Python Statistics Basics
Introduction
Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and presenting data. In data science, statistical methods form the foundation for extracting insights, making predictions, and drawing conclusions from data. Python offers several powerful libraries that make statistical analysis accessible and efficient.
In this tutorial, we'll explore the basics of statistics using Python, focusing on:
- Descriptive statistics (measures of central tendency, dispersion, etc.)
- Working with probability distributions
- Correlation and basic inferential statistics
- Using Python's built-in functions and specialized libraries for statistical analysis
Whether you're analyzing experimental results, processing survey data, or preparing data for machine learning models, these fundamental statistical concepts will serve as essential tools in your data science journey.
Getting Started with Statistics in Python
Required Libraries
Let's start by importing the libraries we'll need:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from statistics import mean, median, mode, stdev
Each of these libraries serves a specific purpose:
- numpy: Efficient numerical operations and array handling
- pandas: Data manipulation and analysis
- scipy.stats: Advanced statistical functions
- matplotlib: Data visualization
- statistics: Basic statistical functions (part of Python's standard library)
Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. Let's create a sample dataset and analyze it:
# Sample dataset - student exam scores
scores = [65, 72, 78, 90, 92, 85, 73, 88, 75, 81, 90, 62, 76, 74, 72]
Measures of Central Tendency
The three main measures of central tendency are:
1. Mean (Average)
# Using Python's statistics module
mean_score = mean(scores)
# Or using NumPy
mean_score_np = np.mean(scores)
print(f"Mean score: {mean_score}")
Output:
Mean score: 78.2
The mean represents the average value of the dataset, calculated by summing all values and dividing by the count.
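To see exactly what these helpers compute, the mean can also be reproduced from first principles:
# Mean from the definition: total of all values divided by their count
manual_mean = sum(scores) / len(scores)  # 1173 / 15
print(f"Manual mean: {manual_mean}")
Output:
Manual mean: 78.2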
2. Median
median_score = median(scores)
# Or using NumPy
median_score_np = np.median(scores)
print(f"Median score: {median_score}")
Output:
Median score: 76
The median is the middle value when the data is ordered. It's less sensitive to outliers than the mean.
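To see that robustness in action, here is a quick sketch of what a single extreme value does to each statistic:
# Appending one outlier shifts the mean by several points
# but barely moves the median
with_outlier = scores + [200]
print(f"Mean: {np.mean(with_outlier):.2f}, Median: {np.median(with_outlier)}")
Output:
Mean: 85.81, Median: 77.0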
3. Mode
# Python's statistics.mode raises StatisticsError on ties only in
# Python < 3.8; from 3.8 onward it returns the first mode encountered.
# statistics.multimode (Python 3.8+) returns every value tied for the
# highest count, which is more informative for multimodal data:
from statistics import multimode

mode_score = mode(scores)
print(f"Mode score: {mode_score}")
print(f"All modes: {multimode(scores)}")
# SciPy's mode returns the smallest mode and its count; with SciPy >= 1.9,
# pass keepdims=False to get scalar fields back
mode_score_scipy = stats.mode(scores, keepdims=False)
print(f"Mode score (SciPy): {mode_score_scipy.mode} with count {mode_score_scipy.count}")
Output:
Mode score: 72
All modes: [72, 90]
Mode score (SciPy): 72 with count 2
The mode is the most frequently occurring value in the dataset.
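If you want the full frequency picture rather than a single winner, collections.Counter from the standard library is a handy companion:
from collections import Counter

# Frequencies in descending order; ties keep first-encountered order
print(Counter(scores).most_common(3))
Output:
[(72, 2), (90, 2), (65, 1)]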
Measures of Dispersion
These statistics describe how spread out the data is:
1. Range
data_range = max(scores) - min(scores)
print(f"Range: {data_range}")
Output:
Range: 30
2. Variance
# Using NumPy
variance = np.var(scores, ddof=1) # ddof=1 for sample variance
print(f"Variance: {variance:.2f}")
Output:
Variance: 85.17
Variance measures the average squared deviation from the mean.
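Reproducing that number from the definition makes the ddof=1 (sample) correction explicit:
# Sample variance: sum of squared deviations divided by n - 1
squared_devs = [(x - np.mean(scores)) ** 2 for x in scores]
manual_variance = sum(squared_devs) / (len(scores) - 1)
print(f"Manual variance: {manual_variance:.2f}")
Output:
Manual variance: 85.17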
3. Standard Deviation
# Using statistics module
std_dev = stdev(scores)
# Or using NumPy
std_dev_np = np.std(scores, ddof=1) # ddof=1 for sample standard deviation
print(f"Standard Deviation: {std_dev:.2f}")
Output:
Standard Deviation: 9.23
Standard deviation is the square root of variance and represents the average distance of data points from the mean.
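Because of that relationship, the two results can be cross-checked directly:
# The square root of the sample variance should equal the sample std
print(np.isclose(np.sqrt(variance), std_dev))
Output:
True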
4. Quartiles and Percentiles
q1 = np.percentile(scores, 25)
q2 = np.percentile(scores, 50) # Same as median
q3 = np.percentile(scores, 75)
print(f"First quartile (Q1): {q1}")
print(f"Second quartile (Q2): {q2}")
print(f"Third quartile (Q3): {q3}")
print(f"Interquartile Range (IQR): {q3 - q1}")
Output:
First quartile (Q1): 72.5
Second quartile (Q2): 76.0
Third quartile (Q3): 86.5
Interquartile Range (IQR): 14.0
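A common application of the IQR, sketched here as an aside, is Tukey's 1.5 × IQR rule for flagging potential outliers:
# Tukey's fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are suspect
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # 72.5 - 21.0 = 51.5
upper_fence = q3 + 1.5 * iqr  # 86.5 + 21.0 = 107.5
outliers = [x for x in scores if x < lower_fence or x > upper_fence]
print(f"Outliers: {outliers}")
Output:
Outliers: []
No score falls outside the fences, so this dataset has no flagged outliers.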
Summary Statistics with Pandas
Pandas makes it even easier to calculate multiple statistics at once:
# Convert to pandas Series
scores_series = pd.Series(scores)
# Get summary statistics
summary = scores_series.describe()
print(summary)
Output:
count    15.000000
mean     78.200000
std       9.228837
min      62.000000
25%      72.500000
50%      76.000000
75%      86.500000
max      92.000000
dtype: float64
Working with Distributions
Understanding probability distributions is crucial for statistical analysis and inference.
Normal Distribution
The normal (Gaussian) distribution is one of the most important probability distributions:
# Generate 1000 random numbers from a normal distribution
# with mean=0 and std=1 (standard normal distribution)
normal_data = np.random.normal(0, 1, 1000)
# Plot histogram
plt.figure(figsize=(10, 6))
plt.hist(normal_data, bins=30, density=True, alpha=0.7, color='skyblue')
# Add a line for the theoretical normal distribution
x = np.linspace(-4, 4, 100)
plt.plot(x, stats.norm.pdf(x, 0, 1), 'r-', linewidth=2)
plt.title('Standard Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)
plt.show()
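Because the data is randomly generated, a quick sanity check is to confirm the sample statistics land near the theoretical parameters (exact values will differ on every run):
# The sample mean and std should be close to 0 and 1 respectively
print(f"Sample mean: {normal_data.mean():.3f}, sample std: {normal_data.std():.3f}")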
Computing Z-scores
Z-scores tell us how many standard deviations a data point is from the mean:
# Calculate z-scores for our exam scores
# Note: stats.zscore standardizes with the population standard deviation
# (ddof=0) by default; pass ddof=1 to use the sample version instead
z_scores = stats.zscore(scores)
print("Original scores:", scores[:5], "...")
print("Z-scores:", [round(float(z), 2) for z in z_scores[:5]], "...")
Output:
Original scores: [65, 72, 78, 90, 92] ...
Z-scores: [-1.48, -0.7, -0.02, 1.32, 1.55] ...
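You can verify the first z-score by hand; note that matching the library output requires the population standard deviation (ddof=0):
# z = (x - mean) / population standard deviation
z_first = (scores[0] - np.mean(scores)) / np.std(scores)
print(f"{z_first:.2f}")
Output:
-1.48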
Calculating Probabilities
Let's calculate the probability of scoring above 85 on the exam, assuming scores follow a normal distribution:
mean_score = np.mean(scores)
std_dev = np.std(scores, ddof=1)
# Probability of scoring above 85
prob_above_85 = 1 - stats.norm.cdf(85, mean_score, std_dev)
print(f"Probability of scoring above 85: {prob_above_85:.4f} or {prob_above_85*100:.2f}%")
Output:
Probability of scoring above 85: 0.2306 or 23.06%
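The same approach handles ranges: the probability of a score falling between two values is the difference of two CDF evaluations. For example, under this fitted normal:
# P(70 < X < 85) = F(85) - F(70)
prob_70_to_85 = stats.norm.cdf(85, mean_score, std_dev) - stats.norm.cdf(70, mean_score, std_dev)
print(f"Probability of scoring between 70 and 85: {prob_70_to_85:.2f}")
Output:
Probability of scoring between 70 and 85: 0.58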
Correlation and Relationships
Correlation measures the relationship between two variables.
Calculating Correlation
# Create two related variables
study_hours = [1, 2, 4, 3, 6, 5, 7, 8, 10, 9, 11, 2, 5, 4, 3]
exam_scores = scores # Our original scores dataset
# Calculate Pearson correlation coefficient
correlation, p_value = stats.pearsonr(study_hours, exam_scores)
print(f"Correlation coefficient: {correlation:.2f}")
print(f"P-value: {p_value:.4f}")
# Visualize with a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(study_hours, exam_scores, alpha=0.7)
plt.title(f'Study Hours vs. Exam Scores (r = {correlation:.2f})')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.grid(True, alpha=0.3)
plt.show()
Output:
Correlation coefficient: 0.53
P-value: 0.0414
A correlation coefficient of 0.53 indicates a moderate positive correlation between study hours and exam scores, and the p-value below 0.05 suggests the relationship is unlikely to be due to chance alone.
Understanding Correlation Strengths
- |r| = 0: No correlation
- 0 < |r| < 0.3: Weak correlation
- 0.3 ≤ |r| < 0.7: Moderate correlation
- 0.7 ≤ |r| < 1: Strong correlation
- |r| = 1: Perfect correlation
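These cut-offs are rules of thumb for Pearson's r, which measures linear association only. If you suspect a monotonic but non-linear relationship, or your data contains outliers, SciPy's rank-based alternative is a drop-in replacement:
# Spearman correlation operates on ranks, making it robust to outliers
rho, p = stats.spearmanr(study_hours, exam_scores)
print(f"Spearman rho: {rho:.2f} (p = {p:.4f})")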
Practical Example: Data Analysis for a Class
Let's put our skills into practice with a more comprehensive example:
# Create a DataFrame for a class
data = {
'Student': [f'Student_{i}' for i in range(1, 16)],
'Math': [65, 72, 78, 90, 92, 85, 73, 88, 75, 81, 90, 62, 76, 74, 72],
'Science': [70, 75, 82, 88, 94, 90, 76, 85, 78, 84, 93, 65, 74, 77, 70],
'English': [62, 68, 74, 86, 90, 88, 72, 82, 70, 78, 85, 60, 71, 70, 68],
'Study_Hours': [1.5, 3.0, 4.0, 7.5, 8.0, 6.5, 3.5, 7.0, 4.5, 5.5, 8.5, 1.0, 3.5, 3.0, 2.5]
}
df = pd.DataFrame(data)
# Calculate basic statistics for each subject
print("Basic Statistics by Subject:")
print(df[['Math', 'Science', 'English']].describe())
# Calculate average score for each student
df['Average'] = df[['Math', 'Science', 'English']].mean(axis=1)
# Find correlation between study hours and average score
correlation = df['Study_Hours'].corr(df['Average'])
print(f"\nCorrelation between Study Hours and Average Score: {correlation:.2f}")
# Visual analysis - scatter plot with regression line
plt.figure(figsize=(10, 6))
plt.scatter(df['Study_Hours'], df['Average'], alpha=0.7)
# Add regression line
m, b = np.polyfit(df['Study_Hours'], df['Average'], 1)
plt.plot(df['Study_Hours'], m*df['Study_Hours'] + b, 'r-')
plt.title('Relationship Between Study Hours and Average Score')
plt.xlabel('Study Hours')
plt.ylabel('Average Score')
plt.grid(True, alpha=0.3)
plt.show()
# Check which students scored above average in all subjects
overall_avg = df[['Math', 'Science', 'English']].values.mean()
high_performers = df[(df['Math'] > overall_avg) &
(df['Science'] > overall_avg) &
(df['English'] > overall_avg)]
print(f"\nNumber of students scoring above average in all subjects: {len(high_performers)}")
print(high_performers[['Student', 'Math', 'Science', 'English', 'Study_Hours']])
This example demonstrates:
- Creating and analyzing a dataset with multiple variables
- Calculating descriptive statistics for different groups
- Investigating relationships between variables
- Using visualizations to support analysis
- Filtering data based on statistical thresholds
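As a natural extension (not part of the walkthrough above), pandas can compute every pairwise correlation in a single call, which becomes useful as soon as a DataFrame has more than two numeric columns:
# Pairwise Pearson correlations between all numeric columns
corr_matrix = df[['Math', 'Science', 'English', 'Study_Hours']].corr()
print(corr_matrix.round(2))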
Statistical Testing Basics
Statistical tests help us determine if patterns in our data could have occurred by chance:
One-sample t-test
Tests if a sample's mean differs from a specified value:
# Test if our class's mean math score differs from the school average (75)
school_avg = 75
t_stat, p_value = stats.ttest_1samp(df['Math'], school_avg)
print(f"One-sample t-test:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Class mean ({df['Math'].mean():.2f}) is {'significantly different from' if p_value < 0.05 else 'not significantly different from'} school average ({school_avg}).")
Independent t-test
Compares means of two independent groups:
# Create two groups based on study time
high_study = df[df['Study_Hours'] >= 5]['Average']
low_study = df[df['Study_Hours'] < 5]['Average']
# Compare their average scores
t_stat, p_value = stats.ttest_ind(high_study, low_study)
print(f"\nIndependent t-test (high study vs low study):")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"High study group mean: {high_study.mean():.2f}")
print(f"Low study group mean: {low_study.mean():.2f}")
print(f"The difference is {'statistically significant' if p_value < 0.05 else 'not statistically significant'}.")
Summary
In this tutorial, we explored the fundamentals of statistics using Python:
- Descriptive Statistics: Calculating measures of central tendency (mean, median, mode) and dispersion (variance, standard deviation)
- Probability Distributions: Working with normal distributions and calculating probabilities
- Correlation Analysis: Measuring and interpreting relationships between variables
- Practical Data Analysis: Applying statistical concepts to real-world data
- Basic Statistical Testing: Performing t-tests to compare means
These statistical tools and concepts form the foundation of data analysis and are essential for more advanced data science techniques. As you continue your journey in data science, you'll build on these statistical basics to develop more sophisticated analyses and models.
Additional Resources and Exercises
Resources
- SciPy Documentation - Statistics
- Pandas Documentation - Statistical Functions
- Khan Academy - Statistics and Probability
- Book: "Think Stats" by Allen B. Downey (free online)
Practice Exercises
1. Data Collection and Analysis:
- Collect data on daily temperatures for your city for one month
- Calculate the mean, median, and standard deviation
- Test if the temperatures follow a normal distribution (a starter sketch for this check appears after the list)
2. Correlation Challenge:
- Find a dataset with multiple numeric variables
- Calculate all pairwise correlations
- Create a correlation matrix heatmap
- Identify the strongest relationships
3. Statistical Test Practice:
- Use a two-sample t-test to compare weekend vs. weekday temperatures
- Report your findings with appropriate statistics
4. Dataset Comparison:
- Find two different datasets representing similar phenomena
- Apply the statistical techniques from this tutorial
- Compare the results and write a brief analysis
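As a starting point for the normality check in the first exercise, here is a minimal sketch using SciPy's Shapiro-Wilk test; the temps array below is a hypothetical stand-in for your own measurements:
# Shapiro-Wilk test: the null hypothesis is that the data is normal
temps = np.random.normal(22, 3, 30)  # placeholder for real temperature data
stat, p = stats.shapiro(temps)
print(f"W = {stat:.4f}, p = {p:.4f}")
print("Consistent with normality" if p > 0.05 else "Likely not normal")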
Remember, statistical concepts become clearer with practice. Try applying these techniques to datasets that interest you!