Python Statistics Basics
Introduction
Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and presenting data. In data science, statistical methods form the foundation for extracting insights, making predictions, and drawing conclusions from data. Python offers several powerful libraries that make statistical analysis accessible and efficient.
In this tutorial, we'll explore the basics of statistics using Python, focusing on:
- Descriptive statistics (measures of central tendency, dispersion, etc.)
- Working with probability distributions
- Correlation and basic inferential statistics
- Using Python's built-in functions and specialized libraries for statistical analysis
Whether you're analyzing experimental results, processing survey data, or preparing data for machine learning models, these fundamental statistical concepts will serve as essential tools in your data science journey.
Getting Started with Statistics in Python
Required Libraries
Let's start by importing the libraries we'll need:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from statistics import mean, median, mode, stdev
Each of these libraries serves a specific purpose:
- numpy: Efficient numerical operations and array handling
- pandas: Data manipulation and analysis
- scipy.stats: Advanced statistical functions
- matplotlib: Data visualization
- statistics: Basic statistical functions (part of Python's standard library)
Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. Let's create a sample dataset and analyze it:
# Sample dataset - student exam scores
scores = [65, 72, 78, 90, 92, 85, 73, 88, 75, 81, 90, 62, 76, 74, 72]
Measures of Central Tendency
The three main measures of central tendency are:
1. Mean (Average)
# Using Python's statistics module
mean_score = mean(scores)
# Or using NumPy
mean_score_np = np.mean(scores)
print(f"Mean score: {mean_score}")
Output:
Mean score: 78.2
The mean represents the average value of the dataset, calculated by summing all values and dividing by the count.
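To see exactly what these helpers compute, the mean can also be reproduced from first principles:
# Mean from the definition: total of all values divided by their count
manual_mean = sum(scores) / len(scores)  # 1173 / 15
print(f"Manual mean: {manual_mean}")
Output:
Manual mean: 78.2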
2. Median
median_score = median(scores)
# Or using NumPy
median_score_np = np.median(scores)
print(f"Median score: {median_score}")
Output:
Median score: 76
The median is the middle value when the data is ordered. It's less sensitive to outliers than the mean.
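To see that robustness in action, here is a quick sketch of what a single extreme value does to each statistic:
# Appending one outlier shifts the mean by several points
# but barely moves the median
with_outlier = scores + [200]
print(f"Mean: {np.mean(with_outlier):.2f}, Median: {np.median(with_outlier)}")
Output:
Mean: 85.81, Median: 77.0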
3. Mode
# Python's statistics.mode raises StatisticsError on ties only in
# Python < 3.8; from 3.8 onward it returns the first mode encountered.
# statistics.multimode (Python 3.8+) returns every value tied for the
# highest count, which is more informative for multimodal data:
from statistics import multimode

mode_score = mode(scores)
print(f"Mode score: {mode_score}")
print(f"All modes: {multimode(scores)}")
# SciPy's mode returns the smallest mode and its count; with SciPy >= 1.9,
# pass keepdims=False to get scalar fields back
mode_score_scipy = stats.mode(scores, keepdims=False)
print(f"Mode score (SciPy): {mode_score_scipy.mode} with count {mode_score_scipy.count}")
Output:
Mode score: 72
All modes: [72, 90]
Mode score (SciPy): 72 with count 2
The mode is the most frequently occurring value in the dataset.
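If you want the full frequency picture rather than a single winner, collections.Counter from the standard library is a handy companion:
from collections import Counter

# Frequencies in descending order; ties keep first-encountered order
print(Counter(scores).most_common(3))
Output:
[(72, 2), (90, 2), (65, 1)]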
Measures of Dispersion
These statistics describe how spread out the data is:
1. Range
data_range = max(scores) - min(scores)
print(f"Range: {data_range}")
Output:
Range: 30
2. Variance
# Using NumPy
variance = np.var(scores, ddof=1) # ddof=1 for sample variance
print(f"Variance: {variance:.2f}")
Output:
Variance: 85.17
Variance measures the average squared deviation from the mean.
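Reproducing that number from the definition makes the ddof=1 (sample) correction explicit:
# Sample variance: sum of squared deviations divided by n - 1
squared_devs = [(x - np.mean(scores)) ** 2 for x in scores]
manual_variance = sum(squared_devs) / (len(scores) - 1)
print(f"Manual variance: {manual_variance:.2f}")
Output:
Manual variance: 85.17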
3. Standard Deviation
# Using statistics module
std_dev = stdev(scores)
# Or using NumPy
std_dev_np = np.std(scores, ddof=1) # ddof=1 for sample standard deviation
print(f"Standard Deviation: {std_dev:.2f}")
Output:
Standard Deviation: 9.23
Standard deviation is the square root of variance and represents the average distance of data points from the mean.
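Because of that relationship, the two results can be cross-checked directly:
# The square root of the sample variance should equal the sample std
print(np.isclose(np.sqrt(variance), std_dev))
Output:
True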
4. Quartiles and Percentiles
q1 = np.percentile(scores, 25)
q2 = np.percentile(scores, 50) # Same as median
q3 = np.percentile(scores, 75)
print(f"First quartile (Q1): {q1}")
print(f"Second quartile (Q2): {q2}")
print(f"Third quartile (Q3): {q3}")
print(f"Interquartile Range (IQR): {q3 - q1}")
Output:
First quartile (Q1): 72.5
Second quartile (Q2): 76.0
Third quartile (Q3): 86.5
Interquartile Range (IQR): 14.0
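A common application of the IQR, sketched here as an aside, is Tukey's 1.5 × IQR rule for flagging potential outliers:
# Tukey's fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are suspect
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # 72.5 - 21.0 = 51.5
upper_fence = q3 + 1.5 * iqr  # 86.5 + 21.0 = 107.5
outliers = [x for x in scores if x < lower_fence or x > upper_fence]
print(f"Outliers: {outliers}")
Output:
Outliers: []
No score falls outside the fences, so this dataset has no flagged outliers.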
Summary Statistics with Pandas
Pandas makes it even easier to calculate multiple statistics at once:
# Convert to pandas Series
scores_series = pd.Series(scores)
# Get summary statistics
summary = scores_series.describe()
print(summary)
Output:
count    15.000000
mean     78.200000
std       9.228837
min      62.000000
25%      72.500000
50%      76.000000
75%      86.500000
max      92.000000
dtype: float64
Working with Distributions
Understanding probability distributions is crucial for statistical analysis and inference.
Normal Distribution
The normal (Gaussian) distribution is one of the most important probability distributions:
# Generate 1000 random numbers from a normal distribution
# with mean=0 and std=1 (standard normal distribution)
normal_data = np.random.normal(0, 1, 1000)
# Plot histogram
plt.figure(figsize=(10, 6))
plt.hist(normal_data, bins=30, density=True, alpha=0.7, color='skyblue')
# Add a line for the theoretical normal distribution
x = np.linspace(-4, 4, 100)
plt.plot(x, stats.norm.pdf(x, 0, 1), 'r-', linewidth=2)
plt.title('Standard Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)
plt.show()
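Because the data is randomly generated, a quick sanity check is to confirm the sample statistics land near the theoretical parameters (exact values will differ on every run):
# The sample mean and std should be close to 0 and 1 respectively
print(f"Sample mean: {normal_data.mean():.3f}, sample std: {normal_data.std():.3f}")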
Computing Z-scores
Z-scores tell us how many standard deviations a data point is from the mean:
# Calculate z-scores for our exam scores
# Note: stats.zscore standardizes with the population standard deviation
# (ddof=0) by default; pass ddof=1 to use the sample version instead
z_scores = stats.zscore(scores)
print("Original scores:", scores[:5], "...")
print("Z-scores:", [round(float(z), 2) for z in z_scores[:5]], "...")
Output:
Original scores: [65, 72, 78, 90, 92] ...
Z-scores: [-1.48, -0.7, -0.02, 1.32, 1.55] ...
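You can verify the first z-score by hand; note that matching the library output requires the population standard deviation (ddof=0):
# z = (x - mean) / population standard deviation
z_first = (scores[0] - np.mean(scores)) / np.std(scores)
print(f"{z_first:.2f}")
Output:
-1.48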
Calculating Probabilities
Let's calculate the probability of scoring above 85 on the exam, assuming scores follow a normal distribution:
mean_score = np.mean(scores)
std_dev = np.std(scores, ddof=1)
# Probability of scoring above 85
prob_above_85 = 1 - stats.norm.cdf(85, mean_score, std_dev)
print(f"Probability of scoring above 85: {prob_above_85:.4f} or {prob_above_85*100:.2f}%")
Output:
Probability of scoring above 85: 0.2306 or 23.06%
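The same approach handles ranges: the probability of a score falling between two values is the difference of two CDF evaluations. For example, under this fitted normal:
# P(70 < X < 85) = F(85) - F(70)
prob_70_to_85 = stats.norm.cdf(85, mean_score, std_dev) - stats.norm.cdf(70, mean_score, std_dev)
print(f"Probability of scoring between 70 and 85: {prob_70_to_85:.2f}")
Output:
Probability of scoring between 70 and 85: 0.58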
Correlation and Relationships
Correlation measures the relationship between two variables.
Calculating Correlation
# Create two related variables
study_hours = [1, 2, 4, 3, 6, 5, 7, 8, 10, 9, 11, 2, 5, 4, 3]
exam_scores = scores # Our original scores dataset
# Calculate Pearson correlation coefficient
correlation, p_value = stats.pearsonr(study_hours, exam_scores)
print(f"Correlation coefficient: {correlation:.2f}")
print(f"P-value: {p_value:.4f}")
# Visualize with a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(study_hours, exam_scores, alpha=0.7)
plt.title(f'Study Hours vs. Exam Scores (r = {correlation:.2f})')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.grid(True, alpha=0.3)
plt.show()
Output:
Correlation coefficient: 0.53
P-value: 0.0414
A correlation coefficient of 0.53 indicates a moderate positive correlation between study hours and exam scores, and the p-value below 0.05 suggests the relationship is unlikely to be due to chance alone.
Understanding Correlation Strengths
- |r| = 0: No correlation
- 0 < |r| < 0.3: Weak correlation
- 0.3 ≤ |r| < 0.7: Moderate correlation
- 0.7 ≤ |r| < 1: Strong correlation
- |r| = 1: Perfect correlation
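These cut-offs are rules of thumb for Pearson's r, which measures linear association only. If you suspect a monotonic but non-linear relationship, or your data contains outliers, SciPy's rank-based alternative is a drop-in replacement:
# Spearman correlation operates on ranks, making it robust to outliers
rho, p = stats.spearmanr(study_hours, exam_scores)
print(f"Spearman rho: {rho:.2f} (p = {p:.4f})")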
Practical Example: Data Analysis for a Class
Let's put our skills into practice with a more comprehensive example:
# Create a DataFrame for a class
data = {
'Student': [f'Student_{i}' for i in range(1, 16)],
'Math': [65, 72, 78, 90, 92, 85, 73, 88, 75, 81, 90, 62, 76, 74, 72],
'Science': [70, 75, 82, 88, 94, 90, 76, 85, 78, 84, 93, 65, 74, 77, 70],
'English': [62, 68, 74, 86, 90, 88, 72, 82, 70, 78, 85, 60, 71, 70, 68],
'Study_Hours': [1.5, 3.0, 4.0, 7.5, 8.0, 6.5, 3.5, 7.0, 4.5, 5.5, 8.5, 1.0, 3.5, 3.0, 2.5]
}
df = pd.DataFrame(data)
# Calculate basic statistics for each subject
print("Basic Statistics by Subject:")
print(df[['Math', 'Science', 'English']].describe())
# Calculate average score for each student
df['Average'] = df[['Math', 'Science', 'English']].mean(axis=1)
# Find correlation between study hours and average score
correlation = df['Study_Hours'].corr(df['Average'])
print(f"\nCorrelation between Study Hours and Average Score: {correlation:.2f}")
# Visual analysis - scatter plot with regression line
plt.figure(figsize=(10, 6))
plt.scatter(df['Study_Hours'], df['Average'], alpha=0.7)
# Add regression line
m, b = np.polyfit(df['Study_Hours'], df['Average'], 1)
plt.plot(df['Study_Hours'], m*df['Study_Hours'] + b, 'r-')
plt.title('Relationship Between Study Hours and Average Score')
plt.xlabel('Study Hours')
plt.ylabel('Average Score')
plt.grid(True, alpha=0.3)
plt.show()
# Check which students scored above average in all subjects
overall_avg = df[['Math', 'Science', 'English']].values.mean()
high_performers = df[(df['Math'] > overall_avg) &
(df['Science'] > overall_avg) &
(df['English'] > overall_avg)]
print(f"\nNumber of students scoring above average in all subjects: {len(high_performers)}")
print(high_performers[['Student', 'Math', 'Science', 'English', 'Study_Hours']])
This example demonstrates:
- Creating and analyzing a dataset with multiple variables
- Calculating descriptive statistics for different groups
- Investigating relationships between variables
- Using visualizations to support analysis
- Filtering data based on statistical thresholds
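As a natural extension (not part of the walkthrough above), pandas can compute every pairwise correlation in a single call, which becomes useful as soon as a DataFrame has more than two numeric columns:
# Pairwise Pearson correlations between all numeric columns
corr_matrix = df[['Math', 'Science', 'English', 'Study_Hours']].corr()
print(corr_matrix.round(2))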
Statistical Testing Basics
Statistical tests help us determine if patterns in our data could have occurred by chance:
One-sample t-test
Tests if a sample's mean differs from a specified value:
# Test if our class's mean math score differs from the school average (75)
school_avg = 75
t_stat, p_value = stats.ttest_1samp(df['Math'], school_avg)
print(f"One-sample t-test:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Class mean ({df['Math'].mean():.2f}) is {'significantly different from' if p_value < 0.05 else 'not significantly different from'} school average ({school_avg}).")
Independent t-test
Compares means of two independent groups:
# Create two groups based on study time
high_study = df[df['Study_Hours'] >= 5]['Average']
low_study = df[df['Study_Hours'] < 5]['Average']
# Compare their average scores
t_stat, p_value = stats.ttest_ind(high_study, low_study)
print(f"\nIndependent t-test (high study vs low study):")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"High study group mean: {high_study.mean():.2f}")
print(f"Low study group mean: {low_study.mean():.2f}")
print(f"The difference is {'statistically significant' if p_value < 0.05 else 'not statistically significant'}.")
Summary
In this tutorial, we explored the fundamentals of statistics using Python:
- Descriptive Statistics: Calculating measures of central tendency (mean, median, mode) and dispersion (variance, standard deviation)
- Probability Distributions: Working with normal distributions and calculating probabilities
- Correlation Analysis: Measuring and interpreting relationships between variables
- Practical Data Analysis: Applying statistical concepts to real-world data
- Basic Statistical Testing: Performing t-tests to compare means
These statistical tools and concepts form the foundation of data analysis and are essential for more advanced data science techniques. As you continue your journey in data science, you'll build on these statistical basics to develop more sophisticated analyses and models.
Additional Resources and Exercises
Resources
- SciPy Documentation - Statistics
- Pandas Documentation - Statistical Functions
- Khan Academy - Statistics and Probability
- Book: "Think Stats" by Allen B. Downey (free online)
Practice Exercises
1. Data Collection and Analysis:
- Collect data on daily temperatures for your city for one month
- Calculate the mean, median, and standard deviation
- Test if the temperatures follow a normal distribution (a starter sketch for this check appears after the list)
2. Correlation Challenge:
- Find a dataset with multiple numeric variables
- Calculate all pairwise correlations
- Create a correlation matrix heatmap
- Identify the strongest relationships
3. Statistical Test Practice:
- Use a two-sample t-test to compare weekend vs. weekday temperatures
- Report your findings with appropriate statistics
4. Dataset Comparison:
- Find two different datasets representing similar phenomena
- Apply the statistical techniques from this tutorial
- Compare the results and write a brief analysis
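As a starting point for the normality check in the first exercise, here is a minimal sketch using SciPy's Shapiro-Wilk test; the temps array below is a hypothetical stand-in for your own measurements:
# Shapiro-Wilk test: the null hypothesis is that the data is normal
temps = np.random.normal(22, 3, 30)  # placeholder for real temperature data
stat, p = stats.shapiro(temps)
print(f"W = {stat:.4f}, p = {p:.4f}")
print("Consistent with normality" if p > 0.05 else "Likely not normal")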
Remember, statistical concepts become clearer with practice. Try applying these techniques to datasets that interest you!