Python Data Analysis
Introduction
Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. Python has become the go-to language for data analysis due to its simplicity and the rich ecosystem of libraries specifically designed for working with data.
In this guide, we will explore how to use Python for data analysis, focusing on three essential libraries:
- NumPy: For numerical computing with powerful array operations
- pandas: For data manipulation and analysis
- matplotlib: For data visualization
Whether you're analyzing sales data, scientific measurements, or social media trends, these tools will help you extract meaningful insights from your data.
Getting Started with Python Data Analysis
Setting Up Your Environment
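Before we dive into data analysis, make sure you have the necessary libraries installed. Besides the three libraries listed above, we also install seaborn, which the visualization sections use alongside matplotlib: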
pip install numpy pandas matplotlib seaborn
Let's start by importing the basic libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
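# Set better default aesthetics for plots (white background with grid lines).
# A single seaborn call is enough: calling sns.set_theme() with no arguments
# after plt.style.use() would override the whitegrid style.
sns.set_theme(style='whitegrid')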
# Display plots inline if using a Jupyter notebook
# %matplotlib inline
Working with NumPy for Numerical Analysis
NumPy is the foundation for data analysis in Python. It provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays efficiently.
Creating NumPy Arrays
# Create a simple array
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Output:
[1 2 3 4 5]
# Create a 2D array (matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix)
Output:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
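Once an array exists, its attributes and indexing syntax are the next things to get comfortable with. The following is a minimal sketch that reuses the 3x3 matrix from above; the particular slices are just illustrative choices.

```python
import numpy as np

# Same 3x3 matrix as in the example above
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(matrix.shape)     # (rows, columns)
print(matrix.ndim)      # number of dimensions
print(matrix[0, 2])     # element in the first row, third column
print(matrix[:, 1])     # every row, second column
print(matrix[1:, :2])   # last two rows, first two columns
```

Indexing is zero-based, and a slice such as `1:` or `:2` follows the same rules as a Python list slice, applied per dimension.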
Array Operations
NumPy allows for vectorized operations, which are much faster than traditional loops:
# Array arithmetic
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
# Addition
print(arr1 + arr2) # Element-wise addition
# Multiplication
print(arr1 * arr2) # Element-wise multiplication
# With scalars
print(arr1 * 2) # Multiply each element by 2
Output:
[5 7 9]
[ 4 10 18]
[2 4 6]
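To make the comparison with loops concrete, here is a small sketch that computes the same element-wise product both ways and then shows broadcasting, where NumPy combines arrays of different shapes. The variable names `looped`, `vectorized`, `column`, and `table` are illustrative only.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Loop version: build the result one element at a time in Python
looped = np.array([a * b for a, b in zip(arr1, arr2)])

# Vectorized version: a single expression evaluated in compiled code
vectorized = arr1 * arr2

print(np.array_equal(looped, vectorized))  # both approaches agree

# Broadcasting: a (3, 1) column combined with a (3,) row gives a (3, 3) grid
column = arr1.reshape(3, 1)
table = column * arr2
print(table)
```

On arrays this small the speed difference is negligible; the gap tends to show up once arrays hold thousands of elements or more.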
Statistical Operations
NumPy has built-in functions for common statistical operations:
data = np.array([15, 23, 48, 10, 28, 36, 52, 19, 25, 33])
print(f"Mean: {np.mean(data)}")
print(f"Median: {np.median(data)}")
print(f"Standard Deviation: {np.std(data):.2f}")  # population standard deviation (ddof=0)
print(f"Min and Max: {np.min(data)} and {np.max(data)}")
Output:
Mean: 28.9
Median: 26.5
Standard Deviation: 12.90
Min and Max: 10 and 52
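One detail worth knowing is that `np.std` divides by n by default, while pandas' `std()` and `describe()` divide by n - 1. The sketch below shows how the `ddof` argument switches between the two, and how `axis` controls the direction of a reduction on 2D data. The small `matrix` is made up for illustration.

```python
import numpy as np

data = np.array([15, 23, 48, 10, 28, 36, 52, 19, 25, 33])

population_std = np.std(data)       # divides by n (ddof=0, the default)
sample_std = np.std(data, ddof=1)   # divides by n - 1, matching pandas

print(f"Population std: {population_std:.2f}")
print(f"Sample std: {sample_std:.2f}")

# For 2D data, axis chooses the direction of the reduction
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.mean(axis=0))  # one mean per column
print(matrix.mean(axis=1))  # one mean per row
```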
Data Manipulation with pandas
While NumPy provides the foundation for numerical computing, pandas is specifically designed for data manipulation and analysis. It introduces two main data structures:
- Series: A one-dimensional labeled array
- DataFrame: A two-dimensional labeled data structure with columns that can be of different types
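Since the rest of this section works mostly with DataFrames, here is a brief sketch of a Series on its own and of how it relates to a DataFrame column. The names and values are sample data chosen for illustration.

```python
import pandas as pd

# A Series: values plus an index of labels (illustrative sample data)
ages = pd.Series([24, 27, 22], index=['Alice', 'Bob', 'Charlie'], name='Age')
print(ages['Bob'])    # label-based access
print(ages.mean())    # Series support the usual statistics

# A DataFrame is a collection of Series that share one index
df = ages.to_frame()
df['City'] = ['New York', 'Boston', 'Chicago']
print(type(df['Age']))  # each column is itself a Series
print(df.dtypes)        # columns can have different types
```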
Creating DataFrames
# Create a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [24, 27, 22, 32, 29],
'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
'Salary': [65000, 72000, 59000, 82000, 75000]
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age      City  Salary
0    Alice   24  New York   65000
1      Bob   27    Boston   72000
2  Charlie   22   Chicago   59000
3    David   32    Denver   82000
4      Eva   29   Seattle   75000
Loading Data from Files
pandas makes it easy to read data from various file formats:
# Let's create a CSV file for demonstration
df.to_csv('employees.csv', index=False)
# Now read it back
employees = pd.read_csv('employees.csv')
print(employees.head()) # The head() method shows the first five rows
Output:
      Name  Age      City  Salary
0    Alice   24  New York   65000
1      Bob   27    Boston   72000
2  Charlie   22   Chicago   59000
3    David   32    Denver   82000
4      Eva   29   Seattle   75000
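`read_csv` also accepts options that control what gets loaded. The sketch below uses an in-memory string as a stand-in for a file on disk, so it runs without touching the filesystem; the choice of columns and dtype is only an example.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the contents mirror part of employees.csv
csv_text = """Name,Age,City,Salary
Alice,24,New York,65000
Bob,27,Boston,72000
Charlie,22,Chicago,59000
"""

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['Name', 'Salary'],    # load only the columns you need
    index_col='Name',              # use a column as the row index
    dtype={'Salary': 'float64'},   # override the inferred type
)
print(subset)
print(subset.dtypes)
```

With a real file, you would pass the path (for example `'employees.csv'`) in place of the `io.StringIO` object.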
Data Exploration
pandas provides numerous methods to explore and understand your data:
# Basic information about the DataFrame
employees.info()  # info() prints its report directly and returns None, so print() is not needed
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Age     5 non-null      int64
 2   City    5 non-null      object
 3   Salary  5 non-null      int64
dtypes: int64(2), object(2)
memory usage: 288.0+ bytes
# Statistical summary
print(employees.describe())
Output:
             Age        Salary
count   5.000000      5.000000
mean   26.800000  70600.000000
std     3.962323   8905.054744
min    22.000000  59000.000000
25%    24.000000  65000.000000
50%    27.000000  72000.000000
75%    29.000000  75000.000000
max    32.000000  82000.000000
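A few more one-liners are handy during a first look at a dataset. This is a minimal sketch on the same employee data, rebuilt here so the block runs on its own; which checks matter most will depend on your data.

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
    'Salary': [65000, 72000, 59000, 82000, 75000],
})

print(employees.shape)              # (rows, columns)
print(employees.isna().sum())       # missing values per column
print(employees['City'].nunique())  # number of distinct cities

# The two highest-paid employees
top_paid = employees.sort_values('Salary', ascending=False).head(2)
print(top_paid)
```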
Data Selection and Filtering
pandas offers powerful ways to select and filter data:
# Selecting columns
print(employees['Name'])
Output:
0      Alice
1        Bob
2    Charlie
3      David
4        Eva
Name: Name, dtype: object
# Selecting rows by position
print(employees.iloc[0:2]) # First two rows
Output:
    Name  Age      City  Salary
0  Alice   24  New York   65000
1    Bob   27    Boston   72000
# Conditional filtering
high_salary = employees[employees['Salary'] > 70000]
print(high_salary)
Output:
    Name  Age     City  Salary
1    Bob   27   Boston   72000
3  David   32   Denver   82000
4    Eva   29  Seattle   75000
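Filters can also be combined. The sketch below shows three common patterns on the same employee data (rebuilt here so the block is self-contained): a boolean mask with `&`, membership tests with `isin`, and the string-based `query` method. The thresholds are arbitrary examples.

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
    'Salary': [65000, 72000, 59000, 82000, 75000],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (employees['Salary'] > 70000) & (employees['Age'] < 30)
young_high_earners = employees.loc[mask, ['Name', 'Salary']]
print(young_high_earners)

# Keep rows whose value appears in a list
in_cities = employees[employees['City'].isin(['Boston', 'Denver'])]
print(in_cities)

# query() expresses the same kind of filter as a string
mid_range = employees.query('Age >= 27 and Salary < 80000')
print(mid_range)
```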
Data Grouping and Aggregation
One of pandas' most powerful features is the ability to group data and compute aggregations:
# Let's create a slightly larger dataset
data = {
'Department': ['IT', 'HR', 'Sales', 'IT', 'HR', 'Sales', 'IT', 'Sales'],
'Employee': ['John', 'Alice', 'Bob', 'Mary', 'Jane', 'Michael', 'David', 'Anne'],
'Salary': [65000, 72000, 59000, 82000, 75000, 67000, 78000, 63000],
'Years': [3, 7, 2, 5, 8, 3, 4, 2]
}
df = pd.DataFrame(data)
# Group by department; compute mean, min, and max salary plus mean years of service
dept_stats = df.groupby('Department').agg({
'Salary': ['mean', 'min', 'max'],
'Years': 'mean'
})
print(dept_stats)
Output:
             Salary                   Years
               mean    min    max      mean
Department
HR          73500.0  72000  75000  7.500000
IT          75000.0  65000  82000  4.000000
Sales       63000.0  59000  67000  2.333333
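The dictionary form of `agg` produces two-level column labels, which can be awkward to work with later. One alternative is named aggregation, sketched below on the same data; the output column names (`avg_salary`, `headcount`, and so on) are our own choices.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Sales', 'IT', 'HR', 'Sales', 'IT', 'Sales'],
    'Employee': ['John', 'Alice', 'Bob', 'Mary', 'Jane', 'Michael', 'David', 'Anne'],
    'Salary': [65000, 72000, 59000, 82000, 75000, 67000, 78000, 63000],
    'Years': [3, 7, 2, 5, 8, 3, 4, 2],
})

# Named aggregation: new_column=(source_column, function)
dept_stats = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    avg_years=('Years', 'mean'),
    headcount=('Employee', 'count'),
)
print(dept_stats)
```

The result has flat column names, so `dept_stats['avg_salary']` works directly.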
Data Visualization with matplotlib and seaborn
After preparing and analyzing your data, visualization helps you understand patterns and communicate findings. We'll use matplotlib together with seaborn, which is built on top of matplotlib and provides a higher-level interface for statistical plots.
Basic Plotting with matplotlib
# Let's create a simple line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2)
plt.title('Sine Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.grid(True)
plt.savefig('sine_wave.png') # Save the figure
plt.show()
This produces a blue sine wave drawn over a grid and also saves it to sine_wave.png in the current working directory.
Bar Charts with pandas and matplotlib
# Using our employee data to create a bar chart
dept_avg_salary = df.groupby('Department')['Salary'].mean()
plt.figure(figsize=(10, 6))
dept_avg_salary.plot(kind='bar', color='skyblue')
plt.title('Average Salary by Department')
plt.xlabel('Department')
plt.ylabel('Average Salary ($)')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.savefig('dept_salary_bar.png')
plt.show()
Advanced Visualization with seaborn
# Create a larger synthetic dataset for better visualization
np.random.seed(42)
data = {
'Department': np.random.choice(['IT', 'HR', 'Sales', 'Marketing'], 100),
'Experience': np.random.randint(1, 15, 100),
'Salary': np.random.randint(50000, 100000, 100),
'Performance': np.random.uniform(2, 5, 100)
}
df_large = pd.DataFrame(data)
# Create a scatter plot with regression line
plt.figure(figsize=(12, 7))
sns.scatterplot(
data=df_large,
x='Experience',
y='Salary',
hue='Department',
size='Performance',
sizes=(20, 200),
alpha=0.7
)
sns.regplot(
data=df_large,
x='Experience',
y='Salary',
scatter=False,
color='black'
)
plt.title('Salary vs. Experience by Department', fontsize=16)
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.savefig('salary_experience_scatter.png')
plt.show()
Creating Multiple Plots
# Create a dashboard of plots
plt.figure(figsize=(15, 10))
# Plot 1: Distribution of Salaries
plt.subplot(2, 2, 1)
sns.histplot(df_large['Salary'], bins=15, kde=True)
plt.title('Salary Distribution')
# Plot 2: Salary by Department (box plot)
plt.subplot(2, 2, 2)
sns.boxplot(x='Department', y='Salary', data=df_large)
plt.title('Salary by Department')
# Plot 3: Experience Distribution
plt.subplot(2, 2, 3)
sns.histplot(df_large['Experience'], bins=10, kde=True)
plt.title('Experience Distribution')
# Plot 4: Performance vs. Salary
plt.subplot(2, 2, 4)
sns.scatterplot(x='Performance', y='Salary', data=df_large, alpha=0.6)
plt.title('Performance vs. Salary')
plt.tight_layout()
plt.savefig('dashboard.png')
plt.show()
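The dashboard above uses `plt.subplot`, which tracks the "current" plot implicitly. An alternative is matplotlib's object-oriented interface, where `plt.subplots` returns the figure and its axes explicitly. The sketch below uses synthetic data and a non-interactive backend so it also runs without a display; both are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, similar in spirit to df_large above
rng = np.random.default_rng(42)
salaries = rng.integers(50000, 100000, size=100)
experience = rng.integers(1, 15, size=100)

# One row, two columns of plots; axes is an array of Axes objects
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].hist(salaries, bins=15)
axes[0].set_title('Salary Distribution')
axes[0].set_xlabel('Salary ($)')

axes[1].scatter(experience, salaries, alpha=0.6)
axes[1].set_title('Experience vs. Salary')
axes[1].set_xlabel('Years of Experience')

fig.tight_layout()
plt.close(fig)  # release the figure once you are done with it
```

To save the result, call `fig.savefig(...)` with a filename of your choice before closing the figure.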
Real-world Data Analysis Example
Let's put everything together with a realistic example. We'll analyze a dataset of car fuel efficiency:
# Let's simulate loading data from an external source
# In practice, you might use:
# df = pd.read_csv('https://raw.githubusercontent.com/datasets/...')
# Creating sample car data
car_data = {
'make': ['Toyota', 'Honda', 'Ford', 'BMW', 'Toyota', 'Honda', 'Ford', 'BMW', 'Toyota', 'Honda'],
'model': ['Corolla', 'Civic', 'Focus', '3 Series', 'Camry', 'Accord', 'Fusion', '5 Series', 'Prius', 'Fit'],
'year': [2019, 2020, 2018, 2021, 2017, 2020, 2019, 2020, 2020, 2018],
'engine_size': [1.8, 2.0, 1.6, 3.0, 2.5, 1.5, 2.0, 4.0, 1.8, 1.5],
'mpg': [32, 36, 28, 24, 29, 38, 27, 22, 52, 33],
'price': [21000, 22500, 18000, 43000, 25000, 28000, 20000, 52000, 26000, 17000]
}
cars = pd.DataFrame(car_data)
# 1. Data exploration
print(cars.head())
print("\nSummary Statistics:")
print(cars.describe())
# 2. Finding correlations between numerical variables
print("\nCorrelation Matrix:")
correlation = cars[['engine_size', 'mpg', 'price', 'year']].corr()
print(correlation)
# 3. Data aggregation
avg_mpg_by_make = cars.groupby('make')['mpg'].mean().sort_values(ascending=False)
print("\nAverage MPG by Make:")
print(avg_mpg_by_make)
# 4. Visualization
plt.figure(figsize=(15, 10))
# Plot 1: MPG vs Engine Size
plt.subplot(2, 2, 1)
sns.scatterplot(x='engine_size', y='mpg', hue='make', size='price',
sizes=(50, 200), data=cars, alpha=0.7)
plt.title('MPG vs Engine Size')
plt.grid(True, alpha=0.3)
# Plot 2: Average MPG by Make
plt.subplot(2, 2, 2)
avg_mpg_by_make.plot(kind='bar', color='skyblue')
plt.title('Average MPG by Make')
plt.ylabel('Miles Per Gallon')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
# Plot 3: Price vs MPG
plt.subplot(2, 2, 3)
sns.scatterplot(x='price', y='mpg', hue='make', data=cars)
plt.title('Price vs Fuel Efficiency')
plt.xlabel('Price ($)')
plt.ylabel('Miles Per Gallon')
plt.grid(True, alpha=0.3)
# Plot 4: Distribution of MPG
plt.subplot(2, 2, 4)
sns.kdeplot(data=cars, x='mpg', hue='make', fill=True, alpha=0.5)
plt.title('Distribution of Fuel Efficiency by Make')
plt.xlabel('Miles Per Gallon')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('car_analysis.png')
plt.show()
# 5. Finding insights
high_efficiency = cars[cars['mpg'] > 35]
print("\nHigh Efficiency Cars (MPG > 35):")
print(high_efficiency)
print("\nKey Findings:")
best = cars.loc[cars['mpg'].idxmax()]  # row of the car with the highest MPG
print(f"1. The car with the highest fuel efficiency is {best['make']} {best['model']} with {best['mpg']} MPG")
print(f"2. Average price of cars in the dataset: ${cars['price'].mean():.2f}")
print(f"3. Correlation between engine size and MPG: {correlation.loc['engine_size', 'mpg']:.2f}")
print(f"4. Correlation between price and MPG: {correlation.loc['price', 'mpg']:.2f}")
This example shows a complete data analysis workflow:
- Loading and exploring the data
- Computing statistics and correlations
- Grouping and aggregating data
- Creating visualizations
- Drawing insights from the analysis
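To see what the correlation step in this workflow actually computes, the sketch below writes out the Pearson correlation between engine size and MPG by hand and compares it with pandas. It uses the same two columns as the car data above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    'engine_size': [1.8, 2.0, 1.6, 3.0, 2.5, 1.5, 2.0, 4.0, 1.8, 1.5],
    'mpg': [32, 36, 28, 24, 29, 38, 27, 22, 52, 33],
})

x = cars['engine_size'].to_numpy()
y = cars['mpg'].to_numpy()

# Pearson r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = cars['engine_size'].corr(cars['mpg'])
print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

A negative value here means that, in this small sample, larger engines tend to go with lower fuel efficiency. With only ten cars, treat the number as descriptive rather than conclusive.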
Summary
In this guide, we've explored Python's powerful ecosystem for data analysis:
- NumPy provides the foundation for numerical computing with arrays
- pandas offers data structures and functions for data manipulation
- matplotlib and seaborn enable creating informative visualizations
Python data analysis is a vast field with many more advanced techniques, but mastering these basics will give you a solid foundation for tackling real-world data problems.
The process typically follows these steps:
- Loading and cleaning data
- Exploring and understanding the data structure
- Manipulating and transforming data
- Analyzing relationships and patterns
- Visualizing results
- Drawing conclusions
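The cleaning and transforming steps in this list did not get their own section above, so here is a minimal sketch of what they can look like. The tiny `raw` DataFrame and the choice to fill with the median are assumptions for illustration; the right strategy depends on your data.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing salary
raw = pd.DataFrame({
    'Department': ['IT', 'HR', 'Sales', 'IT'],
    'Salary': [65000, np.nan, 59000, 82000],
})

# Cleaning: fill the missing salary with the column median
clean = raw.copy()
clean['Salary'] = clean['Salary'].fillna(clean['Salary'].median())

# Transforming: standardize to zero mean and unit (sample) standard deviation
clean['Salary_z'] = (clean['Salary'] - clean['Salary'].mean()) / clean['Salary'].std()

# Transforming: one-hot encode the categorical column
encoded = pd.get_dummies(clean, columns=['Department'])
print(encoded)
```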
Additional Resources and Exercises
Resources for Further Learning
- pandas Documentation
- NumPy User Guide
- matplotlib Tutorials
- seaborn Tutorial
- Python for Data Analysis by Wes McKinney (creator of pandas)
Practice Exercises
- Basic Data Exploration:
  - Download a dataset from Kaggle or use a built-in dataset from seaborn
  - Explore the data using pandas functions like info(), describe(), and head()
  - Identify missing values and handle them appropriately
- Data Transformation:
  - Create new columns based on existing data
  - Normalize or standardize numerical variables
  - Convert categorical variables to numeric using one-hot encoding
- Visual Analysis Project:
  - Choose a dataset of interest (e.g., COVID-19 data, economic indicators, sports statistics)
  - Create a dashboard with at least four different types of plots
  - Write a summary of the insights you discovered from your visualizations
- Time Series Analysis:
  - Find a dataset with time-based information
  - Use pandas date functionality to extract patterns by month, day of week, etc.
  - Create line charts showing trends over time
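As a starting point for the time series exercise, here is a small sketch of the pandas date functionality it refers to. The `sales` data is synthetic (a simple counter over 90 days starting on 2024-01-01), so substitute your own dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily data: 90 days starting 1 January 2024
dates = pd.date_range('2024-01-01', periods=90, freq='D')
sales = pd.DataFrame({'date': dates, 'units': np.arange(90)})

# The .dt accessor exposes date parts of a datetime column
sales['month'] = sales['date'].dt.month
sales['weekday'] = sales['date'].dt.day_name()

# Aggregate by the extracted parts
monthly_total = sales.groupby('month')['units'].sum()
weekday_mean = sales.groupby('weekday')['units'].mean()

print(monthly_total)
print(weekday_mean)
```

From here, `monthly_total.plot(kind='line')` would give a first trend chart.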
By practicing these exercises, you'll build confidence in your data analysis skills and be ready to tackle more complex projects!