
Python Data Analysis

Introduction

Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. Python has become the go-to language for data analysis due to its simplicity and the rich ecosystem of libraries specifically designed for working with data.

In this guide, we will explore how to use Python for data analysis, focusing on three essential libraries, with seaborn as a companion for statistical plots:

  • NumPy: For numerical computing with powerful array operations
  • pandas: For data manipulation and analysis
  • matplotlib: For data visualization

Whether you're analyzing sales data, scientific measurements, or social media trends, these tools will help you extract meaningful insights from your data.

Getting Started with Python Data Analysis

Setting Up Your Environment

Before we dive into data analysis, make sure you have the necessary libraries installed:

bash
pip install numpy pandas matplotlib seaborn

Let's start by importing the basic libraries:

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

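# Set cleaner default aesthetics for plots (white background with a grid).
# sns.set_theme() resets matplotlib's style settings, so the style is passed here
# rather than through a separate plt.style.use() call that it would override.
sns.set_theme(style='whitegrid')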

# Display plots inline if using a Jupyter notebook
# %matplotlib inline

Working with NumPy for Numerical Analysis

NumPy is the foundation for data analysis in Python. It provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays efficiently.

Creating NumPy Arrays

python
# Create a simple array
arr = np.array([1, 2, 3, 4, 5])
print(arr)

Output:

[1 2 3 4 5]

python
# Create a 2D array (matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix)

Output:

[[1 2 3]
 [4 5 6]
 [7 8 9]]

Array Operations

NumPy supports vectorized operations, which apply an operation to a whole array at once and are typically much faster than explicit Python loops:

python
# Array arithmetic
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Addition
print(arr1 + arr2) # Element-wise addition

# Multiplication
print(arr1 * arr2) # Element-wise multiplication

# With scalars
print(arr1 * 2) # Multiply each element by 2

Output:

[5 7 9]
[ 4 10 18]
[2 4 6]
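To make the speed claim concrete, here is a minimal timing sketch. The array size, the helper names `double_with_loop` and `double_vectorized`, and the repeat count are illustrative choices, not part of the examples above, and the exact timings will vary by machine.

```python
import timeit

import numpy as np

# 200,000 values; the size is arbitrary and chosen only for illustration.
values = np.arange(200_000, dtype=np.float64)


def double_with_loop(arr):
    """Double each element using an explicit Python loop."""
    result = np.empty_like(arr)
    for i in range(arr.size):
        result[i] = arr[i] * 2
    return result


def double_vectorized(arr):
    """Double each element with a single vectorized expression."""
    return arr * 2


# Both approaches produce the same values; only the speed differs.
same_result = np.array_equal(double_with_loop(values), double_vectorized(values))
print(f"Same result: {same_result}")

loop_time = timeit.timeit(lambda: double_with_loop(values), number=3)
vector_time = timeit.timeit(lambda: double_vectorized(values), number=3)
print(f"Loop:       {loop_time:.4f} s")
print(f"Vectorized: {vector_time:.4f} s")
```

On most machines the vectorized version should finish far sooner, because the per-element work happens in compiled code instead of the Python interpreter.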

Statistical Operations

NumPy has built-in functions for common statistical operations:

python
data = np.array([15, 23, 48, 10, 28, 36, 52, 19, 25, 33])

print(f"Mean: {np.mean(data)}")
print(f"Median: {np.median(data)}")
print(f"Standard Deviation: {np.std(data):.2f}")
print(f"Min and Max: {np.min(data)} and {np.max(data)}")

Output:

Mean: 28.9
Median: 26.5
Standard Deviation: 12.90
Min and Max: 10 and 52
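One detail worth knowing: `np.std` computes the population standard deviation by default (it divides by n), while pandas' `std()` and `describe()` use the sample standard deviation (dividing by n - 1). The short sketch below shows both on the same data using NumPy's `ddof` parameter; the variable names are illustrative.

```python
import numpy as np

data = np.array([15, 23, 48, 10, 28, 36, 52, 19, 25, 33])

population_std = np.std(data)        # ddof=0 (default): divide by n
sample_std = np.std(data, ddof=1)    # ddof=1: divide by n - 1

print(f"Population std (ddof=0): {population_std:.2f}")  # 12.90
print(f"Sample std (ddof=1):     {sample_std:.2f}")      # 13.60
```

Use `ddof=1` when your data is a sample drawn from a larger population and you want results that match pandas.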

Data Manipulation with pandas

While NumPy provides the foundation for numerical computing, pandas is specifically designed for data manipulation and analysis. It introduces two main data structures:

  1. Series: A one-dimensional labeled array
  2. DataFrame: A two-dimensional labeled data structure with columns that can be of different types
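As a quick illustration of how the two structures relate, the sketch below builds a small Series and a DataFrame by hand. The names `ages` and `people` and their values are made up for this example.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
ages = pd.Series([24, 27, 22], index=['Alice', 'Bob', 'Charlie'], name='Age')
print(ages['Bob'])  # look up a value by its label -> 27

# A DataFrame: columns of different types sharing one index.
people = pd.DataFrame(
    {
        'Age': [24, 27, 22],
        'City': ['New York', 'Boston', 'Chicago'],
    },
    index=['Alice', 'Bob', 'Charlie'],
)

# Selecting a single column from a DataFrame returns a Series.
age_column = people['Age']
print(type(age_column).__name__)  # Series
```

In other words, a DataFrame can be thought of as a collection of Series that share the same index.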

Creating DataFrames

python
# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
    'Salary': [65000, 72000, 59000, 82000, 75000]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City  Salary
0    Alice   24  New York   65000
1      Bob   27    Boston   72000
2  Charlie   22   Chicago   59000
3    David   32    Denver   82000
4      Eva   29   Seattle   75000

Loading Data from Files

pandas makes it easy to read data from various file formats:

python
# Let's create a CSV file for demonstration
df.to_csv('employees.csv', index=False)

# Now read it back
employees = pd.read_csv('employees.csv')
print(employees.head()) # The head() method shows the first five rows

Output:

      Name  Age      City  Salary
0    Alice   24  New York   65000
1      Bob   27    Boston   72000
2  Charlie   22   Chicago   59000
3    David   32    Denver   82000
4      Eva   29   Seattle   75000
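CSV is only one of the supported formats. As a rough sketch, the example below round-trips a small DataFrame through JSON and reads semicolon-separated text. It uses in-memory strings via `io.StringIO` instead of real files so it runs anywhere; the variable names and sample values are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [65000, 72000]})

# JSON round trip: one JSON object per row.
json_text = df.to_json(orient='records')
from_json = pd.read_json(io.StringIO(json_text))
print(from_json)

# Delimited text that uses ';' instead of ',' as the separator.
csv_text = "Name;Salary\nAlice;65000\nBob;72000\n"
from_csv = pd.read_csv(io.StringIO(csv_text), sep=';')
print(from_csv)
```

With real files you would pass a path (for example `pd.read_json('employees.json')`) in place of the `StringIO` object.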

Data Exploration

pandas provides numerous methods to explore and understand your data:

python
# Basic information about the DataFrame
employees.info()  # info() prints its report directly and returns None

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Age     5 non-null      int64
 2   City    5 non-null      object
 3   Salary  5 non-null      int64
dtypes: int64(2), object(2)
memory usage: 288.0+ bytes

python
# Statistical summary
print(employees.describe())

Output:

             Age        Salary
count   5.000000      5.000000
mean   26.800000  70600.000000
std     3.962323   8905.054744
min    22.000000  59000.000000
25%    24.000000  65000.000000
50%    27.000000  72000.000000
75%    29.000000  75000.000000
max    32.000000  82000.000000

Data Selection and Filtering

pandas offers powerful ways to select and filter data:

python
# Selecting columns
print(employees['Name'])

Output:

0      Alice
1        Bob
2    Charlie
3      David
4        Eva
Name: Name, dtype: object

python
# Selecting rows by position
print(employees.iloc[0:2]) # First two rows

Output:

    Name  Age      City  Salary
0  Alice   24  New York   65000
1    Bob   27    Boston   72000

python
# Conditional filtering
high_salary = employees[employees['Salary'] > 70000]
print(high_salary)

Output:

    Name  Age     City  Salary
1    Bob   27   Boston   72000
3  David   32   Denver   82000
4    Eva   29  Seattle   75000
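Filters can also be combined, and `.loc` lets you pick rows and columns in one step. The sketch below rebuilds the same employee data so it runs on its own; the result variable names are illustrative.

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
    'Salary': [65000, 72000, 59000, 82000, 75000],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
young_high_earners = employees[(employees['Salary'] > 70000) & (employees['Age'] < 30)]
print(young_high_earners['Name'].tolist())  # ['Bob', 'Eva']

# .loc selects rows by condition and columns by label at the same time.
names_and_cities = employees.loc[employees['Salary'] > 70000, ['Name', 'City']]
print(names_and_cities)

# isin() keeps rows whose value appears in a list of allowed values.
east_coast = employees[employees['City'].isin(['New York', 'Boston'])]
print(east_coast['Name'].tolist())  # ['Alice', 'Bob']
```

Note that Python's `and`/`or` keywords do not work element-wise on Series, which is why `&` and `|` are used here.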

Data Grouping and Aggregation

One of pandas' most powerful features is the ability to group data and compute aggregations:

python
# Let's create a slightly larger dataset
data = {
    'Department': ['IT', 'HR', 'Sales', 'IT', 'HR', 'Sales', 'IT', 'Sales'],
    'Employee': ['John', 'Alice', 'Bob', 'Mary', 'Jane', 'Michael', 'David', 'Anne'],
    'Salary': [65000, 72000, 59000, 82000, 75000, 67000, 78000, 63000],
    'Years': [3, 7, 2, 5, 8, 3, 4, 2]
}

df = pd.DataFrame(data)

# Group by department; compute the mean, min, and max salary and the mean years of service
dept_stats = df.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Years': 'mean'
})

print(dept_stats)

Output:

             Salary                   Years
               mean    min    max      mean
Department
HR          73500.0  72000  75000  7.500000
IT          75000.0  65000  82000  4.000000
Sales       63000.0  59000  67000  2.333333
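The dictionary form of `agg` produces two-level column headers, which can be awkward to work with later. One alternative is pandas' named aggregation, sketched below on the same data; the output column names (`mean_salary` and so on) are arbitrary labels chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Sales', 'IT', 'HR', 'Sales', 'IT', 'Sales'],
    'Employee': ['John', 'Alice', 'Bob', 'Mary', 'Jane', 'Michael', 'David', 'Anne'],
    'Salary': [65000, 72000, 59000, 82000, 75000, 67000, 78000, 63000],
    'Years': [3, 7, 2, 5, 8, 3, 4, 2],
})

# Named aggregation: new_column_name=(source_column, aggregation_function)
dept_summary = df.groupby('Department').agg(
    mean_salary=('Salary', 'mean'),
    min_salary=('Salary', 'min'),
    max_salary=('Salary', 'max'),
    mean_years=('Years', 'mean'),
)

print(dept_summary)
print(dept_summary.loc['IT', 'mean_salary'])  # 75000.0
```

The result has a single level of column names, so a value can be read with a plain `.loc[row, column]` lookup.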

Data Visualization with matplotlib and seaborn

After preparing and analyzing your data, visualization helps you understand patterns and communicate findings. We'll use matplotlib together with seaborn, a library built on top of matplotlib that provides a higher-level interface for statistical plots.

Basic Plotting with matplotlib

python
# Let's create a simple line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2)
plt.title('Sine Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.grid(True)
plt.savefig('sine_wave.png') # Save the figure
plt.show()

This produces a blue sine wave on a grid and saves the figure to sine_wave.png.

Bar Charts with pandas and matplotlib

python
# Using our employee data to create a bar chart
dept_avg_salary = df.groupby('Department')['Salary'].mean()

plt.figure(figsize=(10, 6))
dept_avg_salary.plot(kind='bar', color='skyblue')
plt.title('Average Salary by Department')
plt.xlabel('Department')
plt.ylabel('Average Salary ($)')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.savefig('dept_salary_bar.png')
plt.show()

Advanced Visualization with seaborn

python
# Create a larger synthetic dataset for better visualization
np.random.seed(42)
data = {
    'Department': np.random.choice(['IT', 'HR', 'Sales', 'Marketing'], 100),
    'Experience': np.random.randint(1, 15, 100),
    'Salary': np.random.randint(50000, 100000, 100),
    'Performance': np.random.uniform(2, 5, 100)
}

df_large = pd.DataFrame(data)

# Create a scatter plot with regression line
plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=df_large,
    x='Experience',
    y='Salary',
    hue='Department',
    size='Performance',
    sizes=(20, 200),
    alpha=0.7
)
sns.regplot(
    data=df_large,
    x='Experience',
    y='Salary',
    scatter=False,
    color='black'
)
plt.title('Salary vs. Experience by Department', fontsize=16)
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.savefig('salary_experience_scatter.png')
plt.show()

Creating Multiple Plots

python
# Create a dashboard of plots
plt.figure(figsize=(15, 10))

# Plot 1: Distribution of Salaries
plt.subplot(2, 2, 1)
sns.histplot(df_large['Salary'], bins=15, kde=True)
plt.title('Salary Distribution')

# Plot 2: Salary by Department
plt.subplot(2, 2, 2)
sns.boxplot(x='Department', y='Salary', data=df_large)
plt.title('Salary by Department')

# Plot 3: Experience Distribution
plt.subplot(2, 2, 3)
sns.histplot(df_large['Experience'], bins=10, kde=True)
plt.title('Experience Distribution')

# Plot 4: Performance vs. Salary
plt.subplot(2, 2, 4)
sns.scatterplot(x='Performance', y='Salary', data=df_large, alpha=0.6)
plt.title('Performance vs. Salary')

plt.tight_layout()
plt.savefig('dashboard.png')
plt.show()
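The same kind of multi-panel figure can also be built with matplotlib's object-oriented interface, where `plt.subplots` returns the figure and an array of axes that you draw on explicitly. The sketch below is a simplified two-panel variant using plain matplotlib and freshly generated random data, not the `df_large` dataset above. The `Agg` backend line is an assumption that lets the code run without a display; drop it in a notebook.

```python
import matplotlib

matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
salaries = rng.integers(50000, 100000, 100)
experience = rng.integers(1, 15, 100)

# One row, two columns of axes; each panel is addressed through its own Axes object.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].hist(salaries, bins=15)
axes[0].set_title('Salary Distribution')
axes[0].set_xlabel('Salary ($)')

axes[1].scatter(experience, salaries, alpha=0.6)
axes[1].set_title('Experience vs. Salary')
axes[1].set_xlabel('Years of Experience')
axes[1].set_ylabel('Salary ($)')

fig.tight_layout()
```

Because every call names the axes it acts on, this style tends to be easier to follow than `plt.subplot` once a figure has several panels. seaborn plotting functions accept an `ax=` argument, so they can be directed to a specific panel in the same way.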

Real-world Data Analysis Example

Let's put everything together with a realistic example. We'll analyze a dataset of car fuel efficiency:

python
# Let's simulate loading data from an external source
# In practice, you might use:
# df = pd.read_csv('https://raw.githubusercontent.com/datasets/...')

# Creating sample car data
car_data = {
    'make': ['Toyota', 'Honda', 'Ford', 'BMW', 'Toyota', 'Honda', 'Ford', 'BMW', 'Toyota', 'Honda'],
    'model': ['Corolla', 'Civic', 'Focus', '3 Series', 'Camry', 'Accord', 'Fusion', '5 Series', 'Prius', 'Fit'],
    'year': [2019, 2020, 2018, 2021, 2017, 2020, 2019, 2020, 2020, 2018],
    'engine_size': [1.8, 2.0, 1.6, 3.0, 2.5, 1.5, 2.0, 4.0, 1.8, 1.5],
    'mpg': [32, 36, 28, 24, 29, 38, 27, 22, 52, 33],
    'price': [21000, 22500, 18000, 43000, 25000, 28000, 20000, 52000, 26000, 17000]
}

cars = pd.DataFrame(car_data)

# 1. Data exploration
print(cars.head())
print("\nSummary Statistics:")
print(cars.describe())

# 2. Finding correlations between numerical variables
print("\nCorrelation Matrix:")
correlation = cars[['engine_size', 'mpg', 'price', 'year']].corr()
print(correlation)

# 3. Data aggregation
avg_mpg_by_make = cars.groupby('make')['mpg'].mean().sort_values(ascending=False)
print("\nAverage MPG by Make:")
print(avg_mpg_by_make)

# 4. Visualization
plt.figure(figsize=(15, 10))

# Plot 1: MPG vs Engine Size
plt.subplot(2, 2, 1)
sns.scatterplot(x='engine_size', y='mpg', hue='make', size='price',
                sizes=(50, 200), data=cars, alpha=0.7)
plt.title('MPG vs Engine Size')
plt.grid(True, alpha=0.3)

# Plot 2: Average MPG by Make
plt.subplot(2, 2, 2)
avg_mpg_by_make.plot(kind='bar', color='skyblue')
plt.title('Average MPG by Make')
plt.ylabel('Miles Per Gallon')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)

# Plot 3: Price vs MPG
plt.subplot(2, 2, 3)
sns.scatterplot(x='price', y='mpg', hue='make', data=cars)
plt.title('Price vs Fuel Efficiency')
plt.xlabel('Price ($)')
plt.ylabel('Miles Per Gallon')
plt.grid(True, alpha=0.3)

# Plot 4: Distribution of MPG
plt.subplot(2, 2, 4)
sns.kdeplot(data=cars, x='mpg', hue='make', fill=True, alpha=0.5)
plt.title('Distribution of Fuel Efficiency by Make')
plt.xlabel('Miles Per Gallon')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('car_analysis.png')
plt.show()

# 5. Finding insights
high_efficiency = cars[cars['mpg'] > 35]
print("\nHigh Efficiency Cars (MPG > 35):")
print(high_efficiency)

print("\nKey Findings:")
best = cars.loc[cars['mpg'].idxmax()]  # row of the most fuel-efficient car
print(f"1. The car with the highest fuel efficiency is {best['make']} {best['model']} with {best['mpg']} MPG")
print(f"2. Average price of cars in the dataset: ${cars['price'].mean():.2f}")
print(f"3. Correlation between engine size and MPG: {correlation.loc['engine_size', 'mpg']:.2f}")
print(f"4. Correlation between price and MPG: {correlation.loc['price', 'mpg']:.2f}")

This example shows a complete data analysis workflow:

  1. Loading and exploring the data
  2. Computing statistics and correlations
  3. Grouping and aggregating data
  4. Creating visualizations
  5. Drawing insights from the analysis

Summary

In this guide, we've explored Python's powerful ecosystem for data analysis:

  • NumPy provides the foundation for numerical computing with arrays
  • pandas offers data structures and functions for data manipulation
  • matplotlib and seaborn enable creating informative visualizations

Python data analysis is a vast field with many more advanced techniques, but mastering these basics will give you a solid foundation for tackling real-world data problems.

The process typically follows these steps:

  1. Loading and cleaning data
  2. Exploring and understanding the data structure
  3. Manipulating and transforming data
  4. Analyzing relationships and patterns
  5. Visualizing results
  6. Drawing conclusions

Additional Resources and Exercises

Resources for Further Learning

  1. pandas Documentation
  2. NumPy User Guide
  3. matplotlib Tutorials
  4. seaborn Tutorial
  5. Python for Data Analysis by Wes McKinney (creator of pandas)

Practice Exercises

  1. Basic Data Exploration:

    • Download a dataset from Kaggle or use a built-in dataset from seaborn
    • Explore the data using pandas functions like info(), describe(), and head()
    • Identify missing values and handle them appropriately
  2. Data Transformation:

    • Create new columns based on existing data
    • Normalize or standardize numerical variables
    • Convert categorical variables to numeric using one-hot encoding
  3. Visual Analysis Project:

    • Choose a dataset of interest (e.g., COVID-19 data, economic indicators, sports statistics)
    • Create a dashboard with at least four different types of plots
    • Write a summary of the insights you discovered from your visualizations
  4. Time Series Analysis:

    • Find a dataset with time-based information
    • Use pandas date functionality to extract patterns by month, day of week, etc.
    • Create line charts showing trends over time

By practicing these exercises, you'll build confidence in your data analysis skills and be ready to tackle more complex projects!


