Python Data Analysis

Introduction

Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. Python has become the go-to language for data analysis due to its simplicity and the rich ecosystem of libraries specifically designed for working with data.

In this guide, we will explore how to use Python for data analysis, focusing on three essential libraries:

NumPy: For numerical computing with powerful array operations
pandas: For data manipulation and analysis
matplotlib: For data visualization (together with seaborn, which builds on it)

Whether you're analyzing sales data, scientific measurements, or social media trends, these tools will help you extract meaningful insights from your data.

Getting Started with Python Data Analysis

Setting Up Your Environment

Before we dive into data analysis, make sure you have the necessary libraries installed:

bash
pip install numpy pandas matplotlib seaborn

Let's start by importing the basic libraries:

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

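# Set better default aesthetics for plots (seaborn's whitegrid theme).
# sns.set_theme() resets matplotlib's style settings, so the style is passed here
# rather than through a separate plt.style.use() call.
sns.set_theme(style='whitegrid')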

# Display plots inline if using a Jupyter notebook
# %matplotlib inline

Working with NumPy for Numerical Analysis

NumPy is the foundation for data analysis in Python. It provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays efficiently.

Creating NumPy Arrays

python
# Create a simple array
arr = np.array([1, 2, 3, 4, 5])
print(arr)

Output:

[1 2 3 4 5]

python
# Create a 2D array (matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix)

Output:

[[1 2 3]
 [4 5 6]
 [7 8 9]]
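
Once an array exists, its shape and contents can be inspected directly. The following is a minimal sketch of a few attributes and constructors that come up often; the variable names `zeros`, `evens`, and `grid` are illustrative and not part of the examples above.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Basic attributes of an array
print(matrix.shape)   # (3, 3): rows, columns
print(matrix.ndim)    # 2: number of dimensions
print(matrix.size)    # 9: total number of elements

# Other common ways to build arrays
zeros = np.zeros((2, 3))          # 2x3 array filled with 0.0
evens = np.arange(0, 10, 2)       # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced values from 0 to 1

# Indexing and slicing
print(matrix[0, 2])   # 3 (row 0, column 2)
print(matrix[:, 1])   # [2 5 8] (the second column)
```

Note that `np.arange` excludes its stop value while `np.linspace` includes it by default.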

Array Operations

NumPy supports vectorized operations, which apply an operation to every element of an array at once and are typically much faster than explicit Python loops:

python
# Array arithmetic
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Addition
print(arr1 + arr2)  # Element-wise addition

# Multiplication
print(arr1 * arr2)  # Element-wise multiplication

# With scalars
print(arr1 * 2)     # Multiply each element by 2

Output:

[5 7 9]
[ 4 10 18]
[2 4 6]
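
To make the comparison with loops concrete, here is a small sketch that computes the same result both ways and then shows broadcasting, where arrays of different shapes are combined. The arrays `values`, `col`, and `row` are made up for illustration.

```python
import numpy as np

values = np.arange(1, 6)          # [1 2 3 4 5]

# Loop version: square each element one at a time
squares_loop = np.array([v ** 2 for v in values])

# Vectorized version: one expression over the whole array
squares_vec = values ** 2

print(squares_vec)                                 # [ 1  4  9 16 25]
print(np.array_equal(squares_loop, squares_vec))   # True

# Broadcasting: a (3, 1) column combined with a (3,) row gives a 3x3 table
col = np.array([[1], [2], [3]])
row = np.array([10, 20, 30])
table = col * row
print(table)
```

Both versions give the same values; the vectorized form is shorter and runs the loop in compiled code, which is where the speed advantage on large arrays comes from.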

Statistical Operations

NumPy has built-in functions for common statistical operations:

python
data = np.array([15, 23, 48, 10, 28, 36, 52, 19, 25, 33])

print(f"Mean: {np.mean(data)}")
print(f"Median: {np.median(data)}")
print(f"Standard Deviation: {np.std(data):.2f}")  # population std (ddof=0)
print(f"Min and Max: {np.min(data)} and {np.max(data)}")

Output:

Mean: 28.9
Median: 26.5
Standard Deviation: 12.90
Min and Max: 10 and 52
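
One detail worth knowing: `np.std` computes the population standard deviation by default (it divides by n), while pandas' `std()` and `describe()` use the sample standard deviation (dividing by n - 1). The sketch below shows both on the same data, using NumPy's `ddof` parameter.

```python
import numpy as np

data = np.array([15, 23, 48, 10, 28, 36, 52, 19, 25, 33])

population_std = np.std(data)        # divides by n (ddof=0, the default)
sample_std = np.std(data, ddof=1)    # divides by n - 1

print(f"Population std: {population_std:.2f}")   # 12.90
print(f"Sample std:     {sample_std:.2f}")       # 13.60
```

Which one is appropriate depends on whether the data is the whole population or a sample drawn from it.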

Data Manipulation with pandas

While NumPy provides the foundation for numerical computing, pandas is specifically designed for data manipulation and analysis. It introduces two main data structures:

Series: A one-dimensional labeled array
DataFrame: A two-dimensional labeled data structure with columns that can be of different types
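
As a quick illustration of how the two structures relate, the sketch below builds a Series and a small DataFrame and shows that selecting one column of a DataFrame returns a Series. The names `ages`, `people`, and `city_column` are illustrative.

```python
import pandas as pd

# A Series: values plus an index of labels
ages = pd.Series([24, 27, 22], index=['Alice', 'Bob', 'Charlie'], name='Age')
print(ages['Bob'])          # 27 (lookup by label)

# A DataFrame: several columns sharing one index
people = pd.DataFrame(
    {
        'Age': [24, 27, 22],
        'City': ['New York', 'Boston', 'Chicago'],
    },
    index=['Alice', 'Bob', 'Charlie'],
)

# Selecting a single column returns a Series
city_column = people['City']
print(type(city_column).__name__)   # Series
print(people.dtypes)                # one dtype per column
```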

Creating DataFrames

python
# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
    'Salary': [65000, 72000, 59000, 82000, 75000]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City  Salary
0    Alice   24  New York   65000
1      Bob   27    Boston   72000
2  Charlie   22   Chicago   59000
3    David   32    Denver   82000
4      Eva   29   Seattle   75000

Loading Data from Files

pandas makes it easy to read data from various file formats:

python
# Let's create a CSV file for demonstration
df.to_csv('employees.csv', index=False)

# Now read it back
employees = pd.read_csv('employees.csv')
print(employees.head())  # The head() method shows the first five rows

Output:

      Name  Age      City  Salary
0    Alice   24  New York   65000
1      Bob   27    Boston   72000
2  Charlie   22   Chicago   59000
3    David   32    Denver   82000
4      Eva   29   Seattle   75000
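
`read_csv` also accepts options that control what gets loaded and how. The sketch below reads CSV text from memory (standing in for a file on disk) and uses `usecols`, `parse_dates`, and `index_col`. The `HireDate` column is a hypothetical addition for illustration and is not part of the employee data above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Name,Age,City,Salary,HireDate
Alice,24,New York,65000,2021-03-15
Bob,27,Boston,72000,2019-07-01
Charlie,22,Chicago,59000,2022-01-10
"""

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['Name', 'Salary', 'HireDate'],  # load only these columns
    parse_dates=['HireDate'],                # convert to datetime values
    index_col='Name',                        # use names as row labels
)

print(subset)
print(subset.dtypes)
```

Other formats follow the same pattern, for example `pd.read_excel` and `pd.read_json`.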

Data Exploration

pandas provides numerous methods to explore and understand your data:

python
# Basic information about the DataFrame
# info() prints its report directly and returns None, so it is not wrapped in print()
employees.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   City    5 non-null      object
 3   Salary  5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 288.0+ bytes

python
# Statistical summary
print(employees.describe())

Output:

             Age        Salary
count   5.000000      5.000000
mean   26.800000  70600.000000
std     3.962323   8905.054744
min    22.000000  59000.000000
25%    24.000000  65000.000000
50%    27.000000  72000.000000
75%    29.000000  75000.000000
max    32.000000  82000.000000
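
A few more exploration calls are often useful on a first pass. The sketch below rebuilds the employee table with one salary deliberately set to missing (an assumption made only for this illustration) so that the missing-value count has something to report.

```python
import numpy as np
import pandas as pd

# Same employees as above, with one salary deliberately set to missing
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
    'Salary': [65000, 72000, np.nan, 82000, 75000],
})

print(employees.shape)              # (5, 4): rows, columns
print(employees.isna().sum())       # missing values per column
print(employees['City'].nunique())  # 5 distinct cities

# The two oldest employees
oldest = employees.sort_values('Age', ascending=False).head(2)
print(oldest)
```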

Data Selection and Filtering

pandas offers powerful ways to select and filter data:

python
# Selecting columns
print(employees['Name'])

Output:

0      Alice
1        Bob
2    Charlie
3      David
4        Eva
Name: Name, dtype: object

python
# Selecting rows by position
print(employees.iloc[0:2])  # First two rows

Output:

    Name  Age     City  Salary
0  Alice   24  New York   65000
1    Bob   27    Boston   72000

python
# Conditional filtering
high_salary = employees[employees['Salary'] > 70000]
print(high_salary)

Output:

    Name  Age     City  Salary
1    Bob   27   Boston   72000
3  David   32   Denver   82000
4    Eva   29  Seattle   75000
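
Filters can also be combined. The following sketch, using the same employee data, shows multiple conditions with `&`, label-based selection with `.loc`, and the string-based `query()` method. The variable names `young_high`, `east`, and `mid_range` are illustrative.

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
    'Salary': [65000, 72000, 59000, 82000, 75000],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
young_high = employees[(employees['Age'] < 30) & (employees['Salary'] > 70000)]
print(young_high['Name'].tolist())   # ['Bob', 'Eva']

# .loc selects rows by condition and columns by label in one step
east = employees.loc[employees['City'].isin(['New York', 'Boston']), ['Name', 'City']]
print(east)

# query() expresses the same kind of filter as a string
mid_range = employees.query('Age >= 27 and Salary < 80000')
print(mid_range['Name'].tolist())    # ['Bob', 'Eva']
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why the first form uses `&` with parentheses.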

Data Grouping and Aggregation

One of pandas' most powerful features is the ability to group data and compute aggregations:

python
# Let's create a slightly larger dataset
data = {
    'Department': ['IT', 'HR', 'Sales', 'IT', 'HR', 'Sales', 'IT', 'Sales'],
    'Employee': ['John', 'Alice', 'Bob', 'Mary', 'Jane', 'Michael', 'David', 'Anne'],
    'Salary': [65000, 72000, 59000, 82000, 75000, 67000, 78000, 63000],
    'Years': [3, 7, 2, 5, 8, 3, 4, 2]
}

df = pd.DataFrame(data)

# Group by department; compute mean, min, and max salary plus mean years
dept_stats = df.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Years': 'mean'
})

print(dept_stats)

Output:

           Salary                  Years
            mean    min    max     mean
Department                             
HR         73500.0  72000  75000  7.500000
IT         75000.0  65000  82000  4.000000
Sales      63000.0  59000  67000  2.333333
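
Passing a dictionary to `agg` yields two-level column labels, as in the output above. If flat column names are preferred, pandas' named aggregation is one alternative. This is a sketch on the same department data; the output column names (`mean_salary`, `headcount`, and so on) are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Sales', 'IT', 'HR', 'Sales', 'IT', 'Sales'],
    'Employee': ['John', 'Alice', 'Bob', 'Mary', 'Jane', 'Michael', 'David', 'Anne'],
    'Salary': [65000, 72000, 59000, 82000, 75000, 67000, 78000, 63000],
    'Years': [3, 7, 2, 5, 8, 3, 4, 2],
})

# Named aggregation: new_column=(source_column, function)
dept_summary = df.groupby('Department').agg(
    mean_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    mean_years=('Years', 'mean'),
    headcount=('Employee', 'count'),
)
print(dept_summary)

# Flat names make follow-up steps such as sorting straightforward
print(dept_summary.sort_values('mean_salary', ascending=False).index.tolist())
```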

Data Visualization with matplotlib and seaborn

After preparing and analyzing your data, visualization helps in understanding patterns and communicating findings. We'll use matplotlib together with seaborn, which is built on top of matplotlib and provides a higher-level interface.

Basic Plotting with matplotlib

python
# Let's create a simple line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2)
plt.title('Sine Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.grid(True)
plt.savefig('sine_wave.png')  # Save the figure
plt.show()

This produces a blue sine wave on a grid and saves it to sine_wave.png.
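
matplotlib also offers an object-oriented interface, in which you work with explicit figure and axes objects instead of the implicit "current" plot. The sketch below draws two curves on one set of axes; the cosine curve and the file name `sine_cosine.png` are additions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# plt.subplots() returns the figure and its axes explicitly
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, np.sin(x), label='sin(x)')
ax.plot(x, np.cos(x), linestyle='--', label='cos(x)')
ax.set_title('Sine and Cosine')
ax.set_xlabel('x')
ax.set_ylabel('value')
ax.legend()
ax.grid(True)
fig.savefig('sine_cosine.png', dpi=150)  # hypothetical file name
# Call plt.show() to display the figure in an interactive session
```

This style tends to scale better once a figure contains several subplots, because every call names the axes it acts on.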

Bar Charts with pandas and matplotlib

python
# Using our employee data to create a bar chart
dept_avg_salary = df.groupby('Department')['Salary'].mean()

plt.figure(figsize=(10, 6))
dept_avg_salary.plot(kind='bar', color='skyblue')
plt.title('Average Salary by Department')
plt.xlabel('Department')
plt.ylabel('Average Salary ($)')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.savefig('dept_salary_bar.png')
plt.show()

Advanced Visualization with seaborn

python
# Create a larger synthetic dataset for better visualization
np.random.seed(42)
data = {
    'Department': np.random.choice(['IT', 'HR', 'Sales', 'Marketing'], 100),
    'Experience': np.random.randint(1, 15, 100),
    'Salary': np.random.randint(50000, 100000, 100),
    'Performance': np.random.uniform(2, 5, 100)
}

df_large = pd.DataFrame(data)

# Create a scatter plot with regression line
plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=df_large, 
    x='Experience', 
    y='Salary', 
    hue='Department', 
    size='Performance',
    sizes=(20, 200), 
    alpha=0.7
)
sns.regplot(
    data=df_large, 
    x='Experience', 
    y='Salary', 
    scatter=False, 
    color='black'
)
plt.title('Salary vs. Experience by Department', fontsize=16)
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.savefig('salary_experience_scatter.png')
plt.show()

Creating Multiple Plots

python
# Create a dashboard of plots
plt.figure(figsize=(15, 10))

# Plot 1: Distribution of Salaries
plt.subplot(2, 2, 1)
sns.histplot(df_large['Salary'], bins=15, kde=True)
plt.title('Salary Distribution')

# Plot 2: Salary by Department
plt.subplot(2, 2, 2)
sns.boxplot(x='Department', y='Salary', data=df_large)
plt.title('Salary by Department')

# Plot 3: Experience Distribution
plt.subplot(2, 2, 3)
sns.histplot(df_large['Experience'], bins=10, kde=True)
plt.title('Experience Distribution')

# Plot 4: Performance vs. Salary
plt.subplot(2, 2, 4)
sns.scatterplot(x='Performance', y='Salary', data=df_large, alpha=0.6)
plt.title('Performance vs. Salary')

plt.tight_layout()
plt.savefig('dashboard.png')
plt.show()
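
The same kind of dashboard can be laid out with `plt.subplots`, which returns a grid of axes that you index by row and column. The sketch below uses only matplotlib and freshly generated random data (not the `df_large` table above) to stay self-contained; seaborn's axes-level functions accept an `ax=` argument if you prefer to draw with them.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only
rng = np.random.default_rng(42)
salary = rng.integers(50000, 100000, size=100)
experience = rng.integers(1, 15, size=100)

# A 2x2 grid of axes, indexed as axes[row, column]
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

axes[0, 0].hist(salary, bins=15)
axes[0, 0].set_title('Salary Distribution')

axes[0, 1].hist(experience, bins=10)
axes[0, 1].set_title('Experience Distribution')

axes[1, 0].scatter(experience, salary, alpha=0.6)
axes[1, 0].set_title('Experience vs. Salary')

axes[1, 1].boxplot(salary)
axes[1, 1].set_title('Salary Spread')

fig.tight_layout()
# Call plt.show() to display the figure in an interactive session
```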

Real-world Data Analysis Example

Let's put everything together with a realistic example. We'll analyze a dataset of car fuel efficiency:

python
# Let's simulate loading data from an external source
# In practice, you might use: 
# df = pd.read_csv('https://raw.githubusercontent.com/datasets/...')

# Creating sample car data
car_data = {
    'make': ['Toyota', 'Honda', 'Ford', 'BMW', 'Toyota', 'Honda', 'Ford', 'BMW', 'Toyota', 'Honda'],
    'model': ['Corolla', 'Civic', 'Focus', '3 Series', 'Camry', 'Accord', 'Fusion', '5 Series', 'Prius', 'Fit'],
    'year': [2019, 2020, 2018, 2021, 2017, 2020, 2019, 2020, 2020, 2018],
    'engine_size': [1.8, 2.0, 1.6, 3.0, 2.5, 1.5, 2.0, 4.0, 1.8, 1.5],
    'mpg': [32, 36, 28, 24, 29, 38, 27, 22, 52, 33],
    'price': [21000, 22500, 18000, 43000, 25000, 28000, 20000, 52000, 26000, 17000]
}

cars = pd.DataFrame(car_data)

# 1. Data exploration
print(cars.head())
print("\nSummary Statistics:")
print(cars.describe())

# 2. Finding correlations between numerical variables
print("\nCorrelation Matrix:")
correlation = cars[['engine_size', 'mpg', 'price', 'year']].corr()
print(correlation)

# 3. Data aggregation
avg_mpg_by_make = cars.groupby('make')['mpg'].mean().sort_values(ascending=False)
print("\nAverage MPG by Make:")
print(avg_mpg_by_make)

# 4. Visualization
plt.figure(figsize=(15, 10))

# Plot 1: MPG vs Engine Size
plt.subplot(2, 2, 1)
sns.scatterplot(x='engine_size', y='mpg', hue='make', size='price', 
                sizes=(50, 200), data=cars, alpha=0.7)
plt.title('MPG vs Engine Size')
plt.grid(True, alpha=0.3)

# Plot 2: Average MPG by Make
plt.subplot(2, 2, 2)
avg_mpg_by_make.plot(kind='bar', color='skyblue')
plt.title('Average MPG by Make')
plt.ylabel('Miles Per Gallon')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)

# Plot 3: Price vs MPG
plt.subplot(2, 2, 3)
sns.scatterplot(x='price', y='mpg', hue='make', data=cars)
plt.title('Price vs Fuel Efficiency')
plt.xlabel('Price ($)')
plt.ylabel('Miles Per Gallon')
plt.grid(True, alpha=0.3)

# Plot 4: Distribution of MPG
plt.subplot(2, 2, 4)
sns.kdeplot(data=cars, x='mpg', hue='make', fill=True, alpha=0.5)
plt.title('Distribution of Fuel Efficiency by Make')
plt.xlabel('Miles Per Gallon')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('car_analysis.png')
plt.show()

# 5. Finding insights
high_efficiency = cars[cars['mpg'] > 35]
print("\nHigh Efficiency Cars (MPG > 35):")
print(high_efficiency)

print("\nKey Findings:")
best = cars.loc[cars['mpg'].idxmax()]  # row with the highest MPG
print(f"1. The car with the highest fuel efficiency is the {best['make']} {best['model']} with {best['mpg']} MPG")
print(f"2. Average price of cars in the dataset: ${cars['price'].mean():.2f}")
print(f"3. Correlation between engine size and MPG: {correlation.loc['engine_size', 'mpg']:.2f}")
print(f"4. Correlation between price and MPG: {correlation.loc['price', 'mpg']:.2f}")

This example shows a complete data analysis workflow:

Loading and exploring the data
Computing statistics and correlations
Grouping and aggregating data
Creating visualizations
Drawing insights from the analysis

Summary

In this guide, we've explored Python's powerful ecosystem for data analysis:

NumPy provides the foundation for numerical computing with arrays
pandas offers data structures and functions for data manipulation
matplotlib and seaborn enable creating informative visualizations

Python data analysis is a vast field with many more advanced techniques, but mastering these basics will give you a solid foundation for tackling real-world data problems.

The process typically follows these steps:

Loading and cleaning data
Exploring and understanding the data structure
Manipulating and transforming data
Analyzing relationships and patterns
Visualizing results
Drawing conclusions

Additional Resources and Exercises

Resources for Further Learning

pandas Documentation
NumPy User Guide
matplotlib Tutorials
seaborn Tutorial
Python for Data Analysis by Wes McKinney (creator of pandas)

Practice Exercises

Basic Data Exploration:
- Download a dataset from Kaggle or use a built-in dataset from seaborn
- Explore the data using pandas functions like info(), describe(), and head()
- Identify missing values and handle them appropriately
Data Transformation:
- Create new columns based on existing data
- Normalize or standardize numerical variables
- Convert categorical variables to numeric using one-hot encoding
Visual Analysis Project:
- Choose a dataset of interest (e.g., COVID-19 data, economic indicators, sports statistics)
- Create a dashboard with at least four different types of plots
- Write a summary of the insights you discovered from your visualizations
Time Series Analysis:
- Find a dataset with time-based information
- Use pandas date functionality to extract patterns by month, day of week, etc.
- Create line charts showing trends over time
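
As a possible starting point for the missing-value and data-transformation exercises, here is a minimal sketch on a small made-up table. Filling with the median and standardizing with the sample standard deviation are just one set of choices among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Sales', 'IT'],
    'Salary': [65000.0, 72000.0, np.nan, 82000.0],
})

# Handle missing values: here, fill with the column median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Standardize: subtract the mean, divide by the standard deviation
df['Salary_z'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()

# One-hot encode the categorical column
encoded = pd.get_dummies(df, columns=['Department'])
print(encoded)
```

After encoding, each department becomes its own indicator column (`Department_HR`, `Department_IT`, `Department_Sales`).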

By practicing these exercises, you'll build confidence in your data analysis skills and be ready to tackle more complex projects!
