Pandas Introduction
What is Pandas?
Pandas is one of the most popular Python libraries for data manipulation and analysis. It provides fast, flexible data structures that make working with structured data intuitive. If you're working with tabular data (like spreadsheets or SQL tables), time series, or any kind of labeled or relational data, Pandas will likely become your go-to tool.
The library was created by Wes McKinney in 2008 while he was working as a quantitative analyst. The name "Pandas" is derived from "panel data," an econometrics term for multidimensional structured data sets.
Why Use Pandas?
Pandas solves many common problems in data analysis:
- Reading various file formats: CSV, Excel, SQL databases, JSON, and more
- Data cleaning: Handling missing values, duplicates, and formatting issues
- Data transformation: Reshaping, pivoting, merging, and aggregating data
- Data analysis: Statistical functions, grouping operations, and time series analysis
- Data visualization: Integration with Matplotlib for creating graphs and charts
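As a rough illustration of the cleaning and aggregation tasks listed above, here is a minimal sketch on a small made-up table. The column names (`region`, `units`) and all values are assumptions chosen for demonstration only.

```python
import pandas as pd

# Hypothetical sales records with one duplicate row and one missing value
raw = pd.DataFrame({
    'region': ['North', 'South', 'South', 'East'],
    'units': [10.0, 5.0, 5.0, None],
})

# Data cleaning: drop exact duplicate rows, then fill the missing value with 0
clean = raw.drop_duplicates().fillna({'units': 0})

# Data transformation: total units per region
totals = clean.groupby('region')['units'].sum()
print(totals)
```

Whether filling a missing value with 0 is appropriate depends on the data; it is used here only to keep the example short.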
Installing Pandas
Before we dive into Pandas, you need to install it. You can install Pandas using pip:
pip install pandas
Or using conda:
conda install pandas
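After either command finishes, one simple way to check that the installation worked is to import the library and print its version. The version string you see will depend on what was installed.

```python
import pandas as pd

# If this import succeeds, Pandas is installed; the version string varies by environment
print(pd.__version__)
```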
Core Pandas Data Structures
Pandas has two primary data structures:
- Series: A one-dimensional labeled array
- DataFrame: A two-dimensional labeled data structure with columns of potentially different types
Let's explore each of these structures.
Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.
Creating a Series
import pandas as pd
# Create a Series from a list
s = pd.Series([10, 20, 30, 40])
print(s)
Output:
0 10
1 20
2 30
3 40
dtype: int64
You can also specify your own index:
# Create a Series with custom index
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)
Output:
a 10
b 20
c 30
d 40
dtype: int64
A Series can be created from a dictionary as well:
# Create a Series from a dictionary
d = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
s = pd.Series(d)
print(s)
Output:
a 10
b 20
c 30
d 40
dtype: int64
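To make the "values plus index" description more concrete, the sketch below inspects both parts of a Series. It also shows what happens when a dictionary is combined with an explicit index; the label `'z'` is a made-up key used to show a missing entry.

```python
import pandas as pd

s = pd.Series({'a': 10, 'b': 20, 'c': 30, 'd': 40})

# The labels and the underlying values are stored separately
print(s.index.tolist())   # ['a', 'b', 'c', 'd']
print(s.values.tolist())  # [10, 20, 30, 40]

# An explicit index selects and orders the dictionary keys;
# 'z' is not in the dictionary, so its value becomes NaN
partial = pd.Series({'a': 10, 'b': 20}, index=['b', 'a', 'z'])
print(partial)
```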
Accessing Series Elements
You can access elements using index labels:
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s['a']) # Output: 10
print(s[['a', 'c']]) # Access multiple elements
Output:
10
a 10
c 30
dtype: int64
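Beyond square brackets, a Series can also be accessed explicitly by label with `.loc`, by position with `.iloc`, or filtered with a boolean condition. The following is a short sketch using the same Series as above.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(s.loc['b'])   # by label: 20
print(s.iloc[0])    # by position: 10

# Label slices with .loc include both endpoints
print(s.loc['b':'c'].tolist())  # [20, 30]

# Boolean mask: keep only the values greater than 20
print(s[s > 20].tolist())  # [30, 40]
```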
DataFrame
A DataFrame is a 2-dimensional labeled data structure with columns that can be of different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
Creating a DataFrame
# Create a DataFrame from a dictionary of lists
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 John 28 New York
1 Anna 34 Paris
2 Peter 29 Berlin
3 Linda 42 London
You can also specify the order of the columns:
df = pd.DataFrame(data, columns=['Name', 'City', 'Age'])
print(df)
Output:
Name City Age
0 John New York 28
1 Anna Paris 34
2 Peter Berlin 29
3 Linda London 42
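The description of a DataFrame as "a dictionary of Series objects" can be sketched directly. In this example the two Series deliberately share only some labels, to show how rows are aligned by index; the data is made up.

```python
import pandas as pd

# Two Series that share only some labels
ages = pd.Series({'John': 28, 'Anna': 34, 'Peter': 29})
cities = pd.Series({'Anna': 'Paris', 'Peter': 'Berlin', 'Linda': 'London'})

# Rows are aligned on the union of the labels; gaps are filled with NaN
people = pd.DataFrame({'Age': ages, 'City': cities})
print(people)

# Each column of the DataFrame is itself a Series
print(type(people['Age']))
```

Because Linda has no age and John has no city, those cells show up as `NaN` in the result.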
Creating a DataFrame from CSV
One of the most common ways to create a DataFrame is by importing data from a CSV file:
# Reading a CSV file into a DataFrame
df = pd.read_csv('data.csv')
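If you don't have a `data.csv` file at hand, a self-contained way to try this out is to read CSV text from memory. The sketch below assumes a tiny made-up dataset and uses `io.StringIO` as a stand-in for a file on disk.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (made-up data)
csv_text = """Name,Age,City
John,28,New York
Anna,34,Paris
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Convert back to CSV text; index=False omits the row labels
print(df.to_csv(index=False))
```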
Basic DataFrame Operations
Let's explore some basic operations on DataFrames.
Viewing Data
# Sample DataFrame for demonstration
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'Salary': [50000, 60000, 55000, 75000, 65000]
}
df = pd.DataFrame(data)
# View first 2 rows
print(df.head(2))
# View last 2 rows
print(df.tail(2))
# Get basic information about the DataFrame
# (info() prints its report directly and returns None, so no print() is needed)
df.info()
# Get statistical summary of numerical columns
print(df.describe())
Output (head):
Name Age City Salary
0 John 28 New York 50000
1 Anna 34 Paris 60000
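A few attributes are also handy for a quick look at a DataFrame's size and column types. The sketch below rebuilds the same sample DataFrame so that it can be run on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'Salary': [50000, 60000, 55000, 75000, 65000],
})

print(df.shape)             # (5, 4): number of rows, number of columns
print(df.columns.tolist())  # ['Name', 'Age', 'City', 'Salary']
print(df.dtypes)            # data type of each column
print(df['Age'].mean())     # 34.0
```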
Selecting Data
# Select a single column (returns a Series)
print(df['Name'])
# Select multiple columns
print(df[['Name', 'City']])
# Select rows by position using iloc
print(df.iloc[0]) # First row
print(df.iloc[1:3]) # Second and third rows
# Select rows by label using loc
print(df.loc[2]) # Row with index 2
Output (df['Name']):
0 John
1 Anna
2 Peter
3 Linda
4 Bob
Name: Name, dtype: object
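Both `loc` and `iloc` also accept a column selector, which lets you pick rows and columns in one step. The following sketch, using the same sample data, highlights one difference that often surprises newcomers: label slices include the end label, while position slices do not.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'Salary': [50000, 60000, 55000, 75000, 65000],
})

# Single cell: row label 1, column 'City'
print(df.loc[1, 'City'])  # Paris

# Label slice: includes the end label, so rows 1, 2 and 3
subset = df.loc[1:3, ['Name', 'Age']]
print(subset)

# Position slice: excludes the end, so rows 1 and 2, first two columns
print(df.iloc[1:3, 0:2])

# Rows matching a condition, restricted to chosen columns
high = df.loc[df['Salary'] > 60000, ['Name', 'Salary']]
print(high)
```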
Data Manipulation
# Add a new column
df['Experience'] = [3, 8, 5, 15, 10]
# Modify a column
df['Salary'] = df['Salary'] * 1.1 # 10% raise for everyone
# Delete a column
df_copy = df.copy()
del df_copy['Experience']
# Filter data
senior_employees = df[df['Age'] > 30]
print(senior_employees)
# Sort by Age
print(df.sort_values(by='Age'))
# Sort by multiple columns
print(df.sort_values(by=['City', 'Age']))
Output (senior_employees):
Name Age City Salary Experience
1 Anna 34 Paris 66000.0 8
3 Linda 42 London 82500.0 15
4 Bob 37 Tokyo 71500.0 10
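Filters can be combined, and columns can be removed without modifying the original DataFrame. The sketch below is a self-contained variation on the steps above, using the original (pre-raise) salaries; the threshold values are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'Salary': [50000, 60000, 55000, 75000, 65000],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
experienced = df[(df['Age'] > 30) & (df['Salary'] >= 65000)]
print(experienced)

# drop() returns a new DataFrame and leaves df unchanged
without_city = df.drop(columns=['City'])
print(without_city.columns.tolist())  # ['Name', 'Age', 'Salary']

# Sort with the highest salary first
by_salary = df.sort_values(by='Salary', ascending=False)
print(by_salary)
```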
Real-World Example: Data Analysis
Let's work through a more complete example of using Pandas for data analysis. We'll analyze a small, fictional dataset of employee performance reviews.
# Sample employee performance data
performance_data = {
    'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Bob', 'Sarah', 'Michael', 'Emma'],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'Marketing', 'HR', 'Finance', 'IT'],
    'Review_Score': [4.2, 4.5, 3.9, 4.7, 3.2, 4.0, 4.3, 3.8],
    'Years_Employed': [3, 5, 2, 7, 3, 4, 6, 2],
    'Salary': [55000, 60000, 52000, 75000, 48000, 62000, 69000, 53000]
}
employee_df = pd.DataFrame(performance_data)
print(employee_df)
Output:
Employee Department Review_Score Years_Employed Salary
0 John IT 4.2 3 55000
1 Anna HR 4.5 5 60000
2 Peter IT 3.9 2 52000
3 Linda Finance 4.7 7 75000
4 Bob Marketing 3.2 3 48000
5 Sarah HR 4.0 4 62000
6 Michael Finance 4.3 6 69000
7 Emma IT 3.8 2 53000
Now, let's perform some data analysis:
# Calculate average review score by department
dept_review = employee_df.groupby('Department')['Review_Score'].mean().sort_values(ascending=False)
print("Average Review Score by Department:")
print(dept_review)
# Calculate average salary
print(f"Average Salary: ${employee_df['Salary'].mean():.2f}")
# Find correlation between years employed and salary
correlation = employee_df['Years_Employed'].corr(employee_df['Salary'])
print(f"Correlation between Years Employed and Salary: {correlation:.2f}")
# Create a new column for performance bonus (based on review score)
employee_df['Bonus'] = employee_df['Review_Score'] * 1000
print(employee_df[['Employee', 'Review_Score', 'Bonus']])
Output:
Average Review Score by Department:
Department
Finance 4.500000
HR 4.250000
IT 3.966667
Marketing 3.200000
Name: Review_Score, dtype: float64
Average Salary: $59250.00
Correlation between Years Employed and Salary: 0.93
Employee Review_Score Bonus
0 John 4.2 4200.0
1 Anna 4.5 4500.0
2 Peter 3.9 3900.0
3 Linda 4.7 4700.0
4 Bob 3.2 3200.0
5 Sarah 4.0 4000.0
6 Michael 4.3 4300.0
7 Emma 3.8 3800.0
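Several statistics per group can also be computed in a single pass with `agg`. The sketch below uses the same employee data (trimmed to the columns it needs) and assumes pandas 0.25 or newer, which introduced this named-aggregation syntax. The output column names `headcount`, `avg_score`, and `max_salary` are illustrative choices, not anything Pandas requires.

```python
import pandas as pd

employee_df = pd.DataFrame({
    'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Bob', 'Sarah', 'Michael', 'Emma'],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'Marketing', 'HR', 'Finance', 'IT'],
    'Review_Score': [4.2, 4.5, 3.9, 4.7, 3.2, 4.0, 4.3, 3.8],
    'Salary': [55000, 60000, 52000, 75000, 48000, 62000, 69000, 53000],
})

# Named aggregation: new_column=(source_column, function)
summary = employee_df.groupby('Department').agg(
    headcount=('Employee', 'count'),
    avg_score=('Review_Score', 'mean'),
    max_salary=('Salary', 'max'),
)
print(summary)
```

The result has one row per department, which makes it easy to compare group size, average score, and top salary side by side.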
Summary
In this introduction to Pandas, we've covered:
- What Pandas is and why it's useful for data analysis
- How to install Pandas
- The two main data structures: Series and DataFrame
- Creating Series and DataFrames from various data sources
- Basic operations for viewing, selecting, and manipulating data
- A real-world example of data analysis using Pandas
Pandas is an extremely powerful library with many more capabilities beyond what we've covered. As you continue learning, you'll discover functions for advanced data manipulation, time series analysis, working with missing data, and much more.
Additional Resources
To deepen your knowledge of Pandas, here are some excellent resources:
- Official Pandas Documentation
- 10 Minutes to Pandas - A quick introduction to the main concepts
- Pandas Cookbook - Recipes for common tasks
Exercises
Practice what you've learned with these exercises:
- Create a DataFrame with information about 5 different books (title, author, year, genre, rating).
- Load a CSV file of your choice using pd.read_csv().
- Filter the employee DataFrame to show only employees from the IT department with a review score greater than 4.0.
- Calculate the average salary by department and sort departments by average salary in descending order.
- Create a new column called "Experience_Level" that categorizes employees as "Junior" (0-3 years), "Mid" (4-6 years), or "Senior" (7+ years).
Happy coding with Pandas!