Pandas DataFrame
Introduction
A DataFrame is the primary data structure in Pandas. You can think of it as a spreadsheet or SQL table represented in Python code: a 2-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
DataFrames are particularly powerful because they allow you to:
- Store data of different types in different columns
- Perform operations on rows, columns, or specific cells
- Handle missing data effectively
- Combine and merge datasets
- Apply statistical functions and aggregations
In this tutorial, we'll explore Pandas DataFrames in depth, from creation to manipulation and analysis.
Creating DataFrames
Let's start by importing the pandas library:
import pandas as pd
import numpy as np
From a Dictionary
One of the most common ways to create a DataFrame is from a dictionary:
# Creating a DataFrame from a dictionary
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 34, 29, 42],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 John 28 New York
1 Anna 34 Paris
2 Peter 29 Berlin
3 Linda 42 London
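By default, rows get an integer index 0 to n-1. If you want different row labels, you can pass an index when constructing the DataFrame (the labels 'a' through 'd' here are arbitrary examples):

```python
import pandas as pd

# Same data as above, but with custom row labels instead of the default RangeIndex
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df_labeled = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

# Rows can now be looked up by their label
print(df_labeled.loc['b', 'City'])  # Paris
```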
From Lists
You can also create a DataFrame from a list of lists:
# Creating a DataFrame from lists
data_list = [
['John', 28, 'New York'],
['Anna', 34, 'Paris'],
['Peter', 29, 'Berlin'],
['Linda', 42, 'London']
]
df = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])
print(df)
Output:
Name Age City
0 John 28 New York
1 Anna 34 Paris
2 Peter 29 Berlin
3 Linda 42 London
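A closely related option is a list of dictionaries, where each dictionary becomes one row. This is a sketch of that pattern with two of the rows from above:

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
records = [
    {'Name': 'John', 'Age': 28, 'City': 'New York'},
    {'Name': 'Anna', 'Age': 34, 'City': 'Paris'},
]
df_records = pd.DataFrame(records)
print(df_records)
```

This form is convenient when rows arrive one at a time, for example from a JSON API.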
From CSV Files
DataFrames can be created by reading data from files:
# Reading a CSV file into a DataFrame
# df = pd.read_csv('data.csv')
# Example if you had a file:
# df = pd.read_csv('people.csv')
# print(df.head())
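Since we don't have a CSV file on disk, a self-contained way to try read_csv is with an in-memory text buffer that stands in for the file (the data here mirrors our earlier example):

```python
import pandas as pd
from io import StringIO

# Simulate the contents of a CSV file so the example runs without any file on disk
csv_text = "Name,Age,City\nJohn,28,New York\nAnna,34,Paris\n"

# read_csv accepts any file-like object, including an in-memory buffer
df_csv = pd.read_csv(StringIO(csv_text))
print(df_csv.head())
```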
Basic DataFrame Properties and Methods
Let's explore some basic properties and methods to understand our DataFrame better:
# Create a sample DataFrame
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 34, 29, 42],
'City': ['New York', 'Paris', 'Berlin', 'London'],
'Salary': [50000, 60000, 55000, 75000]
}
df = pd.DataFrame(data)
# Examining the DataFrame
print("First few rows:")
print(df.head())
print("\nDataFrame shape (rows, columns):")
print(df.shape)
print("\nDataFrame info:")
print(df.info())
print("\nDataFrame columns:")
print(df.columns)
print("\nDataFrame data types:")
print(df.dtypes)
print("\nDataFrame summary statistics:")
print(df.describe())
Output:
First few rows:
Name Age City Salary
0 John 28 New York 50000
1 Anna 34 Paris 60000
2 Peter 29 Berlin 55000
3 Linda 42 London 75000
DataFrame shape (rows, columns):
(4, 4)
DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 4 non-null object
1 Age 4 non-null int64
2 City 4 non-null object
3 Salary 4 non-null int64
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes
None
DataFrame columns:
Index(['Name', 'Age', 'City', 'Salary'], dtype='object')
DataFrame data types:
Name object
Age int64
City object
Salary int64
dtype: object
DataFrame summary statistics:
Age Salary
count 4.000000 4.000000
mean 33.250000 60000.000000
std 6.396614 10801.234497
min 28.000000 50000.000000
25% 28.750000 53750.000000
50% 31.500000 57500.000000
75% 36.000000 63750.000000
max 42.000000 75000.000000
Accessing DataFrame Elements
Pandas offers multiple ways to access data in a DataFrame:
Column Selection
# Selecting a column
print("Names:")
print(df['Name'])
# Selecting multiple columns
print("\nNames and Ages:")
print(df[['Name', 'Age']])
Output:
Names:
0 John
1 Anna
2 Peter
3 Linda
Name: Name, dtype: object
Names and Ages:
Name Age
0 John 28
1 Anna 34
2 Peter 29
3 Linda 42
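A detail worth knowing: single brackets return a 1-dimensional Series, while double brackets return a DataFrame, even for one column. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 34]})

# Single brackets -> Series (1-D, with an index)
print(type(df['Name']))

# Double brackets -> DataFrame with a single column (2-D)
print(type(df[['Name']]))
```

This matters when a later operation expects one shape or the other.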
Row Selection
# Selecting rows by position using iloc
print("First row:")
print(df.iloc[0])
print("\nMultiple rows:")
print(df.iloc[1:3])
# Selecting rows based on conditions
print("\nPeople older than 30:")
print(df[df['Age'] > 30])
Output:
First row:
Name John
Age 28
City New York
Salary 50000
Name: 0, dtype: object
Multiple rows:
Name Age City Salary
1 Anna 34 Paris 60000
2 Peter 29 Berlin 55000
People older than 30:
Name Age City Salary
1 Anna 34 Paris 60000
3 Linda 42 London 75000
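The difference between iloc (position-based) and loc (label-based) becomes visible once the index is no longer 0, 1, 2, ... — a small sketch with an arbitrary integer index:

```python
import pandas as pd

# An index of 10, 20, 30 makes labels and positions diverge
df = pd.DataFrame({'Age': [28, 34, 29]}, index=[10, 20, 30])

print(df.loc[20])    # row with LABEL 20 -> Age 34
print(df.iloc[0])    # row at POSITION 0 -> Age 28

# loc slices include both endpoints; iloc slices exclude the stop
print(df.loc[10:20])   # two rows (labels 10 and 20)
print(df.iloc[0:2])    # also two rows (positions 0 and 1)
```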
Cell Selection
# Get a specific cell value
print("Anna's city:")
print(df.loc[1, 'City'])
# Get a range of cells
print("\nNames and ages subset:")
print(df.loc[0:2, ['Name', 'Age']])
Output:
Anna's city:
Paris
Names and ages subset:
Name Age
0 John 28
1 Anna 34
2 Peter 29
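When you need exactly one cell, the scalar accessors .at (by label) and .iat (by position) are faster alternatives to loc/iloc — a short sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'City': ['New York', 'Paris']})

# .at uses row label and column name
print(df.at[1, 'City'])   # Paris

# .iat uses integer positions for both axes
print(df.iat[1, 1])       # Paris
```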
Modifying DataFrames
DataFrames are mutable, which means we can change their content:
Adding New Columns
# Add a new column
df['Experience'] = [3, 8, 5, 15]
print("DataFrame with Experience column:")
print(df)
# Add a calculated column
df['Experience_Years_Per_Age'] = df['Experience'] / df['Age']
print("\nDataFrame with ratio column:")
print(df)
Output:
DataFrame with Experience column:
Name Age City Salary Experience
0 John 28 New York 50000 3
1 Anna 34 Paris 60000 8
2 Peter 29 Berlin 55000 5
3 Linda 42 London 75000 15
DataFrame with ratio column:
Name Age City Salary Experience Experience_Years_Per_Age
0 John 28 New York 50000 3 0.107143
1 Anna 34 Paris 60000 8 0.235294
2 Peter 29 Berlin 55000 5 0.172414
3 Linda 42 London 75000 15 0.357143
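As an alternative to assigning a new column in place, assign() returns a new DataFrame with the extra column, which keeps the original untouched and works well in method chains. A minimal sketch (the column name Ratio is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Age': [28, 34], 'Experience': [3, 8]})

# assign() does not mutate df; it returns a new DataFrame
df2 = df.assign(Ratio=lambda d: d['Experience'] / d['Age'])
print(df2)
```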
Modifying Values
# Update a specific value
df.loc[0, 'Salary'] = 55000
print("Updated John's salary:")
print(df)
# Update based on condition
df.loc[df['Age'] > 30, 'Salary'] = df['Salary'] * 1.1
print("\nGave 10% raise to people over 30:")
print(df)
Output:
Updated John's salary:
Name Age City Salary Experience Experience_Years_Per_Age
0 John 28 New York 55000 3 0.107143
1 Anna 34 Paris 60000 8 0.235294
2 Peter 29 Berlin 55000 5 0.172414
3 Linda 42 London 75000 15 0.357143
Gave 10% raise to people over 30:
Name Age City Salary Experience Experience_Years_Per_Age
0 John 28 New York 55000 3 0.107143
1 Anna 34 Paris 66000 8 0.235294
2 Peter 29 Berlin 55000 5 0.172414
3 Linda 42 London 82500 15 0.357143
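The conditional update above can also be expressed with numpy.where, which picks between two values element-wise; some people find this reads more clearly when both branches assign something. A sketch with the same 10% raise:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [28, 34, 42], 'Salary': [55000.0, 60000.0, 75000.0]})

# np.where(condition, value_if_true, value_if_false), applied element-wise
df['Salary'] = np.where(df['Age'] > 30, df['Salary'] * 1.1, df['Salary'])
print(df)
```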
Adding and Removing Rows
# Add a new row
new_person = {'Name': 'Michael', 'Age': 31, 'City': 'Chicago',
'Salary': 58000, 'Experience': 7, 'Experience_Years_Per_Age': 7/31}
# Using loc to add a row at index 4
df.loc[4] = new_person
print("DataFrame with new person:")
print(df)
# Remove a row
df = df.drop(2) # Drops Peter
print("\nDataFrame after removing Peter:")
print(df)
Output:
DataFrame with new person:
Name Age City Salary Experience Experience_Years_Per_Age
0 John 28 New York 55000 3 0.107143
1 Anna 34 Paris 66000 8 0.235294
2 Peter 29 Berlin 55000 5 0.172414
3 Linda 42 London 82500 15 0.357143
4 Michael 31 Chicago 58000 7 0.225806
DataFrame after removing Peter:
Name Age City Salary Experience Experience_Years_Per_Age
0 John 28 New York 55000 3 0.107143
1 Anna 34 Paris 66000 8 0.235294
3 Linda 42 London 82500 15 0.357143
4 Michael 31 Chicago 58000 7 0.225806
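Assigning via df.loc[4] works for a single new row, but the general way to append rows is pd.concat (the older DataFrame.append method was removed in pandas 2.0). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 34]})
new_rows = pd.DataFrame({'Name': ['Michael'], 'Age': [31]})

# Concatenate along the row axis; ignore_index renumbers the result 0..n-1
df = pd.concat([df, new_rows], ignore_index=True)
print(df)
```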
Handling Missing Data
Real-world datasets often contain missing values. Pandas provides several methods to handle them:
# Create a DataFrame with missing values
data_missing = {
'Name': ['John', 'Anna', None, 'Linda', 'Michael'],
'Age': [28, None, 29, 42, 31],
'City': ['New York', 'Paris', 'Berlin', None, 'Chicago'],
'Salary': [None, 60000, 55000, 75000, 58000]
}
df_missing = pd.DataFrame(data_missing)
print("DataFrame with missing values:")
print(df_missing)
# Check for missing values
print("\nMissing values per column:")
print(df_missing.isna().sum())
# Fill missing values with a specific value
df_filled = df_missing.fillna({
'Name': 'Unknown',
'Age': df_missing['Age'].mean(),
'City': 'Unknown',
'Salary': df_missing['Salary'].median()
})
print("\nDataFrame with filled missing values:")
print(df_filled)
# Drop rows with missing values
df_dropped = df_missing.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)
Output:
DataFrame with missing values:
Name Age City Salary
0 John 28.0 New York NaN
1 Anna NaN Paris 60000.0
2 None 29.0 Berlin 55000.0
3 Linda 42.0 None 75000.0
4 Michael 31.0 Chicago 58000.0
Missing values per column:
Name 1
Age 1
City 1
Salary 1
dtype: int64
DataFrame with filled missing values:
Name Age City Salary
0 John 28.000000 New York 59000.0
1 Anna 32.500000 Paris 60000.0
2 Unknown 29.000000 Berlin 55000.0
3 Linda 42.000000 Unknown 75000.0
4 Michael 31.000000 Chicago 58000.0
DataFrame after dropping rows with missing values:
Name Age City Salary
4 Michael 31.0 Chicago 58000.0
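Beyond fillna and dropna, two other common strategies are forward-filling (copy the last valid value downward, useful for time series) and linear interpolation. A small sketch on a standalone Series:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

# Forward-fill propagates the last valid observation forward
print(s.ffill().tolist())        # [1.0, 1.0, 1.0, 4.0]

# Interpolation fills gaps linearly between valid values
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0]
```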
Real-World Application: Data Analysis
Let's put our DataFrame knowledge to work on a simple data analysis task using a sample employee dataset:
# Create a more realistic employee dataset
data = {
'Employee': ['John Smith', 'Anna Johnson', 'Peter Williams', 'Linda Brown',
'Michael Davis', 'Sarah Wilson', 'Robert Taylor', 'Susan Anderson'],
'Department': ['IT', 'HR', 'Finance', 'Marketing', 'IT', 'Finance', 'Marketing', 'HR'],
'Salary': [65000, 60000, 72000, 68000, 78000, 65000, 71000, 62000],
'Years_Employed': [3, 5, 8, 4, 7, 5, 6, 2],
'Performance': ['Good', 'Excellent', 'Good', 'Average', 'Excellent', 'Good', 'Average', 'Good']
}
employees = pd.DataFrame(data)
print("Employee Dataset:")
print(employees)
# Analyze average salary by department
dept_salaries = employees.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print("\nAverage Salary by Department:")
print(dept_salaries)
# Analyze performance distribution
perf_count = employees['Performance'].value_counts()
print("\nPerformance Distribution:")
print(perf_count)
# Find high performers with good salaries
high_performers = employees[(employees['Performance'] == 'Excellent') &
(employees['Salary'] > 65000)]
print("\nExcellent performers with salary > 65000:")
print(high_performers)
# Calculate correlation between years employed and salary
correlation = employees['Years_Employed'].corr(employees['Salary'])
print(f"\nCorrelation between years employed and salary: {correlation:.2f}")
# Summary statistics by department
dept_stats = employees.groupby('Department').agg({
'Salary': ['mean', 'min', 'max'],
'Years_Employed': ['mean', 'min', 'max']
})
print("\nDepartment Statistics:")
print(dept_stats)
Output:
Employee Dataset:
Employee Department Salary Years_Employed Performance
0 John Smith IT 65000 3 Good
1 Anna Johnson HR 60000 5 Excellent
2 Peter Williams Finance 72000 8 Good
3 Linda Brown Marketing 68000 4 Average
4 Michael Davis IT 78000 7 Excellent
5 Sarah Wilson Finance 65000 5 Good
6 Robert Taylor Marketing 71000 6 Average
7 Susan Anderson HR 62000 2 Good
Average Salary by Department:
Department
IT 71500.0
Marketing 69500.0
Finance 68500.0
HR 61000.0
Name: Salary, dtype: float64
Performance Distribution:
Good 4
Average 2
Excellent 2
Name: Performance, dtype: int64
Excellent performers with salary > 65000:
Employee Department Salary Years_Employed Performance
4 Michael Davis IT 78000 7 Excellent
Correlation between years employed and salary: 0.72
Department Statistics:
Salary Years_Employed
mean min max mean min max
Department
Finance 68500.0 65000 72000 6.500 5 8
HR 61000.0 60000 62000 3.500 2 5
IT 71500.0 65000 78000 5.000 3 7
Marketing 69500.0 68000 71000 5.000 4 6
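The agg() call with a dict of lists produces the MultiIndex column headers seen above. If you prefer flat, readable column names, named aggregation is an alternative — a sketch with a cut-down version of the dataset (the names avg_salary and max_salary are our own choices):

```python
import pandas as pd

employees = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [65000, 60000, 78000, 62000],
})

# Named aggregation: new_column=(source_column, aggregation_function)
stats = employees.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
)
print(stats)
```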
Summary
In this tutorial, we've covered the essential aspects of Pandas DataFrames:
- Creating DataFrames from dictionaries, lists, and external files
- Basic properties and methods to understand DataFrame structure
- Accessing data through columns, rows, and specific cells
- Modifying DataFrames by adding/removing columns and rows
- Handling missing values with various strategies
- Practical data analysis on a real-world dataset
DataFrames are incredibly powerful and form the foundation of data analysis in Python. They provide an intuitive, spreadsheet-like structure with powerful data manipulation capabilities built-in.
Exercises
Test your understanding with these exercises:
- Create a DataFrame of your own with at least 5 columns and 10 rows about a subject of your interest (e.g., movies, cars, athletes).
- Calculate summary statistics for numeric columns.
- Filter the DataFrame to show only rows that meet at least two conditions.
- Add a new calculated column based on values in other columns.
- Group the data by one column and find aggregates (mean, max, min) of another column.
- Handle missing data in your DataFrame using at least two different methods.
- Visualize one aspect of your DataFrame using Pandas' built-in plotting capabilities.