
Pandas DataFrame

Introduction

A DataFrame is the primary data structure in Pandas. You can think of it as a spreadsheet or SQL table represented in Python code: a 2-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

DataFrames are particularly powerful because they allow you to:

  • Store data of different types in different columns
  • Perform operations on rows, columns, or specific cells
  • Handle missing data effectively
  • Combine and merge datasets
  • Apply statistical functions and aggregations

In this tutorial, we'll explore Pandas DataFrames in depth, from creation to manipulation and analysis.

Creating DataFrames

Let's start by importing the pandas library:

python
import pandas as pd
import numpy as np

From a Dictionary

One of the most common ways to create a DataFrame is from a dictionary:

python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   42    London

From Lists

You can also create a DataFrame from a list of lists:

python
# Creating a DataFrame from lists
data_list = [
    ['John', 28, 'New York'],
    ['Anna', 34, 'Paris'],
    ['Peter', 29, 'Berlin'],
    ['Linda', 42, 'London']
]

df = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   42    London

From CSV Files

DataFrames can be created by reading data from files:

python
# Reading a CSV file into a DataFrame
# df = pd.read_csv('data.csv')

# Example if you had a file:
# df = pd.read_csv('people.csv')
# print(df.head())
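
To make this runnable without an existing file, we can first write a small CSV with Python's standard library and then read it back (the file name `people.csv` is just a placeholder):

```python
import csv

import pandas as pd

# Write a tiny example file so read_csv has something to load
rows = [
    ['Name', 'Age', 'City'],
    ['John', 28, 'New York'],
    ['Anna', 34, 'Paris'],
]
with open('people.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# read_csv uses the first row as column headers by default
df = pd.read_csv('people.csv')
print(df.head())
print(df.shape)  # (2, 3): two data rows, three columns
```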

Basic DataFrame Properties and Methods

Let's explore some basic properties and methods to understand our DataFrame better:

python
# Create a sample DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [50000, 60000, 55000, 75000]
}

df = pd.DataFrame(data)

# Examining the DataFrame
print("First few rows:")
print(df.head())

print("\nDataFrame shape (rows, columns):")
print(df.shape)

print("\nDataFrame info:")
print(df.info())

print("\nDataFrame columns:")
print(df.columns)

print("\nDataFrame data types:")
print(df.dtypes)

print("\nDataFrame summary statistics:")
print(df.describe())

Output:

First few rows:
    Name  Age      City  Salary
0   John   28  New York   50000
1   Anna   34     Paris   60000
2  Peter   29    Berlin   55000
3  Linda   42    London   75000

DataFrame shape (rows, columns):
(4, 4)

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    4 non-null      object
 1   Age     4 non-null      int64
 2   City    4 non-null      object
 3   Salary  4 non-null      int64
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes
None

DataFrame columns:
Index(['Name', 'Age', 'City', 'Salary'], dtype='object')

DataFrame data types:
Name      object
Age        int64
City      object
Salary     int64
dtype: object

DataFrame summary statistics:
             Age        Salary
count   4.000000      4.000000
mean   33.250000  60000.000000
std     6.396614  10801.234497
min    28.000000  50000.000000
25%    28.750000  53750.000000
50%    31.500000  57500.000000
75%    36.000000  63750.000000
max    42.000000  75000.000000

Accessing DataFrame Elements

Pandas offers multiple ways to access data in a DataFrame:

Column Selection

python
# Selecting a column
print("Names:")
print(df['Name'])

# Selecting multiple columns
print("\nNames and Ages:")
print(df[['Name', 'Age']])

Output:

Names:
0     John
1     Anna
2    Peter
3    Linda
Name: Name, dtype: object

Names and Ages:
    Name  Age
0   John   28
1   Anna   34
2  Peter   29
3  Linda   42
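
Note the difference in return type: single brackets give a one-dimensional Series, while a list of column names (double brackets) gives a DataFrame, even for a single column:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 34]})

# df['Name'] is a Series; df[['Name']] is a one-column DataFrame
print(type(df['Name']))    # <class 'pandas.core.series.Series'>
print(type(df[['Name']]))  # <class 'pandas.core.frame.DataFrame'>
```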

Row Selection

python
# Selecting rows by position using iloc
print("First row:")
print(df.iloc[0])

print("\nMultiple rows:")
print(df.iloc[1:3])

# Selecting rows based on conditions
print("\nPeople older than 30:")
print(df[df['Age'] > 30])

Output:

First row:
Name          John
Age             28
City      New York
Salary       50000
Name: 0, dtype: object

Multiple rows:
    Name  Age    City  Salary
1   Anna   34   Paris   60000
2  Peter   29  Berlin   55000

People older than 30:
    Name  Age    City  Salary
1   Anna   34   Paris   60000
3  Linda   42  London   75000

Cell Selection

python
# Get a specific cell value
print("Anna's city:")
print(df.loc[1, 'City'])

# Get a range of cells
print("\nNames and ages subset:")
print(df.loc[0:2, ['Name', 'Age']])

Output:

Anna's city:
Paris

Names and ages subset:
    Name  Age
0   John   28
1   Anna   34
2  Peter   29
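
One subtlety worth highlighting: `loc` slices by label and includes the end label (which is why `df.loc[0:2]` returns rows 0 through 2 above), while `iloc` slices by position and excludes the end, like regular Python slicing:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 34, 29, 42]})

print(len(df.loc[0:2]))   # 3 rows -- label slice, end label included
print(len(df.iloc[0:2]))  # 2 rows -- position slice, end excluded
```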

Modifying DataFrames

DataFrames are mutable, which means we can change their content:

Adding New Columns

python
# Add a new column
df['Experience'] = [3, 8, 5, 15]
print("DataFrame with Experience column:")
print(df)

# Add a calculated column
df['Experience_Years_Per_Age'] = df['Experience'] / df['Age']
print("\nDataFrame with ratio column:")
print(df)

Output:

DataFrame with Experience column:
    Name  Age      City  Salary  Experience
0   John   28  New York   50000           3
1   Anna   34     Paris   60000           8
2  Peter   29    Berlin   55000           5
3  Linda   42    London   75000          15

DataFrame with ratio column:
    Name  Age      City  Salary  Experience  Experience_Years_Per_Age
0   John   28  New York   50000           3                  0.107143
1   Anna   34     Paris   60000           8                  0.235294
2  Peter   29    Berlin   55000           5                  0.172414
3  Linda   42    London   75000          15                  0.357143

Modifying Values

python
# Update a specific value
df.loc[0, 'Salary'] = 55000
print("Updated John's salary:")
print(df)

# Update based on condition
df.loc[df['Age'] > 30, 'Salary'] = df['Salary'] * 1.1
print("\nGave 10% raise to people over 30:")
print(df)

Output:

Updated John's salary:
    Name  Age      City  Salary  Experience  Experience_Years_Per_Age
0   John   28  New York   55000           3                  0.107143
1   Anna   34     Paris   60000           8                  0.235294
2  Peter   29    Berlin   55000           5                  0.172414
3  Linda   42    London   75000          15                  0.357143

Gave 10% raise to people over 30:
    Name  Age      City  Salary  Experience  Experience_Years_Per_Age
0   John   28  New York   55000           3                  0.107143
1   Anna   34     Paris   66000           8                  0.235294
2  Peter   29    Berlin   55000           5                  0.172414
3  Linda   42    London   82500          15                  0.357143

Adding and Removing Rows

python
# Add a new row
new_person = {'Name': 'Michael', 'Age': 31, 'City': 'Chicago',
              'Salary': 58000, 'Experience': 7, 'Experience_Years_Per_Age': 7/31}

# Using loc to add a row at index 4
df.loc[4] = new_person
print("DataFrame with new person:")
print(df)

# Remove a row
df = df.drop(2) # Drops Peter
print("\nDataFrame after removing Peter:")
print(df)

Output:

DataFrame with new person:
      Name  Age      City  Salary  Experience  Experience_Years_Per_Age
0     John   28  New York   55000           3                  0.107143
1     Anna   34     Paris   66000           8                  0.235294
2    Peter   29    Berlin   55000           5                  0.172414
3    Linda   42    London   82500          15                  0.357143
4  Michael   31   Chicago   58000           7                  0.225806

DataFrame after removing Peter:
      Name  Age      City  Salary  Experience  Experience_Years_Per_Age
0     John   28  New York   55000           3                  0.107143
1     Anna   34     Paris   66000           8                  0.235294
3    Linda   42    London   82500          15                  0.357143
4  Michael   31   Chicago   58000           7                  0.225806
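
Assigning to `df.loc[4]` works when you know a free index label, but the more general way to append rows is `pd.concat` (the older `DataFrame.append` method was deprecated and removed in pandas 2.0). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 34]})
new_row = pd.DataFrame({'Name': ['Michael'], 'Age': [31]})

# ignore_index=True renumbers the combined rows 0, 1, 2, ...
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```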

Handling Missing Data

Real-world datasets often contain missing values. Pandas provides several methods to handle them:

python
# Create a DataFrame with missing values
data_missing = {
    'Name': ['John', 'Anna', None, 'Linda', 'Michael'],
    'Age': [28, None, 29, 42, 31],
    'City': ['New York', 'Paris', 'Berlin', None, 'Chicago'],
    'Salary': [None, 60000, 55000, 75000, 58000]
}

df_missing = pd.DataFrame(data_missing)
print("DataFrame with missing values:")
print(df_missing)

# Check for missing values
print("\nMissing values per column:")
print(df_missing.isna().sum())

# Fill missing values with a specific value
df_filled = df_missing.fillna({
    'Name': 'Unknown',
    'Age': df_missing['Age'].mean(),
    'City': 'Unknown',
    'Salary': df_missing['Salary'].median()
})
print("\nDataFrame with filled missing values:")
print(df_filled)

# Drop rows with missing values
df_dropped = df_missing.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)

Output:

DataFrame with missing values:
      Name   Age      City   Salary
0     John  28.0  New York      NaN
1     Anna   NaN     Paris  60000.0
2     None  29.0    Berlin  55000.0
3    Linda  42.0      None  75000.0
4  Michael  31.0   Chicago  58000.0

Missing values per column:
Name      1
Age       1
City      1
Salary    1
dtype: int64

DataFrame with filled missing values:
      Name   Age      City   Salary
0     John  28.0  New York  59000.0
1     Anna  32.5     Paris  60000.0
2  Unknown  29.0    Berlin  55000.0
3    Linda  42.0   Unknown  75000.0
4  Michael  31.0   Chicago  58000.0

DataFrame after dropping rows with missing values:
      Name   Age      City   Salary
4  Michael  31.0   Chicago  58000.0
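
Besides filling with constants or statistics, you can also propagate neighbouring values: `ffill` carries the last valid observation forward, and `bfill` works backward. A small sketch:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0]
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0]
```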

Real-World Application: Data Analysis

Let's put our DataFrame knowledge to work on a simple data analysis task using a sample employee dataset:

python
# Create a more realistic employee dataset
data = {
    'Employee': ['John Smith', 'Anna Johnson', 'Peter Williams', 'Linda Brown',
                 'Michael Davis', 'Sarah Wilson', 'Robert Taylor', 'Susan Anderson'],
    'Department': ['IT', 'HR', 'Finance', 'Marketing', 'IT', 'Finance', 'Marketing', 'HR'],
    'Salary': [65000, 60000, 72000, 68000, 78000, 65000, 71000, 62000],
    'Years_Employed': [3, 5, 8, 4, 7, 5, 6, 2],
    'Performance': ['Good', 'Excellent', 'Good', 'Average', 'Excellent', 'Good', 'Average', 'Good']
}

employees = pd.DataFrame(data)
print("Employee Dataset:")
print(employees)

# Analyze average salary by department
dept_salaries = employees.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print("\nAverage Salary by Department:")
print(dept_salaries)

# Analyze performance distribution
perf_count = employees['Performance'].value_counts()
print("\nPerformance Distribution:")
print(perf_count)

# Find high performers with good salaries
high_performers = employees[(employees['Performance'] == 'Excellent') &
                            (employees['Salary'] > 65000)]
print("\nExcellent performers with salary > 65000:")
print(high_performers)

# Calculate correlation between years employed and salary
correlation = employees['Years_Employed'].corr(employees['Salary'])
print(f"\nCorrelation between years employed and salary: {correlation:.2f}")

# Summary statistics by department
dept_stats = employees.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Years_Employed': ['mean', 'min', 'max']
})
print("\nDepartment Statistics:")
print(dept_stats)

Output:

Employee Dataset:
         Employee Department  Salary  Years_Employed Performance
0      John Smith         IT   65000               3        Good
1    Anna Johnson         HR   60000               5   Excellent
2  Peter Williams    Finance   72000               8        Good
3     Linda Brown  Marketing   68000               4     Average
4   Michael Davis         IT   78000               7   Excellent
5    Sarah Wilson    Finance   65000               5        Good
6   Robert Taylor  Marketing   71000               6     Average
7  Susan Anderson         HR   62000               2        Good

Average Salary by Department:
Department
IT           71500.0
Marketing    69500.0
Finance      68500.0
HR           61000.0
Name: Salary, dtype: float64

Performance Distribution:
Good         4
Average      2
Excellent    2
Name: Performance, dtype: int64

Excellent performers with salary > 65000:
        Employee Department  Salary  Years_Employed Performance
4  Michael Davis         IT   78000               7   Excellent

Correlation between years employed and salary: 0.72

Department Statistics:
             Salary                Years_Employed
               mean    min    max           mean min max
Department
Finance     68500.0  65000  72000            6.5   5   8
HR          61000.0  60000  62000            3.5   2   5
IT          71500.0  65000  78000            5.0   3   7
Marketing   69500.0  68000  71000            5.0   4   6

Summary

In this tutorial, we've covered the essential aspects of Pandas DataFrames:

  • Creating DataFrames from dictionaries, lists, and external files
  • Basic properties and methods to understand DataFrame structure
  • Accessing data through columns, rows, and specific cells
  • Modifying DataFrames by adding/removing columns and rows
  • Handling missing values with various strategies
  • Practical data analysis on a real-world dataset

DataFrames are incredibly powerful and form the foundation of data analysis in Python. They provide an intuitive, spreadsheet-like structure with powerful data manipulation capabilities built-in.

Additional Resources

To continue learning about Pandas DataFrames:

  1. Official Pandas Documentation
  2. Pandas User Guide
  3. 10 Minutes to Pandas Tutorial

Exercises

Test your understanding with these exercises:

  1. Create a DataFrame of your own with at least 5 columns and 10 rows about a subject of your interest (e.g., movies, cars, athletes).
  2. Calculate summary statistics for numeric columns.
  3. Filter the DataFrame to show only rows that meet at least two conditions.
  4. Add a new calculated column based on values in other columns.
  5. Group the data by one column and find aggregates (mean, max, min) of another column.
  6. Handle missing data in your DataFrame using at least two different methods.
  7. Visualize one aspect of your DataFrame using Pandas' built-in plotting capabilities.
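
As a starting point for exercise 7, a minimal sketch of Pandas' built-in plotting (this assumes matplotlib is installed; the department averages come from the analysis above):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Department': ['IT', 'HR', 'Finance'],
                   'Salary': [71500, 61000, 68500]})

# DataFrame.plot wraps matplotlib; kind='bar' draws one bar per row
df.plot(x='Department', y='Salary', kind='bar', legend=False)
plt.ylabel('Average salary')
plt.tight_layout()
plt.savefig('salaries.png')  # use plt.show() instead in an interactive session
```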

