
Pandas DataFrame

Introduction

A DataFrame is the primary data structure in Pandas. You can think of it as a spreadsheet or SQL table represented in Python code: a 2-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

DataFrames are particularly powerful because they allow you to:

  • Store data of different types in different columns
  • Perform operations on rows, columns, or specific cells
  • Handle missing data effectively
  • Combine and merge datasets
  • Apply statistical functions and aggregations

In this tutorial, we'll explore Pandas DataFrames in depth, from creation to manipulation and analysis.

Creating DataFrames

Let's start by importing the pandas library:

python
import pandas as pd
import numpy as np

From a Dictionary

One of the most common ways to create a DataFrame is from a dictionary:

python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   42    London

From Lists

You can also create a DataFrame from a list of lists:

python
# Creating a DataFrame from lists
data_list = [
    ['John', 28, 'New York'],
    ['Anna', 34, 'Paris'],
    ['Peter', 29, 'Berlin'],
    ['Linda', 42, 'London']
]

df = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   42    London

From CSV Files

DataFrames can be created by reading data from files:

python
# Reading a CSV file into a DataFrame
# df = pd.read_csv('data.csv')

# Example if you had a file:
# df = pd.read_csv('people.csv')
# print(df.head())
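
To make this runnable without an existing file, we can first write a small CSV with Python's standard library and then read it back (the file name `people.csv` is just a placeholder):

```python
import csv

import pandas as pd

# Write a tiny example file so read_csv has something to load
rows = [
    ['Name', 'Age', 'City'],
    ['John', 28, 'New York'],
    ['Anna', 34, 'Paris'],
]
with open('people.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# read_csv uses the first row as column headers by default
df = pd.read_csv('people.csv')
print(df.head())
print(df.shape)  # (2, 3): two data rows, three columns
```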

Basic DataFrame Properties and Methods

Let's explore some basic properties and methods to understand our DataFrame better:

python
# Create a sample DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [50000, 60000, 55000, 75000]
}

df = pd.DataFrame(data)

# Examining the DataFrame
print("First few rows:")
print(df.head())

print("\nDataFrame shape (rows, columns):")
print(df.shape)

print("\nDataFrame info:")
print(df.info())

print("\nDataFrame columns:")
print(df.columns)

print("\nDataFrame data types:")
print(df.dtypes)

print("\nDataFrame summary statistics:")
print(df.describe())

Output:

First few rows:
    Name  Age      City  Salary
0   John   28  New York   50000
1   Anna   34     Paris   60000
2  Peter   29    Berlin   55000
3  Linda   42    London   75000

DataFrame shape (rows, columns):
(4, 4)

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    4 non-null      object
 1   Age     4 non-null      int64
 2   City    4 non-null      object
 3   Salary  4 non-null      int64
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes
None

DataFrame columns:
Index(['Name', 'Age', 'City', 'Salary'], dtype='object')

DataFrame data types:
Name      object
Age        int64
City      object
Salary     int64
dtype: object

DataFrame summary statistics:
             Age        Salary
count   4.000000      4.000000
mean   33.250000  60000.000000
std     6.396614  10801.234497
min    28.000000  50000.000000
25%    28.750000  53750.000000
50%    31.500000  57500.000000
75%    36.000000  63750.000000
max    42.000000  75000.000000

Accessing DataFrame Elements

Pandas offers multiple ways to access data in a DataFrame:

Column Selection

python
# Selecting a column
print("Names:")
print(df['Name'])

# Selecting multiple columns
print("\nNames and Ages:")
print(df[['Name', 'Age']])

Output:

Names:
0     John
1     Anna
2    Peter
3    Linda
Name: Name, dtype: object

Names and Ages:
    Name  Age
0   John   28
1   Anna   34
2  Peter   29
3  Linda   42
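
Note the difference in return type: single brackets give a one-dimensional Series, while a list of column names (double brackets) gives a DataFrame, even for a single column:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 34]})

# df['Name'] is a Series; df[['Name']] is a one-column DataFrame
print(type(df['Name']))    # <class 'pandas.core.series.Series'>
print(type(df[['Name']]))  # <class 'pandas.core.frame.DataFrame'>
```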

Row Selection

python
# Selecting rows by position using iloc
print("First row:")
print(df.iloc[0])

print("\nMultiple rows:")
print(df.iloc[1:3])

# Selecting rows based on conditions
print("\nPeople older than 30:")
print(df[df['Age'] > 30])

Output:

First row:
Name          John
Age             28
City      New York
Salary       50000
Name: 0, dtype: object

Multiple rows:
    Name  Age    City  Salary
1   Anna   34   Paris   60000
2  Peter   29  Berlin   55000

People older than 30:
    Name  Age    City  Salary
1   Anna   34   Paris   60000
3  Linda   42  London   75000

Cell Selection

python
# Get a specific cell value
print("Anna's city:")
print(df.loc[1, 'City'])

# Get a range of cells
print("\nNames and ages subset:")
print(df.loc[0:2, ['Name', 'Age']])

Output:

Anna's city:
Paris

Names and ages subset:
    Name  Age
0   John   28
1   Anna   34
2  Peter   29
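
One subtlety worth highlighting: `loc` slices by label and includes the end label (which is why `df.loc[0:2]` returns rows 0 through 2 above), while `iloc` slices by position and excludes the end, like regular Python slicing:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 34, 29, 42]})

print(len(df.loc[0:2]))   # 3 rows -- label slice, end label included
print(len(df.iloc[0:2]))  # 2 rows -- position slice, end excluded
```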

Modifying DataFrames

DataFrames are mutable, which means we can change their content:

Adding New Columns

python
# Add a new column
df['Experience'] = [3, 8, 5, 15]
print("DataFrame with Experience column:")
print(df)

# Add a calculated column
df['Experience_Years_Per_Age'] = df['Experience'] / df['Age']
print("\nDataFrame with ratio column:")
print(df)

Output:

DataFrame with Experience column:
    Name  Age      City  Salary  Experience
0   John   28  New York   50000           3
1   Anna   34     Paris   60000           8
2  Peter   29    Berlin   55000           5
3  Linda   42    London   75000          15

DataFrame with ratio column:
    Name  Age      City  Salary  Experience  Experience_Years_Per_Age
0   John   28  New York   50000           3                  0.107143
1   Anna   34     Paris   60000           8                  0.235294
2  Peter   29    Berlin   55000           5                  0.172414
3  Linda   42    London   75000          15                  0.357143

Modifying Values

python
# Update a specific value
df.loc[0, 'Salary'] = 55000
print("Updated John's salary:")
print(df)

# Update based on condition
df.loc[df['Age'] > 30, 'Salary'] = df['Salary'] * 1.1
print("\nGave 10% raise to people over 30:")
print(df)

Output:

Updated John's salary:
    Name  Age      City  Salary  Experience  Experience_Years_Per_Age
0   John   28  New York   55000           3                  0.107143
1   Anna   34     Paris   60000           8                  0.235294
2  Peter   29    Berlin   55000           5                  0.172414
3  Linda   42    London   75000          15                  0.357143

Gave 10% raise to people over 30:
    Name  Age      City  Salary  Experience  Experience_Years_Per_Age
0   John   28  New York   55000           3                  0.107143
1   Anna   34     Paris   66000           8                  0.235294
2  Peter   29    Berlin   55000           5                  0.172414
3  Linda   42    London   82500          15                  0.357143

Adding and Removing Rows

python
# Add a new row
new_person = {'Name': 'Michael', 'Age': 31, 'City': 'Chicago',
              'Salary': 58000, 'Experience': 7, 'Experience_Years_Per_Age': 7/31}

# Using loc to add a row at index 4
df.loc[4] = new_person
print("DataFrame with new person:")
print(df)

# Remove a row
df = df.drop(2) # Drops Peter
print("\nDataFrame after removing Peter:")
print(df)

Output:

DataFrame with new person:
      Name  Age      City  Salary  Experience  Experience_Years_Per_Age
0     John   28  New York   55000           3                  0.107143
1     Anna   34     Paris   66000           8                  0.235294
2    Peter   29    Berlin   55000           5                  0.172414
3    Linda   42    London   82500          15                  0.357143
4  Michael   31   Chicago   58000           7                  0.225806

DataFrame after removing Peter:
      Name  Age      City  Salary  Experience  Experience_Years_Per_Age
0     John   28  New York   55000           3                  0.107143
1     Anna   34     Paris   66000           8                  0.235294
3    Linda   42    London   82500          15                  0.357143
4  Michael   31   Chicago   58000           7                  0.225806
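
Assigning to `df.loc[4]` works when you know a free index label, but the more general way to append rows is `pd.concat` (the older `DataFrame.append` method was deprecated and removed in pandas 2.0). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 34]})
new_row = pd.DataFrame({'Name': ['Michael'], 'Age': [31]})

# ignore_index=True renumbers the combined rows 0, 1, 2, ...
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```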

Handling Missing Data

Real-world datasets often contain missing values. Pandas provides several methods to handle them:

python
# Create a DataFrame with missing values
data_missing = {
    'Name': ['John', 'Anna', None, 'Linda', 'Michael'],
    'Age': [28, None, 29, 42, 31],
    'City': ['New York', 'Paris', 'Berlin', None, 'Chicago'],
    'Salary': [None, 60000, 55000, 75000, 58000]
}

df_missing = pd.DataFrame(data_missing)
print("DataFrame with missing values:")
print(df_missing)

# Check for missing values
print("\nMissing values per column:")
print(df_missing.isna().sum())

# Fill missing values with a specific value
df_filled = df_missing.fillna({
    'Name': 'Unknown',
    'Age': df_missing['Age'].mean(),
    'City': 'Unknown',
    'Salary': df_missing['Salary'].median()
})
print("\nDataFrame with filled missing values:")
print(df_filled)

# Drop rows with missing values
df_dropped = df_missing.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)

Output:

DataFrame with missing values:
      Name   Age      City   Salary
0     John  28.0  New York      NaN
1     Anna   NaN     Paris  60000.0
2     None  29.0    Berlin  55000.0
3    Linda  42.0      None  75000.0
4  Michael  31.0   Chicago  58000.0

Missing values per column:
Name      1
Age       1
City      1
Salary    1
dtype: int64

DataFrame with filled missing values:
      Name   Age      City   Salary
0     John  28.0  New York  59000.0
1     Anna  32.5     Paris  60000.0
2  Unknown  29.0    Berlin  55000.0
3    Linda  42.0   Unknown  75000.0
4  Michael  31.0   Chicago  58000.0

DataFrame after dropping rows with missing values:
      Name   Age      City   Salary
4  Michael  31.0   Chicago  58000.0
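
Besides filling with constants or statistics, you can also propagate neighbouring values: `ffill` carries the last valid observation forward, and `bfill` works backward. A small sketch:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0]
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0]
```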

Real-World Application: Data Analysis

Let's put our DataFrame knowledge to work on a simple data analysis task using a sample employee dataset:

python
# Create a more realistic employee dataset
data = {
    'Employee': ['John Smith', 'Anna Johnson', 'Peter Williams', 'Linda Brown',
                 'Michael Davis', 'Sarah Wilson', 'Robert Taylor', 'Susan Anderson'],
    'Department': ['IT', 'HR', 'Finance', 'Marketing', 'IT', 'Finance', 'Marketing', 'HR'],
    'Salary': [65000, 60000, 72000, 68000, 78000, 65000, 71000, 62000],
    'Years_Employed': [3, 5, 8, 4, 7, 5, 6, 2],
    'Performance': ['Good', 'Excellent', 'Good', 'Average', 'Excellent', 'Good', 'Average', 'Good']
}

employees = pd.DataFrame(data)
print("Employee Dataset:")
print(employees)

# Analyze average salary by department
dept_salaries = employees.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print("\nAverage Salary by Department:")
print(dept_salaries)

# Analyze performance distribution
perf_count = employees['Performance'].value_counts()
print("\nPerformance Distribution:")
print(perf_count)

# Find high performers with good salaries
high_performers = employees[(employees['Performance'] == 'Excellent') &
                            (employees['Salary'] > 65000)]
print("\nExcellent performers with salary > 65000:")
print(high_performers)

# Calculate correlation between years employed and salary
correlation = employees['Years_Employed'].corr(employees['Salary'])
print(f"\nCorrelation between years employed and salary: {correlation:.2f}")

# Summary statistics by department
dept_stats = employees.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Years_Employed': ['mean', 'min', 'max']
})
print("\nDepartment Statistics:")
print(dept_stats)

Output:

Employee Dataset:
         Employee Department  Salary  Years_Employed Performance
0      John Smith         IT   65000               3        Good
1    Anna Johnson         HR   60000               5   Excellent
2  Peter Williams    Finance   72000               8        Good
3     Linda Brown  Marketing   68000               4     Average
4   Michael Davis         IT   78000               7   Excellent
5    Sarah Wilson    Finance   65000               5        Good
6   Robert Taylor  Marketing   71000               6     Average
7  Susan Anderson         HR   62000               2        Good

Average Salary by Department:
Department
IT           71500.0
Marketing    69500.0
Finance      68500.0
HR           61000.0
Name: Salary, dtype: float64

Performance Distribution:
Good         4
Average      2
Excellent    2
Name: Performance, dtype: int64

Excellent performers with salary > 65000:
        Employee Department  Salary  Years_Employed Performance
4  Michael Davis         IT   78000               7   Excellent

Correlation between years employed and salary: 0.72

Department Statistics:
             Salary                Years_Employed
               mean    min    max           mean min max
Department
Finance     68500.0  65000  72000            6.5   5   8
HR          61000.0  60000  62000            3.5   2   5
IT          71500.0  65000  78000            5.0   3   7
Marketing   69500.0  68000  71000            5.0   4   6

Summary

In this tutorial, we've covered the essential aspects of Pandas DataFrames:

  • Creating DataFrames from dictionaries, lists, and external files
  • Basic properties and methods to understand DataFrame structure
  • Accessing data through columns, rows, and specific cells
  • Modifying DataFrames by adding/removing columns and rows
  • Handling missing values with various strategies
  • Practical data analysis on a real-world dataset

DataFrames are incredibly powerful and form the foundation of data analysis in Python. They provide an intuitive, spreadsheet-like structure with powerful data manipulation capabilities built-in.

Additional Resources

To continue learning about Pandas DataFrames:

  1. Official Pandas Documentation
  2. Pandas User Guide
  3. 10 Minutes to Pandas Tutorial

Exercises

Test your understanding with these exercises:

  1. Create a DataFrame of your own with at least 5 columns and 10 rows about a subject of your interest (e.g., movies, cars, athletes).
  2. Calculate summary statistics for numeric columns.
  3. Filter the DataFrame to show only rows that meet at least two conditions.
  4. Add a new calculated column based on values in other columns.
  5. Group the data by one column and find aggregates (mean, max, min) of another column.
  6. Handle missing data in your DataFrame using at least two different methods.
  7. Visualize one aspect of your DataFrame using Pandas' built-in plotting capabilities.
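
As a starting point for exercise 7, a minimal sketch of Pandas' built-in plotting (this assumes matplotlib is installed; the department averages come from the analysis above):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Department': ['IT', 'HR', 'Finance'],
                   'Salary': [71500, 61000, 68500]})

# DataFrame.plot wraps matplotlib; kind='bar' draws one bar per row
df.plot(x='Department', y='Salary', kind='bar', legend=False)
plt.ylabel('Average salary')
plt.tight_layout()
plt.savefig('salaries.png')  # use plt.show() instead in an interactive session
```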

