Pandas Introduction

What is Pandas?

Pandas is one of the most popular Python libraries for data manipulation and analysis. It provides fast, flexible data structures designed to make working with structured data intuitive. If you're working with tabular data (like spreadsheets or SQL tables), time series, or any other kind of labeled or relational data, Pandas will likely become your go-to tool.

The library was created by Wes McKinney in 2008 while he was working as a quantitative analyst. The name "Pandas" is derived from "panel data," an econometrics term for multidimensional structured data sets.

Why Use Pandas?

Pandas solves many common problems in data analysis:

  • Reading various file formats: CSV, Excel, SQL databases, JSON, and more
  • Data cleaning: Handling missing values, duplicates, and formatting issues
  • Data transformation: Reshaping, pivoting, merging, and aggregating data
  • Data analysis: Statistical functions, grouping operations, and time series analysis
  • Data visualization: Integration with Matplotlib for creating graphs and charts
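
To make a couple of these points concrete, here is a minimal sketch of a cleaning step followed by a simple calculation. The data and the variable names (`raw`, `clean`) are made up for illustration, and filling a missing age with the column mean is just one possible choice.

```python
import pandas as pd

# Hypothetical data with one duplicate row and one missing value
raw = pd.DataFrame({
    'Name': ['John', 'Anna', 'Anna', 'Peter'],
    'Age': [28, 34, 34, None],
})

# Data cleaning: drop exact duplicate rows, then fill the missing age
# with the mean of the known ages (an assumption made for this sketch)
clean = raw.drop_duplicates().copy()
clean['Age'] = clean['Age'].fillna(clean['Age'].mean())

# Data analysis: average age across the cleaned rows
print(clean)
print(clean['Age'].mean())
```

The methods used here (`drop_duplicates`, `fillna`, `mean`) are all introduced properly later in your Pandas journey; for now, the point is how little code each step needs.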

Installing Pandas

Before we dive into Pandas, you need to install it. You can install Pandas using pip:

bash
pip install pandas

Or using conda:

bash
conda install pandas
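
Once the installation finishes, a quick way to check that it worked is to import the library and print its version. This is only a sanity check; the version string you see depends on what was installed.

```python
import pandas as pd

# If this import succeeds, Pandas is installed in the current environment
print(pd.__version__)
```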

Core Pandas Data Structures

Pandas has two primary data structures:

  1. Series: A one-dimensional labeled array
  2. DataFrame: A two-dimensional labeled data structure with columns of potentially different types

Let's explore each of these structures.

Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

Creating a Series

python
import pandas as pd

# Create a Series from a list
s = pd.Series([10, 20, 30, 40])
print(s)

Output:

0    10
1    20
2    30
3    40
dtype: int64

You can also specify your own index:

python
# Create a Series with custom index
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

Output:

a    10
b    20
c    30
d    40
dtype: int64

A Series can be created from a dictionary as well:

python
# Create a Series from a dictionary
d = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
s = pd.Series(d)
print(s)

Output:

a    10
b    20
c    30
d    40
dtype: int64
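
When you combine a dictionary with an explicit index, Pandas looks up each index label in the dictionary. The sketch below assumes a label `'z'` that is deliberately not a key of the dictionary, to show what happens to missing entries.

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30, 'd': 40}

# Pick keys 'd' and 'a' in that order; 'z' is not in the dictionary
s = pd.Series(d, index=['d', 'a', 'z'])
print(s)

# The labels of the Series, and which entries are missing
print(s.index.tolist())
print(s.isna().tolist())
```

The entry for `'z'` becomes `NaN` (Pandas' marker for a missing value), and because `NaN` is a floating-point value, the Series is stored as `float64` rather than `int64`.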

Accessing Series Elements

You can access elements using index labels:

python
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s['a']) # Output: 10
print(s[['a', 'c']]) # Access multiple elements

Output:

10
a    10
c    30
dtype: int64
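
Besides plain square brackets, a Series also supports explicit label-based access with `loc`, position-based access with `iloc`, and filtering with a boolean condition. Here is a short sketch using the same Series; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']          # label-based lookup
by_position = s.iloc[1]        # position-based lookup (second element)
label_slice = s.loc['b':'c']   # label slices include both endpoints
large = s[s > 20]              # boolean mask keeps values above 20

print(by_label, by_position)
print(label_slice.tolist())
print(large.index.tolist())
```

Using `loc` and `iloc` makes it explicit whether you mean a label or a position, which matters once your index labels are themselves integers.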

DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns that can be of different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.

Creating a DataFrame

python
# Create a DataFrame from a dictionary of lists
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   42    London

You can also specify the sequence of columns:

python
df = pd.DataFrame(data, columns=['Name', 'City', 'Age'])
print(df)

Output:

    Name      City  Age
0   John  New York   28
1   Anna     Paris   34
2  Peter    Berlin   29
3  Linda    London   42
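
Since a DataFrame can be thought of as a dictionary of Series objects, you can also build one that way. In this sketch the two Series (`ages` and `cities`, both invented for the example) do not cover exactly the same people, which shows how Pandas lines columns up by index label.

```python
import pandas as pd

ages = pd.Series({'John': 28, 'Anna': 34, 'Peter': 29})
cities = pd.Series({'Anna': 'Paris', 'Peter': 'Berlin', 'Linda': 'London'})

# Each Series becomes a column; the rows cover every label found in either index
df = pd.DataFrame({'Age': ages, 'City': cities})
print(df)

# Selecting a column gives back a Series
print(type(df['Age']).__name__)
```

Labels that appear in only one of the Series (John and Linda here) still get a row, with `NaN` in the column that has no value for them.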

Creating a DataFrame from CSV

One of the most common ways to create a DataFrame is by importing data from a CSV file:

python
# Reading a CSV file into a DataFrame
df = pd.read_csv('data.csv')
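
If you don't have a CSV file at hand, you can try `read_csv` on text held in memory. The sketch below uses `io.StringIO` as a stand-in for a file; the contents are made up and are not the actual `data.csv` referred to above.

```python
import io
import pandas as pd

# In-memory text standing in for the contents of a CSV file
csv_text = """Name,Age,City
John,28,New York
Anna,34,Paris
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)
```

The first line of the text is used as the column names, and Pandas infers a data type for each column from its values.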

Basic DataFrame Operations

Let's explore some basic operations on DataFrames.

Viewing Data

python
# Sample DataFrame for demonstration
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'Salary': [50000, 60000, 55000, 75000, 65000]
}

df = pd.DataFrame(data)

# View first 2 rows
print(df.head(2))

# View last 2 rows
print(df.tail(2))

# Get basic information about the DataFrame
# (info() prints its report directly and returns None, so no print() is needed)
df.info()

# Get statistical summary of numerical columns
print(df.describe())

Output (head):

   Name  Age      City  Salary
0  John   28  New York   50000
1  Anna   34     Paris   60000
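
A few attributes are also handy for a first look at a DataFrame's structure. This is a small sketch on a trimmed-down version of the sample data (the City column is left out to keep it short).

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'Salary': [50000, 60000, 55000, 75000, 65000],
})

print(df.shape)             # tuple of (number of rows, number of columns)
print(df.columns.tolist())  # column labels as a plain list
print(df.dtypes)            # data type of each column
```

Note that `shape`, `columns`, and `dtypes` are attributes, not methods, so they are written without parentheses.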

Selecting Data

python
# Select a single column (returns a Series)
print(df['Name'])

# Select multiple columns
print(df[['Name', 'City']])

# Select rows by position using iloc
print(df.iloc[0]) # First row
print(df.iloc[1:3]) # Second and third rows

# Select rows by label using loc
print(df.loc[2]) # Row with index 2

Output (df['Name']):

0     John
1     Anna
2    Peter
3    Linda
4      Bob
Name: Name, dtype: object
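
Both `loc` and `iloc` can take a row selector and a column selector at the same time. The following sketch rebuilds part of the sample DataFrame so it can run on its own; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
})

# loc: row label 2, column label 'City'
city = df.loc[2, 'City']

# iloc: row positions 1 and 2 (end excluded), first two columns
block = df.iloc[1:3, 0:2]

# loc slices include the end label, unlike iloc slices
rows = df.loc[1:3, ['Name', 'Age']]

print(city)
print(block)
print(rows)
```

The difference in slicing is easy to trip over: `df.iloc[1:3]` returns two rows, while `df.loc[1:3]` returns three, because label-based slices include the end label.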

Data Manipulation

python
# Add a new column
df['Experience'] = [3, 8, 5, 15, 10]

# Modify a column
df['Salary'] = df['Salary'] * 1.1 # 10% raise for everyone

# Delete a column
df_copy = df.copy()
del df_copy['Experience']

# Filter rows: keep only employees older than 30
senior_employees = df[df['Age'] > 30]
print(senior_employees)

# Sort by Age
print(df.sort_values(by='Age'))

# Sort by multiple columns
print(df.sort_values(by=['City', 'Age']))

Output (senior_employees):

    Name  Age    City   Salary  Experience
1   Anna   34   Paris  66000.0           8
3  Linda   42  London  82500.0          15
4    Bob   37   Tokyo  71500.0          10
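
Filters can be combined, and columns can also be removed without `del`. Here is a self-contained sketch using the original (pre-raise) salaries; the thresholds are arbitrary values chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'Salary': [50000, 60000, 55000, 75000, 65000],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
older_and_well_paid = df[(df['Age'] > 30) & (df['Salary'] > 60000)]
print(older_and_well_paid)

# drop returns a new DataFrame and leaves the original unchanged
without_salary = df.drop(columns=['Salary'])
print(without_salary.columns.tolist())

# Sort from highest to lowest salary
print(df.sort_values(by='Salary', ascending=False))
```

Python's `and` and `or` keywords do not work element by element on Series, which is why the `&` and `|` operators are used here.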

Real-World Example: Data Analysis

Let's look at a real-world example of using Pandas for data analysis. We'll analyze a fictional dataset of employee performance reviews.

python
# Sample employee performance data
performance_data = {
    'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Bob', 'Sarah', 'Michael', 'Emma'],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'Marketing', 'HR', 'Finance', 'IT'],
    'Review_Score': [4.2, 4.5, 3.9, 4.7, 3.2, 4.0, 4.3, 3.8],
    'Years_Employed': [3, 5, 2, 7, 3, 4, 6, 2],
    'Salary': [55000, 60000, 52000, 75000, 48000, 62000, 69000, 53000]
}

employee_df = pd.DataFrame(performance_data)
print(employee_df)

Output:

  Employee Department  Review_Score  Years_Employed  Salary
0     John         IT           4.2               3   55000
1     Anna         HR           4.5               5   60000
2    Peter         IT           3.9               2   52000
3    Linda    Finance           4.7               7   75000
4      Bob  Marketing           3.2               3   48000
5    Sarah         HR           4.0               4   62000
6  Michael    Finance           4.3               6   69000
7     Emma         IT           3.8               2   53000

Now, let's perform some data analysis:

python
# Calculate average review score by department
dept_review = employee_df.groupby('Department')['Review_Score'].mean().sort_values(ascending=False)
print("Average Review Score by Department:")
print(dept_review)

# Calculate average salary
print(f"Average Salary: ${employee_df['Salary'].mean():.2f}")

# Find correlation between years employed and salary
correlation = employee_df['Years_Employed'].corr(employee_df['Salary'])
print(f"Correlation between Years Employed and Salary: {correlation:.2f}")

# Create a new column for performance bonus (based on review score)
employee_df['Bonus'] = employee_df['Review_Score'] * 1000
print(employee_df[['Employee', 'Review_Score', 'Bonus']])

Output:

Average Review Score by Department:
Department
Finance      4.500000
HR           4.250000
IT           3.966667
Marketing    3.200000
Name: Review_Score, dtype: float64

Average Salary: $59250.00

Correlation between Years Employed and Salary: 0.93

  Employee  Review_Score   Bonus
0     John           4.2  4200.0
1     Anna           4.5  4500.0
2    Peter           3.9  3900.0
3    Linda           4.7  4700.0
4      Bob           3.2  3200.0
5    Sarah           4.0  4000.0
6  Michael           4.3  4300.0
7     Emma           3.8  3800.0
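
A grouping operation can also compute several statistics in one pass. The sketch below uses `agg` with named aggregation on the same employee data (Salary is left out to keep it short); the output column names `headcount`, `best_review`, and `avg_years` are made up for this example.

```python
import pandas as pd

employee_df = pd.DataFrame({
    'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Bob', 'Sarah', 'Michael', 'Emma'],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'Marketing', 'HR', 'Finance', 'IT'],
    'Review_Score': [4.2, 4.5, 3.9, 4.7, 3.2, 4.0, 4.3, 3.8],
    'Years_Employed': [3, 5, 2, 7, 3, 4, 6, 2],
})

# Named aggregation: output_column=(input_column, function)
summary = employee_df.groupby('Department').agg(
    headcount=('Employee', 'count'),
    best_review=('Review_Score', 'max'),
    avg_years=('Years_Employed', 'mean'),
)
print(summary)
```

The result is a DataFrame indexed by department with one column per statistic, which is often easier to read than running several separate `groupby` calls.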

Summary

In this introduction to Pandas, we've covered:

  • What Pandas is and why it's useful for data analysis
  • How to install Pandas
  • The two main data structures: Series and DataFrame
  • Creating Series and DataFrames from various data sources
  • Basic operations for viewing, selecting, and manipulating data
  • A real-world example of data analysis using Pandas

Pandas is an extremely powerful library with many more capabilities beyond what we've covered. As you continue learning, you'll discover functions for advanced data manipulation, time series analysis, working with missing data, and much more.

Exercises

Practice what you've learned with these exercises:

  1. Create a DataFrame with information about 5 different books (title, author, year, genre, rating).
  2. Load a CSV file of your choice using pd.read_csv().
  3. Filter the employee DataFrame to show only employees from the IT department with a review score greater than 4.0.
  4. Calculate the average salary by department and sort departments by average salary in descending order.
  5. Create a new column called "Experience_Level" that categorizes employees as "Junior" (0-3 years), "Mid" (4-6 years), or "Senior" (7+ years).

Happy coding with Pandas!


