Pandas Introduction

What is Pandas?

Pandas is one of the most popular Python libraries for data manipulation and analysis. It provides fast, flexible data structures designed to make working with structured data intuitive. If you work with tabular data (such as spreadsheets or SQL tables), time series, or any other kind of labeled or relational data, Pandas will likely become your go-to tool.

The library was created by Wes McKinney in 2008 while he was working as a quantitative analyst. The name "Pandas" is derived from "panel data," an econometrics term for multidimensional structured data sets.

Why Use Pandas?

Pandas solves many common problems in data analysis:

Reading various file formats: CSV, Excel, SQL databases, JSON, and more
Data cleaning: Handling missing values, duplicates, and formatting issues
Data transformation: Reshaping, pivoting, merging, and aggregating data
Data analysis: Statistical functions, grouping operations, and time series analysis
Data visualization: Integration with Matplotlib for creating graphs and charts

Installing Pandas

Before we dive into Pandas, you need to install it. You can install Pandas using pip:

bash
pip install pandas

Or using conda:

bash
conda install pandas
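
To confirm that the installation worked, a quick check is to import the library and print its version. This is only a minimal sketch; the version string you see depends on your environment.

```python
import pandas as pd

# If this import succeeds, Pandas is installed in the active environment
print(pd.__version__)
```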

Core Pandas Data Structures

Pandas has two primary data structures:

Series: A one-dimensional labeled array
DataFrame: A two-dimensional labeled data structure with columns of potentially different types

Let's explore each of these structures.

Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

Creating a Series

python
import pandas as pd

# Create a Series from a list
s = pd.Series([10, 20, 30, 40])
print(s)

Output:

0    10
1    20
2    30
3    40
dtype: int64

You can also specify your own index:

python
# Create a Series with custom index
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

Output:

a    10
b    20
c    30
d    40
dtype: int64

A Series can be created from a dictionary as well:

python
# Create a Series from a dictionary
d = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
s = pd.Series(d)
print(s)

Output:

a    10
b    20
c    30
d    40
dtype: int64
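
Since a Series pairs values with an index, it can help to look at the two parts separately. The sketch below inspects them and shows that arithmetic between two Series matches entries by label rather than by position. The second Series, `other`, is made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

labels = list(s.index)   # the index labels
values = s.to_numpy()    # the underlying values as a NumPy array
print(labels)
print(values)
print(s.dtype)

# Arithmetic aligns on index labels, not on position.
# fill_value=0 treats labels missing from one side as 0.
other = pd.Series([1, 2], index=['b', 'd'])
total = s.add(other, fill_value=0)
print(total)
```

Here `'b'` and `'d'` are increased by 1 and 2, while `'a'` and `'c'` keep their original values.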

Accessing Series Elements

You can access elements using index labels:

python
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s['a'])  # Output: 10
print(s[['a', 'c']])  # Access multiple elements

Output:

10
a    10
c    30
dtype: int64
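
Besides plain square brackets, a Series supports explicit position-based and label-based access, as well as filtering with a condition. The following is a short sketch of those options using the same Series as above.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

first = s.iloc[0]              # access by integer position
by_label = s.loc['b']          # access by label (explicit form of s['b'])
label_slice = s.loc['b':'c']   # label slices include both endpoints
large = s[s > 20]              # boolean mask keeps only matching entries

print(first, by_label)
print(label_slice)
print(large)
```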

DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns that can be of different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
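
To make the "dictionary of Series" idea concrete, here is a small sketch that builds a DataFrame from two Series. The names `ages`, `cities`, and `people` and their contents are invented for illustration. Note how rows are matched by label, and a label missing from one Series produces a missing value (NaN) in that column.

```python
import pandas as pd

ages = pd.Series({'John': 28, 'Anna': 34, 'Peter': 29})
cities = pd.Series({'Anna': 'Paris', 'John': 'New York', 'Linda': 'London'})

# Each dictionary key becomes a column; rows are aligned on the Series labels
people = pd.DataFrame({'Age': ages, 'City': cities})
print(people)

# Selecting one column gives back a Series
print(type(people['Age']))
```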

Creating a DataFrame

python
# Create a DataFrame from a dictionary of lists
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   42    London

You can also specify the order of the columns:

python
df = pd.DataFrame(data, columns=['Name', 'City', 'Age'])
print(df)

Output:

    Name      City  Age
0   John  New York   28
1   Anna     Paris   34
2  Peter    Berlin   29
3  Linda    London   42
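
The `columns` argument does more than reorder. If it names a column that is not a key in the dictionary, that column is created and filled with missing values. You can also pass `index` to set your own row labels. In this sketch the `'Country'` column and the `'p1'`/`'p2'` labels are made up for illustration.

```python
import pandas as pd

data = {
    'Name': ['John', 'Anna'],
    'Age': [28, 34],
}

# 'Country' is not a key in data, so the whole column is NaN
df = pd.DataFrame(data, columns=['Name', 'Age', 'Country'], index=['p1', 'p2'])
print(df)
```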

Creating a DataFrame from CSV

One of the most common ways to create a DataFrame is by importing data from a CSV file:

python
# Reading a CSV file into a DataFrame
df = pd.read_csv('data.csv')
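
If you do not have a CSV file at hand, you can try `read_csv` on text held in memory. The sketch below uses `io.StringIO` as a stand-in for a file; the CSV contents are invented for illustration. It also shows `to_csv`, which converts a DataFrame back to CSV text.

```python
import io
import pandas as pd

# In-memory stand-in for a file such as data.csv (contents are made up)
csv_text = """Name,Age,City
John,28,New York
Anna,34,Paris
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)  # column types are inferred from the text

# With no path given, to_csv returns a string; index=False omits the row labels
round_trip = df.to_csv(index=False)
print(round_trip)
```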

Basic DataFrame Operations

Let's explore some basic operations on DataFrames.

Viewing Data

python
# Sample DataFrame for demonstration
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'Salary': [50000, 60000, 55000, 75000, 65000]
}

df = pd.DataFrame(data)

# View first 2 rows
print(df.head(2))

# View last 2 rows
print(df.tail(2))

# Get basic information about the DataFrame
# (info() prints its report itself and returns None, so print() is not needed)
df.info()

# Get statistical summary of numerical columns
print(df.describe())

Output (head):

   Name  Age      City  Salary
0  John   28  New York   50000
1  Anna   34     Paris   60000
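
Alongside these methods, a few attributes are handy for a first look at a DataFrame: its dimensions, column names, and column types. A brief sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'Salary': [50000, 60000, 55000, 75000, 65000]
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names
print(df.dtypes)         # data type of each column

# Summary statistics are also available one column at a time
mean_age = df['Age'].mean()
print(mean_age)
```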

Selecting Data

python
# Select a single column (returns a Series)
print(df['Name'])

# Select multiple columns
print(df[['Name', 'City']])

# Select rows by position using iloc
print(df.iloc[0])  # First row
print(df.iloc[1:3])  # Second and third rows

# Select rows by label using loc
print(df.loc[2])  # Row with index 2

Output (df['Name']):

0     John
1     Anna
2    Peter
3    Linda
4      Bob
Name: Name, dtype: object
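
Both `loc` and `iloc` can select rows and columns in one step. One difference is easy to miss: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position. A short sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'Salary': [50000, 60000, 55000, 75000, 65000]
})

# loc takes labels: row label 2, column label 'City'
city = df.loc[2, 'City']

# iloc takes integer positions: first two rows, first two columns
corner = df.iloc[0:2, 0:2]

# loc includes the end label (rows 1, 2, 3); iloc excludes the end position (rows 1, 2)
by_label = df.loc[1:3, ['Name', 'Age']]
by_position = df.iloc[1:3]

print(city)
print(corner)
print(len(by_label), len(by_position))
```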

Data Manipulation

python
# Add a new column
df['Experience'] = [3, 8, 5, 15, 10]

# Modify a column
df['Salary'] = df['Salary'] * 1.1  # 10% raise for everyone

# Delete a column
df_copy = df.copy()
del df_copy['Experience']

# Filter data
senior_employees = df[df['Age'] > 30]
print(senior_employees)

# Sort by Age
print(df.sort_values(by='Age'))

# Sort by multiple columns
print(df.sort_values(by=['City', 'Age']))

Output (senior_employees):

    Name  Age    City   Salary  Experience
1   Anna   34   Paris  66000.0           8
3  Linda   42  London  82500.0          15
4    Bob   37   Tokyo  71500.0          10
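
Filters can combine several conditions, and columns can also be removed without touching the original DataFrame. The sketch below rebuilds the sample data with the original salaries (before the raise) so that it runs on its own; the 65000 threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Age': [28, 34, 29, 42, 37],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'Salary': [50000, 60000, 55000, 75000, 65000]
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
well_paid_over_30 = df[(df['Age'] > 30) & (df['Salary'] >= 65000)]
print(well_paid_over_30)

# drop returns a new DataFrame and leaves df unchanged
without_city = df.drop(columns=['City'])
print(without_city.columns.tolist())

# Sort from highest to lowest salary
by_salary = df.sort_values(by='Salary', ascending=False)
print(by_salary)
```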

Real-World Example: Data Analysis

Let's work through a realistic example of using Pandas for data analysis. We'll analyze a fictional dataset of employee performance reviews.

python
# Sample employee performance data
performance_data = {
    'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Bob', 'Sarah', 'Michael', 'Emma'],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'Marketing', 'HR', 'Finance', 'IT'],
    'Review_Score': [4.2, 4.5, 3.9, 4.7, 3.2, 4.0, 4.3, 3.8],
    'Years_Employed': [3, 5, 2, 7, 3, 4, 6, 2],
    'Salary': [55000, 60000, 52000, 75000, 48000, 62000, 69000, 53000]
}

employee_df = pd.DataFrame(performance_data)
print(employee_df)

Output:

  Employee Department  Review_Score  Years_Employed  Salary
0     John         IT           4.2               3   55000
1     Anna         HR           4.5               5   60000
2    Peter         IT           3.9               2   52000
3    Linda    Finance           4.7               7   75000
4      Bob  Marketing           3.2               3   48000
5    Sarah         HR           4.0               4   62000
6  Michael    Finance           4.3               6   69000
7     Emma         IT           3.8               2   53000

Now, let's perform some data analysis:

python
# Calculate average review score by department
dept_review = employee_df.groupby('Department')['Review_Score'].mean().sort_values(ascending=False)
print("Average Review Score by Department:")
print(dept_review)

# Calculate average salary
print(f"Average Salary: ${employee_df['Salary'].mean():.2f}")

# Find correlation between years employed and salary
correlation = employee_df['Years_Employed'].corr(employee_df['Salary'])
print(f"Correlation between Years Employed and Salary: {correlation:.2f}")

# Create a new column for performance bonus (based on review score)
employee_df['Bonus'] = employee_df['Review_Score'] * 1000
print(employee_df[['Employee', 'Review_Score', 'Bonus']])

Output:

Average Review Score by Department:
Department
Finance      4.500000
HR           4.250000
IT           3.966667
Marketing    3.200000
Name: Review_Score, dtype: float64

Average Salary: $59250.00

Correlation between Years Employed and Salary: 0.93

  Employee  Review_Score   Bonus
0     John           4.2  4200.0
1     Anna           4.5  4500.0
2    Peter           3.9  3900.0
3    Linda           4.7  4700.0
4      Bob           3.2  3200.0
5    Sarah           4.0  4000.0
6  Michael           4.3  4300.0
7     Emma           3.8  3800.0
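
A group-by can also compute several statistics at once. The sketch below uses named aggregation to build a small per-department summary; the output column names `headcount`, `avg_score`, and `max_years` are chosen here for illustration. The data is repeated so the example runs on its own.

```python
import pandas as pd

employee_df = pd.DataFrame({
    'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Bob', 'Sarah', 'Michael', 'Emma'],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'Marketing', 'HR', 'Finance', 'IT'],
    'Review_Score': [4.2, 4.5, 3.9, 4.7, 3.2, 4.0, 4.3, 3.8],
    'Years_Employed': [3, 5, 2, 7, 3, 4, 6, 2]
})

# Each keyword is an output column: (source column, aggregation function)
summary = employee_df.groupby('Department').agg(
    headcount=('Employee', 'count'),
    avg_score=('Review_Score', 'mean'),
    max_years=('Years_Employed', 'max'),
)
print(summary)
```

The result is a DataFrame indexed by department, so a single value can be read with, for example, `summary.loc['IT', 'headcount']`.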

Summary

In this introduction to Pandas, we've covered:

What Pandas is and why it's useful for data analysis
How to install Pandas
The two main data structures: Series and DataFrame
Creating Series and DataFrames from various data sources
Basic operations for viewing, selecting, and manipulating data
A real-world example of data analysis using Pandas

Pandas is an extremely powerful library with many more capabilities beyond what we've covered. As you continue learning, you'll discover functions for advanced data manipulation, time series analysis, working with missing data, and much more.

Additional Resources

To deepen your knowledge of Pandas, here are some excellent resources:

Official Pandas Documentation
10 Minutes to Pandas - A quick introduction to the main concepts
Pandas Cookbook - Recipes for common tasks

Exercises

Practice what you've learned with these exercises:

Create a DataFrame with information about 5 different books (title, author, year, genre, rating).
Load a CSV file of your choice using pd.read_csv().
Filter the employee DataFrame to show only employees from the IT department with a review score greater than 4.0.
Calculate the average salary by department and sort departments by average salary in descending order.
Create a new column called "Experience_Level" that categorizes employees as "Junior" (0-3 years), "Mid" (4-6 years), or "Senior" (7+ years).

Happy coding with Pandas!
