
Pandas Objects

Introduction

Pandas is a powerful data manipulation and analysis library for Python, providing data structures designed to work with labeled and relational data efficiently. Understanding the fundamental objects in Pandas is crucial for anyone starting with data analysis in Python.

In this tutorial, we'll explore the three primary Pandas objects:

  1. Series - one-dimensional labeled arrays
  2. DataFrame - two-dimensional labeled data structures
  3. Index - immutable array-like objects for axis labels

Whether you're analyzing financial data, processing scientific results, or cleaning web analytics, these objects form the foundation of your Pandas workflow.

Pandas Series

A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.). It's similar to a column in a spreadsheet or a database table.

Creating a Series

You can create a Series from various Python objects:

python
import pandas as pd

# Creating a Series from a list
numbers = pd.Series([1, 2, 3, 4, 5])
print(numbers)

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64

By default, Pandas assigns integer indices starting from 0. However, you can specify custom indices:

python
# Creating a Series with custom indices
fruits = pd.Series([10, 20, 30, 40], index=['apple', 'banana', 'cherry', 'date'])
print(fruits)

Output:

apple     10
banana    20
cherry    30
date      40
dtype: int64
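A list is not the only starting point. As a small sketch, a Series can also be built from a dictionary (keys become index labels) or from a single scalar plus an index. The names `stock` and `defaults` are made up for this illustration:

```python
import pandas as pd

# Keys become the index labels; values become the data
stock = pd.Series({'apple': 10, 'banana': 20, 'cherry': 30, 'date': 40})
print(stock)

# A scalar is repeated for every label in the given index
defaults = pd.Series(0, index=['apple', 'banana'])
print(defaults)
```

The dictionary form is handy when the labels and values already live together in your data.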

Accessing Series Elements

You can access Series elements using index labels or integer positions:

python
# Using label-based indexing
print(f"Value for banana: {fruits['banana']}")

# Using .iloc for integer-position access
# (plain fruits[2] on a label-indexed Series is deprecated in pandas 2.x)
print(f"Value at position 2: {fruits.iloc[2]}")
print(f"First value: {fruits.iloc[0]}")

# Using .loc for label-based access
print(f"Value for cherry: {fruits.loc['cherry']}")

Output:

Value for banana: 20
Value at position 2: 30
First value: 10
Value for cherry: 30
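Slices work with both accessors, with one difference worth seeing in code: a `.loc` label slice includes its end label, while an `.iloc` position slice excludes its stop position. The sketch below rebuilds the `fruits` Series so it can run on its own; `by_label` and `by_position` are illustrative names:

```python
import pandas as pd

fruits = pd.Series([10, 20, 30, 40], index=['apple', 'banana', 'cherry', 'date'])

# Label slice with .loc: both endpoints are included
by_label = fruits.loc['banana':'cherry']

# Position slice with .iloc: the stop position (3) is excluded
by_position = fruits.iloc[1:3]

print(by_label)
print(by_position)
```

Both slices select the same two elements, banana and cherry.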

Series Operations

A Series supports vectorized operations, so arithmetic with a scalar value is applied to every element at once:

python
# Scalar multiplication
print(fruits * 2)

# Adding a value
print(fruits + 5)

Output:

apple     20
banana    40
cherry    60
date      80
dtype: int64

apple     15
banana    25
cherry    35
date      45
dtype: int64

You can also perform operations between Series; values are matched by index label:

python
prices = pd.Series([0.5, 0.3, 0.7, 0.9], index=['apple', 'banana', 'cherry', 'date'])
print(fruits * prices) # Element-wise multiplication

Output:

apple      5.0
banana     6.0
cherry    21.0
date      36.0
dtype: float64
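The example above uses two Series with identical labels. When the labels differ, Pandas aligns on the label rather than the position, and any label present on only one side produces `NaN`. The following sketch uses made-up data (including the label `elderberry`) to show this, along with the `fill_value` argument of `Series.mul`:

```python
import pandas as pd

fruits = pd.Series([10, 20, 30], index=['apple', 'banana', 'cherry'])
prices = pd.Series([0.3, 0.5, 0.9], index=['banana', 'apple', 'elderberry'])

# Values are matched by label, not by position
total = fruits * prices
print(total)

# fill_value substitutes the given value for a label missing on one side
filled = fruits.mul(prices, fill_value=0)
print(filled)
```

Here `cherry` and `elderberry` each appear in only one Series, so they are `NaN` in `total` and `0.0` in `filled`.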

Pandas DataFrame

A DataFrame is a 2-dimensional labeled data structure resembling a table or spreadsheet. It consists of rows and columns, with each column potentially containing different data types.

Creating a DataFrame

There are several ways to create a DataFrame:

From a Dictionary of Lists

python
# Creating a DataFrame from a dictionary of lists
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [50000, 60000, 55000, 75000]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City  Salary
0   John   28  New York   50000
1   Anna   34     Paris   60000
2  Peter   29    Berlin   55000
3  Linda   42    London   75000

From a List of Dictionaries

python
# Creating a DataFrame from a list of dictionaries
employees = [
    {'Name': 'John', 'Age': 28, 'Department': 'IT'},
    {'Name': 'Anna', 'Age': 34, 'Department': 'HR'},
    {'Name': 'Peter', 'Age': 29, 'Department': 'Sales'}
]

df_employees = pd.DataFrame(employees)
print(df_employees)

Output:

    Name  Age Department
0   John   28         IT
1   Anna   34         HR
2  Peter   29      Sales

From a NumPy Array

python
import numpy as np

# Creating a DataFrame from a NumPy array
array_data = np.random.randn(3, 4) # 3 rows, 4 columns of random numbers
df_array = pd.DataFrame(array_data, columns=['A', 'B', 'C', 'D'])
print(df_array)

Output (your values will differ, since the numbers are random):

          A         B         C         D
0  0.304994 -0.371394  0.283472 -0.720008
1  0.429738  1.140422 -0.356893  0.395309
2 -0.451676 -0.131659 -0.158146  0.261513
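If you want the random example to give the same numbers on every run, one option is NumPy's seeded `Generator`. This is a sketch of that alternative, not the approach used above; the seed value 42 is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# A seeded Generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)
array_data = rng.standard_normal((3, 4))  # 3 rows, 4 columns

df_array = pd.DataFrame(array_data, columns=['A', 'B', 'C', 'D'])
print(df_array.shape)
```

Reproducible data makes it easier to compare your output with someone else's.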

Accessing DataFrame Data

There are multiple ways to access data in a DataFrame:

Selecting Columns

python
# Select a single column (returns a Series)
print(df['Name'])

# Select multiple columns (returns a DataFrame)
print(df[['Name', 'Age']])

Output:

0     John
1     Anna
2    Peter
3    Linda
Name: Name, dtype: object

    Name  Age
0   John   28
1   Anna   34
2  Peter   29
3  Linda   42

Selecting Rows

python
# Select rows by position using iloc
print(df.iloc[1]) # Second row

# Select multiple rows
print(df.iloc[1:3]) # Rows 1 and 2 (not including 3)

Output:

Name       Anna
Age          34
City      Paris
Salary    60000
Name: 1, dtype: object

    Name  Age    City  Salary
1   Anna   34   Paris   60000
2  Peter   29  Berlin   55000

Selecting Rows and Columns

python
# Select specific cells by row and column positions
print(df.iloc[0, 1]) # First row, second column

# Select a subset of rows and columns
print(df.iloc[0:2, [0, 2]]) # First two rows, first and third columns

Output:

28

   Name      City
0  John  New York
1  Anna     Paris
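The examples above select by position with `.iloc`. The label-based counterpart is `.loc`, which takes row labels and column names. Here is a minimal sketch that rebuilds the same `df` so it runs standalone; `subset` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [50000, 60000, 55000, 75000]
})

# Single cell: row label 0, column label 'Age'
print(df.loc[0, 'Age'])

# Label slices include both endpoints, so this returns rows 0, 1 and 2
subset = df.loc[0:2, ['Name', 'City']]
print(subset)
```

With the default integer index, row labels and positions coincide, which is why the endpoint rule is the easiest way to tell the two accessors apart.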

Basic DataFrame Operations

Adding a New Column

python
# Adding a new column
df['Experience'] = [3, 8, 4, 12]
print(df)

Output:

    Name  Age      City  Salary  Experience
0   John   28  New York   50000           3
1   Anna   34     Paris   60000           8
2  Peter   29    Berlin   55000           4
3  Linda   42    London   75000          12
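New columns are often computed from existing ones rather than typed in. The sketch below shows two ways to do that; the column names `MonthlySalary` and `Senior`, and the threshold of 8 years, are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Salary': [50000, 60000, 55000, 75000],
    'Experience': [3, 8, 4, 12]
})

# Vectorized arithmetic on an existing column (modifies df in place)
df['MonthlySalary'] = df['Salary'] / 12

# assign() returns a new DataFrame and leaves df unchanged
df_with_flag = df.assign(Senior=df['Experience'] >= 8)
print(df_with_flag)
```

Use direct assignment when you want to change the DataFrame you have, and `assign()` when you would rather keep the original intact.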

Statistical Operations

python
# Basic statistics
print("Mean age:", df['Age'].mean())
print("Maximum salary:", df['Salary'].max())
print("\nSummary statistics:")
print(df.describe())

Output:

Mean age: 33.25
Maximum salary: 75000

Summary statistics:
             Age        Salary  Experience
count   4.000000      4.000000    4.000000
mean   33.250000  60000.000000    6.750000
std     6.396614  10801.234497    4.112988
min    28.000000  50000.000000    3.000000
25%    28.750000  53750.000000    3.750000
50%    31.500000  57500.000000    6.000000
75%    36.000000  63750.000000    9.000000
max    42.000000  75000.000000   12.000000
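The `std` row reported by `describe()` is the sample standard deviation, which divides by n - 1. The short sketch below checks this on the Age values and contrasts it with the population version (`ddof=0`); the variable names are illustrative:

```python
import pandas as pd

ages = pd.Series([28, 34, 29, 42])

# Sample standard deviation (divides by n - 1); this is what describe() reports
sample_std = ages.std()

# Population standard deviation (divides by n)
population_std = ages.std(ddof=0)

print(round(sample_std, 6), round(population_std, 6))
```

For small datasets like this one the two values differ noticeably, so it helps to know which one you are reading.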

Filtering Data

python
# Filtering rows based on conditions
print("Employees younger than 30:")
print(df[df['Age'] < 30])

print("\nEmployees from London or Paris:")
print(df[df['City'].isin(['London', 'Paris'])])

print("\nEmployees with salary > 55000 and experience > 5:")
print(df[(df['Salary'] > 55000) & (df['Experience'] > 5)])

Output:

Employees younger than 30:
    Name  Age      City  Salary  Experience
0   John   28  New York   50000           3
2  Peter   29    Berlin   55000           4

Employees from London or Paris:
    Name  Age    City  Salary  Experience
1   Anna   34   Paris   60000           8
3  Linda   42  London   75000          12

Employees with salary > 55000 and experience > 5:
    Name  Age    City  Salary  Experience
1   Anna   34   Paris   60000           8
3  Linda   42  London   75000          12
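The filters above combine conditions with `&` (and). The same pattern works with `|` (or) and `~` (not), and `DataFrame.query()` lets you write a condition as a string. This sketch rebuilds a smaller `df`; the thresholds and result names are chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [50000, 60000, 55000, 75000]
})

# OR: younger than 30, or earning more than 70000
young_or_high = df[(df['Age'] < 30) | (df['Salary'] > 70000)]

# NOT: everyone outside Paris
not_paris = df[~(df['City'] == 'Paris')]

# query() expresses a condition as a string
older = df.query('Age >= 30 and Salary > 55000')

print(young_or_high)
print(not_paris)
print(older)
```

Each condition needs its own parentheses, because `&`, `|` and `~` bind more tightly than comparison operators in Python.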

Pandas Index Object

The Index is an immutable array responsible for holding axis labels for Series and DataFrame objects. While you rarely construct one directly, understanding it helps explain how Pandas looks up and aligns data by label.

Working with Index Objects

python
# Examining the index of a Series
print("Series index:")
print(fruits.index)

# Examining the index of a DataFrame
print("\nDataFrame index:")
print(df.index)
print("\nDataFrame columns:")
print(df.columns)

Output:

Series index:
Index(['apple', 'banana', 'cherry', 'date'], dtype='object')

DataFrame index:
RangeIndex(start=0, stop=4, step=1)

DataFrame columns:
Index(['Name', 'Age', 'City', 'Salary', 'Experience'], dtype='object')
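To make "immutable" concrete, here is a small sketch with a hand-built Index. It shows that item assignment fails, while set-like methods return new Index objects. The labels and variable names are invented for the example:

```python
import pandas as pd

idx = pd.Index(['apple', 'banana', 'cherry'])

# Item assignment on an Index raises TypeError
try:
    idx[0] = 'apricot'
except TypeError as err:
    print(f"Cannot modify an Index in place: {err}")

# Set-like operations return new Index objects
other = pd.Index(['banana', 'cherry', 'date'])
common = idx.intersection(other)
combined = idx.union(other)
print(common)
print(combined)
```

Because an Index cannot change underneath you, it is safe for several Series or DataFrames to share one.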

Setting Custom Indices

You can set a column as the index of your DataFrame:

python
# Setting 'Name' as the index
df_indexed = df.set_index('Name')
print(df_indexed)

# Reset the index back to default
df_reset = df_indexed.reset_index()
print("\nAfter resetting index:")
print(df_reset)

Output:

       Age      City  Salary  Experience
Name
John    28  New York   50000           3
Anna    34     Paris   60000           8
Peter   29    Berlin   55000           4
Linda   42    London   75000          12

After resetting index:
    Name  Age      City  Salary  Experience
0   John   28  New York   50000           3
1   Anna   34     Paris   60000           8
2  Peter   29    Berlin   55000           4
3  Linda   42    London   75000          12
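One practical benefit of a meaningful index is lookup by label with `.loc`. The sketch below uses a shortened version of the employee table so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 34, 29],
    'City': ['New York', 'Paris', 'Berlin']
})
df_indexed = df.set_index('Name')

# Look up a whole row by its label
print(df_indexed.loc['Anna'])

# Look up a single value by row label and column label
print(df_indexed.loc['Peter', 'City'])
```

This reads more naturally than filtering on the Name column, provided the names are unique.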

Multi-level Indexing

Pandas supports hierarchical indexing with multiple levels:

python
# Creating a DataFrame with multi-level indexing
multi_index = pd.MultiIndex.from_tuples([
    ('USA', 'New York'),
    ('USA', 'Boston'),
    ('France', 'Paris'),
    ('Germany', 'Berlin')
], names=['Country', 'City'])

df_multi = pd.DataFrame({
    'Population': [8400000, 685000, 2200000, 3700000],
    'Area': [468, 89, 105, 891]
}, index=multi_index)

print(df_multi)

Output:

                  Population  Area
Country City
USA     New York     8400000   468
        Boston        685000    89
France  Paris        2200000   105
Germany Berlin       3700000   891
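Once a DataFrame has a hierarchical index, you can select on one level or on a full combination of levels. The sketch below rebuilds `df_multi` and shows three common selections; the result names `usa`, `paris` and `berlin` are illustrative:

```python
import pandas as pd

multi_index = pd.MultiIndex.from_tuples([
    ('USA', 'New York'),
    ('USA', 'Boston'),
    ('France', 'Paris'),
    ('Germany', 'Berlin')
], names=['Country', 'City'])

df_multi = pd.DataFrame({
    'Population': [8400000, 685000, 2200000, 3700000],
    'Area': [468, 89, 105, 891]
}, index=multi_index)

# All rows for one outer-level label
usa = df_multi.loc['USA']

# One row identified by a full (Country, City) tuple
paris = df_multi.loc[('France', 'Paris')]

# Cross-section on the inner level
berlin = df_multi.xs('Berlin', level='City')

print(usa)
print(paris)
print(berlin)
```

Selecting on the outer level drops that level from the result, so `usa` is indexed by City alone.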

Practical Applications

Data Analysis Example: Sales Data

Let's combine what we've learned to analyze some sales data:

python
# Create sales data
sales_data = {
    'Date': pd.date_range(start='2023-01-01', periods=6, freq='D'),
    'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Quantity': [10, 15, 8, 12, 20, 5],
    'Price': [100, 50, 100, 75, 50, 100]
}

sales_df = pd.DataFrame(sales_data)
sales_df['Total'] = sales_df['Quantity'] * sales_df['Price']
print(sales_df)

# Analyzing sales by product
product_sales = sales_df.groupby('Product').agg({
    'Quantity': 'sum',
    'Total': 'sum'
})
print("\nSales by Product:")
print(product_sales)

# Find the date with the highest sales (idxmax returns the first row if there is a tie)
max_sales_date = sales_df.loc[sales_df['Total'].idxmax(), 'Date']
max_sales = sales_df['Total'].max()
print(f"\nHighest sales: ${max_sales} on {max_sales_date.date()}")

Output:

        Date Product  Quantity  Price  Total
0 2023-01-01       A        10    100   1000
1 2023-01-02       B        15     50    750
2 2023-01-03       A         8    100    800
3 2023-01-04       C        12     75    900
4 2023-01-05       B        20     50   1000
5 2023-01-06       A         5    100    500

Sales by Product:
         Quantity  Total
Product
A              23   2300
B              35   1750
C              12    900

Highest sales: $1000 on 2023-01-01
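In this dataset two days share the highest total (2023-01-01 and 2023-01-05), and `idxmax()` reports only the first. If you need every tied row, one approach is to compare against the maximum directly. This is a sketch on the same data; `top_days` is an illustrative name:

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=6, freq='D'),
    'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Quantity': [10, 15, 8, 12, 20, 5],
    'Price': [100, 50, 100, 75, 50, 100]
})
sales_df['Total'] = sales_df['Quantity'] * sales_df['Price']

# Keep every row whose Total equals the maximum, not just the first one
top_days = sales_df[sales_df['Total'] == sales_df['Total'].max()]
print(top_days[['Date', 'Product', 'Total']])
```

Whether ties matter depends on the question you are asking, so it is worth checking for them before reporting a single "best" day.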

Time Series Analysis Example

Pandas is excellent for time series data:

python
# Creating a time series
dates = pd.date_range('20230101', periods=6)
time_series = pd.Series(np.random.randn(6), index=dates)
print("Time Series Data:")
print(time_series)

# Resampling to monthly frequency (calculating mean)
# Note: pandas 2.2+ deprecates the 'M' alias in favor of 'ME' (month end)
monthly = time_series.resample('M').mean()
print("\nMonthly Average:")
print(monthly)

Output (your values will differ, since the data is random):

Time Series Data:
2023-01-01    0.335909
2023-01-02   -0.373500
2023-01-03   -0.635623
2023-01-04    0.800277
2023-01-05   -0.226735
2023-01-06    0.576587
Freq: D, dtype: float64

Monthly Average:
2023-01-31    0.079486
Freq: M, dtype: float64
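Because the example above uses random numbers, its results are hard to check by eye. The sketch below uses fixed, made-up readings so the arithmetic is easy to follow, and shows two related tools: resampling into two-day bins and a rolling mean over three observations:

```python
import pandas as pd

dates = pd.date_range('2023-01-01', periods=6, freq='D')
readings = pd.Series([1.0, 3.0, 5.0, 7.0, 9.0, 11.0], index=dates)

# Two-day bins: each bin averages two consecutive daily values
two_day = readings.resample('2D').mean()
print(two_day)

# Rolling mean over a 3-observation window; the first two entries are NaN
rolling = readings.rolling(window=3).mean()
print(rolling)
```

Resampling reduces the number of rows (six days become three bins), while a rolling window keeps one row per original observation.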

Summary

In this tutorial, we covered the three fundamental objects in Pandas:

  1. Series: One-dimensional labeled arrays that can hold any data type, perfect for representing a single column or time series.

  2. DataFrame: Two-dimensional labeled data structures similar to tables, with columns of potentially different data types.

  3. Index: Immutable array-like objects that hold axis labels for both Series and DataFrame objects.

Understanding these structures is essential for effective data manipulation in Python. Each object has its specific use cases, methods, and properties that make Pandas a versatile tool for data analysis.

Additional Resources and Exercises

Practice Exercises

  1. Series Practice: Create a Series of the monthly expenses for a year and calculate the average monthly expense.

  2. DataFrame Manipulation:

    • Create a DataFrame of students with columns for name, age, grade, and subjects.
    • Add a new column for pass/fail status based on grades.
    • Filter to find all students who passed and are under 15.

  3. Data Analysis Challenge:

    • Import a CSV file of your choice using pd.read_csv()
    • Perform basic exploratory data analysis
    • Create at least two meaningful insights from the data

  4. Multi-index Challenge: Create a DataFrame with a hierarchical index representing sales data by region, city, and product, then perform group operations to find the best-selling product in each region.

Remember that the best way to learn Pandas is through practice with real data!


