Pandas Series
Introduction
In pandas, a Series is one of the fundamental data structures that serves as a building block for working with data. You can think of a Series as a one-dimensional labeled array capable of holding data of any type (integers, strings, floating-point numbers, Python objects, etc.). It's similar to a column in a spreadsheet or a database table, or a single variable in your dataset.
A Series is characterized by:
- An index that labels each element in the Series
- A collection of data values
- The ability to hold any data type (mixed types are allowed, but they are stored with the generic object dtype)
- A variety of built-in methods for data manipulation and analysis
In this guide, we'll explore how to create, manipulate, and work with pandas Series objects.
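As a quick illustration of the "any data type" point above, the sketch below compares a Series of plain integers with one that mixes types. The variable names `ints` and `mixed` are only for illustration.

```python
import pandas as pd

# A Series of plain integers gets a compact numeric dtype
ints = pd.Series([1, 2, 3])

# Mixing integers, strings, and floats falls back to the generic object dtype
mixed = pd.Series([1, "two", 3.0])

print(ints.dtype)   # an integer dtype, typically int64
print(mixed.dtype)  # object
```

Numeric dtypes are what make vectorized operations fast, which is why homogeneous data is usually the better choice.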
Prerequisites
To follow along with this tutorial, make sure you have pandas installed:
pip install pandas
Let's begin by importing pandas:
import pandas as pd
import numpy as np # We'll use numpy in some examples
Creating a Series
From a List
The simplest way to create a pandas Series is from a list:
# Create a Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(s)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
Notice that pandas automatically assigned an index (0 to 4) to our Series.
With Custom Index
You can specify your own index labels when creating a Series:
# Create a Series with custom indices
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
s = pd.Series(data, index=index)
print(s)
Output:
a 10
b 20
c 30
d 40
e 50
dtype: int64
From a Dictionary
When creating a Series from a dictionary, the keys become the index:
# Create a Series from a dictionary
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
s = pd.Series(data)
print(s)
Output:
a 10
b 20
c 30
d 40
e 50
dtype: int64
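A dictionary can also be combined with an explicit index. The sketch below assumes a small example dictionary: the index argument selects and orders the keys, and any label missing from the dictionary is filled with NaN.

```python
import pandas as pd

data = {'a': 10, 'b': 20, 'c': 30}

# 'c' and 'a' are looked up in the dict; 'z' is not a key, so it becomes NaN
s = pd.Series(data, index=['c', 'a', 'z'])
print(s)
```

Because NaN is a floating-point value, the resulting Series is upcast from int64 to float64.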
With Scalar Value
You can create a Series from a single scalar value, which is repeated for each index label:
# Create a Series with a scalar value
s = pd.Series(5, index=['a', 'b', 'c', 'd', 'e'])
print(s)
Output:
a 5
b 5
c 5
d 5
e 5
dtype: int64
From NumPy Arrays
Series can also be created from NumPy arrays:
# Create a Series from a NumPy array
data = np.array([10, 20, 30, 40, 50])
s = pd.Series(data)
print(s)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
Series Attributes
Let's explore some important attributes of a Series:
# Create a sample Series
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
# 1. Values
print("Series values:")
print(s.values)
# 2. Index
print("\nSeries index:")
print(s.index)
# 3. Data type
print("\nSeries data type:")
print(s.dtype)
# 4. Shape
print("\nSeries shape:")
print(s.shape)
# 5. Size
print("\nSeries size:")
print(s.size)
# 6. Name (if set)
s.name = "My Series"
print("\nSeries name:")
print(s.name)
Output:
Series values:
[10 20 30 40 50]
Series index:
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Series data type:
int64
Series shape:
(5,)
Series size:
5
Series name:
My Series
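Besides `values`, a Series offers `to_numpy()` to get the underlying data as a NumPy array, and `rename()` to set the name without modifying the original. The sketch below uses made-up names (`scores`, `points`) purely for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='scores')

# to_numpy() returns the data as a NumPy array
arr = s.to_numpy()
print(type(arr))
print(arr)

# rename() with a scalar returns a new Series with a different name;
# the original Series keeps its own name
renamed = s.rename('points')
print(s.name, renamed.name)
```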
Accessing Series Elements
There are multiple ways to access elements in a Series:
By Position (Integer Location)
Use iloc for integer-based (positional) indexing:
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
# Access single element by position
print(s.iloc[0]) # First element
print(s.iloc[-1]) # Last element
# Slicing by position
print(s.iloc[1:4]) # Elements at positions 1, 2, and 3
Output:
10
50
b 20
c 30
d 40
dtype: int64
By Label (Index)
Use loc for label-based indexing:
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
# Access single element by label
print(s.loc['a'])
# Slicing by label (inclusive of end label)
print(s.loc['b':'d'])
Output:
10
b 20
c 30
d 40
dtype: int64
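The difference between loc and iloc matters most when the index labels are themselves integers. The following sketch uses a deliberately shuffled integer index (an assumption made for illustration) so that label and position disagree.

```python
import pandas as pd

# Labels 2, 0, 1 sit at positions 0, 1, 2
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the element at position 0

print(by_label)     # 20
print(by_position)  # 10
```

Being explicit about loc or iloc avoids this kind of ambiguity.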
Direct Indexing
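You can also index a Series directly with square brackets. With a labeled index like the one below, this lookup is label-based, and it also accepts lists of labels and boolean masks. Older pandas versions fell back to positional access for integer keys such as s[0] on a non-integer index, but that behavior is deprecated in recent releases, so prefer iloc for positions: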
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
# By label
print(s['a'])
# By multiple labels
print(s[['a', 'c', 'e']])
# By boolean condition
print(s[s > 30])
Output:
10
a 10
c 30
e 50
dtype: int64
d 40
e 50
dtype: int64
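Boolean conditions can be combined as well. The sketch below shows the element-wise operators `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Values strictly between 10 and 50
middle = s[(s > 10) & (s < 50)]

# Values equal to either end of the range
extremes = s[(s == 10) | (s == 50)]

# Everything except 30
not_thirty = s[~(s == 30)]

print(middle)
print(extremes)
print(not_thirty)
```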
Series Operations
Series support various operations that make data manipulation easy:
Mathematical Operations
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
# Addition
print(s + 5)
# Multiplication
print(s * 2)
# Power
print(s ** 2)
Output:
a 15
b 25
c 35
d 45
e 55
dtype: int64
a 20
b 40
c 60
d 80
e 100
dtype: int64
a 100
b 400
c 900
d 1600
e 2500
dtype: int64
Operations Between Series
s1 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
# Addition of two Series
print(s1 + s2)
# With different indices
s3 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'f', 'g'])
print(s1 + s3) # Note the NaN values where indices don't align
Output:
a 11
b 22
c 33
d 44
e 55
dtype: int64
a 11.0
b 22.0
c 33.0
d NaN
e NaN
f NaN
g NaN
dtype: float64
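If NaN is not what you want for labels that appear in only one Series, the `add` method accepts a `fill_value` that stands in for the missing side. This is a minimal sketch reusing the s1 and s3 data from above.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
s3 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'f', 'g'])

# Labels present in only one Series are treated as 0 on the other side
total = s1.add(s3, fill_value=0)
print(total)
```

The result is still float64, but 'd' and 'e' keep their values from s1 and 'f' and 'g' keep theirs from s3.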
Applying Functions
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
# Apply a function to each element
print(s.apply(lambda x: x * 2))
# Apply NumPy functions
print(np.sqrt(s))
Output:
a 20
b 40
c 60
d 80
e 100
dtype: int64
a 3.162278
b 4.472136
c 5.477226
d 6.324555
e 7.071068
dtype: float64
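For simple arithmetic, `apply` and the vectorized operators give the same result, and the vectorized form is generally faster on large data. The sketch below checks that equivalence and then uses `map` for a transformation with no vectorized operator. The 'high'/'low' labels and the threshold of 30 are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

via_apply = s.apply(lambda x: x * 2)  # calls the function once per element
vectorized = s * 2                    # operates on the whole array at once
print(via_apply.equals(vectorized))   # True

# map() is handy for element-wise lookups or labeling
labels = s.map(lambda x: 'high' if x >= 30 else 'low')
print(labels)
```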
Common Series Methods
Here are some commonly used methods for Series:
Statistical Methods
s = pd.Series([10, 20, 30, 40, 50])
print(f"Mean: {s.mean()}")
print(f"Median: {s.median()}")
print(f"Standard deviation: {s.std()}")
print(f"Min: {s.min()}")
print(f"Max: {s.max()}")
print(f"Sum: {s.sum()}")
print(f"Description: \n{s.describe()}")
Output:
Mean: 30.0
Median: 30.0
Standard deviation: 15.811388300841896
Min: 10
Max: 50
Sum: 150
Description:
count 5.000000
mean 30.000000
std 15.811388
min 10.000000
25% 20.000000
50% 30.000000
75% 40.000000
max 50.000000
dtype: float64
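The standard deviation shown above is the sample standard deviation: `Series.std()` divides by n - 1 by default (`ddof=1`), whereas `np.std()` divides by n (`ddof=0`). The sketch below compares the two on the same data.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

sample_std = s.std()                   # ddof=1, about 15.81
population_std = s.std(ddof=0)         # ddof=0, about 14.14
numpy_std = np.std(s.to_numpy())       # NumPy's default is also ddof=0

print(sample_std, population_std, numpy_std)
```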
Data Transformation
s = pd.Series([10, 20, 30, 40, 50])
# Cumulative sum
print("Cumulative sum:")
print(s.cumsum())
# Percentage change
print("\nPercentage change:")
print(s.pct_change())
# Shift values (move values by 1)
print("\nShifted values:")
print(s.shift(1))
# Replace values
print("\nReplaced values (30 -> 300):")
print(s.replace(30, 300))
Output:
Cumulative sum:
0 10
1 30
2 60
3 100
4 150
dtype: int64
Percentage change:
0 NaN
1 1.000000
2 0.500000
3 0.333333
4 0.250000
dtype: float64
Shifted values:
0 NaN
1 10.0
2 20.0
3 30.0
4 40.0
dtype: float64
Replaced values (30 -> 300):
0 10
1 20
2 300
3 40
4 50
dtype: int64
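To see how these transformations relate, the sketch below rebuilds `pct_change()` by hand from `shift()`, and shows `diff()` for absolute differences. The name `manual` is illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Percentage change is the current value divided by the previous one, minus 1
manual = s / s.shift(1) - 1
print(np.allclose(manual.to_numpy(), s.pct_change().to_numpy(), equal_nan=True))  # True

# diff() gives the absolute change from the previous element
differences = s.diff()
print(differences)
```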
Filtering and Sorting
s = pd.Series([30, 10, 50, 20, 40], index=['a', 'b', 'c', 'd', 'e'])
# Filtering
print("Values greater than 30:")
print(s[s > 30])
# Checking for values (a bare "50 in s" would test the index labels, not the values)
print("\nIs 50 in the Series?")
print(50 in s.values)
# Sorting by value
print("\nSorted by value:")
print(s.sort_values())
# Sorting by index
print("\nSorted by index:")
print(s.sort_index())
Output:
Values greater than 30:
c 50
e 40
dtype: int64
Is 50 in the Series?
True
Sorted by value:
b 10
d 20
a 30
e 40
c 50
dtype: int64
Sorted by index:
a 30
b 10
c 50
d 20
e 40
dtype: int64
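A few related operations are worth knowing. The sketch below contrasts membership tests on the index and on the values, then shows `isin`, descending sorts, and `nlargest` on the same data as above.

```python
import pandas as pd

s = pd.Series([30, 10, 50, 20, 40], index=['a', 'b', 'c', 'd', 'e'])

# The in operator on a Series checks the index labels, not the values
print(50 in s)          # False
print('c' in s)         # True
print(50 in s.values)   # True

# isin() tests each value against a collection
mask = s.isin([10, 50])
print(s[mask])

# Sort from largest to smallest, or just take the top two
descending = s.sort_values(ascending=False)
top_two = s.nlargest(2)
print(descending)
print(top_two)
```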
Handling Missing Data
s = pd.Series([10, np.nan, 30, np.nan, 50], index=['a', 'b', 'c', 'd', 'e'])
print("Original Series with NaN values:")
print(s)
# Check for null values
print("\nNull value check:")
print(s.isnull())
# Drop null values
print("\nSeries with NaN values dropped:")
print(s.dropna())
# Fill null values
print("\nSeries with NaN values filled with 0:")
print(s.fillna(0))
# Fill null values with forward fill method
print("\nSeries with NaN values forward filled:")
print(s.ffill()) # Same result as the older s.fillna(method='ffill'), which is deprecated in recent pandas
Output:
Original Series with NaN values:
a 10.0
b NaN
c 30.0
d NaN
e 50.0
dtype: float64
Null value check:
a False
b True
c False
d True
e False
dtype: bool
Series with NaN values dropped:
a 10.0
c 30.0
e 50.0
dtype: float64
Series with NaN values filled with 0:
a 10.0
b 0.0
c 30.0
d 0.0
e 50.0
dtype: float64
Series with NaN values forward filled:
a 10.0
b 10.0
c 30.0
d 30.0
e 50.0
dtype: float64
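Two more tools that often come up with missing data are counting the gaps and interpolating across them. The sketch below uses the same Series as above. `interpolate()` with its default linear method treats the values as equally spaced.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, 30, np.nan, 50], index=['a', 'b', 'c', 'd', 'e'])

# isna() returns booleans, so summing counts the missing entries
n_missing = int(s.isna().sum())
print(n_missing)  # 2

# Linear interpolation fills each gap from its neighbors
filled = s.interpolate()
print(filled)
```

Whether forward filling, a constant, or interpolation is appropriate depends on what the data represents.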
Practical Examples
Let's look at some practical examples of using Series in real-world scenarios:
Example 1: Stock Prices Analysis
# Daily closing prices of a stock for one week
dates = pd.date_range('2023-01-01', periods=5, freq='D')
prices = pd.Series([150.5, 152.3, 151.9, 153.7, 155.2], index=dates)
print("Stock prices:")
print(prices)
# Calculate daily returns
daily_returns = prices.pct_change()
print("\nDaily returns:")
print(daily_returns)
# Calculate statistics
print("\nSummary statistics:")
print(daily_returns.describe())
# Find days with positive returns
print("\nDays with positive returns:")
print(prices[daily_returns > 0])
Output:
Stock prices:
2023-01-01 150.5
2023-01-02 152.3
2023-01-03 151.9
2023-01-04 153.7
2023-01-05 155.2
Freq: D, dtype: float64
Daily returns:
2023-01-01 NaN
2023-01-02 0.011960
2023-01-03 -0.002626
2023-01-04 0.011850
2023-01-05 0.009759
Freq: D, dtype: float64
Summary statistics:
count 4.000000
mean 0.007736
std 0.006982
min -0.002626
25% 0.006663
50% 0.010805
75% 0.011877
max 0.011960
dtype: float64
Days with positive returns:
2023-01-02 152.3
2023-01-04 153.7
2023-01-05 155.2
dtype: float64
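One possible extension of this example is to smooth the prices with a rolling mean. The sketch below assumes a 3-day window, which is an arbitrary choice for illustration. The first two entries are NaN because a full window is not yet available.

```python
import pandas as pd

dates = pd.date_range('2023-01-01', periods=5, freq='D')
prices = pd.Series([150.5, 152.3, 151.9, 153.7, 155.2], index=dates)

# Mean of each price and the two before it
rolling_mean = prices.rolling(window=3).mean()
print(rolling_mean)
```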
Example 2: Sales Data Analysis
# Monthly sales data for a year
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
sales = pd.Series([10500, 12600, 14800, 13900, 15200, 16800,
                   17500, 18100, 16400, 15300, 14700, 19200], index=months)
print("Monthly sales:")
print(sales)
# Calculate total yearly sales
print(f"\nTotal yearly sales: ${sales.sum():,}")
# Find the month with highest sales
print(f"\nMonth with highest sales: {sales.idxmax()} (${sales.max():,})")
# Find the month with lowest sales
print(f"\nMonth with lowest sales: {sales.idxmin()} (${sales.min():,})")
# Calculate quarterly sales
q1 = sales.iloc[0:3].sum()
q2 = sales.iloc[3:6].sum()
q3 = sales.iloc[6:9].sum()
q4 = sales.iloc[9:12].sum()
quarterly_sales = pd.Series([q1, q2, q3, q4], index=['Q1', 'Q2', 'Q3', 'Q4'])
print("\nQuarterly sales:")
print(quarterly_sales)
Output:
Monthly sales:
Jan 10500
Feb 12600
Mar 14800
Apr 13900
May 15200
Jun 16800
Jul 17500
Aug 18100
Sep 16400
Oct 15300
Nov 14700
Dec 19200
dtype: int64
Total yearly sales: $185,000
Month with highest sales: Dec ($19,200)
Month with lowest sales: Jan ($10,500)
Quarterly sales:
Q1 37900
Q2 45900
Q3 52000
Q4 49200
dtype: int64
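The four quarterly sums can also be computed in a single step with `groupby`. The sketch below is one way to do it, assuming the months are in calendar order: integer-dividing each position by 3 assigns the months to groups 0 through 3. The name `quarter_ids` is illustrative.

```python
import numpy as np
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
sales = pd.Series([10500, 12600, 14800, 13900, 15200, 16800,
                   17500, 18100, 16400, 15300, 14700, 19200], index=months)

# Positions 0-2 map to 0, 3-5 to 1, 6-8 to 2, and 9-11 to 3
quarter_ids = np.arange(len(sales)) // 3

quarterly = sales.groupby(quarter_ids).sum()
quarterly.index = ['Q1', 'Q2', 'Q3', 'Q4']
print(quarterly)
```

This gives the same totals as the slice-based version and adapts more easily if the grouping changes.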
Example 3: Customer Survey Ratings
# Customer satisfaction ratings (1-5 scale)
ratings = pd.Series([5, 4, 4, 5, 3, 2, 5, 5, 4, 3, 5, 4, 5, 3, 4])
# Count of each rating
rating_counts = ratings.value_counts().sort_index()
print("Rating distribution:")
print(rating_counts)
# Percentage of each rating
rating_percent = ratings.value_counts(normalize=True).sort_index() * 100
print("\nPercentage distribution:")
print(rating_percent.map('{:.1f}%'.format))
# Average rating
print(f"\nAverage rating: {ratings.mean():.2f} out of 5")
# Percentage of satisfied customers (rating 4 or 5)
satisfied = ((ratings >= 4).sum() / ratings.size) * 100
print(f"\nPercentage of satisfied customers: {satisfied:.1f}%")
Output:
Rating distribution:
2 1
3 3
4 5
5 6
dtype: int64
Percentage distribution:
2 6.7%
3 20.0%
4 33.3%
5 40.0%
dtype: object
Average rating: 4.07 out of 5
Percentage of satisfied customers: 73.3%
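A slightly shorter way to get the satisfied share is to take the mean of the boolean mask, since True counts as 1 and False as 0. The sketch below reuses the ratings from above and checks that both formulas agree.

```python
import pandas as pd

ratings = pd.Series([5, 4, 4, 5, 3, 2, 5, 5, 4, 3, 5, 4, 5, 3, 4])

# The mean of a boolean Series is the fraction of True values
satisfied_share = (ratings >= 4).mean() * 100
original_formula = ((ratings >= 4).sum() / ratings.size) * 100

print(f"{satisfied_share:.1f}%")  # 73.3%
```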
Series vs. Other Python Data Structures
It's helpful to understand how pandas Series compares to other Python data structures:
Feature | pandas Series | Python List | NumPy Array | Python Dictionary |
---|---|---|---|---|
Labeled index | ✓ | ✗ | ✗ | ✓ (keys) |
Homogeneous data | Recommended but not required | ✗ | ✓ | ✗ |
Math operations | ✓ (vectorized) | ✗ | ✓ (vectorized) | ✗ |
Built-in data analysis | ✓ | ✗ | Limited | ✗ |
Missing data handling | ✓ | ✗ | Limited | ✗ |
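The "Math operations" row is easy to see in practice. In the sketch below, multiplying a list by 2 repeats it, while a NumPy array and a Series multiply element by element. The Series additionally keeps its labels.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

list_result = values * 2                                          # repetition
array_result = np.array(values) * 2                               # element-wise
series_result = pd.Series(values, index=['a', 'b', 'c']) * 2      # element-wise, labeled

print(list_result)    # [1, 2, 3, 1, 2, 3]
print(array_result)   # [2 4 6]
print(series_result)
```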
Summary
In this guide, we've covered the pandas Series object in detail:
- A Series is a one-dimensional labeled array capable of holding any data type
- Series objects can be created from various data structures like lists, dictionaries, and NumPy arrays
- Series have a flexible indexing system that allows access by position or label
- Series support vectorized operations and built-in methods for data analysis
- Series provide extensive functionality for handling missing data and data manipulation
Together with the DataFrame, the Series is a fundamental building block of pandas and enables powerful, efficient data analysis in Python. Understanding how to work with Series is essential for any data analysis workflow that uses pandas.
Exercises
To practice your Series skills, try these exercises:
- Create a Series containing the temperatures (in Celsius) for a week and convert them to Fahrenheit (F = C * 9/5 + 32).
- Given a Series of monthly expenses, calculate the total, average, minimum, and maximum expenses.
- Create a Series of rainfall data with some missing values, then calculate the average rainfall after filling missing values with the mean.
- Create a Series of test scores and calculate what percentage of students scored above the average.
- Create a Series of stock prices and calculate the daily percentage change, then identify the day with the highest price increase.
Happy coding with pandas Series!