Pandas SciPy Integration

Introduction

Pandas is a powerful data manipulation library that's essential for data analysis in Python, while SciPy (Scientific Python) provides advanced scientific computing capabilities including optimization, linear algebra, statistics, and more. When these two libraries are used together, they create a robust environment for scientific data analysis.

In this tutorial, we'll explore how to effectively integrate Pandas DataFrames with SciPy's scientific computing functions. This integration allows you to perform advanced statistical analyses, optimization, signal processing, and more on your structured data.

Prerequisites

Before diving into the integration details, make sure you have the following libraries installed:

bash
pip install pandas scipy numpy matplotlib

Let's start by importing the necessary libraries:

python
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

Basic SciPy Integration with Pandas

Statistical Functions

One of the most common ways to use SciPy with Pandas is for statistical calculations. Let's see how to apply SciPy's statistical functions to Pandas DataFrames.

python
# Create a sample DataFrame
df = pd.DataFrame({
    'A': np.random.normal(0, 1, 100),
    'B': np.random.normal(5, 3, 100),
    'C': np.random.normal(-2, 2, 100)
})

# Display the first few rows
print(df.head())

Output:

          A         B         C
0  1.764052  4.001144 -2.263112
1  0.400157  8.734991 -1.026945
2  0.978738  2.240893 -3.087586
3  2.240893  8.159203 -3.147797
4 -0.977278  3.357098 -0.894017

Now, let's use SciPy's stats module to perform statistical tests on our data:

python
from scipy import stats

# Perform a one-sample t-test on column A
t_stat, p_value = stats.ttest_1samp(df['A'], 0)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

Output:

T-statistic: 0.3871
P-value: 0.6996

This test checks whether the mean of column A differs significantly from 0. Since the p-value is well above 0.05, we fail to reject the null hypothesis that the mean is 0.
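
The same test can be run on every column at once by combining SciPy with Pandas' apply method. Here is a minimal sketch that reuses the df and the stats import from above:

python
# Run the one-sample t-test (against a population mean of 0) on each column
def ttest_vs_zero(col):
    t_stat, p_value = stats.ttest_1samp(col, 0)
    return pd.Series({'t_stat': t_stat, 'p_value': p_value})

ttest_results = df.apply(ttest_vs_zero)
print(ttest_results)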

Performing Correlation Analysis

SciPy offers various correlation methods that can be applied to Pandas DataFrames:

python
# Calculate Pearson correlation between columns A and B
pearson_corr, pearson_p = stats.pearsonr(df['A'], df['B'])
print(f"Pearson correlation: {pearson_corr:.4f}")
print(f"P-value: {pearson_p:.4f}")

# Calculate Spearman rank correlation
spearman_corr, spearman_p = stats.spearmanr(df['A'], df['B'])
print(f"Spearman correlation: {spearman_corr:.4f}")
print(f"P-value: {spearman_p:.4f}")

Output:

Pearson correlation: -0.0327
P-value: 0.7472
Spearman correlation: -0.0189
P-value: 0.8526
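
When a DataFrame has more than two columns, the pairwise tests can be collected into a tidy results table. A small sketch that loops over every pair of columns in the df created earlier:

python
from itertools import combinations

# Pairwise Pearson correlations (with p-values) for every pair of columns
rows = []
for col1, col2 in combinations(df.columns, 2):
    corr, p = stats.pearsonr(df[col1], df[col2])
    rows.append({'pair': f'{col1}-{col2}', 'pearson_r': corr, 'p_value': p})

corr_table = pd.DataFrame(rows)
print(corr_table)

Pandas' own df.corr() computes the correlation matrix directly, but SciPy is needed when you also want the accompanying p-values.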

Advanced Statistical Analysis

Probability Distributions

SciPy's stats module provides many probability distributions that can be used to analyze Pandas data:

python
# Create a sample data series
data = pd.Series(np.random.normal(size=1000))

# Fit a normal distribution to the data
mu, sigma = stats.norm.fit(data)
print(f"Fitted normal distribution: μ = {mu:.4f}, σ = {sigma:.4f}")

# Plot the histogram and the fitted PDF
x = np.linspace(data.min(), data.max(), 100)
pdf_fitted = stats.norm.pdf(x, mu, sigma)

plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, density=True, alpha=0.7, label='Data')
plt.plot(x, pdf_fitted, 'r-', label=f'Fitted Normal: μ={mu:.2f}, σ={sigma:.2f}')
plt.title('Data Histogram with Fitted Normal Distribution')
plt.legend()
plt.xlabel('Value')
plt.ylabel('Density')

Output:

Fitted normal distribution: μ = -0.0093, σ = 0.9909

The code above will display a histogram of your data with the fitted normal distribution overlaid.
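
The visual check can be backed up with a goodness-of-fit test. One option is a Kolmogorov-Smirnov test against the fitted parameters, sketched below with the data, mu, and sigma from above (note that the p-value is only approximate when the parameters are estimated from the same sample):

python
# Kolmogorov-Smirnov test of the data against the fitted normal distribution
ks_stat, ks_p = stats.kstest(data, 'norm', args=(mu, sigma))
print(f"K-S statistic: {ks_stat:.4f}, p-value: {ks_p:.4f}")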

Kernel Density Estimation

SciPy's stats module provides kernel density estimation which can be applied to Pandas Series:

python
# Create bimodal data
bimodal_data = pd.Series(np.concatenate([
    np.random.normal(-3, 1, 500),
    np.random.normal(3, 1, 500)
]))

# Kernel density estimation
kde = stats.gaussian_kde(bimodal_data)

# Plot the histogram and KDE
x = np.linspace(bimodal_data.min(), bimodal_data.max(), 1000)
plt.figure(figsize=(10, 6))
plt.hist(bimodal_data, bins=50, density=True, alpha=0.6, label='Data')
plt.plot(x, kde(x), 'r-', linewidth=3, label='KDE')
plt.title('Bimodal Data with Kernel Density Estimation')
plt.legend()
plt.xlabel('Value')
plt.ylabel('Density')

This creates a visualization of bimodal data with kernel density estimation overlaid on the histogram.
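
Because the fitted KDE object is callable, the estimated density can also be evaluated at arbitrary points and stored back in a DataFrame, for example to flag low-density (outlying) observations. A minimal sketch reusing bimodal_data and kde from above:

python
# Evaluate the estimated density at each observation and keep it as a column
bimodal_df = bimodal_data.to_frame(name='value')
bimodal_df['density'] = kde(bimodal_df['value'])

# The lowest-density rows are the most unusual under the estimate
print(bimodal_df.nsmallest(5, 'density'))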

Signal Processing with Pandas and SciPy

Filtering Time Series Data

SciPy's signal processing capabilities can be used to filter time series data in Pandas:

python
from scipy import signal

# Create a time series with noise
np.random.seed(42)
time = pd.date_range(start='2023-01-01', periods=1000, freq='H')
original_signal = np.sin(np.linspace(0, 10*np.pi, 1000))
noisy_signal = original_signal + np.random.normal(0, 0.5, 1000)

# Create a DataFrame
ts_df = pd.DataFrame({
    'time': time,
    'original': original_signal,
    'noisy': noisy_signal
})

# Apply a Butterworth filter
b, a = signal.butter(3, 0.05)  # 3rd-order Butterworth low-pass filter (cutoff at 0.05 of the Nyquist frequency)
ts_df['filtered'] = signal.filtfilt(b, a, ts_df['noisy'])

# Plot the results
plt.figure(figsize=(12, 8))
plt.subplot(3, 1, 1)
plt.plot(ts_df['time'][:200], ts_df['original'][:200])
plt.title('Original Signal')
plt.subplot(3, 1, 2)
plt.plot(ts_df['time'][:200], ts_df['noisy'][:200])
plt.title('Noisy Signal')
plt.subplot(3, 1, 3)
plt.plot(ts_df['time'][:200], ts_df['filtered'][:200])
plt.title('Filtered Signal')
plt.tight_layout()

This example shows how to apply a Butterworth filter to clean up noisy time series data, which is a common task in signal processing.
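
Because this synthetic example keeps the clean signal around, the effect of the filter can be quantified directly. A quick sketch using the ts_df built above:

python
# Root-mean-square error relative to the known clean signal
rmse_noisy = np.sqrt(np.mean((ts_df['noisy'] - ts_df['original']) ** 2))
rmse_filtered = np.sqrt(np.mean((ts_df['filtered'] - ts_df['original']) ** 2))
print(f"RMSE before filtering: {rmse_noisy:.4f}")
print(f"RMSE after filtering:  {rmse_filtered:.4f}")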

Optimization with SciPy and Pandas

SciPy's optimization capabilities can be used with Pandas to solve complex problems:

python
from scipy import optimize

# Example: Find the parameters that minimize the sum of squared errors
# Create sample data
x_data = np.linspace(0, 10, 100)
y_data = 3.5 * np.exp(-0.4 * x_data) + np.random.normal(0, 0.2, 100)
data_df = pd.DataFrame({'x': x_data, 'y': y_data})

# Define the function to fit
def exp_func(x, a, b):
    return a * np.exp(b * x)

# Define the error function to minimize
def error_func(params, x, y):
    a, b = params
    y_fit = exp_func(x, a, b)
    return np.sum((y - y_fit) ** 2)

# Initial guess
initial_guess = [1.0, -1.0]

# Minimize the error function
result = optimize.minimize(error_func, initial_guess, args=(data_df['x'], data_df['y']))
optimal_a, optimal_b = result.x

print(f"Optimal parameters: a = {optimal_a:.4f}, b = {optimal_b:.4f}")

# Plot the data and the fitted function
plt.figure(figsize=(10, 6))
plt.scatter(data_df['x'], data_df['y'], alpha=0.6, label='Data')
plt.plot(x_data, exp_func(x_data, optimal_a, optimal_b), 'r-', linewidth=2,
         label=f'Fit: {optimal_a:.2f}*exp({optimal_b:.2f}*x)')
plt.title('Exponential Function Fit to Noisy Data')
plt.legend()
plt.xlabel('x')
plt.ylabel('y')

Output:

Optimal parameters: a = 3.5243, b = -0.4023

This example shows how to fit an exponential function to noisy data by minimizing the sum of squared errors using SciPy's optimization tools.
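
For least-squares curve fitting specifically, SciPy also provides scipy.optimize.curve_fit, which wraps this pattern and additionally returns a covariance estimate for the fitted parameters. A sketch reusing exp_func, data_df, and initial_guess from above:

python
from scipy.optimize import curve_fit

# Fit the same model directly; pcov is the estimated covariance of the parameters
popt, pcov = curve_fit(exp_func, data_df['x'], data_df['y'], p0=initial_guess)
perr = np.sqrt(np.diag(pcov))  # one-standard-deviation parameter uncertainties

print(f"curve_fit parameters: a = {popt[0]:.4f}, b = {popt[1]:.4f}")
print(f"Standard errors:      {perr[0]:.4f}, {perr[1]:.4f}")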

Interpolation with SciPy and Pandas

SciPy provides powerful interpolation functions that work well with Pandas DataFrames:

python
from scipy import interpolate

# Create sparse data
x_sparse = np.linspace(0, 10, 10)
y_sparse = np.sin(x_sparse)
sparse_df = pd.DataFrame({'x': x_sparse, 'y': y_sparse})

# Create a fine grid for interpolation
x_fine = np.linspace(0, 10, 100)

# Linear interpolation
f_linear = interpolate.interp1d(sparse_df['x'], sparse_df['y'])
y_linear = f_linear(x_fine)

# Cubic spline interpolation
f_cubic = interpolate.interp1d(sparse_df['x'], sparse_df['y'], kind='cubic')
y_cubic = f_cubic(x_fine)

# Create a DataFrame with all results
result_df = pd.DataFrame({
    'x': x_fine,
    'linear_interp': y_linear,
    'cubic_interp': y_cubic,
    'true_value': np.sin(x_fine)
})

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(sparse_df['x'], sparse_df['y'], s=50, c='black', label='Data Points')
plt.plot(result_df['x'], result_df['true_value'], 'b-', label='True Function')
plt.plot(result_df['x'], result_df['linear_interp'], 'r--', label='Linear Interpolation')
plt.plot(result_df['x'], result_df['cubic_interp'], 'g-', label='Cubic Interpolation')
plt.legend()
plt.title('Interpolation Methods Comparison')
plt.xlabel('x')
plt.ylabel('y')

This example demonstrates how to use different interpolation methods from SciPy with Pandas to estimate values between data points.
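
The same machinery is handy for filling gaps in a Series. The sketch below builds a hypothetical series with two missing interior values on the same x_sparse grid and fills them with a cubic interpolant (Pandas' Series.interpolate also offers spline-based methods that rely on SciPy under the hood):

python
# A series with two missing interior values (positions 3 and 6 are arbitrary choices)
gapped = pd.Series(np.sin(x_sparse), index=x_sparse)
gapped.iloc[[3, 6]] = np.nan

# Build a cubic interpolant from the known points and fill the gaps
known = gapped.dropna()
f_fill = interpolate.interp1d(known.index, known.values, kind='cubic')

mask = gapped.isna().to_numpy()
filled = gapped.copy()
filled[mask] = f_fill(gapped.index[mask])
print(filled)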

Spatial Data Analysis

SciPy's spatial module can be used with Pandas for spatial data analysis:

python
from scipy import spatial

# Create spatial data
np.random.seed(42)
points = np.random.rand(100, 2) * 10 # 100 random points in 2D
points_df = pd.DataFrame(points, columns=['x', 'y'])

# Compute the KDTree for efficient nearest neighbor queries
tree = spatial.KDTree(points_df[['x', 'y']].values)

# Find the nearest neighbor for a query point
query_point = np.array([5, 5])
distance, index = tree.query(query_point)
nearest_point = points_df.iloc[index]

print(f"Query point: ({query_point[0]}, {query_point[1]})")
print(f"Nearest point: ({nearest_point['x']:.4f}, {nearest_point['y']:.4f})")
print(f"Distance: {distance:.4f}")

# Plot the points and the query point
plt.figure(figsize=(8, 8))
plt.scatter(points_df['x'], points_df['y'], alpha=0.6, label='Data Points')
plt.scatter(query_point[0], query_point[1], c='red', s=100, label='Query Point')
plt.scatter(nearest_point['x'], nearest_point['y'], c='green', s=100, label='Nearest Point')
plt.plot([query_point[0], nearest_point['x']], [query_point[1], nearest_point['y']], 'k--')
plt.title('Nearest Neighbor Search with KDTree')
plt.legend()
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')

Output:

Query point: (5, 5)
Nearest point: (5.4429, 4.7242)
Distance: 0.7968

This example shows how to use SciPy's spatial module to find the nearest neighbor to a query point in a spatial dataset.
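
The same tree also handles k-nearest-neighbour and fixed-radius queries, which pair naturally with Pandas indexing. A short sketch reusing tree, points_df, and query_point from above (the radius of 1.0 is an arbitrary choice for illustration):

python
# Five nearest neighbours of the query point
distances, indices = tree.query(query_point, k=5)
print(points_df.iloc[indices])

# All points within a radius of 1.0 of the query point
within = tree.query_ball_point(query_point, r=1.0)
print(f"Points within radius 1.0: {len(within)}")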

Real-World Application: Financial Data Analysis

Let's bring together what we've learned to analyze financial time series data:

python
# Create sample stock price data
dates = pd.date_range('2022-01-01', periods=252) # Trading days in a year
prices = 100 * np.cumprod(1 + np.random.normal(0.001, 0.02, 252))  # simulate prices from daily returns ~ N(0.001, 0.02)
stock_df = pd.DataFrame({'Date': dates, 'Price': prices})
stock_df.set_index('Date', inplace=True)

# Calculate daily returns
stock_df['Return'] = stock_df['Price'].pct_change().fillna(0)

# Fit a normal distribution to the returns
mu, sigma = stats.norm.fit(stock_df['Return'])
print(f"Return distribution: μ = {mu:.6f}, σ = {sigma:.6f}")

# Test for normality using the Shapiro-Wilk test
shapiro_test = stats.shapiro(stock_df['Return'])
print(f"Shapiro-Wilk test: W = {shapiro_test.statistic:.4f}, p-value = {shapiro_test.pvalue:.4f}")

# Apply a simple moving average filter
window_size = 20
b = np.ones(window_size) / window_size
stock_df['SMA'] = signal.lfilter(b, 1, stock_df['Price'])

# Plot the results
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))

# Plot stock price and SMA
ax1.plot(stock_df.index, stock_df['Price'], label='Stock Price')
ax1.plot(stock_df.index, stock_df['SMA'], 'r-', label=f'{window_size}-day SMA')
ax1.set_title('Stock Price with Simple Moving Average')
ax1.legend()
ax1.grid(True)

# Plot return distribution
x = np.linspace(min(stock_df['Return']), max(stock_df['Return']), 100)
pdf_fitted = stats.norm.pdf(x, mu, sigma)
ax2.hist(stock_df['Return'], bins=30, density=True, alpha=0.7, label='Daily Returns')
ax2.plot(x, pdf_fitted, 'r-', linewidth=2,
         label=f'Normal: μ={mu:.4f}, σ={sigma:.4f}')
ax2.set_title('Distribution of Daily Returns')
ax2.legend()
ax2.grid(True)

plt.tight_layout()

Output:

Return distribution: μ = 0.000938, σ = 0.019992
Shapiro-Wilk test: W = 0.9946, p-value = 0.4825
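
As a sanity check, the moving average computed with signal.lfilter should match Pandas' own rolling mean once the filter's warm-up period has passed (lfilter implicitly pads the start of the series with zeros, so the first window_size - 1 values differ). A small sketch reusing stock_df and window_size from above:

python
# Compare the lfilter-based SMA with a trailing rolling mean
rolling_sma = stock_df['Price'].rolling(window_size).mean()
max_diff = (stock_df['SMA'].iloc[window_size:] - rolling_sma.iloc[window_size:]).abs().max()
print(f"Max difference after the warm-up period: {max_diff:.2e}")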

This comprehensive example shows how to:

  1. Analyze financial time series data with Pandas
  2. Fit a normal distribution to returns using SciPy's statistics functions
  3. Test for normality using the Shapiro-Wilk test
  4. Apply a signal processing filter to calculate a moving average
  5. Visualize the results with matplotlib

Summary

In this tutorial, we've explored the powerful integration of Pandas and SciPy for scientific data analysis. We've covered:

  • Basic statistical analysis using SciPy's stats module with Pandas DataFrames
  • Probability distributions and kernel density estimation
  • Signal processing for time series data filtering
  • Optimization techniques for curve fitting
  • Interpolation methods for sparse data
  • Spatial data analysis with KDTree
  • A real-world application in financial data analysis

The combination of Pandas' data manipulation capabilities with SciPy's scientific computing functions creates a versatile toolkit for data scientists and analysts. This integration allows you to efficiently organize, transform, and analyze complex datasets while leveraging advanced scientific computing algorithms.

Additional Resources and Exercises

Exercises

  1. Signal Processing: Download a publicly available dataset of daily temperature readings. Use SciPy's signal processing functions to remove high-frequency noise and identify seasonal patterns.

  2. Statistical Testing: Compare two columns in a dataset using different statistical tests from SciPy (t-test, Mann-Whitney U test, etc.) and interpret the results.

  3. Curve Fitting: Create a dataset with an exponential trend and some noise. Use SciPy's optimization functions to fit various functions (exponential, logarithmic, polynomial) and determine which provides the best fit.

  4. Spatial Analysis: Use a dataset containing geographical coordinates. Implement a nearest-neighbor search to find points of interest within a certain radius of a given location.

  5. Interpolation Challenge: Take a time series with missing values. Compare different interpolation methods from SciPy to fill the gaps and evaluate which method produces the most accurate results.

These exercises will help you apply the concepts learned in this tutorial and gain practical experience with Pandas and SciPy integration for data analysis.


