Pandas Scikit-learn Integration

Introduction

When working on machine learning projects, two Python libraries often go hand in hand: pandas and scikit-learn (often abbreviated as sklearn). Pandas provides powerful data structures and data manipulation capabilities, while scikit-learn offers a wide range of machine learning algorithms. Integrating these two libraries creates a seamless workflow for data preprocessing, model training, and evaluation.

In this tutorial, we'll explore how to effectively combine pandas DataFrames with scikit-learn's machine learning tools to build predictive models. This integration is fundamental for any data science or machine learning project, as it bridges the gap between data manipulation and model development.

Why Integrate Pandas with Scikit-learn?

Before diving into the technical details, let's understand why this integration is so valuable:

  1. Efficient workflow: Move from data cleaning to model training without leaving the Python ecosystem
  2. Better data understanding: Leverage pandas' analysis capabilities before applying machine learning
  3. Feature engineering: Use pandas' powerful functions to create and transform features
  4. Simplified preprocessing: Easily prepare your data for machine learning algorithms

Prerequisites

To follow along with this tutorial, you should have:

  • Basic understanding of Python
  • Familiarity with pandas DataFrames
  • Python environment with the following packages installed:
python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
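
If any of these are missing, a single pip install pandas numpy scikit-learn matplotlib seaborn will bring in everything used in this tutorial (matplotlib and seaborn appear later, in the exploration section).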

Loading Data with Pandas

Let's start by loading a dataset using pandas. We'll use the famous Iris dataset, which is often used for introductory machine learning projects:

python
# Load the Iris dataset
from sklearn.datasets import load_iris

# Load data
iris = load_iris()

# Create a DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add target column
df['target'] = iris.target

# Display the first few rows
print(df.head())

Output:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0

Data Exploration with Pandas before Machine Learning

Before applying machine learning algorithms, it's crucial to understand your data. Pandas provides excellent tools for this purpose:

python
# Basic statistics
print(df.describe())

# Check for missing values
print(df.isnull().sum())

# Visualize correlations
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlations')
plt.show()

This exploratory analysis helps you understand your data's distributions and relationships, which informs your feature engineering and model selection decisions.
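
For instance, grouping by the target shows how strongly each feature separates the three classes — a quick pandas-side preview of what a classifier will pick up on:

python
# Per-class feature means: a quick look at class separability
print(df.groupby('target').mean())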

Preparing Data for Scikit-learn

Scikit-learn expects data in a specific format: features (X) and target (y) separated, with numerical values. Pandas makes this preparation straightforward:

python
# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Check shapes
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

Output:

X shape: (150, 4)
y shape: (150,)

Handling Categorical Variables

If your dataset contains categorical variables, you'll need to convert them to numerical values before using scikit-learn. Pandas offers seamless integration with scikit-learn's preprocessing tools:

python
# Example with categorical data
data = {'color': ['red', 'blue', 'green', 'red', 'blue'],
        'size': ['small', 'medium', 'large', 'medium', 'small'],
        'price': [10.5, 15.0, 20.0, 12.0, 14.5]}

categorical_df = pd.DataFrame(data)

# Using sklearn's preprocessing with pandas
from sklearn.preprocessing import OneHotEncoder

# Select categorical columns
cat_cols = ['color', 'size']
cat_data = categorical_df[cat_cols]

# Apply one-hot encoding
encoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded_data = encoder.fit_transform(cat_data)

# Create DataFrame with encoded values
encoded_df = pd.DataFrame(
    encoded_data,
    columns=encoder.get_feature_names_out(cat_cols)
)

# Combine with numerical columns
numerical_cols = ['price']
result_df = pd.concat([categorical_df[numerical_cols].reset_index(drop=True),
                       encoded_df.reset_index(drop=True)], axis=1)

print(result_df.head())

Output:

   price  color_blue  color_green  color_red  size_large  size_medium  size_small
0   10.5         0.0          0.0        1.0         0.0          0.0         1.0
1   15.0         1.0          0.0        0.0         0.0          1.0         0.0
2   20.0         0.0          1.0        0.0         1.0          0.0         0.0
3   12.0         0.0          0.0        1.0         0.0          1.0         0.0
4   14.5         1.0          0.0        0.0         0.0          0.0         1.0
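
For quick, one-off encoding, pandas' own get_dummies produces the same kind of result in a single call. OneHotEncoder remains the better choice inside a training workflow, though, because it remembers the categories it saw during fit and applies them consistently to new data:

python
# pandas-native one-hot encoding; dtype=float matches the encoder output above
dummies_df = pd.get_dummies(categorical_df, columns=['color', 'size'], dtype=float)
print(dummies_df.head())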

Train-Test Split with Scikit-learn and Pandas

A crucial step in machine learning is splitting your data into training and testing sets. Scikit-learn makes this easy while maintaining the pandas DataFrame structure:

python
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# train_test_split preserves the DataFrame type, so the column names are
# already intact; these aliases just keep the naming consistent below
X_train_df = X_train
X_test_df = X_test

print(f"Training set shape: {X_train_df.shape}")
print(f"Testing set shape: {X_test_df.shape}")

Output:

Training set shape: (105, 4)
Testing set shape: (45, 4)
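
For classification tasks it is often worth passing stratify=y as well, so that each class appears in the same proportion in both splits. A minimal variant of the call above:

python
# Stratified split: preserves the class distribution of y in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)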

Feature Scaling with Scikit-learn

Many machine learning algorithms perform better with scaled features. Let's use scikit-learn's StandardScaler with our pandas DataFrame:

python
# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train_df),
    columns=X_train_df.columns
)

# Transform the test data (using the same scaler)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test_df),
    columns=X_test_df.columns
)

# Display scaled data
print("Scaled training data:")
print(X_train_scaled.head())

Output:

Scaled training data:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0           1.257209         -0.142344           0.536587          0.693569
1          -1.561078         -0.142344          -1.328294         -1.253787
2           0.806720         -0.142344           0.699175          0.693569
3           0.205977          0.772163          -0.028354          0.165873
4           0.205977         -0.599097           0.211410          0.165873
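
If you are on scikit-learn 1.2 or newer, the set_output API removes the need for the manual pd.DataFrame wrapping above: transformers can be asked to return DataFrames directly. A minimal sketch:

python
# Ask the scaler to emit DataFrames instead of NumPy arrays (scikit-learn >= 1.2)
scaler = StandardScaler().set_output(transform="pandas")
X_train_scaled = scaler.fit_transform(X_train_df)  # DataFrame, column names preserved
X_test_scaled = scaler.transform(X_test_df)
print(type(X_train_scaled))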

Model Training and Evaluation

Now, we can train a machine learning model using scikit-learn and our prepared pandas data:

python
# Initialize the model
model = LogisticRegression(random_state=42, max_iter=200)

# Train the model
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Output:

Accuracy: 1.0000

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      1.00      1.00        17
           2       1.00      1.00      1.00        12

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
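
A perfect score on a single 45-row test set says more about Iris than about the model, so it is worth confirming with cross-validation. scikit-learn's cross-validation utilities accept pandas objects directly; a sketch using the unsplit data:

python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation straight on the pandas DataFrame/Series
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")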

Feature Importance Analysis with Pandas

After training your model, pandas can help you analyze feature importance:

python
# For logistic regression, we can examine coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]  # row 0 = class-0 coefficients; multiclass models have one row per class
})

# Sort by absolute value for importance
coefficients['Absolute'] = coefficients['Coefficient'].abs()
coefficients = coefficients.sort_values('Absolute', ascending=False)

print("Feature importance:")
print(coefficients)

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(coefficients['Feature'], coefficients['Absolute'])
plt.xlabel('Absolute Coefficient Value')
plt.title('Feature Importance')
plt.tight_layout()
plt.show()

Pipeline Integration

One of the most powerful ways to integrate pandas with scikit-learn is through pipelines. Pipelines allow you to combine preprocessing steps and modeling into a single workflow:

python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Create a sample dataset with mixed types and missing values
data = {
    'age': [25, 30, np.nan, 22, 35],
    'income': [50000, np.nan, 70000, 45000, 90000],
    'gender': ['male', 'female', 'female', 'male', 'male'],
    'category': ['A', 'B', 'A', 'C', 'B'],
    'outcome': [0, 1, 1, 0, 1]
}

mixed_df = pd.DataFrame(data)
print("Original data:")
print(mixed_df)

# Identify column types
numeric_features = ['age', 'income']
categorical_features = ['gender', 'category']

# Define preprocessing steps for different column types
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessors
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create the full pipeline
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Prepare data
X_pipeline = mixed_df.drop('outcome', axis=1)
y_pipeline = mixed_df['outcome']

# Apply the pipeline
full_pipeline.fit(X_pipeline, y_pipeline)

# Make predictions
pipeline_preds = full_pipeline.predict(X_pipeline)
print("\nPredictions:", pipeline_preds)

Output:

Original data:
    age   income  gender category  outcome
0  25.0  50000.0    male        A        0
1  30.0      NaN  female        B        1
2   NaN  70000.0  female        A        1
3  22.0  45000.0    male        C        0
4  35.0  90000.0    male        B        1

Predictions: [0 1 1 0 1]
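
After fitting, you can also pull the transformed design matrix back into pandas for inspection. The feature names come from the fitted ColumnTransformer via get_feature_names_out (available in recent scikit-learn versions); with this small, mostly dense dataset the transform returns a regular array:

python
# Recover the preprocessed features as a labeled DataFrame
fitted_preprocessor = full_pipeline.named_steps['preprocessor']
X_transformed = pd.DataFrame(
    fitted_preprocessor.transform(X_pipeline),
    columns=fitted_preprocessor.get_feature_names_out()
)
print(X_transformed.head())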

Real-world Application: Customer Churn Prediction

Let's apply what we've learned to a real-world example: predicting customer churn. We'll use a simplified dataset for this demonstration:

python
# Create a sample customer dataset
data = {
    'age': [35, 42, 28, 39, 45, 32, 41, 30, 35, 38],
    'tenure': [6, 24, 3, 12, 36, 5, 18, 8, 7, 15],
    'monthly_charges': [65.5, 105.2, 45.3, 78.5, 110.0, 55.0, 98.2, 75.5, 68.0, 85.0],
    'total_charges': [393.0, 2524.8, 135.9, 942.0, 3960.0, 275.0, 1767.6, 604.0, 476.0, 1275.0],
    'contract_type': ['Month-to-month', 'Two year', 'Month-to-month', 'One year',
                      'Two year', 'Month-to-month', 'One year', 'Month-to-month',
                      'Month-to-month', 'One year'],
    'online_security': ['No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
    'tech_support': ['No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes'],
    'churn': [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
}

churn_df = pd.DataFrame(data)

# Display the data
print(churn_df.head())

# Basic EDA
print("\nChurn rate:")
print(churn_df['churn'].value_counts(normalize=True))

Now let's build a prediction pipeline:

python
# Split features and target
X_churn = churn_df.drop('churn', axis=1)
y_churn = churn_df['churn']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_churn, y_churn, test_size=0.3, random_state=42)

# Identify column types
numeric_features = ['age', 'tenure', 'monthly_charges', 'total_charges']
categorical_features = ['contract_type', 'online_security', 'tech_support']

# Create preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Full pipeline
churn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Train model
churn_pipeline.fit(X_train, y_train)

# Make predictions
y_pred = churn_pipeline.predict(X_test)

# Evaluate
print("\nChurn Prediction Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Summary

In this tutorial, we've explored how to effectively integrate pandas DataFrames with scikit-learn's machine learning tools. Here are the key takeaways:

  1. Seamless workflow: Pandas DataFrames can be passed directly to scikit-learn's estimators and utilities
  2. Data preparation: Use pandas for exploratory analysis, cleaning, and feature engineering
  3. Transformations: Easily apply scikit-learn preprocessing tools to pandas data
  4. Pipelines: Create end-to-end machine learning workflows combining pandas and sklearn
  5. Analysis: Use pandas to interpret model results and feature importance

This integration creates a powerful toolkit for data scientists and machine learning practitioners, allowing you to focus on solving problems rather than dealing with data format conversions.

Additional Resources

To deepen your understanding of pandas and scikit-learn integration, check out these resources:

  • The official pandas documentation: https://pandas.pydata.org/docs/
  • The scikit-learn user guide: https://scikit-learn.org/stable/user_guide.html

Exercises

  1. Basic Integration: Load the California Housing dataset (fetch_california_housing) from scikit-learn and convert it to a pandas DataFrame. Calculate basic statistics and correlations. (The older Boston Housing dataset was removed from scikit-learn in version 1.2.)

  2. Missing Data: Create a DataFrame with some missing values, then use scikit-learn's SimpleImputer to handle them before training a model.

  3. Feature Engineering: Starting with the Titanic dataset, use pandas to create new features (like family size from SibSp and Parch), then build a predictive model with scikit-learn.

  4. Model Comparison: Using pandas and scikit-learn, compare the performance of three different classification models on the Iris dataset. Store and visualize the results using pandas.

  5. Advanced Pipeline: Build a complete machine learning pipeline that includes data cleaning, feature selection, hyperparameter tuning, and model evaluation for a dataset of your choice.

Happy coding!
