Pandas SQL Import

Introduction

When working with data in the real world, you'll often need to access information stored in SQL databases. Pandas provides powerful tools to help you import this data directly into DataFrames, making it easy to analyze and manipulate database content using familiar Pandas functions.

In this tutorial, we'll learn how to:

  • Connect to different types of SQL databases
  • Import data from SQL queries into Pandas DataFrames
  • Optimize your SQL imports for performance
  • Handle common challenges when working with database data

Prerequisites

Before we begin, make sure you have:

  • Python and Pandas installed
  • Basic knowledge of SQL queries
  • A SQL database to connect to (we'll show examples with SQLite, MySQL, and PostgreSQL)

You'll also need to install the appropriate database connector. For example:

bash
# For SQLite (included in Python standard library)
# No additional installation needed

# For MySQL
pip install pymysql

# For PostgreSQL
pip install psycopg2   # or: pip install psycopg2-binary (pre-built wheel, no compiler needed)

Basic SQL Import with Pandas

The read_sql Function

The primary way to import SQL data into Pandas is the read_sql family, which consists of three closely related functions (a quick comparison sketch follows this list):

  • pd.read_sql_query(): Executes a SQL query and returns the results as a DataFrame
  • pd.read_sql_table(): Imports an entire SQL table as a DataFrame (requires SQLAlchemy)
  • pd.read_sql(): A convenience wrapper that dispatches to either of the above, depending on whether you pass it a query or a table name
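
To see how the three relate, here is a minimal, self-contained sketch. It builds a throwaway in-memory table first, so the names in it are purely illustrative, and it assumes reasonably recent Pandas and SQLAlchemy versions:

python
import pandas as pd
from sqlalchemy import create_engine

# Throwaway in-memory SQLite database with a tiny table to read back
engine = create_engine('sqlite://')
pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']}).to_sql(
    'students', engine, index=False
)

# 1. read_sql_query: always executes a query
df_q = pd.read_sql_query('SELECT * FROM students', engine)

# 2. read_sql_table: always reads a whole table (requires SQLAlchemy)
df_t = pd.read_sql_table('students', engine)

# 3. read_sql: a table name is routed to read_sql_table, a query to read_sql_query
df_a = pd.read_sql('students', engine)
df_b = pd.read_sql('SELECT * FROM students', engine)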

Let's look at some basic examples:

Using SQLite (Built-in Database)

SQLite is a lightweight database that's built into Python, making it perfect for examples:

python
import pandas as pd
import sqlite3

# Create a connection to a SQLite database
# Using an in-memory database for this example
conn = sqlite3.connect(':memory:')

# Create a simple table for demonstration
conn.execute('CREATE TABLE students (id INTEGER, name TEXT, grade REAL)')
conn.execute("INSERT INTO students VALUES (1, 'Alice', 85.5)")
conn.execute("INSERT INTO students VALUES (2, 'Bob', 92.0)")
conn.execute("INSERT INTO students VALUES (3, 'Charlie', 78.5)")
conn.commit()

# Read the entire table into a DataFrame
df = pd.read_sql('SELECT * FROM students', conn)
print(df)

Output:

   id     name  grade
0   1    Alice   85.5
1   2      Bob   92.0
2   3  Charlie   78.5

Using a SQL Query

You can also use specific SQL queries to filter or transform data as you import it:

python
# Import only students with grades above 80
df_high_grades = pd.read_sql(
    'SELECT * FROM students WHERE grade > 80',
    conn
)
print(df_high_grades)

Output:

   id   name  grade
0   1  Alice   85.5
1   2    Bob   92.0

Using read_sql_table

If you're using SQLAlchemy (a popular SQL toolkit for Python), you can also use the read_sql_table function to import a table directly. Note that creating a new engine gives you a fresh, empty in-memory database, so we first copy our DataFrame into it with to_sql:

python
from sqlalchemy import create_engine

# Create a SQLAlchemy engine (this points at a new, empty in-memory database)
engine = create_engine('sqlite:///:memory:')

# Copy our students data into the new database
df.to_sql('students', engine, index=False)

# Import the entire table
df = pd.read_sql_table('students', engine)
print(df)
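
read_sql_table also accepts a columns argument, which is handy when you only need part of a wide table:

python
# Read only selected columns of the table
df_names = pd.read_sql_table('students', engine, columns=['name', 'grade'])
print(df_names)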

Connecting to Different Database Systems

MySQL Example

python
import pandas as pd
import pymysql

# Create connection
conn = pymysql.connect(
    host='localhost',
    user='username',
    password='password',
    database='my_database'
)

# Import data
df = pd.read_sql('SELECT * FROM customers', conn)
print(df.head())

# Close connection when done
conn.close()

PostgreSQL Example

python
import pandas as pd
import psycopg2

# Create connection
conn = psycopg2.connect(
    host='localhost',
    database='my_database',
    user='username',
    password='password'
)

# Import data
df = pd.read_sql('SELECT customer_id, name, email FROM customers', conn)
print(df.head())

# Close connection
conn.close()
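
One caveat: recent Pandas versions emit a warning when you pass a raw DBAPI connection other than sqlite3 to read_sql, because only SQLAlchemy connectables (and sqlite3 connections) are officially supported. Wrapping your credentials in a SQLAlchemy engine avoids the warning. A sketch with placeholder credentials and host names you would replace with your own:

python
import pandas as pd
from sqlalchemy import create_engine

# MySQL via PyMySQL (placeholder credentials)
mysql_engine = create_engine('mysql+pymysql://username:password@localhost/my_database')

# PostgreSQL via psycopg2 (placeholder credentials)
pg_engine = create_engine('postgresql+psycopg2://username:password@localhost/my_database')

df = pd.read_sql('SELECT * FROM customers', pg_engine)
print(df.head())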

Advanced SQL Import Techniques

Setting the Index Column

You can specify which column should be the DataFrame index:

python
df = pd.read_sql(
    'SELECT * FROM students',
    conn,
    index_col='id'
)
print(df)

Output:

       name  grade
id
1     Alice   85.5
2       Bob   92.0
3   Charlie   78.5

Parsing Dates

If your SQL data includes date columns, you can have Pandas parse them automatically:

python
# Adding a date column to our example
conn.execute('ALTER TABLE students ADD COLUMN enrollment_date TEXT')
conn.execute("UPDATE students SET enrollment_date = '2023-01-15' WHERE id = 1")
conn.execute("UPDATE students SET enrollment_date = '2023-02-20' WHERE id = 2")
conn.execute("UPDATE students SET enrollment_date = '2023-01-10' WHERE id = 3")
conn.commit()

# Import with date parsing
df = pd.read_sql(
    'SELECT * FROM students',
    conn,
    parse_dates=['enrollment_date']
)
print(df)
print("\nData types:")
print(df.dtypes)

Output:

   id     name  grade  enrollment_date
0   1    Alice   85.5       2023-01-15
1   2      Bob   92.0       2023-02-20
2   3  Charlie   78.5       2023-01-10

Data types:
id                          int64
name                       object
grade                     float64
enrollment_date    datetime64[ns]
dtype: object

Parameterized Queries for Security

To prevent SQL injection attacks, use parameterized queries:

python
min_grade = 80

# Safe way to pass parameters
df = pd.read_sql(
    'SELECT * FROM students WHERE grade > ?',
    conn,
    params=(min_grade,)
)
print(df)

Output:

   id   name  grade  enrollment_date
0   1  Alice   85.5       2023-01-15
1   2    Bob   92.0       2023-02-20
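
Note that the placeholder symbol depends on the driver: sqlite3 understands ? (and also named :name placeholders), while pymysql and psycopg2 expect %s. For example, the same query with a named placeholder against the sqlite3 connection used above:

python
# Named placeholders also work with sqlite3; pass the values as a dict
df = pd.read_sql(
    'SELECT * FROM students WHERE grade > :min_grade',
    conn,
    params={'min_grade': min_grade}
)
print(df)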

Real-World Application: Sales Data Analysis

Let's look at a more practical example of importing and analyzing sales data from a database:

python
import pandas as pd
import sqlite3
from datetime import datetime, timedelta

# Create a sample sales database
conn = sqlite3.connect(':memory:')

# Create tables
conn.execute('''
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT,
        unit_price REAL
    )
''')

conn.execute('''
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER,
        sale_date TEXT,
        quantity INTEGER,
        FOREIGN KEY (product_id) REFERENCES products (product_id)
    )
''')

# Insert sample data
products = [
    (1, 'Laptop', 'Electronics', 899.99),
    (2, 'Mouse', 'Electronics', 24.99),
    (3, 'Coffee Mug', 'Kitchen', 11.99),
    (4, 'Water Bottle', 'Kitchen', 14.99),
    (5, 'Notebook', 'Office', 4.99)
]

conn.executemany('INSERT INTO products VALUES (?,?,?,?)', products)

# Generate some sales data
sales = []
sale_id = 1
today = datetime.now()

for i in range(30):  # Generate 30 days of sales
    date = (today - timedelta(days=i)).strftime('%Y-%m-%d')
    for product_id in range(1, 6):
        # Quantity between 1 and 10, cycling with the day (deterministic, not random)
        quantity = (i % 10) + 1
        sales.append((sale_id, product_id, date, quantity))
        sale_id += 1

conn.executemany('INSERT INTO sales VALUES (?,?,?,?)', sales)
conn.commit()

# Now let's analyze the sales data

# 1. Join products and sales tables
query = '''
SELECT
    s.sale_id,
    p.product_name,
    p.category,
    s.sale_date,
    s.quantity,
    p.unit_price,
    s.quantity * p.unit_price AS revenue
FROM sales s
JOIN products p ON s.product_id = p.product_id
'''

sales_df = pd.read_sql(query, conn, parse_dates=['sale_date'])

print("Sample of sales data:")
print(sales_df.head())

# 2. Analyze sales by category
category_sales = sales_df.groupby('category')['revenue'].sum().reset_index()
print("\nSales by category:")
print(category_sales)

# 3. Analyze sales trends over time
daily_sales = sales_df.groupby(sales_df['sale_date'].dt.date)['revenue'].sum().reset_index()
print("\nDaily sales (last 5 days):")
print(daily_sales.tail())

This example demonstrates:

  1. Creating a more complex database structure
  2. Using SQL JOIN operations in your query
  3. Performing data analysis on the imported data

Performance Considerations

When working with large databases, consider these tips:

1. Only Import What You Need

Instead of importing entire tables, select only the columns and rows you need:

python
# Bad (imports everything)
df = pd.read_sql('SELECT * FROM huge_table', conn)

# Better (only imports what you need)
df = pd.read_sql("SELECT id, name FROM huge_table WHERE date > '2023-01-01'", conn)

2. Use Chunking for Large Datasets

For very large tables, import the data in chunks:

python
chunks = []
for chunk in pd.read_sql('SELECT * FROM huge_table', conn, chunksize=10000):
    # Process each chunk as it arrives
    chunk['processed'] = chunk['value'] * 2
    chunks.append(chunk)

# Combine all chunks into a single DataFrame
# (ignore_index avoids repeated index values across chunks)
df = pd.concat(chunks, ignore_index=True)

3. Use SQLAlchemy for Better Performance

SQLAlchemy engines are the connection type Pandas officially supports for most databases; besides enabling read_sql_table, they provide connection pooling, which helps when you run many queries against the same server. Make sure the engine points at a database that actually contains your table:

python
from sqlalchemy import create_engine

# Create an engine (the path below is a placeholder for your own database file or URL)
engine = create_engine('sqlite:///my_database.db')

# Use the engine with read_sql
df = pd.read_sql('SELECT * FROM students', engine)

Common Challenges and Solutions

Handling NULL Values

SQL NULL values are converted automatically: numeric columns receive NumPy's NaN, while text columns receive Python's None:

python
# Add a row with NULL values
conn.execute("INSERT INTO students VALUES (4, 'Dave', NULL, NULL)")
conn.commit()

df = pd.read_sql('SELECT * FROM students', conn)
print(df)
print("\nNull check:")
print(df.isnull().sum())

Dealing with Different Data Types

Sometimes you need to explicitly cast values:

python
# Cast inside the SQL query
df = pd.read_sql(
    'SELECT id, CAST(grade AS TEXT) AS grade_str FROM students',
    conn
)
print(df)
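
Alternatively, you can import the raw values and cast on the Pandas side with astype, which keeps the SQL simpler:

python
# Cast after import instead of inside the query
df = pd.read_sql('SELECT id, grade FROM students', conn)
df['grade_str'] = df['grade'].astype(str)
print(df.dtypes)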

Handling Large Integers

Some databases use 64-bit integers that might need special handling:

python
df = pd.read_sql(
    'SELECT * FROM large_number_table',
    conn,
    dtype={'big_id': 'int64'}  # Ensure a 64-bit integer (the dtype argument needs a recent Pandas version)
)

Summary

In this tutorial, we've learned:

  1. How to import SQL data into Pandas DataFrames using read_sql(), read_sql_query(), and read_sql_table()
  2. Connecting to different types of databases (SQLite, MySQL, PostgreSQL)
  3. Advanced import techniques like setting index columns and parsing dates
  4. Using parameterized queries for security
  5. Real-world applications with sales data analysis
  6. Performance optimization techniques for large datasets
  7. Solutions to common challenges when importing SQL data

SQL databases and Pandas form a powerful combination for data analysis. With the techniques covered in this tutorial, you'll be able to seamlessly integrate data from your databases into your Pandas workflows.

Exercises

  1. Create a small SQLite database with a "books" table including columns for title, author, publication year, and price. Import this data into a Pandas DataFrame.
  2. Connect to a public sample database (like the SQLite Chinook database) and import the "customers" and "invoices" tables. Join them to analyze customer purchasing patterns.
  3. Write a function that takes a date range as input and returns a DataFrame with sales data filtered by that date range.
  4. Experiment with chunking by creating a large table (10,000+ rows) and importing it in chunks of different sizes. Compare the performance.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)