PostgreSQL Bulk Loading
When working with large amounts of data, importing it efficiently into your PostgreSQL database becomes crucial. This guide covers various methods for bulk loading data into PostgreSQL, with a focus on performance and practical applications.
Introduction
Bulk loading refers to the process of inserting large volumes of data into a database at once, rather than row by row. PostgreSQL offers several efficient methods for bulk loading that significantly outperform standard INSERT statements when dealing with large datasets.
Why is bulk loading important?
- Performance: Load millions of rows in seconds instead of hours
- Reduced overhead: Minimizes transaction overhead and logging
- Resource efficiency: Uses less memory and CPU compared to individual inserts
The COPY Command
The COPY command is PostgreSQL's primary tool for bulk loading data. It's designed specifically for high-performance data imports and exports.
Basic Syntax
COPY table_name [ (column_name [, ...]) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
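The three source forms differ in where the data comes from: a plain filename is read by the PostgreSQL server process, PROGRAM runs a shell command on the server and reads its output, and STDIN streams the data over the client connection. A brief sketch of each (the table name, paths, and gzipped file are placeholders):

-- Read a file that lives on the database server
COPY my_table FROM '/var/lib/postgresql/import/data.csv' WITH (FORMAT CSV, HEADER);

-- Read the output of a command run on the server
-- (requires superuser privileges or the pg_execute_server_program role)
COPY my_table FROM PROGRAM 'gunzip -c /var/lib/postgresql/import/data.csv.gz' WITH (FORMAT CSV, HEADER);

-- Read from the client connection; psql's \copy and most driver bulk-load APIs use this form
COPY my_table FROM STDIN WITH (FORMAT CSV, HEADER);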
Loading Data from a CSV File
Let's say we have a CSV file named employees.csv with the following content:
id,first_name,last_name,email,hire_date
1,John,Smith,[email protected],2022-01-15
2,Mary,Johnson,[email protected],2022-02-20
3,Robert,Williams,[email protected],2022-03-05
And we want to load it into an employees table:
-- First create the table
CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100),
    hire_date DATE
);
-- Then use COPY to load the data
COPY employees(id, first_name, last_name, email, hire_date)
FROM '/path/to/employees.csv'
WITH (FORMAT CSV, HEADER);
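One caveat for this particular example: the file supplies explicit values for the SERIAL column id, and COPY does not advance the underlying sequence, so later inserts that rely on the default could collide with the loaded ids. A common fix, assuming the default sequence created by SERIAL:

-- Move the sequence past the highest loaded id
SELECT setval(pg_get_serial_sequence('employees', 'id'),
              (SELECT MAX(id) FROM employees));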
Options for COPY
The COPY command supports numerous options for flexibility:
COPY table_name FROM '/path/to/file.csv' WITH (
    FORMAT CSV,        -- Format (CSV, TEXT, BINARY)
    HEADER,            -- Skip the first line (header)
    DELIMITER ',',     -- Column separator (default is tab for TEXT)
    QUOTE '"',         -- Quote character
    ESCAPE '\',        -- Escape character
    NULL 'NULL',       -- String representing NULL values
    ENCODING 'UTF8'    -- File encoding
);
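Real-world files often deviate from the defaults; spreadsheet exports in many locales, for instance, use a semicolon delimiter and Latin-1 encoding. A hedged example combining several options (the file path is a placeholder):

COPY my_table FROM '/path/to/export_from_spreadsheet.csv' WITH (
    FORMAT CSV,
    HEADER,
    DELIMITER ';',      -- semicolon-separated values
    NULL '',            -- empty fields become NULL (the CSV default)
    ENCODING 'LATIN1'   -- file was saved as ISO-8859-1
);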
Using \copy with psql
While COPY is powerful, it requires server file system access. For client-side file loading, PostgreSQL's command-line utility psql provides the \copy meta-command:
\copy employees FROM '/local/path/to/employees.csv' WITH (FORMAT CSV, HEADER)
The \copy command runs on the client and sends the data to the server, so it works with files on your local machine.
Key differences between COPY and \copy:
- COPY runs server-side and is faster, but requires server file access (reading a server file requires superuser privileges or membership in the pg_read_server_files role)
- \copy runs client-side and works with local files
- Syntax is almost identical, but \copy is a psql meta-command, so the whole command must be written on one line and the filename doesn't need quotes
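Because \copy runs on the client, it is also easy to script from the shell. A minimal sketch, assuming a database named mydb and a CSV file in the current directory:

psql -d mydb -c "\copy employees FROM 'employees.csv' WITH (FORMAT CSV, HEADER)"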
Practical Example: Importing a Large Dataset
Let's walk through a complete example of importing a large dataset of customer transactions:
-- Create the table structure
CREATE TABLE transactions (
    transaction_id UUID PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    amount DECIMAL(10,2) NOT NULL,
    transaction_date TIMESTAMP NOT NULL,
    product_id VARCHAR(20),
    quantity INTEGER,
    status VARCHAR(10)
);
-- Import from CSV using COPY
COPY transactions
FROM '/data/transactions.csv'
WITH (
    FORMAT CSV,
    HEADER,
    DELIMITER ',',
    NULL 'NA'
);
-- Verify the imported data
SELECT COUNT(*) FROM transactions;
Output:
count
-------
100000
(1 row)
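Beyond the row count, a few quick sanity checks help catch truncated or mis-parsed files. These queries only use the columns defined above:

-- Date range of the loaded data
SELECT MIN(transaction_date), MAX(transaction_date) FROM transactions;

-- Row counts and totals per status, to spot obviously wrong values
SELECT status, COUNT(*) AS row_count, SUM(amount) AS total_amount
FROM transactions
GROUP BY status
ORDER BY status;

-- How many rows had 'NA' converted to NULL in the optional product_id column
SELECT COUNT(*) FROM transactions WHERE product_id IS NULL;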
Using COPY with Transformations
Sometimes you need to transform data during import. While the COPY
command doesn't directly support transformations, you can combine it with temporary tables and SQL:
-- Create a temporary table with the exact structure of your CSV
CREATE TEMPORARY TABLE temp_users (
    user_id VARCHAR(20),
    username VARCHAR(50),
    signup_date VARCHAR(20),  -- Keep as string initially
    status VARCHAR(10)
);
-- Load raw data into the temporary table
COPY temp_users FROM '/path/to/users.csv' WITH (FORMAT CSV, HEADER);
-- Insert into the final table with transformations
INSERT INTO users (user_id, username, signup_date, status, is_active)
SELECT
    user_id,
    LOWER(username),                                      -- Convert username to lowercase
    TO_DATE(signup_date, 'YYYY-MM-DD'),                   -- Convert string to date
    status,
    CASE WHEN status = 'active' THEN TRUE ELSE FALSE END  -- Derive boolean field
FROM temp_users;
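If the same file might be loaded more than once, the staging-table pattern combines nicely with an upsert. A sketch assuming users has a unique or primary key constraint on user_id:

INSERT INTO users (user_id, username, signup_date, status, is_active)
SELECT
    user_id,
    LOWER(username),
    TO_DATE(signup_date, 'YYYY-MM-DD'),
    status,
    (status = 'active')              -- boolean form of the CASE expression above
FROM temp_users
ON CONFLICT (user_id) DO UPDATE
SET username    = EXCLUDED.username,
    signup_date = EXCLUDED.signup_date,
    status      = EXCLUDED.status,
    is_active   = EXCLUDED.is_active;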
Bulk Loading from Programming Languages
Many programming languages offer efficient ways to perform bulk loading into PostgreSQL.
Python Example with psycopg2
import psycopg2

# Connect to the database
conn = psycopg2.connect("dbname=mydb user=postgres password=secret")
cur = conn.cursor()

# Stream the CSV file through COPY ... FROM STDIN.
# copy_expert accepts the full COPY option syntax, so the header row and quoted
# fields are handled by the server (copy_from uses the simpler TEXT format and
# would mis-parse quoted CSV values).
with open('products.csv', 'r') as f:
    cur.copy_expert(
        """
        COPY products (product_id, name, category, price, stock)
        FROM STDIN WITH (FORMAT CSV, HEADER)
        """,
        f
    )

# Commit the transaction
conn.commit()

# Close the connection
cur.close()
conn.close()
Node.js Example with pg-copy-streams
const { Pool } = require('pg');
const fs = require('fs');
const { pipeline } = require('stream');
const copyFrom = require('pg-copy-streams').from;

const pool = new Pool({
  user: 'postgres',
  host: 'localhost',
  database: 'mydb',
  password: 'secret',
  port: 5432,
});

async function bulkLoadData() {
  const client = await pool.connect();
  try {
    // Create a readable stream from your CSV file
    const fileStream = fs.createReadStream('orders.csv');

    // Create a PostgreSQL COPY stream
    const copyStream = client.query(copyFrom(`
      COPY orders(order_id, customer_id, order_date, total_amount)
      FROM STDIN WITH (FORMAT CSV, HEADER)
    `));

    // Pipe the file stream to the copy stream
    await new Promise((resolve, reject) => {
      pipeline(fileStream, copyStream, (err) => {
        if (err) reject(err);
        else resolve();
      });
    });

    console.log('Bulk loading completed successfully');
  } finally {
    client.release();
  }
}

bulkLoadData().catch(err => console.error('Error during bulk loading:', err));
Performance Considerations
To achieve the best performance when bulk loading:
- Disable constraints and indexes during load, then recreate them afterwards:

-- Before loading
ALTER TABLE mytable DROP CONSTRAINT mytable_pkey;
DROP INDEX IF EXISTS mytable_idx;

-- Perform COPY operation

-- After loading
ALTER TABLE mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX mytable_idx ON mytable(column_name);

- Increase maintenance memory so the index and constraint rebuilds after the load run faster:

SET maintenance_work_mem = '1GB'; -- Adjust based on available RAM

- Use multiple COPY commands in parallel for different chunks of your data.

- Use UNLOGGED tables for temporary data loading (see the sketch after this list):

CREATE UNLOGGED TABLE temp_loading_table (...);
-- Perform COPY, then move data to regular table if needed
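A sketch of the UNLOGGED staging pattern from the last tip, assuming a hypothetical events target table. The staging table skips write-ahead logging, which speeds up the load but means its contents are lost after a crash and are not replicated:

-- Staging table with the same columns and defaults as the target
CREATE UNLOGGED TABLE staging_events (LIKE events INCLUDING DEFAULTS);

COPY staging_events FROM '/data/events.csv' WITH (FORMAT CSV, HEADER);

-- Move the rows into the durable table, then drop the staging table
INSERT INTO events SELECT * FROM staging_events;
DROP TABLE staging_events;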
Real-world Use Cases
ETL Data Pipeline
In an ETL pipeline, bulk loading typically handles the final load step: the extract and transform stages produce clean CSV files (or a stream of rows), and COPY ... FROM STDIN moves the prepared data into the target tables in a single pass.
Database Migration
When migrating from another database system to PostgreSQL:
- Export data from source database to CSV format
- Create corresponding table structure in PostgreSQL
- Use COPY to efficiently import the data
- Verify data integrity and create necessary indexes (see the sketch below)
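A minimal sketch of the import and verification steps, assuming the source system exported a customers table to customers.csv (table and column names are placeholders):

-- Client-side load from the exported file
\copy customers FROM 'customers.csv' WITH (FORMAT CSV, HEADER)

-- Compare this with the row count reported by the source database
SELECT COUNT(*) FROM customers;

-- Create the indexes only after the data is in place
CREATE INDEX customers_email_idx ON customers (email);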
Common Problems and Solutions
Handling Errors During Import
By default, COPY aborts the whole command at the first row it cannot parse, so a single bad line means nothing gets imported. A few practical options:

-- Control how null-like values are handled per column
COPY products FROM '/path/to/products.csv' WITH (
    FORMAT CSV,
    HEADER,
    DELIMITER ',',
    FORCE_NULL (description, manufacturer),     -- quoted null strings in these columns become NULL
    FORCE_NOT_NULL (product_id, name, price)    -- these columns are never converted to NULL
);

-- PostgreSQL 17 and later can skip rows that fail parsing or type conversion
-- instead of aborting (constraint violations still abort the load)
COPY products FROM '/path/to/products.csv' WITH (
    FORMAT CSV,
    HEADER,
    ON_ERROR ignore
);

On older versions, the most reliable approach is the staging-table pattern shown earlier: load every column as text into a temporary table, then validate and convert the rows with SQL before inserting them into the final table.
Dealing with Special Characters
In CSV format, any value containing the delimiter, the quote character, or a newline must be wrapped in quotes, and a quote character inside a quoted value is escaped. The defaults already handle standard CSV quoting:

COPY my_table FROM '/path/to/file.csv' WITH (
    FORMAT CSV,
    DELIMITER ',',
    QUOTE '"',    -- character used to quote values (the default)
    ESCAPE '"'    -- a quote inside a value is escaped by doubling it (the default)
);
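For example, a product name that itself contains a comma and a quote character would appear in the file like this (hypothetical data; note the doubled inner quotes):

id,name,price
1,"Widget, 5"" deluxe",9.99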
Summary
PostgreSQL's bulk loading capabilities provide efficient methods for importing large volumes of data:
- The COPY command offers high-performance server-side data loading
- The \copy meta-command provides client-side loading functionality
- Temporary tables allow for data transformations during the import process
- Programming languages can interface with PostgreSQL for bulk operations
By using these tools and following best practices, you can import millions of rows in seconds rather than hours, making your data loading processes more efficient and reliable.
Exercises
- Create a table to store product information and practice importing data using the COPY command.
- Compare the performance of loading 10,000 rows using individual INSERT statements versus the COPY command.
- Write a script to generate a large CSV file and bulk load it into PostgreSQL using both the server-side COPY and client-side \copy commands.
- Implement a data transformation during bulk loading by using a temporary table and SQL.