PostgreSQL Bulk Loading
When working with large amounts of data, importing it efficiently into your PostgreSQL database becomes crucial. This guide covers various methods for bulk loading data into PostgreSQL, with a focus on performance and practical applications.
Introduction
Bulk loading refers to the process of inserting large volumes of data into a database at once, rather than row by row. PostgreSQL offers several efficient methods for bulk loading that significantly outperform standard INSERT statements when dealing with large datasets.
Why is bulk loading important?
- Performance: Load millions of rows in seconds instead of hours
- Reduced overhead: Minimizes transaction overhead and logging
- Resource efficiency: Uses less memory and CPU compared to individual inserts
The COPY Command
The COPY command is PostgreSQL's primary tool for bulk loading data. It's designed specifically for high-performance data imports and exports.
Basic Syntax
COPY table_name [ (column_name [, ...]) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
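The three source forms differ in where the data comes from: a plain filename is read by the PostgreSQL server process, PROGRAM runs a shell command on the server and reads its output, and STDIN streams the data over the client connection. A brief sketch of each (the table name, paths, and gzipped file are placeholders):

-- Read a file that lives on the database server
COPY my_table FROM '/var/lib/postgresql/import/data.csv' WITH (FORMAT CSV, HEADER);

-- Read the output of a command run on the server
-- (requires superuser privileges or the pg_execute_server_program role)
COPY my_table FROM PROGRAM 'gunzip -c /var/lib/postgresql/import/data.csv.gz' WITH (FORMAT CSV, HEADER);

-- Read from the client connection; psql's \copy and most driver bulk-load APIs use this form
COPY my_table FROM STDIN WITH (FORMAT CSV, HEADER);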
Loading Data from a CSV File
Let's say we have a CSV file named employees.csv with the following content:
id,first_name,last_name,email,hire_date
1,John,Smith,[email protected],2022-01-15
2,Mary,Johnson,[email protected],2022-02-20
3,Robert,Williams,[email protected],2022-03-05
And we want to load it into an employees table:
-- First create the table
CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100),
    hire_date DATE
);
-- Then use COPY to load the data
COPY employees(id, first_name, last_name, email, hire_date)
FROM '/path/to/employees.csv'
WITH (FORMAT CSV, HEADER);
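One caveat for this particular example: the file supplies explicit values for the SERIAL column id, and COPY does not advance the underlying sequence, so later inserts that rely on the default could collide with the loaded ids. A common fix, assuming the default sequence created by SERIAL:

-- Move the sequence past the highest loaded id
SELECT setval(pg_get_serial_sequence('employees', 'id'),
              (SELECT MAX(id) FROM employees));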
Options for COPY
The COPY command supports numerous options for flexibility:
COPY table_name FROM '/path/to/file.csv' WITH (
    FORMAT CSV,        -- Format (CSV, TEXT, BINARY)
    HEADER,            -- Skip the first line (header)
    DELIMITER ',',     -- Column separator (default is tab for TEXT)
    QUOTE '"',         -- Quote character
    ESCAPE '\',        -- Escape character
    NULL 'NULL',       -- String representing NULL values
    ENCODING 'UTF8'    -- File encoding
);
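Real-world files often deviate from the defaults; spreadsheet exports in many locales, for instance, use a semicolon delimiter and Latin-1 encoding. A hedged example combining several options (the file path is a placeholder):

COPY my_table FROM '/path/to/export_from_spreadsheet.csv' WITH (
    FORMAT CSV,
    HEADER,
    DELIMITER ';',      -- semicolon-separated values
    NULL '',            -- empty fields become NULL (the CSV default)
    ENCODING 'LATIN1'   -- file was saved as ISO-8859-1
);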
Using \copy with psql
While COPY is powerful, it requires server file system access. For client-side file loading, PostgreSQL's command-line utility psql provides the \copy meta-command:
\copy employees FROM '/local/path/to/employees.csv' WITH (FORMAT CSV, HEADER)
The \copy command runs on the client and sends the data to the server, so it works with files on your local machine.
Key differences between COPY and \copy:
- COPY runs server-side and is faster, but requires server file access (reading a server file requires superuser privileges or membership in the pg_read_server_files role)
- \copy runs client-side and works with local files
- Syntax is almost identical, but \copy is a psql meta-command, so the whole command must be written on one line and the filename doesn't need quotes
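Because \copy runs on the client, it is also easy to script from the shell. A minimal sketch, assuming a database named mydb and a CSV file in the current directory:

psql -d mydb -c "\copy employees FROM 'employees.csv' WITH (FORMAT CSV, HEADER)"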
Practical Example: Importing a Large Dataset
Let's walk through a complete example of importing a large dataset of customer transactions:
-- Create the table structure
CREATE TABLE transactions (
    transaction_id UUID PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    amount DECIMAL(10,2) NOT NULL,
    transaction_date TIMESTAMP NOT NULL,
    product_id VARCHAR(20),
    quantity INTEGER,
    status VARCHAR(10)
);
-- Import from CSV using COPY
COPY transactions
FROM '/data/transactions.csv'
WITH (
    FORMAT CSV,
    HEADER,
    DELIMITER ',',
    NULL 'NA'
);
-- Verify the imported data
SELECT COUNT(*) FROM transactions;
Output:
count
-------
100000
(1 row)
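Beyond the row count, a few quick sanity checks help catch truncated or mis-parsed files. These queries only use the columns defined above:

-- Date range of the loaded data
SELECT MIN(transaction_date), MAX(transaction_date) FROM transactions;

-- Row counts and totals per status, to spot obviously wrong values
SELECT status, COUNT(*) AS row_count, SUM(amount) AS total_amount
FROM transactions
GROUP BY status
ORDER BY status;

-- How many rows had 'NA' converted to NULL in the optional product_id column
SELECT COUNT(*) FROM transactions WHERE product_id IS NULL;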
Using COPY with Transformations
Sometimes you need to transform data during import. While the COPY
command doesn't directly support transformations, you can combine it with temporary tables and SQL:
-- Create a temporary table with the exact structure of your CSV
CREATE TEMPORARY TABLE temp_users (
    user_id VARCHAR(20),
    username VARCHAR(50),
    signup_date VARCHAR(20),  -- Keep as string initially
    status VARCHAR(10)
);
-- Load raw data into the temporary table
COPY temp_users FROM '/path/to/users.csv' WITH (FORMAT CSV, HEADER);
-- Insert into the final table with transformations
INSERT INTO users (user_id, username, signup_date, status, is_active)
SELECT
    user_id,
    LOWER(username),                                      -- Convert username to lowercase
    TO_DATE(signup_date, 'YYYY-MM-DD'),                   -- Convert string to date
    status,
    CASE WHEN status = 'active' THEN TRUE ELSE FALSE END  -- Derive boolean field
FROM temp_users;
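If the same file might be loaded more than once, the staging-table pattern combines nicely with an upsert. A sketch assuming users has a unique or primary key constraint on user_id:

INSERT INTO users (user_id, username, signup_date, status, is_active)
SELECT
    user_id,
    LOWER(username),
    TO_DATE(signup_date, 'YYYY-MM-DD'),
    status,
    (status = 'active')              -- boolean form of the CASE expression above
FROM temp_users
ON CONFLICT (user_id) DO UPDATE
SET username    = EXCLUDED.username,
    signup_date = EXCLUDED.signup_date,
    status      = EXCLUDED.status,
    is_active   = EXCLUDED.is_active;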
Bulk Loading from Programming Languages
Many programming languages offer efficient ways to perform bulk loading into PostgreSQL.
Python Example with psycopg2
import psycopg2

# Connect to the database
conn = psycopg2.connect("dbname=mydb user=postgres password=secret")
cur = conn.cursor()

# Stream the CSV file through COPY ... FROM STDIN.
# copy_expert accepts the full COPY option syntax, so the header row and quoted
# fields are handled by the server (copy_from uses the simpler TEXT format and
# would mis-parse quoted CSV values).
with open('products.csv', 'r') as f:
    cur.copy_expert(
        """
        COPY products (product_id, name, category, price, stock)
        FROM STDIN WITH (FORMAT CSV, HEADER)
        """,
        f
    )

# Commit the transaction
conn.commit()

# Close the connection
cur.close()
conn.close()
Node.js Example with pg-copy-streams
const { Pool } = require('pg');
const fs = require('fs');
const { pipeline } = require('stream');
const copyFrom = require('pg-copy-streams').from;

const pool = new Pool({
  user: 'postgres',
  host: 'localhost',
  database: 'mydb',
  password: 'secret',
  port: 5432,
});

async function bulkLoadData() {
  const client = await pool.connect();
  try {
    // Create a readable stream from your CSV file
    const fileStream = fs.createReadStream('orders.csv');

    // Create a PostgreSQL COPY stream
    const copyStream = client.query(copyFrom(`
      COPY orders(order_id, customer_id, order_date, total_amount)
      FROM STDIN WITH (FORMAT CSV, HEADER)
    `));

    // Pipe the file stream to the copy stream
    await new Promise((resolve, reject) => {
      pipeline(fileStream, copyStream, (err) => {
        if (err) reject(err);
        else resolve();
      });
    });

    console.log('Bulk loading completed successfully');
  } finally {
    client.release();
  }
}

bulkLoadData().catch(err => console.error('Error during bulk loading:', err));
Performance Considerations
To achieve the best performance when bulk loading:
- Disable constraints and indexes during load, then recreate them afterwards:

-- Before loading
ALTER TABLE mytable DROP CONSTRAINT mytable_pkey;
DROP INDEX IF EXISTS mytable_idx;

-- Perform COPY operation

-- After loading
ALTER TABLE mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX mytable_idx ON mytable(column_name);

- Increase maintenance memory so the index and constraint rebuilds after the load run faster:

SET maintenance_work_mem = '1GB'; -- Adjust based on available RAM

- Use multiple COPY commands in parallel for different chunks of your data.

- Use UNLOGGED tables for temporary data loading (see the sketch after this list):

CREATE UNLOGGED TABLE temp_loading_table (...);
-- Perform COPY, then move data to regular table if needed
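A sketch of the UNLOGGED staging pattern from the last tip, assuming a hypothetical events target table. The staging table skips write-ahead logging, which speeds up the load but means its contents are lost after a crash and are not replicated:

-- Staging table with the same columns and defaults as the target
CREATE UNLOGGED TABLE staging_events (LIKE events INCLUDING DEFAULTS);

COPY staging_events FROM '/data/events.csv' WITH (FORMAT CSV, HEADER);

-- Move the rows into the durable table, then drop the staging table
INSERT INTO events SELECT * FROM staging_events;
DROP TABLE staging_events;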
Real-world Use Cases
ETL Data Pipeline
In an ETL pipeline, bulk loading typically handles the final load step: the extract and transform stages produce clean CSV files (or a stream of rows), and COPY ... FROM STDIN moves the prepared data into the target tables in a single pass.
Database Migration
When migrating from another database system to PostgreSQL:
- Export data from source database to CSV format
- Create corresponding table structure in PostgreSQL
- Use COPY to efficiently import the data
- Verify data integrity and create necessary indexes (see the sketch below)
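A minimal sketch of the import and verification steps, assuming the source system exported a customers table to customers.csv (table and column names are placeholders):

-- Client-side load from the exported file
\copy customers FROM 'customers.csv' WITH (FORMAT CSV, HEADER)

-- Compare this with the row count reported by the source database
SELECT COUNT(*) FROM customers;

-- Create the indexes only after the data is in place
CREATE INDEX customers_email_idx ON customers (email);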
Common Problems and Solutions
Handling Errors During Import
By default, COPY aborts the whole command at the first row it cannot parse, so a single bad line means nothing gets imported. A few practical options:

-- Control how null-like values are handled per column
COPY products FROM '/path/to/products.csv' WITH (
    FORMAT CSV,
    HEADER,
    DELIMITER ',',
    FORCE_NULL (description, manufacturer),     -- quoted null strings in these columns become NULL
    FORCE_NOT_NULL (product_id, name, price)    -- these columns are never converted to NULL
);

-- PostgreSQL 17 and later can skip rows that fail parsing or type conversion
-- instead of aborting (constraint violations still abort the load)
COPY products FROM '/path/to/products.csv' WITH (
    FORMAT CSV,
    HEADER,
    ON_ERROR ignore
);

On older versions, the most reliable approach is the staging-table pattern shown earlier: load every column as text into a temporary table, then validate and convert the rows with SQL before inserting them into the final table.
Dealing with Special Characters
In CSV format, any value containing the delimiter, the quote character, or a newline must be wrapped in quotes, and a quote character inside a quoted value is escaped. The defaults already handle standard CSV quoting:

COPY my_table FROM '/path/to/file.csv' WITH (
    FORMAT CSV,
    DELIMITER ',',
    QUOTE '"',    -- character used to quote values (the default)
    ESCAPE '"'    -- a quote inside a value is escaped by doubling it (the default)
);
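For example, a product name that itself contains a comma and a quote character would appear in the file like this (hypothetical data; note the doubled inner quotes):

id,name,price
1,"Widget, 5"" deluxe",9.99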
Summary
PostgreSQL's bulk loading capabilities provide efficient methods for importing large volumes of data:
- The COPY command offers high-performance server-side data loading
- The \copy meta-command provides client-side loading functionality
- Temporary tables allow for data transformations during the import process
- Programming languages can interface with PostgreSQL for bulk operations
By using these tools and following best practices, you can import millions of rows in seconds rather than hours, making your data loading processes more efficient and reliable.
Exercises
- Create a table to store product information and practice importing data using the COPY command.
- Compare the performance of loading 10,000 rows using individual INSERT statements versus the COPY command.
- Write a script to generate a large CSV file and bulk load it into PostgreSQL using both the server-side COPY and client-side \copy commands.
- Implement a data transformation during bulk loading by using a temporary table and SQL.