PostgreSQL GROUP BY

Introduction

The GROUP BY clause is one of the most powerful features in PostgreSQL for data analysis and reporting. It allows you to organize your data into groups based on one or more columns and then perform aggregate calculations on each group rather than on the entire dataset. This is particularly useful when you need to summarize large datasets or generate statistical reports.

In this tutorial, we'll explore how to use the GROUP BY clause effectively, understand its relationship with aggregate functions, and see real-world applications that demonstrate its power in data analysis.

Basic Syntax

The basic syntax of the GROUP BY clause is:

sql
SELECT column1, column2, ... , aggregate_function(column_name)
FROM table_name
WHERE condition
GROUP BY column1, column2, ...
ORDER BY column1, column2, ... ;

Here, aggregate_function can be functions like:

COUNT(): Returns the number of rows
SUM(): Calculates the sum of values
AVG(): Calculates the average of values
MIN(): Finds the minimum value
MAX(): Finds the maximum value

Understanding GROUP BY with Examples

Let's create a sample database to demonstrate the GROUP BY clause in action:

sql
CREATE TABLE sales (
    sale_id SERIAL PRIMARY KEY,
    product_id INT NOT NULL,
    customer_id INT NOT NULL,
    sale_date DATE NOT NULL,
    quantity INT NOT NULL,
    amount DECIMAL(10, 2) NOT NULL
);

INSERT INTO sales (product_id, customer_id, sale_date, quantity, amount) VALUES
(1, 101, '2023-01-15', 2, 150.00),
(2, 102, '2023-01-16', 1, 75.50),
(1, 103, '2023-01-17', 3, 225.00),
(3, 101, '2023-01-18', 1, 200.00),
(2, 104, '2023-01-19', 2, 151.00),
(3, 102, '2023-01-20', 1, 200.00),
(1, 103, '2023-01-21', 1, 75.00),
(2, 101, '2023-01-22', 3, 226.50);

Example 1: Simple GROUP BY with COUNT

Let's count how many sales we have for each product:

sql
SELECT 
    product_id, 
    COUNT(*) AS total_sales
FROM 
    sales
GROUP BY 
    product_id
ORDER BY 
    product_id;

Output:

 product_id | total_sales
------------+-------------
          1 |           3
          2 |           3
          3 |           2

This query groups the sales records by product_id and counts how many sales exist for each product.

Example 2: Multiple Aggregate Functions

We can use multiple aggregate functions in a single query:

sql
SELECT 
    product_id,
    COUNT(*) AS total_transactions,
    SUM(quantity) AS total_quantity,
    SUM(amount) AS total_revenue,
    AVG(amount) AS average_sale_value,
    MAX(amount) AS highest_sale,
    MIN(amount) AS lowest_sale
FROM 
    sales
GROUP BY 
    product_id
ORDER BY 
    total_revenue DESC;

Output:

 product_id | total_transactions | total_quantity | total_revenue | average_sale_value | highest_sale | lowest_sale
------------+-------------------+----------------+---------------+--------------------+--------------+-------------
          1 |                 3 |              6 |        450.00 |             150.00 |       225.00 |       75.00
          3 |                 2 |              2 |        400.00 |             200.00 |       200.00 |      200.00
          2 |                 3 |              6 |        453.00 |             151.00 |       226.50 |       75.50

This query provides a comprehensive sales analysis for each product, including counts, sums, averages, maximums, and minimums.

Example 3: GROUP BY with Multiple Columns

We can group by multiple columns to get more detailed insights:

sql
SELECT 
    product_id, 
    customer_id, 
    COUNT(*) AS transactions,
    SUM(amount) AS total_spent
FROM 
    sales
GROUP BY 
    product_id, customer_id
ORDER BY 
    product_id, customer_id;

Output:

 product_id | customer_id | transactions | total_spent
------------+-------------+--------------+-------------
          1 |         101 |            0 |        0.00
          1 |         103 |            2 |      300.00
          2 |         101 |            1 |      226.50
          2 |         102 |            0 |        0.00
          2 |         104 |            1 |      151.00
          3 |         101 |            1 |      200.00
          3 |         102 |            1 |      200.00

This query shows how many transactions and how much each customer spent on each product.

The HAVING Clause with GROUP BY

The WHERE clause cannot be used with aggregate functions. Instead, PostgreSQL provides the HAVING clause, which filters the results of a GROUP BY query based on aggregate conditions.

Example 4: Using HAVING to Filter Groups

Let's find products with total sales greater than $300:

sql
SELECT 
    product_id, 
    SUM(amount) AS total_revenue
FROM 
    sales
GROUP BY 
    product_id
HAVING 
    SUM(amount) > 300
ORDER BY 
    total_revenue DESC;

Output:

 product_id | total_revenue
------------+---------------
          2 |        453.00
          1 |        450.00
          3 |        400.00

This query first groups the sales by product, calculates the sum for each group, and then filters to show only products with a total revenue greater than $300.

GROUP BY with Date Functions

PostgreSQL offers powerful date manipulation functions that work well with GROUP BY to analyze time-based data.

Example 5: Monthly Sales Analysis

sql
SELECT 
    EXTRACT(MONTH FROM sale_date) AS month,
    EXTRACT(YEAR FROM sale_date) AS year,
    COUNT(*) AS sales_count,
    SUM(amount) AS monthly_revenue
FROM 
    sales
GROUP BY 
    year, month
ORDER BY 
    year, month;

Output:

 month | year | sales_count | monthly_revenue
-------+------+-------------+-----------------
     1 | 2023 |           8 |         1303.00

This query groups sales by month and year, showing the count and sum of sales for each period.

Using GROUP BY with CASE statements

We can combine GROUP BY with CASE statements for even more sophisticated analysis.

Example 6: Categorizing Sales by Value

sql
SELECT 
    CASE 
        WHEN amount < 100 THEN 'Low Value'
        WHEN amount >= 100 AND amount < 200 THEN 'Medium Value'
        ELSE 'High Value'
    END AS sale_category,
    COUNT(*) AS count,
    SUM(amount) AS total_revenue
FROM 
    sales
GROUP BY 
    sale_category
ORDER BY 
    total_revenue DESC;

Output:

 sale_category | count | total_revenue
---------------+-------+---------------
 High Value    |     3 |        651.50
 Medium Value  |     3 |        501.50
 Low Value     |     2 |        150.50

This query categorizes sales into different value brackets and then counts and sums each category.

ROLLUP and CUBE for Advanced Aggregation

PostgreSQL supports advanced aggregation extensions to GROUP BY like ROLLUP and CUBE for generating subtotals and grand totals.

Example 7: Using ROLLUP for Hierarchical Summaries

sql
SELECT 
    COALESCE(product_id::TEXT, 'All Products') AS product,
    COALESCE(DATE_TRUNC('month', sale_date)::TEXT, 'All Dates') AS month,
    SUM(amount) AS total_sales
FROM 
    sales
GROUP BY 
    ROLLUP(product_id, DATE_TRUNC('month', sale_date))
ORDER BY 
    product_id NULLS LAST, month NULLS LAST;

This query uses ROLLUP to generate subtotals by product and month, as well as a grand total.

Common Mistakes and Best Practices

Mistake 1: Selecting Non-Aggregated Columns

A common mistake is including columns in the SELECT clause that are not in the GROUP BY clause and not wrapped in an aggregate function:

sql
-- This will cause an error
SELECT product_id, customer_id, SUM(amount)
FROM sales
GROUP BY product_id;

PostgreSQL will give an error because it doesn't know which customer_id value to show for each product group.

Mistake 2: Using WHERE instead of HAVING

Remember that WHERE filters rows before grouping, while HAVING filters groups after grouping:

sql
-- Incorrect: This will filter sales before grouping
SELECT product_id, SUM(amount) AS total
FROM sales
WHERE SUM(amount) > 300  -- This will cause an error
GROUP BY product_id;

-- Correct: This will filter groups after grouping
SELECT product_id, SUM(amount) AS total
FROM sales
GROUP BY product_id
HAVING SUM(amount) > 300;

Best Practice: Query Execution Order

Understanding the order of execution can help write better queries:

FROM clause selects and joins tables
WHERE clause filters rows
GROUP BY clause groups rows
HAVING clause filters groups
SELECT clause selects columns
ORDER BY clause sorts the result

Real-World Applications

Application 1: Sales Dashboard

sql
SELECT 
    TO_CHAR(DATE_TRUNC('month', sale_date), 'Month YYYY') AS month,
    SUM(amount) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(amount) / COUNT(DISTINCT customer_id) AS revenue_per_customer
FROM 
    sales
GROUP BY 
    DATE_TRUNC('month', sale_date)
ORDER BY 
    DATE_TRUNC('month', sale_date);

This query could power a monthly sales dashboard, showing revenue trends, customer counts, and per-customer metrics.

Application 2: Product Performance Analysis

sql
WITH product_sales AS (
    SELECT 
        product_id,
        COUNT(*) AS transaction_count,
        SUM(quantity) AS units_sold,
        SUM(amount) AS total_revenue,
        AVG(amount) AS avg_transaction_value
    FROM 
        sales
    GROUP BY 
        product_id
)
SELECT 
    p.*,
    total_revenue / units_sold AS revenue_per_unit,
    total_revenue / (SELECT SUM(total_revenue) FROM product_sales) * 100 AS revenue_percentage
FROM 
    product_sales p
ORDER BY 
    total_revenue DESC;

This complex query provides a comprehensive product performance analysis, showing sales metrics and comparing each product's contribution to total revenue.

Summary

The GROUP BY clause is an essential tool for data analysis in PostgreSQL. By grouping rows with similar values, we can:

Aggregate data for reporting and analysis
Generate statistics across different dimensions
Filter grouped data using the HAVING clause
Combine with other SQL features for powerful data insights

Mastering GROUP BY allows you to transform raw data into meaningful business insights efficiently. It's particularly valuable for building dashboards, generating reports, and performing data analysis.

Exercises

Using the sales table, write a query to find the total number of sales and revenue by day of the week.
Write a query to find customers who have spent more than the average customer.
Create a report showing quarterly sales totals and the percentage change from the previous quarter.
Write a query to find which product has the highest average sale amount and the lowest average sale amount.
Using GROUP BY with HAVING, find products that have been purchased by at least three different customers.

Additional Resources

PostgreSQL Documentation on GROUP BY
Advanced PostgreSQL Aggregation
PostgreSQL Window Functions (which complement GROUP BY for complex analysis)

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Basic Syntax​

Understanding GROUP BY with Examples​

Example 1: Simple GROUP BY with COUNT​

Example 2: Multiple Aggregate Functions​

Example 3: GROUP BY with Multiple Columns​

The HAVING Clause with GROUP BY​

Example 4: Using HAVING to Filter Groups​

GROUP BY with Date Functions​

Example 5: Monthly Sales Analysis​

Using GROUP BY with CASE statements​

Example 6: Categorizing Sales by Value​

ROLLUP and CUBE for Advanced Aggregation​

Example 7: Using ROLLUP for Hierarchical Summaries​

Common Mistakes and Best Practices​

Mistake 1: Selecting Non-Aggregated Columns​

Mistake 2: Using WHERE instead of HAVING​

Best Practice: Query Execution Order​

Real-World Applications​

Application 1: Sales Dashboard​

Application 2: Product Performance Analysis​

Summary​

Exercises​

Additional Resources​

Introduction

Basic Syntax

Understanding GROUP BY with Examples

Example 1: Simple GROUP BY with COUNT

Example 2: Multiple Aggregate Functions

Example 3: GROUP BY with Multiple Columns

The HAVING Clause with GROUP BY

Example 4: Using HAVING to Filter Groups

GROUP BY with Date Functions

Example 5: Monthly Sales Analysis

Using GROUP BY with CASE statements

Example 6: Categorizing Sales by Value

ROLLUP and CUBE for Advanced Aggregation

Example 7: Using ROLLUP for Hierarchical Summaries

Common Mistakes and Best Practices

Mistake 1: Selecting Non-Aggregated Columns

Mistake 2: Using WHERE instead of HAVING

Best Practice: Query Execution Order

Real-World Applications

Application 1: Sales Dashboard

Application 2: Product Performance Analysis

Summary

Exercises

Additional Resources