Redis HyperLogLogs

Introduction

HyperLogLog is one of Redis' most fascinating data structures, yet it's often overlooked by beginners. It solves a very specific problem: efficiently counting unique elements in a dataset (also known as calculating the "cardinality" of a set).

While this might sound simple—after all, couldn't you just use a Redis Set and count its members?—HyperLogLog shines when dealing with massive datasets where memory efficiency becomes critical.

What Makes HyperLogLog Special?

HyperLogLog is a probabilistic data structure that estimates the number of unique elements with remarkable accuracy while using minimal memory:

Uses only 12KB of memory per HyperLogLog
Provides accuracy with an error rate of just 0.81%
Can count billions of unique elements without increasing memory usage

Let's compare this to traditional methods:

Basic HyperLogLog Operations

HyperLogLog supports a handful of commands that are simple to use but powerful in application.

Creating and Adding Elements

To add elements to a HyperLogLog, we use the PFADD command:

PFADD visitors user1 user2 user3

This command:

Creates the HyperLogLog if it doesn't exist
Adds the specified elements
Returns 1 if the estimated cardinality changed, 0 otherwise

Counting Unique Elements

To get the estimated count of unique elements, use PFCOUNT:

PFCOUNT visitors

This returns the approximate number of unique elements added to the HyperLogLog.

Merging Multiple HyperLogLogs

One of the most powerful features is the ability to merge multiple HyperLogLogs:

PFMERGE result visitors1 visitors2 visitors3

This merges the unique counts from multiple sources into a single HyperLogLog.

Practical Example: Tracking Website Visitors

Let's implement a simple visitor tracking system using HyperLogLogs:

js
// Imagine this function runs when a user visits your website
function trackUserVisit(userId, page) {
  // Track unique visitors per page
  redis.pfadd(`page:${page}:visitors`, userId);
  
  // Track unique visitors for the whole site
  redis.pfadd('site:visitors', userId);
  
  // Track unique visitors for today
  const today = new Date().toISOString().split('T')[0];
  redis.pfadd(`site:visitors:${today}`, userId);
}

// To get unique visitor counts
async function getVisitorStats() {
  const [siteTotal, todayTotal] = await Promise.all([
    redis.pfcount('site:visitors'),
    redis.pfcount(`site:visitors:${new Date().toISOString().split('T')[0]}`)
  ]);
  
  return {
    allTime: siteTotal,
    today: todayTotal
  };
}

Memory Efficiency in Action

If you were to track 10 million unique visitors using traditional Redis Sets:

Set approach: ~400MB (40 bytes per user ID)
HyperLogLog approach: 12KB (constant regardless of unique count)

That's a 33,000x memory saving!

Real-World Applications

HyperLogLog is especially useful for:

Analytics systems: Counting unique visitors, page views, etc.
Network monitoring: Tracking unique IP addresses
Database systems: Estimating query result sizes
Ad tech: Counting unique impressions across campaigns
Social media: Approximating reach metrics

Common Mistakes and Limitations

While powerful, HyperLogLogs have some important limitations to keep in mind:

They're probabilistic: You trade perfect accuracy for memory efficiency
They're write-only: You can't retrieve the elements you've added, only count them
They only count uniques: They don't track frequency or other metrics
Commands start with PF: Named after Philippe Flajolet, inventor of the algorithm

Extended Example: User Activity Tracking

Let's build a more complex example tracking daily and weekly active users:

js
// Track user activity
async function trackUserActivity(userId) {
  const today = new Date().toISOString().split('T')[0];
  const weekNumber = getWeekNumber(new Date()); // Helper function to get week number
  
  // Record this user's activity for today and this week
  await redis.pfadd(`users:active:day:${today}`, userId);
  await redis.pfadd(`users:active:week:${weekNumber}`, userId);
}

// Get daily active users for the past week
async function getDailyActiveUsers() {
  const days = [];
  const counts = [];
  
  // Generate the past 7 days
  for (let i = 0; i < 7; i++) {
    const date = new Date();
    date.setDate(date.getDate() - i);
    const day = date.toISOString().split('T')[0];
    days.push(day);
  }
  
  // Get DAU counts for each day
  for (const day of days) {
    const count = await redis.pfcount(`users:active:day:${day}`);
    counts.push({ day, count });
  }
  
  return counts;
}

// Calculate daily to weekly active user ratio
async function getDAU_WAU_Ratio() {
  const today = new Date().toISOString().split('T')[0];
  const currentWeek = getWeekNumber(new Date());
  
  const [dau, wau] = await Promise.all([
    redis.pfcount(`users:active:day:${today}`),
    redis.pfcount(`users:active:week:${currentWeek}`)
  ]);
  
  return {
    dau,
    wau,
    ratio: wau > 0 ? (dau / wau) : 0
  };
}

HyperLogLog Under the Hood

While the full algorithm is complex, here's a simplified explanation of how HyperLogLog works:

It uses a hash function to convert input elements to binary numbers
It focuses on patterns in these numbers (specifically, the position of the leftmost '1' bit)
By tracking these patterns across many inputs, it can estimate cardinality
Multiple estimators are combined to improve accuracy

This approach is based on the observation that the pattern of leading zeros in randomly distributed hash values follows a predictable distribution.

Summary

Redis HyperLogLogs provide an incredibly memory-efficient way to count unique elements in large datasets. While they sacrifice perfect accuracy for efficiency, their ~99% accuracy is more than sufficient for most applications.

Key points to remember:

Use HyperLogLogs when you need to count unique items in very large datasets
They use constant memory regardless of the number of elements counted
Commands are prefixed with PF (PFADD, PFCOUNT, PFMERGE)
They're write-only; you can't retrieve the original elements

Additional Resources

Here are some exercises to practice working with HyperLogLogs:

Build a simple CLI tool that tracks unique words in text files using Redis HyperLogLogs
Implement a basic analytics system that tracks unique visitors per page and across your entire site
Compare the memory and performance differences between using Redis Sets and HyperLogLogs for tracking unique items
Create a system that tracks unique user actions across different time windows (hour, day, week, month)

Practice Exercises

Unique Word Counter: Create a function that counts unique words in a text using HyperLogLog
IP Address Tracker: Build a system that tracks unique IP addresses accessing a web server
Multi-dimensional Analytics: Track unique users across different dimensions (device, country, page)
Approximate Joins: Use HyperLogLogs to estimate the size of database join operations before executing them

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

What Makes HyperLogLog Special?​

Basic HyperLogLog Operations​

Creating and Adding Elements​

Counting Unique Elements​

Merging Multiple HyperLogLogs​

Practical Example: Tracking Website Visitors​

Memory Efficiency in Action​

Real-World Applications​

Common Mistakes and Limitations​

Extended Example: User Activity Tracking​

HyperLogLog Under the Hood​

Summary​

Additional Resources​

Practice Exercises​