C++ Performance Optimization

Introduction

Performance optimization is a critical aspect of C++ programming that can significantly impact how efficiently your applications run. C++ provides developers with fine-grained control over system resources, making it particularly well-suited for performance-critical applications. However, with this power comes responsibility—writing efficient C++ code requires understanding various optimization techniques and knowing when to apply them.

In this guide, we'll explore practical approaches to optimize your C++ code, from simple compiler flags to advanced techniques like algorithm selection, memory management, and data structure optimization. Whether you're building games, scientific applications, or embedded systems, these techniques will help you write faster, more efficient code.

Why Optimize?

Before diving into optimization techniques, it's important to understand why we optimize:

  1. Better user experience: Faster applications lead to better user satisfaction
  2. Resource efficiency: Optimized code uses less CPU, memory, and energy
  3. Scalability: Efficient code can handle larger workloads
  4. Cost savings: In cloud environments, optimized code can reduce operational costs

However, remember the famous programming wisdom:

"Premature optimization is the root of all evil" — Donald Knuth

Focus first on writing correct, maintainable code. Optimize only after you've identified performance bottlenecks through profiling.

Compiler Optimizations

Optimization Flags

Modern C++ compilers can perform sophisticated optimizations automatically. Enable them with compiler flags:

bash
# GCC/Clang
g++ -O2 program.cpp -o program # Moderate optimization
g++ -O3 program.cpp -o program # Aggressive optimization

# MSVC
cl /O2 program.cpp # Optimize for speed (roughly comparable to -O2)

Different optimization levels make different trade-offs:

  • -O0 or no flag: No optimization (fastest compilation, slowest execution)
  • -O1: Basic optimizations
  • -O2: Most optimizations without significant code size increase
  • -O3: Aggressive optimizations, including more aggressive inlining and loop vectorization

Example: Impact of Compiler Optimization

cpp
#include <iostream>
#include <chrono>
#include <vector>

// Function to compute sum of a vector
double sum_vector(const std::vector<double>& vec) {
    double sum = 0.0;
    for (size_t i = 0; i < vec.size(); ++i) {
        sum += vec[i];
    }
    return sum;
}

int main() {
    // Create a large vector
    const size_t size = 100000000;
    std::vector<double> numbers(size, 1.0);

    // Measure execution time
    auto start = std::chrono::high_resolution_clock::now();
    double result = sum_vector(numbers);
    auto end = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> duration = end - start;

    std::cout << "Sum: " << result << std::endl;
    std::cout << "Time taken: " << duration.count() << " seconds" << std::endl;

    return 0;
}

When compiled with different optimization levels, you might see results like:

# With -O0
Sum: 1e+08
Time taken: 0.31 seconds

# With -O3
Sum: 1e+08
Time taken: 0.08 seconds

The optimized version runs significantly faster because the compiler can apply loop unrolling, vectorization, and other optimizations.

Compiler Intrinsics and Pragmas

Compilers also offer specific intrinsics and pragmas to guide optimization:

cpp
// Hint to the compiler that a branch is likely/unlikely (GCC/Clang)
if (__builtin_expect(error_condition, 0)) { // Unlikely path
    handle_error();
}

// Force inline a function (__forceinline is MSVC-specific;
// GCC/Clang use __attribute__((always_inline)))
__forceinline void critical_function() {
    // Function body
}

// Unroll loops (#pragma unroll is supported by Clang;
// GCC spells it #pragma GCC unroll N)
#pragma unroll
for (int i = 0; i < 4; i++) {
    // Loop body
}
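
These intrinsics are compiler-specific. Since C++20, branch-probability hints also have a portable spelling as standard attributes; a minimal sketch:

cpp
#include <cstdio>

// C++20 standard attributes replace __builtin_expect portably
void process(int status) {
    if (status != 0) [[unlikely]] {
        std::puts("error path");
    } else [[likely]] {
        std::puts("hot path");
    }
}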

Algorithm Optimization

Choose the Right Algorithm

The most significant performance improvements often come from choosing better algorithms. An O(n) algorithm will always outperform an O(n²) algorithm for large enough n, regardless of low-level optimizations.

Example: Finding an Element

Consider these two approaches to find an element in a collection:

cpp
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>

int main() {
    // Create a large sorted vector
    const size_t size = 10000000;
    std::vector<int> numbers(size);
    for (size_t i = 0; i < size; ++i) {
        numbers[i] = static_cast<int>(i);
    }

    // Value to search for
    int target = 7654321;

    // Method 1: Linear search - O(n)
    auto start1 = std::chrono::high_resolution_clock::now();

    bool found1 = false;
    for (int num : numbers) {
        if (num == target) {
            found1 = true;
            break;
        }
    }

    auto end1 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration1 = end1 - start1;

    // Method 2: Binary search - O(log n)
    auto start2 = std::chrono::high_resolution_clock::now();

    bool found2 = std::binary_search(numbers.begin(), numbers.end(), target);

    auto end2 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration2 = end2 - start2;

    std::cout << "Linear search: " << duration1.count() << " seconds, found: " << found1 << std::endl;
    std::cout << "Binary search: " << duration2.count() << " seconds, found: " << found2 << std::endl;

    return 0;
}

Output:

Linear search: 0.0424 seconds, found: 1
Binary search: 0.0000012 seconds, found: 1

The binary search is thousands of times faster because it has O(log n) complexity instead of O(n).
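
If you also need the element's position, std::lower_bound performs the same O(log n) search but returns an iterator instead of a bool:

cpp
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> numbers = {1, 3, 5, 7, 9}; // must be sorted
    int target = 7;

    auto it = std::lower_bound(numbers.begin(), numbers.end(), target);
    if (it != numbers.end() && *it == target) {
        std::cout << "Found at index " << (it - numbers.begin()) << std::endl;
    }

    return 0;
}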

Use Standard Library Algorithms

The C++ Standard Library provides optimized implementations of common algorithms:

cpp
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>

int main() {
    std::vector<int> data = {5, 2, 9, 1, 7, 3, 8, 6, 4};

    // Sort the vector
    std::sort(data.begin(), data.end());

    // Find minimum and maximum
    auto [min_it, max_it] = std::minmax_element(data.begin(), data.end());

    // Calculate sum
    int sum = std::accumulate(data.begin(), data.end(), 0);

    std::cout << "Sorted: ";
    for (int x : data) std::cout << x << " ";
    std::cout << "\nMin: " << *min_it << ", Max: " << *max_it << std::endl;
    std::cout << "Sum: " << sum << std::endl;

    return 0;
}

Output:

Sorted: 1 2 3 4 5 6 7 8 9 
Min: 1, Max: 9
Sum: 45

These algorithms are often more efficient than hand-written loops because they're carefully implemented and might use architecture-specific optimizations.

Memory Management Optimization

Avoid Dynamic Allocation When Possible

Dynamic memory allocation (using new and delete) is slower than stack allocation and can lead to memory fragmentation.

cpp
// Slow: Dynamic allocation
void slow_function() {
    for (int i = 0; i < 1000; i++) {
        int* data = new int[100];
        // Process data
        delete[] data;
    }
}

// Fast: Stack allocation
void fast_function() {
    for (int i = 0; i < 1000; i++) {
        int data[100];
        // Process data
    }
}
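
When the size is a compile-time constant, std::array offers the same stack allocation with the convenience of a container interface; a minimal sketch:

cpp
#include <array>
#include <numeric>

// Stack-allocated like a raw array, but with size(), iterators,
// and compatibility with standard algorithms
int sum_fixed_buffer() {
    std::array<int, 100> data{}; // zero-initialized, no heap allocation
    // Process data...
    return std::accumulate(data.begin(), data.end(), 0);
}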

Use Reserve for Containers

When you know the approximate size of a container beforehand, reserve memory to avoid reallocations:

cpp
#include <iostream>
#include <vector>
#include <chrono>

int main() {
    const size_t elements = 10000000;

    // Without reserve
    auto start1 = std::chrono::high_resolution_clock::now();
    std::vector<int> v1;
    for (size_t i = 0; i < elements; ++i) {
        v1.push_back(static_cast<int>(i));
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration1 = end1 - start1;

    // With reserve
    auto start2 = std::chrono::high_resolution_clock::now();
    std::vector<int> v2;
    v2.reserve(elements); // Pre-allocate memory
    for (size_t i = 0; i < elements; ++i) {
        v2.push_back(static_cast<int>(i));
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration2 = end2 - start2;

    std::cout << "Without reserve: " << duration1.count() << " seconds" << std::endl;
    std::cout << "With reserve: " << duration2.count() << " seconds" << std::endl;

    return 0;
}

Output:

Without reserve: 0.143 seconds
With reserve: 0.062 seconds

Use Custom Allocators

For performance-critical applications, custom allocators can significantly improve memory management:

cpp
#include <iostream>
#include <vector>
#include <new> // std::bad_alloc

// Simple pool allocator (an illustrative sketch: it bump-allocates from
// fixed-size blocks and only frees memory when the allocator is destroyed;
// it ignores alignment beyond that of Block and is not thread-safe)
template<typename T, size_t BlockSize = 4096>
class PoolAllocator {
private:
    struct Block {
        char data[BlockSize];
        Block* next;
    };

    Block* currentBlock = nullptr;
    size_t currentOffset = BlockSize;

public:
    using value_type = T;

    PoolAllocator() = default;

    template<typename U>
    PoolAllocator(const PoolAllocator<U>&) {}

    T* allocate(size_t n) {
        size_t bytes = n * sizeof(T);
        if (bytes > BlockSize) {
            throw std::bad_alloc(); // request larger than a whole block
        }
        if (currentOffset + bytes > BlockSize) {
            // Allocate a new block and chain it to the previous ones
            Block* newBlock = new Block;
            newBlock->next = currentBlock;
            currentBlock = newBlock;
            currentOffset = 0;
        }

        T* result = reinterpret_cast<T*>(currentBlock->data + currentOffset);
        currentOffset += bytes;
        return result;
    }

    void deallocate(T*, size_t) {
        // Memory is freed when the allocator is destroyed
    }

    ~PoolAllocator() {
        while (currentBlock) {
            Block* next = currentBlock->next;
            delete currentBlock;
            currentBlock = next;
        }
    }
};

int main() {
    // Using standard allocator
    std::vector<int> standard_vec;

    // Using custom pool allocator
    std::vector<int, PoolAllocator<int>> pool_vec;

    // Test them...

    return 0;
}
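
Since C++17, the standard library ships this idea as polymorphic memory resources; std::pmr::monotonic_buffer_resource is a ready-made bump allocator:

cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

int main() {
    // Carve allocations out of a stack buffer; nothing is freed until
    // the resource is destroyed (falls back to the heap if the buffer fills)
    std::byte buffer[4096];
    std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer));

    std::pmr::vector<int> vec(&pool); // allocations come from the pool
    for (int i = 0; i < 100; ++i) {
        vec.push_back(i);
    }

    return 0;
}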

Memory Layout Optimization

How you organize data can significantly impact performance due to CPU caching:

cpp
// Inefficient: Array of structures
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass, radius;
};
std::vector<Particle> particles(1000);

// More efficient: Structure of arrays
struct ParticleSystem {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass, radius;
};
ParticleSystem particles;
particles.x.resize(1000);
particles.y.resize(1000);
// ...

The structure of arrays layout can be more cache-friendly for operations that process one property for all particles, as it improves spatial locality.
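
For example, a position update that touches only x and vx streams through two contiguous arrays in the SoA layout, instead of striding past unused fields; a minimal sketch using the ParticleSystem above:

cpp
// SoA update: reads and writes are dense, which helps caching and
// gives the compiler an easy target for auto-vectorization
void update_x(ParticleSystem& ps, float dt) {
    for (size_t i = 0; i < ps.x.size(); ++i) {
        ps.x[i] += ps.vx[i] * dt;
    }
}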

Loop Optimization

Loop Unrolling

Manually unrolling loops reduces per-iteration overhead (the increment, compare, and branch) and can expose instruction-level parallelism. Compilers already unroll many loops at higher optimization levels, so measure before doing this by hand:

cpp
// Original loop
double sum = 0.0;
for (int i = 0; i < 1000; i++) {
    sum += data[i];
}

// Unrolled loop (assumes the length is a multiple of 4)
double sum = 0.0;
for (int i = 0; i < 1000; i += 4) {
    sum += data[i];
    sum += data[i+1];
    sum += data[i+2];
    sum += data[i+3];
}
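
A variant that often helps more is unrolling with independent partial sums, which breaks the serial dependency on a single accumulator (each addition no longer has to wait for the previous one):

cpp
// Four independent accumulators; assumes the length is a multiple of 4
// (note: floating-point results may differ slightly due to reordering)
double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
for (int i = 0; i < 1000; i += 4) {
    s0 += data[i];
    s1 += data[i+1];
    s2 += data[i+2];
    s3 += data[i+3];
}
double sum = s0 + s1 + s2 + s3;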

Loop Fusion

Combine multiple loops that operate on the same data to improve cache utilization:

cpp
// Before fusion - two separate loops
for (int i = 0; i < n; i++) {
    result1[i] = input[i] * 2;
}
for (int i = 0; i < n; i++) {
    result2[i] = result1[i] + 5;
}

// After fusion - one combined loop
for (int i = 0; i < n; i++) {
    result1[i] = input[i] * 2;
    result2[i] = result1[i] + 5;
}
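
If result1 is only an intermediate value, fusion also exposes the chance to eliminate the extra array and its memory traffic entirely:

cpp
// No intermediate array needed once the loops are fused
for (int i = 0; i < n; i++) {
    result2[i] = input[i] * 2 + 5;
}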

Remove Invariant Calculations

Move calculations that don't change out of loops:

cpp
// Inefficient: std::pow is re-evaluated on every iteration
for (int i = 0; i < n; i++) {
    double result = std::pow(base, exponent) * data[i];
    // ...
}

// Efficient: hoist the invariant computation out of the loop
double power = std::pow(base, exponent);
for (int i = 0; i < n; i++) {
    double result = power * data[i];
    // ...
}

Data Structure Optimization

Use the Right Container

Choose containers based on your usage patterns; each container makes different performance trade-offs:

Container       Access     Insertion      Deletion   Memory Overhead
vector          O(1)       O(1)* or O(n)  O(n)       Low
list            O(n)       O(1)           O(1)       High
map             O(log n)   O(log n)       O(log n)   Medium
unordered_map   O(1) avg   O(1) avg       O(1) avg   Medium-High

*amortized, at the end
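
As a quick illustration of matching container to access pattern (a sketch; the right choice always depends on measured workloads):

cpp
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Random access dominates: vector (contiguous storage, O(1) indexing)
    std::vector<double> samples(1024, 0.0);
    samples[512] = 1.0;

    // Key lookup dominates: unordered_map (O(1) average)
    std::unordered_map<std::string, int> counts;
    ++counts["hits"];

    // Frequent erasure mid-sequence with stable iterators: list
    std::list<int> pending = {1, 2, 3};
    pending.erase(pending.begin());

    return 0;
}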

Custom Hash Functions

For unordered containers, efficient hash functions improve performance:

cpp
#include <iostream>
#include <unordered_map>
#include <string>

// Point structure used as a key
struct Point {
    int x, y;

    bool operator==(const Point& other) const {
        return x == other.x && y == other.y;
    }
};

// Custom hash function for Point
struct PointHash {
    size_t operator()(const Point& p) const {
        // Combine the hashes of x and y. This shift-XOR combine is simple
        // but fairly weak; see the stronger combiner below
        return std::hash<int>()(p.x) ^ (std::hash<int>()(p.y) << 1);
    }
};

int main() {
    // Using the custom hash function
    std::unordered_map<Point, std::string, PointHash> point_map;

    point_map[{10, 20}] = "Point A";
    point_map[{30, 40}] = "Point B";

    std::cout << "Point {10, 20}: " << point_map[{10, 20}] << std::endl;

    return 0;
}
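
For keys with several fields, a boost-style hash_combine mixes the bits more thoroughly and tends to produce fewer collisions; a sketch (the magic constant comes from boost::hash_combine):

cpp
#include <cstddef>
#include <functional>

// Mixes a new value into an existing hash seed
inline size_t hash_combine(size_t seed, size_t value) {
    return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

struct Point { int x, y; };

struct BetterPointHash {
    size_t operator()(const Point& p) const {
        size_t h = std::hash<int>()(p.x);
        return hash_combine(h, std::hash<int>()(p.y));
    }
};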

CPU Optimization Techniques

SIMD Instructions

Single Instruction, Multiple Data (SIMD) instructions can process multiple data elements simultaneously:

cpp
#include <iostream>
#include <immintrin.h> // For AVX intrinsics (compile with e.g. -mavx)
#include <chrono>

// Regular function to add two arrays
void add_arrays(const float* a, const float* b, float* result, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

// SIMD function to add two arrays using AVX
void add_arrays_simd(const float* a, const float* b, float* result, size_t n) {
    size_t i = 0;

    // Process 8 floats at a time using AVX
    for (; i + 7 < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vresult = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(result + i, vresult);
    }

    // Handle remaining elements
    for (; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

int main() {
    const size_t n = 10000000;

    // Align memory to 32-byte boundary for AVX
    float* a = static_cast<float*>(_mm_malloc(n * sizeof(float), 32));
    float* b = static_cast<float*>(_mm_malloc(n * sizeof(float), 32));
    float* result1 = static_cast<float*>(_mm_malloc(n * sizeof(float), 32));
    float* result2 = static_cast<float*>(_mm_malloc(n * sizeof(float), 32));

    // Initialize arrays
    for (size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(i * 2);
    }

    // Time regular version
    auto start1 = std::chrono::high_resolution_clock::now();
    add_arrays(a, b, result1, n);
    auto end1 = std::chrono::high_resolution_clock::now();

    // Time SIMD version
    auto start2 = std::chrono::high_resolution_clock::now();
    add_arrays_simd(a, b, result2, n);
    auto end2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> duration1 = end1 - start1;
    std::chrono::duration<double> duration2 = end2 - start2;

    std::cout << "Regular version: " << duration1.count() << " seconds" << std::endl;
    std::cout << "SIMD version: " << duration2.count() << " seconds" << std::endl;

    // Verify results
    bool correct = true;
    for (size_t i = 0; i < n; ++i) {
        if (result1[i] != result2[i]) {
            correct = false;
            break;
        }
    }
    std::cout << "Results are " << (correct ? "correct" : "incorrect") << std::endl;

    // Free memory
    _mm_free(a);
    _mm_free(b);
    _mm_free(result1);
    _mm_free(result2);

    return 0;
}

Output:

Regular version: 0.028 seconds
SIMD version: 0.008 seconds
Results are correct

Cache-Friendly Code

Optimize code to better utilize CPU cache:

cpp
// Cache-unfriendly: Column-major traversal of row-major array
for (int col = 0; col < N; col++) {
    for (int row = 0; row < N; row++) {
        matrix[row][col] = computeValue(row, col);
    }
}

// Cache-friendly: Row-major traversal of row-major array
for (int row = 0; row < N; row++) {
    for (int col = 0; col < N; col++) {
        matrix[row][col] = computeValue(row, col);
    }
}

Avoid Branch Mispredictions

Branch mispredictions can stall the CPU pipeline. Use techniques like branch-free code when appropriate; note that modern compilers often emit conditional moves for simple patterns like these anyway, so measure before committing to them:

cpp
// Branching version
int max(int a, int b) {
    if (a > b) return a;
    else return b;
}

// Branch-free version (relies on bool converting to 0 or 1)
int max_branchless(int a, int b) {
    return a * (a >= b) + b * (b > a);
}

// Another branch-free technique
int abs_branchless(int x) {
    // Get the sign bit (right-shifting a negative signed value is
    // implementation-defined before C++20, but mainstream compilers
    // perform an arithmetic shift)
    int mask = x >> 31;
    // If x is negative, compute -x, otherwise keep x
    return (x ^ mask) - mask;
}

Multithreading and Parallelism

Use std::async for Simple Parallelism

cpp
#include <iostream>
#include <vector>
#include <future>
#include <numeric>
#include <chrono>

// Function to sum a range of a vector
double sum_range(const std::vector<double>& vec, size_t start, size_t end) {
    return std::accumulate(vec.begin() + start, vec.begin() + end, 0.0);
}

int main() {
    // Create a large vector
    const size_t size = 100000000;
    std::vector<double> numbers(size, 1.0);

    // Sequential sum
    auto start1 = std::chrono::high_resolution_clock::now();
    double sum1 = std::accumulate(numbers.begin(), numbers.end(), 0.0);
    auto end1 = std::chrono::high_resolution_clock::now();

    // Parallel sum using std::async
    auto start2 = std::chrono::high_resolution_clock::now();

    // Split the work into 4 tasks
    size_t quarter = size / 4;
    auto future1 = std::async(std::launch::async, sum_range, std::ref(numbers), 0, quarter);
    auto future2 = std::async(std::launch::async, sum_range, std::ref(numbers), quarter, quarter*2);
    auto future3 = std::async(std::launch::async, sum_range, std::ref(numbers), quarter*2, quarter*3);
    auto future4 = std::async(std::launch::async, sum_range, std::ref(numbers), quarter*3, size);

    // Get results and sum them
    double sum2 = future1.get() + future2.get() + future3.get() + future4.get();

    auto end2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> duration1 = end1 - start1;
    std::chrono::duration<double> duration2 = end2 - start2;

    std::cout << "Sequential sum: " << sum1 << " in " << duration1.count() << " seconds" << std::endl;
    std::cout << "Parallel sum: " << sum2 << " in " << duration2.count() << " seconds" << std::endl;

    return 0;
}

Output:

Sequential sum: 1e+08 in 0.24 seconds
Parallel sum: 1e+08 in 0.07 seconds

Use std::execution Policies (C++17)

C++17 introduced execution policies for standard algorithms (note that with GCC's libstdc++, the parallel policies require linking against Intel TBB, e.g. -ltbb):

cpp
#include <iostream>
#include <vector>
#include <algorithm>
#include <execution>
#include <chrono>
#include <random>

int main() {
    // Create a large vector with random values
    const size_t size = 10000000;
    std::vector<int> vec(size);

    // Fill with random numbers
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> distrib(1, 100);

    for (auto& num : vec) {
        num = distrib(gen);
    }

    // Sequential sort
    std::vector<int> vec1 = vec;
    auto start1 = std::chrono::high_resolution_clock::now();
    std::sort(vec1.begin(), vec1.end());
    auto end1 = std::chrono::high_resolution_clock::now();

    // Parallel sort
    std::vector<int> vec2 = vec;
    auto start2 = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, vec2.begin(), vec2.end());
    auto end2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> duration1 = end1 - start1;
    std::chrono::duration<double> duration2 = end2 - start2;

    std::cout << "Sequential sort: " << duration1.count() << " seconds" << std::endl;
    std::cout << "Parallel sort: " << duration2.count() << " seconds" << std::endl;

    // Verify results
    bool same = std::equal(vec1.begin(), vec1.end(), vec2.begin());
    std::cout << "Results are " << (same ? "identical" : "different") << std::endl;

    return 0;
}

Output:

Sequential sort: 1.24 seconds
Parallel sort: 0.35 seconds
Results are identical
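
The same policies apply to other algorithms. For reductions, std::reduce is the parallel-friendly counterpart of std::accumulate (it assumes the operation is associative and commutative, which lets the library reorder work across threads):

cpp
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> numbers(100000000, 1.0);

    // Parallel reduction; the order of operations is unspecified
    double sum = std::reduce(std::execution::par, numbers.begin(), numbers.end(), 0.0);

    std::cout << "Sum: " << sum << std::endl;
    return 0;
}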

Real-World Case Study: Image Processing

Let's look at a real-world example of performance optimization in an image processing context. We'll implement a simple Gaussian blur filter with progressive optimizations.

Version 1: Basic Implementation

cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <chrono>

// Simple image class
class Image {
public:
    Image(int width, int height)
        : width_(width), height_(height), pixels_(width * height, 0) {}

    // Clamps out-of-range coordinates to the image border
    float getPixel(int x, int y) const {
        if (x < 0) x = 0;
        if (x >= width_) x = width_ - 1;
        if (y < 0) y = 0;
        if (y >= height_) y = height_ - 1;
        return pixels_[y * width_ + x];
    }

    void setPixel(int x, int y, float value) {
        pixels_[y * width_ + x] = value;
    }

    int getWidth() const { return width_; }
    int getHeight() const { return height_; }

private:
    int width_, height_;
    std::vector<float> pixels_;
};

// Version 1: Basic Gaussian blur implementation
Image gaussianBlur_v1(const Image& input, float sigma) {
    // Create output image
    Image output(input.getWidth(), input.getHeight());

    // Compute kernel size based on sigma
    int kernelSize = static_cast<int>(2 * std::ceil(3 * sigma)) + 1;
    int kernelRadius = kernelSize / 2;

    // Create kernel
    std::vector<float> kernel(kernelSize * kernelSize);
    float kernelSum = 0.0f;

    for (int y = -kernelRadius; y <= kernelRadius; y++) {
        for (int x = -kernelRadius; x <= kernelRadius; x++) {
            float value = std::exp(-(x*x + y*y) / (2 * sigma * sigma));
            kernel[(y + kernelRadius) * kernelSize + (x + kernelRadius)] = value;
            kernelSum += value;
        }
    }

    // Normalize kernel
    for (auto& k : kernel) {
        k /= kernelSum;
    }

    // Apply convolution
    for (int y = 0; y < input.getHeight(); y++) {
        for (int x = 0; x < input.getWidth(); x++) {
            float sum = 0.0f;

            for (int ky = -kernelRadius; ky <= kernelRadius; ky++) {
                for (int kx = -kernelRadius; kx <= kernelRadius; kx++) {
                    float kernelValue = kernel[(ky + kernelRadius) * kernelSize + (kx + kernelRadius)];
                    float pixelValue = input.getPixel(x + kx, y + ky);
                    sum += pixelValue * kernelValue;
                }
            }

            output.setPixel(x, y, sum);
        }
    }

    return output;
}

int main() {
    // Create test image (1000x1000 with a square in the middle)
    const int width = 1000;
    const int height = 1000;
    Image testImage(width, height);

    // Fill with a square pattern
    for (int y = height/4; y < 3*height/4; y++) {
        for (int x = width/4; x < 3*width/4; x++) {
            testImage.setPixel(x, y, 1.0f);
        }
    }

    // Measure performance of gaussian blur
    auto start = std::chrono::high_resolution_clock::now();

    Image blurred = gaussianBlur_v1(testImage, 5.0f);

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = end - start;

    std::cout << "Basic Gaussian blur: " << duration.count() << " seconds" << std::endl;

    return 0;
}

Version 2: Optimized Implementation

cpp
// Version 2: Optimized Gaussian blur.
// A Gaussian kernel is separable: a horizontal and a vertical 1D pass
// replace the 2D convolution, cutting the work per pixel from k*k to 2k
// taps, and each pass walks memory in a cache-friendly order
Image gaussianBlur_v2(const Image& input, float sigma) {
    int width = input.getWidth();
    int height = input.getHeight();

    // Create temporary and output images
    Image temp(width, height);
    Image output(width, height);

    // Compute kernel size based on sigma
    int kernelSize = static_cast<int>(2 * std::ceil(3 * sigma)) + 1;
    int kernelRadius = kernelSize / 2;

    // Create 1D kernel
    std::vector<float> kernel(kernelSize);
    float kernelSum = 0.0f;

    for (int x = -kernelRadius; x <= kernelRadius; x++) {
        float value = std::exp(-(x*x) / (2 * sigma * sigma));
        kernel[x + kernelRadius] = value;
        kernelSum += value;
    }

    // Normalize kernel
    for (auto& k : kernel) {
        k /= kernelSum;
    }

    // Horizontal pass
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;

            for (int k = -kernelRadius; k <= kernelRadius; k++) {
                int xOffset = x + k;
                if (xOffset < 0) xOffset = 0;
                if (xOffset >= width) xOffset = width - 1;

                sum += input.getPixel(xOffset, y) * kernel[k + kernelRadius];
            }

            temp.setPixel(x, y, sum);
        }
    }

    // Vertical pass
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;

            for (int k = -kernelRadius; k <= kernelRadius; k++) {
                int yOffset = y + k;
                if (yOffset < 0) yOffset = 0;
                if (yOffset >= height) yOffset = height - 1;

                sum += temp.getPixel(x, yOffset) * kernel[k + kernelRadius];
            }

            output.setPixel(x, y, sum);
        }
    }

    return output;
}

Version 3: Multithreaded Implementation

cpp
// Version 3: Multithreaded Gaussian blur
// (in addition to the headers above, this requires #include <future>)
Image gaussianBlur_v3(const Image& input, float sigma) {
    int width = input.getWidth();
    int height = input.getHeight();

    // Create temporary and output images
    Image temp(width, height);
    Image output(width, height);

    // Compute kernel size based on sigma
    int kernelSize = static_cast<int>(2 * std::ceil(3 * sigma)) + 1;
    int kernelRadius = kernelSize / 2;

    // Create 1D kernel
    std::vector<float> kernel(kernelSize);
    float kernelSum = 0.0f;

    for (int x = -kernelRadius; x <= kernelRadius; x++) {
        float value = std::exp(-(x*x) / (2 * sigma * sigma));
        kernel[x + kernelRadius] = value;
        kernelSum += value;
    }

    // Normalize kernel
    for (auto& k : kernel) {
        k /= kernelSum;
    }

    // Horizontal pass over a band of rows
    auto horizontalPass = [&](int startY, int endY) {
        for (int y = startY; y < endY; y++) {
            for (int x = 0; x < width; x++) {
                float sum = 0.0f;

                for (int k = -kernelRadius; k <= kernelRadius; k++) {
                    int xOffset = x + k;
                    if (xOffset < 0) xOffset = 0;
                    if (xOffset >= width) xOffset = width - 1;

                    sum += input.getPixel(xOffset, y) * kernel[k + kernelRadius];
                }

                temp.setPixel(x, y, sum);
            }
        }
    };

    // Vertical pass over a band of rows
    auto verticalPass = [&](int startY, int endY) {
        for (int y = startY; y < endY; y++) {
            for (int x = 0; x < width; x++) {
                float sum = 0.0f;

                for (int k = -kernelRadius; k <= kernelRadius; k++) {
                    int yOffset = y + k;
                    if (yOffset < 0) yOffset = 0;
                    if (yOffset >= height) yOffset = height - 1;

                    sum += temp.getPixel(x, yOffset) * kernel[k + kernelRadius];
                }

                output.setPixel(x, y, sum);
            }
        }
    };

    // Divide work across a fixed number of threads
    // (std::thread::hardware_concurrency() would adapt to the machine)
    int threadCount = 4;
    int rowsPerThread = height / threadCount;
    std::vector<std::future<void>> futures;

    // Launch threads for horizontal pass
    for (int i = 0; i < threadCount; i++) {
        int startY = i * rowsPerThread;
        int endY = (i == threadCount - 1) ? height : (i + 1) * rowsPerThread;
        futures.push_back(std::async(std::launch::async, horizontalPass, startY, endY));
    }

    // Wait for horizontal pass to complete
    for (auto& future : futures) {
        future.wait();
    }

    futures.clear();

    // Launch threads for vertical pass
    for (int i = 0; i < threadCount; i++) {
        int startY = i * rowsPerThread;
        int endY = (i == threadCount - 1) ? height : (i + 1) * rowsPerThread;
        futures.push_back(std::async(std::launch::async, verticalPass, startY, endY));
    }

    // Wait for vertical pass to complete
    for (auto& future : futures) {
        future.wait();
    }

    return output;
}

// Let's compare all versions
int main() {
    // Create test image (1000x1000 with a square in the middle)
    const int width = 1000;
    const int height = 1000;
    Image testImage(width, height);

    // Fill with a square pattern
    for (int y = height/4; y < 3*height/4; y++) {
        for (int x = width/4; x < 3*width/4; x++) {
            testImage.setPixel(x, y, 1.0f);
        }
    }

    // Basic implementation
    auto start1 = std::chrono::high_resolution_clock::now();
    Image blurred1 = gaussianBlur_v1(testImage, 5.0f);
    auto end1 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration1 = end1 - start1;

    // Optimized implementation
    auto start2 = std::chrono::high_resolution_clock::now();
    Image blurred2 = gaussianBlur_v2(testImage, 5.0f);
    auto end2 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration2 = end2 - start2;

    // Multithreaded implementation
    auto start3 = std::chrono::high_resolution_clock::now();
    Image blurred3 = gaussianBlur_v3(testImage, 5.0f);
    auto end3 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration3 = end3 - start3;

    std::cout << "Basic implementation: " << duration1.count() << " seconds" << std::endl;
    std::cout << "Optimized implementation: " << duration2.count() << " seconds" << std::endl;
    std::cout << "Multithreaded implementation: " << duration3.count() << " seconds" << std::endl;

    return 0;
}

Output:

Basic implementation: 1.832 seconds
Optimized implementation: 0.427 seconds
Multithreaded implementation: 0.121 seconds

This example demonstrates several optimization techniques:

  1. Algorithm optimization (separable filter)
  2. Better memory access patterns
  3. Multithreading

Best Practices and Pitfalls

Measure First, Optimize Later

Always profile before optimization to identify bottlenecks:

cpp
#include <iostream>
#include <chrono>
#include <string>

// RAII timer: reports the elapsed time when it goes out of scope
class Timer {
public:
    Timer(const std::string& label)
        : label_(label), start_(std::chrono::high_resolution_clock::now()) {}

    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double, std::milli> duration = end - start_;
        std::cout << label_ << ": " << duration.count() << " ms" << std::endl;
    }

private:
    std::string label_;
    std::chrono::time_point<std::chrono::high_resolution_clock> start_;
};

void functionToProfile() {
    Timer timer("Function execution");
    // Function code here
    for (int i = 0; i < 1000000; i++) {
        // Some work (beware: a loop with no observable effects
        // may be optimized away entirely at -O2)
    }
}

Common Pitfalls

  1. Excessive abstraction: Layers of indirection (e.g., unnecessary virtual calls) can defeat inlining and hurt performance
  2. False sharing: Multiple threads modifying data in the same cache line
  3. Premature micro-optimization: Focusing on small details before addressing algorithmic issues
  4. Ignoring compiler warnings: They often hint at performance issues
  5. Over-optimization: Making code unreadable for negligible gains

Summary

In this guide, we've explored numerous techniques for optimizing C++ code:

  1. Compiler optimizations: Using the right compiler flags and directives
  2. Algorithm optimization: Choosing appropriate algorithms and data structures
  3. Memory management: Reducing allocations and improving locality
  4. Loop optimization: Techniques to make loops more efficient
  5. Multithreading: Distributing work across multiple threads
  6. CPU-specific optimizations: SIMD instructions and cache-friendly patterns

Remember these key points:

  • Profile first to identify bottlenecks
  • Start with algorithmic improvements
  • Consider the readability/performance trade-off
  • Use standard library containers and algorithms
  • Enable compiler optimizations

Exercises

  1. Compare the performance of different sorting algorithms (bubble sort, quicksort, std::sort) on arrays of various sizes.
  2. Implement a matrix multiplication algorithm and optimize it using techniques from this guide.
  3. Write a program that processes a large file and optimize it for both memory usage and speed.
  4. Profile a simple video game loop (update entities, check collisions) and optimize the bottlenecks.
  5. Compare the performance of std::map vs. std::unordered_map for different operation patterns.
