C++ Sobel Edge Detection — 2000× NumPy Speedup

What it is

A systematic benchmarking study of manual optimization techniques applied to the Sobel edge detection kernel, implemented in C++17 with pybind11 Python bindings. The goal: understand the performance ceiling of each optimization layer independently, then measure how compiler flags interact with manual optimizations.

All implementations produce bit-identical output to the NumPy reference. This isn't a benchmark that trades accuracy for speed — it's a controlled study of what the compiler and hardware can do when you give them the right hints.

Architecture

Figure: Optimization stack, from the NumPy baseline (1×) through the pybind11 bridge to the C++ implementations: basic (~330×), loop unrolling, SIMD, cache blocking, and all optimizations combined (~2000×). Each optimization layer is benchmarked independently. The pybind11 bridge adds sub-microsecond call overhead, which is negligible against the ~0.10s NumPy baseline and under 2% of even the fastest (~0.00005s) C++ run.

Results

Figure: Execution time and speedup relative to the NumPy baseline across all implementations and compiler flag combinations. The combined implementation with aggressive compiler flags reaches ~2000× speedup.

Implementation	Execution time	Speedup vs NumPy
NumPy baseline	~0.10s	1×
C++ basic	~0.0003s	~330×
C++ loop unrolling	~0.00015s	~670×
C++ SIMD (SSE)	~0.00008s	~1250×
C++ all optimizations	~0.00005s	~2000×
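
The speedup column is simply the NumPy time divided by the implementation time (for example 0.10s / 0.00005s ≈ 2000×). The write-up does not say how the timings were collected, so the helper below is only a sketch of one reasonable approach: take the median of repeated runs with std::chrono::steady_clock. The names median_seconds and speedup, and the default of 21 repeats, are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Median wall-clock time of `fn` over `repeats` runs, in seconds.
// The median is less sensitive to one-off stalls than the mean.
template <typename Fn>
double median_seconds(Fn&& fn, std::size_t repeats = 21) {
    assert(repeats > 0);
    std::vector<double> samples;
    samples.reserve(repeats);
    for (std::size_t i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    // Partially sort so the middle element is the median.
    const auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}

// Speedup as used in the table: baseline time divided by candidate time.
inline double speedup(double baseline_s, double candidate_s) {
    assert(candidate_s > 0.0);
    return baseline_s / candidate_s;
}
```

With this, speedup(0.10, 0.0003) gives roughly 333, which the table rounds to ~330×. The NumPy baseline itself would have to be timed on the Python side.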

Figure: Effectiveness of each compiler flag combination across optimization methods. -O3 with -march=native and -ffast-math captures most of the available performance, but manual SIMD still adds 1.5–1.8× on top.

Figure: Speedup contribution of each manual optimization method. SIMD intrinsics contribute the largest single gain; the combined implementation compounds all layers.

Figure: Visual comparison of Sobel output, NumPy vs. all optimizations combined. All C++ implementations produce bit-identical results to the NumPy reference; no accuracy is traded for speed.

Optimization techniques

Manual code optimizations:

Loop unrolling — explicit 3×3 kernel expansion, eliminates inner loop overhead
SIMD intrinsics (SSE) — 128-bit registers processing 4 pixels in parallel
Cache blocking — tiled processing to keep working set in L1/L2 cache
Memory prefetching — explicit __builtin_prefetch hints ahead of the read cursor
Combined — all of the above together
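
To make the first technique concrete, here is a minimal sketch of a basic kernel next to a loop-unrolled one. It is not the project's actual code: the Image alias, the function names, the float row-major layout, the zeroed border, and the sqrt(gx² + gy²) magnitude are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed layout: row-major float grayscale image; border pixels stay 0.
using Image = std::vector<float>;

// Basic version: generic 3x3 convolution with two inner loops over the kernel.
Image sobel_basic(const Image& in, std::size_t w, std::size_t h) {
    assert(in.size() == w * h);
    static const float KX[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const float KY[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    Image out(w * h, 0.0f);
    for (std::size_t y = 1; y + 1 < h; ++y) {
        for (std::size_t x = 1; x + 1 < w; ++x) {
            float gx = 0.0f, gy = 0.0f;
            for (std::size_t ky = 0; ky < 3; ++ky) {
                for (std::size_t kx = 0; kx < 3; ++kx) {
                    const float p = in[(y + ky - 1) * w + (x + kx - 1)];
                    gx += KX[ky][kx] * p;
                    gy += KY[ky][kx] * p;
                }
            }
            out[y * w + x] = std::sqrt(gx * gx + gy * gy);
        }
    }
    return out;
}

// Unrolled version: the 3x3 kernel is written out by hand, so the inner
// kernel loops, the coefficient lookups, and the zero-weight taps disappear.
Image sobel_unrolled(const Image& in, std::size_t w, std::size_t h) {
    assert(in.size() == w * h);
    Image out(w * h, 0.0f);
    for (std::size_t y = 1; y + 1 < h; ++y) {
        const float* r0 = in.data() + (y - 1) * w;  // row above
        const float* r1 = in.data() + y * w;        // current row
        const float* r2 = in.data() + (y + 1) * w;  // row below
        for (std::size_t x = 1; x + 1 < w; ++x) {
            const float gx = (r0[x + 1] - r0[x - 1])
                           + 2.0f * (r1[x + 1] - r1[x - 1])
                           + (r2[x + 1] - r2[x - 1]);
            const float gy = (r2[x - 1] - r0[x - 1])
                           + 2.0f * (r2[x] - r0[x])
                           + (r2[x + 1] - r0[x + 1]);
            out[y * w + x] = std::sqrt(gx * gx + gy * gy);
        }
    }
    return out;
}
```

The two versions sum the taps in a different order, so they are only guaranteed to agree bit for bit when the arithmetic is exact, for example when the pixels are integer values such as 8-bit intensities stored as float.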

Compiler flags tested:

-O2 (baseline optimization)
-O2 -funroll-loops (auto-unroll only)
-O2 -march=native -mavx2 (auto-vectorization)
-O3 -march=native -ffast-math -funroll-loops (full aggressive)

What I learned

Auto-vectorization is good but not complete. With -march=native, the compiler captures roughly 80% of available SIMD performance. Manual SSE intrinsics add another 1.5–1.8× on top even with -march=native enabled, because the compiler can't always prove that the memory access pattern is safe to vectorize across the boundary logic.
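
One way to sidestep the boundary problem is to handle it explicitly: a vector body for interior pixels and a scalar loop for the row tail. The sketch below illustrates that pattern under assumptions of my own (float row-major pixels, zeroed border, hypothetical names sobel_px, sobel_scalar, sobel_sse); it is not the project's kernel. It relies on the __SSE2__ macro defined by GCC and Clang and falls back to the scalar loop elsewhere.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE/SSE2 intrinsics, baseline on x86-64
#endif

using Image = std::vector<float>;  // assumed: row-major float grayscale

// Scalar Sobel magnitude for one interior pixel.
static inline float sobel_px(const float* r0, const float* r1, const float* r2,
                             std::size_t x) {
    const float gx = (r0[x + 1] - r0[x - 1]) + 2.0f * (r1[x + 1] - r1[x - 1])
                   + (r2[x + 1] - r2[x - 1]);
    const float gy = (r2[x - 1] - r0[x - 1]) + 2.0f * (r2[x] - r0[x])
                   + (r2[x + 1] - r0[x + 1]);
    return std::sqrt(gx * gx + gy * gy);
}

// Scalar reference over the whole image.
Image sobel_scalar(const Image& in, std::size_t w, std::size_t h) {
    assert(in.size() == w * h);
    Image out(w * h, 0.0f);
    for (std::size_t y = 1; y + 1 < h; ++y) {
        const float* r0 = in.data() + (y - 1) * w;
        const float* r1 = in.data() + y * w;
        const float* r2 = in.data() + (y + 1) * w;
        for (std::size_t x = 1; x + 1 < w; ++x) out[y * w + x] = sobel_px(r0, r1, r2, x);
    }
    return out;
}

// SSE version: 4 output pixels per iteration, scalar loop for the row tail.
Image sobel_sse(const Image& in, std::size_t w, std::size_t h) {
    assert(in.size() == w * h);
    Image out(w * h, 0.0f);
    for (std::size_t y = 1; y + 1 < h; ++y) {
        const float* r0 = in.data() + (y - 1) * w;
        const float* r1 = in.data() + y * w;
        const float* r2 = in.data() + (y + 1) * w;
        float* dst = out.data() + y * w;
        std::size_t x = 1;
#if defined(__SSE2__)
        const __m128 two = _mm_set1_ps(2.0f);
        // Outputs x..x+3 read columns x-1..x+4, so require x + 4 <= w - 1.
        for (; x + 5 <= w; x += 4) {
            const __m128 a0 = _mm_loadu_ps(r0 + x - 1), b0 = _mm_loadu_ps(r0 + x),
                         c0 = _mm_loadu_ps(r0 + x + 1);
            const __m128 a1 = _mm_loadu_ps(r1 + x - 1), c1 = _mm_loadu_ps(r1 + x + 1);
            const __m128 a2 = _mm_loadu_ps(r2 + x - 1), b2 = _mm_loadu_ps(r2 + x),
                         c2 = _mm_loadu_ps(r2 + x + 1);
            // Same operation order as sobel_px, lane by lane.
            const __m128 gx = _mm_add_ps(
                _mm_add_ps(_mm_sub_ps(c0, a0), _mm_mul_ps(two, _mm_sub_ps(c1, a1))),
                _mm_sub_ps(c2, a2));
            const __m128 gy = _mm_add_ps(
                _mm_add_ps(_mm_sub_ps(a2, a0), _mm_mul_ps(two, _mm_sub_ps(b2, b0))),
                _mm_sub_ps(c2, c0));
            const __m128 mag2 = _mm_add_ps(_mm_mul_ps(gx, gx), _mm_mul_ps(gy, gy));
            _mm_storeu_ps(dst + x, _mm_sqrt_ps(mag2));
        }
#endif
        // Scalar tail: remaining interior pixels of this row.
        for (; x + 1 < w; ++x) dst[x] = sobel_px(r0, r1, r2, x);
    }
    return out;
}
```

Unaligned loads (_mm_loadu_ps) are used because the three shifted windows cannot all be 16-byte aligned at once. Keeping the vector arithmetic in the same order as the scalar code is what makes matching output plausible; a production kernel would likely also reuse loaded registers across iterations.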

Cache blocking scales with image size. The benefit of tiled processing is negligible on small test images but grows substantially as the working set stops fitting in L2. This is expected from cache theory but satisfying to measure directly.
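
As an illustration of tiled processing, the sketch below walks the interior of the image in rectangular tiles instead of full rows, so the rows touched by one tile can stay cache-resident while it is processed. The tile sizes (256×64), the function names, and the float row-major layout are hypothetical choices, not values taken from the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Image = std::vector<float>;  // assumed: row-major float grayscale

// Sobel magnitude at interior pixel (x, y).
static inline float sobel_at(const Image& in, std::size_t w, std::size_t x, std::size_t y) {
    const float* r0 = in.data() + (y - 1) * w;
    const float* r1 = in.data() + y * w;
    const float* r2 = in.data() + (y + 1) * w;
    const float gx = (r0[x + 1] - r0[x - 1]) + 2.0f * (r1[x + 1] - r1[x - 1])
                   + (r2[x + 1] - r2[x - 1]);
    const float gy = (r2[x - 1] - r0[x - 1]) + 2.0f * (r2[x] - r0[x])
                   + (r2[x + 1] - r0[x + 1]);
    return std::sqrt(gx * gx + gy * gy);
}

// Untiled reference: plain row-by-row traversal.
Image sobel_rows(const Image& in, std::size_t w, std::size_t h) {
    assert(in.size() == w * h);
    Image out(w * h, 0.0f);
    for (std::size_t y = 1; y + 1 < h; ++y)
        for (std::size_t x = 1; x + 1 < w; ++x) out[y * w + x] = sobel_at(in, w, x, y);
    return out;
}

// Tiled traversal: the interior is split into tile_w x tile_h blocks.
Image sobel_tiled(const Image& in, std::size_t w, std::size_t h,
                  std::size_t tile_w = 256, std::size_t tile_h = 64) {
    assert(in.size() == w * h);
    assert(tile_w > 0 && tile_h > 0);
    Image out(w * h, 0.0f);
    if (w < 3 || h < 3) return out;  // no interior pixels
    for (std::size_t ty = 1; ty < h - 1; ty += tile_h) {
        const std::size_t y_end = std::min(ty + tile_h, h - 1);
        for (std::size_t tx = 1; tx < w - 1; tx += tile_w) {
            const std::size_t x_end = std::min(tx + tile_w, w - 1);
            // Process one tile; its input footprint is (tile_h + 2) rows
            // by (tile_w + 2) columns.
            for (std::size_t y = ty; y < y_end; ++y)
                for (std::size_t x = tx; x < x_end; ++x)
                    out[y * w + x] = sobel_at(in, w, x, y);
        }
    }
    return out;
}
```

Tiling only changes the order in which pixels are visited, so the output is identical to the row-by-row version for any tile size. Good tile dimensions depend on the cache sizes of the target machine and are best found by measurement.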

All optimizations together beat any single one. No single technique accounts for the full gain: SIMD contributes the most (~1250× on its own), but unrolling, cache effects, and compiler flags each add to it. Reaching the ~2000× ceiling requires all of them.

pybind11 overhead is negligible. Python ↔ C++ call overhead with pybind11 is sub-microsecond on array-sized inputs, well below the noise floor of the benchmark.
