
C++ Sobel Edge Detection — 2000× NumPy Speedup

Systematic exploration of C++ optimization techniques for Sobel edge detection, from a basic C++ port through SIMD intrinsics and cache blocking. Achieves ~2000× speedup over NumPy with bit-identical output.

What it is

A systematic benchmarking study of manual optimization techniques applied to the Sobel edge detection kernel, implemented in C++17 with pybind11 Python bindings. The goal: understand the performance ceiling of each optimization layer independently, then measure how compiler flags interact with manual optimizations.

All implementations produce bit-identical output to the NumPy reference. This isn't a benchmark that trades accuracy for speed — it's a controlled study of what the compiler and hardware can do when you give them the right hints.
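For orientation, the kernel under study can be sketched as a plain scalar loop. This is a minimal sketch rather than the project's code: the name `sobel_basic`, the float32 row-major layout, the gradient-magnitude output, and the zeroed one-pixel border are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical baseline: 3x3 Sobel gradient magnitude on a row-major float image.
// Border pixels are left at zero (an assumption; the original may pad or clamp).
std::vector<float> sobel_basic(const std::vector<float>& src,
                               std::size_t width, std::size_t height) {
    assert(src.size() == width * height);
    std::vector<float> dst(width * height, 0.0f);
    if (width < 3 || height < 3) return dst;  // no interior pixels

    for (std::size_t y = 1; y + 1 < height; ++y) {
        const float* up  = &src[(y - 1) * width];
        const float* mid = &src[y * width];
        const float* dn  = &src[(y + 1) * width];
        for (std::size_t x = 1; x + 1 < width; ++x) {
            // Horizontal gradient: right column minus left column, weights 1-2-1.
            const float gx = (up[x + 1] + 2.0f * mid[x + 1] + dn[x + 1])
                           - (up[x - 1] + 2.0f * mid[x - 1] + dn[x - 1]);
            // Vertical gradient: bottom row minus top row, weights 1-2-1.
            const float gy = (dn[x - 1] + 2.0f * dn[x] + dn[x + 1])
                           - (up[x - 1] + 2.0f * up[x] + up[x + 1]);
            dst[y * width + x] = std::sqrt(gx * gx + gy * gy);
        }
    }
    return dst;
}
```

Any optimized variant would need to reproduce this function's output exactly to satisfy the bit-identical requirement.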

Architecture

Figure: Optimization stack. NumPy baseline at 1×, then the pybind11 bridge, then the C++ implementations, from basic at ~330× through loop unrolling, SIMD, and cache blocking, up to all optimizations combined at ~2000×.
Each optimization layer is independently benchmarked. The pybind11 bridge adds sub-microsecond overhead, which is negligible against the ~0.10s NumPy baseline and still small next to the ~0.00005s combined C++ run.
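One way to benchmark each layer independently on the C++ side is a small timing helper like the one below. It is a sketch under assumptions: the names `best_time_seconds` and `speedup` are hypothetical, and taking the best of several runs is a common convention, not necessarily the methodology used for the numbers here.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: best wall-clock time in seconds over `repeats` runs of `fn`.
// Taking the minimum reduces the influence of scheduler and cache warm-up noise.
template <typename Fn>
double best_time_seconds(Fn&& fn, int repeats = 10) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();  // monotonic clock
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Speedup of a candidate relative to a baseline, both given in seconds.
inline double speedup(double baseline_seconds, double candidate_seconds) {
    assert(candidate_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}
```

With the approximate timings reported for this project, `speedup(0.10, 0.00005)` comes out at roughly 2000. The NumPy baseline itself would be timed separately on the Python side.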

Results

Figure: Execution time and speedup relative to the NumPy baseline across all implementations and compiler flag combinations. The combined implementation with aggressive compiler flags reaches ~2000× speedup.
Implementation           Execution time   Speedup vs NumPy
NumPy baseline           ~0.10s           1×
C++ basic                ~0.0003s         ~330×
C++ loop unrolling       ~0.00015s        ~670×
C++ SIMD (SSE)           ~0.00008s        ~1250×
C++ all optimizations    ~0.00005s        ~2000×
Figure: Effectiveness of each compiler flag combination across optimization methods. -O3 with -march=native and -ffast-math captures most of the available performance, but manual SIMD still adds 1.5–1.8× on top.
Figure: Speedup contribution of each manual optimization method. SIMD intrinsics contribute the largest single gain; the combined implementation compounds the gains of all layers.
Figure: Visual comparison of Sobel output, NumPy vs. all optimizations combined. Output quality check: all C++ implementations produce bit-identical results to the NumPy reference. No accuracy is traded for speed.
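"Bit-identical" is a stricter test than element-wise equality. A minimal sketch of such a check follows, assuming float32 buffers; the name `bit_identical` is hypothetical and the project's actual comparison may be done differently, for example on the Python side.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical check: true only if both buffers hold exactly the same bytes.
// Unlike operator==, this distinguishes +0.0f from -0.0f and treats two NaNs
// with the same bit pattern as equal.
bool bit_identical(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) return false;
    if (a.empty()) return true;  // avoid calling memcmp on possibly-null data
    return std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0;
}
```

A tolerance-based comparison would pass results that differ in the last bit, so a byte-level comparison is the more direct way to support the claim above.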

Optimization techniques

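Manual code optimizations: loop unrolling, SIMD (SSE) intrinsics, and cache blocking, each benchmarked on its own and then all combined.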

Compiler flags tested: combinations including -O3, -march=native, and -ffast-math.

What I learned

Auto-vectorization is good but not complete. With -march=native, the compiler captures most of the available SIMD performance, but manual SSE intrinsics still add another 1.5–1.8× on top, because the compiler can't always prove that the memory access pattern is safe to vectorize across the boundary logic.
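The manual SSE approach can be sketched as a four-wide vector body plus a scalar tail, which makes the boundary handling explicit instead of leaving it for the compiler to prove. This is an illustration under assumptions, not the project's kernel: `sobel_pixel` and `sobel_sse` are hypothetical names, the image is float32 row-major, and borders are left at zero. It requires an x86 or x86-64 target.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics; x86/x86-64 only

// Scalar Sobel magnitude for one interior pixel, used for the tail of each row.
static inline float sobel_pixel(const float* up, const float* mid,
                                const float* dn, std::size_t x) {
    const float gx = (up[x + 1] + 2.0f * mid[x + 1] + dn[x + 1])
                   - (up[x - 1] + 2.0f * mid[x - 1] + dn[x - 1]);
    const float gy = (dn[x - 1] + 2.0f * dn[x] + dn[x + 1])
                   - (up[x - 1] + 2.0f * up[x] + up[x + 1]);
    return std::sqrt(gx * gx + gy * gy);
}

// Hypothetical SSE variant: four output pixels per iteration, borders left at zero.
std::vector<float> sobel_sse(const std::vector<float>& src,
                             std::size_t width, std::size_t height) {
    assert(src.size() == width * height);
    std::vector<float> dst(width * height, 0.0f);
    if (width < 3 || height < 3) return dst;

    const __m128 two = _mm_set1_ps(2.0f);
    for (std::size_t y = 1; y + 1 < height; ++y) {
        const float* up  = &src[(y - 1) * width];
        const float* mid = &src[y * width];
        const float* dn  = &src[(y + 1) * width];
        float* out = &dst[y * width];

        std::size_t x = 1;
        // Vector body: the widest load reads index x + 4, so require x + 4 < width.
        for (; x + 4 < width; x += 4) {
            const __m128 ul = _mm_loadu_ps(up + x - 1);
            const __m128 uc = _mm_loadu_ps(up + x);
            const __m128 ur = _mm_loadu_ps(up + x + 1);
            const __m128 ml = _mm_loadu_ps(mid + x - 1);
            const __m128 mr = _mm_loadu_ps(mid + x + 1);
            const __m128 dl = _mm_loadu_ps(dn + x - 1);
            const __m128 dc = _mm_loadu_ps(dn + x);
            const __m128 dr = _mm_loadu_ps(dn + x + 1);
            // Same operation order as sobel_pixel, so rounding matches lane by lane.
            const __m128 right  = _mm_add_ps(_mm_add_ps(ur, _mm_mul_ps(two, mr)), dr);
            const __m128 left   = _mm_add_ps(_mm_add_ps(ul, _mm_mul_ps(two, ml)), dl);
            const __m128 bottom = _mm_add_ps(_mm_add_ps(dl, _mm_mul_ps(two, dc)), dr);
            const __m128 top    = _mm_add_ps(_mm_add_ps(ul, _mm_mul_ps(two, uc)), ur);
            const __m128 gx = _mm_sub_ps(right, left);
            const __m128 gy = _mm_sub_ps(bottom, top);
            const __m128 sum = _mm_add_ps(_mm_mul_ps(gx, gx), _mm_mul_ps(gy, gy));
            _mm_storeu_ps(out + x, _mm_sqrt_ps(sum));
        }
        // Scalar tail: remaining interior pixels that do not fill a full vector.
        for (; x + 1 < width; ++x) out[x] = sobel_pixel(up, mid, dn, x);
    }
    return dst;
}
```

Unaligned loads (`_mm_loadu_ps`) are used because the three shifted windows cannot all be 16-byte aligned at once. Keeping the arithmetic in the same order as the scalar path is what makes matching output plausible; reordering the additions could change the rounding.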

Cache blocking scales with image size. The benefit of tiled processing is negligible on small test images but grows substantially as the working set stops fitting in L2. This is expected from cache theory but satisfying to measure directly.
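Tiled processing can be sketched as two extra outer loops over blocks of the image. The function name `sobel_tiled` and the default tile sizes below are assumptions for illustration; good tile dimensions depend on the cache sizes of the target machine and would need to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical tiled Sobel: walks the interior in tile_h x tile_w blocks so the
// source rows a block needs are reused while they are still in cache.
// Borders are left at zero (an assumption).
std::vector<float> sobel_tiled(const std::vector<float>& src,
                               std::size_t width, std::size_t height,
                               std::size_t tile_w = 256, std::size_t tile_h = 64) {
    assert(src.size() == width * height);
    assert(tile_w > 0 && tile_h > 0);
    std::vector<float> dst(width * height, 0.0f);
    if (width < 3 || height < 3) return dst;

    for (std::size_t ty = 1; ty + 1 < height; ty += tile_h) {
        const std::size_t y_end = std::min(ty + tile_h, height - 1);
        for (std::size_t tx = 1; tx + 1 < width; tx += tile_w) {
            const std::size_t x_end = std::min(tx + tile_w, width - 1);
            // Process one tile; each pixel is computed exactly as in the untiled loop.
            for (std::size_t y = ty; y < y_end; ++y) {
                const float* up  = &src[(y - 1) * width];
                const float* mid = &src[y * width];
                const float* dn  = &src[(y + 1) * width];
                for (std::size_t x = tx; x < x_end; ++x) {
                    const float gx = (up[x + 1] + 2.0f * mid[x + 1] + dn[x + 1])
                                   - (up[x - 1] + 2.0f * mid[x - 1] + dn[x - 1]);
                    const float gy = (dn[x - 1] + 2.0f * dn[x] + dn[x + 1])
                                   - (up[x - 1] + 2.0f * up[x] + up[x + 1]);
                    dst[y * width + x] = std::sqrt(gx * gx + gy * gy);
                }
            }
        }
    }
    return dst;
}
```

Tiling only changes the order in which pixels are visited, not the arithmetic per pixel, so the output is the same for any tile size. On small images a single tile covers everything and the loop degenerates to the untiled version, which is consistent with the negligible benefit observed there.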

All optimizations together beat any single one. No single technique accounts for the full gain: SIMD contributes the most, but unrolling, cache effects, and compiler flags each add to it. Reaching ~2000× requires all of them.

pybind11 overhead is negligible. Python ↔ C++ call overhead with pybind11 is sub-microsecond on array-sized inputs, well below the noise floor of the benchmark.

Links