What it is
A systematic benchmarking study of manual optimization techniques applied to the Sobel edge detection kernel, implemented in C++17 with pybind11 Python bindings. The goal: understand the performance ceiling of each optimization layer independently, then measure how compiler flags interact with manual optimizations.
All implementations produce bit-identical output to the NumPy reference. This isn't a benchmark that trades accuracy for speed — it's a controlled study of what the compiler and hardware can do when you give them the right hints.
Architecture
Results

| Implementation | Execution time | Speedup vs NumPy |
|---|---|---|
| NumPy baseline | ~0.10s | 1× |
| C++ basic | ~0.0003s | ~330× |
| C++ loop unrolling | ~0.00015s | ~670× |
| C++ SIMD (SSE) | ~0.00008s | ~1250× |
| C++ all optimizations | ~0.00005s | ~2000× |



Optimization techniques
Manual code optimizations:
- Loop unrolling — explicit 3×3 kernel expansion, eliminates inner loop overhead
- SIMD intrinsics (SSE) — 128-bit registers processing 4 pixels in parallel
- Cache blocking — tiled processing to keep working set in L1/L2 cache
- Memory prefetching — explicit
__builtin_prefetchhints ahead of the read cursor - Combined — all of the above together
Compiler flags tested:
-O2(baseline optimization)-O2 -funroll-loops(auto-unroll only)-O2 -march=native -mavx2(auto-vectorization)-O3 -march=native -ffast-math -funroll-loops(full aggressive)
What I learned
Auto-vectorization is good but not complete. With -march=native, the compiler captures roughly 80% of available SIMD performance. Manual SSE intrinsics add another 1.5–1.8× on top even with -march=native enabled, because the compiler can't always prove that the memory access pattern is safe to vectorize across the boundary logic.
Cache blocking scales with image size. The benefit of tiled processing is negligible on small test images but grows substantially as the working set stops fitting in L2. This is expected from cache theory but satisfying to measure directly.
All optimizations together beat any single one. There's no single technique that dominates — SIMD, unrolling, cache effects, and compiler flags all contribute independently. The 2000× ceiling requires all of them.
pybind11 overhead is negligible. Python ↔ C++ call overhead with pybind11 is sub-microsecond on array-sized inputs, well below the noise floor of the benchmark.