Context
You’re working on a fintech fraud detection service that scores transactions in real time. The service runs on a fleet of CPU-only instances (no GPUs) and must stay under a strict P99 latency budget to avoid blocking legitimate purchases. A recent model update increased CPU time, and profiling shows the code is memory-bandwidth bound with frequent cache misses.
Core Question
Explain how you would optimize a hot loop or algorithm implementation under specific hardware constraints. Address the following:
- Diagnose the bottleneck: How do you determine whether the code is compute-bound vs memory-bound? What signals do you look for in profiling (e.g., cache-miss rate, IPC, branch-mispredicts)?
- Data layout and locality: How would you change data structures (AoS vs SoA), access patterns (stride, sequential scans), and memory allocation to improve L1/L2 cache locality and reduce TLB pressure? (An AoS-to-SoA sketch follows this list.)
- Vectorization and branching: How would you make the code more amenable to SIMD (e.g., AVX2/NEON) and reduce branch mispredictions? When do bit tricks or lookup tables help? (A branchless sketch follows this list.)
- Algorithmic trade-offs under constraints: If the hardware constraint is “only 256 MB RAM per worker” or “slow storage / high-latency memory,” how do you choose between time vs space (e.g., streaming, blocking/tiling, approximate methods)? (A tiling sketch follows this list.)
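As an illustration of the data-layout bullet (a minimal sketch; the `TxnAoS`/`TxnSoA` records, field names, and field counts are invented for the example), splitting an array-of-structs into a struct-of-arrays keeps the hot field densely packed, so every cache line fetched during a scan carries useful data:

```cpp
#include <vector>

// AoS: each record drags its cold fields through the cache even when the
// scan below only needs `amount`. (Record layout invented for illustration.)
struct TxnAoS {
    double amount;
    long   account_id;
    long   merchant_id;
    double features[8];   // cold during this particular scan
};

double sum_amounts_aos(const std::vector<TxnAoS>& txns) {
    double sum = 0.0;
    for (const TxnAoS& t : txns) sum += t.amount;   // stride = sizeof(TxnAoS) bytes
    return sum;
}

// SoA: the hot field is contiguous, so each 64-byte cache line yields eight
// useful doubles and the hardware prefetcher sees a unit-stride stream.
struct TxnSoA {
    std::vector<double> amount;
    std::vector<long>   account_id;
    std::vector<long>   merchant_id;
    std::vector<double> features;   // n * 8, layout chosen to match access order
};

double sum_amounts_soa(const TxnSoA& txns) {
    double sum = 0.0;
    for (double a : txns.amount) sum += a;          // unit stride, SIMD-friendly
    return sum;
}
```

The same reshaping also reduces TLB pressure, since the scan touches fewer distinct pages per useful byte.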
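For the vectorization and branching bullet, one standard move is to turn a data-dependent branch into an arithmetic select (again a minimal sketch; the `flagged` predicate and function names are invented):

```cpp
#include <cstddef>

// Branchy version: if `flagged[i]` is unpredictable, the branch predictor
// mispredicts often and each mispredict flushes the pipeline.
double risky_total_branchy(const double* amount, const int* flagged, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (flagged[i]) total += amount[i];
    }
    return total;
}

// Branchless version: the condition becomes a 0.0/1.0 multiplier, so there is
// nothing to mispredict, and with FP reassociation enabled (e.g. -ffast-math)
// compilers can vectorize the reduction with masked/blended SIMD adds.
// (Note: adding +0.0 for unflagged rows differs from skipping them only when
// `amount[i]` is NaN or Inf.)
double risky_total_branchless(const double* amount, const int* flagged, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += amount[i] * static_cast<double>(flagged[i] != 0);
    }
    return total;
}
```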
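For the constrained-memory bullet, blocking/tiling is the canonical way to trade a little loop complexity for much less memory traffic: process one block of the large operand at a time so it stays resident in cache (or, at a coarser grain, inside a 256 MB RAM budget) while everything that needs it streams past. A minimal sketch; `score_pair`, the array names, and the block size are invented:

```cpp
#include <cstddef>

// Hypothetical per-pair kernel; stands in for whatever the model computes.
static inline double score_pair(double a, double b) { return a * b; }

// Naive i/j loops stream the whole `b` array through the cache once per
// element of `a`. Tiling instead fixes a block of `b`, reuses it for every
// `a[i]` while it is still hot in L1/L2, then moves on to the next block.
void score_all_tiled(const double* a, std::size_t n,
                     const double* b, std::size_t m,
                     double* out /* n * m */) {
    constexpr std::size_t B = 4096;   // tune so the block fits in L2 with room to spare
    for (std::size_t j0 = 0; j0 < m; j0 += B) {
        const std::size_t j1 = (j0 + B < m) ? (j0 + B) : m;
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = j0; j < j1; ++j) {
                out[i * m + j] = score_pair(a[i], b[j]);
            }
        }
    }
}
```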
Scope Guidance
Assume the interviewer expects concrete, implementable techniques (not just “use a faster algorithm”). Discuss: cache lines, prefetching, loop fusion/fission, tiling, alignment, avoiding allocations, and how you’d validate improvements with benchmarks. You may use small pseudocode snippets to illustrate transformations.
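At the snippet level this guidance asks for, loop fusion is a representative transformation: merging two passes over the same data removes a materialized intermediate and roughly halves memory traffic. A minimal sketch with invented names:

```cpp
#include <cmath>
#include <cstddef>

// Two passes: `scaled` is written out to memory and then read back, so the
// code pays for the array traffic twice and needs an intermediate buffer.
double two_pass(const double* x, double* scaled, std::size_t n, double k) {
    for (std::size_t i = 0; i < n; ++i) scaled[i] = x[i] * k;
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += std::exp(scaled[i]);
    return sum;
}

// Fused: each x[i] is touched exactly once and the intermediate value lives
// in a register, so the pass is limited by a single sequential read stream.
double fused(const double* x, std::size_t n, double k) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += std::exp(x[i] * k);
    return sum;
}
```

Whichever transformation is proposed, the expectation is that it is validated with benchmarks that reproduce production conditions (realistic data sizes, cold vs warm cache, the actual P99 measurement) rather than a microbenchmark whose working set happens to fit in L1.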