Less Slow C++

📰 Hacker News · ashvardanian

Explore how coroutines, SIMD, and secure enclaves affect C++ performance, and learn how to optimize memory access and error handling.

advanced Published 18 Apr 2025

Action Steps

Explore coroutines for high-performance work using libraries like cppcoro
Use SIMD intrinsics for clarity and performance, and consider dropping to assembly for easier library distribution
Investigate hardware support for vectorized scatter/gather in AVX-512 and SVE
Compare secure enclaves and pointer tagging on Intel, Arm, and AMD architectures
Measure the throughput gap between CPU and GPU Tensor Cores (TCs) using benchmarks like MLPerf
Optimize memory access by minimizing misaligned memory accesses and split-loads, and using non-temporal loads/stores
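As a rough illustration of the last step, the sketch below compares loads that stay inside one cache line with loads that straddle two (split loads). It is not code from the article: the 64-byte line size, the buffer size, and the names `sum_words` and `nanoseconds_per_pass` are all assumptions made for this example, and a real measurement would need a proper benchmark harness and pinned, warmed-up runs.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t cache_line = 64;        // assumed cache-line size in bytes
constexpr std::size_t line_count = 1u << 14;  // 16384 lines, about 1 MiB of data

// One spare line at the end keeps offset reads inside the buffer.
alignas(cache_line) static unsigned char buffer[(line_count + 1) * cache_line];

// Reads one 8-byte word per cache line, starting `offset` bytes into each line.
// offset == 0 keeps every load inside a single line;
// offset == 60 makes every load straddle two neighboring lines.
std::uint64_t sum_words(std::size_t offset) {
    assert(offset < cache_line);
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < line_count; ++i) {
        std::uint64_t word;
        // memcpy is the portable way to express an unaligned load.
        std::memcpy(&word, buffer + i * cache_line + offset, sizeof(word));
        total += word;
    }
    return total;
}

// Hypothetical timing helper: average nanoseconds per pass over the buffer.
// `sink` accumulates the sums so the compiler cannot discard the loads.
double nanoseconds_per_pass(std::size_t offset, int repeats, std::uint64_t& sink) {
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        buffer[0] = static_cast<unsigned char>(r);  // vary the input between passes
        sink += sum_words(offset);
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / repeats;
}
```

Calling `nanoseconds_per_pass(0, 100, sink)` and `nanoseconds_per_pass(60, 100, sink)` gives two numbers to compare. How large the gap is depends on the CPU and on whether the buffer fits in cache, so treat any single run as a hint rather than a result.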

Who Needs to Know This

This article is relevant to software engineers, particularly those working on high-performance applications, because it discusses optimization techniques and design choices that affect performance.

Key Insight

💡 Coroutines, SIMD, and secure enclaves each affect C++ performance, and adopting any of them requires careful evaluation of trade-offs and optimization techniques.

Full Article

Earlier this year, I took a month to reexamine my coding habits and rethink some past design choices. I hope to rewrite and improve my FOSS libraries this year, and I needed answers to a few questions first. Perhaps some of these questions will resonate with others in the community, too.

- Are coroutines viable for high-performance work?
- Should I use SIMD intrinsics for clarity or drop to assembly for easier library distribution?
- Has hardware caught up with vectorized scatter/gather in AVX-512 & SVE?
- How do secure enclaves & pointer tagging differ on Intel, Arm, & AMD?
- What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
- How costly are misaligned memory accesses & split-loads, and what gains do non-temporal loads/stores offer?
- Which parts of the standard library hit performance hardest?
- How do error-handling strategies compare overhead-wise?
- What's the compile-time vs. run-time trade-off for lazily evaluated ranges?
- Wha
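To make the error-handling question concrete, here is a minimal sketch of two strategies one might compare: throwing an exception versus returning a status code. It is not taken from the article. The function names are hypothetical, and using `std::errc` as the status type is just one convenient choice for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <system_error>

// Strategy 1 (hypothetical helper): report failure by throwing.
std::uint32_t digit_or_throw(char c) {
    if (c < '0' || c > '9') throw std::invalid_argument("not a digit");
    return static_cast<std::uint32_t>(c - '0');
}

// Strategy 2 (hypothetical helper): report failure through a status code.
// A value-initialized std::errc means success.
std::errc digit_or_status(char c, std::uint32_t& out) noexcept {
    if (c < '0' || c > '9') return std::errc::invalid_argument;
    out = static_cast<std::uint32_t>(c - '0');
    return std::errc{};
}

// Counts characters that are not decimal digits, using exceptions.
std::size_t count_failures_throwing(const std::string& text) {
    std::size_t failures = 0;
    for (char c : text) {
        try {
            (void)digit_or_throw(c);
        } catch (const std::invalid_argument&) {
            ++failures;
        }
    }
    return failures;
}

// Counts characters that are not decimal digits, using status codes.
std::size_t count_failures_status(const std::string& text) {
    std::size_t failures = 0;
    std::uint32_t digit = 0;
    for (char c : text) {
        if (digit_or_status(c, digit) != std::errc{}) ++failures;
    }
    return failures;
}
```

Both counters return the same answer for the same input, so timing them over inputs with different failure rates isolates the cost of the reporting mechanism. On mainstream implementations the throwing path tends to be much more expensive when failures are frequent, but the exact ratio is something to measure on your own toolchain.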
