Parallelizing the physics solver by Dennis Gustafsson (https://youtu.be/Kvsvd67XUKw?si=moTxJqvke4g6835s) was a really cool talk. Always cool to see how people tackle an optimization problem. The only thing I fundamentally disagree with is the use of parallel_for. It might look really innocent, but it starves the CPU after each iteration, and while not a lot of time seems to be lost per iteration, you also pay the ramp-up/ramp-down costs of the thread pool every time. Used without care, this ends up in a death-by-a-thousand-cuts scenario.
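Concretely, the pattern I mean looks something like this; a minimal sketch with a hypothetical Solver type, using TBB's parallel_for as a stand-in for whatever the talk's engine actually uses:

```cpp
// Hypothetical Solver; the point is the loop structure, not the physics.
#include <tbb/parallel_for.h>

struct Solver {
    int constraint_count = 0;
    void solve_constraint(int i) { /* hypothetical constraint kernel */ }
};

void solve(Solver& solver, int iterations) {
    for (int iter = 0; iter < iterations; ++iter) {
        tbb::parallel_for(0, solver.constraint_count, [&](int i) {
            solver.solve_constraint(i);
        });
        // Implicit join here: every worker idles until the slowest chunk
        // of this iteration finishes, and the next parallel_for pays the
        // fork/wake-up cost of the pool all over again.
    }
}

int main() {
    Solver solver;
    solver.constraint_count = 4096;
    solve(solver, /*iterations=*/8);  // 8 barriers, 8 ramp-ups/downs
}
```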

It is very convenient to write code like this, I agree, but it will prevent you from reaching peak performance. Someone in the audience correctly points out that you could overlap the second half of the first solver iteration with the first half of the second iteration, and that is not really possible with this setup (at least I have not yet seen an implementation of parallel_for that would allow this kind of overlap).
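To sketch what that overlap could look like: if you assume constraints are range-partitioned so that a chunk only couples with its immediate neighbours (my assumption for the sake of the example, not something stated in the talk), each (iteration, chunk) pair becomes a task that waits only on the neighbouring chunks of the previous iteration instead of on a global barrier. A deliberately toy, hand-rolled version:

```cpp
// Toy cross-iteration scheduling: chunk c of iteration i may start once
// chunks c-1, c, c+1 of iteration i-1 are done. No global barrier, so the
// tail of one iteration overlaps the head of the next. Spin-waits keep the
// example short; a real scheduler would use proper task queues.
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

constexpr int kChunks = 8;
constexpr int kIters  = 4;

std::atomic<bool> done[kIters][kChunks];

void solve_chunk(int iter, int chunk) { /* hypothetical solver kernel */ }

bool ready(int iter, int chunk) {
    if (iter == 0) return true;
    int lo = std::max(0, chunk - 1), hi = std::min(kChunks - 1, chunk + 1);
    for (int c = lo; c <= hi; ++c)
        if (!done[iter - 1][c].load(std::memory_order_acquire)) return false;
    return true;
}

int main() {
    std::atomic<int> next{0};  // tasks claimed in flat (iter, chunk) order
    auto worker = [&] {
        for (;;) {
            int t = next.fetch_add(1);
            if (t >= kIters * kChunks) return;
            int iter = t / kChunks, chunk = t % kChunks;
            while (!ready(iter, chunk)) std::this_thread::yield();
            solve_chunk(iter, chunk);
            done[iter][chunk].store(true, std::memory_order_release);
        }
    };
    unsigned n = std::max(2u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```

Because tasks are claimed in flat order, a task's dependencies always have smaller flat indices and are already claimed, so the spin-waits cannot deadlock.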
It's kind of unfortunate how huge the complexity jump is from a simple parallel_for to dependency-driven task scheduling. This is an area where I really think the more complex "ECS"-style approaches can make big improvements in practice. Doing it all by hand works, but it's super fragile and eats a lot of programmer time.
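To make that concrete: the same cross-iteration dependencies from the toy sketch above can be declared once and handed to an off-the-shelf task-graph library. A hedged sketch using Taskflow (https://github.com/taskflow/taskflow); solve_chunk is still a hypothetical stand-in:

```cpp
// Declare (iteration, chunk) tasks and their neighbour dependencies up
// front; the library's scheduler handles the overlap, work stealing, and
// lifetime issues that make the hand-rolled version fragile.
#include <taskflow/taskflow.hpp>
#include <algorithm>

constexpr int kChunks = 8;
constexpr int kIters  = 4;

void solve_chunk(int iter, int chunk) { /* hypothetical solver kernel */ }

int main() {
    tf::Executor executor;
    tf::Taskflow graph;

    tf::Task tasks[kIters][kChunks];
    for (int i = 0; i < kIters; ++i)
        for (int c = 0; c < kChunks; ++c)
            tasks[i][c] = graph.emplace([i, c] { solve_chunk(i, c); });

    // chunk c of iteration i runs after chunks c-1, c, c+1 of iteration i-1
    for (int i = 1; i < kIters; ++i)
        for (int c = 0; c < kChunks; ++c)
            for (int n = std::max(0, c - 1); n <= std::min(kChunks - 1, c + 1); ++n)
                tasks[i - 1][n].precede(tasks[i][c]);

    executor.run(graph).wait();
}
```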
Yeah, that isn't just a limitation of parallel_for_each but of Cilk-style nested task parallelism more generally: it can only represent series-parallel task graphs (https://en.wikipedia.org/wiki/Series%E2%80%93parallel_graph). That said, you can support both types of API with the same task graph backend, at least if you support dynamic tasks.
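For anyone wondering what the smallest non-series-parallel case looks like, it's the "N" graph: A → C, B → C, B → D. Fork-join can only wait on all spawned children at once, so it has to over-synchronize (e.g. make D wait on A as well), while an explicit dependency API states it directly. In the same Taskflow sketch style as above, with placeholder lambdas:

```cpp
// The "N" graph, the forbidden pattern of series-parallel task graphs.
#include <taskflow/taskflow.hpp>

int main() {
    tf::Executor executor;
    tf::Taskflow g;
    auto A = g.emplace([] { /* ... */ });
    auto B = g.emplace([] { /* ... */ });
    auto C = g.emplace([] { /* ... */ });
    auto D = g.emplace([] { /* ... */ });
    A.precede(C);
    B.precede(C);
    B.precede(D);  // D starts as soon as B finishes; no dependency on A
    executor.run(g).wait();
}
```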