Split/join style. It is almost like single-threaded code, except that all the heavy work is done in parallel as fast as possible. The single-threaded part is almost free in practice: less than 1 ms in total per frame, often ten times less.
Split/join is also what I was talking about, but with the values stored in immutable temporary heap locations rather than in a common lockless queue.
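To make the idea concrete, here is a minimal sketch of split/join where each worker returns a fresh immutable value instead of pushing into a shared queue. The names `simulate_chunk` and `split_join` are made up for illustration, and the squaring is a stand-in for real per-frame work. Note that in standard CPython the GIL means threads won't actually speed up CPU-bound work, so treat this as the shape of the pattern rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_chunk(chunk):
    # Hypothetical heavy work: reads only its own input and returns a
    # brand-new immutable tuple. No shared state is touched, so no locks
    # or lockless queue are needed.
    return tuple(x * x for x in chunk)


def split_join(items, workers=4):
    # Split: cut the input into roughly equal immutable chunks.
    size = max(1, (len(items) + workers - 1) // workers)
    chunks = [tuple(items[i:i + size]) for i in range(0, len(items), size)]

    # Parallel part: each task produces its own temporary result.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(simulate_chunk, chunks))

    # Join: the cheap single-threaded part just stitches the results
    # together in their original order.
    return [value for part in partials for value in part]


frame_result = split_join(list(range(10)))
```

The temporary tuples become garbage as soon as the join finishes, which is exactly where allocation speed and fragmentation start to matter.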
If you have the memory and allocation speed, that can be much faster than a lockless queue (especially if that queue is under high contention). However, for a game or an embedded system I'd imagine the impact on heap fragmentation would be a pretty significant reason not to do this.
For a GCed language, those heap allocations are fairly cheap, quite a bit cheaper than what you pay in a language with manual memory management. (In GCed languages, the majority of memory allocations are one or two conditional checks plus a pointer bump. It's very nearly the same speed as stack allocation.)
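As a rough illustration of what "a conditional check with a pointer bump" means, here is a toy model of a bump allocator. `BumpArena` is a made-up class, not how any particular runtime implements its nursery; a real collector would trigger a collection where this sketch returns `None`.

```python
class BumpArena:
    """Toy bump-pointer allocator over a fixed-size buffer."""

    def __init__(self, capacity):
        self.buffer = bytearray(capacity)
        self.top = 0  # offset of the next free byte

    def alloc(self, size):
        new_top = self.top + size
        if new_top > len(self.buffer):  # the one conditional check
            return None  # assumption: a real GC would collect here
        offset = self.top
        self.top = new_top  # the pointer bump
        return offset

    def reset(self):
        # Stand-in for a collection: everything in the arena is freed at once.
        self.top = 0


arena = BumpArena(16)
first = arena.alloc(8)   # offset 0
second = arena.alloc(8)  # offset 8
```

There is no free list to search and nothing to fragment inside the arena, which is why this comes close to the cost of stack allocation.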
When I was talking about mutable state (and my example is perhaps bad), I meant system-wide or process-wide mutable state.