I was also struck by how strongly AMD emphasized the novelty of OoO memory access in RDNA4, which is why I appreciated that @Chester did this analysis.
I had the same question, too: why didn't AMD use it before and, conversely, what benefits did in-order have?
I'm no expert in this area, but in-order designs are common in DSPs, where workloads tend to be regular enough for that to make sense.
I wonder if the RDNA3.5+ architecture, which will be used in the next batch of APUs, will focus solely on this kind of optimization plus FP8 support (for FSR4). My reasoning is that RDNA4's RT cores take up significant space, which would make the design overly bulky without providing enough ray tracing performance to justify it(?) Just guessing.