DFlash can be faster than MTP.
But there is no universal winner. It depends on the model, the hardware, the task, the workload, and many other details.
I benchmarked MTP vs DFlash with vLLM and llama.cpp across math, coding, and chat workloads.
For Qwen3.6 27B, DFlash delivered up to a 4x speedup. Once tuned, MTP was not far behind, but DFlash reached the highest peak performance.
For Qwen3.6 35B A3B, MTP often performed better.
All the results here: