Once you look at inference in this way, the hardware implications become clear. Drafting and verification have different memory, compute, and latency requirements. The target model (the larger parent) benefits from the high-throughput characteristics of large GPU systems, which can verify several drafted tokens in a single parallel pass. The draft model (the smaller child) often benefits from hardware optimized for low latency, small batches, and fast token-by-token generation. As a result, the two models can benefit from very different silicon architectures, each tuned to its part of the pipeline.
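The draft/verify split described above can be sketched in a few lines. This is a toy illustration, not a real decoder: `draft_model` and `target_model` are hypothetical stand-ins (simple arithmetic rules, with the target disagreeing at every fourth position to force rejections), and the verification here is the greedy variant, where the target's token replaces the draft's at the first mismatch; production systems use probabilistic rejection sampling over the two models' distributions.

```python
def draft_model(context: list[int]) -> int:
    # Hypothetical cheap draft: guess the next token as last + 1.
    return (context[-1] + 1) % 100

def target_model(context: list[int]) -> int:
    # Hypothetical expensive target: agrees with the draft except at
    # every 4th position, to force an occasional rejection.
    nxt = (context[-1] + 1) % 100
    return nxt if len(context) % 4 != 0 else (nxt + 1) % 100

def speculative_step(context: list[int], k: int = 4) -> list[int]:
    """Draft k tokens cheaply, then verify them with the target model.

    Accept the longest agreeing prefix; on the first mismatch, keep the
    target model's token instead and stop (greedy variant).
    """
    # Drafting phase: sequential, latency-bound -- suits small, fast hardware.
    drafted = []
    ctx = list(context)
    for _ in range(k):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)

    # Verification phase: on a real GPU all k positions are scored in one
    # parallel forward pass; this loop just models the accept/reject logic.
    accepted = []
    ctx = list(context)
    for t in drafted:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # keep the target's correction
            break
    return accepted

print(speculative_step([0], k=4))  # -> [1, 2, 3, 5]: 3 drafts accepted, 4th corrected
```

Even in this toy, the asymmetry is visible: the draft loop is inherently serial per token, while the target only needs one batched scoring pass per round, which is exactly why the two roles reward different hardware.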