Inside the Forward Pass: Can Transformer Internals Predict Correctness?
TL;DR: Internal transformer signals (entropy, attention, hidden state statistics) predict generation correctness with AUROC 0.60–0.90 under grouped held-out evaluation, without looking at the output text. The first 10 generated tokens carry most of the predictive signal for code tasks. Model confidence scores are nearly uncorrelated with correctness for…
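To make the "entropy" signal above concrete, here is a minimal numpy sketch of one such feature: per-token predictive entropy from the model's output logits, averaged over the first k generated tokens. The helper names (`token_entropy`, `early_entropy_feature`) are illustrative, not the post's actual code.

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Entropy of softmax(logits) per token; logits has shape (T, V)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def early_entropy_feature(logits: np.ndarray, k: int = 10) -> float:
    """Mean entropy over the first k generated tokens, the window that
    carries most of the predictive signal for code tasks."""
    return float(token_entropy(logits[:k]).mean())

# Uniform logits over a 100-token vocabulary give the maximum
# possible entropy, log(100) ~= 4.605 nats:
uniform = np.zeros((10, 100))
print(early_entropy_feature(uniform))
```

A feature like this (possibly alongside attention and hidden-state statistics) would then be fed to a lightweight probe evaluated with grouped held-out AUROC, so that examples from the same task never appear in both train and test splits.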