Eric, really nice work and very clearly written. And very cost efficient!

A few Qs, if you don’t mind.

  1. Did you test any text-only models (like OSS 20B)? I notice all three models you tested were multi-modal and am wondering how important that is.

  2. In the “Score-weighed program selection” column, does “No” mean you sampled uniformly (at random), while “Yes” means you greedily take the highest train accuracy, using pixel match as a tie breaker?

  3. In the library generation phase, you did one round. Was that also 5 programs per task?

  4. I suppose you have to execute the whole library on every task to get the scores, correct? It should be quick, but does that become a bottleneck?

  5. The score is determined first by train accuracy and then by pixel accuracy. I assume that train accuracy is nearly always zero in the first round on a given task, so pixel accuracy must be doing all of the heavy lifting? (I've sketched how I'm picturing #2 and #5 right after these questions.)
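To spell out how I'm reading #2 and #5, here's a minimal sketch of the scoring and selection as I imagine them. Everything here (the names, the types, the `pixel_accuracy` helper) is my own guess, not taken from your post:

```python
import random
from typing import Callable, List, Tuple

# Hypothetical types, just for the sketch.
Grid = List[List[int]]
Program = Callable[[Grid], Grid]

def pixel_accuracy(predicted: Grid, target: Grid) -> float:
    """Fraction of matching cells; 0.0 if the grid shapes differ."""
    if len(predicted) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(predicted, target)
    ):
        return 0.0
    cells = [(p, t) for p_row, t_row in zip(predicted, target)
             for p, t in zip(p_row, t_row)]
    return sum(p == t for p, t in cells) / len(cells)

def score(program: Program, train_pairs: List[Tuple[Grid, Grid]]) -> Tuple[float, float]:
    """Lexicographic score: exact-match train accuracy first, pixel accuracy as the tie breaker."""
    outputs = [program(inp) for inp, _ in train_pairs]
    train_acc = sum(out == tgt for out, (_, tgt) in zip(outputs, train_pairs)) / len(train_pairs)
    pixel_acc = sum(pixel_accuracy(out, tgt) for out, (_, tgt) in zip(outputs, train_pairs)) / len(train_pairs)
    return (train_acc, pixel_acc)

def select(library: List[Program], train_pairs: List[Tuple[Grid, Grid]],
           score_weighted: bool) -> Program:
    if score_weighted:
        # "Yes": greedily take the best-scoring program; Python compares the tuples
        # lexicographically, so train accuracy wins and pixel accuracy breaks ties.
        return max(library, key=lambda p: score(p, train_pairs))
    # "No": sample uniformly at random from the library.
    return random.choice(library)
```

In other words, I'm picturing the score as a lexicographic (train accuracy, pixel accuracy) pair, with "Yes" meaning a greedy argmax over that pair and "No" meaning a uniform draw from the library. Is that roughly right?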

Sep 18 at 9:07 AM
