Just compared my Devil’s Advocate prompt across 5 different LLMs/Agents on a company I know well:
Claude 4.7 (regular)
ChatGPT 5.5 (regular)
Claude 4.7 (Research)
ChatGPT 5.5 (Deep Research)
Gemini 3.1 (Deep Research)
Conclusions:
All were at least decent and usable
Gemini was the least useful (!), which is surprising, since six months ago it was at the top
The gap between the regular and Research/Deep Research modes for both Claude and ChatGPT was small to medium; the regular versions were pretty good
My rank order for this task was:
Claude 4.7 Research: 9/10
Claude 4.7: 8.5/10
ChatGPT 5.5 Deep Research: 8.5/10
ChatGPT 5.5: 8/10
Gemini 3.1 Deep Research: 7.5/10
Curious what you're finding if you've been comparing workflows across different models/agents. Compound With AI - have you done any comparisons recently?