ollama just quietly unlocked something big.
qwen3.5 now runs locally with image + text input.
meaning your agents can see what they’re doing, not just read prompts.
run it: ollama run qwen3.5
this lets a local agent analyze screenshots, diagrams, PDFs, or UI states, then decide what action to take.
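a minimal sketch of a single vision call, assuming the ollama python client and the qwen3.5 tag from above (swap in whatever vision-capable tag you actually pulled):

# pip install ollama
import ollama

# send a screenshot plus a task to the local model
response = ollama.chat(
    model="qwen3.5",  # assumption -- use the vision tag you have installed
    messages=[{
        "role": "user",
        "content": "describe what's on this screen and suggest the next action",
        "images": ["screenshot.png"],  # local file path; raw bytes also work
    }],
)
print(response["message"]["content"])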
the simple loop builders are experimenting with (rough code sketch after the stack below):
1 capture screenshot
2 send image + task to qwen3.5
3 model decides next action
4 automation layer executes
stack looks like:
ollama
qwen3.5
playwright
python or node agent loop
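to make that concrete, here's a rough sketch wiring the pieces together with python + playwright + the ollama client. the task, the CLICK/TYPE/DONE text protocol, and the action parsing are all assumptions for illustration; a real agent wants structured outputs and guardrails:

# pip install ollama playwright && playwright install chromium
import ollama
from playwright.sync_api import sync_playwright

TASK = "find the search box and describe what to click next"  # hypothetical task

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")

    for step in range(5):  # cap the loop so it can't run forever
        # 1. capture screenshot
        page.screenshot(path="state.png")

        # 2. send image + task to qwen3.5
        reply = ollama.chat(
            model="qwen3.5",  # assumption -- use your local vision tag
            messages=[{
                "role": "user",
                "content": f"task: {TASK}\n"
                           "reply with one line: CLICK <css selector>, TYPE <text>, or DONE",
                "images": ["state.png"],
            }],
        )["message"]["content"].strip()

        # 3. model decides next action / 4. automation layer executes it
        if reply.startswith("CLICK "):
            page.click(reply.removeprefix("CLICK ").strip())
        elif reply.startswith("TYPE "):
            page.keyboard.type(reply.removeprefix("TYPE ").strip())
        elif reply.startswith("DONE"):
            break

    browser.close()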
old agents read prompts.
multimodal agents read environments.
that’s where local automation starts to feel like a real operator.
…
curious what people here are building with vision models locally.
browser agents? screen-aware copilots? something weirder?