Sure, though this is just an intuition, and about as far-fetched as it gets.
Regarding the question, let's call the million simpletons "agents".
I would experiment with a big global state (think of a million-sized activation vector) projecting into each local agent's simplified view of the global state.
Each agent's response (action) is then projected back to change the global state.
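A minimal numpy sketch of that read/write loop, just to make the shapes concrete. Everything here (the toy sizes, the `Agent` class, the `W_read`/`W_write` names, the dense random projections) is invented for illustration; at anything near a million agents the projections would obviously have to be sparse, shared, or factored.

```python
import numpy as np

rng = np.random.default_rng(0)

GLOBAL_DIM = 2048   # toy size; the idea is something closer to a million
LOCAL_DIM = 32      # each agent's simplified view of the global state
N_AGENTS = 64       # toy count

class Agent:
    """One simpleton: reads a projected view, emits an action, writes back."""
    def __init__(self):
        # W_read: global state -> local view; W_write: local action -> global update.
        self.W_read = rng.standard_normal((LOCAL_DIM, GLOBAL_DIM)).astype(np.float32) / np.sqrt(GLOBAL_DIM)
        self.W_write = rng.standard_normal((GLOBAL_DIM, LOCAL_DIM)).astype(np.float32) / np.sqrt(LOCAL_DIM)

    def step(self, global_state):
        view = self.W_read @ global_state   # simplified local view
        action = np.tanh(view)              # trivial stand-in "policy"
        delta = self.W_write @ action       # contribution back to the global state
        return action, delta

def tick(global_state, agents, decay=0.9):
    """One synchronous round: every agent reads, acts, and writes back."""
    total = np.zeros_like(global_state)
    for agent in agents:
        _, delta = agent.step(global_state)
        total += delta
    return decay * global_state + total / len(agents)

agents = [Agent() for _ in range(N_AGENTS)]
state = np.random.default_rng(1).standard_normal(GLOBAL_DIM).astype(np.float32)
state = tick(state, agents)
```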
If they're all under an "animal" virtual skull, then some of the global state generates motions, e.g. camera/wheel movement, and sensory inputs project into the global state too.
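Sensing and acting can follow the same pattern: a sensor writes into the global state through a projection, and a motor reads commands out of it. Again just a sketch with invented names (`W_sense`, `W_motor`) and toy sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

GLOBAL_DIM = 2048   # toy size again
SENSOR_DIM = 128    # e.g. a flattened, downsampled camera frame
MOTOR_DIM = 4       # e.g. wheel velocities / camera pan-tilt

# Sensors write into the global state, motors read out of it,
# through fixed projections, same pattern as the agents.
W_sense = rng.standard_normal((GLOBAL_DIM, SENSOR_DIM)).astype(np.float32) / np.sqrt(SENSOR_DIM)
W_motor = rng.standard_normal((MOTOR_DIM, GLOBAL_DIM)).astype(np.float32) / np.sqrt(GLOBAL_DIM)

def sense(global_state, observation):
    """Mix a new observation into the global state."""
    return global_state + W_sense @ observation

def act(global_state):
    """Read motor commands (e.g. wheel speeds) off the global state."""
    return np.tanh(W_motor @ global_state)

global_state = np.zeros(GLOBAL_DIM, dtype=np.float32)
frame = rng.standard_normal(SENSOR_DIM).astype(np.float32)
global_state = sense(global_state, frame)
wheels = act(global_state)
```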
Sparsity is assumed, both for the global state encoding and for the subset of agents active at any given time, mostly for scaling/cost reasons.
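For the sparsity part, one simple recipe (my assumption, not something from the question) is top-k on the global state plus waking only the agents whose "receptive field" overlaps the currently active entries:

```python
import numpy as np

rng = np.random.default_rng(2)

GLOBAL_DIM = 2048
K_ACTIVE_DIMS = 64      # keep only the k largest global activations
N_AGENTS = 1000
K_ACTIVE_AGENTS = 50    # only this many agents run per tick

def sparsify(global_state, k=K_ACTIVE_DIMS):
    """Zero out everything but the k largest-magnitude entries."""
    idx = np.argpartition(np.abs(global_state), -k)[-k:]
    out = np.zeros_like(global_state)
    out[idx] = global_state[idx]
    return out

# Each agent "listens" to a small fixed set of global indices; an agent is
# woken up only if its indices carry enough of the currently active mass.
agent_receptive_fields = [rng.choice(GLOBAL_DIM, size=32, replace=False)
                          for _ in range(N_AGENTS)]

def select_active_agents(global_state, k=K_ACTIVE_AGENTS):
    active = sparsify(global_state)
    salience = np.array([np.abs(active[rf]).sum() for rf in agent_receptive_fields])
    return np.argsort(salience)[-k:]   # indices of the agents to run this tick

global_state = rng.standard_normal(GLOBAL_DIM).astype(np.float32)
print(select_active_agents(global_state))
```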
Some global reward feedback is needed too, one that also projects back onto recent past local actions, so credit reaches the agents that acted shortly before the reward arrived.
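That "projects into recent past local actions" bit is basically credit assignment. One common way to sketch it is an eligibility trace per agent: decay it every step, and when the global reward arrives, nudge the agent's parameters in proportion to the trace. This is a REINFORCE-style toy of my own, not something implied by the question:

```python
import numpy as np

rng = np.random.default_rng(3)

LOCAL_DIM = 32
TRACE_DECAY = 0.9       # how far back "recent past" reaches
LEARNING_RATE = 0.01

class AgentWithTrace:
    """Keeps a decaying eligibility trace of its recent actions so a
    delayed global reward can be assigned back to them."""
    def __init__(self):
        self.params = rng.standard_normal(LOCAL_DIM).astype(np.float32) * 0.1
        self.trace = np.zeros(LOCAL_DIM, dtype=np.float32)

    def step(self, view):
        action = np.tanh(self.params * view)
        # Decay old credit, then add credit for the action just taken
        # (view * (1 - action^2) is d(action)/d(params) for this toy policy).
        self.trace = TRACE_DECAY * self.trace + view * (1.0 - action**2)
        return action

    def reward(self, r):
        # A global scalar reward nudges parameters in proportion to how much
        # each of them contributed to recent actions.
        self.params += LEARNING_RATE * r * self.trace

agent = AgentWithTrace()
for _ in range(10):
    agent.step(rng.standard_normal(LOCAL_DIM).astype(np.float32))
agent.reward(1.0)       # broadcast reward arrives a few ticks later
```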