I've recently become interested in AI, specifically generative AI, doing things it really wasn't built to do. I also love a good visualisation, so I combined the two to teach GPT-V to game, and then used this to explore how we can get better outcomes from generic models, using a bit of sensible structure and guidance.
Using an open-source game example from Pygame, I asked GPT-V to defend this open-source planet from the alien invasion.

I started by showing GPT-V a single frame of the game and asking it to decide whether to fire the gun, wait, move left, or move right.

The result is less than stellar. Although it can do basic movements, it fails to come up with any strategy, and just sits in the corner (this will be a common thread...)

In this example, the AI is only shown one frame. It has no capacity to remember previous states or previous instructions. The log is just a stream of 'FIRE!', as it has no way of knowing it has exceeded the total number of shots. It gets stuck in the corner, and whilst there is a glimmer of hope as it attempts to run from the kill-shot bomb, this is ruined by its decision to retreat instantly to the corner, to its death.

AI defends a planet, with a single frame of reference. Score 5
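The single-frame loop can be sketched roughly like this. It's a minimal sketch: the action names, prompt wording, and the `ask_model` wrapper are my assumptions for illustration, not the exact code used.

```python
from typing import Callable

# Hypothetical action set and prompt; the real game's controls and
# wording may differ.
ACTIONS = {"FIRE", "WAIT", "LEFT", "RIGHT"}

PROMPT = (
    "You are defending a planet in a 2D shooter. Look at this frame "
    "and reply with exactly one of: FIRE, WAIT, LEFT, RIGHT."
)

def decide(frame_png: bytes, ask_model: Callable[[str, bytes], str]) -> str:
    """Ask the vision model for one action from a single frame.

    `ask_model` stands in for the actual GPT-V call (e.g. a
    chat-completions request carrying a base64-encoded image); it is
    injected so the game loop stays model-agnostic and testable.
    """
    reply = ask_model(PROMPT, frame_png).strip().upper()
    # No memory, no guard rails: if the model answers anything
    # outside the action set, fall back to WAIT.
    return reply if reply in ACTIONS else "WAIT"
```

Every tick, the game renders a frame, calls `decide`, and applies the move; nothing carries over between calls, which is exactly why the log fills up with 'FIRE!'.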

The next thing I tried was expanding its memory to 10 frames so that it could see its last positions and moves. This is always a good idea for added context, but it doesn't help in our case, as the 'hide in the corner' strategy, reached by accident, is fundamentally flawed.
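A rolling 10-frame memory like this is easy to sketch with a bounded deque. The class and field names here are illustrative assumptions, not the original code.

```python
from collections import deque

class FrameMemory:
    """Rolling window of the last N (frame summary, action) pairs,
    fed back to the model as extra context on each turn."""

    def __init__(self, size: int = 10):
        # Oldest entries drop off automatically once full.
        self.history = deque(maxlen=size)

    def record(self, frame_summary: str, action: str) -> None:
        self.history.append((frame_summary, action))

    def as_prompt(self) -> str:
        # Render history as plain text for the prompt; the real
        # implementation could pass the frames themselves instead.
        lines = [
            f"t-{len(self.history) - i}: {summary} -> {action}"
            for i, (summary, action) in enumerate(self.history)
        ]
        return "Recent moves:\n" + "\n".join(lines)
```

The catch is visible in the design: memory only replays what already happened, so a bad habit like corner-hiding gets reinforced rather than corrected.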

The next step was to introduce a new 'agent' to the scenario. This 'agent' is simply a piece of code that feeds GPT-V some relevant text-based information, like how many shots have been fired, how many bombs are active, and where on the screen the player is. (My hope was that it would recognise that it was getting stuck in the corner.)
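This stats agent needs no AI at all; it is just a plain function over the game state. The field names and thresholds below are assumptions about how the game stores things, not the actual implementation.

```python
def stats_agent(state: dict) -> str:
    """'Dumb' agent: turn raw game state into a short text report
    that gets appended to the player agent's prompt."""
    x = state["player_x"]
    width = state["screen_width"]
    # Call out the corners explicitly, since the model kept
    # hiding there without realising it.
    if x < width * 0.1:
        position = "far left corner"
    elif x > width * 0.9:
        position = "far right corner"
    else:
        position = "mid-screen"
    return (
        f"Shots fired: {state['shots_fired']}/{state['max_shots']}. "
        f"Active bombs: {state['bombs_active']}. "
        f"Player position: {position}."
    )
```

The point of the exercise is that cheap, reliable code can hand the expensive, unreliable model exactly the facts it keeps failing to infer from pixels.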

Now the AI has slightly more information about the scenario - Score 7

This slightly improves our outlook. We get stuck in the corner less, and the gameplay is a bit more dynamic. There is some movement to the shots, but ultimately, the vision AI fails to see the obvious threat above the player, and simply waits for its impending doom.

This is strange behaviour though, because if you ask GPT-V where the bombs are, it can accurately tell you that a bomb is in a threatening position; it just never decides to move out of the way based on visual information alone.

This is where a truly multi-agent approach comes in. This time, not only do we have an 'intelligent' agent playing the game and a basic agent giving some live stats, but I also introduce a separate 'intelligent' strategy agent. This agent uses the same GPT-V technology, but rather than being asked to give an order to move or fire, it is asked to give a strategy and explain to the player agent where the bombs are and where the danger comes from.
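Wiring the three agents together might look like the sketch below, where `ask_model` again stands in for the real GPT-V call, `stats` is the text from the code-only agent, and the prompts are paraphrased assumptions rather than the originals.

```python
from typing import Callable

def play_turn(frame_png: bytes, stats: str,
              ask_model: Callable[[str, bytes], str]) -> str:
    """One turn of the three-agent loop.

    - `stats` comes from the 'dumb' code agent.
    - The strategy agent (GPT-V) is asked for advice, not a move.
    - The player agent (GPT-V) picks the move, seeing both the
      stats and the strategy in its prompt.
    """
    strategy = ask_model(
        "You are a strategy adviser for a planet-defence game. "
        "Say where the bombs are, where the danger comes from, and "
        f"what the player should do next.\n{stats}",
        frame_png,
    )
    move = ask_model(
        "You control the truck. Reply with exactly one of FIRE, "
        f"WAIT, LEFT, RIGHT.\n{stats}\nAdviser says: {strategy}",
        frame_png,
    )
    return move.strip().upper()
```

Splitting "describe the danger" from "pick a move" is the whole trick: the model was already good at the first task, and feeding its own description back in makes it much better at the second.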

This was interesting, as the strategy agent would use information gathered from the frame to provide additional context.

"Looking at the layout, the truck should move to the left to evade the falling bomb, as moving to the right would bring it closer to the corner, potentially trapping it. Therefore, the truck should move left immediately to stay safe." - Strategy Agent 2024

This additional context genuinely changes the way the player operates, keeping it more central and moving it dynamically and consistently away from threats.

High score of 22, with a multi-agent approach

This time the gameplay is revolutionised. The player moves much more dynamically and knows when to move away from bombs. The corner strategy is still strong, but the strategy agent makes it less prevalent. The player also seems to be much better at killing the aliens in this scenario than in the others. At around score 17/18 we see the player actively move out of the corner to avoid bombs, because it actually knows:

  1. It will get stuck in the corner.
  2. The bomb is above the player.


All thanks to the strategy agent.

The player eventually gets trapped in a tight spot and loses the game, but I do think this fun demonstration shows that multi-agent approaches, combining both smart and simpler agents, can increase effectiveness using the same underlying pieces of technology.