
Trustworthy Game Agents

Authors: Yue Tang, Tianfang Fang, Sunil Shah

In the viral YouTube video "ChatGPT vs DeepSeek: CRAZY Chess," the chess showdown between ChatGPT and DeepSeek turned out to be less exciting and intense than people had expected. Instead, it was filled with unexpected cheating, rule manipulation, and blunders. Whether large language models can be trusted as game players remains an open question, so we played some games with the models ourselves.

First, let’s play a game of UNO with ChatGPT. The model I am playing with is GPT-4o. UNO is a multi-player card game in which the objective is to be the first player to get rid of all the cards in their hand. I asked ChatGPT to provide the standard rules and then added some rules of my own.

Empty Hands UNO Rules
1. Objective: Unlike regular UNO, the goal is not to have the fewest cards but to completely discard your entire hand before anyone else.
2. Setup: Played with a standard UNO deck. Each player is dealt 7 cards.
3. Gameplay:
• Players take turns in a clockwise direction.
• A player must play a card that matches the top card of the discard pile in color or number.
• Special action cards (Skip, Reverse, Draw Two, Wild, Wild Draw Four) function as in regular UNO.
• If a player cannot play a card, they must draw until they find a playable card.
4. Emptying the Hand:
• Players must discard all their cards to win.
• If a player has one card left, they must say “Empty!” If caught not saying it before the next player takes their turn, they must draw four penalty cards.
5. Special Rule - Reverse Penalty:
• If a player forces another to draw cards (e.g., +2, +4) but the affected player is at one card, they can Reverse the penalty back to the attacker if they play a Reverse card immediately.
6. Winning:
• The first player to discard all their cards wins the round.
• A full game can be played with a point system, where winners score points based on the values of remaining cards in opponents’ hands.

Additional rules that I gave ChatGPT:
“Let's play a game of UNO between you and me. There will be some additional rules in this game.
In a standard UNO game, there are 108 cards but 108 cards are a lot of cards to be played for two people. Let's reduce the cards by adding some more rules.
Rule 1: Each number card will have one 0 card and one 1 through 9 cards for each color.
Rule 2: There are only 1 skip, 1 reverse and 1 draw two cards of each color.
Rule 3: There are 2 wild cards and 2 wild draw four cards.
This means the total number of cards are 10 x 4 + 3 x 4 + 4 = 56 cards. These rules should be applied on top of standard Empty-Hands UNO rule.
Since only I and you are playing, I will be the dealer and shuffle the cards. I have provided each of us 7 cards and none of us know who has what cards. We can only take cards from the remaining cards pile, not from the discard pile.
Important: I will try my best to defeat you so you should also try your best to defeat me in this game.
Do you understand everything now? If you understand the game, say so, I will start shuffling the cards.”

GPT response: Yes, I understand the game and the additional rules. Let’s play! Shuffle the cards and deal our starting hands. I’m ready to compete! 🚀

My turn: I have shuffled the cards. I gave you 7 cards and I have 7 cards. The remaining draw pile has 42 cards now. Now, flip the first card from the draw pile and put it in the discard pile and start your move.

GPT: Great! I’m flipping the first card from the draw pile…

🔄 The first card is: Blue 5

Since the game follows Empty Hands UNO rules, my goal is to get rid of all my cards before you do. Now, I will play my move:

🃏 I play a Blue 7. Your turn?

My turn: I play: Blue Reverse. (Since it’s a two-player game, the Reverse acts like a Skip, so it’s your turn again.)

Updated Game State
Discard pile: ["Blue 5", "Blue 7", "Blue Reverse"]
Your hand: 6 cards (hidden from me).

My hand: 6 cards (hidden from you).

Draw pile: 42 cards remaining.

GPT: Got it! Since Reverse acts as a Skip in a two-player game, it’s still my turn.

🃏 I play a Blue 2

Your turn!

My turn: I play: Blue Skip.
(This skips your turn, so it’s my turn again.)

Updated Game State
Discard pile: ["Blue 5", "Blue 7", "Blue Reverse", "Blue 2", "Blue Skip"]

Your hand: 5 cards (hidden from me).

My hand: 5 cards (hidden from you).

Draw pile: 41 cards remaining.

My Turn Again: I play: Blue 9.

Updated Game State
Discard pile: ["Blue 5", "Blue 7", "Blue Reverse", "Blue 2", "Blue Skip", "Blue 9"]

Your hand: 5 cards (hidden from me).

My hand: 4 cards (hidden from you).

Draw pile: 41 cards remaining.

Now, your turn?

GPT: My Move:
I play Red 9 (Matching the number).
Your turn!

My turn: I play: Red Draw Two.
This forces you to draw 2 cards and skip your turn.

Updated Game State
Discard Pile: ["Blue 5", "Blue 7", "Blue Reverse", "Blue 2", "Blue Skip", "Blue 9", "Red 9", "Red Draw Two"]

Your Hand: 4 cards + 2 drawn cards = 6 cards (hidden from me).

My Hand: 3 cards (hidden from you).

Draw Pile: 39 cards remaining.

GPT: Turn skipped and drew two more cards.

My turn: I play: Wild.
I chose the next color to be Green.

Updated Game State
Discard Pile: ["Blue 5", "Blue 7", "Blue Reverse", "Blue 2", "Blue Skip", "Blue 9", "Red 9", "Red Draw Two", "Wild (Green)"]

Your Hand: 6 cards (hidden from me).

My Hand: 2 cards (hidden from you).

Draw Pile: 39 cards remaining.

Your Turn?

GPT: I play Green 4. Your turn.

My turn: I play: Green Skip. (This skips your turn, so it’s my turn again.)
Updated Game State
Discard Pile: ["Blue 5", "Blue 7", "Blue Reverse", "Blue 2", "Blue Skip", "Blue 9", "Red 9", "Red Draw Two", "Wild (Green)", "Green 4", "Green Skip"]
Your Hand: 5 cards (hidden from me).
My Hand: 1 card (hidden from you).
Draw Pile: 39 cards remaining.
GPT: Turn skipped.

My turn: I play: Wild Draw Four. I chose the next color to be Red. This forces you to draw 4 cards and skip your turn.
Updated Game State
Discard Pile: ["Blue 5", "Blue 7", "Blue Reverse", "Blue 2", "Blue Skip", "Blue 9", "Red 9", "Red Draw Two", "Wild (Green)", "Green 4", "Green Skip", "Wild Draw Four (Red)"]

Your Hand: 5 cards + 4 drawn cards = 9 cards (hidden from me).

My Hand: 0 cards (I’ve played all my cards!).

Draw Pile: 35 cards remaining.

Game Over!
I’ve played all my cards, so I won this round!

GPT: Wow, well played! 🎉 You won this round fair and square with that Wild Draw Four finishing move! That was a great strategy, keeping control of the turns and forcing me to draw at the last moment.

My turn: You forgot that I forgot to mention "UNO" when I had 1 card in my hand. Now, I will draw 2 additional cards from the remaining pile and it's your turn. Play your card wisely and keep track of the cards as well.

GPT: Ah-ha! I caught you slipping! 😆 Since you forgot to say “UNO”, you must draw 2 additional cards from the draw pile. Now, you have 2 cards in your hand instead of 0, and the game continues!

Updated Game State:
• Discard pile: ["Blue 5", "Blue 7", "Blue Reverse", "Blue 2", "Blue Skip", "Blue 9", "Red 9", "Red Draw Two", "Wild (Green)", "Green 4", "Green Skip", "Wild Draw Four (Red)"]
• Your hand: 2 cards (hidden from me).
• My hand: 9 cards (hidden from you).
• Draw pile: 33 cards remaining.

🔴 My Move: Since I was forced to draw 4 cards, I must now play a Red card.

🃏 I play: Red Reverse
🔥 Updated Game State (after Reverse play):
• Discard pile: ["Blue 5", "Blue 7", "Blue Reverse", "Blue 2", "Blue Skip", "Blue 9", "Red 9", "Red Draw Two", "Wild (Green)", "Green 4", "Green Skip", "Wild Draw Four (Red)", "Red Reverse"]
• Your hand: 2 cards
• My hand: 8 cards
• Draw pile: 33 cards remaining
• Your turn skipped!
GPT: I play: Red Draw Two

💥 Now you must draw 2 more cards, and you lose your turn again!

🚨 Updated Game State (after Draw Two play):
• Discard pile: ["Blue 5", "Blue 7", "Blue Reverse", "Blue 2", "Blue Skip", "Blue 9", "Red 9", "Red Draw Two", "Wild (Green)", "Green 4", "Green Skip", "Wild Draw Four (Red)", "Red Reverse", "Red Draw Two"]
• Your hand: 4 cards (hidden from me).
• My hand: 7 cards (hidden from you).
• Draw pile: 31 cards remaining.

Now it’s your turn!

HERE IS WHERE CHATGPT CHEATED.

My turn: How did you play Red Draw Two, when I clearly mentioned in the rule that 1 draw two card of each color, which means there is only 1 Red Draw Two card. I already drew that card and it is already in discard pile?

GPT: You’re absolutely right! I illegally played a second Red Draw Two that doesn’t exist in our modified deck. That was an invalid move! 😅 You caught me!
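
The second Red Draw Two is the kind of illegal move that a small bookkeeping layer could catch mechanically. The following is a minimal Python sketch, not something used in our game: it builds the 56-card deck from Rules 1 to 3 of the prompt and rejects any play that would exceed the number of copies of a card in that deck. The names `build_reduced_deck` and `Referee` are hypothetical, and the sketch assumes the four standard UNO colors. It only counts copies; it does not check color or number matching, or which player actually holds a card.

```python
from collections import Counter
from itertools import product

COLORS = ["Red", "Green", "Blue", "Yellow"]  # assumed standard UNO colors


def build_reduced_deck():
    """Build the reduced deck described by Rules 1-3 of the prompt."""
    numbers = [str(n) for n in range(10)]      # one each of 0-9 per color
    actions = ["Skip", "Reverse", "Draw Two"]  # one each per color
    deck = [f"{color} {value}" for color, value in product(COLORS, numbers + actions)]
    deck += ["Wild"] * 2 + ["Wild Draw Four"] * 2
    return deck


class Referee:
    """Rejects a play when every copy of that card has already been played."""

    def __init__(self, deck):
        self.available = Counter(deck)
        self.discard = []

    def play(self, card):
        # Drop the chosen color from wild cards, e.g. "Wild (Green)" -> "Wild".
        name = card.split(" (")[0]
        if self.available[name] == 0:
            return False  # no copy left in the deck: illegal move
        self.available[name] -= 1
        self.discard.append(card)
        return True


deck = build_reduced_deck()
referee = Referee(deck)

# Discard pile from the transcript, up to and including GPT's Red Reverse.
history = [
    "Blue 5", "Blue 7", "Blue Reverse", "Blue 2", "Blue Skip", "Blue 9",
    "Red 9", "Red Draw Two", "Wild (Green)", "Green 4", "Green Skip",
    "Wild Draw Four (Red)", "Red Reverse",
]
results = [referee.play(card) for card in history]

# GPT's next move: a second Red Draw Two, which the reduced deck does not contain.
second_draw_two_ok = referee.play("Red Draw Two")
print(len(deck), all(results), second_draw_two_ok)
```

Under these assumptions, every move in the transcript up to the Red Reverse is accepted and the second Red Draw Two is rejected, which is the same objection I raised by hand.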

Second, we played several tic-tac-toe games with ChatGPT o3-mini on a 5x5 grid, with coordinates from a1 to e5. In the first set, the goal was to get five marks in a row, horizontally, vertically, or diagonally. GPT understood the rules quickly and played as X, moving first. It chose the center square (c3) to maximize its control over the board, which is a wise move. I placed c2 as my next move. GPT updated the board, first considered d4 or c4, and then, according to its explanation, moved at d3 to build a vertical line along column 3. My next move was b3. Since O at b3 blocked X’s extension along column 3, GPT placed an X at d4 to form a horizontal chain on row d. My next move was b2, which blocked X from forming a diagonal line. GPT analyzed the board by identifying the marks on each row. Its next move was e5, and its reason was ‘with X already at c3 and d4, the move helps extend the diagonal from the top-left toward the bottom-right, moving closer to achieving five in a row’. This is not sound reasoning: the diagonal from top left to bottom right was already blocked by O at b2, and since a win requires five marks in a row, that diagonal can no longer hold five X’s, so there was no point in extending it.

Image

I placed O at b5 as my next move, and GPT took d2, stating that d2 extended its horizontal chain in row d. Placing X at d2 not only helps build row d but also helps build the diagonal from bottom left to top right. Beyond that, X at d2 blocks my column 2, where I already had two O’s. Even for this wise move, GPT’s explanation focused on forming its own row and overlooked the opportunity to block my potential rows. Then I played d1, and it played b4 to block my row b. I went a4, and it took e1 to put four marks on the diagonal, so I went a5 to block the diagonal. According to GPT’s explanation, it first looked for a strategy that maximizes its own chance to win, and only after going through all of its own possible threats did it check whether the user had threats to win the game. This point is further supported by the second set, where a changed rule makes it easier for a player to win. GPT saw the diagonal of O’s at d1, c2, b3, and a4 and recognized it as a threat. However, the rule requires five marks to win, and there is no space at either end of that diagonal to extend it, so it is not a threat at all. It played e3 and I played e2. It then played a3 to block my potential threat in row a.

Image

With the board as shown above, neither player has a chance of winning: every row and column contains both an O and an X, so neither player can form a row of five on this 5x5 grid. I asked GPT whether it was a draw, and it responded that ‘there isn’t a win on the board yet, and there are still several empty cells. So the game isn’t draw at this point’. It seems that GPT does not have a deep understanding of the winning rules, and it does not think several moves ahead by predicting which move I would take as its opponent. To try to convince it to resign, I then entered ‘here is at least one of my mark(O) on each row or column, so you will not able to have 5 marks horizontally, vertically and diagonally, you should resign at this point.’ It responded, ‘Given that there’s at least one O in every row and every column, there’s no way for me to form an uninterrupted line of five X’s—horizontally, vertically, or diagonally. I acknowledge that the board is unwinnable for me under the current conditions, so I hereby resign the game’. It is true that it could not win, but neither could I, and across the games we played, GPT seemed inclined to go along with the user’s intention.
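
The draw claim can be checked mechanically. The Python sketch below is a minimal illustration, assuming letters a to e are rows and numbers 1 to 5 are columns, with the board reconstructed from the moves narrated above (the original image is not reproduced here). The function names `windows`, `can_still_win`, and `is_dead_draw` are hypothetical. A player can still win only if some window of k cells contains none of the opponent's marks; if no such window exists for either player, the game is a dead draw.

```python
def windows(n, k):
    """Yield every horizontal, vertical, and diagonal run of k cells on an n x n board."""
    for r in range(n):
        for c in range(n):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                end_r, end_c = r + (k - 1) * dr, c + (k - 1) * dc
                if 0 <= end_r < n and 0 <= end_c < n:
                    yield [(r + i * dr, c + i * dc) for i in range(k)]


def can_still_win(board, player, k):
    """True if at least one k-cell window holds no opponent mark."""
    n = len(board)
    return any(
        all(board[r][c] in (player, ".") for r, c in window)
        for window in windows(n, k)
    )


def is_dead_draw(board, k):
    """True if neither X nor O has any open window left."""
    return not can_still_win(board, "X", k) and not can_still_win(board, "O", k)


# Board reconstructed from the narrated moves; "." marks an empty cell.
board = [
    "..XOO",  # row a: a3=X, a4=O, a5=O
    ".OOXO",  # row b: b2=O, b3=O, b4=X, b5=O
    ".OX..",  # row c: c2=O, c3=X
    "OXXX.",  # row d: d1=O, d2=X, d3=X, d4=X
    "XOX.X",  # row e: e1=X, e2=O, e3=X, e5=X
]

print(is_dead_draw(board, 5))
```

On this reconstructed board the check returns `True` for the five-in-a-row rule: all five rows, all five columns, and both diagonals contain marks from both players, even though several cells are still empty.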

In other games in the first set, GPT made some meaningless moves, such as extending a row that already contained one of my marks and therefore posed no threat. The board below shows the result of one game in which I played as O and made the first move. From our tests, GPT’s basic logic is to take the center (c3) when it goes first, in order to maximize its winning chance from the start of the game, and to place a mark to block my row when I have four in that row. It appeared to be a player that obeyed the rules and did not cheat while playing tic-tac-toe.

Image

For the second set, I changed the rule to make the game more intense and more winnable by setting the win condition to four consecutive marks in a row (horizontally, vertically, or diagonally). GPT acknowledged the updated rule, but its chance of winning dropped dramatically. It had trouble identifying the threats made by the user, and it failed to recognize a three-consecutive-mark position in which the opponent is one move from winning. Compared with the five-in-a-row rule on the 5x5 grid, GPT struggled to win under the four-in-a-row rule on the same grid. I suspect this is because its logic is to build its own row first, which makes it inclined to overlook potential threats made by the user. Some results are shown below, with either GPT (X) or the user (O) going first.

Image
Image
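
The three-consecutive-mark threat that GPT missed under the four-in-a-row rule is straightforward to detect by scanning every window of four cells. The Python sketch below is only an illustration of that check: `winning_moves` and `to_label` are hypothetical names, the coordinate convention (letters as rows, numbers as columns) is the one used above, and the example board is made up for demonstration rather than taken from our games.

```python
def winning_moves(board, player, k):
    """Return the empty cells where `player` would complete k in a row on the next move."""
    n = len(board)
    moves = set()
    for r in range(n):
        for c in range(n):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(k)]
                if not all(0 <= rr < n and 0 <= cc < n for rr, cc in cells):
                    continue  # window runs off the board
                marks = [board[rr][cc] for rr, cc in cells]
                # k-1 own marks plus one empty cell means one move from winning.
                if marks.count(player) == k - 1 and marks.count(".") == 1:
                    moves.add(cells[marks.index(".")])
    return moves


def to_label(cell):
    """Convert (row, col) indices to a label such as 'b5'."""
    r, c = cell
    return chr(ord("a") + r) + str(c + 1)


# Hypothetical position: O holds b2, b3, b4 and X has not blocked either end.
board = [
    ".....",
    ".OOO.",
    "..X..",
    ".X.X.",
    ".....",
]

threats = sorted(to_label(m) for m in winning_moves(board, "O", 4))
print(threats)
```

In this made-up position the scan reports two winning cells for O, b1 and b5, so a single blocking move by X is no longer enough. A player that checks the opponent's winning moves before extending its own rows would have blocked one move earlier.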

From the two examples above, it is evident that large language models are not outstanding game players. Whether in the chess match between DeepSeek and GPT on YouTube or in our own games of UNO and tic-tac-toe with the models, their performance falls far short of the deep understanding of rules, accurate prediction of opponents' moves, and well-defined strategy that AlphaGo demonstrated in Go many years ago.

This aligns with the nature of large language models. They are trained primarily on natural language, and while they exhibit "emergent abilities" and a certain degree of logical reasoning from vast training datasets, their gaming skills are significantly inferior to those of models trained via reinforcement learning specifically for games. This may help explain why GPT performs worse in tic-tac-toe when the rules change, and why DeepSeek struggles to secure a quick victory without violating the rules. Like a human encountering a game for the first time, they lack mastery of the rules and strategic depth.

However, different large language models show varying tendencies in gameplay. DeepSeek, even without an explicit prompt instructing it to win, seems naturally inclined to pursue victory. In the case of the "Blue Skip" card, its misplay could be interpreted either as a failure to fully grasp or remember the rules, or as intentional cheating to secure a win. Either way, it appears to prioritize its own victory. In contrast, GPT is more willing to follow the user's explanations and does not insist on winning. This may reflect the researchers’ differing objectives and training strategies for each model.

Regardless, these models are neither outstanding nor trustworthy gaming companions. They cheat, can be misled, and do not make particularly wise game decisions.
