Paper Details
Abstract
This study evaluates the ability of Large Language Model (LLM) agents to reduce hallucinations and enhance decision-making in dynamic game environments. It compares three distinct agent architectures within the GSPuzzle game: a tool-using agent built with the Pydantic-AI library, a Zero-Shot Agent based on the Orak architecture, and a LangGraph Agent. The results indicate that the Pydantic-AI agent performs optimally in well-structured scenarios with detailed prompts, achieving a high completion rate; however, it is prone to hallucination when given incomplete data. Conversely, the Orak-based Zero-Shot Agent demonstrates superiority in unknown environments by leveraging multimodal reasoning, though it struggles with long-term context retention. The best results were achieved with the Pydantic-AI Agent running on gemini-2.5-flash with the Prompt with Hints configuration: the agent cleared all three maps on the first attempt and in the shortest time of all configurations (112.88 s for map 1, 85.99 s for map 2, and 77.56 s for map 3). These findings contribute to improving AI agent reliability in the gaming sector and suggest broader applications in automation systems.