Dyna-Q is a foundational reinforcement learning (RL) architecture, introduced by Richard Sutton, that integrates model-free learning with model-based planning. While traditional Q-learning relies exclusively on direct, trial-and-error interaction with the real environment, Dyna-Q uses those same real experiences to build an internal “world model” at the same time. The agent then “hallucinates”, or simulates, extra experiences from that model and uses them to update its Q-values, and hence its policy, in the background. Because each real transition can be reused for many updates, this typically improves sample efficiency substantially.

The Core Architecture

Dyna-Q continually manages four processes that conceptually run in parallel:
Acting: The agent observes the current state, uses an ε-greedy policy to select an action, and executes it.
Direct RL: The agent uses the real reward and next state to update its Q-table with standard one-step tabular Q-learning.
Model Learning: The agent logs the real transition (state, action -> reward, next state) into a local lookup table to refine its model of the world.
Planning: The agent takes a break from the real world, randomly samples previously observed state-action pairs from its internal model, and runs n simulated Q-learning updates on the model's predicted outcomes.

Step-by-Step Algorithm Loop
The entire process operates within a single execution cycle:
1. Initialize Q(s, a) and Model(s, a) for all states and actions
2. Loop forever (or per episode):
   a) s <- current state
   b) Choose action 'a' from 's' using policy derived from Q (e.g., ε-greedy)
   c) Take action 'a'; observe reward 'r' and next state 's_prime'
   d) [Direct RL] Update Q(s, a) using the real (s, a, r, s_prime)
   e) [Model Learning] Store Model(s, a) <- (r, s_prime)
   f) [Planning] Repeat n times:
      - Select a random, previously visited state 's_sim'
      - Select a random action 'a_sim' previously taken in 's_sim'
      - Fetch predicted (r_sim, s_prime_sim) from Model(s_sim, a_sim)
      - Update Q(s_sim, a_sim) using the simulated transition
   g) s <- s_prime

Direct Q-Learning vs. Dyna-Q
The distinction between classic Q-learning and Dyna-Q lies entirely in how they use real-world interactions: Q-learning applies each real transition to a single update and then discards it, whereas Dyna-Q also stores it in the model and replays it during planning.
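As a rough illustration of that difference, here is a minimal Python sketch of the loop above on a made-up six-state corridor. The environment, the names `step`, `q_update`, `greedy` and `dyna_q`, and the hyperparameter values are all assumptions chosen for illustration, not part of the algorithm description. Setting `n_planning=0` skips the planning step, which leaves plain one-step Q-learning; any larger value replays stored transitions from the model after every real step.

```python
import random
from collections import defaultdict

N_STATES = 6      # hypothetical corridor: states 0..5, goal at state 5
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Toy deterministic environment (an assumption for this sketch)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return reward, next_state


def q_update(Q, s, a, r, s_prime, alpha, gamma):
    """One-step tabular Q-learning update, used for real and simulated data."""
    target = r + gamma * max(Q[(s_prime, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def greedy(Q, s, rng):
    """Greedy action with random tie-breaking."""
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])


def dyna_q(n_planning, episodes=30, alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q(s, a), implicitly initialised to 0
    model = {}              # Model(s, a) -> (r, s_prime)
    real_steps = 0
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            # Acting: epsilon-greedy choice from the current Q-table.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = greedy(Q, s, rng)
            r, s_prime = step(s, a)
            real_steps += 1
            # Direct RL: update from the real transition.
            q_update(Q, s, a, r, s_prime, alpha, gamma)
            # Model learning: remember what this (s, a) pair led to.
            model[(s, a)] = (r, s_prime)
            # Planning: replay n randomly chosen, previously seen pairs.
            for _ in range(n_planning):
                s_sim, a_sim = rng.choice(list(model))
                r_sim, s_prime_sim = model[(s_sim, a_sim)]
                q_update(Q, s_sim, a_sim, r_sim, s_prime_sim, alpha, gamma)
            s = s_prime
    return Q, real_steps


if __name__ == "__main__":
    for n in (0, 20):
        _, steps = dyna_q(n_planning=n)
        print(f"n_planning={n}: {steps} real environment steps over 30 episodes")
```

Sampling a stored `(s_sim, a_sim)` key from `model` guarantees that the simulated action was actually taken in the simulated state, so the model always has a prediction for it. The sketch assumes a deterministic environment, which is why a single stored outcome per pair is enough; a stochastic environment would need a richer model.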