Mathematics for ML #13 | Markov Decision Process


13. Markov Decision Process

  • Contents

    • Markov Process

    • Markov Reward Process

    • Markov Decision Process

    • Bellman Equations

  • Summary

    • Let’s find a trajectory (a sequence of states and actions) that maximizes the reward in a given MDP.

    • It’s complex to find the argmax of long-term value functions directly.

    • The Bellman equations allow us to break down complex, long-term value functions into recursive steps.

    • This enables us to find a reward-maximizing trajectory in an iterative manner.



13.1 Markov Process

  • Markov process is a type of stochastic (random) process that satisfies the Markov property.

    • Markov Property : The future state depends only on the current state, not on the sequence of events that preceded it.
  • Key Features:

    • Memoryless: The process has no “memory” of past states.

    • State transitions: It moves from one state to another based on certain probabilities.

    • State space: The set of all possible states.

    • Transition probability: The chance of moving from one state to another.

  • Ex) Let’s say the weather can be Sunny or Rainy. A Markov process might define the following:

    • If it’s Sunny today, there’s an 80% chance it’ll be Sunny tomorrow and 20% chance of Rain.

    • If it’s Rainy today, there’s a 50% chance it’ll be Sunny tomorrow and 50% chance of Rain.

    • This process doesn’t care how long it’s been Sunny or Rainy—just what the current state is.

  • Markov Chain : A Markov process with discrete time steps and a finite (or countable) state space.

Markov Process Example



13.2 Markov Reward Process

  • Markov Reward Process (MRP) is a Markov process with rewards added. It’s a mathematical framework used to model situations where outcomes are partly random and partly under the control of a reward structure.

  • Components

    1. States (S) – A finite set of possible situations.

    2. Transition probabilities (P) – The probability of moving from one state to another.

    3. Reward function (R) – The expected immediate reward received after entering a state.

    4. Discount factor (γ) – A number between 0 and 1 that represents how much future rewards are worth compared to immediate rewards.

  • Return $G_t$ : the total (summation of) discounted reward from time-step $t$

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$

  • Value Function : tells us the expected total future reward from each state.

    • represent value of each state

    • This equation tells us the value of being in a state s as the immediate reward $R(s)$ plus the expected value of future states.

$$ V(s) = R(s) + \gamma \sum_{s{\prime}} P(s{\prime}|s) V(s{\prime}) $$

Markov Reward Process Example



13.3 Markov Decision Process

  • Markov Decision Process : is a MRP with action added. A framework used to describe an agent that makes sequential decisions in a stochastic (random) environment, where the goal is to maximize cumulative rewards over time.

  • Components

    • States (S): What situation the agent is currently in

    • Actions (A): What choices the agent can make

    • Transitions (P): What the world does in response (random but governed by probabilities)

    • Rewards (R): What the agent gets (positive or negative feedback)

    • Policy (π): A strategy that maps states to actions

    • Discount Factor (γ): A way to value the future compared to now

  • State Value Function : the expected return starting from states $s$, and then following the policy

$$ V_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s\right] $$

  • Action Value Function : the expected return (total future reward) starting from state s, taking action a, and then following policy $\pi$ thereafter.

    • tells you how good it is to take a specific action in a specific state, assuming you act according to some policy afterward.

$$ Q^\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right] $$



Bellman Equation

  • Value Functions (from above)

    • While the state-value function $V^\pi(s)$ gives the value of just being in a state,

    • the action-value function $Q^\pi(s, a)$ evaluates the value of a specific decision.

  • Bellman Equation : is a recursive formula that decomposes the value function into two components

    1. The immediate reword $R_{t+1}$

    2. Discounted value of the future state $γ.v(S_{t+1})$

  • Bellman Equation for the State-Value Function $V^\pi(s)$

    • The value of state $s$ under policy $\pi$ is the expected reward for taking an action a in that state, plus the discounted value of the next state $s{\prime}$, assuming you follow policy $\pi$ thereafter.

$$ V_\pi(s) = E[R_{t+1} + \gamma V_\pi(S_{t+1}) | S_t = s] $$

  • Bellman Equation for the Action-Value Function $Q^\pi(s, a)$

    • The value of taking action a in state $s$ is the immediate reward plus the discounted expected value of the actions in the next state, again following policy $\pi$.

    $$ Q^\pi(s, a) = R(s, a) + \gamma \sum_{s^{\prime}} P(s^{\prime}|s, a) \sum_{a^{\prime}} \pi(a^{\prime}|s^{\prime}) Q^\pi(s^{\prime}, a^{\prime}) $$

  • Why Bellman Equations Matter:

    • Bellman Equations allow us to break down complex, long-term value calculations into recursive steps.

    • Make it possible to use many iterative RL algorithms



References