Mathematics for ML #13 | Markov Decision Process

2024-12-13 3. Mathematics for ML Comments

13. Markov Decision Process

Contents
- Markov Process
- Markov Reward Process
- Markov Decision Process
- Bellman Equations
Summary
- Let’s find a trajectory (a sequence of states and actions) that maximizes the reward in a given MDP.
- It’s complex to find the argmax of long-term value functions directly.
- The Bellman equations allow us to break down complex, long-term value functions into recursive steps.
- This enables us to find a reward-maximizing trajectory in an iterative manner.

Markov process is a type of stochastic (random) process that satisfies the Markov property.
- Markov Property : The future state depends only on the current state, not on the sequence of events that preceded it.
Key Features:
- Memoryless: The process has no “memory” of past states.
- State transitions: It moves from one state to another based on certain probabilities.
- State space: The set of all possible states.
- Transition probability: The chance of moving from one state to another.
Ex) Let’s say the weather can be Sunny or Rainy. A Markov process might define the following:
- If it’s Sunny today, there’s an 80% chance it’ll be Sunny tomorrow and 20% chance of Rain.
- If it’s Rainy today, there’s a 50% chance it’ll be Sunny tomorrow and 50% chance of Rain.
- This process doesn’t care how long it’s been Sunny or Rainy—just what the current state is.
Markov Chain : A Markov process with discrete time steps and a finite (or countable) state space.

Markov Process Example

Markov Reward Process (MRP) is a Markov process with rewards added. It’s a mathematical framework used to model situations where outcomes are partly random and partly under the control of a reward structure.
Components
1. States (S) – A finite set of possible situations.
2. Transition probabilities (P) – The probability of moving from one state to another.
3. Reward function (R) – The expected immediate reward received after entering a state.
4. Discount factor (γ) – A number between 0 and 1 that represents how much future rewards are worth compared to immediate rewards.
Return $G_t$ : the total (summation of) discounted reward from time-step $t$

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$

Value Function : tells us the expected total future reward from each state.
- represent value of each state
- This equation tells us the value of being in a state s as the immediate reward $R(s)$ plus the expected value of future states.

$$ V(s) = R(s) + \gamma \sum_{s{\prime}} P(s{\prime}|s) V(s{\prime}) $$

Markov Reward Process Example

Markov Decision Process : is a MRP with action added. A framework used to describe an agent that makes sequential decisions in a stochastic (random) environment, where the goal is to maximize cumulative rewards over time.
Components
- States (S): What situation the agent is currently in
- Actions (A): What choices the agent can make
- Transitions (P): What the world does in response (random but governed by probabilities)
- Rewards (R): What the agent gets (positive or negative feedback)
- Policy (π): A strategy that maps states to actions
- Discount Factor (γ): A way to value the future compared to now
State Value Function : the expected return starting from states $s$, and then following the policy

$$ V_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s\right] $$

Action Value Function : the expected return (total future reward) starting from state s, taking action a, and then following policy $\pi$ thereafter.
- tells you how good it is to take a specific action in a specific state, assuming you act according to some policy afterward.

$$ Q^\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right] $$

Value Functions (from above)
- While the state-value function $V^\pi(s)$ gives the value of just being in a state,
- the action-value function $Q^\pi(s, a)$ evaluates the value of a specific decision.
Bellman Equation : is a recursive formula that decomposes the value function into two components
1. The immediate reword $R_{t+1}$
2. Discounted value of the future state $γ.v(S_{t+1})$
Bellman Equation for the State-Value Function $V^\pi(s)$
- The value of state $s$ under policy $\pi$ is the expected reward for taking an action a in that state, plus the discounted value of the next state $s{\prime}$, assuming you follow policy $\pi$ thereafter.

$$ V_\pi(s) = E[R_{t+1} + \gamma V_\pi(S_{t+1}) | S_t = s] $$

Bellman Equation for the Action-Value Function $Q^\pi(s, a)$
- The value of taking action a in state $s$ is the immediate reward plus the discounted expected value of the actions in the next state, again following policy $\pi$.
$$ Q^\pi(s, a) = R(s, a) + \gamma \sum_{s^{\prime}} P(s^{\prime}|s, a) \sum_{a^{\prime}} \pi(a^{\prime}|s^{\prime}) Q^\pi(s^{\prime}, a^{\prime}) $$
Why Bellman Equations Matter:
- Bellman Equations allow us to break down complex, long-term value calculations into recursive steps.
- Make it possible to use many iterative RL algorithms