15 June, 18:13

Show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

+5
Answers (1)
  1. 15 June, 21:30
    0

    Explanation:

    MDPs

    MDPs can be formulated with a reward function R(s), with R(s, a), which depends on the action taken, or with R(s, a, s'), which depends on both the action taken and the outcome state.

    We want to show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
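
    For concreteness in the sketches below, assume the original MDP is stored tabularly as plain Python dictionaries; the names (states, actions, T, R3, gamma) and the tiny two-state example are illustrative assumptions, not part of the question.

    ```python
    # Illustrative tabular MDP with a three-argument reward R(s, a, s').
    # The dict-based layout and the two-state example are assumptions
    # made for this sketch, not a standard API.

    states = ["s1", "s2"]
    actions = ["stay", "go"]

    # T[(s, a, s_next)]: probability of reaching s_next when doing a in s.
    T = {
        ("s1", "stay", "s1"): 1.0,
        ("s1", "go", "s2"): 0.9,
        ("s1", "go", "s1"): 0.1,
        ("s2", "stay", "s2"): 1.0,
        ("s2", "go", "s1"): 0.8,
        ("s2", "go", "s2"): 0.2,
    }

    # R3[(s, a, s_next)]: reward depending on the action and the outcome state.
    R3 = {
        ("s1", "stay", "s1"): 0.0,
        ("s1", "go", "s2"): 1.0,
        ("s1", "go", "s1"): -0.1,
        ("s2", "stay", "s2"): 0.5,
        ("s2", "go", "s1"): 0.0,
        ("s2", "go", "s2"): 0.0,
    }

    gamma = 0.9
    ```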

    One solution is to define a 'pre-state' pre(s, a, s') for every s, a, s', such that executing a in s leads not to s' but to pre(s, a, s'). From the pre-state there is only one action b, which always leads to s'. Let the new MDP have transition model T', reward R', and discount γ'.

    $$T'(s, a, \mathrm{pre}(s, a, s')) = T(s, a, s')$$
    $$T'(\mathrm{pre}(s, a, s'), b, s') = 1$$
    $$R'(s, a) = 0$$
    $$R'(\mathrm{pre}(s, a, s'), b) = \gamma^{-1/2}\, R(s, a, s')$$
    $$\gamma' = \gamma^{1/2}$$
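
    A direct implementation of this construction might look like the following sketch, reusing the dictionary layout assumed earlier; the helper name to_state_action_reward and the dummy action label 'b' are invented for illustration.

    ```python
    import math

    def to_state_action_reward(states, T, R3, gamma):
        """Turn an MDP with reward R(s, a, s') into one with reward R'(s, a)
        by inserting a pre-state for every transition, per the definitions above.
        Sketch only: the dict layout and the action label 'b' are assumptions."""
        sqrt_gamma = math.sqrt(gamma)        # gamma' = gamma^(1/2)
        new_states = list(states)
        new_T, new_R = {}, {}

        for (s, a, s_next), p in T.items():
            pre = ("pre", s, a, s_next)      # the new pre-state pre(s, a, s')
            new_states.append(pre)
            new_T[(s, a, pre)] = p           # T'(s, a, pre(s, a, s')) = T(s, a, s')
            new_T[(pre, "b", s_next)] = 1.0  # T'(pre(s, a, s'), b, s') = 1
            new_R[(s, a)] = 0.0              # R'(s, a) = 0
            # R'(pre(s, a, s'), b) = gamma^(-1/2) * R(s, a, s')
            new_R[(pre, "b")] = R3[(s, a, s_next)] / sqrt_gamma

        return new_states, new_T, new_R, sqrt_gamma
    ```

    The γ^(-1/2) scaling compensates for the fact that the original reward is now paid one half-step later (at the pre-state), where it would otherwise pick up an extra discount of γ^(1/2).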

    Then, using pre as shorthand for pre(s, a, s'):

    $$U(s) = \max_a\Big[\, R'(s, a) + \gamma^{1/2} \sum_{\mathrm{pre}} T'(s, a, \mathrm{pre}) \, \max_b\Big[\, R'(\mathrm{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathrm{pre}, b, s') \, U(s') \Big]\Big]$$

    Substituting the definitions of T', R', and γ', the γ^(1/2) discount cancels the γ^(-1/2) scaling on the reward, the two half-step discounts on U(s') combine into γ, and the expression collapses to the Bellman equation of the original MDP:

    $$U(s) = \max_a\Big[\sum_{s'} T(s, a, s') \,\big(R(s, a, s') + \gamma\, U(s')\big)\Big]$$

    so the utilities of the original states, and hence the optimal policies, are the same in both MDPs.
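
    The equivalence can also be checked numerically with a small value-iteration sketch. This assumes the names states, actions, T, R3, gamma, and to_state_action_reward from the earlier sketches; the two helpers below are invented for illustration.

    ```python
    def value_iteration_sas(states, actions, T, R3, gamma, iters=500):
        """Value iteration for the original MDP with reward R(s, a, s')."""
        U = {s: 0.0 for s in states}
        for _ in range(iters):
            new_U = {}
            for s in states:
                q_values = []
                for a in actions:
                    succ = [(sn, p) for (s2, a2, sn), p in T.items()
                            if s2 == s and a2 == a]
                    if succ:
                        q_values.append(sum(p * (R3[(s, a, sn)] + gamma * U[sn])
                                            for sn, p in succ))
                new_U[s] = max(q_values)
            U = new_U
        return U

    def value_iteration_sa(states, actions, T, R, gamma, iters=500):
        """Value iteration for an MDP with reward R(s, a)."""
        U = {s: 0.0 for s in states}
        for _ in range(iters):
            new_U = {}
            for s in states:
                q_values = []
                for a in actions:
                    succ = [(sn, p) for (s2, a2, sn), p in T.items()
                            if s2 == s and a2 == a]
                    if succ:
                        q_values.append(R[(s, a)]
                                        + gamma * sum(p * U[sn] for sn, p in succ))
                new_U[s] = max(q_values)
            U = new_U
        return U

    # Utilities of the original states should agree up to numerical tolerance.
    U_orig = value_iteration_sas(states, actions, T, R3, gamma)
    new_states, new_T, new_R, new_gamma = to_state_action_reward(states, T, R3, gamma)
    U_new = value_iteration_sa(new_states, actions + ["b"], new_T, new_R, new_gamma)
    for s in states:
        assert abs(U_orig[s] - U_new[s]) < 1e-6
    ```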

    Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

    Similarly to the pre-state construction above, create a state post(s, a) for every s, a, such that

    $$T'(s, a, \mathrm{post}(s, a)) = 1$$
    $$T'(\mathrm{post}(s, a), b, s') = T(s, a, s')$$
    $$R'(s) = 0$$
    $$R'(\mathrm{post}(s, a)) = \gamma^{-1/2}\, R(s, a)$$
    $$\gamma' = \gamma^{1/2}$$
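
    Under the same dictionary conventions, this second construction might be sketched as follows; the helper name to_state_reward and the dummy action label 'b' are again invented for illustration.

    ```python
    import math

    def to_state_reward(states, T, R_sa, gamma):
        """Turn an MDP with reward R(s, a) into one with a state-only reward R'(s)
        by inserting a post-state for every (s, a) pair, per the definitions above.
        Sketch only: T is keyed by (s, a, s') and R_sa by (s, a)."""
        sqrt_gamma = math.sqrt(gamma)            # gamma' = gamma^(1/2)
        new_states = list(states)
        new_T = {}
        new_R = {s: 0.0 for s in states}         # R'(s) = 0

        for (s, a, s_next), p in T.items():
            post = ("post", s, a)                # the new post-state post(s, a)
            if post not in new_R:
                new_states.append(post)
                # R'(post(s, a)) = gamma^(-1/2) * R(s, a)
                new_R[post] = R_sa[(s, a)] / sqrt_gamma
                new_T[(s, a, post)] = 1.0        # T'(s, a, post(s, a)) = 1
            new_T[(post, "b", s_next)] = p       # T'(post(s, a), b, s') = T(s, a, s')

        return new_states, new_T, new_R, sqrt_gamma
    ```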

    Then, using post as shorthand for post(s, a):

    $$U(s) = R'(s) + \gamma^{1/2} \max_a\Big[\sum_{\mathrm{post}} T'(s, a, \mathrm{post}) \,\Big(R'(\mathrm{post}) + \gamma^{1/2} \max_b\Big[\sum_{s'} T'(\mathrm{post}, b, s') \, U(s')\Big]\Big)\Big]$$

    which, after substituting the definitions of T', R', and γ', reduces in the same way to the Bellman equation of the MDP we started from:

    $$U(s) = \max_a\Big[R(s, a) + \gamma \sum_{s'} T(s, a, s') \, U(s')\Big]$$

    so optimal policies again correspond exactly.
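
    As before, the correspondence can be sanity-checked numerically. This sketch reuses states, actions, T, gamma, value_iteration_sa, and to_state_reward from the earlier blocks, and invents a small R(s, a) table and a value_iteration_s helper for the state-only reward convention.

    ```python
    def value_iteration_s(states, actions, T, R, gamma, iters=500):
        """Value iteration with a state-only reward:
        U(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') U(s')."""
        U = {s: 0.0 for s in states}
        for _ in range(iters):
            new_U = {}
            for s in states:
                totals = []
                for a in actions:
                    succ = [(sn, p) for (s2, a2, sn), p in T.items()
                            if s2 == s and a2 == a]
                    if succ:
                        totals.append(sum(p * U[sn] for sn, p in succ))
                new_U[s] = R[s] + gamma * max(totals)
            U = new_U
        return U

    # A small, made-up action-dependent reward on the same transition model T.
    R_sa = {("s1", "stay"): 0.0, ("s1", "go"): 1.0,
            ("s2", "stay"): 0.5, ("s2", "go"): 0.0}

    U_orig = value_iteration_sa(states, actions, T, R_sa, gamma)
    post_states, post_T, post_R, post_gamma = to_state_reward(states, T, R_sa, gamma)
    U_post = value_iteration_s(post_states, actions + ["b"], post_T, post_R, post_gamma)
    for s in states:
        assert abs(U_orig[s] - U_post[s]) < 1e-6
    ```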
