15 June, 18:13

Show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

+5
Answers (1)
  1. 15 June, 21:30
    0

    Explanation:

    MDPs

    MDPs can be formulated with a reward function R(s), with R(s, a), which depends on the action taken, or with R(s, a, s'), which depends on both the action taken and the outcome state.

    We want to show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
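
    For concreteness in the sketches below, assume the original MDP is stored tabularly as plain Python dictionaries; the names (states, actions, T, R3, gamma) and the tiny two-state example are illustrative assumptions, not part of the question.

    ```python
    # Illustrative tabular MDP with a three-argument reward R(s, a, s').
    # The dict-based layout and the two-state example are assumptions
    # made for this sketch, not a standard API.

    states = ["s1", "s2"]
    actions = ["stay", "go"]

    # T[(s, a, s_next)]: probability of reaching s_next when doing a in s.
    T = {
        ("s1", "stay", "s1"): 1.0,
        ("s1", "go", "s2"): 0.9,
        ("s1", "go", "s1"): 0.1,
        ("s2", "stay", "s2"): 1.0,
        ("s2", "go", "s1"): 0.8,
        ("s2", "go", "s2"): 0.2,
    }

    # R3[(s, a, s_next)]: reward depending on the action and the outcome state.
    R3 = {
        ("s1", "stay", "s1"): 0.0,
        ("s1", "go", "s2"): 1.0,
        ("s1", "go", "s1"): -0.1,
        ("s2", "stay", "s2"): 0.5,
        ("s2", "go", "s1"): 0.0,
        ("s2", "go", "s2"): 0.0,
    }

    gamma = 0.9
    ```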

    One solution is to define a 'pre-state' pre(s, a, s') for every s, a, s', such that executing a in s leads not to s' but to pre(s, a, s'). From the pre-state there is only one action b, which always leads to s'. Let the new MDP have transition model T', reward R', and discount γ'.

    $$T'(s, a, \mathrm{pre}(s, a, s')) = T(s, a, s')$$
    $$T'(\mathrm{pre}(s, a, s'), b, s') = 1$$
    $$R'(s, a) = 0$$
    $$R'(\mathrm{pre}(s, a, s'), b) = \gamma^{-1/2}\, R(s, a, s')$$
    $$\gamma' = \gamma^{1/2}$$
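
    A direct implementation of this construction might look like the following sketch, reusing the dictionary layout assumed earlier; the helper name to_state_action_reward and the dummy action label 'b' are invented for illustration.

    ```python
    import math

    def to_state_action_reward(states, T, R3, gamma):
        """Turn an MDP with reward R(s, a, s') into one with reward R'(s, a)
        by inserting a pre-state for every transition, per the definitions above.
        Sketch only: the dict layout and the action label 'b' are assumptions."""
        sqrt_gamma = math.sqrt(gamma)        # gamma' = gamma^(1/2)
        new_states = list(states)
        new_T, new_R = {}, {}

        for (s, a, s_next), p in T.items():
            pre = ("pre", s, a, s_next)      # the new pre-state pre(s, a, s')
            new_states.append(pre)
            new_T[(s, a, pre)] = p           # T'(s, a, pre(s, a, s')) = T(s, a, s')
            new_T[(pre, "b", s_next)] = 1.0  # T'(pre(s, a, s'), b, s') = 1
            new_R[(s, a)] = 0.0              # R'(s, a) = 0
            # R'(pre(s, a, s'), b) = gamma^(-1/2) * R(s, a, s')
            new_R[(pre, "b")] = R3[(s, a, s_next)] / sqrt_gamma

        return new_states, new_T, new_R, sqrt_gamma
    ```

    The γ^(-1/2) scaling compensates for the fact that the original reward is now paid one half-step later (at the pre-state), where it would otherwise pick up an extra discount of γ^(1/2).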

    Then, using pre as shorthand for pre(s, a, s'):

    $$U(s) = \max_a\Big[\, R'(s, a) + \gamma^{1/2} \sum_{\mathrm{pre}} T'(s, a, \mathrm{pre}) \, \max_b\Big[\, R'(\mathrm{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathrm{pre}, b, s') \, U(s') \Big]\Big]$$

    Substituting the definitions of T', R', and γ', the γ^(1/2) discount cancels the γ^(-1/2) scaling on the reward, the two half-step discounts on U(s') combine into γ, and the expression collapses to the Bellman equation of the original MDP:

    $$U(s) = \max_a\Big[\sum_{s'} T(s, a, s') \,\big(R(s, a, s') + \gamma\, U(s')\big)\Big]$$

    so the utilities of the original states, and hence the optimal policies, are the same in both MDPs.
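
    The equivalence can also be checked numerically with a small value-iteration sketch. This assumes the names states, actions, T, R3, gamma, and to_state_action_reward from the earlier sketches; the two helpers below are invented for illustration.

    ```python
    def value_iteration_sas(states, actions, T, R3, gamma, iters=500):
        """Value iteration for the original MDP with reward R(s, a, s')."""
        U = {s: 0.0 for s in states}
        for _ in range(iters):
            new_U = {}
            for s in states:
                q_values = []
                for a in actions:
                    succ = [(sn, p) for (s2, a2, sn), p in T.items()
                            if s2 == s and a2 == a]
                    if succ:
                        q_values.append(sum(p * (R3[(s, a, sn)] + gamma * U[sn])
                                            for sn, p in succ))
                new_U[s] = max(q_values)
            U = new_U
        return U

    def value_iteration_sa(states, actions, T, R, gamma, iters=500):
        """Value iteration for an MDP with reward R(s, a)."""
        U = {s: 0.0 for s in states}
        for _ in range(iters):
            new_U = {}
            for s in states:
                q_values = []
                for a in actions:
                    succ = [(sn, p) for (s2, a2, sn), p in T.items()
                            if s2 == s and a2 == a]
                    if succ:
                        q_values.append(R[(s, a)]
                                        + gamma * sum(p * U[sn] for sn, p in succ))
                new_U[s] = max(q_values)
            U = new_U
        return U

    # Utilities of the original states should agree up to numerical tolerance.
    U_orig = value_iteration_sas(states, actions, T, R3, gamma)
    new_states, new_T, new_R, new_gamma = to_state_action_reward(states, T, R3, gamma)
    U_new = value_iteration_sa(new_states, actions + ["b"], new_T, new_R, new_gamma)
    for s in states:
        assert abs(U_orig[s] - U_new[s]) < 1e-6
    ```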

    Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

    Similarly to the pre-state construction above, create a state post(s, a) for every s, a, such that

    $$T'(s, a, \mathrm{post}(s, a)) = 1$$
    $$T'(\mathrm{post}(s, a), b, s') = T(s, a, s')$$
    $$R'(s) = 0$$
    $$R'(\mathrm{post}(s, a)) = \gamma^{-1/2}\, R(s, a)$$
    $$\gamma' = \gamma^{1/2}$$
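
    Under the same dictionary conventions, this second construction might be sketched as follows; the helper name to_state_reward and the dummy action label 'b' are again invented for illustration.

    ```python
    import math

    def to_state_reward(states, T, R_sa, gamma):
        """Turn an MDP with reward R(s, a) into one with a state-only reward R'(s)
        by inserting a post-state for every (s, a) pair, per the definitions above.
        Sketch only: T is keyed by (s, a, s') and R_sa by (s, a)."""
        sqrt_gamma = math.sqrt(gamma)            # gamma' = gamma^(1/2)
        new_states = list(states)
        new_T = {}
        new_R = {s: 0.0 for s in states}         # R'(s) = 0

        for (s, a, s_next), p in T.items():
            post = ("post", s, a)                # the new post-state post(s, a)
            if post not in new_R:
                new_states.append(post)
                # R'(post(s, a)) = gamma^(-1/2) * R(s, a)
                new_R[post] = R_sa[(s, a)] / sqrt_gamma
                new_T[(s, a, post)] = 1.0        # T'(s, a, post(s, a)) = 1
            new_T[(post, "b", s_next)] = p       # T'(post(s, a), b, s') = T(s, a, s')

        return new_states, new_T, new_R, sqrt_gamma
    ```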

    Then, using post as shorthand for post(s, a):

    $$U(s) = R'(s) + \gamma^{1/2} \max_a\Big[\sum_{\mathrm{post}} T'(s, a, \mathrm{post}) \,\Big(R'(\mathrm{post}) + \gamma^{1/2} \max_b\Big[\sum_{s'} T'(\mathrm{post}, b, s') \, U(s')\Big]\Big)\Big]$$

    which, after substituting the definitions of T', R', and γ', reduces in the same way to the Bellman equation of the MDP we started from:

    $$U(s) = \max_a\Big[R(s, a) + \gamma \sum_{s'} T(s, a, s') \, U(s')\Big]$$

    so optimal policies again correspond exactly.
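
    As before, the correspondence can be sanity-checked numerically. This sketch reuses states, actions, T, gamma, value_iteration_sa, and to_state_reward from the earlier blocks, and invents a small R(s, a) table and a value_iteration_s helper for the state-only reward convention.

    ```python
    def value_iteration_s(states, actions, T, R, gamma, iters=500):
        """Value iteration with a state-only reward:
        U(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') U(s')."""
        U = {s: 0.0 for s in states}
        for _ in range(iters):
            new_U = {}
            for s in states:
                totals = []
                for a in actions:
                    succ = [(sn, p) for (s2, a2, sn), p in T.items()
                            if s2 == s and a2 == a]
                    if succ:
                        totals.append(sum(p * U[sn] for sn, p in succ))
                new_U[s] = R[s] + gamma * max(totals)
            U = new_U
        return U

    # A small, made-up action-dependent reward on the same transition model T.
    R_sa = {("s1", "stay"): 0.0, ("s1", "go"): 1.0,
            ("s2", "stay"): 0.5, ("s2", "go"): 0.0}

    U_orig = value_iteration_sa(states, actions, T, R_sa, gamma)
    post_states, post_T, post_R, post_gamma = to_state_reward(states, T, R_sa, gamma)
    U_post = value_iteration_s(post_states, actions + ["b"], post_T, post_R, post_gamma)
    for s in states:
        assert abs(U_orig[s] - U_post[s]) < 1e-6
    ```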
