Fix the implementation of value iteration: the way it selects the best actions for a state is wrong.
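
For reference, here is a minimal sketch of how best-action selection is commonly done in value iteration. It is not the implementation being fixed. The MDP encoding (`transitions[state][action]` as a list of `(probability, next_state, reward)` tuples) and the names `q_value`, `best_actions`, and `value_iteration` are all assumptions made for illustration. The key point is that the best actions are the ones that maximize the expected one-step return `r + gamma * V(s')`, with ties kept rather than dropped.

```python
from typing import Dict, List, Tuple

# Hypothetical MDP encoding (an assumption, not taken from the code under repair):
# transitions[state][action] -> list of (probability, next_state, reward)
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]


def q_value(transitions: Transitions, values: Dict[str, float],
            state: str, action: str, gamma: float) -> float:
    """Expected one-step return of taking `action` in `state`."""
    return sum(p * (r + gamma * values[s2])
               for p, s2, r in transitions[state][action])


def best_actions(transitions: Transitions, values: Dict[str, float],
                 state: str, gamma: float, tol: float = 1e-9) -> List[str]:
    """All actions whose Q-value is within `tol` of the maximum."""
    actions = transitions[state]
    if not actions:  # terminal state: nothing to choose
        return []
    q = {a: q_value(transitions, values, state, a, gamma) for a in actions}
    best = max(q.values())
    return [a for a, v in q.items() if best - v <= tol]


def value_iteration(transitions: Transitions, gamma: float = 0.9,
                    theta: float = 1e-8) -> Dict[str, float]:
    """Sweep Bellman optimality backups until the largest change is below `theta`."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:  # terminal states keep value 0
                continue
            new_v = max(q_value(transitions, values, s, a, gamma)
                        for a in actions)
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < theta:
            return values
```

Typical mistakes in this step include taking the argmax over immediate reward only, ignoring transition probabilities, or returning a single action when several are tied. Comparing the existing code against `best_actions` above may help locate which of these applies, if any.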
