Fix the implementation of the value_iteration function. The way it selects the best action for a state is incorrect in 
both the improvement and the evaluation steps.
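
For reference, here is a minimal sketch of how greedy action selection usually works in value iteration. The original function is not shown here, so everything in it is an assumption: the name value_iteration_sketch, the helper q_values, the transition tensor P of shape (states, actions, states), the reward matrix R of shape (states, actions), and the discount gamma are all hypothetical. The point it illustrates is that both steps should rank actions by the same one-step lookahead value: the evaluation step takes the maximum over actions, and the improvement step takes the argmax.

```python
import numpy as np


def q_values(P, R, V, gamma):
    """One-step lookahead: Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s'].

    P, R and gamma are assumed inputs (hypothetical layout, not the original API).
    """
    # (S, A, S) @ (S,) contracts the next-state axis and yields shape (S, A).
    return R + gamma * (P @ V)


def value_iteration_sketch(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Return (V, policy) for a tabular MDP given as dense arrays."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        # Evaluation step: back up each state with the value of its best action.
        V_new = q_values(P, R, V, gamma).max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Improvement step: pick the action that maximizes the same lookahead value.
    policy = q_values(P, R, V, gamma).argmax(axis=1)
    return V, policy
```

A common mistake this sketch avoids is selecting the action from the immediate reward alone, or from the value of the most likely next state, instead of the full expected return R[s, a] + gamma * E[V(s')].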
