Abstract: Embodied instruction following (EIF) is a challenging task in Embodied AI that requires robots to possess a range of capabilities, including language understanding, object identification, environmental exploration, action planning, and accurate manipulation. To investigate household robot tasks expected to be meaningful in the near future, the community has developed preliminary solutions based on modular methods, but their overall performance remains far below that of humans. Completing tasks in unfamiliar, unseen environments poses a significant challenge for intelligent robots. Error analysis on the ALFRED dataset indicates that the modular structure falls short in target comprehension and vision-based interaction. We therefore propose a post-processing optimization approach that combines environment-information alignment via semantic matching with visual-interaction enhancement via pose adjustment. Evaluated on two interactive datasets, our paradigm improves performance by an average of 37.12% (relative) over the baseline, suggesting that contextual information from the current environment effectively increases the adaptability of robots.