- Abstract: Many state-of-the-art methods for human action recognition are based on attention mechanisms, which underscores the importance of attention for this task. With the rapid development of neural networks, human action recognition has improved greatly through the use of convolutional neural networks (CNNs) or recurrent neural networks (RNNs). In this paper, we propose a model based on a spatial-temporal attention weighted LSTM. The model attends to the key part of each video frame and to the important frames in each video sequence, so the central problem is how to locate the key parts spatially and the key frames temporally. We present a feasible architecture that addresses both problems effectively. The model is trained and tested on three datasets: UCF-11, UCF-101, and HMDB51. The results demonstrate that it performs well on human action recognition.
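
To make the two kinds of attention concrete, the following is a minimal NumPy sketch of spatial weighting within each frame followed by temporal weighting across frames. It is not the architecture proposed in the paper: the function names, the linear scoring vectors `w_spatial` and `w_temporal`, the tensor shapes, and the random inputs are all assumptions chosen for illustration, and the LSTM itself is omitted (in a full model the per-frame vectors would plausibly be LSTM outputs rather than raw pooled features).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(frame_feats, w_spatial):
    # frame_feats: (T, K, D) = frames x spatial locations x feature dim.
    # w_spatial:   (D,) hypothetical scoring vector (an assumption).
    scores = frame_feats @ w_spatial            # (T, K) score per location
    alpha = softmax(scores, axis=1)             # weights sum to 1 per frame
    attended = (alpha[..., None] * frame_feats).sum(axis=1)  # (T, D)
    return alpha, attended

def temporal_attention(frame_vecs, w_temporal):
    # frame_vecs: (T, H) one vector per frame.
    # w_temporal: (H,) hypothetical scoring vector (an assumption).
    scores = frame_vecs @ w_temporal            # (T,) score per frame
    beta = softmax(scores, axis=0)              # weights sum to 1 over frames
    video_vec = (beta[:, None] * frame_vecs).sum(axis=0)     # (H,)
    return beta, video_vec

# Illustrative shapes only: 8 frames, a 7x7 feature map, 16-dim features.
rng = np.random.default_rng(0)
T, K, D = 8, 49, 16
feats = rng.normal(size=(T, K, D))
alpha, per_frame = spatial_attention(feats, rng.normal(size=D))
beta, video_vec = temporal_attention(per_frame, rng.normal(size=D))
```

Here `alpha` plays the role of the per-frame weights over spatial locations and `beta` the weights over frames; a classifier would then operate on `video_vec`.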