Abstract: Large-scale, loosely coupled applications cannot be run directly on high-performance computing (HPC) platforms. At the same time, deploying and maintaining both HPC and high-throughput computing (HTC) platforms wastes computing resources. To address the problem that existing resource management systems cannot execute HTC applications efficiently on high-performance computers, we propose, design, and implement Teno, an HTC job execution framework that requires no modification to the existing Slurm configuration on Tianhe-2. Teno applies hierarchical scheduling on top of Slurm to achieve fine-grained resource scheduling and optimizes the traditional Master-Worker model, thereby speeding up high-throughput job execution and increasing the effective utilization of cluster resources. It also implements fault-tolerance mechanisms, including fault recovery and error retry. Finally, we design a range of experiments on Tianhe-2 to test and evaluate the key factors affecting high-throughput computing in Teno, Slurm, and HTCondor, and analyze in detail why Teno outperforms the other two.