\section{Conclusion}
This paper introduced a novel approach to scaling speech-text pre-training using supervised semantic tokens and synthetic interleaved data. By employing a supervised speech tokenizer and generating 600B tokens of interleaved data, we scaled our speech pre-training to 1 trillion tokens, achieving state-of-the-art performance on speech language modeling and spoken question answering. We also developed an end-to-end spoken chatbot by fine-tuning our pre-trained model; it demonstrates competitive performance in both conversational ability and speech quality. Future work could explore more efficient training techniques, investigate larger model sizes, and expand multilingual capabilities.