Abstract: We introduce MoEAtt, a novel deep Mixture of Experts (MoE) architecture that uses an attention mechanism as its routing gate, with the individual experts and the router trained jointly. The training procedure is designed to induce heterogeneity across the experts, which in turn yields a discriminative representation of the input space. We evaluate the MoEAtt architecture on multiple datasets to demonstrate its versatility and applicability across a range of scenarios, achieving state-of-the-art performance on some of them and showcasing the effectiveness of MoEAtt. Finally, we discuss further benefits and potential afforded by the MoE architecture.
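
The abstract does not spell out how the attention-based gate is formulated, so the following is only a minimal sketch of one plausible reading: the input is projected to a query, attended against learned per-expert keys, and the resulting attention weights mix the expert outputs, with experts and router trained jointly by backpropagation. The class name `AttentionRoutedMoE` and parameters such as `expert_keys` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionRoutedMoE(nn.Module):
    """Sketch of an MoE layer whose gate is an attention mechanism:
    the input acts as a query against learned expert keys, and the
    attention weights mix the expert outputs. Experts and router are
    trained jointly by ordinary backpropagation."""

    def __init__(self, d_in: int, d_hidden: int, d_out: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
            for _ in range(n_experts)
        )
        self.query = nn.Linear(d_in, d_hidden)  # project input to a query
        self.expert_keys = nn.Parameter(torch.randn(n_experts, d_hidden))  # one learned key per expert

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention-style routing: scaled dot product between the input's
        # query and each expert key, softmax-normalised into gate weights.
        q = self.query(x)                                        # (batch, d_hidden)
        scores = q @ self.expert_keys.t() / q.shape[-1] ** 0.5   # (batch, n_experts)
        gates = F.softmax(scores, dim=-1)                        # (batch, n_experts)

        # Soft (dense) combination of all expert outputs, kept simple for clarity.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, d_out)
        return (gates.unsqueeze(-1) * expert_out).sum(dim=1)           # (batch, d_out)


if __name__ == "__main__":
    layer = AttentionRoutedMoE(d_in=16, d_hidden=32, d_out=10, n_experts=4)
    y = layer(torch.randn(8, 16))
    print(y.shape)  # torch.Size([8, 10])
```

In this sketch all experts are evaluated densely and mixed softly; a sparse top-k variant, or an explicit diversity objective to encourage the expert heterogeneity the abstract mentions, would be natural extensions but is not shown here.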