Abstract: Efficient inference in large output spaces is an essential yet challenging task in large-scale machine learning.
Previous approaches reduce this problem to Approximate Maximum Inner Product Search (AMIPS), which is
based on the observation that the prediction of a given model corresponds to the logit with the largest value.
However, models are not perfectly accurate, and successfully retrieving the largest logit may not lead to
the correct prediction. We argue that approximate MIPS approaches are sub-optimal because they are tailored
for retrieving the class with the largest inner product rather than the correct class. Moreover, the logits generated
from neural networks with large output space lead to extra challenges for the AMIPS method to achieve a high
recall rate within the computation budget of efficient inference. In this paper, we propose HALOS, which reduces
inference to sub-linear computation by selectively activating a small set of output-layer neurons that are likely to
correspond to the correct classes rather than those that merely yield the largest logits. Our extensive evaluations show that HALOS
matches or even outperforms the accuracy of the given models with a 21× speedup and an 87% energy reduction.
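To make the MIPS reduction concrete, the following is a minimal sketch (not from the paper; all names and sizes are illustrative, and NumPy is assumed): exact inference over the output layer is an argmax over the inner products of the feature vector with every class weight row, while an AMIPS-style approach scores only a candidate subset. Here the candidate selection is a toy random sample; real systems use structures such as hashing or graph indexes.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 1000, 64
W = rng.standard_normal((num_classes, dim))  # output-layer weights, one row per class
h = rng.standard_normal(dim)                 # penultimate-layer features for one input

# Exact inference: the predicted class maximizes the logit W[c] @ h,
# i.e. prediction is a Maximum Inner Product Search over the rows of W.
logits = W @ h
pred = int(np.argmax(logits))

# Approximate MIPS: evaluate only a candidate subset of classes.
# (Toy stand-in: a random sample; the recall of the true argmax
# depends entirely on whether it lands in the candidate set.)
candidates = rng.choice(num_classes, size=100, replace=False)
approx_pred = int(candidates[np.argmax(W[candidates] @ h)])
```

The sketch also illustrates the abstract's point: even when the approximate search perfectly recovers `pred`, that class is only the largest-logit class, which is not necessarily the correct label for the input.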