We present a parameter-efficient, scalable MIL framework that learns context-aware patch representations, substantially reducing reliance on complex aggregation mechanisms. Our experiments show that once rich contextual features are learned, simple pooling performs on par with more elaborate MIL heads, which underscores the robustness and modularity of the proposed approach. A current limitation is the restriction to unimodal visual inputs; evaluating scalability and robustness in larger multimodal pipelines is left for future work.