Military Object Detection Using a Fine-Tuned Florence Vision Model

ICLR 2026 Conference Submission 19442 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Military Object Detection, Florence-2, LoRA Fine-Tuning, Vision-Language Models, Real-World Battlefield Dataset, Camouflage Detection, Data Augmentation, Class-Aware Sampling, Object Detection Metrics, Precision and Recall, Intersection over Union (IoU), Automated Threat Detection, Defense and Security Applications, YOLOv5 Benchmark Comparison, Computer Vision for Surveillance

TL;DR: We fine-tune Florence-2 Large using LoRA on a curated real-world dataset, achieving over 98% precision and recall and an average IoU of about 85%, with robust performance under challenging conditions such as camouflage, low light, and occlusion.

Abstract: Accurately detecting military vehicles and equipment in real-world environments is a challenging but vital task in modern defense. In this work, we fine-tune Microsoft’s Florence-2 Large model to recognize a wide range of military assets, including tanks, helicopters, artillery, and ground troops, under realistic battlefield conditions. Instead of training on clean or staged images, we build a dataset of over 7,000 images collected from military exercises and surveillance footage. These images include difficult cases such as camouflage, partial visibility, and low lighting. The objects are annotated using tools such as CVAT and Roboflow, which helps keep labels consistent across all categories. To improve performance on underrepresented classes, we use data augmentation and class-aware sampling. Our model achieves a precision of 98.95%, a recall of 99.33%, and an average IoU of 85.29%. These results indicate that the model performs well even in messy, real-world conditions. This work highlights the potential of Florence-2 Large for practical defense applications such as drone surveillance, battlefield monitoring, and automated threat detection.
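For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how precision, recall, and average IoU are commonly computed for bounding-box detections. It is not the authors' evaluation code: the box format (x_min, y_min, x_max, y_max), the greedy one-to-one matching, the 0.5 IoU threshold, and the function names `iou` and `evaluate` are all assumptions made for illustration, since the abstract does not state the matching protocol.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evaluate(preds: List[Box], truths: List[Box], threshold: float = 0.5):
    """Greedily match each prediction to its best unmatched ground-truth box.

    Returns (precision, recall, mean IoU over matched pairs).
    The 0.5 threshold is an illustrative assumption, not a value from the paper.
    """
    unmatched = list(truths)
    matched_ious = []
    for p in preds:
        if not unmatched:
            break
        best = max(unmatched, key=lambda t: iou(p, t))
        score = iou(p, best)
        if score >= threshold:
            matched_ious.append(score)
            unmatched.remove(best)  # each ground-truth box is matched at most once
    tp = len(matched_ious)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(truths) if truths else 0.0
    mean_iou = sum(matched_ious) / tp if tp else 0.0
    return precision, recall, mean_iou
```

Under these assumptions, precision is the fraction of predicted boxes that match a ground-truth box, recall is the fraction of ground-truth boxes that are found, and the average IoU is taken over matched pairs only. A different matching rule or threshold would change all three numbers.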

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 19442
