Computer vision in drones: detection, tracking, recognition

The May 2026 Indian Army sovereign swarm RFI placed computer vision in drones at the centre of its evaluation criteria (Indian Army RFI, May 2026). Three Indian programmes mark the maturing of this stack: Akashteer's AI-driven target recognition during Operation Sindoor in May 2025, the Tonbo Imaging Vault counter-UAV export to Armenia in January 2025, and the DRDO IRDE airborne EO/IR imaging system. Four pillars define the stack: detection, tracking, recognition, and segmentation.

What computer vision in drones actually means

Computer vision in drones lets a UAV extract meaning from a visual scene. The drone detects objects, tracks them across frames, recognises what they are, and segments the image into actionable regions. This is the perception layer of autonomous flight.

The visual scene comes from one or more cameras. The mix can include RGB, thermal, multispectral, and stereo cameras, supported by LiDAR and radar. Computer vision turns those pixel streams into structured outputs: a bounding box around a vehicle, a track ID across frames, a class label such as "tank" or "truck," a depth estimate. This perception layer sits within the broader autonomy stack of modern UAVs, alongside planning, control, and communications.

Four pillars drive the rest of this analysis. Detection finds objects. Tracking follows them. Recognition classifies them. Segmentation maps the scene. Each pillar maps to a distinct neural-network family, a distinct compute budget, and a distinct use case.

The Indian benchmark is Akashteer. During Operation Sindoor, the Bharat Electronics-developed system demonstrated AI-driven recognition of hostile aerial targets at the corps level (Ministry of Defence press note, May 2025). It integrated data from radars, drones, and ground sensors into a unified air-defence picture (Bharat Electronics IACCS material, 2024). The same discipline is now being pushed into individual unmanned platforms feeding into swarm intelligence and decentralised drone coordination.

Object detection - how a drone identifies what it sees

Drone object detection runs a neural network on the live camera feed. The network draws bounding boxes around every instance of a target class. The dominant real-time architecture is YOLO (You Only Look Once), now in its YOLO26 variant released September 2025. It is fast enough to run on an onboard Jetson compute payload (Ultralytics YOLO26 release notes, September 2025).

YOLO's design choice is the single forward pass. One network, one inference, every bounding box produced at once. The YOLO26 release is optimised for edge and low-power devices through streamlined regression. Faster R-CNN remains the accuracy-priority option for offline pipelines. RT-DETR and the transformer family deliver higher mAP at the cost of latency. LDDm-YOLO and RTUAV-YOLO target small aerial objects specifically.

RTUAV-YOLO achieves a 3.4 percent improvement on COCO mAP50 against the YOLOv11 baseline. It also reduces parameters by 65.3 percent for onboard deployment (RTUAV-YOLO paper, MDPI Sensors, 25 October 2025). The VisDrone benchmark dataset is the standard against which Indian and global aerial detection models are measured. It covers image detection, video detection, single-object tracking, and multi-object tracking.

Indigenous EO/IR payloads now embed edge AI processing with real-time recognition algorithms. The target applications include anti-tank guided munitions, guided bombs, and kamikaze drones. This places real-time object detection inside the gimbal itself across the Indian military drone fleet. For procurement, this means the perception-payload spec sheet should require Jetson-class compute and the YOLO26 or RT-DETR generation as the detection baseline.

Object tracking - keeping identity across frames

Detection alone is not enough. The drone needs to know that the bounding box in frame 1 and the bounding box in frame 30 refer to the same vehicle. Multi-object tracking algorithms solve this problem.

Two families dominate. ByteTrack uses a two-stage data association strategy and the Kalman filter for motion prediction. It retains low-confidence detections instead of discarding them, preserving track continuity through partial occlusion (ByteTrack paper, ECCV 2022). DeepSORT extends the original SORT algorithm with a deep-learning appearance descriptor and a Mahalanobis-distance matching cost. It re-identifies objects after they leave and re-enter the frame (DeepSORT paper, ICIP 2017).
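The two-stage association idea can be sketched in a few lines. This is a minimal illustration, not the ByteTrack reference implementation: it uses greedy IoU matching in place of the Hungarian assignment, takes track boxes as already predicted (a real tracker would supply Kalman-filter predictions), and omits the creation of new tracks. The function names and the `high`, `low`, and `min_iou` thresholds are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: Dict[int, Box], dets: List[Box], min_iou: float) -> Dict[int, int]:
    """Pair track IDs with detection indices, best IoU first (greedy, not Hungarian)."""
    pairs = sorted(
        ((iou(box, det), tid, j) for tid, box in tracks.items() for j, det in enumerate(dets)),
        reverse=True,
    )
    matched: Dict[int, int] = {}
    used = set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if tid not in matched and j not in used:
            matched[tid] = j
            used.add(j)
    return matched


def associate(
    tracks: Dict[int, Box],
    detections: List[Tuple[Box, float]],
    high: float = 0.6,     # assumed confidence threshold for stage one
    low: float = 0.1,      # assumed floor below which detections are dropped
    min_iou: float = 0.3,  # assumed minimum overlap for a valid match
) -> Dict[int, Box]:
    """Stage one matches high-confidence detections; stage two gives the
    still-unmatched tracks a second chance against low-confidence detections."""
    high_dets = [box for box, score in detections if score >= high]
    low_dets = [box for box, score in detections if low <= score < high]

    first = greedy_match(tracks, high_dets, min_iou)
    updated = {tid: high_dets[j] for tid, j in first.items()}

    leftover = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(leftover, low_dets, min_iou)
    updated.update({tid: low_dets[j] for tid, j in second.items()})
    return updated
```

With two tracks and one detection at confidence 0.9 plus one at 0.2 (say, a partially occluded vehicle), both tracks are kept. A tracker that discarded everything below the high threshold would lose the second one.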

ByteTrack achieves 76.6 MOTA on standard multi-object-tracking benchmarks. SORT scores 74.6 and DeepSORT scores 75.4 (Frontiers vehicle ReID via DeepSORT, 31 July 2024). The choice is operational. ByteTrack wins on small, fast, partially occluded targets. DeepSORT wins where targets pass behind clutter and reappear.
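For readers comparing these scores, MOTA (Multiple Object Tracking Accuracy) is conventionally defined as one minus the ratio of total errors (missed objects, false positives, and identity switches) to the number of ground-truth objects, summed over all frames. A minimal sketch follows. The function name is hypothetical and the counts in the usage note are invented for illustration, not taken from any benchmark.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, ground_truth_objects: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over every frame."""
    if ground_truth_objects <= 0:
        raise ValueError("ground_truth_objects must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / ground_truth_objects
```

For example, 120 misses, 60 false positives, and 20 identity switches against 1,000 ground-truth objects give a MOTA of 0.8, which would be reported as 80.0. Because all three error types are summed, a tracker can trade identity switches for misses without changing the score. That is one reason to evaluate the tracker separately against the operational use case.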

Where this matters operationally splits three ways. Counter-UAV systems need to maintain track on a small fast-moving target through clutter. Loitering munitions need to track a designated target through smoke and dust. Survey drones need to count moving vehicles across a transect without double-counting. In the first two cases, this is the OODA loop in autonomous combat drones compressed to machine timescales. For procurement, the tracking module is a separate evaluation line item. A bid that names YOLO without naming the tracker is incomplete.

Object recognition and classification - what is that object

Once an object is detected and tracked, the drone has to classify it. The classes range from a tank, a truck, or a soldier to a power-line pylon, a crop disease pattern, or a structural crack. Recognition is performed by a convolutional neural network or a vision transformer trained on a labelled dataset.

The labelling pipeline is the unglamorous foundation. Bounding-box annotation produces detection training data. Pixel-level annotation produces semantic segmentation data. Instance segmentation labels separate each occurrence of the same class. This lets the model count individual objects rather than just identify presence. The quality of the labelled dataset determines the ceiling on every downstream metric.

The Indian gap is the dataset gap. Commercial computer vision models are trained on Western datasets that do not generalise to Indian terrain, vehicle types, or crop varieties. A model trained on COCO and ImageNet learns to recognise a North American pickup truck reliably. It misclassifies an Indian utility vehicle as a generic "vehicle." Target classification confidence drops by 10 to 15 percent in Indian theatres. The gap closes only when domestic training data does.

The response is structural. The Indian Army's Special Task Force on AI under the Directorate General of Information Systems is building India-specific training datasets (DGIS Special Task Force note, July 2025). The Defence AI Project Agency coordinates indigenous research in AI-based surveillance and threat detection. It covers the categories of unmanned aerial systems Indian forces operate (DAIPA establishment note, Ministry of Defence). Until DAIPA's training-data effort matures, the Indian Army's AI target recognition accuracy in domestic theatres will lag Western benchmarks by 10 to 15 percent.

Semantic segmentation and depth estimation - the supporting layer

Semantic segmentation assigns a class label to every pixel in the frame. It is useful for mapping, agricultural analysis where the vegetation index is computed per pixel, and free-space estimation for autonomous navigation. Instance segmentation separates individual objects of the same class. The output is a labelled scene map that the path planner can route through.
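As an illustration of a per-pixel vegetation index, the sketch below computes NDVI, the widely used normalised difference of near-infrared and red reflectance. The text does not say which index is meant, so NDVI is an assumption here. The function names and the plain nested-list image representation are also chosen only to keep the example dependency-free.

```python
from typing import List


def ndvi(nir: float, red: float) -> float:
    """NDVI = (NIR - Red) / (NIR + Red); returns 0.0 where both bands are zero."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def ndvi_map(nir_band: List[List[float]], red_band: List[List[float]]) -> List[List[float]]:
    """Apply NDVI pixel by pixel to two equally sized reflectance bands."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]
```

Values near 1 indicate dense healthy vegetation and values near 0 indicate bare soil or built surfaces. A segmentation mask can then restrict the statistics to the pixels labelled as crop.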

Depth estimation infers distance from monocular or stereo imagery. It is useful for obstacle avoidance, landing-zone selection, and target geolocation. Monocular depth uses a single camera and learned depth priors. Stereo depth uses two cameras and triangulation. LiDAR adds an active sensing channel that works in low light. Edge AI computer vision architectures on drones fuse these three modalities into a single depth field. This sensor fusion in the perception layer separates research demonstrators from deployable systems.
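Stereo triangulation reduces to one relation for a rectified camera pair: depth equals focal length times baseline divided by disparity. The sketch below assumes an idealised, calibrated, rectified rig. The function name and the example numbers in the usage note are illustrative only.

```python
def stereo_depth_m(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth Z = f * B / d for a rectified stereo pair.

    focal_px:     focal length in pixels
    baseline_m:   distance between the two camera centres in metres
    disparity_px: horizontal pixel shift of the same point between the two images
    """
    if disparity_px <= 0:
        return float("inf")  # no measurable shift: the point is beyond stereo range
    return focal_px * baseline_m / disparity_px
```

With an 800-pixel focal length and a 12-centimetre baseline, a 16-pixel disparity corresponds to about 6 metres. Halving the disparity doubles the depth, so range resolution degrades quickly with distance. That is one reason small-baseline stereo on a drone is typically paired with LiDAR or learned monocular priors.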

The two tasks feed autonomous flight directly. Segmentation gives the path planner a drivable corridor. Depth estimation gives the obstacle-avoidance controller a collision envelope. Without both, autonomous landing in unprepared terrain is not certifiable.

The Indian application is autonomous landing without ground radar. DRDO's SWiFT prototype demonstrated autonomous landing on surveyed coordinates in December 2023. It built on the Autonomous Flying Wing Technology Demonstrator's first flight in July 2022 (DRDO ADE press note, July 2022; DRDO SWiFT trial release, December 2023). Both rely on semantic segmentation of onboard imagery and on depth perception to identify the touchdown zone and execute the flare without external infrastructure. For autonomous-landing deliverables under the sovereign swarm RFI, segmentation-and-depth perception is the certifiable capability inside computer vision in drones, not a research add-on.

Onboard inference - running computer vision on the drone

Cloud-dependent computer vision fails in contested environments where data links are jammed or cut. Onboard AI inference architectures carry the compute stack on the platform itself. The typical compute module is an NVIDIA Jetson Orin NX or Orin Nano running optimised neural networks for detection, tracking, and recognition (NVIDIA Jetson Orin product brief, 2025).

The engineering tradeoffs are unforgiving. TOPS budget against thermal envelope. Weight against battery draw. Frame rate against model size. A YOLO26 model at 60 frames per second on a Jetson Orin NX draws 15 to 25 watts and adds 50 grams. The same model at 30 frames per second on an Orin Nano draws 7 to 15 watts. Onboard inference on Jetson Orin-class hardware is now the procurement assumption rather than a research feature.
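One way to compare the two configurations is energy per processed frame, which is simply power divided by frame rate. The sketch below uses only the wattage and frame-rate figures quoted above. It attributes the whole power draw to inference, which is a simplifying assumption. The function and dictionary names are hypothetical.

```python
from typing import Dict, Tuple


def joules_per_frame(power_w: float, fps: float) -> float:
    """Energy spent per frame: watts are joules per second, so divide by frames per second."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return power_w / fps


# (min watts, max watts, frames per second), taken from the figures quoted in the text
CONFIGS: Dict[str, Tuple[float, float, float]] = {
    "Orin NX @ 60 fps": (15.0, 25.0, 60.0),
    "Orin Nano @ 30 fps": (7.0, 15.0, 30.0),
}


def energy_ranges(configs: Dict[str, Tuple[float, float, float]]) -> Dict[str, Tuple[float, float]]:
    """Map each configuration to its (low, high) energy per frame in joules."""
    return {
        name: (joules_per_frame(lo, fps), joules_per_frame(hi, fps))
        for name, (lo, hi, fps) in configs.items()
    }
```

On these figures the Orin NX works out to roughly 0.25 to 0.42 joules per frame and the Orin Nano to roughly 0.23 to 0.50. The ranges overlap, which suggests the lower-power option saves battery mainly by processing fewer frames rather than by being more efficient per frame.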

The optimisation pipeline is where projects lose time. Model pruning removes redundant weights. TensorRT compilation re-orders operations for the target hardware. FP32-to-INT8 quantisation cuts weight storage and the associated memory traffic roughly fourfold, since each value shrinks from 32 bits to 8. TensorRT delivers two-to-three-times inference throughput on Jetson hardware against an untuned baseline.
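The fourfold figure follows directly from the bit widths, and the quantisation step itself is simple to sketch. The example below shows symmetric per-tensor INT8 quantisation in plain Python. It is a conceptual illustration only and does not reproduce TensorRT's calibration procedure or API. All function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values; the error per weight is at most about scale / 2."""
    return [q * scale for q in quantized]


def weight_bytes(n_weights: int, bits: int) -> int:
    """Storage needed for n_weights values at the given bit width."""
    return n_weights * bits // 8
```

A million weights occupy 4,000,000 bytes at FP32 and 1,000,000 bytes at INT8, which is the fourfold reduction. The cost is rounding error bounded by half the scale factor, which is why quantised models are normally re-validated on representative imagery before deployment.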

India's sovereign swarm requirements published in May 2026 require offline autonomy and onboard mission-control logic for contested environments. Indigenous gimbal vendors integrate edge AI processing inside the payload itself. NewSpace Research's MOSAIC architecture delivers decentralised onboard perception across a swarm. Operators evaluating UAV bids should treat "cloud-dependent inference" as a procurement disqualifier, given the edge AI inference architectures now required by Indian forces.

EO/IR payloads - the Indian indigenous stack

Vendor content on computer vision in drones treats the camera as a generic input. Indian procurement officers cannot. The EO/IR gimbal is where the perception stack physically lives. Its specifications drive every downstream capability.

The indigenous tactical end anchors on a family of lightweight gimbals for small UAVs. The heavy variants integrate cooled MWIR, day imager, and laser rangefinder under four kilograms total weight. At the high-altitude end sits a multi-sensor gimbal class for maritime patrol aircraft and combat helicopters, accommodating up to seven EO/IR sensors simultaneously. The configuration includes HD day imager, HD thermal imager, short-wave infrared, laser designator, and haze-penetration pointers. An IRST-class system provides 360-degree sensor-fusion vision for ground platforms, identifying armoured vehicles at up to 2,000 metres.

The government anchor is the DRDO IRDE airborne EO/IR imaging system. Ruggedised at approximately 50 kilograms, it carries a thermal imager, HD TV camera, infrared camera, laser rangefinder, and laser target designator. It is designed for retrofit on aircraft, helicopters, and UAVs. The system supports auto-acquisition and auto-tracking of multiple targets in varied climatic conditions (DRDO IRDE airborne EO system release, IMR coverage 2023). The procurement path for thermal imaging and EO/IR drone payloads in India runs through IRDE for heavy platforms and indigenous vendors for tactical platforms. The component indigenisation push under PLI 2.0 is expected to deepen the supply base.

Two newer suppliers extend the stack. EON Space Labs designs indigenous EO/IR camera payloads and multisensor gimbal systems in India (EON Space Labs company profile, 2025). Raphe mPhibr's onboard AI stack has been adopted in Indian Army platforms. The supply base is broadening to a five-to-seven-vendor ecosystem inside two procurement cycles.

Computer vision in counter-drone systems and the Operation Sindoor lesson

Counter-drone systems use the same computer-vision techniques in reverse. Detection of small fast-moving aerial targets. Tracking across cluttered backgrounds. Recognition to distinguish drones from birds and balloons. Computer vision for counter-drone systems is the perception layer of the kill chain.

The Indian counter-UAS stack reads like a vision-centric inventory. The Vault AI-enabled anti-UAV system, exported to Armenia in January 2025, combines electronic-scanning radar with EO/IR imagers and AI-driven classification (IDRW Vault export note, 23 January 2025). It offers both soft-kill RF jamming and hard-kill kinetic options. Bharat Electronics' IACCS contribution to Akashteer enables AI-driven recognition of hostile aerial activity across the corps-level air-defence grid. The D-4 anti-drone platform and the SAKSHAM counter-UAS grid form the layered shield deployed across the western frontier, spanning detection, tracking, and neutralisation.

The Operation Sindoor lesson is the speed lesson. When adversaries deploy hundreds of low-cost drones per day to saturate air defence, detection-and-classification speed is the differentiator (Organiser analysis, 15 May 2026). Computer vision is the speed layer. Without onboard recognition, every drone is a human-in-the-loop classification decision and the queue overflows. With onboard recognition, the operator sees a prioritised target list rather than a raw radar return.

The hard-kill end extends the same logic. The Wavestrike Gen-3 high-power microwave directed-energy weapon, unveiled at Aero India 2025, neutralises targets that the recognition layer has already classified, using electronics-burning RF energy. It builds on the recognition-driven targeting that loitering munitions and the precision-strike kill chain already use. The May 2026 sovereign swarm RFI placed onboard AI perception at the centre of its evaluation criteria. This is the procurement signal that ties this analysis to the next two years of Indian defence drone capability.

What comes next

Three inflection points define the next 24 months for computer vision in drones. First, the YOLO26 generation will push detection accuracy on small aerial targets above current benchmarks within two procurement cycles. Second, the indigenous EO/IR payload stack across IRDE, EON, and tactical-gimbal vendors is on a path to displace imported subsystems by 2028. This runs under the Drone Shakti and PLI 2.0 framework, an indigenisation curve also visible across BVLOS operations under DGCA regulation. Third, the sovereign swarm RFI's perception criteria will set the baseline for every future Indian Army UAV procurement.