Overview

This report gives the preliminary machine learning (ML) results for the MWS Phase 2 project.
The initial models for this deliverable are trained on images and ground-truth annotation data provided by Richard Lamprey and his team in Uganda. For the purposes of providing a working MVP, this deliverable focuses on producing a fast, accurate, multi-species ML pipeline that can produce image-level positive/negative scores and bounding boxes around objects of interest.
This work is preliminary and does not replicate the level and depth of research from Phase 1. The intuitions and lessons learned from Phase 1 allowed this Phase 2 model training to be more streamlined, with fewer options to consider when validating the final pipeline. This is a direct benefit of having completed Phase 1, since the training procedure could focus on the features, configurations, and tools that are known to work well and produce a good, holistic ML model.
As such, the preliminary Phase 2 models do not perform any negative visual clustering prior to training, do not incorporate the (elephant-only) training datasets from Phase 1, do not predict deduplicated sequence estimates or counts or suggest any bias corrections, do not produce pixel-level foreground/background masks, do not provide bounding box cluster assignments, and do not perform any subsequent image alignment or overlap estimation. This new ML pipeline is intended to be used on individual images in isolation, which is the intended use case for the MVP deliverable. More advanced ML outputs will require new image and annotation data from the KAZA survey, survey transect metadata, flight position and attitude telemetry, and research time.
Link to the Phase 1 final report
Data
The initial ML training data provided by Richard Lamprey was collected over the course of several years by flying over several conservation areas. These areas include the Murchison Falls National Park (MFNP), the Queen Elizabeth National Park (QENP), and the Tsavo East National Park (TENP). The table below details six datasets that have been provided to Wild Me to train a preliminary multi-species ML pipeline. The annotated and full image datasets were downloaded from Amazon S3 storage across 11 separate volumes; the CSV annotation files were created with VGG Image Annotator (VIA) and provided to Wild Me via email.
Location | Date | Annotated Images | Annotations | Received | Total Images | Size |
---|---|---|---|---|---|---|
MFNP | Sep. 2015 | 1,204 | 10,177 * | 11/04/21 | 32,949 | 324 GB |
MFNP | Dec. 2015 | 1,067 | 10,177 * | 11/04/21 | 31,696 | 423 GB |
MFNP | Apr. 2016 | 1,475 | 12,334 | 08/03/22 | 20,749 | 243 GB |
TENP | 2017 | 1,130 | 5,496 | 07/22/22 | 115,994 | 700 GB |
QENP | 2018 | 1,208 | 7,784 | 02/14/22 | 48,818 | 295 GB |
MFNP | 2019 | 2,127 | 16,361 | 02/16/22 | 60,535 | 403 GB |
All Locations | 2015–2019 | 8,211 (deduplicated) | 59,754 (deduplicated) | 08/03/22 | 319,148 (deduplicated) | 2.39 TB |
The data for ML training was provided to Wild Me as a collection of three parts:
- The positive images
- The positive annotations in CSV
- The full dataset
Because only the positive images are annotated, the full 2.4 TB of imagery (319,148 images) provided to Wild Me must be substantially reduced for ML training to just the annotated images; without annotations there is no reliable negative ground-truth, so only 8,211 images are used for training. This filtering may seem extreme, but the incidence rate of animals within an image is still extremely low, and the low pixel density of annotations within an image means that a large number of negative regions can be sampled from each positive example. There are 59,754 annotations for these 8,211 images, an average of only 7.3 boxes per image. While we would prefer to include more completely negative images in the ML training dataset, the practical tradeoffs between training time and accuracy suggest that the benefit would be marginal at best. This assumption is reasonable because animals are captured uniformly and incidentally by the aerial sensor and the underlying sighting distributions are strongly biased towards areas where animals commonly aggregate; as a result, the positive images already contain large amounts of representative background from which negatives can be sampled.
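To make this reduction concrete, below is a minimal sketch of separating the annotated images from the full image dump using the VIA CSV exports. The file paths, the `.jpg` extension, and the reliance on the standard VIA `filename` and `region_count` columns are assumptions for illustration, not the pipeline's actual code.

```python
import csv
from pathlib import Path

# Hypothetical locations; the actual CSV files and image volumes differ per dataset.
VIA_CSV = Path("annotations/mfnp_2016_via.csv")
IMAGE_ROOT = Path("images/")

# Collect the set of image filenames that have at least one annotation.
# A standard VIA CSV export has one row per region, with a `filename` column
# and a `region_count` column; rows with region_count == 0 are unannotated.
annotated = set()
with VIA_CSV.open(newline="") as fh:
    for row in csv.DictReader(fh):
        if int(row.get("region_count", 0)) > 0:
            annotated.add(row["filename"])

# Reduce the full image dump to only the annotated (positive) images.
positives = [p for p in IMAGE_ROOT.rglob("*.jpg") if p.name in annotated]
print(f"{len(positives)} annotated images kept for training")
```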
Below is a breakdown of all 38 annotation labels and their respective number of bounding boxes within the combined, deduplicated dataset:
Species | Total |
---|---|
Buffalo | 8,773 |
Camel | 92 |
Canoe | 332 |
Car | 495 |
Cow | 1,889 |
Crocodile | 61 |
Dead animal/White bones | 81 |
Dead-Bones | 2 |
Eland | 284 |
Ele.Carcass Old | 177 |
Elephant | 2,837 |
Gazelle_Gr | 170 |
Gazelle_Th | 92 |
Gerenuk | 130 |
Giant Forest Hog | 40 |
Giraffe | 785 |
Goat | 2,184 |
Hartebeest | 2,950 |
Hippo | 1,240 |
Impala | 366 |
Kob | 29,282 |
Kudu | 68 |
Motorcycle | 69 |
Oribi | 2,848 |
Oryx | 568 |
Ostrich | 92 |
Roof Grass | 394 |
Roof Mabati | 251 |
Sheep | 317 |
Test | 1,122 |
Topi | 225 |
Vehicle | 15 |
Warthog | 3,724 |
Waterbuck | 3,285 |
White bones | 139 |
White_Bones | 573 |
Wildebeest | 4 |
Zebra | 1,610 |
All Species | 59,754 |

Lastly, when training the bounding box localizer, the following labels were renamed and merged:
Provided Label | ML Label |
---|---|
Gazelle_Gr | Grants Gazelle |
Gazelle_Th | Thomsons Gazelle |
Dead animal/White bones | White Bones |
Dead-Bones | White Bones |
Ele.Carcass Old | White Bones |
White bones | White Bones |
White_Bones | White Bones |
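For reference, this merge can be expressed as a small lookup applied when labels are read from the VIA CSVs; this is an illustrative sketch, and the function name is hypothetical.

```python
# Label normalization applied before training the localizer, taken directly
# from the mapping table above.
LABEL_MERGE = {
    "Gazelle_Gr": "Grants Gazelle",
    "Gazelle_Th": "Thomsons Gazelle",
    "Dead animal/White bones": "White Bones",
    "Dead-Bones": "White Bones",
    "Ele.Carcass Old": "White Bones",
    "White bones": "White Bones",
    "White_Bones": "White Bones",
}

def normalize_label(raw: str) -> str:
    """Map a provided VIA label onto the ML training label."""
    return LABEL_MERGE.get(raw, raw)
```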
Results
Tile Grid


For the MVP model, the grid2 extraction was turned off for speed and the resulting grid1 tiles were further subsampled to keep only 10% of the negative tiles in each image. These tiles formed the global negative set that was mined with the iterative boosting strategy.
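Below is a minimal sketch of that per-image negative subsampling, assuming each grid1 tile record carries a `has_annotation` flag; the record structure and the fixed seed are illustrative assumptions, not the pipeline's actual data model.

```python
import random

def subsample_negatives(tiles, keep_fraction=0.10, seed=0):
    """Keep every positive grid1 tile but only ~10% of the negative tiles.

    `tiles` is assumed to be a list of dicts for one image, each with a
    boolean `has_annotation` flag.
    """
    rng = random.Random(seed)
    positives = [t for t in tiles if t["has_annotation"]]
    negatives = [t for t in tiles if not t["has_annotation"]]
    kept = rng.sample(negatives, int(len(negatives) * keep_fraction))
    return positives + kept
```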






Model Training
Partition | Total | Positive | Negative | Train | Test |
---|---|---|---|---|---|
Images | 8,211 | 6,662 | 1,549 | 6,560 | 1,651 |
Tiles | 685,757 | 42,054 * | 643,703 | 547,420 | 138,337 |
WIC Boost 0 Tiles | 66,540 | 33,270 | 33,270 | 53,232 (train) | 13,308 (val) |
WIC Boost 1 Tiles | 92,207 | 33,270 | 58,937 | 73,765 (train) | 18,442 (val) |
WIC Boost 2 Tiles | 117,734 | 33,270 | 84,194 | 94,187 (train) | 23,547 (val) |
LOC Tiles | 53,072 | 33,270 | 19,802 | 39,960 (train) | 13,112 (val) |
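The WIC boost rows in the table above reflect the iterative hard-negative mining loop: each round trains on the current negative set, scores the remaining negative pool, and folds the highest-scoring (hardest) negatives into the next round. The sketch below illustrates that loop; the training and scoring callables, the per-round quota, and the use of file paths as tile identifiers are assumptions rather than the pipeline's actual interfaces.

```python
import random

def mine_hard_negatives(positive_tiles, negative_pool, train_fn, score_fn,
                        rounds=3, per_round=25_000, seed=0):
    """Iterative hard-negative ("boosting") mining sketch.

    `positive_tiles` and `negative_pool` are lists of tile file paths.
    `train_fn(positives, negatives)` returns a fitted WIC model and
    `score_fn(model, tiles)` returns one positive-class score per tile.
    """
    rng = random.Random(seed)
    # Round 0 starts balanced: as many random negatives as positives.
    negatives = rng.sample(negative_pool, len(positive_tiles))
    model = None
    for boost_round in range(rounds):
        model = train_fn(positive_tiles, negatives)
        if boost_round == rounds - 1:
            break  # no mining after the final round
        scores = score_fn(model, negative_pool)  # higher = more "animal-like"
        # The negatives the current model is most fooled by are the hardest
        # examples; fold the top-scoring ones into the next round's set.
        ranked = sorted(zip(negative_pool, scores), key=lambda ts: ts[1], reverse=True)
        hard = [tile for tile, _ in ranked[:per_round]]
        negatives = sorted(set(negatives) | set(hard))
    return model, negatives
```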
Below is a list of model training improvements made for the MVP; an illustrative configuration sketch follows the list:
- Changed from DenseNet-201 to ResNet-50 (see plot below) to speed up training and inference time
- Changed from the PyTorch SGD with Momentum optimizer to Adam (default LR)
- Added more data augmentation types to reduce over-fitting, and changed the batch sampling ratio to 1.0 between positives and negatives
- Improved training infrastructure to use all available CPU cores (40 for MVP) and added multi-GPU training and inference for both models
- Changed the WIC from an ensemble of multiple models to a single model. The performance benefit seen during Phase 1 was relatively marginal (<5%) and significantly slowed training and inference time (by a factor of 3 to 5).
- Changed the number of WIC boosting rounds from ten (Phase 1 research) to three. The hard negative boosting procedure saw diminishing returns in Phase 1 after round three, and a more substantial improvement to general ML performance was obtained by cleaning up missing ground-truth labels in the underlying training dataset.
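The improvements above translate into roughly the following PyTorch configuration sketch. The tile folder layout, batch size, augmentation parameters, and pretrained-weight choice are illustrative assumptions; only the ResNet-50 backbone, Adam at its default learning rate, the balanced positive/negative sampling, the 40 CPU workers, and multi-GPU training correspond to items in the list.

```python
import torch
import torchvision
from torchvision import transforms

# Data augmentation for 256x256 tiles (types and magnitudes are illustrative).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])

# Hypothetical tile folder with `positive/` and `negative/` subdirectories.
dataset = torchvision.datasets.ImageFolder("tiles/train", transform=augment)

# 1.0 positive:negative batch sampling ratio via inverse-frequency weights.
targets = torch.tensor(dataset.targets)
class_counts = torch.bincount(targets).float()
weights = (1.0 / class_counts)[targets]
sampler = torch.utils.data.WeightedRandomSampler(weights, num_samples=len(dataset))
loader = torch.utils.data.DataLoader(dataset, batch_size=128, sampler=sampler,
                                     num_workers=40)  # all available CPU cores

# ResNet-50 backbone with a 2-class (positive/negative) head, Adam at default LR.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model = torch.nn.DataParallel(model).cuda()  # multi-GPU training
optimizer = torch.optim.Adam(model.parameters())  # default lr of 1e-3
criterion = torch.nn.CrossEntropyLoss()  # standard training loop omitted
```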

Whole Image Classifier (WIC)



Localizer (LOC)
The localizer is run as a secondary pipeline component after the WIC. As such, we can focus the LOC performance plots on its recall ability. Because these are the first models we have trained to support multiple species, we make a simplifying assumption that a localization is considered correct if the bounding box is correct (IoU ≥ 20%), regardless of whether the species label is correct. Preliminary results suggest that the model's localization performance drops by ~8% if we require matched bounding box predictions to also have the same species label as the ground-truth. The full breakdown of the LOC model by species is still pending.
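The matching rule can be stated precisely with a small helper. The corner-coordinate box format and the function names are assumptions; only the 20% IoU threshold and the species-agnostic criterion come from the evaluation described above.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def localization_is_correct(pred_box, gt_box, iou_threshold=0.20):
    """MVP matching rule: the predicted box only needs to overlap the
    ground-truth box at IoU >= 20%; the species label is not checked."""
    return iou(pred_box, gt_box) >= iou_threshold
```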
Furthermore, because the LOC is run on tiles that overlap under two grid strategies, we can omit any failed detections on the margins of each tile. The assumption is that a neighboring tile contains that margin near its center pixels, so our evaluation focuses on the center of each tile (margin=32, tile=256x256) and suggests operating points optimized for this use case. Preliminary results indicate that the LOC's performance improves by 14.2% if we ignore annotations that are missed along the 1/4 margin of each tile. The aggregation code (evaluation also pending) is responsible for performing non-maximum suppression (NMS) across and between tiles and for aggregating the final detections at the image level; a sketch of this margin filter and cross-tile NMS follows the results below. Below are the tile-specific results and suggested performance numbers for the MVP.



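Below is the sketch of the margin filter and the cross-tile NMS referenced above. The center-point margin test, the 0.5 NMS threshold, and the function names are assumptions rather than validated configuration values; it also reuses the `iou` helper from the previous sketch.

```python
TILE_SIZE = 256
MARGIN = 32  # pixels ignored on every edge; a neighboring tile covers them

def in_tile_center(box, tile=TILE_SIZE, margin=MARGIN):
    """Keep a detection only if its center lies inside the tile's central region."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return margin <= cx <= tile - margin and margin <= cy <= tile - margin

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over image-space boxes pooled from all
    tiles; assumes the iou() helper from the previous sketch is in scope."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
```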
Open Source on WBIA

ScoutBot

Docker

Known Issues
- Small animals are a challenge for the positive/negative threshold of 2.5% pixel area coverage
- Not all training annotations were used because grid2 extraction was disabled during training
- Large objects like construction vehicles and buildings are larger than the tile size
- The detected species labels are sometimes confused with other, visually similar species
- The image-level aggregation configuration has not been fully validated against held-out test images
- At least one ground-truth image has bounding boxes burned into its pixel data, as seen below:
