Overview

An image from Murchison Sep. 2015 with 4 Hartebeest bounding boxes: "TEP SURVEY SEP 2015 AIRPHOTOS 30 Sep Left and Right, 1 Oct Left Only/2015-10-01-2-106-0904.JPG"

This report gives the preliminary machine learning (ML) results for the MWS Phase 2 project.

The initial models for this deliverable are trained on images and ground-truth annotation data provided by Richard Lamprey and his team in Uganda. For the purposes of providing a working MVP, this deliverable focuses on producing a fast, accurate, multi-species ML pipeline that can produce image-level positive/negative scores and bounding boxes around objects of interest.

This work is preliminary and does not replicate the level and depth of research from Phase 1. The intuitions and lessons learned from Phase 1 allowed the Phase 2 model training to be more streamlined and allowed fewer options to be considered when validating the final pipeline. This is a direct benefit of having completed Phase 1: the training procedure could focus on the features, configurations, and tools that are known to work well and to produce a good, holistic ML model.

As such, the preliminary Phase 2 models do not perform any negative visual clustering prior to training, do not incorporate the (elephant-only) training datasets from Phase 1, do not predict deduplicated sequence estimates or counts or suggest any bias corrections, do not produce pixel-level foreground/background masks, do not provide bounding box cluster assignments, and do not perform any subsequent image alignment or overlap estimation. The new ML pipeline is intended to be used on individual images in isolation, which is the intended use case for the MVP deliverable. More advanced ML outputs will require new image and annotation data from the KAZA survey, survey transect metadata, flight position and attitude telemetry, and research time.

Link to the Phase 1 final report

Data

The initial ML training data provided by Richard Lamprey was collected over the course of several years of flights over several conservation areas, including Murchison Falls National Park (MFNP), Queen Elizabeth National Park (QENP), and Tsavo East National Park (TENP). The table below details the six datasets that were provided to Wild Me to train a preliminary multi-species ML pipeline. The annotated and full image datasets were downloaded from Amazon S3 storage across 11 separate volumes; the CSV annotation files were created with VGG Image Annotator (VIA) and provided to Wild Me via email.

Location | Date | Images | Annotations | Received | Full Dataset Images | Dataset Size
MFNP | Sep. 2015 | 1,204 | 10,177 * | 11/04/21 | 32,949 | 324 GB
MFNP | Dec. 2015 | 1,067 | 10,177 * | 11/04/21 | 31,696 | 423 GB
MFNP | Apr. 2016 | 1,475 | 12,334 | 08/03/22 | 20,749 | 243 GB
TENP | 2017 | 1,130 | 5,496 | 07/22/22 | 115,994 | 700 GB
QENP | 2018 | 1,208 | 7,784 | 02/14/22 | 48,818 | 295 GB
MFNP | 2019 | 2,127 | 16,361 | 02/16/22 | 60,535 | 403 GB
All Locations (deduplicated) | | 8,211 | 59,754 | 08/03/22 | 319,148 | 2.39 TB
* The MFNP September and December 2015 datasets were annotated together and delivered as a single CSV download containing a total of 20,355 annotations for both.

The data for ML training was provided to Wild Me as a collection of three parts:

  1. The positive images
  2. The positive annotations in CSV
  3. The full dataset
The full datasets include all of the images that were taken for each survey, including images "on the survey line" (OSL), the end-of-transect turns, and images taken in transit to the survey area. In each full image dataset all animals have been counted by Richard's team, but only about 50% of the counted animals have been annotated with bounding boxes. The images designated for bounding box annotation were extracted from their respective full datasets and randomly assigned to a team of 5 data annotators in Uganda. After a round of data annotation and review using the VGG Image Annotator tool, the CSV files were exported for all positive images and provided to Wild Me. The raw counts of animals for each image in the full dataset were not provided, meaning that the ML cannot directly assume that a random image in the full dataset is a true negative across the entire image. To compensate for this, only the images that contain annotations were used for training the initial Phase 2 ML pipeline.
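For reference, below is a minimal sketch of how the VIA CSV exports can be grouped into per-image bounding boxes. It assumes the standard VIA export columns (filename, region_shape_attributes, region_attributes) and a "Species" attribute key; the file name and attribute key are illustrative and may differ from the delivered files.

    import csv
    import json
    from collections import defaultdict

    def load_via_boxes(csv_path, label_key="Species"):
        """Collect rectangular bounding boxes per image from a VIA CSV export.

        Assumes the standard VIA columns ("filename", "region_shape_attributes",
        "region_attributes"); the attribute key holding the species label
        ("Species" here) is a guess and may differ in the delivered files.
        """
        boxes = defaultdict(list)
        with open(csv_path, newline="") as handle:
            for row in csv.DictReader(handle):
                shape = json.loads(row["region_shape_attributes"] or "{}")
                attrs = json.loads(row["region_attributes"] or "{}")
                if shape.get("name") != "rect":
                    continue  # only rectangular annotations are used for training
                boxes[row["filename"]].append({
                    "x": shape["x"],
                    "y": shape["y"],
                    "w": shape["width"],
                    "h": shape["height"],
                    "label": attrs.get(label_key, "unknown"),
                })
        return boxes

    # Only images that appear in the CSV (i.e., have at least one box) are kept
    # for training; the remaining images in the full dataset are set aside.
    annotated = load_via_boxes("murchison-sep-2015-annotations.csv")  # hypothetical file name
    print(f"{len(annotated)} annotated images")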

This restriction means that the full 2.4 TB of images (319,148 total) provided to Wild Me must be substantially reduced for ML training to only the images with annotations. The need for reliable negative ground-truth means that only 8,211 images are used for training. This filtering may seem extreme, but the incidence rate of animals within an image is still extremely low, and the low pixel density of annotations within an image means that a large number of negative regions can be sampled from each positive example. There are 59,754 annotations for these 8,211 images, an average of only 7.3 boxes per image. While we would prefer to include more completely negative images in the ML training dataset, the practical tradeoff between training time and accuracy suggests that the benefit would be marginal at best. This assumption is reasonable because animals are captured uniformly and incidentally by the aerial sensor, and the underlying sighting distributions are strongly biased towards areas where animals more commonly aggregate.

Below is a breakdown of all 38 annotation labels and their respective numbers of bounding boxes within the combined and deduplicated dataset:

Species | Total
Buffalo | 8,773
Camel | 92
Canoe | 332
Car | 495
Cow | 1,889
Crocodile | 61
Dead animal/White bones | 81
Dead-Bones | 2
Eland | 284
Ele.Carcass Old | 177
Elephant | 2,837
Gazelle_Gr | 170
Gazelle_Th | 92
Gerenuk | 130
Giant Forest Hog | 40
Giraffe | 785
Goat | 2,184
Hartebeest | 2,950
Hippo | 1,240
Impala | 366
Kob | 29,282
Kudu | 68
Motorcycle | 69
Oribi | 2,848
Oryx | 568
Ostrich | 92
Roof Grass | 394
Roof Mabati | 251
Sheep | 317
Test | 1,122
Topi | 225
Vehicle | 15
Warthog | 3,724
Waterbuck | 3,285
White bones | 139
White_Bones | 573
Wildebeest | 4
Zebra | 1,610
All Species | 59,754
The labels highlighted in red are those with fewer than 100 annotations in the entire dataset. The class "Test" is also highlighted.
Example annotations loaded into the WBIA software suite from the above dataset.

Lastly, when training the bounding box localizer, the following species labels were renamed and merged:

Provided Label | ML Label
Gazelle_Gr | Grants Gazelle
Gazelle_Th | Thomsons Gazelle
Dead animal/White bones | White Bones
Dead-Bones | White Bones
Ele.Carcass Old | White Bones
White bones | White Bones
White_Bones | White Bones
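A minimal sketch of this renaming/merging step, implemented as a simple lookup table applied to each provided label before localizer training; any label not listed passes through unchanged.

    # Mapping from the labels delivered in the VIA CSV files to the labels used
    # when training the localizer, following the table above.
    LABEL_MAP = {
        "Gazelle_Gr": "Grants Gazelle",
        "Gazelle_Th": "Thomsons Gazelle",
        "Dead animal/White bones": "White Bones",
        "Dead-Bones": "White Bones",
        "Ele.Carcass Old": "White Bones",
        "White bones": "White Bones",
        "White_Bones": "White Bones",
    }

    def normalize_label(label):
        """Return the ML training label for a provided annotation label."""
        return LABEL_MAP.get(label, label)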

Results

Tile Grid
An example grid extraction visualization from the Phase 1 report. The tiles from grid 1 are colored in orange, the tiles from grid 2 are blue, and the border tiles are colored black.
The extracted tiles from the above example image, with each tile's weighted positive area percentage in the top left.

For the MVP model, the grid 2 extraction was turned off for speed, and the resulting grid 1 tiles were further subsampled to keep only 10% of the negative tiles in each image. These tiles formed the global negative set that was mined with the iterative boosting strategy.
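The sketch below illustrates the general idea, assuming 256x256 tiles, a single non-overlapping grid, and a 10% keep rate for negative tiles; the actual grid offsets, border-tile handling, and sampling logic live in the WBIA code and are not reproduced here.

    import random

    TILE = 256            # tile size used by the pipeline (256 x 256 pixels)
    NEG_KEEP_RATE = 0.10  # keep only 10% of the negative tiles per image

    def grid1_tiles(width, height, tile=TILE):
        """Yield (x, y, w, h) windows for a simple non-overlapping grid.

        The real pipeline uses two offset grids plus border tiles; this sketch
        only reproduces a single grid for illustration.
        """
        for y in range(0, height - tile + 1, tile):
            for x in range(0, width - tile + 1, tile):
                yield (x, y, tile, tile)

    def overlaps(tile_xywh, box_xywh):
        """True if a bounding box overlaps the tile at all."""
        tx, ty, tw, th = tile_xywh
        bx, by, bw, bh = box_xywh
        return not (bx + bw <= tx or tx + tw <= bx or by + bh <= ty or ty + th <= by)

    def sample_tiles(width, height, boxes, rng=None):
        """Split an image into positive tiles and a 10% subsample of negatives."""
        rng = rng or random.Random(42)
        positives, negatives = [], []
        for tile in grid1_tiles(width, height):
            if any(overlaps(tile, box) for box in boxes):
                positives.append(tile)
            elif rng.random() < NEG_KEEP_RATE:
                negatives.append(tile)
        return positives, negatives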

The final WBIA database, including the number of images, annotations, and (subsampled) tiles.
Model Training
Partition | Total | Positive | Negative | Train | Test
Images | 8,211 | 6,662 | 1,549 | 6,560 | 1,651
Tiles | 685,757 | 42,054 * | 643,703 | 547,420 | 138,337
WIC Boost 0 Tiles | 66,540 | 33,270 | 33,270 | 53,232 (train) | 13,308 (val)
WIC Boost 1 Tiles | 92,207 | 33,270 | 58,937 | 73,765 (train) | 18,442 (val)
WIC Boost 2 Tiles | 117,734 | 33,270 | 84,194 | 94,187 (train) | 23,547 (val)
LOC Tiles | 53,072 | 33,270 | 19,802 | 39,960 (train) | 13,112 (val)
A breakdown of the number of images and tiles available in the dataset and used for training the WIC and LOC models. When a model's trainval dataset is sampled, an 80/20 split between training and validation is applied automatically.
* The number of positive tiles counts any tile that has any portion of an annotation within it. For training, we further restrict these tiles to require that a weighted percentage of the tile's pixel area (at least 2.5%) is covered by an annotation. This reduces the effective number of positive tiles from 42,054 to 33,270.
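A rough sketch of this positive-tile criterion is shown below. The pipeline uses a weighted pixel-area percentage; the exact weighting is not described here, so plain (unweighted) annotation coverage stands in for it.

    MIN_COVERAGE = 0.025  # at least 2.5% of the tile area must be covered

    def intersection_area(tile_xywh, box_xywh):
        """Pixel area of the overlap between a tile and a bounding box."""
        tx, ty, tw, th = tile_xywh
        bx, by, bw, bh = box_xywh
        iw = min(tx + tw, bx + bw) - max(tx, bx)
        ih = min(ty + th, by + bh) - max(ty, by)
        return max(iw, 0) * max(ih, 0)

    def is_positive_tile(tile_xywh, boxes, min_coverage=MIN_COVERAGE):
        """Approximate the training criterion: enough annotated area in the tile.

        The pipeline uses a *weighted* area percentage; the weighting is not
        reproduced here, so raw coverage (which may double count overlapping
        boxes) stands in for it.
        """
        _, _, tw, th = tile_xywh
        covered = sum(intersection_area(tile_xywh, box) for box in boxes)
        return covered / float(tw * th) >= min_coverage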

Below is a list of model training improvements made for MVP:

A plot from "Benchmark Analysis of Representative Deep Neural Network Architectures" by Bianco et al. on the real-time inference performance of various CNN backbones. We can see that DenseNet-201 (Phase 1 backbone) is slightly more accurate than ResNet-50 (MVP backbone), but it is substantially slower. To maximize speed while obtaining a similar level of accuracy, the MVP model uses ResNet-50 as the backbone of the WIC.
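As an illustration of the backbone swap only (not the actual WIC training code), a ResNet-50 classifier with a two-class head can be built with torchvision as sketched below; the 256x256 input matches the tile size used by the pipeline, and the weights enum requires a recent torchvision release.

    import torch
    import torchvision

    def build_wic_backbone(num_outputs=2, pretrained=True):
        """A ResNet-50 classifier head as a stand-in for the MVP WIC backbone.

        The actual WIC training configuration (augmentation, loss, optimizer)
        lives in the WBIA code; this only shows the backbone swap from
        DenseNet-201 to ResNet-50.
        """
        weights = torchvision.models.ResNet50_Weights.DEFAULT if pretrained else None
        model = torchvision.models.resnet50(weights=weights)
        model.fc = torch.nn.Linear(model.fc.in_features, num_outputs)
        return model

    wic = build_wic_backbone()
    wic.eval()
    with torch.no_grad():
        scores = torch.softmax(wic(torch.randn(1, 3, 256, 256)), dim=1)  # tile-sized input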
Whole Image Classifier (WIC)
WIC performance plots prior to any ground-truth cleaning.
WIC performance plots after ground-truth cleaning. Ground-truth labels were converted from positive to negative if the final WIC model predicted a confidence score below the average negative tile score (~2.3%) and the covered area of any ground-truth bounding boxes was less than 5%. Negative ground-truth labels were converted to positive if the covered area was greater than 0% or the final WIC model predicted a confidence score above the average positive tile score (~86%).
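These cleaning rules can be summarized as a small decision function, sketched below with the approximate score thresholds quoted above; the variable names and the exact coverage computation are illustrative.

    AVG_NEG_SCORE = 0.023  # average negative tile score from the final WIC (~2.3%)
    AVG_POS_SCORE = 0.86   # average positive tile score (~86%)

    def clean_tile_label(label, wic_score, coverage):
        """Apply the ground-truth cleaning rules described above.

        label     -- 'positive' or 'negative' ground-truth for the tile
        wic_score -- final WIC confidence for the tile (0..1)
        coverage  -- fraction of the tile covered by ground-truth boxes (0..1)
        """
        if label == "positive" and wic_score < AVG_NEG_SCORE and coverage < 0.05:
            return "negative"
        if label == "negative" and (coverage > 0.0 or wic_score > AVG_POS_SCORE):
            return "positive"
        return label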
The confusion matrix for the final WIC boosting round model, which obtains a classification accuracy of 97.57% and an Operating Point (OP) of 7%.
Localizer (LOC)

The localizer is run as a secondary pipeline component after the WIC. As such, we can focus the LOC performance plots on its recall ability. Because this is the first model we have trained to support multiple species, we make a simplifying assumption: a localization is considered correct if it gets the bounding box correct (IoU=20%), regardless of whether it labeled the species correctly. Preliminary results suggest that the model's localization performance drops by ~8% if we require matched bounding box predictions to also have the same species as the ground-truth. The full breakdown of the LOC model by species is still pending.
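A minimal sketch of this species-agnostic matching rule, using a standard intersection-over-union (IoU) computation on (x, y, w, h) boxes; the prediction/ground-truth dictionaries and their "bbox" keys are illustrative.

    def iou(box_a, box_b):
        """Intersection-over-union for two (x, y, w, h) boxes."""
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        iw = min(ax + aw, bx + bw) - max(ax, bx)
        ih = min(ay + ah, by + bh) - max(ay, by)
        inter = max(iw, 0) * max(ih, 0)
        union = aw * ah + bw * bh - inter
        return inter / union if union else 0.0

    def is_match(prediction, ground_truth, iou_threshold=0.20):
        """Species-agnostic matching: only the box location must be correct."""
        return iou(prediction["bbox"], ground_truth["bbox"]) >= iou_threshold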

Furthermore, because the LOC is run on tiles that overlap across two grid strategies, we can omit any failed detections on the margins of each tile. The assumption is that a neighboring tile has that margin within its center pixels, so our evaluation focuses on the center of each tile (margin=32, tile=256x256) and suggests operating points optimized for this use case. Preliminary results indicate that the LOC's performance improves by 14.2% if we ignore annotations that are missed along the 1/4 margin of each tile. The aggregation code (evaluation also pending) is responsible for performing non-maximum suppression (NMS) across and between tiles and for aggregating the final detections at the image level. Below are the tile-specific results and suggested performance numbers for MVP.
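The sketch below shows the two evaluation details described above: discarding detections whose centers fall in the 32-pixel tile margin, and a basic greedy NMS primitive. The real aggregation code operates across and between tiles at the image level; this is only the per-tile building block, and it repeats the iou() helper from the previous sketch for completeness. Box coordinates are assumed to be tile-local.

    TILE = 256
    MARGIN = 32  # detections in the outer 32-pixel margin are handled by a neighboring tile

    def iou(box_a, box_b):
        """Intersection-over-union for two (x, y, w, h) boxes (same as above)."""
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        iw = min(ax + aw, bx + bw) - max(ax, bx)
        ih = min(ay + ah, by + bh) - max(ay, by)
        inter = max(iw, 0) * max(ih, 0)
        union = aw * ah + bw * bh - inter
        return inter / union if union else 0.0

    def in_tile_center(box_xywh, tile=TILE, margin=MARGIN):
        """True if the box center falls inside the tile's central region."""
        x, y, w, h = box_xywh
        cx, cy = x + w / 2.0, y + h / 2.0
        return margin <= cx <= tile - margin and margin <= cy <= tile - margin

    def nms(detections, iou_threshold=0.60):
        """Greedy non-maximum suppression over {'bbox': ..., 'score': ...} detections."""
        kept = []
        for det in sorted(detections, key=lambda d: d["score"], reverse=True):
            if all(iou(det["bbox"], k["bbox"]) < iou_threshold for k in kept):
                kept.append(det)
        return kept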

The Precision-Recall curves for the LOC model. These plots are generated for different non-maximum suppression (NMS) thresholds and aggregate the generalized performance across all species.
The confusion matrix for the best performing NMS threshold. The MVP LOC model is 75.53% accurate and obtains an Average Precision (AP) of 92% across all species. The recommended NMS threshold is 60% and the Operating Point (OP) is 38%.
An example prediction using the MVP models.
Open Source on WBIA
The MWS Phase 1 code was merged into the main branch of WBIA on GitHub and is now available as part of that open source project. The models that were trained during Phase 1 have been uploaded to Wild Me's public CDN, and the full list of APIs is available by adding a single command line flag to the standard WBIA Docker image.
ScoutBot
The MWS Phase 1 and MVP models have been converted to ONNX (Open Neural Network eXchange) model files and have been deployed as a stand-alone Python package called "ScoutBot". The package is open source, has CI/CD on GitHub, has automated Docker and PyPI deployments, has documentation on ReadTheDocs, offers a Python API to interface with the Scout ML, offers a CLI executable to run the Scout ML from the command line, and will download models from an external CDN as needed for inference.
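ScoutBot's own Python API and CLI are documented on ReadTheDocs; as a generic illustration of how an exported ONNX model can be run, the snippet below uses onnxruntime directly. The model filename, input shape, and execution provider are placeholders, not ScoutBot's actual interface.

    import numpy as np
    import onnxruntime as ort

    # Placeholder model path; ScoutBot downloads the real ONNX files from
    # Wild Me's CDN, and their names and input shapes may differ.
    session = ort.InferenceSession("scout_wic.onnx", providers=["CPUExecutionProvider"])

    input_name = session.get_inputs()[0].name
    tile = np.random.rand(1, 3, 256, 256).astype(np.float32)  # stand-in for a preprocessed tile
    outputs = session.run(None, {input_name: tile})
    print(outputs[0])  # e.g., per-class confidence scores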
Docker
ScoutBot has been uploaded to Docker Hub as an image that launches a pre-built demo. The Dockerfile shows how to install and set up the GPU dependencies for ScoutBot's accelerated inference. Furthermore, the Docker image has the ML models pre-downloaded and baked in, so inference can happen offline.

Known Issues

A ground-truth image with bounding boxes drawn directly into the image for each object, incorrectly altering the original pixel data of the image.