Overview

This report gives the preliminary machine learning (ML) results for the MWS Phase 2 project.
The initial models for this deliverable are trained on images and ground-truth annotation data provided by Richard Lamprey and his team in Uganda. For the purposes of providing a working MVP, this deliverable focuses on producing a fast, accurate, multi-species ML pipeline that can produce image-level positive/negative scores and bounding boxes around objects of interest.
This work is preliminary and does not replicate the level and depth of research from Phase 1. The intuitions and lessons learned from Phase 1 allowed this Phase 2 model training to be more streamlined, with fewer options to consider when validating the final pipeline. This is a direct benefit of having completed Phase 1, since the training procedure could focus on the features, configurations, and tools that are known to work well and produce a good, holistic ML model.
As such, the preliminary Phase 2 models do not perform any negative visual clustering prior to training, do not incorporate the (elephant-only) training datasets from Phase 1, do not predict deduplicated sequence estimates or counts or suggest any bias corrections, do not produce pixel-level foreground/background masks, do not provide bounding box cluster assignments, and do not perform any subsequent image alignment or overlap estimation. This new ML pipeline is intended to be used on individual images in isolation, which is the intended use case for the MVP deliverable. More advanced ML outputs will require new image and annotation data from the KAZA survey, survey transect metadata, flight position and attitude telemetry, and research time.
Link to the Phase 1 final report
Data
The initial ML training data provided by Richard Lamprey was collected over the course of several years by flying over several conservation areas. These areas include the Murchison Falls National Park (MFNP), the Queen Elizabeth National Park (QENP), and the Tsavo East National Park (TENP). The table below details six datasets that have been provided to Wild Me to train a preliminary multi-species ML pipeline. The annotated and full image datasets were downloaded from Amazon S3 storage across 11 separate volumes; the CSV annotation files were created with VGG Image Annotator (VIA) and provided to Wild Me via email.
Location | Date | Annotated Images | Annotations | Received | Total Images | Size |
---|---|---|---|---|---|---|
MFNP | Sep. 2015 | 1,204 | 10,177 * | 11/04/21 | 32,949 | 324 GB |
MFNP | Dec. 2015 | 1,067 | 10,177 * | 11/04/21 | 31,696 | 423 GB |
MFNP | Apr. 2016 | 1,475 | 12,334 | 08/03/22 | 20,749 | 243 GB |
TENP | 2017 | 1,130 | 5,496 | 07/22/22 | 115,994 | 700 GB |
QENP | 2018 | 1,208 | 7,784 | 02/14/22 | 48,818 | 295 GB |
MFNP | 2019 | 2,127 | 16,361 | 02/16/22 | 60,535 | 403 GB |
All Locations | 2015–2019 | 8,211 (deduplicated) | 59,754 (deduplicated) | 08/03/22 | 319,148 (deduplicated) | 2.39 TB |
The data for ML training was provided to Wild Me as a collection of three parts:
- The positive images
- The positive annotations in CSV
- The full dataset
Because only the positive images are annotated, the full 2.4 TB of imagery (319,148 images) provided to Wild Me must be substantially reduced for ML training to just the annotated images; without annotations there is no reliable negative ground-truth, so only 8,211 images are used for training. This filtering may seem extreme, but the incidence rate of animals within an image is still extremely low, and the low pixel density of annotations within an image means that a large number of negative regions can be sampled from each positive example. There are 59,754 annotations for these 8,211 images, an average of only 7.3 boxes per image. While we would prefer to include more completely negative images in the ML training dataset, the practical tradeoffs between training time and accuracy suggest that the benefit would be marginal at best. This assumption is reasonable because animals are captured uniformly and incidentally by the aerial sensor and the underlying sighting distributions are strongly biased towards areas where animals commonly aggregate; as a result, the positive images already contain large amounts of representative background from which negatives can be sampled.
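To make this reduction concrete, below is a minimal sketch of separating the annotated images from the full image dump using the VIA CSV exports. The file paths, the `.jpg` extension, and the reliance on the standard VIA `filename` and `region_count` columns are assumptions for illustration, not the pipeline's actual code.

```python
import csv
from pathlib import Path

# Hypothetical locations; the actual CSV files and image volumes differ per dataset.
VIA_CSV = Path("annotations/mfnp_2016_via.csv")
IMAGE_ROOT = Path("images/")

# Collect the set of image filenames that have at least one annotation.
# A standard VIA CSV export has one row per region, with a `filename` column
# and a `region_count` column; rows with region_count == 0 are unannotated.
annotated = set()
with VIA_CSV.open(newline="") as fh:
    for row in csv.DictReader(fh):
        if int(row.get("region_count", 0)) > 0:
            annotated.add(row["filename"])

# Reduce the full image dump to only the annotated (positive) images.
positives = [p for p in IMAGE_ROOT.rglob("*.jpg") if p.name in annotated]
print(f"{len(positives)} annotated images kept for training")
```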
Below is a breakdown of all 38 annotation labels and their respective number of bounding boxes within the combined, deduplicated dataset:
Species | Total |
---|---|
Buffalo | 8,773 |
Camel | 92 |
Canoe | 332 |
Car | 495 |
Cow | 1,889 |
Crocodile | 61 |
Dead animal/White bones | 81 |
Dead-Bones | 2 |
Eland | 284 |
Ele.Carcass Old | 177 |
Elephant | 2,837 |
Gazelle_Gr | 170 |
Gazelle_Th | 92 |
Gerenuk | 130 |
Giant Forest Hog | 40 |
Giraffe | 785 |
Goat | 2,184 |
Hartebeest | 2,950 |
Hippo | 1,240 |
Impala | 366 |
Kob | 29,282 |
Kudu | 68 |
Motorcycle | 69 |
Oribi | 2,848 |
Oryx | 568 |
Ostrich | 92 |
Roof Grass | 394 |
Roof Mabati | 251 |
Sheep | 317 |
Test | 1,122 |
Topi | 225 |
Vehicle | 15 |
Warthog | 3,724 |
Waterbuck | 3,285 |
White bones | 139 |
White_Bones | 573 |
Wildebeest | 4 |
Zebra | 1,610 |
All Species | 59,754 |

Lastly, when training the bounding box localizer, the following labels were renamed and merged:
Provided Label | ML Label |
---|---|
Gazelle_Gr | Grants Gazelle |
Gazelle_Th | Thomsons Gazelle |
Dead animal/White bones | White Bones |
Dead-Bones | White Bones |
Ele.Carcass Old | White Bones |
White bones | White Bones |
White_Bones | White Bones |
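For reference, this merge can be expressed as a small lookup applied when labels are read from the VIA CSVs; this is an illustrative sketch, and the function name is hypothetical.

```python
# Label normalization applied before training the localizer, taken directly
# from the mapping table above.
LABEL_MERGE = {
    "Gazelle_Gr": "Grants Gazelle",
    "Gazelle_Th": "Thomsons Gazelle",
    "Dead animal/White bones": "White Bones",
    "Dead-Bones": "White Bones",
    "Ele.Carcass Old": "White Bones",
    "White bones": "White Bones",
    "White_Bones": "White Bones",
}

def normalize_label(raw: str) -> str:
    """Map a provided VIA label onto the ML training label."""
    return LABEL_MERGE.get(raw, raw)
```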
Results
Tile Grid


For the MVP model, the grid2 extraction was turned off for speed and the resulting grid1 tiles were further subsampled to keep only 10% of the negative tiles in each image. These tiles formed the global negative set that was mined with the iterative boosting strategy.
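Below is a minimal sketch of that per-image negative subsampling, assuming each grid1 tile record carries a `has_annotation` flag; the record structure and the fixed seed are illustrative assumptions, not the pipeline's actual data model.

```python
import random

def subsample_negatives(tiles, keep_fraction=0.10, seed=0):
    """Keep every positive grid1 tile but only ~10% of the negative tiles.

    `tiles` is assumed to be a list of dicts for one image, each with a
    boolean `has_annotation` flag.
    """
    rng = random.Random(seed)
    positives = [t for t in tiles if t["has_annotation"]]
    negatives = [t for t in tiles if not t["has_annotation"]]
    kept = rng.sample(negatives, int(len(negatives) * keep_fraction))
    return positives + kept
```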






Model Training
Partition | Total | Positive | Negative | Train | Test |
---|---|---|---|---|---|
Images | 8,211 | 6,662 | 1,549 | 6,560 | 1,651 |
Tiles | 685,757 | 42,054 * | 643,703 | 547,420 | 138,337 |
WIC Boost 0 Tiles | 66,540 | 33,270 | 33,270 | 53,232 (train) | 13,308 (val) |
WIC Boost 1 Tiles | 92,207 | 33,270 | 58,937 | 73,765 (train) | 18,442 (val) |
WIC Boost 2 Tiles | 117,734 | 33,270 | 84,194 | 94,187 (train) | 23,547 (val) |
LOC Tiles | 53,072 | 33,270 | 19,802 | 39,960 (train) | 13,112 (val) |
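The WIC boost rows in the table above reflect the iterative hard-negative mining loop: each round trains on the current negative set, scores the remaining negative pool, and folds the highest-scoring (hardest) negatives into the next round. The sketch below illustrates that loop; the training and scoring callables, the per-round quota, and the use of file paths as tile identifiers are assumptions rather than the pipeline's actual interfaces.

```python
import random

def mine_hard_negatives(positive_tiles, negative_pool, train_fn, score_fn,
                        rounds=3, per_round=25_000, seed=0):
    """Iterative hard-negative ("boosting") mining sketch.

    `positive_tiles` and `negative_pool` are lists of tile file paths.
    `train_fn(positives, negatives)` returns a fitted WIC model and
    `score_fn(model, tiles)` returns one positive-class score per tile.
    """
    rng = random.Random(seed)
    # Round 0 starts balanced: as many random negatives as positives.
    negatives = rng.sample(negative_pool, len(positive_tiles))
    model = None
    for boost_round in range(rounds):
        model = train_fn(positive_tiles, negatives)
        if boost_round == rounds - 1:
            break  # no mining after the final round
        scores = score_fn(model, negative_pool)  # higher = more "animal-like"
        # The negatives the current model is most fooled by are the hardest
        # examples; fold the top-scoring ones into the next round's set.
        ranked = sorted(zip(negative_pool, scores), key=lambda ts: ts[1], reverse=True)
        hard = [tile for tile, _ in ranked[:per_round]]
        negatives = sorted(set(negatives) | set(hard))
    return model, negatives
```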
Below is a list of model training improvements made for the MVP; an illustrative configuration sketch follows the list:
- Changed from DenseNet-201 to ResNet-50 (see plot below) to speed up training and inference time
- Changed from the PyTorch SGD with Momentum optimizer to Adam (default LR)
- Added more data augmentation types to reduce over-fitting, and changed the batch sampling ratio to 1.0 between positives and negatives
- Improved training infrastructure to use all available CPU cores (40 for MVP) and added multi-GPU training and inference for both models
- Changed the WIC from an ensemble of multiple models to a single model. The performance benefit seen during Phase 1 was relatively marginal (<5%) and significantly slowed training and inference time (by a factor of 3 to 5).
- Changed the number of WIC boosting rounds from ten (Phase 1 research) to three. The hard negative boosting procedure saw diminishing returns in Phase 1 after round three, and a more substantial improvement to general ML performance was obtained by cleaning up missing ground-truth labels in the underlying training dataset.
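The improvements above translate into roughly the following PyTorch configuration sketch. The tile folder layout, batch size, augmentation parameters, and pretrained-weight choice are illustrative assumptions; only the ResNet-50 backbone, Adam at its default learning rate, the balanced positive/negative sampling, the 40 CPU workers, and multi-GPU training correspond to items in the list.

```python
import torch
import torchvision
from torchvision import transforms

# Data augmentation for 256x256 tiles (types and magnitudes are illustrative).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])

# Hypothetical tile folder with `positive/` and `negative/` subdirectories.
dataset = torchvision.datasets.ImageFolder("tiles/train", transform=augment)

# 1.0 positive:negative batch sampling ratio via inverse-frequency weights.
targets = torch.tensor(dataset.targets)
class_counts = torch.bincount(targets).float()
weights = (1.0 / class_counts)[targets]
sampler = torch.utils.data.WeightedRandomSampler(weights, num_samples=len(dataset))
loader = torch.utils.data.DataLoader(dataset, batch_size=128, sampler=sampler,
                                     num_workers=40)  # all available CPU cores

# ResNet-50 backbone with a 2-class (positive/negative) head, Adam at default LR.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model = torch.nn.DataParallel(model).cuda()  # multi-GPU training
optimizer = torch.optim.Adam(model.parameters())  # default lr of 1e-3
criterion = torch.nn.CrossEntropyLoss()  # standard training loop omitted
```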

Whole Image Classifier (WIC)



Localizer (LOC)
The localizer is run as a secondary pipeline component after the WIC. As such, we can focus the LOC performance plots on its recall ability. Because these are the first models we have trained to support multiple species, we make a simplifying assumption that a localization is considered correct if the bounding box is correct (IoU ≥ 20%), regardless of whether the species label is correct. Preliminary results suggest that the model's localization performance drops by ~8% if we require matched bounding box predictions to also have the same species label as the ground-truth. The full breakdown of the LOC model by species is still pending.
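The matching rule can be stated precisely with a small helper. The corner-coordinate box format and the function names are assumptions; only the 20% IoU threshold and the species-agnostic criterion come from the evaluation described above.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def localization_is_correct(pred_box, gt_box, iou_threshold=0.20):
    """MVP matching rule: the predicted box only needs to overlap the
    ground-truth box at IoU >= 20%; the species label is not checked."""
    return iou(pred_box, gt_box) >= iou_threshold
```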
Furthermore, because the LOC is run on tiles that overlap under two grid strategies, we can omit any failed detections on the margins of each tile. The assumption is that a neighboring tile contains that margin near its center pixels, so our evaluation focuses on the center of each tile (margin=32, tile=256x256) and suggests operating points optimized for this use case. Preliminary results indicate that the LOC's performance improves by 14.2% if we ignore annotations that are missed along the 1/4 margin of each tile. The aggregation code (evaluation also pending) is responsible for performing non-maximum suppression (NMS) across and between tiles and for aggregating the final detections at the image level; a sketch of this margin filter and cross-tile NMS follows the results below. Below are the tile-specific results and suggested performance numbers for the MVP.



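Below is the sketch of the margin filter and the cross-tile NMS referenced above. The center-point margin test, the 0.5 NMS threshold, and the function names are assumptions rather than validated configuration values; it also reuses the `iou` helper from the previous sketch.

```python
TILE_SIZE = 256
MARGIN = 32  # pixels ignored on every edge; a neighboring tile covers them

def in_tile_center(box, tile=TILE_SIZE, margin=MARGIN):
    """Keep a detection only if its center lies inside the tile's central region."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return margin <= cx <= tile - margin and margin <= cy <= tile - margin

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over image-space boxes pooled from all
    tiles; assumes the iou() helper from the previous sketch is in scope."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
```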
Open Source on WBIA

ScoutBot

Docker

Known Issues
- Small animals are a challenge for the positive/negative threshold of 2.5% pixel area coverage
- Not all training annotations were used because grid2 extraction was disabled during training
- Large objects like construction vehicles and buildings are larger than the tile size
- The detected species labels are sometimes confused with other, visually similar species
- The image-level aggregation configuration has not been fully validated against held-out test images
- At least one ground-truth image has bounding boxes burned into its pixel data, as seen below:
