Labelling - preparing data for model development

Most modern artificial intelligence (AI) algorithms are developed by supplying labelled data.

These labels are created by hand (every time you click on one of the “Are you a robot? Identify the cars” images, you are supplying labels to AI models). The model receives the raw images as input and the labels as output, and learns the combination of internal parameters that best maps the input to the labelled output.

While it is relatively straightforward to provide ground-truth labels for everyday objects, such as houses, cars or kitchenware, it is much more difficult to do so for invasive plants that look very much alike, and even more so from an aerial perspective, which is unnatural for most humans.

If advanced imagery analysis tasks, such as object detection or segmentation, are to be used, the task of identifying a plant is compounded by the added task of delineating it, i.e., supplying the extent of each identified object through bounding boxes or polygons. Operators who are highly familiar with the target species are usually the best people to label weed imagery, provided they are instructed in the use of the chosen labelling software. Labelling is highly labour-intensive and thus takes significant time in most cases; this time needs to be accounted for in any weed detection project. Where possible, assembling a team of labellers is the best way to ensure that large numbers of images can be labelled efficiently for each species.
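To make the two delineation styles concrete, here is a minimal sketch of how a polygon label and its derived bounding box might be represented. The species name and all coordinates are illustrative only, not taken from any real dataset.

```python
# Sketch of two common label representations for object detection and
# segmentation. All values here are hypothetical examples.

def polygon_to_bbox(polygon):
    """Derive an axis-aligned bounding box (x_min, y_min, x_max, y_max)
    from a polygon given as a list of (x, y) pixel vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

# A polygon delineating one plant in pixel coordinates (hypothetical).
plant_polygon = [(120, 80), (160, 70), (190, 110), (150, 150), (110, 130)]

label = {
    "species": "Bitou Bush",                 # class chosen by the operator
    "polygon": plant_polygon,                # fine-grained delineation
    "bbox": polygon_to_bbox(plant_polygon),  # coarser box extent
}
print(label["bbox"])  # → (110, 70, 190, 150)
```

Polygons capture the plant outline precisely but take longer to draw; bounding boxes are faster to label but include background pixels, which is one of the trade-offs a labelling guideline should settle up front.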

Fig. 1 RGB image showing labelling polygons of target plant species (Bitou Bush)

Furthermore, as data acquisition and labelling are the most cumbersome and expensive components of the process, especially when expert labellers need to be consulted, the labelling process requires clear definition and should be driven by end-user needs.

The quality of data is the key bottleneck in machine learning model development (Geiger et al., 2021, Roberts et al., 2021) and care should be taken in defining data accurately and reproducibly.

For example, if the treatment of plants is only possible at certain stages of their life cycle, such as when they have already matured, it may be beneficial to identify them just before that stage. Likewise, if no identification by human labellers is possible during certain phenological stages, it may be beneficial to exclude these stages from labelling.

Furthermore, the parts that require identification need to be clearly defined, whether flowers, stems and leaves are to be included or are categorically excluded, or under which circumstances these should be included. These definitions also inform the choice of tools that can be considered suitable for the task.

A third consideration is what can be identified from the target viewpoint. While handheld images may show stems and size differences between plants, these may not be visible in top-down viewpoints. The two figures below illustrate this.

Fig. 2. Side-view of a plant, showing differences in height

Fig. 3. Top-down view of a plant, with indistinguishable height differences.

The following tables show examples of how these processes could be defined, with colour codes highlighting potential negative cases and issues.

Table 1. Features of plants, growth stages and potential false positives

Stage       | Bitou                | Hibertia       | Myoporum       | Scaveola
------------|----------------------|----------------|----------------|---------------
Seed        | no distinct features |                |                |
Sprout      | no distinct features |                |                |
Small Plant | yellow flowers       | yellow flowers | yellow flowers | yellow flowers
Adult Plant | yellow flowers       | yellow flowers | yellow flowers | yellow flowers
Dying Plant | irrelevant           |                |                |

Table 1 above shows the distinct plant features, at what stage of the plant life cycle they should be identified, and potential false positives by phenological stage.

Table 2. Distinct plant features of targets and potential false-positive species, and their suitable identification period

Time   | Bitou          | Hibertia       | Myoporum       | Scaveola
-------|----------------|----------------|----------------|---------------
Winter |                |                |                |
Spring | yellow flowers | yellow flowers | yellow flowers | yellow flowers
Summer | yellow flowers |                |                |
Autumn |                |                |                |

Table 2 above shows the distinct plant features and when they should be identified, sorted by time of year. Tables 1 and 2 can be used interchangeably, depending on the application scenario. Importantly, the time of detection requires definition.

Table 3. Key plant features of targets and potential false positives

Component | Bitou  | Hibertia | Myoporum | Scaveola
----------|--------|----------|----------|---------
roots     |        |          |          |
stems     |        |          |          |
leaves    |        |          |          |
flowers   | yellow | yellow   | yellow   | yellow
seeds     |        |          |          |

Table 3 above shows the distinct plant features by plant component. This is important for delineating labelling sections, as well as highlighting potential false positives.
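Guidelines like Tables 1–3 can also be encoded in a machine-readable form so that labelling tools or quality checks can enforce them. The sketch below is one hypothetical way to do this; every entry (stage names, parts, species list) is illustrative, mirroring the tables above rather than prescribing a format.

```python
# Hypothetical machine-readable labelling guideline, mirroring Tables 1-3:
# which growth stages of the target are labelled, which plant parts are
# included, and which lookalike species may cause false positives.

labelling_spec = {
    "target": "Bitou",
    "label_stages": ["Small Plant", "Adult Plant"],  # stages with distinct features
    "exclude_stages": ["Seed", "Sprout", "Dying Plant"],
    "include_parts": ["flowers"],                    # yellow flowers are the key cue
    "exclude_parts": ["roots", "stems", "leaves", "seeds"],
    "false_positive_species": ["Hibertia", "Myoporum", "Scaveola"],
}

def should_label(stage, part, spec=labelling_spec):
    """Return True if an observation at this stage and plant part
    falls under the labelling guideline."""
    return stage in spec["label_stages"] and part in spec["include_parts"]

print(should_label("Adult Plant", "flowers"))  # → True
print(should_label("Sprout", "flowers"))       # → False
```

Keeping the guideline in one shared structure like this helps a team of labellers apply identical rules, and makes later audits of label consistency straightforward.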

Clear labelling guidelines and definitions enable consistent labels, especially for targets that are difficult to delineate and that change between phenological stages, such as weeds.

Since supervised deep learning is a statistical process during which a model approximates a desired outcome provided by human labellers in the form of data, these labels are key to robust model performance.

There is a variety of software that can be used for labelling; the appropriate choice largely depends on the type of computer vision task to be employed in developing the model. Table 4 below lists a selection of these tools and their pros and cons for use in weed detection.

Table 4. Examples of labelling software, pros and cons.

Roboflow

  Pros:
  • commercial tool, well maintained
  • uses other models to generate label suggestions
  • shareable between multiple users
  Cons:
  • cost
  • uses your data to train its own models

Computer Vision Annotation Tool (CVAT)

  Pros:
  • commercial tool
  • shareable between multiple users
  Cons:
  • cost; limited usage in the free tier
  • difficult to set up a project for many users

Labelstud.io

  Pros:
  • open source
  • shareable between users / can be set up on a server
  Cons:
  • difficult to set up for multiple users
  • interface takes getting used to
  • commercial version required for larger-scale usage

Pixel Annotation Tool

  Pros:
  • open source
  • uses a watershed algorithm in the back end
  Cons:
  • difficult to install
  • cannot be run on a server for multiple users

Makesense.ai

  Pros:
  • easy to use in the browser
  • quick to set up
  • doesn't store images
  Cons:
  • only "detection" and "image annotation" tasks
  • not persistent and not shareable between users

VGG Image Annotator (VIA)

  Pros:
  • open source
  • easy to install; runs in the browser
  Cons:
  • sometimes buggy and cumbersome to use
  • cannot be run on a server for multiple users
  • unintuitive interface

Cloud computing providers also offer a range of image annotation tools as part of their AI frameworks, such as Amazon Web Services (AWS) Rekognition, Google Cloud Vision API, and Azure Machine Learning Annotations.
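Whichever tool is chosen, its exported annotations usually need converting into the format the training framework expects. As one common example, pixel-coordinate boxes are often converted to the normalized (x_center, y_center, width, height) convention used by YOLO-style pipelines; the image size and box values below are illustrative only.

```python
# Convert a pixel-coordinate bounding box (x_min, y_min, x_max, y_max) to
# the normalized (x_center, y_center, width, height) convention used by
# YOLO-style training pipelines. Values below are hypothetical.

def to_yolo(bbox, img_w, img_h):
    """Normalize a pixel bbox by the image dimensions."""
    x_min, y_min, x_max, y_max = bbox
    return (
        (x_min + x_max) / 2 / img_w,   # x_center, fraction of image width
        (y_min + y_max) / 2 / img_h,   # y_center, fraction of image height
        (x_max - x_min) / img_w,       # box width, fraction of image width
        (y_max - y_min) / img_h,       # box height, fraction of image height
    )

print(to_yolo((110, 70, 190, 150), 1000, 1000))  # → (0.15, 0.11, 0.08, 0.08)
```

Writing such conversions once, and verifying them on a handful of hand-checked examples, avoids silent coordinate errors that would otherwise corrupt every label in the dataset.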