Labelling - preparing data for model development

Most modern artificial intelligence (AI) algorithms are developed by supplying labelled data.

These labels are created by hand (every time you click on one of the “Are you a robot? Identify the cars” images, you are supplying labels to AI models). The model receives the raw images as input and the labels as output, and learns the combination of internal parameters that best maps the input to the labelled output.

While it is relatively straightforward to provide ground-truth labels for everyday objects, such as houses, cars or kitchenware, it is much more difficult to do so for invasive plants that look very much alike, and even more so from an aerial perspective, which is unnatural for most humans.

If advanced imagery analysis tasks, such as object detection or segmentation, are to be used, the task of identifying a plant is compounded by the added task of delineating it, i.e., supplying the extent of each identified object through bounding boxes or polygons. Operators who are highly familiar with the target species are usually the best people to label weed imagery, provided they are instructed in the use of the chosen labelling software. Labelling is highly labour-intensive and thus takes significant time in most cases; this time needs to be accounted for in any weed detection project. Where possible, assembling a team of labellers is the best way to ensure that large numbers of images can be labelled efficiently for each species.
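To make the two delineation styles concrete, here is a minimal sketch of how a polygon label and its derived bounding box might be represented. The species name and all coordinates are illustrative only, not taken from any real dataset.

```python
# Sketch of two common label representations for object detection and
# segmentation. All values here are hypothetical examples.

def polygon_to_bbox(polygon):
    """Derive an axis-aligned bounding box (x_min, y_min, x_max, y_max)
    from a polygon given as a list of (x, y) pixel vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

# A polygon delineating one plant in pixel coordinates (hypothetical).
plant_polygon = [(120, 80), (160, 70), (190, 110), (150, 150), (110, 130)]

label = {
    "species": "Bitou Bush",                 # class chosen by the operator
    "polygon": plant_polygon,                # fine-grained delineation
    "bbox": polygon_to_bbox(plant_polygon),  # coarser box extent
}
print(label["bbox"])  # → (110, 70, 190, 150)
```

Polygons capture the plant outline precisely but take longer to draw; bounding boxes are faster to label but include background pixels, which is one of the trade-offs a labelling guideline should settle up front.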

Fig. 1 RGB image showing labelling polygons of target plant species (Bitou Bush)

Furthermore, as data acquisition and labelling are the most cumbersome and expensive components of the process, especially when expert labellers need to be consulted, the labelling process requires clear definition and should be driven by end-user needs.

The quality of data is the key bottleneck in machine learning model development (Geiger et al., 2021, Roberts et al., 2021) and care should be taken in defining data accurately and reproducibly.

For example, if the treatment of plants is only possible at certain stages of their life cycle, such as when they have already matured, it may be beneficial to identify them just before that stage. Likewise, if no identification by human labellers is possible during certain phenological stages, it may be beneficial to exclude these stages from labelling.

Furthermore, the parts that require identification need to be clearly defined, whether flowers, stems and leaves are to be included or are categorically excluded, or under which circumstances these should be included. These definitions also inform the choice of tools that can be considered suitable for the task.

A third consideration is what can be identified from the target viewpoint. While handheld images may show stems and size differences between plants, these may not be visible in top-down viewpoints. The two figures below illustrate this.

Fig. 2. Side-view of a plant, showing differences in height

Fig. 3. Top-down view of a plant, with indistinguishable height differences.

The following tables show examples of how these processes could be defined, with colour codes highlighting potential negative cases and issues.

Table 1. Features of plants, growth stages and potential false positives

Stage       | Bitou                | Hibertia       | Myoporum       | Scaveola
------------|----------------------|----------------|----------------|---------------
Seed        | no distinct features |                |                |
Sprout      | no distinct features |                |                |
Small Plant | yellow flowers       | yellow flowers | yellow flowers | yellow flowers
Adult Plant | yellow flowers       | yellow flowers | yellow flowers | yellow flowers
Dying Plant | irrelevant           |                |                |

Table 1 above shows the distinct plant features, at what stage of the plant life cycle they should be identified, and potential false positives by phenological stage.

Table 2. Distinct plant features of targets and potential false-positive species, and their suitable identification period

Time   | Bitou          | Hibertia       | Myoporum       | Scaveola
-------|----------------|----------------|----------------|---------------
Winter |                |                |                |
Spring | yellow flowers | yellow flowers | yellow flowers | yellow flowers
Summer | yellow flowers |                |                |
Autumn |                |                |                |

Table 2 above shows the distinct plant features and when they should be identified, sorted by time of year. Tables 1 and 2 can be used interchangeably, depending on the application scenario. Importantly, the time of detection requires definition.

Table 3. Key plant features of targets and potential false positives

Component | Bitou  | Hibertia | Myoporum | Scaveola
----------|--------|----------|----------|---------
roots     |        |          |          |
stems     |        |          |          |
leaves    |        |          |          |
flowers   | yellow | yellow   | yellow   | yellow
seeds     |        |          |          |

Table 3 above shows the distinct plant features by plant component. This is important for delineating labelling sections, as well as highlighting potential false positives.
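Guidelines like Tables 1–3 can also be encoded in a machine-readable form so that labelling tools or quality checks can enforce them. The sketch below is one hypothetical way to do this; every entry (stage names, parts, species list) is illustrative, mirroring the tables above rather than prescribing a format.

```python
# Hypothetical machine-readable labelling guideline, mirroring Tables 1-3:
# which growth stages of the target are labelled, which plant parts are
# included, and which lookalike species may cause false positives.

labelling_spec = {
    "target": "Bitou",
    "label_stages": ["Small Plant", "Adult Plant"],  # stages with distinct features
    "exclude_stages": ["Seed", "Sprout", "Dying Plant"],
    "include_parts": ["flowers"],                    # yellow flowers are the key cue
    "exclude_parts": ["roots", "stems", "leaves", "seeds"],
    "false_positive_species": ["Hibertia", "Myoporum", "Scaveola"],
}

def should_label(stage, part, spec=labelling_spec):
    """Return True if an observation at this stage and plant part
    falls under the labelling guideline."""
    return stage in spec["label_stages"] and part in spec["include_parts"]

print(should_label("Adult Plant", "flowers"))  # → True
print(should_label("Sprout", "flowers"))       # → False
```

Keeping the guideline in one shared structure like this helps a team of labellers apply identical rules, and makes later audits of label consistency straightforward.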

Clear labelling guidelines and definitions enable consistent labels, especially for targets that are difficult to delineate and that change between phenological stages, such as weeds.

Since supervised deep learning is a statistical process during which a model approximates a desired outcome provided by human labellers in the form of data, these labels are key to robust model performance.

There is a variety of software that can be used for labelling; the appropriate choice largely depends on the type of computer vision task to be employed in developing the model. Table 4 below lists a selection of these tools and their pros and cons for use in weed detection.

Table 4. Examples of labelling software, pros and cons.

Roboflow

  Pros:
  • commercial tool, well maintained
  • uses other models to generate label suggestions
  • shareable between multiple users
  Cons:
  • cost
  • uses your data to train its own models

Computer Vision Annotation Tool (CVAT)

  Pros:
  • commercial tool
  • shareable between multiple users
  Cons:
  • cost; limited usage in the free tier
  • difficult to set up a project for many users

Labelstud.io

  Pros:
  • open source
  • shareable between users / can be set up on a server
  Cons:
  • difficult to set up for multiple users
  • interface takes getting used to
  • commercial version required for larger-scale usage

Pixel Annotation Tool

  Pros:
  • open source
  • uses a watershed algorithm in the back end
  Cons:
  • difficult to install
  • cannot be run on a server for multiple users

Makesense.ai

  Pros:
  • easy to use in the browser
  • quick to set up
  • doesn't store images
  Cons:
  • only "detection" and "image annotation" tasks
  • not persistent and not shareable between users

VGG Image Annotator (VIA)

  Pros:
  • open source
  • easy to install; runs in the browser
  Cons:
  • sometimes buggy and cumbersome to use
  • cannot be run on a server for multiple users
  • unintuitive interface

Cloud computing providers also offer a range of image annotation tools as part of their AI frameworks, such as Amazon Web Services (AWS) Rekognition, Google Cloud Vision API, and Azure Machine Learning Annotations.
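Whichever tool is chosen, its exported annotations usually need converting into the format the training framework expects. As one common example, pixel-coordinate boxes are often converted to the normalized (x_center, y_center, width, height) convention used by YOLO-style pipelines; the image size and box values below are illustrative only.

```python
# Convert a pixel-coordinate bounding box (x_min, y_min, x_max, y_max) to
# the normalized (x_center, y_center, width, height) convention used by
# YOLO-style training pipelines. Values below are hypothetical.

def to_yolo(bbox, img_w, img_h):
    """Normalize a pixel bbox by the image dimensions."""
    x_min, y_min, x_max, y_max = bbox
    return (
        (x_min + x_max) / 2 / img_w,   # x_center, fraction of image width
        (y_min + y_max) / 2 / img_h,   # y_center, fraction of image height
        (x_max - x_min) / img_w,       # box width, fraction of image width
        (y_max - y_min) / img_h,       # box height, fraction of image height
    )

print(to_yolo((110, 70, 190, 150), 1000, 1000))  # → (0.15, 0.11, 0.08, 0.08)
```

Writing such conversions once, and verifying them on a handful of hand-checked examples, avoids silent coordinate errors that would otherwise corrupt every label in the dataset.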