Training datasets for AI

A key part of any object detection project is developing a robust training dataset. A training dataset is a set of labelled data that is used to train a machine learning model.

In the context of AI-based weed detection, a training dataset consists of images in which the target weed species has been manually labelled. As described in the section on “labelling”, each label is typically a bounding box drawn by a human annotator (normally a weed expert who can distinguish the target species from other vegetation) to mark the extent of an object in the image.


Fig 1. An example of an RGB ground quadrat image with known plant species present in the landscape and the target plant identified by the model in blue bounding boxes using object detection (HWF).
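
As an illustration of the bounding-box labels described above, the following is a minimal sketch of how a single label might be stored. The class name, field names, label string and pixel values are all hypothetical and chosen only for this example; real annotation tools each define their own formats. Some object detection toolchains expect box coordinates as fractions of the image size, so the sketch also shows that conversion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """One annotator-drawn box in pixel coordinates (origin at the top-left)."""

    label: str    # species name assigned by the annotator (hypothetical value below)
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def width(self) -> float:
        return self.x_max - self.x_min

    def height(self) -> float:
        return self.y_max - self.y_min

    def normalised(self, image_width: int, image_height: int):
        """Return (x_centre, y_centre, width, height) as fractions of the image size."""
        return (
            (self.x_min + self.x_max) / 2 / image_width,
            (self.y_min + self.y_max) / 2 / image_height,
            self.width() / image_width,
            self.height() / image_height,
        )


# Hypothetical example: one target plant in a 1000 x 800 pixel quadrat image.
box = BoundingBox(label="target_weed", x_min=100, y_min=200, x_max=300, y_max=600)
print(box.normalised(1000, 800))
```

Storing the label name alongside the coordinates keeps each box self-describing, which makes later consistency checks easier.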

The size and quality of the training dataset are both important for the accuracy of the object detection model. A larger dataset will generally lead to a more accurate model, but it may also require more time and resources to train the model.

A high-quality dataset will contain images that are well-labelled and representative of the objects that the model will be used to detect. Collaboration between weed experts and data scientists is essential for generating high-quality training datasets.

There are some disadvantages to using CNNs for object detection.

  • They require a large amount of training data, which is time-consuming and expensive to generate.
  • They can be computationally expensive to train. CNNs are complex models that require a lot of computing power to train, typically graphics processing units (GPUs) with large amounts of memory. This can be a challenge for applications that need to be deployed on resource-constrained devices.
  • CNNs can be sensitive to changes in lighting and viewpoint, which is a challenge in weed detection where environmental conditions can vary widely.

If the images presented to the model are taken under lighting conditions or from viewpoints that differ from those in the training data, the CNN may not be able to detect the objects accurately.

Training datasets that contain images collected under a wide range of conditions can help to mitigate this issue. Despite these limitations, CNNs are a powerful tool for object detection and they are being used in a wide variety of applications. As the technology continues to improve, these limitations are likely to be addressed.
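
One simple way to see whether a dataset covers a wide range of conditions is to tally the images by the conditions under which they were captured. The sketch below assumes that each image has a metadata record with hypothetical `lighting` and `viewpoint` fields; the file names, category names and the 20% threshold are illustrative assumptions, not recommendations from this guide.

```python
from collections import Counter

# Hypothetical capture metadata: one record per image in the training set.
images = [
    {"file": "img_001.jpg", "lighting": "overcast", "viewpoint": "nadir"},
    {"file": "img_002.jpg", "lighting": "overcast", "viewpoint": "nadir"},
    {"file": "img_003.jpg", "lighting": "overcast", "viewpoint": "oblique"},
    {"file": "img_004.jpg", "lighting": "overcast", "viewpoint": "nadir"},
    {"file": "img_005.jpg", "lighting": "full_sun", "viewpoint": "nadir"},
    {"file": "img_006.jpg", "lighting": "low_sun", "viewpoint": "nadir"},
]


def condition_counts(records, key):
    """Count how many images were captured under each value of `key`."""
    return Counter(record[key] for record in records)


def under_represented(counts, min_share):
    """Return the conditions that make up less than `min_share` of all images."""
    total = sum(counts.values())
    return sorted(name for name, n in counts.items() if n / total < min_share)


lighting = condition_counts(images, "lighting")
print(lighting)
# Conditions below an (assumed) 20% share are flagged for further image collection.
print(under_represented(lighting, min_share=0.2))
```

A tally like this only shows where coverage is thin; deciding which conditions matter for a given site still depends on the judgement of the weed experts and data scientists involved.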

The following list summarises some important considerations when developing a training dataset:

  • The size of the dataset. A larger dataset will generally lead to a more accurate model, but it may also require more time and resources to collect and label the data.
  • The diversity of the dataset. The dataset should be diverse enough to represent the objects that the model will be used to detect across the range of lighting conditions, viewpoints, and object sizes it is likely to encounter.
  • The quality of the data. The data should be of high quality and free of noise. This will help to ensure that the model learns accurate representations of the objects.
  • The labelling of the data. The data should be labelled accurately and consistently, and a large number of labelled images should be provided (a minimum of 500 images is recommended). This is important for the model to learn the correct relationships between the features of the objects and their labels. For this reason, the input of domain specialists who can accurately annotate the target weed is critical in any AI-based detection project. This stage is the most labour-intensive and can take many weeks to complete, depending on the site and the images required.
  • The cost and time of development. The cost and time of developing the training dataset should be considered, especially if the dataset is large or complex. AI tools for automated labelling are increasingly being investigated to reduce these resource requirements.
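
Some of the considerations above, such as consistent labelling and the recommended minimum number of labelled images, can be checked automatically before training. The following is a minimal sketch under stated assumptions: the annotation layout, the function name `audit_labels`, the file names and the label name are all hypothetical, and the only figure taken from this guide is the 500-image minimum.

```python
MIN_LABELLED_IMAGES = 500  # minimum recommended in the list above


def audit_labels(annotations, known_labels, min_images=MIN_LABELLED_IMAGES):
    """Return a list of problems found in a set of bounding-box annotations.

    `annotations` maps an image file name to a dict with:
      "size":  (width, height) of the image in pixels
      "boxes": list of (label, x_min, y_min, x_max, y_max) tuples
    This layout is an assumption made for the example.
    """
    problems = []
    labelled = 0
    for name, annotation in annotations.items():
        width, height = annotation["size"]
        if annotation["boxes"]:
            labelled += 1
        for label, x_min, y_min, x_max, y_max in annotation["boxes"]:
            if label not in known_labels:
                problems.append(f"{name}: unknown label {label!r}")
            inside = 0 <= x_min < x_max <= width and 0 <= y_min < y_max <= height
            if not inside:
                problems.append(f"{name}: box is empty or extends outside the image")
    if labelled < min_images:
        problems.append(
            f"only {labelled} labelled images; at least {min_images} recommended"
        )
    return problems


# Hypothetical annotations for three 1000 x 800 pixel quadrat images.
annotations = {
    "quadrat_001.jpg": {"size": (1000, 800), "boxes": [("target_weed", 100, 200, 300, 600)]},
    "quadrat_002.jpg": {"size": (1000, 800), "boxes": [("target_weed", 900, 100, 1100, 300)]},
    "quadrat_003.jpg": {"size": (1000, 800), "boxes": []},  # not yet labelled
}

for problem in audit_labels(annotations, known_labels={"target_weed"}):
    print(problem)
```

Checks like these catch mechanical errors only. Whether a box has been drawn around the correct species still has to be verified by a domain specialist.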