AI model development

The development of a model for weed detection can adopt the general model development pipeline illustrated in the figures below.

A common AI project has the following participants, each responsible for certain components of the development:

  • data engineer (builds the data ingestion pipelines)
  • machine learning engineer (trains and iterates on models to perform the task)
  • software engineer (integrates the machine learning model with the rest of the software and hardware architecture)
  • project manager (main point of contact with the end user)

Since the key ingredient of successful AI models is data, the focus is on producing quality data (Geiger et al., 2021; Maxwell et al., 2021; Jordan, 2018, 2019).

Figure 1 illustrates the complexity of a machine learning model and its key components, while Figure 2 shows the development lifecycle and its iterative returns to earlier stages.

Fig. 1. The components of a machine learning model (https://www.neuraxio.com/blogs/news/the-business-process-of-machine-learning-with-automl)

Fig. 2. The Machine Learning Development Lifecycle (https://www.jeremyjordan.me/ml-projects-guide/)

The key question from the business process of developing AI applications applies equally to the research case: what is the end-user requirement?

This question drives many factors of the development lifecycle and, inherently, the treatment of data. End users will most likely treat the developed models as black boxes, without wanting to engage with intricate workings such as labelling additional images. The end-user requirements drive the definition of the data, the desired behaviour and the associated software.

Dataset splits during AI model development

Supervised Deep Learning requires datasets to train and evaluate the performance of a model. Supervised learning requires labelled data, in which a human provides the “true” output that the model should learn (see the section on “labelling”).

In general, datasets are split into three parts:

  1. The training dataset
  2. The validation dataset
  3. The test dataset

Fig. 3. Process for using training, validation and test datasets in AI model development (source: “Train Test Validation Split: How To & Best Practices [2023]”, v7labs.com)

The test dataset is generally withheld from the model during training, and it is essential that this dataset resembles the real-world scenario as closely as possible - it is the “real-world” dataset. It is also essential that the test dataset is never mixed into the training dataset. The two datasets should always be kept separate, and the test dataset should be held constant over time so that models can be compared as they are improved.

The training and validation datasets usually come from the same pool and are split at model training time. The parameters of the model are updated using samples from the training dataset, while accuracy is evaluated on the validation dataset. When performance on the validation dataset no longer improves, the model parameters are stored and the performance is measured on the test dataset (the “real-world” dataset) for the final reported value.
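
As a concrete illustration, such a split can be produced with scikit-learn's train_test_split. This is a minimal sketch, assuming the data are simple lists of image paths and binary labels; the 70/15/15 fractions and the fixed random seed are illustrative choices, not prescriptions.

    # Minimal sketch of a train/validation/test split with scikit-learn.
    from sklearn.model_selection import train_test_split

    # Placeholder data: image paths and binary labels (1 = target weed present).
    image_paths = [f"img_{i:04d}.jpg" for i in range(1000)]
    labels = [i % 2 for i in range(1000)]

    # Split off the test set first and keep it fixed (random_state), so that
    # successive model versions are always compared on the same images.
    trainval_x, test_x, trainval_y, test_y = train_test_split(
        image_paths, labels, test_size=0.15, stratify=labels, random_state=42
    )

    # Split the remainder into training and validation sets (~70% / ~15% overall).
    train_x, val_x, train_y, val_y = train_test_split(
        trainval_x, trainval_y, test_size=0.15 / 0.85,
        stratify=trainval_y, random_state=42
    )

    print(len(train_x), len(val_x), len(test_x))  # ~700 / ~150 / 150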

The data formats required by the end user drive the supplementary software development. If georeferenced files such as GeoTIFFs are the desired input, the pipeline will have to look different from one for standard images generated by smartphones. Image-stitching artifacts and resolution differences, as well as the data-loading pipelines themselves, are common in these cases and differ from standard computer vision data formats (Huang et al., 2019).
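
To make the difference concrete, the sketch below contrasts loading a plain smartphone image with loading a georeferenced GeoTIFF using rasterio, a common Python library for geospatial rasters. The file names are placeholders for illustration.

    # Sketch of how loading a georeferenced GeoTIFF differs from a plain image.
    import numpy as np
    import rasterio
    from PIL import Image

    # A smartphone JPEG: just a pixel array, no spatial reference.
    rgb = np.asarray(Image.open("photo.jpg"))          # (rows, cols, 3)

    # A GeoTIFF orthomosaic: a bands-first array plus georeferencing metadata
    # (coordinate reference system and an affine pixel-to-world transform).
    with rasterio.open("orthomosaic.tif") as src:
        bands = src.read()                             # (bands, rows, cols)
        print(src.crs)                                 # e.g. EPSG:32755
        print(src.transform * (0, 0))                  # world coords of pixel (0, 0)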

Additionally, the question of whether it is more important to detect every specimen, at the expense of False Positives, or to detect only reliable specimens, at the expense of False Negatives, can drive model hyperparameters such as the choice of loss function (Maxwell et al., 2021).
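
For example, in PyTorch one common way to express this trade-off for a binary detector is the pos_weight argument of BCEWithLogitsLoss, which makes missed detections (False Negatives) costlier than false alarms. This is a minimal sketch; the weight of 5.0 is purely illustrative.

    # Sketch of biasing a binary weed detector towards recall in PyTorch.
    # pos_weight > 1 penalises missed weeds (False Negatives) more heavily,
    # so the model tolerates more False Positives; 5.0 is an illustrative value.
    import torch
    import torch.nn as nn

    recall_oriented_loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([5.0]))

    logits = torch.tensor([0.2, -1.3, 2.1])   # raw model outputs for 3 pixels/boxes
    targets = torch.tensor([1.0, 1.0, 0.0])   # 1 = target weed, 0 = background
    print(recall_oriented_loss(logits, targets))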

The importance of early discussions with the project team

At the outset of any model development, a list of questions to discuss with the project team helps ensure that the team fully understands the target species, so that it can be identified accurately in the images for labelling purposes. Early information about the data, and about the AI techniques most appropriate for model development, is also needed at the beginning of the process.

The project team should document discussions around the following questions:

  1. When should the target species be detected? At what stage of its phenology? At what time of year? How long before application of control treatments?
  2. Where does the target species appear? What do the environments look like? What are other species that appear in these environments? What are the properties of those environments?
  3. What does the target species look like in its entirety? What do the flowers, stem, roots and leaves look like? How do these change over time? How do they appear in aerial imagery? What are its distinct features?
  4. What may be potential False Positives? What do they look like? Do the timelines (phenological or seasonal) coincide? How are they distinguishable from one another? Do they appear in the same environments?
  5. How many images are available? What is the dataset size? What is the dataset variance?


The project team should also determine the following for later analysis:

  6. Which Computer Vision task is most likely to yield adequate results? Classification, Object Detection or Semantic Segmentation?
  7. How can labels be generated reliably? Which tool is used? Who generates labels? What are the guidelines for labelling? Which part of the plant is excluded and included, explicitly? At which phenological stage should plants be detected?
  8. How are False Positives and False Negatives treated? Is it more important to detect all instances of the target species, at the cost of False Positives, or to detect only reliable instances of the target species?
  9. Which datasets are available? How many datasets are available? How can they be used to split into training / validation and test? What data formats do they have? Are georeferenced images used, or only raw images? How many sensors were used? Which sensors were used? Are there fixed flying heights?
  10. Which computational resources are available? Where is data hosted, exchanged or made available to computational resources? Are there constraints on the computational capabilities of available systems?

These issues require clarification before model development can begin, as they drive the development of the models and are early indicators of success.

Infrastructure and Resources

Another question that requires definition is the deployment strategy and the available hardware resources. Models have become ever more resource intensive and require large GPUs, even more so during training. Developing models requires access to GPUs, and how these are hosted defines the libraries and tools available for development, e.g. data handling, libraries and training settings. Commercial tools such as Azure, AWS or Roboflow make off-the-shelf tasks easy by removing some technical hurdles; however, they are less flexible in the definition of model parameters and settings for non-standardised data, and they still require a degree of technical literacy that cannot be assumed of end users as much as of developers.
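
As a quick sanity check, the compute available in a given environment can be inspected directly; the small sketch below uses PyTorch, one of several options.

    # Report whether a CUDA GPU is visible to this environment (PyTorch).
    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}, memory: {props.total_memory / 1e9:.1f} GB")
    else:
        print("No CUDA GPU detected; training would fall back to the CPU.")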

Python is by far the most popular language for AI model development. It is a general-purpose language that is easy to learn and use, and it has a large ecosystem of libraries and tools designed specifically for AI. Some of the most common Python AI libraries, such as TensorFlow, PyTorch and scikit-learn, are also compatible with other popular programming languages, such as R, Java and C++. However, as the use of these libraries requires programming skills, developing AI models with them typically requires the involvement of a specialised data scientist.
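
As a small illustration of how little code these libraries need for standard tasks, the following sketch trains a Random Forest classifier on scikit-learn's bundled iris dataset; it is purely illustrative and unrelated to weed imagery.

    # Minimal scikit-learn example: a Random Forest on a bundled toy dataset.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))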

Given the widespread adoption of AI models, cloud computing providers such as AWS, Azure or Roboflow offer managed AI services. These provide a turnkey solution for developing and deploying AI models, typically including pre-trained models, pre-built pipelines and managed infrastructure. This makes it much easier for scientists to get started with AI without having to invest in the infrastructure or expertise required to build and manage their own models. These tools also provide a graphical user interface (GUI) for building and training AI models, so models can be developed without learning how to code. One limitation of these solutions is that they abstract away the underlying complexity of the process, which makes it hard to see how the models reach their decisions and, in turn, to identify and address potential biases or errors. As with any model, it is critical to test thoroughly before deployment in the field, to ensure the models work as expected and do not exhibit bias or errors.

Examples of specific pipelines for RGB, MS and HS imagery are shown below (taken from the recent research project “Weed Managers Guide to Remote Detection”):

Fig. 4. Imagery processing and analysis pipeline for RGB imagery (weed detection pipeline)

Source: Dr. M. Rahaman, Charles Sturt University

Fig. 5. Imagery processing and analysis pipeline for HS/MS imagery

  1. Preliminary study on weed detection (one iteration)
    1. Retrieve related works and methods on detection and classification of the weed.
    2. Collaborate with on-site and weed-management experts and translate on-ground data samples into geo-referenced weed locations in UAV-surveyed areas (ground truth).
    3. Gather information about UAV band wavelengths for MS or HS.
  2. Processing the raw images (per site, per camera, per flight altitude)
    1. Download the high-resolution RGB, MS or HS raw images from a cloud service or external storage drive.
    2. Generate high-resolution RGB orthomosaic and export the mosaic (*.jpg with georeferencing metadata using the same global coordinates) using Agisoft Metashape.
    3. Generate MS georeferenced orthomosaic images and export the mosaic (*.tiff) using Agisoft Metashape.
    4. Georeference the MS and HS data using the georeferenced RGB as a baseline.
  3. Labelling the MS/HS image (per site, camera, and flight altitude)
    1. Load the RGB mosaic (or set of tiles) to ENVI to serve as a reference and visual aid for photo interpretation and labelling using ArcGIS Pro or QGIS.
    2. Load the multispectral or hyperspectral tiles to be labelled in ENVI.
    3. Import the ground truth GPS coordinates (*.SHP files) over RGB and MS images using QGIS or ENVI.
    4. Create ROIs and export tiles from the MS orthomosaic image based on the ground-truth GPS locations.
    5. Label Regions of Interest (ROIs) containing weeds, with cross-validation from ground experts, in the ENVI processing software.
    6. Export the labelled ROIs as ENVI (*.hdr and binary) or *.SHP files.
  4. Python custom code development and training of a machine learning model (one or multiple models per site, sensor and altitude configurations); a condensed code sketch of this step follows the list.
    1. Read multispectral (ENVI, TIF), hyperspectral (ENVI) and satellite (TIF) rasters.
    2. Convert hyperspectral transects from radiance to reflectance using a white reference spectral intensity (batch processing available).
    3. Apply pre-processing operations to hyperspectral transects such as spatial filtering, spectral filtering, and morphological operations (batch processing available).
    4. Calculate spectral indices from a self-contained spectral index library.
    5. Train a machine learning classifier for pixel-wise classification (configured hyper-parameters, training the model, and selection of important features).
    6. Export trained ML models for subsequent inference using the entire tile dataset.
    7. Export a predicted map (a georeferenced image) into ENVI or TIF file formats.
    8. Visualise the predicted map using ENVI, ArcGIS, or QGIS.
  5. ML model validation
    1. Test the models on different sites and at different resolutions.
    2. Upload the results to the central storage location.
    3. Collect feedback from team members.
    4. Refine the model to obtain higher accuracy.
  6. Investigate other models (SVM / RF / DT / KNN) (per model)
    1. Conduct the same procedure as in steps 4 and 5.
    2. Compare the different models/ML algorithms using classification reports and model training time.
    3. Select the best ML classifier for detection and mapping of the weed.

Source: N. Amarasingam, Queensland University of Technology
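
As flagged in step 4 above, the following is a condensed sketch of that step: reading a multispectral raster, deriving a spectral index, training a pixel-wise classifier and exporting a georeferenced prediction map. It uses rasterio and scikit-learn; the file names, the band order (red = band 3, NIR = band 4) and the rasterised label file are assumptions made for illustration, not part of the original pipeline.

    # Condensed sketch of pipeline step 4: read a multispectral raster, derive
    # a spectral index, train a pixel-wise classifier, export a predicted map.
    import numpy as np
    import rasterio
    from sklearn.ensemble import RandomForestClassifier

    with rasterio.open("ms_orthomosaic.tif") as src:
        ms = src.read().astype("float32")            # (bands, rows, cols)
        profile = src.profile

    # Assumed band order: red = band 3, NIR = band 4 (0-indexed 2 and 3).
    red, nir = ms[2], ms[3]
    ndvi = (nir - red) / (nir + red + 1e-6)          # one spectral index as a feature

    # Stack per-pixel features: raw bands plus NDVI -> (pixels, features).
    features = np.vstack([ms, ndvi[None]]).reshape(ms.shape[0] + 1, -1).T

    # Labels: 0 = unlabelled, 1 = weed, 2 = background (e.g. rasterised ROIs).
    with rasterio.open("labels.tif") as src:
        y = src.read(1).reshape(-1)

    train_mask = y > 0
    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    clf.fit(features[train_mask], y[train_mask])

    # Predict every pixel and write a georeferenced map alongside the input.
    pred = clf.predict(features).reshape(ms.shape[1], ms.shape[2]).astype("uint8")
    profile.update(count=1, dtype="uint8")
    with rasterio.open("predicted_map.tif", "w", **profile) as dst:
        dst.write(pred, 1)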