So you have an ESP32-CAM (possibly the S3 variant) and you want to recognize objects in its stream of images. But you're neither a computer vision expert nor a programming guru: you're looking for a plug-and-play solution with as little code as possible. Then you're in the right place: this is the only tutorial you will need to take your realtime object detection project with the ESP32-CAM from start to finish.
This article is a complete, step-by-step tutorial on realtime object detection with the ESP32-CAM board. Since it aims to be your single source of truth, it is pretty long. But don't be scared: it is broken down into manageable steps that you can complete in a few minutes each.
The entire process should take 30-60 minutes the first time: once you become familiar with the process, you should be able to complete a project in 15-20 minutes!
Here's how the tutorial is structured:
Collect training images
Label images (a.k.a. define your objects of interest)
Train a neural network with Edge Impulse
Run the network on your ESP32-CAM
Prerequisites
You need to create a free account on Edge Impulse. For those who don't know, Edge Impulse is a low-code platform to train AI models targeted at many development boards and operating systems (from microcontrollers to Raspberry Pi to desktop PCs). It is free to use (with limitations) and beginner-friendly.
Collect training images
To create a machine learning model, you will need training data. This is the data (images, in our case) that tells the model what it should learn to recognize. In the context of object detection, we will go even further: the training data will be the objects inside our images. We'll see later how to label them.
For now, we need to collect a few images to form our dataset. We'll load a sketch that allows us to save frames from the ESP32-CAM using a web browser.
ESPx library
This project makes use of the espx library.
This Arduino library defines a set of abstractions that make the ESP32 features easily accessible with a few lines of code.
Install the latest version of espx from the Arduino Library Manager.
The code examples on this page have been tested with version 1.0.6: if you get weird errors about missing variables or methods, double-check that you have the correct version.
If the error persists, please open an issue on GitHub.
You will also need to install the excellent JPEGDEC library from Larry Bank.
/**
* Create a MJPEG HTTP server
*/
#include <espx.h>
#include <espx/wifix.h>
#include <espx/mdnsx.h>
#include <espx/camx.h>
#include <espx/camx/mjpegx.h>
void setup() {
delay(1000);
Serial.begin(115200);
Serial.println("Camx example: MJPEG server");
// configure camx through Serial Monitor
camx.model.prompt();
camx.pixformat.jpeg();
camx.quality.high();
camx.resolution.qvga();
// initialize camx,
// enter endless loop on error
camx.begin().raise();
// connect to WiFi and set hostname
// to avoid using the IP address of the board:
// server will be available at http://esp32cam.local
wifix("SSID", "PASSWORD").raise();
mdnsx("esp32cam");
// start MJPEG server
// (turn on INFO logging to see messages from server)
mjpegx.listenOn(80);
mjpegx.begin().raise();
}
void loop() {
// server runs in background
}
This sketch is the same as the one in the tutorial ESP32-CAM streaming: the webpage has a button to download a single image with a click and another to download images automatically every second.
Leave the resolution at qvga (320x240): you don't need larger images. Our model will actually run at an even lower resolution (likely 96x96), so you'd waste bandwidth streaming at higher resolutions.
Don't forget to replace the SSID and PASSWORD of your WiFi network!
Put your object of interest in front of the camera and start taking pictures. 20-30 pictures should be enough to get started: don't take too many at this point, because labelling becomes tedious with tens of images. Only capture more images if your model exhibits low accuracy later!
In my case, this is an example of the object I wanted to recognize: a penguin toy.
Next, capture 20-30 more images without the object. Try to create some variability in the pictures: move your camera, rotate it, include objects different from the one of interest. This will help make our model more robust.
Label images
Now that we have some data to work with, we need to label it. Labelling, in this case, means drawing a box around the objects we want our model to recognize. There are a couple of tools available online, but since we'll be creating and training our model with Edge Impulse, let's use their Studio.
Now follow these exact steps to speed up the labelling:
1. Create a new project. Call it object-detection or similar
2. Select Images > Classify multiple objects > Import existing data
3. Click Select files and select all the images without the object!
4. Check Automatically split between training and testing; then hit Begin upload
5. Close the modal and click Labelling queue in the top navigation bar
6. Always click Save labels without doing anything! (since there's no object in these images, we'll use them for background modeling)
7. Go back to Upload data in the top navigation bar
8. Click Select files and select all the images with the object
9. Check Automatically split between training and testing; then hit Begin upload
10. Go to Labelling queue and draw a box around the object you want to recognize. On the right, make sure Label suggestions: Track objects between frames is selected
11. Label all the images. Make sure to fit the bounding box to the object while leaving a few pixels of padding
12. If you have more objects, repeat steps 7-11 for each object, one at a time
If you upload all the images at once, the labelling queue will mix the different objects and you will waste a lot more time drawing the bounding boxes.
Now it is time to define and train our machine learning vision model. As I said earlier, Edge Impulse is a low-code/no-code platform, so you won't be writing a single line of code to complete this step.
Navigate to Impulse design on the left menu
Enter 80 as both image width and image height
Select Fit shortest axis as resize mode
Add the Image processing block
Add the Object detection learning block
Save the impulse
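If you're curious what Fit shortest axis does, here is a rough, board-free sketch of my understanding of that resize mode (the names below are made up for illustration): the image is scaled so its shortest side matches the target size, then the longer side is center-cropped. For a 320x240 QVGA frame resized to 80x80, the 240px side shrinks to 80 and the 320px side is cropped.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative sketch of the "fit shortest axis" resize mode:
// scale so the shortest side matches the target, then
// center-crop the longer side down to a square.
struct CropRegion {
    int scaledW, scaledH;  // image size after scaling
    int cropX, cropY;      // top-left corner of the square crop
};

CropRegion fitShortestAxis(int srcW, int srcH, int target) {
    float scale = float(target) / std::min(srcW, srcH);
    int scaledW = int(srcW * scale);
    int scaledH = int(srcH * scale);
    return { scaledW, scaledH, (scaledW - target) / 2, (scaledH - target) / 2 };
}
```

With these numbers, a 320x240 frame scales to roughly 106x80 and loses about 13 pixels on each side of the width.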
Navigate to Impulse design > Image on the left
Select Grayscale as color depth
Click on Save parameters
Click on Generate features (it will take ~1 minute to complete)
Navigate to Impulse design > Object detection
Set the number of training cycles to 35
Set learning rate to 0.005
Click on Choose a different model right below the FOMO block
Select FOMO (Faster Objects, More Objects) MobileNetV2 0.1
Hit Start training and wait until it completes. It can take 4-5 minutes depending on the number of images.
On the right, you will get a confusion matrix: this table gives a quick overview of how well (or badly) the model performs on your data. Of course you aim for a 100% score, but that's not always achievable. On a first try, anything above 90% is a good starting point.
To get a more trustworthy accuracy score, navigate to Model testing on the left menu and hit Classify all: this score is calculated on images the model has never seen during training, so it's a much less biased evaluation.
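To make these percentages concrete, here is how an overall accuracy score is read off a confusion matrix: correct predictions sit on the diagonal, everything off it is a mistake. The numbers below are made up for illustration, not taken from Edge Impulse.

```cpp
#include <cassert>

// Toy 2x2 confusion matrix: rows = actual class, columns = predicted.
// Class 0 = background, class 1 = penguin. Numbers are invented.
int confusion[2][2] = {
    { 18, 2 },   // 18 background images classified correctly, 2 mistaken for penguin
    {  1, 19 },  // 1 penguin missed, 19 detected correctly
};

// Accuracy = correct predictions (diagonal) / all predictions, in percent.
float accuracy(const int m[2][2]) {
    int correct = 0, total = 0;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            total += m[i][j];
            if (i == j) correct += m[i][j];
        }
    return 100.0f * correct / total;
}
```

Here 37 of 40 predictions land on the diagonal, giving 92.5% accuracy.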
If your model's score is low, here are a few things you can try, from least effort to most:
double-check that you drew the bounding boxes correctly around your objects
increase the number of training cycles (try 50 or 80) while lowering the learning rate (try 0.001)
increase the image size (from 80 to 96)
enable RGB mode instead of grayscale
collect more images of your object (up to 50)
Hopefully, your model's score will increase to an acceptable level.
The final step in the Edge Impulse Studio is to download the AI model as an Arduino library.
Navigate to Deployment on the left menu
Select Arduino Library from the search bar
Leave all as is and hit Build
A zip file containing the model library will download: extract its contents (a folder named <your-project-name>_inferencing) and copy the folder into your Arduino libraries folder.
Model deployment
Now that we have our AI model as a library, we can integrate it into our sketch. If you didn't read the tutorial on capturing frames from the ESP32-CAM, this is the right time to do so; otherwise you may be intimidated by the syntax of the following sketch.
/**
* Perform object detection on camera frames
* using Edge Impulse's FOMO model
*/
#include <JPEGDEC.h>
#include <tinyml4all-object-detection_inferencing.h>
#include <espx.h>
#include <espx/camx.h>
#include <espx/camx/fomox.h>
void setup() {
delay(1000);
Serial.begin(115200);
Serial.println("Fomox example: object detection");
// configure camx through Serial Monitor
camx.model.prompt();
camx.pixformat.jpeg();
camx.quality.high();
camx.resolution.qvga();
// initialize camx,
// enter endless loop on error
camx.begin().raise();
// fomox doesn't need initialization
// but you can set the minimum
// confidence threshold for objects to be included
// in the results (from 0 to 1)
fomox.moreConfidentThan(0.6);
}
void loop() {
auto frame = camx.grab();
if (!fomox.process(frame)) {
Serial.print("Failure: ");
Serial.println(fomox.failure());
return;
}
Serial.printf("Found %d objects in %dms\n", fomox.count, fomox.stopwatch.millis());
// loop over objects
for (auto object : fomox.objects) {
Serial.printf(
"Found object of class %s at coordinates (%d, %d) with confidence %.2f\n",
object.label,
object.cx,
object.cy,
object.score
);
}
}
Here's what the sketch does:
configures the camera using the camx object
continuously grabs a frame in the loop
feeds the frame to the AI model, handling errors if processing fails
loops over the detected objects to print their label, position and confidence
You can replace the for loop body with your custom handling logic (steer servo motors, toggle a relay, play a sound...).
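As a starting point, here is a board-free sketch of such custom logic. Detection and shouldToggleRelay are hypothetical names that mirror the fields fomox exposes (label, cx, cy, score); in the real sketch you would run this check inside the for loop and call digitalWrite() or similar.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for a fomox detected object
// (same fields as in the sketch above: label, cx, cy, score).
struct Detection {
    const char *label;
    int cx, cy;
    float score;
};

// Example custom handler: return true (e.g. to toggle a relay)
// when a "penguin" is detected confidently in the left half
// of a 320px-wide QVGA frame.
bool shouldToggleRelay(const Detection &object) {
    return strcmp(object.label, "penguin") == 0
        && object.cx < 160       // left half of the frame
        && object.score >= 0.6f; // same threshold as the sketch
}
```

Swap the label, the coordinate test and the threshold for whatever your application needs; the point is that each detected object gives you a label, a position and a confidence to act on.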