Squashing MobileNetV1 for microcontrollers

MobileNets are a family of neural network architectures aimed at image classification. They're known to be small and fast while still achieving good classification accuracy compared to more sophisticated architectures. In recent years, TensorFlow Lite for Microcontrollers has made it possible to run such networks on embedded hardware, which is a huge achievement.

Nevertheless, the TF runtime adds a lot of overhead to its models, both in terms of resource consumption and latency. Out of this frustration, I had the idea to manually implement MobileNetV1 in plain C++. Or, more accurately: I had the idea of writing a Python package that does it for me.

Key issues

Writing a neural network from scratch by hand can be tedious and error-prone. It is also hard to customize (for example, to add or remove layers).

Solution

Since MobileNetV1 is made of a handful of different layer types (Convolution2D, Depthwise Convolution2D, MaxPooling2D, Dense), it is enough to write a generic implementation for each of them once and pass the correct set of parameters to adapt it to the different layers, as sketched below.
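
To give an idea, here is a minimal sketch of what such a generic kernel could look like: a single depthwise 3x3 convolution routine, parameterized by the input width and stride, that every depthwise layer can call with its own concrete values. Bias and activation are omitted for brevity; this mirrors the calls in the generated code shown below, not the library's exact implementation.

#include <stdint.h>

// Generic depthwise 3x3 convolution over a single channel: the same routine
// serves every depthwise layer, only `width` (input side length) and
// `stride` change from one call to the next.
void depthwise_conv(const float *input, float *output,
                    const float *weights, uint16_t width, uint16_t stride) {
    const uint16_t out_width = (width - 3) / stride + 1;

    for (uint16_t y = 0; y < out_width; y++) {
        for (uint16_t x = 0; x < out_width; x++) {
            float acc = 0;

            // accumulate the 3x3 neighborhood against the 3x3 kernel
            for (uint8_t ky = 0; ky < 3; ky++)
                for (uint8_t kx = 0; kx < 3; kx++)
                    acc += input[(y * stride + ky) * width + x * stride + kx]
                         * weights[ky * 3 + kx];

            output[y * out_width + x] = acc;
        }
    }
}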

Since we have full visibility "at compile time" over the input/output shapes of each layer, we can unroll each function call with the correct offsets for memory access. We can also tailor the memory allocation to be exactly as large as required, and not a single byte more (RAM being the most limiting factor on microcontrollers).
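
For instance, the generated predict() shown below works on a single "ping/pong" arena whose two halves are reused by consecutive layers. Here is a sketch of how its size can be derived, whether by the Python generator itself or, equivalently, by a constexpr helper like the hypothetical one below (the figures are those of the Pico model shown later):

#include <stddef.h>

// The half-arena only needs to hold the largest intermediate tensor, so the
// generator takes the maximum over the layer output sizes, all of which are
// known ahead of time.
constexpr size_t layer_outputs[] = {
    32 * 32 * 3,  // conv2d (0)
    34 * 34 * 3,  // padding (1) -> the largest: 3468 floats
    16 * 16 * 3,  // depthwise (1)
    16 * 16 * 6,  // pointwise (1)
    // ... the remaining layers are all smaller
};

constexpr size_t max_size(const size_t *sizes, size_t count) {
    size_t best = 0;
    for (size_t i = 0; i < count; i++)
        if (sizes[i] > best)
            best = sizes[i];
    return best;
}

constexpr size_t HALF_ARENA = max_size(layer_outputs, sizeof(layer_outputs) / sizeof(layer_outputs[0]));

// two halves ("ping" and "pong"): 2 * 3468 floats, roughly 27 kB of RAM
float arena[2 * HALF_ARENA];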

Results

The end result of my speculative work is a Python library that is indeed able to produce an optimized implementation of slight variations of the original MobileNetV1 architecture. To make it usable in extremely constrained settings, I implemented architectures all the way down to "pico", which reduce the number of layers and parameters to the bare minimum.

Here's the Python code to train a model.

from micromobilenet import PicoMobileNet

# replace num_classes with the actual number of classes
net = PicoMobileNet(num_classes=10)
net.config.learning_rate = 0.01
net.config.batch_size = 32
net.config.verbosity = 1
net.config.loss = "categorical_crossentropy"
net.config.metrics = ["categorical_accuracy"]
net.config.checkpoint_path = "./checkpoints/pico"

net.build()
net.compile()

# train_x/train_y and val_x/val_y hold the training and validation data
net.fit(train_x, train_y, val_x, val_y, epochs=30)
predictions = net.predict(test_x)

And here's an extract of the generated C++ code.

uint16_t predict(float *input) {
    float *ping = arena;
    float *pong = arena + 3468;

    // conv2d (0)
    for (int16_t d = 0; d < 3; d++)
        this->conv2d_3x3x1(input, ping + 32 * 32 * d, conv2d_0_weights[d], 96, 3);

    // padding (1)
    for (int16_t d = 0; d < 3; d++)
        this->pad(ping + 32 * 32 * d, pong + 34 * 34 * d, 32);

    memcpy(ping, pong, sizeof(float) * 34 * 34 * 3);

    // depthwise (1)
    for (int16_t d = 0; d < 3; d++)
        this->depthwise_conv(ping + 34 * 34 * d, pong + 16 * 16 * d, depthwise_1_weights[d], 34, 2);

    // pointwise (1)
    for (int16_t d = 0; d < 6; d++)
        this->pointwise_conv(pong, ping + 16 * 16 * d, pointwise_1_weights[d], 16, 3);

    // padding (2)
    for (int16_t d = 0; d < 6; d++)
        this->pad(ping + 16 * 16 * d, pong + 18 * 18 * d, 16);

    memcpy(ping, pong, sizeof(float) * 18 * 18 * 6);

    // depthwise (2)
    for (int16_t d = 0; d < 6; d++)
        this->depthwise_conv(ping + 18 * 18 * d, pong + 8 * 8 * d, depthwise_2_weights[d], 18, 2);

    // pointwise (2)
    for (int16_t d = 0; d < 12; d++)
        this->pointwise_conv(pong, ping + 8 * 8 * d, pointwise_2_weights[d], 8, 6);


    // padding (3)
    for (int16_t d = 0; d < 12; d++)
        this->pad(ping + 8 * 8 * d, pong + 10 * 10 * d, 8);

    memcpy(ping, pong, sizeof(float) * 10 * 10 * 12);


    // depthwise (3)
    for (int16_t d = 0; d < 12; d++)
        this->depthwise_conv(ping + 10 * 10 * d, pong + 4 * 4 * d, depthwise_3_weights[d], 10, 2);

    // pointwise (3)
    for (int16_t d = 0; d < 24; d++)
        this->pointwise_conv(pong, ping + 4 * 4 * d, pointwise_3_weights[d], 4, 12);

    this->maxpool(ping, pong, 4, 24);

    for (uint16_t d = 0; d < numOutputs; d++)
        this->dot(pong, ping + d, conv2d_last_weights[d], conv2d_last_bias[d], 24);

    this->softmax(ping, outputs, numOutputs);

    return this->argmax();
}
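
For completeness, here is a hypothetical usage sketch. The header and class names are placeholders (not necessarily what micromobilenet emits), and the 96x96 single-channel input size is inferred from the first conv call above (input width 96, stride 3, producing 32x32 feature maps).

#include "pico_mobilenet.h"  // placeholder header name

PicoMobileNet net;  // placeholder class name

// 96x96 single-channel image, normalized to float
float input[96 * 96];

void classify() {
    // fill `input` with pixel data, then run inference
    uint16_t label = net.predict(input);
    // per-class scores sit in the `outputs` member after softmax (see above)
}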

Final considerations and benchmarks

The implementation I wrote is plain, dependency-free C++ and does not leverage any hardware-specific acceleration (e.g. the CMSIS framework). That would not be hard to integrate, though, since the code already declares primitives for the core computations (e.g. 3x3 matrix multiplication).
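
As an illustration of such a primitive (a sketch, not the generated code), a 3x3 multiply-accumulate like the one below is the natural place to plug in a vendor-optimized routine, e.g. a CMSIS-DSP dot product:

// Multiply-accumulate of a 3x3 input patch against a 3x3 kernel.
// Replacing its body with a hardware-accelerated dot product
// (e.g. arm_dot_prod_f32 from CMSIS-DSP) would speed up every conv layer.
inline float mult3x3(const float *patch, const float *kernel) {
    float acc = 0;
    for (int i = 0; i < 9; i++)
        acc += patch[i] * kernel[i];
    return acc;
}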

How fast does this perform?

Benchmarks for ESP32-S3

┌──────────────┬────────┬────────────┬──────────┬─────────────────────┐
│ Architecture │ Params │ Flash (kB) │ RAM (kB) │ Execution time (µs) │
├──────────────┼────────┼────────────┼──────────┼─────────────────────┤
│ Pico         │    844 │       4.52 │    30.55 │                2832 │
│ Nano         │  1,636 │       8.54 │    64.23 │                6543 │
│ Micro        │  4,264 │      19.75 │   132.36 │               31987 │
│ Milli        │ 11,704 │      49.70 │   162.12 │               37641 │
│ Base         │ 30,040 │     123.60 │   235.47 │               53944 │
└──────────────┴────────┴────────────┴──────────┴─────────────────────┘