Arduino ChatGPT Voice Assistant

Have you ever wanted to build your own Arduino-powered, Alexa-like smart device to control your environment with your voice? Thanks to advances in the accuracy and accessibility of Large Language Models and Speech-to-Text, it is now pretty easy to do!

This page will show you a totally free and easy way to turn any Arduino board equipped with a microphone and Wi-Fi into an intelligent, configurable voice-controlled assistant. 

How does it work?

The workflow is simple:

  1. you define the set of commands you want to recognize
  2. the board records a short chunk of audio with its microphone
  3. the board sends the audio and the commands to an external server
  4. the server uses ChatGPT-like models to transcribe the audio and match the transcription to one of the commands
  5. the server sends the result back to the board, where your code reacts to the matched command

Here's a short demo of the end result.

What are you waiting for?
Creating your ChatGPT-powered, voice-controlled assistant is as easy as clicking a button!

Download the project

Implementation

All the audio recording and board-server communication is already implemented for you. Your job is just to define the commands you want to recognize and the code to run when they are received. A few settings are also required to initialize the microphone properly.

Here's the full sketch with a lot of comments that explain what each line does. Don't be scared by the length: the lines that matter are only a few and I will detail them later.

The sketch is tailored for the Arduino Nano RP2040 Connect board and understands the following commands:

  1. raise / lower the volume of your PC
  2. turn on / off any digital pin
  3. control the built-in RGB LED
  4. type what you say as if the board were a keyboard

/**
 * Turn your Arduino board into a 
 * voice-controlled personal assistant.
 * 
 * Tested on Nano RP2040 Connect.
 */
#include <WiFiNINA.h>
#include <PluggableUSBHID.h>
#include <USBKeyboard.h>
#include "VoiceAssistant.h"


VoiceAssistant assistant;
USBKeyboard keyboard;


void setup() {
  Serial.begin(115200);
  Serial.println("Init...");

  // configure assistant

  // how long can a command last
  // try to keep this to a minimum to
  // not exceed the available RAM
  assistant.recordingDuration("3 seconds");
  
  // sampling frequency of the on-board microphone:
  // 21 kHz on the Nano RP2040 Connect, 16 kHz on the Nicla Vision.
  // other boards may use other frequencies!
  assistant.microphoneFreq("21 khz");
  
  // if you want to extend the duration
  // but you have limited RAM, you can
  // "decimate" the frequency. That is, recorded
  // audio will be sub-sampled by a given factor.
  // e.g. the following line captures audio
  // at 21 kHz, but stores only every other sample.
  // the format is "<mic frequency> / <decimation factor>"
  // (decimation factor can be a float e.g. 1.5).
  // keep only one of the two microphoneFreq() calls in your sketch
  assistant.microphoneFreq("21 khz / 2");
  
  // which language are the commands in?
  // (only the first 2 characters are used,
  // so "en" is what really matters)
  assistant.language("english");
  
  // configure the credentials for your Wi-Fi network
  assistant.wifiCredentials("SSID", "PASSWORD");

  // register custom commands
  // the general form is "<command> <argument1> <argument2>..."
  // if the value can only be from a limited set, specify the
  // values separated by |
  // edit volume will respond to "raise the volume", "lower the volume",
  // "decrease the volume"...
  assistant.addCommand("edit volume {direction:up|down}", [](Args& args) {
    // this code is executed when your voice input matches this command
    String direction = args["direction"];

    if (direction == "up")
      keyboard.media_control(KEY_VOLUME_UP);
    else if (direction == "down")
      keyboard.media_control(KEY_VOLUME_DOWN);
    else {
      Serial.print("Unknown volume control: ");
      Serial.println(direction);
    }
  });

  // if the value can assume any value, specify its type
  // most common types will be string or number
  // "set pin..." will also respond to "enable pin <pin>"
  // or "turn on pin <pin>"
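  assistant.addCommand("set pin {pin:number} to {state:on|off}", [](Args& args) {
    int pin = args["pin"].toInt();
    auto state = (args["state"] == "on") ? HIGH : LOW;

    // the pin must be configured as an output before it can be driven
    pinMode(pin, OUTPUT);
    digitalWrite(pin, state);
  });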

  // control the Nano RP2040 Connect built-in RGB LED
  assistant.addCommand("set led to {color:off|red|green|blue}", [](Args& args) {
    auto red = (args["color"] == "red") ? HIGH : LOW;
    auto green = (args["color"] == "green") ? HIGH : LOW;
    auto blue = (args["color"] == "blue") ? HIGH : LOW;
    
    pinMode(LEDR, OUTPUT);
    pinMode(LEDG, OUTPUT);
    pinMode(LEDB, OUTPUT);

    digitalWrite(LEDR, red);
    digitalWrite(LEDG, green);
    digitalWrite(LEDB, blue);
  });

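  // type what you said as if your Arduino were a keyboard
  assistant.addCommand("type {text:string}", [](Args& args) {
    // pass the text as an argument, not as the format string,
    // so that a "%" in the transcription is typed literally
    keyboard.printf("%s", args["text"].c_str());
  });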

  // handle missing match (no command matched)
  assistant.onUnknown([](String transcription) {
    Serial.print("Unknown prompt: ");
    Serial.println(transcription);
  });

  // init assistant
  while (!assistant.begin()) {
    Serial.print("Init error: ");
    Serial.println(assistant.error);
  }

  Serial.println("Init done");
}


void loop() {
  // let the assistant do its job
  assistant.loop();
  
  if (assistant.failed()) {
    Serial.print("Loop error: ");
    Serial.println(assistant.error);
    assistant.clear();
  }

  // trigger the start of recording.
  // here we wait for input in the Serial Monitor,
  // but you may use a button, for example
  if (assistant.isIdle() && Serial.available()) {
    Serial.readStringUntil('\n');
    assistant.startRecording();  
  }
}
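
If you prefer a button to the Serial Monitor, the trigger should fire once per press and not on every pass through loop() while the button is held down. The following is a minimal sketch of that idea in plain C++. PressDetector and BUTTON_PIN are hypothetical names that are not part of the project, and contact bounce is not handled here.

```cpp
// Hypothetical helper (not part of the project): reports a button
// press exactly once, on the transition from released to pressed.
class PressDetector {
 public:
  // Feed the current reading (true = pressed).
  // Returns true only on the first call after the button goes down.
  bool update(bool pressed) {
    bool justPressed = pressed && !wasPressed_;
    wasPressed_ = pressed;
    return justPressed;
  }

 private:
  bool wasPressed_ = false;
};

// Possible use inside loop(), assuming a button wired between
// BUTTON_PIN (hypothetical) and ground, with pinMode(BUTTON_PIN, INPUT_PULLUP):
//
//   static PressDetector button;
//   bool pressed = (digitalRead(BUTTON_PIN) == LOW);
//   if (button.update(pressed) && assistant.isIdle())
//     assistant.startRecording();
```

Calling update() on every pass, before checking isIdle(), keeps the detector in sync with the button even while a recording is in progress.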

Configuration

To make the system work, you have to configure the following parameters:

  • duration: how long (at most) can a voice command last?
  • frequency: what's the sampling frequency of the microphone? (can be found in the datasheet of the board)
  • language: in which language are you going to talk?
  • Wi-Fi credentials: the board needs an internet connection to work

assistant.recordingDuration("3 seconds");
assistant.microphoneFreq("21 khz / 2");
assistant.language("english");
assistant.wifiCredentials("SSID", "PASSWORD");

The duration is limited because the whole recording has to fit in memory before it is sent. Do the math:

3 seconds x 21 kHz x 2 bytes per sample = 126 kB of RAM

If your board doesn't have that much memory available, the sketch will crash. By using the frequency decimation option, you can increase the duration that fits into your board's RAM (at the cost of a slight decrease in audio quality, and thus in transcription quality).
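
To check whether a given configuration fits your board, the arithmetic above can be written as a small helper. This is a sketch in plain C++ that you can run on your PC. bufferBytes and decimate are hypothetical names, 2 bytes per sample is taken from the formula above, and the sub-sampling strategy (keep the sample at index k x factor, rounded down, with no filtering) is an assumption, since the project's actual method is not described here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed to hold one recording in RAM.
// decimation = 1 means "keep every sample".
std::size_t bufferBytes(double seconds, double sampleRateHz,
                        double decimation = 1.0,
                        std::size_t bytesPerSample = 2) {
  double storedSamples = seconds * sampleRateHz / decimation;
  return static_cast<std::size_t>(storedSamples) * bytesPerSample;
}

// Naive decimation (assumed strategy): the k-th output sample is the
// input sample at index floor(k * factor). Factors below 1 are
// treated as 1, i.e. nothing is dropped.
std::vector<std::int16_t> decimate(const std::vector<std::int16_t>& in,
                                   double factor) {
  if (factor < 1.0) factor = 1.0;

  std::vector<std::int16_t> out;
  for (std::size_t k = 0;; ++k) {
    std::size_t index = static_cast<std::size_t>(k * factor);
    if (index >= in.size()) break;
    out.push_back(in[index]);
  }
  return out;
}
```

With these definitions, bufferBytes(3, 21000) gives 126000 bytes, matching the figure above, and bufferBytes(3, 21000, 2) gives 63000 bytes, so a decimation factor of 2 lets you record for twice as long in the same amount of RAM.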

Commands

This is the core part of the entire system. Here you can define the commands that the ChatGPT-like model will try to recognize from your spoken words.

I tried to design a format that is as simple as possible, while still allowing a good degree of flexibility. Your commands should read like plain language, with variable parts enclosed in curly braces {}, optionally followed by a type hint. A type hint can either be a generic one (e.g. string or number) or a list of possible values separated by | (e.g. on|off).

Here's a list of a few examples you can use as reference.

// control a digital output
// you will need the pin number / name (let's name it "target")
// and its "state" (ON or OFF).
// By defining the "target" to be a string, it will match to
// commands like "set pin 5 to on", but also to "turn on the lamp"
assistant.addCommand("set {target:string} to {state:on|off}", ...);

// read an input
// will match to e.g. "read the temperature"
assistant.addCommand("read input {target:string}", ...);

// media controls: volume up/down
assistant.addCommand("edit volume {direction:up|down}", ...);

// media controls: play/pause
// will match to e.g. "pause the music"
assistant.addCommand("control music {action:play|pause}", ...);
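
To make the format concrete, here is a rough sketch of how a template could be broken down into its variable parts. It is plain C++ for illustration only and is not the project's parser. Placeholder, splitChoices and parseTemplate are hypothetical names, and treating a missing type hint as string is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One "{name:hint}" variable found in a command template.
struct Placeholder {
  std::string name;                  // e.g. "state"
  std::string type;                  // "string", "number" or "enum"
  std::vector<std::string> choices;  // filled only when type == "enum"
};

// Split "on|off" into {"on", "off"}.
std::vector<std::string> splitChoices(const std::string& hint) {
  std::vector<std::string> parts;
  std::size_t start = 0;
  while (true) {
    std::size_t bar = hint.find('|', start);
    if (bar == std::string::npos) {
      parts.push_back(hint.substr(start));
      break;
    }
    parts.push_back(hint.substr(start, bar - start));
    start = bar + 1;
  }
  return parts;
}

// Collect every {name:hint} placeholder, in order of appearance.
std::vector<Placeholder> parseTemplate(const std::string& tpl) {
  std::vector<Placeholder> found;
  std::size_t open = 0;
  while ((open = tpl.find('{', open)) != std::string::npos) {
    std::size_t close = tpl.find('}', open);
    if (close == std::string::npos) break;  // unbalanced brace: stop

    std::string body = tpl.substr(open + 1, close - open - 1);
    std::size_t colon = body.find(':');

    Placeholder p;
    p.name = body.substr(0, colon);
    // Assumption: a placeholder without a hint is a free-form string.
    std::string hint =
        (colon == std::string::npos) ? "string" : body.substr(colon + 1);

    if (hint == "string" || hint == "number") {
      p.type = hint;
    } else {
      p.type = "enum";
      p.choices = splitChoices(hint);
    }
    found.push_back(p);
    open = close + 1;
  }
  return found;
}
```

For "set pin {pin:number} to {state:on|off}" this yields two placeholders: pin, of type number, and state, limited to the values on and off.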

Actions

Each command needs an action: a function that is called when you say the related command. The function receives the variables that were matched along with it. The syntax may look a bit obscure if you're not familiar with C++, but you can just copy-paste from the example and edit as needed.

// define an action
assistant.addCommand("edit volume {direction:up|down}", [](Args& args) {
  // use [] operator to get an argument by name
  String direction = args["direction"];

  // returns an empty string when the key is missing
  if (args["volume"] == "")
    Serial.println("No volume set");
});
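
If you are curious about what happens behind addCommand, the pattern is a table that maps each command to a function, plus a small container for the matched arguments. The following is a simplified sketch in plain C++ and not the project's actual code. This Args class and CommandTable are hypothetical stand-ins that use std::string where the Arduino sketch uses String, and the real library may store actions differently.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the Args type used in the sketch.
class Args {
 public:
  void set(const std::string& key, const std::string& value) {
    values_[key] = value;
  }

  // Look up an argument by name; a missing key yields an empty string.
  std::string operator[](const std::string& key) const {
    auto it = values_.find(key);
    return it == values_.end() ? std::string() : it->second;
  }

 private:
  std::map<std::string, std::string> values_;
};

using Action = std::function<void(Args&)>;

// Hypothetical registry: remembers one action per command template.
class CommandTable {
 public:
  void addCommand(const std::string& command, Action action) {
    actions_[command] = std::move(action);
  }

  // Run the action registered for `command`.
  // Returns false when no such command was registered.
  bool dispatch(const std::string& command, Args& args) {
    auto it = actions_.find(command);
    if (it == actions_.end()) return false;
    it->second(args);
    return true;
  }

 private:
  std::map<std::string, Action> actions_;
};
```

In this sketch, the server's reply would be turned into a command name and an Args object, and dispatch would then call the function you registered. A false return value corresponds to the case handled by onUnknown in the full sketch.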
