This tutorial demonstrates the complete flow for deploying a trained neural network inside a sensor embedding the ISPU using ST Edge AI Core.
The use case is human activity recognition (HAR): accelerometer data is used to train a small neural network (NN) to recognize four activities: stationary, walking, running, and cycling.
To reproduce all the steps in this tutorial, the following hardware and software components are required:
- NUCLEO-F401RE: STM32 Nucleo board supported by ST Edge AI Core and the X-CUBE-ISPU package
- X-NUCLEO-IKS4A1: motion MEMS and environmental sensors expansion board, which must be connected to the NUCLEO-F401RE through the Arduino UNO V3 connector
- ISPU-Toolchain: toolchain to build applications for sensors embedding the ISPU, which must be added to the system PATH (for detailed steps, refer to the dedicated README).
- ST Edge AI Core: tool to easily convert pretrained AI models for integration into ST products, which must be added to the system PATH (for detailed steps, refer to the dedicated article inside the ST Edge AI Core installation folder: <stedgeai_folder>/1.0/Documentation/setting_env.html).
- MEMS Studio: desktop software solution to develop and test solutions for MEMS sensors.
- X-CUBE-ISPU: expansion package for ISPU development containing documentation, examples, and STM32 firmware for enabling communication with MEMS Studio.
- Python: Python programming language interpreter, necessary to create and train the neural network model (required Python packages are listed in requirements.txt). A version of Python ≤ 3.11 is required.
- Jupyter Notebook: interactive computing platform accessible from the web browser to run Python code.
For the first step of this tutorial, the Nucleo board (with the expansion board and adapter board mounted on top) and MEMS Studio can be used to collect multiple data logs for each of the activities.
- Mount the X-NUCLEO-IKS4A1 expansion board on top of the NUCLEO-F401RE board and connect it to a USB port of your computer. Do not plug any sensor adapter board into the DIL24 socket if it hosts a sensor with the same I2C address as the LSM6DSO16IS sensor already available on the expansion board, as an address clash prevents the firmware from working correctly.
- Flash the LSM6DSO16IS_DataLogExtended.bin firmware from the X-CUBE-ISPU package to enable communication between MEMS Studio and the sensor (this can be done simply by copying the .bin file to the board mass storage).
- Open MEMS Studio; under Connect, select Serial as Communication type and the serial port the board is connected to as Communication port, and then press Connect. Then, if not already shown, select LSM6DSO16IS in the Accelerometer sensor dropdown menu and press Select.
- Go to the Sensor Evaluation section and select the Quick Setup page to configure the sensor (Accelerometer Full Scale: 16 g; Accelerometer Output Data Rate: 26 Hz), then press the ▶ button in the top left corner to start streaming data from the sensor.
- Go to the Save to File page, select a save path for the CSV log in the data/<id>_<activity> folder (where <activity> is the desired activity name and <id> is the numeric identifier of the activity used for its classification), and select only Accelerometer in both the Data and Datalog period source sections.
- Use the Start/Stop buttons to start and stop data collection, making sure to acquire data in the most realistic setting with only one activity type per log.
Several CSV logs are already available in the data folder and are sufficient to obtain good training results; however, the user is free to add new logs or customize the activities.
These logs have been originally obtained from a public HAR dataset (Reiss, Attila. (2012). PAMAP2 Physical Activity Monitoring. UCI Machine Learning Repository. https://doi.org/10.24432/C5NW2H) and later cleaned/pre-processed for the purpose of this tutorial. The following steps have been applied to the original dataset:
- Keep only logs acquired from the wrist-mounted sensor and only from the { lying, sitting, standing, watching_TV, walking, nordic_walking, upstairs, downstairs, running, cycling } activities
- Merge { lying, sitting, standing, watching_TV } into the stationary class, and { walking, nordic_walking, upstairs, downstairs } into the walking class
- Visually inspect the data and remove start/stop segments where no task is being performed
- Keep only the accelerometer X, Y, and Z axes and convert the data to mg
- Resample the data from 100 Hz to 26 Hz using the Fourier method
- Segment the CSV logs into multiple logs of 1 minute each
- Balance the dataset by keeping the same number of logs for each class
- Export the logs in MEMS Studio CSV format
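The unit conversion, segmentation, and balancing steps above amount to simple arithmetic, sketched below in C. The sketch assumes that the original accelerometer values are in m/s² and converts them using standard gravity (9.80665 m/s²); at 26 Hz, a 1-minute log holds 60 × 26 = 1560 samples. The helper names are hypothetical, and discarding a trailing partial minute is an assumption, not a documented detail of how the provided logs were produced.

```c
#include <stddef.h>

#define STANDARD_GRAVITY_MS2 9.80665 /* m/s^2 per g */
#define LOG_ODR_HZ 26u               /* data rate after resampling */
#define LOG_DURATION_S 60u           /* each exported log lasts 1 minute */

/* Convert an acceleration from m/s^2 to mg (1 g = 1000 mg). */
static double acc_ms2_to_mg(double acc_ms2)
{
    return acc_ms2 / STANDARD_GRAVITY_MS2 * 1000.0;
}

/* Number of samples in one 1-minute log at 26 Hz. */
static size_t samples_per_log(void)
{
    return (size_t)LOG_ODR_HZ * LOG_DURATION_S;
}

/* Number of complete 1-minute logs that can be cut from a recording;
 * a trailing partial minute is discarded (assumption). */
static size_t complete_logs(size_t n_samples)
{
    return n_samples / samples_per_log();
}

/* Balancing: keep the same number of logs for every class, i.e. the
 * smallest per-class count (n_classes must be at least 1). */
static size_t balanced_logs_per_class(const size_t *logs_per_class,
                                      size_t n_classes)
{
    size_t min = logs_per_class[0];
    for (size_t i = 1; i < n_classes; i++)
        if (logs_per_class[i] < min)
            min = logs_per_class[i];
    return min;
}
```

For example, with 12, 9, 15, and 10 logs available for the four classes, balancing keeps 9 logs per class.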
Once a sufficiently large dataset has been acquired, the next step is training the model on the collected data. To make this process as easy as possible, a ready-to-use notebook (har_tutorial.ipynb) is provided.
By running all cells in this notebook, the following steps are performed:
- Load data from the CSV logs contained in the data/<id>_<activity> folders
- Segment the data into 2-second windows without overlap
- Label the windows using the corresponding <id> value
- Divide the dataset into training (63%), validation (7%), and testing (30%) sets
- Train a small 1D-CNN model (772 parameters)
- Quantize the model weights from float to int8 representation

Note: Model quantization may trade some accuracy for a significant reduction in model size; both the original and the quantized models are used in this tutorial to illustrate this trade-off.
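The windowing and splitting steps can be illustrated with a small C sketch (the notebook itself does this in Python). At 26 Hz, a 2-second window contains 52 samples, which matches the f32(1x52x3) model input reported later by ST Edge AI Core; non-overlapping windows advance by a full window length. The split arithmetic here (integer truncation for the training and validation sets, remainder to the test set) is an assumption for illustration; the notebook may round differently and typically shuffles before splitting.

```c
#include <stddef.h>

#define ODR_HZ 26u
#define WINDOW_S 2u
#define WINDOW_LEN (ODR_HZ * WINDOW_S) /* 52 samples per window */

/* Non-overlapping windows: the stride equals the window length and
 * trailing samples that do not fill a whole window are dropped. */
static size_t n_windows(size_t n_samples)
{
    return n_samples / WINDOW_LEN;
}

/* Index of the first sample of window w inside a log. */
static size_t window_start(size_t w)
{
    return w * WINDOW_LEN;
}

typedef struct {
    size_t train, val, test;
} split_counts;

/* 63% train / 7% validation / 30% test with integer arithmetic;
 * any rounding remainder goes to the test set. */
static split_counts split_dataset(size_t total_windows)
{
    split_counts s;
    s.train = total_windows * 63u / 100u;
    s.val = total_windows * 7u / 100u;
    s.test = total_windows - s.train - s.val;
    return s;
}
```

With these definitions, a 1-minute log (1560 samples) yields exactly 30 windows.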
At the end of the procedure, both the original and quantized models and the test set are saved in the output_ipynb folder.
The notebook uses Keras and TensorFlow Lite to create, train, and quantize the model, but other frameworks could be used instead (any framework whose models can be exported or converted to ONNX, such as PyTorch, or QKeras and Larq for quantization).
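TensorFlow Lite's int8 scheme is affine: a real value r is represented by an integer q such that r ≈ scale × (q - zero_point), with q saturated to [-128, 127]. The sketch below quantizes and dequantizes a single value under that scheme; the scale and zero-point values in the example are arbitrary and are not taken from the tutorial's model, which ST Edge AI Core reports as using per-channel parameters ("ss/sa per channel").

```c
#include <stdint.h>

/* Round to nearest, ties away from zero, without needing libm. */
static int32_t round_nearest(float v)
{
    return (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Affine int8 quantization: q = round(r / scale) + zero_point,
 * saturated to the int8 range before the integer conversion. */
static int8_t quantize_int8(float r, float scale, int32_t zero_point)
{
    float t = r / scale + (float)zero_point;
    if (t < -128.0f)
        t = -128.0f;
    if (t > 127.0f)
        t = 127.0f;
    return (int8_t)round_nearest(t);
}

/* Inverse mapping: r ~= scale * (q - zero_point). */
static float dequantize_int8(int8_t q, float scale, int32_t zero_point)
{
    return scale * (float)((int32_t)q - zero_point);
}
```

The round trip shows where the accuracy loss mentioned in the note comes from: values between quantization steps snap to the nearest multiple of scale, and values outside the representable range saturate.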
To customize the activities recognized by the model, the user can simply add or remove activity folders in data. For example, to add a new driving activity, the user needs to create a data/4_driving folder in which all CSV logs corresponding to that activity are placed.
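Because each class label is encoded as the numeric prefix of its folder name, a loader only needs to split the name at the first underscore. The following sketch shows that parsing in C; the function name and the choice to reject names without a numeric prefix or with an empty activity name are assumptions for illustration (the notebook performs this step in Python).

```c
#include <stddef.h>
#include <string.h>

/* Parse a folder name of the form "<id>_<activity>" (e.g. "4_driving").
 * On success, stores the numeric id, copies the activity name into
 * `activity` (truncated to activity_size - 1 chars; activity_size must
 * be at least 1) and returns 1. Returns 0 if the name does not start
 * with digits followed by '_' and a non-empty activity name. */
static int parse_activity_folder(const char *name, int *id,
                                 char *activity, size_t activity_size)
{
    const char *p = name;
    int value = 0;

    if (*p < '0' || *p > '9')
        return 0; /* no numeric prefix */
    while (*p >= '0' && *p <= '9') {
        value = value * 10 + (*p - '0');
        p++;
    }
    if (*p != '_' || p[1] == '\0')
        return 0; /* missing separator or empty activity name */

    *id = value;
    strncpy(activity, p + 1, activity_size - 1);
    activity[activity_size - 1] = '\0';
    return 1;
}
```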
To be able to run the provided Jupyter Notebook, please follow these steps:
- (Optional) Create a Python virtual environment using venv, Anaconda/Miniconda, or any other similar tool to avoid conflicts with previously installed Python packages.
- Install the required modules by running the following shell command from the folder of this tutorial:
pip install -r requirements.txt
- Start the Jupyter server by running:
jupyter notebook
and open the URL displayed in your shell in a web browser. Alternatively, a Jupyter extension is available for VS Code, which takes care of starting the Jupyter server in the background.
- Run all the Jupyter Notebook cells.
With the NN model trained, ST Edge AI Core can now be used to integrate it inside the ISPU and to inspect its behavior, in order to develop an accurate and reliable application.
To help decide which model is better suited for ISPU integration, the analyze command can be used to obtain useful information about its memory footprint and number of operations.
By running the following command, the Keras (.h5) model can be analyzed:
stedgeai analyze --target ispu --model output_ipynb/cnn_8x8x8.h5 --no-workspace
Here is the summary from the st_ai_output/network_analyze_report.txt report:
Exec/report summary (analyze)
----------------------------------------------------------------------------------------------------------------
model file : st-mems-ispu\tutorials\st_edge_ai_core\output_ipynb\cnn_8x8x8.h5
type : keras
c_name : network
options : use-lite-runtime, use-st-ai
optimization : balanced
target/series : ispu
workspace dir : st-mems-ispu\tutorials\st_edge_ai_core\st_ai_ws
output dir : st-mems-ispu\tutorials\st_edge_ai_core\st_ai_output
model_fmt : float
model_name : cnn_8x8x8
model_hash : 0x22073547509dd4abcf6eaa3b928e3e2b
params # : 724 items (2.83 KiB)
----------------------------------------------------------------------------------------------------------------
input 1/1 : 'input', f32(1x52x3), 624 Bytes, user
output 1/1 : 'output', f32(1x4), 16 Bytes, user
macc : 12,960
weights (ro) : 2,704 B (2.64 KiB) (1 segment) / -192(-6.6%) vs float model
activations (rw) : 1,664 B (1.62 KiB) (1 segment)
ram (total) : 2,304 B (2.25 KiB) = 1,664 + 624 + 16
----------------------------------------------------------------------------------------------------------------
If the ISPU-Toolchain is detected on the system PATH, ST Edge AI Core uses it to estimate the memory footprint of the converted model, also taking its code size into account:
Summary - "ispu" target
-----------------------------------------------------------
Code RAM (ro) %* Data RAM (rw) %
-----------------------------------------------------------
RT total 10,860 80.1% 872 27.5%
-----------------------------------------------------------
TOTAL 13,564 3,176
-----------------------------------------------------------
* rt/total
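The figures in these reports are related by simple arithmetic, which helps when reading them: the f32(1x52x3) input occupies 52 × 3 × 4 = 624 bytes, the f32(1x4) output occupies 16 bytes, and ram (total) is activations + input + output. In the target summary, TOTAL adds the weights to the runtime code RAM and the model RAM to the runtime data RAM. The check below only restates the report values with hypothetical helpers; it makes no claim about how ST Edge AI Core computes them internally.

```c
#include <stddef.h>

/* Size in bytes of a float32 tensor with the given number of elements. */
static size_t f32_tensor_bytes(size_t n_elements)
{
    return n_elements * sizeof(float);
}

/* ram (total) = activations + input + output, as in the analyze report. */
static size_t ram_total(size_t activations, size_t input, size_t output)
{
    return activations + input + output;
}

/* Share of a total in tenths of a percent, rounded (integer only). */
static unsigned permille(size_t part, size_t total)
{
    return (unsigned)((part * 1000u + total / 2u) / total);
}
```

For instance, permille(10860, 13564) gives 801, i.e. the 80.1% runtime share of code RAM shown in the summary.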
Alternatively, for the quantized TFLite (.tflite) model, run the following command:
stedgeai analyze --target ispu --model output_ipynb/qcnn_8x8x8.tflite --no-workspace
Here is the summary from the st_ai_output/network_analyze_report.txt report:
Exec/report summary (analyze)
---------------------------------------------------------------------------------------------------------------------
model file : st-mems-ispu\tutorials\st_edge_ai_core\output_ipynb\qcnn_8x8x8.tflite
type : tflite
c_name : network
options : use-lite-runtime, use-st-ai
optimization : balanced
target/series : ispu
workspace dir : st-mems-ispu\tutorials\st_edge_ai_core\st_ai_ws
output dir : st-mems-ispu\tutorials\st_edge_ai_core\st_ai_output
model_fmt : ss/sa per channel
model_name : qcnn_8x8x8
model_hash : 0xe01fcc727cab1f8e766f97e15187cf95
params # : 676 items (760 B)
---------------------------------------------------------------------------------------------------------------------
input 1/1 : 'serving_default_input0', f32(1x52x3), 624 Bytes, user
output 1/1 : 'conversion_21', f32(1x4), 16 Bytes, user
macc : 12,552
weights (ro) : 760 B (760 B) (1 segment) / -1,944(-71.9%) vs float model
activations (rw) : 1,008 B (1008 B) (1 segment)
ram (total) : 1,648 B (1.61 KiB) = 1,008 + 624 + 16
---------------------------------------------------------------------------------------------------------------------
Summary - "ispu" target
----------------------------------------------------------
Code RAM (ro) %* Data RAM (rw) %
----------------------------------------------------------
RT total 14,546 95.0% 4 0.2%
----------------------------------------------------------
TOTAL 15,306 1,652
----------------------------------------------------------
* rt/total
Comparing the reports before and after the quantization:
- macc: 12,960 → 12,552
- weights: 2,704 B → 760 B
- code RAM (ro): 13,564 B → 15,306 B
- data RAM (rw): 3,176 B → 1,652 B
After quantization, the network performs about the same number of MACC operations, and the weights occupy ~72% less memory. However, code RAM (ro), where the weights are stored, is even larger than before: the additional runtime code needed to perform quantized inference outweighs the weight savings for such a small network. Data RAM (rw), instead, shows a significant reduction of ~48% (from 3,176 B to 1,652 B).
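The relative changes can be reproduced from the two reports. The sketch below computes (new - old) / old in tenths of a percent using integer arithmetic; the helper is hypothetical and simply restates the report values.

```c
#include <stdint.h>

/* Relative change from old_val to new_val in tenths of a percent,
 * rounded to nearest with ties away from zero (-719 means -71.9%). */
static int32_t change_permille(int32_t old_val, int32_t new_val)
{
    int32_t num = (new_val - old_val) * 1000;
    return (num >= 0 ? num + old_val / 2 : num - old_val / 2) / old_val;
}
```

Applied to the comparison list, this gives -71.9% for weights, -48.0% for data RAM, +12.8% for code RAM, and -3.1% for MACC.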
Note: The larger the model, the smaller the impact of the code overhead on the final code size; for this reason, it is always recommended to run the analyze command on the model before proceeding further in the development process with ST Edge AI Core.
Given these results, the next steps in the tutorial focus only on the unquantized float model. In this case, the quantized model would be the better choice only if the application were constrained by data RAM size.
The actual conversion step is performed using the generate command, which enables the user to easily generate a C library optimized for the ISPU architecture from the trained model.
By running the following command, the Keras (.h5) model can be converted:
stedgeai generate --target ispu --model output_ipynb/cnn_8x8x8.h5 --output generated --no-workspace --no-report
This operation creates, inside the specified output folder (generated), the C files (.c/.h) containing model-specific code and data, and a C runtime library (.h/.a); together, they make model inference possible on the ISPU.
To evaluate model performance and the correctness of the conversion performed by ST Edge AI Core, the validate command can be used.
The command offers various functionalities. In its basic form, when only the model is provided with no extra arguments, the tool generates random data, feeds it to both the original and the converted model, and checks that their predictions coincide; alternatively, the user can provide the input/output data directly for more control over the frequencies of the predicted classes.
Another useful option is validation on target, which checks the correctness of the converted model on the final target (LSM6DSO16IS) and measures its execution time. Before doing so, the generated C model must be copied inside the template_stedgeai_validate project, which must then be compiled to create a sensor configuration (.ucf) file:
- First, copy the template_stedgeai_validate project available in the examples folder of this repository:
cp -r ../../examples/ism330is_lsm6dso16is/template_stedgeai_validate/ispu ispu_validation
- Then, copy the content of generated inside the template project:
cp -r generated/* ispu_validation
- Lastly, compile the project using make to generate the sensor configuration (.ucf) file and copy it to the output folder:
make -C ispu_validation/make
cp ispu_validation/make/bin/ispu.ucf output/har_validate.ucf
Note: Make sure you have the ISPU-Toolchain correctly set up on your system to be able to compile the code for the ISPU architecture.
Finally, once all the previous steps are completed, the validation on target can be performed by connecting the Nucleo board to a USB port of the PC and running the validate command, specifying the model (.h5), the sensor configuration (.ucf), and the validation data (.npz) as arguments:
stedgeai validate --target ispu --mode target --model output_ipynb/cnn_8x8x8.h5 --valinput output_ipynb/har_testset.npz --ucf output/har_validate.ucf --no-workspace
The st_ai_output/network_validate_report.txt report includes information about the execution time and model accuracy:
ST.AI Profiling results v2.0 - "network"
---------------------------------------------------------------
nb sample(s) : 647
duration : 70.225 ms by sample (70.055/70.483/0.093)
macc : 12960
cycles/MACC : 27.09
---------------------------------------------------------------
The most important information here is the duration field, which reports that model inference takes ~70 ms to execute; this is an important parameter to know for running the model in real time.
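The profiling figures are consistent with each other: cycles/MACC × MACC gives the number of clock cycles per inference, and dividing by the clock frequency gives the duration. The sketch below assumes a 5 MHz ISPU clock, which is the value that makes the reported numbers agree (check the sensor configuration for the actual setting), and also computes the duration of the 52-sample input window at 26 Hz. The helper names are hypothetical.

```c
/* Estimated inference time in milliseconds from the profiling figures. */
static double inference_ms(double cycles_per_macc, double macc,
                           double clock_hz)
{
    return cycles_per_macc * macc / clock_hz * 1000.0;
}

/* Clock frequency implied by a measured duration (in ms). */
static double implied_clock_hz(double cycles_per_macc, double macc,
                               double duration_ms)
{
    return cycles_per_macc * macc / (duration_ms / 1000.0);
}

/* Duration in milliseconds of a window of `samples` at `odr_hz`. */
static double window_ms(double samples, double odr_hz)
{
    return samples / odr_hz * 1000.0;
}
```

With these numbers, 27.09 × 12,960 ≈ 351,000 cycles per inference, or about 70.2 ms at 5 MHz, which is roughly 3.5% of each 2-second window.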
Evaluation report (summary)
---------------------------------------------------------------------------------------------------------------------------------------------
Output acc rmse mae l2r mean std nse cos tensor
---------------------------------------------------------------------------------------------------------------------------------------------
TARGET c-model #1 98.61% 0.0688044 0.0119832 0.1396483 0.0000000 0.0688177 0.9747615 0.9904999 output, (4,), m_id=[16]
original model #1 98.61% 0.0688044 0.0119832 0.1396483 -0.0000000 0.0688177 0.9747615 0.9904999 output, (4,), m_id=[16]
X-cross #1 100.00% 0.0000000 0.0000000 0.0000001 0.0000000 0.0000000 1.0000000 1.0000000 output, (4,), m_id=[16]
---------------------------------------------------------------------------------------------------------------------------------------------
The evaluation report contains three rows of results, with the following meaning:
- TARGET c-model: performance of the converted model using the provided outputs as ground truth
- original model: performance of the original model using the provided outputs as ground truth
- X-cross: performance of the converted model using the original model outputs as ground truth
In this case, the performance of the converted model is practically the same as that of the original model (for more details on validation metrics, refer to the dedicated article inside the ST Edge AI Core installation folder: <stedgeai_folder>/1.0/Documentation/evaluation_metrics.html).
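As a rough illustration of what the acc and mae columns measure, the sketch below compares two sets of 4-class output vectors: accuracy as the fraction of samples whose highest-scoring class matches the reference, and mean absolute error over all output values. These are common textbook definitions and an assumption here; the exact definitions used by ST Edge AI Core are given in evaluation_metrics.html. For the X-cross row, the reference would be the original model's outputs instead of the ground truth.

```c
#include <stddef.h>

#define N_CLASSES 4

/* Index of the largest value in a score vector. */
static size_t argmax4(const float v[N_CLASSES])
{
    size_t best = 0;
    for (size_t i = 1; i < N_CLASSES; i++)
        if (v[i] > v[best])
            best = i;
    return best;
}

/* Fraction of samples whose predicted class matches the reference. */
static double accuracy(const float (*pred)[N_CLASSES],
                       const float (*ref)[N_CLASSES], size_t n)
{
    size_t hits = 0;
    for (size_t s = 0; s < n; s++)
        if (argmax4(pred[s]) == argmax4(ref[s]))
            hits++;
    return n ? (double)hits / (double)n : 0.0;
}

/* Mean absolute error over all n * N_CLASSES output values. */
static double mean_abs_error(const float (*pred)[N_CLASSES],
                             const float (*ref)[N_CLASSES], size_t n)
{
    double sum = 0.0;
    for (size_t s = 0; s < n; s++)
        for (size_t i = 0; i < N_CLASSES; i++) {
            double d = (double)pred[s][i] - (double)ref[s][i];
            sum += d < 0.0 ? -d : d;
        }
    return n ? sum / (double)(n * N_CLASSES) : 0.0;
}
```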
To accommodate users who prefer a graphical interface over the command line, ST Edge AI Core has been integrated in MEMS Studio.
After opening MEMS Studio, go to the Advanced Features section and select the ISPU Model Converter page. The lower portion of the page contains four buttons that open subpages corresponding to the main functionalities of ST Edge AI Core:
- Load NN model / Generate: generate a C library optimized for the ISPU architecture from the trained model
- Analyze: obtain useful information regarding the model's memory footprint and the number of operations
- Validate: evaluate model performance and the correctness of the conversion performed by ST Edge AI Core
  - On host: run the validation on the user's computer
  - On target: run the validation directly on the ISPU (note: requires the Nucleo board to be flashed with the nucleo_f401re_ispu_stedgeai_validate.bin firmware)
- Benchmark: assess all benchmark results (original model, c-model, and X-cross) from both host and target validations
After the validation phase, if the results are satisfactory, the next step is the integration inside the ISPU firmware and the implementation of the actual logic of the application.
The first step is to copy the template_stedgeai project and add the C code generated from the model:
cp -r ../../examples/ism330is_lsm6dso16is/template_stedgeai/ispu ispu_integration
cp -r generated/* ispu_integration
Then, for the actual integration, the main.c template must be modified to perform the following steps:
- Read accelerometer data and store it inside a buffer of length 52 samples x 3 axes
- Once the buffer is full, run the model inference to obtain the prediction
- Write the model prediction in the ISPU output registers
The integration template is application-agnostic; therefore, as a first step, all relevant information, such as the sensor settings, must be added:
#define ACC_ODR 26 // [Hz]
#define ACC_FS 8 // [g]
#define ACC_SENS 0.244 // [mg/LSB]
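ACC_SENS must match ACC_FS. For the LSM6DSO16IS accelerometer, the datasheet sensitivity doubles with each full-scale step (0.061, 0.122, 0.244, and 0.488 mg/LSB at ±2, ±4, ±8, and ±16 g). The sketch below encodes that relationship and checks at compile time that the window length derived from ACC_ODR matches the 52-sample model input; the WINDOW_S macro and the lookup function are hypothetical additions, not part of the template.

```c
#define ACC_ODR 26 // [Hz], as in main.c
#define ACC_FS 8   // [g], as in main.c
#define WINDOW_S 2 // hypothetical: window duration in seconds

/* The window length in samples must match the model input height (52). */
_Static_assert(ACC_ODR * WINDOW_S == 52, "window must hold 52 samples");

/* Accelerometer sensitivity in mg/LSB for a full scale in g,
 * or 0.0f for an unsupported full scale. */
static float acc_sensitivity_mg_per_lsb(int fs_g)
{
    switch (fs_g) {
    case 2:  return 0.061f;
    case 4:  return 0.122f;
    case 8:  return 0.244f;
    case 16: return 0.488f;
    default: return 0.0f;
    }
}
```

As a sanity check, the largest positive raw value, 32767 LSB, corresponds to 32767 × 0.244 ≈ 7995 mg, just under the 8 g full scale.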
Next, the variables needed to implement the application logic must be declared; in this case, an array of label strings, the number of samples currently stored in the input buffer, and the model prediction value:
static const char *labels[] = { "stationary", "walking", "running", "cycling" };
static uint8_t win_cnt;
static int8_t prediction;
To ensure correct functioning, the initialization logic must be placed inside the algo_00_init function:
void __attribute__ ((signal)) algo_00_init(void)
{
(void)stai_runtime_init(); // initialize the runtime library
(void)stai_network_init(net); // initialize the network context
init_network_buffers(net, input_buffers, output_buffers);
// initialize state variables
win_cnt = 0;
prediction = 0;
}
Most notably, the stai_runtime_init, stai_network_init, and init_network_buffers functions initialize the runtime library, the network context, and the internal buffer pointers, respectively.
After initialization, all the logic of the application, including running the model, should be placed in the algo_00 function, which is called every time new data is ready to be read:
void __attribute__ ((signal)) algo_00(void)
{
// ispu output registers base address
uint32_t addr = ISPU_DOUT_00;
// reinterpret input buffer as a multi-dimensional array of shape {1,52,3}
float (*input)[STAI_NETWORK_IN_1_HEIGHT][STAI_NETWORK_IN_1_CHANNEL] =
(float (*)[STAI_NETWORK_IN_1_HEIGHT][STAI_NETWORK_IN_1_CHANNEL])input_buffers[0];
// reinterpret output buffer as a multi-dimensional array of shape {1,4}
float (*output)[STAI_NETWORK_OUT_1_CHANNEL] =
(float (*)[STAI_NETWORK_OUT_1_CHANNEL])output_buffers[0];
Here we declare three important variables:
- addr: address of the first unused ISPU output register, used to share results with the host microcontroller or microprocessor
- input: pointer to the model input buffer where new accelerometer samples are stored
- output: pointer to the model output buffer where new model predictions are stored
// read accelerometer data and place it inside input buffer
input[0][win_cnt][0] = cast_sint16_t(ISPU_ARAW_X) * ACC_SENS;
input[0][win_cnt][1] = cast_sint16_t(ISPU_ARAW_Y) * ACC_SENS;
input[0][win_cnt][2] = cast_sint16_t(ISPU_ARAW_Z) * ACC_SENS;
Here, new accelerometer data is read, converted from LSB to mg (this is important since the model was trained on data in mg), and placed inside the model input buffer.
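On the ISPU, cast_sint16_t reads a signed 16-bit value directly from the sensor register space. Off-target, the same conversion can be reproduced from the two raw bytes of an axis; the sketch below assumes the little-endian (low byte first) output register layout of ST MEMS sensors, and the helper name is hypothetical.

```c
#include <stdint.h>

#define ACC_SENS 0.244f // [mg/LSB] at 8 g full scale, as in main.c

/* Combine the low and high register bytes of one axis into a signed
 * 16-bit sample (two's complement) and convert it to mg. */
static float raw_bytes_to_mg(uint8_t lo, uint8_t hi)
{
    int32_t v = ((int32_t)hi << 8) | lo; /* 0 .. 65535 */
    if (v >= 32768)
        v -= 65536; /* apply the sign of the 16-bit value */
    return (float)v * ACC_SENS;
}
```

For example, a raw value of 0x1000 (4096 LSB) converts to about 999.4 mg, i.e. roughly 1 g.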
// write accelerometer data to output registers
for (uint8_t i = 0; i < STAI_NETWORK_IN_1_CHANNEL; i++, addr += sizeof(float))
cast_float(addr) = input[0][win_cnt][i];
Additionally, we can write accelerometer data to the ISPU output registers for debugging purposes.
After incrementing the window counter, if the number of samples in the input buffer equals the window length, the model inference can be run:
// increment count and check if input buffer is ready
if (++win_cnt == STAI_NETWORK_IN_1_HEIGHT) {
win_cnt = 0;
// run model inference
stai_network_run(net, STAI_MODE_SYNC);
// prediction corresponds to the output with the highest probability
float max_prob = -1.0f;
for (uint8_t i = 0; i < STAI_NETWORK_OUT_1_CHANNEL; i++) {
if (output[0][i] > max_prob) {
max_prob = output[0][i];
prediction = i;
}
}
}
The stai_network_run function does all the work by forwarding the input buffer through all layers of the network and updating the output buffer.
To get the actual prediction from the softmax output layer, some logic must be implemented to find the index of the output with the highest probability.
// write prediction results to output registers
for (uint8_t i = 0; i < STAI_NETWORK_OUT_1_CHANNEL; i++, addr += sizeof(float))
cast_float(addr) = output[0][i];
strcpy((char *)addr, labels[prediction]);
// interrupt generation
int_status = int_status | 0x1u;
}
Finally, the prediction probabilities and the predicted class label can be written to the ISPU output registers before triggering an interrupt; the host microcontroller or microprocessor receiving the interrupt then retrieves this data and acts accordingly.
Once the changes have been made to main.c, the project can be compiled to generate the sensor configuration (.ucf/.h):
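The window-counting and argmax logic of algo_00 can be exercised off-target. The sketch below mirrors it on a PC: each call stores one sample, and every 52nd call runs an inference step and selects the highest-scoring class. The fake_inference function is a hypothetical placeholder for stai_network_run that writes a fixed score vector, so only the control flow and the argmax selection correspond to the ISPU code.

```c
#include <stddef.h>
#include <stdint.h>

#define WIN_LEN 52
#define N_AXES 3
#define N_OUT 4

static float input[WIN_LEN][N_AXES];
static float output[N_OUT];
static uint8_t win_cnt;
static int8_t prediction;
static unsigned inference_count;

/* Index of the highest score, as in algo_00 (scores are assumed to be
 * non-negative softmax probabilities, hence the -1.0f start value). */
static int8_t argmax(const float *scores, uint8_t n)
{
    float max_prob = -1.0f;
    int8_t best = 0;
    for (uint8_t i = 0; i < n; i++) {
        if (scores[i] > max_prob) {
            max_prob = scores[i];
            best = (int8_t)i;
        }
    }
    return best;
}

/* Hypothetical stand-in for stai_network_run: fixed output scores. */
static void fake_inference(void)
{
    static const float scores[N_OUT] = { 0.1f, 0.2f, 0.6f, 0.1f };
    for (size_t i = 0; i < N_OUT; i++)
        output[i] = scores[i];
    inference_count++;
}

/* Mirrors algo_00: store one sample, run inference on a full window. */
static void on_new_sample(float x, float y, float z)
{
    input[win_cnt][0] = x;
    input[win_cnt][1] = y;
    input[win_cnt][2] = z;
    if (++win_cnt == WIN_LEN) {
        win_cnt = 0;
        fake_inference();
        prediction = argmax(output, N_OUT);
    }
}
```

Because the comparison is a strict greater-than, ties resolve to the lowest class index.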
make -C ispu_integration/make
cp ispu_integration/make/bin/ispu.ucf output/har_tutorial.ucf
cp ispu_integration/make/bin/ispu.h output/har_tutorial.h
For convenience, a reference ISPU project integrating the HAR model is already provided in the ispu folder, and the prebuilt sensor configuration files can be found in output.
For debugging purposes, the content of the ISPU output registers can be easily read and interpreted by MEMS Studio in real time.
To correctly visualize the outputs, a JSON file with the same name as the sensor configuration (.ucf) file and located in the same directory is needed. This file must contain a list of description objects, each specifying:
- name: name of the output
- type: type of the output
- size (optional): size of the output in number of values (defaults to 1)
In this case, the following file correctly formats the outputs:
{
"output": [
{
"name": "Acc x [mg]",
"type": "float"
},
{
"name": "Acc y [mg]",
"type": "float"
},
{
"name": "Acc z [mg]",
"type": "float"
},
{
"name": "Output stationary",
"type": "float"
},
{
"name": "Output walking",
"type": "float"
},
{
"name": "Output running",
"type": "float"
},
{
"name": "Output cycling",
"type": "float"
},
{
"name": "Prediction",
"type": "char",
"size": 10
}
]
}
A pre-made output format file is already available in the output folder.
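On the host side, the output area can be decoded following the same layout that the JSON file describes: seven 4-byte floats (three accelerometer axes and four class outputs) followed by a 10-character label. The sketch below assumes that the output registers, starting from the first one written by algo_00, have already been read into a byte buffer, and that host and ISPU share the same little-endian IEEE 754 float representation; the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_ACC 3
#define N_CLASSES 4
#define LABEL_LEN 10 /* "size": 10 in the JSON file */

/* Total bytes to decode: 7 floats + 10 chars. */
#define HAR_OUTPUT_BYTES ((N_ACC + N_CLASSES) * sizeof(float) + LABEL_LEN)

typedef struct {
    float acc_mg[N_ACC];
    float scores[N_CLASSES];
    char label[LABEL_LEN + 1]; /* +1 for a terminator added on the host */
} har_output;

/* Decode the raw output bytes in the order written by algo_00. */
static void decode_har_output(const uint8_t *buf, har_output *out)
{
    size_t off = 0;
    for (size_t i = 0; i < N_ACC; i++, off += sizeof(float))
        memcpy(&out->acc_mg[i], buf + off, sizeof(float));
    for (size_t i = 0; i < N_CLASSES; i++, off += sizeof(float))
        memcpy(&out->scores[i], buf + off, sizeof(float));
    /* "stationary" fills all 10 chars, so a 10-byte read does not
     * include the NUL written by strcpy: terminate explicitly. */
    memcpy(out->label, buf + off, LABEL_LEN);
    out->label[LABEL_LEN] = '\0';
}
```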
Finally, MEMS Studio can be used to upload and test the sensor configuration (.ucf) containing the ISPU program:
- Make sure the Nucleo board has been flashed with the LSM6DSO16IS_DataLogExtended.bin firmware (note: this is the same firmware used before for data logging)
- Connect the board, go to the Sensor Evaluation section, and then select the Quick Setup page
- Click the Load configuration file button to open a file dialog window and select the har_tutorial.ucf file to load the configuration inside the ISPU
- Once the upload is completed, go to the Data Monitor page and press the ▶ button in the top left corner to start reading the ISPU Output results from the ISPU program
More information: http://www.st.com
Copyright © 2024 STMicroelectronics