Machine learning has reached its peak and is finally used everywhere. You can see face recognition systems at the airport and personalized advertisements on Facebook.

However, when we talk about combining ML with embedded devices, there is still a considerable gap.

1. We do not understand what the embedded world is

Working with ML, we are used to having enormous computational power.

For example, AlexNet requires 727 MegaFLOPs* and 235MB of memory to process a small 227x227px image. The ARM Cortex-A8 in the Google Nexus S can deliver about 66 MegaFLOPs per second. So, you have to wait ~11 seconds for a single inference. That’s a lot!
*FLOP — floating-point operation
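The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch using the two figures quoted above; the function name is made up for illustration, and it assumes the processor sustains its peak throughput, which real workloads rarely do.

```python
# Figures quoted in the text above.
ALEXNET_MFLOP = 727      # work for one forward pass, in millions of FLOPs
CORTEX_A8_MFLOPS = 66    # throughput, in millions of FLOPs per second


def inference_seconds(model_mflop: float, device_mflops: float) -> float:
    """Ideal-case latency: total work divided by peak throughput."""
    return model_mflop / device_mflops


latency = inference_seconds(ALEXNET_MFLOP, CORTEX_A8_MFLOPS)
print(f"~{latency:.1f} s per image")  # ~11.0 s per image
```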

Check out more info: “Estimates of memory consumption and FLOP counts for various CNNs” and “Floating-point performance of ARM cores and their efficiency”.

I took several ML-related courses at university. We did a lot of cool stuff in the homework. But even there, my 4GB GTX 1050 was not enough to train all the models.

A typical ML engineer rarely thinks about computational resources. Moreover, he or she seldom cares about memory usage. Why? Because it’s cheap, and even your phone has a pretty good CPU and a lot of memory.

Do you still run out of memory on your phone when it comes to photos from the last party? Imagine that you work with a TrueTouch sensing controller that has 256KB of flash memory. Yeah, 256KB. And you cannot use all of it, because the existing firmware lives there too. So, you get around 100KB. Check the size of your last model. It is probably much, much larger.
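To make the budget concrete, here is a rough sketch that counts the parameters of a small fully connected network and checks whether the weights alone fit into ~100KB. The layer sizes are hypothetical, chosen only for illustration, and the check ignores the inference code and the RAM needed for activations.

```python
FLASH_BUDGET_BYTES = 100 * 1024  # ~100KB left after firmware (figure from the text)


def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def model_bytes(n_params, bytes_per_param):
    """Storage needed for the parameters alone."""
    return n_params * bytes_per_param


# Hypothetical tiny network: 64 inputs, two hidden layers, 4 outputs.
layers = [64, 256, 64, 4]
n_params = dense_param_count(layers)

for name, width in (("float32", 4), ("int8", 1)):
    size = model_bytes(n_params, width)
    print(f"{name}: {size} bytes, fits: {size <= FLASH_BUDGET_BYTES}")
```

With these made-up sizes, the 32-bit float weights overshoot the budget while one-byte weights fit, which is one reason the bytes per parameter matter so much on such a chip.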

Is it getting interesting? Let’s continue :)

What do you think of when you hear “embedded device”? Picture it and keep that image in mind.

1. It’s any electrical machinery. Even my microwave and washing machine

Yeah, you are right!

Almost any electrical device today is an embedded device. It may have one or several controllers inside, each responsible for a specific function: touch sensing, engine condition monitoring, etc.

2. Arduino and/or Raspberry Pi

Congrats, you are right again!

They are among the most popular and widely used kits for DIY projects. And yes, they are embedded devices.

3. Jetson Nano and similar

Right again.

It’s a special developer kit (aka a mini-computer) designed precisely for running ML models. It’s extremely powerful and, to be honest, super sexy.

But there is something missing here

I have many friends who build fabulous hardware pet projects. They commonly use an Arduino or an STM32 (aka hardcore Arduino).

I know several AI engineers who are thrilled about the Jetson Nano and similar devices. They think that it’s the future of Embedded AI.

And now, please think about this: “How many of these devices are used at the production level?”

The answer: very few

Think about how many electrical devices you have in your house. Then add the tons of controllers in your car. The security system at your job. I could continue this list for a very long time.

And each of those devices has some controller. Usually, it’s miniature and super cheap. Its resources and capabilities cannot be compared with those of a Jetson or a Raspberry Pi.

Imagine that you have a microcontroller. Its main task is to process your finger touches on the screen. It has an ARM Cortex-M0 processor and 256KB of memory (of which only 80–120KB is available to you). It’s a real-time system, so you have very little time to run inference for your model, let’s say 100 microseconds. And your goal is to improve or replace some algorithm there.
Good luck, and welcome to the world of “Embedded AI.”
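To get a feeling for that deadline, it helps to translate it into CPU cycles. The sketch below assumes a hypothetical 48 MHz clock; the clock speed of the controller is not stated here, so treat the result as an order-of-magnitude illustration only.

```python
def cycle_budget(clock_hz: int, deadline_us: int) -> int:
    """CPU cycles available before the deadline (integer arithmetic)."""
    return clock_hz * deadline_us // 1_000_000


ASSUMED_CLOCK_HZ = 48_000_000  # hypothetical value, not taken from the article
DEADLINE_US = 100              # "let's say 100 microseconds" from the text

budget = cycle_budget(ASSUMED_CLOCK_HZ, DEADLINE_US)
print(budget)  # 4800
```

Under that assumption, the whole inference, including loading weights and applying activations, has to finish within a few thousand cycles.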

* “The embedded world” consists of $1–2 chips with limited resources. And this is what is used at the production level *

2. Poor Infrastructure

I worked on the project I described above. Everything was going great. I developed a small network that could potentially fit into that microcontroller.

Then the time came to move the model from my computer to the device!

1. Quantization

That processor could not do floating-point operations. Even if it could, we would not use them, because they are quite complex and take a lot of time.

So, I quantized the model weights, that is, converted a continuous range of values into a finite set of discrete values.
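As an illustration of the idea, here is a minimal sketch of symmetric quantization with one scale per tensor, written in plain Python. It is one common scheme and not necessarily the one used in the project; the function names and the sample weights are made up for the example.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers using a single per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [v * scale for v in q]


weights = [0.42, -1.0, 0.25, 0.0]  # hypothetical weights
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)

for w, qi, r in zip(weights, q, restored):
    print(f"{w:+.4f} -> {qi:4d} -> {r:+.4f}")
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the precision you trade for storing one byte per weight and using integer arithmetic only.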

And guess what? Neither PyTorch nor TensorFlow Lite fully supports it. They do not support all activation functions (and I used the pretty simple HardTanh). PyTorch could not even save the quantized model to a file.

So, I had to do it by hand.
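To show what “by hand” can mean for an activation function, here is a sketch of HardTanh in the integer domain. HardTanh is just a clamp to [-1, 1], so for an integer `q` that represents the real value `q * scale`, it becomes a clamp between two integer bounds. This is an illustrative assumption about how it could be done, not the code from the project; the function names are hypothetical.

```python
def hardtanh_float(x, lo=-1.0, hi=1.0):
    """Reference HardTanh on real values: clamp x to [lo, hi]."""
    return max(lo, min(hi, x))


def hardtanh_quantized(q, scale, lo=-1.0, hi=1.0):
    """HardTanh on an integer q that represents the real value q * scale."""
    q_lo = round(lo / scale)  # integer image of the lower bound
    q_hi = round(hi / scale)  # integer image of the upper bound
    return max(q_lo, min(q_hi, q))


scale = 1 / 64  # hypothetical scale; a power of two keeps the demo exact
for q in (-200, -64, 10, 100):
    print(q, "->", hardtanh_quantized(q, scale))
```

On the device, the two bounds would be computed once offline, so the activation costs only two integer comparisons per value.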

2. Inference code generation

You want to run your model on that controller, right? So, you need C code for model inference.

It’s sad, but you have to write it by hand. Why? PyTorch has no functionality for generating inference code. TFLite does, but it’s pretty limited and, again, does not support common activation functions.

So, again, I did it by hand.
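A small part of that manual work is turning quantized weights into C source that can be compiled into the firmware. The sketch below shows one possible way to do it with a few lines of Python; the helper names, the array name, and the weight values are all hypothetical, and a real generator would also have to emit the layer loops themselves.

```python
def emit_c_array(name, values, ctype="int8_t", per_line=8):
    """Render a list of integers as a constant C array definition."""
    rows = []
    for i in range(0, len(values), per_line):
        chunk = values[i:i + per_line]
        rows.append("    " + ", ".join(str(v) for v in chunk))
    body = ",\n".join(rows)
    return f"static const {ctype} {name}[{len(values)}] = {{\n{body}\n}};\n"


def emit_header(arrays):
    """Combine several arrays into one C header as a string."""
    parts = ["#include <stdint.h>\n"]
    parts += [emit_c_array(name, vals) for name, vals in arrays.items()]
    return "\n".join(parts)


# Hypothetical quantized weights of a single layer.
source = emit_header({"layer0_weights": [53, -127, 32, 0]})
print(source)
```

Declaring the arrays `static const` typically lets the toolchain keep them in flash instead of copying them into the much scarcer RAM.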

I faced lots of situations like this at work. It’s not a punch at PyTorch or TF, but rather a “cry for help”.

*** There is another side of the ML community that is looking for a professional tool for “Embedded AI” but can’t find it. ***

Anything good?

I see a huge interest in AI from big semiconductor companies, and they are doing remarkably valuable and important things. They deserve a separate article, so I will list just a couple of them here to keep this one short.

CMSIS-NN: efficient neural network kernels for Arm Cortex-M CPUs
Compilers that produce highly efficient inference code optimized for the hardware you have
And a lot of other cool stuff