A web-based solution to control a swarm of Raspberry Pis, featuring a real-time dashboard, a deep learning inference engine, 1-click Cloud deployment, and dataset labeling tools.

This is the first article of the three-part SorterBot series.

Part 1 — General project description and the Web Application
Part 2 — Controlling the Robotic Arm
Part 3 — Transfer Learning and Cloud Deployment

Source code on GitHub:

Control Panel: Django backend and React frontend, running on EC2
Inference Engine: Object Recognition with PyTorch, running on ECS
Raspberry: Python script to control the Robotic Arm
Installer: AWS CDK, GitHub Actions and a bash script to deploy the solution
LabelTools: Dataset labeling tools with Python and OpenCV

I recently completed an AI mentorship program at SharpestMinds, of which the central element was to build a project, or even better, a complete product.

I choose the latter, and in this article, I write about what I built, how I built it, and what I learned along the way. Before we get started, I would like to send a special thanks to my mentor, Tomas Babej (CTO@ProteinQure) for his invaluable help during this journey.

When thinking about what to build, I came up with an idea of a web-based solution to control a swarm of Raspberry Pis, featuring a real-time dashboard, a deep learning inference engine, 1-click Cloud deployment, and dataset labeling tools. The Raspberry Pis can have any sensors and actuators attached to them.

They collect data, send it to the inference engine, which processes it and turns it into commands that the actuators can execute. A control panel is also included to manage and monitor the system, while the subsystems communicate with each other using either WebSockets or REST API calls.

As an implementation of the above general idea, I built SorterBot, where the sensor is a camera, and the actuators are a robotic arm and an electromagnet. This solution is able to automatically sort metal objects based on how they look.

When the user starts a session, the arm scans the area in front of it, locates the objects and containers within its reach, then automatically divides the objects into as many groups as many containers were found. Finally, it moves the objects to their corresponding containers.

SorterBot automatically picks up objects

To process the images taken by the arm’s camera, I built an inference engine based on Facebook AI’s Detectron2 framework. When a picture arrives for processing, it localizes the items and containers on that image, then saves the bounding boxes to the database.

After the last picture in a given session is processed, the items are clustered into as many groups as many containers were found. Finally, the inference engine generates commands, which are instructing the arm to move the similar-looking items into the same container.

To make it easier to control and monitor the system, I built a control panel, using React for the front-end and Django for the back-end. The front end shows a list of registered arms, allows the user to start a session, and also shows existing sessions with their statuses.

Under each session, the user can access the logically grouped logs, as well as before and after overview images of the working area. To avoid paying for AWS resources unnecessarily, the user also has the option to start and stop the ECS cluster where the inference engine runs, using a button in the header.

To make it easier for the user to see what the arm is doing, I used OpenCV to stitch together the pictures that the camera took during the session.

Additionally, another set of pictures are taken after the arm moved the objects to the containers, so the user can see a before/after overview of the area and verify that the arm actually moved the objects to the containers.

Overview image made of the session images stitched together

The backend communicates with the Raspberry Pis via WebSockets and REST calls, handles the database and controls the inference engine. To enable real-time updates from the backend as they happen, the front-end also communicates with the back-end via WebSockets.

Since the solution consists of many different AWS resources and it is very tedious to manually provision them, I automated the deployment process utilizing AWS CDK and a lengthy bash script.

To deploy the solution, 6 environment variables have to be set, and a single bash script has to be run. After the process finishes (which takes around 30 minutes), the user can log in to the control panel from any web browser and start using the solution.

The Web Application

Conceptually the communication protocol has two parts. The first part is a repeated heartbeat sequence that the arm runs at regular intervals to check if everything is ready for a session to be started. The second part is the session sequence, responsible for coordinating the execution of the whole session across subsystems.

Diagram illustrating how the different parts of the solution communicate with each other

Heartbeat Sequence

The point where the execution of the first part starts is marked with a green rectangle. As the first step, the Raspberry Pi pings the WebSocket connection to the inference engine.

If the connection is healthy, it skips over to the next part. If the inference engine appears to be offline, it requests its IP address from the control panel.

After the control panel returns the IP (or ‘false’ if the inference engine is actually offline), it tries to establish a connection with the new address. This behavior enables the inference engine to be turned off when it’s not in use, which lowers costs significantly. It also simplifies setting up the arms, which is especially important when multiple arms are used.

Regardless if the connection with the new IP succeeds or not, the result gets reported to the control panel alongside the arm’s ID. When the control panel receives the connection status, it first checks if the arm ID is already registered in the database, and registers it if needed.

After that, the connection status is pushed to the UI, where a status LED lights up in green or orange, representing whether the connection succeeded or not, respectively.

An arm as it appears on the UI, with the start button and status light

On the UI, next to the status LED, there is a ‘play’ button. When the user clicks this button, the arm’s ID is added to a list in the database that contains the IDs of the arms that should start a session.

When an arm checks in with the connection status, and that status is green, it checks if its ID is in that list. If it is, the ID gets removed and a response is sent back to the arm to start a session. If it isn’t, a response is sent back to restart the heartbeat sequence without starting a session.

Session Sequence

The first task of the arm is to take pictures for inference. To do that, the arm moves to inference position then starts to rotate at its base. It stops at certain intervals, then the camera takes a picture, which is directly sent to the inference engine as bytes, using the WebSocket connection.

High-level diagram of the Inference Engine

When the image data is received from the Raspberry Pi, the image processing begins. First, the image is decoded from bytes, then the resulting NumPy array is used as the input of the Detectron2 object recognizer.

The model outputs bounding box coordinates of the recognized objects alongside their classes. The coordinates are relative distances from the top-left corner of the image measured in pixels.

Only binary classification is done here, meaning an object can be either an item or a container. Further clustering of items is done in a later step. At the end of the processing, the results are saved to the PostgreSQL database, then the images are written to disk to be used later by the vectorizer, and archived to S3 for later reference.

Saving and uploading the image is not in the critical path, so they are executed in a separate thread. This lowers execution time as the sequence can continue before the upload finishes.

When evaluating models in Detectron2’s model zoo, I choose Faster R-CNN R-50 FPN, as it provides the lowest inference time (43 ms), lowest training time (0.261 s/iteration), and lowest training memory consumption (3.4 GB), without giving up too much accuracy (41.0 box AP, which is 92.5% of the best network’s box AP), compared to other available architectures.

After all of the session images have been processed and the signal to generate session commands arrived, stitching together these pictures starts on a separate process, providing a ‘before’ overview for the user.

Parallel to this, all the image processing results belonging to the current session are loaded from the database. First, the coordinates are converted to absolute polar coordinates using an arm-specific constant sent with the request.

The constant, r represents the distance between the center of the image and the arm’s base axis. The relative coordinates (x and y on the drawing below) are pixel distances from the top-left corner of the image.

The angle where the image was taken is denoted with γ. Δγ represents the difference between the angle of the given item and the image’s center and can be calculated using equation 1) on the drawing below.

The first absolute polar coordinate of the item (angle, γ’), can be simply calculated using this equation: γ’ = γ + Δγ. The second coordinate (radius, r’), can be calculated using equation 2) on the drawing.

Drawing and equations used to convert relative coordinates to absolute polar coordinates

After the conversion of the coordinates, the bounding boxes belonging to the same physical objects are replaced by their averaged absolute coordinates.

In the preprocessing step for the vectorizer, the images saved to disk during the previous step are loaded, then cropped around the bounding boxes of each object, resulting in a small picture of every item.

Example of an object cropped around its bounding box

These pictures are converted to tensors, then added to a PyTorch dataloader. Once all the images are cropped, the created batch is processed by the vectorizer network.

The chosen architecture is a ResNet18 model, which is appropriate for these small-sized images. A PyTorch hook is inserted after the last fully connected layer, so in each inference step the output of that layer, a 512-dimensional feature vector is copied to a tensor outside of the network.

After the vectorizer processed all of the images, the resulting tensor is directly used as input of the K-Means clustering algorithm. For the other required input, the number of clusters to be computed, a simple count of the recognized containers is inserted from the database.

This step outputs a set of pairings, representing which item goes to which container. Lastly, these pairings are replaced with absolute coordinates that are sent to the robotic arm.

The commands are pairs of coordinates representing items and containers. The arm executes these one by one, moving the objects to the containers using the electromagnet.

After the objects were moved, the arm takes another set of pictures to be stitched, as an overview of the landscape after the operation. Finally, the arm resets to its initial position and the session is complete.

To be continued...