In my last blog, I discussed some basic concepts of computer vision and how to create a facial recognition filter using OpenCV. But what if you want to detect something other than faces in an image? There are two possible ways forward:

Train your own model: If you have a lot of images of the object you are trying to identify, you could build your own model using OpenCV’s built-in functions (like k-Nearest-Neighbors) or more powerful machine learning libraries like PyTorch and TensorFlow. However, this means you need hundreds if not thousands of images that all show your object of interest. Training a model on them can also take time, hard drive space, and computing power. You’d also need some knowledge of object detection algorithms such as Region-based Convolutional Neural Network (R-CNN), Faster R-CNN, and You Only Look Once (YOLO).
Use a pre-existing, openly available model: There are many resources available online that can help you build object detection into your program, app, or analysis on the fly. In my last blog, I took advantage of OpenCV’s pre-trained Haar cascade models for faces and eyes to quickly build a fun face filter. Python libraries like Facebook AI’s Detectron2, among others, ship with powerful pre-trained models ready to go (a collection of these is often called a model zoo). Many were trained on the Common Objects in Context (COCO) image dataset. Alternatively, you can find pre-trained models to import into deep learning frameworks like PyTorch or TensorFlow on websites like ModelZoo.co.


The latter of the above options is obviously much less flexible and customizable. So, if you are trying to detect something very unusual in an image or video, it might be best to go your own route.

But, depending on the goals and needs of your project, it might be to your advantage to use a pre-existing model from a model zoo because it can save you time, energy, and space on your hard drive. For example, in my last blog, it wasn’t really necessary for the goal of the project to train my own facial recognition model. I wanted to see if I could build a cool filter to put on people’s faces.

If I had tried to train my own model, I would have needed to find or create a dataset of face images (which itself takes time) and then train the model. But all I needed was the Haar cascade, and then I was ready to go. Because object detection is such a common task, it was much more efficient to use the tools already available to me.

Going further, many models are easy enough to implement without actually knowing how they work or having much knowledge of computer vision at all. This is particularly good for the software engineer who just needs the functionality of object detection and has to focus on other tasks.

Choosing the model that’s right for you

There are so many models to choose from. How do you know which to use in your project? Or should you train your own model entirely? Here are a few things to consider:

Does it have the functionality you need? Some models just help you draw a box around the detected object, while others might help you detect more specific features of an object, like the limbs of a person. But chances are no single model will be perfect for you, especially if you have a specific project in mind. For example, perhaps you want to detect poodles as opposed to other breeds of dogs. A baseline dog detection model from Detectron2 trained on COCO images may help you identify dogs in an image or video and discard images of cats, but it’s not going to get you all the way to your goal. One way forward is to use a pre-existing model as a springboard for your own.
Would training a new model be redundant? Because so many object detection tasks are similar, ask yourself whether you are re-inventing the wheel by creating a whole new model. This is particularly likely for common tasks like facial recognition or text detection. Sure, training on these tasks is a good coding/modeling exercise, but it's redundant when a project deadline is looming and so many models already exist.
Is it fast enough? Many models are built using algorithms such as R-CNN, Faster R-CNN, and YOLO. Some algorithms are faster, while others are more efficient with memory. If you are trying to process live images, as in a video, be sure to use Faster R-CNN or YOLO. Many model zoos list speed and memory metrics for you to consider (here is Detectron2's).
Is it detail-oriented enough? A model built using the YOLO algorithm is quite fast, but sometimes at the cost of specificity. YOLO is therefore great for live video but might struggle to find a gathering of people in the distance. Depending on what you're trying to detect, make sure your model has the capacity to do it.
Does it match your skill level? Some models and Python packages are easy to pick up. In my last blog, implementing the Haar cascade with OpenCV was easy. However, be sure that you have the skills and ability to use your model. This may seem obvious, but it's easy to bite off more than you can chew. Some models might require Python packages you are unfamiliar with or require you to understand concepts you haven’t encountered. I have definitely tried to use a model from somewhere and found myself in over my head.
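To make the poodle example from the list above a bit more concrete, here is a minimal sketch of the first step of that springboard idea: keep only the confident "dog" detections from a general-purpose detector so they can be handed to a second, more specific model. The Detection type, the keep_label helper, and the sample values are all hypothetical and only for illustration. Real libraries return their own output formats, so treat this as a pattern rather than any particular library's API.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical, simplified detector output for one object."""
    label: str                       # class name, e.g. a COCO category like "dog"
    score: float                     # model confidence between 0 and 1
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels


def keep_label(detections: List[Detection], label: str,
               min_score: float = 0.5) -> List[Detection]:
    """Return detections of one class at or above a confidence threshold."""
    return [d for d in detections
            if d.label == label and d.score >= min_score]


# Made-up output for a single image (assumption, not real model output).
detections = [
    Detection("dog", 0.92, (34, 50, 210, 300)),
    Detection("cat", 0.88, (220, 80, 400, 310)),
    Detection("dog", 0.31, (5, 5, 40, 40)),
]

dogs = keep_label(detections, "dog")
for d in dogs:
    # Each remaining box could be cropped and passed to a breed classifier.
    print(d.label, d.score, d.box)
```

The cat and the low-confidence dog are dropped, so a poodle-specific model would only ever need to look at regions the general model is already fairly sure contain a dog.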

Overall, choosing a model, whether a model you created or one you found, is largely about balancing efficiency (both the computer’s and yours) and the level of detail the task requires of you.
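If a model zoo doesn't list speed metrics for your hardware, one rough way to answer the "is it fast enough?" question yourself is to time the model on a handful of frames. The sketch below assumes your model can be wrapped in a function that takes one frame and returns its detections. The measure_fps helper and the fake_detector stand-in are hypothetical names for illustration, and the 5 ms delay is an invented number, not a measurement of any real model.

```python
import time


def measure_fps(detect, frames, warmup=2):
    """Estimate how many frames per second `detect` can process."""
    for frame in frames[:warmup]:
        detect(frame)                # warm-up runs are not timed
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def fake_detector(frame):
    # Hypothetical stand-in for a real model: pretends each
    # frame takes about 5 ms to process and finds nothing.
    time.sleep(0.005)
    return []


frames = [None] * 20                 # placeholder frames (assumption)
fps = measure_fps(fake_detector, frames)
print(f"{fps:.1f} frames per second")
```

Video typically runs at around 24 to 30 frames per second, so a model that lands well below that on your own machine will probably struggle to keep up with a live feed.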