If you've been in or around computer vision for a while, you might have seen one of these things:
These are known as fiducial markers, and are used as a way to establish a visual reference in the scene. They're easy to make and easy to use. When used in the right context, extracting these markers from a scene can aid in camera calibration, localization, tracking, mapping, object detection... almost anything that uses camera geometry, really.
What you might not know is that there are a lot of different fiducial markers:
Some of these are older, some are newer, some are "better" in a given scenario. However, all of them share the same very clever mathematical foundation. A few smart yet straightforward ideas allow these fiducial markers to perform well under different lighting conditions, upside-down, and in even the harshest environment: the real world.
In my opinion, the best way to understand this math magic is to try it ourselves. What goes into making a robust fiducial marker? Can we pull it off? We'll use the ArUco tag design as our reference. The ArUco tag is one of the most common fiducial markers in computer vision, and it checks a lot of boxes as far as robustness and usability goes. Let's reverse-engineer ArUco markers and discover why these work so well.
🇪🇸 ArUco (also stylized ARUCO or Aruco) got its moniker from Augmented Reality, University of Cordoba, which is the home of the lab that created the marker. At least, that's what it seems to be; I've never seen it documented!
As a rule of thumb, our marker shouldn't look like anything found in natural images:
It doesn't take much brainstorming to realize that "black-and-white matte square" fits a lot of these qualifications. In addition, square corners and edges are each themselves unique features, giving us more bang for our buck. Let's use this concept as our base.
It's easy enough to track one black-and-white square frame to frame in a video (and some algorithms do!), but what if we want to use more than one marker? Having two identical squares makes it impossible to identify which is which. We clearly need some way to disambiguate.
If we want to maintain our not-from-nature design choices, keeping black-and-white squares around somehow makes sense. Why don't we use... combinations of squares?
We can now differentiate, to some extent. But clever minds will notice that a rotation of our camera will confuse our identification:
Obviously, we'll have to do better than this.
...so let's use more squares! We can switch from black to white across squares to up the entropy in our pattern.
This looks better, but even with more squares, we still have the risk of mixing it up with another marker design. It's clear that we want to avoid mis-classifying our markers, and the more differentiable our markers are, the easier it will be to detect each of them correctly. However, there's a balance to strike here: if our patterns are too different from one another, or don't have enough features, it won't be easy for our camera to tell that something is a marker at all!
This means that our markers should share a common structure, but have different and varied features. The easiest way to add structure is to define a constant shape (in rows and columns of squares) for every marker that the camera might see. This specificity makes it clearer what our algorithm is trying to find, and sets conditions for identification once it finds it.
Now, with our structure well-defined, we can focus on the original problem: making every individual marker that fits this structure as different as possible from all of its kin.
Let's put that another way: we want to maximize the difference between all of our marker patterns that share that same structure. Whenever the word "maximize" comes up in algorithmic design, one should immediately think of optimization. If we can somehow translate our marker patterns into a formula or algorithm, we can turn our predicament into an optimization problem which can be better understood and solved.
Markers like ArUco do this by treating every square in the pattern like a bit of information. Every pattern can therefore be represented by a bit string, and the difference between two patterns is the minimum number of changes it takes to turn the first into the second.
This is called the Hamming distance between markers, named after mathematician Richard Hamming. By representing the difference in Hamming distance, we can formulate our goal in terms of formulas, not just words.
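This bit-string view can be sketched in a few lines of Python. The two patterns below are made-up 4x4 markers flattened row-by-row into bit strings, purely for illustration; this isn't ArUco's actual encoding.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions where two equal-length bit strings differ."""
    assert len(a) == len(b), "markers must share the same structure"
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

# Two hypothetical 4x4 markers, flattened row-by-row (1 = white, 0 = black).
marker_a = "1011001001101001"
marker_b = "1011001001101110"

# Three squares differ, so three "flips" turn one pattern into the other.
print(hamming_distance(marker_a, marker_b))  # -> 3
```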
So, now that we can represent every marker as its corresponding bit pattern, we can do some math! We want to make sure that every marker has as many black-to-white transitions as possible...
...while making its pattern as different as possible from every other marker...
...while also making sure that this marker's rotations are unique from itself, so that we don't mix it up with another marker when our camera is tilted.
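These three criteria can be sketched as cost functions over a marker grid. The code below is illustrative Python of our own, not ArUco's implementation; a marker here is a square list-of-lists of 0/1 cells.

```python
def transitions(grid):
    """Criterion 1: count black/white transitions along rows and columns
    (more transitions means more detectable features)."""
    n = len(grid)
    count = 0
    for r in range(n):
        for c in range(n - 1):
            count += grid[r][c] != grid[r][c + 1]  # horizontal neighbor pairs
            count += grid[c][r] != grid[c + 1][r]  # vertical neighbor pairs
    return count

def grid_distance(a, b):
    """Criterion 2: Hamming distance between two marker grids."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def rotate90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def self_rotation_distance(grid):
    """Criterion 3: minimum distance between a marker and its own 90/180/270
    degree rotations. Zero means a tilted camera can't tell them apart."""
    rotated, distances = grid, []
    for _ in range(3):
        rotated = rotate90(rotated)
        distances.append(grid_distance(grid, rotated))
    return min(distances)
```

A diagonal 2x2 pattern like `[[1, 0], [0, 1]]` scores well on transitions but has a self-rotation distance of zero: rotate it 180 degrees and it looks identical, which is exactly the ambiguity criterion 3 rejects.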
Now use all of these formulas together to score every new marker we generate. Collect those markers that clear a certain threshold, and repeat until you have as many as you need! The result is a complete dictionary of fiducial markers.
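A naive generation loop might look like the sketch below: draw random candidates, reject rotationally ambiguous ones, and keep only those far enough from everything already accepted. The thresholds, helper names, and rejection-sampling scheme here are our own, chosen for illustration; real generators are cleverer about searching the pattern space.

```python
import random

def rotate90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def distance(a, b):
    """Hamming distance between two marker grids."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def rotation_aware_distance(a, b):
    """Smallest Hamming distance between a and any rotation of b."""
    best = distance(a, b)
    for _ in range(3):
        b = rotate90(b)
        best = min(best, distance(a, b))
    return best

def generate_dictionary(size, n=4, min_distance=4, seed=0):
    """Accept random n x n candidates until the dictionary is full."""
    rng = random.Random(seed)
    dictionary = []
    while len(dictionary) < size:
        candidate = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]
        # Reject candidates that coincide with one of their own rotations.
        rotated, self_distances = rotate90(candidate), []
        for _ in range(3):
            self_distances.append(distance(candidate, rotated))
            rotated = rotate90(rotated)
        if min(self_distances) == 0:
            continue  # rotationally ambiguous with itself
        # Keep only candidates far enough from every accepted marker.
        if all(rotation_aware_distance(candidate, m) >= min_distance
               for m in dictionary):
            dictionary.append(candidate)
    return dictionary
```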
🖥️ Richard Hamming's innovations around bit-wise information storage and transfer were a huge influence on the Information Age. Though we won't cover them here, 3Blue1Brown's wonderful explanation on the power and simplicity of Hamming Codes is worth a watch.
Now that we have our dictionary, we can start detecting. Our optimization process gave us some pretty cool abilities in this regard. For one, we have a much lower chance of mixing up our markers (which is what we optimized for). More surprisingly, though, we've also made ourselves more robust to detection mistakes.
For instance, what if we detect a white square in a marker, when in reality that same square was black? Since we've maximized the Hamming distance between all of our markers, this mistake shouldn't cost us. Instead, we'll select the matching marker in our dictionary that has the closest Hamming distance to our detected features. The misdetection was unfortunate, but we took it in stride.
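To make this concrete, here's a toy nearest-match lookup over an invented 8-bit dictionary (real ArUco patterns carry far more bits; these IDs and patterns are made up):

```python
def hamming(a, b):
    """Count differing positions between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

# A toy dictionary whose patterns are all far apart from one another.
dictionary = {
    0: "00000000",
    1: "11110000",
    2: "00001111",
    3: "11111111",
}

def identify(detected_bits):
    """Return the ID whose pattern is closest to what we detected."""
    return min(dictionary,
               key=lambda marker_id: hamming(dictionary[marker_id], detected_bits))

# One square was misread (last bit flipped), but ID 1 is still the closest match.
print(identify("11110001"))  # -> 1
```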
What if we misdetect several squares? At some point, the misdetections will be too much for our poor dictionary, and we'll start getting confused. But when do we hit this threshold?
An easy way to set that breaking point ourselves is by adding a minimum Hamming distance constraint between all markers in our dictionary. If our minimum Hamming distance is 9, and we generate a marker that has a distance of 8 from the rest of the dictionary, we throw it out. This process limits the number of markers that we can add, but in exchange we generate a dictionary that's more robust to detection mistakes. This ability becomes more crucial as your marker size grows and the feature details become finer.
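This trade-off follows a standard coding-theory rule: a dictionary with minimum Hamming distance d can unambiguously correct up to floor((d - 1) / 2) bit errors, because a pattern with that many misread squares is still strictly closer to its true marker than to any other.

```python
def correctable_errors(min_hamming_distance: int) -> int:
    """Maximum number of misread squares that can still be corrected
    unambiguously, given the dictionary's minimum Hamming distance."""
    return (min_hamming_distance - 1) // 2

# With the minimum distance of 9 from the example above, we can absorb
# up to four misdetected squares and still pick the right marker.
print(correctable_errors(9))  # -> 4
```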
🤔 Small disclaimer: Fiducial marker libraries like AprilTag apply this strict Hamming distance threshold when generating their libraries. ArUco, however, instead derives a maximum value for its cost functions (those formulas we derived above). The effect is the same: more robust dictionaries.
There you have it: the basics of ArUco! All we needed to do was:
Of course, there are small tweaks and changes one can make to this basic formula to up the robustness, or variety, or usefulness in different environments. This is where all those other markers come from! Regardless, these other markers all considered the same points we did today; they just arrived at different solutions.
Tangram Vision's calibration module can use several fiducial types to help bootstrap the camera calibration process. While we're no fan of the checkerboard (as we'll gladly tell you), we all recognize the value of fiducials when used in the right place at the right time. Even your humble author has been known to wave around a few markers now and then:
If waving markers is as far as you want to go in the calibration process, Tangram Vision has you covered! Come by and sign up for early access to the Tangram Vision Platform, where we're developing easy-to-use calibration software alongside other great perception modules like our plug-and-play sensor runtime. Feel free to send us a tweet or just get in touch!
Tangram Vision helps perception teams develop and scale autonomy faster.