As a perception infrastructure company, we talk to many roboticists. When we hear repeating themes in our conversations, we take note. One repeating theme we’ve heard with increasing regularity is: “we plan to create our own stereo depth sensing system.” While there are myriad reasons why a robotics company may choose to do this, there is typically one common factor: for many robotics applications, there are simply no off-the-shelf depth sensors available that meet the specific (and demanding) set of requirements for an application.
So is creating a stereo depth sensing system as simple as mounting two cameras to a robot? Hardly. In the following post, we’ll go over the shortcomings that roboticists have found with many of the available depth sensors, and what factors to consider if you choose to roll your own depth (or, as we like to call it, “RYOD”).
There are several reasons why no single depth sensor is ideal for all robotics applications. Let’s start with baseline. The accuracy of a depth sensor varies across its range, and invariably, a depth sensor will be tuned to be most accurate within a specific portion of that range. This is heavily influenced by the baseline, which is the distance between the two cameras in a stereo system.
Sensors with small baselines, such as the golf-ball-sized Intel RealSense D405, provide impressive near-range accuracy but struggle to perform well at longer distances. Sensors with large baselines, such as the Stereolabs ZED, are best suited for long-range use cases where accuracy at the far end of the range is required, but they suffer at closer ranges. Sensors with medium baselines, such as the Luxonis OAK-D or Structure Core, are more general-purpose and try to split the difference between very short and very wide baseline sensors.
For some robotic applications, such as a pick and place robot with a work surface that always stays in the same place, an off-the-shelf depth sensor with a medium baseline can be a great solution. Yet many mobile robots are required to sense depth at near, medium, and far ranges. For devices that need high depth accuracy across a wide range, it is typical to see an array of different sensors (even different depth sensors) employed to meet this requirement.
So how do you determine what baseline to choose if you RYOD? Consider the following charts with the Y-axis showing the amount of error caused by being off by one pixel in disparity matching. The disparity refers to the distance between corresponding points in the two images of your stereo setup in pixel space. If you're off by a pixel, this discrepancy can translate into a considerable error in the depth estimate — an important metric when gauging how the precision of your disparity matching translates to accurate depth.
Imagine a scenario with a one centimeter baseline — akin to the RealSense D405. The difference in estimated depth remains negligible until we reach a distance of about half a meter. Beyond this point, errors in disparity start to project into the estimated depth dramatically. In short, a narrow baseline can amplify disparity errors and compromise depth estimation as the distance to the object increases.
On the other hand, if the baseline is extended to 20 centimeters, you'll find the scale on the x-axis shifting, and we can extend the depth range to two meters without substantial disparity errors. However, as we increase the depth further, error again starts to creep in. This illustrates why a wider baseline becomes increasingly important as we aim to extend our viewing range.
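To make this concrete, here is a minimal sketch of the math behind those charts, using the standard stereo relation Z = f·B/d. The 800-pixel focal length is an illustrative assumption rather than the spec of any particular sensor.

```python
# Sketch of how a one-pixel disparity error projects into depth error,
# using the standard stereo relation Z = f * B / d. The focal length
# (in pixels) is an illustrative assumption, not the spec of any sensor.

def depth_error_for_one_pixel(depth_m: float, baseline_m: float, focal_px: float = 800.0) -> float:
    """Approximate depth error (meters) caused by a one-pixel disparity error."""
    disparity_px = focal_px * baseline_m / depth_m                    # d = f * B / Z
    depth_if_one_off = focal_px * baseline_m / (disparity_px - 1.0)   # Z if d is one pixel short
    return depth_if_one_off - depth_m

for baseline_m in (0.01, 0.20):  # 1 cm vs. 20 cm baselines
    for depth_m in (0.25, 0.5, 1.0, 2.0, 4.0):
        err = depth_error_for_one_pixel(depth_m, baseline_m)
        print(f"baseline {baseline_m * 100:>4.0f} cm, depth {depth_m:>4.2f} m -> error {err:.3f} m")
```

With the assumed focal length, a one-pixel error at 2 m shifts the 1 cm baseline estimate by well over half a meter, while the same error on the 20 cm baseline stays in the centimeter range.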
Of course, there is a limit to this. As your baseline gets larger and larger, the minimum distance you can “perceive” an object grows larger as well. If you can’t see an object in both cameras of a stereo pair, then you can’t produce any disparity at all. Imagine a sensor with a 1 meter baseline — objects within a few centimeters of one of the cameras couldn’t possibly be seen by the other camera, and there would be no way to generate disparity matches. Speaking of which…
If you look at the spec sheets for most popular depth sensors, you’ll see a fairly consistent resolution specification for the stereo pair: one megapixel. Why is that? Wouldn’t it be possible to specify a higher resolution set of cameras?
It turns out there are a few reasons why many depth sensors limit the stereo pair to one megapixel cameras. The first is cost: these cameras are inexpensive, and given the relatively tight spread of pricing within the depth sensor market, it’s important to manage the bill of materials (BOM) to stay within that spread. But there are reasons beyond price.
Depth sensors with dedicated onboard depth processing have limitations in the amount of processing capacity or compute available. The capacity required to process cameras with resolutions higher than a megapixel often simply isn’t available, or may only be available at lower framerates / resolutions.
Therefore, the depth map quality from many off-the-shelf depth sensors can be considered good — but not great. For robotics applications that require greater resolution, the RYOD approach is to push the processing back to a more powerful host which won't have the same thermal and power restrictions that embedded GPUs and CPUs might present.
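As a rough illustration of why higher-resolution stereo strains embedded compute, dense matching cost scales roughly with width × height × disparity search range, and the search range itself tends to grow with image width. The resolutions and 10% search fraction below are illustrative assumptions, not measurements of any particular device.

```python
# Rough back-of-envelope: dense disparity matching cost scales roughly with
# width * height * disparity search range, and the search range itself grows
# with image width. Resolutions and the 10% search fraction are illustrative.

def relative_matching_cost(width: int, height: int, disparity_fraction: float = 0.1) -> int:
    """Relative cost of dense disparity matching, in arbitrary units."""
    num_disparities = int(width * disparity_fraction)  # search range scales with width
    return width * height * num_disparities

one_mp = relative_matching_cost(1280, 800)    # ~1 MP stereo pair
four_mp = relative_matching_cost(2560, 1600)  # ~4 MP stereo pair
print(f"A 4 MP pair costs roughly {four_mp / one_mp:.0f}x a 1 MP pair")  # ~8x
```

That rough 8x factor, before memory bandwidth even enters the picture, is a large part of why embedded depth processors cap out around one megapixel or drop the frame rate.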
But beware: it’s not as simple as just using a more powerful host. While a higher number of pixels might seem beneficial, more pixels don't always equate to better precision and accuracy. Increased resolution can lead to greater noise in the sensor (which appears as grainy aberrations or a reduced signal-to-noise ratio in the final image), or more vignetting at the periphery of the image due to signal loss. Pixel pitch, that is, the physical size of a pixel, actually matters! Imaging sensors with smaller pixels may produce higher resolution images, but they are more expensive and often come with a lot of caveats. In fact, for some IR cameras it may not even be possible to increase pixel count, because the pixel pitch is limited by the physical constraints of the technology.
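As a quick back-of-envelope check on pixel pitch, you can divide the active sensor width by the horizontal pixel count. The 1/3-inch sensor width used below (roughly 4.8 mm) is an approximation chosen only to show the trend.

```python
# Pixel pitch is roughly the active sensor width divided by the horizontal
# pixel count. The ~4.8 mm width of a 1/3" sensor is an approximation used
# only to illustrate how pitch shrinks as resolution climbs on the same die.

def pixel_pitch_um(sensor_width_mm: float, horizontal_pixels: int) -> float:
    return sensor_width_mm / horizontal_pixels * 1000.0  # mm -> micrometers

print(pixel_pitch_um(4.8, 1280))  # ~3.75 um for a 1 MP-class sensor
print(pixel_pitch_um(4.8, 4000))  # ~1.2 um if you push 4000 px onto the same die
```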
💡 A small digression: Some older time-of-flight cameras (e.g. SwissRanger) had an extremely large pixel pitch, around 40μm, in order to attain a sufficient signal-to-noise ratio. For reference, most RGB imaging sensors have a pixel pitch of around 2-3μm, which makes those time-of-flight pixels enormous by comparison!
As a result, some of these older time-of-flight cameras were extremely limited in resolution. The SwissRanger 4000 time-of-flight cameras had a resolution of 176×144! By today’s standards, this is almost nothing at all!
One popular alternative is high dynamic range (HDR) cameras. These cameras use embedded software to expand dynamic range by merging images captured at different exposures into a single wide-range image. However, this operation effectively halves the frame rate and can be computationally expensive, adding another layer of complexity to your camera and host decision-making process.
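Here is a hedged sketch of that merging step, using OpenCV's Mertens exposure fusion as one stand-in for whatever your camera or host pipeline actually runs. Note how two captures produce one output frame, which is where the halved frame rate comes from; grab_frame is a placeholder rather than a real capture API.

```python
import cv2
import numpy as np

def grab_frame(exposure_ms: float) -> np.ndarray:
    """Placeholder capture call: returns a synthetic 8-bit color frame."""
    rng = np.random.default_rng(int(exposure_ms))
    return rng.integers(0, 256, size=(800, 1280, 3), dtype=np.uint8)

# Two captures at different exposures...
short_exposure = grab_frame(exposure_ms=2.0)
long_exposure = grab_frame(exposure_ms=16.0)

# ...are fused into a single output frame, halving the effective frame rate.
fused = cv2.createMergeMertens().process([short_exposure, long_exposure])  # float32, ~[0, 1]
fused_8bit = np.clip(fused * 255.0, 0, 255).astype(np.uint8)
```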
Lastly, if you do choose to process on the host, it's essential to factor in computational resources. Your depth sensing system may compete for resources with other critical processes in your device. For instance, HDR cameras may require one or more cores to process, which could potentially strain your system's resources. Therefore, understanding your computational capabilities and limitations is a crucial part of the decision if you RYOD.
If you haven’t already, head over to our Depth Sensor Visualizer to get a sense of the range and field-of-view (FOV) of today’s most popular off-the-shelf depth sensors. While there is quite a bit of variation, you’ll probably note a convergence towards a standard set of values among the more popular sensors.
For many robotics applications, this is perfectly fine. However, in some cases, there is a desire to cover as much FOV as possible with as few sensors as possible. In this case, a wide or ultra-wide FOV stereo pair can be desirable.
But there are challenges when using wider FOV cameras. It's crucial to determine what kind of distortion is tolerable, as tasks such as intrinsic calibration become inherently more difficult as FOV increases. For dense depth generation specifically, disparity matching likewise becomes significantly more difficult as distortion warps the image. Fisheye lenses, for example, are antithetical to how most traditional disparity matching algorithms operate. Even if one manages to calibrate and rectify frames, the regions near the very periphery of a fisheye image are so warped as to be useless, even if there’s overlap across multiple cameras.
💡 The reason that fisheye cameras cannot be used for depth generation at the extrema of the FOV is that almost all dense depth is generated by matching disparity along epipolar lines. It isn’t easy, nor geometrically sound, to draw an epipolar line between two spheroids rather than two planes.
The derivation is complicated and not at all intuitive, but the takeaway is that you’ll usually stick to wide or ultra-wide FOV lenses (rather than fisheye) when constructing a sensor for dense depth. The same limitation does not necessarily apply if you’re using sparse depth / photogrammetric methodologies for your sensing use case.
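For reference, here is the standard planar formulation that dense matchers rely on, written in conventional notation. It is a sketch of the general idea, not a derivation for any specific lens model.

```latex
% Pinhole (planar) stereo: a pixel x in the left image and its match x' in
% the right image satisfy the epipolar constraint
\[
  x'^{\top} F \, x = 0
\]
% where F is the fundamental matrix. After rectification, epipolar lines
% become horizontal image rows, and depth follows from disparity d, focal
% length f, and baseline B:
\[
  Z = \frac{f \, B}{d}
\]
% Fisheye projections are not planar, so these row-aligned epipolar lines
% no longer exist near the periphery of the image.
```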
Lastly, depending on the environment where your robot will be deployed, the potential impact of thermal expansion and contraction of the lens may be important as well. Many lens manufacturers will provide some measure of the coefficient of thermal expansion for a lens, but on the cheaper side (read: plastic lenses) they may not. In general, plastic lenses will almost always perform worse than glass lenses in this respect, but the difference often comes down to how much this choice affects your final BOM. Take care to understand how thermal effects will influence your lens choice, both in terms of BOM and in terms of the application you’re targeting!
When it comes to the camera's shutter, you need to decide between a rolling shutter and a global shutter.
Rolling shutter cameras are less expensive and more widely available in many resolutions and spectrums, but they are subject to motion blur and read-out artifacts when motion is present, which, if not corrected for explicitly by your software, can lead to significant and confusing imaging errors. Global shutter cameras, on the other hand, eliminate most motion artifacts by exposing every pixel on the CMOS / CCD simultaneously, but they are more expensive and come with a more restricted selection from camera vendors.
In general, if you’re building a perception system you probably want to aim for a global shutter camera. While it will limit your choices and increase your BOM, correcting rolling shutter artifacts can be a huge software task to undertake, especially if your application involves a lot of motion! Rolling shutter cameras do have their place (and in particular have some advantages in reducing noise in the resulting image), but the reason that many of the common depth sensors today are moving to global shutter is obvious: it eliminates a significant software risk at an early stage of the project for a (usually) fixed and reasonable cost.
Another key aspect to consider is the spectrum, with the most common choices being either visible light (aka RGB or grayscale), or infrared.
RGB cameras often seem like an obvious choice. After all, as humans, we see in color, and we often expect our machines to see as we do. One of the pros of using RGB cameras is that you get photometric information from them. That makes them very useful for tasks beyond depth, such as helping to generate rich data for machine learning pipelines, or picking up photometric texture that is inherent to the environment (e.g. subtle variation in otherwise uniformly colored flooring or walls).
This is partly because you can leverage color data to differentiate between objects and to see more texture within the scene, a benefit that is often understated even though it comes from very basic information. However, a major con of RGB is its reliance on the visible light spectrum, which can be problematic, especially for depth: if there is suddenly no light (or too much), you have a real problem!
In addition, RGB cameras specifically need to be debayered; RAW images are what you're likely to get from any system that doesn’t come with some kind of vendor-provided software. Color calibration issues like false coloration can also arise. These can be mitigated through a color calibration process, but that process can be difficult to perform correctly across a variety of color temperatures and lighting conditions.
Similar to false coloration are chromatic aberration and moiré effects. These can be due in part to the lens, but also to the debayering process. You won’t encounter these problems with a grayscale or IR camera.
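As a small illustration, here is a hedged sketch of demosaicing a RAW Bayer frame with OpenCV. The Bayer pattern (BG here) depends on your specific sensor and is an assumption, and a synthetic frame stands in for a real RAW capture; the last line also previews the bandwidth point made below.

```python
import cv2
import numpy as np

# Synthetic stand-in for a RAW Bayer frame straight off the sensor.
raw_bayer = np.random.randint(0, 256, size=(800, 1280), dtype=np.uint8)

# Demosaic (debayer) into a 3-channel color image. The BG pattern is an
# assumption; your sensor's pattern may be RG, GR, or GB instead.
bgr = cv2.cvtColor(raw_bayer, cv2.COLOR_BayerBG2BGR)

# A grayscale sensor skips this step entirely, and its frames are a third
# of the size of the demosaiced color output at the same resolution.
print(bgr.nbytes / raw_bayer.nbytes)  # 3.0
```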
Grayscale is a great alternative to RGB that we believe to be underexplored. There's no debayering because you're simply measuring intensity across the scene, so you get sharper imagery with less bandwidth required.
This is because you’re not sending three channels of red, green, and blue data; rather, you're sending a single channel of intensity, and therefore you get higher data throughput. However, you don't have any color information, so if you have an application that requires color, then grayscale is not a fit. Like RGB cameras, grayscale cameras rely on visible light, so again, if the lights go off, you’ll have problems.
So what if the lights do go off, and often? Or is your lighting just inconsistent? Then you may want to consider infrared. Infrared tends to be more robust to different lighting conditions that might otherwise be out of one’s control. If you want to operate in the dark and you have your own infrared projector or emitter (more on these below), you'll be able to do that much more easily than with an RGB or grayscale camera. However, most infrared cameras have lower resolution, as a result of larger CMOS pixel sizes.
And, in particular, infrared cameras won't work without an infrared light source. Given these constraints, there are many situations in which infrared cameras are still not the right choice. One of the more common failure modes in warehouse robotics is that you want your robots working 24/7; however, at the end of the day, somebody turns the lights off, and that's not good for visible-spectrum cameras. Conversely, if your application works outdoors during the day, sunlight will likely wash out any IR emitter you pair with your cameras.
💡 On pattern projectors: Many robots operate in environments where there are self-similar surfaces. What is a self-similar surface, you ask? It’s a surface with no variation. Think of a relatively regular concrete sidewalk, or a blank white wall. Since most depth systems are looking for some variation in features, these surfaces will flummox them. Adding an infrared pattern projector will coat these surfaces with a layer of patterned infrared light that gives your stereo pair something to track against. The downside? Adding an emitter brings up an additional supply chain question. Additionally, as these use a laser to emit the pattern, a Class 1 laser certification might be necessary.
This deserves a section of its own, because it is possibly the most important factor to consider of them all. Unfortunately, we find ourselves in a world that is still constrained by supply chain issues. While the situation has improved greatly over the past couple of years, it is still not uncommon to find that a certain component has a lead time of months. To mitigate this, we recommend seeking out readily available cameras that are already extensively used in other areas like smartphones, ADAS (Advanced Driver-Assistance Systems), and other consumer devices.
Opting for a component with an exotic specification might introduce an unnecessary risk into your supply chain. A practical strategy when searching for a camera on platforms like AVnet, Mouser, or Digikey is to validate its long-term availability by aligning it with the mass-produced product it's primarily used for. Take, for example, the Sony IMX390: a popular image sensor for ADAS applications and rear-view cameras in cars.
We end our RYOD exploration with the final piece of the puzzle: software. There are a number of functional requirements to consider: platform compatibility, disparity calculation, sensor fusion, and sensor calibration.
Depending on what platform you’ve developed your robot on, you’ll need to ensure that the cameras you choose can deliver data in a format that your platform can consume. If you use ROS or ROS 2, you’ll want to package data in ROS bags (deprecated) or in the MCAP format. New platforms like Viam may have their own data transfer protocols to consider.
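As one hedged example of that plumbing, the sketch below publishes frames as sensor_msgs/Image in ROS 2 using rclpy and cv_bridge. The topic name, frame rate, and synthetic frame are all assumptions made for illustration, not part of any particular driver.

```python
import numpy as np
import rclpy
from cv_bridge import CvBridge
from rclpy.node import Node
from sensor_msgs.msg import Image

class StereoLeftPublisher(Node):
    """Publishes one camera of a stereo pair on an assumed topic name."""

    def __init__(self):
        super().__init__("stereo_left")
        self.pub = self.create_publisher(Image, "/stereo/left/image_raw", 10)
        self.bridge = CvBridge()
        self.timer = self.create_timer(1.0 / 30.0, self.publish_frame)  # 30 Hz

    def publish_frame(self):
        frame = np.zeros((800, 1280), dtype=np.uint8)  # placeholder for a real capture
        msg = self.bridge.cv2_to_imgmsg(frame, encoding="mono8")
        msg.header.stamp = self.get_clock().now().to_msg()
        msg.header.frame_id = "stereo_left"
        self.pub.publish(msg)

def main():
    rclpy.init()
    rclpy.spin(StereoLeftPublisher())
    rclpy.shutdown()

if __name__ == "__main__":
    main()
```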
Signal timing is important both between the cameras that comprise a given stereo pair and between that stereo pair and any other sensors they will be fused with. Let’s not forget that your host clock may need synchronization as well. While we consider synchronization a part of the calibration problem here at Tangram, other software is not always so transparent about which of these aspects a given calibration or streaming package covers.
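To see why this matters in practice, here is a minimal, generic sketch of pairing frames from two cameras by nearest timestamp within a tolerance. It is purely illustrative (not how Tangram's software performs synchronization) and assumes both streams already report timestamps in seconds on a common clock.

```python
# Pair left/right frames whose timestamps fall within a tolerance of one
# another; frames without a close-enough partner are dropped. Timestamps
# are assumed to be seconds on a common clock.

def pair_frames(left_stamps: list[float], right_stamps: list[float], tolerance_s: float = 0.005):
    pairs = []
    for t_left in left_stamps:
        t_right = min(right_stamps, key=lambda t: abs(t - t_left))
        if abs(t_right - t_left) <= tolerance_s:
            pairs.append((t_left, t_right))
    return pairs

left = [0.000, 0.033, 0.066, 0.100]
right = [0.001, 0.034, 0.072, 0.099]
print(pair_frames(left, right))  # the 0.066 s frame finds no partner and is dropped
```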
Achieving high-quality intrinsic and extrinsic calibration for your stereo pairs and other sensors is crucial. And speaking of calibration, you might want to consider Tangram Vision for this task; it offers a seamless solution for your stereo setup.
💡 Tangram Vision sees calibration as solving the following three problems:
1. Modeling Intrinsics
2. Registering Extrinsics
3. Synchronization
Our software aims for a complete solution to all of these problems, rather than solving each in a piecemeal fashion.
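For readers who want a concrete anchor, the first two problems are commonly expressed with the standard pinhole projection model shown below. This is conventional notation, not necessarily the exact model our software uses.

```latex
% A 3D point X projects to pixel x via intrinsics K and extrinsics [R | t]:
\[
  K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},
  \qquad
  x \sim K \, [\, R \mid t \,] \, X
\]
% Intrinsic calibration estimates K (plus lens distortion terms); extrinsic
% registration estimates R and t between sensors; synchronization aligns the
% timestamps at which those measurements are taken.
```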
Disparity calculation is another crucial aspect, and you have options like census matching, sum-of-absolute-differences (SAD) matching, or semi-global (block) matching (SGM/SGBM).
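As one example of the last option, here is a hedged sketch of running OpenCV's semi-global block matcher on a rectified stereo pair. The parameters are common starting points rather than tuned values, and synthetic frames stand in for real rectified images.

```python
import cv2
import numpy as np

# Synthetic stand-ins for a rectified grayscale stereo pair.
left = np.random.randint(0, 256, size=(800, 1280), dtype=np.uint8)
right = np.random.randint(0, 256, size=(800, 1280), dtype=np.uint8)

block_size = 5
matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,          # must be divisible by 16
    blockSize=block_size,
    P1=8 * block_size ** 2,      # smoothness penalty for small disparity changes
    P2=32 * block_size ** 2,     # larger penalty for big disparity jumps
    uniquenessRatio=10,
)

# SGBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0
```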
If you find that there is no suitable off-the-shelf depth sensor that fits the needs of your robot application, you may well join the ranks of robotics companies that have chosen to roll their own depth sensors. As you’ve likely ascertained from our article, this is not an insignificant undertaking. However, there is a well-understood path, and the components and software required to make such a system a reality are widely available and well documented. Of course, we hope that you’ll choose to work with us when it comes time to synchronize and calibrate your system. After all, depth sensors are a key supported modality in our platform, whether off-the-shelf or self-developed.