In the world of automated vision, there's only so much one can do with a single sensor. Optimize for one thing, and lose the other; and even with one-size-fits-all attempts, it's hard to paint a full sensory picture of the world at 30fps.
We use sensor combinations to overcome this restriction. This intuitively seems like the right play; more sensors mean more information. The jump from one camera to two, for instance, unlocks binocular vision, or the ability to see behind as well as in front. Better yet, use three cameras to do both at once. Add in a LiDAR unit, and see farther. Add in active depth, and see with more fidelity. Tying together multiple data streams is so valuable that this act, sensor fusion, is a whole discipline in itself.
Yet this boon in information often makes vision-enabled systems harder to build, not easier. Binocular vision relies on stable intrinsic and extrinsic camera properties, which cameras don't have. Depth sensors lose accuracy with distance. A sensor can fail entirely, like LiDAR on a foggy day.
This means that effective sensor fusion involves constructing vision architecture in a way that minimizes uncertainty in uncertain conditions. Sensors aren't perfect, and data can be noisy. It's the job of the engineer to sort this out and derive assurances about what is actually true. This challenge is what makes sensor fusion so difficult: it takes competency in information theory, geometry, optimization, fault tolerance, and a whole mess of other things to get right.
So how do we start?
...just kidding. Though you would be surprised how many times an educated guess gets thrown in! No, we're talking
So let's review our predicament:
Nonetheless, this is all that we have to work with. This seems troubling; we can't be certain about anything!
Instead, what we can do is minimize our uncertainty. Through the beauty of mathematics, we can combine all of this knowledge and actually come out with a more certain idea of our state through time than if we used any one sensor or model.
This 👆 is the magic of Kalman filters.
💡 Warning: Math.
Let's pretend that we're driving an RC car in a completely flat, very physics-friendly line.
There are two things that we can easily track about our car's state: its position \(p_{t}\) and velocity \(v_{t}\).
We can speed up our robot by punching the throttle, something we do frequently. We do this by exerting a force \(f\) on the RC car's mass \(m\), resulting in an acceleration \(a\) (see Newton's Second Law of Motion).
With just this information, we can derive a model for how our car will act over a time period \(\Delta t\) using some classical physics:
$$\tag{1.1}p_{t+1} = p_{t} + (v_{t} \Delta t)+ \frac{f \Delta t^2} {2m}$$
$$\tag{1.2}v_{t+1} = v_{t} + \frac{ f \Delta t} {m}$$
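As a quick sanity check, equations 1.1 and 1.2 can be sketched in a few lines of Python. The function name `predict_motion` and the numbers below are made up for illustration; only the two formulas come from the text.

```python
def predict_motion(p, v, f, m, dt):
    """Advance position p and velocity v by one time step dt."""
    a = f / m  # Newton's second law: acceleration from force and mass
    p_next = p + v * dt + 0.5 * a * dt ** 2  # equation 1.1
    v_next = v + a * dt                      # equation 1.2
    return p_next, v_next


# Hypothetical numbers: a 2 kg car at 0 m moving at 1 m/s, pushed with 4 N for 0.5 s.
p_next, v_next = predict_motion(p=0.0, v=1.0, f=4.0, m=2.0, dt=0.5)
print(p_next, v_next)  # 0.75 2.0
```

With zero force, the car just coasts: position advances by \(v_{t} \Delta t\) and velocity stays put.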
We can simplify this for ourselves using some convenient matrix notation. Let's put the values we can track, position \(p_{t}\) and velocity \(v_{t}\), into a state vector:
$$\tag{1.3}\textbf{x}_{t} = \begin{bmatrix} p_{t} \\ v_{t} \end{bmatrix}$$
...and let's put our applied forces into a control vector that represents all the outside influences affecting our state:
$$\tag{1.4}\textbf{u}_{t} = \begin{bmatrix} \frac{f}{m} \end{bmatrix} = \frac{f}{m}$$
Now, with a little rearranging, we can organize our motion model for position and velocity into something a bit more compact:
$$\tag{1.5} \begin{bmatrix}p_{t+1} \\ v_{t+1}\end{bmatrix} = \underbrace{\begin{bmatrix}1 & \Delta t \\ 0 & 1\end{bmatrix}}_{F_{t}} \begin{bmatrix}p_{t} \\ v_{t}\end{bmatrix}+\underbrace{\begin{bmatrix}\frac{\Delta t^2}{2} \\ \Delta t\end{bmatrix}}_{B_{t}}\frac{f}{m}$$
$$\tag{1.6}\Rightarrow \textbf{x}_{t+1} = F_{t}\textbf{x}_{t} + B_{t}\textbf{u}_{t}$$
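Here is a minimal sketch of equation 1.6 in plain Python, using lists of rows as matrices so it runs without any libraries. The helper names (`mat_vec`, `predict_state`) and the sample values are assumptions for illustration, not part of the original derivation.

```python
def mat_vec(A, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def predict_state(x, u, dt):
    """Equation 1.6: x_next = F x + B u, where u = f / m is a scalar."""
    F = [[1.0, dt],
         [0.0, 1.0]]          # F_t from equation 1.5
    B = [0.5 * dt ** 2, dt]   # B_t from equation 1.5 (a single column)
    Fx = mat_vec(F, x)
    return [fx + b * u for fx, b in zip(Fx, B)]


# Hypothetical state [position, velocity] = [0 m, 1 m/s], with f / m = 2 m/s^2.
x_next = predict_state(x=[0.0, 1.0], u=2.0, dt=0.5)
print(x_next)  # [0.75, 2.0]
```

The result matches what equations 1.1 and 1.2 give for the same inputs, which is the whole point: the matrix form is the same model, just packaged more compactly.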
By rolling up these terms, we get some handy notation that we can use later:
However, we're not exactly sure whether or not our state values are true to life; there's uncertainty! Let's make some assumptions about what this uncertainty might look like in our system:
These two assumptions let us lean on the Central Limit Theorem! We can therefore assume that our error follows a Normal Distribution, aka a Gaussian curve.
💡 We will use our understanding of Gaussian curves later to great effect, so take note!
We're going to give this uncertainty model a special name: a probability density function (PDF). This represents how probable it is that certain states are the true state. Peaks in our function correspond to the states that have the highest probability of occurrence.
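To make "peaks mean most probable" concrete, here is a small sketch that evaluates a one-dimensional Gaussian PDF at a few candidate positions. The function name `gaussian_pdf` and the belief values (mean 2.0 m, standard deviation 0.5 m) are hypothetical, chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Probability density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical belief about position: mean 2.0 m, standard deviation 0.5 m.
for candidate in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(candidate, round(gaussian_pdf(candidate, mean=2.0, std=0.5), 4))
```

The density is highest at the mean and falls off symmetrically on either side; a smaller standard deviation would give a taller, narrower peak.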
Our state vector \(\textbf{x}_{t}\) represents the mean \(\mu\) of this PDF. To derive the rest of the function, we can model our state uncertainty using a covariance matrix \(\textbf{P}_{t}\):
$$\tag{2.1}\textbf{P}_{t} =\begin{bmatrix}\Sigma_{pp} & \Sigma_{pv} \\ \Sigma_{vp} & \Sigma_{vv}\end{bmatrix}$$
There are some interesting properties here in \(\textbf{P}_{t}\) . The diagonal elements (\(\Sigma _{pp}, \Sigma _{vv}\)) represent how much these variables deviate from their own mean. We call this variance.
The off-diagonal elements of \(\textbf{P}_{t}\) express covariance between state elements. If \(\Sigma_{pv}\) is zero, for instance, then we know that an error in velocity won't influence an error in position. If it's any other value, we can safely say that one affects the other in some way. PDFs without covariance terms look like Figure 1 above, with major and minor axes aligned with our world axes. PDFs with covariance are skewed off-axis depending on how extreme the covariance is:
Variance, covariance, and the related correlation of variables are valuable, as they make our PDF more information-dense.
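One way to read "how extreme" a covariance is: normalize it into a correlation coefficient, \(\rho = \Sigma_{pv} / \sqrt{\Sigma_{pp}\Sigma_{vv}}\), which always lands between -1 and 1. The sketch below does this for two hypothetical covariance matrices; the function name and the numbers are assumptions made for illustration.

```python
import math


def correlation(P):
    """Position-velocity correlation coefficient of a 2x2 covariance matrix."""
    var_p, cov_pv = P[0]
    _, var_v = P[1]
    return cov_pv / math.sqrt(var_p * var_v)


# Hypothetical covariance matrices, for illustration only.
P_aligned = [[4.0, 0.0],
             [0.0, 1.0]]   # no covariance: PDF axes line up with the world axes
P_skewed = [[4.0, 1.5],
            [1.5, 1.0]]    # nonzero covariance: PDF is tilted off-axis

print(correlation(P_aligned))  # 0.0
print(correlation(P_skewed))   # 0.75
```

A value near zero means the errors are close to unrelated; a value near 1 or -1 means knowing one error tells you a lot about the other.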
We know how to predict \(\textbf{x}_{t+1}\), but we also need the predicted covariance \(\textbf{P}_{t+1}\) if we're going to describe our state fully. We can derive it from \(\textbf{x}_{t+1}\) using some (drastically simplified) linear algebra:
$$\tag{2.2}\textbf{P}_{t+1} = cov(F_{t}\textbf{x}_{t}) + cov(B_{t}\textbf{u}_{t}) = F_{t} \textbf{P}_{t} F_{t}^T + \xcancel{cov(B_{t}\textbf{u}_{t})}$$
Notice that \(B_{t}\textbf{u}_{t}\) got tossed out! Control has no uncertainty that we can directly observe, so we can't use the same math that we did on \(F_{t}\textbf{x}_{t}\).
However, we can factor in the effects of noisy control inputs another way: by adding a process noise covariance matrix \(Q_{t}\):
$$\tag{2.3}\textbf{P}_{t+1} = F_{t} \textbf{P}_{t} F_{t}^T + Q_{t}$$
Yes, we are literally adding noise.
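For the curious, here is the step hiding behind "drastically simplified". It is a standard identity for pushing a random vector through a constant matrix. The symbol \(\textbf{z}\) is introduced here only for this aside: think of it as the unknown true state, a random vector with mean \(\textbf{x}_{t}\) and covariance \(\textbf{P}_{t}\). \(E[\cdot]\) denotes expected value.

$$cov(F_{t}\textbf{z}) = E\left[(F_{t}\textbf{z} - F_{t}\textbf{x}_{t})(F_{t}\textbf{z} - F_{t}\textbf{x}_{t})^T\right] = F_{t}\,E\left[(\textbf{z} - \textbf{x}_{t})(\textbf{z} - \textbf{x}_{t})^T\right]F_{t}^T = F_{t}\textbf{P}_{t}F_{t}^T$$

The middle step just factors the constant \(F_{t}\) out of the expectation on both sides, which is where the sandwich shape of the first term in equation 2.3 comes from.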
We have now derived the full prediction step:
$$\tag{2.4} \textbf{x}_{t+1} =\underbrace{F_{t} \textbf{x}_{t}}_{state}+\underbrace{B_{t} \textbf{u}_{t}}_{control}$$
$$\tag{2.5} \textbf{P}_{t+1} =\underbrace{F_{t} \textbf{P}_{t} F_{t}^T}_{prediction}+\underbrace{Q_{t}}_{process}$$
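Putting equations 2.4 and 2.5 together, a single prediction step might look like the sketch below. It sticks to plain Python lists so it runs as-is; the helper names and every starting value (state, covariance, process noise, time step) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*A)]


def predict(x, P, u, Q, dt):
    """One prediction step: equation 2.4 (state) and equation 2.5 (covariance)."""
    F = [[1.0, dt],
         [0.0, 1.0]]
    B = [0.5 * dt ** 2, dt]
    # Equation 2.4: x_next = F x + B u
    x_next = [sum(f * xi for f, xi in zip(row, x)) + b * u for row, b in zip(F, B)]
    # Equation 2.5: P_next = F P F^T + Q
    FPFt = mat_mul(mat_mul(F, P), transpose(F))
    P_next = [[fp + q for fp, q in zip(fp_row, q_row)] for fp_row, q_row in zip(FPFt, Q)]
    return x_next, P_next


# Hypothetical starting values, for illustration only.
x = [0.0, 1.0]                    # position 0 m, velocity 1 m/s
P = [[1.0, 0.0], [0.0, 1.0]]      # initial state covariance
Q = [[0.25, 0.0], [0.0, 0.25]]    # process noise covariance

for step in range(3):
    x, P = predict(x, P, u=2.0, Q=Q, dt=0.5)
    print(step + 1, x, "position variance:", P[0][0], "velocity variance:", P[1][1])
```

Run it and watch the diagonal of `P`: with these values, both variances get larger on every step, since nothing in the prediction step ever pulls them back down.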
Our results are... ok. We got a good guess at our new state out of this process, sure, but we're a lot more uncertain than we used to be!
There's a good reason for that: everything up to this point has been a sort of "best guess". We have our state, and we have a model of how the world works; all that we're doing is using both to predict what might happen over time. We still need something to support these predictions outside of our model.
Something like sensor measurements, for instance.
We’re getting there! So far, this post has covered
We’ll keep it going in Part II by bringing in our sensor measurements (finally). We will use these measurements, along with our PDFs, to uncover the true magic of Kalman filters!
Spoiler: it’s not magic. It’s just more math.