In the introduction to this series, we discussed the Sensoria Communia, a.k.a. the Common Sensors of autonomy: Cameras, Depth, and LiDAR. These three modalities were common for a variety of practical reasons, both technical and commercial. Where one sensor fails, another excels, making them naturally complementary. However, even when all of these modalities are used together, they can still face scenarios that lead to failures. This has inspired researchers and engineers to push the state of the art in autonomy with new forms of sensing that challenge conventional approaches to perception.
Near the top of this list lie event cameras. Also known as neuromorphic cameras, dynamic vision cameras, or even silicon retinas, these biologically-inspired sensors have incredible properties that make them viable modalities in even the most challenging autonomous environments. Recent work in academia has proven that these sensors have a place in autonomy, excelling at perception problems with which the Common Sensors would struggle.
We’re going to dive into what makes these sensors so special, how the technology is being used today, and what commercial viability they’ve shown. There’s a reason that these sensors are still part of the Sensoria Obscura… but there’s no doubt their popularity is growing.
Before we get into the cameras themselves, let’s define what’s meant by neuromorphic. This refers to any technology that’s inspired by biological neural computation, i.e. by the brain, its neurons, and any ancillary systems that connect to it. It’s thought that mimicking the layout of the human brain will allow computers to mimic its learning styles as well, most commonly in the form of Spiking Neural Networks (SNNs).
Without getting too far into the nuances of neural network architecture, it’s enough to know that neural networks are made up of many layers of “neurons”, each of which takes its input from the layer before it and passes its output to the layer after it. These networks can be “fully-connected”, which means that every neuron in a layer is connected to every neuron in the layers immediately before and after its own. In a conventional fully-connected neural network, data flows from one end to the other without hindrance, and presents us with an output based on that network’s training.
Interestingly, this is not how our own neurons work. We don’t simply pass every stimulus or signal through the body end-to-end. Instead, when a neuron receives stimuli, there’s an electrical threshold that must be overcome before the signal moves onward; crossing that threshold fires the neuron’s action potential. This acts as another filtering mechanism on data inputs. If the signal is not “meaningful” enough, it won’t have an effect on the next layer.
Spiking neural networks add this action potential to vanilla neural networks. Now the network acts more like our own nervous system: instead of automatically passing data along the neural chain, the network holds that data back until it reaches a certain threshold. This introduces an element of time into the neural network; it can take multiple timesteps of accumulated input for a neuron to hit its activation threshold, and that timing itself adds information the network can use for classification. Note this new time factor, as it will become an important element in event cameras as well.
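To make the difference concrete, here’s a minimal sketch in plain Python and NumPy (no particular SNN framework) contrasting a vanilla fully-connected layer, which passes its output forward unconditionally, with a leaky integrate-and-fire neuron that only emits a spike once accumulated input crosses a threshold. The sizes, weights, and constants are illustrative, not taken from any real network.

```python
import numpy as np

rng = np.random.default_rng(0)

# A vanilla fully-connected layer: every input influences every output,
# and the result is passed along unconditionally.
def dense_layer(x, weights, bias):
    return np.maximum(weights @ x + bias, 0.0)  # ReLU activation

# A leaky integrate-and-fire (LIF) neuron: input accumulates in a membrane
# potential over time, decays ("leaks") each step, and only produces an
# output spike once the potential crosses a threshold.
def lif_neuron(input_current, threshold=1.0, leak=0.9):
    potential = 0.0
    spikes = []
    for i_t in input_current:
        potential = leak * potential + i_t    # accumulate with leak
        if potential >= threshold:            # "action potential" reached
            spikes.append(1)
            potential = 0.0                   # reset after firing
        else:
            spikes.append(0)
    return spikes

# The dense layer responds to a single input vector immediately...
x = rng.random(4)
w, b = rng.random((3, 4)), np.zeros(3)
print(dense_layer(x, w, b))

# ...while the LIF neuron needs several timesteps of weak input before it fires.
print(lif_neuron([0.3, 0.3, 0.3, 0.3, 0.3]))  # [0, 0, 0, 1, 0]
```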
💾 I fully admit that I’m not a neural network/biology expert, and that this explanation is lacking. For those who want a fuller overview, Frontiers has a good summary on SNNs and their applications: https://www.frontiersin.org/articles/10.3389/fncom.2021.646125/full
Whenever you see the term “neuromorphic”, know that whatever is happening is probably similar to this process: signals processed and transmitted via a potential function. And, indeed, we’ll find that neuromorphic cameras, a.k.a. event cameras, fit this mold.
Instead of thinking of an event camera as a traditional camera, it’s better thought of as an array of neuromorphic sensors. I’ll let ETH Zurich explain:
"In contrast to standard cameras, which acquire full images at a rate specified by an external clock (e.g., 30 fps), event cameras… respond to brightness changes in the scene asynchronously and independently for every pixel. Thus, the output of an event camera is a variable datarate sequence of digital “events” or “spikes”, with each event representing a change of brightness (log intensity) of predefined magnitude at a pixel at a particular time." [1]
In other words, instead of being time-dependent, data output is now signal-dependent. With a standard camera, you’ll get data at a fixed rate, whether it’s needed or not; this is time-dependent data output. With event cameras, we don’t get data unless there’s enough signal to warrant a change in the state of the sensor; this is signal-dependent data output. The result is an array of on-off signals that are independent in space and time. As we learned above, this very signal potential is what makes these cameras neuromorphic.
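One rough way to picture this: take two consecutive log-intensity snapshots of a scene and emit an event only at pixels whose brightness changed by more than some contrast threshold. This is a simplification (a real event camera compares each pixel against the log intensity at that pixel’s last event, asynchronously, rather than diffing global frames), but it captures the signal-dependent behavior. The threshold and image sizes below are arbitrary.

```python
import numpy as np

def events_from_change(prev_log_i, curr_log_i, t, threshold=0.2):
    """Emit (x, y, timestamp, polarity) tuples wherever the log intensity
    changed by more than `threshold`. No change, no data."""
    diff = curr_log_i - prev_log_i
    ys, xs = np.nonzero(np.abs(diff) > threshold)
    return [(int(x), int(y), t, int(np.sign(diff[y, x]))) for y, x in zip(ys, xs)]

# A static scene produces no events at all...
scene = np.log(np.full((180, 240), 100.0))
print(len(events_from_change(scene, scene, t=0.001)))    # 0

# ...while a brightness jump at a single pixel produces exactly one event.
moved = scene.copy()
moved[90, 120] += 1.0
print(events_from_change(scene, moved, t=0.002))          # [(120, 90, 0.002, 1)]
```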
When the neuromorphic camera debuted in the May 1991 edition of Scientific American [2], the authors (Misha Mahowald and Carver Mead) explained how the human eye compresses signals with a large dynamic range into relatively straightforward on-off signals using the aptly named bipolar layer of retinal neurons. By mimicking this retinal structure in the neuromorphic camera, they were able to construct signal-dependent images that only took in the data that was actively changing in the scene.
Of course, they imaged a cat to show it off.
Early forms of neuromorphic cameras were shown to suffer from the same optical illusions that human eyes do, including filling in brightness information between high-contrast squares and afterimages of an object in motion. These effects are due to the nature of the bipolar layer: by being reliant on only the change in the scene, not the scene values themselves, neuromorphic cameras take time to compensate for changes. This effect is only temporary and is highly localized: it only takes a moment for any one pixel to reach a stable voltage after an abrupt change.
Given that event cameras record data differently than a conventional camera, factors that were once constant are now completely at the whim of the scene:
If there is no change in the scene, there is no change in the data. This means that no data is sent and no power is drawn beyond the minimal amount needed to hold the signal value in memory for a given pixel. We’ll find that this asynchronous nature of the signal makes the intake of event camera data much more complex than conventional camera data.
With event cameras, there’s really no such thing as a framerate, as there are no frames. Instead, most event cameras are rated by a maximum event bandwidth in megaevents per second, or Mev/s. Events come in the form of a tuple:
[x position, y position, timestamp, change value]
The more pixels there are in the event camera, the more events will be generated. This is intuitive enough: any changes in the scene are now picked up in higher resolution, and so will trigger more events.
This tradeoff between resolution and throughput historically hasn’t been a problem because, well, there was never a high-resolution event camera. Most event cameras on the market were under QVGA resolution (320x240 pixels). The iniVation DAVIS240, for example, is 240x180 pixels and operates at a maximum 12 Mev/s.
This is no longer the case. The newest event cameras are now reaching VGA (640x480) and even HD (1280x720) resolutions. These cameras are getting up into the hundreds and even thousands of Mev/s. The Samsung DVS-Gen4 is 1280x960 and can hit a whopping 1200 Mev/s at its peak event rate, 100x that of the DAVIS240. That’s 1,200,000,000 events per second, for those playing at home.
If we translate this into a byte rate, we can expect a lot of data at peak event throughput. In a scenario where all pixels register an event at once (which would be rare), we can derive some scary figures after doing some back-of-the-napkin calculation based on message size:
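Here’s one version of that napkin math. The exact message size depends on the camera and its driver; purely for illustration, assume roughly 13 bytes per event (two bytes each for x and y, eight for a timestamp, one for polarity):

```python
BYTES_PER_EVENT = 2 + 2 + 8 + 1   # x, y, timestamp, polarity (an assumed encoding)

def peak_bandwidth_gbps(peak_mev_per_s, bytes_per_event=BYTES_PER_EVENT):
    """Convert a peak event rate in Mev/s into gigabits per second."""
    return peak_mev_per_s * 1e6 * bytes_per_event * 8 / 1e9

print(peak_bandwidth_gbps(12))     # DAVIS240:  ~1.2 Gb/s
print(peak_bandwidth_gbps(1200))   # DVS-Gen4: ~124.8 Gb/s
```

Under that assumption, the DAVIS240’s worst case fits comfortably over a single USB link, while the DVS-Gen4’s worst case wouldn’t fit through anything you’d casually plug into a robot.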
That is so much data.
The DAVIS240 uses a micro-USB 3.0 cable, which tops out at 4.8 Gb/s; that’s enough for it to stream, but nowhere near enough for the DVS-Gen4. Even switching to a CAT6 ethernet cable, we’re limited to a peak bandwidth of 10 Gb/s. The DVS-Gen4’s peak output blows past that by more than 12x.
At this event rate, you will absolutely saturate the bandwidth of the connection even with partial data capture. This means that you’re now introducing latency into the communications of your sensor system (which strikes this author as ironic, given the event camera is praised for its nearly-real-time readings).
According to an article on the development of Poker-DVS, a benchmarking event camera dataset using playing cards, the researchers couldn’t do much with the data until they were able to bring the peak event rate down to 8-10 Mev/s [3]. As we noted above, this is under the peak event rate of a QVGA event camera; we’re either sacrificing spatial resolution or temporal resolution to reach this number, but the sacrifice is necessary just to handle what’s being produced.
This is the price one pays for nearly-instant sensor readings: a lot of readings. It’s up to the user to know how to control this data throughput for the optimal performance in their system. Luckily, there are a few straightforward ways to do just that.
We’ve already touched on one way to reduce event rate: changing the resolution of the camera. This makes it physically impossible to go over a certain peak event threshold, but one sacrifices data resolution that, traditionally, would make a difference in an autonomous vision pipeline.
But event cameras aren’t ones for tradition. A 2022 article from the University of Zurich argues that high-resolution event cameras don’t always outperform their low-resolution counterparts in certain computer vision tasks, and can in fact perform much worse under adverse conditions [4]. For instance, optical flow tasks were found to perform better with low-resolution data in nighttime lighting conditions.
The large caveat to this (in this author’s opinion) is that there are very few systems optimized for event camera data. The above paper used SNN models trained on data derived from certain conditions, e.g. high exposure or rapid movement. Once software and hardware adapt to an event camera’s high throughput and data formats, we could be seeing different trends entirely.
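Whichever way those trends go, dropping spatial resolution (whether in the sensor itself or by binning events in software after capture) remains the bluntest instrument for capping event rate. Here’s a rough sketch of software binning in plain Python, not any vendor’s API, reusing the (x, y, timestamp, polarity) tuples from earlier:

```python
def downsample_events(events, factor=2):
    """Map events onto a coarser pixel grid by integer division of coordinates.
    Many events collapse onto the same coarse pixel, so deduplicating within a
    short time window is one way to cut the event rate further."""
    return [(x // factor, y // factor, t, polarity) for (x, y, t, polarity) in events]

events = [(120, 90, 0.002, 1), (121, 90, 0.002, 1), (640, 360, 0.003, -1)]
print(downsample_events(events, factor=4))
# [(30, 22, 0.002, 1), (30, 22, 0.002, 1), (160, 90, 0.003, -1)]
```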
Another lever that we have to control the event rate is contrast. Lowering the camera’s contrast sensitivity (raising its contrast threshold) means it takes a larger brightness change to trigger an event; raising the sensitivity means a smaller change will do. By playing with the contrast setting, one can develop heuristics for controlling the event rate in a given scene.
However, this method should be used with caution. Lowering the contrast in an event camera has the same effect as lowering the contrast in a conventional camera: borders and features become washed out and less defined. When this technique was employed with Poker-DVS, for instance, the authors found that the playing cards with red pips were significantly less defined than the ones with black pips (which had naturally high contrast on a white playing card). In order to create the dataset, in fact, the authors manufactured a playing card deck with black pips for every card to get around this limitation [3].
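As a sketch of the kind of heuristic this enables, one could nudge the threshold up or down based on the measured event rate. The `contrast_threshold` attribute and `set_contrast_threshold` call below are hypothetical, standing in for whatever a given camera’s driver actually exposes:

```python
def adjust_contrast_threshold(camera, measured_mev_per_s, target_mev_per_s=8.0,
                              step=0.01, min_thresh=0.05, max_thresh=0.5):
    """Simple feedback loop: if the event rate is too high, require a larger
    brightness change per event (raise the threshold); if it's well under
    target, lower the threshold to recover sensitivity.
    `camera.contrast_threshold` and `camera.set_contrast_threshold` are
    hypothetical interfaces, for illustration only."""
    threshold = camera.contrast_threshold
    if measured_mev_per_s > target_mev_per_s:
        threshold = min(threshold + step, max_thresh)
    elif measured_mev_per_s < 0.5 * target_mev_per_s:
        threshold = max(threshold - step, min_thresh)
    camera.set_contrast_threshold(threshold)
```

The 8 Mev/s default target here is borrowed from the Poker-DVS figure above; in practice the target would depend on your link bandwidth and downstream compute, and pushing the threshold too high starts erasing low-contrast features (like those red pips) from the data entirely.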
We’ve shown that we can get data under control using variables like contrast and resolution. Now how do we process that data? What does one do with hundreds of thousands of individual pixel readings a second?
Well, it depends on what you want; going beyond an atomic ‘event’ unit requires some lateral thinking. There are in fact many different (and common!) ways to conglomerate and represent event data [1]: individual events, packets of events, event frames (2D histograms of event counts), time surfaces, voxel grids, and reconstructed intensity images, among others.
Some of these representations trade accuracy in time or space for easier processing by other programs. For instance, if your pipeline already uses grayscale images, you might just want to create synthetic grayscale images every so often to mimic a conventional camera. This is absolutely possible with an event camera (if you’re comfortable pre-processing your data on the front-end).
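One simple route, for example, is to collapse all events within a fixed time window into a 2D histogram (an “event frame”). This isn’t a true grayscale reconstruction, which takes more sophisticated integration, but it produces an image-shaped array that frame-based pipelines can consume. A minimal sketch, assuming the (x, y, timestamp, polarity) tuples from earlier:

```python
import numpy as np

def accumulate_frame(events, width, height, t_start, t_end):
    """Collapse all events in [t_start, t_end) into a single 2D count image.
    Timestamp precision within the window is deliberately thrown away."""
    frame = np.zeros((height, width), dtype=np.int32)
    for x, y, t, polarity in events:
        if t_start <= t < t_end:
            frame[y, x] += polarity   # signed counts; use abs() for pure activity
    return frame

events = [(120, 90, 0.0021, 1), (120, 90, 0.0042, 1), (10, 5, 0.0090, -1)]
frame = accumulate_frame(events, width=240, height=180, t_start=0.0, t_end=0.01)
print(frame[90, 120], frame[5, 10])   # 2 -1
```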
On the other hand, if highly accurate timestamp information is important to you, you might instead opt to just take in packets of events or even individual events as they are generated. This could be desirable when e.g. training a spiking neural net to detect certain motion patterns. The density and speed of data generated gives the user plenty to work with, if they’re willing to tinker with the output.
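Conversely, grouping the raw tuples into fixed-duration packets keeps every individual timestamp intact for downstream consumers such as an SNN. Another rough sketch:

```python
from collections import defaultdict

def packetize(events, packet_duration=0.001):
    """Group raw events into packets of `packet_duration` seconds while keeping
    every individual timestamp intact."""
    packets = defaultdict(list)
    for event in events:
        _, _, t, _ = event
        packets[int(t // packet_duration)].append(event)
    return [packets[k] for k in sorted(packets)]

events = [(120, 90, 0.0003, 1), (121, 90, 0.0004, -1), (10, 5, 0.0021, 1)]
for packet in packetize(events):
    print(packet)
# [(120, 90, 0.0003, 1), (121, 90, 0.0004, -1)]
# [(10, 5, 0.0021, 1)]
```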
---
So: How do event cameras measure up side-by-side to our Common Sensors? Using my handy and very official Technical and Commercial Metrics from the intro Sensoria Obscura post, we can compare.
...which we will do next post! Stay tuned as we dive into the technical and commercial progress of event cameras in autonomy today.
EDIT: That post is here! https://www.tangramvision.com/blog/sensoria-obscura-event-cameras-part-ii