Kinect Z Buffer Noise and Audio Beam Steering Precision
So, I've been working on hacking the Microsoft Kinect hardware with OSG recently and have come up with some novel information I'd like to share.
First, I've been measuring the depth noise inherent in the Kinect's structured-light Z-camera. The Z values tend to jitter a lot, even on a static scene, so I set out to measure and quantify this noise so I could plan for and filter it.
Here is the test scene I set up, viewed in freenect's glview:
This shows a variety of objects at different distances, plus a solid backdrop. Nothing moved in the scene during sampling, so the Z values should represent a stable scene.
I sampled 100 frames of this, recording max/min/mean and standard deviation for each pixel independently.
I dumped this to a CSV file, keeping only every fifth non-null/non-zero sample (since OpenOffice Calc choked on more than 65000 lines of data). Each row of the CSV contains the mean Z value followed by the standard deviation. I plotted these as an XY scatter plot, with mean Z on the X axis (range of around 815 to just over 1000 Kinect Z units) and standard deviation on the Y axis. Since many pixels had nearly the same mean Z but different SD values, there is often more than one dot per X location on the graph.
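The per-pixel statistics pass described above is straightforward to reproduce. Here is a minimal sketch, assuming depth frames arrive as 2-D arrays (for example from libfreenect's Python bindings; frame acquisition is stubbed out with synthetic data here, and the frame size and noise level are placeholders, not measured Kinect values):

```python
import numpy as np

def depth_noise_stats(frames):
    """frames: iterable of 2-D depth arrays, all the same shape (a static scene).
    Returns per-pixel min, max, mean, and standard deviation across frames."""
    stack = np.stack([np.asarray(f, dtype=np.float64) for f in frames])
    return stack.min(axis=0), stack.max(axis=0), stack.mean(axis=0), stack.std(axis=0)

# Synthetic stand-in for 100 captured frames: a flat scene at ~900 Z units
# with Gaussian jitter of SD 5 (roughly the noise floor observed above).
rng = np.random.default_rng(0)
frames = [900 + rng.normal(0, 5, size=(480, 640)) for _ in range(100)]
zmin, zmax, zmean, zstd = depth_noise_stats(frames)
```

From there, writing every fifth non-zero (mean, SD) pair out to CSV is a one-liner per row.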
As you can see, Kinect is kinda noisy. In the worst samples the standard deviation exceeds 75, but there seems to be a consistent floor where the standard deviation is 5 or under.
There is also a peculiar bell curve that I can't explain. I would have expected error to increase proportionally with distance, but it doesn't seem to. It could be unrelated, having something to do with the different optical properties of the materials of the test objects found at different distances, but this explanation seems inadequate to me in light of how clear the bell curve is.
The gap at around 930 Z units probably has to do with me not having an object covering that Z distance.
Attached is the CSV file I used if you wish to do your own analysis. If anyone wants other data produced/analyzed, I could probably assist.
Moving on, my second topic is Kinect's built-in four-element beamforming microphone array. Though it is not yet supported by freenect, a question has arisen: how accurate is the beam steering, and could it identify the speaking person in a room with two or more people? Fortunately this task doesn't require arbitrary beam forming for all possible locations and extraction of the strongest signal (which is computationally expensive). Since the Kinect's Z camera already knows where all the bodies' heads are, the problem is simply to beam-form a virtual microphone aimed at the head of each person and reconstruct the audio heard at that spot. Whichever person is speaking will have the loudest audio from their beamformed virtual microphone. However -- is the mic array sufficiently accurate to distinguish between two humans standing near each other or sitting on a couch?
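The "virtual microphone aimed at each head" idea is essentially delay-and-sum beamforming: delay each mic's signal by its extra path length to the target point, then sum, so sound from that point adds coherently. A minimal sketch (the sample rate and mic geometry here are assumptions for illustration, not the Kinect's actual values):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature
SAMPLE_RATE = 16000     # Hz, assumed for this sketch

def delay_and_sum(signals, mic_xy, target_xy):
    """signals: (n_mics, n_samples) array of synchronized mic channels.
    mic_xy: (n_mics, 2) mic positions in metres; target_xy: the point to
    steer at. Advances each channel by its extra path delay relative to the
    nearest mic, then averages, forming a virtual mic at target_xy."""
    dists = np.linalg.norm(mic_xy - np.asarray(target_xy), axis=1)
    delays = (dists - dists.min()) / SPEED_OF_SOUND      # seconds
    shifts = np.round(delays * SAMPLE_RATE).astype(int)  # whole samples
    n = signals.shape[1]
    out = np.zeros(n)
    for sig, s in zip(signals, shifts):
        out[: n - s] += sig[s:]  # later arrivals are advanced into alignment
    return out / len(signals)
```

Picking the speaker then reduces to beamforming at each tracked head position and comparing RMS levels of the outputs.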
Luckily a friend of mine is an audio hardware engineer with an interest in mic arrays and beam forming, and he put together a spreadsheet to calculate this. The spreadsheet is attached below. Any cell with a yellow background is supposed to be changed by the user. The fields are:
Frequency: This spreadsheet only works on a single frequency at a time. This is that frequency. Human speech is normally limited to 200 Hz to 3 kHz. If you ignore tall males, then 300 Hz to 3 kHz is fine. Most of the important stuff is in the 1 to 1.5 kHz range.
Distance From Sensor: This is the distance from the Kinect to the couch.
Speed of sound: Change this if you want to simulate how the Kinect will operate on Mars or Jupiter.
Mic Dist from center: This is the side to side distance of the mic from the center of the Kinect. I kind of guessed at these values based on the pictures from the Kinect tear-down and my measurements of my Kinect. Basically, this sets the spacing of the mics in the array.
Couch Dist from Centerline: This is the distance that the person is from the centerline of the couch/Kinect. There must always be a "zero distance", but everything else can be changed.
Mic Dist, ft: This is the distance from the person to the mic, given the couch dist. from centerline.
Amplitude after mixing: This is the "volts" of the audio. It assumes the beam steering is aimed straight ahead.
Normalized to 0ft from centerline: This is the amplitude normalized.
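The spreadsheet's core calculation can be reproduced in a few lines: sum unit-amplitude sinusoids with the phase offsets implied by each mic's path length to the listener, then normalize to the on-axis value. The mic offsets below are my own rough guesses (the spreadsheet's guessed values aren't reproduced here), and units are feet to match the spreadsheet:

```python
import numpy as np

# Assumed values: mic offsets from center are guesses from tear-down photos,
# and the speed of sound is in ft/s to match the spreadsheet's feet units.
MIC_OFFSETS_FT = (-0.37, -0.12, 0.12, 0.37)
SPEED_OF_SOUND_FTS = 1125.0

def array_response(freq_hz, offset_ft, distance_ft=8.0):
    """Amplitude of four summed unit sinusoids arriving with the phase
    offsets implied by each mic's path length to a listener sitting
    offset_ft from the centerline at distance_ft from the sensor."""
    mics = np.asarray(MIC_OFFSETS_FT)
    d = np.sqrt(distance_ft**2 + (offset_ft - mics) ** 2)  # path lengths, ft
    phases = 2 * np.pi * freq_hz * d / SPEED_OF_SOUND_FTS
    return abs(np.exp(1j * phases).sum()) / len(mics)

def normalized_amplitude(freq_hz, offsets_ft, distance_ft=8.0):
    """The 'Normalized to 0 ft from centerline' column: response at each
    offset divided by the on-axis response."""
    on_axis = array_response(freq_hz, 0.0, distance_ft)
    return [array_response(freq_hz, o, distance_ft) / on_axis for o in offsets_ft]
```

Scanning offsets until the normalized curve crosses 0.5 reproduces the "window" reading described below, though with my guessed spacing the exact widths won't match the spreadsheet's.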
How to use this:
With sound we think in terms of a -3 dB threshold. For example, the cutoff frequency of a filter is normally the frequency at which the audio is 3 dB down from "normal". It's an arbitrary threshold, but the one we'll use here. When converting dB to our normalized output, we treat it as a power ratio, so -3 dB = 0.5 and 0 dB = 1.0.
So using the "default values" in the spreadsheet, we see that the normalized amplitude hits 0.5 (a.k.a. -3 dB) at 2.5 feet from the center of the couch. So at 1 kHz the "sweet spot" for mic pickup is 5 feet wide: 2.5 feet to either side of the center of the couch.
The -6dB "window" is about 6.5 feet wide.
While this all assumes a fixed beam-steering direction, that's not what's important. What it shows is the width of the beam. We see a beam 5 feet wide at 8 feet, and this is not going to change much whether the beam is directed straight ahead or 20 degrees to one side.
In my opinion, what this says is that in the center of the normal human speech frequency range the Kinect mic array can isolate the couch, but not individual people on the couch.