Kinect Z Buffer Noise and Audio Beam Steering Precision


So, I've been working on hacking the Microsoft Kinect hardware with OSG recently and have come up with some novel information I'd like to share.

Depth Noise

First, I've been measuring the depth noise inherent in Kinect's structured-light Z-camera. The Z values tend to jitter a lot, even on a static scene, so I set out to measure and quantify this noise so I could plan for it and filter it.

Here is the test scene I set up, viewed in freenect's glview:

This shows a variety of objects at different distances, plus a solid backdrop. Nothing moved in the scene during sampling, so the Z values should represent a stable scene.

I sampled 100 frames of this, recording max/min/mean and standard deviation for each pixel independently.
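For anyone who wants to repeat the measurement, here is a minimal sketch of the per-pixel statistics pass. It is not the code I actually ran: the function name `per_pixel_stats`, the flat-list frame layout, the choice to treat a raw value of 0 as "no reading", and the use of the population standard deviation are all assumptions made for illustration.

```python
import math

def per_pixel_stats(frames):
    """Compute (min, max, mean, std) for each pixel across a list of frames.

    frames: list of frames, each a flat list of raw depth values, all the
    same length. Assumption: a raw value of 0 means "no reading" and is
    skipped. Pixels with no valid reading at all yield None.
    """
    n_pixels = len(frames[0])
    stats = []
    for i in range(n_pixels):
        samples = [frame[i] for frame in frames if frame[i] != 0]
        if not samples:
            stats.append(None)
            continue
        mean = sum(samples) / len(samples)
        # Population variance (divide by N); an assumption, not the post's stated choice.
        variance = sum((s - mean) ** 2 for s in samples) / len(samples)
        stats.append((min(samples), max(samples), mean, math.sqrt(variance)))
    return stats
```

Each pixel is handled independently, so a pixel that drops out in some frames still gets statistics from the frames where it did return a value.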

I dumped this to a CSV file, keeping only every fifth non-null/non-zero sample (since OpenOffice Calc choked on more than 65000 lines of data). Each row of the CSV holds the mean Z value, followed by the standard deviation. I plotted these as an XY scatter plot, with mean Z on the X axis (range of around 815 to just over 1000 Kinect Z units) and standard deviation on the Y axis. Since many pixels had nearly the same mean Z but different SD values, there is often more than one dot per X location on the graph.
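The thinning step can be sketched roughly as follows. The function name `write_subsampled_csv` and the two-decimal formatting are hypothetical; the only parts taken from the description above are the "mean, then standard deviation" column order and keeping every fifth valid sample.

```python
import csv

def write_subsampled_csv(pixel_stats, out, step=5):
    """Write every `step`-th valid (mean_z, std_dev) pair to `out` as CSV.

    pixel_stats: iterable of (mean_z, std_dev) pairs, or None for pixels
    with no data. Entries that are None or have a zero mean are skipped
    before counting, so the stride applies to valid samples only.
    Returns the number of rows written.
    """
    writer = csv.writer(out)
    valid_seen = 0
    kept = 0
    for entry in pixel_stats:
        if entry is None or entry[0] == 0:
            continue
        if valid_seen % step == 0:
            writer.writerow([f"{entry[0]:.2f}", f"{entry[1]:.2f}"])
            kept += 1
        valid_seen += 1
    return kept
```

Pass any writable text file object as `out`, for example one opened with `open("stats.csv", "w", newline="")`.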

As you can see, Kinect is kinda noisy. In the worst samples the standard deviation exceeds 75, but there seems to be a consistent floor where the standard deviation is 5 or under.

There is also a peculiar bell-curve that I can't explain. I would have expected error to always proportionally increase with distance, but it doesn't seem to. It could be unrelated, having something to do with the different optical properties of the materials of the test objects found at different distances, but this explanation seems inadequate to me in light of how clear the bell curve is.

The gap at around 930 Z units probably has to do with me not having an object covering that Z distance.

Attached is the CSV file I used if you wish to do your own analysis. If anyone wants other data produced/analyzed, I could probably assist.

Audio Steering

Moving on, my second topic is Kinect's built-in four-element phased-array beamforming microphone array. Though it is not yet supported by freenect, the question has arisen as to how accurate the beam steering is, and whether it could identify the speaking person in a room with two or more people. Fortunately this task doesn't require arbitrary beamforming for all possible locations and extraction of the strongest signal (which is computationally expensive). Since the Kinect's Z camera already knows where all the bodies' heads are, the problem is simply to beamform a virtual microphone aimed at the head of each person and reconstruct the audio heard at that spot. Whichever person is speaking will have the loudest audio from their beamformed virtual microphone. However, is the mic array sufficiently accurate to distinguish between two humans standing near each other or sitting on a couch?
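To make the "virtual microphone per head" idea concrete, here is a minimal delay-and-sum sketch. Everything in it is an assumption for illustration: the function names, 2D positions in meters, whole-sample delays, and the idea that four synchronized channels are available at all (freenect does not expose them yet).

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, roughly room temperature

def steer_delays(mic_positions, target, sample_rate):
    """Whole-sample delay per mic so that sound from `target` lines up."""
    dists = [math.dist(mic, target) for mic in mic_positions]
    farthest = max(dists)
    # Mics closer to the target hear the sound earlier, so they get delayed more.
    return [round((farthest - d) / SPEED_OF_SOUND * sample_rate) for d in dists]

def virtual_mic_power(channels, delays):
    """Mean power of the delay-and-sum output for one steering direction."""
    start = max(delays)
    length = len(channels[0])
    total = 0.0
    for t in range(start, length):
        mixed = sum(ch[t - d] for ch, d in zip(channels, delays))
        total += mixed * mixed
    return total / (length - start)

def loudest_head(channels, mic_positions, head_positions, sample_rate):
    """Index of the head whose virtual microphone has the most power."""
    powers = [
        virtual_mic_power(channels, steer_delays(mic_positions, head, sample_rate))
        for head in head_positions
    ]
    return max(range(len(powers)), key=powers.__getitem__)
```

Only one steering direction per known head is evaluated, which is the saving described above compared to scanning every possible location.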

Luckily a friend of mine is an audio hardware engineer with an interest in mic arrays and beam forming, and he put together a spreadsheet to calculate this. The spreadsheet is attached below. Any cell with a yellow background is supposed to be changed by the user. The fields are:

Frequency: This spreadsheet only works on a single frequency at a time. This is that frequency. Human speech is normally limited to 200 Hz to 3000 Hz. If you ignore tall males, then 300 Hz to 3 kHz is fine. Most of the important stuff is in the 1-1.5 kHz range.

Distance From Sensor: This is the distance from the Kinect to the couch.

Speed of sound: Change this if you want to simulate how the Kinect will operate on Mars or Jupiter.

Mic Dist from center: This is the side to side distance of the mic from the center of the Kinect. I kind of guessed at these values based on the pictures from the Kinect tear-down and my measurements of my Kinect. Basically, this sets the spacing of the mics in the array.

Couch Dist from Centerline: This is the distance that the person is from the centerline of the couch/Kinect. There must always be a "zero distance", but everything else can be changed.

Mic Dist, ft: This is the distance from the person to the mic, given the couch dist. from centerline.

Amplitude after mixing: This is the "volts" of the audio. Assumes that the beam steering is set for straight forward.

Normalized to 0ft from centerline: This is the amplitude normalized.
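For readers without a spreadsheet program, the fields above can be approximated in a few lines. This is a rough sketch of my reading of the calculation, not a port of the spreadsheet: the mic offsets are placeholder guesses, distance attenuation is ignored, the 1 kHz and 8 ft defaults are taken from the discussion in this post, and the function names are made up.

```python
import cmath
import math

# Hypothetical side-to-side mic offsets from the Kinect's center, in feet
# (one mic on one side, three on the other). Placeholders, not measurements.
MIC_OFFSETS_FT = (-0.37, 0.12, 0.25, 0.37)

def mixed_amplitude(couch_offset_ft, freq_hz=1000.0, sensor_dist_ft=8.0,
                    speed_ft_s=1125.0, mic_offsets_ft=MIC_OFFSETS_FT):
    """Amplitude after mixing all mics with the beam steered straight ahead.

    Each mic contributes a unit-amplitude tone whose phase depends on the
    path length from the talker to that mic ("Mic Dist, ft").
    """
    total = 0j
    for mic_x in mic_offsets_ft:
        path_ft = math.hypot(couch_offset_ft - mic_x, sensor_dist_ft)
        phase = 2.0 * math.pi * freq_hz * path_ft / speed_ft_s
        total += cmath.exp(-1j * phase)
    return abs(total)

def normalized_amplitude(couch_offset_ft, **kwargs):
    """Mixed amplitude relative to a talker at 0 ft from the centerline."""
    return mixed_amplitude(couch_offset_ft, **kwargs) / mixed_amplitude(0.0, **kwargs)
```

Calling `normalized_amplitude(x)` for a range of `x` values gives the same kind of column as "Normalized to 0ft from centerline"; the exact numbers depend entirely on the mic offsets you plug in.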

How to use this:

With sound we think in terms of a -3 dB threshold. For example, the cutoff frequency of a filter is normally the frequency at which the audio is 3 dB down from "normal". It's an arbitrary threshold, but the one we'll use here. When converting dB to our normalized output, -3 dB = 0.5 and 0 dB = 1.0. (Strictly speaking, 0.5 is the -3 dB point for a power ratio; for a voltage amplitude ratio, 0.5 is closer to -6 dB.)
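As a quick reference, the two usual decibel conventions can be written out as below. Which of them the spreadsheet's normalized column actually follows is not something this sketch can settle; the function names are just illustrative.

```python
import math

def power_ratio_to_db(ratio):
    """Decibels for a ratio of powers."""
    return 10.0 * math.log10(ratio)

def amplitude_ratio_to_db(ratio):
    """Decibels for a ratio of amplitudes (e.g. volts)."""
    return 20.0 * math.log10(ratio)

def db_to_power_ratio(db):
    """Inverse of power_ratio_to_db."""
    return 10.0 ** (db / 10.0)

print(round(power_ratio_to_db(0.5), 2))      # -3.01
print(round(amplitude_ratio_to_db(0.5), 2))  # -6.02
```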

So using the "default values" in the spreadsheet, we see that the normalized amplitude hits 0.5 (a.k.a. -3 dB) at 2.5 feet from the center of the couch. So at 1 kHz the "sweet spot" for mic pickup is 5 feet wide, 2.5 feet to either side of the center of the couch.
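Reading the crossing point off a column of numbers can also be automated. The sketch below takes any response function of the couch offset (in feet) and finds where it first drops to a threshold; the name `half_width`, the scan step, and the linear interpolation are assumptions, and the response function itself has to come from your own model or spreadsheet export.

```python
def half_width(response, threshold=0.5, max_offset_ft=10.0, step_ft=0.05):
    """Smallest offset (ft) at which response(offset) falls to `threshold`.

    Scans outward from the centerline in steps of `step_ft` and linearly
    interpolates between the last sample above the threshold and the first
    sample at or below it. Returns None if the threshold is never reached.
    """
    prev_x, prev_r = 0.0, response(0.0)
    if prev_r <= threshold:
        return 0.0
    steps = int(max_offset_ft / step_ft)
    for i in range(1, steps + 1):
        x = i * step_ft
        r = response(x)
        if r <= threshold:
            return prev_x + (prev_r - threshold) * (x - prev_x) / (prev_r - r)
        prev_x, prev_r = x, r
    return None
```

Doubling the returned value gives the full width of the pickup window, under the assumption that the response is roughly symmetric about the centerline.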

The -6dB "window" is about 6.5 feet wide.

While this all assumes a fixed "beam steering" direction, that's not what's important. What it shows is the width of the beam. We see 5 feet wide at 8 feet, but this is not going to change much whether the beam is directed straight ahead or 20 degrees to one side.

In my opinion, what this says is that in the center of the normal human speech frequency range the Kinect mic array can isolate the couch, but not individual people on the couch.

StatisticsImage.jpg (57.38 KB)
StatsGraph.png (128.43 KB)
statistics.7z (234.27 KB)
kinect_mic_array.zip (3.1 KB)

The reason why

Dear Xenon,

Your problem does indeed seem to be what is called "partial surface": the presence of several depth dots within a single pixel (let's recall that the dots projected by Kinect are not organized in a grid, but the non-Cartesian depth "image" is resampled into a Cartesian one). Within each pixel, the several depths obtained are averaged. A micro-fluctuation in the projected point position thus introduces a large depth standard deviation. You could verify this easily by removing the border pixels from your set of points.


Kinect audio

Hi, Xenon
I am a student currently working on a project using Kinect to locate a sound source and do some classification. The problem I have now is how to hack the Kinect audio, specifically how to get the synchronized 4-mic array data. There seems to be very little information, and few feasible methods, about audio hacking on the internet. I saw your analysis of beamforming using Kinect, and am wondering how you got the data from Kinect, or did you just simulate the results? I am new to Kinect but still trying to solve the problem. Thanks very much! I appreciate your help with that.



Hey Chris,

I saw this link on the osg forum and thought I'd check it out. Something you could do to get more of a sense of where the noise may be coming from is to set a standard deviation threshold to obtain a set of pixels which are particularly noisy, and highlight/color (red perhaps) on the image for easy visualization. As you suggest it could reveal particular materials which are troublesome, but I agree that seems unlikely. More likely perhaps is that edges are difficult to accurately place in the z dimension as the sensor may switch between detecting the nearer/further surface just due to noise. The physical placement of objects in the scene could lead to the properties of the noise that you see, because objects roughly half way through the depth field can have large z-distance between them and the surface they're occluding, where the closer and further objects have smaller distances to their occluded surfaces. If this possibility plays out, you could think about not just filtering or minimizing noise, but using noise as a contribution to an edge detection technique.
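The thresholding idea could look something like the sketch below. The function name, the flat-list image layout, and the grayscale-plus-red output are all illustrative assumptions; a sensible threshold would have to be picked from the scatter plot.

```python
def highlight_noisy(gray_pixels, std_devs, threshold):
    """Return RGB tuples: red where a pixel's std dev exceeds `threshold`.

    gray_pixels: flat list of 0-255 intensities (e.g. a scaled depth image).
    std_devs: flat list of per-pixel standard deviations, with None for
    pixels that had no valid readings. Other pixels stay gray.
    """
    out = []
    for gray, sd in zip(gray_pixels, std_devs):
        if sd is not None and sd > threshold:
            out.append((255, 0, 0))
        else:
            out.append((gray, gray, gray))
    return out
```

If the red pixels cluster along object outlines, that would support the edge explanation over the material one.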

