Writing software to tell where sound comes from (directional listening)

If you look into research papers on multi-phase microphone arrays, specifically those used for underwater direction finding (i.e., a big area of submarine research during the Cold War: where is the motor sound coming from so we can aim the torpedoes?), then you'll find the technology and math required to find the location of a sound given two or more microphone inputs.

It's non-trivial, though, and not something that can be covered adequately here, so you aren't going to find an easy code snippet and/or library to do what you need.

The main issue is eliminating echoes and shadows. A simplistic method would be to start with a single tone, filter out everything but that tone, then measure the phase difference of that tone between the two microphones. The phase difference will give you a lot of information about the location of the tone.
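For what it's worth, a minimal NumPy sketch of that single-tone idea might look like the following. Here left, right and fs are assumed to be two synchronised microphone signals and their sample rate, and the helper name is made up:

import numpy as np

def tone_delay(left, right, fs):
    """Estimate the left/right time delay of the dominant tone by comparing
    FFT phases at the strongest frequency bin."""
    spec_l = np.fft.rfft(left)
    spec_r = np.fft.rfft(right)
    #pick the strongest non-DC tone in the left channel
    #(a crude stand-in for "filter out everything but that tone")
    k = np.argmax(np.abs(spec_l[1:])) + 1
    freq = k * fs / len(left)
    #phase difference at that bin, wrapped into [-pi, pi)
    dphi = np.angle(spec_l[k]) - np.angle(spec_r[k])
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi
    #convert the phase difference into a time delay; this is only unambiguous
    #if the microphone spacing is less than half the tone's wavelength
    return dphi / (2 * np.pi * freq)

The sign of the returned delay tells you which microphone the wavefront reached first.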

You can then choose whether you want to deal with echoes and multipath issues (many of which can be eliminated by removing all but the strongest tone) or move onto correlating sounds that consist of something other than a single tone - a person talking, or a glass break, for instance. Start small and easy, and expand from there.


I was looking up something similar and wrote a dumb answer here that got deleted. I had some ideas but didn't really write them up properly. The deletion bruised my internet ego a bit, so I decided to actually try the problem, and I think it worked!

Actually doing a real localization à la Adam Davis's answer is very difficult, but doing a human-style location (looking at the first source, ignoring echoes, or treating them as sources) is not too bad, I think, though I'm not a signal processing expert by any means.

I read this and this, which made me realise that the problem is really one of finding the time shift between the two signals (cross-correlation). From there you can calculate the angle using the speed of sound. Note that you'll get two solutions (front and back).
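In other words, if t is the measured time shift, d the microphone spacing and v the speed of sound, then sin(angle) = v*t/d. A minimal sketch of that using scipy.signal.correlate (all the names here are placeholders, not code from the answers I read):

import numpy as np
from scipy.signal import correlate

def bearing_from_delay(left, right, fs, mic_distance, v=343.0):
    """Estimate the bearing in degrees (0 = straight ahead) from the lag
    of the peak of the cross-correlation of two microphone signals."""
    cor = correlate(left, right, mode='full')
    lag = np.argmax(cor) - (len(right) - 1)     # delay in samples
    dt = lag / fs                               # delay in seconds
    sin_theta = np.clip(v * dt / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))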

The key information I read was in this answer and others on the same page, which talk about how to use fast Fourier transforms in SciPy to find the cross-correlation curve.

Basically, you need to import the wave file into Python. See this.
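As one possible shortcut, scipy.io.wavfile can give you the two channels as NumPy arrays directly (the file name here is just a placeholder, and this assumes a 16-bit PCM stereo file):

from scipy.io import wavfile

rate, data = wavfile.read('stereo_recording.wav')   # data has shape (nframes, nchannels)
left, right = data[:, 0], data[:, 1]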

If your input is a tuple of two numpy arrays (left, right), each zero-padded with at least its own length of zeroes (apparently to stop the correlation wrapping around circularly), the code follows from Gustavo's answer. I think you need to recognise that FFTs assume time-invariance, which means that if you want any kind of time-based tracking of signals you need to 'bite off' small chunks of data.

I brought the following code together from the mentioned sources. For each chunk it estimates the time delay, in frames, between left and right (negative/positive); to convert that to actual time, divide by the sample rate. To turn the delay into an angle (which is what gets plotted), you need to:

  • assume everything is on a plane (no height factor)
  • forget the difference between sounds in front and those behind (you can't differentiate them)

You would also want to use the distance between the two microphones to reject echoes (time delays longer than the delay corresponding to 90 degrees, i.e. the mic spacing divided by the speed of sound).

I realise that I've borrowed a lot here, so thanks to all of those who inadvertently contributed!

import wave
import struct
from numpy import array, concatenate, argmax
from numpy import abs as nabs
from scipy.signal import fftconvolve
from matplotlib.pyplot import plot, show
from math import asin

def crossco(wav):
    """Returns cross correlation function of the left and right audio. It
    uses a convolution of left with the right reversed which is the
    equivalent of a cross-correlation.
    """
    cor = nabs(fftconvolve(wav[0],wav[1][::-1]))
    return cor

def trackTD(fname, width, chunksize=5000):
    """Track the estimated angle of arrival over time.
    width is the distance between the two microphones, in metres."""
    track = []
    #open the wave file using Python's built-in wave library
    wav = wave.open(fname, 'r')
    #get the info from the file; this is kind of ugly and non-PEPish
    (nchannels, sampwidth, framerate, nframes, comptype, compname) = wav.getparams()

    #only loop while there are enough whole chunks left in the wave
    #(tell() and nframes are both counted in frames, regardless of channel count)
    while wav.tell() < nframes - chunksize:

        #read one chunk of audio frames as a sequence of bytes
        frames = wav.readframes(chunksize)

        #unpack that byte sequence into 16-bit samples (interleaved if stereo)
        out = struct.unpack_from("%dh" % (chunksize * nchannels), frames)

        # Convert the 2 channels to numpy arrays
        if nchannels == 2:
            #the left channel is the 0th and other even-numbered elements
            left = array(out[0::2])
            #the right channel is the odd-numbered elements
            right = array(out[1::2])
        else:
            left = array(out)
            right = left

        #zero-pad each channel with as many zeroes as it has samples
        left = concatenate((left, [0]*chunksize))
        right = concatenate((right, [0]*chunksize))

        chunk = (left, right)

        #if the volume is very low (peak below 800), assume 0 degrees
        if abs(max(left)) < 800:
            a = 0.0
        else:
            #otherwise compute how many frames of delay there are in this chunk
            cor = argmax(crossco(chunk)) - chunksize*2
            #convert the delay to seconds
            t = cor / float(framerate)
            #get sin(angle), assuming v = 340 m/s: sin(a) = (t*v)/width
            #clamp to [-1, 1] so echoes with impossibly long delays don't crash asin
            sina = max(-1.0, min(1.0, t*340/width))
            a = asin(sina) * 180/3.14159

        #add this chunk's angle estimate to the list
        track.append(a)


    #plot the list
    plot(track)
    show()

I tried this out using some stereo audio I found at equilogy. I used the car example (stereo file). It produced this.

To do this on-the-fly, I guess you'd need to have an incoming stereo source that you could 'listen to' for a short time (I used 1000 frames = 0.0208s) and then calculate and repeat.
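A rough sketch of that loop, assuming the third-party sounddevice package and reusing the chunk-correlation idea from the code above (the sample rate, block size and microphone spacing are all assumptions):

import numpy as np
import sounddevice as sd                     # third-party; pip install sounddevice
from scipy.signal import fftconvolve
from math import asin, degrees

fs = 48000        # sample rate (assumed)
chunk = 1024      # frames per estimate, roughly 0.02 s at 48 kHz
width = 0.2       # microphone spacing in metres (assumed)

def estimate_angle(block):
    """Estimate the angle of arrival for one block of stereo samples."""
    left, right = block[:, 0], block[:, 1]
    cor = np.abs(fftconvolve(left, right[::-1]))
    lag = np.argmax(cor) - (len(right) - 1)          # delay in frames
    sin_a = np.clip(340.0 * lag / fs / width, -1.0, 1.0)
    return degrees(asin(sin_a))

with sd.InputStream(samplerate=fs, channels=2) as stream:
    while True:
        block, overflowed = stream.read(chunk)       # blocks until `chunk` frames arrive
        print(estimate_angle(block))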

[edit: I found you can simply use the FFT convolve function, with one of the two time series reversed, to compute the correlation]


This is an interesting problem. I don't know of any reference material for this, but I do have some experience in audio software and signal processing that may help point you in the right direction.

Determining sound source direction (where the sound is coming from around you) is fairly simple. Get 6 directional microphones and point them up, down, front, back, left, and right. By looking at the relative amplitudes of the mic signals in response to a sound, you could pretty easily determine which direction a particular sound is coming from. Increase the number of microphones for increased resolution.
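As a toy illustration of that amplitude-comparison idea (the microphone ordering and the RMS weighting are assumptions, not a standard algorithm):

import numpy as np

#unit vectors for the up, down, front, back, left, right microphones (assumed ordering)
DIRECTIONS = np.array([
    [0, 0, 1], [0, 0, -1],
    [1, 0, 0], [-1, 0, 0],
    [0, 1, 0], [0, -1, 0],
], dtype=float)

def dominant_direction(mic_blocks):
    """mic_blocks: six equal-length sample arrays, one per directional mic.
    Returns a crude amplitude-weighted unit vector pointing towards the sound."""
    rms = np.array([np.sqrt(np.mean(np.square(b))) for b in mic_blocks])
    vec = (rms[:, None] * DIRECTIONS).sum(axis=0)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec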

Two microphones would only tell you whether a sound is coming from the left or the right. The reason your two ears can figure out whether a sound is coming from in front of you or behind you is that the outer structure of your ear modifies the sound depending on its direction, which your brain interprets and corrects for.