Speech Overlap Detection

girkenroth

Hi,

I’m trying to figure out how to use Pd to differentiate between single-voice speech and a multiple-voice conversation in which voices overlap. In other words, the moment of overlap is what I’m interested in. I want to be able to listen to a conversation between two people and detect when their voices collide / overlap.

I guess that, spectrally, there’s a major difference between a conversation with no overlaps and one with speech overlaps, so perhaps one direction to look at is detecting sudden spectral changes?

Any thoughts / ideas / pointers for achieving this are very welcome!

Thanks,
K

mod

bonk~ will detect spectral changes, but even a single person speaking will have a huge spectral variation in their speech patterns.

i doubt that this can be done without some pretty intense AI pattern recognition

mod

what you MIGHT find, is that when voices overlap, you suddenly get a lot more dissonant harmonics,

but i have absolutely no idea how you'd measure that.

emacpher

You might be able to detect non-harmonic mixtures of harmonics using the cepstrum ... this is the (inverse) Fourier transform of the log magnitude of the Fourier transform of the signal. Or maybe just get the autocorrelation of the spectrum and see how periodic that is?
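To make the cepstrum suggestion concrete, here is a minimal sketch in Python/NumPy (not Pd). It assumes a 16 kHz sample rate, 2048-sample frames, and synthetic harmonic tones standing in for voices; the function names and the 60-400 Hz pitch range are illustrative choices, not anything from a Pd library. A single voiced talker should give one clear cepstral peak at the pitch period, and the hope is that an overlap shows up as a weaker or ambiguous peak.

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse FFT of the log magnitude spectrum of a Hann-windowed frame."""
    windowed = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-9)  # small offset avoids log(0)
    return np.fft.irfft(log_mag)

def cepstral_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Return (pitch estimate in Hz, cepstral peak height) within an assumed pitch range."""
    ceps = real_cepstrum(frame)
    lo, hi = int(sr / fmax), int(sr / fmin)  # quefrency range in samples
    q = lo + int(np.argmax(ceps[lo:hi]))
    return sr / q, float(ceps[q])

def harmonic_tone(f0, sr, n, n_harmonics=20):
    """Synthetic stand-in for a voiced sound: harmonics of f0 with 1/k amplitudes."""
    t = np.arange(n) / sr
    return sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, n_harmonics + 1))

sr, n = 16000, 2048
one_voice = harmonic_tone(125.0, sr, n)
two_voices = one_voice + harmonic_tone(190.0, sr, n)
print("one voice: ", cepstral_pitch(one_voice, sr))
print("two voices:", cepstral_pitch(two_voices, sr))
```

Comparing the peak height frame by frame against a threshold tuned on real recordings would be one way to flag candidate overlaps; real speech is far less tidy than these test tones, so treat the numbers as a starting point only.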

Here are some papers I googled up that look at two-talker detection:

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1369314
http://www.lrde.epita.fr/~reda/cours/speech/speakerDiarization/4518619.pdf

girkenroth

how do you get the autocorrelation of the spectrum? is there a Pd object for doing this?

seb-harmonik.ar

try http://puredata.hurleur.com/sujet-6846-autocorrelation-wiener-khinchin-theorem-why-doesn-work

this may be helpful too: http://puredata.hurleur.com/sujet-6776-helmholtz-guess
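As a rough illustration of the idea behind the first link, the Wiener-Khinchin theorem says an autocorrelation can be computed as the inverse FFT of a power spectrum. The sketch below is Python/NumPy rather than a Pd patch, and applies that to the magnitude spectrum itself to get the "autocorrelation of the spectrum". The function names, frame size and pitch range are assumptions made for the example.

```python
import numpy as np

def autocorr_fft(x):
    """Normalized linear autocorrelation via Wiener-Khinchin (IFFT of the power spectrum)."""
    n = len(x)
    x = x - np.mean(x)
    spec = np.fft.rfft(x, 2 * n)  # zero-pad to 2n so the result is not circular
    ac = np.fft.irfft(spec * np.conj(spec))[:n]
    return ac / ac[0] if ac[0] > 0 else ac

def spectral_autocorrelation(frame):
    """Autocorrelation of the magnitude spectrum; the lag axis is in FFT bins."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    return autocorr_fft(mag)

def spectral_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Harmonic spacing (Hz) at the strongest spectral-autocorrelation lag, plus its strength."""
    n = len(frame)
    sac = spectral_autocorrelation(frame)
    lo, hi = int(np.ceil(fmin * n / sr)), int(fmax * n / sr)
    lag = lo + int(np.argmax(sac[lo:hi + 1]))
    return lag * sr / n, float(sac[lag])

def harmonic_tone(f0, sr, n, n_harmonics=20):
    """Synthetic stand-in for a voiced sound: harmonics of f0 with 1/k amplitudes."""
    t = np.arange(n) / sr
    return sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, n_harmonics + 1))

sr, n = 16000, 2048
print(spectral_pitch(harmonic_tone(125.0, sr, n), sr))
```

A strongly periodic spectrum (one harmonic series) gives a high peak at the harmonic spacing; two interleaved harmonic series should make the peak less dominant, which is the property one would try to threshold.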

girkenroth

thanks for all replies.

before delving into detecting speech overlaps, i think i should try building a simple patch for speech and non-speech activity detection. can anyone share a patch that does that? or is there a Pd external for voice activity detection?

sunji

Speech is composed of two sonic elements: periodic pitched phonemes, and transient noise phonemes. Roughly, these correspond to vowels and consonants respectively.

If there is a single pitch present, with a pseudomelodic contour, articulated with dynamic noise envelopes, you might have a voice talking. But it might also be a clarinet solo.

To frame the desired tool correctly, we should figure out what is not speech but might be received as input. Will these be taped conversations with minimal environmental noise? Or will we try to train it to distinguish between human speech and a dog bark?

girkenroth

As a start, the tool should be able to detect human speech activity within an indoor environment with minimal background noise - say a gallery, an office, etc.
The indoor environment may occasionally include some short mechanical sounds which may be louder than normal/standard human speech.
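
Given the constraints above (quiet indoor room, occasional short but loud mechanical sounds), one possible starting point is a frame-based detector that requires three things at once: enough energy, a periodic (pitched) component, and a minimum duration. The sketch below is Python/NumPy, not a Pd patch; every threshold, the frame sizes and the function names are assumptions to be tuned on real recordings. It only catches voiced speech, and as noted earlier in the thread, a sustained pitched instrument would also pass.

```python
import numpy as np

def frame_features(frame, sr, fmin=60.0, fmax=400.0):
    """Return (rms, periodicity); periodicity is the autocorrelation peak in the pitch range."""
    frame = frame - np.mean(frame)
    rms = float(np.sqrt(np.mean(frame ** 2)))
    if rms == 0.0:
        return 0.0, 0.0
    n = len(frame)
    spec = np.fft.rfft(frame, 2 * n)  # zero-padded FFT gives a linear autocorrelation
    ac = np.fft.irfft(spec * np.conj(spec))[:n]
    ac = ac / ac[0]
    lo, hi = int(sr / fmax), int(sr / fmin)
    return rms, float(np.max(ac[lo:hi]))

def remove_short_runs(flags, min_run):
    """Clear runs of True shorter than min_run frames (rejects brief events)."""
    out = flags.copy()
    start = None
    for i, f in enumerate(np.append(flags, False)):
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start < min_run:
                out[start:i] = False
            start = None
    return out

def detect_speech(signal, sr, frame_len=1024, hop=512,
                  rms_thresh=0.02, periodicity_thresh=0.5, min_run=3):
    """Per-frame boolean flags: loud enough, periodic enough, and long enough."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    feats = [frame_features(signal[s:s + frame_len], sr) for s in starts]
    flags = np.array([r > rms_thresh and p > periodicity_thresh for r, p in feats],
                     dtype=bool)
    return remove_short_runs(flags, min_run)

def harmonic_tone(f0, sr, n, n_harmonics=20):
    """Synthetic stand-in for a voiced sound: harmonics of f0 with 1/k amplitudes."""
    t = np.arange(n) / sr
    return sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, n_harmonics + 1))
```

The energy threshold alone would be fooled by a loud click, which is why the periodicity and minimum-run checks are there. In Pd, the same three cues could probably be approximated with an envelope follower and a pitch tracker, but that mapping is untested here.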

girkenroth

any ideas?