This is a hard problem with a bunch of other tough computer science and DSP problems as part of it. I think it's ambitious. However, nothing like a mountain to challenge the spirit of man... so here's what I know about it.....
To change the voice of a speaker in real time you need a three-stage process: analysis, a magical intermediate stage of transformation in the "parametric domain", and resynthesis.
The analysis / synthesis part is fairly easy. Mobile phone technology already uses applications of LPC (linear predictive coding) and phase vocoders that split the voice up into a set of filter coefficients and an excitation signal. These are recombined in the receiving handset by a resynthesis stage. So, something few people realise: when you listen to a friend talk on a mobile phone you are not hearing their real voice, you are hearing a resynthesised voice. The signal is split up this way because it reduces bandwidth and compresses the data sent, but it opens another possibility...
If you alter the filter coefficients it's possible to change the voice, even to another age or gender.
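To make the idea concrete, here is a minimal sketch of that LPC pipeline: analyse, inverse-filter to recover the excitation, warp the pole angles to move the formants, then resynthesise. The signal is a crude synthetic vowel (an impulse train through two resonators), not real speech, and `lpc` / `shift_formants` are illustrative helpers I've made up, not a library API:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    n = len(x)
    r = np.array([np.dot(x[:n - i], x[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e                      # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        e *= (1.0 - k * k)
    return a

def shift_formants(a, factor):
    """Scale the pole angles of the all-pole filter: moving a pole's angle
    moves the resonance (formant) it creates, leaving its bandwidth alone."""
    poles = np.roots(a)
    angles = np.clip(np.angle(poles) * factor, -np.pi, np.pi)
    return np.real(np.poly(np.abs(poles) * np.exp(1j * angles)))

# Build a crude synthetic vowel: 100 Hz impulse train through two resonators.
fs = 8000
exc = np.zeros(fs)
exc[::80] = 1.0                           # "glottal" impulse train
sig = exc
for f0, bw in [(700, 100), (1200, 120)]:  # two made-up formant frequencies
    r = np.exp(-np.pi * bw / fs)
    th = 2 * np.pi * f0 / fs
    sig = lfilter([1.0], [1.0, -2 * r * np.cos(th), r * r], sig)

a = lpc(sig, 10)
residual = lfilter(a, [1.0], sig)         # inverse filter -> excitation estimate
a_up = shift_formants(a, 1.2)             # raise formants ~20%
out = lfilter([1.0], a_up, residual)      # resynthesise with altered coefficients
```

Because the excitation (pitch) passes through untouched while only the filter coefficients change, the result keeps the speaker's intonation with a different vocal-tract character, which is exactly the separation that makes LPC attractive here.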
It will sound artificial unless you get the mapping exactly right. Getting this part to work is at the forefront of research into speaker-independent speech recognition, which deals with words as matrices in a "parameter space" rather than as simple time- or frequency-domain signals. Perry Cook and Eduardo Miranda have done some of this, but only in one direction, from the physical parameters to the signal. To make a voice changer as you describe you need to go both ways: derive the physical parameters from the signal, alter them, and then resynthesise the voice.
This would make a good post-doctoral research project for a team of 2-5 programmers... just to let you know what you're getting into! And it has no practical commercial uses other than deception, so outside an artistic context I would remain mindful of that if I were you.
A good place to start would be the phase vocoder: experiment by goofing with the analysis data to shift the formants. A better system is probably (edit: *wavelet analysis and Fourier resynthesis*) Linear Predictive Coding, because that makes it easier to transform formants independently of anything else. See the Tapestrea sound-design software, which could have interesting applications here. A dirty solution would be a form of cross-synthesis with a limited dictionary of recognised transformations.
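The "goofing with the analysis data" experiment can be sketched in STFT terms: estimate each frame's spectral envelope (here by cepstral smoothing, one choice among several), stretch the envelope along the frequency axis, and reapply it over the untouched fine structure. The test tone and the helper name `shift_formants_stft` are my own assumptions, not anything from an existing library:

```python
import numpy as np

def shift_formants_stft(x, factor, nfft=1024, hop=256, lifter=32):
    """Move each frame's spectral envelope (formants) up by `factor`
    while leaving the fine structure (pitch harmonics) where it was."""
    win = np.hanning(nfft)
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for start in range(0, len(x) - nfft, hop):
        frame = x[start:start + nfft] * win
        spec = np.fft.rfft(frame)
        mag = np.abs(spec) + 1e-12
        cep = np.fft.irfft(np.log(mag))        # real cepstrum of the frame
        cep[lifter:-lifter] = 0.0              # keep low quefrencies = envelope
        env = np.exp(np.fft.rfft(cep).real)
        bins = np.arange(len(env))
        warped = np.interp(bins / factor, bins, env)  # stretch envelope upward
        resynth = np.fft.irfft(spec / env * warped) * win
        out[start:start + nfft] += resynth     # overlap-add resynthesis
        norm[start:start + nfft] += win ** 2
    norm[norm < 1e-8] = 1.0                    # avoid divide-by-zero at edges
    return out / norm

# Toy "voiced" test tone: a 150 Hz carrier with slow amplitude movement.
fs = 16000
t = np.arange(fs // 2) / fs
x = np.sin(2 * np.pi * 150 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 3.0 * t))
y = shift_formants_stft(x, 1.25)               # raise formants 25%, pitch untouched
```

Dividing out the envelope before multiplying in the warped one is what keeps this a formant shift rather than a pitch shift; doing both independently is the whole game.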
As a quick and practical solution you might find that certain VST plugins similar to Antares Auto-Tune can be subverted to alter speech in a way that renders the speaker unrecognisable. This is used in TV documentaries for interviews where the subject wants to stay anonymous. Not to be alarmist, but it is actually possible to reverse this process and obtain the original voice if you know what you are doing, and speaker-identification software works by analysing the mannerisms of speech, not the exact signal, so to truly disguise a speaker it's best to get an actor to read their words.
In summary: changing a voice so that it sounds like another (generic) person - quite easy;
changing speaker A into speaker B so that a human would be fooled - very difficult.