What is the best way to capture multiple audio streams and mix them together into a single audio stream for passing in to the ASF mux/writer?
For example, reading from mic or line in and from wav mix (separately), running a DMO filter on the mic/line, then mixing with the wav mix (and performing any SRC to match the wav mix), then sending to the ASF mux/writer as a single stereo stream.
I tried to set this up with GraphEdit; no luck (the AVI muxer allowed this to work, but I need to output ASF). I also tried the Solveig ASF multiplexer: it appears to store the channels unmixed in the file; playback on a stereo system loses the extra channel.
Well, you can configure the ASF writer using LoadProfileByData(). Profiles can be created using GenProfile (from the WMF SDK).
From http://msdn.microsoft.com/library/default.asp?url=/library/en-us/WMFORM11/htm/inputsstreamsandoutputs.asp we can add extra streams to the ASF file. What I need to do is mix both streams into a single stereo stream, as opposed to creating a file with multiple audio streams (as is needed for surround formats such as 5.1). Any pointers to documentation and/or samples are appreciated.
Even though this is not possible to test with GraphEdit, perhaps all that is needed is to add another audio stream to the ASF writer, and based on the stereo format from the profile, the ASF writer will automatically convert and mix the streams into a single stereo stream?
I am also looking to build a filter that can mix audio streams from multiple input pins.
I'm assuming I'll have to base it on CBaseFilter, since CTransformFilter only supports a single input pin.
I'm planning to use multiple source graphs, linked to my render graph (which will contain this filter) via multiple GMFBridge filters, so the input samples will have different timestamps and, quite possibly, different sample rates. I'm thinking of accepting uncompressed PCM streams from MP3 decoders, AAC decoders and MPEG movie clips, as well as a live stream from the sound card. I also need to be able to disconnect and reconnect different streams (which may have different sample rates) while the renderer is still playing. The output stream will obviously have one sample rate and one set of timestamps, so I'm thinking I'll have to drop or duplicate some of the input samples to keep all the streams synchronised, and also make up timestamps for the output stream.
What I'm wondering is, should I treat the sample rate of each input stream as gospel and ignore the timestamps on the input samples (assuming the samples will arrive in the correct order), or should I build the output stream based on the timestamps (compensated for different start times) of the input streams?
That's a tricky question. Most timestamps are generated from the samples themselves; however, the timestamps could start with an offset if the audio stream was synchronized with a video stream. My suggestion would be to support both and make it a configurable option via a custom interface.
It's probably wise to start with CBaseFilter because of the synchronization requirements, i.e. the output pin is not synchronized with the input pins. Resampling the different source rates to a common output rate will most likely be the most challenging part.
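Compared with the resampling, the summing itself is straightforward. A minimal sketch of the core mixing step, assuming all inputs have already been resampled to a common rate and aligned (the function name and the choice of 16-bit PCM are illustrative assumptions, not from the thread):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sum same-rate, same-length 16-bit PCM buffers into one output buffer,
// saturating instead of wrapping on overflow.
std::vector<int16_t> MixPcm16(const std::vector<std::vector<int16_t>>& inputs)
{
    if (inputs.empty()) return {};
    const size_t n = inputs[0].size();
    std::vector<int16_t> out(n);
    for (size_t i = 0; i < n; ++i) {
        int32_t acc = 0;                                // widen to avoid overflow
        for (const auto& in : inputs)
            acc += in[i];
        acc = std::clamp<int32_t>(acc, -32768, 32767);  // saturate, don't wrap
        out[i] = static_cast<int16_t>(acc);
    }
    return out;
}
```

A real filter would do this per media sample on the output pin's thread; the clamp is what keeps two loud inputs from wrapping into noise.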
I've looked a bit further into this. Timestamps are probably not too important on something like an audio stream, unless I'm synchronising to a video stream. I can probably rely on the nSamplesPerSec member of WAVEFORMATEX and the dRate parameter of IPin::NewSegment. If I have to resample a stream with timestamps, it would be preferable to synchronise the video to the resampled audio timestamps rather than the other way round, so that any playback-rate error created by resampling the audio doesn't lose synch with the video. (I wouldn't want to show multiple video streams at the same time, apart from during dissolves and wipes, so losing synch on some of them wouldn't matter.) The Windows kernel mixer can handle multiple audio streams from disparate applications and send them to the same endpoint, so it can't be rocket science.
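Even relying on nSamplesPerSec rather than input timestamps, the drop/duplicate decision still needs a drift check against the reference clock. A hedged sketch of that check, assuming DirectShow's 100-ns REFERENCE_TIME convention (the function name is hypothetical):

```cpp
#include <cstdint>

// Given how long the output timeline has been running (in 100-ns units) and
// how many input samples have actually arrived, return how many samples the
// mixer should duplicate (positive) or drop (negative) to stay in step.
int64_t SamplesToCorrect(int64_t elapsed100ns, uint32_t sampleRate,
                         int64_t samplesReceived)
{
    // Samples the stream *should* have delivered by now at its nominal rate.
    const int64_t expected = elapsed100ns * sampleRate / 10'000'000;
    return expected - samplesReceived;  // >0: under-delivering, duplicate
                                        // <0: over-delivering, drop
}
```

In practice you would apply the correction gradually (one sample at a time) rather than in a burst, to keep it inaudible.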
No, it's not rocket science. The hardest part is getting hold of a good audio resampling algorithm. You can use the PCM ACM filter, but it's garbage (bad quality and inaccurate). Secret Rabbit Code is a good one, available for purchase for commercial use or free for GPL use.
Hi, sorry for the late reply. My solution was to use DirectShow only to capture the video stream (real-time app). All audio is processed with DirectSound (low-level, custom mixing thread). The custom mixer also generates clock data (based on sample count) which drives a DirectShow clock which is used by the custom video capture filter. The sound samples are combined with the video samples using WMWriter directly (not through DirectShow).
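The sample-count-driven clock described above boils down to one conversion, sketched here assuming DirectShow's 100-ns REFERENCE_TIME units (the function name is hypothetical; a real implementation would sit behind IReferenceClock):

```cpp
#include <cstdint>

// Convert the mixer's running sample count into a DirectShow-style
// REFERENCE_TIME (100-ns units). 64-bit math keeps the multiply from
// overflowing for any realistic session length.
int64_t SamplesToReferenceTime(int64_t samplesWritten, uint32_t sampleRate)
{
    return samplesWritten * 10'000'000 / sampleRate;
}
```

Driving the graph clock from this value is what ties video capture to the audio hardware's actual rate rather than to the system timer.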
In order to support seeking, real-time index code has to be written and implemented through a custom Writer Sink. That's not a lot of code, but it requires a bit of work, as the spec (http://www.microsoft.com/windows/windowsmedia/forpros/format/asfspec.aspx) does not tell the whole story.
Even with the above tightly coupled timer/clock system, audio-video sync can still be a problem, as some webcams deliver invalid timestamps. Thus, I implemented various work-arounds (for example, ignoring video timestamps and using the actual time of delivery). This will not work with webcams which buffer frames and deliver late, as is the case with webcams with special effects enabled.
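The "use actual time of delivery" work-around above can be sketched as a simple fallback rule, assuming 100-ns stream times (all the names here are hypothetical):

```cpp
#include <cstdint>

// If a capture device delivers a missing or non-monotonic timestamp, fall
// back to the time the sample actually arrived (stream time at delivery).
// lastStart is the previous sample's chosen start time.
int64_t ChooseStartTime(bool hasTimestamp, int64_t deviceStart,
                        int64_t deliveryTime, int64_t lastStart)
{
    // Reject absent timestamps and timestamps that run backwards.
    if (!hasTimestamp || deviceStart <= lastStart)
        return deliveryTime;
    return deviceStart;
}
```

As noted, this heuristic breaks down when a device buffers frames and delivers them late: the delivery time is then no better than the bad timestamp.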
For complex systems, you may have the best luck writing as much of the code yourself, even though initially it is more work. I tried many different ways to solve this problem, using existing systems/code, but ultimately achieved the best results with custom code.
As for resampling, you can probably get away with dropping/adding a single sample with no major issue (small deltas). The bigger challenge is computing actual rates from the “hardware” cursors (which tend to be noisy). I can run about 30 ms latency on most systems (using DirectSound, a custom mixer thread and sync code, relative to the hardware cursor). Actual latency is higher due to the kernel mixer, overhead, etc. For small resampling amounts, you can use simple linear interpolation. For larger ratios, the best methods involve FIR filtering (band-limited interpolation: http://ccrma.stanford.edu/~jos/pasp/Bandlimited_Interpolation.html). Cubic interpolation is probably not worth the effort over linear.
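For the small-ratio case, linear interpolation can be sketched as follows for mono float PCM. This is a minimal illustration under my own naming, not production SRC; as noted above, larger ratios call for a band-limited FIR resampler instead:

```cpp
#include <cstddef>
#include <vector>

// Resample mono float PCM from srcRate to dstRate by linear interpolation.
std::vector<float> ResampleLinear(const std::vector<float>& in,
                                  double srcRate, double dstRate)
{
    if (in.size() < 2 || srcRate <= 0 || dstRate <= 0) return {};
    const double step = srcRate / dstRate;  // input samples per output sample
    const size_t outLen = static_cast<size_t>((in.size() - 1) / step) + 1;
    std::vector<float> out(outLen);
    for (size_t i = 0; i < outLen; ++i) {
        const double pos  = i * step;                  // position in input
        const size_t idx  = static_cast<size_t>(pos);
        const double frac = pos - idx;
        const float a = in[idx];
        const float b = (idx + 1 < in.size()) ? in[idx + 1] : a;
        out[i] = static_cast<float>(a + (b - a) * frac);
    }
    return out;
}
```

Linear interpolation acts as a weak low-pass filter, so it audibly dulls material with strong high-frequency content when the ratio is large; that's where the FIR approach earns its cost.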
Thanks for the help, guys. I'm looking into Secret Rabbit Code. The commercial licence looks pretty reasonable, in case I go down that route.
For straightforward rendering of multiple disparate audio streams, would there be anything wrong in using multiple instances of the waveOut renderer (or even multiple graphs) in the same application and letting the kernel mixer sort it out?
Yes, you can use multiple audio renderers at different rates without issue. It's probably best to keep them in the same graph so that they share the same reference clock automatically, although in theory it should be possible to share a clock between multiple graphs, provided you take care in how it is used.
Secret Rabbit Code (by Erik de Castro Lopo) is an optimized (float) version of Julius O. Smith's band-limited interpolation resampler (fixed point). Dominic Mazzoni released a floating point version as well (not as optimized as SRC, but LGPL instead of GPL/Commercial): http://www-ccrma.stanford.edu/~jos/resample/Available_Software.html
For real-time apps, simple methods can work surprisingly well (for example, FMOD's use of linear interpolation in games is good enough for sound effects). If the application starts out with highly compressed audio (such as MP3, WMA or OGG), then as long as the SRC does not increase perceived noise/distortion, it's good enough.
If timing/synchronization is not critical, rendering the audio pins for multiple streams in the same graph should be a fast and easy way to go: you can quickly test your scenarios in GraphEdit.
Regarding complex filter graphs (using multiple renderers, GMFBridge, etc.): if you run into problems (especially timing-related ones), you may find it beneficial to remove complexity. I originally tried to do everything in DirectShow (it looked like it would be the simple/fast development solution), but after stripping away complexity while debugging timing issues, all that remained in DirectShow was video stream capture, i.e. the real-time, timing-critical capture components. (Code for decompressing samples, video, etc., still uses DirectShow.)
It's certainly not Rocket Science in terms of math, but once you get your application working reliably, you may find that when you explain to your colleagues what you had to do to achieve that reliability, it starts to sound like Rocket Science.
"If timing/synchronization is not critical, rendering the audio pins for multiple streams in the same graph should be a fast and easy way to go: you can quickly test your scenarios in GraphEdit."
I appreciate I am a novice, but I went into GraphEdit and could not see which filters to use to mix multiple audio sources and output them to the speaker. I picked two File Source (Async) filters under the DirectShow filters, but could not find any way to mix them together (all the filters had only one input pin) and output them to the speaker.
I would appreciate knowing which filter I must use to achieve this, or a pointer to any source code.
Tx in advance
As suggested above, the idea is to use one instance of the DirectSound renderer for each audio stream. The infrastructure will take care of the mixing.
"For straightforward rendering of multiple disparate audio streams, would there be anything wrong in using multiple instances of the waveOut renderer (or even multiple graphs) in the same application and letting the kernel mixer sort it out?"
Michel Roujansky, http://www.roujansky.com