WASAPI latency - possible to achieve lower than 70ms 'glass to output' on Surface RT?

    Question

  • Hi

    I'm interested in porting a musical instrument application to Windows RT, and I'd like to achieve the lowest possible latency between tapping on the screen and hearing the result at the audio output. I've hacked the WASAPI C++ sample to make it render a continuous sine wave whose volume is briefly increased in response to a tap on the screen.

    To measure the latency I'm putting a microphone near the device, recording it, and loading the recording into a wav editor. This way I can measure the time between the physical tap on the screen and the onset of the audio coming out of the speaker. The best I can manage so far is a 'glass to speaker' latency of about 60-70ms, which is enough to make the platform just about viable, but it's not up to current iOS standards.

    Here are the results I've got so far:

    1. If I use shared mode with no hardware offloading I get a buffer size of 960 frames and a latency of 85-100ms (using the measurement method above).

    2. If I use shared mode with hardware offloading, I get a buffer size of 480 frames and a latency of about 60-70ms.

    3. If I use exclusive mode (hardware offloading appears not to be an option here) with the minimum device period (which the docs say one can do), I get a buffer size of 144 frames and a latency of 45-50ms (see the initialization sketch after this list). This starts to look really promising, but if I examine the waveform I see that only about half of the rendered audio appears at the output! It seems to put out about 80-100 samples of silence, followed by about 130 samples of correct audio, which I assume is due to buffer underruns.

    4. If I do the same as 3, but ask for the minimum (rather than the default) device period, the output waveform is OK and I get a buffer size of 480 frames with a latency of about 65ms. This is about the same as 2.
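
    For reference, here's roughly how I'm initializing the client for cases 3 and 4 - a minimal sketch with error handling omitted; the member names are just from my hacked-up copy of the sample. Passing hnsDefaultPeriod instead of hnsMinPeriod gives case 4:

    REFERENCE_TIME hnsDefaultPeriod, hnsMinPeriod;
    m_AudioClient->GetDevicePeriod( &hnsDefaultPeriod, &hnsMinPeriod );
    
    m_AudioClient->Initialize(
    	AUDCLNT_SHAREMODE_EXCLUSIVE,
    	AUDCLNT_STREAMFLAGS_EVENTCALLBACK,
    	hnsMinPeriod,	// buffer duration
    	hnsMinPeriod,	// periodicity - must equal the buffer duration in exclusive mode
    	m_MixFormat,
    	NULL );
    
    UINT32 bufferFrames;
    m_AudioClient->GetBufferSize( &bufferFrames );	// reports 144 frames at the minimum period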

    So to conclude:

    - The best I've managed is either 2 (shared mode with HW offloading) or 4 (exclusive mode, HW offloading not allowed). 

    - In all cases, passing the minimum period (as reported by GetDevicePeriod()) to IAudioClient2::Initialize() results in underruns.

    - In all cases, passing the default period (as reported by GetDevicePeriod()) to IAudioClient2::Initialize() works without underruns, but gives no better than 60-70ms latency.  I was hoping for a bit better than this.

    My questions are:

    - Are there any settings I can try to improve this?  In particular, the docs suggest that using the minimum period returned from GetDevicePeriod() is acceptable in exclusive mode, but this does not appear to work in practice.

    - If the audio latency cannot be improved, there must also be some latency coming from the touch handling (a CoreWindow::PointerPressed event).  Can this be improved in some way?

    Any discussion welcome.  This subject always seems to be an experimental one on any platform I've ever worked on!

    November 6, 2012 12:38

All replies

  • Have you looked into using XAudio2 and other low latency DX APIs?

    November 6, 2012 16:21
  • To my knowledge, XAudio2 sits on top of WASAPI and WASAPI is the lowest you can go.  Therefore I'm really looking for ways to set up WASAPI for the lowest possible latency.
    November 6, 2012 16:22
  • I reproduced the scenario on my own Surface RT and can confirm your findings. Additionally, I tried handling the buffer callbacks within the main UI thread and also within a high priority threadpool thread in the simplest fashion:

    BYTE *Data;
    for (;;)
    {
    	// Block until WASAPI signals that it needs more data (2s timeout)
    	DWORD retval = WaitForSingleObjectEx( m_SampleReadyEvent, 2000, FALSE );
    	m_AudioRenderClient->GetBuffer( m_BufferFrames, &Data );
    	FillBuffer();	// app-specific rendering into Data
    	m_AudioRenderClient->ReleaseBuffer( m_BufferFrames, 0 );
    }

    Nonetheless, in exclusive mode without offloading, no device period below the default one remained stable for a reasonable amount of time. My wild guess is that the Windows RT thread priorities are still not high enough compared to MMCSS Pro Audio for desktop applications, since I don't have any problems running my store app on my desktop PC with the minimum device period.

    Any ideas how this can be solved?

    November 21, 2012 21:36
    Since making my original post I have discovered one further thing - but it only makes matters worse.  Given that exclusive mode and shared mode with hardware offloading give similar latency, I was planning on using shared mode on the basis that it benefits the user by not taking exclusive control of the audio.  However, when hardware offloading is enabled it seems that you can do very little processing in the audio callback before experiencing underruns.  Therefore, it's necessary to move the app's audio processing to a separate task and relegate the audio callback to the role of copying the last buffer and telling the app's audio task to start rendering the next - thereby incurring another buffer's worth of latency (another 10ms given a buffer of 480 frames @ 48kHz).

    Exclusive mode, on the other hand, does seem to let you do a lot of CPU work in the callback without problems.  The side effect of using exclusive mode seems to be that the app can carry on happily rendering audio even when moved into the background, which I can't decide whether it's an advantage or a disadvantage - so it should probably be a user choice.  I've yet to see if it's possible for the app to destroy the WASAPI objects when it enters the background and then re-create them when it enters the foreground again (a sketch of the idea follows).  If so, this might be the best solution.
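
    In case it's useful, this is the rough shape of what I have in mind - completely untested, and I'm not yet sure whether VisibilityChanged or Suspending is the right trigger; the two helper names are hypothetical, from my own wrapper:

    void App::OnVisibilityChanged( CoreWindow^ sender, VisibilityChangedEventArgs^ args )
    {
    	if ( !args->Visible )
    		m_Renderer->StopAndReleaseClient();	// hypothetical helper: Stop() and release the WASAPI objects
    	else
    		m_Renderer->CreateAndStartClient();	// hypothetical helper: re-Initialize() and Start() again
    }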

    I realised that there's probably some extra touch latency which can be reduced.  I don't know how your main app loop is structured, but mine followed the traditional model of processing events, rendering the display and then presenting the display buffer.  Once Present() is called, nothing much is going to happen, since I assume it just blocks the main thread waiting for the next update.  Instead of calling Present() just after the display has been rendered, it's possible to sit in an inner loop and continue processing events, only exiting that loop once the time to Present() draws near, i.e.:

    	// Semi-pseudo code!
    	while (!m_windowClosed)
    	{
    		// Get the current time
    		double StartTime = CurrentPerformanceCounter();
    
    		// Process events
    		CoreWindow::GetForCurrentThread()->Dispatcher->ProcessEvents(CoreProcessEventsOption::ProcessAllIfPresent);
    
    		// Update and render scene
    		spApplication->RenderScene();
    
    		// Keep processing events every millisecond until it's almost time to present
    		do
    		{
    			CoreWindow::GetForCurrentThread()->Dispatcher->ProcessEvents(CoreProcessEventsOption::ProcessAllIfPresent);
    			Sleep(1);
    		}
    		while (CurrentPerformanceCounter() - StartTime < (1.0 / 60.0) - (1.0 / 1000.0)); // Assuming a 60Hz refresh, minus 1ms to be on the safe side
    
    		Present();
    	}

    In practice, this seems to have made some difference but not as much as I'd hoped.  The best it can really do is even out the response time so that you don't get cases where a touch event 'just missed' the last poll and has to wait until the next display refresh.

    Finally (and this is a bit of an aside), I've found that simply adjusting the volume using the hardware +/- buttons causes serious CPU spikes and buffer underruns.  I'm not sure, but this seems to occur all the time the volume slider pop-up is visible on screen.  Maybe it's the CPU work of graphically compositing this view over a DirectX window?

    EDIT : Actually, exclusive mode seems to 'degrade' over time no matter what buffer size I use.  Sometimes it'll run for a few minutes, other times it'll only manage 10 seconds.  Once this happens, it sounds 'grainy' like it's missing a part of each buffer and never recovers.  I suspect once again this is from trying to do too much work in the callback.

    EDIT 2 : I think the 'poll for input events and sleep in between' idea doesn't work because it's not possible to sleep in Windows for much below 15ms.  In fact, it could sleep for more than that and affect the display frame rate, so just pretend you never even heard that idea.  It typically works on *nix systems though, if you ever find yourself in need of it!

    • Edited by MattBlip November 26, 2012 23:36
    November 22, 2012 0:42
  • More discoveries:

    - Managed to always get a stable start in minimum latency (144 samples per buffer @ 48kHz) exclusive mode by calling m_AudioClient->Start() after a small waiting period:

    // The event is never signalled, so this just waits out the full 2000ms timeout
    HANDLE wait = CreateEventEx( NULL, NULL, CREATE_EVENT_MANUAL_RESET, EVENT_ALL_ACCESS );
    WaitForSingleObjectEx( wait, 2000, FALSE );
    m_AudioClient->Start();

    - Whenever the output "breaks", one extra call to the callback procedure stabilizes it again. Unfortunately, this must be done every few seconds and is still no proper solution. Note that I'm not doing any computation except minimalist copy-filling of the buffers and handling input events on a millisecond basis.

    - When timing the callbacks using WaitForSingleObjectEx( m_SampleReadyEvent, ... ), I always get timeouts when specifying a waiting time slightly above the buffer duration (a 15ms timeout vs. a 10ms buffer duration). I wonder why that is.

    I also raised the priority of MFPutWaitingWorkItem to 2, although that didn't seem to do a lot. If anybody could confirm that 2 is indeed the value for "critical", that would be nice. I couldn't find anything about it and went with the AVRT_PRIORITY values used for work-queue threads.


    November 22, 2012 22:38
  • Waiting before starting the client seems to make no difference for me.  Are you inserting the wait in WASAPIDevice::OnStartPlayback(), after pre-rolling the buffer with silence and before calling mAudioClient->Start()?

    - When you say, "Whenever the output 'breaks' one extra call to the callback procedure stabilizes it again", how are you detecting that the output has 'broken', and which callback are you calling from where?  Can you detect this before it's happened, or is it already too late because an audible glitch has occurred on the output?

    November 23, 2012 13:57
  • This is the order after initializing m_AudioClient:

    m_AudioClient->GetBufferSize( &m_BufferFrames );
    m_AudioClient->GetService( __uuidof(IAudioRenderClient), (void**) &m_AudioRenderClient );
    m_AudioClient->SetEventHandle( m_SampleReadyEvent );
    MFCreateAsyncResult( nullptr, &m_xSampleReady, nullptr, &m_SampleReadyAsyncResult );
    
    // Pre-roll the buffer with silence (the 4 assumes 4 bytes per frame, e.g. 16-bit stereo)
    BYTE *Data;
    m_AudioRenderClient->GetBuffer( m_BufferFrames, &Data );
    for ( unsigned int y = 0; y < 4 * m_BufferFrames; y++ )
    	Data[y] = 0;
    m_AudioRenderClient->ReleaseBuffer( m_BufferFrames, 0 );
    
    MFPutWaitingWorkItem( m_SampleReadyEvent, 2, m_SampleReadyAsyncResult, &m_SampleReadyKey );
    
    // The event is never signalled, so this just waits out the full 2000ms timeout
    HANDLE wait = CreateEventEx( NULL, NULL, CREATE_EVENT_MANUAL_RESET, EVENT_ALL_ACCESS );
    WaitForSingleObjectEx( wait, 2000, FALSE );
    m_AudioClient->Start();

    - So far I detect it only by hearing. I fill the buffers in HRESULT WASAPIRenderer::OnSampleReady( IMFAsyncResult* pResult ), which I made public and call from the UI thread.


    EDIT: I used a performance counter to look at the time between callback events, and about every 5th event seems to be signalled a few ms too late, as compared to my desktop PC where I observed no fluctuations. I have no idea how to fix this; I suppose it's a driver/hardware issue.
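
    For anyone wanting to reproduce the measurement, this is roughly what I did at the top of OnSampleReady (a sketch; m_LastCallbackTime is an extra LARGE_INTEGER member I added just for this, zero-initialized):

    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency( &freq );
    QueryPerformanceCounter( &now );
    if ( m_LastCallbackTime.QuadPart != 0 )
    {
    	// deltaMs should match the buffer duration; on my Surface RT roughly every 5th event is a few ms late
    	double deltaMs = 1000.0 * ( now.QuadPart - m_LastCallbackTime.QuadPart ) / freq.QuadPart;
    }
    m_LastCallbackTime = now;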
    • Edited by David Kain November 23, 2012 18:50
    November 23, 2012 15:04
  • That's really strange, adding the wait (and/or fiddling with various combinations of start up order) seems to be making no difference for me.

    I have managed to make some progress though.  Before, I mentioned that doing any significant CPU work in the render callback seemed to cause underruns pretty easily.  I've found out that this statement isn't quite true - more specifically, it comes down to the amount of CPU work done between the GetBuffer() and ReleaseBuffer() calls.  Also, when WASAPI signals that it needs a buffer, it really, really needs it like yesterday.  So instead of doing this (like the WASAPI sample):

    mpAudioRenderClient->GetBuffer(RenderFramesAvailable, &Data);
    
    	// Do app specific rendering and fill the Data buffer returned by GetBuffer()
    
    mpAudioRenderClient->ReleaseBuffer(RenderFramesAvailable, 0);

    Much better results can be obtained like this:

    mpAudioRenderClient->GetBuffer(RenderFramesAvailable, &Data);
    
    	// Copy in the buffer we generated at the end of the last call
    	memcpy(Data, mpRenderBuffer, BufferSizeInBytes);
    
    mpAudioRenderClient->ReleaseBuffer(RenderFramesAvailable, 0);
    
    // Do app specific rendering and put the results in mpRenderBuffer (to be used on the next call)
    
    

    Effectively this means that WASAPI gets its new buffer as soon as possible, although since that buffer was generated during the previous call it's actually one buffer behind.  By doing this I can run most of the time using the minimum device period.  The code above assumes that the requested buffer size is the same each time.  I'm not sure what I'd do if I (e.g.) render 100 frames ahead on one call and then it decides to ask for 200 on the next call - I suppose I'd have to render the extra frames immediately (see the sketch below).
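
    Something like this is what I have in mind for the variable-size case (untested; m_Fifo is a hypothetical ring buffer of pre-rendered frames and RenderFrames() a hypothetical render helper):

    // Top up if WASAPI asks for more than was rendered ahead
    UINT32 framesQueued = m_Fifo.FramesAvailable();
    if ( framesQueued < RenderFramesAvailable )
    	RenderFrames( RenderFramesAvailable - framesQueued );	// render the shortfall immediately
    m_Fifo.Read( Data, RenderFramesAvailable );	// copy into the buffer returned by GetBuffer()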

    There's still a problem though - 80% of the time this works fine, but sometimes the output becomes 'broken' (grainy, like it's constantly underrunning on every buffer) and then the only way to 'unbreak' it is to stop the client and restart it.  Even this doesn't guarantee that it'll work straight away - sometimes I need to stop/restart 2 or 3 times until it's working again.  If I could find a way to detect this and take corrective action then it might be in with a chance, but none of the functions seem to return any sort of HRESULT which might indicate an underrun.  As it stands, a single glitch becomes a permanent problem until the client is restarted.

    November 24, 2012 0:50
  • Never mind all that fiddling, rock stable now! Just initialize without AUDCLNT_STREAMFLAGS_EVENTCALLBACK and handle buffers like so:

    UINT32 paddingFrames;
    UINT32 availableFrames;
    BYTE *Data;
    
    for (;;) {
    	// Busy-wait until at least half the buffer is free
    	do {
    		m_AudioClient->GetCurrentPadding( &paddingFrames );
    		availableFrames = m_BufferFrames - paddingFrames;
    	}
    	while ( availableFrames < m_BufferFrames / 2 );
    	m_AudioRenderClient->GetBuffer( availableFrames, &Data );
    	for ( unsigned int i = 0; i < availableFrames; i++ ) {
    		// Fill
    	}
    	m_AudioRenderClient->ReleaseBuffer( availableFrames, 0 );
    }



    • Edited by David Kain November 24, 2012 12:11
    November 24, 2012 12:08
  • Nice, I'll try that as soon as I get time :)

    I gather it's probably going to burn loads of CPU as it stands.  Have you tried adding any sleep/wait logic to it or are you running it in a separate task?

    November 24, 2012 18:13
  • I've yet to see what works best, but at least a separate sleepless task handling pointer input is no problem at all.
    November 24, 2012 23:31
    I still seem to be getting nowhere with this.  I've tried the non-event-driven polling method you described above, but as soon as I do any significant processing work to render the buffer it all goes wrong again, despite the fact that the CPU work takes <50% of the buffer length.  Also, I've found that if I don't specify the AUDCLNT_STREAMFLAGS_EVENTCALLBACK flag, the number of buffer frames reported by IAudioClient2::GetBufferSize() jumps from 144 to 1024.

    Even if I can get the non-event-driven version working, I'm worried about how much CPU it's going to use and its effect on battery life.  I was hoping it would be possible to sleep the buffer-polling task until the next buffer is approximately due, but from some tests I've done using WaitForSingleObjectEx with a timeout of 1ms (see the sketch below), the minimum quantum I get is about 15ms, and even that's far from guaranteed (although probably about right for Windows given other information I've been reading).  Given that we're talking about servicing buffer sizes of 144/48000 = 3ms, any waiting using this method pretty well guarantees missing the next buffer.
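
    This is roughly the test I used (a sketch - the event is never signalled, so each wait should time out after the requested 1ms if the scheduler honoured it):

    HANDLE dummy = CreateEventEx( NULL, NULL, CREATE_EVENT_MANUAL_RESET, EVENT_ALL_ACCESS );
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency( &freq );
    QueryPerformanceCounter( &t0 );
    for ( int i = 0; i < 100; i++ )
    	WaitForSingleObjectEx( dummy, 1, FALSE );	// ask for a 1ms timeout
    QueryPerformanceCounter( &t1 );
    // On the Surface RT this comes out at roughly 15ms per wait, not 1ms
    double avgMs = 1000.0 * ( t1.QuadPart - t0.QuadPart ) / freq.QuadPart / 100.0;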

    In fact, the only method I can get to render consistently and reliably glitch-free is shared mode (with no hardware offloading) using the default buffer size, which gives an overall latency of 100ms.  Most frustrating!

    November 26, 2012 0:26
  • Dammit, you're right. I should have checked all that before posting, sorry. Alright, I put my project on hold. 100 ms is unacceptable and far from low-latency.

    Dear Surface/Win RT developing team, please provide a solution as soon as possible.

    November 26, 2012 14:46
    No need to apologize - I'm grateful to have at least one person to compare notes with, and you've encouraged me to be more tenacious than I'd otherwise be!

    A couple more observations:

    1 - I got my app into a state where even fairly large buffer sizes weren't working in any mode other than shared, and found that a device reboot fixed it.  I then went back to my pared-down version of the WASAPI sample to get the simplest possible test case.  Exclusive mode (with the default buffer size) was working OK again after the reboot, so I added a spin loop using the performance counter to simulate some CPU work in the audio event callback (see the sketch after this list).  Once the CPU load got to about 50% of the buffer duration, it started underrunning.  Fair enough, so I recompiled the code to take out the spin loop - and it was still underrunning.  The only way to recover from this was to reboot the device again.

    2 - Sometimes, even without loading the CPU and after a clean reboot, exclusive mode will start up already broken - and again, the only way to fix it is a reboot.
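
    For completeness, the spin loop was just this sort of thing (a sketch; the 50% figure is relative to the buffer duration at 48kHz):

    // Burn CPU for ~50% of the buffer duration to simulate rendering work
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency( &freq );
    QueryPerformanceCounter( &start );
    double loadSeconds = 0.5 * (double) m_BufferFrames / 48000.0;
    do {
    	QueryPerformanceCounter( &now );
    } while ( (double)( now.QuadPart - start.QuadPart ) / freq.QuadPart < loadSeconds );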

    Shared mode without HW offloading never seems to exhibit these problems, but its latency's too high.  I've pretty well come to the conclusion that there are some definite performance problems/bugs in there that simply cannot be worked around at the app level.

    I'm close to shelving my project too - these findings are an all too familiar story from both Android and PlayBook, and still leave iOS as the only candidate for real-time music apps.

    I second the request to see some progress on this in a future Windows update.

    November 26, 2012 17:05
  • Hello,

    Our design goal for the audio and video stack in W8 was to reduce the latency to around 100 ms one way. With a lot of work we were able to get the latency down to around 65 ms one way in shared mode. With the current audio architecture it's really difficult to get latencies below about 45 ms one way in exclusive mode using the polling method mentioned above.

    The original design goal for WASAPI was preventing "glitching" during normal playback scenarios. This naturally leads to larger buffers and higher latency. The work we did for W8 has certainly decreased the latency enough to allow for VoIP scenarios, but it is not nearly good enough for most professional music apps.

    As a musician myself, I have been pushing to see the platform support round-trip latencies in the 15-20 ms range. I think that this is totally possible with the current architecture. Unfortunately, at this time this really isn't on the roadmap. I will keep pushing and we will have to see how the platform evolves.

    I wish I had better news but I do hope that this gives you some insight.

    Thanks much,

    James 


    Windows SDK Technologies - Microsoft Developer Services - http://blogs.msdn.com/mediasdkstuff/

    November 27, 2012 1:54
    Moderator
  • I'd be happy with a latency in shared mode of 65ms, but I can only get a reliable 100ms.

    One further problem I've noticed.  In my app, the main loop (where CoreWindow::Dispatcher->ProcessEvents() is called) seems to have the same priority as the audio rendering task (invoked as a result of MFPutWaitingWorkItem(..., 2, ..., ...)). 

    Even if I comment out the registration functions for all CoreWindow touch events, if I put a few fingers on the screen and start dragging them around, CoreWindow::Dispatcher->ProcessEvents() starts taking an average of around 4ms to execute, my audio callback takes longer, and the audio stutters. 

    Aside from the fact that 4ms is a lot of time to deal with touches, how can I ensure that the event-driven audio rendering callback always takes priority over the UI thread?  My audio rendering has a deterministic execution time by design and is completely time-critical, so everything else should be dropped to service this task as the top priority.  One untested idea is sketched below.
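
    Specifically, I'm wondering about putting the callback on the shared "Pro Audio" MMCSS work queue.  I don't know whether this is honoured on Windows RT, so treat it as an untested sketch:

    // Untested: lock the shared "Pro Audio" work queue...
    DWORD taskId = 0, queueId = 0;
    HRESULT hr = MFLockSharedWorkQueue( L"Pro Audio", 0, &taskId, &queueId );
    // ...then have the IMFAsyncCallback's GetParameters() report that queue:
    //     *pdwQueue = queueId;  *pdwFlags = 0;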

    November 27, 2012 11:47
  • Hi James

    It's been three months now.  Are there likely to be any further developments in this area?  iOS has a thriving music-making scene due to its low-latency audio implementation, but right now WinRT is going nowhere in this regard.

    February 11, 2013 12:18
    Wouldn't the interface you're using to record the signal to the PC have its own latency issue?
    September 9, 2013 10:33
    Yes, but that's a fixed offset applied to all measurements.  As long as you measure relative times within the recording, this measurement latency is nulled out.
    September 10, 2013 9:36
    Have there been any updates in this regard? I'm struggling with some latency issues in my VoIP application (trying to cancel echo on 10ms packets using Speex), and I'd love to see some progress with these APIs.
    November 6, 2013 0:49
  • Hello,

    Likely we won't see latencies as low as 10ms round trip on the Windows platform any time soon. This is due to many factors; as an example, multiple large buffers are necessary in a shared audio scenario. While this problem isn't insurmountable, it is a difficult problem to solve, particularly on low-end ARM hardware.

    That said, have you tried measuring the WASAPI latency in your Windows 8.1 app and comparing it to the same code running on 8.0? I would be interested to see whether you are seeing lower latencies in 8.1 compared to 8.0. My expectation is that latencies have decreased between 8.0 and 8.1, but only fractionally.

    I hope this helps,

    James


    Windows SDK Technologies - Microsoft Developer Services - http://blogs.msdn.com/mediasdkstuff/

    November 6, 2013 1:15
    Moderator
    It is remarkable, though, that Apple solved this issue years ago with Core Audio - even with several applications running, and even on mobile iOS devices.

    There are also ASIO drivers for Windows which provide low-latency streaming for several applications at once (if they support ASIO). Why not recreate ASIO's functionality with WASAPI?

    November 28, 2013 23:19