System hangs on resume with Hive Registry on eMMC (SD) - WinCE 6.0 R3 RRS feed

  • Question

  • Hello everybody,

    I'm facing a nasty issue on a device whose hive-based registry is persisted on an eMMC. Things work fine during "normal" operation - that is, the device suspends and resumes correctly almost always. But there is a specific use case in which the resume process hangs in a deadlock.

    From what I've seen so far, the deadlock is caused by a page fault on a registry access during resume, after returning to normal operation but before the file system has finished mounting the relevant volume. In this condition, even JIT debugging over KITL doesn't work anymore, and the only way to get some information about what's going on is through a hardware debugger (Lauterbach).

    Now, I'm probably missing something important - but I can't see the pole. The eMMC is mounted as permanent and the driver has been modified not to issue fake card removal on resume. The block driver is also not registered as power managed - just block device class. Still the volume gets unmounted and remounted during suspend-resume cycle - and I still don't understand if that's relevant or not: I expected a page fault to be put on hold until the volume is remounted.

    Thank you in advance for any help you may provide.

    BR - Stefano 

    Thursday, June 23, 2011 1:56 PM

All replies

  • Are you specifically compressing any of the driver DLLs?  This can cause a page fault during resume on CE 6.0 and later.


    Bruce Eitman (eMVP)
    Senior Engineer
    Bruce.Eitman AT Eurotech DOT com
    My BLOG http://geekswithblogs.net/bruceeitman

    Eurotech Inc.
    www.Eurotech.com
    Thursday, June 23, 2011 7:32 PM
    Moderator
  • No, there's no compressed driver nor DLL in the image.

    The memory mapped file the registry relies upon seems to be paged out before suspend, but the system is unable to page it back in at resume.

    I say so since I've observed that the lock is due to a WaitForObject looped with a timeout of 0 - in WINCE600\private\winceos\COREOS\storage\fsdmgr\mountedvolume.hpp, function EnterWithWait, lines 63+.

    In this situation, the volume being accessed is the ROM FS - but m_PnPWaitIODelay in the call is 0, which probably is correct for ROM FS (I'm not so deep into the FSDMGR - not yet). The inner function, in the EnterWithWait loop, is returning ERROR_DEVICE_NOT_AVAILABLE because of powerdown status reported by MountedVolume_t::Enter.

    So, it looks like the ROM FS has been powered down at suspend, then at resume someone is trying to access the registry while the ROM FS is not available yet - but the wait blocks the system (since it loops with zero timeout).

    Please note that this problem occurs only if the USB OHCI2 driver is installed: at resume, it is unloaded and reloaded by a thread (CHW::CeResumeThread). Here, during DeviceDeinitialize(), the library is unloaded and ShimEngine key is accessed - but ROM FS is not ready.

    Of course there are [dirty] workarounds, but since I'm surely missing something at higher level, I would like to "fix by understanding"... don't like to assume that no other thread is going to do the same one day!

    BR - Stefano


    Stefano Voulaz Embedded Design @ projecKt studio

    Saturday, June 25, 2011 1:42 PM
  • Some additional thoughts...

    I investigated a bit deeper and yes, the ROM FS is initialized first inside the Store Manager - here the registry is unavailable, registry keys are hardcoded and PnPWaitIODelay is globally set to 0 - hence the wait timeout.

    Now, the "virtual" ROM FS volume is powered off and on just like any other volume, but so far I've been unable to find a way to instruct the system never to page out the memory used by the registry (as I understand it, a memory mapped file is used), or at least the pages holding the boot hive - though it seems logical that at runtime the keys are not split between boot and non-boot registry.

    Probably mine is a peculiar use-case, but since the ROM FS has a PnPWaitIODelay of zero, any thread (read: application thread) that would randomly access a paged out registry might potentially lock the system on resume - probably depending on priorities and scheduling order.

    All the above is true, of course, if my understanding about registry management is correct... Reading Sue Loh's article about Paging Pool in WinCE 6.0 gave me the impression that the memory used for the registry is handled (almost) like any other allocation. But if the registry is never paged out... then I should scrub my head with the other hand!

    BR - Stefano


    Stefano Voulaz Embedded Design @ projecKt studio
    Saturday, June 25, 2011 3:35 PM
  • Hello!

    After further investigation and discussion with some colleagues, I confirm the thoughts of my last post - and I would humbly raise a flag for a potential bug in the volume manager of Windows CE 6.0 R3.

    A PnPWaitIODelay of zero for a volume (any volume) is basically inconsistent with the volume access policy, especially within a suspend/resume cycle - where all the volumes (including ROMFS virtual volume) receive power notification changes. If during resume, for any reason, a volume is accessed before it receives the PowerOn notification, the PnPWaitIODelay normally allows the thread accessing the volume to yield - so that the volume will possibly become available after the wait timeout. If PnPWaitIODelay is zero, then the thread doesn't yield and in case it's a high priority thread the system may lock waiting for the volume to become available.
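    The effect of a zero delay can be illustrated with a small, portable model. This is only a sketch under simplifying assumptions, not the FSDMGR code: Volume, PowerNotifier and enterWithWait are hypothetical names, the notifier stands for a lower-priority thread that only makes progress while the caller is blocked, and maxAttempts exists only so the model terminates (the real loop would spin forever).

```cpp
#include <cassert>

// Simplified stand-in for a mounted volume; not the FSDMGR type.
struct Volume {
    bool poweredOn = false;
};

// Models a lower-priority thread that delivers the power-on notification.
// It only makes progress while the higher-priority caller is blocked.
struct PowerNotifier {
    int msUntilPowerOn;
    void runFor(int ms, Volume& vol) {
        msUntilPowerOn -= ms;
        if (msUntilPowerOn <= 0) vol.poweredOn = true;
    }
};

// Retry loop in the spirit of EnterWithWait. Returns the attempt on which
// the volume was entered, or -1 if maxAttempts passed without success.
int enterWithWait(Volume& vol, PowerNotifier& pm, int waitDelayMs, int maxAttempts) {
    assert(waitDelayMs >= 0 && maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (vol.poweredOn) return attempt;
        // A wait with a zero timeout returns at once, so the caller never
        // blocks and the lower-priority notifier never gets CPU time.
        if (waitDelayMs > 0) pm.runFor(waitDelayMs, vol);
    }
    return -1;
}
```

    With a non-zero delay the caller blocks on each retry, the notifier gets to run and the volume eventually powers on; with a delay of zero the loop never gives the notifier a chance, however many attempts are allowed.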

    In my specific case, the resume thread of the OHCD driver has a priority of 101 (the same as the IST) and on resume it re-activates the driver after deactivating it. During deactivation the registry is accessed, but following a page fault (caused by a page-out on the related memory mapped file) the ROMFS virtual volume is accessed *before* it has received the power on notification. Then the system locks in a WaitForObject - which is blocking because PnPWaitIODelay is zero for the ROMFS volume. The [still to be confirmed] workaround to this situation is to reimplement the OHCD resume policy, possibly using a thread with a lower priority and probably a Sleep() before deactivating the device.

    Can anyone confirm all the above?

    BR - Stefano

     


    Stefano Voulaz Embedded Design @ projecKt studio
    Thursday, June 30, 2011 7:25 AM
  • That sounds like a reasonable analysis.  Would it work to temporarily raise the priority of the PM thread at suspend/resume?  There is already code in place to do this; see giSuspendPriority and related comments in the PM (platform.cpp).


    Dean Ramsier eMVP BSQUARE Corporation
    Thursday, June 30, 2011 12:55 PM
  • Hello Dean,  thank you for pointing me to this possible (and interesting) solution.

    At the moment I'm proceeding the other way - lowering the priority of the OHCD resume thread and forcing the thread to yield with a Sleep(0) when it starts. This required just reimplementing CHW::ResumeThread in a local file (plus some adjustments to the "sources" file, of course). I'm now running the tests and it seems to work fine - but I have to run some more before marking the problem as (surely) fixed.
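    A minimal sketch of the first lines such a reimplemented resume thread might run, assuming the usual CE calls CeSetThreadPriority, GetCurrentThread and Sleep. The function name PrepareResumeThread and the priority value 150 are hypothetical (on CE a larger number means a lower priority, so anything above 101 lowers it); the non-CE branch only provides recording stand-ins so the sketch compiles off-target.

```cpp
#include <cassert>

#ifdef UNDER_CE
#include <windows.h>
#else
// Off-target stand-ins: they only record what was requested.
typedef void* HANDLE;
typedef unsigned long DWORD;
typedef int BOOL;
static int g_lastPriority = -1;
static int g_yieldCount = 0;
static HANDLE GetCurrentThread() { return nullptr; }
static BOOL CeSetThreadPriority(HANDLE, int priority) {
    g_lastPriority = priority;
    return 1;
}
static void Sleep(DWORD) { ++g_yieldCount; }
#endif

// Hypothetical value: any priority numerically above 101 is lower than the IST.
const int kLoweredResumePriority = 150;

// Called at the top of the resume thread, before the driver is deactivated.
bool PrepareResumeThread() {
    if (!CeSetThreadPriority(GetCurrentThread(), kLoweredResumePriority)) {
        return false;  // keep the original priority if the call fails
    }
    Sleep(0);  // give up the rest of the time slice before touching the registry
    return true;
}
```

    Whether this is enough depends on which other threads are runnable at that moment, so it should be treated as a workaround to test rather than a guaranteed fix.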

    BTW, thank you for the suggestion - I'll look at it anyway, since it seems a reasonable alternate approach. I'll also check if there is any other thread involved, other than the PM (i.e., from FSDMGR). I'll keep digging to explore some more system details.

    BR - Stefano


    Stefano Voulaz Embedded Design @ projecKt studio
    Thursday, June 30, 2011 2:39 PM
  • Hi all - a quick update on the topic.

    I checked the approach suggested by Dean, setting SuspendPriority256 (pretty undocumented, I must say) to 98 - above the USB OHCD resume thread. Unfortunately, this approach doesn't fit my use case, since the PM thread priority rises to 98 only during the suspend process and is restored when leaving the suspend procedure. This happens well before the resume is initiated, so modifying SuspendPriority256 has no effect on the resume process itself (which is when the system hangs). I'm still investigating the PM to see how ResumePriority256 is used instead - so far no clue, but I will keep digging.

    In the meanwhile, the tests on the CHW::ResumeThread went on and it looks like the system is quite stable now - fingers crossed, though!

    Thanks again and BR - Stefano


    Stefano Voulaz Embedded Design @ projecKt studio
    Wednesday, July 6, 2011 6:31 PM
  • Hi Stefano,

    Thank you for the valuable information in this thread. In fact, your thread was the inspiration for my digging into this Windows CE bug. After one week of continuous work, I have collected useful information about the power handler and the issues with suspend/resume. The problem I faced was that my S3C6410-based device hangs after resume when I use a client USB driver (for a data acquisition device), specifically a SiLabs USBXpress device. Later I found that:

    1. When the processor speed is increased, the probability of hanging after resume increases

    2. When more debug messages are enabled, the probability of hanging decreases

    3. When some application causes an exception during operation, CE will hang whether or not a USB device is attached

    4. When a USB device is attached and opened for transfer, CE will hang after suspend/resume even if the device is closed before suspension

    5. If we go through a suspend/resume cycle before opening the USB device, CE works fine and resumes normally even if we open the USB device later

    After investigating the PlatformSetSystemPowerState function in platform.cpp (C:\WINCE600\PUBLIC\COMMON\OAK\DRIVERS\PM\PDD\DEFAULT\platform.cpp), I noticed that:

    1. This function is invoked twice during a suspend/resume cycle. The first call, during suspend, raises the suspend thread to a higher priority according to the DEF_SUSPEND_THREAD_PRIORITY definition (I use priority 100). It then notifies all devices other than block devices about the suspend (in case any of them needs to access the registry), sets the suspend flag and calls PowerOffSystem(), which takes care of calling the device and GWES power handlers and also calls OEMPowerOff(). In fact, PowerOffSystem() invokes the kernel function NKPowerOffSystem(), which runs single-threaded until it has done its job and then returns control to the suspend thread. OEMPowerOff() suspends the processor and waits for a resume event. Once a resume event fires, NKPowerOffSystem() calls the device and GWES power-up handlers and returns. Up to this point everything is fine.

    2. At the end of the first call to PlatformSetSystemPowerState, the suspend thread is restored to its normal priority (possibly the priority of the calling procedure), and hence all device ISTs fire and become active. At this point the file systems have not yet been notified about the resume and are not ready. In the second call to PlatformSetSystemPowerState, the resume section notifies all block devices and filesys about power up, but this is actually too late! We will see why.

    3. Something happens before the second call to PlatformSetSystemPowerState: some devices (a USB client device and SD, for example) need to go through a deactivation/reactivation cycle in order to work properly, as described by Stefano. During deactivation, the handler tries to deregister the device by calling I_DeregisterDevice(hDevice) in module devload.c. This function looks up the hDevice information in the registry and tries to delete the active driver entry from HKLM\Drivers\Active for the corresponding device. But filesys is not yet ready, and hence CE hangs or sometimes throws an exception in udevice.exe, module filesys.dll.

    4. The deletion of the active key fails, which is why we see an ever-increasing number of active devices that does not match reality.
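    The ordering problem in the list above can be expressed as a small check. This is only an illustrative model, not PM code: the Step names and firstUnsafeRegistryAccess are hypothetical, and the resume path is reduced to a flat sequence of events.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified events on the resume path.
enum class Step {
    KernelResumed,        // PowerOffSystem() has returned
    BlockDriversPowerOn,  // block device class told to power up
    FileSystemPowerOn,    // FSNOTIFY_POWER_ON delivered to filesys
    RegistryAccess        // e.g. a driver deactivation reading the registry
};

// Returns the index of the first registry access that happens before the
// file system has been powered on, or -1 if the ordering is safe.
int firstUnsafeRegistryAccess(const std::vector<Step>& order) {
    assert(!order.empty());
    bool fileSystemReady = false;
    for (std::size_t i = 0; i < order.size(); ++i) {
        if (order[i] == Step::FileSystemPowerOn) {
            fileSystemReady = true;
        } else if (order[i] == Step::RegistryAccess && !fileSystemReady) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

    In this model, the stock sequence (kernel resumed, a driver thread touches the registry, then the file system is notified) is flagged at the registry access, while a sequence in which the file system is notified first passes the check.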

    I tried raising the priority of the suspend thread, but without success, because PlatformSetSystemPowerState raises the priority only during suspend (the first call). I also tried lowering the priority of the USB IST and adding some Sleep() calls, but without any useful results in my case.

    Finally, I decided to modify the PlatformSetSystemPowerState function itself. I raised the priority for the second call (the if statements divide the routine into two parts, resume and suspend); this gave better results but didn't solve the problem completely. The next solution was to move the filesys notification to right after the call to PowerOffSystem, like this:

        RETAILMSG(TRUE, (_T("PM: PlatformSetSystemPowerState: will enter suspend!\r\n")));
        PowerOffSystem();   // sets a flag in the kernel for the scheduler
        Sleep(0);           // so we force the scheduler to run and wait till all threads served
        RETAILMSG(TRUE, (_T("PM: PlatformSetSystemPowerState: back from poweroff!\r\n")));

        // notify file systems right after returning from PowerOffSystem()
        FileSystemPowerFunction(FSNOTIFY_POWER_ON);
        gfFileSystemsAvailable = TRUE;

        // clear the suspend flag
        gfSystemSuspended = FALSE;
        gfPasswordOn = 0;

    This solution completely eliminates the hang, and the active device number in the registry is now correct and no longer increasing. However, an exception is thrown in explorer.exe during the first suspend/resume cycle only. I will keep investigating the exception; I hope someone can help figure out the problem, especially Microsoft. I think this is an obvious bug in CE and I hope Microsoft will correct it.

    Thanks

    Jaafar Kh

     

    • Edited by Jaafar Kh Sunday, November 6, 2011 10:06 PM
    Sunday, November 6, 2011 9:56 PM
  • Important update:

    The reason behind the exception in explorer.exe was that the block device drivers must be notified before filesys is powered up. In fact, both UpdateClassDeviceStates for block devices and FileSystemPowerFunction(FSNOTIFY_POWER_ON) are moved from the "else if(fResumeSystem)" block into the "if (fSuspendSystem)" block, after the call to PowerOffSystem():

                    PMLOGMSG(ZONE_PLATFORM || ZONE_RESUME, (_T("%s: calling PowerOffSystem()\r\n"), pszFname));
                    PowerOffSystem();       // sets a flag in the kernel for the scheduler
                    Sleep(0);               // so we force the scheduler to run
                    PMLOGMSG(ZONE_PLATFORM || ZONE_RESUME, (_T("%s: back from PowerOffSystem()\r\n"), pszFname));
              
    
                    // we're waking up from a resume -- update block device power states
                    // so we can access the registry and/or files.
                    PMLOGMSG(ZONE_PLATFORM || ZONE_RESUME, 
                        (_T("%s: resuming - notifying block drivers\r\n"), pszFname));
    
                    pdl = GetDeviceListFromClass(&idBlockDevices);
                    if(pdl != NULL) {
                        UpdateClassDeviceStates(pdl);
                    }
                    
                    // Notify file systems that their block drivers are back.
                    FileSystemPowerFunction(FSNOTIFY_POWER_ON);
                    gfFileSystemsAvailable = TRUE;
    
                    // clear the suspend flag
                    gfSystemSuspended = FALSE;
                    gfPasswordOn = 0 ;
    


    Also, DEF_SUSPEND_THREAD_PRIORITY must be set to 100 (decimal) so that the power-up notifications complete before the device ISTs start running.

    Now there is no hang, no increasing active device count and no exception. CE seems stable so far.

    Enjoy

    • Proposed as answer by Jaafar Kh Monday, November 7, 2011 6:48 PM
    Monday, November 7, 2011 6:47 PM