Unit will no longer boot after a power cycle (Windows CE 5.0)

  • Question

  • Hello:

    My general question is how to find out exactly why units no longer boot after a customer application has been run and the unit is then power cycled.

    A customer has been running their application on Windows CE 4.2 for many years. They upgraded to Windows CE 5.0 because we just can't support CE 4.2 anymore.

    Their application (VB.NET) compiles in VS2005 with no errors and runs fine on the unit, with no errors that they can see.

    But when they power cycle the unit, the unit no longer boots...at all. Even if I reload the CE 5.0 NK.BIN on the unit it still does not boot.

    This is not just a single unit, but every unit. They start their application, it runs fine for a day or so, and then either it suddenly stops or, if they power cycle the unit, it no longer boots.

    If I reinstall the CE 4.2 NK.BIN on the unit and then load the CE 5.0 NK.BIN, the unit is able to boot again. My guess is that loading CE 4.2 wipes out whatever CE 5.0 had formatted, so that when I reload CE 5.0 it starts from a clean memory/flash state.

    Past experience shows that reloading a CE 5.0 NK.BIN does not really seem to erase existing data in the flash; my guess is that it just overwrites existing files and leaves any new files generated by users alone. So whatever is preventing the unit from booting apparently does not get cleared out when I reinstall the CE 5.0 NK.BIN.

    This is a network-based application that receives data and writes that data to files on the unit...just text data, like employee names. We have a PC-based client test program, which I got from the customer, that emulates the data transmission from a PC. Most of the time it runs just fine, but when I power cycle the unit, the unit will not boot anymore. Sometimes the test program hangs and appears to be waiting for an acknowledgement from the unit. But when that happens, the unit pings OK and reboots without a problem.

    But if I run the test program a dozen times, or maybe just a few times, the unit consistently no longer boots when I perform a power cycle. Before I power cycle, I get no error messages from the application. The unit appears to behave perfectly fine. I can ping it and remote connect to it, and there is nothing to indicate anything is wrong at all, until I power cycle the unit and it does not boot.

    Where do I start in finding out what has changed in the unit, i.e. what has been corrupted so that the unit can no longer boot?

    I want to blame the application, but it does not appear to be doing anything obviously wrong. I'm bothered that there does not seem to be much error checking, but all the VB.NET programs I have found in other examples are written the same way.

    I suspect the unit is getting corrupted by the following lines of code:

        Private fileCfg As StreamWriter

            fileCfg = File.CreateText(strFileTmp)
        .
            fileCfg.WriteLine(strTemp)
            .
            fileCfg.Close()

    If I comment out those lines, I don't seem to see the problem occurring. I also tried adding a fileCfg.Flush() call after writing, but the unit still fails.

    We suspect a hardware type error, but this occurs on units that have different date-coded flash chips.

    It seems that whatever code the unit runs when it boots is corrupted, so it just hangs when it tries to boot. But that's just a guess.

    I have upgraded the CE 5.0 OS to the latest QFEs.

    How do I set things up to even be able to start looking?

    Thanks.

    Tuesday, August 12, 2014 4:21 PM

All replies

  • You say "doesn't boot", but what does it do?  My read is that the bootloader doesn't run, but I suspect that isn't what you mean.

    Bruce Eitman (eMVP) Senior Engineer Bruce.Eitman AT Eurotech DOT com My BLOG http://geekswithblogs.net/bruceeitman Eurotech Inc. www.Eurotech.com

    Wednesday, August 13, 2014 11:26 AM
    Moderator
  • Hi Bruce:

    The symptom is that the unit really does nothing after I apply power to it. There is a momentary click sound, which it always makes when it first powers up, but after that nothing. It really does not do much on a successful boot anyway. It's attached to a 2x40 LCD display that momentarily shows a checkerboard screen, then runs a program that displays its version and IP address on the LCD display, plus a little jingle.

    This is a mature product and does not have a GUI or console port to see anything during its boot up process.

    ( See picture of it at http://www.lanpoint.com/lanpoint-plus-fixed-images.html )

    The bad unit does nothing to indicate it's doing anything at all when it is powered on.

    I presume the bootloader runs, because when I pop in a CF card containing a CE 4.2 NK.BIN and give it time to load, the unit then boots up. If I then pop in a CF card with the CE 5.0 NK.BIN, let it load, and remove the CF card, the unit boots up and behaves as if nothing had happened.

    I can still successfully compile the CE 5.0 OS and tried it with all the latest QFEs (thanks to your website, which pointed me to the location for the latest QFEs), but that made no difference.

    My guess is that something has previously been corrupted, and that while loading the OS the bootloader encounters some kind of error and stops. I just have no means to see the error.

    It seems I need to know exactly which steps occur when a unit powers up, and then be able to monitor those steps to see where it stops. At power-up, the CPU must start reading and executing instructions at a specific memory address (at least it did back in the 8080 days) and do whatever it does to get to the point where it is a functioning device.

    Again, the problem does not occur until I power cycle the unit.  Otherwise, you would never know anything is wrong.

    Would performing some kind of periodic memory dump, or maybe some kind of periodic CRC check on every file in the system to detect any kind of a change before I perform a power cycle help?
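    For reference, a periodic per-file CRC check like the one asked about could look roughly like the sketch below. This is only an illustration: the module and function names are made up, it uses the standard CRC-32 (reflected polynomial EDB88320), and it was written against the desktop .NET Framework rather than tested on the Compact Framework. Note that it can only check what the file system returns through the file API, so damage that exists only on the flash media (for example behind a cache) may not show up until after a power cycle.

```vbnet
Imports System
Imports System.IO

' Hypothetical helper: prints a CRC-32 for every file under a folder so that
' two snapshots (for example before and after a test run) can be compared.
Module FlashCrcSnapshot
    ' Lookup table for the standard reflected CRC-32 polynomial.
    Private ReadOnly CrcTable() As UInteger = BuildTable()

    Private Function BuildTable() As UInteger()
        Dim table(255) As UInteger
        For i As Integer = 0 To 255
            Dim c As UInteger = CUInt(i)
            For bit As Integer = 0 To 7
                If (c And 1UI) <> 0UI Then
                    c = &HEDB88320UI Xor (c >> 1)
                Else
                    c = c >> 1
                End If
            Next
            table(i) = c
        Next
        Return table
    End Function

    ' Reads the file in 4 KB chunks and returns its CRC-32.
    Public Function Crc32OfFile(ByVal filePath As String) As UInteger
        Dim crc As UInteger = &HFFFFFFFFUI
        Dim buffer(4095) As Byte
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            Dim n As Integer = fs.Read(buffer, 0, buffer.Length)
            While n > 0
                For i As Integer = 0 To n - 1
                    Dim index As Integer = CInt((crc Xor CUInt(buffer(i))) And &HFFUI)
                    crc = CrcTable(index) Xor (crc >> 8)
                Next
                n = fs.Read(buffer, 0, buffer.Length)
            End While
        End Using
        Return crc Xor &HFFFFFFFFUI
    End Function

    ' Walks the folder tree and prints one "CRC  path" line per file.
    Public Sub PrintSnapshot(ByVal root As String)
        For Each f As String In Directory.GetFiles(root)
            Console.WriteLine("{0:X8}  {1}", Crc32OfFile(f), f)
        Next
        For Each d As String In Directory.GetDirectories(root)
            PrintSnapshot(d)
        Next
    End Sub
End Module
```

    Calling PrintSnapshot("\Flash") (the folder name is taken from this thread; adjust as needed) before and after running the test program and diffing the two listings would show which files changed.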

    Where in the OS source code is the bootloader source? I am sure there is some kind of debug mode I can get this into. I am not the original software developer of this customized OS.

    Thanks.

    Wednesday, August 13, 2014 3:40 PM
  • This sounds like a classic case of file corruption to me.

    The bootloader is loading NK from a flash (CF) medium, right? The same medium is used as storage in CE, and the application is writing to this.

    Flushing the file is fine, but that doesn't do anything to the disk cache (only to the file cache).

    Disable any disk caching and try again.

    Also, check the FATs of a working device and a failing device. I'm sure you will find the FAT is trashed.


    Good luck,

    Michel Verhagen, eMVP
    Check out my blog: http://guruce.com/blog

    GuruCE
    Microsoft Embedded Partner
    http://guruce.com
    Consultancy, training and development services.


    Thursday, August 14, 2014 3:21 AM
    Moderator
  • You asked/stated: The bootloader is loading NK from a flash (CF) medium, right? The same medium is used as storage in CE, and the application is writing to this.

    I think the answer is no. I believe what I have is a ROM-only file system. The unit is not booting from a Compact Flash card that contains an NK.BIN file. The unit boots from its flash memory.

    I believe that when I first power on the unit, the bootloader checks whether a Compact Flash card is present in the unit's Compact Flash slot and contains a file called NK.BIN. If it finds NK.BIN on the card, the bootloader unpacks the contents of NK.BIN and writes them into a file structure in flash memory.

    Once that is done I remove the Compact Flash card from the unit. The NK.BIN file itself is not in the unit anymore. It's all been unpacked and sorted into a bunch of separate files and folders as shown below.

    When power is completely removed and then re-applied, the unit boots up. After that, any folders or files I create or copy into any of the folders under 'My Device' remain intact if I do a power cycle (completely remove all power to the unit). We create a 'Flash' folder to store our programs, but I can write files anywhere and they remain intact across a power cycle.

    The application causing the problem creates a folder called /Flash/solaria2 and creates files as shown below:

    These are just text files. The file called TMP is created first and data is written into it with the following commands:

            fileCfg = File.CreateText(strFileTmp)
        .
            fileCfg.WriteLine(strTemp)
            .
            fileCfg.Close()

    Once the application closes the TMP file the application copies the TMP file to a different name using the following command:

           File.Copy(strFileTmp, strFileAct, True)

    where strFileTmp = "TMP" and (for example) strFileAct="PER".

    If I comment out these 4 commands, the problem seems to go away.

    I have replaced the CreateText() and WriteLine() commands with File.Open and File.Write() commands, which might be helping.

    This is written in VB.NET. I see nothing in it about flushing, and I don't know how to disable disk caching or check the state of a FAT. How are any of these things done in VB.NET? Might adding a File.Flush() command after each write, or before the File.Close(), help?

    As things stand right now, the File.Copy(strFileTmp, strFileAct, True) call seems to be the problem.
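    As a point of comparison, a more defensive version of the write-then-copy sequence might look like the sketch below. The module, function and lock names are hypothetical, the SyncLock is only useful if several threads can update the same files (an assumption; the threading model is not shown in this thread), and the code was written against the desktop .NET Framework, not tested on the Compact Framework. Flush() here only empties the StreamWriter's own buffer into the file system; it does not by itself force cached data out to the flash.

```vbnet
Imports System
Imports System.IO

' Hypothetical helper: writes a temp file, closes it, then copies it over the
' active file, reporting I/O errors instead of letting them pass silently.
Module SafeFileUpdate
    ' Assumed lock object: serialises updates if more than one thread writes.
    Private ReadOnly fileLock As New Object()

    ' Returns True on success, False if an IOException occurred.
    Public Function WriteAndPublish(ByVal tmpPath As String, _
                                    ByVal actPath As String, _
                                    ByVal lines() As String) As Boolean
        SyncLock fileLock
            Try
                Using writer As StreamWriter = File.CreateText(tmpPath)
                    For Each textLine As String In lines
                        writer.WriteLine(textLine)
                    Next
                    ' Empties the writer's buffer; not a guarantee that the
                    ' data has reached the flash media.
                    writer.Flush()
                End Using ' closes the file even if WriteLine throws

                ' Same call as in the original code, now inside the Try block.
                File.Copy(tmpPath, actPath, True)
                Return True
            Catch ex As IOException
                Console.WriteLine("File update failed: " & ex.Message)
                Return False
            End Try
        End SyncLock
    End Function
End Module
```

    The main differences from the original snippet are that the file is always closed, only one update runs at a time, and a failure is reported to the caller rather than ignored.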

    Once the unit no longer boots, I can't check flash memory because the OS never gets to the point of doing anything. Is there a way to perform some kind of total memory dump of the contents of flash memory that I might compare before I power cycle a unit? What is odd is the unit behaves perfectly fine, until I pull power and try to reboot it.

    As you stated, it does look like something is getting corrupted someplace. I just don't know how or where.

    Thanks

    Thursday, August 14, 2014 9:23 PM
  • Chulk,

    What I think is more likely to happen is this:

    1. Bootloader starts and checks for CF. If no CF inserted jump to 4
    2. Check if NK.BIN is on CF, if not, jump to 4
    3. Load NK.BIN and write to internal NAND flash (to a partition you can't see in CE)
    4. Load NK.BIN from internal NAND flash to RAM
    5. Jump to start address in RAM

    Now, when CE starts it will mount the part of internal NAND (that is not used to store NK.BIN) as a storage device. The application is writing its log/data files to this mounted storage device.

    Thus, your application is still writing to the same NAND flash device on which NK.BIN is stored.

    Does the CE kernel use a hive-based registry? If so, replace it with a RAM-based registry and see if the problem goes away.

    If you don't need registry persistence between cold boots then this is your solution (most probably).

    If you do need registry persistence and the hive based registry is the problem (due to file corruption), please try to enable TFAT and disable any disk caching in the kernel.

    This all has nothing to do with the VB.NET application. It only has to do with files being written when the device is turned off. This can cause file system corruption, which in turn can make the persisted registry hives unreadable from disk, causing CE to wait indefinitely for the hive store to mount, or just to hang because of corrupted data.

    So, here are your steps:

    1. Disable any disk caching in the kernel registry and rebuild your kernel. Flash to the device and try to reproduce the hang. If it still occurs;
    2. Change to RAM based (non persisted) registry and rebuild your kernel.
      Flash to the device and try to reproduce the hang. If it still occurs;
    3. Add TFAT to your OS Design, rebuild your kernel and MAKE SURE to reformat the NAND partition and VERIFY TFAT is indeed used! If the problem still occurs;
    4. Write a small program in your bootloader that can dump the entire contents of the NAND device (or hook up a JTAG and read the entire NAND that way). Then determine the positions in NAND of the storage partition and the NK.BIN and verify the contents are as you expect.

    The options are listed in order of complexity. Hopefully you'll find your fix before 4.
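    For step 4, once two raw dumps have been saved as files on a PC (one from a working unit, one from a failing unit), a first comparison could be as simple as the sketch below. The module and function names are made up, and how the dumps are produced is not covered here; this only finds the offset of the first difference, which can then be matched against the NK.BIN and storage partition positions.

```vbnet
Imports System
Imports System.IO

' Hypothetical PC-side helper for comparing two raw NAND dump files.
Module DumpCompare
    ' Returns the offset of the first differing byte, or -1 if the files are
    ' identical. If one file is a prefix of the other, returns the length of
    ' the shorter file.
    Public Function FirstDifference(ByVal pathA As String, _
                                    ByVal pathB As String) As Long
        Dim result As Long = -1
        Using a As New FileStream(pathA, FileMode.Open, FileAccess.Read)
            Using b As New FileStream(pathB, FileMode.Open, FileAccess.Read)
                Dim offset As Long = 0
                Do
                    Dim x As Integer = a.ReadByte() ' -1 at end of file
                    Dim y As Integer = b.ReadByte()
                    If x <> y Then
                        result = offset
                        Exit Do
                    End If
                    If x = -1 Then Exit Do ' both files ended together
                    offset += 1
                Loop
            End Using
        End Using
        Return result
    End Function
End Module
```

    For example, FirstDifference("good.bin", "bad.bin") (file names are placeholders) returning an offset inside the region where NK.BIN is stored would point at the image itself being damaged rather than the storage partition.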

    If it does turn out to be data corruption then the only 100% solution is to add a supercap or small battery to the device so you can keep the device on for a couple of seconds after cutting power. You can then notify the OS (and through the OS the applications) that the device is about to go down which gives applications and the OS the time to flush, close and finish writing all data to the storage medium.

    Failing that, you may want to check out Datalight's Reliance Nitro as a solution.


    Good luck,

    Michel Verhagen, eMVP



    Thursday, August 14, 2014 11:21 PM
    Moderator
  • Thanks. I'll start trying your suggestions.

    ============================

    In a Common.reg I found the following:

    [HKEY_LOCAL_MACHINE\System\StorageManager\FATFS]
        "FriendlyName"="FAT FileSystem"
        "Dll"="fatfsd.dll"
        "Flags"=dword:00000064
        "Paging"=dword:1
        "EnableCache"=dword:1  ;...Change this to 0(zero?) and generate a new NK.BIN...correct?
        "CacheSize"=dword:0
        "Util"="fatutil.dll"
        "CacheDll"="diskcache.dll"
    ; END HIVE BOOT SECTION
    ; @CESYSGEN ENDIF CE_MODULES_FATFSD || CE_MODULES_TFAT

    ===================================

    Thanks;
    Monday, August 18, 2014 11:48 PM
  • Correct.

    Good luck,

    Michel Verhagen, eMVP

    • Marked as answer by Chulk Ches Thursday, August 28, 2014 8:34 PM
    Wednesday, August 20, 2014 1:03 AM
    Moderator
  • Just wanted to let you know that I've suspended my investigation. My customer/developer indicated that they had modified their program to get rid of some race condition and that their program appears to be working OK now.

    I did disable the disk cache in the registry and that did seem to help a bit, but the same failure would still happen.

    But for now I'll put this on hold until it raises its ugly head again...very bizarre.

    Thanks to everyone for all your help and advice.

    • Marked as answer by Chulk Ches Thursday, August 28, 2014 8:33 PM
    Wednesday, August 27, 2014 3:45 PM