Corrupt files on disk after power failure with EWF on WES7

  • Wednesday, June 20, 2012 1:17 PM
     
     

    We're building a WES7 embedded system.   I experienced a corrupt disk recently, which is worrisome.

    Our setup is a WES7 Atom based computer with a 16 GB SSD.  EWF (RAMREG) is enabled on the disk. The disk has two NTFS partitions.  One of them, the OS partition, is protected by EWF.  I deleted a directory on the protected partition and pulled the power to the board.  When the system came back up, the directory still showed in Explorer, but it was corrupted and I couldn't delete it, or change into it.  A few reboots did not change this.

    I then disabled EWF with the -commitanddisable and -live switches and rebooted.  When the system started, WES7 performed a checkdisk during boot and after checkdisk ran, the system appeared fine again.  The directory that I deleted initially while EWF was enabled was restored and the system seemed to operate normally.

    I enabled EWF again and retried the steps a couple of times (deleting directory and powered off / restarted).  The system has been behaving as expected.

    Our main reason for using EWF is to protect against unexpected power failures.  Any suggestions on deploying with EWF to protect against power failures?

    Thanks!


    • Edited by kokketiel Wednesday, June 20, 2012 1:26 PM

All Replies

  • Wednesday, June 20, 2012 4:05 PM
     
     

    There are multiple things going on here.

    1. EWF doesn't protect the NTFS metadata section, so this space is still getting written to. The benefit is that if power does get cycled, the metadata should help recover the system.

    2. For flash devices like CF and USB, there is wear-leveling technology to reduce flash wear. I assume the same is true for SSDs. If the flash is doing a wear-level block move in the middle of a power cycle, this could have been the cause of the corruption.

    EWF or FBWF with NTFS is the best possible solution to protect against sudden power cycles.

    -Sean 


    www.sjjmicro.com / www.seanliming.com / www.annabooks.com, Book Author - Pro Guide to WES 7, XP Embedded Advanced, Pro Guide to POS for .NET

  • Monday, July 16, 2012 4:44 PM
     
     

    Egads!! We too are experiencing corrupt drive images. We have a 16 GB SLC SSD drive partitioned into two drives: C: (for the O/S) and D: (for data). We protect drive C: with EWF. The D: drive we do not protect, but our code can tolerate bad files here.

    Our device is always powered off unannounced. It is factory machinery. For this reason, we use EWF. Yet still, we're experiencing corruption where the O/S cannot boot because of bad files. We've been thinking we have a power spike or something causing the corruption. 

    Sean: Your comment is frightening as it adds a whole new level of concern - that even never writing to the drive, the device itself is moving things around.

    Does anyone else have a comment to add to this?  Should we abandon SSD drives and go back to standard spinning drives?  We thought SSD would be better since this device is used in a dirty/dusty environment.

  • Monday, July 16, 2012 5:39 PM
     
     

    What SSD brand are you using?

    -Sean



  • Tuesday, July 17, 2012 11:11 AM
     
     

    For SSDs, I have decided on the Intel 320 series (the ones with power failure protection) and I am not looking forward to testing with EWF at all.

    I believe that is the magic feature that will save your bacon since it allows the SSD to complete anything in flight.

    I can safely say however, that you should avoid kingston SV100 drives like the plague. (out of 64, 8 have failed so far, not one has been out there more than 3 months!) In fact, pretty much anything consumer is scary.


    =^x^=

  • Friday, July 20, 2012 5:32 AM
     
     

    I am possibly also facing this issue, with a design for a medical device that contains a small PC running Windows 7 or WES7 (because of constraints such as graphics, drivers, etc.). The PC will use either a conventional HDD or an enterprise-class SSD (with internal capacitors) as boot drive. The entire device must function like an appliance, i.e. the user should be able to turn it on/off instantly, with a switch, or pull the plug, without damage. From studying WES7 documentation, it seems that EWF+HORM could in theory support that. I am however worried that EWF was not really designed for routine hard power cuts, in addition to the SSD issues discussed above. I realize that using a UPS (or a laptop) could avoid file system damage and allow regular Windows shutdown, but that is a solution I want to avoid because of regulatory complexity.

    There is also a software product called IntervalZero ReadyOn, which claims to support instant on/off, apparently through a mechanism similar to EWF+HORM. One of their product managers even claimed they had invented Windows Embedded and licensed it to Microsoft (http://www.mp3car.com/software-and-software-development/30244-anyone-looked-into-readyon-for-xp-boot-9.html). But ReadyOn is not available for Windows 7.

    Back to WES7: Is it realistic to assume I can get ROUTINE instant on/off through EWF+HORM, and if so, what do I need to pay attention to other than use HDDs or SSDs with capacitors? Any recommendation for a specific SSD brand/model? Is it indeed better to stick to HDDs? Or, am I crazy for even expecting this to work day-in, day-out?

  • Friday, July 20, 2012 6:20 AM
     
     

    I think you have a fabulous challenge there. I can't use HORM, but I have to wonder what sort of "instant on" time you are expecting.

    Perhaps you could fake it so the power switch is actually a "wake" button and that it is charging its condensers and actually on the moment the power cable is plugged in.

    In other news, I managed to break the ntfs file system on an intel 320. So they aren't the panacea for all ills, but they are a pretty good bandage.


    =^x^=

  • Friday, July 20, 2012 6:28 AM
     
     
    I am trying to be realistic--the usual time it takes to wake from hibernation is acceptable. I am most concerned about the feasibility of hard power off.
  • Sunday, July 22, 2012 11:00 PM
     
     

    So does anyone have any experience or suggestions on this? My current plan is to install WES7 with EWF (but not HORM for starters) on a conventional, enterprise-class HDD, and then keep booting and cutting the power, say hundreds of times, or until it doesn't boot anymore (whichever happens sooner). If it works 100s of times, I will have some confidence that instant-off is OK, and then I can add HORM to it, and also switch to an enterprise-class SSD.

  • Monday, July 23, 2012 12:21 AM
     
     

    Not actually close enough to you, but to add a data point.

    Apacer commercial SSD (power fail protection), WES2009, EWF.

    Survived 100 power offs (pull the plug, flip the switch, etc.) no problem.

    Supplier put Kingston SV100's in instead. They have been waking up dead in distressing quantities. (8/64 failed within 3 months, file system corruption, actual bad blocks in the SSD, etc.) 

    I'll try the 100 on these intel 320 drives (consumer, but 5 year warranty + power fail protection) when I get a working image ready this week.

    I'm thinking maybe FBWF of \WINDOWS should do the trick ^^;


    =^x^=

  • Monday, July 23, 2012 12:40 AM
     
     

    Aha, interesting, thanks. So are you expecting FBWF to be superior to EWF in any way? I was planning to use EWF because it works together with HORM, so it's on the optimistic development path to fast (if not instant) power-on.

  • Wednesday, October 03, 2012 8:50 PM
     
     

    Any progress from either Oberst (EWF/HORM/SSD) or PE (Intel 320 Pow Off Test)?  

    For our part, we have experienced repeated BSOD after many 10s to hundreds of power off/on cycles with Intel 520 with EWF C: partition and non EWF D: partition.    We have 4x Intel SSD 320s with same behavior.   At this time we can not determine the cause of BSOD on powerup.  Once systems are running, they are very stable.

    Details:

    10 systems.  Intel S1200KPR, XEON E3-1225v2, Dataram ECC 2x4G, MS WES7, Intel 520 SSD, EWF on C: Part,  no EWF D: part, System Logs relocated to D:, Pagefile to D:

    Test Script does this:
    1. Hard Reboot via ethernet power strip
    2. Wait 1 minute.
    3. Reboot ... UNLESS no ping reply
    4. Go to next system in list and repeat 1-3
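
    The four steps above can be sketched roughly as follows. This is an illustrative reading of the steps, not the script actually used here: `power_cycle` and `is_alive` are hypothetical callables standing in for whatever drives the Ethernet power strip and sends the ping, and the round count and wait time are placeholders.

```python
import time
from typing import Callable, Dict, Iterable, List


def run_power_cycles(
    hosts: Iterable[str],
    power_cycle: Callable[[str], None],   # hypothetical: toggles the outlet feeding this host
    is_alive: Callable[[str], bool],      # hypothetical: True if the host answers a ping
    rounds: int,
    boot_wait_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Hard power-cycle each host in turn; return successful cycles per host.

    A host that does not answer after the wait is dropped from the rotation,
    so it stays in its failed state for inspection.
    """
    active: List[str] = list(hosts)
    completed: Dict[str, int] = {h: 0 for h in active}
    for _ in range(rounds):
        for host in list(active):          # iterate a copy, entries may be removed
            power_cycle(host)              # step 1: hard reboot
            sleep(boot_wait_s)             # step 2: wait for the system to boot
            if is_alive(host):             # step 3: keep cycling only if it replies
                completed[host] += 1
            else:
                active.remove(host)
        if not active:                     # nothing left to test
            break
    return completed
```

    Passing the outlet control and the ping in as functions keeps the loop independent of the power strip's protocol, and leaving a non-responding system untouched preserves the failed disk state for later analysis.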

    We are looking for expert advice to assist with solving this problem.  


    OnePV1d

  • Tuesday, October 09, 2012 8:45 AM
     
     

    I forgot about my Intel 320 and left it rebooting every 3 minutes over a long weekend. No problems at all. Something over 500 power cycles before I remembered.

    I found a few interesting tools.  SSD Life Pro and ssdready.

    It watches how much you write and guesses at how long your drive will live.  ssdready is entertaining, but the estimates of daily write traffic are interesting.

    One thing I can think of, if you're using an SSD: was your image _built_ for SSD? E.g. superfetch/prefetcher off, defrag off, no page file.

    The 520 doesn't have power failure protection, so anything "in flight" will be lost when the power goes.


    =^x^=

  • Tuesday, October 30, 2012 3:42 PM
     
     

    Thank you for the comments and advice, PE.

    Some more test results.

    After a lot of work on the WES7 image to move file writes from the c: partition (EWF) to the d: partition, we are no longer getting BSODs or a corrupt image.

    1000+ cycles disruptive power off/on ... and so far no BSOD.

    At this time, the recommendation to others is to use Process Monitor filtered to show EWF partition writes.  Then work down the list of files moving files from the EWF drive to a non-EWF drive.
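
    As a rough aid for working down that list, a Process Monitor capture saved as CSV can be summarised per path. This sketch assumes the export uses Process Monitor's default column headers (`Operation`, `Path`) and that `WriteFile` is the operation of interest; the function name and the drive letter default are illustrative only.

```python
import csv
from collections import Counter
from typing import Iterable, List, Tuple


def top_written_paths(
    csv_lines: Iterable[str],
    drive: str = "C:",
    limit: int = 20,
) -> List[Tuple[str, int]]:
    """Count WriteFile events per path on one drive, busiest first."""
    counts: Counter = Counter()
    prefix = drive.upper() + "\\"
    for row in csv.DictReader(csv_lines):
        if row.get("Operation") != "WriteFile":
            continue                       # ignore reads, registry activity, etc.
        path = row.get("Path") or ""
        if path.upper().startswith(prefix):
            counts[path] += 1
    return counts.most_common(limit)
```

    It can be fed an open file object, for example one opened with `newline=""` and `encoding="utf-8-sig"` so that a byte-order mark, if present, is tolerated. The paths at the top of the result are the first candidates to relocate to the unprotected partition.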



    OnePV1d

  • Thursday, November 01, 2012 9:09 AM
     
     

    Did you end up moving anything strange?

    The only things I've moved are the event logs, and only the big three. There doesn't seem to be an obvious way to move the line-noise ones like "microsoft-diagnostics-cheesecake".

    Comparing identical images, one with EWF on, one off, over 800 power cycles (yes, I forgot for a weekend again) there are more events logged on EWF than not, to the tune of a couple of thousand...

    Interestingly, the WES2009 image with EWF and an intel drive survived 100 cycles (1 off, 5 on) but chkdsk blew a gasket and complained about the EWF partition. (the wide temp range toshiba and the POS kingston v100+ also survived, the latter with lots of lovely ntfs errors _ON_ C, the ewf partition!)


    =^x^=

  • Tuesday, November 06, 2012 8:02 PM
     
     

    Same kind of deal here. We're using an Intel SSD and ran into corruptions on a non-protected partition after abrupt power failure. We use EWF RAM-REG + HORM for our C: partition and haven't had any problems there.

    To mitigate this, our device has batteries, and the software does all its logging and shuts down Windows when running on batteries and the batteries get down to a critical level.  We tried calling the Win32 flush calls from our C# app but that didn't guarantee that writes would be committed. Shutting down gracefully before a power failure (if possible) was our only solution.
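
    For readers unfamiliar with the flush approach mentioned above, here is a minimal Python sketch of the same idea (the original app was C#, so this is an illustration, not the code we ran). `write_durably` is a hypothetical helper name. Per the Python documentation, `os.fsync` on Windows uses the C runtime's `_commit`. As described above, a flush request like this does not by itself guarantee the data survives a power cut if the drive holds it in a volatile cache.

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to the device before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's user-space buffer into the OS
        os.fsync(f.fileno())   # ask the OS to flush its cache for this file
```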

    I think Intel sells battery-backed up SSD's to mitigate this, but who knows how long those batteries are good for. 



    • Edited by Ben Schoepke Tuesday, November 06, 2012 8:02 PM
    • Edited by Ben Schoepke Tuesday, November 06, 2012 8:03 PM
  • Wednesday, November 07, 2012 12:46 AM
     
     

    The important thing is which Intel SSD, the 330 doesn't have the capacitors, the 320 does. Performance versus rock solid. The capacitors have sufficient capacity to complete ANY inflight write operation.

    Some testing I did, both with and without EWF, WD notebook drives, kingston SSD Now V100 (I'd recommend those to my enemies) and Intel 320.

    WES7 + New Application, 100, 200, 500 power cycles (1 min off, 5 min on) 

    kingston hard errors, WD SMART errors. Intel didn't complain at all.

    WES7 + Old Application, ditto

    WD replaced with toshiba industrial drives, kingston didn't get worse, not a squeak from the intel.

    WES2009 + Old Application, didn't get beyond 200.

    OS unable to chkdsk successfully on any hardware -.-

    Interesting, the number of entries in the event logs is dramatically different depending on EWF ON or OFF.

    I did not have sufficient hardware to stuff Intel 330's in, but it is entirely possible that their write speed is sufficient that you would have to have deity like precision killing the power to corrupt stuff.


    =^x^=

  • Thursday, November 15, 2012 9:00 AM
     
     

    I have nearly the same problem with an SSD drive. The tests were done with WES7 and 2 partitions (C: System / D: Data). C: is protected with EWF + HORM.

    When a file (the test file is a txt file) is changed and saved, the data sometimes goes corrupt on an unexpected power loss. The test file is edited with notepad and notepad++. If the file is corrupt, notepad displays black fields and notepad++ displays a "NULL" for each corrupt letter.
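
    The "NULL" display suggests the damaged regions are filled with zero bytes. Assuming that is the case, and that the test file is plain ASCII or UTF-8 text (which should never contain a zero byte), a check like the following sketch could flag a corrupt file automatically after each power cycle. `nul_runs` is an illustrative name, not an existing tool.

```python
from typing import List, Tuple


def nul_runs(data: bytes, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return (offset, length) for each run of NUL bytes at least min_len long."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, b in enumerate(data):
        if b == 0:
            if start is None:
                start = i                  # a new run of zero bytes begins here
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # handle a run that extends to the end of the data
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs
```

    An empty result means no zero-filled regions were found; it does not prove the file content is otherwise what was saved.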

    Also there are three different, time-dependent results: 

    1 - 3 sec between power loss and saving = no change is saved in the file

    3 - 8 sec between power loss and saving = file goes corrupt 

    8+ sec between power loss and saving = changes are saved correctly. 

    When I run Sync v2.0 (http://technet.microsoft.com/en-us/sysinternals/bb897438) between saving and the power loss, no file corruption appears. But that can't run if the power supply is removed suddenly (not usual, but possible).

    Is there any possibility to prevent corrupt files??

    Thanks!

     
  • Thursday, November 22, 2012 5:20 AM
     
     

    Which SSD are you using, is it one with power failure protection?

    So far, I've found that the Intel 320 SSD's are close to bulletproof. 1600 power cycles and not a single complaint on WES7. (500 caused WES2009 to have lots of "please check the disk. chkdsk found errors, chkdsk thinks it fixed it but it didn't...")

    If you have a kingston SSD, you're SOL. Sorry. ^^:

    Reading about sync suggests you shouldn't need it, unless you have done something weird like turning off write buffer flushing.


    =^x^=


    • Edited by P E Thursday, November 22, 2012 5:21 AM
  • Thursday, February 28, 2013 10:57 PM
     
     

    Hello,

    I, too, am using EWF to mitigate hard drive corruptions in a medical diagnostic instrument that can and will be turned off via the power switch.  It really came about because we wanted to harden the OS image against malware, viruses, etc... But as someone who cut his teeth in the true embedded world where we paid a lot of attention to power failures and ensuring the systems still shut down in an orderly fashion (typically in less than 50ms), I made the same leap as so many others here and figured that EWF was also a great tool to help with power failures in the Windows world. 

    In my current project, we have gone to platter drives because we were experiencing a number of issues with SSDs in one of our products (running VxWorks) and were frightened by the data we saw during the investigation of the problem. 

    To test my theory about EWF and power failure, a couple of years ago, I used a J-Works USB controlled relay module and wrote a program to repeatedly power cycle our SBC running WES7 with EWF.  We ran this configuration for a few weeks without interruption and NEVER encountered any issues.  I was able to vary both on and off cycle times with this program and tried some different settings with 100% success.  Perhaps it is the SSD drives? 

    Sean, if no writes are occurring to the EWF-protected volume and you are using an SSD drive, would you still expect wear leveling algorithms to move blocks? It has always been my understanding that wear leveling kicks in when writing to the device itself.  I would love some clarification on this (even though we aren't using SSDs now).

    Thanks!

    Rick

  • Monday, March 04, 2013 11:34 AM
     
     

    What sort of cycle were you running?

    I use Intel 320s; the power-fail protection was the deciding feature. Someone else used Kingston in something, and they were out in the field for 3 months before they started dying... We won't be buying them in a hurry. They are all being replaced with Toshiba HDDs: slow, wide-temperature-range things that appear as bombproof as the Intel SSDs, as long as performance isn't even remotely important.

    For my test (this box has now done this over 2000 times with no problems) I had 1 minute off, 5 on. Even the toshibas have survived 1000.

    I do see some curious junk from the date EWF was turned on in the event logs on the unprotected partition, but nothing worrying.

    I also have a letter from intel about the replacement for the 320, still waiting for firm data...


    =^x^=

  • Monday, March 04, 2013 9:39 PM
     
     

    Actually there are writes made to an EWF protected volume.

    1. NTFS has the metadata section that is not protected by EWF. Any writes made to the protected disk are sent to RAM, but some critical data that would typically be used for recovery after power loss is sent to the metadata section. Writes are being made and wear leveling kicks in. I noted this issue in my book as I was burned on an XPe project that used EWF. You have to use forensic tools to see that data has been written to the drive. There have been a few others who have spotted this over the past 10 years.

    2. It was in an old forum post that someone from MS said the disk address lines are still toggled under EWF, thus also triggering a wear leveling event. I have no way to verify this one.

    WES 8 introduces UWF, which will replace all the older filters moving forward. I asked if UWF will have the same issues, but the people I asked were new and not as closely familiar with EWF. I will ask again to see if anyone is looking into this.

    -Sean


    www.annabooks.com / www.seanliming.com / Book Author - Pro Guide to WES 7, XP Embedded Advanced, Pro Guide to POS for .NET