Heap Corruption puzzler.

    Question

  • One of my customers is reporting frequent failures in the C++ service I'm responsible for maintaining. 

    I've analyzed several crash dumps collected via adplus with page heap enabled, but so far I've been unable to find the root cause.  I'm running out of ideas and looking for new ones. 

    Here's the stack trace of the latest dump:

    ===========================================================
    VERIFIER STOP 00000008: pid 0xAC8: corrupted heap block 
    
    	00151000 : Heap handle
    	04144990 : Heap block
    	00000020 : Block size
    	00000000 : 
    ===========================================================
    
    FirstChance_bpe_CONTRL_C_OR_Debug_Break
    
    [snip]
    
     # 33 Id: ac8.edc Suspend: 1 Teb: 7ff97000 Unfrozen
     # ChildEBP RetAddr Args to Child       
    00 06416b58 7c863309 00151000 06416c08 06416be4 ntdll!DbgBreakPoint (FPO: [0,0,0])
    01 06416b68 7c877a7f 00000008 7c877e04 00151000 ntdll!RtlpPageHeapStop+0x72 (FPO: [Non-Fpo])
    02 06416be4 7c877fec 00151000 00000009 04144990 ntdll!RtlpDphReportCorruptedBlock+0x2e5 (FPO: [Non-Fpo])
    03 06416c48 7c878b74 0b0333a0 00000000 00151000 ntdll!RtlpDphAddToDelayedFreeQueue+0x120 (FPO: [Non-Fpo])
    04 06416c6c 7c878d94 00151000 00250000 01000002 ntdll!RtlpDphNormalHeapFree+0x73 (FPO: [Non-Fpo])
    05 06416cc4 7c87bc6b 00150000 01000002 0b0333c0 ntdll!RtlpDebugPageHeapFree+0x146 (FPO: [Non-Fpo])
    06 06416d2c 7c85574a 00150000 01000002 0b0333c0 ntdll!RtlDebugFreeHeap+0x2c (FPO: [Non-Fpo])
    07 06416e04 7c83e600 00150000 01000002 0b0333c0 ntdll!RtlFreeHeapSlowly+0x37 (FPO: [Non-Fpo])
    08 06416ee8 00562935 00150000 01000002 0b0333c0 ntdll!RtlFreeHeap+0x11a (FPO: [Non-Fpo])
    09 06416efc 00401848 0b0333c0 00591a66 00000000 NetFYISvc!ATL::CWin32Heap::Free+0x19 (FPO: [Non-Fpo]) (CONV: thiscall)
    0a 06416f14 0043c635 06416f3c 25a6057e 0641ca28 NetFYISvc!ATL::CSimpleStringT<wchar_t,0>::operator=+0x48 (FPO: [1,0,0]) (CONV: thiscall)
    0b 06416f48 00466fd5 06416fdc 25a605aa 0641fcfc NetFYISvc!CFoldDocList::BuildLink_SearchPage+0x195 (FPO: [1,6,4]) (CONV: thiscall)

    The failure in this case seems to be on freeing the 'old memory' of a CString prior to attaching a new value.  What puzzles me is that the CString in question is a local variable of BuildLink_SearchPage.  The code looks vaguely like:

    CString link = m_urlStart + RQ_SEARCH;
    link = m_Request.BuildLink(link, etc, etc, etc);  // exception here!
    link.Append("&foo=bar");
    return link;

    The first parameter to m_Request.BuildLink is of type "const CString&"; the return value is of type "CString" and is built from a working copy.

    So the failure isn't happening here, but somewhere else... both in code and time.
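
    One way to picture why the report shows up at an unrelated free: in the trace above, the block flagged as corrupt (04144990) is not the block being freed (0b0333c0), and the stop is raised from RtlpDphAddToDelayedFreeQueue. The toy model below is only a sketch of that idea, not the real page heap code. DelayedFreeQueue, kFreedFill and DemoLateDetection are hypothetical names, and the fill value is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Toy model only: NOT the real page heap implementation.
constexpr unsigned char kFreedFill = 0xF0;  // assumed fill value for freed blocks

class DelayedFreeQueue {
    struct Entry {
        int id;
        std::vector<unsigned char>* block;
    };

public:
    explicit DelayedFreeQueue(std::size_t capacity) : capacity_(capacity) {}

    // Stamps the block with the fill pattern and parks it in the queue.
    // If the queue is over capacity, the oldest parked block is verified
    // and released. Returns the id of a block found corrupted during that
    // check, or -1 if nothing was found.
    int Free(int id, std::vector<unsigned char>* block) {
        std::fill(block->begin(), block->end(), kFreedFill);
        queue_.push_back({id, block});
        if (queue_.size() <= capacity_) {
            return -1;
        }
        Entry oldest = queue_.front();
        queue_.pop_front();
        for (unsigned char b : *oldest.block) {
            if (b != kFreedFill) {
                return oldest.id;  // someone wrote to it after it was freed
            }
        }
        return -1;
    }

private:
    std::size_t capacity_;
    std::deque<Entry> queue_;
};

// Frees block 1, scribbles on it through a "stale pointer", then frees an
// innocent block 2. Returns the id reported while freeing block 2.
int DemoLateDetection() {
    std::vector<unsigned char> a(32), b(32);
    DelayedFreeQueue q(1);
    int first = q.Free(1, &a);   // block 1 parked, nothing to check yet
    a[5] = 0x41;                 // stale write into the freed block
    int second = q.Free(2, &b);  // innocent free triggers the check
    return (first == -1) ? second : -2;
}
```

    DemoLateDetection() returns 1: the damage to block 1 is only noticed while block 2 is being freed, so the call stack at detection time points at code that did nothing wrong.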

     

    What else can I try to find it?





    This signature unintentionally left blank.
    Tuesday, May 03, 2011 1:46 PM

Answers

  • The dump posted above is a minidump, so the result of the first command (!heap -s -v) is

    07f70670: Unable to read ProcessHeaps array

    Yesterday I may have gotten a lucky break.  My customer uploaded an AV dump that not only captured the write to invalid memory, but also caught the thread that had freed that memory before the writer finished processing.  One bug squashed.  Hopefully it's the one I've been fighting for the past week, but I may never really know.

    When I give them the fix, I'll also update the adplus configuration to grab full dumps on Page Heap errors... just in case.
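
    For that class of bug (one thread freeing a buffer while another is still using it), one common remedy is shared ownership. The sketch below is an illustration under assumptions, not the actual fix that went into the service: Request and ProcessOnWorker are hypothetical names, and std::shared_ptr stands in for whatever ownership scheme the real code uses.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <thread>

// Hypothetical payload shared between the owning thread and a worker.
struct Request {
    std::wstring url;
};

// Hands a Request to a worker thread, then drops the owner's reference
// before the worker is known to be done. Returns what the worker read.
std::wstring ProcessOnWorker(const std::wstring& url) {
    auto req = std::make_shared<Request>();
    req->url = url;

    std::wstring seen;
    // The lambda copies the shared_ptr, so the worker holds its own reference.
    std::thread worker([req, &seen] { seen = req->url; });

    // With a raw pointer and delete, this would be a use-after-free race.
    // Here the Request is destroyed only when the last reference goes away.
    req.reset();

    worker.join();  // join before reading 'seen'
    return seen;
}
```

    The point of the design is that neither thread decides on its own when the memory dies; whichever one finishes last releases it.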

     


    • Marked as answer by Nick F. _ Thursday, June 09, 2011 12:11 PM
    Friday, May 06, 2011 12:27 PM

All replies

  • Maybe try UMDH?
    I'm preparing for the exam 70-660 TS: Windows Internals
    Wednesday, May 04, 2011 12:07 PM
  • I've used UMDH to locate memory leaks before, even in drivers I didn't write, but I'm not sure it would help with what I'm seeing.

    The problem isn't allocation or free... It's one of corruption.  My hunch is that one thread is trying to use memory another thread has re-allocated, but by the time the exception is raised, the damage is done... the crooks have made off with the loot and Elvis has left the building.

    I need to trap the cause, or have a way to back track from effect to cause.

     


    Wednesday, May 04, 2011 12:59 PM
  • Application Verifier?
    Wednesday, May 04, 2011 6:20 PM
  • Locally I've tried Application Verifier and BoundsChecker... neither one gives any indication that anything is wrong.

    Either I'm not doing the same actions my customer's users are, or we're just not getting enough load on the system.

    I suppose I could see about loading application verifier on the production box, but I'm not sure they'd go for it.  Currently they're testing a build with stack checking enabled to see if that reveals anything.

     

     


    Wednesday, May 04, 2011 8:57 PM
  • From the stack, it seems you have page heap enabled, and the exception (actually a debug break) is caused by errors in the metadata of a heap block.

     

    On the dump, you can try:

    !heap -s -v          << verify the whole heap; this takes time, but may reveal other errors

    !heap -p -a 0b0333c0 << check the address that is passed to free

    dd 0b0333c0 L20      << see the content of the CString

     

    Post the results; maybe I or someone else can give you some hints.

    Regards

    Kjell GUnnar

    Thursday, May 05, 2011 2:02 PM