win32 window in WPF
I'm not sure that the stack part (or at least the UXTheme stuff) is trustworthy. The bottom of the stack seems normal. And we see what appears to be an exception handler trying to do cleanup. Then lots of nested calls to various layers of heap management code.
But the part where the stack transitions from RtlFreeHeap to ConvertToUnicode doesn't make any sense. I suspect that everything above that is left over from previous use of the stack.
0048f40c 6b88f208 mscorwks!_EH_epilog3_GS+0xa, calling mscorwks!__security_check_cookie
0048f410 6b8a756e mscorwks!SString::ConvertToUnicode+0x81, calling mscorwks!_EH_epilog3_GS
0048f424 77b4371e ntdll_77b10000!RtlpFreeHeap+0xbb1, calling ntdll_77b10000!RtlLeaveCriticalSection
0048f42c 77b436fa ntdll_77b10000!RtlpFreeHeap+0xb7a, calling ntdll_77b10000!_SEH_epilog4
A crash in RtlFreeHeap points to heap corruption, which suggests that the problem is in unmanaged code. However, the memory for managed objects must ultimately be allocated from unmanaged memory, so it could be either.
I suggest you look for places where your unmanaged window could corrupt the heap: multiple frees of the same allocation, or writes past an allocation's boundaries.
Here's a useful article on memory leaks in WPF. You might also consider something like ANTS Performance and/or Memory Profiler from RedGate to help diagnose problems like this.
Your problem is not caused by a managed memory leak. Clearly you are tickling a bug somewhere in unmanaged code.
The SyncFlush() method is called after several MILCore calls, and it appears to cause the changes that have been sent to be processed immediately instead of being left in queue for later processing. Since the call processes everything previously sent, nothing in your visual tree can be ruled out from the call stack you sent.
A call stack that includes unmanaged calls may turn up more useful information. Run the application under VS.NET with native debugging enabled, or under windbg or another native code debugger. Set the debugger to break on the exception, and capture the call stack at the point where it breaks.
The call stack will of course descend into MILCore, and from there it may go into the DirectX layer and the DirectX driver. A clue as to which part of your code caused the problem may be found somewhere in this native call stack.
Chances are that MILCore is passing a huge value for some parameter into DirectX based on what you are telling it. Check your application for anything that could make DirectX allocate a lot of memory. Examples of things to look for would be:
- BitmapSources that are set to load at very high resolution.
- Large WriteableBitmaps
- Extremely large (or negative) transform or size values
Another way to attack this problem is to progressively simplify your application until the problem disappears, then look very closely at what you removed last. When convenient, it can be good to do this as a binary search: initially cut out half of the visual complexity. If it works, put back half of what was removed; otherwise remove another half. Repeat until done.
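The binary search can be sketched in a few lines of Python. This is only an illustration of the idea, not something you would run against a WPF application: the element names are hypothetical, and `fails_with` stands in for the manual step of showing only the given elements, running the application, and seeing whether the problem reproduces. It also assumes a single element is responsible, which may not hold in your case.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def find_culprit(components: Sequence[T],
                 fails_with: Callable[[List[T]], bool]) -> T:
    """Return the one component whose presence triggers the failure.

    Assumes exactly one culprit, and that fails_with(subset) is True
    whenever that culprit is part of the subset.
    """
    candidates = list(components)
    if not fails_with(candidates):
        raise ValueError("the full set does not reproduce the problem")
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still reproduces the failure.
        candidates = first if fails_with(first) else second
    return candidates[0]


# Hypothetical element names standing in for parts of a visual tree.
elements = ["toolbar", "chart", "video_host", "status_bar", "map_view"]
culprit = find_culprit(elements, lambda shown: "video_host" in shown)
print(culprit)
```

Each round halves the number of candidates, so even a visual tree with a few hundred elements needs fewer than ten test runs.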
Also note that it is usually unnecessary to actually remove UI components to keep MILCore from seeing them. Any Visual with Visibility.Hidden may be skipped over entirely.
There is no generalized way to avoid this problem, but the search technique will help you pinpoint what specifically needs to be changed to fix it in the particular case.
It is safe to say from the call stack that you have found a bug in either the .NET Framework or the DirectX drivers for a particular video card.
Regarding the second stack trace you posted
John Knoeller is correct that the transition from RtlFreeHeap to ConvertToUnicode is nonsense, but draws the wrong conclusion from it. What we are seeing is that your debugger got lost when tracing back the stack. It started correctly from the exception but got lost below the Assembly.ExecuteMainMethod frame because that part of the stack had been overwritten as the exception was handled and the debugger was invoked.
Unfortunately, any analysis of this stack trace is useless for your purposes because it was captured too late. What we are seeing is an exception occurring during processing of a WM_LBUTTONDOWN, which is converted to a WM_SYSCOMMAND, which then catches an exception. In other words, you clicked on something that caused a system command (such as a resize), which caused an exception. At the point this stack trace was captured, the exception was already being handled. The reason you are seeing User32 and UxTheme calls is that these are involved in processing the button click. They have nothing to do with the real problem.
You are on the right track, but you will need to capture a stack trace at the moment the allocation fails (or you can use one of the other approaches I suggested above).
You will know you have the correct stack trace when all the managed frames in your first stack trace appear in it and the top of the stack is a failing memory allocation. Note that we are really interested only in the unmanaged frames that appear above the DUCE+Channel.SyncFlush call; everything below that will be .NET Framework and your application code.
How to get a native stack trace at the right time
You want to get a stack trace at the time of the first memory allocation failure within the DUCE+Channel.SyncFlush call shown. This may be tricky. There are three approaches I use (note that in each case you start with a breakpoint inside the SyncFlush call; see the note below for more details):
1. Set the debugger to break on all exceptions (managed and unmanaged), then keep hitting go (F5, or "g") until it breaks on the memory allocation exception you are interested in. This is the first thing to try because it is quick, but it often fails when working with native code because native code often returns an error code to its native caller instead of throwing an exception.
2. Set the debugger to break on all exceptions and also set breakpoints on common memory allocation routines, then hit F5 (go) repeatedly until the exception occurs, counting how many times you hit F5. The next time you run, hit F5 one fewer time and you may be on the allocation call that generated the exception. Capture the call stack to Notepad, then F10 (step over) repeatedly from there to see if it really was the allocation that failed.
3. Set a breakpoint on the first native frame called by SyncFlush (this is wpfgfx_v0300!MilComposition_SyncFlush) to skip over the managed-to-native transition, then F5 to run to it. F10 (step over) through the function until EAX contains one of the error codes E_OUTOFMEMORY (0x8007000E), ERROR_OUTOFMEMORY (0x0000000E), or ERROR_NOT_ENOUGH_MEMORY (0x00000008). Note the most recent "call" instruction. The next time you run the program, run to there and step into it. Repeat this until you are down to the memory allocation call that caused the problem, and dump the stack trace. Note that in many cases you will find yourself looping through a largish data structure, so some intelligence is required to set an appropriate breakpoint to skip over the loop so you can get where you need to be quickly. This technique is very reliable but very labor-intensive.
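When single-stepping and watching EAX, it helps to have the three out-of-memory codes in one place. The following is a small Python sketch, not part of any debugger: `classify_eax` is a made-up helper name, and `hresult_from_win32` only mirrors the Windows `HRESULT_FROM_WIN32` macro for ordinary positive Win32 error codes. It also shows why E_OUTOFMEMORY and ERROR_OUTOFMEMORY look so similar: the former is the latter wrapped as an HRESULT.

```python
from typing import Optional

FACILITY_WIN32 = 7

# The three values named in the text.
ERROR_NOT_ENOUGH_MEMORY = 0x00000008
ERROR_OUTOFMEMORY = 0x0000000E
E_OUTOFMEMORY = 0x8007000E

OUT_OF_MEMORY_CODES = {
    ERROR_NOT_ENOUGH_MEMORY: "ERROR_NOT_ENOUGH_MEMORY",
    ERROR_OUTOFMEMORY: "ERROR_OUTOFMEMORY",
    E_OUTOFMEMORY: "E_OUTOFMEMORY",
}


def hresult_from_win32(code: int) -> int:
    """Mirror HRESULT_FROM_WIN32 for Win32 codes in the range 0..0xFFFF."""
    if code == 0:
        return 0
    return 0x80000000 | (FACILITY_WIN32 << 16) | (code & 0xFFFF)


def classify_eax(eax: int) -> Optional[str]:
    """Return the symbolic name if EAX holds one of the codes, else None."""
    return OUT_OF_MEMORY_CODES.get(eax & 0xFFFFFFFF)


print(classify_eax(0x8007000E))
print(hex(hresult_from_win32(ERROR_OUTOFMEMORY)))
```

Keep in mind that 8 and 0xE are small numbers that can also appear in EAX as ordinary return values such as counts, so a match is a hint to look more closely rather than proof of a failed allocation.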
Note: In each case you don't want to set breakpoints or start single-stepping until your application is inside the failing DUCE+Channel.SyncFlush call. To ensure this, start the application with all breakpoints disabled. When it is running, enable a breakpoint on System.Windows.Media.Composition.DUCE+Channel.SyncFlush and resize the window. The first time around, just hit F5 to make sure the exception occurs on the first SyncFlush call (if not, count how many times you have to hit F5 before the exception occurs). Then disable the breakpoint and restart the program. Repeat the procedure, but this time, after you hit the SyncFlush call the right number of times, set your breakpoints or do your single-stepping as described above.
Recommendations
The debugging techniques I describe above are labor-intensive: Plan to spend several hours at least. Because of this, I generally try repeatedly simplifying my application to find out exactly what tickles the bug before jumping into the debugger for something like this. This has two advantages: It will give you a good repro to send the graphics card vendor, and it will make your debugging faster because there will be less displayed and therefore less code to single-step through, fewer allocations, etc.
Because the problem happens only with a specific graphics card, there is no doubt that the problem is either a bug in the graphics card driver or in the MilCore code that calls it. Most likely it is in the graphics card driver, but it is possible that MilCore is passing invalid values that are handled correctly by most graphics cards but not this one. The debugging techniques I describe above will tell you whether this is the case: for example, if MilCore is telling the graphics card to allocate a 1000000x1000000 pixel area and the graphics card is giving correct resolution information, the bug is in MilCore. But if MilCore's requests are reasonable, then the bug is in the graphics card driver.
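To judge whether a request you see in the debugger is "reasonable", a quick back-of-the-envelope calculation is usually enough. The Python sketch below assumes 4 bytes per pixel (a 32-bit format such as BGRA) and compares against the 4 GiB address range of a 32-bit process. Both are assumptions for illustration, and the function names are made up.

```python
ADDRESS_SPACE_32BIT = 2 ** 32  # 4 GiB, the most a 32-bit pointer can address


def surface_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Bytes needed for an uncompressed width x height pixel surface."""
    if width < 0 or height < 0:
        raise ValueError("negative dimensions indicate a bad size value")
    return width * height * bytes_per_pixel


def fits_in_32bit_process(width: int, height: int) -> bool:
    """True if the surface is smaller than the whole 32-bit address range."""
    return surface_bytes(width, height) < ADDRESS_SPACE_32BIT


# The 1000000x1000000 example from the text versus a full-HD surface.
for w, h in [(1_000_000, 1_000_000), (1920, 1080)]:
    print(w, h, surface_bytes(w, h), fits_in_32bit_process(w, h))
```

Under these assumptions the 1000000x1000000 surface needs 4,000,000,000,000 bytes (roughly 3.6 TiB), which no driver could satisfy, while a 1920x1080 surface needs about 8 MB. A request of the first kind points at the caller; a failure on a request of the second kind points at the driver.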