Performance of Write file by StorageFile with StorageStreamTransaction and CRT _write() API

    Question

  • After doing some performance profiling on StorageFile API and CRT file API,

    (Please refer to another thread I posted: http://social.msdn.microsoft.com/Forums/en-US/winappswithnativecode/thread/46ef04d5-2201-4a58-9a49-8952641a5cdf)

    I found that writing a file through StorageFile with StorageStreamTransaction can also be much slower than the CRT _write() API.

    My test samples are as follows:

    1. use CRT _write() API 

    void RunTestCaseCRT()
    {
        // 1. create file
        DWORD desiredAccess = GENERIC_READ|GENERIC_WRITE;
        DWORD shareMode = FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE;
        DWORD createDisposition = CREATE_ALWAYS;
        HANDLE h32 = NULL;
        int hCrt = 0;
        int flags = 0;
        int nbytes = 0;
        String^ filePath = ApplicationData::Current->LocalFolder->Path + "\\TestCaseCRT.txt";
    
        CREATEFILE2_EXTENDED_PARAMETERS cf2ex = {0};
        cf2ex.dwSize = sizeof(CREATEFILE2_EXTENDED_PARAMETERS);
        cf2ex.dwFileAttributes = FILE_ATTRIBUTE_NORMAL;
        cf2ex.hTemplateFile = NULL;
        h32 = CreateFile2(filePath->Data(), desiredAccess, shareMode, createDisposition, &cf2ex);
        if (h32 == INVALID_HANDLE_VALUE) {
            return;
        }
    
        uint64_t startTime = GetTickCount64();
    
        // 2. open crt file handle
        flags &= _O_APPEND | _O_RDONLY | _O_TEXT;  // only attributes described in http://msdn.microsoft.com/en-us/library/bdts1c9x.aspx
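        hCrt = _open_osfhandle((intptr_t)h32, flags);
        if (hCrt == -1) {
            // the CRT did not take ownership, so release the Win32 handle ourselves
            CloseHandle(h32);
            return;
        }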
    
        // 3. write file by _write() 10000 times
        for(int i=0 ; i<WRITE_FILE_COUNT ; i++) {
            nbytes = _write(hCrt, s_1kb_data, strlen(s_1kb_data));
            if (nbytes == -1) {
                _close(hCrt);
                return;
            }
        }
    
        uint64_t endTime = GetTickCount64();
    
        // 4. close file
        if( _close(hCrt) == -1 ) {
            return;
        }
    
        // (endTime - startTime) is the result
    }

    2. use StorageFile API

    static uint64_t s_case4_start_time = 0;
    static uint64_t s_case4_end_time = 0;
    static int s_case4_loop_count = 0;
    
    char s_1kb_data[1025] = "1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye 1000 btye end.";
    
    void RunTestCase4()
    {
        s_case4_loop_count = 0;
    
        // 1. create file
        create_task(ApplicationData::Current->LocalFolder->CreateFileAsync("TestCase4.txt",CreationCollisionOption::ReplaceExisting))
         .then(
            [this](StorageFile^ file)
            {
                s_case4_start_time = GetTickCount64();
                // 2. start write file loop
                TransactedWriteFileLoop(file);
            }
        ,task_continuation_context::use_arbitrary()).then(
            [this](task<void> preTask)
            {
                try
                {
                    preTask.get();
                }
                catch(Exception^ ex)
                {
                    ;
                }
            }
        ,task_continuation_context::use_arbitrary());
    }
    
    void TransactedWriteFileLoop(Windows::Storage::StorageFile^ file)
    {
        // 3. open StorageStreamTransaction
        create_task(file->OpenTransactedWriteAsync())
         .then(
         [this, file](StorageStreamTransaction^ iostream)
            {
                //4. open DataWriter
                Streams::IOutputStream^ outputStream= iostream->Stream->GetOutputStreamAt(iostream->Stream->Size);
                Streams::DataWriter^ dataWriter  = ref new Streams::DataWriter(outputStream);
    
                //5. write to DataWriter
                auto platformBuffer = ref new Platform::Array<BYTE>((BYTE*)s_1kb_data, strlen(s_1kb_data));
                dataWriter->WriteBytes(platformBuffer);
    
                //6. Store to DataWriter
                create_task(dataWriter->StoreAsync())
                 .then(
                    [this,file,dataWriter,iostream](unsigned int byteWritten) 
                    {
                        //7. Commit StorageStreamTransaction
                        create_task(iostream->CommitAsync())
                        .then(
                            [this,file,dataWriter](task<void> preTask)
                            {
                                try
                                {
                                    preTask.get();
                                    dataWriter->DetachStream();
                                    s_case4_loop_count++;
                                    //8. write until 10000 times
                                    if(s_case4_loop_count < WRITE_FILE_COUNT) {
                                        TransactedWriteFileLoop(file);
                                    }
                                    else {
                                        //9. wrote 10000 times
                                        s_case4_end_time = GetTickCount64();
                                        // (s_case4_end_time - s_case4_start_time) is the result
                                    }
                                }
                                catch(Exception^ ex)
                                {
                                    ;
                                }
                            }
                        ,task_continuation_context::use_arbitrary());
                    }
                ,task_continuation_context::use_arbitrary()).then(
                    [this](task<void> preTask)
                    {
                        try
                        {
                            preTask.get();
                        }
                        catch(Exception^ ex)
                        {
                            ;
                        }
                    }
                ,task_continuation_context::use_arbitrary());
            }
        ,task_continuation_context::use_arbitrary()).then(
            [this](task<void> preTask)
            {
                try
                {
                    preTask.get();
                }
                catch(Exception^ ex)
                {
                    ;
                }
            }
        ,task_continuation_context::use_arbitrary());
    }

    Both test cases write 1 KB of data to a file 10000 times.

    The CRT API takes 40 milliseconds.

    The StorageFile API takes 71187 milliseconds.

    Is this the expected performance of the StorageFile API?

     

     
    Friday, April 12, 2013 3:37 AM

All replies

  • You are comparing the performance of a synchronous for loop to the performance of tens of thousands of task continuations invoked via a recursive function. It's really not going to matter what code you happen to stick in as an example. You are comparing apples to rocket launchers and the fact that you aren't even letting the concurrency runtime be concurrent (since you reuse the same file again and again thus necessitating that each set of 6 tasks and continuations runs entirely to its end before the next set can begin) just makes it worse.

    I have no idea what you are trying to do, but your presumed claim that this in any way represents the performance of the StorageFile API is bogus. This represents the performance of 60000 PPL tasks forcibly run in a synchronous fashion. The performance of the StorageFile API, if it can be gleaned at all from this data (which is doubtful), is 71187/60000 ≈ 1.19 milliseconds per task (equivalently, 60000/71187 ≈ 0.84 tasks per millisecond).
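    Dividing tasks by elapsed time gives a throughput, while dividing elapsed time by tasks gives a per-task cost, and the two are easy to mix up. The following is a minimal sketch of both conversions; the helper names are hypothetical and not part of either test.

```cpp
#include <cstdint>

// Average cost of one unit of work, in milliseconds.
// itemCount must be non-zero.
double MillisecondsPerItem(std::uint64_t elapsedMs, std::uint64_t itemCount)
{
    return static_cast<double>(elapsedMs) / static_cast<double>(itemCount);
}

// Throughput: units of work completed per millisecond.
// elapsedMs must be non-zero.
double ItemsPerMillisecond(std::uint64_t elapsedMs, std::uint64_t itemCount)
{
    return static_cast<double>(itemCount) / static_cast<double>(elapsedMs);
}
```

    With the figures posted in the question, MillisecondsPerItem(71187, 10000) is about 7.12 ms per StorageFile write, and MillisecondsPerItem(40, 10000) is 0.004 ms per CRT write. Taking the reply's count of 60000 tasks, the cost is about 1.19 ms per task.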

    Try reworking your test so that you create 10000 uniquely named files (such that you can then run the PPL tasks concurrently) and see how that performs. Also, with the CRT case, add in an equivalent number of PPL tasks so that you limit the influence of asynchrony on the results. Then you will be comparing apples to apples.
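    One way to read the "equivalent number of tasks" suggestion is sketched below. It is only an illustration under stated assumptions: it uses standard C++ std::async rather than PPL, and WriteChunk is a hypothetical stand-in that appends to an in-memory buffer, so it shows the shape of the comparison rather than real file-system numbers.

```cpp
#include <chrono>
#include <future>
#include <string>

// Hypothetical stand-in for one 1 KB write. It appends to an in-memory
// buffer so the sketch isolates scheduling overhead from disk speed.
void WriteChunk(std::string& sink, const std::string& chunk)
{
    sink.append(chunk);
}

// Baseline: a plain synchronous loop, shaped like the CRT test.
std::chrono::microseconds TimeSyncLoop(std::string& sink,
                                       const std::string& chunk, int count)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i) {
        WriteChunk(sink, chunk);
    }
    return std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start);
}

// Same work, but every write is wrapped in its own task that must finish
// before the next one starts, like a serialized chain of continuations.
// std::launch::async hands each write to another thread;
// std::launch::deferred runs it on the calling thread when get() is called.
std::chrono::microseconds TimeSerializedTasks(std::string& sink,
                                              const std::string& chunk,
                                              int count, std::launch policy)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i) {
        std::async(policy, [&sink, &chunk] { WriteChunk(sink, chunk); }).get();
    }
    return std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start);
}
```

    Calling both functions with the same chunk and count and comparing the returned durations gives a rough idea of how much of the gap comes from task scheduling alone, before any storage API is involved.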

    To the extent that you ever need to run 10000 iterations of something, if you know that the total time for all iterations combined will be under 50 ms, then you can just run them synchronously. If they'll take longer then you must use asynchrony to avoid blocking the UI thread (otherwise your app will fail certification). There are other rules and guidelines as well; see, e.g.: http://msdn.microsoft.com/en-us/library/windows/apps/xaml/Hh780631(v=win.10).aspx .


    Visual C++ MVP | Website | Blog | @mikebmcl | Windows Store DirectX Game Template

    Friday, April 12, 2013 10:21 AM
  • While truckwu's example is somewhat contrived and shows the Async stuff at its worst, it does show that the Async philosophy scales poorly. There are other real world examples of poor scaling I've seen on this forum, such as retrieving large numbers of thumbnails from image repositories. This doesn't even mention that the pseudo-recursive function call to chain multiple Async calls is worse than spaghetti code. I want to vomit every time I see it. For a real world example try to download an unknown number of bytes via WinRT sockets.

    Also, MSFT's own apps (e.g. Solitaire on WinRT) definitely violate the 50ms rule. It's a horribly written app that revs up the fan on my Samsung Series 7 Slate while it's running ... and is still visibly chunky, i.e. violates the 50ms rule.

    Friday, April 12, 2013 2:01 PM
  • After re-reading MikeBMcL's criticism of Truckwu's test case I decided to do a more realistic test of Win32 versus WinRT for directory handling. Here's a link to the zipped source of my test app (created in VS2012 using the "Blank App XAML" template):

    https://skydrive.live.com/redir?resid=A490AEB98CF07F17!251&authkey=!AKxnac5mH74f_iQ

    The app creates 1000 files in the Local storage area at startup. Let's say these represent Photos my daughter took at the Taylor Swift concert (she filled up her camera's memory card!) and I want to get the count of files and their total size. Build the app and run it. You'll see truckwu's fstat equivalent on the top row, allowing you to compare Win32 vs. WinRT pounding on a single file.

    On the bottom row is my more realistic example, where the files are enumerated using Win32 and WinRT and the file sizes are added together. Note that I did *not* implement a thread-safe accumulation of the file sizes in the WinRT case. In this more realistic scenario, WinRT performs WORSE compared to Win32 than in truckwu's contrived fstat case in the top row. Here are the numbers I get on my Samsung Series 7 Slate:

    Win32 Enumeration: 0 to 15 ms (faster than the NT time tick!)

    WinRT Enumeration: around 3760 ms

    In this more realistic case, Win32 is at least 250 times faster than WinRT. WinRT's Async performance is so bad that it actually is a self-reinforcing technique => WinRT Async needs to be executed asynchronously precisely because its asynchronous nature makes it so slow!

    Are there any bugs in my WinRT implementation of the file enumeration? I hope so.
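    The enumeration test above leaves the size accumulation non-thread-safe. A minimal sketch of one way to make it safe is shown here, assuming each completion callback may run on a different thread; SizeAccumulator is a hypothetical helper, not code from the linked test app.

```cpp
#include <atomic>
#include <cstdint>

// Accumulates a file count and a byte total from callbacks that may run
// concurrently on different threads.
class SizeAccumulator
{
public:
    // Safe to call from any thread; each counter is updated atomically.
    void Add(std::uint64_t fileSizeBytes)
    {
        totalBytes_.fetch_add(fileSizeBytes);
        fileCount_.fetch_add(1);
    }

    std::uint64_t TotalBytes() const { return totalBytes_.load(); }
    std::uint64_t FileCount() const { return fileCount_.load(); }

private:
    std::atomic<std::uint64_t> totalBytes_{0};
    std::atomic<std::uint64_t> fileCount_{0};
};
```

    Each callback would call Add() with the size it retrieved, and the totals would be read only after all callbacks have completed. The two counters are updated separately, so a reader that looks while callbacks are still running may see a count and a total that do not match.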

    Friday, May 3, 2013 1:56 PM
  • Yes there are bugs. For one, your purported example is a bunch of pictures in the app's local storage. How did they get there? And why would the user want them there rather than in their Pictures library?

    Of course, you can't use the Pictures library as an example since you can't access it with the Win32 API (see: http://social.msdn.microsoft.com/Forums/en-US/winappswithnativecode/thread/06c71593-ca93-42e4-8e8b-d6cae31c64e2 ). Sandboxing prevents you from accessing anything you don't have direct, full permissions to (i.e. anything other than the app's app data directory and its install directory).

    Further, your test is on a device with an SSD. How does it perform with a mechanical hdd? How about on a network folder?

    Yes, async is going to be slower than direct access. Why is that a surprise? The design goal for the WinRT Storage API was to create cancellable operations that do not block the UI thread and which enable inter-app sharing. While I'm sure they made great efforts to make async storage as fast as possible, speed would always take second place to one of the primary design goals.

    Even if you took your Win32 code and threw it on a background thread, it still wouldn't be cancellable (so if it started requesting something from a network drive on a heavily congested network, it could block for a long time) and it still wouldn't be able to take in files shared from another app without copying them locally using the WinRT storage API.

    If you're going to be doing file operations that are solely on items in your app's app data directory (local and roaming storage) or its install directory, then go ahead and use the Win32 APIs since they should always be fast enough to tackle whatever you are throwing at them without any appreciable risk of noticeable UI blocking. Beyond that, though, you don't get a choice. Considering that a user's pictures library might contain a folder on an NAS or some other network drive, that's a good thing since your app is blocked from doing the sorts of operations that can freeze the UI and make users crazy.


    Visual C++ MVP | Website | Blog | @mikebmcl | Windows Store DirectX Game Template

    Saturday, May 4, 2013 2:20 PM
  • You are defending the indefensible here.

    1. Why can't the Win32 file and directory functions access anything outside those two locations? If sandboxing is so important it should be in the kernel, not a user-mode "file broker" sitting on top of Win32.

    2. How about a HDD or other media? I don't know. I do know that Win32 will always beat Async.

    3. About cancellation. Why go through all the WinRT Async business if cancellation is important? Simply add a CancelHandle() or some other Win32 function to abort a file op.

    4. Of course I would put a potentially long-running op on a separate thread. That doesn't require polluting an entire API with Async functionality. Babysitting that thread via the UI requires the same amount of care as firing off Async ops. So you gain nothing with the Async philosophy. All pain and no gain.
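    Points 3 and 4 can be illustrated with a cooperative cancellation flag in standard C++. This is a sketch only: RunCancellable and processChunk are hypothetical names, and the flag is checked between chunks, so a call that is already blocked (for example on a slow network share) is not interrupted by it.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>

// Processes up to chunkCount chunks, checking a shared flag before each one.
// Returns the number of chunks actually processed.
// processChunk is a hypothetical callback standing in for one blocking
// read or write; it receives the zero-based chunk index.
std::size_t RunCancellable(std::size_t chunkCount,
                           const std::atomic<bool>& cancelRequested,
                           const std::function<void(std::size_t)>& processChunk)
{
    std::size_t done = 0;
    for (; done < chunkCount; ++done) {
        if (cancelRequested.load()) {
            break;  // stop before starting the next chunk
        }
        processChunk(done);
    }
    return done;
}
```

    In an app, this function would run on a worker thread while the UI thread sets the flag when the user cancels. How quickly the worker reacts depends on how long a single chunk takes.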

    Via private communication from MSFT, the current suspect for the dismal performance we're seeing is the FileBroker attached to the Async APIs. Hopefully, they'll investigate what's going on and we'll get a definitive answer. I don't think the FB is entirely responsible for the horrendous performance. I think it is mainly due to the threading overhead, which places blame on the Async philosophy. Why? Because truckwu's file writing test showed similar poor performance (which *should* take the FB out of the equation after the initial file creation).

    Saturday, May 4, 2013 3:04 PM