none
Cannot decide on an implementation - File I/O Async or not? RRS feed

  • Question

  • Hi all,

    I'm currently writing a manager class for a queue of objects, I have written two implementations of what I wanted to achieve, a synchronous version and an asynchronous version. I understand that almost any type of I/O should be asynchronous if possible.

    The problem I am having is which path do I take, the problem is simple; open a file and read it line by line parsing the line into data and constructing an object from that data.

    I'm working with a list of proxies, so my idea was that there could be a lot of proxies in the list so I wrote this implementation:

    public static async Task<ProxyManager> LoadFromFileAsync(string filePath)
    {
        var proxyManager = new ProxyManager();
    
        using (var fileStream = File.OpenText(filePath))
        {
            var fileLine = "";
            while ((fileLine = await fileStream.ReadLineAsync()) != null)
            {
                proxyManager.Enqueue(new Proxy(fileLine));
            }
        }
    
        return proxyManager;
    }
    

    Then I had another thought, usually the time taken to read a file and do such a thing is miniscule so I re-wrote it in a synchronous fashion:

    public static ProxyManager LoadFromFile(string filePath)
    {
        var proxyManager = new ProxyManager();
    
        File.ReadAllLines(filePath)
            .ToList()
            .ForEach(fileLine => proxyManager.Enqueue(new Proxy(fileLine)));
    
        return proxyManager;
    }
    

    Both are pretty simple implementations, but now I cannot decide which choice to go for. I was also wondering if anyone could come up with a better solution, I always like to see how things can be done differently!

    Tuesday, November 25, 2014 8:07 PM

Answers

  • If it takes sufficient time such that async processing makes sense then use the async version. You can also call it synchronously if desired.  The problem that I have with your first code is that it really isn't doing reasonable work asynchronously. Reading a single line is not going to be that slow compared to all the work you're doing. I would probably recommend instead that you take your first set of code that is opening the file and enumerating the lines and move that into a method (basically make it synchronous). There really isn't much use in calling ReadLineAsync in this case.  The method can return the ProxyManager object when it is done. Then create an async method that calls your real method and returns a task with the ProxyManager as a result. Clients will then call the async method and can wait for you to read and process the file (on a secondary thread). If the process of reading the line, creating the proxy and adding it to the manager is expensive then you could consider moving that to an async process as well but that is probably overkill for now.

    Michael Taylor
    http://blogs.msmvps.com/p3net

    • Marked as answer by Impulser Tuesday, November 25, 2014 11:21 PM
    Tuesday, November 25, 2014 8:45 PM
    Moderator
  • Hi Impulser,

    What the solution is that you should go for strongly depens on your needs. I guess speed could be an issue. In your position, I would just try both solutions with a realistic set of files and file sizes ans measure the time each one takes.

    The approach you should take depends on the type of the input data. How many files do you have? 1, 5, 10 , 1000? And how many lines in each file? Is it always the same number of lines?

    Another approache you could use is to take the Parallel class:

    public static ProxyManager LoadFromFile(string filePath)
    {
        var proxyManager = new ProxyManager();
    
        string[] lines = File.ReadAllLines(filePath);
        Parallel.ForEach(lines, (line) => proxyManager.Enqueue(new Proxy(line)));
        return proxyManager;
    }

    You might as well use another Parallel.ForEach call to iterate through the files first.

    Rgds MM

    Please remember to "mark as answer" if this post was helpful. Thanks.

    • Marked as answer by Impulser Tuesday, November 25, 2014 11:21 PM
    Tuesday, November 25, 2014 9:01 PM

All replies

  • If it takes sufficient time such that async processing makes sense then use the async version. You can also call it synchronously if desired.  The problem that I have with your first code is that it really isn't doing reasonable work asynchronously. Reading a single line is not going to be that slow compared to all the work you're doing. I would probably recommend instead that you take your first set of code that is opening the file and enumerating the lines and move that into a method (basically make it synchronous). There really isn't much use in calling ReadLineAsync in this case.  The method can return the ProxyManager object when it is done. Then create an async method that calls your real method and returns a task with the ProxyManager as a result. Clients will then call the async method and can wait for you to read and process the file (on a secondary thread). If the process of reading the line, creating the proxy and adding it to the manager is expensive then you could consider moving that to an async process as well but that is probably overkill for now.

    Michael Taylor
    http://blogs.msmvps.com/p3net

    • Marked as answer by Impulser Tuesday, November 25, 2014 11:21 PM
    Tuesday, November 25, 2014 8:45 PM
    Moderator
  • If it takes sufficient time such that async processing makes sense then use the async version. You can also call it synchronously if desired.  The problem that I have with your first code is that it really isn't doing reasonable work asynchronously. Reading a single line is not going to be that slow compared to all the work you're doing. I would probably recommend instead that you take your first set of code that is opening the file and enumerating the lines and move that into a method (basically make it synchronous). There really isn't much use in calling ReadLineAsync in this case.  The method can return the ProxyManager object when it is done. Then create an async method that calls your real method and returns a task with the ProxyManager as a result. Clients will then call the async method and can wait for you to read and process the file (on a secondary thread). If the process of reading the line, creating the proxy and adding it to the manager is expensive then you could consider moving that to an async process as well but that is probably overkill for now.

    Michael Taylor
    http://blogs.msmvps.com/p3net

    So if I understand you correctly, simply wrap the synchronous method like this?

    public static async Task<ProxyManager> LoadFromFileAsync(string filePath)
    {
        return await Task.Run(() => LoadFromFile(filePath));
    }
    

    Tuesday, November 25, 2014 8:58 PM
  • Hi Impulser,

    What the solution is that you should go for strongly depens on your needs. I guess speed could be an issue. In your position, I would just try both solutions with a realistic set of files and file sizes ans measure the time each one takes.

    The approach you should take depends on the type of the input data. How many files do you have? 1, 5, 10 , 1000? And how many lines in each file? Is it always the same number of lines?

    Another approache you could use is to take the Parallel class:

    public static ProxyManager LoadFromFile(string filePath)
    {
        var proxyManager = new ProxyManager();
    
        string[] lines = File.ReadAllLines(filePath);
        Parallel.ForEach(lines, (line) => proxyManager.Enqueue(new Proxy(line)));
        return proxyManager;
    }

    You might as well use another Parallel.ForEach call to iterate through the files first.

    Rgds MM

    Please remember to "mark as answer" if this post was helpful. Thanks.

    • Marked as answer by Impulser Tuesday, November 25, 2014 11:21 PM
    Tuesday, November 25, 2014 9:01 PM
  • My initial approach to this problem was thinking this: "I'm loading a list of proxies from a text file, sure 10,000 proxies wouldn't be that bad in synchronous, but what if I wanted to load a file of 2 million proxies for instance".

    I had already considered the Parallel.ForEach approach but I thought that maybe letting the underlying Task Scheduler do the work of off loading the data across my machine's cores might be better, I imagine the team that implemented the TPL to have run millions of benchmarks and had the Task Scheduler tuned to find the best result.

    Tuesday, November 25, 2014 9:08 PM
  • I wouldn't use your sync version of LoadFromFile as it is inefficient compared to your async version. You have to be careful with Task.Run though as it is considered to be an "incorrect" approach to async.  Refer to this blog post about it. 
    Tuesday, November 25, 2014 9:10 PM
    Moderator
  • IIRC, I have read that blog post which is what made me not choose the "wrapper" implementation in the first place.

    So just a thought about what Manuel posted, it now seems that it'd probably be better to keep a synchronous version of the function but use the Parallel.ForEach approach in the implementation of the function.

    Edit: I think both of your suggestions have just clicked in my brain, this is what I just produced.

    public static Task<ProxyManager> LoadFromFileAsync(string filePath)
    {
        return Task.Run(() => LoadFromFile(filePath));
    }
    
    public static ProxyManager LoadFromFile(string filePath)
    {
        var proxyManager = new ProxyManager();
    
        Parallel.ForEach(File.ReadAllLines(filePath),
                         fileLine => proxyManager.Enqueue(new Proxy(fileLine)));
    
        return proxyManager;
    }
    

    • Edited by Impulser Tuesday, November 25, 2014 9:35 PM
    Tuesday, November 25, 2014 9:16 PM