Reading a big Text File

  • Question

  • User265788195 posted

This text file is very large (right now 99,906 lines). I need to read each line and do something with the information. I am using this code:

    MemoryStream ms = DownloadFile(sftp, "MyTextFile.txt");
    if (ms != null)
    {
        ms.Position = 0;
        List<string> rows = new List<string>();
        using (var reader = new StreamReader(ms, Encoding.ASCII))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Keep only lines whose first two characters are in MyList()
                if (MyList().Contains(line.Substring(0, 2)))
                    rows.Add(line);
            }
        }

        foreach (string row in rows)
        {
            // Logic that needs to happen on each row
        }
    }
    else
    {
        // Console.ReadLine();
    }

It's taking forever to do this. Is there a better way?

    Thanks,

    Friday, May 20, 2016 9:48 PM

All replies

  • User1231829591 posted

Hi, try this link: http://cc.davelozinski.com/c-sharp/fastest-way-to-read-text-files . The article covers the fastest ways to read text files in C#.

    Friday, May 20, 2016 10:14 PM
  • User303363814 posted

It's taking forever to do this

    Could you add a tiny bit of detail to this?

    Even reducing to 1% of forever is still forever.

    Saturday, May 21, 2016 12:02 AM
  • User1231829591 posted

The reason your code is sluggish is that your RAM is taking a hit. You need to create a custom class to read your file. Follow this link: https://siddheshshelke.wordpress.com/2011/10/23/read-large-text-files-using-c/ . It deals with a situation exactly like yours.

    Saturday, May 21, 2016 12:53 AM
  • User303363814 posted

    because your RAM is taking a hit
    How do you know this?

100,000 lines with, say, 200 bytes in each line = 20 megabytes. Do you think the code is running on a computer with 30 megabytes of RAM? My little laptop has 8,000 megabytes. That file would consume 0.25% of the RAM (that is, approximately nothing). Why are you sure that this would be a problem?

The key factors to me seem to be:

a) No measurements are being made. We have no idea which line or lines are slow.

b) We have no idea what the expectations are or how realistic they are.

c) DownloadFile() is a total black box. We do not know if it is 'slow' (or even what slow means), highly inefficient, or lightning fast. What if DownloadFile is reading information over a dial-up modem and performs a handshake for each line? What if DownloadFile grabs data off an SSD drive? What if DownloadFile grabs data from a very busy website? What if DownloadFile is trying to use an internet connection that is being throttled due to excess usage? We know absolutely nothing about DownloadFile. It may be a problem, it may not be; without measurements we cannot know. Does DownloadFile actually download a file from somewhere or is it just opening it? We do not know.
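
For what it is worth, a Stopwatch around each phase would answer (a) and (c) immediately. A minimal sketch, assuming the poster's DownloadFile/MyList methods and the sftp variable from the original snippet (and System.Diagnostics for Stopwatch):

    var sw = Stopwatch.StartNew();
    MemoryStream ms = DownloadFile(sftp, "MyTextFile.txt");      // phase 1: the download itself
    Console.WriteLine($"DownloadFile: {sw.ElapsedMilliseconds} ms");

    sw.Restart();
    var prefixes = MyList();                                     // call it once, outside the loop
    var rows = new List<string>();
    ms.Position = 0;
    using (var reader = new StreamReader(ms, Encoding.ASCII))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            if (prefixes.Contains(line.Substring(0, 2)))
                rows.Add(line);
    }
    Console.WriteLine($"Read + filter: {sw.ElapsedMilliseconds} ms");

    sw.Restart();
    foreach (string row in rows)
    {
        // the per-row logic goes here
    }
    Console.WriteLine($"Per-row processing: {sw.ElapsedMilliseconds} ms");

Whichever number dwarfs the others is where the time goes; the rest can be ignored.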

    Saturday, May 21, 2016 1:52 AM
  • User1231829591 posted

    Point taken, but from my own and many others' experiences the problem described by the poster has all the symptoms of the processor/RAM getting hammered. 

    Saturday, May 21, 2016 10:47 AM
  • User265788195 posted
     string path = @"C:\MyFolder\MyNewFile.txt";
     List<string> rows = new List<string>();

     using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read))
     using (BufferedStream bs = new BufferedStream(fs))
     using (StreamReader sr = new StreamReader(bs))
     {
         string line;
         while ((line = sr.ReadLine()) != null)
         {
             if (MyList().Contains(line.Substring(0, 2)))
                 rows.Add(line);
         }
     }

     // Limit the test to the first 100 matching rows
     List<string> newList = new List<string>();
     newList.AddRange(rows.Take(100));

     Console.WriteLine(DateTime.Now);
     foreach (string row in newList)
     {
         // DoSomething
     }

I changed the code to use BufferedStream and still got the same result.

As I said, it was taking too long; I waited 10 minutes and then tried the logic with only the first 100 rows. For 100 rows it's taking 43 seconds. For 1,000 it took 7 minutes.

Should I convert my text file into XML? Will that make any difference? My whole txt file is around 60 MB.

    Monday, May 23, 2016 7:43 PM
  • User303363814 posted

I ran your code and it took 0.029 seconds with a 5,000-line file.

What does the method MyList() do? Does it call a database or some other slow operation?

In my code I made it a simple list variable instead of a method call.

A simplified version of your code could be:

    var prefixes = new List<string> { "01", "03", "05" };
    var rows = File.ReadLines(path)
                   .Where(l => prefixes.Contains(l.Substring(0, 2)));

It runs in 0.011 seconds with a 5,000-line input file.

    Monday, May 23, 2016 8:48 PM
  • User265788195 posted

I uncommented each line and tried again. The time-consuming thing is the foreach loop. I have some web service calls, so it looks like this is a different issue now: the web service calls are taking longer. Basically, I call a web service to give me the details about each line, then, depending on some conditions, I update the information again using another web service call.

So for 90,000 rows I am making 3 different web service calls per row. Is there a better way?

    Monday, May 23, 2016 9:44 PM
  • User303363814 posted

    Do you really need to make the web service call for each line?  Can you read what you need to know first, once, store it in some variables and then just use the local information?
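
Something like this, as a rough sketch (myService and GetAllDetails are made-up placeholders for whatever your web service actually exposes):

    // Hypothetical: one bulk call up front instead of one call per line.
    Dictionary<string, string> details = myService.GetAllDetails();

    foreach (string row in rows)
    {
        string key = row.Substring(0, 2);       // or whatever identifies the row remotely

        // Only in-memory lookups inside the 90,000-iteration loop.
        if (details.TryGetValue(key, out string detail))
        {
            // use 'detail' locally instead of calling the service again
        }
    }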

    Monday, May 23, 2016 10:57 PM
  • User265788195 posted

    Hmm, how can I do that for so many rows? I can't think of how I can achieve that.

    Monday, May 23, 2016 11:09 PM
  • User303363814 posted

    Look at the numbers.

    Reading a line from a file takes about 2 microseconds.  See my code.

    The web service calls that you make for each line take 430,000 microseconds. (You said that a 100 line file takes 43 seconds)

So, what does your program do? It does web service calls. That's all. The file-reading part is 0.0004% of the work; that's zero to humans. Forget it.

What is a web service call? It is shouting a question at someone on the other side of the planet and waiting for them to notice your question, then get around to answering you, and then shout the answer back at you. If you really need to yell hundreds of thousands of questions at some remote server then it is never, ever going to be fast. Your design is screwed.

You need to go back to the drawing board and re-architect the whole thing.

    Maybe you need to upload your file to a slightly different web service and ask it to process the whole file for you in one transaction which can happen locally (no remote calls) and then send the results back to you.

Maybe you should not wait for a file to be created but instead get each line of information as it is generated and do the web service calls immediately. So, while the other system is working out how to create the next line in the file, you can process the previous line. If your web service processing takes less time than the generating system needs to create the next line, then your processing will be complete half a second after you get the last line.

    Maybe you should not process every line in the input file.  Is someone really going to look at all 100,000 lines of output?  Can you filter the input file?

    Is the file similar to the previous file?  Maybe you just need to work out the 'delta' and process that.

    Set expectations. Maybe you should start processing after everyone goes home and give them the results in the morning.  Does it matter if it takes 10 seconds or 10 hours if no one is waiting?

Look carefully at the business need. Do you really, really, really need to do all those calls for every single line? Is each line really that different? Maybe the whole thing needs to be interactive. Let the user pick the line they are interested in and only do the web service call for that line when they ask for it. (They will not do that 100,000 times.)

Does the information that the web service provides change very often? Can you download all the remote information once a day/week/month/hour in the background and use the local copy to do your processing?

    So many possibilities.
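
As one concrete illustration, the 'delta' idea is only a few lines. A rough sketch, assuming yesterday's and today's files (names made up here) are plain text with one record per line:

    // Lines that were not in the previous file are the only ones that need the web service calls.
    var previous = new HashSet<string>(File.ReadLines("MyTextFile_yesterday.txt"));

    foreach (string line in File.ReadLines("MyTextFile_today.txt"))
    {
        if (previous.Contains(line))
            continue;             // already processed last time

        // web service calls only for new or changed lines
    }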

    Tuesday, May 24, 2016 12:43 AM
  • User265788195 posted

Thank you so much for your reply. Yes, I felt that the design needed to be re-architected. I was handed the design, and as I started developing and understanding it I am finding these things.

BTW, it seems this can be left alone after hours to run for however long it takes. Still, is that a good way? I am not sure. I need to think of other solutions.

    Tuesday, May 24, 2016 4:25 PM
  • User753101303 posted

    Hi,

Do you retrieve some unique information for each row, or do you sometimes just get the same info again? If the latter, an easy change would be to keep the info handy rather than fetching it again. For example, if you process a log file and look up the country for an IP address, you only need to do that once for each distinct IP rather than once for each line in the log file. I am not sure which kind of information you are processing.
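
A rough sketch of that idea, using the IP-to-country example (ExtractIp, LookupCountry and logPath are made-up placeholders for the parsing, the remote call, and the file path):

    // Cache of answers already fetched, keyed by the distinct value (here: the IP address).
    var countryByIp = new Dictionary<string, string>();

    foreach (string line in File.ReadLines(logPath))
    {
        string ip = ExtractIp(line);

        if (!countryByIp.TryGetValue(ip, out string country))
        {
            country = LookupCountry(ip);      // the expensive remote call, once per distinct IP
            countryByIp[ip] = country;
        }

        // use 'country' for this line
    }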

    Tuesday, May 24, 2016 5:33 PM
  • User265788195 posted

From each row I get a case number and a case status. I call the web service to get the case details (an XML document), and if the case status is different then I update it.
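
Roughly, the per-row flow looks like this (CaseServiceClient, GetCaseDetails, UpdateCaseStatus and the parsing helpers are made-up names standing in for the real web service API):

    var client = new CaseServiceClient();

    foreach (string row in rows)
    {
        string caseNumber = GetCaseNumber(row);    // parsing helpers assumed
        string fileStatus = GetCaseStatus(row);

        var details = client.GetCaseDetails(caseNumber);       // call 1: fetch the case XML

        if (details.Status != fileStatus)
            client.UpdateCaseStatus(caseNumber, fileStatus);    // call 2: only when the status changed
    }

If the same case number can appear on more than one row, the dictionary caching sketched above avoids fetching its details twice.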

    Tuesday, May 24, 2016 6:52 PM