locked
How to pull values from items in JSON without indexing? RRS feed

  • Question

  • User-1306520653 posted

    I have the following code that extracts json elements values and outputs to csv:

    public static void Json_to_Csv(string jsonInputFile, string csvFile)
    {
        using (var p = new ChoJSONReader(jsonInputFile).WithJSONPath("$..readResults")) // "readResults": [
        {
            using (var w = new ChoCSVWriter(csvFile).WithFirstLineHeader())
            {
                w.Write(p
                    .Select(r1 =>
                    {
                        var lines = (dynamic[])r1.lines;
                        return new
                        {
                            FileName = jsonInputFile,
                            Page = r1.page,
                            PracticeName = lines[2].text,
                            OwnerFullName = lines[4].text,
                            OwnerEmail = lines[6].text,
                        };
                    }
            }
        }
    }
    

    csv output:

    File Name,Page,Practice Name,Owner Full Name,Owner Email
    file1.json,1,Some Practice Name,Bob Lee,Bob@gmail.com
    

    Currently there is no other contextual information on each item to reference them so the only way is by indexing, e.g. lines[2]

    This works for now but I may have other JSON files that have an extra field, therefore the values pulled will be wrong.

    In order to address this scenario, how can i pull the values contextually instead of indexing the lines?

    Ive tried PracticeName = lines["Practice Name"].text,, but i get Cannot implicitly convert type string to int error

    file1.json sample:

    {
      "status": "succeeded",
      "createdDateTime": "2020-10-22T19:35:35Z",
      "lastUpdatedDateTime": "2020-10-22T19:35:36Z",
      "analyzeResult": {
        "version": "3.0.0",
        "readResults": [
          {
            "page": 1,
            "angle": 0,
            "width": 8.5,
            "height": 11,
            "unit": "inch",
            "lines": [          
              {
                "boundingBox": [
                  0.5016,
                  1.9141,
                  2.5726,
                  1.9141,
                  2.5726,
                  2.0741,
                  0.5016,
                  2.0741
                ],           
               "text": "Account Information",
                "words": [
                  {
                    "boundingBox": [
                      0.5016,
                      1.9345,
                      1.3399,
                      1.9345,
                      1.3399,
                      2.0741,
                      0.5016,
                      2.0741
                    ],
                    "text": "Account",
                    "confidence": 1
                  },
                  {
                    "boundingBox": [
                      1.3974,
                      1.9141,
                      2.5726,
                      1.9141,
                      2.5726,
                      2.0741,
                      1.3974,
                      2.0741
                    ],
                    "text": "Information",
                    "confidence": 1
                  }
                ]
              },
              {
                "boundingBox": [
                  1.7716,
                  2.4855,
                  2.8793,
                  2.4855,
                  2.8793,
                  2.6051,
                  1.7716,
                  2.6051
                ],
                "text": "Practice Name",
                "words": [
                  {
                    "boundingBox": [
                      1.7716,
                      2.4855,
                      2.3803,
                      2.4855,
                      2.3803,
                      2.6051,
                      1.7716,
                      2.6051
                    ],
                    "text": "Practice",
                    "confidence": 1
                  },
                  {
                    "boundingBox": [
                      2.4362,
                      2.4948,
                      2.8793,
                      2.4948,
                      2.8793,
                      2.6051,
                      2.4362,
                      2.6051
                    ],
                    "text": "Name",
                    "confidence": 1
                  }
                ]
              },
              {
                "boundingBox": [
                  2.9993,
                  2.5257,
                  4.7148,
                  2.5257,
                  4.7148,
                  2.714,
                  2.9993,
                  2.714
                ],
                "text": "Some Practice Name",
                "words": [
                  {
                    "boundingBox": [
                      3.0072,
                      2.5385,
                      3.6546,
                      2.5284,
                      3.6516,
                      2.7131,
                      3.0105,
                      2.712
                    ],
                    "text": "Some",
                    "confidence": 0.984
                  },
                  {
                    "boundingBox": [
                      3.6887,
                      2.5281,
                      4.2112,
                      2.5262,
                      4.2028,
                      2.7159,
                      3.6854,
                      2.7132
                    ],
                    "text": "Parctice",
                    "confidence": 0.986
                  },
                  {
                    "boundingBox": [
                      4.2453,
                      2.5263,
                      4.7223,
                      2.5297,
                      4.7091,
                      2.72,
                      4.2366,
                      2.7161
                    ],
                    "text": "Name",
                    "confidence": 0.986
                  }
                ]
              },
              {
                "boundingBox": [
                  1.6116,
                  2.9999,
                  2.8816,
                  2.9999,
                  2.8816,
                  3.1158,
                  1.6116,
                  3.1158
                ],
                "text": "Owner Full Name",
                "words": [
                  {
                    "boundingBox": [
                      1.6116,
                      3.0039,
                      2.1026,
                      3.0039,
                      2.1026,
                      3.1157,
                      1.6116,
                      3.1157
                    ],
                    "text": "Owner",
                    "confidence": 1
                  },
                  {
                    "boundingBox": [
                      2.1541,
                      2.9999,
                      2.3784,
                      2.9999,
                      2.3784,
                      3.1158,
                      2.1541,
                      3.1158
                    ],
                    "text": "Full",
                    "confidence": 1
                  },
                  {
                    "boundingBox": [
                      2.4384,
                      3.0052,
                      2.8816,
                      3.0052,
                      2.8816,
                      3.1155,
                      2.4384,
                      3.1155
                    ],
                    "text": "Name",
                    "confidence": 1
                  }
                ]
              },
              {
                "boundingBox": [
                  2.9993,
                  3.0242,
                  3.6966,
                  3.0242,
                  3.6966,
                  3.2125,
                  2.9993,
                  3.2014
                ],
                "text": "Bob Lee",
                "words": [
                  {
                    "boundingBox": [
                      3.0063,
                      3.0303,
                      3.3439,
                      3.0349,
                      3.3461,
                      3.2125,
                      3.007,
                      3.2081
                    ],
                    "text": "Bob",
                    "confidence": 0.987
                  },
                  {
                    "boundingBox": [
                      3.3788,
                      3.0349,
                      3.6931,
                      3.0326,
                      3.697,
                      3.2121,
                      3.3813,
                      3.2125
                    ],
                    "text": "Lee",
                    "confidence": 0.983
                  }
                ]
              },
              {
                "boundingBox": [
                  1.945,
                  3.5063,
                  2.8748,
                  3.5063,
                  2.8748,
                  3.6261,
                  1.945,
                  3.6261
                ],
                "text": "Owner Email",
                "words": [
                  {
                    "boundingBox": [
                      1.945,
                      3.5143,
                      2.4359,
                      3.5143,
                      2.4359,
                      3.6261,
                      1.945,
                      3.6261
                    ],
                    "text": "Owner",
                    "confidence": 1
                  },
                  {
                    "boundingBox": [
                      2.4874,
                      3.5063,
                      2.8748,
                      3.5063,
                      2.8748,
                      3.6259,
                      2.4874,
                      3.6259
                    ],
                    "text": "Email",
                    "confidence": 1
                  }
                ]
              },
              {
                "boundingBox": [
                  3.0104,
                  3.5005,
                  4.6042,
                  3.5005,
                  4.6042,
                  3.6888,
                  3.0104,
                  3.6777
                ],
                "text": "bob@gmail.com",
                "words": [
                  {
                    "boundingBox": [
                      3.0212,
                      3.5047,
                      4.5837,
                      3.5039,
                      4.5769,
                      3.6886,
                      3.0129,
                      3.6787
                    ],
                    "text": "bob@gmail.com",
                    "confidence": 0.951
                  }
                ]
              }
            ]
          }
        ]
      }
    }
    
    Friday, October 23, 2020 11:50 PM

Answers

  • User303363814 posted

    My eyeball scan of your JSON seems to show that the .text of odd numbered lines is the name of a field and the .text of even numbered lines is the value of that field.  For example

    lines[3].text is "Owner Full Name" and lines[3+1] is "Bob Lee"

    I don't know the JSON package you are using but it would seem that you could create a small utility method to retrieve the value corresponding to a required field with something like

    string fieldValue(IEnumerable<???> lines, string nameOfField) {
    var skipped = lines.SkipWhile(l => l.text != nameOfField;
    return skipped.Skip(1).First().text;
    }

    The skipped variable would be the 'lines' input with everything prior to the field of interest removed.  You then just skip over the field name line and return the .text property of the next line.  You obviously need some serious error handling code but this little method would allow you to do things like

    OwnerEmail = fieldValue(lines, "Owner Email");

    Of course, the literal "Owner Email" could be taken from the csv file rather than being a literal.

    Because I don't know your JSON reading package I can't tell you what type ??? is but I would think that you know that - its just the type of a single element of the lines array.

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Tuesday, October 27, 2020 8:26 AM

All replies

  • User475983607 posted

    I think you have to realize that a file with 3 columns is a different type than a file with 4 columns.  Designs based on dynamically generated columns generally points to a design flaw.  If the columns can be added over time then the programming problem is handling versions. 

    Saturday, October 24, 2020 12:02 AM
  • User-1306520653 posted

    I think you have to realize that a file with 3 columns is a different type than a file with 4 columns.  Designs based on dynamically generated columns generally points to a design flaw.  If the columns can be added over time then the programming problem is handling versions. 

    You're right, but honestly, this is the best/only way I have currently that achieves something for the short-term. I do know that this is not feasible for the long-term, as files could change, which is why I'm posting this thread so I can perhaps receive advice on a better design that what I have currently. 

    Saturday, October 24, 2020 12:09 AM
  • User475983607 posted

    You're right, but honestly, this is the best/only way I have currently that achieves something for the short-term. I do know that this is not feasible for the long-term, as files could change, which is why I'm posting this thread so I can perhaps receive advice on a better design that what I have currently. 

    It is up to you to design the code so you can keep track of CSV versions.  Perhaps put the version in the file name. 

    Saturday, October 24, 2020 10:45 AM
  • User303363814 posted

    My eyeball scan of your JSON seems to show that the .text of odd numbered lines is the name of a field and the .text of even numbered lines is the value of that field.  For example

    lines[3].text is "Owner Full Name" and lines[3+1] is "Bob Lee"

    I don't know the JSON package you are using but it would seem that you could create a small utility method to retrieve the value corresponding to a required field with something like

    string fieldValue(IEnumerable<???> lines, string nameOfField) {
    var skipped = lines.SkipWhile(l => l.text != nameOfField;
    return skipped.Skip(1).First().text;
    }

    The skipped variable would be the 'lines' input with everything prior to the field of interest removed.  You then just skip over the field name line and return the .text property of the next line.  You obviously need some serious error handling code but this little method would allow you to do things like

    OwnerEmail = fieldValue(lines, "Owner Email");

    Of course, the literal "Owner Email" could be taken from the csv file rather than being a literal.

    Because I don't know your JSON reading package I can't tell you what type ??? is but I would think that you know that - its just the type of a single element of the lines array.

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Tuesday, October 27, 2020 8:26 AM
  • User-1306520653 posted

    My eyeball scan of your JSON seems to show that the .text of odd numbered lines is the name of a field and the .text of even numbered lines is the value of that field.  For example

    lines[3].text is "Owner Full Name" and lines[3+1] is "Bob Lee"

    I don't know the JSON package you are using but it would seem that you could create a small utility method to retrieve the value corresponding to a required field with something like

    string fieldValue(IEnumerable<???> lines, string nameOfField) {
    var skipped = lines.SkipWhile(l => l.text != nameOfField;
    return skipped.Skip(1).First().text;
    }

    The skipped variable would be the 'lines' input with everything prior to the field of interest removed.  You then just skip over the field name line and return the .text property of the next line.  You obviously need some serious error handling code but this little method would allow you to do things like

    OwnerEmail = fieldValue(lines, "Owner Email");

    Of course, the literal "Owner Email" could be taken from the csv file rather than being a literal.

    Because I don't know your JSON reading package I can't tell you what type ??? is but I would think that you know that - its just the type of a single element of the lines array.

    Hi Paul

    Thanks for your response! I am actually not sure what the type is, Ive thought at first easy enough, its string since thats what im converting the read results to afterall but nope, text is not recognized. So i thought it must be JToken type and that is not the case either. Ive tried byte as well as byte[] but that is not correct either. 

    This json is generated from the Azure computer vision REST API, so what would the type be?

    Here is the code used to generate the JSON:

            /// <summary>
            /// Gets the text from the specified image file by using
            /// the Computer Vision REST API.
            /// </summary>
            /// <param name="imageFilePath">The image file with text.</param>
            static async Task ReadText(string imageFilePath)
            {
                try
                {
                    HttpClient client = new HttpClient();
    
                    // Request headers.
                    client.DefaultRequestHeaders.Add(
                        "Ocp-Apim-Subscription-Key", subscriptionKey);
    
                    string url = uriBase;
    
                    HttpResponseMessage response;
    
                    // Two REST API methods are required to extract text.
                    // One method to submit the image for processing, the other method
                    // to retrieve the text found in the image.
    
                    // operationLocation stores the URI of the second REST API method,
                    // returned by the first REST API method.
                    string operationLocation;
    
                    // Reads the contents of the specified local image
                    // into a byte array.
                    byte[] byteData = GetImageAsByteArray(imageFilePath);
    
                    // Adds the byte array as an octet stream to the request body.
                    using (ByteArrayContent content = new ByteArrayContent(byteData))
                    {
                        // This example uses the "application/octet-stream" content type.
                        // The other content types you can use are "application/json"
                        // and "multipart/form-data".
                        content.Headers.ContentType =
                            new MediaTypeHeaderValue("application/octet-stream");
    
                        // The first REST API method, Batch Read, starts
                        // the async process to analyze the written text in the image.
                        response = await client.PostAsync(url, content);
                    }
    
                    // The response header for the Batch Read method contains the URI
                    // of the second method, Read Operation Result, which
                    // returns the results of the process in the response body.
                    // The Batch Read operation does not return anything in the response body.
                    if (response.IsSuccessStatusCode)
                        operationLocation =
                            response.Headers.GetValues("Operation-Location").FirstOrDefault();
                    else
                    {
                        // Display the JSON error data.
                        string errorString = await response.Content.ReadAsStringAsync();
                        Console.WriteLine("\n\nResponse:\n{0}\n",
                            JToken.Parse(errorString).ToString());
                        return;
                    }
    
                    // If the first REST API method completes successfully, the second 
                    // REST API method retrieves the text written in the image.
                    //
                    // Note: The response may not be immediately available. Text
                    // recognition is an asynchronous operation that can take a variable
                    // amount of time depending on the length of the text.
                    // You may need to wait or retry this operation.
                    //
                    // This example checks once per second for ten seconds.
                    string contentString;
                    int i = 0;
                    do
                    {
                        System.Threading.Thread.Sleep(1000);
                        response = await client.GetAsync(operationLocation);
                        contentString = await response.Content.ReadAsStringAsync();
                        ++i;
                    }
                    while (i < 60 && contentString.IndexOf("\"status\":\"succeeded\"") == -1);
    
                    if (i == 60 && contentString.IndexOf("\"status\":\"succeeded\"") == -1)
                    {
                        Console.WriteLine("\nTimeout error.\n");
                        return;
                    }
    
                    // Display the JSON response.
                    Console.WriteLine("\nResponse:\n\n{0}\n",
                        JToken.Parse(contentString).ToString());
                }
                catch (Exception e)
                {
                    Console.WriteLine("\n" + e.Message);
                }
            }
    
            /// <summary>
            /// Returns the contents of the specified file as a byte array.
            /// </summary>
            /// <param name="imageFilePath">The image file to read.</param>
            /// <returns>The byte array of the image data.</returns>
            static byte[] GetImageAsByteArray(string imageFilePath)
            {
                // Open a read-only file stream for the specified file.
                using (FileStream fileStream =
                    new FileStream(imageFilePath, FileMode.Open, FileAccess.Read))
                {
                    // Read the file's contents into a byte array.
                    BinaryReader binaryReader = new BinaryReader(fileStream);
                    return binaryReader.ReadBytes((int)fileStream.Length);
                }
            }
        }

    Wednesday, October 28, 2020 3:29 AM
  • User303363814 posted

    The debugger or the docs will give you the types.

    Wednesday, October 28, 2020 7:45 AM
  • User-1306520653 posted

    The debugger or the docs will give you the types.

    According to the Docs, Its Newtonsoft.Json package, and i tried JToken but the "JToken does not contain a definition for "text" and no accessible extension method 'text' accepting a first argument of type 'JToken' could be found"

    According to debugger when i input the file, its type string, but i am getting same error for string: "string does not contain a definition for "text" and no accessible extension method 'text' accepting a first argument of type 'string' could be found"

    Wednesday, October 28, 2020 6:57 PM
  • User303363814 posted

    The code you showed in the very first post was referencing the .text property of one element of the lines array.  Therefore, you have the type at your fingertips. The debugger will tell you the type of the element of 'lines', or you can just hover with your mouse.  

    Thursday, October 29, 2020 12:36 AM
  • User-1306520653 posted

    The code you showed in the very first post was referencing the .text property of one element of the lines array.  Therefore, you have the type at your fingertips. The debugger will tell you the type of the element of 'lines', or you can just hover with your mouse.  

    it was dynamic!!! OMG thats the type LOL

    but dude, i just tested it inside the choJSON code i had and it worked!! it FREAKING worked!!

    I cant thank you enough!!! a lot of folks said this is impossible, and someone said I had to redesign it from scratch, in whcih he recommended i start with clustering the JSON into buckets and then applying some AI transformations on it then use the strategy pattern etc...

    someone else whose MS certified said after he tried coding it he realized its impossible to do because the JSON is not in the format of a regular JSON property/attribute relationship that you'd find. 

    I almost gave up! How does your simple function work so well!! AMAZING!

    Friday, October 30, 2020 2:13 AM