How to find an address within a string in C#

  • Question

  • Hey Everyone,

    I am working on a project in which I need to remove postal addresses from a response field, because the project is not allowed to give out PII to anyone who exports data from the system. Could anyone suggest a good way to do this? I looked around briefly and couldn't find anything online. I know I could possibly use regex, but I don't know how to create or format a pattern string. If anyone could help me out, that would be great.

    An example string would be: 5-6 FT WEEDS ALL OVER AT 123 SCHOMER RD (Address changed for PII reasons)

    Another example:

    CALLER STATES NEIGHBOR HAS A BUNCH OF JUNK, TRASH - CANS, BARRELS, BOARDS, CHAIRS, LOGS ALL TYPE OF DEBRIS IN BACKYARD, IT'S HORRIBLE AND NEEDS REMOVING.  CALLER STATES YOU CAN COME ONTO THEIR PROPERTY AT 1234 SUPERIOR ST TO VIEW NEIGHBOR'S BACKYARD.

    This is public government 311 data, which is why I have to remove addresses.

    Thanks in advance!

    Nate



    • Edited by NHastings25 Monday, August 5, 2019 2:04 PM Updating Examples
    Monday, August 5, 2019 1:53 PM

Answers

  • There is no easy solution. One big complication is spelling. If a road is misspelled then it might not be caught. Another thing is abbreviations such as rd and st. If anyone uses an inconsistent abbreviation then that might be missed.

    Someone needs to spend time analyzing the data. The amount of time spent will likely affect the quality.

    One thing that is likely to help is to have a list (database) of street names for Aurora. The conversion could search for street names. However that probably cannot be done efficiently using regular expressions.

    I am not a fan of regular expressions, so I might be too quick to say they won't help here. I think you need a relatively simple syntactic analyzer to find words. In this context, words are generalized to include numbers such as 1234. Words are separated by whitespace and punctuation. Each non-numeric word could be looked up against a list of relevant words such as abbreviations and street names. If the list of street names has words such as street and road spelled out without abbreviations, then the program needs to expand abbreviations. When a street name is found, the preceding word could be checked for a street number. And so forth and so on. Data written by people for people can be difficult to parse. Street numbers might not be all numbers, such as 1234A.
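    The word-by-word approach described above might be sketched in C# roughly as follows. This is only an illustration under stated assumptions: the class name StreetNameScanner, the two-entry street list, the abbreviation table, and the "[ADDRESS REDACTED]" marker are all hypothetical, and the sketch only handles street names made of one name word plus a suffix. A real street list would come from the city's own data.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class StreetNameScanner
{
    // Hypothetical sample data, spelled out without abbreviations.
    private static readonly HashSet<string> StreetNames =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "SCHOMER ROAD", "SUPERIOR STREET"
        };

    // Hypothetical abbreviation table used to expand suffixes before lookup.
    private static readonly Dictionary<string, string> Abbreviations =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["RD"] = "ROAD", ["ST"] = "STREET", ["AVE"] = "AVENUE", ["DR"] = "DRIVE"
        };

    // A "word" is a run of letters, digits, or apostrophes;
    // whitespace and other punctuation separate words.
    private static readonly Regex Word = new Regex(@"[A-Za-z0-9']+");

    // A street number is digits with an optional trailing letter (1234 or 1234A).
    private static readonly Regex StreetNumber = new Regex(@"^\d+[A-Za-z]?$");

    public static string Redact(string text)
    {
        MatchCollection words = Word.Matches(text);
        var result = new StringBuilder(text);

        // Walk backwards so the positions of earlier words stay valid
        // after a later span has been replaced.
        int i = words.Count - 2;
        while (i >= 0)
        {
            string suffix = words[i + 1].Value;
            if (Abbreviations.TryGetValue(suffix, out var expanded))
            {
                suffix = expanded;
            }

            if (!StreetNames.Contains(words[i].Value + " " + suffix))
            {
                i--;
                continue;
            }

            // Street name found: include the preceding word if it is a street number.
            int first = i;
            if (i > 0 && StreetNumber.IsMatch(words[i - 1].Value))
            {
                first = i - 1;
            }

            int start = words[first].Index;
            int end = words[i + 1].Index + words[i + 1].Length;
            result.Remove(start, end - start).Insert(start, "[ADDRESS REDACTED]");

            // Continue with the pair of words just before the redacted span.
            i = first - 2;
        }

        return result.ToString();
    }
}
```

    With this sample data, StreetNameScanner.Redact("5-6 FT WEEDS ALL OVER AT 123 SCHOMER RD") gives "5-6 FT WEEDS ALL OVER AT [ADDRESS REDACTED]". A misspelling such as "123 SHOMER RD" passes through unchanged, which is the weakness described above.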

    I once wrote a program (using COBOL) to parse aircraft manufacturing instructions written by people for people to pick out tools and materials for the manufacture of aircraft. The aircraft were Black World (classified) but I believe one of them was a stealth fighter. Another one or at least one of them was a drone. A Manufacturing Engineer spent more than a year analyzing the data, and in parallel it took me about a year to write the program.

    You probably can use existing software to help you but I do not know how much experience you have. One thing I am not familiar with is indexing (searching) software. You might be able to use something intended to parse (index) data for the purpose of making searches faster. Something else that could help is software that generates programs to do syntactic analyses based on specified rules. Since this is a relatively common requirement, perhaps I am wrong that it is not easy. Perhaps other communities have developed something that can be used.

    You can tell management that it is not easy. 



    Sam Hobbs
    SimpleSamples.Info

    • Marked as answer by NHastings25 Monday, August 5, 2019 6:34 PM
    Monday, August 5, 2019 5:52 PM

All replies

  • First of all, you need to determine the characteristics that are common to all your addresses. For instance, the two addresses above can be identified by the fact that they start with a numeral and end with the street designation, such as ST or RD. You can locate such strings using a regular expression like this:

    @"\d+\s(\w|\s)+?(RD|ST)"

    But of course if your set of addresses contains other variations then you will need to modify the expression to support the variations.
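    As an illustration of how such a pattern might be applied and extended, here is a minimal C# sketch. The class name AddressRedactor, the suffix list, the four-word limit, and the "[ADDRESS REDACTED]" marker are assumptions made for this example, not part of the original suggestion. The pattern adds word boundaries so that ST does not match inside a longer word, and it caps the number of words between the house number and the suffix. Without some cap, the pattern as posted appears to start matching at the 6 in "5-6 FT WEEDS ..." and run all the way to RD.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressRedactor
{
    // Assumed shape of an address: a house number (optionally with a trailing
    // letter), one to four name words, then a street suffix as a whole word.
    private static readonly Regex StreetAddress = new Regex(
        @"\b\d+[A-Z]?\s+(?:[A-Z0-9']+\s+){1,4}?(?:RD|ROAD|ST|STREET|AVE|AVENUE|DR|DRIVE|LN|LANE|WAY)\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Replaces every span that looks like a street address with a marker.
    public static string Redact(string text)
    {
        return StreetAddress.Replace(text, "[ADDRESS REDACTED]");
    }
}
```

    For the first example string, AddressRedactor.Redact returns "5-6 FT WEEDS ALL OVER AT [ADDRESS REDACTED]". The sketch still over-redacts text that merely looks like an address, for example "2 TREES DOWN ON MAIN ST", and misses any suffix that is not in the list.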

    Monday, August 5, 2019 2:15 PM
    Moderator
  • There is just no way to do this automatically.  You can make a stab, but a human has to be involved.  Consider a report that says "50 ROSE THORNS WAY TOO TALL AT 50 ROSE THORNS WAY".  Further, as someone who has done registration duty for many conventions and festivals, I know that people are stupidly unreliable when typing in addresses.  They don't follow the rules, even when there are rules.

    The only fully reliable solution is to change the data entry so the address is in a separate field from the problem report.


    Tim Roberts | Driver MVP Emeritus | Providenza & Boekelheide, Inc.

    Monday, August 5, 2019 5:50 PM
  • Hi Alberto,

    Thanks for the response. As other people have said, it isn't as easy as I was hoping it might be, so I'll have to develop a different solution.

    Nate

    Monday, August 5, 2019 6:31 PM
  • Hi Tim,

    Thanks for the input. I was expecting as much, but I thought I would ask the community for ideas. The problem is that the city's 311 system has an address input, which is an easy redaction, but since they might type a neighbor's address to report the problem, I can't remove it that way. I'll have to look for a better solution that allows for running the reports while also allowing for general redaction. Thanks for the response!

    Nate

    Monday, August 5, 2019 6:34 PM
  • Hi Sam,

    While I am experienced in data analytics, among other things, I was not sure what system might be the best way to detect and redact that data. The data that I gave as an example was the description field of the 311 data for the City of Aurora. I will take a look at other ways, but I was looking to remove as much data as possible so that the final redaction process by the legal department is quicker and doesn't take too much of their time. I will look at other options, and I appreciate your response. I marked your response as the answer since it gave the best answer and other alternatives.

    Thanks,

    Nate

    Monday, August 5, 2019 6:40 PM
  • "final redaction process by the legal department"

    I did not know details such as that, of course, but you might develop a tool to help with their analyses instead of doing all the conversion. I do not know what to suggest; it is something you can work out with them.
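    One possible shape for a tool that helps with review rather than doing the conversion is sketched below in C#. It only flags spans that look like addresses and leaves the decision to a person. The class name AddressReviewHelper, the pattern, and the << >> markers are hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AddressReviewHelper
{
    // Assumed candidate shape: house number, one to four words, street suffix.
    private static readonly Regex Candidate = new Regex(
        @"\b\d+[A-Z]?\s+(?:[A-Z0-9']+\s+){1,4}?(?:RD|ROAD|ST|STREET|AVE|AVENUE|DR|DRIVE|LN|LANE|WAY)\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the text of every candidate span, in order of appearance.
    public static List<string> FindCandidates(string text)
    {
        var hits = new List<string>();
        foreach (Match m in Candidate.Matches(text))
        {
            hits.Add(m.Value);
        }
        return hits;
    }

    // Wraps each candidate in << >> so a reviewer can see it in context
    // and decide whether it really is an address.
    public static string Highlight(string text)
    {
        return Candidate.Replace(text, m => "<<" + m.Value + ">>");
    }
}
```

    On the report "50 ROSE THORNS WAY TOO TALL AT 50 ROSE THORNS WAY", both occurrences are flagged, so the reviewer still has to decide which one is the address.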



    Sam Hobbs
    SimpleSamples.Info

    Monday, August 5, 2019 6:55 PM