Considering the current CSV schema from a 3rd party source:
name, street, city, state
Additionally, we'll add the following after the pull is complete:
date, source
We know from existing code runs that some addresses will fail to parse with usaddress due to abnormalities such as a missing 'postType' (st, rd, ave, and so on) or an intersection such as main st & 100th. Though it does not exist at the time this issue was created, there will be a report for failed addresses; the expectation is that someone can manually edit the offending lines so that parsing can do its job.
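For reference, a minimal sketch of how a failed parse might be caught and routed into that report. This assumes we call usaddress.tag() on the street field; the file names and row structure are placeholders, not decisions.

```python
import csv
import usaddress

FIELDS = ["name", "street", "city", "state"]
failed_rows = []

def parse_street(row):
    """Try to tag the street field; collect the row for the failure report if it can't be parsed."""
    try:
        tagged, address_type = usaddress.tag(row["street"])
        return tagged
    except usaddress.RepeatedLabelError:
        # Ambiguous input such as "main st & 100th" repeats labels and lands here.
        failed_rows.append(row)
        return None

with open("raw.csv", newline="") as fh:
    for row in csv.DictReader(fh, fieldnames=FIELDS):
        parse_street(row)

# Placeholder for the future failure report someone can edit by hand.
with open("failed_addresses.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(failed_rows)
```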
However, as future runs progress, it's likely that we'll fail to recognize the manually modified row as data we already have and append the offending data once again. My current working theory is that we can take part or all of the name/address/city/state and convert it into a hash or checksum. A hash could be favorable if we could decode it back into the original value when needed, though I'm not sure decoding is really something we'll need to rely on. If that is overkill, a checksum could work just as well. My fear is that we'll manually modify something and then need to see the original address we pulled down for troubleshooting/debugging. Alternatively, if we change how the logic parses the line, we could decode a hash and programmatically update other fields as needed.
This approach would let us add a hash field after source in the CSV schema. We could reference it to detect duplicates coming from the source's 'raw csv' files and make logic decisions without parsing more addresses, while also not having to worry about failed parsings.
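A rough sketch of that idea, assuming the fingerprint is a SHA-256 digest of the concatenated name/street/city/state fields stored in a new hash column appended after source; the file names, separator, and normalization are assumptions for illustration.

```python
import csv
import hashlib

RAW_FIELDS = ["name", "street", "city", "state"]

def row_fingerprint(row):
    """Deterministic fingerprint of the raw identifying fields, normalized to lowercase."""
    key = "|".join(row[f].strip().lower() for f in RAW_FIELDS)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def append_new_rows(raw_csv, master_csv, date, source):
    """Append only rows whose fingerprint has not already been seen in the master file."""
    with open(master_csv, newline="") as fh:
        seen = {r["hash"] for r in csv.DictReader(fh)}

    out_fields = RAW_FIELDS + ["date", "source", "hash"]
    with open(raw_csv, newline="") as src, open(master_csv, "a", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=out_fields)
        for row in csv.DictReader(src, fieldnames=RAW_FIELDS):
            digest = row_fingerprint(row)
            if digest in seen:
                continue  # already pulled; a manual edit downstream won't cause a re-append
            seen.add(digest)
            writer.writerow({**row, "date": date, "source": source, "hash": digest})
```

Because the fingerprint is computed from the raw source fields rather than the (possibly hand-edited) master row, a later pull of the same source line still matches the stored hash even after someone fixes the street column by hand.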
This issue needs to be ironed out before the end of phase one, but it does not impact any other improvements or cleanup, since we're at an early stage of data collection.
So far, investigation shows that using something like an MD5 or SHA hash would give us a one-way hash, which is only good for determining whether the original string has been modified and offers no mechanism for recovering the original string.
Either we'll need two-way encryption (if that is feasible) or we'll need to consider another approach.
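If recovering the original string turns out to matter, one possible approach (an assumption, not a decision) is symmetric encryption via the cryptography package's Fernet, kept alongside the one-way digest used for dedup; the simpler fallback is to just keep the raw line in a separate column or file.

```python
import hashlib
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # would need to be stored securely and reused across runs
fernet = Fernet(key)

raw = "Acme Co|main st & 100th|Springfield|IL"  # hypothetical raw key

# One-way: good only for detecting change / deduplication.
digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Two-way: the token can be decrypted back to the original for troubleshooting.
token = fernet.encrypt(raw.encode("utf-8"))
assert fernet.decrypt(token).decode("utf-8") == raw
```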