Considering the current CSV schema from a 3rd party source:
name, street, city, state
Additionally, we'll add the following after the pull is complete:
date, source
We know from existing code runs that some addresses will fail to parse with usaddress due to abnormalities such as a missing 'postType' (st, rd, ave, and so on) or an intersection such as main st & 100th. Though it does not exist at the time this issue was created, there will be a report for failed addresses; the expectation is that someone can manually edit the offending lines so that parsing can do its job.
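For reference, a minimal sketch of how a failed parse might be caught and routed into that report. This assumes we call usaddress.tag() on the street field; the file names and row structure are placeholders, not decisions.

```python
import csv
import usaddress

FIELDS = ["name", "street", "city", "state"]
failed_rows = []

def parse_street(row):
    """Try to tag the street field; collect the row for the failure report if it can't be parsed."""
    try:
        tagged, address_type = usaddress.tag(row["street"])
        return tagged
    except usaddress.RepeatedLabelError:
        # Ambiguous input such as "main st & 100th" repeats labels and lands here.
        failed_rows.append(row)
        return None

with open("raw.csv", newline="") as fh:
    for row in csv.DictReader(fh, fieldnames=FIELDS):
        parse_street(row)

# Placeholder for the future failure report someone can edit by hand.
with open("failed_addresses.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(failed_rows)
```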
However, as future runs progress, it's likely that we'll fail to recognize the manually modified row as data we already have and append the offending data once again. My current working theory is that we can take part or all of the name/address/city/state and convert it into a hash or checksum. A hash could be favorable if we could decode it back into the original value when needed, though I'm not sure decoding is really something we'll need to rely on. If that is overkill, a checksum could work just as well. My fear is that we'll manually modify something and then need to see the original address we pulled down for troubleshooting/debugging. Alternatively, if we change how the logic parses the line, we could decode a hash and programmatically update other fields as needed.
This approach would let us add a hash field after source in the CSV schema. We could reference it to detect duplicates coming from the source's 'raw csv' files and make logic decisions without parsing more addresses, while also not having to worry about failed parsings.
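A rough sketch of that idea, assuming the fingerprint is a SHA-256 digest of the concatenated name/street/city/state fields stored in a new hash column appended after source; the file names, separator, and normalization are assumptions for illustration.

```python
import csv
import hashlib

RAW_FIELDS = ["name", "street", "city", "state"]

def row_fingerprint(row):
    """Deterministic fingerprint of the raw identifying fields, normalized to lowercase."""
    key = "|".join(row[f].strip().lower() for f in RAW_FIELDS)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def append_new_rows(raw_csv, master_csv, date, source):
    """Append only rows whose fingerprint has not already been seen in the master file."""
    with open(master_csv, newline="") as fh:
        seen = {r["hash"] for r in csv.DictReader(fh)}

    out_fields = RAW_FIELDS + ["date", "source", "hash"]
    with open(raw_csv, newline="") as src, open(master_csv, "a", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=out_fields)
        for row in csv.DictReader(src, fieldnames=RAW_FIELDS):
            digest = row_fingerprint(row)
            if digest in seen:
                continue  # already pulled; a manual edit downstream won't cause a re-append
            seen.add(digest)
            writer.writerow({**row, "date": date, "source": source, "hash": digest})
```

Because the fingerprint is computed from the raw source fields rather than the (possibly hand-edited) master row, a later pull of the same source line still matches the stored hash even after someone fixes the street column by hand.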
This issue needs to be ironed out before the end of phase one, but it does not impact any other improvements or cleanup, since we're at an early stage of data collection.
So far, investigation shows that using something like an MD5 or SHA hash would give us a one-way hash, which is only good for determining whether the original string has been modified and offers no mechanism for recovering the original string.
Either we'll need two-way encryption (if that is feasible) or we'll need to consider another approach.
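If recovering the original string turns out to matter, one possible approach (an assumption, not a decision) is symmetric encryption via the cryptography package's Fernet, kept alongside the one-way digest used for dedup; the simpler fallback is to just keep the raw line in a separate column or file.

```python
import hashlib
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # would need to be stored securely and reused across runs
fernet = Fernet(key)

raw = "Acme Co|main st & 100th|Springfield|IL"  # hypothetical raw key

# One-way: good only for detecting change / deduplication.
digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Two-way: the token can be decrypted back to the original for troubleshooting.
token = fernet.encrypt(raw.encode("utf-8"))
assert fernet.decrypt(token).decode("utf-8") == raw
```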