Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CSV utility functions #956

Merged
merged 4 commits into from
Mar 19, 2024
Merged

Add CSV utility functions #956

merged 4 commits into from
Mar 19, 2024

Conversation

matthew-white
Copy link
Member

@matthew-white matthew-white commented Mar 12, 2024

This PR makes progress on getodk/central#589, adding utility functions related to CSV files. parseCSVHeader() parses the header row of a CSV file, parseCSV() parses the data, and formatRow() displays a row in CSV format.

What has been done to verify that this works as intended?

This PR just adds utility functions; it doesn't make user-facing changes. I've added unit tests of the utility functions.

Why is this the best possible solution? Were any other approaches considered?

npm package

One important decision was around which npm package to use for CSV parsing. I first tried csv-parse, since that's what we use on Backend. However, I ran into a couple of issues:

  • I didn't see an easy way to stream the CSV file in the browser. In the examples of csv-parse use in the browser, a CSV string is passed to the parse() function. Ideally, we'd be able to pass a File or a FileReader or a ReadableStream from Blob.prototype.stream(). It looks like there may be a way to convert a File to a Node-style stream that could be used with csv-parse, but I was hoping for something simpler. Related: Add support for WhatWG streams adaltas/node-csv#242
  • I had trouble parsing just the header row. When I tried using option to, I saw an error in the browser console: "Uncaught TypeError: "listener" argument must be a function". I think that error stems from option.to breaks the stream and exits the program adaltas/node-csv#333. That said, it looked like parsing was able to complete despite the error logging. But to avoid the error, I tried doing some preliminary parsing, grabbing the substring of the CSV string before the first newline character and passing that to csv-parse. That approach allowed me to avoid option to, but it's again at odds with streaming. It also won't return the correct result if a column header contains a newline.

I gave a different package Papa Parse a try instead, and I've been liking it. It's easy to stream the file, and I didn't have trouble parsing the header row. It also seems markedly faster than what I was doing with csv-parse (where I was calling text() on the File, then passing the resulting string to the parse() function). I tried parsing a file with 4 columns (3 properties + label):

# rows size csv-parse time Papa Parse time
100,000 8 MB 1.5s 0.4s
1,000,000 80 MB 14s 2s

Another little benefit of Papa Parse is the ability to easily abort parsing. I'm planning to set it up so that if the user leaves the upload modal, any ongoing CSV parsing is aborted.

Parsing the header and data separately

As mentioned above, there are separate functions for parsing the header vs. parsing the data. parseCSV() parses the data and expects to receive an array of column headers. That array comes from parsing the header row using parseCSVHeader().

I've set it up this way so that if there is an error in the header, the rest of the CSV file is not parsed. The upload modal will show errors about the header, and I want those to be shown as soon as the header has been parsed and validated.

That could still have been done in a single function, say, by passing a validateHeader option to the function. However, parseCSV() is already fairly complex, and I didn't want to make it even more so. parseCSVHeader() and parseCSV() also handle errors pretty differently. If Papa Parse finds errors in the header (of code MissingQuotes or InvalidQuotes), parseCSVHeader() will not return a rejected promise and will instead return the errors along with Papa Parse's best attempt to parse the header. That's because the upload modal will display both the errors and the header: the two are needed together. On the other hand, parseCSV() returns a rejected promise if there is an error with the data. In that case, the upload modal simply shows a danger alert. parseCSV() only returns data if it was successful. At that point, it may also return warnings about the data. parseCSVHeader() has no concept of warnings.

I don't think we lose much by separating parseCSVHeader() and parseCSV(). There end up being some conceptual differences between the two, so I think it's useful to do so.

Returning promises

Another part of the approach I took was to wrap Papa Parse in a promise. Papa Parse follows more of a callback pattern, but it is more convenient to work with promises. parseCSVHeader() and parseCSV() return promises, using Papa Parse callbacks under the hood. Also, Papa Parse provides its own API for aborting parsing, but in other places in Frontend (around requests), we use AbortSignals. parseCSVHeader() and parseCSV() will accept an optional AbortSignal, then use the AbortSignal to trigger the abort mechanism in Papa Parse. That way, most code can just use promises and AbortSignals, and for the most part, only src/util/csv.js needs to be concerned with the mechanics of Papa Parse.

Before submitting this PR, please make sure you have:

  • run npm run test and npm run lint and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code or assets from external sources are properly credited in comments or that everything is internally sourced

@matthew-white matthew-white requested a review from ktuite March 12, 2024 22:05
Copy link
Member Author

@matthew-white matthew-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes from interactive code review

@matthew-white
Copy link
Member Author

For some reason, I thought Papa Parse only returned quote errors for the header row. I'd tried passing Papa Parse a CSV string with an invalid quote after the header and didn't see an error. However, I must have missed something, because I'm now seeing that Papa Parse does return quote errors about the data after the header. I think that's a good development, as that's the better behavior. I've pushed a commit that specifically handles those errors, rejecting parsing with a better error message compared to before.

@matthew-white matthew-white merged commit b1f625f into master Mar 19, 2024
1 check passed
@matthew-white matthew-white deleted the csv-util branch March 19, 2024 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants