Glyphosate SAP Clustering

Daniel J. Hicks, PhD

This project demonstrates the use of text mining methods to analyze public comments on government regulatory actions. Specifically, I analyze comments on a meeting of EPA's Scientific Advisory Panel, held December 13-16, 2016, to discuss the herbicide glyphosate. Public comments were posted in the Regulations.gov docket EPA-HQ-OPP-2016-0385.

The analysis is broken into a series of steps. Most of these steps are completely automated, using the scripts found in this repository. Each script has a corresponding literate HTML version. Running these scripts, in numerical order, will completely reproduce the analysis. In the steps below, filename links will load HTML versions of scripts (generated with rmarkdown::render), which include both code and output. The full repository can be found at https://github.com/dhicks/glyphosate/.

Some brief findings are presented in the final script file, 7_findings.R. However, the focus of this demonstration project is on data-gathering and -analysis methods, rather than the development of communicable findings.

1. The script 1_scrape.R retrieves all public comments and selected attachments (PDF, Microsoft Word, and plaintext files) found in the docket. Attachments are converted to plaintext files using two command-line tools, pdftotext and pandoc.
2a. The downloaded attachments are manually reviewed for completeness (that is, to confirm that all files were downloaded correctly), accuracy of plaintext conversion, and relevance. In this particular docket, many of the attachments were petition signature pages, submissions from letter-writing campaigns (where individuals merely signed their name to prepared letters), previous publications (such as EPA documents or peer-reviewed journal articles), bibliographies, or commenters' CVs, resumes, or biographies. All of these attachment types were excluded from the rest of the analysis. This step is conducted using the file 2_attachments.xlsx.

2b. All comments are manually classified under two variables:
- commenter type: one of academic, advocacy (consumer or environmental), government, industry (including organic farmers), or individual (anonymous or lacking a determinable affiliation)
- valence: either neg (arguing against glyphosate or its registration or use, or arguing that it poses health risks) or pro (arguing for glyphosate)
For this step, the file 1_comment_metadata.csv (generated in step 1) is opened in Excel, edited, and saved as 2_comment_metadata.xlsx.
3. Comments and attachments are combined into a single dataframe by the script 3_combine_comments_attachments.R.
4. The script 4_prep_fit_word2vec.R standardizes the combined comments, identifies significant multigrams (phrases comprising multiple words), and fits a word embedding model. Word embeddings represent words in relatively low-dimensional vector spaces; the cosine similarity of a pair of vectors roughly tracks how likely the corresponding words are to occur close to each other. The fitted word2vec model is saved in the file glyphosate_w2v.bin.
5. The script 5_clustering.R uses affinity propagation to construct clusters of terms. "Focal terms" are identified as the terms with the highest information gain for distinguishing industry and advocacy comments. This initial termlist is expanded by taking the 500 most similar terms from the fitted word embedding model. These 500 terms are then clustered using affinity propagation, with word embedding similarity as the similarity measure. After clusters are developed, they are "mapped" to individual comments, and dot plots are used to identify clusters that are more strongly associated with industry or advocacy comments.
6. The script 6_ml.R constructs a machine learning classifier to distinguish pro- and anti-glyphosate comments, extracts the most important terms for this classifier, and constructs partial dependence plots.
7. The file 7_findings.R draws some implications from the cluster analysis in step 5. It discusses linguistic differences between advocacy and industry comments; suggests how these differences might have influenced EPA's decision-making; and examines the relative prominence of workers, consumers, children, and farmers in the comments.
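
The cosine similarity used with the word embeddings from 4_prep_fit_word2vec.R can be illustrated with a minimal base R sketch. This is not code from the repository: the three-dimensional vectors, the term names, and the helper cosine_similarity are made up for illustration, and a real word2vec model would have far more dimensions.

```r
# Cosine similarity between two numeric vectors.
cosine_similarity <- function(u, v) {
  sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
}

# Hypothetical embeddings (made-up values, NOT read from glyphosate_w2v.bin).
embeddings <- rbind(
  herbicide  = c(0.9, 0.1, 0.2),
  weedkiller = c(0.8, 0.2, 0.1),
  testimony  = c(0.1, 0.9, 0.3)
)

# Pairwise similarities: scale each row to unit length, then take dot products.
unit_rows <- embeddings / sqrt(rowSums(embeddings^2))
similarity <- unit_rows %*% t(unit_rows)

# Rank the other terms by their similarity to a query term.
query <- "herbicide"
ranked <- sort(similarity[query, colnames(similarity) != query], decreasing = TRUE)
print(round(ranked, 3))
```

Expanding a termlist by "most similar terms" amounts to taking the top entries of a ranking like this one for each focal term.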

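The information gain criterion used to pick focal terms in 5_clustering.R can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: it treats each term as simply present or absent in a comment, and the toy labels, the matrix doc_term, and the helper names entropy and information_gain are all hypothetical.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a present/absent term indicator with respect to the labels:
# entropy of the labels minus the weighted entropy within each indicator group.
information_gain <- function(term_present, labels) {
  weights <- table(term_present) / length(term_present)
  conditional <- sum(sapply(names(weights), function(level) {
    weights[[level]] * entropy(labels[as.character(term_present) == level])
  }))
  entropy(labels) - conditional
}

# Toy data (made up): eight comments, two hypothetical terms.
labels <- rep(c("industry", "advocacy"), each = 4)
doc_term <- cbind(
  term_a = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  term_b = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Score every term and list the most informative first.
gains <- sort(apply(doc_term, 2, information_gain, labels = labels), decreasing = TRUE)
print(gains)
```

In this toy example term_a separates the two commenter types perfectly and term_b not at all, so a focal-term list would keep term_a and drop term_b.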
