Reproduce the results presented in the paper "Sparse matrix linear models for structured high-throughput data".

senresearch/mlm_l1_supplement

Sparse matrix linear models for structured high-throughput data

This repository contains code to reproduce the results presented in the paper "Sparse matrix linear models for structured high-throughput data".

Analysis was primarily performed in Julia [1], and visualizations were created using R [2]. The Julia package associated with this paper is MatrixLMnet, which extends the MatrixLM package.

Simulations examining dependence of runtimes on data size

  • scaling_times.jl: Examine how runtime increases with dimension size in simulated data for two algorithms, FISTA with backtracking and ADMM.
  • scaling_times.R: Visually compare runtimes for FISTA with backtracking and ADMM.
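
For readers unfamiliar with the algorithms being timed, the following is a minimal sketch of FISTA for an L1-penalized matrix linear model. It assumes the objective 0.5*||Y - X B Z'||_F^2 + lam*||B||_1 and uses a fixed step size rather than backtracking, so it is a simplified variant and not the implementation in MatrixLMnet. The function and argument names (`fista_mlm`, `lam`, `step`) are hypothetical, and plain Python lists stand in for matrices to keep the example dependency-free.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def soft_threshold(v, t):
    """Proximal operator of t*|v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fista_mlm(Y, X, Z, lam, step, n_iter=200):
    """Sketch: minimize 0.5*||Y - X B Z'||_F^2 + lam*||B||_1 over B.

    Y is n x m, X is n x p, Z is m x q, and B is p x q.
    `step` is a fixed step size (assumed small enough for convergence);
    no backtracking line search is performed.
    """
    p, q = len(X[0]), len(Z[0])
    Xt, Zt = transpose(X), transpose(Z)
    B = [[0.0] * q for _ in range(p)]
    A = [row[:] for row in B]  # extrapolated point used for the gradient
    t = 1.0
    for _ in range(n_iter):
        fitted = matmul(matmul(X, A), Zt)
        R = [[y - f for y, f in zip(yr, fr)] for yr, fr in zip(Y, fitted)]
        G = matmul(matmul(Xt, R), Z)  # negative gradient of the smooth part
        # Gradient step followed by entrywise soft-thresholding.
        B_new = [
            [soft_threshold(a + step * g, step * lam) for a, g in zip(ar, gr)]
            for ar, gr in zip(A, G)
        ]
        # Momentum update for the extrapolated point.
        t_new = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        mom = (t - 1.0) / t_new
        A = [
            [bn + mom * (bn - bo) for bn, bo in zip(bnr, bor)]
            for bnr, bor in zip(B_new, B)
        ]
        B, t = B_new, t_new
    return B
```

With identity matrices for both `X` and `Z`, the minimizer is the entrywise soft-thresholding of `Y` by `lam`, which makes a convenient sanity check for the sketch.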

Simulations inspired by environmental screening data (Woodruff et al., 2011)

Woodruff, T. J., Zota, A. R., & Schwartz, J. M. (2011). Environmental chemicals in pregnant women in the United States: NHANES 2003–2004. Environmental health perspectives, 119(6), 878-885.

  • woodruff_sim.jl: Simulate data and run L1-penalized matrix linear model.
  • woodruff_sim.R: Run univariate linear models on simulated data, and reproduce ROC plot comparing approaches.
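
As a rough illustration of what goes into an ROC comparison on simulated data, the sketch below computes ROC points from a list of scores (for example, absolute coefficient estimates) and the known true signals. This is one common construction and is not necessarily how `woodruff_sim.R` builds its plot; the names `roc_points` and `auc` are hypothetical.

```python
def roc_points(scores, is_true_signal):
    """Return (false positive rate, true positive rate) pairs.

    A candidate is called positive when its score is at least the threshold;
    thresholds sweep over the distinct scores from largest to smallest.
    """
    pos = sum(1 for t in is_true_signal if t)
    neg = len(is_true_signal) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one true signal and one null")
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, is_true_signal) if s >= thr and t)
        fp = sum(1 for s, t in zip(scores, is_true_signal) if s >= thr and not t)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

Because the truth is known in a simulation, each method being compared can be scored this way and the resulting curves overlaid.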

E. coli genetic screening data (Nichols et al., 2011)

Nichols, R. J., Sen, S., Choo, Y. J., Beltrao, P., Zietek, M., Chaba, R., Lee, S., Kazmierczak, K. M., Lee, K. J., Wong, A., et al. (2011). Phenotypic landscape of a bacterial cell. Cell, 144(1):143–156.

Download the data here [3]. Once downloaded, the files should be saved in the data/raw_KEIO_data/ directory.

  • nichols_preprocess.R: Preprocess data.
  • nichols.jl: Run L1-penalized matrix linear model, with cross-validation.
  • nichols.R: Reproduce dot plot and ROC plot for analyzing auxotrophs.
  • nichols_sim.jl: Simulate data and run matrix linear models (least squares and L1-penalized).
  • nichols_sim.R: Reproduce ROC plots for comparing matrix linear models with and without the L1 penalty.
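
The cross-validation mentioned above can be pictured with a generic k-fold split. The sketch below partitions observation indices into folds and yields training and held-out sets; it is an illustration only, and the fold construction used by `nichols.jl` and MatrixLMnet may differ (for instance, in how rows and columns of the response matrix are folded). The function names and the `seed` argument are hypothetical.

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]


def train_test_splits(n, k, seed=0):
    """Yield (train, held_out) index lists, one pair per fold."""
    for held_out in kfold_indices(n, k, seed):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        yield train, held_out
```

In a penalized fit, each candidate penalty value would be evaluated on every held-out fold and the value with the best average performance retained.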

Arabidopsis fitness adaptation QTL data (Ågren et al., 2013)

Ågren, J., Oakley, C. G., McKay, J. K., Lovell, J. T., & Schemske, D. W. (2013). Genetic mapping of adaptation reveals fitness tradeoffs in Arabidopsis thaliana. Proceedings of the National Academy of Sciences, 110(52), 21077-21082.

Download RIL_DataForSelectionAnalyses3yrs.xls and geno.csv here [4, 5]. Once downloaded, the files should be saved in the data/ directory.
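
Before running the preprocessing script, it can help to confirm that the downloads landed where the scripts expect them. The following is a small optional check, not part of the repository; the helper name `missing_files` is hypothetical, and the file names are the ones given above.

```python
from pathlib import Path


def missing_files(data_dir, expected_names):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected_names if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    missing = missing_files(
        "data", ["RIL_DataForSelectionAnalyses3yrs.xls", "geno.csv"]
    )
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All expected files found in data/.")
```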

  • agren_preprocess.R: Preprocess data.
  • agren.jl: Run L1-penalized matrix linear model, with cross-validation.
  • agren.R: Reproduce plot of main effect and interaction QTL peaks.
  • agren_times.jl: Compare the runtimes for different L1-penalized algorithms.

Arabidopsis eQTL experiment data (Lowry et al., 2013)

Lowry, D. B., Logan, T. L., Santuari, L., Hardtke, C. S., Richards, J. H., DeRose-Wilson, L. J., ... & Juenger, T. E. (2013). Expression quantitative trait locus mapping across water availability environments reveals contrasting associations with genomic features in Arabidopsis. The Plant Cell, 25(9), 3266-3279.

Download the series matrix file from the paper's GEO site here and Supplemental Dataset 1b here [6]. Once downloaded, the files should be saved in the data/ directory.

Download the annotation file here. Once downloaded, the file should be saved in the data/ directory.

The files for the marker positions, TKrils_Marker_PhysPos.csv, and the key for the TxK RIL IDs, dd2014_cytocovar.csv, are provided in the data subdirectory.

  • lowry_preprocess.R: Preprocess data.
  • lowry.jl: Run L1-penalized matrix linear model.
  • lowry.R: Reproduce scatterplot of main effect and interaction QTL.
  • lowry_multipleqtl.R: Run multiple QTL analysis on individual phenotypes using the R/qtl package [7].
  • lowry_times.jl: Compare the runtimes for different L1-penalized algorithms on random subsets of the data.
  • lowry_times.R: Plot runtimes of each L1-penalized algorithm against number of genes in subset.

1. Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM review, 59(1):65–98.

2. R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

3. Nichols, R. J., Sen, S., Choo, Y. J., Beltrao, P., Zietek, M., Chaba, R., Lee, S., Kazmierczak, K. M., Lee, K. J., Wong, A., et al. (2011). Phenotypic landscape of a bacterial cell. Cell, 144(1):143–156.

4. Ågren, J., Oakley, C. G., Lundemo, S., & Schemske, D. W. (2017). Adaptive divergence in flowering time among natural populations of Arabidopsis thaliana: estimates of selection and QTL mapping. Evolution, 71(3), 550-564.

5. Ågren, J., Oakley, C. G., Lundemo, S., & Schemske, D. W. (2016). Adaptive divergence in flowering time among natural populations of Arabidopsis thaliana: estimates of selection and QTL mapping. Data from: Dryad Digital Repository. https://doi.org/10.5061/dryad.77971.

6. Lovell, J. T., Mullen, J. L., Lowry, D. B., Awole, K., Richards, J. H., Sen, S., ... & McKay, J. K. (2015). Exploiting differential gene expression and epistasis to discover candidate genes for drought-associated QTLs in Arabidopsis thaliana. The Plant Cell, 27(4), 969-983.

7. Broman, K. W., Wu, H., Sen, Ś., & Churchill, G. A. (2003). R/qtl: QTL mapping in experimental crosses. Bioinformatics, 19(7), 889-890.
