Commit 689c2f7

Update README.md
1 parent aa7606e commit 689c2f7

1 file changed

+41
-42
lines changed

README.md

@@ -4,24 +4,45 @@ A toolkit for variable number tandem repeats (VNTRs) analysis, which enables:
44
2. genotyping each VNTR as a set of (*k*-mer, count) given SRS data (`danbing-tk align`), and
55
3. predicting VNTR length from the genotype (`danbing-tk predict`) given locus-specific sampling bias (LSB).
66

7-
[Manuscript](https://www.biorxiv.org/content/10.1101/2020.08.13.249839v3) available on biorxiv.
7+
See [online manuscript](https://www.nature.com/articles/s41467-021-24378-0) for details.
88

99

10-
## Installation
11-
This will take up to an hour on a typical work station, depending on your internet connection.
10+
## Download Releases
11+
Each release comes with the latest version of the genotypable VNTRs, RPGG, and/or precomputed LSB.
1212

13-
### Download Releases
14-
Each release comes with the lastest version of genotypable VNTRs, RPGG, and precomputed LSB.
15-
16-
| | File | Input of | Output of |
17-
|-------------------|-----------------------------------|--------------------|------------------|
18-
| Genotypable VNTRs | tr.good.bed | danbing-tk build | |
19-
| Control regions | ctrl.bed | danbing-tk build | |
20-
| RPGG | pan.(tr\|ntr\|graph).kmers | danbing-tk align | danbing-tk build |
21-
| Precomputed LSB | LSB.tsv | danbing-tk predict | |
13+
| | File | Input of | Output of |
14+
|-------------------|-------------------------------------------------------|--------------------|------------------|
15+
| Genotypable VNTRs | tr.good.bed | danbing-tk build | |
16+
| Control regions | ctrl.bed | danbing-tk build | |
17+
| RPGG | pan.(tr.kmers\|kmerDBi.umap\|kmerDBi.vv\|graph.umap) | danbing-tk align | danbing-tk build |
18+
| Precomputed LSB | LSB.tsv | danbing-tk predict | |
2219

20+
- Release v1.3: Updated RPGG built from 35 HGSVC genomes.
2321
- Release v1.0: Also includes VNTR summary statistics, eGene discoveries, and example analyses such as differential length/motif analysis, eQTL mapping, VNTR locus QC, and sample QC.
2422

23+
24+
## Building on Linux
25+
```shell
26+
git clone https://github.com/ChaissonLab/danbing-tk
27+
cd danbing-tk && make
28+
```
29+
30+
## danbing-tk align
31+
Decompress the RPGG `RPGG.tar.gz` in your working directory.
32+
33+
Example usage to genotype an SRS sample using the RPGG:
34+
35+
```shell
36+
samtools fasta -@2 -n $SRS.bam |
37+
/$PREFIX/danbing-tk/bin/danbing-tk -gc 80 -ae -kf 4 0 -cth 45 -o $OUT_PREF -k 21 -qs pan -fai /dev/stdin -p $THREADS | gzip >$OUT_PREF.aln.gz
38+
```
39+
40+
`danbing-tk align` takes ~12 CPU hours to genotype a 30x SRS sample. It generates `$OUT_PREF.tr.kmers` and `$OUT_PREF.aln.gz`, whose formats are specified in [File Format](#file-format).
41+
42+
**Important note:** If the outputs of `danbing-tk align` will be used directly for downstream analyses, e.g. association tests, please check the [distribution of LSB](#distribution-of-lsb) section below before running.
43+
44+
45+
## danbing-tk build
2546
### Install Dependencies
2647
Users who intend to run only `danbing-tk align` can skip this step.
2748

@@ -40,39 +61,15 @@ conda create -n $MY_ENVIRONMENT -c conda-forge -c bioconda -c intel \
4061
scikit-learn=0.23.1 statsmodels=0.12.1 pysam=0.15.3
4162
```
4263

43-
### Building on Linux
44-
```shell
45-
git clone https://github.com/ChaissonLab/danbing-tk
46-
cd danbing-tk && make
47-
```
48-
49-
### Test Environment
5064
To check if everything is configured properly:
5165
1. Go to `/$PREFIX/danbing-tk/test/`
52-
5366
2. Replace `$PREFIX` in `goodPanGenomeGraph.json` and `input/genome.bam.tsv` with the path to danbing-tk
5467
3. Run `snakemake -p -s ../pipeline/GoodPanGenomeGraph.snakefile -j 4 --rerun-incomplete --output-wait 3`
5568

5669
Tested on v1.0.
57-
## Usage
58-
59-
### danbing-tk align
60-
Decompress the RPGG `RPGG.tar.gz` and link `*.kmers` (required by the `-qs` option) to your working directory with `ln -s`.
6170

62-
An example usage to genotype SRS sample using the RPGG:
63-
64-
```shell
65-
samtools fasta -@2 -n $SRS.bam |
66-
/$PREFIX/danbing-tk/bin/bam2pe -fai /dev/stdin |
67-
/$PREFIX/danbing-tk/bin/danbing-tk -gc 50 -k 21 -qs pan -fai /dev/stdin -o $OUT_PREFIX \
68-
-p 24 -cth 45 -rth 0.5
69-
```
70-
71-
`danbing-tk align` takes ~42 cpu hours to genotype a 30x SRS sample. This will generate a `*.tr.kmers` output with format specified in [File Format](#file-format).
72-
73-
**Important note:** If outputs of `danbing-tk align` are intended to be used directly for downstream analyses e.g. association tests, please check the [distribution of LSB](#distribution-of-lsb) section below before running.
7471

75-
### danbing-tk build
72+
### Running danbing-tk build
7673
- Required inputs:
7774
- haplotype-resolved assemblies (FASTA)
7875
- matched SRS data (BAM; optional)
@@ -96,7 +93,7 @@ snakemake -p -s /$PREFIX/danbing-tk/pipeline/GoodPanGenomeGraph.snakefile -j 40\
9693

9794
Submitting jobs to a cluster is preferred because `danbing-tk build` is compute-intensive (~1200 CPU hours for the original dataset). Otherwise, remove `--cluster` and its parameters to run jobs locally.
9895

99-
### danbing-tk predict
96+
## danbing-tk predict
10097
Locus-specific sampling biases (LSB) at VNTR regions are critical for normalizing the sum of *k*-mer counts to VNTR length. We provide precomputed LSB at the VNTR regions for fast comparison. However, this assumes that the LSB of the dataset of interest is close enough to that of the dataset in the original paper. Please check this assumption by running a joint PCA on the LSB of non-repetitive regions together with the original dataset, provided in `LSB.tsv`. If the assumption fails, leave-one-out analysis (next section) on the dataset of interest is necessary to make accurate predictions. The following usage applies when the assumption holds.
10198

10299
Run `getCovByLocus.397.sh` on your SRS dataset.
@@ -148,6 +145,7 @@ kmer1 kmer_count1
148145
>locus i+1
149146
...
150147
```
148+
The second field is optional.
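
For readers who want to load this output in a script, the following is a minimal Python sketch of a parser for the layout shown above. It is not part of danbing-tk: the function name `parse_tr_kmers` is hypothetical, fields are assumed to be whitespace-separated, and whatever follows `>` on a header line is treated as the locus label. A missing count is stored as `None`, since the second field is optional.

```python
def parse_tr_kmers(lines):
    """Group k-mer records by locus: {locus_label: {kmer: count or None}}.

    Assumes '>'-prefixed header lines start a locus and every other
    non-empty line is 'kmer' or 'kmer<whitespace>count'.
    """
    loci = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            current = line[1:].strip()
            loci[current] = {}
            continue
        if current is None:
            raise ValueError("k-mer record found before any '>' locus header")
        fields = line.split()
        # The count column is optional, so fall back to None when absent.
        loci[current][fields[0]] = int(fields[1]) if len(fields) > 1 else None
    return loci


# Made-up records for illustration only (k = 21).
example = [
    ">locus 0",
    "ACGTACGTACGTACGTACGTA\t12",
    "CGTACGTACGTACGTACGTAC\t7",
    ">locus 1",
    "T" * 21,
]
genotype = parse_tr_kmers(example)
```

To read a real file, pass an open file handle instead of the list, e.g. `parse_tr_kmers(open(path))`, where `path` points to a `*.tr.kmers` file.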
151149

152150
- Alignment output (`-a` option)
153151
- Synopsis
@@ -159,7 +157,8 @@ kmer1 kmer_count1
159157
- `ops`: operations to align the read to the graph
160158
- `=`: a match in the repeat
161159
- `.`: a match in the flank
162-
- `[A|C|G|T]`: a mismatch in the repeat; letter in the graph is shown
163-
- `[a|c|g|t]`: a mismatch in the flank; letter in the graph is shown
164-
- `[H|h]`: a homopolymer run in the repeat or flank
165-
- `S`: a gap (skipped)
160+
- `[A|C|G|T]`: a mismatch; letter in the graph is shown
161+
- `[0|1|2|3]`: a deletion; letter in the graph is shown as 0123 for ACGT, respectively.
162+
- `I`: an insertion in the read
163+
- `H`: a nucleotide in the homopolymer run
164+
 - `*`: unaligned nucleotide
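
The operation codes listed above can be tallied with a short script. The following Python sketch is an illustration rather than part of danbing-tk: it assumes that `ops` is a plain string with exactly one character per operation, and the names `summarize_ops`, `OP_NAMES`, and `DELETED_BASE` are hypothetical. Digits are decoded to the deleted graph base using the 0123 to ACGT mapping.

```python
from collections import Counter

# Single-character operations with a fixed meaning.
OP_NAMES = {
    "=": "repeat_match",
    ".": "flank_match",
    "I": "insertion",
    "H": "homopolymer",
    "*": "unaligned",
}
# Deletions are written as digits: 0123 stand for ACGT, respectively.
DELETED_BASE = dict(zip("0123", "ACGT"))


def summarize_ops(ops):
    """Return (per-category counts, string of deleted graph bases)."""
    counts = Counter()
    deleted = []
    for op in ops:
        if op in OP_NAMES:
            counts[OP_NAMES[op]] += 1
        elif op in "ACGT":
            counts["mismatch"] += 1
        elif op in DELETED_BASE:
            counts["deletion"] += 1
            deleted.append(DELETED_BASE[op])
        else:
            raise ValueError(f"unrecognized operation: {op!r}")
    return counts, "".join(deleted)


# Made-up ops string for illustration only.
counts, deleted_bases = summarize_ops("..==A=2=I*")
```

For this made-up string, `counts` holds two flank matches, four repeat matches, and one each of mismatch, deletion, insertion, and unaligned nucleotide, and `deleted_bases` is `"G"`.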
