
Commit 3732c8e

Authored Feb 12, 2022

Update docs for uniCOIL and other learned sparse models (#997)

pyserini.search -> pyserini.search.lucene
pyserini.index -> pyserini.index.lucene
1 parent 7e21271 commit 3732c8e
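
The rename is purely a module-path change. The same paths apply to the interactive Python API; below is a minimal sketch, assuming the searcher and index-reader classes keep their current names under the new subpackages (this commit itself only touches documentation):

```python
# Assumed class names/locations (LuceneImpactSearcher, IndexReader); the commit only
# updates docs, so treat this as an illustration of the new module layout.
from pyserini.search.lucene import LuceneImpactSearcher  # formerly under pyserini.search
from pyserini.index.lucene import IndexReader            # formerly under pyserini.index

# e.g., impact retrieval against one of the pre-built indexes mentioned in the docs below
searcher = LuceneImpactSearcher.from_prebuilt_index(
    'msmarco-v2-passage-unicoil-noexp-0shot',
    'castorini/unicoil-noexp-msmarco-passage')
hits = searcher.search('how long is life cycle of flea', k=10)
```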

5 files changed: +177 −176 lines changed
 

docs/experiments-deepimpact.md

+23 −23
````diff
@@ -1,4 +1,4 @@
-# Pyserini: DeepImpact for MS MARCO V1 Passage Ranking
+# Pyserini: DeepImpact on MS MARCO V1 Passage Ranking
 
 This page describes how to reproduce the DeepImpact experiments in the following paper:
 
@@ -7,8 +7,6 @@ This page describes how to reproduce the DeepImpact experiments in the following
 Here, we start with a version of the MS MARCO passage corpus that has already been processed with DeepImpact, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 
-Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-deepimpact.md) based on Java.
-
 ## Data Prep
 
 > You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
@@ -17,28 +15,28 @@ We're going to use the repository's root directory as the working directory.
 First, we need to download and extract the MS MARCO passage dataset with DeepImpact processing:
 
 ```bash
-# Alternate mirrors of the same data, pick one:
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact-b8.tar -P collections/
-wget https://vault.cs.uwaterloo.ca/s/57AE5aAjzw2ox2n/download -O collections/msmarco-passage-deepimpact-b8.tar
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact.tar -P collections/
 
-tar xvf collections/msmarco-passage-deepimpact-b8.tar -C collections/
+tar xvf collections/msmarco-passage-deepimpact.tar -C collections/
 ```
 
-To confirm, `msmarco-passage-deepimpact-b8.tar` is ~3.6 GB and has MD5 checksum `3c317cb4f9f9bcd3bbec60f05047561a`.
+To confirm, `msmarco-passage-deepimpact.tar` is 3.6 GB and has MD5 checksum `fe827eb13ca3270bebe26b3f6b99f550`.
 
 ## Indexing
 
 We can now index these docs:
 
 ```bash
-python -m pyserini.index -collection JsonVectorCollection \
-  -input collections/msmarco-passage-deepimpact-b8/ \
-  -index indexes/lucene-index.msmarco-passage.deepimpact-b8 \
-  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-  -threads 12
+python -m pyserini.index.lucene \
+  --collection JsonVectorCollection \
+  --input collections/msmarco-passage-deepimpact/ \
+  --index indexes/lucene-index.msmarco-passage-deepimpact/ \
+  --generator DefaultLuceneDocumentGenerator \
+  --threads 12 \
+  --impact --pretokenized
 ```
 
-The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the DeepImpact tokens.
+The important indexing options to note here are `--impact --pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the DeepImpact tokens.
 
 Upon completion, we should have an index with 8,841,823 documents.
 The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 15 minutes.
@@ -48,7 +46,7 @@ The indexing speed may vary; on a modern desktop with an SSD (using 12 threads,
 To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries.
 First, fetch the MS MARCO passage ranking dev set queries:
 
-```
+```bash
 # Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz -P collections/
 wget https://vault.cs.uwaterloo.ca/s/NYibRJ9bXs5PspH/download -O collections/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz
@@ -60,21 +58,23 @@ The MD5 checksum of the topics file is `88a2987d6a25b1be11c82e87677a262e`.
 We can now run retrieval:
 
 ```bash
-python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz \
-  --index indexes/lucene-index.msmarco-passage.deepimpact-b8 \
-  --output runs/run.msmarco-passage.deepimpact-b8.tsv \
-  --impact \
-  --hits 1000 --batch 36 --threads 12 \
-  --output-format msmarco
+python -m pyserini.search.lucene \
+  --index indexes/lucene-index.msmarco-passage-deepimpact/ \
+  --topics collections/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz \
+  --output runs/run.msmarco-passage-deepimpact.tsv \
+  --output-format msmarco \
+  --batch 36 --threads 12 \
+  --hits 1000 \
+  --impact
 ```
 
-Note that the important option here is `-impact`, where we specify impact scoring.
+Note that the important option here is `--impact`, where we specify impact scoring.
 A complete run should take around five minutes.
 
 The output is in MS MARCO output format, so we can directly evaluate:
 
 ```bash
-python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.deepimpact-b8.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-deepimpact.tsv
 ```
 
 The results should be as follows:
````
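
Aside on what `JsonVectorCollection` with `--impact --pretokenized` ingests: each document supplies its already-tokenized DeepImpact terms and weights in a `vector` field, which is why no further tokenization or BM25 length normalization happens at indexing time. A minimal sketch with made-up ids, terms, and weights (the distributed corpus typically stores one JSON object per line):

```python
# Hypothetical one-document example of the JsonVectorCollection input format.
# The impact index is built from "vector": a map of pre-tokenized term -> weight.
import json
import pathlib

doc = {
    "id": "7187158",                                              # passage id (invented)
    "contents": "the manhattan project brought together scientists ...",  # stored text
    "vector": {"manhattan": 74, "project": 52, "scientists": 31},  # term -> impact weight
}

pathlib.Path("collections/example").mkdir(parents=True, exist_ok=True)
with open("collections/example/docs00.json", "w") as f:
    f.write(json.dumps(doc) + "\n")
```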

docs/experiments-msmarco-v2-unicoil.md

+23 −34
````diff
@@ -1,4 +1,4 @@
-# Pyserini: uniCOIL w/ doc2query-T5 for MS MARCO V2
+# Pyserini: uniCOIL w/ doc2query-T5 on MS MARCO V2
 
 This page describes how to reproduce retrieval experiments with the uniCOIL model on the MS MARCO V2 collections.
 Details about our model can be found in the following paper:
@@ -30,30 +30,28 @@ To confirm, `msmarco_v2_passage_unicoil_noexp_0shot.tar` is 24 GB and has an MD5
 Index the sparse vectors:
 
 ```bash
-python -m pyserini.index \
+python -m pyserini.index.lucene \
   --collection JsonVectorCollection \
   --input collections/msmarco_v2_passage_unicoil_noexp_0shot \
   --index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
   --generator DefaultLuceneDocumentGenerator \
   --threads 32 \
-  --impact \
-  --pretokenized
+  --impact --pretokenized
 ```
 
 > If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-passage-unicoil-noexp-0shot` in the command below.
 
 Sparse retrieval with uniCOIL:
 
 ```bash
-python -m pyserini.search \
+python -m pyserini.search.lucene \
   --topics msmarco-v2-passage-dev \
   --encoder castorini/unicoil-noexp-msmarco-passage \
   --index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
   --output runs/run.msmarco-v2-passage.unicoil-noexp.0shot.txt \
-  --impact \
+  --batch 144 --threads 36 \
   --hits 1000 \
-  --batch 144 \
-  --threads 36
+  --impact
 ```
 
 To evaluate, using `trec_eval`:
@@ -91,28 +89,26 @@ To confirm, `msmarco_v2_passage_unicoil_0shot.tar` is 41 GB and has an MD5 check
 Index the sparse vectors:
 
 ```bash
-python -m pyserini.index \
+python -m pyserini.index.lucene \
   --collection JsonVectorCollection \
   --input collections/msmarco_v2_passage_unicoil_0shot \
   --index indexes/lucene-index.msmarco-v2-passage.unicoil-0shot \
   --generator DefaultLuceneDocumentGenerator \
   --threads 32 \
-  --impact \
-  --pretokenized
+  --impact --pretokenized
 ```
 
 Sparse retrieval with uniCOIL:
 
 ```bash
-python -m pyserini.search \
+python -m pyserini.search.lucene \
   --topics msmarco-v2-passage-dev \
   --encoder castorini/unicoil-msmarco-passage \
   --index indexes/lucene-index.msmarco-v2-passage.unicoil-0shot \
   --output runs/run.msmarco-v2-passage.unicoil.0shot.txt \
-  --impact \
+  --batch 144 --threads 36 \
   --hits 1000 \
-  --batch 144 \
-  --threads 36
+  --impact
 ```
 
 To evaluate, using `trec_eval`:
@@ -152,32 +148,29 @@ To confirm, `msmarco_v2_doc_segmented_unicoil_noexp_0shot.tar` is 54 GB and has
 Index the sparse vectors:
 
 ```bash
-python -m pyserini.index \
+python -m pyserini.index.lucene \
   --collection JsonVectorCollection \
   --input collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot \
   --index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
   --generator DefaultLuceneDocumentGenerator \
   --threads 32 \
-  --impact \
-  --pretokenized
+  --impact --pretokenized
 ```
 
 > If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-doc-per-passage-unicoil-noexp-0shot` in the command below.
 
 Sparse retrieval with uniCOIL:
 
 ```bash
-python -m pyserini.search \
+python -m pyserini.search.lucene \
   --topics msmarco-v2-doc-dev \
   --encoder castorini/unicoil-noexp-msmarco-passage \
   --index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
   --output runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt \
-  --impact \
+  --batch 144 --threads 36 \
   --hits 10000 \
-  --batch 144 \
-  --threads 36 \
-  --max-passage-hits 1000 \
-  --max-passage
+  --max-passage --max-passage-hits 1000 \
+  --impact
 ```
 
 For the document corpus, since we are searching the segmented version, we retrieve the top 10k _segments_ and perform MaxP to obtain the top 1000 _documents_.
@@ -217,30 +210,27 @@ To confirm, `msmarco_v2_doc_segmented_unicoil_0shot.tar` is 62 GB and has an MD5
 Index the sparse vectors:
 
 ```bash
-python -m pyserini.index \
+python -m pyserini.index.lucene \
   --collection JsonVectorCollection \
   --input collections/msmarco_v2_doc_segmented_unicoil_0shot \
   --index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil.0shot \
   --generator DefaultLuceneDocumentGenerator \
   --threads 32 \
-  --impact \
-  --pretokenized
+  --impact --pretokenized
 ```
 
 Sparse retrieval with uniCOIL:
 
 ```bash
-python -m pyserini.search \
+python -m pyserini.search.lucene \
   --topics msmarco-v2-doc-dev \
   --encoder castorini/unicoil-msmarco-passage \
   --index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil.0shot \
   --output runs/run.msmarco-doc-v2-segmented.unicoil.0shot.txt \
-  --impact \
+  --batch 144 --threads 36 \
  --hits 10000 \
-  --batch 144 \
-  --threads 36 \
-  --max-passage-hits 1000 \
-  --max-passage
+  --max-passage --max-passage-hits 1000 \
+  --impact
 ```
 
 For the document corpus, since we are searching the segmented version, we retrieve the top 10k _segments_ and perform MaxP to obtain the top 1000 _documents_.
@@ -259,7 +249,6 @@ recall_100 all 0.7556
 recall_1000 all 0.9056
 ```
 
-
 ## Reproduction Log[*](reproducibility.md)
 
 + Results reproduced by [@lintool](https://github.com/lintool) on 2021-08-13 (commit [`2b96b9`](https://github.com/castorini/pyserini/commit/2b96b99773302315e4d7dbe4a373b36b3eadeaa6))
````
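
Both document runs above retrieve 10,000 segments and rely on `--max-passage --max-passage-hits 1000` to turn them into a document ranking. Here is a minimal sketch of that MaxP step, assuming segment ids of the form `<docid>#<segment>` as in the segmented MS MARCO V2 corpus:

```python
# Minimal MaxP sketch (assumption: segment ids look like "<docid>#<segment index>").
# Keep each document's best-scoring segment, then truncate to the top 1000 documents.
from collections import defaultdict

def maxp(segment_hits: list[tuple[str, float]], k: int = 1000) -> list[tuple[str, float]]:
    best: dict[str, float] = defaultdict(lambda: float("-inf"))
    for seg_id, score in segment_hits:
        docid = seg_id.split("#")[0]            # strip the segment suffix
        best[docid] = max(best[docid], score)   # MaxP: a doc scores as its best segment
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]

# e.g. maxp([("msmarco_doc_00_0#3", 12.1), ("msmarco_doc_00_0#1", 9.7), ("msmarco_doc_01_42#0", 11.5)])
# -> [("msmarco_doc_00_0", 12.1), ("msmarco_doc_01_42", 11.5)]
```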

docs/experiments-spladev2.md

+30 −26
````diff
@@ -1,4 +1,4 @@
-# Pyserini: SPLADEv2 for MS MARCO V1 Passage Ranking
+# Pyserini: SPLADEv2 on MS MARCO V1 Passage Ranking
 
 This page describes how to reproduce with Pyserini the DistilSPLADE-max experiments in the following paper:
 
@@ -7,8 +7,6 @@ This page describes how to reproduce with Pyserini the DistilSPLADE-max experime
 Here, we start with a version of the MS MARCO passage corpus that has already been processed with SPLADE, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved. As SPLADE weights are given in fp16, they have been converted to integer by taking the round of weight*100.
 
-Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-splade-v2.md) based on Java.
-
 ## Data Prep
 
 > You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
@@ -24,21 +22,23 @@ wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msm
 tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/
 ```
 
-To confirm, `msmarco-passage-distill-splade-max.tar` is ~9.8 GB and has MD5 checksum `95b89a7dfd88f3685edcc2d1ffb120d1`.
+To confirm, `msmarco-passage-distill-splade-max.tar` is 9.9 GB and has MD5 checksum `95b89a7dfd88f3685edcc2d1ffb120d1`.
 
 ## Indexing
 
 We can now index these documents:
 
 ```bash
-python -m pyserini.index -collection JsonVectorCollection \
-  -input collections/msmarco-passage-distill-splade-max \
-  -index indexes/lucene-index.msmarco-passage.distill-splade-max \
-  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-  -threads 12
+python -m pyserini.index.lucene \
+  --collection JsonVectorCollection \
+  --input collections/msmarco-passage-distill-splade-max \
+  --index indexes/lucene-index.msmarco-passage-distill-splade-max \
+  --generator DefaultLuceneDocumentGenerator \
+  --threads 12 \
+  --impact --pretokenized
 ```
 
-The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the SPLADEv2 tokens.
+The important indexing options to note here are `--impact --pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the SPLADEv2 tokens.
 
 Upon completion, we should have an index with 8,841,823 documents.
 The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 30 minutes.
@@ -61,23 +61,25 @@ The MD5 checksum of the topics file is `621a58df9adfbba8d1a23e96d8b21cb7`.
 We can now run retrieval:
 
 ```bash
-python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz \
-  --index indexes/lucene-index.msmarco-passage.distill-splade-max \
-  --output runs/run.msmarco-passage.distill-splade-max.tsv \
-  --impact \
-  --hits 1000 --batch 36 --threads 12 \
-  --output-format msmarco
+python -m pyserini.search.lucene \
+  --index indexes/lucene-index.msmarco-passage-distill-splade-max \
+  --topics collections/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz \
+  --output runs/run.msmarco-passage-distill-splade-max.tsv \
+  --output-format msmarco \
+  --batch 36 --threads 12 \
+  --hits 1000 \
+  --impact
 ```
 
-Note that the important option here is `-impact`, where we specify impact scoring.
+Note that the important option here is `--impact`, where we specify impact scoring.
 A complete run can take around half an hour.
 
 *Note from authors*: We are still investigating why it takes so long using Pyserini, while the same model (including distilbert query encoder forward pass in CPU) takes only **10 minutes** on similar hardware using a numba implementation for the inverted index and using sequential processing (only one query at a time).
 
 The output is in MS MARCO output format, so we can directly evaluate:
 
 ```bash
-python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.distill-splade-max.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-distill-splade-max.tsv
 ```
 
 The results should be as follows:
@@ -104,19 +106,21 @@ mv distilsplade_max distill-splade-max
 Then run retrieval with `--encoder distill-splade-max`:
 
 ```bash
-python -m pyserini.search --topics msmarco-passage-dev-subset \
-  --index indexes/lucene-index.msmarco-passage.distill-splade-max \
-  --encoder distill-splade-max \
-  --output runs/run.msmarco-passage.distill-splade-max.tsv \
-  --impact \
-  --hits 1000 --batch 36 --threads 12 \
-  --output-format msmarco
+python -m pyserini.search.lucene \
+  --index indexes/lucene-index.msmarco-passage-distill-splade-max \
+  --topics msmarco-passage-dev-subset \
+  --encoder distill-splade-max \
+  --output runs/run.msmarco-passage-distill-splade-max.tsv \
+  --output-format msmarco \
+  --batch 36 --threads 12 \
+  --hits 1000 \
+  --impact
 ```
 
 And then evaluate:
 
 ```bash
-python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.distill-splade-max.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-distill-splade-max.tsv
 ```
 
 The results should be something along these lines:
````
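
On the note above that the fp16 SPLADE weights were converted to integers by taking the round of weight*100, a small sketch of that quantization step (term names and weights are made up; dropping terms that round to zero is an extra assumption, not something the guide states):

```python
# Hypothetical quantization of SPLADE term weights to integer impacts, round(weight * 100),
# matching the note in the guide above.
def quantize(term_weights: dict[str, float]) -> dict[str, int]:
    impacts = {term: round(w * 100) for term, w in term_weights.items()}
    return {term: w for term, w in impacts.items() if w > 0}  # drop zero-impact terms (assumed)

print(quantize({"vaccine": 1.3672, "immunity": 0.8021, "##ine": 0.0031}))
# {'vaccine': 137, 'immunity': 80}
```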
