docs/experiments-deepimpact.md (+23 -23)
@@ -1,4 +1,4 @@
-# Pyserini: DeepImpact for MS MARCO V1 Passage Ranking
+# Pyserini: DeepImpact on MS MARCO V1 Passage Ranking
This page describes how to reproduce the DeepImpact experiments in the following paper:
@@ -7,8 +7,6 @@ This page describes how to reproduce the DeepImpact experiments in the following
Here, we start with a version of the MS MARCO passage corpus that has already been processed with DeepImpact, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.
-Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-deepimpact.md) based on Java.
## Data Prep
> You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
@@ -17,28 +15,28 @@ We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with DeepImpact processing:
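A minimal sketch of this step, assuming a `wget`-and-`tar` workflow; the URL and archive name below are placeholders, not the actual ones from the full guide:

```bash
# Placeholder URL and archive name -- substitute the ones given in the full guide.
wget https://example.com/msmarco-passage-deepimpact.tar -P collections/
tar xvf collections/msmarco-passage-deepimpact.tar -C collections/
```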
-The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the DeepImpact tokens.
+The important indexing options to note here are `--impact --pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the DeepImpact tokens.
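For reference, a sketch of the kind of indexing invocation these options belong to, assuming the `pyserini.index.lucene` entry point; the input and index paths are placeholders, not taken from this diff:

```bash
# Sketch only: input and index paths are placeholders.
# --impact and --pretokenized are the options discussed above.
python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco-passage-deepimpact \
  --index indexes/lucene-index.msmarco-passage-deepimpact \
  --generator DefaultLuceneDocumentGenerator \
  --threads 12 \
  --impact --pretokenized
```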
Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 15 minutes.
@@ -48,7 +46,7 @@ The indexing speed may vary; on a modern desktop with an SSD (using 12 threads,
To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries.
First, fetch the MS MARCO passage ranking dev set queries:
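A sketch of the fetch, assuming a gzipped TSV of pre-tokenized queries; the URL and file name are placeholders, not the actual ones from the guide:

```bash
# Placeholder URL and file name for the pre-tokenized dev queries.
wget https://example.com/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz -P collections/
gunzip collections/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz
```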
> If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-passage-unicoil-noexp-0shot` in the command below.
> If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-doc-per-passage-unicoil-noexp-0shot` in the command below.
@@ -217,30 +210,27 @@ To confirm, `msmarco_v2_doc_segmented_unicoil_0shot.tar` is 62 GB and has an MD5
For the document corpus, since we are searching the segmented version, we retrieve the top 10k _segments_ and perform MaxP to obtain the top 1000 _documents_.
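As a rough sketch of such a run, assuming `pyserini.search.lucene` exposes MaxP aggregation via `--max-passage`/`--max-passage-hits`; the index, topics, and output paths are placeholders:

```bash
# Sketch only: retrieve 10k segments per query, then aggregate to the
# top 1000 documents with MaxP. Index, topics, and output paths are placeholders.
python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-v2-doc-segmented \
  --topics topics.msmarco-v2-doc.dev.tsv \
  --output runs/run.msmarco-v2-doc.dev.txt \
  --impact --pretokenized \
  --hits 10000 --max-passage --max-passage-hits 1000
```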
@@ -259,7 +249,6 @@ recall_100 all 0.7556
recall_1000 all 0.9056
```
## Reproduction Log[*](reproducibility.md)
+ Results reproduced by [@lintool](https://github.com/lintool) on 2021-08-13 (commit [`2b96b9`](https://github.com/castorini/pyserini/commit/2b96b99773302315e4d7dbe4a373b36b3eadeaa6))
docs/experiments-spladev2.md (+30 -26)
@@ -1,4 +1,4 @@
-# Pyserini: SPLADEv2 for MS MARCO V1 Passage Ranking
+# Pyserini: SPLADEv2 on MS MARCO V1 Passage Ranking
This page describes how to reproduce with Pyserini the DistilSPLADE-max experiments in the following paper:
@@ -7,8 +7,6 @@ This page describes how to reproduce with Pyserini the DistilSPLADE-max experime
Here, we start with a version of the MS MARCO passage corpus that has already been processed with SPLADE, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved. As SPLADE weights are given in fp16, they have been converted to integers by rounding `weight * 100`.
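For illustration only (the weight value below is made up), the quantization amounts to:

```bash
# An fp16 SPLADE weight w is stored as the integer round(w * 100).
python -c "w = 1.2734; print(round(w * 100))"   # prints 127
```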
-Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-splade-v2.md) based on Java.
## Data Prep
> You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
-The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the SPLADEv2 tokens.
+The important indexing options to note here are `--impact --pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the SPLADEv2 tokens.
Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 30 minutes.
@@ -61,23 +61,25 @@ The MD5 checksum of the topics file is `621a58df9adfbba8d1a23e96d8b21cb7`.
-Note that the important option here is `-impact`, where we specify impact scoring.
+Note that the important option here is `--impact`, where we specify impact scoring.
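A sketch of the retrieval invocation this option belongs to, assuming `pyserini.search.lucene` with on-the-fly query encoding via `--encoder`; the index path, output path, and encoder name are assumptions rather than values from this diff, and exact flags may differ across Pyserini versions:

```bash
# Sketch only: index/output paths and the encoder name are assumptions.
python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-passage-distill-splade-max \
  --topics msmarco-passage-dev-subset \
  --encoder naver/splade_v2_max \
  --output runs/run.msmarco-passage.distill-splade-max.tsv \
  --output-format msmarco \
  --impact \
  --hits 1000 --batch-size 36 --threads 12
```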
A complete run can take around half an hour.
*Note from authors*: We are still investigating why this takes so long with Pyserini, while the same model (including the DistilBERT query encoder forward pass on CPU) takes only **10 minutes** on similar hardware using a Numba implementation of the inverted index and sequential processing (one query at a time).
The output is in MS MARCO output format, so we can directly evaluate:
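A sketch of the evaluation step, assuming Pyserini's bundled MS MARCO passage evaluation wrapper and a placeholder run file name:

```bash
# Placeholder run file name; the first argument names the dev-subset qrels.
python -m pyserini.eval.msmarco_passage_eval \
  msmarco-passage-dev-subset \
  runs/run.msmarco-passage.distill-splade-max.tsv
```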