1 Introduction

The assumption that a query unambiguously defines the user’s information need does not always hold in a Web search scenario (Spärck-Jones et al. 2007; Sanderson 2008). Typical user queries bear some degree of ambiguity (Song et al. 2009). While truly ambiguous queries (e.g., ‘office’) are open to different interpretations (e.g., ‘business room’, ‘software suite’, ‘tv show’), queries with a clearly defined interpretation (e.g., ‘the office tv show’) might still be open to different aspects of this interpretation (e.g., ‘schedule’, ‘episode guide’, ‘cast’) (Clarke et al. 2008). An effective approach for tackling ambiguous queries is to diversify the search results, so as to maximise the chance that different users will find at least one relevant result to their particular need (Agrawal et al. 2009).

A classical diversification strategy consists in comparing the retrieved results to one another, in order to promote novelty in the ranking (Carbonell and Goldstein 1998; Zhai et al. 2003; Wang and Zhu 2009; Rafiei et al. 2010). In particular, novelty-based diversification approaches implicitly assume that different results will cover different aspects of the query, and hence should be promoted in the ranking. While classical approaches deploy novelty as their sole ranking strategy, the state-of-the-art approaches deploy a hybrid strategy. In particular, the latter approaches seek to promote not only novel search results, but also results with a high coverage of the aspects underlying the initial query (Agrawal et al. 2009; Carterette and Chandar 2009; Santos et al. 2010a, c). This strategy is enabled by an explicit representation of the query aspects, in contrast to the implicit aspect representation adopted by the existing novelty-based approaches.

Unfortunately, the prevalence of different aspect representations has precluded a direct comparison between coverage and novelty as diversification strategies. As a result, it remains unclear whether the striking difference in performance commonly observed between coverage- and novelty-based approaches is due to their underlying aspect representation (explicit vs. implicit) or to their diversification strategy (coverage vs. novelty). It is also unclear how much novelty actually contributes to the effectiveness of the current state-of-the-art approaches, given that it penalises their efficiency: unlike novelty, the coverage of a search result is estimated independently of the other retrieved results. Although intuitive, novelty has yet to be shown effective for diversifying Web search results. In particular, existing evidence of the effectiveness of novelty as a diversification strategy is based either on qualitative studies (Carbonell and Goldstein 1998) or on curated corpora, such as Wikipedia (Rafiei et al. 2010) or newswire (Wang and Zhu 2009).

In this paper, we challenge the common view of novelty as an intuitive diversification strategy. To this end, we thoroughly investigate the role of this strategy in light of both classical as well as state-of-the-art diversification approaches in the literature. To enable our investigation, we adapt two existing novelty-based diversification approaches to leverage explicit query aspect representations. Likewise, we produce coverage-only versions of two state-of-the-art approaches that deploy a hybrid of coverage and novelty strategies. By doing so, we bridge the gap between the diversification approaches in the literature and enable their evaluation in terms of the aspect representation and the diversification strategy dimensions. As a result, we provide the first comprehensive account of the role of novelty as a ranking strategy for diversifying Web search results.

Using the evaluation framework provided by the diversity task of the TREC 2009 and 2010 Web tracks (Clarke et al. 2009, 2010), we empirically show that novelty cannot consistently improve over a standard, non-diversified baseline ranking. When leveraging explicit aspect representations (including a ‘ground-truth’ aspect representation), we show that novelty-based approaches can be improved, but are still not significantly more effective than a non-diversified ranking. On the diversification strategy dimension, we find that novelty does not contribute significantly to the coverage-based strategy deployed by the current state-of-the-art, suggesting that the efficiency overhead added by promoting novelty does not pay off. Finally, through a rigorous analysis based on simulated rankings of various quality, we demonstrate that, under special conditions, novelty can still play a role in breaking ties between results with similar coverage.

In summary, the major contributions of this paper are:

  1.

    A unifying framework to enable the direct comparison of existing diversification approaches across the aspect representation and diversification strategy dimensions;

  2.

    A thorough investigation of the impact of different aspect representations and diversification strategies for search result diversification;

  3.

    A comprehensive analysis of the role of novelty as a diversification strategy, under a range of empirical and simulated relevance scenarios.

The remainder of this paper is organised as follows. In Sect. 2, we provide background on search result diversification and on representative approaches from the literature. In Sect. 3, we describe the methodology that supports our investigations. Our experimental setup is detailed in Sect. 4, while Sects. 5 and 6 discuss the results of our investigation, based on empirical and simulated experiments, respectively. Finally, in Sect. 7, we present our concluding remarks and directions for future research.

2 Background and related work

In 1964, Goffman (1964) pointed out that ‘the relationship between a document and a query is necessary but not sufficient to determine relevance’. Later, in 1991, Gordon and Lenk (1991) discussed two assumptions underlying the probability ranking principle (Cooper 1971; Robertson 1977), namely, that relevance is determined with certainty, and that documents are judged relevant or not independently of one another. Since then, several approaches have been proposed to overcome these limiting assumptions. Among these, search result diversification tackles the uncertainty of relevance estimates, primarily resulting from query ambiguity, by promoting documents with maximum coverage of the possible aspects underlying a query. Additionally, it accounts for the dependent relevance of documents by promoting those documents with maximum novelty with respect to the already selected ones.

The diversification approaches in the literature can be classified according to two complementary dimensions: aspect representation and diversification strategy (Santos et al. 2010c). The aspect representation determines how a document is described in light of the several aspects underlying a query. In particular, an implicit representation describes a document regardless of the query aspects, based on features intrinsic to the document (e.g., the terms it contains). In turn, an explicit representation describes how well a document covers the query aspects, where each aspect can be itself represented in a variety of ways. For instance, different aspects can represent different query classes according to a predefined taxonomy (Agrawal et al. 2009) or different topics covered by the retrieved documents (Carterette and Chandar 2009). More generally, different aspects can represent multiple information needs underlying the query, e.g., as different query reformulations (Radlinski and Dumais 2006; Santos et al. 2010a).

Complementarily to the aspect representation, the diversification strategy determines how a diversification approach achieves the goal of satisfying different aspects of a query. Coverage-based approaches achieve this goal by directly estimating how well each document covers each aspect of the query, regardless of the other retrieved documents. Alternative estimates of coverage depend on the adopted aspect representation and include classification confidence (Agrawal et al. 2009), topicality (Carterette and Chandar 2009), and relevance (Santos et al. 2010a, c). A different diversification strategy exploits the relationships among the retrieved documents. In particular, novelty-based approaches directly compare the retrieved documents to one another, in order to promote those that convey novel information (i.e., information not conveyed by the other retrieved documents). Existing approaches differ mostly in how they identify novel information. For instance, novelty can be estimated based on content dissimilarity (Carbonell and Goldstein 1998), divergence (Zhai et al. 2003), conditioned relevance (Chen and Karger 2006), or relevance score correlation (Rafiei et al. 2010; Wang and Zhu 2009).

Table 1 organises the most representative diversification approaches in the literature according to the aspect representation and diversification strategy dimensions. In particular, coverage (Carterette and Chandar 2009) and hybrid (i.e., coverage + novelty) approaches (Santos et al. 2010a, c) have been shown to substantially outperform pure novelty-based ones. On the other hand, to the best of our knowledge, novelty has only been tested on qualitative studies (Carbonell and Goldstein 1998) or on curated corpora such as Wikipedia (Rafiei et al. 2010) or newswire (Wang and Zhu 2009), with its effectiveness in a Web search result diversification scenario yet to be proven. Moreover, while hybrid approaches constitute the current state-of-the-art (Clarke et al. 2009; Clarke et al. 2010), it is unclear how much of their effectiveness comes from also promoting novelty. To address these questions, Sect. 3 describes our research methodology. The results of our thorough experimentation are discussed in Sects. 5 and 6 and unveil the role of novelty as a diversification strategy.

Table 1 An overview of representative search result diversification approaches in the literature, organised in terms of two dimensions: diversification strategy and query aspect representation

3 Bridging the gap

The objectives of search result diversification are two-fold: (1) to maximise the number of query aspects covered in the ranking, and (2) to avoid excessive redundancy among the covered aspects. Finding a subset of the retrieved documents with maximum coverage (or, similarly, minimum redundancy) with respect to the query aspects is an instance of the Maximum Coverage Problem (Hochbaum 1997), and is therefore NP-hard (Agrawal et al. 2009). Most of the diversification approaches in the literature deploy a greedy approximation algorithm for this problem. From an initial ranking \(\mathcal{R}\), this algorithm builds a ranking \(\mathcal{S}\), by iteratively selecting a document d* such that:

$$ d^{\ast}=\mathop{\hbox{arg max}}\limits_{d \in {\mathcal{R}} \setminus {\mathcal{S}}} \mathit{score}(d,q,{\mathcal{A}},{\mathcal{S}}), $$
(1)

where \(\mathit{score}(d,q,\mathcal{A},\mathcal{S})\) is typically computed as a trade-off between the estimated relevance of d given the query q, and the diversity of d given some representation of the aspects \(\mathcal{A}\) underlying q and the documents in \(\mathcal{S}\), which were selected in the previous iterations of the algorithm (Santos et al. 2010b).
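As an illustration, the greedy procedure can be sketched as follows; the toy scoring function, which simply rewards aspects not yet covered by the selected set, is a hypothetical stand-in for any concrete instantiation of Eq. 1:

```python
def greedy_diversify(R, score, q, A, depth=None):
    """Greedy approximation for Eq. 1: repeatedly move the document that
    maximises score(d, q, A, S) from the candidate set into the ranking S."""
    S, candidates = [], list(R)
    depth = depth if depth is not None else len(candidates)
    while candidates and len(S) < depth:
        best = max(candidates, key=lambda d: score(d, q, A, S))
        candidates.remove(best)
        S.append(best)
    return S

# Toy instantiation: documents are modelled as sets of covered aspects, and a
# document's score is the number of aspects it covers that S does not cover yet.
def toy_score(d, q, A, S):
    covered = set().union(*S) if S else set()
    return len(d - covered)

docs = [{"a"}, {"a", "b"}, {"c"}]
print(greedy_diversify(docs, toy_score, q=None, A=None))
# the document covering two new aspects is selected first
```

Note that each iteration re-scores every remaining candidate against the current \(\mathcal{S}\), which is precisely the cost that a coverage-only strategy avoids.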

Although they share the goal of producing a diverse ranking, coverage- and novelty-based approaches implement the above objective function in different ways. While purely coverage-based approaches typically ignore the set of already selected documents \(\mathcal{S}\), existing novelty-based approaches ignore the set of query aspects \(\mathcal{A}\). In practice, this renders coverage and novelty, as implemented by existing approaches, not directly comparable. In this section, we describe our methodology to bridge the gap between these approaches and enable their direct comparison. Besides evaluating novelty in contrast to and in combination with coverage, our goal is to isolate these strategies from their underlying aspect representation, so as to provide a controlled setting for our investigations. To this end, in Sect. 3.1, we propose adaptations of two implicit novelty-based diversification approaches to leverage explicit aspect representations. Additionally, in Sect. 3.2, we deconstruct two explicit hybrid approaches to deploy a coverage-based strategy only.

3.1 Explicit novelty-based diversification

Existing novelty-based diversification approaches rely on an implicit aspect representation to estimate the diversity of a document with respect to the other retrieved documents (Carbonell and Goldstein 1998; Zhai et al. 2003; Wang and Zhu 2009). As a result, these approaches compare documents purely on the basis of their content, rather than based on how they satisfy different query aspects. Moreover, the resulting document representation (e.g., in the term-frequency space of a given corpus) is usually high-dimensional, which negatively impacts both the effectiveness and the efficiency of these approaches (Manning et al. 2008). To counter these limitations and—more importantly for this work—to enable a direct comparison of existing diversification approaches across both the aspect representation and the diversification strategy dimensions, we propose to leverage explicit aspect representations for estimating novelty. Besides providing a more expressive account of the relationship between documents and the aspects they cover, this representation also has a considerable impact on efficiency, since the feature space is reduced from the size of the corpus vocabulary (millions) to the number of aspects underlying a query (around a dozen).

Given a query q with a set of aspects \(\mathcal{A}\), with \(|\mathcal{A}| = k\), we explicitly represent each retrieved document \(d \in \mathcal{R}\) as a k-dimensional vector \(\mathbf{d}\) over \(\mathcal{A}\). In particular, the m-th dimension of the vector \(\mathbf{d}\) is defined as:

$$ \mathbf{d}_m = f(d,a_m), $$
(2)

where the function f estimates how well the document d satisfies the aspect \(a_m \in \mathcal{A}\). As we will show in Sect. 4.3, different measures of the document-aspect association can be used, depending on how the aspects are identified, e.g., based on reformulations mined from a query log or on categories derived from a classification taxonomy.
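To make Eq. 2 concrete, the sketch below builds such a vector using a simple term-overlap measure as a hypothetical choice of f; the instantiations of f actually used in our experiments (e.g., retrieval scores or classification confidence) are described in Sect. 4.3:

```python
def term_overlap(doc_terms, aspect_terms):
    """A hypothetical document-aspect association f(d, a_m): the fraction of
    the aspect's terms that occur in the document."""
    aspect = set(aspect_terms)
    return len(aspect & set(doc_terms)) / len(aspect) if aspect else 0.0

def explicit_vector(doc_terms, aspects):
    """Eq. 2: a k-dimensional representation of the document, with one
    dimension per query aspect, valued f(d, a_m)."""
    return [term_overlap(doc_terms, a.split()) for a in aspects]

doc = "the office tv show episode guide and schedule".split()
print(explicit_vector(doc, ["episode guide", "cast"]))
```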

Regardless of the particular mechanism used to identify the aspects of a query, an explicit representation of documents with respect to these aspects can be seamlessly integrated into existing novelty-based diversification approaches. In particular, to enable our analysis in Sects. 5 and 6, we derive explicit versions of two well-known novelty-based approaches in the literature, namely, Maximal Marginal Relevance (MMR, Carbonell and Goldstein 1998) and Mean-Variance Analysis (MVA, Wang and Zhu 2009).

MMR (Carbonell and Goldstein 1998) instantiates the scoring function in Eq. 1 by estimating the similarity between \(d \in \mathcal{R} \setminus \mathcal{S}\) and its most similar document \(d_j \in \mathcal{S}\). Likewise, we devise xMMR (Explicit Maximal Marginal Relevance) to estimate novelty over explicit representations of the retrieved documents:

$$ \mathit{score}_{\rm xMMR} (d,q,{\mathcal{A}},{\mathcal{S}})= \lambda \hbox{sim}_1(d,q) - (1-\lambda) \max_{\mathbf{d}_j \in {\mathcal{S}}} \hbox{sim}_2(\mathbf{d},\mathbf{d}_j), $$
(3)

where \(\hbox{sim}_1(d,q)\) and \(\hbox{sim}_2(\mathbf{d},\mathbf{d}_j)\) estimate the relevance of d to the query q and its similarity to the documents already in \(\mathcal{S}\), respectively. A balance between relevance (i.e., \(\hbox{sim}_1\)) and redundancy (i.e., \(\max \hbox{sim}_2\), the opposite of novelty) is achieved through an appropriate setting of λ, as will be described in Sect. 4.5. In our experiments, \(\hbox{sim}_1(d,q)\) is estimated by a standard retrieval model. Following Carbonell and Goldstein (1998), we compute \(\hbox{sim}_2(\mathbf{d},\mathbf{d}_j)\) as the cosine between the explicit representations of d and \(d_j\) over the aspects \(\mathcal{A}\).
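A minimal sketch of the xMMR objective (Eq. 3), assuming the aspect vectors of Eq. 2 and taking the relevance score sim1 as given:

```python
import math

def cosine(u, v):
    """Cosine similarity between two aspect vectors (sim2 in Eq. 3)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score_xmmr(rel, d_vec, S_vecs, lam):
    """Eq. 3: trade off relevance against the similarity to the most similar
    document already selected (i.e., the redundancy of d given S)."""
    redundancy = max((cosine(d_vec, s) for s in S_vecs), default=0.0)
    return lam * rel - (1 - lam) * redundancy

# a document whose aspect vector matches a selected one is penalised
print(score_xmmr(1.0, [1, 0], [[1, 0]], lam=0.5))  # fully redundant
print(score_xmmr(1.0, [0, 1], [[1, 0]], lam=0.5))  # fully novel
```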

Analogously to MMR, MVA (Wang and Zhu 2009) instantiates Eq. 1 by trading off relevance and redundancy. However, instead of computing the similarity between documents, MVA estimates the redundancy of a document based on how its relevance scores correlate to those of the other documents. Accordingly, we devise xMVA (Explicit Mean-Variance Analysis) to estimate these correlations based on how well the documents satisfy the explicitly represented query aspects. The objective function of xMVA is defined according to the following equation:

$$ \mathit{score}_{\rm xMVA}(d,q,{\mathcal{A}},{\mathcal{S}})=\mu_{(d)} - b w_i \sigma^2_{(d)}-2b \sum_{d_j \in {\mathcal{S}}} w_j \sigma_{(d_j)} \sigma_{(d)} \rho_{(\mathbf{d},\mathbf{d}_j)}, $$
(4)

where \(\mu_{(d)}\) and \(\sigma^2_{(d)}\) represent the mean and variance of the relevance estimates associated with document d, respectively, while the summation component estimates the redundancy of document d in light of the documents in \(\mathcal{S}\). In particular, documents are compared in terms of their correlation \(\rho_{(\mathbf{d},\mathbf{d}_j)}\). A balance between relevance, variance, and redundancy is achieved through the parameter b. Following Wang and Zhu (2009), \(\mu_{(d)}\) is estimated by a standard retrieval model, with relevance scores normalised to yield a probability distribution. Additionally, \(\sigma_{(d)}\) is set as a constant for all documents. In our experiments, both σ and b are set through training, as will be described in Sect. 4.5. Finally, \(\rho_{(\mathbf{d},\mathbf{d}_j)}\) is estimated as Pearson’s correlation between the explicit representations of d and \(d_j\) over the aspects \(\mathcal{A}\).
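The xMVA objective (Eq. 4) can be sketched as follows, with σ held constant as above and, as a further simplifying assumption for this sketch, a single uniform rank weight w in place of the position-dependent weights:

```python
import math

def pearson(u, v):
    """Pearson's correlation between two aspect vectors (rho in Eq. 4)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0

def score_xmva(mu_d, d_vec, S_vecs, b, sigma, w=1.0):
    """Eq. 4: mean relevance, minus a variance penalty, minus a redundancy
    term driven by the correlation with already selected documents."""
    redundancy = sum(w * sigma * sigma * pearson(d_vec, s) for s in S_vecs)
    return mu_d - b * w * sigma ** 2 - 2 * b * redundancy

print(score_xmva(0.5, [1, 0, 1], [], b=0.5, sigma=0.1))           # no redundancy
print(score_xmva(0.5, [1, 0, 1], [[2, 0, 2]], b=0.5, sigma=0.1))  # correlated: lower
```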

3.2 Explicit coverage-based diversification

Besides making coverage and novelty directly comparable by introducing explicit novelty-based diversification approaches (i.e., xMMR and xMVA), we want to be able to assess the effectiveness of novelty when combined with coverage. To this end, we deconstruct two state-of-the-art diversification approaches, IA-Select (Agrawal et al. 2009) and xQuAD (Santos et al. 2010a), which deploy a hybrid of coverage and novelty. Our goal is to produce directly comparable versions of these approaches, which should deploy coverage as their only strategy.

IA-Select (Agrawal et al. 2009) was originally proposed to diversify the search results according to a predefined taxonomy, such as the one provided by the Open Directory Project (ODP). Its objective function is defined as:

$$ \mathit{score}_{\rm {IA - Select}}(d,q,{\mathcal{A}},{\mathcal{S}}) = \sum_{a_m \in {\mathcal{A}}} u(a_m|q,{\mathcal{S}}) v(d|q,a_m), $$
(5)

where the function u estimates the marginal utility of the query aspect \(a_m\) given the query q and the documents already selected in \(\mathcal{S}\), and the function v estimates the coverage of d with respect to q and \(a_m\). The marginal utility u incorporates both the relative importance of the aspect \(a_m\) in light of all aspects \(\mathcal{A}\) of the query q, and the current utility of \(a_m\), in light of the aspects already covered by the documents in \(\mathcal{S}\). In practice, the function u emulates a novelty component, by estimating how much the already selected documents satisfy each aspect of the query. To produce a coverage-only version of IA-Select, we assume that the query aspects do not lose their utility even if they are already covered by the documents in \(\mathcal{S}\). In practice, this is achieved simply by dropping the term \(\mathcal{S}\) in Eq. 5:

$$ \mathit{score}_{{\rm {IA - Select}}^{\ast}}(d,q,{\mathcal{A}},{\mathcal{S}})=\sum_{a_m \in {\mathcal{A}}} u(a_m|q) v(d|q,a_m). $$
(6)

To emphasise its difference from the standard IA-Select in Eq. 5, we call this coverage-only version IA-Select*.
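The contrast between Eqs. 5 and 6 can be sketched as follows; the multiplicative decay of aspect utility follows Agrawal et al. (2009), while the toy utilities and coverage values are purely illustrative:

```python
def utility(a, S, u0, v):
    """Marginal utility u(a_m | q, S) in Eq. 5, following the multiplicative
    update of Agrawal et al. (2009): an aspect loses utility as the documents
    already selected in S cover it."""
    remaining = u0[a]
    for dj in S:
        remaining *= 1.0 - v(dj, a)
    return remaining

def score_ia_select(d, aspects, S, u0, v):
    # Eq. 5: a hybrid of coverage (v) and novelty (the S-dependent utility u)
    return sum(utility(a, S, u0, v) * v(d, a) for a in aspects)

def score_ia_select_star(d, aspects, u0, v):
    # Eq. 6: coverage only -- the aspect utilities no longer depend on S
    return sum(u0[a] * v(d, a) for a in aspects)

# toy aspect importances and document-aspect coverage values
u0 = {"a": 0.5, "b": 0.5}
cov = {("d1", "a"): 0.9, ("d2", "a"): 0.8}
v = lambda d, a: cov.get((d, a), 0.0)
print(score_ia_select("d2", ["a", "b"], ["d1"], u0, v))  # utility of "a" has decayed
print(score_ia_select_star("d2", ["a", "b"], u0, v))     # utility of "a" is fixed
```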

Different from IA-Select, xQuAD (Santos et al. 2010a) implements the objective function in Eq. 1 as a mixture of probabilities:

$$ \mathit{score}_{\rm xQuAD}(d,q,{\mathcal{A}},{\mathcal{S}})= (1-\lambda) \hbox{P}_R(d|q) + \lambda \hbox{P}_D(d,\bar{{\mathcal{S}}}|q), $$
(7)

where \(\hbox{P}_R(d|q)\) denotes the probability of d being relevant given the query q and \({\hbox{P}_D(d,\bar{\mathcal{S}}|q)}\) denotes the probability of d, but none of the documents already selected in \(\mathcal{S}\), being diverse given q. These two probabilities are mixed using the parameter λ, which implements a trade-off between promoting relevant and diverse documents (Santos et al. 2010b). By marginalising over the possible aspects of q, the probability \({\hbox{P}_D(d,\bar{\mathcal{S}}|q)}\) can be further broken down as:

$$ \hbox{P}_D(d,\bar{{\mathcal{S}}}|q) = \sum_{a_m \in {\mathcal{A}}} \hbox{P}_D(a_m|q) \hbox{P}_D(d|q,a_m) \hbox{P}_D(\bar{{\mathcal{S}}}|q,a_m), $$
(8)

where \(\hbox{P}_D(a_m|q)\) denotes the importance of the aspect \(a_m\) given the query q, \(\hbox{P}_D(d|q,a_m)\) denotes the coverage of d given q and \(a_m\), and \({\hbox{P}_D(\bar{\mathcal{S}}|q,a_m)}\) denotes the novelty of any document satisfying \(a_m\), based on the probability that none of the documents in \(\mathcal{S}\) satisfy this aspect. Analogously to our adaptation of IA-Select, we introduce a coverage-only version of xQuAD by assuming that all query aspects retain their utility, regardless of the documents previously selected in \(\mathcal{S}\). In practice, this is achieved simply by dropping the novelty probability \({\hbox{P}_D(\bar{\mathcal{S}}|q,a_m)}\), which produces xQuAD*:

$$ \begin{aligned} \mathit{score}_{{\rm xQuAD}^{\ast}}(d,q,{\mathcal{A}},{\mathcal{S}}) &=(1-\lambda) \hbox{P}_R(d|q)\\ &\quad +\lambda \sum_{a_m \in {\mathcal{A}}} \hbox{P}_D(a_m|q) \hbox{P}_D(d|q,a_m). \end{aligned} $$
(9)

Note that, without a novelty component, the coverage-only objective functions of both IA-Select* (Eq. 6) and xQuAD* (Eq. 9) no longer require an iterative, greedy diversification strategy. In practice, for an initial ranking of n documents, this reduces the cost of computing Eq. 1 for each document from O(n) to O(1). In Sects. 5 and 6, we evaluate all of these approaches, in order to investigate the role of novelty when deployed in isolation, as well as when combined with coverage in a hybrid strategy.
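This reduction is easy to see in code: a coverage-only objective such as xQuAD* (Eq. 9) scores each document independently of the selected set, so the diversified ranking is obtained by a single sort rather than a greedy loop. The relevance and coverage values below are purely illustrative:

```python
def rank_xquad_star(R, rel, cov, aspect_weights, lam):
    """xQuAD* (Eq. 9): with the novelty probability dropped, each document's
    score is independent of S, so diversification reduces to a single sort."""
    def score(d):
        diversity = sum(w * cov(d, a) for a, w in aspect_weights.items())
        return (1 - lam) * rel[d] + lam * diversity
    return sorted(R, key=score, reverse=True)

# toy data: d1 is more relevant overall; d2 covers both aspects of the query
rel = {"d1": 0.9, "d2": 0.5}
cov_table = {("d2", "a"): 1.0, ("d2", "b"): 1.0}
cov = lambda d, a: cov_table.get((d, a), 0.0)
weights = {"a": 0.5, "b": 0.5}
print(rank_xquad_star(["d1", "d2"], rel, cov, weights, lam=0.0))  # relevance only
print(rank_xquad_star(["d1", "d2"], rel, cov, weights, lam=0.8))  # diversity dominates
```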

4 Experimental setup

In this section, we describe the setup that supports our investigations in Sects. 5 and 6. These investigations aim to answer the following questions:

  1.

    Is novelty an effective diversification strategy, and can it be improved with an explicit aspect representation?

  2.

    How does an explicit novelty strategy perform in contrast to and in combination with a coverage strategy?

  3.

    What is the role of novelty as a diversification strategy?

We address the first two research questions in Sect. 5. To answer the first question, we fix the diversification strategy dimension to novelty, in order to evaluate the impact of different aspect representations. Conversely, to tackle the second question, we fix the aspect representation dimension to different explicit representations and measure the effectiveness of novelty in contrast to and in combination with coverage. Finally, to provide further insights into the role of novelty as a search result diversification strategy, in Sect. 6, we answer the third question, by thoroughly evaluating this strategy with simulated rankings of various quality. The remainder of this section describes the experimental setup that supports all these investigations.

4.1 Collection and topics

Our investigations are conducted within the standard experimentation framework of the diversity task of the TREC 2009 and 2010 Web tracks (Clarke et al. 2009, 2010), henceforth referred to as WT09 and WT10 tasks, respectively. These tasks provide a total of 98 queries (50 for WT09, 48 for WT10), sampled from the query log of a commercial search engine. For each query, TREC assessors identified multiple sub-topics, representing different aspects of the initial query, with relevance assessments conducted at the sub-topic level. As the document corpus, we use the category-B subset of the TREC ClueWeb09 corpus (henceforth ClueWeb09 B), as used in the WT09 and WT10 tasks. In our experiments, this 50-million Web document corpus is indexed using Terrier, with Porter’s stemmer and standard stopword removal.

4.2 Retrieval approaches

To verify the consistency of our results, we experiment with several retrieval approaches under a uniform setting. As an ad hoc retrieval approach, which does not perform diversification, we use the Divergence From Randomness DPH model (Amati et al. 2007). Besides being effective, DPH is a parameter-free probabilistic model, and hence requires no training. On top of DPH, we experiment with diversification approaches representative of the novelty and coverage strategies. In particular, these approaches directly leverage the scores produced by DPH as their underlying ‘relevance’ estimation, as discussed in Sect. 3. As novelty-based approaches, we use MMR (Carbonell and Goldstein 1998) and MVA (Wang and Zhu 2009), as well as their explicit variants, xMMR and xMVA, introduced in Sect. 3.1. As coverage-based approaches, we consider our variants IA-Select* and xQuAD*, from Sect. 3.2. Their standard versions, namely, IA-Select (Agrawal et al. 2009) and xQuAD (Santos et al. 2010a), are used as hybrid approaches, and are representative of the state-of-the-art. Indeed, an instance of xQuAD attained the top performance in the diversity task of the TREC 2009 and 2010 Web tracks (cat. B) (Clarke et al. 2009; Clarke et al. 2010). Following Zhai et al. (2003), to cope with the quadratic number of pairwise comparisons performed by novelty-based approaches, all novelty-based, coverage-based, and hybrid approaches are applied to diversify the top 100 documents retrieved by DPH.

4.3 Aspect representations

To analyse the impact of different aspect representations, we compare a traditional implicit representation of documents in the space of the terms in the ClueWeb09 B corpus to four explicit aspect representations, described in the remainder of this section. Additionally, Table 2 summarises these explicit representations in terms of the average number of aspects identified for the WT09 and WT10 queries. For keyword-based aspect representations (i.e., WS, WR, and WT in Table 2), we also show the average length (in tokens) of each query aspect, and the average overlap between each aspect and the initial query, measured as the fraction of unique query terms covered by the aspect.

Table 2 Statistics of the explicit query aspect representations used in this paper

Our first explicit aspect representation (DZ in Table 2) was proposed by Agrawal et al. (2009), and corresponds to the 15 top-level categories from the Open Directory Project (ODP): adult, arts, business, computers, games, health, home, news, recreation, reference, regional, science, shopping, society, and sports. In particular, each document is represented as a k-dimensional vector, with each dimension corresponding to the probability that the document belongs to a category. Following Agrawal et al. (2009), this probability is estimated by the cosine between the document and the centroid representing the category, according to a Rocchio classifier (Manning et al. 2008). To obtain a centroid for each category, we randomly select 3,000 documents from the ClueWeb09 B corpus that belong exclusively to this category in ODP.
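A sketch of this centroid-based classification, with toy term vectors standing in for the 3,000 sampled ODP training documents per category:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Rocchio-style centroid: the mean of the term vectors of the training
    documents sampled for a category."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dz_vector(doc_vec, category_centroids):
    """DZ representation: one dimension per ODP category, valued by the
    cosine between the document and the category centroid."""
    return [cosine(doc_vec, c) for c in category_centroids]

# toy 3-term vocabulary; two categories with illustrative training vectors
sports = centroid([[1, 0, 0], [1, 1, 0]])
arts = [0.0, 0.0, 1.0]
print(dz_vector([2, 1, 0], [sports, arts]))
```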

Our second and third aspect representations were proposed by Santos et al. (2010a). In particular, for each of the WT09 and WT10 queries, we obtain two sets of query reformulations from a commercial search engine: suggested queries (WS, displayed in the search engine’s search box) and related queries (WR, displayed alongside the search engine’s results). For each set with k aspects, we represent a document as a k-dimensional vector, with each dimension (i.e., the function f in Eq. 2) corresponding to the estimated relevance of the document to a different reformulation. To ensure this estimation is consistent with the one produced for the initial query, both are given by DPH.

Finally, as a ‘ground-truth’ aspect representation (WT), we represent the retrieved documents in the space of the sub-topics identified by TREC assessors for each of the WT09 and WT10 queries (Clarke et al. 2009, 2010). In particular, these sub-topics provide a reference performance for the other explicit aspect representations used in our investigation. Analogously to using query reformulations from a commercial search engine, the retrieved documents are represented as k-dimensional vectors, with each dimension denoting the estimated relevance of a document to a TREC sub-topic, once again according to DPH. Additionally, the availability of relevance assessments for these ‘ground-truth’ aspects enables the evaluation of coverage and novelty using diversity estimates of various simulated quality, as we will show in Sect. 6.

4.4 Evaluation metrics

To evaluate the various approaches investigated in this paper, we use the two primary metrics in the diversity task of the TREC 2010 Web track (Clarke et al. 2010): ERR-IA and α-nDCG. The Intent-Aware Expected Reciprocal Rank (ERR-IA) metric (Chapelle et al. 2009) implements a cascade user model (Craswell et al. 2008), which penalises redundancy across multiple query aspects, by assuming that users will stop examining the result list once they find relevant information. The α-normalised Discounted Cumulative Gain (α-nDCG) metric (Clarke et al. 2008) extends the traditional nDCG (Järvelin and Kekäläinen 2002), with a parameter α that controls how much redundancy should be penalised. This tunable parameter is particularly suited for our investigation, as it allows the evaluation of novelty in an extreme scenario (α = 1), which models a user with no tolerance to redundancy (Clarke et al. 2008). Both ERR-IA and α-nDCG have been shown to reward rankings that achieve a balance of coverage and novelty (Clarke et al. 2011). Moreover, α-nDCG has been shown to possess a discriminative power at least as high as that of the traditional nDCG (Sakai and Song 2011). Following the standard TREC setting, unless otherwise noted, both metrics are reported at rank cutoff 20 (Clarke et al. 2010). It is worth noting, however, that the observed trends are consistent across different rank cutoffs up to 100.
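For illustration, the gain and discount components of α-nDCG can be sketched as below; the full metric additionally normalises by the score of an ideal ranking, which is itself approximated greedily (Clarke et al. 2008):

```python
import math

def alpha_dcg(ranking, alpha, depth=20):
    """alpha-DCG at the given depth: a document's gain for each aspect it is
    relevant to decays by a factor of (1 - alpha) per earlier document
    covering that aspect. `ranking` is a list of sets of relevant aspects."""
    seen = {}
    score = 0.0
    for i, aspects in enumerate(ranking[:depth]):
        gain = sum((1 - alpha) ** seen.get(a, 0) for a in aspects)
        for a in aspects:
            seen[a] = seen.get(a, 0) + 1
        score += gain / math.log2(i + 2)
    return score

# with alpha = 1 (no tolerance to redundancy), a repeated aspect earns nothing
print(alpha_dcg([{"a"}, {"a"}], alpha=1.0))
print(alpha_dcg([{"a"}, {"b"}], alpha=1.0))
```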

4.5 Training procedure

Most approaches in our evaluation require some parameter tuning. The exceptions are DPH, IA-Select (Agrawal et al. 2009), and IA-Select*, which are parameter-free. In order to train the parameters of the other approaches (i.e., λ for MMR (Carbonell and Goldstein 1998) and xMMR, b and σ for MVA (Wang and Zhu 2009) and xMVA, and λ for xQuAD* and xQuAD (Santos et al. 2010a)), we use the WT09 and WT10 topics as training and test sets, in a cross-year fashion, i.e., we train on WT09 and test on WT10, and vice versa. All parameters are optimised through simulated annealing (Kirkpatrick et al. 1983), to maximise ERR-IA@100 on the training topics. To ensure our conclusions are not limited by the available training data, besides reporting our results on the test topics, we also report the training performance of all approaches.

5 Empirical evaluation

In this section, we address our first two research questions through an empirical evaluation within the framework provided by the TREC 2009 and 2010 Web tracks (Clarke et al. 2009, 2010). In particular, Sect. 5.1 covers our first question, to assess the effectiveness of novelty-based approaches across implicit and explicit aspect representations. Sections 5.2 and 5.3 address our second research question, by further investigating how novelty performs in contrast to and in combination with coverage across multiple aspect representations.

5.1 Implicit versus explicit novelty

To answer our first question, we contrast novelty-based diversification approaches based on implicit and explicit aspect representations. In particular, we aim to investigate not only whether existing approaches can be improved with a more refined aspect representation, but also whether any of these representations can improve over a standard, non-diversified baseline. Table 3 shows the training and test diversification performances of MMR and MVA (as implicit novelty-based approaches), as well as their explicit counterparts (xMMR and xMVA, respectively) in terms of ERR-IA and α-nDCG. The latter approaches are deployed with the four explicit representations described in Sect. 4.3: ODP categories (DZ) (Agrawal et al. 2009), suggested (WS) and related (WR) Web search queries (Santos et al. 2010a), and the official TREC Web track sub-topics (WT) (Clarke et al. 2009, 2010). The performance of DPH is provided as a non-diversified baseline. The best performance per approach is highlighted in bold. Significance is verified using the Wilcoxon signed-rank test. The symbols ▴ (▾) and \(\vartriangle\) (▿) denote a significant increase (decrease) at the p < 0.01 and p < 0.05 levels, respectively, while = denotes no significant difference. A first instance of these symbols denotes the significance of each approach compared to DPH. A second instance, for all variants of xMMR and xMVA, denotes significance with respect to MMR or MVA, respectively.

Table 3 Diversification performance (@20) of novelty-based approaches for implicit (MMR and MVA) and explicit (xMMR and xMVA) aspect representations

From Table 3, we first observe that both MMR and MVA show at best marginal, non-significant improvements over the non-diversified ranking produced by DPH, even under training. Indeed, the largest observed improvement is only +3% (MVA’s α-nDCG on the WT09 topics). These results corroborate our initial observations in this paper regarding the lack of empirical validation of novelty-based approaches for diversifying Web search results. Answering our first research question, they show that novelty is generally an ineffective diversification strategy for Web search.

With respect to the different aspect representations, we observe that both xMMR and xMVA can improve over their implicit counterparts in most settings. Under the test scenario, these improvements can be significant, particularly for xMMR using related Web queries (WR) on the WT09 topics (ERR-IA only) and the ‘ground-truth’ (WT) sub-topics on the WT10 topics, and for xMVA using ODP categories (DZ) on WT09 and WT10 (the latter for α-nDCG only). This completes the investigation of our first research question, by showing that an explicit aspect representation can help improve novelty-based diversification. Nevertheless, only xMMR using the ‘ground-truth’ aspect representation is able to significantly improve over the non-diversified DPH ranking, which suggests that an explicit representation per se cannot guarantee effective performance for novelty-based approaches.

5.2 Explicit coverage versus explicit novelty

The observations in Sect. 5.1 suggest an inherent limitation of novelty as a diversification strategy, regardless of any particular aspect representation. To address our second research question, we first contrast the effectiveness of novelty and coverage-based approaches using the same representations. To this end, in Table 4, we compare the diversification performance of xMMR and xMVA (novelty-based) to that of IA-Select* and xQuAD* (coverage-based) across the four explicit aspect representations considered in this work. Two instances of the previously introduced significance symbols denote whether IA-Select* and xQuAD* differ significantly from xMMR and xMVA, respectively.

Table 4 Diversification performance (@20) of coverage (IA-Select* and xQuAD*) and novelty-based (xMMR and xMVA) approaches for different explicit aspect representations

From Table 4, we observe that both coverage-based approaches substantially outperform the novelty-based ones in almost all settings, often significantly. The only exception is IA-Select* using the DZ aspect representation, which performs slightly below the novelty-based approaches on the WT10 topics, yet not significantly so. This might be due to the overall lower performance of the DZ aspect representation compared to the other considered representations. Nevertheless, xQuAD* still outperforms both xMMR and xMVA in this scenario. Considering the other aspect representations, both xMMR and xMVA are significantly outperformed when using the WR representation on both the WT09 and WT10 topics, and the WT representation on the WT10 topics. Additionally, on the WT09 topics, significant improvements over xMVA are observed when using the WS representation, and over xMMR when using the WT representation. This answers our second research question, by showing that, whenever the underlying aspect representation is held fixed, coverage provides an often significantly superior diversification strategy compared to novelty.

5.3 Explicit coverage versus explicit coverage + novelty

The results in Sect. 5.2 show that novelty cannot improve upon a pure coverage-based strategy. To complete the investigation of our second research question, we investigate whether novelty can be effective in combination with coverage. To this end, Table 5 shows the diversification performance of IA-Select and xQuAD, which deploy hybrid diversification strategies, compared to their coverage-only versions, IA-Select* and xQuAD*, respectively. The previously described symbols are used to denote significant improvements of the hybrid versions over the coverage-only ones.

Table 5 Diversification performance (@20) of coverage-based (IA-Select* and xQuAD*) and hybrid approaches (IA-Select and xQuAD) for different explicit aspect representations

From Table 5, we note that neither IA-Select nor xQuAD can consistently improve upon their coverage-only versions. Indeed, no significant improvement is observed across the entire table. Recalling our second question, this surprising result shows that novelty does not significantly contribute to the effectiveness of the state-of-the-art diversification approaches in the literature. Along with the other results in this section, it raises further questions regarding the role of novelty as a diversification strategy, and the conditions (if any) under which this strategy could be effective. We investigate these questions in the next section. A full breakdown analysis of the results in Tables 3, 4, and 5 is provided in Appendix A.

6 Simulated evaluation

The results in Sect. 5 show that novelty performs ineffectively in comparison to and in combination with coverage, and even when compared to a standard, non-diversified ad hoc retrieval baseline. What remains unknown is why this is the case. Hence, in this section, we address our third and last research question, by further investigating the role of novelty as a search result diversification strategy. In particular, our ultimate goal is to identify the conditions (if any) under which novelty could be deployed effectively.

To this end, we perform two complementary simulations. Section 6.1 analyses the impact of simulated relevance and diversity estimates on the effectiveness of novelty-based diversification. Section 6.2 investigates how novelty is affected by non-relevant documents. As the results of both simulations lead to identical conclusions in both the WT09 and WT10 settings, for brevity, we only present and discuss the latter.

6.1 Relevance versus diversity

Building upon the view of search result diversification as a trade-off between promoting relevance or diversity (Santos et al. 2010b), we analyse the diversification performance of novelty-based, coverage-based, and hybrid approaches over a range of simulated relevance and diversity estimation performances. The first scenario (simulated relevance) simulates the application of these approaches over baseline rankings of various quality. The second scenario (simulated diversity) has different interpretations for different approaches. For coverage-based approaches, it represents a refined estimation of how well a document covers different query aspects (e.g., the probability P_D(d|qa_m) in Eqs. 8 and 9). For explicit novelty-based approaches, it equates to a refined document representation in the space of the considered aspects (see Eq. 2), which allows for an improved identification of novel documents.
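To make the two estimation targets concrete, the sketch below contrasts a coverage-only score with a hybrid coverage-plus-novelty gain, assuming aspect importance and aspect-document relevance estimates are given as plain dictionaries. The function names and the simplified scoring are ours; they mirror, but do not reproduce, the paper's Eqs. 8 and 9.

```python
def coverage_score(d, aspects, p_aspect, p_rel):
    """Coverage-only scoring (xQuAD*-like): how well d covers the query
    aspects, independently of any other retrieved document."""
    return sum(p_aspect[a] * p_rel[d][a] for a in aspects)

def hybrid_gain(d, aspects, selected, p_aspect, p_rel):
    """Coverage + novelty (xQuAD-like): discount aspects already well
    covered by previously selected documents."""
    gain = 0.0
    for a in aspects:
        novelty = 1.0
        for s in selected:
            novelty *= 1.0 - p_rel[s][a]  # aspect a becomes less novel
        gain += p_aspect[a] * p_rel[d][a] * novelty
    return gain
```

In these terms, the simulated-diversity scenario amounts to making the p_rel estimates progressively more accurate, while the simulated-relevance scenario varies the quality of the baseline ranking being diversified.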

Following Turpin and Scholer (2006), we produce a range of relevance estimation performances by simulating re-rankings of the top 1000 results retrieved by DPH for each of the WT10 queries. In particular, each re-ranking seeks a different target query average precision (AP), by iteratively swapping randomly chosen pairs of relevant and irrelevant documents. For this simulation, we use the relevance assessments for the ad hoc task of the TREC 2010 Web track (Clarke et al. 2010). A similar procedure is used to simulate diversity estimates. For this simulation, we use the TREC Web track sub-topics as an aspect representation. As described in Sect. 4.3, this is the only available aspect representation with relevance assessments (i.e., those from the diversity task of the TREC 2010 Web track). Based on these ‘ground-truth’ aspects and their corresponding relevance assessments, our simulation iteratively re-ranks the top 1000 results retrieved by DPH for a given query with respect to each sub-topic of this query, until a target aspect AP performance is achieved.
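A minimal sketch of this swap-based procedure is given below. The accept/undo hill-climbing loop and the stopping tolerance are our assumptions about the implementation; Turpin and Scholer (2006) describe the pair-swapping idea, not this exact code.

```python
import random

def average_precision(ranking, qrels):
    """AP of a ranked list of doc ids against a set of relevant ids."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in qrels:
            hits += 1
            total += hits / i
    return total / len(qrels) if qrels else 0.0

def simulate_target_ap(ranking, qrels, target, tol=0.01, max_iters=10000):
    """Randomly swap relevant/non-relevant pairs, keeping only swaps that
    move AP towards the target, until the ranking is within `tol` of it."""
    r = list(ranking)
    ap = average_precision(r, qrels)
    for _ in range(max_iters):
        if abs(ap - target) <= tol:
            break
        rel = [i for i, d in enumerate(r) if d in qrels]
        non = [i for i, d in enumerate(r) if d not in qrels]
        if not rel or not non:
            break
        i, j = random.choice(rel), random.choice(non)
        r[i], r[j] = r[j], r[i]
        new_ap = average_precision(r, qrels)
        if abs(new_ap - target) < abs(ap - target):
            ap = new_ap
        else:
            r[i], r[j] = r[j], r[i]  # undo swaps that move AP away
    return r, ap
```

The diversity simulation follows the same logic, with AP computed per sub-topic rather than per query.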

As target relevance (for queries) and diversity (for query aspects) estimation performances, we split the range of possible AP values (i.e., [0,1]) into 20 equally sized bins (i.e., each bin has size 0.05). Within the range of each bin, we randomly select 20 target AP values, making up a total of 400 simulated relevance and diversity estimation performances per query. To enable a comprehensive yet controlled analysis, we focus on xMMR, xQuAD*, and xQuAD as representative explicit novelty-based, coverage-based, and hybrid diversification approaches, respectively. These approaches are particularly suited for this analysis, as they directly implement the aforementioned trade-off between relevance and diversity, hence allowing a controlled experimentation, by varying these two components independently. To avoid any bias towards one of these components, all approaches are applied with the standard setting of λ = 0.5.

The diversification performance of xMMR, xQuAD*, and xQuAD is shown in Fig. 1a for a range of relevance estimation performances. Relevance performance (the x axis) is measured by mean average precision (MAP). Diversification performance (the y axis) is measured by α-nDCG@100 with α = 1.0, so as to maximally penalise redundancy. Additionally, since all approaches are applied to diversify the top 100 documents, evaluation at rank cutoff 100 ensures that any observed improvements are due to removing redundancy with respect to the aspects already covered, rather than to covering additional query aspects in the ranking. The diversification performance of a standard DPH ranking is also included as a baseline. From the figure, we first observe that the diversification performance of all approaches is highly correlated with their underlying relevance estimation performance. This is somewhat expected: by improving relevance, the chance of satisfying at least one of the aspects of the query increases, as confirmed by the high correlation observed for the DPH baseline itself (Pearson’s r = 0.8978). As for the diversification approaches, xMMR is almost indistinguishable from DPH across the query MAP range. Likewise, xQuAD cannot be distinguished from xQuAD*. This further shows that novelty is a generally weak strategy for promoting diversity, both on its own and when combined with coverage.
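The measure used on the y axis can be sketched as follows, assuming binary document-aspect judgements; with α = 1.0, a document earns gain only for aspects not already covered higher in the ranking, so redundant results receive no credit at all. The greedy ideal ranking is the usual approximation; this is our illustration, not the official evaluation tool.

```python
import math

def alpha_dcg(ranking, aspects, alpha=1.0, depth=100):
    """aspects: doc id -> set of aspects the document is relevant to."""
    seen = {}  # how often each aspect has been covered so far
    dcg = 0.0
    for k, d in enumerate(ranking[:depth], start=1):
        gain = sum((1 - alpha) ** seen.get(a, 0) for a in aspects.get(d, ()))
        dcg += gain / math.log2(k + 1)
        for a in aspects.get(d, ()):
            seen[a] = seen.get(a, 0) + 1
    return dcg

def alpha_ndcg(ranking, aspects, alpha=1.0, depth=100):
    # Ideal gain via the standard greedy approximation.
    pool, ideal, seen = list(aspects), [], {}
    while pool and len(ideal) < depth:
        best = max(pool, key=lambda d: sum((1 - alpha) ** seen.get(a, 0)
                                           for a in aspects[d]))
        ideal.append(best)
        pool.remove(best)
        for a in aspects[best]:
            seen[a] = seen.get(a, 0) + 1
    denom = alpha_dcg(ideal, aspects, alpha, depth)
    return alpha_dcg(ranking, aspects, alpha, depth) / denom if denom else 0.0
```

A ranking that interleaves aspects scores strictly higher than one that repeats an already-covered aspect, which is exactly the redundancy penalty exploited in this analysis.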

Fig. 1 Diversification performance of explicit novelty-based (xMMR), coverage-based (xQuAD*), and hybrid (xQuAD) approaches for a range of (a) relevance and (b) diversity performances

Figure 1b provides a complementary view of the results in Fig. 1a. In this second scenario, instead of varying the relevance estimations for the query, we simulate a range of diversity estimations. Once again, besides the diversification performance of xMMR, xQuAD*, and xQuAD over the range of simulated diversity estimations, we include the performance of DPH as an ad hoc retrieval baseline. From Fig. 1b, we observe that the performance of xMMR remains limited by the performance of the baseline ranking, even with increasingly improved aspect relevance estimations. This result further confirms the limitations of novelty as a diversification strategy. In contrast, the performance of xQuAD* substantially increases as the underlying aspect relevance estimations improve. This shows that, besides being more robust as a diversification strategy, coverage can also benefit more from improved evidence of the association of documents to query aspects. More surprisingly, coverage proves to be a more effective strategy for promoting novelty (i.e., for reducing redundancy) than novelty itself, as shown by the striking superiority of xQuAD* compared to xMMR. On the other hand, the performance of xQuAD cannot be distinguished from that of xQuAD*, further confirming the limitations of novelty when combined with coverage.

6.2 Relevance versus non-relevance

The results in Sect. 6.1 emphasise the limitations of novelty as a diversification strategy, based on a range of simulated relevance and diversity performance scenarios. Focusing on the relevance simulation scenario, for a fixed baseline ranking (i.e., a fixed relevance performance), a novelty-based diversification approach re-ranks documents on the basis of their differences from other documents, with no bearing on their likelihood of being relevant to a query aspect. In particular, Zhai et al. (2003) suggested that the performance gains attained by promoting novelty are offset by the corresponding losses due to also promoting non-relevant documents.

To fully investigate this intuition in a Web search setting, we perform a complementary simulation to the one shown in Fig. 1a. In particular, while the previous simulation produced baseline rankings with various performances, these rankings still contained both relevant and non-relevant documents. Instead, we simulate a different scenario, where the baseline ranking is gradually improved by randomly removing non-relevant documents. This allows us to assess the impact of non-relevant documents on the performance of novelty-based diversification. In particular, Fig. 2a shows the diversification performance of MMR, xMMR, xQuAD*, and xQuAD, as we increase the fraction of non-relevant documents removed from a baseline ranking produced by DPH. MMR (Carbonell and Goldstein 1998) is included so as to allow the analysis of the impact of non-relevant documents under an implicit novelty-based approach. The performance of DPH itself is also shown as a baseline. We test removal fractions from 0 to 1, in steps of 0.05. For instance, a removal fraction of 0 represents the original DPH ranking, while a fraction of 1 means that all non-relevant results have been removed from this ranking. For a given fraction, each random removal of non-relevant documents is repeated 20 times, and we report diversification performances averaged across these 20 repetitions, with error bars denoting standard deviations.
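The removal procedure just described can be sketched as follows (our illustration of the setup, not the paper's code):

```python
import random

def remove_nonrelevant(ranking, qrels, fraction, rng=random):
    """Remove a random `fraction` of the non-relevant documents from
    `ranking`, preserving the relative order of the remaining documents."""
    non = [d for d in ranking if d not in qrels]
    drop = set(rng.sample(non, round(fraction * len(non))))
    return [d for d in ranking if d not in drop]
```

Each fraction would be drawn 20 times and the resulting diversification scores averaged, matching the repetitions reported above.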

Fig. 2 Diversification performance of explicit novelty-based (xMMR), coverage-based (xQuAD*), and hybrid (xQuAD) approaches as non-relevant documents are removed

From Fig. 2a, we first note, as expected, that the performance of DPH improves as non-relevant documents are removed from its ranking. What we are interested in, however, is whether a novelty-based strategy can take advantage of these gradually improving baseline performances. Looking at MMR, we observe that the performance of this implicit novelty-based approach is lower than that of DPH. Moreover, the gap between MMR and DPH remains almost unaltered as non-relevant documents are removed. A similar observation can be made for xMMR: although it performs above DPH, the gap between the two approaches does not increase with the removal of non-relevant documents. Another important observation is that the hybrid combination of coverage and novelty implemented by xQuAD does not benefit from an improved baseline ranking when compared to xQuAD*; indeed, the performance of these two approaches is indistinguishable in the figure. These results are surprising, as they show that, contrary to the established intuition, a baseline ranking comprising only relevant documents is not sufficient to improve novelty-based diversification.

To investigate what could help improve novelty as a diversification strategy, we perform a similar simulation to the one presented in Fig. 2a, but under an extreme scenario. In particular, while the diversification approaches in Fig. 2a leverage ‘real’ aspect-document relevance estimates (i.e., those provided by DPH), we propose a scenario where these approaches are deployed under ideal conditions, so as to assess their maximum potential. In this idealised scenario, all approaches are deployed with ‘perfect’ aspect-document relevance estimates, based on the relevance assessments of the diversity task of the TREC 2010 Web track (Clarke et al. 2010). Moreover, all approaches are deployed to make full use of these perfect estimates. To achieve this, xMMR is deployed with λ = 0 (see Eq. 3), while xQuAD and xQuAD* are deployed with λ = 1.0 (see Eqs. 7 and 9).

Figure 2b shows the results of this ‘perfect’ simulation scenario. From the figure, we first observe that xMMR can consistently outperform DPH. However, as in Fig. 2a, the gap between xMMR and DPH remains roughly constant as non-relevant documents are removed. This surprising result shows that removing non-relevant documents from the baseline ranking does not necessarily improve novelty, even when novelty is deployed under idealised conditions.

In terms of absolute performance, although xMMR performs slightly better than in the ‘real’ scenario of Fig. 2a, the benefits of deploying novelty as a standalone strategy seem quite low. Indeed, while xMMR struggles to improve over DPH, xQuAD* largely outperforms both DPH and xMMR. To understand why this is the case, we can look at the right end of Fig. 2b. In particular, when there are only relevant documents to be diversified (i.e., when the fraction of non-relevant documents removed is 1), xQuAD* still outperforms xMMR. This is because, unlike coverage, novelty does not take into account how well each individual document covers multiple query aspects. In contrast, coverage provides a much stronger diversification performance, by placing more emphasis on ‘highly diverse’ documents (i.e., documents relevant to multiple aspects). Lastly, compared to xQuAD*, a purely coverage-based approach, the hybrid strategy deployed by xQuAD is finally shown to bring significant improvements. This shows that, although rather limited as a standalone strategy, novelty can still play a role in combination with coverage, as a tie-breaking criterion: whenever two documents have similar coverage, the one covering the least-seen aspects (i.e., the most novel) should be ranked higher.
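This tie-breaking role can be illustrated with a toy greedy selection, ours rather than the paper's algorithm: coverage (how many aspects a document is relevant to) is the primary criterion, and novelty (how many of those aspects remain unseen) only decides between equally covering documents.

```python
def greedy_diversify(docs, aspects, k):
    """docs: list of ids; aspects: id -> set of aspects it is relevant to."""
    selected, seen, pool = [], set(), list(docs)

    def score(d):
        coverage = len(aspects[d])        # primary: aspects covered at all
        novelty = len(aspects[d] - seen)  # tie-break: aspects not yet seen
        return (coverage, novelty)

    while pool and len(selected) < k:
        best = max(pool, key=score)
        selected.append(best)
        seen |= aspects[best]
        pool.remove(best)
    return selected
```

With two documents of equal coverage, the one contributing a still-unseen aspect is selected, mirroring the tie-breaking behaviour observed for xQuAD over xQuAD*.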

7 Conclusions

We have thoroughly investigated the role of novelty as a diversification strategy. In particular, we placed existing diversification approaches in a common framework comprising two complementary dimensions: diversification strategy and aspect representation. Moreover, we introduced four new diversification approaches to enable the assessment of novelty as a diversification strategy independently of the aspect representation dimension. Through this investigation, we provided empirical evidence of the limitations of novelty-based diversification in a standard Web search scenario. Finally, through a comprehensive simulation-based analysis, we shed light on the reasons behind these limitations and on the role that novelty can still play as a diversification strategy.

In particular, we found that novelty is generally an ineffective diversification strategy when deployed on its own. As it ignores how diverse individual documents are, its performance is inherently limited by the relevance of the underlying baseline ranking. However, when deployed in combination with a coverage-based strategy, it can still provide improvements, provided that an effective aspect-document relevance estimation mechanism is available. To this end, future research should focus on constructing aspect representations that better reflect the multiple possible information needs underlying an ambiguous query (Santos and Ounis 2011), e.g., based on the needs of previous users who issued similar queries, as identified from a query log. Another promising direction is to develop improved retrieval approaches for estimating how well different documents cover the identified query aspects, e.g., by leveraging machine-learned models (Santos et al. 2011).