1 Introduction

A typical representative of building cluster-center dictionaries is K-means [12] and its variants [2, 15]. The basic K-means algorithm is known to depend strongly on initialization, and its results can vary arbitrarily [13]. More advanced versions, such as K-means++ [2] and Bisecting K-means [20], provide better initializations but are still sensitive to noise and usually do not yield the desired results for complex data. Other center-based algorithms, such as EM-clustering [17] and Fuzzy c-means [3], suffer from similar drawbacks. In terms of complexity, K-means is linear with respect to the data size per iteration, yet the algorithm iterates many times to converge, which requires multiple passes over the entire dataset. Also, like many center-based methods, it is an in-memory algorithm, which raises the issue of appropriate data handling/pre-processing when the dataset cannot fit into main memory. Another issue of greedy, iterative center-based algorithms is that they require the user to specify the number of centers. In most practical situations this corresponds to domain knowledge that may not be available; in the problem of finding feature similarities in large image collections, such information is a priori unknown. A rough approximation of the number of clusters can be obtained by first executing the algorithm many times for parameter tuning, which increases the overall running time significantly. Some algorithms, such as Affinity Propagation [10] and Density Peaks [24], can detect the number of clusters automatically, but they usually either perform well only on simple data (with limited noise and clutter between clusters) or are inefficient for large datasets.

Improving Efficiency. Some hierarchy-based clustering algorithms, like BIRCH [29], are reported to provide increased efficiency. BIRCH builds a CF (clustering feature) tree and uses an agglomerative clustering algorithm to merge leaves towards a specific number of clusters. Since agglomerative clustering is very expensive, with \(O(N^2 \log N)\) time complexity, efficiency is guaranteed only if the user chooses parameters that generate a reasonable number of leaves. Although BIRCH can be fast with careful parameter selection, it does not provide natural clusters and its performance is usually sensitive to the permutation of the data [29]. On the other hand, reducing the complexity of iterative methods has been a major motivation for developing single-pass algorithms, i.e., methods that parse the data once [25]. Despite its aforementioned limitations, BIRCH [29] is a popular single-pass framework. The single-pass K-means of [9] yields results similar to the iterative version. StreamSL [19] performs better than BIRCH, yet with higher complexity. StreamKM++ [1] approximates the performance of K-means++ and is faster than StreamSL, yet still slower than BIRCH. Overall, fast stream-clustering algorithms report accuracy lower than or equal to that of K-means++.

Visual Vocabularies. The greedy, iterative paradigm is still very popular in Computer Vision, despite its drawbacks, due to its simplicity and its acceptable efficiency for some applications. Specifically, despite the advances of deep neural networks in image classification, building visual vocabularies [7, 14, 22, 26] can still provide significant benefits for various tasks, including unsupervised object detection in image collections [5] or in streaming data where new categories may emerge. Coates et al. [6] use simple K-means clustering and a triangle metric to learn small image blocks, and use these learned features to encode an image. In [14, 21, 22], K-means and EM clustering are used to encode SIFT features [16] detected in an image, yielding VLAD (Vector of Locally Aggregated Descriptors) and Fisher Vector encodings, respectively. For object retrieval, [23] uses a randomized k-d forest when matching centers to points to boost the speed of simple K-means, and reports better results than the vocabulary tree method of [18].

In this paper we present a center-based approach that improves the trade-off between accuracy and efficiency. Specifically, our method: (i) accurately detects the number of natural clusters in a non-parametric fashion, (ii) requires only a single pass through the dataset, while its efficiency can be further improved using hierarchy, and (iii) can be used for building visual vocabularies and/or object proposals from streaming images, where new clusters may emerge.

2 Method Overview

Consider a dataset \( \varvec{D}=\left\{ \varvec{x}_1; \varvec{x}_2; \ldots ;\varvec{x}_N \right\} \), where \( \varvec{x}_i\) is a \(1\times d\) feature vector. We build assignments to an a priori unknown number, K, of clusters \(\left\{ \varvec{\pi }_1; \varvec{\pi }_2; \ldots ;\varvec{\pi }_K \right\} \), \(\varvec{\pi }_k \bigcap \varvec{\pi }_l = \emptyset \), \(\forall k \ne l \), such that \(\bigcup _{k=1}^K \varvec{ \pi }_k \subseteq \left\{ \varvec{x}_1; \varvec{x}_2; \ldots ;\varvec{x}_N \right\} \). Data not assigned to clusters are considered outliers/noise. For \(\varvec{x}_i\) and \(\varvec{x}_j\), \(\forall i \ne j \le N\), a threshold \(\theta \), and a similarity measure \(s(\varvec{x}_i,\varvec{x}_j)\), \(\varvec{x}_j\) is matched to \(\varvec{x}_i\) if \(s(\varvec{x}_i,\varvec{x}_j) > \theta \); here we take s to be the negative Euclidean distance. If \(\varvec{x}_i\) is a cluster center, then \(\varvec{x}_j\) is assigned to \(\varvec{\pi }_i\).
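As a minimal sketch of this matching rule (a hypothetical illustration; the function names are ours, assuming NumPy feature vectors):

```python
import numpy as np

def similarity(x_i, x_j):
    """s(x_i, x_j): negative Euclidean distance, so larger values mean more similar."""
    return -float(np.linalg.norm(np.asarray(x_i) - np.asarray(x_j)))

def is_match(x_i, x_j, theta):
    """x_j is matched to x_i if s(x_i, x_j) > theta (theta is a negative threshold)."""
    return similarity(x_i, x_j) > theta
```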

Fig. 1. Feature clustering with fading affect bias [28]: clustering while ‘forgetting’. We cluster features from large image collections or streaming videos, with an a priori unknown number of clusters. Dictionary is a dynamically populated list of formed cluster centers, while Memory is a temporary list of unmatched or rarely matched features. When a feature cannot be matched with any existing center in Dictionary, it moves to Memory; similar features subsequently also move to Memory, where they form a cluster, which is then transferred to Dictionary as a new cluster. The ‘activity’ counter of a Memory entry is increased when that entry is populated, and is reduced for every feature that is matched in Dictionary or moved to Memory but not assigned to that entry. This way, a center in Memory is either ‘activated’ (transferred to Dictionary) or ‘dies’ (diminishes).

In our method, illustrated in Fig. 1, we build a list of centers dynamically, which we call ‘Dictionary’; it is initially empty and is then enriched by frequently matched features while parsing the given dataset. At any given instance of Dictionary, features near the cluster centers are of higher density, which means they are the most representative samples of the patterns formed in the subset of the data parsed so far, though not necessarily representative of the entire dataset. Therefore, instead of parsing the data sequentially, we perform random sampling without replacement, which improves accuracy as we show below. A practical way to do so is shuffling: consecutive features in the shuffled order are random samples from the original dataset. We match each feature \(\varvec{x}_i\), in shuffled order, with the closest existing center in the Dictionary. If the match succeeds with respect to a similarity threshold \(\theta \), we assign the feature to the corresponding cluster \(\varvec{\pi }_k\) and update its center \(\varvec{c}_k\) in constant time as \(\varvec{c}_k^{(new)}= \frac{\parallel \varvec{\pi }_k \parallel \varvec{c}_k +\varvec{x}_i}{\parallel \varvec{\pi }_k \parallel +1}\). If no matching center is found, we temporarily store the feature in what we call (short) memory with ‘fading affect bias’ [28], or simply ‘Memory’, which is a list of centers that is initially empty, then dynamically enlarged, and progressively diminished. Temporarily stored features are either ‘forgotten’ as outliers or move to Dictionary as members of a new cluster.
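The constant-time center update is a running mean; a small sketch of this step and of the nearest-center search (our naming, not the authors' implementation) could look as follows:

```python
import numpy as np

def nearest_center(x, centers):
    """Return (index, similarity) of the closest center, or (None, -inf) if the list is empty."""
    if len(centers) == 0:
        return None, -np.inf
    dists = np.linalg.norm(np.stack(centers) - np.asarray(x), axis=1)
    k = int(np.argmin(dists))
    return k, -float(dists[k])

def update_center(c_k, size_k, x):
    """c_k^(new) = (|pi_k| c_k + x) / (|pi_k| + 1): running mean, O(d) per feature."""
    return (size_k * c_k + x) / (size_k + 1), size_k + 1
```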

Memory uses a variable ‘activity’, indicating how frequently an entry (temporary center) is matched. The activity is set to an initial value, \(a_0\), when a new entry is created and then varies during feature assignments. When a Memory entry is matched, its activity value is increased; when this value exceeds a threshold, \(\phi \), the corresponding entry is transferred to the Dictionary as a new cluster. The activity value of an entry is reduced when a feature is matched with either a center in the Dictionary or a different entry in Memory. When the activity becomes negative, the corresponding entry ‘dies’, i.e., it diminishes and is removed from Memory. This way we reject outliers, assuming that they are either randomly assigned to existing Memory entries or form new, small entries that eventually diminish. This is why we consider this a short memory: it keeps ‘forgetting’ data while receiving data, and the less persistent the assignments to a given entry (indicative of noise), the more likely it is for this entry to diminish. We borrow the term ‘fading affect bias’ from cognitive psychology [28] to describe the fact that noise (negative memories) is discarded (fades) fast.
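Putting the two lists together, a single-pass sketch of the whole scheme is given below. This is our reading of the mechanism described above (and of the activity increments/decrements analyzed in Sect. 3), not the authors' code; data structures and names are illustrative.

```python
import numpy as np

def single_pass_cluster(X, theta, a0, phi, seed=0):
    """Single-pass Dictionary/Memory clustering sketch (our structure and naming).

    X: (N, d) feature array; theta: similarity threshold (negative distance);
    a0: initial activity; phi: activation threshold for promotion to Dictionary.
    """
    rng = np.random.default_rng(seed)
    dictionary = []                                   # permanent clusters: [center, size]
    memory = []                                       # temporary clusters: [center, size, activity]

    def closest(x, centers):
        if not centers:
            return None, -np.inf
        d = np.linalg.norm(np.stack(centers) - x, axis=1)
        k = int(np.argmin(d))
        return k, -float(d[k])

    for i in rng.permutation(len(X)):                 # shuffling = sampling without replacement
        x = X[i].astype(float)
        k, s = closest(x, [c for c, _ in dictionary])
        if s > theta:                                 # matched an existing cluster center
            c, n = dictionary[k]
            dictionary[k] = [(n * c + x) / (n + 1), n + 1]
            for e in memory:
                e[2] -= 1                             # every Memory entry 'forgets' a little
        else:
            m, sm = closest(x, [c for c, _, _ in memory])
            if sm > theta:                            # matched a temporary center
                c, n, a = memory[m]
                memory[m] = [(n * c + x) / (n + 1), n + 1, a + a0]
                for j, e in enumerate(memory):
                    if j != m:
                        e[2] -= 1
                if memory[m][2] > phi:                # 'activated': promote to Dictionary
                    c, n, _ = memory.pop(m)
                    dictionary.append([c, n])
            else:                                     # start a new temporary center
                for e in memory:
                    e[2] -= 1
                memory.append([x, 1, float(a0)])
        memory = [e for e in memory if e[2] >= 0]     # negative activity: the entry 'dies'
    return [c for c, _ in dictionary]
```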

3 Parameter Estimation

In the scheme described above, the initial activity value \(a_0\) and the threshold \(\phi \) essentially dictate how soon noise is discarded from Memory while parsing the dataset, and also determine whether informative features are mistaken for noise.

In the shuffled list, each position is equally likely (with probability \(1/N\)) to contain any given feature of the dataset. Therefore, the probability that a feature \(\varvec{x}_i\) in the shuffled list comes from cluster \(\varvec{\pi }_k\) is \(P \left( \varvec{x}_i \in \varvec{\pi }_k \right) = p_k = \frac{\parallel \varvec{\pi }_k \parallel }{N}\). If we assume that all features from the same cluster can be matched to each other with respect to some similarity threshold \(\theta \), we can define noise as small clusters with population smaller than a certain number \(N_f\); the probability of a feature appearing in such a noise cluster is then smaller than \(N_f/N\). Therefore, in what follows, \(N_f/N\) and \(\theta \) are related to each other: for smaller values of \(\theta \), higher values of \(N_f/N\) should be considered.

Consider a newly created entry in Memory with its activity value initialized to \(a_0\). The activity value will decrease to 0 if no feature is matched with this entry during \(a_0\) steps, i.e., during processing \(a_0\) new features. These \(a_0\) steps include sampling the dataset, matching and adding in Dictionary or Memory, and removing from Memory. Thus, \(\phi \) indicates how frequently a Memory entry should be matched within \(a_0\) steps to be considered ‘informative’ and not noise.

Our primary hypothesis, \(H_0=\) ‘informative’, for a feature \(\varvec{x}_i\) in the data is \(\big \{ H_0: P \left( \varvec{x}_i \in \varvec{\pi }_k, \parallel \varvec{\pi }_k \parallel \ge N_f \right) \big \}\), and the alternative hypothesis, \(H_1=\) ‘noise’, is \(\big \{H_1: P \left( \varvec{x}_i \in \varvec{\pi }_k, \parallel \varvec{\pi }_k \parallel < N_f \right) \big \}\). This translates into calculating (a) a lower bound for \(a_0\) that guarantees we have sufficient samples before rejecting a cluster, and (b) a lower bound for \(\phi \) that indicates sufficient evidence that a temporary cluster is not noise. In other words, we must guarantee a low probability of discarding real clusters as noise and a low probability of permanently accepting noise, or respectively,

$$\begin{aligned} {\left\{ \begin{array}{ll} P \{ H_1=1 \mid H_0=1, H_1=0 \}\le \alpha \,\ \mathbf{Condition A } \\ P \{ H_1=0 \mid H_0=0, H_1=1 \}\le \beta \,\ \mathbf{Condition B } \end{array}\right. } \end{aligned}$$
(1)

where \(\alpha \) and \(\beta \) indicate probabilistic significance and are typically very small numbers (\(\le 0.05 \)). However, these two probabilities cannot be small simultaneously without increasing \(a_0\), which is the sample size. A solution is to satisfy \(\alpha \) first and then enlarge the sample size to satisfy \(\beta \) [11].

Assume an entry k in Memory has been matched X times during \(a_0\) processing steps. Then X follows a binomial distribution, \(X \sim B(a_0,p_k)\). According to the Central Limit Theorem [11], when \(a_0\) is large and \(a_0p_k\ge 5\), we can approximate the binomial distribution \(B(a_0,p_k)\) with the normal distribution \(N(a_0p_k,\sqrt{a_0p_k(1-p_k)})\). Then, according to Condition A in Eq. (1), we expect that \(P\{ X \le X^*\} \le \alpha \), where \(X^*\) is the minimum number of matches required for acceptance. If we consider the standardized variable \(Y=\frac{X-a_0p_k}{\sqrt{a_0p_k(1-p_k)}} \sim N(0,1)\),

$$\begin{aligned} P\{ X \le X^*\} =P \left\{ Y \le \frac{X^*-a_0p_k}{\sqrt{a_0p_k(1-p_k)}} \right\} \le \alpha \end{aligned}$$
(2)

If \(\varPhi \) is the cumulative distribution function of the standard normal distribution, i.e., \(P \left\{ Y \le \frac{X^*-a_0p_k}{\sqrt{a_0p_k(1-p_k)}} \right\} = \varPhi \left( \frac{X^*-a_0p_k}{\sqrt{a_0p_k(1-p_k)}} \right) \), then the solution for \(X^*\) is,

$$\begin{aligned} X^* \ge a_0p_k+\varPhi ^{-1}(\alpha ) \sqrt{a_0p_k(1-p_k)} \ge X_f, \text {with} \end{aligned}$$
(3)
$$\begin{aligned} X_f = a_0 \frac{N_f}{N} +\varPhi ^{-1}(\alpha ) \sqrt{a_0 \frac{N_f}{N} \Big (1-\frac{N_f}{N} \Big )}, \end{aligned}$$
(4)

considering \(p_k \ge \frac{N_f}{N}\) and that both \(\frac{N_f}{N}\) and \(p_k\) are at most 0.5. Therefore, if an entry in Memory is matched more than \(X_f\) times within \(a_0\) steps, we have confidence \(1-\alpha \) that the features matched to this entry are from a real cluster and can be added to Dictionary, which satisfies Condition A in Eq. (1).

Once a Memory entry is matched with an input feature, its activity value, initialized with \(a_0\), increases by \(a_0\). If after m steps the entry has been matched X times, \(X \le m \le a_0\), the activity value is \(a_0 + Xa_0 - (m-X) = X(a_0 + 1) + a_0 - m\), and for \(\{ X=X_f, m = a_0 \}\),

$$\begin{aligned} \phi =X_f (a_0 + 1), \end{aligned}$$
(5)

which determines how long an entry is preserved in Memory, before being transferred to Dictionary as a permanent cluster.
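As a quick consistency check of this bookkeeping (the value of \(X_f\) below is purely illustrative):

```python
def activity_after(m, X, a0):
    """Activity after m steps with X matches: a0 (initial) + X*a0 (increments) - (m - X) (decrements)."""
    return a0 + X * a0 - (m - X)

a0, X_f = 458, 10                       # illustrative numbers, not prescribed by the text
phi = X_f * (a0 + 1)                    # Eq. (5)
assert activity_after(m=a0, X=X_f, a0=a0) == phi
```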

Next, we examine how to choose \(a_0\) so that we do not discard real clusters from Memory, based on Condition A of Eq. (1). Assume an existing entry in Memory is from a real cluster with population \(N_{\pi } \ge N_f\). The probability that no feature from this cluster appears among the \(a_0\) samples, and hence that the entry is never matched, is \(P=\left( \frac{N-N_{\pi }}{N} \right) ^{a_0}\). If we need \(1-\alpha \) confidence that this will not happen, i.e.,

$$\begin{aligned} 1 - \left( \frac{N-N_{\pi }}{N} \right) ^{a_0} = 1-P \{ H_1=1 \mid H_0=1, H_1=0 \} \ge 1- \alpha , \text {then} \end{aligned}$$
(6)
$$\begin{aligned} a_0 \ge \frac{\ln \alpha }{\ln (1-\frac{N_f}{N})} \ge \frac{\ln \alpha }{\ln (1-\frac{N_{\pi }}{N})} \end{aligned}$$
(7)

Therefore, the probability of discarding a real cluster as noise is at most \(\alpha \) if we choose \(a_0\) according to the condition above.
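Eq. (7) can be evaluated directly; a small sketch (the function name is ours):

```python
import math

def a0_lower_bound(alpha, nf_over_n):
    """Eq. (7): minimum a_0 so that an entry from a real cluster (population >= N_f)
    is matched at least once within a_0 samples with confidence 1 - alpha."""
    return math.log(alpha) / math.log(1.0 - nf_over_n)

print(a0_lower_bound(alpha=0.01, nf_over_n=0.01))    # ~458.2, consistent with a_0 >= 458 below
```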

In Condition B of Eq. (1) we also require a low probability of accepting noise as ‘informative’ features. Assume a noise cluster \(\varvec{\pi }_z\) has been formed, with \(\parallel \varvec{\pi }_z \parallel < N_f\), and it has been matched Z times during \(a_0\) steps. Again, Z follows a binomial distribution, \(Z \sim B(a_0,p_z)\), with \(p_z = \frac{\parallel \varvec{\pi }_z \parallel }{N}\); however, we cannot approximate it with a normal distribution since \(p_z\) is practically very small. Instead, we use the Poisson approximation \(P\{Z \ge X_f \} = \sum _{q=\lceil X_f \rceil }^{a_0} \frac{(a_0p_z)^q}{q!} e^{-a_0p_z}\). We cannot derive a closed-form solution for this probability; however, an example shows that it is insignificant. Recall that \(\frac{N_f}{N}\) is the minimum portion of the dataset that a real cluster can contain. For \(\frac{N_f}{N}=0.01\) and \(\alpha =0.01\), Eq. (7) gives \(a_0 \ge 458\). If we consider \(a_0 = 458\) and \(p_z = \frac{1}{2} \frac{N_f}{N}\), then \(P\{Z \ge X_f \}=1.3955 \cdot 10^{-4}\). In the worst case, where \(p_z = \frac{N_f}{N}\), it is \(P\{Z \ge X_f \}=0.0191\), which determines the \(\beta \)-value in Eq. (1).
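The tail probabilities of this example can be reproduced numerically. Note that the sketch below (ours, using SciPy) evaluates the acceptance threshold of Eq. (4) with the magnitude of the normal quantile, i.e., \(\varPhi ^{-1}(1-\alpha )\) in place of \(\varPhi ^{-1}(\alpha )\); with this reading it matches the numbers reported above.

```python
import math
from scipy.stats import norm, poisson

nf_over_n, alpha, a0 = 0.01, 0.01, 458
# Threshold from Eq. (4), taking the upper-tail quantile Phi^{-1}(1 - alpha):
x_f = a0 * nf_over_n + norm.ppf(1 - alpha) * math.sqrt(a0 * nf_over_n * (1 - nf_over_n))

for p_z in (0.5 * nf_over_n, nf_over_n):              # noise-cluster proportions from the example
    tail = poisson.sf(math.ceil(x_f) - 1, a0 * p_z)   # P{Z >= ceil(X_f)} under Poisson(a0 * p_z)
    print(p_z, tail)                                  # ~1.4e-4 and ~0.019
```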

Sufficient Subset Size. Next, we show the portion of the dataset that needs to be processed for cluster centers to be calculated accurately. If we consider a cluster \(\varvec{\pi }_k\) and we sample \(N^*\) features from the dataset of size N, then the total number of features \(N_k\) expected to be from \(\varvec{\pi }_k\) follows the binomial distribution, \(N_k \sim B(N^*, p_k)\). Let us consider the Chebyshev inequality,

$$\begin{aligned} P\left\{ \mid N_k - E[N_k] \mid <\varepsilon \right\} \ge 1-\frac{D[N_k]}{\varepsilon ^2}, \end{aligned}$$
(8)

with \(E[N_k] = N^*p_k\) being the expectation and \(D[N_k] = N^*p_k(1-p_k)\) the variance of the random variable \(N_k\), while \(\varepsilon \) is a positive constant. Then,

$$\begin{aligned} P\left\{ \mid N_k-N^*p_k \mid < \varepsilon \right\} \ge 1- \frac{N^*p_k(1-p_k)}{\varepsilon ^2} \end{aligned}$$
(9)

If we consider that from the \(N^*\) samples it is expected that \(N^*\frac{N_f}{N}\) outliers will emerge, we can assign \(\varepsilon = \omega N^*\frac{N_f}{N}\), \(0< \omega < 1\). Then, Eq. (9) becomes

$$\begin{aligned} P\left\{ \mid N_k-N^*p_k \mid < \omega N^*\frac{N_f}{N} \right\} \ge 1- \frac{1 - \frac{N_f}{N}}{ \omega ^2 N^*\frac{N_f}{N} } \end{aligned}$$
(10)

For a probability significance \(1-\gamma \), we expect,

$$\begin{aligned} P\left\{ \mid N_k-N^*p_k \mid < \omega N^*\frac{N_f}{N} \right\} \ge 1- \gamma \end{aligned}$$
(11)

From Eqs. (10) and (11),

$$\begin{aligned} 1-\frac{1 - \frac{N_f}{N}}{ \omega ^2 N^*\frac{N_f}{N} } \ge 1-\gamma \Rightarrow N^* \ge \frac{1-\frac{N_f}{N}}{\omega ^2 \gamma \frac{N_f}{N}}, \end{aligned}$$
(12)

which is the condition for the size of the subset of the data that needs to be processed to generate accurate clusters.
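For instance, with hypothetical values \(\frac{N_f}{N}=0.01\), \(\omega =0.5\), and \(\gamma =0.05\) (chosen by us, not taken from the text), Eq. (12) can be evaluated as:

```python
def min_subset_size(nf_over_n, omega, gamma):
    """Eq. (12): minimum number of processed samples N*."""
    return (1.0 - nf_over_n) / (omega ** 2 * gamma * nf_over_n)

print(min_subset_size(nf_over_n=0.01, omega=0.5, gamma=0.05))   # 7920.0 samples
```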

Correctness. The analysis above involves a feature subset of size \(a_0\). Since we have \(T=\lfloor \frac{N}{a_0}\rfloor \) such subsets (sampling without replacement), the probability of failure to detect a cluster is (Condition A in Eq. (1)),

$$\begin{aligned} P_{fail} =\left( P\{H_1=1 \mid H_0=1, H_1=0 \} \right) ^T =\alpha ^T \end{aligned}$$
(13)

and for all K clusters, the probability of success is

$$\begin{aligned} P_{success}= \left[ 1 - P_{fail} \right] ^K =\left( 1-\alpha ^T \right) ^K \end{aligned}$$
(14)

To illustrate the importance of this probability, let us consider a set of \(N=10^5\) features and \(K=100\) clusters (ground truth). If we choose \(\frac{N_f}{N}=0.01\) and \(\alpha =0.01\), then Eq. (7) gives \(a_0 \ge 458\). Considering the minimum number of features in each sample set, \(a_0 = 458\), we obtain \(P_{success}= \big (1 - 0.01^{\lfloor \frac{10^5}{458}\rfloor } \big )^{100} \approx 1\). Thus, under a distinct grouping pattern among the data, the algorithm is theoretically able to find a perfect clustering result. Note that \(\frac{N_f}{N}\) determines the maximum portion of the data allowed to form a noise cluster, or equivalently, the minimum portion of the data that can be in a real cluster.
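The numbers of this example can be checked directly (note that \(\alpha ^T \approx 10^{-436}\) underflows to zero in double precision, so the printed probability is exactly 1.0):

```python
N, K, alpha, a0 = 10**5, 100, 0.01, 458
T = N // a0                          # number of a_0-sized subsets: 218
p_fail = alpha ** T                  # Eq. (13): ~1e-436, underflows to 0.0
p_success = (1.0 - p_fail) ** K      # Eq. (14)
print(T, p_success)                  # 218 1.0
```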

Clustering Streaming Features. Our method inherently carries the idea of sequential processing: it parses the dataset once, and each feature requires constant computation. However, we assumed that the data order is shuffled at the beginning, in order to distribute noise evenly among the considered subsets of size \(a_0\) and avoid noise accumulation. If we consider the problem of feature clustering in image collections, in the average case noise is equally likely to appear in any image, and therefore shuffling the order of the features does not have a significant effect. For sequential processing, the worst case is when successive images include more noise, or when noise is distributed spatially in an image in a non-uniform fashion. In such scenarios, if we do not shuffle the data, we can rely on the size of the formed clusters, and remove small ones transferred from Memory to Dictionary as statistically insignificant with respect to the content of the images.

Here we show that for a cluster \(\varvec{\pi }_k\), \(\parallel \varvec{\pi }_k\parallel \ge N_f\), no matter what the order of the input features is, there exists at least one consecutive subsequence of features \(\mathbf{x}\), \(\parallel \mathbf{x} \parallel = a_0\), such that,

$$\begin{aligned} \frac{\parallel \left\{ \varvec{x}_i \mid \varvec{x}_i \in \varvec{\pi }_k, \varvec{x}_i \in \mathbf{x} \right\} \parallel }{a_0} \ge \frac{N_f}{N} \end{aligned}$$
(15)

In what follows, we use binary variables \(\nu \) to describe feature membership in a specific cluster.

Lemma: For any permutation of a binary set \(\left\{ \nu _i\mid \nu _i=0 \text{ or } \nu _i=1 \right\} _N\) and any positive integer \(a_0 \le N\), there exists at least one consecutive subsequence \(\varvec{\nu }\) of length \(a_0\), such that,

$$\begin{aligned} \frac{\parallel \left\{ \nu _i \mid \nu _i=1, \nu _i \in \varvec{\nu } \right\} \parallel }{a_0} \ge \frac{\sum _{n=1}^N \nu _n}{N}, \end{aligned}$$
(16)

where \(\varvec{\nu }\) can include tail-head permutations, i.e., \([\nu _i,\nu _{i+1},\ldots ,\nu _N,\nu _1,\nu _2,\ldots ]\).

Note: This condition means that there is at least one consecutive subsequence of length \(a_0\) where the density of the cluster members is greater than or equal to the average density of the cluster members in the entire dataset.

Proof. We prove this lemma by contradiction. Assume for any \(a_0\)-length consecutive subsequence of an arbitrary permutation \([\nu _1,\ldots , \nu _N]\),

$$\begin{aligned} \sum _{j=i}^{i+a_0-1} \nu _{mod(j,N)+1} < a_0 \frac{\sum _{n=1}^N \nu _n}{N}, \,\ \forall i=0,1,\ldots ,N-1 \end{aligned}$$
(17)

We consider the modulo index \(mod(j,N)+1\) to account for tail-head permutations (see above). In total, there are N distinct consecutive subsequences. Then,

$$\begin{aligned} \sum _{i=0}^{N-1} \left( \sum _{j=i}^{i+a_0-1} \nu _{mod(j,N)+1} \right) =a_0 \sum _{n=1}^{N} \nu _n, \end{aligned}$$
(18)

i.e., each element in the dataset is added \(a_0\) times (consider an \(a_0\)-length window ‘sliding’ N times along the dataset/sequence). However, according to the assumption in Eq. (17), we have,

$$\begin{aligned} \sum _{i=0}^{N-1} \left( \sum _{j=i}^{i+a_0-1} \nu _{mod(j,N)+1} \right) < \sum _{i=0}^{N-1} \left( a_0 \frac{\sum _{n=1}^N \nu _n }{N} \right) = a_0 \sum _{n=1}^{N} \nu _n, \end{aligned}$$
(19)

which contradicts Eq. (18). \(\Box \)
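The lemma can also be checked numerically over circular (tail-head) windows; the sketch below uses arbitrary sizes of our choosing:

```python
import numpy as np

def lemma_holds(nu, a0):
    """True if some circular window of length a0 has member density >= the global density."""
    nu = np.asarray(nu, dtype=int)
    target = a0 * nu.sum() / len(nu)                   # a0 * (sum_n nu_n) / N
    wrapped = np.concatenate([nu, nu[:a0 - 1]])        # allow tail-head windows
    window_sums = np.convolve(wrapped, np.ones(a0, dtype=int), mode="valid")
    return bool((window_sums >= target).any())

rng = np.random.default_rng(0)
nu = rng.permutation(np.r_[np.ones(50, dtype=int), np.zeros(950, dtype=int)])   # N=1000, 50 members
assert lemma_holds(nu, a0=100)
```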

4 Experimental Results

Image categorization involves, in general, three steps: (a) building visual vocabularies from image features, (b) image encoding using the vocabularies, and (c) classification. We used our method to cluster SIFT features [16] and create visual vocabularies. To showcase the benefit of using our method in such problems, namely improving the trade-off between accuracy and efficiency, we adopted three image classification methods: Vector of Locally Aggregated Descriptors (VLAD) [14], Bag-of-Visual-Words (BoVW) [7], and Fisher Vector (FV) [22]. We compare our approach with ANN K-means [15] and Naive EM [17] when each is used in these three methods. Note that variations of K-means and Naive EM are among the most popular clustering approaches used in such Computer Vision tasks.

We used three publicly available image collections: (a) Object Discovery 100 (Obj. Disc); (b) Caltech 101 (Caltech101); and (c) PASCAL VOC 2007: we first used a subset of 6 randomly chosen categories (PASC(6)) and then the entire collection (PASC(all)).

In the experiments with K-means and EM, we followed the approach in [27] and randomly chose K\(\cdot 1000\) features for VLAD and FV, and K\(\cdot 200\) for BoVW. Since our algorithm is much more efficient than K-means and Naive EM, it allows us to mine a larger dataset while still being much faster than those two. For the experiments with our method, we used: for FV, 0.8 million features (N) from each dataset; for VLAD, 1 million features from the Obj. Disc. dataset, 5 million from PASC(6), and 5 million from PASC(all); for BoVW, 2 million features from Obj. Disc. and 5 million from each of PASC(6) and PASC(all).

Table 1. Comparisons in building visual vocabularies for three popular image categorization methods. K = number of clusters; mAP = mean Average Precision [23]; DB = Davies-Bouldin index [8]; CH = Calinski-Harabasz index [4]; LL = Log-likelihood

Table 1 summarizes the results of our method and the competition when used in each image classification method (VLAD, BoVW, FV) and for each dataset. Each row corresponds to a different vocabulary size K, as set by the competing method. For each experiment (row), we ran our method and the competition 25 times and report average results. To evaluate the clustering itself, we used the Davies-Bouldin (DB) index [8], the Calinski-Harabasz (CH) index [4], and the Log-likelihood (LL). To evaluate the overall accuracy of VLAD, BoVW, and FV, we used the mean Average Precision (mAP) [23]. The reported times are for clustering only. These results illustrate that our method and the competition yield, on average, comparable accuracy, with our method being significantly faster: in Table 1, the boldface numbers correspond to indicative comparison instances where our method is 4 to 42 times faster than the competition. Note that SIFT features from natural images are usually very cluttered, so the clustered data typically lack well-separated natural groups. Nevertheless, our method still generates competitive results while boosting the clustering efficiency.

To test the performance of our approach in sequential data (clustering on the fly), we used videos captured by an onboard camera of a quadrotor during flight. Figure 2 illustrates an indicative example of clustering detected SIFT features in eight non-consecutive \(752 \times 480\) frames of a video (frame numbers are shown on top left of each image). The detected features are marked in different shapes and colors, indicating different cluster assignments. The magenta-yellow arrows in frames \(\#250\) and \(\#375\) show indicative examples of newly emerged clusters (orange and green square categories), while the long double arrow indicates correspondence between features (features in the same cluster) across frames. In this experiment we used \(N_f / N=0.005\) and \(\theta = -275\).

Fig. 2. Building visual vocabularies on the fly. Colors and shapes indicate cluster assignments of the detected features, while the arrows in frames \(\#250\) and \(\#375\) indicate newly emerged clusters (orange/green squares). (Color figure online)

Finally, we also validated the efficiency of our method using synthetic datasets generated by Gaussian mixture models with random means and covariances, large numbers of clusters, and uniformly distributed noise. Compared to K-means++ [2], BIRCH [29], and EM clustering [17], our approach was on average 3–7 times faster than the first two, while EM was the slowest among the competition. Note that BIRCH builds a CF-tree, where grouping the leaves into the desired number of clusters is computationally expensive, with \(O(N^2 \log N)\) complexity.

5 Conclusions

We described a fast and accurate center-based clustering method suitable for large datasets with a high number of natural clusters. It produces cluster centers with a single pass through the data, using a Dictionary and a (short) Memory list to build and enrich a global (sparse) histogram of the data: dense entries in Dictionary and Memory correspond to frequently matched features, thus indicating formed clusters. Input features that are not matched in Dictionary move to Memory, where they are either assigned to an existing entry or create a new one. Memory entries that are not populated sufficiently are discarded as noise, while dense entries are moved to Dictionary permanently. Our results showed that the trade-off between accuracy and efficiency is improved compared to clustering approaches commonly used in Computer Vision.