Abstract
Humans are still indispensable on industrial assembly lines, but in the event of an error, they need support from intelligent systems. In addition to the objects to be observed, it is equally important to understand the fine-grained hand movements of a human in order to track the entire process. However, deep-learning-based hand action recognition methods are very label intensive, which not all industrial companies can afford due to the associated costs. This work therefore presents a self-supervised learning approach for industrial assembly processes that allows a spatio-temporal transformer architecture to be pre-trained on a variety of information from real-world video footage of daily life. Subsequently, this deep learning model is adapted to the industrial assembly task at hand using only a few labels. Well-known real-world datasets best suited for representation learning of such hand actions in a regression task are outlined, together with the extent to which they optimize the subsequent supervised classification task. This fine-tuning is supplemented by concept drift detection, which makes the resulting productively employed models more robust against concept drift and future changes in assembly movements.
1 Introduction
With the rising costs of maintaining sophisticated systems and the increasing technological demands on operators, many manufacturing companies are realizing the importance of focusing on people in their production processes in order to remain economical and flexible enough to produce multiple product variants, especially on their assembly lines. For example, despite advances in automation technology, it has been found that 70% of the world’s assembly lines are still operated solely by humans [14]. This underscores the essential role that human operators play in the manufacturing process, as they are able to cost-effectively handle a wide range of products and easily adapt to new or improved processes. However, human operators also have limitations. They can make mistakes, especially at the beginning of a learning process or when performing repetitive tasks. In addition, the psychological consequences of performing the same task every day can have a negative impact on their well-being. To overcome these challenges, this work focuses on assistive systems for assembly applications based on deep learning vision applications that can help assemblers by describing their movements in the assembly process.
However, traditional supervised deep learning approaches require significant amounts of data and labels [40, 41], which can be challenging for industrial companies to provide. These companies may not have the necessary resources or expertise to record, clean, and label their processes for deep learning applications. Additionally, the employees with the necessary domain knowledge may be scarce and only available for a short period of time. Moreover, the data may rarely leave the factory premises, making it difficult to share with others.
Self-supervised representation learning approaches reduce the resource-intensive task of data labeling and allow deep-learning models to be pre-trained on a similar task from, for example, the non-industrial world, so that they can then be adapted to an industrial use case using only a small amount of labeled data [12, 38]. This allows the respective industrial domain expert to incorporate their knowledge into the deep learning model in the shortest possible time, and the industrial company to support its employees with this knowledge. In this work, a self-supervised learning approach for a two-towered spatio-temporal transformer encoder architecture is investigated. This model learns the representation of hand movements from various daily videos based on hand skeleton data. The learned representation results from sequences that are partially masked. The model reconstructs these masked regions, similar to the approach of the BERT model [4]. Subsequently, the suitability of transferring the weights to an industrial use case is evaluated with respect to criteria such as the amount of labeled data, training time, and performance on a test dataset.
In addition to this extended version of [31], the occurrence of concept drift during fine-tuning is considered and possible responses to prevent performance degradation are outlined. This phenomenon refers to changing circumstances potentially impacting the data and the performance of machine learning models consuming it [8]. For fine-tuning, concept drift might occur if the new data is very different from the base data and no correlations can be recognized [48]. This can overwhelm models, which lack a learned representation of the new information. In turn, this can lead to poor validation and test results. As such training processes can be highly time and computation-resource-consuming, means for optimizing their progress in a targeted fashion are very valuable.
In the real industrial world, concept drift can also result in serious problems that cannot be solved by responses as simple as exchanging data. In that regard, one of the primary threats is the volatility of input information and conditions over time. Even minor adjustments in assembly processes can lead to concept drift in the associated data, despite the processes having unaltered outcomes. Possible examples are changing tools, new or differently shaped components, or different hand movements involved during assembly. Classifying certain steps from such processes can become increasingly difficult for a machine learning model.
Since the reconstruction performance during fine-tuning is prone to fluctuations, and the process may suffer from internalizing training steps that produce local error maxima or fluctuating errors, we propose to counteract this effect by using explicit concept drift adaptation methods. In this way, abnormal errors can be detected and appropriate interventions in the process can be triggered to improve the fine-tuning performance.
The main contributions are as follows:
-
The presentation of an efficient masked autoencoding approach for spatio-temporal transformer encoder architectures that works with fewer labels. The goal is to achieve self-supervised industrial context understanding so that productive models for industrial applications can subsequently be trained faster with less labeled data.
-
A demonstration of how non-industrial datasets can be used to pre-train models for industrial applications with limited domain data availability.
-
Extensive experiments on challenging video benchmarks, achieving results comparable to or better than improved state-of-the-art methods.
-
A concept and application of explicit progress control for dealing with concept drift in model training architectures to detect and analyze its occurrence during fine-tuning, as well as feasible responses for stabilizing the training process in the event of further changes over time in the assembly context.
For this purpose, different self-supervised learning approaches based on the Masked Autoencoder (MAE) are presented in Sect. 2. Subsequently, the model architecture and the masking method used are described in Sect. 3. This model architecture is first trained in a supervised manner on an industrial dataset to create a ground truth. Subsequently, the same model architecture is pre-trained self-supervised with a masking approach on the different presented real-world datasets before its usability for fine-tuning on an industrial dataset is tested in Sect. 4. Section 5 then analyzes to what extent this method optimizes the learning process. In addition, each section is enhanced with state-of-the-art information on the conceptualization and application of concept drift handling techniques. Finally, the work is concluded with an outlook in Sect. 6.
2 Related work
Generative self-supervised learning is a method to provide knowledge to a model that should recognize patterns in unlabeled data based on observation. Through this observation, the model can additionally exploit previously unrecognized patterns beyond the obvious ones that a labeler with domain knowledge would annotate in a supervised learning approach. This observation and learning is usually done by an autoencoder, which converts an input into a latent representation that a decoder then converts back into the structure of the input. The resulting deviation is then taken as a measurement value to check the performance of the model [26]. In order to train the autoencoder for better generalizability, a masking method can be applied to the input data to create a denoising autoencoder, which is a type of neural network designed to remove noise from data by learning to reconstruct the original input from a corrupted version. It achieves this by encoding the noisy input into a lower-dimensional representation and then decoding it back to the original denoised output. This process helps the model learn robust features that capture the underlying structure of the data [10, 39].
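To make the denoising-autoencoder principle concrete, the following minimal PyTorch sketch corrupts an input, encodes it into a lower-dimensional representation, decodes it again, and measures the reconstruction error against the clean input. The layer sizes and the Gaussian corruption are illustrative assumptions, not the configuration used later in this work.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_inputs: int = 126, n_latent: int = 32):
        super().__init__()
        # encoder compresses the (corrupted) input into a latent representation
        self.encoder = nn.Sequential(nn.Linear(n_inputs, n_latent), nn.ReLU())
        # decoder maps the latent representation back to the input structure
        self.decoder = nn.Linear(n_latent, n_inputs)

    def forward(self, x_corrupted: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x_corrupted))

def denoising_loss(model: nn.Module, x_clean: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)   # corruption step
    return nn.functional.mse_loss(model(x_noisy), x_clean)      # deviation from the clean input
```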
In the following sections, an overview of the most promising masked self-supervised learning methods investigated for this industrial use case is given, related to skeleton-based data with a focus on images, video data, and multivariate time series data. Additionally, relevant aspects from the state of the art on concept drift adaptation are outlined.
2.1 Masked autoencoder in image and video data
The idea of MAEs is relatively simple and has been successfully applied in the field of Natural Language Processing (NLP) for quite some time. The most popular NLP approach that is trained with this method is the BERT Model [4]. In that case, the random masking method involves randomly replacing 15% of the tokens in the input text with a special token [MASK]. The model is then trained to predict the original tokens that were replaced, enabling it to learn contextual representations of words by understanding the surrounding context. Afterwards the model can be fine-tuned to different tasks from the same domain with fewer labels compared to a supervised approach.
However, this masking procedure has also been used in image processing for quite some time under the term denoising autoencoders [37]. One of the more popular recent works, [10], used a Vision Transformer (ViT) [5] architecture as the autoencoder and found that significantly more masking than the BERT approach's 15% can be applied to images. They masked about 75%, depending on the depth of the model and the masking strategies, proving that MAEs can also be used as scalable vision learners. They also showed that random masking, which is performed on images by replacing random patches, can produce significantly better results on image data than, for example, block or grid masking. Block masking hides contiguous regions or blocks in the data, effectively removing certain areas of the input to encourage the model to infer the missing information. Grid masking, on the other hand, masks data in a grid-like pattern, hiding non-contiguous but regularly spaced portions of the input to promote the model's ability to recognize patterns and dependencies across different sections. Afterwards, the model is trained to predict the masked content, forcing it to learn contextual information from the remaining unmasked data. In this way, they have revolutionized the method of self-supervised learning by not only reaching the state of the art for image pre-training, but also bridging the gap between visual and linguistic MAE pre-training [3]. Similar findings were reported at the same time by [45], who investigated multiple masking strategies like square, block-wise, and random masking and achieved the best performance with random masking. Their SimMIM model also confirms that direct prediction of pixels as in MAE does not perform worse than other complex design methods, such as tokenization, clustering, or discretization. [6] extended these pixel-wise masking approaches for images by applying random masking to videos and proved that randomly masked spatio-temporal information not only improves the fine-tuning results, but also allows the masking ratio to be increased to up to 90%. For downstream tasks, the final results already improve from masking rates of over 70%. The VideoMAE model also demonstrated that masking is very well applicable for representation learning of videos [33]. It used a tube masking approach, a technique that masks spatio-temporal regions, referred to as tubes, across consecutive frames to focus the model on learning dynamic visual features over time. With this masking method, they showed that a relatively small amount of data is sufficient for training and that a very large dataset is not required as previously assumed.
2.2 Masked autoencoder in multivariate time series and skeleton based data
Similar approaches to the previous ones for masking have also been used for multivariate time series data, showing that this method of representation learning can also be applied to very dense types of data [46]. [46] used a transformer encoder architecture similar to BERT with the same random masking approach [4]. What was interesting here is that the same samples could be used multiple times for pre-training purposes, as long as they were masked at different locations. The Multivariate Time Series Masked Autoencoder (MTSMAE) introduced by [32] added a patch embedding layer [34] along the time dimension after the traditional ViT embedding. They also show that a high masking level in such dense data can cause greater information loss, greatly reduce the redundancy of the data, and reduce the model's overall understanding of the low-level information. Consequently, a very high masking ratio, such as 95%, means that a large amount of information is lost and that the data the model can learn from is limited, which can affect the model's understanding of the data. Therefore, they chose a masking ratio of 85% and achieved the best results for the representation learning and the subsequent downstream classification task. [44] presented SkeletonMAE, a MAE approach that considers 3D skeletal data. They investigated random and fixed masking of frames as well as joint masking methods, and introduced a new spatio-temporal masking method for skeleton data that operates on both the joint plane and the image plane, with an appropriate combination of masking ratios in the spatial and temporal dimensions. When masking joint skeletal data, specific joints in skeletal data sequences are masked to train the model to predict their positions. By masking these joints, the model learns to capture spatial and temporal dependencies between joints and improves its understanding of human motion. In this way, the model improves its ability to generalize and recognize different actions by focusing on the relationships and dynamics between different joints in the skeletal data. The approach achieved its best downstream-task results with a fixed masking accuracy of 85.4% and a random masking accuracy of 86.6%. This was accomplished with a fixed masking ratio of 40% in the temporal domain and 50% in the spatial domain. Additionally, a random masking ratio of 50% was used in the spatial domain and 40% and 50% in the temporal domain. These ratio parameters matched the findings from [32] and can be used as general state-of-the-art values for such a masking method.
2.3 Concept drift adaptation
Data streams can be abstractly represented as potentially infinite series of data points randomly drawn at various timestamps. The data X in streams follows patterns that are primarily determined by the characteristics of the processes generating it, but might also contain noise components. In machine learning, such patterns are referred to as concepts. From a statistical perspective, a concept can be formalized as a probability distribution P(X) [43].
Real-world concepts are often volatile, and the associated P(X) must be assumed to be non-stationary, i.e. potentially subject to frequent changes. This phenomenon is commonly referred to as concept drift. It can arise in different types with various dynamic profiles (e.g. abrupt or extended), magnitudes (e.g. warning or change level), and feature-spatial expansions (e.g. local or global). Generally, multiple types can occur in parallel [17, 43]. Concept drift has destructive potential with respect to machine learning models. It can negatively affect their performance, as it leads to them being confronted with previously unseen input patterns [7, 16, 42].
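Following this notation, the occurrence of concept drift can be stated compactly (cf. [7, 43]): a drift occurs between two points in time \(t_0\) and \(t_1\) if

\(\exists X: \; P_{t_0}(X) \neq P_{t_1}(X),\)

i.e. the distribution generating the data stream is no longer the same at both times.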
Paradoxically, in practice, this problem is often ignored and lacks established best practices [36]. To be robust, i.e. to sustain performance, machine learning models therefore require appropriate responses treating concept drift occurrences. The field of concept drift adaptation internalizes this motivation and delves into the design of associated measures, which typically involve model retraining [7] or ensemble-based mechanics [18]. Adaptation can be done blindly, which is also referred to as implicit adaptation, or employ informed measures, which is referred to as explicit adaptation. The former attempts to treat concept drifts without any information about their timing; associated measures often involve retraining in regular intervals or once labels become available. The latter involves the identification of concept drifts by employing specific detectors and explicitly timed and controlled responses to treat them [7]. In contrast to implicit adaptation, explicit concept drift adaptation has the potential to economize compute resources [29] and allows more deliberate treatment of concept drifts [25].
3 Proposed method
In the following Section, the feature extraction of the hand skeleton information is explained in more detail before the architecture of the coding model to be trained is presented. Then, the masking method for learning the representation is discussed. Figure 1 shows the prototypical implementation of the combined model architecture in an end-to-end training scheme. It is based on the pre-extraction of the hand skeleton data and the subsequent random masking block before the masked data is processed by the transformer architecture to reconstruct the skeleton sequence.
Finally, a method for monitoring the fine-tuning training process for concept drift and applying appropriate responses is proposed.
3.1 Model architecture
3.1.1 Preextraction of hand skeleton data
A keypoint extraction method inspired by Google’s MediaPipe Hands solution architecture is used, see [47]. The model consists of a palm recognition module followed by a hand feature detector that provides 21 2.5D coordinates for various landmarks. These vertices serve as key features for the subsequent transformer model. The process begins with a palm detector network optimized for real-time mobile use [47]. Instead of detecting the entire hand with fingers, the model first identifies the palms through bounding boxes to ensure stability. An encoder-decoder feature extractor, similar to a feature pyramid network [21], recognizes the palm at different scales. To address the imbalance between background and palm detections during training, the focal loss [22] is used [47]. When a palm is detected, an image section is generated that includes the entire hand for further analysis. This image section is then fed into a convolutional neural network that identifies the 21 landmarks of the hand with 2.5D coordinates. These coordinates include x and y values and a z value representing depth relative to the wrist, together with a hand presence probability and handedness information. To optimize subsequent frames, the detected landmarks are used to calculate a new crop area, keeping the hand within this region. Subsequent frames are cropped accordingly, avoiding redundant detection by the single-shot detector, except when the probability of hand presence falls below a certain threshold [47]. This reduces the number of compute cycles required during inference: the convolutional neural network operates on smaller cropped images, and the single-shot detector only scans the entire image when hand detection is lost, see Fig. 2.
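As an illustration, a minimal sketch of the keypoint pre-extraction step using the publicly available MediaPipe Hands Python API is given below. It is not the paper's exact pipeline: the cropping, smoothing, and sequence-assembly logic described above is omitted, and the fixed sequence length of 100 frames is taken from Sect. 3.1.2.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def extract_hand_keypoints(video_path: str, max_frames: int = 100) -> np.ndarray:
    """Return an array of shape [frames, 126] (2 hands * 21 keypoints * 3 coordinates)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    with mp_hands.Hands(static_image_mode=False, max_num_hands=2,
                        min_detection_confidence=0.5) as hands:
        while len(frames) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            coords = np.zeros((2, 21, 3), dtype=np.float32)   # undetected hands stay zero
            if results.multi_hand_landmarks:
                for h, hand in enumerate(results.multi_hand_landmarks[:2]):
                    for k, lm in enumerate(hand.landmark):
                        coords[h, k] = (lm.x, lm.y, lm.z)      # z: depth relative to the wrist
            frames.append(coords.reshape(-1))
    cap.release()
    return np.stack(frames) if frames else np.zeros((0, 126), dtype=np.float32)
```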
3.1.2 ConvGTN
The skeleton output of the hand keypoint extractor serves as the foundation for the subsequent part of the architecture. This part focuses on classifying the temporal and sequential correlations within the extracted time series of keypoints. To achieve this, a specialized gated transformer network [23] is employed, specifically adapted for this use case, see Fig. 3. Previously, the network from [23] demonstrated remarkable performance on 13 multivariate time series classification tasks, including human action recognition tasks similar to the one in this work. The adapted architecture revolves around a two-tower transformer with batch normalization [15] instead of the layer normalization of the traditional approach from [35], where one tower’s encoder captures attention in the temporal dimension and the other in the spatial-channel dimension at each time step. As activation function in the feed-forward block of the encoders, the Gaussian Error Linear Unit (GELU) is used [11]. Additional parameters for the encoders are QUERY=32, VALUE=32, KEY=4, with N=4 layers and a hidden dimension of D_HIDDEN=512. To merge the encoded features from the two towers, an adaptive weighted concatenation acts as a gate before the last fully connected layers. This gate dynamically determines during backpropagation which tower contributes the more crucial features to the classification process. The output of each encoder is additionally passed beforehand through a linear layer with 128 input features and, per tower, either 126 keypoint features or 100 frame features as output, restoring the original input dimension and matching the input signal of [126,100]. To enhance the model’s predictive capabilities, an additional Conv1D layer with a kernel size of 3 is introduced. This extra convolution facilitates better correlation detection between the hand keypoints, represented as [21*3*2] [keypoints per hand, xyz coordinates, hands], embedded temporally within the model. It leads to improved gradients in the temporal tower for each time step of the input data, represented as [128,126,100] [batch size, features, sequence length]. The Conv1D layer is succeeded by a linear layer with 128 input and 128 output features and a learnable positional encoding layer [19]. The total number of trainable parameters in this model is 1,743,684, which is far fewer than the more than 18 M parameters of the initial setup of [23].
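The following simplified PyTorch sketch illustrates the two-tower idea with the dimensions given above ([batch, 126 features, 100 frames], Conv1D embedding with kernel size 3, GELU activations, feed-forward dimension 512, four encoder layers). It is a structural approximation only: it uses PyTorch's standard encoder layers with layer normalization rather than the batch normalization of the actual model, and the gating, positional encoding, and output layers are simplified assumptions.

```python
import torch
import torch.nn as nn

class ConvGTNSketch(nn.Module):
    def __init__(self, n_features=126, seq_len=100, d_model=128, n_layers=4, n_classes=12):
        super().__init__()
        # Conv1D embedding (kernel size 3) correlating neighboring time steps of the keypoints
        self.conv_embed = nn.Conv1d(n_features, d_model, kernel_size=3, padding=1)
        self.temporal_pos = nn.Parameter(torch.zeros(1, seq_len, d_model))  # learnable positions
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=512, activation="gelu", batch_first=True)
        self.temporal_tower = nn.TransformerEncoder(make_layer(), n_layers)  # attends over time steps
        self.channel_embed = nn.Linear(seq_len, d_model)
        self.spatial_tower = nn.TransformerEncoder(make_layer(), n_layers)   # attends over channels
        self.gate = nn.Linear(2 * d_model, 2)          # adaptive weighting of the two towers
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, x):                              # x: [batch, 126, 100]
        t = self.conv_embed(x).transpose(1, 2)         # temporal tokens: [batch, 100, d_model]
        t = self.temporal_tower(t + self.temporal_pos).mean(dim=1)
        s = self.spatial_tower(self.channel_embed(x)).mean(dim=1)  # channel tokens: [batch, 126, d_model]
        w = torch.softmax(self.gate(torch.cat([t, s], dim=-1)), dim=-1)
        return self.head(torch.cat([w[:, :1] * t, w[:, 1:] * s], dim=-1))
```

A forward pass with `torch.randn(8, 126, 100)` returns class scores of shape [8, 12].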
3.2 Masking method
As masking method, random masking is used, motivated by the results of [44] and [6]. Due to the similar structure of skeleton-based video data, this is a promising choice. Before splitting the information for both towers, the sequential input is masked temporally and spatially with a masking ratio of 50%. This prevents the model from detecting spatial masking too easily due to tubes and temporal patterns caused by fully masked frames [10]. The training procedure consists of first randomly masking the skeleton sequences in space and time (cf. the masking block in Fig. 3). The respective dimension is then reconstructed as a regression task, and the loss is calculated with a mean squared error (MSE) based only on the masked regions. As a result, the model is prevented from memorizing the entire structure of the input sequence and skipping the masked regions during regression. It ensures that the loss calculation only focuses on the masked regions without being influenced by rote learning.
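A minimal sketch of this procedure is shown below, assuming the [batch, 126 features, 100 frames] layout from Sect. 3.1.2 and the 50% masking ratio stated above; masked positions are simply zeroed here, and the loss is averaged only over masked entries.

```python
import torch

def random_spatio_temporal_mask(x: torch.Tensor, ratio: float = 0.5):
    """Randomly hide individual (feature, frame) positions of the skeleton sequence."""
    mask = torch.rand_like(x) < ratio                 # True where the input is hidden
    return x.masked_fill(mask, 0.0), mask

def masked_mse(reconstruction: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """MSE restricted to masked positions, so visible inputs cannot simply be copied through."""
    return ((reconstruction - target) ** 2)[mask].mean()

# usage inside one self-supervised training step (model: the spatio-temporal encoder):
# x_masked, mask = random_spatio_temporal_mask(batch)   # batch: [B, 126, 100]
# loss = masked_mse(model(x_masked), batch, mask)
```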
3.3 Explicit progress control for treating concept drifts during fine-tuning processes
Training processes can be demanding in terms of duration and computation resources. Optimizing their continuous progress can have an economizing impact. Concept drift occurring during such processes, i.e. caused by changing circumstances, can impact the data and, consequently, compromise the performance of machine learning models consuming it. The suggested explicit progress control method is based on aspects from explicit concept drift adaptation and targets the mitigation of negative concept drift impact. To the best of the authors’ knowledge, this approach is novel.
Algorithm 1 presents details of this method, which is applicable to fine-tuning processes, as pseudo code.
The loss value of each training iteration, computed via the cross-entropy loss function [13], can be used to identify occurrences of concept drift in the data, i.e. significantly abnormal error behavior. For this task, a concept drift detector is employed. It receives the cross-entropy loss as input and emits an alert in case of a detection (cf. lines 7 and 8). This input can be passed as a raw or processed value, depending on the detector. Section 4.4 outlines the intricacies of different detectors as well as input processing steps appropriate for the considered application context. Depending on the loss trend surrounding a concept drift occurrence, different types can be distinguished: A concept drift is considered positive if the loss exhibits a decreasing trend, whereas it is considered negative if the loss exhibits an increasing trend.
The concept drift detector’s alerts then trigger reactions that intervene in the training process. The motivation behind this is to prevent performance deterioration and to improve training stability. The reactions can be triggered directly or on the basis of other conditions. The range of feasible reactions is vast and can e.g. include a reorganization of the training data, model versioning or an update of the training parameters. In the context of this work, however, a curriculum learning approach [1] is suggested for each batch. This means that as soon as a negative drift is detected for a certain range of data samples, they are considered hard ones, i.e. their internalization by the model is difficult. Therefore, these samples are not backpropagated but stored temporarily (cf. lines 9 and 10). If no concept drift is detected, backpropagation is directly done for such samples that are thus considered easy ones, which is in line with the fundamentals of curriculum learning.
As soon as a batch has been processed, fine-tuning on the gathered hard samples follows to make the model more robust (cf. lines 17 through 25). As it is expected that not all hard samples can be incorporated correctly during training, the set of samples again causing concept drift detections is tracked to monitor the improvement of the process (cf. line 21).
Figure 4 serves as an additional schematic depicting the suggested method. To highlight that the reaction to a concept drift detection is not restricted to the curriculum learning approach suggested in this paper, it depicts a block abstractly referring to a variable Response.
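A condensed sketch of the suggested progress control loop is given below. The structure follows the description above (and Algorithm 1): a drift detector consumes the per-iteration loss, samples flagged during a negative drift are deferred as hard samples, and a second pass fine-tunes on them while tracking which ones remain problematic. The ADWIN implementation from the river library, its delta value, the per-sample iteration, and the crude trend check are illustrative assumptions; the experiments use the pre-processed detector inputs described in Sect. 4.4.

```python
from river.drift import ADWIN   # example detector from the river library; PH can be substituted

def loss_increasing(history, k=10):
    """Crude local trend check on the loss history; Sect. 4.4 describes the
    OLS-based variant that is actually combined with ADWIN."""
    if len(history) < 2 * k:
        return False                                       # too little evidence for a negative drift
    return sum(history[-k:]) / k >= sum(history[-2 * k:-k]) / k

def fine_tune_with_progress_control(model, optimizer, loss_fn, loader, detector=None):
    detector = detector or ADWIN(delta=0.002)
    history, hard_samples = [], []
    for x, y in loader:                                    # main pass over the samples of a batch
        loss = loss_fn(model(x), y)
        history.append(loss.item())
        detector.update(loss.item())
        if detector.drift_detected and loss_increasing(history):
            hard_samples.append((x, y))                    # negative drift: defer, do not backpropagate
            continue
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    still_hard = []                                        # second pass: fine-tune on the hard samples
    for x, y in hard_samples:
        loss = loss_fn(model(x), y)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        detector.update(loss.item())
        if detector.drift_detected:
            still_hard.append((x, y))                      # samples that still cause detections
    return still_hard
```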
4 Experiments
The following section presents the datasets selected for pre-training based on the findings from [30] and the Industrial Hand Assembly Dataset V1 for the downstream task, as well as the model training environment and training time, before the experiments are discussed in more detail.
4.1 Datasets
4.1.1 20bn-something-something V2
The 20bn-something-something V2 dataset [9, 24] is a large-scale video dataset which is widely used as a benchmark for action recognition tasks. It contains 220,847 video clips with a duration of 2 to 6 s, each showing short human actions or activities from various sources. It covers 174 different action classes, representing a diverse set of everyday human activities, interactions with objects, and gestures.
4.1.2 EGTEA gaze
The EGTEA Gaze dataset [20] focuses on gaze estimation in the context of object interactions and human-object interactions in an egocentric vision setup. The dataset contains data from 61 participants interacting with 28 different objects. It consists of 14,000 frames in 86 videos and covers 32 activities, like pouring liquids, cutting vegetables, writing on a whiteboard, and using tools.
4.1.3 Industrial hand assembly dataset V1
The Industrial Hand Assembly Dataset V1 (IHADV1) [30] consists of industrial hand assembly recordings, which are available as videos or as already pre-extracted hand skeleton data in 12 fine-grained hand action classes. The dataset includes 459,180 frames in its basic version and, in another available version, 2,295,900 frames after applying spatial augmentation. One of its essential characteristics is the presence of occlusions, hand-object interactions, and a diverse set of fine-grained human hand actions specifically tailored for industrial assembly tasks.
4.2 Model training environment
The model architecture was implemented using PyTorch and integrated with Google’s MediaPipe Hands framework. Training and tuning of hyperparameters were performed on the STANDARD_NC6 instance of Microsoft Azure, which is equipped with 6 vCPUs and 56 GiB of memory. For the final training of the model, a GPU corresponding to half a K80 card with 12 GiB of memory was used, along with a maximum of 24 data disks and 1 NCiS. Depending on the amount of data used, training time ranged from 50 min up to 20 h and 09 min for the pre-trained models and up to 1 h and 45 min for the fine-tuned models, see the last columns in Tables 2 and 3.
4.3 Conception of self-supervised pre-training and supervised downstream task
To verify that the model architecture and random masking work as desired, the model is first trained separately on all datasets in a supervised fashion to create a baseline. Each experiment is then repeated as described using the self-supervised random masking training method. The reconstruction performance of the model on the action sequences is checked using the respective measurement values RMSE, MSE, and R2 score as well as the training time. Furthermore, the respective performance of the model is visualized in a test plot in which the representability of hand movements is demonstrated. These trained models are afterwards fine-tuned on the IHADV1 [30] by attaching to the complete self-supervised trained model a classifier consisting of a flatten layer, a linear layer with 126\(\times\)100 input features and the number of classes as output features, followed by a final SoftMax layer. With this additional classifier, the model has 1,882,284 trainable parameters. Since the datasets from Sect. 4.1 have different sizes, different BATCH_SIZEs had to be used in the self-supervised training phase, see the first entries in each column of Table 1. The small amount of data was compensated during masking, as in [46], by masking a sample several times per iteration and using it for training, so that the model saw a sample several times in different executions, see the ITERMASK cell in Table 1. Thus, as expected, a consistent improvement was observed even if significantly less data was available than, for example, in the 20bn-something-something V2 dataset, which was trained with a BATCH_SIZE of 128.
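For illustration, the described classification head corresponds to the following PyTorch layers; the pre-trained encoder producing the [126, 100] representation is omitted, and attaching it is assumed to follow the description above.

```python
import torch.nn as nn

# Classification head attached to the pre-trained model for the downstream task:
# flatten the [126, 100] output, map it to the 12 IHADV1 classes, apply SoftMax.
classifier_head = nn.Sequential(
    nn.Flatten(),               # [batch, 126, 100] -> [batch, 12600]
    nn.Linear(126 * 100, 12),   # number of output features = number of classes
    nn.Softmax(dim=-1),
)
```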
Also for the downstream task, due to the different amounts of data, different hyperparameters have to be applied during training to stabilize the model and to detect the corresponding patterns in the data, see the second entries in the columns EPOCHS, BATCH, and LR in Table 1. In addition, to stabilize the model training during fine-tuning, a ReduceLROnPlateau scheduler from PyTorch is applied to the training loop. This learning rate scheduler is used to dynamically adjust the learning rate in response to validation performance, with the goal of improving training stability and convergence. It monitors the validation loss and reduces the learning rate by a factor if the loss does not improve over successive epochs. This provides additional fine-grained control over the learning process and avoids potential plateaus. The LR has been increased to 1e-3, and the SCHEDULER_PATIENCE and SCHEDULER_FACTOR parameters have been set to 7 epochs and 0.5, respectively.
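The scheduler configuration described above corresponds to the following standard PyTorch usage; the optimizer choice and the placeholder model shown here are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(126 * 100, 12)   # placeholder for the fine-tuned network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)            # LR increased to 1e-3
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=7)                   # SCHEDULER_FACTOR=0.5, SCHEDULER_PATIENCE=7

# called once per epoch with the current validation loss:
# scheduler.step(val_loss)   # halves the learning rate after 7 epochs without improvement
```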
4.4 Application of explicit progress control for treating concept drifts
This section provides details for applying the explicit progress control method. As introduced in Sect. 3.3, this method fosters the robustness of fine-tuning processes against concept drift. For identifying concept drift occurrences, two different detectors are separately employed in experiments: Adaptive Windowing (ADWIN) and the Page-Hinkley (PH) method for concept drift detection. To leverage the strengths of both detectors, they receive differently pre-processed loss value inputs. Their detection evidence is integrated in the explicit progress control method as indicated in Algorithm 1.
ADWIN This concept drift detector provides rigorous performance guarantees for detecting prolonged concept drifts. It maintains two sliding windows and provides concept drift alerts if they exhibit statistical differences. The parameter delta enables the user to control the detector’s sensitivity [2].
In this work, ADWIN receives the raw loss values per iteration of the training process as input. In case it detects a concept drift, the local loss trend is inferred before triggering a response. While various methods can be applied for this task, this work suggests one inspired by signal analysis. A parametrized number of loss values before the identified concept drift candidate is considered. Another parametrized number of values at the end of these is discarded to account for ADWIN’s unknown detection delay. The remaining values are used for linear approximation using the ordinary least squares method. If the linear approximation exhibits positive or near-zero slope, an increasing loss trend is inferred. If it exhibits a non-zero negative slope, a decreasing loss trend is inferred. Only the former case results in a response being triggered. If the loss trend is decreasing, the training process is not interfered with and the concept drift candidate is ignored.
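A small numpy sketch of this trend inference is given below; the window length and the number of discarded trailing values are assumed parameters, not values reported in the paper.

```python
import numpy as np

def infer_loss_trend(losses, window: int = 20, discard: int = 3) -> str:
    """Fit an ordinary-least-squares line to the losses preceding a drift candidate.
    The last `discard` values are dropped to account for ADWIN's unknown detection delay."""
    vals = np.asarray(losses[-(window + discard):len(losses) - discard], dtype=float)
    if vals.size < 2:
        return "increasing"                                # too little history: treat conservatively
    slope = np.polyfit(np.arange(vals.size), vals, 1)[0]
    return "increasing" if slope >= -1e-6 else "decreasing"   # near-zero slope counts as increasing

# only an "increasing" trend triggers a response; a decreasing trend is ignored
```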
Page-Hinkley This method for concept drift detection operates by monitoring a sequence of observations and calculating a cumulative sum based on these. When the cumulative sum exceeds a user-parametrized threshold, it indicates that a significant change has occurred in the data stream [27, 28].
With PH being employed in this work’s experiments, the detector input is pre-processed as follows: The incrementally growing loss value history is continuously standardized according to its mean and standard deviation. This is followed by two consecutive computations of the discrete differences between ensuing values, which provides an approximation of the second derivative of the standardized loss signal as input for PH. Each concept drift detection directly triggers a response.
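The described pre-processing of the loss history for PH can be sketched as follows; the PageHinkley class of the river library is used as an assumed example implementation, and the loss history is a random stand-in.

```python
import numpy as np
from river.drift import PageHinkley

def preprocess_for_ph(loss_history):
    """Standardize the growing loss history and take two discrete differences,
    approximating the second derivative of the standardized loss signal."""
    x = np.asarray(loss_history, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)     # continuous standardization
    return np.diff(x, n=2)                    # second-order discrete difference

losses = list(np.random.default_rng(0).random(200))   # stand-in for the recorded loss history
ph = PageHinkley()
for value in preprocess_for_ph(losses):
    ph.update(value)
    if ph.drift_detected:
        pass  # each detection directly triggers a response (cf. Algorithm 1)
```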
5 Results and analysis
In the following section, the results of the self-supervised pre-training and the fine-tuning of the downstream task based on it are presented, followed by an analysis of the training process.
5.1 Results self-supervised learning
As can be seen in Table 2, the best result was obtained with all datasets combined. The longest training time was more than 20 h, but with an R2 score of 50.7% and much lower MSE and RMSE of 0.0001 and 0.01. Only the standalone 20bn-sth-sth dataset could match these good results to some extent. It can also be seen in Table 2 that the more data is available and the more variance the data has, the better the representation becomes. An example of the very good representation results can be seen in Fig. 5, where the red and blue dots are the true sequence and the green and yellow dots, depending on whether the hand is left or right, are the predicted ones while picking up a larger part. This shows that after self-supervised representation learning, the model is able to recreate very accurate sequences from scratch.
5.2 Results downstream task
The downstream task shows a similar picture regarding the amount of data as the previous self-supervised comparison, with a maximum training time of 1:45:18 h. The 20bn-sth-sth dataset, as well as all datasets together, showed significantly better results with many labels. Both achieved an accuracy of over 95% with 80% labeled data as well as with fewer labels. As expected, accuracy in classification decreases with fewer labels, but the results are still within a very good range, with an achieved accuracy of up to 93.15% with only 40% of the labels, see Table 3.
To compare how self-supervised pre-training improves the results, the first row presents the results of fully supervised training of the model. As in the other experiments, the number of labels in the IHADV1 dataset was reduced piece by piece for this purpose. Although the results show that very good comparable measurements are produced even without pre-training in the high label range, Fig. 6 clearly shows that pre-training leads to better generalizability of the model.
This can be seen in the smaller gap between the training curve, which is the solid line, and the validation loss, which is the dashed line, when the golden lines (w pre-training) and the blue lines (w/o pre-training) are compared. In addition, the results in the low-label range, which should be the target for these experiments in Table 3, are clearly inferior to those obtained with pre-training. However, there is a disadvantage in the context of the random masking used here, which is visible in the training with ALL_DATA when comparing 60% and 40% of the labels. Due to the uncontrolled masking of the skeleton sequence, it is possible that parts of the sequences are masked that contain good features for reconstruction and context understanding, and the model is therefore no longer able to reconstruct the masked sequence parts completely. Further work on avoiding this problem is discussed in the outlook. In addition, it can be seen that this classical random masking self-supervised learning method also requires a certain amount of data to obtain enough contextual information in advance. In row 4 of Table 3, where the EGTEA dataset was used, it is clearly visible that too little contextual understanding was obtained in advance for classification. How to solve these cases with a satisfactory performance is also discussed in the outlook.
5.3 Results of explicit progress control for treating concept drifts
Due to the large number of follow-up experiments caused by the combination of the different datasets from the previous experiments and the novel integration of the explicit progress control method, the following section focuses exclusively on the self-supervised pre-trained models and the training on ALL_DATA of the combined datasets. Table 4 provides an overview of the associated experiments. The annotations "w DD" and "w/o DD" mark the inclusion and omission of the explicit progress control method, respectively. As before, the amount of labeled data is gradually reduced to show that this approach can reduce the amount of labeled data for each use case without significant performance loss. The more impressive results, however, are those from the experiments including the novel explicit progress control method reported in rows 2 and 3 of Table 4. There, it becomes clear how employing PH (cf. row 2) as well as the slightly better ADWIN concept drift detector (cf. row 3) leads to a higher accuracy in the low-label range. Although only 10% of the labels are available, the accuracy is close to the 95% achieved in the first experiments without the new method, as presented in the last row of Table 3.
The effect of the two concept drift detectors also becomes apparent in Fig. 7. Figure 8 displays the same observations while focusing on epoch 30 onwards to further highlight the differences. Already in epoch 45, i.e. far earlier, the blue train loss curve, which results from employing ADWIN, reaches a stable minimum when compared to the green loss curve corresponding to not employing the proposed novel method. In the latter case, this state is not reached before epoch 65. Employing PH also leads to the stable minimum being reached earlier, more precisely around epoch 60. These results show that explicit concept drift adaptation by the re-initialisation of the hard samples has an improving effect on model training.
5.4 Analysis
The results clearly demonstrate that more data with more variance in the self-supervised masking produces significantly better results, generalizability, and performance in the self-supervised method as well as in the subsequent downstream task. It has also been shown that a break-even point is reached with an amount of labels as small as 40% in this case, producing very high test accuracy results. The improvement with more and higher-variation data, as in the case of all datasets combined, but also in the case of the 20bn-sth-sth dataset, was additionally recognized during training when looking at the weights passing through the gate of the ConvGTN in Fig. 3, where the spatial and temporal features are combined. It was recognized that the more data was available to the gate, the better the towers could be balanced, leading to a better final result. When less data was available, only one tower was optimized, since the gate then decides for only one tower in the end. This shows that using a large set of unlabeled data under these conditions is necessary for the robustness of similar models, but this needs to be further investigated in later experiments.
Observing the use of the explicit progress control method in the experiments from Sect. 5.3, it is apparent that concept drift detection does not only reveal deteriorating effects in model training. It can also help to improve the training process, especially for fine-tuning tasks, when an explicit adaptation action takes place. Even without extensive hyperparameter tuning of the two concept drift detectors, it is evident that both options have a great effect on the training, especially with fewer labels, as the numerical results in Table 4 show. In addition to the accuracy results, the method also helps to achieve a stable result in fewer epochs, although these results can be optimized and examined in more detail in future work.
6 Conclusion and outlook
In this work, it was shown that it is possible to pre-train a spatio-temporal deep learning model without labels using data from daily life and then fine-tune it efficiently with a significantly smaller number of labels than in the fully supervised state. This subsequent fine-tuning can then be further optimized and stabilized using various concept drift adaptation methods to ensure stable performance even in a later changing environment, which is very often the case in industrial scenarios. The reason for this approach is the need to avoid the high cost of labels for a company. In addition, this approach makes it possible to train scalable models in a self-supervised way and then use them over a long period of time in an industrial application for quality control, as in the presented case of human assembly recognition. Future experiments will test how stable the model reacts to further assembly steps, how extensive hyperparameter tuning of the applied concept drift detection methods can improve the training process and how the self-supervised pre-training can be optimized followed by a larger evaluation and cross-validation of all data sets investigated so far to find a better combination of features for Human Hand Action Recognition.
References
Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th annual international conference on machine learning. pp. 41–48 (2009)
Bifet, A., Gavalda, R.: Learning from time-changing data with adaptive windowing. In: Proceedings of the 2007 SIAM international conference on data mining. pp. 443–448. SIAM (2007)
Cao, S., Xu, P., Clifton, D.A.: How to understand masked autoencoders. arXiv preprint arXiv:2202.03670 (2022)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: transformers for image recognition at scale (2021)
Feichtenhofer, C., Li, Y., He, K., et al.: Masked autoencoders as spatiotemporal learners. Adv. Neural. Inf. Process. Syst. 35, 35946–35958 (2022)
Gama, J., Zliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. 46(4), 1–37 (2014). https://doi.org/10.1145/2523813
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 46(4), 1–37 (2014)
Goyal, R., Kahou, S.E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., Memisevic, R.: The “something something” video database for learning and evaluating visual common sense (2017). arXiv:1706.04261
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)
Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016)
Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
Ho, Y., Wookey, S.: The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. IEEE Access 8, 4806–4813 (2019)
Hu, M., Kapoor, B., Akella, P., Prager, D.: The state of human factory analytics (2018), https://info.kearney.com/30/2769/uploads/the-state-of-human-factory-analytics.pdf?intIaContactId=eAsAAnVQ4FJww4J%2fWxZkpg%3d%3d&intExternalSystemId=1, accessed: 07/25/2024
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. JMLR Workshop and Conference Proceedings, vol. 37, pp. 448–456. JMLR.org (2015), http://proceedings.mlr.press/v37/ioffe15.html
Iwashita, A.S., Papa, J.P.: An Overview on Concept Drift Learning. IEEE Access 7, 1532–1547 (2019). https://doi.org/10.1109/ACCESS.2018.2886026
Khamassi, I., Sayed-Mouchaweh, M., Hammami, M., Ghédira, K.: Discussion and review on evolving data streams and concept drift adapting. Evol. Syst. 9(1), 1–23 (2018). https://doi.org/10.1007/s12530-016-9168-2
Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Woźniak, M.: Ensemble learning for data stream analysis: a survey. Inf. Fusion 37, 132–156 (2017). https://doi.org/10.1016/j.inffus.2017.02.004
Li, Y., Si, S., Li, G., Hsieh, C.J., Bengio, S.: Learnable fourier features for multi-dimensional spatial positional encoding (2021)
Li, Y., Liu, M., Rehg, J.M.: In the eye of the beholder: Gaze and actions in first person video (2020). arxiv:2006.00626
Lin, T., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. CoRR abs/1612.03144 (2016), arxiv:1612.03144
Lin, T., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. CoRR abs/1708.02002 (2017), arxiv:1708.02002
Liu, M., Ren, S., Ma, S., Jiao, J., Chen, Y., Wang, Z., Song, W.: Gated transformer networks for multivariate time series classification. CoRR abs/2103.14438 (2021), arxiv:2103.14438
Mahdisoltani, F., Berger, G., Gharbieh, W., Fleet, D.J., Memisevic, R.: Fine-grained video classification and captioning. CoRR abs/1804.09235 (2018), arxiv:1804.09235
Minku, L.L., Yao, X.: DDD: a new ensemble approach for dealing with concept drift. IEEE Trans. Knowl. Data Eng. 24(4), 619–633 (2012). https://doi.org/10.1109/TKDE.2011.58
Ng, A.: Sparse autoencoder (NA), http://www.stanford.edu/class/cs294a/sparseAutoencoder.pdf
Page, E.S.: Continuous inspection schemes. Biometrika 41(1/2), 100–115 (1954)
Sebastião, R., Fernandes, J.M.: Supporting the page-hinkley test with empirical mode decomposition for change detection. In: International Symposium on Methodologies for Intelligent Systems. pp. 492–498. Springer (2017)
Sethi, T.S., Kantardzic, M.: Don’t pay for validation: detecting drifts from unlabeled data using margin density. Procedia Comput. Sci. 53, 103–112 (2015). https://doi.org/10.1016/j.procs.2015.07.284
Sturm, F., Hergenroether, E., Reinhardt, J., Vojnovikj, P.S., Siegel, M.: Challenges of the creation of a dataset for vision based human hand action recognition in industrial assembly. In: Arai, K. (ed.) Intelligent Computing, pp. 1079–1098. Springer Nature Switzerland, Cham (2023)
Sturm, F., Sathiyababu, R., Allipilli, H., Hergenroether, E., Siegel, M.: Self-supervised representation learning for fine grained human hand action recognition in industrial assembly lines. In: International Symposium on Visual Computing. pp. 172–184. Springer (2023)
Tang, P., Zhang, X.: Mtsmae: Masked autoencoders for multivariate time-series forecasting. In: 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI). pp. 982–989. IEEE (2022)
Tong, Z., Song, Y., Wang, J., Wang, L.: Videomae: masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural. Inf. Process. Syst. 35, 10078–10093 (2022)
Trockman, A., Kolter, J.Z.: Patches are all you need? Trans. Mach. Learn. Res. 2023 (2022)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Vela, D., Sharp, A., Zhang, R., Nguyen, T., Hoang, A., Pianykh, O.S.: Temporal quality degradation in AI models. Sci. Rep. (2022). https://doi.org/10.1038/s41598-022-15245-z
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning. pp. 1096–1103 (01 2008). https://doi.org/10.1145/1390156.1390294
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11(Dec), 3371–3408 (2010)
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A., Bottou, L.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11(12), 3371–3408 (2010)
Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating visual representations from unlabeled video. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 98–106 (2016)
Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., Murphy, K.: Tracking emerges by colorizing videos. In: Proceedings of the European conference on computer vision (ECCV). pp. 391–408 (2018)
Wares, S., Isaacs, J., Elyan, E.: Data stream mining: methods and challenges for handling concept drift. SN Appl. Sci. 1(11), 1–19 (2019). https://doi.org/10.1007/s42452-019-1433-0
Webb, G.I., Hyde, R., Cao, H., Nguyen, H.L., Petitjean, F.: Characterizing concept drift. Data Min. Knowl. Disc. 30(4), 964–994 (2016). https://doi.org/10.1007/s10618-015-0448-4
Wu, W., Hua, Y., Wu, S., Chen, C., Lu, A., et al.: Skeletonmae: Spatial-temporal masked autoencoders for self-supervised skeleton action recognition. arXiv preprint arXiv:2209.02399 (2022)
Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., Hu, H.: Simmim: a simple framework for masked image modeling. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 9643–9653 (2021)
Zerveas, G., Jayaraman, S., Patel, D., Bhamidipaty, A., Eickhoff, C.: A transformer-based framework for multivariate time series representation learning. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. pp. 2114–2124 (2021)
Zhang, F., Bazarevsky, V., Vakunov, A., Tkachenka, A., Sung, G., Chang, C., Grundmann, M.: Mediapipe hands: On-device real-time hand tracking. CoRR abs/2006.10214 (2020), arxiv:2006.10214
Žliobaitė, I., Pechenizkiy, M., Gama, J.: An overview of concept drift applications. Big data analysis: new algorithms for a new society pp. 91–114 (2016)
Funding
Open Access funding enabled and organized by Projekt DEAL.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.