Table 1 Summary of the fish detection methods proposed for stereo vision, the datasets used, and their evaluation results
| Challenge | Strategy | References | Algorithm | Dataset description | Annotation | Size | Preprocessing | Evaluation results |
|---|---|---|---|---|---|---|---|---|
| High inter-class similarity | Integrated the backbone with a Convolutional Block Attention Module (CBAM) to focus on key areas | Liu et al. (2022) | YOLOv5 | Self-built dataset of RGB and infrared images captured by an underwater depth camera | Fewer mask annotations than bounding-box annotations | N/A | N/A | Measurement accuracy = 96.9% |
| | | Deng et al. (2022) | Keypoint R-CNN | Self-built dataset of five fish species captured by a binocular camera in a culture pool | Annotated following the COCO human key-point annotation format | Training/validation/test = 7200/1800/900 | Enhancement: two attributes randomly chosen from saturation, brightness, contrast, and sharpness were adjusted stochastically | mAP for fish detection increased by 2.4% after integrating CBAM |
| | | Deng et al. (2023) | RetinaNet, CenterNet | Self-built dataset captured by a binocular camera above the water surface, keeping only one image of each stereo pair; collection scenes included both outdoor and indoor lighting | Only straight, unoccluded fish were annotated with head and tail key-points | Training/validation = 6400/1600 for fish detection, 19,200/4800 for key-point detection | Enhancement: saturation, brightness, contrast, and sharpness were adjusted stochastically | Pixel-distance errors decreased by about 0.215 pixels after integrating CBAM |
| | Integrated a Deep Layer Aggregation backbone with a Transformer | Yu et al. (2023) | CenterNet | RGB images collected from the Internet and the field, mostly captured in clear water bodies | Each fish labeled with 9 key-points | Training/validation/test = 1260/158/158 | Enhancement: turbid underwater image enhancement based on parameter-tuned stochastic resonance | AP for key-point detection increased by 1.8 after integrating the Transformer alone |
| Multi-scale variations | Increased skip connections through deformable convolution for better feature aggregation | Yu et al. (2023) | Already mentioned above | | | | | AP for key-point detection increased by 3.9 after integrating the aggregation module alone |
| | Replaced the Feature Pyramid Network (FPN) with an Improved Path Aggregation Network (I-PANet) | Deng et al. (2022) | Already mentioned above | | | | | mAP for fish detection increased by 1.4% after replacing FPN with I-PANet |
| | Replaced the FPN with an ASFF structure to improve the scale invariance of the features adaptively | Deng et al. (2023) | Already mentioned above | | | | | Already mentioned above |
| Misdetection of key-points due to high fish density | Intermediate supervision in a Stacked Hourglass-based network avoids false detections | Suo et al. (2020) | Faster R-CNN, Stacked Hourglass | Self-built stereo dataset captured by a binocular camera in a culture pool | Bounding boxes and 7 key-points | Training/validation = 1117/124 for fish detection, 551/61 for key-point detection | N/A | mAP for fish detection = 0.905; averaged Object Keypoint Similarity (OKS) = 0.667 |
| | Bottom-up method designed for multi-person key-point detection and tracking, using Part Affinity Fields (PAFs) to associate key parts of the same body | Hsieh and Lee (2023) | OpenPose, ArtTrack | Self-built stereo dataset of Oplegnathus punctatus image pairs captured by a binocular camera in a culture pool | Each fish labeled with 9 key-points and 9 bones | 1000 images for training | Enhancement: histogram equalization (white balance) and edge enhancement | Measurement relative error = 4.49% |
| Real-time requirements | Pre-trained the model on open datasets | Tonachella et al. (2022) | YOLOv4, ResNet-101 | Open Image datasets for fish detection; self-built stereo dataset for key-point detection, captured by a stereo camera in a sea cage | Cropped fish images with the snout tip and the base of the middle caudal rays labeled | Training/test = 1120/280 for fish detection, 8960/3840 for key-point detection | Augmentation: scale, noise, rotation, translation, and brightness | mAP for fish detection = 87%; MSE for landmark detection = 0.23 |
| | Lightweight YOLOv5-small model with a transfer-learning strategy | Marrable et al. (2023) | YOLOv5-small | Open dataset from the OzFish stereo-BRUVS imagery | Cropped fish images with head and tail labeled | Training/validation/test = 5348/2292/4154 | N/A | Deep-learning precision for key-point detection = 77.40% |
| | Lightweight, pre-trained Deep Layer Aggregation network (DLA-X-60-C) | Deng et al. (2023) | Already mentioned above | | | | | Number of parameters decreased by 72.207 M after using the pre-trained lightweight model |
| | Bottleneck and group convolutions effectively improve training efficiency | Deng et al. (2022) | Already mentioned above | | | | | Model size increased by only 0.811 M, while mAP for fish detection increased by 4.55% |
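Several entries above attach CBAM to the detector backbone to counter high inter-class similarity. The sketch below illustrates the idea in plain NumPy, not any of the cited implementations: the channel branch pools globally and passes the pooled vectors through a shared two-layer MLP (`w1`, `w2` are illustrative placeholder weights with an assumed reduction ratio), and the spatial branch, which in CBAM uses a 7×7 convolution over the channel-wise average and max maps, is simplified here to a per-pixel weighted sum (`ws`).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feat, w1, w2, ws):
    """CBAM-style sequential channel then spatial attention (NumPy sketch).

    feat : (C, H, W) feature map
    w1   : (C//r, C) first layer of the shared MLP (reduction ratio r)
    w2   : (C, C//r) second layer of the shared MLP
    ws   : (2,) weights of the simplified per-pixel fusion of the two
           channel-pooled maps (stand-in for CBAM's 7x7 convolution)
    """
    # Channel attention: global average- and max-pooling, shared MLP, sum.
    avg = feat.mean(axis=(1, 2))                  # (C,)
    mx = feat.max(axis=(1, 2))                    # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # two layers with ReLU
    ca = sigmoid(mlp(avg) + mlp(mx))              # per-channel weights in (0, 1)
    feat = feat * ca[:, None, None]

    # Spatial attention: channel-wise average and max maps, fused per pixel.
    smap = sigmoid(ws[0] * feat.mean(axis=0) + ws[1] * feat.max(axis=0))
    return feat * smap[None, :, :]                # same shape as the input
```

Because both attention maps pass through a sigmoid, the refined features are elementwise bounded by the input, which is one reason the module can be inserted into an existing backbone (as in the YOLOv5 and Keypoint R-CNN rows) with little disruption to training.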