Table 1 Summary of various fish detection methods proposed, the datasets used and their evaluation results in stereo vision

From: A review of deep learning-based stereo vision techniques for phenotype feature and behavioral analysis of fish in aquaculture

| Challenge | Strategy | Reference | Algorithm | Dataset description | Annotation | Size | Preprocessing | Evaluation results |
|---|---|---|---|---|---|---|---|---|
| High inter-class similarity | Integrated the backbone with the Convolutional Block Attention Module (CBAM) to focus on key areas | Liu et al. (2022) | YOLOv5 | Self-built dataset of RGB and infrared images captured by an underwater depth camera | Fewer mask annotations than labelled-box annotations | N/A | N/A | Measurement accuracy = 96.9% |
| | | Deng et al. (2022) | Keypoint RCNN | Self-built dataset of images of five fish species captured by a binocular camera in a culture pool | Annotated following the COCO human key-point format | Training/validation/test = 7200/1800/900 | Enhancement: two attributes among saturation, brightness, contrast, and sharpness adjusted stochastically | mAP for fish detection increased by 2.4% after integrating CBAM |
| | | Deng et al. (2023) | RetinaNet; CenterNet | Self-built dataset captured by a binocular camera above the water surface, keeping only one side of each image pair; scenes include both outdoor and indoor lighting | Only straight, unoccluded fish annotated with head and tail key-points | Training/validation = 6400/1600 (fish detection), 19,200/4800 (key-point detection) | Enhancement: saturation, brightness, contrast, and sharpness adjusted stochastically | Pixel distance error decreased by about 0.215 pixels after integrating CBAM |
| | Integrated the Deep Layer Aggregation backbone with a Transformer | Yu et al. (2023) | CenterNet | RGB images collected from the Internet and the field, mostly captured in clear water | Each fish labelled with 9 key-points | Training/validation/test = 1260/158/158 | Enhancement: turbid underwater image enhancement based on parameter-tuned stochastic resonance | AP for key-point detection increased by 1.8 after integrating the Transformer alone |
| Multi-scale variations | Added skip connections through deformable convolution for better feature aggregation | Yu et al. (2023) | See above | See above | | | | AP for key-point detection increased by 3.9 after integrating the aggregation alone |
| | Replaced the Feature Pyramid Network (FPN) with the Improved Path Aggregation Network (I-PANet) | Deng et al. (2022) | See above | See above | | | | mAP for fish detection increased by 1.4% after replacing FPN with I-PANet |
| | Replaced FPN with an ASFF structure to adaptively improve the scale invariance of features | Deng et al. (2023) | See above | See above | | | | See above |
| Misdetection of key-points due to high fish density | Intermediate supervision in a Stacked Hourglass-based network avoids false detections | Suo et al. (2020) | Faster R-CNN; Stacked Hourglass | Self-built stereo dataset captured by a binocular camera in a culture pool | Bounding boxes and 7 key-points | Training/validation = 1117/124 (fish detection), 551/61 (key-point detection) | N/A | mAP for fish detection = 0.905; averaged Object Keypoint Similarity (OKS) = 0.667 |
| | Bottom-up method designed for multi-person key-point detection and tracking, using Part Affinity Fields (PAFs) to associate key-points of the same body | Hsieh and Lee (2023) | OpenPose; ArtTrack | Self-built stereo dataset of image pairs of Oplegnathus punctatus captured by a binocular camera in a culture pool | Each fish labelled with 9 key-points and 9 bones | 1000 training images | Enhancement: histogram equalization (white balance) and edge enhancement | Measurement relative error = 4.49% |
| Real-time requirements | Pre-trained the model with open datasets | Tonachella et al. (2022) | YOLOv4; ResNet-101 | Open Images dataset for fish detection; self-built stereo dataset for key-point detection, captured by a stereo camera in a sea cage | Cropped fish images with the snout tip and the base of the middle caudal rays labelled | Training/test = 1120/280 (fish detection), 8960/3840 (key-point detection) | Augmentation: scale, noise, rotation, translation, and brightness | mAP for fish detection = 87%; MSE for landmark detection = 0.23 |
| | Lightweight YOLOv5-small model with a transfer-learning strategy | Marrable et al. (2023) | YOLOv5-small | Open dataset from the OzFish stereo-BRUVS imagery | Cropped fish images with head and tail labelled | Training/validation/test = 5348/2292/4154 | N/A | Precision for key-point detection = 77.40% |
| | Lightweight, pre-trained Deep Layer Aggregation network (DLA-X-60-C) | Deng et al. (2023) | See above | See above | | | | Parameter count decreased by 72.207 M after using the pre-trained lightweight model |
| | Bottleneck and group convolution to improve training efficiency | Deng et al. (2022) | See above | See above | | | | Model size increased by only 0.811 M while mAP for fish detection increased by 4.55% |

  1. AP, mAP and MSE denote average precision, mean average precision, and mean squared error
  2. N/A: not available
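Several rows credit CBAM with the reported detection gains. As a rough illustration of what CBAM computes (a minimal NumPy sketch, not the reviewed papers' code), the module applies two sequential gates: channel attention from global average/max pooling through a shared two-layer MLP, and spatial attention from channel-wise average/max pooling through a small convolution (reduced here to a 1x1 channel mix for brevity; CBAM uses a 7x7 kernel):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    # feat: (C, H, W). Global average- and max-pool over the spatial dims,
    # pass each through a shared MLP (w1: C -> C//r, w2: C//r -> C),
    # sum, and gate the channels with a sigmoid.
    avg = feat.mean(axis=(1, 2))          # (C,)
    mx = feat.max(axis=(1, 2))            # (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ avg, 0) + w2 @ np.maximum(w1 @ mx, 0))
    return feat * gate[:, None, None]

def spatial_attention(feat, kernel):
    # Average- and max-pool across channels -> (2, H, W), mix the two maps
    # with a learned 1x1 kernel, and gate each spatial location.
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])   # (2, H, W)
    gate = sigmoid((kernel[:, None, None] * pooled).sum(axis=0))
    return feat * gate[None, :, :]

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
k = rng.standard_normal(2) * 0.1

out = spatial_attention(channel_attention(feat, w1, w2), k)
print(out.shape)  # (8, 4, 4)
```

Because both gates lie in (0, 1), the module rescales features without changing their shape, which is why it can be dropped into an existing backbone.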
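The Deng et al. (2022) row attributes a model-size increase of only 0.811 M to bottleneck and group convolution. A back-of-envelope parameter count (illustrative channel and group numbers, not the paper's) shows why this design is cheap:

```python
def conv_params(c_in, c_out, k, groups=1):
    # Weight count of a 2-D convolution (bias ignored): each of the `groups`
    # groups maps c_in/groups inputs to c_out/groups outputs with a k x k kernel,
    # so grouping divides the cost by `groups`.
    return (c_in // groups) * (c_out // groups) * k * k * groups

c = 256
standard = conv_params(c, c, 3)                          # plain 3x3 conv
bottleneck = (conv_params(c, c // 4, 1)                  # 1x1 reduce
              + conv_params(c // 4, c // 4, 3, groups=4) # grouped 3x3
              + conv_params(c // 4, c, 1))               # 1x1 expand
print(standard, bottleneck)  # 589824 41984
```

Replacing one plain 3x3 convolution with the bottleneck block cuts its parameters by roughly 14x in this example, which is the mechanism behind the small size increase reported in the table.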
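Suo et al. (2020) report an averaged Object Keypoint Similarity (OKS) of 0.667. For readers unfamiliar with the metric, a minimal sketch of the COCO-style OKS score follows; the fish key-points and the per-keypoint kappa constants are hypothetical (COCO defines kappas for human joints, and values for fish landmarks are assumptions here):

```python
import math

def oks(pred, gt, visible, area, kappas):
    # COCO-style Object Keypoint Similarity: each key-point contributes a
    # Gaussian score of its pixel error, scaled by object area (s^2 = area)
    # and a per-keypoint tolerance kappa, averaged over labelled key-points.
    num, den = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visible, kappas):
        if not v:
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        num += math.exp(-d2 / (2 * area * k ** 2))
        den += 1
    return num / den if den else 0.0

# Hypothetical 3-keypoint fish annotation (head, dorsal fin, tail).
gt = [(10.0, 20.0), (30.0, 22.0), (50.0, 20.0)]
pred = [(11.0, 20.5), (30.0, 25.0), (48.0, 20.0)]
score = oks(pred, gt, [1, 1, 1], area=900.0, kappas=[0.05, 0.05, 0.05])
print(round(score, 3))
```

An OKS of 1 means every labelled key-point was predicted exactly; errors are penalized relative to object size, so the same pixel error hurts small fish more than large ones.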
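The measurement accuracies and relative errors in the table come from triangulating matched key-points across stereo pairs. A minimal rectified-stereo sketch (the calibration values are hypothetical, not from any of the cited papers): the snout and tail pixels are back-projected using depth Z = f * B / d, and fish length is the 3-D distance between them:

```python
import math

def backproject(u, v, d, f, b, cx, cy):
    # Rectified stereo pinhole model: depth Z = f*b/d from disparity d,
    # then back-project the left-image pixel (u, v) to camera coordinates.
    # All lengths come out in the same unit as the baseline b.
    z = f * b / d
    return ((u - cx) * z / f, (v - cy) * z / f, z)

def fish_length(head_l, head_r, tail_l, tail_r, f, b, cx, cy):
    # Disparity is the horizontal pixel offset between the matched
    # left/right detections of the same key-point.
    h = backproject(head_l[0], head_l[1], head_l[0] - head_r[0], f, b, cx, cy)
    t = backproject(tail_l[0], tail_l[1], tail_l[0] - tail_r[0], f, b, cx, cy)
    return math.dist(h, t)

# Hypothetical calibration: f = 800 px, baseline = 0.10 m, principal point (320, 240).
L = fish_length((300, 240), (220, 240), (420, 250), (348, 250),
                f=800, b=0.10, cx=320, cy=240)
print(round(L, 3))  # length in metres
```

This is why key-point localization accuracy dominates the length-measurement errors quoted in the table: a one-pixel disparity error shifts the recovered depth, and hence the length, of every fish.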