Table 1 Summary of the fish detection methods proposed for stereo vision, the datasets used, and their evaluation results
| Challenge | Strategy | References | Algorithm | Dataset description | Annotation | Size | Preprocessing | Evaluation results |
|---|---|---|---|---|---|---|---|---|
| High inter-class similarity | Integrated the backbone with a Convolutional Block Attention Module (CBAM) to focus on key areas | Liu et al. (2022) | YOLOv5 | Self-built dataset of RGB and infrared images captured by an underwater depth camera | Fewer mask annotations than bounding-box annotations | N/A | N/A | Measurement accuracy = 96.9% |
| | | Deng et al. (2022) | Keypoint R-CNN | Self-built dataset of five fish species captured by a binocular camera in a culture pool | Annotated following the COCO human key-point annotation format | Training/validation/test = 7200/1800/900 | Enhancement: two attributes randomly chosen from saturation, brightness, contrast, and sharpness were adjusted stochastically | mAP for fish detection increased by 2.4% after integrating CBAM |
| | | Deng et al. (2023) | RetinaNet, CenterNet | Self-built dataset captured by a binocular camera above the water surface, keeping only one image of each stereo pair; collection scenes included both outdoor and indoor lighting | Only straight, unoccluded fish were annotated with head and tail key-points | Training/validation = 6400/1600 for fish detection, 19,200/4800 for key-point detection | Enhancement: saturation, brightness, contrast, and sharpness were adjusted stochastically | Pixel-distance errors decreased by about 0.215 pixels after integrating CBAM |
| | Integrated a Deep Layer Aggregation backbone with a Transformer | Yu et al. (2023) | CenterNet | RGB images collected from the Internet and the field, mostly captured in clear water bodies | Each fish labeled with 9 key-points | Training/validation/test = 1260/158/158 | Enhancement: turbid underwater image enhancement based on parameter-tuned stochastic resonance | AP for key-point detection increased by 1.8 after integrating the Transformer alone |
| Multi-scale variations | Increased skip connections through deformable convolution for better feature aggregation | Yu et al. (2023) | Already mentioned above | | | | | AP for key-point detection increased by 3.9 after integrating the aggregation module alone |
| | Replaced the Feature Pyramid Network (FPN) with an Improved Path Aggregation Network (I-PANet) | Deng et al. (2022) | Already mentioned above | | | | | mAP for fish detection increased by 1.4% after replacing FPN with I-PANet |
| | Replaced the FPN with an ASFF structure to improve the scale invariance of the features adaptively | Deng et al. (2023) | Already mentioned above | | | | | Already mentioned above |
| Misdetection of key-points due to high fish density | Intermediate supervision in a Stacked Hourglass-based network avoids false detections | Suo et al. (2020) | Faster R-CNN, Stacked Hourglass | Self-built stereo dataset captured by a binocular camera in a culture pool | Bounding boxes and 7 key-points | Training/validation = 1117/124 for fish detection, 551/61 for key-point detection | N/A | mAP for fish detection = 0.905; averaged Object Keypoint Similarity (OKS) = 0.667 |
| | Bottom-up method designed for multi-person key-point detection and tracking, using Part Affinity Fields (PAFs) to associate key parts of the same body | Hsieh and Lee (2023) | OpenPose, ArtTrack | Self-built stereo dataset of Oplegnathus punctatus image pairs captured by a binocular camera in a culture pool | Each fish labeled with 9 key-points and 9 bones | 1000 images for training | Enhancement: histogram equalization (white balance) and edge enhancement | Measurement relative error = 4.49% |
| Real-time requirements | Pre-trained the model on open datasets | Tonachella et al. (2022) | YOLOv4, ResNet-101 | Open Image datasets for fish detection; self-built stereo dataset for key-point detection, captured by a stereo camera in a sea cage | Cropped fish images with the snout tip and the base of the middle caudal rays labeled | Training/test = 1120/280 for fish detection, 8960/3840 for key-point detection | Augmentation: scale, noise, rotation, translation, and brightness | mAP for fish detection = 87%; MSE for landmark detection = 0.23 |
| | Lightweight YOLOv5-small model with a transfer-learning strategy | Marrable et al. (2023) | YOLOv5-small | Open dataset from the OzFish stereo-BRUVS imagery | Cropped fish images with head and tail labeled | Training/validation/test = 5348/2292/4154 | N/A | Deep-learning precision for key-point detection = 77.40% |
| | Lightweight, pre-trained Deep Layer Aggregation network (DLA-X-60-C) | Deng et al. (2023) | Already mentioned above | | | | | Number of parameters decreased by 72.207 M after using the pre-trained lightweight model |
| | Bottleneck and group convolutions effectively improve training efficiency | Deng et al. (2022) | Already mentioned above | | | | | Model size increased by only 0.811 M, while mAP for fish detection increased by 4.55% |
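Several entries above attach CBAM to the detector backbone to counter high inter-class similarity. The sketch below illustrates the idea in plain NumPy, not any of the cited implementations: the channel branch pools globally and passes the pooled vectors through a shared two-layer MLP (`w1`, `w2` are illustrative placeholder weights with an assumed reduction ratio), and the spatial branch, which in CBAM uses a 7×7 convolution over the channel-wise average and max maps, is simplified here to a per-pixel weighted sum (`ws`).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feat, w1, w2, ws):
    """CBAM-style sequential channel then spatial attention (NumPy sketch).

    feat : (C, H, W) feature map
    w1   : (C//r, C) first layer of the shared MLP (reduction ratio r)
    w2   : (C, C//r) second layer of the shared MLP
    ws   : (2,) weights of the simplified per-pixel fusion of the two
           channel-pooled maps (stand-in for CBAM's 7x7 convolution)
    """
    # Channel attention: global average- and max-pooling, shared MLP, sum.
    avg = feat.mean(axis=(1, 2))                  # (C,)
    mx = feat.max(axis=(1, 2))                    # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # two layers with ReLU
    ca = sigmoid(mlp(avg) + mlp(mx))              # per-channel weights in (0, 1)
    feat = feat * ca[:, None, None]

    # Spatial attention: channel-wise average and max maps, fused per pixel.
    smap = sigmoid(ws[0] * feat.mean(axis=0) + ws[1] * feat.max(axis=0))
    return feat * smap[None, :, :]                # same shape as the input
```

Because both attention maps pass through a sigmoid, the refined features are elementwise bounded by the input, which is one reason the module can be inserted into an existing backbone (as in the YOLOv5 and Keypoint R-CNN rows) with little disruption to training.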