Abstract
Computer vision technology has been widely used for blind assistance, such as navigation and way finding. However, few camera-based systems have been developed to help blind or visually impaired people find daily necessities. In this paper, we propose a prototype system for blind-assistant object finding based on a camera network and matching-based recognition. We collect a dataset of daily necessities and apply Speeded-Up Robust Features and Scale Invariant Feature Transform feature descriptors to perform object recognition. Experimental results demonstrate the effectiveness of our prototype system.
1 Introduction
The World Health Organization (WHO) estimated that in 2002, 2.6 % of the world’s total population was visually impaired. Also, the American Foundation for the Blind (AFB) estimates that more than 25 million people in the United States live with vision loss. Visually impaired people face much inconvenience when interacting with their surrounding environments, and one of the most common challenges is finding dropped or misplaced personal items (e.g., keys, wallets). While much of the literature and many systems have focused on navigation (Kao et al. 1996), way finding (Guide et al. 2008), scene object recognition (Nikolakis et al. 2005), text reading (http://www.seeingwithsound.com/ocr.htm), bar code reading, banknote recognition (Sudol et al. 2010), etc., camera-based systems that find daily necessities for the blind are rarely available in the market. According to the AFB (2012), visually impaired individuals can live comfortably in a house or apartment by following these principles: increasing lighting, eliminating hazards, creating color contrasts, organizing and labeling items, and reducing glare. This means that blind or visually impaired people need effective assistance, including natural daylight simulation, fluorescent tape, colored objects against dark backgrounds, and the tagging of daily necessities for recognition. However, in most cases it is inconvenient and expensive to make these complex arrangements. In contrast, setting up cameras is easier and more economical: a 1080p full high-definition camera, which supports video at 1920 × 1080 resolution, costs less than $500. Thus computer vision and pattern recognition technology could be an effective alternative for assisting blind or visually impaired people.
To the best of our knowledge, few blind-assistive systems are able to help blind or visually impaired people find their daily necessities. Previous blind-assistant systems focus more on navigation, path planning, and tracking of the blind person. In Orwell et al. (2005), a method was designed to track football players with a network of multiple cameras. In Marinakis and Dudek (2005), a method was designed to infer the relative locations of cameras and construct an indoor layout by analyzing their respective observations. In Hoover and Olsen (2000), a sensor network, cooperating with a first-person-view camera, was built to help a robot perceive obstacles and plan effective paths. In Xie et al. (2008), a camera network was built for object retrieval, where the Scale Invariant Feature Transform (SIFT) descriptor was employed for object detection and recognition. Locomotion analysis (Tang and Su 2012) could be used to design specific room layouts that help blind people live more conveniently. These systems assume that the blind user is located in an unfamiliar environment and provide general instructions to help the user reach a destination. However, blind or visually impaired people (like normal-vision people) spend relatively little time in unfamiliar environments, so blind-assistant systems should pay more attention to the daily life of blind people. They need better perception and control of their personal objects, so we employ computer vision technology to help them manage their daily necessities.
In our proposed system, a multiple-camera network is built by placing a camera at each of the important locations in the indoor environment of the blind user’s daily life. The important locations are usually around tables, cabinets, and the wash sink. The cameras monitor the scene around these fixed locations and inform the blind user of the locations of his/her requested objects. In this process, matching-based object recognition is performed to find the objects. Fast and efficient object recognition is a very popular area in the computer vision community; its goal is to localize and verify given objects in images or video sequences. Humans are able to recognize a wide variety of objects in images or videos with little effort, regardless of differences in viewpoint, scenario, image scale, translation, rotation, distortion, and illumination (Kreiman 2008). It has been shown that the human visual system can discriminate among tens of thousands of different object categories (Biederman 1987) efficiently, within about 100–200 ms (Potter and Levy 1969; Thorpe et al. 1996; Hung et al. 2005). The human visual system can also effectively handle background clutter and object occlusions. However, it is extremely difficult to build computer vision algorithms for object recognition that are both robust and selective enough to handle objects with large similarity or variations. Fortunately, present and future technological innovations in object detection and recognition systems will contribute extensively to helping blind individuals. LookTel (Sudol et al. 2010) is an extensive platform that integrates state-of-the-art computer vision techniques with mobile communication devices and returns real-time banknote recognition results announced by a text-to-speech engine. OCR engines (http://www.seeingwithsound.com/ocr.htm) were developed to transform the text information in camera-based scene images into readable text codes. Signage and text detection (Wang et al. 2012) was designed to extract abstract symbol information directly from natural scene images. In addition, some other recognition systems used FM sonar sensors (Kao et al. 1996) and camera-integrated walking canes (Guide et al. 2008) for blind navigation. Sensor modules could be used for searching tasks in the surrounding environment (Hub et al. 2004). Other prototypes for blind-assistant localization and recognition were introduced in Hub et al. (2006), Gehring (2008) and Bobo et al. (2008). These devices offer an equivalent of raw visual input to blind people via complex soundscapes (a head-mounted camera and stereo headphones), thus leaving the recognition tasks to the human brain. Furthermore, many other techniques have been developed for efficient object detection and recognition. Jauregi et al. (2009) proposed a two-step algorithm based on region detection and feature extraction, which aims to improve the extracted features by reducing unnecessary keypoints and to increase efficiency in terms of accuracy and computational time. Ta et al. (2009) presented an efficient algorithm for continuous image recognition and feature descriptor tracking in video.
2 System design
Our system consists of a wearable camera and multiple fixed cameras. Figure 1 illustrates the layout design of the system. The blind user is equipped with a wearable camera that is connected (wired or wirelessly) to a computer (a PDA or laptop), as shown in Fig. 2. The user can send a request to find an item by speech command and then use the wearable camera system to look for the item. When the system finds the requested item, an audio signal is produced. A dataset of the user’s personal items is created as reference samples. In the dataset, multiple images of each item are captured under different camera views, scales, lighting changes, and occlusions.
Multiple cameras are fixed at the important locations where the blind user is likely to leave his/her items, and together they compose a blind-assistant network. When a blind user sends an object-finding request to the system, all the fixed cameras start object recognition by comparing the objects they capture with the reference objects in the dataset. The system then reports the most similar object according to the matching distance and instructs the blind user to approach its location. Next, the camera attached to the blind user performs further recognition to verify the presence of the requested object.
The flowchart of our proposed object recognition algorithm is shown in Fig. 3. First, based on the blind user’s request, features are extracted from camera-captured images by Speeded-Up Robust Features (SURF) or SIFT. Then, these features are compared to pre-calculated features from reference images of the requested object in the dataset. If matches are found, the algorithm determines whether the object has been found according to pre-established thresholds for each object.
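For illustration, the sketch below shows how one such matching step might look with OpenCV. It is a minimal sketch, not the deployed implementation: the function name, the 0.75 ratio and the default threshold of ten matches are assumptions for the example.

```python
import cv2

def find_object(query_frame_path, reference_path, min_matches=10, ratio=0.75):
    """Return True if the requested object appears in the camera frame."""
    query = cv2.imread(query_frame_path, cv2.IMREAD_GRAYSCALE)
    reference = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)

    detector = cv2.SIFT_create()                 # SURF would be used analogously
    kp_r, desc_r = detector.detectAndCompute(reference, None)
    kp_q, desc_q = detector.detectAndCompute(query, None)
    if desc_r is None or desc_q is None:
        return False                             # no features in one of the images

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(desc_r, desc_q, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) >= min_matches              # pre-established threshold
```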
3 Object feature extraction
A large variety of features can be extracted from a camera-based scene image, so the combination and selection of features play an important role in data analysis (Xiang et al. 2012; Husle et al. 2012). Interest point detectors and descriptors (Bay et al. 2006; Lowe 2004) are able to extract representative and discriminative features from reference images. Based on the state-of-the-art local feature descriptors SIFT (Lowe 2004) and SURF (Bay et al. 2006; SURF source code 2008), precise matching is possible between images containing identical objects. Both SIFT and SURF extract representative and discriminative keypoints from an image, which carry significant information about the appearance and structure of the object in the image.
SURF is a robust image detector and descriptor that can be used in computer vision tasks such as object recognition or 3D reconstruction. The standard version of SURF is several times faster than SIFT. The SURF detector is based on the Hessian matrix, which provides excellent performance in computational time and accuracy. Given a point \( \mathbf{x} = (x,y) \) in an image \( I \), the Hessian matrix \( H(\mathbf{x},\sigma) \) at \( \mathbf{x} \) and scale \( \sigma \) is defined as Eq. (1),

$$ H(\mathbf{x},\sigma ) = \begin{pmatrix} L_{xx} (\mathbf{x},\sigma ) & L_{xy} (\mathbf{x},\sigma ) \\ L_{xy} (\mathbf{x},\sigma ) & L_{yy} (\mathbf{x},\sigma ) \end{pmatrix}, \quad (1) $$

where \( L_{xx} (\mathbf{x},\sigma ) \), \( L_{xy} (\mathbf{x},\sigma ) \) and \( L_{yy} (\mathbf{x},\sigma ) \) are the convolutions of the Gaussian second-order derivatives with the image \( I \) at point \( \mathbf{x} \). Gaussians are optimal for scale-space analysis, but in practice they must be discretized and cropped. Interest points in the image are detected by non-maximum suppression in a 3 × 3 × 3 neighborhood.
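As a minimal sketch of Eq. (1), the code below computes a scale-normalized determinant-of-Hessian response using exact Gaussian second-order derivatives from SciPy; the real SURF detector instead approximates these derivatives with box filters over integral images, which is omitted here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_response(image, sigma):
    """Scale-normalised det(H) of Eq. (1) at every pixel of a float image."""
    L_xx = gaussian_filter(image, sigma, order=(0, 2))  # second derivative along x (columns)
    L_yy = gaussian_filter(image, sigma, order=(2, 0))  # second derivative along y (rows)
    L_xy = gaussian_filter(image, sigma, order=(1, 1))  # mixed second derivative
    # sigma**4 normalises the second-order derivatives across scales
    return (sigma ** 4) * (L_xx * L_yy - L_xy ** 2)
```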
The SURF descriptor, on the other hand, is based on sums of approximated Haar wavelet responses within a circular region around the interest point, with radius 6S, where S is the scale at which the interest point is detected. Since the wavelets are large at high scales, integral images are used for fast filtering. Once the wavelet responses are calculated, the dominant orientation is estimated within a sliding orientation window covering an angle of π/3. The horizontal and vertical responses within this window yield a vector, and the orientation of the longest such vector is assigned to the interest point. A square descriptor region is then constructed around the interest point, aligned with the selected orientation, and is subdivided regularly into 4 × 4 smaller sub-regions, which preserves important spatial information for the descriptor construction.
SIFT extracts local features by mapping gradient orientations within predefined blocks and cells into a histogram. It calculates the Difference of Gaussian (DOG) of image maps in scale space as Eqs. (2) and (3),

$$ L(x,y,\sigma ) = G(x,y,\sigma ) * I(x,y), \quad (2) $$

$$ D(x,y,\sigma ) = L(x,y,k\sigma ) - L(x,y,\sigma ), \quad (3) $$

where \( x \), \( y \), and \( \sigma \) denote the spatial coordinates and scale respectively, \( G(x,y,\sigma ) \) is a Gaussian filter with variable scale \( \sigma \), \( I(x,y) \) is the input image, \( k \) is a constant multiplicative factor between neighboring scales, and \( D(x,y,\sigma ) \) represents the DOG map.
The DOG function approximates the scale-normalized Laplacian of Gaussian, i.e., \( D(x,y,\sigma ) \approx \sigma^{2} \nabla^{2} G \). Previous work shows that the local maxima and minima of \( \sigma^{2} \nabla^{2} G \) are more stable image features than those obtained with the gradient, Hessian, or Harris operators. Thus, the local maxima and minima of the DOG maps are extracted as SIFT feature points.
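A minimal sketch of the DOG computation in Eqs. (2)–(3) and its spatial extrema is given below; a full SIFT implementation builds a multi-octave pyramid and searches a 3 × 3 × 3 scale-space neighborhood, which this sketch omits.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoint_candidates(image, sigma=1.6, k=2 ** 0.5):
    """Candidate keypoints as local spatial extrema of one DOG map."""
    dog = gaussian_filter(image, k * sigma) - gaussian_filter(image, sigma)
    is_max = dog == maximum_filter(dog, size=3)   # local maxima in a 3 x 3 window
    is_min = dog == minimum_filter(dog, size=3)   # local minima in a 3 x 3 window
    ys, xs = np.nonzero(is_max | is_min)
    return list(zip(xs.tolist(), ys.tolist()))
```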
The SIFT keypoints take the form of oriented disks attached to representative structures of the objects in the image. The detected keypoints remain invariant to translations, rotations, scale changes, and other deformations, and they are used as representative local features of the objects in the image. Around each SIFT keypoint, a 4 × 4 grid of blocks is defined and a histogram of gradient orientations is generated for each block. Since the gradient orientation is quantized into eight values, each histogram has eight bins. All the block histograms are then cascaded into a \( 16 \times 8 = 128 \) dimensional vector, as shown in Fig. 4. This feature vector is used as the SIFT descriptor at the keypoint.
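The sketch below illustrates how such a 16 × 8 = 128-dimensional vector can be assembled from a 16 × 16 patch of image gradients around the keypoint; the Gaussian weighting, trilinear interpolation, and normalization of the real SIFT descriptor are left out.

```python
import numpy as np

def sift_like_descriptor(patch_dx, patch_dy):
    """patch_dx, patch_dy: 16 x 16 horizontal/vertical gradients around a keypoint."""
    magnitude = np.hypot(patch_dx, patch_dy)
    orientation = np.arctan2(patch_dy, patch_dx) % (2 * np.pi)
    bins = np.floor(orientation / (2 * np.pi / 8)).astype(int)   # 8 orientation bins

    descriptor = np.zeros((4, 4, 8))
    for by in range(4):                         # 4 x 4 grid of blocks
        for bx in range(4):
            block = (slice(4 * by, 4 * by + 4), slice(4 * bx, 4 * bx + 4))
            np.add.at(descriptor[by, bx], bins[block].ravel(), magnitude[block].ravel())
    return descriptor.reshape(128)              # cascaded block histograms
```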
Fig. 4 An illustration of the SIFT descriptor. The left panel presents the original image and a keypoint in the red circle. The middle panel presents the 16 blocks around the keypoint and their respective gradient orientations. The right panel shows the 128-dimensional feature vector generated by the votes of quantized gradient orientations (color figure online)
The SIFT descriptor plays an important role in many applications involving content-based information retrieval, object matching, and object recognition. It is able to find distinctive keypoints in images that are invariant to location, scale, rotation, affine transformations and illumination changes. Given two images, one showing a complete object against a clean background from a standard viewpoint and the other containing the same object under a different viewpoint, a complex background or partial occlusion, we apply the SIFT detector and descriptor to find matches between the two images. Each match is assigned a score based on the Euclidean distance between the SIFT descriptors of the two matched points. The number and scores of the keypoint matches are used to design our object recognition algorithm.
4 System implementation
In order to recognize objects effectively, we collect a dataset of daily necessities as reference objects. This dataset contains personal items that are essential to visually impaired individuals, such as keys and sunglasses. In addition, it covers a variety of conditions such as image scale, translation, rotation, change in viewpoint, and partial occlusion. All the images are captured in the presence of cluttered backgrounds. Some examples are displayed in Fig. 5.
4.1 Camera network
Our proposed system is based on a network of cameras. A blind user sends a request to find a specific object, and the system starts object recognition, with each camera searching its respective region. Each camera outputs a recognition score by comparing its captured objects with image samples of the reference object in the dataset. The score is calculated from the average matching distance of the SURF/SIFT keypoints. Thus, several possible locations of the requested object can be obtained from small matching distances.
In our system, all the cameras are connected to a local network for information sharing. Each camera reports its results of matching the reference objects. A host takes charge of information collection and analysis, and it notifies the blind user of the most probable locations of his/her requested object.
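The host-side logic can be sketched as follows; the report format, the camera locations and the distance cut-off are illustrative assumptions rather than part of the deployed system.

```python
def locate_object(reports, max_distance=250.0):
    """reports: list of (camera_location, mean_matching_distance) tuples."""
    candidates = [(loc, dist) for loc, dist in reports if dist < max_distance]
    if not candidates:
        return None                                  # object not seen by any camera
    return min(candidates, key=lambda item: item[1])[0]

# Example: the camera at the kitchen table reports the smallest matching distance.
best = locate_object([("kitchen table", 180.2), ("cabinet", 310.5), ("wash sink", 205.7)])
print("Please move toward the", best)
```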
4.2 Matching-based recognition
To identify an object in each query image without accepting false matches, thresholds are set for every reference image based on experiments. In our system, a threshold of ten matching keypoints is employed for the cell phone, sunglasses and keys, and a threshold of 25 matching keypoints is used for the wrist watch and juice cup.
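In code, this per-class decision rule can be sketched as below; the dictionary simply mirrors the thresholds stated above, and the default of ten matches for unlisted classes is an assumption for the example.

```python
MATCH_THRESHOLDS = {"cell phone": 10, "sunglasses": 10, "keys": 10,
                    "wrist watch": 25, "juice cup": 25}

def object_identified(object_name, num_matching_keypoints):
    """True if the number of key matches reaches the class-specific threshold."""
    return num_matching_keypoints >= MATCH_THRESHOLDS.get(object_name, 10)
```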
Figure 6 illustrates the interest points detected from different reference objects, marked with red circles. The same method is also applied to the query object. After descriptors are constructed around the interest points of the reference and query images, SURF features from the reference images are extracted and matched against the features from the query images in our dataset. If the SURF features in the query image match those in the selected reference image, we apply the predefined thresholds to decide whether the object is the item the blind user is looking for. Figure 8 illustrates that our algorithm can successfully predict the category of a query image in the presence of background clutter, partial occlusion, and viewpoint changes.
Next, the SIFT descriptor is applied to the dataset for performance evaluation of object recognition through keypoint matching. At first, we collect an object dataset of reference images, which consists of ten categories of common-use objects, such as keys, glasses, coffee cup, etc. To predict the object category of a query image \( I_{\text{Q}} \), we compare it with each reference image \( I_{\text{R}} \) in the dataset. The SIFT detector is applied to obtain two sets of keypoints \( K_{\text{Q}} \) and \( K_{\text{R}} \) from the two images, and the corresponding 128-dimensional SIFT descriptor is calculated at each keypoint. Based on the Euclidean distance between SIFT descriptors, each keypoint is assigned a match from the other image. Firstly, for a keypoint \( P \) (\( P \in K_{\text{Q}} \)), we measure its distance to each keypoint in \( K_{\text{R}} \); if the keypoint \( P^{\prime} \) (\( P^{\prime} \in K_{\text{R}} \)) has the minimum distance to \( P \), the two are regarded as a match. Secondly, we only preserve the keypoints whose ratio between the nearest-neighbor distance and the second nearest-neighbor distance is below 0.6, which reduces false positives and increases robustness; the other keypoints and their corresponding matches are removed. Thirdly, we calculate the mean distance \( d(I_{\text{Q}} ,I_{\text{R}} ) \) over the remaining keypoint matches. As mentioned above, \( I_{\text{R}} \) is one of the reference images in the dataset. We then calculate the mean distance from the query image \( I_{\text{Q}} \) to all the reference images that belong to the same category as \( I_{\text{R}} \), as Eq. (4),

$$ L_{\text{C}} (I_{\text{Q}} ) = \frac{1}{||C||}\sum\limits_{I_{\text{R}} \in C} d(I_{\text{Q}} ,I_{\text{R}} ), \quad (4) $$

where \( L_{\text{C}} (I_{\text{Q}} ) \) represents the mean distance from the query image to a category \( C \) of reference images, and \( ||C|| \) represents the number of reference images in category \( C \). The minimum distance models the similarity between the object in the query image and those in the dataset, and it can be used to predict which category the object in the query image belongs to. In the same way, we calculate the mean distance to each category in the dataset. The query image is then assigned to the category with the minimum mean distance, as Eq. (5),

$$ C^{*} = \mathop {\arg \min }\limits_{C} L_{\text{C}} (I_{\text{Q}} ). \quad (5) $$
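The category decision of Eqs. (4) and (5) can be sketched as follows, where `match_distance` stands for the ratio-test matching procedure described above and is assumed to return the mean descriptor distance \( d(I_{\text{Q}} ,I_{\text{R}} ) \) for a pair of images.

```python
def predict_category(query_image, reference_dataset, match_distance):
    """reference_dataset: dict mapping category name -> list of reference images."""
    mean_distance = {}
    for category, references in reference_dataset.items():
        distances = [match_distance(query_image, ref) for ref in references]
        mean_distance[category] = sum(distances) / len(distances)   # Eq. (4)
    return min(mean_distance, key=mean_distance.get)                # Eq. (5)
```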
Figure 7 depicts the keypoint matching between objects of the same category, but captured from different viewpoints, distances, and backgrounds. This figure demonstrates that the SIFT feature detector and descriptor can handle arbitrary translations, rotations, and scale changes. It ensures that our object-finding system is not influenced by the relative positions of the blind user and the object, or by the surrounding environment of the object.
5 Experimental results and discussion
To evaluate the performance of SURF- and SIFT-based object recognition, we collect a testing dataset for the ten classes of daily necessities. Each class contains ten image samples, covering a variety of conditions such as image scale, translation, rotation, change in viewpoint, and partial occlusion.
The proposed algorithm can effectively distinguish different classes of objects from test images of the same objects in the presence of cluttered backgrounds, in different scenarios and under a variety of conditions. When further testing is performed with multiple query images, the algorithm successfully identifies objects, as shown in Fig. 8. The white lines trace the matched features from the reference image to the positions where they are detected in the test image. If the number of matching points is greater than the threshold, the object is detected. Furthermore, there are no falsely identified items for any class of objects, even when they are partially occluded by other objects.
We observe that in some situations the algorithm fails to find enough matching points to identify an expected object even when it is captured by the camera. That is, although the algorithm initially detects an object of interest, the number of matching features does not reach the threshold of its class, so the object cannot be correctly identified. This happens because of the challenging conditions under which some test images are taken.
SURF features are robust only to a limited degree of rotation and viewpoint change. These errors are caused by images with very large variations in illumination, background, scaling and/or rotation. Changes in illumination and cluttered backgrounds generate few matching points, which affects object recognition. Large scale changes, identified as the strong relief effect in ASIFT (Yu and Morel 2011), also affect recognition. Rotation plays a big role as well, since the SURF descriptors are constructed around interest points whose largest orientation vector is estimated within a sliding orientation window of π/3; any orientation vector outside this range will result in mismatching. Therefore, if we did not have to deal with images that contain extreme scaling changes and rotations, we would have a very robust and efficient object recognition algorithm.
Because we deal with images captured under a variety of conditions, the test accuracy for each class of objects is between 50 and 95 %. Our algorithm achieves an average accuracy of 69 %, as shown in Table 1.
Table 2 presents the results of object recognition. We can see that objects with discriminative surface texture, such as the book and the juice cup, obtain higher recognition accuracy, because the SIFT descriptor is designed for texture matching. However, the recognition results for keys and sunglasses are lower, because their appearance depends strongly on the capture viewpoint. The two tables demonstrate that the SIFT descriptor obtains better performance than the SURF descriptor.
5.1 Comparisons between SURF-based recognition and SIFT-based recognition
The experimental results demonstrate that SURF-based recognition has higher efficiency, while SIFT-based recognition obtains higher accuracy over all testing object categories. The SIFT detector extracts DOG maxima and minima as keypoints, which are more stable than the Hessian-matrix-based SURF keypoints. Besides, the SIFT descriptor in our experiments has 128 dimensions, while the SURF descriptor has only 64 dimensions, so the SIFT descriptor preserves more information about local features. On the other hand, SURF improves the computational efficiency of feature extraction because (1) it has a lower feature dimension than SIFT, and (2) it simply sums first-order Haar wavelet responses to build the descriptor instead of computing statistics of gradients.
It is a challenging task to combine SIFT and SURF, because the two detectors extract different groups of keypoints from an identical object, corresponding to different aspects of object appearance and structure. A simple cascade of the two descriptors does not improve recognition accuracy, but further lowers efficiency. In future work, we will integrate both SIFT and SURF into a histogram of visual words under the Bag-of-Words framework, which is expected to generate a more robust and efficient object recognizer.
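As a rough sketch of this direction (assuming scikit-learn for clustering), SIFT/SURF descriptors from the training images would be clustered into a visual vocabulary, and each image would then be represented by a normalized histogram of visual words; the vocabulary size of 200 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(training_descriptors, num_words=200):
    """training_descriptors: list of (n_i x d) descriptor arrays from training images."""
    return KMeans(n_clusters=num_words, n_init=10).fit(np.vstack(training_descriptors))

def bow_histogram(descriptors, vocabulary):
    """Represent one image by a normalised histogram of visual words."""
    words = vocabulary.predict(descriptors)                 # assign each descriptor to a word
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```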
6 Conclusion and future work
This paper presents a prototype blind-assistant system that helps blind users find their personal items in daily life through a camera-based network and a matching-based object recognition algorithm. We employ two types of local feature descriptors for detecting keypoint matches. The SURF and SIFT interest point detectors and descriptors are scale- and rotation-invariant, and they provide our object recognition algorithm with ways to handle image scaling, translation, rotation, change in viewpoint, and partial occlusion between objects in the presence of cluttered backgrounds.
In order to evaluate the performance of our algorithm, a reference dataset is built by collecting daily necessities that are essential to visually impaired or blind people. Experimental results demonstrate that the proposed algorithm can effectively identify objects under conditions of cluttered background and occlusion without falsely identifying any reference object. However, with the pre-learned thresholds, the algorithm sometimes fails to find enough matches to identify an object of interest that is present in the image, such as the cell phone, due to a lack of distinguishable features.
Our future work will focus on enhancing the object recognition system so that it can better detect and identify objects under extreme and challenging conditions. We will also enhance the cooperation among the different cameras in the system and address human interface issues for image capture and for the auditory presentation of recognition results on computers and cell phones.
References
American Foundation for the Blind (2012) http://www.afb.org/. Accessed 2012
Bay H, Tuytelaars T, Van Gool L (2006) SURF: speeded up robust features. European Conference on Computer Vision
Biederman I (1987) Recognition-by-components: a theory of human image understanding. Psychol Rev 94:115–147
Bobo B, Chellapa R, Tang C (2008) Developing a real-time identify-and-locate system for the blind. In: Workshop on computer vision applications for the visually impaired
Gehring S (2008) Adaptive indoor navigation for the blind. Proc GI Jahrestagung 1:293–294
Guide R, Østerby M, Soltveit S (2008) Blind navigation and object recognition. Laboratory for Computational Stochastics, University of Aarhus, Denmark. http://www.daimi.au.dk/~mdz/BlindNavigation_and_ObjectRecognition.pdf. Accessed 2008
Hoover A, Olsen B (2000) Sensor network perception for mobile robotics. IEEE Int Conf Robotics Autom 1:342–347
Hub A, Diepstraten J, Ertl T (2004) Design and development of an indoor navigation and object identification system for the Blind. In: Proceedings of ASSETS, pp 147–152
Hub A, Hartter T, Ertl T (2006) Interactive tracking of movable objects for the blind on the basis of environmental models and perception oriented object recognition methods. In: Proceedings of ASSETS, pp 111–118
Hung C, Kreiman G, Poggio T, DiCarlo J (2005) Fast read-out of object identity from Macaque inferior temporal cortex. Science 310:863–866
Husle J, Khoshgoftaar M, Napolitano A, Wald R (2012) Threshold-based feature selection techniques for high-dimensional bioinformatics data. Netw Model Anal Health Inform Bioinform 1(1–2):47–61
Jauregi E, Lazkano E, Sierra B (2009) Object recognition using region detection and feature extraction. In: Proceedings of towards autonomous robotic systems (TAROS) (ISSN: 2041-6407)
Kao G, Probert P, Lee D (1996) Object recognition with FM sonar: an assistive device for blind and visually-impaired people. AAAI fall symposium on developing assistive technology for people with disabilities. MIT, Cambridge
Kreiman G (2008) Biological object recognition. Scholarpedia 3(6): 2667 http://www.scholarpedia.org/article/Biological_object_recognition
Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vision
Marinakis D, Dudek G (2005) Topology inference for a vision-based sensor network. In: Proceedings of Canadian conference on computer and robot vision, pp 121–128
Mobile OCR, face and object recognition for the blind. http://www.seeingwithsound.com/ocr.htm. Accessed 1996–2013
Nikolakis G, Tzovaras D, Strintzis MG (2005) Object recognition for the blind. In: Proceedings of 13th European signal processing conference (EUSIPCO 2005). Antalya, Turkey
Orwell J, Lowey L, Thirde D (2005) Architecture and algorithms for tracking football players with multiple cameras. IEEE Proc Vision Image Signal Process 152(2):232–241
Potter M, Levy E (1969) Recognition memory for a rapid sequence of pictures. J Exp Psychol 81:10–15
SURF source code (2008) http://www.vision.ee.ethz.ch/~surf/. Accessed 2008
Sudol J, Dialameh O, Blanchard C, Dorcey T (2010) LookTel—a comprehensive platform for computer-aided visual assistance. IEEE Conference on Computer Vision and Pattern Recognition
Ta DN, Chen WC, Gelfand N, Pulli K (2009) SURFTrac: efficient tracking and continuous object recognition using local feature descriptors. IEEE Conference on Computer Vision and Pattern Recognition, pp 2937–29
Tang W, Su D (2012) Locomotion analysis and its applications in neurological disorders detection: state-of-art review. Netw Model Anal Health Inform Bioinform
Thorpe S, Fize D, Marlot C (1996) Speed of processing in the human visual system. Nature 381:520–522
Wang S, Yi C, Tian Y (2012) Signage detection and recognition for blind persons to access unfamiliar environment. J Comput Vision Image Process 2(2)
Xiang Y, Fuhry D, Kaya K, Jin R, Catalyurek U, Huang K (2012) Merging network patterns: a general framework to summarize biomedical network data. Netw Model Anal Health Inform Bioinform 1(3):103–116
Xie D, Yan T, Ganesan D, Hanson A (2008) Design and implementation of a dual-camera wireless sensor network for object retrieval. In: Proceedings of the 7th international conference on information processing in sensor networks, pp 469–480
Yu G, Morel J-M (2011) ASIFT: an algorithm for fully affine invariant comparison. Image Process On Line
Acknowledgments
This work was supported by NSF grant IIS-0957016, EFRI-1137172, NIH 1R21EY020990, ARO grant W911NF-09-1-0565, DTFH61-12-H-00002, Microsoft Research, and CITY SEEDs grant.