
  • Open access
  • Published: 09 October 2023

Sign language recognition using the fusion of image and hand landmarks through multi-headed convolutional neural network

  • Refat Khan Pathan 1 ,
  • Munmun Biswas 2 ,
  • Suraiya Yasmin 3 ,
  • Mayeen Uddin Khandaker   ORCID: orcid.org/0000-0003-3772-294X 4 , 5 ,
  • Mohammad Salman 6 &
  • Ahmed A. F. Youssef 6  

Scientific Reports, volume 13, Article number: 16975 (2023)


  • Computational science
  • Image processing

Sign language recognition is a breakthrough for communication within the deaf-mute community and has been a critical research topic for years. Although some previous studies have successfully recognized sign language, they require many costly instruments, including sensors, devices, and high-end processing power. Such drawbacks can be largely overcome by employing artificial-intelligence-based techniques. Since cameras for capturing video or images are readily available in this era of advanced mobile technology, this study demonstrates a cost-effective technique to detect American Sign Language (ASL) using an image dataset. The "Finger Spelling, A" dataset has been used, covering 24 letters (excluding j and z, as they involve motion). The main reason for using this dataset is that its images have complex backgrounds with different environments and scene colors. Two layers of image processing have been used: in the first layer, images are processed as a whole for training, and in the second layer, the hand landmarks are extracted. A multi-headed convolutional neural network (CNN) model has been proposed to train these two layers and tested with 30% of the dataset. To avoid overfitting, data augmentation and dynamic learning-rate reduction have been used. With the proposed model, a test accuracy of 98.981% has been achieved. It is expected that this study may help to develop an efficient human–machine communication system for the deaf-mute community.


Introduction

Spoken language is the medium of communication for the majority of the population, yet a section of the population cannot communicate through it. Deafness is a disability that impairs hearing and leaves people unable to hear, while muteness is a disability that impairs speaking and leaves people unable to talk. People affected by either are limited only in hearing and/or speech and can otherwise do most other things; communication is the main barrier that isolates them from the rest of society 1 . As there are many spoken languages in the world, a distinct language is needed for them to express their thoughts and opinions in a way that is understandable to others, and such a language is called sign language. Understanding sign language is an arduous task, an ability that must be acquired through training.

Many methods are available that use different tools, such as images (2D, 3D), sensor data (hand gloves 2 , Kinect sensors 3 , neuromorphic sensors 4 ), and videos. In most real settings the captured images are excessively noisy, so a high level of pre-processing is required. The available online datasets are already processed or captured in a lab environment, where recent advanced AI models can easily be trained and evaluated, making them prone to errors in real-life applications with different kinds of noise. Accordingly, there is a basic need for a model that can deal with noisy images and still deliver positive results. Different machine learning methods can be used to perform image classification and recognition. Apart from recognizing static images, work has also been done on depth-camera sensing and video processing 5 , 6 , 7 . Various processes embedded in such systems were implemented with different programming languages to execute the procedural strategies for maximum effectiveness of the final system. The problem can be addressed and organized into three comparable methodologies: first, using static image recognition techniques and pre-processing procedures; second, using deep learning models; and third, using Hidden Markov Models.

Sign language serves this part of the community and enables smooth communication among people with speaking and hearing difficulties (deaf and mute). They use hand gestures along with facial expressions and body movements to interact. Yet not many people outside this community become familiar with sign language gestures 8 . Hand gestures form a significant part of the sign language vocabulary, while facial expressions and body movements play the role of emphasizing the words and phrases expressed by hand gestures. Hand gestures can be static or dynamic 9 , 10 . There are methodologies for motion detection utilizing the dynamic vision sensor (DVS), a technique similar to the framework introduced in this work. For example, Arnon et al. 11 presented an event-based gesture recognition system, which processes the event stream using a natively event-based processor from International Business Machines called TrueNorth. They used a temporal filter cascade to create spatio-temporal frames that a CNN executes on the event-based processor and reported an accuracy of 96.46%. However, in real-life scenarios the background is not static, so the stated power-saving process might not work properly. Jun Haeng Lee et al. 12 proposed a motion classification method with two DVSs to obtain a stereo-vision system; they used spiking neurons to handle the incoming events and faced the same real-life issue. Static hand signals, also called hand postures, are formed by different shapes and orientations of the hands without conveying any motion information, whereas dynamic hand gestures comprise a sequence of hand postures with associated motion information 13 . Using facial expressions, static hand images, and hand gestures, sign language provides the tools to communicate much as spoken languages do, and there are different kinds of sign languages as well 14 .

In this work, we have applied a fusion of traditional image processing with extracted hand landmarks and trained a multi-headed CNN so that the two streams can complement each other's weights at the concatenation layer. The main objective is to achieve a better detection rate without relying on a traditional single-channel CNN. This approach has been shown to work well with less computational power and fewer epochs on medical image datasets 15 . The rest of the paper is organized as follows: the literature review in the "Literature review" section; materials and methods in the "Materials and methods" section, with three subsections (dataset description in "Dataset description", image pre-processing in "Pre-processing of image dataset", and working procedure in "Working procedure"); result analysis in the "Result analysis" section; and the conclusion in the "Conclusion" section.

Literature review

State-of-the-art techniques have centered on utilizing deep learning models to achieve good accuracy with low execution time. CNNs have shown huge improvements in visual object recognition 16 , natural language processing 17 , scene labeling 18 , medical image processing 15 , and so on. Despite these accomplishments, there is comparatively little work on applying CNNs to video classification. This is partly due to the difficulty of adapting CNNs to incorporate both spatial and temporal information. Models using special hardware components such as a depth camera have been used to obtain depth-variation data in the image as an extra feature for correlation, followed by a CNN for classification 19 , but they still achieve low accuracy. An innovative technique that does not need a pre-trained model was created using a capsule network and adaptive pooling 11 .

Furthermore, it was shown that reducing the layers of a CNN in a greedy manner and developing a deep belief network produced superior outcomes compared with other fundamental methodologies 20 . Feature extraction using the scale-invariant feature transform (SIFT) and classification using neural networks were developed to obtain good results 21 . In one of the methods, the images were converted into an RGB scheme, the data were augmented using the motion depth channel, and finally 3D recurrent convolutional neural networks (3DRCNN) were used to build a working system 5 , 22 , where Canny edge detection with Oriented FAST and Rotated BRIEF (ORB) was applied. The ORB feature detection technique and the K-means clustering algorithm were used to create a bag-of-features model for all descriptors; however, with a plain background the easy-to-detect edges make the model totally dependent on edges, and if the edges give wrong information, accuracy may fall, which becomes the main problem to solve.

In recent years, deep learning approaches have become standard for improving the recognition accuracy of sign language models. Using the Faster Region-based Convolutional Neural Network (Faster-RCNN) 23 , a CNN model is applied for hand recognition in the image data. Rastgoo et al. 24 proposed a method in which they cropped the images properly, used a fusion of RGB and depth images (RBM), added two noise types (Gaussian noise and salt-and-pepper noise), and prepared the data for training. As a biologically inspired deep learning model, CNNs achieve all three phases with a single framework that is trained from raw pixel values to classifier outputs, but extreme computation power was needed. The authors of ref. 25 proposed 3D CNNs in which the third dimension joins the spatial and temporal stamps; the network accepts several neighboring frames as input and performs 3D convolution in the convolutional layers. The study reported in 26 followed similar ideas and proposed regularizing the outputs with high-level features, joining the predictions of a wide range of models. They applied the developed models to recognize human activities and achieved better performance compared with benchmark methods, but it is not certain that this works with hand gestures, as they detected the face first and then body movement 27 .

On the other hand, Microsoft and Leap Motion have developed distinct approaches to identify and track a user's hand and body movements with the Kinect and the Leap Motion Controller (LMC), respectively. Kinect recognizes the body skeleton and tracks the hands, whereas the LMC detects and tracks hands with its built-in cameras and infrared sensors 3 , 28 . Using the provided framework, Sykora et al. 7 utilized the Kinect system to capture the depth data of 10 hand gestures and classified them using a speeded-up robust features (SURF) technique, reaching an accuracy of 82.8%; however, it was not tested on a more extensive database, and the modified feature extraction methods (SIFT, SURF) can be non-invariant to the orientation of gestures. Likewise, Huang et al. 29 proposed a 10-word ASL recognition system utilizing Kinect that achieved a precision rate of 97% with tenfold cross-validation using an SVM and a set of frame-independent features, but the most significant problem in this method is segmentation.

In summary, most of the models in the literature either depend on a single variable or require high computational power. In addition, the datasets chosen for training and validating these models have plain backgrounds, which are easier to detect. Our main aim is to reduce the computational power required for training and the dependency of model training on one layer.

Materials and methods

Dataset description

Using a generalized single-color background to classify sign language is very common. We intended to avoid a single-color background and instead use complex backgrounds with many users' hand images to increase the detection difficulty. For this reason, we have used the "ASL Finger Spelling" dataset 30 , which has images of different sizes and orientations with complex backgrounds, with over 500 images per sign (24 signs in total) from 4 users (non-native to sign language). This dataset contains separate RGB and depth images; we have worked with the RGB images in this research. The photos were taken in 5 sessions with the same background and lighting. The dataset details are shown in Table 1, and some sample images are shown in Fig. 1.

Figure 1. Sample images from the dataset containing 24 signs from the same user.

Pre-processing of image dataset

Images were pre-processed for two purposes: preparing the original image training set and extracting the hand landmarks. A traditional CNN has one input data channel and one output channel. We use two input data channels and one output channel, so the data needs to be prepared for both inputs individually.

Raw image processing

In raw image processing, we converted the images from RGB to grayscale to reduce color complexity. We then applied a 2D sharpening kernel to the images, as shown in Fig. 2. After that, we resized the images to 50 × 50 pixels for evaluation by the CNN. Finally, we normalized the grayscale values (0–255) by dividing the pixel values by 255, so the new pixel array contains values in the range (0–1). The primary advantage of this normalization is that the CNN trains faster on inputs in the (0–1) range than on wider value ranges.

Figure 2. Raw image pre-processing with (a) sharpening kernel.
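The steps above can be summarized in a short pre-processing routine. The following is a minimal sketch assuming OpenCV and NumPy; the exact sharpening kernel is not specified in the text, so a common 3 × 3 sharpening kernel is used here purely for illustration.

```python
import cv2
import numpy as np

# Illustrative 3x3 sharpening kernel (the paper's actual kernel is shown in Fig. 2a).
SHARPEN_KERNEL = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]], dtype=np.float32)

def preprocess_raw_image(path):
    """Grayscale -> sharpen -> resize to 50x50 -> normalize to [0, 1]."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # drop color information
    img = cv2.filter2D(img, -1, SHARPEN_KERNEL)    # sharpen the image
    img = cv2.resize(img, (50, 50))                # CNN input size
    return img.astype(np.float32) / 255.0          # scale pixel values to 0-1
```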

Hand landmark detection

Google's hand landmark model takes an RGB input with an image size of (224 × 224 × 3). We therefore took the RGB images, converted the pixel values to float32, and resized all the images to (256 × 256 × 3). After applying the model, it outputs 21 three-dimensional landmark coordinates. The landmark detection process is shown in Fig. 3.

Figure 3. Hand landmark detection and extraction of 21 coordinates.
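A sketch of this extraction step is shown below using MediaPipe Hands, the library that exposes Google's hand landmark model; the zero-array fallback for images in which no hand is detected is an assumption, since the text does not state how detection failures are handled.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_landmarks(path):
    """Return a (21, 3) array of (x, y, z) hand landmarks, or zeros if no hand is found."""
    img = cv2.resize(cv2.imread(path), (256, 256))
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)          # the model expects RGB input
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(rgb)
    if not result.multi_hand_landmarks:                 # no hand detected in this image
        return np.zeros((21, 3), dtype=np.float32)
    lm = result.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32)
```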

Working procedure

The whole work is divided into two main parts: one is the raw image processing, and the other is the hand landmark extraction. After both individual processing steps were completed, a custom lightweight multi-headed CNN model was built to train on both data streams. Before passing them through a fully connected layer for classification, we merged both channels' features so that the model could choose among the best weights. The working procedure is illustrated in Fig. 4.

Figure 4. Flow diagram of the working procedure.

Model building

In this research, we have used a multi-headed CNN, meaning our model has two input data channels. Before this, we trained the processed images and the hand landmarks with two separate models for comparison. Google's model is not ideal for "in the wild" situations, so we needed the original images to compensate for its occasional failures. The first head of the model takes the processed images as input, and the second head takes the hand landmark data. On the hand landmark side, 2D convolutional layers with 50 and 25 filters, kernel size (3, 3), ReLU activation, and stride 1; MaxPooling2D with pool size (2, 2); batch normalization; and dropout layers have been used. On the image side, 2D convolutional layers with 32, 64, 128, and 512 filters, kernel size (3, 3), and ReLU activation; MaxPooling2D with pool size (2, 2); batch normalization; and dropout layers have been used. After both flatten layers, the two heads are concatenated and pass through a dense layer and a dropout layer. Finally, the output dense layer has 24 units with Softmax activation. The model has been compiled with the Adam optimizer and MSE loss and trained for 50 epochs. Figure 5 illustrates the proposed CNN architecture, and Table 2 shows the model details.

Figure 5. Proposed multi-headed CNN architecture. Bottom values are the numbers of filters and top values are the output shapes.
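A minimal Keras sketch of the two-headed architecture described above is given below. The dense-layer width, dropout rates, padding choices, and the (2, 1) pooling on the landmark branch are assumptions made to keep the tensor shapes valid; the paper's Table 2 holds the exact configuration.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (BatchNormalization, Concatenate, Conv2D,
                                     Dense, Dropout, Flatten, MaxPooling2D)

# Head 1: 50x50 grayscale images.
img_in = Input(shape=(50, 50, 1), name="image")
x = Conv2D(32, (3, 3), activation="relu")(img_in)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation="relu")(x)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(128, (3, 3), activation="relu")(x)
x = MaxPooling2D((2, 2))(x)
x = BatchNormalization()(x)
x = Conv2D(512, (3, 3), activation="relu")(x)
x = Dropout(0.3)(x)
x = Flatten()(x)

# Head 2: 21 hand landmarks with (x, y, z) coordinates.
lm_in = Input(shape=(21, 3, 1), name="landmarks")
y = Conv2D(50, (3, 3), padding="same", activation="relu")(lm_in)
y = MaxPooling2D((2, 1))(y)
y = BatchNormalization()(y)
y = Conv2D(25, (3, 3), padding="same", activation="relu")(y)
y = MaxPooling2D((2, 1))(y)
y = Dropout(0.3)(y)
y = Flatten()(y)

# Concatenate both heads, then classify into the 24 letter classes.
z = Concatenate()([x, y])
z = Dense(256, activation="relu")(z)
z = Dropout(0.3)(z)
out = Dense(24, activation="softmax")(z)

model = Model(inputs=[img_in, lm_in], outputs=out)
model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])  # MSE loss as reported; labels one-hot encoded
```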

Training and testing

The input images were augmented to make training harder so that the model would not overfit. Image augmentation was performed with an ImageDataGenerator using 10° rotation, a 0.1 zoom range, 0.1 width and height shift ranges, and horizontal flips. To further guard against overfitting, we used a dynamic learning rate that monitors the validation accuracy with patience 5, factor 0.5, and a minimum learning rate of 0.00001. For training, we used 46,023 images, and for testing, 19,725 images. The training versus testing accuracy and loss over 50 epochs are shown in Fig. 6.

Figure 6. Training versus testing accuracy and loss for 50 epochs.
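The augmentation and learning-rate settings quoted above map directly onto standard Keras utilities, sketched below; note that for a two-input model, a custom generator would be needed to pair each augmented image with its (unaugmented) landmark vector.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ReduceLROnPlateau

augmenter = ImageDataGenerator(
    rotation_range=10,        # up to 10 degrees of rotation
    zoom_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)

lr_schedule = ReduceLROnPlateau(
    monitor="val_accuracy",   # reduce LR when validation accuracy plateaus
    patience=5,
    factor=0.5,
    min_lr=1e-5,
)
```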

For further evaluation, we calculated the precision, recall, and F1 score of the proposed multi-headed CNN model, which shows excellent performance. To compute these values, we first calculated the confusion matrix (shown in Fig. 7). When a class is positive and classified as positive, it is counted as a true positive (TP); when a class is negative and classified as negative, it is a true negative (TN). If a class is negative but classified as positive, it is a false positive (FP), and when a class is positive but classified as negative, it is a false negative (FN). From these, precision, recall, and the F1 score are defined as follows:

Figure 7. Confusion matrix of the testing dataset. Numerical values on the X and Y axes denote the letters in sequence from A = 0 to Y = 24; the indices for J (9) and Z (25) are missing because the dataset does not include those letters.

Precision: Precision is the ratio of TP to the total number of predicted positive observations.

Recall: Recall is the ratio of TP to the total number of positive observations in the actual class.

F1 score: The F1 score is the harmonic mean of precision and recall.
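For reference, these metrics follow the standard definitions:

```latex
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \\[4pt]
F_1 &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{aligned}
```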

The Precision, Recall, and F1 score for 24 classes are shown in Table 3 .

Result analysis

In human action recognition tasks, sign language has an extra advantage, as it can be used to communicate efficiently. Many techniques have been developed using image processing, sensor data processing, and motion detection, applying different algorithms and methods from machine learning and deep learning. Depending on the methodology, researchers have proposed their own ways of classifying sign languages. As technologies develop, we can explore the limitations of previous works and improve accuracy. Ref. 13 proposes a technique for recognizing hand gestures, an essential part of the sign language vocabulary, based on an efficient deep convolutional neural network (CNN) architecture. That CNN design removes the need for detecting and segmenting hands from the captured images, reducing the computational burden of hand pose recognition compared with classical approaches. In our method, we use two input channels, for the images and the hand landmarks, to obtain more robust data, making the process more efficient together with dynamic learning-rate adjustment. In ref. 14, the presented results were acquired by retraining and testing a sign language gesture dataset on a convolutional neural network model utilizing Inception v3; the model comprises multiple convolution filter inputs that are trained on parts of the same data. A capsule-based deep neural network sign posture translator for American Sign Language (ASL) fingerspelling 20 has been introduced, in which capsules and pooling are used simultaneously in the network. That work confirms that utilizing pooling and capsule routing in the same network can improve the network's accuracy and convergence speed. In our method, we use Google's pre-trained model to extract the hand landmarks, which is close in spirit to transfer learning, and we show that utilizing two input channels can also improve accuracy.

Moreover, ref. 5 proposed a 3DRCNN model integrating a 3D convolutional neural network (3DCNN) and an enhanced fully connected recurrent neural network (FC-RNN), where the 3DCNN learns multi-modality features from RGB, motion, and depth channels, and the FC-RNN captures the temporal information among short video clips segmented from the original video. Consecutive clips with similar semantic meaning are identified by applying a sliding-window approach over the whole video sequence. Ref. 26 combines a CNN with traditional feature extractors and is capable of accurate, real-time hand posture recognition; the architecture is assessed on three distinct benchmark datasets and compared with state-of-the-art convolutional neural networks, with extensive experimentation conducted using binary, grayscale, and depth data and two different validation techniques. The proposed feature-fusion-based CNN 31 is shown to perform better across combinations of validation procedures and image representations. Similarly, a fusion-based CNN is demonstrated to improve the recognition rate in our study.

After global motion analysis, the hand gesture image sequence was analyzed for keyframe selection. The video sequences of a given gesture were segmented in the RGB color space before feature extraction. This step benefited from the colored gloves worn by the signers: samples of pixel vectors representative of the glove's color were used to estimate the mean and covariance matrix of the color to be segmented, so the segmentation process was automated with no user intervention. In the color object tracking method, the video frames were converted into the HSV (hue-saturation-value) color space; pixels with the target color were then identified and labeled, and the resulting images were converted to binary (grayscale) images. The system identifies image regions corresponding to human skin by binarizing the input image with a proper threshold value. Small regions were then eliminated from the binarized image by applying a morphological operator, and the remaining regions were selected as hand candidates.

In the proposed method, we have used a two-headed CNN to train on the processed input images. Although a single image input stream is widely used, two input streams have an advantage over it: in the classification layer of the CNN, if one stream gives a false result, it can be compensated by the other stream's weights, and combining both results can still produce a correct outcome. We used this idea and successfully improved the final validation and test results. Before combining the image and hand landmark inputs, we tested both individually and obtained a test accuracy of 96.29% for the images and 98.42% for the hand landmarks. We did not use binarization, as it would corrupt images whose background color matches the skin color of the hand. This method is also suitable for in-the-wild situations, as it is not entirely dependent on the hand position in an image frame. A comparison of the literature and our work is shown in Table 4, which shows that our method outperforms most existing approaches in accuracy.

Table 5 illustrates that the combined model, while having a larger number of parameters and consuming more memory, achieves the highest accuracy of 98.98%. This suggests that the combined approach, which incorporates both image and hand landmark information, is effective when accuracy is the priority. The hand landmarks model, despite having fewer parameters and lower memory consumption, also performs impressively with an accuracy of 98.42%, but it inherits the error rate and memory footprint of Google's pre-trained landmark model. The image model, while consuming less memory, has a slightly lower accuracy of 96.29%. The choice between these models depends on the specific application requirements, the trade-off between accuracy and resource utilization, and the importance of execution time.

Conclusion

This work proposes a methodology for sign language recognition. Sign language is the core medium of communication between deaf-mute and hearing people, and it is highly applicable in real-world scenarios such as communication, human–computer interaction, security, and advanced AI. For a long time, researchers have been working in this field to build reliable, low-cost, and publicly available SLR systems using different sensors, images, videos, and many other techniques. Many datasets have been used, including numeric sensory, motion, and image datasets. Most datasets are prepared under good lab conditions for experiments, which may not reflect practical, real-world cases. For this reason, the Fingerspelling dataset has been used, which contains real-world characteristics such as complex backgrounds and uneven image shapes and conditions. First, the raw images are processed and resized to 50 × 50. Then, the hand landmark points are detected and extracted from these hand images. Each image thus passes through two processing pipelines, producing two data channels. A multi-headed CNN architecture has been proposed for these two data channels. The data have been augmented to avoid overfitting, and dynamic learning-rate adjustment has been applied. From the prepared data, a 70–30% train-test split has been made. With the 30% held-out data, a validation accuracy of 98.98% has been achieved, which is highly reliable for a dataset of this size.

There are some limitations of the proposed method compared with the literature. Some methods may work with small image datasets, but because we use a simple CNN model, our method requires a good number of images for training. The proposed method also depends on the hand landmark extraction model; a different hand landmark model may produce different results. In raw image processing, it would be possible to detect and crop the hand region to reduce the image size, which may increase the recognition rate and reduce the model training time; we may try this in future work. Currently, raw image processing takes a considerable amount of training time because we consider the whole image for training.

Data availability

The dataset used in this paper (ASL Fingerspelling Images (RGB & Depth)) is publicly available at Kaggle on this URL: https://www.kaggle.com/datasets/mrgeislinger/asl-rgb-depth-fingerspelling-spelling-it-out .

References

Anderson, R., Wiryana, F., Ariesta, M. C. & Kusuma, G. P. Sign language recognition application systems for deaf-mute people: A review based on input-process-output. Proced. Comput. Sci. 116, 441–448. https://doi.org/10.1016/j.procs.2017.10.028 (2017).


Mummadi, C. et al. Real-time and embedded detection of hand gestures with an IMU-based glove. Informatics 5 (2), 28. https://doi.org/10.3390/informatics5020028 (2018).

Hickeys Kinect for Windows - Windows apps. (2022). Accessed 01 January 2023. https://learn.microsoft.com/en-us/windows/apps/design/devices/kinect-for-windows

Rivera-Acosta, M., Ortega-Cisneros, S., Rivera, J. & Sandoval-Ibarra, F. American sign language alphabet recognition using a neuromorphic sensor and an artificial neural network. Sensors 17 (10), 2176. https://doi.org/10.3390/s17102176 (2017).


Ye, Y., Tian, Y., Huenerfauth, M., & Liu, J. Recognizing American Sign Language Gestures from Within Continuous Videos. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) , 2145–214509 (IEEE, 2018). https://doi.org/10.1109/CVPRW.2018.00280 .

Ameen, S. & Vadera, S. A convolutional neural network to classify American Sign Language fingerspelling from depth and colour images. Expert Syst. 34 (3), e12197. https://doi.org/10.1111/exsy.12197 (2017).

Sykora, P., Kamencay, P. & Hudec, R. Comparison of SIFT and SURF methods for use on hand gesture recognition based on depth map. AASRI Proc. 9 , 19–24. https://doi.org/10.1016/j.aasri.2014.09.005 (2014).

Sahoo, A. K., Mishra, G. S. & Ravulakollu, K. K. Sign language recognition: State of the art. ARPN J. Eng. Appl. Sci. 9 (2), 116–134 (2014).


Mitra, S. & Acharya, T. “Gesture recognition: A survey. IEEE Trans. Syst. Man Cybern. Part C 37 (3), 311–324. https://doi.org/10.1109/TSMCC.2007.893280 (2007).

Rautaray, S. S. & Agrawal, A. Vision based hand gesture recognition for human computer interaction: A survey. Artif. Intell. Rev. 43 (1), 1–54. https://doi.org/10.1007/s10462-012-9356-9 (2015).

Amir, A. et al. A low power, fully event-based gesture recognition system. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7388–7397 (IEEE, 2017). https://doi.org/10.1109/CVPR.2017.781.

Lee, J. H. et al. Real-time gesture interface based on event-driven processing from stereo silicon retinas. IEEE Trans. Neural Netw. Learn Syst. 25 (12), 2250–2263. https://doi.org/10.1109/TNNLS.2014.2308551 (2014).


Adithya, V. & Rajesh, R. A deep convolutional neural network approach for static hand gesture recognition. Proc. Comput. Sci. 171 , 2353–2361. https://doi.org/10.1016/j.procs.2020.04.255 (2020).

Das, A., Gawde, S., Suratwala, K., & Kalbande, D. Sign language recognition using deep learning on custom processed static gesture images. In 2018 International Conference on Smart City and Emerging Technology (ICSCET) , 1–6 (IEEE, 2018). https://doi.org/10.1109/ICSCET.2018.8537248 .

Pathan, R. K. et al. Breast cancer classification by using multi-headed convolutional neural network modeling. Healthcare 10 (12), 2367. https://doi.org/10.3390/healthcare10122367 (2022).


Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324. https://doi.org/10.1109/5.726791 (1998).

Collobert, R., & Weston, J. A unified architecture for natural language processing. In Proceedings of the 25th international conference on Machine learning—ICML ’08 , 160–167 (ACM Press, 2008). https://doi.org/10.1145/1390156.1390177 .

Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), 1915–1929. https://doi.org/10.1109/TPAMI.2012.231 (2013).

Xie, B., He, X. & Li, Y. RGB-D static gesture recognition based on convolutional neural network. J. Eng. 2018 (16), 1515–1520. https://doi.org/10.1049/joe.2018.8327 (2018).

Jalal, M. A., Chen, R., Moore, R. K., & Mihaylova, L. American sign language posture understanding with deep neural networks. In 2018 21st International Conference on Information Fusion (FUSION) , 573–579 (IEEE, 2018).

Shanta, S. S., Anwar, S. T., & Kabir, M. R. Bangla Sign Language Detection Using SIFT and CNN. In 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT) , 1–6 (IEEE, 2018). https://doi.org/10.1109/ICCCNT.2018.8493915 .

Sharma, A., Mittal, A., Singh, S. & Awatramani, V. Hand gesture recognition using image processing and feature extraction techniques. Proc. Comput. Sci. 173 , 181–190. https://doi.org/10.1016/j.procs.2020.06.022 (2020).

Ren, S., He, K., Girshick, R., & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process Syst. , 28 (2015).

Rastgoo, R., Kiani, K. & Escalera, S. Multi-modal deep hand sign language recognition in still images using restricted Boltzmann machine. Entropy 20 (11), 809. https://doi.org/10.3390/e20110809 (2018).

Jhuang, H., Serre, T., Wolf, L., & Poggio, T. A biologically inspired system for action recognition. In 2007 IEEE 11th International Conference on Computer Vision , 1–8. (IEEE, 2007) https://doi.org/10.1109/ICCV.2007.4408988 .

Ji, S., Xu, W., Yang, M. & Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35 (1), 221–231. https://doi.org/10.1109/TPAMI.2012.59 (2013).

Huang, J., Zhou, W., Li, H., & Li, W. Sign language recognition using 3D convolutional neural networks. In 2015 IEEE International Conference on Multimedia and Expo (ICME), 1–6 (IEEE, 2015). https://doi.org/10.1109/ICME.2015.7177428.

Digital worlds that feel human Ultraleap. Accessed 01 January 2023. Available: https://www.leapmotion.com/

Huang, F. & Huang, S. Interpreting American Sign Language with Kinect. Journal of Deaf Studies and Deaf Education (Oxford University Press, 2011).

Pugeault, N., & Bowden, R. Spelling it out: Real-time ASL fingerspelling recognition. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops) , 1114–1119 (IEEE, 2011). https://doi.org/10.1109/ICCVW.2011.6130290 .

Rahim, M. A., Islam, M. R. & Shin, J. Non-touch sign word recognition based on dynamic hand gesture using hybrid segmentation and CNN feature fusion. Appl. Sci. 9 (18), 3790. https://doi.org/10.3390/app9183790 (2019).

“ASL Alphabet.” Accessed 01 Jan, 2023. https://www.kaggle.com/grassknoted/asl-alphabet


Funding was provided by the American University of the Middle East, Egaila, Kuwait.

Author information

Authors and Affiliations

Department of Computing and Information Systems, School of Engineering and Technology, Sunway University, 47500, Bandar Sunway, Selangor, Malaysia

Refat Khan Pathan

Department of Computer Science and Engineering, BGC Trust University Bangladesh, Chittagong, 4381, Bangladesh

Munmun Biswas

Department of Computer and Information Science, Graduate School of Engineering, Tokyo University of Agriculture and Technology, Koganei, Tokyo, 184-0012, Japan

Suraiya Yasmin

Centre for Applied Physics and Radiation Technologies, School of Engineering and Technology, Sunway University, 47500, Bandar Sunway, Selangor, Malaysia

Mayeen Uddin Khandaker

Faculty of Graduate Studies, Daffodil International University, Daffodil Smart City, Birulia, Savar, Dhaka, 1216, Bangladesh

College of Engineering and Technology, American University of the Middle East, Egaila, Kuwait

Mohammad Salman & Ahmed A. F. Youssef


Contributions

R.K.P and M.B, Conceptualization; R.K.P. methodology; R.K.P. software and coding; M.B. and R.K.P. validation; R.K.P. and M.B. formal analysis; R.K.P., S.Y., and M.B. investigation; S.Y. and R.K.P. resources; R.K.P. and M.B. data curation; S.Y., R.K.P., and M.B. writing—original draft preparation; S.Y., R.K.P., M.B., M.U.K., M.S., A.A.F.Y. and M.S. writing—review and editing; R.K.P. and M.U.K. visualization; M.U.K. and M.B. supervision; M.B., M.S. and A.A.F.Y. project administration; M.S. and A.A.F.Y, funding acquisition.

Corresponding author

Correspondence to Mayeen Uddin Khandaker .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Pathan, R.K., Biswas, M., Yasmin, S. et al. Sign language recognition using the fusion of image and hand landmarks through multi-headed convolutional neural network. Sci Rep 13 , 16975 (2023). https://doi.org/10.1038/s41598-023-43852-x


Received : 04 March 2023

Accepted : 29 September 2023

Published : 09 October 2023

DOI : https://doi.org/10.1038/s41598-023-43852-x




Sign Language Recognition

74 papers with code • 15 benchmarks • 23 datasets

Sign Language Recognition is a computer vision and natural language processing task that involves automatically recognizing and translating sign language gestures into written or spoken language. The goal of sign language recognition is to develop algorithms that can understand and interpret sign language, enabling people who use sign language as their primary mode of communication to communicate more easily with non-signers.

( Image credit: Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison )


Benchmarks

Best-performing models across the 15 benchmarks include SlowFastSign, STF+LSTM, NLA-SLR, SignBERT, SignBERT+, SPOTER, StepNet, mVITv2-S, 3D-DCNN + ST-MGCN, Skeleton Image Representation, and HWGAT.


Most implemented papers

Learning to Estimate 3D Hand Pose from Single RGB Images

Low-cost consumer depth cameras and deep learning have enabled reasonable 3D hand pose estimation from single depth images.

BlazePose: On-device Real-time Body Pose tracking

We present BlazePose, a lightweight convolutional neural network architecture for human pose estimation that is tailored for real-time inference on mobile devices.

A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation


Concretely, we pretrain the sign-to-gloss visual network on the general domain of human actions and the within-domain of a sign-to-gloss dataset, and pretrain the gloss-to-text translation network on the general domain of a multilingual corpus and the within-domain of a gloss-to-text corpus.

Skeleton Aware Multi-modal Sign Language Recognition

Sign language is commonly used by deaf or speech impaired people to communicate but requires significant effort to master.

Continuous Sign Language Recognition with Correlation Network

Visualizations demonstrate the effects of CorrNet on emphasizing human body trajectories across adjacent frames.

SubUNets: End-To-End Hand Shape and Continuous Sign Language Recognition

We propose a novel deep learning approach to solve simultaneous alignment and recognition problems (referred to as "Sequence-to-sequence" learning).

Fingerspelling recognition in the wild with iterative visual attention

In this paper we focus on recognition of fingerspelling sequences in American Sign Language (ASL) videos collected in the wild, mainly from YouTube and Deaf social media.

Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison

Based on this new large-scale dataset, we are able to experiment with several deep learning methods for word-level sign recognition and evaluate their performances in large scale scenarios.

TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation

Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences.

KArSL: Arabic Sign Language Database

Hamzah-Luqman/KArSL • ACM Transactions on Asian and Low-Resource Language Information Processing 2021

The availability of a comprehensive benchmarking database for ArSL is one of the challenges of the automatic recognition of Arabic Sign language.

Sign Language Recognition

9 Pages Posted: 2 Apr 2024

Shraddha Srivastava

Inderprastha Engineering College

Ritik Jaiswal

Raghib Ahmad

Vishal Maddheshiya

Date Written: March 30, 2024

This comprehensive review explores the evolving landscape of gesture and emotion recognition technologies, with a focus on applications for the deaf and hard of hearing communities. The study introduces an efficient deep convolutional neural network approach for hand gesture recognition, leveraging transfer learning to overcome dataset limitations. Evaluation on three diverse datasets demonstrates high recognition rates, emphasizing the system's potential in sign language analysis. Emotion recognition systems, crucial for human-computer interaction, are investigated, comparing contact-less methods like facial analysis with physiological parameter monitoring through smart wearables. The incorporation of multimodal emotional computing is investigated, exhibiting different modalities' accuracy. Additionally, the paper delves into technological advancements in sign language recognition, visualization, and synthesis, identifying trends and gaps. The review concludes with a proposed framework for sign language recognition research, acknowledging the importance of diverse input modalities and anticipating future developments in this dynamic field.

Keywords: comprehensive, evolving landscape, leveraging, visualization



Deepsign: Sign Language Detection and Recognition Using Deep Learning


Proposed LSTM-GRU-Based Model

  • The feature vectors are extracted using InceptionResNetV2 and passed to the model. The video frames are classified into objects with InceptionResNetV2, and the extracted key points are then stacked across video frames;
  • The first layer of the neural network is composed of a combination of LSTM and GRU. This combination captures the semantic dependencies more effectively;
  • Dropout is used to reduce overfitting and improve the model's generalization ability;
  • The final output is obtained through the 'softmax' function.
  • The LSTM layer of 1536 units, with 0.3 dropout and an 'l2' kernel regularizer, receives data from the input layer;
  • The data are then passed to the GRU layer using the same parameters;
  • The results are passed to a fully connected dense layer;
  • The output is fed to a dropout layer with a value of 0.3 (a code sketch of this architecture follows the list).
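A minimal Keras sketch of this pipeline is given below, assuming per-frame features are extracted with InceptionResNetV2 (whose average-pooled output is 1536-dimensional). The sequence length, dense-layer width, and number of classes are placeholders; the l2 regularization strength uses the Keras default, and the loss function is an assumption, since it is not stated here.

```python
from tensorflow.keras import Input, Model, regularizers
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras.applications.inception_resnet_v2 import preprocess_input
from tensorflow.keras.layers import LSTM, GRU, Dense, Dropout

NUM_FRAMES, NUM_CLASSES = 30, 11   # placeholder values

# Frozen per-frame feature extractor; pooling="avg" yields 1536-d vectors, e.g.
# frame_features = backbone.predict(preprocess_input(frames))  # -> (NUM_FRAMES, 1536)
backbone = InceptionResNetV2(include_top=False, pooling="avg")
backbone.trainable = False

inputs = Input(shape=(NUM_FRAMES, 1536))                    # stacked frame features
x = LSTM(1536, dropout=0.3, return_sequences=True,
         kernel_regularizer=regularizers.l2())(inputs)      # LSTM layer
x = GRU(1536, dropout=0.3,
        kernel_regularizer=regularizers.l2())(x)            # GRU layer, same parameters
x = Dense(256, activation="relu")(x)                        # fully connected dense layer
x = Dropout(0.3)(x)                                         # dropout of 0.3
outputs = Dense(NUM_CLASSES, activation="softmax")(x)       # softmax output

model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",              # assumed loss
              metrics=["accuracy"])
```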

References

  • Ministry of Statistics & Programme Implementation. Available online: https://pib.gov.in/PressReleasePage.aspx?PRID=1593253 (accessed on 5 January 2022).
  • Manware, A.; Raj, R.; Kumar, A.; Pawar, T. Smart Gloves as a Communication Tool for the Speech Impaired and Hearing Impaired. Int. J. Emerg. Technol. Innov. Res. 2017 , 4 , 78–82. [ Google Scholar ]
  • Wadhawan, A.; Kumar, P. Sign language recognition systems: A decade systematic literature review. Arch. Comput. Methods Eng. 2021 , 28 , 785–813. [ Google Scholar ] [ CrossRef ]
  • Papastratis, I.; Chatzikonstantinou, C.; Konstantinidis, D.; Dimitropoulos, K.; Daras, P. Artificial Intelligence Technologies for Sign Language. Sensors 2021 , 21 , 5843. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Nandy, A.; Prasad, J.; Mondal, S.; Chakraborty, P.; Nandi, G. Recognition of Isolated Indian Sign Language Gesture in Real Time. Commun. Comput. Inf. Sci. 2010 , 70 , 102–107. [ Google Scholar ]
  • Mekala, P.; Gao, Y.; Fan, J.; Davari, A. Real-time sign language recognition based on neural network architecture. In Proceedings of the IEEE 43rd Southeastern Symposium on System Theory, Auburn, AL, USA, 14–16 March 2011. [ Google Scholar ]
  • Chen, J.K. Sign Language Recognition with Unsupervised Feature Learning ; CS229 Project Final Report; Stanford University: Stanford, CA, USA, 2011. [ Google Scholar ]
  • Sharma, M.; Pal, R.; Sahoo, A. Indian sign language recognition using neural networks and KNN classifiers. J. Eng. Appl. Sci. 2014 , 9 , 1255–1259. [ Google Scholar ]
  • Agarwal, S.R.; Agrawal, S.B.; Latif, A.M. Article: Sentence Formation in NLP Engine on the Basis of Indian Sign Language using Hand Gestures. Int. J. Comput. Appl. 2015 , 116 , 18–22. [ Google Scholar ]
  • Wazalwar, S.S.; Shrawankar, U. Interpretation of sign language into English using NLP techniques. J. Inf. Optim. Sci. 2017 , 38 , 895–910. [ Google Scholar ] [ CrossRef ]
  • Shivashankara, S.; Srinath, S. American Sign Language Recognition System: An Optimal Approach. Int. J. Image Graph. Signal Process. 2018 , 10 , 18–30. [ Google Scholar ]
  • Camgoz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural Sign Language Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018. [ Google Scholar ]
  • Muthu Mariappan, H.; Gomathi, V. Real-Time Recognition of Indian Sign Language. In Proceedings of the International Conference on Computational Intelligence in Data Science, Haryana, India, 6–7 September 2019. [ Google Scholar ]
  • Mittal, A.; Kumar, P.; Roy, P.P.; Balasubramanian, R.; Chaudhuri, B.B. A Modified LSTM Model for Continuous Sign Language Recognition Using Leap Motion. IEEE Sens. J. 2019 , 19 , 7056–7063. [ Google Scholar ] [ CrossRef ]
  • De Coster, M.; Herreweghe, M.V.; Dambre, J. Sign Language Recognition with Transformer Networks. In Proceedings of the Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, 13–15 May 2020; pp. 6018–6024. [ Google Scholar ]
  • Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Skeleton aware multi-modal sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–24 June 2021; pp. 3413–3423. [ Google Scholar ]
  • Liao, Y.; Xiong, P.; Min, W.; Min, W.; Lu, J. Dynamic Sign Language Recognition Based on Video Sequence with BLSTM-3D Residual Networks. IEEE Access 2019 , 7 , 38044–38054. [ Google Scholar ] [ CrossRef ]
  • Adaloglou, N.; Chatzis, T. A Comprehensive Study on Deep Learning-based Methods for Sign Language Recognition. IEEE Trans. Multimed. 2022 , 24 , 1750–1762. [ Google Scholar ] [ CrossRef ]
  • Aparna, C.; Geetha, M. CNN and Stacked LSTM Model for Indian Sign Language Recognition. Commun. Comput. Inf. Sci. 2020 , 1203 , 126–134. [ Google Scholar ] [ CrossRef ]
  • Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv 2016 , arXiv:1602.07261. [ Google Scholar ]
  • Yang, D.; Martinez, C.; Visuña, L.; Khandhar, H.; Bhatt, C.; Carretero, J. Detection and Analysis of COVID-19 in medical images using deep learning techniques. Sci. Rep. 2021 , 11 , 19638. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Likhar, P.; Bhagat, N.K.; Rathna, G.N. Deep Learning Methods for Indian Sign Language Recognition. In Proceedings of the 2020 IEEE 10th International Conference on Consumer Electronics (ICCE-Berlin), Berlin, Germany, 9–11 November 2020; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Hochreiter, S.; Schmidhuber, J. Long Short-term Memory. Neural Comput. 1997 , 9 , 1735–1780. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Le, X.-H.; Hung, V.; Ho, G.L.; Sungho, J. Application of Long Short-Term Memory (LSTM) Neural Network for Flood Forecasting. Water 2019 , 11 , 1387. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Yan, S. Understanding LSTM and Its Diagrams. Available online: https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714 (accessed on 19 January 2022).
  • Chen, J. CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning. 2012. Available online: http://vision.stanford.edu/teaching/cs231a_autumn1213_internal/project/final/writeup/distributable/Chen_Paper.pdf (accessed on 15 March 2022).


Table: Comparison of related sign language recognition methods.

Author | Methodology | Dataset | Accuracy
Mittal et al. (2019) | 2D-CNN and modified LSTM, with Leap Motion sensor | ASL | 89.50%
Aparna and Geetha (2019) | CNN and 2-layer LSTM | Custom dataset (6 signs) | 94%
Jiang et al. (2021) | 3DCNN with SL-GCN using RGB-D modalities | AUTSL | 98%
Liao et al. (2019) | 3D-ConvNet with BLSTM | DEVISIGN_D | 89.8%
Adaloglou et al. (2021) | Inflated 3D ConvNet with BLSTM | RGB + D | 89.74%

Table: Precision, recall, and F1-score of the evaluated models on IISL2020 (our dataset), AUTSL, and GSL.

Model | IISL2020 (P / R / F1) | AUTSL (P / R / F1) | GSL (P / R / F1)
GRU-GRU | 0.92 / 0.90 / 0.90 | 0.93 / 0.90 / 0.90 | 0.93 / 0.92 / 0.93
LSTM-LSTM | 0.96 / 0.96 / 0.95 | 0.89 / 0.89 / 0.89 | 0.90 / 0.89 / 0.89
GRU-LSTM | 0.91 / 0.89 / 0.89 | 0.90 / 0.89 / 0.89 | 0.91 / 0.90 / 0.90
LSTM-GRU | 0.97 / 0.97 / 0.97 | 0.95 / 0.94 / 0.95 | 0.95 / 0.94 / 0.94

Share and Cite

Kothadiya, D.; Bhatt, C.; Sapariya, K.; Patel, K.; Gil-González, A.-B.; Corchado, J.M. Deepsign: Sign Language Detection and Recognition Using Deep Learning. Electronics 2022 , 11 , 1780. https://doi.org/10.3390/electronics11111780




Artificial Intelligence Technologies for Sign Language

AI technologies can play an important role in breaking down the communication barriers of deaf or hearing-impaired people with other communities, contributing significantly to their social inclusion. Recent advances in both sensing technologies and AI algorithms have paved the way for the development of various applications aiming at fulfilling the needs of deaf and hearing-impaired communities. To this end, this survey aims to provide a comprehensive review of state-of-the-art methods in sign language capturing, recognition, translation and representation, pinpointing their advantages and limitations. In addition, the survey presents a number of applications, while it discusses the main challenges in the field of sign language technologies. Future research directions are also proposed in order to assist prospective researchers towards further advancing the field.

1. Introduction

Sign language (SL) is the main means of communication between hearing-impaired people and other communities and it is expressed through manual (i.e., body and hand motions) and non-manual (i.e., facial expressions) features. These features are combined together to form utterances that convey the meaning of words or sentences [ 1 ]. Being able to capture and understand the relation between utterances and words is crucial for the Deaf community in order to guide us to an era where the translation between utterances and words can be achieved automatically [ 2 ]. The research community has long identified the need for developing sign language technologies to facilitate the communication and social inclusion of hearing-impaired people. Although the development of such technologies can be really challenging due to the existence of numerous sign languages and the lack of large annotated datasets, the recent advances in AI and machine learning have played a significant role towards automating and enhancing such technologies.

Sign language technologies cover a wide spectrum, ranging from the capturing of signs to their realistic representation in order to facilitate the communication between hearing-impaired people, as well as the communication between hearing-impaired and speaking people. More specifically, sign language capturing involves the accurate extraction of body, hand and mouth expressions using appropriate sensing devices in marker-less or marker-based setups. The accuracy of sign language capturing technologies is currently limited by the resolution and discrimination ability of sensors and the fact that occlusions and fast hand movements pose significant challenges to the accurate capturing of signs. Sign language recognition (SLR) involves the development of powerful machine learning algorithms to robustly classify human articulations to isolated signs or continuous sentences. Current limitations in SLR lie in the lack of large annotated datasets that greatly affect the accuracy and generalization ability of SLR methods, as well as the difficulty in identifying sign boundaries in continuous SLR scenarios.

On the other hand, sign language translation (SLT) involves the translation between different sign languages, as well as the translation between sign and spoken languages. SLT methods employ sequence-based machine learning algorithms and aim to bridge the communication gap between people signing or speaking different languages. The difficulties in SLT lie in the lack of multilingual sign language datasets, as well as the inaccuracies of SLR methods, considering that gloss recognition (performed by SLR methods) is the initial step of SLT methods. Finally, sign language representation involves the accurate representation and reproduction of signs using realistic avatars or signed video approaches. Currently, avatar movements are deemed unnatural and hard to understand by the Deaf community due to inaccuracies in skeletal pose capturing and the lack of life-like features in the appearance of avatars.

Sign language technologies are interconnected and affect one another, as seen in Figure 1. The accurate extraction of hand and body motions, as well as facial expressions, plays a crucial role in the success of the machine learning algorithms that are responsible for the robust recognition of signs. Moreover, accurate sign language recognition significantly affects the performance of sign language translation and representation methods. The breakthroughs in sensing devices and AI have paved the way for the development of sign language applications that can immensely facilitate hearing-impaired people in their everyday life.

Figure 1. Sign language technologies.

Previous literature reviews mainly concentrate on specific sign language technologies, such as video-based and sensor-based sign language recognition [ 3 , 4 , 5 , 6 , 7 ] and sign language translation [ 8 , 9 ]. Lately, with the development of sign language applications, there are also reviews that present sign language systems to facilitate hearing-impaired people in teaching and learning, as well as in voice and text interpretation systems [ 10 , 11 ]. However, there is no systematic review that presents all sign language technologies and their relations with each other. This review aims to fill this gap by presenting the advances of AI in all sign language technologies, ranging from capturing and recognition to translation and representation, and concludes by describing recent sign language applications that can considerably facilitate the communication between hearing-impaired and speaking people. The main purpose of this review is to demonstrate the importance of using AI technologies in sign language to facilitate deaf and hearing-impaired people in their communication with other communities. In addition, this review aims at familiarizing researchers with the state-of-the-art in all sign language technologies and at proposing future research directions that can facilitate the development of even more accurate approaches that can lead to mainstream products for the Deaf community. More specifically, the objectives of this review can be summarized as follows:

  • A comprehensive overview of the use of AI technologies in various sign language tasks (i.e., capturing, recognition, translation and representation), along with their importance to their field, is provided.
  • The advantages and limitations of modern sign language technologies and the relations between them are discussed and explored.
  • Possible future directions in the development of AI technologies for sign language are suggested to facilitate prospective researchers in the field.

The rest of this survey is organized as follows. In Section 2 , the literature search guideline is presented. Sign language capturing sensors are described in Section 3 . In Section 4 , sign language recognition methods are categorized and discussed. Sign language representation approaches and applications are presented in Section 5 and Section 6 , respectively. Finally, conclusions and potential future research directions are highlighted in Section 7 .

2. Literature Search

A systematic literature search was performed by adopting the PRISMA guidelines [ 12 ]. The articles were extracted in June 2021 from three academic databases, namely Scopus (https://www.scopus.com/home.uri, accessed on 28 May 2021), ProQuest (https://www.proquest.com/, accessed on 28 May 2021) and IEEE Xplore (https://ieeexplore.ieee.org/Xplore/home.jsp, accessed on 28 May 2021). Articles that were not peer-reviewed or not written in English were discarded. Since this review deals with AI technologies for sign language, the search was based on the following condition:

TITLE-ABSTRACT-KEYWORDS ( sign AND language AND ( recognition OR application(*) OR avatar(*) OR representation(*) OR translation OR captur(*) OR generation OR production ) ) AND PUBLISH YEAR > 2018 AND ( LIMIT-TO ( DOCTYPE , "ar" ) OR LIMIT-TO ( DOCTYPE , "cp" ) OR LIMIT-TO ( DOCTYPE , "ch" ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ) ) AND ( LIMIT-TO ( PUBSTAGE , "final" ) ) AND ( LIMIT-TO ( SUBJAREA , "COMP" ) OR LIMIT-TO ( SUBJAREA , "ENGI" ) )

The aforementioned search condition requires the existence of the above words (i.e., recognition, translation, etc.) in the title, abstract or keywords of the literature works. In this context, (*) allows for variations in the search terms (i.e., captur(*) allows the existence of words such as capture, capturing, etc.). In addition, the search is performed for papers published after 2018, since the field is evolving at a fast pace and older methods quickly become obsolete. To this end, this review aims to present only the latest and best works related to sign language technologies. Finally, the papers included in this review have been published as journal articles, conference proceedings and book chapters (i.e., DOCTYPE) in the fields of computing and engineering (i.e., SUBJAREA).

The number of the records retrieved from the three databases is 2368. From this number, 331 duplicate records are removed, leading to 2037 unique records. After screening title, abstract and finally the full text with various criteria to discard irrelevant records, 106 records remain and are included in this review. The selection procedure is depicted in Figure 2 .

Figure 2. Flowchart of the systematic literature search process.

3. Sign Language Capturing

Sign language capturing involves the recording of sign gestures using appropriate sensor setups. The purpose is to capture discriminative information from the signs that will allow the study, recognition and 3D representation of signs at later stages. Moreover, sign language capturing enables the construction of large datasets that can be used to accurately train and evaluate machine learning sign language recognition and translation algorithms.

3.1. Capturing Sensors

The most common means of recording sign gestures is through visual sensors that are able to capture fine-grained information, such as facial expressions and body postures, that is crucial for understanding sign language. Cerna et al. in [ 13 ] employed a Kinect sensor [ 14 ] to simultaneously capture red-green-blue (RGB) image, depth and skeletal information towards the recording of a multimodal dataset with Brazilian sign language. Similarly, Kosmopoulos et al. in [ 15 ] captured realistic real-life scenarios with sign language using the Kinect sensor. The dataset contains isolated and continuous sign language recordings with RGB, depth and skeletal information, along with annotated hand and facial features. Contrary to the previous methods that use a single Kinect sensor, this work additionally employs a machine vision camera, along with a television screen, for sign demonstration. Sincan et al. in [ 16 ], captured isolated Turkish sign language glosses using Kinect sensors with a large variety of indoor and outdoor backgrounds, revealing the importance of capturing videos with various backgrounds. Adaloglou et al. in [ 17 ], created a large sign language dataset with a RealSense D435 sensor that records both RGB and depth information. The dataset contains continuous and isolated sign videos and is appropriate for both isolated and continuous sign language recognition tasks.
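As a rough sketch of how an RGB-D capture setup such as the RealSense D435 can be driven from Python, the snippet below uses the pyrealsense2 bindings; the stream resolutions, frame rate and recording length are illustrative choices, not those of the cited datasets:

```python
import numpy as np
import pyrealsense2 as rs  # Intel RealSense Python bindings (assumed available)

# Minimal RGB + depth capture loop; parameters are illustrative.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)
try:
    for _ in range(300):                                              # ~10 s at 30 fps
        frames = pipeline.wait_for_frames()
        color = np.asanyarray(frames.get_color_frame().get_data())    # (480, 640, 3) uint8
        depth = np.asanyarray(frames.get_depth_frame().get_data())    # (480, 640) uint16
        # ... append color/depth frames to the recording buffers of a dataset ...
finally:
    pipeline.stop()
```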

Another sensor that has been employed for sign language capturing is Leap Motion, which has the ability to capture 3D positions of hand and fingers at the expense of having to operate close to the subject. Mittal et al. in [ 18 ], employed this type of sensor to record sign language gestures. Other setups with antennas and readers of radio-frequency identification (RFID) signals have also been adopted for sign language recognition. Meng et al. in [ 19 ], extracted phase characteristics of RFID signals to detect and recognize sign gestures. The training setup consists of an RFID reader, an RFID tag and a directional antenna. The recorded human should stand between the reader and the tag for a proper capturing. Moreover, the recognition system is signer-dependent.

On the other hand, wearable sensors have been adopted for capturing sign language gestures. Galea et al. in [ 20 ], used electromyography (EMG) to capture electrical activity that was produced during arm movement. The Thalmic MYO armband device was used for the recording of Irish sign language alphabet. Similarly, Zhang et al. [ 21 ] used a wearable device to capture EMG and inertial measurement unit (IMU) signals, while they used a convolutional neural network (CNN) [ 22 ] followed by a long short-term memory (LSTM) [ 23 ] architecture to recognize American sign language at both word and sentence levels. One disadvantage of the method is that its performance has not been evaluated under walking condition. Hou et al. in [ 24 ], proposed Sign-Speaker, which was deployed on a smartwatch to collect sign signals. Then, these signals were sent to a smartphone and were translated into spoken language in real-time. In this method, a very simple capturing setup is required, consisting of a smartwatch and a smartphone. However, their system recognizes a limited number of signs and it cannot generalize well to new users. Wang et al. in [ 25 ], employed a system with two armbands using both IMU and EMG sensors in order to capture fine-grained finger and hand positions and movements. How et al. in [ 26 ], used a low-cost dataglove with IMU sensors to capture sign gestures that were transmitted through Bluetooth to a smartphone device. Nevertheless, the employment of a single right-hand dataglove limited the number of signs that could be performed by this setup.

Each of the aforementioned sensor setups for sign language capturing has different characteristics, which makes it suitable for different applications. Kinect sensors provide high resolution RGB and depth information, but their accuracy is restricted by the distance from the sensors. Leap Motion also requires a small distance between the sensor and the subject, but its low computational requirements enable its use in real-time applications. Multi-camera setups are capable of providing highly accurate results at the expense of increased complexity and computational requirements. A MYO armband that can detect EMG and inertial signals is also used in a few works, but the inertial signals may be distorted by body motions when people are walking. Smartwatches are really popular nowadays and they can also be used for sign language capturing, but their output can be quite noisy due to unexpected body movements. Finally, datagloves can provide highly accurate sign language capturing results in real-time. However, the tuning of their components (i.e., flex sensor, accelerometer, gyroscope) may require a trial and error process that is impractical and time-consuming. In addition, signers tend not to prefer datagloves for sign language capturing as they are considered invasive.

3.2. Datasets

Datasets are crucial for the performance of methodologies regarding sign language recognition, translation and synthesis and as a result a lot of attention has been drawn towards the accurate capturing of signs and their meticulous annotation. The majority of the existing publicly available datasets are captured with visual sensors and are presented below.

3.2.1. Continuous Sign Language Recognition Datasets

Continuous sign language recognition (CSLR) datasets contain videos of sequences of signs instead of individual signs and are more suitable for developing real-life applications. Phoenix-2014 [ 27 ] is one of the most popular CSLR datasets, with recordings of weather forecasts in German sign language. All videos were recorded with 9 signers at a frame rate of 25 frames per second. The dictionary has 1081 unique glosses and the dataset contains 5672 videos for training, 540 videos for validation and 629 videos for testing. The same authors created an updated version of Phoenix-2014, called Phoenix-2014-T [ 28 ], with spoken language translations, which makes it appropriate for both CSLR and sign language translation experiments. It contains 8257 videos from 9 different signers performing 1088 unique signs and 2887 unique words. Although all recordings were performed in a controlled environment, Phoenix-2014 and Phoenix-2014-T are both challenging datasets with large vocabularies and a varying number of samples per sign, with a few signs having only a single sample. Similarly, BSL-1K [ 29 ] contains video recordings from British news broadcasts, along with annotations automatically extracted from the provided subtitles. It is a large database with 273,000 samples from 40 signers that is also used for sign language segmentation. Another notable dataset is CSL [ 30 , 31 ], which contains Chinese words widely used in daily communication. The dataset has 100 sentences with signs that were performed by 50 signers. The recordings were performed in a lab with predefined conditions (i.e., background, lighting). The vocabulary size is 178 words that are performed multiple times, resulting in high recognition results achieved by SLR methods. GRSL [ 15 ] is another CSLR dataset of Greek sign language that is used in home care services, which contains multiple modalities, such as RGB, depth and skeletal joints. On the other hand, GSL [ 17 ] is a large Greek sign language dataset created to assist the communication of Deaf people with public service employees. The dataset was created with a RealSense D435 sensor that records both RGB and depth information. Furthermore, it contains both continuous and isolated sign videos from 15 predefined scenarios. It was recorded in a laboratory environment, where each scenario is repeated five consecutive times.

3.2.2. Isolated Sign Language Recognition Datasets

Isolated sign language recognition (ISLR) datasets are important for identifying and learning discriminative features for sign language recognition. CSL-500 [ 31 , 32 ] is the isolated version of CSL, but it contains 500 unique glosses performed by the same 50 signers. CSLR methods usually adopt this dataset for feature learning prior to finetuning on the CSL dataset. MS-ASL [ 33 ] is another widely employed ISLR dataset with 1000 unique American sign language glosses. It contains recordings collected from the YouTube platform from 222 signers with a large variance in background settings, which makes this dataset suitable for training complex methods with strong representation capabilities. Similarly, WASL [ 34 ] is an ISLR dataset with 2000 unique American sign glosses performed by 119 signers. The videos have different background and illumination conditions, which makes it a challenging ISLR benchmark dataset. On the other hand, AUTSL is a Turkish sign language dataset captured under various indoor and outdoor backgrounds, while LSA64 [ 35 ] is an Argentinian sign language dataset that includes 3200 videos, in which 10 non-expert subjects execute 5 repetitions of 64 different types of signs. LSA64 is a small and relatively easy dataset, where SLR methods achieve outstanding recognition performance. Finally, IsoGD [ 36 ] is a gesture recognition dataset that consists of 47,933 RGB-D videos performed by 21 different individuals and contains 249 gesture labels. Although IsoGD is a gesture recognition dataset, its large size and challenging illumination and background conditions allow the training of highly accurate ISLR methods.

3.2.3. Discussion

A discussion about the aforementioned datasets can be made at this stage, while a detailed overview of the dataset characteristics is provided in Table 1. It can be seen that over time datasets have become larger in size (i.e., number of samples), involve more signers and contain high resolution videos captured under various and challenging illumination and background conditions. Moreover, new datasets usually include different modalities (i.e., RGB, depth and skeleton). Recording sign language videos using many signers is very important, since each person performs signs with different speed, body posture and facial expression. Moreover, high resolution videos capture more clearly small but important details, such as finger movements and facial expressions, which are crucial cues for sign language understanding. Datasets with videos captured under different conditions enable deep networks to extract highly discriminative features for sign language classification. As a result, methodologies trained on such datasets can obtain greatly enhanced representation and generalization capabilities and achieve high recognition performances. Furthermore, although RGB information is the predominant modality used for sign language recognition, additional modalities, such as skeleton and depth information, can provide complementary information to the RGB modality and significantly improve the performance of SLR methods.

Table 1. Large-scale publicly available SLR datasets.

Dataset | Language | Signers | Classes | Video Instances | Resolution | Type | Modalities | Year
Phoenix-2014 [ ] | German | 9 | 1231 | 6841 | 210 × 260 | CSLR | RGB | 2014
CSL [ , ] | Chinese | 50 | 178 | 25,000 | 1920 × 1080 | CSLR | RGB, depth | 2016
Phoenix-2014-T [ ] | German | 9 | 1231 | 8257 | 210 × 260 | CSLR | RGB | 2018
GRSL [ ] | Greek | 15 | 1500 | 4000 | varying | CSLR | RGB, depth, skeleton | 2020
BSL-1K [ ] | British | 40 | 1064 | 273,000 | varying | CSLR | RGB | 2020
GSL [ ] | Greek | 7 | 310 | 10,295 | 848 × 480 | CSLR | RGB, depth | 2021
CSL-500 [ , ] | Chinese | 50 | 500 | 125,000 | 1920 × 1080 | ISLR | RGB, depth | 2016
MS-ASL [ ] | American | 222 | 1000 | 25,513 | varying | ISLR | RGB | 2019
WASL [ ] | American | 119 | 2000 | 21,013 | varying | ISLR | RGB | 2020
AUTSL [ ] | Turkish | 43 | 226 | 38,336 | 512 × 512 | ISLR | RGB, depth | 2020
KArSL [ ] | Arabic | 3 | 502 | 75,300 | varying | ISLR | RGB, depth, skeleton | 2021

4. Sign Language Recognition

Sign language recognition (SLR) is the task of recognizing sign language glosses from video streams. It is a very important research area since it can bridge the communication gap between hearing and Deaf people, facilitating the social inclusion of hearing-impaired people. Moreover, sign language recognition can be classified into isolated and continuous based on whether the video streams contain an isolated gloss or a gloss sequence that corresponds to a sentence.

4.1. Continuous Sign Language Recognition

Continuous Sign Language Recognition aims at classifying signed videos into entire sentences (i.e., ordered sequences of glosses). CSLR is a very challenging task as it requires the recognition of glosses from video sequences without any knowledge of the sign boundaries (i.e., lack of ground truth annotations regarding the start and end of glosses). Most works adopt 2D or 3D-CNNs for feature extraction, followed by temporal convolutional networks or recurrent neural networks (RNNs) for sequential information modelling. To measure CSLR performance, the word error rate (WER) [ 38 ] is commonly adopted. WER measures the minimum number of operations (i.e., substitutions, deletions and insertions) required to transform the predicted sequence into the target sequence, normalized by the length of the target sequence.
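For concreteness, WER can be computed with a standard Levenshtein dynamic-programming table over gloss sequences; the short sketch below is illustrative (the gloss names in the example are made up) and is not tied to any particular evaluation toolkit:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / len(reference),
    computed with a Levenshtein dynamic-programming table over gloss sequences."""
    r, h = reference, hypothesis
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# Example: one substitution and one deletion over a 4-gloss reference -> WER = 0.5
print(wer(["MORGEN", "REGEN", "NORD", "WIND"], ["MORGEN", "SONNE", "NORD"]))
```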

Cui et al. [ 39 ] adopted a 2D-CNN followed by temporal 1D convolutional layers for feature extraction. The extracted spatio-temporal features were fed to a bidirectional long short-term memory (BLSTM) network for modelling the context of the entire sequence. The feature extractor was extended with a classifier and trained in a fully-supervised setting on isolated glosses for video to gloss alignment, while the BLSTM was used for CSLR. This two-step optimization process was conducted iteratively with Connectionist Temporal Classification (CTC) [ 40 ] and Cross-Entropy losses, until the network converged. Besides, the recognition model fused RGB with optical flow modalities and achieved a WER of 22.8% on the Phoenix-2014 dataset. Similarly, Koishybay et al. in [ 41 ], adopted a residual 2D-CNN with cascaded 1D convolutional layers for feature extraction, while for CSLR experiments, BLSTM was utilized. Their method generated gloss-level alignments using the Levenshtein distance in order to fine-tune the feature extractor. However, the authors stated that during the early iterations the model predicted poor alignment proposals, which hinders the training process and requires several iterations to converge. Cheng et al. in [ 42 ], proposed a 2D fully convolutional network with a feature enhancement module that did not require iterative training. Instead, it provided extra supervision and assisted the CSLR network to learn better gloss alignments. Niu et al. in [ 43 ], proposed a 2D-CNN followed by a Transformer network for CSLR. They used three stochastic methods to drop frames of the input video, to randomly stop gradients of back-propagation and to model glosses using hidden states, respectively, which led to better CSLR performance. Nevertheless, the randomness ratio of these stochastic processes must be tuned carefully to achieve good recognition rates. Generally, CSLR methods based on 2D-CNNs achieve great recognition performance. More specifically, 2D-CNNs extract descriptive features from the frame sequences, while the sequence modelling mechanisms align efficiently the input video and the output predictions. However, they usually require complex training strategies, such as iterative optimization techniques, to achieve strong feature extraction capabilities.
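To make the typical pipeline concrete, the following PyTorch sketch shows how a per-frame 2D-CNN, temporal 1D convolutions, a BLSTM and the CTC loss fit together. It is a schematic simplification with placeholder layer sizes (the gloss vocabulary of 1081 follows the Phoenix-2014 description above), not the architecture of any specific cited work:

```python
import torch
import torch.nn as nn

class CSLRNet(nn.Module):
    def __init__(self, num_glosses, feat_dim=512, hidden=256):
        super().__init__()
        # Per-frame spatial features (stand-in for a pretrained 2D-CNN backbone)
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # Short-term temporal modelling with 1D convolutions
        self.temporal = nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2)
        # Long-term context with a bidirectional LSTM
        self.blstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_glosses + 1)   # +1 for the CTC blank

    def forward(self, video):                                      # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.frame_cnn(video.flatten(0, 1)).view(b, t, -1)
        feats = self.temporal(feats.transpose(1, 2)).transpose(1, 2)
        feats, _ = self.blstm(feats)
        return self.classifier(feats).log_softmax(-1)              # (B, T, num_glosses + 1)

model = CSLRNet(num_glosses=1081)
logp = model(torch.randn(2, 16, 3, 112, 112))                      # 2 clips of 16 frames
ctc = nn.CTCLoss(blank=1081, zero_infinity=True)
targets = torch.randint(0, 1081, (2, 4))                           # dummy gloss sequences
loss = ctc(logp.transpose(0, 1), targets,                          # CTC expects (T, B, C)
           torch.full((2,), 16, dtype=torch.long),
           torch.full((2,), 4, dtype=torch.long))
```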

On the other hand, some works chose to incorporate attention mechanisms for CSLR. Pan et al. in [ 44 ], used a key-frame sampling technique to extract the most descriptive frames of the video. Then, a vector representation was constructed from the skeletal data of the key-frames, which was fed to an attention-based BLSTM to model the temporal information. Huang et al. [ 45 ] proposed an adaptive encoder-decoder architecture to learn the temporal boundaries of the video. Furthermore, a hierarchical BLSTM with attention over sliding windows was used on the decoder to weigh the importance of the input frames. Li et al. in [ 46 ], used a pyramid structure of BLSTMs in order to find key actions of the video representations, which were produced from the 2D-CNN. Moreover, an attention-based LSTM was used to align the input and output sequences and the whole network was trained jointly with Cross-Entropy and CTC losses.

Recently, the self-attention mechanism has been introduced in a variety of models, such as the Transformer, and has also been adopted by CSLR methods. Slimane et al. in [ 47 ], proposed two data streams with cropped hand images and full images. The two modalities were passed through two 2D-CNNs to extract the spatial features. Then, the modalities were synchronized by a self-attention module to obtain better contextual information and generate efficient video representations for CSLR. Zhou et al. [ 48 ], adopted a fully-inception architecture with 2D and 1D convolutional layers, along with a self-attention module, to further improve the feature extraction capabilities of the inception layers.

Reinforcement techniques have also been applied for CSLR, along with Transformer networks. Zhang et al. in [ 49 ], adopted a 3D-CNN followed by a Transformer network that was responsible for recognizing gloss sequences from input videos. Instead of training the model with cross-entropy loss, they used the REINFORCE algorithm [ 50 ] to directly optimize the model by using WER as the reward function of the agent (i.e., the feature extractor). Wei et al. in [ 51 ], used a semantic boundary detection algorithm with reinforcement learning to improve CSLR performance. A spatio-temporal feature extractor learned the video representations. Then, the detection algorithm used reinforcement learning to detect gloss timestamps from video sequences and refine the final video representations. The evaluation metric was used again as the reward function. The major limitation of this method is the need for a careful selection of the pooling size, which defines the action search space for the reinforcement learning agent.

Papastratis et al. [ 52 ] constructed a cross-modal approach in order to effectively model intra-gloss dependencies by leveraging information from text. This method extracted video features using a video encoder that consisted of a 2D-CNN followed by temporal convolutions and a BLSTM, while text representations were obtained from an LSTM. Finally, these embeddings were aligned in a joint latent space. The improved representations led to great CSLR performance, achieving WERs of 24.0% and 3.52% on Phoenix-2014 and GSL SI, respectively. Papastratis et al. in their latest work [ 53 ], employed a generative adversarial network to evaluate the predictions of the video encoder. In addition, contextual information was incorporated to improve recognition performance on sign language conversations.

Due to their efficient feature extraction capabilities, 3D-CNNs have also been adopted by many researchers for CSLR. Wei et al. in [ 54 ], used a 3D residual CNN along with a BLSTM, while applying the grammatical rules of sign language. The text was split into isolated words and n-grams, which were modelled using two classifiers. The two classifiers aimed to recognize each word independently and based on the context, in contrast to CTC, which models the whole sequence. Pu et al. in [ 55 ], employed a 3D-CNN with an LSTM decoder and a CTC decoder that were jointly aligned with a soft dynamic time warping (soft-DTW) [ 56 ] alignment constraint. The network was trained recursively with the proposed alignments from soft-DTW. The method achieved WERs of 6.1% and 32.7% on CSL Split 1 and CSL Split 2, respectively. Guo et al. in [ 57 ], developed a fully convolutional approach with a 3D-CNN followed by 1D temporal convolutional layers. The 1D CNN block had a hierarchical structure with small and large receptive fields to capture short- and long-term correlations in the video, while the entire architecture was trained with the CTC loss. 3D-CNNs are computationally expensive methods that require pre-training on large-scale datasets and cannot be tuned directly for CSLR; to this end, sliding window techniques are adopted to create informative features. To tackle this problem, some works incorporated pseudo-labelling, which is an optimization process that adds predicted labels to the training set. Pei et al. in [ 58 ], trained a deep 3D-CNN with CTC and generated clip-level pseudo-labels from the CTC alignment to obtain better feature representations. To improve the quality of pseudo-labels, Zhou et al. in [ 59 ], proposed a dynamic decoding method instead of greedy decoding to find better alignment paths and filter out the wrong pseudo-labels. Their method applied the I3D [ 60 ] network from the action recognition field along with temporal convolutions and bidirectional gated recurrent units (BGRU) [ 61 ]. Moreover, the proposed method achieved a WER of 34.5% on the Phoenix-2014 dataset. However, pseudo-labelling required many iterations, while the initial labels affected the convergence of the optimization process.
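As an illustration of the pseudo-labelling idea, greedy CTC decoding can turn per-frame probabilities into clip-level labels by collapsing repeats and dropping blanks; this is a simplified sketch, whereas the cited works use more elaborate alignment and filtering strategies:

```python
import torch

def greedy_ctc_pseudo_labels(log_probs, blank=0):
    """log_probs: (T, C) per-frame log-probabilities from a CSLR network.
    Returns the frame-wise argmax labels and the collapsed gloss sequence that
    could be fed back as pseudo-labels for fine-tuning the feature extractor."""
    frame_labels = log_probs.argmax(dim=-1)          # best class per frame
    collapsed = []
    prev = blank
    for lbl in frame_labels.tolist():
        if lbl != blank and lbl != prev:             # drop blanks and repeated frames
            collapsed.append(lbl)
        prev = lbl
    return frame_labels, collapsed

# Toy example with 3 classes (+ blank=0): frames predict 0,2,2,0,1,1 -> glosses [2, 1]
logits = torch.log_softmax(torch.tensor([
    [5., 0., 0., 0.], [0., 0., 5., 0.], [0., 0., 5., 0.],
    [5., 0., 0., 0.], [0., 5., 0., 0.], [0., 5., 0., 0.]]), dim=-1)
print(greedy_ctc_pseudo_labels(logits))
```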

In Table 2 , several methods are compared on the test set of the most commonly adopted datasets for continuous sign language recognition. From the experimental results it is shown that multi-modal methods achieve the lowest WERs. More specifically, STMC [ 62 ] has the best recognition rates on Phoenix-2014, CSL Split 1 and CSL Split 2 datasets using RGB, hands and skeleton modalities, while SLRGAN [ 53 ], employing the RGB and text modality, achieves superior performance on the GSL SI and GSL SD datasets.

Table 2. Performance comparison of CSLR approaches categorized by dataset, measured in WER (%). The best performance for each dataset appears in bold.

Method | Input Modality | Dataset | Test Set (WER)
PL [ ] | RGB | Phoenix-2014 | 40.6
RL [ ] | RGB | Phoenix-2014 | 38.3
Align-iOpt [ ] | RGB | Phoenix-2014 | 36.7
DenseTCN [ ] | RGB | Phoenix-2014 | 36.5
DPD [ ] | RGB | Phoenix-2014 | 34.5
CNN-1D-RNN [ ] | RGB | Phoenix-2014 | 34.4
Fully-Inception Networks [ ] | RGB | Phoenix-2014 | 31.3
SAN [ ] | RGB | Phoenix-2014 | 29.7
SFD [ ] | RGB | Phoenix-2014 | 25.3
CrossModal [ ] | RGB | Phoenix-2014 | 24.0
Fully-Conv-Net [ ] | RGB | Phoenix-2014 | 23.9
SLRGAN [ ] | RGB | Phoenix-2014 | 23.4
CNN-TEMP-RNN [ ] | RGB + Optical flow | Phoenix-2014 | 22.8
STMC [ ] | RGB + Hands + Skeleton | Phoenix-2014 | —
DenseTCN [ ] | RGB | CSL Split 1 | 14.3
Key-action [ ] | RGB | CSL Split 1 | 9.1
Align-iOpt [ ] | RGB | CSL Split 1 | 6.1
WIC-NGC [ ] | RGB | CSL Split 1 | 5.1
DPD [ ] | RGB | CSL Split 1 | 4.7
Fully-Conv-Net [ ] | RGB | CSL Split 1 | 3.0
CrossModal [ ] | RGB | CSL Split 1 | 2.4
SLRGAN [ ] | RGB | CSL Split 1 | —
STMC [ ] | RGB + Hands + Skeleton | CSL Split 1 | —
Key-action [ ] | RGB | CSL Split 2 | 49.1
DenseTCN [ ] | RGB | CSL Split 2 | 44.7
Align-iOpt [ ] | RGB | CSL Split 2 | 32.7
STMC [ ] | RGB + Hands + Skeleton | CSL Split 2 | —
CrossModal [ ] | RGB | GSL SI | 3.52
SLRGAN [ ] | RGB | GSL SI | —
CrossModal [ ] | RGB | GSL SD | 41.98
SLRGAN [ ] | RGB | GSL SD | —

4.2. Isolated Sign Language Recognition

Isolated sign language recognition refers to the task of accurately detecting single sign gestures from videos and thus it is usually tackled similarly to action and gesture recognition, as well as other types of video processing and classification tasks, with the extraction and learning of highly discriminative features [ 63 , 64 , 65 ]. In the literature, a common approach to the task of isolated sign language recognition is the extraction of hand and mouth regions from the video sequences in an attempt to remove noisy backgrounds that can inhibit classification performance. Liao et al. in [ 66 ], proposed a video-based SLR method that was based on hand region extraction and classification using 3D ResNet networks and BLSTM layers. Similarly, Aly et al. in [ 67 ], developed an ISLR method that segmented hand regions from images using the DeepLabv3+ algorithm [ 68 ], extracted features from these regions using a Convolutional Self-Organizing Map and classified the features using a deep recurrent neural network consisting of 3 BLSTM layers. Gökçe et al. in [ 69 ], proposed 3D-CNN networks for the processing of hand, upper body and face image regions and the fusion of these streams at the score level to accurately classify isolated signs. The authors stated that their method performs comparatively worse on mono-morphemic signs performed with a single hand, rather than on temporally more complex signs with two-handed gestures. On the other hand, Zhang et al. in [ 70 ], proposed the Multiple extraction and Multiple prediction (MEMP) network that consists of alternating 3D-CNN networks and Convolutional LSTM layers that extracted spatio-temporal features from video sequences multiple times, enabling the network to achieve 99.06% and 78.85% accuracy on the LSA64 and IsoGD datasets, respectively. Li et al. in [ 71 ], proposed a SLR method that was based on transferring cross-domain knowledge of news signs to a base model, improving its performance using domain-invariant features.

To further improve the accuracy and robustness of SLR methods, several researchers proposed the extraction of other types of features, such as optical flow and skeletal joints from visual cues. These multi-stream networks are more computationally expensive than their single stream counterparts, but they have the advantage of overcoming confusing cases regularly met when a single type of features is employed. Sarhan et al. in [ 72 ], proposed a two-stream network architecture that received as input RGB and optical flow data, extracted features using I3D networks and performed late fusion at the score level for accurate sign language recognition. Rastgoo et al. in [ 73 ], proposed a multi-stream SLR method that utilized as input hand image regions, hand heatmaps and 2D projections of hand skeletal joints to images. These input data were processed using 3D-CNN networks, concatenated and fed to LSTM layers for sign recognition. Konstantinidis et al. in [ 74 ], proposed a SLR methodology that was based on the processing and late fusion of body and hand skeletal features using LSTM layers. Apart from the raw joint coordinates, the authors also utilized joint-line distances, which led to a significant improvement in the performance of the method, reaching 98.09% accuracy in the LSA64 dataset. In a later work [ 75 ], the same authors introduced additional streams that processed RGB video sequences and optical flow data, enhancing even more the performance of their method, ultimately achieving 99.84% accuracy in the LSA64 dataset. Similarly, Papadimitriou et al. in [ 76 ], proposed a multi-stream SLR method that processes hand and mouth regions, as well as optical flow and skeletal features for the accurate classification of signs. These features were concatenated and fed to a temporal deformable convolutional attention-based encoder-decoder that predicts the sign class. Gündüz et al. in [ 77 ], employed a multi-stream SLR approach that received as input RGB video sequences, optical flow sequences and body and hand skeletal features and performed a late fusion to accurately classify Turkish signs. Bilge et al. in [ 78 ], proposed a SLR method that can generalize well on unseen signs. To achieve this, the authors employed two 3D-CNN networks followed by BLSTM layers for the extraction of short-term and long-term feature representations from body and hand video sequences. In addition, the authors employed a BERT model [ 79 ] for the extraction of textual sign representations from text descriptions of how the signs were performed. Finally, they used a bi-linear compatibility function to associate video and text representations.
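A rough sketch of the kind of hand-crafted skeletal descriptors mentioned above, here pairwise joint(-line) distances computed per frame; the joint layout and exact feature set are generic placeholders and may differ from those used in the cited works:

```python
import numpy as np

def pairwise_joint_distances(skeleton_seq):
    """skeleton_seq: (T, J, 3) array of J 3D joints per frame.
    Returns a (T, J*(J-1)//2) matrix of distances between every joint pair,
    a simple descriptor that can complement raw joint coordinates."""
    t, j, _ = skeleton_seq.shape
    iu = np.triu_indices(j, k=1)                                        # upper-triangle joint pairs
    diffs = skeleton_seq[:, :, None, :] - skeleton_seq[:, None, :, :]   # (T, J, J, 3)
    dists = np.linalg.norm(diffs, axis=-1)                              # (T, J, J)
    return dists[:, iu[0], iu[1]]

# Example: 30 frames of a 21-joint hand skeleton -> 30 x 210 feature matrix
features = pairwise_joint_distances(np.random.rand(30, 21, 3))
print(features.shape)    # (30, 210)
```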

In an effort to derive more discriminative features, Rastgoo et al. in [ 63 ], proposed a multi-stream SLR method that gets as input hand regions, 3D hand pose features and Extra Spatial Hand Relation features (i.e., orientation and slope of hands). These features were concatenated and fed to an LSTM layer to derive the sign class. In this way, the authors managed to achieve a really high accuracy of 86.32% in the challenging IsoGD dataset. Kumar et al. in [ 64 ], proposed Spatial 3D Relational Features for sign language recognition. These features were computed from the area and perimeter of polygons formed by quadruples of skeletal joints. Then, the class of a test sign was predicted by comparing the sign with the training set using global alignment kernels. In another work [ 80 ], Kumar et al. introduced two novel features for accurate sign language recognition that were named colour-coded topographical descriptors. These descriptors were formed as images from the computation of joint distances and angles. Finally, these descriptors were processed by 2D CNNs and merged to derive the class of the sign.

Recently, the advances in deep learning led several isolated SLR methods to leverage attention mechanisms, transformer networks and graph convolutional networks. Attention mechanisms in particular enable a deep network to pay more attention on features that are important for a classification task and are widely employed by most state-of-the-art SLR methods. Parelli et al. in [ 81 ], proposed a multi-stream SLR method that processes hand and mouth image regions as well as 3D hand skeletal data. All streams were concatenated and fed to an attention CNN network that accurately predicts the class of the sign. Attention LSTM, attention GRU and Transformer networks were also tested but they led to inferior performance. De Amorim et al. in [ 82 ], proposed an American SLR method that extracts skeletal data from video sequences and then processes them using a Spatio-Temporal Graph Convolutional Network (GCN) [ 83 ]. Tunga et al. in [ 84 ], proposed a SLR method that extracts skeletal features from video sequences and then employs a GCN network to model spatial dependencies among the skeletal data, as well as a BERT model to model temporal dependencies among the skeletal data. The two representations were finally merged to derive the class of the sign. A limitation of this approach is that the model cannot differentiate in-plane and out-of-plane movements due to the use of only 2D spatial information. In a similar fashion, Meng et al. in [ 85 ], proposed a GCN with multi-scale attention modules to process the extracted skeletal data and model their long-term spatial and temporal dependencies. In this way, the authors achieved a really high accuracy of 97.36% in the CSL-500 dataset. GCNs are computationally lighter than the image processing networks, but they often cannot extract highly enriched features, thus leading to inferior performance, as noted in [ 82 ].
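For readers unfamiliar with graph convolutions on skeleton data, the core operation is a normalized aggregation over the joint graph followed by a learned projection; the minimal sketch below is generic and is not the ST-GCN architecture of [ 83 ] itself:

```python
import numpy as np

def gcn_layer(x, adj, weight):
    """One graph-convolution step on skeletal data.
    x: (J, F) joint features, adj: (J, J) adjacency of the skeleton graph,
    weight: (F, F_out) learnable projection. Returns (J, F_out)."""
    a_hat = adj + np.eye(adj.shape[0])                 # add self-connections
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt           # symmetric normalization
    return np.maximum(a_norm @ x @ weight, 0)          # aggregate, project, ReLU

# Toy 3-joint chain (e.g., shoulder-elbow-wrist) with 2D coordinates as features
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, 0.4]])
out = gcn_layer(x, adj, np.random.rand(2, 4))
print(out.shape)    # (3, 4)
```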

Finally, the wide adoption of RGB-D sensors for action and gesture recognition has led several researchers to adopt them for multi-modal sign language recognition as well. However, the performance of such multi-modal methodologies is currently limited by the small number of large publicly available RGB-D datasets and the mediocre accuracy of depth information. Tur et al. in [ 86 ], proposed a Siamese deep network for the concurrent processing of RGB and depth sequences. The extracted features were then concatenated and passed to an LSTM layer for isolated sign language recognition. Ravi et al. in [ 87 ], proposed a multi-modal SLR method that was based on the processing of RGB, depth and optical flow sequences. Each stream employed CNN layers to process the sequences and then, all features were fused together and fed to a CNN model for classification. Rastgoo et al. in [ 88 ], proposed a multi-modal SLR method that leverages RGB and depth video sequences to achieve an accuracy of 86.1% in the IsoGD dataset. More specifically, the authors extracted pixel-level, optical flow, deep hand and hand pose features for each modality, concatenated these features across both modalities and classified them to sign classes using an LSTM layer. The authors stated that there were signs with similar appearance and motion features that led to misclassification errors and thus they proposed the use of augmentation strategies, high capacity networks and more data samples.

Huang et al. in [ 89 ], proposed the use of RGB, depth and skeletal data as input to attention-based 3D-CNNs and attention-based BLSTMs in order for the proposed SLR method to pay attention to spatio-temporal dependencies in the input data and fuse the input streams in an optimal way. Huang et al. in [ 90 ], proposed a sequence-to-sequence approach that detects key frames to remove noisy information from video sequences. Then, they extracted CNN features from these key frames, histogram of oriented gradients (HOG) features from depth motion maps and trajectory features from skeletal data. These features were finally concatenated and fed to an encoder-decoder LSTM network that predicted sub-words that form the signed word. Zhang et al. in [ 91 ], proposed a highly accurate SLR method that initially selected pairs of aligned RGB-D images to reduce redundancy. Then, the proposed method computed discriminative features from hand regions using a spatial stream and extracted depth motion features using a temporal stream. Both streams were finally fused by a convolutional fusion layer and the output feature vector was used for classification. The authors reported that occlusions and surface materials can significantly affect the quality of depth images, degrading the performance of their model. Common failure cases among most ISLR methodologies are the difficulty in differentiating signs when performed differently by users and the inability to accurately classify signs with similar hand shapes and positions. An overview of the performance of ISLR methods on well-known datasets is presented in Table 3.

Table 3. Performance of ISLR methods on well-known datasets. The best performance for each dataset appears in bold.

Method | Dataset | Accuracy (%)
Konstantinidis et al. [ ] | LSA64 [ ] | 98.09
Zhang et al. [ ] | LSA64 [ ] | 99.06
Konstantinidis et al. [ ] | LSA64 [ ] | 99.84
Gündüz et al. [ ] | LSA64 [ ] | 99.9
Huang et al. [ ] | CSL-500 [ , ] | 91.18
Zhang et al. [ ] | CSL-500 [ , ] | 96.7
Meng et al. [ ] | CSL-500 [ , ] | 97.36
Sarhan et al. [ ] | IsoGD [ ] | 62.09
Zhang et al. [ ] | IsoGD [ ] | 63.78
Zhang et al. [ ] | IsoGD [ ] | 78.85
Rastgoo et al. [ ] | IsoGD [ ] | 86.1
Rastgoo et al. [ ] | IsoGD [ ] | 86.32

4.3. Sign Language Translation

Sign Language Translation is the task of translating videos with sign language into spoken language by modeling not only the glosses but also the language structure and grammar. It is an important research area that facilitates the communication between the Deaf and other communities. Moreover, the SLT task is more challenging compared to CSLR due to the additional linguistic rules and the representation of spoken languages. SLT methods are usually evaluated using the bilingual evaluation understudy (BLEU) metric [ 92 ]. BLEU is a translation quality score that evaluates the correspondence between the predicted translation and the ground truth text. More specifically, BLEU-n measures the n-gram overlap between the output and the reference sentences. BLEU-1, -2, -3 and -4 scores are reported to provide a clear view of the actual translation performance of a method. Camgoz et al. in [ 28 ], adopted an attention-based neural machine translation architecture for SLT. The encoder consisted of a 2D-CNN and an LSTM network, while the decoder consisted of word embeddings with an attention LSTM. The authors stated that the method is prone to errors when spoken words are not explicitly signed in the video but inferred from the context. Their method set the baseline performance on Phoenix-2014-T with a BLEU-4 score of 18.4. Orbay et al. in [ 93 ], compared different gloss tokenization methods using either 2D-CNN, 3D-CNN, LSTM or Transformer networks. In addition, they investigated the importance of using full frames compared to hand images, as the former provide useful information regarding the face and arms of the signer for SLT. On the other hand, Ko et al. in [ 94 ], utilized human keypoints extracted from the video, which were then fed to a recurrent encoder-decoder network for sign language translation. Furthermore, the skeletal features were extracted with OpenPose and then normalized to improve the overall performance. Then, they were fed to the encoder, while the translation was generated from the attention decoder. Differently, Zheng et al. in [ 95 ], used a preprocessing algorithm to remove similar and redundant frames of the input video and increase the processing speed of the neural network without losing information. Then, they employed an SLT architecture that consisted of a 2D-CNN, temporal convolutional layers and bidirectional GRUs. Their method was able to deal with long videos that have long-term dependencies, improving the translation quality. Zhou et al. in [ 62 ], proposed a multi-modal framework for CSLR and SLT tasks. The proposed method used a 2D-CNN, 1D convolutional layers and several BLSTMs and learned both spatial and temporal dependencies between different modalities. The proposed method achieved a BLEU-4 score of 23.65 on the test set of Phoenix-2014-T. However, due to the multi-modal cues, this method is very computationally heavy and requires several hours of training.
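As an illustration of the metric, a simplified sentence-level BLEU-n can be computed as clipped n-gram precisions combined with a brevity penalty; the sketch below (with a made-up sentence pair) is for intuition only, since published results rely on standard corpus-level BLEU implementations:

```python
import math
from collections import Counter

def bleu_n(reference, hypothesis, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of the 1..max_n clipped n-gram
    precisions multiplied by a brevity penalty. Inputs are lists of words."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)       # crude smoothing of zero counts
    bp = 1.0 if len(hypothesis) > len(reference) else math.exp(1 - len(reference) / max(len(hypothesis), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "am morgen regnet es im norden".split()
hyp = "am morgen regnet es im sueden".split()
print(round(bleu_n(ref, hyp), 3))
```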

Recently, Transformer networks have also been employed for sign language translation due to their success in natural language processing tasks. Camgoz et al. in [ 96 ], introduced a joint architecture for CSLR and SLT with a Transformer encoder-decoder network. The network was trained with CTC and Cross-Entropy losses, while the gloss-level supervision improved the SLT performance. The authors evaluated various configurations of their method and stated that directly translating from video representations can improve the translation quality. A limitation of this approach was in translating numbers, as there was no such context available during training. In their latest work, Camgoz et al. in [ 97 ], adopted additional modalities and a cross-modal attention mechanism to synchronize the different streams and model both inter- and intra-contextual information. Kim et al. in [ 98 ], used a deep neural network to extract human keypoints, which were fed to a transformer encoder-decoder network, while the keypoints were normalized based on the neck location. A comparison of existing SLT methods evaluated on the Phoenix-2014-T dataset is shown in Table 4. Overall, Transformer-based SLT methods achieve slightly better performance than RNN-based methods, which indicates the importance of the attention mechanism for SLT. In addition, using multiple modalities can also improve the translation quality.
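The joint recognition-and-translation setup described above can be sketched as a single Transformer trained with a CTC loss on the encoder outputs (gloss level) and a cross-entropy loss on the decoder outputs (word level). The PyTorch example below is schematic, with placeholder dimensions (the gloss and word vocabulary sizes follow the Phoenix-2014-T description above), rather than the exact model of [ 96 ]:

```python
import torch
import torch.nn as nn

class JointSLTTransformer(nn.Module):
    def __init__(self, feat_dim=1024, d_model=256, n_glosses=1088, n_words=2887):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # (positional encodings omitted for brevity)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.gloss_head = nn.Linear(d_model, n_glosses + 1)   # +1 for the CTC blank
        self.word_embed = nn.Embedding(n_words, d_model)
        self.word_head = nn.Linear(d_model, n_words)

    def forward(self, frame_feats, word_tokens):
        memory = self.transformer.encoder(self.proj(frame_feats))       # (B, T, d)
        gloss_logp = self.gloss_head(memory).log_softmax(-1)            # CTC branch (glosses)
        tgt = self.word_embed(word_tokens)
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.transformer.decoder(tgt, memory, tgt_mask=mask)
        return gloss_logp, self.word_head(dec)                          # CE branch (words)

model = JointSLTTransformer()
gloss_logp, word_logits = model(torch.randn(2, 32, 1024), torch.randint(0, 2887, (2, 10)))
loss = (nn.CTCLoss(blank=1088)(gloss_logp.transpose(0, 1),
                               torch.randint(0, 1088, (2, 5)),
                               torch.full((2,), 32, dtype=torch.long),
                               torch.full((2,), 5, dtype=torch.long))
        + nn.CrossEntropyLoss()(word_logits.reshape(-1, 2887),
                                torch.randint(0, 2887, (2, 10)).reshape(-1)))
```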

Table 4. Reported results on sign language translation on Phoenix-2014-T. The best performance appears in bold.

Method | Validation Set (BLEU-1 / BLEU-2 / BLEU-3 / BLEU-4) | Test Set (BLEU-1 / BLEU-2 / BLEU-3 / BLEU-4)
Sign2Gloss2Text [ ] | 42.88 / 30.30 / 23.02 / 18.40 | 43.29 / 30.39 / 22.82 / 18.13
MCT [ ] | – / – / – / 19.51 | – / – / – / 18.51
S2(G+T)-Transformer [ ] | 47.26 / 34.40 / 27.05 / 22.38 | 46.61 / 33.73 / 26.19 / 21.32
STMC-T [ ] | — | —

5. Sign Language Representation

The automatic and realistic sign language representation is vital for each sign language system. The representation of a sentence in sign language instead of a plain text can make the system friendlier and more accessible to the members of the deaf community. Signs are commonly represented using avatars or synthesized videos of a real human. The challenges of this task include the difficulty in creating realistic representations due to complex hand shapes and rapid arm movements.

5.1. Realistic Avatars

A common approach to sign language representation is the use of 3D avatars that can reproduce facial expressions and body/hand movements with a high degree of accuracy and realism, representing signs in a way that is understandable by deaf or hearing-impaired people. Balayn et al. in [ 99 ], developed a virtual communication agent for sign language to recognize Japanese sign language sentences from video recordings and synthesize sign language animations. Their system adopted a deep LSTM encoder-decoder network to translate sign language videos to spoken text, while a separate encoder-decoder network used as input the sign language glosses and extracted specific encodings, which were then used to synthesize the avatar motion. However, the network employed for the generation task does not have enough parameters to learn complete sentence expressions, lacking an attention module that could assist in learning longer-term dependencies. Shaikh et al. in [ 100 ], employed a system to generate sign animations from audio announcements in railway stations. At first, language rules and grammar were applied to the input text to transform it into a specific format. Then, inverse kinematics were applied to calculate the avatar target positions for each word and render the final video representation. Melchor et al. in [ 101 ], used a speech recognition system to translate Mexican speech into sign language. Then, the signs were represented through an avatar that was digitally animated on a mobile device. Uchida et al. in [ 102 ], developed an application to automatically produce sign language animations for sports games that was able to operate on live game broadcasts. A disadvantage of the application is that the delay time between the video occurrence and the video display is large.

Das et al. in [ 103 ], developed a 3D avatar to convert Indian text or speech into sign language. The input was translated to English and then to the corresponding Indian sign language using Natural Language Processing (NLP) rules and techniques. The final avatar movements were generated using a predefined sign vocabulary and Blender. A limitation of the system is that it was developed for a limited corpus and that the avatar had no facial expressions. Mehta et al. in [ 104 ], introduced a system in order to translate online videos into Indian Sign Language (ISL) and produce sign animations with a 3D cartoon-like avatar. The audio from the videos was captioned using NLP algorithms and mapped to signs that were finally rendered with the avatar. Nevertheless, due to the limited resources available for ISL, the performance of the system may degrade when dealing with complex grammatical structures and interactions. Patel et al. in [ 105 ], developed an application for animation generation. The input speech was recognised and translated with Google Cloud Speech Recognizer. Then, the translated text was converted to Hamburg notation system (HamNoSys) [ 106 ] and sign gesture markup language (SigML) [ 107 ] notations to effectively generate animations. Kumar et al. in [ 108 , 109 ] developed a mobile application to translate English text into ISL. HamNoSys was used for sign representation, SigML for its conversion to an XML file, and an avatar was employed to generate signs. A weakness of the developed system is that it struggles to represent complex animation and facial expressions of ISL signs. Moreover, the proposed system does not index the signs based on its context and this can cause confusion on directional signs that require different handling based on the context. Brock et al. in [ 110 ], adopted deep recurrent neural networks to generate 3D skeleton data from sign language videos. Subsequently, inverse kinematics were applied to calculate joints angles and positions that were mapped to a sign language avatar for animation synthesis.

5.2. Sign Language Production

Sign language production (SLP) has gained a lot of attention lately due to the huge advances in deep learning that allows the production of realistic signed videos. Sign language production techniques aim to replace the rigid body and facial features of an avatar with the natural features of a real human. To this end, these techniques usually receive as input sign language glosses and a reference image of a human and synthesize a signed video with the human performing signs in a more realistic way than the one that could have been achieved by an avatar.

Stoll et al. in [ 111 ], proposed an SLP method using a machine translation encoder-decoder network to translate spoken language into gloss sequences. Then, each gloss was assigned to a unique 2D skeleton pose; these poses were extracted from sign videos, normalized and aligned. Finally, a pose-guided generative adversarial network handled the skeleton pose sequence and a reference image to generate the gloss video. However, this method fails to generate precise videos when the hand keypoints are not detected by the pose estimation method or the timing of the glosses is not predicted correctly. In their latest work, Stoll et al. in [ 112 ], used an improved architecture with additional components. The neural machine translation (NMT) network directly transforms spoken text to pose sequences, while a motion graph was adopted to generate smooth 2D skeletal poses. An improved generative adversarial network (GAN) was used in order to produce videos with higher resolution. The motion graph and the GAN modules significantly improved the quality of the generated videos. Stoll et al. in [ 113 ], adopted an auto-regressive gloss-to-pose network that can generate skeleton poses and velocities for each sign language gloss. In addition, a pose-to-video network generated the output video using a 2D-CNN along with a GAN. This approach resulted in smooth transitions between glosses and refined details on hand and finger shapes. Saunders et al. in [ 114 ], employed Transformers to automatically generate 3D human poses from spoken text using a multiple-level configuration. A text-to-gloss-to-pose (T2G2P) network with Transformer layers translated text sentences to sign language glosses and finally to 3D poses, while a text-to-pose (T2P) network directly transformed text into human poses. Furthermore, a progressive Transformer decoder was used to generate continuous and smooth human poses one frame at a time. Moreover, the method achieved superior performance compared to NMT-based and GAN-based methods. Xiao et al. in [ 115 ] developed a bidirectional system for SLR and SLP. A deep RNN was used to jointly recognize sign language from input skeleton poses and to generate skeleton sequences that were used to move an avatar or generate a signed video. The generated sequences were also used for SLR and improved the robustness of the system.

Cui et al. in [ 116 ], used a pose predictor network, which contains an LSTM and an autoencoder to generate the future human poses given a reference pose and the gloss label. Moreover, an image synthesis module accepted as input the current frame and the next pose to predict the next frame of the video using a U-Net based architecture with a CNN and an LSTM. Furthermore, it extracted regions of interest to improve details, such as the hands, which were crucial for generating high-quality sign language videos. This approach was able to synthesize realistic signs with naturally evolving hand shapes.

6. Applications

The advances in sign language capturing, recognition and representation have led to the development of several related applications. Each application is compatible either with desktop computers or with Android and iOS smartphones, as illustrated in Table 5. The majority of the methods use one or two CNN models integrated into their applications. The use of lightweight CNN models ensures the real-time performance of the applications.

Table 5. Characteristics of sign language applications.

Method | Operating System | Sign Language | Scenario
Liang et al. [ ] | Windows desktop | British | Dementia screening
Zhou et al. [ ] | iOS | Hong Kong | Translation
Ozarkar et al. [ ] | Android | Indian | Translation
Joy et al. [ ] | Android | Indian | Learning
Paudyal et al. [ ] | Android | American | Learning
Luccio et al. [ ] | Android | Multiple | Learning
Chaikaew et al. [ ] | Android, iOS | Thai | Learning
Ku et al. [ ] | – | American | Translation
Potamianos et al. [ ] | – | Greek | Learning
Lee et al. [ ] | – | Korean | Translation
Schioppo et al. [ ] | – | American | Learning
Bansal et al. [ ] | – | American | Learning
Quandt et al. [ ] | – | American | Learning

Liang et al. in [ 117 ], introduced an automatic toolkit to recognize early stages of dementia among British Sign Language (BSL) users. Hand trajectory data, facial data and elbow distribution data were employed for feature extraction. The data were extracted using the OpenPose and dlib libraries. The final decision, whether the user was healthy or not, was taken by a CNN model. Zhou et al. in [ 118 ], created a Hong Kong sign language recognition platform, consisting of a mobile application and a Jetson Nano [ 130 ]. The mobile application was the front-end of the platform that preprocessed the sign language video. After the preprocessing, the video was transferred to the Jetson Nano, which translated the video into spoken language using a pre-trained deep learning model. Moreover, the authors created a Hong Kong sign language dataset for the purposes of the study. However, the method provides only word-level translation and predicts a relatively small vocabulary size. Furthermore, Ku et al. in [ 124 ], employed the 2D camera of the smartphone to record the signer. Hand skeleton information was extracted by OpenPose and a CNN model identified the meaning of the sign. The user could also choose to translate a pre-recorded video. However, only three gestures are recognized, and only finger positions, rather than the entire hand, are employed for feature extraction. Moreover, the application does not run in real-time. On the other hand, Ozarkar et al. in [ 119 ], implemented a smartphone application consisting of three modules. The sound classification module detected and classified input sounds and alerted the user through vibrations. The gesture recognition module recognized the input Indian sign language video and converted it to natural language. In addition, the Multilingual Translation Module could either convert text to speech in different Indian regional languages or convert speech to text. Some limitations of the method are the performance degradation when more than one person appears in front of the camera, as well as the sensitivity of the sound classification module in noisy environments. Finally, Lee et al. in [ 126 ], described multiple technologies that could be integrated into a smartphone and ease the communication between speaking and hearing-impaired people. These technologies were: Text-To-Speech (TTS), Speech-To-Text (STT), Augmentative and Alternative Communication (AAC) and motion recognition.
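As a rough illustration of how such applications typically wire a lightweight recognizer into a real-time loop, the sketch below reads webcam frames with OpenCV and calls a placeholder classifier; classify_sign stands in for whatever model (and device, e.g., a phone or an edge board) a given application actually deploys:

```python
import collections
import cv2  # OpenCV for camera capture and display

def classify_sign(frame_buffer):
    """Placeholder: a deployed app would run a small CNN over the buffered frames,
    possibly on-device or on an edge board such as a Jetson Nano."""
    return "HELLO", 0.99

def run_realtime_recognition(window=16, camera_id=0):
    cap = cv2.VideoCapture(camera_id)
    buffer = collections.deque(maxlen=window)        # sliding window of recent frames
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            buffer.append(cv2.resize(frame, (224, 224)))
            if len(buffer) == window:
                label, conf = classify_sign(list(buffer))
                cv2.putText(frame, f"{label} ({conf:.2f})", (10, 30),
                            cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
            cv2.imshow("sign recognition", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):     # press q to quit
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()

if __name__ == "__main__":
    run_realtime_recognition()
```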

Numerous education-oriented applications employing SLR have also been developed. These applications aim to help users learn or practice SL. Potamianos et al. in [125], presented a summary of the SL-ReDu project. The goal of the project was to teach Greek sign language as a second language through recognition. The educational process was supported by self-monitoring and objective assessment of the learners. Furthermore, a deep learning-based approach for isolated sign recognition of GSL was introduced. On the other hand, Joy et al. in [120], proposed a mobile application that could be used as a visual dictionary for children. It consisted of two modules: an object detection module and a word recognition module. The former enabled the user to select an object, for which the application displayed the corresponding sign. The latter took as input a picture of a text and demonstrated the corresponding sign. However, the word recognition module is limited to translating a maximum of 950 characters from a text. In addition, there are delays in loading sign animation videos due to the limited number of videos that can be stored on the mobile device. Moreover, Paudyal et al. in [121], designed a smartphone application that provides feedback to a sign language learner based on the location, movement, orientation and hand-shape of their signs. A dataset was also created from 100 learners, for 25 American Sign Language (ASL) signs. However, the system does not perform continuous SLR. Schioppo et al. in [127], created a virtual environment for learning sign language, employing a virtual reality headset with a Leap Motion sensor attached to it. The system was evaluated on the 26 letters of the ASL alphabet. Luccio et al. in [122], employed an Elf Sandbot robot [131] to help people with hearing impairments learn sign language. Two smartphone and tablet applications were also developed, the first controlling the movement of the robot and the second taking a verbal or textual input of a word or sentence, translating it to sign language and demonstrating the corresponding video. Furthermore, Chaikaew et al. in [123], introduced an application that supports the communication of hearing-impaired people who want to learn Thai sign language. The learners were able to choose the preferred vocabulary and practice with animations. Bansal et al. in [128], designed a game aiming to help Deaf children who lack continuous access to sign language, using only a high-resolution camera and pose estimation software. The learner was asked to describe a scene and, if the description was correct, advanced to the next scene. Moreover, a dataset with RGB and depth features was created from adults with little experience with ASL. Nevertheless, the dataset consists of too few data to effectively train a deep learning model. Finally, Quandt et al. in [129], designed an avatar that served as the teacher in a virtual environment in order to teach introductory ASL to novice signers. The users could also see a digital representation of their hands thanks to the use of Leap Motion. However, the system could not capture signs that involved touching a specific part of the body or signs that involved body part occlusion.

7. Conclusions and Future Directions

In this paper, the broad spectrum of AI technologies in the field of sign language is covered. Starting from sign language capturing methods for the collection of sign language data and moving on to sign language recognition and representation techniques for the identification and translation of sign language, this review highlights all the important technologies for the construction of a complete AI-based sign language system. Additionally, it explores the relations among these AI technologies and presents their advantages and challenges. Finally, it presents groundbreaking sign language applications that facilitate the communication between hearing-impaired and speaking people, and enable the social inclusion of hearing-impaired people in their everyday life. The aim of this review is to familiarize researchers with sign language technologies and assist them in developing better approaches.

In the field of sign language capturing, selecting an optimal sensor for a given task is essential and depends highly on various constraints (e.g., cost, speed, accuracy). For instance, wearable sensors (i.e., gloves) are expensive, capture only hand joints and arm movements, and, in recognition applications, require the user to wear gloves. On the other hand, camera sensors, such as web or smartphone cameras, are inexpensive and capture the most substantial information, like the face and the body posture, which are crucial for sign language.

Concerning CSLR approaches, most of the existing works adopt 2D CNNs with temporal convolutional networks or recurrent neural networks that use video as input. In general, 2D methods have lower training complexity compared to 3D architectures and produce better CSLR performance. Moreover, it has been experimentally shown that multi-modal architectures that utilize optical flow or human pose information achieve slightly higher recognition rates than unimodal methods. In addition, CSLR performance on datasets with large vocabularies of more than 1000 words, such as Phoenix-2014, or on datasets with unseen words in the test sets, such as CSL Split 2 and GSL SD, is far from perfect. Furthermore, ISLR methods have been extensively explored and have achieved high recognition rates on large-scale datasets. However, they are not suitable for real-life applications, since they are trained to detect and classify isolated signs on pre-segmented videos.

Sign language translation methods have shown promising results, although they have not been exhaustively explored. The majority of SLT methods adopt architectures from the fields of neural machine translation and video captioning. These approaches are of great importance, since they translate sign language into its spoken counterparts and can be used to facilitate the communication between the Deaf community and other groups. For this reason, this research field requires additional attention from the research community.

Sign language representation approaches adopt either 3D avatars or video generation architectures. 3D animations require manual design of the movement and the position of each joint of the avatar, which is very time-consuming. In addition, it is extremely difficult to generate smooth and realistic animations of the fine-grained movements that compose a sign without sophisticated motion capturing systems that employ multiple cameras and specialized wearable sensors. On the other hand, recent deep learning methods for sign language production have shown promising results in synthesizing sign language videos automatically. Moreover, they can generate realistic videos using a reference image or video of a human, which the Deaf community also prefers over avatars.

Regarding sign language applications, they are mostly developed to be integrated into a smartphone operating system and perform SL translation or recognition. A distinct category is the education-oriented applications, which are very useful for anyone with little or no knowledge of sign language. In order to create better and more easily accessible applications, research should focus on the development of more robust and less computationally expensive AI models, along with further improvement of the existing software for the integration of AI models into smart devices.

Figure 3 is designed to provide objective and subjective comparisons of AI technologies and DNN architectures for sign language, as seen from the perspective and experience of the authors in the field. More specifically, Figure 3a presents and compares the characteristics of the different AI technologies for sign language. "Volume of works" measures the number of published papers for each sign language technology and is calculated from the results of the query search in the databases. "Challenges" subjectively measures the difficulty of accurately dealing with each sign language technology and is based on the performance of the methods in the specific area. Finally, "future potential" expresses the view of the authors on which sign language technology has the most potential to deliver future research works.


Figure 3. Radar charts showcasing the findings of this survey regarding (a) the literature methods for CSLR, ISLR, and SLP and (b) the characteristics of each AI sign language technology.

From the chart in Figure 3a, it can be seen that most existing works deal with sign language recognition, while sign language capturing and translation methods are still not thoroughly explored. It is strongly believed that these research areas should be explored more in future works. Furthermore, there is still great room for improvement for applications, especially mobile ones, that can assist the Deaf community. Regarding future directions, improvements can still be achieved in the accuracy of sign language recognition and production systems. In addition, advances should be made in the extraction of robust skeletal features, especially in the presence of occlusions, as well as in the realism of avatars. Finally, it is crucial to develop fast and robust sign language applications that can be integrated into the everyday life of hearing-impaired people and facilitate their communication with other people and services.

On the other hand, Figure 3b draws a comparison between various DNN architectures in terms of the performance of the proposed networks (accuracy), the hardware requirements for inference and training (hardware requirements), the scope for improvement based on the performance gains and the volume of works (future potential), the computational complexity during training (training complexity), and the number of recorded datasets that are currently available (existing datasets). Except for the existing datasets, whose values are based on a search for publicly available datasets, all other metrics presented in the chart of Figure 3b are derived from the study of the reviewed papers and the opinions and experience of the authors. As can be observed, ISLR methods achieve high accuracy with small hardware requirements, but such methods have been extensively explored, resulting in limited future potential. On the other hand, CSLR and SLP methods have high hardware and training requirements, and demonstrate significant future potential, as there is still great room for improvement in future research works.

Author Contributions

Conceptualization, I.P., C.C., D.K., K.D. and P.D.; Formal analysis, I.P., C.C., D.K., K.D. and P.D.; Funding acquisition, P.D.; Project administration, P.D.; Supervision, K.D.; Writing—original draft, I.P., C.C., D.K.; Writing—review and editing, K.D. and P.D. All authors have read and agreed to the published version of the manuscript.

This research was funded by the Greek General Secretariat of Research and Technology under contract T1EΔK-02469 EPIKOINONO.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Instant Sign Language Recognition by WAR Strategy Algorithm Based Tuned Machine Learning

  • Research Article
  • Open access
  • Published: 08 September 2024


  • Shahad Thamear Abd Al-Latief   ORCID: orcid.org/0009-0003-9141-7951 1 ,
  • Salman Yussof   ORCID: orcid.org/0000-0002-2040-4454 2 ,
  • Azhana Ahmad   ORCID: orcid.org/0000-0003-1149-4053 3 ,
  • Saif Mohanad Khadim   ORCID: orcid.org/0009-0009-5090-8942 1 &
  • Raed Abdulkareem Abdulhasan   ORCID: orcid.org/0000-0001-6478-8990 4  

Sign language serves as the primary means of communication for individuals with hearing and speech disabilities. However, the comprehension of sign language by those without such disabilities poses a significant challenge, resulting in a notable communication gap across society. Despite the use of numerous effective machine learning techniques, a compromise between accuracy and computing time remains in sign language recognition. This paper presents a novel, exceptionally accurate and fast sign language recognition system built upon the recently devised metaheuristic WAR Strategy optimization algorithm. Following preprocessing, spatial and temporal features are extracted using the Linear Discriminant Analysis (LDA) and Gray-Level Co-occurrence Matrix (GLCM) methods. Afterward, the WAR Strategy optimization algorithm is adopted in two procedures: first, to optimize the extracted set of features, and second, to fine-tune the hyperparameters of six standard machine learning models in order to achieve precise and efficient sign language recognition. The proposed system was assessed on sign language datasets of different languages (American, Arabic, and Malaysian) containing numerous variations. It attained recognition accuracies ranging from 93.11% to 100% with multiple optimized machine learning classifiers and training times of 0.038–10.48 s. As demonstrated by the experimental outcomes, the proposed system is exceptionally efficient in terms of time, complexity, generalization, and accuracy.


1 Introduction

Communication has a pivotal role in establishing and maintaining interpersonal relationships and exerts a substantial influence on individuals' lives by facilitating knowledge acquisition and sharing, promoting social interaction, cultivating relationship growth, and providing a means for individuals to express their feelings and needs [1]. In contrast to the majority of individuals, who use verbal communication, individuals with hearing and speaking limitations employ a distinct form of nonverbal communication referred to as Sign Language (SL). This form of language plays a crucial role in facilitating communication for people who encounter challenges in verbal or auditory expression. Unlike spoken language, the comprehension of sign language does not depend on sound perception or vocalization. Instead, those who are deaf or mute employ a coordinated array of hand shapes, orientations, and movements that rely on many body parts, such as the fingers, hands, arms, head, torso, and facial expressions. This multifaceted approach enables them to effectively communicate messages, thereby establishing sign language as a visual form of communication [2]. Linguistic studies of sign language originated in the 1970s [3] and have demonstrated that sign language shares similarities with spoken languages but is distinct from them in terms of its vocabulary and grammar. Signs are created using a finite collection of gestural features, similar to how a small number of sounds can produce millions of words in spoken languages. Sign language undergoes natural evolution and growth over time and across different geographical locations, and numerous countries possess their own distinct national sign language, exhibiting regional and domestic variations [4]. According to the report of the World Federation of the Deaf, there are more than 70 million individuals in the world with a speaking or hearing disability, who use over 300 types of sign language [5]. Nevertheless, this mode of communication is not widely embraced by those who do not have hearing or speaking difficulties, and only a minority of them can comprehend it or acquire proficiency in it. This observation highlights an authentic disparity in communication between individuals who experience hearing and speaking impairments and the broader society [6]. Hence, there is a crucial need for an automated system capable of precisely identifying and translating sign language in order to overcome these obstacles, create a smooth platform for communication between individuals who are deaf or mute and those who are able to hear and speak, and guarantee equitable access to information for those who are deaf or mute, akin to their counterparts [7]. Sign language recognition encompasses several crucial sub-fields, including detection, tracking, pose estimation, gesture recognition, and pose recovery. Many machine learning models have been widely employed in human–computer interaction applications, where these sub-fields are substantially applied. However, the task of recognizing sign language using machine learning is complicated and presents various problems that might have a substantial impact on the outcomes, since the signs in this language lack a direct correspondence to particular words. Hence, the identification of sign language is a multifaceted procedure that goes beyond the mere replacement of individual signs with their respective spoken language equivalents.
This can be linked to the presence of unique vocabulary and grammatical structures in sign languages, which are not tied to any specific spoken language [8]. Moreover, other challenges encompass several aspects, namely the scarcity and imbalance of datasets, the variability in sign language due to environmental conditions, the presence of dynamic signs that might lead to occlusion, feature extraction, the time required, the complexity involved, and the processing cost [9]. Feature extraction is considered an initial stage prior to employing machine learning techniques for classification in most image recognition systems. In contrast, deep learning, a distinct type of machine learning, involves the direct extraction of features and representations from data [10]. This is achieved by encoding entire images as points inside a high-dimensional space and utilizing them as a cohesive unit for training [11]. Nevertheless, the successful training of these models frequently necessitates a substantial amount of annotated training data and significant computational resources [12]. An extra procedure, so-called feature optimization or feature selection, has been adopted in several recognition systems; it is used to optimize or select the most effective collection of extracted features in order to reduce the dimensionality and thus minimize the processing time [13]. Therefore, the primary aim of this procedure is to improve the quality of the features employed in machine learning models, thereby enhancing the accuracy, generalization, and overall performance of the models [14]. Moreover, another performance enhancement of recognition systems has employed the concept of hyperparameter optimization of machine learning models, whose main intent is to identify the most favorable parameter values for the model. Metaheuristic optimization algorithms are the most widely recognized and commonly employed for the purposes of feature optimization and machine learning hyperparameter tuning [15], due to their ability to find the best set of solutions and thereby solve complicated problems left unsolved by traditional methods in many domains. Moreover, many adversarial attacks can target any classification system based on machine learning, such as inference attacks [16], image perturbation attacks, pose manipulation, temporal attacks such as motion blur, data poisoning, backdoor attacks that lead to misclassification, and privacy and inference attacks [17, 18, 19, 20, 21, 22]. One of the strongest protections against these attacks is the efficient extraction of features that represent the data in an effective manner.

This work introduces a novel technique for recognizing static signs in sign language, namely those representing alphabets and numerals. The extracted features are optimized using the recently created metaheuristic WAR Strategy optimization algorithm. Furthermore, the WAR Strategy algorithm is utilized to optimize the hyperparameters of six machine learning algorithms. The proposed system successfully addresses the challenges of generalization and computing cost, and accurately recognizes sign language in different poses, orientations, and illuminations with minimal time requirements.

The remainder of this paper is organized as follows. Sect. 2 reviews the related literature. Sect. 3 describes the metaheuristic WAR Strategy optimization algorithm. Sect. 4 presents the proposed static sign language recognition methodology. The outcomes obtained by implementing the introduced system on three distinct sign language datasets are detailed in Sect. 5, and Sect. 6 provides a succinct summary of the conclusions and future research.

2 Literature Review

Sign language recognition is a highly competitive subfield of gesture recognition within the academic community. Gesture detection may involve a variety of techniques, such as the application of sensory hardware, which incurs substantial costs and is less practical in real-life circumstances. As a result, scholars have implemented computer vision algorithms to attain the highest level of accuracy in sign language recognition. Consequently, machine learning techniques have gained significant attention, leading to the development of numerous machine learning models, particularly deep ones, for the purpose of recognizing sign language in its various forms (static and dynamic), across different languages and countries. However, developing a high-performing deep learning model for sign language recognition remains difficult, owing to the complexity of sign language, its reliance on spatial and temporal features, and the computational time and cost associated with deep learning models. Due to these constraints, researchers have been compelled to utilize metaheuristic algorithms in conjunction with machine learning models in order to accurately identify sign language.

In [23], a gesture recognition system was presented that employs the Cuckoo Search algorithm, together with the SURF transform and a Hidden Markov Model (HMM), to minimize complex trajectories. The main goal was to decrease the feature dimensions under different illuminations, orientations, and gesture shapes. The system achieved a recognition accuracy of 80.16%, which is not acceptable for real-time applications, and it was not evaluated on sign language recognition.

In [24], a static Indian sign language recognition system was introduced that relies on an Improved Genetic Algorithm to perform feature selection after extraction based on a contour model and the Canny detector. A feed-forward neural network was adopted for classification, yielding 74% recognition accuracy.

In [25], the metaheuristic Crow Search Algorithm (CSA) was adopted to identify and select the best parameters of convolutional neural networks for gesture recognition in Human-Computer Interaction (HCI). The system showed a high recognition accuracy of 100% on a public Kaggle dataset. Nevertheless, the system suffers from hardware and computational cost, since it was executed on Google Colab's GPU-based cloud framework. Additionally, the system was not evaluated on several sign languages of different sorts, resulting in limited generality.

In [26], gestures were recognized using a proposed LightGBM (Light Gradient Boosting Machine) with a parameter enhancement method based on a developed Memetic Firefly algorithm. The main aim of this work was to reduce the computational cost while providing high recognition performance, achieving 99.36%, 99%, 99%, and 99% in terms of accuracy, precision, recall, and F1-measure, respectively. The introduced model has a high number of parameters, which may cause it to fit the training data too closely and lead to overfitting. Moreover, the system has poor generalization, it was not evaluated on different sign language datasets with high variation, and the training time was not measured.

The problem of poor generalization was the main interest in [27] and [28]. Both static and dynamic Indian sign language were recognized in [27] by selecting the optimal feature set and updating the related weights in a Multilayer Perceptron (MLP) model using a hybrid metaheuristic algorithm called Deer Hunting-based Grey Wolf Optimization (DH-GWO). The statistical results for static Indian sign language were 97%, 94.6%, and 81% in terms of accuracy, precision, and F1 score, while the results for the dynamic dataset were 89%, 47%, and 30%. Although the findings were impressive for static sign language, this approach requires a significant amount of time and has a high level of computational complexity. Furthermore, the dataset used is not accessible to the public. In [28], two types of static sign language were recognized: Mexican, with 99.37% accuracy, and American, with 99.98%. The parameters of Deep Convolutional Neural Networks (DCNN) were tuned using Particle Swarm Optimization (PSO) in that work.

Despite the high recognition results, significant computational resources may be necessary to handle huge datasets or complex designs, because PSO requires the computationally intensive task of training several candidate networks to assess the fitness of each particle.

Selecting the optimal features for sign language recognition was the goal in [29], in which the features are extracted and a metaheuristic PSO algorithm is applied before classification. The selected features are then classified using a multi-class Support Vector Machine to avoid the high computational complexity of deep learning. The method was evaluated on seven public datasets covering three different sign languages with uniform and complicated backgrounds, including an Indian dataset, Jochen Triesch (JTD), MNIST, NUS Dataset II, Static Hand Posture with different backgrounds, an Arabic dataset, and the IEEE ASL dataset, with accuracies of 99.18%, 93.07%, 90.9%, 96.7%, 91.1%, 93%, and 80.1%, respectively. Notwithstanding its commendable outcomes and its endeavors to overcome challenges related to generalization and computational expense, this system exhibits difficulties with dynamic signs and a notable propensity for errors on certain letters, including J and Z. Reducing the feature set will, nevertheless, impact precision and invariably involves a trade-off. Furthermore, neither the required time nor dynamic sign languages were accounted for in the evaluation.

Recognizing the complex Arabic sign language is the main interest of [30]. A Deep Convolutional Autoencoder optimized by the Atom Search Optimization algorithm was utilized to classify the features extracted using a capsule network (CapsNet). The statistical results in terms of accuracy, precision, recall, and F-measure were 99.17%, 95.52%, 95.45%, and 95.31%, respectively. The dataset used in this work is not large enough to prove that the system is able to recognize all Arabic words, and the required time is not mentioned.

In [31], gestures were recognized using a CNN whose hyperparameters were tuned by the newly developed Harris Hawks Optimization (HHO) algorithm. The recognition accuracy was 100% and the required training time was 15 min. While the presented system demonstrates a commendable recognition rate, it does not account for generalization and does not specifically address the demands of real-time applications that prioritize speed and computational effectiveness.

To overcome the high dimensionality of sign language features, some related works employ deep learning to extract the features [32, 33, 34, 35]. The MobileNet model was employed for feature extraction in [32, 33]. Moreover, the Artificial Rabbits Optimizer was used in [32] to tune the hyperparameters of a Siamese Neural Network for recognizing sign language, achieving 99.14% accuracy.

In [33], parameter tuning was performed on both MobileNet, using Manta Ray Foraging Optimization (MRFO), and a hybrid deep learning model, using the Reptile Search Algorithm. The alphabet of American Sign Language was recognized with 99.51% accuracy.

In [34], the densely connected network DenseNet169 was used for feature extraction, and the parameters of Multilayer Perceptron (MLP) models were optimized using the Deer Hunting Optimization algorithm, achieving 92.88% recognition accuracy for Arabic sign language.

On the other hand, in [35], the Inception v3 model was used to generate the feature maps and a Deep Wavelet Autoencoder was fine-tuned using the metaheuristic Sand Cat Swarm Optimizer to recognize American Sign Language alphabets and numbers, showing a recognition accuracy of 99.01%. These four studies suffer from complexity, long runtimes, and poor generalization, since they were not tested on numerous datasets with substantial variations and cannot be applied in real-world settings.

In [36], the parameters of a developed CNN were tuned using three metaheuristic algorithms, namely the Whale Optimization Algorithm (WOA), Particle Swarm Optimization, and an Improved Competitive Gray Wolf Optimizer (ICGWO), achieving a recognition accuracy of 99.93% on the English alphabet in Indian Sign Language (ISL). This CNN exclusively processes an input image of the hand, as opposed to the entire body, which necessitates precise hand detection or segmentation. Furthermore, the architecture is computationally demanding, requiring iterations proportional to the number of candidate solutions. Consequently, it can be laborious and time-consuming, particularly when dealing with extensive datasets or intricate CNN architectures.

While researchers have utilized various metaheuristic algorithms to optimize features and tune parameters in machine learning models for sign language or gesture recognition, little emphasis has been placed on addressing issues related to time, complexity, computing power, and generalization.

Given the issues mentioned above, the recently developed metaheuristic optimization algorithm called the WAR Strategy is utilized in this paper to optimize features and fine-tune the hyperparameters of machine learning models for sign language recognition. The key aspects of the proposed system are as follows:

• Optimizing the features extracted from sign language images using the metaheuristic WAR Strategy algorithm, in order to overcome the problems related to hand gesture variations.

• Fine-tuning the hyperparameters of six machine learning models using the metaheuristic WAR Strategy algorithm to accurately identify sign language while reducing cost and time.

• Assessing the efficacy and the high generalization of the proposed system on three publicly accessible sign language datasets with significant variations, namely American, Arabic, and Malaysian.

3 WAR Strategy Optimization Algorithm

The collection of optimization algorithms known as metaheuristics emerged as a result of the rise of difficult and complicated problems in numerous fields and applications that cannot be handled within an acceptable timescale and at a reasonable computing cost. In situations where more conventional optimization techniques have failed, these algorithms, which draw their inspiration from social or natural phenomena, can frequently provide effective and adaptable solutions. Metaheuristic optimization algorithms seek optimal or nearly optimal solutions within a given problem space in order to get the most out of a system. They have found their way into numerous fields, such as scheduling, routing, hyperparameter tuning in machine learning, feature optimization and selection, and many more [37]. The WAR Strategy Optimization Algorithm is a newly developed metaheuristic for solving optimization problems. Its primary source of inspiration is the strategic planning of military forces going into battle, where each soldier autonomously converges towards the optimal value. The algorithm incorporates two well-known approaches to combat, namely offensive (attack) and defensive tactics, and the positions of the soldiers on the battlefield are modified in line with the strategy that is put into action. A new method for updating weights and a plan for relocating weak soldiers make the algorithm more resilient and improve its convergence. The WAR Strategy algorithm shows quick convergence across different search spaces and balances the exploration and exploitation phases well [38]. The main steps of the WAR Strategy optimization algorithm are illustrated in Algorithm 1 [38]; a simplified code sketch of its overall structure is given after the algorithm.

Algorithm 1: The Metaheuristic WAR Strategy Optimizer.

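
The following is a minimal, hypothetical Python sketch of the overall structure described above: soldiers move under the guidance of the best solution (the King) and the second-best (the Commander), rank-based weights are updated when a move improves fitness, and the weakest soldier is relocated randomly. The exact position-update equations are those given in [38]; the ones below are simplified stand-ins for illustration only.

```python
# Simplified sketch of a WAR-Strategy-style population optimizer (not the exact
# equations of [38]): King/Commander guidance, rank-based weights, weak-soldier relocation.
import numpy as np

def war_strategy_sketch(fitness, dim, lb, ub, n_soldiers=30, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(n_soldiers, dim))        # soldier positions
    F = np.array([fitness(x) for x in X])                   # fitness values (to minimize)
    W = np.ones(n_soldiers)                                  # rank-based weights
    rank = np.zeros(n_soldiers)
    for _ in range(n_iter):
        order = np.argsort(F)
        king, commander = X[order[0]].copy(), X[order[1]].copy()
        for i in range(n_soldiers):
            rho = rng.random()
            if rng.random() < 0.5:                           # "attack"-style move (stand-in)
                cand = X[i] + 2 * rho * (commander - king) + rng.random() * (W[i] * king - X[i])
            else:                                            # "defense"-style move (stand-in)
                cand = X[i] + 2 * rho * (king - commander) + rng.random() * W[i] * (commander - X[i])
            cand = np.clip(cand, lb, ub)
            f_cand = fitness(cand)
            if f_cand < F[i]:                                # keep improving moves, promote rank
                X[i], F[i] = cand, f_cand
                rank[i] += 1
                W[i] *= (1 - rank[i] / n_iter) ** 2          # simplified weight update
        weakest = np.argmax(F)                               # relocate the weakest soldier
        X[weakest] = rng.uniform(lb, ub, size=dim)
        F[weakest] = fitness(X[weakest])
    best = np.argmin(F)
    return X[best], F[best]

# Usage example: minimize the Easom benchmark used later in Sect. 4.4 (global minimum -1 at (pi, pi)).
easom = lambda d: -np.cos(d[0]) * np.cos(d[1]) * np.exp(-((d[0] - np.pi) ** 2 + (d[1] - np.pi) ** 2))
print(war_strategy_sketch(easom, dim=2, lb=-100, ub=100))
```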

4 Proposed Sign Language Recognition System Methodology

The sign language recognition system described in this work comprises a series of stages and procedures, each of which plays a crucial role in accomplishing the desired goals. As shown in Fig. 1, before beginning the system procedures, the data is divided into two separate groups, with 70% assigned for training the system and the remaining 30% for testing and evaluation. The first stage of the proposed system consists of a series of preprocessing procedures. Afterwards, the identification and segmentation of the region of interest is performed in order to prepare the data for the subsequent stage. The next stage involves feature extraction, which relies on two techniques, Linear Discriminant Analysis (LDA) and the Gray-Level Co-occurrence Matrix (GLCM), to provide a comprehensive collection of features that accurately represent the data. The feature optimization phase follows, with the goal of obtaining the best collection of features from the extracted ones using the WAR Strategy algorithm. Subsequently, the WAR Strategy algorithm is adopted to fine-tune the hyperparameters of six machine learning (ML) algorithms, resulting in the tuned classifiers used in the final classification phase.

Figure 1. Architectural model of the proposed static sign language recognition system.

4.1 Sign Image Preprocessing

The sign language dataset may exhibit several variations, which can be attributed to factors such as the capturing instrument, environmental conditions, orientation, occlusion, position, and so on. The primary objective of preprocessing prior to classification using ML is to enhance the quality of the data in order to produce efficient and accurate classification results. Preprocessing techniques such as noise elimination, contrast improvement, and normalization are used to address these kinds of variation.

The proposed system includes the following series of preprocessing steps (a minimal code sketch is given after the list).

• Convert the colored RGB sign language images into grayscale, which strongly affects the subsequent phases, mainly feature extraction. Grayscale images contain only one color channel, ensuring that the retrieved features remain unaffected by color. This is a crucial step in many applications, as it decreases computational complexity, lowers processing time, and simplifies the execution of subsequent steps [39].

• Adjust the contrast using the Histogram Equalization technique, since an incorrect distribution of light or insufficient illumination may lead to misclassification [40].

• Apply Gaussian blur filtering to attenuate the high-frequency components of the image in order to eliminate noise and enhance features, particularly those associated with edges and lines. Equation (1) gives the Gaussian filter, applied with a 3 × 3 kernel in this system, which provides a smoothed image that preserves the needed features [41]: \(G(x,y) = \frac{1}{2\pi \sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}\) (1)

• Convert the image to binary in order to simplify the process and make it easier to identify the most crucial areas that contain vital details. Binarization enhances the capability to differentiate the foreground from the background of an image [42]. A thresholding-based binarization method is implemented in this system with a threshold value of 115.
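
The following is a minimal sketch of the preprocessing pipeline described above (grayscale conversion, histogram equalization, a 3 × 3 Gaussian blur, and fixed-threshold binarization at 115), written with OpenCV. The file name and any parameters not stated in the text are assumptions.

```python
# Minimal sketch of the described preprocessing steps using OpenCV.
import cv2

def preprocess(path, threshold=115):
    img = cv2.imread(path)                               # BGR sign image (path is illustrative)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)         # single-channel grayscale
    eq = cv2.equalizeHist(gray)                          # histogram equalization for contrast
    blur = cv2.GaussianBlur(eq, (3, 3), 0)               # 3x3 Gaussian smoothing
    _, binary = cv2.threshold(blur, threshold, 255, cv2.THRESH_BINARY)
    return blur, binary                                   # smoothed grayscale + binary mask

smoothed, mask = preprocess("sign.jpg")
```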

4.2 Cropping and Resizing

It is essential to use segmentation techniques to isolate from the given image the precise area of the hand involved in the sign. Applying a precise identification procedure is important, since it greatly influences the outcome. This research utilizes the well-recognized contour-based segmentation approach, also known as edge-based segmentation, to locate and segment the Region of Interest (ROI) in the image [43]. The first part of this technique involves identifying the edges, followed by a connecting step that exploits the continuity of curved lines and edges to delineate the hand area and shape in the image [44]. The contour-based segmentation is applied to the binary image of linked edges to extract the hand portion from the smoothed grayscale image.

Afterwards, the segmented image of the hand area is resized to enhance its finer characteristics and reduce processing time. The proposed system utilizes the widely used bilinear interpolation scaling technique for resizing; this technique uses the four adjacent pixels to calculate the desired pixel and operates in both the vertical and horizontal directions [45]. In this work, the sign language images are resized to 50 × 50 pixels.
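
A minimal sketch of this cropping and resizing stage is shown below: the largest contour in the binary mask is assumed to be the hand, its bounding box is cropped from the smoothed grayscale image, and the crop is resized to 50 × 50 with bilinear interpolation. Selecting the largest contour is an assumption, since the text does not state how the hand contour is chosen.

```python
# Minimal sketch: contour-based hand segmentation followed by bilinear resizing to 50x50.
import cv2

def crop_and_resize(smoothed, mask, size=(50, 50)):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:                                       # fall back to resizing the whole image
        return cv2.resize(smoothed, size, interpolation=cv2.INTER_LINEAR)
    hand = max(contours, key=cv2.contourArea)              # assume the largest contour is the hand
    x, y, w, h = cv2.boundingRect(hand)
    roi = smoothed[y:y + h, x:x + w]                        # region of interest (hand area)
    return cv2.resize(roi, size, interpolation=cv2.INTER_LINEAR)

hand_roi = crop_and_resize(smoothed, mask)                  # 50x50 grayscale hand image
```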

4.3 Feature Extraction

Irrelevant characteristics may lead to incorrect categorization and erroneous recognition in sign language recognition; thus, extracting the relevant features is crucial for obtaining accurate results [46]. With the goal of improving the efficiency of data analysis and storage while decreasing processing time and reducing complexity, feature extraction addresses the problem of finding the most succinct and meaningful set of characteristics. This kind of data representation for classification and regression is therefore often regarded as the most popular and practical option.

Two feature extraction methods are employed in this paper, Linear Discriminant Analysis (LDA) and the Gray-Level Co-occurrence Matrix (GLCM), to acquire the spatial and temporal features of the sign language image dataset.

Linear Discriminant Analysis (LDA), commonly known as Fisher's Discriminant Analysis, is a widely used method for reducing dimensionality and extracting desired features. It is mostly employed in supervised classification settings [47]. The approach works by determining the most effective linear combinations of characteristics that accurately differentiate between different groups or categories. The number of features it yields is equal to the number of classes minus one, which is beneficial for dealing with multi-class problems [48]. On the other hand, the GLCM is a square matrix used to express the probability of pairs of pixel intensities occurring together at a certain distance and angle within an image [49]. It is used to extract the relevant texture characteristics of the image through diverse statistical measures of the pixels. Six features are acquired using the GLCM: Inverse Difference Moment, Energy, Entropy, Homogeneity, Mean, and Contrast [50].
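
The following is a minimal sketch of this feature extraction stage using scikit-image (≥ 0.19) and scikit-learn. The GLCM distance/angle settings and the exact formulas for the entropy, mean, and Inverse Difference Moment are not specified in the text, so common definitions are used here as assumptions (under this definition the IDM coincides with scikit-image's "homogeneity").

```python
# Minimal sketch: six GLCM texture features per 50x50 hand image plus LDA-projected features.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def glcm_features(gray50):
    glcm = graycomatrix(gray50, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)       # distance/angle are assumptions
    p = glcm[:, :, 0, 0]
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))        # GLCM entropy (common definition)
    i, j = np.indices(p.shape)
    idm = np.sum(p / (1.0 + (i - j) ** 2))                 # inverse difference moment
    mean = np.sum(i * p)                                   # GLCM mean
    return np.array([idm,
                     graycoprops(glcm, "energy")[0, 0],
                     entropy,
                     graycoprops(glcm, "homogeneity")[0, 0],
                     mean,
                     graycoprops(glcm, "contrast")[0, 0]])

def extract_features(images, labels):
    X_raw = np.array([img.flatten() for img in images])    # pixel vectors fed to LDA
    lda = LinearDiscriminantAnalysis()                     # yields (n_classes - 1) features
    X_lda = lda.fit_transform(X_raw, labels)
    X_glcm = np.array([glcm_features(img) for img in images])
    return np.hstack([X_lda, X_glcm])                      # combined feature set
```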

4.4 Feature Optimization by the Metaheuristic WAR Strategy Algorithm

It is vital to obtain the optimal or nearly optimal collection of features that accurately reflect the most important information, including all relevant aspects, of sign language in various poses and environments. This work utilizes the metaheuristic WAR Strategy algorithm to optimize the feature set obtained from the LDA and GLCM feature extraction methods. The key parameter values used in the WAR Strategy algorithm for feature optimization are shown in Table 1.

The Easom function is a unimodal test function whose global minimum lies in a very small region compared to the total search space. It has just two variables and is often used as a benchmark for evaluating the effectiveness of optimization strategies. The Easom function is defined as follows [51]:

\(f(d_1, d_2) = -\cos(d_1)\cos(d_2)\exp \left( -\left( (d_1 - \pi)^2 + (d_2 - \pi)^2 \right) \right)\)

The test area is usually restricted to the square \(-100 \le d_1 \le 100\), \(-100 \le d_2 \le 100\). Its global minimum, \(f(d) = -1\), is attained at \((d_1, d_2) = (\pi, \pi)\).
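
The text does not spell out the fitness function used when optimizing the extracted features with the settings of Table 1, so the following is a hypothetical wrapper-style sketch: each candidate solution is interpreted as a binary mask over the extracted features, and its fitness is the cross-validated error of a simple classifier on the selected subset. Any population-based optimizer, including the WAR Strategy sketch given earlier, could minimize this fitness.

```python
# Hypothetical wrapper-style feature-selection fitness (the paper's exact fitness is not given).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def feature_subset_fitness(position, X, y):
    mask = np.asarray(position) > 0.5                     # threshold a continuous position to a 0/1 mask
    if not mask.any():                                     # penalize empty feature subsets
        return 1.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()
    return 1.0 - acc                                       # lower is better (classification error)

# Usage example (assumed names): fitness = lambda p: feature_subset_fitness(p, X_train, y_train),
# optimized over the hypercube [0, 1]^n_features.
```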

4.5 Sign Language Classification Based on Tuned Machine Learning Models

Bio-inspired algorithms are renowned for their effective use in hybrid approaches for fine-tuning the parameters of ML models. Hyperparameter fine-tuning is the process of systematically optimizing the hyperparameters of an ML model to improve its overall performance and efficiency, by determining the parameter values that allow the model to reach the desired stability range and achieve its main goal [52]. The hyperparameters of six well-known ML models are fine-tuned using the WAR Strategy optimization algorithm. These machine learning models are the Support Vector Machine (SVM) [53], Random Forest [54], Logistic Regression [55], Decision Tree [56], K-Nearest Neighbors (KNN) [57], and Naïve Bayes [58]. The WAR Strategy optimization algorithm enhances the performance of these ML models by optimizing their most influential hyperparameters. The tuned versions of these models are then used to classify the extracted and optimized sign language features for the purpose of recognition. The specific parameters adjusted for each ML model are listed in Table 2, while Table 3 lists the parameters of the WAR Strategy optimizer used in this hyperparameter optimization phase.

Ackley's function is a multimodal test function defined as follows [59]:

\(f(x) = -20 \exp \left( -0.2 \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2} \right) - \exp \left( \frac{1}{n} \sum_{i=1}^{n} \cos (2\pi x_i) \right) + 20 + e\)

The test area is usually bounded to the hypercube \(-32.768 \le x_i \le 32.768\), \(i = 1, \ldots, n\), while the global minimum \(f(x) = 0\) is attained at \(x_i = 0\), \(i = 1, \ldots, n\).

The tuning of these hyperparameters is performed in this paper using the WAR Strategy algorithm. The initial hyperparameter values of the employed ML models and the values obtained after fine-tuning them with the WAR Strategy algorithm are shown in Table 4. The obtained hyperparameter values are used in the optimized versions of the ML models in the proposed system; a sketch of this tuning loop for one of the models is given below.
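
The following is a hypothetical sketch of the metaheuristic hyperparameter tuning for one of the six models (the SVM). The actual search ranges and WAR Strategy settings are those of Tables 2 and 3 and are not reproduced here; the log-scale mapping and ranges below are assumptions for illustration only.

```python
# Hypothetical sketch: cross-validated error of an SVM as the fitness for hyperparameter tuning.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def svm_fitness(position, X, y):
    # position is a 2-vector in [0, 1]^2 mapped to (C, gamma) on log scales (assumed mapping).
    C = 10 ** (position[0] * 4 - 1)                        # C in [0.1, 1000]
    gamma = 10 ** (position[1] * 4 - 3)                    # gamma in [0.001, 10]
    acc = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()
    return 1.0 - acc                                       # minimize classification error

# Usage example (assumed names), reusing the optimizer sketch from Sect. 3:
# best_pos, best_err = war_strategy_sketch(lambda p: svm_fitness(p, X_train, y_train),
#                                          dim=2, lb=0.0, ub=1.0)
```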

It can be noted from Table 4 that the optimal hyperparameter values for all ML models are obtained by employing the WAR Strategy algorithm. Figure 2 shows the behavior of the WAR Strategy optimization algorithm during the fine-tuning of these parameters.

Figure 2. The diffusion behavior of the WAR Strategy algorithm during fine-tuning of the ML hyperparameters.

5 Experimental Evaluation and Results

The proposed sign language recognition system was assessed for its ability to recognize static signs in three benchmark datasets representing three different languages: American, Arabic, and Malaysian. The efficacy of the presented system, which utilizes the optimized versions of the ML models, has been demonstrated in handling the numerous variations present in the three datasets, including differences in lighting conditions, distance, backgrounds, dimensions, positions, and orientations. Moreover, it displays exceptional recognition accuracy while decreasing the training time, especially when dealing with many images of different quality. The effect of using the WAR Strategy optimization algorithm in both the feature optimization phase and the hyperparameter tuning phase was assessed: sign language was recognized first without using the WAR Strategy algorithm, second using it only for optimizing the features, and third employing it in both phases (optimizing the features and the ML parameters), which constitutes the proposed system.

The performance measurements used to assess the efficiency of the presented system are accuracy, precision, recall, F-measure, and the training time of the model.
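
As an illustration, the following is a minimal sketch of how these measurements could be computed on the held-out 30% test split for one tuned classifier. The macro averaging of precision, recall, and F-measure is an assumption, since the text does not specify the averaging scheme.

```python
# Minimal sketch of the reported evaluation measurements for one tuned classifier.
import time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(clf, X_train, y_train, X_test, y_test):
    start = time.perf_counter()
    clf.fit(X_train, y_train)                              # train the tuned classifier
    train_time = time.perf_counter() - start
    y_pred = clf.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro"),
        "recall": recall_score(y_test, y_pred, average="macro"),
        "f1": f1_score(y_test, y_pred, average="macro"),
        "training_time_s": train_time,
    }
```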

The implementation environment of the presented system is an ASUS laptop with an AMD Ryzen 9 5900HS (3.30 GHz) with Radeon Graphics, 16 GB of RAM, an NVIDIA GeForce RTX GPU, and 64-bit Windows 10.

5.1 American Sign Language Dataset Results

This static sign language dataset describes the letters of American Sign Language and was acquired from Kaggle [60]. It contains 87,000 colored sign language images stored in JPG format at 200 × 200 resolution. The images are organized into 29 folders (classes): 26 for the letters A-Z and the remaining three for DELETE, NOTHING, and SPACE. All images in this dataset exhibit a high level of variation and were gathered under varying lighting conditions and in many locations and backgrounds. Figure 3 shows some samples of this dataset, while the diffusion of the WAR Strategy algorithm when optimizing its features is shown in Fig. 4. The statistical results obtained on this dataset from the two case studies and the proposed system are presented in Tables 5, 6, and 7.

Figure 3. American Sign Language dataset samples.

Figure 4. The diffusion of the WAR Strategy algorithm on the American Sign Language dataset features.

The statistical findings of the three case studies indicate that the proposed system attains the maximum recognition accuracy through the use of the WAR Strategy optimization algorithm. Four fine-tuned ML models, namely the decision tree, random forest, SVM, and KNN, reached a perfect accuracy of 100%. This reflects the effectiveness of the system in dealing with the many variations in sign language, with a training time of just 0.1 s for the decision tree. Figure 5 displays a comparative chart of the accuracies and training times obtained by the proposed system and the two previously examined case studies.

Figure 5. The proposed system results compared with the two case studies for recognizing American Sign Language.

5.2 Arabic Sign Language Dataset Results

The Arabic sign language dataset utilized in this work is ArSL2018, which is large and labelled [61]. The dataset contains 54,049 static Arabic sign language images covering the 32 Arabic alphabet signs, stored in grayscale format at 64 × 64 resolution. It was gathered from 40 participants of various ages using a smart camera in a single position, 1 m away. Some Arabic sign language images from this dataset are presented in Fig. 6, which makes clear that it suffers from many variations, while the diffusion of the WAR Strategy algorithm when optimizing the features of this dataset is shown in Fig. 7. Tables 8, 9, and 10 present the statistical results for this dataset.

Figure 6. Samples of the Arabic alphabet sign language dataset.

Figure 7. The diffusion of the WAR Strategy optimization algorithm on the Arabic sign language dataset features.

Despite the complex nature of Arabic sign language and the poor lighting and low resolution of the dataset used, the proposed system has demonstrated its effectiveness in addressing such limitations. Many tuned classifiers achieved 100% recognition accuracy with minimal training time. Figure 8 illustrates a direct comparison between the proposed method and the two case studies, i.e., without using the WAR Strategy and using it only for feature optimization.

Figure 8. The proposed system results compared with the two case studies for recognizing Arabic sign language.

5.3 Malaysian Sign Language Dataset Results

The Malaysian sign language dataset utilized in this paper, obtained from Kaggle [62], contains clear and focused images of hand gestures corresponding to Malaysian letters and numbers. The dataset covers alphabets, numbers, and words; the evaluation of the presented system was performed on 12,400 images of the Malaysian alphabets and numbers only. All images are colored JPGs at 640 × 480 resolution. The images representing the numbers are distributed over 11 folders of 300 images each, while the images representing the alphabet are distributed over 26 folders of 350 images per letter. Samples of this dataset are presented in Fig. 9, and the results obtained without the WAR Strategy optimization algorithm, with it used only for feature optimization, and with the full proposed system are given in Tables 11, 12, and 13, respectively. Moreover, Fig. 10 illustrates the diffusion of the WAR Strategy algorithm during optimization of the features extracted from this dataset.

Figure 9. Malaysian alphabet sign language dataset samples.

Figure 10. The diffusion of the WAR Strategy optimization algorithm on the Malaysian sign language dataset features.

Although the Malaysian sign language dataset is extensive and high-dimensional, the approach used in this study achieves good accuracy in recognizing its signs with several of the optimized classifiers. Figure 11 shows the accuracy and training duration achieved by the proposed system in comparison to the two case studies.

Figure 11. The proposed system results compared with the two case studies for recognizing Malaysian sign language.

5.4 Discussion

The efficiency of the metaheuristic WAR Strategy algorithm has been proved by the analysis of the obtained results; it can be considered a powerful optimization algorithm due to its rapid convergence and its ability to avoid getting trapped in local optima. This first became clear when employing it for feature optimization, which raised the superiority of the proposed system in overcoming the problems of sign language recognition, such as variations in illumination, poor resolution, orientation, occlusion, and poses. In addition, using the WAR Strategy to optimize the hyperparameters of the ML models raised the recognition accuracy for American, Arabic, and Malaysian sign language, reaching a maximum of 100% with the tuned versions of the ML models. Moreover, for nearly all of the optimized classifiers, the time necessary to train the ML models was decreased to below one second. By implementing the suggested system on three distinct sign language datasets with different attributes, the issue of generalization in sign language recognition systems was addressed, and the outcomes were both time-efficient and accurate. A substantial disparity becomes apparent when contrasting the outcomes of conventional machine learning models for sign language recognition with the tuned versions of these models. By analyzing the results obtained from the second case study, the efficacy of the WAR Strategy algorithm was initially observed when it was applied to feature optimization. Furthermore, the results obtained from the proposed sign language recognition system unequivocally demonstrated the efficacy of employing the WAR Strategy algorithm for both feature optimization and machine learning hyperparameter tuning. The previously conducted case studies and the evaluation of six machine learning classifiers on three distinct categories of sign language datasets collectively indicate that the WAR Strategy algorithm is highly effective. Moreover, due to its remarkable performance, the proposed system obviates the need for time-consuming and intricate deep learning methodologies, as it instead utilizes fine-tuned classical machine learning models to recognize sign language accurately and rapidly.

To establish the clear advantage of the proposed sign language recognition system over previous research, a comparative analysis was undertaken. The findings of this examination are detailed in Table  14 . When compared to previous systems, the proposed method demonstrates the lowest execution time (0.1 s) and the highest identification rate (100%). The accuracy rate and the required training time outcomes for each tuned classifier across the three datasets are illustrated in Fig.  12 .

Figure 12. The results of the proposed system on the three examined datasets.

6 Conclusions and Future Work

This paper presents an efficient and accurate recognition system for static sign language in American, Arabic, and Malaysian that relies on the recently developed metaheuristic WAR Strategy algorithm. After processing the sign language image by correcting its contrast, cropping the desired hand region, and extracting the spatial and temporal features using two powerful techniques (i.e., LDA and GLCM), the WAR Strategy algorithm is applied to the extracted feature set. Applying this metaheuristic optimization algorithm produces an optimized feature set that holds the most significant information of the signs. The feature optimization phase enables the proposed system to overcome many obstacles that arise when recognizing sign language, such as variance in lighting, differences in orientations and poses, and partial or complete occlusion. Moreover, the WAR Strategy optimization algorithm is applied a second time to optimize the hyperparameters of six widely utilized machine learning models. Tuning the machine learning models' parameters greatly improved the performance of the proposed system, as observed in the previously presented recognition results, which exhibit a 100% accuracy rate. This parameter optimization also improved the ML processes and reduced the time required to train the models, which reached as low as 0.5 s in some of the tuned versions of the ML models. As a result, the proposed system proves its efficiency in dealing with large datasets that have many variations and shows high performance. In addition, the proposed system has proven its high generalization by recognizing different sign languages (i.e., American, Arabic, and Malaysian) and giving superior results on all of them. In the future, the intention is to enhance the system to be applied to dynamic sign language or to test it on a different kind of data, such as sound.

Data Availability Statement

All the datasets adopted in our experiments are publicly available and can be accessed as follows. The ASL Alphabet Dataset is publicly available on Kaggle and can be accessed via the following link: https://www.kaggle.com/datasets/grassknoted/asl-alphabet?resource=download . The DOI for the dataset is https://doi.org/10.34740/kaggle/dsv/29550 . The dataset was accessed on October 15, 2023. The ArASL: Arabic Alphabets Sign Language Dataset is publicly available on Mendeley Data and can be accessed via the following link: https://data.mendeley.com/datasets/y7pckrw6z2/1 . The DOI for the dataset is https://doi.org/10.17632/y7pckrw6z2.1 . The dataset was accessed on October 30, 2023. The Malaysian Sign Language (MSL) Image Dataset is publicly available on Kaggle and can be accessed via the following link: https://www.kaggle.com/datasets/pradeepisawasan/malaysian-sign-language-msl-image-dataset . The DOI for the dataset is https://doi.org/10.34740/kaggle/dsv/7135047 . The dataset was accessed on November 4, 2023.

Hall JA, Davis DC (2017) Proposing the communicate bond belong theory: evolutionary intersections with episodic interpersonal communication. Commun Theory 27(1):21–47. https://doi.org/10.1111/comt.12106


Stokoe WC Jr (2005) Sign language structure: an outline of the visual communication systems of the American deaf. J Deaf Stud Deaf Edu 10(1):3–37. https://doi.org/10.1093/deafed/eni001

Mcburney SL (2001) William Stokoe and the discipline of sign language linguistics. Historiographia Linguistica 28(1–2):143–186. https://doi.org/10.1075/hl.28.1.10mcb

Goldin-Meadow S, Brentari D (2017) Gesture, sign, and language: the coming of age of sign language and gesture studies. Behav Brain Sci 40:e46. https://doi.org/10.1017/S0140525X15001247

World Federation of the deaf. Rome, Italy. Retrieved from https://wfdeaf.org/our-work/ . Accessed 12 Dec 2024

Rastgoo R, Kiani K, Escalera S (2021) Sign language recognition: a deep survey. Expert Syst Appl 164:113794. https://doi.org/10.1016/j.eswa.2020.113794

Wang Z, Zhao T, Ma J, Chen H, Liu K, Shao H, Wang Q, Ren Ju (2020) Hear sign language: a real-time end-to-end sign language recognition system. IEEE Trans Mob Comput 21(7):2398–2410. https://doi.org/10.1109/TMC.2020.3038303

Farooq U, Rahim MSM, Sabir N, Hussain A, Abid A (2021) Advances in machine translation for sign language: approaches, limitations, and challenges. Neural Comput Appl 33(21):14357–14399. https://doi.org/10.1007/s00521-021-06079-3

Koller O, Forster J, Ney H (2015) Continuous sign language recognition: towards large vocabulary statistical recognition systems handling multiple signers. Comput Vis Image Underst 141:108–125. https://doi.org/10.1016/j.cviu.2015.09.013

Hassan MH (2023) Applications of machine learning in mobile networking. J Smart Internet Things (JSIoT) 2023:23–35. https://doi.org/10.2478/jsiot-2023-0003

Jogin M, Madhulika MS, Divya GD, Meghana RK, Apoorva S (2018) Feature extraction using convolution neural networks (CNN) and deep learning. In: 2018 3rd IEEE international conference on recent trends in electronics, information and communication technology (RTEICT), pp. 2319–2323. IEEE. https://doi.org/10.1109/RTEICT42901.2018.9012507

Chen X-W, Lin X (2014) Big data deep learning: challenges and perspectives. IEEE Access 2:514–525. https://doi.org/10.1109/ACCESS.2014.2325029

Boubezoul A, Paris S (2012) Application of global optimization methods to model and feature selection. Pattern Recogn 45(10):3676–3686. https://doi.org/10.1016/j.patcog.2012.04.015

Alelyani S, Tang J, Liu H (2018) Feature selection for clustering: A review. Data Clustering 29–60. ISBN: 9781315373515

Yang Li, Shami A (2020) On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing 415:295–316. https://doi.org/10.1016/j.neucom.2020.07.061

Sheikh BUH, Zafar A (2024) Unlocking adversarial transferability: a security threat towards deep learning-based surveillance systems via black box inference attack-a case study on face mask surveillance. Multimed Tools Appl 83(8):24749–24775. https://doi.org/10.1007/s11042-023-15405-x

Sheikh BUH, Zafar A (2024) Beyond accuracy and precision: a robust deep learning framework to enhance the resilience of face mask detection models against adversarial attacks. Evolv Syst 15(1):1–24. https://doi.org/10.1007/s12530-023-09522-z

Sheikh BUH, Zafar A (2023) RRFMDS: rapid real-time face mask detection system for effective COVID-19 monitoring. SN Comput Sci 4(3):288. https://doi.org/10.1007/s42979-023-01738-9

Roshan K, Zafar A, Haque SBU (2024) Untargeted white-box adversarial attack with heuristic defence methods in real-time deep learning-based network intrusion detection system. Comput Commun 218:97–113. https://doi.org/10.1016/j.comcom.2023.09.030

Sheikh BUH, Zafar A (2024) White-box inference attack: compromising the security of deep learning-based COVID-19 diagnosis systems. Int J Inf Technol 16(3):1475–1483. https://doi.org/10.1007/s41870-023-01538-7

Sheikh BUH, Zafar A (2024) Unlocking adversarial transferability: a security threat towards deep learning-based surveillance systems via black box inference attack-a case study on face mask surveillance. Multimed Tools Appl 83(8):24749–24775. https://doi.org/10.1007/s11042-023-16439-x

Sheikh BUH, Zafar A (2024) Robust medical diagnosis: a novel two-phase deep learning framework for adversarial proof disease detection in radiology images. J Imag Inf Med 37(1):308–338. https://doi.org/10.1007/s10278-023-00916-8

Sagayam KM, Hemanth DJ, Vasanth XA, Henesy LE, Ho CC (2018) Optimization of a HMM-based hand gesture recognition system using a hybrid cuckoo search algorithm. Hybrid Metaheur Image Anal 2018:87–114

Kaluri R, Ch PR (2018) Optimized feature extraction for precise sign gesture recognition using self-improved genetic algorithm. Int J Eng Technol Innov 8(1):25–37


Gadekallu TR, Alazab M, Kaluri R, Maddikunta PKR, Bhattacharya S, Lakshmanna K (2021) Hand gesture classification using a novel CNN-crow search algorithm. Compl Intell Syst 7:1855–1868. https://doi.org/10.1007/s40747-021-00324-x

Nayak J, Naik B, Dash PB, Souri A, Shanmuganathan V (2021) Hyper-parameter tuned light gradient boosting machine using memetic firefly algorithm for hand gesture recognition. Appl Soft Comput 107:107478. https://doi.org/10.1016/j.asoc.2021.107478

Kowdiki M, Khaparde A (2021) Automatic hand gesture recognition using hybrid meta-heuristic-based feature selection and classification with dynamic time warping. Comput Sci Rev 39:100320. https://doi.org/10.1016/j.cosrev.2020.100320

Fregoso J, Gonzalez CI, Martinez GE (2021) Optimization of convolutional neural networks architectures using PSO for sign language recognition. Axioms 10(3):139. https://doi.org/10.3390/axioms10030139

Bansal SR, Wadhawan S, Goel R (2022) mrmr-pso: a hybrid feature selection technique with a multiobjective approach for sign language recognition. Arab J Sci Eng 47(8):10365–10380. https://doi.org/10.1007/s13369-021-06456-z

Marzouk R, Alrowais F, Al-Wesabi FN, Hilal AM (2022) Atom search optimization with deep learning enabled arabic sign language recognition for speaking and hearing disability persons. Healthcare 10(9):1606. https://doi.org/10.3390/healthcare10091606

Gadekallu TR, Srivastava G, Liyanage M, Iyapparaja M, Chowdhary CL, Koppu S, Maddikunta PKR (2022) Hand gesture recognition based on a Harris hawks optimized convolution neural network. Comput Electr Eng 100:107836. https://doi.org/10.1016/j.compeleceng.2022.107836

Marzouk R, Alrowais F, Al-Wesabi FN, Hilal AM (2023) Sign language recognition using artificial rabbits optimizer with siamese neural network for persons with disabilities. J Disab Res 2(4):31–39

Alsolai H, Alsolai L, Al-Wesabi FN, Othman M, Rizwanullah M, Abdelmageed AA (2023) Automated sign language detection and classification using reptile search algorithm with hybrid deep learning. Heliyon 10:1

Al-onazi BB, Nour MK, Alshahran H, Elfaki MA, Alnfiai MM, Marzouk R, Othman M, Sharif MM, Motwakel A (2023) Arabic sign language gesture classification using deer hunting optimization with machine learning model. Comput Mater Contin. https://doi.org/10.32604/cmc.2023.035303

Asiri MM, Motwakel A, Drar S (2023) Sand cat swarm optimizer with deep wavelet autoencoder-based sign language recognition for hearing-and speech-impaired persons. J Disab Res 2(3):94–104. https://doi.org/10.57197/JDR-2023-0040

Paharia N, Jadon RS, Gupta SK (2023) Optimization of convolutional neural network hyperparameters using improved competitive gray wolf optimizer for recognition of static signs of Indian sign language. J Electr Imag 32(2):023042–023042. https://doi.org/10.1117/1.JEI.32.2.023042

Chopard B, Tomassini M (2018) An introduction to metaheuristics for optimization. Springer, Cham. https://doi.org/10.1007/978-3-319-93073-2


Ayyarao TSLV, Ramakrishna NSS, Elavarasan RM, Polumahanthi N, Rambabu M, Saini G, Khan B, Alatas B (2022) War strategy optimization algorithm: a new effective metaheuristic algorithm for global optimization. IEEE Access 10:25073–25105. https://doi.org/10.1109/ACCESS.2022.3153493

Saravanan G, Yamuna G, Nandhini S (2016) Real time implementation of RGB to HSV/HSI/HSL and its reverse color space models. In: 2016 International conference on communication and signal processing (ICCSP), pp. 0462–0466. IEEE. https://doi.org/10.1109/ICCSP.2016.7754179

Dhal KG, Das A, Ray S, Gálvez J, Das S (2021) Histogram equalization variants as optimization problems: a review. Arch Comput Methods Eng 28:1471–1496. https://doi.org/10.1007/s11831-020-09425-1

Abdulhasan RA, Al-latief STA, Kadhim SM (2023) Instant learning based on deep neural network with linear discriminant analysis features extraction for accurate iris recognition system. Multimed Tools Appl. https://doi.org/10.1007/s11042-023-16751-6

Bovik AC (ed) (2009) The essential guide to image processing. Academic Press. https://doi.org/10.1016/B978-0-12-374457-9.X0001-7

Hsiao Y-T, Chuang C-L, Jiang J-A, Chien C-C (2005) A contour based image segmentation algorithm using morphological edge detection. In: 2005 IEEE International conference on systems, man and cybernetics, vol. 3, pp. 2962–2967. IEEE. https://doi.org/10.1109/ICSMC.2005.1571600

Abubakar FM (2012) A study of region-based and contour-based image segmentation. Signal Image Proc 3(6):15. https://doi.org/10.5121/sipij.2012.3602


Yan F, Zhao S, Venegas-Andraca SE, Hirota K (2021) Implementing bilinear interpolation with quantum images. Digital Signal Proc 117:103149. https://doi.org/10.1016/j.dsp.2021.103149

Madhiarasan DM, Roy P, Pratim P (2022) A comprehensive review of sign language recognition: different types, modalities, and datasets. Preprint arXiv:2204.03328 . https://doi.org/10.48550/arXiv.2204.03328

Xanthopoulos P, Pardalos PM, Trafalis TB, Xanthopoulos P, Pardalos PM, Trafalis TB (2013) Linear discriminant analysis. Robust Data Mining. https://doi.org/10.1007/978-1-4419-9878-1_4

Sharma A, Paliwal KK (2015) Linear discriminant analysis for the small sample size problem: an overview. Int J Mach Learn Cybern 6:443–454. https://doi.org/10.1007/s13042-013-0226-9

Öztürk Ş, Akdemir B (2018) Application of feature extraction and classification methods for histopathological image using GLCM, LBP, LBGLCM, GLRLM and SFTA. Proc Comput Sci 132:40–46. https://doi.org/10.1016/j.procs.2018.05.057

Garg M, Dhiman G (2021) A novel content-based image retrieval approach for classification using GLCM features and texture fused LBP variants. Neural Comput Appl 33:1311–1328. https://doi.org/10.1007/s00521-020-05017-z

Solteiro Pires EJ, Tenreiro Machado JA, de Moura Oliveira PB, Boaventura Cunha J, Mendes L (2010) Particle swarm optimization with fractional-order velocity. Nonlinear Dyn 61:295–301. https://doi.org/10.1007/s11071-009-9649-y

Andonie R (2019) Hyperparameter optimization in learning systems. J Membr Comput 1(4):279–291. https://doi.org/10.1007/s41965-019-00023-0

Pisner DA, Schnyer DM (2020) Support vector machine. In: Machine learning, pp. 101–121. Academic Press. https://doi.org/10.1016/B978-0-12-815739-8.00006-7

Pal M (2005) Random forest classifier for remote sensing classification. Int J Remote Sens 26(1):217–222. https://doi.org/10.1080/01431160412331269698

Dreiseitl S, Ohno-Machado L (2002) Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 35(5–6):352–359. https://doi.org/10.1016/S1532-0464(03)00034-0

Navada A, Ansari AN, Patil S, Sonkamble BA (2011) Overview of use of decision tree algorithms in machine learning. In: 2011 IEEE control and system graduate research colloquium, pp. 37–42. IEEE. https://doi.org/10.1109/ICSGRC.2011.5991826

Cunningham P, Delany SJ (2021) k-Nearest neighbour classifiers—a tutorial. ACM Comput Surv (CSUR) 54(6):1–25. https://doi.org/10.1145/3459665

Ontivero-Ortega M, Lage-Castellanos A, Valente G, Goebel R, Valdes-Sosa M (2017) Fast Gaussian Naïve Bayes for searchlight classification analysis. Neuroimage 163:471–479. https://doi.org/10.1016/j.neuroimage.2017.09.001

Liang JJ, Qin AK, Suganthan PN, Baskar S (2006) Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE Trans Evol Comput 10(3):281–295. https://doi.org/10.1109/TEVC.2005.857610

Akash N. ASL Alphabet. Kaggle. https://doi.org/10.34740/kaggle/dsv/29550 . Accessed 15 Oct 2023

Latif G, Mohammad N, Alghazo J, AlKhalaf R, AlKhalaf R (2019) ArASL: arabic alphabets sign language dataset. Data Brief 23:103777. https://doi.org/10.1016/j.dib.2019.103777

Isawasan P, Zolkefly A. Malaysian Sign Language (MSL) Image Dataset. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/7135047 . Accessed 4 Nov 2023


Acknowledgements

Not Applicable

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

College of Graduate Studies (COGS), Universiti Tenaga Nasional (National Energy University), Selangor, Malaysia

Shahad Thamear Abd Al-Latief & Saif Mohanad Khadim

Institute of Informatics and Computing in Energy, Universiti Tenaga Nasional (National Energy University), Selangor, Malaysia

Salman Yussof

College of Computing and Informatics, Universiti Tenaga Nasional (National Energy University), Selangor, Malaysia

Azhana Ahmad

Faculty of Electrical and Electronics Engineering, Universiti Tun Hussein Onn, Johor, Malaysia

Raed Abdulkareem Abdulhasan


Contributions

Shahad Th. wrote the manuscript, proposed the method, and performed the analysis. Salman Y. and Azhana A. supervised the work. Saif M. and Raed A. reviewed the work and provided advice on adjustments.

Corresponding author

Correspondence to Shahad Thamear Abd Al-Latief.

Ethics declarations

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Consent to Participate

The authors provide the appropriate consent to participate.

Consent for Publication

The authors provide the consent to publish the images in the manuscript. The data used in the publication is publicly available. We provide respective citations for each of the data sources.

Ethical Approval

Not Applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Abd Al-Latief, S.T., Yussof, S., Ahmad, A. et al. Instant Sign Language Recognition by WAR Strategy Algorithm Based Tuned Machine Learning. Int J Netw Distrib Comput (2024). https://doi.org/10.1007/s44227-024-00039-8

Download citation

Received : 22 April 2024

Accepted : 23 August 2024

Published : 08 September 2024

DOI : https://doi.org/10.1007/s44227-024-00039-8


Keywords

  • Hyperparameter tuning
  • WAR strategy optimization algorithm
  • Feature optimization
  • Machine learning
  • Sign language recognition


Sign Language Semantics

Sign languages (in the plural: there are many) arise naturally as soon as groups of deaf people have to communicate with each other. Sign languages became institutionally established starting in the late eighteenth century, when schools using sign languages were founded in France, and spread across different countries, gradually leading to a golden age of Deaf culture (we capitalize Deaf when talking about members of a cultural group, and use deaf for the audiological status). This came to a partial halt in 1880, when the Milan Congress declared that oral education was superior to sign language education (Lane 1984)—a view that is amply refuted by research (Napoli et al. 2015). While sign languages continued to be used in Deaf education in some countries (e.g., the United States), it was only in the 1970s that a Deaf Awakening gave renewed prominence to sign languages in the western world (see Lane 1984 for a broader history).

Besides their essential role in Deaf culture and education, sign languages have a key role to play for linguistics in general and for semantics in particular. Despite earlier misconceptions that denied them the status of full-fledged languages, their grammar, their expressive possibilities, and their brain implementation are overall strikingly similar to those of spoken languages (Sandler & Lillo-Martin 2006; MacSweeney et al. 2008). In other words, human language exists in two modalities, signed and spoken, and any general theory must account for both, a view that is accepted in all areas of linguistics.

Cross-linguistic semantics is thus naturally concerned with sign languages. In addition, sign languages (or ‘sign’ for short) raise several foundational questions. These include cases of ‘Logical Visibility’, cases of iconicity, and the potential universal accessibility of certain properties of the sign modality.

Historically, a number of notable early works in sign language semantics have taken a similar argumentative form, proposing that certain key components of Logical Forms that are covert in speech sometimes have an overt reflex in sign (‘Logical Visibility’). Such arguments have been formulated for diverse phenomena such as variables, context shift, and telicity. Their semantic import is clear: if a logical element is indeed overtly realized, it has ramifications for the inventory of the logical vocabulary of human language, and indirectly for the types of entities that must be postulated in semantic models (‘natural language ontology’, Moltmann 2017). Moreover, when a given element has an overt reflex, one can directly manipulate it in order to investigate its interaction with other parts of the grammar.

Arguments based on Logical Visibility are certainly not unique to sign language (e.g., see for instance Matthewson 2001 and Cable 2011 (see Other Internet Resources) for the importance of semantic fieldwork for spoken languages), nor do they entail that sign languages as a class will make visible the same set of logical elements. Nevertheless, a notable finding from cross-linguistic work on sign languages is that a number of the logical elements implicated in these discussions do indeed appear with a similar morphological expression across a large number of historically unrelated sign languages. Such observations invite deeper speculation about what it is about the signed modality that makes certain logical elements likely to appear in a given form.

A second thread of semantically-relevant research relates to the observation that sign languages make rich use of iconicity (Liddell 2003; Taub 2001; Cuxac & Sallandre 2007), the property by which a symbol resembles its denotation by preserving some of its structural properties. Sign language iconicity raises three foundational questions. First, some of the same semantic elements that are implicated in arguments for Logical Visibility turn out to be employed and manipulated in the expression of concrete or abstract iconic relations (e.g., pictorial uses of individual-denoting expressions; scalar structure; mereological structure), thus suggesting that logical and iconic notions are intertwined at the core of sign language. Second, sign languages have designated conventional words (‘classifier predicates’) whose position or movement must be interpreted iconically; this calls for an integration of techniques from pictorial semantics into natural language semantics (Schlenker 2018a; Schlenker & Lamberton forthcoming). Finally, this high degree of iconicity raises questions about the comparison between speech and sign, with the possibility that, along iconic dimensions, the latter is expressively richer (Goldin-Meadow & Brentari 2017; Schlenker 2018a).

Possibly due in part to the above factors, sign languages—even when historically unrelated—behave as a coherent language family, with several semantic properties in common that are not generally shared with spoken languages (Sandler & Lillo-Martin 2006). Furthermore, some of these properties occasionally seem to be ‘known’ by individuals that do not have access to sign language; these include hearing non-signers and also deaf homesigners (i.e., deaf individuals that are not in contact with a Deaf community and thus have to invent signs to communicate with their hearing environment). Explaining this convergence is a key theoretical challenge.

Besides semantics proper, sign raises important questions for the analysis of information structure, notably topic and focus. These are often realized by way of facial articulators, including raised eyebrows, which have sometimes been taken to play the same role as some intonational properties of speech (e.g., Dachkovsky & Sandler 2009). For reasons of space, we leave these issues aside in what follows.

  • 1.1 Loci as visible variables?
  • 1.2 Time and world-denoting loci
  • 1.3 Degree-denoting loci
  • 2.1 Telicity
  • 2.2 Context shift
  • 3.1 Iconic modulations
  • 3.2 Event structure
  • 3.3 Plurals and pluractionals
  • 3.4 Iconic loci
  • 4.1 The demonstrative analysis of classifier predicates
  • 4.2 The pictorial analysis of classifier predicates
  • 4.3 Comparison and refinements
  • 5.1 Reintegrating gestures in the sign/speech comparison
  • 5.2 Typology of iconic contributions in speech and in sign
  • 5.3 Classifier predicates and pro-speech gestures
  • 6.1 Sign language typology and sign language emergence
  • 6.2 Sign language grammar and gestural grammar
  • 6.3 Why convergence
  • 7. Future issues
  • Other Internet Resources
  • Related Entries

1. Logical Visibility I: Loci

In several cases of foundational interest, sign languages don’t just have the same types of Logical Forms as spoken languages; they may make overt key parts of the logical vocabulary that are usually covert in spoken language. These have been called instances of ‘Logical Visibility’ (Schlenker 2018a; following the literature, we use the term ‘logical’ loosely, to refer to primitive distinctions that play a key role in a semantic analysis).

Claims of Logical Visibility have been made for logical variables associated with syntactic indices (Lillo-Martin & Klima 1990; Schlenker 2018a), for context shift operators (Quer 2005, 2013; Schlenker 2018a), and for verbal morphemes relevant to telicity (Wilbur 2003, 2008; Wilbur & Malaia 2008). In each case, the claim of Logical Visibility has been debated, and many questions remain open.

In this section, we discuss cases in which logical variables of different types have been argued to sometimes be overt in sign—a claim that has consequences of foundational interest for semantics; we will discuss further potential cases of Logical Visibility in the next section.

In English and other languages, sentences such as (2a) and (3a) can be read in three ways (see (2b) – (3b) ), depending on whether the embedded pronoun is understood to depend on the subject, on the object, or to refer to some third person.

A claim of Logical Visibility relative to variables has been made in sign because one can introduce a separate position in signing space, or ‘locus’ (plural ‘loci’), for each of the antecedents (e.g., Sarkozy on the left and Obama on the right for (2) ), and one can then point towards these loci (towards the left or towards the right) to realize the pronoun: loci thus mirror the role of variables in these examples.

Sign languages routinely use loci to represent objects or individuals one is talking about. Pronouns can be realized by pointing towards these positions. The signer and addressee are represented in a fixed position that corresponds to their real one, and similarly for third persons that are present in the discourse situation: one points at them to refer to them. But in addition, arbitrary positions can be created for third persons that are not present in the discourse. The maximum number of loci that can be simultaneously used seems to be determined by considerations of performance (e.g., memory) rather than by rigid grammatical conditions (there are constructed examples with up to 7 loci in the literature).

Loci corresponding to the signer (1), the addressee (2), and different third persons (3a and 3b) (from Pfau, Salzmann, & Steinbach 2018: Figure 1)

We focus on the description of loci in American Sign Language (ASL) and French Sign Language (LSF, for ‘ Langue des Signes Française ’), but these properties appear in a similar form across the large majority of sign languages. Singular pronouns are signed by directing the index finger towards a point in space; plural pronouns can be signed with a variety of strategies, including using the index finger to trace a semi-circle around an area of space, and are typically used to refer to groups of at least three entities. Other pronouns specify a precise number of participants with an incorporated numeral (e.g., dual, trial), and move between two or more points in space. In addition, some verbs, called ‘agreement verbs’, behave as if they contain a pronominal form in their realization, pertaining to the subject and/or to the object. For instance, TELL in ASL targets different positions depending on whether the object is second person (5a) or third person (5b) .

Figures: the start of TELL in ASL, with the dominant hand’s index finger extended near the mouth: (a) I tell you; (b) I tell him/her.

Loci often make it possible to disambiguate pronominal reference. For instance, the ambiguity of the example in (2) can be removed in LSF, where the sentence comes in two versions. In both, Sarkozy is assigned a locus on the signer’s left (by way of the index of the left hand held upright, ‘👆 left ’ below), and Obama a locus on the right (using the index of the right hand held upright, ‘👆 right ’ below). The verb tell in he (Sarkozy) tells him (Obama) is realized as a single sign linking the Sarkozy locus on the left to the Obama locus on the right (unlike the ASL version, which just displays object agreement, the LSF version displays both subject and object agreement: ‘ left TELL right ’ indicates that the sign moves from the Sarkozy locus on the left to the Obama locus on the right). If he refers to Sarkozy, the signer points towards the Sarkozy locus on the left (‘👉 left ’); if he refers to Obama, the signer points towards the Obama locus on the right (‘👉 right ’).

SARKOZY 👆 left OBAMA 👆 right left TELL right ‘Sarkozy told Obama…


In sign language linguistics, signs are transcribed in capital letters, as was the case above, and loci are encoded by way of letters ( a , b , c , …), starting from the signer’s dominant side (right for a right-handed signer, left for a left-handed signer). The upward fingers used to establish positions for Sarkozy and Obama are called ‘classifiers’ and are glossed here as CL (with the conventions of Section 4.2 . the gloss would be PERSON-cl ; classifiers are just one way to establish the position of antecedents, and they are not essential here). Pronouns involving pointing with the index finger are glossed as IX . With these conventions, the sentence in (6) can be represented as in (7) . (Examples are followed by the name of the language, as well as the reference of the relevant video when present in the original source; thus ‘LSF 4, 235’ indicates that the sentence is from LSF, and can be found in the video referenced as 4, 235.)

The ambiguity of quantified sentences such as (3) can also be removed in sign, as illustrated in an LSF sentence in (8) .

In light of these data, the claim of Logical Visibility is that sign language loci (when used—for they need not be) are an overt realization of logical variables.

One potential objection is that pointing in sign might be very different from pronouns in speech: after all, one points when speaking, but pointing gestures are not pronouns. However this objection has little plausibility in view of formal constraints that are shared between pointing signs and spoken language pronouns. For example, pronouns in speech are known to follow grammatical constraints that determine which terms can be used in which environments (e.g., the non-reflexive pronoun her when the antecedent is ‘far enough’ vs. the reflexive pronoun herself when the antecedent is ‘close enough’). Pointing signs obey similar rules, and enter into an established typology of cross-linguistic variation. For instance, the ASL reflexive displays the same kinds of constraints as the Mandarin reflexive pronoun in terms of what counts as ‘close enough’ (see Wilbur 1996; Koulidobrova 2009; Kuhn 2021).

It is thus generally agreed that pronouns in sign are part of the same abstract system as pronouns in speech. It is also apparent that loci play a similar function to logical variables, disambiguating antecedents and tracking reference. This being said, the claim that loci are a direct morphological spell-out of logical variables requires a more systematic evaluation of the extent to which sign language loci have the formal properties of logical variables. As a concrete counterpoint, one observes that gender features in English also play a similar function, disambiguating antecedents and tracking reference. For instance, Joe Biden told Kamala Harris that he would be elected has a rather unambiguous reading ( he = Biden), while Joe Biden told Kamala Harris that she would be elected has a different one ( she = Harris ). Such parallels have led some linguists to propose that loci should best be viewed as grammatical features akin to gender features (Neidle et al. 2000; Kuhn 2016).

As it turns out, sign language loci seem to share some properties with logical variables, and some properties with grammatical features. On the one hand, the flexibility with which loci can be used seems closer to the nature of logical variables than to grammatical features. First, gender features are normally drawn from a finite inventory, whereas there seems to be no upper bound to the number of loci used except for reasons of performance. Second, gender features have a fixed form, whereas loci can be created ‘on the fly’ in various parts of signing space. On the other hand, loci may sometimes be disregarded in ways that resemble gender features. A large part of the debate has focused on sign language versions of sentences such as: Only Ann did her homework . This has a salient (‘bound variable’) reading that entails that Bill didn’t do his homework . In order to derive this reading, linguists have proposed that the gender features of the pronoun her must be disregarded, possibly because they are the result of grammatical agreement. Loci can be disregarded in the very same kind of context, suggesting that they are features, not logical variables (Kuhn 2016). In light of this theoretical tension, a possible synthesis is that loci are a visible realization of logical variables, but mediated by a featural level (Schlenker 2018a). The debate continues to be relatively open.
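As a rough illustration of the competing views, consider the following schematic entries (a simplified sketch of our own, with [[·]] for the interpretation function and g for an assignment; the glossed example is constructed for illustration and is not drawn from the literature):

```latex
% Loci-as-variables (sketch): pointing toward locus a is interpreted as the
% variable a, valued by the assignment g (deictic use) or bound by a quantified
% antecedent established at locus a.
[\![\textsc{ix}\text{-}a]\!]^{c,g} \;=\; g(a)

% Bound use (illustrative gloss): a quantifier signed at locus a binds that variable.
[\![\textsc{each}\text{-}a\ \textsc{student}\ \textsc{think}\ \textsc{ix}\text{-}a\ \textsc{smart}]\!]^{c,g} = 1
\ \text{iff}\ \forall x\,[\mathrm{student}(x) \rightarrow x \text{ thinks that } x \text{ is smart}]

% Loci-as-features (alternative sketch): the locus is an agreement feature on the
% pronoun, like gender on English 'her', and can be ignored under 'only'.
```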

While there has been much theoretical interest in cases in which reference is disambiguated by loci, this is usually an option, not an obligation. In ASL, for instance, it is often possible to realize pronouns by pointing towards a neutral locus that need not be introduced explicitly by nominal antecedents, and in fact several antecedents can be associated with this default locus. This gives rise to instances of referential ambiguity that are similar to those found in English in (2)–(3) above (see Frederiksen & Mayberry 2022; for an account that treats loci as corresponding to entire regions of signing space, and also allows for sign language pronouns without locus specification, see Steinbach & Onea 2016).

Regardless of implementation, the flexible nature of sign language loci allows one to revisit foundational questions about anaphora and reference.

In the analysis of temporal and modal constructions in speech, there are two broad directions. One goes back to quantified tense logic and modal logic, and takes temporal and modal expressions of natural language to be fundamentally different from individual-denoting expressions: the latter involve the full power of variables and quantifiers, whereas no variables exist in the temporal and modal domain, although operators manipulate implicit parameters. The opposite view is that natural language has in essence the same logical vocabulary across the individual, the temporal and the modal domains, with variables (which may take different forms across different domains) and quantifiers that may bind them (see for instance von Stechow 2004). This second tradition was forcefully articulated by Partee (1973) for tense and Stone (1997) for mood. Partee’s and Stone’s argument was in essence that tense and mood have virtually all the uses that pronouns do. This suggests, theory-neutrally, that pronouns, tenses and moods have a common semantic core. With the additional assumption that pronouns should be associated with variables, this suggests that tenses and moods should be associated with variables as well, perhaps with time- and world-denoting variables, or with a more general category of situation-denoting variables.

As an example, pronouns can have a deictic reading on which they refer to salient entities in the context; if a person sitting alone with their head in their hands utters: She left me , one will understand that she refers to the person’s former partner. Partee argued that tense has deictic uses too. For instance, if an elderly author looks at a picture selected by their publisher for a forthcoming book, and says: I wasn’t young , one will understand that the author wasn’t young when the picture was taken . Stone similarly argued that mood can have deictic readings, as in the case of someone who, while looking at a high-end stereo in a store, says: My neighbors would kill me . The interpretation is that the speaker’s neighbors would (metaphorically) kill them “if the speaker bought the stereo and played it a ‘satisfying’ volume”, in Stone’s words. A wide variety of other uses of pronouns can similarly be replicated with tense and mood, such as cross-sentential binding with indefinite antecedents. (In the individual domain: A woman will go to Mars. She [=the woman who goes to Mars] will be famous . In the temporal domain: I sometimes go to China. I eat Peking duck [=in the situations in which I visit China] .)
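The two traditions can be contrasted schematically as follows (a simplified sketch of our own; the entries are illustrative and are not quotations of Partee’s or Stone’s formalizations):

```latex
% Operator-based (Priorian) tense: no temporal variable; a covert existential over times.
[\![\mathrm{PAST}\ \varphi]\!]^{c,g,t} = 1 \ \text{iff}\ \exists t' \prec t:\ [\![\varphi]\!]^{c,g,t'} = 1

% Referential (Partee-style) tense: the tense carries an index i whose value is a
% contextually salient past time (e.g., the time at which the picture was taken).
[\![\mathrm{PAST}_i\ \varphi]\!]^{c,g,t} = 1 \ \text{iff}\ g(i) \prec t \ \text{and}\ [\![\varphi]\!]^{c,g,g(i)} = 1

% Stone-style referential mood: a world/situation index j valued by a salient
% possibility (e.g., the worlds in which the speaker buys the stereo).
[\![\mathrm{WOULD}_j\ \varphi]\!]^{c,g,w} = 1 \ \text{iff}\ [\![\varphi]\!]^{c,g,g(j)} = 1
```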

While strong, these arguments are indirect because the form of tense and mood looks nothing like pronouns. In several sign languages, including at least ASL (Schlenker 2018a) and Chinese Sign Language (Lin et al. 2021), loci provide a more direct argument because in carefully constructed examples, pointing to loci can be used not just to refer to individuals, but also to temporal and modal situations, with a meaning akin to that of the word then in English. It follows that the logical system underlying the ASL pronominal system (e.g., as variables) extends to temporal and modal situations.

A temporal example appears in (9) . In the first sentence, SOMETIMES WIN is signed in a locus a . In the second sentence, the pointing sign IX-a refers back to the situations in which I win. The resulting meaning is that I am happy in those situations in which I win , not in general; this corresponds to the reading obtained with the word then in English: ‘then I am happy’. (Here and below, ‘re’ glosses raised eyebrows, with a line above the words over which eyebrow raising occurs.)

Context: Every week I play in a lottery.

Formally, SOMETIMES can be seen as an existential quantifier over temporal situations, so the first sentence is semantically existential: there are situations in which I win . The pointing sign thus displays cross-sentential anaphora, depending on a temporal existential quantifier that appears in a preceding sentence. A further point made by Chinese Sign Language (but not by the ASL example above) is that loci may be ordered on a temporal line, with the result that not just the loci but also their ordering can be made visible (Lin et al. 2021).

A related argument can be made about anaphoric reference to modal situations: in (10), the first sentence just asserts that there are possible situations in which I am infected, associating the locus a with situations of infection. The second sentence makes reference to them: in those situations (not in general), I have a problem. Here too, the reading obtained corresponds to a use of the word then in English.

In sum, temporal and modal loci make two points. First, theory-neutrally, the pointing sign can have both the use of English pronouns and of temporal and modal readings of the word then , suggesting that a single system of reference underlies individual, temporal and modal reference. Second, on the assumption that loci are the overt realization of some logical variables, sign languages provide a morphological argument for the existence of temporal and modal variables alongside individual variables.

In spoken language semantics, there is a related debate about the existence of degree-denoting variables. The English sentences Ann is tall and Ann is taller than Bill (as well as other gradable constructions) can be analyzed in terms of reference to degrees, for instance as in (11) .
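Since the content of (11) is not reproduced above, the following gives one standard degree-based rendering of the sort at issue (a sketch of our own; θ_tall is a context-sensitive threshold and height maps individuals to degrees):

```latex
% 'Ann is tall': Ann's height meets a contextual standard of tallness.
[\![\text{Ann is tall}]\!]^{c} = 1 \ \text{iff}\ \mathrm{height}(\mathrm{Ann}) \geq \theta_{\mathit{tall}}(c)

% 'Ann is taller than Bill': quantification over degrees; there is a degree that
% Ann's height reaches but Bill's does not.
[\![\text{Ann is taller than Bill}]\!] = 1 \ \text{iff}\
\exists d\,[\mathrm{height}(\mathrm{Ann}) \geq d \ \wedge\ \neg(\mathrm{height}(\mathrm{Bill}) \geq d)]
```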

To say that one can analyze the meaning in terms of reference to degrees doesn’t entail that one must (for discussion, see Klein 1980). And even if one posits quantification over degrees, a further question is whether natural language has counterparts of pronouns that refer to degrees—if so, one would have an argument that natural language is committed to degrees. Importantly, this debate is logically independent from that about the existence of time- and world-denoting pronominals, as one may consistently believe that there are pronominals in one domain but not in the other. The question must be asked anew, and here too sign language brings important insights.

Degree-denoting pronouns exist in some constructions of Italian Sign Language (LIS; Aristodemo 2017). In (12) , the movement of the sign TALL ends at a particular location on the vertical axis (which we call α, to avoid confusion with the Latin characters used for individual loci in the horizontal plane), which intuitively represents Gianni’s degree of height. In the second sentence, the pronoun points towards this degree-denoting locus, and the rest of the sentence characterizes this degree of height.

More complicated examples can be constructed in which Ann is taller than Bill makes available two degree-denoting loci, one corresponding to Ann’s height and the other to Bill’s.

In sum, some constructions of LIS provide evidence for the existence of degree-denoting pronouns in sign language, which in turn suggests that natural language is sometimes committed to the existence of degrees. And if one grants that loci are the realization of variables, one obtains the further conclusion that natural language has at least some degree-denoting variables. (It is a separate question whether all languages avail themselves of degree variables; see Beck et al. 2009 for the hypothesis that this is a parameter of semantic variation.)

Finally, we observe that, unlike the examples of individual-denoting pronouns seen so far, the placement of degree pronouns along a particular axis is semantically interpreted, reflecting the total ordering of their denotations: not only are degrees visibly realized, but so is their ordering (see also Section 1.2 regarding the ordering of temporal loci on timelines in Chinese Sign Language, and Section 3.4 for a discussion of structural iconicity).

In sum, loci have been argued to be the overt realization of individual, time, world, and degree variables. If one grants this point, it follows that sign language is ontologically committed to these object types. But the debate is still ongoing, with alternatives that take loci to be similar to features rather than to variables. Let us add that the loci-as-variable analysis has offered a new argument in favor of dynamic semantics, where a variable can depend on a quantifier without being in its syntactic scope; see the supplement on Dynamic Loci .

2. Logical Visibility II: Beyond Loci

There are further cases in which sign language arguably makes visible some components of Logical Forms that are not always overt in spoken language.

Semanticists traditionally classify event descriptions as telic if they apply to events that have a natural endpoint determined by that description, and they call them atelic otherwise. Ann arrived and Mary understood have such a natural endpoint, e.g., the point at which Ann reached her destination, and that at which Mary saw the light, so to speak: arrive and understand are telic. By contrast, Ann waited and Mary thought lack such natural endpoints: wait and think are atelic. As a standard test (e.g., Rothstein 2004), a temporal modifier of the form in X time modifies telic VPs while for X time modifies atelic VPs (e.g., Ann arrived in five minutes vs. Ann waited for five minutes , Mary understood in a second vs. Mary thought for a second ).

Telicity is a property of predicates (i.e., verbs complete with arguments and modifiers), not of verbs themselves. Whether a predicate is telic or atelic may thus result from a variety of different factors; these include adverbial modifiers that explicitly identify an endpoint— run 10 kilometers (in an hour) is telic, but run back and forth (for an hour) is atelic—and properties of the nominal arguments— eat an apple (in two minutes) is telic whereas eat lentil soup (for two minutes) is atelic. But telicity also depends on properties of the lexical semantics of the verb itself, as illustrated by the intransitive verbs above, as well as transitive examples like found a solution (in an hour) versus look for a solution (for an hour) . In work on spoken language, some theorists have posited that these lexical factors can be explained by a morphological decomposition of the verb, and that inherently telic verbs like arrive or find include a morpheme that specifies the endstate resulting from a process (Pustejovsky 1991; Ramchand 2008). This morpheme has been called various things in the literature, including ‘EndState’ (Wilbur 2003) and ‘Res’ (Ramchand 2008).
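One common way of writing such a decomposition, given here as a hedged neo-Davidsonian sketch rather than as the specific proposal of any of the works cited:

```latex
% Inherently telic 'arrive': the event divides into a process part and a
% result-state part contributed by the Res/EndState morpheme.
[\![\text{arrive}]\!] = \lambda x\,\lambda e.\ \exists e_1 \exists e_2\,
[\, e = e_1 \oplus e_2 \ \wedge\ \mathrm{process}(e_1) \ \wedge\ \mathrm{res}(e_2) \ \wedge\ \mathrm{at\text{-}goal}(x)(e_2)\,]

% Inherently atelic 'wait': a homogeneous process with no result-state component,
% hence compatible with 'for X time' but not 'in X time'.
[\![\text{wait}]\!] = \lambda x\,\lambda e.\ \mathrm{wait}(e) \ \wedge\ \mathrm{agent}(e) = x
```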

In influential work, Wilbur (2003, 2008; Wilbur & Malaia 2008) has argued that the lexical factors related to telicity are often realized overtly in the phonology of several sign languages: inherently telic verbs tend to be realized with sharp sign boundaries; inherently atelic verbs are realized without them (Wilbur 2003, 2008; Wilbur & Malaia 2008). For instance, the ASL sign ARRIVE involves a sharp deceleration, as one hand makes contact with the other, as shown in (13) .

ARRIVE in ASL (telic): the open non-dominant hand is held palm-up at the lower chest; the dominant hand moves down toward it and stops abruptly on contact. (Picture credits: Valli, Clayton: 2005, The Gallaudet Dictionary of American Sign Language, Gallaudet University Press.)

In contrast, WAIT is realized with a trilled movement of the fingers and optional circular movement of the hands, without a sharp boundary:

WAIT in ASL (atelic): both open hands are held in front of the upper chest, palms facing the signer, with a trilled movement of the fingers and no sharp boundary.

Similarly, in LSF UNDERSTAND , which is telic, is realized with three fingers forming a tripod that ends up closing on the forehead; the closure is realized quickly, and thus displays a sharp boundary. By contrast, REFLECT, which is atelic, is realized by the repeated movement of the curved index finger towards the temple, without sharp boundaries.

COMPRENDRE (UNDERSTAND) in LSF: the thumb and first two fingers, initially apart, close so that their tips touch as they reach the forehead. RÉFLÉCHIR (REFLECT) in LSF: the curved index finger repeatedly touches the forehead. (Credits: La langue des signes – dictionnaire bilingue LSF-français, IVT 1986.)

Wilbur (2008) posits that in ASL and other sign languages, this phonological cue, the “rapid deceleration of the movement to a complete stop”, is an overt manifestation of the morpheme EndState, yielding inherently telic lexical predicates. If Wilbur’s analysis is correct, this is another possible instance of visibility of an abstract component of Logical Forms that is not usually overt in spoken languages. An alternative is that an abstract version of iconicity is responsible for this observation (Kuhn 2015), as we will see in Section 3.2 .

It is also possible that the analysis of this phonological cue varies across languages. Of note, both ASL and LSF have exceptions to the generalization (Davidson et al. 2019), for example ASL SLEEP is atelic but ends with deceleration and contact between the fingers; LSF RESIDE is similarly atelic but ends with deceleration and contact. In contrast, in Croatian Sign Language (HZJ), endmarking appears to be a regular morphological process, allowing a verb stem to alternate between an end-marked and non-endmarked form (Milković 2011).

In the classic analysis of indexicals developed by Kaplan (1989), the value of an indexical (words like I , here , and now ) is determined by a context parameter that crucially doesn’t interact with time and world operators (in other words, the context parameter is not shiftable). The empirical force of this idea can be illustrated by the distinction between I , an indexical, and the person speaking , which is indexical-free. The speaker is always late may, on one reading, refer to different speakers in different situations because speaker can be evaluated relative to a time quantified by always . Similarly, The speaker must be late can be uttered even if one has no idea who the speaker is supposed to be; this is because speaker can be evaluated relative to a world quantified by must . By contrast, I am always late and I must be late disallow such a dependency because I is dependent on the context parameter alone, not on time and world quantification. This analysis raises a question: are there any operators that can manipulate the context of evaluation of indexicals? While such operators can be defined for a formal language, Kaplan famously argued that they do not exist in natural language and called them, for this reason, ‘monsters’.

Against this claim, an operator of ‘context shift’ (a Kaplanian monster) has been argued to exist in several spoken languages (including Amharic and Zazaki). The key observation was that some indexicals can be shifted in the scope of some attitude verbs, and in the absence of quotation (e.g., Anand & Nevins 2004; Anand 2006; Deal 2020). Schematically, in such languages, Ann says that I am a hero can mean that Ann says that she herself is a hero, with I interpreted from Ann’s perspective. Several researchers have argued that context shift can be overt in sign language, and realized by an operation called ‘Role Shift’, whereby the signer rotates her body to adopt the perspective of another character (Quer 2005, 2013; Schlenker 2018a).
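Schematically, the contrast can be sketched as follows (a simplified rendering in the spirit of the works cited, not an exact reproduction of any of them):

```latex
% Kaplan's picture: indexicals depend only on the context parameter c, which
% ordinary operators cannot shift; descriptions depend on world/time instead.
[\![\mathrm{I}]\!]^{c,w,t} = \mathrm{speaker}(c)
\qquad
[\![\text{the person speaking}]\!]^{c,w,t} = \iota x\,[\,x \text{ is speaking in } w \text{ at } t\,]

% A context-shifting ('monstrous') operator, of the kind claimed to be overtly
% realized by Role Shift: material in its scope is evaluated against the reported
% context c', so a second person pronoun picks out the reported addressee rather
% than the signer's addressee.
[\![\mathrm{C}\ \varphi]\!]^{c,w,t} = [\![\varphi]\!]^{c',\,w,\,t}
\quad \text{where } c' \text{ is the context of the reported speech act or thought.}
```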

A simple example involves the sentence WIFE SAY IX-2 FINE , where the boldfaced words are signed from the rotated position, illustrated below. As a result, the rest of the sentence is interpreted from the wife’s perspective, with the consequence that the second person pronoun IX-2 refers to whoever the wife is talking to, and not the addressee of the signer.

An example of Role Shift in ASL ( Credits: Lillo-Martin 2012)

four pictures of the signer. The first two for 'WIFE SAY' looking one way then rotating to look another way to sign the last two 'IX-2 FINE'

WIFE SAY IX-2 FINE

Role Shift exists in several languages, and in some cases (notably in Catalan and German sign languages [Quer 2005; Herrmann & Steinbach 2012]), it has been argued not to involve mere quotation. On the context-shifting analysis of Role Shift (e.g., Quer 2005, 2013), the point at which the signer rotates her body corresponds to the insertion of a context-shifting operator C, yielding the representation: WIFE SAY C [ IX-2 FINE ]. The boldfaced words are signed in rotated position and are taken to be interpreted in the scope of C.

Interestingly, Role Shift differs from context-shifting operations described in speech in that it can be used outside of attitude reports (to distinguish the two cases, researchers use the term ‘Attitude Role Shift’ for the case we discussed before, and ‘Action Role Shift’ for the non-attitudinal case). For example, if one is talking about an angry man who has been established at locus a , one can use an English-strategy and say IX-a WALK-AWAY to mean that he walked away. But an alternative possibility is to apply Role Shift after the initial pointing sign, and say the following (with the operator C realized by the signer’s rotation):

Here 1-WALK-AWAY is a first person version of ‘walk away’, but the overall meaning is just that the angry person associated with locus a walked away. By performing a body shift and adopting that person’s position to sign 1-WALK-AWAY , the signer arguably makes the description more vivid.

Importantly, in ASL and LSF, Role Shift interacts with iconicity. Attitude Role Shift has, at a minimum, a strong quotational component. For instance, angry facial expressions under Role Shift must be attributed to the attitude holder, not to the signer (Schlenker 2018a). This observation extends to ASL and LSF Action Role Shift: disgusted facial expressions under Action Role Shift are attributed to the agent rather than to the signer.

As in other cases of purported Logical Visibility, the claim that Role Shift is the visible reflex of an operation that is covert in speech has been challenged. In the analysis of Davidson (2015, following Supalla 1982), Role Shift falls under the category of classifier predicates, specific constructions of sign language that are interpreted in a highly iconic fashion (we discuss them in Section 4). What is special about Role Shift is that the classifier is not signed with a hand (as other classifiers are), but with the signer’s own body. The iconic nature of this classifier means that properties of role-shifted expressions that can be iconically assigned to the described situation must be. For Attitude Role Shift, the analysis essentially derives a quotational reading via a demonstration—for our example above, something like: My wife said this, “You are fine”. Cases of Attitude Role Shift that have been argued not to involve standard quotation (in Catalan and German sign languages) require refinements of the analysis. For Action Role Shift, the operation has the effect of demonstrating those parts of signs that are not conventional, yielding in essence: He walked away like this, where this refers to all iconically interpretable components of the role-shifted construction (including the signer’s angry expression, if applicable).
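In rough outline (our simplification, not Davidson’s exact formulation), the attitude verb takes a demonstration argument, and the role-shifted material supplies that demonstration:

```latex
% Attitude Role Shift as quotation-like demonstration (simplified sketch): d is a
% demonstration (a reproduction of relevant properties) of the reported speech event.
[\![\textsc{say}]\!] = \lambda d\,\lambda x\,\lambda e.\
\mathrm{say}(e) \ \wedge\ \mathrm{agent}(e) = x \ \wedge\ \mathrm{demonstration}(d, e)

% 'WIFE SAY RS[IX-2 FINE]' then asserts, roughly, that there is a saying event by
% the wife which the role-shifted signing demonstrates. For Action Role Shift, only
% the iconically interpretable components of the role-shifted material (facial
% expression, manner of movement) are demonstrated.
```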

Debates about Role Shift have two possible implications. On one view, Role Shift provides overt evidence for context shift, and extends the typology of Kaplanian monsters beyond spoken languages and beyond attitude operators (due to the existence of Action Role Shift alongside Attitude Role Shift). On the alternative view developed by Davidson, Role Shift suggests that some instances of quotation should be tightly connected to a broader analysis of iconicity owing to the similarity between Attitude Role Shift and Action Role Shift.

In sum, it has been argued that telicity and context shift can be overtly marked in sign language, hence instances of Logical Visibility beyond loci, but alternative accounts exist as well. Let us add that there are cases of Logical non-visibility, in which logical elements that are often overt in speech can be covert in sign. See the supplement on Coordination for the case of an ASL construction ambiguous between conjunction and disjunction.

3. Iconicity I: Optional Iconic Modulations

On a standard (Saussurean) view, language is made of discrete conventional elements, with iconic effects at the margins. Sign languages cast doubt on this view because iconicity interacts in complex and diverse ways with grammar.

By iconicity, we mean a rule-governed way in which an expression denotes something by virtue of its resemblance to it, as is for instance the case of a photograph of a cat, or of a vocal imitation of a cat call. By contrast, the conventional word cat does not refer by resembling cats. There are also mixed cases in which an expression has both a conventional and an iconic component.

In this section and the next, we survey constructions that display optional or obligatory iconicity in sign, and call for the development of a formal semantics with iconicity. As we will see, some purported cases of Logical Visibility might be better analyzed as more or less abstract versions of iconicity, with the result that several phenomena discussed above can be analyzed from at least two theoretical perspectives.

As in spoken language, it is possible to modulate some conventional words in an iconic fashion. In English, the talk was looong suggests that the talk wasn’t just long but very long. Similarly, the conventional verb GROW in ASL can be realized more quickly to evoke a faster growth, and with broader endpoints to suggest a larger growth, as is illustrated below (Schlenker 2018b). There are multiple potential levels of speed and endpoint breadth, which suggests that a rule is genuinely at work in this case.

(18) Different iconic modulations of the sign GROW in ASL (Picture credits: M. Bonnet). The six panels show the sign realized for a small, medium, or large amount, each signed either slowly or quickly.

In English, iconic modulations can arguably be at-issue and thus interpreted in the scope of grammatical operators. An example is the following sentence: If the talk is loooong, I’ll leave before the end. This means that if the talk is very long, I’ll leave before the end (but if it’s only moderately long, maybe not); here, the iconic contribution is interpreted in the scope of the if-clause, just like normal at-issue contributions. The iconic modulation of GROW has similarly been argued to be at-issue (Schlenker 2018b). (See Section 5.2 for further discussion on at-issue vs. non-at-issue semantic contributions.)

While conceptually similar to iconic modulations in English, the sign language versions are arguably richer and more pervasive than their spoken language counterparts.

Iconic modulation interacts with the marking of telicity noticed by Wilbur (Section 2.1). GROW, discussed in the preceding sub-section, is an (atelic) degree achievement; the iconic modifications above indicate the final degree reached and the time it took to reach that degree. Similarly, for telic verbs, the speed and manner in which the phonological movement reaches its endpoint can indicate the speed and manner in which the result state is reached. For example, if LSF UNDERSTAND is realized slowly and then quickly, the resulting meaning is that there was a difficult beginning, and then an easier conclusion. Atelic verbs that don’t involve degrees can also be iconically modulated; for instance, if LSF REFLECT is signed slowly and then quickly, the resulting meaning is that the person’s reflection intensified. Here too, the iconic contribution has been argued to be at-issue (Schlenker 2018b).

There are also cases in which the event structure is not just specified but radically altered by a modulation, as in the case of incompletive forms (also called unrealized inceptive, Liddell 1984; Wilbur 2008). ASL DIE, a telic verb, is expressed by turning the dominant hand palm-down to palm-up as shown below (the non-dominant hand turns palm-up to palm-down). If the hands only turn partially, the sign is roughly interpreted as ‘almost die’.

Normal vs. incompletive form of DIE in ASL (Credits: J. Kuhn):

a. DIE in ASL: the non-dominant hand shown in its before (palm up) and after (palm down) positions.

b. ALMOST-DIE in ASL: the non-dominant hand shown in its before (palm vertical) and after (palm down) positions.

Similarly to the fact that multiple levels of speed and size can be indicated by the verb GROW in (18) , the incompletive form of verbs can be modulated to indicate arbitrarily many degrees of completion, depending on how far the hand travels; these examples thus seem to necessitate an iconic rule (Kuhn 2015). On the other hand, while the examples with GROW can be analyzed by simple predicate modification (‘The group grew and it happened like this: slowly’), examples of incompletive modification require a deeper integration in the semantics, similar to the semantic analysis of the adverb almost or the progressive aspect in English. (Notably, it’s nonsense to say: ‘My grandmother died and it happened like this: incompletely.’)
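For concreteness, the predicate-modification analysis of the GROW examples can be rendered schematically as follows; the notation, in particular the predicate iconic-manner and the parameter Φ standing for the phonetic realization of the token, is merely illustrative and not part of the proposals cited above:

\[ [\![\textit{GROW-mod}_{\Phi}]\!]^{t,w} \;=\; \lambda e.\; \textit{grow}_{t,w}(e) \,\wedge\, \textit{iconic-manner}_{t,w}(e, \Phi) \]

On such a treatment, the iconic clause merely adds a further condition on the growing event. Incompletive forms resist this treatment precisely because the iconic component would have to alter, rather than merely further specify, the lexical meaning.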

The key theoretical question lies in the integration between iconic and conventional elements in such cases. If one posits a decompositional analysis involving a morpheme representing the endstate (EndState or Res, see Section 2.1), one must certainly add to it an iconic component (with a non-trivial challenge for incompletive forms, where the iconic component does not just specify but radically alters the lexical meaning). Alternatively, one may posit that a structural form of iconicity is all one needs, without morphemic decomposition. An iconic analysis along these lines has been proposed (Kuhn 2015: Section 6.5), although a full account has yet to be developed.

The logical notion of plurality is expressed overtly in some way in many of the world’s languages: pluralizing operations may apply to nouns or verbs to indicate a plurality of objects or events (for nouns: ‘plurals’; for verbs: ‘pluractionals’). Historically, arguments of Logical Visibility have not been made for plurals in sign languages, since—while overt plural marking certainly exists in sign language—plural morphemes also appear overtly in spoken languages (e.g., English singular horse vs. plural horses ).

Nevertheless, mirroring areas of language in which arguments of Logical Visibility do apply, plural formation in sign language shows a number of unique and revealing properties. First, the morphological expression of this logical concept is similar for both nouns and verbs across a large number of unrelated sign languages: for both plural nouns (Pfau & Steinbach 2006) and pluractional verbs (Kuhn & Aristodemo 2017), plurality is expressed by repetition. We note that repetition-based plurals and pluractionals also exist in speech (Sapir 1921: 79).

Second, in sign language, these repeated plural forms have been shown to feed iconic processes. Modifications of the way in which the sign is repeated may indicate the number of objects or events, or may indicate the arrangement of these pluralities in space or time. Relatedly, so-called ‘punctuated’ repetitions (with clear breaks between the iterations) refer to precise plural quantities (e.g., three objects or events for three iterations), while ‘unpunctuated’ repetitions (without clear breaks between the iterations) refer to plural quantities with vague thresholds, and often ‘at least’ readings (Pfau & Steinbach 2006; Schlenker & Lamberton 2022).

In the nominal domain, the number of repetitions may provide an indication of the number of objects, and the arrangement of the repetitions in signing space can provide a pictorial representation of the arrangement of the denotations in real space (Schlenker & Lamberton 2022). For instance, the word TROPHY can be iterated three times on a straight line to refer to a group of trophies that are horizontally arranged; or the three iterations can be arranged as a triangle to refer to trophies arranged in a triangular fashion. A larger number of iterations serves to refer to larger groups. Here too, the iconic contribution can be at-issue and thus be interpreted in the scope of logical operators such as if-clauses.

TROPHY in ASL, repetition on a line: the TROPHY sign repeated to the signer’s upper left, upper front, and upper right.

TROPHY in ASL, repetition as a triangle: the TROPHY sign repeated to the signer’s middle left, upper front, and middle right.

(Credits: M. Bonnet)

Punctuated (= easy to count) repetitions yield meanings with precise thresholds (often with an ‘exactly’ reading, e.g., ‘exactly three trophies’ for three punctuated iterations); unpunctuated repetitions yield vague thresholds and often ‘at least’ readings (e.g., ‘several trophies’ for three unpunctuated iterations). While one may take the distinction to be conventional, it might have an iconic source. In essence, unpunctuated iterations result in a kind of pictorial vagueness on which the threshold is hard to discern; deriving the full range of ‘exactly’ and ‘at least’ readings is non-trivial, however (Schlenker & Lamberton 2022).

In the verbal domain, pluractionals (referring to pluralities of events) can be created by repeating a verb, for instance in LSF and ASL. A complete analysis seems to require both conventionalized grammatical components and iconic components. The form of reduplication—as identical reduplication or as alternating two-handed reduplication—appears to conventionally communicate the distribution of events with respect to either time or participants. But a productive iconic rule also appears to be involved, as the number and speed of the repetitions gives an idea of the number and speed of the denoted events (Kuhn & Aristodemo 2017); again, the iconic contribution can be at-issue.

Iconic plurals and pluractionals alike are now treated by way of mixed lexical entries that include a grammatical/logical component and an iconic component. For instance, if N is a (singular) noun denoting a set of entities S, then the iconic plural N-rep denotes the set of entities x such that:

  • (i) x is the sum of atomic elements in S (i.e., \(x \in *S\)), and
  • (ii) the form of N-rep iconically represents x.

Condition (i) is the standard definition of a plural; condition (ii) is the iconic part, which is itself in need of an elucidation using general tools of pictorial semantics (see Section 4 ).
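Putting the two conditions together, one schematic rendering of such a mixed entry is the following, where the predicate icon abbreviates the iconic clause that pictorial semantics is meant to spell out (the notation is illustrative):

\[ [\![N\text{-rep}]\!]^{c,t,w} \;=\; \lambda x.\; x \in {*}S \,\wedge\, \textit{icon}_{c,t,w}(\text{the form of } N\text{-rep},\, x) \]

Pluractional verbs receive parallel entries, with pluralities of events in place of pluralities of individuals.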

Loci, which have been hypothesized to be (sometimes) the overt realization of variables, can lead a dual life as iconic representations. Singular loci may (but need not) be simplified pictures of their denotations: if so, a person-denoting locus is a structured area I , and pronouns are directed towards a point i that corresponds to the upper part of the body. In ASL, when the person is tall, one can thus point upwards (there are also metaphorical cases in which one points upwards because the person is powerful or important). When a person is understood to be in a rotated position, the direction of the pronoun correspondingly changes, as seen in (21) for a person standing upright or hanging upside down (Schlenker 2018a; see also Liddell 2003).


Iconic mappings involving loci may also preserve abstract structural relations that have been posited to exist for various kinds of ontological objects, including mereological relations, total orderings, and domains of quantification.

First, two plural loci—indexed over areas of space—may (but need not) express mereological relations diagrammatically, with a locus a embedded in a locus b if the denotation of a is a mereological part of the denotation of b (Schlenker 2018a). For example, in (22) , the ASL expression POSS-1 STUDENT (‘my students’) introduces a large locus (glossed as ab to make it clear that it contains subloci a and b —but initially just a large locus). MOST introduces a sublocus a within this large locus because the plurality denoted by a is a proper part of that denoted by ab . And critically, diagrammatic reasoning also makes available a third discourse referent: when a plural pronoun points towards b —the complement of the sublocus a within the large locus ab —the sentence is acceptable, and b is understood to refer to the students who did not come to class.

In English, the plural pronoun they clearly lacks such a reading when one says, Most of my students came to class. They stayed home , which sounds contradictory. (One can communicate the target interpretation by saying, The others stayed home , but the others is not a pronoun.) Likewise, in ASL, if the same discourse is uttered using default, non-localized plural pronouns, the pattern of inferences is exactly identical to the English translation.
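The diagrammatic condition at work in the ASL example can be stated schematically as follows, where area(·) assigns each locus its region of signing space, and ⊑ and ⊖ stand for mereological parthood and complementation (the notation is purely illustrative):

\[ \text{area}(a) \subseteq \text{area}(ab) \;\Rightarrow\; [\![a]\!] \sqsubseteq [\![ab]\!] \qquad\text{and}\qquad [\![b]\!] \;=\; [\![ab]\!] \ominus [\![a]\!] \]

The second clause is what makes the complement-set reading available for the localized plural pronoun, in contrast with English they and with non-localized plural pronouns in ASL.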

A second case of preservation of abstract orders pertains to degree-denoting and sometimes time-denoting loci. In LIS, degree-denoting loci are represented iconically, with the total ordering mapped to an axis in space, as described in Section 1.3 . Time-denoting loci may but need not give rise to preservation of ordering on an axis, depending on whether normal signing space is used (as in the ASL examples (9) above), or a specific timeline, as mentioned in Section 1.2 in relation to Chinese Sign Language. As in the case of diagrammatic plural pronouns, the spatial ordering of degree- and time-denoting loci generates an iconic inference—beyond the meaning of the words themselves—about the relative degree of a property or temporal order of events.
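The common thread in these cases can be stated as a structure-preservation condition, given here in purely schematic notation: if two degree- or time-denoting loci a and b are arranged along an axis, their spatial order is mapped onto the order of their denotations,

\[ \text{pos}(a) \prec_{\text{space}} \text{pos}(b) \;\Rightarrow\; [\![a]\!] \prec [\![b]\!] \]

where ≺ is the relevant ordering of degrees or times.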

A third case involves the partial ordering of domain restrictions of nominal quantifiers: greater height in signing space may be mapped to a larger domain of quantification, as is the case in ASL (Davidson 2015) and, at least for indefinite pronouns, in Catalan Sign Language (Barberà 2015).

4. Iconicity II: Classifier Predicates

A special construction type, classifier predicates (‘classifiers’ for short), raises difficult conceptual questions because it combines conventional and iconic meaning. Classifier predicates are lexical expressions that refer to classes of animate or inanimate entities that share some physical characteristics—e.g., objects with a flat surface, cylindrical objects, upright individuals, sitting individuals, etc. Their form is conventional; for instance, the ASL ‘three’ handshape, depicted below, represents a vehicle. But their position, orientation and movement in signing space are interpreted iconically and gradiently (Emmorey & Herzig 2003), as illustrated in the translation of the example below.

(24) [Illustration: right hand with thumb, index, and middle fingers extended moves right to left across the chest]

‘A car drove by [with a movement resembling that of the hand].’

These constructions have on several occasions been compared to gestures in spoken language, especially to gestures that fully replace some words, as in: This airplane is about to FLY-take-off, with the verb replaced with a hand gesture representing an airplane taking off. But there is an essential difference: classifier predicates are stable parts of the lexicon, while gestures are not.

Early semantic analyses, notably by Zucchi 2011 and Davidson 2015, took classifier predicates to have a self-referential demonstrative component, with the result that the moving vehicle classifier in (24) means in essence ‘move like this’, where ‘this’ makes reference to the very form of the classifier movement. As mentioned in Section 2.2, this analysis has been extended to Role Shift by Davidson (2015), who took the classifier to be, in this case, the signer’s rotated body.

The demonstrative analysis of classifier predicates as stated has two general drawbacks. First, it establishes a natural class containing classifiers and demonstratives (like English this ), but the two phenomena possibly display different behaviors. Notably, while demonstratives behave roughly like free variables that can pick up their referent from any of a number of contextual sources, the iconic component of classifiers can only refer to the position/movement and configuration of the hand (any demonstrative variable is thus immediately saturated). Second, the demonstrative analysis currently relegates the iconic component to a black box. Without any interpretive principles on what it means for an event to be ‘like’ the demonstrated representation, one cannot provide any truth conditions for the sentence as a whole.

Any complete analysis must thus develop an explicit semantics for the iconic component. This is more generally necessary to derive explicit truth conditions from other iconic constructions in sign language, such as the repetition-based plurals discussed in Section 3.3 above: in the metalanguage, the condition [a certain expression] iconically represents [a certain object] was in need of explication.

A recent model has been offered by formal pictorial semantics, developed by Greenberg and Abusch (e.g., Greenberg 2013, 2021; Abusch 2020). The basic idea is that a picture obtained by a given projection rule (for instance, perspective projection) is true of precisely those situations that can project onto the picture. Greenberg has further extended this analysis with the notion of an object projecting onto a picture part (in addition to a situation projecting onto a whole picture). This notion proves useful for sign language applications because they usually involve partial iconic representations, with one iconic element representing a single object or event in a larger situation. To illustrate, below, the picture in (25a) is true of the situation in (25b) , and the left-most shape in the picture in (25a) denotes the top cube in the situation in (25b) .

(25) Illustration of a projection rule relating (parts of) a picture to (objects in) a world (Credits: Gabriel Greenberg): (a) Picture; (b) Situation.

The full account makes reference to a notion of viewpoint relative to which perspective projection is assessed, and a picture plane, both represented in (25b) . This makes it possible to say that the top cube (top-cube) projects onto the left-hand shape (left-shape) relative to the viewpoint (call it π), at the time t and in the world w in which the projection is assessed. In brief:
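In purely schematic notation (the predicate project is illustrative shorthand, not the official formulation):

\[ \textit{project}_{t,w}(\textit{top-cube},\, \textit{left-shape},\, \pi) \]

i.e., relative to viewpoint π, the top cube projects onto the left-most shape at the time and world at which the projection is assessed.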

Classifier predicates (as well as other iconic constructions, such as repetition-based plurals) may be analyzed with a version of pictorial semantics to explicate the truth-conditional contribution of iconic elements.

To illustrate, consider a pair of minimally different words in ASL that can be translated as ‘airplane’: one is a normal noun, glossed as PLANE, and the other is a classifier predicate, glossed below as PLANE-cl. Both forms involve the handshape in (26), but the normal noun includes a tiny repetition (different from that of plurals) which is characteristic of some nominals in ASL. As we will see, the position of the classifier version is interpreted iconically (‘an airplane in position such and such’), whereas the nominal version need not be.

(26) Handshape for both (i) ASL PLANE (= nominal version) and (ii) ASL PLANE-cl (= classifier predicate version): thumb, index, and little fingers extended, the other two closed. (Credits: J. Kuhn)

Semantically, the difference between the two forms is that only the classifier generates obligatory iconic inferences about the plane’s configuration and movement. This has clear semantic consequences when several classifier predicates occur in the same sentence. In (27b) , two tokens of PLANE-cl appear in positions a and b , and as the video makes clear, the two classifiers are signed close to each other and in parallel. As a result, the sentence only makes an assertion about cases in which two airplanes take off next to each other or side by side. In contrast, with a normal noun in (27a) , the assertion is that there is danger whenever two airplanes take off at the same time, irrespective of how close the two airplanes are, or how they are oriented relative to each other.

(ASL, 35, 1916, 4 judgments; short video clip of the sentences, no audio)

To capture these differences, one can posit the following lexical entries for the normal noun and for its classifier predicate version. Importantly, the interpretation of PLANE-cl in (28b) is defined for a particular token of the sign (not a type), produced with a phonetic realization Φ.

Evaluation is relative to a context c that provides the viewpoint, \(\pi_{c}\). In the lexical entry for the normal noun in (28a), \(\textit{plane}'_{t,w}\) is a (metalanguage) predicate of individuals that applies to anything that is an airplane at t in w. The classifier predicate has the lexical entry in (28b). It has the same conventional component as the normal noun, but adds to it an (iconic) projective condition: for a token of the predicate PLANE-cl to be true of an object x, x should project onto this very token.
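Based on this prose description, the two entries in (28) can be reconstructed schematically as follows (the exact notation, in particular the projection predicate project, is illustrative):

\[ \text{(28a)}\quad [\![\textit{PLANE}]\!]^{c,t,w} \;=\; \lambda x.\; \textit{plane}'_{t,w}(x) \]
\[ \text{(28b)}\quad [\![\textit{PLANE-cl}_{\Phi}]\!]^{c,t,w} \;=\; \lambda x.\; \textit{plane}'_{t,w}(x) \,\wedge\, \textit{project}_{t,w}(x,\, \Phi,\, \pi_{c}) \]

The second conjunct of (28b) is the projective condition that generates the obligatory iconic inferences about the plane’s position and movement.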

With this pictorial semantics in hand, we can make a more explicit comparison to the demonstrative analysis of classifiers. As described above, a demonstrative analysis takes classifiers to include a component of meaning akin to ‘move like this’. For Zucchi, this is spelled out via a lexical entry very close to the one in (28), but in which the second clause (in terms of projection above) is instead a similarity function, asserting that the position of the denoted object x is ‘similar’ to that of the airplane classifier; the proposal, however, leaves it entirely open what it means to be ‘similar’. Of course, one may supplement the analysis with a separate explication in which similarity is defined in terms of projection, but this move presupposes rather than replaces an explicit pictorial account. In other words, the demonstrative analysis relegates the iconic component to a black box, whose content can be specified by the pictorial analysis. But once a pictorial analysis is posited, it becomes unclear why one should make a detour through the demonstrative component, rather than posit pictorial lexical entries in the first place.
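By comparison, a similarity-based entry in the spirit of the demonstrative analysis would replace the projective clause with an unanalyzed similarity condition, along the following schematic lines (again, the notation is illustrative):

\[ [\![\textit{PLANE-cl}_{\Phi}]\!]^{c,t,w} \;=\; \lambda x.\; \textit{plane}'_{t,w}(x) \,\wedge\, \textit{sim}_{t,w}(x,\, \Phi) \]

The work of the pictorial analysis is then hidden inside sim, which is precisely the black box at issue.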

A number of further refinements need to be made to any analysis of classifiers. First, to have a fully explicit iconic semantics, one must contend with several differences between classifiers and pictures.

  • Classifier predicates have a conventional shape; only their position, orientation and movement are interpreted iconically (sometimes modifications of the conventional handshape can be interpreted iconically as well). This requires projection rules with a partly conventional component (of the type: a certain symbol appears in a certain position of the picture if an object of the right type projects onto that position).
  • Many classifier predicates are dynamic (in the sense of involving movement) rather than static; this requires the development of a semantics for visual animations.
  • Sign language classifiers are not two-dimensional pictures, but rather 3D representations. One can think of them as puppets whose shape needn’t be interpreted literally, but whose position, orientation and movement can be iconically precise. This requires formal means that go beyond pictorial semantics.

The interaction between iconic representations and the sentences they appear in also requires further refinements. A first refinement pertains to the semantics. For simplicity, we assumed above that the viewpoint relative to which the iconic component of classifier predicates is evaluated is fixed by the context. Notably, though, in some cases, viewpoint choice can be dependent on a quantifier. In the example below, the meaning obtained is that in all classes, during the break, for some salient viewpoint π associated with the class , there is a student who leaves with the movement depicted relative to π; a recent proposal (Schlenker and Lamberton forthcoming) has viewpoint variables in the object language, and they may be left free or bound by default existential quantifiers, as illustrated in (30) . (While there is a strong intuition that Role Shift manipulates viewpoints as well, a formal account has yet to be developed.)
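A schematic logical form for the quantified example, with a viewpoint variable bound in the scope of the universal quantifier (the predicates and variable names are merely illustrative), would be along the following lines:

\[ \forall c\, \big[\textit{class}(c) \rightarrow \exists \pi\, \exists x\, [\textit{student}(x, c) \,\wedge\, \textit{leave}_{\pi}(x)]\big] \]

on the reading where the viewpoint relative to which the classifier’s movement is assessed co-varies with the class.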

A second refinement pertains to the syntax. Across sign languages, classifier constructions have been shown to sometimes override the basic word order of the language; for instance, ASL normally has the default word order SVO (Subject Verb Object), but classifier predicates usually prefer preverbal objects instead. One possible explanation is that the non-standard syntax of classifiers arises at least in part from their iconic semantics; we revisit this point in Section 5.3 .

5. Sign with Iconicity versus Speech with Gestures

The iconic contributions discussed above are to some extent different from those found in speech. Iconic modulations exist in speech (e.g., looong means ‘very long’) but are probably less diverse than those found in sign. Repetition-based plurals and pluractionals exist in speech (Rubino 2013), and it has been argued that, for pluractional ideophones in some languages, the number of repetitions can reflect the number of denoted events (Henderson 2016). But sign language repetitions can iconically convey a particularly rich amount of information, including through their punctuated or unpunctuated nature, and sometimes their arrangement in space (Schlenker & Lamberton 2022). As for iconic pronouns and classifier predicates, they simply have no clear counterparts in speech. From this perspective, speech appears to be ‘iconically deficient’ relative to sign.

But Goldin-Meadow and Brentari (2017) have argued that a typological comparison between sign language and spoken language makes little sense if it does not take gestures into account: sign with iconicity should be compared to speech with gestures rather than to speech alone, since gestures are the main exponent of iconic enrichments in spoken language. This raises a question: From a semantic perspective, does speech with gesture have the same expressive effect and the same grammatical integration as sign with iconicity?

This question has motivated a systematic study of iconic enrichments across sign and speech, and has led to the discovery of fine-grained differences (Schlenker 2018b). The key issue pertains to the place of different iconic contributions in the typology of inferences, which includes at-issue contributions and non-at-issue ones, notably presuppositions and supplements (the latter are the semantic contributions of appositive relative clauses).

While detailed work is still limited, several iconic constructions in sign language have been argued to make at-issue contributions (sometimes alongside non-at-issue ones). This is the case of iconic modulations of verbs, as for GROW in (18) , of repetition-based plurals and pluractionals, and of classifier predicates.

By contrast, gestures that accompany spoken words have been argued in several studies (starting with the pioneering one by Ebert & Ebert 2014 – see Other Internet Resources) to make primarily non-at-issue contributions. Recent typologies (e.g., Schlenker 2018b; Barnes & Ebert 2023) distinguish between co-speech gestures, which co-occur with the spoken words they modify (a slapping gesture co-occurs with punish in (31a)); post-speech gestures, which follow the words they modify (the gesture follows punish in (31b)); and pro-speech gestures, which fully replace some words (the slapping gesture has the function of a verb in (31c)).

[Illustration: slapping gesture]

When different tests are applied, such as embedding under negation, these three types display different semantic behaviors. Co-speech gestures have been argued to trigger conditionalized presuppositions, as in (32a) . Post-speech gestures have been argued to display the behavior of appositive relative clauses, and in particular to be deviant in some negative environments, as illustrated in (32b)–(32b′); in addition, post-speech gestures, just like appositive relative clauses, usually make non-at-issue contributions.

(Picture credits: M. Bonnet)

Only pro-speech gestures, as in (32c), make at-issue contributions by default (possibly in addition to other contributions). In this respect, they ‘match’ the behavior of iconic modulations, iconic plurals and pluractionals, and classifier predicates. But unlike these, pro-speech gestures are not words and are correspondingly expressively limited. For instance, abstract psychological verbs UNDERSTAND (= (15a)) and especially REFLECT (= (15b)) can be modulated in rich iconic ways in LSF—e.g., if the hand movement of REFLECT starts slow and ends fast, this conveys that the reflection intensified (Schlenker 2018a). But there are no clear pro-speech gestures with the same abstract meanings, and thus one cannot hope to emulate with pro-speech gestures the contributions of UNDERSTAND and REFLECT, including when they are enriched by iconic modulations.

In sum, while the reintegration of gestures into the study of speech opens new avenues of comparison between sign with iconicity and speech with gestures, one shouldn’t jump to the conclusion that these enriched objects display precisely the same semantic behavior.

Unlike gestures in general and pro-speech gestures in particular, classifier predicates have a conventional form (only the position, orientation, and movement are iconically interpreted, accompanied in limited cases by aspects of the handshape). But there are still striking similarities between pro-speech gestures and classifier predicates.

First, on a semantic level, the iconic semantics sketched for classifier predicates in Section 4.2 seems useful for pro-speech gestures as well, sometimes down to the details—for instance, it has been argued that the dependency between viewpoints and quantifiers illustrated in (30) has a counterpart with pro-speech gestures (Schlenker & Lamberton forthcoming).

Second, on a syntactic level, classifier predicates often display a different word order from other constructions, something that has been found across languages (Pavlič 2016). In ASL, the basic word order is SVO, but preverbal objects are usually preferred if the verb is a classifier predicate, for instance one that represents a crocodile moving and eating up a ball (as is standard in syntax, the ‘basic’ or underlying word order may be modified on independent grounds by further operations, for instance ones that involve topics and focus; we are not talking about such modifications of the word order here).

It has been proposed that the non-standard word order is directly related to the iconic properties of classifier predicates. The idea is that these create a visual animation of an action, and preferably take their arguments in the order in which their denotations are visible (Schlenker, Bonnet et al. 2024; see also Napoli, Sutton Spence, and Müller de Quadros 2017). One would typically see a ball and a crocodile before seeing the eating, hence the preference for preverbal objects (note that the subject is preverbal anyway in ASL). A key argument for this idea is that when one considers a minimally different sentence involving a crocodile spitting out a ball it had previously ingested, SVO order is regained, in accordance with the fact that an observer would see the object after the action in this case.

Strikingly, these findings carry over to pro-speech gestures. Goldin-Meadow et al. (2008) famously noted that when speakers of languages with diverse word orders are asked to use pantomime to describe an event with an agent and a patient, they tend to go with SOV order, including if this goes against the basic word order of their language (as is the case in English). Similarly, pre-verbal objects are preferred in sequences of pro-speech gestures in French (despite the fact that the basic word order of the language is SVO); this is for instance the case for a sequence of pro-speech gestures that means that a crocodile ate up a ball. Remarkably, with spit-out-type gestural predicates, an SVO order is regained, just as is the case with ASL classifier predicates (Schlenker, Bonnet, et al. 2024, following in part Christensen, Fusaroli, & Tylén 2016; Napoli, Mellon, et al. 2017; Schouwstra & de Swart 2014). This suggests that iconicity, an obvious commonality between the two constructions, might indeed be responsible for the non-standard word order.

6. Universal Properties of the Signed Modality

Properties discussed above include: (i) the use of loci to realize anaphora, (ii) the overt marking of telicity and (possibly) context shift, (iii) the presence of rich iconic modulations interacting with event structure, plurals and pluractionals, and anaphora, (iv) the existence of classifier predicates, which have both a conventional and an iconic dimension. Although the examples above involve a relatively small number of languages, it turns out that these properties exist in several and probably many sign languages. Historically unrelated sign languages are thus routinely treated as a ‘language family’ because they share numerous properties that are not shared by spoken languages (Sandler & Lillo-Martin 2006). Of course, this still allows for considerable variation across sign languages, for instance with respect to word order (e.g., ASL is SVO, LIS is SOV).

Cases of convergence also exist in language emergence. Homesigners are deaf individuals who are not in contact with an established sign language and thus develop their own gesture systems to communicate with their families. While homesigners do not invent a sign language, they sometimes discover on their own certain properties of mature sign languages. Loci and repetition-based plurals are cases in point (Coppola & So 2006; Coppola et al. 2013). Strikingly, Coppola and colleagues (2013) showed in a production experiment that a group of homesigners from Nicaragua used both punctuated and unpunctuated repetitions, with the kinds of semantic distinctions found in mature sign language. Coppola et al. further

examined a child homesigner and his hearing mother, and found that the child’s number gestures displayed all of the properties found in the adult homesigners’ gestures, but his mother’s gestures did not. (Coppola, Spaepen, & Goldin-Meadow 2013: abstract)

This provided clear evidence that this homesigner had invented this strategy of plural-marking.

In sum, there is striking typological convergence among historically unrelated sign languages, and homesigners can in some cases discover grammatical devices found in mature sign languages.

It is arguably possible to have non-signers discover on the fly certain non-trivial properties of sign languages (Strickland et al. 2015; Schlenker 2020). One procedure involves hybrids of words and gestures. We saw a version of this in Section 5.3 , when we discussed similarities between pro-speech gestures and classifier predicates. The result was that along several dimensions, notably word order preferences and associated meanings, pro-speech gestures resemble ASL classifier predicates (they also differ from them in not having lexical forms).

More generally, hybrid sequences of words and gestures suggest that non-signers sometimes have access to a gestural grammar somewhat reminiscent of sign languages. (It goes without saying that there is no claim whatsoever that non-signers know the sophisticated grammars of sign languages, any more than a naive monolingual English speaker knows the grammar of Mandarin or Hebrew.) In one experimental study (summarized in Schlenker 2020), gestures with a verbal meaning, such as ‘send kisses’, targeted different positions, corresponding to the addressee or some third person, as illustrated below.

a. ‘send kisses to you’: face with mouth in a kiss and a hand gesturing forward, with the face facing front.

b. ‘send kisses to him/her’: face with mouth in a kiss and a hand gesturing forward, but the whole oriented to the side.

(Credits: J. Kuhn)

The conditions in which these two forms can be used turn out to be reminiscent of the behavior of the agreement verb TELL in ASL: in (5) , the verb could target the addressee position to mean I tell you , or some position to the side to mean I tell him/her . The study showed that non-signers immediately perceived a distinction between the second person object form and the third person object form of the gestural verb, despite the fact that English, unlike ASL, has no object agreement markers. In other words, non-signers seemed to treat the directionality of the gestural verb as a kind of agreement marker. More fine-grained properties of the ASL object agreement construction were tested with gestures, again with positive results.

More broadly, it has been argued that aspects of gestural grammar resemble the grammar of ASL in designated cases involving loci, repetition-based plurals and pluractionals, Role Shift, and telicity marking (e.g., Schlenker 2020 and references therein). These findings have yet to be confirmed with experimental means, but if they are correct, the question is why.

We have seen three cases of convergence in the visual modality: typological convergence among unrelated sign languages, homesigners’ ability to discover designated aspects of sign language grammar, and possibly the existence of a gestural grammar somewhat reminiscent of sign language in designated cases. None of these cases should be exaggerated. While typologically they belong to a language family, sign languages are very diverse, varying on all levels of linguistic structure. As for homesigners, the gestural systems they develop compensate for the lack of access to a sign language; indeed, homesigners bear the consequences of lacking access to a native language (see for instance Morford & Hänel-Faulhaber 2011; Gagne & Coppola 2017). Finally, non-signers cannot guess anything about sign languages apart from a few designated properties.

Still, these cases of convergence should be explained. There are at least three conceivable directions, which might have different areas of applicability. Chomsky famously argued that there exists an innate Universal Grammar (UG) that underlies all human languages (see for instance Chomsky 1965, Pinker 1994). One possibility is that UG doesn’t just specify abstract features and rules (as is usually assumed), but also certain form-to-meaning mappings in the visual modality, for instance the fact that pronouns are realized by way of pointing. A second possibility is that the iconic component of sign language—possibly in more abstract forms than is usually assumed—is responsible for some of the convergence. An example was discussed in Section 5.3 in relation to the word order differences between classifier predicates and normal signs, and between gesture sequences and normal words. A third possibility is that, for reasons that have yet to be determined, the visual modality sometimes makes it possible to realize in a more uniform fashion some deeper cognitive properties of linguistic expressions.

On a practical level, future research will have to find the optimal balance between fine-grained studies and robust methods of data collection (e.g., what are the best methods to collect fine-grained data from a small number of consultants? how can large-scale experiments be set up for sign language semantics?). A second issue pertains to the involvement of native signers and Deaf researchers, who should obviously play a central role in this entire research.

On a theoretical level, the traditional view of human language as a discrete system with iconicity at the margins is hard to maintain in view of the analysis of sign with iconicity (and possibly also of speech with gestures). Rather, human language is a hybrid system with a discrete/logical component and an iconic component. But there are multiple open issues. First, cases of Logical Visibility will no doubt give rise to further debates. Second, a formal iconic semantics appropriate for sign language has yet to be fully developed. Third, the interaction between the discrete/logical component and the iconic component must be investigated in greater detail. Fourth, the formal semantics of sign language should be extended with an equally formal pragmatics to investigate, among others, information structure, and the rich typology of inferences that has been unearthed for spoken languages (including implicatures, presuppositions, supplements, expressives, etc.). Importantly, this formal pragmatics will have to explore both the discrete/logical and the iconic component of sign language. Fifth, consequences of the iconic component for the syntax will have to be further explored, especially in view of the hypothesis that classifier predicates display a non-standard syntax because they have an iconic semantics. Last, but not least, the philosophy of language should take sign languages into account. For the moment, it almost never does so.

  • Abusch, Dorit, 2020, “Possible‐Worlds Semantics for Pictures”, in The Wiley Blackwell Companion to Semantics , Daniel Gutzmann, Lisa Matthewson, Cécile Meier, Hotze Rullmann, and Thomas Zimmermann (eds.), Hoboken, NJ: Wiley. doi:10.1002/9781118788516.sem003
  • Anand, Pranav and Andrew Nevins, 2004, “Shifty Operators in Changing Contexts”, in Proceedings of Semantics and Linguistic Theory (SALT 14) , Robert B. Young (ed.), Semantics and Linguistic Theory (Linguistics Society of America), pp. 20–37. doi:10.3765/salt.v14i0.2913
  • Anand, Pranav, 2006, “De de Se”, Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA. [ Anand 2006 available online ]
  • Aristodemo, Valentina, 2017, Gradable Constructions in Italian Sign Language , Ph.D. Thesis, Ecole des Hautes Etudes en Sciences Sociales, Paris.
  • Barberà Altimira, Gemma, 2015, The Meaning of Space in Sign Language: Reference, Specificity and Structure in Catalan Sign Language Discourse (Sign Languages and Deaf Communities 4), Boston: De Gruyter Mouton/Ishara Press. doi:10.1515/9781614518815
  • Barnes, Kathryn and Cornelia Ebert, 2023, “The Information Status of Iconic Enrichments: Modelling Gradient at-Issueness”, Theoretical Linguistics , 49(3–4): 167–223. doi:10.1515/tl-2023-2009
  • Beck, Sigrid, Sveta Krasikova, Daniel Fleischer, Remus Gergel, Stefan Hofstetter, Christiane Savelsberg, John Vanderelst, and Elisabeth Villalta, 2009, “Crosslinguistic Variation in Comparison Constructions”, Linguistic Variation Yearbook , 9: 1–66. doi:10.1075/livy.9.01bec
  • Bellugi, Ursula and Susan Fischer, 1972, “A Comparison of Sign Language and Spoken Language”, Cognition , 1(2–3): 173–200. doi:10.1016/0010-0277(72)90018-2
  • Chomsky, Noam, 1965, Aspects of the Theory of Syntax , Cambridge, MA: MIT Press
  • Christensen, Peer, Riccardo Fusaroli, and Kristian Tylén, 2016, “Environmental Constraints Shaping Constituent Order in Emerging Communication Systems: Structural Iconicity, Interactive Alignment and Conventionalization”, Cognition , 146: 67–80. doi:10.1016/j.cognition.2015.09.004
  • Cogill-Koez, Dorothea, 2000, “Signed Language Classifier Predicates: Linguistic Structures or Schematic Visual Representation?”, Sign Language & Linguistics , 3(2): 153–207. doi:10.1075/sll.3.2.03cog
  • Coppola, Marie and Wing Chee So, 2006, “The Seeds of Spatial Grammar: Spatial Modulation and Coreference in Homesigning and Hearing Adults”, in Proceedings of the 30th Boston University Conference on Language Development , David Bamman, Tatiana Magnitskaia, and Colleen Zaller (eds.), Boston: Cascadilla Press, 1: 119–130.
  • Coppola, Marie, Elizabet Spaepen, and Susan Goldin-Meadow, 2013, “Communicating about Quantity without a Language Model: Number Devices in Homesign Grammar”, Cognitive Psychology , 67(1–2): 1–25. doi:10.1016/j.cogpsych.2013.05.003
  • Corina, David and Nicole Spotswood, 2012, “Neurolinguistics”, in Pfau, Steinbach, and Woll 2012 : 739–762 (ch. 31). doi:10.1515/9783110261325.739
  • Cresswell, M. J., 1990, Entities and Indices (Studies in Linguistics and Philosophy 41), Dordrecht/Boston: Kluwer Academic Publishers.
  • Cuxac, Christian and Marie-Anne Sallandre, 2007, “Iconicity and Arbitrariness in French Sign Language: Highly Iconic Structures, Degenerated Iconicity and Diagrammatic Iconicity”, in Verbal and Signed Languages: Comparing Structures, Constructs and Methodologies (Empirical Approaches to Language Typology 36), Elena Pizzuto, Paola Pietrandrea, and Raffaele Simone (eds.), Berlin/New York: Mouton de Gruyter, 13–33.
  • Dachkovsky, Svetlana and Wendy Sandler, 2009, “Visual Intonation in the Prosody of a Sign Language”, Language and Speech , 52(2–3): 287–314. doi:10.1177/0023830909103175
  • Davidson, Kathryn, 2013, “‘And’ or ‘or’: General Use Coordination in ASL”, Semantics and Pragmatics , 6: article 4 (44 pages). doi:10.3765/sp.6.4
  • –––, 2015, “Quotation, Demonstration, and Iconicity”, Linguistics and Philosophy , 38(6): 477–520. doi:10.1007/s10988-015-9180-1
  • Davidson, Kathryn and Deanna Gagne, 2022, “‘More Is up’ for Domain Restriction in ASL”, Semantics and Pragmatics , 15: article 1 (52 pages). doi:10.3765/sp.15.1
  • Davidson, Kathryn, Annemarie Kocab, Andrea D. Sims, and Laura Wagner, 2019, “The Relationship between Verbal Form and Event Structure in Sign Languages”, Glossa: A Journal of General Linguistics , 4(1): 123. doi:10.5334/gjgl.924
  • Deal, Amy Rose, 2020, A Theory of Indexical Shift: Meaning, Grammar, and Crosslinguistic Variation (Linguistic Inquiry Monographs 82), Cambridge, MA: The MIT Press. doi:10.7551/mitpress/12374.001.0001
  • Elbourne, Paul D., 2005, Situations and Individuals (Current Studies in Linguistics 41), Cambridge, MA: MIT Press.
  • Emmorey, Karen, 2002, Language, Cognition, and the Brain: Insights from Sign Language Research , Mahwah, NJ: Lawrence Erlbaum Associates.
  • –––, 2014, “Iconicity as Structure Mapping”, Philosophical Transactions of the Royal Society B: Biological Sciences , 369(1651): 20130301. doi:10.1098/rstb.2013.0301
  • Emmorey, Karen and Melissa Herzig, 2003, “Categorical versus Gradient Properties of Classifier Constructions in ASL”, in Perspectives on Classifer Constructions in Sign Language , Karen Emmorey (ed.), Mahwah, NJ: Lawrence Erlbaum Associates, 222–246.
  • Evans, Gareth, 1980, “Pronouns”, Linguistic Inquiry , 11(2): 337–362.
  • Frederiksen, Anne Therese and Rachel I. Mayberry, 2022, “Pronoun Production and Comprehension in American Sign Language: The Interaction of Space, Grammar, and Semantics”, Language, Cognition and Neuroscience , 37(1): 80–102. doi:10.1080/23273798.2021.1968013
  • Gagne, Deanna L. and Marie Coppola, 2017, “Visible Social Interactions Do Not Support the Development of False Belief Understanding in the Absence of Linguistic Input: Evidence from Deaf Adult Homesigners”, Frontiers in Psychology , 8: article 837. doi:10.3389/fpsyg.2017.00837
  • Geach, Peter T., 1962, Reference and Generality: An Examination of Some Medieval and Modern Theories (Contemporary Philosophy), Ithaca, NY: Cornell University Press.
  • Goldin-Meadow, Susan and Diane Brentari, 2017, “Gesture, Sign, and Language: The Coming of Age of Sign Language and Gesture Studies”, Behavioral and Brain Sciences , 40: e46. doi:10.1017/S0140525X15001247
  • Goldin-Meadow, Susan, Wing Chee So, Aslı Özyürek, and Carolyn Mylander, 2008, “The Natural Order of Events: How Speakers of Different Languages Represent Events Nonverbally”, Proceedings of the National Academy of Sciences , 105(27): 9163–9168. doi:10.1073/pnas.0710060105
  • Greenberg, Gabriel, 2013, “Beyond Resemblance”, The Philosophical Review , 122(2): 215–287. doi:10.1215/00318108-1963716
  • –––, 2021, “Semantics of Pictorial Space”, Review of Philosophy and Psychology , 12(4): 847–887. doi:10.1007/s13164-020-00513-6
  • Heim, Irene, 1982, The Semantics of Definite and Indefinite Noun Phrases , Ph.D. Thesis, University of Massachusetts, Amherst, Amherst, MA.
  • –––, 1990, “E-Type Pronouns and Donkey Anaphora”, Linguistics and Philosophy , 13(2): 137–177. doi:10.1007/BF00630732
  • Henderson, Robert, 2016, “A Demonstration-Based Account of (Pluractional) Ideophones”, in Proceedings of Semantics and Linguistic Theory (SALT 26) , Mary Morony, Carol-Rose Little, Jacob Collard, and Dan Burgdorf (eds.), 664–683. doi:10.3765/salt.v26i0.3786
  • Herrmann, Annika and Markus Steinbach, 2012, “Quotation in Sign Languages: A Visible Context Shift”, in Converging Evidence in Language and Communication Research (Converging Evidence in Language and Communication Research 15), Isabelle Buchstaller and Ingrid Van Alphen (eds.), Amsterdam: John Benjamins Publishing Company, 203–228. doi:10.1075/celcr.15.12her
  • Jacobson, Pauline, 1999, “Towards a Variable Free Semantics”, Linguistics and Philosophy , 22(2): 117–185. doi:10.1023/A:1005464228727
  • –––, 2012, “Direct Compositionality and ‘Uninterpretability’: The Case of (Sometimes) ‘Uninterpretable’ Features on Pronouns”, Journal of Semantics , 29(3): 305–343. doi:10.1093/jos/ffs005
  • Kamp, Hans, 1981 [1984], “A Theory of Truth and Semantic Representation”, in Formal Methods in the Study of Language (Mathematical Centre Tracts 135), Jeroen A. G. Groenendijk, Theo M. V. Janssen, and Martin J. B. Stokhof (eds.), Amsterdam: Mathematisch Centrum. Reprinted in Truth, Interpretation and Information: Selected Papers from the Third Amsterdam Colloquium , Jeroen Groenendijk, Theo M. V. Janssen, and Martin Stokhof (eds.), Berlin/Boston: De Gruyter, 1–42. doi:10.1515/9783110867602.1
  • Kaplan, David, 1989, “Demonstratives. An Essay on the Semantics, Logic, Metaphysics, and Epistemology of Demonstratives and Other Indexicals”, in Themes from Kaplan , Joseph Almog, John Perry, and Howard Wettstein (eds.), New York: Oxford University Press, 481–563.
  • Klein, Ewan, 1980, “A Semantics for Positive and Comparative Adjectives”, Linguistics and Philosophy , 4(1): 1–45. doi:10.1007/BF00351812
  • Koulidobrova, Elena, 2009, “SELF: Intensifier and ‘long distance’ effects in ASL”, 21st European Summer School in Logic, Language, and Information . Bordeaux, Association for Logic Language and Information (FoLLI). [ Koulidobrova 2009 available online ]
  • Kuhn, Jeremy, 2015, Cross-Categorial Singular and Plural Reference in Sign Language , Ph.D. Thesis, New York University.
  • –––, 2016, “ASL Loci: Variables or Features?”, Journal of Semantics , 33(3): 449–491. doi:10.1093/jos/ffv005
  • –––, 2021, “Discourse Anaphora: Theoretical Perspectives”, in The Routledge Handbook of Theoretical and Experimental Sign Language Research , Josep Quer, Roland Pfau, and Annika Herrmann (eds.), Abingdon/New York: Routledge, 458–479.
  • Kuhn, Jeremy and Valentina Aristodemo, 2017, “Pluractionality, Iconicity, and Scope in French Sign Language”, Semantics and Pragmatics , 10: article 6 (49 pages): doi:10.3765/sp.10.6
  • Lane, Harlan, 1984, When the Mind Hears: A History of the Deaf , New York: Random House.
  • Liddell, Scott K., 1984, “Unrealized-Inceptive Aspect in American Sign Language: Feature Insertion in Syllabic Frames”, in Papers from the Twentieth Regional Meeting of the Chicago Linguistic Society , Chicago: Chicago Linguistic Society, 257–270.
  • –––, 2003, Grammar, Gesture, and Meaning in American Sign Language , Cambridge/New York: Cambridge University Press. doi:10.1017/CBO9780511615054
  • Lillo-Martin, Diane, 2012, “Utterance Reports and Constructed Action”, in Pfau, Steinbach, and Woll 2012 : 365–387 (ch. 17). doi:10.1515/9783110261325.365
  • Lillo-Martin, Diane and Edward S. Klima, 1990, “Pointing out Differences: ASL Pronouns in Syntactic Theory”, in Theoretical Issues in Sign Language Research: Volume 1, Linguistics , Susan D. Fischer and Patricia Siple (eds.), Chicago, IL: The University of Chicago Press, 191–210.
  • Lin, Hao, Jeremy Kuhn, Huan Sheng, and Philippe Schlenker, 2021, “Timelines and Temporal Pointing in Chinese Sign Language”, Glossa: A Journal of General Linguistics , 6(1): article 133. doi:10.16995/glossa.5836
  • MacSweeney, Mairéad, Cheryl M. Capek, Ruth Campbell, and Bencie Woll, 2008, “The Signing Brain: The Neurobiology of Sign Language”, Trends in Cognitive Sciences , 12(11): 432–440. doi:10.1016/j.tics.2008.07.010
  • Malaia, Evie, Ronnie B. Wilbur, and Marina Milković, 2013, “Kinematic Parameters of Signed Verbs”, Journal of Speech, Language, and Hearing Research , 56(5): 1677–1688. doi:10.1044/1092-4388(2013/12-0257)
  • Matthewson, Lisa, 2001, “Quantification and the Nature of Crosslinguistic Variation”, Natural Language Semantics , 9(2): 145–189. doi:10.1023/A:1012492911285
  • Meir, Irit, Wendy Sandler, Carol Padden, and Mark Aronoff, 2010, “Emerging Sign Languages”, in The Oxford Handbook of Deaf Studies, Language, and Education , Marc Marschark and Patricia Elizabeth Spencer (eds.), Oxford/New York: Oxford University Press, 2: 267–280. doi:10.1093/oxfordhb/9780195390032.013.0018
  • Milković, Marina, 2011, Verb classes in Croatian Sign Language (HZJ): Syntactic and semantic properties , PhD thesis, University of Zagreb, Croatia.
  • Moltmann, Friederike, 2017, “Natural Language Ontology”, in Oxford Research Encyclopedia of Linguistics , Mark Aronoff (ed.), Oxford: Oxford University Press. doi:10.1093/acrefore/9780199384655.013.330
  • Morford, Jill P. and Barbara Hänel‐Faulhaber, 2011, “Homesigners as Late Learners: Connecting the Dots from Delayed Acquisition in Childhood to Sign Language Processing in Adulthood”, Language and Linguistics Compass , 5(8): 525–537. doi:10.1111/j.1749-818X.2011.00296.x
  • Napoli, Donna Jo, Nancy K. Mellon, John K. Niparko, Christian Rathmann, Gaurav Mathur, Tom Humphries, Theresa Handley, Sasha Scambler, and John D. Lantos, 2015, “Should All Deaf Children Learn Sign Language?”, Pediatrics , 136(1): 170–176. doi:10.1542/peds.2014-1632
  • Napoli, Donna Jo, Rachel Sutton Spence, and Ronice Müller de Quadros, 2017, “Influence of Predicate Sense on Word Order in Sign Languages: Intensional and Extensional Verbs”, Language , 93(3): 641–670. doi:10.1353/lan.2017.0039
  • Neidle, Carol Jan, Judy Kegl, Dawn MacLaughlin, Benjamin Bahan, and Robert G. Lee (eds.), 2000, The Syntax of American Sign Language: Functional Categories and Hierarchical Structure (Language, Speech, and Communication), Cambridge, MA: MIT Press.
  • Ohori, Toshio, 2004, “Coordination in Mentalese”, in Coordinating Constructions (Typological Studies in Language 58), Martin Haspelmath (ed.), Amsterdam: John Benjamins Publishing Company, 41–66. doi:10.1075/tsl.58.04oho
  • Padden, Carol A., 1986, “Verbs and Role-Shifting in American Sign Language”, Proceedings of the Fourth National Symposium on Sign Language Research and Teaching , Silver Spring, MD: National Association of the Deaf.
  • Partee, Barbara Hall, 1973, “Some Structural Analogies between Tenses and Pronouns in English”, The Journal of Philosophy , 70(18): 601–609. doi:10.2307/2025024
  • Pavlič, Matic, 2016, “The Word Order Parameter in Slovenian Sign Language : Transitive, Ditransitive, Classifier and Locative Constructions”, Doctoral Thesis, Università Ca’ Foscari Venezia.
  • Pfau, Roland and Markus Steinbach, 2006, “Pluralization in Sign and in Speech: A Cross-Modal Typological Study”, Linguistic Typology , 10(2): 135–182. doi:10.1515/LINGTY.2006.006
  • Pfau, Roland, Martin Salzmann, and Markus Steinbach, 2018, “The Syntax of Sign Language Agreement: Common Ingredients, but Unusual Recipe”, Glossa: A Journal of General Linguistics , 3(1): article 107 (46 pages). doi:10.5334/gjgl.511
  • Pfau, Roland, Markus Steinbach, and Bencie Woll (eds.), 2012, Sign Language: An International Handbook (Handbücher zur Sprach- und Kommunikationswissenschaft/Handbooks of Linguistics and Communication Science 37), Berlin/Boston: De Gruyter Mouton. doi:10.1515/9783110261325
  • Pinker, Steven, 1994, The Language Instinct: How the Mind Creates Language , New York: William Morrow.
  • Postal, Paul, 1966, “On so-called ‘pronouns’ in English”, in Report on the Seventeenth Annual Round Table Meeting on Linguistics and Language Studies , Washington, DC: Georgetown University Press, pp. 177–206
  • Pustejovsky, James, 1991, “The Syntax of Event Structure”, Cognition , 41(1–3): 47–81. doi:10.1016/0010-0277(91)90032-Y
  • Quer, Josep, 2005, “Context Shift and Indexical Variables in Sign Languages”, in Proceedings of Semantics and Linguistic Theory (SALT 15) , 152–168. doi:10.3765/salt.v15i0.2923
  • –––, 2013, “Attitude Ascriptions in Sign Languages and Role Shift”, in Proceedings of the 13 th Meeting of the Texas Linguistics Society , Leah C. Geer (ed.), Austin: Texas Linguistics Forum, pp. 12–28. [ Quer 2013 available online ]
  • Quine, Willard V., 1960, “Variables Explained Away”, Proceedings of the American Philosophical Society , 104(3): 343–347.
  • Ramchand, Gillian, 2008, Verb Meaning and the Lexicon: A First-Phase Syntax (Cambridge Studies in Linguistics 116), Cambridge/New York: Cambridge University Press. doi:10.1017/CBO9780511486319
  • Rothstein, Susan, 2004, Structuring Events: A Study in the Semantics of Lexical Aspect (Explorations in Semantics), Oxford/Malden, MA: Blackwell. doi:10.1002/9780470759127
  • Rubino, Carl, 2013, “Reduplication”, in The World Atlas of Language Structures Online (v2020.3) [Data set], Matthew S. Dryer and Martin Haspelmath (eds.), Leipzig: Max Planck Institute for Evolutionary Anthropology. [ Rubino data set available online ] doi:10.5281/zenodo.7385533
  • Sandler, Wendy and Diane C. Lillo-Martin, 2006, Sign Language and Linguistic Universals , Cambridge/New York: Cambridge University Press. doi:10.1017/CBO9781139163910
  • Sapir, Edward, 1921, Language: An Introduction to the Study of Speech , New York: Harcourt, Brace and Company.
  • Schlenker, Philippe, 2018a, “Visible Meaning: Sign Language and the Foundations of Semantics”, Theoretical Linguistics , 44(3–4): 123–208. doi:10.1515/tl-2018-0012
  • –––, 2018b, “Iconic Pragmatics”, Natural Language & Linguistic Theory , 36(3): 877–936. doi:10.1007/s11049-017-9392-x
  • –––, 2020, “Gestural Grammar”, Natural Language & Linguistic Theory , 38(3): 887–936. doi:10.1007/s11049-019-09460-z
  • Schlenker, Philippe, Marion Bonnet, Jonathan Lamberton, Jason Lamberton, Emmanuel Chemla, Mirko Santoro, and Carlo Geraci, 2024, “Iconic Syntax: Sign Language Classifier Predicates and Gesture Sequences”, Linguistics and Philosophy , 47(1): 77–147. doi:10.1007/s10988-023-09388-z
  • Schlenker, Philippe and Jonathan Lamberton, 2022, “Meaningful Blurs: The Sources of Repetition-Based Plurals in ASL”, Linguistics and Philosophy , 45(2): 201–264. doi:10.1007/s10988-020-09312-9
  • –––, forthcoming, “Iconological Semantics”, Linguistics & Philosophy , accepted with minor revisions. [ Schlenker & Lamberton available online (lingbuzz/007048) ]
  • Schouwstra, Marieke and Henriëtte De Swart, 2014, “The Semantic Origins of Word Order”, Cognition , 131(3): 431–436. doi:10.1016/j.cognition.2014.03.004
  • von Stechow, Arnim, 2004, “Binding by Verbs: Tense, Person, and Mood under Attitudes”, in The Syntax and Semantics of the Left Periphery (Interface Explorations [IE] 9), Horst Lohnstein and Susanne Trissler (eds.), Berlin/New York: De Gruyter, 431–488. doi:10.1515/9783110912111.431
  • Steinbach, Markus and Edgar Onea, 2016, “A DRT Analysis of Discourse Referents and Anaphora Resolution in Sign Language”, Journal of Semantics , 33(3): 409–448. doi:10.1093/jos/ffv002
  • Stone, Matthew, 1997, “The Anaphoric Parallel Between Modality and Tense”. Technical Report MS-CIS-97-09 , University of Pennsylvania, Department of Computer and Information Science. [ Stone 1997 available online ]
  • Strickland, Brent, Carlo Geraci, Emmanuel Chemla, Philippe Schlenker, Meltem Kelepir, and Roland Pfau, 2015, “Event Representations Constrain the Structure of Language: Sign Language as a Window into Universally Accessible Linguistic Biases”, Proceedings of the National Academy of Sciences , 112(19): 5968–5973. doi:10.1073/pnas.1423080112
  • Supalla, Ted, 1982, Structure and acquisition of verbs of motion and location in American Sign Language . Ph.D. Thesis, University of California, San Diego.
  • Taub, Sarah F., 2001, Language from the Body: Iconicity and Metaphor in American Sign Language , Cambridge/New York: Cambridge University Press. doi:10.1017/CBO9780511509629
  • Wilbur, Ronnie B., 1996, “Focus and Specificity in ASL Structures Containing SELF”, Linguistic Society of America Annual Meeting, San Diego, January 1996.
  • –––, 2003, “Representations of Telicity in ASL”, in Proceedings from the Annual Meeting of the Chicago Linguistic Society , 39(1): 354–368.
  • –––, 2008, “Complex Predicates Involving Events, Time and Aspect: Is This Why Sign Languages Look So Similar?”, in Signs of the time: Selected papers from TISLR 8 (2004) , Josep Quer (ed.), Hamburg: Signum, pp. 217–250.
  • Wilbur, Ronnie B. and Evie Malaia, 2008, “Event Visibility Hypothesis: Motion Capture Evidence for Overt Marking of Telicity in ASL”, Linguistics Society of America Annual Meeting, Chicago, January 2008.
  • Zucchi, Sandro, 2011, “Event Descriptions and Classifier Predicates in Sign Languages”, Presentation given at FEAST in Venice, 21 June 2011.
  • –––, 2012, “Formal Semantics of Sign Languages”, Language and Linguistics Compass , 6(11): 719–734. doi:10.1002/lnc3.348
  • –––, 2017, “Event Categorization in Sign Languages”, in Handbook of Categorization in Cognitive Science , second edition, Henri Cohen and Claire Lefebvre (eds.), Amsterdam: Elsevier, 377–396. doi:10.1016/B978-0-08-101107-2.00016-6
How to cite this entry . Preview the PDF version of this entry at the Friends of the SEP Society . Look up topics and thinkers related to this entry at the Internet Philosophy Ontology Project (InPhO). Enhanced bibliography for this entry at PhilPapers , with links to its database.
  • Cable, Seth, 2011, “ Understudied and Endangered Languages at the Semantics/Syntax Interface ” (slides), presented at 50 Years of Linguistics at MIT: A Scientific Reunion, 10 December 2011.
  • Ebert, Cornelia and Christian Ebert, 2014, “ Gestures, Demonstratives, and the Attributive/Referential Distinction ”, slides of talk at Semantics and Philosophy in Europe 7, ZAS, Berlin.

anaphora | innateness: and language | logical form | ontology, natural language | plural quantification | presupposition | quotation | semantics: dynamic | tense and aspect


