  • Review Article
  • Open access
  • Published: 18 October 2021

A systematic review of smartphone-based human activity recognition methods for health research

  • Marcin Straczkiewicz (ORCID: orcid.org/0000-0002-8703-4451) 1,
  • Peter James 2,3 &
  • Jukka-Pekka Onnela 1

npj Digital Medicine volume 4, Article number: 148 (2021)

  • Predictive markers
  • Public health
  • Quality of life

Smartphones are now nearly ubiquitous; their numerous built-in sensors enable continuous measurement of activities of daily living, making them especially well-suited for health research. Researchers have proposed various human activity recognition (HAR) systems aimed at translating measurements from smartphones into various types of physical activity. In this review, we summarized the existing approaches to smartphone-based HAR. For this purpose, we systematically searched Scopus, PubMed, and Web of Science for peer-reviewed articles published up to December 2020 on the use of smartphones for HAR. We extracted information on smartphone body location, sensors, and physical activity types studied, as well as the data transformation techniques and classification schemes used for activity recognition. We identified 108 articles and described the various approaches used for data acquisition, data preprocessing, feature extraction, and activity classification, identifying the most common practices and their alternatives. We conclude that smartphones are well-suited for HAR research in the health sciences. For population-level impact, future studies should focus on improving the quality of collected data, addressing missing data, incorporating more diverse participants and activities, relaxing requirements about phone placement, providing more complete documentation on study participants, and sharing the source code of the implemented methods and algorithms.

Introduction

Progress in science has always been driven by data. More than 5 billion mobile devices were in use in 2020 1 , with multiple sensors (e.g., accelerometer and GPS) that can capture detailed, continuous, and objective measurements on various aspects of our lives, including physical activity. Such proliferation in worldwide smartphone adoption presents unprecedented opportunities for the collection of data to study human behavior and health. Along with sufficient storage, powerful processors, and wireless transmission, smartphones can collect a tremendous amount of data on large cohorts of individuals over extended time periods without additional hardware or instrumentation.

Smartphones are promising data collection instruments for objective and reproducible quantification of traditional and emerging risk factors for human populations. Behavioral risk factors, including but not limited to sedentary behavior, sleep, and physical activity, can all be monitored by smartphones in free-living environments, leveraging the personal or lived experiences of individuals. Importantly, unlike some wearable activity trackers 2 , smartphones are not a niche product but instead have become globally available, increasingly adopted by users of all ages both in advanced and emerging economies 3 , 4 . Their adoption in health research is further supported by encouraging findings made with other portable devices, primarily wearable accelerometers, which have demonstrated robust associations between physical activity and health outcomes, including obesity, diabetes, various cardiovascular diseases, mental health, and mortality 5 , 6 , 7 , 8 , 9 . However, there are some important limitations to using wearables for studying population health: (1) their ownership is much lower than that of smartphones 10 ; (2) most people stop using their wearables after 6 months of use 11 ; and (3) raw data are usually not available from wearable devices. The last point often forces investigators to rely on proprietary device metrics, which lowers the already low rate of reproducibility of biomedical research in general 12 and makes uncertainty quantification in the measurements nearly impossible.

Human activity recognition (HAR) is a process aimed at the classification of human actions in a given period of time based on discrete measurements (acceleration, rotation speed, geographical coordinates, etc.) made by personal digital devices. In recent years, this topic has been proliferating within the machine learning research community; at the time of writing, over 400 articles had been published on HAR methods using smartphones. This is a substantial increase from just a handful of articles published a few years earlier (Fig. 1 ). As data collection using smartphones becomes easier, analysis of the collected data is increasingly identified as the main bottleneck in health research 13 , 14 , 15 . To tackle the analytical challenges of HAR, researchers have proposed various algorithms that differ substantially in terms of the type of data they use, how they manipulate the collected data, and the statistical approaches used for inference and/or classification. Published studies use existing methods and propose new methods for the collection, processing, and classification of activities of daily living. Authors commonly discuss data filtering and feature selection techniques and compare the accuracy of various machine learning classifiers either on previously existing datasets or on datasets they have collected de novo for the purposes of the specific study. The results are typically summarized using classification accuracy within different groups of activities, such as ambulation, locomotion, and exercise.

Figure 1. Articles were published between January 2008 and December 2020, based on a search of PubMed, Scopus, and Web of Science databases (for details, see “Methods”).

To successfully incorporate developments in HAR into research in public health and medicine, there is a need to understand the approaches that have been developed and identify their potential limitations. Methods need to accommodate physiological (e.g., weight, height, age) and habitual (e.g., posture, gait, walking speed) differences of smartphone users, as well as differences in the built environment (e.g., buildings and green spaces) that provide the physical and social setting for human activities. Moreover, the data collection and statistical approaches typically used in HAR may be affected by location (where the user wears the phone on their body) and orientation of the device 16 , which complicates the transformation of collected data into meaningful and interpretable outputs.

In this paper, we systematically review the emerging literature on the use of smartphones for HAR for health research in free-living settings. Given that the main challenge in this field is shifting from data collection to data analysis, we focus our analysis on the approaches used for data acquisition, data preprocessing, feature extraction, and activity classification. We provide insight into the complexity and multidimensionality of HAR utilizing smartphones, the types of data collected, and the methods used to translate digital measurements into human activities. We discuss the generalizability and reproducibility of approaches, i.e., the features that are essential and applicable to large and diverse cohorts of study participants. Lastly, we identify challenges that need to be tackled to accelerate the wider utilization of smartphone-based HAR in public health studies.

Our systematic review was conducted by searching for articles published up to December 31, 2020, on PubMed, Scopus, and Web of Science databases. The databases were screened for titles, abstracts, and keywords containing phrases “activity” AND (“recognition” OR “estimation” OR “classification”) AND (“smartphone” OR “cell phone” OR “mobile phone”). The search was limited to full-length journal articles written in English. After removing duplicates, we read the titles and abstracts of the remaining publications. Studies that did not investigate HAR approaches were excluded from further screening. We then filtered out studies that employed auxiliary equipment, like wearable or ambient devices, and studies that required carrying multiple smartphones. Only studies that made use of commercially available consumer-grade smartphones (either personal or loaner) were read in full. We excluded studies that used the smartphone microphone or video camera for activity classification as they might record information about an individual’s surroundings, including information about unconsented individuals, and thus hinder the large-scale application of the approach due to privacy concerns. To focus on studies that mimicked free-living settings, we excluded studies that utilized devices strapped or glued to the body in a fixed position.

Our search resulted in 1901 hits for the specified search criteria (Fig. 2). After removal of articles that did not discuss HAR algorithms (n = 793), employed additional hardware (n = 150), or utilized microphones, cameras, or body-affixed smartphones (n = 149), there were 108 references included in this review.

Figure 2. The search was conducted in PubMed, Scopus, and Web of Science databases and included full-length peer-reviewed articles written in English. The search was carried out on January 2, 2021.

Most HAR approaches consist of four stages: data acquisition, data preprocessing, feature extraction, and activity classification (Fig. 3 ). Here, we provide an overview of these steps and briefly point to significant methodological differences among the reviewed studies for each step. Figure 4 summarizes specific aspects of each study. Of note, we decomposed data acquisition processes into sensor type, experimental environment, investigated activities, and smartphone location; we indicated which studies preprocessed collected measurements using signal correction methods, noise filtering techniques, and sensor orientation-invariant transformations; we marked investigations based on the types of signal features they extracted, as well as the feature selection approaches used; we indicated the adopted activity classification principles, utilized classifiers, and practices for accuracy reporting; and finally, we highlighted efforts supporting reproducibility and generalizability of the research. Before diving into these technical considerations, we first provide a brief description of study populations.

Figure 3. The map displays common aspects of HAR systems together with their operational definitions. The methodological differences between the reviewed studies are highlighted in Figure 4.

Figure 4. The columns correspond to the 108 reviewed studies and the rows correspond to different technical aspects of each study. Cells marked with a cross (x) indicate that the given study used the given method, algorithm, or approach. Rows have been grouped to correspond to different stages of HAR, such as data processing, and color shading of rows indicates how frequently a particular aspect is present among the studies (darker shade corresponds to higher frequency).

Study populations

We use the term study population to refer to the group of individuals investigated in any given study. In the reviewed studies, data were usually collected from fewer than 30 individuals, although one larger study analyzed data from 440 healthy individuals 17 . Studies often included healthy adults in their 20s and 30s, with only a handful of studies involving older individuals. Most studies did not report the full distribution of ages, only the mean age or the age range of participants (Fig. 5 ). To get a sense of the distribution of participant ages, we attempted to reconstruct an overall approximate age distribution by assuming that the participants in each study are evenly distributed in age between the minimum and maximum ages, which may not be the case. A comparison of the reconstructed age distribution of study participants with nationwide age distributions clearly demonstrates that future HAR research in health settings needs to broaden the age spectrum of the participants. Less effort was devoted in the studies to investigating populations with different demographic and disease characteristics, such as elders 18 , 19 , 20 and individuals with Parkinson’s disease 21 .

Figure 5. Panel a displays the age of the population in each study, typically described by its range (lines) or mean (dots). Panel b displays the reconstructed age distribution in the reviewed studies (see the text). The nationwide age distributions of three countries displayed in panel c offer a stark contrast with the reconstructed distribution of study participant ages.

Data acquisition

We use the term data acquisition to refer to a process of collecting and storing raw sub-second-level smartphone measurements for the purpose of HAR. The data are typically collected from individuals by an application that runs on the device and samples data from built-in smartphone sensors according to a predefined schedule. We carefully examined the selected literature for details on the investigated population, measurement environment, performed activities, and smartphone settings.

In the reviewed studies, data acquisition typically took place in a research facility and/or nearby outdoor surroundings. In such environments, study participants were asked to perform a series of activities along predefined routes and to interact with predefined objects. The duration and order of performed activities were usually determined by the study protocol and the participant was supervised by a research team member. A less common approach involved observation conducted in free-living environments, where individuals performed activities without specific instructions. Such studies were likely to provide more insight into diverse activity patterns due to individual habits and unpredictable real-life conditions. Compared to a single laboratory visit, studies conducted in free-living environments also allowed investigators to monitor behavioral patterns over many weeks 22 or months 23 .

Activity selection is one of the key aspects of HAR. The studies in our review tended to focus on a small set of activities, including sitting, standing, walking, running, and stair climbing. Less common activities involved various types of mobility, locomotion, fitness, and household routines, e.g., slow, normal, and brisk walking 24 , multiple transportation modes, such as by car, bus, tram, train, metro, and ferry 25 , sharp body-turns 26 , and household activities, like sweeping a floor or walking with a shopping bag 27 . More recent studies concentrated solely on walking recognition 28 , 29 . As shown in Fig. 4 , the various measured activities in the reviewed studies can be grouped into classes: “posture” refers to lying, sitting, standing, or any pair of these activities; “mobility” refers to walking, stair climbing, body-turns, riding an elevator or escalator, running, cycling, or any pair of these activities; “locomotion” refers to motorized activities; and “other” refers to various household and fitness activities or singular actions beyond the described groups.

The spectrum of investigated activities determines the choice of sensors used for data acquisition. At the time of writing, a standard smartphone is equipped with a number of built-in hardware sensors and protocols that can be used for activity monitoring, including an accelerometer, gyroscope, magnetometer, GPS, proximity sensor, and light sensor, as well as sensors that collect information on ambient pressure, humidity, and temperature (Fig. 6). Accurately estimating which sensors were commonly available at a given time is challenging given the large number of smartphone manufacturers and models, as well as the variation in their adoption in different countries. Based on global statistics on smartphone market shares 30 and specifications of flagship models 31, it appears that the accelerometer, gyroscope, magnetometer, GPS, and proximity and light sensors were fairly commonly available by 2010. Other smartphone sensors were introduced a couple of years later; for example, the barometer was included in the Samsung Galaxy S III released in 2012, and a thermometer and hygrometer were included in the Samsung Galaxy S4 released in 2013.

Figure 6. Inertial sensors (accelerometer, gyroscope, and magnetometer) provide measurements with respect to the three orthogonal axes (x, y, z) of the body of the phone; the remaining sensors are orientation-invariant.

Our literature review revealed that the most commonly used sensors for HAR are the accelerometer, gyroscope, and magnetometer, which capture data about acceleration, angular velocity, and phone orientation, respectively, and provide temporally dense, high-resolution measurements for distinguishing among activity classes (Fig. 7 ). Inertial sensors were often used synchronously to provide more insight into the dynamic state of the device. Some studies showed that the use of a single sensor can yield similar accuracy of activity recognition as using multiple sensors in combination 32 . To alleviate the impact of sensor position, some researchers collected data using the built-in barometer and GPS sensors to monitor changes in altitude and geographic location 33 , 34 , 35 . Certain studies benefited from using the broader set of capabilities of smartphones; for example, some researchers additionally exploited the proximity sensor and light sensor to allow recognition of a measurement’s context, e.g., the distance between a smartphone and the individual’s body, and changes between in-pocket and out-of-pocket locations based on changes in illumination 36 , 37 . The selection of sensors was also affected by secondary research goals, such as simplicity of classification and minimization of battery drain. In these studies, data acquisition was carried out using a single sensor (e.g., accelerometer 22 ), a small group of sensors (e.g., accelerometer and GPS 38 ), or a purposely modified sampling frequency or sampling scheme (e.g., alternating between data collection and non-collection cycles) to reduce the volume of data collected and processed 39 . Supplementing GPS data with other sensor data was motivated by the limited indoor reception of GPS; satellite signals may be absorbed or attenuated by walls and ceilings 17 up to 60% of the time inside buildings and up to 70% of the time in underground trains 23 .

Figure 7. a A person is sitting at a desk with the smartphone placed in the front pants pocket; b a person is walking normally (~1.9 steps per second) with the smartphone placed in a jacket pocket; c a person is ascending stairs with the smartphone placed in a backpack; d a person is walking slowly (~1.4 steps per second) holding the smartphone in hand; e a person is jogging (~2.8 steps per second) with the smartphone placed in the back pocket of a pair of shorts.

Sampling frequency specifies how many observations are collected by a sensor within a 1-s time interval. The selection of sampling frequency is usually performed as a trade-off between measurement accuracy and battery drain. Sampling frequency in the reviewed studies typically ranged between 20 and 30 Hz for inertial sensors and 1 and 10 Hz for the barometer and GPS. The most significant variations were seen in studies where limited energy consumption was a priority (e.g., accelerometer sampled at 1 Hz 40 ) or if investigators used advanced signal processing methods, such as time-frequency decomposition methods, or activity templates that required higher sampling frequency (e.g., accelerometer sampled at 100 Hz 41 ). Some studies stated that inertial sensors sampled at 20 Hz provided enough information to distinguish between various types of transportation 42 , while 10 Hz sampling rate was sufficient to distinguish between various types of mobility 43 . One study demonstrated that reducing the sampling rate from 100 Hz to 12.5 Hz increased the duration of data collection by a factor of three on a single battery charge 44 .
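To make this trade-off concrete, the short sketch below downsamples a synthetic 100 Hz accelerometer trace to 12.5 Hz, the same reduction discussed above; it is an illustrative example using SciPy's decimate (which low-pass filters before discarding samples), not a procedure taken from any reviewed study.

    # Illustrative sketch: downsampling a synthetic 100 Hz accelerometer trace to 12.5 Hz.
    import numpy as np
    from scipy.signal import decimate

    fs_high = 100.0                               # original sampling frequency (Hz)
    t = np.arange(0, 10, 1 / fs_high)             # 10 s of data
    acc = np.sin(2 * np.pi * 1.9 * t) + 0.1 * np.random.randn(t.size)  # ~1.9 Hz step rhythm plus noise

    # decimate() applies an anti-aliasing filter, then keeps every 8th sample: 100 Hz -> 12.5 Hz
    acc_low = decimate(acc, q=8)
    print(acc.size, "samples at 100 Hz ->", acc_low.size, "samples at 12.5 Hz")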

A crucial parameter in the data acquisition process is the smartphone’s location on the body. This is important mainly because of the nonstationary nature of real-life conditions and the strong effect it has on the smartphone’s inertial sensors. The main challenge in HAR in free-living conditions is that data recorded by the accelerometer, gyroscope, and magnetometer sensors differ between the upper and lower body as the device is not affixed to any specific location or orientation 45 . Therefore, it is essential that studies collect data from as many body locations as possible to ensure the generalizability of results. In the reviewed literature, study participants were often instructed to carry the device in a pants pocket (either front or back), although a number of studies also considered other placements, such as jacket pocket 46 , bag or backpack 47 , 48 , and holding the smartphone in the hand 49 or in a cupholder 50 .

To establish the ground truth for physical activity in HAR studies, data were usually annotated manually by trained research personnel or by the study participants themselves 51 , 52 . However, we also noted several approaches that automated this process both in controlled and free-living conditions, e.g., through a designated smartphone application 22 or a built-in step counter paired with GPS data to produce “weak” labels 53 . The annotation was also done using the built-in microphone 54 , video camera 18 , 20 , or an additional body-worn sensor 29 .

Finally, the data acquisition process in the reviewed studies was carried out using purpose-built applications. In studies with online activity classification, the collected data did not leave the device; instead, the entire HAR pipeline was implemented on the smartphone. In contrast, studies using offline classification transmitted data to an external (remote) server for processing using a cellular, Wi-Fi, Bluetooth, or wired connection.

Data preprocessing

We use the term data preprocessing to refer to a collection of procedures aimed at repairing, cleaning, and transforming measurements recorded for HAR. The need for such a step is threefold: (1) measurement systems embedded in smartphones are often less stable than research-grade data acquisition units, so the data might be sampled unevenly or contain missingness or sudden spikes that are unrelated to an individual’s actual behavior; (2) the spatial orientation of the device (how the phone is situated in a person’s pocket, say) influences the tri-axial measurements of inertial sensors, potentially degrading the performance of the HAR system; and (3) despite careful planning and execution of the data acquisition stage, data quality may be compromised by other unpredictable factors, e.g., lack of compliance by the study participants, unequal duration of activities in the measurement (i.e., dataset imbalance), or technological issues.

In our literature review, the first group of obstacles was typically addressed using signal processing techniques (in Fig. 4 , see “standardization”). For instance, to alleviate the mismatch between requested and effective sampling frequency, researchers proposed the use of linear interpolation 55 or spline interpolation 56 (Fig. 8 ). Such procedures were imposed on a range of affected sensors, typically the accelerometer, gyroscope, magnetometer, and barometer. Further time-domain preprocessing considered data trimming, carried out to remove unwanted data components. For this purpose, the beginning and end of each activity bout, a short period of activity of a specified kind, were clipped as nonrepresentative for the given activity 46 . During this stage, the researchers also dealt with dataset imbalance, which occurs when there are different numbers of observations for different activity classes in the training dataset. Such a situation makes the classifier susceptible to overfitting in favor of the larger class; in the reviewed studies, this issue was resolved using up-sampling or down-sampling of data 17 , 57 , 58 , 59 . In addition, the measurements were processed for high-frequency noise cancellation (i.e., “denoising”). The literature review identified several methods suitable for this task, including the use of low-pass finite impulse response filters (with a cutoff frequency typically equal to 10 Hz for inertial sensors and 0.1 Hz for barometers) 60 , 61 , which remove the portion of the signal that is unlikely to result from the activities of interest; weighted moving average 55 ; moving median 45 , 62 ; and singular-value decomposition 63 . GPS data were sometimes de-noised based on the maximum allowed positional accuracy 64 .
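As a minimal illustration of two of these standardization steps, the sketch below interpolates an unevenly sampled accelerometer axis onto a regular 30 Hz grid and then applies a low-pass FIR filter with a 10 Hz cutoff (the typical values cited above); the data are synthetic and the function names and parameter choices are illustrative assumptions, not taken from a specific study.

    import numpy as np
    from scipy.interpolate import interp1d
    from scipy.signal import firwin, filtfilt

    def standardize_axis(t, x, fs=30.0, cutoff=10.0):
        """Resample one accelerometer axis to an even grid and low-pass filter it.

        t: irregular timestamps (s); x: raw samples; fs: target rate (Hz);
        cutoff: low-pass cutoff (Hz), 10 Hz being a typical value for inertial sensors.
        """
        # 1) Linear interpolation onto an evenly spaced time grid
        t_even = np.arange(t[0], t[-1], 1.0 / fs)
        x_even = interp1d(t, x, kind="linear")(t_even)

        # 2) Low-pass FIR filter (zero-phase via filtfilt) to suppress high-frequency noise
        taps = firwin(numtaps=31, cutoff=cutoff, fs=fs)
        return t_even, filtfilt(taps, [1.0], x_even)

    # Synthetic, unevenly sampled axis for demonstration
    t_raw = np.sort(np.random.uniform(0, 10, 280))
    x_raw = np.sin(2 * np.pi * 2 * t_raw) + 0.2 * np.random.randn(t_raw.size)
    t_new, x_clean = standardize_axis(t_raw, x_raw)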

Figure 8. Standardization includes relabeling (a), when labels are reassigned to better match transitions between activities; trimming (b), when part of the signal is removed to balance the dataset for system training; interpolation (c), when missing data are filled in based on adjacent observations; and denoising (d), when the signal is filtered from redundant components. The transformation includes normalization (e), when the signal is normalized to unidimensional vector magnitude; rotation (f), when the signal is rotated to a different coordinate system; and separation (g), when the signal is separated into linear and gravitational components. Raw accelerometer data are shown in gray, and preprocessed data are shown using different colors.

Another element of data preprocessing concerns device orientation (in Fig. 4 , see “transformation”). Smartphone measurements are sensitive to device orientation, which may vary due to clothing, body shape, and movement during dynamic activities 57 . One popular solution reported in the literature was to transform the three-dimensional signal into a univariate vector magnitude that is invariant to rotations and more robust to translations. This procedure was often applied to accelerometer, gyroscope, and magnetometer data. Accelerometer data were also subjected to digital filtering by separating the signal into linear (related to body motions) and gravitational (related to device spatial orientation) acceleration 65 . This separation was typically performed using a high-pass Butterworth filter of low order (e.g., order 3) with a cutoff frequency below 1 Hz. Other approaches transformed the tri-axial measurement into a bi-axial one with horizontal and vertical axes 49 , or projected the data from the device coordinate system into a fixed coordinate system (e.g., the coordinate system of a smartphone that lies flat on the ground) using a rotation matrix (Euler angle-based 66 or quaternion 47 , 67 ).
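The two most common orientation-handling steps, computing the vector magnitude and separating gravitational from linear acceleration, can be sketched as follows. The sketch isolates gravity with a low-order low-pass Butterworth filter and subtracts it, a complementary formulation of the high-pass filtering described above; the cutoff, filter order, and synthetic data are illustrative assumptions rather than any study's settings.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def vector_magnitude(acc):
        """Rotation-invariant magnitude of a tri-axial signal; acc has shape (n, 3)."""
        return np.linalg.norm(acc, axis=1)

    def split_gravity_linear(acc, fs=30.0, cutoff=0.3, order=3):
        """Separate gravitational and linear acceleration with a low-order Butterworth filter.

        A low-pass filter well below 1 Hz (here 0.3 Hz) isolates the slowly varying
        gravity component; subtracting it leaves body (linear) acceleration.
        """
        b, a = butter(order, cutoff, btype="low", fs=fs)
        gravity = filtfilt(b, a, acc, axis=0)
        return gravity, acc - gravity

    acc = np.random.randn(300, 3) * 0.1 + np.array([0.0, 0.0, 9.81])  # phone lying roughly flat
    vm = vector_magnitude(acc)
    gravity, linear = split_gravity_linear(acc)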

Feature extraction

We use the term feature extraction to refer to a process of selecting and computing meaningful summaries of smartphone data for the goal of activity classification. A typical extraction scheme includes data visualization, data segmentation, feature selection, and feature calculation. A careful feature extraction step allows investigators not only to understand the physical nature of activities and their manifestation in digital measurements, but also, and more importantly, to help uncover hidden structures and patterns in the data. The identified differences are later quantified through various statistical measures to distinguish between activities. In an alternative approach, the process of feature extraction is automated using deep learning, which handles feature selection using simple signal processing units, called neurons, that have been arranged in a network structure that is multiple layers deep 59 , 68 , 69 , 70 . As with many applications of deep learning, the results may not be easily interpretable.

The conventional approach to feature extraction begins with data exploration. For this purpose, researchers in our reviewed studies employed various graphical data exploration techniques like scatter plots, lag plots, autocorrelation plots, histograms, and power spectra 71 . The choice of tools was often dictated by the study objectives and methods. For example, research on inertial sensors typically presented raw three-dimensional data from accelerometers, gyroscopes, and magnetometers plotted for the corresponding activities of standing, walking, and stair climbing 50 , 72 , 73 . Acceleration data were often inspected in the frequency domain, particularly to observe periodic motions of walking, running, and cycling 45 , and the impact of the external environment, like natural vibration frequencies of a bus or a subway 74 . Locomotion and mobility were investigated using estimates of speed derived from GPS. In such settings, investigators calculated the average speed of the device and associated it with either the group of motorized (car, bus, train, etc.) or non-motorized (walking, cycling, etc.) modes of transportation.

In the next step, measurements are divided into smaller fragments (also, segments or epochs) and signal features are calculated for each fragment (Fig. 9 ). In the reviewed studies, this segmentation was typically conducted using a windowing technique that allows consecutive windows to overlap. The window size usually had a fixed length that varied from 1 to 5 s, while the overlap of consecutive windows was often set to 50%. Several studies that investigated the optimal window size supported this common finding: short windows (1–2 s) were sufficient for recognizing posture and mobility, whereas somewhat longer windows (4–5 s) had better classification performance 75 , 76 , 77 . Even longer windows (10 s or more) were recommended for recognizing locomotion modes or for HAR systems employing frequency-domain features calculated with the Fourier transform (resolution of the resulting frequency spectrum is inversely proportional to window length) 42 . In principle, this calibration aims to closely match the window size with the duration of a single instance of the activity (e.g., one step). Similar motivation led researchers to seek more adaptive segmentation methods. One idea was to segment data based on specific time-domain events, like zero-cross points (when the signal changes value from positive to negative or vice versa), peak points (local maxima), or valley points (local minima), which represent the start and endpoints of a particular activity bout 55 , 57 . This allowed for segments to have different lengths corresponding to a single fundamental period of the activity in question. Such an approach was typically used to recognize quasiperiodic activities like walking, running, and stair climbing 63 .
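A fixed-length sliding window with 50% overlap, the most common segmentation scheme among the reviewed studies, can be expressed in a few lines; the window length, sampling rate, and synthetic input below are illustrative choices, not values prescribed by any particular study.

    import numpy as np

    def sliding_windows(signal, fs=30.0, window_s=2.5, overlap=0.5):
        """Split a signal of shape (n, channels) into fixed-length windows.

        window_s: window length in seconds (1-5 s is typical in the literature);
        overlap: fraction of overlap between consecutive windows (often 50%).
        """
        win = int(window_s * fs)
        step = int(win * (1.0 - overlap))
        starts = range(0, signal.shape[0] - win + 1, step)
        return np.stack([signal[s:s + win] for s in starts])

    acc = np.random.randn(3000, 3)      # 100 s of tri-axial data at 30 Hz (placeholder)
    windows = sliding_windows(acc)      # shape: (n_windows, 75, 3)
    print(windows.shape)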

Figure 9. An analyzed measurement (a) is segmented into smaller fragments using a sliding window (b). Depending on the approach, each segment may then be used to compute time-domain (c) or frequency-domain features (d), but it may also serve as the activity template (e), or as input for deep learning networks that compute hidden (“deep”) features (f). The selected feature extraction approach determines the activity classifier: time- and frequency-domain features are paired with machine learning classifiers (g) and activity templates are investigated using distance metrics (h), while deep features are computed within embedded layers of convolutional neural networks (i).

The literature described a large variety of signal features used for HAR, which can be divided into several categories based on the initial signal processing procedure. This enables one to distinguish between activity templates (i.e., raw signal), deep features (i.e., hidden features calculated within layers of deep neural networks), time-domain features (i.e., statistical measures of time-series data), and frequency-domain features (i.e., statistical measures of frequency representation of time-series data). The most popular features in the reviewed papers were calculated from time-domain signals as descriptive statistics, such as local mean, variance, minimum and maximum, interquartile range, signal energy (defined as the area under the squared magnitude of the considered continuous signal), and higher-order statistics. Other time-domain features included mean absolute deviation, mean (or zero) crossing rate, regression coefficients, and autocorrelation. Some studies described novel and customized time-domain features, like histograms of gradients 78 , and the number of local maxima and minima, their amplitude, and the temporal distance between them 39 . Time-domain features were typically calculated over each axis of the three-dimensional measurement or orientation-invariant vector magnitude. Studies that used GPS also calculated average speed 64 , 79 , 80 , while studies that used the barometer analyzed the pressure derivative 81 .
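A sketch of how several of the time-domain features listed above might be computed for a single window of the vector magnitude signal is given below; the feature set and window size are illustrative rather than a canonical list from the literature.

    import numpy as np
    from scipy.stats import skew, kurtosis

    def time_domain_features(window):
        """Common time-domain features for one window of the vector magnitude signal."""
        return {
            "mean": np.mean(window),
            "variance": np.var(window),
            "min": np.min(window),
            "max": np.max(window),
            "iqr": np.percentile(window, 75) - np.percentile(window, 25),
            "energy": np.sum(window ** 2),                      # area under the squared magnitude
            "mad": np.mean(np.abs(window - np.mean(window))),   # mean absolute deviation
            "skewness": skew(window),
            "kurtosis": kurtosis(window),
            "mean_crossings": np.sum(np.diff(np.sign(window - np.mean(window))) != 0),
        }

    window = np.random.randn(75)        # one 2.5 s window of vector magnitude at 30 Hz (placeholder)
    features = time_domain_features(window)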

Signals transformed to the frequency domain were less exploited in the literature. A commonly performed signal decomposition used the fast Fourier transform (FFT) 82 , 83 , an algorithm that converts a temporal sequence of samples to a sequence of frequencies present in that sample. The essential advantage of frequency-domain features over time-domain features is their ability to identify and isolate certain periodic components of performed activities. This enabled researchers to estimate (kinetic) energy within particular frequency bands associated with human activities, like gait and running 51 , as well as with different modes of locomotion 74 . Other frequency-domain features included spectral entropy and parameters of the dominant peak, e.g., its frequency and amplitude.
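The following sketch illustrates frequency-domain features of the kind described above, computed with the FFT for one window: the dominant peak, spectral entropy, and the energy in a band loosely associated with gait. The band limits, window, and sampling rate are illustrative assumptions, not thresholds from the reviewed studies.

    import numpy as np

    def frequency_domain_features(window, fs=30.0):
        """Dominant-peak, spectral-entropy, and band-energy features from a single window."""
        spectrum = np.abs(np.fft.rfft(window - np.mean(window)))
        freqs = np.fft.rfftfreq(window.size, d=1.0 / fs)
        power = spectrum ** 2
        p = power / np.sum(power)                    # normalized power distribution
        dominant = np.argmax(power[1:]) + 1          # skip the DC bin
        return {
            "dominant_freq_hz": freqs[dominant],
            "dominant_amplitude": spectrum[dominant],
            "spectral_entropy": -np.sum(p * np.log2(p + 1e-12)),
            # energy in a band often associated with walking cadence (~1.4-2.3 Hz, illustrative)
            "gait_band_energy": np.sum(power[(freqs >= 1.4) & (freqs <= 2.3)]),
        }

    window = np.sin(2 * np.pi * 1.9 * np.arange(150) / 30.0)   # 5 s of a 1.9 Hz rhythm
    features = frequency_domain_features(window)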

Activity templates function essentially as blueprints for different types of physical activity. In the HAR systems, we reviewed, these templates were compared to patterns of observed raw measurements using various distance metrics 38 , 84 , such as the Euclidean or Manhattan distance. Given the heterogeneous nature of human activities, activity templates were often enhanced using techniques similar to dynamic time warping 29 , 57 , which measures the similarity of two temporal sequences that may vary in speed. As an alternative to raw measurements, some studies used signal symbolic approximation, which translates a segmented time-series signal into sequences of symbols based on a predefined mapping rule (e.g., amplitude between −1 and −0.5 g represents symbol “a”, amplitude between −0.5 and 0 g represents symbol “b”, and so on) 85 , 86 , 87 .
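As an illustration of template matching, the sketch below implements a naive dynamic time warping distance and assigns a segment the label of its closest template; the templates are synthetic sinusoids standing in for real activity blueprints, and the code is a simplified sketch of the general idea rather than any study's method.

    import numpy as np

    def dtw_distance(a, b):
        """Dynamic time warping distance between two 1-D sequences (naive O(len(a)*len(b)))."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m]

    def classify_by_template(segment, templates):
        """Assign a segment the label of its closest activity template."""
        return min(templates, key=lambda label: dtw_distance(segment, templates[label]))

    templates = {
        "walking": np.sin(2 * np.pi * 1.9 * np.arange(60) / 30.0),
        "running": np.sin(2 * np.pi * 2.8 * np.arange(60) / 30.0),
    }
    segment = np.sin(2 * np.pi * 2.0 * np.arange(55) / 30.0)   # slightly slower, shorter bout
    print(classify_by_template(segment, templates))             # expected: "walking"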

More recent studies utilized deep features. In these approaches, smartphone data were either fed to deep neural networks as raw univariate or multivariate time series 35 , 48 , 60 or preprocessed into handcrafted time- and frequency-domain feature vectors 82 , 83 . Within the network layers, the input data were then transformed (e.g., using convolution) to produce two-dimensional activation maps that revealed hidden spatial relations between axes and sensors specific to a given activity. To improve the resolution of input data, one study proposed to split the integer and decimal values of accelerometer measurements 41 .
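A minimal sketch of a one-dimensional convolutional network of the kind used to learn deep features from raw windows is shown below (assuming PyTorch is available); the architecture, layer sizes, and number of classes are illustrative assumptions, not a configuration reported in the reviewed studies.

    import torch
    import torch.nn as nn

    class TinyHARNet(nn.Module):
        """Minimal 1-D CNN: the convolutional layers act as learned ("deep") feature extractors."""
        def __init__(self, n_channels=3, n_classes=5, window_len=128):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(n_channels, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
                nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            )
            self.classifier = nn.Linear(64 * (window_len // 4), n_classes)

        def forward(self, x):            # x: (batch, channels, time)
            z = self.features(x)
            return self.classifier(z.flatten(1))

    model = TinyHARNet()
    windows = torch.randn(8, 3, 128)     # 8 tri-axial windows of 128 samples each (placeholder)
    logits = model(windows)              # (8, 5) class scores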

In the reviewed articles, the number of extracted features typically varied from a few to a dozen. However, some studies purposely calculated too many features (sometimes hundreds) and let the analytical method perform variable selection, i.e., identify those features that were most informative for HAR 88 . Support vector machines 81 , 89 , gain ratio 43 , recursive feature elimination 38 , correlation-based feature selection 51 , and principal component analysis 90 were among the popular feature selection/dimension reduction methods used.
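The sketch below illustrates two of the feature selection and dimension reduction strategies mentioned above, recursive feature elimination and principal component analysis, using scikit-learn on placeholder features; the feature matrix, labels, and retained dimensions are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import RFE
    from sklearn.svm import SVC

    X = np.random.randn(200, 60)            # 200 windows x 60 handcrafted features (placeholder)
    y = np.random.randint(0, 4, size=200)   # 4 activity classes (placeholder)

    # Recursive feature elimination with a linear SVM keeps the 15 most informative features
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=15).fit(X, y)
    X_selected = rfe.transform(X)

    # Alternatively, PCA projects the features onto 10 orthogonal components
    X_reduced = PCA(n_components=10).fit_transform(X)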

Activity classification

We use the term activity classification to refer to a process of associating extracted features with particular activity classes based on the adopted classification principle. The classification is typically performed by a supervised learning algorithm that has been trained to recognize patterns between features and labeled physical activities in the training dataset. The fitted model is then validated on separate observations, using a validation dataset, usually data obtained from the same group of study participants. The comparison between predictions made by the model and the known true labels allows one to assess the accuracy of the approach. This section summarizes the methods used in classification and validation, and also provides some insights into reporting on HAR performance.

The choice of classifier aims to identify a method that has the highest classification accuracy for the collected datasets and for the given data processing environment (e.g., online vs. offline). The reviewed literature included a broad range of classifiers, from simple decision trees 18 , k-nearest neighbors 65 , support vector machines 91 , 92 , 93 , logistic regression 21 , naïve Bayes 94 , and fuzzy logic 64 to ensemble classifiers such as random forest 76 , XGBoost 95 , AdaBoost 45 , 96 , bagging 24 , and deep neural networks 48 , 60 , 82 , 97 , 98 , 99 . Simple classifiers were frequently compared to find the best solution in the given measurement scenario 43 , 53 , 100 , 101 , 102 . A similar type of analysis was implemented for ensemble classifiers 79 . Incremental learning techniques were proposed to adapt the classification model to new data streams and unseen activities 103 , 104 , 105 . Other semi-supervised approaches were proposed to utilize unlabeled data to improve the personalization of HAR systems 106 and data annotation 53 , 70 . To increase the effectiveness of HAR, some studies used a hierarchical approach, where the classification was performed in separate stages and each stage could use a different classifier. The multi-stage technique was used for gradual decomposition of activities (coarse-grained first, then fine-grained) 22 , 37 , 52 , 60 and to handle the predicament of changing sensor location (body location first, then activity) 91 . Multi-instance multi-label approaches were adapted for the classification of complex activities (i.e., activities that consist of several basic activities) 62 , 107 as well as for recognition of basic activities paired with different sensor locations 108 .
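A typical comparison of simple classifiers, of the kind reported in many of the reviewed studies, might look like the scikit-learn sketch below; the placeholder feature matrix and the particular classifiers and settings are illustrative assumptions rather than a replication of any study.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X = np.random.randn(300, 20)            # window-level feature vectors (placeholder)
    y = np.random.randint(0, 5, size=300)   # activity labels (placeholder)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    classifiers = {
        "decision tree": DecisionTreeClassifier(),
        "k-nearest neighbors": KNeighborsClassifier(),
        "support vector machine": SVC(),
        "random forest": RandomForestClassifier(n_estimators=100),
    }
    for name, clf in classifiers.items():
        acc = clf.fit(X_train, y_train).score(X_test, y_test)
        print(f"{name}: {acc:.3f}")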

Classification accuracy could also be improved by using post-processing, which relies on modifying the initially assigned label using the rules of logic and probability. The correction was typically performed based on activity duration 74 , activity sequence 25 , and activity transition probability and classification confidence 80 , 109 .
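A simple form of such post-processing is majority-vote smoothing of the predicted label sequence, sketched below; the window width and example labels are illustrative, and published approaches typically use richer rules based on activity duration, transition probabilities, and classification confidence.

    import numpy as np

    def smooth_labels(pred, width=5):
        """Replace each predicted label by the majority vote within a centered window.

        Brief, implausible label flips (e.g., a single 'running' window inside a long
        walking bout) are removed; `width` should be odd.
        """
        half = width // 2
        padded = np.concatenate([pred[:1].repeat(half), pred, pred[-1:].repeat(half)])
        smoothed = []
        for i in range(len(pred)):
            values, counts = np.unique(padded[i:i + width], return_counts=True)
            smoothed.append(values[np.argmax(counts)])
        return np.array(smoothed)

    pred = np.array(["walk", "walk", "run", "walk", "walk", "sit", "sit", "sit"])
    print(smooth_labels(pred, width=3))   # the isolated 'run' window is relabeled 'walk'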

The selected method is typically cross-validated, which splits the collected dataset into two or more parts—training and testing—and only uses the part of the data for testing that was not used for training. The literature mentions a few cross-validation procedures, with k -fold and leave-one-out cross-validation being the most common 110 . Popular train-test proportions were 90–10, 70–30, and 60–40. A validation is especially valuable if it is performed using studies with different demographics and smartphone use habits. Such an approach allows one to understand the generalizability of the HAR system to real-life conditions and populations. We found a few studies that followed this validation approach 18 , 21 , 71 .
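The sketch below illustrates k-fold cross-validation and the accompanying accuracy reporting (confusion matrix, precision, recall, and F-score) with scikit-learn; the random placeholder features and labels are illustrative assumptions, so the printed numbers demonstrate only the workflow, not real performance.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.model_selection import cross_val_predict, cross_val_score

    X = np.random.randn(300, 20)            # window-level feature vectors (placeholder)
    y = np.random.randint(0, 3, size=300)   # labels for, e.g., sitting/walking/stairs (placeholder)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)

    # 5-fold cross-validation: each window is used exactly once for testing
    scores = cross_val_score(clf, X, y, cv=5)
    print("accuracy per fold:", np.round(scores, 3))

    # Cross-validated predictions support a confusion matrix and precision/recall/F-score
    y_pred = cross_val_predict(clf, X, y, cv=5)
    print(confusion_matrix(y, y_pred))
    print(classification_report(y, y_pred))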

Activity classification is the last stage of HAR. In our review, we found that analysis results were typically reported in terms of classification accuracy using various standard metrics like precision, recall, and F-score. Overall, the investigated studies reported very high classification accuracies, typically above 95%. Several comparisons revealed that ensemble classifiers tended to outperform individual or single classifiers 27 , 77 , and deep-learning classifiers tended to outperform both individual and ensemble classifiers 48 . More nuanced summaries used the confusion matrix, which allows one to examine which activities are more likely to be classified incorrectly. This approach was particularly useful for visualizing classification differences between similar activities, such as normal and fast walking or bus and train riding. Additional statistics were usually provided in the context of HAR systems designed to operate on the device. In this case, activity classification needed to be balanced among acceptable classifier performance, processing time, and battery drain 44 . The desired performance optimum was obtained by making use of dataset remodeling (e.g., by replacing the oldest observations with the newest ones), low-cost classification algorithms, limited preprocessing, and conscientious feature selection 45 , 86 . Computation time was sometimes reported for complex methods, such as deep neural networks 20 , 82 , 111 and extreme learning machine 112 , as well as for symbolic representation 85 , 86 and in comparative analyses 46 . A comprehensive comparison of results was difficult or impossible, as discussed below.

Over the past decade, many studies have investigated HAR using smartphones. The reviewed literature provides detailed descriptions of essential aspects of data acquisition, data preprocessing, feature extraction, and activity classification. Studies were conducted with one or more objectives, e.g., to limit technological imperfections (e.g., no GPS signal reception indoors), to minimize computational requirements (e.g., for online processing of data directly on the device), and to maximize classification accuracy (all studies). Our review summarizes the most frequently used methods and offers available alternatives.

As expected, no single activity recognition procedure was found to work in all settings, which underlines the importance of designing methods and algorithms that address specific research questions in health while keeping the specifics of the study cohort in mind (e.g., age distribution, the extent of device use, and nature of disability). While datasets were usually collected in laboratory settings, there was little evidence that algorithms trained using data collected in these controlled settings could be generalized to free-living conditions 113 , 114 . In free-living settings, duration, frequency, and specific ways of performing any activity are subject to context and individual ability, and these degrees of freedom need to be considered in the development of HAR systems. Validation of these data in free-living settings is essential, as the true value of HAR systems for public health will come through transportable and scalable applications in large, long-term observational studies or real-world interventions.

Some studies were conducted with a small number of able-bodied volunteers. This makes the process of data handling and classification easier but also limits the generalizability of the approach to more diverse populations. The latter point was well demonstrated in two of the investigated studies. In the first study, the authors observed that the performance of a classifier trained on a young cohort significantly decreases if validated on an older cohort 18 . Similar conclusions can be drawn from the second study, where the observations on healthy individuals did not replicate in individuals with Parkinson’s disease 21 . These facts highlight the role of algorithmic fairness (or fairness of machine learning), the notion that the performance of an algorithm should not depend on variables considered sensitive, such as race, ethnicity, sexual orientation, age, and disability. A highly visible example of this was the decision of some large companies, including IBM, to stop providing facial recognition technology to police departments for mass surveillance 115 , and the European Commission has considered a ban on the use of facial recognition in public spaces 116 . These decisions followed findings demonstrating the poor performance of facial recognition algorithms when applied to individuals with dark-skin tones.

The majority of the studies we reviewed utilized stationary smartphones at a single body position (i.e., a specific pants pocket), sometimes even with a fixed orientation. However, such scenarios are rarely observed in real-life settings, and these types of studies should be considered more as proofs of concept. Indeed, as demonstrated in several studies, inertial sensor data might not share similar features across body locations 49 , 117 , and smartphone orientation introduces additional artifacts to each axis of measurement, which make any distribution-based features (e.g., mean, range, skewness) difficult to use without appropriate data preprocessing. Many studies provided only incomplete descriptions of the experimental setup and study protocol and gave few details on demographics, environmental context, and the performed activities. Such information should be reported as fully and accurately as possible.

Only a few studies considered classification in a context that involves activities outside the set of activities the system was trained on; for example, if the system was trained to recognize walking and running, these were the only two activities that the system was later tested on. However, real-life activities are not limited to a prescribed set of behaviors, i.e., we do not just sit still, stand still, walk, and climb stairs. These classifiers, when applied to free-living conditions, will naturally miss the activities they were not trained on but will also likely overestimate those activities they were trained on. An improved scheme could assume that the observed activities are a sample from a broader spectrum of possible behaviors, including periods when the smartphone is not on a person, or assess the uncertainty associated with the classification of each type of activity 84 . This could also provide for an adaptive approach that would enable observation/interventions suited to a broad range of activities relevant for health, including decreasing sedentary behavior, increasing active transport (i.e., walking, bicycling, or public transit), and improving circadian patterns/sleep.

The use of personal digital devices, in particular smartphones, makes it possible to follow large numbers of individuals over long periods of time, but invariably investigators need to consider approaches to missing sensor data, which is a common problem. The importance of this problem is illustrated in a recent paper that introduced a resampling approach to imputing missing smartphone GPS data; the authors found that relative to linear interpolation—the naïve approach to missing spatial data—imputation resulted in a tenfold reduction in the error averaged across all daily mobility features 118 . On the flip side of missing data is the need to propagate uncertainty, in a statistically principled way, from the gaps in the raw data to the inferences that investigators wish to draw from the data. It is a common observation that different people use their phones differently, and some may barely use their phones at all; the net result is not that the data collected from these individuals are not useful, but rather that the data are less informative about their behavior than they ideally might be. Dealing with missing data and accounting for the resulting uncertainty is important because it means that one does not have to exclude participants from a study because their data fail to meet some arbitrary threshold of completeness; instead, everyone counts, and every bit of data from each individual counts.

The collection of behavioral data using smartphones understandably raises concerns about privacy; however, investigators in health research are well-positioned to understand and address these concerns given that health data are generally considered personal and private in nature. Consequently, there are established practices and common regulations on human subjects’ research, where informed consent of the individual to participate is one of the key foundations of any ethically conducted study. Federated learning is a machine learning technique that can be used to train an algorithm across decentralized devices, here smartphones, using only local data (data from the individual) and without the need to exchange data with other devices. This approach appears at first to provide a powerful solution to the privacy problem: the personal data never leave the person’s phone and only the outputs of the learning process, generally parameter estimates, are shared with others. This is where the tension between privacy and the need for reproducible research arises, however. The reason for data collection is to produce generalizable knowledge, but according to an often-cited study, 65% of medical studies were inconsistent when retested and only 6% were completely reproducible 12 . In the studies reviewed here, only 4 out of 108 made the source code or the methods used in the study publicly available. For a given scientific question, studies that are not replicable require the collection of more private and personal data; this highlights the importance of reproducibility of studies, especially in health, where there are both financial and ethical considerations when conducting research. If federated learning provides no possibility to confirm data analyses, to re-analyze data using different methods, or to pool data across studies, it by itself cannot be the solution to the privacy problem. Nevertheless, the technique may act as inspiration for developing privacy-preserving methods that also enable future replication of studies. One possibility is to use publicly available datasets (Table 1 ). If sharing of source code were more common, HAR methods could be tested on these publicly available datasets, perhaps in a similar way as datasets of handwritten digits are used to test classification methods in machine learning research. Although some efforts have been made in this area 42 , 119 , 120 , 121 , the recommended course of action assumes collecting and analyzing data from a large spectrum of sensors on diverse and understudied populations and validating classifiers against widely accepted gold standards.

When accurate, reproducible, and transportable methods coalesce to recognize a range of relevant activity patterns, smartphone-based HAR approaches will provide a fundamental tool for public health researchers and practitioners alike. We hope that this paper has provided to the reader some insights into how smartphones may be used to quantify human behavior in health research and the complexities that are involved in the collection and analysis of such data in this challenging but important field.

Data availability

Aggregated data analyzed in this study are available from the corresponding author upon request.

Code availability

Scripts used to process the aggregated data are available from the corresponding author upon request.

GSM Association. The mobile economy 2020. https://www.gsma.com/mobileeconomy/wp-content/uploads/2020/03/GSMA_MobileEconomy2020_Global.pdf (2020).

Mercer, K. et al. Acceptance of commercially available wearable activity trackers among adults aged over 50 and with chronic illness: a mixed-methods evaluation. JMIR mHealth uHealth 4 , e7 (2016).

Anderson, M. & Perrin, A. Tech adoption climbs among older adults. http://www.pewinternet.org/wp-content/uploads/sites/9/2017/05/PI_2017.05.17_Older-Americans-Tech_FINAL.pdf (2017).

Taylor, K. & Silver, L. Smartphone ownership is growing rapidly around the world, but not always equally. http://www.pewresearch.org/global/wp-content/uploads/sites/2/2019/02/Pew-Research-Center_Global-Technology-Use-2018_2019-02-05.pdf (2019).

Cooper, A. R., Page, A., Fox, K. R. & Misson, J. Physical activity patterns in normal, overweight and obese individuals using minute-by-minute accelerometry. Eur. J. Clin. Nutr. 54 , 887–894 (2000).

Ekelund, U., Brage, S., Griffin, S. J. & Wareham, N. J. Objectively measured moderate- and vigorous-intensity physical activity but not sedentary time predicts insulin resistance in high-risk individuals. Diabetes Care 32 , 1081–1086 (2009).

Legge, A., Blanchard, C. & Hanly, J. G. Physical activity, sedentary behaviour and their associations with cardiovascular risk in systemic lupus erythematosus. Rheumatology https://doi.org/10.1093/rheumatology/kez429 (2019).

Loprinzi, P. D., Franz, C. & Hager, K. K. Accelerometer-assessed physical activity and depression among U.S. adults with diabetes. Ment. Health Phys. Act. 6 , 79–82 (2013).

Smirnova, E. et al. The predictive performance of objective measures of physical activity derived from accelerometry data for 5-year all-cause mortality in older adults: National Health and Nutritional Examination Survey 2003–2006. J. Gerontol. Ser. A https://doi.org/10.1093/gerona/glz193 (2019).

Wigginton, C. Global Mobile Consumer Trends, 2nd edition. Deloitte, https://www2.deloitte.com/content/dam/Deloitte/us/Documents/technology-media-telecommunications/us-global-mobile-consumer-survey-second-edition.pdf (2017).

Coorevits, L. & Coenen, T. The rise and fall of wearable fitness trackers. Acad. Manag. 2016, https://doi.org/10.5465/ambpp.2016.17305abstract (2016).

Prinz, F., Schlange, T. & Asadullah, K. Believe it or not: how much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov. 10 , 712 (2011).

Kubota, K. J., Chen, J. A. & Little, M. A. Machine learning for large-scale wearable sensor data in Parkinson’s disease: concepts, promises, pitfalls, and futures. Mov. Disord. 31 , 1314–1326 (2016).

Iniesta, R., Stahl, D. & McGuffin, P. Machine learning, statistical learning and the future of biological research in psychiatry. Psychol. Med. 46 , 2455–2465 (2016).

Kuehn, B. M. FDA’s foray into big data still maturing. J. Am. Med. Assoc. 315 , 1934–1936 (2016).

Straczkiewicz, M., Glynn, N. W. & Harezlak, J. On placement, location and orientation of wrist-worn tri-axial accelerometers during free-living measurements. Sensors 19 , 2095 (2019).

Esmaeili Kelishomi, A., Garmabaki, A. H. S., Bahaghighat, M. & Dong, J. Mobile user indoor-outdoor detection through physical daily activities. Sensors 19 , 511 (2019).

Del Rosario, M. B. et al. A comparison of activity classification in younger and older cohorts using a smartphone. Physiol. Meas. 35 , 2269–2286 (2014).

Del Rosario, M. B., Lovell, N. H. & Redmond, S. J. Learning the orientation of a loosely-fixed wearable IMU relative to the body improves the recognition rate of human postures and activities. Sensors 19 , 2845 (2019).

Nan, Y. et al. Deep learning for activity recognition in older people using a pocket-worn smartphone. Sensors 20 , 7195 (2020).

Albert, M. V., Toledo, S., Shapiro, M. & Kording, K. Using mobile phones for activity recognition in Parkinson’s patients. Front. Neurol. 3 , 158 (2012).

Liang, Y., Zhou, X., Yu, Z. & Guo, B. Energy-efficient motion related activity recognition on mobile devices for pervasive healthcare. Mob. Netw. Appl. 19 , 303–317 (2014).

Gjoreski, H. et al. The university of Sussex-Huawei locomotion and transportation dataset for multimodal analytics with mobile devices. IEEE Access 6 , 42592–42604 (2018).

Wu, W., Dasgupta, S., Ramirez, E. E., Peterson, C. & Norman, G. J. Classification accuracies of physical activities using smartphone motion sensors. J. Med. Internet Res. 14 , e130 (2012).

Guvensan, M. A., Dusun, B., Can, B. & Turkmen, H. I. A novel segment-based approach for improving classification performance of transport mode detection. Sensors 18 , 87 (2018).

Pei, L. et al. Human behavior cognition using smartphone sensors. Sensors 13 , 1402–1424 (2013).

Della Mea, V., Quattrin, O. & Parpinel, M. A feasibility study on smartphone accelerometer-based recognition of household activities and influence of smartphone position. Inform. Heal. Soc. Care 42 , 321–334 (2017).

Klein, I. Smartphone location recognition: a deep learning-based approach. Sensors 20 , 214 (2020).

Casado, F. E. et al. Walking recognition in mobile devices. Sensors 20 , 1189 (2020).

O’Dea, S. Global smartphone market share worldwide by vendor 2009–2020. https://www.statista.com/statistics/271496/global-market-share-held-by-smartphone-vendors-since-4th-quarter-2009/ (2021).

GSMArena. https://www.gsmarena.com/ (2021). Accessed 24 March 2021.

Shoaib, M., Bosch, S., Durmaz Incel, O., Scholten, H. & Havinga, P. J. M. Fusion of smartphone motion sensors for physical activity recognition. Sensors 14 , 10146–10176 (2014).

Vanini, S., Faraci, F., Ferrari, A. & Giordano, S. Using barometric pressure data to recognize vertical displacement activities on smartphones. Comput. Commun. 87 , 37–48 (2016).

Wan, N. & Lin, G. Classifying human activity patterns from smartphone collected GPS data: a fuzzy classification and aggregation approach. Trans. GIS 20 , 869–886 (2016).

Gu, Y., Li, D., Kamiya, Y. & Kamijo, S. Integration of positioning and activity context information for lifelog in urban city area. Navigation 67 , 163–179 (2020).

Miao, F., He, Y., Liu, J., Li, Y. & Ayoola, I. Identifying typical physical activity on smartphone with varying positions and orientations. Biomed. Eng. Online 14 , 32 (2015).

Lee, Y.-S. & Cho, S.-B. Layered hidden Markov models to recognize activity with built-in sensors on Android smartphone. Pattern Anal. Appl. 19 , 1181–1193 (2016).

Martin, B. D., Addona, V., Wolfson, J., Adomavicius, G. & Fan, Y. Methods for real-time prediction of the mode of travel using smartphone-based GPS and accelerometer data. Sensors 17 , 2058 (2017).

Oshin, T. O., Poslad, S. & Zhang, Z. Energy-efficient real-time human mobility state classification using smartphones. IEEE Trans. Comput. 64 , 1680–1693 (2015).


Shin, D. et al. Urban sensing: Using smartphones for transportation mode classification. Comput. Environ. Urban Syst. 53 , 76–86 (2015).

Hur, T. et al. Iss2Image: a novel signal-encoding technique for CNN-based human activity recognition. Sensors 18 , 3910 (2018).

Gjoreski, M. et al. Classical and deep learning methods for recognizing human activities and modes of transportation with smartphone sensors. Inf. Fusion 62 , 47–62 (2020).

Wannenburg, J. & Malekian, R. Physical activity recognition from smartphone accelerometer data for user context awareness sensing. IEEE Trans. Syst. Man, Cybern. Syst. 47 , 3143–3149 (2017).

Yurur, O., Labrador, M. & Moreno, W. Adaptive and energy efficient context representation framework in mobile sensing. IEEE Trans. Mob. Comput. 13 , 1681–1693 (2014).

Li, P., Wang, Y., Tian, Y., Zhou, T.-S. & Li, J.-S. An automatic user-adapted physical activity classification method using smartphones. IEEE Trans. Biomed. Eng. 64 , 706–714 (2017).


Awan, M. A., Guangbin, Z., Kim, C.-G. & Kim, S.-D. Human activity recognition in WSN: a comparative study. Int. J. Networked Distrib. Comput. 2 , 221–230 (2014).

Chen, Z., Zhu, Q., Soh, Y. C. & Zhang, L. Robust human activity recognition using smartphone sensors via CT-PCA and online SVM. IEEE Trans. Ind. Inform. 13 , 3070–3080 (2017).

Zhu, R. et al. Efficient human activity recognition solving the confusing activities via deep ensemble learning. IEEE Access 7 , 75490–75499 (2019).

Yang, R. & Wang, B. PACP: a position-independent activity recognition method using smartphone sensors. Inf 7 , 72 (2016).

Gani, M. O. et al. A light weight smartphone based human activity recognition system with high accuracy. J. Netw. Comput. Appl. 141 , 59–72 (2019).

Reddy, S. et al. Using mobile phones to determine transportation modes. ACM Trans. Sens. Networks 6 , 1–27 (2010).

Guidoux, R. et al. A smartphone-driven methodology for estimating physical activities and energy expenditure in free living conditions. J. Biomed. Inform. 52 , 271–278 (2014).

Cruciani, F. et al. Automatic annotation for human activity recognition in free living using a smartphone. Sensors 18 , 2203 (2018).

Micucci, D., Mobilio, M. & Napoletano, P. UniMiB SHAR: A dataset for human activity recognition using acceleration data from smartphones. Appl. Sci. 7 , 1101 (2017).

Derawi, M. & Bours, P. Gait and activity recognition using commercial phones. Comput. Secur. 39 , 137–144 (2013).

Gu, F., Khoshelham, K., Valaee, S., Shang, J. & Zhang, R. Locomotion activity recognition using stacked denoising autoencoders. IEEE Internet Things J. 5 , 2085–2093 (2018).

Chen, Y. & Shen, C. Performance analysis of smartphone-sensor behavior for human activity recognition. IEEE Access 5 , 3095–3110 (2017).

Javed, A. R. et al. Analyzing the effectiveness and contribution of each axis of tri-axial accelerometer sensor for accurate activity recognition. Sensors 20 , 2216 (2020).

Mukherjee, D., Mondal, R., Singh, P. K., Sarkar, R. & Bhattacharjee, D. EnsemConvNet: a deep learning approach for human activity recognition using smartphone sensors for healthcare applications. Multimed. Tools Appl. 79 , 31663–31690 (2020).

Avilés-Cruz, C., Ferreyra-Ramírez, A., Zúñiga-López, A. & Villegas-Cortéz, J. Coarse-fine convolutional deep-learning strategy for human activity recognition. Sensors 19 , 1556 (2019).

Guiry, J. J., van de Ven, P. & Nelson, J. Multi-sensor fusion for enhanced contextual awareness of everyday activities with ubiquitous devices. Sensors 14 , 5687–5701 (2014).

Saha, J., Chowdhury, C., Ghosh, D. & Bandyopadhyay, S. A detailed human activity transition recognition framework for grossly labeled data from smartphone accelerometer. Multimed. Tools Appl . https://doi.org/10.1007/s11042-020-10046-w (2020).

Ignatov, A. D. & Strijov, V. V. Human activity recognition using quasiperiodic time series collected from a single tri-axial accelerometer. Multimed. Tools Appl. 75 , 7257–7270 (2016).

Das, R. D. & Winter, S. Detecting urban transport modes using a hybrid knowledge driven framework from GPS trajectory. ISPRS Int. J. Geo-Information 5 , 207 (2016).

Arif, M., Bilal, M., Kattan, A. & Ahamed, S. I. Better physical activity classification using smartphone acceleration sensor. J. Med. Syst. 38 , 95 (2014).

Heng, X., Wang, Z. & Wang, J. Human activity recognition based on transformed accelerometer data from a mobile phone. Int. J. Commun. Syst. 29 , 1981–1991 (2016).

Gao, Z., Liu, D., Huang, K. & Huang, Y. Context-aware human activity and smartphone position-mining with motion sensors. Remote Sensing 11 , 2531 (2019).

Kang, J., Kim, J., Lee, S. & Sohn, M. Transition activity recognition using fuzzy logic and overlapped sliding window-based convolutional neural networks. J. Supercomput. 76 , 8003–8020 (2020).

Shojaedini, S. V. & Beirami, M. J. Mobile sensor based human activity recognition: distinguishing of challenging activities by applying long short-term memory deep learning modified by residual network concept. Biomed. Eng. Lett. 10 , 419–430 (2020).

Mairittha, N., Mairittha, T. & Inoue, S. On-device deep personalization for robust activity data collection. Sensors 21 , 41 (2021).

Khan, A. M., Siddiqi, M. H. & Lee, S.-W. Exploratory data analysis of acceleration signals to select light-weight and accurate features for real-time activity recognition on smartphones. Sensors 13 , 13099–13122 (2013).

Ebner, M., Fetzer, T., Bullmann, M., Deinzer, F. & Grzegorzek, M. Recognition of typical locomotion activities based on the sensor data of a smartphone in pocket or hand. Sensors 20 , 6559 (2020).

Voicu, R.-A., Dobre, C., Bajenaru, L. & Ciobanu, R.-I. Human physical activity recognition using smartphone sensors. Sensors 19 , 458 (2019).

Hur, T., Bang, J., Kim, D., Banos, O. & Lee, S. Smartphone location-independent physical activity recognition based on transportation natural vibration analysis. Sensors 17 , 931 (2017).

Bashir, S. A., Doolan, D. C. & Petrovski, A. The effect of window length on accuracy of smartphone-based activity recognition. IAENG Int. J. Comput. Sci. 43 , 126–136 (2016).

Lu, D.-N., Nguyen, D.-N., Nguyen, T.-H. & Nguyen, H.-N. Vehicle mode and driving activity detection based on analyzing sensor data of smartphones. Sensors 18 , 1036 (2018).

Wang, G. et al. Impact of sliding window length in indoor human motion modes and pose pattern recognition based on smartphone sensors. Sensors 18 , 1965 (2018).

Jain, A. & Kanhangad, V. Human activity classification in smartphones using accelerometer and gyroscope sensors. IEEE Sens. J. 18 , 1169–1177 (2018).

Bedogni, L., Di Felice, M. & Bononi, L. Context-aware Android applications through transportation mode detection techniques. Wirel. Commun. Mob. Comput. 16 , 2523–2541 (2016).

Ferreira, P., Zavgorodnii, C. & Veiga, L. edgeTrans—edge transport mode detection. Pervasive Mob. Comput. 69 , 101268 (2020).

Gu, F., Kealy, A., Khoshelham, K. & Shang, J. User-independent motion state recognition using smartphone sensors. Sensors 15 , 30636–30652 (2015).

Li, X., Wang, Y., Zhang, B. & Ma, J. PSDRNN: an efficient and effective HAR scheme based on feature extraction and deep learning. IEEE Trans. Ind. Inform. 16 , 6703–6713 (2020).

Zhao, B., Li, S., Gao, Y., Li, C. & Li, W. A framework of combining short-term spatial/frequency feature extraction and long-term IndRNN for activity recognition. Sensors 20 , 6984 (2020).

Huang, E. J. & Onnela, J.-P. Augmented movelet method for activity classification using smartphone gyroscope and accelerometer data. Sensors 20 , 3706 (2020).

Montero Quispe, K. G., Sousa Lima, W., Macêdo Batista, D. & Souto, E. MBOSS: a symbolic representation of human activity recognition using mobile sensors. Sensors 18 , 4354 (2018).

Sousa Lima, W., de Souza Bragança, H. L., Montero Quispe, K. G. & Pereira Souto, E. J. Human activity recognition based on symbolic representation algorithms for inertial sensors. Sensors 18 , 4045 (2018).

Bragança, H., Colonna, J. G., Lima, W. S. & Souto, E. A smartphone lightweight method for human activity recognition based on information theory. Sensors 20 , 1856 (2020).

Saeedi, S. & El-Sheimy, N. Activity recognition using fusion of low-cost sensors on a smartphone for mobile navigation application. Micromachines 6 , 1100–1134 (2015).

Bilal, M., Shaikh, F. K., Arif, M. & Wyne, M. F. A revised framework of machine learning application for optimal activity recognition. Clust. Comput. 22 , 7257–7273 (2019).

Shi, D., Wang, R., Wu, Y., Mo, X. & Wei, J. A novel orientation- and location-independent activity recognition method. Pers. Ubiquitous Comput. 21 , 427–441 (2017).

Antos, S. A., Albert, M. V. & Kording, K. P. Hand, belt, pocket or bag: practical activity tracking with mobile phones. J. Neurosci. Methods 231 , 22–30 (2014).

Shi, J., Zuo, D. & Zhang, Z. Transition activity recognition system based on standard deviation trend analysis. Sensors 20 , 3117 (2020).

Garcia-Gonzalez, D., Rivero, D., Fernandez-Blanco, E. & Luaces, M. R. A public domain dataset for real-life human activity recognition using smartphone sensors. Sensors 20 , 2200 (2020).

Saeedi, S., Moussa, A. & El-Sheimy, N. Context-aware personal navigation using embedded sensor fusion in smartphones. Sensors 14 , 5742–5767 (2014).

Zhang, W., Zhao, X. & Li, Z. A comprehensive study of smartphone-based indoor activity recognition via Xgboost. IEEE Access 7 , 80027–80042 (2019).

Ferrari, A., Micucci, D., Mobilio, M. & Napoletano, P. On the personalization of classification models for human activity recognition. IEEE Access 8 , 32066–32079 (2020).

Zhou, B., Yang, J. & Li, Q. Smartphone-based activity recognition for indoor localization using a convolutional neural network. Sensors 19 , 621 (2019).

Pires, I. M. et al. Pattern recognition techniques for the identification of activities of daily living using a mobile device accelerometer. Electronics 9 , 509 (2020).

Alo, U. R., Nweke, H. F., Teh, Y. W. & Murtaza, G. Smartphone motion sensor-based complex human activity identification using deep stacked autoencoder algorithm for enhanced smart healthcare system. Sensors 20 , 6300 (2020).

Otebolaku, A. M. & Andrade, M. T. User context recognition using smartphone sensors and classification models. J. Netw. Comput. Appl. 66 , 33–51 (2016).

Zhuo, S. et al. Real-time smartphone activity classification using inertial sensors—recognition of scrolling, typing, and watching videos while sitting or walking. Sensors 20 , 655 (2020).

Asim, Y., Azam, M. A., Ehatisham-ul-Haq, M., Naeem, U. & Khalid, A. Context-aware human activity recognition (CAHAR) in-the-wild using smartphone accelerometer. IEEE Sens. J. 20 , 4361–4371 (2020).

Zhao, Z., Chen, Z., Chen, Y., Wang, S. & Wang, H. A class incremental extreme learning machine for activity recognition. Cogn. Comput. 6 , 423–431 (2014).

Abdallah, Z. S., Gaber, M. M., Srinivasan, B. & Krishnaswamy, S. Adaptive mobile activity recognition system with evolving data streams. Neurocomputing 150 , 304–317 (2015).

Guo, H., Chen, L., Chen, G. & Lv, M. Smartphone-based activity recognition independent of device orientation and placement. Int. J. Commun. Syst. 29 , 2403–2415 (2016).

Cruciani, F. et al. Personalizing activity recognition with a clustering based semi-population approach. IEEE ACCESS 8 , 207794–207804 (2020).

Saha, J., Ghosh, D., Chowdhury, C. & Bandyopadhyay, S. Smart handheld based human activity recognition using multiple instance multiple label learning. Wirel. Pers. Commun . https://doi.org/10.1007/s11277-020-07903-0 (2020).

Mohamed, R., Zainudin, M. N. S., Sulaiman, M. N., Perumal, T. & Mustapha, N. Multi-label classification for physical activity recognition from various accelerometer sensor positions. J. Inf. Commun. Technol. 17 , 209–231 (2018).

Wang, C., Xu, Y., Liang, H., Huang, W. & Zhang, L. WOODY: a post-process method for smartphone-based activity recognition. IEEE Access 6 , 49611–49625 (2018).

Garcia-Ceja, E. & Brena, R. F. An improved three-stage classifier for activity recognition. Int. J. Pattern Recognit. Artif. Intell . 32 , 1860003 (2018).

Ravi, D., Wong, C., Lo, B. & Yang, G.-Z. A deep learning approach to on-node sensor data analytics for mobile or wearable devices. IEEE J. Biomed. Heal. Inform. 21 , 56–64 (2017).

Chen, Z., Jiang, C. & Xie, L. A novel ensemble ELM for human activity recognition using smartphone sensors. IEEE Trans. Ind. Inform. 15 , 2691–2699 (2019).

van Hees, V. T., Golubic, R., Ekelund, U. & Brage, S. Impact of study design on development and evaluation of an activity-type classifier. J. Appl. Physiol. 114 , 1042–1051 (2013).

Sasaki, J. et al. Performance of activity classification algorithms in free-living older adults. Med. Sci. Sports Exerc. 48 , 941–950 (2016).

Allyn, B. IBM abandons facial recognition products, condemns racially biased surveillance. https://www.npr.org/2020/06/09/873298837/ibm-abandons-facial-recognition-products-condemns-racially-biased-surveillance (2020).

Chee, F. Y. EU mulls five-year ban on facial recognition tech in public areas. https://www.reuters.com/article/uk-eu-ai/eu-mulls-five-year-ban-on-facial-recognition-tech-in-public-areas-idINKBN1ZF2QN (2020).

Saha, J., Chowdhury, C., Chowdhury, I. R., Biswas, S. & Aslam, N. An ensemble of condition based classifiers for device independent detailed human activity recognition using smartphones. Information 9 , 94 (2018).

Barnett, I. & Onnela, J.-P. Inferring mobility measures from GPS traces with missing data. Biostatistics 21 , e98–e112 (2020).

Wang, L. et al. Enabling reproducible research in sensor-based transportation mode recognition with the Sussex-Huawei dataset. IEEE ACCESS 7 , 10870–10891 (2019).

Lee, M. H., Kim, J., Jee, S. H. & Yoo, S. K. Integrated solution for physical activity monitoring based on mobile phone and PC. Healthc. Inform. Res. 17 , 76–86 (2011).

Fahim, M., Fatima, I., Lee, S. & Park, Y.-T. EFM: evolutionary fuzzy model for dynamic activities recognition using a smartphone accelerometer. Appl. Intell. 39 , 475–488 (2013).

Yurur, O., Liu, C. H. & Moreno, W. Light-weight online unsupervised posture detection by smartphone accelerometer. IEEE Internet Things J. 2 , 329–339 (2015).

Awan, M. A., Guangbin, Z., Kim, H.-C. & Kim, S.-D. Subject-independent human activity recognition using Smartphone accelerometer with cloud support. Int. J. Ad Hoc Ubiquitous Comput. 20 , 172–185 (2015).

Chen, Z., Wu, J., Castiglione, A. & Wu, W. Human continuous activity recognition based on energy-efficient schemes considering cloud security technology. Secur. Commun. Netw. 9 , 3585–3601 (2016).

Guo, J. et al. Smartphone-based patients’ activity recognition by using a self-learning scheme for medical monitoring. J. Med. Syst. 40 , 140 (2016).

Walse, K. H., Dharaskar, R. V. & Thakare, V. M. A study of human activity recognition using AdaBoost classifiers on WISDM dataset. IIOAB J. 7 , 68–76 (2016).

Lee, K. & Kwan, M.-P. Physical activity classification in free-living conditions using smartphone accelerometer data and exploration of predicted results. Comput. Environ. Urban Syst. 67 , 124–131 (2018).

Ahmad, N. et al. SARM: salah activities recognition model based on smartphone. Electronics 8 , 881 (2019).

Usman Sarwar, M. et al. Recognizing physical activities having complex interclass variations using semantic data of smartphone. Softw. Pract. Exp . 51 , 532–549 (2020).

Kwapisz, J. R., Weiss, G. M. & Moore, S. A. Activity recognition using cell phone accelerometers. in Proceedings of the Fourth International Workshop on Knowledge Discovery from Sensor Data 10–18 https://doi.org/10.1145/1964897.1964918 (Association for Computing Machinery, 2010).

Sharma, A., Singh, S. K., Udmale, S. S., Singh, A. K. & Singh, R. Early transportation mode detection using smartphone sensing data. IEEE Sens. J . 1 , https://doi.org/10.1109/JSEN.2020.3009312 (2020).

Chen, Z. et al. Smartphone sensor-based human activity recognition using feature fusion and maximum full a posteriori. IEEE Trans. Instrum. Meas. 69 , 3992–4001 (2020).

Vavoulas, G. Chatzaki, C. Malliotakis, T. Pediaditis, M. & Tsiknakis, M. The MobiAct Dataset: recognition of activities of daily living using smartphones. In Proceedings of the International Conference on Information and Communication Technologies for Ageing Well and e-Health (eds. Röcker, C., Ziefle, M., O’Donoghue, J, Maciaszek, L. & Molloy W.) Vol. 1: ICT4AWE, (ICT4AGEINGWELL 2016) 143–151, https://www.scitepress.org/ProceedingsDetails.aspx?ID=VhZYzluZTNE=&t=1 (SciTePress, 2016).

Shoaib, M., Bosch, S., Incel, O. D., Scholten, H. & Havinga, P. J. M. Complex human activity recognition using smartphone and wrist-worn motion sensors. Sensors 16 , 426 (2016).

Lockhart, J. W. et al. Design considerations for the WISDM smart phone-based sensor mining architecture. in Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data. 25–33 https://doi.org/10.1145/2003653.2003656 (Association for Computing Machinery, 2011).

Vaizman, Y., Ellis, K. & Lanckriet, G. Recognizing detailed human context in the wild from smartphones and smartwatches. IEEE Pervasive Comput. 16 , 62–74 (2017).

Sztyler, T. & Stuckenschmidt, H. On-body localization of wearable devices: an investigation of position-aware activity recognition. in 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom) 1–9 https://ieeexplore.ieee.org/document/7456521 (IEEE, 2016).

Malekzadeh, M., Clegg, R. G., Cavallaro, A. & Haddadi, H. Mobile sensor data anonymization. in Proceedings of the International Conference on Internet of Things Design and Implementation. 49–58 https://doi.org/10.1145/3302505.3310068 (ACM, 2019).

Carpineti, C., Lomonaco, V., Bedogni, L., Felice, M. D. & Bononi, L. Custom dual transportation mode detection by smartphone devices exploiting sensor diversity. in 2018 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops) 367–372 https://ieeexplore.ieee.org/document/8480119 (IEEE, 2018).

Ichino, H., Kaji, K., Sakurada, K., Hiroi, K. & Kawaguchi, N. HASC-PAC2016: large scale human pedestrian activity corpus and its baseline recognition. in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct. 705–714 https://doi.org/10.1145/2968219.2968277 (Association for Computing Machinery, 2016).


Acknowledgements

Drs. Straczkiewicz and Onnela are supported by NHLBI award U01HL145386 and NIMH award R37MH119194. Dr. Onnela is also supported by the NIMH award U01MH116928. Dr. James is supported by NCI award R00CA201542 and NHLBI award R01HL150119.

Author information

Authors and Affiliations

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA

Marcin Straczkiewicz & Jukka-Pekka Onnela

Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA, 02215, USA

Peter James

Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA

Peter James

Contributions

M.S. conducted the review, prepared figures, and wrote the initial draft. P.J. and J.P.O. revised the manuscript. J.P.O. supervised the project. All authors reviewed the manuscript.

Corresponding author

Correspondence to Marcin Straczkiewicz.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Straczkiewicz, M., James, P. & Onnela, JP. A systematic review of smartphone-based human activity recognition methods for health research. npj Digit. Med. 4 , 148 (2021). https://doi.org/10.1038/s41746-021-00514-4


Received : 24 March 2021

Accepted : 13 September 2021

Published : 18 October 2021

DOI : https://doi.org/10.1038/s41746-021-00514-4









REVIEW article

A Review of Human Activity Recognition Methods

Michalis Vrigkas

  • 1 Department of Computer Science and Engineering, University of Ioannina, Ioannina, Greece
  • 2 Computational Biomedicine Laboratory, Department of Computer Science, University of Houston, Houston, TX, USA

Recognizing human activities from video sequences or still images is a challenging task due to problems, such as background clutter, partial occlusion, changes in scale, viewpoint, lighting, and appearance. Many applications, including video surveillance systems, human-computer interaction, and robotics for human behavior characterization, require a multiple activity recognition system. In this work, we provide a detailed review of recent and state-of-the-art research advances in the field of human activity classification. We propose a categorization of human activity methodologies and discuss their advantages and limitations. In particular, we divide human activity classification methods into two large categories according to whether they use data from different modalities or not. Then, each of these categories is further analyzed into sub-categories, which reflect how they model human activities and what type of activities they are interested in. Moreover, we provide a comprehensive analysis of the existing, publicly available human activity classification datasets and examine the requirements for an ideal human activity recognition dataset. Finally, we report the characteristics of future research directions and present some open issues on human activity recognition.

1. Introduction

Human activity recognition plays a significant role in human-to-human interaction and interpersonal relations. It provides information about the identity of a person, their personality, and their psychological state, all of which are difficult to extract. The human ability to recognize another person’s activities is one of the main subjects of study of the scientific areas of computer vision and machine learning. As a result of this research, many applications, including video surveillance systems, human-computer interaction, and robotics for human behavior characterization, require a multiple activity recognition system.

Among various classification techniques, two main questions arise: “What action?” (i.e., the recognition problem) and “Where in the video?” (i.e., the localization problem). When attempting to recognize human activities, one must determine the kinetic states of a person, so that the computer can efficiently recognize this activity. Human activities, such as “walking” and “running,” arise very naturally in daily life and are relatively easy to recognize. On the other hand, more complex activities, such as “peeling an apple,” are more difficult to identify. Complex activities may be decomposed into other simpler activities, which are generally easier to recognize. Usually, the detection of objects in a scene may help to better understand human activities as it may provide useful information about the ongoing event ( Gupta and Davis, 2007 ).

Most of the work in human activity recognition assumes a figure-centric scene of uncluttered background, where the actor is free to perform an activity. The development of a fully automated human activity recognition system, capable of classifying a person’s activities with low error, is a challenging task due to problems such as background clutter, partial occlusion, changes in scale, viewpoint, lighting and appearance, and frame resolution. In addition, annotating behavioral roles is time-consuming and requires knowledge of the specific event. Moreover, intra- and interclass similarities make the problem even more challenging. That is, actions within the same class may be expressed by different people with different body movements, and actions between different classes may be difficult to distinguish as they may be represented by similar information. The way that humans perform an activity depends on their habits, and this makes the underlying activity quite difficult to identify. Also, the construction of a visual model for learning and analyzing human movements in real time, with inadequate benchmark datasets available for evaluation, is a challenging task.

To overcome these problems, a system typically comprises three components, namely: (i) background subtraction ( Elgammal et al., 2002 ; Mumtaz et al., 2014 ), in which the system attempts to separate the parts of the image that are invariant over time (background) from the objects that are moving or changing (foreground); (ii) human tracking, in which the system locates human motion over time ( Liu et al., 2010 ; Wang et al., 2013 ; Yan et al., 2014 ); and (iii) human action and object detection ( Pirsiavash and Ramanan, 2012 ; Gan et al., 2015 ; Jainy et al., 2015 ), in which the system is able to localize a human activity in an image.
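To make the first component concrete, the following sketch applies a standard background subtraction model (OpenCV's MOG2) to a video and cleans the resulting foreground mask with a morphological opening. The video path and parameter values are illustrative placeholders, not settings taken from any of the cited studies.

```python
import cv2

cap = cv2.VideoCapture("example_video.mp4")  # hypothetical input file
# MOG2 maintains a per-pixel mixture-of-Gaussians model of the background.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pixels that deviate from the learned background model become foreground.
    fg_mask = subtractor.apply(frame)
    # Morphological opening removes isolated noise from the foreground mask.
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)

cap.release()
```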

The goal of human activity recognition is to examine activities from video sequences or still images. Motivated by this fact, human activity recognition systems aim to correctly classify input data into its underlying activity category. Depending on their complexity, human activities are categorized into: (i) gestures; (ii) atomic actions; (iii) human-to-object or human-to-human interactions; (iv) group actions; (v) behaviors; and (vi) events. Figure 1 visualizes the decomposition of human activities according to their complexity.

Figure 1. Decomposition of human activities.

Gestures are considered as primitive movements of the body parts of a person that may correspond to a particular action of this person ( Yang et al., 2013 ). Atomic actions are movements of a person describing a certain motion that may be part of more complex activities ( Ni et al., 2015 ). Human-to-object or human-to-human interactions are human activities that involve two or more persons or objects ( Patron-Perez et al., 2012 ). Group actions are activities performed by a group of persons ( Tran et al., 2014b ). Human behaviors refer to physical actions that are associated with the emotions, personality, and psychological state of the individual ( Martinez et al., 2014 ). Finally, events are high-level activities that describe social actions between individuals and indicate the intention or the social role of a person ( Lan et al., 2012a ).

The rest of the paper is organized as follows: in Section 2, a brief review of previous surveys is presented. Section 3 presents the proposed categorization of human activities. In Sections 4 and 5, we review various human activity recognition methods and analyze the strengths and weaknesses of each category separately. In Section 6, we provide a categorization of human activity classification datasets and discuss some future research directions. Finally, conclusions are drawn in Section 7.

2. Previous Surveys and Taxonomies

There are several surveys in the human activity recognition literature. Gavrila (1999) separated the research into 2D (with and without explicit shape models) and 3D approaches. In Aggarwal and Cai (1999) , a new taxonomy was presented focusing on human motion analysis, tracking from single view and multiview cameras, and recognition of human activities. Similar in spirit to the previous taxonomy, Wang et al. (2003) proposed a hierarchical action categorization. The survey of Moeslund et al. (2006) mainly focused on pose-based action recognition methods and proposed a fourfold taxonomy, including initialization of human motion, tracking, pose estimation, and recognition methods.

A fine separation between the meanings of “action” and “activity” was proposed by Turaga et al. (2008) , where the activity recognition methods were categorized according to their degree of activity complexity. Poppe (2010) characterized human activity recognition methods into two main categories, describing them as “top-down” and “bottom-up.” On the other hand, Aggarwal and Ryoo (2011) presented a tree-structured taxonomy, where the human activity recognition methods were categorized into two big sub-categories, the “single layer” approaches and the “hierarchical” approaches, each of which have several layers of categorization.

Modeling 3D data is also a new trend, and it was extensively studied by Chen et al. (2013b) and Ye et al. (2013) . As the human body consists of limbs connected with joints, one can model these parts using stronger features, which are obtained from depth cameras, and create a 3D representation of the human body, which is more informative than the analysis of 2D activities carried out in the image plane. Aggarwal and Xia (2014) recently presented a categorization of human activity recognition methods from 3D stereo and motion capture systems with the main focus on methods that exploit 3D depth data. To this end, Microsoft Kinect has played a significant role in motion capture of articulated body skeletons using depth sensors.

Although much research has been focused on human activity recognition systems from video sequences, human activity recognition from static images remains an open and very challenging task. Most of the studies of human activity recognition are associated with facial expression recognition and/or pose estimation techniques. Guo and Lai (2014) summarized all the methods for human activity recognition from still images and categorized them into two big categories according to the level of abstraction and the type of features each method uses.

Jaimes and Sebe (2007) proposed a survey for multimodal human computer interaction focusing on affective interaction methods from poses, facial expressions, and speech. Pantic and Rothkrantz (2003) performed a complete study in human affective state recognition methods that incorporate non-verbal multimodal cues, such as facial and vocal expressions. Pantic et al. (2006) studied several state-of-the-art methods of human behavior recognition including affective and social cues and covered many open computational problems and how they can be efficiently incorporated into a human-computer interaction system. Zeng et al. (2009) presented a review of state-of-the-art affective recognition methods that use visual and audio cues for recognizing spontaneous affective states and provided a list of related datasets for human affective expression recognition. Bousmalis et al. (2013a) proposed an analysis of non-verbal multimodal (i.e., visual and auditory cues) behavior recognition methods and datasets for spontaneous agreements and disagreements. Such social attributes may play an important role in analyzing social behaviors, which are the key to social engagement. Finally, a thorough analysis of the ontologies for human behavior recognition from the viewpoint of data and knowledge representation was presented by Rodríguez et al. (2014) .

Table 1 summarizes the previous surveys on human activity and behavior recognition methods sorted by chronological order. Most of these reviews summarize human activity recognition methods, without providing the strengths and the weaknesses of each category in a concise and informative way. Our goal is not only to present a new classification for the human activity recognition methods but also to compare different state-of-the-art studies and understand the advantages and disadvantages of each method.

Table 1. Summary of previous surveys.

3. Human Activity Categorization

The human activity categorization problem has remained a challenging task in computer vision for more than two decades. Previous works on characterizing human behavior have shown great potential in this area. First, we categorize the human activity recognition methods into two main categories: (i) unimodal and (ii) multimodal activity recognition methods according to the nature of sensor data they employ. Then, each of these two categories is further analyzed into sub-categories depending on how they model human activities. Thus, we propose a hierarchical classification of the human activity recognition methods, which is depicted in Figure 2 .

Figure 2. Proposed hierarchical categorization of human activity recognition methods.

Unimodal methods represent human activities from data of a single modality, such as images, and they are further categorized as: (i) space-time , (ii) stochastic , (iii) rule-based , and (iv) shape-based methods .

Space-time methods involve activity recognition methods, which represent human activities as a set of spatiotemporal features ( Shabani et al., 2011 ; Li and Zickler, 2012 ) or trajectories ( Li et al., 2012 ; Vrigkas et al., 2013 ). Stochastic methods recognize activities by applying statistical models to represent human actions (e.g., hidden Markov models) ( Lan et al., 2011 ; Iosifidis et al., 2012a ). Rule-based methods use a set of rules to describe human activities ( Morariu and Davis, 2011 ; Chen and Grauman, 2012 ). Shape-based methods efficiently represent activities with high-level reasoning by modeling the motion of human body parts ( Sigal et al., 2012b ; Tran et al., 2012 ).

Multimodal methods combine features collected from different sources ( Wu et al., 2013 ) and are classified into three categories: (i) affective , (ii) behavioral , and (iii) social networking methods .

Affective methods represent human activities according to emotional communications and the affective state of a person ( Liu et al., 2011b ; Martinez et al., 2014 ). Behavioral methods aim to recognize behavioral attributes, non-verbal multimodal cues, such as gestures, facial expressions, and auditory cues ( Song et al., 2012a ; Vrigkas et al., 2014b ). Finally, social networking methods model the characteristics and the behavior of humans in several layers of human-to-human interactions in social events from gestures, body motion, and speech ( Patron-Perez et al., 2012 ; Marín-Jiménez et al., 2014 ).

Usually, the terms “activity” and “behavior” are used interchangeably in the literature ( Castellano et al., 2007 ; Song et al., 2012a ). In this survey, we differentiate between these two terms in the sense that the term “activity” is used to describe a sequence of actions that correspond to specific body motion. On the other hand, the term “behavior” is used to characterize both activities and events that are associated with gestures, emotional states, facial expressions, and auditory cues of a single person. Some representative frames that summarize the main human action classes are depicted in Figure 3 .

Figure 3. Representative frames of the main human action classes for various datasets.

4. Unimodal Methods

Unimodal human activity recognition methods identify human activities from data of one modality. Most of the existing approaches represent human activities as a set of visual features extracted from video sequences or still images and recognize the underlying activity label using several classification models ( Kong et al., 2014a ; Wang et al., 2014 ). Unimodal approaches are appropriate for recognizing human activities based on motion features. However, the ability to recognize the underlying class only from motion is on its own a challenging task. The main problem is how to ensure the continuity of motion over time, as an action may unfold uniformly or non-uniformly within a video sequence. Some approaches use snippets of motion trajectories ( Matikainen et al., 2009 ; Raptis et al., 2012 ), while others use the full length of motion curves by tracking the optical flow features ( Vrigkas et al., 2014a ).

We classify unimodal methods into four broad categories: (i) space-time , (ii) stochastic , (iii) rule-based , and (iv) shape-based approaches . Each of these sub-categories describes specific attributes of human activity recognition methods according to the type of representation each method uses.

4.1. Space-Time Methods

Space-time approaches focus on recognizing activities based on space-time features or on trajectory matching. They consider an activity as a 3D space-time volume, formed by concatenating 2D frames over time. An activity is represented by a set of space-time features or trajectories extracted from a video sequence. Figure 4 depicts an example of a space-time approach based on dense trajectories and motion descriptors ( Wang et al., 2013 ).

Figure 4. Visualization of human actions with dense trajectories (top row). Example of a typical human space-time method based on dense trajectories (bottom row). First, dense feature sampling is performed for capturing local motion. Then, features are tracked using dense optical flow, and feature descriptors are computed ( Wang et al., 2013 ).
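The sketch below illustrates the dense-trajectory idea of the figure in a simplified form: points sampled on a regular grid are propagated by a dense optical flow field, and their frame-to-frame displacements are concatenated into a descriptor. It omits the trajectory pruning and the HOG/HOF/MBH descriptors of Wang et al. (2013); `frames` is assumed to be a list of 8-bit grayscale frames, and the grid step and trajectory length are arbitrary choices.

```python
import numpy as np
import cv2

def dense_trajectories(frames, step=10, length=15):
    h, w = frames[0].shape
    # Dense grid of starting points (x, y).
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    points = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    tracks = [points.copy()]
    for t in range(min(length, len(frames) - 1)):
        # Dense optical flow between consecutive frames (Farneback method).
        flow = cv2.calcOpticalFlowFarneback(frames[t], frames[t + 1], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        xi = np.clip(points[:, 0].astype(int), 0, w - 1)
        yi = np.clip(points[:, 1].astype(int), 0, h - 1)
        points = points + flow[yi, xi]      # move each point along the flow field
        tracks.append(points.copy())
    traj = np.stack(tracks, axis=1)         # shape: (num_points, length + 1, 2)
    disp = np.diff(traj, axis=1)            # per-frame displacements
    return disp.reshape(disp.shape[0], -1)  # one flat descriptor per trajectory
```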

A plethora of human activity recognition methods based on space-time representation have been proposed in the literature ( Efros et al., 2003 ; Schuldt et al., 2004 ; Jhuang et al., 2007 ; Fathi and Mori, 2008 ; Niebles et al., 2008 ). A major family of methods relies on optical flow, which has proven to be an important cue. Efros et al. (2003) recognized human actions from low-resolution sports video sequences using a nearest neighbor classifier, where humans are represented by windows 30 pixels in height. The approach of Fathi and Mori (2008) was based on mid-level motion features, which are also constructed directly from optical flow features. Moreover, Wang and Mori (2011) employed motion features as input to hidden conditional random fields (HCRFs) ( Quattoni et al., 2007 ) and support vector machine (SVM) classifiers ( Bishop, 2006 ). Real-time classification and prediction of future actions was proposed by Morris and Trivedi (2011) , where an activity vocabulary is learned through a three-step procedure. Other optical flow-based methods that gained popularity were presented by Dalal et al. (2006) , Chaudhry et al. (2009) , and Lin et al. (2009) . A translation- and scale-invariant descriptor was introduced by Oikonomopoulos et al. (2009) , in which spatiotemporal features based on B-splines are extracted from the optical flow field. To model this descriptor, a Bag-of-Words (BoW) technique is employed, while classification of activities is performed using relevance vector machines (RVM) ( Tipping, 2001 ).
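As a generic illustration of the codebook-based pipeline that several of the methods above share, the sketch below quantizes local spatiotemporal descriptors with K-means, represents each video as a normalized histogram of visual words, and trains an SVM on those histograms (an RVM or another classifier could be substituted). The descriptor arrays, vocabulary size, and SVM parameters are hypothetical, not those of any specific cited study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def fit_codebook(train_descriptors, k=200):
    # train_descriptors: list of (num_features, dim) arrays, one per training video.
    all_feats = np.vstack(train_descriptors)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_feats)

def video_histogram(codebook, descriptors):
    # Assign each local descriptor to its nearest visual word.
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)      # L1-normalized word histogram

def train_bow_classifier(train_descriptors, labels, k=200):
    codebook = fit_codebook(train_descriptors, k)
    X = np.array([video_histogram(codebook, d) for d in train_descriptors])
    clf = SVC(kernel="rbf", C=10.0).fit(X, labels)
    return codebook, clf
```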

The classification of a video sequence using local features in a spatiotemporal environment has also been given much focus. Schuldt et al. (2004) represented local events in a video using space-time features, while an SVM classifier was used to recognize an action. Gorelick et al. (2007) considered actions as 3D space-time silhouettes of moving humans. They took advantage of the Poisson equation solution to efficiently describe an action by using spectral clustering between sequences of features and applying nearest neighbor classification to characterize an action. Niebles et al. (2008) addressed the problem of action recognition by creating a codebook of space-time interest points. A hierarchical approach was followed by Jhuang et al. (2007) , where an input video was analyzed into several feature descriptors depending on their complexity. The final classification was performed by a multiclass SVM classifier. Dollár et al. (2005) proposed spatiotemporal features based on cuboid descriptors. Instead of encoding human motion for action classification, Jainy et al. (2015) proposed to incorporate information from human-to-objects interactions and combined several datasets to transfer information from one dataset to another.

An action descriptor of histograms of interest points, relying on the work of Schuldt et al. (2004) , was presented by Yan and Luo (2012) . Random forests for action representation have also attracted widespread interest for action recognition Mikolajczyk and Uemura (2008) and Yao et al. (2010) . Furthermore, the key issue of how many frames are required to recognize an action was addressed by Schindler and Gool (2008) . Shabani et al. (2011) proposed a temporally asymmetric filtering for feature detection and activity recognition. The extracted features were more robust under geometric transformations than the features described by a Gabor filter ( Fogel and Sagi, 1989 ). Sapienza et al. (2014) used a bag of local spatiotemporal volume features approach to recognize and localize human actions from weakly labeled video sequences using multiple instance learning.

The problem of identifying multiple persons simultaneously and performing action recognition was presented by Khamis et al. (2012) . The authors considered that a person could first be detected by performing background subtraction techniques. Based on histograms of oriented gradients, Dalal and Triggs (2005) were able to detect humans, whereas classification of actions was performed by training an SVM classifier. Wang et al. (2011b) performed human activity recognition by associating the context between interest points based on the density of all features observed. A multiview activity recognition method was presented by Li and Zickler (2012) , where descriptors from different views were connected together to construct a new augmented feature that contains the transition between the different views. Multiview action recognition has also been studied by Rahmani and Mian (2015) . A non-linear knowledge transfer model based on deep learning was proposed for mapping action information from multiple camera views into one single view. However, their method is computationally expensive as it requires a two-step sequential learning phase prior to the recognition step for analyzing and fusing the information of multiple views.

Tian et al. (2013) employed spatiotemporal volumes using a deformable part model to train an SVM classifier for recognizing sport activities. Similar in spirit, the work of Jain et al. (2014) used a 3D space-time volume representation of human actions obtained from super-voxels to understand sport activities. They used an agglomerative approach to merge super-voxels that share common attributes and localize human activities. Kulkarni et al. (2015) used a dynamic programming approach to recognize sequences of actions in untrimmed video sequences. A per-frame time-series representation of each video and a template representation of each action were proposed, whereas dynamic time warping was used for sequence alignment.
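Dynamic time warping, mentioned above as the alignment step, can be written compactly; the sketch below is the textbook O(nm) formulation for aligning a per-frame feature sequence of a probe video with an action template, not the implementation of Kulkarni et al. (2015).

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    # seq_a, seq_b: arrays of per-frame feature vectors, possibly of different lengths.
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local frame distance
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
```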

Samanta and Chanda (2014) proposed a novel representation of human activities using a combination of spatiotemporal features and a facet model ( Haralick and Watson, 1981 ), while they used a 3D Haar wavelet transform and higher order time derivatives to describe each interest point. A vocabulary was learned from these features and SVM was used for classification. Jiang et al. (2013) used a mid-level feature representation of video sequences using optical flow features. These features were clustered using K-means to build a hierarchical template tree representation of each action. A tree search algorithm was used to identify and localize the corresponding activity in test videos. Roshtkhari and Levine (2013) also proposed a hierarchical representation of video sequences for recognizing atomic actions by building a codebook of spatiotemporal volumes. A probe video sequence was classified into its underlying activity according to its similarity with each representation in the codebook.

Earlier approaches were based on describing actions by using dense trajectories. The work of Le et al. (2011) discovered the action label in an unsupervised manner by learning features directly from video data. A high-level representation of video sequences, called “action bank,” was presented by Sadanand and Corso (2012) . Each video was represented by a set of action descriptors, which were put in correspondence. The final classification was performed by an SVM classifier. Yan and Luo (2012) also proposed a novel action descriptor based on spatial temporal interest points (STIP) ( Laptev, 2005 ). To avoid overfitting, they proposed a novel classification technique combining Adaboost and sparse representation algorithms. Wu et al. (2011) used visual features and Gaussian mixture models (GMM) ( Bishop, 2006 ) to efficiently represent the spatiotemporal context distributions between the interest points at several space and time scales. The underlying activity was represented by a set of features extracted by the interest points over the video sequence. A new type of feature called the “hankelet” was presented by Li et al. (2012) . This type of feature, which was formed by short tracklets, along with a BoW approach, was able to recognize actions under different viewpoints without requiring any camera calibration.

The work of Vrigkas et al. (2014a) focused on recognizing human activities by representing a human action with a set of clustered motion trajectories. A Gaussian mixture model was used to cluster the motion trajectories, and the action labeling was performed using a nearest neighbor classification scheme. Yu et al. (2012) proposed a propagative point-matching approach using random projection trees, which can handle unlabeled data in an unsupervised manner. Jain et al. (2013) used motion compensation techniques to recognize atomic actions. They also proposed a new motion descriptor called “divergence-curl-shear descriptor,” which is able to capture the hidden properties of flow patterns in video sequences. Wang et al. (2013) used dense optical flow trajectories to describe the kinematics of motion patterns in video sequences. However, several intraclass variations caused by missing data, partial occlusion, and the short duration of actions in time may harm the recognition accuracy. Ni et al. (2015) discovered the most discriminative groups of similar dense trajectories for analyzing human actions. Each group was assigned a learned weight according to its importance in motion representation.
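To illustrate trajectory clustering of the kind described above, the sketch below fits a Gaussian mixture model to trajectory descriptors and summarizes a video by its average cluster responsibilities, which could then be compared to labeled videos with a nearest neighbor rule. The descriptor extraction, number of mixture components, and pooling choice are assumptions for illustration, not the procedure of Vrigkas et al. (2014a).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def trajectory_signature(trajectory_descriptors, n_components=5):
    # trajectory_descriptors: (num_trajectories, dim) array for one video.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(trajectory_descriptors)
    # Soft assignment of each trajectory to a motion cluster.
    responsibilities = gmm.predict_proba(trajectory_descriptors)
    # Video-level signature: mean cluster responsibility over its trajectories,
    # comparable across videos with a nearest neighbor rule.
    return responsibilities.mean(axis=0)
```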

An unsupervised method for learning human activities from short tracklets was proposed by Gaidon et al. (2014) . They used a hierarchical clustering algorithm to represent videos with an unordered tree structure and compared all tree-clusters to identify the underlying activity. Raptis et al. (2012) proposed a mid-level approach extracting spatiotemporal features and constructing clusters of trajectories, which could be considered as candidates of an action. Yu and Yuan (2015) extracted bounding box candidates from video sequences, where each candidate may contain human motion. The most significant action paths were estimated by defining an action score. Due to the large spatiotemporal redundancy in videos, many candidates may overlap. Thus, estimation of the maximum set coverage was applied to address this problem; however, as the maximum set coverage problem is NP-hard, the estimation requires approximate solutions.

An approach that exploits the temporal information encoded in video sequences was introduced by Li et al. (2011) . The temporal data were encoded into a trajectory system, which measures the similarity between activities and computes the angle between the associated subspaces. A method that tracks features and produces a number of trajectory snippets was proposed by Matikainen et al. (2009) . The trajectories were clustered by an SVM classifier. Motion features were extracted from a video sequence by Messing et al. (2009) . These features were tracked with respect to their velocities, and a generative mixture model was employed to learn the velocity history of these trajectories and classify each video clip. Tran et al. (2014a) proposed a scale and shape invariant method for localizing complex spatiotemporal events in video sequences. Their method was able to relax the tight constraints of bounding box tracking, while they used a sliding window technique to track spatiotemporal paths maximizing the summation score.

An algorithm that may recognize human actions in 3D space by a multicamera system was introduced by Holte et al. (2012a) . It was based on the synergy of 3D space and time to construct a 4D descriptor of spatial temporal interest points and a local description of 3D motion features. The BoW technique was used to form a vocabulary of human actions, whereas agglomerative information bottleneck and SVM were used for action classification. Zhou and Wang (2012) proposed a new representation of local spatiotemporal cuboids for action recognition. Low-level features were encoded and classified via a kernelized SVM classifier, whereas a classification score denoted the confidence that a cuboid belongs to an atomic action. The new feature could act as complementary material to the low-level feature. The work of Sanchez-Riera et al. (2012) recognized human actions using stereo cameras. Based on the technique of BoW, each action was presented by a histogram of visual words, whereas their approach was robust to background clutter.

The problem of temporal segmentation and event recognition was examined by Hoai et al. (2011) . Action recognition was performed by a supervised learning algorithm. Satkin and Hebert (2010) explored the effectiveness of video segmentation by discovering the most significant portions of videos. In the sense of video labeling, the study of Wang et al. (2012b) leveraged the shared structural analysis for activity recognition. The correct annotation was given in each video under a semisupervised scheme. Bag-of-video words have become very popular. Chakraborty et al. (2012) proposed a novel method applying surround suppression. Human activities were represented by bag-of-video words constructed from spatial temporal interest points by suppressing the background features and building a vocabulary of visual words. Guha and Ward (2012) employed a technique of sparse representations for human activity recognition. An overcomplete dictionary was constructed using a set of spatiotemporal descriptors. Classification over three different dictionaries was performed.

Seo and Milanfar (2011) proposed a method based on space-time locally adaptive regression kernels and the matrix cosine measure. They extracted features from space-time descriptors and compared them against features of the target video. A vocabulary based approach has been proposed by Kovashka and Grauman (2010) . The main idea is to find the neighboring features around the detected interest points, quantize them, and form a vocabulary. Ma et al. (2015) extracted spatiotemporal segments from video sequences that correspond to whole or part human motion and constructed a tree-structured vocabulary of similar actions. Fernando et al. (2015) learned to arrange human actions in chronological order in an unsupervised manner by exploiting temporal ordering in video sequences. Relevant information was summarized together through a ranking learning framework.

The main disadvantage of using a global representation, such as optical flow, is its sensitivity to noise and partial occlusions. Space-time approaches also struggle to recognize actions when more than one person is present in a scene, since space-time features focus mainly on local spatiotemporal information. Moreover, the computation of these features produces sparse and varying numbers of detected interest points, which may lead to low repeatability, although background subtraction can help overcome this limitation.

Low-level features, usually encoded in a fixed-length feature vector (e.g., bag-of-words), often fail to be associated with high-level events. Trajectory-based methods face the problems of human body detection and tracking, both of which remain open issues. Complex activities are also more difficult to recognize when space-time feature-based approaches are employed. Furthermore, viewpoint invariance is another issue that these approaches have difficulty handling.

4.2. Stochastic Methods

In recent years, there has been a tremendous growth in the amount of computer vision research aimed at understanding human activity. There has been an emphasis on activities where the entity to be recognized may be considered as a stochastically predictable sequence of states. Researchers have conceived and used many stochastic techniques, such as hidden Markov models (HMMs) ( Bishop, 2006 ) and hidden conditional random fields (HCRFs) ( Quattoni et al., 2007 ), to infer useful results for human activity recognition.

Robertson and Reid (2006) modeled human behavior as a stochastic sequence of actions. Each action was described by a feature vector, which combined information about position, velocity, and local descriptors. An HMM was employed to encode human actions, whereas recognition was performed by searching for image features that represent an action. Pioneering this task, Wang and Mori (2008) were among the first to propose HCRFs for the problem of activity recognition. A human action was modeled as a configuration of parts of image observations. Motion features were extracted forming a BoW model. Activity recognition and localization via a figure-centric model was presented by Lan et al. (2011) . Human location was treated as a latent variable, which was extracted from a discriminative latent variable model by simultaneous recognition of an action. A real-time algorithm that models human interactions was proposed by Oliver et al. (2000) . The algorithm was able to detect and track human movement, forming a feature vector that describes the motion. This vector was given as input to an HMM, which was used for action classification. Song et al. (2013) considered human action sequences at various temporal resolutions. At each level of abstraction, they learned a hierarchical model with latent variables to group similar semantic attributes of each layer. Representative stochastic models are presented in Figure 5 .


Figure 5. Representative stochastic approaches for action recognition. (A) Factorized HCRF model used by Wang and Mori (2008) . Circle nodes correspond to variables, and square nodes correspond to factors. (B) Hierarchical latent discriminative model proposed by Song et al. (2013) .
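
A common way to apply HMMs to action recognition, consistent with the approaches discussed above, is to train one model per action class on per-frame feature sequences and pick the class whose model gives the highest log-likelihood for a test sequence. The sketch below uses the hmmlearn package; the synthetic feature sequences, the number of hidden states, and the two class names are illustrative assumptions.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # pip install hmmlearn

rng = np.random.default_rng(2)

# Placeholder per-frame feature sequences for two hypothetical action classes.
def fake_sequences(offset, n_seq=10, n_frames=40, dim=8):
    return [rng.normal(loc=offset, size=(n_frames, dim)) for _ in range(n_seq)]

train = {"walk": fake_sequences(0.0), "wave": fake_sequences(1.0)}

# Train one HMM per action class on its concatenated sequences.
models = {}
for action, seqs in train.items():
    X = np.vstack(seqs)
    lengths = [len(s) for s in seqs]
    models[action] = GaussianHMM(n_components=4, covariance_type="diag",
                                 n_iter=20).fit(X, lengths)

# Classify a new sequence by the model with the highest log-likelihood.
test_seq = rng.normal(loc=1.0, size=(40, 8))
prediction = max(models, key=lambda a: models[a].score(test_seq))
print(prediction)
```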

A multiview person identification method was presented by Iosifidis et al. (2012a) . Fuzzy vector quantization and linear discriminant analysis were employed to recognize a human activity. Huang et al. (2011) presented a boosting algorithm called LatentBoost. The authors trained several models with latent variables to recognize human actions. A stochastic modeling of human activities on a shape manifold was introduced by Yi et al. (2012) . A human activity was extracted as a sequence of shapes, which is considered as one realization of a random process on a manifold. Piecewise Brownian motion was used to model human activity on the respective manifold. Wang et al. (2014) proposed a semisupervised framework for recognizing human actions combining different visual features. All features were projected onto a common subspace, and a boosting technique was employed to recognize human actions from labeled and unlabeled data. Yang et al. (2013) proposed an unsupervised method for recognizing motion primitives for human action classification from a set of very few examples.

Sun and Nevatia (2013) treated video sequences as sets of short clips rather than a whole representation of actions. Each clip corresponded to a latent variable in an HMM model, while a Fisher kernel technique ( Perronnin and Dance, 2007 ) was employed to represent each clip with a fixed-length feature vector. Ni et al. (2014) decomposed the problem of complex activity recognition into two sequential sub-tasks with increasing granularity levels. First, the authors applied human-to-object interaction techniques to identify the area of interest, then used this context-based information to train a conditional random field (CRF) model ( Lafferty et al., 2001 ) and identify the underlying action. Lan et al. (2014) proposed a hierarchical method for predicting future human actions, which may be considered as a reaction to a previously performed action. They introduced a new representation of human kinematic states, called “hierarchical movements,” computed at coarse-to-fine levels of granularity. Predicting future events from partially unseen video clips with incomplete action execution has also been studied by Kong et al. (2014b) . A sequence of previously observed features was used as a global representation of actions and a CRF model was employed to capture the evolution of actions across time in each action class.

An approach for group activity classification was introduced by Choi et al. (2011) . The authors were able to recognize activities such as a group of people talking or standing in a queue. The proposed scheme was based on random forests, which could select samples of spatiotemporal volumes in a video that characterize an action. A probabilistic Markov random field (MRF) ( Prince, 2012 ) framework was used to classify and localize the activities in a scene. Lu et al. (2015) also employed a hierarchical MRF model to represent segments of human actions by extracting super-voxels from different scales and automatically estimated the foreground motion using saliency features of neighboring super-voxels.

The work of Wang et al. (2011a) focused on tracking dense sample points from video sequences using optical flow based on HCRFs for object recognition. Wang et al. (2012c) proposed a probabilistic model of two components. The first component modeled the temporal transition between action primitives to handle large variation in an action class, while the second component located the transition boundaries between actions. A hierarchical structure, which is called the sum-product network, was used by Amer and Todorovic (2012) . The BoW technique encoded the terminal nodes, the sum nodes corresponded to mixtures of different subsets of terminals, and the product nodes represented mixtures of components.

Zhou and Zhang (2014) proposed a method robust to background clutter, camera motion, and occlusions for recognizing complex human activities. They used a multiple-instance formulation in conjunction with an MRF model and were able to represent human activities with a bag of Markov chains obtained from STIP and salient region feature selection. Chen et al. (2014) addressed the problem of identifying and localizing human actions using CRFs. The authors were able to distinguish between intentional actions and unknown motions that may happen in the surroundings by ordering video regions and detecting the actor of each action. Kong and Fu (2014) addressed the problem of human interaction classification from subjects that lie close to each other. Such a setting is prone to errors from partial occlusions and feature-to-object mismatching. To overcome this problem, the authors proposed a patch-aware model, which learned regions of interacting subjects at different patch levels.

Shu et al. (2015) recognized complex video events and group activities from aerial shots captured from unmanned aerial vehicles (UAVs). A preprocessing step prior to the recognition process was adopted to address several limitations of frame capturing, such as low resolution, camera motion, and occlusions. Complex events were decomposed into simpler actions and modeled using a spatiotemporal CRF graph. An approach that segments video activities and decomposes them into smaller clips containing sub-actions was presented by Wu et al. (2015) . The authors modeled the relation of consecutive actions by building a graphical model for unsupervised learning of the activity label from depth sensor data.

Often, human actions are highly correlated to the actor, who performs a specific action. Understanding both the actor and the action may be vital for real-life applications, such as robot navigation and patient monitoring. Most of the existing works do not take into account the fact that a specific action may be performed in a different manner by different actors. Thus, a simultaneous inference of actors and actions is required. Xu et al. (2015) addressed these limitations and proposed a general probabilistic framework for joint actor-action understanding, and also presented a new dataset for actor-action recognition.

There is an increasing interest in exploring human-object interaction for recognition. Moreover, recognizing human actions from still images by taking advantage of contextual information, such as surrounding objects, is a very active topic ( Yao and Fei-Fei, 2010 ). These methods assume that not only the human body itself, but the objects surrounding it, may provide evidence of the underlying activity. For example, a soccer player interacts with a ball when playing soccer. Motivated by this fact, Gupta and Davis (2007) proposed a Bayesian approach that encodes object detection and localization for understanding human actions. Extending the previous method, Gupta et al. (2009) introduced spatial and functional constraints on static shape and appearance features and they were also able to identify human-to-object interactions without incorporating any motion information. Ikizler-Cinbis and Sclaroff (2010) extracted dense features and performed tracking over consecutive frames for describing both motion and shape information. Instead of explicitly using separate object detectors, they divided the frames into regions and treated each region as an object candidate.

Most of the existing probabilistic methods for human activity recognition may perform well and apply exact and/or approximate learning and inference. However, they are usually more complicated than non-parametric methods, since they use dynamic programming or computationally expensive HMMs for estimating a varying number of parameters. Due to their Markovian nature, they must enumerate all possible observation sequences while capturing only the dependencies between each state and its corresponding observation. HMMs treat features as conditionally independent, but this assumption may not hold for the majority of applications. Often, the observation sequence may be ignored due to normalization, leading to the label bias problem ( Lafferty et al., 2001 ). Thus, HMMs are not suitable for recognizing more complex events; rather, an event is decomposed into simpler activities, which are easier to recognize.

Conditional random fields, on the other hand, overcome the label bias problem. Most of the aforementioned methods do not require large training datasets, since they are able to model the hidden dynamics of the training data and incorporate prior knowledge over the representation of data. Although CRFs outperform HMMs in many applications, including bioinformatics, activity, and speech recognition, the construction of more complex models for human activity recognition may have good generalization ability but is rather impractical for real-time applications due to the large number of parameters to estimate and the need for approximate inference.
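
For concreteness, a linear-chain CRF for per-frame action labeling can be set up as follows with the sklearn-crfsuite package. Because the CRF is globally normalized over the whole label sequence, it avoids the label bias problem discussed above. The toy frame features ("speed", "pose") and labels are invented for illustration and do not come from any of the cited works.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

# Each frame is described by a dictionary of (discretized) features; each video is a
# sequence of such frames labeled with per-frame action states.
X_train = [
    [{"speed": "low", "pose": "upright"}, {"speed": "high", "pose": "leaning"}],
    [{"speed": "low", "pose": "sitting"}, {"speed": "low", "pose": "sitting"}],
]
y_train = [["stand", "run"], ["sit", "sit"]]

# A linear-chain CRF conditions on the whole observation sequence, which is what
# lets it sidestep the label bias problem of locally normalized models.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

X_test = [[{"speed": "high", "pose": "leaning"}]]
print(crf.predict(X_test))  # e.g., [['run']]
```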

4.3. Rule-Based Methods

Rule-based approaches determine ongoing events by modeling an activity using rules or sets of attributes that describe an event. Each activity is considered as a set of primitive rules/attributes, which enables the construction of a descriptive model for human activity recognition.

Action recognition of complex scenes with multiple subjects was proposed by Morariu and Davis (2011) . Each subject must follow a set of certain rules while performing an action. The recognition process was performed over basketball game videos, where the players were first detected and tracked, generating a set of trajectories that were used to create a set of spatiotemporal events. Based on first-order logic and probabilistic approaches, such as Markov networks, the authors were able to infer which event had occurred. Figure 6 summarizes their method using primitive rules for recognizing human actions. Liu et al. (2011a) addressed the problem of recognizing actions by a set of descriptive and discriminative attributes. Each attribute was associated with the characteristics describing the spatiotemporal nature of the activities. These attributes were treated as latent variables, which capture the degree of importance of each attribute for each action in a latent SVM approach.


Figure 6. Relation between primitive rules and human actions ( Morariu and Davis, 2011 ) .

A combination of activity recognition and localization was presented by Chen and Grauman (2012) . The whole approach was based on the construction of a space-time graph using a high-level descriptor, where the algorithm seeks to find the optimal subgraph that maximizes the activity classification score (i.e., find the maximum weight subgraph, which in the general case is an NP-complete problem). Kuehne et al. (2014) proposed a structured temporal approach for daily living human activity recognition. The authors used HMMs to model human actions as action units and then used grammatical rules to form a sequence of complex actions by combining different action units. When temporal grammars are used for action classification, the main problem lies in handling long video sequences due to the complexity of the models. One way to cope with this limitation is to segment video sequences into smaller clips that contain sub-actions, using a hierarchical approach ( Pirsiavash and Ramanan, 2014 ). The generation of short descriptions from video sequences ( Vinyals et al., 2015 ) based on convolutional neural networks (CNN) ( Ciresan et al., 2011 ) was also used for activity recognition ( Donahue et al., 2015 ).

Intermediate semantic feature representations for recognizing actions unseen during training were proposed ( Wang and Mori, 2010 ). These intermediate features were learned during training, while parameter sharing between classes was enabled by capturing the correlations between frequently occurring low-level features ( Akata et al., 2013 ). Learning how to recognize new classes that were not seen during training, by associating intermediate features and class labels, is a necessary aspect for transferring knowledge between training and test samples. This problem is generally known as zero-shot learning ( Palatucci et al., 2009 ). Thus, instead of learning one classifier per class, a two-step classification method has been proposed by Lampert et al. (2009) . Specific attributes are predicted from already learned classifiers and are mapped into a class-level score.
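
A minimal sketch of two-step, attribute-based zero-shot classification in the spirit of direct attribute prediction: one binary classifier is trained per attribute on seen classes, and an unseen class defined only by its attribute signature can still be scored at test time. The attribute names, class signatures, and random training data are illustrative assumptions, not the setup of Lampert et al. (2009).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Class-attribute matrix: which semantic attributes (e.g., "arm motion", "outdoor",
# "uses object") describe each action class; "reading" is treated as an unseen class.
attributes = {
    "running": np.array([1, 1, 0]),
    "reading": np.array([0, 0, 1]),   # unseen class, defined only by its attributes
}

# 1. Train one binary classifier per attribute on seen-class videos.
X_train = rng.normal(size=(100, 32))
A_train = rng.integers(0, 2, size=(100, 3))          # per-video attribute labels
attr_clfs = [LogisticRegression(max_iter=200).fit(X_train, A_train[:, k])
             for k in range(3)]

# 2. At test time, predict attribute probabilities and score each (possibly unseen)
#    class by how well its attribute signature matches the predictions.
def predict_class(x):
    probs = np.array([clf.predict_proba(x[None, :])[0, 1] for clf in attr_clfs])
    scores = {c: np.prod(np.where(sig == 1, probs, 1 - probs))
              for c, sig in attributes.items()}
    return max(scores, key=scores.get)

print(predict_class(rng.normal(size=32)))
```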

Action classification from still images by learning semantic attributes was proposed by Yao et al. (2011) . Attributes describe specific properties of human actions, while parts of actions, which were obtained from objects and human poses, were used as bases for learning complex activities. The problem of attribute-action association was reported by Zhang et al. (2013) . The authors proposed a multitask learning approach ( Evgeniou and Pontil, 2004 ) for simultaneously coping with low-level features and action-attribute relationships and introduced attribute regularization as a penalty term for handling irrelevant predictions. A noise-robust representation for attribute-based human action classification was proposed by Zhang et al. (2015) . Sigmoid and Gaussian envelopes were incorporated into the loss function of an SVM classifier, where outliers were eliminated during the optimization process. A GMM was used for modeling human actions, and a transfer ranking technique was employed for recognizing unseen classes. Ramanathan et al. (2015) were able to transfer semantic knowledge between classes to learn human actions from still images. The interaction between different classes was modeled using linguistic rules. However, for high-level activities, the use of language priors is often not adequate, thus simpler and more explicit rules should be constructed.

Complex human activities cannot be recognized directly by rule-based approaches. Thus, decomposition into simpler atomic actions is applied, and the individual actions are then combined to recognize complex or simultaneously occurring activities. This limitation requires constant feedback from the user in the form of rule/attribute annotations of the training examples, which is time consuming and error-prone due to the subjective nature of user-defined annotations. To overcome this drawback, several approaches employing transfer learning ( Lampert et al., 2009 ; Kulkarni et al., 2014 ), multitask learning ( Evgeniou and Pontil, 2004 ; Salakhutdinov et al., 2011 ), and semantic/discriminative attribute learning ( Farhadi et al., 2009 ; Jayaraman and Grauman, 2014 ) were proposed to automatically generate and handle the most informative attributes for human activity classification.

4.4. Shape-Based Methods

Modeling of human pose and appearance has received great attention from researchers during the last decades. Parts of the human body are described in 2D space as rectangular patches and as volumetric shapes in 3D space (see Figure 7 ). It is well known that activity recognition algorithms based on the human silhouette play an important role in recognizing human actions. As a human silhouette consists of limbs jointly connected to each other, it is important to obtain exact human body parts from videos. This problem is considered as a part of the action recognition process, and many algorithms have been proposed to address it.


Figure 7. Human body representations. (A) 2D skeleton model ( Theodorakopoulos et al., 2014 ) and (B) 3D pictorial structure representation ( Belagiannis et al., 2014 ).

A major focus in action recognition from still images or videos has been placed on scene appearance ( Thurau and Hlavac, 2008 ; Yang et al., 2010 ; Maji et al., 2011 ). More specifically, Thurau and Hlavac (2008) represented actions by histograms of pose primitives, and n-gram expressions were used for action classification. Also, Yang et al. (2010) combined actions and human poses together, treating poses as latent variables, to infer the action label in still images. Maji et al. (2011) introduced a representation of human poses, called the “poselet activation vector,” which is defined by the 3D orientation of the head and torso and provided a robust representation of human pose and appearance. Moreover, action categorization based on modeling the motion of parts of the human body was presented by Tran et al. (2012) , where a sparse representation was used to model and recognize complex actions. In the sense of template-matching techniques, Rodriguez et al. (2008) introduced the maximum average correlation height (MACH) filter, which was a method for capturing intraclass variabilities by synthesizing a single action MACH filter for a given action class. Sedai et al. (2013a) proposed a combination of shape and appearance descriptors to represent local features for human pose estimation. The different types of descriptors were fused at the decision level using a discriminative learning model. Nevertheless, identifying which body parts are most significant for recognizing complex human activities still remains a challenging task ( Lillo et al., 2014 ). The classification model and some representative examples of the estimation of human pose are depicted in Figure 8 .


Figure 8. Classification of actions from human poses ( Lillo et al., 2014 ). (A) The discriminative hierarchical model for the recognition of human action from body poses. (B) Examples of correct human pose estimation of complex activities.

Ikizler and Duygulu (2007) modeled the human body as a sequence of oriented rectangular patches. The authors described a variation of the BoW method called bag-of-rectangles. Spatially oriented histograms were formed to describe a human action, while the classification of an action was performed using four different methods, namely frame voting, global histogramming, SVM classification, and dynamic time warping (DTW) ( Theodoridis and Koutroumbas, 2008 ). The study of Yao and Fei-Fei (2012) modeled human poses for human-object interactions by introducing a mutual context model. The types of human poses, as well as the spatial relationship between the different human parts, were modeled. Self-organizing maps (SOMs) ( Kohonen et al., 2001 ) were introduced by Iosifidis et al. (2012b) for learning human body posture, in conjunction with fuzzy distances, to achieve time-invariant action representation. The proposed algorithm was based on multilayer perceptrons, where each layer was fed by an associated camera, for view-invariant action classification. Human interactions were addressed by Andriluka and Sigal (2012) . First, 2D human poses were estimated from pictorial structures from groups of humans and then each estimated structure was fitted into 3D space. To this end, several 2D human pose benchmarks have been proposed for the evaluation of articulated human pose estimation methods ( Andriluka et al., 2014 ).

Action recognition using depth cameras was introduced by Wang et al. (2012a) , where a new feature type called “local occupancy pattern” was also proposed. This feature was invariant to translation and was able to capture the relation between human body parts. The authors also proposed a new model for human actions called “actionlet ensemble model,” which captured the intraclass variations and was robust to errors incurred by depth cameras. 3D human poses have been taken into consideration in recent years and several algorithms for human activity recognition have been developed. A recent review on 3D pose estimation and activity recognition was proposed by Holte et al. (2012b) . The authors categorized 3D pose estimation approaches aimed at presenting multiview human activity recognition methods. The work of Shotton et al. (2011) modeled 3D human poses and performed human activity recognition from depth images by mapping the pose estimation problem into a simpler pixel-wise classification problem. Graphical models have been widely used in modeling 3D human poses. The problem of articulated 3D human pose estimation was studied by Fergie and Galata (2013) , where the limitation of the mapping from the image feature space to the pose space was addressed using mixtures of Gaussian processes, particle filtering, and annealing ( Sedai et al., 2013b ). A combination of discriminative and generative models improved the estimation of human pose.

Multiview pose estimation was examined by Amin et al. (2013) . The 2D poses for different sources were projected onto 3D space using a mixture of multiview pictorial structures models. Belagiannis et al. (2014) have also addressed the problem of multiview pose estimation. They constructed 3D body part hypotheses by triangulation of 2D pose detections. To solve the problem of body part correspondence between different views, the authors proposed a 3D pictorial structure representation based on a CRF model. However, building successful models for human pose estimation is not straightforward ( Pishchulin et al., 2013 ). Combining both pose-specific appearance and the joint appearance of body parts helps to construct a more powerful representation of the human body. Deep learning has gained much attention for multisource human pose estimation ( Ouyang et al., 2014 ) where the tasks of detection and estimation of human pose were jointly learned. Toshev and Szegedy (2014) have also used deep learning for human pose estimation. Their approach relies on using deep neural networks (DNN) ( Ciresan et al., 2012 ) for representing cascade body joint regressors in a holistic manner.

Despite the vast development of pose estimation algorithms, the problem still remains challenging for real-time applications. Jung et al. (2015) presented a method for fast estimation of human pose at 1,000 frames per second. To achieve such a high computational speed, the authors used random walk sub-sampling methods. Human body parts were handled as directional tree-structured representations and a regression tree was trained for each joint in the human skeleton. However, this method depends on the initialization of the random walk process.

Sigal et al. (2012b) addressed the multiview human-tracking problem where the modeling of 3D human pose consisted of a collection of human body parts. The motion estimation was performed by non-parametric belief propagation ( Bishop, 2006 ). On the other hand, the work of Livne et al. (2012) explored the problem of inferring human attributes, such as gender, weight, and mood, through 3D pose tracking. Representing activities using trajectories of human poses is computationally expensive due to the many degrees of freedom. To this end, efficient dimensionality reduction methods should be applied. Moutzouris et al. (2015) proposed a novel method for reducing the dimensionality of human poses called “hierarchical temporal Laplacian eigenmaps” (HTLE). Moreover, the authors were able to estimate unseen poses using a hierarchical manifold search method.

Du et al. (2015) divided the human skeleton into five segments and used each of these parts to train a hierarchical neural network. The output of each layer, which corresponds to neighboring parts, is fused and fed as input to the next layer. However, this approach suffers from the problem of data association as parts of the human skeleton may vanish through the sequential layer propagation and back projection. Nie et al. (2015) also divided human pose into smaller mid-level spatiotemporal parts. Human actions were represented using a hierarchical AND/OR graph and dynamic programing was used to infer the class label. One disadvantage of this method is that it cannot deal with self-occlusions (i.e., overlapping parts of human skeleton).

A shared representation of human poses and visual information has also been explored ( Ferrari et al., 2009 ; Singh and Nevatia, 2011 ; Yun et al., 2012 ). However, the effectiveness of such methods is limited by tracking inaccuracies in human poses and complex backgrounds. To this end, several kinematic and part-occlusion constraints for decomposing human poses into separate limbs have been explored to localize the human body ( Cherian et al., 2014 ). Xu et al. (2012) proposed a mid-level representation of human actions by computing local motion volumes in skeletal points extracted from video sequences and constructed a codebook of poses for identifying the action. Eweiwi et al. (2014) reduced the required amount of pose data using a fixed length vector of more informative motion features (e.g., location and velocity) for each skeletal point. A partial least squares approach was used for learning the representation of action features, which is then fed into an SVM classifier.
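
The fixed-length skeletal descriptors mentioned above can be illustrated as follows: per-joint location and velocity statistics are pooled over a sequence into a single vector and fed to an SVM. This is a generic sketch under assumed data shapes (frames x joints x 3D coordinates), not the exact feature set of Eweiwi et al. (2014).

```python
import numpy as np
from sklearn.svm import SVC

def skeletal_features(joints):
    """joints: array of shape (n_frames, n_joints, 3) with 3D joint positions."""
    velocities = np.diff(joints, axis=0)                 # per-frame joint velocities
    return np.concatenate([
        joints.mean(axis=0).ravel(),                     # mean joint locations
        joints.std(axis=0).ravel(),                      # spatial spread per joint
        velocities.mean(axis=0).ravel(),                 # mean joint velocities
        np.abs(velocities).mean(axis=0).ravel(),         # average speed per joint
    ])

rng = np.random.default_rng(6)

# Placeholder training set: 50 skeleton sequences of 60 frames and 20 joints each.
X = np.array([skeletal_features(rng.normal(size=(60, 20, 3))) for _ in range(50)])
y = rng.integers(0, 4, size=50)

clf = SVC(kernel="linear").fit(X, y)
```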

Kviatkovsky et al. (2014) mixed shape and motion features for online action classification. The recognition process could be applied in real time using the incremental covariance update and the on-demand nearest neighbor classification schemes. Rahmani et al. (2014) trained a random decision forest (RDF) ( Ho, 1995 ) and applied a joint representation of depth information and 3D skeletal positions for identifying human actions in real time. A novel part-based skeletal representation for action recognition was introduced by Vemulapalli et al. (2014) . The geometry between different body parts was taken into account, and a 3D representation of the human skeleton was proposed. Human actions were treated as curves in a Lie group ( Murray et al., 1994 ), and classification was performed using SVM and temporal modeling approaches. Following a similar approach, Anirudh et al. (2015) represented skeletal joints as points on the product space. Shape features were represented as high-dimensional non-linear trajectories on a manifold to learn the latent variable space of actions. Fouhey et al. (2014) exploited the interaction between human actions and scene geometry to recognize human activities from still images using 3D skeletal representation and adopting geometric representation constraints of the scenes.

The problem of appearance-to-pose mapping for human activity understanding was studied by Urtasun and Darrell (2008) . Gaussian processes were used as an online probabilistic regressor for this task using sparse representation of data for reducing computational complexity. Theodorakopoulos et al. (2014) have also employed sparse representation of skeletal data in the dissimilarity space for human activity recognition. In particular, human actions were represented by vectors of dissimilarities, and a set of prototype actions was built. The recognition was performed in the dissimilarity space using sparse representation-based classification. A publicly available dataset (UPCV Action dataset) consisting of skeletal data of human actions was also proposed.

A common problem in estimating human pose is the high-dimensional space (i.e., each limb may have a large number of degrees of freedom that need to be estimated simultaneously). Action recognition relies heavily on the obtained pose estimations. The articulated human body is usually represented as a tree-like structure, thus locating the global position and tracking each limb separately is intrinsically difficult, since it requires exploration of a large state space of all possible translations and rotations of the human body parts in 3D space. Many approaches, which employ background subtraction ( Sigal et al., 2012a ) or assume fixed limb lengths and uniformly distributed rotations of body parts ( Burenius et al., 2013 ), have been proposed to reduce the complexity of the 3D space.

Moreover, the association of human pose orientation with the poses extracted from different camera views is also a difficult problem due to similar body parts of different humans in each view. Mixing body parts of different views may lead to ambiguities because of the multiple candidates of each camera view and false positive detections. The estimation of human pose is also very sensitive to several factors, such as illumination changes, variations in view-point, occlusions, background clutter, and human clothing. Low-cost devices, such as Microsoft Kinect and other RGB-D sensors, which provide 3D depth data of a scene, can efficiently leverage these limitations and produce a relatively good estimation of human pose, since they are robust to illumination changes and texture variations ( Gao et al., 2015 ).

5. Multimodal Methods

Recently, much attention has been focused on multimodal activity recognition methods. An event can be described by different types of features that provide more and useful information. In this context, several multimodal methods are based on feature fusion, which can be expressed by two different strategies: early fusion and late fusion. The easiest way to gain the benefits of multiple features is to directly concatenate features in a larger feature vector and then learn the underlying action ( Sun et al., 2009 ). This feature fusion technique may improve recognition performance, but the new feature vector is of much larger dimension.
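
The contrast between early and late fusion can be made concrete with a short sketch: early fusion concatenates the modality-specific feature vectors before training a single classifier, whereas late fusion trains one classifier per modality and combines their scores. The feature dimensions, the equal 0.5/0.5 score weights, and the SVM choice are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
n = 200
visual = rng.normal(size=(n, 50))     # e.g., appearance/motion descriptors
audio = rng.normal(size=(n, 20))      # e.g., audio cue statistics
y = rng.integers(0, 2, size=n)

# Early fusion: concatenate the modalities into one (larger) feature vector.
early = SVC(probability=True).fit(np.hstack([visual, audio]), y)

# Late fusion: train one classifier per modality and combine their scores.
clf_v = SVC(probability=True).fit(visual, y)
clf_a = SVC(probability=True).fit(audio, y)

def late_fusion_predict(xv, xa):
    # Average the per-modality class probabilities before taking the argmax.
    p = 0.5 * clf_v.predict_proba(xv) + 0.5 * clf_a.predict_proba(xa)
    return p.argmax(axis=1)

print(late_fusion_predict(visual[:3], audio[:3]))
```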

Multimodal cues are usually correlated in time, thus a temporal association of the underlying event and the different modalities is an important issue for understanding the data. In that context, audio-visual analysis is used in many applications not only for audio-visual synchronization ( Lichtenauer et al., 2011 ) but also for tracking ( Perez et al., 2004 ) and activity recognition ( Wu et al., 2013 ). Multimodal methods are classified into three categories: (i) affective methods , (ii) behavioral methods , and (iii) methods based on social networking . Multimodal methods describe atomic actions or interactions of a person that may correspond to his/her affective states and communication with others, and depend on emotions and/or body movements.

5.1. Affective Methods

The core of emotional intelligence is understanding the mapping between a person’s affective states and the corresponding activities, which are strongly related to the emotional state and communication of a person with other people ( Picard, 1997 ). Affective computing studies model the ability of a person to express, recognize, and control his/her affective states in terms of hand gestures, facial expressions, physiological changes, speech, and activity recognition ( Pantic and Rothkrantz, 2003 ). This research area is generally considered to be a combination of computer vision, pattern recognition, artificial intelligence, psychology, and cognitive science.

A key issue in affective computing is accurately annotated data. Ratings are one of the most popular affect annotation tools. However, such annotations are challenging to obtain in real-world situations, since affective events are expressed in a different manner by different persons or occur simultaneously with other activities and feelings. Preprocessing affective annotations may be detrimental for generating accurate and unambiguous affective models due to biased representations of affect annotation. To this end, a study on how to produce highly informative affective labels has been proposed by Healey (2011) . Soleymani et al. (2012) investigated the properties of developing a user-independent emotion recognition system that is able to detect the most informative affective tags from electroencephalogram (EEG) signals, pupillary reflex, and bodily responses that correspond to video stimuli. Nicolaou et al. (2014) proposed a novel method based on probabilistic canonical correlation analysis (PCCA) ( Klami and Kaski, 2008 ) and DTW for fusing multimodal emotional annotations and performing temporal alignment of sequences.

Liu et al. (2011b) associated multimodal features (i.e., textual and visual) for classifying affective states in still images. The authors argued that visual information is not adequate for understanding human emotions, and thus additional information that describes the image is needed. Dempster-Shafer theory ( Shafer, 1976 ) was employed for fusing the different modalities, while SVM was used for classification. Hussain et al. (2011) proposed a framework for fusing multimodal physiological features, such as heart and facial muscle activity, skin response, and respiration, for detecting and recognizing affective states. AlZoubi et al. (2013) explored the effect of affective feature variations over time on the classification of affective states.

Siddiquie et al. (2013) analyzed four different affective dimensions, namely activation, expectancy, power, and valence ( Schuller et al., 2011 ). To this end, they proposed joint hidden conditional random fields (JHCRFs) as a new classification scheme to take advantage of the multimodal data. Furthermore, their method used late fusion to combine audio and visual information. This may lead to significant loss of intermodality dependence and suffers from propagation of classification errors across different levels of classifiers. Although their method could efficiently recognize the affective state of a person, the computational burden was high, as JHCRFs require twice as many hidden variables as traditional HCRFs when features represent two different modalities.

Nicolaou et al. (2011) proposed a regression model based on SVMs for regression (SVR) ( Smola and Schölkopf, 2004 ) for continuous prediction of multimodal emotional states, using facial expression, shoulder gesture, and audio cues in terms of arousal and valence (Figure 9 ). Castellano et al. (2007) explored the dynamics of body movements to identify affective behaviors using time series of multimodal data. Martinez et al. (2014) presented a detailed review of learning methods for the classification of affective and cognitive states of computer game players. They analyzed the properties of directly using affect annotations in classification models, and proposed a method for transforming such annotations to build more accurate models.


Figure 9. Flow chart of multimodal emotion recognition. Emotions, facial expressions, shoulder gestures, and audio cues are combined for the continuous prediction of emotional states ( Nicolaou et al., 2011 ).
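
A hedged sketch of continuous affect prediction along the lines described above: support vector regression maps multimodal features to continuous arousal and valence values. The feature dimensionality, the random training data, and the use of one regressor per affective dimension are assumptions for illustration, not the configuration of Nicolaou et al. (2011).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(5)

# Placeholder multimodal features (e.g., facial expression + shoulder gesture + audio
# cues) and continuous arousal/valence annotations in [-1, 1].
X = rng.normal(size=(300, 40))
y = rng.uniform(-1, 1, size=(300, 2))   # columns: arousal, valence

# One support vector regressor per affective dimension.
model = MultiOutputRegressor(SVR(kernel="rbf", C=1.0, epsilon=0.1)).fit(X, y)

arousal, valence = model.predict(X[:1])[0]
print(arousal, valence)
```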

Multimodal affect recognition methods in the context of neural networks and deep learning have generated considerable recent research interest ( Ngiam et al., 2011 ). In a more recent study, Martinez et al. (2013) could efficiently extract and select the most informative multimodal features using deep learning to model emotional expressions and recognize the affective states of a person. They incorporated physiological signals into emotional states, such as relaxation, anxiety, excitement, and fun, and demonstrated that deep learning was able to extract more informative features than conventional feature extraction on physiological signals.

Although the understanding of human activities may benefit from affective state recognition, the classification process is extremely challenging due to the semantic gap between the low-level features extracted from video frames and high-level concepts, such as emotions, that need to be identified. Thus, building strong models that can cope with multimodal data, such as gestures, facial expressions, and physiological data, depends on the ability of the model to discover relations between different modalities and generate informative representations of affect annotations. Generating such information is not an easy task. Users cannot always express their emotion with words, and producing satisfactory and reliable ground truth that corresponds to a given training instance is quite difficult as it can lead to ambiguous and subjective labels. This problem becomes more prominent as human emotions are continuous acts in time, and variations in human actions may be confusing or lead to subjective annotations. Therefore, automatic affective recognition systems should reduce the effort for selecting the proper affective label to better assess human emotions.

5.2. Behavioral Methods

Recognizing human behaviors from video sequences is a challenging task for the computer vision community ( Candamo et al., 2010 ). A behavior recognition system may provide information about the personality and psychological state of a person, and its applications vary from video surveillance to human-computer interaction. Behavioral approaches aim at recognizing behavioral attributes and non-verbal multimodal cues, such as gestures, facial expressions, and auditory cues. Factors that can affect human behavior may be decomposed into several components, including emotions, moods, actions, and interactions with other people. Hence, the recognition of complex actions may be crucial for understanding human behavior. One important aspect of human behavior recognition is the choice of proper features, which can be used to recognize behavior in applications, such as gaming and physiology. A key challenge in recognizing human behaviors is to define specific emotional attributes for multimodal dyadic interactions ( Metallinou and Narayanan, 2013 ). Such attributes may be descriptions of emotional states or cognitive states, such as activation, valence, and engagement. A typical example of a behavior recognition system is depicted in Figure 10 .


Figure 10. Example of interacting persons . Audio-visual features and emotional annotations are fed into a GMM for estimating the emotional curves ( Metallinou et al., 2013 ).

Audio-visual representation of human actions has gained an important role in human behavior recognition methods. Sargin et al. (2007) suggested a method for speaker identification integrating a hybrid scheme of early and late fusion of audio-visual features and used CCA ( Hardoon et al., 2004 ) to synchronize the multimodal features. However, their method can cope only with frontal-view video sequences. Metallinou et al. (2008) proposed a probabilistic approach based on GMMs for recognizing human emotions in dyadic interactions. The authors took advantage of facial expressions, as they can be expressed by the facial action coding system (FACS) ( Ekman et al., 2002 ), which describes all possible facial expressions as combinations of action units (AUs), and combined them with audio information extracted from each participant to identify their emotional state. Similarly, Chen et al. (2015) proposed a real-time emotion recognition system that modeled 3D facial expressions using random forests. The proposed method was robust to subjects’ poses and changes in the environment.

Wu et al. (2010) proposed a human activity recognition system by taking advantage of the auditory information of the video sequences of the HOHA dataset ( Laptev et al., 2008 ) and used late fusion techniques for combining audio and visual cues. The main disadvantage of this method is that it used different classifiers to separately learn the audio and visual context. Also, the audio information of the HOHA dataset contains dynamic backgrounds and the audio signal is highly diverse (i.e., audio shifts roughly from one event to another), which generates the need for developing audio feature selection techniques. Similar in spirit is the work of Wu et al. (2013) , who used the generalized multiple kernel learning algorithm for estimating the most informative audio features. They applied fuzzy integral techniques to combine the outputs of two different SVM classifiers, increasing the computational burden of the method.

Song et al. (2012a) proposed a novel method for human behavior recognition based on multiview hidden conditional random fields (MV-HCRF) ( Song et al., 2012b ) and estimated the interaction of the different modalities by using kernel canonical correlation analysis (KCCA) ( Hardoon et al., 2004 ). However, their method cannot deal with data that contain complex backgrounds, and due to the down-sampling of the original data the audio-visual synchronization may be lost. Also, their method used different sets of hidden states for audio and visual information. This property assumes that the audio and visual features were a priori synchronized, while it increases the complexity of the model. Metallinou et al. (2012) employed several hierarchical classification models, ranging from neural networks to HMMs and their combinations, to recognize audio-visual emotional levels of valence and arousal rather than emotional labels, such as anger and kindness.

Vrigkas et al. (2014b) employed a fully connected CRF model to identify human behaviors, such as friendly, aggressive, and neutral. To evaluate their method, they introduced a novel behavior dataset, called the Parliament dataset, which consists of political speeches in the Greek parliament. Bousmalis et al. (2013b) proposed a method based on hierarchical Dirichlet processes to automatically estimate the optimal number of hidden states in an HCRF model for identifying human behaviors. The proposed model, also known as the infinite hidden conditional random field (iHCRF) model, was employed to recognize emotional states, such as pain, agreement, and disagreement, from non-verbal multimodal cues.

Baxter et al. (2015) proposed a human activity classification model that does not learn the temporal structure of human actions but rather decomposes human actions and uses them as features for learning complex human activities. The intuition behind this approach is a psycholinguistic phenomenon, where randomizing letters in the middle of words has almost no effect on understanding the underlying word, provided that the first and last letters of the word remain unchanged ( Rawlinson, 2007 ). The problem of behavioral mimicry in social interactions was studied by Bilakhia et al. (2013) . It can be seen as the imitation of human speech, facial expressions, gestures, and movements. Metallinou et al. (2013) applied mixture models to capture the mapping between audio and visual cues to understand the emotional states of dyadic interactions.

Selecting the proper features for human behavior recognition has always been a trial-and-error approach for many researchers in this area of study. In general, effective feature extraction is highly application dependent. Several feature descriptors, such as HOG3D ( Kläser et al., 2008 ) and STIP ( Laptev, 2005 ), are not able to sufficiently characterize human behaviors. The combination of visual features with other more informative features, which reflect human emotions and psychology, is necessary for this task. Nonetheless, the description of human activities with high-level contents usually leads to recognition methods with high computational complexity. Another obstacle that researchers must overcome is the lack of adequate benchmark datasets to test and validate the reliability, effectiveness, and efficiency of a human behavior recognition system.

5.3. Methods Based on Social Networking

Social interactions are an important part of daily life. A fundamental component of human behavior is the ability to interact with other people via their actions. Social interaction can be considered as a special type of activity where someone adapts his/her behavior according to the group of people surrounding him/her. Most of the social networking systems that affect people’s behavior, such as Facebook, Twitter, and YouTube, measure social interactions and infer how such sites may be involved in issues of identity, privacy, social capital, youth culture, and education. Moreover, the field of psychology has shown great interest in studying social interactions, as scientists may infer useful information about human behavior. A recent survey on human behavior recognition provides a complete summarization of up-to-date techniques for automatic human behavior analysis for single person, multiperson, and object-person interactions ( Candamo et al., 2010 ).

Fathi et al. (2012) modeled social interactions by estimating the location and orientation of the faces of persons taking part in a social event, computing a line of sight for each face. This information was used to infer the location where an individual may be found. The type of interaction was recognized by assigning social roles to each person. The authors were able to recognize three types of social interactions: dialog, discussion, and monolog. To capture these social interactions, eight subjects wearing head-mounted cameras participated in groups of interacting persons analyzing their activities from the first-person point of view. Figure 11 shows the resulting social network built from this method. In the sense of first-person scene understanding, Park and Shi (2015) were able to predict joint social interactions by modeling geometric relationships between groups of interacting persons. Although the proposed method could cope with missing information and variations in scene context, scale, and orientation of human poses, it is sensitive to localization of interacting members, which leads to erroneous predictions of the true class.


Figure 11. Social network of interacting persons. The connections between the group of persons P1, …, P25 and the subjects wearing the cameras S1, …, S8 are weighted based on how often a person’s face is captured by a subject’s camera ( Fathi et al., 2012 ).

Human behavior on sport datasets was investigated by Lan et al. (2012a) . The authors modeled the behavior of humans in a scene using social roles in conjunction with modeling low-level actions and high-level events. Burgos-Artizzu et al. (2012) discussed the social behavior of mice. Each video sequence was segmented into periods of activities by constructing a temporal context that combines spatiotemporal features. Kong et al. (2014a) proposed a new high-level descriptor called “interactive phrases” to recognize human interactions. This descriptor was a binary motion relationship descriptor for recognizing complex human interactions. Interactive phrases were treated as latent variables, while the recognition was performed using a CRF model.

Cui et al. (2011) recognized abnormal behaviors in human group activities. The authors represented human activities by modeling the relationships between the current behavior of a person and his/her actions. An attribute-based social activity recognition method was introduced by Fu et al. (2014) . The authors were interested in classifying social activities of daily life, such as birthdays and weddings. A new social activity dataset was also proposed. By treating attributes as latent variables, the authors were able to annotate and classify video sequences of social activities. Yan et al. (2014) leveraged the problem of human tracking for modeling the repulsion, attraction, and non-interaction effects in social interactions. The tracking problem was decomposed into smaller tasks by tracking all possible configurations of interaction effects, while the number of trackers was dynamically estimated. Tran et al. (2014b) modeled crowded scenes as a graph of interacting persons. Each node represents one person, and each edge on the graph is associated with a weight according to the level of interaction between the participants. The interacting groups were found by graph clustering, where each maximal clique corresponds to an interacting group.

The work of Lu et al. (2011) focused on automatically tracking and recognizing players’ positions (i.e., attacker and defender) in sports videos. The main problem of this work was the low resolution of the players to be tracked (a player was roughly 15 pixels tall). Lan et al. (2012b) recognized group activities, which were considered as latent variables, encoding the contextual information in a video sequence. Two types of contextual information were explored: group-to-person interactions and person-to-person interactions. To model person-to-person interactions, one approach is to model the associated structure. The second approach is based on spatiotemporal features, which encode the information about an action and the behavior of people in the neighborhood. Finally, the third approach is a combination of the above two.

Much focus has also been given to recognizing human activities from real-life videos, such as movies and TV shows, by exploiting scene contexts to localize activities and understand human interactions ( Marszałek et al., 2009 ; Patron-Perez et al., 2012 ; Bojanowski et al., 2013 ; Hoai and Zisserman, 2014 ). The recognition accuracy of such complex videos can also be improved by relating textual descriptions and visual context to a unified framework ( Ramanathan et al., 2013 ). An alternative approach is a system that takes a video clip as its input and generates short textual descriptions, which may correspond to an activity label that was unseen during training ( Guadarrama et al., 2013 ). However, natural video sequences may contain irrelevant scenes or scenes with multiple actions. As a result, Bandla and Grauman (2013) proposed a method for recognizing human activities from unsegmented videos using a voting-based classification scheme to find the most frequently used action label.

Marín-Jiménez et al. (2014) used a bag of visual-audio words scheme along with late fusion for recognizing human interactions in TV shows. Even though their method performs well in recognizing human interactions, the lack of an intrinsic estimation of the audio-visual relationship limits the approach. Bousmalis et al. (2011) considered a system based on HCRFs for spontaneous agreement and disagreement recognition using audio and visual features. Although both methods yielded promising results, they did not consider any kind of explicit correlation and/or association between the different modalities. Hoai and Zisserman (2014) proposed a learning-based method that exploits the context and properties of a scene for detecting upper body positions and understanding the interactions of the participants in TV shows. An audio-visual analysis for recognizing dyadic interactions was presented by Yang et al. (2014). The authors combined a GMM with a Fisher kernel to model multimodal dyadic interactions and predict the body language of each subject according to the behavioral state of his/her interlocutor. Escalera et al. (2012) represented the concept of social interactions as an oriented graph, using an influence model to identify human interactions. Audio and visual detection and segmentation were performed to extract the exact segments of interest in a video sequence, and then the influence model was employed. Each link measured the influence of one person over another.

Many works on human activity recognition based on deep learning techniques have been proposed in the literature. In fact, deep learning methods have had a large impact on a plethora of research areas, including image/video understanding, speech recognition, and biomedical image analysis. Kim et al. (2013) used deep belief networks (DBN) (Hinton et al., 2006) in both supervised and unsupervised manners to learn the most informative audio-visual features and classify human emotions in dyadic interactions. Their system was able to preserve non-linear relationships between multimodal features and showed that unsupervised learning can be used efficiently for feature selection. Shao et al. (2015) mixed appearance and motion features for recognizing group activities in crowded scenes collected from the web. To combine the different modalities, the authors applied multitask deep learning. By these means, they were able to capture the intraclass correlations between the learned attributes, and they also proposed a novel crowd scene understanding dataset, called the WWW crowd dataset.
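
As a rough illustration of the unsupervised feature-learning recipe mentioned above, the sketch below stacks two restricted Boltzmann machines (the building blocks of a DBN) in front of a linear classifier. The random feature matrix, layer sizes, and hyperparameters are all assumptions for demonstration; this is not the audio-visual DBN of Kim et al. (2013).

```python
# Minimal sketch: stacked RBMs for unsupervised feature learning, then a supervised classifier.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
features = rng.random((300, 64))             # stand-in audio-visual features scaled to [0, 1]
labels = rng.integers(0, 2, size=300)        # stand-in emotion/activity labels

model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),   # supervised stage on the learned features
])
model.fit(features, labels)
```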

Deep learning has also been used by Gan et al. (2015) for detecting and recognizing complex events in video sequences. The proposed approach followed a sequential framework: first, saliency maps were used for detecting and localizing events, and then deep learning was applied to the pretrained features to identify the most important frames corresponding to the underlying event. Although much of the existing work on event understanding relies on video representations, significant work has been done on recognizing complex events from static images. Xiong et al. (2015) utilized CNNs to hierarchically combine information from different visual channels. The new representation of fused features was used to recognize complex social events. To assess their method, the authors introduced a large dataset with more than 60,000 static images obtained from the web, called the web image dataset for event recognition (WIDER).

Karpathy et al. (2014) performed an experimental evaluation of CNNs for classifying events from large-scale video datasets, using one million videos spanning 487 categories (the Sports-1M dataset) obtained from YouTube. Chen et al. (2013a) exploited different types of features, such as static and motion features, for recognizing unlabeled events from heterogeneous web data (e.g., YouTube and the Google/Bing image search engines). A separate classifier was learned for each source, and a multi-domain adaptation approach was followed to infer the labels for each data source. Tang et al. (2013) studied the problem of heterogeneous feature combination for recognizing complex events. They treated the problem as two tasks: first estimating which features were most informative for recognizing social events, and then combining the different features using an AND/OR graph structure.

Modeling crowded scenes has been a difficult task due to partial occlusions, interacting motion patterns, and sparsely distributed cameras in outdoor environments (Alahi et al., 2014). Most of the existing approaches for modeling group activities and social interactions between different persons exploit contextual information from the scene. However, such information is not sufficient to fully understand the underlying activity, as it does not capture the variations in human poses when interacting with other persons. When recognizing social interactions with a fixed number of participants, the problem may be relatively tractable; when the number of interacting people changes dynamically over time, the complexity increases and the problem becomes more challenging. Moreover, social interactions are usually decomposed into smaller subsets that contain individual activities or interactions between pairs of individuals. The individual motion patterns are analyzed separately and then combined to estimate the event. A person adapts his/her behavior according to the person with whom s/he interacts; thus, such an approach is limited by the fact that only specific interaction patterns can be successfully modeled, and it is brittle when modeling complex social events.

5.4. Multimodal Feature Fusion

Consider a scenario in which several people perform a specific activity/behavior and some of them emit sounds. In the simplest case, a human activity recognition system may recognize the underlying activity by taking into account only the visual information. However, the recognition accuracy may be enhanced by audio-visual analysis, as different people may perform different activities with similar body movements but with different sound intensities. The audio information may help identify the person of interest in a test video sequence and distinguish between different behavioral states.

A great difficulty in multimodal feature analysis is the dimensionality of the data from the different modalities. For example, video features are typically far more complex and higher-dimensional than audio features, and thus techniques for dimensionality reduction are useful. In the literature, there are two main fusion strategies that can be used to tackle this problem (Atrey et al., 2010; Shivappa et al., 2010).

Early fusion, or fusion at the feature level, combines features of different modalities, usually by reducing the dimensionality of each modality and creating a new feature vector that represents an individual. Canonical correlation analysis (CCA) (Hardoon et al., 2004) has been widely studied in the literature as an effective way of fusing data at the feature level (Sun et al., 2005; Wang et al., 2011c; Rudovic et al., 2013). The advantage of early fusion is that it yields good recognition results when the different modalities are highly correlated, since only one learning phase is required. On the other hand, the difficulty of combining the different modalities may lead to the domination of one modality over the others. A novel method for fusing verbal (i.e., textual information) and non-verbal (i.e., visual signals) cues was proposed by Evangelopoulos et al. (2013). Each modality is analyzed separately, and saliency scores are used in linear and non-linear fusion schemes.
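
A minimal sketch of CCA-based early fusion is given below, assuming two already-extracted per-clip descriptors (random stand-ins here) for the visual and audio modalities; the dimensions, number of components, and classifier choice are illustrative assumptions rather than a prescription from the cited works.

```python
# Minimal sketch of feature-level (early) fusion with CCA on stand-in descriptors.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_clips = 200
visual = rng.normal(size=(n_clips, 128))    # e.g., pooled visual descriptors per clip
audio = rng.normal(size=(n_clips, 40))      # e.g., MFCC statistics per clip
labels = rng.integers(0, 2, size=n_clips)   # binary activity labels (toy)

# Project both modalities into a shared, low-dimensional, maximally correlated space.
cca = CCA(n_components=10)
visual_c, audio_c = cca.fit_transform(visual, audio)

# Concatenate the projected views into a single fused feature vector per clip.
fused = np.hstack([visual_c, audio_c])
clf = LinearSVC().fit(fused, labels)        # one learning phase on the fused representation
```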

The second category of methods, known as late fusion or fusion at the decision level, learns a separate probabilistic model for each modality. All scores are then combined in a supervised framework, yielding a final decision score (Westerveld et al., 2003; Jiang et al., 2014). Exploiting the individual strength of each modality may lead to better recognition results. However, this strategy is time-consuming, requires more complex supervised learning schemes, and may lose inter-modality correlations. A comparison of early versus late fusion methods for video analysis was reported by Snoek et al. (2005).
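
The decision-level recipe can be sketched in a few lines: one classifier is trained per modality, and their posterior scores are combined with fixed weights. The data, weights, and classifier choice below are assumptions for illustration only.

```python
# Minimal sketch of decision-level (late) fusion: one classifier per modality, weighted score averaging.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_clips = 200
visual = rng.normal(size=(n_clips, 128))
audio = rng.normal(size=(n_clips, 40))
labels = rng.integers(0, 3, size=n_clips)   # three hypothetical activity classes

# Train each modality separately.
clf_visual = LogisticRegression(max_iter=1000).fit(visual, labels)
clf_audio = LogisticRegression(max_iter=1000).fit(audio, labels)

def late_fusion_predict(x_vis, x_aud, w_vis=0.6, w_aud=0.4):
    # Combine the per-class posteriors with fixed weights and take the arg-max class.
    scores = w_vis * clf_visual.predict_proba(x_vis) + w_aud * clf_audio.predict_proba(x_aud)
    return scores.argmax(axis=1)

preds = late_fusion_predict(visual, audio)
```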

Recently, a third approach for fusing multimodal data has come to the foreground (Karpathy et al., 2014). This approach, called slow fusion, combines the previous two and can be seen as a hierarchical fusion technique that slowly fuses data by successively passing information through early and late fusion levels. Although this approach appears to have the advantages of both early and late fusion, it also incurs a large computational burden due to the multiple levels of information processing. Figure 12 illustrates the graphical models of the different fusion approaches.

Figure 12. Graphical representation of different fusion approaches (Karpathy et al., 2014).
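
As a conceptual sketch of the hierarchical idea, the snippet below fuses short temporal chunks of a clip at an early level and then fuses the chunk-level decisions at a late level. This is only an illustration of the successive early/late stages described above, under simplifying assumptions; it is not the CNN architecture of Karpathy et al. (2014).

```python
# Conceptual sketch of "slow" (hierarchical) fusion on stand-in per-frame features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_clips, n_frames, dim = 100, 8, 32
frames = rng.normal(size=(n_clips, n_frames, dim))   # per-frame features (toy)
labels = rng.integers(0, 2, size=n_clips)

# Early level: concatenate adjacent frame pairs into chunk descriptors.
chunks = frames.reshape(n_clips, n_frames // 2, 2 * dim)

# One chunk-level classifier shared across temporal positions.
chunk_clf = LogisticRegression(max_iter=1000).fit(
    chunks.reshape(-1, 2 * dim), np.repeat(labels, n_frames // 2))

# Late level: average the chunk posteriors of each clip into a final clip-level score.
chunk_scores = chunk_clf.predict_proba(chunks.reshape(-1, 2 * dim))[:, 1]
clip_scores = chunk_scores.reshape(n_clips, -1).mean(axis=1)
preds = (clip_scores > 0.5).astype(int)
```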

6. Discussion

Human activity understanding has become one of the most active research topics in computer vision. The type and amount of data that each approach uses depend on the ability of the underlying algorithm to deal with heterogeneous and/or large-scale data. The development of a fully automated human activity recognition system is a non-trivial task due to cluttered backgrounds, complex camera motion, large intraclass variations, and data acquisition issues. Tables 2 and 3 provide a comprehensive comparison of unimodal and multimodal methods, respectively, and list the benefits and limitations of each family of methods.

Table 2. Comparison of unimodal methods.

Table 3. Comparison of multimodal methods.

The first step in developing a human activity recognition system is to acquire an adequate human activity database, which can be used for training and testing purposes. A complete survey covering important aspects of human activity recognition datasets was introduced by Chaquet et al. (2013). Such a dataset should be sufficiently rich in a variety of human actions, and its creation should correspond to real-world scenarios. The quality of the input media that form the dataset is one of the most important considerations; the input media can be static images or video sequences, colored or gray-scale. An ideal human activity dataset should address the following issues: (i) the input media should include still images and/or video sequences, (ii) the amount of data should be sufficient, (iii) the input media should be of adequate quality (resolution, grayscale or color), (iv) a large number of subjects should perform each action, (v) there should be a large number of action classes, (vi) changes in illumination, (vii) large intraclass variations (i.e., variations in subjects’ poses), (viii) recordings with partial occlusion of the human body, and (ix) complex backgrounds.

Although there exists a plethora of benchmark activity recognition datasets in the literature, we have focused on the most widely used ones with respect to database size, resolution, and usability. Table 4 summarizes human activity recognition datasets, organizing them into seven categories. All datasets are grouped by their associated category and listed in chronological order within each group. We also report the number of classes, actors, and video clips along with the frame resolution.

Table 4. Human activity recognition datasets.

Many of the existing datasets for human activity recognition were recorded in controlled environments, with participating actors performing specific actions. Furthermore, several datasets are not generic but rather cover a specific set of activities, such as sports and simple actions, which are usually performed by one actor. These limitations constitute an unrealistic scenario that does not cover real-world situations and does not meet the specifications for an ideal human activity dataset presented earlier. Nevertheless, several activity recognition datasets that take these requirements into account have been proposed.

Several existing datasets have reached the end of their expected life cycle (e.g., methods have achieved a 100% recognition rate on the Weizmann and KTH datasets). These datasets were captured in controlled environments, and the performed actions were recorded from a frontal-view camera. The simple backgrounds and the lack of intraclass variation in human movements make these datasets inapplicable to real-world scenarios. However, they remain popular for human activity classification, as they provide a convenient evaluation criterion for many new methods. A significant problem in constructing a proper human activity recognition dataset is the annotation of each action, which is generally performed manually and is therefore prone to bias.

Understanding human activities is a part of interpersonal relationships. Humans have the ability to understand another human’s actions by interpreting stimuli from the surroundings. Machines, on the other hand, need a learning phase to be able to perform this operation. Thus, some basic questions arise about a human activity classification system:

1. How can we determine whether a human activity classification system provides the best performance?

2. In which cases is the system prone to errors when classifying a human activity?

3. To what extent can the system reach the human ability to recognize a human activity?

4. Are the success rates of the system adequate for drawing safe conclusions?

It is necessary for the system to be fully automated. To achieve this, all stages of human activity modeling and analysis must be performed automatically, namely: (i) human activity detection and localization, where the challenge is to detect and localize a human activity in the scene; background subtraction (Elgammal et al., 2002) and human tracking (Liu et al., 2010) are usually used as part of this process; (ii) human activity modeling (e.g., feature extraction; Laptev, 2005), which extracts the information needed for the recognition step; and (iii) human activity classification, where a probe video sequence is assigned to one of the activity classes defined before building the system.
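
A toy end-to-end sketch of these three stages is shown below, under strong simplifying assumptions: MOG2 background subtraction stands in for detection and localization, a coarse foreground-motion histogram stands in for feature extraction, and a nearest-neighbour classifier stands in for activity classification. The video paths and the descriptor are hypothetical.

```python
# Toy three-stage pipeline: detection/localization -> feature extraction -> classification.
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def extract_clip_descriptor(video_path, n_bins=16):
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2()   # stage (i): crude person/motion detection
    histogram = np.zeros(n_bins)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)                   # foreground (moving-person) mask
        ratio = (mask > 0).mean()                        # stage (ii): fraction of moving pixels as a crude feature
        histogram[min(int(ratio * n_bins), n_bins - 1)] += 1
    cap.release()
    return histogram / max(histogram.sum(), 1)

# Stage (iii): classify a probe clip against labeled training clips (paths are hypothetical).
train_x = [extract_clip_descriptor(p) for p in ["walk1.avi", "run1.avi"]]
train_y = ["walking", "running"]
clf = KNeighborsClassifier(n_neighbors=1).fit(train_x, train_y)
print(clf.predict([extract_clip_descriptor("probe.avi")]))
```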

In addition, the system should work regardless of external factors. This means that the system should perform robustly despite changes in lighting, pose variations, partially occluded human bodies, and background clutter. The number as well as the type of human activity classes to be recognized is also an important factor in the robustness of the system. The requirements of an ideal human activity classification system cover several topics, including automatic activity classification and localization, and robustness to lighting and pose variations (e.g., multiview recognition), partially occluded human bodies, and background clutter. Moreover, all possible activities should be detected during the recognition process, the recognition accuracy should be independent of the number of activity classes, and the activity identification process should be performed in real time with a high success rate and a low false positive rate.

Despite the vast amount of research in this field, generalizing the learning framework is crucial for modeling and understanding real-world human activities. Several challenges concerning the ability of a classification system to generalize under external factors, such as variations in human poses and different data acquisition conditions, remain open. The ability of a human activity classification system to match humans’ skill in recognizing actions in real time is a future challenge to be tackled. Machine-learning techniques that incorporate knowledge-driven approaches may be vital for human activity modeling and recognition in unconstrained environments, where data may not be adequate or may suffer from occlusions and changes in illumination and viewpoint.

Training and validation methods still suffer from limitations, such as slow learning rates, which worsen for large-scale training data, and low recognition rates. Although much research focuses on leveraging human activity recognition from big data, this problem is still in its infancy. The exact opposite problem (i.e., learning human activities from very little or missing training data) is also very challenging. Several issues concerning the minimum number of training examples needed to model the dynamics of each class or to safely infer the performed activity label are still open and need further investigation. More attention should also be paid to developing methods that are robust to missing data during either training or testing.

The role of appropriate feature extraction for human activity recognition is a problem that needs to be tackled in future research. The extraction of low-level features focused on representing human motion is a very challenging task. To this end, a fundamental question arises: are there features invariant to scale and viewpoint changes that can model human motion in a unique manner for all possible configurations of human pose?

Furthermore, it is evident that there exists a great need for efficiently manipulating training data that may come from heterogeneous sources. The number and type of modalities that can be used for analyzing human activities is an important question. The combination of multimodal features, such as body motion features, facial expressions, and the intensity level of the voice, may produce superior results compared to unimodal approaches. On the other hand, such a combination may yield over-complete examples that can be confusing and misleading. The multimodal feature fusion techniques proposed so far do not incorporate the special characteristics of each modality or the level of abstraction at which to fuse. Therefore, a comprehensive evaluation of feature fusion methods that retain the coupling between features is an issue that still needs to be assessed.

It is evident that the lack of large and realistic human activity recognition datasets is a significant challenge that needs to be addressed. An ideal action dataset should cover several topics, including diversity in human poses for the same action, a wide range of ground truth labels, and variations in image capture conditions and quality. Although action datasets that satisfy most of these specifications have been introduced in the literature, the question of how many actions we can actually learn remains open for further exploration. Most of the existing datasets contain very few classes (15 on average). However, some datasets contain many more activities, reaching 203 or even 487 classes. In such large datasets, distinguishing between easy and difficult examples for representing the different classes and recognizing the underlying activity is itself difficult. This opens a promising research area that should be further studied.

Another challenge worthy of further exploration is the exploitation of unsegmented sequences, where one activity may follow another. Frequent changes in human motion, together with actions performed by groups of interacting persons, make the problem especially challenging. More sophisticated high-level activity recognition methods need to be developed, which should be able to localize and recognize simultaneously occurring actions performed by different persons.

7. Conclusion

In this survey, we carried out a comprehensive study of state-of-the-art methods for human activity recognition and proposed a hierarchical taxonomy for classifying these methods. We surveyed different approaches, which were classified into two broad categories (unimodal and multimodal) according to the source channel each approach employs to recognize human activities. We discussed unimodal approaches and provided an internal categorization of these methods, which were developed for analyzing gestures, atomic actions, and more complex activities, either directly or by decomposing activities into simpler actions. We also presented multimodal approaches for the analysis of human social behaviors and interactions. We discussed the different levels of representation of feature modalities and reported the limitations and advantages of each representation. A comprehensive review of existing human activity classification benchmarks was also presented, and we examined the challenges that data acquisition poses for understanding human activity. Finally, we outlined the characteristics of an ideal human activity recognition system.

Most of the existing studies in this field fail to describe human activities in a concise and informative way, as they introduce limitations related to computational issues. The gap between a complete representation of human activities and the corresponding data collection and annotation remains a challenging and unbridged problem. In particular, we may conclude that despite the tremendous growth in human activity understanding methods, many problems remain open, including modeling of human poses, handling occlusions, and annotating data.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

This research was funded in part by the UH Hugh Roy and Lillie Cranz Cullen Endowment Fund. All statements of fact, opinion, or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of the sponsors.

Aggarwal, J. K., and Cai, Q. (1999). Human motion analysis: a review. Comput. Vis. Image Understand. 73, 428–440. doi:10.1006/cviu.1998.0744

Aggarwal, J. K., and Ryoo, M. S. (2011). Human activity analysis: a review. ACM Comput. Surv. 43, 1–43. doi:10.1145/1922649.1922653

Aggarwal, J. K., and Xia, L. (2014). Human activity recognition from 3D data: a review. Pattern Recognit. Lett. 48, 70–80. doi:10.1016/j.patrec.2014.04.011

Akata, Z., Perronnin, F., Harchaoui, Z., and Schmid, C. (2013). “Label-embedding for attribute-based classification,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 819–826.

Alahi, A., Ramanathan, V., and Fei-Fei, L. (2014). “Socially-aware large-scale crowd forecasting,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 2211–2218.

AlZoubi, O., Fossati, D., D’Mello, S. K., and Calvo, R. A. (2013). “Affect detection and classification from the non-stationary physiological data,” in Proc. International Conference on Machine Learning and Applications (Portland, OR), 240–245.

Amer, M. R., and Todorovic, S. (2012). “Sum-product networks for modeling activities with stochastic structure,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1314–1321.

Amin, S., Andriluka, M., Rohrbach, M., and Schiele, B. (2013). “Multi-view pictorial structures for 3D human pose estimation,” in Proc. British Machine Vision Conference (Bristol), 1–12.

Andriluka, M., Pishchulin, L., Gehler, P. V., and Schiele, B. (2014). “2D human pose estimation: new benchmark and state of the art analysis,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 3686–3693.

Andriluka, M., and Sigal, L. (2012). “Human context: modeling human-human interactions for monocular 3D pose estimation,” in Proc. International Conference on Articulated Motion and Deformable Objects (Mallorca: Springer-Verlag), 260–272.

Anirudh, R., Turaga, P., Su, J., and Srivastava, A. (2015). “Elastic functional coding of human actions: from vector-fields to latent variables,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 3147–3155.

Atrey, P. K., Hossain, M. A., El-Saddik, A., and Kankanhalli, M. S. (2010). Multimodal fusion for multimedia analysis: a survey. Multimed. Syst. 16, 345–379. doi:10.1007/s00530-010-0182-0

Bandla, S., and Grauman, K. (2013). “Active learning of an action detector from untrimmed videos,” in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 1833–1840.

Baxter, R. H., Robertson, N. M., and Lane, D. M. (2015). Human behaviour recognition in data-scarce domains. Pattern Recognit. 48, 2377–2393. doi:10.1016/j.patcog.2015.02.019

Belagiannis, V., Amin, S., Andriluka, M., Schiele, B., Navab, N., and Ilic, S. (2014). “3D pictorial structures for multiple human pose estimation,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 1669–1676.

Bilakhia, S., Petridis, S., and Pantic, M. (2013). “Audiovisual detection of behavioural mimicry,” in Proc. 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (Geneva), 123–128.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning . Secaucus, NJ: Springer.

Blank, M., Gorelick, L., Shechtman, E., Irani, M., and Basri, R. (2005). “Actions as space-time shapes,” in Proc. IEEE International Conference on Computer Vision (Beijing), 1395–1402.

Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., and Sivic, J. (2013). “Finding actors and actions in movies,” in Proc. IEEE International Conference on Computer Vision (Sydney), 2280–2287.

Bousmalis, K., Mehu, M., and Pantic, M. (2013a). Towards the automatic detection of spontaneous agreement and disagreement based on nonverbal behaviour: a survey of related cues, databases, and tools. Image Vis. Comput. 31, 203–221. doi:10.1016/j.imavis.2012.07.003

Bousmalis, K., Zafeiriou, S., Morency, L. P., and Pantic, M. (2013b). Infinite hidden conditional random fields for human behavior analysis. IEEE Trans. Neural Networks Learn. Syst. 24, 170–177. doi:10.1109/TNNLS.2012.2224882

Bousmalis, K., Morency, L., and Pantic, M. (2011). “Modeling hidden dynamics of multimodal cues for spontaneous agreement and disagreement recognition,” in Proc. IEEE International Conference on Automatic Face and Gesture Recognition (Santa Barbara, CA), 746–752.

Burenius, M., Sullivan, J., and Carlsson, S. (2013). “3D pictorial structures for multiple view articulated pose estimation,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 3618–3625.

Burgos-Artizzu, X. P., Dollár, P., Lin, D., Anderson, D. J., and Perona, P. (2012). “Social behavior recognition in continuous video,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1322–1329.

Candamo, J., Shreve, M., Goldgof, D. B., Sapper, D. B., and Kasturi, R. (2010). Understanding transit scenes: a survey on human behavior-recognition algorithms. IEEE Trans. Intell. Transp. Syst. 11, 206–224. doi:10.1109/TITS.2009.2030963

Castellano, G., Villalba, S. D., and Camurri, A. (2007). “Recognising human emotions from body movement and gesture dynamics,” in Proc. Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science , Vol. 4738 (Lisbon), 71–82.

Chakraborty, B., Holte, M. B., Moeslund, T. B., and Gonzàlez, J. (2012). Selective spatio-temporal interest points. Comput. Vis. Image Understand. 116, 396–410. doi:10.1016/j.cviu.2011.09.010

Chaquet, J. M., Carmona, E. J., and Fernández-Caballero, A. (2013). A survey of video datasets for human action and activity recognition. Comput. Vis. Image Understand. 117, 633–659. doi:10.1016/j.cviu.2013.01.013

Chaudhry, R., Ravichandran, A., Hager, G. D., and Vidal, R. (2009). “Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1932–1939.

Chen, C. Y., and Grauman, K. (2012). “Efficient activity detection with max-subgraph search,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1274–1281.

Chen, H., Li, J., Zhang, F., Li, Y., and Wang, H. (2015). “3D model-based continuous emotion recognition,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1836–1845.

Chen, L., Duan, L., and Xu, D. (2013a). “Event recognition in videos by learning from heterogeneous web sources,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 2666–2673.

Chen, L., Wei, H., and Ferryman, J. (2013b). A survey of human motion analysis using depth imagery. Pattern Recognit. Lett. 34, 1995–2006. doi:10.1016/j.patrec.2013.02.006

Chen, W., Xiong, C., Xu, R., and Corso, J. J. (2014). “Actionness ranking with lattice conditional ordinal random fields,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 748–755.

Cherian, A., Mairal, J., Alahari, K., and Schmid, C. (2014). “Mixing body-part sequences for human pose estimation,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 2361–2368.

Choi, W., Shahid, K., and Savarese, S. (2011). “Learning context for collective activity recognition,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3273–3280.

Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. (2011). “Flexible, high performance convolutional neural networks for image classification,” in Proc. International Joint Conference on Artificial Intelligence (Barcelona), 1237–1242.

Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012). “Multi-column deep neural networks for image classification,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 3642–3649.

Cui, X., Liu, Q., Gao, M., and Metaxas, D. N. (2011). “Abnormal detection using interaction energy potentials,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3161–3167.

Dalal, N., and Triggs, B. (2005). “Histograms of oriented gradients for human detection,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 886–893.

Dalal, N., Triggs, B., and Schmid, C. (2006). “Human detection using oriented histograms of flow and appearance,” in Proc. European Conference on Computer Vision (Graz), 428–441.

Dollár, P., Rabaud, V., Cottrell, G., and Belongie, S. (2005). “Behavior recognition via sparse spatio-temporal features,” in Proc. International Conference on Computer Communications and Networks (Beijing), 65–72.

Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., et al. (2015). “Long-term recurrent convolutional networks for visual recognition and description,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2625–2634.

Du, Y., Wang, W., and Wang, L. (2015). “Hierarchical recurrent neural network for skeleton based action recognition,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1110–1118.

Efros, A. A., Berg, A. C., Mori, G., and Malik, J. (2003). “Recognizing action at a distance,” in Proc. IEEE International Conference on Computer Vision , Vol. 2 (Nice), 726–733.

Ekman, P., Friesen, W. V., and Hager, J. C. (2002). Facial Action Coding System (FACS): Manual . Salt Lake City: A Human Face.

Elgammal, A., Duraiswami, R., Harwood, D., and Davis, L. S. (2002). Background and foreground modeling using nonparametric kernel density for visual surveillance. Proc. IEEE 90, 1151–1163. doi:10.1109/JPROC.2002.801448

Escalera, S., Baró, X., Vitrià, J., Radeva, P., and Raducanu, B. (2012). Social network extraction and analysis based on multimodal dyadic interaction. Sensors 12, 1702–1719. doi:10.3390/s120201702

Evangelopoulos, G., Zlatintsi, A., Potamianos, A., Maragos, P., Rapantzikos, K., Skoumas, G., et al. (2013). Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Trans. Multimedia 15, 1553–1568. doi:10.1109/TMM.2013.2267205

Evgeniou, T., and Pontil, M. (2004). “Regularized multi-task learning,” in Proc. ACM International Conference on Knowledge Discovery and Data Mining (Seattle, WA), 109–117.

Eweiwi, A., Cheema, M. S., Bauckhage, C., and Gall, J. (2014). “Efficient pose-based action recognition,” in Proc. Asian Conference on Computer Vision (Singapore), 428–443.

Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. A. (2009). “Describing objects by their attributes,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1778–1785.

Fathi, A., Hodgins, J. K., and Rehg, J. M. (2012). “Social interactions: a first-person perspective,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1226–1233.

Fathi, A., and Mori, G. (2008). “Action recognition by learning mid-level motion features,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1–8.

Fergie, M., and Galata, A. (2013). Mixtures of Gaussian process models for human pose estimation. Image Vis. Comput. 31, 949–957. doi:10.1016/j.imavis.2013.09.007

Fernando, B., Gavves, E., Oramas, J. M., Ghodrati, A., and Tuytelaars, T. (2015). “Modeling video evolution for action recognition,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 5378–5387.

Ferrari, V., Marin-Jimenez, M., and Zisserman, A. (2009). “Pose search: retrieving people using their pose,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1–8.

Fisher, R. B. (2004). PETS04 Surveillance Ground Truth Dataset . Available at: http://www-prima.inrialpes.fr/PETS04/

Fisher, R. B. (2007a). Behave: Computer-Assisted Prescreening of Video Streams for Unusual Activities . Available at: http://homepages.inf.ed.ac.uk/rbf/BEHAVE/

Fisher, R. B. (2007b). PETS07 Benchmark Dataset . Available at: http://www.cvg.reading.ac.uk/PETS2007/data.html

Fogel, I., and Sagi, D. (1989). Gabor filters as texture discriminator. Biol. Cybern. 61, 103–113. doi:10.1007/BF00204594

Fothergill, S., Mentis, H. M., Kohli, P., and Nowozin, S. (2012). “Instructing people for training gestural interactive systems,” in Proc. Conference on Human Factors in Computing Systems (Austin, TX), 1737–1746.

Fouhey, D. F., Delaitre, V., Gupta, A., Efros, A. A., Laptev, I., and Sivic, J. (2014). People watching: human actions as a cue for single view geometry. Int. J. Comput. Vis. 110, 259–274. doi:10.1007/s11263-014-0710-z

Fu, Y., Hospedales, T. M., Xiang, T., and Gong, S. (2012). “Attribute learning for understanding unstructured social activity,” in Proc. European Conference on Computer Vision, Lecture Notes in Computer Science , Vol. 7575 (Florence), 530–543.

Fu, Y., Hospedales, T. M., Xiang, T., and Gong, S. (2014). Learning multimodal latent attributes. IEEE Trans. Pattern Anal. Mach. Intell. 36, 303–316. doi:10.1109/TPAMI.2013.128

Gaidon, A., Harchaoui, Z., and Schmid, C. (2014). Activity representation with motion hierarchies. Int. J. Comput. Vis. 107, 219–238. doi:10.1007/s11263-013-0677-1

Gan, C., Wang, N., Yang, Y., Yeung, D. Y., and Hauptmann, A. G. (2015). “DevNet: a deep event network for multimedia event detection and evidence recounting,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2568–2577.

Gao, Z., Zhang, H., Xu, G. P., and Xue, Y. B. (2015). Multi-perspective and multi-modality joint representation and recognition model for 3D action recognition. Neurocomputing 151, 554–564. doi:10.1016/j.neucom.2014.06.085

Gavrila, D. M. (1999). The visual analysis of human movement: a survey. Comput. Vis. Image Understand. 73, 82–98. doi:10.1006/cviu.1998.0716

Gorelick, L., Blank, M., Shechtman, E., Irani, M., and Basri, R. (2007). Actions as space-time shapes. IEEE Trans. Pattern Anal. Mach. Intell. 29, 2247–2253. doi:10.1109/TPAMI.2007.70711

Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R. J., Darrell, T., et al. (2013). “Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition,” in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 2712–2719.

Guha, T., and Ward, R. K. (2012). Learning sparse representations for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1576–1588. doi:10.1109/TPAMI.2011.253

Guo, G., and Lai, A. (2014). A survey on still image based human action recognition. Pattern Recognit. 47, 3343–3361. doi:10.1016/j.patcog.2014.04.018

Gupta, A., and Davis, L. S. (2007). “Objects in action: an approach for combining action understanding and object perception,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Minneapolis, MN), 1–8.

Gupta, A., Kembhavi, A., and Davis, L. S. (2009). Observing human-object interactions: using spatial and functional compatibility for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31, 1775–1789. doi:10.1109/TPAMI.2009.83

Haralick, R. M., and Watson, L. (1981). A facet model for image data. Comput. Graph. Image Process. 15, 113–129. doi:10.1016/0146-664X(81)90073-3

Hardoon, D. R., Szedmak, S. R., and Shawe-Taylor, J. R. (2004). Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16, 2639–2664. doi:10.1162/0899766042321814

Healey, J. (2011). “Recording affect in the field: towards methods and metrics for improving ground truth labels,” in Proc. International Conference on Affective Computing and Intelligent Interaction (Memphis, TN), 107–116.

Heilbron, F. C., Escorcia, V., Ghanem, B., and Niebles, J. C. (2015). “ActivityNet: a large-scale video benchmark for human activity understanding,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 961–970.

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554. doi:10.1162/neco.2006.18.7.1527

Ho, T. K. (1995). “Random decision forests,” in Proc. International Conference on Document Analysis and Recognition , Vol. 1 (Washington, DC: IEEE Computer Society), 278–282.

Hoai, M., Lan, Z. Z., and Torre, F. (2011). “Joint segmentation and classification of human actions in video,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3265–3272.

Hoai, M., and Zisserman, A. (2014). “Talking heads: detecting humans and recognizing their interactions,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 875–882.

Holte, M. B., Chakraborty, B., Gonzàlez, J., and Moeslund, T. B. (2012a). A local 3-D motion descriptor for multi-view human action recognition from 4-D spatio-temporal interest points. IEEE J. Sel. Top. Signal Process. 6, 553–565. doi:10.1109/JSTSP.2012.2193556

Holte, M. B., Tran, C., Trivedi, M. M., and Moeslund, T. B. (2012b). Human pose estimation and activity recognition from multi-view videos: comparative explorations of recent developments. IEEE J. Sel. Top. Signal Process. 6, 538–552. doi:10.1109/JSTSP.2012.2196975

Huang, Z. F., Yang, W., Wang, Y., and Mori, G. (2011). “Latent boosting for action recognition,” in Proc. British Machine Vision Conference (Dundee), 1–11.

Hussain, M. S., Calvo, R. A., and Pour, P. A. (2011). “Hybrid fusion approach for detecting affects from multichannel physiology,” in Proc. International Conference on Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science , Vol. 6974 (Memphis, TN), 568–577.

Ikizler, N., and Duygulu, P. (2007). “Human action recognition using distribution of oriented rectangular patches,” in Proc. Conference on Human Motion: Understanding, Modeling, Capture and Animation (Rio de Janeiro), 271–284.

Ikizler-Cinbis, N., and Sclaroff, S. (2010). “Object, scene and actions: combining multiple features for human action recognition,” in Proc. European Conference on Computer Vision, Lecture Notes in Computer Science , Vol. 6311 (Hersonissos, Heraclion, Crete, greece: Springer), 494–507.

Iosifidis, A., Tefas, A., and Pitas, I. (2012a). Activity-based person identification using fuzzy representation and discriminant learning. IEEE Trans. Inform. Forensics Secur. 7, 530–542. doi:10.1109/TIFS.2011.2175921

Iosifidis, A., Tefas, A., and Pitas, I. (2012b). View-invariant action recognition based on artificial neural networks. IEEE Trans. Neural Networks Learn. Syst. 23, 412–424. doi:10.1109/TNNLS.2011.2181865

Jaimes, A., and Sebe, N. (2007). “Multimodal human-computer interaction: a survey,” in Computer Vision and Image Understanding , Vol. 108 (Special Issue on Vision for Human-Computer Interaction), 116–134.

Jain, M., Gemert, J., Jégou, H., Bouthemy, P., and Snoek, C. G. M. (2014). “Action localization with tubelets from motion,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 740–747.

Jain, M., Jegou, H., and Bouthemy, P. (2013). “Better exploiting motion for better action recognition,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 2555–2562.

Jainy, M., Gemerty, J. C., and Snoek, C. G. M. (2015). “What do 15,000 object categories tell us about classifying and localizing actions?,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 46–55.

Jayaraman, D., and Grauman, K. (2014). “Zero-shot recognition with unreliable attributes,” in Proc. Annual Conference on Neural Information Processing Systems (Montreal, QC), 3464–3472.

Jhuang, H., Gall, J., Zuffi, S., Schmid, C., and Black, M. J. (2013). “Towards understanding action recognition,” in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 3192–3199.

Jhuang, H., Serre, T., Wolf, L., and Poggio, T. (2007). “A biologically inspired system for action recognition,” in Proc. IEEE International Conference on Computer Vision (Rio de Janeiro), 1–8.

Jiang, B., Martínez, B., Valstar, M. F., and Pantic, M. (2014). “Decision level fusion of domain specific regions for facial action recognition,” in Proc. International Conference on Pattern Recognition (Stockholm), 1776–1781.

Jiang, Y. G., Ye, G., Chang, S. F., Ellis, D. P. W., and Loui, A. C. (2011). “Consumer video understanding: a benchmark database and an evaluation of human and machine performance,” in Proc. International Conference on Multimedia Retrieval (Trento), 29–36.

Jiang, Z., Lin, Z., and Davis, L. S. (2013). A unified tree-based framework for joint action localization, recognition and segmentation. Comput. Vis. Image Understand. 117, 1345–1355. doi:10.1016/j.cviu.2012.09.008

Jung, H. Y., Lee, S., Heo, Y. S., and Yun, I. D. (2015). “Random treewalk toward instantaneous 3D human pose estimation,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2467–2474.

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). “Large-scale video classification with convolutional neural networks,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 1725–1732.

Khamis, S., Morariu, V. I., and Davis, L. S. (2012). “A flow model for joint action recognition and identity maintenance,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1218–1225.

Kim, Y., Lee, H., and Provost, E. M. (2013). “Deep learning for robust feature generation in audiovisual emotion recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (Vancouver, BC), 3687–3691.

Klami, A., and Kaski, S. (2008). Probabilistic approach to detecting dependencies between data sets. Neurocomputing 72, 39–46. doi:10.1016/j.neucom.2007.12.044

Kläser, A., Marszałek, M., and Schmid, C. (2008). “A spatio-temporal descriptor based on 3D-gradients,” in Proc. British Machine Vision Conference (Leeds: University of Leeds), 995–1004.

Kohonen, T., Schroeder, M. R., and Huang, T. S. (eds) (2001). Self-Organizing Maps , Third Edn. New York, NY.: Springer-Verlag Inc.

Kong, Y., and Fu, Y. (2014). “Modeling supporting regions for close human interaction recognition,” in Proc. European Conference on Computer Vision (Zurich), 29–44.

Kong, Y., Jia, Y., and Fu, Y. (2014a). Interactive phrases: semantic descriptions for human interaction recognition. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1775–1788. doi:10.1109/TPAMI.2014.2303090

Kong, Y., Kit, D., and Fu, Y. (2014b). “A discriminative model with multiple temporal scales for action prediction,” in Proc. European Conference on Computer Vision (Zurich), 596–611.

Kovashka, A., and Grauman, K. (2010). “Learning a hierarchy of discriminative space-time neighborhood features for human action recognition,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (San Francisco, CA), 2046–2053.

Kuehne, H., Arslan, A., and Serre, T. (2014). “The language of actions: recovering the syntax and semantics of goal-directed human activities,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 780–787.

Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011). “HMDB: a large video database for human motion recognition,” in Proc. IEEE International Conference on Computer Vision (Barcelona), 2556–2563.

Kulkarni, K., Evangelidis, G., Cech, J., and Horaud, R. (2015). Continuous action recognition based on sequence alignment. Int. J. Comput. Vis. 112, 90–114. doi:10.1007/s11263-014-0758-9

Kulkarni, P., Sharma, G., Zepeda, J., and Chevallier, L. (2014). “Transfer learning via attributes for improved on-the-fly classification,” in Proc. IEEE Winter Conference on Applications of Computer Vision (Steamboat Springs, CO), 220–226.

Kviatkovsky, I., Rivlin, E., and Shimshoni, I. (2014). Online action recognition using covariance of shape and motion. Comput. Vis. Image Understand. 129, 15–26. doi:10.1016/j.cviu.2014.08.001

Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). “Conditional random fields: probabilistic models for segmenting and labeling sequence data,” in Proc. International Conference on Machine Learning (Williamstown, MA: Williams College), 282–289.

Lampert, C. H., Nickisch, H., and Harmeling, S. (2009). “Learning to detect unseen object classes by between-class attribute transfer,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 951–958.

Lan, T., Chen, T. C., and Savarese, S. (2014). “A hierarchical representation for future action prediction,” in Proc. European Conference on Computer Vision (Zurich), 689–704.

Lan, T., Sigal, L., and Mori, G. (2012a). “Social roles in hierarchical models for human activity recognition,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1354–1361.

Lan, T., Wang, Y., Yang, W., Robinovitch, S. N., and Mori, G. (2012b). Discriminative latent models for recognizing contextual group activities. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1549–1562. doi:10.1109/TPAMI.2011.228

Lan, T., Wang, Y., and Mori, G. (2011). “Discriminative figure-centric models for joint action localization and recognition,” in Proc. IEEE International Conference on Computer Vision (Barcelona), 2003–2010.

Laptev, I. (2005). On space-time interest points. Int. J. Comput. Vis. 64, 107–123. doi:10.1007/s11263-005-1838-7

Laptev, I., Marszałek, M., Schmid, C., and Rozenfeld, B. (2008). “Learning realistic human actions from movies,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1–8.

Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. (2011). “Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3361–3368.

Li, B., Ayazoglu, M., Mao, T., Camps, O. I., and Sznaier, M. (2011). “Activity recognition using dynamic subspace angles,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3193–3200.

Li, B., Camps, O. I., and Sznaier, M. (2012). “Cross-view activity recognition using hankelets,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1362–1369.

Li, R., and Zickler, T. (2012). “Discriminative virtual views for cross-view action recognition,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 2855–2862.

Lichtenauer, J., Valstar, J. S. M., and Pantic, M. (2011). Cost-effective solution to synchronised audio-visual data capture using multiple sensors. Image Vis. Comput. 29, 666–680. doi:10.1016/j.imavis.2011.07.004

Lillo, I., Soto, A., and Niebles, J. C. (2014). “Discriminative hierarchical modeling of spatio-temporally composable human activities,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 812–819.

Lin, Z., Jiang, Z., and Davis, L. S. (2009). “Recognizing actions by shape-motion prototype trees,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 444–451.

Liu, J., Kuipers, B., and Savarese, S. (2011a). “Recognizing human actions by attributes,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3337–3344.

Liu, N., Dellandréa, E., Tellez, B., and Chen, L. (2011b). “Associating textual features with visual ones to improve affective image classification,” in Proc. International Conference on Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science , Vol. 6974 (Memphis, TN), 195–204.

Liu, J., Luo, J., and Shah, M. (2009). “Recognizing realistic actions from videos in the wild,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1–8.

Liu, J., Yan, J., Tong, M., and Liu, Y. (2010). “A Bayesian framework for 3D human motion tracking from monocular image,” in IEEE International Conference on Acoustics, Speech and Signal Processing (Dallas, TX: IEEE), 1398–1401.

Livne, M., Sigal, L., Troje, N. F., and Fleet, D. J. (2012). Human attributes from 3D pose tracking. Comput. Vis. Image Understanding 116, 648–660. doi:10.1016/j.cviu.2012.01.003

Lu, J., Xu, R., and Corso, J. J. (2015). “Human action segmentation with hierarchical supervoxel consistency,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 3762–3771.

Lu, W. L., Ting, J. A., Murphy, K. P., and Little, J. J. (2011). “Identifying players in broadcast sports videos using conditional random fields,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3249–3256.

Ma, S., Sigal, L., and Sclaroff, S. (2015). “Space-time tree ensemble for action recognition,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 5024–5032.

Maji, S., Bourdev, L. D., and Malik, J. (2011). “Action recognition from a distributed representation of pose and appearance,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3177–3184.

Marín-Jiménez, M. J., Noz Salinas, R. M., Yeguas-Bolivar, E., and de la Blanca, N. P. (2014). Human interaction categorization by using audio-visual cues. Mach. Vis. Appl. 25, 71–84. doi:10.1007/s00138-013-0521-1

Marszałek, M., Laptev, I., and Schmid, C. (2009). “Actions in context,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 2929–2936.

Martinez, H. P., Bengio, Y., and Yannakakis, G. N. (2013). Learning deep physiological models of affect. IEEE Comput. Intell. Mag. 8, 20–33. doi:10.1109/MCI.2013.2247823

Martinez, H. P., Yannakakis, G. N., and Hallam, J. (2014). Don’t classify ratings of affect; rank them! IEEE Trans. Affective Comput. 5, 314–326. doi:10.1109/TAFFC.2014.2352268

Matikainen, P., Hebert, M., and Sukthankar, R. (2009). “Trajectons: action recognition through the motion analysis of tracked features,” in Workshop on Video-Oriented Object and Event Classification, in Conjunction with ICCV (Kyoto: IEEE), 514–521.


Keywords: human activity recognition, activity categorization, activity datasets, action representation, review, survey

Citation: Vrigkas M, Nikou C and Kakadiaris IA (2015) A Review of Human Activity Recognition Methods. Front. Robot. AI 2:28. doi: 10.3389/frobt.2015.00028

Received: 09 July 2015; Accepted: 29 October 2015; Published: 16 November 2015

*Correspondence: Christophoros Nikou, cnikou@cs.uoi.gr


Human Activity Recognition Data Analysis: History, Evolutions, and New Trends

Paola Patricia Ariza-Colpas

1 Department of Computer Science and Electronics, Universidad de la Costa CUC, Barranquilla 080002, Colombia

2 Faculty of Engineering in Information and Communication Technologies, Universidad Pontificia Bolivariana, Medellín 050031, Colombia; [email protected]

Enrico Vicario

3 Department of Information Engineering, University of Florence, 50139 Firenze, Italy; [email protected] (E.V.); [email protected] (F.P.)

Ana Isabel Oviedo-Carrascal

Shariq Butt Aziz

4 Department of Computer Science and IT, University of Lahore, Lahore 44000, Pakistan; shariq2315@gmail.com

Marlon Alberto Piñeres-Melo

5 Department of Systems Engineering, Universidad del Norte, Barranquilla 081001, Colombia; pineresm@uninorte.edu.co

Alejandra Quintero-Linero

6 Microbiology Program, Universidad Popular del Cesar, Valledupar 200002, Colombia; alejandraquinterol@unicesar.edu.co

Fulvio Patara

The Assisted Living Environments research area, AAL (Ambient Assisted Living), focuses on generating innovative technology, products, and services that provide assistance, medical care, and rehabilitation to older adults in order to extend the time these people can live independently, whether they suffer from neurodegenerative diseases or some disability. This important area is responsible for the development of activity recognition systems (ARS), a valuable tool for identifying the type of activity carried out by older adults so that they can be given the assistance that allows them to carry out their daily activities with complete normality. This article reviews the literature and the evolution of the different techniques for processing this type of data, covering supervised, unsupervised, ensemble, deep, reinforcement, and transfer learning as well as metaheuristic approaches applied to this sector of health science, and it presents the metrics reported in recent experiments for researchers in this area of knowledge. As a result of this review, models based on reinforcement or transfer learning can be identified as a promising line of work for the processing and analysis of human activity recognition data.

1. Introduction

Currently, the number of older adults who require a caregiver due to various conditions related to neurodegenerative diseases has increased greatly. This situation constitutes a major problem both for society and for integrated health systems worldwide, because there is not enough infrastructure to attend to the growing number of people with this type of condition at scale. For this reason, a line of research has emerged that combines sensing with the processing of HAR (Human Activity Recognition) data in order to support the care of these individuals.

In general, this type of experimentation uses a model as a representation of the reality being studied. In most analyses it is not necessary to consider every detail of reality; the model is not only a substitute for reality but also a simplification of it. Depending on the tools used, models can be classified as iconic, analogous, symbolic, deterministic, stochastic, static, continuous, or discrete [ 1 ]. From the foundations of probability theory, Cramér [ 2 ] specifies that a mathematical model provides a description of a certain class of observed phenomena.

Artificial intelligence (AI) is defined as a field of science and engineering concerned with the computational understanding of what is commonly called intelligent behavior and with the creation of artifacts that exhibit such behavior [ 3 ]. These processes require training models from data sources that are, in some cases, heterogeneous. Training from a learning perspective involves the acquisition and pre-processing of information and the application or construction of rules for its treatment and use in order to generate reasoning; that is, the rules are used to reach approximate or definitive conclusions and to self-correct [ 4 ].

Artificial intelligence has become more popular today thanks to big data, advanced algorithms, and computers with improved power and storage. Systems based on artificial intelligence are becoming an integral element of digital systems and, more specifically, are having a profound impact on human decision-making. As a result, there is a growing demand for information systems researchers to investigate and understand these implications for decision making and to contribute to the theoretical advancement and practical success of the applications of this area of knowledge [ 5 ].

In 1959, Arthur Samuel coined the term Machine Learning and defined it as "the field of study that gives computers the ability to learn without being explicitly programmed". Machine learning is part of the field of artificial intelligence, and its objective is usually to recognize patterns in data and fit statistical models to them [ 6 ]. Along with artificial intelligence, machine learning has emerged as the method of choice for the development of practical software for image and speech recognition, natural language processing, robot control, and other applications such as Human Activity Recognition (HAR). Many developers of artificial intelligence systems recognize that, for many applications, it may be easier to train a system by feeding it examples of the desired input and output behavior than to manually program in advance the desired response for all possible inputs [ 7 ].

In recent decades, machine learning has played an important role in the construction of models based on experience with processed data [ 8 ], enabling computers to build models from data, for example to automate decision-making processes based on the input data [ 6 ]; for their part, the authors of [ 8 ] state that the field explores the study and construction of algorithms that can learn from and make predictions on data. This systematic review of the literature locates the advances made in Human Activity Recognition for each of the machine learning methods, together with their evolution and results.

The recognition of human activities has become one of the most widely used areas of knowledge, enabling many advances in the care of patients at home and in the improvement of the quality of life of the elderly. For this reason, different approaches have been used to process the data from the available datasets, among which are machine learning and metaheuristics. The HAR problem is characterized by the complexity associated with the different data inputs, which can come from wearable sensors, object sensors, images, and audio, among others. Many models have been developed to try to improve performance and quality metrics across different experiments. Motivated by this research field, the main contributions of this paper are to (a) show researchers the datasets most used in the literature for experimentation and detail the algorithms most used in their analysis; (b) identify, for each of the data processing approaches, the results of the experiments with these algorithms and the quality metrics associated with those applications; and (c) suggest, based on the analysis of the literature, the techniques to use to obtain good results and show future researchers the results of current applications so that they can improve their own experiments.

For this purpose, a compilation of 570 articles was analyzed, extracted from specialized databases such as IEEE Xplore, Scopus, ScienceDirect, and Web of Science. These manuscripts were analyzed through a meta-analytic matrix that allowed relevant information to be extracted, such as the year of publication, the database used, the techniques implemented, and the results of the quality metrics reported.

This article is a review of the literature on the use of the different machine learning methods: supervised, unsupervised, ensemble, deep, and reinforcement learning. The rest of the paper is organized as follows. First, a concept map of the HAR approaches is shown ( Section 2 ). Second, conceptual information is presented ( Section 3 ). Third, the methodology for analyzing the information sources is detailed ( Section 4 ). This is followed by the scientometric analysis ( Section 5 ), the technical analysis ( Section 6 ), and the conclusions ( Section 7 ). Finally, future work is outlined ( Section 8 ).

2. HAR Approach Concept Maps

In the last decades, machine learning has evolved into different methods and techniques to address challenges in different areas of knowledge. Figure 1 shows the breakdown of these data mining methods, starting from classic supervised and unsupervised machine learning. Among the most notable supervised analysis algorithms ( Section 3.1 ) are Decision Trees [ 9 ], Support Vector Machines [ 10 ], the Naive Bayesian Classifier [ 11 ], Artificial Neural Networks [ 12 ], Decision Tables [ 13 ], and Logistic Models [ 14 ], among others. Regarding Unsupervised Learning ( Section 3.2 ), several methods can be found, among which Clustering [ 15 ], Association Rules [ 16 ], and Dimensionality Reduction [ 17 ] stand out. As for Ensemble Learning ( Section 3.3 ), techniques such as Stacking [ 18 ], Bagging [ 19 ], and Boosting [ 20 ] can be highlighted. Later, emphasis was placed on methods based on Deep Neural Networks ( Section 3.4 ) [ 21 ], which have several levels of analysis for knowledge discovery. Machine learning has since evolved toward analysis based on Reinforcement Learning ( Section 3.5 ) [ 22 ], in which a system of rewards and punishments shapes the learning process. Metaheuristic techniques ( Section 3.6 ) [ 23 ] are strategies for designing heuristic procedures; the types of metaheuristics are therefore established, in the first place, according to the type of procedure to which they refer. Among other types of algorithms, the following can be identified: Threshold Accepting [ 24 ], Memetic algorithms [ 25 ], Multistart algorithms [ 26 ], CRO (coral-reef-based algorithms) [ 27 ], Swarm algorithms [ 28 ], Genetic algorithms [ 29 ], Scatter Search [ 30 ], Variable Neighborhood Search [ 31 ], and Ant Colony [ 32 ]. Finally, Section 3.7 presents the Transfer Learning approach [ 33 ] to Human Activity Recognition, which uses different combinations of neural networks for the analysis.

Figure 1. HAR Approach Concept Maps.

3. Conceptual Information

3.1. Supervised Learning

Supervised learning is a technique for deducing a function from training data. Training data consist of pairs of objects (usually vectors): one component of the pair is the input data and the other is the desired result. The output of the function can be a numeric value (as in regression problems) or a class label (as in classification problems). The goal of supervised learning is to create a function capable of predicting the value corresponding to any valid input object after viewing a series of examples, the training data. To do this, the learner must generalize from the presented data to previously unseen situations. Among the techniques most used in machine learning, the following can be highlighted.

3.1.1. Decision Tree

According to Timaran [ 34 ], the quality of a decision tree depends on its size and the precision of the classification. A subset of the dataset (the training set) is chosen and a decision tree is created [ 9 ]. If it does not return the correct answer for the objects in the test set, a selection of exceptions is added to the training set, and the process continues until the correct decision set is found. The most used classification algorithms in the decision tree category are ID3, C4.5, CART, SPRINT, and J48 [ 34 ].
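To make the idea concrete, the short Python sketch below (using scikit-learn; the feature values, window labels, and parameter choices such as max_depth are purely illustrative and do not come from any study reviewed here) trains a CART-style decision tree on synthetic HAR-like features and reports its accuracy on a held-out split.

# Minimal decision-tree sketch with synthetic HAR-like features (hypothetical values).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic features per window (e.g., mean and std of acceleration magnitude)
X = rng.normal(size=(200, 2))
y = (X[:, 1] > 0.5).astype(int)          # 1 = "walking", 0 = "resting" (toy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)   # CART-style tree
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))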

3.1.2. Support Vectorial Machine (SVM)

In SVMs [ 10 ], the input quantities are mapped non-linearly to a very high-dimensional feature space, in which a linear decision surface is constructed [ 35 ]. According to Hang [ 36 ], an SVM uses a non-linear mapping to transform the original training data into a higher dimension. Within this new dimension, it looks for the optimal linear separating hyperplane (that is, a "decision boundary" that separates the tuples of one class from another). SVMs can be used for numerical prediction as well as for classification. They have been applied to several areas, including handwritten digit recognition, object recognition, and speaker identification, as well as to benchmark time-series prediction tests.
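The following minimal sketch illustrates the non-linear mapping idea with scikit-learn's RBF-kernel SVM; the synthetic data, the scaling step, and the hyperparameters C and gamma are illustrative assumptions rather than values taken from the literature reviewed here.

# Minimal SVM sketch: non-linear (RBF) kernel with feature scaling.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                       # synthetic feature vectors
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # non-linearly separable toy labels

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)
print("training accuracy:", model.score(X, y))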

3.1.3. Naïve Bayesian Classifier

The Naïve Bayesian classifier is a special type of machine learning algorithm that addresses the task of classification [ 11 ]. Its foundation is Bayes' theorem. The algorithm assumes that the variables used for prediction are independent of each other; in other words, the presence of a given characteristic in a data instance is not related to the presence or absence of any other characteristic.

3.1.4. Artificial Neural Networks-ANN

The fundamental processing elements of an ANN are artificial neurons (or nodes) that are interconnected by weighted links forming layers [ 12 ]. Normally an ANN has an input layer, an output layer, and several hidden layers that vary depending on the complexity of the problem in question. Neurons transform weighted inputs into outputs using an activation function that can take different linear and non-linear forms. The process by which the weights are adjusted is called learning. Several non-linear ANNs are known to act as function approximators. Several parameters define the architecture of a neural network: the type of connection, the learning rule, and the activation functions. Depending on these configuration parameters, there are different types of ANN, for example, the Multilayer Perceptron (MLP) [ 37 ], Echo State Networks (ESN), Radial Basis Function networks (RBFN), and the Boltzmann machine.

3.1.5. Decision Tables

Decision tables, also called decision rules, achieve a synthetic representation of knowledge [ 13 ]. There are at least four sources of inconsistencies in decision tables: (1) hesitation in evaluating decision attribute values; (2) errors in recording, measurement, and observation; (3) missing condition attributes related to the evaluation of the decision attribute values; and (4) the unstable nature of the system represented by the decision table, and the like. These inconsistencies cannot be considered simple errors or noise. To acquire rules from inconsistent decision tables, relative attribute reductions are needed. Skowron and Rauszer introduced the discernibility matrix method, which became a popular approach for listing all reductions in rough set theory [ 26 ].

3.1.6. Tree-Based on the Logistic Model-LMT

This classification process mixes decision trees with logistic regression [ 14 ]. The classification process can be improved if feature selection techniques are used; these allow the prioritization or relevance of the attributes to be assigned using the class criterion, thus obtaining a structure of attributes that directly affect the model and that are, in turn, increasingly relevant to the classification.

3.2. Unsupervised Learning

Unsupervised learning is a machine learning method in which a model is fit to observations. It is distinguished from supervised learning by the fact that there is no a priori knowledge. In unsupervised learning, a data set of input objects is processed; thus, unsupervised learning typically treats input objects as a set of random variables, with a density model being constructed for the data set. There are different unsupervised learning methods, among which we can highlight Clustering, Association Rules, and Dimensionality Reduction.

3.2.1. Clustering Methods

In recent decades, many clustering algorithms have been proposed and developed [ 38 , 39 ], from hierarchical approaches (single link, complete link, etc.) to partitional ones (k-means, Gaussian mixtures, density estimation and mode seeking, etc.), among other methods. As data sets get larger and more varied, many of the dimensions are often irrelevant, and these irrelevant dimensions can confuse traditional clustering algorithms.

Clustering is used for many purposes because it simplifies massive data by extracting essential information, which makes subsequent analyses or processes feasible or more efficient. For example, in information systems, clustering is applied to text documents or images to speed up indexing and retrieval [ 40 , 41 ]. Clustering can also be a stand-alone process; it has been used as a technique for prototype-based supervised learning algorithms, and different applications have also been made on non-vector data. The application of clustering algorithms to the analysis of unlabeled data has become a useful tool to explore and solve different application problems in data mining. Clustering methods [ 39 , 42 ] have been used to solve problems emanating from different contexts and disciplines, see Table 1 .

Table 1. Clustering methods and applications.

Method | Strategy | Applications
Hierarchical | Agglomerative | Nearest Neighbor; Farthest Neighbor; Average Linkage Pool; Minimum Variance; Median Method
Hierarchical | Divisive | —
Non-hierarchical | Reassignment (centroids) | K-means; QuickCluster; Forgy methods
Non-hierarchical | Reassignment (medoids) | k-medoids; CLARA
Non-hierarchical | Reassignment (density) | Dynamic clouds
Non-hierarchical | Typological approximation | Modal Analysis; Taxmap method; Fortin method
Non-hierarchical | Probabilistic approximation | Wolf methods
Non-hierarchical | Direct | Block Clustering
Non-hierarchical | Reductive | Type Q Factor Analysis
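As a minimal illustration of the partitional (reassignment) strategy in Table 1, the sketch below groups unlabeled synthetic feature vectors with k-means using scikit-learn; the two-cluster structure and all numeric values are hypothetical.

# Minimal k-means sketch: partitional clustering of unlabeled feature vectors.
from sklearn.cluster import KMeans
import numpy as np

rng = np.random.default_rng(2)
# Two synthetic groups of activity windows (e.g., low vs. high movement intensity)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(2.0, 0.3, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", km.cluster_centers_)
print("first ten assignments:", km.labels_[:10])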

3.2.2. Association Rules Methods

Association rules base their analysis on algorithmic "if-then" statements that capture probabilistic relationships among the multiple elements of the data found in large databases of different formats and types. Throughout their evolution, data mining techniques based on association rules have had multiple applications, among which sales analysis and the analysis of medical data sets can be highlighted.

Based on these algorithmic "if-then" statements and on established criteria such as support and confidence, association rules can identify the most important patterns. The support criterion tells us how frequently the elements of a rule appear together in the data set. The confidence criterion determines how often the "then" part of the statement turns out to be true when the "if" part holds. There is also another common metric, called lift, which is fundamentally based on comparing the expected confidence with the confidence actually observed in the data. The literature review allows the progress of association rule algorithms to be identified, as detailed in Table 2 .

Table 2. Evolution of association rule algorithms.

Based on | Algorithms
Frequent itemset mining | Apriori; Apriori-TID; ECLAT TID-list; FP-Growth
Big data algorithms | R-Apriori; YAFIM; ParEclat; Par-FP (Parallel FP-Growth with Sampling); HPA (Hash Partitioned Apriori)
Distributed algorithms | PEAR (Parallel Efficient Association Rules)
Distributed algorithms for fuzzy association rule mining | Count Distribution algorithm
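The support and confidence criteria described above can be computed directly; the following pure-Python sketch does so for one hypothetical rule over a toy set of sensor-event transactions (all item names are invented for illustration).

# Minimal sketch: support and confidence of the rule {kettle_on} -> {cup_used}.
transactions = [
    {"kettle_on", "cup_used", "fridge_open"},
    {"kettle_on", "cup_used"},
    {"fridge_open", "cup_used"},
    {"kettle_on"},
]

antecedent, consequent = {"kettle_on"}, {"cup_used"}
n = len(transactions)
n_ant = sum(antecedent <= t for t in transactions)              # transactions containing the antecedent
n_both = sum((antecedent | consequent) <= t for t in transactions)

support = n_both / n                   # frequency of the full rule in the data
confidence = n_both / n_ant            # how often "then" holds when "if" holds
print(f"support={support:.2f}, confidence={confidence:.2f}")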

3.2.3. Dimensionality Reduction Methods

Dimensionality reduction methods are statistical techniques that map the data set to lower-dimensional subspaces derived from the original space, allowing a description of the data at a lower cost. These techniques are important because many algorithms from fields such as numerical analysis, machine learning, and data mining tend to degrade in performance when used with high-dimensional data; in extreme cases, the algorithm is no longer useful for the purpose for which it was designed. The curse of dimensionality refers to the various phenomena that arise when analyzing and organizing data from multi-dimensional spaces. Among the most important algorithms, we can highlight the following (a short combined sketch of these filters is given after the descriptions).

Missing Values Ratio [ 73 ]: when examining the data, if a variable contains missing values, we can fill them in when they are few or remove the variable directly; when the proportion of missing values in a variable is too high, the variable is usually removed because it contains too little information. How to remove it depends on the situation: we can set a threshold and, if the proportion of missing values is greater than that threshold, remove the corresponding column. The lower the threshold, the more aggressive the dimensionality reduction.

Low Variance Filter [ 74 ]: if a column takes essentially the same value throughout the dataset, that is, its variance is very low, we generally consider that it contains very little information, so it can be eliminated directly. In practice, the variance of every variable is calculated and those with the smallest variances are eliminated.

High Correlation Filter [ 75 ]: if two variables are highly correlated, they have similar trends and are likely to carry similar information. Moreover, the presence of such variables can reduce the performance of certain models (such as linear and logistic regression models). To solve such problems, we can calculate the correlation between independent variables and, if the correlation coefficient exceeds a certain threshold, eliminate one of the two variables.

Random Forests/Ensemble Trees [ 76 ]: the random forest is a widely used feature selection algorithm; it automatically calculates the importance of each feature, so no separate programming is required, which helps us choose a smaller subset of features. Advantages of the random forest: high precision; the introduction of randomness makes it resistant to overfitting and gives it good noise tolerance (it can better handle outliers); it can handle very high-dimensional data without feature selection; it can handle both discrete and continuous data; the data set does not need to be normalized; training is fast; the importance of variables for classification can be obtained; and it is easy to parallelize. Disadvantages of the random forest: when there are many decision trees in the forest, the space and time required for training become large, and its interpretability is poor.

Principal Component Analysis (PCA) [ 77 ]: PCA is a very common dimensionality reduction method. It reduces the number of predictors by reducing the dimensionality of high-dimensional data while eliminating noise in the process. The most direct application is data compression; it is mainly used for noise reduction in signal processing and for visualization after dimensionality reduction.
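As a minimal combined sketch of the filters above, the code below drops columns whose missing-value ratio exceeds a threshold and then applies PCA to the remaining features; the column names, the 0.5 threshold, and the number of components are hypothetical choices made only for illustration.

# Minimal sketch: missing-values-ratio filter followed by PCA (hypothetical data and thresholds).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(150, 6)), columns=[f"f{i}" for i in range(6)])
df.loc[df.sample(frac=0.8, random_state=0).index, "f5"] = np.nan   # f5 is 80% missing

# 1) Missing Values Ratio: drop columns whose fraction of missing values exceeds the threshold.
threshold = 0.5
kept = df.loc[:, df.isnull().mean() <= threshold]
print("kept columns:", kept.columns.tolist())

# 2) PCA: project the remaining features onto a few principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(kept.fillna(kept.mean()))
print("reduced shape:", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))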

3.3. Ensemble Learning

An ensemble is a set of machine learning models. Each model produces a different prediction, and the predictions from the different models are combined to obtain a single prediction. The advantage gained from combining different models is that, because each model works differently, their errors tend to compensate for each other. This results in a better generalization error.

3.3.1. Voting by the Majority

Multiple machine learning models are trained with the same data [ 78 ]. When we have new data, we obtain a prediction from each model, and each model has a vote associated with it. In this way, the final prediction is the one that the majority of models vote for. There is another way to combine the votes: when the machine learning models output a probability, we can use "soft voting". In soft voting, more importance is given to results in which a model is very confident; that is, when the prediction is very close to probability 0 or 1, more weight is given to that model's prediction.
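A minimal soft-voting sketch with scikit-learn follows; the three base models and the synthetic data are illustrative assumptions, not a recommended configuration for HAR.

# Minimal soft-voting sketch: average the predicted probabilities of three different models.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(max_depth=3))],
    voting="soft",                      # weight each vote by the model's predicted probability
)
ensemble.fit(X, y)
print("training accuracy:", ensemble.score(X, y))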

3.3.2. Bagging

Unlike majority voting, here the way to get the errors to compensate for each other is to train each model with subsets of the training set [ 79 ]. These subsets are formed by randomly choosing samples (with replacement) from the training set. For classification problems, the results are combined as in majority voting, with soft voting for models that output probabilities. For regression problems, the arithmetic mean is normally used.

3.3.3. Boosting

In boosting, each model tries to fix the errors of the previous models [ 80 ]. For example, in the case of classification, the first model will try to learn the relationship between the input attributes and the result. It will surely make some mistakes, so the second model will try to reduce these errors. This is achieved by giving more weight to poorly classified samples and less weight to well-classified samples. For regression problems, predictions with a higher mean squared error are given more weight by the next model.
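As one concrete instance of boosting, the sketch below uses scikit-learn's AdaBoost, which re-weights misclassified samples between rounds; the data and hyperparameters are hypothetical.

# Minimal boosting sketch: AdaBoost re-weights samples that earlier learners misclassified.
from sklearn.ensemble import AdaBoostClassifier
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)        # toy, non-linear labels

boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
boost.fit(X, y)
print("training accuracy:", boost.score(X, y))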

3.3.4. Stacking

When we talk about a stacking ensemble, we mean that we are stacking models [ 81 ]. When we stack models, the outputs of multiple base models are used as the inputs of another model (a meta-model), which produces the final prediction.
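The following minimal sketch stacks two base classifiers under a logistic-regression meta-model using scikit-learn's StackingClassifier; the choice of base models and the synthetic data are assumptions made only for illustration.

# Minimal stacking sketch: base models feed their predictions to a final meta-model.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(250, 5))
y = (X.sum(axis=1) > 0).astype(int)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),       # meta-model trained on the base models' outputs
)
stack.fit(X, y)
print("training accuracy:", stack.score(X, y))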

3.4. Deep Learning

Deep Learning is a type of machine learning that is structured and inspired by the human brain and its neural networks [ 82 ]. Deep learning processes data to detect objects, recognize conversations, translate languages, and make decisions. Being a type of machine learning, this technology helps artificial intelligence learn continuously. Deep learning is based on the use of artificial neural networks. Within neural networks 3 types are the most used.

3.4.1. Convolutional Neural Networks (CNN)

Convolutional neural networks are artificial neural networks that have been designed to process structured matrices, such as images [ 83 ]. That is, they are responsible for classifying images based on the patterns and objects that appear in them, for example, lines, circles, or even eyes and faces.
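For intuition, the sketch below defines a small one-dimensional convolutional network over fixed-length accelerometer windows using TensorFlow/Keras (one possible framework, not prescribed by this review); the window length of 128 samples, the three axes, the six activity classes, the layer sizes, and the random data are all hypothetical choices standing in for a real HAR dataset.

# Minimal 1D-CNN sketch (TensorFlow/Keras): classify fixed-length accelerometer windows.
# The window length (128 samples), 3 axes, and 6 activity classes are hypothetical choices.
import numpy as np
import tensorflow as tf

X = np.random.randn(64, 128, 3).astype("float32")      # 64 windows x 128 samples x 3 axes
y = np.random.randint(0, 6, size=64)                    # 6 activity classes (toy labels)

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu", input_shape=(128, 3)),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
print(model.predict(X[:1]).round(2))                    # class probabilities for one window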

3.4.2. Recurrent Neural Networks (RNN)

Recurrent neural networks are neural networks that use sequential data or time-series data [ 84 ]. These types of networks solve ordinal or temporal problems, such as language translation, speech recognition, Natural Language Processing (NLP), and image captioning; they are therefore found in technologies such as Siri or Google Translate. In the case of natural language processing, a person's speech is recognized: for example, it can be distinguished whether the person speaking is a man or a woman, an adult or a minor, or whether they have an Andalusian or Catalan accent. In this way, the person's way of speaking is analyzed and their idiolect is characterized.

3.4.3. Generative Adversarial Networks (GAN)

Generative adversarial networks use two artificial neural networks and set them against each other (which is why they are called adversarial) to generate new content or synthetic data that can pass as real [ 85 ]. One of the networks generates and the other works as a "discriminator". The discriminator network (also known as the adversarial network) is trained to recognize real content and acts as a critic for the generator network, pushing it to produce content that appears real.

3.5. Reinforcement Learning

The field of machine learning is the branch of artificial intelligence that encompasses techniques that allow machines to learn through their environment. This environment can be considered the set of data that the algorithm has or obtains during the training stage. Reinforcement learning is the most common form of learning in nature: an individual interacts with the environment and obtains information from cause-effect relationships, from the results of the actions carried out, and from the strategy to follow to complete an objective [ 86 ].

The temporal difference method was introduced by Sutton [ 87 ] as a model-free method based on a bootstrapping update rule; it estimates the values of immediate and future rewards in a way similar to dynamic programming and is denoted TD(λ). Temporal difference methods attempt to estimate the value function of a given state under a policy and, contrary to Monte Carlo methods, do not need to wait until the end of an episode to make such an estimate. Some prominent algorithms follow.

3.5.1. SARSA

One of the algorithms derived from the temporal-difference method is SARSA [88], an on-policy method: it starts with an initial policy and updates it at the end of each episode.

3.5.2. Q-Learning

Q-learning is a value-based learning algorithm that focuses on optimizing the value function for a given environment or problem [89, 90]. The Q in Q-learning represents the quality of the next action the model can take. The process is automatic and simple, which makes this technique a common starting point for reinforcement learning. The model stores all the learned values in a table, the Q-table, and in simple terms these values are then used to choose the best action.
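
The toy sketch below illustrates the tabular Q-learning update rule on a made-up one-dimensional corridor; the environment, reward, and hyperparameters are assumptions for illustration only, and the comment notes how SARSA would differ.

```python
# Minimal tabular Q-learning sketch on a toy 1-D corridor (illustrative only).
import numpy as np

n_states, n_actions = 5, 2              # states 0..4; actions: 0 = left, 1 = right
goal = n_states - 1
Q = np.zeros((n_states, n_actions))     # the Q-table: one value per state-action pair
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def greedy(s):
    """Greedy action with random tie-breaking."""
    best = np.flatnonzero(Q[s] == Q[s].max())
    return int(rng.choice(best))

for episode in range(300):
    s = 0
    while s != goal:
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else greedy(s)
        s_next = max(0, s - 1) if a == 0 else min(goal, s + 1)
        r = 1.0 if s_next == goal else 0.0
        # Q-learning (off-policy) update; SARSA would instead use the value of the
        # next action actually taken rather than the max over actions.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)  # "move right" should end up with the higher value in every state
```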

3.5.3. Deep Reinforcement Learning

In deep reinforcement learning [91], reinforcement learning is integrated with neural networks. The company DeepMind began to use this type of learning to create agents that learned to play Atari games from scratch without having any information about them, not even the rules of the video game.

3.6. Metaheuristic Learning

Metaheuristics are high-level strategies for designing or improving very general heuristic procedures with high performance. The term metaheuristic first appeared in Fred Glover's seminal article on tabu search in 1986 [92]. Since then, many guidelines have been proposed for designing good problem-solving procedures which, as their fields of application expanded, have adopted the denomination of metaheuristics.

Some of the main types are the following. Relaxation metaheuristics [93] refer to problem-solving procedures that use relaxations of the original model (that is, modifications of the model that make the problem easier to solve), whose solution facilitates the solution of the original problem. Constructive metaheuristics [94] are oriented toward procedures that try to obtain a solution through the analysis and gradual selection of the components that form it. Search metaheuristics [95] guide procedures that use transformations or moves to traverse the space of alternative solutions and exploit the associated neighborhood structures. Evolutionary metaheuristics [96] focus on procedures based on sets of solutions that evolve over the solution space.
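
As an illustration of an evolutionary metaheuristic (not taken from the reviewed literature), the following sketch runs a plain genetic algorithm on a toy bit-string maximization problem; the population size, mutation rate, and objective are arbitrary assumptions.

```python
# Minimal genetic-algorithm sketch on a toy bit-string problem (illustrative only).
import random

GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 30, 40, 50, 0.02
fitness = sum                        # toy objective: number of 1s in the bit string
random.seed(0)

def tournament(pop):
    """Pick the fitter of two randomly chosen individuals."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    offspring = []
    while len(offspring) < POP_SIZE:
        p1, p2 = tournament(population), tournament(population)
        cut = random.randrange(1, GENOME_LEN)                     # one-point crossover
        child = p1[:cut] + p2[cut:]
        child = [1 - g if random.random() < MUTATION_RATE else g for g in child]
        offspring.append(child)
    population = offspring

best = max(population, key=fitness)
print("Best fitness:", fitness(best), "out of", GENOME_LEN)
```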

3.7. Transfer Learning

Deep learning primarily emphasizes features, reinforcement learning primarily emphasizes feedback, and transfer learning primarily emphasizes adaptation. Traditional machine learning reaps only what it sows, learning each task from its own data, whereas transfer learning can carry what was learned in one task over to another.

Successful models and algorithms in artificial intelligence have so far been driven mainly by supervised learning, which consumes a great deal of data and requires big-data support to meet the precise requirements of applications. The development of artificial intelligence now tends toward satisfying those requirements without massive data, so "small data learning" is becoming a new point of interest. Small-data learning techniques, represented by transfer learning and reinforcement learning, better reflect this direction of artificial intelligence.

Since the transfer learning (TL) concept was proposed by Stevo Bozinovski and Ante Fulgosi in 1976 [97], it has received a great deal of attention from the academic community. The definition of transfer learning is broad, and a variety of specialized terms have appeared in related research, such as learning to learn, lifelong learning, multitask learning, meta-learning, inductive transfer, knowledge transfer, and context-sensitive learning. Among them, transfer learning is most closely related to multitask learning, which learns multiple different tasks at the same time and discovers implicit common features to aid each single task.
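
A minimal transfer-learning sketch, assuming Keras/TensorFlow and an ImageNet-pretrained MobileNetV2 (the backbone, input size, and six target classes are illustrative assumptions, not a method from the reviewed works), freezes the transferred layers and trains only a new classification head:

```python
# Minimal transfer-learning sketch (Keras/TensorFlow; illustrative only).
import tensorflow as tf

# Reuse a network pretrained on ImageNet as a fixed feature extractor.
base = tf.keras.applications.MobileNetV2(input_shape=(96, 96, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False                                 # keep the transferred parameters fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(6, activation="softmax"),    # new head for six target classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_task_images, new_task_labels, epochs=5)  # hypothetical small target dataset
```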

3.8. Human Activity Recognition

Recognizing human activities consists of interpreting human gestures or movements through sensors to determine the action or activity performed [98]. For example, a HAR system can report activities performed by patients outside of hospital facilities, which makes it a useful tool for evaluating health interventions and therapy progress and for clinical decision-making [99]. HAR can be supervised or unsupervised: a supervised HAR system requires prior training with a labeled dataset, whereas an unsupervised system does not require training but relies on a set of rules configured during development. In this particular work, we focus on a supervised HAR system that recognizes the following six human activities: walking (WK), climbing stairs (WU), descending stairs (WD), standing (ST), lying down (LD), and sitting (SD). We refer to WK, WU, and WD as dynamic activities, since they involve voluntary movement that causes displacement and is reflected in the inertial sensors, and to ST, LD, and SD as static activities, given that they do not involve voluntary movements of the subject and there is no displacement of the person.
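
To make the supervised pipeline concrete, the following minimal sketch classifies synthetic accelerometer windows into the six activities above using simple statistical features and a Random Forest; the use of scikit-learn, the window length, the features, and the classifier are illustrative assumptions rather than the setup of any specific study.

```python
# Minimal supervised HAR sketch: windowed tri-axial accelerometer data, statistical
# features, and a Random Forest classifier (synthetic data; illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

ACTIVITIES = ["WK", "WU", "WD", "ST", "LD", "SD"]
WIN = 128                                   # assumed window length in samples

# Synthetic stand-in for labelled accelerometer windows: (n_windows, WIN, 3 axes).
rng = np.random.default_rng(0)
windows = rng.normal(size=(600, WIN, 3))
labels = rng.integers(0, len(ACTIVITIES), size=600)

def extract_features(w):
    """Per-axis mean and standard deviation plus magnitude statistics."""
    mag = np.linalg.norm(w, axis=1)
    return np.concatenate([w.mean(axis=0), w.std(axis=0), [mag.mean(), mag.std()]])

X = np.array([extract_features(w) for w in windows])
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            labels=range(len(ACTIVITIES)),
                            target_names=ACTIVITIES, zero_division=0))
```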

In HAR systems it is common to use signals and images that come from sensors located in a specific physical space, such as a room, or carried by people, like those found in smartphones or smartwatches. Smartphones are mobile phones that can perform tasks similar to those of a computer, such as storing and processing data and browsing the Internet [100]. In addition, compared with personal computers, smartphones are widely accepted due to their small size, low weight, personal nature and, especially, their connectivity, which allows access to information sites and social networks at any time and place [101]. Other features usually present are integrated cameras, contact management, multimedia software capable of playing music and displaying photos and videos, navigation programs and, in addition, the ability to view business documents in different formats such as PDF and Microsoft Office [101].

Smartphones currently incorporate different sensors, such as positioning sensors, proximity sensors, temperature sensors, accelerometers, gyroscopes, magnetometers, and microphones, as shown in Figure 2. Exploiting these sensors for activity recognition is a challenge addressed by different scientific communities, particularly in the fields of computer vision, signal processing, and machine learning. The sensors are usually operated by a microcontroller or microprocessor, which performs the function of a small computer.

Figure 2. Sensors of Human Activity Recognition.

Inertial sensors are based on the principle of inertia, the tendency of a body to conserve its velocity (in the absence of an external influence, a body remains in uniform rectilinear motion). There are different types of sensors that can provide signals for HAR systems; two of the most used are the accelerometer and the gyroscope. The accelerometer measures acceleration (in meters per second squared, m/s2) from the variations of a capacitance inside the sensor. This capacitance belongs to a microelectromechanical system (MEMS) consisting of a suspended silicon structure anchored at a fixed point and free to move along the axis being measured. When acceleration occurs, the structure moves and the capacitance departs from equilibrium; this change is measured to provide the acceleration along that axis.
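
A common first preprocessing step with such tri-axial measurements is to compute the orientation-independent acceleration magnitude; the minimal sketch below (NumPy, with made-up values) illustrates it.

```python
# Minimal sketch: acceleration magnitude from tri-axial accelerometer samples (illustrative only).
import numpy as np

acc = np.array([[0.1, -0.2, 9.7],    # x, y, z in m/s^2 (made-up values)
                [0.3,  0.1, 9.9],
                [1.2, -0.5, 9.2]])

magnitude = np.linalg.norm(acc, axis=1)   # sqrt(x^2 + y^2 + z^2) for each sample
dynamic = magnitude - 9.81                # roughly remove the constant gravity component
print(magnitude, dynamic)
```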

According to the type of sensors and the occupation of the indoor environments, a series of datasets have been built that have served to carry out different experiments based on machine learning techniques. The most prominent datasets are: UCI HAR [102], KU-HAR [103], Precis HAR [104], Fall-Up Dataset [105], VanKasteren [106], CASAS Multiresident [107], WISDM [108], DOMUS [109], Opportunity [110], CASAS Aruba [111], USC-HAD [112], MIT PlaceLab [113], Smart Environment—Ulster University [114], CASAS–Daily Life Kyoto [115], PAMAP2 [116], mHealth [117], DSADS [118], and UJAmI SmartLab [119].

4. Methodology for Analyzing the Information

The methodology for the analysis of the publications follows Kitchenham [120]. It consists of identifying the main research problem and then breaking down each of its components, analyzing the different inclusion and exclusion criteria, to determine a suitable search string to be used in the scientific databases. Specifically for our case study, in addition to scientometric variables, we identified variables related to the type of dataset used, the techniques or algorithms implemented, and the quality of the results as measured by quality metrics. Kitchenham [120] defines different stages of the literature review process, among which the following can be highlighted: (a) identification of the search parameters (search objectives and identified hypotheses); (b) definition of the search engines (selection of the specialized databases where the study is to be developed); (c) response to the hypotheses raised during the literature inquiry process.

Following these previously defined phases, the first step is to identify the central question of the inquiry process. For this literature review it is: "What are the different techniques based on machine learning that support the analysis of human activity recognition datasets?". To carry out the literature review, the IEEE, Scopus, Science Direct, and WOS databases were used. To delimit the documentary findings, the following search string was used: (HAR OR ADL OR AAL) AND dataset AND ("indoor environment" OR "smart homes" OR "intelligent buildings" OR "ambient intelligence" OR "assisted living"). Figure 3 shows the basic concept scheme for filtering the reviewed documents. The references were then analyzed according to the machine learning technique implemented, as described in Section 6.

Figure 3. Relationship between concepts for the literature review.

It is important to specify that the terms shown in Figure 3, and their ordering, define the domain of knowledge; the string was previously tested in the search engines of the scientific databases to eliminate the noise that can be generated at search time and to exclude papers not related to the study area. Following the methodology explained above, different analytical factors of high importance for those interested in this area of knowledge were recorded in the meta-analytic matrix, such as the year of publication of the work (within a window of no more than 5 years), the journal, conference, or book where the publication appeared, the quartile in the case of journal publications, and the country of origin of the first author as well as their university or research center. Other technical variables were likewise taken into account for this research, such as the name of the dataset, the type of data collection, the type of activities carried out, the number of individuals defining the occupancy, the data mining techniques used, the hybridization of techniques, and the results of the quality metrics.

5. Scientometric Analysis

In the results obtained from the 570 articles processed, different relevant variables were taken into account, in detail: (1) the year of publication of the article (see Figure 4); (2) the database where the publication can be found; (3) the type of publication (journal, conference, or book); (4) the quartile of the journal, in the case of journal publications; (5) the country of origin of the journal; (6) the country of origin of the first author of the article; (7) the university of the first author; (8) the dataset used for the experiments; (9) the techniques used for the discovery of information; and (10) the results of the metrics of each technique.

Figure 4. Years of publication of the articles.

It can be identified that 2018 was the year in which the most publications were generated in HAR's line of work. Likewise, when the publications are broken down by database, most of the works have been published in the Science Direct database, followed by Scopus; some publications are visible in several databases, as shown in Figure 5.

Figure 5. Publications by database.

Of the total articles analyzed, 64% refer to conference publications, 4% are books, and 36% refer to journals; see Figure 6a,b.

Figure 6. (a) Publications division according to typology; (b) distribution of publications by quartiles.

6. Technical Analysis

6.1. Supervised Learning Applied to Human Activity Recognition Dataset

Regarding the application of machine learning techniques to human activity recognition datasets, various experiments have been developed; the most relevant ones found in the literature are highlighted below (see Table 3). Tasmin [121] carried out implementations on the UCI-HAR dataset using the supervised algorithms Nearest Neighbor, Decision Tree, Random Forest, and Naive Bayes; of these, the Naive Bayes classifier obtained the best activity-detection results, with an accuracy of 76.9%. Igwe [122] concentrated his experiments on the ARAS dataset, recorded in two different locations (House A and House B), and on CASAS Tulum, created by Washington State University (WSU); the author applied supervised techniques such as SVM, ANN, and MSA (Margin Setting Algorithm), demonstrating the effectiveness of the latter in identifying activities with accuracies of 68.85%, 96.24%, and 68% on the respective datasets.

Table 3. Supervised Techniques results.

Dataset | Technique | Accuracy | Precision | Recall | F-Measure | References
UCI Machine LearningNearest Neighbor75.7---[ ]
Decision Tree76.3---
Random Forest75.9---
Naive Bayes76.9---
Aras (House A)MSA (Margin Setting Algorithm)68.85---[ ]
SVM66.90---
ANN67.32---
Aras (House B)MSA (Margin Setting Algorithm)96.24---
SVM94.81---
ANN95.42---
CASAS TulumMSA (Margin Setting Algorithm)68.00---
SVM66.6---
ANN67.37---
MhealthK-NN99.64--99.7[ ]
ANN99.55--99.6
SVM99.89--100
C4.599.32--99.3
CART99.13--99.7
Random Forest99.89--99.89
Rotation Forest99.79--99.79
WISDM, SCUT_NA-ASliding window with variable size, S transform, and regularization based robust subspace (SRRS) for selection and SVM for Classification96.1---[ ]
SCUT NA-ASliding window with fixed samples, SVM like a classifier, cross-validation91.21---
PAMPA2, MhealthSliding windows with fixed 2s, SVM, and Cross-validation84.10---
SBHARSliding windows with fixed 4s, SVM, and Cross-validation93.4---
WISDMMLP based on voting techniques with nb-Tree are used96.35---
UTD-MHADFeature level fusion approach& collaborative representation classifier79.1---
GroupwareMark Hall’s feature selection and Decision Tree 99.4---
Free-livingk-NN and Decision Tree95---
WISDM, SkodaHybrid Localizing learning (k-NN-LSS-VM)81---
UniMiB SHARLSTM and Deep Q-Learning95---
GroupwareSliding windows Gaussian Linear Filter and NB classifier89.5---
GroupwareSliding windows Gaussian Linear Filter and Decision Tree classifier99.99---
CSI-dataSVM96---[ ]
LSTM89---
Built by the authorsIBK95---[ ]
Classifier based ensemble98---
Bayesian network63---
Built by the authorsDecision Tree91.08--89.75[ ]
Random Forest91.25--90.02
Gradient Boosting97.59--97.4
KNN93.76--93.21
Naive Bayes88.57--88.07
SVM92.7--91.53
XGBoost96.93--96.63
UK-DALEFFNN95.28---[ ]
SVM93.84---
LSTM83.07---
UCI Machine LearningKNN90.7491.1590.2890.45[ ]
SVM96.2796.4396.1496.23
HMM+SVM96.5796.7496.4996.56
SVM+KNN96.7196.7596.6996.71
Naive Bayes77.0379.2576.9176.72
Logistic Reg95.9396.1395.8495.92
Decision Tree87.3487.3986.9586.99
Random Forest92.392.492.0392.14
MLP95.2595.4995.1395.25
DNN96.8196.9596.7796.83
LSTM91.0891.3891.2491.13
CNN+LSTM93.0893.1793.1093.07
CNN+BiLSTM95.4296.5895.2695.36
Inception+ResNet95.7696.0695.6395.75
UCI Machine LearningNB-NB73.68--46.9[ ]
NB-KNN85.58--61.08
NB-DT89.93--69.75
NB-SVM79.97--53.69
KNN-NB74.93--45
KNN-KNN79.3--49.82
KNN-DT87.01--60.98
KNN-SVM82.24--53.1
DT-NB84.72--60.05
DT-KNN91.55--73.11
DT-DT92.73--75.97
DT-SVM93.23--77.35
SVM-NB30.40---
SVM-KNN25.23---
SVM-DT92.43--75.31
SVM-SVM43.32---
CASAS TulumBack-Propagation88.75---[ ]
SVM87.42---
DBM90.23---
CASAS TworBack-Propagation76.9---
SVM73.52---
DBM78.49---
WISDMKNN6978-78[ ]
LDA4034-34
QDA6558-58
RF9091-91
DT7777-77
CNN6662-60
DAPHNETKNN9087-88
LDA9183-83
QDA9182-82
RF9191-91
DT9183-83
CNN9087-87
PAPAMKNN6566-66
LDA4545-45
QDA1519-19
RF8083-83
DT6060-60
CNN7376-73
HHAR(Phone)KNN8385-85
LDA4345-45
QDA4050-50
RF8889-89
DT6766-66
CNN8484-84
HHAR(watch)KNN7882-82
LDA5452-52
QDA2627-27
RF8585-85
DT6969-69
CNN8383-83
MhealthKNN7681-81
LDA3859-59
QDA9182-82
RF8585-85
DT7777-77
CNN8080-80
RSSIKNN9191-91
LDA9191-91
QDA9191-91
RF9191-91
DT9191-91
CNN9190-91
CSIKNN9393-93
LDA9393-93
QDA9292-92
RF9393-93
DT9393-93
CNN9292-92
Casas ArubaDT96.393.892.393[ ]
SVM88.288.387.888.1
KNN89.287.885.986.8
AdaBoost989695.995.9
DCNN95.693.995.394.6
SisFallSVM97.7776.17 75.6[ ]
Random Forest96.8279.99 79.95
KNN96.7193.99 68.36
CASAS MilanNaive Bayes76.65 [ ]
HMM+SVM77.44
CRF61.01
LSTM93.42
CASAS CairoNaive Bayes82.79
HMM+SVM82.41
CRF68.07
LSTM83.75
CASAS Kyoto 2Naive Bayes63.98
HMM+SVM65.79
CRF66.20
LSTM69.76
CASAS Kyoto 3Naive Bayes77.5
HMM+SVM81.67
CRF87.33
LSTM88.71
CASAS Kyoto 4Naive Bayes63.27
HMM+SVM60.9
CRF58.41
LSTM85.57

Subasi [123] performed an analysis on the Mhealth dataset, applying techniques such as K-NN, ANN, SVM, C4.5, CART, Random Forest, and Rotation Forest; the best results were obtained with SVM and Random Forest, at 99.89%. Maswadi [124] first prepared the data using sliding-window segmentation with variable window size on different datasets, such as WISDM with SCUT_NA-A, SCUT NA-A alone, PAMAP2 with Mhealth, SBHAR, WISDM, UTD-MHAD, Groupware, Free-living, WISDM with Skoda, UniMiB SHAR, and Groupware, showing the superiority of this technique with accuracies above 80%. Other authors such as Damodaran [125] applied SVM and LSTM to the CSI-Data dataset, with the best results obtained by SVM at 96%.

Other authors, such as Saha [126] and Das [127], define the characteristics and process for the construction of their own datasets, to which a set of techniques is applied; notably, both authors show that support vector machines are efficient at classifying human activities. Franco [128] uses techniques such as FFNN, SVM, and LSTM on the UK-DALE dataset, showing the effectiveness of FFNN with 95.28% accuracy.

Bozkurt [129] and Wang [130] carry out supervised learning implementations on the UCI HAR dataset with various combined supervised techniques; Bozkurt reports that SVM + KNN obtains good classification results with an accuracy of 96.71%, while Wang explains that a combination based on Decision Trees reaches an accuracy of 92.73%. The authors of [131] analyze two datasets of the CASAS set, Tulum and Twor, highlighting the use of back-propagation with accuracies of 88.75% and 76.9%, respectively.

Demrozi [132] performs multiple experiments with many supervised techniques on widely known datasets such as WISDM, DAPHNET, PAMAP, HHAR (phone), HHAR (watch), Mhealth, RSSI, and CSI, implementing algorithms such as KNN, LDA, QDA, RF, DT, and CNN. On WISDM, DAPHNET, PAMAP, HHAR (phone), and HHAR (watch), the RF algorithm obtains the best results, with accuracies of 90%, 91%, 80%, 88%, and 85% and precision and recall of 91%, 91%, 83%, 89%, and 85%, respectively. For Mhealth and RSSI, the QDA algorithm stands out, with 91% and 92% accuracy and 85% and 92% precision and recall, respectively.

Xu [133] applies and compares techniques such as DT, SVM, KNN, AdaBoost, and DCNN on the CASAS Aruba dataset, showing the superiority of ensemble techniques such as AdaBoost, with an accuracy of 98%, precision of 96%, recall of 95.9%, and F-measure of 95.9%. Other authors, such as Hussain [134], apply algorithms like SVM, Random Forest, and KNN to datasets such as SisFall, with SVM giving the best results at 97.77% accuracy. Finally, Liciotti [135] experiments on a set of well-known CASAS project datasets (Milan, Cairo, Kyoto 2, Kyoto 3, and Kyoto 4) with algorithms such as Naive Bayes, HMM + SVM, CRF, and LSTM, showing the superiority of LSTM in the results.

6.2. Unsupervised Learning Applied to Human Activity Recognition Dataset

Among the unsupervised learning applications in the literature, different algorithms can be observed that are evaluated with quality metrics associated with the resulting groupings, such as the ARI, Jaccard index, silhouette index, Euclidean index, and F1 Fisher's discriminant ratio (see Table 4). The following works stand out. Wang [130] uses various versions of the UCI-HAR dataset, implementing algorithms such as K-means, HAC, and FCM, with the best results obtained by FCM. Mohmed [136] applies unsupervised algorithms such as FCM to the Nottingham Trent University dataset. Brena [137] applies a method developed by the authors, called the PM Model, to perform unsupervised analysis of the Chest Sensor, Wrist Sensor, WISDM, and Smartphone datasets, which he evaluates using the silhouette index. He [138] applies another method developed by the authors, the wavelet tensor fuzzy clustering scheme (WTFCS), to the DSAD dataset, obtaining an ARI of 89.66%.

Table 4. Unsupervised Techniques results.

Dataset | Technique | ARI | Jaccard Index | Silhouette Index | Euclidean | F1 Fisher's Discriminant Ratio | References
UCI HAR SmartPhone | K-means | 0.7727 | 0.3246 | 0.4416 | - | - | [ ]
UCI HAR SmartPhone | HAC | 0.4213 | 0.2224 | 0.5675 | - | - |
UCI HAR SmartPhone | FCM | 0.8343 | 0.4052 | 0.4281 | - | - |
UCI HAR Single Chest-Mounted Accelerometer | K-means | 0.8850 | 0.6544 | 0.6935 | - | - |
UCI HAR Single Chest-Mounted Accelerometer | HAC | 0.5996 | 0.2563 | 0.6851 | - | - |
UCI HAR Single Chest-Mounted Accelerometer | FCM | 0.9189 | 0.7230 | 0.7751 | - | - |
Nottingham Trent University | FCM | - | - | - | - | - | [ ]
Chest Sensor Dataset | PM Model | - | - | 25.8% | - | - | [ ]
Wrist Sensor Dataset | PM Model | - | - | 64.3% | - | - |
WISDM Dataset | PM Model | - | - | 54% | - | - |
Smartphone Dataset | PM Model | - | - | 85% | - | - |
DSAD | Wavelet tensor fuzzy clustering scheme (WTFCS) | 0.8966 | - | - | - | - | [ ]
UCI HAR | Spectral Clustering | - | 0.543 | - | 0.583 | - | [ ]
UCI HAR | Single Linkage | - | 0.807 | - | 0.851 | - |
UCI HAR | Ward Linkage | - | 0.770 | - | 0.810 | - |
UCI HAR | Average Linkage | - | 0.790 | - | 0.871 | - |
UCI HAR | K-medoids | - | 0.653 | - | 0.654 | - |
UCI HAR | K-means | - | - | - | - | 52.1 | [ ]
UCI HAR | K-Means 5 | - | - | - | - | 50.7 |
UCI HAR | Spectral Clustering | - | - | - | - | 57.8 |
UCI HAR | Gaussian Mixture | - | - | - | - | 49.8 |
UCI HAR | DBSCAN | - | - | - | - | 16.4 |
CADL | K-means | - | - | - | - | 50.9 |
CADL | K-Means 5 | - | - | - | - | 50.5 |
CADL | Spectral Clustering | - | - | - | - | 61.9 |
CADL | Gaussian Mixture | - | - | - | - | 58.9 |
CADL | DBSCAN | - | - | - | - | 13.9 |

Wang [139] applies clustering-based algorithms such as Spectral Clustering, Single Linkage, Ward Linkage, Average Linkage, and K-medoids to the UCI-HAR dataset, analyzing their Jaccard and Euclidean indices as shown in Table 4. In the same way, Bota [140] experiments on the UCI-HAR and CADL datasets with the K-means, K-Means 5, Spectral Clustering, Gaussian Mixture, and DBSCAN algorithms, analyzing their F1 Fisher's discriminant ratio.
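
To show how such clustering results are typically scored, the following minimal sketch runs K-means on synthetic feature vectors and reports the Adjusted Rand Index and silhouette index used in Table 4; the data, the number of clusters, and the library choice (scikit-learn) are illustrative assumptions only.

```python
# Minimal unsupervised HAR evaluation sketch (scikit-learn, synthetic data; illustrative only).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for per-window feature vectors with six "activity" groups.
X, y_true = make_blobs(n_samples=600, centers=6, n_features=10, random_state=0)

labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
print("ARI:       ", adjusted_rand_score(y_true, labels))
print("Silhouette:", silhouette_score(X, labels))
```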

6.3. Ensemble Learning Applied to Human Activity Recognition Dataset

In approaches based on ensemble learning, multiple techniques are usually applied together, which jointly offer better results (see Table 5). Below is a description of the works found in the literature review that apply these techniques to the recognition of human activities. Yacchirema [141] uses a combination of techniques such as Decision Tree, Ensemble, Logistic Regression, and Deepnet to analyze the SisFall dataset, reporting an accuracy of 99.06% for the Deepnet algorithm. For his part, Manzi [142] uses a mixture of X-means and SVM to analyze the Cornell Activity dataset and the TST dataset, obtaining 98.4% and 92.7%, respectively.

Table 5. Ensembled Learning Techniques results.

Dataset | Technique | Accuracy | Precision | Recall | F-Measure | References
SisFallDecision Tree97.48---[ ]
Ensemble99.51---
Logistic Regression84.87---
Deepnet99.06---
Cornell Activity DatasetX-means-SVM98.495.095.8-[ ]
TST Dataset92.795.691.1-
HHARMulti-task deep clustering 67.265.365.9[ ]
MobiAct 68.369.166.8
MobiSense 72.571.270.7
NTU-RGB + DK-Means85.72---[ ]
GMM87.26---
UCI HARCELearning96.88%---[ ]
UCI HARRF96.9697.097.098[ ]
XGB96.2969696
AdaB50.5615151
GB94.53959595
ANN92.51929392
V. RNN90.53909190
LSTM91.23909190
DT94.23959595
KNN96.59979797
NB80.67848181
Proposed DatasetGB84.184.184.284.1[ ]
RFs83.983.984.183.9
Bagging838383.183
XGB80.480.580.480.4
AdaBoost77.277.377.377.3
DT76.9777777
MLP67.668.767.867.8
LSVM6565.765.164.9
NLSVM6363.363.262.8
LR59.660.259.859.4
KNNs58.960.159.258.9
GNB56.159.455.445.2
House ABernoulli NB78.764--[ ]
Decision Tree8879.4--
Logistic Regression81.469.2--
KNN75.864.9-
House BBernoulli NB95.979.4-
Decision Tree97.286.4-
Logistic Regression96.582.7-
KNN93.179.8-
UCI HARSVM-AdaBoost99.9 99.9[ ]
k-NN-AdaBoost99.43 99.4
ANN-AdaBoost99.33 99.33
NB-AdaBoost97.24 97.2
RF-AdaBoost99.98 100
CART-AdaBoost99.97 100
C4.5-AdaBoost99.95 100
REPTree-AdaBoost99.95 100
LADTree-AdaBoost98.84 98.8
HAR DatasetKNN90.3 [ ]
CART84.9
BAYES77
RF92.7
HAPT DatasetKNN89.2
CART80.2
BAYES74.7
RF91
ET91.7
Proposed Method92.6

Ma [143] uses a model based on multi-task deep clustering on the HHAR, MobiAct, and MobiSense datasets; on the latter it obtains an accuracy of 72.5%, a precision of 71.2%, and a recall of 70.7%. Budisteanu [144] describes the NTU-RGB+D dataset and implements the K-Means and GMM algorithms, obtaining 85.72% and 87.26%, respectively. Xu [145] uses the well-known UCI-HAR dataset, implementing their own CELearning technique and obtaining an accuracy of 96.88%.

Choudhury [146] also analyzes the UCI-HAR dataset with the algorithms RF, XGB, AdaB, GB, ANN, V. RNN, LSTM, DT, KNN, and NB, where the RF algorithm performs best among the ensemble models with 96.96%. Wang [147], for his part, defines his own dataset, to which he applies the algorithms GB, RFs, Bagging, XGB, AdaBoost, DT, MLP, LSVM, NLSVM, LR, KNNs, and GNB, with the RF algorithm obtaining among the best results at an accuracy of 83.9%. Jethanandani [148] works with the popular House A and House B datasets, applying algorithms such as Bernoulli NB, Decision Tree, Logistic Regression, and KNN; this experimentation shows the good results of the decision-tree-based algorithms, with 88% and 97.2%, respectively.

Subasi [149] also uses the UCI-HAR dataset, applying SVM-AdaBoost, k-NN-AdaBoost, ANN-AdaBoost, NB-AdaBoost, RF-AdaBoost, CART-AdaBoost, C4.5-AdaBoost, REPTree-AdaBoost, and LADTree-AdaBoost, obtaining the best results with the REPTree-AdaBoost combination at 99.95% accuracy. Padmaja [150] uses the HAR and HAPT datasets, implementing the KNN, CART, BAYES, RF, and ET algorithms as well as a method proposed by the authors, and demonstrates the superiority of the proposed method's results.

6.4. Deep Learning Applied to Human Activity Recognition Dataset

Implementations based on deep learning have become very useful for the identification of activities of daily living, especially those that include image processing [151, 152] (see Table 6). Some relevant results of the literature review are detailed below. Wan [153] makes use of the UCI-HAR and PAMAP2 datasets, implementing algorithms such as CNN, LSTM, BLSTM, MLP, and SVM, with the CNN implementation showing good results at 92.71% and 91%, respectively. Akula [154] builds their own dataset, to which they apply the algorithms LBP-Naive Bayes, HOG-Naive Bayes, LBP-KNN, HOG-KNN, LBP-SVM, and HOF-SVM, obtaining the best results with HOF-SVM at 85.92% accuracy.

Table 6. Deep Learning Techniques results.

Dataset | Technique | Accuracy | Precision | Recall | F-Measure | References
Uci HarCNN92.7193.2192.8292.93[ ]
LSTM89.0189.1488.9988.99
BLSTM89.489.4189.3689.35
MLP86.8386.8386.5886.61
SVM89.8590.589.8689.85
PAMAP2CNN91.0091.6690.8691.16
LSTM85.8686.5184.6785.34
BLSTM89.5290.1989.0289.4
MLP82.0783.3582.1782.46
SVM84.0784.7184.2383.76
Propio Infrared ImagesLBP-Naive Bayes42.1---[ ]
HOG-Naive Bayes77.01---
LBP-KNN53.261---
HOG-KNN83.541---
LBP-SVM62.34---
HOF-SVM85.92---
Uci HarDeepConvLSTM94.77---[ ]
CNN92.76---
Weakly DatasetDeepConvLSTM92.31---
CNN85.17---
OpportunityHC85.69---[ ]
CBH84.66---
CBS85.39---
AE83.39---
MLP86.65---
CNN87.62---
LSTM86.21---
Hybrid87.67---
ResNet87.67---
ARN90.29---
UniMiB-SAHRHC21.96---
CBH64.36---
CBS67.36---
AE68.39---
MLP74.82---
CNN73.36---
LSTM68.81---
Hybrid72.26---
ResNet75.26---
ARN76.39---
Uci HarKNN90.7491.1590.2890.48[ ]
SVM96.2796.4396.1496.23
HMM+SVM96.5796.7406.4996.56
SVM+KNN96.7196.7596.6996.71
Naive Bayes77.0379.2576.9176.72
Logistic Regression95.9396.1395.8495.92
Decision Tree87.3487.3986.9586.99
Random Forest92.3092.492.0392.14
MLP95.2595.4995.1395.25
DNN96.8196.9596.7796.83
LSTM91.0891.3891.2491.13
CNN+LSTM93.0893.1793.1093.07
CNN+BiLSTM95.4295.5895.2695.36
Inception+ResNet95.7696.0695.6395.75
Utwente DatasetNaive Bayes---94.7[ ]
SVM---91.6
Deep Stacked Autoencoder---97.6
CNN-BiGRu---97.8
PAMAP2DeepCOnvTCN---81.8
InceptionTime---81.1
CNN-BiGRu---85.5
FrailSafe datasetCNN91.84---[ ]
CASAS MilanLSTM76.65---[ ]
Bi-LSTM77.44---
Casc-LSTM61.01---
ENs2-LSTM93.42---
CASAS CairoLSTM82.79---
Bi-LSTM82.41---
Casc-LSTM68.07---
ENs2-LSTM83.75---
CASAS Kyoto 2LSTM63.98---
Bi-LSTM65.79---
Casc-LSTM66.20---
ENs2-LSTM69.76---
CASAS Kyoto 3LSTM77.5---
Bi-LSTM81.67---
Casc-LSTM87.33---
ENs2-LSTM88.71---
ProposalANN89.06---[ ]
SVM94.12---
DBN95.85---

He [155] implements DeepConvLSTM and CNN on the UCI-HAR and Weakly datasets, showing good deep learning results of 94.77% and 92.31%, respectively. Long [156], in turn, uses the Opportunity and UniMiB-SAHR datasets with the algorithms HC, CBH, CBS, AE, MLP, CNN, LSTM, Hybrid, ResNet, and ARN, with ARN obtaining 90.29% and 76.39%. Bozkurt [157], for his part, analyzes only the UCI-HAR dataset with KNN, SVM, HMM + SVM, SVM + KNN, Naive Bayes, Logistic Regression, Decision Tree, Random Forest, MLP, DNN, LSTM, CNN + LSTM, CNN + BiLSTM, and Inception + ResNet, with the DNN algorithm reaching an accuracy of 96.81%.

Mekruksavanich [158] uses the Utwente and PAMAP2 datasets, applying the Naive Bayes, SVM, Deep Stacked Autoencoder, and CNN-BiGRU techniques and obtaining the best results with the last of these. Papagiannaki [159] used the FrailSafe dataset with a CNN implementation reaching an accuracy of 91.84%. Liciotti [139] uses techniques such as LSTM, Bi-LSTM, Casc-LSTM, and ENs2-LSTM on the CASAS group of datasets to show the dynamics of deep-learning-based processes. Hassan [160] applied ANN, SVM, and DBN on a proposed dataset for the development of a robust human activity recognition system based on smartphone sensor data, obtaining accuracies of 89.06% for ANN, 94.12% for SVM, and 95.85% for DBN.
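
As a hedged illustration of the kind of network used in these studies, the following minimal sketch trains a small 1-D CNN followed by an LSTM on randomly generated windows shaped like typical inertial HAR data (Keras/TensorFlow); the architecture and data shapes are assumptions for illustration and do not reproduce any reviewed model.

```python
# Minimal deep-learning sketch for sensor-based HAR (Keras/TensorFlow, random data; illustrative only).
import numpy as np
import tensorflow as tf

X = np.random.randn(800, 128, 9).astype("float32")   # 128 time steps, 9 inertial channels
y = np.random.randint(0, 6, size=800)                 # six activity classes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 9)),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),   # local motion patterns
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.LSTM(64),                                       # longer-range temporal structure
    tf.keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=64, verbose=0)
```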

6.5. Reinforcement Learning Applied to Human Activity Recognition Dataset

Currently, there is a new trend of reinforcement-based learning processes in which systems are capable of learning by themselves from punishment and reward schemes defined by behavioral psychology, and this line of work has been introduced into HAR. This review identified several relevant works, three of which are highly representative (see Table 7). Berlin [161] made implementations on the Weizmann and KTH datasets using a Spiking Neural Network, showing promising results of 94.44% and 92.50%. Lu [162] uses the DoMSEV dataset with the Deep-Shallow algorithm, reaching an accuracy of 72.9%, and Hossain [163] proposed a new dataset on which they implemented the Deep Q-Network algorithm with an accuracy of 83.26%.

Table 7. Reinforcement Learning Techniques results.

Dataset | Technique | Accuracy | References
Weizmann datasets | Spiking Neural Network | 94.44 | [ ]
KTH datasets | Spiking Neural Network | 92.50 |
DoMSEV | Deep-Shallow | 72.9 | [ ]
Proposal | Deep Q-Network (DQN) | 83.26 | [ ]
S.Yousefi-2017 | Reinforcement Learning Agent Recurrent Neural Network with Long Short-Term Memory | 80 | [ ]
FallDeFi | Reinforcement Learning Agent Recurrent Neural Network with Long Short-Term Memory | 83 |
UCI HAR | Reinforcement Learning + DeepConvLSTM | 98.36 | [ ]
Proposal | - | 79 | [ ]
UCF-Sports | Q-learning | 95 | [ ]
UCF-101 | Q-learning | 85 |
sub-JHMDB | Q-learning | 80 |
MHEALTH | Cluster-Q learning | 94.5 | [ ]
PAMAP2 | Cluster-Q learning | 83.42 |
UCI HAR | Cluster-Q learning | 81.32 |
MARS | Cluster-Q learning | 85.92 |
DataEgo | LRCN | 88 | [ ]
Proposal | Mask Algorithm | 96.02 | [ ]
Proposal | LSTM-Reinforcement Learning | 90.50 | [ ]
Proposal | Convolutional Autoencoder | 87.7 | [ ]

6.6. Metaheuristic Algorithms Applied to Human Activity Recognition Dataset

In the review of the state of the art, it was possible to identify different metaheuristic techniques that contribute to activity identification. Among the most evident results are applications of Genetic Algorithms, with accuracies of 96.43% [171], 87.5% [172], 95.71% [173], 99.75% [174], 98.00% [175], and 98.96% [175]. In many solutions, hybrid systems or new algorithms proposed by the authors are used; see Table 8.

Table 8. Metaheuristic Learning Techniques results.

Dataset | Technique | Accuracy | References
Cifar-100 | L4-Banched-ActionNet + EntACS + Cub-CVM | 98.00 | [ ]
Sbharpt | Ant-Colony, NB | 98.96 | [ ]
Ucihar | Bee swarm optimization with a deep Q-network | 98.41 | [ ]
Motionsense | Binary Grey Wolf Optimization | 93.95 | [ ]
Mhealth | Binary Grey Wolf Optimization | 96.83 |
Uci Har | Genetic Algorithms-SVM | 96.43 | [ ]
Ucf50 | Genetic Algorithms-CNN | 87.5 | [ ]
Sbhar | GA-PCA | 95.71 | [ ]
Mnist | GA-CNN | 99.75 | [ ]
Cifar-100 | Genetic Algorithms-SVM | 98.00 | [ ]
Sbharpt | Genetic Algorithms-CNN | 98.96 | [ ]

6.7. Transfer Algorithms Applied to Human Activity Recognition Dataset

Transfer learning (TL) transfers the parameters of an already learned and trained model to a new model to help train the latter. Considering that most data or tasks are related, the learned model parameters can be shared with the new model in order to speed up and optimize its learning efficiency. The basic motivation of TL is to apply the knowledge gained from one problem to a different but related problem; see Table 9.

Table 9. Transfer Learning Techniques results.

Dataset | Technique | Accuracy | Precision | Recall | F-Measure | References
CSIKNN98.3---[ ]
SVM98.3---
CNN99.2---
OpportunityKNN+PCA60---[ ]
GFK59---
STL65---
SA-GAN73---
USC-HADMMD80---[ ]
DANN77---
WD72---
ProposalKNN-OS79.8485.8491.8888.61[ ]
KNN-SS89.6494.4194.7694.52
SVM-OS77.1497.0479.2387.09
SVM-SS87.594.3992.6193.27
DT-OS87.594.6192.1693.14
DT-SS91.7995.1996.2695.71
JDA86.7992.7193.0792.89
BDA91.4395.995.1895.51
IPL-JPDA93.2197.0495.9796.48
KNN-OS79.8485.8491.8888.61
Wiezmann DatasetVGG-16 MODEL96.9597.0097.0097.00[ ]
VGG-19 MODEL96.5497.0097.0096.00
Inception-v3 Model95.6396.0096.0096.00
PAMAP2DeepConvLSTM---93.2[ ]
Skoda Mini Checkpoint---93
OpportunityPCA66.78---[ ]
TCA68.43---
GFK70.87---
TKL70.21---
STL73.22---
TNNAR78.4---
PAMAP2PCA42.87---
TCA47.21---
GFK48.09---
TKL43.32---
STL51.22---
TNNAR55.48---
UCI DSADSPCA71.24---
TCA73.47---
GFK81.23---
TKL74.26---
STL83.76---
TNNAR87.41---
UCI HARCNN-LSTM90.8---[ ]
DT76.73.--[ ]
RF71.96---
TB75.65---
TransAct86.49---
MhealthDT48.02---
RF62.25---
TB66.48---
TransAct77.43---
Daily SportDT66.67...
RF70.38...
TB72.86.--
TransAct80.83---
ProposalWithout SVD (Singular Value Decomposition)63.13%---[ ]
With SVD (Singular Value Decomposition)43.13%---
Transfer Accuracy97.5%---
PAMAP2CNN84.89---[ ]
UCI HAR83.16---
UCI HARkNN77.28---[ ]
DT72.16---
DA77.46---
NB69.93---
Transfer Accuracy83.7---
UCF Sports Action datasetVGGNet-1997.13---[ ]
AMASSDeepConvLSTM87.46---[ ]
DIP89.08---
DAR DatasetBase CNN85.38---[ ]
AugToAc91.38---
HDCNN86.85---
DDC86.67---
UCI HARCNN_LSTM92.13---[ ]
CNN_LSTM_SENSE91.55---
LSTM91.28---
LSTM_DENSE91.40---
ISPLCNN_LSTM99.06---
CNN_LSTM_SENSE98.43---
LSTM96.23---
LSTM_DENSE98.11---

7. Conclusions

The objective of this systematic literature review is to provide HAR researchers with a set of recommendations, among which the different datasets that can be used depending on the type of research are highlighted. For this analysis, different data sources were considered over an observation window between the years 2017 and 2021. Among the most representative databases, IEEE Xplore stands out with 256 articles, far surpassing other specialized databases such as Scopus, Science Direct, Web of Science, and ACM.

It is important to specify that 47% of the publications correspond to proceedings of congresses or conferences and 36% to specialized journals. Discriminating by the quartiles in which the articles are published, although the majority of publications are indeed conference proceedings without a specific category, the 36% of publications that appeared in journals are mostly in the first two quartiles, Q1 and Q2.

In this article, a technical analysis of the different types of datasets used in HAR experimentation was carried out. It should be noted that the creation of new datasets has increased, while some traditional approaches based on indoor datasets from the WSU CASAS project remain. In addition, public repositories such as the UCI Machine Learning repository have provided sets widely used in the literature, such as Opportunity and UCI HAR. The processing of images and videos within datasets has also increased, allowing the application of different cutting-edge techniques, as with the Weakly and UniMiB-SAHR datasets.

In this review, the different data processing approaches used in this area of knowledge were also examined. For the specific case of supervised learning, the use of algorithms such as Random Forest, Naive Bayes, and Support Vector Machine stands out. Regarding unsupervised learning, most of the analyzed works use techniques such as Spectral Clustering, Single Linkage, Ward Linkage, Average Linkage, and K-medoids. With ensemble learning, it was possible to show that combining different sets of techniques improves experimental results, among which combinations based on classification and clustering can be highlighted. Another modern and widely used approach is deep learning, focused on datasets with massive image processing requirements, where algorithms such as LSTM, Bi-LSTM, Casc-LSTM, and ENs2-LSTM stand out. Approaches based on reinforcement learning use resources such as Q-learning and Cluster-Q learning in their experimentation processes. The metaheuristic-based approach shows the usability of different algorithms, among which the following stand out: L4-Banched-ActionNet + EntACS + Cub-CVM, Ant Colony with NB, Bee swarm optimization with a deep Q-network, and Genetic Algorithms.

It is important to point out that, due to the high demand for data and information processing, it becomes increasingly necessary to implement techniques capable of improving performance and results, such as those based on Reinforcement Learning and Transfer Learning. Another challenge found in the literature is the processing of multi-occupancy datasets, which makes the use of computational resources and the identification of activities more expensive.

8. Future Works

Among the future works that can follow this systematic review of the literature is the real-time analysis of datasets containing not only sensor data but also images and sound. Algorithms based on Reinforcement Learning and Transfer Learning stand out here, as they provide a wide range of competitive solutions, together with the addition of multi-occupancy to the datasets.

Acknowledgments

This research has received funding under the REMIND project Marie Sklodowska-Curie EU Framework for Research and Innovation Horizon 2020, under Grant Agreement No. 734355. Furthermore, this research has been supported by the Spanish government by means of the projects RTI2018-098979-A-I00, PI-0387-2018 and CAS17/00292.

Funding Statement

European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 734355.

Author Contributions

Definition of taxonomy, P.P.A.-C., F.P. and E.V.; Conceptualization, P.P.A.-C., A.I.O.-C. and M.A.P.-M.; Human Activity Recognition conceptual Information P.P.A.-C., F.P. and E.V.; Methodology P.P.A.-C. and M.A.P.-M.; Technical and Scientometric Analysis P.P.A.-C., M.A.P.-M. and F.P. and A.Q.-L.; Formal Conclusions P.P.A.-C. and A.I.O.-C.; Supervision F.P. and E.V.; Writing-Review & Editing, P.P.A.-C., S.B.A. and F.P. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Human activity recognition for production and logistics—a systematic literature review.

literature review activity recognition

1. Introduction

  • What is the current status of research regarding HAR for P+L and related domains from a practitioner’s perspective?
  • What are the specifications of current applications regarding the sensor technology, recording environment and utilised datasets?
  • What methods of HAR are deployed?
  • What is the research gap to enhance HAR in P+L? What does the future road map look like?

2. Demarcation from Related Surveys

3. method of literature review, 3.1. inclusion criteria, 3.2. selection process, 3.3. literature analysis.

  • the initial situation and scope;
  • the methodological and empirical results; and
  • the further research demand.

4.1. Contributions Per Stage and Reasons for Exclusion during Selection Process

4.2. systematic review of relevant contributions, 4.2.1. application, 4.2.2. har methods, data representation, pre-processing, segmentation, shallow methods, deep learning, 5. discussion and conclusions.

  • What is the current status of research regarding HAR for P+L and related domains from a practitioner’s perspective? For the past 10 years, eight publications dealing with HAR in P+L have been identified. They address a variety of use cases but none covers the entire domain. Apart from two applications [ 83 , 85 ], the approaches assume a predefined set of activities, which is a downside amid the versatility of human work in P+L. Furthermore, the necessary effort for dataset creation is unknown, making the expenditure for deploying HAR in industry difficult to predict. In applications for related domains, locomotion activities as well as exercises and ADLs that resemble manual work in P+L are covered, allowing for their transfer to this domain.
  • What are the specifications of current applications regarding the sensor technology, recording environment and utilised datasets? The vast majority of research is done using IMUs placed on a person or using the accelerometers of smartphones. The sensor attachment could not be derived from the activities to be recognised. There was no link apparent to the reviewers. Seven out of the eight P+L contributions use data recorded in a laboratory. In total, 39 contributions use real-life data versus 19 that use laboratory data. Only four papers use both real-life and laboratory data. The reviewers did not find work regarding the training of a classifier using data from a laboratory for deployment in a real-life P+L facility or transfer learning between datasets and scenarios in an industrial context. Most of the publications proposed their own datasets or used individual excerpts from data available in repositories; thus, replicating their methods and results is hardly possible.
  • What methods of HAR are deployed? Current publications solve HAR either using a standard pattern recognition algorithm or using deep networks. Publications follow only the sliding window approach for segmenting signals. The window size differs strongly according to the recording scenarios. However, the overlapping is usually 50 % . For the standard methods, there is a large number of statistical features in time and frequency, being the variance, mean, correlation, energy and entropy the most common. Deep applications have been applied successfully for solving HAR. In comparison with applications in the vision domain, the networks are relatively shallow. Temporal CNNs or combinations between tCNNS and RNNs show the best results. Accuracy is the most used metric for evaluating the HAR methods. However, methods using datasets with unbalanced annotation should be evaluated with precision, recall and F1-metrics; otherwise, the performance of the method is not evaluated correctly.
  • What is the research gap to enhance HAR in P+L? What does the future road map look like? From the reviewer’s perspective, further research on HAR for P+L should focus on five issues. First, a high-quality benchmark dataset for HAR methods to deploy in P+L is missing. This dataset should contain motion pattern that are as close to reality as possible and it should allow for comparison among different methods and thus being relevant for application in industry. Second, it must become possible to quantify the data creation effort, including both recording and annotation following a predefined protocol. This allows for a holistic effort estimation when deploying HAR in P+L. Third, most of the observed activities in the literature corpus are simplistic and they do not cover the entirety of manually performed work in P+L. Furthermore, the definition of activities cannot be considered fixed at design time and expected to remain the same during run time in such a rapidly evolving industry. Methods of HAR for P+L must address this issue. Fourth, method-wise, the segmentation approach should be revised in detail as a window-based approach is currently the only method for generating activity hypothesis. This method does not handle activities that differ on their duration. A new method for computing activities with strongly different duration is needed. Fifth, the methods using deep networks do not include confidence measure. Even though these network methods show the state-of-the-art performance on benchmark datasets, they are still overconfident with their predictions. For this reason, integrating deep architectures with probabilistic reasoning for solving HAR using context information can be difficult.

Author Contributions

Conflicts of interest.

  • Dregger, J.; Niehaus, J.; Ittermann, P.; Hirsch-Kreinsen, H.; ten Hompel, M. Challenges for the future of industrial labor in manufacturing and logistics using the example of order picking systems. Procedia CIRP 2018 , 67 , 140–143. [ Google Scholar ] [ CrossRef ]
  • Hofmann, E.; Rüsch, M. Industry 4.0 and the current status as well as future prospects on logistics. Comput. Ind. 2017 , 89 , 23–34. [ Google Scholar ] [ CrossRef ]
  • Michel, R. 2016 Warehouse/DC Operations Survey: Ready to Confront Complexity ; Northwestern University Transportation Library: Evanston, IL, USA, 2016. [ Google Scholar ]
  • Schlögl, D.; Zsifkovits, H. Manuelle Kommissioniersysteme und die Rolle des Menschen. BHM Berg-und Hüttenmänn. Monatshefte 2016 , 161 , 225–228. [ Google Scholar ] [ CrossRef ]
  • Liang, C.; Chee, K.J.; Zou, Y.; Zhu, H.; Causo, A.; Vidas, S.; Teng, T.; Chen, I.M.; Low, K.H.; Cheah, C.C. Automated Robot Picking System for E-Commerce Fulfillment Warehouse Application. In Proceedings of the 14th IFToMM World Congress, Taipei, Taiwan, 25–30 October 2015; pp. 398–403. [ Google Scholar ] [ CrossRef ]
  • Oleari, F.; Magnani, M.; Ronzoni, D.; Sabattini, L. Industrial AGVs: Toward a pervasive diffusion in modern factory warehouses. In Proceedings of the 2014 IEEE 10th International Conference on Intelligent Computer Communication and Processing (ICCP), Piscataway, NJ, USA, 4–6 September 2014; pp. 233–238. [ Google Scholar ] [ CrossRef ]
  • Grosse, E.H.; Glock, C.H.; Neumann, W.P. Human Factors in Order Picking System Design: A Content Analysis. IFAC-PapersOnLine 2015 , 48 , 320–325. [ Google Scholar ] [ CrossRef ]
  • Calzavara, M.; Glock, C.H.; Grosse, E.H.; Persona, A.; Sgarbossa, F. Analysis of economic and ergonomic performance measures of different rack layouts in an order picking warehouse. Comput. Ind. Eng. 2017 , 111 , 527–536. [ Google Scholar ] [ CrossRef ]
  • Grosse, E.H.; Calzavara, M.; Glock, C.H.; Sgarbossa, F. Incorporating human factors into decision support models for production and logistics: Current state of research. IFAC-PapersOnLine 2017 , 50 , 6900–6905. [ Google Scholar ] [ CrossRef ]
  • Chen, C.; Jafari, R.; Kehtarnavaz, N. A survey of depth and inertial sensor fusion for human action recognition. Multimed. Tools Appl. 2017 , 76 , 4405–4425. [ Google Scholar ] [ CrossRef ]
  • Ordóñez, F.; Roggen, D. Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition. Sensors 2016 , 16 , 115. [ Google Scholar ] [ CrossRef ]
  • Haescher, M.; Matthies, D.J.; Srinivasan, K.; Bieber, G. Mobile Assisted Living: Smartwatch-based Fall Risk Assessment for Elderly People. In Proceedings of the 5th International Workshop on Sensor-Based Activity Recognition and Interaction iWOAR’18, Berlin, Germany, 20–21 September 2018; pp. 6:1–6:10. [ Google Scholar ] [ CrossRef ]
  • Hölzemann, A.; Van Laerhoven, K. Using Wrist-Worn Activity Recognition for Basketball Game Analysis. In Proceedings of the 5th International Workshop on Sensor-Based Activity Recognition and Interaction iWOAR’18, Berlin, Germany, 20–21 September 2018; pp. 13:1–13:6. [ Google Scholar ] [ CrossRef ]
  • Zeng, M.; Nguyen, L.T.; Yu, B.; Mengshoel, O.J.; Zhu, J.; Wu, P.; Zhang, J. Convolutional Neural Networks for Human Activity Recognition using Mobile Sensors. In Proceedings of the 6th International Conference on Mobile Computing, Applications and Services, ICST, Austin, TX, USA, 6–7 November 2014. [ Google Scholar ] [ CrossRef ]
  • Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional Two-Stream Network Fusion for Video Action Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1933–1941. [ Google Scholar ] [ CrossRef ]
  • Ronao, C.A.; Cho, S.B. Deep Convolutional Neural Networks for Human Activity Recognition with Smartphone Sensors. In Neural Information Processing ; Lecture Notes in Computer Science; Arik, S., Huang, T., Lai, W.K., Liu, Q., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 46–53. [ Google Scholar ] [ CrossRef ]
  • Yang, J.B.; Nguyen, M.N.; San, P.P.; Li, X.L.; Krishnaswamy, S. Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition. In Proceedings of the 24th International Conference on Artificial Intelligence IJCAI’15, Buenos Aires, Argentina, 25–31 July 2015; pp. 3995–4001. [ Google Scholar ]
  • Bishop, C.M. Pattern Recognition and Machine Learning ; Information Science and Statistics; Springer: Cham, Switzerland, 2006. [ Google Scholar ]
  • Fink, G.A. Markov Models for Pattern Recognition: From Theory to Applications , 2nd ed.; Advances in Computer Vision and Pattern Recognition; Springer: Cham, Switzerland, 2014. [ Google Scholar ]
  • Twomey, N.; Diethe, T.; Fafoutis, X.; Elsts, A.; McConville, R.; Flach, P.; Craddock, I. A Comprehensive Study of Activity Recognition Using Accelerometers. Informatics 2018 , 5 , 27. [ Google Scholar ] [ CrossRef ]
  • Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning ; MIT Press: Cambridge, MA, USA, 2015. [ Google Scholar ]
  • Yao, R.; Lin, G.; Shi, Q.; Ranasinghe, D.C. Efficient dense labelling of human activity sequences from wearables using fully convolutional networks. y 2018 , 78 , 252–266. [ Google Scholar ] [ CrossRef ]
  • Feldhorst, S.; Aniol, S.; ten Hompel, M. Human Activity Recognition in der Kommissionierung– Charakterisierung des Kommissionierprozesses als Ausgangsbasis für die Methodenentwicklung. Logist. J. Proc. 2016 , 2016 . [ Google Scholar ] [ CrossRef ]
  • Alam, M.A.U.; Roy, N. Unseen Activity Recognitions: A Hierarchical Active Transfer Learning Approach. In Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5–8 June 2017; pp. 436–446. [ Google Scholar ] [ CrossRef ]
  • Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2009 , 22 , 1345–1359. [ Google Scholar ] [ CrossRef ]
  • Luan, P.G.; Tan, N.T.; Thinh, N.T. Estimation and Recognition of Motion Segmentation and Pose IMU-Based Human Motion Capture. In Robot Intelligence Technology and Applications 5 ; Advances in Intelligent Systems and Computing; Kim, J.H., Myung, H., Kim, J., Xu, W., Matson, E.T., Jung, J.W., Choi, H.L., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 383–391. [ Google Scholar ] [ CrossRef ]
  • Pfister, A.; West, A.M.; Bronner, S.; Noah, J.A. Comparative abilities of Microsoft Kinect and Vicon 3D motion capture for gait analysis. J. Med. Eng. Technol. 2014 , 38 , 274–280. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Schlagenhauf, F.; Sahoo, P.P.; Singhose, W. A Comparison of Dual-Kinect and Vicon Tracking of Human Motion for Use in Robotic Motion Programming. Robot Autom. Eng. J. 2017 , 1 , 555558. [ Google Scholar ] [ CrossRef ]
  • Bulling, A.; Blanke, U.; Schiele, B. A tutorial on human activity recognition using body-worn inertial sensors. ACM Comput. Surv. 2014 , 46 , 1–33. [ Google Scholar ] [ CrossRef ]
  • Roggen, D.; Förster, K.; Calatroni, A.; Tröster, G. The adARC pattern analysis architecture for adaptive human activity recognition systems. J. Ambient. Intell. Humaniz. Comput. 2013 , 4 , 169–186. [ Google Scholar ] [ CrossRef ]
  • Dalmazzo, D.; Tassani, S.; Ramírez, R. A Machine Learning Approach to Violin Bow Technique Classification: A Comparison Between IMU and MOCAP Systems. In Proceedings of the 5th International Workshop on Sensor-Based Activity Recognition and Interaction, iWOAR’18, Berlin, Germany, 20–21 September 2018; pp. 12:1–12:8. [ Google Scholar ] [ CrossRef ]
  • Vinciarelli, A.; Esposito, A.; André, E.; Bonin, F.; Chetouani, M.; Cohn, J.F.; Cristani, M.; Fuhrmann, F.; Gilmartin, E.; Hammal, Z.; et al. Open Challenges in Modelling, Analysis and Synthesis of Human Behaviour in Human–Human and Human–Machine Interactions. Cogn. Comput. 2015 , 7 , 397–413. [ Google Scholar ] [ CrossRef ]
  • Lara, O.D.; Labrador, M.A. A Survey on Human Activity Recognition using Wearable Sensors. IEEE Commun. Surv. Tutor. 2013 , 15 , 1192–1209. [ Google Scholar ] [ CrossRef ]
  • Su, X.; Tong, H.; Ji, P. Activity recognition with smartphone sensors. Tsinghua Sci. Technol. 2014 , 19 , 235–249. [ Google Scholar ] [ CrossRef ]
  • Attal, F.; Mohammed, S.; Dedabrishvili, M.; Chamroukhi, F.; Oukhellou, L.; Amirat, Y. Physical Human Activity Recognition Using Wearable Sensors. Sensors 2015 , 15 , 31314–31338. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Edwards, M.; Deng, J.; Xie, X. From pose to activity: Surveying datasets and introducing CONVERSE. Comput. Vis. Image Underst. 2016 , 144 , 73–105. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • O’Reilly, M.; Caulfield, B.; Ward, T.; Johnston, W.; Doherty, C. Wearable Inertial Sensor Systems for Lower Limb Exercise Detection and Evaluation: A Systematic Review. Sport. Med. 2018 , 48 , 1221–1246. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Kitchenham, B.; Brereton, P. A systematic review of systematic review process research in software engineering. Inf. Softw. Technol. 2013 , 55 , 2049–2075. [ Google Scholar ] [ CrossRef ]
  • Kitchenham, B.; Pearl Brereton, O.; Budgen, D.; Turner, M.; Bailey, J.; Linkman, S. Systematic literature reviews in software engineering—A systematic literature review. Inf. Softw. Technol. 2009 , 51 , 7–15. [ Google Scholar ] [ CrossRef ]
  • Kitchenham, B. Procedures for Performing Systematic Reviews ; Keele University: Keele, UK, 2004; p. 33. [ Google Scholar ]
  • Chen, L.; Zhao, X.; Tang, O.; Price, L.; Zhang, S.; Zhu, W. Supply chain collaboration for sustainability: A literature review and future research agenda. Int. J. Prod. Econ. 2017 , 194 , 73–87. [ Google Scholar ] [ CrossRef ]
  • Caspersen, C.J.; Powell, K.E.; Christenson, G.M. Physical activity, exercise, and physical fitness: definitions and distinctions for health-related research. Public Health Rep. 1985 , 100 , 126–131. [ Google Scholar ] [ PubMed ]
  • Purkayastha, A.; Palmaro, E.; Falk-Krzesinski, H.J.; Baas, J. Comparison of two article-level, field-independent citation metrics: Field-Weighted Citation Impact (FWCI) and Relative Citation Ratio (RCR). J. Inf. 2019 , 13 , 635–642. [ Google Scholar ] [ CrossRef ]
  • Xi, L.; Bin, Y.; Aarts, R. Single-accelerometer-based daily physical activity classification. In Proceedings of the 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, MN, USA, 3–6 September 2009; pp. 6107–6110. [ Google Scholar ] [ CrossRef ]
  • Altun, K.; Barshan, B. Human Activity Recognition Using Inertial/Magnetic Sensor Units. In Human Behavior Understanding ; Salah, A.A., Gevers, T., Sebe, N., Vinciarelli, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6219, pp. 38–51. [ Google Scholar ] [ CrossRef ]
  • Altun, K.; Barshan, B.; Tunçel, O. Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognit. 2010 , 43 , 3605–3620. [ Google Scholar ] [ CrossRef ]
  • Khan, A.M.; Lee, Y.K.; Lee, S.Y.; Kim, T.S. Human Activity Recognition via an Accelerometer-Enabled-Smartphone Using Kernel Discriminant Analysis. In Proceedings of the 2010 5th International Conference on Future Information Technology, Busan, Korea, 21–23 May 2010; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Kwapisz, J.R.; Weiss, G.M.; Moore, S.A. Activity recognition using cell phone accelerometers. ACM SigKDD Explor. Newsl. 2011 , 12 , 74. [ Google Scholar ] [ CrossRef ]
  • Wang, L.; Gu, T.; Tao, X.; Chen, H.; Lu, J. Recognizing multi-user activities using wearable sensors in a smart home. Pervasive Mob. Comput. 2011 , 7 , 287–298. [ Google Scholar ] [ CrossRef ]
  • Casale, P.; Pujol, O.; Radeva, P. Human Activity Recognition from Accelerometer Data Using a Wearable Device. In Pattern Recognition and Image Analysis ; Vitrià, J., Sanches, J.M., Hernández, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6669, pp. 289–296. [ Google Scholar ] [ CrossRef ]
  • Gu, T.; Wang, L.; Wu, Z.; Tao, X.; Lu, J. A Pattern Mining Approach to Sensor-Based Human Activity Recognition. IEEE Trans. Knowl. Data Eng. 2010 , 23 , 1359–1372. [ Google Scholar ] [ CrossRef ]
  • Lee, Y.S.; Cho, S.B. Activity Recognition Using Hierarchical Hidden Markov Models on a Smartphone with 3D Accelerometer. In Hybrid Artificial Intelligent Systems ; Corchado, E., Kurzyński, M., Woźniak, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6678, pp. 460–467. [ Google Scholar ] [ CrossRef ]
  • Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. Human Activity Recognition on Smartphones Using a Multiclass Hardware-Friendly Support Vector Machine. In Ambient Assisted Living and Home Care ; Bravo, J., Hervás, R., Rodríguez, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7657, pp. 216–223. [ Google Scholar ] [ CrossRef ]
  • Deng, L.; Leung, H.; Gu, N.; Yang, Y. Generalized Model-Based Human Motion Recognition with Body Partition Index Maps ; Blackwell Publishing Ltd.: Oxford, UK, 2012; Volume 31, pp. 202–215. [ Google Scholar ] [ CrossRef ]
  • Lara, S.D.; Labrador, M.A. A mobile platform for real-time human activity recognition. In Proceedings of the 2012 IEEE Consumer Communications and Networking Conference (CCNC), Las Vegas, NV, USA, 14–17 January 2012; pp. 667–671. [ Google Scholar ] [ CrossRef ]
  • Lara, O.D.; Pérez, A.J.; Labrador, M.A.; Posada, J.D. Centinela: A human activity recognition system based on acceleration and vital sign data. Pervasive Mob. Comput. 2012 , 8 , 717–729. [ Google Scholar ] [ CrossRef ]
  • Siirtola, P.; Röning, J. Recognizing Human Activities User-independently on Smartphones Based on Accelerometer Data. IJIMAI 2012 , 1 , 38. [ Google Scholar ] [ CrossRef ]
  • Koskimäki, H.; Huikari, V.; Siirtola, P.; Röning, J. Behavior modeling in industrial assembly lines using a wrist-worn inertial measurement unit. J. Ambient. Intell. Humaniz. Comput. 2013 , 4 , 187–194. [ Google Scholar ] [ CrossRef ]
  • Shoaib, M.; Scholten, H.; Havinga, P. Towards Physical Activity Recognition Using Smartphone Sensors. In Proceedings of the 2013 IEEE 10th International Conference on Ubiquitous Intelligence and Computing and 2013 IEEE 10th International Conference on Autonomic and Trusted Computing, Vietri sul Mere, Italy, 18–21 December 2013; pp. 80–87. [ Google Scholar ] [ CrossRef ]
  • Zhang, M.; Sawchuk, A.A. Human Daily Activity Recognition With Sparse Representation Using Wearable Sensors. IEEE J. Biomed. Health Inform. 2013 , 17 , 553–560. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Bayat, A.; Pomplun, M.; Tran, D.A. A Study on Human Activity Recognition Using Accelerometer Data from Smartphones. Procedia Comput. Sci. 2014 , 34 , 450–457. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Garcia-Ceja, E.; Brena, R.; Carrasco-Jimenez, J.; Garrido, L. Long-Term Activity Recognition from Wristwatch Accelerometer Data. Sensors 2014 , 14 , 22500–22524. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Gupta, P.; Dallas, T. Feature Selection and Activity Recognition System Using a Single Triaxial Accelerometer. IEEE Trans. Biomed. Eng. 2014 , 61 , 1780–1786. [ Google Scholar ] [ CrossRef ]
  • Kwon, Y.; Kang, K.; Bae, C. Unsupervised learning for human activity recognition using smartphone sensors. Expert Syst. Appl. 2014 , 41 , 6067–6074. [ Google Scholar ] [ CrossRef ]
  • Aly, H.; Ismail, M.A. ubiMonitor: intelligent fusion of body-worn sensors for real-time human activity recognition. In Proceedings of the 30th Annual ACM Symposium on Applied Computing-SAC’15, Salamanca, Spain, 13–17 April 2015; pp. 563–568. [ Google Scholar ] [ CrossRef ]
  • Bleser, G.; Steffen, D.; Reiss, A.; Weber, M.; Hendeby, G.; Fradet, L. Personalized Physical Activity Monitoring Using Wearable Sensors. In Smart Health ; Holzinger, A., Röcker, C., Ziefle, M., Eds.; Springer International Publishing: Cham, Switzerland, 2015; Volume 8700, pp. 99–124. [ Google Scholar ] [ CrossRef ]
  • Chen, Y.; Xue, Y. A Deep Learning Approach to Human Activity Recognition Based on Single Accelerometer. In Proceedings of the 2015 IEEE International Conference on Systems, Man, and Cybernetics, Kowloon, China, 9–12 October 2015; pp. 1488–1492. [ Google Scholar ] [ CrossRef ]
  • Guo, M.; Wang, Z. A feature extraction method for human action recognition using body-worn inertial sensors. In Proceedings of the 2015 IEEE 19th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Calabria, Italy, 6–8 May 2015; pp. 576–581. [ Google Scholar ] [ CrossRef ]
  • Zainudin, M.; Sulaiman, M.N.; Mustapha, N.; Perumal, T. Activity recognition based on accelerometer sensor using combinational classifiers. In Proceedings of the 2015 IEEE Conference on Open Systems (ICOS), Bandar Melaka, Malaysia, 24–26 August 2015; pp. 68–73. [ Google Scholar ] [ CrossRef ]
  • Ayachi, F.S.; Nguyen, H.P.; Lavigne-Pelletier, C.; Goubault, E.; Boissy, P.; Duval, C. Wavelet-based algorithm for auto-detection of daily living activities of older adults captured by multiple inertial measurement units (IMUs). Physiol. Meas. 2016 , 37 , 442–461. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Fallmann, S.; Kropf, J. Human Activity Pattern Recognition based on Continuous Data from a Body Worn Sensor placed on the Hand Wrist using Hidden Markov Models. Simul. Notes Eur. 2016 , 26 , 9–16. [ Google Scholar ] [ CrossRef ]
  • Feldhorst, S.; Masoudenijad, M.; ten Hompel, M.; Fink, G.A. Motion Classification for Analyzing the Order Picking Process using Mobile Sensors-General Concepts, Case Studies and Empirical Evaluation. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods, Rome, Italy, 24–26 February 2016; SCITEPRESS-Science and Technology Publications: Setubal, Portugal, 2016; pp. 706–713. [ Google Scholar ] [ CrossRef ]
  • Hammerla, N.Y.; Halloran, S.; Ploetz, T. Deep, Convolutional, and Recurrent Models for Human Activity Recognition using Wearables. arXiv 2016 , arXiv:1604.08880. [ Google Scholar ]
  • Liu, Y.; Nie, L.; Liu, L.; Rosenblum, D.S. From action to activity: Sensor-based activity recognition. Neurocomputing 2016 , 181 , 108–115. [ Google Scholar ] [ CrossRef ]
  • Margarito, J.; Helaoui, R.; Bianchi, A.; Sartor, F.; Bonomi, A. User-Independent Recognition of Sports Activities from a Single Wrist-worn Accelerometer: A Template Matching Based Approach. IEEE Trans. Biomed. Eng. 2015 , 63 , 788–796. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Reyes-Ortiz, J.L.; Oneto, L.; Samà, A.; Parra, X.; Anguita, D. Transition-Aware Human Activity Recognition Using Smartphones. Neurocomputing 2016 , 171 , 754–767. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Ronao, C.A.; Cho, S.B. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 2016 , 59 , 235–244. [ Google Scholar ] [ CrossRef ]
  • Ronao, C.A.; Cho, S.B. Recognizing human activities from smartphone sensors using hierarchical continuous hidden Markov models. Int. J. Distrib. Sens. Netw. 2017 , 13 , 155014771668368. [ Google Scholar ] [ CrossRef ]
  • Song-Mi, L.; Sangm, M.Y.; Heeryon, C. Human activity recognition from accelerometer data using Convolutional Neural Network. In Proceedings of the 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju, Korea, 13–16 February 2017; pp. 131–134. [ Google Scholar ] [ CrossRef ]
  • Scheurer, S.; Tedesco, S.; Brown, K.N.; O’Flynn, B. Human activity recognition for emergency first responders via body-worn inertial sensors. In Proceedings of the 2017 IEEE 14th International Conference on Wearable and Implantable Body Sensor Networks (BSN), Eindhoven, The Netherlands, 9–12 May 2017; pp. 5–8. [ Google Scholar ] [ CrossRef ]
  • Vital, J.P.M.; Faria, D.R.; Dias, G.; Couceiro, M.S.; Coutinho, F.; Ferreira, N.M.F. Combining discriminative spatiotemporal features for daily life activity recognition using wearable motion sensing suit. Pattern Anal. Appl. 2017 , 20 , 1179–1194. [ Google Scholar ] [ CrossRef ]
  • Chen, Z.; Le, Z.; Cao, Z.; Guo, J. Distilling the Knowledge From Handcrafted Features for Human Activity Recognition. IEEE Trans. Ind. Inform. 2018 , 14 , 4334–4342. [ Google Scholar ] [ CrossRef ]
  • Moya Rueda, F.; Grzeszick, R.; Fink, G.; Feldhorst, S.; ten Hompel, M. Convolutional Neural Networks for Human Activity Recognition Using Body-Worn Sensors. Informatics 2018 , 5 , 26. [ Google Scholar ] [ CrossRef ]
  • Nair, N.; Thomas, C.; Jayagopi, D.B. Human Activity Recognition Using Temporal Convolutional Network. In Proceedings of the 5th international Workshop on Sensor-Based Activity Recognition and Interaction-iWOAR’18, Berlin, Germany, 20–21 September 2018; pp. 1–8. [ Google Scholar ] [ CrossRef ]
  • Reining, C.; Schlangen, M.; Hissmann, L.; ten Hompel, M.; Moya, F.; Fink, G.A. Attribute Representation for Human Activity Recognition of Manual Order Picking Activities. In Proceedings of the 5th international Workshop on Sensor-based Activity Recognition and Interaction-iWOAR’18, Berlin, Germany, 20–21 September 2018; pp. 1–10. [ Google Scholar ] [ CrossRef ]
  • Tao, W.; Lai, Z.H.; Leu, M.C.; Yin, Z. Worker Activity Recognition in Smart Manufacturing Using IMU and sEMG Signals with Convolutional Neural Networks. Procedia Manuf. 2018 , 26 , 1159–1166. [ Google Scholar ] [ CrossRef ]
  • Wolff, J.P.; Grützmacher, F.; Wellnitz, A.; Haubelt, C. Activity Recognition using Head Worn Inertial Sensors. In Proceedings of the 5th international Workshop on Sensor-based Activity Recognition and Interaction-iWOAR’18, Berlin, Germany, 20–21 September 2018; pp. 1–7. [ Google Scholar ] [ CrossRef ]
  • Xi, R.; Li, M.; Hou, M.; Fu, M.; Qu, H.; Liu, D.; Haruna, C.R. Deep Dilation on Multimodality Time Series for Human Activity Recognition. IEEE Access 2018 , 6 , 53381–53396. [ Google Scholar ] [ CrossRef ]
  • Xie, L.; Tian, J.; Ding, G.; Zhao, Q. Human activity recognition method based on inertial sensor and barometer. In Proceedings of the 2018 IEEE International Symposium on Inertial Sensors and Systems (INERTIAL), Moltrasio, Italy, 26–29 March 2018; pp. 1–4. [ Google Scholar ] [ CrossRef ]
  • Zhao, J.; Obonyo, E. Towards a Data-Driven Approach to Injury Prevention in Construction. In Advanced Computing Strategies for Engineering ; Smith, I.F.C., Domer, B., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 10863, pp. 385–411. [ Google Scholar ] [ CrossRef ]
  • Zhu, Q.; Chen, Z.; Yeng, C.S. A Novel Semi-supervised Deep Learning Method for Human Activity Recognition. IEEE Trans. Ind. Inform. 2018 , 3821–3830. [ Google Scholar ] [ CrossRef ]
  • Rueda, F.M.; Fink, G.A. Learning Attribute Representation for Human Activity Recognition. arXiv 2018 , arXiv:1802.00761. [ Google Scholar ]
  • Lampert, C.H.; Nickisch, H.; Harmeling, S. Attribute-Based Classification for Zero-Shot Visual Object Categorization. IEEE Trans. Pattern Anal. Mach. Intell. 2013 , 36 , 453–465. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Lockhart, J.W.; Weiss, G.M.; Xue, J.C.; Gallagher, S.T.; Grosner, A.B.; Pulickal, T.T. WISDM Lab: Dataset ; Department of Computer & Information Science, Fordham University: Bronx, NY, USA, 2013. [ Google Scholar ]
  • Kwapisz, J.R.; Weiss, G.M.; Moore, S.A. WISDM Lab: Dataset ; Department of Computer & Information Science, Fordham University: Bronx, NY, USA, 2012. [ Google Scholar ]
  • Roggen, D.; Plotnik, M.; Hausdorff, J. UCI Machine Learning Repository: Daphnet Freezing of Gait Data Set ; School of Information and Computer Science, University of California: Irvine, CA, USA, 2013; Available online: https://archive.ics.uci.edu/ml/datasets/Daphnet+Freezing+of+Gait (accessed on 20 July 2019).
  • Müller, M.; Röder, T.; Eberhardt, B.; Weber, A. Motion Database HDM05 ; Technical Report; Universität Bonn: Bonn, Germany, 2007. [ Google Scholar ]
  • Banos, O.; Toth, M.A.; Amft, O. UCI Machine Learning Repository: REALDISP Activity Recognition Dataset Data Set. Available online: https://archive.ics.uci.edu/ml/datasets/REALDISP+Activity+Recognition+Dataset (accessed on 20 July 2019).
  • Reyes-Ortiz, J.L.; Anguita, D.; Oneto, L.; Parra, X. UCI Machine Learning Repository: Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set. Available online: https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions (accessed on 20 July 2019).
  • Zhang, M.; Sawchuk, A.A. Human Activities Dataset. 2012. Available online: http://sipi.usc.edu/had/ (accessed on 20 July 2019).
  • Yang, A.Y.; Giani, A.; Giannatonio, R.; Gilani, K.; Iyengar, S.; Kuryloski, P.; Seto, E.; Seppa, V.P.; Wang, C.; Shia, V.; et al. d-WAR: Distributed Wearable Action Recognition. Available online: https://people.eecs.berkeley.edu/~yang/software/WAR/ (accessed on 20 July 2019).
  • Roggen, D.; Calatroni, A.; Long-Van, N.D.; Chavarriaga, R.; Hesam, S.; Tejaswi Digumarti, S. UCI Machine Learning Repository: OPPORTUNITY Activity Recognition Data Set. Available online: https://archive.ics.uci.edu/ml/datasets/opportunity+activity+recognition (accessed on 20 July 2019).
  • Reyes-Ortiz, J.L.; Anguita, D.; Ghio, A.; Oneto, L.; Parra, X. UCI Machine Learning Repository: Human Activity Recognition Using Smartphones Data Set. Available online: https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones (accessed on 20 July 2019).
  • Reiss, A. UCI Machine Learning Repository: PAMAP2 Physical Activity Monitoring Data Set. Available online: https://archive.ics.uci.edu/ml/datasets/pamap2+physical+activity+monitoring (accessed on 20 July 2019).
  • Bulling, A.; Blanke, U.; Schiele, B. MATLAB Human Activity Recognition Toolbox. Available online: https://github.com/andreas-bulling/ActRecTut (accessed on 20 July 2019).
  • Zappi, P.; Lombriser, C.; Stiefmeier, T.; Farella, E.; Roggen, D.; Benini, L.; Tröster, G. Activity Recognition from On-body Sensors: Accuracy-power Trade-off by Dynamic Sensor Selection. In Proceedings of the 5th European Conference on Wireless Sensor Networks EWSN’08, Bologna, Italy, 30 January–1 February 2008; Springer: Berlin, Heidelberg, 2008; pp. 17–33. [ Google Scholar ]
  • Fukushima, K.; Miyake, S. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognit. 1982 , 15 , 455–469. [ Google Scholar ] [ CrossRef ]
  • Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 ; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105.


Ref. | Year | Author & Description
[ ] | 2013 | Lara and Labrador reviewed the state of the art in HAR based on wearable sensors. They addressed the general structure of HAR systems and design issues. Twenty-eight systems are evaluated in terms of recognition performance, energy consumption and other criteria.
[ ] | 2014 | Xing Su et al. surveyed recent advances in HAR with smartphone sensors and addressed experiment settings. They divided activities into five types: living, working, health, simple and complex.
[ ] | 2015 | Attal et al. reviewed classification techniques for HAR using accelerometers. They provided an overview of sensor placement, detected activities and performance metrics of current state-of-the-art approaches.
[ ] | 2016 | Edwards et al. presented a review on publicly available datasets for HAR. The examined sensor technology includes MoCap and IMUs. The observed application domains are ADL, surveillance, sports and generic activities, meaning that a wide variety of actions is covered.
[ ] | 2018 | Twomey et al. surveyed the state of the art in activity recognition using accelerometers. They focused on ADL and examined, among other issues, sensor placement and its influence on recognition performance.
[ ] | 2018 | O’Reilly et al. synthesised and evaluated studies which investigate the capacity of IMUs to assess movement quality in lower limb exercises. The studies are categorised into three groups: exercise detection, movement classification or measurement validation.
Inclusion Criteria | Description
Database | IEEE Xplore, Science Direct, Google Scholar, Scopus, European Union Digital Library (EUDL), ACM Digital Library, LearnTechLib, Springer Link, Wiley Online Library, dblp computer science bibliography, IOP Science, World Scientific, Multidisciplinary Digital Publishing Institute (MDPI), SciTePress Digital Library (Science and Technology Publications)
Keywords | Motion Capturing, Motion Capture, MoCap, OMC, OMMC; Inertial Measurement Unit, IMU, Accelerometer, body-worn/on-body/wearable/wireless Sensor; (Human) Activity/Action Recognition, HAR; Production, Manufacturing, Logistics, Warehousing, Order Picking
Year of publication | 2009–2018
Language | English
Source Types | Conference Proceedings & Peer-reviewed Journals
Identifier | Persistent identifier mandatory (DOI, ISBN, ISSN, arXiv)
Content Criteria | Description
( ) IMU or OMMC | Method is based on data from IMUs or OMMC systems. The sensors and markers are either attached to the subject’s body or body-worn.
( ) Human | Contribution addresses the recognition of activities performed by humans.
( ) Physical World | Data are recorded in the physical world without the use of simulated or immersive environments.
( ) Quantification | The application aims to quantitatively determine the occurrence of activities, not to capture and analyse them for developing new methods in related fields.
( ) Application-oriented | Perspectives for deploying the proposed method in P+L are conceivable. Definition of HAR-related terms is not the contribution’s focus.
( ) Physical activity | According to Caspersen et al. [ ], “physical activity is defined as any bodily movement produced by skeletal muscles that results in energy expenditure”. In this literature review, bodily movement is limited to torso and limb movement.
( ) No focus on hardware | Comparison of sensor technologies or a showcase of new hardware used for HAR is not the contribution’s focus.
( ) Clear Method | Publications are computer science oriented, stating clear pattern recognition methods and performance metrics.
Stage | Description
(I) Keywords | Keywords of the publication match the Inclusion Criteria. Contributions have not yet been examined by the reviewers at this point.
(II) Title | The title does not conflict with any Content Criteria, either because it complies with the criteria or because it is ambiguous.
(III) Abstract | The abstract’s content does not conflict with any Content Criteria, either because it complies with the criteria or because the necessary specifications are missing.
(IV) Full Text | Reading the full text confirms compliance with all Content Criteria. Properties of the publication are recorded in the literature overview.
Root Category / Subcategory | Description
P+L | Deployment in industrial settings, e.g., production facilities or warehouses
Other | Related application domain, e.g., health or ADL
Work | Working activities such as assembly or order picking
Exercises | Sport activities, e.g., riding a stationary bicycle or gymnastic exercises
Locomotion | Walking and running, as well as the recognition of the lack of locomotion when standing
ADL | Activities of daily living, including cooking, doing the laundry, driving a car and so forth
Arm | Upper and lower arm
Hand | Including wrists
Leg | Including knee and shank
Foot | Including ankle
Torso | Including chest, back, belt and waist
Head | Including sensors attached to a helmet or protective gear
Smartphone | Worn in a pocket or a bag. If attached to a limb, the corresponding subcategory is checked as well
Repository | Utilised dataset is available in a repository
Individual | Dataset is created specifically for the contribution and not available in a repository
Laboratory | Recording takes place in a constrained laboratory environment
Real-life | Recording takes place in a real-life environment, e.g., a real warehouse or in public places
Name of dataset | Name, origin, repository and description of the dataset
Passive Markers | Markers reflect light for the camera to capture
Active Markers | Markers emit light for the camera to capture
IMU | Devices that measure specific forces and angular rates, e.g., accelerometers and gyroscopes
Pre.-Pr. | Pre-processing: normalisation, noise filtering, low-pass and high-pass filtering, and re-sampling
Segm. | Segmentation: sliding-window approach
FE - Stat. Feat. | Statistical feature extraction: time- and frequency-domain features
FE - App.-based | Application-based features, e.g., kinematics, body model, event-based
FR | Feature reduction, e.g., Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA), Kernel Discriminant Analysis (KDA), Random Projection (RP)
CL-NB | Classification method: Naïve Bayes
CL-HMMs | Classification method: Hidden Markov Models
CL-SVM | Classification method: Support Vector Machines
CL-MLP | Classification method: Multilayer Perceptron
CL-Other | Classification method: Random Forest (RF), Decision Trees (DT), Dynamic Time Warping (DTW), K-Nearest Neighbor (KNN), Fuzzy Logic (FL), Logistic Regression (LR), Bayesian Network (BN), Least-Squares (LS), Conditional Random Field (CRF), Factorial Conditional Random Field (FCR), Conditional Clauses (CC), Gaussian Mixture Models (GMM), Template Matching (TM), Dynamic Bayesian Mixture Model (DBMM), Emerging Patterns (EP), Gradient-Boosted Trees (GBT), Sparsity Concentration Index (SCI)
CNN | Convolutional Neural Networks
tCNN | Temporal CNNs and Dilated TCNNs (DTCNN)
rCNN | Recurrent Neural Networks, e.g., GRU, LSTM, Bidirectional LSTM
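Most of the surveyed pipelines rely on the sliding-window segmentation listed in the scheme above. Purely as an illustration (not code from any of the surveyed publications), the following Python sketch cuts a continuous labelled recording into overlapping windows; the 128-sample window, 50% overlap, and majority-vote labelling are assumptions chosen for the example.

```python
import numpy as np

def sliding_windows(signal, labels, win_len=128, overlap=0.5):
    """Cut a continuous recording (n_samples, n_channels) into fixed-length,
    overlapping windows; each window inherits the majority label of its samples."""
    step = max(1, int(win_len * (1 - overlap)))
    windows, window_labels = [], []
    for start in range(0, len(signal) - win_len + 1, step):
        windows.append(signal[start:start + win_len])
        window_labels.append(np.bincount(labels[start:start + win_len]).argmax())
    return np.stack(windows), np.array(window_labels)

# Example: 60 s of tri-axial accelerometer data at 50 Hz with per-sample labels.
acc = np.random.randn(3000, 3)
lab = np.random.randint(0, 4, size=3000)
X, y = sliding_windows(acc, lab)   # X: (n_windows, 128, 3), y: (n_windows,)
```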
Stage | No. of Publications
(I) Keywords | 1243
(II) Title | 524
(III) Abstract | 263
(IV) Full Text | 52
General Information | Domain | Activity | Attachment | Dataset | DP | Shallow Method | DL
Ref. | Year | Author | FWCI | P+L | Other | Work | Exercises | Locomotion | ADL | Arm | Hand | Leg | Foot | Torso | Head | Smartphone | Repository | Individual | Laboratory | Real-Life | Pre.-Pr. | Segm. | FE-Stat.Feat. | FE-Others | FR | CL-NB | CL-HMMs | CL-SVM | CL-MLP | CL-Others | CNN | tCNN | rCNN
[ ]2009Xi Long et al.12.48 x xx x x xxxxxxx
[ ]2010Altun and Barshan7.95 x xx xx x x x x xx LS, KNN
[ ]2010Altun et al.4.60 x xx xx x x x xx x xBDM,LSM,KNN,DTW
[ ]2010Khan et al.7.38 x x x x xx x xLDA, KDA
[ ]2010Kwapisz et al.- x x x x x xx xDT, LR
[ ]2010Wang et al.4.62 x x x x x xx x FCR
[ ]2011Casale et al.8.02 x xx x x x x RF
[ ]2011Gu et al.5.16 x x x x xx x EP
[ ]2011Lee and Cho13.37 x x x x x x x
[ ]2012Anguita et al.35.00 x x x x x xxxx x
[ ]2012Deng et al.0.58 x xx xxxxxx xxx x GM, DTW
[ ]2012Lara and Labrador7.53 x x x x x xxx x DT
[ ]2012Lara et al.18.74 x x x x x xx x xBN, DT, LR
[ ]2012Siirtola and Röning- x xx x x xxxx QDA,KNN,DT
[ ]2013Koskimäki et al.1.20x x x xx xx KNN
[ ]2013Shoaib et al.9.30 x x xx x x x xxxx x x LR,KNN,DT
[ ]2013Zhang and Sawchuk6.37 x x x x x x xxSCI
[ ]2014Bayat et al.11.96 x xx x x xxxx xxRF, LR
[ ]2014Bulling et al.64.63 x xxx x xxxxx xxx KNN boosting
[ ]2014Garcia-Ceja et al.2.52 x xx x xxxx x CRF
[ ]2014Gupta and Dallas9.06 x x x x x xx x KNN
[ ]2014Kwon et al.5.89 x xx x x x xx GMM
[ ]2014Zeng et al.39.20xxx xxxxxxx xx xx x x
[ ]2015Aly and Ismail0.00 x xxx x xx x xx xx x CC
[ ]2015Bleser et al.2.95 x xxxx xx x x
[ ]2015Chen and Xue20.10 x x x x x x x x x DBMx
[ ]2015Guo and Wang3.56 x xx x xx x xxxx xx x KNN, DT
[ ]2015Zainudin et al.6.95 x x xx x xx xDT, LR
[ ]2016Ayachi et al.1.59 x xxxxxxxx xx
[ ]2016Fallmann and Kropf- x xxxx x xxx xxxx x
[ ]2016Feldhorst et al.1.31x x x x x x x xx x x RF
[ ]2016Hammerla et al.23.99 x xxxxxxxx x xxxx x xxx
[ ]2016Liu et al.30.36 x xxxxxxx x x x x x KNN
[ ]2016Margarito et al.5.24 x xx x x x xx DTW, TM
[ ]2016Ordóñez and Roggen42.72xxx xxxxxxx xx xxxx xx
[ ]2016Reyes-Ortiz et al.12.46 x xxxxxxxxxxx xxxx x
[ ]2016Ronao and Cho24.82 x x x x x x x
[ ]2016Ronao and Cho4.89 x x x xx x xx x x
[ ]2017Song-Mi Lee et al.10.91 x x x x x x x
[ ]2017Scheurer et al.7.06 x x x xx x x x GBT, KNN
[ ]2017Vital et al.0.93 x xxxxxxxx xx xxx xxDBMM
[ ]2018Chen et al.1.77 x x x xx x xx x x
[ ]2018Moya Rueda et al.3.69xxxxxxxxxxx xxxxxx x
[ ]2018Nair et al.0.00 x x x xx xxx x
[ ]2018Reining et al.0.00x x x xxxxxx xx xx x
[ ]2018Tao et al.0.00x x x xx xx x
[ ]2018Wolff et al.0.00 x xxx x x xxx x
[ ]2018Xi et al.1.61 x xxxxxxxx x xx x x
[ ]2018Xie et al.6.43 x x x x x xx x RF
[ ]2018Yao et al.3.53 x xxxxxxx xxxxx x
[ ]2018Zhao and Obonyo0.00x x x x x xx xx xx xx x KNN
[ ]2018Zhu et al.0.00 x x x xx x xx x RF,KNN,LR x
Ref. | Name | Utilised in
[ ] | Actitracker from Wireless Sensor Data Mining (WISDM) | [ ]
[ ] | Activity Prediction from Wireless Sensor Data Mining (WISDM) | [ ]
[ ] | Daphnet Gait dataset (DG) | [ ]
[ ] | Mocap Database HDM05 | [ ]
[ ] | Realistic Sensor Displacement Benchmark Dataset (REALDISP) | [ ]
[ ] | Smartphone-Based Recognition of Human Activities and Postural Transitions Dataset | [ ]
[ ] | USC-SIPI Human Activity Dataset (USC-HAD) | [ ]
[ ] | Wearable Action Recognition Database (WARD) | [ ]
Ref. | Name | Description | Utilised in
[ ] | Opportunity | Published in 2012, this dataset contains recordings from wearable, object, and ambient sensors in a room simulating a studio flat. Four subjects were asked to perform early morning cleanup and breakfast activities. | [ , , , , , , ]
[ ] | Human Activity Recognition Using Smartphones Data Set | This dataset from 2012 contains smartphone recordings. Thirty subjects aged 19 to 48 performed six different locomotion activities while wearing a smartphone on the waist. | [ , , , , ]
[ ] | PAMAP2 | Published in 2012, this dataset provides recordings from three IMUs and a heart rate monitor. Nine subjects performed twelve different household, sports and daily living activities. Some subjects performed further optional activities. | [ , , , , ]
[ ] | Hand Gesture | This dataset from 2013 contains 70 minutes of arm movements per subject from eight ADLs as well as from playing tennis. The two recorded subjects were equipped with three IMUs on the right hand and arm. | [ , , ]
[ ] | Skoda | This dataset from 2008 contains ten manipulative gestures performed by a single worker in a car maintenance scenario. Twenty accelerometers were used for recording. | [ , ]
1202020202525304030505050505098100100100126300
10–20355255100.721.9–7.51.671.281.2–1.322.565,12,200.52.561.28460.67
-3350-50--50-5050–755050505050505050505
Domain | Features | Definitions | Publications
Time | Variance | Arithmetic variance | [ ]
Mean | Arithmetic mean | [ ]
Pairwise Correlation | Correlation between every pair of axes | [ ]
Minimum | Smallest value in the window | [ ]
Maximum | Largest value in the window | [ ]
Energy | Average sum of squares | [ ]
Signal Magnitude Area | [ ]
IQR | Interquartile range | [ ]
Root Mean Square | Square root of the mean of the squared values | [ ]
Kurtosis | [ ]
Skewness | [ ]
MinMax | Difference between the maximum and the minimum in the window | [ ]
Zero Crossing Rate | Rate of sign changes | [ ]
Average Absolute Deviation | Mean absolute deviation from a central point | [ ]
MAD | Median absolute deviation | [ ]
Mean Crossing Rate | [ ]
Slope | Sen’s slope for a series of data | [ ]
Log-Covariance | [ ]
Norm | Euclidean norm | [ ]
APF | Average number of occurrences of peaks | [ ]
Variance Peak Frequency | Variance of APF | [ ]
Correlation Pearson Coefficient | [ ]
Angle | Angle between mean signal and vector | [ ]
Time Between Peaks | Time [ms] between peaks | [ ]
Binned Distribution | Quantisation of the difference between the maximum and the minimum | [ ]
Median | Middle value in the window | [ ]
Five different Percentiles | Observations in five different percentiles | [ ]
Sum and Square Sum in Percentiles | Sum and square sum of observations above/below a certain percentile | [ ]
ADM | Average derivative of the magnitude | [ ]
Frequency | Entropy | Normalised information entropy of the discrete FFT component magnitudes of the signal | [ ]
Signal Energy | Sum of squared signal amplitudes | [ ]
Skewness | Asymmetry of the distribution | [ ]
Kurtosis | Heavy-tailedness of the distribution | [ ]
DC Component of FFT and DCT | [ ]
Peaks of the DFT | First 5 peaks of the FFT | [ ]
Spectral | [ ]
Spectral centroid | Centroid of a given spectrum | [ ]
Frequency Range Power | Sum of absolute amplitudes of the signal | [ ]
Cepstral coefficients | Mel-Frequency Cepstral Coefficients | [ ]
Correlation | [ ]
maxFreqInd | Largest frequency component | [ ]
MeanFreq | Weighted average of the frequency signal | [ ]
Energy Band | Spectral energy of a frequency band | [ ]
PPF | Peak Power Frequency | [ ]
Domain | Features | Definitions | Publications
Spatial | Gravity variation | Gravity acceleration computed using the harmonic mean of the acceleration along the three axes (x, y, z) | [ ]
Eigenvalues of Dominant Directions | [ ]
Structural | Trend | [ , ]
Magnitude of change | [ , ]
Time | Autoregressive Coefficients | [ ]
Kinematics | User step frequency | Number of detected steps per unit time | [ ]
Walking Elevation | Correlation between the acceleration along the y-axis and the gravity acceleration or the acceleration along the z-axis | [ ]
Correlation Hand and Foot | Acceleration correlation between wrist and ankle | [ ]
Heel Strike Force | Mean and variance of the heel strike force, computed using dynamics | [ ]
Average Velocity | Integral of the acceleration | [ ]
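To make the tabulated features concrete, the sketch below (an illustration only, not code from any of the reviewed papers) computes a handful of the listed time- and frequency-domain features for a single window of tri-axial accelerometer data; the window shape and the small numerical constants are assumptions for the example.

```python
import numpy as np

def window_features(w):
    """A few of the tabulated features for one window w of shape (n_samples, 3)."""
    feats = {
        "mean": w.mean(axis=0),                        # arithmetic mean per axis
        "variance": w.var(axis=0),                     # arithmetic variance per axis
        "rms": np.sqrt((w ** 2).mean(axis=0)),         # root mean square per axis
        "iqr": np.percentile(w, 75, axis=0) - np.percentile(w, 25, axis=0),
        "min_max": w.max(axis=0) - w.min(axis=0),      # MinMax
        "sma": np.abs(w).sum(axis=1).mean(),           # signal magnitude area
        "zero_crossing_rate": ((w[:-1] * w[1:]) < 0).mean(axis=0),
        "pairwise_corr": np.corrcoef(w.T)[np.triu_indices(3, k=1)],
    }
    # Frequency domain, computed on the de-meaned magnitude signal.
    mag = np.linalg.norm(w, axis=1)
    spec = np.abs(np.fft.rfft(mag - mag.mean())) ** 2
    p = spec / (spec.sum() + 1e-12)
    feats["signal_energy"] = spec.sum() / len(mag)     # sum of squared amplitudes
    feats["spectral_entropy"] = -(p * np.log2(p + 1e-12)).sum() / np.log2(len(p))
    return feats

# Example use on one 2.56 s window sampled at 50 Hz.
features = window_features(np.random.randn(128, 3))
```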
Metric | # of Publications
Accuracy | 38
Precision | 12
Recall | 11
Weighted F1 | 5
Mean F1 | 6
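For reference, the sketch below (illustrative only; the class count and the tie-handling constants are assumptions) shows how the tabulated metrics relate to one another: accuracy and per-class precision and recall are derived from a confusion matrix, and the mean (macro) F1 differs from the weighted F1 only in whether classes are averaged with or without their support.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Accuracy, mean (macro) F1 and support-weighted F1 from a confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # column sums = predicted counts
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # row sums = true counts (support)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    support = cm.sum(axis=1)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "mean_f1": f1.mean(),                                 # unweighted average over classes
        "weighted_f1": (f1 * support).sum() / support.sum(),  # weighted by class support
    }
```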

Reining, C.; Niemann, F.; Moya Rueda, F.; Fink, G.A.; ten Hompel, M. Human Activity Recognition for Production and Logistics—A Systematic Literature Review. Information 2019 , 10 , 245. https://doi.org/10.3390/info10080245


Human activity recognition in artificial intelligence framework: a narrative review

  • Published: 18 January 2022
  • Volume 55 , pages 4755–4808, ( 2022 )


  • Neha Gupta 1 , 3 ,
  • Suneet K. Gupta 1 ,
  • Rajesh K. Pathak 2 ,
  • Vanita Jain 3 ,
  • Parisa Rashidi 4 &
  • Jasjit S. Suri   ORCID: orcid.org/0000-0001-6499-396X 5 , 6  

35k Accesses

130 Citations

6 Altmetric

Explore all metrics

Human activity recognition (HAR) has multifaceted applications owing to the worldwide availability of acquisition devices such as smartphones and video cameras and to its ability to capture human activity data. While electronic devices and their applications are steadily growing, advances in artificial intelligence (AI) have revolutionized the ability to extract deep hidden information for accurate detection and interpretation. This calls for a better understanding of the rapidly growing acquisition devices, AI, and applications, the three pillars of HAR, under one roof. Many review articles have been published on the general characteristics of HAR, a few have compared all HAR devices at the same time, and few have explored the impact of evolving AI architectures. In the proposed review, a detailed narration on the three pillars of HAR is presented, covering the period from 2011 to 2021. Further, the review presents recommendations for improved HAR design, reliability, and stability. Five major findings were: (1) HAR constitutes three major pillars: devices, AI, and applications; (2) HAR has dominated the healthcare industry; (3) hybrid AI models are in their infancy and need considerable work to provide stable and reliable designs; further, these trained models need solid prediction, high accuracy, and generalization, and must meet the objectives of the applications without bias; (4) little work was observed in abnormality detection during actions; and (5) almost no work has been done in forecasting actions. We conclude that (a) the HAR industry will evolve in terms of the three pillars of electronic devices, applications, and the type of AI, and (b) AI will provide a powerful impetus to the HAR industry in the future.

Similar content being viewed by others


An Overview of Human Activity Recognition Using Wearable Sensors: Healthcare and Artificial Intelligence


Human Activity Recognition Using Wearable Sensors: Review, Challenges, Evaluation Benchmark


Human Activity Recognition (HAR) Using Deep Learning: Review, Methodologies, Progress and Future Research Directions


1 Introduction

Human activity recognition (HAR) can be described as the art of identifying and naming activities using Artificial Intelligence (AI) from raw activity data gathered by various sources (so-called devices). Examples of such devices include wearable sensors (Pham et al. 2020 ), electronic device sensors like smartphone inertial sensors (Qi et al. 2018 ; Zhu et al. 2019 ), camera devices like Kinect (Wang et al. 2019a ; Phyo et al. 2019 ), closed-circuit television (CCTV) (Du et al. 2019 ), and some commercial off-the-shelf (COTS) equipment (Ding et al. 2015 ; Li et al. 2016 ). The use of diverse sources makes HAR important for multifaceted application domains, such as healthcare (Pham et al. 2020 ; Zhu et al. 2019 ; Wang et al. 2018 ), surveillance (Thida et al. 2013 ; Deep and Zheng 2019 ; Vaniya and Bharathi 2016 ; Shuaibu et al. 2017 ; Beddiar et al. 2020 ), remote care to elderly people living alone (Phyo et al. 2019 ; Deep and Zheng 2019 ; Yao et al. 2018 ), smart home/office/city (Zhu et al. 2019 ; Deep and Zheng 2019 ; Fan et al. 2017 ), and various monitoring applications like sports and exercise (Ding et al. 2015 ). The widespread use of HAR benefits the safety and quality of life of humans (Ding et al. 2015 ; Chen et al. 2020 ).

Devices like sensors, video cameras, radio frequency identification (RFID), and Wi-Fi are not new, but their usage in HAR is in its infancy. The reason for HAR’s evolution is the fast growth of techniques such as AI, which enables the use of these devices in various application domains (Suthar and Gadhia 2021 ). Therefore, we can say that there is a mutual relationship between AI techniques or AI models and HAR devices. Earlier, these models were based on a single image or a small sequence of images, but the advancements in AI have provided more opportunities. According to our observations (Chen et al. 2020 ; Suthar and Gadhia 2021 ; Ding et al. 2019 ), the growth of HAR is directly proportional to the advancement of AI, which expands the scope of HAR in various application domains.

The introduction of deep learning (DL) in the HAR domain has simplified the task of extracting meaningful features from raw sensor data. The evolution of DL models, namely (1) convolutional neural networks (CNN) (Tandel et al. 2020 ), (2) transfer learning schemes (which allow knowledge reusability: a recognition model is trained on one set of data and the trained knowledge can then be reused on a different testing dataset) such as Inception (Szegedy et al. 2015 , 2016 , 2017 ), VGG-16 (Simonyan and Zisserman 2015 ), and Residual Neural Networks (ResNet-50) (Nash et al. 2018 ), (3) a series of hybrid DL models such as the fusion of CNN with long short-term memory (LSTM) or Inception with ResNets (Yao et al. 2017 , 2019 , 2018 ; Buffelli and Vandin 2020 ), (4) loss function designs such as cross-entropy, Kullback–Leibler divergence, and Tversky (Janocha and Czarnecki 2016 ; Wang et al. 2020a ), and (5) optimization paradigms such as stochastic gradient descent (SGD) (Soydaner 2020 ; Sun et al. 2020 ), has made HAR design largely plug-and-play. Even though such designs are becoming black-box oriented, a better understanding is required to ensure that this three-legged stool of devices, AI, and applications is stable and effective.
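As an illustration of the hybrid DL models mentioned above, the following PyTorch sketch fuses a small 1D CNN with an LSTM and trains it with a cross-entropy loss and SGD; the layer sizes, window length, and number of classes are arbitrary assumptions for the example, not an implementation from any of the cited works.

```python
import torch
import torch.nn as nn

class ConvLSTMHAR(nn.Module):
    """Hybrid CNN-LSTM for windowed inertial data shaped (batch, channels, time)."""
    def __init__(self, n_channels=3, n_classes=6, hidden=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                      # x: (batch, channels, time)
        z = self.features(x)                   # (batch, 64, time / 4)
        z = z.permute(0, 2, 1)                 # (batch, time / 4, 64) for the LSTM
        out, _ = self.lstm(z)
        return self.classifier(out[:, -1, :])  # last time step -> class logits

model = ConvLSTMHAR()
x = torch.randn(8, 3, 128)                     # e.g., 8 windows of 128 tri-axial samples
logits = model(x)                              # (8, 6)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 6, (8,)))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss.backward()
optimizer.step()
```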

Typically, HAR consists of four stages (Fig.  1 ): (1) capturing of the activity signal, (2) data pre-processing, (3) AI-based activity recognition, and (4) the user interface for the management of HAR. Each stage can be implemented using several techniques, giving the HAR system designer multiple choices. Thus, the choice of the application domain, the type of data acquisition device, and the AI algorithms used for activity detection makes the design even more challenging.

Figure 1: Four stages of the HAR process (Hx et al. 2017 ).

Numerous reviews in HAR have been published, but our observations show that most of the studies are associated with either vision-based (Beddiar et al. 2020 ; Dhiman Chhavi 2019 ; Ke et al. 2013 ) or sensor-based (Carvalho and Sofia 2020 ; Lima et al. 2019 ), while very few have considered RFID-based and device-free HAR. Further, there is no AI review article that covers the detailed analysis of all the four device types that includes all four types of devices such as sensor-based (Yao et al. 2017 , 2019 ; Hx et al. 2017 ; Hsu et al. 2018 ; Xia et al. 2020 ; Murad and Pyun 2017 ), vision-based (Feichtenhofer et al. 2018 ; Simonyan and Zisserman 2014 ; Newell Alejandro 2016 ; Crasto et al. 2019 ), RFID-based (Han et al. 2014 ), and device-free (Zhang et al. 2011 ).

An important observation to note here is that technology has advanced in the field of AI, i.e., deep learning (Agarwal et al. 2021 ; Skandha et al. 2020 ; Saba et al. 2021 ) and machine learning methods (Hsu et al. 2018 ; Jamthikar et al. 2020 ) and is revolutionizing the ability to extract deep hidden information for accurate detection and interpretation. Thus, there is a need to understand the role of these new paradigms that are rapidly changing HAR devices. This puts the requirement to consider a review inclined to address simultaneously changing AI and HAR devices. Therefore, the main objective of this study is to better understand the HAR framework while integrating devices and application domains in the specialized AI framework. What types of devices can fit in which type of application, and what attributes of the AI can be considered during the design of such (Agarwal et al. 2021 ) a framework are some of the issues that need to be explored. Thus, this review is going to illustrate how one can select such a combination by first understanding the types of HAR devices, and then, the knowledge-based infrastructure in the fast-moving world of AI, knowing that some of such combinations can be transformed into different applications (domains).

The proposed review is structured as follows: Sect.  2 covers the search strategy, and literature review with statistical distributions of HAR attributes. Section  3 illustrates the description of the HAR stages, HAR devices, and HAR application domains in the AI framework. Section  4 illustrates the role of emerging AI as the core of HAR. Section  5 presents performance evaluation criteria in the HAR and integration of AI in HAR devices. Section  6 consists of a critical discussion on factors influencing HAR, benchmarking of the study against the previous studies, and finally, the recommendations. Section  7 finally concludes the study.

2 Search strategy and literature review

“Google Scholar” was used to search for articles published between 2011 and the present. The search included the keywords “human activity recognition” or “HAR” in combination with the terms “machine learning”, “deep learning”, “sensor-based”, “vision-based”, “RFID-based” and “device-free”. Figure  2 shows the PRISMA diagram with the criteria for the selection of HAR articles. We identified around 1548 articles in the last 10-year period, which were then short-listed to 175 articles based on three major assessment criteria: the AI models used, the target application domain, and the data acquisition devices, which are the three main pillars of the proposed review. Based on these criteria we formed two clusters of attributes. Cluster 1 includes 7 HAR device- and application-based attributes, and cluster 2 includes 7 AI attributes. The HAR device- and application-based attributes are: data source, #activities, datasets, subjects, scenarios, total #actions, and performance evaluation, while the AI attributes include: #features, feature extraction, ML/DL model, architecture, metrics, validation, and hyperparameters/optimizer/loss function. The description of the HAR device- and application-based attributes is given in Sect.  3.2 . Further, Tables A.1 , A.2 , A.3 and A.4 of "Appendix 1 " illustrate these attributes for the various studies considered in the proposed review. Cluster 2’s AI attributes are discussed in Sect.  4.2 , and Tables 3 , 4 , 5 and 6 provide insight into the AI models adopted by researchers in their HAR models. Apart from the three major criteria, three exclusion and four inclusion criteria were also followed in the selection of research articles. Excluded were (1) articles with traditional and older AI techniques, (2) non-relevant articles, and (3) articles with insufficient data. These exclusion criteria removed 991, 125, and 54 articles (marked as E1, E2, and E3 in the PRISMA flowchart), leading to the finalization of the 175 articles. Included were (1) non-redundant articles, (2) articles after detailed screening of abstract and conclusion, (3) articles meeting the eligibility criteria of advanced AI techniques, target domain, and device type, and (4) articles passing qualitative synthesis, including the impact factor of the journal and the authors’ contribution to the HAR domain (marked as I1, I2, I3, and I4 in the PRISMA flowchart).

Figure 2: PRISMA model for the study selection.

In the proposed review, we performed a rigorous analysis of the HAR framework in terms of AI techniques, device types, and application domains. One of the major observations of the proposed study is the existence of a mutual relationship between HAR device types and AI techniques. First, the analysis of HAR devices is presented in Fig.  3 a, which is based on the articles considered between 2011 and 2021 and shows the changing pattern of HAR devices over time. Secondly, the growth of ML and DL techniques is presented in Fig.  3 b, which shows that HAR is trending towards the use of DL-based techniques. The distribution of HAR devices is elaborated further in Fig.  4 a; Fig.  4 b shows the further categorization of sensor-based HAR into wearable sensors (WS) and smartphone sensors (SPS). Figure  4 c shows the division of vision-based HAR into video- and skeleton-based models. Further, Fig.  4 d shows the types of HAR application domains.

Figure 3: a Changing pattern of HAR devices over time; b distribution of machine learning (ML) and deep learning (DL) articles in the last decade.

Figure 4: a Types of HAR devices; b sensor-based devices; c vision-based devices; d HAR applications. WS: wearable sensors, SPS: smartphone sensor, sHome: smart home, mHealthcare: health care monitoring, cSurv: crowd surveillance, fDetect: fall detection, eMonitor: exercise monitoring, gAnalysis: gait analysis.

Observation 1

According to the device-wise analysis in Fig.  3 a, vision-based HAR was popular between 2011 and 2016. From 2017 onwards, the growth of sensor-based models became more prominent, which is the same period in which DL techniques entered the HAR domain (Fig.  3 b). In the period 2017–2021, Wi-Fi devices also emerged as one of the data sources for gathering activity data.

Observation 2

Figure  3 b shows the year-wise distribution of articles published using ML and DL techniques. The key observation is the transition of AI techniques from ML to DL. From 2011 to 2016, HAR models with an ML framework were popular, while HAR models using DL techniques started to evolve from 2014. In the last three years, this growth has increased significantly. Therefore, after analysing the graphs of Fig.  3 a, b thoroughly, we can say that HAR devices are evolving as the trend shifts towards the DL framework. This combined analysis supports our claim of a mutual relationship between AI and device types.

Devices used in the HAR paradigm are the premier component by which HAR can be classified. We observed a total of 9 review articles, arranged in chronological order (see Table 1 ). These reviews focused mainly on three sets of devices: sensor-based (marked in a light shade color) (Carvalho and Sofia 2020 ; Lima et al. 2019 ; Wang et al. 2016a , 2019b ; Lara and Labrador 2013 ; Hx et al. 2017 ; Demrozi et al. 2020 ; Crasto et al. 2019 ; De-La-Hoz-Franco et al. 2018 ), vision-based (marked in a dark shade color) (Beddiar et al. 2020 ; Dhiman Chhavi 2019 ; Ke et al. 2013 ; Obaida and Saraee 2017 ; Popoola and Wang 2012 ), and device-free HAR (Hussain et al. 2020 ). Table 1 summarizes the nine articles based on the focus area, keywords, number of keywords, research period, and #citations. Note that sensor-based HAR captures activity signals using ambient and embedded sensors, vision-based HAR gathers 3-dimensional (3D) activity data using a 3D camera or depth camera, and device-free HAR captures activity data using Wi-Fi transmitter–receiver units.

3 HAR process, HAR devices, and HAR applications in AI framework

The objective of developing HAR models is to provide information about human actions, which helps in analyzing the behavior of a person in a real environment. It allows computer-based applications to help users perform tasks and to improve their lifestyle, for example through remote care for the elderly living alone and posture monitoring during exercise. This section presents the HAR framework, which includes HAR stages, HAR devices, and target application domains.

3.1 HAR process

There are four main stages in the HAR process: data acquisition, pre-processing, model training, and performance evaluation (Figure S.1(a) in the supporting document). In stage 1 , depending on the target application, a HAR device is selected. For example, in surveillance applications involving multiple persons, the HAR device for data collection is the camera. Similarly, for applications that involve monitoring a person's daily activity, the preferred data acquisition source is a sensor. One can also use a camera, but it breaches the user's privacy and incurs high computational cost. Table 2 illustrates the variation in HAR devices according to the application domain and describes diverse HAR applications in terms of various data sources and AI techniques. Note that the acquired data sometimes suffer from noise or other unwanted signals and therefore pose challenges for post-processing AI-based systems. Thus, it is very important to have a robust feature extraction system with a robust network for better prediction. In stage 2 , data cleaning is performed, which involves low-pass or high-pass filters for noise suppression or image enhancement (Suri 2013 ; Sudeep et al. 2016 ). These data undergo regional and boundary segmentation (Multi Modality State-of-the-Art Medical Image Segmentation and 2011 ; Suri et al. 2002 ; Suri 2001 ). Our group has published several dedicated monographs on segmentation paradigms that are available as a ready reference (Suri 2004 , 2005 ; El-Baz and Jiang 2016 ; El-Baz and Suri JS 2019 ). The segmented data can now be used for model training. Stage 3 involves the training of the HAR model using ML or DL techniques. When using hand-crafted features, one can use ML-based techniques (Maniruzzaman et al. 2017 ); for automated feature extraction, one can use the DL framework. Apart from automatic feature learning, DL offers knowledge reusability through transfer learning models, exploration of huge datasets (Biswas et al. 2018 ), and the use of hybrid DL models, which allows the identification and learning of spatial as well as temporal features. After stage 3, the HAR model is ready to be used for an application or prediction. Stage 4 is the most challenging part since the model is applied to real data, whose behavior varies depending on physical factors like age, physique, and the approach taken to perform a task. A HAR model is efficient if its performance is independent of these physical factors.
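To tie the four stages together, here is a minimal end-to-end sketch (purely illustrative; the synthetic data, window parameters, and the choice of a random forest are assumptions, not a method prescribed by the studies discussed here) that segments a labelled sensor stream, extracts simple per-window statistics, trains a shallow classifier, and reports the stage-4 evaluation metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def make_windows(stream, labels, win=128, step=64):
    """Stage 2: segment a filtered (n_samples, 3) stream into overlapping windows
    and summarise each window with simple per-axis statistics."""
    X, y = [], []
    for s in range(0, len(stream) - win + 1, step):
        seg = stream[s:s + win]
        X.append(np.concatenate([seg.mean(0), seg.std(0), seg.min(0), seg.max(0)]))
        y.append(np.bincount(labels[s:s + win]).argmax())
    return np.array(X), np.array(y)

# Stage 1 placeholder: a synthetic accelerometer stream with per-sample activity labels.
stream = np.random.randn(10000, 3)
labels = np.random.randint(0, 6, size=10000)

# Stage 3: train a shallow classifier on the window features.
X, y = make_windows(stream, labels)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Stage 4: evaluate on held-out windows.
print(classification_report(y_te, clf.predict(X_te)))
```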

3.2 HAR devices

The HAR device type depends on the target application. Figure S.1(b) (Supporting document) presents the different sources for activity data: sensors, video cameras, RFID systems, and Wi-Fi devices.

The sensor-based approaches can be categorized into wearable sensors and device sensors. In the wearable sensor-based approach, a body-worn sensor module is designed that includes inertial sensors and environmental sensor units (Pham et al. 2017 , 2020 ; Hsu et al. 2018 ; Xia et al. 2020 ; Murad and Pyun 2017 ; Saha et al. 2020 ; Tao et al. 2016a , b ; Cook et al. 2013 ; Zhou et al. 2020 ; Wang et al. 2016b ; Attal et al. 2015 ; Chen et al. 2021 ; Fullerton et al. 2017 ; Khalifa et al. 2018 ; Tian et al. 2019 ). Because wearable sensor devices can sometimes be stressful for the user, smart-device sensors offer a solution. In the device sensor approach, data are captured using smartphone inertial sensors (Zhu et al. 2019 ; Yao et al. 2018 ; Wang et al. 2016a , 2019b ; Zhou et al. 2020 ; Li et al. 2019 ; Civitarese et al. 2019 ; Chen and Shen 2017 ; Garcia-Gonzalez et al. 2020 ; Sundaramoorthy and Gudur 2018 ; Gouineua et al. 2018 ; Lawal and Bano 2019 ; Bashar et al. 2020 ). The most commonly used sensors for HAR are the accelerometer and the gyroscope. Table A.1 of “Appendix 1 ” shows the types of data acquisition devices, activity classes, and scenarios in earlier sensor-based HAR models.

Video camera

Video cameras can be further classified into two types: 3D cameras and depth cameras. 3D camera-based HAR models use closed-circuit television (CCTV) cameras in the user’s environment to monitor the actions performed by the user. Usually, the monitoring task is performed by humans or by some innovative recognition model. Researchers have proposed numerous HAR models that can process and evaluate activity video or image data and recognize the performed activities (Wang et al. 2018 ; Feichtenhofer et al. 2018 , 2017 , 2016 ; Diba et al. 2016 , 2020 ; Yan et al. 2018 ; Chong and Tay 2017 ). The accuracy of activity recognition from 3D camera data depends on physical factors such as lighting and background color. A solution to this issue is the use of a depth camera (like Kinect). The Kinect camera provides different data streams such as depth, RGB, and audio. The depth stream captures body joint coordinates, and based on these joint coordinates a skeleton-based HAR model can be developed. Skeleton-based HAR models have applications in domains that involve posture recognition (Liu et al. 2020 ; Abobakr et al. 2018 ; Akagündüz et al. 2016 ). Table A.2 of “Appendix 1 ” provides an overview of earlier vision-based HAR models. Apart from 3D and depth cameras, one can use thermal cameras, but they can be expensive.

RFID tags and readers

By installing passive RFID tags in close proximity to the user, activity data can be collected using RFID readers. Compared with active RFID tags, passive tags have a longer operational life, as they do not need a separate battery; instead, they harvest the reader’s energy and convert it into an electrical signal to operate their circuitry. However, the range of active tags is greater than that of passive tags. Both can be used for HAR models (Du et al. 2019 ; Ding et al. 2015 ; Li et al. 2016 ; Yao et al. 2018 ; Zhang et al. 2011 ; Xia et al. 2012 ; Fan et al. 2019 ). A further description of existing RFID-based HAR models is provided in Table A.3 of “Appendix 1 ”.

Wi-Fi device: In the last 5 years, device-free HAR has gained popularity. Researchers have explored the possibility of capturing activity signals using Wi-Fi devices, with channel state information (CSI) from the wireless signal used to acquire the activity data. Many models have been developed for fall detection and gait recognition using CSI (Yao et al. 2018; Wang et al. 2019c, d, 2020b; Zou et al. 2019; Yan et al. 2020; Fei et al. 2020). A description of popular existing Wi-Fi device-based HAR models is provided in Table A.4 of "Appendix 1".

Summary of challenges in HAR devices

There are four main types of HAR devices, and researchers have proposed various HAR models with advanced AI techniques for each. The use of electronic devices for gathering activity data in the HAR domain is steadily increasing, but with this growth the challenges are also evolving: (1) video camera-based applications gather data with video cameras, which invades the user's privacy and requires high-powered systems to process the large volumes of data produced; (2) in sensor-based HAR models, wearable devices can be stressful and inconvenient for the user, so smartphone sensors are preferable, but smartphones and smartwatches are currently limited to recognizing simple activities such as walking, sitting, and going upstairs; (3) in RFID tag-and-reader-based HAR models, RFID-based activity capture is limited to indoor settings; and (4) Wi-Fi-based HAR models are new to the HAR industry and still have open issues: they can capture activities performed within Wi-Fi range but cannot identify movement in blind-spot areas.

3.3 HAR applications using AI

In the last decade, researchers have developed various HAR models for different domains. "Which type of HAR device is suitable for which application domain, and what is the suitable AI methodology?" is the central question when developing a HAR framework. Table 2 describes diverse HAR applications together with their data sources and AI techniques, showing how the HAR device and AI technique vary with the application domain. The pie chart in Fig. 4d shows the distribution of applications across the existing articles. HAR is used in fields such as:

Crowd surveillance (cSurv): Crowd pattern monitoring and detecting panic situations in the crowd.

Health care monitoring (mHealthcare): Assistive care for ICU patients and trauma resuscitation.

Smart home (sHome): Care for elderly or dementia patients and child activity monitoring.

Fall detection (fDetect): Detection of abnormality in action which results in a person's fall.

Exercise monitoring (eMonitor): Pose estimation while doing exercise.

Gait analysis (gAnalysis): Analyze gait patterns to monitor health problems.

3.4 HAR applications with different activity-types

There is no predefined set of activities; rather, the activity types vary according to the application domain. Figure S.2 (Supporting document) shows the activity types involved in human activity recognition.

Single person activity

Here the action is performed by a single person. Figure S.3 (Supporting document) shows examples of single-person activities (jumping jacks, baby crawling, punching a boxing bag, and handstand walking). Single-person actions can be divided into the following categories:

Behavior: The goal of behavior recognition is to recognize a person's behavior from activity data, which is useful in monitoring applications such as dementia patient and child behavior monitoring (Han et al. 2014; Nam and Park 2013; Arifoglu and Bouchachia 2017).

Gestures: Gesture recognition has applications in sign language recognition for differently-abled persons, for which wearable sensor-based HAR models are more suitable (Sreekanth and Narayanan 2017; Ohn-Bar and Trivedi 2014; Xie et al. 2018; Kasnesis et al. 2017; Zhu and Sheng 2012).

Activity of daily living (ADL) and ambient assistive living (AAL): ADL activities are performed in an indoor environment, such as cooking, sleeping, and sitting. In a smart home, ADL monitoring for dementia patients can be performed using wireless sensor-based HAR models (Nguyen et al. 2017; Sung et al. 2012) or RFID tag-based HAR models (Ke et al. 2013; Oguntala et al. 2019; Raad et al. 2018; Ronao and Cho 2016). AAL-based models help elderly and disabled people by providing remote care, medication reminders, and management (Rashidi and Mihailidis 2013). CCTV cameras are an ideal choice, but they raise privacy issues (Shivendra shivani and Agarwal 2018); therefore, sensor- or RFID-based HAR models (Parada et al. 2016; Adame et al. 2018) or wearable sensor-based models are more suitable (Azkune and Almeida 2018; Ehatisham-Ul-Haq et al. 2020; Magherini et al. 2013).

Multiple person activity

The action is performed by a group of persons. Multiple-person movement is illustrated in Figure S.4 (Supporting document), which depicts normal human movement on a pedestrian pathway and the anomalous activity of a cyclist and a truck on that pathway. It can belong to the following categories.

Interaction: Interactions include human–object activities (cooking, reading a book) (Kim et al. 2019; Koppula et al. 2013; Ni et al. 2013; Xu et al. 2017) and human–human activities such as a handshake (Weng et al. 2021). An example of human–object interaction is the free-weight exercise monitoring (FEMO) model, which uses RFID devices to monitor exercise via tags installed on dumbbells (Ding et al. 2015).

Group: This involves monitoring the number of people in an indoor environment such as a museum, or monitoring crowd patterns (Chong and Tay 2017; Xu et al. 2013). To count the number of people in an area, Wi-Fi units can be used: the received signal strength is user-sensitive and can therefore serve as a people-counting signal.

Observation 3: Vision-based HAR has broad application domains, but it has limitations such as privacy concerns and the need for more resources (such as GPUs). These issues can be overcome with sensor-based HAR, but its application domain is currently limited to single-person activity monitoring.

4 Core of the HAR system design: emerging AI

The foremost goal of HAR is to predict the movement or action of a person from action data collected by a data acquisition device. These movements include activities such as walking, exercising, and cooking. Predicting movements is challenging, as it involves huge amounts of unlabelled sensor data and video data that suffer from conditions such as lighting, background noise, and scale variation. To overcome these challenges, the AI framework offers numerous ML and DL techniques.

4.1 Artificial intelligence models in HAR

ML architectures: ML is a subset of AI that aims at developing intelligent models by extracting unique features that help recognize patterns in the input data (Maniruzzaman et al. 2018). There are two types of ML approaches: supervised and unsupervised. In the supervised approach, a mathematical model is created based on the relationship between raw input data and output data; the idea behind the unsupervised approach is to detect patterns in raw input data without prior knowledge of the output. Figure S.5 (Supporting document) illustrates the popular ML techniques used in recognizing human actions (Qi et al. 2018; Yao et al. 2019; Multi Modality State-of-the-Art Medical Image Segmentation and Registration Methodologies 2011). Our group has developed several applications of ML models for handling different diseases, such as diabetes management (Maniruzzaman et al. 2018), liver cancer (Biswas et al. 2018), thyroid cancer (Rajendra Acharya et al. 2014), ovarian cancer (Acharya et al. 2013a, 2015), prostate (Pareek et al. 2013), breast (Huang et al. 2008), skin (Shrivastava et al. 2016), arrhythmia classification (Martis et al. 2013), and recently cardiovascular disease (Acharya et al. 2012, 2013b). In the last 5 years, researchers' focus has shifted to semi-supervised learning, where the HAR model is trained on labelled as well as unlabelled data. The semi-supervised approach aims to label unlabelled data using knowledge gained from a set of labelled data: the HAR model is trained on popular labelled datasets together with new users' unlabelled test data, which is classified into activity classes according to the knowledge gained from the training data (Mabrouk et al. 2015; Cardoso and Mendes Moreira 2016).
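To make the supervised, hand-crafted-feature route concrete, a minimal scikit-learn sketch is given below; the features, synthetic data, and SVM settings are assumptions for illustration, not a recommended design.

```python
# A hedged sketch of the supervised ML route: hand-crafted features per window + SVM.
# The features, synthetic data and SVM settings are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def handcrafted_features(windows):
    """Per-window mean and standard deviation per axis, plus signal magnitude area."""
    mean = windows.mean(axis=1)
    std = windows.std(axis=1)
    sma = np.abs(windows).mean(axis=1).sum(axis=1, keepdims=True)
    return np.hstack([mean, std, sma])

windows = np.random.randn(400, 128, 3)           # (window, samples, axes)
labels = np.random.randint(0, 4, size=400)       # e.g. walk / sit / stand / stairs
X = handcrafted_features(windows)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X_tr, y_tr)
print("hold-out accuracy:", clf.score(X_te, y_te))
```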

DL/TL architectures: In recent years, DL has become quite popular owing to its capability of learning high-level features and its superior performance (Saba et al. 2019; Biswas et al. 2019). The basic idea behind DL is data representation, which enables it to produce optimal features; it learns unknown patterns from raw data without human intervention. The DL techniques used in HAR can be divided into three groups: deep neural networks (DNN), hybrid deep learning (HDL) models, and transfer learning (TL)-based models (Agarwal et al. 2021), as shown in Figure S.5 of the Supporting document. DNNs include models such as convolutional neural networks (CNN) (Deep and Zheng 2019; Liu et al. 2020; Zeng et al. 2014), recurrent neural networks (RNN) (Murad and Pyun 2017), and RNN variants, including long short-term memory (LSTM) and gated recurrent unit (GRU) networks (Zhu et al. 2019; Du et al. 2019; Fazli et al. 2021). In hybrid HAR models, a combination of CNN and RNN models is trained on spatio-temporal data; researchers have proposed various hybrid models in the last 5 years, such as DeepSense (Yao et al. 2017) and DeepConvLSTM (Wang et al. 2019a). Apart from hybrid AI models, various transfer learning-based HAR models build on pre-trained DL architectures such as ResNet-50, Inception-v3, and VGG-16 (Feichtenhofer et al. 2018; Newell Alejandro 2016; Crasto et al. 2019; Tran et al. 2019; Feichtenhofer and Ai 2019). However, the role of TL in sensor-based HAR is still evolving (Deep and Zheng 2019).

Figure 5a depicts a representative CNN architecture for HAR: two convolution layers followed by a pooling layer extract features from the activity image and reduce its dimensionality; a fully connected (FC) layer then performs iterative weight computations and a softmax layer makes the binary or granular decision, after which the input image is classified into an activity class. Figure 5b presents a representative TL-based HAR model, which includes pre-trained models such as VGG-16, Inception-v3, and ResNet. The pre-trained model is trained on a large dataset of natural images (e.g., people, cats, dogs, and food); these pre-trained weights are applied to the training data, a sequence of images, through an intermediate layer that forms the customized fully connected layer. The training weights are then fine-tuned using the optimizer function, and the retrained model is applied to the test data to classify each activity into an activity class.
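A minimal Keras sketch of the Fig. 5a topology is shown below, written here for windowed inertial data with 1-D convolutions; the layer sizes, window shape, and number of classes are assumptions, not the architecture of any particular cited model.

```python
# A minimal Keras sketch of the CNN topology in Fig. 5a, applied to windowed
# inertial data with 1-D convolutions. Layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

n_classes = 6
model = models.Sequential([
    layers.Input(shape=(128, 3)),                           # one window: 128 samples x 3 axes
    layers.Conv1D(32, kernel_size=5, activation="relu"),    # convolution layer 1
    layers.Conv1D(64, kernel_size=5, activation="relu"),    # convolution layer 2
    layers.MaxPooling1D(pool_size=2),                       # pooling for dimensionality reduction
    layers.Flatten(),
    layers.Dense(128, activation="relu"),                   # fully connected layer
    layers.Dense(n_classes, activation="softmax"),          # softmax decision layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```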

Miniaturized mobile devices are handy to use and offer a set of physiological sensors that can capture activity signals, but the captured data have a complex structure and strong inner correlations. Deep learning models that combine CNN and RNN components can explore this complex data and identify detailed features for activity recognition. One such model is DeepConvLSTM by Ordóñez and Roggen (2016), in which the CNN works as a feature extractor that represents the sensor input data as feature maps, and the LSTM layer explores the temporal dynamics of those feature maps. Yao et al. proposed a similar model named DeepSense, in which two convolution layers (individual and merge conv layers) and stacked GRU layers are the main building blocks (Yao et al. 2017). Figure 5c shows a representative hybrid HAR model with a CNN-LSTM framework.
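The sketch below outlines a hybrid CNN-LSTM model in the spirit of Fig. 5c, where convolutional layers produce feature maps and LSTM layers model their temporal dynamics; the layer sizes and input shape are illustrative assumptions rather than the published DeepConvLSTM configuration.

```python
# A hedged sketch of a DeepConvLSTM-style hybrid: convolutions extract per-window
# feature maps, LSTM layers model their temporal dynamics. Sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

n_classes = 6
model = models.Sequential([
    layers.Input(shape=(128, 3)),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.LSTM(128, return_sequences=True),   # temporal dynamics of the feature maps
    layers.LSTM(128),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```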

Figure 5

a CNN model for HAR (where ω(*): weights of the hidden layers, σ(*): activation function, λ: learning rate, *: convolutional operation, and L(ω): the loss function), b TL-based model for HAR, and c hybrid HAR model (CNN-LSTM)

Loss function

A DL model learns by means of a loss function, which evaluates how well the algorithm models the applied data: if the predictions deviate strongly from the actual output, the value of the loss function is large. With the help of an optimization function, the model gradually learns to reduce the prediction error. The loss functions used most often in HAR models are mean squared loss and cross-entropy (Janocha and Czarnecki 2016; Wang et al. 2020a); their standard forms are restated after the definitions below.

Mean absolute error (δ): the average of the absolute differences between the predicted output \(\hat{y}_i\) and the actual output \(y_i\), with N the number of training samples.

Mean squared error (ε): the average of the squared differences between the predicted output \(\hat{y}_i\) and the actual output \(y_i\), with N the number of training samples.

Cross-entropy loss (η): evaluates the performance of a model whose output is a probability between 0 and 1; the loss increases as the predicted probability \(\hat{y}_i\) diverges from the actual output \(y_i\).

Binary cross-entropy loss: predicts the probability of one of two activity classes.

Multiclass cross-entropy loss: the generalization of binary CEL, in which each class is assigned a unique integer value in the range 0 to n−1 (n is the number of classes).

Kullback–Leibler divergence (KL-divergence): a measure of how one probability distribution diverges from another. For probability distributions P(x) and Q(x), the KL-divergence is the expectation under P(x) of the logarithmic difference between P(x) and Q(x).
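For reference, the losses defined above can be restated compactly (with \(y_i\) the actual and \(\hat{y}_i\) the predicted output over N training samples, and n classes in the multiclass case):

```latex
\begin{align}
\delta &= \frac{1}{N}\sum_{i=1}^{N}\left|y_i-\hat{y}_i\right|
  && \text{(mean absolute error)}\\
\varepsilon &= \frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^{2}
  && \text{(mean squared error)}\\
\eta &= -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{n} y_{i,c}\,\log \hat{y}_{i,c}
  && \text{(cross-entropy; } n=2 \text{ is the binary case)}\\
D_{\mathrm{KL}}(P\,\|\,Q) &= \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}
  && \text{(KL-divergence)}
\end{align}
```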

Hyper-parameters and optimization

Drop-out rate: a regularization technique in which some neurons are randomly dropped to avoid overfitting.

Learning rate: defines how fast the parameters of the network are updated.

Momentum: steers the direction of the next update based on knowledge gained from previous steps.

Number of hidden layers : number of hidden layers between input and output layers.

Optimization

Optimization is a method for updating the parameters of a neural network. DL provides a wide range of optimizers: gradient descent (GD), stochastic gradient descent (SGD), RMSprop, and Adam. GD is a first-order optimization method that relies on the first-order derivative of the loss function. SGD is a variant of GD that updates the model's parameters frequently: it computes the loss for each training sample and alters the parameters accordingly. The RMSprop optimizer belongs to the family of adaptive learning methods; it deals with the vanishing/exploding gradient issue by using a moving average of squared gradients to normalize the gradient. The most powerful of these is the Adam optimizer, which combines the momentum of GD, retaining knowledge of previous updates, with the adaptive learning of RMSprop, and introduces two additional hyper-parameters, β1 and β2 (Soydaner 2020; Sun et al. 2020).
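As a brief illustration, the optimizers above can be configured in Keras as follows; the values shown are common defaults, not tuned recommendations.

```python
# A short sketch of the optimizers discussed above, as configured in Keras.
# The hyper-parameter values are common defaults, not recommendations.
from tensorflow.keras import optimizers

sgd = optimizers.SGD(learning_rate=0.01, momentum=0.9)                  # GD with momentum
rmsprop = optimizers.RMSprop(learning_rate=0.001, rho=0.9)              # moving average of squared gradients
adam = optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)   # momentum + adaptive learning rates
```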

The most common validation strategies are k-fold cross-validation and leave-one-subject-out (LOSO). In k-fold cross-validation, k−1 folds are used for training and the remaining fold for validation; this pattern is followed for common choices of k such as twofold, threefold, and tenfold cross-validation. In LOSO, the data of one subject are held out for validation and the rest are used for training.
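A small scikit-learn sketch of both strategies is given below, assuming per-window features X, labels y, and a subject identifier per window for the LOSO split (all synthetic here).

```python
# A sketch of k-fold and LOSO validation with scikit-learn on synthetic features.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X = np.random.randn(300, 12)
y = np.random.randint(0, 4, size=300)
subjects = np.random.randint(0, 10, size=300)      # which subject produced each window
clf = RandomForestClassifier(n_estimators=100, random_state=0)

kfold_acc = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
loso_acc = cross_val_score(clf, X, y, groups=subjects, cv=LeaveOneGroupOut())
print(f"10-fold: {kfold_acc.mean():.2f}  LOSO: {loso_acc.mean():.2f}")
```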

4.2 AI models adapting HAR devices

Various HAR devices exist for capturing human activity signals, and their goal is to capture those signals with minimal distortion. To provide deeper insight into existing HAR models, we identified seven AI attributes and present them in tabular form: number of features, feature extraction, AI model architecture, metrics, validation, and hyper-parameters/optimizer/loss function. For an in-depth description of recent HAR models published between 2019 and 2021, we compiled four tables, one per HAR device: Table 3 (sensor), Table 4 (vision), Table 5 (RFID), and Table 6 (device-free).

In Table 3, we provide insight into the AI techniques adopted in sensor-based HAR models over the last two years; knowledge about earlier sensor-based HAR models published between 2011 and 2018 is provided in Table S.1 (Supporting document) (Zhu et al. 2019; Ding et al. 2019; Yao et al. 2017, 2019; Hsu et al. 2018; Murad and Pyun 2017; Sundaramoorthy and Gudur 2018; Lawal and Bano 2019). Sensor-based HAR is dominated by DL techniques, especially CNNs or combinations of CNNs with RNNs or their variants. The most used hyper-parameters are the learning rate, batch size, number of layers, and drop-out rate, while the Adam optimizer, cross-entropy loss, and k-fold validation are dominant. For example, entry (R2, C4) of Table 3 presents a 3D CNN-based HAR model with three convolutional layers of sizes (32, 64, 128) followed by a pooling layer, an FC layer of size 128, and a softmax layer; entry (R2, C6) gives the validation strategy (10% of the data used for validation) and entry (R2, C7) the hyper-parameters (LR = 0.001, batch size = 50) and the optimizer (Adam) selected for performance fine-tuning.

Table 4 illustrates the AI frameworks of vision-based HAR models published in the last 2 years; descriptions of earlier vision-based HAR models published between 2011 and 2018 are provided in Table S.2 (Supporting document) (Qi et al. 2018; Wang et al. 2018; Thida et al. 2013; Feichtenhofer et al. 2018, 2017; Simonyan and Zisserman 2014; Newell Alejandro 2016; Diba et al. 2016; Xia et al. 2012; Vishwakarma and Singh 2017; Chaaraoui 2015). Early vision-based HAR models were dominated by ML algorithms such as support vector machines (SVM) and k-means clustering with principal component analysis (PCA)-based feature extraction. In the last few years, researchers have shifted to the DL paradigm, where the dominant techniques are multi-dimensional CNNs, LSTMs, and combinations of the two. In video camera-based HAR models, the incoming data is a video stream, which needs more resources and processing time; this issue has driven the use of TL in vision-based HAR approaches. The hyper-parameters used in vision-based HAR are the drop-out rate, learning rate, weight decay, and batch normalization; mean squared loss and cross-entropy loss are the most used loss functions, while RMSprop and SGD are the dominant optimizers. For example, entry (R1, C3) of Table 4 describes a 3D CNN-based HAR model whose input layer splits skeletal-joint information into coloured skeleton motion history images (Color-skl-MHI) and relative joint images (RJI), each followed by a 3D CNN, then a fusion layer that combines the outputs of both 3D CNN branches, and finally the output layer.

Table 5 shows recognition models using RFID devices published in the last 2 years, while details of earlier RFID-based HAR models are provided in Table S.3 (Supporting document) (Ding et al. 2015; Li et al. 2016; Fan et al. 2017). RFID-based HAR is mostly dominated by ML algorithms such as SVM, sparse coding, and dictionary learning; very few researchers have used DL techniques. Some RFID-based HAR models follow a traditional approach in which the received signal strength indicator (RSSI) is used for data gathering and the recognition task is performed by computing similarity with dynamic time warping (DTW). Table 6 provides an overview of device-free HAR models, where Wi-Fi devices are used to collect activity data; the recognition approach is similar to RFID-based HAR, and ML approaches remain more dominant than DL.

Impact of DL on miniaturized mobile and wireless sensing HAR devices

Visible growth of DL in vision-based HAR is evident from the existing models listed in Table 4, where most recent work uses advanced DL techniques such as TL with VGG-16, VGG-19, and ResNet-50. Apart from these TL-based models, there are hybrid models using autoencoders, as shown in row R8 of Table 4, which combines CNN, LSTM, and autoencoder components to extract deep features from voluminous video datasets. The impact of advanced DL techniques on sensor-based and device-free HAR, however, is not yet as pronounced. Owing to their compact size and versatility, miniaturized and wireless sensing devices are poised to become the next revolution in the HAR framework, and the key to their progress is the emerging DL framework. The data gathered from these devices are unlabelled, complex, and strongly inter-correlated. DL offers (1) advanced algorithms such as TL and unsupervised learning techniques, including generative adversarial networks (GAN) and variational autoencoders (VAE); (2) fast optimization techniques such as SGD and Adam; and (3) dedicated DL libraries such as TensorFlow, PyTorch, and Theano for handling complex data.

Observation 4: DL techniques are still evolving. Minimal work has been done using TL in sensor-based HAR models, and most approaches are discriminative, using supervised learning to train HAR models. Generative models such as VAEs and GANs have matured in the computer vision domain but are still new to HAR.

5 Performance evaluation in HAR and integration of AI in HAR devices

5.1 Performance evaluation

Researchers have adopted different metrics for evaluating the performance of HAR models; the most popular is accuracy. The metrics most used in sensor-based HAR are accuracy, sensitivity, specificity, and F1-score. The evaluation metrics used in existing vision-based HAR models are accuracy (top-1 and top-5) and mean average precision (mAP). Metrics used in RFID-based HAR include accuracy, F1-score, recall, and precision, while those used in device-free HAR include F1-score, precision, recall, and accuracy. "Appendix 2" gives the mathematical definitions of the performance evaluation metrics used in the HAR framework.
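For illustration, these metrics can be computed with scikit-learn as below; the predicted and true labels are hypothetical.

```python
# A sketch of the common HAR evaluation metrics using scikit-learn,
# computed on hypothetical predicted vs. true activity labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0, 3, 2]      # actual activity classes
y_pred = [0, 1, 2, 1, 1, 0, 3, 2]      # classes predicted by a HAR model

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))   # sensitivity
print("F1-score :", f1_score(y_true, y_pred, average="macro"))
```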

5.2 Integration of AI in HAR devices

In the last few years, significant growth can be seen in the use of DL in the HAR framework, but DL models come with challenges. (1) Overfitting/underfitting: when the amount of activity data is limited, the HAR model can learn the training set so well that it also learns its irregularities and random noise, which degrades the model's ability to generalize. Underfitting is the opposite problem, where the HAR model neither models the training data nor generalizes to new, unseen data. Both overfitting and underfitting result in lower performance. Overfitting can be countered by selecting an appropriate optimizer and tuning the right hyper-parameters, by increasing the size of the training data, or by using k-fold cross-validation; the challenge is to select a range of hyper-parameters that works well during training and testing and continues to work when the HAR model is used in real-life applications. (2) Hardware integration in HAR devices: in the last 10 years, various high-performing HAR models have appeared, but the question remains how well they can be used in real environments without integrating specialized hardware such as graphics processing units (GPUs) and extra memory. The objective should therefore be a robust, lightweight model that can run in a real environment without specialized hardware. For applications with huge data volumes, such as video, GPUs are needed for training; Python offers libraries (such as Keras and TensorFlow) for implementing the AI framework on a general-purpose CPU, whereas working on GPUs requires dedicated library support. This may ultimately require specialized hardware in the target application, which makes it expensive: processing power and cost are interrelated, i.e., more power costs more.
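As a hedged sketch of two common counter-measures consistent with the hyper-parameters discussed in Sect. 4, the snippet below adds a drop-out layer and stops training early once the validation loss stops improving; the architecture and data are purely illustrative.

```python
# A hedged sketch of two common overfitting counter-measures: a drop-out layer
# and early stopping on validation loss. Architecture and data are illustrative.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

model = models.Sequential([
    layers.Input(shape=(128, 3)),
    layers.Conv1D(32, 5, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.5),                       # drop-out rate: randomly silences 50% of units
    layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

X = np.random.randn(500, 128, 3).astype("float32")
y = np.random.randint(0, 6, size=500)
stop = callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=50, callbacks=[stop], verbose=0)
```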

6 Critical discussion

In the proposed review, we made four observations based on the three pillars of HAR: HAR devices, AI techniques, and applications. Based on these observations and on the challenges highlighted in Sects. 3 and 4, we make three claims and four recommendations.

6.1 Claims based on in-depth analysis of HAR devices, AI, and HAR applications

(i) Mutual relationship between HAR devices and the AI framework: our first claim is based on observations 1 and 2, which illustrate that advancement in AI directly affects the growth of HAR devices. In Sect. 2, Fig. 3a presents the growth of HAR devices over the last 10 years, and Fig. 3b illustrates the advancement in AI, showing how researchers have shifted from ML to the DL paradigm in the last 5 years. From observations 1 and 2 we can therefore rationalize that advancement in AI is driving the growth of HAR devices: most earlier HAR models depended on camera or customized wearable-sensor data, but in the last 5 years more devices, such as embedded sensors and Wi-Fi devices, have emerged as prominent HAR sources.

(ii) Growth in HAR devices increases the scope of HAR across application domains: claim 2 is based on observation 3, which shows how obtaining the best results in a target application depends on the HAR device. For applications such as crowd monitoring, gathering activity data with body or device sensors will not give prominent results, because sensors are best suited to single-person applications. Similarly, a camera is a poor choice in a smart-home environment, because cameras invade the user's privacy and require high computational cost.

We can therefore conclude that for multi-person applications such as surveillance, video cameras have proven best, whereas for single-person monitoring applications smart-device sensors are more suitable.

(iii) HAR devices, AI, and target application domains are the three pillars of the HAR framework: from all four observations and from claims 1 and 2, we conclude that HAR devices, AI, and application domains are the three pillars of a successful HAR model.

6.2 Benchmarking: comparison between different HAR reviews

The objective of the proposed review is to provide a complete and comprehensive review of HAR based on the three pillars, i.e., device type, AI techniques, and application domains. Table 7 benchmarks the proposed review against existing studies.

6.3 A short note on HAR datasets

A narrative review warrants a special note on the types of HAR datasets. (1) Sensor-based: researchers have proposed many popular sensor-based datasets. Table A.5 ("Appendix 1") describes these datasets with attributes such as data source, number of factors, sensor location, and activity type; it includes wearable sensor-based datasets (Alsheikh et al. 2016; Asteriadis and Daras 2017; Zhang et al. 2012; Chavarriaga et al. 2013; Munoz-Organero 2019; Roggen et al. 2010; Qin et al. 2019) as well as smart-device sensor-based datasets (Ravi et al. 2016; Cui and Xu 2013; Weiss et al. 2019; Miu et al. 2015; Reiss and Stricker 2012a, b; Lv et al. 2020; Gani et al. 2019; Stisen et al. 2015; Röcker et al. 2017; Micucci et al. 2017). Apart from the datasets in Table A.5, a few more are worth mentioning, such as the very popular Kasteren dataset (Kasteren et al. 2011; Chen et al. 2017). (2) Vision-based: devices for collecting 3D data include CCTV cameras (Koppula and Saxena 2016; Devanne et al. 2015; Zhang and Parker 2016; Li et al. 2010; Duan et al. 2020; Kalfaoglu et al. 2020; Gorelick et al. 2007; Mahadevan et al. 2010), depth cameras (Cippitelli et al. 2016; Gaglio et al. 2015; Neili Boualia and Essoukri Ben Amara 2021; Ding et al. 2016; Cornell Activity Datasets: CAD-60 & CAD-120 2021), and videos from public domains such as YouTube and Hollywood movie scenes (Gu et al. 2018; Soomro et al. 2012; Kuehne et al. 2011; Sigurdsson et al. 2016; Kay et al. 2017; Carreira et al. 2018; Goyal et al. 2017); public-domain videos are used because, unlike camera recordings, they pose no privacy issue. Table A.6 ("Appendix 1") describes the vision-based datasets with the same attributes (data source, number of factors, sensor location, and activity type). Apart from the datasets in Table A.6, a few more are publicly available, such as the MCDS (magnetic wall chessboard video) dataset (Tanberk et al. 2020), the NTU-RGBD datasets (Yan et al. 2018; Liu et al. 2016), VIRAT 1.0 (3 h of person–vehicle interaction), and VIRAT 2.0 (8 h of surveillance scenes of a school parking lot) (Wang and Ji 2014). (3) RFID-based: RFID-based HAR is mostly used for smart-home applications, where the actions performed by the user are monitored through RFID tags. To the best of our knowledge, there is hardly a public dataset available for RFID-based HAR; researchers have developed their own datasets for their respective applications. One such dataset was developed by Ding et al. (2015) and includes data from 10 exercises performed by 15 volunteers over 2 weeks, with a total duration of 1543 min. Similarly, Li et al. developed a dataset for trauma resuscitation covering 10 activities and 5 resuscitation phases (Li et al. 2016); a similar strategy was followed by Du et al. (2019), Fan et al. (2017), Yao et al. (2017, 2019), and Wang et al. (2019d). (4) Device-free: few popular datasets are publicly available, so researchers have followed the same strategy as in RFID-based HAR. One such dataset includes data from 6 volunteers performing 440 actions, for a total of 4400 samples (Wang et al. 2019c); Yan et al. (2020), Fei et al. (2020), and Wang et al. (2019d) have likewise proposed their own datasets.

6.4 Strengths and limitations

This is the first review of its kind in which the HAR system is presented as consisting of three components: HAR devices, AI models, and HAR applications. It is the only review that considers all four kinds of HAR devices, namely sensor-based, vision-based, RFID-based, and device-free, within the AI framework. The engineering perspective on AI was discussed in terms of architecture, loss-function design, and optimization strategies; a comprehensive and comparative study was conducted in the benchmarking section; and dataset sources were provided for the reader. Limitations: a significant amount of work has been done in the HAR domain, but some limitations remain. (1) Synchronized activities: earlier HAR models presume that a person performs a single activity at a time. This is not true: in the real world, humans perform synchronized activities such as talking on a smartphone while walking or reading a book. To the best of our knowledge, there is hardly a HAR model that considers synchronized activities. (2) Complex and composite activities: state-of-the-art results have been achieved for simple, atomic activities such as running, walking, and going up or down stairs, but very limited work has been done on complex activities that combine two or more simple actions. For example, in exercise monitoring, an exercise such as a burpee involves jumping, bending down, and extending the legs; such complex and risky activities require attention for proper posture monitoring, but to the best of our knowledge no HAR model can monitor exercises involving complex activities. (3) Future action forecasting: most existing work identifies an action after the user has performed it, such as fall detection; no HAR model predicts future actions. For example, in a smart-home environment, if an elderly person is exercising and there is a risk of a fall, a smart system that identifies the fall in advance and warns the person in time would be very helpful. (4) Lack of real-time validation of HAR models: for validation, earlier HAR models used k-fold cross-validation and LOSO, in which part of the dataset is held out for validation. However, most datasets are gathered in experimental setups and lack real-time character; there is therefore a need for models that perform well on experimental as well as real-time data without AI bias (Suri et al. 2016).

Recommendations

Trending AI technique: the use of transfer learning has shown significant results with vision-based HAR models, but very little work has applied it to sensor-based HAR. Sensor-based HAR with TL could be the next revolution in the HAR domain.

Trending device type: a decade ago, the most popular source for capturing activity signals was the video camera, but vision-based HAR has major issues, including user privacy and GPU requirements. The solution is sensor-based HAR, in which a simple smartphone or smartwatch captures the activity signals; in the last 3 years, sensor-based HAR has become one of the most popular HAR approaches.

Dominant target domain in HAR: although HAR has multifaceted application domains such as surveillance, healthcare, and fall detection, healthcare is the most crucial domain in which HAR plays an important role, covering remote health monitoring of patients, exercise monitoring, and assistive care for elderly people living alone. In the current COVID-19 pandemic scenario, a sensor-based HAR model with DL techniques can provide assistive care to home-quarantined COVID-19 patients by monitoring their health remotely.

Abnormal action identification and future action prediction: a significant amount of work has been done in HAR, but most of it revolves around recognizing simple activities; very little work addresses abnormalities in actions. Abnormal conditions fall into two categories, physical and non-physical. Physical examples include (a) fall detection under normal activities of daily living (ADL), (b) fall detection in elderly health-monitoring conditions, and (c) fall detection in sports; only physical abnormalities can be detected under this paradigm. Non-physical examples include dizziness, headaches, and nausea, which are not physical parameters that a camera can detect; they can, however, be monitored with dedicated sensor-based devices such as blood-pressure monitors and oximeters. Further, to our knowledge, few applications combine camera and sensor devices in the non-physical setting. Apart from abnormality identification, hardly any work addresses predicting future actions from current ones. For example, a person walking or running without paying attention to the road may trip over a sudden obstacle and fall; no application currently detects the obstacle and raises an alarm in advance. Forecasting is oriented towards projections at more distant times, unlike the near-current spatial and temporal information used for recognition, and it is especially challenging for motion estimation in subsequent frames where data are unavailable and unseen.

7 Conclusion

Unlike earlier review articles that focused on a single HAR device, we have presented a study built around the three pillars of HAR, i.e., HAR devices, AI, and application domains. In this review we hypothesized, first, that the growth in HAR devices is synchronized with the evolving AI framework, and we rationalized this with a graphical analysis of existing HAR models. Our second hypothesis is that the growth in AI is the core of HAR and makes it suitable for multifaceted domains; we rationalized this by presenting representative CNN and TL architectures for HAR and by discussing the importance of hyper-parameters, optimizers, and loss functions in HAR model design. A unique contribution concerns the role of the AI framework in existing HAR models for each type of HAR device. The study further highlighted that (1) sensor-based HAR with miniaturized devices will open opportunities in healthcare applications, especially remote care and monitoring, and (2) device-free HAR using Wi-Fi devices can make HAR an essential part of a healthy human life. Finally, the study presented four recommendations to broaden the vision of new researchers and help them expand the scope of HAR across diverse domains with the evolving AI framework, towards providing a healthy quality of life.

Abbreviations

Ambient assistive living
Activity of daily living
Artificial intelligence
Average pooling
Area under curve
Batch normalization
Cross-entropy loss
Convolutional neural network
Convolution
Cross-validation
Deep learning
Dynamic time warping
Equal error rate
Fully connected
Generative adversarial network
Giga floating-point operations per second
Graphics processing unit
Gated recurrent unit
Human activity recognition
Kullback–Leibler
Leave one subject out
Learning rate
Long short-term memory
Mean absolute error
Mean average precision
Machine learning
Max pooling
Mean squared error
Principal component analysis
Radio frequency identification
Recurrent neural network
Received signal strength indicator
Sensitivity
Stochastic gradient descent
Specificity
Support vector machine
Transfer learning
Variational autoencoder
True positive
True negative
False positive
False negative
Likelihood ratio

Abobakr A, Hossny M, Nahavandi S (2018) A skeleton-free fall detection system from depth images using random decision forest. IEEE Syst J 12(3):2994–3005. https://doi.org/10.1109/JSYST.2017.2780260

Acharya UR et al (2012) An accurate and generalized approach to plaque characterization in 346 carotid ultrasound scans. IEEE Trans Instrum Meas 61(4):1045–1053. https://doi.org/10.1109/TIM.2011.2174897

Acharya UR et al (2013b) Automated classification of patients with coronary artery disease using grayscale features from left ventricle echocardiographic images. Comput Methods Programs Biomed 112(3):624–632. https://doi.org/10.1016/j.cmpb.2013.07.012

Acharya UR et al (2015) Ovarian tissue characterization in ultrasound: a review. Technol Cancer Res Treat 14(3):251–261. https://doi.org/10.1177/1533034614547445

Acharya UR, Sree SV, Saba L, Molinari F, Guerriero S, Suri JS (2013a) Ovarian tumor characterization and classification using ultrasound—a new online paradigm. J Digit Imaging 26(3):544–553. https://doi.org/10.1007/s10278-012-9553-8

Adame T, Bel A, Carreras A, Melià-Seguí J, Oliver M, Pous R (2018) CUIDATS: An RFID–WSN hybrid monitoring system for smart health care environments. Future Gen Comput Syst 78:602–615. https://doi.org/10.1016/j.future.2016.12.023

Agarwal M et al (2021) A novel block imaging technique using nine artificial intelligence models for COVID-19 disease classification, characterization and severity measurement in lung computed tomography scans on an Italian cohort. J Med Syst. https://doi.org/10.1007/s10916-021-01707-w

Agarwal M et al (2021) Wilson disease tissue classification and characterization using seven artificial intelligence models embedded with 3D optimization paradigm on a weak training brain magnetic resonance imaging datasets: a supercomputer application. Med Biol Eng Comput 59(3):511–533. https://doi.org/10.1007/s11517-021-02322-0

Akagündüz E, Aslan M, Şengür A (2016) Silhouette orientation volumes for efficient fall detection in depth videos. IEEE J Biomed Health Inform. https://doi.org/10.1109/JBHI.2016.2570300

Alsheikh MA, Selim A, Niyato D, Doyle L, Lin S, Tan HP (2016) Deep activity recognition models with triaxial accelerometers. In: AAAI workshop technical reports, vol. WS-16-01, pp 8–13, 2016.

Arifoglu D, Bouchachia A (2017) Activity recognition and abnormal behaviour detection with recurrent neural networks. Procedia Comput Sci 110:86–93. https://doi.org/10.1016/j.procs.2017.06.121

Asteriadis S, Daras P (2017) Landmark-based multimodal human action recognition. Multimed Tools Appl 76(3):4505–4521. https://doi.org/10.1007/s11042-016-3945-6

Attal F, Mohammed S, Dedabrishvili M, Chamroukhi F, Oukhellou L, Amirat Y (2015) Physical human activity recognition using wearable sensors. Sensors (Switzerland) 15(12):31314–31338. https://doi.org/10.3390/s151229858

Azkune G, Almeida A (2018) A scalable hybrid activity recognition approach for intelligent environments. IEEE Access 6(8):41745–41759. https://doi.org/10.1109/ACCESS.2018.2861004

Bashar SK, Al Fahim A, Chon KH (2020) Smartphone based human activity recognition with feature selection and dense neural network. In: Proceedings of annual international conference of the ieee engineering in medicine and biology society EMBS, vol. 2020-July, pp 5888–5891, 2020. https://doi.org/10.1109/EMBC44109.2020.9176239

Beddiar DR, Nini B, Sabokrou M, Hadid A (2020) Vision-based human activity recognition: a survey. Multimed Tools Appl 79(41–42):30509–30555. https://doi.org/10.1007/s11042-020-09004-3

Biswas M et al (2018) Symtosis: a liver ultrasound tissue characterization and risk stratification in optimized deep learning paradigm. Comput Methods Programs Biomed 155:165–177. https://doi.org/10.1016/j.cmpb.2017.12.016

Biswas M et al (2019) State-of-the-art review on deep learning in medical imaging. Front Biosci Landmark 24(3):392–426. https://doi.org/10.2741/4725

Buffelli D, Vandin F (2020) Attention-based deep learning framework for human activity recognition with user adaptation. arXiv, 2020.

Cardoso HL, Mendes Moreira J (2016) Human activity recognition by means of online semi-supervised learning, pp. 75–77. https://doi.org/10.1109/mdm.2016.93

Carreira J, Noland E, Banki-Horvath A, Hillier C, Zisserman A (2018) A short note about kinetics-600, 2018. [Online]. http://arxiv.org/abs/1808.01340 .

Carvalho LI, Sofia RC (2020) A review on scaling mobile sensing platforms for human activity recognition: challenges and recommendations for future research. IoT 1(2):451–473. https://doi.org/10.3390/iot1020025

Chaaraoui AA (2015) Abnormal gait detection with RGB-D devices using joint motion history features

Chavarriaga R et al (2013) The opportunity challenge: a benchmark database for on-body sensor-based activity recognition. Pattern Recognit Lett 34(15):2033–2042. https://doi.org/10.1016/j.patrec.2012.12.014

Chen WH, Cho PC, Jiang YL (2017) Activity recognition using transfer learning. Sensors Mater 29(7):897–904. https://doi.org/10.18494/SAM.2017.1546

Chen Y, Shen C (2017) Performance analysis of smartphone-sensor behavior for human activity recognition. IEEE Access 5(c):3095–3110. https://doi.org/10.1109/ACCESS.2017.2676168

Chen J, Sun Y, Sun S (2021) Improving human activity recognition performance by data fusion and feature engineering. Sensors (Switzerland) 21(3):1–23. https://doi.org/10.3390/s21030692

Chen K, Zhang D, Yao L, Guo B, Yu Z, Liu Y (2020) Deep learning for sensor-based human activity recognition: overview, challenges and opportunities. arXiv, vol. 37, no. 4, 2020

Chong YS, Tay YH (2017) Abnormal event detection in videos using spatiotemporal autoencoder. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 10262 LNCS, pp 189–196, 2017. https://doi.org/10.1007/978-3-319-59081-3_23

Cippitelli E, Gasparrini S, Gambi E, Spinsante S (2016) A human activity recognition system using skeleton data from RGBD sensors. Comput Intell Neurosci. https://doi.org/10.1155/2016/4351435

Civitarese G, Presotto R, Bettini C (2019) Context-driven active and incremental activity recognition, 2019. [Online]. http://arxiv.org/abs/1906.03033 .

Cook DJ, Krishnan NC, Rashidi P (2013) Activity discovery and activity recognition: a new partnership. IEEE Trans Cybern 43(3):820–828. https://doi.org/10.1109/TSMCB.2012.2216873

Cornell Activity Datasets: CAD-60 & CAD-120 (2021) [Online]. Available: re3data.org: Cornell Activity Datasets: CAD-60 & CAD-120; editing status 2019-01-22; re3data.org—Registry of Research Data Repositories. https://doi.org/10.17616/R3DD2D . Accessed 17 Apr 2021

Crasto N et al (2019) MARS: motion-augmented RGB stream for action recognition to cite this version : HAL Id : hal-02140558 MARS: motion-augmented RGB stream for action recognition, 2019. [Online]. http://www.europe.naverlabs.com/Research/

Cui J, Xu B (2013) Cost-effective activity recognition on mobile devices. In: BODYNETS 2013—8th international conference on body area networks, pp 90–96, 2013. https://doi.org/10.4108/icst.bodynets.2013.253656

De-La-Hoz-Franco E, Ariza-Colpas P, Quero JM, Espinilla M (2018) Sensor-based datasets for human activity recognition—a systematic review of literature. IEEE Access 6(c):59192–59210. https://doi.org/10.1109/ACCESS.2018.2873502

Deep S, Zheng X (2019) Leveraging CNN and transfer learning for vision-based human activity recognition. In: 2019 29th international telecommunication networks and application conference ITNAC 2019, pp 35–38, 2019. https://doi.org/10.1109/ITNAC46935.2019.9078016

Demrozi F, Pravadelli G, Bihorac A, Rashidi P (2020) Human activity recognition using inertial, physiological and environmental sensors: a comprehensive survey. IEEE Access 8:210816–210836. https://doi.org/10.1109/ACCESS.2020.3037715

Devanne M, Wannous H, Berretti S, Pala P, Daoudi M, Del Bimbo A (2015) 3-D human action recognition by shape analysis of motion trajectories on Riemannian manifold. IEEE Trans Cybern 45(7):1340–1352. https://doi.org/10.1109/TCYB.2014.2350774

Dhiman C, Vishwakarma DK (2019) A review of state-of-the-art techniques for abnormal human activity recognition. Eng Appl Artif Intell, pp 21–45

Diba A, Pazandeh AM, Van Gool L (2016) Efficient two-stream motion and appearance 3D CNNs for video classification, 2016, [Online]. http://arxiv.org/abs/1608.08851

Diba A et al. (2020) Large scale holistic video understanding. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 12350 LNCS, pp 593–610, 2020. https://doi.org/10.1007/978-3-030-58558-7_35

Ding R et al (2019) Empirical study and improvement on deep transfer learning for human activity recognition. Sensors (Switzerland). https://doi.org/10.3390/s19010057

Ding W, Liu K, Fu X, Cheng F (2016) Profile HMMs for skeleton-based human action recognition. Signal Process Image Commun 42:109–119. https://doi.org/10.1016/j.image.2016.01.010

Ding H et al. (2015) FEMO: a platform for free-weight exercise monitoring with RFIDs. In: SenSys 2015—proceedings of 13th ACM conference on embedded networked sensor systems, pp 141–154. https://doi.org/10.1145/2809695.2809708 .

Du Y, Lim Y, Tan Y (2019) A novel human activity recognition and prediction in smart home based on interaction. Sensors (Switzerland). https://doi.org/10.3390/s19204474

Duan H, Zhao Y, Xiong Y, Liu W, Lin D (2020) Omni-sourced webly-supervised learning for video recognition. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 12360 LNCS, pp 670–688, 2020. https://doi.org/10.1007/978-3-030-58555-6_40

Ehatisham-Ul-Haq M, Azam MA, Amin Y, Naeem U (2020) C2FHAR: coarse-to-fine human activity recognition with behavioral context modeling using smart inertial sensors. IEEE Access 8:7731–7747. https://doi.org/10.1109/ACCESS.2020.2964237

El-Baz JSSA, Jiang X (2016) Biomedical Image Segmentation: Advances and Trends. CRC Press, Taylor & Francis Group

El-Baz A, Suri JS (2019) Level set method in medical imaging segmentation. CRC Press, Taylor & Francis Group, London

Fan X, Gong W, Liu J (2017) I2tag: RFID mobility and activity identification through intelligent profiling. ACM Trans Intell Syst Technol 9(1):1–21. https://doi.org/10.1145/3035968

Fan X, Wang F, Wang F, Gong W, Liu J (2019) When RFID meets deep learning: exploring cognitive intelligence for activity identification. IEEE Wirel Commun 26(3):19–25. https://doi.org/10.1109/MWC.2019.1800405

Fazli M, Kowsari K, Gharavi E, Barnes L, Doryab A (2020) HHAR-net: hierarchical human activity recognition using neural networks, pp 48–58, 2021. https://doi.org/10.1007/978-3-030-68449-5_6

Fei H, Xiao F, Han J, Huang H, Sun L (2020) Multi-variations activity based gaits recognition using commodity WiFi. IEEE Trans Veh Technol 69(2):2263–2273. https://doi.org/10.1109/TVT.2019.2962803

Feichtenhofer C, Ai F (2019) SlowFast networks for video recognition technical report AVA action detection in ActivityNet challenge 2019, pp. 2–5

Feichtenhofer C, Fan H, Malik J, He K (2018) SlowFast networks for video recognition 2018. [Online]. http://arxiv.org/abs/1812.03982

Feichtenhofer C, Pinz A, Wildes RP (2017) Spatiotemporal multiplier networks for video action recognition. In: Proceedings of 30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-Janua, no. Nips, pp 7445–7454, 2017. https://doi.org/10.1109/CVPR.2017.787

Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of IEEE computer society conference on computer vision and pattern recognition, vol 2016-Decem, no. i, pp. 1933–1941, 2016. https://doi.org/10.1109/CVPR.2016.213

Ferrari A, Micucci D, Mobilio M, Napoletano P (2020) On the personalization of classification models for human activity recognition. IEEE Access 8:32066–32079. https://doi.org/10.1109/ACCESS.2020.2973425

Fullerton E, Heller B, Munoz-Organero M (2017) Recognizing human activity in free-living using multiple body-worn accelerometers. IEEE Sens J 17(16):5290–5297. https://doi.org/10.1109/JSEN.2017.2722105

Gaglio S, Lo Re G, Morana M (2015) Human activity recognition process using 3-D posture data. IEEE Trans Hum Mach Syst 45(5):586–597. https://doi.org/10.1109/THMS.2014.2377111

Gani MO et al (2019) A light weight smartphone based human activity recognition system with high accuracy. J Netw Comput Appl 141(May):59–72. https://doi.org/10.1016/j.jnca.2019.05.001

Garcia-Gonzalez D, Rivero D, Fernandez-Blanco E, Luaces MR (2020) A public domain dataset for real-life human activity recognition using smartphone sensors. Sensors (Switzerland). https://doi.org/10.3390/s20082200

Gorelick L, Blank M, Shechtman E, Irani M, Basri R (2007) Actions as space-time shapes. IEEE Trans Pattern Anal Mach Intell 29(12):2247–2253. https://doi.org/10.1109/TPAMI.2007.70711

Gouineua F, Sortin M Chikhaoui B (2018) Chikhaoui-DL-springer (2018).pdf. Springer, pp 302–315

Goyal R et al. (2017) The ‘Something Something’ video database for learning and evaluating visual common sense. In: Proceedings of the IEEE international conference on computer vision, pp 5843–5851. https://doi.org/10.1109/ICCV.2017.622 .

Gu C et al. (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 6047–6056, 2018. https://doi.org/10.1109/CVPR.2018.00633

Han J et al. (2014) CBID: a customer behavior identification system using passive tags. In: Proceedings of international conference on network protocols, ICNP, pp 47–58, 2014. https://doi.org/10.1109/ICNP.2014.26 .

Hsu YL, Yang SC, Chang HC, Lai HC (2018) Human daily and sport activity recognition using a wearable inertial sensor network. IEEE Access 6(c):31715–31728. https://doi.org/10.1109/ACCESS.2018.2839766

Huang SF, Chang RF, Moon WK, Lee YH, Chen DR, Suri JS (2008) Analysis of tumor vascularity using ultrasound images. IEEE Trans Med Imaging 27(3):320–330

Hussain Z, Sheng QZ, Zhang WE (2020) A review and categorization of techniques on device-free human activity recognition. J Netw Comput Appl 167:102738. https://doi.org/10.1016/j.jnca.2020.102738

Hx P, Wang J, Hu L, Chen Y, Hao S (2017) Deep learning for sensor based activity recognition: a survey. Pattern Recognit Lett 1–9

Jalal A, Uddin M, Kim TS (2012) Depth video-based human activity recognition system using translation and scaling invariant features for life logging at smart home. IEEE Trans Consum Electron 58(3):863–871. https://doi.org/10.1109/TCE.2012.6311329


Author information

Authors and Affiliations

CSE Department, Bennett University, Greater Noida, UP, India

Neha Gupta & Suneet K. Gupta

Rawatpura Sarkar University, Raipur, Chhattisgarh, India

Rajesh K. Pathak

Bharati Vidyapeeth’s College of Engineering, Paschim Vihar, New Delhi, India

Neha Gupta & Vanita Jain

Intelligent Health Laboratory, Department of Biomedical Engineering, University of Florida, Gainesville, USA

Parisa Rashidi

Stroke Diagnostic and Monitoring Division, AtheroPoint™, Roseville, CA 95661, USA

Jasjit S. Suri

Global Biomedical Technologies, Inc., Roseville, CA, USA

Corresponding author

Correspondence to Jasjit S. Suri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file 1 (DOCX 944 kb)

Appendix A

The types of HAR devices and their applications are the two main components of HAR. Tables A.1, A.2, A.3, and A.4 summarize existing HAR models by device type in terms of data source, number of activities, number of subjects, datasets, activity scenarios, and performance evaluation.

The performance of a HAR model is evaluated using quantitative metrics. Table B.1 summarizes the evaluation metrics used in existing HAR models. Before describing these metrics, four confusion-matrix terms need to be defined (a brief illustrative sketch follows the list below):

True positive (TP): the number of positive samples correctly predicted as positive.

False positive (FP): the number of actual negative samples incorrectly predicted as positive.

True negative (TN): the number of negative samples correctly predicted as negative.

False negative (FN): the number of actual positive samples incorrectly predicted as negative.
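
As a concrete illustration of how these counts translate into the metrics listed in Table B.1, the short Python sketch below computes accuracy, precision, recall (sensitivity), specificity, and F1-score from confusion-matrix counts. The variable names and example values are illustrative assumptions made here, not figures taken from any reviewed model.

```python
# Minimal sketch: deriving common HAR evaluation metrics from
# confusion-matrix counts. The counts below are hypothetical example
# values, not results from any reviewed model.
TP, FP, TN, FN = 90, 10, 85, 15

accuracy = (TP + TN) / (TP + TN + FP + FN)       # fraction of all samples predicted correctly
precision = TP / (TP + FP)                       # fraction of predicted positives that are truly positive
recall = TP / (TP + FN)                          # sensitivity: fraction of actual positives detected
specificity = TN / (TN + FP)                     # fraction of actual negatives detected
f1_score = 2 * precision * recall / (precision + recall)

print(f"Accuracy:    {accuracy:.3f}")
print(f"Precision:   {precision:.3f}")
print(f"Recall:      {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"F1-score:    {f1_score:.3f}")
```

For multiclass HAR, these quantities are typically computed per activity class and then macro- or micro-averaged to obtain a single summary score.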

About this article

Gupta, N., Gupta, S.K., Pathak, R.K. et al. Human activity recognition in artificial intelligence framework: a narrative review. Artif Intell Rev 55, 4755–4808 (2022). https://doi.org/10.1007/s10462-021-10116-x

Published: 18 January 2022

Issue Date: August 2022

DOI: https://doi.org/10.1007/s10462-021-10116-x


  • Sensor-based
  • Vision-based
  • Radio frequency-based identification
  • Device-free
  • Hybrid models