Stylistic gait synthesis based on hidden Markov models


In this work we present an expressive gait synthesis system based on hidden Markov models (HMMs), following and modifying a procedure originally developed for speaking style adaptation in speech synthesis. A large database of neutral motion capture walk sequences was used to train an HMM of average walk. The model was then used for automatic adaptation to a particular style of walk using only a small amount of training data from the target style. The open source toolkit that we adapted for motion modeling also enabled us to take into account the dynamics of the data and to model accurately the duration of each HMM state. We also address the assessment issue and propose a procedure for qualitative user evaluation of the synthesized sequences. Our tests show that the style of these sequences is easily recognized and that the sequences look natural to the evaluators.

1 Introduction

Human motion is a very complex field of study. The components of our behaviors, which are so natural to the human eye, can hardly be separated into physiological processes, the personal style of every human, or some kind of additional "style" or "mood" that influences the final motion, as presented in many works (see for instance [1]). A given gesture will easily be identified by the human eye as clumsy, elegant, heavy, or any other characteristic. Unfortunately, automatically extracting that information is a very difficult task, as stylistic variation is intrinsically merged with the basic motion, the individuality of the subject and the time-variability of the gesture (two motions by the same subject will never be exactly the same).

A broad field of applications can be found for human motion synthesis. While its use is currently mostly limited to the entertainment industry, with 3D movies, video games, virtual agents, etc., other domains could benefit from realistic automatic motion synthesis, in the same way as they already benefit from motion capture [2]. Medical applications could use it, for instance, to control active prostheses or to detect and understand the motion of motor-impaired individuals [3]. New applications in the animation of virtual 3D characters could also take advantage of more advanced motion synthesis, for virtual agents interacting in real time with the user, for 3D animation movies, for video games, etc. [4]. A good synthesizer could for instance enable non-professional animators to produce believable and controllable motion sequences without animation experience or expensive motion capture equipment. For artistic applications, motion analysis/synthesis could make it possible for an actor or a dancer to interact in real time on stage with a virtual character whose motions are correlated with those of the human performer or with any other signal.

In the framework of virtual character animation, several approaches are available to synthesize realistic human motion. Among those, motion capture based approaches have attracted considerable interest in recent years, especially as motion capture has become more affordable. Motion capture is a technology that transfers the movements of humans into a numerical form usable by computers, and numerous methods have been developed for using and re-using motion capture data [5]. The main problems encountered with motion capture data are its high dimensionality, the choice of the parameterization, and the variability associated with motion in general. All these factors make it hard to retrieve, analyze, adapt, and modify motion patterns either made "on request" or coming from an existing motion database.

Two approaches coexist for using motion capture data for producing animations: the "template-based" and the "model-based" approaches. In the "template-based" approach, a large database of motion sequences is built and algorithms are developed to address the common data-mining issues (like retrieving the required motion segments), edit these motion parts if needed, and blend them together to produce new sequences [6]. Several problems are associated with the "template-based" approach, which relies on a database that is queried for motion segments. The database needs to be stored, which can be a first issue, and must be large enough to contain all the required motion capture segments. Increasing the database size, however, makes effective searching harder. In the synthesis process, unrelated motion parts have to be concatenated, and it is difficult to ensure the continuity of the produced motion. Controllability is also an issue, as there is no continuous modeling of styles, only distinct independent examples that can display several characteristics.

The "model-based" approach, sometimes also referred to as the "machine learning" approach, consists of training models based on motion capture data. The models can later be used to synthesize new motion sequences without resorting to the database initially used for training [7-10]. Furthermore, style can be modeled as a parameter of the model, giving the user new possibilities for the control of the synthesized segments, and for the combination of styles not available in the original data. This approach has been used for years in speech processing for example, first for recognition and more recently for synthesis [11].

Our current work falls in the latter category, with the use of model-based techniques, and more precisely of hidden Markov models (HMMs) [12], for the modeling and synthesis of human-like motion. We aim not only to synthesize a plausible human walk but also to isolate some kind of "style" component. Taking into account such a "style" parameter will enable us to synthesize a broad range of styles, and to have an open model where new styles can always easily be added.

In this work, a general model of "neutral" walk is built in a first step, by training a model over a large database. The resulting neutral-style model can then be used as a basis for the adaptive training of any style-specific model using only a small amount of training data from the target style. A new style-adapted model can thus be obtained very easily each time it is required by capturing only about a dozen steps of the desired walk style and running the adaptive training. This technique, which was originally developed for speaker adaptation in speech synthesis (HTS toolkit) [11, 13, 14], has been adapted to the motion synthesis problem in our work. The main interest of this approach is that it makes it possible to tackle the main problem of model-based techniques, which is the large amount of data needed to train each new model (corresponding to each different walk style in our case). Thanks to this work, it is possible to train representative models for walk styles for which the training of standard models failed because the set of data available for each style was too small. This method also opens interesting paths for style interpolation or for adding style to plain walks.

The article is organized as follows. Section 2 reviews HMM-based motion analysis/synthesis. The training databases are then presented in Section 3. Section 4 describes the preprocessing of the data. Section 5 presents the HMM training and adaptation procedure and its use for synthesis of new stylistic walk sequences, along with some results. A qualitative user evaluation is presented in Section 6. Section 7 concludes this article by presenting perspectives and future work.

2 Related work

2.1 HMMs for motion synthesis

Various walk synthesis algorithms use statistical learning techniques to automatically extract the underlying rules of human motion, without any prior knowledge, directly from training on 3D motion capture data. The resulting statistical models can then be used for generating new motion sequences automatically, using only some high-level commands from the user. Such synthesized motions are thus visually different from the training motions but stochastically similar to them. A few studies use principal component analysis (PCA), not for reducing the dimensionality of the angle data, but as a way of modeling motion units composed of a sequence of frames. Thanks to their periodicity, walk cycles are especially well suited for such an algorithm. This approach has been taken for instance by Glardon et al. [8], Troje [15] and in our previous work [10]. But most work in this area uses variations of HMMs, Markov chains or other kinds of probabilistic transitions between motions [9, 16-18], in order to take the high dynamic complexity of human movement into account.

An HMM consists of a finite set of states, with transitions among the states governed by a set of so-called transition probabilities. In HMMs, each state is associated with an outcome (more generally called observation) probability distribution. Only this outcome is visible to an external observer, not the state that produced it: at each time t, the external observer sees one outcome, but does not know which state produced it. HMMs are hence doubly stochastic processes, as visible outcomes are determined by the outcome probability distribution associated with the state, and as the state changes at each time according to the transition probabilities between states. In our work, the outcomes of the HMM are the frames of the motion. A basic left-to-right HMM with no skips is illustrated in Figure 1, as a model in which the only possible state transitions at each time are either to stay in the same state or to go to the next state.

Figure 1. A simple three-state left-to-right HMM with no skips (with $a_{ij}$ representing the transition probability between states $i$ and $j$).
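As an illustration only, the left-to-right topology described above can be sketched in a few lines of Python. The function name `sample_left_to_right`, the scalar Gaussian emissions and the numerical values are assumptions chosen for brevity; they are not taken from the models trained in this work, whose observations are full motion frames.

```python
import random


def sample_left_to_right(stay_prob, means, stds, seed=0):
    """Sample one sequence from a left-to-right HMM with no skips.

    stay_prob[i] plays the role of the self-transition probability a_ii;
    the remaining mass 1 - a_ii moves to state i + 1 (or exits the model
    after the last state). Each state emits a scalar Gaussian value,
    which is a simplifying assumption of this sketch.
    """
    rng = random.Random(seed)
    states, observations = [], []
    state = 0
    while state < len(stay_prob):
        states.append(state)
        observations.append(rng.gauss(means[state], stds[state]))
        # Stay in the same state with probability a_ii, otherwise advance.
        if rng.random() >= stay_prob[state]:
            state += 1
    return states, observations


# Hypothetical three-state model: only `observations` would be visible
# to an external observer, `states` stays hidden.
states, observations = sample_left_to_right(
    stay_prob=[0.8, 0.9, 0.7],
    means=[0.0, 1.0, -1.0],
    stds=[0.1, 0.1, 0.1],
)
```

The hidden state index can only stay constant or increase by one from one frame to the next, which is exactly the constraint imposed by the topology of Figure 1.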

Motion has been studied with numerous variants of HMMs, whether it was for analysis or synthesis purposes [19]. In the following paragraphs, we will focus on the studies related to the use of HMMs for motion synthesis, and not just motion in general. In some cases, some kind of "style" component is taken into account, but no style parameter has yet been found that can be used to synthesize styles that are very different from the motions on which the system was trained. Increasing the number of styles represented by the system means increasing the complexity of the model and in most cases re-training it completely, with the additional issue that enough data must be available for each style.

Tanco and Hilton [16] describe a model consisting of two hierarchical levels. In the first level, PCA is used to reduce the dimensionality of the data and is followed by a K-means clustering of the poses space. The clusters--defining the boundaries of "motion segments" in the original training data--are then used as states of a Markov chain that represents the temporal behavior of the training data. A discrete HMM is only used in the second level to relate the states of the Markov chain to the original full examples of motion sequences from the training database (the Markov chain states are the observations of the HMM and the hidden states of the HMM are the motion examples). During synthesis, a sequence of Markov chain states is calculated given beginning and end poses defined by the user. The second synthesis stage takes the generated state sequence as an input and searches for the most likely sequence of motion segments from the original training data that could have generated that Markov chain state sequence. There is thus no "true" HMM synthesis step, as the database needs to be accessed each time a new motion has to be built. This work is more related to the template-database approach of motion capture animation, as it was described in Section 1, than to the approach we describe in the present article.

Wang et al. [17] go further in their motion modeling by using a "time-striding HMM" (TSHMM), which is also a two-layer model. In the first layer, an approximation of high-level (time-striding) statistical transitions is calculated, with first order transition probabilities. Those "high level" transitions correspond, for example, to the transitions between two different behaviors like walking and running. The high-level states from the upper layer are modeled in the second layer by a set of left-right HMMs. Those HMMs correspond to "atomical movements", i.e., motion segments that are as short as possible while remaining long enough to enable the prediction of the next pose. Synthesis is based on the model alone, without reusing any motion segment from the original database.

Li et al. [18] also use the principle of motion decomposition into sub-units connected to each other by transition probabilities, and model each sub-unit individually. Their system, called "motion texture", is a technique for synthesizing complex human motions (like dancing for instance) so that they are statistically similar to the original motion capture data. The model is made of a set of "motion textons", and of their distribution, thereby characterizing the stochastic and dynamic nature of motion captures performed for the training. They define "motion textons" as the repetitive patterns in complex human motion (for instance: spinning, hopping or tiptoeing for dance motion). Each motion texton is modeled by a linear dynamic system (LDS) [18]. The distribution of the textons is modeled by a transition matrix which gives probabilities for transiting from one texton to another. It is thus possible to generate new animations and vary their execution by modifying motion at the texton level, or to synthesize a new choreography by varying the distributions.

2.2 Motion style modeling and synthesis

An interesting approach is chosen by some researchers who try to integrate a "style" variable into their HMM models. It enables the model, during the synthesis step, to vary not only the motion itself, but also the way the motion is performed, i.e. the "style" of the motion.

Wang et al. [20], for instance, use a training algorithm which integrates statistical optimization techniques with the expectation-maximization (EM) learning steps. Their method, called "SOMN-HMM" (which stands for "self-organizing mixture networks" which are used to represent mixture of Gaussians in the HMMs), makes it possible to train basic HMMs as well as parametric HMMs containing a "style" parameter. In [21], output densities are represented by "stylized decomposable triangulated graphs" (mix-SDTG) instead of SOMNs, and they also take into account a style variable.

Among all the models enabling the generation of data representing motion thanks to approximation functions, the "style machine" developed by Matthew Brand [9] is especially appealing. The major interest of this method is that its learning algorithm, based on the maximization of entropy, makes it possible to train HMMs whose structure is not known in advance, without resorting to successive attempts to find an adequate structure. Furthermore, this method integrates a style variable that can vary during the synthesis of a motion sequence. However, in that work the "style" variable is not explicit, and it is thus not possible to control a given style directly, but only to change some intrinsic style-related parameters.

In an approach closely related to ours, Yamazaki et al. [22] synthesize walk using a hidden semi-Markov model (HSMM). The "style" variation they incorporate in their model through multiple regression consists of quantitative variations of speed and stride length. Two values can thus be controlled, but multiple regression is not suited to expressivity modeling, since expressivity can hardly be quantified with a numerical value. The multiple regression model is trained once and for all, and it is not possible to add a new "style characteristic" without training the whole model again.

One of the problems with motion synthesis is that, unlike for speech which is decomposed into sentences, words, phonemes, etc., which are universal and can be represented as a finite set of possibilities, there is no widely accepted "dictionary" of basic motions. Each research team uses its own terminology and the possibilities are potentially infinite. There is thus no common basis for comparison, and as there is no method to assess the quality or the realism of a synthesized motion, the comparison of methods proposed by each research group is not straightforward. Most studies even lack qualitative assessment of their results.

3 Training databases

In all model-based techniques, the first major issue is to obtain enough representative training data. The quality of models is highly dependent on the quality of the data and how accurately these data describe the studied phenomenon. Motion capture being the only solution to obtain realistic 3D human motion data [2], it is the only way to gather representative training data for statistical modeling of human motion.

In this work we have used two databases recorded with an inertial motion tracking system, the inertial gyroscopic system (IGS-190) from Animazoo [23]. The IGS-190 is a commercial motion capture suit that contains 18 inertial sensors, which each consist of a three axis accelerometer, a three axis gyroscope and a three axis magnetometer. The data from those three sources are integrated and fused directly in the inertial sensor boxes. Angles between the body segments are thus provided straight from the sensors; no mapping is necessary between tracked 3D positions of markers and joint angles, unlike in optical motion capture systems.

Most studies use optical motion capture systems, which usually induce space limitations and where walk is thus recorded on a treadmill. In contrast, the inertial suit IGS-190 does not imply any kind of space limitation. The recorded subject can thus move freely in an open space area and walk can be recorded in a more natural way. This kind of inertial suit is thus especially interesting for the study of expressive walk, as it gives more freedom to the subject who can follow non-straight trajectories and is not constrained to a given constant speed like he would be on a treadmill.

In our databases, like in all motion capture systems, the human body structure is approximated by a kinematic tree of joints modeled as points separated by segments of known constant lengths (see the skeletons in Figure 2). The starting point of that kinematic chain, also called the "root" of the skeleton, is the middle of the hips, at the bottom of the spine. The hierarchy of the skeleton is the same for all our subjects and recordings and contains 18 articulations. This hierarchy and the limb lengths corresponding to the recorded subject are defined for each person prior to the first recording and are constant across different motion recordings by the same subject.

Figure 2. Four example postures taken from the motion capture database (sad, afraid, drunk and decided walks).

There is no 3D position tracking system in the IGS suit, and the absolute position of the subject is calculated by the software given a known initial position, using the length of the skeleton segments from the feet to the hip, the angles recorded between those segments for each frame, and always considering that the lowest point of the skeleton is in contact with the ground.

Our two databases, respectively called "eNTERFACE'08 3D" and "Mockey", were recorded with the same motion capture suit but with different aims, subjects and settings. The eNTERFACE'08 3D database is described in detail in [24]. This first database contains 17 walk sequences for 41 subjects. Among these, 12 sequences correspond, for each subject, to three sequences of straight walk over approximately seven meters for four different speed instructions. Those four instructions were "free", slow, middle and fast walks. In the "free" walk, subjects were invited to walk at their usual comfort speed. In the present work, the three free walk sequences of the 41 subjects were used to train our average "neutral" walk model. In that database, the motion was captured at a frame rate of 60 frames per second (fps).

The Mockey database, the second database used in this work, aims to study the "expressivity" of walk [10]. Various walks were performed by the same actor walking on a stage. He was given instructions about the "walking style" he had to act before each walk sequence recording. The 11 different acted styles were the following: proud, decided, sad, topmodel, drunk, cool, afraid, tiptoeing, heavy, in a hurry, manly. These 11 styles were arbitrarily chosen as they all have a recognizable influence on walk, as illustrated in Figure 2. Our "style" component thus consists of exaggerated variations that can be far from plain walk. In this second database, motion was recorded at a frame rate of 30 fps. Depending on the style of walk performed and its corresponding step length, a different number of walk cycles was recorded for each style. The 11 different styles and their corresponding number of left and right steps are presented in Table 1.

Table 1 Mockey database walk styles and corresponding number of steps recorded

4 Data preprocessing

In the data format we use, three values per frame give the absolute 3D position (XYZ cartesian coordinates) of the root of the skeleton while the 54 other values represent the 3D angles of the 18 joints of the skeleton. The three values corresponding to the 3D position were discarded as they can be recalculated later using the angles, information about the foot contact with the ground, and the fixed leg segment lengths. The directions of all walk sequences from both databases were then aligned before further processing. The walk sequences were also manually segmented into left and right steps. The boundaries of the steps were arbitrarily defined as the moment the heel touches the ground.

We chose to model the rotations of the 18 captured joints rather than the 3D cartesian coordinates of these joints in order to ensure that the fixed limb length constraints were respected in the synthesized motion: as only rotations are applied to the fixed limb length skeleton definition presented in Section 3, there will be no length deformation in the skeleton after synthesis. This would not be the case with joint cartesian coordinates, as nothing would ensure that the distance between two successive joints of the skeleton hierarchy remains constant, unless that constraint is explicitly added in the synthesis algorithm.

Once we had chosen to model rotations, the choice of the rotation parameterization was not straightforward. Lots of problems are associated with the different 3D rotation representations that exist, and none of them is ideal in all situations. Rotation matrices, Euler angles, quaternions, axis/angle representation and exponential maps are the most common rotation parameterizations (see for instance [25] for a more detailed presentation of those five representations), but the choice of the parameterization will always depend on the application of interest.

Our data was originally represented by Euler angles, in which each 3D rotation is split into three simpler successive rotations around the axes of the local coordinate system associated with the object (X, Y and Z axes). That representation is not well suited for our purpose as, among other issues, there is not always a single representation of each 3D rotation but rather several possible angle combinations that lead to the same rotation. More information about singularities in the Euler angle parameterization can be found in [25, 26].

In this work, our angles were converted into the exponential map parameterization, which is locally linear and where singularities can be avoided [26, 27]. Exponential maps represent any 3D rotation by a single rotation about an axis. In this parameterization, each vector $r$ in $\mathbb{R}^3$ is associated with a single rotation:

$$ r = \theta\, u, \tag{1} $$

where the vector r is the three-component exponential map, u is the unit-length 3D vector corresponding to the axis of rotation, and θ is the rotation angle around the axis. The direction of r defines the rotation axis u , and the magnitude (θ) of the vector r is the scalar value of the angle to rotate by. This relationship is completed by associating the zero vector to the identity rotation, making the relationship continuous. For in-depth analysis of the advantages and drawbacks of exponential maps, please refer to [26].
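To make the parameterization concrete, the following sketch converts an exponential map vector into a rotation matrix using Rodrigues' rotation formula. It is a minimal illustration, not the conversion code used in this work; the function name `expmap_to_matrix` and the threshold `eps` are hypothetical.

```python
import math


def expmap_to_matrix(r, eps=1e-12):
    """Convert an exponential map vector r = theta * u to a 3x3 rotation matrix.

    The angle theta is the norm of r and the axis u is its direction.
    The zero vector is mapped to the identity rotation.
    """
    theta = math.sqrt(sum(x * x for x in r))
    if theta < eps:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    ux, uy, uz = (x / theta for x in r)
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    # Rodrigues' formula: R = c * I + s * [u]_x + (1 - c) * u u^T
    return [
        [c + ux * ux * v, ux * uy * v - uz * s, ux * uz * v + uy * s],
        [uy * ux * v + uz * s, c + uy * uy * v, uy * uz * v - ux * s],
        [uz * ux * v - uy * s, uz * uy * v + ux * s, c + uz * uz * v],
    ]


# A quarter turn about the Z axis.
rotation = expmap_to_matrix([0.0, 0.0, math.pi / 2])
```

Each joint is described by one such three-component vector, so a pose with 18 joints corresponds to 18 calls of this kind.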

The pose of the skeleton at each frame of the walk cycle is thus described by a vector with a fixed number of variables: 18 tridimensional joint angles, which gives a vector of 54 values per frame to describe the motion.

5 Average model and style adaptation

5.1 Method

As explained before, our objective was to synthesize stylistic walks with few data, starting from a robust neutral walk modeling. Our approach is to start from a procedure originally developed for speaker adaptation in speech synthesis and to adapt it to our motion problem. The speech and motion fields present strong similarities, like inter-subject variability and stylistic or temporal variations. They are also very different; for instance, motion data do not need feature extraction or temporal windowing, have a much higher dimensionality, and cannot be represented by a finite number of phonemes. This led us to reduce our study to walk synthesis alone, as opposed to motion synthesis in general. In this section, we will briefly explain the different stages of the HMM-based motion synthesis as we used it, based on the HTS framework [11].

5.1.1 Parameter analysis, model structure and labels

Let us assume that our training data $C$ consists of $T$ realizations of our 54-dimensional parameter vector $c_t$: $C = [c_1, c_2, \ldots, c_t, \ldots, c_T]$. As presented in Section 4, our feature vector $c_t$ consists of the 54 exponential map parameters describing the skeleton pose at frame $t$, so we have $c_t = [c_t(1), c_t(2), \ldots, c_t(54)]$. Following the procedure proposed in the HTS framework, the dynamics of the data were taken into account in our models by concatenating $c_t$ with a vector containing the first and second time derivatives of our parameters (for both neutral and stylistic model training) [28]. The observation vector $o_t$ we want to model thus consists of the static feature vector $c_t$ plus the corresponding dynamic feature vectors $\Delta c_t$ and $\Delta^2 c_t$, which makes $o_t$ a 162-dimensional parameter vector. Our observation vector can thus be expressed as $o_t = [c_t, \Delta c_t, \Delta^2 c_t]$, where the derivatives were calculated as follows:

$$ \Delta c_t = \frac{1}{2}\left(c_{t+1} - c_{t-1}\right), \tag{2} $$
$$ \Delta^2 c_t = \frac{1}{4}\left(c_{t+2} - 2c_t + c_{t-2}\right). \tag{3} $$

Taking into account the $T$ observation vectors, our whole training data can be expressed as $O = [o_1, o_2, \ldots, o_t, \ldots, o_T]$. Considering the matrix $W$ representing the coefficients that link $c$, $\Delta c$, and $\Delta^2 c$ as expressed in Equations (2) and (3), the relation between the observation matrix $O$ and the static parameter matrix $C$ is:

$$ O = WC. \tag{4} $$
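The structure of W can be illustrated for a single scalar parameter, using the window coefficients of Equations (2) and (3). This is a sketch under stated assumptions: the helper names `build_w` and `apply_w` are hypothetical, and frames falling outside the sequence are replaced by the nearest valid frame, a boundary rule the text does not specify. For the full 54-dimensional pose vector, the same windows would be applied to each dimension.

```python
# (offset, weight) pairs: identity window, then Equations (2) and (3).
WINDOWS = [
    [(0, 1.0)],                          # static feature c_t
    [(-1, -0.5), (1, 0.5)],              # delta
    [(-2, 0.25), (0, -0.5), (2, 0.25)],  # delta-delta
]


def build_w(num_frames):
    """Build W with 3T rows and T columns for one scalar parameter."""
    rows = []
    for t in range(num_frames):
        for window in WINDOWS:
            row = [0.0] * num_frames
            for offset, weight in window:
                # Assumption: clamp out-of-range frame indices at the edges.
                idx = min(max(t + offset, 0), num_frames - 1)
                row[idx] += weight
            rows.append(row)
    return rows


def apply_w(w, c):
    """Compute O = W C for a scalar trajectory c given as a list."""
    return [sum(wij * cj for wij, cj in zip(row, c)) for row in w]


# Quadratic trajectory c_t = t^2; rows 3t, 3t+1, 3t+2 of the result hold
# the static, delta and delta-delta values of frame t.
c = [0.0, 1.0, 4.0, 9.0, 16.0]
o = apply_w(build_w(len(c)), c)
```

For the middle frame of this example the delta and delta-delta values are 4 and 2, matching the first and second derivatives of a quadratic at that point.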

In HTS, the time d spent in each state of the HMM is explicitly modeled by duration probability density functions thanks to the HSMM [29], a variation of HMMs which takes state duration modeling into account. An HSMM is shown schematically in Figure 3 and can be compared to the classical HMM of Figure 1. Explicit duration modeling avoids the implicit assumption of classical HMMs, in which the probability density of the duration d decays exponentially, an assumption that is inaccurate for most real-life problems, such as motion in our case. State duration densities were modeled with a multidimensional Gaussian distribution for each HMM. The dimension of these distributions is equal to the number of states in the HMM, set to five in our work, with each dimension corresponding to one HMM state, as explained in [29].

Figure 3. A three-state HSMM (with $p_i(d)$ representing the probability density of the duration of state $i$).
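For reference, the decaying duration distribution of a classical HMM follows directly from its self-transition probability. Writing $a_{ii}$ for the self-transition probability of state $i$ (as in Figure 1), the probability of spending exactly $d$ frames in that state is

$$ p_i(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, \ldots $$

This distribution is always maximal at $d = 1$ and has mean $1/(1 - a_{ii})$. By contrast, a Gaussian duration density, written here with an assumed per-state mean $m_i$ and variance $\sigma_i^2$,

$$ p_i(d) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left(-\frac{(d - m_i)^2}{2\sigma_i^2}\right), $$

is maximal at $d = m_i$, so the most likely duration of a state can be placed at any value observed in the data.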

During training, contextual factors related to the position of the step in the whole walk sequence were taken into account, thereby multiplying the number of models to train. However, not all model parameters can be estimated with sufficient accuracy from limited training data. Furthermore, not all possible combinations of contextual factors will be present in the training database, and unseen models have to be taken into account before the synthesis step. To overcome this problem, both parameter and duration models can be clustered using decision trees. The decision tree is a binary tree, and in each of its nodes, a question splits contextual models into two groups. All possible contextual combinations can be found by traversing the trees. Once the decision tree is constructed, unseen contexts can be taken into account and leaves containing little or very similar data can be merged (for more information on how trees are built and used, please refer to [30]).

5.1.2 Average model training

Using the above HSMM model taking both static and dynamic parameters into account, we train an average walk model on a large set of walkers. This average model will be used in the next step of our procedure as the initial model from which the adaptation will start. In our work, the step boundaries of our segmented database are only used to initialize the parameters of the average walk model (they are not used in the adaptation or synthesis stages). A "walker adaptive training" (WAT) algorithm was used during the training stage of our average model. This WAT training reduces the influence of walk differences among the 41 walkers of our training data on the parameters of the final average model. More information on the WAT training of the average model can be found in [31], where it is referred to as "SAT" for "speaker adaptive training".

5.1.3 Style adaptive training of HSMM models

In the previous section, the general HSMM training scheme was presented. In some cases, for instance when not enough data is available to perform a conventional training, an adaptive training procedure can be conducted. This adaptive training modifies a general HSMM model, trained with sufficient data, to fit a particular style using only a small amount of data from this target style. Adaptation is performed in this article with the constrained structural maximum a posteriori linear regression (CSMAPLR) transformation [32, 33]. This transformation is called "linear regression" because the calculated transformation of the HSMM parameters can only be linear. The adapted mean $\hat{\mu}$ ($\hat{m}$) and variance $\hat{\Sigma}$ ($\hat{\sigma}^2$) of the state output (state duration) distributions can be expressed, given the linear transformation $A_o$ ($A_d$) and the bias $b_o$ ($b_d$), in the following form:

$$ \hat{\mu} = A_o \mu - b_o, \qquad \hat{\Sigma} = A_o \Sigma A_o^{\top}, \tag{5} $$
$$ \hat{m} = A_d m - b_d, \qquad \hat{\sigma}^2 = A_d \sigma^2 A_d. \tag{6} $$

The term "constrained" refers to the fact that the linear transformations applied to the means and the linear transformation applied to the variances of the average model (both for durations and observation parameters) are required to be the same, other than the bias. A detailed explanation of the CSMAPLR transformation and how it can be calculated can be found in [32] and [33]. This CSMAPLR transformation is implemented within the HTS framework.
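The effect of such a constrained transformation on one Gaussian can be sketched as follows. This is not the CSMAPLR estimation itself, which is described in [32, 33]; it only applies an already estimated transformation to a mean and a covariance. The function names and the numerical values are hypothetical, and the covariance update is assumed to use the transpose of the transformation so that the result stays symmetric.

```python
def matvec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def adapt_gaussian(a, b, mean, cov):
    """Apply the same linear transformation a to the mean and the covariance.

    The bias b is subtracted from the transformed mean, following the sign
    convention of the text. The covariance becomes a * cov * a^T.
    """
    adapted_mean = [m - bi for m, bi in zip(matvec(a, mean), b)]
    adapted_cov = matmul(matmul(a, cov), transpose(a))
    return adapted_mean, adapted_cov


# Hypothetical two-dimensional example.
new_mean, new_cov = adapt_gaussian(
    a=[[2.0, 0.0], [0.0, 0.5]],
    b=[1.0, -1.0],
    mean=[1.0, 2.0],
    cov=[[1.0, 0.0], [0.0, 4.0]],
)
```

Because the mean and the covariance share the same matrix, a single transformation moves and reshapes the whole distribution consistently, which is what the "constrained" qualifier refers to.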

The last step of the adaptation training procedure consists of a maximum a posteriori (MAP) [13] adaptation that further transforms the models already linearly adapted by CSMAPLR, modifying the estimation of the distributions having enough training samples, as explained in [32].
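As a reminder of how MAP adaptation behaves, the usual textbook form of the MAP update of a Gaussian mean is given below. The exact variant used in the HTS framework is described in [32]; the symbols $\tau$, $\gamma_t$ and $\mu_0$ are introduced here for illustration only. With $\mu_0$ the prior mean (here, the mean already adapted by the linear transformation), $\gamma_t$ the probability of occupying the state at frame $t$ of the adaptation data, and $\tau > 0$ a weight controlling the influence of the prior:

$$ \hat{\mu}_{\mathrm{MAP}} = \frac{\tau\, \mu_0 + \sum_{t} \gamma_t\, o_t}{\tau + \sum_{t} \gamma_t}. $$

When a distribution has seen few adaptation frames, $\sum_t \gamma_t$ is small and the estimate stays close to $\mu_0$; when many frames are available, the estimate moves toward the weighted mean of the adaptation data, which is why only distributions with enough training samples are noticeably modified.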

5.1.4 HSMM synthesis

In HSMM-based synthesis, the synthesis stage consists of an algorithm which directly generates the optimal parameter sequence from the HSMM in the maximum-likelihood sense. In our HSMMs, the probability density function of the observations was modeled by one Gaussian distribution per state. Given an HSMM $\lambda$ and the sequence of steps we want to generate, the HSMM synthesis consists of finding the parameter sequence $O^* = [o_1, o_2, \ldots, o_T]$ with maximum probability given the HSMM model $\lambda$. The problem can thus be mathematically expressed as follows:

O * = arg max O P ( O | λ ) .

Unfortunately, there is no known algorithm to solve this equation analytically. We can thus only find approximate solutions by using the most likely state sequence. Since $P(O \mid \lambda) = \sum_{\text{all } q} P(O, q \mid \lambda)$, where q is one sequence of states from the set of all possible state sequences corresponding to the walk we want to generate, the problem can be approximated by:

$$O^* \approx \arg\max_{O} \left( \max_{q} P(O \mid q, \lambda)\, P(q \mid \lambda) \right). \tag{8}$$

The initial problem of finding the optimal sequence of observations O* given the HSMM λ and the desired sequence of synthesized walk steps can thus be split into two optimization problems:

  1. Find the optimal sequence of states q* given the HSMM λ and the desired sequence of synthesized walk steps:

    $$q^* = \arg\max_{q} P(q \mid \lambda). \tag{9}$$

  2. Find the optimal sequence of parameters O* given the previously determined optimal sequence of states q* and the HSMM λ:

    $$O^* = \arg\max_{O} P(O \mid q^*, \lambda). \tag{10}$$

The optimal sequence of states q* must be estimated first. Since the state duration densities are known from the HSMM modeling, q* can be determined according to Equation (9) [29]. Once the optimal state sequence has been calculated, the optimal sequence of parameters can be determined from Equation (10).
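The first of these two steps can be sketched as follows for one left-to-right model with no skip. The sketch assumes Gaussian state duration densities, for which the product of the duration likelihoods is maximized when each state lasts its mean duration; rounding to whole frames is an approximation. The function names and the duration means are hypothetical and are not taken from the trained models.

```python
import numpy as np

def most_likely_durations(duration_means):
    """Round the mean duration of each state to a whole number of frames (at least 1)."""
    d = np.rint(np.asarray(duration_means, dtype=float)).astype(int)
    return np.maximum(d, 1)

def expand_states(durations):
    """Turn per-state durations into a frame-level state sequence q*."""
    return np.repeat(np.arange(len(durations)), durations)

# Hypothetical duration means (in frames) for a five-state step model.
duration_means = [4.2, 7.8, 3.1, 5.9, 2.4]
durations = most_likely_durations(duration_means)
q_star = expand_states(durations)
print(durations, q_star)
```

The frame-level sequence `q_star` then selects which state Gaussian applies at each frame in the second optimization step.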

When the constraints between static and dynamic features expressed in Equations (2) and (3) are added to the optimization problem, maximizing $P(O \mid q^*, \lambda)$ with respect to O (Equation (10)) becomes equivalent to maximizing it with respect to C:

$$C^* = \arg\max_{C} P(WC \mid q^*, \lambda),$$

as O = WC (Equation (4)). In the HTS framework, and as explained in detail in [28], this problem can be solved using the Cholesky decomposition. The algorithm just described can thus generate a parameter trajectory of static features that maximizes the likelihood of the parameter sequence containing both static and corresponding dynamic parameters given an HSMM.
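A minimal sketch of this maximization is given below for a single static dimension. It makes several assumptions not stated here: only the first derivative is used, the delta window is 0.5 (c_{t+1} - c_{t-1}) with the edge frames replicated, and covariances are diagonal. With Gaussian states, maximizing the likelihood with respect to C amounts to solving the linear system $W^{\top}\Sigma^{-1}W\,C = W^{\top}\Sigma^{-1}\mu$, which is done here with a Cholesky factorization. The function names are illustrative and do not correspond to the HTS implementation.

```python
import numpy as np

def build_window_matrix(T):
    """Return W (2T x T) such that O = W C stacks [c_t, delta c_t] for each frame."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                       # static row: c_t
        prev_t, next_t = max(t - 1, 0), min(t + 1, T - 1)
        W[2 * t + 1, next_t] += 0.5             # assumed delta window:
        W[2 * t + 1, prev_t] -= 0.5             # 0.5 * (c_{t+1} - c_{t-1})
    return W

def generate_trajectory(means, variances):
    """means, variances: (T, 2) per-frame [static, delta] Gaussians selected by q*."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    T = means.shape[0]
    W = build_window_matrix(T)
    precision = 1.0 / variances.reshape(-1)     # diagonal of Sigma^{-1}
    R = W.T @ (precision[:, None] * W)          # W^T Sigma^{-1} W
    r = W.T @ (precision * means.reshape(-1))   # W^T Sigma^{-1} mu
    L = np.linalg.cholesky(R)                   # R = L L^T
    y = np.linalg.solve(L, r)                   # solve L y = r
    return np.linalg.solve(L.T, y)              # solve L^T c = y

# Toy example: two "states" whose static means jump from 0 to 1, zero delta means.
T = 10
means = np.zeros((T, 2))
means[5:, 0] = 1.0
variances = np.ones((T, 2))
trajectory = generate_trajectory(means, variances)
print(np.round(trajectory, 3))
```

In the toy example the delta constraints smooth the jump between the two static means instead of reproducing it as a step, which is how smooth transitions between states arise.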

However, the generated parameter sequence is often excessively smoothed due to statistical processing. The sharp variations that appear in the motion and account for a great deal of the style variation tend to disappear, and the synthesized walks lose a great deal of their naturalness. In [34], Toda and Tokuda present an algorithm to reduce that effect by taking into account one of the characteristics of the parameter sequence that was removed statistically: the global variance of the data. The global variance $g_\upsilon(C)$ of the static features $c_t$ over a time sequence of T frames is calculated by:

$$\bar{c}(\mathrm{dim}) = \frac{1}{T} \sum_{t=1}^{T} c_t(\mathrm{dim}),$$
$$\upsilon(\mathrm{dim}) = \frac{1}{T} \sum_{t=1}^{T} \left( c_t(\mathrm{dim}) - \bar{c}(\mathrm{dim}) \right)^2,$$
$$g_\upsilon(C) = \left[ \upsilon(1), \upsilon(2), \ldots, \upsilon(\mathrm{dim}), \ldots, \upsilon(54) \right].$$

The method proposed in [34] and implemented in the HTS framework considers not only the HMM likelihood for the static and dynamic feature vectors, but also the likelihood of the global variance. The probability to maximize in Equation (7) becomes:

$$P(O \mid \lambda, \lambda_{g\upsilon}) = \sum_{\text{all } q} P(O, q \mid \lambda)^{\omega}\, P(g_\upsilon(C) \mid \lambda_{g\upsilon}),$$

with $\lambda_{g\upsilon}$ a single Gaussian distribution representing the global variance of the data $g_\upsilon(C)$ by a mean vector and a covariance matrix, and ω a constant determining the weight between the two likelihoods. Taking into account the global variance of the data enabled us to avoid over-smoothed synthesized walks.
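The global variance vector of a sequence is simply the per-dimension variance of the static features taken over the T frames of that sequence. The sketch below computes it for a toy sequence; the function name and the data are illustrative assumptions, and in this work the vector would have 54 components.

```python
import numpy as np

def global_variance(C):
    """C: (T, D) static features. Returns the D per-dimension variances over the T frames."""
    C = np.asarray(C, dtype=float)
    mean = C.mean(axis=0)                     # per-dimension mean over the frames
    return ((C - mean) ** 2).mean(axis=0)     # per-dimension variance over the frames

# Toy sequence with T = 3 frames and D = 2 dimensions.
C = np.array([[0.0, 1.0],
              [2.0, 1.0],
              [4.0, 1.0]])
gv = global_variance(C)
print(gv)
```

An over-smoothed trajectory shows up as components of this vector that are smaller than those observed in the training data, which is what the additional likelihood term penalizes.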

Once our adapted model is built, we can synthesize as many stylistic walk sequences as we want using the same synthesis procedure as described here. The model gives us joint angles and the displacement of the skeleton can be computed using our knowledge of the limb lengths and the step in which we are (which defines which foot is in contact with the ground).

5.2 Results

5.2.1 Neutral walk modeling

For our HMM training and synthesis, we followed the method explained in Section 5.1 and adapted the functions originally implemented for speech within the HMM-based speech synthesis system (HTS) to our procedure. The implementation of the HTS toolkit (version 2.1) that we used in this work is publicly available on the HTS website [11].

The three sequences of "free" walk of the 41 subjects of the eNTERFACE'08 3D database were used to train our average neutral walk model, which consisted of five-state left-to-right HSMMs with no skip for both steps (right and left). The database contains 669 observation sequences for "right step" and 656 observation sequences for "left step". We made the contextual distinction between five positions in the walk sequence for each step: the first, second, last-but-one, and last steps of the sequence, and all the other steps. Training thus began with ten models (five for each step).
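The contextual labeling described above can be sketched as a small function that assigns each step of a walk sequence its foot and its position class. The label strings, the function name, and the priority given to "first" and "last" in very short sequences are assumptions made for this illustration.

```python
def label_steps(n_steps, first_foot="right"):
    """Return a (foot, position) label for each step of a walk sequence."""
    feet = ("right", "left") if first_foot == "right" else ("left", "right")
    labels = []
    for i in range(n_steps):
        if i == 0:
            position = "first"
        elif i == n_steps - 1:
            position = "last"
        elif i == 1:
            position = "second"
        elif i == n_steps - 2:
            position = "last-but-one"
        else:
            position = "other"
        labels.append((feet[i % 2], position))   # feet alternate from step to step
    return labels

labels = label_steps(6)
for label in labels:
    print(label)
```

Collecting the labels produced for sequences starting on either foot yields the ten initial contexts (two feet times five positions) mentioned in the text.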

During the training phase, some of the ten initial models were automatically tied by the context-based tree clustering and only six HMMs remained for the whole walk modeling in the average model: two models for the first step of a walk sequence, two for steps inside a sequence, and two for the last step of a sequence (one model for the right step and one for the left step each time).

5.2.2 Style walk modeling

Adaptive training is performed with constrained maximum likelihood linear regression (CMLLR) transformation [33] of our previously trained average neutral walk HSMM. For each of the 11 expressive walks of our Mockey database, a separate adaptive training was performed using all of the data available for the target style. The number of observation sequences for each of the stylized walks is given in Table 1. For each style, we thus obtained separate contextual (initial, final and "inside a sequence") models for the right and left steps.

5.2.3 Synthesis of new walk sequences

Each new walk sequence is synthesized by first concatenating HMMs corresponding to the desired succession of steps. The whole parameter sequence is then calculated from that complete sequence of models, taking into account the dynamics of the synthesized parameters thanks to the first and second derivatives of the parameters. Therefore, the smoothness of the transitions between the successive steps of the walk sequence is ensured.

The parameters generated by the model are only the angles between the body segments, hence no overall displacement of the character is synthesized by the HMMs. But using our knowledge of the boundaries of each synthesized step and the height of each foot (for both heel and toe) given by the known angles and length of the limb parts, we can determine which part of which foot is in contact with the ground. Starting from that fixed 3D point, we can compute the overall displacement of the whole body and ensure at the same time that no foot sliding occurs. Figure 4 illustrates two examples of synthesized walks (sad and topmodel walks). The style difference is already visible in these poses, and the duration difference is also illustrated as more poses (and thus more time) are needed to complete the sad walk step than the topmodel walk step.
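The displacement computation can be sketched as follows, under simplifying assumptions: the positions of the candidate contact points (for instance heel and toe of each foot) relative to the root are supposed to be already available from forward kinematics, and the index of the point touching the ground at each frame is supposed to be known. The function name, the array layout and the random test data are hypothetical; a fuller version would also snap each new support point to the ground height.

```python
import numpy as np

def root_trajectory(contact_rel, support):
    """Compute root positions that keep the current support point fixed in the world.

    contact_rel: (T, K, 3) positions of K candidate contact points relative to the root.
    support:     (T,) index of the contact point touching the ground at each frame.
    """
    contact_rel = np.asarray(contact_rel, dtype=float)
    support = np.asarray(support, dtype=int)
    T = contact_rel.shape[0]
    root = np.zeros((T, 3))
    anchor = np.zeros(3)                          # world position of the support point
    root[0] = anchor - contact_rel[0, support[0]]
    for t in range(1, T):
        if support[t] != support[t - 1]:
            # New support point: pin it where it was in the world at the previous frame.
            anchor = root[t - 1] + contact_rel[t - 1, support[t]]
        root[t] = anchor - contact_rel[t, support[t]]
    return root

# Random stand-in for forward-kinematics output: 6 frames, 4 contact points.
rng = np.random.default_rng(0)
rel = rng.normal(size=(6, 4, 3))
sup = np.array([0, 0, 0, 2, 2, 2])
root = root_trajectory(rel, sup)
print(np.round(root, 3))
```

Since the root is always placed relative to a point that stays fixed during its support phase, the support foot cannot slide by construction.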

Figure 4. Left step of synthesized walk for sad (top) and topmodel (bottom) walks. Synthesized poses are displayed every 0.1 second.

Our average model was trained with data recorded at a frame rate of 60 fps and adapted in the second phase to data captured at a rate of 30 fps, but that difference was not an issue as the durations were adapted automatically during the average-to-style model adaptation. The synthesized walks, coming from models adapted to the Mockey style data, corresponded to a frame rate of 30 fps.

6 Qualitative user evaluation

6.1 Methodology

A recurrent problem with motion data synthesis is the difficulty of evaluating the produced motion sequences. Most studies only present their method without giving the reader information about the quality of the results, or just give a link to an example of synthesized motion.

In this article, we propose three different subjective tests that enabled us to assess the quality of the synthesis results. The basic set of tested videos consisted of 44 walk sequences: one original walk sequence for each of the 11 styles, the same sequences from which the displacement of the root of the skeleton was removed, one sequence of synthesized walk for each of the 11 styles without adding the overall displacement (called "static" in the next sections), and the same synthesized sequences for which the absolute position of the root was calculated as explained in Section 5.1.4 (called "displacement" in the next sections). Two videos of motion synthesized with the average walk model were added (with and without displacement), which makes 46 videos in total. In the video sequences, motion was performed by a basic blue stick-figure character as shown in Figure 2.

Participants accessed the evaluation tests through a web browser. They had to start each video themselves by clicking on it, and could watch it as many times as they wanted. If they did not complete the tests, they could come back later, but a participant's results were saved even if the three tests were not completely finished. Video sequences lasted between 3 and 17 s.

About 100 naive evaluators took part in the evaluation. The three tests and their respective results are presented in Sections 6.2, 6.3, and 6.4. For each of the three tests, every evaluator was presented with a set of ten videos or pairs of videos. Those videos were randomly picked by the evaluation program, and were thus different for each evaluator.

6.2 Naturalness evaluation

In the first test, the evaluator was presented one random video at a time. He was asked to choose among three propositions: the stylistic walk in the video seems "real", "synthetic", or "I don't know". The aim of the test was to determine if there was a significant difference in the way the naturalness of the original and the synthesized walks were perceived.

In a first trial, this test was presented to the users in an odd manner and several users reported that they were confused and did not understand the question. The user was asked if the walk was "natural" or "unnatural", which led most people to perceive nearly all the walks, both original and synthetic, as "unnatural" because of the nature of the data presented: exaggerated walk styles performed by an actor. We thus reformulated the question and only kept the results obtained after that change, which explains why only 500 sequences were evaluated in this first test.

Five hundred walk sequences were evaluated in total (246 original sequences and 254 synthetic sequences). The results of the test are presented in Figure 5. 65.45% of the original walks and 50.39% of the synthetic walks were labeled as "real", and the user could not decide for 2.44% of the original walks and 6.69% of the synthetic walks. We can thus say that even if the original walks seem somewhat more natural to the evaluators, our synthesized walks also looked natural: more than half of the synthetic sequences presented to the evaluators were identified as "real", about 15 percentage points less than the real original walks. We also verified informally that the degree of unnaturalness of the original and synthesized motions labeled as "unnatural" by the evaluators was not significantly different. This was done a posteriori, by showing five people both original and synthesized videos that had most often been identified as "unnatural" during our extensive user evaluation, and asking them whether some of the videos were significantly less natural than others.

Figure 5. Results of the naturalness test comparing the perception (real, synthetic or "I don't know") of original and synthesized walk sequences.

6.3 Style recognition evaluation

In the second test, the evaluators were again presented with one video at a time. They were asked to choose among 13 different style possibilities: the 11 styles, plus "average walk" or "I don't know". A total of 922 evaluations of videos taken randomly from the set of 46 possible videos were performed.

The recognition rate was 45.9% for original walks and 38.93% for synthetic walks. Less than half of the styles were properly recognized, but this result is easily explained by the fact that no examples of the different possible styles were presented to the users before letting them choose among the 13 proposed answers; some of the styles were thus subject to the evaluator's interpretation, which did not always correspond to the actor's interpretation. Furthermore, some of the styles were very close (for instance "proud" was more often recognized as "cool" or "manly") and were easily confused with one another. The confusion matrix of the classification by the evaluators is presented in Table 2. It shows that when a walk style is wrongly identified as another style, that association is the same for original and synthetic walks and consequently depends more on style interpretation than on the quality of the synthesis. In order to ensure that the low style recognition results were not caused by the motion representation (stick figure), we displayed the original motion data on a more realistic 3D body character and asked five subjects to recognize the displayed style in some examples with the stick figure and some examples with the 3D body character. The 3D body representation did not improve recognition, which supports our analysis that the poor style recognition was due to the differing interpretations of our eleven proposed styles by the users and by the actor. Another factor that seemed to influence the results is that in the videos with root displacement, the character displayed was smaller on the screen and the details of the motion were harder to distinguish. Despite these facts, the classification rate is much higher than mere chance: with 13 possible choices for style, a random classification would have given a recognition rate of 1/13 ≈ 7.69%.

Table 2 Confusion matrix of style recognition test for both original walk sequences (first part of the table) and synthesized sequences (second part of the table)

The percentages of correctly classified videos for both original/synthetic and with/without root displacement are presented in Table 3. The results correspond to what could be expected: original motions were slightly better recognized than synthesized motions. Furthermore, adding the displacement of the root, which gives the evaluator a better global view of the scene and of the succession of steps, improves the results for original motions but worsens them for synthesized motions. The values of all style recognition rates for both original and synthetic motions are presented in Figure 6. We can see that for some styles, synthesis worsened the recognition rate (for instance "tiptoe", walk number 8). For others, the style was better recognized in synthesized sequences than in the original ones (for instance the "cool" walk, number 6). The Pearson correlation coefficient between the recognition scores of the original motions and those of the synthesized motions over the 11 styles is 0.8849. This value shows that the recognition rates per style for original and synthetic motions are tightly linked: if a style is well recognized in the original motions, it will be well recognized in the synthesized motions.
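For reference, the Pearson correlation coefficient between two lists of per-style recognition rates can be computed as sketched below. The rates used here are made-up placeholders, not the values of Figure 6, and the function name is an assumption.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()          # center both variables
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Placeholder recognition rates (in percent) for a few styles.
original_rates = [80.0, 35.0, 60.0, 20.0, 55.0]
synthetic_rates = [70.0, 30.0, 65.0, 15.0, 40.0]
r = pearson(original_rates, synthetic_rates)
print(round(r, 4))
```

A coefficient close to 1 means that the styles that are easy to recognize in one set of videos are also the easy ones in the other set, regardless of the absolute recognition level.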

Table 3 Percentage of correctly classified walk sequences for the style recognition test
Figure 6. Style recognition rate (in percent) of original versus synthesized walk sequences for the 11 walk styles.

Following Bernhardt and Robinson [1], we can calculate a more objective measure of the recognition efficiency η by normalizing the achieved recognition rate (or sensitivity) by the recognition rate given by a random classification (sensitivity expected by chance):

$$\eta = \frac{\text{Achieved sensitivity}}{\text{Sensitivity expected by chance}}.$$

The efficiency of the users' recognition is thus equal to $\eta_{\text{orig}}$ = 47.6/7.69 = 6.19 for original walk sequences and to $\eta_{\text{synth}}$ = 37.15/7.69 = 4.83 for synthesized walk sequences. These values are both higher than the human recognition efficiencies cited in [1] (η = 3.72 and η = 3.55 for emotional state classification based on original knocking motions in [35] (four emotions, point-light display) and [36] (five emotions, full video)), indicating that the style component was accurately perceived in both our original and synthesized sequences.
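The arithmetic behind these two efficiencies can be checked with a few lines, using the sensitivities quoted above and the chance level of a 13-way choice. The function name is an illustrative assumption.

```python
def recognition_efficiency(sensitivity_percent, n_classes):
    """Achieved sensitivity divided by the sensitivity expected by chance."""
    chance_percent = 100.0 / n_classes     # 13 choices give about 7.69%
    return sensitivity_percent / chance_percent

eta_orig = recognition_efficiency(47.6, 13)
eta_synth = recognition_efficiency(37.15, 13)
print(round(eta_orig, 2), round(eta_synth, 2))
```

Normalizing by the chance level is what makes the comparison with studies offering four or five answer choices meaningful.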

6.4 Original versus synthesized comparison

In our last test, participants were presented with two videos at the same time, the original and synthesized videos corresponding to the same style (either both static or both with displacement). A screenshot of this third test is presented in Figure 7. The order of the videos was randomly determined by the program. Evaluators were asked to choose among five possible qualifications for the level of resemblance between the two videos: "identical", "slightly different", "different", "very different" and "nothing in common". We gave numerical values to these subjective opinions, from four for "identical" to zero for "nothing in common". The score is thus four if all comparisons are found identical and zero if they are all found as having "nothing in common". Eight hundred and sixty-five comparison tests were performed, leading to a very good global score of 3.15. The number of answers for each of the five categories is presented in Figure 8. We can see that the most frequently chosen level of resemblance is "identical" and that the number of answers decreases as the perceived difference increases. These results show that our synthesized walk sequences look very similar to the original training data.
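The global score is the mean of the numerical values attached to the answers. A minimal sketch of that computation is shown below; the answer counts are placeholders chosen for the example and are not the counts of Figure 8.

```python
SCORES = {
    "identical": 4,
    "slightly different": 3,
    "different": 2,
    "very different": 1,
    "nothing in common": 0,
}

def global_score(counts):
    """Mean resemblance score from a mapping answer -> number of times chosen."""
    total = sum(counts.values())
    return sum(SCORES[answer] * n for answer, n in counts.items()) / total

# Placeholder answer counts for illustration only.
example_counts = {"identical": 5, "slightly different": 3, "different": 2}
score = global_score(example_counts)
print(score)
```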

Figure 7. Screenshot of the interface of the third evaluation test: comparison of original versus synthesized walk.

Figure 8. Original versus synthesized comparison test, with resemblance ranging from 0 (nothing in common) to 4 (identical).

7 Conclusion

Thanks to the method presented in this article, we were able to build HMMs of our 11 different stylized walks, and to use these models to synthesize new walk sequences. Our method produces very convincing synthesized walk sequences in which the style characteristics can be recognized (some examples of synthesized motion sequences can be found in "Additional file 1" or on the author's webpage [37]), even though the walk styles in our stylistic database were exaggerated and thus extremely different from each other, unlike most motion style studies, which concentrate on smaller variations.

Additional file 1: (QuickTime movie). This short video presents some examples of the stylistic walk sequences that were synthesized in this work and presented to the participants of the user assessment tests. (MOV 9 MB)

We also proposed a setup for a subjective evaluation of the synthesis results, which showed that the synthesized walks were close to the original training data and also pointed out some of the weaknesses of the synthesis, indicating directions for future work. The recognition test showed for instance that adding the displacement to the motion improved the recognition rate for original motions but had the opposite effect on synthesized sequences. We think that this is due to the inter-step variation which is lower in the synthesized sequences than in the original motion, and that should be further improved.

Future work will include further analyses of the evaluation tests that can be used to assess the naturalness of the produced motions, and analysis of the use of style interpolation/extrapolation using the trained models. One could also study how several parameters influence the perceived results, like the variables of the HMM (number of states for instance), the number of stylistic steps in the adaptation training phase, the way the results are presented to the user (skinned virtual character versus stick figure), how a reduction of the dimensionality of the original data influences the quality of the results, etc. The adaptation method presented here could also be used to analyze and synthesize walks for different human characteristics that influence the walk style, like gender (male vs. female walk) or age (children vs. elderly or others).


References

  1. Bernhardt D, Robinson P, Paiva A, Prada R, Picard R: Detecting Affect from Non-stylised Body Motions. In Affective Computing and Intelligent Interaction. Springer, Berlin; 2007:59-70.

  2. Menache A: Understanding Motion Capture for Computer Animation and Video Games. Morgan Kaufmann Publishers Inc., San Francisco; 2000.

  3. Mena D, Mansour J, Simon S: Analysis and synthesis of human swing leg motion during gait and its clinical applications. J Biomech 1981, 14(12):823-832. 10.1016/0021-9290(81)90010-5

  4. Pejsa T, Pandzic I: State of the art in example-based motion synthesis for virtual characters in interactive applications. Comput Graph Forum 2010, 29: 202-226. 10.1111/j.1467-8659.2009.01591.x

  5. Forsyth D, Arikan O, Ikemoto L, O'Brien J, Ramanan D: Computational studies of human motion: part 1, tracking and motion synthesis. Found Trends Comput Graph Vis 2005, 1(2-3):77-254.

  6. Geng W, Yu G: Reuse of motion capture data in animation: a review. Comput Sci Appl (ICCSA) 2003, 2003: 620-629.

  7. Calinon S, Guenter F, Billard A: On learning, representing, and generalizing a task in a humanoid robot. IEEE Trans Syst Man Cybern B 2007, 37(2):286-298.

  8. Glardon P, Boulic R, Thalmann D: PCA-based walking engine using motion capture data. IEEE Comput Graph Int 2004, 2004: 292-298.

  9. Brand M, Hertzmann A: Style machines. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., New York; 2000:183-192.

  10. Tilmanne J, Dutoit T: Expressive gait synthesis using PCA and Gaussian modeling. In Proceedings of the Third international conference on Motion in games. Springer, Berlin, Heidelberg; 2010:363-374.

  11. HTS Working Group: The HMM-based speech synthesis system (HTS) Version 2.1. Accessed 2010

  12. Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 1989, 77: 257-286. 10.1109/5.18626

  13. Yamagishi J, Nose T, Zen H, Ling Z, Toda T, Tokuda K, King S, Renals S: Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Trans Audio Speech Lang Process 2009, 17(6):1208-1230.

  14. Picart B, Drugman T, Dutoit T: Analysis and synthesis of hypo and hyperarticulated speech. In Proceedings of the Speech Synthesis Workshop 7 (SSW7). NICT/ATR, Kyoto, Japan, Sept; 2010:270-275.

  15. Troje NF, Shipley TF, Zacks JM: Retrieving Information from Human Movement Patterns. In Understanding Events: How Humans See, Represent, and Act on Events. Volume 1. Oxford University Press, Oxford; 2008:308-334.

  16. Tanco LM, Hilton A: Realistic synthesis of novel human movements from a database of motion capture examples. In Proc of the Workshop on Human Motion (HUMO'00). IEEE Computer Society, Washington, DC, USA; 2000:137.

  17. Wang Y, Liu Z, Zhou L: Automatic 3D motion synthesis with time-striding hidden Markov model. In Proc International Conference on Machine Learning and Cybernetics (ICMLC'05). Volume 3930. SB Heidelberg, Guangzhou, Aug; 2005:558-567.

  18. Li Y, Wang T, Shum H: Motion texture: a two-level statistical model for character motion synthesis. In Proc of SIGGRAPH'02. ACM Press, New York; 2002:465-472.

  19. Ramanan D, Forsyth DA: Motion Analysis by Synthesis: Automatically Annotating Activities in Video. 2005.

  20. Wang Y, Xie L, Liu Z, Zhou L: The SOMN-HMM model and its application to automatic synthesis of 3D character animation. In IEEE Conference on Systems, Man, and Cybernetics. Taipei, Taiwan; 2006:4948-4952.

  21. Wang Y, Liu Z, Zhou L: Learning style-directed dynamics of human motion for automatic motion synthesis. In IEEE Conference on Systems, Man, and Cybernetics. Taipei, Taiwan; 2006:4428-4433.

  22. Yamazaki T, Niwase N, Yamagishi J, Kobayashi T: Human walking motion synthesis based on multiple regression hidden semi-Markov model. In 2005 International Conference on Cyberworlds (CW'05). IEEE Computer Society, Washington DC; 2005:445-452.

  23. IGS-190: Animazoo website.

  24. Tilmanne J, Sebbe R, Dutoit T: A database for stylistic human gait modeling and synthesis. In Proceedings of the eNTERFACE'08 Workshop on Multimodal Interfaces. Paris, France; 2008:91-94.

  25. Parent R: Technical background. In Computer Animation Complete: Part I: Introduction to Computer Animation. Morgan Kaufmann, Emeryville; 2009:60-68.

  26. Grassia F: Practical parameterization of rotations using the exponential map. J Graph Tools 1998, 3: 29-48. 10.1080/10867651.1998.10487493

  27. Johnson MP: Exploiting quaternions to support expressive interactive character motion. Massachusetts Institute of Technology; 2002.

  28. Tokuda K, Yoshimura T, Masuko T, Kobayashi T, Kitamura T: Speech parameter generation algorithms for HMM-based speech synthesis. In Proc ICASSP. Istanbul, Turkey; 2000:1315-1318.

  29. Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Duration modeling for HMM-based speech synthesis. In Fifth International Conference on Spoken Language Processing (ICSLP). Sydney; 1998:29-32.

  30. Young S, Evermann G, Gales M, Hain T, Kershaw D, Liu X, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P: The HTK Book, Version 3.4. Entropic Cambridge Research Laboratory, Cambridge; 2009.

  31. Yamagishi J, Kobayashi T: Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Trans Inf Syst 2007, 90(2):533-543.

  32. Yamagishi J, Kobayashi T, Nakano Y, Ogata K, Isogai J: Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Trans Audio Speech Lang Process 2009, 17: 66-83.

  33. Gales M: Maximum likelihood linear transformations for HMM-based speech recognition. Comput Speech Lang 1998, 12(2):75-98. 10.1006/csla.1998.0043

  34. Toda T, Tokuda K: A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans Inf Syst 2007, 90(5):816-824.

  35. Kapur A, Kapur A, Virji-Babul N, Tzanetakis G, Driessen P: Gesture-Based Affective Computing on Motion Capture Data. In Affective Computing and Intelligent Interaction. Volume 3784. Springer, Berlin/Heidelberg; 2005:1-7. 10.1007/11573548_1

  36. Pollick FE, Paterson HM, Bruderlin A, Sanford AJ: Perceiving affect from arm movement. Cognition 2001, 82(2):B51-B61. 10.1016/S0010-0277(01)00147-0

  37. Joelle Tilmanne's webpage.


This project was partly funded by the Ministry of Région Wallonne under the Numediart research program (grant N0716631). Joëlle Tilmanne was supported by the "Fonds pour la formation à la recherche dans l'industrie et l'agriculture" (FRIA) during part of this work. The authors would like to thank the comedian Sebastien Marchetti and Thierry Ravet for their participation in the motion capture database recording.

Author information

Corresponding author

Correspondence to Joëlle Tilmanne.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Tilmanne, J., Moinet, A. & Dutoit, T. Stylistic gait synthesis based on hidden Markov models. EURASIP J. Adv. Signal Process. 2012, 72 (2012).
