
# Stylistic gait synthesis based on hidden Markov models

- Joëlle Tilmanne^{1} (corresponding author)
- Alexis Moinet^{1}
- Thierry Dutoit^{1}

**2012**:72

https://doi.org/10.1186/1687-6180-2012-72

© Tilmanne et al; licensee Springer. 2012

**Received:** 15 April 2011 **Accepted:** 26 March 2012 **Published:** 26 March 2012

## Abstract

In this work we present an expressive gait synthesis system based on hidden Markov models (HMMs), following and modifying a procedure originally developed for speaking style adaptation in speech synthesis. A large database of neutral motion capture walk sequences was used to train an HMM of average walk. The model was then used for automatic adaptation to a particular style of walk using only a small amount of training data from the target style. The open source toolkit that we adapted for motion modeling also enabled us to take into account the dynamics of the data and to model accurately the duration of each HMM state. We also address the assessment issue and propose a procedure for qualitative user evaluation of the synthesized sequences. Our tests show that the style of these sequences can easily be recognized and that the sequences look natural to the evaluators.

## Keywords

- motion capture
- hidden Markov models
- style
- expressivity
- gait
- motion synthesis

## 1 Introduction

Human motion is a very complex field of study. The components of our behaviors, which are so natural to the human eye, can hardly be separated into physiological processes, the personal style of every human, or some kind of additional "style" or "mood" that influences the final motion, as presented in many works (see for instance [1]). A given gesture will easily be identified by the human eye as clumsy, elegant, heavy, or any other characteristic. Unfortunately, automatically extracting that information is a very difficult task, as stylistic variation is intrinsically merged with the basic motion, the individuality of the subject and the time-variability of the gesture (two motions by the same subject will never be exactly the same).

A broad field of applications can be found for human motion synthesis. While its use is currently mostly limited to the entertainment industry, with 3D movies, video games, virtual agents, etc., other domains could benefit from realistic automatic motion synthesis, in the same way as they already benefit from motion capture [2]. Medical applications could use it, for instance, to control active prostheses or to detect and understand the motion of motor-impaired individuals [3]. New applications in the animation of 3D virtual characters could also take advantage of more advanced motion synthesis, for virtual agents interacting in real time with the user, for 3D animation movies, for video games, etc. [4]. A good synthesizer could for instance enable non-professional animators to produce believable and controllable motion sequences without animation experience or expensive motion capture equipment. For artistic applications, motion analysis/synthesis could make it possible for an actor or a dancer to interact in real time on stage with a virtual character whose motions are correlated with those of the human performer or with any other signal.

In the framework of virtual character animation, several approaches are available to synthesize realistic human motion. Among those, motion capture based approaches have attracted a lot of interest in recent years, especially as motion capture has become more affordable. Numerous methods have been developed for using and re-using data from motion capture [5], a technology that transfers the movements of humans into a numerical form usable by computers. The main problems encountered with motion capture data are its high dimensionality, the choice of the parameterization, and the variability associated with motion in general. All these factors make it hard to retrieve, analyze, adapt, and modify motion patterns either made "on request" or coming from an existing motion database.

Two approaches coexist for using motion capture data to produce animations: the "template-based" and the "model-based" approaches. In the "template-based" approach, a large database of motion sequences is built and algorithms are developed to address the common data-mining issues (like retrieving the required motion segments), edit these motion parts if needed, and blend them together to produce new sequences [6]. Several problems are associated with the "template-based" approach, which relies on a database that is queried for motion segments. The database needs to be stored, which can be a first issue, and must be large enough to contain all the required motion capture segments. But increasing the database size can hamper effective searching. In the synthesis process, unrelated motion parts have to be concatenated, and it is difficult to ensure the continuity of the produced motion. Controllability is also an issue, as there is no continuous modeling of styles, only distinct independent examples that can display several characteristics.

The "model-based" approach, sometimes also referred to as the "machine learning" approach, consists in training models based on motion capture data. The models can later be used to synthesize new motion sequences without resorting to the database initially used for training [7–10]. Furthermore, style can be modeled as a parameter of the model, giving the user new possibilities for controlling the synthesized segments and for combining styles not available in the original data. This approach has been used for years in speech processing, for example, first for recognition and more recently for synthesis [11].

Our current work falls in the latter category, with the use of model-based techniques, and more precisely of hidden Markov models (HMMs) [12], for the modeling and synthesis of human-like motion. We aim not only to synthesize a plausible human walk but also to isolate some kind of "style" component. Taking into account such a "style" parameter will enable us to synthesize a broad range of styles, and to have an open model where new styles can always easily be added.

In this work, a general model of "neutral" walk is built in a first step, by training a model over a large database. The resulting neutral-style model can then be used as a basis for the adaptive training of any style-specific model using only a small amount of training data from the target style. A new style-adapted model can thus be obtained very easily each time it is required by capturing only about a dozen steps of the desired walk style and running the adaptive training. This technique, which was originally developed for speaker adaptation in speech synthesis (HTS toolkit) [11, 13, 14], has been adapted to the motion synthesis problem in our work. The main interest of this approach is that it makes it possible to tackle the main problem of model-based techniques, which is the large amount of data needed to train each new model (corresponding to each different walk style in our case). Thanks to this work, it is possible to train representative models for walk styles for which the training of standard models failed because the set of data available for each style was too small. This method also opens interesting paths for style interpolation or for adding style to plain walks.

The article is organized as follows. Section 2 reviews HMM-based motion analysis/synthesis. The training databases are then presented in Section 3. Section 4 describes the preprocessing of the data. Section 5 presents the HMM training and adaptation procedure and its use for synthesis of new stylistic walk sequences, along with some results. A qualitative user evaluation is presented in Section 6. Section 7 concludes this article by presenting perspectives and future work.

## 2 Related work

### 2.1 HMMs for motion synthesis

Various walk synthesis algorithms use statistical learning techniques to automatically extract the underlying rules of human motion, without any prior knowledge, directly from training on 3D motion capture data. The resulting statistical models can then be used for generating new motion sequences automatically, using only some high-level commands from the user. Such synthesized motions are thus visually different from the training motions but stochastically similar to them. A few studies use principal component analysis (PCA), not for reducing the dimensionality of the angle data, but as a way of modeling motion units composed of a sequence of frames. Thanks to their periodicity, walk cycles are especially well suited for such an algorithm. This approach has been taken for instance by Glardon et al. [8], Troje [15] and in our previous work [10]. But most work in this area uses variations of HMMs, Markov chains or other kinds of probabilistic transitions between motions [9, 16–18], in order to take the high dynamic complexity of human movement into account.

At each time *t*, the external observer sees one outcome, but does not know which state produced it. HMMs are hence doubly stochastic processes, as visible outcomes are determined by the outcome probability distribution associated with the state, and as the state changes at each time according to the transition probabilities between states. In our work, the outcomes of the HMM are the frames of the motion. A basic left-to-right HMM with no skips is illustrated in Figure 1, as a model in which the only possible state transitions at each time are either to stay in the same state or to go to the next state.
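As an illustration of the left-to-right topology with no skips, the transition structure can be sketched as follows. This is only a minimal sketch: the number of states, the value of `stay_prob` and the choice of making the last state loop on itself are assumptions made for the example, not parameters taken from the models trained in this work.

```python
import numpy as np

def left_to_right_transitions(n_states, stay_prob):
    """Transition matrix of a left-to-right HMM with no skips.

    From state i the chain either stays in i (probability stay_prob)
    or moves to state i + 1. The last state loops on itself (assumption).
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = stay_prob
        A[i, i + 1] = 1.0 - stay_prob
    A[-1, -1] = 1.0
    return A

def sample_states(A, n_frames, rng):
    """Draw a hidden state sequence of n_frames frames, starting in state 0."""
    states = [0]
    for _ in range(n_frames - 1):
        states.append(rng.choice(len(A), p=A[states[-1]]))
    return np.array(states)

A = left_to_right_transitions(n_states=5, stay_prob=0.8)
states = sample_states(A, n_frames=30, rng=np.random.default_rng(0))
```

Every sampled sequence is non-decreasing and never jumps over a state, which is the hidden part of the process; the observer would only see the frames emitted by those states.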

Motion has been studied with numerous variants of HMMs, whether it was for analysis or synthesis purposes [19]. In the following paragraphs, we will focus on the studies related to the use of HMMs for motion synthesis, and not just motion in general. In some cases, some kind of "style" component is taken into account, but no style parameter has yet been found that can be used to synthesize styles that are very different from the motions on which the system was trained. Increasing the number of styles represented by the system means increasing the complexity of the model and in most cases re-training it completely, with the additional issue that enough data must be available for each style.

Tanco and Hilton [16] describe a model consisting of two hierarchical levels. In the first level, PCA is used to reduce the dimensionality of the data and is followed by a *K*-means clustering of the pose space. The clusters, which define the boundaries of "motion segments" in the original training data, are then used as states of a Markov chain that represents the temporal behavior of the training data. A discrete HMM is only used in the second level to relate the states of the Markov chain to the original full examples of motion sequences from the training database (the Markov chain states are the observations of the HMM and the hidden states of the HMM are the motion examples). During synthesis, a sequence of Markov chain states is calculated given beginning and end poses defined by the user. The second synthesis stage takes the generated state sequence as an input and searches for the most likely sequence of motion segments from the original training data that could have generated that Markov chain state sequence. There is thus no "true" HMM synthesis step, as the database needs to be accessed each time a new motion has to be built. This work is more closely related to the template-based approach to motion capture animation described in Section 1 than to the approach we describe in the present article.

Wang et al. [17] go further in their motion modeling by using a "time-striding HMM" (TSHMM), which is also a two-layer model. In the first layer, an approximation of high-level (time-striding) statistical transitions is calculated, with first order transition probabilities. Those "high level" transitions correspond, for example, to the transitions between two different behaviors like walking and running. The high-level states from the upper layer are modeled in the second layer by a set of left-right HMMs. Those HMMs correspond to "atomical movements", i.e., motion segments that are as short as possible while being long enough to enable the prediction of the next pose. Synthesis is based on the model only, without needing to reuse any motion segment from the original database.

Li et al. [18] also use the principle of motion decomposition into sub-units connected to each other by transition probabilities, and model each sub-unit individually. Their system, called "motion texture", is a technique for synthesizing complex human motions (like dancing for instance) so that they are statistically similar to the original motion capture data. The model is made of a set of "motion textons", and of their distribution, thereby characterizing the stochastic and dynamic nature of motion captures performed for the training. They define "motion textons" as the repetitive patterns in complex human motion (for instance: spinning, hopping or tiptoeing for dance motion). Each motion texton is modeled by a linear dynamic system (LDS) [18]. The distribution of the textons is modeled by a transition matrix which gives probabilities for transiting from one texton to another. It is thus possible to generate new animations and vary their execution by modifying motion at the texton level, or to synthesize a new choreography by varying the distributions.

### 2.2 Motion style modeling and synthesis

An interesting approach is chosen by some researchers who try to integrate a "style" variable into their HMM models. It enables the model, during the synthesis step, to vary not only the motion itself, but also the way the motion is performed, i.e. the "style" of the motion.

Wang et al. [20], for instance, use a training algorithm which integrates statistical optimization techniques with the expectation-maximization (EM) learning steps. Their method, called "SOMN-HMM" (which stands for "self-organizing mixture networks" which are used to represent mixture of Gaussians in the HMMs), makes it possible to train basic HMMs as well as parametric HMMs containing a "style" parameter. In [21], output densities are represented by "stylized decomposable triangulated graphs" (mix-SDTG) instead of SOMNs, and they also take into account a style variable.

Among all the models that generate motion data by means of approximation functions, the "style machine" developed by Matthew Brand [9] is especially appealing. The major interest of this method is that, thanks to its learning algorithm based on the maximization of entropy, it makes it possible to train HMMs whose structure is not known in advance, without having to proceed by successive attempts to find an adequate structure. Furthermore, this method integrates a style variable that can vary during the synthesis of a motion sequence. However, in that work the "style" variable is not explicit; it is thus not possible to control a given style directly, but only to change some intrinsic style-related parameters.

In an approach closely related to ours, Yamazaki et al. [22] synthesize walk using a hidden semi-Markov model (HSMM). The "style" variation they incorporate in their model through multiple regression consists of quantitative variations of speed and stride length. There are thus two values that can be controlled, but multiple regression is not suited to expressivity modeling, which can hardly be quantified with a numerical value. The multiple regression model is trained once and for all, and it is not possible to add a new "style characteristic" without training the whole model again.

One of the problems with motion synthesis is that, unlike for speech which is decomposed into sentences, words, phonemes, etc., which are universal and can be represented as a finite set of possibilities, there is no widely accepted "dictionary" of basic motions. Each research team uses its own terminology and the possibilities are potentially infinite. There is thus no common basis for comparison, and as there is no method to assess the quality or the realism of a synthesized motion, the comparison of methods proposed by each research group is not straightforward. Most studies even lack qualitative assessment of their results.

## 3 Training databases

In all model-based techniques, the first major issue is to obtain enough representative training data. The quality of models is highly dependent on the quality of the data and how accurately these data describe the studied phenomenon. Motion capture being the only solution to obtain realistic 3D human motion data [2], it is the only way to gather representative training data for statistical modeling of human motion.

In this work we have used two databases recorded with an inertial motion tracking system, the inertial gyroscopic system (IGS-190) from Animazoo [23]. The IGS-190 is a commercial motion capture suit that contains 18 inertial sensors, which each consist of a three axis accelerometer, a three axis gyroscope and a three axis magnetometer. The data from those three sources are integrated and fused directly in the inertial sensor boxes. Angles between the body segments are thus provided straight from the sensors; no mapping is necessary between tracked 3D positions of markers and joint angles, unlike in optical motion capture systems.

Most studies use optical motion capture systems, which usually induce space limitations and where walk is thus recorded on a treadmill. In contrast, the inertial suit IGS-190 does not imply any kind of space limitation. The recorded subject can thus move freely in an open space area and walk can be recorded in a more natural way. This kind of inertial suit is thus especially interesting for the study of expressive walk, as it gives more freedom to the subject, who can follow non-straight trajectories and is not constrained to a given constant speed as on a treadmill.

There is no 3D position tracking system in the IGS suit, and the absolute position of the subject is calculated by the software given a known initial position, using the length of the skeleton segments from the feet to the hip, the angles recorded between those segments for each frame, and always considering that the lowest point of the skeleton is in contact with the ground.

Our two databases, respectively called "eNTERFACE'08 3D" and "Mockey", were recorded with the same motion capture suit but with different aims, subjects and settings. The eNTERFACE'08 3D database is described in detail in [24]. This first database contains 17 walk sequences for 41 subjects. Among these, 12 sequences correspond, for each subject, to three sequences of straight walk over approximately seven meters for four different speed instructions. Those four instructions were "free", slow, middle and fast walks. In the "free" walk, subjects were invited to walk at their usual comfort speed. In the present work, the three free walk sequences of the 41 subjects were used to train our average "neutral" walk model. In that database, the motion was captured at a frame rate of 60 frames per second (fps).

Table 1 Mockey database walk styles and corresponding number of steps recorded

| Walk | Style | Nbr steps (left) | Nbr steps (right) |
|---|---|---|---|
| 1 | Proud | 26 | 24 |
| 2 | Decided | 18 | 15 |
| 3 | Sad | 35 | 33 |
| 4 | Topmodel | 28 | 27 |
| 5 | Drunk | 40 | 40 |
| 6 | Cool | 25 | 25 |
| 7 | Afraid | 19 | 18 |
| 8 | Tiptoeing | 18 | 20 |
| 9 | Heavy | 24 | 25 |
| 10 | In a hurry | 20 | 21 |
| 11 | Manly | 19 | 20 |

## 4 Data preprocessing

In the data format we use, three values per frame give the absolute 3D position (XYZ cartesian coordinates) of the root of the skeleton while the 54 other values represent the 3D angles of the 18 joints of the skeleton. The three values corresponding to the 3D position were discarded as they can be recalculated later using the angles, information about the foot contact with the ground, and the fixed leg segment lengths. The directions of all walk sequences from both databases were then aligned before further processing. The walk sequences were also manually segmented into left and right steps. The boundaries of the steps were arbitrarily defined as the moment the heel touches the ground.
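The frame layout described above can be sketched as follows. This is only an illustrative sketch: the function name `split_frames` is hypothetical, and the assumption that the three position values come first in each 57-value frame is ours and is not stated in the description of the data format.

```python
import numpy as np

N_JOINTS = 18

def split_frames(frames):
    """Split (T, 57) frames into root positions (T, 3) and joint angles (T, 18, 3).

    Assumption: the three root position values come first in each frame.
    """
    frames = np.asarray(frames, dtype=float)
    root_position = frames[:, :3]
    joint_angles = frames[:, 3:].reshape(len(frames), N_JOINTS, 3)
    return root_position, joint_angles

frames = np.arange(2 * 57, dtype=float).reshape(2, 57)  # two dummy frames
root_position, joint_angles = split_frames(frames)
# Only the 54 angle values per frame are kept for modeling.
features = joint_angles.reshape(len(frames), -1)
```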

We chose to model the rotations of the 18 captured joints rather than the 3D cartesian coordinates of these joints in order to ensure that the fixed limb length constraints were respected in the synthesized motion: as only rotations are applied to the fixed limb length skeleton definition presented in Section 3, there will be no length deformation in the skeleton after synthesis. This would not be the case with joint cartesian coordinates, as nothing would ensure that the distance between two successive joints of the skeleton hierarchy remains constant, unless that constraint is explicitly added in the synthesis algorithm.

Once we had chosen to model rotations, the choice of the rotation parameterization was not straightforward. Many problems are associated with the different 3D rotation representations that exist, and none of them is ideal in all situations. Rotation matrices, Euler angles, quaternions, axis/angle representation and exponential maps are the most common rotation parameterizations (see for instance [25] for a more detailed presentation of those five representations), but the choice of the parameterization will always depend on the application of interest.

Our data was originally represented by Euler angles, in which each 3D rotation is split into three simpler successive rotations around the axes of the local coordinate system associated with the object (*X, Y* and *Z* axes). That representation is not well suited for our purpose as, among other issues, there is not always a single representation of each 3D rotation but rather several possible angle combinations that lead to the same rotation. More information about singularities in the Euler angle parameterization can be found in [25, 26].

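With exponential maps, each vector in $\mathbb{R}^{3}$ is associated with a single rotation:

$\overrightarrow{r}=\theta \overrightarrow{u}$ (1)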

where the vector $\overrightarrow{r}$ is the three-component exponential map, $\overrightarrow{u}$ is the unit-length 3D vector corresponding to the axis of rotation, and *θ* is the rotation angle around the axis. The direction of $\overrightarrow{r}$ defines the rotation axis $\overrightarrow{u}$, and the magnitude (*θ*) of the vector $\overrightarrow{r}$ is the scalar value of the angle to rotate by. This relationship is completed by associating the zero vector to the identity rotation, making the relationship continuous. For in-depth analysis of the advantages and drawbacks of exponential maps, please refer to [26].
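The relationship between an axis/angle pair, the exponential map and the corresponding rotation can be sketched as follows. This is a minimal sketch using Rodrigues' rotation formula; the function names are hypothetical, and the conversion from the original Euler angle data is not shown.

```python
import numpy as np

def axis_angle_to_expmap(axis, theta):
    """Exponential map r = theta * u, with u the unit-length rotation axis."""
    u = np.asarray(axis, dtype=float)
    u = u / np.linalg.norm(u)
    return theta * u

def expmap_to_matrix(r, eps=1e-12):
    """Rotation matrix of an exponential map vector (Rodrigues' formula)."""
    r = np.asarray(r, dtype=float)
    theta = np.linalg.norm(r)      # rotation angle = magnitude of r
    if theta < eps:
        return np.eye(3)           # zero vector <-> identity rotation
    u = r / theta                  # rotation axis = direction of r
    K = np.array([[0.0, -u[2], u[1]],
                  [u[2], 0.0, -u[0]],
                  [-u[1], u[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# Quarter turn around the Z axis.
r = axis_angle_to_expmap([0.0, 0.0, 1.0], np.pi / 2)
R = expmap_to_matrix(r)
```

Each joint is then described by the three components of `r`, which is how the 18 joints lead to 54 values per frame.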

The pose of the skeleton at each frame of the walk cycle is thus described by a vector with a fixed number of variables: 18 tridimensional joint angles, which gives a vector of 54 values per frame to describe the motion.

## 5 Average model and style adaptation

### 5.1 Method

As explained before, our objective was to synthesize stylistic walks with few data, starting from a robust neutral walk modeling. Our approach is to start from a procedure originally developed for speaker adaptation in speech synthesis and to adapt it to our motion problem. The speech and motion fields present strong similarities, such as inter-subject variability and stylistic or temporal variations. They are also very different; for instance, motion data do not need feature extraction or temporal windowing, have a much higher dimensionality, and cannot be represented by a finite number of phonemes. This led us to reduce our study to walk synthesis alone, as opposed to motion synthesis in general. In this section, we will briefly explain the different stages of the HMM-based motion synthesis as we used it, based on the HTS framework [11].

#### 5.1.1 Parameter analysis, model structure and labels

The static parameter matrix $C$ consists of $T$ realizations of our 54-dimensional parameter vector $c_t$: $C=[c_1, c_2, \ldots, c_t, \ldots, c_T]$. As presented in Section 4, our feature vector ($c_t$) consists of the 54 exponential map parameters describing the skeleton pose at frame $t$, so we have $c_t=[c_t(1), c_t(2), \ldots, c_t(54)]^{\top}$. Following the procedure proposed in the HTS framework, the dynamics of the data were taken into account in our models by concatenating $c_t$ with a vector containing the first and second time derivatives of our parameters (for both neutral and stylistic model training) [28]. The observation vector $o_t$ we want to model thus consists of the static feature vector $c_t$ plus the corresponding dynamic feature vectors $\Delta c_t$ and $\Delta^{2} c_t$, which makes $o_t$ a 162-dimensional parameter vector. Our observation vector $o_t$ can thus be expressed as ${o}_{t}={\left[{c}_{t}^{\top},\Delta {c}_{t}^{\top},{\Delta}^{2}{c}_{t}^{\top}\right]}^{\top}$, where the derivatives were calculated as follows:

With $T$ observation vectors, our whole training data can be expressed as $O=[o_1, o_2, \ldots, o_t, \ldots, o_T]$. Considering the matrix $W$ representing the coefficients that link $c$, $\Delta c$ and $\Delta^{2}c$ as expressed in Equations (2) and (3), the relation between the observation matrix $O$ and the static parameter matrix $C$ is:

$O=WC$ (4)

The time *d* spent in each state of the HMM is explicitly modeled by duration probability density functions thanks to the HSMM [29], a variation of HMMs which takes state duration modeling into account. A schematic representation of an HSMM is shown in Figure 3 and can be compared to the classical HMM of Figure 1. This prevents the probability density of the duration *d* from being modeled as a decaying exponential as in classical HMMs, which is inaccurate for most real-life problems, such as motion in our case. State duration densities were modeled with a multidimensional Gaussian distribution for each HMM. The dimension of these distributions is equal to the number of states in the HMM, set to five in our work, with each dimension corresponding to one HMM state, as explained in [29].
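The difference between the implicit duration model of a classical HMM and an explicit Gaussian duration model can be sketched numerically as follows. The self-transition probability, mean and standard deviation used here are arbitrary illustrative values, and the Gaussian is simply evaluated on integer durations and renormalized, which is a simplification made for this example.

```python
import numpy as np

def hmm_duration_pmf(stay_prob, max_d):
    """P(d) for d = 1..max_d implied by a self-transition probability.

    Staying d frames means d - 1 self-transitions followed by one exit.
    """
    d = np.arange(1, max_d + 1)
    return stay_prob ** (d - 1) * (1.0 - stay_prob)

def gaussian_duration_pmf(mean, std, max_d):
    """Gaussian density evaluated on d = 1..max_d and renormalized."""
    d = np.arange(1, max_d + 1)
    p = np.exp(-0.5 * ((d - mean) / std) ** 2)
    return p / p.sum()

implicit = hmm_duration_pmf(stay_prob=0.9, max_d=60)
explicit = gaussian_duration_pmf(mean=10.0, std=2.0, max_d=60)
```

Even though a self-transition probability of 0.9 gives an expected stay of ten frames, the implicit model always considers a single frame to be the most likely duration, whereas the explicit model peaks at the desired duration.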

During training, contextual factors related to the position of the step in the whole walk sequence were taken into account, thereby multiplying the number of models to train. However, not all model parameters can be estimated with sufficient accuracy if we only have limited training data. Furthermore, not all possible combinations of contextual factors will be present in the training database, and unseen models have to be taken into account before the synthesis step. To overcome this problem, both parameter and duration models can be clustered using decision trees. The decision tree is a binary tree, and in each of its nodes, a question splits contextual models into two groups. All possible contextual combinations can be found by traversing the trees. Once the decision tree is constructed, unseen contexts can be taken into account and leaves containing little or very similar data can be merged (for more information on how trees are built and used, please refer to [30]).

#### 5.1.2 Average model training

Using the above HSMM model taking both static and dynamic parameters into account, we train an average walk model on a large set of walkers. This average model will be used in the next step of our procedure as the initial model from which the adaptation will start. In our work, the step boundaries of our segmented database are only used to initialize the parameters of the average walk model (they are not used in the adaptation or synthesis stages). A "walker adaptive training" (WAT) algorithm was used during the training stage of our average model. This WAT training reduces the influence of walk differences among the 41 walkers of our training data on the parameters of the final average model. More information on the WAT training of the average model can be found in [31], where it is referred to as "SAT" for "speaker adaptive training".

#### 5.1.3 Style adaptive training of HSMM models

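The style adaptation relies on the constrained structural maximum *a posteriori* linear regression (CSMAPLR) transformation, in which the observation (duration) distributions of the average model are transformed using the matrix $A_o$ ($A_d$) and the bias $b_o$ ($b_d$), under the following form: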

The term "constrained" refers to the fact that the linear transformations applied to the means and the linear transformation applied to the variances of the average model (both for durations and observation parameters) are required to be the same, other than the bias. A detailed explanation of the CSMAPLR transformation and how it can be calculated can be found in [32] and [33]. This CSMAPLR transformation is implemented within the HTS framework.
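The effect of such a constraint can be illustrated on a single Gaussian. The sketch below assumes one common way of writing a constrained transform, in which the same matrix `A` acts on the mean and on the covariance and a bias `b` is added to the mean; the exact convention used in [32] and [33] may differ, and all numerical values are arbitrary. It shows that adapting the model in this way is equivalent to applying an affine transform to the observation and correcting by the determinant of `A`.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def adapt_gaussian(mean, cov, A, b):
    """Apply the same linear transform A to mean and covariance, plus bias b."""
    return A @ mean + b, A @ cov @ A.T

# Arbitrary illustrative values, not taken from the trained models.
mean, cov = np.array([0.5, -1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
A, b = np.array([[1.2, 0.1], [0.0, 0.9]]), np.array([0.3, -0.2])
o = np.random.default_rng(0).normal(size=2)

# Model-space view: evaluate the observation under the adapted Gaussian.
new_mean, new_cov = adapt_gaussian(mean, cov, A, b)
model_space = gaussian_logpdf(o, new_mean, new_cov)

# Feature-space view: transform the observation, keep the average model.
feature_space = (gaussian_logpdf(np.linalg.solve(A, o - b), mean, cov)
                 - np.linalg.slogdet(A)[1])
```

Both quantities are equal, which is why a constrained transform can be estimated and applied without touching the parameters of the average model directly.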

The last step of the adaptation training procedure consists in a maximum *a posteriori* (MAP) [13] adaptation that further transforms the models already linearly adapted by CSMAPLR, modifying the estimation of the distributions having enough training samples, as explained in [32].

#### 5.1.4 HSMM synthesis

If *q* is one sequence of states from the set of all possible state sequences corresponding to the walk we want to generate, the problem can be approximated by:

The search for the optimal sequence of parameters *O** given the HSMM λ and the desired sequence of synthesized walk steps can thus be split into two optimization problems:

- (1) Find the optimal sequence of states *q** given the HSMM λ and the desired sequence of synthesized walk steps:

  ${q}^{*}=\underset{q}{\text{arg}\,\text{max}}\,P\left(q|\lambda \right).$ (9)

- (2) Find the optimal sequence of parameters *O** given the previously determined optimal sequence of states *q** and the HSMM λ:

  ${O}^{*}=\underset{O}{\text{arg}\,\text{max}}\,P\left(O|{q}^{*},\lambda \right).$ (10)

The optimal sequence of states *q** must first be estimated. Since the state duration densities are known thanks to the HSMM modeling, *q** can be determined according to Equation (9) [29]. Once the optimal state sequence has been calculated, the optimal sequence of parameters can be determined from Equation (10).
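The first optimization can be sketched as follows, under the simplifying assumption that each state duration is set to the rounded mean of its Gaussian duration density, which is the most likely value when the total length of the step is left free. The duration means below are hypothetical values for a five-state model, and the function name is ours.

```python
import numpy as np

def most_likely_state_sequence(duration_means):
    """Expand per-state duration means into a frame-level state sequence.

    Each state lasts for its rounded mean duration, with a minimum of one frame.
    """
    durations = np.maximum(1, np.rint(duration_means).astype(int))
    return np.repeat(np.arange(len(durations)), durations), durations

# Hypothetical duration means (in frames) for the five states of one step.
q_star, durations = most_likely_state_sequence(np.array([4.2, 7.8, 6.0, 5.4, 3.1]))
```

The resulting state sequence fixes which output distribution is responsible for each frame, which is all that the second optimization needs.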

Maximizing $P(O|q^{*},\lambda)$ with respect to $O$ (Equation (10)) becomes equivalent to maximizing it with respect to $C$:

${C}^{*}=\underset{C}{\text{arg}\,\text{max}}\,P\left(WC|{q}^{*},\lambda \right)$

as *O = WC* (Equation (4)). In the HTS framework and as explained in detail in [28], this problem can be solved using the Cholesky decomposition. The algorithm we just described can thus generate a parameter trajectory of static features that maximizes the likelihood of the parameter sequence containing both static and corresponding dynamic parameters given an HSMM model.
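For a single feature dimension and diagonal covariances, this generation step can be sketched as follows. The window coefficients in `WINDOWS`, the replication of the first and last frames at the boundaries, and the toy means and variances are all assumptions made for the example; they are common choices but are not taken from the equations of this article, and the function names are hypothetical. With Gaussian output distributions, maximizing the likelihood of *WC* amounts to solving the linear system $W^{\top}\Sigma^{-1}WC=W^{\top}\Sigma^{-1}\mu$, whose matrix is symmetric positive definite and can therefore be factorized with a Cholesky decomposition.

```python
import numpy as np

# Assumed window coefficients over frames (t-1, t, t+1).
WINDOWS = np.array([
    [0.0, 1.0, 0.0],    # static: c_t
    [-0.5, 0.0, 0.5],   # delta: 0.5 * (c_{t+1} - c_{t-1})
    [1.0, -2.0, 1.0],   # delta-delta: c_{t-1} - 2 c_t + c_{t+1}
])

def build_window_matrix(T):
    """Return W of shape (3T, T) such that O = W @ C for one feature dimension.

    Rows 3t, 3t+1 and 3t+2 produce c_t, delta c_t and delta^2 c_t.
    Frames outside the sequence are replaced by the nearest frame (assumption).
    """
    W = np.zeros((3 * T, T))
    for t in range(T):
        for k, window in enumerate(WINDOWS):
            for offset, coeff in zip((-1, 0, 1), window):
                idx = min(max(t + offset, 0), T - 1)
                W[3 * t + k, idx] += coeff
    return W

def generate_trajectory(means, variances):
    """Most likely static trajectory given per-frame Gaussian statistics.

    means, variances: arrays of shape (T, 3) holding, for each frame, the mean
    and variance of [c, delta c, delta^2 c] of the state occupied at that frame.
    """
    T = means.shape[0]
    W = build_window_matrix(T)
    precision = 1.0 / variances.reshape(-1)      # diagonal of Sigma^{-1}
    R = W.T @ (precision[:, None] * W)           # W^T Sigma^{-1} W
    r = W.T @ (precision * means.reshape(-1))    # W^T Sigma^{-1} mu
    L = np.linalg.cholesky(R)                    # R = L L^T
    y = np.linalg.solve(L, r)                    # solve L y = r
    return np.linalg.solve(L.T, y)               # solve L^T C = y

# Two hypothetical states held for 5 frames each; the static mean jumps from 0 to 1.
means = np.zeros((10, 3))
means[5:, 0] = 1.0
variances = np.ones((10, 3)) * np.array([0.1, 0.01, 0.01])
C = generate_trajectory(means, variances)
```

Because the dynamic features are constrained to be consistent with the static ones, the generated trajectory moves gradually from one state mean to the next instead of reproducing the piecewise-constant sequence of means.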

The global variance ($\upsilon(C)$) of the static features $c_t$ over a time sequence of $T$ frames is calculated by:

with $\lambda_{\upsilon}$ a single Gaussian distribution representing the global variance of the data *υ*(*c*) by a mean vector and a covariance matrix, and *ω* a constant determining the weight between the two likelihoods. Taking into account the global variance of the data enabled us to avoid over-smoothed synthesized walks.
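The quantity that the global variance term monitors can be sketched as follows. The sketch assumes the usual definition of global variance as the per-dimension variance of the static features over the frames of a sequence, normalized by the number of frames; the toy trajectories are arbitrary and only illustrate why an over-smoothed trajectory is penalized.

```python
import numpy as np

def global_variance(C):
    """Per-dimension variance of the static features over the T frames of C.

    C has shape (T, D); the result has shape (D,).
    """
    mean = C.mean(axis=0)
    return ((C - mean) ** 2).mean(axis=0)

# Toy example: an over-smoothed trajectory keeps the shape but loses amplitude.
natural = np.sin(np.linspace(0.0, 2.0 * np.pi, 50))[:, None]
smoothed = 0.5 * natural

gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)
```

Halving the amplitude divides the global variance by four, so a trajectory whose global variance is far below that observed in the training data receives a low likelihood under $\lambda_{\upsilon}$.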

Once our adapted model is built, we can synthesize as many stylistic walk sequences as we want using the same synthesis procedure as described here. The model gives us joint angles and the displacement of the skeleton can be computed using our knowledge of the limb lengths and the step in which we are (which defines which foot is in contact with the ground).

### 5.2 Results

#### 5.2.1 Neutral walk modeling

For our HMM training and synthesis, we followed the method explained in Section 5.1 and adapted the functions originally implemented for speech within the HMM-based speech synthesis system (HTS) to our procedure. The implementation of the HTS toolkit (version 2.1) that we used in this work is publicly available on the HTS website [11].

The three sequences of "free" walk of the 41 subjects of the eNTERFACE'08 3D database were used to train our average neutral walk model, which consisted of a five-state left-to-right HSMM with no skips for each step (right and left). The database contains 669 observation sequences for "right step" and 656 observation sequences for "left step". We made the contextual distinction between five positions in the walk sequence for each step: the first, second, last, last-but-one steps of the sequence, and all the other steps. The training thus began with ten models to train (five for each step).
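The contextual labeling of the steps can be sketched as follows. The label strings, the function name and the priority given to the "first" and "last" contexts in very short sequences are assumptions made for this illustration; they do not reproduce the label format actually used with the HTS toolkit.

```python
def step_labels(n_steps, first_side="left"):
    """Label each step of a walk sequence with its side and position context."""
    sides = ("left", "right") if first_side == "left" else ("right", "left")
    labels = []
    for i in range(n_steps):
        if i == 0:
            context = "first"
        elif i == n_steps - 1:
            context = "last"
        elif i == 1:
            context = "second"
        elif i == n_steps - 2:
            context = "last-but-one"
        else:
            context = "other"
        labels.append(f"{sides[i % 2]}_{context}")
    return labels

labels = step_labels(6)
```

With two sides and five position contexts, this labeling yields the ten initial models mentioned above.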

During the training phase, some of the ten initial models were automatically tied by the context-based tree clustering and only six HMMs remained for the whole walk modeling in the average model: two models for the first step of a walk sequence, two for steps inside a sequence, and two for the last step of a sequence (one model for the right step and one for the left step each time).
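The ten initial contextual labels (before tree-based tying) could be generated along the following lines. The label strings and the priority applied when a short sequence makes two contexts coincide (for example, the second step also being the last one) are illustrative assumptions, not the authors' labeling scheme.

```python
def step_context(index, n_steps):
    """Return the position context of a step within a walk sequence."""
    if index == 0:
        return "first"
    if index == n_steps - 1:
        return "last"
    if index == 1:
        return "second"
    if index == n_steps - 2:
        return "last_but_one"
    return "other"

def label_steps(n_steps, first_side="right"):
    """Label alternating right/left steps with their side and position context."""
    sides = ("right", "left") if first_side == "right" else ("left", "right")
    return [f"{sides[i % 2]}_{step_context(i, n_steps)}" for i in range(n_steps)]
```

For a six-step sequence starting on the right foot, this yields one label per step, each combining one of the two sides with one of the five positions.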

#### 5.2.2 Style walk modeling

Adaptive training is performed with constrained maximum likelihood linear regression (CMLLR) transformation [33] of our previously trained average neutral walk HSMM. For each of the 11 expressive walks of our Mockey database, a separate adaptive training was performed using all of the data available for the target style. The number of observation sequences for each of the stylized walks is given in Table 1. For each style, we thus obtained separate contextual (initial, final and "inside a sequence") models for the right and left steps.

#### 5.2.3 Synthesis of new walk sequences

Each new walk sequence is synthesized by first concatenating HMMs corresponding to the desired succession of steps. The whole parameter sequence is then calculated from that complete sequence of models, taking into account the dynamics of the synthesized parameters thanks to the first and second derivatives of the parameters. Therefore, the smoothness of the transitions between the successive steps of the walk sequence is ensured.
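The concatenation step can be illustrated with a small sketch that turns a desired succession of steps into a frame-level state sequence. It assumes each state is held for its rounded mean duration (with a minimum of one frame), which is one common choice and not necessarily the setting used here; the model dictionary layout and names are hypothetical.

```python
def expand_state_sequence(step_names, models):
    """Concatenate step models and expand each state by its mean duration.

    step_names: ordered list of step model names for the desired walk.
    models:     dict mapping a model name to a list of
                (state_id, mean_duration_in_frames) tuples.
    Returns one (model_name, state_id) pair per output frame.
    """
    frames = []
    for name in step_names:
        for state_id, mean_duration in models[name]:
            n_frames = max(1, round(mean_duration))  # at least one frame per state
            frames.extend([(name, state_id)] * n_frames)
    return frames
```

The resulting frame-level sequence provides the per-frame means and variances from which the whole parameter trajectory is then generated in a single pass, which is why transitions between steps come out smooth.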

Our average model was trained with data recorded at a frame rate of 60 fps and adapted in the second phase to data captured at a rate of 30 fps, but that difference was not an issue as the durations were adapted automatically during the average-to-style model adaptation. The synthesized walks, coming from models adapted to the Mockey style data, corresponded to a frame rate of 30 fps.

## 6 Qualitative user evaluation

### 6.1 Methodology

A recurrent problem with motion data synthesis is the difficulty of evaluating the produced motion sequences. Most studies only present their method without giving the reader information about the quality of the results, or just give a link to an example of synthesized motion.

In this article, we propose three different subjective tests that enabled us to assess the quality of the synthesis results. The basic set of tested videos consisted of 44 walk sequences: one original walk sequence for each of the 11 styles, the same sequences from which the displacement of the root of the skeleton was removed, one sequence of synthesized walk for each of the 11 styles without adding the overall displacement (called "static" in the next sections), and the same synthesized sequences for which the absolute position of the root was calculated as explained in Section 5.1.4 (called "displacement" in the next sections). Two videos of motion synthesized with the average walk model were added (with and without displacement), which makes 46 videos in total. In the video sequences, motion was performed by a basic blue stick-figure character as shown in Figure 2.

Participants accessed the evaluation tests through a web browser. They had to start each video themselves by clicking on it, and could watch it as many times as they wanted. If they did not complete the tests, they could come back later; a participant's results were saved even if the three tests were not all finished. Video sequences lasted between 3 and 17 s.

About 100 naive evaluators took part in the evaluation. The three tests and their respective results are presented in Sections 6.2, 6.3, and 6.4. For each of the three tests, every evaluator was presented with a set of ten videos or pairs of videos. Those videos were randomly picked by the evaluation program, and were thus different for each evaluator.

### 6.2 Naturalness evaluation

In the first test, the evaluator was presented with one random video at a time and asked to choose among three answers: the stylistic walk in the video seems "real", seems "synthetic", or "I don't know". The aim of the test was to determine whether there was a significant difference in the way the naturalness of the original and the synthesized walks was perceived.

In a first trial, the question was worded differently, and several users reported that they were confused and did not understand it. The user was asked whether the walk was "natural" or "unnatural", which led most people to perceive nearly all the walks, both original and synthetic, as "unnatural" because of the nature of the data presented: exaggerated walk styles performed by an actor. We thus reformulated the question and kept only the results obtained after that change, which explains why only 500 sequences were evaluated in this first test.

### 6.3 Style recognition evaluation

In the second test, the evaluators were again presented with one video at a time. They were asked to choose among 13 different possibilities: the 11 styles, plus "average walk" or "I don't know". A total of 922 evaluations of videos taken randomly from the set of 46 possible videos were performed.

Confusion matrix of style recognition test for both original walk sequences (first part of the table) and synthesized sequences (second part of the table). Rows give the actual style; columns give the evaluators' classification (%).

| Actual style | Proud | Decided | Sad | Topmodel | Drunk | Cool | Afraid | Tiptoe | Heavy | Hurry | Manly | Average | ? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Original** | | | | | | | | | | | | | |
| Proud | 10 | 0 | 3 | 3 | 0 | 27 | 0 | 0 | 3 | 0 | 23 | 27 | 3 |
| Decided | 3 | 37 | 3 | 0 | 0 | 5 | 3 | 0 | 3 | 26 | 13 | 5 | 2 |
| Sad | 2 | 0 | 68 | 0 | 9 | 2 | 4.6 | 0 | 7 | 0 | 2.3 | 5 | 0 |
| Topmodel | 14 | 0 | 0 | 58 | 3 | 11 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| Drunk | 0 | 0 | 0 | 0 | 91 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Cool | 9 | 15 | 0 | 4 | 0 | 20 | 0 | 0 | 0 | 0 | 6 | 37 | 6 |
| Afraid | 0 | 0 | 0 | 0 | 6 | 0 | 49 | 36 | 3 | 0 | 0 | 0 | 6 |
| Tiptoe | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 71 | 0 | 11 | 0 | 0 | 7 |
| Heavy | 0 | 0 | 15 | 0 | 34 | 15 | 7 | 0 | 17 | 0 | 10 | 0 | 2 |
| Hurry | 0 | 39 | 0 | 10 | 0 | 5 | 0 | 0 | 0 | 19.5 | 5 | 15 | 7 |
| Manly | 0 | 0 | 6 | 3 | 0 | 19 | 0 | 0 | 11 | 0 | 47 | 6 | 8 |
| **Synthesized** | | | | | | | | | | | | | |
| Proud | 15 | 0 | 0 | 6 | 0 | 23 | 0 | 0 | 6 | 0 | 10 | 37 | 2 |
| Decided | 2 | 42 | 2 | 0 | 0 | 0 | 2 | 0 | 9 | 40 | 2 | 0 | 0 |
| Sad | 0 | 3 | 71 | 0 | 5 | 0 | 0 | 0 | 13 | 0 | 3 | 3 | 3 |
| Topmodel | 11 | 0 | 0 | 51 | 17 | 17 | 0 | 0 | 0 | 0 | 2 | 0 | 2 |
| Drunk | 7 | 20 | 2 | 2 | 72 | 5 | 0 | 0 | 2 | 0 | 0 | 0 | 9 |
| Cool | 15 | 3 | 0 | 3 | 0 | 38 | 0 | 0 | 0 | 0 | 24 | 12 | 6 |
| Afraid | 0 | 0 | 0 | 0 | 2 | 0 | 44 | 40 | 7 | 0 | 0 | 0 | 7 |
| Tiptoe | 0 | 4 | 0 | 8 | 0 | 0 | 15 | 42 | 0 | 15 | 0 | 4 | 12 |
| Heavy | 5 | 2 | 10 | 5 | 7 | 17 | 2 | 0 | 19 | 0 | 19 | 2 | 12 |
| Hurry | 2 | 30 | 0 | 16 | 0 | 5 | 0 | 0 | 0 | 23 | 5 | 18 | 2 |
| Manly | 7 | 2 | 11 | 0 | 2 | 11 | 0 | 0 | 9 | 0 | 52 | 2 | 2 |

Percentage of correctly classified walk sequences for the style recognition test

| | Original (%) | Synthesized (%) |
|---|---|---|
| Static | 44.20 | 40.71 |
| Displacement | 47.60 | 37.15 |

We define the recognition efficiency *η* by normalizing the achieved recognition rate (or sensitivity) by the recognition rate given by a random classification (sensitivity expected by chance):

$$\eta = \frac{\text{achieved recognition rate}}{\text{recognition rate expected by chance}}$$

With 13 possible answers, the rate expected by chance is 100/13 = 7.69%. Using the recognition rates obtained with displacement, the efficiency of the users' recognition is thus equal to *η*_{orig} = 47.6/7.69 = 6.19 for original walk sequences and to *η*_{synth} = 37.15/7.69 = 4.83 for synthesized walk sequences. These values are both higher than the human recognition efficiencies cited in [1] (*η* = 3.72 and *η* = 3.55 for emotional state classification based on original knocking motions in [35] (four emotions, point-light display) and [36] (five emotions, full video)), indicating that the style component was accurately perceived in both our original and synthesized sequences.
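The efficiency values can be checked with a few lines of code. The function name is illustrative, and the chance level is taken as 100 divided by the number of possible answers, which matches the 7.69% used for 13 answers.

```python
def recognition_efficiency(recognition_rate_percent, n_classes):
    """Ratio of the achieved recognition rate to the rate expected by chance."""
    chance_percent = 100.0 / n_classes  # uniform random guess among n_classes
    return recognition_rate_percent / chance_percent

eta_orig = recognition_efficiency(47.6, 13)    # original sequences, with displacement
eta_synth = recognition_efficiency(37.15, 13)  # synthesized sequences, with displacement
print(round(eta_orig, 2), round(eta_synth, 2))
```

Normalizing by the chance level is what makes the comparison with studies using four or five classes meaningful, since their raw recognition rates are not directly comparable with a 13-way choice.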

### 6.4 Original versus synthesized comparison

## 7 Conclusion

Additional file 1: **Video-StylisticGaitSynthesis.mov (QuickTime movie)**. This short video presents some examples of the stylistic walk sequences that were synthesized in this work and presented to the participants of the user assessment tests. (MOV 9 MB)

We also proposed a setup for a subjective evaluation of the synthesis results, which showed that the synthesized walks were close to the original training data and also pointed out some of the weaknesses of the synthesis, indicating directions for future work. The recognition test showed for instance that adding the displacement to the motion improved the recognition rate for original motions but had the opposite effect on synthesized sequences. We think that this is due to the inter-step variation which is lower in the synthesized sequences than in the original motion, and that should be further improved.

Future work will include further analyses of the evaluation tests that can be used to assess the naturalness of the produced motions, and an analysis of style interpolation/extrapolation using the trained models. One could also study how several parameters influence the perceived results, such as the variables of the HMM (the number of states, for instance), the number of stylistic steps used in the adaptation training phase, the way the results are presented to the user (skinned virtual character versus stick figure), or a reduction of the dimensionality of the original data. The adaptation method presented here could also be used to analyze and synthesize walks for different human characteristics that influence the walk style, like gender (male vs. female walk) or age (children vs. elderly or others).

## Declarations

### Acknowledgements

This project was partly funded by the Ministry of Région Wallonne under the Numediart research program (grant N0716631). Joëlle Tilmanne was supported by the "Fonds pour la formation à la recherche dans l'industrie et l'agriculture" (FRIA) during part of this work. The authors would like to thank the comedian Sebastien Marchetti and Thierry Ravet for their participation in the motion capture database recording.

## Authors’ Affiliations

## References

1. Bernhardt D, Robinson P, Paiva A, Prada R, Picard R: Detecting Affect from Non-stylised Body Motions. In *Affective Computing and Intelligent Interaction*. Springer, Berlin; 2007:59-70.
2. Menache A: *Understanding Motion Capture for Computer Animation and Video Games*. Morgan Kaufmann Publishers Inc., San Francisco; 2000.
3. Mena D, Mansour J, Simon S: Analysis and synthesis of human swing leg motion during gait and its clinical applications. *J Biomech* 1981, 14(12):823-832. doi:10.1016/0021-9290(81)90010-5
4. Pejsa T, Pandzic I: State of the art in example-based motion synthesis for virtual characters in interactive applications. *Comput Graph Forum* 2010, 29: 202-226. doi:10.1111/j.1467-8659.2009.01591.x
5. Forsyth D, Arikan O, Ikemoto L, O'Brien J, Ramanan D: Computational studies of human motion: part 1, tracking and motion synthesis. *Found Trends Comput Graph Vis* 2005, 1(2-3):77-254.
6. Geng W, Yu G: Reuse of motion capture data in animation: a review. *Comput Sci Appl (ICCSA)* 2003, 2003: 620-629.
7. Calinon S, Guenter F, Billard A: On learning, representing, and generalizing a task in a humanoid robot. *IEEE Trans Syst Man Cybern B* 2007, 37(2):286-298.
8. Glardon P, Boulic R, Thalmann D: PCA-based walking engine using motion capture data. *IEEE Comput Graph Int* 2004, 2004: 292-298.
9. Brand M, Hertzmann A: Style machines. In *Proceedings of the 27th annual conference on Computer graphics and interactive techniques*. ACM Press/Addison-Wesley Publishing Co., New York; 2000:183-192.
10. Tilmanne J, Dutoit T: Expressive gait synthesis using PCA and Gaussian modeling. In *Proceedings of the Third international conference on Motion in games*. Springer, Berlin, Heidelberg; 2010:363-374.
11. HTS Working Group: *The HMM-based speech synthesis system (HTS) Version 2.1*. [http://hts.sp.nitech.ac.jp/] Accessed 2010.
12. Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition. *Proc IEEE* 1989, 77: 257-286. doi:10.1109/5.18626
13. Yamagishi J, Nose T, Zen H, Ling Z, Toda T, Tokuda K, King S, Renals S: Robust speaker-adaptive HMM-based text-to-speech synthesis. *IEEE Trans Audio Speech Lang Process* 2009, 17(6):1208-1230.
14. Picart B, Drugman T, Dutoit T: Analysis and synthesis of hypo and hyperarticulated speech. In *Proceedings of the Speech Synthesis Workshop 7 (SSW7)*. NICT/ATR, Kyoto, Japan, Sept; 2010:270-275.
15. Troje NF, Shipley TF, Zacks JM: Retrieving Information from Human Movement Patterns. In *Understanding Events: How Humans See, Represent, and Act on Events*. *Volume 1*. Oxford University Press, Oxford; 2008:308-334.
16. Tanco LM, Hilton A: Realistic synthesis of novel human movements from a database of motion capture examples. In *Proc of the Workshop on Human Motion (HUMO'00)*. IEEE Computer Society, Washington, DC, USA; 2000:137.
17. Wang Y, Liu Z, Zhou L: Automatic 3D motion synthesis with time-striding hidden Markov model. In *Proc International Conference on Machine Learning and Cybernetics (ICMLC'05)*. *Volume 3930*. SB Heidelberg, Guangzhou, Aug; 2005:558-567.
18. Li Y, Wang T, Shum H: Motion texture: a two-level statistical model for character motion synthesis. In *Proc of SIGGRAPH'02*. ACM Press, New York; 2002:465-472.
19. Ramanan D, Forsyth DA: Motion Analysis by Synthesis: Automatically Annotating Activities in Video. 2005. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.94.2069]
20. Wang Y, Xie L, Liu Z, Zhou L: The SOMN-HMM model and its application to automatic synthesis of 3D character animation. In *IEEE Conference on Systems, Man, and Cybernetics*. Taipei, Taiwan; 2006:4948-4952.
21. Wang Y, Liu Z, Zhou L: Learning style-directed dynamics of human motion for automatic motion synthesis. In *IEEE Conference on Systems, Man, and Cybernetics*. Taipei, Taiwan; 2006:4428-4433.
22. Yamazaki T, Niwase N, Yamagishi J, Kobayashi T: Human walking motion synthesis based on multiple regression hidden semi-Markov model. In *2005 International Conference on Cyberworlds (CW'05)*. IEEE Computer Society, Washington DC; 2005:445-452.
23. IGS-190: *Animazoo website*. [http://www.animazoo.com]
24. Tilmanne J, Sebbe R, Dutoit T: A database for stylistic human gait modeling and synthesis. In *Proceedings of the eNTERFACE'08 Workshop on Multimodal Interfaces*. Paris, France; 2008:91-94.
25. Parent R: Technical background. In *Computer Animation Complete: Part I: Introduction to Computer Animation*. Morgan Kaufmann, Emeryville; 2009:60-68.
26. Grassia F: Practical parameterization of rotations using the exponential map. *J Graph Tools* 1998, 3: 29-48. doi:10.1080/10867651.1998.10487493
27. Johnson MP: *Exploiting quaternions to support expressive interactive character motion*. Massachusetts Institute of Technology; 2002.
28. Tokuda K, Yoshimura T, Masuko T, Kobayashi T, Kitamura T: Speech parameter generation algorithms for HMM-based speech synthesis. In *Proc ICASSP*. Istanbul, Turkey; 2000:1315-1318.
29. Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Duration modeling for HMM-based speech synthesis. In *Fifth International Conference on Spoken Language Processing (ICSLP)*. Sydney; 1998:29-32.
30. Young S, Evermann G, Gales M, Hain T, Kershaw D, Liu X, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P: *The HTK Book, Version 3.4*. Entropic Cambridge Research Laboratory, Cambridge; 2009.
31. Yamagishi J, Kobayashi T: Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. *IEICE Trans Inf Syst* 2007, 90(2):533-543.
32. Yamagishi J, Kobayashi T, Nakano Y, Ogata K, Isogai J: Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. *IEEE Trans Audio Speech Lang Process* 2009, 17: 66-83.
33. Gales M: Maximum likelihood linear transformations for HMM-based speech recognition. *Comput Speech Lang* 1998, 12(2):75-98. doi:10.1006/csla.1998.0043
34. Toda T, Tokuda K: A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. *IEICE Trans Inf Syst* 2007, 90(5):816-824.
35. Kapur A, Kapur A, Virji-Babul N, Tzanetakis G, Driessen P: Gesture-Based Affective Computing on Motion Capture Data. In *Affective Computing and Intelligent Interaction*. *Volume 3784*. Springer, Berlin/Heidelberg; 2005:1-7. doi:10.1007/11573548_1
36. Pollick FE, Paterson HM, Bruderlin A, Sanford AJ: Perceiving affect from arm movement. *Cognition* 2001, 82(2):B51-B61. doi:10.1016/S0010-0277(01)00147-0
37. *Joelle Tilmanne's webpage*. [http://tcts.fpms.ac.be/~tilmanne/]

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.