GEPETO

GEsture for the PEdagogie of InTOnation

Logos: LAM, LPP, GIPSA, CNRS, ANR, SU

HMI and Gesture control for Education and Re-education of Phonatory Mastery

The aim of the GEPETO project is to investigate the use of manual gestures, mediated by new Human Machine Interfaces (HMI), for designing innovative tools and methods for intonation education (training) and re-education (re-training). The control of a synthesized voice through hand gestures is a new research paradigm in the field of human-machine interaction, with applications in new musical instruments and speech research (Feugère et al., 2017; Delalez and d’Alessandro, 2017). As with a musical instrument, speech prosody is “played”, or controlled, by the hands (chironomy, from the Greek “ruled by hand (motion)”). Previous studies have demonstrated that chironomic intonation using handwriting gestures on a graphic tablet can be even more precise and accurate than the natural voice in imitation tasks (d’Alessandro et al., 2011, 2014). The high performance of chironomy for performative (i.e. real-time gesture-controlled) voice synthesis can be attributed to its intrinsic multimodal integration of vision, kinaesthesis and audition (Perrotin and d’Alessandro, 2016), as well as to the existing dexterity of handwriting movements (as used for writing and drawing), which are repurposed for a new task.

Performative voice synthesis was initially developed as a tool for prosodic research and as a new family of digital musical instruments, but it could also foster important new applications in language acquisition and vocal substitution. The proposed project explores new HMI paradigms along two lines: education and re-education of the phonatory function. The first aim is to develop an educational program based on chironomy and to test it in language classes. The second aim is to develop chironomy-based tools for vocal impairment assistance. In the case of phonatory function impairment, gestural control can improve expressive intonation in an augmented reality paradigm: phonation is controlled or enhanced by chironomy, while articulation is controlled by the true vocal tract. An extreme case is that of vocal substitution: when a laryngectomy results in voice loss, the gestural control of intonation must enable the restoration of both linguistic and expressive intonation (Crevier-Buchman et al., 1998).

Research Hypothesis

Building on the results obtained for performative singing synthesis, the GEPETO project uses (HMI-mediated) chironomy, for the first time, as a means to support the learning of difficult linguistic and expressive phonatory tasks. This work program has been explored in preliminary studies on gesture-controlled expressive speech synthesis (Evrard et al., 2015) and in studies of the identification and production of Mandarin tones, in which naive learners made comparable improvements when chironomy replaced the natural voice in imitation tasks (Xiao et al., 2019). The GEPETO project is based on three main hypotheses:

  1. Central intonation patterns hypothesis. Melodic and rhythmic patterns can be considered as intonation gestures. These intonation gestures convey linguistic and expressive information. Intonation, in both its perceptual and motor production aspects, is represented and embodied at a relatively high cognitive level, and it is to some extent independent of the modality actually used to reproduce it. Intonation can therefore be transferred from the vocal apparatus to other modalities (in our case, hand gestures).
  2. Substitution hypothesis. The precision and quality of control of performative vocal synthesis are sufficient to produce chironomic intonation patterns that are indiscernible from speech intonation patterns. Intonation can therefore be transferred from the vocal apparatus to other modalities, in our case hand gestures. This allows for hand control of an artificial voice source, or intonation substitution (re-education), and for manipulation of fine details of recorded speech in foreign language accent acquisition (education).
  3. Multimodal reinforcement hypothesis. The gestural component, both in production and perception, is a fundamental dimension of learning: making or perceiving a gesture reinforces the acquisition of the corresponding intonation pattern. Performative synthesis of fine details of recorded speech involves the auditory, visual and kinaesthetic modalities, allowing for multimodal reinforcement.

The process of learning to produce a sound can be summarized by the following feedback loop (displayed in Fig. 1):

  • Hear a goal sound, whether externally or internally.
  • Try to recreate the sound using movements of the body.
  • Detect the difference between the target sound and the produced sound.
  • Make adjustments to the body’s movements in order to reduce the difference.
Figure 1: principle of intonation learning
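
The four steps of this loop can be sketched as a small simulation. This is only an illustrative model under stated assumptions, not the project's implementation: the function name `learn_pitch` and the parameters `gain` (how much of the perceived error the learner corrects per trial) and `jnd_cents` (the smallest pitch difference the learner can hear) are hypothetical, and the proportional correction rule is chosen purely for simplicity.

```python
import math


def learn_pitch(target_hz, start_hz, gain=0.5, jnd_cents=10.0, max_trials=50):
    """Simulate the hear / produce / compare / adjust loop for one pitch target.

    All parameters are hypothetical illustration values:
      gain      -- fraction of the perceived error corrected per trial (0 < gain <= 1)
      jnd_cents -- smallest pitch difference the learner can perceive, in cents
    Returns the list of produced frequencies in Hz, one per attempt.
    """
    produced = start_hz
    history = [produced]
    for _ in range(max_trials):
        # Step 3: detect the difference between target and produced sound (in cents).
        error_cents = 1200.0 * math.log2(target_hz / produced)
        if abs(error_cents) < jnd_cents:
            # The learner no longer hears a difference, so no further adjustment is made.
            break
        # Step 4: adjust the production to reduce the difference.
        produced *= 2.0 ** (gain * error_cents / 1200.0)
        history.append(produced)
    return history


if __name__ == "__main__":
    attempts = learn_pitch(target_hz=220.0, start_hz=180.0)
    print([round(f, 1) for f in attempts])
```

In this toy model, a large `jnd_cents` makes the loop stop before any correction is made, and a small `gain` makes convergence slower, which loosely mirrors limited aural sensitivity and limited bodily control, respectively.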

This loop is common to music learning, language learning, and voice or speech therapy and rehabilitation. The lack of success in these tasks can be attributed either to a lack of awareness of the feedback loop or to problems in carrying it out (in red in Fig. 1):

  1. Lacking the aural sensitivity to perceive differences in sounds.
  2. Not having enough bodily control or memory to adjust sound-producing movements.
  3. Trying to repeat phrases that are too long, without sufficient working memory capacity to hold them in mind long enough to find movements to recreate them.

The GEPETO project will provide a tool that explicitly targets these problems. Our hypothesis is that if these problems are properly addressed, anyone with normal hearing and voice/hand capacities can learn tasks traditionally considered challenging.

For problem 1: the correspondence of the visual and kinaesthetic modalities with the aural one can guide the ear toward salient features in the sounds. The system will also be designed to give visual feedback. For problem 2: chironomy makes use of a modality in which people have more dexterity. The hand is then used to “teach” the voice in the learning task, and to pilot the voice in the rehabilitation task. For problem 3: the applications will be designed to gradually increase the length of phrases and, more importantly, to reinforce and complement aural memory through the kinaesthetic and visual modalities.

Position of the project as it relates to the state of the art

The GEPETO project aims to bring significant advances over the state of the art in three domains: vocal instruments, the use of gesture in pedagogy, and the use of gesture in rehabilitation.

Vocal instruments

Instruments for performative voice synthesis were initially developed for studying expressive speech. Although automatic speech synthesis has reached a high level of naturalness and intelligibility, its expressivity (the ability to convey nuances of expression or emotional content) remains poor. This is because expressivity depends on the communicative situation, on the intentions of the speaker and on the reaction of the listener, all things that a machine can hardly manage. By contrast, performative vocal synthesis is the process of playing synthetic voices like a musical instrument. In this case the expression is given by the player, and the only limit is her or his ability to play the instrument. In a speech intonation mimicking paradigm, it has been shown that stylized intonation contours produced by chironomy seem perceptually indistinguishable from natural contours (d’Alessandro et al., 2011). This indicates that chironomic stylization is effective, and that hand movements can be analogous to intonation movements.

This has been applied to musical instrument development. Cantor Digitalis, a real-time formant synthesizer controlled by a graphic tablet and a stylus, was used to assess melodic precision and accuracy in singing synthesis. The results show that all subjects achieved high accuracy and precision in the chironomic control of singing synthesis (d’Alessandro et al., 2014). Some subjects performed significantly better in chironomic singing than in natural singing, and this study demonstrated the capabilities of chironomy as a precise and accurate means of controlling intonation in singing synthesis. The expressivity of performative vocal synthesis was recognized in the musical community, as Cantor Digitalis won first prize at the Margaret Guthman Musical Instrument Competition in 2015. These instruments initially used a graphic tablet. However, they can be adapted to other types of interfaces, such as MIDI Polyphonic Expression (MPE) keyboards like the ROLI Seaboard. Two instruments allowing for real-time resequencing and intonation control of pre-recorded speech have also been demonstrated: Vokinesis and Voks. They are controlled by a tablet, an MPE keyboard, or a hands-free instrument like the Theremin (Xiao et al. 2019b), together with a touch control interface (button, Touché, MetaTouche).

Chironomic control of intonation can also be viewed as the process of adding auditory feedback to the gesture, or sonification of gesture, where the sound features are representative of intonation components (e.g., f0, rhythmic patterns, etc.). It is the creation of this new link between the kinaesthetic and visual perception of the gesture and the auditory perception of the sonification that constitutes the multimodal reinforcement hypothesis at the core of the project. Moreover, the nature of the auditory feedback can lead to multiple scenarios:

  • The auditory feedback (variations of intonation features) can be played through the modification of another person’s voice (natural or synthetic, but different from the user’s). We call this scenario external sonification of gestural intonation, and it is implemented through our vocal instruments.
  • The auditory feedback can be played through an artificial excitation source located inside the vocal tract of the user. This excitation source is thus naturally combined with the user’s articulation to produce an integrated semi-synthetic voice. We call this scenario internal sonification of gestural intonation.

These gesture-controlled auditory feedbacks can be produced either instead of the user’s natural voice (“Voice instrument” and “Voice substitution” conditions for external and internal sonification, respectively) or simultaneously with it (“Dual voice” and “Augmented voice” conditions for external and internal sonification, respectively). Combining the two types of auditory feedback with or without the natural voice leads to four experimental conditions, which are summarised in Table 1 and will be investigated during this project. Specifically, the external sonification paradigm will be at the centre of the intonation education process, while the internal sonification paradigm will constitute the basis of the intonation re-education process.

Table 1: Description of the sources used for each chironomic control condition
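
As a compact restatement of the four conditions described above, the sketch below encodes each one by its sonification type and by whether the user's natural voice is heard at the same time. The names `Sonification`, `Condition`, `CONDITIONS` and `find_condition` are hypothetical and serve only as an illustration; they are not taken from the project's software.

```python
from dataclasses import dataclass
from enum import Enum


class Sonification(Enum):
    EXTERNAL = "external"  # gesture drives another (natural or synthetic) voice
    INTERNAL = "internal"  # gesture drives an excitation source inside the user's vocal tract


@dataclass(frozen=True)
class Condition:
    name: str
    sonification: Sonification
    natural_voice: bool  # True when the user's natural voice sounds simultaneously


# The four experimental conditions, as described in the text preceding Table 1.
CONDITIONS = (
    Condition("Voice instrument", Sonification.EXTERNAL, natural_voice=False),
    Condition("Voice substitution", Sonification.INTERNAL, natural_voice=False),
    Condition("Dual voice", Sonification.EXTERNAL, natural_voice=True),
    Condition("Augmented voice", Sonification.INTERNAL, natural_voice=True),
)


def find_condition(sonification, natural_voice):
    """Return the condition name for a sonification type and natural-voice setting."""
    for condition in CONDITIONS:
        if condition.sonification is sonification and condition.natural_voice == natural_voice:
            return condition.name
    raise ValueError("no matching condition")


if __name__ == "__main__":
    for condition in CONDITIONS:
        voice = "with" if condition.natural_voice else "without"
        print(f"{condition.name}: {condition.sonification.value} sonification, {voice} natural voice")
```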

The GEPETO project therefore aims to bring significant progress beyond the state of the art for vocal instruments. Vocal instruments are currently limited to external gestural sonification; internal gestural sonification will be developed and tested for the first time. Both external and internal gestural sonification will be adapted to personal platforms, whereas today's instruments run on laptops. The project will also offer the opportunity to disseminate voice instruments in the fields of language learning and clinical phonetics, in addition to the more obvious musical field.

Use of gestures in pedagogy of foreign language

Several authors have shown that the use of gestures can have beneficial effects in foreign language learning: pitch gestures (gestures that mimic melody in speech) favour word learning in L2 Russian (Kushch et al., 2018) as well as the recognition of intonation patterns in L2 English (Crison et al., 2018); rhythmic beat gestures seem to significantly improve Spanish learners' accentedness in L2 English (Gluhareva and Prieto, 2017). Coding information through different modalities (auditory, visual, kinaesthetic) leaves a richer trace in memory. “Engaging oneself physically (miming an action, making a gesture) has a stronger effect on short-term memorization. In foreign language teaching, children who reproduce gestures while repeating new words are able to memorize more items. It is therefore important to encourage the reproduction of pedagogical gestures in the classroom.” (Tellier, 2010, p.11). The pedagogical gesture of the teacher, a “federator of information” (Tellier, 2010, p.4), alleviates the formal language classroom setting and has an informative linguistic function by providing the pupil with lexical, phonological and/or grammatical indications. “Thus, the voice becomes visible, the movement heard.” (Llorca, 2008).

The GEPETO project will propose and test innovative methods for the use of gestures in the pedagogy of foreign languages. It will endeavour to use HMI tools and gesture control of intonation in this field, and it aims to provide a convincing proof of concept.

Use of gestures in rehabilitation

In Melodic Intonation Therapy, the use of gesture, combined with stimulation by coded intonations, has been shown to help some aphasic participants recover oral expression. “Tapping the left hand may engage a right-hemisphere sensorimotor network that controls both hand and mouth movements” (Norton et al., 2009, p.4).

In the case of laryngectomy, the source excitation of the voice signal, which carries intonation information, is absent. Current solutions for voice rehabilitation combine the speaker’s natural articulation with an artificial excitation source that is applied to the neck or injected into the mouth (Liu & Ng, 2007). However, these systems often generate stationary excitations that have a relatively constant intonation and lead to extremely robotic voices (Kaye et al., 2017). A few solutions have been proposed for gestural control of electrolarynx intonation, using either pressure control (TruTone™; Takahashi et al., 2005) or accelerometers (Matsui et al., 2013), but none has led to a proper usability evaluation. We therefore observe a major contrast between previous studies on the chironomic control of intonation (d’Alessandro et al., 2011, 2014) and the few solutions proposed for intonation control in voice substitution. This contrast motivates exploiting the potential of chironomic control of intonation for voice substitution. Success of the GEPETO project in this research would constitute a substantial milestone in improving the expressivity of voice substitution system output.