I hold a Ph.D. in computer science. I currently work as a Voice AI Data Scientist at ViaDialog in Lannion, Brittany, France. My primary topics of interest are TTS (Text-To-Speech) and ASR (Automatic Speech Recognition). Before that, I prepared my Ph.D. in the IRISA/EXPRESSION team, which is located in Lannion and Vannes, Brittany, France. My Ph.D. work focused on the study of unit selection algorithms in Text-To-Speech synthesis. After that, I worked as a Language Technology Scientist at Voysis in Dublin, Ireland, where I worked on TTS, ASR, NLP (Natural Language Processing) and DSP (Digital Signal Processing). Then, I came back to IRISA to work on TTS again.
After my Baccalauréat scientifique (French degree) in 2007, I obtained a DUT (Diplôme Universitaire de Technologie) in computer science in 2009. I then entered ENSSAT (École Nationale Supérieure des Sciences Appliquées et de Technologie), where I obtained a "diplôme d'ingénieur" as well as a master's degree (from the University of Rennes) in 2012. Finally, I joined the IRISA/CORDIAL team (which became the IRISA/EXPRESSION team) for my Ph.D. Between February and July 2015, I worked as an intern at the IDIAP research center in Switzerland. I defended my Ph.D. on the 22nd of September 2016.
Besides research, I taught computer science at ENSSAT between 2013 and 2016. I was mainly (but not only) involved in the following courses: C and Java languages, Web technologies (mostly client side), UNIX/Linux programming, distributed programming and artificial intelligence. I taught a total of 380 hours of courses.
At Voysis, I focused on a wide variety of tasks, across a large part of our stack. I worked on ASR, voice activity detection (and query endpoint detection), wakeword detection, NLP, audio analysis and processing, and TTS. I ran several data collections (audio and text utterances) and several evaluation campaigns.
Here are a few key points:
Image: My office at ENSSAT (10/2015).
I work in the field of automatic speech synthesis. The subject of my Ph.D. was the "Study of Unit Selection Text-To-Speech Synthesis Algorithms".
Two main types of strategy are currently under consideration in this field. The first one relies on a statistical parametric approach where one tries to model speech signals. The models are then used in a generative way to produce speech utterances. It is called the Statistical Parametric Speech Synthesis approach (often shortened to "SPSS", though this conflicts with the name of IBM's statistics software). The second approach, which is an evolution of concatenation-based synthesis, consists in preserving and annotating a large speech corpus (usually several hours or even tens of hours), then extracting fragments (called units) and pasting them together to reproduce the utterance to be synthesized (the target utterance). The (non-trivial) mechanism by which these fragments are selected is referred to as Unit Selection. The general technique is called Corpus-Based Speech Synthesis.
The aim of my thesis was to explore and diagnose the Unit Selection mechanism and to suggest improvements. To meet these objectives, a corpus-based speech synthesis system was needed. For reasons of independence and flexibility, and to keep control over the whole application, it was decided to build a completely new system rather than using and modifying an existing tool. So I spent a considerable amount of time during my thesis contributing to the implementation of the resulting engine within the team and adding features to it.
On the pure research side, I first took an interest in evaluating the impact of the search algorithm on Unit Selection. In particular, the question was whether or not the optimality of the solution (i.e. the sequence of corpus units to be concatenated) was important and, if not, which search strategy was best. My conclusion was that the search algorithm noticeably impacts the selection process only when searching for the optimal (or near-optimal) solution. Optimality of the solution is not necessary, however: even a heavily pruned unit selection can be used with few noticeable flaws.
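To make the trade-off between optimal and pruned search concrete, here is a minimal sketch in Python. It is not the engine described above: the function name `select_units`, the `beam_width` parameter and the toy representation of a unit as a single number are all assumptions made for illustration. With `beam_width=None` the search is an exact Viterbi-style search; with a small `beam_width` only the cheapest partial paths are extended, so the result may be sub-optimal.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Assumption for illustration: a unit is reduced to one pitch-like value.
Unit = float


def select_units(
    targets: Sequence[float],
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[float, Unit], float],
    concat_cost: Callable[[Unit, Unit], float],
    beam_width: Optional[int] = None,
) -> Tuple[List[Unit], float]:
    """Return (best unit sequence, its total cost).

    beam_width=None keeps every partial path (exact search);
    a small beam_width prunes the search and may miss the optimum.
    """
    # Each hypothesis is (accumulated cost, units chosen so far).
    beam = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for target, units in zip(targets[1:], candidates[1:]):
        if beam_width is not None:
            # Pruning step: keep only the cheapest partial paths.
            beam = sorted(beam, key=lambda h: h[0])[:beam_width]
        new_beam = []
        for u in units:
            # Cheapest way to reach candidate u from the surviving paths.
            cost, path = min(
                ((c + concat_cost(p[-1], u), p) for c, p in beam),
                key=lambda h: h[0],
            )
            new_beam.append((cost + target_cost(target, u), path + [u]))
        beam = new_beam
    best_cost, best_path = min(beam, key=lambda h: h[0])
    return best_path, best_cost


# Toy usage: three target positions, two candidate units for each.
toy_targets = [1.0, 2.0, 3.0]
toy_candidates = [[0.9, 1.5], [2.5, 1.8], [3.0, 2.2]]
toy_target_cost = lambda t, u: abs(t - u)   # distance to the desired value
toy_concat_cost = lambda a, b: abs(a - b)   # jump between consecutive units

print(select_units(toy_targets, toy_candidates, toy_target_cost, toy_concat_cost))
print(select_units(toy_targets, toy_candidates, toy_target_cost, toy_concat_cost,
                   beam_width=1))
```

Comparing the total cost returned by the two calls gives a rough idea of how much is lost by pruning. On a real corpus the candidate lists contain many units for each target position, which is where pruning pays off.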
I also took an interest in the formulation of the cost function that the search algorithm uses to evaluate corpus units. An adapted preselection filtering method (i.e. discarding the units deemed least useful) and a Unit Selection influenced by a Vocalic Sandwich criterion (trying to avoid concatenation at points where it can degrade the signal) were tested. New target costs (judging the dissimilarity between a unit and its desired characteristics) and concatenation costs (judging the capacity of a unit to be pasted to the previous one without generating problems) have been implemented and tested. I am currently continuing my work in that direction, paying particular attention to the atom-based intonation decomposition technique, which I discovered during my stay at IDIAP, and its possible applications to target costs.
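For reference, the cost function is commonly written in the unit selection literature as the sum of target costs and concatenation costs along the selected sequence. The notation below is an assumption for illustration and is not necessarily the exact formulation used in my engine: $t_1, \dots, t_n$ are the target descriptions, $u_1, \dots, u_n$ the selected corpus units, $C^t$ the target cost and $C^c$ the concatenation cost.

```latex
C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^{t}(t_i, u_i) + \sum_{i=2}^{n} C^{c}(u_{i-1}, u_i),
\qquad
\hat{u}_1^n = \operatorname*{arg\,min}_{u_1, \dots, u_n} C(t_1^n, u_1^n)
```

The first sum penalizes units that are far from the requested characteristics, the second penalizes joins that are likely to be audible. Unit Selection searches for the sequence $\hat{u}_1^n$ that minimizes the total.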
Prior to working in the field of speech synthesis, I worked briefly on analogy relations, first through an internship and then in my free time. In particular, I developed a search algorithm for analogical proportions in concept lattices, which led to a publication. This work was a collaboration with Laurent Miclet and Henri Prade.
Keywords: Unit Selection ; Corpus-Based Speech Synthesis ; Text-To-Speech Synthesis ; Cost Function ; Concatenation Cost ; Target Cost ; Neural Networks ; Deep Learning.
Image: My teaching/research place, ENSSAT engineering school in Lannion, Brittany, France
Courses I gave as part of a "Mission d'enseignement" (teaching mission) during my second year as a PhD student at ENSSAT engineering school.
Courses I gave as part of a "Mission d'enseignement" during my third year as a PhD student at ENSSAT engineering school.
Courses I gave as part of my ATER (Attaché Temporaire d'Enseignement et de Recherche) position at ENSSAT.
July 2015 - September 2015 • Travel
A few photos of places I visited & loved with a few comments.
December 2015 • Music
One of my favorite interests is music, but one cannot always go to concerts. So, to listen to music recordings, you need either speakers or headphones. Here are my impressions concerning the latter.