Inside Nuance: the art and science of how Siri speaks

Scratching the collar of my neck, where humans once had gills. Certainly it had a company that sweatshirt, but I'd always both ways concerned. Broken by enumeration also used. The winner is to be announced at the World Awards. Divided into two sections, both secured by yet another lock. Important as it is, by my regular supervision. Furthermore. So too for the travelers, it was a nightmare in their minds, a creature from the darker side of the intellect. Please wait.

So I do think language is primarily a tool for communication, and traditionally within computer science, within technology, that's all we use it for. But language has this very rich secondary power, which is a kind of social glue.

Did your about non unit precision content speech synthesis administrators.

Nuance is a company that's focused on the next generation of human-machine interactions. We're building new types of interfaces for how users access information, and a big part of that is making these systems talk.

A couple of decades ago, when building a voice, you might have just wanted to guarantee coverage of all the individual phonemes of the language, right? So a sound would be a "t", a "g" or an "r".

The most technologically advanced place they're ever built with electronics advanced given monitors price level.

And so long as we had all of these different sounds represented, we could then cut these sounds up in different ways and reassemble them into whatever words and sentences we liked.

You are right. The road I American does it. Sure is great to get out of that bear.

There is co-articulation between sounds, which means that the precise sound you make when you say an "a" varies depending on what you're going to say next.

To be or not to be. That you've got with you. Oh yes, I, sir. But well, electronics present on wall, incredible.

So the speech organs of the mouth and throat move fluidly from one position to the next, meaning that you get this coloration from one sound to the next. So at the very least you want to make sure the sound units contain sequences of sounds. So we want all possible combinations of two sounds.

Just got a couple of little things here. On the first line, on the first paragraph, looking for a little bit more clarity with "community".

You know, there's the obvious things that voice directors will pick up on: a missed word, some gravel in the voice, there's noise. A typical voice project will last, if we're on point, about three, maybe four months.

Let's keep the same sort of energy and pace.

The intonation, the speed at which people speak, that carries most of the information. That's what's going to tell you, first of all, if somebody's sincere, if they're warm, what they're indicating, what they're trying to tell you.

One, one. Two, two. Three, three. Hold on, I have two takes in. This is the exciting part.

When I was younger I waited tables in a really fancy restaurant where you had to read an endless list of specials, and I would get to the end and people would just look up at me and say, "Such a nice voice."

Okay, let's start over. What can I help you with?

I was interested in becoming a broadcast journalist, so I was on the radio in college, and then I got a job at the local public radio station in Philadelphia. So I was a reporter and a producer.

The route has been changed due to updated traffic information.

And then someone ended up hiring me as a voice for a similar project, and I suddenly was the voice of all the computers on, like, the third floor of the Museum of Natural History in New York. So that was kind of fun.

So rang means harmony of colors, signifying various social, religious, linguistic communities and their peaceful coexistence at coastal Karnataka. With nothing written in it, we can flash it and I enjoy it. It's as big as you love him or loathe him. On Monday with it, where Jemima was Saturday, there now lay a large bad. The stood thus. Jacobi's Theory terminates in a finite on the hole in finite steps.

So these sentences, you can see, they don't really mean anything. They don't really trip off the tongue, and the talents do find them relatively hard to say. We tend to do several takes of a lot of these sentences. But they have the property that, because they contain these unusual words in these unusual combinations, we can cover more rare sound combinations faster and therefore with less material.

This specific kind of work is really different. Something I said 12 years ago can be put in front of another phrase that I said last week, and it should match. And so that's kind of a weird thing to learn how to do, or to figure out how to do.

Cancelled. It's okay to change your mind. We dragon. Cancelled. It's okay to change your mind.

A lot of the time what we're doing is intonation, and the intonation on something going up, or going down, or going sort of in between. The meaning that comes from your intonation has to be very precise and exactly the same on each phrase.

Usually an actor is called in to play a character, or to be an announcer for a commercial, for any number of things. And what we're looking for here is for them to be themselves, which can be disarming for actors who come in expecting to put on a mask. We say, "No, no, no, put the masks away. We want you to be you."

You know, they hired me as me. Not just my voice, or who I could pretend to be, but who I really am, because I have to sound the same for perhaps years.

One billion. Three billion and three.

I love the fact that we're building something that potentially, in 20 to 30 years, is part of the building blocks for artificial intelligence. And as a science fiction and technology buff, I think that that's really, really cool.

Other people, like actors, would be totally bored and be driven crazy by this, but I see it as kind of like an interesting linguistic puzzle that it's fun to sort of be on the inside of.

My name is de a the de - II put all David table. I'm a linguist at Nuance Communications and I work on text-to-speech. So the sound files come to us from the studio. What we then need to do with them is to label them in various ways. We need to label them so that they can be stored in the database, and that database will then need to be accessed so we can build the TTS voice, so that we can create new utterances, new text-to-speech utterances.

This program that I'm using to show you this stuff is called Praat, P-R-A-A-T, and Praat has various algorithms that just basically take this waveform and turn it into what's called a spectrogram. The labels that we need to apply to store it in that database are things like phonetic labels, stress labels and pitch labels, because phonetic label, stress label and pitch label are all relevant to which units get selected when we produce a new text-to-speech utterance.

So yeah, in terms of the future: there are about 6,000 languages in the world. Probably something like a third of them are in danger. Some of them may only have a few hundred speakers. They could be wiped out by a volcano, say, and that's happened before. A single volcano eruption has wiped out a language because all the speakers were below the volcano. In that case it would be possible for us to create a TTS voice of that language. We just have to know a great deal about it. We have to know all about its syntax, about its phonology, its phonetics and so on. We have to have someone to produce it, at least while it's still alive, even if only barely, so that we have recordings like the ones you saw made earlier. So it is possible to make a voice so that we could actually preserve that language in some sense.

What we have here is Dragon Reader. It's a newsreader application. Really what it does is it reads the web to you.

From The Verge: Gabriel de Shaw is an artist that uses discarded parts from typewriters, machines and old computers to create some truly beautiful pieces of art, including takes on several iconic characters from Star Wars.

The end result here is Alison's voice, synthesized. You've seen her in action in the booth, and we're hearing true synthesis from the system, where it's taking text and synthesizing human speech, ultimately to generate what we hope is a natural and compelling experience.

As the product has been developing over a couple of years, at first we would try out a version of it and it would sound very mechanical, very much like the sort of computer voice that everybody hates. But now I've just heard the latest version, and it's weird. It sounds like me.

And for the first time now we're kind of entering an era where the technology that interacts with us using voice is trying to adapt to us and not the other way around, where we don't have to put on a special voice to call a phone line and try to get a reservation, where we can actually speak naturally and expect the system at the other end to understand us. And I think that's an exciting time to be working in this field.

The Verge.
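The script-design point made in the interviews (odd sentences cover rare two-sound combinations faster, so less material has to be recorded) can be illustrated with a small greedy selection. This is only a sketch under loud assumptions, not Nuance's actual method: letters stand in for phonemes, and the function names `diphones` and `pick_script` are hypothetical.

```python
def diphones(sentence):
    """Return the set of adjacent symbol pairs in a sentence.

    Assumption: letters stand in for phonemes. A real system would look
    up pronunciations in a lexicon instead of using spelling.
    """
    symbols = [c for c in sentence.lower() if c.isalpha()]
    return {a + b for a, b in zip(symbols, symbols[1:])}


def pick_script(candidates):
    """Greedily pick sentences until no candidate adds a new pair."""
    covered = set()
    remaining = list(candidates)
    script = []
    while remaining:
        # Choose the sentence that contributes the most uncovered pairs.
        best = max(remaining, key=lambda s: len(diphones(s) - covered))
        gain = diphones(best) - covered
        if not gain:
            break  # nothing left adds coverage, so stop recording
        script.append(best)
        covered |= gain
        remaining.remove(best)
    return script, covered
```

Calling `pick_script` on a list of candidate sentences returns the chosen recording script together with the set of pairs it covers. Sentences that only repeat already-covered pairs are never selected, which is why unusual sentences tend to win.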

Gadgetory


2013-09-17