Inside Nuance: the art and science of how Siri speaks
2013-09-17
scratching the collar of my neck where
humans once had gills certainly it had a
company that sweatshirt but I'd always
both ways concerned broken by
enumeration also used the winner is to
be announced at the world Awards divided
into two sections both secured by yet
another lock important as it is by my
regular supervision furthermore so too
for the travelers it was a nightmare in
their minds a creature from the darker
side of the intellect
please wait
So I do think language is primarily a tool for communication, and traditionally within computer science, within technology, that's all we use it for. But language has this very rich secondary power, which is a kind of social glue.
did your about non unit precision
content speech synthesis administrators
Nuance is a company that's focused on the next generation of human-machine interactions. We're building new types of interfaces for how users access information, and a big part of that is making these systems talk. A couple of decades ago, when building a voice, you might have just wanted to guarantee coverage of all the individual phonemes of the language, right? So a sound would be a tiger or are
the most technologically advanced place they're ever built with electronics advanced given monitors price level
And so long as we had all of these different sounds represented, we could then cut these sounds up in different ways and reassemble them into whatever words and sentences we liked.
you are right the road I American does it sure is great to get out of that bear
Co-articulation between sounds, which means that the precise sound you make when you say an "a" varies depending on whether you're next going to say to another say to be or not to be
that you've got with you oh yes i sir but well electronics present on wall in credible
So the speech organs of the mouth and throat move fluidly from one position to the next, meaning that you get this coloration from one sound to the next. So at the very least you want to make sure the sound units contain sequences of sounds, so we want all possible combinations of two sounds.
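The idea of covering "all possible combinations of two sounds" can be sketched in a few lines of Python. This is only an illustration, not Nuance's tooling: the four-symbol phoneme list and the recorded sequence are made-up assumptions, and a real language has dozens of phonemes.

```python
from itertools import product

# Toy phoneme inventory (an assumption for illustration only).
PHONEMES = ["a", "t", "g", "r"]

def all_diphones(phonemes):
    """Every ordered pair of phonemes, i.e. every two-sound sequence."""
    return set(product(phonemes, repeat=2))

def diphones_in(sequence):
    """Adjacent pairs actually present in one recorded phoneme sequence."""
    return set(zip(sequence, sequence[1:]))

inventory = all_diphones(PHONEMES)
# Hypothetical recording, already transcribed into phonemes.
recorded = diphones_in(["t", "a", "g", "a", "r"])
missing = inventory - recorded

print(len(inventory), "pairs needed,", len(missing), "still missing")
```

The number of pairs grows with the square of the inventory size, which is why coverage of two-sound sequences needs far more recorded material than coverage of individual phonemes.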
Just got a couple of little things here. On the first line, on the first paragraph, looking for a little bit more clarity with "community".
You know, there's the obvious things that voice directors will pick up on: a missed word, some gravel in the voice, there's noise. A typical voice project will last, if, you know, we're on point, about three, maybe four months.
Let's keep the same sort of energy and pace.
The intonation, the speed at which people speak, that carries most of the information. That's what's going to tell you, first of all, if somebody's sincere, if they're warm, what they're indicating, what they're trying to tell you.
one one two two three three
Hold on, I have two takes in. This is the exciting part.
When I was younger, I waited tables in a really fancy restaurant where you had to read an endless list of specials, and I would get to the end and people would just look up at me and say, "Such a nice voice."
okay let's start over what can I help you with
I was interested in becoming a broadcast journalist, so I was on the radio in college, and then I got a job at the local public radio station in Philadelphia, so I was a reporter and a producer.
the route has been changed due to updated traffic information
And then someone ended up hiring me as a voice for a similar project, and I suddenly was the voice of all the computers on, like, the third floor of the Museum of Natural History in New York, so that was kind of fun.
so rang means harmony of colors signifying various social religious linguistic communities and their peaceful coexistence at coastal Karnataka
with nothing written in it we can flash it and I enjoy it
it's as big as you love him or loathe him on Monday with it where Jemima was Saturday there now lay a large bad the stood thus Jacobi's Theory terminates in a finite on the hole in finite steps
So these sentences, you can see, they don't really mean anything. They don't really trip off the tongue, and the talents do find them relatively hard to say. We tend to do several takes of a lot of these sentences. But they have the property that, because they contain these unusual words in these unusual combinations, we can cover more rare sound combinations faster, and therefore with less material.
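One way to picture why odd sentences "cover more rare sound combinations faster" is a greedy selection over candidate sentences. The sketch below is an assumption about how such a script could be chosen, not a description of the actual process used in the studio; the sentence names and phoneme sequences are invented.

```python
def diphones_in(phones):
    """Adjacent two-sound pairs in a phoneme sequence."""
    return set(zip(phones, phones[1:]))

def greedy_script(candidates, target):
    """Repeatedly pick the sentence that covers the most still-missing pairs."""
    remaining = set(target)
    chosen = []
    while remaining:
        best = max(candidates,
                   key=lambda name: len(diphones_in(candidates[name]) & remaining))
        gain = diphones_in(candidates[best]) & remaining
        if not gain:
            break  # no candidate adds anything new
        chosen.append(best)
        remaining -= gain
    return chosen, remaining

# Hypothetical candidate sentences, already converted to phonemes.
candidates = {
    "plain": ["t", "a", "t", "a"],
    "odd": ["g", "r", "a", "t", "g"],
    "other": ["r", "a", "g"],
}
target = set().union(*(diphones_in(p) for p in candidates.values()))
chosen, remaining = greedy_script(candidates, target)
print(chosen, remaining)
```

In this toy data the unusual sentence is picked first because it contributes the most new pairs in a single reading, which is the property the speaker describes.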
This specific kind of work is really different. Something I said 12 years ago can be put in front of another phrase that I said last week, and it should match. And so that's kind of a weird thing to learn how to do, or to figure out how to do.
cancelled it's okay to change your mind
we dragon cancelled it's okay to change your mind
A lot of the time what we're doing is intonation, and the intonation on something going up or going down or going sort of in between, the meaning that comes from your intonation has to be very precise and exactly the same on each phrase.
Usually an actor is called in to play a character or to be an announcer for a commercial, for any number of things. And what we're looking for here is for them to be themselves, which can be disarming for actors who come in expecting to put on a mask, and we say, "No, no, no, put the masks away, we want you to be you."
You know, they hired me as me, you know, not just my voice or who I could pretend to be, but who I really am, because I have to sound the same for perhaps years.
one billion three billion and three
I love the fact that we're building something that potentially in 20 to 30 years is part of the building blocks for artificial intelligence, and as a science fiction and technology buff, I think that that's really, really cool.
Other people, like actors, would be totally bored and be driven crazy by this, but I see it as kind of like an interesting linguistic puzzle that it's fun to sort of be on the inside of.
My name is de a the de - II put all David table. I'm a linguist at Nuance Communications and I work on text-to-speech. So the sound files come to us from the studio. What we then need to do with them is to label them in various ways. We need to label them so that they can be stored in the database, and that database will then need to be accessed so we can build the TTS voice, so that we can create new text-to-speech utterances.
This program that I'm using to show you this stuff is called Praat, P-R-A-A-T, and Praat has various algorithms that just basically take this waveform and turn it into what's called a spectrogram. The labels that we need to apply to store it in that database are things like phonetic label, stress label, pitch label, because phonetic label, stress label and pitch label are all relevant to which units get selected when we produce a new text-to-speech utterance.
So yeah, in terms of the future: there are about 6,000 languages in the world. Probably something like a third of them are in danger. Some of them may only have a few hundred speakers, could be wiped out by a volcano, say, and that's happened before: a single volcano eruption has wiped out a language because all the speakers were below the volcano. In that case it would be possible for us to create a TTS voice of that language. We just have to know a great deal about it. We have to know all about its syntax, about its phonology, its phonetics and so on. We have to have someone to produce it, at least while it's still alive, even if only barely, so that we have recordings like the ones you saw made earlier. So it is possible to make a voice so that we could actually preserve that language in some sense.
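The labelling step described in this segment (phonetic, stress, and pitch labels stored per unit and used to decide which units get selected) could look roughly like the sketch below. Everything here is a hypothetical illustration: the `Unit` fields, the recording ids, and the cost weights are assumptions, and a production system would also score how well neighbouring units join, which this sketch leaves out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    phone: str       # phonetic label
    stressed: bool   # stress label
    pitch_hz: float  # pitch label
    source: str      # hypothetical id of the studio recording

def target_cost(unit, stressed, pitch_hz):
    """Lower is better: penalise a stress mismatch, then pitch distance."""
    stress_penalty = 0.0 if unit.stressed == stressed else 50.0  # arbitrary weight
    return stress_penalty + abs(unit.pitch_hz - pitch_hz)

def select_unit(database, phone, stressed, pitch_hz):
    """Pick the stored unit with the right phone that best fits the request."""
    candidates = [u for u in database if u.phone == phone]
    if not candidates:
        return None
    return min(candidates, key=lambda u: target_cost(u, stressed, pitch_hz))

# Made-up database entries standing in for labelled recordings.
database = [
    Unit("a", True, 220.0, "take_001"),
    Unit("a", False, 180.0, "take_002"),
    Unit("a", True, 140.0, "take_003"),
    Unit("t", False, 0.0, "take_004"),
]
best = select_unit(database, "a", stressed=True, pitch_hz=150.0)
print(best.source)
```

The point of the sketch is only that the labels act as search keys: without them, the database would hold audio but no way to ask for "a stressed 'a' at roughly this pitch".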
What we have here is a Dragon reader. It's a newsreader application. Really, what it does is it reads the web to you.
From The Verge: Gabriel Dishaw is an artist that uses discarded parts from typewriters, machines and old computers to create some truly beautiful pieces of art, including takes on several iconic characters from Star Wars.
The end result here is Alison's voice, synthesized. You've seen her in action in the booth, and we're hearing true synthesis from the system, where it's taking text and synthesizing human speech, ultimately to generate what we hope is a natural and compelling experience.
As the product has been developing over a couple of years, at first we would try out a version of it and it would sound very mechanical, very much like the sort of computer voice that everybody hates. But now I've just heard the latest version and it's weird, it sounds like me.
And for the first time now, we're kind of entering an era where the technology that interacts with us using voice is trying to adapt to us and not the other way around, where we don't have to put on a special voice to call a phone line and try to get a reservation, where we can actually speak naturally and expect the system at the other end to understand us. And I think that's an exciting time to be working in this field.
The Verge