My attempt at text-to-speech (FAIL)

Text to speech is where the computer speaks what you type.



For my attempt at text to speech, my idea was to record myself saying every sound in the English language.

(I now think I had totally the wrong idea. When I play the Windows text-to-speech voice "LH Michael" (which sounds similar to Stephen Hawking's voice) very slowly, the sounds seem to be synthesized somehow, not recorded one by one like I was doing.)

Next I did some research about the sounds of the English language, and I found that there are 44 sounds in total (20 vowels and 24 consonants).

These are...

VOWELS
A - the A sound (e.g. day, made, say)
a - the a sound (e.g. can, dad, mad)
ER - the louder ER/UR sound (e.g. her, confirm, sir)
er/a - the sound that is a cross between ER and a (e.g. teacher, singer)
AR - the ARE sound (e.g. dart, arch)
AIR - the AIR sound (e.g. dare, fair)
E - the EEE sound (e.g. meet, tea)
e - the eh sound (e.g. set, bed)
IE - the IEE sound (e.g. tier, heavier)
I - the EYE sound (e.g. mine, lie)
i - the i sound (e.g. did, sick)
O - the O sound (e.g. mode, so)
o - the o sound (e.g. odd, mob)
OO - the OO sound (e.g. moo, boot)
oo - the oo sound (e.g. book, foot)
OI - the OI sound (e.g. void, toy)
OR - the OR sound (e.g. door, more)
OW - the OW sound (e.g. cow, now)
U - the U sound (e.g. umbrella, up)
UA - the UA sound (e.g. dual, cruel)

CONSONANTS
B - the B sound (e.g. bet, ball)
C - the C sound but not CH (e.g. cat, cup)
CH - the CH sound (e.g. chocolate)
D - the D sound (e.g. duck, dog)
F - the F sound (e.g. fox, fan)
G - the G sound (e.g. goat, gat)
H - the H sound (e.g. house, hat)
J - the J sound (e.g. jewel, jug)
L - the L sound (e.g. land, lunch)
M - the M sound (e.g. mum, mad)
N - the N sound (e.g. new, nan)
NG - the NG sound (e.g. shopping)
P - the P sound (e.g. poo, purse)
R - the R sound (e.g. rat, run)
S - the S sound but not SH (e.g. sand, side)
SH - the SH sound (e.g. sheep, shop)
T - the T sound but not TH (e.g. tick, table)
TH - the voiced TH sound (e.g. the)
th - the unvoiced th sound, slightly softer (e.g. thimble)
V - the V sound (e.g. vet, view)
W - the W sound (e.g. water)
Y - the Y sound (e.g. yellow, yacht)
Z - the Z sound (e.g. zoo, zebra)
ZH - the ZH sound, a bit unusual (e.g. closure)


NOTE: This is my interpretation of the sounds. Different places have the sounds listed slightly differently (but the idea is still the same).

Surprisingly, the sound Q does not exist because it is the C sound next to the W sound (CW),
and the sound X does not exist because it is the C sound next to the S sound (CS), as in fox. Only the name of the letter X has an e sound in front (eCS).
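
As a rough sketch, the sound list above could be stored as plain JavaScript data. The names VOWELS, CONSONANTS, LETTER_EXPANSIONS and expandLetter are made up for this example and are not taken from the original project; only the sound codes come from the lists above.

```javascript
// Sound codes copied from the lists above (20 vowels, 24 consonants).
const VOWELS = [
  "A", "a", "ER", "er/a", "AR", "AIR", "E", "e", "IE", "I",
  "i", "O", "o", "OO", "oo", "OI", "OR", "OW", "U", "UA",
];
const CONSONANTS = [
  "B", "C", "CH", "D", "F", "G", "H", "J", "L", "M", "N", "NG",
  "P", "R", "S", "SH", "T", "TH", "th", "V", "W", "Y", "Z", "ZH",
];

// Letters with no sound of their own are written as a sequence of other sounds.
// For X this is what is heard inside a word such as "fox";
// the name of the letter adds a vowel in front.
const LETTER_EXPANSIONS = new Map([
  ["q", ["C", "W"]],
  ["x", ["C", "S"]],
]);

// Returns the sound sequence for a letter that has no sound of its own,
// or null when the letter is not one of those special cases.
function expandLetter(letter) {
  return LETTER_EXPANSIONS.get(letter.toLowerCase()) || null;
}

console.log(VOWELS.length + CONSONANTS.length); // total number of sounds
console.log(expandLetter("Q"));
```

Keeping upper and lower case codes apart matters here, because "A" and "a", or "TH" and "th", are different sounds in this scheme.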



Next I recorded myself saying all of the different sounds.

Then in JavaScript I worked out the logic of English spelling and pronunciation, so that it converts words from how they are written into how they sound. (This was hard, as English is very complicated.)
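
To give an idea of what that kind of logic can look like, here is a minimal sketch that turns a written word into a list of sound codes by always taking the longest spelling pattern that matches. It is not the original code: the RULES table and the wordToSounds function are hypothetical, and the handful of rules is far too small for real English.

```javascript
// Hypothetical, deliberately tiny rule table: spelling pattern -> sound codes.
// Real English needs many more rules and a long list of exceptions.
const RULES = [
  ["sh", ["SH"]],
  ["ch", ["CH"]],
  ["ng", ["NG"]],
  ["oo", ["OO"]],
  ["ow", ["OW"]],
  ["ee", ["E"]],
  ["qu", ["C", "W"]],
  ["x", ["C", "S"]],
  ["c", ["C"]],
  ["k", ["C"]],
  ["a", ["a"]],
  ["e", ["e"]],
  ["i", ["i"]],
  ["o", ["o"]],
  ["u", ["U"]],
];
// Single consonant letters whose sound code is just the capital letter.
for (const letter of "bdfghjlmnprstvwyz") {
  RULES.push([letter, [letter.toUpperCase()]]);
}

function wordToSounds(word) {
  const text = word.toLowerCase();
  const sounds = [];
  let pos = 0;
  while (pos < text.length) {
    // Pick the longest matching pattern so "sh" wins over "s" then "h".
    let best = null;
    for (const [pattern, codes] of RULES) {
      if (text.startsWith(pattern, pos) && (!best || pattern.length > best.pattern.length)) {
        best = { pattern, codes };
      }
    }
    if (best) {
      sounds.push(...best.codes);
      pos += best.pattern.length;
    } else {
      pos += 1; // skip characters that no rule covers
    }
  }
  return sounds;
}

console.log(wordToSounds("cow"));  // [ 'C', 'OW' ]
console.log(wordToSounds("shop")); // [ 'SH', 'o', 'P' ]
console.log(wordToSounds("book")); // [ 'B', 'OO', 'C' ]
```

The last line shows the problem: with these rules "book" comes out with the OO sound from "moo" instead of the oo sound it should have, so every rule like this ends up needing exceptions.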



As I was working out the logic and testing the text to speech, it sounded horrible and disjointed.

The problem was getting the different sounds to blend together to make the words sound right.

When I recorded the sounds I was saying each sound on its own. Now I realise that when people talk they say the sounds differently to when they say the sounds on their own.

e.g. if I make the sound C, then make the sound OW, then join the two together, it will not sound like COW; instead it will be disjointed.
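
The joining step can be pictured with a small sketch, assuming each sound is a separate recorded clip played straight after the previous one. The clip lengths in CLIP_MS are made-up numbers and scheduleClips is a hypothetical helper, not part of the original project.

```javascript
// Made-up clip lengths in milliseconds; real recordings would differ.
const CLIP_MS = { C: 180, OW: 420 };

// Lays the clips end to end: each one starts exactly where the last one stops.
function scheduleClips(sounds, clipMs) {
  let startMs = 0;
  return sounds.map((code) => {
    if (!(code in clipMs)) {
      throw new Error("No recording for sound " + code);
    }
    const entry = { code, startMs, endMs: startMs + clipMs[code] };
    startMs = entry.endMs;
    return entry;
  });
}

console.log(scheduleClips(["C", "OW"], CLIP_MS));
// [ { code: 'C', startMs: 0, endMs: 180 }, { code: 'OW', startMs: 180, endMs: 600 } ]
```

Nothing in this schedule lets the C run into the OW. Each clip is a complete sound said on its own, which is why the joined result sounds like two separate sounds rather than one word.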

Also, the vowels were sort of alright but the consonants were awful.

So next I recorded all of the sounds again, but this time for each sound I said a word that had the sound I wanted, then said the sound again pretending I was still saying the word.

This worked a bit better, but it still didn't sound all that good. One of the problems was that I was saying the consonants at a slightly different pitch to the vowels. Some of the consonants still didn't sound right either; the worst were P, T, D, R and G.



I am not going to be able to make it sound any better without a lot of work, so I have now given up.

(Now I realise that most text to speech sounds are synthesized, not recorded)

So here is my failed attempt at making a text to speech:

CLICK HERE FOR MY TEXT-TO-SPEECH



I have learnt so much while making this. Why wasn't I told any of this at school?
At school, teachers would always go on about which letters are vowels and which are consonants, but would never say what a vowel or consonant actually is. Teachers would always say that there are 5 vowels in the alphabet, but never said that the total number of vowel sounds is far more. Teachers always taught English in such a dull way - SCHOOL WAS SO CRAP.




