Speech synthesis, in the broad sense, is the reconstruction of the waveform of a speech signal from its parameters [1]; in the narrow sense, it is the generation of a speech signal from printed text. It is considered part of artificial intelligence.
More generally, speech synthesis refers to everything connected with the artificial production of human speech.
A speech synthesizer is a system, implemented in software and/or hardware, that converts text or images into speech.
A voice engine is the core system that directly converts text or commands into speech; it can also exist independently of a computer.
The use of speech synthesis
Speech synthesis may be needed whenever the recipient of information is a person. The quality of a speech synthesizer is judged primarily by its similarity to a human voice and by its intelligibility. The simplest synthesized speech can be created by combining fragments of recorded speech stored in a database. Surprisingly, this method is already encountered everywhere, often without our noticing it.
- Speech synthesis from text or from a message code can be used in information and reference systems, to assist people who are blind or unable to speak, and in cases where a machine directs a person's actions.
- In announcements of train departures and the like.
- To report information about technological processes: in military and aerospace engineering, in robotics, and in spoken dialogue between a person and a computer.
- As a sound effect, it is often used in electronic music.
Speech synthesis methods
All methods of speech synthesis can be divided into the following groups: [2]
- parametric synthesis;
- concatenative (compilation) synthesis;
- synthesis by rules;
- subject-oriented synthesis.
Parametric Synthesis
Parametric speech synthesis is the final operation in vocoder systems, where the speech signal is represented by a small set of continuously varying parameters. Parametric synthesis is appropriate when the set of messages is limited and does not change often. Its advantage is that speech can be recorded for any language and any speaker. The quality of parametric synthesis can be very high, depending on how strongly the information is compressed in the parametric representation. However, parametric synthesis cannot be used for arbitrary messages that were not defined in advance.
Compilation Synthesis
Compilation synthesis amounts to composing a message from a pre-recorded dictionary of source elements. The elements are no smaller than a word, so the content of synthesized messages is limited by the size of the dictionary. As a rule, the dictionary contains no more than a few hundred words. The main problem in compilation synthesis is the amount of memory needed to store the dictionary, which is why various methods of compressing or coding the speech signal are used. Compilation synthesis is widely used in practice. In Western countries, a variety of devices (from military aircraft to home appliances) are equipped with voice response systems. In Russia, voice response systems were until recently used mainly in military equipment, but they are increasingly found in everyday life, for example in the help services of mobile operators that report the status of a subscriber's account.
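As a minimal sketch of this idea, the following Python example composes a message by looking up whole words in a small dictionary of pre-recorded waveforms and joining them with short pauses. The "recordings" are stand-in sine tones generated in code, and the names make_tone, DICTIONARY, and compile_message, the sample rate, and the word list are all assumptions made for illustration only.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)


def make_tone(freq_hz, duration_ms):
    """Stand-in for a recorded word: a plain sine tone of the given length."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical dictionary of pre-recorded words (placeholder tones, not speech).
DICTIONARY = {
    "balance": make_tone(300, 300),
    "is": make_tone(400, 100),
    "five": make_tone(500, 250),
    "rubles": make_tone(600, 350),
}


def compile_message(words, pause_ms=50):
    """Join dictionary recordings in order, separated by short silences."""
    pause = [0.0] * (SAMPLE_RATE * pause_ms // 1000)
    samples = []
    for k, word in enumerate(words):
        if word not in DICTIONARY:
            # The message content is limited by the dictionary:
            # a word without a recording cannot be spoken.
            raise KeyError(f"no recording for word: {word!r}")
        if k > 0:
            samples.extend(pause)
        samples.extend(DICTIONARY[word])
    return samples


message = compile_message(["balance", "is", "five", "rubles"])
```

The KeyError branch mirrors the limitation described above: any word outside the dictionary simply cannot be produced.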
Complete speech synthesis by rules
Full speech synthesis by rules (also called synthesis from printed text) controls all parameters of the speech signal and can therefore generate speech from previously unknown text. In this case, the parameters obtained by analyzing a speech signal are stored in memory, as are the rules for combining sounds into words and phrases. Synthesis is carried out by modeling the vocal tract using analog or digital technology. During synthesis, the parameter values and the rules for joining phonemes are applied sequentially at a fixed time interval, for example every 5-10 ms. This method relies on programmed knowledge of acoustic and linguistic constraints and does not directly use elements of human speech. Systems based on it follow one of two approaches. The first, known as articulatory synthesis, aims to build a model of the human speech production system. The second is formant synthesis by rules. The intelligibility and naturalness of such synthesizers can be brought to levels comparable with natural speech.
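The frame-by-frame parameter update described above can be illustrated with a very small formant synthesizer. The sketch below passes a train of pulses (a crude stand-in for the glottal source) through a cascade of two-pole digital resonators and changes the formant frequencies every 5 ms. The resonator recurrence is the standard two-pole form; the formant targets, bandwidths, pitch, sample rate, linear interpolation, and all names (Resonator, glide, synthesize) are assumptions chosen for illustration, not values taken from any particular synthesizer.

```python
import math

SAMPLE_RATE = 16000                         # Hz (assumed)
FRAME_MS = 5                                # parameters are updated every 5 ms
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # samples per frame


class Resonator:
    """Two-pole digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""

    def __init__(self):
        self.a = self.b = self.c = 0.0
        self.y1 = self.y2 = 0.0

    def set(self, freq_hz, bandwidth_hz):
        t = 1.0 / SAMPLE_RATE
        self.c = -math.exp(-2 * math.pi * bandwidth_hz * t)
        self.b = (2 * math.exp(-math.pi * bandwidth_hz * t)
                  * math.cos(2 * math.pi * freq_hz * t))
        self.a = 1.0 - self.b - self.c  # normalizes the gain at 0 Hz to one

    def step(self, x):
        y = self.a * x + self.b * self.y1 + self.c * self.y2
        self.y2, self.y1 = self.y1, y
        return y


def glide(start, end, n_frames):
    """Linearly interpolate formant frequencies frame by frame.

    Bandwidths are taken from `start` and held constant.
    """
    frames = []
    for k in range(n_frames):
        w = k / max(n_frames - 1, 1)
        frames.append([(f0 + w * (f1 - f0), bw)
                       for (f0, bw), (f1, _) in zip(start, end)])
    return frames


def synthesize(frames, pitch_hz=120):
    """frames: one list of (freq_hz, bandwidth_hz) formants per 5 ms frame."""
    resonators = [Resonator() for _ in frames[0]]
    period = SAMPLE_RATE // pitch_hz  # samples between source pulses
    out = []
    n = 0
    for formants in frames:
        # Rule-driven parameter update at the frame boundary.
        for r, (freq, bw) in zip(resonators, formants):
            r.set(freq, bw)
        for _ in range(FRAME_LEN):
            x = 1.0 if n % period == 0 else 0.0  # impulse-train source
            for r in resonators:                  # cascade of formant filters
                x = r.step(x)
            out.append(x)
            n += 1
    return out


# Illustrative (assumed) formant targets for two vowel-like sounds.
VOWEL_A = [(700, 80), (1200, 90), (2600, 120)]
VOWEL_I = [(300, 60), (2300, 100), (3000, 120)]

signal = synthesize(glide(VOWEL_A, VOWEL_I, 40))  # 40 frames of 5 ms each
```

A rule-based system would derive the sequence of formant targets from the phonemes of the input text; here the two targets are simply hard-coded to keep the example short.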
Synthesis by rules using previously stored segments of natural speech is a variety of rule-based synthesis that became widespread once it became possible to manipulate speech signals in digitized form. Depending on the size of the source elements, the following types of synthesis are distinguished:
- microsegment (micro-wave);
- allophone;
- diphone;
- half-syllable;
- syllabic;
- synthesis from units of arbitrary size.
Typically half-syllables are used as these elements, that is, segments containing half of a consonant and half of the adjacent vowel. Speech can then be synthesized from a given text, but intonation is difficult to control. The quality of such synthesis falls short of natural speech because distortions often arise at the boundaries where diphones are joined. Compiling speech from pre-recorded word forms does not solve the problem of high-quality synthesis of arbitrary messages either, since the acoustic and prosodic characteristics of words (duration and intonation) vary with the type of phrase and the position of the word in it. This remains true even when large amounts of memory are used to store word forms.
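One common way to soften the boundary distortions mentioned above is to overlap neighbouring units slightly and crossfade between them instead of butting them together. The sketch below shows a plain linear crossfade on lists of samples; the function names and the choice of a linear ramp are assumptions for illustration, and the sketch does nothing about pitch or intonation mismatches between units.

```python
def crossfade_join(left, right, overlap):
    """Join two sample lists, blending the last `overlap` samples of `left`
    with the first `overlap` samples of `right` using a linear ramp."""
    if overlap <= 0:
        return list(left) + list(right)  # hard join, no smoothing
    if overlap > min(len(left), len(right)):
        raise ValueError("overlap is longer than one of the units")
    head = list(left[:-overlap])
    tail = list(right[overlap:])
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the right-hand unit
        blended.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + blended + tail


def concatenate_units(units, overlap):
    """Chain any number of units, crossfading at every boundary."""
    out = list(units[0])
    for unit in units[1:]:
        out = crossfade_join(out, unit, overlap)
    return out


# Two artificial "units" with very different levels at the join.
unit_a = [1.0] * 10
unit_b = [0.0] * 10
hard = crossfade_join(unit_a, unit_b, overlap=0)
smooth = crossfade_join(unit_a, unit_b, overlap=4)
```

With the hard join the signal jumps from one level to the other in a single sample, while the crossfaded version spreads the change over the overlap region, at the cost of shortening the result by `overlap` samples.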
Subject-Oriented Synthesis
Subject-oriented synthesis assembles pre-recorded words and phrases into complete spoken messages. It is used in applications where the variety of texts is limited to a particular topic or domain, for example train departure announcements and weather forecasts. The technology is simple to use and has long been applied commercially, including in electronic devices such as talking clocks and calculators. Such systems can sound very natural because the variety of sentence types is limited and closely matches the intonation of the original recordings. However, because these systems are limited to the words and phrases in their database, they cannot be applied more widely: they can synthesize only the combinations of words and phrases for which they were programmed.
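A train departure announcement of the kind mentioned above can be sketched as a fixed template whose slots are filled from a closed set of recorded values. In the Python example below, the template wording, the inventory of recordings, and the names TEMPLATE, RECORDED_SLOT_VALUES, and build_announcement are all hypothetical; the function returns the ordered list of fragments to play rather than audio.

```python
# Hypothetical inventory of recorded fragments, identified by their text.
RECORDED_PHRASES = {"the train to", "departs from platform", "at"}
RECORDED_SLOT_VALUES = {
    "city": {"Moscow", "Minsk"},
    "platform": {"1", "2", "3"},
    "time": {"10:15", "18:40"},
}

# Fixed sentence pattern: literal phrases interleaved with named slots.
TEMPLATE = [
    "the train to", "{city}",
    "departs from platform", "{platform}",
    "at", "{time}",
]


def build_announcement(**slots):
    """Return the ordered list of recorded fragments to play back."""
    playlist = []
    for item in TEMPLATE:
        if item.startswith("{") and item.endswith("}"):
            name = item[1:-1]
            value = slots[name]
            if value not in RECORDED_SLOT_VALUES[name]:
                # Only values that were recorded in advance can be announced.
                raise ValueError(f"no recording for {name}={value!r}")
            playlist.append(value)
        else:
            if item not in RECORDED_PHRASES:
                raise ValueError(f"no recording for phrase {item!r}")
            playlist.append(item)
    return playlist


playlist = build_announcement(city="Minsk", platform="2", time="18:40")
```

Because every slot value must already exist as a recording, a request such as `build_announcement(city="Paris", platform="2", time="18:40")` fails, which is exactly the restriction to pre-programmed combinations described above.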
History
At the end of the 18th century, the Danish scientist Christian Kratzenstein, a full member of the Russian Academy of Sciences, created a model of the human vocal tract that could pronounce five long vowel sounds (a, e, i, o, u). The model was a system of acoustic resonators of various shapes that produced vowels using vibrating reeds excited by an air stream. In 1778, the Austrian scientist Wolfgang von Kempelen supplemented Kratzenstein's model with models of the tongue and lips and presented an acoustic-mechanical speaking machine capable of reproducing certain sounds and their combinations. Hissing and whistling sounds were produced with a special hand-operated bellows. In 1837, the scientist Charles Wheatstone presented an improved version of the machine that could reproduce vowels and most consonants. In 1846, Joseph Faber demonstrated his talking organ, Euphonia, which attempted to synthesize not only speech but also singing.
At the end of the 19th century, the famous scientist Alexander Bell created his own "talking" mechanical model, very similar in design to Wheatstone's machine. With the start of the 20th century, the era of electrical machines began, and scientists were able to use sound wave generators and build algorithmic models on their basis.
In the 1930s, Bell Labs employee Homer Dudley, working on ways to reduce the bandwidth needed in telephony and thereby increase transmission capacity, developed the VOCODER (from the English voice and coder), a keyboard-controlled electronic speech analyzer and synthesizer. Dudley's idea was to analyze the voice signal, break it into parts, and resynthesize it in a form that places lower demands on line capacity. An improved version of Dudley's vocoder, the VODER, was presented at the 1939 New York World's Fair [3].
The first speech synthesizers sounded rather unnatural, and the phrases they produced were often barely intelligible. The quality of synthesized speech has improved steadily, however, and speech generated by modern systems is sometimes indistinguishable from real human speech. Despite the success of electronic speech synthesizers, research on mechanical speech synthesizers continues, for example for use in humanoid robots. [4]
The first computer-based speech synthesis systems began to appear in the late 1950s, and the first text-to-speech synthesizer was created in 1968.
Present and Future
It is too early to speak of a promising future in the coming decades for speech synthesis by rules, since its output still sounds mostly like robot speech and is in places hard to understand. What a listener can reliably determine is whether the synthesizer speaks in a male or a female voice; the subtleties inherent in a human voice are often still missing. For this reason, development has partly turned away from constructing speech signals directly and continues to rely on the simplest segmentation of recorded voice.
Hybrid speech synthesis can be used to attack speech recognition systems. [5]
See also
- Voice search
- Vocaloid
- Vocoder
- Voice cloning
- Speech recognition
- Jaws
- VoiceXML
Notes
- [1] Under this definition, the conversion of sound pressure to electrical voltage and back in a microphone and telephone, as well as recording and playback (for example, from magnetic media), is not synthesis. Sampling and quantization of a speech signal in pulse-code modulation are likewise unrelated to speech synthesis, whereas the generation of a speech signal in vocoder systems can be considered synthesis.
- [2] Sorokin V. N. Speech Synthesis. Moscow: Nauka, 1992, p. 392.
- [3] Dennis Klatt's History of Speech Synthesis page (archived July 4, 2006 on the Wayback Machine), dedicated to the history of speech synthesizers, presents audio recordings of various speech synthesizers, including a recording of Homer Dudley's vocoder.
- [4] For example, Japanese scientists from the Takanishi Laboratory at Waseda University are working on an anthropomorphic model of a talking robot. Their latest development (2005), the Waseda Talker No.5 model, has a full set of speech organs: lungs, larynx, soft palate, tongue, teeth, lips, and so on, with 18 degrees of freedom in total. More detailed information, including photographs and videos, is available on their Anthropomorphic Talking Robot Waseda-Talker Series page (archived July 17, 2007).
- [5] Research on the robustness of voice verification to attacks using a synthesis system. Journal of Instrumentation, February 2014.
Literature
- B. M. Lobanov, L. I. Tsirulnik. Computer Synthesis and Speech Cloning. Minsk: Belarusian Science, 2008. 316 p.
- James L. Flanagan. Analysis, Synthesis and Perception of Speech. Moscow: Communication, 1968. 394 p.
- V. N. Sorokin. Speech Synthesis. Moscow: Nauka, 1992.
- Thierry Dutoit. An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers, 1997. 312 p. ISBN 0-7923-4498-7.
- S. V. Rybin. Speech Synthesis: textbook for the course "Speech Synthesis". St. Petersburg: ITMO University, 2014. 92 p.
Links
- Speech Synthesis in the Open Directory Project (dmoz) link directory
- Thierry Dutoit. A Short Introduction to Text-to-Speech Synthesis. TTS research team, TCTS Lab (December 17, 1999). Retrieved January 4, 2014.
- How speech synthesis from Yandex works | Habrahabr
- Speech Synthesizer Online