Another development in voice interaction technology from Google, where the "speech" was learned purely from roughly 24 hours of paired text and speech recordings, without relying on hand-crafted linguistic rules. Essentially very complex pattern recognition, but with a 'simpler' underlying technical structure to generate the results.
Mimicking speech is not new (just ask Alexa... or Siri...), but this is interesting on two counts:
1) evidence of continuing progress in machine learning and neural network architectures that solve problems by identifying patterns in large volumes of data - relevant to data science in any organisation
2) bringing access to human-sounding speech, generated purely from text, closer to general use - and the investment by the large tech companies to make this a reality.
From Google's announcement: "Generating very natural sounding speech from text (text-to-speech, TTS) has been a research goal for decades. There has been great progress in TTS research over the last few years, and many individual pieces of a complete TTS system have greatly improved. Incorporating ideas from past work such as Tacotron and WaveNet, we added more improvements to end up with our new system, Tacotron 2. Our approach does not use complex linguistic and acoustic features as input. Instead, we generate human-like speech from text using neural networks trained using only speech examples and corresponding text transcripts."
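To make the two-stage idea concrete, here is a minimal, purely illustrative sketch of a Tacotron 2-style pipeline: one network maps characters to a mel spectrogram, and a second (WaveNet, in Google's system) converts that spectrogram to audio samples. The weights below are random stand-ins and the function names are my own, not the actual Tacotron 2 API; the point is only the shape of the data flow.

```python
import numpy as np

# Hypothetical two-stage TTS sketch. Stage 1: characters -> mel spectrogram.
# Stage 2: mel spectrogram -> waveform. Real systems learn both stages from
# paired (text, audio) data; here everything is random and untrained.
rng = np.random.default_rng(0)

CHARS = "abcdefghijklmnopqrstuvwxyz '"
CHAR_TO_ID = {c: i for i, c in enumerate(CHARS)}
EMBED_DIM = 16
N_MELS = 80            # Tacotron 2 predicts 80-band mel spectrograms
FRAMES_PER_CHAR = 5    # crude stand-in for a learned attention alignment
HOP_SAMPLES = 200      # audio samples generated per spectrogram frame

embedding = rng.standard_normal((len(CHARS), EMBED_DIM))
spec_proj = rng.standard_normal((EMBED_DIM, N_MELS))

def text_to_mel(text: str) -> np.ndarray:
    """Stage 1: map raw characters to a mel-spectrogram-shaped array."""
    ids = [CHAR_TO_ID[c] for c in text.lower() if c in CHAR_TO_ID]
    char_feats = embedding[ids]                 # (n_chars, EMBED_DIM)
    frames = np.repeat(char_feats, FRAMES_PER_CHAR, axis=0)
    return np.tanh(frames @ spec_proj)          # (n_frames, N_MELS)

def mel_to_audio(mel: np.ndarray) -> np.ndarray:
    """Stage 2: a toy 'vocoder' expanding each frame into waveform samples."""
    per_frame = mel.mean(axis=1)                # one level per frame
    return np.repeat(per_frame, HOP_SAMPLES)    # (n_frames * HOP_SAMPLES,)

mel = text_to_mel("hello world")
audio = mel_to_audio(mel)
print(mel.shape, audio.shape)
```

The appeal of the end-to-end approach is visible even in this toy: nothing in the pipeline encodes phonemes, stress, or duration rules; those regularities would have to be absorbed by the learned weights from the training examples alone.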