I recently looked at generating spoken audio from text. I have since found another way to do this locally, instead of sending all the data to Google.
The tool that I used is called piper1-gpl.
Install Piper
Use the Python package manager pip to get it installed:
pip install piper-tts
You may need to use your distro’s package manager to get pip installed if it’s not present.
Choose a voice
Next, you need to choose a voice to read the text. The following command will print a list of available voices:
python3 -m piper.download_voices
This prints all the voices in the form language_COUNTRY-name-quality. I’m looking for Spanish with a Latin American accent. I know there are lots, but anything that’s not Porteño will do (sorry Argentina, but I’m British and we’re still salty about the Hand of God), so a Mexican accent is perfect. The two options that the above command prints are as follows:
python3 -m piper.download_voices
...
es_MX-ald-medium
es_MX-claude-high
...
I’m going with claude, so I downloaded this voice:
python3 -m piper.download_voices es_MX-claude-high
This downloads a small model to the current directory. In this case, the model es_MX-claude-high.onnx is downloaded along with a JSON config file es_MX-claude-high.onnx.json. It’s only 64M.
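Since Piper needs both files side by side, it can be handy to check that the download left them where you expect. The following is a minimal sketch, not part of Piper itself: the helper names voice_files and missing_voice_files are hypothetical, and the only thing taken from above is the naming pattern of the two files.

```python
from pathlib import Path


def voice_files(voice: str, directory: str = ".") -> tuple[Path, Path]:
    """Return the expected model and config paths for a downloaded voice."""
    base = Path(directory)
    model = base / f"{voice}.onnx"         # the voice model itself
    config = base / f"{voice}.onnx.json"   # the JSON config that sits next to it
    return model, config


def missing_voice_files(voice: str, directory: str = ".") -> list[str]:
    """List the names of any expected voice files that are not on disk."""
    return [p.name for p in voice_files(voice, directory) if not p.is_file()]


if __name__ == "__main__":
    missing = missing_voice_files("es_MX-claude-high")
    print("missing:", missing if missing else "nothing")
```

An empty list means both the model and its config are present in the directory you checked.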
Generate Speech to a File
Now you have everything you need to generate some speech. The following command will output the speech to a .wav audio file and take a text file (the first paragraph of “Don Quixote” from Gutenberg) as the input:
python3 -m piper --model es_MX-claude-high.onnx --input-file Don-Quixote-para1.txt --output-file test.wav
The flags I used are:
--model - Voice model file
--input-file - Input text file
--output-file - Output audio file
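If you have several text files to convert, the same command can be assembled from Python. This is only a sketch under the assumption that the three flags above are all you need; piper_command and wav_name are hypothetical helper names, and the command is printed rather than executed so you can inspect it first.

```python
import shlex
from pathlib import Path


def piper_command(model: str, input_file: str, output_file: str) -> list[str]:
    """Build the argument list for the piper invocation shown above."""
    return [
        "python3", "-m", "piper",
        "--model", model,
        "--input-file", input_file,
        "--output-file", output_file,
    ]


def wav_name(input_file: str) -> str:
    """Derive an output name by swapping the input's extension for .wav."""
    return str(Path(input_file).with_suffix(".wav"))


if __name__ == "__main__":
    for text_file in ["Don-Quixote-para1.txt"]:
        cmd = piper_command("es_MX-claude-high.onnx", text_file, wav_name(text_file))
        # Print the command; pass cmd to subprocess.run(cmd, check=True) to run it.
        print(shlex.join(cmd))
```

Keeping the arguments as a list, rather than one long string, avoids quoting problems if a file name contains spaces.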
The WAV file will get rather large quickly. The test I used above was 139 words long and resulted in a 2.1M file. You can squash this into an .opus file with the following command:
ffmpeg -i <INFILE> -c:a libopus -b:a 32k -vbr on -compression_level 10 -threads 0 -ac 1 -application voip <OUTFILE>
This reduced it to 192K.
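Those numbers roughly add up. Assuming the WAV is 16-bit mono at 22050 Hz (the format passed to aplay later in this post), it holds 44,100 bytes per second of audio, so 2.1M is about 50 seconds of speech, and 50 seconds at 32 kbit/s comes to roughly 200 KB. The sketch below does the same arithmetic with Python's standard wave module; the function names are hypothetical, and the Opus figure is only an estimate because the encoder runs in variable bitrate mode.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the playing time of a WAV file, from its frame count and rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def opus_size_bytes(duration_seconds: float, bitrate_bps: int = 32_000) -> float:
    """Estimate the encoded size at a target bitrate (bits/s divided by 8)."""
    return duration_seconds * bitrate_bps / 8


if __name__ == "__main__":
    # About 50 seconds of speech at the 32k bitrate used in the ffmpeg command.
    print(opus_size_bytes(50))
```

To check a real file, call wav_duration_seconds("test.wav") and feed the result into opus_size_bytes.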
Generate and Play Speech
Instead of outputting the speech to a .wav file, it is possible to play the speech by sending the audio to stdout and piping it into an audio player. This took me a little while to get working, but in the end, I used aplay provided by (at least on Arch) the alsa-utils package.
The command is:
python3 -m piper --model es_MX-claude-high.onnx --input-file Don-Quixote-para1.txt --output-raw | aplay --format=S16_LE --rate=22050
The new arguments I used are:
--output-raw - Sends the speech to stdout
--format=S16_LE - Tells aplay to use the S16_LE format (signed 16-bit, little-endian)
--rate=22050 - Tells aplay to use a sample rate of 22050 Hz
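The sample rate has to match the voice, so rather than hard-coding 22050 you could read it from the voice's JSON config. This is a sketch resting on an assumption: that the config stores the rate under an "audio" object with a "sample_rate" key. Check your own .onnx.json file before relying on it. The function name aplay_args is hypothetical.

```python
import json


def aplay_args(config_path: str) -> list[str]:
    """Build aplay arguments using the sample rate found in a voice config."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: the rate lives at config["audio"]["sample_rate"].
    rate = config["audio"]["sample_rate"]
    return ["aplay", "--format=S16_LE", f"--rate={rate}"]
```

For a voice with a 22050 Hz rate this gives the same aplay arguments as the command above.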
Now you can listen to the speech as piper renders it.