How to Generate Text-to-Speech Locally

I recently looked at generating spoken audio from text using Google's text-to-speech service. I have since found a way to do the same thing locally, instead of sending all the data to Google.

The tool that I used is called piper1-gpl.

Install Piper

Use the Python package manager pip to get it installed:

pip install piper-tts

You may need to use your distro’s package manager to get pip installed if it’s not present.
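
On some distros the system Python is marked as externally managed and pip will refuse to install packages globally. If you run into that, one way around it (a sketch, nothing piper-specific) is to install piper-tts into a virtual environment instead:

python3 -m venv piper-venv
source piper-venv/bin/activate
pip install piper-tts

The piper-venv name is just an example; remember to activate the environment again in any new shell before running the piper commands below.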

Choose a Voice

Next, you need to choose a voice to read the text. The following command will print a list of available voices:

python3 -m piper.download_voices

This prints all the voices in the form language_REGION-name-quality (so es_MX-claude-high is Spanish, Mexico, the "claude" voice, high quality). I'm looking for Spanish with a Latin American accent; I know there are lots, but anything that isn't Porteño will do (sorry Argentina, but I'm British and we're still salty about the Hand of God), so a Mexican accent is perfect. The two options that the above command prints are as follows:

python3 -m piper.download_voices
...
es_MX-ald-medium
es_MX-claude-high
...
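
The full list is long, so it can be handy to filter it with ordinary shell tools. For example, assuming the listing goes to stdout as above, this narrows it down to the Spanish voices (the part before the underscore is the language code):

python3 -m piper.download_voices | grep '^es_'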

I’m going with claude, so I downloaded this voice:

python3 -m piper.download_voices es_MX-claude-high

This downloads a small model to the current directory. In this case, the model es_MX-claude-high.onnx is downloaded along with a JSON config file es_MX-claude-high.onnx.json. It’s only 64M.

Generate Speech To A File

Now you have everything you need to generate some speech. The following command takes a text file (the first paragraph of “Don Quixote” from Project Gutenberg) as input and writes the speech to a .wav audio file:

python3 -m piper --model es_MX-claude-high.onnx --input-file Don-Quixote-para1.txt --output-file test.wav

The flags I used are:

--model: the .onnx voice model downloaded in the previous step.
--input-file: the text file whose contents will be read aloud.
--output-file: the .wav file the audio is written to.

The WAV files get rather large, rather quickly. The test I used above was 139 words long and resulted in a 2.1M file. You can squash this into an .opus file with the following command:

ffmpeg -i <INFILE> -c:a libopus -b:a 32k -vbr on -compression_level 10 -threads 0 -ac 1 -application voip <OUTFILE>

This reduced it to 192K.
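
Those numbers are roughly what you'd expect: the model's raw output is 16-bit mono at 22050 Hz, which is about 44 kB per second, so the 2.1M WAV works out to around 48 seconds of speech, and 48 seconds at Opus's 32 kbit/s comes to roughly 190K. If you end up with a directory of WAV files, a small shell loop (a sketch reusing the exact ffmpeg invocation from above) will convert them all:

for f in *.wav; do
    ffmpeg -i "$f" -c:a libopus -b:a 32k -vbr on -compression_level 10 -threads 0 -ac 1 -application voip "${f%.wav}.opus"
done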

Generate and Play Speech

Instead of writing the speech to a .wav file, it is possible to play it directly by sending the audio to stdout and piping it into an audio player. This took me a little while to get working, but in the end I used aplay, which is provided (at least on Arch) by the alsa-utils package.

The command is:

python3 -m piper --model es_MX-claude-high.onnx --input-file Don-Quixote-para1.txt --output-raw | aplay --format=S16_LE --rate=22050

The new arguments I used are:

--output-raw: tells piper to write raw audio samples to stdout instead of a WAV file.
--format=S16_LE: tells aplay to expect signed 16-bit little-endian samples.
--rate=22050: tells aplay the sample rate, matching the 22050 Hz that the model produces.

Now you can listen to the speech as piper renders it.
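
Everything above drives piper through its command-line entry point, but it is also a Python library, so the same thing can be scripted. The sketch below assumes the Python API that piper1-gpl documents (a PiperVoice class with a load classmethod and a synthesize_wav method); older piper-tts releases used slightly different names, so check the README of the version you installed:

import wave

from piper import PiperVoice  # assumption: piper1-gpl package layout

# Load the voice downloaded earlier; the .onnx.json config sitting next to
# the model should be picked up automatically.
voice = PiperVoice.load("es_MX-claude-high.onnx")

# Read the input text and synthesize it straight into a WAV file.
with open("Don-Quixote-para1.txt", encoding="utf-8") as f:
    text = f.read()

with wave.open("test.wav", "wb") as wav_file:
    voice.synthesize_wav(text, wav_file)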