SRE Audiobook
Why
I wanted to listen to the SRE book while biking.
How
- I've already been using Edge's built-in text-to-speech feature to listen to articles of the book, and the quality of the text-to-speech service is good enough.
- My podcast player, Pocket Casts, has great UX and I'm used to it. It can be used to play the audiobook if I publish it as a podcast.
What
- A podcast feed that contains the audio files of the book generated by the text-to-speech (TTS) service.
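As a rough illustration of what such a feed could look like, here is a minimal sketch that builds an RSS 2.0 document with one `<enclosure>` per chapter. The function name `build_feed`, the tuple layout, and the URLs in the usage note are assumptions made for this example and are not taken from the project. A real podcast feed would probably need more metadata (for example a cover image and publication dates) before every player accepts it.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, episodes):
    """Build a minimal RSS 2.0 feed.

    episodes is a list of (episode_title, audio_url, size_in_bytes) tuples
    (a hypothetical layout chosen for this sketch).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for episode_title, audio_url, size in episodes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = episode_title
        # Use the audio URL as a stable identifier for the episode.
        ET.SubElement(item, "guid").text = audio_url
        # The enclosure is what a podcast player actually downloads.
        ET.SubElement(
            item, "enclosure", url=audio_url, length=str(size), type="audio/mpeg"
        )
    return ET.tostring(rss, encoding="unicode")
```

Calling `build_feed("SRE Audiobook", "https://example.com/", [("Preface", "https://example.com/preface.mp3", 1024)])` returns the feed as a string that can be written to a file and hosted next to the audio files.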
Challenges
- ▶️ TTS services guess breaks and tone if you just give them the text.
- ▶️ Google's TTS struggles with long sentences and requires more guidance. It failed to synthesize the audio for the Acknowledgments section of the book's preface because that section is just a list of names. This issue has been reported by others as well.
- ☑️ I switched to Microsoft's TTS service and it worked fine.
- ▶️ GCP's documentation for its TTS service is not as good as Azure's.
- ▶️ Google's TTS service is more forgiving when it comes to SSML. It can handle invalid SSML and still generate the audio file. Microsoft's TTS service is more strict and won't generate the audio file if the SSML is invalid.
- ▶️ TTS services have a length limit on what a single request can synthesize. They also offer long-audio APIs, but those don't support SSML, require storage setup, and don't return the audio file in the response.
- ☑️ The input needs to be split into smaller chunks and each chunk needs to be sent to the TTS service separately. This was covered in the `HTMLParser` code as well.
- ☑️ Chunks will be concatenated to generate the final audio file of each chapter.
- ▶️ Sometimes the TTS service fails to generate the audio file and returns an error. Since the script runs on the whole book, restarting the script regenerates the whole book from the beginning.
- ☑️ I cache the generated chunks and chapters based on the SHA-256 hash of the input text. If the script fails, I can restart it and it will skip the chunks that have already been generated.
- ▶️ Tables, images and graphs need to be converted to a format better suited to an audiobook.
- ⏹️ For tables, I can repeat the column names for each row.
- ⏹️ For images and graphs, I can generate a description with an image-to-text service and read that description in place of the image.
- ⏹️ The audiobook can point to the source URL of each image, graph and table.
- ▶️ Nonlinear elements like `<sub>` or footnotes need to be handled properly.
- ▶️ Headers can be used as bookmark sections in the audiobook.
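
The chunking, caching and concatenation steps described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the function names, the `max_chars` default, the `.mp3` extension, and the `synthesize` callable (standing in for whichever TTS client is used) are all hypothetical. It also joins chunks by plain byte concatenation, which is an assumption of this example; a real implementation may need an audio tool to join the files cleanly.

```python
import hashlib
import re
from pathlib import Path


def split_into_chunks(text, max_chars):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize_cached(chunk, synthesize, cache_dir):
    """Return audio bytes for a chunk, calling synthesize only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache key is the SHA-256 hash of the chunk text.
    key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.mp3"
    if not path.exists():
        path.write_bytes(synthesize(chunk))
    return path.read_bytes()


def synthesize_chapter(text, synthesize, cache_dir, max_chars=3000):
    """Synthesize a chapter chunk by chunk and join the audio bytes in order."""
    chunks = split_into_chunks(text, max_chars)
    return b"".join(synthesize_cached(c, synthesize, cache_dir) for c in chunks)
```

Because each chunk is written to the cache as soon as it is synthesized, rerunning `synthesize_chapter` after a failure only calls the TTS service for the chunks that are still missing.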