I want to convert a book to audio and save the file. Naturally, I don't want my computer to speak the book out loud while the conversion happens, but looking at the Azure documentation, I frankly don't see a way to get a stream object without speaking the text first. I've already got the code set up to save the file, but I can't save it unless I play the audio first. In short: I want to convert text to a stream object without having to listen to my computer utter the text. I realize a very inelegant workaround is to simply mute my computer, but suppose the conversion takes an hour and I need to take a phone call on it.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'

    # I've tried both of these; the second assignment overrides the first.
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

In the following line, I don't want to do this step, because it utters the audio:
    result = speech_synthesizer.speak_text_async("I'm excited to try text to speech").get()

But I have to do that step in order to run the following steps:
    stream = speechsdk.AudioDataStream(result)
    stream.save_to_wav_file(path)

I've looked through all the methods on the speech_synthesizer object, but every one of them involves speaking the text. They are listed here:
    class SpeechSynthesizer(builtins.object)
     |  SpeechSynthesizer(speech_config: azure.cognitiveservices.speech.SpeechConfig, audio_config: Optional[azure.cognitiveservices.speech.audio.AudioOutputConfig] = <azure.cognitiveservices.speech.audio.AudioOutputConfig object at 0x137ffc790>, auto_detect_source_language_config: azure.cognitiveservices.speech.languageconfig.AutoDetectSourceLanguageConfig = None)
     |
     |  A speech synthesizer.
     |
     |  :param speech_config: The config for the speech synthesizer
     |  :param audio_config: The config for the audio output.
     |      This parameter is optional.
     |      If it is not provided, the default speaker device will be used for audio output.
     |      If it is None, the output audio will be dropped.
     |      None can be used for scenarios like performance test.
     |  :param auto_detect_source_language_config: The auto detection source language config
     |
     |  Methods defined here:
     |
     |  __del__(self)
     |
     |  __init__(self, speech_config: azure.cognitiveservices.speech.SpeechConfig, audio_config: Optional[azure.cognitiveservices.speech.audio.AudioOutputConfig] = <azure.cognitiveservices.speech.audio.AudioOutputConfig object at 0x137ffc790>, auto_detect_source_language_config: azure.cognitiveservices.speech.languageconfig.AutoDetectSourceLanguageConfig = None)
     |      Initialize self. See help(type(self)) for accurate signature.
     |
     |  get_voices_async(self, locale: str = '') -> azure.cognitiveservices.speech.ResultFuture
     |      Get the available voices, asynchronously.
     |
     |      :param locale: Specify the locale of voices, in BCP-47 format; or leave it empty to get all available voices.
     |      :returns: A task representing the asynchronous operation that gets the voices.
     |
     |  speak_ssml(self, ssml: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
     |      Performs synthesis on ssml in a blocking (synchronous) mode.
     |
     |      :returns: A SpeechSynthesisResult.
     |
     |  speak_ssml_async(self, ssml: str) -> azure.cognitiveservices.speech.ResultFuture
     |      Performs synthesis on ssml in a non-blocking (asynchronous) mode.
     |
     |      :returns: A future with SpeechSynthesisResult.
     |
     |  speak_text(self, text: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
     |      Performs synthesis on plain text in a blocking (synchronous) mode.
     |
     |      :returns: A SpeechSynthesisResult.
     |
     |  speak_text_async(self, text: str) -> azure.cognitiveservices.speech.ResultFuture
     |      Performs synthesis on plain text in a non-blocking (asynchronous) mode.
     |
     |      :returns: A future with SpeechSynthesisResult.
     |
     |  start_speaking_ssml(self, ssml: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
     |      Starts synthesis on ssml in a blocking (synchronous) mode.
     |
     |      :returns: A SpeechSynthesisResult.
     |
     |  start_speaking_ssml_async(self, ssml: str) -> azure.cognitiveservices.speech.ResultFuture
     |      Starts synthesis on ssml in a non-blocking (asynchronous) mode.
     |
     |      :returns: A future with SpeechSynthesisResult.
     |
     |  start_speaking_text(self, text: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
     |      Starts synthesis on plain text in a blocking (synchronous) mode.
     |
     |      :returns: A SpeechSynthesisResult.
     |
     |  start_speaking_text_async(self, text: str) -> azure.cognitiveservices.speech.ResultFuture
     |      Starts synthesis on plain text in a non-blocking (asynchronous) mode.
     |
     |      :returns: A future with SpeechSynthesisResult.
     |
     |  stop_speaking(self) -> None
     |      Synchronously terminates ongoing synthesis operation.
     |      This method will stop playback and clear unread data in PullAudioOutputStream.
     |
     |  stop_speaking_async(self) -> azure.cognitiveservices.speech.ResultFuture
     |      Asynchronously terminates ongoing synthesis operation.
     |      This method will stop playback and clear unread data in PullAudioOutputStream.
     |
     |      :returns: A future that is fulfilled once synthesis has been stopped.
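For what it's worth, the docstring above does say that when audio_config is None "the output audio will be dropped", so my best guess at the flow I want is the sketch below, pieced together from my code above. I haven't been able to confirm that speak_text_async actually stays silent in that case, which is really what I'm asking:

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
    speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'

    # audio_config=None: per the docstring above, the output audio is dropped,
    # so (I assume) nothing should play through the speakers.
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

    # Synthesis still runs, but the audio should only exist in memory.
    result = speech_synthesizer.speak_text_async("I'm excited to try text to speech").get()

    # Wrap the in-memory result and write it out as a WAV file.
    stream = speechsdk.AudioDataStream(result)
    stream.save_to_wav_file(path)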
UPDATE

Someone recommended using a synthesize_speech_to_stream_async method, but his code resulted in errors and I haven't heard back from him. I still think he might be on to something.
His code was:
    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
    speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'
    stream = speechsdk.AudioDataStream(format=speechsdk.AudioStreamFormat(pcm_data_format=speechsdk.PcmDataFormat.Pcm16Bit, sample_rate_hertz=16000, channel_count=1))
    result = speechsdk.SpeechSynthesizer(speech_config=speech_config).synthesize_speech_to_stream_async("I'm excited to try text to speech", stream).get()
    stream.save_to_wav_file(path)

This generated an error at:
    stream = speechsdk.AudioDataStream(
        format=speechsdk.AudioStreamFormat(
            pcm_data_format=speechsdk.PcmDataFormat.Pcm16Bit,
            sample_rate_hertz=16000,
            channel_count=1))

which recommended:
    stream = speechsdk.AudioDataStream(
        format=speechsdk.AudioStreamWaveFormat(
            pcm_data_format=speechsdk.PcmDataFormat.Pcm16Bit,
            sample_rate_hertz=16000,
            channel_count=1))

But that generated:
    AttributeError: module 'azure.cognitiveservices.speech' has no attribute 'PcmDataFormat'
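In case it helps in framing an answer: the other route I'm aware of is skipping the in-memory stream entirely and pointing the synthesizer's output at a file. My understanding (unverified) is that AudioOutputConfig accepts a filename parameter for exactly this, so something like the following sketch, where 'book.wav' is just a placeholder path:

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
    speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'

    # Route synthesis output to a WAV file instead of the default speaker.
    file_config = speechsdk.audio.AudioOutputConfig(filename='book.wav')
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=file_config)

    # If this works the way I hope, the audio lands in book.wav
    # and nothing is sent to the speakers.
    result = speech_synthesizer.speak_text_async("I'm excited to try text to speech").get()

There's also speechsdk.audio.PullAudioOutputStream, which the stop_speaking docstring above mentions, but I haven't tried wiring that up yet.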