I'm trying to stream audio bytes into a shared buffer and pass it through a transcription model. The audio comes from a websocket, sampled at 8 kHz and mu-law encoded. I've managed to play a few seconds of audio back to myself fine if I stream into separate audio buffers (`ibuffer` and `obuffer`) for the inbound and outbound tracks, but if I collect everything into a shared buffer the audio is really slow and delayed. Here is an extract from my testing code:
obuffer = b""ibuffer = b""shared = b""while True: data = await queue.get() if data["event"] == "media": websocket_payload = data["media"]["payload"] chunk = audioop.ulaw2lin(base64.b64decode(websocket_payload), 2) if data["media"]["track"] == INBOUND: obuffer += chunk if data["media"]["track"] == OUTBOUND: ibuffer += chunk shared += chunk
I've been testing by pickling `obuffer`, `ibuffer`, and `shared` this way, then loading them back, saving them as `.wav` files, and playing them on my machine. The separate buffers play fine, and they can even be merged by simply averaging them (roughly as sketched below), which also plays fine - but collecting them into a shared buffer doesn't produce the same quality of audio. The result is quite far off from the original, and I've tried different sample rates up to 16 kHz. Does anyone have any idea what to do here?
It's strange because Twilio's own documentation says you can do this no problem.
```python
import pickle
import wave

with open("all_bytes.pkl", "rb") as f:
    loaded_audio_bytes = pickle.load(f)

nchannels = 1
sampwidth = 2
framerate = 8000
nframes = len(loaded_audio_bytes) // (nchannels * sampwidth)

with wave.open("wav.wav", 'wb') as wf:
    wf.setnchannels(nchannels)
    wf.setsampwidth(sampwidth)
    wf.setframerate(framerate)
    wf.setnframes(nframes)
    wf.writeframes(loaded_audio_bytes)
```
This answer suggests just using the outbound track on its own, but I need both tracks here!