I'm trying to read a FASTQ file directly into a pandas DataFrame, similar to the question linked below:
Read FASTQ file into a Spark dataframe
I've searched all over, but just can't find a viable option.
Currently, I'm running the following:
import subprocess
from io import StringIO

import pandas as pd

cmd = f'zcat {infile} | paste - - - -'
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
b = StringIO(p.communicate()[0].decode('utf-8'))
_ = pd.read_csv(b, sep='\t', names=['read_id', 'seq', '+', 'qual'],
                on_bad_lines='skip', dtype=str, chunksize=1000000)
Is there a cleaner way to just use pandas instead? I was thinking of setting sep='\n', but then I just get one row with multiple columns. Could I maybe read the file in and then take every 4th row to build the 4 needed columns (or something like that)?
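For what it's worth, here is a minimal sketch of that "every 4th row" idea on a small in-memory example. The trick is to read one FASTQ line per row using a separator character that can never occur in the file (I'm assuming '\x01' here, since FASTQ is printable ASCII), then reshape the resulting column into 4 columns with NumPy. The sample data and variable names are made up for illustration:

```python
import io

import pandas as pd

# Two toy FASTQ records (4 lines each), standing in for a real file handle
fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nJJJJ\n"

# '\x01' never appears in printable-ASCII FASTQ, so each line becomes one field
lines = pd.read_csv(io.StringIO(fastq_text), sep='\x01', header=None, dtype=str)

# Reshape the single column row-major: every 4 consecutive lines -> 1 record
df = pd.DataFrame(lines.to_numpy().reshape(-1, 4),
                  columns=['read_id', 'seq', '+', 'qual'])
```

This keeps everything inside pandas/NumPy, but note it still needs all lines in memory at once unless combined with chunked reading.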
Speed is my main concern, so the fastest solution would be best.
Side note: my FASTQ files will not fit in memory, so I will have to read them in chunks.
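To illustrate the chunking constraint without shelling out to zcat/paste, here is a hedged sketch of a generator that reads a gzipped FASTQ a fixed number of records at a time and yields one DataFrame per chunk. The function name and chunk size are my own; whether this beats the subprocess pipeline on speed would need benchmarking:

```python
import gzip
import itertools

import pandas as pd

def fastq_to_df_chunks(infile, records_per_chunk=1_000_000):
    """Yield DataFrames of FASTQ records, reading the file lazily in chunks."""
    with gzip.open(infile, 'rt') as fh:
        while True:
            # One record = 4 lines, so pull 4 * records_per_chunk lines at a time
            lines = list(itertools.islice(fh, 4 * records_per_chunk))
            if not lines:
                break
            # zip-ing the same iterator 4 times groups consecutive lines in fours
            it = (line.rstrip('\n') for line in lines)
            yield pd.DataFrame(zip(it, it, it, it),
                               columns=['read_id', 'seq', '+', 'qual'])
```

Because islice only materializes one chunk of lines at a time, peak memory stays bounded by records_per_chunk rather than the file size.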