I have a .dat file that I need to parse using Spark or Python.
The row I need to remove from the DataFrame is the pipe-separated (|) row. When I add an option to skip rows, or to use the first row as the header, the pipe row still comes back as either the first or second row, with default _c0-style column names as the headers.
import pandas as pd
df = spark.read.format("csv").option("skipRows", 100).load("file/path.dat.gz")
display(df)
The output of the above still contains the pipe-separated row, which looks like this:
FileHeader|^2024-01-10 11:00:00|^2.00|^APR|^021|^CRR|^2024-01-10 00:00:00|^2024-01-10 23:59:59|^NODE
The second row is the one that should be used as the table header. It is comma-separated, in the following format:
@column1,column2.Value,column3.Value,column4.Value,column5.Value
How do I remove the pipe-separated row and have the second row used as the header?
I have tried the following to skip the first row within the .dat.gz file:
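To illustrate the result I am after, here is a minimal sketch in plain Python using only the standard library. It assumes the file is gzip-compressed UTF-8 text, that exactly one pipe-delimited line precedes the comma-separated header, and that everything after the header is comma-separated data. The function name `read_dat_gz` is hypothetical and only used for illustration.

```python
import csv
import gzip


def read_dat_gz(path):
    """Return (header, rows) from a gzipped text file.

    Assumes line 1 is the pipe-delimited file header (discarded) and
    line 2 holds the real, comma-separated column names.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        next(handle, None)           # discard the pipe-delimited first line
        reader = csv.reader(handle)  # remaining lines are comma-separated
        header = next(reader, [])    # second line of the file: column names
        rows = list(reader)          # everything after that is data
    return header, rows
```

Calling `read_dat_gz("/path/to/file.dat.gz")` should give the column names from the second line and the data rows as lists of strings. I would like the equivalent behaviour in a Spark DataFrame.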
df = spark.read.format("csv").option("skipRows", 1).load("<filepath>")
I have also tried the following to stop the first row being used as the header, but to no avail, as the pipe-delimited row still ends up in the DataFrame:
diamonds = (
    spark.read.format("csv")
    .option("header", "false")
    .option("inferSchema", "true")
    .option("sep", "|")
    .load("/path/to/file.dat.gz")
)
# display(diamonds)