Channel: Active questions tagged python - Stack Overflow

Databricks: how to remove a pipe-separated row from a .dat file and output to a dataframe

I have a .dat file that I need to parse using Spark or Python.

The row that needs removing from the dataframe is the pipe-separated (|) row. When I add the option to skip rows, or to treat the first row as the header, the pipe row still comes back as either the first or second row, with default column names such as _c0 as the headers.

    import pandas as pd

    df = spark.read.format("csv").option("skipRows", 100).load("file/path.dat.gz")
    display(df)

The output of the above still contains the following row:

FileHeader|^2024-01-10 11:00:00|^2.00|^APR|^021|^CRR|^2024-01-10 00:00:00|^2024-01-10 23:59:59|^NODE

The second row is the one that should be used as the table header. It is comma-separated, in the following format:

@column1,column2.Value,column3.Value,column4.Value,column5.Value

How do I remove the pipe-separated row and have the second row used as the header?

I have tried the following to skip the first row of the .dat.gz file:

 df = spark.read.format("csv").option("skipRows", 1).load("<filepath>")

I have also tried the following, with the header option turned off and the pipe as separator, but to no avail: the pipe-delimited row still ends up in the dataframe.

    diamonds = (spark.read
          .format("csv")
          .option("header", "false")
          .option("inferSchema", "true")
          .option("sep", "|")
          .load("/path/to/file.dat.gz"))
    # display(diamonds)

Output of the above: (image not included)
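For reference, one way to get the layout described above (drop the pipe-separated first line, then treat the comma-separated second line as the header) can be sketched in plain Python with the standard library, outside of Spark. The function name `read_dat_skipping_file_header` and its `skip` parameter are made up for illustration, and the sketch assumes the file is gzip-compressed, UTF-8 encoded, and has exactly one FileHeader line before the real header.

```python
import csv
import gzip


def read_dat_skipping_file_header(path, skip=1, encoding="utf-8"):
    """Return (header, rows) from a gzipped file, ignoring the first `skip` lines.

    Hypothetical helper: assumes the lines after the skipped ones are
    ordinary comma-separated values with the column names on the first line.
    """
    with gzip.open(path, mode="rt", encoding=encoding, newline="") as handle:
        for _ in range(skip):
            next(handle, None)  # discard the pipe-separated FileHeader line
        reader = csv.reader(handle)
        header = next(reader, [])  # e.g. ['@column1', 'column2.Value', ...]
        rows = [row for row in reader if row]  # drop completely empty lines
    return header, rows
```

Calling `header, rows = read_dat_skipping_file_header("/path/to/file.dat.gz")` gives the column names and the data rows as lists of strings. If pandas is acceptable, `pd.read_csv(path, skiprows=1)` should do the same in one call, since pandas infers gzip compression from the `.gz` extension. Passing the result on to `spark.createDataFrame(rows, schema=header)` is untested here, and column names containing a dot (such as `column2.Value`) would probably need backtick quoting or renaming in Spark.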

