Channel: Active questions tagged python - Stack Overflow

Pyspark to_timestamp date format parsing error


In my pyspark dataframe, there is a column CallDate with datatype as string containing the values shown below:

2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00

I am trying to convert this column from string to timestamp using pyspark.sql.functions.to_timestamp().

When I run this code:

df.withColumn('IncidentDate', to_timestamp(col('CallDate'), 'yyyy/MM/dd')).select('CallDate', 'IncidentDate').show()

I get this output:

+-------------------+------------+
|           CallDate|IncidentDate|
+-------------------+------------+
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
+-------------------+------------+

I believe the NULL values appear because the format I specified (yyyy/MM/dd) does not match the actual date strings, and since no match is found, NULL is returned.

But when I run this code:

df.withColumn('IncidentDate', to_timestamp(col('CallDate'), 'yyyy-MM-dd')).select('CallDate', 'IncidentDate').show()

I get this error:

Caused by: org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0: Fail to parse '2008-04-01T00:00:00' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
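For reference, the exception itself names two escape hatches. A minimal sketch of the first one, assuming an active SparkSession bound to the name spark (not shown in my code above):

```python
# Config fragment: restore the lenient pre-3.0 SimpleDateFormat-based
# parsing, as the SparkUpgradeException message suggests.
# 'spark' is assumed to be an already-created SparkSession.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```

But I am not asking how to work around the error; I want to understand the behavior.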

I know the correct parse format should be "yyyy-MM-dd'T'HH:mm:ss" as shown below:

df.withColumn('IncidentDate', to_timestamp(col('CallDate'), "yyyy-MM-dd'T'HH:mm:ss")).select('CallDate', 'IncidentDate').show()

But my question is: why does Spark return NULL values when I give the parse format as yyyy/MM/dd, but throw an error when I give it as yyyy-MM-dd?
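To show what I mean by the two failure modes, here is an analogy in plain Python using datetime.strptime (this is only an illustration of strict parsing, not Spark's actual parser). A format whose literal separators never match fails outright, while a format that matches only a prefix of the string fails differently, because 'T00:00:00' is left unconsumed:

```python
from datetime import datetime

value = "2008-04-01T00:00:00"

# Separators don't match at all ('/' vs '-'): parsing fails outright.
try:
    datetime.strptime(value, "%Y/%m/%d")
except ValueError as e:
    print("slash format:", e)

# The pattern matches the '2008-04-01' prefix, but 'T00:00:00' is left
# over. A strict parser rejects this too, whereas a lenient parser
# would accept the prefix and ignore the rest.
try:
    datetime.strptime(value, "%Y-%m-%d")
except ValueError as e:
    print("dash format:", e)

# The full pattern consumes the whole string and succeeds.
print(datetime.strptime(value, "%Y-%m-%dT%H:%M:%S"))
```

My guess is that something similar is happening in Spark, where the dash format would have succeeded under the old lenient parser but fails under the new strict one, hence the upgrade exception instead of a plain NULL.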





