Channel: Active questions tagged python - Stack Overflow

Python Databricks Dataframe join filtering records unexpectedly


In Azure Databricks I have the following tables:

[File_Processing_History]
Id              bigint
Client          varchar(255)
FileName        varchar(255)
FileType        varchar(3)
EventType       varchar(100)
EventContext    varchar(255)
Checksum        char(40)
OccurredOn      timestamp
Status          varchar(50)
Retry           boolean

[Raw_Data]
XYZFileKey      string
FileName        string
FileContent     string

The File_Processing_History table is a historical table that tracks the events that happen to files in the system. The Raw_Data table contains raw data extracted from files in our system, along with metadata such as the source file name and the file contents (data). Both tables have a FileName column that contains the names of files on the system.

I also have the following Python/Spark code in an Azure Databricks Notebook:

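# Imports the snippet relies on (not shown in the original post)
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, max

file_processing_df = spark.table("myapp.control.File_Processing_History")

# Attach each file's most recent OccurredOn to every one of its history rows
latest_records_df = file_processing_df.join(
    file_processing_df.groupBy("FileName").agg(max("OccurredOn").alias("MaxOccurredOn")),
    on=["FileName"],
    how="inner"
)

# Keep only "Loaded" rows for the requested Client and FileType
result_df = latest_records_df.filter(
    (col("Client") == Client) &
    (col("FileType") == fileType) &
    (col("Status") == "Loaded")
)

# Rank each file's rows from newest to oldest
windowSpec = Window.partitionBy("FileName").orderBy(col("OccurredOn").desc())
result_df = result_df.withColumn("row_number", F.row_number().over(windowSpec))

# Keep the single newest row per file, and only if it is the file's latest event
filtered_result_df = result_df.filter((col("OccurredOn") == col("MaxOccurredOn")) & (col("row_number") == 1))
filtered_result_df = filtered_result_df.drop("row_number")
filtered_result_df.show()

FileNames = filtered_result_df.select("FileName").distinct()
FileNames.show()

# Join the raw data to the selected file names
rawDF = spark.table(f"`usurint-mvp`.raw.Raw_Data")
filtered_rawDF = rawDF.join(FileNames, on="FileName", how="inner")
filtered_rawDF.show()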

This code is attempting to:

  1. Find all the File_Processing_History rows for the given Client + FileType that not only have the latest OccurredOn date, but also have a Status of "Loaded" (if any exist), and save these rows to a dataframe (called filtered_result_df)
  2. Find all the distinct FileNames in that dataframe and store them in a FileNames dataframe
  3. Select the rows from the Raw_Data table whose FileName value matches one of the values in the FileNames dataframe
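For reference, the intent of the three steps above can be sketched in plain Python, without Spark. This is only an illustration of the expected logic, not the notebook code: the rows, the file names (a.csv, b.csv), the client and file type values, and both helper functions are hypothetical.

```python
from datetime import datetime

# Hypothetical history rows; every value here is made up for illustration.
history = [
    {"FileName": "a.csv", "Client": "acme", "FileType": "csv",
     "Status": "Received", "OccurredOn": datetime(2024, 4, 9, 10, 0)},
    {"FileName": "a.csv", "Client": "acme", "FileType": "csv",
     "Status": "Loaded", "OccurredOn": datetime(2024, 4, 9, 15, 38)},
    {"FileName": "b.csv", "Client": "acme", "FileType": "csv",
     "Status": "Loaded", "OccurredOn": datetime(2024, 4, 8, 9, 0)},
    {"FileName": "b.csv", "Client": "acme", "FileType": "csv",
     "Status": "Failed", "OccurredOn": datetime(2024, 4, 9, 9, 0)},
]

# Hypothetical raw rows keyed by the same FileName values.
raw_data = [
    {"FileName": "a.csv", "XYZFileKey": "k1", "FileContent": "..."},
    {"FileName": "b.csv", "XYZFileKey": "k2", "FileContent": "..."},
]


def latest_loaded_file_names(rows, client, file_type):
    """Steps 1 and 2: names of files whose latest event is a 'Loaded' row."""
    # Latest OccurredOn per FileName, computed over all history rows.
    latest = {}
    for row in rows:
        name = row["FileName"]
        if name not in latest or row["OccurredOn"] > latest[name]:
            latest[name] = row["OccurredOn"]
    # A set gives the distinct file names directly.
    return {
        row["FileName"]
        for row in rows
        if row["Client"] == client
        and row["FileType"] == file_type
        and row["Status"] == "Loaded"
        and row["OccurredOn"] == latest[row["FileName"]]
    }


def matching_raw_rows(raw_rows, file_names):
    """Step 3: raw rows whose FileName is exactly equal to a selected name."""
    return [row for row in raw_rows if row["FileName"] in file_names]


file_names = latest_loaded_file_names(history, "acme", "csv")
matched = matching_raw_rows(raw_data, file_names)
```

In this made-up data, b.csv is excluded because its most recent event is not "Loaded", while a.csv is kept and then matched against the raw rows by exact string equality on FileName.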

When this runs, I get the following console output from the three show() calls:

+--------------------+---+-------+--------+--------------+--------------------+--------+--------------------+------+-----+--------------------+
|            FileName| Id| Tenant|FileType|     EventType|        EventContext|Checksum|          OccurredOn|Status|Retry|       MaxOccurredOn|
+--------------------+---+-------+--------+--------------+--------------------+--------+--------------------+------+-----+--------------------+
|z823021-01-04f42s...|175|acme-1|     123|FILE_DECRYPTED|Things were done and...|     N/A|2024-04-09 15:38:...|Loaded|false|2024-04-09 15:38:...|
+--------------------+---+-------+--------+--------------+--------------------+--------+--------------------+------+-----+--------------------+

+--------------------+
|            FileName|
+--------------------+
|z823021-01-04f42s...|
+--------------------+

+--------------+----------+--------------+
|      FileName|XYZFileKey|SourceFileData|
+--------------+----------+--------------+
+--------------+----------+--------------+

Since z823021-01-04f42s... exists in both filtered_result_df and FileNames, I would have expected the filtered_rawDF to be non-empty. Can anyone spot where I'm going awry?
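One thing worth ruling out, without claiming it is the cause here: an inner join on FileName only matches values that are exactly equal, and show() truncates long strings by default, so two names that look identical in the output can still differ. The sketch below is plain Python over two collections of key strings. The helper name diagnose_key_mismatch and the sample values are hypothetical, and the normalization (trimming whitespace and ignoring case) is just one assumption about how near-misses might look.

```python
def diagnose_key_mismatch(left_keys, right_keys):
    """Return (exact, near): exact matches, and left keys that only match
    a right key after trimming whitespace and ignoring case."""
    left, right = set(left_keys), set(right_keys)
    exact = left & right

    # Index the right-hand keys by their normalized form.
    normalized_right = {}
    for key in right:
        normalized_right.setdefault(key.strip().casefold(), []).append(key)

    # For each unmatched left key, look for right keys that differ only
    # by surrounding whitespace or letter case.
    near = {}
    for key in left - exact:
        candidates = normalized_right.get(key.strip().casefold())
        if candidates:
            near[key] = sorted(candidates)
    return exact, near


# Hypothetical values: the second list has a trailing space on the name.
history_names = ["report-01.txt"]
raw_names = ["report-01.txt ", "other.txt"]
exact, near = diagnose_key_mismatch(history_names, raw_names)
```

In the notebook, the two inputs would be the distinct FileName values collected from FileNames and from rawDF. Calling show(truncate=False) on each DataFrame also prints the full, untruncated values for a direct visual comparison.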

