In Azure Databricks I have the following tables:
    [File_Processing_History]
    Id            bigint
    Client        varchar(255)
    FileName      varchar(255)
    FileType      varchar(3)
    EventType     varchar(100)
    EventContext  varchar(255)
    Checksum      char(40)
    OccurredOn    timestamp
    Status        varchar(50)
    Retry         boolean

    [Raw_Data]
    XYZFileKey    string
    FileName      string
    FileContent   string

The `File_Processing_History` table is a historical table that tracks the events that happen to files in the system. The `Raw_Data` table contains raw data extracted from files in our system, along with metadata such as the source file name and contents (data). Both tables have a `FileName` column that contains the names of files on the system.
I also have the following Python/Spark code in an Azure Databricks Notebook:
    from pyspark.sql import Window
    from pyspark.sql import functions as F
    from pyspark.sql.functions import col, max

    # Client and fileType are notebook variables set elsewhere (not shown here)
    file_processing_df = spark.table("myapp.control.File_Processing_History")

    # Attach the latest OccurredOn per FileName to every history row
    latest_records_df = file_processing_df.join(
        file_processing_df.groupBy("FileName").agg(max("OccurredOn").alias("MaxOccurredOn")),
        on=["FileName"],
        how="inner"
    )

    # Keep only "Loaded" rows for the given client and file type
    result_df = latest_records_df.filter(
        (col("Client") == Client) &
        (col("FileType") == fileType) &
        (col("Status") == "Loaded")
    )

    # Keep a single row per FileName, and only if it is the latest event
    windowSpec = Window.partitionBy("FileName").orderBy(col("OccurredOn").desc())
    result_df = result_df.withColumn("row_number", F.row_number().over(windowSpec))
    filtered_result_df = result_df.filter(
        (col("OccurredOn") == col("MaxOccurredOn")) & (col("row_number") == 1)
    )
    filtered_result_df = filtered_result_df.drop("row_number")
    filtered_result_df.show()

    FileNames = filtered_result_df.select("FileName").distinct()
    FileNames.show()

    # Join the raw data to the selected file names
    rawDF = spark.table(f"`usurint-mvp`.raw.Raw_Data")
    filtered_rawDF = rawDF.join(FileNames, on="FileName", how="inner")
    filtered_rawDF.show()

This code is attempting to:
- Find all the `File_Processing_History` rows for the given Client + FileType that not only have the latest `OccurredOn` date, but also have a Status of "Loaded" (if any exist), and save these rows to a dataframe (called `filtered_result_df`)
- Find all the distinct `FileName` values in that dataframe and store them in a `FileNames` dataframe
- Take all rows from the `FileNames` dataframe and select any rows from the `Raw_Data` table whose `FileName` value matches one of the values in `FileNames`
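To make the intent of those three steps concrete, here is a minimal plain-Python sketch of the same logic on a handful of made-up rows. It is not the Spark code and does not reproduce the problem; the sample data, the values `"acme"` and `"csv"`, and the helper names `latest_loaded_file_names` and `matching_raw_rows` are all hypothetical and exist only for illustration.

```python
from datetime import datetime

# Hypothetical sample rows; only the columns the logic needs are included.
history = [
    {"FileName": "a.csv", "Client": "acme", "FileType": "csv",
     "Status": "Received", "OccurredOn": datetime(2024, 4, 9, 15, 0)},
    {"FileName": "a.csv", "Client": "acme", "FileType": "csv",
     "Status": "Loaded", "OccurredOn": datetime(2024, 4, 9, 15, 38)},
    {"FileName": "b.csv", "Client": "acme", "FileType": "csv",
     "Status": "Loaded", "OccurredOn": datetime(2024, 4, 9, 14, 0)},
    {"FileName": "b.csv", "Client": "acme", "FileType": "csv",
     "Status": "Failed", "OccurredOn": datetime(2024, 4, 9, 16, 0)},
]
raw_data = [
    {"FileName": "a.csv", "XYZFileKey": "k1", "FileContent": "..."},
    {"FileName": "b.csv", "XYZFileKey": "k2", "FileContent": "..."},
]


def latest_loaded_file_names(history_rows, client, file_type):
    """Steps 1 and 2: names of files whose latest event is 'Loaded'."""
    latest = {}
    for row in history_rows:
        current = latest.get(row["FileName"])
        # Keep only the most recent event per FileName
        if current is None or row["OccurredOn"] > current["OccurredOn"]:
            latest[row["FileName"]] = row
    return {
        name
        for name, row in latest.items()
        if row["Client"] == client
        and row["FileType"] == file_type
        and row["Status"] == "Loaded"
    }


def matching_raw_rows(raw_rows, file_names):
    """Step 3: raw rows whose FileName is exactly one of the selected names."""
    return [row for row in raw_rows if row["FileName"] in file_names]


file_names = latest_loaded_file_names(history, "acme", "csv")
matched = matching_raw_rows(raw_data, file_names)
print(file_names)
print(matched)
```

In this toy data, `b.csv` was loaded at one point but its latest event is a failure, so it is meant to be excluded, while `a.csv` is kept and its raw row is returned. That is the behaviour I am after in the Spark version.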
When the notebook code runs, I get the following console output from the `show()` calls:
    +--------------------+---+-------+--------+--------------+--------------------+--------+--------------------+------+-----+--------------------+
    |            FileName| Id| Tenant|FileType|     EventType|        EventContext|Checksum|          OccurredOn|Status|Retry|       MaxOccurredOn|
    +--------------------+---+-------+--------+--------------+--------------------+--------+--------------------+------+-----+--------------------+
    |z823021-01-04f42s...|175| acme-1|     123|FILE_DECRYPTED|Things were done and...|     N/A|2024-04-09 15:38:...|Loaded|false|2024-04-09 15:38:...|
    +--------------------+---+-------+--------+--------------+--------------------+--------+--------------------+------+-----+--------------------+

    +--------------------+
    |            FileName|
    +--------------------+
    |z823021-01-04f42s...|
    +--------------------+

    +--------------+----------+--------------+
    |      FileName|XYZFileKey|SourceFileData|
    +--------------+----------+--------------+
    +--------------+----------+--------------+

Since `z823021-01-04f42s...` exists in both `filtered_result_df` and `FileNames`, I would have expected `filtered_rawDF` to be non-empty. Can anyone spot where I'm going awry?