Python Databricks Dataframe join filtering records unexpectedly

In Azure Databricks I have the following tables:

[File_Processing_History]
Id            bigint
Client        varchar(255)
FileName      varchar(255)
FileType      varchar(3)
EventType     varchar(100)
EventContext  varchar(255)
Checksum      char(40)
OccurredOn    timestamp
Status        varchar(50)
Retry         boolean

[Raw_Data]
XYZFileKey    string
FileName      string
FileContent   string

The File_Processing_History table is a historical table that tracks the events that happen to files in the system. The Raw_Data table contains raw data extracted from files in our system and has metadata such as the source file name and contents (data). Both tables have a FileName column that contains the names of files on the system.

I also have the following Python/Spark code in an Azure Databricks Notebook:

# Imports the snippet relies on
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, max

file_processing_df = spark.table("myapp.control.File_Processing_History")

# Attach the latest OccurredOn per FileName to every history row
latest_records_df = file_processing_df.join(
    file_processing_df.groupBy("FileName").agg(max("OccurredOn").alias("MaxOccurredOn")),
    on=["FileName"],
    how="inner"
)

result_df = latest_records_df.filter(
    (col("Client") == Client) &
    (col("FileType") == fileType) &
    (col("Status") == "Loaded")
)

# Keep only the most recent row per FileName
windowSpec = Window.partitionBy("FileName").orderBy(col("OccurredOn").desc())
result_df = result_df.withColumn("row_number", F.row_number().over(windowSpec))
filtered_result_df = result_df.filter((col("OccurredOn") == col("MaxOccurredOn")) & (col("row_number") == 1))
filtered_result_df = filtered_result_df.drop("row_number")
filtered_result_df.show()

FileNames = filtered_result_df.select("FileName").distinct()
FileNames.show()

rawDF = spark.table(f"`usurint-mvp`.raw.Raw_Data")
filtered_rawDF = rawDF.join(FileNames, on="FileName", how="inner")
filtered_rawDF.show()

This code is attempting to:

1. Find all the File_Processing_History rows for the given Client + FileType that not only have the latest OccurredOn date but also have a Status of "Loaded" (if any exist), and save these rows to a dataframe called filtered_result_df.
2. Find all the distinct FileNames in that dataframe and store them in a FileNames dataframe.
3. Select any rows from the Raw_Data table whose FileName value matches one of the values in FileNames.
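
For reference, the intent of the three steps listed above can be sketched in plain Python over small in-memory rows, independent of Spark. The sample rows, the client and file type values, and the helper names `latest_loaded_file_names` and `rows_for_files` are all made up for illustration; this is a sketch of the intended logic, not the notebook code.

```python
from datetime import datetime


def latest_loaded_file_names(history, client, file_type):
    """First two steps: names of files whose most recent event is 'Loaded'."""
    latest = {}
    for row in history:
        current = latest.get(row["FileName"])
        # Keep the row with the greatest OccurredOn for each FileName
        if current is None or row["OccurredOn"] > current["OccurredOn"]:
            latest[row["FileName"]] = row
    return {
        name
        for name, row in latest.items()
        if row["Client"] == client
        and row["FileType"] == file_type
        and row["Status"] == "Loaded"
    }


def rows_for_files(raw_rows, file_names):
    """Third step: keep raw rows whose FileName is exactly one of file_names."""
    return [row for row in raw_rows if row["FileName"] in file_names]


def event(name, status, hour, minute):
    # Hypothetical history row; client and file type are fixed for brevity
    return {
        "FileName": name,
        "Client": "acme",
        "FileType": "123",
        "Status": status,
        "OccurredOn": datetime(2024, 4, 9, hour, minute),
    }


history = [
    event("a.csv", "Received", 15, 0),
    event("a.csv", "Loaded", 15, 38),
    event("b.csv", "Loaded", 14, 0),
    event("b.csv", "Failed", 16, 0),  # latest event for b.csv is not "Loaded"
]
raw_rows = [
    {"FileName": "a.csv", "XYZFileKey": "k1", "FileContent": "..."},
    {"FileName": "b.csv", "XYZFileKey": "k2", "FileContent": "..."},
]

names = latest_loaded_file_names(history, "acme", "123")
matched = rows_for_files(raw_rows, names)
print(names, [row["XYZFileKey"] for row in matched])
```

With these sample rows only a.csv qualifies, because the latest event for b.csv has a Status other than "Loaded".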

When this runs I get the following console output from the show() calls:

+--------------------+---+-------+--------+--------------+--------------------+--------+--------------------+------+-----+--------------------+
|            FileName| Id| Tenant|FileType|     EventType|        EventContext|Checksum|          OccurredOn|Status|Retry|       MaxOccurredOn|
+--------------------+---+-------+--------+--------------+--------------------+--------+--------------------+------+-----+--------------------+
|z823021-01-04f42s...|175|acme-1|     123|FILE_DECRYPTED|Things were done and...|     N/A|2024-04-09 15:38:...|Loaded|false|2024-04-09 15:38:...|
+--------------------+---+-------+--------+--------------+--------------------+--------+--------------------+------+-----+--------------------+

+--------------------+
|            FileName|
+--------------------+
|z823021-01-04f42s...|
+--------------------+

+--------------+----------+--------------+
|      FileName|XYZFileKey|SourceFileData|
+--------------+----------+--------------+
+--------------+----------+--------------+

Since z823021-01-04f42s... exists in both filtered_result_df and FileNames, I would have expected the filtered_rawDF to be non-empty. Can anyone spot where I'm going awry?
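
One thing that may be worth ruling out: `show()` truncates string values to 20 characters by default, so two names that print identically can still differ further along, or differ only by whitespace or letter case, and an inner join matches exact values only. Below is a minimal sketch of such a check in plain Python, assuming the FileName values from both sides have been collected into lists. The helper `near_misses` and the sample names are hypothetical, and this is only one possible cause.

```python
def near_misses(left_names, right_names):
    """Pair up names that differ exactly but match after strip + lowercase."""

    def normalize(name):
        return name.strip().lower()

    # Index the right-hand names by their normalized form
    right_by_norm = {}
    for name in right_names:
        right_by_norm.setdefault(normalize(name), []).append(name)

    exact = set(right_names)
    pairs = []
    for name in left_names:
        if name in exact:
            continue  # exact match, so an inner join would keep it
        for candidate in right_by_norm.get(normalize(name), []):
            pairs.append((name, candidate))
    return pairs


# Hypothetical sample names: the second list has a trailing space on one value
history_names = ["sample-file.csv"]
raw_names = ["sample-file.csv ", "Other.csv"]
print(near_misses(history_names, raw_names))
```

In the notebook, `FileNames.show(truncate=False)` prints the full values, which makes this kind of difference easier to see.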
