After mounting my data lake in Databricks, I ran into an issue when trying to load all JSON files into a DataFrame using a *.json glob. When I removed the file extension from the pattern, the load succeeded.
Not working:
df = spark.read.option("recursiveFileLookup", "true") \
    .json("/mnt/adls_gen/prod/**/*.json")
Executing the code above produces the following error:
[PATH_NOT_FOUND] Path does not exist: dbfs:/mnt/adls_gen/prod/**/*.json.
Working:
df = spark.read.option("recursiveFileLookup", "true") \
    .json("/mnt/adls_gen/prod/**/*")
However, this also reads files with other extensions, such as *.json_old and *.txt.
I'm not aware of any other option to use in this scenario. Is there a way to filter by file extension when reading? The files in my data lake have various extensions, so I'm looking for a solution that handles that.
My Spark version is:
Apache Spark 3.4.1, Scala 2.12
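For what it's worth, one thing I have seen mentioned but not yet verified is the pathGlobFilter data source option, which reportedly filters leaf file names by a glob while recursiveFileLookup still walks subdirectories. Below is a sketch of the reader call I would try (commented out, since it needs a live Spark session), together with a pure-Python illustration of how I assume a *.json glob would filter leaf file names:

```python
from fnmatch import fnmatch

# Sketch of the reader call I would try (requires a SparkSession named `spark`;
# pathGlobFilter is my assumption of the right option here, not a confirmed fix):
# df = (spark.read
#         .option("recursiveFileLookup", "true")
#         .option("pathGlobFilter", "*.json")   # keep only files matching the glob
#         .json("/mnt/adls_gen/prod/"))

# Pure-Python illustration of filtering leaf file names with a "*.json" glob,
# which is how I understand pathGlobFilter to behave:
paths = ["a.json", "b.json_old", "c.txt", "nested/d.json"]
kept = [p for p in paths if fnmatch(p.rsplit("/", 1)[-1], "*.json")]
print(kept)  # ['a.json', 'nested/d.json']
```

Note that the glob here goes in the option rather than in the path itself, which may be why the **/*.json path pattern failed for me.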