Hi, I am learning PySpark. My code currently works for CSV data, but when I convert the data to JSON I get this error:
```
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column (named
_corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
```
The sample JSON data is:

```json
[
  {"student_id": 1, "name": "John Doe", "age": 18, "grade": "A"},
  {"student_id": 2, "name": "Jane Smith", "age": 17, "grade": "B"},
  {"student_id": 3, "name": "Bob Johnson", "age": 19, "grade": "C"},
  {"student_id": 4, "name": "Alice Williams", "age": 18, "grade": "A"},
  {"student_id": 5, "name": "Charlie Brown", "age": 17, "grade": "B"},
  {"student_id": 6, "name": "Emma Davis", "age": 19, "grade": "C"},
  {"student_id": 7, "name": "James Miller", "age": 18, "grade": "A"},
  {"student_id": 8, "name": "Sophie Taylor", "age": 17, "grade": "B"},
  {"student_id": 9, "name": "David White", "age": 19, "grade": "C"}
]
```
and the Python code I have used is:

```python
mydata = spark.read.json("/original.csv")
mydata.show()
```