I am trying to write a pyspark dataframe to S3. This is my configuration:
```python
spark = (
    SparkSession.builder
    .config('spark.master', 'local')
    .config('spark.app.name', 'Demo')
    .config('spark.jars.packages',
            'org.apache.hadoop:hadoop-aws:3.2.4,org.apache.hadoop:hadoop-common:3.2.4')
    .config('spark.hadoop.fs.s3a.aws.credentials.provider',
            'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
    .getOrCreate()
)
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", '')
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", '')
```
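For context, the write that fails is the CSV export shown in the traceback below (simplified here; `s3Path` is the `s3a://` destination built from the bucket, prefix, and file name):

```python
# Simplified from dataPipeline.py: export the dataframe as CSV to S3.
# s3Path is the s3a:// destination assembled from bucketName/prefixName/fileName.
flightsDelatDfOrgDestMaxDist.write.csv(s3Path, mode='overwrite', header=True)
```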
When I run my code using `python app.py`, the script runs fine and the data is uploaded to S3. But when I run the same script using `spark-submit app.py`, I am getting the error below:
```
Traceback (most recent call last):
  File "/home/usr/sparkCode/EndToEndProj1/dataPipeline.py", line 94, in <module>
    s3Interaction(bucketName, prefixName, fileName, write=True)
  File "/home/usr/sparkCode/EndToEndProj1/dataPipeline.py", line 22, in s3Interaction
    flightsDelatDfOrgDestMaxDist.write.csv(s3Path, mode='overwrite', header=True)
  File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1864, in csv
  File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o76.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:454)
	at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:530)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240)
	at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:850)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
	... 25 more
```
When I run it through `python`, I can see it download the dependencies, but when I run it using `spark-submit`, I do not see it download any dependencies. I believe `spark-submit` is not downloading the .jar files, which is why this error is raised.
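To compare what each launch mode actually picks up, here is a minimal check I could add to app.py (the `'NOT SET'` fallback is just for illustration):

```python
# Print the packages config as the running session sees it, to compare
# the `python app.py` and `spark-submit app.py` launch modes.
print(spark.sparkContext.getConf().get('spark.jars.packages', 'NOT SET'))
```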
I tried to find the answer on this thread, but it's not clear to me.
I am confused: why is the execution behavior different between the two runs? How are `python` and `spark-submit` different from each other?