Channel: Active questions tagged python - Stack Overflow

Pyspark: Python Vs Spark-Submit. Error: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found


I am trying to write a PySpark DataFrame to S3. This is my configuration:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config('spark.master', 'local')
         .config('spark.app.name', 'Demo')
         .config('spark.jars.packages',
                 'org.apache.hadoop:hadoop-aws:3.2.4,org.apache.hadoop:hadoop-common:3.2.4')
         .config('spark.hadoop.fs.s3a.aws.credentials.provider',
                 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
         .getOrCreate()
         )

sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", '')
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", '')

When I run my code using python app.py, the script runs fine and the data is uploaded to S3.

But when I run the same script using spark-submit app.py, I am getting the error below:

Traceback (most recent call last):
  File "/home/usr/sparkCode/EndToEndProj1/dataPipeline.py", line 94, in <module>
    s3Interaction(bucketName, prefixName, fileName, write=True)
  File "/home/usr/sparkCode/EndToEndProj1/dataPipeline.py", line 22, in s3Interaction
    flightsDelatDfOrgDestMaxDist.write.csv(s3Path, mode='overwrite', header=True)
  File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1864, in csv
  File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/home/usr/virtualenv/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o76.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
        at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:454)
        at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:530)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388)
        at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240)
        at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:850)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
        ... 25 more

When I run it through python, I can see the dependencies being downloaded, but when I run it using spark-submit, I do not see any dependencies being downloaded. I believe spark-submit is not downloading the .jar files, and that is what raises this error.

I tried to find the answer in this thread, but it's not clear to me.

I am confused: why is the execution behavior different between the two runs? How are python and spark-submit different from each other?
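For reference, one workaround I am considering (this is my assumption, not something I have confirmed fixes it): passing the same package coordinates directly to spark-submit via --packages, instead of relying on the spark.jars.packages config set inside the script. A sketch, assuming the same app.py as above:

```shell
# Hypothetical invocation: ask spark-submit itself to resolve the
# hadoop-aws jars at launch time, before the SparkSession is created.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.2.4,org.apache.hadoop:hadoop-common:3.2.4 \
  app.py
```

I don't know whether spark-submit honors spark.jars.packages when it is set only in the builder, or whether it must be supplied on the command line (or in spark-defaults.conf) before launch.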

