
Import Custom Python Modules on EMR Serverless through Spark Configuration


I created a spark_reader.py module that hosts multiple classes I want to use as a template. I've seen in multiple configurations online that setting "spark.submit.pyFiles" should let you import the module, and the directory containing it has an __init__.py. However, I still get an error when I try to "import spark_reader". My assumption was that it would work since Spark is configured with it.
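Roughly, the setup looks like this (the class body and the entry-point file name are simplified placeholders, not my real code):

# spark_reader.py, uploaded to s3://bi-emr-2024/emr-serverless-workspaces/modules/spark_reader.py
class SparkReader:
    """Template class I want to reuse across jobs."""
    def __init__(self, spark):
        self.spark = spark

# entry_point.py, the script submitted as the Spark job
import spark_reader  # this is the import that fails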

My goal is to use the EMR Serverless application from SageMaker as well, which is why I want to solve this dependency-management issue. Has anyone solved this by modifying the runtimeConfiguration?

{"runtimeConfiguration": [    {"classification": "spark-defaults","configurations": null,"properties": {"spark.pyspark.virtualenv.requirements": "s3://bi-emr-2024/venv/requirements.txt","spark.submit.pyFiles": "s3://bi-emr-2024/emr-serverless-workspaces/modules/spark_reader.py","spark.pyspark.virtualenv.requirements.use": "true","spark.pyspark.virtualenv.type": "native","spark.pyspark.virtualenv.enabled": "true","spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv","spark.sql.shuffle.partitions": "100","spark.pyspark.python": "python","spark.log.level": "DEBUG","spark.serializer": "org.apache.spark.serializer.KryoSerializer","spark.jars": "s3://bi-emr-2024/mysql-connector-j-8.3.0.jar"      }    }  ]}

What I've done for now is pass the Python module's S3 path in as a secret so that I can load it manually into the Spark context:

sc.addPyFile(f"{os.environ['PYTHON_MODULES']}")

I think I'm being inefficient here and there could be better solutions to this.
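Put together, the workaround looks roughly like this (the entry-point structure is simplified; PYTHON_MODULES is the environment variable carrying the secret):

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Make spark_reader.py available to the driver and executors before importing it.
sc.addPyFile(os.environ["PYTHON_MODULES"])

import spark_reader  # only works after addPyFile has run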

