I'm having trouble getting different versions of PySpark to work correctly on my Windows machine in combination with different versions of Python installed via pyenv.
The setup:
- I installed pyenv and let it set the environment variables (PYENV, PYENV_HOME, PYENV_ROOT and the entry in PATH)
- I installed the Amazon Corretto Java JDK (jdk1.8.0_412) and set the JAVA_HOME environment variable.
- I downloaded winutils.exe and hadoop.dll from here and set the HADOOP_HOME environment variable.
- Via pyenv I installed Python 3.10.10 and then pyspark 3.4.1
- Via pyenv I installed Python 3.8.10 and then pyspark 3.2.1
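For reference, the setup for each interpreter looked roughly like this (exact commands reconstructed from memory):

```
pyenv install 3.10.10
pyenv global 3.10.10
pip install pyspark==3.4.1

pyenv install 3.8.10
pyenv global 3.8.10
pip install pyspark==3.2.1
```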
Python works as expected:
- I can switch between different versions with `pyenv global <version>`
- When I use `python --version` in PowerShell, it always shows the version that I previously set with pyenv.
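For example, switching back and forth behaves exactly as expected:

```
PS> pyenv global 3.8.10
PS> python --version
Python 3.8.10
PS> pyenv global 3.10.10
PS> python --version
Python 3.10.10
```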
But I'm having trouble with PySpark.
For one, I cannot start PySpark from the PowerShell console by running `pyspark`:

>>> The term 'pyspark' is not recognized as the name of a cmdlet, function, script file....
More annoyingly, my repo scripts (with a .venv created via pyenv and Poetry) also fail:
Caused by: java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
[...]Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
However, both work after I add the following two entries to the PATH environment variable:
- C:\Users\myuser\.pyenv\pyenv-win\versions\3.10.10
- C:\Users\myuser\.pyenv\pyenv-win\versions\3.10.10\Scripts
but then I would have to "hardcode" the Python version, which is exactly what I don't want to do while using pyenv.
With the path hardcoded, even if I switch to another Python version (`pyenv global 3.8.10`), running `pyspark` in PowerShell still starts PySpark 3.4.1 from the PATH entry for Python 3.10.10. Likewise, `python` on the command line always points to the hardcoded Python version, no matter what I do with pyenv.
I was hoping to be able to start PySpark 3.2.1 from Python 3.8.10, which I had just "activated" globally with pyenv.
What do I have to do to switch between the Python installations (and thus also between PySpark versions) with pyenv, without "hardcoding" the Python paths?
Example PySpark script:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("myapp")
    .getOrCreate()
)

data = [
    ("Finance", 10),
    ("Marketing", 20),
]

df = spark.createDataFrame(data=data)
df.show(10, False)
```
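I run it from the repo's .venv via Poetry, along these lines (the script name here is just a placeholder):

```
PS> poetry run python example_script.py
```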