I want to use a Jinja2 template to create a column in a DataFrame using PySpark. For example, if I have a column `name`, I want to use the following template to create another column called `new_name`.
```python
from jinja2 import Template
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

TEMPLATE = """Hello {{ customize(name) }}!"""

def customize(name):
    return name + "san"

template = Template(source=TEMPLATE)
template.globals["customize"] = customize

def udf_foo(name):
    return template.render(name=name)

convertUDF = udf(lambda z: udf_foo(z), StringType())

df = df.select(df.name)
df1 = df.withColumn("new_name", convertUDF(col("name")))
```
Executing the code, I get the following error, which I think occurs because the template cannot be serialized successfully.
```
An exception was thrown from the Python worker. Please see the stack trace below.
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 189, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 541, in loads
    return cloudpickle.loads(obj, encoding=encoding)
TypeError: Template.__new__() missing 1 required positional argument: 'source'
```
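To check my suspicion, I reproduced what I believe is the same failure mode with plain `pickle` and a toy class whose `__new__`, like `jinja2.Template.__new__`, takes a required `source` argument (this is only an analogy, not Jinja2 itself):

```python
import pickle

class Toy:
    # Like jinja2.Template, __new__ requires a positional argument.
    # Unpickling calls cls.__new__(cls) with no extra arguments, so it fails.
    def __new__(cls, source):
        return super().__new__(cls)

    def __init__(self, source):
        self.source = source

data = pickle.dumps(Toy("Hello {{ name }}!"))  # dumping succeeds
try:
    pickle.loads(data)  # loading fails, like cloudpickle.loads on the worker
except TypeError as e:
    print(e)  # missing 1 required positional argument: 'source'
```

The dump succeeds and only the load fails, which matches the traceback above where the error comes from `cloudpickle.loads` on the worker.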
I have tried other serializers such as Pickle and Kryo, but the error persists.
- Does anyone think this might not be a serialization-related error?
- Do you know how to fix this so that Jinja2 can be used with PySpark?
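One idea I am experimenting with is constructing the `Template` inside the function, so it is built fresh on the worker at call time and never has to be pickled from the driver. The plain-Python part would look like this (the Spark wiring would be the same `udf(...)` call as in my code above); I am not sure this is the right approach:

```python
from jinja2 import Template

TEMPLATE = """Hello {{ customize(name) }}!"""

def customize(name):
    return name + "san"

def udf_foo(name):
    # Build the template inside the function so it is constructed at call
    # time on the worker and nothing Jinja2-related crosses the pickle
    # boundary when the UDF is shipped.
    template = Template(TEMPLATE)
    template.globals["customize"] = customize
    return template.render(name=name)

print(udf_foo("Kenji"))  # Hello Kenjisan!
```

Rebuilding the template on every call is obviously wasteful, so if this is the right direction I would probably cache it lazily per worker, but I wanted to confirm the serialization diagnosis first.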
Thanks in advance!