I'm trying to use the John Snow Labs ESG model, and I keep getting the following error:
The failing line:

    document_assembler = DocumentAssembler().setInputCol('text').setOutputCol('document')

The error:

    java.lang.NoClassDefFoundError: org/apache/spark/ml/util/MLWritable$class

Full stack trace:

    Py4JJavaError: An error occurred while calling None.com.johnsnowlabs.nlp.DocumentAssembler.
    : java.lang.NoClassDefFoundError: org/apache/spark/ml/util/MLWritable$class
        at com.johnsnowlabs.nlp.DocumentAssembler.<init>(DocumentAssembler.scala:16)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
        at py4j.Gateway.invoke(Gateway.java:257)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
        at java.lang.Thread.run(Thread.java:750)
    Caused by: java.lang.ClassNotFoundException: org.apache.spark.ml.util.MLWritable$class
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
        at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
        ... 13 more
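As far as I understand, the `$class` suffix on the missing class is how Scala 2.11 encoded trait implementation classes (Scala 2.12 dropped them), so this looks like a Scala-version mismatch between a jar and the cluster. To sanity-check versions I've been printing what the driver JVM actually reports; a minimal sketch (the py4j access path to `scala.util.Properties` is my assumption):

    # Sanity check: print the Spark and Scala versions the driver JVM reports,
    # plus the Python-side Spark NLP version, to compare against the Scala
    # suffix (_2.11 / _2.12) of the installed spark-nlp jar.
    import sparknlp

    print("Spark:", spark.version)
    print("Scala:", spark.sparkContext._jvm.scala.util.Properties.versionString())
    print("Spark NLP (Python side):", sparknlp.version())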
I'm working on Databricks with the following clusters:
- Cluster 1:
  - Runtime: 13.2 ML (includes Apache Spark 3.4.0, GPU, Scala 2.12)
  - Worker & driver type: Standard_NC12s_v3 (224 GB memory, 2 GPUs)
- Cluster 2:
  - Runtime: 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12)
  - Node type: Standard_DS5_v2 (56 GB memory, 16 cores)
The libraries added to the cluster, following the installation instructions here:
- PyPi: spark-nlp (tried with and without version)
- PyPi: pyspark (tried with and without version)
- Maven: com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.0
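For reference, my understanding is that Spark NLP 5.x on a Scala 2.12 / Spark 3.x runtime ships as the `_2.12` Maven artifact, so a session pinning that coordinate would look something like the sketch below (on Databricks the jar is normally installed as a cluster library instead; this is only to show the coordinate I believe should match):

    # Sketch: start a session with a spark-nlp jar whose Scala suffix (_2.12)
    # matches the cluster's Scala version, pinned to Spark NLP 5.2.2.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("sparknlp-esg")
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2")
        .getOrCreate()
    )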
Spark NLP version: 5.2.2
Spark version: 3.4.0 (I also tried a 14.1 cluster with Spark 3.5.0)
Code:
    import sparknlp

    spark = sparknlp.start()
    sparknlp.version(), spark.version

    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline
    import pandas as pd

    document_assembler = DocumentAssembler().setInputCol('text').setOutputCol('document')
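The constructor call on the last line is where the `NoClassDefFoundError` is thrown; nothing after it runs. For completeness, a minimal end-to-end sketch of what I'm ultimately trying to do (the sample DataFrame is just for illustration):

    # Minimal repro: DocumentAssembler's constructor is where the error
    # surfaces; the rest of the pipeline never executes.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler

    document_assembler = DocumentAssembler() \
        .setInputCol('text') \
        .setOutputCol('document')

    df = spark.createDataFrame([('Some ESG disclosure text',)], ['text'])
    result = Pipeline(stages=[document_assembler]).fit(df).transform(df)
    result.select('document.result').show(truncate=False)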
I found the following questions and resources online, but haven't managed to get a solution:
- spark-nlp : DocumentAssembler initializing failing
- Maven dependency for java.lang.NoClassDefFoundError
- java.lang.NoClassDefFoundError
- NoClassDefFoundError: org/apache/spark/ml/util/MLWritable
- java.lang.NoClassDefFoundError: org/apache/spark/ml/util/MLWritable$class
- TypeError: 'JavaPackage' object is not callable - DocumentAssembler() - Spark NLP