Hello, I am trying to use PySpark with Kafka. These are the versions in my environment:
- Spark: 3.5.0 (spark-3.5.0-bin-hadoop3)
- Kafka: kafka_2.13-3.6.0
- PySpark: 3.5.0
My code is:

```python
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /content/spark-streaming-kafka-0-10-assembly_2.13-3.5.0.jar--packages,org.apache.spark:spark-streaming-kafka-0-10_2.13:3.5.0,org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.0 pyspark-shell'

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "c1") \
    .option("includeHeaders", "true") \
    .load()
```

This returns the following error:
```
Failed to find data source: kafka. Please deploy the application as per the deployment section of Structured Streaming + Kafka Integration Guide.
```
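One thing I noticed while re-reading my setup: in my `PYSPARK_SUBMIT_ARGS` string there is no space between the jar path and `--packages`, and a comma right after `--packages`. I'm not sure this is the cause, but a version with the spacing I believe `spark-submit` expects (same jar path and package coordinates as in my code, only the flag separators changed) would be:

```python
import os

# Same jar path and package coordinates as in my code above.
# Assumption on my part: flags are separated by spaces, and only the
# coordinates after --packages are comma-separated.
jar = "/content/spark-streaming-kafka-0-10-assembly_2.13-3.5.0.jar"
packages = ",".join([
    "org.apache.spark:spark-streaming-kafka-0-10_2.13:3.5.0",
    "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.0",
])
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--jars {jar} --packages {packages} pyspark-shell"
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

Is this the correct format, and could the malformed string explain why the kafka data source is never found?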
I'm using Google Colab
I've tried downgrading versions and applying several older solutions from Stack Overflow, without success.