
Sink is not written into Delta table in Spark Structured Streaming


I want to create a streaming job that reads messages from TXT files in a folder, parses them, does some processing, and appends the result to one of three possible Delta tables depending on the parse result: a parse_failed table, an unknown_msgs table, and a parsed_msgs table.

Reading is done with

sdf = spark.readStream.text(path=path_input, lineSep="\n\n", pathGlobFilter="*.txt", recursiveFileLookup=True)
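For reference, readStream.text yields a streaming DataFrame with a single string column named value, and with lineSep="\n\n" each blank-line-separated block in a file becomes one row. A minimal self-contained sketch of the same setup (the app name and input path here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("msg-stream").getOrCreate()
path_input = "/data/incoming_msgs"  # hypothetical folder to watch

sdf = spark.readStream.text(
    path=path_input,
    lineSep="\n\n",             # each blank-line-separated block becomes one row
    pathGlobFilter="*.txt",     # only pick up .txt files
    recursiveFileLookup=True,   # also descend into subfolders
)
sdf.printSchema()  # root |-- value: string (nullable = true)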

and writing with

x = sdf.writeStream.foreachBatch(process_microbatch).start()
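For completeness, a fuller write-side setup would look roughly as follows; the checkpointLocation option and the awaitTermination() call are my additions here (standard Structured Streaming practice, path_checkpoint is a hypothetical path), not something the job above sets:

x = (
    sdf.writeStream
    .foreachBatch(process_microbatch)               # invoked once per micro-batch
    .option("checkpointLocation", path_checkpoint)  # hypothetical path; enables restart recovery
    .start()
)
x.awaitTermination()  # keep the driver alive so the query keeps running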

where process_microbatch is

def process_microbatch(self, batch_df: DataFrame, batch_id: int) -> None:
    """Processing of newly arrived messages. For each message replicate it if needed, and execute the parse_msg_proxy on each."""
    batch_df.rdd.flatMap(lambda msg: replicate_msg(msg)).map(lambda msg: parse_msg_proxy(msg))
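One thing worth illustrating about this function: flatMap and map are lazy RDD transformations, so on their own they only build a lineage and execute nothing until an action runs. A tiny stand-alone example (plain RDD, no streaming involved):

rdd = spark.sparkContext.parallelize(["a", "b"])
mapped = rdd.map(lambda s: s.upper())  # lazy: the lambda has not run yet
print(mapped.collect())                # action: now it runs -> ['A', 'B']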

and where parse_msg_proxy is

def parse_msg_proxy(self, msg: str) -> None:
    try:
        parsed_msg = parse_message(msg, element_mapping)
        # do some processing
        # create df_msg dataframe from parsed_msg
        df_msg.write.format("delta").mode("append").save(path_parsed_msgs)
    except ParseException as e:
        spark.createDataFrame([{'msg': parsed_msg, 'error': str(e)}]).write.format("delta").mode("append").save(path_parse_errors)
        raise Exception("Parse error occurred.")
    except UnknownMsgTypeException:
        spark.createDataFrame([{'msg': parsed_msg}]).write.format("delta").mode("append").save(path_unknown_msgs)

The streaming job starts without any error message, but the Delta tables are never created. What's wrong?

Thanks!

Update:

If I change the function by adding a collect() call to it:

def process_microbatch(self, batch_df: DataFrame, batch_id: int) -> None:
    """Processing of newly arrived messages. For each message replicate it if needed, and execute the parse_msg_proxy on each."""
    batch_df.rdd.flatMap(lambda msg: replicate_msg(msg)).map(lambda msg: parse_msg_proxy(msg)).collect()

I get the error message:

RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
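As far as I understand SPARK-5063, the issue is that parse_msg_proxy executes on the workers once collect() forces evaluation, and there it references the driver-side spark session and issues DataFrame writes, which is not allowed. Below is a sketch of the driver-side restructuring I am considering; the status values and result schema are my own assumptions, the parse helpers and exceptions are the ones from above, and replication via replicate_msg is left out for brevity:

from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import StringType, StructField, StructType

# Assumed result schema: a status tag plus the parsed payload / error text.
result_schema = StructType([
    StructField("status", StringType()),   # "ok" / "parse_error" / "unknown"
    StructField("payload", StringType()),
])

@F.udf(result_schema)
def parse_udf(msg):
    # Runs on the executors; touches no SparkSession or DataFrame API.
    try:
        return ("ok", parse_message(msg, element_mapping))  # helpers from the post
    except ParseException as e:
        return ("parse_error", str(e))
    except UnknownMsgTypeException:
        return ("unknown", None)

def process_microbatch(batch_df: DataFrame, batch_id: int) -> None:
    res = batch_df.withColumn("res", parse_udf("value"))
    # The DataFrame writes below run from the driver, so Delta appends are fine here.
    res.filter("res.status = 'ok'").select("res.payload").write.format("delta").mode("append").save(path_parsed_msgs)
    res.filter("res.status = 'parse_error'").select("value", "res.payload").write.format("delta").mode("append").save(path_parse_errors)
    res.filter("res.status = 'unknown'").select("value").write.format("delta").mode("append").save(path_unknown_msgs)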

