Channel: Active questions tagged python - Stack Overflow

Is it possible to write a self-referencing column in PySpark?


I'm writing a small PoC to port a piece of logic from plain Python to PySpark. The Python version processes logs stored in SQLite one by one:

    logs = [...]
    processed_logs = []
    previous_log = EmptyDecoratedLog()  # empty
    for log in logs:
        processed_log = with_outlet_value_closed(log, previous_log)
        previous_log = processed_log
        processed_logs.append(processed_log)

and

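    def with_outlet_value_closed(current_entry: DecoratedLog, previous_entry: DecoratedLog) -> DecoratedLog:
        # A GS2 entry sets the value; any other entry inherits it from the previous entry.
        if current_entry.sourceName == "GS2":
            current_entry.outletValveClosed = current_entry.eventData
        else:
            current_entry.outletValveClosed = previous_entry.outletValveClosed
        return current_entry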

which I wanted to express with the PySpark API as:

    import pyspark.sql.functions as f
    from pyspark.sql import Window as W

    window = W.orderBy("ID")  # where ID is a unique id on those logs
    df.withColumn(
        "testValveOpened",
        f.when(f.col("sourceName") == "GS2", f.col("eventData"))
        .otherwise(f.lag("testValveOpened").over(window)),
    )

but this leads to AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name outletValveClosed cannot be resolved.

So my question is: is it possible to express this kind of logic, where the value in the current row depends on the previous row of the same column? (I know this will result in all records being processed on a single thread, but that's fine.)
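For reference, the row-to-row dependency described here can be unrolled into a running "most recent GS2 value seen so far". The plain-Python sketch below illustrates that reading on the sample rows from this question; the `carry` helper and the starting value of 0 are assumptions made for illustration, not part of the original code.

```python
from itertools import accumulate

# (ID, sourceName, eventData) rows taken from the sample data in this question
rows = [
    (1, "GS3", 1), (2, "GS2", 1), (3, "GS2", 1), (4, "GS1", 1), (5, "GS1", 1),
    (6, "ABC", 0), (7, "B123", 0), (8, "B423", 0), (9, "PTSD", 168), (10, "XCD", 0),
]


def carry(previous_value, row):
    """Hypothetical helper: a GS2 row sets the value, any other row keeps the previous one."""
    _id, source_name, event_data = row
    return event_data if source_name == "GS2" else previous_value


# initial=0 stands in for the empty previous log (an assumption);
# accumulate yields the initial value first, so it is dropped with [1:].
test_valve_opened = list(accumulate(rows, carry, initial=0))[1:]
print(test_valve_opened)
```

Seen this way, each row only needs the latest GS2 `eventData` at or before it, not the previously computed output value.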

I've tried initializing the column first:

    df = df.withColumn("testValveOpened", f.lit(0))
    df.withColumn(
        "testValveOpened",
        f.when(f.col("sourceName") == "GS2", f.col("eventData"))
        .otherwise(f.lag("testValveOpened").over(window)),
    )

but then I'm getting

    ID |sourceName|eventData|testValveOpened
    1  |GS3       |1        |0
    2  |GS2       |1        |1
    3  |GS2       |1        |1
    4  |GS1       |1        |0
    5  |GS1       |1        |0
    6  |ABC       |0        |0
    7  |B123      |0        |0
    8  |B423      |0        |0
    9  |PTSD      |168      |0
    10 |XCD       |0        |0

I would like to get

    ID |sourceName|eventData|testValveOpened
    1  |GS3       |1        |0
    2  |GS2       |1        |1
    3  |GS2       |1        |1
    4  |GS1       |1        |1
    5  |GS1       |1        |1
    6  |ABC       |0        |1
    7  |B123      |0        |1
    8  |B423      |0        |1
    9  |PTSD      |168      |1
    10 |XCD       |0        |1

So when sourceName is GS2, take the value of eventData; otherwise carry over the value from the previous row's testValveOpened.

