Channel: Active questions tagged python - Stack Overflow

Handling millions of state keys in Beam and Dataflow


I’m building a pipeline that needs to keep track of item changes, so I need to implement a stateful streaming pipeline. The use case: I will receive on the order of tens of events per second, and each event should be enriched with metadata from BigQuery. The BigQuery table holds metadata for millions of unique events (tens to hundreds of millions) and is updated every 24 hours.

For each received event, I need to look up its metadata in BigQuery. Since the table is really big, I cannot afford to query it every time an event arrives, so my plan is to keep the metadata somewhere inside the job. For this, I see 3 possible scenarios:

  • Slowly-changing lookup cache: query all the metadata, keep it in memory, and pass it as a side input to the transform responsible for enriching the stream. Honestly, I tried this approach with a small chunk of metadata. The issue: when I publish a message to a Pub/Sub topic connected to an event-based trigger in my pipeline to refresh the metadata, the side input keeps serving the old version of the data instead of the newly updated one (I verified through the logs that the side input is indeed re-triggered and recomputed). I’ve already posted a question about that here.
  • Ingesting the metadata as state: the other approach I thought about is to ingest the BQ metadata into Beam state, keyed by the item key (the same key the received events carry). Although I’ll be storing millions of state key-value pairs in the streaming job, most of which will never be read, this guarantees that every received event finds its appropriate metadata in state and can be enriched with it. With tens to hundreds of thousands of keys this seems to work fine, but when I pull all the millions of metadata rows, the pipeline somehow gets stuck processing them. Here is what I’ve tried:

Since I cannot use the built-in ReadFromBigQuery transform (it has to sit at the root of the graph, on a PBegin), I needed to implement my own transform that reads from BQ:

  • Using the BQ Storage API: I used this API to pull the data through multiple streams:

    session = self.client.create_read_session(
        parent=parent,
        read_session=requested_session,
        max_stream_count=10,
    )
    for stream_counter, stream in enumerate(session.streams):
        logging.info(f"Reading stream {stream_counter}")
        reader = self.client.read_rows(stream.name)
        rows = reader.rows(session)
        for row in rows:
            # build the state key and value for this row, then emit them
            yield state_key_str, state_value_dict

While this approach worked well with a small amount of data, when it comes to pulling millions of metadata rows and ingesting them into state, I can see from the logs that the workers just keep reading the metadata over and over again without ever processing it correctly.

  • Exporting the BQ table into multiple files, and reading the state entries from each file:

I tried both building my own file reader and using the native ReadAllFromText transform. I can see the data being returned, but the downstream transform stays blocked and never properly processes the data when there are millions of metadata rows. With small data it seems to work fine.

class ReadDataFromStorage(DoFn):
    def setup(self):
        self.client = storage.Client()

    def process(self, element):
        blobs = self.client.list_blobs("bucket", prefix="prefix")
        for blob in blobs:
            if not str(blob.name).endswith(".csv"):
                logging.info(f"Skipping element {blob.name}")
                continue
            yield f"gs://bucket/{blob.name}"


class PullBQMetadata(PTransform):
    def expand(self, pcoll):
        return (
            pcoll
            | 'Read files list' >> ParDo(ReadDataFromStorage())
            | 'Read All From text' >> ReadAllFromText(skip_header_lines=1)
            | 'Process Results' >> ParDo(self.filter_and_return_native)
        )
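For reference, a per-line parser like filter_and_return_native in the transform above might look roughly like this standalone sketch, turning each exported CSV line into a (state_key, metadata_dict) pair; the column names here are made up:

```python
import csv
import io

# Hypothetical column layout of the exported CSV files.
FIELDNAMES = ('item_key', 'color', 'size')


def parse_metadata_line(line):
    """Turn one exported CSV line into a (state_key, metadata_dict) pair."""
    row = next(csv.DictReader(io.StringIO(line), fieldnames=FIELDNAMES))
    key = row.pop('item_key')  # the item key becomes the state key
    return key, dict(row)      # remaining columns become the metadata value
```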

Am I taking the wrong approach? Is there something wrong with what I'm trying to do? Has anyone faced a similar issue before?

