Hi, I have a task in Python to generate simulated JSON events coming from a Customer Data Platform (such as Rudderstack), apply some simple transformations, and finally load the result into a SQLite DB. This solution should then be deployed as a Prefect flow. The description below illustrates this in more detail:

A script "generate.py" will create events simulating the payload of a Customer Data Platform such as Rudderstack. "task.py" will, in turn, scan through the output of "generate.py" and transform the data so that the final output is a dictionary with the number of event views per source for a given hour of a given day (the sketch below shows what I mean). "task.py" should finally load this output into a SQLite database.
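To make this concrete, here is a rough, minimal sketch of what I have in mind for the two scripts. The event fields, source values, JSON-lines file format, and function names (generate_events, transform_events, load_counts) are my own placeholders, not the actual Rudderstack payload schema:

```python
# Rough sketch only -- the event fields, source values and JSON-lines file
# format below are assumptions, not the real Rudderstack payload schema.
import json
import random
import sqlite3
import uuid
from collections import Counter
from datetime import datetime, timezone

SOURCES = ["web", "ios", "android"]  # hypothetical source names


def generate_events(path="events.json", n=1000):
    """generate.py: write simulated CDP page-view events, one JSON object per line."""
    with open(path, "w") as f:
        for _ in range(n):
            event = {
                "messageId": str(uuid.uuid4()),
                "type": "page",
                "source": random.choice(SOURCES),
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            f.write(json.dumps(event) + "\n")


def transform_events(path="events.json"):
    """task.py, step 1: count page views per (source, day, hour)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            ts = datetime.fromisoformat(event["timestamp"])
            counts[(event["source"], ts.strftime("%Y-%m-%d"), ts.hour)] += 1
    return counts


def load_counts(counts, db_path="events.db"):
    """task.py, step 2: load the hourly view counts into SQLite."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS hourly_views "
        "(source TEXT, day TEXT, hour INTEGER, views INTEGER)"
    )
    con.executemany(
        "INSERT INTO hourly_views VALUES (?, ?, ?, ?)",
        [(s, d, h, v) for (s, d, h), v in counts.items()],
    )
    con.commit()
    con.close()


if __name__ == "__main__":
    generate_events()
    load_counts(transform_events())
```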
As mentioned earlier, the solution above should be deployed as a Prefect flow running on a Kubernetes cluster.
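For reference, this is roughly how I was planning to wrap the two steps as a Prefect 2 flow. The task and flow names are placeholders, the imports assume the hypothetical helpers from the sketch above live in generate.py and task.py, and the Kubernetes deployment details (work pool, image) are left out:

```python
# Minimal sketch of wiring the two scripts together as a Prefect 2 flow;
# the imports below assume the hypothetical helper functions from the
# earlier sketch are defined in generate.py and task.py.
from prefect import flow, task

from generate import generate_events
from task import transform_events, load_counts


@task
def generate():
    generate_events()


@task
def transform_and_load():
    load_counts(transform_events())


@flow(name="clickstream-hourly")
def clickstream_flow():
    generate()
    transform_and_load()


if __name__ == "__main__":
    clickstream_flow()
```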
Questions:
a) How many Prefect tasks/blocks should I use as part of this workflow?
b) Suppose a new business requirement comes in to deliver the data to a Machine Learning application at much lower latency (seconds instead of hourly). What should be done to meet this requirement?
c) Suppose that over a period of a few months the volume of clickstream data grows 100x and the above solution can no longer keep up. Any suggestions on how to go about resolving this throughput issue?