I have to accomplish a complex dataframe conversion like this:

    import polars as pl

    original_dataframe = pl.DataFrame({
        'index': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
        'content': [
            {'key': 3, 'val': 20}, {'key': 4, 'val': 50}, {'key': 3, 'val': 8},
            {'key': 5, 'val': 70}, {'key': 4, 'val': -60}, {'key': 2, 'val': 30},
            {'key': 4, 'val': 5},
        ],
    })

    ┌───────┬───────────┐
    │ index ┆ content   │
    │ ---   ┆ ---       │
    │ str   ┆ struct[2] │
    ╞═══════╪═══════════╡
    │ A     ┆ {3,20}    │
    │ B     ┆ {4,50}    │
    │ C     ┆ {3,8}     │
    │ D     ┆ {5,70}    │
    │ E     ┆ {4,-60}   │
    │ F     ┆ {2,30}    │
    │ G     ┆ {4,5}     │
    └───────┴───────────┘

              ||
              \/

    ┌───────┬──────────────────────────┐
    │ index ┆ content                  │
    │ ---   ┆ ---                      │
    │ str   ┆ list[struct[2]]          │
    ╞═══════╪══════════════════════════╡
    │ A     ┆ [{3,20}]                 │
    │ B     ┆ [{3,20}, {4,50}]         │
    │ C     ┆ [{3,28}, {4,50}]         │
    │ D     ┆ [{3,28}, {4,50}, {5,70}] │
    │ E     ┆ [{3,28}, {5,70}]         │
    └───────┴──────────────────────────┘

This conversion combines:
- cumulatively add each struct into a list, row by row;
- if a struct with the same 'key' field already exists in the list, aggregate the two structs by summing their 'val' fields;
- if the struct 'val' field is <= 0 after aggregation, drop the struct from the list;
- sort each list by the struct 'key' field;
- also drop a struct if its 'val' field or 'key' field is null.
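
To pin down what the five rules mean, here is a minimal pure-Python reference that could be used to check a Polars solution. It is only a sketch: the function name `accumulate` and the (index, struct) input shape are made up for illustration, and it assumes that a key's running total is kept even while it is <= 0 and is merely hidden from the list.

```python
def accumulate(rows):
    """Apply the five rules to an iterable of (index, struct) pairs.

    Assumption (not stated in the rules): a key's running total is kept
    even while it is <= 0; such keys are only hidden from the output.
    """
    totals = {}  # key -> running sum of 'val'
    out = []
    for index, struct in rows:
        key = struct.get("key") if struct else None
        val = struct.get("val") if struct else None
        # Rule 5: ignore a struct whose 'key' or 'val' is null.
        if key is not None and val is not None:
            # Rules 1-2: add the struct, summing 'val' for a repeated 'key'.
            totals[key] = totals.get(key, 0) + val
        # Rules 3-4: hide non-positive totals and sort by 'key'.
        content = [
            {"key": k, "val": v} for k, v in sorted(totals.items()) if v > 0
        ]
        out.append((index, content))
    return out


# Rows A-E of the example above.
rows = [
    ("A", {"key": 3, "val": 20}),
    ("B", {"key": 4, "val": 50}),
    ("C", {"key": 3, "val": 8}),
    ("D", {"key": 5, "val": 70}),
    ("E", {"key": 4, "val": -60}),
]
result = accumulate(rows)
for index, content in result:
    print(index, content)
```

For rows A to E this reproduces the expected lists shown above. It is the slow row-by-row approach, so it is meant as a correctness check rather than as the answer.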
The conversion can be done in an ugly way by using iter_rows() and to_list() to iterate over the dataframe rows with intermediate Python data types (list, dict), but this approach is slow. How can it be solved using only Polars functions, so that it is fast and elegant?
PS: Thanks to @jqurious's reminder, there is an additional requirement, so I updated the question.

    pl.DataFrame({
        'index': ['A', 'B', 'C', 'D', 'E', 'F'],
        'content': [
            {'key': 3, 'val': 20}, {'key': 4, 'val': 50}, {'key': 3, 'val': 8},
            {'key': 2, 'val': 30}, {'key': 4, 'val': -60}, {'key': 4, 'val': 5},
        ],
    })

    ┌───────┬───────────┐
    │ index ┆ content   │
    │ ---   ┆ ---       │
    │ str   ┆ struct[2] │
    ╞═══════╪═══════════╡
    │ A     ┆ {3,20}    │
    │ B     ┆ {4,50}    │
    │ C     ┆ {3,8}     │
    │ D     ┆ {2,30}    │
    │ E     ┆ {4,-60}   │
    │ F     ┆ {4,5}     │
    └───────┴───────────┘

              ||
              \/

    ┌───────┬──────────────────────────┐
    │ index ┆ content                  │
    │ ---   ┆ ---                      │
    │ str   ┆ list[struct[2]]          │
    ╞═══════╪══════════════════════════╡
    │ A     ┆ [{3,20}]                 │
    │ B     ┆ [{3,20}, {4,50}]         │
    │ C     ┆ [{3,28}, {4,50}]         │
    │ D     ┆ [{2,30}, {3,28}, {4,50}] │
    │ E     ┆ [{2,30}, {3,28}]         │
    │ F     ┆ [{2,30}, {3,28}, {4,5}]  │
    └───────┴──────────────────────────┘

The updated requirement is:
- if the struct 'val' field is <= 0 after the cumulative sum, drop the struct from the corresponding row's list immediately; if the same struct 'key' field appears again in a following row with a 'val' field > 0, the cumulative aggregation for that key should start again from that row.
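
To make the drop-and-restart behaviour concrete, here is a minimal pure-Python sketch of the updated requirement. The function name `accumulate_with_reset` and the (index, struct) input shape are hypothetical, chosen only for illustration. The point is that a key whose total falls to <= 0 is forgotten entirely, so a later struct with that key starts a fresh sum.

```python
def accumulate_with_reset(rows):
    """Cumulative per-key sum where a key is forgotten once its total is <= 0.

    `rows` is an iterable of (index, struct) pairs; this shape is an
    assumption made for the sketch, not something Polars produces directly.
    """
    totals = {}  # key -> running sum of 'val' (only positive totals are stored)
    out = []
    for index, struct in rows:
        key = struct.get("key") if struct else None
        val = struct.get("val") if struct else None
        # Skip structs with a null 'key' or 'val'.
        if key is not None and val is not None:
            total = totals.get(key, 0) + val
            if total > 0:
                totals[key] = total
            else:
                # Drop immediately; the next struct with this key restarts at 0.
                totals.pop(key, None)
        out.append(
            (index, [{"key": k, "val": v} for k, v in sorted(totals.items())])
        )
    return out


# The six rows of the updated example.
updated_rows = [
    ("A", {"key": 3, "val": 20}),
    ("B", {"key": 4, "val": 50}),
    ("C", {"key": 3, "val": 8}),
    ("D", {"key": 2, "val": 30}),
    ("E", {"key": 4, "val": -60}),
    ("F", {"key": 4, "val": 5}),
]
updated_result = accumulate_with_reset(updated_rows)
for index, content in updated_result:
    print(index, content)
```

Row E removes key 4 (50 - 60 = -10), and row F brings it back as {4,5} rather than -5, which matches the expected output above. Because each row depends on the state left by the previous one, a plain cumulative sum is not enough here.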