Channel: Active questions tagged python - Stack Overflow

cumulative aggregate a polars list[struct[]]

I have to accomplish a complex dataframe conversion like this:

original_dataframe = pl.DataFrame({
    'index': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'content': [
        {'key': 3, 'val': 20},
        {'key': 4, 'val': 50},
        {'key': 3, 'val': 8},
        {'key': 5, 'val': 70},
        {'key': 4, 'val': -60},
        {'key': 2, 'val': 30},
        {'key': 4, 'val': 5},
    ],
})

┌───────┬───────────┐
│ index ┆ content   │
│ ---   ┆ ---       │
│ str   ┆ struct[2] │
╞═══════╪═══════════╡
│ A     ┆ {3,20}    │
│ B     ┆ {4,50}    │
│ C     ┆ {3,8}     │
│ D     ┆ {5,70}    │
│ E     ┆ {4,-60}   │
│ F     ┆ {2,30}    │
│ G     ┆ {4,5}     │
└───────┴───────────┘
       ||
       \/
┌───────┬──────────────────────────┐
│ index ┆ content                  │
│ ---   ┆ ---                      │
│ str   ┆ list[struct[2]]          │
╞═══════╪══════════════════════════╡
│ A     ┆ [{3,20}]                 │
│ B     ┆ [{3,20}, {4,50}]         │
│ C     ┆ [{3,28}, {4,50}]         │
│ D     ┆ [{3,28}, {4,50}, {5,70}] │
│ E     ┆ [{3,28}, {5,70}]         │
└───────┴──────────────────────────┘

This conversion combines:

  1. cumulatively append each row's struct to a list, row by row;
  2. if a struct with the same 'key' field already exists in the list, merge the two structs by summing their 'val' fields;
  3. if a struct's 'val' field is <= 0 after aggregation, drop it from the list;
  4. sort each list by the struct 'key' field;
  5. also drop a struct if its 'val' field or 'key' field is null.
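
To pin down what the five rules mean, here is a minimal plain-Python reference sketch. It is not the Polars solution being asked for, only a slow row-by-row statement of the expected semantics. The function name `cumulative_lists` is hypothetical, and it assumes that a key whose running sum has fallen to <= 0 keeps accumulating in the background, which the rules above do not spell out.

```python
from collections import defaultdict


def cumulative_lists(rows):
    """Return one list of {'key', 'val'} dicts per input row.

    `rows` is an iterable of dicts like {'key': 3, 'val': 20}, or None.
    Assumption: a key whose running sum is <= 0 is hidden from the output
    but its sum keeps accumulating across later rows.
    """
    totals = defaultdict(int)  # running sum of 'val' per 'key'
    result = []
    for content in rows:
        # Rule 5: ignore structs that are null or have a null key/val.
        if (
            content is not None
            and content.get("key") is not None
            and content.get("val") is not None
        ):
            # Rules 1 and 2: accumulate, merging equal keys by summing 'val'.
            totals[content["key"]] += content["val"]
        # Rule 3: keep only positive sums. Rule 4: sort by 'key'.
        result.append(
            [{"key": k, "val": v} for k, v in sorted(totals.items()) if v > 0]
        )
    return result


# Rows A to E from the example above.
rows = [
    {"key": 3, "val": 20},
    {"key": 4, "val": 50},
    {"key": 3, "val": 8},
    {"key": 5, "val": 70},
    {"key": 4, "val": -60},
]
for row_list in cumulative_lists(rows):
    print(row_list)
```

For rows A to E this reproduces the five expected lists shown in the output table.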

The conversion can be done in an ugly way by using iter_rows() and to_list() to iterate over the dataframe rows with intermediate Python data types (list, dict), but that approach is slow. How can it be solved using only Polars functions, so that it is fast and elegant?

PS: Thanks to @jqurious for the reminder. There is an additional requirement, so I have updated the question.

pl.DataFrame({
    'index': ['A', 'B', 'C', 'D', 'E', 'F'],
    'content': [
        {'key': 3, 'val': 20},
        {'key': 4, 'val': 50},
        {'key': 3, 'val': 8},
        {'key': 2, 'val': 30},
        {'key': 4, 'val': -60},
        {'key': 4, 'val': 5},
    ],
})

┌───────┬───────────┐
│ index ┆ content   │
│ ---   ┆ ---       │
│ str   ┆ struct[2] │
╞═══════╪═══════════╡
│ A     ┆ {3,20}    │
│ B     ┆ {4,50}    │
│ C     ┆ {3,8}     │
│ D     ┆ {2,30}    │
│ E     ┆ {4,-60}   │
│ F     ┆ {4,5}     │
└───────┴───────────┘
        ||
        \/
┌───────┬──────────────────────────┐
│ index ┆ content                  │
│ ---   ┆ ---                      │
│ str   ┆ list[struct[2]]          │
╞═══════╪══════════════════════════╡
│ A     ┆ [{3,20}]                 │
│ B     ┆ [{3,20}, {4,50}]         │
│ C     ┆ [{3,28}, {4,50}]         │
│ D     ┆ [{2,30}, {3,28}, {4,50}] │
│ E     ┆ [{2,30}, {3,28}]         │
│ F     ┆ [{2,30}, {3,28}, {4,5}]  │
└───────┴──────────────────────────┘

The updated requirement is:

  1. if a struct's 'val' field is <= 0 after the cumulative sum, drop it from the corresponding row's list immediately; if the same 'key' appears again in a later row with a 'val' field > 0, it should be cumulatively aggregated again, starting from that new value (as in row F above, where key 4 comes back as {4,5}).
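
The drop-and-restart behaviour can be stated as a minimal plain-Python reference sketch. Again, this is not the Polars-only solution being asked for, just a slow row-by-row version to check candidate answers against. The function name `cumulative_lists_with_reset` is hypothetical, and it assumes that a dropped key which reappears with a 'val' <= 0 simply stays dropped.

```python
def cumulative_lists_with_reset(rows):
    """Return one list of {'key', 'val'} dicts per input row.

    A key is removed as soon as its running sum is <= 0; if it shows up
    again later, accumulation restarts from that row's 'val'.
    """
    running = {}  # key -> current positive running sum
    result = []
    for content in rows:
        # Skip structs that are null or have a null key/val.
        if (
            content is not None
            and content.get("key") is not None
            and content.get("val") is not None
        ):
            key = content["key"]
            total = running.get(key, 0) + content["val"]
            if total > 0:
                running[key] = total
            else:
                # Drop immediately so that a later row starts from scratch.
                running.pop(key, None)
        # Emit the current state sorted by 'key'.
        result.append([{"key": k, "val": running[k]} for k in sorted(running)])
    return result


# Rows A to F from the updated example.
updated_rows = [
    {"key": 3, "val": 20},
    {"key": 4, "val": 50},
    {"key": 3, "val": 8},
    {"key": 2, "val": 30},
    {"key": 4, "val": -60},
    {"key": 4, "val": 5},
]
for row_list in cumulative_lists_with_reset(updated_rows):
    print(row_list)
```

For rows A to F this gives the six expected lists from the updated output table. The last row contains {4,5} because key 4 was removed at row E and restarts at row F.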
