Quantcast
Channel: Active questions tagged python - Stack Overflow
Viewing all articles
Browse latest Browse all 23131

Elastic Performance for Spatial Queries Slows Down when Geometries are in Millions

$
0
0

I have a usecase to query ES for spatial intersection query with a Multipolygon. I tried two approaches where I store documents in flat structure, where 1 Polygon represents 1 document and in another setup I aggregate docs before ingesting and so 1 Polygon now represents n documents using nested type.

When I do spatial intersection query, I get very similar query time, average difference of + - 2sec but the reduction of documents between the two approaches is around 5x, meaning second approach has only 3 Million documents as compared to 13M previously

I want to understand what can I optimise in terms of my spatial query or index settings. I intend to have all my spatial queries perform at around 1s irrespective of polygon complexity. My intention is to get unique values for some metadata key based on a spatial intersection, in this case, areaName. I'm aware of different complexity of polygons will give different response times but comparing 13M vs 3M doc index, I expected a considerable improvement. I'm running everything using Python ES API.

On checking with Kibana Profiler, I understand that IndexOrDocValuesQuery takes longest time in which specifically build_scorer is taking 90% of the time.

Initially, I tried to ingest my documents in a flat structure with this kind of mapping:

{"mappings": {"properties": {"areaName": {"type": "keyword"      },"location": {"type": "geo_shape"      }    }  }}

Here's a sample of docs for this index:

{"_index": "catalog.a","_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d","_score": 0.0025355585,"_source": {"location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))","areaName": "Dataset A"  },"_index": "catalog.a","_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d","_score": 0.0015355585,"_source": {"location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))","areaName": "Dataset B"  },"_index": "catalog.a","_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d","_score": 0.002355585,"_source": {"location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))","areaName": "Dataset C"  }}

This index is around 80GB in size and has 3 primary shards and has around 13 Million such docs. I'm testing using this query for performance:

response = es.search(index = catalog.a, body =        {"size":0,"query": {"bool": {"filter": [                {"geo_shape": {"location": {"shape": {"type": "Multipolygon","coordinates": country_variable                      },"relation": "intersects"                    }                  }                }              ]            }          },"aggs": {"unique_soilProperty": {"terms": {"field": "areaName","size":15,              }            }          }        },     request_cache=False    )

Next, I changed my mapping to this:

{"mappings": {"location": {"type": "geo_shape"      }"items": {"type": "nested","properties": {"areaName": {"type": "keyword"      },    }  }}

My documents now look like this:

{"_index": "catalog.b","_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d","_score": 0.0025355585,"_source": {"location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))","items":[{ "areaName": "Dataset A" }]  },"_index": "catalog.b","_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d","_score": 0.0015355585,"_source": {"location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))","items":[{ "areaName": "Dataset B" }]  }}

This index is around 25GB in size and has 1 primary shard and has around 3 Million such docs. I'm testing using this query for performance:

response = es.search(index = catalog.b, body =        {"size":0,"query": {"bool": {"filter": [                {"geo_shape": {"location": {"shape": {"type": "Multipolygon","coordinates": country_variable                      },"relation": "intersects"                    }                  }                }              ]            }          },"aggs": {"nested_items": {"nested": {"path": "items"              },"aggs": {"unique_dataSources": {"terms": {"field": "items.areaName","size": 15                  }                }              }            }          }        },     request_cache=False    )

I get very similar query performance even with this setup. What kind of optimisations I could do more, to speed it up?


Viewing all articles
Browse latest Browse all 23131

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>