I have a use case where I query Elasticsearch with a spatial intersection query using a MultiPolygon. I tried two approaches: in the first, documents are stored in a flat structure where one polygon corresponds to one document; in the second, I aggregate documents before ingesting, so one polygon now represents n documents via a nested type.

When I run the spatial intersection query, I get very similar query times in both setups, an average difference of ± 2 s, even though the second approach has roughly 5x fewer documents: about 3 million compared to the previous 13 million.

I want to understand what I can optimise in my spatial query or index settings. My goal is for all spatial queries to complete in around 1 s, irrespective of polygon complexity. The intent of the query is to get the unique values of a metadata key (here, areaName) for everything that spatially intersects the input shape. I'm aware that polygons of different complexity will give different response times, but comparing a 13M-doc index against a 3M-doc one, I expected a considerable improvement. I'm running everything through the Python Elasticsearch API.
Checking with the Kibana Search Profiler, I can see that IndexOrDocValuesQuery takes the longest, and within it build_scorer accounts for roughly 90% of the time.
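To double-check the Kibana numbers from the Python side, I also sum the build_scorer share out of a profiled search (run with "profile": True in the request body). The response below is a trimmed, illustrative sample with made-up timings, not real output from my cluster:

```python
# Trimmed, illustrative Profile API response (timings are made up).
sample_response = {
    "profile": {
        "shards": [
            {
                "searches": [
                    {
                        "query": [
                            {
                                "type": "IndexOrDocValuesQuery",
                                "time_in_nanos": 2_000_000_000,
                                "breakdown": {
                                    "build_scorer": 1_800_000_000,
                                    "next_doc": 150_000_000,
                                    "advance": 50_000_000,
                                },
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

def build_scorer_share(response):
    """Fraction of total query time spent in build_scorer, summed over shards."""
    total = scorer = 0
    for shard in response["profile"]["shards"]:
        for search in shard["searches"]:
            for query in search["query"]:
                total += query["time_in_nanos"]
                scorer += query["breakdown"].get("build_scorer", 0)
    return scorer / total

print(round(build_scorer_share(sample_response), 2))  # 0.9 with the sample numbers above
```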
Initially, I tried to ingest my documents in a flat structure with this kind of mapping:
{"mappings": {"properties": {"areaName": {"type": "keyword" },"location": {"type": "geo_shape" } } }}Here's a sample of docs for this index:
{"_index": "catalog.a","_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d","_score": 0.0025355585,"_source": {"location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))","areaName": "Dataset A" },"_index": "catalog.a","_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d","_score": 0.0015355585,"_source": {"location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))","areaName": "Dataset B" },"_index": "catalog.a","_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d","_score": 0.002355585,"_source": {"location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))","areaName": "Dataset C" }}This index is around 80GB in size and has 3 primary shards and has around 13 Million such docs. I'm testing using this query for performance:
response = es.search(
    index="catalog.a",
    body={
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_shape": {
                            "location": {
                                "shape": {
                                    "type": "MultiPolygon",
                                    "coordinates": country_variable
                                },
                                "relation": "intersects"
                            }
                        }
                    }
                ]
            }
        },
        "aggs": {
            "unique_soilProperty": {
                "terms": {"field": "areaName", "size": 15}
            }
        }
    },
    request_cache=False
)

Next, I changed my mapping to this:
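For reference, this is roughly how I read the unique values out of the terms aggregation; the response dict below is a trimmed, illustrative sample (bucket keys and counts are made up):

```python
# Trimmed, illustrative response for the flat-index query:
# unique areaName values come back as terms-aggregation buckets.
sample_response = {
    "took": 2100,
    "aggregations": {
        "unique_soilProperty": {
            "buckets": [
                {"key": "Dataset A", "doc_count": 120000},
                {"key": "Dataset B", "doc_count": 95000},
            ]
        }
    }
}

buckets = sample_response["aggregations"]["unique_soilProperty"]["buckets"]
unique_area_names = [b["key"] for b in buckets]
print(unique_area_names)  # ['Dataset A', 'Dataset B']
```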
{"mappings": {"location": {"type": "geo_shape" }"items": {"type": "nested","properties": {"areaName": {"type": "keyword" }, } }}My documents now look like this:
{"_index": "catalog.b","_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d","_score": 0.0025355585,"_source": {"location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))","items":[{ "areaName": "Dataset A" }] },"_index": "catalog.b","_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d","_score": 0.0015355585,"_source": {"location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))","items":[{ "areaName": "Dataset B" }] }}This index is around 25GB in size and has 1 primary shard and has around 3 Million such docs. I'm testing using this query for performance:
response = es.search(
    index="catalog.b",
    body={
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_shape": {
                            "location": {
                                "shape": {
                                    "type": "MultiPolygon",
                                    "coordinates": country_variable
                                },
                                "relation": "intersects"
                            }
                        }
                    }
                ]
            }
        },
        "aggs": {
            "nested_items": {
                "nested": {"path": "items"},
                "aggs": {
                    "unique_dataSources": {
                        "terms": {"field": "items.areaName", "size": 15}
                    }
                }
            }
        }
    },
    request_cache=False
)

I get very similar query performance even with this setup. What further optimisations could I make to speed things up?
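For completeness, this is how I read the unique values in the nested setup, where the terms buckets sit one level down under the nested aggregation. The response dict is again a trimmed, illustrative sample with made-up values:

```python
# Trimmed, illustrative response for the nested-index query:
# the terms buckets are nested under the "nested_items" aggregation.
sample_response = {
    "took": 2050,
    "aggregations": {
        "nested_items": {
            "doc_count": 410000,
            "unique_dataSources": {
                "buckets": [
                    {"key": "Dataset A", "doc_count": 120000},
                    {"key": "Dataset B", "doc_count": 95000},
                ]
            }
        }
    }
}

buckets = sample_response["aggregations"]["nested_items"]["unique_dataSources"]["buckets"]
unique_area_names = [b["key"] for b in buckets]
print(unique_area_names)  # ['Dataset A', 'Dataset B']
```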