
Cannot mimic manual document split in Azure programmatically using Azure SplitSkill

I am moving from a manual setup of my RAG solution in Azure to setting everything up programmatically with the Azure Python SDK. I have a container with a single PDF. With the manual setup, the document count under the created index is 401 when the chunk size is set to 256. With my custom skillset:

    from azure.search.documents.indexes.models import (
        InputFieldMappingEntry,
        OutputFieldMappingEntry,
        SplitSkill,
    )

    split_skill = SplitSkill(
        name="split",
        description="Split skill to chunk documents",
        context="/document",
        text_split_mode="pages",
        default_language_code="en",
        maximum_page_length=300,  # why can't this be set to 256 if I can do this with a manual setup?
        page_overlap_length=30,
        inputs=[
            InputFieldMappingEntry(name="text", source="/document/content"),
        ],
        outputs=[
            OutputFieldMappingEntry(name="textItems", target_name="pages"),
        ],
    )

I get 271. I want to mimic my manual chunking setup as closely as possible, since it already performs well. What am I missing?
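
One way to sanity-check the two counts is a rough estimate of how many chunks a given length and overlap would produce. The sketch below is not the SplitSkill algorithm; it assumes a plain fixed-size sliding window with no regard for sentence boundaries, and the helper name estimate_chunk_count and the document length of 100,000 units are hypothetical. It also assumes both setups measure length in the same unit; if one counts characters and the other counts tokens, the numbers are not directly comparable.

```python
import math


def estimate_chunk_count(total_length: int, max_length: int, overlap: int = 0) -> int:
    """Estimate the number of chunks from a fixed-size sliding window.

    This is a simplification for comparison only, not Azure's splitter.
    All lengths must be expressed in the same unit.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    if total_length <= 0:
        return 0
    if total_length <= max_length:
        return 1
    # Each chunk after the first advances the window by (max_length - overlap).
    step = max_length - overlap
    return 1 + math.ceil((total_length - max_length) / step)


if __name__ == "__main__":
    doc_length = 100_000  # hypothetical document length, not taken from the post
    print("300 with overlap 30:", estimate_chunk_count(doc_length, 300, 30))
    print("256 with no overlap:", estimate_chunk_count(doc_length, 256, 0))
```

Comparing the two settings this way shows how much of the gap between 401 and 271 the length and overlap parameters could explain on their own, before looking at differences in units or in how each pipeline breaks text.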

