I am getting the error below when trying to modify, chunk, and re-save a Hugging Face Dataset.
I was wondering if anyone might be able to help?
```
Traceback (most recent call last):
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\main.py", line 39, in <module>
    new_dataset = dataset.map(process_row, batched=True, batch_size=1, remove_columns=None)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 3570, in _map_single
    writer.write_batch(batch)
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_writer.py", line 571, in write_batch
    pa_table = pa.Table.from_arrays(arrays, schema=schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow\table.pxi", line 4642, in pyarrow.lib.Table.from_arrays
  File "pyarrow\table.pxi", line 3922, in pyarrow.lib.Table.validate
  File "pyarrow\error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1 named type expected length 44 but got length 21
```

My minimum reproducible code is below:
```python
import datasets
from datasets import load_dataset, Dataset
from semantic_text_splitter import TextSplitter

# Step 1: Load the existing dataset
dataset = load_dataset('HF_Dataset')

# Slice the 'train' split of the dataset
sliced_data = dataset['train'][:100]

# Convert the sliced data back into a Dataset object
dataset = Dataset.from_dict(sliced_data)

def chunk_text(text_list, metadata):
    splitter = TextSplitter(1000)
    chunks = [chunk for text in text_list for chunk in splitter.chunks(text)]
    return {"text_chunks": chunks, **metadata}

# Define a global executor
#executor = ThreadPoolExecutor(max_workers=1)

def process_row(batch):
    # Initialize a dictionary to store the results
    results = {k: [] for k in batch.keys()}
    results['text_chunks'] = []  # Add 'text_chunks' key to the results dictionary

    # Process each row in the batch
    for i in range(len(batch['text'])):
        # Apply the chunk_text function to the text
        chunks = chunk_text(batch['text'][i], {k: v[i] for k, v in batch.items() if k != 'text'})
        # Add the results to the dictionary
        for k, v in chunks.items():
            results[k].extend(v)

    # Return the results
    return results

# Apply the function to the dataset
new_dataset = dataset.map(process_row, batched=True, batch_size=1, remove_columns=None)

# Save and upload the new dataset
new_dataset.to_json('dataset.jsonl')
dataset_dict = datasets.DatasetDict({"split": new_dataset})
# dataset_dict.save_to_disk("", format="json")
# dataset_dict.upload_to_hub("", "This is a test dataset")
```

I was expecting the code to chunk the dataset, keep the metadata, and save it as a .jsonl file.
Instead, I got the above error.
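For context on what the `ArrowInvalid` message means: `Dataset.map` with `batched=True` requires every column in the returned dict to have the same length, and here `text_chunks` grows by one entry per chunk while each metadata column grows by one entry per row. Below is a minimal, hypothetical sketch (not the original code; `str.split` stands in for `TextSplitter.chunks`, and `doc_id` is an invented metadata column) showing a batch function that keeps the returned columns the same length by repeating each metadata value once per chunk:

```python
# Hypothetical sketch: batched map must return equal-length columns.
# When one row explodes into several chunks, each metadata value is
# repeated once per chunk so the columns stay aligned.

def explode(batch):
    out = {"text_chunks": [], "doc_id": []}
    for text, doc_id in zip(batch["text"], batch["doc_id"]):
        chunks = text.split()  # stand-in for a real text splitter
        out["text_chunks"].extend(chunks)
        out["doc_id"].extend([doc_id] * len(chunks))  # repeat metadata per chunk
    return out

batch = {"text": ["a b c d", "e f"], "doc_id": [1, 2]}
out = explode(batch)
print(len(out["text_chunks"]), len(out["doc_id"]))  # 6 6 -> equal lengths
```

Whether this matches my intended output is part of what I am unsure about, but it at least satisfies the equal-length constraint that the pyarrow validation appears to be enforcing.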