I am getting the error below when trying to modify, chunk, and re-save a Hugging Face Dataset.
I was wondering if anyone might be able to help?
```
Traceback (most recent call last):
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\main.py", line 39, in <module>
    new_dataset = dataset.map(process_row, batched=True, batch_size=1, remove_columns=None)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 3570, in _map_single
    writer.write_batch(batch)
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_writer.py", line 571, in write_batch
    pa_table = pa.Table.from_arrays(arrays, schema=schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow\table.pxi", line 4642, in pyarrow.lib.Table.from_arrays
  File "pyarrow\table.pxi", line 3922, in pyarrow.lib.Table.validate
  File "pyarrow\error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1 named type expected length 44 but got length 21
```

My minimum reproducible code is below:
```python
import datasets
from datasets import load_dataset, Dataset
from semantic_text_splitter import TextSplitter

# Step 1: Load the existing dataset
dataset = load_dataset('HF_Dataset')

# Slice the 'train' split of the dataset
sliced_data = dataset['train'][:100]

# Convert the sliced data back into a Dataset object
dataset = Dataset.from_dict(sliced_data)

def chunk_text(text_list, metadata):
    splitter = TextSplitter(1000)
    chunks = [chunk for text in text_list for chunk in splitter.chunks(text)]
    return {"text_chunks": chunks, **metadata}

# Define a global executor
#executor = ThreadPoolExecutor(max_workers=1)

def process_row(batch):
    # Initialize a dictionary to store the results
    results = {k: [] for k in batch.keys()}
    results['text_chunks'] = []  # Add 'text_chunks' key to the results dictionary

    # Process each row in the batch
    for i in range(len(batch['text'])):
        # Apply the chunk_text function to the text
        chunks = chunk_text(batch['text'][i], {k: v[i] for k, v in batch.items() if k != 'text'})
        # Add the results to the dictionary
        for k, v in chunks.items():
            results[k].extend(v)

    # Return the results
    return results

# Apply the function to the dataset
new_dataset = dataset.map(process_row, batched=True, batch_size=1, remove_columns=None)

# Save and upload the new dataset
new_dataset.to_json('dataset.jsonl')
dataset_dict = datasets.DatasetDict({"split": new_dataset})
# dataset_dict.save_to_disk("", format="json")
# dataset_dict.upload_to_hub("", "This is a test dataset")
```

I was expecting the code to chunk the dataset, keep the metadata, and save it as a .jsonl file.
Instead, I got the above error.
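For context on what the `ArrowInvalid` message means: `Dataset.map` with `batched=True` requires every column in the returned dict to have the same length, and here `text_chunks` grows by one entry per chunk while each metadata column grows by one entry per row. Below is a minimal, hypothetical sketch (not the original code; `str.split` stands in for `TextSplitter.chunks`, and `doc_id` is an invented metadata column) showing a batch function that keeps the returned columns the same length by repeating each metadata value once per chunk:

```python
# Hypothetical sketch: batched map must return equal-length columns.
# When one row explodes into several chunks, each metadata value is
# repeated once per chunk so the columns stay aligned.

def explode(batch):
    out = {"text_chunks": [], "doc_id": []}
    for text, doc_id in zip(batch["text"], batch["doc_id"]):
        chunks = text.split()  # stand-in for a real text splitter
        out["text_chunks"].extend(chunks)
        out["doc_id"].extend([doc_id] * len(chunks))  # repeat metadata per chunk
    return out

batch = {"text": ["a b c d", "e f"], "doc_id": [1, 2]}
out = explode(batch)
print(len(out["text_chunks"]), len(out["doc_id"]))  # 6 6 -> equal lengths
```

Whether this matches my intended output is part of what I am unsure about, but it at least satisfies the equal-length constraint that the pyarrow validation appears to be enforcing.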