Channel: Active questions tagged python - Stack Overflow

Getting a pyarrow.lib.ArrowInvalid: Column 1 named type expected length 44 but got length 21 when trying to create a Hugging Face dataset


I am getting the error below when trying to modify, chunk, and re-save a Hugging Face Dataset.

I was wondering if anyone might be able to help?

Traceback (most recent call last):
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\main.py", line 39, in <module>
    new_dataset = dataset.map(process_row, batched=True, batch_size=1, remove_columns=None)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_dataset.py", line 3570, in _map_single
    writer.write_batch(batch)
  File "C:\Users\conno\LegalAIDataset\LegalAIDataset\.venv\Lib\site-packages\datasets\arrow_writer.py", line 571, in write_batch
    pa_table = pa.Table.from_arrays(arrays, schema=schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow\table.pxi", line 4642, in pyarrow.lib.Table.from_arrays
  File "pyarrow\table.pxi", line 3922, in pyarrow.lib.Table.validate
  File "pyarrow\error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1 named type expected length 44 but got length 21

My minimal reproducible code is below:

import datasets
from datasets import load_dataset, Dataset
from semantic_text_splitter import TextSplitter

# Step 1: Load the existing dataset
dataset = load_dataset('HF_Dataset')

# Slice the 'train' split of the dataset
sliced_data = dataset['train'][:100]

# Convert the sliced data back into a Dataset object
dataset = Dataset.from_dict(sliced_data)

def chunk_text(text_list, metadata):
    splitter = TextSplitter(1000)
    chunks = [chunk for text in text_list for chunk in splitter.chunks(text)]
    return {"text_chunks": chunks, **metadata}

# Define a global executor
#executor = ThreadPoolExecutor(max_workers=1)

def process_row(batch):
    # Initialize a dictionary to store the results
    results = {k: [] for k in batch.keys()}
    results['text_chunks'] = []  # Add 'text_chunks' key to the results dictionary
    # Process each row in the batch
    for i in range(len(batch['text'])):
        # Apply the chunk_text function to the text
        chunks = chunk_text(batch['text'][i], {k: v[i] for k, v in batch.items() if k != 'text'})
        # Add the results to the dictionary
        for k, v in chunks.items():
            results[k].extend(v)
    # Return the results
    return results

# Apply the function to the dataset
new_dataset = dataset.map(process_row, batched=True, batch_size=1, remove_columns=None)

# Save and upload the new dataset
new_dataset.to_json('dataset.jsonl')
dataset_dict = datasets.DatasetDict({"split": new_dataset})
# dataset_dict.save_to_disk("", format="json")
# dataset_dict.upload_to_hub("", "This is a test dataset")

I was expecting the code to chunk the dataset, keep the metadata and save it as a .jsonl file.

Instead, I got the above error.
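For context on what this error usually means: a batched map function must return columns of equal length, and the dictionary built in process_row can violate that. extend over a string adds one entry per character rather than one per row, and the untouched 'text' list stays empty, so the columns disagree. A minimal, self-contained sketch of a batch function that keeps every output column the same length (using a hypothetical fixed-width splitter in place of TextSplitter, so it runs without the library):

```python
def chunk_text(text, size=1000):
    # Stand-in for semantic_text_splitter.TextSplitter(size).chunks(text):
    # split a single string into fixed-width pieces.
    return [text[i:i + size] for i in range(0, len(text), size)]

def process_row(batch):
    # One output list per metadata column, plus 'text_chunks'; the original
    # 'text' column is dropped so no column is left with length 0.
    results = {k: [] for k in batch if k != "text"}
    results["text_chunks"] = []
    for i in range(len(batch["text"])):
        chunks = chunk_text(batch["text"][i], size=10)
        results["text_chunks"].extend(chunks)
        # Repeat each metadata value once per chunk so every output column
        # gains exactly len(chunks) entries for this row.
        for k in batch:
            if k != "text":
                results[k].extend([batch[k][i]] * len(chunks))
    return results

batch = {"text": ["a" * 25], "source": ["case-1"]}
out = process_row(batch)
# "a" * 25 splits into 3 chunks of width 10, so every column has 3 entries.
```

When calling dataset.map with a function like this, you would also pass remove_columns=dataset.column_names (rather than remove_columns=None) so the original, unchunked columns are not carried over alongside the longer chunked ones.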

