ClassLabel disappears after loading DatasetDict (Hugging Face)

I have a DatasetDict containing 10 splits (‘fold_0’ to ‘fold_9’). Every Dataset object in the DatasetDict has 2 features: “label” and “text”. Here’s a small overview:

    print(my_dataset_dict)
    >>> DatasetDict({
            fold_0: Dataset({
                features: ['label', 'text'],
                num_rows: 85087
            })
            fold_1: Dataset({
                features: ['label', 'text'],
                num_rows: 85076
            })
        ....
            fold_9: Dataset({
                features: ['label', 'text'],
                num_rows: 85159
            })
        })

For each Dataset, the “label” column was encoded with ClassLabel, and the “text” column is just a bunch of sentences:

    print(my_dataset_dict['fold_0'].features)
    >>> {'label': ClassLabel(names=['MA211', 'MA221', ..., 'V39'], id=None), 'text': Value(dtype='string', id=None)}

So far so good, it’s exactly what I’m expecting. However, if I push it to the Hub and then load it again (in another script or in the same one, it doesn’t matter), the labels disappear:

    huggingface_hub.delete_repo(repo_id=dataset_path, repo_type='dataset', missing_ok=True)  # Just to be sure the previous DatasetDict is removed first
    my_dataset_dict.push_to_hub(dataset_path)  # No problem, I see it on the Hub after that (and the real labels appear)
    test_dataset_dict = datasets.load_dataset(dataset_path)  # Reloading it from the same path
    print(test_dataset_dict['fold_0'].features)
    >>> {'label': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None)}

As you can see, I don’t have the labels anymore. This is a problem for me because I need to curate the data and create the dataset in one notebook, then load the data back and perform some ML tasks in another notebook, and I’m losing the real labels along the way.

I tried loading with test_dataset_dict = datasets.load_dataset(dataset_path, download_mode=datasets.DownloadMode.FORCE_REDOWNLOAD), but it doesn’t change anything. The text and the labels (just the integers) are loaded, but I don’t have the names of the labels.

The names of the labels are pushed to the Hub, because I can see them in the dataset viewer under the label column (each integer is shown with its associated code right next to it).

Thanks for your help!

