I have a .jsonl dataset that I am trying to convert into a TensorFlow dataset.
Each line of the .jsonl is of the form

    {"text": "some text", "meta": "irrelevant"}

I need to get it into a tensorflow dataset where each element has a key "text" associated with a tf.string value.
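To make the target concrete, here is a minimal pure-Python sketch of the records I am after. It uses `json.loads` rather than `eval`, and the helper name `iter_text_records` and the sample lines are made up for illustration. It only shows the desired element structure, not the TensorFlow conversion itself.

```python
import json


def iter_text_records(lines):
    """Yield {'text': ...} dicts from an iterable of JSON Lines strings.

    Hypothetical helper: keeps only the "text" field and drops "meta".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {"text": record["text"]}


# Made-up sample lines in the same shape as the .jsonl file.
sample_lines = [
    '{"text": "some text", "meta": "irrelevant"}',
    '{"text": "more text", "meta": "also irrelevant"}',
]

records = list(iter_text_records(sample_lines))
print(records)
```

A generator like this could presumably be handed to `tf.data.Dataset.from_generator`, but that is an assumption on my part and not something I have verified.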
It seems like the closest I've gotten is the following

    import tensorflow as tf

    ds = tf.data.TextLineDataset('train_mini.jsonl')

    def f(tnsr):
        text = eval(tnsr.numpy())['text']
        return tf.constant(text)
        #return {'text':text}

    ds = ds.map(lambda x: tf.py_function(func=f,inp=[x], Tout=tf.string))
    ds = tf.data.Dataset({"text": list(ds.as_numpy_iterator())})

which throws the following error
    InvalidArgumentError: ValueError: Error converting unicode string while converting Python sequence to Tensor.
    Traceback (most recent call last):
      File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 241, in __call__
        return func(device, token, args)
      File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 130, in __call__
        ret = self._func(*args)
      File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 309, in wrapper
        return func(*args, **kwargs)
      File "/home/crytting/persuasion/json_to_tfds.py", line 7, in f
        return tf.constant(text)
      File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 262, in constant
        allow_broadcast=True)
      File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 270, in _constant_impl
        t = convert_to_eager_tensor(value, ctx, dtype)
      File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 96, in convert_to_eager_tensor
        return ops.EagerTensor(value, ctx.device_name, dtype)
    ValueError: Error converting unicode string while converting Python sequence to Tensor.
         [[{{node EagerPyFunc}}]]

I have tried many many ways of doing this, but nothing has worked. It seems like it shouldn't be this hard, and I'm wondering if I'm missing some really simple way of doing it.