I am trying to validate nested JSON with Great Expectations. I have managed to do this with the following expectation suite, plus some modification of the batch to pull the nested values into their own columns:
nestedjson_expectations_suite.json

    {
      "data_asset_type": "Dataset",
      "expectation_suite_name": "default",
      "expectations": [
        {
          "expectation_type": "expect_column_values_to_be_between",
          "kwargs": {"column": "id", "max_value": 100, "min_value": 1},
          "meta": {}
        },
        {
          "expectation_type": "expect_column_values_to_be_unique",
          "kwargs": {"column": "id"},
          "meta": {}
        },
        {
          "expectation_type": "expect_column_values_to_match_regex",
          "kwargs": {"column": "name", "regex": "^[A-Za-z\\s]+$"},
          "meta": {}
        },
        {
          "expectation_type": "expect_column_values_to_not_be_null",
          "kwargs": {"column": "name"},
          "meta": {}
        },
        {
          "expectation_type": "expect_column_values_to_be_between",
          "kwargs": {"column": "details_age", "max_value": 120, "min_value": 0},
          "meta": {}
        },
        {
          "expectation_type": "expect_column_values_to_match_regex",
          "kwargs": {"column": "details_address_city", "regex": "^[A-Za-z\\s]+$"},
          "meta": {}
        },
        {
          "expectation_type": "expect_column_values_to_match_regex",
          "kwargs": {"column": "details_address_state", "regex": "^[A-Za-z\\s]+$"},
          "meta": {}
        }
      ],
      "ge_cloud_id": null,
      "meta": {"great_expectations_version": "0.18.8"}
    }

nested.py

    import great_expectations as ge
    import openpyxl

    # Load the Great Expectations context
    context = ge.data_context.DataContext("../.")

    # Load the JSON data into a Pandas DataFrame
    data_file_path = "../../data/nested.json"
    df = ge.read_json(data_file_path)

    # Create a batch of data
    batch = ge.dataset.PandasDataset(df)

    # Create new columns for nested values
    batch["details_age"] = batch["details"].apply(lambda x: x.get("age"))
    batch["details_address_city"] = batch["details"].apply(lambda x: x.get("address").get("city"))
    batch["details_address_state"] = batch["details"].apply(lambda x: x.get("address").get("state"))

    result = batch.validate("nestedjson_expectations_suite.json")
    print(result)

nested.json

    [
      {
        "id": 1,
        "name": "John Doe",
        "details": {
          "age": 30,
          "address": {"city": "New York", "state": "NY"}
        }
      },
      {
        "id": 2,
        "name": "Jane Smith",
        "details": {
          "age": 25,
          "address": {"city": "San Francisco", "state": "CA"}
        }
      }
    ]

The documentation is not very clear on how to proceed with batches:
- How do I run this as a checkpoint?
- How do I build Data Docs from the results?
- How do I create a validator for the modified batch?
- How do I export the results to an output file, ideally an Excel file rather than JSON?
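
For the export point, something along these lines is roughly what I have in mind. It is only a sketch under assumptions: I am assuming the validation result can be converted to a plain dict (for example with `result.to_json_dict()`) and that each entry under `results` carries `expectation_config`, `success`, and `result` keys. The helper names `flatten_results` and `rows_to_csv` and the `sample` dict are made up for illustration, and the sketch writes CSV (which Excel opens) using only the standard library.

```python
import csv
import io

# Columns kept in the flat report (my own choice, not a Great Expectations convention).
FIELDNAMES = ["expectation_type", "column", "success", "unexpected_count"]


def flatten_results(result_dict):
    """Turn a validation-result dict into one flat row per expectation."""
    rows = []
    for item in result_dict.get("results", []):
        config = item.get("expectation_config", {})
        kwargs = config.get("kwargs", {})
        outcome = item.get("result", {})
        rows.append({
            "expectation_type": config.get("expectation_type"),
            "column": kwargs.get("column"),
            "success": item.get("success"),
            "unexpected_count": outcome.get("unexpected_count"),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample, shaped like what I assume the result dict looks like.
sample = {
    "success": False,
    "results": [
        {
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": "id", "min_value": 1, "max_value": 100},
            },
            "success": True,
            "result": {"unexpected_count": 0},
        },
        {
            "expectation_config": {
                "expectation_type": "expect_column_values_to_match_regex",
                "kwargs": {"column": "details_address_city", "regex": "^[A-Za-z\\s]+$"},
            },
            "success": False,
            "result": {"unexpected_count": 1},
        },
    ],
}

rows = flatten_results(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

If pandas and openpyxl are installed, `pandas.DataFrame(rows).to_excel("validation_results.xlsx", index=False)` should give a real `.xlsx` file from the same rows, but I have not confirmed this is the recommended way to do it with Great Expectations.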
I tried to update the batch request to modify batches but that didn't seem to work!

    validator = context.get_validator(
        batch_request=BatchRequest(**batch_request),
        expectation_suite_name=expectation_suite_name,
    )

Please suggest.