I've been working on a Python implementation of a Merkle DAG (Directed Acyclic Graph) with the goal of creating Content Addressable Archive (CAR) files. However, I've hit a roadblock: I can't figure out the correct way to specify links in the nodes. My Python 3 implementation is included below.
I'm using the multiformats library to generate a CID for each chunk of data, and then I'm trying to create a Merkle DAG where each node contains links to its children. The end goal is to produce a CAR file.
I'm storing the CIDs of the chunks in the "links" field of the root node. However, I'm unsure whether this is done correctly. Are there any specific requirements for linking nodes in an IPLD Merkle DAG that I might be missing?
If anyone has experience with Merkle DAGs and CAR file creation in Python, could you please review my code and provide insights into the correct way to specify links in the nodes and generate a valid CAR file?
I appreciate any assistance or suggestions to help me move past this roadblock. Thank you!
from multiformats import CID, varint, multihash, multibase
import dag_cbor
import json
import msgpack


def generate_cid(data, codec="dag-pb"):
    hash_value = multihash.digest(data, "sha2-256")
    return CID("base32", version=1, codec=codec, digest=hash_value)


def generate_merkle_tree(file_path, chunk_size):
    cids = []
    # Read the file
    with open(file_path, "rb") as file:
        while True:
            # Read a chunk of data
            chunk = file.read(chunk_size)
            if not chunk:
                break
            # Generate CID for the chunk
            cid = generate_cid(chunk, codec="raw")
            cids.append((cid, chunk))
    # Generate Merkle tree root CID from all the chunks
    # root_cid = generate_cid(b"".join(bytes(cid[0]) for cid in cids))
    # Create the root node with links and other data
    root_node = {"file_name": "test.png", "links": [str(cid[0]) for cid in cids]}
    # Encode the root node as dag-pb
    root_data = dag_cbor.encode(root_node)
    # Generate CID for the root node
    root_cid = generate_cid(root_data, codec="dag-pb")
    return root_cid, cids, root_data


def create_car_file(root, cids):
    header_roots = [root]
    header_data = dag_cbor.encode({"roots": header_roots, "version": 1})
    header = varint.encode(len(header_data)) + header_data
    car_content = b""
    car_content += header
    for cid, chunk in cids:
        cid_bytes = bytes(cid)
        block = varint.encode(len(chunk) + len(cid_bytes)) + cid_bytes + chunk
        car_content += block
    root_cid = bytes(root)
    root_block = varint.encode(len(root_cid)) + root_cid
    car_content += root_block
    with open("output.car", "wb") as car_file:
        car_file.write(car_content)


file_path = "./AADHAAR.png"  # Replace with the path to your file
chunk_size = 16384  # Adjust the chunk size as needed
root, cids, root_data = generate_merkle_tree(file_path, chunk_size)
print(root)
create_car_file(root, cids)
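For reference, here is a minimal standard-library sketch of how a CARv1 data section is framed, as far as I understand the format: a varint length prefix covering the CID plus the block bytes, then the CID, then the block bytes. The same framing applies to the root block, so a root section that carries only the CID and no data would not follow this layout. The helper names (`encode_varint`, `cid_v1_bytes`, `car_block`) and the sample payloads are hypothetical and not part of any library; the multicodec and multihash codes are the registered values for raw, dag-cbor, and sha2-256.

```python
import hashlib

RAW = 0x55        # multicodec code for raw bytes
DAG_CBOR = 0x71   # multicodec code for dag-cbor
SHA2_256 = 0x12   # multihash code for sha2-256


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def cid_v1_bytes(codec: int, data: bytes) -> bytes:
    """Binary CIDv1: version, codec, then a sha2-256 multihash of the data."""
    digest = hashlib.sha256(data).digest()
    multihash = encode_varint(SHA2_256) + encode_varint(len(digest)) + digest
    return encode_varint(1) + encode_varint(codec) + multihash


def car_block(cid_bytes: bytes, data: bytes) -> bytes:
    """One CAR section: varint(len(CID) + len(data)) | CID | data."""
    return encode_varint(len(cid_bytes) + len(data)) + cid_bytes + data


# Hypothetical stand-ins for a file chunk and an encoded root node.
chunk = b"example chunk"
root_data = b"\xa0"  # placeholder: an empty CBOR map, not a real root node

sections = (
    car_block(cid_v1_bytes(RAW, chunk), chunk)
    + car_block(cid_v1_bytes(DAG_CBOR, root_data), root_data)
)
```

The CAR header (the dag-cbor map with `roots` and `version`) would still go in front of these sections, framed with its own varint length prefix as in the code above.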
I've been working on a Python implementation to create a Merkle DAG and subsequently generate a Content Addressable Archive (CAR) file.
I attempted to link nodes by storing the CIDs of the chunks in the "links" field of the root node. However, I'm uncertain whether I'm doing this correctly. My expectation was that each node would contain links to its children, but I'm unsure whether there are specific requirements for linking nodes in an IPLD Merkle DAG.
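To make the linking question concrete, here is a small standard-library sketch of what a link looks like on the wire in DAG-CBOR, as far as I understand the spec: a CBOR tag 42 wrapping a byte string that holds a `0x00` prefix followed by the binary CID. A CID stored as a base32 string is encoded as a plain CBOR text string instead, so an IPLD reader would not treat it as a link. The helpers `cid_v1_bytes` and `cbor_link` are hypothetical names written for this illustration. With the `dag_cbor` package, my understanding is that passing `CID` objects rather than `str(cid)` produces this tag 42 form, but that is an assumption worth checking against its documentation. Separately, a node encoded with `dag_cbor.encode` would normally carry the `dag-cbor` codec in its CID rather than `dag-pb`.

```python
import hashlib


def cid_v1_bytes(codec_code: int, data: bytes) -> bytes:
    """Binary CIDv1 with a sha2-256 multihash.

    Assumes codec_code < 0x80 so that it fits in a single varint byte
    (true for raw = 0x55 and dag-cbor = 0x71).
    """
    digest = hashlib.sha256(data).digest()
    return bytes([0x01, codec_code, 0x12, len(digest)]) + digest


def cbor_link(cid_bytes: bytes) -> bytes:
    """Encode a binary CID as a DAG-CBOR link (CBOR tag 42)."""
    payload = b"\x00" + cid_bytes  # 0x00 is the identity multibase prefix
    if not 24 <= len(payload) <= 255:
        raise ValueError("this sketch only handles one-byte CBOR lengths")
    # 0xd8 0x2a is tag(42); 0x58 nn is a byte string with a one-byte length.
    return b"\xd8\x2a" + bytes([0x58, len(payload)]) + payload


# Hypothetical chunk standing in for one piece of the file.
chunk_cid = cid_v1_bytes(0x55, b"example chunk")
link = cbor_link(chunk_cid)
```

In a real root node, each entry of the "links" list would be one such tagged value, and the surrounding map and list would be produced by the CBOR encoder.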