I am wondering how to make sure HDFS data access makes the best use of local replicas, to minimize network transfer.
I am hosting HDFS on 3 machines with replication set to 3. Let's name them machines A, B, and C. Machine A is the namenode, and all three are datanodes.
Currently, I am reading data like the following:

```python
# Run this code on machine A, B, C separately
import fsspec
import pandas as pd

with fsspec.open('hdfs://machine_A_ip:9000/path/to/data.parquet', 'rb') as fp:
    df = pd.read_parquet(fp)
```

I observed huge network traffic (100+ MB/s upload and download), no matter which machine I run it on (namenode or not).
I also tried hosting Dask and Ray clusters on the same set of machines, but I don't think Dask supports this feature: Does Dask communicate with HDFS to optimize for data locality? - Stack Overflow
I haven't found any clues in the documentation either:
- API Reference — fsspec: Wraps pyarrow
- pyarrow.fs.HadoopFileSystem — Apache Arrow
- Filesystem Interface — Apache Arrow
- Issues · apache/arrow: Seems to be no discussion about locality
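One thing I considered is HDFS short-circuit local reads, which (as I understand it) let a client running on the same machine as a replica read the block file directly from disk instead of streaming it through the datanode's TCP socket. A minimal hdfs-site.xml sketch, assuming the standard property names, would look like this (it needs to be set on both the datanodes and the client machines, and the domain socket path must exist and be writable by the datanode user):

```xml
<configuration>
  <!-- Enable reading local replicas directly, bypassing the datanode socket -->
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <!-- Unix domain socket used for the client/datanode handshake -->
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
  </property>
</configuration>
```

But I am not sure whether this is the right lever here, or whether the client is even choosing the local replica in the first place.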