I am trying to convert a GeoDataFrame to a polars DataFrame with `from_pandas`, but I receive an `ArrowTypeError: Did not pass numpy.dtype object` exception.
The expected outcome would be a polars DataFrame with the `geometry` column typed as `pl.Object`.
I'm aware of https://github.com/geopolars/geopolars (alpha) and https://github.com/pola-rs/polars/issues/1830 and would be OK with the shapely objects just being represented as pl.Object for now.
Here is a minimal example to demonstrate the problem:

    ## Minimal example displaying the issue
    import geopandas as gpd
    print("geopandas version: ", gpd.__version__)
    import geodatasets
    print("geodatasets version: ", geodatasets.__version__)
    import polars as pl
    print("polars version: ", pl.__version__)

    gdf = gpd.GeoDataFrame.from_file(geodatasets.get_path("nybb"))
    print("\nOriginal GeoDataFrame")
    print(gdf.dtypes)
    print(gdf.head())

    print("\nGeoDataFrame to Polars without geometry")
    print(pl.from_pandas(gdf.drop("geometry", axis=1)).head())

    try:
        print("\nGeoDataFrame to Polars naiive")
        print(pl.from_pandas(gdf).head())
    except Exception as e:
        print(e)

    try:
        print("\nGeoDataFrame to Polars with schema override")
        print(pl.from_pandas(gdf, schema_overrides={"geometry": pl.Object}).head())
    except Exception as e:
        print(e)

    # again to print stack trace
    pl.from_pandas(gdf).head()

Output

    geopandas version: 0.14.4
    geodatasets version: 2023.12.0
    polars version: 0.20.23

    Original GeoDataFrame
    BoroCode         int64
    BoroName        object
    Shape_Leng     float64
    Shape_Area     float64
    geometry      geometry
    dtype: object
       BoroCode       BoroName     Shape_Leng    Shape_Area  \
    0         5  Staten Island  330470.010332  1.623820e+09
    1         4         Queens  896344.047763  3.045213e+09
    2         3       Brooklyn  741080.523166  1.937479e+09
    3         1      Manhattan  359299.096471  6.364715e+08
    4         2          Bronx  464392.991824  1.186925e+09

                                                geometry
    0  MULTIPOLYGON (((970217.022 145643.332, 970227....
    1  MULTIPOLYGON (((1029606.077 156073.814, 102957...
    2  MULTIPOLYGON (((1021176.479 151374.797, 102100...
    3  MULTIPOLYGON (((981219.056 188655.316, 980940....
    4  MULTIPOLYGON (((1012821.806 229228.265, 101278...

    GeoDataFrame to Polars without geometry
    shape: (5, 4)
    ┌──────────┬───────────────┬───────────────┬────────────┐
    │ BoroCode ┆ BoroName      ┆ Shape_Leng    ┆ Shape_Area │
    │ ---      ┆ ---           ┆ ---           ┆ ---        │
    │ i64      ┆ str           ┆ f64           ┆ f64        │
    ╞══════════╪═══════════════╪═══════════════╪════════════╡
    │ 5        ┆ Staten Island ┆ 330470.010332 ┆ 1.6238e9   │
    │ 4        ┆ Queens        ┆ 896344.047763 ┆ 3.0452e9   │
    │ 3        ┆ Brooklyn      ┆ 741080.523166 ┆ 1.9375e9   │
    │ 1        ┆ Manhattan     ┆ 359299.096471 ┆ 6.3647e8   │
    │ 2        ┆ Bronx         ┆ 464392.991824 ┆ 1.1869e9   │
    └──────────┴───────────────┴───────────────┴────────────┘

    GeoDataFrame to Polars naiive
    Did not pass numpy.dtype object

    GeoDataFrame to Polars with schema override
    Did not pass numpy.dtype object

Stack trace (the same with and without `schema_overrides`):

    ---------------------------------------------------------------------------
    ArrowTypeError                            Traceback (most recent call last)
    Cell In[59], line 27
         24     print(e)
         26 # again to print stack trace
    ---> 27 pl.from_pandas(gdf).head()

    File c:\Users\...\polars\convert.py:571, in from_pandas(data, schema_overrides, rechunk, nan_to_null, include_index)
        568     return wrap_s(pandas_to_pyseries("", data, nan_to_null=nan_to_null))
        569 elif isinstance(data, pd.DataFrame):
        570     return wrap_df(
    --> 571         pandas_to_pydf(
        572             data,
        573             schema_overrides=schema_overrides,
        574             rechunk=rechunk,
        575             nan_to_null=nan_to_null,
        576             include_index=include_index,
        577         )
        578     )
        579 else:
        580     msg = f"expected pandas DataFrame or Series, got {type(data).__name__!r}"

    File c:\Users\...\polars\_utils\construction\dataframe.py:1032, in pandas_to_pydf(data, schema, schema_overrides, strict, rechunk, nan_to_null, include_index)
       1025     arrow_dict[str(idxcol)] = plc.pandas_series_to_arrow(
       1026         data.index.get_level_values(idxcol),
       1027         nan_to_null=nan_to_null,
       1028         length=length,
       1029     )
       1031 for col in data.columns:
    -> 1032     arrow_dict[str(col)] = plc.pandas_series_to_arrow(
       1033         data[col], nan_to_null=nan_to_null, length=length
       1034     )
       1036 arrow_table = pa.table(arrow_dict)
       1037 return arrow_to_pydf(
       1038     arrow_table,
       1039     schema=schema,
       (...)
       1042     rechunk=rechunk,
       1043 )

    File c:\Users\...\polars\_utils\construction\other.py:97, in pandas_series_to_arrow(values, length, nan_to_null)
         95     return pa.array(values, from_pandas=nan_to_null)
         96 elif dtype:
    ---> 97     return pa.array(values, from_pandas=nan_to_null)
         98 else:
         99     # Pandas Series is actually a Pandas DataFrame when the original DataFrame
        100     # contains duplicated columns and a duplicated column is requested with df["a"].
        101     msg = "duplicate column names found: "

    File c:\Users\...\pyarrow\array.pxi:323, in pyarrow.lib.array()
    File c:\Users\...\pyarrow\array.pxi:79, in pyarrow.lib._ndarray_to_array()
    File c:\Users\...\pyarrow\array.pxi:67, in pyarrow.lib._ndarray_to_type()
    File c:\Users\...\pyarrow\error.pxi:123, in pyarrow.lib.check_status()

    ArrowTypeError: Did not pass numpy.dtype object
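
For reference, the traceback shows the error being raised inside `pa.array(values, ...)` while the `geometry` column is processed, so one way around it might be to keep that column away from `pl.from_pandas` altogether. The sketch below shows two possible workarounds and is not a confirmed polars or geopandas recommendation. The function names `geodataframe_to_polars_wkb` and `geodataframe_to_polars_object` are hypothetical helpers made up for illustration. The first assumes `GeoDataFrame.to_wkb()` returns a plain pandas DataFrame with the geometries encoded as WKB bytes. The second assumes `pl.Series(..., dtype=pl.Object)` accepts a list of shapely objects.

```python
def geodataframe_to_polars_wkb(gdf):
    """Convert a GeoDataFrame to polars, storing geometries as WKB bytes.

    Hypothetical helper: the geometry column is expected to arrive in
    polars as a binary column rather than as pl.Object.
    """
    # Imported lazily so this module loads without the optional dependency.
    import polars as pl

    # to_wkb() replaces the geometry dtype with ordinary Python bytes,
    # which pyarrow can convert without knowing about geopandas.
    return pl.from_pandas(gdf.to_wkb())


def geodataframe_to_polars_object(gdf):
    """Convert a GeoDataFrame to polars, keeping shapely objects as pl.Object.

    Hypothetical helper: the non-geometry columns go through from_pandas
    (which works in the question), and the geometry column is attached
    separately as an Object series.
    """
    import polars as pl

    geom_name = gdf.geometry.name  # name of the active geometry column
    attributes = pl.from_pandas(gdf.drop(geom_name, axis=1))
    geometries = pl.Series(geom_name, list(gdf.geometry), dtype=pl.Object)
    return attributes.with_columns(geometries)
```

With the `gdf` from the minimal example, these would be called as `geodataframe_to_polars_wkb(gdf)` or `geodataframe_to_polars_object(gdf)`. If the WKB variant behaves as assumed, the geometries could be restored later with `shapely.from_wkb`. The `pl.Object` variant is closer to the expected outcome described above, though Object columns support only a limited set of polars operations.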