Channel: Active questions tagged python - Stack Overflow

How to use a pandas series of arrays, each containing dataframe indices, to operate on that dataframe

This is from the 1990 California Housing Dataset used in Geron's Hands-On Machine Learning. That might be helpful context, but this is more of a pandas/numpy question. I have a solution, but I'm wondering if there's a better one, as mine feels pretty inelegant.

There are 3 pieces of data involved, each with 16,512 rows. The first is the latitude and longitude of the districts each house is located within:

lat_longs = housing.iloc[:, :2]
lat_longs.head()

|index|longitude|latitude|
|---|---|---|
|13096|-122.42|37.8|
|14973|-118.38|34.14|
|3785|-121.98|38.36|
|14689|-117.11|33.75|
|20507|-118.15|33.77|

The second set is the prices associated with each of those houses:

housing_labels.head()

|index|median_house_value|
|---|---|
|13096|458300.0|
|14973|483800.0|
|3785|101700.0|
|14689|96100.0|
|20507|361800.0|

The third is an array, 16,512 x 5. Each row contains the indices of the 5 closest houses to the house referenced by that row. (I wrapped it in a dataframe to make it easier to display in markdown, but it's a numpy array.)

idx[:5]

index01234
030591266838284611138
15608111080937213446
2539423101146962497
3314935138391140114826
450165510470811889

My goal is to get the median house price of the five closest houses. My solution was the following:

pd.Series(list(idx)).apply(lambda x: np.median(housing_labels.iloc[x]))

|index|0|
|---|---|
|0|500001.0|
|1|386700.0|
|2|111500.0|
|3|96100.0|
|4|306300.0|

(The x in the lambda above is all 5 indices for that row.) Like I said, it worked, but I'm wondering if there's a better, faster (I'm under the impression apply is slow) and/or more elegant solution than what I came up with?
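For comparison, here is a minimal sketch of one direction that might avoid apply: index the underlying NumPy array with the whole 2-D idx at once, then take the median along each row. The small housing_labels and idx below are made-up stand-ins (an assumption, not the real 16,512-row data), and idx is assumed to hold integer positions, as it does with .iloc above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real data: 6 prices and, for each of 3 rows,
# the integer positions (not index labels) of 3 neighbours.
housing_labels = pd.Series(
    [458300.0, 483800.0, 101700.0, 96100.0, 361800.0, 92600.0],
    index=[13096, 14973, 3785, 14689, 20507, 1286],
    name="median_house_value",
)
idx = np.array([[0, 4, 1],
                [3, 2, 5],
                [1, 4, 5]])

# Approach from the question: one Python-level call per row.
via_apply = pd.Series(list(idx)).apply(lambda x: np.median(housing_labels.iloc[x]))

# Vectorized sketch: fancy-indexing with a 2-D integer array returns an array
# of the same shape as idx, so the median can be taken along axis 1.
values = housing_labels.to_numpy()
via_numpy = np.median(values[idx], axis=1)
```

Both results should agree row for row. The vectorized version does the lookup and the reduction inside NumPy rather than in a Python loop, which is usually where the speed difference comes from, though I have not timed it on the full dataset.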

Having a series of arrays, where each array holds the indices that match a criterion of interest for that particular row, seems like a common pattern in data science, and I'd love to have a better solution for it. Any ideas?

-Joe


I was asked for a reproducible example. When I tried the suggested solution on the small dataframe and array below, it gave the correct answer.

For housing_labels:

|index|median_house_value|
|---|---|
|13096|458300.0|
|14973|483800.0|
|3785|101700.0|
|14689|96100.0|
|20507|361800.0|
|1286|92600.0|
|18078|349300.0|
|4396|440900.0|
|18031|160100.0|
|6753|183900.0|

For idx:

idx = np.array([[13096, 20507, 4396],
                [6753, 3785, 14973],
                [14689, 18078, 18031],
                [14973, 20507, 1286]])

Correct output:

[440900. 183900. 160100. 361800.]
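
Note that in this small example idx holds index labels (13096, 20507, ...) rather than integer positions, so a purely positional lookup would not apply directly. Below is a minimal sketch of one way to reproduce the output above, assuming housing_labels is a Series with a unique index (an assumption; it could equally be a one-column dataframe). It is not necessarily the solution that was suggested to me.

```python
import numpy as np
import pandas as pd

# The reproducible example from above, with housing_labels assumed to be a Series.
housing_labels = pd.Series(
    [458300.0, 483800.0, 101700.0, 96100.0, 361800.0,
     92600.0, 349300.0, 440900.0, 160100.0, 183900.0],
    index=[13096, 14973, 3785, 14689, 20507, 1286, 18078, 4396, 18031, 6753],
    name="median_house_value",
)
idx = np.array([[13096, 20507, 4396],
                [6753, 3785, 14973],
                [14689, 18078, 18031],
                [14973, 20507, 1286]])

# idx holds index labels here, so translate labels to integer positions first.
positions = housing_labels.index.get_indexer(idx.ravel()).reshape(idx.shape)
assert (positions >= 0).all()  # get_indexer returns -1 for a label it cannot find

# Look up all prices at once and take the median of each row.
medians = np.median(housing_labels.to_numpy()[positions], axis=1)
print(medians)  # [440900. 183900. 160100. 361800.]
```

If idx already contains positions, as in the full 16,512 x 5 array used with .iloc, the get_indexer step can be skipped and idx used directly.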

