This is from the 1990 California Housing Dataset used in Geron's Hands-On Machine Learning. That might be helpful context, but this is more of a pandas/numpy question. I have a solution, but I'm wondering if there's a better one, as mine feels pretty inelegant.
There are 3 pieces of data involved, each with 16,512 rows. The first is the latitude and longitude of the districts each house is located within:
lat_longs = housing.iloc[:, :2]
lat_longs.head()

| index | longitude | latitude |
|---|---|---|
| 13096 | -122.42 | 37.8 |
| 14973 | -118.38 | 34.14 |
| 3785 | -121.98 | 38.36 |
| 14689 | -117.11 | 33.75 |
| 20507 | -118.15 | 33.77 |
The second set is the prices associated with each of those houses:
housing_labels.head()

| index | median_house_value |
|---|---|
| 13096 | 458300.0 |
| 14973 | 483800.0 |
| 3785 | 101700.0 |
| 14689 | 96100.0 |
| 20507 | 361800.0 |
The third is an array, 16,512 x 5. Each row contains the indices of the 5 closest houses to the house referenced by that row. (I wrapped it in a dataframe to make it easier to display in markdown, but it's a numpy array.)
idx[:5]
| index | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 3059 | 1266 | 8382 | 8461 | 1138 |
| 1 | 5608 | 1 | 11080 | 9372 | 13446 |
| 2 | 5394 | 2 | 3101 | 14696 | 2497 |
| 3 | 3 | 14935 | 13839 | 11401 | 14826 |
| 4 | 5016 | 5510 | 4 | 708 | 11889 |
My goal is to get, for each house, the median price of its five closest houses. My solution was the following:
pd.Series(list(idx)).apply(lambda x: np.median(housing_labels.iloc[x]))
| index | 0 |
|---|---|
| 0 | 500001.0 |
| 1 | 386700.0 |
| 2 | 111500.0 |
| 3 | 96100.0 |
| 4 | 306300.0 |
(The x in the lambda above is all 5 indices for that row.) Like I said, it worked, but I'm wondering if there's a better, faster (I'm under the impression apply is slow) and/or more elegant solution than what I came up with?
This pattern of having a series of arrays, where each array holds the indices that match a criterion of interest for that particular row, seems common in data science, and I'd love to have a better solution for it. Any ideas?
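For reference, the per-row lookup in the lambda above can also be written as a single NumPy integer-array indexing step followed by a median along axis 1. This is only a minimal sketch on toy data: the small `housing_labels` and `idx` here are stand-ins built from the first five prices shown above, and the names `values` and `neighbor_medians` are hypothetical. It assumes `idx` holds positional indices, as it does with `.iloc`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for housing_labels (first five prices from the table above).
housing_labels = pd.DataFrame(
    {"median_house_value": [458300.0, 483800.0, 101700.0, 96100.0, 361800.0]}
)

# Toy stand-in for idx: each row lists POSITIONAL indices of neighbouring houses.
idx = np.array([[0, 1, 2],
                [2, 3, 4],
                [0, 3, 4]])

# Pull the prices out as a plain 1-D array, then index it with the 2-D idx.
# values[idx] has the same shape as idx, with each index replaced by its price.
values = housing_labels["median_house_value"].to_numpy()
neighbor_medians = np.median(values[idx], axis=1)

print(neighbor_medians)
```

Because the indexing and the median both run inside NumPy, this avoids calling a Python function once per row, which is where `apply` tends to lose time.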
-Joe
I was asked for a reproducible example. When I tried the solution suggested below on these small dataframes and arrays, it gave the correct answer.

For housing_labels:

| index | median_house_value |
|---|---|
| 13096 | 458300.0 |
| 14973 | 483800.0 |
| 3785 | 101700.0 |
| 14689 | 96100.0 |
| 20507 | 361800.0 |
| 1286 | 92600.0 |
| 18078 | 349300.0 |
| 4396 | 440900.0 |
| 18031 | 160100.0 |
| 6753 | 183900.0 |
For idx:
idx = np.array([
    [13096, 20507, 4396],
    [6753, 3785, 14973],
    [14689, 18078, 18031],
    [14973, 20507, 1286],
])

Correct output:
[440900. 183900. 160100. 361800.]
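Note that in this small example `idx` holds index labels (13096, 20507, ...) rather than row positions, so a positional lookup such as `.iloc` would not apply directly. One way to handle that, sketched here under the assumption that the index labels of `housing_labels` are unique, is to look the labels up with `.loc` on the flattened array and reshape the result back to the shape of `idx`. The names `prices`, `neighbor_prices`, and `label_medians` are hypothetical.

```python
import numpy as np
import pandas as pd

# The reproducible example from above, with the original index labels.
housing_labels = pd.DataFrame(
    {"median_house_value": [458300.0, 483800.0, 101700.0, 96100.0, 361800.0,
                            92600.0, 349300.0, 440900.0, 160100.0, 183900.0]},
    index=[13096, 14973, 3785, 14689, 20507, 1286, 18078, 4396, 18031, 6753],
)

idx = np.array([
    [13096, 20507, 4396],
    [6753, 3785, 14973],
    [14689, 18078, 18031],
    [14973, 20507, 1286],
])

# Label-based lookup on the flattened idx, then restore the 2-D shape.
prices = housing_labels["median_house_value"]
neighbor_prices = prices.loc[idx.ravel()].to_numpy().reshape(idx.shape)

# Median across each row of neighbouring prices.
label_medians = np.median(neighbor_prices, axis=1)

print(label_medians)
```

On this data the result should agree with the correct output listed above.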