Channel: Active questions tagged python - Stack Overflow

Issue in finding nearest values in dataframes for different variables (based on location and conditions)


I have a combined_vars_df plus surge_df, wave_df, and waterlevel_df for 235 counties on the "Atlantic" and "GulfOfMexico" coasts. A few sample rows of each are given below. For each date in combined_vars_df, I select the same date in surge_df, wave_df, and waterlevel_df, compute the distance (using the haversine package) from longitude_precip and latitude_precip to the latitude and longitude of every surge, wave, and water level point, pick the nearest point in each dataset, and add its percentile and other columns to combined_vars_df. I use the following conditions when computing distances and selecting the nearest location and its percentiles:

Conditions:

  • if surge_points = Yes, find the closest storm surge data point, with its corresponding columns, to that location and add it to combined_vars_df

  • if wave_points = Yes, find the closest wave data point, with its percentiles, to that location and add it to combined_vars_df

  • if surge_points = Yes, find the closest water level data point, with its corresponding percentiles, to that location and add it to combined_vars_df
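The conditions above can be sketched roughly as follows. This is a minimal illustration in plain Python rather than pandas: the helper names `haversine_km` and `nearest_station` are hypothetical, the great-circle formula is written out by hand instead of calling the haversine package, and the station rows are a few of the sample rows shown in this question.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (mean Earth radius of 6371.0088 km assumed)."""
    radius_km = 6371.0088
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_station(point, stations, flag, lat_key, lon_key):
    """Return the closest station row, or None when the flag is not 'Yes'."""
    if point[flag] != "Yes" or not stations:
        return None
    return min(
        stations,
        key=lambda s: haversine_km(
            point["latitude_precip"], point["longitude_precip"], s[lat_key], s[lon_key]
        ),
    )


# First precipitation point from the sample (surge_points = Yes, wave_points = No).
point = {"latitude_precip": 37.525, "longitude_precip": -75.925,
         "surge_points": "Yes", "wave_points": "No"}
surge_rows = [
    {"latitude_surge": 44.634, "longitude_surge": -66.694, "surge_percentiles": 0.7},
    {"latitude_surge": 25.913, "longitude_surge": -81.753, "surge_percentiles": 0.26},
    {"latitude_surge": 39.067, "longitude_surge": -74.956, "surge_percentiles": 0.786666667},
]
wave_rows = [{"latitude": 23.5, "longitude": -83.5, "waveHs_percentiles": 0.137897756}]

nearest_surge = nearest_station(point, surge_rows, "surge_points",
                                "latitude_surge", "longitude_surge")
nearest_wave = nearest_station(point, wave_rows, "wave_points", "latitude", "longitude")
```

Returning the whole row rather than a single percentile keeps the other columns of the nearest station available for copying.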

Also, I choose which wave_df to use based on whether coast is "GulfOfMexico" or "Atlantic", because I have two wave datasets: one for all the counties on the "Atlantic" coast and another for "GulfOfMexico".
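A small sketch of that selection, assuming the two wave datasets are kept in a dictionary keyed by coast name. The function `select_wave_data` and the placeholder values are hypothetical. The sketch also shows a common pitfall: written literally as the Python expression `coast == "GulfOfMexico" or "Atlantic"`, the condition is truthy for every value of coast.

```python
def select_wave_data(coast, wave_by_coast):
    """Return the wave dataset registered for `coast`; fail loudly on anything else."""
    if coast not in wave_by_coast:
        raise ValueError(f"Unknown coast: {coast!r}")
    return wave_by_coast[coast]


# Placeholder strings stand in for the two wave DataFrames.
wave_by_coast = {"Atlantic": "wave_atlantic_df", "GulfOfMexico": "wave_gulf_df"}
chosen = select_wave_data("GulfOfMexico", wave_by_coast)

# Pitfall: `x == "GulfOfMexico" or "Atlantic"` is truthy for any x, because
# the non-empty string "Atlantic" is evaluated on its own after the `or`.
always_truthy = bool("Pacific" == "GulfOfMexico" or "Atlantic")

# A membership test expresses the intended check.
is_known = "Pacific" in ("GulfOfMexico", "Atlantic")
```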

I want to add all columns from surge, wave and waterlevel data and their values based on nearest locations. For example, creating new columns "Date_surge", "latitude_surge", "longitude_surge", "surge_percentiles", "Date_waterlevel", "latitude_waterlevel", "longitude_waterlevel", "waterlevel_percentiles", "Date_wave", "latitude_wave", "longitude_wave", "waveHs_percentiles" in combined_vars_df.
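To make the target layout concrete, here is a minimal sketch of copying every column of a nearest row into a combined row. It assumes plain dictionaries instead of DataFrames; `merge_nearest` and `WAVE_RENAME` are hypothetical names. The wave sample uses unsuffixed names ("latitude", "longitude", "Date"), so a rename map is applied to arrive at the "_wave" names listed above.

```python
# Hypothetical mapping from the wave sample's column names to the desired ones.
WAVE_RENAME = {"Date": "Date_wave", "latitude": "latitude_wave", "longitude": "longitude_wave"}


def merge_nearest(combined_row, nearest_row, rename=None):
    """Return a copy of combined_row extended with all fields of nearest_row."""
    rename = rename or {}
    merged = dict(combined_row)
    for key, value in nearest_row.items():
        merged[rename.get(key, key)] = value
    return merged


combined_row = {"Date_precip": "1/1/1980", "latitude_precip": 37.525, "longitude_precip": -75.925}
nearest_wave_row = {"latitude": 23.5, "longitude": -83.5, "station_name": 61001,
                    "Date": "1/1/1980 23:00", "waveHs_percentiles": 0.137897756}

merged = merge_nearest(combined_row, nearest_wave_row, WAVE_RENAME)
```

In pandas the same renaming could be done with `DataFrame.rename(columns=...)` before combining the frames.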

combined_vars_df:

Time    longitude_precip    latitude_precip    PRCP    Percentiles_precip    Date_precip    longitude_soil_moisture    latitude_soil_moisture    swvl1    Percentiles_soil_moisture    Date_soil_moisture    time    longitude_discharge    latitude_discharge    dis24    Percentiles_discharge    Date_discharge    surge_points    wave_points
1/1/1980    -75.925    37.525    6.96E-08    0.351962956    1/1/1980    -75.925    37.525    0.015750297    0.424171425    1/1/1980    1/1/1980    -75.925    37.525    0.31250007    0.404584409    1/1/1980    Yes    No
1/1/1980    -75.775    37.625    6.96E-08    0.359496898    1/1/1980    -75.775    37.625    0.09928183    0.425004301    1/1/1980    1/1/1980    -75.775    37.625    0.18750007    0.340279939    1/1/1980    Yes    No
1/1/1980    -75.475    37.875    6.96E-08    0.354216519    1/1/1980    -75.475    37.875    0.22655162    0.392982732    1/1/1980    1/1/1980    -75.475    37.875    0.10937507    0.441995239    1/1/1980    Yes    No
1/1/1980    -75.725    37.975    6.96E-08    0.376004926    1/1/1980    -75.725    37.975    0.25292561    0.398858982    1/1/1980    1/1/1980    -75.725    37.975    0.46875007    0.454758892    1/1/1980    Yes    No
1/1/1980    -75.475    37.925    6.96E-08    0.356204666    1/1/1980    -75.475    37.925    0.23879522    0.391525624    1/1/1980    1/1/1980    -75.475    37.925    0.39062507    0.383258258    1/1/1980    Yes    No
1/1/1980    -75.625    37.625    6.96E-08    0.351704216    1/1/1980    -75.625    37.625    0.09588592    0.425028448    1/1/1980    1/1/1980    -75.625    37.625    0.03125007    0.184426202    1/1/1980    Yes    No
1/1/1980    -75.575    37.925    6.96E-08    0.360239297    1/1/1980    -75.575    37.925    0.24256245    0.396662633    1/1/1980    1/1/1980    -75.575    37.925    0.17187507    0.321553118    1/1/1980    Yes    No
1/1/1980    -75.675    37.725    6.96E-08    0.35538695    1/1/1980    -75.675    37.725    0.16938361    0.424660739    1/1/1980    1/1/1980    -75.675    37.725    0.17187507    0.201397005    1/1/1980    Yes    No

surge_df:

Date_surge    waterlevel_surge    latitude_surge    longitude_surge    surge    waterlevel (tide)_surge    surge_percentiles
1/1/1980    2107    44.634    -66.694    4.00000007    2103    0.7
1/1/1980    519    25.913    -81.753    -66.99999993    586    0.26
1/1/1980    106    41.235    -70.034    8.00000007    98    0.61
1/1/1980    1222    43.931    -69.302    -20.99999993    1243    0.595
1/1/1980    819    39.067    -74.956    46.00000007    773    0.786666667
1/1/1980    60.00000003    29.634    -91.567    -156.9999999    217    0.05
1/1/1980    156    29.78    -85.415    -92.99999993    249    0.1325
1/1/1980    1253    43.696    -70.21    -10.99999993    1264    0.6125
1/1/1980    631    30.308    -81.372    -109.9999999    741    0.225
1/1/1980    85.00000003    30.278    -88.755    -221.9999999    307    0.05
1/1/1980    159    29.048    -89.106    -76.99999993    236    0.105
1/1/1980    156    29.019    -89.282    -79.99999993    236    0.095
1/1/1980    100    30.22    -88.462    -184.9999999    285    0.05

waterlevel_df:

Date_waterlevel    waterlevel    latitude_waterlevel    longitude_waterlevel    surge_waterlevel    waterlevel (tide)_waterlevel    waterlevel_percentile
1/1/1980    2107    44.634    -66.694    4.00000007    2103    0.833571429
1/1/1980    519    25.913    -81.753    -66.99999993    586    0.515
1/1/1980    106    41.235    -70.034    8.00000007    98    0.4175
1/1/1980    1222    43.931    -69.302    -20.99999993    1243    0.80875
1/1/1980    819    39.067    -74.956    46.00000007    773    0.828333333
1/1/1980    60.00000003    29.634    -91.567    -156.9999999    217    0.05
1/1/1980    156    29.78    -85.415    -92.99999993    249    0.3175
1/1/1980    1253    43.696    -70.21    -10.99999993    1264    0.84125
1/1/1980    631    30.308    -81.372    -109.9999999    741    0.611666667
1/1/1980    85.00000003    30.278    -88.755    -221.9999999    307    0.083333333
1/1/1980    159    29.048    -89.106    -76.99999993    236    0.3425
1/1/1980    156    29.019    -89.282    -79.99999993    236    0.3275

wave_df:

latitude    longitude    station_name    waveHs    Date    waveHs_percentiles
23.5    -83.5    61001    0.164062585    1/1/1980 23:00    0.137897756
17.5    -72    61030    0.664062572    1/1/1980 23:00    0.05
42.416672    -70.583298    63051    0.515625073    1/1/1980 23:00    0.209301515
29.1667    -80.75    63428    1.343750072    1/1/1980 23:00    0.65999999
36.5    -75.583298    63208    1.054687572    1/1/1980 23:00    0.435000004
34.666698    -76.25    63271    0.750000072    1/1/1980 23:00    0.124999995
32.833302    -79.25    63340    1.390625072    1/1/1980 23:00    0.689999989
39.416698    -74.083298    63139    0.593750072    1/1/1980 23:00    0.089999991
41.083328    -70.083298    63092    0.437500072    1/1/1980 23:00    0.05
31.25    -81    63387    1.062500072    1/1/1980 23:00    0.594030164

I am using the following code, but the surge, wave, and water level columns come back empty. Could you please help me solve this issue?

from datetime import timedelta

import pandas as pd
from haversine import haversine, Unit
from tqdm import tqdm


def find_nearest_percentiles(
    combined_vars_df, surge_df, wave_df, waterlevel_df, coast, sheldus_df, window_size=3
):
    """
    Update the combined_vars_df DataFrame with the nearest percentile values of storm surge, wave, and water level
    based on the conditions of 'surge_points' and 'wave_points'. If 'surge_points' or 'wave_points' is 'No',
    the respective entry will be set to a null value. Additionally, consider the coast condition to differentiate
    between Gulf of Mexico and Atlantic datasets.

    Parameters:
    -----------
    combined_vars_df : pd.DataFrame
        DataFrame containing combined variables data with 'longitude_precip', 'latitude_precip', 'Date_precip',
        'surge_points', and 'wave_points' columns.
    surge_df : pd.DataFrame
        DataFrame containing storm surge data with 'longitude_surge', 'latitude_surge', 'Date_surge',
        and 'surge_percentiles' columns.
    wave_df : pd.DataFrame
        DataFrame containing wave data with 'longitude_wave', 'latitude_wave', 'Date_wave',
        and 'waveHs_percentiles' columns.
    waterlevel_df : pd.DataFrame
        DataFrame containing water level data with 'longitude_waterlevel', 'latitude_waterlevel', 'Date_waterlevel',
        and 'waterlevel_percentiles' columns.
    coast : str
        Coast information of the county ('GulfOfMexico' or 'Atlantic') to apply specific conditions.
    sheldus_df : pd.DataFrame
        DataFrame containing SHELDUS hazard data with 'Hazard_start' and 'Hazard_end' columns.
    window_size : int, optional
        The number of days to consider before and after the hazard start and end dates (default is 3).

    Returns:
    --------
    pd.DataFrame
        The updated combined_vars_df with new columns 'nearest_surge', 'nearest_wave', and 'nearest_waterlevel'
        containing the nearest percentile values based on the specified conditions and coast.
    """
    print("Finding the nearest percentile values for storm surge, wave, and water level...")

    # Initialize new columns for the nearest percentile values
    combined_vars_df["nearest_surge"] = None
    combined_vars_df["nearest_wave"] = None
    combined_vars_df["nearest_waterlevel"] = None

    # Convert date columns to datetime format for comparison
    combined_vars_df["Date_precip"] = pd.to_datetime(combined_vars_df["Date_precip"])
    surge_df["Date_surge"] = pd.to_datetime(surge_df["Date_surge"])
    wave_df["Date_wave"] = pd.to_datetime(wave_df["Date_wave"])
    waterlevel_df["Date_waterlevel"] = pd.to_datetime(waterlevel_df["Date_waterlevel"])

    # Remove Time from the Date column in wave_df and keep only the date if there are time values present
    wave_df["Date_wave"] = wave_df["Date_wave"].dt.date

    # Limit latitude and longitude to 4 decimal places for comparison
    surge_df["latitude_surge"] = surge_df["latitude_surge"].round(4)
    surge_df["longitude_surge"] = surge_df["longitude_surge"].round(4)
    wave_df["latitude_wave"] = wave_df["latitude_wave"].round(4)
    wave_df["longitude_wave"] = wave_df["longitude_wave"].round(4)
    waterlevel_df["latitude_waterlevel"] = waterlevel_df["latitude_waterlevel"].round(4)
    waterlevel_df["longitude_waterlevel"] = waterlevel_df["longitude_waterlevel"].round(4)
    combined_vars_df["latitude_precip"] = combined_vars_df["latitude_precip"].round(4)
    combined_vars_df["longitude_precip"] = combined_vars_df["longitude_precip"].round(4)

    # Subset the dataframes based on the SHELDUS hazard start and end dates
    output_combined_vars_df = pd.DataFrame()
    output_surge_df = pd.DataFrame()
    output_wave_df = pd.DataFrame()
    output_waterlevel_df = pd.DataFrame()

    for i in range(len(sheldus_df)):
        # Get the start and end dates for the hazard event and the window dates
        start_date = sheldus_df["Hazard_start"].iloc[i]
        end_date = sheldus_df["Hazard_end"].iloc[i]
        first_window_date = start_date - timedelta(days=window_size)
        last_window_date = end_date + timedelta(days=window_size)
        window_dates = pd.date_range(first_window_date, last_window_date)

        # Subset the dataframes based on the window dates
        window_combined_vars = combined_vars_df.loc[
            combined_vars_df["Date_precip"].isin(window_dates)
        ]
        window_surge = surge_df.loc[surge_df["Date_surge"].isin(window_dates)]
        window_wave = wave_df.loc[wave_df["Date_wave"].isin(window_dates)]
        window_waterlevel = waterlevel_df.loc[
            waterlevel_df["Date_waterlevel"].isin(window_dates)
        ]

        # Concatenate the subsetted dataframes
        output_combined_vars_df = pd.concat(
            [output_combined_vars_df, window_combined_vars], axis=0
        )
        output_surge_df = pd.concat([output_surge_df, window_surge], axis=0)
        output_wave_df = pd.concat([output_wave_df, window_wave], axis=0)
        output_waterlevel_df = pd.concat(
            [output_waterlevel_df, window_waterlevel], axis=0
        )

    # Update the dataframes with the subsetted data
    combined_vars_df = output_combined_vars_df
    surge_df = output_surge_df
    wave_df = output_wave_df
    waterlevel_df = output_waterlevel_df

    # Reset the index of the combined_vars_df
    combined_vars_df.reset_index(drop=True, inplace=True)

    # Print the combined_vars_df
    print("Combined variables data for all sheldus events: ", combined_vars_df)

    # Print the surge_df and wave_df to check if "Date_surge" and "Date_wave" columns are present in the data
    print("Surge data: ", surge_df.head())
    print("Wave data: ", wave_df.head())

    # # Save the subsetted data to a CSV file
    # combined_vars_df.to_csv(f"{base_path}data/combined_vars_df_sheldus_dates_{county_name}_{FIPS}.csv", index=False)

    # Calculate bounding box for the county
    buffer = 0.25  # Buffer in degrees to include a little more area
    min_lon, max_lon = (
        combined_vars_df["longitude_precip"].min() - buffer,
        combined_vars_df["longitude_precip"].max() + buffer,
    )
    min_lat, max_lat = (
        combined_vars_df["latitude_precip"].min() - buffer,
        combined_vars_df["latitude_precip"].max() + buffer,
    )

    # Filter surge, wave, and water level data within the bounding box
    surge_filtered = surge_df[
        (surge_df["longitude_surge"] >= min_lon)
        & (surge_df["longitude_surge"] <= max_lon)
        & (surge_df["latitude_surge"] >= min_lat)
        & (surge_df["latitude_surge"] <= max_lat)
    ]
    wave_filtered = wave_df[
        (wave_df["longitude_wave"] >= min_lon)
        & (wave_df["longitude_wave"] <= max_lon)
        & (wave_df["latitude_wave"] >= min_lat)
        & (wave_df["latitude_wave"] <= max_lat)
    ]
    waterlevel_filtered = waterlevel_df[
        (waterlevel_df["longitude_waterlevel"] >= min_lon)
        & (waterlevel_df["longitude_waterlevel"] <= max_lon)
        & (waterlevel_df["latitude_waterlevel"] >= min_lat)
        & (waterlevel_df["latitude_waterlevel"] <= max_lat)
    ]

    # Print the filtered surge, wave, and water level data
    print("Filtered surge data: ", surge_filtered.head())
    print("Filtered wave data: ", wave_filtered.head())
    print("Filtered water level data: ", waterlevel_filtered.head())

    # Function to find the nearest percentile value for a given row and condition
    def find_nearest_percentile(
        row, event_df, date_col, lat_col, lon_col, percentile_col
    ):
        """
        Find the nearest percentile value for a given row and condition.

        Parameters:
        -----------
        row : pd.Series
            A single row from the combined_vars_df DataFrame containing the reference point.
        event_df : pd.DataFrame
            DataFrame containing event data (e.g., surge, wave, or water level) with date, latitude, longitude,
            and percentile columns.
        date_col : str
            Name of the column in event_df representing the date.
        lat_col : str
            Name of the column in event_df representing the latitude.
        lon_col : str
            Name of the column in event_df representing the longitude.
        percentile_col : str
            Name of the column in event_df representing the percentile values.

        Returns:
        --------
        float or None
            The nearest percentile value if found, or None if no matching date is found in event_df.
        """
        try:
            if row["Date_precip"].date() not in event_df[date_col].dt.date.values:
                return None
            event_same_date = event_df.loc[
                event_df[date_col].dt.date == row["Date_precip"].date()
            ]
            distances = event_same_date.apply(
                lambda event: haversine(
                    (row["latitude_precip"], row["longitude_precip"]),
                    (event[lat_col], event[lon_col]),
                    unit=Unit.KILOMETERS,
                ),
                axis=1,
            )
            nearest_index = distances.idxmin()
            return event_same_date.loc[nearest_index, percentile_col]
        except KeyError as e:
            print(f"KeyError occurred: {str(e)}")
            return None
        except Exception as e:
            print(f"An error occurred: {str(e)}")
            return None

    # Apply the function to each row based on the conditions and coast
    for index, row in tqdm(combined_vars_df.iterrows(), total=len(combined_vars_df)):
        if row["surge_points"] == "Yes":
            combined_vars_df.at[index, "nearest_surge"] = find_nearest_percentile(
                row,
                surge_filtered,
                "Date_surge",
                "latitude_surge",
                "longitude_surge",
                "surge_percentiles",
            )
            combined_vars_df.at[index, "nearest_waterlevel"] = find_nearest_percentile(
                row,
                waterlevel_filtered,
                "Date_waterlevel",
                "latitude_waterlevel",
                "longitude_waterlevel",
                "waterlevel_percentiles",
            )
        if row["wave_points"] == "Yes":
            if coast == "GulfOfMexico":
                combined_vars_df.at[index, "nearest_wave"] = find_nearest_percentile(
                    row,
                    wave_filtered,
                    "Date_wave",
                    "latitude_wave",
                    "longitude_wave",
                    "waveHs_percentiles",
                )
            elif coast == "Atlantic":
                combined_vars_df.at[index, "nearest_wave"] = find_nearest_percentile(
                    row,
                    wave_filtered,
                    "Date_wave",
                    "latitude_wave",
                    "longitude_wave",
                    "waveHs_percentiles",
                )

    return combined_vars_df
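One check that may help narrow down the empty columns: the function keeps only stations inside the bounding box of the precipitation points plus a 0.25 degree buffer, so if no station falls inside that box, every lookup returns None. The sketch below applies the same box to the sample coordinates shown in this question. It uses plain Python and a hypothetical helper `in_box`; the full datasets may of course contain closer stations than the few rows shown here.

```python
# (latitude, longitude) of the sample precipitation points.
precip_points = [
    (37.525, -75.925), (37.625, -75.775), (37.875, -75.475), (37.975, -75.725),
    (37.925, -75.475), (37.625, -75.625), (37.925, -75.575), (37.725, -75.675),
]
# (latitude, longitude) of the sample surge stations.
surge_stations = [
    (44.634, -66.694), (25.913, -81.753), (41.235, -70.034), (43.931, -69.302),
    (39.067, -74.956), (29.634, -91.567), (29.78, -85.415), (43.696, -70.21),
    (30.308, -81.372), (30.278, -88.755), (29.048, -89.106), (29.019, -89.282),
    (30.22, -88.462),
]

buffer = 0.25  # same buffer, in degrees, as in the function above
min_lat = min(lat for lat, _ in precip_points) - buffer
max_lat = max(lat for lat, _ in precip_points) + buffer
min_lon = min(lon for _, lon in precip_points) - buffer
max_lon = max(lon for _, lon in precip_points) + buffer


def in_box(lat, lon):
    """True when the point lies inside the buffered bounding box."""
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


inside = [station for station in surge_stations if in_box(*station)]
print(f"{len(inside)} of {len(surge_stations)} sample surge stations fall inside the box")
```

For these sample rows no station survives the filter, which would be consistent with the empty surge and water level columns. Separately, it may be worth confirming that `wave_df["Date_wave"]` is still a datetime column inside find_nearest_percentile: after `.dt.date` it holds plain date objects, and calling `.dt` on such an object-dtype column raises an AttributeError, which the broad `except Exception` would turn into None.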
