Iterate over 10k pages & fetch data, parse: European Volunteering Services: a tiny scraper that collects opportunities from an EU site

I am looking for a public list of volunteering services in Europe. I don't need full addresses, just the name and the website. I am thinking of data in XML or CSV with these fields: name, country, and ideally some additional fields, with one record per country of presence. By the way, the European volunteering services are great options for young people.

Well, I have found a great page that is very comprehensive; I want to gather data on the European volunteering services that are hosted on a European site:

see: https://youth.europa.eu/go-abroad/volunteering/opportunities_en

@HedgeHog showed me the right approach and how to find the correct selectors in this thread: BeautifulSoup iterate over 10k pages & fetch data, parse: European Volunteering-Services: a tiny scraper that collects opportunities from EU-Site

    # Extracting relevant data
    title = soup.h1.get_text(', ', strip=True)
    location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
    start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
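
To make that concrete: applied to a single placement page, a minimal self-contained version might look like this (a sketch only; the URL is one of the example pages below, and the selectors are the ones suggested in the linked thread):

    import requests
    from bs4 import BeautifulSoup

    # One concrete placement page (one of the example URLs below);
    # the selectors come from the linked thread and may change with the site.
    url = "https://youth.europa.eu/solidarity/placement/39020_en"
    soup = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')

    title = soup.h1.get_text(', ', strip=True)
    location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
    start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
    print(title, location, start_date, end_date, sep=' | ')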

But there are several hundred volunteering opportunities there, stored on pages like the following:

    https://youth.europa.eu/solidarity/placement/39020_en
    https://youth.europa.eu/solidarity/placement/38993_en
    https://youth.europa.eu/solidarity/placement/38973_en
    https://youth.europa.eu/solidarity/placement/38972_en
    https://youth.europa.eu/solidarity/placement/38850_en
    https://youth.europa.eu/solidarity/placement/38633_en

idea:

I think it would be awesome to gather the data, e.g. with a scraper based on BS4 and requests, parse it, and subsequently print it as a DataFrame.

Well, I think we could iterate over all the URLs:

    placement/39020_en
    placement/38993_en
    placement/38973_en
    placement/38850_en

Idea: I think we can iterate from zero to 100,000 over the IDs to fetch all the results that are stored under placement. But this idea is not backed by code yet; in other words, at the moment I have no idea how to iterate over such a huge range:

At the moment I think a basic approach would be to start with this:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    # Function to generate placement URLs based on a range of IDs
    def generate_urls(start_id, end_id):
        base_url = "https://youth.europa.eu/solidarity/placement/"
        urls = [base_url + str(id) + "_en" for id in range(start_id, end_id + 1)]
        return urls

    # Function to scrape data from a single URL
    def scrape_data(url):
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            title = soup.h1.get_text(', ', strip=True)
            location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
            start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
            website_tag = soup.find("a", class_="btn__link--website")
            website = website_tag.get("href") if website_tag else None
            return {
                "Title": title,
                "Location": location,
                "Start Date": start_date,
                "End Date": end_date,
                "Website": website,
                "URL": url
            }
        else:
            print(f"Failed to fetch data from {url}. Status code: {response.status_code}")
            return None

    # Set the range of placement IDs we want to scrape
    start_id = 1
    end_id = 100000

    # Generate placement URLs
    urls = generate_urls(start_id, end_id)

    # Scrape data from all URLs
    data = []
    for url in urls:
        placement_data = scrape_data(url)
        if placement_data:
            data.append(placement_data)

    # Convert data to DataFrame
    df = pd.DataFrame(data)

    # Print DataFrame
    print(df)

which gives me back the following:

    Failed to fetch data from https://youth.europa.eu/solidarity/placement/154_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/156_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/157_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/159_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/161_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/162_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/163_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/165_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/166_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/169_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/170_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/171_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/173_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/174_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/176_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/177_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/178_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/179_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/180_en. Status code: 404
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-5-d6272ee535ef> in <cell line: 42>()
         41 data = []
         42 for url in urls:
    ---> 43     placement_data = scrape_data(url)
         44     if placement_data:
         45         data.append(placement_data)

    <ipython-input-5-d6272ee535ef> in scrape_data(url)
         16         title = soup.h1.get_text(', ', strip=True)
         17         location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
    ---> 18         start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
         19         website_tag = soup.find("a", class_="btn__link--website")
         20         website = website_tag.get("href") if website_tag else None

    ValueError: not enough values to unpack (expected 2, got 0)

Any idea?
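
One guess of mine, not verified: some IDs seem to return a 200 page that is not a regular placement page, so soup.select('span.extra strong') comes back empty and the two-value unpacking fails. A defensive sketch of scrape_data (assuming pages missing the expected elements can simply be skipped) could look like this:

    import requests
    from bs4 import BeautifulSoup

    # Defensive variant of scrape_data(); assumption: pages that lack the
    # expected title/location/date elements are not real placement pages
    # and can be skipped instead of crashing the loop.
    def scrape_data(url):
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            print(f"Failed to fetch data from {url}. Status code: {response.status_code}")
            return None
        soup = BeautifulSoup(response.content, 'html.parser')
        location_tag = soup.select_one('p:has(i.fa-location-arrow)')
        dates = [e.get_text(strip=True) for e in soup.select('span.extra strong')]
        if soup.h1 is None or location_tag is None or len(dates) < 2:
            return None  # not a regular placement page, skip it
        website_tag = soup.find("a", class_="btn__link--website")
        return {
            "Title": soup.h1.get_text(', ', strip=True),
            "Location": location_tag.get_text(', ', strip=True),
            "Start Date": dates[-2],
            "End Date": dates[-1],
            "Website": website_tag.get("href") if website_tag else None,
            "URL": url,
        }

With 100,000 IDs it would probably also make sense to reuse a requests.Session and sleep briefly between requests, so the server is not hammered.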

See the base URL: https://youth.europa.eu/go-abroad/volunteering/opportunities_en

