I am looking for a public list of volunteering services in Europe. I don't need full addresses, just the name and the website. I am thinking of data in XML or CSV with at least these fields: name, country, website (some additional fields would be nice), one record per country of presence. By the way: the European volunteering services are great options for young people.
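Roughly the shape I have in mind (just the header plus a placeholder row to show the format; the values are made up):

name,country,website
Example Volunteering Service,DE,https://example.org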
Well, I have found a page that is very comprehensive; I want to gather data on the European volunteering services that are hosted on a European site:
see: https://youth.europa.eu/go-abroad/volunteering/opportunities_en
@HedgeHog showed me the right approach and how to find the correct selectors in this thread: BeautifulSoup iterate over 10k pages & fetch data, parse: European Volunteering-Services: a tiny scraper that collects opportunities from the EU site
# Extracting relevant data
title = soup.h1.get_text(', ', strip=True)
location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
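For a single placement page, those selectors give me something like this (a minimal sketch; I just took one of the placement IDs listed below and kept HedgeHog's selectors as they are):

import requests
from bs4 import BeautifulSoup

# one of the placement pages listed below (assuming it is still online)
url = "https://youth.europa.eu/solidarity/placement/39020_en"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

title = soup.h1.get_text(', ', strip=True)
location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])

print(title, location, start_date, end_date, sep=' | ')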
But there are several hundred volunteering opportunities there, each stored on its own page like the following:
https://youth.europa.eu/solidarity/placement/39020_en
https://youth.europa.eu/solidarity/placement/38993_en
https://youth.europa.eu/solidarity/placement/38973_en
https://youth.europa.eu/solidarity/placement/38972_en
https://youth.europa.eu/solidarity/placement/38850_en
https://youth.europa.eu/solidarity/placement/38633_en
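One thought: instead of guessing IDs, maybe the placement URLs could be collected from the overview page itself. This is only a sketch and assumes the listing exposes plain <a href> links to /solidarity/placement/..., which I have not verified (the results may well be loaded via JavaScript, in which case this finds nothing):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# assumption (not verified): the overview page links to placements with plain anchor tags
overview = "https://youth.europa.eu/go-abroad/volunteering/opportunities_en"
soup = BeautifulSoup(requests.get(overview).content, "html.parser")

placement_urls = sorted({
    urljoin(overview, a["href"])
    for a in soup.select('a[href*="/solidarity/placement/"]')
})
print(len(placement_urls), placement_urls[:5])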
Idea:
I think it would be awesome to gather the data with a scraper based on BS4 and requests, parse it, and then print the data in a DataFrame.
Well, I think that we could iterate over all the URLs:
placement/39020_en
placement/38993_en
placement/38973_en
placement/38850_en
Idea: I think that we can iterate over the IDs from zero to 100 000 to fetch all the results that are stored under /placement/. But this idea is not backed by code yet; in other words, at the moment I do not know how best to iterate over such a large range.
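Since that would mean up to 100 000 requests, I assume I should at least reuse a connection, set a timeout and pause between requests. This is just a sketch of what I mean, not something I have tested:

import time
import requests

session = requests.Session()  # reuse one connection pool instead of opening a new one per request

def fetch(url):
    # assumption: a timeout plus a short pause keeps the load on the site reasonable
    response = session.get(url, timeout=10)
    time.sleep(0.5)
    return response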
At the moment I think a basic approach would be to start with this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to generate placement URLs based on a range of IDs
def generate_urls(start_id, end_id):
    base_url = "https://youth.europa.eu/solidarity/placement/"
    urls = [base_url + str(id) + "_en" for id in range(start_id, end_id + 1)]
    return urls

# Function to scrape data from a single URL
def scrape_data(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.h1.get_text(', ', strip=True)
        location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
        start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
        website_tag = soup.find("a", class_="btn__link--website")
        website = website_tag.get("href") if website_tag else None
        return {
            "Title": title,
            "Location": location,
            "Start Date": start_date,
            "End Date": end_date,
            "Website": website,
            "URL": url,
        }
    else:
        print(f"Failed to fetch data from {url}. Status code: {response.status_code}")
        return None

# Set the range of placement IDs we want to scrape
start_id = 1
end_id = 100000

# Generate placement URLs
urls = generate_urls(start_id, end_id)

# Scrape data from all URLs
data = []
for url in urls:
    placement_data = scrape_data(url)
    if placement_data:
        data.append(placement_data)

# Convert data to DataFrame
df = pd.DataFrame(data)

# Print DataFrame
print(df)
which gives me back the following output:
Failed to fetch data from https://youth.europa.eu/solidarity/placement/154_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/156_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/157_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/159_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/161_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/162_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/163_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/165_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/166_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/169_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/170_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/171_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/173_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/174_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/176_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/177_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/178_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/179_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/180_en. Status code: 404
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-d6272ee535ef> in <cell line: 42>()
     41 data = []
     42 for url in urls:
---> 43     placement_data = scrape_data(url)
     44     if placement_data:
     45         data.append(placement_data)

<ipython-input-5-d6272ee535ef> in scrape_data(url)
     16         title = soup.h1.get_text(', ', strip=True)
     17         location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
---> 18         start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
     19         website_tag = soup.find("a", class_="btn__link--website")
     20         website = website_tag.get("href") if website_tag else None

ValueError: not enough values to unpack (expected 2, got 0)
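My guess is that some pages return 200 but no longer contain the span.extra strong elements, so the generator yields nothing and the unpacking fails. A guard like this would probably avoid the crash, but I am not sure it is the right fix (just a sketch):

# guard against placement pages that return 200 but lack the date spans
strongs = soup.select('span.extra strong')
if len(strongs) >= 2:
    start_date, end_date = (e.get_text(strip=True) for e in strongs[-2:])
else:
    start_date = end_date = None  # probably an expired placement or a different page layout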
Any idea?
See the base URL: https://youth.europa.eu/go-abroad/volunteering/opportunities_en