Iterate over 10k pages & fetch data, parse: European Volunteering Services: a tiny scraper that collects opportunities from an EU site

I am looking for a public list of volunteering services in Europe. I don't need full addresses, just the name and the website. I am thinking of data in XML or CSV with these fields: name, country, and ideally some additional fields, with one record per country of presence. By the way, the European volunteering services are great options for young people.

Well, I have found a great page that is very comprehensive; I want to gather data on the European volunteering services that are hosted on a European site:

see: https://youth.europa.eu/go-abroad/volunteering/opportunities_en

@HedgeHog showed me the right approach and how to find the correct selectors in this thread: BeautifulSoup iterate over 10k pages & fetch data, parse: European Volunteering-Services: a tiny scraper that collects opportunities from EU-Site

    # Extracting relevant data
    title = soup.h1.get_text(', ', strip=True)
    location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
    start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
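
To make that concrete: applied to a single placement page, a minimal self-contained version might look like this (a sketch only; the URL is one of the example pages below, and the selectors are the ones suggested in the linked thread):

    import requests
    from bs4 import BeautifulSoup

    # One concrete placement page (one of the example URLs below);
    # the selectors come from the linked thread and may change with the site.
    url = "https://youth.europa.eu/solidarity/placement/39020_en"
    soup = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')

    title = soup.h1.get_text(', ', strip=True)
    location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
    start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
    print(title, location, start_date, end_date, sep=' | ')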

But there are several hundred volunteering opportunities there, stored on pages like the following:

    https://youth.europa.eu/solidarity/placement/39020_en
    https://youth.europa.eu/solidarity/placement/38993_en
    https://youth.europa.eu/solidarity/placement/38973_en
    https://youth.europa.eu/solidarity/placement/38972_en
    https://youth.europa.eu/solidarity/placement/38850_en
    https://youth.europa.eu/solidarity/placement/38633_en

idea:

I think it would be awesome to gather the data, e.g. with a scraper based on BS4 and requests, parse it, and subsequently print it as a DataFrame.

Well, I think we could iterate over all the URLs:

    placement/39020_en
    placement/38993_en
    placement/38973_en
    placement/38850_en

Idea: I think we can iterate from zero to 100,000 over the IDs to fetch all the results that are stored under placement. But this idea is not backed by code yet; in other words, at the moment I have no idea how to iterate over such a huge range:

At the moment I think a basic approach would be to start with this:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    # Function to generate placement URLs based on a range of IDs
    def generate_urls(start_id, end_id):
        base_url = "https://youth.europa.eu/solidarity/placement/"
        urls = [base_url + str(id) + "_en" for id in range(start_id, end_id + 1)]
        return urls

    # Function to scrape data from a single URL
    def scrape_data(url):
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            title = soup.h1.get_text(', ', strip=True)
            location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
            start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
            website_tag = soup.find("a", class_="btn__link--website")
            website = website_tag.get("href") if website_tag else None
            return {
                "Title": title,
                "Location": location,
                "Start Date": start_date,
                "End Date": end_date,
                "Website": website,
                "URL": url
            }
        else:
            print(f"Failed to fetch data from {url}. Status code: {response.status_code}")
            return None

    # Set the range of placement IDs we want to scrape
    start_id = 1
    end_id = 100000

    # Generate placement URLs
    urls = generate_urls(start_id, end_id)

    # Scrape data from all URLs
    data = []
    for url in urls:
        placement_data = scrape_data(url)
        if placement_data:
            data.append(placement_data)

    # Convert data to DataFrame
    df = pd.DataFrame(data)

    # Print DataFrame
    print(df)

which gives me back the following:

    Failed to fetch data from https://youth.europa.eu/solidarity/placement/154_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/156_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/157_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/159_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/161_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/162_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/163_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/165_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/166_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/169_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/170_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/171_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/173_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/174_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/176_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/177_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/178_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/179_en. Status code: 404
    Failed to fetch data from https://youth.europa.eu/solidarity/placement/180_en. Status code: 404
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-5-d6272ee535ef> in <cell line: 42>()
         41 data = []
         42 for url in urls:
    ---> 43     placement_data = scrape_data(url)
         44     if placement_data:
         45         data.append(placement_data)

    <ipython-input-5-d6272ee535ef> in scrape_data(url)
         16         title = soup.h1.get_text(', ', strip=True)
         17         location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
    ---> 18         start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
         19         website_tag = soup.find("a", class_="btn__link--website")
         20         website = website_tag.get("href") if website_tag else None

    ValueError: not enough values to unpack (expected 2, got 0)

Any idea?
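
One guess of mine, not verified: some IDs seem to return a 200 page that is not a regular placement page, so soup.select('span.extra strong') comes back empty and the two-value unpacking fails. A defensive sketch of scrape_data (assuming pages missing the expected elements can simply be skipped) could look like this:

    import requests
    from bs4 import BeautifulSoup

    # Defensive variant of scrape_data(); assumption: pages that lack the
    # expected title/location/date elements are not real placement pages
    # and can be skipped instead of crashing the loop.
    def scrape_data(url):
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            print(f"Failed to fetch data from {url}. Status code: {response.status_code}")
            return None
        soup = BeautifulSoup(response.content, 'html.parser')
        location_tag = soup.select_one('p:has(i.fa-location-arrow)')
        dates = [e.get_text(strip=True) for e in soup.select('span.extra strong')]
        if soup.h1 is None or location_tag is None or len(dates) < 2:
            return None  # not a regular placement page, skip it
        website_tag = soup.find("a", class_="btn__link--website")
        return {
            "Title": soup.h1.get_text(', ', strip=True),
            "Location": location_tag.get_text(', ', strip=True),
            "Start Date": dates[-2],
            "End Date": dates[-1],
            "Website": website_tag.get("href") if website_tag else None,
            "URL": url,
        }

With 100,000 IDs it would probably also make sense to reuse a requests.Session and sleep briefly between requests, so the server is not hammered.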

See the base URL: https://youth.europa.eu/go-abroad/volunteering/opportunities_en

