Channel: Active questions tagged python - Stack Overflow

Scraping Data from Framework Documentation and Preparing for Fine-Tuning Gemma 2b Model [closed]


I'm working on a project where I aim to scrape data from the documentation of frameworks like Vue.js and prepare it for fine-tuning the Gemma 2b model. I've written a Python script that uses BeautifulSoup to scrape the data, but I'm unsure whether this is the right approach, and how to proceed with cleaning and formatting the scraped data for fine-tuning.

Code Overview:

I have implemented a Python script that performs the following tasks:

  1. Crawls through the documentation website starting from a given URL.
  2. Extracts useful information from each page, filtering out elements without specific classes.
  3. Saves the scraped data to a text file.

Code for scraping:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

def get_internal_links(url, domain):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    internal_links = set()
    for link in soup.find_all('a', href=True):
        absolute_link = urljoin(url, link['href'])
        parsed_link = urlparse(absolute_link)
        if parsed_link.netloc == domain:
            internal_links.add(absolute_link)
    return internal_links

def crawler(domain, start_url):
    visited = set()
    queue = [start_url]
    useful_data = []
    while queue:
        current_url = queue.pop(0)
        if current_url in visited:
            continue
        visited.add(current_url)
        try:
            response = requests.get(current_url)
            soup = BeautifulSoup(response.text, 'html.parser')
            # Extract useful information
            elements_without_class = soup.find_all(lambda tag: not tag.has_attr('class'))
            for element in elements_without_class:
                if element.text.strip():
                    useful_data.append(element.text.strip())
            # Get all internal links in the current page
            internal_links = get_internal_links(current_url, domain)
            # Add new internal links to the queue
            for link in internal_links:
                if link not in visited:
                    queue.append(link)
        except Exception as e:
            print(f"Error crawling {current_url}: {e}")
    return useful_data

# Usage
start_url = input("Enter the URL of the website you want to scrape: ")
parsed_url = urlparse(start_url)
domain = parsed_url.netloc
useful_data = crawler(domain, start_url)

# Save useful data to a text file
output_file = 'scraped_data.txt'
with open(output_file, 'w', encoding='utf-8') as f:
    for data in useful_data:
        f.write(data + '\n')
print(f"Scraped data saved to {output_file}")
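As a first cleaning step after the crawl, my current idea (an assumption on my part, not a proven approach) is to drop blank lines and exact duplicates from the scraped text while preserving order:

```python
# Minimal sketch of one cleaning idea: remove blank lines and exact duplicate
# entries from the scraped output, keeping the first occurrence of each line.
def deduplicate(lines):
    # dict.fromkeys keeps only the first occurrence of each entry, in insertion order
    return list(dict.fromkeys(line.strip() for line in lines if line.strip()))

sample = ["Vue.js", "Guide", "Vue.js", "", "Guide", "API Reference"]
print(deduplicate(sample))  # ['Vue.js', 'Guide', 'API Reference']
```

This only handles exact duplicates; near-duplicates and irrelevant boilerplate would still need separate filtering.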

Questions:

  1. Is the approach outlined in the provided Python script suitable for scraping data from documentation websites such as Vue.js?
  2. How should I clean the scraped data to remove duplicates and irrelevant content? My goal is to create a CSV dataset of question (prompt) and answer pairs from the scraped data. How can I achieve this formatting?
  3. What are the best practices for preparing a dataset for fine-tuning the Gemma 2b model using documentation from frameworks like Vue.js?
I would greatly appreciate any guidance or suggestions on how to proceed with these tasks. Thank you for your assistance!
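For reference, the CSV layout I have in mind for question 2 looks like this. The pairs below are hypothetical placeholders and the filename is my own choice; the sketch only illustrates the file structure, not how the pairs would actually be generated from the scraped text:

```python
import csv

# Hypothetical prompt/answer pairs -- placeholders to show the intended CSV layout.
pairs = [
    ("What directive does Vue.js use for two-way binding?", "v-model"),
    ("How do you register a global component in Vue.js?", "app.component('name', {...})"),
]

# 'finetune_dataset.csv' is an assumed filename
with open('finetune_dataset.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['prompt', 'answer'])  # header row
    writer.writerows(pairs)
```

The open question for me is how to turn the raw scraped text into such pairs in the first place.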

