I'm working on a project where I scrape data from the documentation of frameworks like Vue.js and prepare it for fine-tuning the Gemma 2B model. I've written a Python script using BeautifulSoup to scrape the data, but I'm unsure whether this is the right approach, and how to proceed with cleaning and formatting the scraped data for fine-tuning.
Code Overview:
I have implemented a Python script that performs the following tasks:
- Crawls through the documentation website starting from a given URL.
- Extracts useful information from each page, filtering out elements without specific classes.
- Saves the scraped data to a text file.
Code for scraping:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

def get_internal_links(url, domain):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    internal_links = set()
    for link in soup.find_all('a', href=True):
        absolute_link = urljoin(url, link['href'])
        parsed_link = urlparse(absolute_link)
        if parsed_link.netloc == domain:
            internal_links.add(absolute_link)
    return internal_links

def crawler(domain, start_url):
    visited = set()
    queue = [start_url]
    useful_data = []
    while queue:
        current_url = queue.pop(0)
        if current_url in visited:
            continue
        visited.add(current_url)
        try:
            response = requests.get(current_url, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')
            # Extract useful information: keep text from elements without a class attribute
            elements_without_class = soup.find_all(lambda tag: not tag.has_attr('class'))
            for element in elements_without_class:
                if element.text.strip():
                    useful_data.append(element.text.strip())
            # Get all internal links on the current page
            internal_links = get_internal_links(current_url, domain)
            # Add new internal links to the queue
            for link in internal_links:
                if link not in visited:
                    queue.append(link)
        except Exception as e:
            print(f"Error crawling {current_url}: {e}")
    return useful_data

# Usage
start_url = input("Enter the URL of the website you want to scrape: ")
parsed_url = urlparse(start_url)
domain = parsed_url.netloc
useful_data = crawler(domain, start_url)

# Save useful data to a text file
output_file = 'scraped_data.txt'
with open(output_file, 'w', encoding='utf-8') as f:
    for data in useful_data:
        f.write(data + '\n')
print(f"Scraped data saved to {output_file}")
```

Questions:
- Is the approach outlined in the provided Python script suitable for scraping data from documentation websites such as Vue.js?
- How should I clean the scraped data to remove duplicates and irrelevant content?
- My goal is to create a CSV dataset of question (prompt) and answer pairs from the scraped data. How can I achieve this formatting?
- What are the best practices for preparing a dataset for fine-tuning the Gemma 2b model using documentation from frameworks like Vue.js?
I would greatly appreciate any guidance or suggestions on how to proceed with these tasks. Thank you for your assistance!
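For reference, here is a rough sketch of the cleaning and CSV step I've been considering. The length threshold in `clean_data` and the placeholder-question heuristic in `to_qa_pairs` are my own guesses, not an established method; in practice the questions would presumably need to be hand-written or generated by a stronger model:

```python
import csv

def clean_data(lines):
    """Collapse whitespace, drop very short fragments (nav labels,
    stray headings) and exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for line in lines:
        text = ' '.join(line.split())
        # 30 chars is an arbitrary cutoff for "too short to be useful"
        if len(text) < 30 or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

def to_qa_pairs(paragraphs):
    """Naive pairing: each paragraph becomes an answer, with a
    placeholder question built from its opening words."""
    pairs = []
    for text in paragraphs:
        stub = ' '.join(text.split()[:6])
        pairs.append((f"What does the documentation say about: {stub}...?", text))
    return pairs

def build_csv(in_path, out_path):
    """Read the scraper's text output and write a question/answer CSV."""
    with open(in_path, encoding='utf-8') as f:
        lines = f.read().splitlines()
    pairs = to_qa_pairs(clean_data(lines))
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['question', 'answer'])
        writer.writerows(pairs)
```

After running the scraper, something like `build_csv('scraped_data.txt', 'qa_dataset.csv')` would produce the two-column CSV. Does this kind of pipeline make sense, or is there a better-established way to derive prompt/answer pairs from documentation text?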