I would like to scrape restaurants and their popular dishes from Yelp.
At the beginning, I was scraping Yelp with Scrapy smoothly, but after a while every request started returning 503 Service Unavailable. I also can no longer open the Yelp website in my browser.
In the Scrapy settings, I've set my own user agent and set ROBOTSTXT_OBEY to False. I then tried the fake-useragent package, but that failed as well.
I've now emailed the Yelp team to ask for permission to access the pages. But I'm also concerned that if I scrape a lot of data later, I may get blocked by Yelp again. It looks like rotating user agents doesn't work.
Code in settings.py:
from fake_useragent import UserAgent

ua = UserAgent()

BOT_NAME = "yelp"

SPIDER_MODULES = ["yelp.spiders"]
NEWSPIDER_MODULE = "yelp.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
#USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
USER_AGENT = ua.random

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
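One thing I noticed while debugging: USER_AGENT = ua.random in settings.py is evaluated only once, when the settings module is loaded, so every request in a run still goes out with the same user agent. As far as I understand, rotating per request has to happen in a downloader middleware instead. Here is an untested sketch of what I mean (the class name RandomUserAgentMiddleware and the priority value 400 are my own choices):

# middlewares.py -- untested sketch of per-request user-agent rotation
from fake_useragent import UserAgent


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a fresh random User-Agent on every request."""

    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is sent.
        request.headers["User-Agent"] = self.ua.random
        return None  # let Scrapy continue processing the request

# settings.py -- enable the middleware
DOWNLOADER_MIDDLEWARES = {
    "yelp.middlewares.RandomUserAgentMiddleware": 400,
}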
Code in the spider:
from fake_useragent import UserAgent

ua = UserAgent()

def parse(self, response):
    pagination_text = response.css('.css-1aq64zd .css-chan6m::text').get()
    if pagination_text:
        total_pages = int(pagination_text.split(' of ')[-1])
        for page_number in range(total_pages):
            start_value = page_number * 10
            #next_page_url = f"https://www.yelp.com/search?find_desc=&find_loc=New+York%2C+NY&start={start_value}"
            #next_page_url = f"https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&start={start_value}"
            next_page_url = response.url + "&start={}".format(start_value)
            yield response.follow(next_page_url, callback=self.parse_page,
                                  headers={"User-Agent": ua.random})
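In case it helps anyone answer: since the 503s only appeared after crawling for a while, I'm also considering slowing the crawl down so I don't get rate-limited in the first place. My understanding is that Scrapy's built-in download delay and AutoThrottle settings are meant for this; the exact numbers below are just my guesses, not tested values:

# settings.py -- throttling sketch; the delay/concurrency values are guesses
DOWNLOAD_DELAY = 5                  # base delay (seconds) between requests
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay (0.5x to 1.5x)
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request to yelp.com at a time

AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

RETRY_ENABLED = True
RETRY_HTTP_CODES = [503]            # retry the throttled responses
RETRY_TIMES = 3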