Channel: Active questions tagged python - Stack Overflow

Scraping Yelp, but 503 Service Unavailable


I would like to scrape restaurants and their popular dishes from Yelp.

At the beginning, I was scraping Yelp with Scrapy smoothly. After a while, though, the requests started failing with 503 Service Unavailable, and I could no longer access the Yelp website either.

In the Scrapy settings, I've set my own user agent and set ROBOTSTXT_OBEY to False. I have now also tried the fake-useragent package, but that failed as well.

I've emailed the Yelp team to ask for permission to access the site again. But I'm also concerned: if I scrape a lot of data later, what happens if Yelp blocks me again? It looks like rotating user agents does not work.
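A 503 that appears only after a stretch of successful requests may point to rate-based blocking rather than a user-agent problem, in which case slowing the crawl could matter more than rotating headers. Below is a minimal sketch of throttling options for settings.py. The setting names are standard Scrapy settings, but every value is an assumption chosen for illustration, not a threshold known to work for Yelp.

```python
# Hypothetical throttling values for settings.py; tune them for the target site.

# Respect robots.txt instead of disabling the check.
ROBOTSTXT_OBEY = True

# Wait about 2 seconds between requests to the same site; Scrapy randomizes
# the actual wait around this value when RANDOMIZE_DOWNLOAD_DELAY is True.
DOWNLOAD_DELAY = 2.0
RANDOMIZE_DOWNLOAD_DELAY = True

# Send only one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adapt the delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Retry a few times on 503 and 429 rather than failing immediately.
RETRY_TIMES = 3
RETRY_HTTP_CODES = [503, 429]
```

These settings only reduce the request rate; whether that is enough to avoid a block depends on the site.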

Code in settings.py:

from fake_useragent import UserAgent

ua = UserAgent()

BOT_NAME = "yelp"

SPIDER_MODULES = ["yelp.spiders"]
NEWSPIDER_MODULE = "yelp.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
#USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
USER_AGENT = ua.random

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
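Note that `USER_AGENT = ua.random` is evaluated once, when the settings module is loaded, so it yields a single user agent for the whole run. One way to vary the header on every request, including the start requests, is a small downloader middleware. The following is a minimal sketch under stated assumptions: the class name `RotateUserAgentMiddleware` is hypothetical, and the list simply reuses the two user-agent strings from the commented-out settings lines.

```python
import random

# Assumed pool of user agents, taken from the commented-out settings lines.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
]


class RotateUserAgentMiddleware:
    """Hypothetical downloader middleware: picks a user agent per request."""

    def process_request(self, request, spider):
        # Overwrite the header on each outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to continue processing the request.
        return None
```

To try it, the class would be registered under `DOWNLOADER_MIDDLEWARES` in settings.py, for example as `{"yelp.middlewares.RotateUserAgentMiddleware": 400}`; the module path and the priority are assumptions. Header rotation on its own may not prevent a block if the site limits by IP address or request rate.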

Code in the spider:

def parse(self, response):
    pagination_text = response.css('.css-1aq64zd .css-chan6m::text').get()
    if pagination_text:
        total_pages = int(pagination_text.split(' of ')[-1])
        for page_number in range(total_pages):
            start_value = page_number * 10
            #next_page_url = f"https://www.yelp.com/search?find_desc=&find_loc=New+York%2C+NY&start={start_value}"
            #next_page_url = f"https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&start={start_value}"
            next_page_url = response.url + "&start={}".format(start_value)
            yield response.follow(next_page_url, callback=self.parse_page, headers={"User-Agent": ua.random})
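Building the next URL with `response.url + "&start=..."` assumes the URL already has a query string and no `start` parameter; otherwise the result is malformed or carries two `start` values. A minimal sketch of a safer way to set the parameter, using only the standard library, is shown here. The helper name `with_start` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_start(url, start):
    """Return url with its 'start' query parameter set to the given offset."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous 'start'.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "start"
    ]
    query.append(("start", str(start)))
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA"
page_urls = [with_start(base, page_number * 10) for page_number in range(3)]
```

In the spider, `with_start(response.url, start_value)` could stand in for the string concatenation.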
