Channel: Active questions tagged python - Stack Overflow

How to get the number of requests in queue in python scrapy?

In the code below:

  • len(self.crawler.engine.slot.scheduler) always returns 0
  • self.crawler.engine.slot.scheduler.stats._stats['scheduler/enqueued'] returns a value that increases with each call: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

I expected the queue size to be high at the start and to decrease as URLs are crawled, so that the value printed before crawling a page is larger than the value printed after it.

Uncommenting the following pagination code shows the same trend of an increasing queue size:

    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)

Note: I have set CONCURRENT_REQUESTS = 1 in the settings.

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes_spider"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
            "https://quotes.toscrape.com/page/3/",
            "https://quotes.toscrape.com/page/4/",
            "https://quotes.toscrape.com/page/5/",
            "https://quotes.toscrape.com/page/6/",
            "https://quotes.toscrape.com/page/7/",
            "https://quotes.toscrape.com/page/8/",
            "https://quotes.toscrape.com/page/9/",
            "https://quotes.toscrape.com/page/10/",
        ]

        def parse(self, response):
            print(f"\n before {self.crawler.engine.slot.scheduler.stats._stats['scheduler/enqueued']} \n\n")
            print(f"\n before2 {len(self.crawler.engine.slot.scheduler)}")  # don't know why it always returns zero

            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }
                next_page = response.css("li.next a::attr(href)").get()
                if next_page is not None:
                    next_page = response.urljoin(next_page)
                    yield scrapy.Request(next_page, callback=self.parse)

            print(f"\n After {self.crawler.engine.slot.scheduler.stats._stats['scheduler/enqueued']} \n\n")
            print(f"\n after2 {len(self.crawler.engine.slot.scheduler)}")  # don't know why it always returns zero
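
The two values printed in the spider above measure different things: scheduler/enqueued is a cumulative counter, so it can only grow, while len(scheduler) reports what is waiting at that moment. A rough way to get a number that can go down is to subtract the matching scheduler/dequeued counter, assuming the scheduler in use maintains it (Scrapy's default scheduler does). The helper name pending_requests and the sample numbers below are illustrative and not part of the original question; this is a minimal sketch, not a definitive way to measure the queue.

```python
from typing import Mapping


def pending_requests(stats: Mapping[str, int]) -> int:
    """Estimate how many requests are still waiting in the scheduler.

    Both counters are cumulative, so neither decreases on its own;
    only their difference can shrink as requests leave the queue.
    """
    enqueued = stats.get("scheduler/enqueued", 0)
    dequeued = stats.get("scheduler/dequeued", 0)
    return enqueued - dequeued


# Hypothetical snapshot of the stats dictionary, for illustration only.
snapshot = {"scheduler/enqueued": 10, "scheduler/dequeued": 7}
print(pending_requests(snapshot))
```

Inside a spider callback this could be called as pending_requests(self.crawler.stats.get_stats()), which reads the counters through the public stats collector instead of the private _stats attribute.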
