Channel: Active questions tagged python - Stack Overflow

Crawled pages but scraped 0 items; content fetched through scrapy shell differs from the HTML in dev tools


I'm new to the Scrapy framework. I would like to scrape the news from https://www.snl24.com/dailysun/news, which loads more articles through infinite scroll. I have run into two issues over the past few days.

  1. Because new content loads through infinite scroll, I looked in the dev tools and found that the request URL is https://www.snl24.com/dailysun/news?pagenumber={page_number}, ending with the specific page number. The same URL can also be found in the HTML of the previous page.


    The HTML shown in the dev tools looks fine and is consistent with what appears on the site, but the content I get by fetching a URL ending in ?pagenumber=4 (for example) in the Scrapy shell is confusing and completely different from what the dev tools show.

  2. If I run the code below, the log shows that pages are crawled successfully, but no items are scraped. The output file contains only a small part of the data, and the spider keeps crawling without outputting any more data.

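The URL pattern from item 1 can be sketched as follows. This is only an illustration under stated assumptions: the helper names `build_page_url` and `resolve_scroll_url` are hypothetical, and the sample relative path passed to `resolve_scroll_url` is a guess at what an `hx-get` value might look like, not something confirmed from the site.

```python
from urllib.parse import urljoin

# Base listing URL taken from the question.
BASE_URL = "https://www.snl24.com/dailysun/news"


def build_page_url(page_number: int) -> str:
    """Hypothetical helper: URL of one infinite-scroll page."""
    return f"{BASE_URL}?pagenumber={page_number}"


def resolve_scroll_url(current_url: str, relative_url: str) -> str:
    """Hypothetical helper: resolve a relative link against the current page URL."""
    return urljoin(current_url, relative_url)


print(build_page_url(4))
# The relative path below is an assumed example value, not real site data.
print(resolve_scroll_url(BASE_URL, "/dailysun/news?pagenumber=2"))
```

Resolving with `urljoin` avoids a doubled slash if the relative value already starts with `/`, which plain string concatenation onto `https://www.snl24.com/` would produce. Inside a spider, `response.urljoin(...)` or `response.follow(...)` performs the same resolution against the response URL.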

import scrapy, time
from scrapy_lyntest.items import ScrapyLyntestItem

class DailysunSpider(scrapy.Spider):
    name = "DailySun"
    allowed_domains = ["www.snl24.com"]
    start_urls = ['https://www.snl24.com/dailysun/news']

    def parse(self, response):
        newss = response.css('div.article-item--container')
        for news in newss:
            inner_news_relative_url = news.css('a[data-event-name="article_link"]::attr(href)').get()
            inner_news_url = 'https://www.snl24.com'+ inner_news_relative_url
            yield response.follow(inner_news_url, callback = self.parse_news_page)
        if response.css('div.loader-indicator').attrib['class'] == 'loader-indicator':
            scroll_relative_url = response.css('div[hx-trigger="revealed"]::attr(hx-get)').get()
            scroll_url = 'https://www.snl24.com/'+ scroll_relative_url
            yield scrapy.Request(url = scroll_url, callback=self.parse)

    def parse_news_page(self, response):
        newspage = response.css('div.article.tf-lhs-col')
        news_item = ScrapyLyntestItem()
        news_item["url"] = response.url,
        news_item["title"] = newspage.css('h1.article__title::text').get(),
        news_item["content"] = newspage.css('div.article__body.NewsArticle p::text').getall(),
        news_item["date"] = newspage.css('p.article__date::text').get()
        yield news_item
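One detail in the spider above that may be worth checking: the assignments to "url", "title" and "content" end with a trailing comma, which in Python turns the right-hand side into a one-element tuple. The sketch below shows the effect with a plain dict standing in for `ScrapyLyntestItem`; the dict and the sample values are hypothetical. This would probably not explain zero scraped items on its own, but it does change the shape of the exported fields.

```python
# A plain dict stands in for ScrapyLyntestItem (hypothetical stand-in).
item = {}

# Trailing comma: the stored value is a 1-tuple, not a string.
item["url"] = "https://example.com/article",

# No trailing comma: the stored value is the string itself.
item["date"] = "2024-01-01"

print(type(item["url"]).__name__)   # tuple
print(type(item["date"]).__name__)  # str
```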

I tried json.loads(response.text) on the response fetched from the URL ending in pagenumber={}, but it raises JSONDecodeError: Expecting value: line 2 column 1 (char 1), so I switched to using CSS selectors directly.
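That error message is what `json.loads` produces when the text starts with a newline followed by something that is not JSON, for example an HTML fragment. The sketch below reproduces the message with a made-up HTML string and shows one way to test whether a body is JSON before choosing a parser. The helper `try_parse_json` and the sample strings are hypothetical, and the actual response body from the site may differ.

```python
import json
from typing import Any, Optional


def try_parse_json(text: str) -> Optional[Any]:
    """Hypothetical helper: return parsed JSON, or None if text is not JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Made-up body: a leading newline followed by an HTML fragment.
html_like_body = "\n<div class=\"article-item--container\">...</div>"

try:
    json.loads(html_like_body)
except json.JSONDecodeError as exc:
    error_message = str(exc)
    print(error_message)  # Expecting value: line 2 column 1 (char 1)

print(try_parse_json(html_like_body))  # None, so fall back to CSS selectors
print(try_parse_json('{"page": 4}'))
```

If the body is an HTML fragment, as the error suggests it might be, parsing it with `response.css(...)` rather than `json.loads` is the natural fit.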

Need help. Thanks a lot.

