I'm new to the Scrapy framework. I would like to scrape the news from https://www.snl24.com/dailysun/news, which loads more news through infinite scroll. I have run into two issues over the past few days.
Because new content is loaded by infinite scroll, I looked in the dev tools and found that the request URL is `https://www.snl24.com/dailysun/news?pagenumber={page_number}`, ending with a specific page number. The same URL can also be found in the HTML of the previous page. The HTML shown in the dev tools looks fine and is consistent with what appears on the site, but the content I get when fetching a URL ending in `?pagenumber=4` (for example) is confusing and differs from what the dev tools show.

If I run the code below, the log says the pages are crawled successfully, but no items are scraped. The output file contains only a small part of the data, and the spider keeps crawling without outputting any more data.
    import scrapy, time
    from scrapy_lyntest.items import ScrapyLyntestItem

    class DailysunSpider(scrapy.Spider):
        name = "DailySun"
        allowed_domains = ["www.snl24.com"]
        start_urls = ['https://www.snl24.com/dailysun/news']

        def parse(self, response):
            newss = response.css('div.article-item--container')
            for news in newss:
                inner_news_relative_url = news.css('a[data-event-name="article_link"]::attr(href)').get()
                inner_news_url = 'https://www.snl24.com'+ inner_news_relative_url
                yield response.follow(inner_news_url, callback = self.parse_news_page)

            if response.css('div.loader-indicator').attrib['class'] == 'loader-indicator':
                scroll_relative_url = response.css('div[hx-trigger="revealed"]::attr(hx-get)').get()
                scroll_url = 'https://www.snl24.com/'+ scroll_relative_url
                yield scrapy.Request(url = scroll_url, callback=self.parse)

        def parse_news_page(self, response):
            newspage = response.css('div.article.tf-lhs-col')
            news_item = ScrapyLyntestItem()
            news_item["url"] = response.url,
            news_item["title"] = newspage.css('h1.article__title::text').get(),
            news_item["content"] = newspage.css('div.article__body.NewsArticle p::text').getall(),
            news_item["date"] = newspage.css('p.article__date::text').get()
            yield news_item
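One thing I am unsure about is how the scroll URL gets built from the `hx-get` attribute. Here is a minimal sketch comparing plain string concatenation with `urllib.parse.urljoin`. It assumes `hx-get` holds a root-relative path; the value below is a guess for illustration, not copied from the site.

```python
from urllib.parse import urljoin

page_url = "https://www.snl24.com/dailysun/news"
# Assumed hx-get value; the real attribute on the site may differ.
scroll_relative_url = "/dailysun/news?pagenumber=2"

# Concatenation as in my spider: a leading slash produces a double slash.
concatenated = "https://www.snl24.com/" + scroll_relative_url

# urljoin resolves the path against the page URL instead.
joined = urljoin(page_url, scroll_relative_url)

print(concatenated)  # https://www.snl24.com//dailysun/news?pagenumber=2
print(joined)        # https://www.snl24.com/dailysun/news?pagenumber=2
```

Inside a spider, `response.urljoin(...)` or passing the relative URL straight to `response.follow(...)` performs the same resolution, so the hard-coded domain prefix may not be needed at all.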
I tried `json.loads(response.text)` on what I get from fetching the URL ending in `pagenumber={}`, but it raises `JSONDecodeError: Expecting value: line 2 column 1 (char 1)`, so I tried using CSS selectors directly.
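To show what I mean, here is a minimal sketch that reproduces the same error message with the standard library alone. It assumes the body starts with a newline followed by an HTML fragment rather than JSON; `sample_body` and the helper `try_parse_json` are made up for illustration and are not the site's real response.

```python
import json

# Hypothetical body: a newline followed by an HTML fragment (not the site's real markup).
sample_body = '\n<div class="article-item--container"><a href="/dailysun/news/example">Example</a></div>'


def try_parse_json(text):
    """Return (data, None) on success, or (None, error message) if text is not JSON."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, str(exc)


data, error = try_parse_json(sample_body)
print(error)  # Expecting value: line 2 column 1 (char 1)
```

If that assumption holds, the paginated response would be HTML rather than JSON, which would explain why `json.loads` fails on it.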
Need help. Thanks a lot.