Channel: Active questions tagged python - Stack Overflow

Web Scraping with Selenium and Python


I have been struggling to scrape a website with Selenium. The target site contains a responsive table from which I need to collect data. The HTML looks like this (I have altered the parts that contain identifying information, but the structure is the same as the real page):

<table class="items">
  ...
  <tbody>
    <tr class="odd">
      <td class="centered">1</td>
      <td class="centered no-border-right">
        <a title="company 1" name="" href="/company 1/year_id/1970"><img src="https://company_1.com/logo.png">
        </a>
      </td>
      <td class="mainlink no-border-links">
        <a title="company 1" name="" href="/company 1/year_id/1970">company 1
        </a>
      </td>
      <td class="rights mainlink redtext">$270k</td>
      <td class="centered">
        <a href="/company 1/purchase/year_id/1970">5
        </a>
      </td>
      <td class="rights mainlink greentext">-
      </td>
      <td class="centered">
        <a href="/company 1/purchase/year_id/1970">4
        </a>
      </td>
      <td class="rights mainlink"><span class="redtext">$-270k</span>
      </td>
    </tr>
    <tr class="even">
    # these code blocks repeat with different data for 24 times
    ...
  </tbody>
</table>
...

With the help of Gemini, I wrote the following Python code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from fake_useragent import UserAgent
from pandas import DataFrame

option = webdriver.ChromeOptions()
option.add_argument("--headless")
ua = UserAgent()
option.add_argument(f"user-agent={ua.chrome}")
driver = webdriver.Chrome(options=option)

table_class = 'items'
url_expenditure = 'https://target_website.com'
driver.get(url_expenditure)
driver.implicitly_wait(5)

table_element = driver.find_element(By.CLASS_NAME, table_class)
table_data = table_element.find_element(By.TAG_NAME, "tr")
table_data = []
for row in table_element.find_elements(By.TAG_NAME, "tr"):
    row_data = [cell.text.strip() for cell in row.find_elements(By.TAG_NAME, "td")]
    table_data.append(row_data)

driver.quit()
print(table_data)

The result is a list with the correct number of rows and columns, but every cell is an empty string instead of the data: [[], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ...] (the first entry is an empty list, and every remaining row is the same list of eight empty strings).

whereas I expected to see [['1', '', 'company 1', '$270k', '5', '-', '4', '$-270k'], [# next row of data], ...]
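To make the expected result concrete, here is a minimal sketch that extracts the cell text from the sample row above using only Python's standard-library html.parser instead of Selenium. The names SAMPLE_HTML, TableTextParser and parse_table are made up for this illustration, and the markup is the altered sample from this question, not the real page. It only shows the shape of the output I am after; it does not explain why Selenium returns empty strings.

```python
from html.parser import HTMLParser

# One row of the altered sample markup from the question (not the real page).
SAMPLE_HTML = """
<table class="items">
  <tbody>
    <tr class="odd">
      <td class="centered">1</td>
      <td class="centered no-border-right">
        <a title="company 1" name="" href="/company 1/year_id/1970"><img src="https://company_1.com/logo.png">
        </a>
      </td>
      <td class="mainlink no-border-links">
        <a title="company 1" name="" href="/company 1/year_id/1970">company 1
        </a>
      </td>
      <td class="rights mainlink redtext">$270k</td>
      <td class="centered">
        <a href="/company 1/purchase/year_id/1970">5
        </a>
      </td>
      <td class="rights mainlink greentext">-
      </td>
      <td class="centered">
        <a href="/company 1/purchase/year_id/1970">4
        </a>
      </td>
      <td class="rights mainlink"><span class="redtext">$-270k</span>
      </td>
    </tr>
  </tbody>
</table>
"""


class TableTextParser(HTMLParser):
    """Collects the stripped text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text nested inside <a> or <span> also lands here while a cell is open.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_table(SAMPLE_HTML)
print(rows)
```

The logo cell comes out as an empty string because it contains only an image, which matches the second element of the row I expect.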

Please help explain what I have to amend in the code block, thank you!

