Channel: Active questions tagged python - Stack Overflow

How do I clean html code with multiple unwanted newlines using Python?


I have a lot of HTML pages that have somehow ended up with many unwanted newline characters: the tags sit on separate lines, and some sentences are split at apparently random points. Here is an example of what I am dealing with:

    <html>
    <head>
    <title>
    One of many
    </title>
    </head>
    <body>
    <h1>
    Spam is not ham
    </h1>
    <p>
    Many plates of Spam
    </p>
    <p>
    Use the Fry option to properly cook the
    Spam
    until done.
    </p>
    <p>
    Enquiries for more recipes can be made through the
    Feed Me
    option.
    </p>
    </body>
    </html>

I had partial success with the opening tags by using str.replace() in this code:

    html_filename = 'page.htm'
    f = open(html_filename, encoding="utf-8")
    file_str = f.readlines()
    f.close()
    with open(html_filename, 'w', encoding="utf-8") as f:
        for line in file_str:
            if '<h1>\n' in line:
                tmp = line.replace('<h1>\n', '<h1>')
                f.write(tmp)
            elif '<p>\n' in line:
                tmp = line.replace('<p>\n', '<p>')
                f.write(tmp)
            else:
                f.write(line)

and get the following result:

    <html>
    <head>
    <title>
    One of many
    </title>
    </head>
    <body>
    <h1>Spam is not ham
    </h1>
    <p>Many plates of Spam
    </p>
    <p>Use the Fry option to properly cook the
    Spam
    until done.
    </p>
    <p>Enquiries for more recipes can be made through the
    Feed Me
    option.
    </p>
    </body>
    </html>

However, I can't figure out how to resolve the lines with just text or the lines with just an end tag.
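
One possible way to handle the text-only lines and the end-tag lines is to stop working line by line and decide how to join each line to the one before it. The following is only a minimal sketch, not a definitive solution. The function name `join_broken_lines` is hypothetical, and it assumes that no whitespace is wanted directly next to a tag. That assumption breaks for inline tags such as `<b>` in the middle of a sentence and for whitespace-sensitive elements such as `<pre>`. For pages like that, a real HTML parser is likely the safer choice.

```python
def join_broken_lines(html: str) -> str:
    """Join stray line breaks in an HTML string (hypothetical helper).

    Assumption: a line break next to a tag carries no meaning, while a
    line break between two pieces of plain text stands for one space.
    """
    # Strip each line and drop the ones that are empty.
    pieces = [line.strip() for line in html.splitlines() if line.strip()]

    out = []
    for piece in pieces:
        # Text followed by text: restore the space the newline replaced.
        # Anything touching a tag is joined with no separator.
        if out and not out[-1].endswith(">") and not piece.startswith("<"):
            out.append(" ")
        out.append(piece)
    return "".join(out)


sample = "<p>\nUse the Fry option to properly cook the\nSpam\nuntil done.\n</p>\n"
print(join_broken_lines(sample))
# <p>Use the Fry option to properly cook the Spam until done.</p>
```

To apply it to a file, read the whole file with `f.read()` rather than `f.readlines()`, pass the string through the function, and write the result back. The output ends up on a single line, so any line breaks you want to keep between block elements would have to be added back afterwards.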

