April 24, 2024, 4:06 a.m. | drake

DEV Community dev.to

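When a crawler requests a page whose markup is not standard HTML, the usual parsing approaches generally fail to turn it into a DOM tree. For example: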





Incorrect examples



  • 1. This approach returns None



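from lxml import etree

# res is assumed to be the HTTP response object (for example from requests.get)
tree = etree.HTML(res.text)  # None for the malformed page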



  • 2. This approach also returns None



from lxml import etree

# Passing an explicit HTMLParser does not help: the result is still None
parser = etree.HTMLParser()
tree = etree.fromstring(res.text, parser)




  • 3. This approach also fails to produce a usable tree



from bs4 import BeautifulSoup

# Same response text as above, parsed with Python's built-in html.parser backend
tree = BeautifulSoup(res.text, 'html.parser')





Correct examples



  • 1. Solution: parse with lxml's soupparser



from lxml.html import soupparser

# soupparser parses with BeautifulSoup (which must be installed) and returns an lxml tree
tree = soupparser.fromstring(res.text)
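
One way to combine the usual parser with this one is to try parsers in order and keep the first tree that is not None. The sketch below is only an illustration, not part of the original post: `parse_with_fallback` and `lxml_parsers` are hypothetical helper names, and it assumes `lxml` and `beautifulsoup4` are installed by the time `lxml_parsers()` is called.

```python
from typing import Any, Callable, Iterable, List, Optional


def parse_with_fallback(
    text: str, parsers: Iterable[Callable[[str], Any]]
) -> Optional[Any]:
    """Return the first non-None result produced by the given parser callables."""
    for parse in parsers:
        try:
            tree = parse(text)
        except Exception:
            # A parser that raises is treated like one that returned None.
            continue
        if tree is not None:
            return tree
    return None


def lxml_parsers() -> List[Callable[[str], Any]]:
    """Hypothetical helper: build the parser chain, importing lxml lazily."""
    from lxml import etree
    from lxml.html import soupparser  # needs beautifulsoup4 installed

    # Try the regular HTML parser first, then the BeautifulSoup-backed one.
    return [etree.HTML, soupparser.fromstring]
```

With a response object `res`, the call would look like `tree = parse_with_fallback(res.text, lxml_parsers())`.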




  • 2. Solution: escape non-ASCII characters before parsing



from lxml import etree

# Replace every non-ASCII character with a numeric character reference, then parse
tree = etree.HTML(res.text.encode("ascii", "xmlcharrefreplace").decode("ascii"))
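
To see what the `encode`/`decode` step does on its own, here is a small sketch that uses only the standard library. The sample string and the helper name `to_ascii_with_charrefs` are made up for illustration. The `xmlcharrefreplace` error handler swaps each character that ASCII cannot represent for a numeric character reference such as `&#233;`, so the parser receives a pure-ASCII string that still decodes to the same text.

```python
import html


def to_ascii_with_charrefs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<p>café 爬虫</p>"  # made-up sample markup
escaped = to_ascii_with_charrefs(sample)

print(escaped)                           # <p>caf&#233; &#29228;&#34411;</p>
print(escaped.isascii())                 # True
print(html.unescape(escaped) == sample)  # True: the references decode back
```

Tags and ASCII text pass through unchanged, so only the characters outside ASCII are rewritten.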


