Fixing non-standard HTML that cannot be parsed
April 24, 2024, 4:06 a.m. | drake
DEV Community dev.to
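When a crawler requests a web page whose HTML is non-standard, the usual approaches fail to parse it into a DOM tree. For example:
Incorrect examples
- 1. This approach returns None
# res is assumed to be an HTTP response fetched earlier (e.g. with requests)
from lxml import etree
tree = etree.HTML(res.text)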
- 2. This approach also returns None
from lxml import etree
parser = etree.HTMLParser()
tree = etree.fromstring(res.text, parser)
- 3. This approach does not produce a usable tree either
from bs4 import BeautifulSoup
tree = BeautifulSoup(res.text, 'html.parser')
Correct examples
- 1. Solution: parse with lxml's BeautifulSoup-backed soupparser
from lxml.html import soupparser
tree = soupparser.fromstring(res.text)
- 2. Solution: replace non-ASCII characters with numeric character references before parsing
from lxml import etree
tree = etree.HTML(res.text.encode("ascii", "xmlcharrefreplace").decode("ascii"))
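The two fixes can also be chained, so that a page is only handed to the slower or more lenient parser when the plain one gives nothing back. The following is a minimal sketch of that idea, not part of the original post: the helper names `escape_non_ascii`, `parse_with_fallback`, and `lxml_parsers` are hypothetical, and the choice to treat a raised exception the same as a None result is an assumption.

```python
from typing import Any, Callable, Iterable, Optional


def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def parse_with_fallback(
    text: str, parsers: Iterable[Callable[[str], Any]]
) -> Optional[Any]:
    """Return the first non-None result from the parsers, tried in order."""
    for parse in parsers:
        try:
            tree = parse(text)
        except Exception:
            # Assumption: a parser that raises is treated like one that returns None.
            continue
        if tree is not None:
            return tree
    return None


def lxml_parsers() -> list:
    """Parser chain built from the calls in this post (needs lxml and beautifulsoup4)."""
    # Imported lazily so the helpers above work without lxml installed.
    from lxml import etree
    from lxml.html import soupparser

    return [
        etree.HTML,                                 # plain lxml parser first
        soupparser.fromstring,                      # solution 1
        lambda s: etree.HTML(escape_non_ascii(s)),  # solution 2
    ]
```

With these helpers, `tree = parse_with_fallback(res.text, lxml_parsers())` would try each approach in turn. The escaping step only rewrites characters outside ASCII, so `escape_non_ascii("café")` gives `caf&#233;`, and markup made of ASCII characters passes through unchanged.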