Fixing Non-Standard HTML That Cannot Be Parsed

April 24, 2024, 4:06 a.m. | drake

DEV Community dev.to

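When a crawler requests a page whose HTML is non-standard, the usual parsing approaches generally fail to build a DOM from it. For example:

Incorrect examples

1. This approach returns None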

from lxml import etree
tree = etree.HTML(res.text)

2. This approach also returns None

from lxml import etree

parser = etree.HTMLParser()
tree = etree.fromstring(res.text, parser)

3. This approach also fails to yield a usable tree

from bs4 import BeautifulSoup
tree = BeautifulSoup(res.text, 'html.parser')

Working examples

1. Solution: parse with lxml's BeautifulSoup-backed soupparser

from lxml.html import soupparser
tree = soupparser.fromstring(res.text)

2. Solution: replace non-ASCII characters with XML character references before parsing

from lxml import etree
tree = etree.HTML(res.text.encode("ascii", "xmlcharrefreplace").decode("ascii"))
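
To make the second solution easier to follow, here is a minimal sketch of what the encode/decode round trip does to the text before it reaches the parser. The helper name to_ascii_markup and the sample string are hypothetical and not part of the original post. It uses only the standard library and does not call lxml.

```python
import html


def to_ascii_markup(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    # "xmlcharrefreplace" turns each unencodable character into &#NNNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<p>café 中文</p>"  # hypothetical input, not from the original post
escaped = to_ascii_markup(sample)
print(escaped)  # <p>caf&#233; &#20013;&#25991;</p>

# The references decode back to the original characters, so no text is lost.
assert html.unescape(escaped) == sample
```

The markup itself is untouched; only characters outside ASCII are rewritten. A plausible reading, which the post does not confirm, is that the parser was failing on those characters, so feeding it pure ASCII sidesteps the problem.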
