Fixing non-standard HTML that cannot be parsed

April 24, 2024, 4:06 a.m. | drake

DEV Community dev.to

When a crawler requests a page whose HTML is non-standard, the usual approaches cannot parse it into a DOM. For example:

Incorrect examples

1. This approach returns None

from lxml import etree
# res is the HTTP response object returned by the crawler's request
tree = etree.HTML(res.text)

2. This approach also returns None

from lxml import etree

parser = etree.HTMLParser()
tree = etree.fromstring(res.text, parser)

3. This approach does not give a usable result either

from bs4 import BeautifulSoup
tree = BeautifulSoup(res.text, 'html.parser')
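The lxml calls above are reported to return None rather than raise, so the failure tends to surface later as an AttributeError somewhere far from the parse. One way to catch it early is a small guard around whichever parser you call. This is only a sketch under assumptions: `parse_or_raise` and `fake_parser` are hypothetical names, and the stand-in parser merely imitates a failure on non-ASCII input so the example runs without lxml. In the crawler you would pass a real parser such as `etree.HTML`.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def parse_or_raise(parse: Callable[[str], Optional[T]], text: str) -> T:
    """Call parse(text) and raise ValueError instead of passing None along."""
    tree = parse(text)
    if tree is None:
        raise ValueError("parser returned None; the markup may be non-standard")
    return tree


# Stand-in parser for illustration only (an assumption, not lxml behavior):
# it returns None whenever the input contains a non-ASCII character.
def fake_parser(text: str) -> Optional[str]:
    return text if text.isascii() else None


print(parse_or_raise(fake_parser, "<p>ok</p>"))

try:
    parse_or_raise(fake_parser, "<p>价格</p>")
except ValueError as exc:
    print("parse failed:", exc)
```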

Correct examples

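1. Solution: parse with lxml's soupparser, which uses BeautifulSoup under the hood (the beautifulsoup4 package must be installed)

from lxml.html import soupparser
tree = soupparser.fromstring(res.text)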

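2. Solution: replace every non-ASCII character with an XML character reference before handing the text to lxml

from lxml import etree
tree = etree.HTML(res.text.encode("ascii", "xmlcharrefreplace").decode("ascii"))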
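To see what the encode/decode round trip in solution 2 does to the text before lxml receives it, here is a small sketch that uses only the standard library. The helper name `to_ascii_markup` and the sample string are made up for illustration.

```python
import html


def to_ascii_markup(text: str) -> str:
    """Replace each non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical sample markup containing non-ASCII characters.
sample = "<p>价格: 5€</p>"
escaped = to_ascii_markup(sample)

# The result is ASCII-only; for example, the euro sign becomes &#8364;
print(escaped)

# The references decode back to the original characters, so no text is lost.
print(html.unescape(escaped) == sample)
```

ASCII characters, including the tags themselves, pass through unchanged, so the parser still sees the same markup structure.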
