Python Quick Reference Handbook

12.7 HTML Processing: html.parser

html.parser is a standard library module for parsing HTML documents. The basic steps for using it are:

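1. Define a class that inherits from HTMLParser.
2. Override the methods that handle the parts of the HTML document you care about, such as handle_starttag(), handle_endtag(), handle_data(), and handle_comment().
3. Create an instance of that subclass and call its feed() method, passing the HTML document as a string.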
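
The steps above can be sketched as follows. This is only a minimal illustration: the class name EventLogger, the events attribute, and the sample HTML string are hypothetical names chosen for this sketch. It overrides all four handler methods mentioned above and records each call in the order the parser makes it.

```python
from html.parser import HTMLParser

# Hypothetical subclass for illustration: records every parser callback.
class EventLogger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag, e.g. <b>
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for each closing tag, e.g. </b>
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for text between tags
        self.events.append(("data", data))

    def handle_comment(self, data):
        # Called with the text inside <!-- ... -->
        self.events.append(("comment", data))

logger = EventLogger()
logger.feed("<!--note--><b>hi</b>")
print(logger.events)
```

Any handler that is not overridden keeps the base-class behavior, which does nothing, so a subclass only needs to define the methods it actually uses.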

The following program counts the number of <p> elements in an HTML document:

from html.parser import HTMLParser

class PCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0
    
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.count += 1

parser = PCounter()
parser.feed("<p></p><p></p><p></p><p></p><p></p>")
print(parser.count)

# Sample program from 《程式語言教學誌》
# http://kaiching.org/
# File name: hdemo01.py
# Purpose: demonstrates the html.parser module
# Author: 張凱慶

Running the program above from the command line gives the following result:

$ python3 hdemo01.py

5
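
The counter above ignores the attrs parameter of handle_starttag(). That parameter is a list of (name, value) pairs for the tag's attributes. As a rough sketch of how it might be used, the following collects the href value of every <a> element; the class name LinkCollector, the links attribute, and the sample markup are assumptions made for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical subclass for illustration: gathers href values from <a> tags.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="/a.html">A</a><p>text</p><a href="/b.html">B</a>')
print(collector.links)
```

With this sample input, the links attribute ends up holding the two href values in document order.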

The following program collects the start tags, end tags, and text of an HTML document into the list attribute content:

from html.parser import HTMLParser

class ContentParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.content = []
    
    def handle_starttag(self, tag, attrs):
        self.content.append("start-" + tag)
    
    def handle_endtag(self, tag):
        self.content.append("end-" + tag)
    
    def handle_data(self, data):
        self.content.append(data)

parser = ContentParser()
data = "<div>H</div><p>E</p><p>L</p><span>L</span><div>O</div>"
parser.feed(data)
print(parser.content)

# Sample program from 《程式語言教學誌》
# http://kaiching.org/
# File name: hdemo02.py
# Purpose: demonstrates the html.parser module
# Author: 張凱慶

Running the program above from the command line gives the following result:

$ python3 hdemo02.py

['start-div', 'H', 'end-div', 'start-p', 'E', 'end-p', 'start-p', 'L', 'end-p', 'start-span', 'L', 'end-span', 'start-div', 'O', 'end-div']
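
If only the text is needed, one possible variation is to override handle_data() alone and join the collected pieces afterwards. The sketch below reuses the same HTML string; the class name TextExtractor and the parts attribute are hypothetical names introduced here.

```python
from html.parser import HTMLParser

# Hypothetical subclass for illustration: keeps text and ignores all tags.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Only text between tags reaches this method
        self.parts.append(data)

extractor = TextExtractor()
extractor.feed("<div>H</div><p>E</p><p>L</p><span>L</span><div>O</div>")
text = "".join(extractor.parts)
print(text)
```

Because the tag handlers are left at their default do-nothing behavior, the five single-letter text nodes are the only items collected, and joining them yields HELLO.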
