明天的筆記本‧Tomorrow notebook

Notes on Commonly Used BeautifulSoup Commands

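1. This post is not an introduction to BeautifulSoup; there are already plenty of tutorials online. It only records the syntax I use most often.
2. Any line marked "(for debugging)" can generally be commented out.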

I. Retrieving content:

(1) Basic usage:

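from bs4 import BeautifulSoup
# r is assumed to be a requests.Response, e.g. r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')  # the 'lxml' parser requires the lxml package
print(soup.prettify())  # pretty-print the parsed HTML with indentation (for debugging)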

(2) Selectors:

1. soup. followed by an HTML tag name:

soup.title                #<title>The Dormouse's story</title>
soup.title.name           # 'title'
soup.title.string         # "The Dormouse's story"
soup.title.parent.name    # 'head'
soup.p                    # <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']           # ['title']  (class is multi-valued, so bs4 returns a list)
soup.a                    #<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')        #[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
                          # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
                          # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
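To try the lookups above without fetching a page, here is a minimal sketch. The html_doc string is an assumed sample (abridged from the markup the outputs above refer to), and html.parser is used instead of lxml only so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration; it is not fetched from anywhere.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""

# html.parser ships with Python, so this runs without lxml installed.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string          # text of the <title> tag
parent_name = soup.title.parent.name    # name of the tag that encloses <title>
p_classes = soup.p["class"]             # class is multi-valued, so this is a list
first_href = soup.a["href"]             # href of the first <a> in the document
link_ids = [a["id"] for a in soup.find_all("a")]  # id of every <a>

print(title_text, parent_name, p_classes, first_href, link_ids)
```

Note that soup.tagname only ever returns the first matching tag; find_all is what returns every match.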

Commonly used features:

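tag = soup.b
tag.name         # get the tag name
tag.attrs        # get the attributes as a dict, e.g. {'class': ['boldest']}
tag.string       # get the string inside the tag
str(tag.string)  # convert the NavigableString to a plain str (unicode() in Python 2)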
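A small sketch of the same attribute access on a throwaway tag. The markup and the variable names here are made up for illustration; tag.get() is shown because it returns None instead of raising KeyError when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document used only for this example.
soup = BeautifulSoup('<b class="boldest" id="intro">Extremely bold</b>', "html.parser")
tag = soup.b

name = tag.name              # the tag name as a str
attrs = tag.attrs            # dict of all attributes; class comes back as a list
tag_id = tag.get("id")       # single attribute lookup
missing = tag.get("href")    # absent attribute: None rather than KeyError
text = str(tag.string)       # NavigableString converted to a plain str

print(name, attrs, tag_id, missing, text)
```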

Text strings:

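soup.title.string      # NavigableString, or None if the tag has more than one child (e.g. nested tags)
soup.title.text        # all text inside the tag, concatenated
soup.title.get_text()  # same as .text; accepts separator= and strip= arguments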
        
title = soup.find('title').get_text()
print(title)
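The difference between .string and get_text() is easiest to see on a tag with nested children. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <h1> has one text child, <p> has text plus a nested <b>.
soup = BeautifulSoup(
    "<div><h1>Plain heading</h1><p>Hello <b>world</b>!</p></div>",
    "html.parser",
)

h1 = soup.find("h1")
p = soup.find("p")

single_child = h1.string     # exactly one child string, so .string returns it
mixed_children = p.string    # several children, so .string is None
all_text = p.get_text()      # concatenates every descendant string
same_text = p.text           # .text is a property that calls get_text()

print(repr(single_child), repr(mixed_children), repr(all_text))
```

This is why get_text() tends to be the safer default when the inner structure of a tag is not known in advance.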

3. find_all examples:

# Search for several tag names at once (all hyperlinks and bold text)
tags = soup.find_all(["a", "b"])
print(tags)

# Limit the number of results
tags = soup.find_all(["a", "b"], limit=2)
print(tags)

# Do not search recursively; only look at direct children
soup.html.find_all("title", recursive=False)
        
        
soup.find(id="link3")     # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
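_title = soup.find("h1").get_text()     # text content of the first <h1>
import re                               # needed for re.compile below
_context = soup.find_all("a", string=re.compile("郭"))  # <a> tags whose string contains "郭"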
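The find_all options above can be checked against a small document. This is only a sketch: the markup, ids, and link text are invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup used only to exercise the find_all arguments.
html_doc = """
<html><head><title>Demo</title></head>
<body>
<h1>Links</h1>
<p><a href="/a" id="link1">Alpha</a> <b>bold</b> <a href="/b" id="link2">Beta</a></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# A list of names matches any of them, in document order.
both = [t.name for t in soup.find_all(["a", "b"])]

# limit stops the search after that many matches.
limited = [t.name for t in soup.find_all(["a", "b"], limit=2)]

# <title> sits inside <head>, so it is not a direct child of <html>.
direct_title = soup.html.find_all("title", recursive=False)

# Keyword arguments filter on attributes.
by_id = soup.find(id="link2").get_text()

# string= filters on the tag's text; a compiled regex is matched with search().
by_text = [a["href"] for a in soup.find_all("a", string=re.compile("^Al"))]

print(both, limited, direct_title, by_id, by_text)
```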

Regular expressions:

Usage:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# Get the text of the <a> tag whose href starts with http://www.aaa.com/
_tag = soup.find('a', attrs={'href': re.compile(r'^http://www\.aaa\.com/')}).get_text()
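One thing to watch with this pattern: find returns None when nothing matches, so chaining .get_text() directly raises AttributeError. A hedged sketch with a guard follows; the two links are made up, and the pattern is anchored with ^ because BeautifulSoup applies a compiled regex with search(), which would otherwise match anywhere in the href.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links; only the second one is on www.aaa.com.
html_doc = (
    '<a href="http://other.example/x">Other</a>'
    '<a href="http://www.aaa.com/news/1">News item</a>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# ^ anchors the match at the start of the href; dots are escaped to be literal.
pattern = re.compile(r"^http://www\.aaa\.com/")
link = soup.find("a", href=pattern)

# Guard against None before reading the text.
link_text = link.get_text() if link is not None else None

# A pattern that matches nothing yields None, not an empty tag.
missing = soup.find("a", href=re.compile(r"^http://nowhere\.invalid/"))

print(link_text, missing)
```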

1. title

1 ------------
# Scrape the title from the raw HTML (html is the page source as a str, e.g. r.text)
res = re.findall(r"<title>(.+?)</title>", html)
print(" Page title is: ", res[0])
 
2 ------------
# Print the title tag
print(soup.title)

------------
soup.select("a[href]")   # select <a> tags that have an href attribute
------------
soup.select('div[title*="keyword"]')  # select <div> tags whose title attribute contains "keyword"
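A quick way to confirm what these two CSS selectors pick up, using invented markup. select returns a list of every match, while select_one returns only the first match (or None).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <div> tags and two <a> tags, one without href.
html_doc = """
<div title="python keyword list">first</div>
<div title="other">second</div>
<a href="/with-href">has href</a>
<a name="anchor">no href</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# [href] matches on the presence of the attribute, whatever its value.
with_href = [a.get_text() for a in soup.select("a[href]")]

# *= matches when the attribute value contains the given substring.
matching_divs = [d.get_text() for d in soup.select('div[title*="keyword"]')]

# select_one returns the first match only.
first_div = soup.select_one("div[title]")

print(with_href, matching_divs, first_div.get_text())
```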
------------
# Regular expressions with re.compile()


------------
# Scrape paragraphs
res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL if multi line
print(" Page paragraph is: ", res[0])

# Scrape every href link on the page
res = re.findall(r'href="(.*?)"', html)
print(" All links: ", res)
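The regex snippets above assume an html string is already in hand. Here is a self-contained sketch with an invented page; it also guards the res[0] lookup, since findall returns an empty list when nothing matches.

```python
import re

# Hypothetical page source; in practice this would be the fetched HTML text.
html = """
<html><head><title>Sample page</title></head>
<body>
<p>First
paragraph</p>
<p>Second paragraph</p>
<a href="/one">one</a> <a href="/two">two</a>
</body></html>
"""

titles = re.findall(r"<title>(.+?)</title>", html)

# re.DOTALL lets . match newlines, so the multi-line first paragraph is captured.
paragraphs = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)

links = re.findall(r'href="(.*?)"', html)

# findall returns [] on no match, so check before indexing.
page_title = titles[0] if titles else None

print(page_title, paragraphs, links)
```

These patterns only work for tags written exactly as <p> and <title> with no attributes; for anything less regular, the BeautifulSoup lookups earlier in these notes are the more robust choice.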
