A Homemade Tool for Scraping Douban Rental Listings | Python Theme Month

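This article is an entry in the "Python Theme Month" event. I have been looking for an apartment on Douban lately, and it is slow going: you have to browse each Douban group post by post to find relevant listings, which wastes a lot of time. It is more efficient to crawl the Douban pages with Python and build a small tool that extracts the key information.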

Basic approach

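Fetch the relevant pages with requests
Extract the content of the relevant DOM nodes with BeautifulSoup
Match that content against regular expressions with re to filter out the key information
Generate an Excel file with Pandas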

Crawling the HTML of all list pages

For example: crawl the first 10 list pages of two rental groups, 深圳租房团 (Shenzhen Rental Group) and 深圳租房 (Shenzhen Rentals).
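To make "the first 10 pages" concrete: each list page is addressed by a `start` offset in the query string, and the script below assumes 25 posts per page. A minimal sketch of how the offsets map to URLs for one group (the helper name `page_urls` is hypothetical, not part of the original script):

```python
def page_urls(base_url, pages=10, page_size=25):
    """Build the list-page URLs for the first `pages` pages of one group."""
    # Page n (counting from 0) starts at post offset n * page_size.
    return [f"{base_url}?start={n * page_size}" for n in range(pages)]


urls = page_urls("https://www.douban.com/group/szsh/discussion")
print(len(urls))   # 10
print(urls[0])     # https://www.douban.com/group/szsh/discussion?start=0
print(urls[-1])    # https://www.douban.com/group/szsh/discussion?start=225
```

The offsets 0, 25, ..., 225 are exactly the values produced by `range(0, 250, 25)`.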

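Note: you have to obtain the Cookie yourself. Log in to Douban in a browser, press F12, open the Network tab, and copy the Cookie value from the request headers of a request to douban.com. Without a Cookie, frequent crawling gets your access blocked by Douban.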

import requests

# Start offset, end offset (exclusive), posts per page: 10 pages of 25 posts
page_indexs = range(0, 250, 25)

# Discussion-list URLs of the rental groups
baseUrls = ['https://www.douban.com/group/szsh/discussion',    # 深圳租房 (Shenzhen Rentals)
            'https://www.douban.com/group/106955/discussion'   # 深圳租房团 (Shenzhen Rental Group)
            ]

# Cookie: replace this placeholder with your own value
cookie = 'the Cookie you copied from the browser after logging in to Douban'

UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'

def download_all_htmls():
    """Download every list page of every group and return the HTML strings."""
    htmls = []
    for baseUrl in baseUrls:
        for idx in page_indexs:
            url = f"{baseUrl}?start={idx}"
            print("download_all_htmls crawl html:", url)
            r = requests.get(url,
                             headers={"User-Agent": UA, "Cookie": cookie})
            if r.status_code != 200:
                # Skip failed requests instead of keeping an error page
                print('download_all_htmls,r.status_code', r.status_code)
                continue
            htmls.append(r.text)
    return htmls

htmls = download_all_htmls()
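Since frequent requests can get you blocked, it may help to pause between attempts and retry a page that fails. The following is a sketch under assumptions, not the original script's behavior: `fetch_with_retry`, `retries`, and `delay` are hypothetical names, and the delay values are guesses rather than limits documented by Douban. The function accepts any object with a requests-style `get` method, such as a `requests.Session`.

```python
import time


def fetch_with_retry(session, url, headers, retries=3, delay=2.0):
    """Return the page text, or None if every attempt fails.

    `session` is anything with a requests-style
    get(url, headers=..., timeout=...) method, e.g. a requests.Session.
    """
    for attempt in range(1, retries + 1):
        response = session.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        print(f"attempt {attempt} for {url} returned {response.status_code}")
        if attempt < retries:
            # Wait longer after each failure (delay, 2 * delay, ...).
            time.sleep(delay * attempt)
    return None
```

With requests installed, it could be called as `fetch_with_retry(requests.Session(), url, {"User-Agent": UA, "Cookie": cookie})`, skipping any page that comes back as `None`.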

Getting the content of the relevant DOM nodes

On a Douban page, press F12 to inspect the page elements. The ones that matter here are the title, the link (href), and the time.

from bs4 import BeautifulSoup

def parse_single_html(html):

    soup = BeautifulSoup(html, 'html.parser')

    # Rows of the post list (the table with class "olt")
    article_items = (
        soup.find("table", class_="olt")
            .find_all("tr", class_="")
    )

    for article_item in article_items:

        # Post title
        title = article_item.find("td", class_="title").get_text().strip()
        # Post link
        link = article_item.find("a")["href"]
        # Post time
        time = article_item.find("td", class_="time").get_text()


Filtering key information with regular expressions

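Regular expressions can be used to filter posts by the content of title. The snippet below belongs inside the for loop of parse_single_html: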

# Post title
title = article_item.find("td", class_="title").get_text().strip()

# Match any of three place names: 科技园 (Keji Yuan), 竹子林 (Zhuzilin), 车公庙 (Chegongmiao)
res1 = re.search("科技园|竹子林|车公庙", title)
# Keep only one-bedroom or single-room listings
res2 = re.search("一房|单间|一室|1房|1室", title)

if res1 is not None and res2 is not None:
    print(title, link, time)
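The same filter can be packaged as a small function with precompiled patterns, which makes it easy to try against a few titles before running the crawler. This is only a sketch: `is_wanted` and the sample titles are made up for illustration.

```python
import re

# Place names to look for: 科技园, 竹子林, 车公庙
AREA = re.compile("科技园|竹子林|车公庙")
# Words that indicate a one-bedroom or single-room listing
ONE_ROOM = re.compile("一房|单间|一室|1房|1室")


def is_wanted(title):
    """True if the title names a wanted area and a one-room listing."""
    return AREA.search(title) is not None and ONE_ROOM.search(title) is not None


# Invented sample titles, for illustration only
samples = [
    "科技园地铁口 一房一厅 直租",
    "车公庙 两房 转租",
    "福田 单间 出租",
]
for title in samples:
    print(is_wanted(title), title)
```

Only the first sample passes: the second names a wanted area but is a two-bedroom listing, and the third is a single room outside the three areas.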

Generating the Excel file

Writing the collected data to an Excel file completes this simple tool for filtering Douban posts. The complete code follows:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# Start offset, end offset (exclusive), posts per page: 10 pages of 25 posts
page_indexs = range(0, 250, 25)

# Discussion-list URLs of the rental groups
baseUrls = ['https://www.douban.com/group/szsh/discussion',    # 深圳租房 (Shenzhen Rentals)
            'https://www.douban.com/group/106955/discussion'   # 深圳租房团 (Shenzhen Rental Group)
            ]

# Cookie: replace this placeholder with your own value
cookie = 'the Cookie you copied from the browser after logging in to Douban'

UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'

# Download every list page
def download_all_htmls():
    htmls = []
    for baseUrl in baseUrls:
        for idx in page_indexs:
            url = f"{baseUrl}?start={idx}"
            print("download_all_htmls crawl html:", url)
            r = requests.get(url,
                             headers={"User-Agent": UA, "Cookie": cookie})
            if r.status_code != 200:
                # Skip failed requests instead of keeping an error page
                print('download_all_htmls,r.status_code', r.status_code)
                continue
            htmls.append(r.text)
    return htmls

htmls = download_all_htmls()

# Titles already collected, used to drop duplicate posts
datasKey = []

# Parse one HTML page and return the matching posts
def parse_single_html(html):

    soup = BeautifulSoup(html, 'html.parser')

    table = soup.find("table", class_="olt")
    if table is None:
        # No post list on this page (for example a login or error page)
        return []
    article_items = table.find_all("tr", class_="")

    datas = []

    for article_item in article_items:

        # Post title
        title = article_item.find("td", class_="title").get_text().strip()
        # Post link
        link = article_item.find("a")["href"]
        # Post time
        time = article_item.find("td", class_="time").get_text()

        # Match any of three place names: 科技园 (Keji Yuan), 竹子林 (Zhuzilin), 车公庙 (Chegongmiao)
        res1 = re.search("科技园|竹子林|车公庙", title)
        # Keep only one-bedroom or single-room listings
        res2 = re.search("一房|单间|一室|1房|1室", title)

        # Keep posts that match both a place and a room type and have not been seen before
        if res1 is not None and res2 is not None and title not in datasKey:
            print(title, link, time)
            datasKey.append(title)
            datas.append({
                "title": title,
                "link": link,
                "time": time
            })
    return datas

all_datas = []

# Parse every downloaded HTML page
for html in htmls:
    all_datas.extend(parse_single_html(html))

df = pd.DataFrame(all_datas)
# Write the data to an Excel file (.xlsx output needs an engine such as openpyxl installed)
df.to_excel("test.xlsx")
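The script removes duplicates by checking each title against the `datasKey` list, which scans the whole list on every check. For larger crawls, a set gives the same result with constant-time lookups on average. A minimal sketch, where `dedupe_by_title` and the sample rows are hypothetical:

```python
def dedupe_by_title(rows):
    """Keep the first row for each title, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row["title"] not in seen:
            seen.add(row["title"])
            unique.append(row)
    return unique


# Invented rows in the same shape as the dicts built in parse_single_html
rows = [
    {"title": "A", "link": "https://example.com/1", "time": "06-01 10:00"},
    {"title": "B", "link": "https://example.com/2", "time": "06-01 11:00"},
    {"title": "A", "link": "https://example.com/3", "time": "06-01 12:00"},
]
print([row["link"] for row in dedupe_by_title(rows)])
```

The third row repeats the title "A", so only the first two links are printed.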