Scraping Lianjia Rental Listings with Python

2025-11-13


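Motivation:

More practice. Today I take the Shijiazhuang Development Zone as an example and scrape every rental listing Lianjia has there. I have practiced on this kind of static page many times and am comfortable with it by now.

Target site:

Link: https://sjz.lianjia.com/zufang/kaifaqu1/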

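Overall approach and functions:

Overall approach:

(1) Find the pagination pattern (Lianjia pages are static, so the pattern is visible in the browser's address bar)
(2) Extract the rental details we want, including listing name, area, and address, and collect them in a list
(3) Save the result locally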
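Step (1) can be sketched as follows. This is only an illustration of the "pgN" suffix pattern used by the full code later in this post; the helper name page_urls and its last_page parameter are made up for the example, and the real number of pages has to be read off the site.

```python
def page_urls(base_url, last_page):
    """Build the list of page URLs by appending "pg1", "pg2", ... to base_url."""
    # Normalise the trailing slash so "…/kaifaqu1" and "…/kaifaqu1/" both work.
    base = base_url.rstrip("/") + "/"
    return ["{}pg{}".format(base, n) for n in range(1, last_page + 1)]


if __name__ == "__main__":
    # last_page=3 is an arbitrary value for the demo, not the real page count.
    for u in page_urls("https://sjz.lianjia.com/zufang/kaifaqu1/", 3):
        print(u)
```

Keeping the URL construction in its own function makes it easy to check the pattern in a browser before sending any requests.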

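Functions:

(1) getHTMLText(url): fetches a page
(2) fillList(url, roomlist): stores the details of each rental listing in the list
(3) save(roomlist, path): reads the list and writes it to a local txt file

Parameters:

(1) roomlist: the list that holds the details of each rental listing
(2) path: the local path to save to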

Some details explained:

(1) How house_name is extracted. Why not look up the a tag directly with find, like this?

house_name = house.find("a",class_ = "twoline").get_text().strip()

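Written this way, nothing is returned. I don't know why, and it is not the first time I have run into this; it may be a quirk of the a tag.

So I changed it to the following: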

house_name = house.find("p",class_ = "content__list--item--title").find("a").get_text().strip()
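One way a lookup like the first one can come back empty is when the class sits on a parent element rather than on the a tag itself. The sketch below reproduces that situation on hand-written markup; the HTML is an assumption made up for the demo, not a copy of the real Lianjia page, and title_of is a hypothetical helper. In BeautifulSoup, find returns None when nothing matches, and calling get_text on None raises AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: here the "twoline" class is on the <p>, not on the <a>.
html = """
<div class="content__list--item">
  <p class="content__list--item--title twoline">
    <a href="#"> Example listing </a>
  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
house = soup.find("div", class_="content__list--item")

# No <a> carries the class "twoline" in this markup, so find() returns None.
direct = house.find("a", class_="twoline")
print(direct)  # None


def title_of(house):
    """Go through the parent <p> first, and tolerate missing elements."""
    title = house.find("p", class_="content__list--item--title")
    link = title.find("a") if title is not None else None
    return link.get_text().strip() if link is not None else ""


print(title_of(house))
```

Printing the relevant part of the real page (for example with house.prettify()) is a quick way to see which element actually carries which class.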

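(2) To keep the extracted text tidy, a reminder about removing whitespace:
str.strip(): removes whitespace from both ends of a string
str.replace("old", "new"): replaces the old substring with the new one; used here to remove inner spaces, tabs, and newlines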
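A small sketch of the two calls on a made-up string. The helper names clean and collapse are hypothetical; collapse shows an alternative that keeps single spaces between words instead of deleting all inner whitespace.

```python
def clean(text):
    """Strip both ends, then delete inner spaces, tabs and newlines."""
    return text.strip().replace(" ", "").replace("\t", "").replace("\n", "")


def collapse(text):
    """Alternative: squeeze every run of whitespace down to one space."""
    return " ".join(text.split())


if __name__ == "__main__":
    raw = "  \n 中层 \t (6层)  \n"  # invented sample, not real scraped data
    print(repr(clean(raw)))
    print(repr(collapse("  near  subway \n\t furnished ")))
```

clean suits fields such as the floor description, where spaces carry no meaning; collapse is the safer choice for free text where words need to stay separated.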

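(3) A few listings omit some fields, which can crash the crawler. Wrapping the loop in a try/except is enough to keep it running. Note that the first incomplete listing then ends the loop for that page, so the remaining listings on that page are skipped.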

try:
    for house in soup.find_all("div", class_="content__list--item"):
        ......
        ......
except Exception:
    # Message: "some fields missing, scrape failed"
    print("部分信息缺失,爬取失败*************************")
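Because the try block above wraps the whole for loop, one bad listing stops the rest of that page. A possible variation, sketched here on plain strings instead of real HTML, is to put the try inside the loop so only the bad listing is skipped. The functions parse_listing and parse_page and the sample data are invented for this illustration.

```python
def parse_listing(raw):
    """Split a made-up "address / area / orientation" string into 3 fields."""
    parts = [p.strip() for p in raw.split("/")]
    if len(parts) < 3:
        raise ValueError("missing fields: {!r}".format(raw))
    return parts[:3]


def parse_page(raw_listings):
    """Parse every listing; collect failures instead of aborting the page."""
    rows, failed = [], []
    for raw in raw_listings:
        try:
            rows.append(parse_listing(raw))
        except ValueError:
            failed.append(raw)  # skip only this listing and carry on
    return rows, failed


if __name__ == "__main__":
    page = ["A-B / 50㎡ / 南", "incomplete", "C-D / 80㎡ / 北"]
    rows, failed = parse_page(page)
    print(len(rows), "parsed;", len(failed), "skipped")
```

Collecting the failed items also makes it possible to inspect afterwards which listings were incomplete.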

Full code:

import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    # Fetch a page and return its text, or "" if the request fails.
    try:
        kv = {"user-agent": "Mozilla/5.0"}
        r = requests.get(url, headers=kv, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        print("getHTMLText失败!")  # "getHTMLText failed!"
        return ""


def fillList(url, roomlist):
    # Pages are url + "pg1", "pg2", ...; the upper bound 55 is hard-coded.
    for i in range(1, 55):
        page_url = url + "pg{}".format(i)
        html = getHTMLText(page_url)
        soup = BeautifulSoup(html, "html.parser")
        try:
            for house in soup.find_all("div", class_="content__list--item"):
                house_name = house.find("p", class_="content__list--item--title").find("a").get_text().strip()
                house_price = house.find("span", class_="content__list--item-price").get_text().strip()
                # The description is "/"-separated: address/area/orientation/layout/floor
                des = house.find("p", class_="content__list--item--des").get_text()
                deslist = des.split("/")
                house_address = deslist[0].strip()
                house_area = deslist[1].strip()
                house_towards = deslist[2].strip()
                house_roomtype = deslist[3].strip()
                house_floor = deslist[4].strip().replace(" ", "").replace("\t", "")
                house_tips = house.find("p", class_="content__list--item--bottom oneline").get_text().strip().replace("\n", "").replace("\t", "")
                roomlist.append([house_name, house_price, house_address, house_area,
                                 house_towards, house_roomtype, house_floor, house_tips])
                print(house_address + "爬取成功!")  # "scraped successfully!"
        except Exception:
            # A missing tag or field ends the loop for this page.
            print("部分信息缺失,爬取失败*************************")


def save(roomlist, path):
    # Mode 'a' appends, so running the script twice writes the header twice.
    with open(path, 'a', encoding='utf-8') as f:
        # Header: listing, price, address, area, orientation, layout, floor, other perks
        f.write("\t".join(["房源", "价格", "具体地址", "面积", "朝向", "房型", "层数", "其他优势"]) + "\n")
        for room in roomlist:
            f.write("\t".join(room) + "\n")
            print(room[2] + "存储成功!")  # "saved successfully!"


def main():
    url = "https://sjz.lianjia.com/zufang/kaifaqu1/"
    roomlist = []
    path = "石家庄开发区租房信息.txt"
    fillList(url, roomlist)
    save(roomlist, path)


if __name__ == "__main__":
    main()

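Results:

Writing straight to Excel is more trouble than it's worth; I've gotten used to writing a txt file and then importing it into Excel (☄⊙ω⊙)☄ As for splitting columns on tabs in Excel, everyone knows how to do that, right? Surely nobody doesn't? Right? Right? Haha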
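For the txt-then-Excel route, the standard-library csv module can write the tab-separated text and will quote any field that itself contains a tab or newline. This is a sketch of an alternative to hand-joining strings, not what the code above does; rows_to_tsv and save_tsv are hypothetical names, and the "utf-8-sig" encoding is a suggestion that, as far as I know, helps Excel detect UTF-8.

```python
import csv
import io


def rows_to_tsv(rows):
    """Render rows (lists of strings) as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def save_tsv(path, rows):
    """Write the rows to a file; newline="" lets csv control line endings."""
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)


if __name__ == "__main__":
    demo = [["房源", "价格"], ["Example listing", "1500 元/月"]]  # invented rows
    print(rows_to_tsv(demo), end="")
```

Opening the file with mode "w" also avoids stacking a second header row when the script is run again.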
