Scraping Lianjia Rental Listings with Python
Motivation:
More practice today: scraping every Lianjia rental listing in Shijiazhuang's development zone (开发区). I've done plenty of static pages like this already, so it's familiar territory.
Target site:
Link: https://sjz.lianjia.com/zufang/kaifaqu1/
Overall approach and methods:
Approach:
(1) Find the pagination pattern (Lianjia is a static site, so the pattern is visible right in the browser's address bar).
(2) Extract the listing details we want (title, area, address, and so on) into a list.
(3) Save the list locally.
Methods:
(1) getHTMLText(url): fetch a page.
(2) fillList(url, roomlist): parse each listing's details into a list.
(3) save(roomlist, path): write the list's contents to a local txt file.
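Step (1) of the approach comes down to appending pg1, pg2, … to the district URL. A minimal sketch (the page count here is just illustrative):

```python
# Lianjia paginates by appending pgN to the district URL.
base_url = "https://sjz.lianjia.com/zufang/kaifaqu1/"
page_urls = [base_url + "pg{}".format(i) for i in range(1, 4)]
print(page_urls[0])   # https://sjz.lianjia.com/zufang/kaifaqu1/pg1
print(page_urls[-1])  # https://sjz.lianjia.com/zufang/kaifaqu1/pg3
```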
Parameters:
(1) roomlist: the list holding each listing's details.
(2) path: the local save path.
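For reference, each roomlist entry ends up as an 8-field list. The sample values below are made up, purely to show the shape:

```python
# One hypothetical roomlist entry, in the order the fields are stored:
# name, price, address, area, orientation, layout, floor, extras
room = ["整租·某小区 2室1厅", "1500元/月", "开发区", "89㎡", "南", "2室1厅1卫", "中楼层(6层)", "精装修"]
print(len(room))  # 8
```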
Selected details:
(1) Extracting house_name: why not find the <a> tag directly, like this?

house_name = house.find("a", class_="twoline").get_text().strip()

Written this way, nothing comes back. I'm not sure why; it isn't the first time I've run into this, and it may be a quirk of the <a> tag.
So I changed it to:

house_name = house.find("p", class_="content__list--item--title").find("a").get_text().strip()
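One plausible explanation (an assumption on my part, not verified against Lianjia's live markup) is that the twoline class sits on the parent <p> rather than on the <a> itself, in which case searching for an <a> with that class matches nothing:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class lives on the <p>, not on the <a>.
html = '<p class="content__list--item--title twoline"><a href="/zufang/1.html"> Sample listing </a></p>'
soup = BeautifulSoup(html, "html.parser")

# No <a> carries the "twoline" class, so this finds nothing:
print(soup.find("a", class_="twoline"))  # None

# Going through the parent <p> first succeeds:
name = soup.find("p", class_="content__list--item--title").find("a").get_text().strip()
print(name)
```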
(2) To keep the extracted text tidy, a reminder on removing whitespace:
str.strip(): removes whitespace from both ends of a string.
str.replace(old, new): replaces old with new; used here to remove interior spaces, tabs, and newlines.
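A quick illustration of the two methods:

```python
raw = "  精装 两室一厅\t\n"
cleaned = raw.strip()               # whitespace at both ends removed
compact = cleaned.replace(" ", "")  # interior spaces removed as well
print(repr(cleaned))  # '精装 两室一厅'
print(compact)        # 精装两室一厅
```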
(3) A few listings are missing some fields, which can abort the crawl partway: wrapping the loop in a try-except takes care of it.

try:
    for house in soup.find_all("div", class_="content__list--item"):
        ......
        ......
except:
    print("Some fields missing; scrape failed *************************")
Full code:

import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch a page, returning its text ("" on failure)."""
    try:
        kv = {"user-agent": "Mozilla/5.0"}
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("getHTMLText failed!")
        return ""


def fillList(url, roomlist):
    """Parse every listing on pages 1-54 into roomlist."""
    for i in range(1, 55):
        page_url = url + "pg{}".format(i)
        html = getHTMLText(page_url)
        soup = BeautifulSoup(html, "html.parser")
        try:
            for house in soup.find_all("div", class_="content__list--item"):
                house_name = house.find("p", class_="content__list--item--title").find("a").get_text().strip()
                house_price = house.find("span", class_="content__list--item-price").get_text().strip()
                # The description line packs address/area/orientation/layout/floor, separated by "/"
                des = house.find("p", class_="content__list--item--des").get_text()
                deslist = des.split("/")
                house_address = deslist[0].strip()
                house_area = deslist[1].strip()
                house_towards = deslist[2].strip()
                house_roomtype = deslist[3].strip()
                house_floor = deslist[4].strip().replace(" ", "").replace("\t", "")
                house_tips = house.find("p", class_="content__list--item--bottom oneline").get_text().strip().replace("\n", "").replace("\t", "")
                roomlist.append([house_name, house_price, house_address, house_area,
                                 house_towards, house_roomtype, house_floor, house_tips])
                print(house_address + " scraped!")
        except:
            print("Some fields missing; scrape failed *************************")


def save(roomlist, path):
    """Write roomlist to a tab-separated txt file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("source\tprice\taddress\tarea\torientation\tlayout\tfloor\textras\n")
        for room in roomlist:
            f.write("\t".join(room) + "\n")
            print(room[2] + " saved!")


def main():
    url = "https://sjz.lianjia.com/zufang/kaifaqu1/"
    roomlist = []
    path = "石家庄开发区租房信息.txt"  # "Shijiazhuang development-zone rental listings"
    fillList(url, roomlist)
    save(roomlist, path)


main()
Results:
Writing straight to Excel is more trouble than it's worth; I'm used to writing to txt first and then importing into Excel (☄⊙ω⊙)☄. Everyone knows how to split columns by tab delimiter in Excel, right? Right? Haha.
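If you'd rather skip Excel's split-by-tab step entirely, the txt file parses cleanly as tab-separated text. A minimal sketch (the sample rows below are made up):

```python
import csv
import io

# Parse tab-separated lines like the ones save() writes (sample data is hypothetical).
sample = "source\tprice\taddress\nWhole rental, 2br\t1500/month\tKaifaqu\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
print(rows[0])     # ['source', 'price', 'address']
print(rows[1][2])  # Kaifaqu
```

The same csv.reader call works on the real file via open(path, encoding="utf-8") in place of the StringIO.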