Collecting Your Own CSDN Blog Statistics in 30 Lines of Code


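1. Purpose

A crawler is in itself a very simple thing; it only grows complicated because of business requirements. Plenty of simple, easy-to-use crawler frameworks exist for developers' convenience, but this post does not use any of those ready-made frameworks, nor does it serve any commercial purpose: it simply scrapes the data of my own CSDN blog.

Of course, you can add features of your own to make it more fun, such as emailing yourself when you gain a follower or receive a comment. I originally wrote some statistics features myself, but the site now offers a "数据观星" (data dashboard) feature, so in idle moments you can browse it and see how many (or how few) visits, likes, and comments your blog has.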

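2. Implementation

2.1 Dependencies

- Python 3
- BeautifulSoup v4.x (development documentation)
- urllib
- tqdm

If running the code below reports a missing package, install it yourself.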
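To find out which packages are missing before running anything, a quick check along these lines may help. This is only a sketch: `REQUIRED` and `missing_packages` are names made up for this example, and it assumes the third-party packages are imported as `bs4` and `tqdm` (installed with `pip install beautifulsoup4 tqdm`); `urllib` ships with Python.

```python
import importlib.util

# Import names (not pip names) of the third-party packages used in this post.
REQUIRED = ["bs4", "tqdm"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are installed.")
```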

2.2 Fetching the blog page source

Every later step works on the page source, so this is the most basic step, and also the simplest.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Blog address
    url = 'https://blog.csdn.net/smileyan9'
    html = urlopen(url)
    # Name the parser explicitly; bs4 warns when it has to guess one
    soup = BeautifulSoup(html.read(), 'html.parser')
    print(soup.title)
    print(soup.body)

Output:

    <title>Smileyan's blog_smile-yan_博客-我的大后端,C/C++,我的Linux领域博主</title>
    <body class="nodata" style="">
    <script src="https://g.csdnimg.cn/common/csdn-toolbar/csdn-toolbar.js" type="text/javascript"></script>
    ......
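Some sites refuse requests that carry urllib's default User-Agent, and a request without a timeout can hang forever. Whether CSDN does either at any given moment is an assumption on my part, but if the plain `urlopen(url)` call fails, a variant like the following sketch might be worth trying. `build_request`, `fetch_html`, and the User-Agent string are all invented for this example, and the page is assumed to be UTF-8 encoded.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0 (blog-stats demo)"):
    """Build a GET request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return it as text, assuming UTF-8 encoding."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The returned text can be passed to `BeautifulSoup` in place of `html.read()`.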

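2.2 Getting the summary statistics

These include the total number of visits, followers, points, and the overall rank. Once a figure reaches a chosen target, you can send yourself a reminder email.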

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Blog address
    url = 'https://blog.csdn.net/smileyan9'
    html = urlopen(url)
    soup = BeautifulSoup(html.read(), 'html.parser')

    # Narrow the search down to the statistics block
    sources = soup.select('.data-info')
    soup = BeautifulSoup(str(sources), 'html.parser')
    dls = soup.find_all(['dl'])
    notes = ['Original', 'Weekly rank', 'Overall rank', 'Visits', 'Level',
             'Points', 'Followers', 'Likes', 'Comments', 'Favorites']
    for step, dl in enumerate(dls):
        print(notes[step], ':', dl['title'])

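The output looks like this:

    Original : <number>
    Weekly rank : <number>
    Overall rank : <number>
    Visits : <number>
    Level : 7级,点击查看等级说明
    Points : <number>
    Followers : <number>
    Likes : <number>
    Comments : <number>
    Favorites : <number>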
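The `title` values are strings, and not all of them are clean numbers (the level entry above is one example). If you want to compare them with targets later, a small conversion helper may be handy. The sketch below is my own addition: `to_int` and `label_values` are hypothetical names, and it assumes that any number you care about sits at the start of the string, possibly with thousands separators.

```python
import re


def to_int(text):
    """Return the leading integer in text, or None if there is none."""
    match = re.match(r"\s*(\d+)", text.replace(",", ""))
    return int(match.group(1)) if match else None


def label_values(labels, titles):
    """Pair each label with its parsed value; unmatched extras are dropped."""
    return {label: to_int(title) for label, title in zip(labels, titles)}


if __name__ == "__main__":
    print(label_values(["Visits", "Level"], ["12,345", "7级,点击查看等级说明"]))
```

Using `zip` instead of indexing also avoids an `IndexError` if the page ever shows more entries than there are labels.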

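2.3 Collecting all article addresses

This time the address to crawl is the CSDN blog address followed by /article/list/1, where the trailing number is the page number.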

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    from tqdm import tqdm
    from time import sleep

    # Number of list pages
    page_num = 4
    # Base address of an individual article
    key = 'https://smileyan.blog.csdn.net/article/details/'
    # All article addresses
    all_urls = []
    for i in range(page_num):
        url = 'https://smileyan.blog.csdn.net/article/list/{}'.format(i+1)
        html = urlopen(url)
        soup = BeautifulSoup(html.read(), 'html.parser')
        # Search by the CSS class used in the page source
        all_a = soup.select('.article-item-box')
        for one in all_a:
            # Build the full address from the base address and the article id
            target_url = key + one['data-articleid']
            all_urls.append(target_url)
            print(target_url, end=',')
    # 146 in total
    print(len(all_urls))
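The page count above is hard-coded, so it has to be updated by hand as the blog grows. One possible alternative, sketched below, is to keep requesting pages until one comes back empty. This rests on an assumption I have not verified: that a list page beyond the last one contains no `.article-item-box` elements. `collect_all_pages` and `fetch_page` are hypothetical names; `fetch_page(n)` stands for whatever function returns the items found on page `n`.

```python
def collect_all_pages(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until a page returns no items."""
    items = []
    for page in range(1, max_pages + 1):
        found = fetch_page(page)
        if not found:
            break
        items.extend(found)
    return items


if __name__ == "__main__":
    # Stand-in data instead of real network requests.
    fake_pages = {1: ["id-1", "id-2"], 2: ["id-3"]}
    print(collect_all_pages(lambda page: fake_pages.get(page, [])))
```

The `max_pages` cap is there so that a site that never returns an empty page cannot keep the loop running forever.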

2.4 Collecting the view count of every article

This is much the same as above; only the place to look in the page source differs.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    from tqdm import tqdm
    from time import sleep

    # Number of list pages
    page_num = 4
    all_visitors = []
    for i in range(page_num):
        url = 'https://blog.csdn.net/smileyan9/article/list/{}'.format(i+1)
        html = urlopen(url)
        soup = BeautifulSoup(html.read(), 'html.parser')
        all_a = soup.select('.article-item-box')
        for one in all_a:
            # View count of one article
            num = one.select('.read-num')[0].get_text()
            all_visitors.append(int(num))
    print(len(all_visitors))
    for visit in all_visitors:
        print(visit, end=',')
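Once the article ids from 2.3 and the view counts from 2.4 are both in hand, they can be combined, for instance to see which articles are read most. The following is just one way to do it, assuming the two lists are in the same order (both come from the same list pages, but I have not checked this rigorously). `top_articles` is a name invented here.

```python
def top_articles(article_ids, visits, n=3):
    """Return the n (article_id, visits) pairs with the most visits."""
    pairs = zip(article_ids, visits)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


if __name__ == "__main__":
    # Made-up sample data, not real statistics.
    ids = ["a", "b", "c", "d"]
    counts = [10, 300, 25, 120]
    print(top_articles(ids, counts, n=2))
    print("total visits:", sum(counts))
```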

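2.5 Sending an email

After getting the summary statistics as described in 2.2, you can set yourself a target and send yourself an email once the target is reached.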

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import smtplib
    from email.mime.text import MIMEText
    from email.header import Header

    # Blog address
    url = 'https://blog.csdn.net/smileyan9'
    html = urlopen(url)
    soup = BeautifulSoup(html.read(), 'html.parser')
    sources = soup.select('.data-info')
    soup = BeautifulSoup(str(sources), 'html.parser')
    results = soup.find_all(['span'])

    # Overall rank
    place = results[2].get_text()
    # Total points
    # score = results[4].get_text()  # works as long as the points do not exceed 10,000
    score = soup.find_all(['dl'])[5]['title']
    # Total followers
    fans = results[5].get_text()
    # Total visits
    text = soup.find_all(style='min-width:58px')[0]
    visitors = text['title']

    print('Rank:', place)
    print('Points:', score)
    print('Followers:', fans)
    print('Visits:', visitors)

    # Define the targets
    target_place = 2000
    target_visitors = 100*10000
    target_fans = 1000   # So hard, hahaha
    target_score = 10000

    # A smaller rank is better, so compare negated values
    now = [-int(place), int(visitors), int(fans), int(score)]
    targets = [-target_place, target_visitors, target_fans, target_score]
    messages = [
        'Congratulations! Your rank has reached the target!',
        'Congratulations! Your visits have reached the target!',
        'Congratulations! Your followers have reached the target!',
        'Congratulations! Your points have reached the target!'
    ]

    # Index of the first target that has been reached, or -1 if none
    reach = -1
    for i in range(len(targets)):
        if now[i] >= targets[i]:
            reach = i
            break

    if reach > -1:
        common = 'Rank: {}; Points: {}; Followers: {}; Visits: {}'.format(place, score, fans, visitors)
        # Third-party SMTP service
        mail_host = 'smtp.exmail.qq.com'   # server
        mail_user = "root@smileyan.cn"     # user name
        mail_pass = "Your password"        # password
        sender = 'root@smileyan.cn'
        receivers = ['root@smileyan.cn']   # recipients; can be your QQ mailbox or any other mailbox

        # Put the current figures into the body as well
        message = MIMEText(messages[reach] + '\n' + common, 'plain', 'utf-8')
        message['From'] = Header("Python script (by smileyan)", 'utf-8')
        message['To'] = Header("Lucky one", 'utf-8')
        subject = 'Congratulations'
        message['Subject'] = Header(subject, 'utf-8')

        try:
            smtpObj = smtplib.SMTP()
            smtpObj.connect(mail_host, 25)   # 25 is the SMTP port
            smtpObj.login(mail_user, mail_pass)
            smtpObj.sendmail(sender, receivers, message.as_string())
            print("Email sent successfully")
        except smtplib.SMTPException:
            print("Error: unable to send the email")
    else:
        print('The revolution has not yet succeeded; comrades, keep working!')

Note that you need to change the mailbox, the password, and so on. This has been tested and sends email correctly.
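If you would rather not keep the password in the script or send it over an unencrypted port, something like the sketch below could be an alternative. It is not what I tested above. It assumes your mail provider accepts SSL connections on port 465, and that the password is stored in an environment variable; the variable name `MAIL_PASSWORD` and the function names `build_message` and `send_message` are made up for this example.

```python
import os
import smtplib
from email.message import EmailMessage


def build_message(sender, receiver, subject, body):
    """Create a plain-text email message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = receiver
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_message(msg, host, user, port=465):
    """Send msg over SSL; the password is read from the environment."""
    password = os.environ["MAIL_PASSWORD"]  # hypothetical variable name
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(user, password)
        server.send_message(msg)


if __name__ == "__main__":
    demo = build_message("me@example.com", "me@example.com",
                         "Target reached", "rank: 1999")
    print(demo["Subject"])
```

Only `build_message` runs in the demo at the bottom; `send_message` needs a real server and credentials.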

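One more question remains: how do you make this Python code run at a given interval? First, it is best to have a machine that is always on (a cheap, easy-to-use cloud server is recommended). Keeping a piece of code running on a server is then easy: for example, wrap everything in an outer for loop and add a sleep to each iteration. You could also write a Linux script that runs the Python script at a fixed interval.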
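The loop-plus-sleep idea can be sketched roughly as follows. `run_periodically` is a name invented for this example, and `job` stands for whatever function performs the check from 2.5; catching every exception so that one failed request does not stop the loop is my own design choice.

```python
import time


def run_periodically(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, sleeping between runs; return the run count.

    With max_runs=None the loop never ends, which is the "always running"
    case. The sleep parameter exists so the loop can be tried without waiting.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as error:  # keep going after a failed run
            print("job failed:", error)
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


if __name__ == "__main__":
    # Three quick runs, one second apart, with a stand-in job.
    run_periodically(lambda: print("checking blog statistics..."), 1, max_runs=3)
```

On a server you would pass a much longer interval, for example `60 * 60` for an hourly check, and leave `max_runs` unset.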

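Recently (2020.10.28) I found time to set this up on my own Huawei Cloud server. If you are interested, see "Linux scheduled tasks (collecting blog statistics with a Python crawler)".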

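3. Summary

One kind of boredom is called "let's scrape some data for fun", and another is called "let's churn out a blog post while we're at it" ......

Thanks for reading. If you found this fun, remember to leave a like at the bottom left. Thank you!

I came across the simple, easy-to-use BeautifulSoup by chance, so I used it to write this demo. Thanks also for the free blog domain https://smileyan.blog.csdn.net/ . To repeat: the scraping code here is purely for fun, with absolutely no intention of "inflating visits" or "commercial use".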

Smileyan 2020.10.20.17:04
