A big part of why I started teaching myself Python was wanting to scrape data on my own, and today I finally learned how to download a novel. So I grabbed a copy of 《球状闪电》 (Ball Lightning).
Two libraries are needed, requests and BeautifulSoup, both installable with pip (the script below also parses with lxml, which pip can install as well).
The main steps:
1. Fetch the page with requests.get(url). If Chinese text comes out garbled, set encoding = '*', where the asterisk stands for the page's charset, which you can usually find in the charset attribute inside the <head>.
2. Find the path to the content you want by inspecting the element (right-click the body text and choose Inspect).
3. Grab the useful parts with find_all() and filter them.

The final result still has too many blank lines, though, and the formatting is a bit messy. text.replace() didn't help either; maybe the newline characters are different (a possible fix is sketched below).
This comrade must keep striving.
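In hindsight, the stubborn blank lines are probably not literal '\n\n' pairs: pages like this often contain '\r' carriage returns or non-breaking spaces ('\xa0') that a plain replace('\n\n', '\n') never matches. A minimal cleanup sketch under that assumption (clean_text is a hypothetical helper, not part of the script below):

def clean_text(raw):
    # Normalize carriage returns and non-breaking spaces first, then
    # rebuild the text from non-empty lines rather than counting '\n's.
    raw = raw.replace('\r\n', '\n').replace('\r', '\n').replace('\xa0', ' ')
    return '\n'.join(line.strip() for line in raw.split('\n') if line.strip())

If this holds, get_contents could simply return clean_text(texts[0].text) instead of the replace call.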
import requests
from bs4 import BeautifulSoup


def get_contents(target):
    # Fetch one chapter page and return its text.
    req = requests.get(url=target)
    req.encoding = 'GB2312'
    html = req.text
    bf = BeautifulSoup(html, features='lxml')
    texts = bf.find_all('div', id='content')
    texts = texts[0].text.replace('\n\n', '\n')  # still can't get rid of the extra newlines?
    return texts


def writer(name, path, text):
    # Append one chapter (title + body) to the file at path.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.write(text)
        f.write('\n\n')


if __name__ == "__main__":
    # Fetch the table of contents.
    names, urls = [], []
    req = requests.get(url='http://book.sbkk8.com/xiandai/liucixinzuopinji/qiuzhuangshandian')
    req.encoding = 'GB2312'
    html = req.text
    bf = BeautifulSoup(html, features='lxml')
    content = bf.find_all('div', class_='mulu')
    atmp = BeautifulSoup(str(content[0]), features='lxml')
    a = atmp.find_all('a')  # returns a list of chapter links
    num = len(a)
    for u in a:  # collect each chapter's title and URL
        names.append(u.string)
        urls.append('http://book.sbkk8.com/' + u.get('href'))
    print("Downloading...")
    for i in range(num):
        writer(names[i], 'Ball-lightning.txt', get_contents(urls[i]))
        print("%.2f%% has been downloaded" % float(100.0 * i / num), end='\r')
    print("100.00% has been downloaded\nFinish")
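Hard-coding GB2312 works for this site, but requests can also guess the charset from the response bytes via its apparent_encoding property, which saves digging through the <head> by hand. A sketch, assuming the same table-of-contents URL:

import requests

url = 'http://book.sbkk8.com/xiandai/liucixinzuopinji/qiuzhuangshandian'
req = requests.get(url)
# apparent_encoding runs charset detection on the body, instead of
# trusting the HTTP headers (which often just default to ISO-8859-1).
req.encoding = req.apparent_encoding
html = req.text

Detection costs a little time per request, but it would also keep the script working if the site ever switches to UTF-8.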