Automatically Building a Markdown Blog List with Python

it2025-07-19 12

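Building a blog list with Python: https://blog.csdn.net/nameofcsdn/article/details/104988119

In that post I gave a Python program that prints the title and URL of every one of my blog posts.

Because of an annoying limit (a single article cannot exceed 64,000 characters), I had to switch to the Markdown editor.

So I adjusted the program's output format to follow Markdown syntax: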

import os
import re
import urllib.request


def out(n):
    # Print n '#' characters (the Markdown heading prefix) without a newline.
    if n:
        print("#", end='')
        out(n - 1)


def readfile(A, file):
    # Read a saved page, find its post URL, fetch the live title,
    # and append 'title url' to the list A.
    with open(file, 'r') as fh:
        f = fh.read()
    try:
        url = re.findall('Content-Location: http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', f)
        eachurl = url[0][18:]  # strip the 18-character 'Content-Location: ' prefix
    except IndexError:
        url = re.findall('http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', f)
        eachurl = url[0]
    html = urllib.request.urlopen(eachurl).read().decode('UTF-8')
    title = re.findall('var articleTitle =.*;', html)
    eachtitle = title[0]
    aurl = eachtitle[20:-2] + ' ' + eachurl  # [20:-2] drops 'var articleTitle = "' and '";'
    A.append(aurl)


def outpath(path1, path2, deep):
    path1 = os.path.join(path1, path2)
    mylist = os.listdir(path1)
    out(deep)
    if os.path.isdir(os.path.join(path1, mylist[0])):  # contains only directories
        print(' ', path2)
        for adir in mylist:
            outpath(path1, adir, deep + 1)
    else:  # contains only files
        A = []
        for adir in mylist:
            try:
                readfile(A, os.path.join(path1, adir))
            except Exception:
                A.append(adir)  # fall back to the file name if the URL or title cannot be read
        print(' ', path2, ' 共', len(mylist), '篇')
        A.sort()
        for each in A:
            # The last 58 characters are the URL (https plus a 9-digit article id).
            print(each[:-58] + ' [博客链接](' + each[-58:] + ')')


outpath('D:\\朱聪', '博客备份（2020年10月20日）', 0)  # depth 0 corresponds to path2 = '博客备份', and so on

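Result:

Pasting the output directly into a blog post gives: https://blog.csdn.net/nameofcsdn/article/details/109147261

I use the label "博客链接" ("blog link") rather than "Link" because the word "link" already appears in some of my post titles, which makes it inconvenient.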
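The output relies on only two Markdown constructs: `#` heading prefixes and inline links of the form `[label](url)`. The following is a minimal sketch of those two pieces in isolation. The helper names `heading` and `link_line` are hypothetical and do not appear in the original program, and the spacing is simplified compared with what `print` produces there.

```python
def heading(depth, name):
    """Return a heading line; depth 0 yields no '#' prefix, as in the script."""
    return '#' * depth + ' ' + name


def link_line(title, url, label='博客链接'):
    """Return 'title [label](url)': the title followed by an inline Markdown link."""
    return title + ' [' + label + '](' + url + ')'


print(heading(2, '7.6, 从数学到编程'))
print(link_line('python 构建博客列表',
                'https://blog.csdn.net/nameofcsdn/article/details/104988119'))
```

Building the line from a separate title and URL also avoids depending on the URL being exactly 58 characters long.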

A further improvement:

Replace the function call on the last line with:

path = r'D:\朱聪\博客备份\7，数学与逻辑\7.6, 从数学到编程'
loc = path.rfind('\\')
path1 = path[0:loc]
outpath(path1, path[loc+1:], len(path1.split('\\'))-2)

This way I only need to paste in the absolute path each time, and nothing else has to be changed.
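To see why the heading depth comes out right, it helps to isolate the path arithmetic. This sketch wraps the same three steps in a hypothetical helper, `split_backup_path`. The `root_depth=2` default mirrors the `-2` in the snippet and assumes the backup root sits two levels below the drive, as in `D:\朱聪\博客备份`.

```python
def split_backup_path(path, root_depth=2):
    """Split a Windows path into (parent, leaf, heading depth)."""
    loc = path.rfind('\\')            # position of the last backslash
    parent = path[:loc]               # everything before it
    leaf = path[loc + 1:]             # the folder to scan
    depth = len(parent.split('\\')) - root_depth
    return parent, leaf, depth


print(split_backup_path(r'D:\朱聪\博客备份'))
print(split_backup_path(r'D:\朱聪\博客备份\7，数学与逻辑\7.6, 从数学到编程'))
```

The backup root gets depth 0 (no `#`), and a folder two levels below it gets depth 2 (`##`), so a subtree printed on its own has the same heading levels as it would in a full run.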

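Another improvement:

Opening every web page one by one is too slow, so I decided to store the information locally so that later runs do not need to visit the pages again.

After considering several options, I chose to rename the files directly, changing each saved post's file name to the post's title.

This raises another problem: some characters cannot be used in file names.

So the actual strategy is to rename a file whenever possible. A file that cannot be renamed is left as it is, and its web page has to be fetched on every run.

Among the characters that are not allowed in file names, the ASCII colon (:) appears in many of my post titles, while the others are isolated cases. So when renaming, I replace the ASCII colon with the full-width Chinese colon (：).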
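The renaming rule can be sketched as a small function. Note that this is a different technique from the one the script uses: the script simply attempts the rename and ignores any failure, whereas this sketch checks the title up front against the characters Windows forbids in file names. The names `FORBIDDEN` and `title_to_filename` are hypothetical, and the `.mht` extension is taken from the script.

```python
# Characters that Windows does not allow in file names.
FORBIDDEN = set('\\/:*?"<>|')


def title_to_filename(title, ext='.mht'):
    """Return a file name for the title, or None if the title cannot be used."""
    candidate = title.replace(':', '：')  # ASCII colon -> full-width colon
    if any(ch in FORBIDDEN for ch in candidate):
        return None                       # leave the file unrenamed
    return candidate + ext


print(title_to_filename('python: 构建博客列表'))
print(title_to_filename('what is a*b?'))
```

A `None` result corresponds to the "cannot rename" case, where the page still has to be fetched on every run.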

Code:

import os
import re
import urllib.request


def out(n):
    # Print n '#' characters (the Markdown heading prefix) without a newline.
    if n:
        print("#", end='')
        out(n - 1)


def getUrlFromFile(file):
    # Extract the post URL from a saved .mht file.
    with open(file, 'r') as fh:
        f = fh.read()
    try:
        url = re.findall('Content-Location: http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', f)
        ret = url[0][18:]  # strip the 18-character 'Content-Location: ' prefix
    except IndexError:
        url = re.findall('http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', f)
        ret = url[0]
    return ret


def getTitleFromFile(file):
    # Fetch the live page and read the title from its 'var articleTitle' line.
    eachurl = getUrlFromFile(file)
    try:
        html = urllib.request.urlopen(eachurl).read().decode('UTF-8')
    except Exception:
        print(file)  # report which file failed, then re-raise instead of using an undefined html
        raise
    title = re.findall('var articleTitle =.*;', html)
    eachtitle = title[0]
    aurl = eachtitle[20:-2]  # drops 'var articleTitle = "' and '";'
    return aurl


def outpath(path1, path2, deep):
    path1 = os.path.join(path1, path2)
    mylist = os.listdir(path1)
    out(deep)
    if os.path.isdir(os.path.join(path1, mylist[0])):  # contains only directories
        print(' ', path2)
        for adir in mylist:
            outpath(path1, adir, deep + 1)
    else:  # contains only files
        A = []
        for adir in mylist:
            file = os.path.join(path1, adir)
            filename = getUrlFromFile(file)  # despite the name, this holds the post URL
            if '博客' in adir or 'nameofcsdn' in adir:
                title = getTitleFromFile(file)  # not renamed yet: fetch the title online
            else:
                title = adir[:-4]  # already renamed: the file name minus '.mht' is the title
            A.append(title + ' ' + filename)
            title = title.replace(':', '：')
            try:
                os.replace(file, os.path.join(path1, title + '.mht'))
            except OSError:
                pass  # the title is not a valid file name; keep the old name
        print(' ', path2, ' 共', len(mylist), '篇')
        A.sort()
        for each in A:
            # The last 58 characters are the URL (https plus a 9-digit article id).
            print(each[:-58] + ' [博客链接](' + each[-58:] + ')')


path = r'D:\朱聪\博客备份'
loc = path.rfind('\\')
path1 = path[0:loc]
outpath(path1, path[loc+1:], len(path1.split('\\'))-2)

Before this optimization, scanning 1,300 posts took about 30 minutes. Afterwards it takes only 2 minutes.
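The speed-up comes from the check that decides whether a file still needs a network request. Here is a minimal sketch of that decision on its own, using a hypothetical helper named `title_from_name`. It assumes, as the script's condition suggests, that a file which has not been renamed yet still contains "博客" or "nameofcsdn" in its name.

```python
def title_from_name(filename):
    """Return the cached title, or None if the page must still be fetched."""
    if '博客' in filename or 'nameofcsdn' in filename:
        return None              # original saved name: title unknown, fetch it
    return filename[:-4]         # renamed file: drop the '.mht' extension


print(title_from_name('二分法.mht'))
print(title_from_name('某篇文章_nameofcsdn的博客.mht'))
```

One consequence of this rule is that a post whose real title contains "博客" would be fetched on every run, because its renamed file is indistinguishable from one that was never renamed.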
