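Task overview
In the word list, some words appear more than once, and each occurrence carries its own example sentences. The goal is to find the indices of all repeated words, merge their example sentences into one row, and finally split the whole list into a repeated part and a non-repeated part.
Core code
1. Reading and writing Excel with the xlwt and xlrd modules
Reading an Excel file takes three steps: get the list of all sheet names, open a sheet by name, then use sheet.row_values() and sheet.col_values() to fetch the contents of a given row or column.
import xlrd
initialData = '...'  # path to the Excel file to read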
workbook = xlrd.open_workbook(initialData)
sheet_names = workbook.sheet_names()
sheet = workbook.sheet_by_name(sheet_names[0])
data = sheet.col_values(4)
Writing an Excel file starts with xlwt.Workbook(), which creates a new in-memory workbook; .add_sheet() then creates a sheet with the given name.
import xlwt
book = xlwt.Workbook(encoding='utf-8', style_compression=0)
wSheet1 = book.add_sheet("noRepetition")
wSheet2 = book.add_sheet("repetition")
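2. Removing duplicates with set(data)
Build a list, allData, that stores for each unique word the index of its first occurrence, the number of times it occurs, and the word itself. (The full script strips surrounding whitespace from every entry of data first, so that words differing only in spaces count as the same word.)
data_unique = set(data)
allData = []
for item in data_unique:
    id = data.index(item)   # index of the first occurrence
    num = data.count(item)  # number of occurrences
    allData.append([id, num, data[id].strip()])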
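The loop above calls data.index() and data.count() once per unique word, and each call scans the whole list. As a minimal sketch of an alternative (not the approach used in this post), a single pass with a dictionary can collect every index of each word. The name group_indices and the sample words are made up for illustration.

```python
from collections import defaultdict


def group_indices(words):
    """Map each stripped word to the list of row indices where it occurs."""
    groups = defaultdict(list)
    for idx, word in enumerate(words):
        groups[word.strip()].append(idx)
    return dict(groups)


# Hypothetical sample column; the real data comes from sheet.col_values(4).
sample = ["apple", "pear ", "apple", "plum", "pear", "apple"]
groups = group_indices(sample)

# Same [first index, count, word] shape as allData, in first-occurrence order.
all_data = [[idxs[0], len(idxs), word] for word, idxs in groups.items()]
print(all_data)
```

Because a dict keeps insertion order, the rows come out in the order the words first appear, whereas iterating over a set gives no such guarantee.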
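3. Finding all example sentences
The core idea is to use .index() to locate every example sentence of a repeated word. Because .index() returns only the index of the first match, each occurrence that has already been found is overwritten with a placeholder, and the search is repeated as many times as the word's repeat count requires; this walks through every occurrence and therefore every example sentence. (Adapted from: https://blog.csdn.net/qq_33094993/article/details/53584379; the trick is nicknamed "偷梁换柱", a Chinese idiom for a stealthy substitution.)
# id, num, word, c2 and dlen come from the enclosing loop in the full code below
nid = id
for n in range(num-1):
    data[nid] = 'quchu'           # overwrite the occurrence already handled
    print(id, num, data[nid])     # debug output
    nid = data.index(word)        # next remaining occurrence of the word
    nwordData = sheet.row_values(nid)
    # append columns 6 to 9 of the duplicate row after the first row's data
    wSheet2.write(c2, 1+dlen+4*n, nwordData[6])
    wSheet2.write(c2, 1+dlen+4*n+1, nwordData[7])
    wSheet2.write(c2, 1+dlen+4*n+2, nwordData[8])
    wSheet2.write(c2, 1+dlen+4*n+3, nwordData[9])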
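The replace-and-search trick can be tried on a plain list without any Excel file. The sketch below first mimics the loop above, then shows an alternative that leaves the list untouched by passing a start position to list.index(). Both helper names and the sample words are illustrative only.

```python
def occurrences_by_replacing(words, word, placeholder="quchu"):
    """Find every index of `word` by overwriting each hit, as in the loop above.

    Works on a copy so the caller's list is not modified.
    """
    words = list(words)
    found = []
    for _ in range(words.count(word)):
        idx = words.index(word)      # first remaining occurrence
        found.append(idx)
        words[idx] = placeholder     # hide it so the next search moves on
    return found


def occurrences_by_start(words, word):
    """Find every index of `word` using list.index(value, start)."""
    found = []
    start = 0
    while True:
        try:
            idx = words.index(word, start)
        except ValueError:           # no further occurrence
            return found
        found.append(idx)
        start = idx + 1


sample = ["apple", "pear", "apple", "plum", "apple"]
print(occurrences_by_replacing(sample, "apple"))
print(occurrences_by_start(sample, "apple"))
```

The placeholder approach only works if the placeholder never occurs as a real word in the list; the start-position variant avoids that assumption.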
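Full code
import xlwt, xlrd
# Note: xlrd 2.0 and later read only .xls files, so an .xlsx input needs an older xlrd or a conversion to .xls first.
initialData = 'book.xlsx'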
workbook = xlrd.open_workbook(initialData)
sheet_names = workbook.sheet_names()
sheet = workbook.sheet_by_name(sheet_names[0])
data = sheet.col_values(4)
print(len(data))
for i in range(len(data)):
    data[i] = data[i].strip()
data_unique = set(data)
allData = []
for item in data_unique:
    id = data.index(item)
    num = data.count(item)
    allData.append([id, num, data[id].strip()])
book = xlwt.Workbook(encoding='utf-8', style_compression=0)
wSheet1 = book.add_sheet("noRepetition")
wSheet2 = book.add_sheet("repetition")
c1 = 0
c2 = 0
for d in allData:
    id = d[0]
    num = d[1]
    word = d[2]
    wordData = sheet.row_values(int(id))
    if num > 1:
        # repeated word: write the count, then the first row, then the merged columns
        wSheet2.write(c2, 0, num)
        dlen = len(wordData)
        for i in range(dlen):
            wSheet2.write(c2, i+1, wordData[i])
        nid = id
        for n in range(num-1):
            data[nid] = 'quchu'
            print(id, num, data[nid])
            nid = data.index(word)
            nwordData = sheet.row_values(nid)
            wSheet2.write(c2, 1+dlen+4*n, nwordData[6])
            wSheet2.write(c2, 1+dlen+4*n+1, nwordData[7])
            wSheet2.write(c2, 1+dlen+4*n+2, nwordData[8])
            wSheet2.write(c2, 1+dlen+4*n+3, nwordData[9])
        c2 = c2 + 1
    else:
        # word occurs once: copy the row unchanged
        for i in range(len(wordData)):
            wSheet1.write(c1, i, wordData[i])
        c1 = c1 + 1
savePath = 'book_分离.xls'
book.save(savePath)
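To check the merge logic without an Excel file, the same split can be sketched on plain lists of rows. This is a hedged illustration, not the script above: split_rows, make_row and the sample data are hypothetical, the key column (4) and the merged columns (6 to 9) are taken from the script, and str() is applied to the key as an assumption so that numeric cells do not break .strip().

```python
def split_rows(rows, key_col=4, example_cols=(6, 7, 8, 9)):
    """Split rows into (no_repetition, repetition) lists.

    A repetition row is laid out like the script's "repetition" sheet:
    [count] + first row + example columns of each later duplicate.
    """
    groups = {}
    for idx, row in enumerate(rows):
        groups.setdefault(str(row[key_col]).strip(), []).append(idx)

    no_repetition, repetition = [], []
    for idxs in groups.values():
        first = list(rows[idxs[0]])
        if len(idxs) == 1:
            no_repetition.append(first)
            continue
        merged = [len(idxs)] + first
        for later in idxs[1:]:
            merged.extend(rows[later][c] for c in example_cols)
        repetition.append(merged)
    return no_repetition, repetition


def make_row(word, tag):
    """Build a hypothetical 10-column row: word in column 4, examples in 6-9."""
    return [""] * 4 + [word] + [""] + [f"{tag}-ex{k}" for k in range(4)]


rows = [make_row("apple", "a1"), make_row("pear", "p1"), make_row(" apple", "a2")]
no_rep, rep = split_rows(rows)
print(len(no_rep), len(rep))
```

Each returned row could then be written cell by cell with the same wSheet1.write(row, col, value) and wSheet2.write(row, col, value) calls used in the script.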