Finding Duplicate Values with Python

it | 2023-01-09 | 121

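Task overview

Some words appear more than once in the vocabulary list, each time with its own example sentences. The goal is to find the index of every repeated word, merge the example sentences of the repeats into one row, and finally split the whole list into a part with repeated words and a part without.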

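Core code

1. Reading and writing Excel with the xlwt and xlrd modules

To read an Excel file, get the list of all sheet names, load one sheet by name, then use sheet.row_values() and sheet.col_values() to fetch the contents of a row or a column. Note that xlrd 2.0 and later can read only the legacy .xls format, so an .xlsx file needs an older xlrd release or a conversion to .xls first.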

    import xlrd

    initialData = '...'  # path of the Excel file to read
    workbook = xlrd.open_workbook(initialData)
    sheet_names = workbook.sheet_names()
    sheet = workbook.sheet_by_name(sheet_names[0])  # first sheet
    data = sheet.col_values(4)  # column index 4 holds the words

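To write an Excel file, create an in-memory workbook with xlwt.Workbook(), then create each sheet by name with .add_sheet(). Nothing is written to disk until book.save() is called.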

    import xlwt

    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    wSheet1 = book.add_sheet("noRepetition")  # words that occur once
    wSheet2 = book.add_sheet("repetition")    # words that occur more than once

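2. Removing all duplicates with set(data)

Build a list of rows, allData, that stores for each distinct word the index of its first occurrence, the number of times it occurs, and the word itself.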

    data_unique = set(data)  # one entry per distinct word
    allData = []
    for item in data_unique:
        id = data.index(item)   # index of the first occurrence
        num = data.count(item)  # number of occurrences
        allData.append([id, num, data[id].strip()])
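Calling data.index() and data.count() for every distinct word rescans the whole list each time. As a minimal sketch of an alternative, assuming data is a plain list of strings, a single pass can record the first index and the count of each word. The function name first_index_and_count is hypothetical and does not appear in the original post.

```python
def first_index_and_count(data):
    """Return [first_index, count, word] rows, one per distinct word."""
    stats = {}  # word -> [first_index, count]
    for i, raw in enumerate(data):
        word = raw.strip()
        if word in stats:
            stats[word][1] += 1
        else:
            stats[word] = [i, 1]
    # dicts keep insertion order, so rows come out in order of first appearance
    return [[idx, num, word] for word, (idx, num) in stats.items()]


all_data = first_index_and_count(["apple", "pear ", "apple", "fig", "pear", "apple"])
```

Unlike iterating over a set, whose order is arbitrary, this keeps the rows in the order the words first appear in the sheet.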

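3. Finding all example sentences

The core idea is to use .index() to locate every row of a repeated word. Since .index() returns only the index of the first match, the occurrence that has already been found is overwritten with a placeholder, and .index() is called again. Repeating this according to the word's repeat count finds every occurrence, and with it every example sentence. (Adapted from https://blog.csdn.net/qq_33094993/article/details/53584379; the trick is also known as "偷梁换柱", roughly "swapping in a substitute".)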

    nid = id
    for n in range(num-1):
        data[nid] = 'quchu'  # overwrite the occurrence already found
        print(id, num, data[nid])
        nid = data.index(word)  # next occurrence of the same word
        nwordData = sheet.row_values(nid)
        # append columns 6 to 9 of the repeated row after the first row's cells
        wSheet2.write(c2, 1+dlen+4*n, nwordData[6])
        wSheet2.write(c2, 1+dlen+4*n+1, nwordData[7])
        wSheet2.write(c2, 1+dlen+4*n+2, nwordData[8])
        wSheet2.write(c2, 1+dlen+4*n+3, nwordData[9])
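The replace-and-search-again loop can be tried on its own, without any Excel file. The following is a minimal sketch, assuming a plain list of strings; the names all_indices_by_replacement and all_indices_by_scan are hypothetical. The second function is a swapped-in alternative, not the technique from the post: it collects the same indices in one pass with enumerate() and never modifies the list.

```python
def all_indices_by_replacement(data, word, placeholder="quchu"):
    """Find every index of word by overwriting each hit, then searching again."""
    work = list(data)          # work on a copy so the caller's list is untouched
    indices = []
    for _ in range(work.count(word)):
        i = work.index(word)   # index() returns only the first remaining match
        indices.append(i)
        work[i] = placeholder  # hide this hit so the next index() moves on
    return indices


def all_indices_by_scan(data, word):
    """Non-mutating alternative: collect matching positions in one pass."""
    return [i for i, item in enumerate(data) if item == word]


words = ["cat", "dog", "cat", "bird", "cat"]
found = all_indices_by_replacement(words, "cat")
```

The placeholder approach relies on the placeholder never being equal to the word being searched for; the enumerate() version has no such constraint.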

Full code

    import xlwt, xlrd

    # xlrd 2.0 and later read only .xls; an .xlsx input needs an older xlrd
    initialData = 'book.xlsx'
    workbook = xlrd.open_workbook(initialData)
    sheet_names = workbook.sheet_names()
    sheet = workbook.sheet_by_name(sheet_names[0])
    data = sheet.col_values(4)  # column index 4 holds the words
    print(len(data))

    # strip whitespace so that "word" and "word " count as the same word
    for i in range(len(data)):
        data[i] = data[i].strip()

    # one [first index, count, word] row per distinct word
    data_unique = set(data)
    allData = []
    for item in data_unique:
        id = data.index(item)
        num = data.count(item)
        allData.append([id, num, data[id].strip()])

    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    wSheet1 = book.add_sheet("noRepetition")
    wSheet2 = book.add_sheet("repetition")

    c1 = 0  # next free row in noRepetition
    c2 = 0  # next free row in repetition
    for d in allData:
        id = d[0]
        num = d[1]
        word = d[2]
        wordData = sheet.row_values(int(id))
        if num > 1:
            # repeated word: write the count, then the full first row
            wSheet2.write(c2, 0, num)
            dlen = len(wordData)
            for i in range(dlen):
                wSheet2.write(c2, i+1, wordData[i])
            nid = id
            for n in range(num-1):
                data[nid] = 'quchu'  # overwrite the occurrence already found
                print(id, num, data[nid])
                nid = data.index(word)  # next occurrence of the same word
                nwordData = sheet.row_values(nid)
                # append columns 6 to 9 of the repeated row
                wSheet2.write(c2, 1+dlen+4*n, nwordData[6])
                wSheet2.write(c2, 1+dlen+4*n+1, nwordData[7])
                wSheet2.write(c2, 1+dlen+4*n+2, nwordData[8])
                wSheet2.write(c2, 1+dlen+4*n+3, nwordData[9])
            c2 = c2 + 1
        else:
            # word occurs once: copy its row unchanged
            for i in range(len(wordData)):
                wSheet1.write(c1, i, wordData[i])
            c1 = c1 + 1

    savePath = 'book_分离.xls'
    book.save(savePath)
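The split-and-merge logic of the script can also be expressed independently of Excel, which makes it easier to check on a few rows. This is a minimal sketch under stated assumptions: rows is a list of lists already read into memory, and the names split_rows, key_col, and extra_cols are hypothetical. It groups rows with a dictionary rather than the placeholder trick used in the post. In the script above, key_col would be 4 and extra_cols would be [6, 7, 8, 9].

```python
from collections import defaultdict


def split_rows(rows, key_col, extra_cols):
    """Split rows into (unique, merged) by the stripped value in key_col.

    For a repeated key, the first row is kept whole and the extra_cols
    values of each later row are appended to it, prefixed by the count.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[str(row[key_col]).strip()].append(row)

    unique, merged = [], []
    for group in groups.values():
        first = list(group[0])
        if len(group) == 1:
            unique.append(first)
            continue
        for other in group[1:]:
            first.extend(other[c] for c in extra_cols)
        merged.append([len(group)] + first)
    return unique, merged


rows = [
    ["run", "He runs fast."],
    ["eat", "We eat late."],
    ["run ", "Run the script."],
]
unique, merged = split_rows(rows, key_col=0, extra_cols=[1])
```

Each list in unique or merged could then be written to its sheet one cell at a time, as the script does with wSheet1.write() and wSheet2.write().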
