Python Cookbook Notes, Part 2

it2023-05-31 80


Chapter 2: Strings and Text

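—Splitting Strings on Any of Multiple Delimiters—

You need to split a string into fields, but the delimiters (and the whitespace around them) are not consistent throughout the string.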

# str.split() only handles very simple cases of string splitting.
# When you need more flexibility, use re.split() instead.
>>> line = 'asdf fjdk; afed, fjek,asdf, foo'
>>> import re
>>> re.split(r'[;,\s]\s*', line)
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
# Watch out for capture groups in the pattern: with one, the matched delimiters also end up in the result.
>>> fields = re.split(r'(;|,|\s)\s*', line)
>>> fields
['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
# To group alternatives without capturing them, use a non-capture group (?:...).
>>> re.split(r'(?:,|;|\s)\s*', line)
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

# Getting the delimiters back is useful in some cases, for example to rebuild the string later.
>>> values = fields[::2]
>>> delimiters = fields[1::2] + ['']
>>> values
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
>>> delimiters
[' ', ';', ',', ',', ',', '']
>>> # Reform the line using the same delimiters
>>> ''.join(v+d for v,d in zip(values, delimiters))
'asdf fjdk;afed,fjek,asdf,foo'
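The splitting recipe above can be wrapped in a small helper. This is only a sketch: the name split_fields and its delimiters parameter are made up for illustration, and it assumes that empty fields should be dropped.

```python
import re

def split_fields(line, delimiters=';,'):
    """Split `line` on any character in `delimiters` or on whitespace.

    Hypothetical helper, not part of the re module.
    """
    # re.escape keeps characters such as '|' or ']' literal inside the class.
    pattern = r'[{}\s]\s*'.format(re.escape(delimiters))
    # Strip first so leading/trailing whitespace does not produce empty fields,
    # then drop the empty strings left behind by adjacent delimiters.
    return [field for field in re.split(pattern, line.strip()) if field]

print(split_fields('asdf fjdk; afed, fjek,asdf, foo'))
print(split_fields('  a|b ', delimiters='|'))
```

Building the character class from a parameter keeps the pattern in one place; whether dropping empty fields is right depends on the data being parsed.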

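—Matching Text at the Start or End of a String—

You need to check the start or end of a string against specific text patterns, such as filename extensions or URL schemes.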

# Use the str.startswith() or str.endswith() method
>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False

If you need to check against several choices, put them in a tuple (it must be a tuple; a list or set raises TypeError) and pass it to startswith() or endswith().

>>> import os
>>> filenames = os.listdir('.')
>>> filenames
[ 'Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h' ]
>>> [name for name in filenames if name.endswith(('.c', '.h')) ]
['foo.c', 'spam.c', 'spam.h']
>>> any(name.endswith('.py') for name in filenames)
True

You might also think of doing this with a regular expression:

>>> import re
>>> url = 'http://www.python.org'
>>> re.match('http:|https:|ftp:', url)
<_sre.SRE_Match object at 0x101253098>

startswith() and endswith() also combine nicely with other operations, such as common data reductions:

if any(name.endswith(('.c', '.h')) for name in os.listdir(dirname)):
    ...
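Since the tuple requirement is easy to trip over, here is a minimal sketch of one way around it. The helper name has_suffix and the sample file names are assumptions for illustration only.

```python
def has_suffix(name, suffixes):
    """Return True if `name` ends with any of `suffixes` (any iterable of str)."""
    # str.endswith() accepts a str or a tuple of str, so convert first.
    return name.endswith(tuple(suffixes))

choices = ['.c', '.h']  # a list, e.g. read from a config file
names = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

sources = [n for n in names if has_suffix(n, choices)]
print(sources)

# Passing the list directly fails with TypeError.
try:
    'foo.c'.endswith(choices)
except TypeError as exc:
    print('TypeError:', exc)
```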

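—Matching Strings Using Shell Wildcard Patterns—

You want to match text using the same wildcard patterns that are commonly used in Unix shells (for example *.py, Dat[0-9]*.csv).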

# The fnmatch module provides two functions, fnmatch() and fnmatchcase(), for this kind of matching
>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
>>> names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
>>> [name for name in names if fnmatch(name, 'Dat*.csv')]
['Dat1.csv', 'Dat2.csv']
# fnmatch() follows the case-sensitivity rules of the underlying operating system
>>> # On OS X (Mac)
>>> fnmatch('foo.txt', '*.TXT')
False
>>> # On Windows
>>> fnmatch('foo.txt', '*.TXT')
True
# If that matters, use fnmatchcase(), which always compares case-sensitively
>>> fnmatchcase('foo.txt', '*.TXT')
False
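fnmatchcase() is not limited to file names. As a small sketch, the same wildcards can filter ordinary strings; the address list below is made-up sample data.

```python
from fnmatch import fnmatchcase

# Made-up sample data for illustration.
addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]

# '*' matches any run of characters, so this keeps entries ending in ' ST'.
streets = [addr for addr in addresses if fnmatchcase(addr, '* ST')]

# '[0-9]' matches exactly one digit.
clark_54xx = [addr for addr in addresses
              if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]

print(streets)
print(clark_54xx)
```

This sits somewhere between plain string methods and full regular expressions, which is handy when the patterns come from users who know shell globbing.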

—Matching and Searching for Text Patterns—

You want to match or search text for a specific pattern.

# To match a literal string, the basic string methods such as str.find(), str.endswith(), and str.startswith() are usually enough. For more complicated matching, use regular expressions and the re module.
>>> text1 = '11/27/2012'
>>> text2 = 'Nov 27, 2012'
>>>
>>> import re
>>> # Simple matching: \d+ means match one or more digits
>>> if re.match(r'\d+/\d+/\d+', text1):
...     print('yes')
... else:
...     print('no')
...
yes

If you are going to match many times with the same pattern, precompile the pattern string into a pattern object with re.compile() first.

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> if datepat.match(text1):
...     print('yes')
... else:
...     print('no')
...
yes
# match() always tries to match at the start of the string. To find every occurrence of the pattern anywhere in the string, use findall() instead.

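When defining a regular expression, it is common to introduce capture groups by enclosing parts of the pattern in parentheses.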

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> m = datepat.match('11/27/2012')
>>> m
<_sre.SRE_Match object at 0x1005d2750>
>>> # Extract the contents of each group
>>> m.group(0)
'11/27/2012'
>>> m.group(1)
'11'
>>> m.group(2)
'27'
>>> m.group(3)
'2012'
>>> m.groups() # month, day, year = m.groups()
('11', '27', '2012')

Tip: if you are going to do a lot of matching or searching, compile the regular expression first and then reuse it.
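To make the difference between match() and searching the whole string concrete, here is a short sketch that reuses one compiled pattern with findall(), finditer(), and fullmatch(). The sample text is an assumption chosen to contain two dates.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# findall() scans the whole string and returns the captured groups.
print(datepat.findall(text))

# finditer() yields match objects one at a time instead of building a list.
for m in datepat.finditer(text):
    month, day, year = m.groups()
    print('{}-{}-{}'.format(year, month, day))

# match() only anchors at the start, so trailing junk is accepted;
# fullmatch() requires the entire string to match.
print(datepat.match('11/27/2012abcdef') is not None)
print(datepat.fullmatch('11/27/2012abcdef') is not None)
```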

—Searching and Replacing Text—

You want to search for a text pattern in a string and replace it.

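# For simple literal patterns, use str.replace() directly
>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> text.replace('yeah', 'yep')
'yep, but no, but yep, but no, but yep'
# For more complicated patterns, use the sub() function in the re module
>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text) # A backslashed digit such as \3 refers to a capture group number in the pattern
'Today is 2012-11-27. PyCon starts 2013-3-13.'
# If you are going to substitute with the same pattern repeatedly, consider compiling it with re.compile() first for better performance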

# For more complicated substitutions, pass a callback function instead of a replacement string
>>> from calendar import month_abbr
>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> def change_date(m):
...     mon_name = month_abbr[int(m.group(1))]
...     return '{} {} {}'.format(m.group(2), mon_name, m.group(3))
...
>>> datepat.sub(change_date, text)
'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'
# If you also want to know how many substitutions were made, use subn() instead
>>> newtext, n = datepat.subn(r'\3-\1-\2', text)
>>> newtext
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>> n
2
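Numbered references like \3 get hard to read as patterns grow. One alternative, sketched here with group names of my own choosing (month, day, year), is to use named groups and refer to them as \g<name> in the replacement or by name in a callback.

```python
import re

# Named groups; the names month/day/year are illustrative choices.
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# \g<name> in the replacement string refers to a named group.
iso_like = datepat.sub(r'\g<year>-\g<month>-\g<day>', text)
print(iso_like)

def to_iso(m):
    """Callback that also zero-pads month and day."""
    return '{:04d}-{:02d}-{:02d}'.format(
        int(m.group('year')), int(m.group('month')), int(m.group('day')))

# subn() returns the new text together with the number of substitutions.
padded, count = datepat.subn(to_iso, text)
print(padded, count)
```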

—Searching and Replacing Case-Insensitive Text—

# To ignore case in text operations, pass the re.IGNORECASE flag to the functions of the re module
>>> text = 'UPPER PYTHON, lower python, Mixed Python'
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace
>>> re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
'UPPER SNAKE, lower snake, Mixed Snake'
# matchcase('snake') returns a callback function (whose argument must be a match object). As mentioned in the previous section, sub() accepts a callback function as well as a replacement string.

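—Specifying a Regular Expression for the Shortest Match—

You are trying to match a text pattern with a regular expression, but it finds the longest possible match. You want it to find the shortest possible match instead.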

>>> str_pat = re.compile(r'\"(.*)\"') # r'\"(.*)\"' is meant to match text enclosed in double quotes
>>> text2 = 'Computer says "no." Phone says "yes."'
>>> str_pat.findall(text2)
['no." Phone says "yes.']
>>> str_pat = re.compile(r'\"(.*?)\"') # Adding ? after the * operator makes the match non-greedy (lazy)
>>> str_pat.findall(text2)
['no.', 'yes.']

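—Writing a Regular Expression for Multiline Patterns—

You are trying to match a block of text with a regular expression, and the match needs to span multiple lines.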

>>> comment = re.compile(r'/\*(.*?)\*/')
>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a
...  multiline comment */
... '''
>>>
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]
# To fix this, change the pattern so that it also accepts newlines
>>> comment = re.compile(r'/\*((?:.|\n)*?)\*/')
>>> comment.findall(text2)
[' this is a\n multiline comment ']
# re.DOTALL makes the dot (.) in a regular expression match any character, including a newline
>>> comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
>>> comment.findall(text2)
[' this is a\n multiline comment ']
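A short sketch contrasting the default behavior with re.DOTALL on a string that mixes a multiline comment and a single-line one. The sample text is made up, and the inline (?s) form is shown as an equivalent spelling of the flag.

```python
import re

# Made-up sample: one comment spans two lines, the other does not.
text = '/* first\n   comment */ code(); /* second */'

# Without DOTALL the dot stops at the newline, so only the
# single-line comment is found.
plain = re.findall(r'/\*(.*?)\*/', text)

# With DOTALL the dot also matches '\n', so both comments are found.
comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
bodies = [body.strip() for body in comment.findall(text)]

# (?s) at the start of the pattern turns on DOTALL inline.
inline = re.findall(r'(?s)/\*(.*?)\*/', text)

print(plain)
print(bodies)
print(inline == comment.findall(text))
```

The non-greedy .*? still matters here: with DOTALL and a greedy .*, the match would run from the first /* to the last */.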

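—Normalizing Unicode Text to a Standard Representation—

You are working with Unicode strings and need to make sure that all of them have the same underlying representation.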

>>> s1 = 'Spicy Jalape\u00f1o'
>>> s2 = 'Spicy Jalapen\u0303o'
>>> s1
'Spicy Jalapeño'
>>> s2
'Spicy Jalapeño'
>>> s1 == s2
False
>>> len(s1)
14
>>> len(s2)
15

# NFC means characters should be fully composed (use a single code point where possible), while NFD means characters should be fully decomposed into base characters plus combining characters
>>> import unicodedata
>>> t1 = unicodedata.normalize('NFC', s1)
>>> t2 = unicodedata.normalize('NFC', s2)
>>> t1 == t2
True
>>> print(ascii(t1))
'Spicy Jalape\xf1o'
>>> t3 = unicodedata.normalize('NFD', s1)
>>> t4 = unicodedata.normalize('NFD', s2)
>>> t3 == t4
True
>>> print(ascii(t3))
'Spicy Jalapen\u0303o'

>>> s = '\ufb01' # A single character
>>> s
'ﬁ'
>>> unicodedata.normalize('NFD', s)
'ﬁ'
# Notice how the combined letters are broken apart here
>>> unicodedata.normalize('NFKD', s)
'fi'
>>> unicodedata.normalize('NFKC', s)
'fi'

# The combining() function tests whether a character is a combining character
>>> t1 = unicodedata.normalize('NFD', s1)
>>> ''.join(c for c in t1 if not unicodedata.combining(c))
'Spicy Jalapeno'
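The two ideas above (normalize before comparing, drop combining characters to remove accents) fit into two small helpers. This is a sketch; the names same_text and strip_accents are made up for illustration.

```python
import unicodedata

def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def strip_accents(text):
    """Remove combining marks such as the tilde in 'ñ'."""
    # Decompose first so that accents become separate combining characters.
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    # Recompose whatever is left into the compact form.
    return unicodedata.normalize('NFC', stripped)

s1 = 'Spicy Jalape\u00f1o'   # precomposed ñ
s2 = 'Spicy Jalapen\u0303o'  # n followed by a combining tilde

print(s1 == s2)
print(same_text(s1, s2))
print(strip_accents(s2))
```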

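—Working with Unicode Characters in Regular Expressions—

You are using regular expressions to process text, and you care about how Unicode characters are handled.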

# By default the re module already has basic support for some Unicode character classes. For example, \d already matches any Unicode digit character
>>> import re
>>> num = re.compile(r'\d+')
>>> # ASCII digits
>>> num.match('123')
<_sre.SRE_Match object at 0x1007d9ed0>
>>> # Arabic digits
>>> num.match('\u0661\u0662\u0663')
<_sre.SRE_Match object at 0x101234030>
# To include specific Unicode characters in a pattern, use their escape sequences (of the form \uXXXX or \UXXXXXXXX)
>>> arabic = re.compile('[\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff]+')
>>> pat = re.compile('stra\u00dfe', re.IGNORECASE)
>>> s = 'straße'
>>> pat.match(s) # Matches
<_sre.SRE_Match object at 0x10069d370>
>>> pat.match(s.upper()) # Doesn't match
>>> s.upper() # Case folds
'STRASSE'

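—Stripping Unwanted Characters from Strings—

The strip() method removes characters from the beginning or end of a string. lstrip() and rstrip() strip from the left side or the right side only.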

>>> # Whitespace stripping
>>> s = ' hello world \n'
>>> s.strip()
'hello world'
>>> s.lstrip()
'hello world \n'
>>> s.rstrip()
' hello world'
>>>
>>> # Character stripping
>>> t = '-----hello====='
>>> t.lstrip('-')
'hello====='
>>> t.strip('-=h')
'ello'
# Stripping never touches the middle of a string. To deal with inner whitespace, use replace() or a regular expression substitution
>>> s = 'hello     world'
>>> s.replace(' ', '')
'helloworld'
>>> import re
>>> re.sub(r'\s+', ' ', s)
'hello world'

—Sanitizing and Cleaning Up Text—

Some bored script kiddie has entered the text "pýtĥöñ" into a form on your web page, and you want to clean it up.

>>> s = 'pýtĥöñ\fis\tawesome\r\n' # Other tools include upper(), lower(), str.replace(), and re.sub()
>>> s
'pýtĥöñ\x0cis\tawesome\r\n'
>>> remap = {
...     ord('\t') : ' ',
...     ord('\f') : ' ',
...     ord('\r') : None # Deleted
... }
>>> a = s.translate(remap)
>>> a
'pýtĥöñ is awesome\n'
>>> import unicodedata
>>> import sys
>>> # unicodedata.combining(c) returns the canonical combining class of c, or 0 if none is defined
>>> cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode)
...                          if unicodedata.combining(chr(c)))
...
>>> b = unicodedata.normalize('NFD', a)
>>> b
'pýtĥöñ is awesome\n'
>>> b.translate(cmb_chrs)
'python is awesome\n'

—Aligning Text Strings—

Use the ljust(), rjust(), and center() methods of strings.

# The format() function can also be used to align strings easily. Use the <, >, or ^ character followed by the desired width
>>> text = 'Hello World'
>>> format(text, '>20')
'         Hello World'
>>> format(text, '<20')
'Hello World         '
>>> format(text, '^20')
'    Hello World     '
>>> format(text, '=>20s')
'=========Hello World'
>>> format(text, '*^20s')
'****Hello World*****'
>>> '{:>10s} {:>10s}'.format('Hello', 'World')
'     Hello      World'
>>> x = 1.2345
>>> format(x, '>10')
'    1.2345'
>>> format(x, '^10.2f')
'   1.23   '
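The section mentions ljust(), rjust(), and center() without showing them, so here is a minimal sketch of the three methods, followed by a small table built with format(). The rows are made-up sample data.

```python
text = 'Hello World'

# Each method pads to the given width; the fill character is optional.
print(repr(text.ljust(20)))
print(repr(text.rjust(20, '=')))
print(repr(text.center(20, '*')))

# Aligning columns of a small table with format specifiers.
rows = [('ACME', 100, 490.1), ('IBM', 50, 91.1)]  # made-up sample data
lines = ['{:<8s}{:>6d}{:>10.2f}'.format(name, shares, price)
         for name, shares, price in rows]
for line in lines:
    print(line)
```

The format() approach has the advantage of working with any value, not only strings, which is why the numeric columns can be aligned and rounded in one step.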

—Combining and Concatenating Strings—

>>> parts = ['Is', 'Chicago', 'Not', 'Chicago?']
>>> ' '.join(parts)
'Is Chicago Not Chicago?'
>>> a = 'Is Chicago'
>>> b = 'Not Chicago?'
>>> a + ' ' + b
'Is Chicago Not Chicago?'
>>> print('{} {}'.format(a,b))
Is Chicago Not Chicago?
>>> print(a, b, sep=' ')
Is Chicago Not Chicago?

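—Interpolating Variables in Strings—

You want to create a string in which embedded variable names are replaced by the string representation of each variable's value.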

>>> s = '{name} has {n} messages.'
>>> s.format(name='Guido', n=37)
'Guido has 37 messages.'
>>> name = 'Guido'
>>> n = 37
>>> s.format_map(vars())
'Guido has 37 messages.'
# One drawback of format() and format_map() is that they do not deal gracefully with missing values
class safesub(dict):
    """Return the placeholder unchanged when a key is missing."""
    def __missing__(self, key):
        return '{' + key + '}'
>>> del n # Make sure n is undefined
>>> s.format_map(safesub(vars()))
'Guido has {n} messages.'
# You can hide the variable substitution step behind a small utility function
import sys
def sub(text):
    return text.format_map(safesub(sys._getframe(1).f_locals))
>>> name = 'Guido'
>>> n = 37
>>> print(sub('Hello {name}'))
Hello Guido
>>> print(sub('You have {n} messages.'))
You have 37 messages.
>>> print(sub('Your favorite color is {color}'))
Your favorite color is {color}
# Some other approaches
>>> name = 'Guido'
>>> n = 37
>>> '%(name)s has %(n)d messages.' % vars()
'Guido has 37 messages.'
>>> import string
>>> s = string.Template('$name has $n messages.')
>>> s.substitute(vars())
'Guido has 37 messages.'
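If reaching into the caller's frame with sys._getframe() feels too magical, a more explicit variant is possible. The sketch below passes values as keyword arguments; the names SafeSub and render are made up. It also shows string.Template.safe_substitute(), which leaves unknown placeholders alone in a similar way.

```python
import string

class SafeSub(dict):
    """dict that returns the placeholder itself for missing keys."""
    def __missing__(self, key):
        return '{' + key + '}'

def render(template, **values):
    """Fill in the given values and leave unknown fields untouched."""
    return template.format_map(SafeSub(values))

print(render('{name} has {n} messages.', name='Guido', n=37))
print(render('{name} has {n} messages.', name='Guido'))

# string.Template offers the same tolerance through safe_substitute().
tmpl = string.Template('$name has $n messages.')
print(tmpl.safe_substitute(name='Guido'))
```

One caveat under this approach: a missing field that carries a format spec, such as {n:d}, would still fail, because the placeholder string cannot be formatted as a number.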

—Reformatting Text to a Fixed Number of Columns—

You have long strings that you want to reformat to a given column width.

>>> s = '123456789'
>>> import textwrap
>>> textwrap.fill(s, 4)
'1234\n5678\n9'
>>> textwrap.fill(s, 4, initial_indent='----')
'----1\n2345\n6789'
# Use os.get_terminal_size() to get the size of the terminal
>>> import os
>>> os.get_terminal_size().columns
80

—Handling HTML and XML Entities in Text—

>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(s)
Elements are written as "<tag>text</tag>".
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
>>> # Disable escaping of quotes
>>> print(html.escape(s, quote=False))
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".
# If you are emitting ASCII text and want to embed character code entities for the non-ASCII characters, pass errors='xmlcharrefreplace' to the relevant I/O functions or to encode()
>>> s = 'Spicy Jalapeño'
>>> s.encode('ascii', errors='xmlcharrefreplace')
b'Spicy Jalape&#241;o'

>>> s = 'Spicy &quot;Jalape&#241;o&quot.'
>>> import html
>>> html.unescape(s) # HTMLParser().unescape() was removed in Python 3.9; use html.unescape()
'Spicy "Jalapeño".'
>>>
>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'
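As a quick check of how escaping and unescaping fit together, here is a small round-trip sketch using only html.escape() and html.unescape() from the standard library (html.unescape() needs Python 3.4 or later).

```python
import html

s = 'Elements are written as "<tag>text</tag>".'

# Escape the markup characters, then reverse the operation.
escaped = html.escape(s)
restored = html.unescape(escaped)
print(escaped)
print(restored == s)

# Numeric and named entities are both handled by html.unescape().
print(html.unescape('Spicy &quot;Jalape&#241;o&quot;.'))

# Going the other way: replace non-ASCII characters by character references.
print('Spicy Jalapeño'.encode('ascii', errors='xmlcharrefreplace'))
```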

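—Tokenizing Text—

You have a string that you want to parse left to right into a stream of tokens. (A token here is a lexical unit: a token type such as NAME or NUM paired with the text it matched.)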

text = 'foo = 23 + 42 * 10'
# The goal is to turn the string into a sequence of (type, value) pairs like this:
tokens = [('NAME', 'foo'), ('EQ','='), ('NUM', '23'), ('PLUS','+'), ('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]
import re
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))
>>> scanner = master_pat.scanner('foo = 42')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group() # In the interactive interpreter, "_" holds the last result
('NAME', 'foo')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('EQ', '=')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('NUM', '42')

# In practice, this technique is easy to package into a generator like this
from collections import namedtuple

def generate_tokens(pat, text):
    Token = namedtuple('Token', ['type', 'value'])
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

# Example use
for tok in generate_tokens(master_pat, 'foo = 42'):
    print(tok)
# Produces output
# Token(type='NAME', value='foo')
# Token(type='WS', value=' ')
# Token(type='EQ', value='=')
# Token(type='WS', value=' ')
# Token(type='NUM', value='42')

# To filter the token stream, define more generator functions or use a generator expression. For example, this filters out all whitespace tokens
tokens = (tok for tok in generate_tokens(master_pat, text)
          if tok.type != 'WS')
for tok in tokens:
    print(tok)
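One thing to watch with the scanner-based generator: it simply stops at the first character that no pattern matches. The sketch below is a variant of the same idea, not the recipe itself. It uses the documented pattern.match(text, pos) call instead of scanner(), and raises an error on unexpected input; the names TOKEN_SPEC and tokenize are assumptions for illustration.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

# (name, pattern) pairs; order matters when patterns overlap.
TOKEN_SPEC = [
    ('NAME', r'[a-zA-Z_][a-zA-Z_0-9]*'),
    ('NUM', r'\d+'),
    ('PLUS', r'\+'),
    ('TIMES', r'\*'),
    ('EQ', r'='),
    ('WS', r'\s+'),
]
master_pat = re.compile(
    '|'.join('(?P<{}>{})'.format(name, pat) for name, pat in TOKEN_SPEC))

def tokenize(text, skip=('WS',)):
    """Yield Token tuples, raising SyntaxError on unmatched input."""
    pos = 0
    while pos < len(text):
        m = master_pat.match(text, pos)
        if m is None:
            raise SyntaxError('unexpected character {!r} at position {}'
                              .format(text[pos], pos))
        pos = m.end()
        if m.lastgroup not in skip:
            yield Token(m.lastgroup, m.group())

for tok in tokenize('foo = 23 + 42 * 10'):
    print(tok)
```

Every pattern in the table matches at least one character, so the position always advances and the loop terminates.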

—Writing a Simple Recursive Descent Parser—

—Performing Text Operations on Byte Strings—

Byte strings support most of the same built-in operations as text strings.
>>> data = b'Hello World'
>>> data[0:5]
b'Hello'
>>> data.startswith(b'Hello')
True
>>> data.split()
[b'Hello', b'World']
>>> data.replace(b'Hello', b'Hello Cruel')
b'Hello Cruel World'
# These operations also work on byte arrays
>>> data = bytearray(b'Hello World')
>>> data[0:5]
bytearray(b'Hello')
>>> data.startswith(b'Hello')
True
>>> data.split()
[bytearray(b'Hello'), bytearray(b'World')]
>>> data.replace(b'Hello', b'Hello Cruel')
bytearray(b'Hello Cruel World')
# You can match byte strings with regular expressions, but the pattern itself must also be given as bytes
>>> data = b'FOO:BAR,SPAM'
>>> import re
>>> re.split('[:,]',data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/re.py", line 191, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: can't use a string pattern on a bytes-like object
>>> re.split(b'[:,]',data) # Notice: pattern as bytes
[b'FOO', b'BAR', b'SPAM']

There are also some differences to be aware of.
# Indexing a byte string yields integers, not individual characters
>>> a = 'Hello World' # Text string
>>> a[0]
'H'
>>> a[1]
'e'
>>> b = b'Hello World' # Byte string
>>> b[0]
72
>>> b[1]
101
# Byte strings do not have a nice string representation and do not print cleanly
>>> s = b'Hello World'
>>> print(s)
b'Hello World' # Observe b'...'
>>> print(s.decode('ascii'))
Hello World
# Byte strings have no format() method (%-style formatting of bytes exists only in Python 3.5 and later)
>>> b'{} {} {}'.format(b'ACME', 100, 490.1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'format'
# To format a byte string, format a normal text string first and then encode it as bytes
>>> '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')
b'ACME              100     490.10'
# Finally, be aware that using byte strings can change the semantics of some operations, especially those related to the filesystem
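A short sketch pulling the differences together. It assumes Python 3.5 or later, where bytes supports %-style formatting; on older versions only the format-then-encode route works.

```python
data = b'Hello World'

# Indexing yields an int; slicing yields a bytes object.
first_int = data[0]
first_byte = data[0:1]
print(first_int, first_byte)

# Route 1: format as text, then encode.
line = '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')

# Route 2 (Python 3.5+): %-formatting directly on bytes.
pct = b'%-10s %10d %10.2f' % (b'ACME', 100, 490.1)

print(line)
print(line == pct)

# Decode before mixing with text; str + bytes raises TypeError.
print('Greeting: ' + data.decode('ascii'))
```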

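One last point: some programmers lean toward byte strings instead of text strings to make their programs run faster. Manipulating bytes can indeed be slightly more efficient than manipulating text (because of the Unicode-related overhead inherent in text), but doing so usually leads to very messy code. You will often find that byte strings do not play well with the rest of Python, and that you end up doing all of the encoding and decoding by hand. Frankly, if you are working with text, use normal text strings in your program, not byte strings. Don't go looking for trouble and you won't find any! (Python Cookbook)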
