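I went through a lot of methods found online and read them carefully enough to understand them, but I noticed that for many HTML pages there is a neater way to extract http links. The idea: a URL in HTML sits inside a pair of double quotes, so the pattern only has to start at "http" and stop at the closing double quote of the link, which is the first double quote that appears after it. Example HTML page: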
<div class="hotnews"> <div class="imgview"id="imgView"><a href="https://xinwen.eastday.com/a/n181211075002407.html?qid=news.baidu.com"target="_blank"><img src="https://imgsa.baidu.com/news/q%3D100/sign=cdae0fb78a94a4c20c23e32b3ef51bac/cefc1e178a82b90151b62d8b7e8da9773912ef6b.jpg"></a></div><ul><li class="hdline0"> <a href="http://www.xinhuanet.com/politics/xxjxs/2018-12/11/c_1123834898.htm"target="_blank"class="a3"> ...的xxx之“喻” </a></li> <li class="hdline1"> <a href="http://news.cri.cn/20181211/313376c7-77cc-abff-3a81-bd855c0a8577.html"target="_blank"> 《xxxx》宣传片</a> <i style="font-size: 12px"> </i><a href="http://politics.gmw.cn/2018-12/11/content_32146726.htm"target="_blank"> 主题歌《梦想阳光》发布</a> </li> <li class="hdline2"> <img src="https://imgsa.baidu.com/news/q%3D100/sign=ab45ee53bbfd5266a12b38149b199799/f9198618367adab46063f9fb86d4b31c8601e4d3.jpg"><a href="http://politics.people.com.cn/n1/2018/1211/c1001-30458946.html"target="_blank"class="a3"> 【央视快评】xxxxxxxx道路</a></li> <li class="hdline3"> <a href="http://news.cri.cn/20181210/384ab948-e36b-b455-9d97-8eb05172c179.html"target="_blank">同舟共济</a> <i style="font-size: 12px"> </i><a href="http://news.cctv.com/2018/12/10/ARTI9v2GwcDNkh8obJh2vnUy181210.shtml"target="_blank"> 《xxxx关键一招》第一集</a> </li> <li class="hdline4"> <a href="http://news.cctv.com/2018/12/10/ARTISzd4ekNLNB88EFFtMgB7181210.shtml"target="_blank"class="a3"> 【数说xx开放40年】40年减贫7.4亿人</a></li> <li class="hdline5"> <a href="http://news.ifeng.com/a/20181211/60188943_0.shtml?_zbs_baidu_news"target="_blank">xxx出席的这个活动,有什么来头?</a> </li> </ul> </div>

Key code:
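import re

# Match "http" or "https", then "://", then every character up to the next double quote
reg = r'http[s]?://[^"]+'
# text is the HTML source of the page shown above
res = re.findall(reg, text)
for i in res:
    print(i)

Result: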
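The same idea can also be tried on a much smaller input. The following is only a minimal sketch: the helper name `extract_links` and the two-link `sample` string are made up for illustration and are not taken from the page above. It uses the same kind of pattern as the key code and stops each match at the first double quote.

```python
import re

# Same idea as the key code: "http" or "https", then "://", then every
# character up to (but not including) the next double quote.
URL_PATTERN = re.compile(r'https?://[^"]+')


def extract_links(html):
    """Return all http(s) URLs in html that are terminated by a double quote."""
    return URL_PATTERN.findall(html)


# Hypothetical sample input, not part of the original page.
sample = (
    '<a href="https://example.com/a.html?x=1" target="_blank">A</a>'
    '<img src="http://example.com/b.jpg">'
)

for link in extract_links(sample):
    print(link)
```

Because the pattern relies only on the closing double quote, it picks up `src` as well as `href` values. A URL written in single quotes, or one in plain text with no quote after it, would run on until the next double quote, so this shortcut fits pages that quote their attributes the way the example above does.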