How to Process HTML Files with Regular Expressions in Python

Published: April 23, 2025

This article explains how to process HTML files with regular expressions in Python, working step by step through a practical exercise.

Processing HTML files with regular expressions in Python

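The finditer method is a match-all method. You may already have used findall, which returns a list of all the matching strings. finditer instead returns an iterator that yields a match object for each match in turn. In the code below, these match objects are accessed (via a for loop) so that group 1 can be printed.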
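As a minimal sketch of the difference between the two methods: the sample line below is made up for illustration, and the pattern mirrors the testRE pattern used later in this article.

```python
import re

# Hypothetical sample line; the pattern mirrors the article's testRE.
pattern = re.compile(r'(logic|sicstus)', re.I)
line = 'Logic programming with SICStus Prolog and more logic.'

# findall returns the captured strings directly (group 1 here).
found = pattern.findall(line)

# finditer yields match objects, so extra details such as the
# start position of each match are available as well.
matches = [(m.group(1), m.start()) for m in pattern.finditer(line)]

print(found)
print(matches)
```

Working with match objects is what makes it possible to pick out individual groups, which the tag exercises rely on.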

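Your task is to write Python REs that recognise certain patterns in an HTML text file. Add code to the STARTER script that compiles REs for these patterns (assigning them to meaningfully named variables) and applies them to each line of the file, printing out the matches found.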

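1. Write a pattern that recognises HTML tags, and print each one as "TAG: TAG string" (e.g. "TAG: b" for the tag <b>). For simplicity, assume that the opening and closing angle brackets (<, >) of each tag always appear on the same line of text. A first attempt might be the regex "<.*>", where "." is the predefined character class symbol that matches any character. Try this out and work out why it is not a good solution, then write a better solution that fixes the problem.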
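The problem with "<.*>" can be seen in a small experiment. The sketch below uses a made-up sample line and compares the greedy pattern with two possible fixes (non-greedy matching and a negated character class); the variable names are illustrative only.

```python
import re

line = '<p><b>bold</b> text</p>'  # hypothetical sample line

greedy = re.compile(r'<.*>')        # runs from the first '<' to the last '>'
lazy = re.compile(r'<(.*?)>')       # shortest match between the brackets
negated = re.compile(r'<([^>]+)>')  # anything except '>' inside the brackets

greedy_hits = greedy.findall(line)
lazy_hits = lazy.findall(line)
negated_hits = negated.findall(line)

print(greedy_hits)
print(lazy_hits)
print(negated_hits)
```

Because ".*" is greedy, the first pattern swallows the whole line as a single "tag", while the other two stop at the first closing bracket.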

2. Modify the code so that it distinguishes opening and closing tags (e.g. p vs. /p), printing OPENTAG and CLOSETAG.

    import re

    # ------------------------------
    testRE = re.compile(r'(logic|sicstus)', re.I)
    # Any tag, open or close; group 1 is the tag label.
    testI = re.compile(r'</?(\w+)[^>]*>')
    # Open tag: the first char after '<' must not be '/'.
    testO = re.compile(r'<[^/](\S*?)[^>]*>')
    # Close tag: '<' immediately followed by '/'.
    testC = re.compile(r'</(\S*?)[^>]*>')

    with open('RGX_DATA.html') as infs:
        linenum = 0
        for line in infs:
            linenum += 1
            if line.strip() == '':
                continue
            print('\n', '-' * 100, '[%d]' % linenum, '\nTEXT:', line, end='')
            # finditer yields one match object per match on the line.
            for m in testRE.finditer(line):
                print('** TEST-RE:', m.group(1))
            for i in testI.finditer(line):
                print('TAG:', i.group(1))
            # Print the tag contents with the angle brackets removed.
            for m in testO.finditer(line):
                print('OPENTAG:', m.group().replace('<', '').replace('>', ''))
            for n in testC.finditer(line):
                print('CLOSETAG:', n.group().replace('<', '').replace('>', ''))
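The script above needs the RGX_DATA.html data file. For experimenting without it, the open/close distinction can also be sketched with a single pattern applied to an in-memory string. The names any_tag and classify_tags and the sample line are assumptions made for this illustration, not part of the starter script.

```python
import re

# One pattern for both kinds of tag: group 1 is '/' for a close tag
# (empty for an open tag), group 2 is the tag label.
any_tag = re.compile(r'<(/?)(\w+)[^>]*>')


def classify_tags(line):
    """Return (kind, label) pairs for each tag found on the line."""
    results = []
    for m in any_tag.finditer(line):
        kind = 'CLOSETAG' if m.group(1) else 'OPENTAG'
        results.append((kind, m.group(2)))
    return results


sample = '<p>Some <b>bold</b> text</p>'  # hypothetical sample line
for kind, label in classify_tags(sample):
    print(f'{kind}: {label}')
```

Capturing the optional slash in its own group keeps it out of the tag label, so the label can be printed without any string replacement.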

Note that some HTML tags have parameters, for example:

    <table border=1 cellspacing=0 cellpadding=8>

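Make sure your pattern for open tags works for tags both with and without parameters, i.e. that it successfully finds and prints the tag label. Now extend your code so that it prints both the open tag label and the parameters, e.g.: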

OPENTAG: table
PARAM: border=1
PARAM: cellspacing=0
PARAM: cellpadding=8

    # Replaces the open-tag loop inside the per-line loop of the script above.
    for m in testO.finditer(line):
        # Drop the angle brackets, then split on whitespace: the first
        # item is the tag label, any remaining items are parameters.
        parts = m.group().replace('<', '').replace('>', '').split()
        for num, part in enumerate(parts):
            if num == 0:
                print('OPENTAG:', part)
            else:
                print('PARAM:', part)
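Splitting on whitespace works for simple parameters like those in the table example, but it breaks on quoted values that contain spaces. As a hedged sketch of one way around that, the code below matches name=value pairs with a second regex. The names open_tag, param and open_tags_with_params are illustrative, and the approach assumes quoted values never contain '>'.

```python
import re

# Open tag: group 1 is the label, group 2 the raw parameter string.
open_tag = re.compile(r'<(\w+)\b([^>]*)>')
# name=value, where value is "double-quoted", 'single-quoted', or bare.
param = re.compile(r'''([\w-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)''')


def open_tags_with_params(line):
    """Return (label, [params]) for each open tag found on the line."""
    results = []
    for m in open_tag.finditer(line):
        params = [f'{name}={value}'
                  for name, value in param.findall(m.group(2))]
        results.append((m.group(1), params))
    return results


for label, params in open_tags_with_params(
        '<table border=1 cellspacing=0 cellpadding=8>'):
    print('OPENTAG:', label)
    for p in params:
        print('PARAM:', p)
```

For the table example this prints the same four lines as the expected output shown earlier; a tag such as `<a href="my page.html" class=x>` keeps its quoted value in one piece.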

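In a regular expression, a backreference can be used to indicate that a substring matched by an earlier part of the regex should appear again. The format is \N (where N is a positive integer), and it refers back to the text matched by the Nth group of the regex. For example, a regex such as r"(\w+) \1" matches only if the string matched by the group (\w+) appears again, exactly, at the position of the backreference \1. It would match a string in which a word appears twice in a row, such as "the the". Using a backreference, write a pattern that matches when a line contains a paired open and close tag, e.g. a bold pair such as <b> ... </b>.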
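A small sketch of both uses of a backreference follows. The sample strings are made up, and the added \b word-boundary anchors in doubled_word are an assumption to avoid matching inside longer words.

```python
import re

# A repeated word: \1 must equal whatever group 1 matched.
doubled_word = re.compile(r'\b(\w+) \1\b')

# A matched open/close pair: the close tag label must equal group 1.
tag_pair = re.compile(r'<(\w+)\b[^>]*>(.*?)</\1\s*>')

m1 = doubled_word.search('Paris in the the spring')
pairs = tag_pair.findall('Some <b>bold</b> and <i>italic</i> text')
mismatch = tag_pair.search('<b>not closed properly</i>')

print(m1.group(0))
print(pairs)
print(mismatch)
```

The last search returns None because the close tag label "i" does not repeat the open tag label "b".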

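Consider that we might want to create a script that performs HTML stripping, i.e. one that takes an HTML file and returns a plain text file from which all HTML markup has been removed. We won't attempt that here; instead we consider a simpler case, namely removing any HTML tags found in each line of the input data file.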

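You should be able to do this using the RE you have already defined for recognising HTML tags. Print the resulting text to the screen as STRIPPED: ...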
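One possible way to do the stripping is re.sub with an empty replacement string. The sketch below assumes a tag pattern that requires a word character right after the opening bracket (and optional slash); the function name strip_tags and the sample lines are illustrative.

```python
import re

# Matches an open or close tag, including any parameters.
tag = re.compile(r'</?\w+\b[^>]*>')


def strip_tags(line):
    """Return the line with every HTML tag removed."""
    return tag.sub('', line)


print('STRIPPED:', strip_tags('<p>Some <b>bold</b> text</p>'))
```

Requiring a word character after '<' means that plain text such as "1 < 2 and 3 > 2" is left alone rather than being treated as a tag.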

    import re

    #------------------------------
    # PART 1:
    # Key thing is to avoid matching strings that include
    # multiple tags, e.g. treating '<p><b>' as a single
    # tag. Can do this in several ways. Firstly, use
    # non-greedy matching, so get shortest possible match
    # including the two angle brackets:
    tag = re.compile('</?(.*?)>')
    # The above treats the '/' of a close tag as a separate
    # optional component - so that this doesn't turn up as
    # part of the match '.group(1)', which is meant to return
    # the tag label.
    # Following alternative solution uses a negated character
    # class to explicitly prevent this including '>':
    tag = re.compile('</?([^>]+)>')
    # Finally, following version separates finding the tag
    # label string from any (optional) parameters that might
    # also appear before the close angle bracket:
    tag = re.compile(r'</?(\w+)\b([^>]+)?>')
    # Note that use of '\b' (as word boundary anchor) here means
    # we must mark the regex string as a 'raw' string (r'..').
    #------------------------------
    # PART 2:
    # Following closeTag definition requires first char
    # after the open angle bracket to be '/', while openTag
    # definition excludes this by requiring first char to be
    # a 'word char' (\w):
    openTag = re.compile(r'<(\w[^>]*)>')
    closeTag = re.compile(r'</([^>]*)>')
    # Following revised definitions are more carefully stated
    # for correct extraction of tag label (separately from
    # any parameters):
    openTag = re.compile(r'<(\w+)\b([^>]+)?>')
    closeTag = re.compile(r'</(\w+)\s*>')
    #------------------------------
    # PART 3:
    # Above openTag definition will already get the string
    # encompassing any parameters, and return it as
    # m.group(2), i.e. defn:
    openTag = re.compile(r'<(\w+)\b([^>]+)?>')
    # If assume that parameters are continuous non-whitespace
    # chars separated by whitespace chars, then we can divide
    # them up using split - and that's how we handle them
    # here. (In reality, parameter strings can be a lot more
    # messy than this, but we won't try to deal with that.)
    #------------------------------
    # PART 4:
    openCloseTagPair = re.compile(r'<(\w+)\b([^>]+)?>(.*?)</\1\s*>')
    # Note use of non-greedy matching for the text falling
    # *between* the open/close tag pair - to avoid false
    # results where have two similar tag pairs on same line.
    #------------------------------
    # PART 5: URLS
    # This is quite tricky. The URL expressions in the file
    # are of two kinds, of which the first is a string
    # between double quotes ("..") which may include
    # whitespace. For this case we might have a regex
    # (the listing is cut off at this point):
    # url = re.compile('href="https://www.19jp.com">
