Title: Memory Peak
Name: Dave S
Date: 1/25/2006 10:06:34 AM
This regexp choked on a string containing the 'less-than' character as part of invalid HTML. As in: 1 is < 2.
Everything following the < character causes greedy validation, and with a long string (748 characters), this regular expression caused CPU usage to peak and remain at 100%. This problem happened consistently (i.e. EVERY TIME that string was passed through the regex). I tracked down the problem to THIS regex with a Microsoft Tech agent who studied the memory dump produced by Windows and IIS. The memory dump pointed to this line:
isHTML = objRegExp.Test(str)
This indicates that the .Test method (in VBScript) of the regular expression object would choke on the 748-character-long string containing the 'less-than' character.
Obviously, in valid HTML that should be written as: 1 is &lt; 2. But many users don't know proper HTML entities. I've reverted to <[^>]+> for the time being.
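For anyone wanting to see how the simpler fallback pattern behaves on this kind of input, here is a rough sketch. It is written in Python rather than VBScript, and the function name is_html and the sample strings are made up for illustration; only the pattern <[^>]+> comes from the comment above. It should not hang on a stray < because the negated character class leaves very little to backtrack over, but note that a stray < followed later by a real tag gets swallowed into one match.

```python
import re

# The simple fallback pattern quoted in the comment above.
TAG = re.compile(r"<[^>]+>")

def is_html(text: str) -> bool:
    """Hypothetical stand-in for objRegExp.Test(str): True if any tag-like span is found."""
    return TAG.search(text) is not None

# Illustrative inputs (assumptions, not the original 748-character string).
stray = "1 is < 2. " + "x" * 748          # a lone '<' and no '>' anywhere
tagged = "1 is < 2, and <b>bold</b> text"  # a lone '<' followed by real tags

print(is_html(stray))        # no '>' exists, so nothing can match
print(is_html(tagged))
print(TAG.findall(tagged))   # the stray '<' runs on until the first '>'
print(TAG.sub("", tagged))   # stripping also removes the text after the stray '<'
```

So the fallback is safe from the runaway CPU problem, but on invalid input like this it can still strip more than intended. Escaping or rejecting a bare < before matching would be one way around that, if it matters for your use.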
Title: best one so
Name: manit chanthavong
Date: 11/3/2005 6:42:57 PM
I looked for an RE to strip HTML tags from a document. This is the best one I've seen.
Title: Very good
Name: Simon Cann
Date: 10/3/2005 11:34:30 AM
Well done for a great expression, it's just what I needed.
Title: RE:Half right, half wrong
Name: Toby Henderson
Date: 4/5/2005 6:11:32 AM
Gideon, you are correct, as those are not valid HTML tags. But seeing that they are meant to be tags, I would want them captured. I'm not testing for validity; I just want to find every tag in a document to do something with them.
Title: Half right, half wrong
Name: Gideon Engelberth
Date: 4/4/2005 11:27:53 AM
This expression may not give false negatives (because it allows things inside tags), but it definitely gives false positives. Two examples of matches that, as far as I know, should not match are: