Title |
Pattern Title
|
Expression |
<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a> |
Description |
This regex extracts the link URL and the link text for every <a href="..."> tag in HTML source, which is useful for crawling sites.
Note that the pattern also allows for links that are spread over multiple lines. |
Matches |
<a href='http://www.regexlib.com'>Text</a> | <a href="...">Text</a> |
Non-Matches |
All other HTML tags |
Author |
Jacek Sompel
|
Source |
|
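As a rough illustration of the description above, the expression can be tried with Python's `re` module. This is only a sketch: the choice of Python, the `re.IGNORECASE` flag, and the names `LINK_RE` and `extract_links` are assumptions made for the example and are not part of the original entry.

```python
import re

# The expression from this entry, copied verbatim into a raw string.
# re.IGNORECASE is an assumption so that <A HREF=...> is also accepted.
LINK_RE = re.compile(
    r"""<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>""",
    re.IGNORECASE,
)


def extract_links(html):
    """Return a list of (url, link_text) pairs, one per matched anchor."""
    # With two capturing groups, findall returns a tuple per match.
    return LINK_RE.findall(html)


if __name__ == "__main__":
    sample = "<p><a href='http://www.regexlib.com'>Text</a></p>"
    for url, text in extract_links(sample):
        print(url, text)
```

Group 1 holds the URL and group 2 the link text, which is why each result is a two-element tuple.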
Title: great!
Name: alex
Date: 3/11/2009 10:10:47 AM
Comment:
just what I needed. thank you.
Title: Problem with single quotes
Name: Andrew Rosca
Date: 8/27/2007 10:03:00 AM
Comment:
This expression is good, but needs a couple of corrections. I ran into two problems with it:
1. I encountered URLs with single quotes in them (don't ask). Because the expression looks for EITHER a single or double quote at the end of the URL it would capture only part of the URL. Example:
<a href="http://domain.com/andrew's%20page/">test</a>
Capture: "http://domain.com/andrew"
2. The expression looks for ONE OR MORE spaces, double quotes, or single quotes. This means it will incorrectly parse links where the URL is blank.
Example:
<a href="" onmouseover="...">test</a>
Capture: "onmouseover="
The following corrections bring it a bit closer:
<a[\s]+[^>]*?href[\s]*=[\s]*((\"(.*?)\")|(\'(.*?)\')).*?>([^<]+|.*?)?<\/a>
The captured URL will be in either group 3 or group 5 (ignore the one that is blank); the link text is in group 6.
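The two problem cases above can be checked against the corrected expression with a short Python sketch. The function name `extract_links` and the use of Python are assumptions for illustration. Note that in Python the unused group comes back as `None` rather than as an empty string, so the code tests for `None` to tell a blank URL apart from a group that did not participate.

```python
import re

# Corrected expression from this comment, copied verbatim.
CORRECTED_RE = re.compile(
    r"""<a[\s]+[^>]*?href[\s]*=[\s]*((\"(.*?)\")|(\'(.*?)\')).*?>([^<]+|.*?)?<\/a>"""
)


def extract_links(html):
    """Return (url, link_text) pairs using groups 3/5 for the URL and 6 for the text."""
    links = []
    for m in CORRECTED_RE.finditer(html):
        # Group 3 is set for double-quoted URLs, group 5 for single-quoted ones.
        url = m.group(3) if m.group(3) is not None else m.group(5)
        links.append((url, m.group(6)))
    return links


if __name__ == "__main__":
    print(extract_links('<a href="http://domain.com/andrew\'s%20page/">test</a>'))
    print(extract_links('<a href="" onmouseover="...">test</a>'))
```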
Title: Problem when there is a CR in the data
Name: Squarei
Date: 4/27/2007 1:00:50 PM
Comment:
This works great until a Carriage Return is added as follows:
<a href="abc"><img src="def" border="0">1223</a>
<a href="ghi" target="_blank">
<img src="jkl"></a>
This will only return the first item.
I have been messing with this for a while and can't figure out how to ignore the carriage return.
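One likely cause, assuming the engine's default where `.` does not match a newline, is that the `.*?` parts of the expression cannot cross the line break inside the second link. The Python sketch below compares the default behaviour with `re.DOTALL` (the rough equivalent of `RegexOptions.Singleline` in .NET). The variable names are assumptions for the example, and whether this matches the commenter's exact setup is not known.

```python
import re

# The expression from this entry, copied verbatim.
PATTERN = r"""<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>"""

# The two links from the comment; the second one contains a line break.
HTML = (
    '<a href="abc"><img src="def" border="0">1223</a>\n'
    '<a href="ghi" target="_blank">\n'
    '<img src="jkl"></a>'
)

# Default flags: "." stops at the newline, so the second link is not matched.
default_matches = re.findall(PATTERN, HTML)

# re.DOTALL lets "." match newlines, so the second link is matched as well.
dotall_matches = re.findall(PATTERN, HTML, re.DOTALL)
hrefs = [url for url, _text in dotall_matches]

if __name__ == "__main__":
    print(len(default_matches), "match(es) with default flags")
    print(len(dotall_matches), "match(es) with re.DOTALL:", hrefs)
```

With `re.DOTALL` the captured link text of the second anchor includes the leading newline, so callers may want to strip whitespace from it.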
Title: didn't work in php
Name: js
Date: 9/24/2006 12:18:34 AM
Comment:
got the error "Warning: preg_match() [function.preg-match]: Unknown modifier ']' in..."
Title: Updated
Name: Jacek Somepl
Date: 4/8/2005 6:32:39 PM
Comment:
Updated the pattern with the one from LPX's comment, since it's slightly more efficient. Thanks, LPX.
Title: Not a bug
Name: Jacek Somepl
Date: 4/8/2005 6:28:54 PM
Comment:
Both patterns return the same results. However, you are correct: my original pattern will match the whole <a name="name"></a><a href="http://google.com">Gooogle</a> line, and since the named anchor doesn't contain any of the required parts, the regex engine simply passes over it.
Title: bug?
Name: LPX
Date: 4/8/2005 5:27:15 AM
Comment:
<a name="name"></a><a href="http://google.com">Gooogle</a>
MATCH
<a name="name"></a><a href="http://google.com">Gooogle</a>
INSTEAD OF
<a href="http://google.com">Gooogle</a>
WITH
$1=http://google.com
$2=Gooogle
I think this is correct:
<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>
LPX
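The behaviour described in this comment can be checked with a small Python sketch. The use of Python and the names `LPX_RE` and `first_link` are assumptions for illustration. Because `[^>]*?` cannot step over the `>` that closes the `<a name="name">` tag, the match should begin at the second anchor.

```python
import re

# The expression proposed in this comment, copied verbatim.
LPX_RE = re.compile(
    r"""<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>"""
)

LINE = '<a name="name"></a><a href="http://google.com">Gooogle</a>'


def first_link(html):
    """Return (whole_match, url, link_text) for the first match, or None."""
    m = LPX_RE.search(html)
    if m is None:
        return None
    return m.group(0), m.group(1), m.group(2)


if __name__ == "__main__":
    print(first_link(LINE))
```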