Regular Expression Library

Regular Expression Details

Title	Test Find Pattern Title
Expression	<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>
Description	This regex extracts the link URL and the link text from every <a href> element in HTML source. Useful for crawling sites. Note that the pattern also allows for links that are spread over multiple lines.
Matches	<a href='http://www.regexlib.com'>Text</a> | <a href="...">Text</a>
Non-Matches	all other html tags
Author	Jacek Sompel
Rating	Not yet rated
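
As a rough illustration of the description above, the following Python sketch applies the expression as given in LPX's comment further down (the one the author says this entry was updated to). The `extract_links` helper and the sample markup are hypothetical, and `re.DOTALL` and `re.IGNORECASE` are assumptions: the entry does not state which regex options it expects.

```python
import re

# Pattern as given in LPX's comment, which the author adopted for this entry.
LINK_RE = re.compile(
    r'''<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>''',
    re.IGNORECASE | re.DOTALL,  # assumption: options are not stated in the entry
)


def extract_links(html):
    """Return (url, link_text) pairs for each <a href=...> element found."""
    return [(m.group(1), m.group(2)) for m in LINK_RE.finditer(html)]


# Hypothetical sample: one single-quoted link, one link spread over two lines.
sample = """<p><a href='http://www.regexlib.com'>Text</a> and
<a class="x" href="http://example.com/page">Another
link</a></p>"""

for url, text in extract_links(sample):
    print(url, "->", repr(text))
```

Group 1 holds the URL and group 2 the link text. A regex like this is only a heuristic for crawling; it is not a full HTML parser.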

Existing User Comments

Title: great!
Name: alex
Date: 3/11/2009 10:10:47 AM
Comment:
just what I needed. thank you.

Title: Problem with single quotes
Name: Andrew Rosca
Date: 8/27/2007 10:03:00 AM
Comment:
This expression is good, but needs a couple of corrections. I ran into two problems with it:
1. I encountered URLs with single quotes in them (don't ask). Because the expression looks for EITHER a single or double quote at the end of the URL, it would capture only part of the URL.
   Example: <a href="http://domain.com/andrew's%20page/">test</a>
   Capture: "http://domain.com/andrew"
2. The expression looks for ONE OR MORE space, double, or single quote. This means it will incorrectly parse links where the URL is blank.
   Example: <a href="" onmouseover="...">test</a>
   Capture: "onmouseover="
The following corrections bring it a bit closer:
<a[\s]+[^>]*?href[\s]*=[\s]*((\"(.*?)\")|(\'(.*?)\')).*?>([^<]+|.*?)?<\/a>
The captured URL will be in either group 3 or 5 (ignore the one that is blank).
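
The two problems and the proposed correction can be checked with a short Python sketch. The names `EARLIER`, `CORRECTED`, and `corrected_url` are hypothetical; the patterns and test strings are taken from the comments on this page. Note one Python-specific detail: a group that did not take part in the match is reported as `None`, not as a blank string.

```python
import re

# Pattern from LPX's comment (URL in group 1).
EARLIER = re.compile(
    r'''<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>''')
# Corrected pattern from the comment above (URL in group 3 or group 5).
CORRECTED = re.compile(
    r'''<a[\s]+[^>]*?href[\s]*=[\s]*((\"(.*?)\")|(\'(.*?)\')).*?>([^<]+|.*?)?<\/a>''')


def corrected_url(html):
    """Return the URL captured by the corrected pattern, or None if no match."""
    m = CORRECTED.search(html)
    if m is None:
        return None
    # Test for None rather than an empty string: a blank URL is a valid capture.
    return m.group(3) if m.group(3) is not None else m.group(5)


quote_case = '''<a href="http://domain.com/andrew's%20page/">test</a>'''
blank_case = '''<a href="" onmouseover="...">test</a>'''

for html in (quote_case, blank_case):
    print(repr(EARLIER.search(html).group(1)), "vs", repr(corrected_url(html)))
```

With the earlier pattern the captures are the truncated URL and `onmouseover=`, as described above; the corrected pattern returns the full URL and an empty string respectively.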

Title: Problem when there is a CR in the data
Name: Squarei
Date: 4/27/2007 1:00:50 PM
Comment:
This works great until a carriage return is added, as follows: <a href="abc"><img src="def" border="0">1223</a>     <a href="ghi" target="_blank"> <img src="jkl"></a> This will only return the first item. I have been messing with this for a while and can't figure out how to ignore the carriage return.
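
One plausible cause, offered as an assumption rather than a confirmed diagnosis: in many regex engines `.` does not match a line feed unless a single-line (dot-all) option is enabled, so a link whose contents span a line break is skipped. The Python sketch below uses `re.DOTALL` and assumes the line break sits inside the second anchor; the exact position of the break in the original sample cannot be recovered from the comment.

```python
import re

PATTERN = r'''<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>'''

# Assumed placement of the line breaks; the comment does not show it exactly.
html = ('<a href="abc"><img src="def" border="0">1223</a>\r\n'
        '<a href="ghi" target="_blank">\r\n<img src="jkl"></a>')

# Default mode: "." stops at the line feed, so the second link is missed.
default_urls = [m.group(1) for m in re.finditer(PATTERN, html)]
# DOTALL mode: "." also matches line breaks, so both links are found.
dotall_urls = [m.group(1) for m in re.finditer(PATTERN, html, re.DOTALL)]

print(default_urls)
print(dotall_urls)
```

Other engines expose the same behavior under different names (for example a "single line" option), so the equivalent switch depends on the environment being used.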

Title: didn't work in php
Name: js
Date: 9/24/2006 12:18:34 AM
Comment:
got the error "Warning: preg_match() [function.preg-match]: Unknown modifier ']' in..."

Title: Updated
Name: Jacek Somepl
Date: 4/8/2005 6:32:39 PM
Comment:
Updated the pattern with the one from LPX's comment since it's slightly more efficient. Thanks, LPX.

Title: Not a bug
Name: Jacek Somepl
Date: 4/8/2005 6:28:54 PM
Comment:
Both patterns return the same results; however, you are correct that my original pattern will match the whole <a name="name"></a><a href="http://google.com">Gooogle</a> line. Since the anchored link doesn't contain any of the required parts, it is simply passed over by the regex engine.

Title: bug?
Name: LPX
Date: 4/8/2005 5:27:15 AM
Comment:
<a name="name"></a><a href="http://google.com">Gooogle</a>
MATCH
<a name="name"></a><a href="http://google.com">Gooogle</a>
INSTEAD OF
<a href="http://google.com">Gooogle</a>
WITH $1=http://google.com $2=Gooogle
I think this is correct:
<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>
LPX
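
A quick way to check this proposal is to run the suggested pattern against the sample line. The Python sketch below is illustrative only; `LPX_PATTERN` is a hypothetical name, and `$1` and `$2` correspond to `group(1)` and `group(2)` in Python's `re` module.

```python
import re

LPX_PATTERN = re.compile(
    r'''<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>''')

line = '<a name="name"></a><a href="http://google.com">Gooogle</a>'
m = LPX_PATTERN.search(line)

# [^>]*? cannot cross the '>' that closes <a name="name">, so the match
# has to start at the second anchor rather than the first.
print(m.group(0))
print(m.group(1), m.group(2))
```

The overall match is limited to the second anchor, with the URL in group 1 and the link text in group 2.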

RegExLib.com - The first Regular Expression Library on the Web!
