RegExLib.com - The first Regular Expression Library on the Web!

Regular Expression Details

Title
Pattern Title
Expression
<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>
Description
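This regex extracts the link URL and the link text from every <a href=...> tag in HTML source, which is useful for crawling sites. The pattern also allows for links that are spread over multiple lines, although any part matched by "." can only cross a line break when the engine's dot-matches-newline option is enabled.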
Matches
<a href='http://www.regexlib.com'>Text</a> | <a href="...">Text</a>
Non-Matches
all other html tags
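
As a rough illustration, the expression can be tried against the listed matches with Python's `re` module. This is only a sketch: the choice of Python, the `extract_links` helper name, the sample HTML, and the `IGNORECASE` and `DOTALL` flags are all assumptions, since the page does not say which engine or options the author used.

```python
import re

# The expression from this page, unchanged. IGNORECASE and DOTALL are
# assumptions; DOTALL lets "." cross line breaks inside a tag.
LINK_RE = re.compile(
    r"""<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return (url, text) pairs for each <a href=...> found in html."""
    return [(m.group(1), m.group(2)) for m in LINK_RE.finditer(html)]


# Hypothetical sample: one single-quoted link, one non-link tag, and one
# link whose opening tag is split across two lines.
sample = """<a href='http://www.regexlib.com'>Text</a>
<p>not a link</p>
<a
  href="http://example.com/">Example</a>"""

for url, text in extract_links(sample):
    print(url, "->", text)
```

Group 1 holds the URL and group 2 the link text, so a crawler would typically keep only group 1 and resolve it against the page's base URL.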
Author
Jacek Sompel
Rating
Not yet rated.

Existing User Comments

Title: great!
Name: alex
Date: 3/11/2009 10:10:47 AM
Comment:
just what I needed. thank you.


Title: Problem with single quotes
Name: Andrew Rosca
Date: 8/27/2007 10:03:00 AM
Comment:
This expression is good, but needs a couple of corrections. I ran into two problems with it:

1. I encountered URLs with single quotes in them (don't ask). Because the expression looks for EITHER a single or a double quote at the end of the URL, it would capture only part of the URL.
Example: <a href="http://domain.com/andrew's%20page/">test</a>
Capture: "http://domain.com/andrew"

2. The expression looks for ONE OR MORE spaces, double quotes, or single quotes before the URL. This means it will incorrectly parse links where the URL is blank.
Example: <a href="" onmouseover="...">test</a>
Capture: "onmouseover="

The following corrections bring it a bit closer:
<a[\s]+[^>]*?href[\s]*=[\s]*((\"(.*?)\")|(\'(.*?)\')).*?>([^<]+|.*?)?<\/a>
The captured URL will be in either group 3 or group 5 (ignore the one that is blank).
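
The two problems and the proposed correction can be reproduced with a short Python sketch. The expressions and the two example links are taken from the comment; the use of Python's `re` module and the `corrected_url` helper are assumptions made for illustration only.

```python
import re

# Expression as listed on this page.
ORIGINAL = re.compile(
    r"""<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>"""
)
# Corrected expression from the comment, unchanged.
CORRECTED = re.compile(
    r"""<a[\s]+[^>]*?href[\s]*=[\s]*((\"(.*?)\")|(\'(.*?)\')).*?>([^<]+|.*?)?<\/a>"""
)


def corrected_url(html):
    """Return the URL from group 3 or 5 of the corrected pattern, or None."""
    m = CORRECTED.search(html)
    if m is None:
        return None
    # Python reports the unused alternative as None, so an empty href
    # ("" in group 3) can still be told apart from the unused group.
    return m.group(3) if m.group(3) is not None else m.group(5)


quote_in_url = """<a href="http://domain.com/andrew's%20page/">test</a>"""
blank_url = '<a href="" onmouseover="...">test</a>'

print(ORIGINAL.search(quote_in_url).group(1))  # truncated at the apostrophe
print(ORIGINAL.search(blank_url).group(1))     # picks up the next attribute
print(corrected_url(quote_in_url))             # full URL
print(repr(corrected_url(blank_url)))          # empty string
```

In engines that report an unused group as an empty string rather than a null value, the "ignore the one that is blank" rule from the comment applies as written.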


Title: Problem when there is a CR in the data
Name: Squarei
Date: 4/27/2007 1:00:50 PM
Comment:
This works great until a carriage return is added, as follows: <a href="abc"><img src="def" border="0">1223</a> &nbsp;&nbsp;&nbsp; <a href="ghi" target="_blank"> <img src="jkl"></a> This will only return the first item. I have been messing with this for a while and can't figure out how to ignore the carriage return.
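
One possible explanation, offered as a guess rather than a confirmed diagnosis: in most engines "." does not match a line feed unless a dot-matches-newline option is set (`re.DOTALL` in Python, `RegexOptions.Singleline` in .NET). The exact position of the line break did not survive in the comment, so the sketch below assumes it falls between the second opening tag and the `<img>`.

```python
import re

# Expression as listed on this page.
PATTERN = r"""<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>"""

# Assumed input: the line break is placed after the second opening tag.
html = (
    '<a href="abc"><img src="def" border="0">1223</a> &nbsp;&nbsp;&nbsp; '
    '<a href="ghi" target="_blank">\r\n<img src="jkl"></a>'
)

# Without DOTALL, ".*?" cannot cross the "\n", so the second link is missed.
default_urls = [m.group(1) for m in re.finditer(PATTERN, html)]
# With DOTALL, "." matches line breaks too and both links are found.
dotall_urls = [m.group(1) for m in re.finditer(PATTERN, html, re.DOTALL)]

print(default_urls)
print(dotall_urls)
```

If the line break sits somewhere else in the real data, the cause may differ, but switching on the dot-matches-newline option is a cheap first thing to try.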


Title: didn't work in php
Name: js
Date: 9/24/2006 12:18:34 AM
Comment:
got the error "Warning: preg_match() [function.preg-match]: Unknown modifier ']' in..."


Title: Updated
Name: Jacek Somepl
Date: 4/8/2005 6:32:39 PM
Comment:
Updated the pattern with the one from LPX's comment, since it's slightly more efficient. Thanks, LPX.


Title: Not a bug
Name: Jacek Somepl
Date: 4/8/2005 6:28:54 PM
Comment:
Both patterns return the same results. However, you are correct: my original pattern will match the whole <a name="name"></a><a href="http://google.com">Gooogle</a> line, and since the anchored link doesn't contain any of the required parts, it will simply be passed by the regex engine.


Title: bug?
Name: LPX
Date: 4/8/2005 5:27:15 AM
Comment:
Input: <a name="name"></a><a href="http://google.com">Gooogle</a>
The pattern matches <a name="name"></a><a href="http://google.com">Gooogle</a> instead of <a href="http://google.com">Gooogle</a> with $1=http://google.com and $2=Gooogle.
I think this is correct:
<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>
LPX
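
A quick check of this case with the expression suggested in the comment, sketched in Python (the choice of Python's `re` module is an assumption; the comment uses `$1`/`$2` notation and does not name an engine):

```python
import re

# Expression suggested in the comment (the same one now listed on the page).
LINK_RE = re.compile(
    r"""<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>"""
)

html = '<a name="name"></a><a href="http://google.com">Gooogle</a>'
m = LINK_RE.search(html)

# [^>]*? cannot run past the ">" that closes <a name="name">, so the match
# can only begin at the second <a ...> tag.
print(m.group(0))
print(m.group(1), m.group(2))
```

The negated class `[^>]` is what keeps the match inside a single opening tag, so the named anchor in front of the link is skipped rather than swallowed.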


Copyright © 2001-2024, RegexAdvice.com