Regex syntax request for help

Ker_01 · Feb 18, 2008

I'm parsing an HTML file, and originally, I thought I only needed to capture
all the links- the following worked well in my particular application
(sample HTML snippet pasted at bottom of post):
^<A HREF=.*>

However, now I've found that I only need to capture and process certain
links. The information that will determine whether a link needs to be
processed is buried between the original link and the next link (or EOF), so
I need to capture a larger (multiline) section of text and test each one to
see if it contains my identifier. It appears that I'm safe using the </TR>
tag as something that always comes after my new identifier and before the
next link (or EOF). So, I'm trying to edit my regex so I can grab this
larger (multiline) section of text, then if the identifier is the correct
one, I'll use my first regex expression or a slightly modified version to
grab just the URL from within the match.

I've been using http://www.aivosto.com/vbtips/regex.html as a helpful source
on regex expressions, but when I test my code on
http://regexlib.com/RETester.aspx I'm getting no results (my first
expression worked fine). Any assistance would be greatly appreciated. I
think I'm pretty close, but the following isn't working:
^<A HREF=.*/TR>

Any advice? The only difference is replacing the single '>' with '/TR>'. I
suspect it may have to do with spaces or linebreaks, but I don't know for
certain.

I'm posting a sample of my much larger HTML below; I'm trying to only
capture the ^<A HREF=.*> URL match for items where the class td includes
"Land Spread Vector".

I prefer using multiple simple Regex expressions versus one donated
expression that does it all, so I can understand my own code and at least
attempt to troubleshoot if I need to change anything.

Thanks!
Keith

<A Href=javascript:openDocument('0900043d802b3528');>

<img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16>

 101998

</a>

</td>

<td class='classtd'>

Green-tipped Martin

</td>

<td class='classtd'>

CURRENT,3.2

</td>

</TR>

<TR>

<TD></TD>

<TD>

<A Href=javascript:openDocument('0900043d803a1ce4');>

<img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16>

 101998 - APRRE - Assert.doc

</a>

</td>

<td class='classtd'>

Land Spread Vector

</td>

<td class='classtd'>

CURRENT,3.0

</td>

</TR>

<TR>

<TD></TD>

<TD>

<A Href=javascript:openDocument('0900043d802b635e');>

<img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16>

 101998-R

</a>

</td>

<td class='classtd'>

Reevaluation

</td>

<td class='classtd'>

CURRENT,1.0

</td>

</TR>

</TD></TR></TABLE><BR><BR>

<CENTER>

<A Href='javascript:history.back();'><img
src='/OurDir/images/back_down.jpg' border=0 align='center'
alt='Back'></A> 

<A Href='javascript:goHome();'><img
src='/OurDir/images/home_down.jpg' border=0 align='center' alt='Home'></A>

</CENTER>

</BODY>

</HTML>

Ron Rosenfeld · Feb 18, 2008

Your description and the data confuse me a bit. It might be clearer to me if
you posted exactly which links you expect to extract.

However, two suggestions:

1. In VBA, dot (".") never matches newline. So if you want to devise an
expression that will match across multiple lines, you need to use something
like "[\s\S]*"

2. If you want to match only those HREF matches that are followed by your
tag, you could use a look-ahead assertion:

<A\sHREF=.*>(?=[\S\s]*/TR>)

Note that the use of the dot in the URL will restrict matches to URLs that
are on a single line. If your URLs might extend across more than one line,
then:

<A\sHREF=[\s\S]*?>(?=[\S\s]*/TR>)
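
To make the two suggestions concrete, here is a minimal sketch. It uses Python's re module rather than VBA, so treat it as an illustration only: the flags re.IGNORECASE and re.MULTILINE stand in for the IgnoreCase and MultiLine properties, and the html string is a condensed, assumed stand-in for one row of the sample (with the link written as javascript:openDocument(...)).

```python
import re

# Condensed stand-in for one table row of the sample HTML (assumed layout).
html = (
    "<A Href=javascript:openDocument('0900043d803a1ce4');>\n"
    " 101998 - APRRE - Assert.doc\n"
    "</a>\n"
    "</td>\n"
    "<td class='classtd'>\n"
    "Land Spread Vector\n"
    "</td>\n"
    "</TR>\n"
)

flags = re.IGNORECASE | re.MULTILINE

# Dot stops at the end of the line, so "/TR>" is never reached.
dot_match = re.search(r"^<A HREF=.*/TR>", html, flags)

# [\s\S] matches any character, newlines included; the lazy *? stops
# at the first "/TR>" rather than the last one in the text.
block_match = re.search(r"^<A HREF=[\s\S]*?/TR>", html, flags)

# Look-ahead: match only the link itself, but require "/TR>" later on.
link_match = re.search(r"<A\sHREF=.*>(?=[\S\s]*/TR>)", html, flags)

print(dot_match)                    # no match
print(block_match.group(0))         # link through </TR>
print(link_match.group(0))          # just the link line
```

The first pattern finds nothing, the second returns the whole block through </TR>, and the third returns only the link.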

--ron

Ker_01 · Feb 18, 2008

Ron- thank you for your reply. In the sample HTML in the original post, the
only URL I ultimately need is
<A Href=javascript:openDocument('0900043d803a1ce4');>

because it is the only one where the text between that URL and the next
includes the text:
<td class='classtd'>
Land Spread Vector '<- what I really need to know
</td>
.....
</TR>

Your last suggested regex was very helpful; I changed it to only look for
the LSV as follows:
<A\sHREF=[\s\S]*?>(?=[\S\s]*Land Spread Vector)

It returned the target URL, but also returned the URL above it, presumably
because they are both followed by the LSV (oops!). I like the idea of using
regex to only return the URLs that are followed by LSV (saves me two steps!)
but I'd need to learn how to have the regex not return the URL if it hits
another URL before the LSV.

The alternative would be to return everything between the URL and the /TR
(multiple lines of text) which would not cut across multiple URLs, and I
could look to see if there was an LSV within that returned text block. The
expression above is only returning the URL line itself, not the multiple
lines of text that end in </TR>
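
The alternative described above (grab each block from a link through </TR>, then test the block for the identifier) might look roughly like the sketch below. It is written in Python rather than VBA, and the html string, block_re, url_re and wanted are all hypothetical names and a condensed, assumed version of the sample data.

```python
import re

# Condensed stand-in for the sample HTML: three rows, one link each (assumed).
html = """<A Href=javascript:openDocument('0900043d802b3528');>
 101998
</a>
</td>
<td class='classtd'>
Green-tipped Martin
</td>
</TR>
<TR>
<TD>
<A Href=javascript:openDocument('0900043d803a1ce4');>
 101998 - APRRE - Assert.doc
</a>
</td>
<td class='classtd'>
Land Spread Vector
</td>
</TR>
<TR>
<TD>
<A Href=javascript:openDocument('0900043d802b635e');>
 101998-R
</a>
</td>
<td class='classtd'>
Reevaluation
</td>
</TR>
"""

# Step 1: each block runs from a link to the first "</TR>" after it.
block_re = re.compile(r"<A\sHREF=[\s\S]*?</TR>", re.IGNORECASE)
# Step 2: pull the link target out of a block that passed the test.
url_re = re.compile(r"<A\sHREF=([^>]+)>", re.IGNORECASE)

wanted = []
for block in block_re.findall(html):
    if "Land Spread Vector" in block:
        wanted.append(url_re.search(block).group(1))

print(wanted)
```

Because the lazy quantifier stops at the first </TR>, no block spans two links, so the plain substring test is enough.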

Thanks for any advice!
Keith

Ron Rosenfeld · Feb 18, 2008

OK, it's difficult to look for the absence of a phrase.

I think the following regex will do it, though. It first matches the <A HREF=
token; then it uses a negative lookahead to refuse to step past another
<A HREF= string before it reaches the Land Spread Vector string.

It also uses a capturing group to extract the actual URL, so you want your
routine to return just capturing Group 1.

<A\sHREF=([^>]+)(?:(?![\s\S]?<A\sHREF=)[\s\S]?)*Land Spread Vector

As written, it will capture into group 1 only:

javascript:openDocument('0900043d803a1ce4');

But if you move the first parenthesis to the beginning of the line, you can
also capture that initial <A HREF=.

Be sure to set IgnoreCase to True in your VBA routine.
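
For comparison, here is a small sketch of the difference between the plain look-ahead and the negative-lookahead version. It is Python, not VBA, with re.IGNORECASE standing in for IgnoreCase. The html string is a condensed, assumed version of the sample, and the strict pattern is a slightly simplified form of the idea (the optional quantifiers are dropped), so take it as an illustration rather than the exact expression.

```python
import re

# Condensed stand-in for the sample HTML: three rows, one link each (assumed).
html = """<A Href=javascript:openDocument('0900043d802b3528');>
 101998
</a></td>
<td class='classtd'>
Green-tipped Martin
</td>
</TR>
<A Href=javascript:openDocument('0900043d803a1ce4');>
 101998 - APRRE - Assert.doc
</a></td>
<td class='classtd'>
Land Spread Vector
</td>
</TR>
<A Href=javascript:openDocument('0900043d802b635e');>
 101998-R
</a></td>
<td class='classtd'>
Reevaluation
</td>
</TR>
"""

# Plain look-ahead: any link with the phrase anywhere after it qualifies.
loose_re = re.compile(
    r"<A\sHREF=[\s\S]*?>(?=[\S\s]*Land Spread Vector)", re.IGNORECASE
)

# Negative look-ahead inside the loop: each step consumes one character
# only if another "<A HREF=" does not start there, so the match cannot
# run past the next link on its way to the phrase.
strict_re = re.compile(
    r"<A\sHREF=([^>]+)(?:(?!<A\sHREF=)[\s\S])*Land Spread Vector",
    re.IGNORECASE,
)

loose = loose_re.findall(html)    # full matches (no capturing group)
strict = strict_re.findall(html)  # contents of Group 1 only

print(len(loose))
print(strict)
```

The loose pattern returns the first two links, while the strict one returns only the target of the link that sits directly above Land Spread Vector.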
--ron
