Regular Expression Help on syntax

L

Lars-Åke Aspelin

In order to get a proper match I have to use the \ character before
the / characters in the Format function, like this

sURLdate = Format(CDate("1/1/2010"), "\/yyyy\/m\/d\/")

Maybe this has something to do with the Regional and Language
settings.

I don't know. AFAIK, all characters except

[\^$.|?*+()

should get matched literally.

Preceding a character (that has no special meaning) by a backslash ("\") merely
represents a single instance of the second character.

If you type, into the immediate window:

?CDate("1/1/2010")

what is returned?
I also have a question. What is the function of the two \b in the
regexp? I get a match even without them.
And the negative number is also matched.

\b represents a word boundary, or, more specifically, it matches at the
position between a word character (anything matched by \w) and a non-word
character (anything matched by [^\w] or \W) as well as at the start and/or end
of the string (or line) if the first and/or last characters in the string (or
line) are word characters.
--ron

If I write ?CDate("1/1/2010") in the immediate window the following is
returned
2010-01-01

If I write ?Format(CDate("1/1/2010"), "/yyyy/m/d/") the following is
returned
-2010-1-1-

Only if I write ?Format(CDate("1/1/2010"), "\/yyyy\/m\/d\/" the wanted
result
/2010/1/1/
is achieved.

So, with my settings the / character seems to generate a - .
To have a / generated the / has to be escaped with \.

I found the explanation in the Excel help for Format:

"(/) Date separator. In some locales, other characters may be used to
represent the date separator. The date separator separates the day,
month, and year when date values are formatted. The actual character
used as the date separator in formatted output is determined by your
system settings."

And in my settings - (hyphen) is used as the date separator, in
accordance with ISO 8601 extended format
http://en.wikipedia.org/wiki/ISO_8601

Lars-Åke
 
W

Walter Briscoe

In message <[email protected]> of Sun, 10 Jan 2010 04:55:10 in
microsoft.public.excel.programming said:
Ron: have you ever looked at the 1970's unix manual volume 2b under the
YACC topic. See this webpage. Look for the link for the PDF files and
use the link : v7vol2b.pdf (819KB)

Webpage
'7th Edition Manual PDF'
(http://plan9.bell-labs.com/7thEdMan/bswv7.html)

pdf file
http://plan9.bell-labs.com/7thEdMan/v7vol2b.pdf



It is the only good description of pattern matching that I have ever
seen.

The question mark indicates any single character.

Joel, How do you conclude that? (I expect "." to match any single
character within a line and ".|\s" to match any single character).

I found this table on page 49/250 (Joel is probably looking elsewhere):

Regular expressions in Lex use the following operators:

....
x? an optional x.
x* 0,1,2, ... instances of x.
x+ 1,2,3, ... instances of x.
x?y an x or a y.
....

I am confused by that x?y. I think it means an optional x followed by a
literal y. I think there may be a glyph confusion. I suspect it is
intended to be x|y where the character - for which I know no name -
between x and y means "or" as & - ampersand - means "and".

The symbols used by Set RE = CreateObject("VBScript.RegExp")
RE.Pattern = ... are a superset of the BRE described in <http://opengrou
p.org/onlinepubs/007908775/xbd/re.html>. My own view is that thinks like
\d is an unnecessary shorthand for [0-9]. I concede it is reasonable if
a range is not allowed in a character set. i.e. if [0123456789] is
needed. In VBA, I refer to <http://msdn.microsoft.com/en-
us/library/ms974570.aspx> which specifies everything I want to know
other than the meanings of the exceptions given.
<http://msdn.microsoft.com/en-us/library/xe43cc8d(VS.85).aspx>
specifies 5019, which I hit yesterday. ;)

For my part, I would start with "-?\d+" to grab a whole number. i.e a
whole number is a minus which is optional followed by a digit one or
more times. I found the OP's situation too complicated to want to follow
and offer a suggestion.
 
R

Ron Rosenfeld

In order to get a proper match I have to use the \ character before
the / characters in the Format function, like this

sURLdate = Format(CDate("1/1/2010"), "\/yyyy\/m\/d\/")

Maybe this has something to do with the Regional and Language
settings.

I don't know. AFAIK, all characters except

[\^$.|?*+()

should get matched literally.

Preceding a character (that has no special meaning) by a backslash ("\") merely
represents a single instance of the second character.

If you type, into the immediate window:

?CDate("1/1/2010")

what is returned?
I also have a question. What is the function of the two \b in the
regexp? I get a match even without them.
And the negative number is also matched.

\b represents a word boundary, or, more specifically, it matches at the
position between a word character (anything matched by \w) and a non-word
character (anything matched by [^\w] or \W) as well as at the start and/or end
of the string (or line) if the first and/or last characters in the string (or
line) are word characters.
--ron

If I write ?CDate("1/1/2010") in the immediate window the following is
returned
2010-01-01

If I write ?Format(CDate("1/1/2010"), "/yyyy/m/d/") the following is
returned
-2010-1-1-

Only if I write ?Format(CDate("1/1/2010"), "\/yyyy\/m\/d\/" the wanted
result
/2010/1/1/
is achieved.

So, with my settings the / character seems to generate a - .
To have a / generated the / has to be escaped with \.

I found the explanation in the Excel help for Format:

"(/) Date separator. In some locales, other characters may be used to
represent the date separator. The date separator separates the day,
month, and year when date values are formatted. The actual character
used as the date separator in formatted output is determined by your
system settings."

And in my settings - (hyphen) is used as the date separator, in
accordance with ISO 8601 extended format
http://en.wikipedia.org/wiki/ISO_8601

Lars-Åke

Well, that is an interesting difference, and clearly related to
Excel/VBA/Regional Settings and not to the regex engine per se, as I originally
thought.

When I place ?Cdate("1/1/2010") in the immediate window, it returns
1/1/2010.

So it would seem that for code that will run properly, regardless of the
Windows Regional settings, it might be best to use "\/yyyy\/m\/d" in the Format
command to properly format the date string to match with the URL.

Here in the US, I don't generally have to deal with international requirements.
Thanks for pointing that out.
--ron
 
R

Ron Rosenfeld

Ron: have you ever looked at the 1970's unix manual volume 2b under the
YACC topic. See this webpage. Look for the link for the PDF files and
use the link : v7vol2b.pdf (819KB)

Webpage
'7th Edition Manual PDF'
(http://plan9.bell-labs.com/7thEdMan/bswv7.html)

pdf file
http://plan9.bell-labs.com/7thEdMan/v7vol2b.pdf



It is the only good description of pattern matching that I have ever
seen.

The question mark indicates any single character.


I have not looked at that reference. However, if that is what it says, it is
wrong.

The question mark "?": makes the preceding item optional. Greedy, so the
optional item is included in the match if possible.

=================================
Looking at the reference you provided, I see that you have misread it. It does
*NOT* read, as you claim, that "the question mark indicates any single
character". It states the almost the SAME definition as *I* posted above:

"Optional Expressions . The operator ? indicates an optional element of an
expression. Thus ab?c matches either ac or abc."
===================================

There are some instances where "?" represents any single character, but it does
NOT do this with regard to regular expressions. Perhaps that is where you got
confused.
There are a number of problems with the above search string

1) why do you care that a blank occurs immediately after the last
digit. what happens if there is a return, tab or end of file

I don't know what you mean by a "blank". Perhaps you are confusing the use of
the "\b" token? Check that definition (which I previously posted). It is also
does NOT seem to be part of the LEX regular expression flavor which you are
quoting. Rather, in LEX, it would mean a backspace; but that is not the case
with more commonly used flavors.
2) The match is looking for a positive or negative sign folowed by any
character followed by a blank. Most people don't have a blank.

No it is not. See above. You are misinterpreting "\b".\
\d+|([-+]\d+)

It is where I found the saying

If you can't bring Mohammed to the mountain, bring the mountain to
Mohammed.

If Mohammed goes to the wrong mountain, he may get the wrong information.

You might try some of these "mountains", if the links are still valid:

Regular Expressions
http://www.regular-expressions.info/reference.html
http://support.microsoft.com/default.aspx?scid=kb;en-us;818802&Product=vbb
http://msdn2.microsoft.com/en-us/library/6wzad2b2.aspx
http://msdn2.microsoft.com/en-us/library/ms974619.aspx
http://www.regex-guru.info/
--ron
 
R

Ron Rosenfeld

For my part, I would start with "-?\d+" to grab a whole number. i.e a
whole number is a minus which is optional followed by a digit one or
more times. I found the OP's situation too complicated to want to follow
and offer a suggestion.

I agree with a lot of what you wrote.

I would point out that the above construct will also extract whole numbers that
are embedded within other strings. That may or may not be desirable.

In other words, it can extract both integers from the string below:

abc123abc 123
--ron
 
J

joel

Ron: I don't think in this case you care if you get multiple numbe
using your expression. Assume this is a webpage (html). Th
requirement is to return the number associated with a tag item in
large documents where you don't know the number of tags in the document
Some tags may occur more than once and you may need to get the numbe
that is associated with the 5th time the tag appeared.

Walter Briscoe: There are utilities in the Unix manual where the
means a single character. In YACC expresions it meas an optiona
character. So you can use

[-+]?\d+

Lars-Åke Aspelin: The backslash before for special characters come
from Bell Labs (the Unix manuals). The scientist at Bell Labs develope
a lot of the standards that are used today for regular expressions. th
reasearch was needed to develope algorithms that cold be used to fin
peole names in phone books and to develope softwae compilers fo
programming languages. both tasks the phone company was interested i
doing and used similar algorithms. Computers in the 1970's wer
expensive to operate and developing efficient fast algortihms that us
minimum amount of memory (memory very expensive).

Companiers today don't have the same concerns. The people are mor
expensive than the hardware which is fast and inexpensive so algorithm
are developed as quickly as possible with lots of errors that sometime
gets fixed if enough people complain
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top