String splitting for inconsistent strings

Richard · Jun 20, 2006

I receive a series of daily share prices in an html email which I save
as a file and then have a macro load this file into a data sheet.

Amongst all the html stuff there are recognisable share price strings
like this in Column A

ARM Holdings 116.00p -0.03p (-2.11)
Amino Technologies 98.50p 0.00p (-0.00)
BAA 872.50p -0.01p (-0.06)
McCarthey and Stone 766.00p 0.04p (0.52)

I'd like to slice these up so that in columns B:C I end up with the
following

B1: Arm Holdings, C1: 116.00
B2: Amino Technologies, C2: 98.50
B3: BAA, C3: 872.50
B4: McCarthy and Stone, C4: 766.00

Unfortunately some names are two or more words long, and others a
single word, which makes identifying the split between the name and the
price difficult.

Until recently, Digital Look.com who send the emails had a slightly
different layout where there was a unique identifier company ID code
surrounded with ( ) which made it reasonably easy to identify the split
between the name and the price. However their layout has now changed to
the above, and I'm strruggling to find an elegant solution to slicing
the string.

I'd be grateful for any help or suggestions

Usual TIA,

Richard Buttrey

Ron Rosenfeld · Jun 20, 2006

I receive a series of daily share prices in an html email which I save
as a file and then have a macro load this file into a data sheet.

Amongst all the html stuff there are recognisable share price strings
like this in Column A

ARM Holdings 116.00p -0.03p (-2.11)
Amino Technologies 98.50p 0.00p (-0.00)
BAA 872.50p -0.01p (-0.06)
McCarthey and Stone 766.00p 0.04p (0.52)

I'd like to slice these up so that in columns B:C I end up with the
following

B1: Arm Holdings, C1: 116.00
B2: Amino Technologies, C2: 98.50
B3: BAA, C3: 872.50
B4: McCarthy and Stone, C4: 766.00

Unfortunately some names are two or more words long, and others a
single word, which makes identifying the split between the name and the
price difficult.

Until recently, Digital Look.com who send the emails had a slightly
different layout where there was a unique identifier company ID code
surrounded with ( ) which made it reasonably easy to identify the split
between the name and the price. However their layout has now changed to
the above, and I'm strruggling to find an elegant solution to slicing
the string.

I'd be grateful for any help or suggestions

Usual TIA,

Richard Buttrey

It seems as if this can be done with "Regular Expressions". I will make use of
the fact that the value which you want to pull out is the first floating point
number that ends with a "p " (thats "p" followed by <space>). If there are
any issues where the stock name includes a string of the form <number>p<space>
you might want to change the formula to take that into account.

Download and install Longre's free morefunc.xll add-in from
http://xcell05.free.fr

To extract the company name:

=REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s))")

To extract the price:

=REGEX.MID(A1,"\d+(\.\d*)?(?=p\s)")

To be a little more bullet-proof one can require that the price number be
identified as the first positive floating point number followed by the sequence
of p<space> then optionally a "-" and another digit. So the formula would
then be:

Company Name:
=REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s-?\d))")

Price:
=REGEX.MID(A1,"\d+(\.\d*)?(?=p\s-?\d)")

--ron

Richard Buttrey · Jun 22, 2006

It seems as if this can be done with "Regular Expressions". I will make use of
the fact that the value which you want to pull out is the first floating point
number that ends with a "p " (thats "p" followed by <space>). If there are
any issues where the stock name includes a string of the form <number>p<space>
you might want to change the formula to take that into account.

Download and install Longre's free morefunc.xll add-in from
http://xcell05.free.fr

To extract the company name:

=REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s))")

To extract the price:

=REGEX.MID(A1,"\d+(\.\d*)?(?=p\s)")

To be a little more bullet-proof one can require that the price number be
identified as the first positive floating point number followed by the sequence
of p<space> then optionally a "-" and another digit. So the formula would
then be:

Company Name:
=REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s-?\d))")

Price:
=REGEX.MID(A1,"\d+(\.\d*)?(?=p\s-?\d)")

--ron

Ron,

Many thanks for the pointer to this Add In which I've just downloaded.
The REGEX function looks very interesting - I just need to get my mind
around the syntax and terminology.

One supplementary if I may. I've just used your example on my data and
it works a treat.
However there's one other small variation I've just noticed, and
that's where the integer part of the share price is four characters
with a comma separating the thousands.

e.g. the following string

GlaxoSmithKline 1,480.00p 0.16p (1.09)

returns nothing for the name in B1, and in C1 480.00, not 1480.00

Can you suggest a modification that will overcome this please - it'll
probably take me a couple of weeks to understand the syntax!

Kind regards,

Richard Buttrey

__
Richard Buttrey
Grappenhall, Cheshire, UK
__________________________

Ron Rosenfeld · Jun 22, 2006

Ron,

Many thanks for the pointer to this Add In which I've just downloaded.
The REGEX function looks very interesting - I just need to get my mind
around the syntax and terminology.

One supplementary if I may. I've just used your example on my data and
it works a treat.
However there's one other small variation I've just noticed, and
that's where the integer part of the share price is four characters
with a comma separating the thousands.

e.g. the following string

GlaxoSmithKline 1,480.00p 0.16p (1.09)

returns nothing for the name in B1, and in C1 480.00, not 1480.00

Can you suggest a modification that will overcome this please - it'll
probably take me a couple of weeks to understand the syntax!

Kind regards,

Richard Buttrey

__
Richard Buttrey
Grappenhall, Cheshire, UK
__________________________

What you need to do, I believe, is optionally look for zero or more sequences,
within the number, of a comma followed by three digits. The syntax for that
would be: (,\d{3})*

I think this modification will do that:

\d+(,\d{3})*(\.\d*)(?=p\s)

So the modified "more bulletproof" formulas would be:

Company Name:

=REGEX.MID(A1,"(\w+\s?)+(?=\s\d+(,\d{3})*(\.\d*)?(?=p\s-?\d))")

Price:

=REGEX.MID(A1,"\d+(,\d{3})*(\.\d*)?(?=p\s-?\d)")

If you look at the Company Name formula, you will note I made a small change as
the previous formula would include a trailing <space> after each name. This
one does not.

The simpler formulas would be

Name:

=REGEX.MID(A1,"(\w+\s?)+(?=\s\d+(,\d{3})*(\.\d*)?(?=p\s))")

Price:

=REGEX.MID(A1,"\d+(,\d{3})*(\.\d*)(?=p\s)")

=========================================

http://www.regular-expressions.info/reference.html

is a place on the net to start learning a bit about syntax.

I would also recommend the book "Mastering Regular Expressions" by Jeffrey
Friedl. I got mine for just a few dollars, used, at amazon.com

Finally, be aware that a limitation of Longre's functions is that they cannot
process strings longer than 255 characters. I believe this limitation is due
to the data types used in the XLL, and also applies to the other functions in
the add-in.

If you need to process longer strings, it can be done in VBA by setting a
reference to Microsoft VBScript Regular Expressions 5.5 and writing some code.
I've got a little of that done, but nothing as flexible as what Longre
provides. And there's also some syntax that won't work in the VBScript
implementation, so I rarely bother with it.

Have fun
--ron

FYI - Sql for splitting a delimited concatenated string into separate strings/rows.	0	Feb 24, 2004
FYI - Sql for splitting a delimited concatenated string into separate strings/rows.	2	Feb 24, 2004

String splitting for inconsistent strings

Richard

Ron Rosenfeld

Richard Buttrey

Ron Rosenfeld

Ask a Question

Similar Threads