Extracting some text from a string.

A

Art MacNeil

Hello,

I'm trying to extract some specific text from a bunch of strings of text.

I think this should be really easy but I can't get it done. I've used "Left"
"Mid", "Substitute" and "Find" in one combination of another and sometimes
it works but it doesn't always work.

Regardless of the path, what I need is, the string of text after the final
"\" and before the ".xls"


I've copied a sample of the strings I'm working with below.

C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Federal\5 year compare\5yearscomparePF.xls
C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Federal\CCA\CCA Line 1 to 2475.xls
C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Federal\CCA\CCA Lines 2475 & down.xls
C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Federal\RRSP Deduction pf\RRSP Deduction PF.xls
E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Federal\5 year compare pf.xls
E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Federal\CCA Line 1 to 2475.xls
E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Federal\CCA Lines 2475 & down.xls
E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Federal\RRSP Deduction pf.xls
C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Québec\Annexe A\Annexe A T P.xls
C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Québec\LSVCC\LSVCC (Québec) T P.xls
C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Québec\QVEH-B\QVEH-B series T P.xls
C:\Source\qa\T P 2006\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Québec\TP1-Comparative_Summary\TP1_Comparative_TAXSUMMARY_PF.xls
E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Québec\Annexe A.xls
E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Québec\LSVCC (Québec).xls
E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Québec\QVEH B.xls
E:\T Ps\T P 2007\QT, IR, QTW, IRW\Tax Components\Calculation
Testing\Québec\TP1 Comp Summary pf.xls




The Text I need is:
5yearscomparePF
CCA Line 1 to 2475
CCA Lines 2475 & down
RRSP Deduction PF
5 year compare pf
CCA Line 1 to 2475
CCA Lines 2475 & down
RRSP Deduction pf
Annexe A T P
LSVCC (Québec) T P
QVEH-B series T P
TP1_Comparative_TAXSUMMARY_PF
Annexe A
LSVCC (Québec)
QVEH B
TP1 Comp Summary pf


Thank you for your help,

Art.
 
R

Rick Rothstein \(MVP - VB\)

If your 'xls' extensions are all always in lower case letters, then this
formula will work...

=SUBSTITUTE(SUBSTITUTE(A1,LEFT(A1,FIND("*",SUBSTITUTE(A1,"\","*",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))))),""),".xls","")

If the extensions can be all lower or all upper case letters, then this
formula will work....

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,LEFT(A1,FIND("*",SUBSTITUTE(A1,"\","*",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))))),""),".xls",""),".XLS","")

Note though that neither formula will work if there is a mixture of upper
and lower case letters in the extension part of the string.

Rick
 
M

MartinW

Hi Rick,

Perhaps this formula which I believe was one of yours might work better.
=MID(A1,FIND("~",SUBSTITUTE(A1,"\","~",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))))+1,999)
then in C1 put
=LEFT(B1,LEN(B1)-4)
This allows for upper and lower case as well as .doc or any other extension.

Now if you could just show us how to combine the two.....I've been trying
but I just can't get the syntax right.

Regards
Martin
 
A

Art MacNeil

This worked perfectly:

=SUBSTITUTE(SUBSTITUTE(A1,LEFT(A1,FIND("*",SUBSTITUTE(A1,"\","*",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))))),""),".xls","")

I can't believe how small the formula is. Mine was huge, and it didn't work
in all cases.

Thank you very much,

Art.
 
R

Rick Rothstein \(MVP - VB\)

Hey, I forgot I posted that one (I tend to re-create these things as the
questions come in). Here is the combined formula...

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,LEFT(A1,FIND("*",SUBSTITUTE(A1,"\","*",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))))),""),".xls",""),".XLS","")

It is 38 characters longer than the one I posted to this thread and requires
some 5 additional function calls (the extra overhead coming from having to
remove the extension).

Rick
 
R

Ron Coderre

Try this:

B1:
=SUBSTITUTE(RIGHT(A1,MATCH("\",INDEX(MID(A1,LEN(A1)+1-ROW($1:$255),1),0),0)-1),".xls","")

That assumes the file name does not exceed 255 characters.

Is that something you can work with?
***********
Regards,
Ron

XL2003, WinXP
 
A

Art MacNeil

This worked as well.

Thank you,

Art.


Ron Coderre said:
Try this:

B1:
=SUBSTITUTE(RIGHT(A1,MATCH("\",INDEX(MID(A1,LEN(A1)+1-ROW($1:$255),1),0),0)-1),".xls","")

That assumes the file name does not exceed 255 characters.

Is that something you can work with?
***********
Regards,
Ron

XL2003, WinXP
 
H

Harlan Grove

Art MacNeil said:
Regardless of the path, what I need is, the string of text after the final
"\" and before the ".xls"
....

Base filenames.

Another alternative, define the name seq referring to the formula

=ROW(INDEX($1:$65536,1,1):INDEX($1:$65536,1024,1))

Then with one of these full pathnames in cell A2, try the formula

=SUBSTITUTE(REPLACE(A2,1,LOOKUP(2,1/(MID(A2,seq,1)="\"),seq),""),".xls","")

For more general parsing tasks, you'd be better off downloading and
installing Laurent Longre's MOREFUNC.XLL add-in and using it's regular
expression functions. Using it you could extract base filenames using

=REGEX.SUBSTITUTE(A2,"^.+\\([^\\]+?)(\.[^.\\]*)?$","[1]")
 
R

Ron Rosenfeld

Art MacNeil said:
Regardless of the path, what I need is, the string of text after the final
"\" and before the ".xls"
...

Base filenames.

Another alternative, define the name seq referring to the formula

=ROW(INDEX($1:$65536,1,1):INDEX($1:$65536,1024,1))

Then with one of these full pathnames in cell A2, try the formula

=SUBSTITUTE(REPLACE(A2,1,LOOKUP(2,1/(MID(A2,seq,1)="\"),seq),""),".xls","")

For more general parsing tasks, you'd be better off downloading and
installing Laurent Longre's MOREFUNC.XLL add-in and using it's regular
expression functions. Using it you could extract base filenames using

=REGEX.SUBSTITUTE(A2,"^.+\\([^\\]+?)(\.[^.\\]*)?$","[1]")

Or you could use (a little shorter):

=REGEX.MID(A1,"[^\\]*(?=\.xls$)")

or, if other than xls files might be present:

=REGEX.MID(A1,"[^\\]*(?=\.\w{3}$)")


--ron
 
R

Ron Rosenfeld

Art MacNeil said:
Regardless of the path, what I need is, the string of text after the final
"\" and before the ".xls"
...

Base filenames.

Another alternative, define the name seq referring to the formula

=ROW(INDEX($1:$65536,1,1):INDEX($1:$65536,1024,1))

Then with one of these full pathnames in cell A2, try the formula

=SUBSTITUTE(REPLACE(A2,1,LOOKUP(2,1/(MID(A2,seq,1)="\"),seq),""),".xls","")

For more general parsing tasks, you'd be better off downloading and
installing Laurent Longre's MOREFUNC.XLL add-in and using it's regular
expression functions. Using it you could extract base filenames using

=REGEX.SUBSTITUTE(A2,"^.+\\([^\\]+?)(\.[^.\\]*)?$","[1]")

Or you could use (a little shorter):

=REGEX.MID(A1,"[^\\]*(?=\.xls$)")

or, if other than xls files might be present:

=REGEX.MID(A1,"[^\\]*(?=\.\w{3}$)")


--ron

The non-XLS file version regex could be augmented to something like:

[^\\]*(?=\.[^\\]*$)

or

[^\\]*(?=\.[^\\]+$)


--ron
 
A

Art MacNeil

Ron Rosenfeld said:
...
Regardless of the path, what I need is, the string of text after the
final
"\" and before the ".xls"
...

Base filenames.

Another alternative, define the name seq referring to the formula

=ROW(INDEX($1:$65536,1,1):INDEX($1:$65536,1024,1))

Then with one of these full pathnames in cell A2, try the formula

=SUBSTITUTE(REPLACE(A2,1,LOOKUP(2,1/(MID(A2,seq,1)="\"),seq),""),".xls","")

For more general parsing tasks, you'd be better off downloading and
installing Laurent Longre's MOREFUNC.XLL add-in and using it's regular
expression functions. Using it you could extract base filenames using

=REGEX.SUBSTITUTE(A2,"^.+\\([^\\]+?)(\.[^.\\]*)?$","[1]")

Or you could use (a little shorter):

=REGEX.MID(A1,"[^\\]*(?=\.xls$)")

or, if other than xls files might be present:

=REGEX.MID(A1,"[^\\]*(?=\.\w{3}$)")


--ron

The non-XLS file version regex could be augmented to something like:

[^\\]*(?=\.[^\\]*$)

or

[^\\]*(?=\.[^\\]+$)


--ron


Thank you.

I hate to say it, but I already had downloaded the MOREFUNC.XLL add-in
installed. I just didn't realize it could be used to do this.

Art.
 
H

Harlan Grove

Ron Rosenfeld said:
=REGEX.SUBSTITUTE(A2,"^.+\\([^\\]+?)(\.[^.\\]*)?$","[1]")
....
The non-XLS file version regex could be augmented to something like:

[^\\]*(?=\.[^\\]*$)

or

[^\\]*(?=\.[^\\]+$)
....

I have nastier filenames on my system, e.g., .RHistory which begins with a
period and has no other period. My REGEX.SUBSTITUTE formula returns it while
your REGEX.MID formula returns "". Either may be appropriate depending on
exactly how one wants to handle such filenames. However, there are also
filenames that have no periods whatsoever, and with those my formula returns
then while yours still returns "".

I had tried REGEX.MID with several variations on your ending assertion to
handle such filenames, but I gave up and used (what I consider to be) the
much more reliable REGEX.SUBSTITUTE.
 
R

Ron Rosenfeld

Thank you.

I hate to say it, but I already had downloaded the MOREFUNC.XLL add-in
installed. I just didn't realize it could be used to do this.

Art.

Once you get used to using Regular Expressions, you'll find that complex text
manipulations become pretty simple.

By the way, so far as I know, MOREFUNC.XLL has not been updated to work with
XL2007, so some of the functions don't work.

Also, one limitation of MOREFUNC is 255 characters. This is a limitation of
the XLL, apparently. But the Regular Expression functions can all be
duplicated in VBA, with only minor changes in the expressions.
--ron
 
R

Ron Rosenfeld

Ron Rosenfeld said:
=REGEX.SUBSTITUTE(A2,"^.+\\([^\\]+?)(\.[^.\\]*)?$","[1]")
...
The non-XLS file version regex could be augmented to something like:

[^\\]*(?=\.[^\\]*$)

or

[^\\]*(?=\.[^\\]+$)
...

I have nastier filenames on my system, e.g., .RHistory which begins with a
period and has no other period. My REGEX.SUBSTITUTE formula returns it while
your REGEX.MID formula returns "". Either may be appropriate depending on
exactly how one wants to handle such filenames. However, there are also
filenames that have no periods whatsoever, and with those my formula returns
then while yours still returns "".

I had tried REGEX.MID with several variations on your ending assertion to
handle such filenames, but I gave up and used (what I consider to be) the
much more reliable REGEX.SUBSTITUTE.

It is certainly the case that my REGEX.MID function assumes that the filename
will have a suffix starting with a "."

How do you differentiate a file name that has no suffix from a subfolder name?
I suppose you could look for a terminal "\", but you'd have to ensure that
would be in the list you're examining.
--ron
 
H

Harlan Grove

Ron Rosenfeld said:
How do you differentiate a file name that has no suffix from a subfolder
name?
I suppose you could look for a terminal "\", but you'd have to ensure that
would be in the list you're examining.

Exactly. If there's a \ to the right, you're NOT looking at a filename.
Conversely, folders CAN have periods in their names.

As for ensuring there are \s, it didn't (and still doesn't) seem relevant to
this thread, but if filenames MIGHT lack drive/directory paths, then use

=REGEX.SUBSTITUTE(A1,"^(.+\\)?([^\\]+?)(\.[^.\\]+)?$","[2]")
 
R

Ron Rosenfeld

Ron Rosenfeld said:
How do you differentiate a file name that has no suffix from a subfolder
name?
I suppose you could look for a terminal "\", but you'd have to ensure that
would be in the list you're examining.

Exactly. If there's a \ to the right, you're NOT looking at a filename.
Conversely, folders CAN have periods in their names.

As for ensuring there are \s, it didn't (and still doesn't) seem relevant to
this thread, but if filenames MIGHT lack drive/directory paths, then use

=REGEX.SUBSTITUTE(A1,"^(.+\\)?([^\\]+?)(\.[^.\\]+)?$","[2]")

OK. That makes sense. And it's been a long time since I've worked with other
file systems. As usual, the best Regex to use depends on the nature of the
data and what is to be extracted.

It is the case that "[^\\]*(?=\.[^\\]+$)" will match just the file names in all
of the examples given by the OP. And if he were only interested in .xls files,
then

=REGEX.MID(A2,"[^\\]*(?=\.xls$)")

But yours is more general, and could be easily adapted to other situations.
--ron
 
Top