regex alternative to nested functions

S

sarah

I am trying to write a formula that takes the contents of a cell,
removes a bunch of characters and prints it in another cell.

For example,

Take: Jim's Roofing..."The best roofing in town!"

Remove certain characters: ' . " !

And return: Jims Roofing The best roofing in town

There are a lot of characters I want to remove and some characters that
I want to replace with other text.

Basically, I have too many SUBSTITUTE fuctions to use in one formula.
VB is not my native language, so I am sorry if this is a stupid
question. Some languages allow you to use regular expressions so that
instead of having a separate SUBSTITUTE function for each character I
want to remove, I can use one.

Is this possible in VB too?
 
D

Daniel.M

Hi Sarah,

You don't need RegEx for what you're doing.
I wrote MultiSubstitute a while back.
If ByChars argument is shorter than ToBeReplacedChars, the 'extra'
characters in ToBeReplacedChars will be eliminated.

Example:
=MultiSubstitute("Hello Mommy","elm","ab") will produce: "Habbo Moy"

Notice the "m" are eliminated because, being at the end of elm, they don't
have any corresponding letter in "ab".

' ===========================
'MultiSubstitute
'In an InputStr, replace all chars belonging to ToBeReplacedChars
'by the corresponding chars in ByChars
'ToBeReplacedChars and ByChars should be of the same length
'or else it removes the found chars
'Returns the string with the characters replaced (or removed)
'By Daniel M.
Function MultiSubstitute(InputStr As String, _
ToBeReplacedChars As String, ByChars As String) As String

Dim s As String, i As Integer
Dim anOffset As Integer, len_Input As Integer, len_ByChars As Integer

len_Input = Len(InputStr)
len_ByChars = Len(ByChars)
For i = 1 To len_Input
s = mid(InputStr, i, 1)
anOffset = InStr(ToBeReplacedChars, s)
If anOffset > 0 Then
If anOffset <= len_ByChars Then
MultiSubstitute = MultiSubstitute & _
mid(ByChars, anOffset, 1)
End If
Else
MultiSubstitute = MultiSubstitute & s
End If
Next

End Function
' ===========================

Regards,

Daniel M.
 
H

Harlan Grove

You don't need RegEx for what you're doing.

No. But she doesn't *need* anything beyond what Excel worksheet functions
provide. As written in her example,

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE("Jim's Roofing...""The best roofing
in town!""","'",""),".",""),"""",""),"!","")

would suffice. As for your example,

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE("Hello Mommy","e","a"),"l","b"),"m","")

would suffice. Granted nested SUBSTITUTE calls don't nest well, but if there are
8 or fewer characters to replace, nested SUBSTITUTE calls would be faster than
udfs.

I suspect the OP's using an overly simplified example. I'd guess only letters
and whitespace should remain. While it'd be possible to feed your UDF a string
of all the other characters as 2nd argument and "" as 3rd argument, that's be
much more difficult than deleting all characters in the class [^A-Za-z \t].
Regexps may not be necessary, strictly speaking, but they wipe the floor with
all other text matching approaches in terms of economy of expression.

A udf is also required to make use of the RegExp object provided by VBScript. If
the OP has Internet Explorer 5.0 or later installed, then VBScript 5 or higher
was installed with it. If not, it's a free download from Microsoft's web site.

http://msdn.microsoft.com/library/default.asp?url=/downloads/list/webdev.asp

Once installed, the udf below could be used in a formula like

=Subst("Jim's Roofing...""The best roofing in town!""","[^A-Za-z \t]+","")

Note that this doesn't give a space between the words Roofing and The. To
replace multiple chars with one space and delete single chars, try

=Subst(Subst("Jim's Roofing...""The best roofing in town!""",
"[^A-Za-z \t]{2,}"," "),"[^A-Za-z \t]","")

This wrinkle would pose nontrivial problems using MultiSubstitute if *single*
unwanted chars should be deleted while *multiple* unwanted chars should be
replaced by a single space. There's a place for transliteration, but it's not a
replacement for regular expressions.


'Similar to Excel's SUBSTITUTE but using VBScript's perl-like regexps
'Required:
' orig_text - string to search through,
' match_pat - regexp to find,
' replace_pat - replacement pattern
'Optional:
' instance - which matched substring to replace or 0 for all (default)
'-----------------------------------------
Function Subst(orig_text As String, _
match_pat As String, _
replace_pat As String, _
Optional instance As Variant) As Variant
'-----------------------------------------
Dim regex As Object, matches As Object, m As Object

If IsMissing(instance) Then
instance = 0#

ElseIf TypeName(instance) <> "Double" Then
Subst = CVErr(xlErrValue) 'invalid instance type
instance = -1#

ElseIf CDbl(instance) <= 0.5 Then
Subst = CVErr(xlErrNum) 'invalid instance value
instance = -1#

Else
instance = Int(instance + 0.5)

End If

If instance = -1# Then Exit Function 'do nothing quickly

Set regex = CreateObject("vbscript.regexp")
regex.Pattern = match_pat
regex.Global = True

If instance = 0# Then
Subst = regex.Replace(orig_text, replace_pat)

Else
Set matches = regex.Execute(orig_text)

If instance > matches.Count Then
Subst = orig_text 'matchnum out of bounds - do nothing

Else
Set m = matches.Item(instance - 1)
Subst = Left(orig_text, m.FirstIndex) & _
regex.Replace(m.Value, replace_pat) & _
Right(orig_text, Len(orig_text) - m.FirstIndex - m.Length)

End If

End If

End Function
 
D

Daniel.M

Hi Harlan,
...

No. But she doesn't *need* anything beyond what Excel worksheet functions
provide. ...

snip

... Granted nested SUBSTITUTE calls don't nest well, but if there are
8 or fewer characters to replace, nested SUBSTITUTE calls would be faster than
udfs.

I suspect the OP's using an overly simplified example. I'd guess only
letters and whitespace should remain.

I discarted the RegExp quite fast. Maybe too fast. :)
But my take on this was the OP may need to replace/substitute more than 8
different characters. I think MultiSubstitute is adequate for the OP's
request as stated.

As for regular expressions, the link and Subst() UDF you provided will
enable the OP to use those powerful tools if she needs/wants it. I would
also suggest she downloads the help file related to Windows Scripting
Documentation (for v.5.6, it's SCRIPT56.CHM) if she wants to be sure about
the pattern metacharacters the regular expression engine is using.

Regards,

Daniel M.
 
D

Daniel.M

Hi Harlan,
Then show how you'd use it to convert the OP's sample text,
Jim's Roofing..."The best roofing in town!"

into the OP's expected result,
Jims Roofing The best roofing in town
noting that in the result there's no space between Jim and s but there's
one space between Roofing and The. I realize it could be done
using SUBSTITUTE to replace the ellipsis with a space, then
using MultiSubstitute to delete remaining punctuation.
MultiSubstitute calls alone won't do it.

Ok. I now see what you meant.
I didn't pick up the 'special case' about treatment of words to prevent
'word collisions'.

As with string processing in general, there are a lot of cases involved. it
would be appreciated to know the OP's choices here.

Regards,

Daniel M.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top