Hi Ron. Thanks for that. However I have a few more questions as follows:
1. How can I get it to include e-mail addresses which contain the symbols
'@' and '.'?
That depends. It is simple to just add the '@' and '.' to the list of valid
characters in stripword. But that would cause problems when those characters
are not part of a valid email address. At present, StripWord strips out what
you originally stated would be invalid characters. However, to include email
addresses, requires a different approach including validating words and email
addresses. Before I get into that, are there any other requirements that you
have yet to specify?
Below is the modified routine. The major modification is in StripWord which
has been modified to recognize many (but not all) valid email addresses; as
well as any words that contain at least 4 characters. The characters in these
words need to be letters, digits or hyphens. (The email address is made a
valid exception to that rule).
There are some cosmetic modifications in the major routine also. One cosmetic
modification I did not include was to have all the output be lower case. But I
have an annotation for where that can be done.
2. When I tested it with some data that included mixed case words, I got
invalid results. Specifically, three cells with "Seven seven SEVEN", "SeVeN
sEvEn", and ""Seven seven SEVEN"" returned a word count of only 2 for the
word 'seven'. Why would this be?
As written, the code SHOULD be case-insensitive. Are you using the appropriate
code? The first I presented had to be changed to make it case insensitive.
When I put those three phrases in A1:A3, and run the code, I get a result of:
Seven 8
Another reason for your problem is that rSrc is not properly specified so that
the routine is not checking the cells where you have those strings.
3. For the previously mentioned invalid results, when a word such as
'(e-mail address removed)' was stripped of non-letter characters, it also got counted as 0
instances of the word.
That is because there is no word = "meherecom". In my first response to you, I
wrote "words are defined as strings containing only letters, digits, slash or a
hyphen. This is done in order to remove punctuation. But it will also remove
other substrings that might include other characters. If this will be an issue,
changes can be easily made."
But this is the first you have indicated that your definition of "word" does
not match the one I stated I was using. So unexpected results are ...
"expected"
4. I am trying to understand your code, and the statements 'Set re =
CreateObject("vbscript.regexp")' and 'set mc = re.Execute' are confusing to
me. Can you explain them please (or point me to another resource that
explains them)?
This is part of the VBScript Regular Expression Engine, which is similar to
that specified in the ECMA262 specification.
Here are some bookmarks (most of which should still be good):
Regular Expressions
http://www.regular-expressions.info/reference.html
http://support.microsoft.com/default.aspx?scid=kb;en-us;818802&Product=vbb
http://msdn2.microsoft.com/en-us/library/6wzad2b2.aspx
http://msdn2.microsoft.com/en-us/library/ms974619.aspx
http://www.regex-guru.info/
================================================
Option Explicit
Sub UniqueWordList()
Dim rSrc As Range, rDest As Range, c As Range
Dim cWordList As Collection
Dim res() As Variant
Dim w() As String
Dim i As Long
Set cWordList = New Collection
Set rSrc = Range("A1:B50")
Set rDest = Range("M1")
rDest.EntireColumn.NumberFormat = "@"
For Each c In rSrc
w = Split(c.Value)
For i = 0 To UBound(w)
w(i) = StripWord(w(i))
If Not w(i) = "" Then
On Error Resume Next
cWordList.Add Item:=w(i), Key:=w(i)
On Error GoTo 0
End If
Next i
Next c
'transfer words to results array
ReDim res(1 To cWordList.Count, 0 To 1)
For i = 1 To cWordList.Count
res(i, 0) = cWordList(i)
Next i
'get counts
For i = LBound(res) To UBound(res)
For Each c In rSrc
res(i, 1) = res(i, 1) + CountWord(c.Value, res(i, 0))
Next c
Next i
'sort alpha: d=0; sort numeric d=1
'there are various ways of sorting
BubbleSort res, 1
rDest.CurrentRegion.Clear
For i = LBound(res) To UBound(res)
rDest.Offset(i, 0).NumberFormat = "@"
rDest.Offset(i, 0).Value = res(i, 0)
'For just lowercase output, use:
'rDest.Offset(i, 0).Value = LCase(res(i, 0))
rDest.Offset(i, 1).Value = res(i, 1)
Next i
End Sub
Private Function StripWord(s As String) As String
Dim re As Object, mc As Object, m As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
re.Pattern = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?" _
& "^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]" _
& "(?:[a-z0-9-]*[a-z0-9])?|[-\w]{4,}"
If re.test(s) = True Then
Set mc = re.Execute(s)
StripWord = mc(0).Value
End If
Set re = Nothing
End Function
Private Function CountWord(ByVal s As String, sPat) As Long
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.IgnoreCase = True
re.Pattern = "\b" & sPat & "\b"
Set mc = re.Execute(s)
CountWord = mc.Count
End Function
Private Sub BubbleSort(TempArray As Variant, d As Long) 'd is 0 based dimension
Dim temp(0, 1) As Variant
Dim i As Integer
Dim NoExchanges As Integer
' Loop until no more "exchanges" are made.
Do
NoExchanges = True
' Loop through each element in the array.
For i = LBound(TempArray) To UBound(TempArray) - 1
' If the element is less than the element
' following it, exchange the two elements.
' change "<" to ">" to sort ascending
If TempArray(i, d) < TempArray(i + 1, d) Then
NoExchanges = False
temp(0, 0) = TempArray(i, 0)
temp(0, 1) = TempArray(i, 1)
TempArray(i, 0) = TempArray(i + 1, 0)
TempArray(i, 1) = TempArray(i + 1, 1)
TempArray(i + 1, 0) = temp(0, 0)
TempArray(i + 1, 1) = temp(0, 1)
End If
Next i
Loop While Not (NoExchanges)
End Sub
==========================================
--ron