Optimizing log parsing for speed

B

Boris Zakharin

I have a VBA method that reads log files and parses the data into an Access
database. I have been told that the speed at which my method runs now is
unacceptable, so I'm looking for any advice to speed this up.

The structure of the file is (at a high level) as follows:
- The log represents the interaction of users with the company's mail server.
- Each entry describes an event such as mail sent, mail received, etc.
- Each entry is exactly one line in length.
- Each entry has a time stamp (the date is in the file name).
- Each entry has an ID number that is guaranteed to be the same for all events
associated with one message (that is, this number is the same for the send
event and for all receive events of the same message).

Now, my current approach is as follows:
- Using Scripting.FileSystemObject, I get each file name using a For Each
construct.
- I get the date from the file name using FormatDateTime(Left(f1.Name, 2) &
"/" & Mid(f1.Name, 3, 2) & "/" & year, vbShortDate).
- I open the file using OpenAsTextStream(ForReading).
- I read each line using vbCrLf & ts.ReadLine & vbCrLf.
- From each line, I try to extract the time, sender, recipient, line ID, etc.
(even though they cannot all be in the same line at once).
- To locate information in the line (this is done by calling a separate
function), I use InStr and InStrRev to get the left and right delimiters, then
return the string produced by the Mid statement.
- Based on which information is available, I call an append query using
CurrentDb.QueryDefs and pass the information as parameters.
- If it's a send event, I first delete all previous events with the same ID,
then I add this line to the senders table.
- If it's a receive event, I first add the line to the recipients table.
Then I open a query which joins the senders table and the final table as a
recordset. If the recordset has only one entry, I pass its fields as
parameters to an append query to insert into the final table. If it has more
than one entry, I use SELECT TOP 1 (I sort on an autonumber field so that the
most recently inserted will be at the top).
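The delimiter-based extraction step could be sketched like this (a hypothetical helper; the function name and delimiter arguments are my assumptions, not the poster's actual code):

```vba
' Hypothetical sketch of the InStr/InStrRev/Mid extraction described above.
Function GetField(strLine As String, strLeft As String, strRight As String) As String
    Dim lngStart As Long
    Dim lngEnd As Long

    lngStart = InStr(strLine, strLeft)
    If lngStart = 0 Then Exit Function          ' left delimiter not found

    lngStart = lngStart + Len(strLeft)
    lngEnd = InStrRev(strLine, strRight)        ' right delimiter, searched from the end
    If lngEnd <= lngStart Then Exit Function

    GetField = Mid(strLine, lngStart, lngEnd - lngStart)
End Function
```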

There are other details that I have not gone into, but here's an example of a
query I use to join the senders and recipients tables (it also references a
numRecs table, explained below):
SELECT s.rDate, s.rTime, s.rFrom, [@rTo] AS rTo, n.nr, s.lc, [@rTime] AS
sTime
FROM senders AS s, numRecs AS n, email_data AS e
WHERE s.rFrom=e.rTo And s.lc=e.lc And s.lc=n.lc And s.rFrom=n.rFrom And
n.nr='1' And s.lc=[@lc] And e.rFrom=[@rTo];

Here numRecs has the number of recipients, which is also read from the same
line as the name of the sender, and email_data is the final table. This is
required because a reply to a message has the same ID as the original
message and is being looked up in the part of the table already created (lc
is the ID number and nr is the number of recipients).

Does this sound like there could be a better way to do this?

Thanks,
Boris Zakharin
 
A

Albert D. Kallal

You might consider dropping the file scripting object, as I would bet that
the native file open commands in ms-access are faster. However, it is
UN-LIKELY that your read speed is the problem.
I call an append query using
CurrentDb.QueryDefs and pass the information as parameters

Remember, the speed of your append query is going to be very fast. However,
opening the query, setting up the parameters and then starting the process is
VERY TIME consuming. Anytime you open a table for processing, you are kind of
starting up a helicopter. Once the helicopter is running, it can travel very
fast. However, that start-up is MANY TIMES more resource-intensive than adding
the one record. You can add hundreds of records per second with ease, but you
can only execute maybe 10 queries per second. So, keep this concept in mind
when trying to optimize your import speed.

So, in place of using some append queries and parameters, you could open a
recordset (that means the expensive and huge helicopter start-up time of
opening the table is done ONCE).

Dim rstRecs As DAO.Recordset

Set rstRecs = CurrentDb.OpenRecordset("tblResults")

.....

Do While (more data to import)

    rstRecs.AddNew
    rstRecs!id = lngId
    rstRecs!dtLogTime = dtTime
    ' ...etc.
    rstRecs.Update

Loop

rstRecs.Close

As you can see, now in our loop we have eliminated one large helicopter
start-up by not using an append query to add new records. So, if you can
eliminate a table open as above, you will improve that speed by at least 100
times (so, 100 minutes becomes 1 minute).

Unfortunately, it seems that you need to do a test select on the "id" for
each record of the log file you import (by the way, have you looked at the
Split command for parsing text? It is great at plucking out text and parsing).
Anyway, executing that one select statement via the ID is another helicopter
start-up. So, for example, you might build a list of 10, or perhaps 20 IDs.
You then make ONE select of those IDs into a recordset (with a given order).

select * from maintable where id in (34, 653, 1299)

So, build up a recordset with say 10, or 20 chunks of info, and process
that.
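For the Split suggestion above, a minimal illustration (the delimiter and sample line are assumptions about the log format, not the real one):

```vba
Dim astrParts() As String
Dim strLine As String

strLine = "12:30:05;SEND;12345;user@example.com"
astrParts = Split(strLine, ";")     ' one call plucks the line apart into fields

Debug.Print astrParts(0)            ' the time stamp
Debug.Print astrParts(2)            ' the message id
```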

You can also check out the Seek command, as that is so blindingly fast as to
spin your head off. You can only use the Seek command on a Jet file-share
table. Again, Seek is very fast, as you can keep the table open. It is a
good 1,000+ times faster than creating a recordset or using a query to
grab/test for the existence of ONE id. (Note again how I am talking about ONE
id with that large query/helicopter start-up time.) If you can grab a bunch of
records with ONE sql statement, then the amount of time for the sql start-up
is not that significant.
However, Seek is *really* fast, as it is a raw seek on the index. You can't
use Seek() on linked tables, but there is a workaround for this at the
Access Web:

http://www.mvps.org/access/tables/tbl0006.htm
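A minimal Seek sketch, assuming a local (not linked) Jet table; the table, index, and field names here are illustrative:

```vba
Dim rst As DAO.Recordset

' dbOpenTable is required for Seek, and only works on a local Jet table
Set rst = CurrentDb.OpenRecordset("tblResults", dbOpenTable)

rst.Index = "PrimaryKey"            ' must name an index that exists on the table
rst.Seek "=", lngId                 ' raw lookup on the index

If Not rst.NoMatch Then
    Debug.Print rst!dtLogTime
End If
rst.Close
```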

You also have some quite nasty joins. You need to make sure that the fields in
those joins are indexed. Further, it is quite weird to learn that in-line sql
often runs MUCH faster than the mdb stored Jet queries, since in-line sql is
forced to re-compile the sql and use the latest stats for the query plan.
This hint does NOT apply if using sql server, but it is not clear what data
engine you are using.

Also, you can use ADO, and it actually does a better job than DAO of keeping
the table open if you create a parameter query. While in general DAO is
faster, DAO (with the exception of the Seek command) does not have a way to
repeatedly grab a value from an open table via sql, whereas ADO does.
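A sketch of the ADO parameter-query idea (table and field names are illustrative): the Command is set up once, and each lookup only changes the parameter value.

```vba
Dim cmd As New ADODB.Command
Dim rs As ADODB.Recordset

Set cmd.ActiveConnection = CurrentProject.Connection
cmd.CommandText = "SELECT * FROM email_data WHERE lc = ?"
cmd.Prepared = True
cmd.Parameters.Append cmd.CreateParameter("lc", adInteger, adParamInput)

' inside the import loop: only the parameter value changes per lookup
cmd.Parameters(0).Value = lngId
Set rs = cmd.Execute
```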
 
B

Boris Zakharin

Thanks a lot

Albert D. Kallal said:
You might consider dropping the file scripting object, as I would bet that
the native file open commands in ms-access are faster. However, it is
UN-LIKELY that your read speed is the problem.

I have to read from multiple files in a directory tree. What other way is
there to easily do this?
So, in place of using some append queries and parameters, you could open a
recordset (that means the expensive and huge helicopter start-up time of
opening the table is done ONCE).

I've done this with the parts that are straightforward and am already seeing
an improvement of nearly 40%! Thanks.
You then make ONE select of those IDs into a recordset (with a given order).

select * from maintable where id in (34, 653, 1299)

So, build up a recordset with say 10, or 20 chunks of info, and process
that.

I'm not sure how feasible this is, but I'll take it under advisement.
You can also check out the Seek command, as that is so blindingly fast as to
spin your head off. You can only use the Seek command on a Jet file-share
table. Again, Seek is very fast, as you can keep the table open. It is a
good 1,000+ times faster than creating a recordset or using a query to
grab/test for the existence of ONE id.

That's the key. I need the ability to find something where the ID is a
certain value and the time is in a certain range. Therefore, I'm now using a
dynaset-type recordset and FindFirst/FindNext.
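Since FindFirst takes a full criteria string, the two-part condition can be expressed directly. A sketch (the field names lc/rTime come from the query earlier in the thread; the table choice and time range are made up):

```vba
Dim rst As DAO.Recordset
Dim strWhere As String

Set rst = CurrentDb.OpenRecordset("SELECT * FROM email_data", dbOpenDynaset)

strWhere = "lc = " & lngId & _
           " AND rTime BETWEEN #10:00:00# AND #10:05:00#"

rst.FindFirst strWhere
Do While Not rst.NoMatch
    Debug.Print rst!rFrom
    rst.FindNext strWhere
Loop
```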
You also have some quite nasty joins. You need to make sure that the fields in
those joins are indexed. Further, it is quite weird to learn that in-line sql
often runs MUCH faster than the mdb stored Jet queries, since in-line sql is
forced to re-compile the sql and use the latest stats for the query plan.
This hint does NOT apply if using sql server, but it is not clear what data
engine you are using.

I'm working with data within the Access file using the default (DAO) access
methods. What do you mean by in-line? I was previously using docmd.runsql
statements to run my queries and then switched to mdb-stored queries based
on advice from somewhere on the web. My performance went up considerably.
Also, you can use ADO, and it actually does a better job than DAO of keeping
the table open if you create a parameter query. While in general DAO is
faster, DAO (with the exception of the Seek command) does not have a way to
repeatedly grab a value from an open table via sql, whereas ADO does.

I'll take it under advisement.
 
A

Albert D. Kallal

comments in-line:

Boris Zakharin said:
Thanks a lot



I have to read from multiple files in a directory tree. What other way is
there to easily do this?

If you need to "traverse" several dirs deep, then the following code is a
really fast way to build and grab all the files from the nested dirs (the code
is short since it is recursive):

Sub dirTest()

    Dim dlist As New Collection
    Dim startDir As String
    Dim i As Integer

    startDir = "C:\access\"
    Call FillDir(startDir, dlist)

    MsgBox "there are " & dlist.Count & " files in the dir"

    ' let's print out the stuff into the debug window for a test
    For i = 1 To dlist.Count
        Debug.Print dlist(i)
    Next i

End Sub


Sub FillDir(startDir As String, dlist As Collection)

    ' build up a list of files, and then
    ' add to this list any additional
    ' folders

    Dim strTemp As String
    Dim colFolders As New Collection
    Dim vFolderName As Variant

    strTemp = Dir(startDir)

    Do While strTemp <> ""
        dlist.Add startDir & strTemp
        strTemp = Dir
    Loop

    ' now build a list of additional folders
    strTemp = Dir(startDir, vbDirectory)

    Do While strTemp <> ""
        If (strTemp <> ".") And (strTemp <> "..") Then
            ' Dir with vbDirectory returns files too, so check the attribute
            If (GetAttr(startDir & strTemp) And vbDirectory) = vbDirectory Then
                colFolders.Add strTemp
            End If
        End If
        strTemp = Dir
    Loop

    ' now process each folder (recursion)
    For Each vFolderName In colFolders
        Call FillDir(startDir & vFolderName & "\", dlist)
    Next vFolderName

End Sub


I've done this with the parts that are straightforward and am already seeing
an improvement of nearly 40%! Thanks.


I'm not sure how feasible this is, but I'll take this into advisement

Yes, the above seems a bit messy. I would actually use the Seek command in
your case. It is the solution here!
That's the key. I need the ability to find something where the ID is a
certain value and the time is in a certain range. Therefore, I'm now using a
dynaset-type recordset and FindFirst/FindNext

FindFirst is ok if the data set is small, say only a few thousand records
max. If it is larger, then use the Seek command, as it will be several
hundred times faster. The problem with FindFirst is that it not only finds
the record, but HAS to check that it is the FIRST one! (a terrible waste of
processing). We really need a "findone" command! So, FindFirst does
work, but due to its meaning, it is very slow. When FindFirst fails, it is
about 1000x faster! When it finds one, it has to CHECK if it is the first!
I'm working with data within the Access file using the default (DAO) access
methods. What do you mean by in-line? I was previously using docmd.runsql
statements to run my queries and then switched to mdb-stored queries based
on advice from somewhere on the web. My performance went up considerably.

Yes, in your case, it is likely best to stick with mdb stored queries.
However, in-line sql can actually run MUCH faster. The in-line query has a
LARGER helicopter start-up time (the query plan must be built and the sql
interpreted each time BEFORE the data starts flowing), but once the data
starts flowing, the in-line query is MUCH faster. Some in-line sql queries
execute as much as 40% faster than stored mdb queries with parameters. The
strange reason for this is that the query optimizer, with parameters, has to
make a guess as to what those parameters will be. With in-line sql, the query
optimizer has less guessing. Since the time to compile and build the query
plan is VERY small, in-line sql can be much faster. However, this
advice really only applies when you have to execute the query once in code.
For repeated queries in a loop, this advice likely does NOT apply, since then
the query compile time would become a factor. You can easily spend .1 of a
second with in-line sql to save 10 seconds on grabbing a bunch of data.
However, if you have a loop that executes 100's of times to grab one
record, then you will not save much!

However, the above is a neat piece of information to be aware of, and I have
seen this speed up some queries by rather large amounts. (The trick only works
if you are comparing it to a saved query with parameters. With no parameters,
the trick does not help.)
I'll take it under advisement.

However, remember, using Seek() is likely the best solution here, and far
better than FindFirst.
 
B

Boris Zakharin

Thanks for responding. Comments below.
If you need to "traverse" several dirs deep, then the following code is a
really fast way to build and grab all the files from the nested dirs (the code
is short since it is recursive):

I'm really not sure how much this would speed things up, as the dir parsing
takes only a small fraction of the time the SQL queries do, but I'll see if I
can implement it.
FindFirst is ok if the data set is small, say only a few thousand records
max. If it is larger, then use the Seek command, as it will be several
hundred times faster. The problem with FindFirst is that it not only finds
the record, but HAS to check that it is the FIRST one! (a terrible waste of
processing). We really need a "findone" command! So, FindFirst does
work, but due to its meaning, it is very slow. When FindFirst fails, it is
about 1000x faster! When it finds one, it has to CHECK if it is the
first!

First of all, I do need to find all occurrences (and there may be multiple),
not just "findone." Second, Seek only works on an indexed field, right? I
need to find records based on the values of two fields (at least in some
cases). Can I have a two-field index, and how would I do a seek on it?

Also, as I understand it, DAO recordsets created from joins do not get
updated when I add records to a table. Apparently, ADO ones do. Is this why
you recommended ADO in the first place? Can I mix ADO and DAO recordsets in
one sub? (I assume not)

Thanks again,
Boris Zakharin
 
A

Albert D. Kallal

Boris Zakharin said:
Thanks for responding. Comments below.


I'm really not sure how much this would speed things up as the dir parsing
is only a small fraction of the SQL queries, but I'll see if I can implement
it.

You are correct, the above will help little, if at all, in terms of
performance. I did not mean to mislead here... sorry! I was just giving a
nice way to populate one collection with the files from multiple dirs. And, if
that number of entries is quite large, the sample code is very fast... but
really, in terms of your import issue, it is a non-issue!

First of all, I do need to find all occurrences (and there may be multiple),
not just "findone."

Golly... ok. If you can order the data by that id (and it is indexed), then
you can simply loop while the id is the same. In fact, funny enough, I just
wrote this code a few days ago:


' scan list for mainid
Dim strWhere As String
Dim strClubList As String

strWhere = "main_id = " & mainID

rstInClubs.FindFirst strWhere

Do While rstInClubs.NoMatch = False

    If strClubList <> "" Then strClubList = strClubList & ","
    strClubList = strClubList & rstInClubs!ClubName

    rstInClubs.FindNext strWhere

Loop

In the above, I did use FindNext, but perhaps just using .MoveNext and
testing might run faster (I have not tested this idea).

I would bet that even if no index is present, simply having an ORDER BY
clause for that recordset would help a lot.
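The MoveNext idea might look like this (an untested sketch, per the caveat above; it assumes the recordset is ordered by main_id so that the matching rows are contiguous):

```vba
rstInClubs.FindFirst "main_id = " & mainID

Do While Not rstInClubs.NoMatch
    ' ... collect rstInClubs!ClubName here ...
    rstInClubs.MoveNext
    If rstInClubs.EOF Then Exit Do
    If rstInClubs!main_id <> mainID Then Exit Do    ' sorted, so no more matches
Loop
```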

Second, Seek only works on an indexed field, right? I
need to find records based on the values of two fields (at least in some
cases). Can I have a two-field index, and how would I do a seek on it?

Yes. You can use Seek with two fields. This does mean you need to create an
index that has BOTH fields. When you create an index (or even a primary key)
in table design mode, you can whack the Indexes button, and you can add more
than one field to a given index name. However, if you DO USE a multi-field
index, then you HAVE to supply both keys for the Seek command. So, if you
don't always have two index key fields to search on, then using Seek might
not be worth the trouble.
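With a compound index, the Seek gets one value per indexed field. A sketch (the index name, and the assumption that it covers (lc, rTime), are mine):

```vba
Dim rst As DAO.Recordset

Set rst = CurrentDb.OpenRecordset("email_data", dbOpenTable)
rst.Index = "idxIdTime"             ' assumed compound index on (lc, rTime)
rst.Seek "=", lngId, dtTime         ' a value for BOTH fields must be supplied

If Not rst.NoMatch Then
    Debug.Print rst!rFrom
End If
rst.Close
```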

Also, as I understand it, DAO recordsets created from joins do not get
updated when I add records to a table. Apparently, ADO ones do. Is this why
you recommended ADO in the first place?

No, I only mentioned ADO because in some cases it can keep the recordset open,
but use the index BETTER than FindFirst in DAO (this is mostly because of
the *different* meanings that FindFirst and Find have). Hence, you can
repeatedly give an ADO recordset an id to find, and there is less helicopter
start-up time in some cases (that is, using a query with a parameter).
So, I ONLY mentioned ADO due to this ability. In most other situations, our
old friend DAO is faster, and also more native to Jet.

And, FindFirst in DAO can have multiple conditions (ADO does NOT allow
this with Find... so the workaround is to use a filter). So, the ADO
Find is not that great, as it only allows ONE field in the condition (once
again, this likely means that ADO will not help you, since you have
multi-field conditions).

Also, the issue of recordsets getting updated? Hum, in either case you have to
do a .Requery on the recordset to reflect newly added records. Again, not
a DAO vs ADO issue.
Can I mix ADO and DAO recordsets in
one sub? (I assume not)

Yes, actually you can do that. I don't recommend it due to some sync
problems... but you can certainly load up ADO and DAO recordsets in the same
sub at the same time.

The only thing I am suggesting here is to try an ADO query with a parameter
(one, or more if need be) in the case where you have repeated "hits" on a
table. Often, it does a MUCH better job than DAO. This is because ADO can
keep much of the helicopter running in some cases.
 
