I give up-- I need help Pleeeezze? Anybody up to a little challange?

M

Mctabish

I recieve data every few weeks (about 65 files, each having up to 7000 or
so) that I need to evaluate. I want to work in Excel., but the data comes in
a PDF format. I can not recieve data in anything but PDF (too simple, I
know)

The fields are fixed len, except the description field (second to last
feild) Each record wraps around 2 or three lines
I need to get each field into it's own cell. I probably could get it IF I
can figure out how to get each record into it's own row....

I have tried to save to TEXT from adobe. then opened it up under word and
tried to seperate at the records, but was not sucessful.


Here is a sample of one of the files: http://www.johnsonclan.net/sample.pdf

Any help would be greatly appreciated!

Thanks,
Mc
 
P

PY & Associates

we could not save a copy of your file, but at least we could Select Text/
copy and paste into Excel

Would this be sufficient for you please?
 
P

PY & Associates

Further to our earlier response, we note the layout of the PDF report is not
consistent. Some places where zero entries are made as ".00" whereas some
places there is no entries at all. If this inconsistence can be rectified,
we believe the report can be converted to Excel with a little help from VBA
please.

Because MUN SHE CHE MHE tax have no entry, we have to adjust position of the
data.
But we trust the following conversion is correct.

NAME JOHNSON KAREN G
CO. 1
YR. 0
C/S# 15
ST.TAX 52.78
CO.TAX 354.03
MUN.TAX
FEES 10
INTER 21.35
AD.COST 18
OTHER 15
TOTAL 471.16
ASSES.CL 2
MUN.CODE 2
A.V. 8120
SHE TAX
CHE TAX
MHE TAX
OLD PARCEL NO. 0000000000000000000000
NEW PARCEL NO. 0121001430010320000100
DESCRIPTION LOT etc
 
H

Harlan Grove

Mctabish wrote...
I recieve data every few weeks (about 65 files, each having up to 7000 or
so) that I need to evaluate. I want to work in Excel., but the data comes in
a PDF format. I can not recieve data in anything but PDF (too simple, I
know)
....

Your best choices are either convincing whoever sends you these files
to send .XLS or even plain text files rather than .PDF files or use
better tools than Excel to deal with these files.
The fields are fixed len, except the description field (second to last
feild) Each record wraps around 2 or three lines
I need to get each field into it's own cell. I probably could get it IF I
can figure out how to get each record into it's own row....

I have tried to save to TEXT from adobe. then opened it up under word and
tried to seperate at the records, but was not sucessful.
....

There's a utility program named pdf2text that used to be available on
Simtel. I haven't been able to find it there today, but it may be
available from other sites (I haven't looked). If you can't find it,
then you could consider GhostScript, which can convert PDF files to
other formats. If these files have inconsistent fields in different
records but a common record terminator or separator, you scripting
languages like Perl or Python to parse the actual fields and use
default values for any missing fields. You could also do this in Excel
once you had the text from the .PDF files, but it's more difficult in
Excel.
 
M

Mctabish

Thanks for the suggestions. Currently, it looks like PDF2TEXT works the best
to get the data over to excel. but...
Is there a way to combine all rows in a column that have data in them until
the next empty row in that column?

This code seems to work OK if I only have 2 rows of data:
=IF(LEN(TRIM(C6))>0,IF(LEN(TRIM(C7))>0,C6 & " | " & C7,""),"") (first test
the current row, if it is empty, do nothing... Then test the next row, if it
has data, combine the current row, then add the "pipe" symbol as a
seperator, the add the next row)

But there are varible rows of data for each entry (MOST are 2 rows only, but
there are a lot that have 3, and more rows)

Thanks,
Mc
 
H

Harlan Grove

Mctabish said:
Thanks for the suggestions. Currently, it looks like PDF2TEXT works
the best to get the data over to excel. but...

Good that you found pdf2text.
Is there a way to combine all rows in a column that have data in
them until the next empty row in that column?

I've looked at your PDF file. If the converted text looked like the PDF
file, there'd be blank lines between records which you could use as record
separators, but they seem to be omitted from the converted text. If that's
so, then your PDF file requires *you* manually indicating which lines begin
new records.

If this is from a governmental entity, and PDF files are all they provide,
then you're very likely out of luck.

Also the resulting text shows only a single space between all nonblank
fields. While it'd be possible to deal with blank SHE/CHE/MHE tax fields,
which precede the unmistakable parcel number fields, there's no simple way
to deal with blank MUN tax fields. The closest you could come would be
parsing out the available fields using formulas like

A1:
=LEFT(B!B1,FIND(" 1 ",B!B1)-1)

B1:
=MID(B!$B1,SUMPRODUCT(LEN($A1:A1)+1)+1,
FIND(" ",MID(B!$B1&" ",SUMPRODUCT(LEN($A1:A1)+1)+1,1024))-1)

Fill B1 right into C1:IV1. This will parse out each word or token in the
description field. Then you'd need to construct records from the fields,
making decisions about when MUN/SHE/CHE/MHE tax fields were blank. When
would these tax fields be blank? If there's no pattern to it, then you have
a mostly manual exercise on your hands.

You *may* be better off using GhostScript or some other software to convert
the PDF files to HTML if the resulting HTML files mostly preserve the PDF
file formatting.
 
B

Bruce

All of the fields "should" be fixed len, except description. I do not care
about the CL CODE after description, so I could use that as part of
description (or use right() to trim it off)
There is a blank line between each record (I have cleaned off all of the
page header info BTW), so what I want to do is to append all records that
are between two blank lines, and then I can parse for the len of each field.
I am 'trying' to make this happen with a few as possible human steps, as I
have to do this every week, on about 60 plus files, some of the files being
a few thousand records.
You are right, these files ARE from a state goverenment agency, that is why
I said that I have to work with the PDF to start with.

Oh, BTW, I was mistaken.... I could NOT find PDF2TEST, it is PDF2TXT.

Thanks,
Mc
 
H

Harlan Grove

Bruce wrote...
All of the fields "should" be fixed len, except description. I do not care
about the CL CODE after description, so I could use that as part of
description (or use right() to trim it off)
There is a blank line between each record (I have cleaned off all of the
page header info BTW), so what I want to do is to append all records that
are between two blank lines, and then I can parse for the len of each field.
I am 'trying' to make this happen with a few as possible human steps, as I
have to do this every week, on about 60 plus files, some of the files being
a few thousand records.
You are right, these files ARE from a state goverenment agency, that is why
I said that I have to work with the PDF to start with.
OK.

Oh, BTW, I was mistaken.... I could NOT find PDF2TEST, it is PDF2TXT.

There are a few products named pdf2text. I mean the one from

http://alkaline.vestris.com/download/WinNT/pdf2text.zip

which is a command line utility. That's what I've used to convert your
PDF file into a text file. The result doesn't have blank lines between
records, but I can work around that. If every record is preceded by a
blank line, then insert a column to the left of the imported text,
enter 1 in A1 (if you've stripped out the header lines, then row 1
should contain the first line of the first record). Then enter the
following formula in A2.

=--(TRIM(B1)="")

and fill down as far as needed in col A, which should evaluate to 1
when the line is the first line of a record and 0 for subsequent lines.
Then if this worksheet were named A, in another worksheet enter the
following.

A2:
=MATCH(1,A!$A:$A,0)

A3:
=MATCH(1,INDEX(A!$A:$A,A2+1):INDEX(A!$A:$A,65536),0)+A2

and fill down as far as needed. Then in B2:O2 you'd need formulas like

B2:
=TRIM(MID(INDEX(B!$B:$B,$A2),2,27))

C2:
=TRIM(MID(INDEX(sample!$B:$B,$A2),30,1))

etc., and in P2:U2 formulas like

P2:
=TRIM(MID(INDEX(sample!$B:$B,$A2+1),1,11))

Q2:
=TRIM(MID(INDEX(sample!$B:$B,$A2+1),13,10))

etc., and in V2 the formula

V2:
=TRIM(MID(INDEX(sample!$B:$B,$A2+1),93,40)
&IF(A2+2<A3,INDEX(sample!$B:$B,$A2+2),""))
 
B

Bruce

Thanks Harlan,

PDF2TXT, is win based (but seems to have a command line mode) It DOES leave
in the blank lines, and appears to have all spaces etc
Sorry for the name change.... I am having to use my son's system...
I usually can figure out code f(for the most part) but I have more questions
than answers here...

I do have a few questions


Thanks,
Mc

=--(TRIM(B1)="") -> seems to work OK
Then if this worksheet were named A, -> renamed
worksheet to "A"
in another worksheet enter the following. -> named
worksheet to B
 
H

Harlan Grove

I butchered the formulas, didn't I? Every instance of B! or sample!
should be A!, so the result worksheet formulas should be

B2 (and similar for C2:O2):
=TRIM(MID(INDEX(A!$B:$B,$A2),2,27))

P2 (and similar for Q2:U2):
=TRIM(MID(INDEX(A!$B:$B,$A2+1),1,11))

and finally V2:
=TRIM(MID(INDEX(A!$B:$B,$A2+1),93,40)&IF(A2+2<A3,INDEX(A!$B:$B,$A2+2),""))
 
B

Bruce

Thanks for all of the assistance!
I tried what you had suggested (strictly cut and paste - as I am not
familiar with the INDEX ) but it did not work for me...
Here is what I did, and it is working!

Column a -> pdf paste - the dat from PDF2TXT
Column b-> paste of row (_value_ of column "d") I use this to re assemble
after sorting on column A to remove the page headers)
Column c -> =IF(LEN(TRIM(A3))>0,1,0) - assigns a 0 or a 1 if the record
is blank or not
Column d -> row() used to keep track of rows for column b and also used in
column d to see what the next blank is, from the current row
Column e -> Test to see how many lines in each "group"
=IF(C3=1,VLOOKUP(0,C3:D$570,2,FALSE)-ROW(),"") ; 570 is how many rows I
have in may test sample)
Column f -> combined for E rows- =IF(LEN(TRIM(E5))=0,IF(E6=5,A6 & A7 & A8
& A9 & A10,IF(E6= 4, A6 & A7 & A8 & A9,IF(E6= 3, A6 & A7 & A8,IF(E6=
2, A6 & A7,A6 )))),"") - this assumes that there is a maximum of 5 rows
per entry. I saw a max of 4 in my test sample.....

I then just needed to parse using the mid() function!

Might not be the cleanest or the fastest, but it works! and it is one heck
of a lot better that what I have been doing!

Now, I need to find a way to combine about 60 or so PDFS into one or two
files (just to make this process easier....)


I am a happy Camper!

Again. Thanks for the assistance!
Mc
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top