Cleaning up Long Word Doc

Rick Gregory

I have a 200+ page document that was created in Pagemaker, distilled to PDF,
and now exported to .rtf for editing in Word (Mac Word 2004).

Among other challenges, it seems the content that was once continuous text
in paragraph form now has every line broken with hard breaks, with no
automatic text wrapping.

As I clean this document up (a client wants it for translation to multiple
languages), can anyone recommend a way to eliminate all the unwanted hard
line breaks (without also eliminating the hard breaks that SHOULD be there)?

I've been playing with search/replace wildcard combos, but no luck yet.

Thank you!
 
Elliott Roper

If you have a paragraph mark at the end of every line and two at the end
of each real paragraph, it's easy.

Select all
find and replace
(wildcards off)
find ^p^p
replace with /\ (A placeholder not in the text)
replace all
find ^p
replace with a single space
replace all
and finally, turn all those /\'s back into ^p

If you want to be fussy, clean up possible multiple spaces too:
find <space>^w
replace with <space>
(Of course, <space> stands for a space character.)

If the hard breaks are manual breaks, substitute ^l for ^p (That's a
lower case L)

Try it on a copy. It might fry styles where paras in different style
were merged into one.
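
For anyone who would rather run the same three passes on a plain-text export outside Word, here is a minimal sketch in Python. It assumes every line ends with one newline and every real paragraph with two; the function name `unwrap_paragraphs` is made up for illustration, and the placeholder is the same /\ used in the steps above.

```python
import re

PLACEHOLDER = "/\\"  # the /\ placeholder from the steps above; must not occur in the text


def unwrap_paragraphs(text: str, placeholder: str = PLACEHOLDER) -> str:
    """Join hard-wrapped lines, keeping a break wherever two newlines marked a real paragraph."""
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text; pick another one")
    text = text.replace("\n\n", placeholder)  # pass 1: protect the real paragraph ends
    text = text.replace("\n", " ")            # pass 2: every remaining break is a line end
    text = re.sub(r" [ \t]+", " ", text)      # optional: collapse a space plus more white space
    return text.replace(placeholder, "\n")    # pass 3: put the real paragraph ends back


if __name__ == "__main__":
    sample = "one\ntwo\n\nthree\nfour"
    print(unwrap_paragraphs(sample))
```

As with the Word version, try it on a copy first: it only works if the doubled breaks really are there.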

Better yet, if you do lots of that work, make a few cleanup
macros that do variations on the above in one keystroke.

For instance, here's one that works only on the selection, making one
paragraph out of it.

Sub One_Paragraph()
'
' One_Paragraph Macro
' Macro recorded 21-10-2002 by Elliott Roper
'
' Joins the selected lines into a single paragraph.
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    ' Pass 1: turn every paragraph mark in the selection into a space.
    With Selection.Find
        .Text = "^p"
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindStop ' stop at the end of the selection
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    ' Pass 2: collapse a space followed by white space into one space.
    With Selection.Find
        .Text = " ^w"
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

These things are easy to make, just record the macro and tidy it up a
bit afterward.
 
Rick Gregory

Thanks for the response. Oh, if it were that simple. There are not in fact
two para breaks at the end of real paragraphs, so I can't seem to find a way
to distinguish continuous lines vs. lines that should break.
 
Rick Gregory

Couldn't you have saved a step by exporting directly from Pagemaker as rich
text?
Thanks, but unfortunately, the source material is in 25+ PM docs, each of
which has multiple stories. I'd have to export the rtf text from each story
in each doc and then reassemble...!
 
Elliott Roper

If there is no way other than reading the document for sense, then
that's what you will have to do to break the paras. Use my macro every
time you select a paragraph and it should go smoothly enough.

Although if I were faced with that task, I think I'd make a first pass
adding a second para mark at the end of each real paragraph, then using
the first technique to clean it all up at the end.

Do you have hard copy of the originals? Scan and OCR?

What about Adobe Acrobat Professional? You might be able to get it to
export the text in a more organised way.
 
John McGhie [MVP - Word and Word Macintosh]

Rick:

The problem with taking it out to PDF is that you lost the internal
structure: PDF is a page description language that saves only the
positioning information.

What I would do is save the whole lot back to plain text, getting rid of the
formatting entirely. You can then use a Find/Replace to add the paragraph
marks: change each instance of two paragraph marks to some tag such as
{myTag}, then change each single paragraph mark (the line-ends) to a space.
Then search for {myTag} and change them all back to paragraph marks.

Now set yourself up some styles and run through and reformat the whole
thing. If you set the styles up on a toolbar, you can totally reformat a
200-page document from plain text in about an hour.

That will give you the fastest, cleanest solution.

Cheers

--

Please reply to the newsgroup to maintain the thread. Please do not email
me unless I ask you to.

John McGhie <[email protected]>
Microsoft MVP, Word and Word for Macintosh. Consultant Technical Writer
Sydney, Australia +61 4 1209 1410
 
Jim Lange

Rick says:
"Oh, if it were that simple. There are not in fact two para breaks at the
end of real paragraphs..."

I work on similar size documents daily. My solution, although not completely
foolproof, is this:
Find and Replace

Find what: .^p
Replace with: .zzz [or any other exclusive placeholder]
Replace All

Find what: ^p
Replace with: [single space]
Replace All

Find what: .zzz
Replace with: ^p
Replace All

The rationale is that 99% of all true paragraphs end with a period. So, this
search parameter has a very high probability of doing what you need. Yes,
you get a few false positives from mid-paragraph sentences whose periods
just happen to be at the end of a line, but they are rare and easy to clean
up when applying styles to reflect the original layout.
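
The same heuristic can be sketched on a plain-text export in Python. This is only an illustration under assumptions: each line ends with a single newline, the function name `unwrap_by_period` is invented here, and .zzz is the placeholder from the steps above. The sketch puts the period back along with the break in the last pass, so it is not lost.

```python
def unwrap_by_period(text: str, placeholder: str = ".zzz") -> str:
    """Treat a period at the end of a line as the end of a real paragraph; join all other lines."""
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text; pick another one")
    text = text.replace(".\n", placeholder)  # pass 1: protect breaks that follow a period
    text = text.replace("\n", " ")           # pass 2: every other break becomes a space
    return text.replace(placeholder, ".\n")  # pass 3: restore the period and the break


if __name__ == "__main__":
    sample = "The cat sat on\nthe mat.\nA new paragraph\nstarts here."
    print(unwrap_by_period(sample))
```

Like the Find and Replace version, it will split a paragraph wherever a mid-paragraph sentence happens to end at a line end, so the result still needs a read-through.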

Imported bullet points are another story; sometimes Word has no clue about
what character or symbol the bullet is, and most times it won't even find
one that you've copied from the document and pasted into Find and Replace.


Jim Lange
Sparkling Clearwater, Fla.
 
John McGhie [MVP - Word and Word Macintosh]

Hi Jim:

{Giggle} Just in case you're interested, Word knows exactly which character
a bullet is. Regrettably, those characters are not part of the document
text: that's why you can't "Find" them.

Bullets and numbering are generated at print or display time from paragraph
properties. They do not exist as characters.

You can find them by searching for the List Bullet or List Number Style (if
that's what you used to apply the bullets or numbering).

Cheers


 
Jim Lange

John, thanks for the tip; it's always good that you try to expand our
knowledge. ;>)

However, to your comment, I referred specifically to "imported bullets" such
as those non-paragraph-propertied ones brought in from PDFs, copy-and-pasted
HTML, and the like. Some of the imported character strings I've dealt with
included grayed-out symbols that, when highlighted and copied from the text,
could be pasted into the "Find what:" box, yet not successfully found. Next
time I run into some, I'll send them to you for study.

Also, a correction is necessary to the solution I presented. The "Replace
with:" box in the step below should have had a period in the replace
parameter, as in this:

Find what: .zzz
Replace with: .^p
Replace All

Despite that, I've not received any feedback as to the viability of the
solution. Rick? Any comments? Does this work for you?

Also note that some cultures regard a giggle associated with a comment as
condescending, mocking or disrespectful, but I know that's neither your
intention nor style.

Jim




 
John McGhie [MVP - Word and Word Macintosh]

Hi Jim:

Ah hah! Sorry: My bad, not reading closely enough. I do apologise...

Yes, if the bullet characters are in high Unicode, you can't paste them,
because the clipboard pastes into that field in plain text, which can't
handle anything that's not in the Mac International character set.

Select precisely one character and run this macro to find out what it is...

Sub Main()
'
' Charcode Macro
' Macro recorded 8/06/00 by John McGhie
'
' Shows the decimal character code of the first selected character.
    Dim charnum As Long
    charnum = AscW(Selection.Text)
    MsgBox Str(charnum)

End Sub

(Actually, if you select more than one character, you will get the character
number of the "first" character in the selection).

The answer comes back in Decimal. Turn wildcards ON, then search for
"^nnnn" where nnnn is the decimal number of the character. If you get text
from WordPerfect or really early versions of Word, the bullet will be in a
SYMBOL field. You can't find the content of the field, but you can find the
field itself: "^d SYMBOL" should get it for you.
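
If the text is already out of Word as plain text, the same lookup can be sketched in Python. The function names `char_code` and `find_code` are hypothetical; the sketch assumes the character sits in the Basic Multilingual Plane, where Python's code point and the number the macro reports should agree.

```python
def char_code(selection: str) -> int:
    """Return the decimal code of the first character in the selection."""
    if not selection:
        raise ValueError("nothing selected")
    return ord(selection[0])  # only the first character counts, as with the macro


def find_code(selection: str) -> str:
    """Build the ^nnnn search string described above for that character."""
    return f"^{char_code(selection)}"


if __name__ == "__main__":
    bullet = "\u2022"  # a common bullet character, used here only as an example
    print(char_code(bullet), find_code(bullet))
```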

Cheers

 
