Ridiculously large save files in Word 2008

mattbacon

Version: 2008
Operating System: Mac OS X 10.5 (Leopard)
Processor: intel

I had two 5/6 page documents, text with a single logo image, sent to me. After opening them, making a few changes and saving them as RTF in Word 2008, they were 28 MB and 19MB respectively. I then reopened them in Word 2004, pasted the text into a new document, saved the logo as a BMP and reinserted it, and saved them. As Word .doc files, they were just under 200K each; as RTF about 300K. What on earth is going on? This is definitely a bug, not a feature! (there's no way that an incremental change history on a document containing 7848 CHARACTERS adds up to 27.8 MB...)

best regards,
Matt
 
John McGhie

Hi Matt:

I suspect you have at least one full-page bitmap in there. A bitmap is any
raster graphic: BMP, JPG, PNG, TIFF etc.

A full-page full-colour bitmap is about 24 MB at print resolution.
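
As a rough check on that figure, here is a minimal sketch of the arithmetic. The page size (US Letter) and the 300 dpi value for "print resolution" are assumptions chosen for illustration, and the image is treated as uncompressed 24-bit colour:

```python
def raw_bitmap_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a raster image of the given printed size."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Assumed: US Letter (8.5 x 11 in) at 300 dpi, 24-bit RGB.
size = raw_bitmap_bytes(8.5, 11, 300, 24)
print(size)                         # 25245000
print(f"{size / 2**20:.1f} MiB")    # 24.1 MiB
```

An A4 page at the same settings comes out slightly larger, so "about 24 MB" is a fair rule of thumb either way.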

When a document is created on a PC, the graphic is saved as the original (BMP
format in your case) and as PNG (the native format on the PC version).

When you get the document and open it on Mac Word, it converts the graphic
to its own native format (PICT) but leaves the other two in there. This
produces really impressive file-bloat very quickly :)

Now, if the logo image is at publishing resolution (900 or 1200 dpi) it is
already huge. If it appears in the running header or footer, it is stored
in the document only once. If it appears in the "text" portion of the
document, a new copy is stored for each page. Things get a bit over-weight!

Saving a file in RTF is not helping: RTF is at least twice the size of the
.doc format, and eight times the size of the .docx format (depending on what
is in the file).
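
Part of the reason is how RTF usually stores pictures: by default the picture bytes are written out as hexadecimal text, two characters per byte, so embedded image data roughly doubles before anything else is counted. A minimal sketch of that effect, using random stand-in bytes rather than a real image:

```python
import binascii
import os

# Stand-in for 100,000 bytes of binary picture data (hypothetical, random).
picture = os.urandom(100_000)

# Hex text uses two ASCII characters for every byte of input.
as_hex_text = binascii.hexlify(picture)

print(len(picture), len(as_hex_text))  # 100000 200000
```

This only illustrates the encoding overhead; it does not account for any extra copies of the image that Word may also write into the file.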

But it may be neither a bug nor a feature. More likely it is a design
limitation of the old file formats, coupled with someone along the way
mishandling or misunderstanding the inclusion of computer graphics.

Or: It could be a corrupt document, full of "stranded RTF".

So: the first thing we need to do is find out what really is the problem.
Please make a COPY of the document, and remove ALL the graphics, then save
it under a new file name. Close Word, then check the file size. If it was
the graphics, and the document is otherwise undamaged, it will have shrunk
to around 200-300 KB.

If the file did not shrink much, the document contains "stranded RTF" and we
need to Maggie it.

Stranded RTF is content that is no longer required in the document (usually:
pictures) that has become disconnected and so cannot be found to delete.
In a binary-format (.doc) document, all of the "non-text" components are
stored in containers at the end of the file. They are linked in to the text
by "pointers". If the pointers get damaged, Word cannot find the content so
when the graphic is deleted from the text, the image file can't be removed
from the document. Now, THAT is a bug! But it is so old now that we're
used to it. It's a design limitation of the old .doc and .rtf formats, and
was part of the reason for the adoption of the new file formats.

The Maggie:

1. Create a new blank document
2. Carefully select all of the text in the original document EXCEPT the last
paragraph mark
3. Copy it.
4. Paste in the new document.
5. Save under a new file name and close all, then re-open.

This technique for de-corrupting is known as "Doing a 'Maggie'", after
Margaret Secara from the Word PC-L mailing list who first publicised the
technique.

Now save again, and check the file size. That might be enough to get you
out of trouble. If the file is down to a reasonable size, stop here and
quit while you are ahead!

If not, we need to ask some careful questions about the intended use of the
document, because we need to re-size those graphics.

* Is the document for on-screen display only? 96 dpi.
* Is the document for printing on normal office colour printers? 150 dpi
and 24-bit RGB colour.
* Is the document for printing on normal office black-and-white printers?
150 dpi 8-bit grey scale.
* Is the document for printing on commercial full-colour equipment? Sorry:
it's going to be "big".

Fire up your graphics editor, and obtain the original of the logo graphic.

I am going to assume that the logo is "geometric shapes and letters" rather
than a continuous-tone photo.

So the next thing we do is switch the graphics format from BMP to PNG. For a
logo, PNG can be around 1/20th of the size of BMP, because it compresses the
image losslessly: resolution and colour are preserved. On a logo, you won't
see any difference, but the file size will be dramatically smaller. If the
logo included a photo, we would choose JPG. It offers a similar size
reduction, but at the expense of sharpness and detail. OK in a photo, not
good in a logo. We could use Compressed TIFF instead of PNG but that can be
troublesome. Or if the logo has few colours (256 or fewer) we could use GIF.
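
To see why a compressed, lossless format does so well on a flat-colour logo, here is a rough sketch using Python's zlib module, which implements DEFLATE, the compression method PNG uses. The "logo" is a synthetic 378 x 378 image in two flat colours, made up purely for illustration. This is not a real PNG encoder, and real savings depend on the image:

```python
import zlib

WIDTH = HEIGHT = 378  # hypothetical logo size in pixels
WHITE = bytes([255, 255, 255])
BLUE = bytes([0, 0, 255])

# Build raw 24-bit RGB rows: a blue square in the middle of a white field.
rows = []
for y in range(HEIGHT):
    if 100 <= y < 278:
        row = WHITE * 100 + BLUE * 178 + WHITE * 100
    else:
        row = WHITE * WIDTH
    rows.append(row)
raw = b"".join(rows)

compressed = zlib.compress(raw, 9)

# Lossless: decompressing gives back exactly the original pixels.
assert zlib.decompress(compressed) == raw
print(len(raw), len(compressed))
```

Large runs of identical pixels compress extremely well, which is exactly what a logo tends to contain and a photo does not.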

Next, we want to reduce the size and the resolution to suit our purpose. We
do the size first, because the resizing algorithm works best if it has spare
pixels to play with.

Measure the height and width of the printed image you want. Now: here is a
trap for young players that I have seen catch a few people out: they wanted
a logo at the top of the page and an address block at the bottom. They
produced it as a single full-page graphic, not realizing that in computer
graphics, "transparent" is a colour too, and all that "space" in the middle
of the page was taking up the same amount of space on disk as if the page
had been completely covered in a photo! If your logo is like that, cut it
into "top" and "bottom" sections, including only the VISIBLE elements in
each.
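
A minimal sketch of what that trap costs, assuming an uncompressed 32-bit image (RGB plus a transparency channel), a US Letter page, and strip heights that are made up for illustration:

```python
def rgba_bytes(width_in, height_in, dpi):
    """Uncompressed size of a 32-bit RGBA image (8 bits each for R, G, B, alpha)."""
    return round(width_in * dpi) * round(height_in * dpi) * 4

DPI = 150  # office-printing resolution from the list above

# One full-page graphic versus two strips holding only the visible parts.
full_page = rgba_bytes(8.5, 11, DPI)
top_strip = rgba_bytes(8.5, 1.5, DPI)      # assumed 1.5 in logo strip
bottom_strip = rgba_bytes(8.5, 1.0, DPI)   # assumed 1 in address strip

print(full_page, top_strip + bottom_strip)  # 8415000 1912500
```

Under these assumptions the two strips need less than a quarter of the space of the single full-page image. Compressed formats narrow the gap, but the empty middle of the page is never free.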

Use the graphic editor's "resize" command to adjust the picture (or each
part, if there are two) to the height and width you want them to have in the
finished result. Save these as "copies" -- don't lose the original file
because you may change your mind and go back a couple of times...

Once the size is correct, we adjust the resolution. For on-screen display,
set 96 dpi. Screens can't display any more than this, so there's no point
in having any more in the document, it's only going to be omitted from the
result.

For office printing in black and white, we want 150 dpi, but we want the
image in grey-scale (eight bits per pixel).

For office printing in colour, we still need only 150 dpi, but we want the
colour as RGB format -- 24 bits per pixel. However, you may be able to
"cheat": your graphics software will show you how many colours you can get
at each colour depth (bits per pixel). If you know the logo has only four
colours (and many of them do...) you can save the file as 4 bits per pixel,
and it will be smaller than the black-and-white version. If your designer
went mad with PhotoShop, sorry, you will have to use 24 bits per pixel (and
a different designer next time...)
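
The relationship between colour count and bits per pixel can be sketched as follows. The 600 x 300 pixel logo is a hypothetical example, and the sizes are for uncompressed pixel data only (no file headers or row padding):

```python
def min_bits_per_pixel(colour_count):
    """Smallest bit depth whose palette can hold colour_count distinct colours."""
    return max(1, (colour_count - 1).bit_length())

def raw_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed pixel data size in bytes (ignores headers and row padding)."""
    return width_px * height_px * bits_per_pixel // 8

W, H = 600, 300  # hypothetical logo dimensions in pixels
print(min_bits_per_pixel(4))   # 2
print(raw_bytes(W, H, 24))     # 540000, full RGB
print(raw_bytes(W, H, 8))      # 180000, 8-bit grey scale
print(raw_bytes(W, H, 4))      # 90000, 16-colour palette
```

Four colours strictly need only 2 bits per pixel, but some formats and editors offer only certain palette depths such as 1, 4 or 8 bits, so 4 bits per pixel is often the practical choice. It is still half the size of the grey-scale version.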

Preparing a document for commercial colour printing is where the file-size
pain really happens. For commercial printing you need a high resolution,
and a high colour depth. You would ask your printer what they require: but
expect an answer in the 900 to 1,200 dpi range, and they might ask for CMYK
(32-bit colour). Don't use CMYK unless you have to: it won't display
properly in Microsoft Office products yet. This kind of resolution and
colour depth produces very large files, and there's no good way around it.
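
To put rough numbers on the four cases, here is a small sketch that computes the uncompressed size of the same logo at each setting. The 2 x 2 inch logo is hypothetical, and the commercial row (1200 dpi, CMYK at 8 bits per channel) is an assumed example of what a print shop might ask for:

```python
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of an image printed at the given size."""
    return round(width_in * dpi) * round(height_in * dpi) * bits_per_pixel // 8

# (dpi, bits per pixel) per intended use. The 1200 dpi / 32-bit CMYK row
# is an assumed example, not a figure any particular printer requires.
TARGETS = {
    "on-screen": (96, 24),
    "office colour": (150, 24),
    "office black-and-white": (150, 8),
    "commercial colour": (1200, 32),
}

for use, (dpi, bpp) in TARGETS.items():
    size = raw_bytes(2, 2, dpi, bpp)  # hypothetical 2 x 2 inch logo
    print(f"{use:24s}{size:>12,d} bytes")
```

With these assumptions the commercial version is over 80 times the size of the office colour version, because size grows with the square of the resolution.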

Having re-sized the graphic to the minimum, and reduced the colour depth as
low as practicable, put it back in the document. Remember: if you want it
in the same position on each page, store it in the running header so it's in
the file only once. Just because the graphic is stored in the header does
not mean it has to print there: you can drag it to any position you like on
the page.

If you do all of that, your document will be as small as you can get it. If
you want it smaller than that, talk to your graphics designer. You could
start by mentioning that you expect them, as industry professionals, to know
better than to send you bitmaps for use in office documents :)

Hope this helps


--
Don't wait for your answer, click here: http://www.word.mvps.org/

Please reply in the group. Please do NOT email me unless I ask you to.

John McGhie, Microsoft MVP, Word and Word:Mac
Sydney, Australia.
 
mattbacon

That's a great answer, John, and full of helpful tips for many circumstances. I think making a "maggie" is effectively what I did. I don't think, though, that the graphic logo is the problem. I opened it out of the document into Photoshop, and it's only 378 pixels square in 8 bit RGB. Saved as PNG it's 128K, saved as a Windows BMP/16 it's 320K.

I think the "stranded RTF" must be the problem. The document arrived as a DOC file in a mail attachment. Attached, it's 4.77MB. Opened in Word 2004 and saved as a DOC file, it's 3.5MB. Saved from 2004 as an RTF, it's 17.4MB. Saved from 2008 as an RTF, it's 19MB. "maggied" as described, and saved as an RTF in 2004, it's 324K and 200K in DOC format.

Clearly something "invisible" in the original Word DOC, which Word encoded "relatively" efficiently in 'only' 3.3 MB excess, is bloating to 17.1MB in RTF. Fonts, maybe?

bestest,
M.
 
John McGhie

Hi Matt:

Great! You have found the problem.

In all my waffle yesterday, I am not sure I properly explained the "Graphics
Bloat Problem" built in to Word, but that's what it was.

When you open a Word document, Word creates a copy of each image in its
native format for display purposes. It leaves that image stored in the
document to save having to create it again. If you then open a document on
a different version of Word, it creates another copy of the image in its own
native format, leaving the other one in place.

This is a very poor design, but Microsoft does not think it is important
enough to change it.

Cheers


 
