Why Does a Scan of a Simple Text Document Result in Such a Large File?

It seems like it should be smaller.

Scanned documents can produce graphics files that are too large to email. I'll review options for more manageable results.
Question: I scanned a copy of my W2 form and tried to email it to someone, but I was told the file was too big – over 25 MB. How does a simple text document acquire such a huge volume?

A scan of a document creates a picture. It’s exactly as if you had pointed your camera at the paper and snapped a photo of it, except your scanner is better at capturing large, flat surfaces.

And pictures can be big.

Let’s look at why, and some of the alternatives.

TL;DR:

Scanning documents

Scanning is the equivalent of taking a photograph of a document. Options to make the resulting file smaller: use OCR to return only the text found in the picture, scan at a lower resolution, or save .jpg files at a lower quality setting.

Text versus picture

Here’s some text:

Ask Leo!

That’s exactly eight bytes: one for each of the letters, one for the space, and one for the exclamation point.

Now, here’s a picture of that text:

[Image: Ask Leo!]

That picture — a “.png” file in this case — is 2,431 bytes in size, over 300 times the size of what was needed to represent the text.

The difference is simple: while the text can be represented by eight bytes, each of which represents one character in the string, a picture is a collection of information that describes each pixel in an image — in this case, a 133×40 image, which contains 5,320 pixels. That simply takes more data to represent.
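If you'd like to check the arithmetic yourself, here's a minimal Python sketch of the comparison above. It assumes the text is stored one byte per character (plain ASCII), and it takes the 2,431-byte figure straight from the article rather than measuring an actual .png file.

```python
# Size of the text itself, assuming one byte per character (ASCII).
text = "Ask Leo!"
text_bytes = len(text.encode("ascii"))

# Dimensions of the picture of that text, as quoted in the article.
width, height = 133, 40
pixel_count = width * height

# Size of the .png file as quoted in the article (not measured here).
png_bytes = 2431

print(text_bytes)               # 8
print(pixel_count)              # 5320
print(png_bytes // text_bytes)  # 303
```

Each of those 5,320 pixels needs some data to describe it, which is why even a well-compressed picture of eight characters ends up hundreds of times larger than the characters themselves.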

Scanning results in a picture

As I said, scanning a document is almost exactly like taking a picture of the piece of paper with your camera. Indeed, many smartphones now have apps for exactly that purpose: point the device’s camera at a document, snap a photo, and you’ve “scanned” it. I do this with credit card receipts all the time.

A camera’s image can be quite large, depending on a number of different factors, and a scan is no different. Your scan is probably a .jpeg or a .png file (both are graphic file formats), or perhaps a PDF file containing images (stored internally as .png or .jpg). The result of a scan is not a simple document or text file format.

In your case, your scan resulted in a file too big to email. There are ways to make it smaller.

Option 1: OCR

Converting an image of a scanned document into text you can edit requires what’s called OCR, for Optical Character Recognition.

There are a number of approaches. Your scanning software may perform this task for you and make the editable text part of the resulting file. My ScanSnap scanner, for example, creates a PDF of each document I scan, which contains both the picture of each page and the results of OCR run on the image. The irony, of course, is that including both makes the resulting document even bigger.

One of the issues with OCR, beyond accuracy, is that it focuses on the text (the characters) on the scanned page, not the formatting. OCR gives you the text, but generally all formatting is lost in the process. Sometimes that’s perfect and exactly what you want. The resulting text will almost certainly be smaller than a picture of the document.

But sometimes — as with your W2, I assume — you really need a copy that looks the same as the original. That’s when you want that picture.

Option 2: Scan resolution

Twenty-five megabytes does seem a little large for a simple one-page document.

Most scanners have a setting that controls how detailed a picture they take of your document. This is measured in “DPI”, or Dots Per Inch.

A simple text document can usually be scanned at a setting as low as 75 DPI and return perfectly acceptable results.

Your scanner might be set at, or default to, a higher resolution. Since scanners are often used to scan photographs, where resolution and detail are much more important, they have much higher DPI settings available. My flatbed scanner, which I use to scan old photographs, can go as high as 2400 DPI. That generates significantly more data for each item scanned: the number of pixels grows with the square of the DPI, so a 2400 DPI scan captures about 1,000 times as many pixels (32 × 32 = 1,024) as a 75 DPI scan of the same page.

I’d definitely look at adjusting the scan resolution as the next step in cutting down the file size of your scans.
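To get a feel for how much the resolution setting matters, here's a rough Python sketch that estimates the uncompressed size of a scan. The function name `raw_scan_bytes`, the letter-size page (8.5 × 11 inches), and the 3 bytes per pixel (24-bit color) are all my assumptions for illustration; your scanner's actual files will usually be smaller because most formats compress the data.

```python
def raw_scan_bytes(width_in: float, height_in: float, dpi: int,
                   bytes_per_pixel: int = 3) -> int:
    """Estimate the uncompressed size of a scan in bytes.

    Hypothetical helper: pixels across x pixels down x bytes per pixel.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumed page: US letter (8.5 x 11 inches), scanned in 24-bit color.
for dpi in (75, 300, 2400):
    size = raw_scan_bytes(8.5, 11, dpi)
    print(f"{dpi:>5} DPI: {size / 1_000_000:,.1f} MB uncompressed")

# Pixel count scales with the square of the DPI ratio.
print((2400 / 75) ** 2)  # 1024.0
```

Under those assumptions, the same page works out to roughly 1.6 MB at 75 DPI, 25.2 MB at 300 DPI, and over 1,600 MB at 2400 DPI before any compression is applied.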

Option 3: Adjust compression

If your scanner produces a “.jpg” file, or can produce one, there’s one more setting you might try to find: the jpg quality setting.

Exactly where this will be, and even what it’s called, unfortunately varies depending on the software you’re using, so I can’t tell you exactly what to look for. It’s often a number between 1 and 10 (though, again, other ranges are also often used) allowing you to make the tradeoff between the size of the resulting file and its quality. A better-looking file will be larger than a file with lower quality.

[Image: Get Out Of Hell Free - High Quality]

[Image: Get Out Of Hell Free - Low Quality]

In the two images above, the first is saved as a jpg at high quality, and is roughly 75KB in size. The second is at low quality, and is 23KB. You can see that the second is significantly less clear and crisp than the first. Depending on the document you’re dealing with, that may be an acceptable tradeoff for a significantly smaller-sized image.

6 comments on “Why Does a Scan of a Simple Text Document Result in Such a Large File?”

  1. I surely wouldn’t recommend emailing a document as sensitive as a W2 form. The potential for identity theft is too great because typically these forms contain confidential info like SS numbers. I would paper mail such a sensitive document. Much more secure.

  2. For a >25MB output file, I’d suspect the file was scanned into some lossless format, most likely TIFF or BMP, which would typically require 4 bytes per pixel. Assuming 8.5″x11″ @ 300dpi * 4 bytes/pixel you get 32.1MB
    You’d want to save as a PNG (or possibly GIF) if it’s a mostly text/line-art image, or JPEG if it’s a photographic type subject. This should be an option in either the scanning program or your favourite image editor. In either case the output image should easily be less than 1MB.

  3. In addition to the file type mentioned by James as being a typical culprit in large file sizes, check the color settings.
    A W2 and many other documents don’t need to be sent in full color.
    8 bit gray scale only takes 1/3 the space of full color and 2 color (black/white) takes even less. If you get down to 2 color, many programs will use group 4 encoding which is extremely compact with no loss of image quality. Group 4 was created for sending faxes and any image that is mainly white with black scattered all around is what it is best at compressing.

  4. I recently carried out a comparison exercise, based on the simple phrase “Here is the News”.

    Bearing in mind the effects of disk segments etc; and that with the JPG, I trimmed the image to minimum to contain the phrase, the results were-

    1 KB – TeXT
    11KB – JPeG Graphics
    26KB – DOC WORD
    32KB – PRiNt as sent to Printer
    50KB – Portable Document Format
    7,125KB – MOVie Quicktime

    I did not check the PDF version to see if there are any inclusions such as Fonts etc.
