
Why Does a Scan of a Simple Text Document Result in Such a Large File?

 I scanned a copy of my W2 form and tried to email it to someone but I was told the file was too big – over 25 MB. How does a simple text document acquire such a huge volume?

The short answer is very simple: a scan of a document creates a picture. It’s exactly as if you had pointed your camera at the paper and snapped a photo of it, except your scanner is better at capturing large, flat surfaces.

And pictures can be big.

Let’s look at why, and what some of the alternatives might be.


Text versus picture

Here’s some text:

Ask Leo!

That’s exactly eight bytes: one for each of the letters, one for the space, and one for the exclamation point.

Now, here’s a picture of that text:

[Image: the text "Ask Leo!" rendered as a picture]

That picture — a “.png” file in this case — is 878 bytes in size, over 100 times the size of what was needed to represent just the text.

The difference is simple: while the text can be represented by 8 bytes, each of which represents one character in the string, a picture is a collection of information that describes each pixel in an image. In this case, that's a 68×15 image, which contains 1,020 pixels. That simply takes more data to represent.
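To make the arithmetic concrete, here is a minimal sketch that compares the two. The 3 bytes per pixel figure is an assumption for illustration (uncompressed 24-bit color); the article doesn't say what color depth the example image uses.

```python
# Compare the size of the text "Ask Leo!" with the raw pixel data of a
# 68x15 picture of it.
text = "Ask Leo!"
text_bytes = len(text.encode("ascii"))  # one byte per character

width, height = 68, 15
pixels = width * height

# Assumption for illustration: uncompressed 24-bit RGB, 3 bytes per pixel.
bytes_per_pixel = 3
raw_image_bytes = pixels * bytes_per_pixel

print(f"text:  {text_bytes} bytes")
print(f"image: {pixels} pixels, {raw_image_bytes} bytes uncompressed")
```

PNG compresses its pixel data without loss, which is why the actual 878-byte file comes in below this raw estimate. It's still far larger than the text alone.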

Scanning results in a picture

As I said, scanning a document is almost exactly like taking a picture of the piece of paper with your camera. Indeed, many smartphones now have apps for exactly that purpose: point the device’s camera at a document, snap a photo, and you’ve “scanned” it. I do this with credit card receipts all the time.

A camera’s image can be quite large, depending on a number of different factors, and a scan is no different. Your scan is probably a .jpeg or a .png file (both graphic file formats), or perhaps a PDF file containing images (stored internally as .png or .jpg). The result of a scan is most assuredly not a simple document or text file format.

In your case, your scan resulted in a file too big to email. There are ways to make it smaller.

Option 1: OCR

Converting an image of a scanned document into text you can edit requires what’s called OCR, for Optical Character Recognition.

There are a number of approaches. Your scanning software may perform this task for you and make the editable text part of the resulting file. My ScanSnap scanner, for example, creates a PDF of each document I scan that contains both the picture of each page and the results of OCR run on the image. The irony, of course, is that including both makes the resulting document even bigger.

One of the issues with OCR, beyond accuracy, is its focus on the text (or characters) on the scanned page, but not the formatting. OCR gives you the text, but generally all formatting is lost in the process. Sometimes that’s perfect and exactly what you want. It’ll almost certainly be smaller than a picture of the document.

But sometimes — as with your W2, I assume — you really just want a true copy that looks the same as the original. That’s when you want that picture.

Option 2: scan resolution

Twenty-five megabytes does seem a little large for a simple one-page document.

Most scanners have a setting that controls how detailed a picture they take of your document. This is measured in “DPI”, or Dots Per Inch.

A simple text document can usually be scanned at a setting as low as 75 DPI and return perfectly acceptable results.

Your scanner might be set, or default, to a higher resolution. Since scanners are often used to scan photographs, where resolution and detail matter much more, they offer much higher DPI settings. My flatbed scanner, which I use to scan old photographs, can go as high as 2400 DPI. That generates significantly more data for each item scanned: the pixel count grows with the square of the DPI, so doubling the resolution roughly quadruples the uncompressed data, and the resulting files are much larger than those from a 75 DPI scan.
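As a rough sketch of how resolution drives size, the following estimates the uncompressed data for a single page at several DPI settings. The page size (8.5 x 11 inches) and color depth (3 bytes per pixel, 24-bit color) are assumptions for illustration, and `raw_scan_bytes` is a hypothetical helper, not part of any scanner software.

```python
def raw_scan_bytes(dpi, width_in=8.5, height_in=11.0, bytes_per_pixel=3):
    """Estimate the uncompressed size, in bytes, of one scanned page.

    Assumes a letter-size page and 24-bit color unless told otherwise.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


for dpi in (75, 150, 300, 600):
    size = raw_scan_bytes(dpi)
    print(f"{dpi:>4} DPI: {size / 1_000_000:.1f} MB uncompressed")
```

Under these assumptions a 300 DPI page works out to about 25 MB before any compression. Real files are usually smaller because most formats compress the data, so treat these numbers as an upper bound rather than a prediction.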

I’d definitely look at adjusting the scan resolution as the next step in cutting down the file size of your scans.

Option 3: adjust compression

If your scanner produces a “.jpg” file, or can produce one, there’s one more setting you might try to find: the jpg quality setting.

Exactly where this will be, and even what it’s called, unfortunately varies depending on the software you’re using, so I can’t tell you exactly what to look for. It’s often a number between 1 and 10 (though other ranges are also common) that allows you to make the tradeoff between the size of the resulting file and its quality. A better-looking file will be larger than a file with lower quality.

[Image: Get Out Of Hell Free - High Quality]

[Image: Get Out Of Hell Free - Low Quality]

In the two images above, the first is saved as a jpg at high quality, and is roughly 75KB in size. The second is at low quality and 23KB. You can see that the second is significantly less clear and crisp than the first. Depending on the document you’re dealing with, that may be an acceptable tradeoff for a significantly smaller-sized image.
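The same tradeoff can be shown with a toy example. This is not how JPEG works internally. It is a simplified stand-in that uses only Python's standard library: it throws away fine tonal detail from a made-up grayscale "page" and then compresses the result with zlib. The synthetic page, the `reduce_detail` helper, and the step size of 32 are all assumptions chosen for illustration.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical stand-in for a scanned grayscale page: light paper with
# slight sensor noise, one byte per pixel.
WIDTH, HEIGHT = 200, 100
page = bytes(230 + random.randint(-8, 8) for _ in range(WIDTH * HEIGHT))


def reduce_detail(pixels, step):
    """Discard fine tonal detail by snapping each pixel down to a multiple of step."""
    return bytes((p // step) * step for p in pixels)


# "High quality" keeps every gray level; "low quality" keeps only coarse ones.
high_quality = zlib.compress(page)
low_quality = zlib.compress(reduce_detail(page, 32))

print(f"raw:          {len(page)} bytes")
print(f"high quality: {len(high_quality)} bytes")
print(f"low quality:  {len(low_quality)} bytes")
```

The low-quality version compresses to fewer bytes because there is less distinct information left to store. That is the same bargain a jpg quality setting offers: less detail in exchange for a smaller file.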


6 comments on “Why Does a Scan of a Simple Text Document Result in Such a Large File?”

  1. I surely wouldn’t recommend emailing a document as sensitive as a W2 form. The potential for identity theft is too great because typically these forms contain confidential info like SS numbers. I would paper mail such a sensitive document. Much more secure.

  2. For a >25MB output file, I’d suspect the file was scanned into some lossless format, most likely TIFF or BMP, which would typically require 4 bytes per pixel. Assuming 8.5″x11″ @ 300dpi * 4 bytes/pixel you get 32.1MB
    You’d want to save as a PNG (or possibly GIF) if it’s a mostly text/line-art image, or JPEG if it’s a photographic type subject. This should be an option in either the scanning program or your favourite image editor. In either case the output image should easily be less than 1MB.

  3. In addition to the file type mentioned by James as being a typical culprit in large file sizes, check the color settings.
    A W2 and many other documents don’t need to be sent in full color.
    8 bit gray scale only takes 1/3 the space of full color and 2 color (black/white) takes even less. If you get down to 2 color, many programs will use group 4 encoding which is extremely compact with no loss of image quality. Group 4 was created for sending faxes and any image that is mainly white with black scattered all around is what it is best at compressing.

  4. @Bill
    I’ve found that 8 bit gray scale compressed in the .jpg format is significantly more readable than a 2 color image and with not a great deal of size difference.

  5. I recently carried out a comparison exercise, based on the simple phrase “Here is the News”.

    Bearing in mind the effects of disk segments etc; and that with the JPG, I trimmed the image to minimum to contain the phrase, the results were-

    1 KB – TeXT
    11KB – JPeG Graphics
    26KB – DOC WORD
    32KB – PRiNt as sent to Printer
    50KB – Portable Document Format
    7,125KB – MOVie Quicktime

    I did not check the PDF version to see if there are any inclusions such as Fonts etc.
