It seems like it should be smaller.
A scan of a document creates a picture. It’s exactly as if you had pointed your camera at the paper and snapped a photo of it, except your scanner is better at capturing large, flat surfaces.
And pictures can be big.
Let’s look at why, and some of the alternatives.
Scanning is the equivalent of taking a photograph of a document. Options to make the resulting file smaller: use OCR to return only the text found in the picture, scan at a lower resolution, or save .jpg files at a lower quality setting.
Text versus picture
Here’s some text:
That’s exactly eight bytes: one for each of the letters, one for the space, and one for the exclamation point.
Now, here’s a picture of that text:
That picture — a “.png” file in this case — is 2,431 bytes in size, over 300 times the size of what was needed to represent the text.
The difference is simple: while the text can be represented by eight bytes, each of which represents one character in the string, a picture is a collection of information that describes each pixel in an image — in this case, a 133×40 image, which contains 5,320 pixels. That simply takes more data to represent.
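The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the figure of 3 bytes per pixel assumes uncompressed 24-bit color, which is my assumption and not necessarily how the .png above stores its data, and the variable names are made up for the example.

```python
# Sizes taken from the example above.
text_bytes = 8            # eight characters, one byte each
png_bytes = 2431          # size of the .png file on disk
width, height = 133, 40   # image dimensions in pixels

pixels = width * height

# Assumption: uncompressed 24-bit color, i.e. 3 bytes per pixel.
raw_bytes = pixels * 3

print(f"pixels in the image: {pixels:,}")
print(f"raw size at 3 bytes/pixel: {raw_bytes:,} bytes")
print(f"png is about {png_bytes / text_bytes:.0f}x the size of the text")
```

Under that assumption, the .png format's compression has already shrunk the raw pixel data considerably, and the picture is still hundreds of times larger than the text.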
Scanning results in a picture
As I said, scanning a document is almost exactly like taking a picture of the piece of paper with your camera. Indeed, many smartphones now have apps for exactly that purpose: point the device’s camera at a document, snap a photo, and you’ve “scanned” it. I do this with credit card receipts all the time.
A camera’s image can be quite large, depending on a number of different factors, and a scan is no different. Your scan is probably a .jpeg or a .png file (both are graphic file formats), or perhaps a PDF file containing images (stored internally using JPEG or similar image compression). The result of a scan is not a simple document or text file format.
In your case, your scan resulted in a file too big to email. There are ways to make it smaller.
Option 1: OCR
Converting an image of a scanned document into text you can edit requires what’s called OCR, for Optical Character Recognition.
There are a number of approaches. Your scanning software may perform this task for you and make the editable text part of the resulting file. My ScanSnap scanner, for example, creates a PDF of each document I scan, which contains both the picture of each page and the results of OCR run on the image. The irony, of course, is that including both makes the resulting document even bigger.
One of the issues with OCR, beyond accuracy, is that it focuses on the text (or characters) on the scanned page, but not the formatting. OCR gives you the text, but generally all formatting is lost in the process. Sometimes that’s perfect and exactly what you want. The resulting text file will almost certainly be smaller than a picture of the document.
But sometimes — as with your W2, I assume — you really need a copy that looks the same as the original. That’s when you want that picture.
Option 2: Scan resolution
Twenty-five megabytes does seem a little large for a simple one-page document.
Most scanners have a setting controlling how detailed a picture they take of your document. This is measured in “DPI”, or Dots Per Inch.
A simple text document can usually be scanned at a setting as low as 75 DPI and return perfectly acceptable results.
Your scanner might be set at, or default to, a higher resolution. Since scanners are often used to scan photographs, where resolution and detail are much more important, they have much higher DPI settings available. My flatbed scanner, which I use to scan old photographs, can go as high as 2400 DPI. That generates significantly more data for each item scanned: because DPI applies to both the width and the height of the page, doubling it roughly quadruples the number of pixels, so the resulting files are dramatically larger than a 75 DPI scan.
I’d definitely look at adjusting the scan resolution as the next step in cutting down the file size of your scans.
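To get a feel for how much the resolution setting matters, here’s a rough sketch that estimates the uncompressed size of a single page at a few DPI settings. It assumes a US Letter page (8.5 by 11 inches) and 3 bytes per pixel for 24-bit color. Both are my assumptions for illustration, the function name is hypothetical, and real files will usually be smaller once the scanning software compresses them.

```python
def raw_scan_bytes(dpi, width_in=8.5, height_in=11.0, bytes_per_pixel=3):
    """Estimate the uncompressed size of a scanned page in bytes.

    Assumptions (not taken from any scanner's documentation): a US Letter
    page and 24-bit color stored at 3 bytes per pixel.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Compare a low text-only setting, a common default, and a photo setting.
for dpi in (75, 300, 2400):
    size = raw_scan_bytes(dpi)
    print(f"{dpi:>4} DPI: {size / 1_000_000:,.1f} MB uncompressed")
```

With these assumptions, a full-color page at 300 DPI works out to roughly 25 MB before any compression is applied.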
Option 3: Adjust compression
If your scanner produces a “.jpg” file, or can produce one, there’s one more setting you might try to find: the jpg quality setting.
Exactly where this will be, and even what it’s called, unfortunately varies depending on the software you’re using, so I can’t tell you exactly what to look for. It’s often a number between 1 and 10 (though, again, other ranges are also often used) allowing you to make the tradeoff between the size of the resulting file and its quality. A better-looking file will be larger than a file with lower quality.
In the two images above, the first is saved as a jpg at high quality, and is roughly 75KB in size. The second is at low quality, and is 23KB. You can see that the second is significantly less clear and crisp than the first. Depending on the document you’re dealing with, that may be an acceptable tradeoff for a significantly smaller-sized image.
6 comments on “Why Does a Scan of a Simple Text Document Result in Such a Large File?”
I surely wouldn’t recommend emailing a document as sensitive as a W2 form. The potential for identity theft is too great because typically these forms contain confidential info like SS numbers. I would paper mail such a sensitive document. Much more secure.
For a >25MB output file, I’d suspect the file was scanned into some lossless format, most likely TIFF or BMP, which would typically require 4 bytes per pixel. Assuming 8.5″x11″ @ 300dpi * 4 bytes/pixel you get 32.1MB
You’d want to save as a PNG (or possibly GIF) if it’s a mostly text/line-art image, or JPEG if it’s a photographic type subject. This should be an option in either the scanning program or your favourite image editor. In either case the output image should easily be less than 1MB.
In addition to the file type mentioned by James as being a typical culprit in large file sizes, check the color settings.
A W2 and many other documents don’t need to be sent in full color.
8 bit gray scale only takes 1/3 the space of full color and 2 color (black/white) takes even less. If you get down to 2 color, many programs will use group 4 encoding which is extremely compact with no loss of image quality. Group 4 was created for sending faxes and any image that is mainly white with black scattered all around is what it is best at compressing.
I’ve found that 8 bit gray scale compressed in the .jpg format is significantly more readable than a 2 color image and with not a great deal of size difference.
I recently carried out a comparison exercise, based on the simple phrase “Here is the News”.
Bearing in mind the effects of disk segments, etc., and that I trimmed the JPG image to the minimum needed to contain the phrase, the results were:
1 KB – TeXT
11KB – JPeG Graphics
26KB – DOC WORD
32KB – PRiNt as sent to Printer
50KB – Portable Document Format
7,125KB – MOVie Quicktime
I did not check the PDF version to see if there are any inclusions such as Fonts etc.
I scanned the same document twice. First time it became 3MB. Second time it was 5.9MB. That doesn’t make sense.