It seems like it should be smaller.
A scan of a document creates a picture. It’s exactly as if you had pointed your camera at the paper and snapped a photo of it, except your scanner is better at capturing large, flat surfaces.
And pictures can be big.
Let’s look at why, and some of the alternatives.
Scanning is the equivalent of taking a photograph of a document. Options to make the resulting file smaller: use OCR to return only the text found in the picture, scan at a lower resolution, or save .jpg files at a lower quality setting.
Text versus picture
Here’s some text:
That’s exactly eight bytes: one for each of the letters, one for the space, and one for the exclamation point.
Now, here’s a picture of that text:
That picture — a “.png” file in this case — is 2,431 bytes in size, over 300 times the size of what was needed to represent the text.
The difference is simple: while the text can be represented by eight bytes, each of which represents one character in the string, a picture is a collection of information that describes each pixel in an image — in this case, a 133×40 image, which contains 5,320 pixels. That simply takes more data to represent.
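To make the arithmetic concrete, here is a small sketch that reproduces the numbers above. The sample string is a hypothetical stand-in with the same shape as the example (six letters, one space, one exclamation point), and the three-bytes-per-pixel figure is an assumption about uncompressed 24-bit color, not a description of how the .png was actually stored.

```python
# Figures quoted in the text above.
PNG_BYTES = 2431          # size of the example .png file
WIDTH, HEIGHT = 133, 40   # pixel dimensions of the example image

# Hypothetical stand-in string (an assumption, not the original sample):
# six letters, one space, one exclamation point.
sample = "Hey you!"


def pixel_count(width, height):
    """Number of pixels in a width x height image."""
    return width * height


def raw_rgb_bytes(width, height, bytes_per_pixel=3):
    """Uncompressed size, assuming 3 bytes (24-bit color) per pixel."""
    return pixel_count(width, height) * bytes_per_pixel


# Plain ASCII text costs one byte per character.
text_size = len(sample.encode("ascii"))

print("text bytes:       ", text_size)
print("pixels:           ", pixel_count(WIDTH, HEIGHT))
print("raw RGB bytes:    ", raw_rgb_bytes(WIDTH, HEIGHT))
print("png / text ratio: ", PNG_BYTES / text_size)
```

Under those assumptions the uncompressed pixels would take even more room than the .png did. The file format's compression is already doing some work, and the picture is still hundreds of times larger than the text.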
Scanning results in a picture
As I said, scanning a document is almost exactly like taking a picture of the piece of paper with your camera. Indeed, many smartphones now have apps for exactly that purpose: point the device’s camera at a document, snap a photo, and you’ve “scanned” it. I do this with credit card receipts all the time.
A camera’s image can be quite large, depending on a number of different factors, and a scan is no different. Your scan is probably a .jpeg or a .png file (both are graphic file formats), or perhaps a PDF file containing images (stored internally as .png or .jpg). The result of a scan is not a simple document or text file format.
In your case, your scan resulted in a file too big to email. There are ways to make it smaller.
Option 1: OCR
Converting an image of a scanned document into text you can edit requires what’s called OCR, for Optical Character Recognition.
There are a number of approaches. Your scanning software may perform this task for you and make the editable text part of the resulting file. My ScanSnap scanner, for example, creates a PDF of each document I scan, which contains both the picture of each page and the results of OCR run on that image. The irony, of course, is that including both makes the resulting document even bigger.
One of the issues with OCR, beyond accuracy, is its focus on the text (or characters) on the scanned page, but not the formatting. OCR gives you the text, but generally all formatting is lost in the process. Sometimes that’s perfect and exactly what you want. It’ll almost certainly be smaller than a picture of the document.
But sometimes — as with your W2, I assume — you really need a copy that looks the same as the original. That’s when you want that picture.
Option 2: Scan resolution
Twenty-five megabytes does seem a little large for a simple one-page document.
Most scanners have a setting that controls how detailed a picture they take of your document. This is measured in “DPI”, or Dots Per Inch.
A simple text document can usually be scanned at a setting as low as 75 DPI and return perfectly acceptable results.
Your scanner might be set at, or default to, a higher resolution. Since scanners are often used to scan photographs, where resolution and detail are much more important, they have much higher DPI settings available. My flatbed scanner, which I use to scan old photographs, can go as high as 2400 DPI. That generates significantly more data for each item scanned: the number of pixels grows with the square of the DPI, so a 2400 DPI scan captures roughly 1,000 times as many pixels as a 75 DPI scan of the same page, and the resulting files are correspondingly larger.
I’d definitely look at adjusting the scan resolution as the next step in cutting down the file size of your scans.
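As a rough illustration of how quickly resolution adds up, the sketch below estimates the pixel dimensions and uncompressed size of a one-page scan at several DPI settings. It assumes a US Letter page (8.5 × 11 inches) and 24-bit color at three bytes per pixel. Both are assumptions, and real files will usually be smaller because .jpg and .png compress the data.

```python
def scan_pixels(dpi, width_in=8.5, height_in=11.0):
    """Return (width_px, height_px) for a page scanned at the given DPI.

    The default page size is US Letter, an assumption for this sketch.
    """
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_bytes(dpi, bytes_per_pixel=3, width_in=8.5, height_in=11.0):
    """Estimate the uncompressed size of the scan in bytes.

    Assumes 24-bit color (3 bytes per pixel) and no compression.
    """
    width_px, height_px = scan_pixels(dpi, width_in, height_in)
    return width_px * height_px * bytes_per_pixel


for dpi in (75, 150, 300, 600, 2400):
    width_px, height_px = scan_pixels(dpi)
    megabytes = raw_size_bytes(dpi) / 1_000_000
    print(f"{dpi:>4} DPI: {width_px} x {height_px} pixels, "
          f"about {megabytes:,.1f} MB uncompressed")
```

Doubling the DPI quadruples the pixel count. Under these assumptions, a 300 DPI scan works out to roughly 25 MB before compression, so a setting in that range, combined with a format that compresses little or not at all, is one plausible way a single page could end up that large. The actual cause depends on your scanner's settings.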
Option 3: Adjust compression
If your scanner produces a “.jpg” file, or can produce one, there’s one more setting you might try to find: the jpg quality setting.
Exactly where this will be, and even what it’s called, unfortunately varies depending on the software you’re using, so I can’t tell you exactly what to look for. It’s often a number between 1 and 10, though other ranges, such as 1 to 100, are also common. It lets you make the tradeoff between the size of the resulting file and its quality: a better-looking file will be larger than one saved at lower quality.
In the two images above, the first is saved as a jpg at high quality, and is roughly 75KB in size. The second is at low quality, and is 23KB. You can see that the second is significantly less clear and crisp than the first. Depending on the document you’re dealing with, that may be an acceptable tradeoff for a significantly smaller-sized image.