The short answer is very simple: a scan of a document creates a picture. It’s exactly as if you had pointed your camera at the paper and snapped a photo of it, except your scanner is better at capturing large, flat surfaces.
And pictures can be big.
Let’s look at why, and what some of the alternatives might be.
Text versus picture
Here’s some text:
That’s exactly eight bytes: one for each of the letters, one for the space, and one for the exclamation point.
Now, here’s a picture of that text:
That picture — a “.png” file in this case — is 878 bytes in size, over 100 times the size of what was needed to represent just the text.
The difference is simple: while the text can be represented by 8 bytes, each of which represents one character in the string, a picture is a collection of information that describes each pixel in an image — in this case, a 68×15 image, which contains 1020 pixels. That simply takes more data to represent.
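To make the arithmetic concrete, here is a minimal sketch in Python. The sample string is only a stand-in of the right length, and the 3-bytes-per-pixel figure assumes uncompressed 24-bit color; neither comes from the original example.

```python
# Compare the bytes needed for a short string with the raw pixel data
# needed for a small picture of that string.

text = "Ask Leo!"  # hypothetical stand-in: 6 letters, 1 space, 1 exclamation point
text_bytes = len(text.encode("ascii"))  # one byte per character

width, height = 68, 15   # image dimensions quoted in the article
pixels = width * height  # number of pixels that must be described

bytes_per_pixel = 3      # assumption: 24-bit RGB with no compression
raw_image_bytes = pixels * bytes_per_pixel

print("text bytes:     ", text_bytes)       # 8
print("pixels:         ", pixels)           # 1020
print("raw image bytes:", raw_image_bytes)  # 3060
```

The real .png is smaller than the raw figure because PNG compresses the pixel data, but it still has to describe every pixel, so it stays far larger than the eight bytes of text.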
Scanning results in a picture
As I said, scanning a document is almost exactly like taking a picture of the piece of paper with your camera. Indeed, many smartphones now have apps for exactly that purpose: point the device’s camera at a document, snap a photo, and you’ve “scanned” it. I do this with credit card receipts all the time.
A camera’s image can be quite large, depending on a number of different factors, and a scan is no different. Your scan is probably a .jpeg or a .png file (both graphic file formats), or perhaps a PDF file containing images (stored internally as .png or .jpg). The result of a scan is most assuredly not a simple document or text file format.
In your case, your scan resulted in a file too big to email. There are ways to make it smaller.
Option 1: OCR
Converting an image of a scanned document into text you can edit requires what’s called OCR, for Optical Character Recognition.
There are a number of approaches. Your scanning software may perform this task for you and make the editable text part of the resulting file. My ScanSnap scanner, for example, creates a PDF of each document I scan that contains both the picture of each page and the results of OCR run on the image. The irony, of course, is that including both makes the resulting document even bigger.
One of the issues with OCR, beyond accuracy, is its focus on the text (or characters) on the scanned page, but not the formatting. OCR gives you the text, but generally all formatting is lost in the process. Sometimes that’s perfect and exactly what you want, and the resulting text will almost certainly be smaller than a picture of the document.
But sometimes — as with your W2, I assume — you really just want a true copy that looks the same as the original. That’s when you want that picture.
Option 2: scan resolution
Twenty-five megabytes does seem a little large for a simple one-page document.
Most scanners have a setting that controls how detailed a picture they take of your document. This is measured in “DPI”, or Dots Per Inch.
A simple text document can usually be scanned at a setting as low as 75 DPI and return perfectly acceptable results.
Your scanner might be set, or default, to a higher resolution. Since scanners are often used to scan photographs, where resolution and detail are much more important, they have much higher DPI settings available. My flatbed scanner, which I use to scan old photographs, can go as high as 2400 DPI. That generates significantly more data for each item scanned: the pixel count grows with the square of the DPI, so the resulting files are dramatically larger than those from a 75-DPI scan.
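As a rough illustration of how quickly resolution adds up, the sketch below estimates the uncompressed size of a scanned page. The function name, the US Letter page size, and the 24-bit color assumption are all illustrative choices; real scan files are usually compressed and therefore smaller.

```python
def raw_scan_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Estimate the uncompressed size of a scan, in bytes.

    Assumes a fixed number of bytes per pixel (3 = 24-bit color) and
    ignores any compression the scanner software applies.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumption: a US Letter page, 8.5 x 11 inches.
for dpi in (75, 150, 300, 600, 2400):
    size_mb = raw_scan_bytes(8.5, 11, dpi) / 1_000_000
    print(f"{dpi:>5} DPI: about {size_mb:,.1f} MB uncompressed")
```

Doubling the DPI quadruples the pixel count, which is why a small change in the setting makes such a large difference. Under these assumptions, a single color page at 300 DPI already comes to about 25 MB before compression.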
I’d definitely look at adjusting the scan resolution as the next step in cutting down the file size of your scans.
Option 3: adjust compression
If your scanner produces a “.jpg” file, or can produce one, there’s one more setting you might try to find: the jpg quality setting.
Exactly where this will be, and even what it’s called, unfortunately varies depending on the software you’re using, so I can’t tell you exactly what to look for. It’s often a number between 1 and 10 (though other ranges are also common) that allows you to make the tradeoff between the size of the resulting file and its quality. A better-looking file will be larger than a file with lower quality.
In the two images above, the first is saved as a jpg at high quality, and is roughly 75KB in size. The second is at low quality and 23KB. You can see that the second is significantly less clear and crisp than the first. Depending on the document you’re dealing with, that may be an acceptable tradeoff for a significantly smaller-sized image.
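The size-versus-quality tradeoff can be sketched without any imaging library. The example below is not JPEG: it is a simplified stand-in that throws away fine brightness detail (quantization) and then applies lossless zlib compression, which is enough to show why discarding detail shrinks the file. The synthetic "page" and all function names are made up for illustration.

```python
import random
import zlib


def make_noisy_page(width=200, height=200, seed=0):
    """Build a synthetic grayscale image: a gradient plus scanner-like noise."""
    rng = random.Random(seed)
    pixels = bytearray()
    for y in range(height):
        for x in range(width):
            base = (x + y) * 255 // (width + height - 2)
            noisy = base + rng.randint(-10, 10)
            pixels.append(max(0, min(255, noisy)))
    return bytes(pixels)


def quantize(pixels, step):
    """Discard detail by snapping each value down to a multiple of `step`."""
    return bytes((p // step) * step for p in pixels)


def compressed_size(pixels):
    """Size in bytes after lossless zlib compression."""
    return len(zlib.compress(pixels, 9))


page = make_noisy_page()
for step in (1, 8, 32):  # larger step = lower "quality"
    size = compressed_size(quantize(page, step))
    print(f"step {step:>2}: {size} bytes")
```

A larger step plays the role of a lower quality setting: the image loses subtle detail, and the compressor has much less information to store.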