Everything and nothing.
Someone commented to me that his spam filter was pretty useless since the spam he was receiving kept coming from different email addresses. The implication was that this person believed that the “From:” address is the only thing that spam filters check.
While that’s possible, it’s also rare.
These days, spam filters are complex and sophisticated pieces of software that check much more than you might think.
Spam filters
Spam filters analyze multiple email characteristics to determine spam probability. They consider factors like sender details, subject lines, message content, and unseen headers. No single rule guarantees filtering; instead, a combination of traits increases the likelihood of being marked as spam.
Your mileage may vary
I have to point out that there is no single spam filter, spam-filtering technique, or spam-filtering set of rules.
How the spam filter works on your email may differ greatly from how it works on mine. As we’ll see in a moment, for large email providers that’s almost guaranteed to be the case.
No fixed set of rules or factors determines whether email will be marked as spam. If there were such a set of comprehensive and public rules, spammers would just use them to game the system, making the rules ineffective the moment they were instituted.
It’s all about probability
Even the word “rules” is technically incorrect.
There’s no magic rule or set of rules that, once broken, guarantees email will be placed into a spam folder. Instead, we’re talking about a much fuzzier concept: probability.
What we can look at are the characteristics of email that add to or reduce the probability a message will be marked as spam.
Any single characteristic by itself is typically not enough to make that determination. However, a message that shows multiple characteristics of being spam is probably going to get filtered as spam. Think of each as a strike against the message. Too many strikes and the message is judged as spam.
Of course, not all characteristics are created equal. One characteristic might be a stronger indicator of spamminess than another. Nor do they stay the same over time: spam topics come in waves and react to current events. For example, a word that was completely benign in email last year might be an indicator of spam tomorrow.
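To make the "strikes" idea concrete, here is a minimal sketch in Python. The characteristic names, the weights, and the threshold are all invented for illustration; I'm not claiming any real spam filter uses these values.

```python
# Hypothetical weights: how strongly each characteristic suggests spam.
# The names and numbers are made up for illustration only.
WEIGHTS = {
    "display_name_mismatch": 2.0,
    "no_subject": 1.0,
    "body_is_only_a_link": 2.5,
    "no_reverse_dns": 3.0,
}

SPAM_THRESHOLD = 4.0  # assumed cut-off; a real filter would keep adjusting it


def spam_score(characteristics):
    """Add up the weights of every characteristic the message shows."""
    return sum(WEIGHTS.get(name, 0.0) for name in characteristics)


def is_spam(characteristics):
    """One strike is rarely enough; several together reach the threshold."""
    return spam_score(characteristics) >= SPAM_THRESHOLD
```

In this sketch a message with only "no_subject" scores 1.0 and is delivered, while one with both "no_subject" and "no_reverse_dns" reaches 4.0 and is filtered. Changing a weight over time is how a once-benign signal could start to count.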
So just what are these characteristics?
In no way is this a comprehensive or official list, but these are some of the types of things that spam filters can look for.
The From: line
Yes, it is possible to block spam based on the “From” line alone. This is helpful when you’re dealing with someone emailing you from a consistent address.
But spam doesn’t work that way. Each spam message is likely to come from a different email address. That being said, it’s possible that a spam filter might have a list of names, email addresses, or email domains that contribute to the probability of an email being marked as spam.
One test that spam filters can use is this:
From: someone@somerandomservice.com <askleoexample@hotmail.com>
This “From:” line has both a display name and an email address. (More on that in the article Why Is an Email Address Sometimes in Angle Brackets?) They both look like email addresses and they don’t match. That’s a common characteristic of spam. (This only works if the display name looks like an email address.)
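A check like that might look something like the following Python sketch. The function name and the pattern are my own assumptions, and the pattern only handles the simple "name <address>" form, not everything a real "From:" line can contain.

```python
import re

# Matches the simple form: optional display name, then <address>.
FROM_PATTERN = re.compile(r'^\s*"?(?P<name>[^"<>]*?)"?\s*<(?P<addr>[^<>\s]+)>\s*$')


def display_name_mismatch(from_value):
    """True if the display name looks like an email address that
    differs from the real address in the angle brackets."""
    match = FROM_PATTERN.match(from_value)
    if match is None:
        return False  # no "name <address>" form, so nothing to compare
    name = match.group("name").strip().lower()
    addr = match.group("addr").strip().lower()
    if "@" not in name:
        return False  # the test only applies when the name looks like an address
    return name != addr
```

Given the example above, `display_name_mismatch("someone@somerandomservice.com <askleoexample@hotmail.com>")` returns True, while an ordinary display name such as "Leo" never triggers the check.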
The To: and Cc: lines
The same “display name versus email address” check applied to the sender can apply to every recipient’s email address.
Other checks on the recipients might include:
- Are there many? Spam is often sent to many recipients at once via the To: or Cc: lines.
- Is the delivery account missing from both To: and Cc:? That can be legitimate (perhaps you were Bcc’ed on the message), but spammers often use Bcc to send to many more recipients than the message shows.
- Are there any recipients at all? Spam often has a blank To: or Cc: line.
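Those recipient checks could be sketched in Python like this. The threshold of 20 and the strike names are assumptions for the example; `getaddresses` is the standard library's parser for address lists.

```python
from email.utils import getaddresses

MANY_RECIPIENTS = 20  # assumed threshold, purely illustrative


def recipient_strikes(to_value, cc_value, delivered_to):
    """Return recipient-related strikes for one message.

    to_value and cc_value are the raw To: and Cc: values ("" if absent);
    delivered_to is the address of the mailbox the message landed in.
    """
    headers = [value for value in (to_value, cc_value) if value.strip()]
    visible = [addr.lower() for _, addr in getaddresses(headers) if addr]

    strikes = []
    if not visible:
        strikes.append("no_visible_recipients")
    elif len(visible) >= MANY_RECIPIENTS:
        strikes.append("many_recipients")
    if delivered_to.lower() not in visible:
        # Possibly an innocent Bcc, which is why this is a strike, not a verdict.
        strikes.append("delivery_account_not_listed")
    return strikes
```

A message addressed only to you produces no strikes at all; one with blank To: and Cc: lines collects two.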
The Subject: line
There may be a list of currently common spammy subjects that increase a message’s chance of being filtered as spam. A message having no subject at all, I’m sure, is on that list.
Similarly, words currently common in spam subject lines could count against any message using them.
Grammar counts, though not as much as it does in the body. Plenty of legitimate subject lines are grammatically incorrect, but most spam subject lines are. Misspellings, unusual spacing, and odd capitalization can also count against a message.
Language — both in word choice and the set of characters used — can signal spam. If a message originates in an English-speaking country and is destined for an English-speaking country, seeing it in a foreign language or seeing foreign characters in the message could be a clue.
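Here is a rough Python sketch of a few subject-line checks along those lines. The cut-offs (80% capitals, 30% non-ASCII characters) are guesses for illustration, and the character-set test assumes a mailbox that normally receives English.

```python
import re


def subject_strikes(subject):
    """Return a list of subject-line strikes (names are illustrative)."""
    text = subject.strip()
    if not text:
        return ["no_subject"]

    strikes = []
    letters = [ch for ch in text if ch.isalpha()]
    if len(letters) >= 8 and sum(ch.isupper() for ch in letters) / len(letters) > 0.8:
        strikes.append("mostly_capitals")

    # Four or more single characters separated by spaces, e.g. "F R E E".
    if re.search(r"\b(?:\w ){3,}\w\b", text):
        strikes.append("spaced_out_letters")

    # Assumes English mail: a large share of non-ASCII characters stands out.
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    if non_ascii / len(text) > 0.3:
        strikes.append("unexpected_character_set")
    return strikes
```

An everyday subject like "Lunch on Friday?" passes cleanly, while "F R E E MONEY NOW" picks up both the capitals strike and the spacing strike.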
The message body
The body of the message is where things get interesting — almost magical. This is where the phrase “looks like spam” really applies at its fuzziest, since what looks like spam to one person might not look like spam to another. Spam filters fight this battle every day.
Spam filters check for these and other issues in the body of the message.
- Just a link. A common phishing attempt is to send just a link in the email body, particularly when the message originates from a hacked email account. Trusting that the sender is indeed one of their contacts, people will often blindly click the link.
- Spammy topics. There are topics related to body enhancement, politics, money-making schemes, and more that are very common in spam and likely to be filtered as such.
- Grammar and spelling. No, you and I are not perfect, but most spam is worse. The quality of the writing can be factored in as a sign of potential spam.
- Language and character set. Just like the subject line, messages in languages that are foreign to both the sender and the receiver can be a sign of spam.
- Images. Email messages with images are common in spam and thus can act as a sign. In particular, an email that is only an image or is mostly images — geared to try to trick you into allowing images to be displayed — can be a red flag.
- Spacing. This is obscure, but I see it used a lot: the top part of a message body might be an explicit call to action for some spammer’s goal. But since it’s so clearly spammy content, they add several blank lines to the message and append non-spammy, often random, content at the end. The presence of non-spammy content might tilt the balance in favor of the message not being identified as spam when it obviously is.
I’m sure I’m missing many more possible indications used by spam filters when they analyze the body of an email message.
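Two of the body checks above, "just a link" and "spacing", are simple enough to sketch in Python. The patterns and the "four blank lines" cut-off are assumptions of mine; topics, grammar, and images need far more than a few lines of code.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
# Assumed cut-off: four or more blank lines in a row inside the body.
BLANK_RUN = re.compile(r"\n(?:[ \t]*\n){4,}")


def body_strikes(body):
    """Return body-related strikes for a plain-text message body."""
    strikes = []
    text = body.strip()

    # "Just a link": nothing is left once the URLs are removed.
    if URL_PATTERN.search(text) and not URL_PATTERN.sub("", text).strip():
        strikes.append("just_a_link")

    # "Spacing": a long run of blank lines separating two chunks of text.
    if BLANK_RUN.search(text):
        strikes.append("large_blank_gap")
    return strikes
```

A body that mentions a link in the middle of a sentence is fine here; a body that is nothing but the link is not.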
Unseen headers
By now, I’m sure you’ve heard about the headers in email you don’t normally see. These lines (much like the To: and Subject: lines) include a bunch of technical information about how the email was routed and formatted and may include what a spam filter thought about it.
Spam filters analyze some headers for clues. The most interesting is what I’d call the chain of custody.
The chain is nothing more than a sequence of information that looks something like this:
- I’m server A, and I got this message from server B to be delivered to my customer, someone@somerandomservice.com.
- I’m B and I got this from C.
- I’m C and I got this from D.
- I’m server D, and I originated this message.
Each of those steps is identified with an IP address and often a name. Now, while we can’t use an IP address to identify a specific source or person (and I have many articles on the topic), there are generalizations about the IP addresses in the chain of custody that can affect the probability of that message being spam.
- DNS. DNS maps names to IP addresses. So, if server D, for example, has a name, does it match the IP address? If not, that’s a strike against it. More basic still, is there any name associated with the IP address at all? If not, that’s typically a serious issue and a sign of spam.
- IP location. Does the location of the IP address the message came from (“server D”) match where the email address supposedly exists? Email from your local ISP’s domain, for example, should never originate from a server in a foreign country.
- IP ownership. Does the source IP address of that message match the servers that are supposedly sending for that domain? For example, if that’s a message from a Gmail account, did it originate on a Gmail server?
- Chain of custody. Is the chain broken? For example, if the line “I’m C and I got this from D” wasn’t present, then the message somehow appears to have hopped from D to B without C recording anything. That’s highly suspicious and often a sign of header forgery.
- Chain reasonableness. As we travel from D to C to B to A, is the path the message took “reasonable”? Did the message appear to take an unnecessary trip through a foreign server? Once again, that’s a possible sign of header forgery and spam.
These are just examples and made-up ones at that. But they should give you some idea of the analysis that’s possible when spam filters review the headers you don’t normally see.
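The chain of custody lives in the "Received:" headers, with the newest hop at the top. As a sketch of the "is the chain broken?" check, the Python below pulls "from X by Y" out of each one and compares neighbors. The server names are made up, and real headers are much messier: the name after "from" is whatever the sending server claimed, so comparing plain strings like this is a big simplification.

```python
import re
from email import message_from_string

# Simplified: real Received: lines carry far more detail than "from X by Y".
HOP = re.compile(r"from\s+(?P<sender>\S+).*?\bby\s+(?P<receiver>[^\s;]+)", re.DOTALL)

# Made-up message using the A, B, C, D servers from the example above.
SAMPLE = "\n".join([
    "Received: from b.example by a.example; Mon, 1 Jan 2024 10:00:03 +0000",
    "Received: from c.example by b.example; Mon, 1 Jan 2024 10:00:02 +0000",
    "Received: from d.example by c.example; Mon, 1 Jan 2024 10:00:01 +0000",
    "From: sender@d.example",
    "Subject: hello",
    "",
    "Message body.",
])


def received_hops(raw_message):
    """Return (sender, receiver) pairs, newest hop first."""
    message = message_from_string(raw_message)
    hops = []
    for value in message.get_all("Received", []):
        match = HOP.search(value)
        if match:
            hops.append((match.group("sender"), match.group("receiver")))
    return hops


def chain_is_unbroken(hops):
    """Each hop's sender should be the receiver of the next-older hop."""
    return all(newer[0] == older[1] for newer, older in zip(hops, hops[1:]))
```

With `SAMPLE` as written the chain is unbroken. Delete the "from c.example by b.example" line and the message appears to jump from C's record straight to A's, which is exactly the gap described above.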
SPF, DKIM, and DMARC
SPF, DKIM, and DMARC are standards that control aspects of mail content and delivery. As a (very gross) overgeneralization:
- SPF – Sender Policy Framework – is mostly about identifying servers that originate email for given email domains. For example, only Yahoo! servers can originate email from Yahoo! email addresses, and Yahoo! has stated that anything not matching that should be considered spam.
- DKIM – DomainKeys Identified Mail – is mostly about using cryptographic digital signatures to authenticate that the claimed sender of a message is the real sender of that message, and possibly also that the message content has not been tampered with. If the confirmation fails, that’s a possible sign of spam.
- DMARC – Domain-based Message Authentication, Reporting & Conformance – is a framework that a) allows the apparent sending domain (say, Yahoo.com) to indicate what should happen when a message fails the SPF and DKIM checks, and b) provides a mechanism for reporting back to the sending domain what’s happening.
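To show how the three fit together, here is a small Python sketch of the DMARC decision. A domain publishes its policy as a short text record in DNS, such as "v=DMARC1; p=reject". The function names and return values are my own; the sketch assumes the SPF and DKIM results have already been worked out elsewhere, and it follows the DMARC rule that a message is acceptable when either check passes in alignment with the "From:" domain. It ignores the reporting side and the optional tags entirely.

```python
def parse_dmarc_record(txt):
    """Split a DMARC record such as 'v=DMARC1; p=reject' into a dict."""
    tags = {}
    for part in txt.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_disposition(spf_aligned_pass, dkim_aligned_pass, record_txt):
    """Return what the sending domain asks receivers to do with the message."""
    if spf_aligned_pass or dkim_aligned_pass:
        return "deliver"  # one passing, aligned check is enough for DMARC
    policy = parse_dmarc_record(record_txt).get("p", "none").lower()
    # The three policy values DMARC defines, mapped to illustrative outcomes.
    outcomes = {"none": "deliver", "quarantine": "spam_folder", "reject": "reject"}
    return outcomes.get(policy, "deliver")
```

For example, `dmarc_disposition(False, False, "v=DMARC1; p=quarantine")` returns "spam_folder". The record values here are examples, not any provider's actual published policy.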
Training
One of the most potentially confusing things about spam filtering is that what is spam to you might not be spam to me.
When we “mark as spam” in many email programs and on many email services, what we’re saying is, “Email like this is spam to me.”
Sophisticated email filtering systems then use that specific email message (that you said is spam) to do two things:
- Analyze its characteristics and update the things that the filter looks at to check for spam for everyone. For example, if large numbers of people mark a specific message as spam, then the filter tries to take that message’s characteristics into account as it does the analysis I’ve been talking about above.
- Use those same characteristics — perhaps a little more aggressively — to update the spam filter specifically for you. The net result is you end up with a spam filter customized to your indication of what is and is not spam.
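As a toy illustration of training, the Python below counts words in messages marked spam or not-spam and scores new messages against those counts. This is a naive Bayes style word counter, a well-known technique I'm substituting for illustration; I'm not claiming it is what any particular email service does. The class, the function, and the "personal weight" of 2.0 are all assumptions.

```python
import math
import re
from collections import Counter


class TrainableFilter:
    """Toy word-count filter in the naive Bayes style."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()

    @staticmethod
    def _words(text):
        return re.findall(r"[a-z']+", text.lower())

    def mark_spam(self, text):
        self.spam_words.update(self._words(text))

    def mark_not_spam(self, text):
        self.ham_words.update(self._words(text))

    def spam_log_odds(self, text):
        """Positive: more like what was marked spam. Negative: less."""
        spam_total = sum(self.spam_words.values())
        ham_total = sum(self.ham_words.values())
        vocabulary = len(set(self.spam_words) | set(self.ham_words)) or 1
        score = 0.0
        for word in self._words(text):
            # Add-one smoothing keeps unseen words from producing log(0).
            p_spam = (self.spam_words[word] + 1) / (spam_total + vocabulary)
            p_ham = (self.ham_words[word] + 1) / (ham_total + vocabulary)
            score += math.log(p_spam / p_ham)
        return score


def combined_score(text, shared, personal, personal_weight=2.0):
    """Blend the everyone filter with your own, counting yours more heavily."""
    return shared.spam_log_odds(text) + personal_weight * personal.spam_log_odds(text)
```

The `shared` filter stands in for what everyone's "mark as spam" clicks teach the service, and `personal` for what only your clicks teach it. Marking false positives as not-spam matters just as much: it feeds `mark_not_spam`, which pulls scores for similar messages back down.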
Failure is always an option
Inevitably, here’s where things get disappointing.
Spam filtering can be complex — and it can also be wrong.
How good a spam filter is depends on its sophistication, its ability to adapt both to new spam (as spammers try to weasel their way around the filter) and to individual user preferences, and its ability to do its job in a reasonable amount of time. The results range from pretty darned good to relatively pointless.
Some spam will make it through, and some “ham” (legitimate mail — the opposite of spam) will occasionally end up in the spam folder.
Do this
My recommendation for dealing with spam remains as it has been for some time.
- Train your email program or service’s spam filter: mark spam as spam and mark those false positives you find in the spam folder as not-spam.
- Never reply to spam.
- Never try to unsubscribe from spam. (If you asked for the email by subscribing, then it’s not spam, and “unsubscribe” is the right way to stop it.)
And above all, don’t let spam stress you out. It’s an everyday fact of life on today’s internet. Mark as spam, and move on.