Someone commented to me that his spam filter was pretty useless since the spam he was receiving kept coming from different email addresses. The implication was that this person believed that the “From:” address is the only thing that spam filters check.
While that’s possible, it’s also very rare.
These days, spam filters are incredibly complex and sophisticated pieces of software that check much more than you might think.
I have to start this discussion by pointing out that there is no single spam filter, no single spam-filtering technique, and no single spam-filtering set of rules.
How the spam filter works on your email may be very, very different than how it works on mine. As we’ll see in a moment, when it comes to large email providers, that’s almost guaranteed to be the case.
So be clear: there is no fixed set of rules, or list of things, that will or will not cause email to be marked as spam. If a comprehensive, public set of rules existed, spammers would simply follow them, making the rules ineffective the moment they were instituted.
It’s all about probability
Even the word “rules” is technically incorrect.
There are no guarantees when it comes to spam. There’s no magic rule that once broken is guaranteed to place an email into a spam folder. Instead, we’re talking about a much fuzzier concept: probability.
What we can look at are the characteristics of email that either add to or reduce the probability that a message is marked as spam.
Any single characteristic, by itself, is typically not enough to make a determination. Taken in combination with other characteristics, however, a message that shows multiple characteristics of being spam is probably going to get filtered as spam. Think of each as a strike against the message. Too many strikes, and the message is judged as spam.
Of course, not all characteristics are created equal: one type of characteristic might be a stronger indicator of spamminess than another. Nor do they stay the same over time: spam topics come in waves and are very reactive to current events. For example, a word that was completely benign in email last year might be an indicator of spam tomorrow.
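To make the "strikes" idea concrete, here is a minimal sketch of how weighted characteristics might be combined into a single probability. The characteristic names, the weights, and the bias are all made up for illustration; real filters learn and constantly adjust values like these, and I'm not claiming any particular filter works exactly this way.

```python
import math

# Hypothetical weights: positive values push toward spam, negative toward ham.
# The names and numbers are invented for this example.
WEIGHTS = {
    "display_name_mismatch": 1.5,
    "no_subject": 1.0,
    "body_is_only_a_link": 2.0,
    "html_mail": 0.2,               # a weak sign on its own
    "sender_in_address_book": -2.5,
}

def spam_probability(characteristics, bias=-2.0):
    """Combine observed characteristics into a probability between 0 and 1."""
    score = bias + sum(WEIGHTS.get(name, 0.0) for name in characteristics)
    return 1.0 / (1.0 + math.exp(-score))  # logistic function

def is_spam(characteristics, threshold=0.5):
    """Too many (or too heavy) strikes push the probability past the threshold."""
    return spam_probability(characteristics) >= threshold
```

With these invented numbers, `is_spam(["html_mail"])` is false, while a message showing a mismatched display name, no subject, and a link-only body lands well over the threshold. No single characteristic decides the outcome by itself.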
So just what are these “characteristics” I keep talking about?
In no way is this a comprehensive or official list, but these are the types of things that spam filters can look for.
The From: line
Yes, it is possible to block spam based on the “From” line alone. This is particularly helpful when you’re dealing not with spam, but with someone or some entity that is emailing you from a consistent address.
Spam doesn’t work that way. Each spam message is likely to come from a different email address.
That being said, it’s possible that a spam filter might have a list of names and/or email addresses that contribute to a message’s probability of being spam.
One test that spam filters can perform is to look for this:
From: firstname.lastname@example.org <email@example.com>
This “From:” line has both a display name and an email address. (More on that in the article “Why is an email address sometimes in angle-brackets?”) Both parts look like email addresses, and they don’t match. That’s a common characteristic of spam. (This test only works if the display name actually looks like an email address.)
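The display-name test can be sketched in a few lines. This is a simplified illustration that uses a regular expression of my own to split the header; a real filter would use a full header parser, and the function name is hypothetical.

```python
import re

# Simplified pattern: optional quoted display name, then <address>.
FROM_PATTERN = re.compile(r'^\s*"?([^"<]*?)"?\s*<\s*([^<>\s]+)\s*>\s*$')

def display_name_mismatch(from_header):
    """True if the display name looks like an email address that differs
    from the real address in angle brackets."""
    match = FROM_PATTERN.match(from_header)
    if not match:
        return False  # no separate display name to compare
    display_name = match.group(1).strip().lower()
    address = match.group(2).strip().lower()
    if "@" not in display_name:
        return False  # an ordinary name such as "Alice"; the test does not apply
    return display_name != address
```

For example, `display_name_mismatch("alice@example.org <bob@example.com>")` counts as a strike, while `"Alice" <alice@example.org>` does not.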
The To: and Cc: lines
The same “display name versus email address” check applied to the sender can be applied to every recipient’s email address.
Other checks on the recipients might include:
- Are there many? Spam is often sent to many recipients at once via the To: or Cc: lines.
- Is the delivery account missing from both To: and Cc:? That could be legitimate – perhaps you were Bcc’ed on an email message – but spammers often use Bcc to send to many more recipients than the message might indicate.
- Are there any recipients at all? Spam often appears with a blank To: or Cc: line.
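The recipient checks above might look something like the following sketch, built on Python's standard `email` package. The strike names and the `max_recipients` cutoff of 20 are assumptions I picked for the example, not values taken from any real filter.

```python
from email import message_from_string
from email.utils import getaddresses

def recipient_strikes(raw_message, delivered_to, max_recipients=20):
    """Return a list of recipient-related strikes for one raw message.

    delivered_to is the account the message was actually delivered to.
    """
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    addresses = [addr.lower() for _, addr in getaddresses(headers) if addr]

    strikes = []
    if not addresses:
        strikes.append("no_visible_recipients")
    if len(addresses) > max_recipients:
        strikes.append("many_recipients")
    if delivered_to.lower() not in addresses:
        # Possibly a legitimate Bcc, so this is only one strike among many.
        strikes.append("delivery_account_not_listed")
    return strikes
```

Each strike would then feed into the overall probability rather than deciding the outcome alone.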
The Subject: line
I’m sure that there’s often a list of the current collection of common spammy subjects that increase a message’s chance of being filtered as spam. A message having no subject at all, I’m sure, is on that list.
Similarly, words that are currently common in spam message subject lines could count against any message that happened to use them.
Grammar counts. Not as much as in the body, which I’ll discuss in a moment, but to some degree. Many legitimate subject lines are grammatically incorrect, but most spam subject lines are. Misspellings, unusual spacing, or odd capitalization can also count against a message.
Language – both in word choice as well as the set of characters used – can be a signal of spam. If a message originates in an English-speaking country, and is destined for an English-speaking country, seeing it in a foreign language, or seeing foreign characters in the message, could be a clue.
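Here is a rough sketch of a few subject-line tests: a missing subject, all capitals, spaced-out letters, and an unexpected character set. It assumes a recipient who normally gets mail written in Latin-script languages; the strike names and cutoffs are mine and purely illustrative.

```python
import re
import unicodedata

# Single characters separated by whitespace, e.g. "F R E E".
SPACED_OUT = re.compile(r"(?:\b\w\s){3,}\w\b")

def subject_strikes(subject):
    """Return a list of subject-line strikes (hypothetical names)."""
    text = subject.strip()
    if not text:
        return ["no_subject"]

    strikes = []
    letters = [c for c in text if c.isalpha()]
    if len(letters) >= 8 and all(c.isupper() for c in letters):
        strikes.append("all_caps")
    if SPACED_OUT.search(text):
        strikes.append("spaced_out_letters")
    # Assumption: the recipient expects Latin-script mail.
    non_latin = [c for c in letters if "LATIN" not in unicodedata.name(c, "")]
    if letters and len(non_latin) > len(letters) / 2:
        strikes.append("unexpected_script")
    return strikes
```

A subject written mostly in Cyrillic is perfectly normal for a Russian speaker, which is exactly why a test like this can only ever be one weighted signal.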
The message body
The actual body of the message is where things get interesting, and almost magical at times. This is where the phrase “looks like spam” really applies at its fuzziest since what looks like spam to one person might not look like spam to another. Spam filters fight this battle every day.
Just a few of the issues spam filters might look for in the message body include:
- Just a link. A very common phishing attempt of late is to send just a link in the email body, particularly when the message can be made to originate from a hacked email account. People, trusting that the sender is indeed one of their contacts, will often blindly click the link.
- Spammy topics. As you can imagine, there are topics – say, related to body enhancement, for one example – that are very common in spam, and very likely to be filtered as spam.
- Grammar and spelling. No, you and I are not perfect in this regard, but most spam is worse. The quality of the actual writing can be factored in as a sign of potential spam.
- Language and character set. Just like the subject line, messages in languages that are foreign to both the sender and the receiver are a possible sign of spam.
- HTML mail. Because HTML email can be abused in many ways to mislead the recipient, simply sending a message in HTML format is a sign of possible spam. It’s not much of a sign, since so many messages are HTML these days, but taken in conjunction with other characteristics, it can count against a message.
- Images. Images are common in spam, and thus can act as a sign of spam. In particular, email that is only an image, or email that is mostly images – geared to trick you into clicking “display images” – can be a red flag.
- Spacing. This is somewhat obscure, but I see it used a lot: the top part of a message body might be an explicit call to action for some spammer’s goal. But since it’s so clearly spammy content, they add a number of blank lines to the message and append non-spammy, often random, content at the end. The idea is that the presence of non-spammy content might tilt the balance in favor of the message not being spam, when it obviously is.
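Two of the body checks above, "just a link" and the blank-line padding trick, are simple enough to sketch. The three-word allowance and the five-blank-line cutoff are arbitrary values I chose for illustration.

```python
import re

URL = re.compile(r"https?://\S+", re.IGNORECASE)
# Six or more consecutive line breaks, i.e. five or more blank lines in a row.
BLANK_RUN = re.compile(r"(?:\n[ \t]*){6,}")

def body_strikes(body):
    """Return a list of body-related strikes (hypothetical names)."""
    text = body.strip()
    strikes = []

    # "Just a link": a URL with almost no other words around it.
    other_words = URL.sub("", text).split()
    if URL.search(text) and len(other_words) <= 3:
        strikes.append("just_a_link")

    # Spammy content up top, then a long gap before filler text.
    if BLANK_RUN.search(text):
        strikes.append("padded_with_blank_lines")
    return strikes
```

Topic, grammar, and image analysis are much harder problems and are where the "almost magical" statistical techniques come in; this sketch doesn't attempt them.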
By now, I’m sure you’ve heard about the “headers” in email that you don’t normally see. These are lines (much like the To: and Subject: lines) that include a bunch of technical information about how the email was routed and formatted, and in many cases, what a spam filter might have thought about it.
As you can imagine, spam filters can analyze some of the headers for clues. The most interesting is what I’d call the “chain of custody”.
The chain is nothing more than a sequence of information that looks something like this:
- I’m server A, and I got this message from server B to be delivered to my customer, firstname.lastname@example.org.
- I’m B and I got this from C.
- I’m C and I got this from D.
- I’m server D, and I originated this message.
Each of those steps is identified with an IP address and often a name. Now, while we can’t use an IP address to identify a specific source or person (and I have many articles on the topic), there are generalizations about the IP addresses in the chain of custody that can affect the probability of that message being spam.
- DNS. DNS maps names to IP addresses. So, if “server D”, for example, has a name, does it match the IP address? If not, that’s a strike against it. More basic still, is there any name associated with the IP address at all? If not, that’s typically a serious issue and a sign of spam.
- IP location. Does the location of the IP address that the message came from (“server D”) match where the email address supposedly exists? Email from your local ISP’s domain, for example, should never originate from a server in a foreign country.
- IP ownership. Does the source IP address of that message actually match the servers that are supposedly sending for that domain? For example, if that’s a message from a Gmail account, did it originate on a Gmail server?
- Chain of custody. Is the chain broken? For example, if the line “I’m C and I got this from D” wasn’t present, then the message somehow appears to have hopped from D to B without C recording anything. That’s highly suspicious and often a sign of header forgery.
- Chain reasonableness. As we travel from D to C to B to A … is the path the message took “reasonable”? Did the message appear to take an unnecessary trip through a foreign server? Once again, that’s a possible sign of header forgery and spam.
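The chain of custody lives in the message's Received: headers, with the most recent hop listed first. Here's a simplified sketch of pulling out the hops and checking that each one hands off to the next. Real Received: lines are far messier than this pattern assumes, and servers often announce names that don't match exactly, so treat this as an illustration only.

```python
import re
from email import message_from_string

# Simplified: "from <sender> ... by <receiver>"
HOP = re.compile(r"from\s+([^\s;]+).*?\bby\s+([^\s;]+)", re.IGNORECASE | re.DOTALL)

def custody_chain(raw_message):
    """Return (from_host, by_host) pairs, newest hop first."""
    msg = message_from_string(raw_message)
    hops = []
    for value in msg.get_all("Received", []):
        match = HOP.search(value)
        if match:
            hops.append((match.group(1).lower(), match.group(2).lower()))
    return hops

def chain_is_broken(hops):
    """True if a hop received the message from a server that the next-older
    hop does not list as its receiver."""
    for (sender, _), (_, older_receiver) in zip(hops, hops[1:]):
        if sender != older_receiver:
            return True
    return False
```

The DNS, location, and ownership checks would each take the IP addresses recorded in those same headers and look them up, which needs network access and is left out here.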
These are just examples, and made-up ones at that. But they should give you some idea of the analysis that’s possible when spam filters review the headers you don’t normally see.
SPF and DKIM
SPF and DKIM are alternately competing and cooperating standards that control aspects of mail content and delivery. As a (very gross) overgeneralization:
- SPF – Sender Policy Framework – is mostly about identifying servers that are allowed to originate email for given email domains. For example, only Yahoo! servers can originate email from Yahoo! email addresses, and Yahoo! has stated that anything not matching that should be considered spam.
- DKIM – DomainKeys Identified Mail – is mostly about using encryption and digital signatures to authenticate that the claimed sender of a message is the real sender of that message, and possibly also that the message content has not been tampered with. If the confirmation fails, that’s a possible sign of spam.
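The core idea behind the SPF check, "is this IP address allowed to send for this domain?", can be sketched with Python's standard `ipaddress` module. The table of allowed networks below is hypothetical and uses documentation-only address ranges; a real check looks up the domain's published SPF record in DNS and handles many more cases than this does.

```python
import ipaddress

# Hypothetical stand-in for published SPF records (documentation address ranges).
ALLOWED_NETWORKS = {
    "example.com": ["192.0.2.0/24", "2001:db8::/32"],
}

def spf_like_check(sender_domain, source_ip):
    """Return "pass", "fail", or "none" (the domain published nothing)."""
    networks = ALLOWED_NETWORKS.get(sender_domain.lower())
    if networks is None:
        return "none"
    ip = ipaddress.ip_address(source_ip)
    for cidr in networks:
        if ip in ipaddress.ip_network(cidr):
            return "pass"
    return "fail"
```

A "fail" here is the Yahoo! scenario described above: mail claiming to come from a domain, sent from a server that domain never authorized. DKIM verification involves cryptographic signatures and is best left to a dedicated library, so I haven't sketched it.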
One of the most potentially confusing things about spam filtering is, as I said earlier, what’s spam to you might not be spam to me – literally.
When we “mark as spam” in many email programs and on many email services, what we’re really saying is “Email like this is spam to me.”
Sophisticated email filtering systems can then use that specific email message (that you said is spam) to do two things:
- Analyze it to see what characteristics it has, and update the things that the filter looks at to check for spam for everyone. For example, if large numbers of people are marking a specific message as spam, then the filter tries to take that message’s characteristics into account as it does the analysis I’ve been talking about above.
- Use those same characteristics – perhaps a little more aggressively – to update the spam filter specifically for you. The net result is that you end up with a spam filter customized to your indication of what is and is not spam.
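As a toy illustration of those two uses of "mark as spam", the sketch below keeps one set of counts for everyone and another per user, and weighs the personal counts more heavily. The class, its method names, and the weight of 3.0 are all invented for this example; real services use far more sophisticated statistical models.

```python
from collections import Counter, defaultdict

class FeedbackFilter:
    """Toy model: spam reports update both a global and a per-user tally."""

    def __init__(self, personal_weight=3.0):
        self.global_counts = Counter()
        self.personal_counts = defaultdict(Counter)
        self.personal_weight = personal_weight  # "a little more aggressively"

    def mark_as_spam(self, user, characteristics):
        """Record the characteristics of a message this user flagged."""
        self.global_counts.update(characteristics)
        self.personal_counts[user].update(characteristics)

    def score(self, user, characteristics):
        """Higher scores mean the message looks more like spam to this user."""
        personal = self.personal_counts.get(user, Counter())
        return sum(
            self.global_counts[c] + self.personal_weight * personal[c]
            for c in characteristics
        )
```

After Alice marks one message as spam, a similar message scores higher for her than it does for Bob, even though Bob's filter has learned a little from her report too.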
Failure is always an option
Inevitably, here’s where things get disappointing.
As we’ve seen so far, spam filtering can be exceptionally complex.
And it can also be wrong.
Depending on the sophistication of the spam filter, depending on its ability to adapt not only to new spam as spammers try to weasel their way around the filter, but also to individual user preferences, and depending on its ability to do its job in a reasonable amount of time, spam filters run the range from pretty darned good (but not perfect), to relatively pointless.
Some spam will make it through. And some “ham” (legitimate mail – the opposite of spam) will occasionally end up in the spam folder.
Dealing with spam
My recommendation for dealing with spam remains as it has for some time:
- Train your email program or service’s spam filter: mark spam as spam, and make sure to mark those false positives you find in the spam folder as not-spam.
- Never reply to spam.
- Never try to unsubscribe from spam. (If you asked for the email by subscribing, then it’s not spam, and “unsubscribe” is the right way to stop it.)
And above all, don’t let spam stress you out. It’s a normal, everyday fact of life on today’s internet.