Why does so much spam have a part of some other email or document in it?

Question:

Do you happen to know what the gibberish writing at the bottom of
spam is? I periodically get spam in my regular mail with a normal
enough header, but at the bottom of some of these emails (not all)
there is something like a history story, or some funky strung-together
writing about a person or event or who knows what.

Yep.

It’s spammers being spammy.

It’s one of many techniques spammers use to try and slide by
automated spam filters.

Let’s look at how that works.


Spam filters are incredibly complicated analysis tools. Some, of course, are better than others, but they all work by looking at a variety of characteristics of each message they analyze.

Headers you see


The “To:”, “From:” and “Subject:” lines are examined for suspicious behaviors. Some spam filters will make sure that the email domains actually exist, for example, or flag messages that are “From:” suspicious sources.

This is one of the reasons you’ll often see spam “From:” yourself. You didn’t send it, but the spammer simply spoofed the “From:” line to make it look like you did. By definition, if you get it, then your email address is valid, and will pass many spam checkers as a valid “From:” address as well.

And of course most will look for “bad words” in the “Subject:” line. This is often why you’ll see spam with subjects that are completely unrelated to the body of the message, and that are in fact often worded so as to entice you to open the message.

Headers you don’t see

The full headers that accompany email messages contain a lot more information. Once again, spammers often falsify that information, and spam checkers will look for signs of that. Even when nothing is falsified, the full headers are also where the IP addresses of the email servers that routed the message can be found. There are many “blacklists” that contain the IP addresses of known spam sources, and many spam checkers will use these blacklists to determine if an incoming message is likely to be spam. (Sadly, with so many lists, they are also often prone to errors, missing some spammers and blacklisting honest sources by mistake.)
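To make the blacklist idea a little more concrete: many IP blacklists are published through DNS, and a filter checks an address by reversing its four numbers and looking the result up under the list’s domain. Here is a rough sketch of just the name-building step. The zone name dnsbl.example.org and the function name are made up for illustration, and no actual lookup is performed.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str = "dnsbl.example.org") -> str:
    """Build the DNS name a filter would look up for an IPv4 address.

    The zone is a placeholder, not a real blacklist.
    """
    # Validate the address; raises ValueError if it is not valid IPv4.
    addr = ipaddress.IPv4Address(ip)
    # DNS-based lists expect the octets in reverse order.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


print(dnsbl_query_name("192.0.2.99"))  # 99.2.0.192.dnsbl.example.org
```

A real filter would then ask DNS for that name; broadly speaking, getting an answer back means the address is on the list.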

The majority of full-header analysis is typically done by spam filtering solutions on mail servers, before the message ever reaches you.

The Message Body

Naturally, the message body is where the spam is most evident. Embedded pictures, bad words or intentional misspellings of bad words are all things that a spam filter can look at to determine if a message is in fact spam.

In fact, it would seem … obvious. I mean, you know what spam is when you see it, right?

The Dilemma

Computers are phenomenally stupid. They make up for it in speed, but at the core of the issue, they’re just dumb. They can parse, they can count and they can categorize, but they can’t understand. So we have to give them rules – often incredibly complex rules – that help them determine what is and is not spam.

For example, is a message that contains the word Viagra spam? How about if it’s mentioned twice? How about if it’s misspelled? If it comes from an overseas domain?

Maybe. Maybe not.

The classic case is of breast cancer discussion lists that lose a bunch of messages because they use the word breast. Spam? Probably not. But the word actually is in an awful lot of spam, so it has to be analyzed for the possibility.

The solution is that most spam filters don’t look at spam as either black or white – they formulate a guess as to “how spammy” it is, and then choose a threshold – anything over that threshold of spammyness is flagged as spam, and anything below it is not.
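Here’s a toy version of that idea, just to show the shape of it. The rules, the weights, and the threshold are all invented for this example; real filters use far more rules and tune them carefully.

```python
import re

# Hypothetical rules and weights, made up for illustration only.
RULES = [
    (re.compile(r"\bviagra\b", re.IGNORECASE), 2.5),
    (re.compile(r"\bfree\b", re.IGNORECASE), 1.0),
    (re.compile(r"!{3,}"), 1.5),  # three or more exclamation marks in a row
]

THRESHOLD = 3.0  # also an assumption; anything over this is flagged


def spam_score(message: str) -> float:
    """Add up the weight of every rule that matches the message."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def is_spam(message: str) -> bool:
    """Flag the message only if its score is over the threshold."""
    return spam_score(message) > THRESHOLD


print(spam_score("Free Viagra!!!"), is_spam("Free Viagra!!!"))
print(spam_score("My doctor asked about Viagra."), is_spam("My doctor asked about Viagra."))
```

The first message trips all three rules and lands well over the threshold. The second mentions the same word but scores too low to be flagged, which is the “maybe, maybe not” from above.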

And that’s where the off-topic text comes in.

A message that has a line or two about Viagra is likely to be analyzed as spam, since that’s all it talks about. However, a line or two about Viagra, followed by multiple paragraphs of boring and unrelated text? That’s harder to say. The spam filter can’t tell that the boring and unrelated stuff is in fact boring and unrelated. The message, as a whole, might actually be legitimate.

As a result, spammers are using that random text to tip the balance of the message’s spammyness in the eyes of many spam filters back into the “probably not spam” category.

Even though it is.
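You can see the effect with a deliberately naive scorer that only measures what fraction of a message’s words look spammy. This is a simplification I’m assuming for the sake of the example (real filters are much smarter), and the word list is made up.

```python
# Hypothetical list of suspicious words, for illustration only.
SPAMMY_WORDS = {"viagra", "cheap", "pills"}


def spammy_fraction(message: str) -> float:
    """Return the fraction of words in the message that are on the list."""
    words = [w.strip(".,!?").lower() for w in message.split()]
    words = [w for w in words if w]  # drop anything that was only punctuation
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in SPAMMY_WORDS)
    return hits / len(words)


pitch = "Cheap Viagra pills!"
# Boring, unrelated filler, repeated to stand in for several paragraphs.
padding = " The river rose slowly that spring and the town waited." * 5

print(spammy_fraction(pitch))            # every word is spammy
print(spammy_fraction(pitch + padding))  # same pitch, now a small fraction
```

The sales pitch hasn’t changed at all, but once it’s buried in filler the spammy words make up only a small slice of the message, and a filter that leans on that kind of ratio is more likely to let it through.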

Spam. It’s a war. Or a game of whack-a-mole. About the time one side gets better weapons, the other side gets better defenses. Repeat, ad nauseam.

5 comments on “Why does so much spam have a part of some other email or document in it?”

I once set a filter to block Cialis. It stopped an email with the word speCIALISt 😉 (capitalized for emphasis).

Mark, that’s a pretty clbuttic mistake made by filter writers. 🙂

http://www.telegraph.co.uk/news/newstopics/howaboutthat/2667634/The-Clbuttic-Mistake-When-obscenity-filters-go-wrong.html

That link’s broken (not sure where it should really go), so I’ll offer this instead: clubuttic.

– Leo
28-Apr-2009

I’ve noticed a lot of spammers will deliberately insert spaces inside the key words. For example, “Viagra” becomes “Vi agra”. Still readable, but the computer doesn’t recognize it, so it goes through.

Talking of spam, I see my GMail spam box is getting bigger by the day. Used to be little more than about ten in there when I checked once a week but I’ve just deleted fifty-four of the beasts in two days and one popped up before my very eyes! Anything to do with Conficker do you think?

I get MUCH, M-U-C-H more spam on my Yahoo account than I get on my hotmail account. Are there certain rules that you can use to get less spam? Like, “Don’t use a yahoo account” likely being one.
