Spam interception: Profiling vs. filtering

November 27 2007

Some say spam is like a bacterial infection: for every new antibiotic drug we invent, a germ will develop that resists it. Antispam services strive to keep one step ahead of the spammers while not blocking any legitimate (and possibly critical) "real" emails.

How do they do it? Tactics fall into two basic categories: profiling and filtering.

Profiling addresses the external characteristics of a message. The simplest form of profiling ushers through mail from addresses we know are legitimate ("whitelisting"), while blocking addresses associated with spam, malware, phishing attacks, and other forms of electronic postal fraud ("blacklisting").

Like police on patrol, antispam algorithms also keep their antennae up for things that "look wrong." Some messages have bad license plates (suspicious timestamps, invalid headers or time zones). Others show suspicious typographic patterns, such as headers in all caps.
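
The profiling checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular antispam service works: the addresses in WHITELIST and BLACKLIST, the one-day tolerance for future timestamps, and the profile function itself are all made up for the example.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical lists; a real service would maintain far larger ones.
WHITELIST = {"alice@example.com"}
BLACKLIST = {"promo@spam.example"}


def profile(raw_message, now=None):
    """Classify a message from its external traits only.

    Returns "allow", "block", "suspect", or "unknown" (nothing decisive,
    so the message would go on to content filtering).
    """
    now = now or datetime.now(timezone.utc)
    msg = message_from_string(raw_message)

    # Whitelisting and blacklisting by sender address.
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in WHITELIST:
        return "allow"
    if sender in BLACKLIST:
        return "block"

    # "Suspicious type patterns": a subject line written entirely in capitals.
    letters = [c for c in msg.get("Subject", "") if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "suspect"

    # "Bad license plates": a missing, unparseable, or far-future Date header.
    try:
        sent = parsedate_to_datetime(msg["Date"])
    except (TypeError, ValueError):
        return "suspect"
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)  # assume UTC when no zone is given
    if sent > now + timedelta(days=1):  # assumed tolerance of one day
        return "suspect"

    return "unknown"
```

A message that comes back as "unknown" has passed the external checks without being explicitly trusted, which is where content-based techniques would take over.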

Domain-name reputations help identify spam based on where a message originates. Spammers move around as they're detected, setting up shop in a newly registered Internet domain not already identified as a spam site. Like city neighborhoods, however, areas of the Internet have reputations. If a new domain has shady surroundings, the new site can be checked out and, if we're lucky, shut down before spamming begins.

Unfortunately, analyzing message structure and origin fails to snare all of the spam coming through the door. Hence, spam-blockers move on to the message itself, using filtering techniques to screen further.

Content scanning checks messages for suspect words and phrases, such as "Viagra" or "time-limited offer." Such scans are why spammers switch to alternate spellings, such as "v14gr4."
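
As a rough sketch of how a scanner might cope with such respellings, the Python below undoes a few common character substitutions before searching for suspect phrases. The substitution table, the phrase list, and the function names are assumptions made for illustration only.

```python
# Hypothetical substitution table: digits and symbols often used in place of letters.
LEET = str.maketrans({"1": "i", "4": "a", "3": "e", "0": "o", "5": "s", "@": "a", "$": "s"})

# Hypothetical phrase list, taken from the examples in the text.
SUSPECT_PHRASES = ("viagra", "time-limited offer")


def normalize(text):
    """Lowercase the text and map look-alike characters back to letters."""
    return text.lower().translate(LEET)


def suspect_phrases(text):
    """Return the suspect phrases found in the text after normalization."""
    cleaned = normalize(text)
    return [phrase for phrase in SUSPECT_PHRASES if phrase in cleaned]
```

For example, suspect_phrases("Cheap V14GR4 here") finds "viagra" even though the word never appears in that form. The trade-off is that normalization also rewrites innocent digits (a "4" in a phone number becomes an "a"), so a scanner along these lines would search the normalized copy and leave the original text untouched.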

Content filters also look at the URLs (web addresses) in messages and block those that fall into categories such as shopping, social networking, gambling, and adult content.
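
A URL check of this kind might look something like the following Python sketch. The category table, the set of blocked categories, and the example domains are invented for illustration; a real filter would draw on a much larger categorized database.

```python
import re
from urllib.parse import urlparse

# Hypothetical category database keyed by domain name.
DOMAIN_CATEGORIES = {"casino.example": "gambling", "shop.example": "shopping"}

# Hypothetical policy: which categories this particular filter blocks.
BLOCKED_CATEGORIES = {"gambling", "adult"}

# Deliberately simple pattern for http and https links in plain text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def blocked_urls(body):
    """Return (url, category) pairs for links that point to blocked categories."""
    hits = []
    for url in URL_PATTERN.findall(body):
        try:
            host = (urlparse(url).hostname or "").lower()
        except ValueError:
            continue  # malformed URL; skip it in this sketch
        # Check the host and each parent domain: www.casino.example, casino.example, example.
        parts = host.split(".")
        for i in range(len(parts)):
            category = DOMAIN_CATEGORIES.get(".".join(parts[i:]))
            if category in BLOCKED_CATEGORIES:
                hits.append((url, category))
                break
    return hits
```

Keeping the category table separate from the blocked set means the same lookup can serve different policies; in this sketch the shopping link is recognized but allowed through.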

"Heuristic" or Bayesian filtering uses statistical algorithms to estimate the likelihood that a message is spam, based on whether similar past messages turned out to be spam. For example, in one batch of messages, mail containing the word "sex" might have a 97% probability of being spam, whereas emails containing the word "sorry" might have only a 2% probability. By weighing many such characteristics together, the filter can predict the status of future messages with good accuracy.
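
To make the idea concrete, here is a minimal naive Bayes sketch in Python. It is a simplified illustration rather than any product's actual algorithm: the BayesFilter class, the add-one smoothing, and the assumption that spam and legitimate mail are equally likely beforehand are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))


class BayesFilter:
    """Toy Bayesian filter that learns per-word spam probabilities."""

    def __init__(self):
        self.spam_counts = Counter()  # number of spam messages containing each word
        self.ham_counts = Counter()   # number of legitimate messages containing each word
        self.spam_total = 0
        self.ham_total = 0

    def train(self, text, is_spam):
        """Record one labeled message."""
        words = tokenize(text)
        if is_spam:
            self.spam_counts.update(words)
            self.spam_total += 1
        else:
            self.ham_counts.update(words)
            self.ham_total += 1

    def word_probability(self, word):
        """Estimate P(spam | word), with add-one smoothing and equal priors."""
        in_spam = (self.spam_counts[word] + 1) / (self.spam_total + 2)
        in_ham = (self.ham_counts[word] + 1) / (self.ham_total + 2)
        return in_spam / (in_spam + in_ham)

    def spam_probability(self, text):
        """Combine the per-word estimates into one probability for the message."""
        log_odds = 0.0
        for word in tokenize(text):
            p = self.word_probability(word)
            log_odds += math.log(p) - math.log(1 - p)
        log_odds = max(min(log_odds, 700.0), -700.0)  # keep math.exp within range
        return 1 / (1 + math.exp(-log_odds))
```

After a few calls to train with labeled spam and legitimate messages, spam_probability returns a number between 0 and 1 for a new message. Words the filter has never seen score exactly 0.5 and so leave the result unchanged, while words seen mostly in spam push the result toward 1.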