Content filtering and spam in 2022

Content filtering

When people talk about spam filtering, they normally think of Bayesian probability theory, content scanning or SpamAssassin: regular expressions or pattern matching used to stop annoying email messages.

The entire world of UBE and UCE, unsolicited bulk e-mail and unsolicited commercial e-mail, with its get-rich-quick schemes, Nigerian 419 scams, and Viagra or other private organ enhancement offers, shows no sign of dying down even in 2022.

With the advent of phishing, Internet-based digital currencies, easy online money transfers and massive connectivity to online services from every nook and corner of the planet, the thieves who want your hard-earned money for free are also playing games to defraud you.

The widespread phishing nets that people fall into worldwide cause losses of several millions of dollars every year, and this shows no sign of reducing or dying down.

Despite all the end user training and secure email gateways, there seems to be no change in the threat posed by those wanting to earn a fast buck.

In terms of police action, law enforcement and legal statutes, nothing much has helped, since most people who fall for such scams are themselves guilty at some level, if of nothing else then of being foolish or greedy.

Fool and his money

We learned in school that a fool and his money are soon parted. This is even more true in the age of the Internet, and most people are too ashamed to talk about it in public.

Anybody with a smartphone has Internet access and uses e-mail. I get at least 10 emails daily promising me lottery funds or some other windfall of several millions. I know I can never make money for free, so I do not proceed.

But there is a very small minority among us who fall for such tricks and are none the wiser. Some even travel abroad and get killed.

What can we do?

It is not uncommon for people to get into serious trouble, and criminals do get caught from time to time worldwide. But Internet scams are typically borderless, and it is very hard to bring criminals to book as they are often in a different jurisdiction.

This article tries to analyze whether such scams, phishing, spear phishing and other forms of business e-mail compromise can actually be prevented by content-scanning secure email gateways or by some home-brew open source spam control mix.

The idea is puerile to begin with.

There is no way content scanning is going to unearth such scams. If it could, companies would not be spending billions of dollars building and maintaining spam filters and API integrations to protect their network assets.

In most cases e-mail is only an entry point into private networks that are otherwise walled off from the Internet. Once malware that arrived by e-mail gains entry into such an unreachable network, it can run on other machines there and cause damage.

rspamd and Bayes filtering

Rspamd (used internally by SpamCheetah) uses statistics to decide the class of a message: either spam or ham. The overall algorithm is based on Bayes' theorem, which defines how probabilities are combined. In general, it gives the probability that a message belongs to a given class (namely, spam or ham) based on the following factors:

The probability of a specific token being spam or ham (which effectively means the count of the token's occurrences in spam and ham messages).

The probability of a specific token appearing in a message (which effectively means the frequency of the token divided by the number of tokens in the message).
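To make the two factors above concrete, here is a minimal naive Bayes sketch in Python. It is not Rspamd's actual implementation: the whitespace tokenizer, the Laplace smoothing constants and the four-message corpus are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Count, per class, in how many messages each token occurs."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in messages:
        totals[label] += 1
        # A set is used so a token is counted once per message.
        counts[label].update(set(text.lower().split()))
    return counts, totals


def spam_probability(text, counts, totals):
    """Combine per-token evidence in log space and return P(spam | tokens)."""
    # Start from the prior odds of spam versus ham in the training data.
    log_odds = math.log(totals["spam"] / totals["ham"])
    for token in set(text.lower().split()):
        # Laplace smoothing (an assumption of this sketch) keeps unseen
        # tokens from producing a zero probability.
        p_spam = (counts["spam"][token] + 1) / (totals["spam"] + 2)
        p_ham = (counts["ham"][token] + 1) / (totals["ham"] + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical toy corpus; a real filter is trained on many thousands of mails.
corpus = [
    ("spam", "win lottery money now"),
    ("spam", "claim your lottery prize money"),
    ("ham", "meeting agenda for monday"),
    ("ham", "please review the project report"),
]
counts, totals = train(corpus)
```

With this toy corpus, spam_probability("lottery money", counts, totals) comes out well above one half and spam_probability("project meeting", counts, totals) well below it. A scammer who writes a polite, well-worded business letter lands on the ham side just as easily.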

Shingles algorithm and friends

The shingles algorithm is used in computing mainly for textual analysis. In spam filtering it detects spammy content that has been slightly modified to evade spam filters.
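The idea can be sketched as follows, assuming word-level shingles of width 3 compared with Jaccard similarity. Production filters usually hash the shingles and keep only a small fingerprint; the function names and sample messages here are hypothetical.

```python
def shingles(text, w=3):
    """Return the set of overlapping w-word sequences in the text."""
    words = text.lower().split()
    if len(words) < w:
        # Texts shorter than one shingle are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Similarity of two shingle sets: size of overlap over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "claim your free lottery prize today by sending your bank details"
variant = "claim your free lottery prize now by sending your bank details"
unrelated = "the quarterly report is attached for your review before the meeting"

similar = jaccard(shingles(original), shingles(variant))
different = jaccard(shingles(original), shingles(unrelated))
```

Changing a single word leaves half of the shingles intact here, so the variant still scores 0.5 against the original while the unrelated message scores 0.0. A filter can therefore treat the variant as a near duplicate of known spam.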

What is wrong with content scanning?

Fundamentally, an e-mail is identified as spam or ham not based on its content but based on how it is sent: whether it is machine sent or human originated, whether the mail server that sent it is trusted, whether the SPF record of the domain it claims authorizes that mail server, and so on.

The presence of a DKIM signature and other such safety measures is also taken into account, alongside RBL checks against currently known spam sources, and plenty of such metrics are applied to make the decision.
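As an illustration of the RBL check, DNS-based blocklists are queried by reversing the octets of the sending IPv4 address and appending the list's zone; an answer means the address is listed. The sketch below assumes a made-up zone, dnsbl.example.org, and the helper names are hypothetical.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blocklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the blocklist zone publishes an A record for the IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: the address is not on this list.
        return False


# dnsbl.example.org is a placeholder, not a real blocklist.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

Note that nothing in this check ever looks at the body of the message.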

Content scanning is also added to the mix to ensure that telltale signs of clear scamming are not present in the email body.

Malware attachments and phishing URLs in an HTML MIME part are cause for concern as well.

Such things are identified using cloud based API services that are updated by the minute and databases of known spam sources, active phishing links and newest malware families.

It is not easy to make out whether a mail is legitimate or spammy based on content alone. If this article does not convince you of that, I wonder whether any other article will.

The future

Mathematically speaking, we live in a day and age where Bayesian content analysis with Markov chains is still effective, and machine learning and AI are yet to show their worth compared to these simple techniques.

The focus has shifted over from the mail body and content to the mechanics of spam sending.

Most email related annoyances are despatched by machines, from mail servers or domains with poor reputation and low reliability scores. Secure email gateways capitalize on such information and leverage active network maps of known botnets, security compromises, bogons and so on to figure out whether a message from a certain block of IP addresses is to be trusted.

The ability to detect the source of an attack is vital in many cases. It takes a long time to establish trust, and even after gaining trust a mail server could get compromised through some software bug or firewall rule.

Greylisting as a technique to detect spam has been pretty effective in figuring out whether a mail server is playing the volume game, sending out massive amounts of spam, or whether it is standards compliant and retries the delivery of each mail message.

It is not practical to impose greylisting delays on every email sent by a mail server. So once a mail server is known to be trusted or has proved that it is legitimate, SpamCheetah assumes that the messages it sends are also genuine, but this is not always true.
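A minimal greylisting sketch might look like the following. It is not SpamCheetah's implementation: the triplet of client IP, sender and recipient, the 300 second delay and the rule that one successful retry makes the server trusted are all assumptions for illustration.

```python
import time


class Greylist:
    """Temporarily reject first-time senders and trust servers that retry."""

    def __init__(self, delay=300, clock=time.time):
        self.delay = delay      # seconds a new triplet must wait (assumed)
        self.clock = clock      # injectable so the logic can be tested
        self.first_seen = {}    # (ip, sender, recipient) -> first attempt time
        self.trusted = set()    # client IPs that have proved they retry

    def check(self, client_ip, sender, recipient):
        """Return 'accept' or 'tempfail' for one delivery attempt."""
        if client_ip in self.trusted:
            # Known good servers skip the delay entirely.
            return "accept"
        triplet = (client_ip, sender, recipient)
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.delay:
            # The server queued the message and retried after the delay,
            # as a standards compliant SMTP server would.
            self.trusted.add(client_ip)
            return "accept"
        # A real gateway would answer with a 4xx SMTP temporary failure here.
        return "tempfail"
```

The trusted set is what removes the delay for later messages, and it is also the weak spot mentioned above: a server that was legitimate yesterday stays trusted after it has been compromised.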

To stop attacks that exploit such temporary compromises, secure e-mail gateways typically employ multiple levels of checks, applying a set of rules to filter e-mail messages in real time as SMTP messages are delivered to user inboxes.
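One way to picture such layered checks is a score to which every check contributes. The fields, weights and thresholds below are invented for illustration and do not describe any particular gateway.

```python
from dataclasses import dataclass


@dataclass
class Message:
    spf_pass: bool        # sending server authorized by the domain's SPF record
    dkim_valid: bool      # DKIM signature present and verified
    rbl_listed: bool      # sending IP found on a blocklist
    content_score: float  # 0.0 (ham-like) to 1.0 (spam-like) from the scanner


def classify(msg, reject_at=5.0, quarantine_at=3.0):
    """Add up hypothetical weights per failed check and pick a verdict."""
    score = 0.0
    if msg.rbl_listed:
        score += 4.0
    if not msg.spf_pass:
        score += 2.0
    if not msg.dkim_valid:
        score += 1.0
    # Content is only one signal among several and is weighted accordingly.
    score += 2.0 * msg.content_score
    if score >= reject_at:
        return "reject"
    if score >= quarantine_at:
        return "quarantine"
    return "deliver"


spammy_text_good_sender = Message(True, True, False, 0.9)
clean_text_bad_sender = Message(False, False, True, 0.1)
```

With these assumed weights, the first message is delivered despite its spam-like wording, while the second is rejected on delivery mechanics alone.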

This level of care is necessary to combat spam. Content scanning alone, whether by tokenizing and applying Bayesian theory to closely related words, by spotting poor English, or by whatever else content scanning algorithms come up with, is scarcely enough.

Threat of false positives

In addition to content filters and the math behind them being woefully inadequate in most cases, false positives and false negatives, in which a legitimate mail is filed away as spam and a spam lands in the inbox respectively, are both very likely too.

Once the internal metrics that drive content scanning decisions are skewed by faulty user feedback, with users wrongly marking messages as spam or as ham, the result is chaos and danger.

SpamCheetah has the ability to use mailbots to mark a message as spam. You can read more about it here. This makes it quite convenient to train the spam content scanner.

That said, spam filtering typically works on the mechanics of e-mail sending, standards compliance and other characteristics of the email delivery pipeline, and not really on what is inside the payload of the email message itself.


The war against email messages that are not human generated, not from a known source, or not in the end user's interest has been ongoing, and content filtering is one of the tools in the mix.

It is neither a vital element nor the only game in town, and my guess is that its importance will be watered down further and further over time. It already has been.

The idea that a pattern or regular expression can identify an e-mail as dangerous for human consumption is not quite sensible if you think about it. It made sense long ago, before we had access to other means of identifying spam.

Targeted attacks that are human generated to steal your money, based on your company's situation and power structure, definitely cannot be fixed by using content scanners.

It is like saying that a robot can identify what a human can identify. There is a place for robotic reading of mail to identify spam, but such rulebooks and cookie cutter approaches have been superseded by machine learning techniques and AI in 2022.

That said, most SEG vendors do not depend on AI to identify spam. It is quite difficult to train models on a huge corpus of emails, and it takes several gigabytes of training data to make those algorithms work.

Instead, when you can easily identify known sources of spam and compromised networks, what would you choose to do if you were developing a spam filter?

Think. 😊

