SpamCheetah is first of all an email security gateway product. That means it must not only stop spam but also let legitimate mail flow.
If the product stops spam well but legitimate mail does not arrive either, then it is a failure and useless.
If however it allows all mails including spam traffic, then also it is worthless.
We need to hit middle ground.
How SpamCheetah does this using third party plugins and tools is described in this blog article.
One of the very first things SpamCheetah does in order to combat spam is ensuring that the DNS verifications check out.
This means a host of queries like reverse DNS and MX check, HELO check, RFC2821 compliance and so on.
It is important for mail senders to use an FQDN in their HELO string.
It is vital for mail senders to have MX records to receive bounces.
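To make the HELO requirement concrete, here is a minimal Python sketch of a purely syntactic FQDN test. It is an illustration only, not SpamCheetah's code (which is written in C); the function name `helo_is_fqdn` is hypothetical, and no DNS query is made, so a name that passes may still fail the reverse DNS or MX checks.

```python
import ipaddress
import re

# One DNS label: 1-63 letters, digits or hyphens, no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def helo_is_fqdn(helo: str) -> bool:
    """Return True if a HELO/EHLO argument looks like a fully qualified domain name.

    Hypothetical helper: syntax only, nothing is resolved.
    """
    name = helo.strip().rstrip(".")
    if not name or len(name) > 253:
        return False
    # Bare IP addresses and address literals such as [192.0.2.1] are not FQDNs.
    try:
        ipaddress.ip_address(name.strip("[]"))
        return False
    except ValueError:
        pass
    labels = name.split(".")
    # A single label such as "localhost" is not fully qualified.
    if len(labels) < 2 or labels[-1].isdigit():
        return False
    return all(_LABEL.fullmatch(label) for label in labels)
```

A sender announcing itself as `mail.example.com` passes, while `localhost` or a bare IP address does not.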
DNS checks are all done asynchronously, without blocking on each response. SPF queries in particular can be slow, because evaluating one record may trigger further lookups for the mechanisms it references. Email and DNS are tightly bound together, as anyone in the email technology field is aware.
But as all the code is written in C, the checks are performed at wire speed.
One of the interesting checks here is the sender score check. This is a very vital piece for spam control as most good mail servers have a high reputation and many don’t have a reputation score at all.
There is a really long list of techniques spam control products deploy.
There is stuff like SPF, DKIM, DMARC, ARS, SRS and the list goes on ad infinitum.
In this blog, we are exploring how some of these help SpamCheetah do its job of effectively keeping spammers at bay and delivering peace of mind to you at a nominal price.
Already SpamCheetah employs many of the techniques all standard mail filters use and many more.
We have already discussed these techniques in enough detail here
In this blog, however, we are going to look in some depth at the other standard techniques SpamCheetah uses to augment its spam detection engine.
DNS checks are one of the most important components of SpamCheetah, just as they are for any other email security product.
Well, SpamCheetah is not in the business of delaying your mail on its way to the inbox. If anything, it is keen on getting your emails to you on time and without hassle, since an inbox full of unwanted junk mail is a pain and most of your time ends up being spent clearing out the junk.
But an unavoidable side effect of greylisting, which is a highly effective method to combat spam, is that first-time mail senders experience a delay before their mail reaches the inbox.
Usually this is not a problem but sometimes it can be.
If you feel as the admin that the price you pay for greylisting is not justified by its spam catch rate, you can turn it off using the web interface and things work as usual.
Previously when SpamCheetah was first created it was built around greylisting. But today it is one of the various techniques it employs to fight spam.
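For readers unfamiliar with greylisting, the idea can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not SpamCheetah's implementation: the class name `Greylist`, the 300 second delay and the in-memory dictionary are all hypothetical choices.

```python
import time


class Greylist:
    """In-memory greylist keyed on (client IP, envelope sender, envelope recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay   # seconds a new triplet must wait (assumed value)
        self.clock = clock           # injectable so the example can be tested
        self.first_seen = {}         # triplet -> time of the first delivery attempt

    def check(self, client_ip, mail_from, rcpt_to):
        """Return 'accept' or 'tempfail' for this delivery attempt."""
        triplet = (client_ip, mail_from.lower(), rcpt_to.lower())
        now = self.clock()
        # Remember when this triplet was first seen; keep the old time on retries.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "accept"
        # A proxy would answer with a temporary (4xx) SMTP error here.
        return "tempfail"
```

A well-behaved mail server retries after a temporary failure and gets through on a later attempt, which is exactly the delay first-time senders notice; much spam software never retries.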
An RBL, or real-time blackhole list, is a list of IP addresses known to send spam. Whenever SpamCheetah receives a mail attempt from an IP address, the RBL is checked. This is moderately effective, as sometimes even good mail servers can become an open relay due to misconfiguration and end up listed.
RBL checks are inexpensive, since the mail is rejected at connection time, before the blacklisted IP address has transferred any message data.
RBL APIs are very commonly available and SpamCheetah uses one of the most effective ones. The RBL data is cached in an internal linked list to speed up further processing.
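As an illustration of a cached RBL lookup, here is a Python sketch that assumes the list is queried over DNS the way most public blocklists are: the IPv4 octets are reversed, the list's zone is appended, and a name that resolves means "listed". The zone `dnsbl.example.org`, the names `dnsbl_query_name`, `dns_listed` and `RblCache`, and the use of a dictionary instead of a linked list are all assumptions made for the example.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    addr = ipaddress.IPv4Address(ip)   # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def dns_listed(name: str) -> bool:
    """Treat a name that resolves as 'listed' and a failed lookup as 'not listed'."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False


class RblCache:
    """Remember verdicts so a repeat connection does not trigger a new query."""

    def __init__(self, zone, lookup=dns_listed):
        self.zone = zone
        self.lookup = lookup   # injectable, so tests need no network access
        self.cache = {}        # ip -> bool (the article mentions a linked list)

    def is_listed(self, ip):
        if ip not in self.cache:
            self.cache[ip] = self.lookup(dnsbl_query_name(ip, self.zone))
        return self.cache[ip]
```

A production cache would also expire entries after some time, since addresses get delisted; that is left out here to keep the sketch short.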
SpamCheetah depends on third-party URL scanning APIs to detect spammy or malicious URLs. If your mail body contains them, the whole message is dropped or quarantined.
Right now the mechanism of simply removing the offensive URLs and passing the message does not exist.
Is it meaningful to do so?
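The URL check described above can be sketched roughly as follows. This is a simplified Python illustration: a local set of bad hostnames stands in for the third-party scanning API, and the names `extract_hosts` and `url_verdict` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple pattern: http(s) URLs up to whitespace, quotes or brackets.
URL_RE = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def extract_hosts(body: str) -> set:
    """Return the lower-cased hostnames of all http(s) URLs found in the body."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:          # malformed URL, e.g. a broken IPv6 literal
            continue
        if host:
            hosts.add(host.rstrip("."))
    return hosts


def url_verdict(body: str, bad_hosts, on_match: str = "quarantine") -> str:
    """Return on_match ('quarantine' or 'drop') if any URL host is listed, else 'pass'."""
    return on_match if extract_hosts(body) & set(bad_hosts) else "pass"
```

Note that the verdict applies to the whole message, matching the current behaviour in which offending URLs are not stripped individually.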
Attachment scanning in SpamCheetah involves a lot of work. The MIME content is decoded and the attachment file extracted, then its SHA-256 hash is computed and checked against a list of known malware hashes.
The malware scanning only happens when the mail has attachments.
Of course malware can also be downloaded from URLs, which the previous section discusses.
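The decode, extract and hash steps can be illustrated with Python's standard `email` and `hashlib` modules. This is only a sketch of the idea, not SpamCheetah's C code; the function names and the set of known-bad hashes are hypothetical.

```python
import email
import hashlib
from email import policy


def attachment_hashes(raw_message: bytes) -> dict:
    """Map each attachment filename to the SHA-256 hex digest of its decoded bytes."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    digests = {}
    for part in msg.walk():
        # Containers hold other parts; only leaf parts carry a payload.
        if part.is_multipart() or part.get_content_disposition() != "attachment":
            continue
        payload = part.get_payload(decode=True)   # undoes base64 / quoted-printable
        if payload is None:
            continue
        digests[part.get_filename() or "unnamed"] = hashlib.sha256(payload).hexdigest()
    return digests


def has_known_malware(raw_message: bytes, known_bad_hashes) -> bool:
    """True if any attachment digest appears in the (hypothetical) bad-hash set."""
    return any(d in known_bad_hashes for d in attachment_hashes(raw_message).values())
```

A message without attachments yields an empty dictionary, so no hash comparison takes place, in line with scanning only happening when attachments are present.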
ClamAV is the most popular antivirus software in the open source world, and several spam control products use it to effectively filter out not only viruses/worms/trapdoors but also malware.
SpamCheetah uses a combination of rspamd and ClamAV to keep your INBOX clean.
ClamAV has been found to be incredibly stable and effective and its signatures very current and up-to-date.
Content filtering is the use of a program to filter email deemed worthless. Content filtering works by specifying content patterns – such as text strings or objects within images – that, if matched, indicate undesirable content that is to be left out.
Text scanning serves a security purpose, and in our case a vital one. However, relying heavily on content scanning for spam control is not very wise, for several reasons.
First, what is spam for one may not be spam for another.
So relying on user feedback to train the content filters is not a very good thing to do.
Moreover, when there are several other methods to figure out spamminess, why depend on this one? Particularly when it is known to cause more misery than happiness.
Allowing pornographic content into the workplace can put a company at risk for sexual harassment claims, or otherwise create a hostile or demeaning work environment.
Hate filled or violent content can compromise employee safety and also reflect poorly on the company as a whole.
Malware sites can lead to malware or other malicious software being installed onto work computers that serve as trapdoors or spread worms.
Social networking sites can reduce productivity and distract employees from routine tasks.
The scope of spam filtering is broad, and URL scanning/filtering, aka SURBL, can often be a strong step towards securing all users who access the web.
In statistics, naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes' theorem with strong (naïve) independence assumptions between the features (see Bayes classifier). They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve higher accuracy levels.
Naïve Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers.
In the statistics and computer science literature, naive Bayes models are known under a variety of names, including simple Bayes and independence Bayes. All these names reference the use of Bayes' theorem in the classifier’s decision rule, but naïve Bayes is not (necessarily) a Bayesian method.
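To show what this looks like for spam filtering, here is a toy multinomial naive Bayes classifier in Python with add-one (Laplace) smoothing. It is a teaching sketch with a hypothetical class name and a made-up training set, not the Bayes module rspamd ships, and it assumes both classes have been trained at least once.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy two-class (spam/ham) word-count classifier."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, label: str, text: str) -> None:
        # Training is just counting, which is why it takes linear time.
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, label: str, text: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        # log P(label) + sum of log P(word | label); words are treated as independent.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in text.lower().split():
            score += math.log((self.word_counts[label][word] + 1) / (total + len(vocab)))
        return score

    def classify(self, text: str) -> str:
        return max(("spam", "ham"), key=lambda label: self.log_score(label, text))
```

Usage: after calling `train("spam", ...)` and `train("ham", ...)` on a handful of labelled messages, `classify(text)` returns whichever label has the higher log score. Working in logarithms avoids the underflow that multiplying many small probabilities would cause.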
The shingles algorithm is a text analysis technique, used here to detect spammy content that has been slightly modified in order to evade spam filters.
Here is the lowdown on it.
The text section must be plain text. For websites, the content is in HTML code. In other words, in order to be able to apply the algorithm meaningfully to the text, all code and any formatting must be removed. In addition, it is also possible to delete fill words, which can be used to extend text artificially, for example, “nevertheless.”
Shingles are overlapping word sequences taken from the text, each a fixed number of words long. They overlap one another much like roof shingles. Here is a short example with length 3, using the sentence
The quick brown fox jumped over the lazy dog
Shingle 1 = the quick brown
Shingle 2 = brown fox jumped
Shingle 3 = jumped over the
Shingle 4 = the lazy dog
The shingle length matters. If it is too long, duplicates are overlooked. If it is too short, a text may too readily be evaluated as duplicate content.
A simple calculation is sufficient to determine how closely two texts match. First the shingles that occur in both texts (the intersection) and the distinct shingles that occur in either text (the union) are determined. The match percentage is then calculated by dividing the number of shared shingles by the total number of distinct shingles.
If two exactly identical texts are compared, the result is 1 and thus a 100% match. If no single shingle is identical, the counter will show 0, in other words a result of 0%.
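The shingle comparison can be sketched in Python as below. The function names are hypothetical; the `step` parameter is an assumption added so that the example in the text, whose shingles advance two words at a time, can be reproduced, whereas a step of 1 yields every overlapping window.

```python
import re


def shingles(text: str, size: int = 3, step: int = 1) -> set:
    """Return the set of word shingles of the given size, taken every `step` words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        " ".join(words[i:i + size])
        for i in range(0, len(words) - size + 1, step)
    }


def similarity(a: str, b: str, size: int = 3, step: int = 1) -> float:
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    sa, sb = shingles(a, size, step), shingles(b, size, step)
    union = sa | sb
    if not union:
        return 0.0   # both texts too short to produce a shingle
    return len(sa & sb) / len(union)
```

Calling `shingles("The quick brown fox jumped over the lazy dog", size=3, step=2)` yields the four shingles listed in the example, and comparing a text with itself gives 1.0.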
rspamd is a really fantastic tool. Without rspamd and its internal modules, SpamCheetah would have to reinvent the wheel and do a lot of the mundane things that well-known open source tools have been known to excel at for decades.
With rspamd in place, SpamCheetah can focus on its key task of SMTP proxying, while content analysis, textual models, Bayesian probability analysis and so on are performed by a third-party plugin.
Once rspamd is happy the mail is passed.
If not, it is stopped either by dropping or quarantining as per admin configuration in the web interface.
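The pass, drop or quarantine decision can be pictured with a small Python sketch. It assumes rspamd is queried over its HTTP protocol, whose JSON reply carries an `action` field; treating only the `reject` action as spam, and the function name `decide`, are assumptions made for the example rather than SpamCheetah's documented behaviour.

```python
import json

# Assumption for this sketch: only rspamd's "reject" action stops the mail.
SPAM_ACTIONS = {"reject"}


def decide(rspamd_reply: str, on_spam: str = "quarantine") -> str:
    """Turn an rspamd JSON reply into 'pass', 'quarantine' or 'drop'.

    on_spam mirrors the admin's choice in the web interface (hypothetical name).
    """
    if on_spam not in ("quarantine", "drop"):
        raise ValueError("on_spam must be 'quarantine' or 'drop'")
    result = json.loads(rspamd_reply)
    return on_spam if result.get("action") in SPAM_ACTIONS else "pass"
```

For instance, a reply of `{"action": "no action", "score": 0.3}` leads to `pass`, whereas `{"action": "reject", "score": 20.1}` leads to whatever the admin configured.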
rspamd can be augmented with several other plugins and backends too but there is no point in duplicating the work that SpamCheetah natively supports.
rspamd is known to be considerably more scalable than SpamAssassin, owing to the programming language in which it is written as well as other design features.
rspamd is not yet as widely used, but over time usage should pick up thanks to its effectiveness and speed of operation.
To fight spam and stay up 24/7 to deliver mail, SpamCheetah must not only be effective at combating spam and letting mail flow; it must also implement company-wide policies such as adding disclaimers or blocking certain senders, recipients or MIME types.
SpamCheetah's ability to deliver mail at wire speed despite performing this wide variety of tasks comes down to its design and its programming language.
We know of one competing email product that uses Node.js for this purpose, using streams and what not. It is our firm belief that Node.js, for all its benefits, cannot outperform C at such tasks.