The name SpamAssassin has been well established in the mail world for around 20 years. But it is written in Perl and is now owned by Apache, which we see as fairly commercial, so in our view using a SpamAssassin backend is not wise. Many sites use it though, and it has had a lot of exposure. Our concern is simple: we do not expect a Perl program to scale well under load.
So SpamCheetah counts on the new kid on the block, Rspamd.
Rspamd is significantly faster and more stable, is even backward compatible with SpamAssassin (it can load SpamAssassin rules), and comes loaded with defaults that simply work.
Rspamd is a spam filtering system that allows evaluation of messages by a number of rules including regular expressions, statistical analysis and custom services such as URL black lists. Each message is analysed by Rspamd and given a spam score.
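The scoring idea described above can be sketched in a few lines of Python. This is only an illustration of "rules add up to a score", not Rspamd's code: the rule names, weights, blacklist contents and the `score_message` helper are all made up for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str                       # symbol reported when the rule fires
    weight: float                   # score added when the rule matches
    matches: Callable[[str], bool]  # predicate over the raw message text

# Hypothetical URL blacklist; a real deployment would query an external list.
URL_BLACKLIST = {"spam.example"}

def has_blacklisted_url(text: str) -> bool:
    """Return True if any http(s) URL in the text points at a blacklisted host."""
    hosts = re.findall(r"https?://([^/\s]+)", text)
    return any(host.lower() in URL_BLACKLIST for host in hosts)

def subject_all_caps(text: str) -> bool:
    """Return True if the Subject line is at least 10 capitals, spaces or '!'."""
    return re.search(r"^Subject: [A-Z !]{10,}$", text, re.M) is not None

# Illustrative rules only: one regular expression rule, one URL blacklist rule.
RULES = [
    Rule("SUBJ_ALL_CAPS", 1.5, subject_all_caps),
    Rule("URL_BLACKLISTED", 5.0, has_blacklisted_url),
]

def score_message(text: str) -> tuple[float, list[str]]:
    """Sum the weights of all rules that fire and list their names."""
    fired = [rule for rule in RULES if rule.matches(text)]
    return sum(rule.weight for rule in fired), [rule.name for rule in fired]

spam = "Subject: BUY NOW CHEAP!!!\n\nVisit http://spam.example/offer"
ham = "Subject: Lunch tomorrow?\n\nSee you at noon."
print(score_message(spam))
print(score_message(ham))
```

The higher the total, the more likely the message is spam; what happens at a given score (reject, greylist, add a header) is a separate policy decision.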
Rspamd is modern, has different goals, and supports plenty of modules and configurations, even for talking to third-party commercial providers like VirusTotal.
SpamCheetah however does not depend on any commercial tool and relies totally on open source technologies and code developed in-house.
Rspamd runs significantly faster than SpamAssassin while providing approximately the same quality of filtering. On the other hand, if you have a lot of custom rules, or you use
Yes, of course, as is evident from the above.
This is the complete list.
But SpamCheetah does not use many of them internally.
Here are the details:
|antivirus||integrates virus scanners (requires configuration)|
|arc||checks and signs ARC signatures|
|asn||looks up ASN-related information|
|clickhouse||pushes scan-related information to clickhouse DBMS|
|bayes_expiry||provides expiration of statistical tokens (requires Redis|
|dcc||performs DCC lookups to determine message bulkiness (requires|
|dkim_signing||adds DKIM signatures to messages (requires configuration)|
|dmarc||performs DMARC policy checks (requires Redis & configuration for|
|elastic||pushes scan-related information to Elasticsearch. (requires|
|emails||extracts emails from a message and checks them against DNS blacklists. (requires configuration)|
|force_actions||forces actions if selected symbols are detected|
|greylisting||allows delaying suspicious messages (requires Redis)|
|history_redis||stores history in Redis (requires Redis)|
|ip_score||dynamically scores sender reputation (requires Redis). This module is removed since Rspamd 2.0 and replaced by the reputation module. The existing configuration is automatically converted by Rspamd.|
|maillist||determines the common mailing list signatures in a message.|
|metadata_exporter||pushes message metadata to external systems|
|metric_exporter||pushes statistics to external monitoring systems|
|mid||selectively suppresses invalid/missing message-id rules|
|milter_headers||adds/removes headers from messages (requires|
|mime_types||applies some rules about mime types met in messages|
|multimap||a complex module that operates with different types of maps.|
|neural networks||allows post-processing of messages using neural network classification. (requires Redis).|
|once_received||detects messages with a single Received header and performs some additional checks for such messages.|
|phishing||detects messages with phished URLs.|
|ratelimit||implements the leaky bucket algorithm for ratelimiting (requires Redis & configuration)|
|replies||checks if an incoming message is a reply for our own message|
|rbl||a plugin that checks messages against DNS runtime blacklists.|
|reputation||a plugin that manages reputation evaluation based on|
|rspamd_update||load dynamic rules and other Rspamd updates (requires|
|spamassassin||loads SpamAssassin rules (requires configuration)|
|spf||performs SPF checks|
|trie||uses suffix trie for extra-fast patterns lookup in messages.|
|whitelist||provides a flexible way to whitelist (or blacklist) messages based on SPF/DKIM/DMARC combinations|
|url_redirector||dereferences redirects (requires Redis configuration)|
SpamCheetah uses Rspamd filtering as one more hurdle for a mail to cross before it lands in your inbox.
Rspamd may not get invoked at all if the mail is dropped before the SpamCheetah C code reaches the Rspamd invocation.
The Redis cache server is used as efficient key-value storage by many Rspamd modules, including those marked "requires Redis" in the module list above.
Furthermore, Redis is used to store Bayes tokens in the statistics module. Rspamd provides several ways to configure Redis storage. There is also support for Redis replication, so Rspamd can write values to one set of Redis servers and read data from another set.
Well, no software can give that promise, let alone Rspamd. But the Rspamd configuration seems pretty promising, and perhaps over time SpamCheetah will learn to dig deeper into the aspects of Rspamd hitherto untouched.
Rspamd uses statistics to decide the class of a message: either spam or ham. The overall algorithm is based on Bayes' theorem, which defines how probabilities are combined. In general, it gives the probability that a message belongs to the specified class (namely, spam or ham) based on the following factors:
The probability of a specific token being spam or ham (which effectively means the count of the token's occurrences in spam and ham messages).
The probability of a specific token appearing in a message (which effectively means the frequency of the token divided by the number of tokens in a message).
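To make the two factors concrete, here is a minimal sketch of a plain naive Bayes combination in Python. It is deliberately simpler than what Rspamd does (no OSB tokens, no chi-square combining), and the function names, the equal-prior assumption and the clamping constant are choices made for this example only.

```python
import math

def token_spam_probability(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Estimate P(spam | token) from occurrence counts, assuming equal class priors."""
    spam_freq = spam_hits / spam_msgs   # how often the token shows up in spam
    ham_freq = ham_hits / ham_msgs      # how often the token shows up in ham
    if spam_freq + ham_freq == 0:
        return 0.5                      # token never seen: carries no evidence
    return spam_freq / (spam_freq + ham_freq)

def combine(probabilities):
    """Combine per-token probabilities with the naive Bayes product rule.

    Working with log-odds avoids underflow when many tokens are multiplied.
    """
    eps = 1e-6                          # clamp so log() never sees 0 or 1
    log_odds = 0.0
    for p in probabilities:
        p = min(max(p, eps), 1 - eps)
        log_odds += math.log(p) - math.log(1 - p)
    return 1 / (1 + math.exp(-log_odds))

# A token seen in 90 of 100 spam messages and 10 of 100 ham messages.
p_token = token_spam_probability(90, 10, 100, 100)
print(p_token)
# Two such tokens in one message push the verdict further towards spam.
print(combine([p_token, p_token]))
```

A spammy token and an equally hammy token cancel each other out, which is why a message is judged on all of its tokens together rather than on any single word.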
However, Rspamd uses more advanced techniques to combine probabilities, such as orthogonal sparse bigrams (OSB) and the inverse chi-square distribution. The key idea of the OSB algorithm is to use not merely single words as tokens but combinations of words weighted by their positions. This scheme is displayed in the following picture:
The main disadvantage is the number of tokens, which is multiplied by the size of the window. Rspamd uses a window of 5 tokens, which means that the number of tokens is around 5 times larger than the number of words.
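The windowing idea can be sketched as follows. This is an illustration of sparse bigrams in general, not Rspamd's tokenizer: the `osb_tokens` name and the `(first word, gap, second word)` tuple layout are assumptions for the example. With a window of 5, each word is paired with up to four of the words that follow it, so the token count grows by roughly the window size.

```python
def osb_tokens(words, window=5):
    """Pair each word with each of the next (window - 1) words.

    The gap between the two words is kept in the token, so "buy ... now"
    at distance 4 is a different token from "buy now" at distance 1.
    """
    tokens = []
    for i, first in enumerate(words):
        for gap in range(1, window):
            j = i + gap
            if j >= len(words):
                break               # ran off the end of the message
            tokens.append((first, gap, words[j]))
    return tokens

words = "buy cheap pills online now today".split()
tokens = osb_tokens(words)
for token in tokens[:4]:
    print(token)
print(len(words), "words ->", len(tokens), "tokens")
```

Only the first few words of a long message get the full four pairs each; the last words of the message have fewer followers, which is why the multiplier is "around" the window size rather than exact.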
Statistical tokens are stored in statfiles which, in turn, are mapped to specific backends.
The shingles algorithm is used mainly in textual analysis, here to detect the minor modifications that are made to spammy content in order to evade spam filters.
Here is the lowdown on it.
The text section must be plain text. For websites, the content is in HTML code. In other words, in order to be able to apply the algorithm meaningfully to the text, all code and any formatting must be removed. In addition, it is also possible to delete fill words, which can be used to extend text artificially, for example, “nevertheless.”
Shingles are overlapping word sequences taken from the text, each consisting of a fixed number of words. They overlap one another like roof shingles. A short example with length 3, using the sentence:
The quick brown fox jumped over the lazy dog
Shingle 1 = the quick brown
Shingle 2 = brown fox jumped
Shingle 3 = jumped over the
Shingle 4 = the lazy dog
The shingle length matters. If it is too long, duplicates are overlooked. If it is too small, a text may too quickly be evaluated as duplicate content.
A simple calculation is sufficient to determine how closely two texts match. Two quantities are determined first: the number of shingles that the two texts have in common (the intersection) and the total number of distinct shingles across both texts (the union). The match percentage is then the number of matching shingles divided by the total number of shingles.
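The example above can be reproduced with a small Python function. The `shingles` name and its `step` parameter are assumptions for this sketch: the four shingles listed correspond to moving two words forward each time, while moving one word at a time (a common choice) gives more, denser shingles.

```python
def shingles(text, length=3, step=2):
    """Split text into word shingles of a fixed length.

    `step` is how many words to advance between shingles; any step smaller
    than `length` makes neighbouring shingles overlap.
    """
    words = text.lower().split()
    last_start = len(words) - length
    return [" ".join(words[i:i + length]) for i in range(0, last_start + 1, step)]

sentence = "The quick brown fox jumped over the lazy dog"
for number, shingle in enumerate(shingles(sentence), start=1):
    print(f"Shingle {number} = {shingle}")
```

Calling `shingles(sentence, step=1)` instead yields seven shingles for the same nine-word sentence.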
If two exactly identical texts are compared, the result is 1 and thus a 100% match. If no single shingle is identical, the counter will show 0, in other words a result of 0%.
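The calculation can be sketched as below: matching shingles divided by all distinct shingles, a ratio often called Jaccard similarity. The function names are hypothetical, and the sketch advances one word at a time when building shingles, which is an assumption rather than something the text prescribes.

```python
def shingle_set(text, length=3):
    """Return the set of distinct word shingles, advancing one word at a time."""
    words = text.lower().split()
    return {tuple(words[i:i + length]) for i in range(len(words) - length + 1)}

def similarity(text_a, text_b, length=3):
    """Matching shingles divided by all distinct shingles of both texts."""
    a, b = shingle_set(text_a, length), shingle_set(text_b, length)
    if not a and not b:
        return 0.0                  # both texts too short: nothing to compare
    return len(a & b) / len(a | b)

original = "the quick brown fox jumped over the lazy dog"
modified = "the quick brown fox jumped over the lazy cat"
print(similarity(original, original))   # identical texts
print(similarity(original, modified))   # one word changed at the end
print(similarity(original, "a slow green turtle walked home"))
```

Changing a single word only disturbs the shingles that contain it, so a lightly edited spam message still scores close to the original.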
Fuzzy hashes are used to search for similar messages – i.e. you can find messages with the same or a slightly modified text using this method. This technology fits well for blocking spam that is simultaneously sent to many users. Since the hash function is unidirectional, it is impossible to restore the original text using a hash only. And this allows you to send requests to third-party hash storages without risk of disclosure.
Furthermore, fuzzy hashes are used not merely for textual data but also for images and other attachment types in email messages. In this case, however, Rspamd looks for exact matches only.
Though Rspamd is free to use for any purpose, many of the RBLs used in the default configuration are not, and care should be taken to see that your use case does not violate their usage terms. Notes about specific RBLs follow below (please follow the links for details):
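One way to picture how a one-way hash can still find similar texts is a min-hash signature over shingles, sketched below. This is a generic textbook construction chosen for illustration; this page does not claim that Rspamd's fuzzy storage is built exactly this way, and the function names, the SHA-256 choice and the signature size of 32 are all assumptions.

```python
import hashlib

def shingle_set(text, length=3):
    """Distinct word shingles of the text, advancing one word at a time."""
    words = text.lower().split()
    return {" ".join(words[i:i + length]) for i in range(len(words) - length + 1)}

def minhash_signature(text, num_hashes=32, length=3):
    """For each of num_hashes salted hash functions, keep the smallest shingle hash.

    Only hash values are kept, so the text cannot be read back from the signature.
    """
    shingles = shingle_set(text, length)
    if not shingles:
        return []                   # text too short to fingerprint
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            hashlib.sha256(salt + s.encode("utf-8")).digest()[:8] for s in shingles
        ))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of signature slots that agree; approximates shingle overlap."""
    if not sig_a or len(sig_a) != len(sig_b):
        return 0.0
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)

original = minhash_signature("the quick brown fox jumped over the lazy dog")
modified = minhash_signature("the quick brown fox jumped over the lazy cat")
unrelated = minhash_signature("a slow green turtle walked home")
print(estimated_similarity(original, original))
print(estimated_similarity(original, modified))
print(estimated_similarity(original, unrelated))
```

Because only the fixed-size signature needs to leave the machine, it can be compared against a shared store of known spam signatures without revealing the message text.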
So with this load rate (1500 messages per second) and with the average size of messages around 2Kb, Rspamd processes each message in around 100ms in average. I hope these numbers could give one some impression about Rspamd performance in general.
Quote from this place.