Close look at rspamd

Introduction to rspamd

rspamd

The name SpamAssassin has been well established in the mail world for around 20 years. But it is written in Perl and is now an Apache project, which in our view is itself pretty commercial, so relying on a SpamAssassin backend does not seem wise to us. Many sites use it, though, and it has had a lot of exposure. Still, in our view a Perl program is unlikely to scale well under load.

So SpamCheetah wishes to count on the new kid on the block, rspamd.

Rspamd is significantly faster and more stable, is even backward compatible with SpamAssassin, and comes loaded with defaults that simply work, and much more.

Rspamd is a spam filtering system that allows evaluation of messages by a number of rules including regular expressions, statistical analysis and custom services such as URL black lists. Each message is analysed by Rspamd and given a spam score.

How it differs from SpamAssassin

Rspamd is modern, has different goals, and supports plenty of modules and configurations, even for talking to third-party commercial providers like VirusTotal.

SpamCheetah however does not depend on any commercial tool and relies totally on open source technologies and code developed in-house.

Rspamd runs significantly faster than SpamAssassin while providing approximately the same quality of filtering. On the other hand, if you have a lot of custom rules, or you use

Can rspamd scale?

Yes, of course, as is evident from the above.

What are the backends available?

This is the complete list of modules, although SpamCheetah does not use many of them internally.

  • Antivirus
  • ARC module
  • ASN module
  • Bayes expiry module
  • Clickhouse module
  • Chartable
  • DCC module
  • DKIM
  • DKIM signing
  • DMARC
  • Elasticsearch exporter
  • Emails scan
  • External services
  • Force Actions
  • Fuzzy check
  • Fuzzy Collection
  • Greylisting module
  • History redis module
  • Mailing list
  • Metadata exporter
  • Metric exporter
  • MID module
  • Milter headers
  • Mime types
  • Multimap
  • MX Check
  • Neural network module
  • Phishing check
  • Ratelimit
  • RBL
  • Regexp
  • Reputation
  • Received policy
  • Replies module
  • Spamassassin rules
  • Spamtrap
  • SPF
  • Trie matcher
  • URL redirector
  • Whitelist

Here are the details

  • antivirus - integrates virus scanners (requires configuration)
  • arc - checks and signs ARC signatures
  • asn - looks up ASN-related information
  • clickhouse - pushes scan-related information to the ClickHouse DBMS (requires configuration)
  • bayes_expiry - provides expiration of statistical tokens (requires Redis and configuration)
  • dcc - performs DCC lookups to determine message bulkiness (requires configuration)
  • dkim_signing - adds DKIM signatures to messages (requires configuration)
  • dmarc - performs DMARC policy checks (requires Redis and configuration for reporting)
  • elastic - pushes scan-related information to Elasticsearch (requires configuration)
  • emails - extracts email addresses from a message and checks them against DNS blacklists (requires configuration)
  • force_actions - forces actions if selected symbols are detected (requires configuration)
  • greylisting - allows delaying suspicious messages (requires Redis)
  • history_redis - stores history in Redis (requires Redis)
  • ip_score - dynamically scores sender reputation (requires Redis). This module was removed in Rspamd 2.0 and replaced by the reputation module. The existing configuration is automatically converted by Rspamd.
  • maillist - detects common mailing list signatures in a message
  • metadata_exporter - pushes message metadata to external systems (requires configuration)
  • metric_exporter - pushes statistics to external monitoring systems (requires configuration)
  • mid - selectively suppresses invalid/missing Message-ID rules
  • milter_headers - adds/removes headers from messages (requires configuration)
  • mime_types - applies some rules about MIME types found in messages
  • multimap - a complex module that operates with different types of maps
  • neural networks - allows post-processing of messages using neural network classification (requires Redis)
  • once_received - detects messages with a single Received header and performs some additional checks for such messages
  • phishing - detects messages with phished URLs
  • ratelimit - implements a leaky bucket algorithm for rate limiting (requires Redis and configuration)
  • replies - checks if an incoming message is a reply to one of our own messages (requires Redis)
  • rbl - a plugin that checks messages against DNS real-time blacklists
  • reputation - a plugin that manages reputation evaluation based on various rules
  • rspamd_update - loads dynamic rules and other Rspamd updates (requires configuration)
  • spamassassin - loads SpamAssassin rules (requires configuration)
  • spf - performs SPF checks
  • trie - uses a suffix trie for extra-fast pattern lookup in messages (requires configuration)
  • whitelist - provides a flexible way to whitelist (or blacklist) messages based on SPF/DKIM/DMARC combinations
  • url_redirector - dereferences redirects (requires Redis configuration)
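
The ratelimit entry above mentions the leaky bucket algorithm. As a rough illustration of the idea only, here is a minimal in-memory sketch in Python. The class name, parameters and the choice to pass timestamps explicitly are our own assumptions; Rspamd's module keeps its buckets in Redis and is configured quite differently.

```python
class LeakyBucket:
    """Generic leaky bucket: each message adds 1 to the level, and the
    level drains at `rate` units per second. Hypothetical sketch, not
    Rspamd's implementation."""

    def __init__(self, burst, rate):
        self.burst = burst      # maximum level the bucket may reach
        self.rate = rate        # units drained per second
        self.level = 0.0
        self.last = None        # timestamp of the previous call

    def allow(self, now):
        """Return True if a message arriving at time `now` (seconds) fits."""
        if self.last is not None:
            leaked = (now - self.last) * self.rate
            self.level = max(0.0, self.level - leaked)
        self.last = now
        if self.level + 1 > self.burst:
            return False        # bucket would overflow: rate limit hit
        self.level += 1
        return True
```

With a burst of 3 and a rate of 1 per second, three messages at the same instant pass, the fourth is limited, and one more passes a second later.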

How does SpamCheetah use rspamd?

It uses rspamd filtering as one more hurdle to cross before the mail lands in your inbox.

rspamd may not get invoked at all if the mail is dropped before we reach the rspamd invocation in the SpamCheetah C code.
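
To give a feel for what such an invocation looks like, here is a small Python sketch of a client talking to rspamd over its HTTP scan protocol. This is not SpamCheetah's C code. It assumes a default rspamd worker listening on 127.0.0.1:11333, and the function names and the delivery policy in should_deliver are hypothetical.

```python
import json
import urllib.request

# Assumption: a local rspamd normal worker on its default port.
RSPAMD_URL = "http://127.0.0.1:11333/checkv2"


def scan_message(raw_message, url=RSPAMD_URL, timeout=10.0):
    """POST a raw RFC 5322 message (bytes) to rspamd and return the JSON reply."""
    request = urllib.request.Request(url, data=raw_message, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def should_deliver(result):
    """Hypothetical policy: drop only when rspamd asks for a (soft) reject."""
    action = result.get("action", "no action")
    return action not in {"reject", "soft reject"}
```

A caller would pass the message bytes to scan_message and then look at fields such as "action" and "score" in the returned dictionary.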

rspamd and redis

The Redis cache server is used as efficient key-value storage by many Rspamd modules, including:

  • Ratelimit plugin uses Redis to store limit buckets
  • Greylisting module stores data and meta hashes inside Redis
  • DMARC module can save DMARC reports inside Redis keys
  • Replies plugin requires Redis to save message ID hashes for outgoing messages
  • IP score plugin uses Redis to store data about AS, country and network reputation
  • Multimap module can use Redis as a read-only database for maps
  • MX Check module uses Redis for caching
  • Reputation module uses Redis for caching
  • Neural network module uses Redis for data storage

Furthermore, Redis is used to store Bayes tokens in the statistics module. Rspamd provides several ways to configure Redis storage. There is also support for Redis replication, so Rspamd can write values to one set of Redis servers and read data from another set.

Can rspamd solve the spam problem completely?

Well, that promise cannot be given by any software, let alone rspamd. But the rspamd configuration seems pretty promising, and perhaps over time SpamCheetah will learn to dig deeper into aspects of rspamd hitherto untouched.

rspamd and Bayes filtering

Rspamd uses statistics to decide the class of a message: either spam or ham. The overall algorithm is based on Bayes' theorem, which defines how probabilities are combined. In general, it gives the probability that a message belongs to the specified class (namely, spam or ham) based on the following factors:

  • The probability of a specific token being spam or ham (which effectively means counting the token's occurrences in spam and ham messages)
  • The probability of a specific token appearing in a message (which effectively means the token's frequency divided by the number of tokens in the message)

However, Rspamd uses more advanced techniques to combine probabilities, such as orthogonal sparse bigrams (OSB) and the inverse chi-square distribution. The key idea of the OSB algorithm is to use not merely single words as tokens but combinations of words weighted by their positions. This scheme is displayed in the following picture:

The main disadvantage is the number of tokens, which is multiplied by the size of the window. Rspamd uses a window of 5 tokens, which means that the number of tokens is around 5 times larger than the number of words.
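
The windowing idea can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name and the (first word, second word, distance) tuple format are hypothetical, and Rspamd's real tokenizer hashes and weights its tokens differently.

```python
def osb_tokens(words, window=5):
    """Pair each word with the words that follow it inside the window,
    remembering the distance between the two (sparse bigrams)."""
    tokens = []
    for i, first in enumerate(words):
        for distance in range(1, window):
            j = i + distance
            if j >= len(words):
                break
            tokens.append((first, words[j], distance))
    return tokens
```

Each word contributes up to window - 1 pairs, so the token count grows roughly in proportion to the window size, which is the disadvantage described above.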

Statistical tokens are stored in statfiles which, in turn, are mapped to specific backends.
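
To make the probability combination more concrete, here is a minimal Python sketch of the general technique: a per-token spam probability estimated from counts, and an inverse chi-square (Fisher style) combination of those probabilities. All function names are hypothetical and this is not Rspamd's actual code, which differs in details such as token weighting and numeric handling.

```python
import math


def token_spam_probability(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Estimate how spammy a token is from its occurrence counts."""
    spam_freq = spam_hits / spam_msgs if spam_msgs else 0.0
    ham_freq = ham_hits / ham_msgs if ham_msgs else 0.0
    if spam_freq + ham_freq == 0.0:
        return 0.5              # unseen token: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


def chi2q(x2, dof):
    """Upper tail of the chi-square distribution for an even dof.
    Note: exp() underflows for very large x2; real code works in log space."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(token_probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    if not token_probs:
        return 0.5
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in token_probs]
    dof = 2 * len(probs)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), dof)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), dof)
    return (1.0 + spamminess - hamminess) / 2.0
```

Scores near 1 suggest spam, scores near 0 suggest ham, and a message whose tokens carry no evidence lands at 0.5.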

Shingles algorithm

The shingles algorithm is used mainly in textual analysis. In spam filtering it helps detect content that has been modified slightly in order to evade filters.

Here is the lowdown on it.

Step 1: Normalize text

The text section must be plain text. For websites, the content is in HTML code. In other words, in order to be able to apply the algorithm meaningfully to the text, all code and any formatting must be removed. In addition, it is also possible to delete filler words, which can be used to extend text artificially, for example, “nevertheless.”
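
A minimal sketch of this normalization step in Python follows. The filler word list and the function name are hypothetical, and the regular expressions are a crude stand-in for a real HTML parser.

```python
import html
import re

# Hypothetical filler words; a real list would be longer and language specific.
FILLER_WORDS = {"nevertheless", "however", "moreover"}


def normalize(raw_html):
    """Strip markup, lowercase, and drop filler words; returns a word list."""
    # Remove script/style blocks entirely, then any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    text = html.unescape(text)              # e.g. "&amp;" becomes "&"
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in FILLER_WORDS]
```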

Step 2: Divide text into shingles

Shingles are overlapping word sequences of a fixed length taken from the text. They are laid over one another like roof shingles. A short example with length 3, in which each shingle shares one word with the next, using the sentence

The quick brown fox jumped over the lazy dog

  • Shingle 1 = the quick brown
  • Shingle 2 = brown fox jumped
  • Shingle 3 = jumped over the
  • Shingle 4 = the lazy dog

If the shingle length is too large, duplicates are overlooked. If it is too small, a text may too readily be judged to be duplicate content.
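
The splitting step can be sketched as follows in Python. The function name and the step parameter are our own. The example above, where neighbouring shingles share one word, corresponds to a step of 2, while many descriptions of shingling slide the window one word at a time (step of 1).

```python
def shingles(words, length=3, step=1):
    """Return word shingles of `length` words, advancing `step` words each time."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    last_start = len(words) - length
    return [tuple(words[i:i + length]) for i in range(0, last_start + 1, step)]


sentence = "the quick brown fox jumped over the lazy dog".split()
for number, shingle in enumerate(shingles(sentence, length=3, step=2), start=1):
    print(f"Shingle {number} = {' '.join(shingle)}")
```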

Step 3: Comparing shingles of different texts

A simple calculation is sufficient to determine how closely two texts match. First, determine the intersection (the shingles that appear in both texts) and the union (all distinct shingles from both texts). The match percentage is then the number of shared shingles divided by the total number of distinct shingles.

If two exactly identical texts are compared, the result is 1 and thus a 100% match. If not a single shingle is shared, the result is 0, in other words a 0% match.
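
This ratio of shared shingles to all distinct shingles is commonly known as the Jaccard similarity. A minimal Python sketch, with hypothetical function names and a sliding window of one word per step:

```python
def shingle_set(words, length=3):
    """Set of overlapping word shingles, sliding one word at a time."""
    return {tuple(words[i:i + length]) for i in range(len(words) - length + 1)}


def similarity(text_a, text_b, length=3):
    """Shared shingles divided by all distinct shingles, from 0.0 to 1.0."""
    a = shingle_set(text_a.lower().split(), length)
    b = shingle_set(text_b.lower().split(), length)
    union = a | b
    if not union:
        return 0.0          # both texts are shorter than one shingle
    return len(a & b) / len(union)
```

For example, changing only the last word of "The quick brown fox jumped over the lazy dog" leaves 6 of the 8 distinct shingles shared, which gives a similarity of 0.75.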

What can rspamd do with Fuzzy hashes?

Fuzzy hashes are used to search for similar messages, i.e. you can find messages with the same or a slightly modified text using this method. This technology fits well for blocking spam that is simultaneously sent to many users. Since the hash function is one-way, it is impossible to restore the original text from a hash alone. This allows you to send requests to third-party hash storages without risk of disclosure.

Furthermore, fuzzy hashes are used not merely for textual data but also for images and other attachment types in email messages. However, in this case, rspamd looks for exact matches to find similar objects.
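
One common way to turn shingles into a compact fuzzy hash is min-hashing: hash every shingle under several different keys and keep only the smallest value per key. The Python sketch below illustrates that general idea. The constants, names and the choice of BLAKE2 are our assumptions, and Rspamd's actual fuzzy hashing differs in its details.

```python
import hashlib

NUM_HASHES = 32     # hypothetical signature length
WINDOW = 3          # words per shingle


def _shingles(text):
    words = text.lower().split()
    if len(words) < WINDOW:
        return [" ".join(words)]        # short text: one shingle
    return [" ".join(words[i:i + WINDOW])
            for i in range(len(words) - WINDOW + 1)]


def fuzzy_signature(text):
    """For each key, keep the smallest keyed hash seen over all shingles."""
    grams = [g.encode("utf-8") for g in _shingles(text)]
    signature = []
    for k in range(NUM_HASHES):
        key = k.to_bytes(4, "big")
        signature.append(min(
            hashlib.blake2b(g, digest_size=8, key=key).digest() for g in grams
        ))
    return signature


def signature_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree (0.0 to 1.0)."""
    same = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return same / len(sig_a)
```

Only the signature needs to leave the machine, and slightly modified texts still agree in most signature positions, while unrelated texts agree in almost none.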

RBLs in rspamd

Though Rspamd is free to use for any purpose, many of the RBLs used in the default configuration are not, and care should be taken to see that your use case does not infringe their terms. Notes about specific RBLs follow below (please follow the links for details):

  • Abusix Mail Intelligence
  • DNSWL
  • Mailspike
  • Rspamd URIBL
  • SORBS
  • SpamEatingMonkey
  • Spamhaus
  • SURBL
  • UCEProtect
  • URIBL
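
For readers unfamiliar with how a DNS blacklist is queried, here is a small Python sketch of the usual convention: the sender's IP address is reversed and prepended to the list's zone, and a successful lookup (conventionally an answer in 127.0.0.0/8) means the address is listed. The zone rbl.example.org and the function names are placeholders, not a real service or Rspamd's API.

```python
import ipaddress
import socket


def rbl_query_name(ip, zone):
    """Build the DNS name to look up for `ip` in the blacklist `zone`."""
    address = ipaddress.ip_address(ip)
    if address.version == 4:
        labels = str(address).split(".")[::-1]              # reversed octets
    else:
        labels = list(address.exploded.replace(":", ""))[::-1]  # reversed nibbles
    return ".".join(labels) + "." + zone


def is_listed(ip, zone):
    """Return True if the name resolves to an address in 127.0.0.0/8."""
    try:
        answer = socket.gethostbyname(rbl_query_name(ip, zone))
    except socket.gaierror:
        return False            # no such name: not listed
    return answer.startswith("127.")
```

Every is_listed call is a real DNS query against the operator's servers, which is why the usage terms of each list matter.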

Performance measurements of rspamd by third party sources

So with this load rate (1500 messages per second) and with the average size of messages around 2Kb, Rspamd processes each message in around 100ms in average. I hope these numbers could give one some impression about Rspamd performance in general.

Quote from this place.
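
As a back-of-the-envelope reading of those numbers, Little's law (messages in flight = arrival rate × time spent per message) gives an idea of how many messages are being processed concurrently. The short Python sketch below only restates the quoted figures; the function name is our own.

```python
def messages_in_flight(rate_per_second, latency_ms):
    """Little's law: average concurrency = arrival rate x time in system."""
    return rate_per_second * latency_ms / 1000


# Figures from the quote: 1500 messages per second, around 100 ms each.
print(messages_in_flight(1500, 100))
```

In other words, at that load roughly 150 messages are being scanned at any given moment.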