This article is the first in a series of posts about the various SaaS products and websites I’ve run over the past eight years. I’ll share some of the issues I’ve dealt with, lessons learned, failures, and maybe a few things that have actually worked. Please let me know what you think!
Back in 2019 or 2020, I decided to rewrite the entire backend of Block Sender, a SaaS application that helps users create better email blocks, among other things. Along the way I added some new features and upgraded to more modern technology. I ran the tests, deployed the code, and tested everything manually in production. Aside from a few small hiccups, everything seemed to be working fine. I wish this was the end of the story…
A few weeks later, a customer notified me (which is embarrassing in itself) that the service wasn’t working: his inbox was filling up with emails that should have been blocked. So I investigated. Usually this kind of issue is caused by Google revoking the service’s connection to the user’s account, which the system handles by emailing the user and asking them to reconnect, but this time the cause was something different.
The backend worker that checks incoming emails against users’ blocks kept crashing every 5-10 minutes. The weirdest part was that there were no errors in the logs and memory usage was fine, yet the CPU load would spike at seemingly random times. For some reason Elastic Beanstalk was taking far too long to restart the worker on its own, so for the next 24 hours (with a 3 hour sleep break, sorry customers 😬) I had to restart it manually every time it crashed.
Debugging issues in production is always a pain, especially when you can’t reproduce the problem locally, let alone determine its cause. So, like any “good” developer, I just started logging everything (see the sketch after this list) and waited for the server to crash again. Since the CPU load was spiking regularly, I suspected the culprit was a particular email or user rather than a macro-level problem (for example, running out of memory). So I tried to narrow it down:
- Did it crash with a specific email ID or type?
- Did it crash for a specific customer?
- Was it crashing at regular intervals?
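None of that logging survives in the codebase, but it looked roughly like the following hypothetical sketch (checkBlock, user, email, and block are stand-in names, not the worker’s real code):

```javascript
// Hypothetical debug logging around the block-checking loop (illustrative names only).
// If the worker dies mid-loop, the last line logged points at the user, email, and block involved.
for (const block of user.blocks) {
  console.log(`[debug] user=${user.id} email=${email.id} block=${block.pattern}`);
  checkBlock(block, email);
}
```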
After hours of this and more log-staring than I care to admit, I finally narrowed it down to a specific customer. From there the search space shrank considerably: it had to be either one of their blocking rules or a particular email the server kept retrying. Luckily for me it was the former, which is a much easier problem to debug given that we take privacy very seriously and don’t store or display email data.
Before we get into the exact issue, let’s first talk about one of Block Sender’s features. At the time, a lot of customers were asking for wildcard blocks, which let them block any email address that follows a common pattern. For example, to block all emails from addresses starting with marketing@, you could create the wildcard block marketing@*.
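To make the intended behavior concrete, here is a quick sketch using the matcher npm package (the same library the backend uses, as described below); the addresses are made-up examples:

```javascript
// Sketch of how a wildcard block is meant to behave, using the matcher package.
import {isMatch} from 'matcher';

isMatch('marketing@example.com', 'marketing@*');  //=> true  (blocked)
isMatch('support@example.com', 'marketing@*');    //=> false (delivered as normal)
```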
What I didn’t think about is that not everyone understands how wildcards work. I assumed most people would use them the way I, as a developer, do: * represents any number of characters. Unfortunately, this particular user assumed they needed one wildcard for each character they wanted to match. They wanted to block all emails from a specific domain (which is actually a native feature of Block Sender, but I guess they weren’t aware of that, which is a whole problem in itself). So instead of using *@example.com, they used **********@example.com.
POV: Observing users as they use the app…
On the worker servers I use the Node.js library matcher to handle these wildcards. It performs glob matching by converting the pattern into a regular expression, so a block like **********@example.com gets converted into something like the following regular expression:
/[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*@example\.com/i
If you have any experience with regular expressions, you know they can get computationally expensive very quickly. Matching the expression above against even a modestly long string is prohibitively expensive: the engine gets stuck backtracking through all the ways the ten wildcard groups could split the text, and that is what was overwhelming the worker’s CPU. So every time this user received an email, the server would crash, and then crash again on each of the retries we had built in to handle temporary failures. That’s why it kept dying every few minutes.
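To see the failure mode in isolation, here is a small sketch. It is not Block Sender’s actual worker code; the input string and the timing harness are my own illustrative assumptions. Even with a short input it can take seconds, and lengthening the run of a characters quickly pushes it toward never finishing:

```javascript
// Minimal reproduction of the catastrophic backtracking (illustrative only).
// The ten [\s\S]* groups give the engine a combinatorial number of ways to split
// the input before it can conclude that "@example.com" never appears.
const pattern = /[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*@example\.com/i;

// An address that *almost* matches (note the .org ending, so the match must fail).
const input = 'a'.repeat(20) + '@example.org';

console.time('wildcard match');
pattern.test(input); // blocks the event loop while the regex engine backtracks
console.timeEnd('wildcard match');
```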
So how did I fix this? The obvious, easy part was to find and fix all existing blocks containing multiple wildcards in a row. But I also needed to sanitize user input better: otherwise any user could enter a pathological pattern and bring down the entire system with what amounts to a ReDoS (regular expression denial of service) attack.
Handling this particular case is very simple: collapse consecutive wildcard characters into a single one.
block = block.replace(/\*+/g, '*')
However, that alone still leaves the app vulnerable to other kinds of ReDoS attacks. Fortunately, there are plenty of packages/libraries that can help with those as well.
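For example (this is my own suggestion, not necessarily what Block Sender runs in production), the re2 npm package swaps the built-in backtracking engine for Google’s RE2, which matches in linear time and therefore cannot be blown up by a pathological pattern:

```javascript
// One possible safeguard: compile user-derived patterns with the re2 package,
// which wraps Google's RE2 engine and guarantees linear-time matching.
// (Illustrative sketch; Block Sender's actual safeguards may differ.)
import RE2 from 're2';

// The same converted wildcard pattern as before, but run by a non-backtracking engine.
const pattern = new RE2('[\\s\\S]*@example\\.com', 'i');

console.log(pattern.test('a'.repeat(10000) + '@example.org')); //=> false, and it returns immediately
```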
Using the fix above in combination with other safety measures, I was able to prevent this problem from happening again. It was a good reminder that user input can never be trusted and should always be sanitized before being used in an application. I didn’t even realize this was a potential problem until it happened to me, so I hope this helps others avoid the same issue.
Have questions, comments, or want to share your own story? Reach out to us on Twitter!