A Good Morning
One fine day, we woke up to a page at 7AM that one of our non-critical services was returning more 500s than usual. In fact, 500s constituted more than 50% of the responses, which is why we were paged. The service had been running smoothly for more than a month, and we had just enabled message publishing in it the previous evening. Suspecting that was the cause, we quickly disabled the functionality by turning a feature flag off, and the 500s gradually came down until the alert resolved itself.
Once we were completely awake and caffeinated, we started looking into the logs of the service and a few things stood out:
- The entries with `level=ERROR` spiked during the window of increased 500s. The errors were actually 502s, since the service was largely unavailable at the time.
- Those log entries were coming specifically from the MQ client (we use self-hosted RabbitMQ).
- The logs were repeatedly showing:
  - MQ sending us `ECONNRESET`
  - Our connection retry mechanism kicking in, succeeding sometimes and failing at other times
  - A seemingly cryptic error message: `connection closed: FRAME_ERROR - type 65, all octets = <<>>: {frame_too_large,1342177289,4088} (501)`
Our service is written in Node.js + TS, and we use `@cloudamqp/amqp-client` to communicate with MQ. We subscribe to `AMQPChannel`'s `onError` method to kick in our connection retry mechanism. Our retry mechanism is quite raw and simple, and it does not take care of messages that would be sent during the retry window; they would be lost. The retry mechanism keeps track of a counter and backs off exponentially. Once a retry is successful, the counter is reset and everything is back to normal.
While the retry mechanism seemed to work fine in this case (spoiler: it wasn't), which we were happy about, the number `1342177289` worried us. We had configured the maximum message size as `4088` bytes, which is more than sufficient in almost all cases. What we were seeing instead was a message of around 1GB, which seemed close to impossible! I joked that I was 99.99% confident a 1GB message was not the case, but left the remaining probability there to amuse myself.
We scoured the service's logs and their historical data, and nothing came anywhere close to a GB, or even a few hundred KBs. Turning back to the logs, we noticed that the same number `1342177289` was repeated multiple times. It is highly improbable to receive multiple roughly-1GB messages with the exact same byte count, and we never retry publishes either.
Having read articles about magic numbers showing up during crashes, I became particularly interested in the number `1342177289`: could it have been a legitimate packet size where one or more bits got corrupted into this humongous number? I started looking at the bits:
0101 0000 0000 0000 0000 0000 0000 1001
Weird, there are more zeros than ones, yes, but the two bits set on the most significant side should not have been flipped on accidentally. Meanwhile, we pasted the number into Google and it brought up results ONLY related to RabbitMQ!
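For reference, the bit string above can be reproduced with a quick check in a Node REPL (purely an illustration):

const n = 1342177289
console.log(n.toString(2).padStart(32, "0")) // 01010000000000000000000000001001
console.log(n.toString(16))                  // 50000009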
What had happened?
John Eckersberg does a good job of explaining what this number means. To make sense of it, we need to bring `type 65` into the picture as well: 65 is ASCII `A`. Then there is the channel number, `19793` (hex `0x4D51`, ASCII `MQ`), which was not included in this log. These three numbers combined spell out the string `AMQP` followed by the numbers `009`. For reference, the AMQP init packet expects `AMQP0091` in the header (`0091` being the protocol version, sent as raw bytes rather than ASCII). We can see part of it in our number:
1342177289
= 0101 0000 0000 0000 0000 0000 0000 1001
= 0x50 0x00 0x00 0x09
= "P" 0 0 9
This means we were sending the initialization packet on a connection that had already been initialized. For reference, here is how our retry logic looked:
class MqClient {
  private readonly client: AMQPClient
  // We keep a single connection
  private connection: AMQPConnection | null = null
  private retryCount = 0

  constructor(url: string) {
    this.client = new AMQPClient(url)
  }

  // Users SHOULD call this
  public async init() {
    this.connection = await this.client.connect()
    // Register the retry handler on the (new) connection
    this.connection.onError(this.retry.bind(this))
    this.retryCount = 0
  }

  private async retry() {
    this.retryCount += 1
    // Exponential backoff
    await sleep(2 ** this.retryCount * 1000 /*ms*/)
    await this.init()
  }
}
While this code looks good on the surface, if an error happens during a retry, the already-registered retry handler is called again, and now multiple retry sequences are running concurrently. It is then entirely possible for a retry to run `init()` on an already-initialized connection, which sends the `AMQP` protocol header over a live connection, and that is exactly what happened.
The Fix
The fix is simple: close the connection and clear the associated handlers before attempting a retry. This is akin to how React's `useEffect` wants us to clean up subscriptions, callbacks, etc. when the component unmounts.
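Here is a minimal sketch of that idea applied to the class above. The `retrying` guard and the `close()` call are illustrative additions; the exact cleanup API depends on the client library, so treat this as a sketch rather than a drop-in fix.

class MqClient {
  private readonly client: AMQPClient
  private connection: AMQPConnection | null = null
  private retryCount = 0
  private retrying = false

  constructor(url: string) {
    this.client = new AMQPClient(url)
  }

  public async init() {
    this.connection = await this.client.connect()
    this.connection.onError(this.retry.bind(this))
    this.retryCount = 0
  }

  private async retry() {
    // Guard against a second error spawning a concurrent retry sequence
    if (this.retrying) return
    this.retrying = true
    try {
      // Tear down the old connection and its handlers first, so a stale
      // onError can never re-initialize a connection that is already live
      if (this.connection) {
        await this.connection.close().catch(() => {})
        this.connection = null
      }
      this.retryCount += 1
      // Exponential backoff
      await sleep(2 ** this.retryCount * 1000 /*ms*/)
      await this.init()
    } finally {
      this.retrying = false
    }
  }
}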
However, that's not the end of the story. We observed that MQ connections were being closed when traffic to our service was a mere 20% above the usual. We were running 3 pods in Kubernetes and saw them exceeding their CPU limits. Scaling them to 16 brought things back to normal at 100% CPU utilization. We still need to investigate why the act of publishing a message uses up so many resources. Until then, we plan to use a `HorizontalPodAutoscaler` on CPU utilization to scale the service up under load.