Fix Azure Cache for Redis Long Hangup and Unstable Connection for Node.js Node-Redis

Azure Jul 14, 2023

For this implementation, I use the official redis npm package (formerly node-redis). With other libraries such as ioredis, you might run into different problems.

Quick fix on unstable connection — Easy part

Add this simple config to the Node.js redis client so that it sends PING regularly.

...
disableOfflineQueue: true,
pingInterval: 1000,

The PING acts as a heartbeat, so the connection is less likely to sit idle and be terminated.

This approach is confirmed by the Azure documentation:

Best practices for connection resilience - Azure Cache for Redis
Learn how to make your Azure Cache for Redis connections resilient.

The Long Hangup — Hard Part

The default Linux TCP configuration cannot recover quickly after a TCP retransmission failure.

Best practices for connection resilience - Azure Cache for Redis
Learn how to make your Azure Cache for Redis connections resilient.

The other half of the story is to blame on Azure and Microsoft.

Don’t you think that:

  1. Linux is configured like this around the world,
    and not only for this TCP service
  2. A normal Redis IaaS deployment just works and is stable out of the box

There are no problems whatsoever there, and we never have to touch these settings.

Cause: TCP Retransmission ‘Exponential Backoff’

When there is a TCP transmission problem, TCP by default WAITs and then tries to resend the packet.

The keyword here is ‘WAIT’.

That’s why we get the long hangup.

A similar issue also happens with Elasticsearch:

TCP retransmission timeout | Elasticsearch Guide [8.8] | Elastic
“Exponential backoff” means, in simple terms, “wait before resending, and if that fails, double the wait time”.

For example, an initial wait time of 200ms and 5 retransmissions add up to around 6s:

  1. 200ms
  2. 400ms
  3. 800ms
  4. 1600ms
  5. 3200ms
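
The doubling schedule above can be reproduced with a few lines of JavaScript. This is only a sketch of the arithmetic: backoffSchedule is a name made up for this example, and the 200ms initial wait is the illustrative figure from the text, not a measured value.

```javascript
// Hypothetical helper: list the wait before each retransmission when the
// wait doubles after every failed attempt.
function backoffSchedule(initialMs, retransmissions) {
  const waits = [];
  let wait = initialMs;
  for (let i = 0; i < retransmissions; i += 1) {
    waits.push(wait);
    wait *= 2;
  }
  return waits;
}

const waits = backoffSchedule(200, 5);
const totalMs = waits.reduce((sum, ms) => sum + ms, 0);
console.log(waits.join(', '), '->', totalMs, 'ms total');
```

The five waits add up to 6200ms, which is the "around 6s" mentioned above.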

But on Linux the number is stored in sysctl and it is 15, meaning around 15 retransmissions. This can take 10 minutes or more, depending on when in the event you start to observe; the kernel documentation puts the hypothetical timeout for this default at 924.6 seconds.

$ sysctl -a | grep tcp_retries
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
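
To get a feel for what tcp_retries2 = 15 means in wall-clock time, the sketch below sums the backoff intervals. It assumes an initial retransmission timeout (RTO) of 200ms that doubles up to a 120s cap, and that the connection is dropped after the wait that follows the last retransmission. These assumptions follow the figures behind the Linux kernel documentation's hypothetical timeout; the real RTO depends on the measured round-trip time. estimateHangMs is a name made up for this example.

```javascript
// Rough estimate only. Assumptions: RTO starts at 200 ms, doubles after each
// retransmission, and is capped at 120 s. Real RTO values vary with RTT.
function estimateHangMs(retries, initialRtoMs = 200, maxRtoMs = 120000) {
  let total = 0;
  let rto = initialRtoMs;
  // One wait before each retransmission, plus one final wait before giving up.
  for (let i = 0; i <= retries; i += 1) {
    total += rto;
    rto = Math.min(rto * 2, maxRtoMs);
  }
  return total;
}

const hangMs = estimateHangMs(15);
console.log(`tcp_retries2=15: about ${(hangMs / 1000).toFixed(1)} s`);
```

Under these assumptions the default of 15 comes out at roughly 15 minutes. Passing a smaller retry count, for example 8, gives about 102 seconds with the same assumptions.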

On the server, this root cause can be confirmed with:

$ netstat -ant

... Send-Q ... Port ...
               6378

// You will see large numbers in the Send-Q column
// This confirms that packets are stuck and piling up there,
// waiting out the retransmission backoff
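
If you want to automate that check, the sketch below scans netstat text for sockets to a given remote port with a non-zero Send-Q. It assumes the Linux netstat -ant column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). findStuckSockets is a made-up name, and the sample text, addresses, and port are invented for illustration; use the port you actually connect to.

```javascript
// Hypothetical helper: find TCP sockets to `remotePort` whose Send-Q is
// non-zero. Assumes the Linux `netstat -ant` column order:
// Proto Recv-Q Send-Q Local-Address Foreign-Address State
function findStuckSockets(netstatOutput, remotePort) {
  return netstatOutput
    .split('\n')
    .map((line) => line.trim().split(/\s+/))
    // Keep only data rows; the header rows do not start with "tcp".
    .filter((cols) => cols.length >= 6 && cols[0].startsWith('tcp'))
    .map(([, , sendQ, local, remote, state]) => ({
      sendQ: Number(sendQ),
      local,
      remote,
      state,
    }))
    .filter((s) => s.sendQ > 0 && s.remote.endsWith(`:${remotePort}`));
}

// Invented sample output (documentation-range IP addresses).
const sample = `
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:22           198.51.100.7:50514      ESTABLISHED
tcp        0   4096 192.0.2.10:41822        198.51.100.20:6380      ESTABLISHED
`;

const stuck = findStuckSockets(sample, 6380);
console.log(stuck);
```

In practice the input text would come from running netstat -ant on the server and capturing its output.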

Or, on my local machine, I can confirm the result with a Wireshark capture.

You will see a lot of TCP Retransmission entries, then finally the connection closes and a new three-way handshake (SYN, SYN-ACK, ACK) opens again.

My machine runs macOS, which is better configured for this case than Linux.

Why Configuring Redis PING Does Not Help

This is not a problem of application inactivity.

Your Redis PING will get stuck and pile up in the Send-Q, waiting too.

Why Configuring the Redis Socket Timeout Does Not Help

The socket.timeout implemented in redis is a timeout for the “connect” phase only. It does not apply once the client is ready and a command it has sent is waiting for a reply.

Bad Luck on Non-Cluster Services and Docker

Given the problem above, we can simply lower the value with sysctl -w <tcp_retries>=number

If your deployment allows you to set it, the commonly recommended values for modern apps are 3, 5, or 8; a value of 8 corresponds to the minimum of 100 seconds recommended by RFC 1122.

What value should I set for the tcp_retries2 parameter? - Red Hat Customer Portal
The /proc/sys/net/ipv4/tcp_retries2 setting controls how many retries TCP makes on data segments. Some guides suggest changing it to a much lower value than the default. Why should I do this?

But if that is not the case, you have two problems:

  1. You cannot set sysctl in the Dockerfile
  2. You cannot set sysctl in ready-made services
    like Azure Container Instances or App Service (as of 2023-04-16).

You may find SOME HOPE in the YAML, whose structure looks like Kubernetes.

Sadly, there is no securityContext where you can escalate privileges as in AKS, and no sysctls YAML property where you can set the value directly.

Solution: Implement Command Timeout

The solution is to implement a “command timeout”.

When sending PING, GET, or SET, we need a timeout as well.
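
As a minimal sketch of what a per-command timeout could look like, the helper below races any promise against a timer. withTimeout is a name made up for this example, not part of the redis package, and the two stand-in promises simulate a fast reply and a stalled socket so the sketch runs without a server.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms`.
function withTimeout(promise, ms, label = 'command') {
  let timer;
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stand-ins for Redis commands (assumptions for illustration only).
const fastCommand = new Promise((resolve) => setTimeout(() => resolve('PONG'), 10));
const stuckCommand = new Promise(() => {}); // never settles, like a stalled socket

withTimeout(fastCommand, 100, 'PING')
  .then((reply) => console.log('reply:', reply));
withTimeout(stuckCommand, 100, 'GET')
  .catch((error) => console.log('failed:', error.message));
```

With a real client this would be called as withTimeout(redis.get('key'), 500, 'GET'), assuming redis is a connected client. Note that the timeout only stops the caller from waiting; it does not cancel the command or unblock the socket underneath.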

There is a trick here: instead of implementing a timeout everywhere, you can rely on one fact:

If the socket is stuck, it is stuck everywhere.

So we run a periodic check and reconnect centrally instead.

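    const { createClient } = require('redis');

    // Example values; tune them for your workload
    const TIMEOUT_TIME = 2000; // max wait for the heartbeat command, in ms
    const INTERVAL_TIME = 5000; // time between checks, in ms

    const config = {
      url: `redis://${redisConfig.host}:${redisConfig.port}`,
      password: redisConfig.password,
      disableOfflineQueue: true,
      pingInterval: 1000,
    };

    const redis = createClient(config);
    redis.on('error', (error) => console.error('Redis client error', error));
    redis.connect().catch((error) => console.error('Initial connect failed', error));

    // This is NOT the socket.timeout option,
    // which only applies while connecting.
    // Connection timeout vs. command timeout
    setInterval(async () => {
      let wait;
      try {
        const heartbeatKey = 'heartbeat';
        const read = redis.get(heartbeatKey);
        // Rejects if the command has not completed within TIMEOUT_TIME
        const timer = new Promise((resolve, reject) => {
          wait = setTimeout(
            () => reject(new Error('Interval check: command timeout')),
            TIMEOUT_TIME
          );
        });

        // Race the command against the timer
        await Promise.race([read, timer]);
        // Values must be strings; ignore failures here, the next check catches them
        redis.set(heartbeatKey, String(Date.now())).catch(() => {});
      } catch (error) {
        try {
          // Drop the stuck socket, reconnect, and record the incident
          await redis.disconnect();
          await redis.connect();
          await redis.set(`incident-${Date.now()}`, `${new Date()}`);
        } catch (reconnectError) {
          console.error('Redis reconnect failed', reconnectError);
        }
      } finally {
        clearTimeout(wait);
      }
    }, INTERVAL_TIME);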

It took me a few days and nights to figure this out.

Hope this saves you some time.

TeamCMD

We are the CODEMONDAY team, and we provide a variety of content about business, technology, and programming. Enjoy it with us.