Fix Azure Cache for Redis Long Hangup and Unstable Connection for Node.js Node-Redis
For this implementation, I use the official redis npm package (formerly node-redis). With other libraries such as ioredis, you might run into different problems.
Quick fix on unstable connection — Easy part
Add this simple config to the Node.js redis package to make it send PING at a fixed interval.
...
disableOfflineQueue: true,
pingInterval: 1000,
The PING keeps a heartbeat going, so the connection is less likely to sit idle and be terminated.
This method is confirmed by Azure doc:
The Long Hangup — Hard Part
The default Linux configuration cannot recover quickly after a TCP retransmission failure.
The other half of the story is on Azure and Microsoft.
Don’t you think that
- Linux is configured like this all around the world, and not only for this TCP service
- A normal Redis on IaaS just works and is stable out of the box, with no problems whatsoever, and we never have to touch it
Cause: TCP Retransmission ‘Exponential Backoff’
When there is a TCP transmission problem, TCP by default WAITs and then tries to resend the packet.
The keyword here is ‘WAIT’.
That’s why we get the long hangup.
A similar issue also happens with Elasticsearch.
“Exponential backoff” means, in simple terms, “wait before resending, and if that fails, double the wait time”.
For example, an initial wait time of 200ms and 5 retransmissions add up to around 6s:
- 200ms
- 400ms
- 800ms
- 1600ms
- 3200ms
On Linux, however, the number is stored in sysctl (net.ipv4.tcp_retries2) and it is 15, meaning around 15 retransmissions. This can take 10 minutes or more, depending on when in the event you start observing.
$ sysctl -a | grep tcp_retries
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
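The doubling described above can be checked with a few lines of JavaScript. This is only a sketch: backoffTotalMs is a hypothetical helper, and the 200 ms minimum and 120 s maximum retransmission timeouts are assumptions based on typical Linux defaults. On a live connection the real timeout depends on the measured round-trip time, so treat the result as an estimate.

```javascript
// Sum the waits of an exponential backoff: each wait doubles the previous
// one, optionally capped at maxMs.
function backoffTotalMs(initialMs, waits, maxMs = Infinity) {
  let wait = initialMs;
  let total = 0;
  for (let i = 0; i < waits; i++) {
    total += Math.min(wait, maxMs);
    wait *= 2;
  }
  return total;
}

// The example from the text: 200 ms initial wait, 5 waits, no cap.
const exampleMs = backoffTotalMs(200, 5);

// Assumed Linux figures: 200 ms minimum timeout, 120 s maximum timeout, and
// tcp_retries2 = 15 retransmissions plus the final wait (16 waits in total).
const linuxMs = backoffTotalMs(200, 16, 120000);

console.log(exampleMs / 1000, 's'); // 6.2 s
console.log(linuxMs / 1000, 's'); // 924.6 s
```

Under these assumptions the default setting adds up to roughly 15 minutes before the kernel gives up on the connection, which is in line with the "10 minutes or more" observed in practice.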
On the server, this root cause can be confirmed with:
$ netstat -ant
... Send-Q ... Port ...
6378
// You will see large numbers in the Send-Q column
// This confirms that packets are stuck and piling up there
// while waiting for the retransmission backoff
On my local machine, I can confirm the same result with a Wireshark capture: a lot of TCP Retransmission packets, then the connection finally closes and a new three-way handshake (SYN, SYN-ACK, ACK) opens.
My machine runs macOS, which is better configured for this case than Linux.
Why Configuring Redis PING Does Not Help
This is not a problem of application inactivity.
Your Redis PING will get stuck and pile up in the Send-Q, waiting too.
Why Configuring the Redis Socket Timeout Does Not Help
The socket.timeout implemented in redis is the “connect” phase timeout. It does not apply once the client is ready and sending commands.
Bad Luck on Non-Cluster Services and Docker
With the problem above, we could simply run sysctl -w <tcp_retries>=number.
If your deployment allows you to set it, the values recommended for modern apps are 3, 5, or 8, with reference to the 100 s timeout recommended by RFC 1122.
But if that is not the case, you have two problems:
- You cannot set sysctl in the Dockerfile
- You cannot set sysctl in ready-made services like Azure Container Instances or App Service (as of 2023-04-16).
You will find SOME HOPE in the YAML, whose structure looks like Kubernetes.
Sadly, there is no securityContext where you can set privileged: true as in AKS, and no sysctls YAML property where you can set the value directly.
Solution: Implement Command Timeout
The solution is to implement a “command timeout”.
When sending PING, GET, or SET, we need a timeout there too.
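As a minimal sketch of what a per-command timeout could look like, the helper below races any promise against a timer. The name withTimeout and the stuckCommand stand-in are hypothetical and not part of the redis package; the never-settling promise only imitates a command waiting behind a stuck socket.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label = 'command') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for a command stuck behind a dead socket: it never settles.
const stuckCommand = new Promise(() => {});

async function demo() {
  // A command that answers in time passes straight through.
  const fast = await withTimeout(Promise.resolve('PONG'), 100, 'PING');

  // A stuck command is rejected once the timeout fires.
  let stuckError = null;
  try {
    await withTimeout(stuckCommand, 100, 'GET');
  } catch (error) {
    stuckError = error.message;
  }
  return { fast, stuckError };
}

demo().then((result) => console.log(result));
```

With a connected client, the same wrapper could be placed around calls such as a GET or SET promise. Note that the timeout only stops the caller from waiting; it does not cancel the command or repair the socket, so a reconnect is still needed.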
There is a trick here: instead of implementing a timeout everywhere, consider this fact:
If the socket is stuck, it is stuck everywhere.
So we implement an interval check and reconnect centrally instead.
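const { createClient } = require('redis');

// Assumed values, not from the original setup; tune them for your workload
const TIMEOUT_TIME = 2000; // ms to wait for the heartbeat command
const INTERVAL_TIME = 5000; // ms between heartbeat checks

// redisConfig holds your own host, port, and password
const config = {
  url: `redis://${redisConfig.host}:${redisConfig.port}`,
  password: redisConfig.password,
  disableOfflineQueue: true,
  pingInterval: 1000,
};

let redis;

// Assumed shape of the connect helper: create a fresh client and connect it
async function connectionRedis(config) {
  redis = createClient(config);
  redis.on('error', (error) => console.error('Redis error', error));
  await redis.connect();
}
connectionRedis(config).catch(console.error);

// This is NOT the socket.timeout option,
// which only applies while connecting
// Connection Timeout VS Command Timeout
setInterval(async () => {
  let wait;
  try {
    const heartbeatKey = 'heartbeat';
    // Promise for the command that may hang
    const read = redis.get(heartbeatKey);
    // Promise that rejects after TIMEOUT_TIME
    const timer = new Promise((resolve, reject) => {
      wait = setTimeout(() => {
        reject(new Error('Interval check command timeout'));
      }, TIMEOUT_TIME);
    });
    // Let's race our promises
    await Promise.race([read, timer]);
    await redis.set(heartbeatKey, String(Date.now()));
  } catch (error) {
    // The socket is stuck: drop it and reconnect with a fresh client
    try {
      await redis.disconnect().catch(() => {});
      await connectionRedis(config);
      await redis.set(`incident-${Date.now()}`, `${new Date()}`);
    } catch (reconnectError) {
      console.error('Redis reconnect failed', reconnectError);
    }
  } finally {
    // Clear the timer so it does not fire after the race is over
    clearTimeout(wait);
  }
}, INTERVAL_TIME);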
It took me a few days and nights to figure this out.
I hope this saves you some time.