On the Fast Reconnection Mechanism of WebSocket

2020/07/22 16:59

Article | Ma Yingying, Netease Smart Enterprise Web Front End Development Engineer

Introduction

In a complete instant messaging application, WebSocket is a critical link. It provides a full-duplex communication mechanism between the client and server of a web application. However, because both WebSocket itself and the underlying TCP connection can be unstable, developers have to design a complete set of keep-alive, liveness-check, and reconnection schemes to guarantee the immediacy and high availability of the application in practice. Reconnection speed in particular directly affects the "immediacy" and user experience of upper-layer applications. Imagine that WeChat still could not send or receive messages a minute after the network came back; wouldn't that drive you crazy?

Therefore, quickly restoring the availability of WebSocket when the network changes becomes particularly important.

A quick introduction to WebSocket

WebSocket was born in 2008 and became an international standard in 2011. All major browsers now support it. It is a new application-layer protocol, a true full-duplex communication protocol designed specifically for web clients and servers.

You can understand the WebSocket protocol by analogy with HTTP. Their differences:

  • The protocol identifier of HTTP is http, and that of WebSocket is ws
  • HTTP requests can only be initiated by the client; the server cannot actively push messages to the client, but with WebSocket it can
  • HTTP requests are subject to the same-origin restriction, so communication between different origins requires cross-origin handling, while WebSocket has no same-origin restriction

Similarities:

  • Both are application-layer communication protocols
  • The default ports are the same, 80 or 443
  • Both can be used for communication between browser and server
  • Both are based on TCP

[Figure: relationship between HTTP, WebSocket, and TCP]

Breaking down the reconnection process

First, consider the question: when is reconnection required?

The most obvious case is that the WebSocket connection has been broken, so we need to initiate another connection in order to keep sending and receiving messages. However, in many scenarios the WebSocket connection is not disconnected and yet is actually unusable, for example when the device switches networks, when a router in the middle of the link fails, or when the server is so overloaded that it cannot respond. In these scenarios the WebSocket is not disconnected, but the upper layer has no way to send or receive data normally. Therefore, before reconnecting, we need a mechanism that senses whether the connection and the service are available, and it must sense this quickly so that we can recover quickly from the unavailable state.

Once the connection is found to be unavailable, you can abandon the old connection by disconnecting it, and then initiate a new one. These two steps seem simple, but doing them fast is not so easy.

The first step is disconnecting the old connection. How can the client do this quickly? The protocol stipulates that the client must negotiate with the server before closing the WebSocket connection. But when the client cannot reach the server and therefore cannot negotiate, how can it disconnect and recover quickly?

The second step is quickly initiating the new connection. "Fast" here does not mean connecting immediately, which would have an unpredictable impact on the server. When reconnecting, a backoff algorithm is usually used, and the reconnection is initiated after a delay. But how should the reconnection interval be traded off against resource consumption? How can a connection be initiated quickly at the "right point in time"?

With these questions in mind, let's take a closer look at these three steps.

Quickly sensing when reconnection is required

The scenarios that require reconnection fall into three types: the connection is broken, the connection is not broken but is unavailable, and the service at the other end of the connection is unavailable.

The first scenario is simple. If the connection has been disconnected, it must be re-established.

For the latter two, whether it is the connection or the service that is unavailable, the effect on the upper-layer application is the same: instant messages can no longer be sent or received. From this perspective, a simple and crude way to sense when reconnection is needed is a heartbeat timeout: send a heartbeat packet, and if no reply has been received from the server after a specific time, consider the service unavailable, as shown in the scheme on the left in the figure below. This method is the most direct. To sense the problem faster with it, the only option is to send heartbeats more often. However, a heartbeat that is too frequent consumes too much traffic and power on mobile devices. Therefore, this method cannot provide rapid detection on its own, and is best used as a fallback mechanism for detecting the connection and the service.
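The heartbeat timeout described above can be sketched in TypeScript as follows. This is only a minimal illustration under stated assumptions: the class name `HeartbeatWatch`, its methods, and the use of plain millisecond timestamps passed in by the caller are all hypothetical and do not come from the article.

```typescript
// Tracks the oldest unanswered heartbeat and reports when its reply is overdue.
// All names are illustrative; times are millisecond timestamps (e.g. Date.now()).
class HeartbeatWatch {
  private timeoutMs: number;
  private pingSentAt: number | null = null;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Call right after a heartbeat packet has been sent.
  pingSent(now: number): void {
    // Keep the oldest unanswered ping so that later pings cannot hide a timeout.
    if (this.pingSentAt === null) this.pingSentAt = now;
  }

  // Call when the server's heartbeat reply arrives.
  pongReceived(): void {
    this.pingSentAt = null;
  }

  // True once a ping has gone unanswered for longer than timeoutMs.
  isTimedOut(now: number): boolean {
    return this.pingSentAt !== null && now - this.pingSentAt > this.timeoutMs;
  }
}
```

An application would call `pingSent` whenever it writes a heartbeat to the socket, call `pongReceived` from its message handler, and treat `isTimedOut(Date.now())` returning true as the signal that the service is unavailable.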

To detect that the connection is unavailable, besides heartbeat detection you can also watch the network status. Network disconnection, Wi-Fi switching, and network switching are the most direct causes of an unavailable connection, so when the network status changes from offline to online, a reconnection is needed in most cases, but not always. Because WebSocket is built on TCP, and a TCP connection does not directly sense network changes seen at the application layer, a brief network interruption sometimes does not affect the WebSocket connection at all, and communication continues normally once the network is restored. Therefore, when the network changes from offline to online, immediately send a heartbeat packet to check whether the connection is still usable. If the server's heartbeat reply is received normally, the connection is still available. If no reply arrives before the timeout, you need to reconnect, as shown on the right in the above figure. The advantage of this method is speed: once the network recovers, you can immediately tell whether the connection is available, and recover quickly if it is not. However, it only covers cases where WebSocket becomes unavailable because of a network change visible at the application layer.

To sum up, detecting with regularly sent heartbeat packets is stable and covers all scenarios, but it is slow. Watching the network status is fast and more sensitive, since there is no need to wait for the heartbeat interval, but the scenarios it covers are limited. We can therefore combine the two: send heartbeat packets regularly at a modest frequency, such as once every 40 s or 60 s depending on the application scenario, and also send a heartbeat immediately when the network status changes from offline to online, to check whether the current connection is available and recover at once if it is not. In this way, in most cases the upper-layer communication recovers quickly from the unavailable state. For the few remaining scenarios, the periodic heartbeat acts as a fallback, so recovery still happens within one heartbeat cycle.
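One way to combine the two schemes is sketched below. It is a hedged illustration, not the author's implementation: `ProbeDeps`, `ConnectionProbe`, and every method name are hypothetical, and the timers are injected so that the sketch does not depend on a browser. In a browser, `setTimer` and `clearTimer` could wrap `setTimeout` and `clearTimeout`, and `networkOnline` could be called from a handler for the window `online` event.

```typescript
// Hypothetical hooks supplied by the application; none come from the article.
interface ProbeDeps {
  sendPing: () => void;       // writes one heartbeat packet to the socket
  onUnavailable: () => void;  // called when a heartbeat reply is overdue
  setTimer: (fn: () => void, ms: number) => unknown; // e.g. wraps setTimeout
  clearTimer: (handle: unknown) => void;             // e.g. wraps clearTimeout
}

class ConnectionProbe {
  private deps: ProbeDeps;
  private intervalMs: number;
  private timeoutMs: number;
  private nextPing: unknown = null; // timer for the next periodic heartbeat
  private deadline: unknown = null; // timer that fires if no reply arrives

  constructor(deps: ProbeDeps, intervalMs: number, timeoutMs: number) {
    this.deps = deps;
    this.intervalMs = intervalMs;
    this.timeoutMs = timeoutMs;
  }

  // Begin the slow periodic heartbeat (for example every 40 s).
  start(): void {
    this.scheduleNext();
  }

  // Network changed from offline to online: probe now instead of waiting.
  networkOnline(): void {
    this.ping();
  }

  // A heartbeat reply arrived: the connection works, resume the slow cycle.
  pongReceived(): void {
    this.cancel(this.deadline);
    this.deadline = null;
    this.scheduleNext();
  }

  stop(): void {
    this.cancel(this.nextPing);
    this.cancel(this.deadline);
    this.nextPing = null;
    this.deadline = null;
  }

  private scheduleNext(): void {
    this.cancel(this.nextPing);
    this.nextPing = this.deps.setTimer(() => this.ping(), this.intervalMs);
  }

  private ping(): void {
    this.cancel(this.nextPing);
    this.nextPing = null;
    this.deps.sendPing();
    // Keep an already running deadline so extra pings cannot postpone it.
    if (this.deadline === null) {
      this.deadline = this.deps.setTimer(() => {
        this.deadline = null;
        this.deps.onUnavailable();
      }, this.timeoutMs);
    }
  }

  private cancel(handle: unknown): void {
    if (handle !== null) this.deps.clearTimer(handle);
  }
}
```

After `onUnavailable` fires, this sketch sends no further heartbeats; the assumption is that the caller reconnects and starts a fresh probe for the new connection.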

Quickly disconnecting old connections

In general, if the old connection still exists, it should be disconnected before the next connection is initiated. First, this releases resources on the client and the server; second, it prevents data from being mistakenly sent or received over the old connection.

We know that WebSocket transmits data over TCP, and the two ends of the connection are the server and the client. The end that closes a TCP connection first is the one that holds the TIME_WAIT state, and the WebSocket protocol prefers that this be the server rather than the client. Therefore, in most normal cases the server, not the client, should initiate closing the underlying TCP connection. That is to say, when the WebSocket connection is to be closed, a server that receives the instruction to close should immediately start closing the TCP connection; a client that receives the instruction to close should send a signal to the server, and then wait until the server closes the underlying TCP connection or until a timeout occurs.

When the client wants to disconnect the old WebSocket, there are two cases: the old connection is available, or it is not. When it is available, the client can send a disconnection signal directly to the server, and the server can then initiate the disconnection. When it is unavailable, for example after the client has switched Wi-Fi networks, the client sends a disconnection signal but the server never receives it, so the client can only wait for the timeout to expire before it is allowed to disconnect. Disconnecting by timeout takes a relatively long time. Is there a quicker way?

The upper-layer application cannot change the protocol-level rule that the server should be the one to initiate the disconnection, so the only place to start is the application logic. For example, the upper-layer business logic can make sure the old connection is completely invalidated, simulate a disconnection, and then initiate a new connection to resume communication. This is equivalent to abandoning the old connection as soon as it fails, without waiting for the real disconnection, so the next step can start quickly. When using this method, you must make sure in the business logic that the old connection is completely dead: all data received from the old connection is discarded, the old connection does not hinder the establishment of the new one, and the old connection's eventual timeout and disconnection do not affect the new connection or the upper-layer business logic.
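A simulated disconnection of this kind might look like the sketch below. The `Link` interface is a deliberately simplified, hypothetical stand-in for a connection object (a real browser `WebSocket` delivers event objects rather than plain strings), and `LinkManager` is an illustrative name, not something from the article.

```typescript
// Simplified, hypothetical shape of a connection used only for this sketch.
interface Link {
  onmessage: ((data: string) => void) | null;
  onclose: (() => void) | null;
  close(): void;
}

class LinkManager {
  private current: Link | null = null;
  private open: () => Link;
  private deliver: (data: string) => void;

  // open creates a new connection; deliver hands data to the business logic.
  constructor(open: () => Link, deliver: (data: string) => void) {
    this.open = open;
    this.deliver = deliver;
  }

  // Abandon the old link at once and switch to a fresh one.
  replace(): Link {
    const old = this.current;
    if (old !== null) {
      // Detach handlers first so late data or a late close event from the
      // old link can never reach the business logic.
      old.onmessage = null;
      old.onclose = null;
      // Request a close but do not wait for it; it may only finish on timeout.
      old.close();
    }
    const link = this.open();
    link.onmessage = (data) => {
      // Only the link that is current right now may deliver data.
      if (this.current === link) this.deliver(data);
    };
    link.onclose = () => {
      if (this.current === link) this.current = null;
    };
    this.current = link;
    return link;
  }
}
```

The design choice here is that the old link is cut off from the application before its close is even requested, so its eventual timeout cannot disturb the new connection.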

Quickly initiating new connections

Developers with IM experience will know that when a reconnection is caused by network problems, you should never initiate the new connection immediately. Otherwise, when network jitter occurs, all devices initiate connections to the server at the same moment, which is no different from a denial-of-service attack in which hackers exhaust network bandwidth with a flood of requests. This is a disaster for the server. Therefore, a backoff algorithm is usually used during reconnection, and the reconnection is initiated after a delay, as shown in the flow on the left in the figure below.

What if you want to connect quickly? The most direct way is to shorten the retry interval. The shorter the interval, the sooner communication resumes after the network is restored. However, retrying too frequently seriously wastes performance, bandwidth, and power. How can a better trade-off be made?

A more reasonable approach is, on the one hand, to increase the retry interval gradually as the number of retries grows, and on the other hand, to monitor network changes. When the network status changes from offline to online, a reconnection is more likely to succeed, so the reconnection interval can be reduced appropriately, as shown on the right side of the figure above (this interval also increases with the number of retries). The two methods are used together.
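The growing retry interval with a shorter starting point after an offline-to-online change can be sketched as a single function. The doubling rule, the upper bound, and the random jitter are assumptions added for illustration (the article only says the interval should grow with the number of retries), and `BackoffOptions` and `reconnectDelay` are hypothetical names.

```typescript
interface BackoffOptions {
  baseMs: number;       // starting delay in the ordinary case
  onlineBaseMs: number; // shorter starting delay right after offline -> online
  maxMs: number;        // upper bound for any single delay
}

// attempt is 0 for the first retry, 1 for the second, and so on.
// rand must return a value in [0, 1); Math.random is the usual choice.
function reconnectDelay(
  attempt: number,
  justCameOnline: boolean,
  opts: BackoffOptions,
  rand: () => number
): number {
  const base = justCameOnline ? opts.onlineBaseMs : opts.baseMs;
  // Double the window on every attempt, but never exceed maxMs.
  const ceiling = Math.min(opts.maxMs, base * Math.pow(2, attempt));
  // Pick a point in [ceiling / 2, ceiling) so that clients which lost the
  // network together do not all retry at the same instant.
  return Math.floor(ceiling / 2 + rand() * (ceiling / 2));
}
```

A caller would pass `Math.random` as `rand`, wait for the returned number of milliseconds, try to connect, and increase `attempt` after each failure.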

In addition, the interval can be adjusted using business logic, according to how likely the reconnection is to succeed. For example, the interval can be increased when the network is not connected or the application is in the background, and decreased when the network is normal, so as to speed up reconnection.
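As a rough sketch of this adjustment, the function below scales a delay according to the application state. The multipliers are arbitrary illustrative values rather than recommendations from the article, and `AppState` and `adjustDelay` are hypothetical names; in a browser the two flags could be derived from `navigator.onLine` and `document.hidden`.

```typescript
interface AppState {
  online: boolean;       // whether the device currently reports a network
  inBackground: boolean; // whether the application is in the background
}

// Scales a reconnection delay by how likely a retry is to succeed.
// The factors 4, 2 and 0.5 are illustrative assumptions.
function adjustDelay(delayMs: number, state: AppState): number {
  if (!state.online) return delayMs * 4;      // a retry is unlikely to succeed
  if (state.inBackground) return delayMs * 2; // nobody is waiting on the result
  return delayMs * 0.5;                       // network normal: retry sooner
}
```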

Conclusion

To summarize, this article first divided WebSocket disconnection and reconnection into three steps: determining when to reconnect, disconnecting the old connection, and initiating a new connection. It then analyzed how to complete each step quickly under different WebSocket and network states. First, check whether the current connection is available by sending heartbeat packets regularly, and also listen for network recovery events and send a heartbeat immediately after recovery, so as to sense the current state quickly and decide whether reconnection is required. Second, under normal circumstances the server disconnects the old connection; when contact with the server is lost, the old connection is simply abandoned and the upper layer simulates a disconnection to achieve a quick disconnect. Finally, when initiating a new connection, use a backoff algorithm to delay the connection for a period of time. To balance resource waste against reconnection speed, increase the reconnection interval when the network is offline, and reduce it when the network is normal or has just changed from offline to online, so that the reconnection happens as quickly as possible.
