MQTT-fu

DEFINITION:
-fu 1. [Slang] Expertise, Mastery
Example 1: My google-fu is weak
Example 2: Aragorn uses Ranger-fu to figure out that Sam and Frodo have taken

This article describes use patterns that experienced MQTT application developers often adopt. It brings together some best practices that are often found in code, passed on verbally, or held in the minds of experienced MQTT practitioners. We hope you find it a valuable resource. It builds on the introduction to MQTT in this article.

We describe and explain: retained messages, Quality of Service (QoS), including subscriber QoS and the “QoS paradox”, Last Will and Testament (LWT), Birth Certificates, online status indication, deleting online status messages, the Clean Session flag, cleaning up durable subscriptions, and finally the request-response messaging pattern.

Retained Messages

When a new subscriber connects to a broker and subscribes to a topic, it is often useful for it to find out what the last-published value was on that topic. For example, if the subscribing application displays weather data, some values, such as windspeed, will change frequently; other values, for example atmospheric pressure, might only change every half hour or so. When you start up the weather display application, you would expect it to show the current values of each variable, rather than leave the pressure dial blank until the next update.

This is where retained messages come in.

If the publisher of a message sets the RETAIN flag in the MQTT publish command, the broker maintains a one-message-deep buffer associated with each topic, and stores the payload of that retained message into that buffer.

Node-RED MQTT output node showing Retain and QoS settings

When a client subscribes to that topic, if there is a retained message in the buffer, then it is immediately sent to the subscribing client, with its RETAIN flag set, so the client knows it is not a “live” value (i.e. just published), but that it is potentially “stale” or out of date. Nonetheless, it is the last-published value, and thus represents the most up-to-date value of the data on that topic.

Notes

Note that you will only receive a maximum of one retained message for each topic you subscribe to. If you receive a message without the retain flag set, after subscribing to a topic, you can assume you will not be receiving a retained message.

If you use wildcards in the subscription, you can receive multiple retained messages on the topics which match your wildcard subscription that have a retained message on them.

Deleting retained messages

You can delete the retained message in the buffer on any topic by publishing a retained message to that topic with an empty payload. That is, a zero-length, null, payload. Note that the null message must have the RETAIN flag set. The null message will be sent to subscribers to that topic, as well as deleting the message buffer. Subsequent subscribers to that topic will not receive a retained message.
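The retained-message behaviour described in this section can be sketched with a small in-memory model. This is not a real broker or client library: `ToyBroker` and `topic_matches` are hypothetical names invented for illustration, live delivery is not modelled, and the special handling of topics beginning with `$` is ignored.

```python
from typing import Dict, List, Tuple


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT-style filter ('+' and '#' wildcards) matches a topic."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":  # multi-level wildcard: matches everything from here on
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


class ToyBroker:
    """Hypothetical stand-in for the broker's one-message-deep retained buffers."""

    def __init__(self) -> None:
        self.retained: Dict[str, bytes] = {}

    def publish(self, topic: str, payload: bytes, retain: bool = False) -> None:
        if not retain:
            return  # live delivery to current subscribers is not modelled here
        if payload == b"":
            self.retained.pop(topic, None)  # empty retained payload clears the buffer
        else:
            self.retained[topic] = payload  # overwrite: only the last value is kept

    def subscribe(self, topic_filter: str) -> List[Tuple[str, bytes, bool]]:
        """Return (topic, payload, retain_flag) for every matching retained message."""
        return [(topic, payload, True)
                for topic, payload in sorted(self.retained.items())
                if topic_matches(topic_filter, topic)]


broker = ToyBroker()
broker.publish("weather/pressure", b"1013", retain=True)
broker.publish("weather/pressure", b"1012", retain=True)   # replaces 1013
broker.publish("weather/windspeed", b"14", retain=True)
on_subscribe = broker.subscribe("weather/+")                # one message per topic
broker.publish("weather/windspeed", b"", retain=True)       # delete retained value
after_delete = broker.subscribe("weather/#")
```

A wildcard subscriber receives one retained message per matching topic, and after the empty retained publication only the pressure value is left.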

Quality of Service – QoS

Some messages are more important than others, and we want to make sure they are not lost en-route to where they are going. Other messages are less important, and we don’t care if the odd one gets lost.

MQTT enables the publisher of a message to decide how important the message is, and then the MQTT infrastructure (publishing client, broker and subscribers) knows how hard it needs to work to make sure that message gets through. We call this the Quality of Service, or QoS, of the message.

It is important to realise that the quality of service is only relevant in the situation where the connection between a client and a broker breaks, and it has to reconnect and work out if there is any catching-up to do on any messages that were in the process of delivery when the connection broke. TCP/IP, the Internet messaging protocol that underpins MQTT, already offers assured delivery of messages, and so this often raises the question of why MQTT needs its own mechanisms for assuring delivery of messages.

The answer is that TCP/IP only assures delivery of messages in a connected system. If the connection breaks, the “socket” – as the TCP/IP message channel is called – is destroyed, and has to be started again by the connecting application.

This is rather like when you phone someone on a mobile phone, and the connection drops, but you carry on talking. Then you eventually realise the person you’re talking to has disconnected, you phone them back, and have to say again what you were saying between the connection dropping and you realising.

The MQTT QoS mechanisms deal with the delivery of messages in the face of network disconnections and reconnections.

Another subtle difference between the service that TCP/IP offers, and that of MQTT, is that TCP/IP only assures message delivery to the top of the TCP/IP software stack in the computer’s operating system. If it gets lost between there and the receiving application successfully processing it, then there is nothing TCP/IP can do to help.

MQTT operates at the application level, so it can assure that a message is received and processed by the target application; if that fails, the message is re-delivered later for another try.

MQTT QoS applies in two places in the journey from publisher, via the broker, to the subscribers:

a) between the publisher and the broker, where the QoS is specified by the publishing application on a per message basis.

b) between the broker and each of the subscribers, where the QoS is specified by the subscriber when it places its subscription with the broker.

This second point is quite subtle, and deserves further explanation, but we will deal with the QoS basics first.

MQTT has three qualities of service for message delivery, or QoS settings, as we tend to call them. We will describe the interactions between a publishing client and an MQTT message broker here, but the same interactions also apply, in exactly the same way, to the connection from broker to subscriber.

It is also worth mentioning that when we say a message is delivered, we mean that it arrives at its intended destination, complete and uncorrupted. That is, it is exactly what was sent from the other end. Most of this is thanks to TCP/IP, which assures (through checksums, retransmission and packet sequencing) that messages arrive uncorrupted, and in the same order that they were transmitted.

QoS 0, aka “fire and forget”

This is the lowest quality of service, and thus the one where the messaging system tries least hard to deliver the message. The “fire and forget” label describes what happens: assuming that there is an open MQTT connection, the publisher sends the message to the broker. And that’s it. If it gets there, great; if not, oh well, it presumably didn’t matter much.

Despite this seeming disregard for the wellbeing of the message, remember that as long as the TCP/IP connection is still working, then the message will be delivered. Thus, QoS 0 is by far the most commonly used quality of service in the MQTT world.

QoS 1, aka “at least once delivery”

QoS 1 is for situations where the message must definitely get through to the other end, even if the link breaks during transmission, but where it doesn’t matter if the message arrives more than once. This QoS level is what’s known as a simple acknowledgement. The publisher sends the message to the broker, and when it receives it, the broker sends back a message saying “I got that message”. Each message has a number, and the acknowledgement message mentions that number, so the publisher knows which message the broker is saying it received, so it can cross it off its list of pending messages.

How can the message be delivered twice? If the connection breaks before the acknowledgement is received by the publisher, the broker has received the message, but the publisher doesn’t know that. So when the connection is restored, the publisher re-sends the message (with the same message identifier number). The broker receives the message (again), and sends an acknowledgement (again). This could go on, until the acknowledgement is successfully received by the publisher, at which point it knows for certain that the broker has received the message, and can stop trying to send it.
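The duplicate-delivery scenario can be reproduced with a toy simulation. This is only a sketch: `Qos1Broker` and `publish_qos1` are hypothetical names, and the lost acknowledgements are modelled by simply ignoring the first few replies rather than by a real network failure.

```python
from typing import List, Tuple


class Qos1Broker:
    """Toy receiver: records every PUBLISH it sees and always acknowledges it."""

    def __init__(self) -> None:
        self.received: List[Tuple[int, str]] = []

    def on_publish(self, packet_id: int, payload: str) -> int:
        self.received.append((packet_id, payload))
        return packet_id  # the acknowledgement carries the same identifier


def publish_qos1(broker: Qos1Broker, packet_id: int, payload: str,
                 acks_lost: int) -> int:
    """Re-send until an acknowledgement arrives.

    The first `acks_lost` acknowledgements are treated as lost in transit.
    Returns the number of transmissions that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        ack = broker.on_publish(packet_id, payload)
        if attempts > acks_lost and ack == packet_id:
            return attempts  # acknowledged: cross it off the pending list


broker = Qos1Broker()
attempts = publish_qos1(broker, packet_id=7, payload="set balance 100", acks_lost=2)
```

With two acknowledgements lost, the publisher transmits three times and the broker sees three copies of the same message, all with identifier 7.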

QoS 1 is most often used where the messages are what’s called “idempotent”. This is where applying the message more than once doesn’t change the outcome. For example, “set my bank balance to £100.00” is idempotent, because you can do it over and over again, and the outcome is still the same: I have £100 in my account. The message “add £5 to my bank balance”, though, is not idempotent. Applying it once will give me £5 more than I had before. Applying it a second time will mean I now have £10 more than I had to start with, and so on. So you would not use QoS 1 for the second type of message.
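The difference is easy to see by applying each kind of message twice. The two functions here are hypothetical handlers written only to illustrate the bank balance example.

```python
def set_balance(balance: float, amount: float) -> float:
    """Idempotent: the result does not depend on the previous balance."""
    return amount


def add_to_balance(balance: float, amount: float) -> float:
    """Not idempotent: every application changes the result."""
    return balance + amount


start = 20.0
set_once = set_balance(start, 100.0)
set_twice = set_balance(set_once, 100.0)    # duplicate delivery: same outcome
add_once = add_to_balance(start, 5.0)
add_twice = add_to_balance(add_once, 5.0)   # duplicate delivery: £5 too much
```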

The other way to deal with multiple deliveries of the same message is known as “de-duplication”, or “de-duping”. Each message is given a unique identifier by the publisher, added to the message payload. It might be a timestamp of sufficiently fine grain to ensure uniqueness, or a unique serial number chosen by the publisher. The receiving application can use this identifier, comparing against the messages it has received before, to make sure the message has not already been processed, and discarding second and subsequent copies if it has.
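A receiving application might implement de-duplication along these lines. This is a minimal sketch under stated assumptions: the JSON payload layout with `id` and `add` fields is invented for the example, and `DedupingReceiver` is a hypothetical class. A real application would also need to bound the set of remembered identifiers, which is not shown.

```python
import json
from typing import Set


class DedupingReceiver:
    """Applies each message at most once, keyed on an identifier in the payload."""

    def __init__(self) -> None:
        self.seen_ids: Set[str] = set()
        self.balance = 0.0

    def on_message(self, payload: str) -> bool:
        """Process the message; return False if it was a duplicate and was discarded."""
        message = json.loads(payload)
        if message["id"] in self.seen_ids:
            return False  # second or subsequent copy: ignore it
        self.seen_ids.add(message["id"])
        self.balance += message["add"]
        return True


receiver = DedupingReceiver()
message = json.dumps({"id": "pub1-000042", "add": 5.0})
results = [receiver.on_message(message) for _ in range(3)]  # delivered three times
```

Even though the non-idempotent "add" message arrives three times, the balance only changes once.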

QoS 2, aka “exactly once delivery”

The third, and highest, Quality of Service in MQTT is QoS 2. This is described as exactly once delivery as it uses an additional message flow between sender and receiver to ensure that despite multiple attempts at delivery, the message is only formally received and processed once by the receiving application.

This is achieved by breaking the message delivery into two steps. The first is the same as the QoS 1 message flow, ensuring that the message arrives at the broker at least once. In the QoS 2 case, however, duplicate messages are recognised by the existence of a message with the same message identifier number in the “in progress” list in the broker.

Multiple deliveries of a message with the same message identifier result in acknowledgements being sent by the broker back to the publisher, but no further action by the broker.

When an acknowledgement gets through to the publisher, it moves to the second part of the delivery process, and sends a “release” message. This tells the broker to go ahead and process the message (working out which subscribers should receive the message, and starting delivering copies of the message to each of them). The broker sends a “completion” message back to the publisher to confirm receipt of the release message.

If the completion message gets lost on its way to the publisher, due to the connection dropping, the publisher will re-send the release message when the connection is restored. However, by now, the broker will have deleted any record of the in-flight message, and so will just politely acknowledge receipt of the release message by re-sending the completion message.

This flow of messages results in both the original publication, and the “release” message, definitely being received by the broker, and by performing the prescribed processing steps on receipt of these messages, the message is assured to be delivered once and once only.
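The broker's side of this two-step exchange can be sketched as a small state machine. `Qos2Receiver` is a hypothetical class, not a real broker. The reply strings use the MQTT packet names PUBREC, for the acknowledgement, and PUBCOMP, for the completion message; the release message is called PUBREL in the protocol. The sketch follows the flow as described above, where the message is processed when the release arrives.

```python
from typing import Dict, List


class Qos2Receiver:
    """Toy broker-side state for QoS 2, keyed by message identifier."""

    def __init__(self) -> None:
        self.in_progress: Dict[int, str] = {}  # received, waiting for release
        self.processed: List[str] = []         # handed on for delivery to subscribers

    def on_publish(self, packet_id: int, payload: str) -> str:
        # A repeat of an in-progress identifier is acknowledged but not stored again.
        self.in_progress.setdefault(packet_id, payload)
        return "PUBREC"

    def on_release(self, packet_id: int) -> str:
        payload = self.in_progress.pop(packet_id, None)
        if payload is not None:
            self.processed.append(payload)  # processed once and once only
        return "PUBCOMP"  # a repeated release is simply acknowledged again


broker = Qos2Receiver()
replies = [
    broker.on_publish(1, "invoice 1000 litres"),
    broker.on_publish(1, "invoice 1000 litres"),  # re-sent after a lost acknowledgement
    broker.on_release(1),
    broker.on_release(1),                         # re-sent after a lost completion
]
```

Both the publication and the release are delivered twice here, yet the invoice message is processed exactly once.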

Clearly QoS 2 is more “chatty” on the communication link than QoS 0 and 1, but that level of assurance of delivery makes it more than worthwhile in the situations where it is warranted.

For example, if my sensor has measured the delivery of 1000 litres of oil to my heating system storage tank, it should send the message at QoS 2 back to the oil company to raise an invoice for that fuel purchase. I don’t want to receive two bills for it, and the oil company doesn’t want to send zero bills. Exactly one bill being sent to me is the correct result.

Subscriber QoS

As mentioned in the previous section, there is also a QoS setting on a subscription lodged by a subscribing client with a broker. This specifies the maximum QoS that it wants to receive. Under normal circumstances, the default QoS setting of “2” is used by the subscriber. Setting it to 2 means that a QoS no higher than 2 is acceptable, which is fine, as that is the highest QoS level. So in practice this tells the broker to send all messages at the same QoS at which they were published.

Consequently a QoS 2 publication from a publishing client makes two distinct QoS 2 transfers: publisher to broker, then broker to subscriber.

If the subscriber specifies a QoS of 0 or 1, it is an indication that for some reason it is not able to deal with a higher QoS from the broker. In particular, a downgrade from QoS 2 may be requested by a client with very little memory that is unable to hold the state information required to complete a QoS 2 message transfer handshake.

Note that the QoS of delivery from the broker to subscribers is on a per-subscriber basis. The publisher has no knowledge of when or if the message has been delivered to all the subscribers to a topic. A subscriber might not be connected to the broker, and thus not receive the message, even though it is still being held on the broker ready for delivery when it does next connect.
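Because the subscription QoS is a maximum, the QoS used on the broker-to-subscriber leg works out as the lower of the two values. A one-line helper makes this explicit; `delivery_qos` is a hypothetical name used only for this illustration.

```python
def delivery_qos(published_qos: int, subscribed_qos: int) -> int:
    """QoS used from broker to subscriber: never higher than the subscriber asked for."""
    for qos in (published_qos, subscribed_qos):
        if qos not in (0, 1, 2):
            raise ValueError(f"invalid QoS: {qos}")
    return min(published_qos, subscribed_qos)


# Every combination of publication QoS and subscription QoS.
table = {(p, s): delivery_qos(p, s) for p in range(3) for s in range(3)}
```

A subscription at QoS 2 therefore passes every message through at its published QoS, while a subscription at QoS 0 downgrades everything to QoS 0.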

Node-RED MQTT input node showing subscriber QoS setting

The QoS paradox

The usual assumption with MQTT messages is that the more “important” they are, the higher the Quality of Service you should use when you publish and subscribe to them, in order that all parties (publisher, broker and subscriber) try their hardest to get the message through.

A message sent with QoS 2 will (barring physical destruction of the client or broker) be delivered once and once only to subscribers. If there is a problem, such as a network link temporarily down, or a subscriber is not available, then the client and broker will remember where they were in the message transfer process, and will retry it repeatedly until the protocol handshake is successfully completed.

However, there is an important exception to this approach: if there is a time window in which delivering the message is important, but after that time is irrelevant, or worse, potentially dangerous, then the message should be sent at QoS 0.

Imagine the scenario that a message must be sent to a remote relay to turn a machine on to perform some task there and then, and that a naive programmer has used QoS 2 for the message, because the machine MUST be turned on.

If the network is down for a few days, and the message isn’t delivered to the machine, some other action will most likely be taken, either a manual intervention, or some kind of fail-safe behaviour by the machine to ensure that it does what is considered the “right thing” in the absence of other instructions from the publisher.

When, sometime later, the network returns and the message is delivered, the need to turn on the machine may well have passed, or indeed it could be extremely dangerous for the machine to suddenly spring into life unexpectedly, just because the network suddenly came back up.

The recommended approach in such situations, then, is to use a QoS 0 publication, and have the subscribing application publish back (also at QoS 0) a confirmation that it has received the message. The publishing application subscribes to those confirmations so that it knows that the message has been received and acted-on by the subscriber (in our example, confirmation that the machine has been turned on).

Most importantly, if a response has not been received from the remote device within an appropriate amount of time, an action can be initiated, such as notifying an operator, retrying the message before giving up, or initiating an alternative action.

For example, if a remote device is not contactable via the usual wide-area network, the application could initiate a backup (possibly high-cost, low-bandwidth) network connection, such as a GSM or satellite phone connection, to contact the device.
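The publish, confirm and time-out logic of this approach can be sketched without any MQTT library. In this illustration the `publish` callable stands in for a QoS 0 publication and the queue stands in for confirmations arriving on a subscribed topic. `send_with_deadline`, the `machine/on` command string and the return values are all assumptions made for the example.

```python
import queue
from typing import Callable, List


def send_with_deadline(publish: Callable[[str], None],
                       confirmations: "queue.Queue[str]",
                       command: str,
                       timeout_s: float,
                       retries: int = 1) -> str:
    """Publish `command` and wait a bounded time for the matching confirmation."""
    for _ in range(1 + retries):
        publish(command)  # fire and forget: nothing is queued for later delivery
        try:
            reply = confirmations.get(timeout=timeout_s)
        except queue.Empty:
            continue  # no confirmation inside the time window: retry or give up
        if reply == command:
            return "confirmed"
    return "escalate"  # e.g. notify an operator or bring up a backup link


# A device that confirms immediately.
confirmations: "queue.Queue[str]" = queue.Queue()
sent: List[str] = []


def responsive_publish(command: str) -> None:
    sent.append(command)
    confirmations.put(command)  # stands in for the device's QoS 0 confirmation


ok = send_with_deadline(responsive_publish, confirmations, "machine/on", timeout_s=0.05)

# A device that never answers: one retry, then escalate.
unanswered: List[str] = []
failed = send_with_deadline(unanswered.append, queue.Queue(), "machine/on", timeout_s=0.05)
```

The key design point is that the deadline lives in the application, so a command that cannot be confirmed in time is abandoned rather than delivered days later.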

Last Will and Testament (LWT)

It is often useful to know when a device has unexpectedly become unable to publish data. This could be because the device itself has failed (for example, batteries running out), or that the network connection has failed. MQTT provides a mechanism to notify interested parties when a device has unexpectedly dropped off the network.

This mechanism is called “Last Will and Testament”, as it can be thought of as the document you lodge with a legal agent such as your lawyer, to convey the information you would like people to know immediately after your death. It can also be thought of as the message you would have sent if you knew you were about to drop off the network.

These messages are sometimes referred to as “death certificates” (cf. birth certificates), “will messages”, or “LWT messages”.

Node-RED MQTT input node Last Will and Testament settings for the MQTT broker configuration

The way the Last Will and Testament mechanism is effected is that when a client connects to a broker, it can optionally include a message in the connect packet, with the usual topic and payload, and QoS and retain flags. This message is held in the broker, associated with the client identifier that the client uses.

As the value of this message is almost always to let clients know that a particular device is no longer connected, it is usual for either the topic or the payload of the LWT message to contain a unique identifier for that client. It does not need to be the MQTT client identifier (which is unique for that broker), but is often a serial number or network MAC address or something like a geographical identifier for that client.

At connect time, the client also establishes a commitment to the broker for a “keepalive” time. This is a commitment from the client that it will send something to the broker every n seconds. Even if it’s not a publication it will send a small “ping” packet to say “I’m still alive”.

The broker tracks the publications and pings from clients and, if the keepalive interval is exceeded (plus a bit of leeway for network delays, and clients being busy doing other processing), then the broker assumes the “untimely death” of the client, closes the network connection to it, and publishes, on the client’s behalf, the Last Will and Testament message.
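The broker's bookkeeping can be modelled in a few lines. This is a sketch, not broker code: `KeepaliveMonitor` is a hypothetical class, time is passed in explicitly as seconds so the example is deterministic, and the leeway is taken as one and a half times the keepalive interval, which is the limit given in the MQTT v3.1.1 specification.

```python
from typing import Dict, List, Tuple

Will = Tuple[str, str]  # (topic, payload)


class KeepaliveMonitor:
    """Toy model of a broker tracking keepalive deadlines and firing wills."""

    GRACE = 1.5  # allowed silence, as a multiple of the keepalive interval

    def __init__(self) -> None:
        self.clients: Dict[str, dict] = {}

    def connect(self, client_id: str, keepalive_s: float, will: Will, now: float) -> None:
        self.clients[client_id] = {"keepalive": keepalive_s, "will": will, "last_seen": now}

    def packet_received(self, client_id: str, now: float) -> None:
        """Any publication or ping from the client resets its timer."""
        self.clients[client_id]["last_seen"] = now

    def expire(self, now: float) -> List[Will]:
        """Drop clients that have been silent too long and return their wills."""
        fired: List[Will] = []
        for client_id in list(self.clients):
            client = self.clients[client_id]
            if now - client["last_seen"] > client["keepalive"] * self.GRACE:
                fired.append(client["will"])  # published on the client's behalf
                del self.clients[client_id]
        return fired


monitor = KeepaliveMonitor()
monitor.connect("fred", keepalive_s=60, will=("LWT/fred", "offline"), now=0)
monitor.connect("jim", keepalive_s=60, will=("LWT/jim", "offline"), now=0)
monitor.packet_received("fred", now=50)   # fred pings; jim stays silent
fired_at_100 = monitor.expire(now=100)
fired_at_141 = monitor.expire(now=141)    # fred has now been silent for 91 s
```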

This is delivered to subscribers to the topic of the LWT message, as set by the client when it connected, and serves as a notification that the client has dropped off the network, and thus they should not expect it to send or receive data. If the unique identifier for the device is in the topic of the LWT message, then the subscribers can use a wildcard subscription to ensure they get LWT messages from any client.

Note that the topic structure for LWT messages must be agreed in advance by publishers and subscribers, so applications that wish to receive LWT messages subscribe to the correct topics, and that clients set the correctly formatted topic and payload in their LWT message.

For example, if the agreed topic structure for LWT messages is of the form “LWT/{device name}”, then the LWT status message for device “fred” would be:

topic: “LWT/fred”
payload: “offline”

An application subscribing to “LWT/+” will receive LWT messages from devices “fred”, “jim” (on topic LWT/jim), etc.

Note that the LWT message is defined at connect time, not at the time of the unexpected disconnection. Hence you can not have an LWT payload of “went offline at 10:00”, as that time would be incorrect.

However, if timestamped LWT messages are important to an application, it can either apply the timestamp itself when the message arrives, or a separate application could subscribe to all LWT messages, enhance them, and then re-publish them (on a different topic). Applications could then subscribe to the enhanced LWT messages.

Of course, someone needs to watch the LWT of the LWT-enhancer application, to restart it if it fails!
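The enhancement step itself is a small transformation. The sketch here assumes the “LWT/{device name}” topic structure from the example above; the `LWT-enhanced/` output topic, the JSON field names and the function name `enhance_lwt` are invented for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Tuple


def enhance_lwt(topic: str, payload: str, received_at: datetime) -> Tuple[str, str]:
    """Turn an LWT message into a timestamped message on a different topic."""
    prefix, _, device = topic.partition("/")
    if prefix != "LWT" or not device:
        raise ValueError(f"not an LWT topic: {topic}")
    enhanced = {
        "device": device,
        "status": payload,
        "receivedAt": received_at.isoformat(),  # time the enhancer saw the LWT
    }
    return f"LWT-enhanced/{device}", json.dumps(enhanced)


when = datetime(2017, 5, 1, 10, 0, 0, tzinfo=timezone.utc)
new_topic, new_payload = enhance_lwt("LWT/fred", "offline", when)
```

Note that the timestamp records when the enhancer received the LWT message, which is later than the moment the device actually dropped off the network by up to the keepalive leeway.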

Birth certificates

Knowing about the other end of the client lifecycle – when a client first connects to a broker – may also be of value to subscribing applications. An indication that a device is operational and connected to the broker means that, as a publisher, it is now likely to send data, or, as a subscriber, that it can now receive messages from publishers.

The broker knows that the client is connected, but there is no way for any other clients to find out that fact unless the client explicitly makes an announcement that it is online.

Node-RED MQTT input node showing Birth Certificate

We use the term “birth certificate” to refer to a message sent by a client when it first connects to a broker. Subscribing clients that are interested in devices coming online can subscribe to a previously-agreed topic structure to ensure they receive such messages.

As with Last Will and Testament messages, the usual purpose of birth certificates is to announce that a particular device is now online, and so it is common for a unique identifier for the client to be incorporated into either the topic or the payload of the birth certificate.

It is also common for the time at which the device came online to be included in the message (assuming that is known by the device) and also often dynamic information such as the allocated IP address for the device.

Online status indication

A “current status” indication can be elegantly implemented using a combination of retained messages, birth certificates and death certificates. This enables a subscribing client to determine the current online status of a device, which can be very useful for status dashboards, or generating notifications after a period offline, etc.

Each participating client has a unique identifier, maybe its MAC address, serial number, or some other identifier. For this example our device will be “abc123”. When it connects to the broker, our device sets a last will and testament to be a retained publication on topic “status/abc123” with payload “0”, indicating that abc123 has gone offline.

Our device publishes a birth certificate every time it connects: a retained message to the “status/abc123” topic, with payload “1”, indicating that it is online.

At any time, an application wishing to determine whether our device is online or not, can subscribe to “status/abc123”, and will immediately receive the retained publication on that topic, as set by either the birth (1), or death (0) certificate, indicating its status at that time.

If the subscriber remains subscribed to that topic, it will receive live status change updates as they happen. By using a wildcard subscription, “status/+”, an application can determine the current, and then ongoing status, of all devices participating in the scheme, which can be used, for example, as input data for a status dashboard application.
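On the dashboard side, each message arriving on the “status/+” subscription can be folded into a simple table of device states. This is a minimal sketch: `apply_status` is a hypothetical handler, the device “def456” is invented for the example, and an empty payload is treated as meaning the retained status has been cleared.

```python
from typing import Dict


def apply_status(statuses: Dict[str, bool], topic: str, payload: str) -> None:
    """Update the device table from one message received on 'status/+'."""
    prefix, _, device = topic.partition("/")
    if prefix != "status" or not device:
        return  # not a status topic: ignore
    if payload == "":
        statuses.pop(device, None)  # retained status was cleared: forget the device
    else:
        statuses[device] = (payload == "1")  # "1" = online, "0" = offline


statuses: Dict[str, bool] = {}
incoming = [
    ("status/abc123", "1"),  # retained birth certificate, received on subscribing
    ("status/def456", "1"),
    ("status/abc123", "0"),  # Last Will and Testament published by the broker
    ("status/def456", ""),   # status message deleted
]
for topic, payload in incoming:
    apply_status(statuses, topic, payload)
```

The same handler serves both the retained messages delivered at subscription time and the live updates that follow.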

Deleting online status messages

Client devices which connect to a broker and publish a retained birth certificate to indicate they are connected will leave a retained publication on that broker which will hang around forever, even if the client device never connects again.

If you have a highly dynamic client population, where clients connect for a while, but then disappear and don’t come back, it may be desirable to remove all evidence of their presence, and delete their online status message. This would ensure the topic tree for online statuses does not have lots of retained ‘offline’ status messages from devices which are long-gone.

Deleting the online status message can be achieved by putting an empty message payload in the retained message of the Last Will and Testament.

An empty payload in a retained message has the effect of clearing the retained message on that topic in the broker. Thus, when the LWT fires, the retained online status message is deleted.

Sometimes this is not the desired behaviour, as an application may wish to find out that a particular client was once connected but is no longer. In this case, the usual death certificate of a retained “0” or “offline” value should be used.

Clean Session flag

Some messaging systems, for example, JMS (Java Message Service), talk about “durable” and “non-durable” subscriptions. In MQTT this mechanism is implemented using the Clean Session flag in the initial connection message from a client to its broker.

The Clean Session flag in the MQTT connect packet tells the broker what it should do with messages and subscriptions for a client when that client disconnects. With Clean Session true, when the client disconnects, the broker deletes all state associated with that client: its subscriptions and any in-flight messages (of any QoS) that were in the process of being delivered or queued for delivery. These are also referred to (for example, in JMS) as ‘non-durable’ subscriptions.

This mode of operation is typically used by devices that are not able to retain state across a power cycle – they start in a known, clean, state, and set up the subscriptions that they want when they connect to a broker.

Messages arriving thereafter will be predictable in the sense that they will be on topics to which the client has explicitly subscribed. Setting Clean Session to true also prevents the accumulation of “marooned” states on the broker from clients that may be ephemeral, for example web browser based clients with randomised client IDs, that will only ever connect once, and never come back again with the same Client ID.

Setting Clean Session false enables a client with a static client ID (i.e. one which does not change between separate broker connection sessions) to subscribe to topics on a broker, and have those subscriptions remain lodged in the broker, even when the client disconnects. These are referred to in other messaging systems as “durable subscriptions”.

For QoS 1 and QoS 2 messages, this means that when a client disconnects (deliberately or otherwise) from the broker, messages that are published on the subscribed topics while the client is offline, will be queued in the broker for that client. When the client reconnects, the messages will be delivered.

So with Clean Session false, the subscribing client will not miss any QoS 1 and QoS 2 messages even when temporarily disconnected. For example, a mobile application in a commercial vehicle, receiving work orders for the driver, will not miss any incoming messages when the vehicle goes through a mobile signal dead spot, or goes through a tunnel.
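The effect of the flag on broker-side state can be illustrated with a toy model. `SessionBroker` is a hypothetical class, not a real broker: subscriptions are exact topic names with no wildcards, every subscription is assumed to be made at QoS 2, and delivery to connected clients is not modelled. The client IDs and topic are invented for the example.

```python
from typing import Dict, List, Set


class SessionBroker:
    """Toy model of the per-client state a broker keeps between connections."""

    def __init__(self) -> None:
        self.subscriptions: Dict[str, Set[str]] = {}
        self.queued: Dict[str, List[str]] = {}
        self.connected: Dict[str, bool] = {}  # client ID -> Clean Session flag

    def connect(self, client_id: str, clean_session: bool) -> List[str]:
        """Connect and return any messages queued while the client was offline."""
        if clean_session:  # start from a known, clean state
            self.subscriptions.pop(client_id, None)
            self.queued.pop(client_id, None)
        self.connected[client_id] = clean_session
        self.subscriptions.setdefault(client_id, set())
        return self.queued.pop(client_id, [])

    def subscribe(self, client_id: str, topic: str) -> None:
        self.subscriptions[client_id].add(topic)

    def disconnect(self, client_id: str) -> None:
        if self.connected.pop(client_id):  # Clean Session true: delete all state
            self.subscriptions.pop(client_id, None)
            self.queued.pop(client_id, None)

    def publish(self, topic: str, payload: str, qos: int) -> None:
        for client_id, topics in self.subscriptions.items():
            offline = client_id not in self.connected
            if topic in topics and offline and qos >= 1:
                self.queued.setdefault(client_id, []).append(payload)


broker = SessionBroker()
broker.connect("van-17", clean_session=False)
broker.subscribe("van-17", "orders/van-17")
broker.disconnect("van-17")                                # enters a tunnel
broker.publish("orders/van-17", "collect pallet", qos=1)   # queued for van-17
broker.publish("orders/van-17", "traffic hint", qos=0)     # QoS 0 is not queued
durable_backlog = broker.connect("van-17", clean_session=False)

broker.connect("browser-1", clean_session=True)
broker.subscribe("browser-1", "orders/van-17")
broker.disconnect("browser-1")                             # state deleted here
broker.publish("orders/van-17", "second job", qos=1)
clean_backlog = broker.connect("browser-1", clean_session=True)
```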

Cleaning-up durable subscriptions

It is up to the implementers of MQTT broker software what, if any, facilities they offer for an administrator to clean up the debris of messages being held in queues for clients that have connected with Clean Session false. Over time this can cause problems for storage and memory resources on the broker machine.

Here we describe a way to manage the durable subscriptions on a broker that respects the need for clients to reconnect later to collect messages, while cleaning up messages and subscriptions for clients that have not been seen for a suitably long time. More than anything, it is an example of how the features of MQTT can be combined with application logic to solve some quite complex problems.

Clean session flag in an MQTT broker configuration in Node-RED

The solution is based on an application, known as “The Cleaner”, which stays connected to the broker continuously. It subscribes to the birth/death certificate topic tree (see the earlier sections on “Last Will and Testament”, “Birth certificates” and “Online status indication” to understand these mechanisms). The status topics are arranged to include the client ID (that is, the one used in the Connect message to identify each client uniquely to the broker) of each connected client, for example, “status/device_1” with a value of 1 (birth certificate) or 0 (set in the Last Will and Testament message).

The Cleaner application receives the offline status messages, unpacks the topic to extract the client ID, and records a timestamp for when that client disconnected in a list. Using a timer or a periodic scan of the list of offline clients, the Cleaner application determines that a given client has been offline for “too long”, according to some suitable definition, and then executes the clean-up step.

Using a separate thread (or spawning a child process, depending on how the application is written), the Cleaner makes an MQTT connection to the broker using the client ID of the client it wants to clean-up. In the connect message, it sets Clean Session true. This tells the broker to clean up any existing state (subscriptions and messages) for that client. The application then disconnects this new connection. Then it can discard the information it holds about that client: both the application and the broker have now removed all trace of its existence.

If you use QoS 2 for the birth/death certificate publications from the devices, and the Cleaner application connects with Clean Session false and makes a QoS 2 subscription to the status topic tree, the broker will ensure that all “offline” status messages are received by the Cleaner application, even if it disconnects briefly.
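The Cleaner's bookkeeping can be sketched independently of any MQTT library. In this illustration the `purge` callback stands in for the step of connecting with the stale client's ID and Clean Session true, then disconnecting. The class layout, method names and the one-hour threshold are assumptions made for the example, and time is passed in explicitly as seconds so the behaviour is deterministic.

```python
from typing import Callable, Dict, List


class Cleaner:
    """Tracks how long each client has been offline and purges stale ones."""

    def __init__(self, max_offline_s: float, purge: Callable[[str], None]) -> None:
        self.max_offline_s = max_offline_s
        self.purge = purge
        self.offline_since: Dict[str, float] = {}

    def on_status(self, topic: str, payload: str, now: float) -> None:
        """Handle one message from the 'status/+' subscription."""
        prefix, _, client_id = topic.partition("/")
        if prefix != "status" or not client_id:
            return
        if payload == "0":
            self.offline_since.setdefault(client_id, now)  # record disconnect time
        else:
            self.offline_since.pop(client_id, None)        # client is back online

    def sweep(self, now: float) -> List[str]:
        """Purge every client that has been offline for too long."""
        expired = [client_id for client_id, since in self.offline_since.items()
                   if now - since > self.max_offline_s]
        for client_id in expired:
            self.purge(client_id)  # connect as client_id with Clean Session true
            del self.offline_since[client_id]
        return expired


purged: List[str] = []
cleaner = Cleaner(max_offline_s=3600, purge=purged.append)
cleaner.on_status("status/device_1", "0", now=0)
cleaner.on_status("status/device_2", "0", now=3000)
cleaner.on_status("status/device_2", "1", now=3100)  # device_2 reconnects in time
swept = cleaner.sweep(now=4000)
```

Running `sweep` from a timer gives the periodic scan described above; only device_1, offline for more than an hour, is cleaned up.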

Request-response pattern

Note: This article was written in 2017, and refers to v3.1.1. MQTT v5.0 has a built-in request-response mechanism. For more information on features of v5.0 see https://www.hivemq.com/tags/mqtt-5-essentials/

MQTT uses a publish-and-subscribe messaging pattern, which decouples publishers from subscribers with an intermediary broker. It is inherently a one-to-many protocol, based on topic-based subscriptions.

Another commonly-used messaging pattern is known as “request-response”. Here, one application sends a message to another application, containing a request for some information, and the application receiving the request formulates a response and sends it back to the requesting application. HTTP-based messaging systems and RESTful APIs (Application Programming Interfaces) implement this pattern.

There is an elegant way to implement the request-response pattern using the publish-subscribe model of MQTT. This enables an MQTT client to request information from another client, and receive a response. The solution is as follows.

The server application (another MQTT client connected to the broker) that will receive the request and provide the response subscribes to a topic which is known to all clients that will want to make use of the service offered. For example, an application that can calculate the square of a number might subscribe to the “square” topic. This is known as the service topic.

A client wishing to send a request and get a response from an application such as our “square” service example constructs a request message including two fields: the input data for the request, and a “reply-to” topic name.

The reply-to topic should be randomly generated, or follow a schema which ensures it is unique to that client. It might be derived from the client ID, or just be random. In this case we’ll use a random number with a prefix string: “response/2901”. The client subscribes to its unique reply-to topic.

The request can be in any format agreed between clients and server applications, but in this example we’ll use a JSON payload:

{ "request": 4, "replyTopic": "response/2901" }

The client publishes this message to the service topic that the server has subscribed to: “square” in this example.

The server application receives the message, parses the request to get the input data to operate on, computes the response value, and publishes a message back to the topic named in the reply-to topic of the request message.

In this case, the application publishes the value 16 (maybe in a JSON or other format) to the “response/2901” topic.

{ "response": 16 }

The requesting application receives the response on this topic, as it had previously subscribed to it, and the request/response exchange is complete.

If there are going to be other requests, the same reply-to topic can be reused, as long as there is no chance of a long-running request overlapping with a subsequent request. If requests can overlap, a separate reply-to topic per request is recommended; the topic then acts as a “correlation ID”, enabling each response to be re-united with its corresponding request on arrival back at the requesting client.
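The whole exchange can be traced end to end with a toy in-memory publish-subscribe bus standing in for the broker. `ToyBus` is a hypothetical class supporting exact topic names only, and delivery is synchronous, which a real broker's would not be. The topic names and payload fields follow the “square” example above.

```python
import json
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str, str], None]  # called with (topic, payload)


class ToyBus:
    """Minimal stand-in for a broker: exact-match topics, synchronous delivery."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.handlers[topic].append(handler)

    def publish(self, topic: str, payload: str) -> None:
        for handler in list(self.handlers[topic]):
            handler(topic, payload)


bus = ToyBus()


def square_service(topic: str, payload: str) -> None:
    """Server side: compute the answer and publish it to the reply-to topic."""
    request = json.loads(payload)
    reply = json.dumps({"response": request["request"] ** 2})
    bus.publish(request["replyTopic"], reply)


bus.subscribe("square", square_service)  # the service topic

# Client side: subscribe to a unique reply-to topic first, then send the request.
responses: List[dict] = []
reply_topic = "response/2901"  # fixed here; normally random or derived from the client ID
bus.subscribe(reply_topic, lambda topic, payload: responses.append(json.loads(payload)))
bus.publish("square", json.dumps({"request": 4, "replyTopic": reply_topic}))
```

Subscribing to the reply-to topic before publishing the request matters: otherwise a fast server could respond before the client is listening.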

Conclusion

In this article we’ve got into some pretty deep MQTT techniques, both at the protocol and application level.

This was a collection of the wisdom, accumulated over many years, that MQTT application developers routinely use to write sophisticated applications for the Internet of Things.

You now should understand a lot more about retained messages, the use of Quality of Service by publishers and subscribers, and the meaning of the Clean Session flag. We have discussed Last Will and Testament and the use of birth and death certificates to maintain online/offline status information for connected devices.

We also explained how to implement the request-response messaging pattern in a publish-subscribe messaging environment.

Update March 2020: There’s a great article on the v5.0 req/resp pattern by HiveMQ, and they wrote a wonderful series of blogs on v5: https://www.hivemq.com/tags/mqtt-5-essentials/