28 May 2015

Sockets: Bytes and Message Terminators

Sockets are fun. They're a cross-platform mechanism for opening a black hole to another dimension. Of course, you're the one that defines the other end and you're in full control of what gets sent into the hole.

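Hopefully you've also written the code at the other end... that way you're also in full control of what comes back through the hole. Note that, like any good black hole, the specks of data flying back at you won't always be complete, nor in sequence. Fortunately, if you're using TCP, the underlying network stack will put the packets back in order and re-request any that went missing, so you receive a reliable stream of bytes. What it won't do is preserve your message boundaries: a single read may hand you half a message, or one and a half. UDP, on the other hand, hands you each datagram as it arrives, so they can turn up out of order, duplicated or not at all.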
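To make the 'won't always be complete' point concrete, here's a minimal Python sketch of reading a known number of bytes from a stream socket. The helper name recv_exact is my own invention for illustration, and socket.socketpair() is only standing in for a real TCP connection so the example can run on one machine.

```python
import socket


def recv_exact(sock, count):
    """Keep calling recv() until exactly `count` bytes have arrived."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)  # may return fewer bytes than asked for
        if not chunk:  # empty bytes means the other end closed the socket
            raise ConnectionError("socket closed before all bytes arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Local demonstration: two sends at one end, one logical read at the other.
left, right = socket.socketpair()
left.sendall(b"hello ")
left.sendall(b"world")
received = recv_exact(right, 11)
left.close()
right.close()
```

The loop is the important part: a single recv() call is allowed to return less than you asked for, so you keep asking until you have the lot.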

Text or Bytes?

It does get better though... especially when you're trying to send through real data, we'll call them bytes [valued from 0 to 255], as opposed to just fancy textual strings in the ASCII format. The latter is great if you actually want to send legible messages; a chat program perhaps? But useless if you want to send large numbers or other binary data quickly and efficiently.

A quick example... say you want to send the value of an HTML colour. Let's choose the colour off-white, of RGB value (250,250,250). You could send the string "250,250,250" and the receiver could parse the ASCII values. You could save further chars by passing in "250250250" and, as long as your receiver understands the format, it would still be understood. What you need to realise though is that you're sending 3 bytes down the pipe for each colour component when you really only need to send one. The characters '2', '5' and '0' each occupy a whole byte, and every one of those bytes has the ability to store a numeric value from 0 through to 255.

The first byte, the ASCII character '2', is actually represented by the raw numeric value 50. The basic idea is that ASCII is a big lookup table: the number 50 is the index in the table that tells the receiver to display the character '2' on the screen. What you can then realise is that, in the buffer, you've actually placed the value 50 in the first byte, not 2. If the receiver read that byte as the number 50 rather than the character '2', then you'd already be saving 1 byte (as '50' would take two ASCII characters!)
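You can poke at the lookup table yourself. A small Python sketch (the variable names are just for illustration) showing what actually lands in the buffer when the text "250" is encoded as ASCII:

```python
text = "250"
raw = text.encode("ascii")   # the bytes that would go down the socket

codes = list(raw)            # the numeric value held in each byte
first_code = ord("2")        # character -> table index
first_char = chr(50)         # table index -> character

print(codes)                 # [50, 53, 48]
print(first_code, first_char)
```

Three bytes, holding 50, 53 and 48. None of them holds the number 250.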

It gets better though... each byte can store up to the value 255, so instead of trying to jam the string "250" into three bytes, you simply need to store the value 250 into the first byte.
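Here's the off-white colour packed both ways, as a rough Python sketch. The variable names are mine; the point is only to compare the sizes.

```python
red, green, blue = 250, 250, 250

# Textual form: one ASCII character per digit, plus the commas.
as_text = f"{red},{green},{blue}".encode("ascii")

# Binary form: one byte per colour component.
as_bytes = bytes([red, green, blue])

print(len(as_text))    # 11
print(len(as_bytes))   # 3

# The receiver gets the numbers straight back out, no parsing required.
decoded = tuple(as_bytes)
```

Note that bytes() will raise a ValueError if you hand it anything outside 0 to 255, which is a handy reminder of the limit.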

Signed or Unsigned?

A byte has 8 bits. Of these 8 bits, all 0s equals the numerical value zero and all 1s equals 255. You can therefore store 256 values in a byte. Computers, to represent negative numbers, use the most significant bit as a flag to indicate if a number is negative. (Endianness doesn't come into it here: that governs the order of bytes within a multi-byte value, not the bits within a byte.) Unfortunately, this takes one bit off your value, allowing you to store -128 to +127. Why -128 and not -127? Stopping at -127 would waste a bit pattern on a second zero, giving you both -0 and +0. See Two's complement to understand more.
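The same eight bits read both ways, sketched in Python with the standard struct module ("B" is an unsigned byte, "b" is a signed one):

```python
import struct

raw = bytes([250])                       # bit pattern 11111010

unsigned = struct.unpack("B", raw)[0]    # read as 0..255
signed = struct.unpack("b", raw)[0]      # read as -128..127 (two's complement)

print(unsigned)   # 250
print(signed)     # -6

# The extremes of the signed range:
lowest = struct.unpack("b", bytes([0x80]))[0]    # 10000000
highest = struct.unpack("b", bytes([0x7F]))[0]   # 01111111
```

Nothing about the byte on the wire changed between the two reads; only the receiver's interpretation did.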

In the end, unless you're using C# (where Microsoft likes to 'help' you by keeping its ASCII encoding strictly 7-bit, so values above 127 get mangled if you push them through it), it'll be up to the receiving end as to how to read the data. All the bits will be there; it's just a matter of casting them to the format you desire.

Message Terminators

Firstly, I'm using the term 'message' here to describe a block of data sent from the server to a client. This block is formulated by the server with known start and end indicators and thrown down the tubes. If the client isn't listening hard enough, then it may well miss the start of the message and have no idea how to recover and process the rest of the data. The goal is to create unique tokens in your message stream to allow a client to discard data it can't deal with and get back to a known starting point. It can then process the next message in the queue.

Choosing a terminator can be difficult. A unique byte, or sequence of bytes, can be hard to find if you are expecting to send arbitrary data in the message content. Human-readable characters can be used if you're sending strings where those characters can't possibly be included. The 'pipe' | is a good choice, or even a comma if you're in total control of message content and can replace/remove them from the middle. The issue is that as soon as a terminating character turns up in the middle of the string, the client will assume the message is complete and pass on the truncated data for further processing. It then gets worse when the client retrieves the next message, which happens to be the second half of what should have been a complete message.
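A quick Python sketch of that failure. The function split_messages is a made-up name for a naive receiver that trusts the pipe completely:

```python
TERMINATOR = b"|"


def split_messages(stream):
    """Naive receiver: every terminator marks the end of a message."""
    parts = stream.split(TERMINATOR)
    return parts[:-1]  # whatever follows the last terminator is incomplete


# Two well-behaved messages arrive intact.
good = split_messages(b"hello|world|")

# Two messages again, but the first one contains a pipe in its content.
broken = split_messages(b"a|b" + TERMINATOR + b"c" + TERMINATOR)

print(good)     # [b'hello', b'world']
print(broken)   # [b'a', b'b', b'c']
```

The sender meant "a|b" and "c"; the receiver saw three messages and has no way of knowing anything went wrong.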

The best way to get around this is to have a header at the start of your messages. First and foremost this header needs to indicate the length of the message it describes. From then on you can have whatever message content you want, making sure the byte count matches the length you have set. In a recent application, I limited myself to 255-byte messages, so the first byte of any data sent was a numeric value that described the number of following bytes that made up the message. I then also put a terminating character at the end, the pipe, as a check so the client could confirm that the end was really there.
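A minimal Python sketch of that layout: one length byte, the content, then a pipe. The function names are hypothetical, and I'm assuming here that the length byte counts only the content and not the trailing pipe; the scheme works either way as long as both ends agree.

```python
TERMINATOR = 0x7C  # ASCII code for '|'


def encode_message(payload: bytes) -> bytes:
    """Build a frame: [length byte][payload][terminator]."""
    if len(payload) > 255:
        raise ValueError("payload too long for a one-byte length header")
    return bytes([len(payload)]) + payload + bytes([TERMINATOR])


def decode_message(frame: bytes) -> bytes:
    """Check one complete frame and return its payload."""
    if len(frame) < 2:
        raise ValueError("frame too short to hold a header and terminator")
    length = frame[0]
    if len(frame) != length + 2:
        raise ValueError("frame size does not match its length header")
    if frame[-1] != TERMINATOR:
        raise ValueError("terminator missing where the header said it would be")
    return frame[1:1 + length]


frame = encode_message(b"a|b")   # a pipe inside the content is now harmless
payload = decode_message(frame)
```

Because the receiver counts bytes rather than hunting for a pipe, content containing a pipe survives the trip. The pipe at the end is only ever checked at the position the header points to.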

Putting it all together...

Once you've defined a message structure, your senders and receivers should be able to decipher the string of bytes coming down the line with a little more ease. Of course, if they fall behind and fail to read packets, those packets can be lost (UDP makes no promises, and even a TCP connection can drop mid-message). Your next step would be to ensure each end is in a known state and that, if a state hasn't progressed, the data gets re-sent.
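As a rough sketch of the receiving side, here's a small Python class that accepts whatever chunks the socket hands over and emits complete messages. It assumes a frame of one length byte, the content, then a pipe, with the length counting only the content. The class name FrameReader and the drop-one-byte recovery are my own assumptions for illustration, not a tested protocol.

```python
TERMINATOR = 0x7C  # ASCII code for '|'


class FrameReader:
    """Collects incoming chunks and pulls out complete frames."""

    def __init__(self):
        self.buffer = bytearray()

    def feed(self, chunk):
        """Add newly received bytes; return any messages now complete."""
        self.buffer.extend(chunk)
        messages = []
        while self.buffer:
            length = self.buffer[0]
            frame_size = length + 2  # header + payload + terminator
            if len(self.buffer) < frame_size:
                break  # frame not fully arrived yet, wait for more data
            if self.buffer[frame_size - 1] == TERMINATOR:
                messages.append(bytes(self.buffer[1:1 + length]))
                del self.buffer[:frame_size]
            else:
                # Out of step with the sender: drop one byte and try again.
                del self.buffer[0]
        return messages


reader = FrameReader()
stream = b"\x05hello|\x03a|b|"
first = reader.feed(stream[:4])    # only part of the first frame
second = reader.feed(stream[4:])   # the rest of it, plus a whole second frame
```

The first call returns nothing because the frame is incomplete; the second returns both messages. The recovery step is crude: a stray byte that looks like a large length can leave the reader waiting a while before it notices it's out of step, which is one reason to add the known-state handshake described above.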

I'll post again in the future with code samples for the theories above.