Serialisation
Serialisation is the process of translating an in-memory object into a byte sequence. Deserialisation is the reverse process.
Synonyms for serialisation include marshalling, encoding, and pickling (the last is Python-specific). Correspondingly, deserialisation is also known as unmarshalling, decoding, or unpickling.
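As a minimal sketch of the round trip, the snippet below uses JSON encoded as UTF-8 for the byte sequence. The helper names `serialise` and `deserialise` and the sample record are illustrative assumptions, not part of any library.

```python
import json


def serialise(obj) -> bytes:
    """Translate an in-memory object into a byte sequence (JSON as UTF-8)."""
    return json.dumps(obj).encode("utf-8")


def deserialise(data: bytes):
    """Rebuild an in-memory object from the byte sequence."""
    return json.loads(data.decode("utf-8"))


record = {"id": 7, "tags": ["a", "b"]}  # hypothetical example data
payload = serialise(record)             # bytes, ready to store or send
restored = deserialise(payload)         # a new, equal object
```

The restored object is equal to the original but is a separate copy; nothing about the original's memory layout survives the trip.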
Uses
- Serialising data for transfer across wires and networks (messaging).
- Storing data (in databases, on hard disk drives).
- Remote procedure calls (RPC), e.g., SOAP.
- Distributing objects, especially in component-based software engineering (e.g., COM and CORBA).
- Detecting changes in time-varying data.
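One way the change-detection use might look, as a sketch: serialise the data in a canonical form and compare hashes of the result. The `fingerprint` helper and the sample readings are hypothetical; sorting keys is an assumption made here so that equal dictionaries always produce the same bytes.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Hash a canonical JSON serialisation of obj (keys sorted, no extra spaces)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


before = {"temperature": 21.5, "unit": "C"}   # hypothetical reading
same = {"unit": "C", "temperature": 21.5}     # same data, different key order
after = {"temperature": 22.0, "unit": "C"}    # the value has changed

unchanged = fingerprint(before) == fingerprint(same)
changed = fingerprint(before) != fingerprint(after)
```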
Messaging
Serialisation is necessary in messaging because in-memory objects use pointers that would be meaningless to the receiver. Serialisation is also necessary for cross-platform communication, where the in-memory representation of objects may differ.
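The cross-platform point can be illustrated with byte order. The sketch below packs the same 32-bit integer in little-endian and big-endian layouts using Python's `struct` module and shows what happens when the receiver assumes the wrong one. The variable names are illustrative.

```python
import struct

value = 1

little = struct.pack("<I", value)  # little-endian unsigned 32-bit: 01 00 00 00
big = struct.pack(">I", value)     # big-endian unsigned 32-bit:    00 00 00 01

# A receiver that assumes big-endian misreads the little-endian bytes.
misread = struct.unpack(">I", little)[0]

# Agreeing on the byte order as part of the format removes the ambiguity.
correct = struct.unpack("<I", little)[0]
```

A serialisation format fixes such details explicitly, so the sender and receiver no longer depend on each other's native memory layout.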
Common serialisation formats
- JSON
- XML
- Protocol Buffers
Less commonly:
- YAML
- CSV
Language-specific serialisation
Many languages have their own serialisation libraries. For example, Python provides the pickle module.
These libraries come with serious disadvantages:
- You are tied to the language.
- The serialised data may not be human-readable.
- The serialised data may not be versionable.
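- Deserialising untrusted data can be dangerous: if an attacker can tamper with the serialised data, for example through a man-in-the-middle (MITM) attack, they can inject a payload that results in arbitrary code execution when it is deserialised.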
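The code-execution risk can be sketched with pickle itself. In the deliberately harmless example below, a class uses `__reduce__` to tell pickle which callable to invoke on load; here it is `operator.add`, but an attacker could name any importable function. The class name `Innocent` is hypothetical.

```python
import operator
import pickle


class Innocent:
    def __reduce__(self):
        # Instructs pickle: "to rebuild this object, call operator.add(40, 2)".
        # A malicious payload could name a far more dangerous callable.
        return (operator.add, (40, 2))


payload = pickle.dumps(Innocent())

# Loading runs the callable chosen by whoever produced the bytes.
result = pickle.loads(payload)
```

After loading, `result` is the return value of the call rather than an `Innocent` instance, which shows that the bytes, not the receiving program, decided what code ran.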
JSON and XML
- Both JSON and XML are human-readable.
- Both JSON and XML can apply optional schema validation.
- XML is verbose and complex.
- XML cannot distinguish between a number and string without a schema.
- JSON doesn’t distinguish between integers and floats.
- JSON doesn’t specify a floating-point precision. Not every integer greater than 2^53 can be represented exactly as an IEEE 754 double-precision float; this matters for languages that store all numbers as doubles, such as JavaScript.
- JSON and XML don’t support binary strings.
- There are various binary encodings of JSON and XML, such as BSON and MessagePack for JSON, and Fast Infoset for XML. None of them are as widely used as the textual formats.
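Two of the points above can be checked in a few lines. This sketch first shows 2^53 + 1 losing precision once it is held as a double, then shows the common workaround for binary data, which is to wrap it in Base64 text (a widespread convention, not part of the JSON standard). The variable names and sample bytes are illustrative.

```python
import base64
import json

# Precision: Python integers are exact, but a consumer that stores every
# JSON number as a double (as JavaScript does) sees a rounded value.
big = 2**53 + 1
as_double = float(big)
precision_lost = int(as_double) != big

# Binary data: json.dumps rejects bytes, so encode them as Base64 text first.
blob = bytes([0, 255, 16, 128])  # hypothetical binary payload
try:
    json.dumps({"data": blob})
    bytes_accepted = True
except TypeError:
    bytes_accepted = False

doc = json.dumps({"data": base64.b64encode(blob).decode("ascii")})
restored = base64.b64decode(json.loads(doc)["data"])
```

Base64 makes the payload roughly a third larger, which is one reason binary formats are attractive for such data.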
Changing schemas
When changing schemas over time, you must carefully consider backward and forward compatibility. Checking the documentation of the serialisation format you are using is essential as there may be hidden pitfalls like loss of data or precision.
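A minimal sketch of both directions of compatibility, using JSON and two hypothetical reader versions: `read_v1` knows only a `name` field, and `read_v2` also knows an optional `email` field that was added later. The field names and functions are assumptions made for illustration.

```python
import json


def read_v1(raw: str) -> dict:
    """Old reader: knows only `name` and ignores fields it does not recognise."""
    doc = json.loads(raw)
    return {"name": doc["name"]}


def read_v2(raw: str) -> dict:
    """New reader: also knows `email`, falling back to None when it is absent."""
    doc = json.loads(raw)
    return {"name": doc["name"], "email": doc.get("email")}


old_data = json.dumps({"name": "Ada"})
new_data = json.dumps({"name": "Ada", "email": "ada@example.com"})

forward = read_v1(new_data)   # old code reading new data
backward = read_v2(old_data)  # new code reading old data
```

Note the pitfall in `read_v1`: if the old code writes its decoded record back to storage, the `email` field it ignored is silently lost.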
Binary serialisation
Binary serialisation is more efficient than text-based serialisation formats like JSON and XML; the same data can be represented in fewer bytes.
However, binary data is not human-readable, and schema-based binary formats are not self-describing: the receiver must know the schema of the serialised data in advance.
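To make the trade-off concrete, the sketch below encodes the same hypothetical record as JSON and as a fixed binary layout using Python's `struct` module. The layout string `SCHEMA` stands in for a schema agreed out of band; it is an assumption of this example, not a real serialisation format.

```python
import json
import struct

record = {"user_id": 1337, "score": 98.5, "active": True}  # hypothetical data

# Text encoding: field names travel with every record.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: little-endian uint32, float64, bool. No field names at all.
SCHEMA = "<Id?"
as_binary = struct.pack(
    SCHEMA, record["user_id"], record["score"], record["active"]
)

# Without SCHEMA the bytes are opaque; with it they decode to positional values.
user_id, score, active = struct.unpack(SCHEMA, as_binary)
```

The binary form is much shorter because it omits names, quotes, and punctuation, which is exactly why a receiver without the layout cannot interpret it.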
Protocol Buffers (protobuf) and Apache Thrift
Protocol Buffers and Apache Thrift are modern binary serialisation formats. Protobuf was developed by Google, and Thrift was developed by Facebook.
Both require a schema to be defined. Both come with code generators that generate code for serialising and deserialising objects in various languages.
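To give a feel for what the generated code does, here is a hand-written sketch of how a single integer field is laid out in the Protocol Buffers wire format: a key combining the field number with a wire type, followed by the value as a variable-length integer (varint). The function names are hypothetical, only non-negative integers are handled, and real projects would rely on the generated code instead. Thrift's encodings differ and are not shown.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, 7 bits per byte."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one integer field: key = (field_number << 3) | wire type 0."""
    key = (field_number << 3) | 0    # wire type 0 means varint
    return encode_varint(key) + encode_varint(value)


# Field number 1 holding the value 150 becomes the three bytes 08 96 01.
encoded = encode_int_field(1, 150)
```

Only the field number appears in the output, never the field name, which is why both sides need the schema to give the bytes meaning.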
Advantages of binary serialisation
- More compact than text-based serialisation formats.
- The schema is a source of documentation. The schema is required and cannot be omitted.
- A database of schemas can be maintained to check forwards and backwards compatibility.
- Code generation can be used for statically typed languages.