Data Serialization: Apache Avro vs. Google Protobuf

by Vijay Nayar
in Software
on 2022-08-26

Abstract
Introduction
- The Use of JSON
- When to Consider Alternative Formats
Comparing Binary-Capable Serialization Formats
Conclusion

Abstract

For applications that need to transmit data over a network or persist data for storage into files, the topic of selecting a data serialization format is an important one. For low performance applications with small amounts of data, there are advantages to using JSON, which is highly readable by humans. However, for more complex systems that store larger amounts of data and have significant computational challenges, a format that supports binary encoding can yield significant advantages. Google’s Protocol Buffers provide solid performance and are easy to use, but their format can only be changed at compile-time. With more flexible use-cases in mind, Apache Avro was created to offer similar performance and lower size like Protocol Buffers, but supports the ability to modify data formats during runtime and it is especially suited for generic data systems such those needed for data engineering and databases.

Introduction

When it comes to data serialization and de-serialization, there is a myriad of formats ranging from raw binary data or CSV files, XML, and of course more modern formats like JSON or YAML. Many of these formats were created for specific purposes and domains such as for configuration files, data storage, transmitting data over HTTP, etc.

The Use of JSON

In this day and age, one of the first go-to solutions for a data serialization format is JSON which has extensive support in virtually all programming languages. Here are a few examples from Python, Java, C++, Rust, and D. And, of course, the format is natively supported in the JavaScript programming language.

{
  "name": "Alex J. Murphy",
  "address": {
    "street": "548 Primrose Ln.",
    "city": "Detroit. MI.",
    "residential": true
  },
  "accuracy": [92.6, 85.1, 75.9]
}
Code language: JSON / JSON with Comments (json)

JSON has a lot of advantages associated with it:

Simple: Data that you create consists of combinations of numbers, strings, booleans, arrays, and objects, and that’s about it. It’s quick to learn and quick to use.
Human Readable: The JSON format is fairly straight forward and is easy to both type and read. However, it does not officially permit comments, which makes documenting it a tad more difficult.
Broad Language Support: As described above, there is hardly a programming language in common use that does not support JSON.

For a single data record, the increased cost of using JSON in terms of memory and CPU usage is pretty small and doesn’t have a significant impact. This is one reason why it is not uncommon for JSON to be transmitted over HTTP, a protocol that already uses human-readable formats and deems the cost to be worth it. It also helps with debugging, because you can simply type in JSON data to send to a server using simple shell scripts like so:

$ curl http://localhost:8080/myService/things -d '{"id": 20}'
Code language: JavaScript (javascript)

However, the format is not suitable for all applications. Compared to other formats that exist, JSON has the following disadvantages:

Space Cost: JSON can take more bytes to represent data than binary formats. For example the number 2,070,638,924 is represented in JSON as the character sequence “2070638924”. Using UTF-8 encoding, there is 1 byte per character, thus 10 bytes. If UTF-16 is used, then 20 bytes. A typical binary encoding is 4 bytes, and it can be even less using techniques like zig-zag encoding.
CPU Cost: In order to make use of data, it must first be converted into a form that a CPU can use. For example, a CPU requires an integer to be encoded as a 32-bit or 64-bit binary number, either in least-significant-byte or most-significant-byte order. The readable format of JSON requires costly conversion into a binary format.

Next we will explore some of the cases where one is willing to give up a human-readable format like JSON, and instead choose something else.

When to Consider Alternative Formats

Like luggage, sometimes data doesn’t fit into the space available.

Serialization formats are used typically when moving data outside of a single program, and instead saving it to files or pushing it over a network. In these situations, size and CPU costs are closely related. In network protocols like TCP and UDP, data is not transmitted as individual bytes, but rather in datagrams or packets. At the network level, such as Ethernet, a number of bytes up to the maximum transmission unit (MTU) for the network are sent together. For most Ethernet networks, a typical MTU is 1,500 bytes.

1,500 bytes is not a large amount of data, it represents approximately 2/3 of a page of text. If you have a lot of fields in your JSON data, it doesn’t take very much to reach this limit.

This means that whether you send 1 or 1,500 bytes, the cost of network latency (by far the largest cost in network communication) is the same. However, sending 1,501 bytes costs twice as much as sending 1,500 bytes because this data must be delivered in 2 packets rather than 1. This means that when there is large volumes of network traffic, saving a few bytes can actually make a big difference if it reduces the number of packets that have to be sent to transmit data.

The other common reason to consider a different data format is when a lot of data needs to be archived. For example, if you are keeping track of financial transactions of customers for a bank, then the numbers add up fairly quickly.

2 transactions per day per customer
x 1,000,000 customers
x 5 year legal retention requirement
x 2 KB per transaction
—————————-
7 Terrabytes of data

If the size of each transaction can be cut by even as little as 50%, this saves 3.5 Terrabytes of storage space. This is especially important because most cloud database providers charge on a monthly basis based on how much storage space is used. If you are using AWS RDS, which charges $0.23 per GB per month with an additional $0.095 for backups, this amounts to a saving of at least $1,200 per month. Furthermore, if you are building a bank, this data will be stored and used by multiple services available to service customer requests, be used to train machine learning models, etc. Throw in the additional savings associated with faster transmission and processing of smaller data, and the savings can easily be multiplied 5x or more. If you are storing large number of records in a databases, logs, data warehouses, and other places, a binary format can provide significant cost savings.

With the reasons why one would consider a binary serialization format described, let us now consider two popular formats: Google Protobuf and Apache Avro.

Comparing Binary-Capable Serialization Formats

An Example Data Record

We’ll use an example from banks, so it sounds nice and official.

For the sake of our analysis, let us consider a realistic data record type for which there may be many records to store. In the introduction, we vaguely referenced a 2KB bank transaction, but now let us be more specific.

The following record (represented as JSON with comments) is a realistic example record for a business:

// A financial transaction.
{
  "id": "1111-2222-3333-4444",
  "version": 3,
  "providerId": "MASTERCARD",
  "providerData": {
    "mcc": "5331",
    "merchantName": "Spencers Späti",
    "merchantLocation": {
      "latitude": 12.1234,
      "longitude": 34.5678
    }
  },
  "source": {
    "internalId": "12345",
    "externalId": "2222333344445555",
    "externalType": "CREDIT_CARD"
  },
  "target": {
    "internalId": "54321",
    "externalId": null,
    "externalType": null
  },
  "description": "Deluxe bagel with ice cream.",
  "type": "PURCHASE",
  "state": "AUTHORIZED",
  "amount": {
    "currencyCode": "EUR",
    "units": 4,
    "nanos": 60000000
  },
  "timestampInMs": 1661506029000,
}
Code language: JSON / JSON with Comments (json)

In JSON format, this record is 675 bytes, assuming that UTF-8 is being used as the encoding. If UTF-16 is being used, then this would be 1,350 bytes. A real record would likely have more data than this, but this serves as a semi-realistic example.

Google Protocol Buffers

Summary of Google Protobuf as a Binary Encoding Format:

Traits
- Data formats are described by writing proto files in a custom format.
- A protobuf compiler generates code in your chosen programming language.
- You create, serialize, and de-serialize your data using the generated code.
Pros
- Very efficient binary format which reduces bytes.
- Standard JSON representation.
- Strongly-typed generated code to speed development.
Cons
- Requires recompilation when formats change.
- Strict discipline is required to maintain backwards compatibility when formats change.
Useful when:
- It is very important to minimize data size
- Data formats do not change very frequently
- The number of possible formats remains small

The Protocol Buffer serialization format was originally created by Google in 2001 and made publicly available in 2008.

In Google Protocol Buffers, the binary format consists of a number of data rows that are roughly formatted as follows:

field number (5+ bits), field type (3 bits), encoded field value

In protocol buffers, every field has a unique number which must not change if you want to preserve backwards compatibility. The fields also have a type, which should be familiar to any Java or C++ programmer, e.g. “int32”, “float”, “bool”, “string”. There are also ways to group many fields together using either the “message” or “repeated” types. Named types can be created using “enum”.

An engineer typically writes a description of data in the proto format and saves it into a file which commonly is referred to as a “proto file” due to the file extension “proto”, e.g. “transaction.proto”.

Example protocol buffer data description:

// File: transaction.proto

// Comments like this are supported in proto files.
syntax = "proto3";

// Programming language specific options are possible.
option java_package = "com.transaction.proto";

// A proto format description consistent with the JSON above.
message Transaction {
  string id = 1; // The value "1" is the field ID, not a value.
  uint32 version = 2;  // Unsigned values are supported as well.
  ProviderId providerId = 3;  // This is an enumerated type we created.

  enum ProviderId {
    INVALID_PROVIDER = 0; // If not specified, the default value is 0.
    MASTERCARD = 1;
    // ...
  }

  // ...
}
Code language: PHP (php)

This proto file is then given to the protocol buffer compiler, called protoc, and the compiler will generate code in the chosen programming language that represents data objects associated with the format you specified in the proto file. Additionally, there are utilities present to serialize and de-serialize your data to and from binary data.

For example, in Java, the classes that represent your proto file come with two functions:

byte[] toByteArray()
static Transaction parseFrom(byte[] data)

Builder classes are also created to more rapidly construct data as well. This allows data to be constructed like so:

Transaction myTransaction = Transaction.newBuilder()
    .id("abcd")
    .version(3)
    .amount(Amount.newBuilder()
      .units(3)
      .currency("GBP"))
    // ...
    .build();
byte[] rawData = myTransaction.toByteArray();
Code language: JavaScript (javascript)

The same example message above, when encoded using protocol buffers, ends up being approximately 141 bytes. Much of the data cannot be reduced in size due to the fact that they are simply textual strings, however, the data size is, even in this case, reduced by 75%. The main reasons for this are:

The Protobuf format does not need to indicate the field name in binary.
The Protobuf format can use zig-zag encoding to make integers smaller.
The Protobuf format does not use any bytes for fields which have their default value.

Apache Avro

The logo of Apache Avro.

Summary of Apache Avro as a Binary Encoding Format:

Traits
- Data formats are described by writing Avro Schemas, which are in JSON.
- Data can be used as generically as GenericRecords or with compiled code.
- The data saved in files or transmitted over a network often contains the schema itself.
Pros
- Efficient binary format which reduces record sizes.
- The reader and writer schemas are known, permitting major data format changes.
- Very expressive language for writing schemas.
- No recompilation is needed to support new data formats.
Cons
- The Avro Schema format is complex and requires studying to learn.
- The Avro libraries for serializing and de-serializing are more complex to learn.
- The binary format is not as compressed as Protocol Buffers.
Useful when:
- It is very important to minimize data size.
- Data formats are frequently evolving or even something users can use.
- Generic solutions to problems like data warehousing or data query systems are needed.

The Apache Avro data serialization system was originally developed as part of the Hadoop distributed computing tools in 2006 and became a stand-alone publicly available project in 2011. Avro was designed with full awareness of Protocol Buffers and was built specifically to overcome some of the challenges and shortcomings associated with that format. Most importantly, it was created to permit changes of the data schema at runtime rather than having to rely on code generation and changes during compile-time.

As a more general serialization system, Apache Avro supports many ways of serializing data, beyond simply serializing a single record at a time. For example, Avro also supports saving batches of records together into a standard file format known as Object Container Files. These files combine the actual data format that records were created with with the records themselves. This file format is ideal for collecting large volumes of data over long stretches of time, which is commonly the case in data engineering and data science, where data from systems is stored in cost-efficient systems like AWS S3.

An Avro Schema representing our example record would look something like the following:

// File: transaction.avsc
{
  "namespace": "mycompany",
  "name": "Transaction",
  "type": "record",
  "fields": [
    {
      "name": "id",
      "doc": "A unique ID for a transaction.",
      "type": "string"
    },
    {
      "name": "version",
      "doc": "Each update uses a increasing version number.",
      "type": "int"
    },
    {
      "name": "providerId",
      "doc": "The financial provider where this transaction occurred.",
      "type": {
        "name": "ProviderId",
        "type": "enum",
        "symbols": ["INVALID", "MASTERCARD", "VISA", ...],
        "default": "INVALID"
      }
    },
    // ...
}
Code language: JSON / JSON with Comments (json)

The binary format used to represent Avro records is somewhat similar to the format used in Protocol Buffers. In order to save space and avoid having to write the textual names of fields, it uses the fact that the reader and writer of data are aware of the schemas being used, and these schemas can be reconciled if they are different. For example, when representing an object which in JSON would be represented as { "name": "Bob", "age": 30 }, the data is written to memory as the data value “Bob” followed by the binary representation of 30. Although JSON attributes are not ordered, the definition of a record type in Avro does order fields, and it is this order which determines the order in which they appear in the binary format.

Thus, in this particular example, our Apache Avro serialized version of our data record would come in at roughly 138 bytes, which is 3 bytes smaller than the serialized Protocol Buffer data. The reason for this is that Apache Avro uses field position rather than field ID to read and write fields. If one has data records where most fields take on their default values, Protocol Buffer records will end up taking fewer bytes.

There are many ways to create and access data in accordance with an Apache Avro schema, but for the purpose of illustrating what is unique about Apache Avro, let us consider of accessing data in a generic manner.

Consider the following Java snippet that manipulates a record, analogous to the previous example given for Protocol Buffers.

Schema schema = new Schema.Parser().parse(new File("transaction.avsc"));
GenericRecord transaction = new GenericData.Record(schema);
transaction.put("id", "abcd");
transaction.put("version", 3);
transaction.get("amount").put("units", 3);
transaction.get("amount").put("currency", "GBP");

// There is a much higher complexity working with the Avro libraries than
// there is when working with Protocol Buffers.
// However, each part, the output, the encoding, and the writing mechanism can be
// swapped and recombined.
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
BinaryEncoder binaryEncoder = EncoderFactory.get().binaryEncoder(outputStream, null);)
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
datumWriter.write(transaction, binaryEncoder);
binaryEncoder.flush();
byte[] rawData = outputStream.toByteArray();
Code language: JavaScript (javascript)

The extra layers of complexity of trying to convert an Avro record into bytes can be enough to make the eyes glaze over, however, with a few good examples and some time reading documentation, this is an easy enough hurdle to overcome.

Additional Considerations

An additional consideration to keep in mind when thinking about Apache Avro, is that many other organizations such as Confluent.io (the original creators of Apache Kafka) recommend using Apache Avro as a data format.

If you are using AWS Redshift as a data warehouse, it should be noted that it also supports the ingestion of Avro data.

AWS Glue, a system for performing data extract, transform, and load (ETL), also supports data in Avro format.

Conclusion

Picking the right method of serializing your data can save a lot of hassle.

The right serialization format to use depends heavily on your use cases. If you are working with low volumes of data where performance or storage is not a concern, than JSON is a perfectly acceptable format.

If bandwidth, CPU, or storage space are of critical concern, but the data formats are few in number and only change slowly, then Protocol Buffers can be a nice choice. They have the advantage of being relatively straight forward and easy to use.

If the same performance constraints exist, but the data formats cannot be known in advance at compile-time, or the number of formats is large and constantly changing, then Apache Avro is a strong candidate for a serialization format. However, it comes with increased complexity, so be prepared to take a bit of time to read the documentation.

Data Serialization: Apache Avro vs. Google Protobuf