Do not trust others’ data

Our world works by exchanging data. Thousands of bits flow in front of our eyes every second without us even realizing it. A touch on a screen or a change in our blood pressure or heart rate quickly becomes ones and zeros and immediately travels to the other side of the planet to be stored or transformed into another digital data set.

We love data and we are addicted to it. We gather it from everything around us, inside us, and even inside others, without a clear purpose. We like to know how many steps we took, our running activity statistics, and how many breaths we took in the last minute. Even if we are on vacation for the next two weeks, we still want to know the temperature in our homes.

Getting closer to our domain of expertise, we have apps/services; call them whatever you prefer. These apps are built to do a job. Some even learn by themselves. However, what they cannot do is work without good-quality data. Furthermore, they need each other: some produce data and others consume it. It should sound familiar.

Between producers and consumers, there is a contract defining the communication rules between them. Producers define the contracts and have the responsibility not to break them. When they plan to break a contract, they should find ways to notify the consumers and give them enough time to adapt their systems to the new contracts. The contracts can be enhanced with additional tools to ensure they remain valid. Contracts are also encountered under names such as APIs or schemas.

This article focuses on data consumers. In many places, “producer” and “data source” are used interchangeably.

Problems with data consumption

Breaking the contract schema

This problem corresponds to breaking changes introduced by producers. They do not show up often, and when they do, it is primarily because of human error.

For example, if a producer removes a field, this will generate errors on the consumer’s side, breaking their contract. Imagine the case where the consumer needs a bank account number, and suddenly the value is not available anymore. The consumer will stop working at that point since the bank account number is strictly required.

Some possible scenarios:

  • In the producer, a typo in the contract goes unnoticed through code review;
  • In the producer, an annotation that facilitates communication serialization is unintentionally moved away from a field;
  • In the producer or consumer, serialization configurations are changed at the system level, and nobody knows what to expect or where it might break;
  • In the producer, the contract documentation is not updated with the new rules;
  • In the producer or consumer, engineers change existing functionality without realizing it is used in another tiny corner of the system;
  • Consumers miss or are unaware of announcements about the producers’ contract updates.

As a consumer, you can’t do much about it besides trying to adopt or implement a schema mechanism that will guarantee the data has the proper structure.

Data format changes

These are also breaking changes, and the tricky part is that they might go unnoticed. Compared to the previously described problem, in this case, the consumer can continue to process messages containing invalid data.

Imagine two consumers, C1 and C2, both belonging to the same company. C1 generates some PDFs, contracts between a customer and a company, which should contain an address. The address comes from a data source, S1. C1 is also a data source for consumer C2, which prints physical letters with the address from C1. The address is formatted by data source S1 as

Country|PostCode|City|Region|Sector|Street|Number

and is filled in a form by a customer. Each address component has its own field in the UI. When submitted, all values are joined together into a single string, separated by |. S1 is part of another company.

C1 has been working fine for a while, but at some point it starts receiving an address like NL|123456||Amsterdam|North Holland||A|300. Note the two occurrences of ||. It could be a typo in the UI in the postcode, city, sector, or street field. The system cannot know which one is correct, but we can be sure that one field was omitted on purpose. C1 stores this address and generates the PDF. Even with such typos, the address is still readable and acceptable from a legal perspective. C2 receives the same address as C1, splits the string by |, and prints the first seven pieces on the physical letter, omitting the 300. In this case, the letter will not be delivered, which might be an issue from a legal perspective.
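
As an illustration of C2’s side of the problem, here is a short sketch (hypothetical code, not taken from any of the actual systems) that splits the incoming string and keeps only the first seven pieces:

String address = "NL|123456||Amsterdam|North Holland||A|300";
String[] parts = address.split("\\|", -1); // -1 also keeps trailing empty components
// The format defines 7 components, but this malformed input produces 8.
System.out.println(parts.length); // prints 8
// Printing only the first seven pieces silently drops the house number "300".
for (int i = 0; i < 7; i++) {
  System.out.println(parts[i]);
}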

This is not a fictional example; some systems do use it. Even a medical protocol for transferring data sends the address in a similar format.

In this case, consumers can implement parsing and validation to protect themselves.

Now, let us ask ourselves: where is the problem? Who is to blame? First, the user, who was not careful with the data they filled in, even though they knew it was important. Second, S1, because, at first sight, it seems to accept any input. Third, C1, because it seems to accept any data from S1. Fourth, C2, because it knew the importance of a correct address and did not make sure it was valid and complete. C2 will probably blame C1 for not validating the address first, since C2 assumed that C1 holds valid data.

Solutions

At least two main strategies can be adopted: implementing a schema validation mechanism, and parsing and validating the data. Neither provides a perfect solution, but they increase the reliability and robustness of the system.

Implement a schema validation mechanism

A schema mechanism will guarantee that the data contains all the necessary fields and that the fields have the expected data types. Such schema mechanisms are already implemented in third-party libraries. Choosing one depends on the communication protocol. The schema mechanisms offer support for various data types (string, int, …), different data structures (like enum), and basic validations (like minimum integer value).

JSON Schema

It is a schema specification for JSON data, with implementations available for many programming languages.

Schema Sample:

{
  "$id": "https://example.com/person.schema.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Person",
  "type": "object",
  "properties": {
    "firstName": {
      "type": "string",
      "description": "The person's first name."
    },
    "lastName": {
      "type": "string",
      "description": "The person's last name."
    },
    "age": {
      "description": "Age in years which must be equal to or greater than zero.",
      "type": "integer",
      "minimum": 0
    }
  }
}

It also has a validation specification, covering constraints such as minimum values and string patterns, which can improve development speed.

One JSON Schema validator implementation for Java is Vert.x JSON Validation. It takes a JSON input, compares it against the schema rules, and provides a validation result. It looks like this:

Validation Sample

import io.vertx.json.schema.Draft;
import io.vertx.json.schema.JsonSchema;
import io.vertx.json.schema.JsonSchemaOptions;
import io.vertx.json.schema.OutputUnit;
import io.vertx.json.schema.Validator;

// `object` holds the schema and `json` holds the data to validate,
// both as io.vertx.core.json.JsonObject instances.
JsonSchema schema = JsonSchema.of(object); // reads the schema
// schema validation
OutputUnit result = Validator.create(
    schema,
    new JsonSchemaOptions().setDraft(Draft.DRAFT7))
  .validate(json);
if (result.getValid()) {
  // valid
} else {
  // not valid
}

Be aware that some tools, such as Swagger (used to generate API documentation), do not offer full compatibility with JSON Schema.

Apache Avro schema

Apache Avro is another JSON-based schema specification, frequently encountered in Kafka environments. It works by making the Kafka components aware of the schema so that they can make sure the messages have the correct structure.

Schema Sample:

{
  "namespace": "com.acme",
  "protocol": "HelloWorld",
  "doc": "Protocol Greetings",
  "types": [
    {"name": "Greeting", "type": "record", "fields": [
      {"name": "message", "type": "string"}]},
    {"name": "Curse", "type": "error", "fields": [
      {"name": "message", "type": "string"}]}
  ],
  "messages": {
    "hello": {
      "doc": "Say hello.",
      "request": [{"name": "greeting", "type": "Greeting" }],
      "response": "Greeting",
      "errors": ["Curse"]
    }
  }
}
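
As an example of how the Kafka components become schema-aware, a producer can be configured with an Avro serializer backed by a schema registry. The sketch below assumes Confluent’s KafkaAvroSerializer and a registry at a placeholder URL; adjust the class and URL to whatever your environment actually uses.

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer");
// Assumes the Confluent Avro serializer is on the classpath.
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    "io.confluent.kafka.serializers.KafkaAvroSerializer");
// The serializer registers the record's schema with the registry and
// fails if the registry reports it as incompatible with what is already there.
props.put("schema.registry.url", "http://localhost:8081");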

Protobuf (Protocol Buffers)

It is a language- and platform-neutral data serialization mechanism.

Example:

syntax = "proto3";
message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
}
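
Once this definition is compiled with protoc, the generated classes enforce the declared structure at build and parse time. A minimal sketch of round-tripping the message above, assuming Java code generation:

SearchRequest request = SearchRequest.newBuilder()
    .setQuery("address validation")
    .setPageNumber(1)
    .setResultPerPage(20)
    .build();

byte[] bytes = request.toByteArray();                  // compact wire format
// parseFrom throws InvalidProtocolBufferException for malformed input
SearchRequest parsed = SearchRequest.parseFrom(bytes);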

Data parsing and validating

Parsing and validating can live on its own or as a complement to schema validation. If the system does not use a schema validation mechanism, engineers will probably end up implementing the same validation rules that would have come for free with a schema validator.

Schema validators cannot cover complex cases, and this is where parsing and validating the data becomes necessary. In the example with the producer and the consumers, a schema validator can only ensure that the address field contains a non-null, non-empty, non-blank string. Custom validators can be added to cooperate with the existing schema validator, but by doing this we are moving away from the schema specification. Beyond that, we need custom parsing and validation: split the field by |, check for seven components, probably check that each one contains only specific characters, and so on.
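
A minimal sketch of such a parser is shown below; the choice of which components may be blank is an assumption made for illustration, not a rule from the original systems.

import java.util.Optional;

// Hypothetical parser for the Country|PostCode|City|Region|Sector|Street|Number format.
static Optional<String[]> parseAddress(String raw) {
  if (raw == null || raw.isBlank()) {
    return Optional.empty();
  }
  String[] parts = raw.split("\\|", -1); // keep empty components
  if (parts.length != 7) {
    return Optional.empty();             // wrong number of components
  }
  // Assume country, post code, city, street, and number must not be blank,
  // while region and sector may be left empty.
  int[] required = {0, 1, 2, 5, 6};
  for (int i : required) {
    if (parts[i].isBlank()) {
      return Optional.empty();
    }
  }
  return Optional.of(parts);
}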

There is another difference between the schema solution and this type of validation: the former is declarative and the latter is imperative. The declarative style is generally preferred by engineers and easier to understand, since it tells the system what to do, while the imperative style means teaching the system how to do it.

Data parsing and validation refer strictly to ensuring the data has the proper format. It does not involve business validations such as verifying that an existing customer does not already use the address.

Validating at the edge

It is a technique that simplifies the validation process for consumers throughout an organization. The services that receive data from outside the organization are responsible for validating and storing it. Those services are seen as single sources of truth for their data, which means that whoever consumes this data has a guarantee that it is of high quality.

The most challenging part of this approach is making all engineers and all teams aware that such golden data sources exist, which is extremely difficult in huge organizations. In many situations, the time spent trying to figure out which service holds the data is significantly higher than the time needed to implement reading from it.

Real-world examples

No schemas for Kafka messages

A new version of a producer was deployed, publishing messages into an existing topic. The difference from the previous version was that it changed an entity’s ID from UUID to String. This made sense for the producer’s team, since that entity needed to work with both data types, but the selfish change left one of the consumers stuck because it was still expecting a UUID.

When using Kafka without a schema for the messages inside a topic, there is a very high chance of ending up with a stuck consumer, especially when there are no ACL rules.

By default, if the consumer cannot deserialize a message, it will keep retrying to process the failed message. This is how it gets blocked.

The tricky part: how can we resume our consumer?

If message loss is acceptable, we can manually reset the consumer’s offset to the offset of the next message. This means manipulating production systems by hand: going into Kafka and setting the offset of that consumer, for a specific topic and a specific partition, to another value.
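
A rough sketch of such a reset using Kafka’s Java AdminClient; the group, topic, partition, and target offset are placeholders, and the same thing can be done with the kafka-consumer-groups command line tool. Error handling is omitted, and the consumer group must be stopped while the offsets are altered.

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
  // Move the consumer group past the message it cannot deserialize.
  TopicPartition partition = new TopicPartition("addresses", 0);
  long nextOffset = 42L; // offset of the first message to process again
  admin.alterConsumerGroupOffsets(
      "c1-consumer-group",
      Map.of(partition, new OffsetAndMetadata(nextOffset)))
    .all()
    .get();
}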

If message loss is unacceptable, a dead letter topic will prevent the consumer from becoming blocked. Another way to avoid blocking is to store the unprocessed messages in a database. If no fallback solution is implemented, the situation is bad: a new consumer has to be written to process the problematic message, which means writing code while production is down and going through the entire deployment process, and the blocked consumer has to be taken down in the meantime. Hopefully, no new badly formatted or badly structured messages arrive; otherwise, you have to start over. And if you do not know how many bad messages there are, you need to figure out yet another solution.
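
A dead letter topic can be as simple as catching the failure in application code and forwarding the raw record, as in the sketch below. It assumes the consumer is configured with a ByteArrayDeserializer so that deserialization failures surface in our code rather than inside the Kafka client; deserialize, process, and the topic name are placeholders, and frameworks such as Spring Kafka ship this behaviour ready-made.

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// The consumer reads raw bytes, so a bad payload fails here, not in the client.
void pollOnce(KafkaConsumer<String, byte[]> consumer,
              KafkaProducer<String, byte[]> producer) {
  for (ConsumerRecord<String, byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
    try {
      process(deserialize(record.value())); // hypothetical helpers
    } catch (Exception e) {
      // Park the bad message on a dead letter topic instead of blocking everything.
      producer.send(new ProducerRecord<>("addresses.DLT", record.key(), record.value()));
    }
  }
  consumer.commitSync();
}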

In this particular case, we deployed a new version of that consumer supporting String instead of UUID for that field. This meant changing the data type throughout the application, including the database column types for multiple tables that referenced it. It was a quick fix, and it worked because this service was not providing data to other systems.

It took around two hours to figure out the problem while the production consumer was not processing any additional messages. It was a case where data loss was unacceptable. The problem was obvious from the first error logs, but the tricky part was figuring out what had changed and why.

The whole issue was a breaking change in the contract that the data source owners did not think about (or did not care about). Having a schema mechanism in place for the producer would have avoided this issue because it would have triggered errors at publish time (and probably in tests).

Who should validate the data

Everyone! With the clarification that each system should validate only the data required to work correctly.

In the example with the producer and the consumers:

  • the producer should ensure that all address fields are filled in; it does not need to ensure that the address exists
  • C1 and C2 should both verify that the address is complete and has the proper structure and format
  • touching on the business rules for a moment, both C1 and C2 are required to validate that the address exists

Any data coming from an external system should be strongly validated.

Services belonging to the same application

To raise awareness, let us now consider that S1, C1, and C2 are services belonging to the same application or company, each owned by a different team.

When designing the system, in the engineers’ discussions, we often hear “C1 can take data from S1 and C2 can work with the data from C1”. That is fine as long as all responsibilities are clear.

By being inside the same company/application, there is a bias toward trusting each other. Many issues can arise from this bias combined with poor team communication or poor documentation.

The C2 team might decide to go to S1 for the data without having any idea that C1 exists. The responsibility for ensuring that the address is structured and formatted correctly, and that it actually exists, will now reside in two different services. Even worse, C2 might go to S1 assuming its addresses are already verified. Much worse, C2 might even go directly to S1’s own source.

These problems cannot be avoided entirely. We can only strive for better communication and good documentation.

Conclusion

No data should pass unverified. Even though you might hear otherwise, there is no such thing as too much validation; just validate only what is necessary for your use case. There are plenty of tools out there that can help you consume only good-quality data and will speed up your work. Trust nobody as long as you do not have a guarantee that the data is good. Better safe than sorry.

Keep in mind that wherever people are involved, there is also room for error.

