Debezium to Snowflake: Lessons learned building data replication in production

Omar Ghalawinji
Published in Shippeo Tech Blog
Jun 1, 2022 · 6 min read


With the well-deserved position that Snowflake and Debezium have gained in the modern data stack, it is now fairly easy to find online resources on using these technologies. In this blog, we take it one step further by reflecting on the lessons we (at Shippeo) have learned using Debezium to replicate data at scale in near real time to Snowflake.

Data is the lifeblood of Shippeo’s business. As a data aggregator in the supply chain industry, we depend on fresh and reliable data to provide accurate ETA predictions and performance insights.

This is not a tutorial. We are not going to explain the different steps to implement data replication to Snowflake using Debezium and Kafka. The available resources online, as well as the official documentation, are largely sufficient. What we want to share with you in this article is the cherry on top: what to do and what not to do if you want to go beyond a simple PoC of these technologies to the actual implementation of a reliable system in production.

Overall Architecture

Overview of the Kafka Connect-based (Debezium) data replication pipeline

We use Debezium, an excellent open-source tool for log-based Change Data Capture: the process of using the logs of an OLTP database to track all the changes in a database and stream them as events. These events are sent in near real time to Kafka, where they are consumed by several Kafka sink connectors and pushed to downstream OLAP systems.

Thoughts on Debezium and Kafka

Think about optimizing Kafka usage in advance

There are a few decisions you can make at the beginning of your development cycle that can have an enormous impact on your Kafka usage. For example, when using several Debezium connectors in the same Kafka cluster with large tables, it is not uncommon for a given Kafka node to run out of disk space or for the initial Debezium snapshot to take days to finish. Here is some advice:

  • Create Kafka topics manually with the right retention policy: Debezium can auto-create Kafka topics if they don’t exist, but this is better avoided. Creating Kafka topics manually before initializing Debezium gives you control over the configuration of each of these topics. For example, the retention policy “Compact+Delete” is particularly useful for topics used by Debezium, as it allows Kafka to keep only the latest message for a given Kafka key, which reduces storage usage. You can automate the creation of these topics with a script or with Terraform if you have a managed Kafka service (see the sketch after this list). This also lets you fine-tune the number of partitions of each topic based on the size of the corresponding table (see next point).
  • Increase the number of partitions of Kafka topics corresponding to your largest tables: this will increase the parallelism of write operations in Kafka and massively expedite the initial snapshot of large tables.
  • Use AVRO serialization with a Schema Registry: instead of the default JSON serialization. This will reduce message sizes and save on storage usage. It’s worth mentioning that some Kafka sink connectors, like the BigQuery sink connector, only support AVRO with a Schema Registry, so it’s better to prepare in advance.
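To make the first two points concrete, here is a minimal sketch in Python (using the confluent-kafka Admin API) of pre-creating Debezium topics with a compact+delete cleanup policy and a per-table partition count. The topic names, partition counts and broker address are illustrative placeholders, not our actual values.

```python
# Pre-create Debezium topics with explicit retention and partitioning.
# All names and numbers below are illustrative examples.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka:9092"})  # placeholder broker address

topics = [
    NewTopic(
        "dbserver1.inventory.orders",        # <server-name>.<schema>.<table>
        num_partitions=12,                   # large table -> more write parallelism
        replication_factor=3,
        config={
            "cleanup.policy": "compact,delete",   # keep only the latest record per key
            "retention.ms": str(7 * 24 * 3600 * 1000),
        },
    ),
    NewTopic(
        "dbserver1.inventory.customers",
        num_partitions=3,                    # smaller table, fewer partitions
        replication_factor=3,
        config={"cleanup.policy": "compact,delete"},
    ),
]

# create_topics() is asynchronous and returns one future per topic.
for topic, future in admin.create_topics(topics).items():
    try:
        future.result()
        print(f"created {topic}")
    except Exception as exc:
        print(f"failed to create {topic}: {exc}")
```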

Create an independent Debezium connector for each downstream system

Let’s imagine that you have certain tables in MySQL that are needed both by your data scientists (who use Snowflake to train their models), and your data analysts (who use BigQuery to build some internal reporting).

Since the same source tables are replicated to two different destinations, you might be tempted to define a single Debezium source connector whose data is consumed by two different sink connectors (Snowflake + BigQuery in this case). This creates coupling between data consumers, and it’s a bad idea for a few reasons, including:

  • If one consumer needs to reinitialize a snapshot of a certain table, the other consumer will be forced to consume this snapshot as well.
  • If one consumer needs only a few columns of a certain large table while the other consumer needs all the columns, the first consumer will be forced to consume the payload containing all the columns anyway.

This breaks the isolation between different workloads, complicates the maintenance of data pipelines, and incurs unnecessary compute costs.
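As an illustration, here is a hedged sketch of registering two independent Debezium MySQL source connectors (one per downstream system) through the Kafka Connect REST API. The hostnames, table lists and connector names are hypothetical, and the config keys follow the Debezium 1.x MySQL connector.

```python
# Register one Debezium source connector per downstream consumer so that each
# pipeline can re-snapshot or filter tables/columns without impacting the other.
import requests

CONNECT_URL = "http://kafka-connect:8083/connectors"  # hypothetical Connect endpoint


def register(name: str, server_name: str, tables: str, server_id: str) -> None:
    config = {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",              # placeholder
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": server_id,           # must be unique per connector
        "database.server.name": server_name,       # topic prefix, unique per connector
        "table.include.list": tables,              # only the tables this consumer needs
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": f"schema-changes.{name}",
    }
    resp = requests.post(CONNECT_URL, json={"name": name, "config": config})
    resp.raise_for_status()


# Same source database, two isolated replication pipelines.
register("orders-for-snowflake", "mysql_snowflake", "shop.orders,shop.order_items", "5501")
register("orders-for-bigquery", "mysql_bigquery", "shop.orders", "5502")
```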

Watch out when using Debezium with PostgreSQL

Setting up Debezium on PostgreSQL for the first time can be tricky. Debezium uses PostgreSQL’s WAL (Write-Ahead Logging) through “publications” and “replication slots”, all of which need to be correctly configured for Debezium to function. While Debezium can create them automatically (as explained in the docs), this requires giving Debezium’s DB user ownership privileges over the replicated tables. This is not only an anti-pattern, but it might even be impossible from an organizational perspective, as these tables are most probably owned by other software development teams in your company, not by the data engineers.

The better alternative is to have the table owners or a superuser create the publication and replication slot manually, as explained nicely in this article. Another good read on Debezium with PostgreSQL is the article by Ashhar Hasan.
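Below is a minimal sketch, in Python with psycopg2, of what that manual setup can look like: a superuser or the table owner creates the publication and the logical replication slot up front, so Debezium’s DB user never needs ownership of the replicated tables. The object names and connection string are illustrative.

```python
# Manual creation of the publication and replication slot, run by a superuser
# or the table owner (not by Debezium's DB user). Names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=shop user=postgres_admin host=db")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Publication limited to the tables Debezium should capture.
    cur.execute(
        "CREATE PUBLICATION dbz_publication FOR TABLE public.orders, public.order_items;"
    )
    # Logical replication slot using the built-in pgoutput plugin (PostgreSQL 10+).
    cur.execute(
        "SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');"
    )

conn.close()
```

The Debezium PostgreSQL connector is then pointed at these pre-created objects via its slot.name and publication.name properties, with publication.autocreate.mode set to disabled.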

Be extremely cautious of the WAL size growing out of control. This can happen in the high-traffic/low-traffic situations explained here, or if a replication slot is left unused after some manual testing. For instance, if your PostgreSQL runs on GCP’s Cloud SQL with storage auto-scaling enabled, a growing WAL will scale up the DB’s storage (and cost), and as of now it is not possible to scale it back down. The only remedy once you discover the problem is to drop the database and recreate it.
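A simple periodic check on pg_replication_slots can catch this early. The sketch below (again Python + psycopg2) flags slots that retain too much WAL and shows how a forgotten slot can be dropped; the threshold, connection string and slot name are arbitrary examples.

```python
# Periodic check for WAL retained by replication slots, plus cleanup of a slot
# left behind by manual testing. Threshold and names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=shop user=postgres_admin host=db")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # How much WAL each slot is holding back from being recycled.
    cur.execute("""
        SELECT slot_name,
               active,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
        FROM pg_replication_slots;
    """)
    for slot_name, active, retained_bytes in cur.fetchall():
        if retained_bytes and retained_bytes > 5 * 1024**3:  # example 5 GB threshold
            print(f"WARNING: slot {slot_name} retains {retained_bytes} bytes of WAL "
                  f"(active={active})")

    # Dropping a forgotten, inactive slot releases the WAL it was pinning.
    # cur.execute("SELECT pg_drop_replication_slot('old_test_slot');")

conn.close()
```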

Be aware of the major bug KAFKA-8713

Long story short: if a nullable database column has a default value defined, Kafka Connect incorrectly replaces any null value in that column with the default value during serialization. This bug impacts Debezium downstream, as Debezium relies on Kafka Connect’s serialization.

Even though this bug compromises data integrity for such columns, it did not seem to be getting a fix any time soon (at the time of writing), as it requires a KIP (Kafka Improvement Proposal).

Thoughts on Snowflake

The nice thing about Snowflake is that it just works reliably. There are not many things you need to worry about (besides your bill for credit usage). And to make this bill less painful, you can optimize the buffer configuration of the Snowflake Kafka connector in order to use Snowpipe more wisely.

Snowflake Kafka Sink Connector buffer configuration with default values

It is worth asking how fast you need to flush data out of Kafka (and thus trigger Snowpipe). Every data pipeline can have different data freshness requirements, so some experimentation on the buffer configuration is needed to find the right trade-off between cost and data freshness.
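For illustration, here is a hedged sketch of updating the buffer thresholds of a Snowflake sink connector through the Kafka Connect REST API. The account, credentials and connector name are placeholders, and the numbers shown are just one possible trade-off, not a recommendation; the documented defaults at the time of writing were around buffer.flush.time=120 seconds, buffer.count.records=10000 and buffer.size.bytes=5000000.

```python
# Update the Snowflake sink connector config with larger buffer thresholds:
# fewer Snowpipe file loads and lower cost, at the price of slightly staler data.
# All credentials and names are placeholders.
import requests

config = {
    "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "topics": "dbserver1.inventory.orders",
    "snowflake.url.name": "myaccount.snowflakecomputing.com",   # placeholder
    "snowflake.user.name": "KAFKA_CONNECTOR",                   # placeholder
    "snowflake.private.key": "********",
    "snowflake.database.name": "RAW",
    "snowflake.schema.name": "DEBEZIUM",
    # Flush less aggressively than the defaults.
    "buffer.flush.time": "300",        # seconds
    "buffer.count.records": "50000",
    "buffer.size.bytes": "20000000",
}

resp = requests.put(
    "http://kafka-connect:8083/connectors/snowflake-orders-sink/config",  # placeholder name
    json=config,
)
resp.raise_for_status()
```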

One inconvenience of the Snowflake Kafka sink connector is that it pushes data to Snowflake as semi-structured JSON, so it needs to be transformed if the table’s schema is needed. Obed explained this elegantly in this blog.
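As a rough illustration (not Obed’s exact approach), a view can re-apply a schema on top of the RECORD_CONTENT VARIANT column that the connector lands next to RECORD_METADATA. The database, table and column names below are hypothetical, and the exact JSON paths depend on your converter and SMT settings.

```python
# Create a typed view over the semi-structured landing table written by the
# Snowflake sink connector. Names, paths and credentials are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount", user="transform_user", password="********",  # placeholders
    database="RAW", schema="DEBEZIUM", warehouse="TRANSFORM_WH",
)

flatten_view = """
CREATE OR REPLACE VIEW ANALYTICS.ORDERS AS
SELECT
    RECORD_CONTENT:after:id::NUMBER                        AS id,
    RECORD_CONTENT:after:status::STRING                    AS status,
    RECORD_CONTENT:after:updated_at::TIMESTAMP_NTZ         AS updated_at,
    RECORD_METADATA:CreateTime::NUMBER                     AS kafka_create_time
FROM DEBEZIUM.DBSERVER1_INVENTORY_ORDERS
"""

conn.cursor().execute(flatten_view)
conn.close()
```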

Thoughts on Monitoring

Debezium and Kafka Connect collectively expose an exhaustive list of metrics that can provide excellent visibility into the data pipeline. But among all these metrics, there is one that stands out: kafka_consumergroup_lag.

I cannot stress enough how much this single metric can tell us about the health of the data replication pipeline end-to-end.

Healthy oscillating trends of kafka_consumergroup_lag

This Kafka Exporter metric reflects the number of messages in a partition of a Kafka topic that have not yet been consumed by the consumer group. Ideally it should oscillate between zero and a given threshold (based on your data pipeline SLAs). A sketch of a simple check on this metric follows the list below.

  • If it stops going up: Debezium has stopped producing messages.
  • If it stops going down: Snowflake has stopped consuming messages.
  • If it goes too high: Snowflake is consuming messages too slowly. The buffer flush config should be tightened.
  • If it doesn’t go high enough: Snowflake is consuming messages too fast. There is potential to save Snowpipe costs by increasing the buffer flush limits.
  • If it is absent: Kafka might be unstable.
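For example, assuming Kafka Exporter metrics are scraped by Prometheus, a small script can query the lag per topic and compare it against a threshold derived from your freshness SLA. The Prometheus URL, consumer group name and threshold below are placeholders.

```python
# Query kafka_consumergroup_lag from Prometheus and flag topics whose lag
# breaches an SLA-derived threshold. URL, group and threshold are placeholders.
import requests

PROMETHEUS = "http://prometheus:9090"  # hypothetical Prometheus endpoint
QUERY = 'sum by (topic) (kafka_consumergroup_lag{consumergroup="snowflake-sink"})'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    topic = series["metric"]["topic"]
    lag = float(series["value"][1])
    if lag > 100_000:                  # pick a threshold from your SLA
        print(f"ALERT: lag on {topic} is {lag:.0f} messages")
    elif lag == 0:
        # A flat zero can also mean Debezium stopped producing, so alert on the
        # absence or flatness of the metric as well.
        print(f"note: {topic} lag is 0, check the producer side too")
```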

Here we come to the end of this self-reflection on replicating transactional data to Snowflake using Debezium and Kafka. Should you have any comments, corrections, or further insights, please do not hesitate to reach out!
