
Testing Databricks SQL Notebooks: A Practical Guide

30 Dec 2024

Testing notebooks may not sound glamorous, but it’s essential for building robust and scalable data pipelines. Imagine assembling IKEA furniture without instructions—everything might fall apart when you least expect it. Similarly, testing ensures your SQL notebooks run smoothly, delivering reliable and accurate data transformations.

In this article I’ll walk through how to set up a testing framework for Databricks SQL notebooks. The goal? To test individual notebooks in isolation while ensuring they work as expected in a Spark environment.

Why Test SQL Notebooks?

Databricks SQL notebooks are powerful tools for managing data transformations. However, without testing, even minor mistakes—like a mismatched schema or an incorrect join—can disrupt your pipeline and lead to hours of debugging.

Testing SQL notebooks helps you:

  • Refactor confidently: Update your code without fear of breaking existing workflows.
  • Catch errors early: Identify issues before they escalate.
  • Maintain pipeline integrity: Ensure each notebook produces the correct output.

What’s in the Test Framework?

To demonstrate, I have created a repository that showcases a MERGE operation between a bronze and a silver table, implemented as a SQL notebook. The repository structure for this framework looks like this:

main_testing/
├── tests/
│   ├── integration_tests/
│   │   ├── conftest.py
│   │   └── test_merge_table_silver_target_table.py
│   └── run_integration_tests.py
└── src/
    └── notebooks/
        └── catalog_dev/
            ├── schema_bronze/
            │   └── create_table_bronze_source_table
            └── schema_silver/
                ├── create_table_silver_target_table
                └── merge_table_silver_target_table

How Does the Test Work?

Here’s the flow of an individual test:

  1. Setup: Create the test tables using the same scripts that are used to create production tables, but with different catalog, schema and table name parameters.
  2. Data Preparation: Insert a small set of test data into the source and target tables.
  3. Execute: Run the MERGE notebook using the test data.
  4. Validate: Compare the results in the target table against expected outcomes.
  5. Cleanup: Drop the test tables to ensure isolation for future tests.

Key Principles

  • Automation: Tests can run as part of a CI/CD pipeline, ensuring continuous validation.
  • Isolation: Each test runs in a self-contained environment, preventing interference.
  • Parameterization: Catalogs, schemas, and table names are parameterized to simplify reusability (see the sketch after this list).
  • Modularity: If the individual notebook contains only one operation, the corresponding test offers a level of granularity akin to unit testing in the world of data engineering. This approach ensures clarity, simplicity, and maintainability.
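
For example, isolation and parameterization can be combined by giving each test run its own uniquely named test objects and passing those names to the notebooks as widget arguments. This is only a minimal sketch of the idea; the helper, the parameter names, and the default values are assumptions, not code from the repository.

import uuid

def build_test_params(run_id: str = "") -> dict:
    """Build widget arguments for one isolated test run.

    The names below are illustrative defaults; adapt them to your
    own test catalog and schema conventions.
    """
    suffix = run_id or uuid.uuid4().hex[:8]
    return {
        "catalog_name": "test",
        "schema_name": f"bronze_test_{suffix}",
        "table_name": f"source_table_{suffix}",
    }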

Step-by-Step Example

1. Parameterized SQL Notebooks

We begin with parameterized notebooks for creating source and target tables:

-- Widgets expose the catalog, schema, and table name as parameters,
-- so the same notebook can create a production table or a test table.
CREATE WIDGET TEXT catalog_name DEFAULT 'dev';
CREATE WIDGET TEXT schema_name DEFAULT 'bronze';
CREATE WIDGET TEXT table_name DEFAULT 'source_table';

-- IDENTIFIER() turns the concatenated widget values into a fully qualified table name.
CREATE OR REPLACE TABLE IDENTIFIER(:catalog_name || '.' || :schema_name || '.' || :table_name)
(
    id INT,
    name STRING,
    created_at TIMESTAMP,
    updated_at TIMESTAMP,
    status STRING
);
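
Because the table name is assembled from widget values, a test can point this notebook at a test catalog and schema simply by overriding those widgets when it runs the notebook. A minimal sketch using dbutils.notebook.run; the workspace path and the parameter values here are assumptions for illustration:

# Create the source table in a test location by overriding the widget defaults.
# The notebook path below is hypothetical; use the actual workspace path.
dbutils.notebook.run(
    "/Repos/main_testing/src/notebooks/catalog_dev/schema_bronze/create_table_bronze_source_table",
    600,  # timeout in seconds
    {"catalog_name": "test", "schema_name": "bronze_test", "table_name": "source_table_test"},
)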

2. Merge Notebook

The MERGE notebook performs the transformation:

MERGE INTO IDENTIFIER(:catalog_name || '.' || :silver_schema_name || '.' || :target_table_name) AS target
USING IDENTIFIER(:catalog_name || '.' || :bronze_schema_name || '.' || :source_table_name) AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET
    target.name = source.name,
    target.created_at = source.created_at,
    target.updated_at = source.updated_at,
    target.status = source.status
WHEN NOT MATCHED THEN
  INSERT (id, name, created_at, updated_at, status)
  VALUES (source.id, source.name, source.created_at, source.updated_at, source.status);
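
To make the two WHEN branches concrete, here is a tiny worked example of what the merge does to the data. The rows are invented purely for illustration, and the timestamp columns are omitted for brevity:

# Invented rows, shown as (id, name, status), purely to illustrate the MERGE.
source_rows = [
    (1, "Alice", "active"),  # id 1 exists in the target -> WHEN MATCHED: update it
    (3, "Carol", "new"),     # id 3 is not in the target -> WHEN NOT MATCHED: insert it
]
target_before = [
    (1, "Alice", "inactive"),
    (2, "Bob",   "active"),  # no matching source row -> left untouched
]
target_after = [
    (1, "Alice", "active"),  # updated from the source
    (2, "Bob",   "active"),  # unchanged
    (3, "Carol", "new"),     # inserted from the source
]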

3. Test Implementation

Fixtures: conftest.py. These fixtures provide a Spark session and a DBUtils instance for the tests.

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Reuse a single Spark session across the whole test session.
    spark = SparkSession.builder.appName('integration-tests').getOrCreate()
    return spark

@pytest.fixture
def dbutils(spark):
    # DBUtils gives the tests access to dbutils.notebook.run and related utilities.
    from pyspark.dbutils import DBUtils
    return DBUtils(spark)
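
Note that pyspark.dbutils ships with the Databricks runtime rather than with open-source PySpark, so the dbutils fixture is only expected to resolve when the tests run on a Databricks cluster, which is exactly how the runner notebook below executes them.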

Main entry point to run all tests: run_integration_tests.py

# Databricks notebook source
%reload_ext autoreload
%autoreload 2

# COMMAND ----------

import pytest
import os
import sys

# Integration tests folder
integration_tests_folder = "integration_tests"

# Avoid writing .pyc files on a read-only filesystem.
sys.dont_write_bytecode = True

# Run pytest on integration tests folder
retcode = pytest.main([integration_tests_folder, "-v", "-p", "no:cacheprovider", "--junit-xml", "/dbfs/tmp/integration_tests/junit.xml"])

# Fail the cell execution if there are any test failures.
assert retcode == 0, "The pytest invocation failed. See the log for details."

Test File: test_merge_table_silver_target_table.py

import os
import pytest
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.testing import assertDataFrameEqual

def test_merge_table_silver_target_table(spark, dbutils):
    """
    Test to verify the merge operation between bronze and silver tables.
    This test performs the following steps:
    1. Creates source and target tables by running respective notebooks.
    2. Inserts test data into both tables.
    3. Runs the merge notebook.
    4. Verifies that the target table has the expected data.
    5. Cleans up by dropping the created tables and schema.
    """
...
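
The body of the test is elided above. To give a sense of what those five steps can look like, here is an illustrative sketch that reuses the fixtures and imports shown earlier; every notebook path, parameter name, and data value in it is an assumption made for this example rather than code from the repository:

def test_merge_sketch(spark, dbutils):
    """Illustrative sketch only; paths, parameters, and data are assumptions."""
    catalog, bronze, silver = "test", "bronze_test", "silver_test"
    source, target = "source_table", "target_table"
    notebooks = "/Repos/main_testing/src/notebooks/catalog_dev"  # hypothetical path

    # 1. Setup: create the test tables with the same scripts used in production
    #    (assumes the test schemas already exist; the real test also handles the schema).
    dbutils.notebook.run(f"{notebooks}/schema_bronze/create_table_bronze_source_table", 600,
                         {"catalog_name": catalog, "schema_name": bronze, "table_name": source})
    dbutils.notebook.run(f"{notebooks}/schema_silver/create_table_silver_target_table", 600,
                         {"catalog_name": catalog, "schema_name": silver, "table_name": target})

    # 2. Data preparation: one matched row (id=1) and one unmatched row (id=2).
    spark.sql(f"""
        INSERT INTO {catalog}.{bronze}.{source} VALUES
            (1, 'Alice', TIMESTAMP '2024-01-01', TIMESTAMP '2024-06-01', 'active'),
            (2, 'Bob',   TIMESTAMP '2024-02-01', TIMESTAMP '2024-06-01', 'new')
    """)
    spark.sql(f"""
        INSERT INTO {catalog}.{silver}.{target} VALUES
            (1, 'Alice', TIMESTAMP '2024-01-01', TIMESTAMP '2024-03-01', 'inactive')
    """)

    # 3. Execute: run the merge notebook against the test tables.
    dbutils.notebook.run(f"{notebooks}/schema_silver/merge_table_silver_target_table", 600, {
        "catalog_name": catalog,
        "bronze_schema_name": bronze,
        "silver_schema_name": silver,
        "source_table_name": source,
        "target_table_name": target,
    })

    # 4. Validate: with this particular data, the merged target should equal the source.
    actual = spark.table(f"{catalog}.{silver}.{target}")
    expected = spark.table(f"{catalog}.{bronze}.{source}")
    assertDataFrameEqual(actual, expected)

    # 5. Cleanup: drop the test tables so later runs start from a clean slate.
    spark.sql(f"DROP TABLE IF EXISTS {catalog}.{bronze}.{source}")
    spark.sql(f"DROP TABLE IF EXISTS {catalog}.{silver}.{target}")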

Why This Matters

By modularizing SQL notebooks and parameterizing table names, we ensure that each operation can be tested independently. This method offers a streamlined approach to testing even the smallest components of your data pipeline while allowing for scalability in a Spark context.

Conclusion

Testing notebooks might not be as flashy as deploying them, but it’s the unsung hero of data engineering. A solid testing framework catches bugs early, ensures data accuracy, and builds confidence in your pipelines. By adopting these practices, you’re not just writing tests—you’re laying the groundwork for scalable and reliable data systems.
