Developer Blog | dbt Developer Hub

How we cut our tests by 80% while increasing data quality: the power of aggregating test failures in dbt

January 24, 2023 · 8 min read

Software Developer @ Tempus

Testing the quality of data in your warehouse is an important aspect in any mature data pipeline. One of the biggest blockers for developing a successful data quality pipeline is aggregating test failures and successes in an informational and actionable way. However, ensuring actionability can be challenging. If ignored, test failures can clog up a pipeline and create unactionable noise, rendering your testing infrastructure ineffective.

Power up your data quality with grouped checks

January 17, 2023 · 7 min read

Emily Riederer

Senior Manager Analytics @ Capital One

Imagine you were responsible for monitoring the safety of a subway system. Where would you begin? Most likely, you'd start by thinking about the key risks like collision or derailment, contemplate what causal factors like scheduling software and track conditions might contribute to bad outcomes, and institute processes and metrics to detect if those situations arose. What you wouldn't do is blindly apply irrelevant industry standards like testing for problems with the landing gear (great for planes, irrelevant for trains) or obsessively worry about low probability events like accidental teleportation before you'd locked down the fundamentals.

When thinking about real-world scenarios, we're naturally inclined to think about key risks and mechanistic causes. However, in the more abstract world of data, many of our data tests often gravitate towards one of two extremes: applying rote out-of-the-box tests (nulls, PK-FK relationships, etc.) from the world of traditional database management or playing with exciting new toys that promise to catch our wildest errors with anomaly detection and artificial intelligence.

Between these two extremes lies a gap where human intelligence goes. Analytics engineers can create more effective tests by embedding their understanding of how the data was created, and especially how this data can go awry (a topic I've written about previously). While such expressive tests will be unique to our domain, modest tweaks to our mindset can help us implement them with our standard tools. This post demonstrates how the simple act of conducting tests by group can expand the universe of possible tests, boost the sensitivity of the existing suite, and help keep our data "on track". This feature is now available in dbt-utils.

Serverless, free-tier data stack with dlt + dbt core.

January 15, 2023 · 8 min read

Euan Johnston

The problem, the builder and tooling

The problem: My partner and I are considering buying a property in Portugal. There is no reference data for the real estate market here - how many houses are being sold, for what price? Nobody knows except the property office and maybe the banks, and they don’t readily divulge this information. The only data source we have is Idealista, which is a portal where real estate agencies post ads.

Unfortunately, there are significantly fewer properties than ads - it seems many real estate companies re-post the same ad that others do, with intentionally different data and often misleading bits of info. The real estate agencies do this so the interested parties reach out to them for clarification, and from there they can start a sales process. At the same time, the website with the ads is incentivised to allow this to continue as they get paid per ad, not per property.

The builder: I’m a data freelancer who deploys end to end solutions, so when I have a data problem, I cannot just let it go.

The tools: I want to be able to run my project on Google Cloud Functions due to the generous free tier. dlt is a new Python library for declarative data ingestion which I have wanted to test for some time. Finally, I will use dbt Core for transformation.

The starting point

If I want to have reliable information on the state of the market I will need to:

Grab the messy data from Idealista and historize it.
Deduplicate existing listings.
Try to infer what listings sold for how much.

Once I have deduplicated listings with some online history, I can get an idea:

How expensive which properties are.
How fast they get sold, hopefully a signal of whether they are “worth it” or not.

Towards a solution

The solution has pretty standard components:

An EtL pipeline. The little t stands for normalisation, such as transforming strings to dates or unpacking nested structures. This is handled by dlt functions written in Python.
A transformation layer taking the source data loaded by my dlt functions and creating the tables necessary, handled by dbt.
Due to the complexity of deduplication, I needed to add a human element to confirm the deduplication in Google Sheets.

These elements are reflected in the diagram below and further clarified in greater detail later in the article:

Project architecture

Ingesting the data

For ingestion, I use a couple of sources:

First, I ingest home listings from the Idealista API, accessed through API Dojo's freemium wrapper. The dlt pipeline I created for ingestion is in this repo.

After an initial round of transformation (described in the next section), the deduplicated data is loaded into BigQuery where I can query it from the Google Sheets client and manually review the deduplication.

When I'm happy with the results, I use the ready-made dlt Sheets source connector to pull the data back into BigQuery, as defined here.

Transforming the data

For transforming I use my favorite solution, dbt Core. For running and orchestrating dbt on Cloud Functions, I am using dlt’s dbt Core runner. The benefit of the runner in this context is that I can re-use the same credential setup, instead of creating a separate profiles.yml file.

This is the package I created: https://github.com/euanjohnston-dev/idealista_dbt_pipeline

Production-readying the pipeline

To make the pipeline more “production ready”, I made some improvements:

Using a credential store instead of hard-coding passwords, in this case Google Secret Manager.
Be notified when the pipeline runs and what the outcome is. For this I sent data to Slack via a dlt decorator that posts the error on failure and the metadata on success.

from dlt.common.runtime.slack import send_slack_message

def notify_on_completion(hook):
    def decorator(func):
        def wrapper(*args, **kwargs):
            try:
                load_info = func(*args, **kwargs)
                message = f"Function {func.__name__} completed successfully. Load info: {load_info}"
                send_slack_message(hook, message)
                return load_info
            except Exception as e:
                message = f"Function {func.__name__} failed. Error: {str(e)}"
                send_slack_message(hook, message)
                raise
        return wrapper
    return decorator

The outcome

The outcome was first and foremost a visualisation highlighting the unique properties available in my specific area of search. The map shown on the left of the page gives a live overview of location, number of duplicates (bubble size) and price (bubble colour) which can amongst other features be filtered using the sliders on the right. This represents a much better decluttered solution from which to observe the actual inventory available.

Dashboard mapping overview

Further charts highlight additional metrics which – now that deduplication is complete – can be accurately measured including most importantly, the development over time of “average price/square metre” and those properties which have been inferred to have been sold.

Next steps

This version was very much about getting a base from which to analyze the properties for my own personal use case.

In terms of further development which could take place, I have had interest from people to run the solution on their own specific target area.

For this to work at scale I would need a more robust method to deal with duplicate attribution, which is a difficult problem as real estate agencies intentionally change details like number of rooms or surface area.

Perhaps this is a problem ML or GPT could solve equally well as a human, given the limited options available.

Learnings and conclusion

The data problem itself was an eye opener into the real-estate market. It’s a messy market full of unknowns and noise, which adds a significant purchase risk to first time buyers.

Tooling wise, it was surprising how quick it was to set everything up. dlt integrates well with dbt and enables fast and simple data ingestion, making this project simpler than I thought it would be.

dlt

Good:

As a big fan of dbt I love how seamlessly the two solutions complement one another. dlt handles the data cleaning and normalisation automatically so I can focus on curating and modelling it in dbt. While the automatic unpacking leaves some small adjustments for the analytics engineer, it’s much better than cleaning and typing json in the database or in custom python code.
When creating my first dummy pipeline I used duckdb. It felt like a great introduction into how simple it is to get started and provided a solid starting block before developing something for the cloud.

Bad:

I did have a small hiccup with the google sheets connector assuming an oauth authentication over my desired sdk but this was relatively easy to rectify. (explicitly stating GcpServiceAccountCredentials in the init.py file for the source).
Using both a verified source in the gsheets connector and building my own from Rapid API endpoints seemed equally intuitive. However I would have wanted more documentation on how to run these 2 pipelines in the same script with the dbt pipeline.

dbt

No surprises there. I developed the project locally, and to deploy to cloud functions I injected credentials to dbt via the dlt runner. This meant I could re-use the setup I did for the other dlt pipelines.

def dbt_run():
  # make an authenticated connection with dlt to the dwh
    pipeline = dlt.pipeline(
        pipeline_name='dbt_pipeline',
        destination='bigquery', # credentials read from env
        dataset_name='dbt'
    )
  # make a venv in case we have lib conflicts between dlt and current env
    venv = dlt.dbt.get_venv(pipeline)
  # package the pipeline, dbt package and env
    dbt = dlt.dbt.package(pipeline, "dbt/property_analytics", venv=venv)
  # and run it
    models = dbt.run_all()
    # show outcome
    for m in models:
        print(f"Model {m.model_name} materialized in {m.time} with status {m.status} and message {m.message}"

Cloud functions

While I had used cloud functions before, I had never previously set them up for dbt and I was able to easily follow dlt’s docs to run the pipelines there. Cloud functions is a great solution to cheaply run small scale pipelines and my running cost of the project is a few cents a month. If the insights drawn from the project help us save even 1% of a house price, the project will have been a success.

To sum up

dlt feels like the perfect solution for anyone who has scratched the surface of python development. To be able to have schemas ready for transformation in such a short space of time is truly… transformational. As a freelancer, being able to accelerate the development of pipelines is a huge benefit within companies who are often frustrated with the amount of time it takes to start ‘showing value’.

I’d welcome the chance to discuss what’s been built to date or collaborate on any potential further development in the comments below.

Making the leap from accountant to analytics engineer

December 15, 2022 · 9 min read

Samuel Harting

Associate Analytics Engineer @ dbt Labs

In seventh grade, I decided it was time to pick a realistic career to work toward, and since I had an accountant in my life who I really looked up to, that is what I chose. Around ten years later, I finished my accounting degree with a minor in business information systems (a fancy way of saying I coded in C# for four or five classes). I passed my CPA exams quickly and became a CPA as soon as I hit the two-year experience requirement. I spent my first few years at a small firm completing tax returns but I didn't feel like I was learning enough, so I went to a larger firm right before the pandemic started. The factors that brought me to the point of changing industries are numerous, but I’ll try to keep it concise: the tax industry relies on underpaying its workers to maintain margins and prevent itself from being top-heavy, my future work as a manager was unappealing to me, and my work was headed in a direction I wasn’t excited about.

Introducing the dbt_project_evaluator: Automatically evaluate your dbt project for alignment with best practices

November 30, 2022 · 7 min read

Grace Goheen

Analytics Engineer @ dbt Labs

Why we built this: A brief history of the dbt Labs Professional Services team

If you attended Coalesce 2022, you’ll know that the secret is out — the dbt Labs Professional Services team is not just a group of experienced data consultants; we’re also an intergalactic group of aliens traveling the Milky Way on a mission to enable analytics engineers to successfully adopt and manage dbt throughout the galaxy.

How to move data from spreadsheets into your data warehouse

November 23, 2022 · 11 min read

Joel Labes

Senior Developer Experience Advocate @ dbt Labs

Once your data warehouse is built out, the vast majority of your data will have come from other SaaS tools, internal databases, or customer data platforms (CDPs). But there’s another unsung hero of the analytics engineering toolkit: the humble spreadsheet.

Spreadsheets are the Swiss army knife of data processing. They can add extra context to otherwise inscrutable application identifiers, be the only source of truth for bespoke processes from other divisions of the business, or act as the translation layer between two otherwise incompatible tools.

Because of spreadsheets’ importance as the glue between many business processes, there are different tools to load them into your data warehouse and each one has its own pros and cons, depending on your specific use case.

A journey through the Foundry: Becoming an analytics engineer at dbt Labs

November 22, 2022 · 6 min read

Wasila Quader

Associate Analytics Engineer @ dbt Labs

Data is an industry of sidesteppers. Most folks in the field stumble into it, look around, and if they like what they see, they’ll build a career here. This is particularly true in the analytics engineering space. Every AE I’ve talked to had envisioned themselves doing something different before finding this work in a moment of serendipity. This raises the question, how can someone become an analytics engineer intentionally? This is the question dbt Labs’ Foundry Program aims to address.

Demystifying event streams: Transforming events into tables with dbt

November 4, 2022 · 13 min read

Charlie Summers

Staff Software Engineer @ Merit

Let’s discuss how to convert events from an event-driven microservice architecture into relational tables in a warehouse like Snowflake. Here are a few things we’ll address:

Why you may want to use an architecture like this
How to structure your event messages
How to use dbt macros to make it easy to ingest new event streams

Stronger together: Python, dataframes, and SQL

October 18, 2022 · 13 min read

Doug Beatty

Senior Developer Experience Advocate @ dbt Labs

For years working in data and analytics engineering roles, I treasured the daily camaraderie sharing a small office space with talented folks using a range of tools - from analysts using SQL and Excel to data scientists working in Python. I always sensed that there was so much we could work on in collaboration with each other - but siloed data and tooling made this much more difficult. The diversity of our tools and languages made the potential for collaboration all the more interesting, since we could have folks with different areas of expertise each bringing their unique spin to the project. But logistically, it just couldn’t be done in a scalable way.

So I couldn’t be more excited about dbt’s polyglot capabilities arriving in dbt Core 1.3. This release brings Python dataframe libraries that are crucial to data scientists and enables general-purpose Python but still uses a shared database for reading and writing data sets. Analytics engineers and data scientists are stronger together, and I can’t wait to work side-by-side in the same repo with all my data scientist friends.

Going polyglot is a major next step in the journey of dbt Core. While it expands possibilities, we also recognize the potential for confusion. When combined in an intentional manner, SQL, dataframes, and Python are also stronger together. Polyglot dbt allows informed practitioners to choose the language that best fits your use case.

In this post, we’ll give you your hands-on experience and seed your imagination with potential applications. We’ll walk you through a demo that showcases string parsing - one simple way that Python can be folded into a dbt project.

We’ll also give you the intellectual resources to compare/contrast:

different dataframe implementations within different data platforms
dataframes vs. SQL

Finally, we’ll share “gotchas” and best practices we’ve learned so far and invite you to participate in discovering the answers to outstanding questions we are still curious about ourselves.

Based on our early experiences, we recommend that you:

✅ Do: Use Python when it is better suited for the job – model training, using predictive models, matrix operations, exploratory data analysis (EDA), Python packages that can assist with complex transformations, and select other cases where Python is a more natural fit for the problem you are trying to solve.

❌ Don’t: Use Python where the solution in SQL is just as direct. Although a pure Python dbt project is possible, we’d expect the most impactful projects to be a mixture of SQL and Python.

Analysts make the best analytics engineers

September 29, 2022 · 11 min read

Brittany Krauth

Manager, Analytics & Insights @ Degreed

When you were in grade school, did you ever play the “Telephone Game”? The first person would whisper a word to the second person, who would then whisper a word to the third person, and so on and so on. At the end of the line, the final person would loudly announce the word that they heard, and alas! It would have morphed into a new word completely incomprehensible from the original word. That’s how life feels without an analytics engineer on your team.

So let’s say that you have a business question, you have the raw data in your data warehouse, and you’ve got dbt up and running. You’re in the perfect position to get this curated dataset completed quickly! Or are you?

The case against `git cherry pick`: Recommended branching strategy for multi-environment dbt projects

September 13, 2022 · 10 min read

Grace Goheen

Analytics Engineer @ dbt Labs

Why do people cherry pick into upper branches?

The simplest branching strategy for making code changes to your dbt project repository is to have a single main branch with your production-level code. To update the main branch, a developer will:

Create a new feature branch directly from the main branch
Make changes on said feature branch
Test locally
When ready, open a pull request to merge their changes back into the main branch

Basic git workflow

If you are just getting started in dbt and deciding which branching strategy to use, this approach–often referred to as “continuous deployment” or “direct promotion”–is the way to go. It provides many benefits including:

Fast promotion process to get new changes into production
Simple branching strategy to manage

The main risk, however, is that your main branch can become susceptible to bugs that slip through the pull request approval process. In order to have more intensive testing and QA before merging code changes into production, some organizations may decide to create one or more branches between the feature branches and main.

KonMari your data: Planning a query migration using the Marie Kondo method

September 8, 2022 · 10 min read

Lauren Benezra

Analytics Engineer @ dbt Labs

If you’ve ever heard of Marie Kondo, you’ll know she has an incredibly soothing and meditative method to tidying up physical spaces. Her KonMari Method is about categorizing, discarding unnecessary items, and building a sustainable system for keeping stuff.

As an analytics engineer at your company, doesn’t that last sentence describe your job perfectly?! I like to think of the practice of analytics engineering as applying the KonMari Method to data modeling. Our goal as Analytics Engineers is not only to organize and clean up data, but to design a sustainable and scalable transformation project that is easy to navigate, grow, and consume by downstream customers.

Let’s talk about how to apply the KonMari Method to a new migration project. Perhaps you’ve been tasked with unpacking the kitchen in your new house; AKA, you’re the engineer hired to move your legacy SQL queries into dbt and get everything working smoothly. That might mean you’re grabbing a query that is 1500 lines of SQL and reworking it into modular pieces. When you’re finished, you have a performant, scalable, easy-to-navigate data flow.

Leverage Accounting Principles when Modeling Financial Data

September 7, 2022 · 14 min read

Joe Markiewicz

Analytics Engineering Manager (Fivetran dbt package maintainer) @ Fivetran

Analyzing financial data is rarely ever “fun.” In particular, generating and analyzing financial statement data can be extremely difficult and leaves little room for error. If you've ever had the misfortune of having to generate financial reports for multiple systems, then you will understand how incredibly frustrating it is to reinvent the wheel each time.

This process can include a number of variations, but usually involves spending hours, days, or weeks working with Finance to:

Understand what needs to go into the reports
Model said reports
Validate said reports
Make adjustments within your model
Question your existence
Validate said reports again

You can imagine how extremely time consuming this process can be. Thankfully, you can leverage core accounting principles and other tools to more easily and effectively generate actionable financial reports. This way, you can spend more time diving into deeper financial analyses.

August 2022 dbt Update: v1.3 beta, Tech Partner Program, and Coalesce!

August 31, 2022 · 5 min read

Lauren Craigie

Product Marketing @ dbt Labs

Semantic layer, Python model support, the new dbt Cloud UI and IDE… there’s a lot our product team is excited to share with you at Coalesce in a few weeks.

But how these things fit together—because of where dbt Labs is headed—is what I’m most excited to discuss.

You’ll hear more in Tristan’s keynote, but this feels like a good time to remind you that Coalesce isn’t just for answering tough questions… it’s for surfacing them. For sharing challenges we’ve felt in silos, finding the people you want to solve them with, and spending the rest of the year chipping away at them. As Tristan says in his latest blog, that’s how this industry moves forward.

An introduction to unit testing your dbt Packages

August 25, 2022 · 7 min read

Yu Ishikawa

Senior Data Privacy Engineer @ Ubie

Editors note - this post assumes working knowledge of dbt Package development. For an introduction to dbt Packages check out So You Want to Build a dbt Package.

It’s important to be able to test any dbt Project, but it’s even more important to make sure you have robust testing if you are developing a dbt Package.

I love dbt Packages, because it makes it easy to extend dbt’s functionality and create reusable analytics resources. Even better, we can find and share dbt Packages which others developed, finding great packages in dbt hub. However, it is a bit difficult to develop complicated dbt macros, because dbt on top of Jinja2 is lacking some of the functionality you’d expect for software development - like unit testing.

In this article, I would like to share options for unit testing your dbt Package - first through discussing the commonly used pattern of integration testing and then by showing how we can implement unit tests as part of our testing arsenal.

Surrogate keys in dbt: Integers or hashes?

August 24, 2022 · 12 min read

Dave Connors

Senior Developer Experience Advocate @ dbt Labs

Those who have been building data warehouses for a long time have undoubtedly encountered the challenge of building surrogate keys on their data models. Having a column that uniquely represents each entity helps ensure your data model is complete, does not contain duplicates, and able to join across different data models in your warehouse.

Sometimes, we are lucky enough to have data sources with these keys built right in — Shopify data synced via their API, for example, has easy-to-use keys on all the tables written to your warehouse. If this is not the case, or if you build a data model with a compound key (aka the data is unique across multiple dimensions), you will have to rely on some strategy for creating and maintaining these keys yourself. How can you do this with dbt? Let’s dive in.

Narrative modeling: How structure can tell a story

August 22, 2022 · 15 min read

Ian Fahey

Analytics Engineer @ dbt Labs

The larger a data ecosystem gets, the more its users and stakeholders expect consistency. As the ratio of data models to team members (to say nothing of stakeholders to team members) skyrockets, an agreed-upon modeling pattern often acts as scaffolding around that growth.

The biggest tool in the toolbox today, dimensional modeling, offers enough consistency to make it the dominant approach in the space, but what might be possible if we shut that toolbox, took a break from our workbench, and instead strolled over to our bookshelf?

In other words, what if we told a story?

How we shaved 90 minutes off our longest running model

August 18, 2022 · 15 min read

Bennie Regenold

Analytics Engineer @ dbt Labs

Barr Yaron

Product Manager @ dbt Labs

When running a job that has over 1,700 models, how do you know what a “good” runtime is? If the total process takes 3 hours, is that fantastic or terrible? While there are many possible answers depending on dataset size, complexity of modeling, and historical run times, the crux of the matter is normally “did you hit your SLAs”? However, in the cloud computing world where bills are based on usage, the question is really “did you hit your SLAs and stay within budget”?

Here at dbt Labs, we used the Model Timing tab in our internal analytics dbt project to help us identify inefficiencies in our incremental dbt Cloud job that eventually led to major financial savings, and a path forward for periodic improvement checks.

Enforcing rules at scale with pre-commit-dbt

August 3, 2022 · 13 min read

Benoit Perigaud

Staff Analytics Engineer @ dbt Labs

Editor's note — since the creation of this post, the package pre-commit-dbt's ownership has moved to another team and it has been renamed to dbt-checkpoint. A redirect has been set up, meaning that the code example below will still work. It is also possible to replace repo: https://github.com/offbi/pre-commit-dbt with repo: https://github.com/dbt-checkpoint/dbt-checkpoint in your .pre-commit-config.yaml file.

At dbt Labs, we have best practices we like to follow for the development of dbt projects. One of them, for example, is that all models should have at least unique and not_null tests on their primary key. But how can we enforce rules like this?

That question becomes difficult to answer in large dbt projects. Developers might not follow the same conventions. They might not be aware of past decisions, and reviewing pull requests in git can become more complex. When dbt projects have hundreds of models, it's hard to know which models do not have any tests defined and aren't enforcing your conventions.

Updating our permissioning guidelines: grants as configs in dbt Core v1.2

July 26, 2022 · 7 min read

Jeremy Cohen

Product Manager @ dbt Labs

Doug Beatty

Senior Developer Experience Advocate @ dbt Labs

If you’ve needed to grant access to a dbt model between 2019 and today, there’s a good chance you’ve come across the "The exact grant statements we use in a dbt project" post on Discourse. It explained options for covering two complementary abilities:

querying relations via the "select" privilege
using the schema those relations are within via the "usage" privilege

Developer Blog | dbt Developer Hub

The problem, the builder and tooling​

The starting point​

Towards a solution​

Ingesting the data​

Transforming the data​

Production-readying the pipeline​

The outcome​

Next steps​

Learnings and conclusion​

dlt​

dbt​

Cloud functions​

To sum up​

Why we built this: A brief history of the dbt Labs Professional Services team​

Why do people cherry pick into upper branches?​

The problem, the builder and tooling

The starting point

Towards a solution

Ingesting the data

Transforming the data

Production-readying the pipeline

The outcome

Next steps

Learnings and conclusion

dlt

dbt

Cloud functions

To sum up

Why we built this: A brief history of the dbt Labs Professional Services team

Why do people cherry pick into upper branches?