
Anatomy of a Pipeline: CI/CD for a dbt Data Warehouse on Google BigQuery Using Azure Pipelines

Setting up continuous deployment for your dbt data warehouse has never been simpler. The major code hosting platforms (GitHub, GitLab and Azure DevOps) all offer tools that automate deploying model changes into production, usually by creating a Docker container, pulling in all code and dependencies, running the models and then tearing it all down again.

Using Azure Pipelines, this checkout-and-run process is specified in a YAML file that lives in your repo. Follow the steps below to automate the deployment of your dbt models on a BigQuery data warehouse.

Defaults

The first set of statements define global defaults for the repo.

The trigger states when the pipeline will run, and how. In this case, any change to the master branch will result in a deployment. Combine this with a ruleset in your repo that only allows merges to the master branch following an approved pull request and you’ll be able to leave the master branch alone knowing that your production systems are well tested and offer no surprises.

With batch set to true, the system waits until the current run has completed before starting another run that batches up all changes committed in the meantime. This is helpful when multiple people are committing changes to the branch at the same time.

The pool option specifies the VM image that the build agent runs on, and container specifies the Docker image the job's steps run inside (here, Python 3.8).

---
trigger:
  batch: true
  branches:
    include:
      - master

pool:
  vmImage: 'ubuntu-latest'

container:
  image: python:3.8

Variables

Azure has predefined variables you can use in your pipeline, or you can define your own. Here, I set a location to store Python packages that I will cache in a later step.

I used the predefined variable Pipeline.Workspace, which supplies the working directory for the pipeline, to make things a little easier down the road.

variables:
  PIP_CACHE_DIR: $(Pipeline.Workspace)/.pip

Deployment Steps

The next few sections define the deployment steps that the pipeline will perform. The steps are carried out in the order they are listed.

Checkout repo

The first step is to check out the repo to get my models into the Docker container. I specified submodules as true because I'm using dbt packages (dbt_utils, for example) that I've defined as git submodules within my repo. Although the recommended method of including packages in dbt is via dbt hub, using git submodules lets me pin the exact package version I need, and the packages are pulled into my repo when the pipeline checks out the code.

Setting path to the name of my repo (e.g. 'dbt') was useful when referring to the model hierarchy in my dbt project definition. If you don't do this, Azure defaults to naming the checkout directory s and things can get confusing.

steps:
  - checkout: self
    submodules: true
    path: <repo name>
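
For comparison, the dbt hub route mentioned above would use a packages.yml file in the project root instead of submodules. A minimal sketch (the package namespace and version pin are illustrative and depend on your dbt version):

```yaml
# packages.yml - the dbt hub alternative to git submodules
# (namespace and version are illustrative; pin the release you need)
packages:
  - package: dbt-labs/dbt_utils
    version: 0.7.0
```

With this approach, the pipeline would need to run dbt deps before dbt run so the packages are fetched at build time.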

Download required files

The warehouse for this project is in Google BigQuery. The best way I've found to connect dbt to BigQuery is to point ~/.dbt/profiles.yml at a service account key file, so both of these files need to be downloaded to the container.

Azure has a nice feature named Secure Files where you can upload files with secrets in them to a secure area (separate from your versioned repo) that you can download into the container via a task in the pipeline. The files are deleted from the container after the pipeline runs.

  - task: DownloadSecureFile@1
    name: GettingServiceAccount
    displayName: 'Downloading Service Account'
    inputs:
      secureFile: <name of service account creds>

  - task: DownloadSecureFile@1
    name: SettingProfile
    displayName: 'Downloading Profile'
    inputs:
      secureFile: 'profiles.yml'

Cache Dependencies

Each time the Azure Pipelines job runs, all of dbt's Python dependencies are downloaded from PyPI and added to the project. Given that the dependencies don't change very often, it's worth caching the packages for reuse across jobs - you pay for the pipeline by the minute, after all.

  - task: Cache@2
    inputs:
      key: 'python | "$(Agent.OS)" | requirements.txt'
      restoreKeys: |
        python | "$(Agent.OS)"
        python
      path: $(PIP_CACHE_DIR)
    displayName: Cache pip packages

The task uses the package definitions in your requirements.txt file as the cache key and looks for cached packages in the PIP_CACHE_DIR specified earlier. If no cache exists yet, the job still completes: dependencies are installed by the normal pip install later, and the cache is saved in a post-job step for the next run.
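
The cache key refers to a requirements.txt at the root of the repo; a minimal sketch (the version pin is illustrative):

```text
# requirements.txt (version pin illustrative)
dbt==0.19.1
```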

Set Up dbt

To make sure that dbt runs correctly and has access to the warehouse, I create symlinks to the downloaded secure files in the ~/.dbt folder. This means that dbt has access while ensuring that the secure files do get deleted after each run.

  - script: |
      mkdir -p ~/.dbt
      echo Installing $(SettingProfile.secureFilePath) to ~/.dbt...
      ln -s $(SettingProfile.secureFilePath) ~/.dbt/profiles.yml
      echo Installing $(GettingServiceAccount.secureFilePath) to ~/.dbt...
      ln -s $(GettingServiceAccount.secureFilePath) ~/.dbt/<name of service account creds>
    displayName: Installing Profile

Create and Test the dbt models

Finally, I install dbt, run the models and test them. If any test fails, the build fails and you are alerted to the problem; note that the models will already have been built by that point, so it's important to act on test failures quickly. It's also important to run tests on your dbt models throughout the software development lifecycle.

  - script: |
      # pip installs to the container user's local bin, so add it to PATH
      export PATH=$PATH:/home/vsts_azpcontainer/.local/bin
      pip install -r requirements.txt
      cd $(Pipeline.Workspace)/<repo name>
      dbt run
      dbt test
    displayName: 'Run and Test'

Of course, this pipeline is only a start. I’d like to improve it to only run changed models, and also regenerate and deploy the docs artifacts to a web server.
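
As a sketch, those improvements could look something like the step below, assuming a manifest.json from the previous production run has been saved somewhere the pipeline can read it (the ./prod-artifacts folder name is illustrative; the state:modified selector and --state flag come from dbt's state selection feature):

```yaml
  # hypothetical follow-up step: build only changed models, then docs
  - script: |
      cd $(Pipeline.Workspace)/<repo name>
      # state:modified compares the project against the saved manifest
      dbt run --models state:modified --state ./prod-artifacts
      dbt docs generate
    displayName: 'Run Changed Models and Generate Docs'
```

The generated docs land in the project's target folder, from where a further step could copy them to a web server.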