Why should we “refuse to click buttons”?
Setting up a Modern Data Stack at a scaling company is challenging: it grows and becomes complex quickly. To stay in control, we can take inspiration from DevOps practice.
It was 2010. I was working on a project with a small team of engineers. After weeks of coding, we decided it was time to launch.
We needed to start our infrastructure, so we signed up for an AWS account, logged into the UI, and went to start our first server. We navigated the AWS console and got it up and running. We copied the IP address, connected to the server via SSH, updated the packages, installed dependencies, and deployed our code. We were in production. Yay!
Shortly after, we realized it would be good to have staging and development environments too. We returned to the AWS console and repeated the whole workflow a few more times.
It was a bit messy: a combination of running scripts, clicking in the UI, and other manual work.
Today the workflow feels very different. The DevOps discipline opened entirely new ways of working, enabling teams to build systems at a much larger scale.
Teams setting up their Modern Data Stack today are starting a very similar journey.
The Modern Data Stack initially feels simple: we run a few scripts and click here and there to set it up. But it becomes increasingly hard to maintain as the company scales. It turns out to be very similar to other software: more code, more configuration, more components, and more scripts. And since we have a much longer history of managing software, we should explore what we can learn from it.
DevOps changed how we build software
Compared with the previous approach, where developers wrote software that was later deployed and run by operations teams, we started to work in a more integrated fashion. Precisely as the name suggests, Development & Operations is a cross-functional way of working.
With this new discipline, the toolbox started to change. CI/CD emphasized automated builds, testing, and deployment, and Docker and Kubernetes entered the ecosystem, opening even more options to automate our workflows.
We also changed the culture. Teams started to take end-to-end ownership of their software. We no longer had a mindset of “throwing things over the wall” for operations to run.
Driven by the growing maturity of cloud provider APIs, infrastructure as code emerged. Tools like Terraform or Pulumi allowed teams to orchestrate API calls to create their infrastructure based on definitions in code.
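To see the contrast with the 2010 story, here is a minimal sketch of that “start a server” step expressed as code, using Pulumi’s Python SDK. The resource names and AMI ID are placeholders for illustration, not from a real project:

```python
# A minimal sketch of "start a server" as code, using Pulumi's Python SDK.
# The AMI ID is a placeholder; a real project would look one up per region.
import pulumi
import pulumi_aws as aws

# One definition, reused identically for dev, staging, and production stacks.
server = aws.ec2.Instance(
    "app-server",
    instance_type="t3.micro",
    ami="ami-0123456789abcdef0",  # placeholder AMI ID
    tags={"Name": "app-server"},
)

# No copying IP addresses out of the console; the stack exports it.
pulumi.export("public_ip", server.public_ip)
```

Running `pulumi up` creates the server, and creating a second stack for staging repeats the whole setup identically, with no console clicks.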
And the result?
Today, we manage complex systems made of many services and components, with perfectly aligned configurations and deployment workflows, and with no manual actions besides merging our PRs.
Tristan Handy labelled DevOps beautifully in his talk at the Future of Data conference[1]: refuse to click buttons.
This is what DevOps is about!
Instead of clicks, we use code. It’s faster and more reliable.
State of the Modern Data Stack
Deploying the Modern Data Stack today feels like deploying software a decade ago. It’s an awkward mix of scripts, clicks, and other manual work.
That is partly because many components are relatively new and lack the mature APIs that could support infrastructure-as-code workflows.
But there is even more to it.
To run the stack, we need to write a lot of SQL. In many tools, we write that code directly in the UI: an editor box with a “Save” button. I eventually have to write perfectly valid SQL code and click “Save.” And off to production we go.
Sometimes I get at least basic testing features, but I still find it scary. A lot goes through my mind at that point:
Will it work?
Is there a logic error?
Are all the definitions aligned with other parts of the stack?
Did I thoroughly test it?
It’s not easy to answer these questions.
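These questions become easier to answer when the SQL lives in a codebase next to tests. As a minimal sketch, assuming a made-up `orders` table and a made-up “high potential” rule, a transformation can be exercised against a tiny in-memory database before it ever reaches production:

```python
# A minimal sketch: testing a SQL transformation against an in-memory
# database before it ships. The schema, query, and threshold are all
# illustrative assumptions, not from any real stack.
import sqlite3

# The transformation under test: which customers count as "high potential".
HIGH_POTENTIAL_SQL = """
    SELECT customer_id
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) >= 1000
"""

def test_high_potential_customers():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice", 600.0), ("alice", 500.0), ("bob", 200.0)],
    )
    rows = conn.execute(HIGH_POTENTIAL_SQL).fetchall()
    # Only alice crosses the threshold; a logic error here fails the build.
    assert rows == [("alice",)]

if __name__ == "__main__":
    test_high_potential_customers()
    print("ok")
```

A check like this can run on every PR, so “Did I thoroughly test it?” has an answer before anything reaches production.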
SQL controls the stack
To better understand the risks, it is crucial to know where SQL lives in the Modern Data Stack and what it does.
Starting with ingestion, we use tools such as Fivetran, described as ELT (extract-load-transform) pipelines, to load data into the warehouse as is. Except that “as is” is not entirely true: Fivetran has transformations[2] that let you write SQL code to alter the data.
When data lands in the warehouse, we do the modeling. We use SQL to query raw tables, turning them into new views or tables.
But there is more. Data flows into various analytics tools. Looker, Metabase, Mode, and other BI tools are all full of SQL code too.
We also have reverse ETL: Census and Hightouch map tables into operational tools using SQL queries. Segment has a similar feature called SQL Traits[3]: write a SQL query and push data to various tools.
In many cases, this SQL controls which customers are considered high potential or likely to churn, who gets a notification to finish a necessary onboarding action, or which customers should get a discount. Maybe we’re exporting the data for the next board meeting too.
By changing the SQL code, we alter our data and, as a result, change decisions taken by people or downstream systems.
DevOps workflows for a more robust experience
We took a big step forward with the adoption of dbt[4], bringing SQL across the data warehouse under control. Other tools like Fivetran and Census integrate with dbt too, spreading the analytics engineering workflow across the stack[5][6].
On the other hand, I still expect that many components will, and should, keep a UI to write SQL. For many use cases, this is perfectly fine.
But I see an increasing number of places where the UI experience is not optimal. As soon as SQL starts to control critical processes, it deserves a more reliable workflow.
One option is for each tool to implement its own git-backed workflow; Looker does this, for example. I find the approach workable, but combined UI+git workflows usually feel a bit clunky to me, mostly because when I code data models and transformations, I want my own setup, not an input box on a webpage.
Another option would be to control SQL via API. It’s not possible today, at least not for all parts of the stack, but it could be.
I imagine a workflow where we write SQL code in a centrally managed codebase, with tests that verify it works. We then deploy the code to each component via API from a CI/CD pipeline: no UI with “Edit code” and “Save” buttons.
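As a sketch of what that deploy step could look like: no vendor exposes exactly this API today, so the endpoint, payload shape, and `DATA_TOOL_*` variables below are all assumptions for illustration:

```python
# A hypothetical CI/CD deploy step that pushes SQL to a tool via API.
# The endpoint, payload, and environment variables are assumptions;
# no current vendor exposes exactly this interface.
import os
import pathlib

import requests

API_URL = os.environ["DATA_TOOL_API_URL"]      # hypothetical base URL
API_TOKEN = os.environ["DATA_TOOL_API_TOKEN"]  # injected by the CI runner

def deploy(sql_dir: str = "queries") -> None:
    """Push every versioned SQL file to the tool instead of pasting it into a UI."""
    for path in sorted(pathlib.Path(sql_dir).glob("*.sql")):
        response = requests.put(
            f"{API_URL}/queries/{path.stem}",
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            json={"sql": path.read_text()},
        )
        response.raise_for_status()  # any rejected query fails the pipeline

if __name__ == "__main__":
    deploy()
```

The pipeline runs the tests first and calls a script like this only after a PR is merged, so “Save” becomes a reviewed commit.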
Combined with managing the whole stack through an infrastructure-as-code approach, we would get a fully replicable stack that reliably executes our data flows.
It’s the same as what we do with software today: we set up infrastructure as code and deploy software with (most of the time) reliable CI/CD pipelines. We could do the same for the data stack as soon as the APIs are available.
One way or another, the Modern Data Stack can learn a lot from DevOps practice. It would become more reliable.
We could indeed Refuse to click buttons.
External links
[1] Future Data 2021: Infrastructure-as-Code and the Modern Data Experience
[2] Fivetran documentation, Basic SQL transformations: https://fivetran.com/docs/transformations/basic-sql
[3] Segment documentation, Personas Audiences: https://segment.com/docs/personas/audiences/