We hear more about tools than about using them the right way
There have been many debates lately about unbundling or rebundling, which led to arguments over who will win the market for any given piece of the data stack. But those debates hide an important fact: tools are just tools. In the end, how you use them is as important as (if not more important than) your product and vendor choice.
Why do we use tools
The main purpose of data tools is to help solve business problems, make better decisions, and improve existing processes through data-driven automation. But many teams have started adopting new tools only to keep up with current trends, such as the modern data stack.
Especially in this day and age, when we're bombarded with a gazillion new technologies every day, it's useful to develop a decision framework for selecting and adopting new data tools intentionally.
Choose your stack wisely
It's good practice to start with the problem that needs to be solved, and with the business process and end users the tool must support. Any addition to the stack should be adopted only if:
- It solves a real problem,
- It fits into how you work and into the needs of your organization,
- It works well with your engineering workflow or business process and ideally enhances them.
Once that's clarified, we need decision and evaluation criteria. Only then can we proceed with tool evaluation, selection, and adoption.
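One way to make such criteria explicit is a simple weighted scoring sheet. The snippet below is only a minimal sketch: the criterion names mirror the three conditions listed above, while the weights, the 0-5 scores, the tool names, and the "must solve a real problem" cutoff are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights for the three adoption criteria; tune them to your organization.
WEIGHTS = {
    "solves_real_problem": 0.5,
    "fits_organization": 0.3,
    "fits_workflow": 0.2,
}


@dataclass
class Candidate:
    name: str
    scores: dict  # criterion name -> score from 0 (poor) to 5 (excellent)


def weighted_score(candidate):
    """Combine the per-criterion scores into a single weighted number."""
    return sum(WEIGHTS[criterion] * candidate.scores[criterion] for criterion in WEIGHTS)


def shortlist(candidates, must_have="solves_real_problem", minimum=3):
    """Drop tools that fail the must-have criterion, then rank the rest by score."""
    eligible = [c for c in candidates if c.scores[must_have] >= minimum]
    return sorted(eligible, key=weighted_score, reverse=True)


# Made-up candidates and scores, purely for illustration.
candidates = [
    Candidate("tool_a", {"solves_real_problem": 5, "fits_organization": 2, "fits_workflow": 3}),
    Candidate("tool_b", {"solves_real_problem": 2, "fits_organization": 5, "fits_workflow": 5}),
    Candidate("tool_c", {"solves_real_problem": 4, "fits_organization": 4, "fits_workflow": 4}),
]

ranked = shortlist(candidates)
for c in ranked:
    print(f"{c.name}: {weighted_score(c):.2f}")
```

In this toy example, tool_b is excluded even though it fits the organization and workflow well, because it does not solve a real problem. The numbers matter less than the habit: writing the criteria down before looking at vendors keeps the evaluation tied to the problem rather than to the trend.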
Why is such a structured approach important for selecting tools for your data stack? Because otherwise, you may end up with accidental complexity. Your data platform may turn into a disjointed and hard-to-maintain set of self-hosted tools, or even worse, a bunch of fully disconnected SaaS products that cost you a fortune and leave you with more problems than before.
The role of engineers in the data landscape
Once we have selected our stack, we can start engineering. With the rise of the modern data stack and new open-source technologies, we have started to see new job titles emerge, such as ML engineers, analytics engineers, or data platform engineers. But what does it mean to be a good engineer and data practitioner?
It’s not only about coding, optimizing query performance, or learning how to cherry-pick Git commits. An engineer is someone who understands how to approach a business problem, identify its root cause, and then select tools and design solutions that won’t lead to regrets in the future. In order to do it right, we need both great tools and great engineers.
Good engineering design vs. tool selection
Does good engineering win over careful tool selection? As always, the truth lies somewhere in between.
In the end, how you use any given tool determines how successful you can be with it.
To illustrate this point, let's look at tools that help ensure reliable dataflow. Products in this domain have started to diverge into two main categories. The first is orchestrators, which rebundle the data stack and strive to control everything in it. The second is the coordination plane, which focuses on observing the state of various tools (and even other orchestrators) and lets you take action based on the observed state.
Traditional orchestrators that rebundle the stack are prone to vendor lock-in, whereas a coordination plane helps you build a data platform that is more adaptable to change and better suited to serve you long-term. But not everything is black and white: even with a perfect data tool, you can still end up in a bad place if you're not careful about how you build your system on top of that platform.
On that premise, does good engineering design win over tool selection? To some extent, yes. However, choosing the wrong tool forces you into one-way-door decisions that lock you in, regardless of how well you design your system on top of it.
Control plane: putting all your (dataflow) eggs in one basket
What does it mean to use a single orchestrator as a control plane for observability, lineage, and dataflow execution? It means that to get a reliable lineage graph and gain a full picture of the state of your data platform, you must use this platform for everything, everywhere, all at once.
On a practical level, it means that triggering your dbt models from dbt Cloud rather than from the central orchestrator breaks your workflow lineage and skews your metadata picture. The orchestrator loses track of the current state, and your data platform is effectively broken only because of where that run was triggered.
The key distinction is choice, or rather whether the tool you select forces you to make one
It's all about choice and the right balance of constraints and flexibility. A central control plane rebundling the entire data stack is extremely constraining, but this might be desirable if you are a small team that favors such constraints over more flexibility.
In contrast, a coordination plane passively observing your data stack and collecting metadata about it regardless of which tool triggered your dataflow gives you the power to choose how you work. With that approach, you can focus on data and building reliable engineering processes rather than on tools and how to fight against them.
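The difference between the two views can be illustrated with a toy model. This is only a sketch under loud assumptions: the `RunEvent` record, the job names, and the `triggered_by` labels are hypothetical and do not correspond to the API of any real orchestrator or to dbt Cloud. It only shows how a view limited to self-triggered runs differs from one that records every observed run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvent:
    job: str
    triggered_by: str  # hypothetical label for the tool that started the run
    status: str


# Made-up run history: one dbt run was started from dbt Cloud directly.
events = [
    RunEvent("ingest_orders", "central_orchestrator", "success"),
    RunEvent("dbt_models", "dbt_cloud", "success"),
    RunEvent("export_report", "central_orchestrator", "failed"),
]


def control_plane_view(events, orchestrator="central_orchestrator"):
    """State as seen by an orchestrator that only knows the runs it started itself."""
    return {e.job: e.status for e in events if e.triggered_by == orchestrator}


def coordination_plane_view(events):
    """State built from every observed run, regardless of which tool triggered it."""
    return {e.job: e.status for e in events}


def failed_jobs(view):
    """Jobs you might want to act on, based on the observed state."""
    return sorted(job for job, status in view.items() if status == "failed")


print(control_plane_view(events))       # the dbt run is missing from this picture
print(coordination_plane_view(events))  # the dbt run is included
print(failed_jobs(coordination_plane_view(events)))
```

In the first view, the dbt run simply does not exist, which is the lineage gap described earlier. In the second, the state is complete no matter where the run was started, so actions such as alerting on failed jobs can rely on it.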
Thanks for reading!
Does the Modern Data Stack Value the “Stack” Over “Data”? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.