
Debunking common myths about data lake architectures, data lake definitions, and data lake analytics

The purpose of this post is to frame data lakes and provide context around how they fit within enterprise data strategies. This has historically been confusing and opaque given conflicting advice from consultants and vendors.

Unfortunately, confusing and misleading advice leads people to ask questions in the context of technology platforms rather than strategy or business outcomes. A technology-driven decision-making process is an attempt to make a subjective conversation more objective. For example, people pursue questions like “What is Azure data lake?”, “What is an Amazon data lake?”, or “What is the best data lake software?” Maybe there is a pushy vendor promoting buzzword-laden HIPAA-compliant data lakes in a healthcare context.

As a result, the conversation around data lakes can get bewildering for those trying to figure out how a lake can add value to their data and insights efforts.

Overview

Before jumping into “How to build a data lake?” or “How to create data lakes for the enterprise?”, it helps to understand the “myths” embedded in much of the strategy, architecture, and implementation advice. Breaking down these myths will help you understand why data lakes fail, as well as various data lake challenges, some of which are the result of vendors and consultants providing a path that may be contrary to data lake best practices.

Let’s get started…

Myth #1: The data lake vs data warehouse

It is not uncommon to find advice that presents a binary choice between a data warehouse or a data lake. This is a false choice.

Reality Check — The difference between a data warehouse and a data lake

The data lake vs data warehouse framing sets the conversation up incorrectly. Similarly, when people ask “are data warehouses obsolete?”, it gives the impression that it is time to toss out your EDW. The framing of both these questions is leading you astray.

Normally, this occurs when a company has some form of technical investment in a particular design pattern. For example, they will claim that certain operations can, or must, occur in a warehouse, and then frame those operations as a limitation and risk associated with lakes.

What is an example of a data lake “limitation” vendors will promote?

A vendor will say lakes are limited because they can’t scale compute resources on demand the way a warehouse can. This is true but misleading. It is like complaining that Tom Brady has never hit a home run in his professional football career, so he must be a terrible athlete. Since Tom Brady is a football player, would you expect him to be dropping dingers over the Green Monster at Fenway (well, maybe the Pesky Pole)? No.

The data lake vs data warehouse debate is a false choice. So why are vendors and consultants applying data warehouse compute concepts to a data lake?

Framing the fact that a data lake has no compute resources of its own as a limitation is FUD. Someone is likely trying to promote a warehouse as the panacea for your data. Data lakes don’t scale compute resources because there are no compute resources in the lake to scale.

Separating compute from storage is a core abstraction a lake architecture embraces. This is why solutions like Redshift Spectrum, Presto, and Athena exist. Let’s take Amazon Athena as an example. Athena is not a warehouse, but an on-demand query engine based on Facebook’s Presto. As a service, Athena (and Presto) provides on-demand “compute” resources to query data in your lake. Amazon Redshift Spectrum, like Athena, can query data in your lake separately from the resources in a Redshift cluster.

Lakes, by design, are supposed to be well abstracted from the services that consume the information residing within them. Regardless of whether you have an Amazon data lake (AWS data lake), Oracle data lake, Azure data lake, or BigQuery data lake on Google Cloud, the model is similar. Contents of the lake can be accessed with a query engine like Athena or a “warehouse” like Redshift, BigQuery, or Snowflake. These services provide the compute resources, not the lake.
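To make that separation concrete, here is a minimal sketch (in Python with boto3) of querying data that sits in a lake using Athena’s on-demand compute. The bucket, database, and table names are hypothetical placeholders, and it assumes a table has already been defined over the lake’s objects in the Glue catalog.

    import time
    import boto3

    # The lake is just objects in S3; Athena supplies the compute on demand.
    # Bucket, database, and table names are hypothetical placeholders.
    athena = boto3.client("athena", region_name="us-east-1")

    run = athena.start_query_execution(
        QueryString="SELECT order_id, order_total FROM web_orders LIMIT 10",
        QueryExecutionContext={"Database": "my_lake_db"},
        ResultConfiguration={"OutputLocation": "s3://my-lake-query-results/athena/"},
    )

    # Poll until the query finishes; no cluster was provisioned or resized.
    while True:
        status = athena.get_query_execution(
            QueryExecutionId=run["QueryExecutionId"]
        )["QueryExecution"]["Status"]["State"]
        if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if status == "SUCCEEDED":
        results = athena.get_query_results(QueryExecutionId=run["QueryExecutionId"])
        for row in results["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])

Notice that nothing in the lake itself changed to run this; the compute came entirely from the query service.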

Rather than have the dialogue be lake vs warehouse, the right discussion for most enterprises is lake AND warehouse. When someone gives you an argument that you need to choose one or the other, they likely have an agenda that aligns with their product offering or commercial partnership.

Myth #2: Your data warehouse is a data lake

This line of thinking suggests you forgo a data lake and dump everything into a warehouse.

Reality Check — Defining an effective data lake

Yes, there are vendors and consultants advocating the “warehouse as a data lake” model.

Various vendors and consultants will suggest that schemas (or other physical and logical constructs) be used to denote the lifecycle of data from “raw” to other states in a warehouse. Any data maturation the business needs is then done directly within the confines of the warehouse.

Traditionally, the role of a data warehouse reflects the “settled truth” for a business. A settled truth reflects a collection of agreed-upon facts about the enterprise. For example, a settled truth may provide authoritative facts about revenue, orders, “best customers”, and a host of other domains.

However, in the “dump everything in the warehouse” model, the warehouse holds everything, which includes ephemeral and volatile raw data.

The suggested repackaging of all raw data into a warehouse looks more like an operational data store (ODS) or a data mart than a warehouse. Can you dump everything into a warehouse? Yes. But just because you can do something technically does not make it the right architecture.

The suggestion to put everything in a warehouse assumes that a “truth” is simply a function of the logical organization of the data. Who defines that logical organization and disseminates it within an enterprise is glossed over, not understood, or at worst, ignored. This approach is almost a textbook definition of the data swamp attributed to data lakes, except someone is advocating that your swamp occur in a warehouse.

This model locks you into a warehouse technology and an operational model. It embraces a mindset that now requires you to dump everything into the warehouse. If you like vendor lock-in, artificial constraints, reduced data literacy, and technical debt, then this approach is certainly for you.

Done right, a data lake can minimize technical debt while accelerating an enterprise team’s consumption of data. Given the accelerating rate of change in the data warehouse, query engine, and data analytics market, minimizing risk and technical debt should be a core part of your strategy.

Myth #3: Data lakes equal Hadoop

You will often find discussions and examples where lakes are synonymous with Hadoop or Hadoop-related vendor technology stacks. This gives the impression that a data lake is tightly bound to Hadoop-specific technologies.

Reality Check — Hadoop is not a data lake

While Hadoop technologies may be used in the formation and operation of some lakes, they do not reflect the foundational strategy and architecture lakes are meant to support.

It is important to recognize that a lake first and foremost reflects a strategy and architecture, not a technology. Pentaho co-founder and CTO James Dixon, who coined the term data lake, said:

This situation is similar to the way that old school business intelligence and analytic applications were built. End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a data mart. This method works fine until you have a new question to ask. The Data Lake approach solves this problem. You store all of the data in a Data Lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the Data Lake for new questions.

Hadoop, like any other technology, supports the enablement of a strategy and architecture. If you built a lake today, you would have a lot of non-Hadoop choices, even if those choices leverage Hadoop-related technologies under the covers. For example, your lake may support a warehouse solution such as Snowflake or query-in-place with Amazon Athena, Presto, Redshift Spectrum, or BigQuery, all at the same time.
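As a hedged illustration of that abstraction, the sketch below (again Python with boto3, with hypothetical bucket, database, and table names) registers Parquet files already sitting in the lake as an external table using Athena DDL. Once the table exists in the catalog, engines such as Athena or Redshift Spectrum can query the same objects without copying them out of the lake, and no Hadoop cluster is involved.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # The DDL writes only metadata to the catalog; the data stays in the lake.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS my_lake_db.web_events (
        event_id   string,
        user_id    string,
        event_type string,
        event_ts   timestamp
    )
    PARTITIONED BY (event_date string)
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/curated/web_events/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "my_lake_db"},
        ResultConfiguration={"OutputLocation": "s3://my-lake-query-results/ddl/"},
    )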

Don’t think a lake is tightly binding you to Hadoop. If you follow a well-abstracted lake architecture, then you are minimizing the risk of artificially limiting the opportunity a lake represents and the benefits to the broader enterprise ecosystem it is meant to support.

Myth #4: Data lakes are just “storage”

In this scenario, a lake is just someplace to store all your stuff, not unlike a super big hard drive on your laptop where you drop all your files into “Untitled folder”. Just dump it in and declare victory.

Reality Check — A lake is not just a place to store stuff

This can get complicated when vendors frame data lakes to be synonymous with storage. For example, even Microsoft packages its product as “Azure data lake storage” or “Azure data lake storage gen2”. Lakes do provide storage, but a characterization that they are “just storage” is off the mark.

As we stated previously, your lake should be viewed as a strategic element of a broader enterprise data stack. This includes contributing to a settled truth in a downstream system like a data warehouse or supporting data consumption in tools like Tableau or Oracle ETL.

As such, lakes are not just storage and are not mutually exclusive of a warehouse or other aspects of a data and analytics stack. Actually, quite the opposite. Most lakes in nature are dynamic ecosystems, not static, closed systems. One job a lake can have is being an active source of data that fuels a warehouse. However, the opposite is also true where certain warehouse workloads can be offloaded to a lake to reduce costs and improve efficiencies.

Packaged and structured properly, the lake can deliver downstream value to those consuming data from it, including a warehouse. For example, lakes can play an active role in supporting the settled truth mission of a warehouse.

We have a customer that uses a lake to undertake quality control analysis for their tagging across dozens of web sites and third-party properties. This allows them to identify possible gaps or implementation errors by the different teams responsible for that work. We also have another customer that uses a lake to reconcile potentially inaccurate or duplicate multi-channel orders across various internal, third-party and partner systems, prior to delivery to an EDW.

Both of these examples highlight that the lake plays a dynamic role in ensuring that the downstream settled truth meets enterprise expectations and norms.

As the folks at McKinsey said, “data lakes ensure flexibility not just within technology stacks but also within business capabilities”. The data lake as a service model is about delivering business value, not just storage. We agree.

Myth #5: Data lakes are only for “raw” data

Linked to Myth #2, the “dump everything into the warehouse” argument says that data lakes don’t add value because only raw data resides in a data lake. The argument goes something like this: “If data lakes only deal with raw data, then don’t bother with a lake; just dump all your data, raw or processed, into a warehouse.”

Reality Check — Defining an effective data lake strategy and architecture

As we stated previously, this contradicts the fundamental premise that a warehouse is meant to reflect the “settled truth” about the business. A better historical comparison is not between a warehouse and a data lake, but between an ODS and a lake.

Historically, it was an ODS, not a warehouse, that was ingesting rough and volatile raw data from upstream data sources. An ODS typically held a narrow time horizon of data, maybe 90 days’ worth. The ODS also may have had a narrower focus, say for a particular domain of data. A lake, on the other hand, will often have no time constraints for data retention and be broader in scope.

So are lakes just for raw data? No.

Data lakes, by design, should have some level of curation for data ingress (i.e., what is coming into the lake). If you have no sense of data ingress patterns into your lake, you likely have problems elsewhere in your technology stack. This is also true for a data warehouse or any data system. Garbage in, garbage out.

Best practices for lakes embrace a model where you have a landing zone to optimize (or curate), however minimally, for downstream consumption. Consumption might be within an analytics tool like Tableau or Power BI, but it also might be an application that handles loading to a warehouse such as Snowflake, Redshift or BigQuery.
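As a minimal sketch of that landing-zone pattern (in Python, with hypothetical bucket prefixes and column names; reading and writing S3 paths with pandas assumes the s3fs and pyarrow packages are installed), the snippet below picks up raw JSON from a landing prefix, applies light curation, and writes partitioned Parquet to a curated prefix that downstream tools or a warehouse loader can consume.

    import pandas as pd

    # Hypothetical lake prefixes for a single daily drop of raw order data.
    landing = "s3://my-data-lake/landing/orders/2024-01-15/orders.json"
    curated = "s3://my-data-lake/curated/orders/"

    # Raw, line-delimited JSON exactly as it arrived from the source system.
    df = pd.read_json(landing, lines=True)

    # Minimal curation: normalize column names and derive a partition column.
    df.columns = [c.lower().strip() for c in df.columns]
    df["order_date"] = pd.to_datetime(df["ordered_at"]).dt.date.astype(str)

    # Columnar, partitioned output is far cheaper for Athena, Spectrum, or
    # BigQuery-style engines to scan than the raw JSON in the landing zone.
    df.to_parquet(curated, partition_cols=["order_date"], index=False)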

We worked with a customer that would send Adobe event data to an AWS data lake to support an enterprise Oracle Cloud environment. Why AWS to Oracle? It was the most efficient and cost-effective data consumption pattern for the Oracle BI environment, especially considering the agility and economics of using an AWS data lake and Athena as the on-demand query service.

By maximizing the effectiveness and efficiency of data in the lake, you are minimizing the downstream costs of processing experienced by data consumers.


Myth #6: Data lakes are only for “big” data

If you spend any time reading materials on data lakes, you would think there is only one type of lake and it would look like the Caspian Sea (it’s a lake despite the “sea” in the name). People describe data lakes as massive, all-encompassing entities designed to hold all knowledge. There is only an “enterprise big data lake”, or a lake is synonymous with big data architecture.

Reality Check — Defining an effective data lake strategy and architecture

Unfortunately, the “big data” angle gives the impression that lakes are only for “Caspian” scale data endeavors. This certainly makes the use of data lakes intimidating. As a result, describing a lake in such massive terms makes lakes seem inaccessible to those who could benefit from them.

Data lakes, like lakes found in nature, come in all different shapes and sizes. Each lake has a natural state, often reflecting ecosystems of data, just like lakes in nature reflect ecosystems of fish, birds, or other organisms.

Here are a few data lake examples:

  • The Great “Caspian” Lakes: Just as the Caspian is a large body of water, these lakes are broad repositories of data. This broad collection of diverse data reflects information from across the enterprise.
  • Temporary “Ephemeral” Lakes: Just as deserts can have small, temporary lakes, an Ephemeral data lake exists for a short period of time. They may be used for a project, pilot, PoC, or a point solution, and they are turned off as quickly as they were turned on.
  • Domain “Project” Lakes: These types of data lakes, like Ephemeral lakes, are often focused on specific knowledge domains. However, unlike the Ephemeral lake, this lake will persist over time. These lakes may also be “shallow”, meaning they may be focused on a narrow domain of data such as media, social, web analytics, email, or similar data sources. We have a customer that has described their project as the “Tableau data lake”.

By design, a lake should embrace an abstraction that minimizes risk and affords you greater flexibility. Also, a lake should be structured for easy consumption, independent of size. This ensures a lake used by a data scientist, a business user, or Python code has an environment structured for easy data consumption rather than an artificial size delineation.

Whether your use case is machine learning, visualization, reporting, feeding a warehouse, or feeding a data mart, thinking differently about size may unlock new opportunities to employ a lake.

Myth #7: Data lakes offer little security

Data lakes are an insecure collection of data objects available to anyone in an organization that wants to take a dip and leave with what they want.

Reality Check — Security is a choice; make sure it is one that you consider

There is some truth to this in the sense that people rely on implicit technology solutions (i.e., automatic AWS S3 AES object encryption) rather than explicitly having an architecture and downstream use cases that govern security. This can lead to security gaps. However, this can be said of many systems and is not unique to a data lake per se. The notion that data lakes are inherently insecure is not accurate.

Security can and should be a first-class citizen in the context of your data lake. Here are a few areas of consideration:

  • Access: It is not uncommon for lakes to have well-defined access policies to the underlying data. Within AWS, this would be defined in your IAM policies for S3 and related services (see the policy sketch after this list). In addition to AWS, Microsoft has an Azure data lake architecture that describes similar methods for security policies.
  • Tools: The tools and systems that consume data from a lake will also offer a level of security. For example, there can be table- and column-level access control depending on the query engine. Data consumption tools such as Tableau or Power BI will also set access controls on the data in the lake.
  • Encryption: Lakes will often expect (or enforce) encryption in transit and at rest.
  • Partitioning: Lakes can also have a level of logical and physical partitioning that further facilitates a security strategy. For example, teams may ETL data from a “raw” landing zone to another location in the lake so they can anonymize sensitive data for downstream consumption.
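To make the access and encryption points above concrete, here is a minimal, hypothetical sketch of a bucket policy applied with boto3: it rejects unencrypted uploads and grants a specific analyst role read access to the curated prefix. The bucket, account, and role names are placeholders, and any real policy should be reviewed against your own security requirements.

    import json
    import boto3

    BUCKET = "my-data-lake"  # hypothetical bucket name

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any object written without server-side encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{BUCKET}/*",
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            },
            {
                # Grant a specific analyst role read access to the curated zone.
                "Sid": "AllowCuratedReads",
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::123456789012:role/lake-analyst"},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{BUCKET}/curated/*",
            },
        ],
    }

    boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))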

One could argue the merits of these different strategies, but to say lakes are intrinsically insecure would be incorrect.

Myth #8: Data lakes equal data swamps

A critique of data lakes is that they devolve into data swamps because they are just storage, lack curation, have no management, no lifecycle/retention policies, and no metadata.

Reality Check — Defining an effective data lake strategy and architecture

In the extreme, there is a level of truth to this. If you treat a data lake like a generic “Untitled folder” on your laptop where you dump files, then yes, you will likely have a swamp (see Myth #4). So this is a risk. However, anyone going down the path of dumping files this way is somewhat uninterested in being successful.

So what are the true data swamps? The ones created through design, not indifference.

The greater threat to a data lake is not a lack of curation, management, lifecycle, and metadata, but the ecosystem of tools, roles, responsibilities, and systems meant to prevent those gaps. A lake becomes a swamp not just because of “dumping files”, but because of the crushing weight of ancillary people, processes, and technologies placed in and around it. If you thought your enterprise data warehouse was a slog, your data lake will start to look very familiar.

Part of the beauty of a lake is a level of simplicity, agility, and flexibility. When significant business logic and process occurs in-lake, you run the risk of creating a solution that lacks simplicity, is not responsive to change and is overly rigid by design. This is the data swamp you need to be very wary of. It is expensive, time-consuming and will fail to meet anyone’s expectations. Sound familiar?

To those planning or those who have already deployed a data lake: be cautious of role and feature creep. It is not uncommon to see vendors (pushed by customers?) pull forward features and functionality found in traditional warehouse or ETL products as an “in-lake” capability. Yes, it is technically possible for you to perform complex in-lake data processing. However, you may already have workflows, tools, people, and technology that perform these functions outside of the lake.

Not all data activities may be appropriate given your context. Think long and hard about the risks of cascading complexity these choices represent. Be cognizant if a current or planned data lake starts to look more like an amalgamation of traditional ETL tools and data warehouses. If you have suffered through an overly complex EDW effort, this will be easy to spot.

Summary

Data lakes fit a familiar technology pattern: a new concept emerges, and it is adopted by brave pioneers as well as technical charlatans. Over time, clarity emerges on success patterns. This clarity is born from hard-fought lessons, largely due to finding success through failure.

This results in the refinement of terminology, best practices, and investments in building better platforms. Ever-changing economics, architecture, and optimization of business practices allow teams to mainstream lakes into their enterprise data stacks in a manner that fits their use case.

It is unfortunate that critiques of data lake initiatives devolve into broad statements of them “not being successful” or “data lakes equal swamps”, into arguments that lakes are too tightly linked to specific technologies like Hadoop, or into complaints that the semantic definition of “what is a data lake” is overly opaque and ever-changing.

Criticism is a necessary part of growth with any technology.

However, a key to growth is taking a step back to develop perspective. In doing so, you will find these criticisms are not unique to data lakes. These critiques can apply to just about any technology endeavor generally and data projects specifically. For example, the term “data warehouse” suffers from the same opaque and changing definition as lakes (see Myth #2). Search Google for “failed data warehouse” and you will find stories about projects that were not successful. Does this mean that we should forgo the phrase “data warehouse” or stop the pursuit of those projects? No.

All too often, the very same consultants or companies that deride lakes are the ones offering products and services as the magic elixir to help implement their vision and best practices for lakes. If a consultant or vendor does not believe in the lake model, why engage them for a solution they don’t believe in? Entrusting this work to these very same consultants or vendors may very well be a reason lake initiatives are not successful.

Getting Started

Stop the purchase order for that shiny Hortonworks data lake solution. Put a hold on the careers page openings for the software development engineers, account managers, solutions architects, and support engineers meant to build your enterprise data lake effort. Start small and be agile.

Here are a few tips as you think about how to get started with a lake:

  1. Focus: Seek opportunities where you can deploy an “Ephemeral” or “Project” type data lake. This will help you reduce risk and overcome technical and organizational challenges so your team can build confidence with lakes.
  2. Passion: Make sure you have an “evangelist” or “advocate” internally, someone who is passionate about the solution and its adoption within the company. If you don’t have a person or team who is passionate about the solution, you will find your data lake is just as productive as a gym membership 4 weeks after the New Year’s resolution that prompted it.
  3. Simple: Embrace simplicity and agility, and put people, process, and technology choices through this lens. A lack of complexity should not be seen as a deficiency but as a byproduct of thoughtful design.
  4. Narrow: Keep the scope narrow and well defined by limiting your lake to understood data, say exports from ERP, CRM, point-of-sale, marketing, or advertising systems. Data literacy at this stage will help you understand workflow around data structure, ingest, governance, quality, and testing.
  5. Experiment: Pair your lake with a modern BI and analytics tool like Tableau, Power BI, Amazon QuickSight, or Looker. This will give non-technical users an opportunity to experiment and explore data access via the lake. It also allows you to engage a different user base that can assess performance bottlenecks, discover opportunities for improvement, possible linkages to any existing EDW systems (or other data systems), and additional candidate data sources. It will also help you discover data lake tools that make sense for your team and where best to invest resources in any type of data lake automation.

Being a successful early data lake adopter means taking a business-value approach rather than a technology one. This means you can worry less about a sexy new offering for the Cloudera data lake, running an AWS Lake Formation workflow, Gartner Magic Quadrant charts, or some data lake analytics solution the Azure team wants you to purchase.

Focusing on business value affords you an opportunity to frame your lake in the context of a holistic data and analytics strategy. This increases velocity and helps you achieve your data lake goals and measure progress in business performance.

Kickstart your data lake with code-free, fully automated, and zero administration Amazon Redshift Spectrum or Amazon Athena services.

Want to discuss data lake strategy, architecture, and pilots? Set up a call with our team of data experts.

Visit us at www.openbridge.com to learn how we help solve data challenges, big and small.

