What is data mesh?
The new data architecture
By Brian Laleye · April 5, 2022 · 21 min read
In this guide, you will learn the fundamentals of a new data architecture: data mesh.
In the era of digital transformation, organizations are facing change on an unprecedented scale. The volume of data and metrics we collect about our customers and the actions they take is growing exponentially.
This rate is more than ten times the aggregate growth in digital data volume from 2011 to 2016. Factors such as the growth in various data sources, business ecosystems, cross-industry interdependencies, deep tech changes, and new compliance regulations have increased complexity and unintended consequences.
According to IDC (International Data Corporation), 175 zettabytes of data will be generated in 2025, almost three times the amount generated in 2021 (61 zettabytes).
The volume of data created, captured, copied, and consumed worldwide from 2010 to 2025 (source: Statista)
Today, data typically sits across 11 platforms (an average of 2.6 per person) with no collective meaning or platform interoperability.
Being able to bring this information into a single report or dashboard allows organizations to make better decisions faster than ever before.
Companies can get a handle on this data complexity by developing a single view of their key metrics across multiple clouds or on-premise data stores.
However, most companies face a talent shortage in data analytics and IT management. According to a McKinsey survey, companies lack the talent they will need in the future. The skills gaps most in need of attention are data analytics (43% of answers) and IT management (26% of answers).
Graph from McKinsey & Company
This new concept, referred to as “data mesh”, helps create a coherent view of your enterprise data ecosystem by combining the collective intelligence of tech and operational teams inside the company with that of external stakeholders (clients, partners, providers).
Let’s dive right in.
Data mesh definition
The concept of the data mesh is relatively new: it was first proposed in 2019 by Zhamak Dehghani, a principal consultant at Thoughtworks. The idea has gained popularity in the enterprise community as an alternative to traditional approaches to data platforms.
Data mesh is a new data architecture that treats each data silo as a microservice and uses an event-driven approach to share data throughout the organization.
The adoption of microservices over the past decade has not only revolutionized software development but also created a new way of building and delivering software applications.
Microservices allow for better scalability, flexibility, and reliability than monolithic architectures can provide.
The same rationale can be applied to data architectures. If we can treat each data silo as a microservice, we can build a holistic enterprise architecture that integrates all of the different pieces of our organization’s data infrastructure.
The result is an architecture that is more flexible, sustainable, and scalable than traditional enterprise architectures built around a central data warehouse (also referred to as “hub and spoke” architectures).
Unlike a data lake or warehouse, which centralizes IT control over resources and access to those resources, each domain in the mesh is responsible for managing its own resources.
The model is designed specifically to prevent the emergence of any single point of failure.
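As a rough illustration of the event-driven idea described above, the sketch below uses a tiny in-memory event bus. It is not a real broker and not Dehghani's reference design. The `EventBus` class, the `orders.created` topic, and the sales and marketing domains are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    topic: str      # hypothetical topic name, e.g. "orders.created"
    payload: dict


class EventBus:
    """In-memory stand-in for a message broker; illustrative only."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, event: Event) -> None:
        # Deliver the event to every handler registered for its topic.
        for handler in self._subscribers[event.topic]:
            handler(event)


bus = EventBus()
marketing_view: List[dict] = []   # data owned by a hypothetical marketing domain

# The marketing domain reacts to events instead of querying the sales database.
bus.subscribe("orders.created", lambda e: marketing_view.append(e.payload))

# The sales domain owns order data and publishes changes as events.
bus.publish(Event("orders.created", {"order_id": 1, "amount": 59.9}))
```

Each domain only depends on the event contract, not on the other domain's storage. In a real deployment the bus would typically be a durable, distributed service.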
A challenger to the traditional data warehouse?
In the last decade, we have seen significant changes in how data is being collected and used.
Data-driven companies started to build internal data platforms, powered by emerging technologies like Hadoop, Apache Spark, and cloud-native services to capture new types of data and use it to create new products or increase customer engagement.
A data warehouse stores historical data in a central repository and uses an ETL process to “load” this data into the repository. The ETL process extracts the data from various source systems, transforms it into a consistent format, and then loads it into the warehouse’s target tables.
In contrast to traditional transactional databases, a data warehouse typically does not overwrite existing data when it performs an update; instead, it appends new records to existing ones.
This approach allows for historical analysis of changes over time and provides a full audit trail for end users.
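The extract, transform, and load steps and the append-only behavior described above can be sketched as follows. This is a minimal illustration in plain Python, assuming a hypothetical source with `customer` and `country` fields and a list standing in for a warehouse table. Real warehouses implement history tracking in their own ways.

```python
from datetime import date

# Hypothetical source rows; the field names are assumptions for illustration.
source_rows = [
    {"customer": " Alice ", "country": "fr"},
    {"customer": "Bob", "country": "DE"},
]

warehouse_table: list = []   # stand-in for a warehouse target table


def extract() -> list:
    # Copy each row so later changes in the source do not alter loaded data.
    return [dict(row) for row in source_rows]


def transform(rows: list, load_date: date) -> list:
    # Normalize to a consistent format and stamp each record with its load date.
    return [
        {
            "customer": r["customer"].strip(),
            "country": r["country"].upper(),
            "load_date": load_date.isoformat(),
        }
        for r in rows
    ]


def load(rows: list) -> None:
    # Append only: earlier versions of a record are never overwritten.
    warehouse_table.extend(rows)


load(transform(extract(), date(2022, 4, 1)))
source_rows[0]["country"] = "be"          # the source system changes a value
load(transform(extract(), date(2022, 4, 2)))

# Both versions of Alice's record remain available for historical analysis.
alice_history = [r["country"] for r in warehouse_table if r["customer"] == "Alice"]
```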
Data warehouses have been the main source of data for business intelligence (BI) tools and applications that provide insight into customer behavior, marketing performance, sales pipeline trends, financial projections, and more.
However, this traditional way of using BI tools is not enough anymore.
The need to use richer datasets with greater volumes and velocity has driven a new wave of innovation in data architecture with the emergence of modern data platforms such as data lakes or data mesh.
Companies that have implemented data mesh
Take Zalando, one of Europe’s largest e-commerce fashion retailers, as an example. The company has been using a central data warehouse for data analysis since its creation in 2008, but ultimately scalability became an issue.
The move to the cloud was a partial solution, but the bigger problem was to meet the growing demand for new uses of that data.
The data engineers were caught in the middle, responsible for both cleansing and transforming an ever-increasing volume of data while also meeting the demand for access.
In 2019, Zalando chose the data mesh architecture, switching to a strategy of spreading data throughout the company and giving ownership to the business group that created it.
Data scientists and engineers were tasked to collaborate with business leaders to structure the data so that it could be shared easily.
Data mesh enables both Tech and Ops teams to collaborate hand-in-hand towards mutual success.
Zalando isn’t the only well-known company to implement data mesh. The architecture is still recent, but other well-known companies have publicly shared their adoption of it.
Data mesh vs Data lake
A data lake is a storage repository that holds a large amount of raw data in its native format until it is needed.
While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags.
When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
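To make the flat layout concrete, here is a small sketch of a store in which every object gets a unique identifier and a set of metadata tags, and queries filter on those tags. The `put` and `find` helpers and the tag names are hypothetical. Actual data lakes rely on object storage and dedicated metadata services.

```python
import uuid

lake: dict = {}   # flat namespace: object id -> (raw payload, metadata tags)


def put(raw: bytes, **tags: str) -> str:
    """Store raw data as-is and return its unique identifier."""
    object_id = str(uuid.uuid4())
    lake[object_id] = (raw, tags)
    return object_id


def find(**wanted: str) -> list:
    """Return ids of objects whose tags match every requested key/value."""
    return [
        oid for oid, (_, tags) in lake.items()
        if all(tags.get(k) == v for k, v in wanted.items())
    ]


click_id = put(b'{"page": "/cart"}', source="web", kind="clickstream")
log_id = put(b"GET /cart 200", source="web", kind="log")
crm_id = put(b"id,name\n1,Alice", source="crm", kind="export")

# A business question about web traffic narrows the lake to a smaller set.
web_objects = find(source="web")
```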
Building on the concepts of big data and data science, the term “data lake” became popular around 2010 when companies began moving their application development processes to the cloud and adopting agile software development methodologies that required more flexibility than traditional enterprise architectures could provide.
Data lakes can be used for a variety of purposes, but one of the most common use cases has been to conduct analytics on data sets of all types.
For example, organizations can leverage their data lakes to perform statistical analyses, machine learning, predictive analytics, and more.
Besides, data lakes can support structured data (relational databases), semi-structured data (files), and unstructured data (social media posts, log files).
This enables IT teams to easily access and analyze different types of data for various business objectives.
Data lakes are often used for log analytics, application monitoring, clickstream analysis, fraud detection, and risk management.
A data lake is similar to a data mesh in that both hold data from many sources, but in a lake the datasets sit side by side in their own silos rather than being combined.
Data lakes let you pull specific data at will, which is useful if you know exactly what you want to do with the information, but they are less useful when you need to mine many different sources for information.
High-level Data LakeHouse data flow from Data LakeHouse — Paradigm of the Decade
Data mesh vs Data fabric
A data fabric is a logical, conceptual model that refers to a network of data sources, data storage, and processing components integrated into a unified whole that enables the movement and availability of data across the entire enterprise.
Here are the main benefits of data fabric:
- Flexibility as a business grows
- Scalability for handling larger volumes of data
- Vastly improved efficiency for managing massive amounts of data
- Seamless movement of data between multiple systems and applications
To sum up, the data fabric approach is to create a single centralized knowledge graph that has one canonical version of the truth.
A data mesh, on the other hand, takes a more decentralized approach by focusing on developing multiple data products around specific domains and use cases.
Data mesh doesn’t see the creation of a knowledge graph as a prerequisite to success.
Instead, it sees the knowledge graph as an outcome of developing multiple data products that are successful because they solve real problems for the business.
Data mesh benefits compared to current data architecture
To summarize, data mesh solves many of the challenges faced by current data architectures, including:
- Decentralization of data ownership, avoiding the need for a centralized team to govern and steward all enterprise data
- Reduced complexity and increased flexibility resulting from avoiding single-point-of-failure, rigid ETL pipelines
- Increased governance as a result of moving towards a distributed model where each service owns its shared subset of data
- Increased trust in data by ensuring that the data pipeline starts at the source, enabling consistency and repeatability
- Elimination of the duplication between various enterprise systems of record (ERP, CRM, HR, etc.)
- Increased agility due to an open-source architecture that relies on domain-specific tools
The four key principles of data mesh
According to Zhamak Dehghani, there are four principles that any data mesh implementation embodies to achieve the promise of scale while delivering quality and integrity guarantees needed to make data usable:
- Domain-oriented decentralized data ownership and architecture
- Data as a product
- Self-serve data infrastructure as a platform
- Federated computational governance
1. Domain Ownership
To support constant change and scalability, data mesh is built on decentralization and the distribution of responsibilities to those who are closest to the data.
There are two types of domain ownership in data mesh:
- Data domain: the data that a team owns and whose management and monetization it is responsible for.
- Data mesh domain: the scope of data ownership of the team within the broader mesh. This defines the scope of data pipelines managed by that team as well as the ability to consume data from other teams. A team’s data mesh domain can be equal to or smaller than its data domain.
The operational concept of a team owning a data mesh domain, rather than individual assets, creates a powerful paradigm for driving culture change across an organization.
Data is accessible to both business teams and data experts, which enables cross-disciplinary collaboration and knowledge exchange and results in high-quality datasets.
For example, in an e-commerce business, a traditional data platform is responsible for ingesting a variety of data such as ‘customers’, their ‘behavior on the platform’, the ‘articles they purchased’, as well as ‘brands’ and ‘discounts’.
Data is then cleansed, enriched, and transformed. Datasets are served to a variety of consumers with a diverse set of needs.
However, when the volume of data is large, there is a high probability that the platform delivers errors in near real time.
The 30,000 ft view of the monolithic data platform, from martinfowler.com
The solution is to move from centralized, domain-agnostic data ownership to decentralized, domain-oriented ownership, as Zalando, the European e-commerce fashion leader, has done.
2. Data as a product
With the data mesh architecture, data is treated as a product in its own right.
Data products are independent and decoupled from other systems, and they can be piped into other applications as needed. This means that data products can be created in response to data needs and that they can be tested and verified independently of other applications.
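One way to picture a data product that can be tested and verified independently is a small object that bundles the data, its owner, and its own quality checks. The sketch below is only an illustration of that idea. The `DataProduct` class, the `orders_daily` name, and the three checks are assumptions, not part of any standard.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataProduct:
    name: str
    owner: str                      # owning domain team
    rows: List[dict]
    checks: Dict[str, Callable[[List[dict]], bool]] = field(default_factory=dict)

    def verify(self) -> Dict[str, bool]:
        """Run every quality check against this product only."""
        return {label: check(self.rows) for label, check in self.checks.items()}


orders = DataProduct(
    name="orders_daily",
    owner="sales",
    rows=[{"order_id": 1, "amount": 59.9}, {"order_id": 2, "amount": 12.0}],
    checks={
        "not_empty": lambda rows: len(rows) > 0,
        "unique_ids": lambda rows: len({r["order_id"] for r in rows}) == len(rows),
        "positive_amounts": lambda rows: all(r["amount"] > 0 for r in rows),
    },
)

# The product is verified on its own, without involving any consuming application.
report = orders.verify()
```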
Insights are now available in real time and enable Operational Analytics.
The decision to build a machine learning model or not is left to the data team, who are best placed to make these judgments (and are free to do so without the pressure of an application team).
In addition, it allows the people building models to focus on getting them into production, rather than struggling with deployment.
By treating data as a product, we remove the need for business analysts and data scientists to collaborate closely with application developers to get their work into production.
Application developers need only to interact with working models and APIs that provide access to required datasets.
This leads to better alignment between the goals of the data team and those of the business and ops teams.
This shift also frees up application developers, who no longer need to worry about creating analytical capabilities. Their concern becomes building great user experiences that harness insights delivered by data products.
Zalando is both an online store and a marketplace. It provides its brand partners with data so they can take action, improve their offerings, and maximize sales.
3. Self-serve data infrastructure as a platform
The first step to creating a self-serve data platform is to create a catalog of data products, where each dataset or API is treated as a product with clear and consistent documentation about the business purpose, lineage, quality, and usage of the product.
Next, you should establish clear standards and guidelines for services that enable analysts and scientists to explore, query, transform, and share new datasets quickly without having to worry about how they will ultimately be used.
Once you have established a catalog of data products and services that enable exploration and experimentation with those products, you can bring together everything into one place so users can easily find what they need without having to navigate different environments and tools.
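The catalog described in these steps could start as something as simple as the sketch below, where each entry records the business purpose, lineage, and quality of a data product, and users search one place to find what they need. The `Catalog` and `CatalogEntry` classes and the sample entries are hypothetical. A production catalog would usually be a dedicated tool rather than hand-written code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    domain: str                 # owning team
    purpose: str                # business purpose
    lineage: Tuple[str, ...]    # upstream sources
    quality: str                # e.g. a freshness or completeness guarantee


class Catalog:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"{entry.name} is already registered")
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Return names of products whose name or purpose mentions the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if kw in e.name.lower() or kw in e.purpose.lower()
        )


catalog = Catalog()
catalog.register(CatalogEntry(
    "churn_scores", "marketing", "Weekly churn risk per customer",
    ("crm.customers", "web.sessions"), "refreshed weekly",
))
catalog.register(CatalogEntry(
    "invoices", "finance", "Issued invoices for revenue reporting",
    ("erp.billing",), "complete by 06:00 daily",
))

matches = catalog.search("churn")
```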
Taking the example of Zalando, each team, such as finance or marketing, is independent in terms of data and can use data on its own.
4. Federated computational governance
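Under federated computational governance, a federation of domain owners and platform representatives agrees on global rules (for example on interoperability, security, and compliance), and the platform applies them automatically. The mesh can also be used to manage and federate the ownership of algorithms for Artificial Intelligence/Machine Learning applications.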
The focus of federated computation is to allow stakeholders to share algorithms without sharing their data sets.
In this model, stakeholders have ownership over their data and can compute that data by accessing algorithms from other “mesh nodes”.
This allows organizations to have full control over their data and gain value from others’ algorithms.
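The idea of sharing algorithms without sharing data sets can be sketched as follows: the algorithm travels to each node, runs on data that stays private, and only aggregates come back. This is a simplified illustration under assumed names (`MeshNode`, `basket_summary`), not a complete privacy-preserving protocol.

```python
from typing import Callable, List


class MeshNode:
    """A domain that keeps its raw data private; illustrative only."""

    def __init__(self, name: str, records: List[float]) -> None:
        self.name = name
        self._records = records          # raw data is not exposed to callers

    def run(self, algorithm: Callable[[List[float]], dict]) -> dict:
        # Only the algorithm's result is returned to the caller.
        return algorithm(self._records)


def basket_summary(values: List[float]) -> dict:
    """A shared algorithm: returns aggregates, not the underlying rows."""
    return {"count": len(values), "total": sum(values)}


nodes = [MeshNode("france", [20.0, 30.0]), MeshNode("germany", [50.0])]
partials = [node.run(basket_summary) for node in nodes]

# Combine the per-node aggregates into a global figure.
global_average = sum(p["total"] for p in partials) / sum(p["count"] for p in partials)
```

Note that nothing here stops a badly written algorithm from returning raw rows. In practice, the algorithms allowed to run on a node would themselves need to be reviewed or constrained.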
The table from Data Mesh Principles and Logical Architecture shows the contrast between the centralized model of data governance (data lake, data warehouse) and distributed domain ownership through data mesh.
Rethink the organizational structure with data mesh: Growth marketing team example
To properly implement a data mesh architecture, it is necessary to rethink the organizational structure of the data-related teams, which can no longer be centralized and must instead be distributed across domains.
Let’s take an example of a growth marketing team.
Typically, marketing teams are organized as follows:
Typical marketing team organization from Moira Helana example
The classical structure forces all teams to work on the same database, which explains the recurring conflicts between these teams and results in an inadequate data architecture.
Besides, this structure doesn’t always suit your needs. In her article about restructuring her growth marketing team, Moira Helena takes the example of a company that focuses on digital channels and therefore doesn’t need a PR analyst to acquire new clients. The analyst can instead be placed in the Retention Squad to maintain a good brand relationship with clients.
With data mesh, you can organize your team to achieve your business goals.
Personalized marketing team organization from Moira Helana example
The idea behind the reorganization is to have an organization by domains to:
- Gain efficiency: Squad teams focus on the creation of value to achieve specific business objectives
- Reduce cost: Horizontal teams that deliver projects to both the Acquisition and Retention teams can be treated as satellite teams
- Better prioritization: Instead of delivering communication projects, your team will naturally prioritize actions that will help you reach your goal.
This new organization encourages employees to work through “task owners” who have more authority than the people they work for.
These individuals would be responsible for making sure the business goal is met, but would not be involved with day-to-day IT operations.
Besides, thanks to Operational Analytics enabled by data mesh, each team can work with real-time data that is specific to its domain.
Different ways to implement data mesh
According to Sven Balnojan, there are three types of data mesh architecture.
1. The standard data mesh
This approach requires a catalog of all the data, where each team stores its data in a “bucket” with some access point to “look into these buckets”.
As a result, Team A and Team B are directly connected to the data’s end users, and both teams are able to claim true “ownership” of their data, as they now have new clients: data customers.
Growth managers, sales managers, and support agents are examples of data customers who have unique requirements that the teams can now process and implement just like any other product requirement.
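A toy version of this setup is sketched below: each team keeps its own bucket, and the catalog maps a dataset name to the access point its owner exposes. The team names, rows, and the `read` helper are hypothetical and only serve to illustrate the layout.

```python
from typing import Callable, Dict, List

# Each team keeps its own "bucket"; names and rows are hypothetical.
team_a_bucket = [{"lead": "acme", "score": 0.9}, {"lead": "globex", "score": 0.4}]
team_b_bucket = [{"ticket": 17, "status": "open"}]

# The catalog maps a dataset name to the access point its owner exposes.
# Access points hand out copies so consumers cannot modify the owner's bucket.
catalog: Dict[str, Callable[[], List[dict]]] = {
    "team_a.leads": lambda: [dict(row) for row in team_a_bucket],
    "team_b.tickets": lambda: [dict(row) for row in team_b_bucket],
}


def read(dataset: str) -> List[dict]:
    """Data customers go through the catalog rather than into a bucket directly."""
    return catalog[dataset]()


# A growth manager (a data customer) asks for high-scoring leads.
hot_leads = [row["lead"] for row in read("team_a.leads") if row["score"] > 0.5]
```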
2. A mostly centralized data mesh
To visualize this type of data mesh, imagine one large bucket with a smooth way to see into it, as well as ten small pipelines that allow teams A, B, and so on to pour data into our large bucket.
Besides, all the teams pour their data in the same format.
Therefore, end users can extract information from different sources efficiently.
Data platform teams can get this up and running quickly because they don’t have to connect many different data sources.
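The "same format" constraint is what makes the single large bucket easy to query. A minimal sketch, assuming a hypothetical shared format with `team`, `metric`, and `value` fields, could look like this:

```python
from typing import List

REQUIRED_FIELDS = {"team", "metric", "value"}   # hypothetical shared format

central_bucket: List[dict] = []


def pour(record: dict) -> None:
    """Pipeline entry point: accept a record only if it follows the shared format."""
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    central_bucket.append(record)


pour({"team": "A", "metric": "signups", "value": 120})
pour({"team": "B", "metric": "signups", "value": 80})

try:
    pour({"team": "C", "value": 5})      # rejected: no "metric" field
    rejected = False
except ValueError:
    rejected = True

# Because every record has the same shape, one query spans all teams.
total_signups = sum(r["value"] for r in central_bucket if r["metric"] == "signups")
```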
3. A mostly decentralized data mesh
Finally, let’s imagine the opposite.
Consider ten data buckets of varying sizes instead of a single standard one, ten pipes siphoning data into our large bucket, and, once more, a smooth data access mechanism.
Teams A and B, on the other hand, developed custom access points to their data pools this time to allow their more specialized customers to access their data.
Consequently, that requires a lot of work from the central data team but provides even more flexibility for the data customers.
Does my company need data mesh?
Well, it depends.
Data mesh is a foundational technology and as such it’s likely not going to be a fit for every organization right away.
Think of your internal IT systems like organs in your body; they all work together and depend on each other in order to function correctly.
If you have a system that doesn’t seem to fit with your current system, perhaps it just needs some time to catch up or get trained in how things work around here.
Remember: it takes time for an organism to mature; don’t rush things!
If you’re making rapid changes with no plan of action, you might end up causing more harm than good.
What does Data Mesh do?
In short, it makes sure everything stays connected by linking disparate data sources together into one cohesive whole.
Data mesh works across all the domains of your organization, whatever the underlying storage, from on-premise data stores to cloud platforms. By connecting these pieces through data products with clear owners, teams can find and trust the information they need.
This also means that if something goes wrong with one data product or pipeline, you can isolate where the problem lies and who owns it, and fix it quickly before too much damage is done.
How does Data Mesh differ from what I already have?
Data mesh is not a security protocol, and it does not provide encryption or decryption services of its own.
What differs from what you already have is where responsibility sits: instead of one central team securing and serving a single warehouse or lake, each domain owns its data products and applies the access rules agreed under federated governance.
Encryption in transit and at rest remains the job of your existing storage and network layers.
Applied consistently, these rules protect sensitive information while still allowing users to access the data they need.
In addition, because data mesh is an architectural and organizational approach rather than a piece of hardware or software, it sits on top of your existing infrastructure. Expect it to require changes in ownership and process rather than a drop-in installation.
As with anything else, there are certain criteria you should look for when deciding whether or not data mesh is right for your business.
Here’s a list of questions to ask yourself:
- Are you currently using multiple cloud or on-premise data stores?
- Is your central data team a bottleneck between the teams that produce data and the teams that need it?
- Do several business domains, such as finance or marketing, produce and consume their own data?
- Do data consumers wait a long time for new datasets or for changes to existing pipelines?
- Is it unclear who owns a dataset and who is accountable for its quality?
- Is the same data duplicated across several systems of record (ERP, CRM, HR, etc.)?
- Do you struggle to find enough data analytics and IT management talent to serve every team centrally?
If you are asking yourself those questions, data mesh can be a solution.
If we take the example of a company with a culture that makes scaling centralized services difficult, centralized data governance can be seen as a bottleneck.
In this scenario, using a data mesh to distribute governance tasks can help you scale even further.
To sum up, the following are some of the benefits of using Data Mesh:
- Access to higher data quality
- Empowerment of Business & Ops teams
- Federated ownership to move from data silos to flows of insight
At RestApp, we’re building a Data Activation Platform for modern data teams, with a large built-in library of connectors to databases, data warehouses, and business apps.
We have designed our next-gen data modeling editor to be intuitive and easy to use.
Discover the next-gen end-to-end data pipeline platform with our built-in No Code SQL, Python and NoSQL functions. Data modeling has never been easier and safer thanks to the No Code revolution, so you can simply create your data pipelines with drag-and-drop functions and stop wasting your time by coding what can now be done in minutes!
Discover Data modeling without code with our 14-day free trial!