December 31, 2019
Our Top 5 Blog Posts of 2019

Hadoop hasn’t lived up to its early promise.
Enterprises in North America are already investing in their third-generation data lake, with the same goal of “democratizing data” at scale. The goal of Governed Self-Service that delivers business insights, and ultimately enables AI and ML initiatives to automate decisions, is a virtuous one.
Let’s look at where we’ve come from:
- First generation: proprietary enterprise data warehouse and business intelligence platforms. Expensive, hard to maintain, narrowly used, and delivering very little ROI because business impact was minimal.
- Second generation: Hadoop as a silver bullet; complex big data ecosystem mostly focused on batch jobs operated by a central team of hyper-specialized data engineers. Science-project level of success. Again, over-hyped and under-delivered.
- Third generation: Similar in many ways to the second generation. However, there is now a focus on addressing some of the gaps of the previous generations, such as real-time data analytics and reducing the cost of managing big data infrastructure.
The data lake, a single place to consume data, is a great concept because it enables policy-based and declarative modes of operation, which allow for much greater decentralization of data control. Unfortunately, data lake architectures have, to date, failed at scale.
To address these failure patterns, we need to shift away from the purely centralized paradigm of the lake and its predecessor, the data warehouse. The next data lake architecture will look much more like a modern distributed architecture: domains are first class, self-serve data infrastructure is automated, and data is treated as a product and a supply chain.
Data lives in certain technologies for a reason: it was created there, or the serving mechanism suits the specific data or use cases. Think graph, time series, textual, big, small, highly concurrent, batch, streaming. The application of the data should define the data engineering required to deliver it.
Hadoop data lakes failed because they were centralized and monolithic.
Two drivers for change:
- Heterogeneous data and source proliferation: As more data becomes available, the ability to consume it all and store/serve it from one place under the control of one platform diminishes.
- Organizations’ growing desire to innovate and be data-driven: Not only are more ways to store data coming online, but the use of data should also be expected to keep growing in scale and variety. Companies need to experiment rapidly to find insights, so reducing cycle time will be at the forefront. The only way to satisfy that goal is to invest in decentralization and automation. Again, consumption patterns should naturally define the data engineering requirements. Slow delivery of value, tied to the limitations of centralized data engineering organizations, is the main cause of friction, and it will remain so until the automation agenda is satisfied.
The next-generation architecture will succeed if it delivers data that is (see the sketch after this list):
- Secure & Governed
- Discoverable
- Self-Describing: More semantics than common data stores provide today
- Programmable
- Trustworthy
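Here is a minimal sketch of how those qualities might surface in practice as a data product descriptor. Everything in it (the DataProduct class, its field names, and the example values) is hypothetical and for illustration only; it is not an AtScale or industry-standard API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataProduct:
    """Hypothetical descriptor for a domain-owned data product."""
    name: str                  # Discoverable: registered under a stable, searchable name
    domain: str                # Owned by a business domain, not a central platform team
    owner: str                 # Accountable contact, part of being trustworthy
    schema: Dict[str, str]     # Self-describing: columns with declared semantic types
    access_policy: str         # Secure & governed: policy declared alongside the data
    endpoints: List[str] = field(default_factory=list)  # Programmable: SQL/API/stream access
    freshness_sla_minutes: int = 60                      # Trustworthy: published expectations

orders = DataProduct(
    name="sales.orders_daily_summary",
    domain="sales",
    owner="sales-data-team@example.com",
    schema={"order_id": "string", "order_ts": "timestamp", "amount_usd": "decimal(18,2)"},
    access_policy="role:analyst",
    endpoints=["jdbc://warehouse/sales", "s3://lake/sales/orders_daily_summary/"],
)
```

The specific fields matter less than the idea: each quality in the list maps to machine-readable metadata that travels with the data, rather than living only inside a central team.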
What’s next for big data?
In the future described above, many technologies can and will coexist. Many enterprises may not LOVE Hadoop; however, it is the right technology to solve the problems they have. Hadoop will continue to have a place in the data infrastructure.
The future technologies to keep an eye on in the third generation of data lakes are tools that unify batch and streaming (such as Apache Beam) to accomplish the goals of the data lake without having to build a monolithic structure that is difficult to manage.
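As a rough illustration of what “unify batch and streaming” means, here is a minimal Apache Beam sketch using the Python SDK. The file names are placeholders; the point is that the same transform graph runs in batch mode over a bounded source, and would run in streaming mode by swapping in an unbounded source (for example beam.io.ReadFromPubSub) plus a windowing step, without rewriting the rest of the pipeline.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# One transform graph serves both modes: only the source (bounded vs. unbounded)
# and the runner configuration change between batch and streaming.
with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("events.txt")        # bounded source -> batch run
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("word_counts")
    )
```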
How AtScale can help
You are at a fork in the road: stay in the Hadoop ecosystem or make the transition to the cloud. Both paths have their specific challenges. If the problems you are encountering on Hadoop fall into the performance and scale, security, or ease-of-use categories, we can help. No other company on the planet has as much experience as we do in making customers successful with business intelligence on any of the Hadoop platforms.
However, if you’ve decided that the cloud is a better solution and you are interested in a relatively pain-free migration that does not disrupt data consumers, we can enable that. Moving data platforms is usually a very complex project, often involving schema changes to fit the data to the functionality and performance characteristics of the new platform. AtScale removes this consideration and allows you to do the easiest migration: lift and shift. Move the schema exactly as is (perhaps with a few type changes) to the new cloud database, repoint AtScale to the new database platform, and your data consumers will not even know it has happened. As a (very) nice side effect, AtScale’s autonomous data engineering components will also optimize cost on the new cloud database, resulting in material savings as you scale your analytics programs.
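To give a feel for what “perhaps a few type changes” can mean in a lift-and-shift, here is a hypothetical sketch, not AtScale tooling and not a complete mapping, that translates common Hive column types into Snowflake-style equivalents; the target type names assume Snowflake and would differ for another cloud database.

```python
# Hypothetical Hive -> Snowflake-style type mapping for a lift-and-shift migration.
# Table and column names move over exactly as-is; only a handful of physical
# types typically need translating for the target platform.
HIVE_TO_CLOUD = {
    "STRING": "VARCHAR",
    "DOUBLE": "FLOAT",
    "TIMESTAMP": "TIMESTAMP_NTZ",
    "DECIMAL(18,2)": "NUMBER(18,2)",
    # Types left out of the map (INT, BIGINT, BOOLEAN, ...) carry over unchanged.
}

def translate_column(name: str, hive_type: str) -> str:
    """Return a target column definition, keeping the type when no change is needed."""
    return f"{name} {HIVE_TO_CLOUD.get(hive_type.upper(), hive_type.upper())}"

print(translate_column("order_ts", "timestamp"))   # order_ts TIMESTAMP_NTZ
print(translate_column("order_id", "bigint"))      # order_id BIGINT
```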
AtScale’s unique approach to automating analytical data engineering tasks helps with performance, security, agility, and cost on Hadoop, in the cloud, and in hybrid cloud/on-premises environments.