December 17, 2019
Big Data Analytics in the Cloud for Today’s Distributed and Diverse Data

Despite the challenges associated with data warehousing, enterprise IT leaders have accepted it as a necessary evil of deriving value from the information in Hadoop and other Big Data ecosystems. But how much does it cost to create data warehouses or data marts that extract data out of Hadoop? And is there a better way to do BI on Big Data?
Analyzing the financial costs of data warehousing
Amazon conducted a study on the costs associated with building and maintaining data warehouses and found that expenses can run between $19,000 and $25,000 per terabyte annually. At a midpoint of roughly $22,000 per TB, a data warehouse containing 40TB of information (a modest repository for many large enterprises) requires a yearly budget of about $880,000, close to $1 million.
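As a back-of-the-envelope check of that figure, the arithmetic is simple; the per-TB cost below is an assumed midpoint of the cited range, not a number from the study itself:

```python
# Back-of-the-envelope data warehouse cost estimate.
# The per-TB annual cost is an assumed midpoint of the $19K-$25K range cited above.
COST_PER_TB_PER_YEAR = 22_000   # USD, assumed midpoint
WAREHOUSE_SIZE_TB = 40          # a modest repository for a large enterprise

annual_cost = COST_PER_TB_PER_YEAR * WAREHOUSE_SIZE_TB
print(f"Estimated annual cost: ${annual_cost:,}")  # Estimated annual cost: $880,000
```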
What’s behind these costs, and should a business be worried about them? While storage grows more affordable every year, the engineering of data warehouses (their setup and maintenance) lies at the heart of the issue. IT leaders, architects and others in charge of developing data warehouses often encounter the following challenges, which add up to real and opportunity costs that affect the bottom line:
- Scrubbing Information: While it’s relatively easy to dump enterprise data into a warehouse, doing so burdens BI professionals with the need to categorize and validate information. To effectively address this challenge, the architect must typically collaborate with a data scientist.
- Maintaining Security: BI professionals, data modelers and data scientists all want unhindered access to the information within the data warehouse. This desire pressures IT to grant liberal user permissions, yet doing so introduces access control risks. In today’s world of data security and governance, security maintenance carries time and labor costs, along with the potential penalty costs of sharing the wrong information with the wrong people.
- Establishing Compatibility with BI and Analytics Tools: Between the Hadoop flavors under the architect’s purview and the SQL tools that data analysts and scientists are familiar with, the architect has to jury-rig a method that lets analysts efficiently query the Big Data ecosystem. The same challenge exists with vendor-specific BI software.
- Ongoing Data Movement: Propagating data from a Hadoop cluster to make it accessible to business intelligence tools requires IT or data scientists to configure jobs that move the data from an ETL system into a data warehouse (see the sketch after this list). In addition, depending on the BI tools analysts use, IT may also need to create multiple off-cluster analytics cubes or other environments to enable interactive queries.
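To make the data-movement burden concrete, here is a minimal sketch of the kind of job IT ends up scheduling. It assumes a PySpark environment with Hive support, a source table named analytics.sales, and a JDBC-reachable warehouse; all names, URLs and credentials are hypothetical placeholders.

```python
# Hypothetical nightly job: copy Hadoop-resident data into an off-cluster warehouse.
# Table names, the JDBC URL, and credentials are placeholders for illustration only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nightly_sales_export")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the source data that lives on the Hadoop cluster.
sales = spark.table("analytics.sales")

# Pre-aggregate so the warehouse only has to serve interactive BI queries.
daily_totals = sales.groupBy("order_date", "region").sum("revenue")

# Push the result into the external data warehouse over JDBC.
daily_totals.write.jdbc(
    url="jdbc:postgresql://warehouse.example.com:5432/bi",  # placeholder warehouse
    table="bi.daily_sales_totals",
    mode="overwrite",
    properties={"user": "etl_user", "password": "***"},
)
```

Every job like this is another pipeline for IT to schedule, monitor and keep in sync with the source data, which is exactly the recurring cost described above.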
Ideally, companies would allow BI tools to run interactive analysis workloads directly on the Hadoop environment. How do enterprises accomplish this today?
Reducing costs through Big Data Intelligence Platforms
Running BI tools directly on Hadoop and other Big Data systems entails using Big Data intelligence platforms to act as mediators between BI applications and enterprise data lakes. The technology uses standard interfaces such as ODBC and JDBC that connect BI applications to various Hadoop ecosystems.
Big Data intelligence platforms provide data analysts with virtual cubes that represent the data within Big Data ecosystems. Analysts can use SQL, XMLA or MDX to create dimensions, measures and relationships, and they can collaborate with other modelers to share, adjust and update models. For instance, an analyst using Excel could issue MDX queries, while another working in Tableau could use SQL.
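As a rough sketch of what this looks like from the analyst’s side, the snippet below connects over ODBC and runs an ordinary SQL aggregate against a virtual cube. The DSN, cube name and column names are all hypothetical, and an Excel user would reach the same cube with MDX over XMLA instead.

```python
# Hypothetical analyst session: query a virtual cube through a standard ODBC interface.
# The DSN name, cube/table name, and columns are illustrative placeholders.
import pyodbc

conn = pyodbc.connect("DSN=BigDataIntelligencePlatform", autocommit=True)
cursor = conn.cursor()

# Dimensions (region, order_month) and a measure (total_revenue) exposed by the cube.
cursor.execute("""
    SELECT region,
           order_month,
           SUM(revenue) AS total_revenue
    FROM sales_cube
    GROUP BY region, order_month
    ORDER BY total_revenue DESC
""")

for region, order_month, total_revenue in cursor.fetchall():
    print(region, order_month, total_revenue)
```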
At their core, Hadoop intelligence platforms use adaptive caching to eliminate the need to extract, transform and load data before running BI workloads. This technology provides the responsiveness BI professionals expect from their queries without requiring off-cluster indexes, data marts or multidimensional OLAP cubes. Instead, adaptive caching analyzes end-user query patterns and learns which aggregates users are likely to request.
The aggregates sit on the Hadoop cluster. In addition, the adaptive caching system handles the entire aggregate table lifecycle, applying updates and decommissioning as required. Ultimately, it eliminates the need for IT to migrate data to off-cluster environments.
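The sketch below is a highly simplified illustration of that idea, not any vendor’s actual implementation: it watches which GROUP BY patterns recur and materializes an on-cluster aggregate once a pattern becomes hot. The threshold, table names and the run_on_cluster() helper are hypothetical.

```python
# Simplified illustration of adaptive caching: watch which GROUP BY patterns
# recur, and materialize an on-cluster aggregate table for the hot ones.
# Thresholds, names, and the run_on_cluster() helper are hypothetical.
from collections import Counter

MATERIALIZE_AFTER = 3          # assumed threshold: build an aggregate after 3 hits
pattern_counts = Counter()     # how often each (table, group-by columns) pattern appears
materialized = {}              # pattern -> name of the on-cluster aggregate table

def run_on_cluster(sql):
    """Placeholder for submitting SQL to the Hadoop cluster (e.g. via Hive or Spark)."""
    print("running:", sql)

def observe_query(table, group_by_cols):
    """Record a query pattern and materialize an aggregate once it becomes hot."""
    pattern = (table, tuple(sorted(group_by_cols)))
    pattern_counts[pattern] += 1
    if pattern_counts[pattern] >= MATERIALIZE_AFTER and pattern not in materialized:
        agg_name = f"agg_{table}_{'_'.join(pattern[1])}"
        cols = ", ".join(pattern[1])
        run_on_cluster(
            f"CREATE TABLE {agg_name} AS "
            f"SELECT {cols}, SUM(revenue) AS revenue FROM {table} GROUP BY {cols}"
        )
        materialized[pattern] = agg_name

def route_query(table, group_by_cols):
    """Answer from the aggregate table when one exists, else from the raw table."""
    pattern = (table, tuple(sorted(group_by_cols)))
    return materialized.get(pattern, table)

# Simulate repeated BI queries against the same dimensions.
for _ in range(4):
    observe_query("sales", ["region", "order_month"])
print(route_query("sales", ["region", "order_month"]))  # -> agg_sales_order_month_region
```

A real platform also handles refreshing and retiring these aggregates as the underlying data changes, which is the lifecycle management described above.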
But what about IT control? Hadoop intelligence platforms connect directly to secure Hadoop services, providing support for delegated authorization, Active Directory, Kerberos, SASL, TLS and other security protocols. In addition, administrators can monitor cube access and queries to ensure they’re effectively enforcing role-based access controls.
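For example, an analyst-facing connection can ride on the cluster’s existing Kerberos setup rather than a separate set of warehouse credentials. A minimal sketch using PyHive against a secured HiveServer2 endpoint might look like this; the host, port and service name are placeholders, and a valid Kerberos ticket (e.g. from kinit) is assumed to already exist.

```python
# Hypothetical Kerberos-authenticated connection to a secured Hive endpoint.
# Host, port, and service name are placeholders; a valid Kerberos ticket
# in the caller's environment is assumed.
from pyhive import hive

conn = hive.connect(
    host="hiveserver2.example.com",
    port=10000,
    auth="KERBEROS",
    kerberos_service_name="hive",
)

cursor = conn.cursor()
cursor.execute("SELECT current_user()")  # runs as the authenticated principal
print(cursor.fetchone())
```

Because the query runs as the authenticated principal, the role-based access controls already defined on the cluster continue to apply.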
The Bottom Line
With the technology available today, investing in proprietary, cost-intensive big data warehouses is no longer the only option. Neither is incurring the time, maintenance and labor costs associated with constantly moving data into warehouses built for business intelligence queries.
To learn more about how leading enterprises plan to drive value for the business through modern analytics on big data, check out our Big Data Maturity Survey Report.