Definition
Data Transformation is the practice of enhancing data to improve its ability to address relevant business questions. It includes cleansing, filtering, attributing and structuring data so that it is defined, constructed and dimensionalized topically, semantically and consistently for effective querying. Data Transformation is part of a broader set of activities referred to as ETL (Extract, Transform and Load): the process of making source data available, accurate and relevant for insights creation, including user access, reporting and analysis.
Purpose
The purpose of data transformation is to enhance raw data from its source into a consistent, accurate and understandable set of data that can be queried by a broad array of business insights creators and consumers to address relevant business questions effectively.
Types of Data Transformation
Basic
- Deduplication: Duplicate records produce incorrect query results, so removing duplicates is a common data transformation step. A combined sketch of these basic steps follows this list.
- Format Revision: Date/Time conversions, units of measure, and character set encodings are common for multinational corporations that do business in the US (Imperial) and everywhere else in the world (metric). OK, fine, Myanmar and Liberia also use Imperial.
- Cleaning: Null handling and standardization, such as consistently coding gender as M/F, are critical for grouping dimensions and for correctly summarizing metric values.
- Key Engineering: Occasionally, the relationship between data stored in different databases is some function of a key. In these cases, key restructuring transforms are applied to normalize the key elements.
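As a rough illustration, here is a minimal pandas sketch of the basic steps above. The table and its column names (order_id, customer_id, region_code, order_date, qty_lbs, gender) are hypothetical stand-ins for whatever a real source system provides.

```python
import pandas as pd

raw = pd.DataFrame({
    "order_id":    [101, 101, 102, 103],
    "customer_id": [7, 7, 8, 9],
    "region_code": ["US", "US", "DE", "DE"],
    "order_date":  ["03/15/2024", "03/15/2024", "03/16/2024", None],
    "qty_lbs":     [10.0, 10.0, 4.4, 2.2],
    "gender":      ["male", "male", "F", "unknown"],
})

# Deduplication: drop exact duplicate rows so totals are not double-counted.
df = raw.drop_duplicates().copy()

# Format revision: parse US-style dates and convert pounds to kilograms.
df["order_date"] = pd.to_datetime(df["order_date"], format="%m/%d/%Y", errors="coerce")
df["qty_kg"] = df["qty_lbs"] * 0.453592

# Cleaning: standardize gender codes; anything unmapped becomes null.
df["gender"] = df["gender"].str.lower().map({"male": "M", "m": "M", "female": "F", "f": "F"})

# Key engineering: build a single surrogate key from region code + customer id.
df["customer_key"] = df["region_code"] + "-" + df["customer_id"].astype(str)
```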
Advanced
- Predication/Filtering: Only move data that satisfy the filter conditions.
- Summarization: A key element of Business Intelligence. Values are aggregated and stored at multiple levels as business metrics – for example, Order Quantity totals by city, state, and country.
- Derivation: Applying business rules to your data that derive new calculated values from existing data – for example, creating a revenue metric that subtracts taxes.
- Splitting: Splitting a single column into multiple columns.
- Data validation: Can be a simple “if/then” rule or a multi-valued assessment – for example, if value A is “foo” and value B is greater than 100, reject the row of data.
- Integration: Similar to Key Engineering: Standardize how data elements are addressed. Data integration reconciles different data keys and values for data that should be the same.
- Joining: Standard database joining, plus more exotic joins from API or unstructured sources. Consider an example where your customers are stored in Salesforce and your sales data is in a homegrown system. Wouldn’t it be nice to have intelligence across both? A combined sketch of these advanced steps follows this list.
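As a rough illustration, the following pandas sketch walks through the advanced steps above on two hypothetical tables (orders and customers); all names, values, and business rules are made up for the example.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [7, 7, 8, 9, 9],
    "city": ["Boston", "Boston", "Berlin", "Munich", "Munich"],
    "quantity": [5, 12, 150, 3, 0],
    "gross_revenue": [50.0, 120.0, 1500.0, 30.0, 0.0],
    "tax": [5.0, 12.0, 150.0, 3.0, 0.0],
})
customers = pd.DataFrame({
    "customer_id": [7, 8, 9],
    "full_name": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
})

# Predication/filtering: only keep rows that satisfy the filter condition.
orders = orders[orders["quantity"] > 0].copy()

# Derivation: apply a business rule to compute net revenue.
orders["net_revenue"] = orders["gross_revenue"] - orders["tax"]

# Data validation: a multi-valued rule; flag rows to reject rather than dropping them silently.
orders["reject"] = (orders["city"] == "Berlin") & (orders["quantity"] > 100)

# Splitting: break a single column into multiple columns.
customers[["first_name", "last_name"]] = customers["full_name"].str.split(" ", n=1, expand=True)

# Joining: combine CRM customers with sales facts from another system.
joined = orders.merge(customers, on="customer_id", how="left")

# Summarization: aggregate order quantity and net revenue by city.
summary = joined.groupby("city", as_index=False)[["quantity", "net_revenue"]].sum()
```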
Primary Uses of Data Transformation Software
Data Transformation is used to enhance data from its source so that it is analysis-ready: relevant, business-defined, accurate and consistent. Changes to the data involve many different options, including filtering, attributing and structuring. Major transformation activities are defined below:
Cleansing – Data cleansing involves removing or altering source data that is not necessary, complete or accurate. Cleansing ensures that data is relevant, useful and understandable.
Aggregation – Data aggregation transforms data to conformed definitions across dimensions – for example, rolling data that is sourced at the day level up to other levels such as month, quarter, and year-to-date. Aggregation ensures that the data is defined consistently based on how the business uses it. Aggregation also improves query speed and performance.
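A minimal sketch of rolling day-level facts up to month and quarter grains, assuming a hypothetical sales table with date and units columns:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-02", "2024-04-10"]),
    "units": [10, 4, 7, 12],
})

# Conform the day-level source to month and quarter grains.
sales["month"] = sales["date"].dt.to_period("M")
sales["quarter"] = sales["date"].dt.to_period("Q")

by_month = sales.groupby("month", as_index=False)["units"].sum()
by_quarter = sales.groupby("quarter", as_index=False)["units"].sum()
```

Pre-computing and storing aggregates like these once, rather than summing detail rows on every query, is what delivers the query-speed benefit mentioned above.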
Attributes – Data attributes are definitions, or standardized groupings, of the attributes instantiated in the data to improve analysis relevance and consistency. This can involve creating standard definitions for ranges of data values, such as defining business-relevant geographical regions or age groups for demographic analysis, or instantiating other standardized segmentation and grouping. Other methods for data grouping include discretization of continuous data: binning, trending, probability distribution fitting and association/regression.
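As a rough illustration, this sketch instantiates two hypothetical attribute groupings – age bands via binning of a continuous value, and business-defined regions – on made-up age and state columns:

```python
import pandas as pd

people = pd.DataFrame({"age": [17, 25, 42, 68], "state": ["MA", "NY", "CA", "OR"]})

# Discretization/binning: map continuous ages into standard demographic bands.
people["age_group"] = pd.cut(
    people["age"],
    bins=[0, 17, 34, 54, 120],
    labels=["<18", "18-34", "35-54", "55+"],
)

# Standardized grouping: map states to business-defined geographic regions.
region_map = {"MA": "Northeast", "NY": "Northeast", "CA": "West", "OR": "West"}
people["region"] = people["state"].map(region_map)
```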
Metrics / Features – Data Transformation can also involve creating metrics – standardized calculations (e.g., Key Performance Indicators) that are used for analysis. Metrics used for data science are often referred to as features, and these definitions are also defined and instantiated as part of the data transformation process.
Hierarchies – Data transformation can be the step that defines and instantiates hierarchies, which enable logical movement from summary to detail in the data. For example, a geographical hierarchy might involve a sequence such as “where marketed” > “state” > “county”, where there is a direct and complete relationship between the values at the highest and lowest levels of detail.
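A minimal sketch of instantiating and checking the geographic hierarchy described above (“where marketed” > state > county); the lookup table, fact table, and values are hypothetical:

```python
import pandas as pd

geo = pd.DataFrame({
    "county": ["Suffolk", "Middlesex", "Cook"],
    "state": ["MA", "MA", "IL"],
    "where_marketed": ["Northeast", "Northeast", "Midwest"],
})
facts = pd.DataFrame({"county": ["Suffolk", "Middlesex", "Cook"], "sales": [100, 250, 400]})

# A well-formed hierarchy: every county rolls up to exactly one state.
assert geo.groupby("county")["state"].nunique().eq(1).all()

# Attach the hierarchy to the facts, then drill from summary to detail.
enriched = facts.merge(geo, on="county", how="left")
by_region = enriched.groupby("where_marketed")["sales"].sum()
by_state = enriched.groupby(["where_marketed", "state"])["sales"].sum()
```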
Normalization – Data normalization is the process of scaling the data to enable “apples-to-apples” comparisons of metrics and features across multiple data sets, either integrated or analyzed and compared separately. Normalization may also be used to eliminate noise or skews in the source data. Examples of normalization techniques are as follows, with a combined sketch after the list:
- Min-max normalization: performing a linear transformation of the original data, typically rescaling it into a fixed range such as [0, 1].
- Z-score normalization: rescaling each value based on its position relative to the mean, expressed in standard deviations (assuming an approximately normal/Gaussian distribution).
- Decimal scaling: normalizing attribute values by moving the decimal point, i.e., dividing by a power of 10.
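A minimal sketch of the three techniques, applied to a hypothetical numeric series x:

```python
import math
import pandas as pd

x = pd.Series([12.0, 45.0, 7.0, 120.0, 33.0])

# Min-max normalization: linear rescaling into the [0, 1] range.
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: each value's position relative to the mean,
# expressed in standard deviations.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by a power of 10 so absolute values fall below 1.
j = math.ceil(math.log10(x.abs().max()))
decimal_scaled = x / (10 ** j)
```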
How Data Transformation is Accomplished
The first step of data transformation is mapping. Data mapping determines the relationship between the data elements of the source and the target and defines how data from the source application is transformed before it is loaded into the target. Data mapping produces a set of instructions, or metadata, that is needed before the actual data conversion takes place.
An example of field mapping: the information in one application might be rendered in upper case, while another application expects it in lower case.
The structure of stored data may also vary between data consumers, requiring semantic mapping prior to the transformation process. For instance, a multinational company may require a transform to change the format used to represent dates, or simply to alter the precision of a particular data type.
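As a rough illustration, a field-level mapping can be captured as metadata (source field, target field, transform rule) and applied later; every field name and rule below is hypothetical:

```python
import pandas as pd

# source_field -> (target_field, transform rule)
field_map = {
    "CUSTOMER_NAME": ("customer_name", lambda s: s.str.lower()),
    "ORDER_DATE":    ("order_date", lambda s: pd.to_datetime(s, format="%d.%m.%Y")),
    "AMOUNT":        ("amount_usd", lambda s: s.round(2)),
}

source = pd.DataFrame({
    "CUSTOMER_NAME": ["ACME GMBH", "GLOBEX LLC"],
    "ORDER_DATE": ["05.03.2024", "17.03.2024"],
    "AMOUNT": [1234.5678, 99.999],
})

# Apply the mapping to produce the target layout.
target = pd.DataFrame({tgt: fn(source[src]) for src, (tgt, fn) in field_map.items()})
```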
Overall, there are two ways to approach ETL transformation:
- Multistage data transformation: Extracted data is moved to a staging area where transformations occur prior to loading the data into the warehouse.
- In-warehouse data transformation: Data is extracted and loaded into the analytics warehouse, and transformations are done there. This is sometimes referred to as Extract, Load, Transform (ELT).
In-warehouse transforms are gaining traction, driven by two factors:
- The increased performance and scalability of the modern analytics database
- These types of transforms are expressed in SQL, the data manipulation language of choice (see the sketch after this list)
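A minimal sketch of the in-warehouse (ELT) pattern: raw rows are loaded first, and the transform is then expressed in SQL and executed where the data lives. Python’s built-in sqlite3 stands in here for a real analytics warehouse, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the analytics warehouse

# "Extract" and "Load": land the raw data untransformed.
conn.execute("CREATE TABLE raw_orders (city TEXT, quantity INTEGER, gross REAL, tax REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [("Boston", 5, 50.0, 5.0), ("Boston", 12, 120.0, 12.0), ("Berlin", 7, 70.0, 7.0)],
)

# "Transform": filter, derive, and summarize in SQL, inside the database.
conn.execute("""
    CREATE TABLE orders_by_city AS
    SELECT city,
           SUM(quantity)    AS total_quantity,
           SUM(gross - tax) AS net_revenue
    FROM raw_orders
    WHERE quantity > 0
    GROUP BY city
""")

print(conn.execute("SELECT * FROM orders_by_city").fetchall())
```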
AtScale supports in-warehouse transforms that are late-bound and materialized for performance. We empower data consumers to apply powerful transformations in their data warehouse using a feature we call Composer, a visual user interface that does not require the user to understand SQL.
Benefits of Well-Executed Data Transformation
The benefits of data transformation are to deliver data that is topically and semantically relevant, accurate, and consistent. The key impact is being able to easily integrate, analyze, and apply data science to the data in order to generate insights with greater relevance, speed, clarity, alignment, and confidence, as follows:
- Relevance – Insights delivered from data are relevant to the business subject and user needs. Data are available, accurate, timely, understandable, and comprehensive to address the business users’ needs.
- Speed – Insights created from data enable actions to be taken faster because the insights are structured to address business questions in a more timely and effective way.
- Clarity – The insights are seen as being clear, compelling, and accurate in such a way that conclusions reached from the data are consistent across the audience of users, and data are sufficient to address questions posed from multiple interpretations of the data and insights.
- Alignment – Because the data is relevant, timely, accurate, and comprehensive and the insights are presented in a clear, consistent way that is easily understood, the interpretation of the data is consistent, supporting improved alignment regarding decisions and actions to be taken.
- Confidence – The insights created from data are trusted, and the performance of decisions and actions are measured using data such that the relationship between insights and effective, impactful decisions and actions is direct, positive, and improving.
Common Roles and Responsibilities for Data Transformation
Business Intelligence and the resulting creation of actionable insights from data delivered to business users involves the following key roles:
- Data Engineers – Data engineers create and manage data pipelines that transport data from source to target, including creating and managing data transformations to ensure data arrives ready for analysis.
- Analytics Engineers – Analytics engineers support data scientists and other predictive and prescriptive analytics use cases, focusing on managing the entire data to model ops process, including data access, transformation, integration, DBMS management, BI and AI data ops, and model ops.
- Data Modelers – Data Modelers are responsible for each type of data model: conceptual, logical, and physical. Data Modelers may also be involved with defining specifications for data transformation and loading.
- Technical Architect – The technical architect is responsible for the logical and physical technical infrastructure and tools. The technical architect works to ensure the data model and databases, including source and target data, can be physically accessed, queried, and analyzed by the various OLAP tools.
- Data Analyst / Business Analyst – Often a business analyst or, more recently, a data analyst is responsible for defining the uses and use cases of the data, as well as providing design input to the data structure – particularly metrics, topical and semantic definitions, business questions/queries, and the outputs (reports and analyses) intended to be performed and improved. Responsibilities also include owning the roadmap for how the data will be enhanced to address additional business questions and existing insights gaps.
- Insights Consumers – Insights consumers (e.g. business leaders and analysts) are responsible for using insights and analyses created by insights creators to improve business performance, including through improved awareness, plans, decisions, and actions.
Key Business Processes Associated with Data Transformation
The process for delivering data transformation is as follows:
- Profiling – Data are reviewed for relevance, completeness, and accuracy by data creators and enablers. Profiling can and should occur for individual datasets and integrated datasets, both in raw form and in a ready-to-analyze structured form. A minimal profiling sketch follows this list.
- Preparation / Transformation – Data are extracted, transformed, attributed, and dimensionalized to be available in a ready-to-analyze form, often with standardized configurations and coded automation to enable faster data refresh and delivery. Data is typically made available in an easy-to-query form such as a database, spreadsheet or Business Intelligence application.
- Integration – When multiple data sources are involved, integration involves combining multiple data sources into a single, structured, ready-to-analyze dataset. Integration involves creating a single data model and then extracting, transforming, and loading the individual data sources to conform to the data model, making the data available for querying by data insights creators and consumers.
- Extraction / Aggregation – The integrated dataset is made available for querying, including being aggregated to optimize query performance.
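As a rough illustration of the profiling step, the following sketch computes completeness and basic accuracy checks on a hypothetical dataset before it is transformed:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, None],
    "amount": [10.0, -5.0, 20.0, 7.5],
})

# Completeness and shape: data types, null counts, and distinct values per column.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_count": df.isna().sum(),
    "distinct": df.nunique(),
})
print(profile)

# Simple accuracy checks worth surfacing before transformation.
checks = {
    "duplicate_keys": int(df["order_id"].duplicated().sum()),
    "negative_amounts": int((df["amount"] < 0).sum()),
}
print(checks)
```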
Common Technology Categories Associated with Data Transformation
Technologies involved with data transformation are as follows:
- Data Engineering – Data engineering is the process and technology required to move data securely from source to target in a way that is easily available and accessible.
- Data Transformation – Data transformation involves altering the data from its raw form to a structured form that is easy to analyze via queries. Transformation also involves enhancing the data to provide attributes and references that increase standardization and ease of integration with other data sources.
- Data Warehouse – Data warehouses store data that are used frequently and extensively by the business for reporting and analysis. Data warehouses are constructed to store the data in a way that is integrated, secure, and easily accessible for standard and ad-hoc queries for many users.
- Data Lake – Data lakes are centralized data storage facilities that automate and standardize the process of acquiring data, storing it, and making it available for profiling, preparation, data modeling, analysis, and reporting/publishing. Data lakes are often created using cloud technology, which makes data storage very inexpensive, flexible, and elastic.
Trends / Outlook for Data Transformation
Key trends to watch in the Data Transformation arena are as follows:
- Semantic Layer – The semantic layer is a common, consistent representation of the data used for business intelligence (reporting and analysis) as well as for analytics. The semantic layer is important because it creates one common, consistent way to define data in multidimensional form, so that queries made from and across multiple applications, including multiple business intelligence tools, go through one common definition rather than recreating the data models and definitions within each tool. This ensures consistency and efficiency, including cost savings as well as the opportunity to improve query speed and performance.
- Automation – Increasing emphasis is being placed by vendors on ease of use and automation to increase speed-to-insights. This includes offering “drag and drop” interfaces to execute data preparation activities and insights creation/queries without having to write code, as well as reusing activities and processes, both for repeated use and for sharing.
- Self-service – As data grows, the availability of qualified data technologists and analysts is very limited. To address this gap and increase productivity without leaning entirely on IT resources to make data and analysis available, self-service is increasingly available for data profiling, mining, preparation, reporting, and analysis. In addition, tools like the Semantic Layer offered by AtScale, Inc. are also focused on enabling business users and data analysts to model data for business intelligence and analytics uses.
- Transferable – Increased effort is also underway to make data easier to consume, including making data easier to publish, for example via APIs and via objects that store elements of the insights.
- Observable – Recently, a host of new vendors have begun offering services referred to as “data observability”: the practice of monitoring data to understand how it is changing and being consumed. This trend, often called “DataOps”, closely mirrors the software development trend called “DevOps”, which tracks how applications are performing and being used in order to understand, anticipate, and address performance gaps and improvement areas proactively rather than reactively.
AtScale and Data Transformation
AtScale is the leading provider of the Semantic Layer, which enables actionable insights and analytics to be delivered with increased speed, scale and cost-effectiveness. Research confirms that companies that use a semantic layer improve their speed to insights by 4x – meaning that a typical project to launch a new data source with analysis and reporting capabilities that would take four months can be done in just one month using a semantic layer.
AtScale’s semantic layer is uniquely positioned to support data transformation: it ensures that data are consistently defined and structured using common attributes, metrics and features in dimensional form. It automates the process of inspecting, cleansing, editing and refining data – adding attributes, hierarchies, and metrics/features – and then extracting and delivering the result automatically as a ready-to-analyze source for any BI tool, whether it’s Tableau, Power BI or Excel. Moreover, this work requires only one resource who understands the data and how it is to be analyzed, eliminating unnecessary complexity and resource intensity. This approach to data transformation automation eliminates multiple data hand-offs, manual coding, the risk of duplicate extracts, and suboptimal query performance.