
Chapter: Data Warehousing and Data Mining

Data extraction, clean up and transformation


1. Tools requirements:


These tools enable sourcing of the proper data content, in the proper format, from operational and external data stores into the data warehouse.


The task includes:

Data transformation from one format to another


Data transformation and calculation based on the application of business rules, e.g., deriving age from date of birth


Data consolidation (combining several source records into a single record) and integration


Metadata synchronization and management, including storing or updating metadata definitions
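A business-rule transformation such as the age-from-date-of-birth example can be sketched in a few lines of Python; the function name and rule are illustrative, not taken from any particular tool:

```python
from datetime import date

def age_from_dob(dob: date, today: date) -> int:
    """Business rule: derive age in whole years from date of birth."""
    years = today.year - dob.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years

print(age_from_dob(date(1990, 6, 15), date(2024, 6, 14)))  # 33
print(age_from_dob(date(1990, 6, 15), date(2024, 6, 15)))  # 34
```

In a real tool this rule would be defined once in the transformation layer and applied to every record flowing from the operational store to the warehouse.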


When implementing a data warehouse, several selection criteria that affect the tool's ability to transform, integrate, and repair the data should be considered:

The ability to identify the data source


Support for flat files and indexed files


Ability to merge data from multiple data sources


Ability to read information from data dictionaries


The code generated by the tool should be maintainable in the development environment


The ability to perform data type and character-set translation is a requirement when moving data between incompatible systems.


The ability to summarize and aggregate records


The data warehouse database management system should be able to perform the load directly from the tool, using the native API.
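Character-set translation between incompatible systems typically means converting mainframe EBCDIC data to ASCII/Unicode. A minimal Python sketch, assuming the source uses EBCDIC code page 500 (actual code pages vary by installation):

```python
def translate_record(raw: bytes, src_encoding: str = "cp500") -> str:
    """Decode a mainframe record (EBCDIC cp500 assumed) into a Python str,
    which can then be re-encoded (e.g., as UTF-8) for the target system."""
    return raw.decode(src_encoding)

# Simulate an EBCDIC source record and translate it.
record = "ACME CORP 00042".encode("cp500")
print(translate_record(record))  # ACME CORP 00042
```

Real extraction tools also handle packed-decimal and binary fields, which a plain character-set decode does not cover.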


2. Vendor approaches:


The tasks of capturing data from a source system, cleaning and transforming it, and then loading the result into a target system can be carried out either by separate products or by a single integrated solution. The integrated solutions are described below:


Code generators:


Create tailored 3GL/4GL transformation programs based on source and target data definitions.


The data transformation and enhancement rules are defined by the developer, and the generated code employs a data manipulation language.

Code generation products are typically used to develop enterprise-wide data warehouse databases.


The main issue with this approach is the management of the large number of programs required to support a complex corporate information system.
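The code-generator idea can be illustrated with a toy generator that emits an extract/load statement from source and target definitions; the table and column names here are invented for illustration:

```python
# Hypothetical source/target definitions; a real product would read these
# from data dictionaries rather than from literals.
source = {"table": "CUST_MASTER", "columns": ["CUST_ID", "NAME", "DOB"]}
target = {"table": "DIM_CUSTOMER", "columns": ["customer_id", "name", "birth_date"]}

def generate_extract_sql(src, tgt):
    """Emit an INSERT ... SELECT that maps source columns to target columns."""
    select_list = ", ".join(f"{s} AS {t}" for s, t in zip(src["columns"], tgt["columns"]))
    return (f"INSERT INTO {tgt['table']} ({', '.join(tgt['columns'])})\n"
            f"SELECT {select_list}\n"
            f"FROM {src['table']};")

print(generate_extract_sql(source, target))
```

A production code generator would emit a full 3GL/4GL program rather than a single statement, which is exactly why the generated-program inventory becomes a management burden.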


Database data replication tools:


These tools capture changes to a single data source on one system and apply the changes to a copy of that source data located on a different system.
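The capture-and-apply cycle of a replication tool can be sketched as follows; the change-log format and names are assumptions for illustration:

```python
# Toy change-data-capture: a list of (operation, key, row) changes captured
# at the source is applied to a replica held on another system.
def apply_changes(replica: dict, changes: list) -> dict:
    for op, key, row in changes:
        if op in ("insert", "update"):
            replica[key] = row
        elif op == "delete":
            replica.pop(key, None)
    return replica

replica = {1: {"name": "Ann"}}
changes = [("update", 1, {"name": "Anne"}),
           ("insert", 2, {"name": "Bob"}),
           ("delete", 1, None)]
print(apply_changes(replica, changes))  # {2: {'name': 'Bob'}}
```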


Rule-driven dynamic transformation engines (also known as data mart builders):


Capture data from a source system at user-defined intervals, transform the data, then send and load the result into a target system.


Data transformation and enhancement is based on a script or function logic defined to the tool.
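A rule-driven engine can be sketched as a table of rules (plain functions) applied field by field to each captured record; the rules shown are illustrative:

```python
# Transformation rules registered against field names. Fields without a
# rule pass through unchanged.
rules = {
    "name": str.title,                                         # normalize case
    "country": lambda v: {"UK": "United Kingdom"}.get(v, v),   # table lookup
}

def transform(record: dict) -> dict:
    """Apply each field's rule (if any) to produce the target record."""
    return {field: rules.get(field, lambda v: v)(value)
            for field, value in record.items()}

print(transform({"name": "jane doe", "country": "UK", "id": 7}))
# {'name': 'Jane Doe', 'country': 'United Kingdom', 'id': 7}
```

In a product, the rule table would be edited through the tool's scripting interface and evaluated on each capture interval rather than hard-coded.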


3. Access to legacy data:


Today many businesses are adopting client/server technologies and data warehousing to meet customer demand for new products and services and to obtain competitive advantage.


The majority of the information required to support business applications and the analytical power of data warehousing is located behind mainframe-based legacy systems. To protect their heavy financial investment in this hardware and software, many organizations turn to middleware solutions.


The middleware strategy is the foundation for Enterprise/Access. It is designed for scalability and manageability in a data warehousing environment.


Enterprise/Access provides a three-tiered architecture through which legacy systems can be connected to a new data warehouse via client/server interfaces. The three-tiered architecture contains the following layers:


The data layer: provides data access and transaction services for management of corporate data.


The process layer: provides services to manage automation and support for current business processes.


The user layer: manages user interaction with process-layer and data-layer services.


4. Vendor solutions:

4.1 Prism Solutions:


Prism Warehouse Manager provides a solution for data warehousing by mapping source data to the target database management system.


The Prism Warehouse Manager generates code to extract and integrate data, create and manage metadata, and create a subject-oriented historical database.

It extracts data from multiple sources: DB2, IMS, VSAM, RMS, and sequential files.


4.2 SAS Institute:


SAS data access engines serve as extraction tools that combine common variables and transform data representation forms for consistency.

SAS also supports decision reporting and graphing, so it can act as the front end as well.


4.3 Carleton Corporation's PASSPORT and MetaCenter:

Carleton's PASSPORT and the MetaCenter fulfill the data extraction and transformation needs of data warehousing.



PASSPORT can produce multiple output files from a single execution of an extract program.


It is a metadata-driven data mapping and data migration facility. It runs as a client on various PC platforms in a three-tiered environment.


It consists of two components:


Mainframe-based: collects the file, record, or table layouts for the required inputs and outputs and converts them into the PASSPORT Data Language (PDL).


Workstation-based: the user transfers the PDL file from the mainframe to a location accessible by PASSPORT.


PASSPORT offers:

A metadata directory at the core of the process


Robust data conversion, migration, analysis and auditing facilities.


The PASSPORT Workbench, a GUI workbench that enables project development on a workstation and supports the various personnel who design, implement, or use the warehouse.


PASSPORT highlights:

Carleton PASSPORT includes a number of facilities and features, which are briefly discussed below:


Data access: provides data selection, automatic file and data matching, selective random access, automatic handling of multiple record types, intelligent joins, and single or multiple record access.


Data analysis and auditing: this facility provides:

Audit reports, including sequence reports, class tabulation reports, file footing reports, stratification reports, and statistical summary reports


Audit facilities, including the SAMPLE command, AGE function, DATE keyword, and ENCRYPT option.


Language and design: supports predefined calculations, arithmetic operations, relational and Boolean operations, range operations, arrays, input data sorts, work fields and system fields, conditional processing, internal and external subroutines, and loop processing.


PASSPORT Data Language (PDL): has a free-form command structure with English-like command syntax.

Run-time environment: supports dynamic work fields and error limit control.


Report writing: supports an unlimited number of line formats, variable page sizes, controlled horizontal and vertical spacing, and dynamic printing.


Centralized metadata repository: provides global access and central information change management and control, and enables metadata accuracy and integrity.


Business metadata: provides for metadata in business terms. This metadata is stored in a relational format and is accessible by any SQL-based query tool.


Load image formats for target DBMSs: PASSPORT formats data for loading into any target RDBMS, including DB2, Informix, Oracle, Sybase, and Red Brick.


Optional user exits: PASSPORT provides support for user exits, where users can optionally invoke previously written routines.


Browsing capabilities: PASSPORT provides metadata in a relational format that is easily accessible by end-user query tools.


The MetaCenter:


The MetaCenter, developed by Carleton Corporation in partnership with Intellidex Systems Inc., is designed to put users in control of the data warehouse.

The heart of the MetaCenter is the metadata dictionary.


The MetaCenter, in conjunction with PASSPORT, provides a number of capabilities:


Data extraction and transformation: the PASSPORT Workbench provides data transformation capabilities to support complex data; the developer can automatically generate COBOL extract programs from the metadata.


Event management and notification: data movement and subscription events are executed and monitored by the scheduler via its event monitors. The scheduler sends notifications to the various responsible administrators via e-mail.


Data mart subscription: Data warehouse users can subscribe to the data they need using business terminology.


Control Center Mover: this unit works with the scheduler to automatically move each data request. The Mover provides seamless connectivity and data pumping between the data warehouse and data marts.



4.4 Validity Corporation


Validity Corporation's Integrity data reengineering tool is used to investigate, standardize, transform, and integrate data from multiple operational systems and external sources.

Its main focus is data quality: avoiding GIGO (garbage in, garbage out).



Benefits of the Integrity tool:


Builds accurate consolidated views of customers, suppliers, products, and other corporate entities.


Maintains the highest quality of data.


4.5 Evolutionary Technologies:


Another data extraction and transformation tool is the ETI-EXTRACT tool, which automates the migration of data between dissimilar storage environments.


It saves up to 95% of the time and cost of manual data conversion. It enables users to:

Populate and maintain a data warehouse

Move to new architectures such as distributed client/server

Integrate disparate systems

Migrate data to new databases, platforms, and applications

ETI-EXTRACT also:

Supports data collection and conversion

Automatically generates and executes programs in the appropriate language for the source and target platforms

Provides a powerful metadata facility

Provides a sophisticated graphical interface that allows users to indicate how to move data through simple point-and-click operations



Product overview:

ETI-EXTRACT automates the migration of data between different storage systems. It supports client/server environments, handles complex conditional manipulations, and offers administration capabilities.


All ETI-EXTRACT functions are accessed through the industry-standard interface. A key feature is ETI-EXTRACT's ability to support conditional data selection and transformation that would otherwise have to be done programmatically.


Selection and transformation are specified via a point-and-click driven process.


ETI-EXTRACT enables users to interactively view metadata, merge metadata from multiple different sources, and export metadata.


The ETI-EXTRACT tool consists of two sets of productivity tools:

The Master Toolset

The Data Conversion Toolset


The Master Toolset:


It is a set of interactive editors that allows a systems programmer, known as a master user, to define to the MetaStore database everything required to be able to read from and write to any database system or file system in the enterprise environment. It contains:


The Environment Editor: allows the specification of each different platform and system operating environment to be accessed.

The Schema Editor: provides easy access for browsing or updating schema information.


The Grammar Editor: defines customized conditional retrieval, transformation, and populate logic to meet environment specifications.


The Template Editor: enables rapid specification of programming rules to shape the way data retrieval, conversion, and populate programs are generated.


The Data Conversion Toolset:

It includes the Conversion Editor, the Executive, the Workset Browser, and the metadata facility.


The Conversion Editor: provides a graphical point-and-click interface for defining the mapping of data between various source data systems and target data systems.

The Conversion Editor allows the user to:

Selectively retrieve data from one or more database management systems


Merge data from multiple systems to create a new database


Populate any number of database management systems or file formats
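The retrieve-merge-populate flow that the Conversion Editor defines graphically can be sketched as follows; the data and field names are invented for illustration:

```python
# Two source systems: an orders file and a customer lookup table.
orders = [{"cust_id": 1, "amount": 250}, {"cust_id": 2, "amount": 90}]
customers = {1: "Ann", 2: "Bob"}

def merge(orders, customers):
    """Join rows from two source systems into new target records."""
    return [{"customer": customers[o["cust_id"]], "amount": o["amount"]}
            for o in orders]

# The merged rows would then populate the target DBMS or file format.
target_rows = merge(orders, customers)
print(target_rows)
# [{'customer': 'Ann', 'amount': 250}, {'customer': 'Bob', 'amount': 90}]
```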


Other components of the ETI-EXTRACT Tool Suite

The ETI-EXTRACT Executive


The ETI-EXTRACT Workset browser


The ETI-EXTRACT Metadata facility


The MetaStore Database


The Metadata Exchange Library


4.6 Information Builders


EDA/SQL is a product of Information Builders for data extraction, transformation, and access; it is a legacy data access tool for building a data warehouse.


EDA/SQL implements a client/server model that is optimized for high performance, and it also supports copy management, data quality management, and data replication.


5. Transformation Engines

Two dynamic transformation engine tools are discussed here:

Informatica's PowerMart suite

The Constellar Hub



5.1 Informatica


Informatica's product, the PowerMart suite, has to be discussed in conjunction with the Metadata Exchange Architecture (MX) initiative.

MX provides the APIs to integrate vendor tools with the Informatica metadata repository.


It captures technical metadata and business metadata on the back end that can be integrated with the metadata in front-end partners' products.


The Informatica repository is the foundation of the PowerMart suite, in which the technical and business metadata are stored.





PowerMart consists of the following components:


PowerMart Designer

PowerMart Server

The Informatica Server Manager

The Informatica Repository

Informatica PowerCapture


5.2 Constellar


The Constellar Hub consists of a set of components that support distributed transformation management capabilities.


The product is designed for both data migration and data distribution in an operational system


It employs a hub-and-spoke architecture to manage the flow of data between source and target systems.


Each spoke represents a data path between the transformation hub and a data source or target.


The hub and its associated sources and targets can be installed on the same machine, or may be on separate networked computers.


The hub supports

Record reformatting and restructuring


Field-level data transformation, validation, and table lookup


File- and multifile-set-level transformation and validation


Creation of intermediate results
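The hub's role in the list above can be sketched as a toy hub-and-spoke flow: records arrive from a source spoke, pass through the hub's field-level transforms and validation, and fan out to the target spokes. All names here are illustrative:

```python
def hub(record, transforms, validate, targets):
    """Apply field-level transforms, validate, then deliver to each target."""
    for field, fn in transforms.items():
        record[field] = fn(record[field])
    if not validate(record):
        return None                    # rejected records go to an error path
    for deliver in targets:            # fan out to each target spoke
        deliver(record)
    return record

warehouse, mart = [], []
hub({"sku": "a-1", "qty": 5},
    transforms={"sku": str.upper},            # field-level transformation
    validate=lambda r: r["qty"] > 0,          # field-level validation
    targets=[warehouse.append, mart.append])  # two target spokes
print(warehouse)  # [{'sku': 'A-1', 'qty': 5}]
```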



Copyright © 2018-2023 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.