
Chapter: Data Warehousing and Data Mining

Data extraction, clean up and transformation

These tools enable sourcing of the proper data contents and formats from operational and external data stores into the data warehouse.


1. Tool requirements:

 


 

The task includes:

Data transformation from one format to another

 

Data transformation and calculation based on the application of business rules, e.g. deriving age from date of birth.

 

Data consolidation (merging several source records into a single record) and integration

 

Metadata synchronization and management, which includes storing and updating metadata definitions.
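The first two tasks above can be sketched in a few lines of Python. This is a minimal illustration, not code from any vendor tool; the date format and field names are assumed for the example.

```python
from datetime import date

def age_from_dob(dob: date, today: date) -> int:
    # Business-rule calculation: derive age from date of birth.
    # The comparison subtracts 1 if this year's birthday has not yet occurred.
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

def to_iso(us_date: str) -> str:
    # Format transformation: MM/DD/YYYY source format to ISO 8601 (YYYY-MM-DD).
    month, day, year = us_date.split("/")
    return f"{year}-{month}-{day}"

print(age_from_dob(date(1990, 6, 15), date(2024, 3, 1)))  # 33
print(to_iso("06/15/1990"))  # 1990-06-15
```

A real extraction tool applies rules like these to every record as it moves from the operational store to the warehouse.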

 

When implementing a data warehouse, several selection criteria that affect the tool's ability to transform, integrate and repair the data should be considered:

The ability to identify the data source

 

Support for flat files and indexed files

 

Ability to merge data from multiple data sources

 

Ability to read information from data dictionaries

 

The code generated by the tool should be maintainable in the development environment

 

The ability to perform data type and character set translation is a requirement when moving data between incompatible systems.

 

The ability to summarize and aggregate records

 

The data warehouse database management system should be able to perform the load directly from the tool using its native API.
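To illustrate the merge and summarization criteria above, here is a minimal Python sketch. The source names and fields (billing, crm, cust_id) are hypothetical, standing in for two operational systems.

```python
from collections import defaultdict

# Two operational sources keyed by customer id (hypothetical data).
billing = [{"cust_id": 1, "amount": 120.0}, {"cust_id": 2, "amount": 80.0},
           {"cust_id": 1, "amount": 40.0}]
crm = {1: {"name": "Acme Ltd"}, 2: {"name": "Globex"}}

# Summarize detail records per customer (aggregation)...
totals = defaultdict(float)
for row in billing:
    totals[row["cust_id"]] += row["amount"]

# ...then merge with the second source into a single warehouse record.
warehouse = [{"cust_id": cid, "name": crm[cid]["name"], "total": amt}
             for cid, amt in sorted(totals.items())]
print(warehouse)
```

The two detail rows for customer 1 collapse into one consolidated record, which is the essence of data consolidation.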

 

2. Vendor approaches:

 

The tasks are capturing data from a source system, cleaning and transforming it, and loading the result into a target system.

 

These tasks can be carried out either by separate products or by a single integrated solution. The integrated solutions are described below:

 

Code generators:

 

Code generators create tailored 3GL/4GL transformation programs based on source and target data definitions.

 

The data transformation and enhancement rules are defined by the developer, and the tool employs a data manipulation language.

Code-generation products are used to develop enterprise-wide data warehouse databases.

 

The main issue with this approach is the management of the large number of programs required to support a complex corporate information system.
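The code-generator approach can be illustrated with a toy Python sketch: hypothetical source and target layouts drive the generation of a small transformation program, standing in for the 3GL/4GL code a real product would emit.

```python
# Hypothetical source and target data definitions.
source_layout = ["cust_id", "dob", "amount"]
target_layout = {"cust_id": "cust_id", "total": "amount"}  # target field <- source field

def generate_program(mapping, source_fields):
    # Emit the text of a transformation program from the definitions.
    lines = ["def transform(record):", "    out = {}"]
    for tgt, src in mapping.items():
        assert src in source_fields  # mapping must reference real source fields
        lines.append(f"    out[{tgt!r}] = record[{src!r}]")
    lines.append("    return out")
    return "\n".join(lines)

code = generate_program(target_layout, source_layout)
namespace = {}
exec(code, namespace)  # a real tool would compile the generated 3GL source instead
print(namespace["transform"]({"cust_id": 7, "dob": "1990-06-15", "amount": 5}))
```

The management problem noted above follows directly: each source/target pair yields another generated program to version, test and maintain.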

 

Database data replication tools:

 

These tools capture changes to a single data source on one system and apply the changes to a copy of the source data located on a different system.
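A minimal sketch of the replication idea, with in-memory dictionaries standing in for the source and its remote copy:

```python
# Change capture and apply: the core of a replication tool (toy sketch).
primary = {"k1": "v1", "k2": "v2"}
replica = dict(primary)  # initial load of the copy on another system

change_log = []  # changes captured on the primary

def apply_change(target, op, key, value=None):
    # Apply a single logged change to a data store.
    if op == "upsert":
        target[key] = value
    elif op == "delete":
        target.pop(key, None)

# Changes occur on the primary and are captured into the log...
for op, key, value in [("upsert", "k2", "v2-new"), ("delete", "k1", None)]:
    apply_change(primary, op, key, value)
    change_log.append((op, key, value))

# ...and are later replayed against the replica on the other system.
for op, key, value in change_log:
    apply_change(replica, op, key, value)

print(replica == primary)  # True
```

Only the changes travel between systems, not the full data set, which is what distinguishes replication from a bulk reload.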

 

Rule-driven dynamic transformation engines (also known as data mart builders):

 

These engines capture data from a source system at user-defined intervals, transform the data, then send and load the result into a target system.

 

Data transformation and enhancement are based on script or function logic defined to the tool.
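The rule-driven idea, in which transformation logic is held as data that the engine interprets rather than as generated or compiled code, can be sketched as follows (field names and rules are hypothetical):

```python
# Rules are data, not compiled code, so they can be changed without
# regenerating a program: each rule is (field, transformation function).
rules = [
    ("name", str.strip),    # remove stray whitespace
    ("name", str.title),    # normalize capitalization
    ("amount", float),      # convert string amounts to numbers
]

def run_rules(record, rules):
    # The engine: apply each rule in order to a copy of the record.
    out = dict(record)
    for field, fn in rules:
        out[field] = fn(out[field])
    return out

batch = [{"name": "  acme ltd ", "amount": "120.5"}]
print([run_rules(r, rules) for r in batch])
```

A real engine would run this loop on a schedule against each captured batch and then load the results into the target.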

 

3. Access to legacy data:

 

Today many businesses are adopting client/server technologies and data warehousing to meet customer demands for new products and services and to obtain a competitive advantage.

 

The majority of the information required to support business applications and the analytical power of data warehousing is located behind mainframe-based legacy systems. While protecting their heavy financial investment in this hardware and software, many organizations turn to middleware solutions to meet this goal.

 

A middleware strategy is the foundation for Enterprise/Access, which is designed for scalability and manageability in a data warehousing environment.

 

Enterprise/Access provides a three-tiered architecture through which legacy systems can be connected to a new data warehouse via client/server interfaces. The three-tiered architecture contains the following layers:

 

The data layer: provides data access and transaction services for the management of corporate data.

 

The process layer: provides services to manage automation and support for current business processes.

 

The user layer: manages user interaction with process- and data-layer services.

 

4. Vendor solutions:

4.1 Prism Solutions:

 

Prism Warehouse Manager provides a solution for data warehousing by mapping source data to a target database management system.

 

The Prism Warehouse Manager generates code to extract and integrate data, create and manage metadata, and create subject-oriented historical databases.

It extracts data from multiple sources: DB2, IMS, VSAM, RMS and sequential files.

 

4.2 SAS Institute:

 

SAS data access engines serve as extraction tools, combining common variables and transforming data representation forms for consistency.

SAS also supports decision reporting and graphing, so it acts as the front end.

 

4.3 Carleton Corporation's PASSPORT and MetaCenter:

Carleton's PASSPORT and the MetaCenter fulfill the data extraction and transformation needs of data warehousing.

 

PASSPORT

PASSPORT can produce multiple output files from a single execution of an extract program.

 

It is a metadata-driven data mapping and data migration facility. It runs as a client on various PC platforms in a three-tiered environment.

 

It consists of two components:

 

Mainframe-based: collects the file, record, or table layouts for the required inputs and outputs and converts them into the PASSPORT Data Language (PDL).

 

Workstation-based: the user must transfer the PDL file from the mainframe to a location accessible by PASSPORT.

 

PASSPORT offers:

A metadata directory at the core of the process

 

Robust data conversion, migration, analysis and auditing facilities.

 

A PASSPORT Workbench, a GUI workbench that enables project development on a workstation and supports the various personnel who design, implement, or use the data warehouse.

 

PASSPORT highlights:

Carleton PASSPORT includes a number of facilities and features, which are briefly discussed below:

 

Data access: it provides data selection, automatic file and data matching, selective random access, automatic handling of multiple record types, intelligent joins, and single or multiple record access.

 

Data analysis and auditing: this facility provides:

 

Audit reports, including sequence reports, class tabulation reports, file footing reports, stratification reports and statistical summary reports.

 

Audit facilities, including the SAMPLE command, AGE function, DATE keyword, and ENCRYPT option.

 

Language and design: it supports predefined calculations, arithmetic operations, relational and Boolean operations, range operations, arrays, input data sorts, work fields and system fields, conditional processing, internal and external subroutines, and loop processing.

 

PASSPORT Data Language (PDL): this has a free-form command structure with English-like command syntax.

Run-time environment: supports dynamic work fields and error-limit control.

 

Report writing: supports an unlimited number of line formats, variable page sizes, controlled horizontal and vertical spacing, and dynamic printing.

 

Centralized metadata repository: provides global access and central information change management and control, and ensures metadata accuracy and integrity.

 

Business metadata: provides metadata in business terms. This metadata is stored in a relational format and is accessible by any SQL-based query tool.

 

Load image formats for target DBMS: PASSPORT formats data for loading into any target RDBMS, including DB2, Informix, Oracle, Sybase and Red Brick.

 

Optional user exits: PASSPORT provides support for user exits, where users can optionally invoke previously written routines.

 

Browsing capabilities: PASSPORT provides metadata in a relational format that is easily accessible by end-user query tools.

 

The MetaCenter:

 

The MetaCenter is developed by Carleton Corporation in partnership with Intellidex Systems Inc. and is designed to put users in control of the data warehouse.

The heart of the MetaCenter is the metadata dictionary.

 

The MetaCenter, in conjunction with PASSPORT, provides a number of capabilities:

 

Data extraction and transformation: the PASSPORT Workbench provides data transformation capabilities to support complex data; the developer can automatically generate COBOL extract programs from the metadata.

 

Event management and notification: data movement and subscription events are executed and monitored by the scheduler via its event monitors. The scheduler sends notifications to the various responsible administrators via e-mail.

 

Data mart subscription: Data warehouse users can subscribe to the data they need using business terminology.

 

Control Center Mover: this unit works with the scheduler to automatically move each data request. The Mover provides seamless connectivity and data pumping between the data warehouse and data marts.

 

 

4.4 Validity Corporation

 

Validity Corporation's Integrity data re-engineering tool is used to investigate, standardize, transform and integrate data from multiple operational systems and external sources.

Its main focus is on data quality, in particular on avoiding the GIGO (garbage in, garbage out) principle.

Benefits of the Integrity tool:

 

Builds accurate, consolidated views of customers, suppliers, products and other corporate entities.

 

Maintains the highest quality of data.

 

4.5 Evolutionary Technologies:

 

Another data extraction and transformation tool is the ETI-EXTRACT tool, which automates the migration of data between dissimilar storage environments.

 

It saves up to 95% of the time and cost of manual data conversion. It enables users to:

Populate and maintain a data warehouse

Move to new architectures such as distributed client/server

Integrate disparate systems

Migrate data to new databases, platforms, and applications

Support data collection and conversion

It automatically generates and executes programs in the appropriate language for the source and target platforms.

It provides a powerful metadata facility.

It provides a sophisticated graphical interface that allows users to indicate how to move data through simple point-and-click operations.

 

 

Product overview:

ETI-EXTRACT automates the migration of data between different storage systems. It supports client/server environments, handles complex conditional manipulations, and offers administration capabilities.

 

All ETI-EXTRACT functions are accessed through an industry-standard interface. A key feature is ETI-EXTRACT's ability to support conditional data selection and transformation that can be done programmatically.

 

Selection and transformation are specified via a point-and-click process.

 

ETI-EXTRACT enables users to interactively view metadata, merge metadata from multiple sources, and export metadata.

 

The ETI-EXTRACT tool consists of two sets of productivity tools:

The Master Toolset

The Data Conversion Toolset

 

The Master Toolset:

 

It is a set of interactive editors that allows a system programmer, known as a master user, to define in the MetaStore database everything required to read and write to any database system or file system in the enterprise environment. It contains:

 

The Environment Editor: allows the specification of each different platform and operating-system environment to be accessed.

The Schema Editor: provides easy access for browsing or updating schema information.

 

The Grammar Editor: defines customized conditional retrieval, transformation, and populate logic to meet environment specifications.

 

The Template Editor: enables rapid specification of programming rules to shape the way data retrieval, conversion, and populate programs are generated.

 

The Data Conversion Toolset:

This includes the Conversion Editor, the Executive, the Workset Browser and the Metadata Facility.

 

The Conversion Editor: provides a graphical point-and-click interface for defining the mapping of data between various source and target data systems.

The Conversion Editor allows the user to:

Selectively retrieve data from one or more database management systems

 

Merge the data from multiple systems to create a new database

 

Populate any number of database management systems or file formats.

 

Other components of the ETI-EXTRACT Tool Suite

The ETI-EXTRACT Executive

 

The ETI-EXTRACT Workset browser

 

The ETI-EXTRACT Metadata facility

 

The MetaStore Database

 

The Metadata Exchange Library

 

4.6 Information Builders

 

EDA/SQL is a product of Information Builders for data extraction, transformation, and legacy data access in building a data warehouse.

 

EDA/SQL implements a client/server model that is optimized for higher performance and also supports copy management, data quality management, and data replication.

 

5. Transformation Engines

Two dynamic transformation engine tools are:

Informatica's PowerMart suite

The Constellar Hub

 

 

5.1 Informatica

 

Informatica's PowerMart suite has to be discussed in conjunction with the Metadata Exchange Architecture (MX) initiative.

MX provides the APIs to integrate vendor tools with the Informatica metadata repository.

 

It captures the technical and business metadata on the back end, which can be integrated with the metadata in front-end partners' products.

 

The Informatica repository is the foundation of the PowerMart suite, in which technical and business metadata is stored.

 

 

 

 

PowerMart consists of the following components:

 

PowerMart Designer

PowerMart Server

The Informatica Server Manager

The Informatica Repository

Informatica PowerCapture

 

5.2 Constellar

 

The Constellar Hub consists of a set of components that support distributed transformation management capabilities.

 

The product is designed for both data migration and data distribution in an operational system.

 

It employs a hub-and-spoke architecture to manage the flow of data between source and target systems.

 

Each spoke represents a data path between the transformation hub and a data source or target.

 

The hub and its associated sources and targets can be installed on the same machine, or may be separate networked computers

 

The hub supports

Record reformatting and restructuring

 

Field level data transformation, validation, and table lookup

 

File and multifile set-level transformation and validation

 

Creation of intermediate results
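A toy Python sketch of the hub-and-spoke flow, showing field-level transformation and a table lookup at the hub before fan-out to targets (all names, fields, and the lookup table are hypothetical):

```python
# A lookup table at the hub, used to enrich records in flight.
country_lookup = {"IN": "India", "US": "United States"}

def hub_transform(record):
    # Field-level transformation and validation performed at the hub.
    out = dict(record)
    out["country"] = country_lookup.get(out["country"], "Unknown")  # table lookup
    out["amount"] = round(out["amount"], 2)                         # normalize precision
    return out

# Each spoke ends at a target system (here, two toy data marts).
targets = {"mart_sales": [], "mart_audit": []}

def route(record):
    # The hub transforms once, then fans the result out along every spoke.
    row = hub_transform(record)
    for target in targets.values():
        target.append(row)

route({"country": "IN", "amount": 10.456})
print(targets["mart_sales"])
```

Because all transformation happens at the hub, a new source or target only adds a spoke; the rules do not have to be duplicated at every endpoint.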

