Data extraction, cleanup and transformation
1. Tool requirements:
The tools enable sourcing of the proper data contents and formats from operational and external data stores into the data warehouse.
The task includes:
Data transformation from one format to another
Data transformation and calculation based on the application of business rules (e.g., deriving age from date of birth)
Data consolidation (merging several source records into a single record) and integration
Metadata synchronization and management, including storing and updating metadata definitions
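The transformation, business-rule calculation and consolidation tasks above can be sketched as follows; the record layout, field names and the fixed reference date are illustrative assumptions, not part of any particular tool:

```python
from datetime import date

# Hypothetical source records; field names are assumptions for illustration.
source_records = [
    {"cust_id": 1, "name": "A. Smith", "dob": "1980-06-15", "balance": 120.0},
    {"cust_id": 1, "name": "A. Smith", "dob": "1980-06-15", "balance": 80.0},
    {"cust_id": 2, "name": "B. Jones", "dob": "1975-01-02", "balance": 50.0},
]

def age_from_dob(dob_str, today=date(2024, 1, 1)):
    """Business rule: derive age from date of birth (fixed date for repeatability)."""
    dob = date.fromisoformat(dob_str)
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

def consolidate(records):
    """Consolidation: merge several source records into a single record per key."""
    merged = {}
    for rec in records:
        key = rec["cust_id"]
        if key not in merged:
            merged[key] = {**rec, "age": age_from_dob(rec["dob"])}
        else:
            merged[key]["balance"] += rec["balance"]   # integrate the measure
    return list(merged.values())

warehouse_rows = consolidate(source_records)
```

The same pattern scales to any derived field: each business rule is a small pure function applied during the load.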
When implementing a data warehouse, several selection criteria that affect the tools' ability to transform, integrate and repair the data should be considered:
The ability to identify the data source
Support for flat files and indexed files
Ability to merge data from multiple data sources
Ability to read information from data dictionaries
The code generated by the tool should be maintainable in the development environment
The ability to perform data type and character set translation, which is a requirement when moving data between incompatible systems
The ability to summarize and aggregate records
The data warehouse database management system should be able to perform the load directly from the tool using the native API.
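As an illustration of the data type and character set translation requirement, the sketch below converts an EBCDIC-encoded field to a Unicode string using Python's `cp500` codec (one common EBCDIC variant) and translates a COBOL-style numeric field held as text; the sample values are assumptions:

```python
# Character set translation between incompatible systems: a mainframe
# typically stores text in EBCDIC, while the warehouse platform expects
# ASCII/Unicode. Python's 'cp500' codec is one common EBCDIC variant.

ebcdic_bytes = "CUSTOMER".encode("cp500")   # as the field might arrive from a mainframe
ascii_text = ebcdic_bytes.decode("cp500")   # translate to a Unicode string

# Data type translation: a zoned-decimal-style numeric field held as text
amount = int("000123".lstrip("0") or "0")   # "000123" becomes the integer 123
```

Note that the EBCDIC byte values differ from their ASCII counterparts, which is why a byte-for-byte copy between such systems produces garbage.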
2. Vendor approaches:
The tasks of capturing data from a source system, cleaning and transforming it, and then loading the results into a target system can be carried out either by separate products or by a single integrated solution. The integrated solutions are described below:
Code generators:
These create tailored 3GL/4GL transformation programs based on the source and target data definitions.
The data transformation and enhancement rules are defined by the developer, and the generated programs employ a data manipulation language.
Code generation products are used to develop enterprise-wide data warehouse databases.
The main issue with this approach is the management of the large number of programs required to support a complex corporate information system.
Database data replication tools:
These capture changes to a single data source on one system and apply the changes to a copy of the source data located on a different system.
Rule-driven dynamic transformation engines (also known as data mart builders):
These capture data from a source system at user-defined intervals, transform the data, and then send and load the results into a target system.
Data transformation and enhancement are based on a script or function logic defined to the tool.
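A rule-driven engine of this kind can be sketched minimally as follows; the rule names, field names and dictionary-based rule registry are illustrative assumptions rather than any vendor's actual scripting interface:

```python
# Minimal rule-driven transformation engine: transformation rules are
# defined as data (plain functions registered by name), and the engine
# applies the requested rules, in order, to each captured record.

RULES = {
    "upper_name": lambda rec: {**rec, "name": rec["name"].upper()},
    "usd_cents":  lambda rec: {**rec, "amount": round(rec["amount"] * 100)},
}

def transform(records, rule_names):
    """Apply the named rules, in order, to every record."""
    out = []
    for rec in records:
        for name in rule_names:
            rec = RULES[name](rec)
        out.append(rec)
    return out

result = transform([{"name": "smith", "amount": 12.5}], ["upper_name", "usd_cents"])
```

Because the rules are data, they can be changed without regenerating or recompiling programs, which is the key contrast with the code-generation approach above.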
3. Access to legacy data:
Today many businesses are adopting client/server technologies and data warehousing to meet customer demand for new products and services and to obtain a competitive advantage.
The majority of the information required to support business applications and the analytical power of data warehousing is located behind mainframe-based legacy systems. While wanting to protect their heavy financial investment in hardware and software, many organizations turn to middleware solutions to meet this goal.
The middleware strategy is the foundation for Enterprise/Access, which is designed for scalability and manageability in a data warehousing environment.
Enterprise/Access provides a three-tiered architecture through which legacy systems can be connected to a new data warehouse via client/server interfaces. The three-tiered architecture contains the following layers:
The data layer: provides data access and transaction services for the management of corporate data.
The process layer: provides services to manage automation of, and support for, current business processes.
The user layer: manages user interaction with the process- and data-layer services.
4. Vendor solutions:
4.1 Prism Solutions:
Prism Warehouse Manager provides a solution for data warehousing by mapping source data to the target database management system.
It generates code to extract and integrate data, create and manage metadata, and create subject-oriented historical databases.
It extracts data from multiple sources: DB2, IMS, VSAM, RMS and sequential files.
4.2 SAS Institute:
SAS data access engines serve as extraction tools that combine common variables and transform data representation forms for consistency.
SAS also supports decision reporting and graphing, so it acts as the front end as well.
4.3 Carleton Corporation's PASSPORT and MetaCenter:
Carleton's PASSPORT and the MetaCenter fulfill the data extraction and transformation needs of data warehousing.
PASSPORT can produce multiple output files from a single execution of an extract program.
It is a metadata-driven data mapping and data migration facility. It runs as a client on various PC platforms in a three-tiered environment.
It consists of two components:
Mainframe-based: collects the file, record, or table layouts for the required inputs and outputs and converts them into the PASSPORT Data Language (PDL).
Workstation-based: the user transfers the PDL file from the mainframe to a location accessible by PASSPORT.
PASSPORT offers:
A metadata directory at the core of the process
Robust data conversion, migration, analysis and auditing facilities
The PASSPORT Workbench, a GUI workbench that enables project development on a workstation and supports the various personnel who design, implement or use the data warehouse
Carleton PASSPORT includes a number of facilities and features, which are briefly discussed below:
Data access: provides data selection, automatic file and data matching, selective random access, automatic handling of multiple record types, intelligent joins, and single or multiple record access.
Data analysis and auditing: this facility provides:
Audit reports, including sequence reports, class tabulation reports, file footing reports, stratification reports and statistical summary reports
Audit facilities, including the SAMPLE command, the AGE function, the DATE keyword, and the ENCRYPT option
Language and design: supports predefined calculations, arithmetic operations, relational and Boolean operations, range operations, arrays, input data sorts, work fields and system fields, conditional processing, internal and external subroutines, and loop processing.
PASSPORT Data Language (PDL): has a free-form command structure with English-like command syntax.
Run-time environment: supports dynamic work fields and error-limit control.
Report writing: supports an unlimited number of line formats, variable page sizes, controlled horizontal and vertical spacing, and dynamic printing.
Centralized metadata repository: provides global access, provides central information change management and control, and ensures metadata accuracy and integrity.
Business metadata: presents metadata in business terms; this metadata is stored in a relational format and is accessible by any SQL-based query tool.
Load image formats for target DBMSs: PASSPORT formats data for loading into any target RDBMS, including DB2, Informix, Oracle, Sybase and Red Brick.
Optional user exits: PASSPORT provides support for user exits, where users can optionally invoke previously written routines.
Browsing capabilities: PASSPORT provides metadata in a relational format that is easily accessible by end user query tool.
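Because the business metadata is stored in a relational format, any SQL-based query tool can browse it. The sketch below illustrates the idea with an in-memory SQLite database; the table layout and column names are hypothetical, not PASSPORT's actual schema:

```python
import sqlite3

# Hypothetical relational metadata store: each row links a business term
# to the physical table/column it describes and its source system.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE metadata (
    business_name TEXT, table_name TEXT, column_name TEXT, source_system TEXT)""")
conn.executemany("INSERT INTO metadata VALUES (?,?,?,?)", [
    ("Customer age", "DIM_CUSTOMER", "AGE", "DB2"),
    ("Order amount", "FACT_ORDERS",  "AMT", "VSAM"),
])

# An end user's browse query, phrased in business terms:
rows = conn.execute(
    "SELECT table_name, column_name FROM metadata WHERE business_name = ?",
    ("Customer age",)).fetchall()
```

Storing metadata relationally is what makes it reachable from ordinary end-user query tools rather than only from the vendor's own workbench.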
The MetaCenter is developed by Carleton Corporation in partnership with Intellidex Systems Inc. and is designed to put users in control of the data warehouse.
The heart of the MetaCenter is the metadata dictionary.
The MetaCenter, in conjunction with PASSPORT, provides a number of capabilities:
Data extraction and transformation: the PASSPORT Workbench provides data transformation capabilities to support complex data; the developer can automatically generate COBOL extract programs from the metadata.
Event management and notification: data movement and subscription events are executed and monitored by the scheduler via its event monitors. The scheduler sends notifications to the various responsible administrators via e-mail.
Data mart subscription: Data warehouse users can subscribe to the data they need using business terminology.
Control Center Mover: this unit works with the scheduler to automatically move each data request. The Mover provides seamless connectivity and data pumping between the data warehouse and the data marts.
4.4 Validity Corporation:
Validity Corporation's Integrity data reengineering tool is used to investigate, standardize, transform and integrate data from multiple operational systems and external sources.
Its main focus is on data quality, in particular on avoiding GIGO (garbage in, garbage out).
Benefits of the Integrity tool:
It builds accurate, consolidated views of customers, suppliers, products and other corporate entities.
It maintains the highest quality of data.
4.5 Evolutionary Technologies:
Another data extraction and transformation tool is the ETI-EXTRACT tool, which automates the migration of data between dissimilar storage environments.
It saves up to 95% of the time and cost of manual data conversion. It enables users to:
Populate and maintain a data warehouse
Move to new architectures such as distributed client/server
Integrate disparate systems
Migrate data to new databases, platforms, and applications
Support data collection and conversion
ETI-EXTRACT also:
Automatically generates and executes programs in the appropriate language for the source and target platforms
Provides a powerful metadata facility
Provides a sophisticated graphical interface that allows users to indicate how to move data through simple point-and-click operations
It automates the migration of data between different storage systems, supports client/server environments, handles complex conditional manipulations, and offers administration capabilities.
All ETI-EXTRACT functions are accessed through an industry-standard interface. A key feature is ETI-EXTRACT's ability to support conditional data selection and transformation; the selection and transformation are specified via a point-and-click process.
ETI-EXTRACT enables users to interactively view metadata, merge metadata from multiple different sources, and export metadata.
The ETI-EXTRACT tool consists of two sets of productivity tools:
The Master Toolset
The Data Conversion Toolset
The Master Toolset:
It is a set of interactive editors that allows a system programmer, known as a master user, to define in the MetaStore database everything required to read from and write to any database system or file system in the enterprise environment. It contains:
The Environment Editor: allows the specification of each different platform and system operating environment to be accessed.
The Schema Editor: provides easy access for browsing or updating schema information.
The Grammar Editor: defines customized conditional retrieval, transformation and populate logic to meet environment specifications.
The Template Editor: enables rapid specification of programming rules that shape the way data retrieval, conversion and populate programs are generated.
The Data Conversion Toolset:
It includes the Conversion Editor, the Executive, the Workset Browser and the Metadata Facility.
The Conversion Editor: provides a graphical point-and-click interface for defining the mapping of data between various source data systems and target data systems.
The Conversion Editor allows the user to:
Selectively retrieve data from one or more database management systems
Merge the data from multiple systems to create a new database
Populate any number of database management systems or file formats
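The retrieve-merge-populate pattern that the Conversion Editor captures graphically can be sketched as follows; the dictionary-based "databases" and field names are illustrative assumptions, not ETI-EXTRACT structures:

```python
# Two hypothetical source systems keyed by customer id.
source_a = {"c1": {"name": "Smith"}, "c2": {"name": "Jones"}}   # e.g. a DB2 table
source_b = {"c1": {"region": "EU"},  "c2": {"region": "US"}}    # e.g. a VSAM file

def retrieve(db, keys):
    """Selectively retrieve records for the given keys."""
    return {k: db[k] for k in keys if k in db}

def merge(a, b):
    """Merge records from multiple systems into a new database image."""
    return {k: {**a.get(k, {}), **b.get(k, {})} for k in a.keys() | b.keys()}

# Retrieve selectively from each source, merge, then populate the target
# (here the "target" is simply the merged in-memory image).
target = merge(retrieve(source_a, ["c1", "c2"]), retrieve(source_b, ["c1"]))
```

The point of the graphical editor is that this mapping is declared by pointing and clicking rather than written as code; the sketch only shows the underlying data flow.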
Other components of the ETI-EXTRACT Tool Suite:
The ETI-EXTRACT Executive
The ETI-EXTRACT Workset browser
The ETI-EXTRACT Metadata facility
The MetaStore Database
The Metadata Exchange Library
4.6 Information Builders:
EDA/SQL is Information Builders' data extraction, transformation and legacy data access product for building a data warehouse.
EDA/SQL implements a client/server model that is optimized for high performance, and it also supports copy management, data quality management and data replication.
5. Transformation Engines
Two dynamic transformation engine tools are discussed here: Informatica's PowerMart suite and the Constellar Hub.
Informatica's PowerMart suite:
Informatica's PowerMart suite has to be discussed in conjunction with the Metadata Exchange Architecture (MX) initiative.
MX provides the APIs to integrate vendor tools with the Informatica metadata repository.
PowerMart captures the technical metadata and business metadata on the back end, which can be integrated with the metadata in front-end partner products.
The Informatica repository is the foundation of the PowerMart suite, in which the technical and business metadata are stored.
PowerMart consists of the following components:
The Informatica Server Manager
The Constellar Hub:
It consists of a set of components that support distributed transformation.
The product is designed for both data migration and data distribution in an operational system.
It employs a hub-and-spoke architecture to manage the flow of data between source and target systems.
The spokes represent data paths between the transformation hub and a data source or target.
The hub and its associated sources and targets can be installed on the same machine or on separate networked computers.
The hub supports:
Record reformatting and restructuring
Field-level data transformation, validation, and table lookup
File- and multi-file set-level transformation and validation
Creation of intermediate results
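The field-level operations listed above (transformation, validation and table lookup) can be sketched as follows; the lookup table and field names are illustrative assumptions, not part of the Constellar product:

```python
# Field-level operations a transformation hub applies to each record
# flowing along a spoke: transformation, validation, and table lookup.

COUNTRY_LOOKUP = {"US": "United States", "IN": "India"}   # table lookup

def process_record(rec):
    rec = dict(rec)
    rec["name"] = rec["name"].strip().title()             # field transformation
    if rec["qty"] < 0:                                    # field validation
        raise ValueError("qty must be non-negative")
    rec["country"] = COUNTRY_LOOKUP.get(rec["country"], "Unknown")
    return rec

row = process_record({"name": " smith ", "qty": 3, "country": "US"})
```

Records that fail validation would typically be diverted to an error file (an intermediate result) rather than loaded into the target.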