Data extraction, clean up and transformation
1. Tools requirements:
The tools enable sourcing of the proper data content and formats from operational and external data stores into the data warehouse.
The tasks include:
Data transformation from one format to another.
Data transformation and calculation based on the application of business rules, e.g. deriving age from date of birth (see the sketch after this list).
Data consolidation (merging several source records into a single record) and integration.
Metadata synchronization and management, including storing or updating metadata definitions.
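For illustration only, a minimal Python sketch of two of these tasks: a business-rule calculation (age from date of birth) and record consolidation. The field names (customer_id, updated) are assumptions made for the example, not part of any particular tool.

from datetime import date

def age_from_dob(dob, today):
    # Business-rule calculation: derive age in whole years from date of birth.
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

def consolidate(records):
    # Data consolidation: merge several source records for the same customer
    # into a single record, with later values overriding earlier ones.
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        merged.setdefault(rec["customer_id"], {}).update(rec)
    return list(merged.values())

print(age_from_dob(date(1990, 5, 20), date(2024, 1, 1)))          # 33
print(consolidate([
    {"customer_id": 1, "name": "A. Smith", "updated": "2024-01-01"},
    {"customer_id": 1, "city": "Chennai",  "updated": "2024-03-01"},
]))                                                               # one merged record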
When implementing a data warehouse, several selection criteria that affect the tool's ability to transform, integrate and repair the data should be considered:
The ability to identify the data source.
Support for flat files and indexed files.
The ability to merge data from multiple data sources.
The ability to read information from data dictionaries.
The code generated by the tool should be maintainable in the development environment.
The ability to perform data type and character set translation, which is a requirement when moving data between incompatible systems.
The ability to summarize and aggregate records (see the sketch after this list).
The data warehouse database management system should be able to perform the load directly from the tool using its native API.
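A small, hypothetical Python sketch of two of these criteria in action: character set translation (here mainframe EBCDIC, code page 500, to Unicode) and summarization/aggregation of detail records. The region and amount fields are invented for the example.

from collections import defaultdict

def translate_charset(raw):
    # Character set translation, e.g. EBCDIC (code page 500) to Unicode.
    return raw.decode("cp500")

def summarize(sales):
    # Summarization/aggregation: roll detail records up to one total per region.
    totals = defaultdict(float)
    for rec in sales:
        totals[rec["region"]] += rec["amount"]
    return dict(totals)

print(translate_charset(b"\xc8\x85\x93\x93\x96"))        # EBCDIC bytes for "Hello"
print(summarize([{"region": "north", "amount": 10.0},
                 {"region": "north", "amount": 5.5},
                 {"region": "south", "amount": 7.0}]))    # {'north': 15.5, 'south': 7.0}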
2. Vendor approaches:
The tasks of capturing data from a source data system, cleaning and transforming it, and loading the result into a target data system can be carried out either by separate products or by a single integrated solution. The integrated solutions are described below.
Code generators:
Create tailored 3GL/4GL transformation programs based on source and target data definitions.
The data transformation and enhancement rules are defined by the developer, and the generated programs employ a data manipulation language.
Code generation products are used to develop enterprise-wide data warehouse databases.
The main issue with this approach is the management of the large number of programs required to support a complex corporate information system. A simplified code-generator sketch follows.
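A highly simplified sketch of the code-generator idea in Python: a transformation program (here a SQL statement, i.e. a data manipulation language) is generated from developer-supplied source and target definitions. The table and column names and the mapping format are assumptions made for illustration; real products generate full 3GL/4GL programs.

def generate_extract_sql(source_table, target_table, mapping):
    # mapping: {target_column: source_expression}, defined by the developer.
    select_list = ",\n  ".join(f"{expr} AS {col}" for col, expr in mapping.items())
    return (f"INSERT INTO {target_table} ({', '.join(mapping)})\n"
            f"SELECT\n  {select_list}\nFROM {source_table};")

print(generate_extract_sql(
    "crm.customers",                      # hypothetical source definition
    "dw.dim_customer",                    # hypothetical target definition
    {"customer_key": "cust_id",
     "full_name": "TRIM(first_nm) || ' ' || TRIM(last_nm)"}))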
Database data replication tools:
These tools capture changes to a single data source on one system and apply the changes to a copy of that source data located on a different system (a minimal sketch follows).
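A minimal sketch of the replication idea in Python, assuming a simple change-record format (operation, key, row) invented for the example: changes captured at the source are re-applied to a copy held on another system.

def apply_changes(replica, changes):
    # replica: {primary_key: row}; changes: ordered change records captured at the source.
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            replica[key] = change["row"]        # apply new or changed row
        elif op == "delete":
            replica.pop(key, None)              # remove deleted row
    return replica

copy_of_source = {101: {"name": "old"}}
change_log = [{"op": "update", "key": 101, "row": {"name": "new"}},
              {"op": "insert", "key": 102, "row": {"name": "added"}},
              {"op": "delete", "key": 101}]
print(apply_changes(copy_of_source, change_log))   # {102: {'name': 'added'}}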
Rule-driven dynamic transformation engines (also known as data mart builders):
Capture data from a source system at user-defined intervals, transform the data, then send and load the result into a target system (see the sketch below).
Data transformation and enhancement is based on a script or function logic defined to the tool.
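A minimal, hypothetical sketch of a rule-driven engine in Python: transformation rules are plain functions registered with the engine, the source is polled at a user-defined interval, and the transformed batch is handed to a load function. All names here are invented for illustration.

import time

def apply_rules(record, rules):
    for rule in rules:                    # each rule is a record -> record function
        record = rule(record)
    return record

def run_engine(extract, rules, load, interval_seconds, cycles=1):
    # Capture data from the source at a user-defined interval, apply the
    # developer-defined rules to each record, then load the result into the target.
    for _ in range(cycles):
        load([apply_rules(rec, rules) for rec in extract()])
        time.sleep(interval_seconds)

source = lambda: [{"name": " alice ", "amount": "12.5"}]          # hypothetical source
rules = [lambda r: {**r, "name": r["name"].strip().title()},      # clean up name
         lambda r: {**r, "amount": float(r["amount"])}]           # convert data type
run_engine(source, rules, print, interval_seconds=0)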
3. Access to legacy data:
Today many businesses are adopting client/server technologies and data warehousing to meet customer demand for new products and services and to obtain a competitive advantage.
The majority of the information required to support business applications and the analytical power of data warehousing is located behind mainframe-based legacy systems. While wanting to protect their heavy financial investment in this hardware and software, many organizations turn to middleware solutions to meet this goal.
The middleware strategy is the foundation for Enterprise/Access, which is designed for scalability and manageability in a data warehousing environment. Enterprise/Access provides a three-tiered architecture through which legacy data can be connected to a new data warehouse via client/server interfaces. The three-tiered architecture contains the following layers (a generic sketch of the layering follows the list):
The data layer: provides data access and transaction services for management of corporate data access.
The process layer: provides services to manage automation and support for current business processes.
The user layer: manages user interaction with the process- and data-layer services.
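A generic illustration of the three-layer separation, sketched in Python with hypothetical class and method names; it is not Enterprise/Access's actual interface.

class DataLayer:
    # Data access and transaction services over corporate data.
    def __init__(self, store):
        self.store = store
    def query(self, key):
        return self.store.get(key)

class ProcessLayer:
    # Manages and automates a business process using data-layer services.
    def __init__(self, data):
        self.data = data
    def customer_summary(self, cust_id):
        return {"customer": cust_id, "profile": self.data.query(cust_id)}

class UserLayer:
    # Manages user interaction with the process- and data-layer services.
    def __init__(self, process):
        self.process = process
    def handle_request(self, cust_id):
        print(self.process.customer_summary(cust_id))

UserLayer(ProcessLayer(DataLayer({42: "legacy record"}))).handle_request(42)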
4. Vendor solutions:
4.1 Prism Solutions:
Prism Warehouse Manager provides a solution for data warehousing by mapping source data to the target database management system.
The Prism Warehouse Manager generates code to extract and integrate data, create and manage metadata, and create a subject-oriented historical database.
It extracts data from multiple sources: DB2, IMS, VSAM, RMS and sequential files.
4.2 SAS Institute:
SAS data access engines serve as extraction tools to combine common variables and transform data representation forms for consistency.
SAS also supports decision reporting and graphing, so it can act as the front end as well.
4.3 Carleton Corporation's PASSPORT and MetaCenter:
Carleton's PASSPORT and the MetaCenter fulfill the data extraction and transformation needs of data warehousing.
PASSPORT:
PASSPORT can produce multiple output files from a single execution of an extract program.
It is a metadata-driven data mapping and data migration facility. It runs as a client on various PC platforms in a three-tiered environment.
It consists of two components:
Mainframe-based: collects the file, record, or table layouts for the required inputs and outputs and converts them into the PASSPORT Data Language (PDL).
Workstation-based: the user transfers the PDL file from the mainframe to a location accessible by PASSPORT.
PASSPORT offers:
A metadata directory at the core of the process.
Robust data conversion, migration, analysis and auditing facilities.
The PASSPORT Workbench, a GUI workbench that enables project development on a workstation and supports the various personnel who design, implement or use the data warehouse.
PASSPORT highlights:
Carleton PASSPORT includes a number of facilities and features, which are briefly discussed below.
Data access: provides data selection, automatic file and data matching, selective random access, automatic handling of multiple record types, intelligent joins, and single or multiple record access.
Data analysis and auditing: this facility provides:
Audit reports, including sequence reports, class tabulation reports, file footing reports, stratification reports and statistical summary reports.
Audit facilities, including the SAMPLE command, AGE function, DATE keyword, and ENCRYPT option.
Language and design: supports predefined calculations, arithmetic operations, relational and Boolean operations, range operations, arrays, input data sorts, work fields and system fields, conditional processing, internal and external subroutines, and loop processing.
PASSPORT Data Language (PDL): a free-form command structure with English-like command syntax.
Run-time environment: supports dynamic work fields and error limit control.
Report writing: supports an unlimited number of line formats, variable page sizes, controlled horizontal and vertical spacing, and dynamic printing.
Centralized metadata repository: provides global access, provides central information change management and control, and ensures metadata accuracy and integrity.
Business metadata: provides for metadata expressed in business terms. This metadata is stored in a relational format and is accessible by any SQL-based query tool.
Load image formats for target DBMS: PASSPORT formats data for loading into any target RDBMS, including DB2, Informix, Oracle, Sybase and Red Brick.
Optional user exits: PASSPORT provides support for user exits, where users can optionally invoke previously written routines.
Browsing capabilities: PASSPORT provides metadata in a relational format that is easily accessible by end-user query tools.
The MetaCenter:
Developed by Carleton Corporation in partnership with Intellidex Systems Inc., the MetaCenter is designed to put users in control of the data warehouse.
The heart of the MetaCenter is the metadata dictionary.
The MetaCenter, in conjunction with PASSPORT, provides a number of capabilities:
Data extraction and transformation: the PASSPORT Workbench provides data transformation capabilities to support complex data; the developer can automatically generate COBOL extract programs from the metadata.
Event management and notification: data movement and subscription events are executed and monitored by the scheduler via its event monitors. The scheduler sends notifications to the various responsible administrators via e-mail.
Data mart subscription: data warehouse users can subscribe to the data they need using business terminology.
Control Center Mover: this unit works with the scheduler to automatically move each data request. The Mover provides seamless connectivity and data pumping between the data warehouse and the data marts.
4.4 Validity Corporation:
Validity Corporation's Integrity data reengineering tool is used to investigate, standardize, transform and integrate data from multiple operational systems and external sources.
Its main focus is on data quality, aiming to avoid the GIGO (garbage in, garbage out) principle.
Benefits of the Integrity tool:
Builds accurate consolidated views of customers, suppliers, products and other corporate entities.
Maintains the highest quality of data.
4.5 Evolutionary Technologies:
Another data extraction and transformation tool is the ETI-EXTRACT tool, which automates the migration of data between dissimilar storage environments.
It saves up to 95% of the time and cost of manual data conversion. It enables users to:
Populate and maintain a data warehouse.
Move to new architectures such as distributed client/server.
Integrate disparate systems.
Migrate data to new databases, platforms, and applications.
Support data collection and conversion.
ETI-EXTRACT automatically generates and executes programs in the appropriate language for the source and target platforms.
It provides a powerful metadata facility.
It provides a sophisticated graphical interface that allows users to indicate how to move data through simple point-and-click operations.
Product overview:
ETI-EXTRACT automates the migration of data between different storage systems. It supports client/server environments, handles complex conditional manipulations, and offers administration capabilities.
All ETI-EXTRACT functions are accessed through an industry-standard interface. A key feature is ETI-EXTRACT's ability to support conditional data selection and transformation that would otherwise have to be done programmatically.
Selection and transformation are specified via a point-and-click driven process.
ETI-EXTRACT enables users to interactively view metadata, merge metadata from multiple different sources, and export metadata.
The ETI-EXTRACT tool consists of two sets of productivity tools:
The Master Toolset
The Data Conversion Toolset
The Master Toolset:
A set of interactive editors that allows a system programmer, known as the master user, to define to the MetaStore database everything required to be able to read and write to any database system or file system in the enterprise environment. It contains:
The Environment Editor: allows the specification of each different platform and system operating environment to be accessed.
The Schema Editor: provides easy access for browsing or updating schema information.
The Grammar Editor: allows defining customized conditional retrieval, transformation and populate logic to meet environment specifications.
The Template Editor: enables rapid specification of programming rules that shape the way data retrieval, conversion and populate programs are generated.
The Data Conversion Toolset:
This includes the Conversion Editor, the Executive, the Workset Browser and the Metadata Facility.
Conversion Editor: provides a graphical point-and-click interface for defining the mapping of data between various source data systems and target data systems.
The Conversion Editor allows the user to:
Selectively retrieve data from one or more database management systems.
Merge the data from multiple systems to create a new database.
Populate any number of database management systems or file formats.
Other components of the ETI-EXTRACT Tool Suite:
The ETI-EXTRACT Executive
The ETI-EXTRACT Workset Browser
The ETI-EXTRACT Metadata Facility
The MetaStore Database
The Metadata Exchange Library
4.6 Information Builders:
EDA/SQL is a product of Information Builders for data extraction, transformation and access; it is a legacy data access tool for building a data warehouse.
EDA/SQL implements a client/server model that is optimized for high performance and also supports copy management, data quality management and data replication.
5. Transformation Engines:
Two dynamic transformation engine tools are discussed here:
Informatica's PowerMart suite
The Constellar Hub
5.1 Informatica:
Informatica's product, the PowerMart suite, has to be discussed in conjunction with the Metadata Exchange Architecture (MX) initiative.
MX provides the APIs to integrate vendor tools with the Informatica metadata repository.
It captures the technical metadata and business metadata on the back end that can be integrated with the metadata in front-end partners' products.
The Informatica repository is the foundation of the PowerMart suite, in which technical and business metadata is stored.
PowerMart consists of the following components:
PowerMart Designer
PowerMart Server
The Informatica Server Manager
The Informatica Repository
Informatica PowerCapture
5.2 Constellar:
The Constellar Hub consists of a set of components that support distributed transformation management capabilities.
The product is designed for both data migration and data distribution in an operational system.
It employs a hub-and-spoke architecture to manage the flow of data between source and target systems.
A spoke represents a data path between the transformation hub and a data source or target.
The hub and its associated sources and targets can be installed on the same machine, or may be on separate networked computers.
The hub supports (a minimal hub-and-spoke sketch follows this list):
Record reformatting and restructuring.
Field-level data transformation, validation, and table lookup.
File and multi-file set-level transformation and validation.
Creation of intermediate results.
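A minimal sketch of the hub idea in Python: a record arriving on a source spoke is given a field-level transformation and a table lookup at the hub, then delivered on each target spoke. The field names, lookup table and spoke callables are assumptions made for the example.

def hub_route(record, transforms, lookup, targets):
    # Field-level transformation at the hub.
    for field, fn in transforms.items():
        record[field] = fn(record[field])
    # Table lookup: replace a code with its descriptive value.
    record["region"] = lookup.get(record["region"], "UNKNOWN")
    # Deliver the transformed record on every target spoke.
    for deliver in targets:
        deliver(record)

hub_route({"amount": "10.0", "region": "N"},       # record arriving on a source spoke
          transforms={"amount": float},            # field-level transformation
          lookup={"N": "North", "S": "South"},     # lookup table
          targets=[print])                         # target spokes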