1.Ahsan Abdullah
1
Data Warehousing
Lecture-18
ETL Detail: Data Extraction & Transformation
Virtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan1010@yahoo.com
2.Ahsan Abdullah
2
ETL Detail: Data Extraction & Transformation
3.Ahsan Abdullah
3
Extracting Changed Data
Incremental data extraction
Incremental data extraction i.e. what has changed, say during last 24 hrs if considering nightly extraction.
Efficient when changes can be identified
This is efficient, when the small changed data can be identified efficiently.
Identification could be costly
Unfortunately, for many source systems, identifying the recently modified data may be difficult or effect operation of the source system.
Very challenging
Change Data Capture is therefore, typically the most challenging technical issue in data extraction.
ONLY yellow part will go to Graphics
4.Ahsan Abdullah
4
Source Systems
Two CDC sources
Modern systems
Legacy systems
ONLY yellow part will go to Graphics
5.Ahsan Abdullah
5
CDC in Modern Systems
Time Stamps
Works if timestamp column present
If column not present, add column
May not be possible to modify table, so add triggers
Triggers
Create trigger for each source table
Following each DML operation trigger performs updates
Record DML operations in a log
Partitioning
Table range partitioned, say along date key
Easy to identify new data, say last week’s data
ONLY yellow part will go to Graphics
6.Ahsan Abdullah
6
CDC in Legacy Systems
Changes recorded in tapes Changes occurred in legacy transaction processing are recorded on the log or journal tapes.
Changes read and removed from tapes Log or journal tape are read and the update/transaction changes are stripped off for movement into the data warehouse.
Problems with reading a log/journal tape are many:
Contains lot of extraneous data
Format is often arcane
Often contains addresses instead of data values and keys
Sequencing of data in the log tape often has deep and complex implications
Log tape varies widely from one DBMS to another.
ONLY yellow part will go to Graphics
7.Ahsan Abdullah
7
Advantages
Immediate.
No loss of history
Flat files NOT required
CDC Advantages: Modern Systems
Modern
Systems
8.Ahsan Abdullah
8
Advantages
No incremental on-line I/O required for log tape
The log tape captures all update processing
Log tape processing can be taken off-line.
No haste to make waste.
CDC Advantages: Legacy Systems
Legacy
Systems
9.Ahsan Abdullah
9
Major Transformation Types
Format revision
Decoding of fields
Calculated and derived values
Splitting of single fields
Merging of information
Character set conversion
Unit of measurement conversion
Date/Time conversion
Summarization
Key restructuring
Duplication
10.Ahsan Abdullah
10
Format revision
Decoding of fields
Calculated and derived values
Splitting of single fields
Covered in issues
Covered in De-Norm
ONLY yellow part will go to Graphics
Major Transformation Types
11.Ahsan Abdullah
11
Merging of information
Character set conversion
Unit of measurement conversion
Date/Time conversion
Not really means combining columns to create one column.
Info for product coming from different sources merging it into single entity.
ONLY yellow part will go to Graphics
For PC architecture converting legacy EBCIDIC to ASCII
For companies with global branches Km vs. mile or lb vs Kg
November 14, 2005 as 11/14/2005 in US and 14/11/2005 in the British format.
This date may be standardized to be written as 14 NOV 2005.
Major Transformation Types
12.Ahsan Abdullah
12
Aggregation & Summarization
How they are different?
Why both are required?
Grain mismatch (don’t require, don’t have space)
Data Marts requiring low detail
Detail losing its utility
Adding like values
Summarization with calculation across business dimension is aggregation. Example Monthly compensation = monthly sale + bonus
ONLY yellow part will go to Graphics
Major Transformation Types
13.Ahsan Abdullah
13
Key restructuring (inherent meaning at source)
i.e. 92424979234 changed to 12345678
Removing duplication
ONLY yellow part will go to Graphics
Incorrect or missing value
Inconsistent naming convention ONE vs 1
Incomplete information
Physically moved, but address not changed
Misspelling or falsification of names
Major Transformation Types
14.Ahsan Abdullah
14
Data content defects
Domain value redundancy
Non-standard data formats
Non-atomic data values
Multipurpose data fields
Embedded meanings
Inconsistent data values
Data quality contamination
15.Ahsan Abdullah
15
Domain value redundancy
Unit of Measure
Dozen, Doz., Dz., 12
Non-standard data formats
Phone Numbers
1234567 or 123.456.7
Non-atomic data fields
Name & Addresses
Dr. Hameed Khan, PhD
ONLY yellow part will go to Graphics
Data content defects Examples
16.Ahsan Abdullah
16
Embedded Meanings
RC, AP, RJ
received, approved, rejected
Data content defects Examples