to Data Imperative: Getting the Single Source of Truth
CS Lee
27th August 2015
Data Sources
- Historical data: System Replacements, System Migrations, Merger & Acquisition
- Data Feeds: Banks, Partnerships
- New data types: Social media, Public emails
Common data problems
- Lack of information standards: different formats & structures across different systems
- Data surprises in individual fields: data misplaced in the database
- Information buried in free-form fields: descriptions, addresses
- Data myopia: lack of consistent identifiers inhibits a single view
- The redundancy nightmare: duplicate records with a lack of standards

Example: the same person recorded three ways
Kate A. Roberts, 416 Columbus Ave #2, Boston, Mass 02116
Catherine Roberts, Four sixteen Columbus APT2, Boston, MA 02116
Mrs. K. Roberts, 416 Columbus Suite #2, Suffolk County 02116

Example: surprises in individual fields
Name | Tax ID | Telephone
J Smith DBA Lime Cons. | 228-02-1975 | 6173380300
Williams & Co. C/O Bill | 025-37-1888 | 415-392-2000
1st Natl Provident | 34-2671434 | 3380321
HP 15 State St. | 508-466-1200 | Orlando

Example: information buried in free-form descriptions
WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
WING ASSEMBY, USE 5J868-A HEX BOLT.25 - DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX.25) - DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)

19-84-103 RS232 Cable 6' M-F CandS
CS-89641 6 ft. Cable Male-F, RS232 #87951
C&SUCH6 Male/Female 25 PIN 6 Foot Cable

Example: no consistent identifier for one entity
90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. Bostno MA 04106
Data Quality Dependency
- CRM, AFM
- ScoreCard, Behavior Trending
- KPIs etc
- Core Business Transaction Data
Consistency, Completeness, Accuracy, Uniformity
Use a Data Quality Methodology
Phases: Knowing, Cleansing, Retaining
Steps: Profiling, Standardization, Matching, Survivorship

Profiling
- Obtains 100% visibility into actual data condition
- Analyzes single-domain as well as free-form fields
- Generates frequency counts of unique values
- Uncovers trends and discrepancies

Standardization
- Parses free-form fields
- Standardizes data
- Incorporates business or industry standards
- Applies phonetic coding to key words

Matching
- Matches identical or near-identical entities within one or more files using proven techniques
- Creates a consolidated view of an entity
- Establishes cross-references

Survivorship
- Creates a Best of Breed representation of the data
- Cross-populates multiple data files with best of breed values
- Resolves conflicting data values based on user-defined business rules
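The standardization, matching and survivorship steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling behind the methodology: the abbreviation table, the matching rule in `match_key` and the "longest value wins" survivorship rule are hypothetical, and the sample records are made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table standing in for business or industry standards.
ABBREVIATIONS = {"AVENUE": "AVE", "STREET": "ST", "STR": "ST", "ROAD": "RD"}

def standardize(value):
    """Upper-case, replace punctuation with spaces, apply the abbreviation table."""
    words = re.sub(r"[^A-Z0-9 ]", " ", value.upper()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def match_key(record):
    """Assumed matching rule: same standardized address and same postcode."""
    return (standardize(record["address"]), record["postcode"])

def survive(records):
    """Assumed survivorship rule: per field, keep the longest non-empty value."""
    fields = {f for r in records for f in r}
    return {f: max((r.get(f, "") for r in records), key=len) for f in sorted(fields)}

def consolidate(records):
    """Group records that share a match key, then build one surviving record per group."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [survive(group) for group in groups.values()]

# Made-up sample records for illustration only.
records = [
    {"name": "K. Roberts", "address": "416 Columbus Avenue", "postcode": "02116", "phone": ""},
    {"name": "Catherine Roberts", "address": "416 COLUMBUS AVE.", "postcode": "02116", "phone": "6173380300"},
    {"name": "J Smith", "address": "15 State St", "postcode": "02109", "phone": ""},
]
golden = consolidate(records)
```

Real matching usually tolerates near-identical values rather than requiring an exact key, so the exact-key grouping here is the simplest possible stand-in.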
Tested & Proven Methodology
1. Data Extraction & Metadata Capture
2. Data Audit & Profiling
3. Data Cleansing
4. Data Mapping Alignment & Conversion
5. Data Validation
6. Test Migration
7. Data Load

Process flow
- Source 1, Source 2, Source 3: extracted flat files in Staging Area
- Master Data: Asset type, Addresses, Region, State, Daerah, PostCode, City, Tmn / Bdr, Jln / Lrg, Rates, etc.
- Data Cleansing: rejected or unmatched data will be sent for Clerical Review; review and fine-tune cleansing rules
- Mapping Criteria: Common Target Format
- Cleansed, consolidated and merged data goes to UAT and Test Case
- Pass UAT and Test Case: populate and load cleansed / merged data into Staging DB
- Fail UAT and Test Case: re-evaluate Mapping Criteria and repeat step 4 onwards
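The routing of rejected or unmatched data to Clerical Review can be sketched as follows. This is a minimal illustration, assuming two hypothetical cleansing rules (state must exist in master data, postcode must be five digits) and a tiny made-up master list; the real rules would come from the Master Data named on the slide.

```python
# Hypothetical master data: a small subset of valid state names.
MASTER_STATES = {"SELANGOR", "KEDAH", "PAHANG", "JOHOR", "MELAKA"}

def cleanse(record):
    """Return (cleansed_record, None), or (None, reason) when a rule fails."""
    state = record.get("state", "").strip().upper()
    postcode = record.get("postcode", "").strip()
    if state not in MASTER_STATES:
        return None, "state not in master data"
    if not (postcode.isdigit() and len(postcode) == 5):
        return None, "postcode is not 5 digits"
    return {**record, "state": state, "postcode": postcode}, None

def run_cleansing(records):
    """Split records into cleansed output and a clerical review queue."""
    cleansed, clerical_review = [], []
    for record in records:
        result, reason = cleanse(record)
        if result is None:
            clerical_review.append({"record": record, "reason": reason})
        else:
            cleansed.append(result)
    return cleansed, clerical_review

# Made-up sample records for illustration only.
sample = [
    {"id": 1, "state": "selangor ", "postcode": "41200"},
    {"id": 2, "state": "SELANGOR D.E.", "postcode": "47300"},
    {"id": 3, "state": "Kedah", "postcode": "8000"},
]
cleansed, clerical_review = run_cleansing(sample)
```

Keeping the failure reason next to each rejected record gives reviewers what they need to fine-tune the cleansing rules before the next pass.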
Comprehensive Data Profiling Process
Sample of Data Profiling outputs Discovering Quality of Data
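A profiling output of this kind can be approximated with frequency counts of values and of value patterns. The sketch below is an assumption about what such a report might contain, not the output of the profiling tool used in the talk; the `pattern` masking convention (letters to A, digits to 9) and the report keys are hypothetical. The sample values are telephone-like strings from the common data problems slide, plus one empty string added for illustration.

```python
import re
from collections import Counter

def pattern(value):
    """Mask a value: letters become A, digits become 9, other characters are kept."""
    masked = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", masked)

def profile(values):
    """Build a small profile of one column: fill rate, distinct values, patterns."""
    non_empty = [v for v in values if v.strip()]
    return {
        "count": len(values),
        "empty": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "value_counts": Counter(non_empty).most_common(),
        "pattern_counts": Counter(pattern(v) for v in non_empty).most_common(),
    }

telephones = ["6173380300", "415-392-2000", "3380321", "508-466-1200", ""]
report = profile(telephones)
```

Pattern counts tend to expose format inconsistencies faster than raw value counts, because many distinct values collapse into a handful of shapes.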
CASE: Fire Risk Accumulation
Ideal situation
- Knowing location of insured assets
- Accumulate Sum Insured
- Monitor against Risk Thresholds
CASE: Fire Risk Accumulation - tokenized
Key data: Insured Asset Address
Input column: ADDRESS
Output columns: RESIDUAL_ADD, BUILDING, ROAD, GARDEN, POSTCODE, CITY, STATE

Input: LOT XXX, KOMPLEKS CAYMAN, 08000 SUNGAI PETANI, KEDAH
Output: LOT XXX KOMPLEKS CAYMAN 08000 SUNGAI PETANI KEDAH

Input: XXX, KOMPLEKS PKNP, KG. TEKEK, PULAU TIOMAN, 26800 ROMPIN, PAHANG DARUL MAKMUR
Output: XXX,,KG TEKEK,PULAU TIOMAN KOMPLEKS PKNP 26800 ROMPIN PAHANG

Input: LOT NO TXXX7 TYPE THE GRAND PHASE CP6C 41200 KLANG SELANGOR
Output: LOT NO TXXX7 TYPE THE GRAND PHASE CP6C 41200 KLANG SELANGOR

Input: XXX46 1ST FLOOR THE MINES SHOPPING FAIR 43300 SERI KEMBANGAN, SELANGOR
Output: XXX46 1ST FLOOR THE MINES SHOPPING FAIR 43300 SERI KEMBANGAN SELANGOR

Input: LOT NO.LXXX7 THE MINES SHOPPING FAIR JALAN DULANG, MINES RESORT CITY SERI KEMBANGAN
Output: LOT NO LXXX7 THE MINES SHOPPING FAIR JALAN DULANG,MINES RESORT CITY 43300 SERI KEMBANGAN SELANGOR

Input: UNIT NO XXX TYPE A-2 BLOCK A THE REEF 48000 RAWANG SELANGOR
Output: UNIT NO XXX TYPE A-2 BLK A THE REEF 48000 RAWANG SELANGOR

Input: SUITE XXX01, 11TH FLOOR, WISMA HANGSAM, 1, JALAN HANG LEKIR, 50000 KUALA LUMPUR.
Output: STE XXX01,11TH FLOOR,,1, WISMA HANGSAM JALAN HANG LEKIR 50000 KUALA LUMPUR WP KUALA LUMPUR

Input: BLOCK XXX3-7 MENARA CITY ONE JALAN MUNSHI ABDULLAH KUALA LUMPUR
Output: BLK XXX3-7 MENARA CITY ONE JALAN MUNSHI ABDULLAH 50000 KUALA LUMPUR WP KUALA LUMPUR

Input: NO XXX6-8 & XXX7-8 MENARA CITY ONE JLN MUNSHI ABDULLAH 50100 WILAYAH PERSEKUTUAN K.LUMPUR.
Output: NO XXX6-8 & XXX7-8 MENARA CITY ONE JLN MUNSHI ABDULLAH WILAYAH PERSEKUTUAN K LUMPU 50100 WP KUALA LUMPUR

Input: XXX2-03 MENARA CITY ONE NO 3 JALAN MUNSHI ABDULLAH 50100 KUALA LUMPUR WILAYAH PERSEKUTUAN
Output: XXX2-03 NO 3 MENARA CITY ONE JALAN MUNSHI ABDULLAH 50100 KUALA LUMPUR WP KUALA LUMPUR

Input: XXX3 PLAZA SEE HOY CHAN, JALAN RAJA CHULAN 50200 KUALA LUMPUR
Output: XXX 3, PLAZA SEE HOY CHAN JALAN RAJA CHULAN 50200 KUALA LUMPUR WP KUALA LUMPUR

Input: XXX7-3 MENARA ANTARA, NO 11 JALAN BUKIT CEYLON 50200 KUALA LUMPUR
Output: XXX7-3,NO 11 MENARA ANTARA JALAN BUKIT CEYLON 50200 KUALA LUMPUR WP KUALA LUMPUR

Input: CP58, SUITE XXX05-06 18TH FLOOR CENTRAL PLAZA 34 JALAN SULTAN ISMAIL 50250 KUALA LUMPUR
Output: CP58,STE XXX05-06 18TH FLOOR CENTRAL PLAZA 34 JALAN SULTAN ISMAIL 50250 KUALA LUMPUR WP KUALA LUMPUR

Input: LOT XXX9, GROUND FLOOR THE MALL NO 100 JALAN PUTRA 50350 KUALA LUMPUR
Output: LOT XXX9,GROUND FLOOR NO 100 THE MALL JALAN PUTRA 50350 KUALA LUMPUR WP KUALA LUMPUR

Input: XXXH FOOLR MENARA TH SELBORN NO.153 JALAN TUN RAZAK 50400 KUALA LUMPUR
Output: 11XXXH FOOLR NO 153 MENARA TH SELBORN JALAN TUN RAZAK 50400 KUALA LUMPUR WP KUALA LUMPUR

Input: NO 27-7 MENARA PERMATA DAMANSARA NO 685 JALAN DAMANSARA 60000 KUALA LUMPUR
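The tokenization shown on this slide can be hinted at with a much simplified Python sketch. It is an assumption-laden illustration, not the parser that produced the slide: it treats the last 5-digit number as the postcode, takes anything starting with JALAN or JLN up to the next comma as the road, and puts everything else before the postcode into the residual. It does not separate BUILDING, GARDEN or STATE, which would need reference data.

```python
import re

POSTCODE = re.compile(r"\b\d{5}\b")
ROAD = re.compile(r"\b(?:JALAN|JLN)\b[^,]*")

def tokenize(address):
    """Split a free-form address into residual, road, postcode and city."""
    text = " ".join(address.upper().split()).rstrip(".")
    matches = list(POSTCODE.finditer(text))
    if not matches:
        return None  # no postcode found: leave the record for manual handling
    last = matches[-1]  # assume the last 5-digit number is the postcode
    before, after = text[:last.start()], text[last.end():]
    road_match = ROAD.search(before)
    road = road_match.group(0).strip() if road_match else ""
    residual = before.replace(road, "") if road else before
    return {
        "residual": residual.strip(" ,"),
        "road": road,
        "postcode": last.group(0),
        "city": after.strip(" ,."),
    }

tokens = tokenize("XXX3 PLAZA SEE HOY CHAN, JALAN RAJA CHULAN 50200 KUALA LUMPUR")
```

Addresses without a postcode return `None` here, which corresponds to the cases the slide fills in from reference data (for example the rows where 43300 or 50000 appears only in the output).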
CASE: Fire Risk Accumulation
- Structured and complete addresses
- Grouping by address
- Accumulating by location
- Decision Support
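Once addresses are structured, grouping and accumulating reduce to a keyed sum. The sketch below assumes the grouping key is (building, postcode) and that a single threshold applies to every location; both are hypothetical choices for illustration, and the sum insured figures are invented.

```python
from collections import defaultdict

def accumulate(policies, threshold):
    """Sum the sum insured per (building, postcode) and flag totals above the threshold."""
    totals = defaultdict(float)
    for policy in policies:
        totals[(policy["building"], policy["postcode"])] += policy["sum_insured"]
    return {
        key: {"total": total, "breach": total > threshold}
        for key, total in totals.items()
    }

# Invented figures; building names and postcodes echo the tokenized addresses.
policies = [
    {"building": "MENARA CITY ONE", "postcode": "50100", "sum_insured": 4_000_000},
    {"building": "MENARA CITY ONE", "postcode": "50100", "sum_insured": 7_500_000},
    {"building": "PLAZA SEE HOY CHAN", "postcode": "50200", "sum_insured": 2_000_000},
]
exposure = accumulate(policies, threshold=10_000_000)
```

The result depends entirely on the cleansing step: two spellings of the same building would produce two keys and hide the accumulation.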
CASE: Fire Risk Accumulation - partial
Key data: Insured Asset Address
Input column: ADDRESS
Output columns: RESIDUAL_ADD, BUILDING, ROAD, GARDEN, POSTCODE, CITY, STATE

Input: UNIT NO XXX TAMAN SRI PUTRA P T NO 337 IN THE 41000 PEKAN OF KLANG SELANGOR
Output: TAMAN SRI PUTRA P T THE PEKAN OF NO 337 IN 41000 KLANG SELANGOR UNIT NO XXX

Input: NO.X XXTH MILE, OFF THE FEDERAL HIGHWAY 47300 PETALING JAYA, SELANGOR DARUL EHSAN
Output: NO X XXTH MILE,OFF THE FEDERAL HWY 47300 PETALING SELANGOR

Input: PARCEL LOT NO RS/GXXX RETAIL SHOP(GROUND FLOOR) PLAZA SRI MUDA SEKSYEN 25 SHAH ALAM SELANGOR
Output: PARCEL LOT NO RS/GXXX RETAIL SHOP GROUND FLOOR SEKSYEN 25 PLAZA SRI MUDA 40000 SHAH ALAM SELANGOR

Input: PARCEL LOT NO RS/GXXX RETAIL SHOP(GROUND FLOOR) (PLAZA SRI MUDA SEKSYEN 25 SHAH ALAM SELANGOR)
Output: PARCEL LOT NO RS/GXXX RETAIL SHOP GROUND FLOOR SEKSYEN 25 PLAZA SRI MUDA 40000 SHAH ALAM SELANGOR

Input: XX, PLAZA PUCHONG JALAN PUCHONG MESRA 158200 KUALA LUMPUR W.P KUALA LUMPUR
Output: XX, PLAZA PUCHONG JALAN PUCHONG MESRA 1 KUALA LUMPUR W P 58200 KUALA LUMPUR WP KUALA LUMPUR

Input: NO.BXXXX PLAZA MOUNT KIARA NO.2 JALAN KIARA KUALA LUMPUR GM 6147 LOT 56054 MK BATU 50480 KL
Output: JALAN KIARA KUALA LUMPUR KUALA NO BXXXX NO 2 PLAZA MT KIARA GM 6147 LOT 56054 MK BATU 50480 LUMPUR WP KUALA LUMPUR

Input: THE LEGENDS GOLF & COUNTRY RESORT, LOT XXXX, KEBUN SEDENAK, P.O. BOX 11, KULAI, JOHOR
Output: THE LEGENDS GOLF& LOT XXXX,KEBUN SEDENAK,P O BOX 11 COUNTRY RESORT, 81000 KULAI JOHOR

Input: XXX & XXX FLOOR, KOMPLEKS YAYASAN BELIA SEDUNIA (WYF COMPLEX), LEBOH AYER KEROH, 75450 MELAKA
Output: XXX& XXX FLOOR,, KOMPLEKS YAYASAN BELIA SEDUNIA WYF COMPLEX LEBUH 75450 AYER KEROH MELAKA
CASE: Fire Risk Accumulation - data acquisition
Key data: Insured Asset Address
Input column: ADDRESS
Output columns: RESIDUAL_ADD, BUILDING, ROAD, GARDEN, POSTCODE, CITY, STATE

Input: HS(D) 95160 PT8007 MUKIM OF RASAH DISTRICT OF SEREMBAN 70000 NEGERI SEMBILAN
Output: HS D 95160 PT8007 MUKIM OF RASAH DISTRICT OF 70000 SEREMBAN NEGERI SEMBILAN

Input: H.S (D) 36900 PT 32027 MK KAJANG LOT XX BLK E KWS PERUSAHAAN BKT ANGKAT SG CHUA 43000 KAJANG
Output: H S D 36900 PT 32027 MK KAJANG LOT XX BLK E KWS PERUSAHAAN BKT ANGKAT SG CHUA 43000 KAJANG SELANGOR

Input: H.S.(D) 36901 PT 32028 MK KAJANG LOT XX BLK E KAW PERUSAHAAN BKT ANGKAT SG CHUA 43000 KAJANG
Output: H S D 36901 PT 32028 MK KAJANG LOT XX BLK E KAW PERUSAHAAN BKT ANGKAT SG CHUA 43000 KAJANG SELANGOR

Input: HS(D)70271 PT NO PARCEL 66XXX HS(D)70271 PT1580 41000 KLANG SELANGOR DARUL EHSAN
Output: HS D 70271 PT NO PARCEL 66XXX HS D 70271 PT1580 41000 KLANG SELANGOR
CASE: Fire Risk Accumulation
CASE: Fire Risk Accumulation - data enhancement
Geo-coding locations
Latitude: 37.775837
Longitude: -122.39557
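With geo-coded locations, accumulation no longer depends on matching address text: insured assets can be summed within a radius of a point. The slide does not say how distances are computed, so the sketch below is an assumption that uses the haversine great-circle formula with a mean Earth radius of 6371 km; the asset records and the `sum_insured_within` helper are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def sum_insured_within(center, assets, radius_km):
    """Total sum insured of assets located within radius_km of center (lat, lon)."""
    return sum(
        asset["sum_insured"]
        for asset in assets
        if haversine_km(center[0], center[1], asset["lat"], asset["lon"]) <= radius_km
    )

# The first point is the coordinate pair shown on the slide; the rest is invented.
center = (37.775837, -122.39557)
assets = [
    {"lat": 37.775837, "lon": -122.39557, "sum_insured": 100},
    {"lat": 38.775837, "lon": -122.39557, "sum_insured": 50},
]
nearby_total = sum_insured_within(center, assets, radius_km=1.0)
```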
COMMERCIAL BREAK
Established 2003 Data Warehouse ETL Data Quality Assurance Data Cleansing
Data Re-Engineering
- Data Cleansing: Data Discovery, Data Standardization, Data Matching, Data Survival
- Data Migration: Data Extraction, Data Cleansing, Data Conversion, Data Loading

Input File (Address Line 1, Address Line 2):
639 N MILLS AVENUE ORLANDO, FLA 32803
306 W MAIN STR, CUMMING, GA 30130
3142 WEST CENTRAL AV TOLEDO OH 43606
843 HEARD AVE AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234 AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 EVANS GA 30809
1775 RUSSELL CIRCLE MILLIS MASSACH USETTS 02038

Retain the Best Information

Result File:
House # | Dir | Str. Name | Type | Unit No. | NYSIIS | City | SOUNDEX | State | Zip | ACCT#
639 | N | MILLS | AVE | | MAL | ORLANDO | O645 | FL | 32803 |
306 | W | MAIN | ST | | MAN | CUMMING | C552 | GA | 30130 |
3142 | W | CENTRAL | AVE | | CANTRAL | TOLEDO | T430 | OH | 43606 |
843 | | HEARD | AVE | | HAD | AUGUSTA | A223 | GA | 30904 |
1139 | | GREENE | ST | | GRAN | AUGUSTA | A223 | GA | 30901 | 1234
4275 | | OWENS | RD | STE 536 | ON | EVANS | E152 | GA | 30809 |
1775 | | RUSSELL | CIR | | RASAL | MILLIS | L260 | MA | 02038 |

Figure: Data Weights Histogram (# of Pairs by match weight, UnMatched vs Matched)
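The SOUNDEX column in the result file is a phonetic code of the city name. A minimal sketch of the standard American Soundex rules is given below, assuming that is the variant behind the slide; the NYSIIS column uses a different algorithm that is not shown here.

```python
# Letter-to-digit table of standard American Soundex.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return the 4-character Soundex code of a word, or '' if it has no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    return ("".join(out) + "000")[:4]

cities = ["ORLANDO", "CUMMING", "TOLEDO", "AUGUSTA", "EVANS"]
codes = {city: soundex(city) for city in cities}
```

For the five cities above the sketch reproduces the codes in the table. For MILLIS it returns M420 rather than the L260 shown in the table.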
Data Re-Engineering Professional Services
- Data Warehouse Infrastructure
- Data Warehouse Design & Build
- Enterprise Data Integration
- Data Cleansing
FINANCIAL CUBES, HR CUBES, BI Data Repository
Data Quality, Cleansing & ETL
- TMB Subscriber Profile Cleansing; CRM source data cleansing and filtering
- DBKL source data cleansing / migration
- CIMB 1Platform Data Warehouse migration
- BIMB Core Banking System Data Cleansing / Re-Org
- CIMB Aviva EDW ETL and Cleansing
- Celcom Data Quality Profiling for DWH
Data Quality, Cleansing & ETL
Local and Regional Clients
- CIMB EDW Platform migration
- CELCOM (IBM) Data Quality Profiling for EDW
- DBKL (IAC) SAP Data Cleansing & Migration (MY)
- CIMB/Aviva (ACT) Data Quality Assessment, Cleansing, ETL (MY)
- CIMB Bank (IBM) ETL, DataStage Enterprise upgrade (MY)
- Maxis (IBM) ETL, Dealer Incentive Analysis (MY)
- CIMB Bank ETL, Data Profiling and Data Cleansing (MY)
- Bank Islam Data Cleansing (MY)
- Telekom Malaysia, Data Cleansing Customer Segmentation, Group Marketing (MY)
- Hutchinson Indonesia (ACW, Aus) - Data Warehouse Call Behavior (Ind)
- Telekom Malaysia, Data Cleansing (Accenture) - icare (MY)
- Telekom Malaysia, Data Mart Call Usage (MY)
- LHDN (IBM) - Data Profiling and Cleansing (MY)
- General Hospital, CDC (IBM) - Data Cleansing (MY)
- Brunei Prime Minister Office, CRM (BWN)
- Bernas, HR/Payroll (MY)
- Maxis, B.I. (MY)
- Thai Farmer Bank (IBM), ETL infra (TH)
- Bank of Thailand (IBM), ETL infra (TH)
(Partial client list, in reverse chronological order)
Thank You