oracle_goldengate_hadoop_hdfs

=====Configuration Destination=====
In this section we will replicate from Oracle -> GoldenGate -> HDFS.

{{:hdfs_diagram.jpg?400|}}

<Code:bash|Create HDFS Properties File (TRG_GGHOME/dirprm/hdfs.properties)>
</Code>
  
Now bear in mind the following settings:

  * rollOnMetadataChange is set to true
  * pkUpdateHandling is set to delete-insert

This means that if the table structure changes (a column is added or removed), the HDFS handler will roll over to a new JSON file. Furthermore, because of the second parameter, an update does much the same thing: the record is deleted and a new one is inserted.
Also, since the table is without a PK, the HDFS handler is forced to track all the columns as a pseudo-PK.
We will modify these settings at the end of the section to see the different results.
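For reference, the behaviour described above is driven by the HDFS handler settings in hdfs.properties. A minimal, illustrative fragment is sketched below; the two settings above and maxFileSize are the properties discussed on this page, while the remaining names and values (such as rootFilePath) are assumptions, not a copy of the actual file:

```properties
# Illustrative fragment of TRG_GGHOME/dirprm/hdfs.properties -- not the complete file
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.format=json
gg.handler.hdfs.rootFilePath=/user/oracle
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.format.pkUpdateHandling=delete-insert
```

The rootFilePath value is an assumption that matches the /user/oracle paths shown later in this page.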
  
=====Create Replicat=====
<Code:bash|Check Replicat Status>
REPLICAT    RUNNING     HDFSRP      00:00:00      00:00:08    
</Code>

=====Test=====
To test, we can connect to the source database and issue an insert:

<Code:bash|Insert Data>
[oracle@hostname ~]$ sqlplus oggsrc/<pwd>@orcl

SQL*Plus: Release 12.1.0.2.0 Production on Sat Feb 4 16:04:27 2017

Copyright (c) 1982, 2014, Oracle. All rights reserved.

Last Successful login time: Sat Feb 04 2017 16:01:11 +11:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 -
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> insert into customer_prod select * from customer where customer_id < 1001;

1000 rows created.

SQL> commit;

Commit complete.

SQL>
</Code>

Check the replication:

<Code:bash|Check source and dest statistics>
--Check Extract
GGSCI (hostname) 2> send priex, stats

Sending STATS request to EXTRACT PRIEX ...

Start of Statistics at 2017-02-04 16:08:10.

Output to ./dirdat/in:

Extracting from OGGSRC.CUSTOMER_PROD to OGGSRC.CUSTOMER_PROD:

*** Total statistics since 2017-02-04 15:46:18 ***
Total inserts 1000.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 1000.00

GGSCI (hostname) 7>

--Check Replicat
GGSCI (hostname) 6> send hdfsrp, stats

Sending STATS request to REPLICAT HDFSRP ...

Start of Statistics at 2017-02-04 16:11:27.

Replicating from OGGSRC.CUSTOMER_PROD to oggsrc.CUSTOMER_PROD:
... Many lines omitted for brevity ...

*** Total statistics since 2017-02-04 16:05:11 ***
Total inserts 1000.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 1000.00

GGSCI (hostname) 7>
</Code>

We can check in the HDFS as well:

<Code:bash|Check in the HDFS>
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle
Found 1 items
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle/oggsrc.customer_prod
Found 1 items
-rw-r--r--   1 oracle supergroup     449586 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -cat /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json | head -100
{
    "table":"OGGSRC.CUSTOMER_PROD",
    "op_type":"I",
    "op_ts":"2020-11-11 14:06:05.000168",
    "current_ts":"2020-11-11T14:06:11.134000",
    "pos":"00000000000000002372",
    "CUSTOMER_ID":1,
    "FIRST_NAME":"Jaden",
    "LAST_NAME":"Poole",
    "EMAIL_ADDRESS":"[email protected]",
    "SSN":"16060407-7404",
    "ADDRESS":"3640 Ac St.",
    "CITY":"San Juan de Dios",
    "ZIP_CODE":"31014",
.............................................................
</Code>

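Note that the handler writes the operations back to back as individual JSON objects rather than as one JSON array, so a plain json.load() over the whole file would fail. A small sketch of how such a file can be parsed (a hypothetical helper, with the record shape based on the output above):

```python
import json

def iter_records(text):
    """Yield each JSON object from text that concatenates objects back to back."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        while idx < len(text) and text[idx].isspace():
            idx += 1  # skip whitespace between records
        if idx >= len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

# Two records shaped like the handler's output above
sample = (
    '{"table":"OGGSRC.CUSTOMER_PROD","op_type":"I","CUSTOMER_ID":1}\n'
    '{"table":"OGGSRC.CUSTOMER_PROD","op_type":"I","CUSTOMER_ID":2}\n'
)
records = list(iter_records(sample))
print(len(records))  # 2
```

In real use the text would come from `hdfs dfs -cat`, as shown above.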

With the current configuration, the HDFS handler will create a new directory (under HDFS) for each replicated table. For example, if I change a second table, I will receive two directories, as follows:

<Code:bash|Check HDFS Structure>
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle
Found 2 items
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod              <- For the customer_prod table
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:36 /user/oracle/oggsrc.product_prod               <- For the product_prod table
[oracle@edvmr1p0 oggsrc]$
</Code>

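The naming convention visible in these listings can be sketched as follows (an illustration of the observed pattern only, not GoldenGate's actual code; the function name is hypothetical):

```python
# Sketch of the observed layout: one directory per source table under the
# root path, with files named <schema>.<table>_<timestamp>.json inside it.
def hdfs_target(root, schema, table, ts):
    base = f"{schema}.{table}".lower()
    return f"{root}/{base}/{base}_{ts}.json"

path = hdfs_target("/user/oracle", "oggsrc", "customer_prod",
                   "2020-11-11_14-06-11.144")
print(path)
# /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json
```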
=====Modify Configuration=====
Let's test a new configuration by setting:

  * gg.handler.hdfs.rollOnMetadataChange=false
  * gg.handler.hdfs.format.pkUpdateHandling=update

in $TRG_GGHOME/dirprm/hdfs.properties.
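The two properties can be edited by hand or scripted; a sketch with sed, demonstrated here on a scratch file (in real use you would point it at $TRG_GGHOME/dirprm/hdfs.properties instead):

```shell
# Demo on a scratch file; substitute $TRG_GGHOME/dirprm/hdfs.properties in real use
PROPS=$(mktemp)
printf '%s\n' \
  'gg.handler.hdfs.rollOnMetadataChange=true' \
  'gg.handler.hdfs.format.pkUpdateHandling=delete-insert' > "$PROPS"

sed -i \
  -e 's/^gg\.handler\.hdfs\.rollOnMetadataChange=.*/gg.handler.hdfs.rollOnMetadataChange=false/' \
  -e 's/^gg\.handler\.hdfs\.format\.pkUpdateHandling=.*/gg.handler.hdfs.format.pkUpdateHandling=update/' \
  "$PROPS"

cat "$PROPS"
```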
This forces the HDFS handler NOT to create a new JSON file on metadata change or on update. Once we make that change, we have to restart the replicat:

<Code:bash|Restart Replicat>
[oracle@edvmr1p0 oggtrg]$ ggsci

Oracle GoldenGate Command Interpreter
Version 12.2.0.1.160823 OGGCORE_OGGADP.12.2.0.1.0_PLATFORMS_161019.1437
Linux, x64, 64bit (optimized), Generic on Oct 19 2016 16:01:40
Operating system character set identified as UTF-8.

Copyright (C) 1995, 2016, Oracle and/or its affiliates. All rights reserved.


GGSCI (edvmr1p0) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
REPLICAT    RUNNING     HDFSRP      00:00:00      00:00:03


GGSCI (edvmr1p0) 2> stop replicat hdfsrp

Sending STOP request to REPLICAT HDFSRP ...
Request processed.


GGSCI (edvmr1p0) 3> start hdfsrp

Sending START request to MANAGER ...
REPLICAT HDFSRP starting


GGSCI (edvmr1p0) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
REPLICAT    RUNNING     HDFSRP      00:00:00      00:00:04


GGSCI (edvmr1p0) 5>
</Code>

Now, if we add a new column or modify the data, a new JSON file will not be rolled over until we reach the 1 GB file size defined in:

  * gg.handler.hdfs.maxFileSize=1g

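Conceptually, the rollover decision after this change can be sketched like this (an illustrative model only, not GoldenGate's implementation):

```python
# Illustrative model of the file-rollover decision after the change:
# a DDL change alone no longer rolls the file, only the size limit does.
MAX_FILE_SIZE = 1 * 1024**3  # gg.handler.hdfs.maxFileSize=1g

def should_roll(file_size, metadata_changed, roll_on_metadata_change=False):
    if metadata_changed and roll_on_metadata_change:
        return True
    return file_size >= MAX_FILE_SIZE

print(should_roll(500_000, metadata_changed=True))       # False: DDL alone no longer rolls
print(should_roll(2 * 1024**3, metadata_changed=False))  # True: over the 1 GB limit
```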

The following operations will still result in a new JSON file, since the replicat was restarted:

<Code:bash|Create new Column>
SQL> alter table product_prod add third_column varchar2(20);

Table altered.

SQL> update product_prod set third_column='Test third' where product_id=4;

1 row updated.

SQL> commit;

Commit complete.

SQL>
</Code>

<Code:bash|Check HDFS files>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -ls /user/oracle/oggsrc.product_prod
Found 3 items
-rw-r--r--   1 oracle supergroup       1230 2020-11-11 14:36 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-31-40.452.json          <- JSON from the first load
-rw-r--r--   1 oracle supergroup        699 2020-11-11 14:55 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-36-48.627.json          <- JSON from when I added the "ONE_MORE_COLUMN" previously
-rw-r--r--   1 oracle supergroup        394 2020-11-11 14:57 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json          <- JSON after I changed the configuration to not roll over on table structure modification
[oracle@edvmr1p0 oggtrg]$
</Code>

After this point, whatever we do to the table, as long as we don't exceed the 1 GB size limit, the same file should simply grow. Let's test it.
We can examine the addition of the new column with a simple cat of the JSON:

<Code:bash|Check the JSON>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -cat /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json

{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:57:31.000115",
    "current_ts":"2020-11-11T14:57:34.807000",
    "pos":"00000000020000006904",
    "PRODUCT_ID":3,
    "PRODUCT_CODE":"OKI26BPY3QK",
    "PRODUCT_NAME":"lorem, auctor quis, tristique ac, eleifend vitae,",
    "SUPPLIER_ID":3,
    "ONE_MORE_COLUMN":null,
    "SECOND_COLUMN":"Data for second"
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:59:44.000139",
    "current_ts":"2020-11-11T14:59:49.201000",
    "pos":"00000000020000009296",
    "PRODUCT_ID":4,
    "PRODUCT_CODE":"SXV14CFH8FL",
    "PRODUCT_NAME":"vehicula risus. Nulla eget metus eu erat",
    "SUPPLIER_ID":4,
    "ONE_MORE_COLUMN":"More data",
    "SECOND_COLUMN":null,
    "THIRD_COLUMN":"Test third"
}
</Code>

So now, if I add a new column (yet again), it SHOULD NOT create a new JSON file. While we are at it, let's add a PK to the table as well.

<Code:bash|Create PK & new column (yet again)>
SQL> alter table product_prod add constraint product_prod_pk primary key(PRODUCT_ID);

Table altered.

SQL> alter table product_prod add fourth_column number;

Table altered.

SQL> update product_prod set fourth_column=1 where product_id=1;

1 row updated.

SQL> commit;

Commit complete.

SQL> exit
</Code>

Now, we can check again for the JSONs:

<Code:bash|Check if HDFS created a new JSON>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -ls /user/oracle/oggsrc.product_prod
Found 3 items
-rw-r--r--   1 oracle supergroup       1230 2020-11-11 14:36 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-31-40.452.json
-rw-r--r--   1 oracle supergroup        699 2020-11-11 14:55 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-36-48.627.json
-rw-r--r--   1 oracle supergroup        394 2020-11-11 14:57 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json
[oracle@edvmr1p0 oggtrg]$
</Code>

No new JSON; let's examine the last one again:

<Code:bash|Examine HDFS>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -cat /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json

{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:57:31.000115",
    "current_ts":"2020-11-11T14:57:34.807000",
    "pos":"00000000020000006904",
    "PRODUCT_ID":3,
    "PRODUCT_CODE":"OKI26BPY3QK",
    "PRODUCT_NAME":"lorem, auctor quis, tristique ac, eleifend vitae,",
    "SUPPLIER_ID":3,
    "ONE_MORE_COLUMN":null,
    "SECOND_COLUMN":"Data for second"                   <- When I added the 2nd column
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:59:44.000139",
    "current_ts":"2020-11-11T14:59:49.201000",
    "pos":"00000000020000009296",
    "PRODUCT_ID":4,
    "PRODUCT_CODE":"SXV14CFH8FL",
    "PRODUCT_NAME":"vehicula risus. Nulla eget metus eu erat",
    "SUPPLIER_ID":4,
    "ONE_MORE_COLUMN":"More data",
    "SECOND_COLUMN":null,
    "THIRD_COLUMN":"Test third"                   <- When I added the 3rd column
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 15:02:49.000101",
    "current_ts":"2020-11-11T15:02:54.274000",
    "pos":"00000000020000013152",
    "PRODUCT_ID":1,
    "FOURTH_COLUMN":1                  <- When I added the 4th column and the PK
}
</Code>

Do you notice that the last record has far fewer values than the rest? That is because I created a PK, so the HDFS handler was no longer forced to include all the columns (as a pseudo-PK) but could use the actual PK (PRODUCT_ID) instead :)

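In other words: without a PK, an update record must carry every column (the whole row acts as the key), whereas with a PK it only needs the key plus the changed columns. A toy illustration (hypothetical helper, consistent with the records above; not GoldenGate's actual code):

```python
# Toy model: which columns end up in an update record, with and without a PK.
def update_record(row, changed, pk_columns=None):
    if not pk_columns:  # no PK: all columns form the pseudo-key
        return dict(row)
    keep = set(pk_columns) | set(changed)
    return {col: val for col, val in row.items() if col in keep}

row = {"PRODUCT_ID": 1, "PRODUCT_CODE": "ABC", "SUPPLIER_ID": 1, "FOURTH_COLUMN": 1}
print(update_record(row, changed=["FOURTH_COLUMN"]))
print(update_record(row, changed=["FOURTH_COLUMN"], pk_columns=["PRODUCT_ID"]))
# The second call keeps only PRODUCT_ID and FOURTH_COLUMN
```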
  • Last modified: 2020/11/11 14:21
  • by andonovj