oracle_goldengate_hadoop_hdfs

=====Configuration Destination=====
In this section we will replicate from Oracle -> GoldenGate -> HDFS.
  
{{:hdfs_diagram.jpg?400|}}
  
  
<Code:bash|Create HDFS Properties File (TRG_GGHOME/dirprm/hdfs.properties)>
...
</Code>
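The properties file itself is not visible in this comparison view. As a rough sketch only, assuming the standard GoldenGate for Big Data HDFS handler with the JSON formatter (property names come from the Oracle sample hdfs.properties; the classpath, mode and other values below are assumptions, not the actual file):

<Code:bash|hdfs.properties (illustrative sketch, not the actual file)>
# Handler selection and type
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.mode=tx

# File layout in HDFS (rootFilePath matches the /user/oracle paths seen later on this page)
gg.handler.hdfs.rootFilePath=/user/oracle
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.fileSuffix=.json

# Rollover behaviour (these are the properties we will play with below)
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rollOnMetadataChange=true

# Output format (pkUpdateHandling defaults to abend; we will change it later)
gg.handler.hdfs.format=json
#gg.handler.hdfs.format.pkUpdateHandling=abend

# Java/Hadoop plumbing (paths are assumptions)
gg.classpath=/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/etc/hadoop/
goldengate.userexit.writers=javawriter
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar
</Code>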
  
=====Modify Configuration=====
Let's test the new configuration by setting the following in $TRG_GGHOME/dirprm/hdfs.properties:

  * gg.handler.hdfs.rollOnMetadataChange=false
  * gg.handler.hdfs.format.pkUpdateHandling=update
  
That will force the HDFS handler NOT to create a new JSON file on metadata change, and to write updates to primary key columns as regular update records. Once we make that change, we have to restart the replicat.
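The restart itself is not shown in this comparison view. A minimal sketch from GGSCI, assuming the replicat is called RHDFS (a hypothetical name; the actual one is not visible here):

<Code:bash|Restart the Replicat (illustrative)>
[oracle@edvmr1p0 oggtrg]$ ./ggsci

GGSCI (edvmr1p0) 1> stop replicat rhdfs

Sending STOP request to REPLICAT RHDFS ...
Request processed.

GGSCI (edvmr1p0) 2> start replicat rhdfs

Sending START request to MANAGER ...
REPLICAT RHDFS starting

GGSCI (edvmr1p0) 3> info replicat rhdfs
</Code>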
  
Now, if we add a new column or modify the data, the handler will not roll over to a new JSON file until we reach the 1G file size defined by:

  * gg.handler.hdfs.maxFileSize=1g
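If we want to keep an eye on how close the current file is getting to that limit, the file sizes can be listed in human-readable form (same path as the listings below):

<Code:bash|Check file sizes (optional)>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -du -h /user/oracle/oggsrc.product_prod
</Code>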

The following changes should still result in one new JSON file, since this is the first data written after the replicat restart:

<Code:bash|Create new Column>
SQL> alter table product_prod add third_column varchar2(20);

Table altered.

SQL> update product_prod set third_column='Test third' where product_id=4;

1 row updated.

SQL> commit;

Commit complete.

SQL>
</Code>

<Code:bash|Check HDFS files>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -ls /user/oracle/oggsrc.product_prod
Found 3 items
-rw-r--r--   1 oracle supergroup       1230 2020-11-11 14:36 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-31-40.452.json          <- JSON from the first load
-rw-r--r--   1 oracle supergroup        699 2020-11-11 14:55 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-36-48.627.json          <- JSON from when I added the "ONE_MORE_COLUMN" previously
-rw-r--r--   1 oracle supergroup        394 2020-11-11 14:57 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json          <- JSON after I changed the configuration to not roll over on table structure modification
[oracle@edvmr1p0 oggtrg]$
</Code>

After this point, whatever we do to the table, the same JSON file should keep growing as long as we don't exceed the 1G size limit. Let's test it.
We can examine the addition of the new column with a simple cat of the JSON:

<Code:bash|Check the JSON>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -cat /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json

{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:57:31.000115",
    "current_ts":"2020-11-11T14:57:34.807000",
    "pos":"00000000020000006904",
    "PRODUCT_ID":3,
    "PRODUCT_CODE":"OKI26BPY3QK",
    "PRODUCT_NAME":"lorem, auctor quis, tristique ac, eleifend vitae,",
    "SUPPLIER_ID":3,
    "ONE_MORE_COLUMN":null,
    "SECOND_COLUMN":"Data for second"
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:59:44.000139",
    "current_ts":"2020-11-11T14:59:49.201000",
    "pos":"00000000020000009296",
    "PRODUCT_ID":4,
    "PRODUCT_CODE":"SXV14CFH8FL",
    "PRODUCT_NAME":"vehicula risus. Nulla eget metus eu erat",
    "SUPPLIER_ID":4,
    "ONE_MORE_COLUMN":"More data",
    "SECOND_COLUMN":null,
    "THIRD_COLUMN":"Test third"
}
</Code>

So now, if I add a new column (yet again), it SHOULDN'T create a new JSON file. While we are at it, let's add a PK to the table as well.

<Code:bash|Create PK & new column (yet again)>
SQL> alter table product_prod add constraint product_prod_pk primary key(PRODUCT_ID);

Table altered.

SQL> alter table product_prod add fourth_column number;

Table altered.

SQL> update product_prod set fourth_column=1 where product_id=1;

1 row updated.

SQL> commit;

Commit complete.

SQL> exit
</Code>

Now we can check the JSON files again:

<Code:bash|Check if HDFS created a new JSON>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -ls /user/oracle/oggsrc.product_prod
Found 3 items
-rw-r--r--   1 oracle supergroup       1230 2020-11-11 14:36 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-31-40.452.json
-rw-r--r--   1 oracle supergroup        699 2020-11-11 14:55 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-36-48.627.json
-rw-r--r--   1 oracle supergroup        394 2020-11-11 14:57 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json
[oracle@edvmr1p0 oggtrg]$
</Code>

No new JSON file. Let's examine the last one again:

<Code:bash|Examine HDFS>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -cat /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json

{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:57:31.000115",
    "current_ts":"2020-11-11T14:57:34.807000",
    "pos":"00000000020000006904",
    "PRODUCT_ID":3,
    "PRODUCT_CODE":"OKI26BPY3QK",
    "PRODUCT_NAME":"lorem, auctor quis, tristique ac, eleifend vitae,",
    "SUPPLIER_ID":3,
    "ONE_MORE_COLUMN":null,
    "SECOND_COLUMN":"Data for second"                   <- When I added the 2nd column
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:59:44.000139",
    "current_ts":"2020-11-11T14:59:49.201000",
    "pos":"00000000020000009296",
    "PRODUCT_ID":4,
    "PRODUCT_CODE":"SXV14CFH8FL",
    "PRODUCT_NAME":"vehicula risus. Nulla eget metus eu erat",
    "SUPPLIER_ID":4,
    "ONE_MORE_COLUMN":"More data",
    "SECOND_COLUMN":null,
    "THIRD_COLUMN":"Test third"                   <- When I added the 3rd column
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 15:02:49.000101",
    "current_ts":"2020-11-11T15:02:54.274000",
    "pos":"00000000020000013152",
    "PRODUCT_ID":1,
    "FOURTH_COLUMN":1                  <- When I added the 4th column and PK
}
</Code>

Do you notice that the last record has far fewer values than the rest? That is because I created a PK, so the handler was no longer forced to include all columns as a pseudo key and instead used the actual PK (PRODUCT_ID) :)
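For contrast, here is a rough illustration (not captured output; every value except FOURTH_COLUMN is just a placeholder) of what the same update would have looked like before the PK existed, when every column is carried along as part of the pseudo key:

<Code:bash|Same update without a PK (illustration only)>
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"...",
    "current_ts":"...",
    "pos":"...",
    "PRODUCT_ID":1,
    "PRODUCT_CODE":"...",
    "PRODUCT_NAME":"...",
    "SUPPLIER_ID":"...",
    "ONE_MORE_COLUMN":"...",
    "SECOND_COLUMN":"...",
    "THIRD_COLUMN":"...",
    "FOURTH_COLUMN":1
}
</Code>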
  