oracle_goldengate_hadoop_hdfs

=====Configuration Destination=====
In this section we will replicate from Oracle -> GoldenGate -> HDFS.
  
{{:hdfs_diagram.jpg?400|}}
  
  
<Code:bash|Create HDFS Properties File (TRG_GGHOME/dirprm/hdfs.properties)>
...
</Code>
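The properties file itself is not visible in this comparison view. As a rough sketch only, assuming the standard GoldenGate for Big Data HDFS handler with the JSON formatter (property names come from the Oracle sample hdfs.properties; the classpath, mode and other values below are assumptions, not the actual file):

<Code:bash|hdfs.properties (illustrative sketch, not the actual file)>
# Handler selection and type
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.mode=tx

# File layout in HDFS (rootFilePath matches the /user/oracle paths seen later on this page)
gg.handler.hdfs.rootFilePath=/user/oracle
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.fileSuffix=.json

# Rollover behaviour (these are the properties we will play with below)
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rollOnMetadataChange=true

# Output format (pkUpdateHandling defaults to abend; we will change it later)
gg.handler.hdfs.format=json
#gg.handler.hdfs.format.pkUpdateHandling=abend

# Java/Hadoop plumbing (paths are assumptions)
gg.classpath=/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/etc/hadoop/
goldengate.userexit.writers=javawriter
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar
</Code>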
  
=====Modify Configuration=====
Let's test the new configuration by setting the following in $TRG_GGHOME/dirprm/hdfs.properties:

  * gg.handler.hdfs.rollOnMetadataChange=false
  * gg.handler.hdfs.format.pkUpdateHandling=update
  
That will force the HDFS handler NOT to create a new JSON file on metadata change, and to write updates to primary key columns as regular update records. Once we make that change, we have to restart the replicat.
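The restart itself is not shown in this comparison view. A minimal sketch from GGSCI, assuming the replicat is called RHDFS (a hypothetical name; the actual one is not visible here):

<Code:bash|Restart the Replicat (illustrative)>
[oracle@edvmr1p0 oggtrg]$ ./ggsci

GGSCI (edvmr1p0) 1> stop replicat rhdfs

Sending STOP request to REPLICAT RHDFS ...
Request processed.

GGSCI (edvmr1p0) 2> start replicat rhdfs

Sending START request to MANAGER ...
REPLICAT RHDFS starting

GGSCI (edvmr1p0) 3> info replicat rhdfs
</Code>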
  
Now, if we add a new column or modify the data, the handler will not roll over to a new JSON file until we reach the 1G file size defined by:

  * gg.handler.hdfs.maxFileSize=1g
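If we want to keep an eye on how close the current file is getting to that limit, the file sizes can be listed in human-readable form (same path as the listings below):

<Code:bash|Check file sizes (optional)>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -du -h /user/oracle/oggsrc.product_prod
</Code>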

The following changes should still result in one new JSON file, since this is the first data written after the replicat restart:

<Code:bash|Create new Column>
SQL> alter table product_prod add third_column varchar2(20);

Table altered.

SQL> update product_prod set third_column='Test third' where product_id=4;

1 row updated.

SQL> commit;

Commit complete.

SQL>
</Code>

<Code:bash|Check HDFS files>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -ls /user/oracle/oggsrc.product_prod
Found 3 items
-rw-r--r--   1 oracle supergroup       1230 2020-11-11 14:36 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-31-40.452.json          <- JSON from the first load
-rw-r--r--   1 oracle supergroup        699 2020-11-11 14:55 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-36-48.627.json          <- JSON from when I added the "ONE_MORE_COLUMN" previously
-rw-r--r--   1 oracle supergroup        394 2020-11-11 14:57 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json          <- JSON after I changed the configuration to not roll over on table structure modification
[oracle@edvmr1p0 oggtrg]$
</Code>

After this point, whatever we do to the table, the same JSON file should keep growing as long as we don't exceed the 1G size limit. Let's test it.
We can examine the addition of the new column with a simple cat of the JSON:

<Code:bash|Check the JSON>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -cat /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json

{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:57:31.000115",
    "current_ts":"2020-11-11T14:57:34.807000",
    "pos":"00000000020000006904",
    "PRODUCT_ID":3,
    "PRODUCT_CODE":"OKI26BPY3QK",
    "PRODUCT_NAME":"lorem, auctor quis, tristique ac, eleifend vitae,",
    "SUPPLIER_ID":3,
    "ONE_MORE_COLUMN":null,
    "SECOND_COLUMN":"Data for second"
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:59:44.000139",
    "current_ts":"2020-11-11T14:59:49.201000",
    "pos":"00000000020000009296",
    "PRODUCT_ID":4,
    "PRODUCT_CODE":"SXV14CFH8FL",
    "PRODUCT_NAME":"vehicula risus. Nulla eget metus eu erat",
    "SUPPLIER_ID":4,
    "ONE_MORE_COLUMN":"More data",
    "SECOND_COLUMN":null,
    "THIRD_COLUMN":"Test third"
}
</Code>

So now, if I add a new column (yet again), it SHOULDN'T create a new JSON file. While we are at it, let's add a PK to the table as well.

<Code:bash|Create PK & new column (yet again)>
SQL> alter table product_prod add constraint product_prod_pk primary key(PRODUCT_ID);

Table altered.

SQL> alter table product_prod add fourth_column number;

Table altered.

SQL> update product_prod set fourth_column=1 where product_id=1;

1 row updated.

SQL> commit;

Commit complete.

SQL> exit
</Code>

Now we can check the JSON files again:

<Code:bash|Check if HDFS created a new JSON>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -ls /user/oracle/oggsrc.product_prod
Found 3 items
-rw-r--r--   1 oracle supergroup       1230 2020-11-11 14:36 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-31-40.452.json
-rw-r--r--   1 oracle supergroup        699 2020-11-11 14:55 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-36-48.627.json
-rw-r--r--   1 oracle supergroup        394 2020-11-11 14:57 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json
[oracle@edvmr1p0 oggtrg]$
</Code>

No new JSON file. Let's examine the last one again:

<Code:bash|Examine HDFS>
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -cat /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json

{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:57:31.000115",
    "current_ts":"2020-11-11T14:57:34.807000",
    "pos":"00000000020000006904",
    "PRODUCT_ID":3,
    "PRODUCT_CODE":"OKI26BPY3QK",
    "PRODUCT_NAME":"lorem, auctor quis, tristique ac, eleifend vitae,",
    "SUPPLIER_ID":3,
    "ONE_MORE_COLUMN":null,
    "SECOND_COLUMN":"Data for second"                   <- When I added the 2nd column
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:59:44.000139",
    "current_ts":"2020-11-11T14:59:49.201000",
    "pos":"00000000020000009296",
    "PRODUCT_ID":4,
    "PRODUCT_CODE":"SXV14CFH8FL",
    "PRODUCT_NAME":"vehicula risus. Nulla eget metus eu erat",
    "SUPPLIER_ID":4,
    "ONE_MORE_COLUMN":"More data",
    "SECOND_COLUMN":null,
    "THIRD_COLUMN":"Test third"                   <- When I added the 3rd column
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 15:02:49.000101",
    "current_ts":"2020-11-11T15:02:54.274000",
    "pos":"00000000020000013152",
    "PRODUCT_ID":1,
    "FOURTH_COLUMN":1                  <- When I added the 4th column and PK
}
</Code>

Do you notice that the last record has far fewer values than the rest? That is because I created a PK, so the handler was no longer forced to include all columns as a pseudo key and instead used the actual PK (PRODUCT_ID) :)
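For contrast, here is a rough illustration (not captured output; every value except FOURTH_COLUMN is just a placeholder) of what the same update would have looked like before the PK existed, when every column is carried along as part of the pseudo key:

<Code:bash|Same update without a PK (illustration only)>
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"...",
    "current_ts":"...",
    "pos":"...",
    "PRODUCT_ID":1,
    "PRODUCT_CODE":"...",
    "PRODUCT_NAME":"...",
    "SUPPLIER_ID":"...",
    "ONE_MORE_COLUMN":"...",
    "SECOND_COLUMN":"...",
    "THIRD_COLUMN":"...",
    "FOURTH_COLUMN":1
}
</Code>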
  