In this section we will replicate data from Oracle, through Oracle GoldenGate, into HDFS.
Create HDFS Properties File ($TRG_GGHOME/dirprm/hdfs.properties)
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rootFilePath=/user/oracle
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.fileSuffix=.json
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=json_row
gg.handler.hdfs.format.prettyPrint=true
gg.handler.hdfs.format.jsonDelimiter=CDATA[]
gg.handler.hdfs.format.pkUpdateHandling=delete-insert
gg.handler.hdfs.format.generateSchema=false
gg.handler.hdfs.mode=tx
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
gg.classpath=/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/etc/hadoop/:
jvm.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar
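A typo in this file usually only surfaces when the replicat starts, so it can help to sanity-check the key=value pairs up front. Below is a minimal sketch of such a check; the `parse_properties` helper is my own illustration, not part of GoldenGate:

```python
def parse_properties(text):
    """Parse simple Java-style key=value properties, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # split on the FIRST '=' only; values like -Djava.class.path=... contain '='
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

sample = """
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rollOnMetadataChange=true
"""
props = parse_properties(sample)
assert props["gg.handler.hdfs.type"] == "hdfs"
assert props["gg.handler.hdfs.maxFileSize"] == "1g"
```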
Now bear in mind two of the settings above: gg.handler.hdfs.rollOnMetadataChange=true and gg.handler.hdfs.format.pkUpdateHandling=delete-insert. The first means that if the table structure changes (a column is added or removed), the handler will roll over to a new JSON file. The second means that an update is written as a delete of the old record followed by an insert of the new one. Furthermore, since the table has no primary key, the handler is forced to track all columns as a pseudo-primary-key. We will modify this at the end of the section to see different results.
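To picture what delete-insert handling means for a table without a primary key, here is a rough sketch. This is my own illustration of the behavior described above, not GoldenGate code: the update is emitted as a delete keyed on the before-image (every column, since there is no PK), followed by an insert of the after-image.

```python
def delete_insert(before, after):
    """Model pkUpdateHandling=delete-insert on a table with no PK:
    delete the old row (all columns act as the pseudo-key),
    then insert the new row."""
    return [
        {"op_type": "D", **before},  # delete the old image
        {"op_type": "I", **after},   # insert the new image
    ]

before = {"PRODUCT_ID": 4, "PRODUCT_NAME": "old name"}
after  = {"PRODUCT_ID": 4, "PRODUCT_NAME": "new name"}
ops = delete_insert(before, after)
assert [op["op_type"] for op in ops] == ["D", "I"]
```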
Create Replication
GGSCI (edvmr1p0) 3> edit param hdfsrp

REPLICAT hdfsrp
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.properties
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP oggsrc.*, target oggsrc.*;

GGSCI (edvmr1p0) 4> add replicat hdfsrp, exttrail ./dirdat/hd
REPLICAT added.

GGSCI (edvmr1p0) 5> start hdfsrp
Sending START request to MANAGER ...
REPLICAT HDFSRP starting

GGSCI (edvmr1p0) 6> info all

Program     Status      Group       Lag at Chkpt    Time Since Chkpt
MANAGER     RUNNING
REPLICAT    RUNNING     HDFSRP      00:00:00        00:00:08
To test, we can connect to the source database and issue an insert:
Insert Data
[oracle@hostname ~]$ sqlplus oggsrc/<pwd>@orcl

SQL*Plus: Release 12.1.0.2.0 Production on Sat Feb 4 16:04:27 2017
Copyright (c) 1982, 2014, Oracle. All rights reserved.
Last Successful login time: Sat Feb 04 2017 16:01:11 +11:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 -
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> insert into customer_prod select * from customer where customer_id < 1001;

1000 rows created.

SQL> commit;

Commit complete.

SQL>
Check the replication:
Check source and dest statistics
--Check Extract
GGSCI (hostname) 2> send priex, stats

Sending STATS request to EXTRACT PRIEX ...

Start of Statistics at 2017-02-04 16:08:10.

Output to ./dirdat/in:

Extracting from OGGSRC.CUSTOMER_PROD to OGGSRC.CUSTOMER_PROD:

*** Total statistics since 2017-02-04 15:46:18 ***
        Total inserts                1000.00
        Total updates                   0.00
        Total deletes                   0.00
        Total discards                  0.00
        Total operations             1000.00

GGSCI (hostname) 7>

--Check Replicat
GGSCI (hostname) 6> send hdfsrp, stats

Sending STATS request to REPLICAT HDFSRP ...

Start of Statistics at 2017-02-04 16:11:27.

Replicating from OGGSRC.CUSTOMER_PROD to oggsrc.CUSTOMER_PROD:

... Many lines omitted for brevity ...

*** Total statistics since 2017-02-04 16:05:11 ***
        Total inserts                1000.00
        Total updates                   0.00
        Total deletes                   0.00
        Total discards                  0.00
        Total operations             1000.00

GGSCI (hostname) 7>
We can check in the HDFS as well:
Check in the HDFS
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle
Found 1 items
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod

[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle/oggsrc.customer_prod
Found 1 items
-rw-r--r--   1 oracle supergroup     449586 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json

[oracle@edvmr1p0 oggsrc]$ hdfs dfs -cat /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json | head -100
{
 "table":"OGGSRC.CUSTOMER_PROD",
 "op_type":"I",
 "op_ts":"2020-11-11 14:06:05.000168",
 "current_ts":"2020-11-11T14:06:11.134000",
 "pos":"00000000000000002372",
 "CUSTOMER_ID":1,
 "FIRST_NAME":"Jaden",
 "LAST_NAME":"Poole",
 "EMAIL_ADDRESS":"[email protected]",
 "SSN":"16060407-7404",
 "ADDRESS":"3640 Ac St.",
 "CITY":"San Juan de Dios",
 "ZIP_CODE":"31014",
.............................................................
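Note that the handler writes a stream of concatenated JSON objects rather than a single JSON array, so a plain json.load() on such a file will fail. One way to read the records back, sketched with only the Python standard library:

```python
import json

def read_ogg_json(text):
    """Yield records from a file of concatenated (possibly pretty-printed)
    JSON objects, as produced by the json_row formatter."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # skip whitespace between consecutive objects
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, end = decoder.raw_decode(text, idx)  # parse one object, get its end offset
        idx = end
        yield obj

sample = (
    '{"table":"OGGSRC.CUSTOMER_PROD","op_type":"I","CUSTOMER_ID":1}\n'
    '{"table":"OGGSRC.CUSTOMER_PROD","op_type":"I","CUSTOMER_ID":2}'
)
records = list(read_ogg_json(sample))
assert len(records) == 2 and records[0]["op_type"] == "I"
```

The same approach works on the real file contents (e.g. the output of `hdfs dfs -cat`), since raw_decode simply consumes one object at a time regardless of pretty-printing.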
With the current configuration, the handler will create a new directory (under HDFS) for each changed table. For example, if I update a second table, I will receive two directories as follows:
Check HDFS Structure
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle
Found 2 items
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod   <- For the customer_prod table.
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:36 /user/oracle/oggsrc.product_prod    <- For the product_prod table.
[oracle@edvmr1p0 oggsrc]$
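Because gg.handler.hdfs.partitionByTable=true, the target directory for each record is derived from rootFilePath plus the fully qualified, lower-cased table name. A rough sketch of that mapping, inferred from the listing above (my own illustration, not GoldenGate code):

```python
def target_dir(root_file_path, schema, table):
    """Mimic partitionByTable=true: one directory per schema.table under the root."""
    return f"{root_file_path}/{schema.lower()}.{table.lower()}"

assert target_dir("/user/oracle", "OGGSRC", "CUSTOMER_PROD") == "/user/oracle/oggsrc.customer_prod"
assert target_dir("/user/oracle", "OGGSRC", "PRODUCT_PROD") == "/user/oracle/oggsrc.product_prod"
```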
Let's test a new configuration by setting the following in $TRG_GGHOME/dirprm/hdfs.properties:

gg.handler.hdfs.rollOnMetadataChange=false

That forces the handler NOT to roll over to a new JSON file on metadata change or on update. Once we make that change, we have to restart the replicat:
Restart Replicat
[oracle@edvmr1p0 oggtrg]$ ggsci

Oracle GoldenGate Command Interpreter
Version 12.2.0.1.160823 OGGCORE_OGGADP.12.2.0.1.0_PLATFORMS_161019.1437
Linux, x64, 64bit (optimized), Generic on Oct 19 2016 16:01:40
Operating system character set identified as UTF-8.

Copyright (C) 1995, 2016, Oracle and/or its affiliates. All rights reserved.

GGSCI (edvmr1p0) 1> info all

Program     Status      Group       Lag at Chkpt    Time Since Chkpt
MANAGER     RUNNING
REPLICAT    RUNNING     HDFSRP      00:00:00        00:00:03

GGSCI (edvmr1p0) 2> stop replicat hdfsrp
Sending STOP request to REPLICAT HDFSRP ...
Request processed.

GGSCI (edvmr1p0) 3> start hdfsrp
Sending START request to MANAGER ...
REPLICAT HDFSRP starting

GGSCI (edvmr1p0) 4> info all

Program     Status      Group       Lag at Chkpt    Time Since Chkpt
MANAGER     RUNNING
REPLICAT    RUNNING     HDFSRP      00:00:00        00:00:04

GGSCI (edvmr1p0) 5>
Now, if we add a new column or modify the data, the handler will not roll over to a new JSON file until we reach the 1 GB file size defined in gg.handler.hdfs.maxFileSize=1g. Before the change, operations like the following would have triggered the creation of a new JSON file:
Create new Column
SQL> alter table product_prod add third_column varchar2(20);

Table altered.

SQL> update product_prod set third_column='Test third' where product_id=4;

1 row updated.

SQL> commit;

Commit complete.

SQL>
Check HDFS files
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -ls /user/oracle/oggsrc.product_prod
Found 3 items
-rw-r--r--   1 oracle supergroup       1230 2020-11-11 14:36 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-31-40.452.json  <- JSON from the first load
-rw-r--r--   1 oracle supergroup        699 2020-11-11 14:55 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-36-48.627.json  <- JSON from when I added the "ONE_MORE_COLUMN" previously
-rw-r--r--   1 oracle supergroup        394 2020-11-11 14:57 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json  <- JSON after I changed the configuration to not roll over on table structure modification
[oracle@edvmr1p0 oggtrg]$
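The decision of when to roll to a new file can be pictured roughly like this. This is a simplified model I wrote based on the properties used in this section (maxFileSize and rollOnMetadataChange), not the handler's actual logic:

```python
GIB = 1024 ** 3  # corresponds to maxFileSize=1g

def should_roll(current_size, metadata_changed, roll_on_metadata_change, max_size=GIB):
    """Roll over to a new JSON file when the size cap is reached,
    or on a metadata change if rollOnMetadataChange=true."""
    if current_size >= max_size:
        return True
    if metadata_changed and roll_on_metadata_change:
        return True
    return False

# With rollOnMetadataChange=false, an ALTER TABLE alone no longer rolls the file:
assert should_roll(394, metadata_changed=True, roll_on_metadata_change=False) is False
# ...but the 1 GiB size cap still applies:
assert should_roll(GIB, metadata_changed=False, roll_on_metadata_change=False) is True
```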
After this point, whatever we do to the table, the file should keep growing as long as we don't exceed the 1 GB size limit. Let's test it. We can examine the addition of a new column with a simple cat of the JSON:
Check the JSON
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -cat /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json
{
 "table":"OGGSRC.PRODUCT_PROD",
 "op_type":"U",
 "op_ts":"2020-11-11 14:57:31.000115",
 "current_ts":"2020-11-11T14:57:34.807000",
 "pos":"00000000020000006904",
 "PRODUCT_ID":3,
 "PRODUCT_CODE":"OKI26BPY3QK",
 "PRODUCT_NAME":"lorem, auctor quis, tristique ac, eleifend vitae,",
 "SUPPLIER_ID":3,
 "ONE_MORE_COLUMN":null,
 "SECOND_COLUMN":"Data for second"
}
{
 "table":"OGGSRC.PRODUCT_PROD",
 "op_type":"U",
 "op_ts":"2020-11-11 14:59:44.000139",
 "current_ts":"2020-11-11T14:59:49.201000",
 "pos":"00000000020000009296",
 "PRODUCT_ID":4,
 "PRODUCT_CODE":"SXV14CFH8FL",
 "PRODUCT_NAME":"vehicula risus. Nulla eget metus eu erat",
 "SUPPLIER_ID":4,
 "ONE_MORE_COLUMN":"More data",
 "SECOND_COLUMN":null,
 "THIRD_COLUMN":"Test third"
}
So now, if I add a new column (yet again), it SHOULDN'T create a new JSON file. While we are at it, let's add a PK to that table as well.
Create PK & new column (yet again)
SQL> alter table product_prod add constraint product_prod_pk primary key(PRODUCT_ID);

Table altered.

SQL> alter table product_prod add fourth_column number;

Table altered.

SQL> update product_prod set fourth_column=1 where product_id=1;

1 row updated.

SQL> commit;

Commit complete.

SQL> exit
Now, we can check again for the JSONs:
Check if HDFS created a new JSON
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -ls /user/oracle/oggsrc.product_prod
Found 3 items
-rw-r--r--   1 oracle supergroup       1230 2020-11-11 14:36 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-31-40.452.json
-rw-r--r--   1 oracle supergroup        699 2020-11-11 14:55 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-36-48.627.json
-rw-r--r--   1 oracle supergroup        394 2020-11-11 14:57 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json
[oracle@edvmr1p0 oggtrg]$
No new JSON file. Let's examine the last one again:
Examine HDFS
[oracle@edvmr1p0 oggtrg]$ hdfs dfs -cat /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json
{
 "table":"OGGSRC.PRODUCT_PROD",
 "op_type":"U",
 "op_ts":"2020-11-11 14:57:31.000115",
 "current_ts":"2020-11-11T14:57:34.807000",
 "pos":"00000000020000006904",
 "PRODUCT_ID":3,
 "PRODUCT_CODE":"OKI26BPY3QK",
 "PRODUCT_NAME":"lorem, auctor quis, tristique ac, eleifend vitae,",
 "SUPPLIER_ID":3,
 "ONE_MORE_COLUMN":null,
 "SECOND_COLUMN":"Data for second"      <- When I added the 2nd column
}
{
 "table":"OGGSRC.PRODUCT_PROD",
 "op_type":"U",
 "op_ts":"2020-11-11 14:59:44.000139",
 "current_ts":"2020-11-11T14:59:49.201000",
 "pos":"00000000020000009296",
 "PRODUCT_ID":4,
 "PRODUCT_CODE":"SXV14CFH8FL",
 "PRODUCT_NAME":"vehicula risus. Nulla eget metus eu erat",
 "SUPPLIER_ID":4,
 "ONE_MORE_COLUMN":"More data",
 "SECOND_COLUMN":null,
 "THIRD_COLUMN":"Test third"            <- When I added the 3rd column
}
{
 "table":"OGGSRC.PRODUCT_PROD",
 "op_type":"U",
 "op_ts":"2020-11-11 15:02:49.000101",
 "current_ts":"2020-11-11T15:02:54.274000",
 "pos":"00000000020000013152",
 "PRODUCT_ID":1,
 "FOURTH_COLUMN":1                      <- When I added the 4th column and PK
}
Do you notice that the last record has far fewer values than the rest? That is because I created a PK, so the handler was no longer forced to include all columns (as a pseudo-PK) and could use the real PK (PRODUCT_ID) instead. :)
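The difference in that last record can be modeled like this: with a primary key, an update record only needs to carry the key columns plus the changed columns, whereas without one every column is emitted as the pseudo-key. A sketch of that idea (my own illustration, not GoldenGate code):

```python
def update_payload(row, changed, pk_cols=None):
    """Build the columns an update record carries: PK + changed columns when a
    primary key exists, otherwise every column (pseudo-PK)."""
    if pk_cols:
        keep = set(pk_cols) | set(changed)
        return {col: val for col, val in row.items() if col in keep}
    return dict(row)  # no PK: all columns act as the key

row = {"PRODUCT_ID": 1, "PRODUCT_CODE": "X", "PRODUCT_NAME": "Y", "FOURTH_COLUMN": 1}

# With a PK, only PRODUCT_ID and the changed column appear (as in the last record above):
with_pk = update_payload(row, changed={"FOURTH_COLUMN"}, pk_cols={"PRODUCT_ID"})
assert set(with_pk) == {"PRODUCT_ID", "FOURTH_COLUMN"}

# Without a PK, the whole row is carried:
assert set(update_payload(row, changed={"FOURTH_COLUMN"})) == set(row)
```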