This is an old revision of the document!
Overview
Configuration Destination
Create HDFS Properties File (TRG_GGHOME/dirprm/hdfs.properties)
gg.handlerlist=hdfs gg.handler.hdfs.type=hdfs gg.handler.hdfs.includeTokens=false gg.handler.hdfs.maxFileSize=1g gg.handler.hdfs.rootFilePath=/user/oracle gg.handler.hdfs.fileRollInterval=0 gg.handler.hdfs.inactivityRollInterval=0 gg.handler.hdfs.fileSuffix=.json gg.handler.hdfs.partitionByTable=true gg.handler.hdfs.rollOnMetadataChange=true gg.handler.hdfs.authType=none gg.handler.hdfs.format=json_row gg.handler.hdfs.format.prettyPrint=true gg.handler.hdfs.format.jsonDelimiter=CDATA[] gg.handler.hdfs.format.pkUpdateHandling=delete-insert gg.handler.hdfs.format.generateSchema=false gg.handler.hdfs.mode=tx goldengate.userexit.timestamp=utc goldengate.userexit.writers=javawriter javawriter.stats.display=TRUE javawriter.stats.full=TRUE gg.log=log4j gg.log.level=INFO gg.report.time=30sec gg.classpath=/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/etc/hadoop/: jvm.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar
Now bare in mind that since the following settings:
- rollOnMetadataChange is set to TRUE
- pkUpdateHandling is set to delete-insert
That means if the table structure changes (add/remove column), the HDFS will create new JSON. Furthermore, because of the 2nd parameter, when you update, that will do pretty much the same, as it will delete the record and insert new one. Furthermore, since the table is without PK, HDFS will be forced to track all the columns into a pseudo-PK column. We will modify that at the end of the section to see different results.
Create Replicat
Create Replication
GGSCI (edvmr1p0) 3> edit param hdfsrp REPLICAT hdfsrp TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.properties REPORTCOUNT EVERY 1 MINUTES, RATE GROUPTRANSOPS 10000 MAP oggsrc.*, target oggsrc.*; GGSCI (edvmr1p0) 4> add replicat hdfsrp, exttrail ./dirdat/hd REPLICAT added. GGSCI (edvmr1p0) 5> start hdfsrp Sending START request to MANAGER ... REPLICAT HDFSRP starting GGSCI (edvmr1p0) 6> info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING REPLICAT RUNNING HDFSRP 00:00:00 00:00:08
Test
To test, we can connect to the Source database and issue insert:
Insert Data
[oracle@hostname ~]$ sqlplus oggsrc/<pwd>@orcl SQL*Plus: Release 12.1.0.2.0 Production on Sat Feb 4 16:04:27 2017 Copyright (c) 1982, 2014, Oracle. All rights reserved. Last Successful login time: Sat Feb 04 2017 16:01:11 +11:00 Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options SQL> insert into customer_prod select * from customer where customer_id < 1001; 1000 rows created. SQL> commit; Commit complete. SQL>
Check the replication:
Check source and dest statistics
--Check Extract GGSCI (hostname) 2> send priex, stats Sending STATS request to EXTRACT PRIEX ... Start of Statistics at 2017-02-04 16:08:10. Output to ./dirdat/in: Extracting from OGGSRC.CUSTOMER_PROD to OGGSRC.CUSTOMER_PROD: *** Total statistics since 2017-02-04 15:46:18 *** Total inserts 1000.00 Total updates 0.00 Total deletes 0.00 Total discards 0.00 Total operations 1000.00 GGSCI (hostname) 7> --Check Replicat GGSCI (hostname) 6> send hdfsrp, stats Sending STATS request to REPLICAT HDFSRP ... Start of Statistics at 2017-02-04 16:11:27. Replicating from OGGSRC.CUSTOMER_PROD to oggsrc.CUSTOMER_PROD: … Many lines omitted for brevity… *** Total statistics since 2017-02-04 16:05:11 *** Total inserts 1000.00 Total updates 0.00 Total deletes 0.00 Total discards 0.00 Total operations 1000.00 GGSCI (hostname) 7>
We can check in the HDFS as well:
Check in the HDFS
[oracle@edvmr1p0 oggsrc]$ hdfs dfs ls /user/oracle Found 1 items drwxr-xr-x - oracle supergroup 0 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod [oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle/oggsrc.customer_prod Found 1 items -rw-r--r-- 1 oracle supergroup 449586 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json [oracle@edvmr1p0 oggsrc]$ hdfs dfs -cat /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json | head -100 { "table":"OGGSRC.CUSTOMER_PROD", "op_type":"I", "op_ts":"2020-11-11 14:06:05.000168", "current_ts":"2020-11-11T14:06:11.134000", "pos":"00000000000000002372", "CUSTOMER_ID":1, "FIRST_NAME":"Jaden", "LAST_NAME":"Poole", "EMAIL_ADDRESS":"[email protected]", "SSN":"16060407-7404", "ADDRESS":"3640 Ac St.", "CITY":"San Juan de Dios", "ZIP_CODE":"31014", .............................................................
With the current configuration, the HDFS will create a new directory (under HDFS) for each changed table. For example if I update a new table, I will recieve two directories as follows:
Check HDFS Structure
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle Found 2 items drwxr-xr-x - oracle supergroup 0 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod <- For customer_prod table. drwxr-xr-x - oracle supergroup 0 2020-11-11 14:36 /user/oracle/oggsrc.product_prod <- For product_prod table. [oracle@edvmr1p0 oggsrc]$
Modify Configuration
Let's test new configuration, by setting:
- gg.handler.hdfs.rollOnMetadataChange=false
- gg.handler.hdfs.format.pkUpdateHandling=update
That will force the HDFS to NOT create new JSON on metadata change and on update. Once we do that change we have to restart the replicat:
Restart Replicat
[oracle@edvmr1p0 oggtrg]$ ggsci Oracle GoldenGate Command Interpreter Version 12.2.0.1.160823 OGGCORE_OGGADP.12.2.0.1.0_PLATFORMS_161019.1437 Linux, x64, 64bit (optimized), Generic on Oct 19 2016 16:01:40 Operating system character set identified as UTF-8. Copyright (C) 1995, 2016, Oracle and/or its affiliates. All rights reserved. GGSCI (edvmr1p0) 1> info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING REPLICAT RUNNING HDFSRP 00:00:00 00:00:03 GGSCI (edvmr1p0) 2> stop replicat hdfsrp Sending STOP request to REPLICAT HDFSRP ... Request processed. GGSCI (edvmr1p0) 3> start hdfsrp Sending START request to MANAGER ... REPLICAT HDFSRP starting GGSCI (edvmr1p0) 4> info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING REPLICAT RUNNING HDFSRP 00:00:00 00:00:04 GGSCI (edvmr1p0) 5>