oracle_goldengate_hadoop_hdfs

This is an old revision of the document!


In this section we will replicate from Oracle → Golden Gate → HDFS

Create HDFS Properties File (TRG_GGHOME/dirprm/hdfs.properties)

gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rootFilePath=/user/oracle
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.fileSuffix=.json
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=json_row
gg.handler.hdfs.format.prettyPrint=true
gg.handler.hdfs.format.jsonDelimiter=CDATA[]
gg.handler.hdfs.format.pkUpdateHandling=delete-insert
gg.handler.hdfs.format.generateSchema=false
gg.handler.hdfs.mode=tx
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
gg.classpath=/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/etc/hadoop/:
jvm.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

Now bare in mind that since the following settings:

  • rollOnMetadataChange is set to TRUE
  • pkUpdateHandling is set to delete-insert

That means if the table structure changes (add/remove column), the HDFS will create new JSON. Furthermore, because of the 2nd parameter, when you update, that will do pretty much the same, as it will delete the record and insert new one. Furthermore, since the table is without PK, HDFS will be forced to track all the columns into a pseudo-PK column. We will modify that at the end of the section to see different results.

Create Replication

GGSCI (edvmr1p0) 3> edit param hdfsrp

REPLICAT hdfsrp
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.properties
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP oggsrc.*, target oggsrc.*;


GGSCI (edvmr1p0) 4> add replicat hdfsrp, exttrail ./dirdat/hd
REPLICAT added.
GGSCI (edvmr1p0) 5> start hdfsrp

Sending START request to MANAGER ...
REPLICAT HDFSRP starting


GGSCI (edvmr1p0) 6> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    RUNNING     HDFSRP      00:00:00      00:00:08    

To test, we can connect to the Source database and issue insert:

Insert Data

[oracle@hostname ~]$ sqlplus oggsrc/<pwd>@orcl
SQL*Plus: Release 12.1.0.2.0 Production on Sat Feb 4 16:04:27
2017
Copyright (c) 1982, 2014, Oracle. All rights reserved.
Last Successful login time: Sat Feb 04 2017 16:01:11 +11:00
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 -
With the Partitioning, OLAP, Advanced Analytics and Real
Application Testing options
SQL> insert into customer_prod select * from customer where
customer_id < 1001;
1000 rows created.
SQL> commit;
Commit complete.
SQL>

Check the replication:

Check source and dest statistics

--Check Extract
GGSCI (hostname) 2> send priex, stats
Sending STATS request to EXTRACT PRIEX ...
Start of Statistics at 2017-02-04 16:08:10.
Output to ./dirdat/in:
Extracting from OGGSRC.CUSTOMER_PROD to OGGSRC.CUSTOMER_PROD:
*** Total statistics since 2017-02-04 15:46:18 ***
Total inserts 1000.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 1000.00
GGSCI (hostname) 7>

--Check Replicat
GGSCI (hostname) 6> send hdfsrp, stats
Sending STATS request to REPLICAT HDFSRP ...
Start of Statistics at 2017-02-04 16:11:27.
Replicating from OGGSRC.CUSTOMER_PROD to oggsrc.CUSTOMER_PROD:
… Many lines omitted for brevity…
*** Total statistics since 2017-02-04 16:05:11 ***
Total inserts 1000.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 1000.00
GGSCI (hostname) 7>

We can check in the HDFS as well:

Check in the HDFS

[oracle@edvmr1p0 oggsrc]$ hdfs dfs ls /user/oracle
Found 1 items
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle/oggsrc.customer_prod
Found 1 items
-rw-r--r--   1 oracle supergroup     449586 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -cat /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json | head -100
{
    "table":"OGGSRC.CUSTOMER_PROD",
    "op_type":"I",
    "op_ts":"2020-11-11 14:06:05.000168",
    "current_ts":"2020-11-11T14:06:11.134000",
    "pos":"00000000000000002372",
    "CUSTOMER_ID":1,
    "FIRST_NAME":"Jaden",
    "LAST_NAME":"Poole",
    "EMAIL_ADDRESS":"[email protected]",
    "SSN":"16060407-7404",
    "ADDRESS":"3640 Ac St.",
    "CITY":"San Juan de Dios",
    "ZIP_CODE":"31014",
.............................................................

With the current configuration, the HDFS will create a new directory (under HDFS) for each changed table. For example if I update a new table, I will recieve two directories as follows:

Check HDFS Structure

[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle
Found 2 items
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod              <- For customer_prod table.
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:36 /user/oracle/oggsrc.product_prod               <- For product_prod table.
[oracle@edvmr1p0 oggsrc]$

Let's test new configuration, by setting:

  • gg.handler.hdfs.rollOnMetadataChange=false
  • gg.handler.hdfs.format.pkUpdateHandling=update

In $TRG_GGHOME/dirprm/hdfs.properties

That will force the HDFS to NOT create new JSON on metadata change and on update. Once we do that change we have to restart the replicat:

Restart Replicat

[oracle@edvmr1p0 oggtrg]$ ggsci

Oracle GoldenGate Command Interpreter
Version 12.2.0.1.160823 OGGCORE_OGGADP.12.2.0.1.0_PLATFORMS_161019.1437
Linux, x64, 64bit (optimized), Generic on Oct 19 2016 16:01:40
Operating system character set identified as UTF-8.

Copyright (C) 1995, 2016, Oracle and/or its affiliates. All rights reserved.



GGSCI (edvmr1p0) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    RUNNING     HDFSRP      00:00:00      00:00:03    


GGSCI (edvmr1p0) 2> stop replicat hdfsrp

Sending STOP request to REPLICAT HDFSRP ...
Request processed.


GGSCI (edvmr1p0) 3> start hdfsrp

Sending START request to MANAGER ...
REPLICAT HDFSRP starting


GGSCI (edvmr1p0) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    RUNNING     HDFSRP      00:00:00      00:00:04    


GGSCI (edvmr1p0) 5>
  • oracle_goldengate_hadoop_hdfs.1605107330.txt.gz
  • Last modified: 2020/11/11 15:08
  • by andonovj