oracle_goldengate_hadoop_hdfs [My wiki]

In this section we will replicate from Oracle → Golden Gate → HDFS

Create HDFS Properties File (TRG_GGHOME/dirprm/hdfs.properties)

gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rootFilePath=/user/oracle
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.fileSuffix=.json
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=json_row
gg.handler.hdfs.format.prettyPrint=true
gg.handler.hdfs.format.jsonDelimiter=CDATA[]
gg.handler.hdfs.format.pkUpdateHandling=delete-insert
gg.handler.hdfs.format.generateSchema=false
gg.handler.hdfs.mode=tx
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
gg.classpath=/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/etc/hadoop/:
jvm.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

Now bare in mind that since the following settings:

rollOnMetadataChange is set to TRUE
pkUpdateHandling is set to delete-insert

That means if the table structure changes (add/remove column), the HDFS will create new JSON. Furthermore, because of the 2nd parameter, when you update, that will do pretty much the same, as it will delete the record and insert new one. Furthermore, since the table is without PK, HDFS will be forced to track all the columns into a pseudo-PK column. We will modify that at the end of the section to see different results.

Create Replication

GGSCI (edvmr1p0) 3> edit param hdfsrp

REPLICAT hdfsrp
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.properties
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP oggsrc.*, target oggsrc.*;


GGSCI (edvmr1p0) 4> add replicat hdfsrp, exttrail ./dirdat/hd
REPLICAT added.
GGSCI (edvmr1p0) 5> start hdfsrp

Sending START request to MANAGER ...
REPLICAT HDFSRP starting


GGSCI (edvmr1p0) 6> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    RUNNING     HDFSRP      00:00:00      00:00:08

To test, we can connect to the Source database and issue insert:

Insert Data

[oracle@hostname ~]$ sqlplus oggsrc/<pwd>@orcl
SQL*Plus: Release 12.1.0.2.0 Production on Sat Feb 4 16:04:27
2017
Copyright (c) 1982, 2014, Oracle. All rights reserved.
Last Successful login time: Sat Feb 04 2017 16:01:11 +11:00
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 -
With the Partitioning, OLAP, Advanced Analytics and Real
Application Testing options
SQL> insert into customer_prod select * from customer where
customer_id < 1001;
1000 rows created.
SQL> commit;
Commit complete.
SQL>

Check the replication:

Check source and dest statistics

--Check Extract
GGSCI (hostname) 2> send priex, stats
Sending STATS request to EXTRACT PRIEX ...
Start of Statistics at 2017-02-04 16:08:10.
Output to ./dirdat/in:
Extracting from OGGSRC.CUSTOMER_PROD to OGGSRC.CUSTOMER_PROD:
*** Total statistics since 2017-02-04 15:46:18 ***
Total inserts 1000.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 1000.00
GGSCI (hostname) 7>

--Check Replicat
GGSCI (hostname) 6> send hdfsrp, stats
Sending STATS request to REPLICAT HDFSRP ...
Start of Statistics at 2017-02-04 16:11:27.
Replicating from OGGSRC.CUSTOMER_PROD to oggsrc.CUSTOMER_PROD:
… Many lines omitted for brevity…
*** Total statistics since 2017-02-04 16:05:11 ***
Total inserts 1000.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 1000.00
GGSCI (hostname) 7>

We can check in the HDFS as well:

Check in the HDFS

[oracle@edvmr1p0 oggsrc]$ hdfs dfs ls /user/oracle
Found 1 items
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle/oggsrc.customer_prod
Found 1 items
-rw-r--r--   1 oracle supergroup     449586 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json
[oracle@edvmr1p0 oggsrc]$ hdfs dfs -cat /user/oracle/oggsrc.customer_prod/oggsrc.customer_prod_2020-11-11_14-06-11.144.json | head -100
{
    "table":"OGGSRC.CUSTOMER_PROD",
    "op_type":"I",
    "op_ts":"2020-11-11 14:06:05.000168",
    "current_ts":"2020-11-11T14:06:11.134000",
    "pos":"00000000000000002372",
    "CUSTOMER_ID":1,
    "FIRST_NAME":"Jaden",
    "LAST_NAME":"Poole",
    "EMAIL_ADDRESS":"[email protected]",
    "SSN":"16060407-7404",
    "ADDRESS":"3640 Ac St.",
    "CITY":"San Juan de Dios",
    "ZIP_CODE":"31014",
.............................................................

With the current configuration, the HDFS will create a new directory (under HDFS) for each changed table. For example if I update a new table, I will recieve two directories as follows:

Check HDFS Structure

[oracle@edvmr1p0 oggsrc]$ hdfs dfs -ls /user/oracle
Found 2 items
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:06 /user/oracle/oggsrc.customer_prod              <- For customer_prod table.
drwxr-xr-x   - oracle supergroup          0 2020-11-11 14:36 /user/oracle/oggsrc.product_prod               <- For product_prod table.
[oracle@edvmr1p0 oggsrc]$

Let's test new configuration, by setting:

gg.handler.hdfs.rollOnMetadataChange=false
gg.handler.hdfs.format.pkUpdateHandling=update

In $TRG_GGHOME/dirprm/hdfs.properties

That will force the HDFS to NOT create new JSON on metadata change and on update. Once we do that change we have to restart the replicat:

Restart Replicat

[oracle@edvmr1p0 oggtrg]$ ggsci

Oracle GoldenGate Command Interpreter
Version 12.2.0.1.160823 OGGCORE_OGGADP.12.2.0.1.0_PLATFORMS_161019.1437
Linux, x64, 64bit (optimized), Generic on Oct 19 2016 16:01:40
Operating system character set identified as UTF-8.

Copyright (C) 1995, 2016, Oracle and/or its affiliates. All rights reserved.



GGSCI (edvmr1p0) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    RUNNING     HDFSRP      00:00:00      00:00:03    


GGSCI (edvmr1p0) 2> stop replicat hdfsrp

Sending STOP request to REPLICAT HDFSRP ...
Request processed.


GGSCI (edvmr1p0) 3> start hdfsrp

Sending START request to MANAGER ...
REPLICAT HDFSRP starting


GGSCI (edvmr1p0) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    RUNNING     HDFSRP      00:00:00      00:00:04    


GGSCI (edvmr1p0) 5>

Now, if we add new column or modify the data, we will not rollover new JSON file until we don't reach the 1G file size defined in:

gg.handler.hdfs.maxFileSize=1g

These settings should cause the creation of a new JSON File:

Create new Column

SQL> alter table product_prod add third_column varchar2(20);

Table altered.
SQL> update product_prod set third_column='Test third' where product_id=4;

1 row updated.

SQL> commit;

Commit complete.

SQL>

Check HDFS files

[oracle@edvmr1p0 oggtrg]$ hdfs dfs -ls /user/oracle/oggsrc.product_prod
Found 3 items
-rw-r--r--   1 oracle supergroup       1230 2020-11-11 14:36 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-31-40.452.json          <- JSON from the first loan
-rw-r--r--   1 oracle supergroup        699 2020-11-11 14:55 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-36-48.627.json          <- JSON when I added the "ONE_MORE_COLUMN" previously
-rw-r--r--   1 oracle supergroup        394 2020-11-11 14:57 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json          <- JSON after I changed the configuration to not rollover new JSON after table structure modification
[oracle@edvmr1p0 oggtrg]$

After this point, whatever we do to the table, as long as we don't exceed the 1G of size, it should grow, let's test it. We can examine the additional of a new column with a simple cat of the JSON

Check the JSON

[oracle@edvmr1p0 oggtrg]$ hdfs dfs -cat /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json

{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:57:31.000115",
    "current_ts":"2020-11-11T14:57:34.807000",
    "pos":"00000000020000006904",
    "PRODUCT_ID":3,
    "PRODUCT_CODE":"OKI26BPY3QK",
    "PRODUCT_NAME":"lorem, auctor quis, tristique ac, eleifend vitae,",
    "SUPPLIER_ID":3,
    "ONE_MORE_COLUMN":null,
    "SECOND_COLUMN":"Data for second"
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:59:44.000139",
    "current_ts":"2020-11-11T14:59:49.201000",
    "pos":"00000000020000009296",
    "PRODUCT_ID":4,
    "PRODUCT_CODE":"SXV14CFH8FL",
    "PRODUCT_NAME":"vehicula risus. Nulla eget metus eu erat",
    "SUPPLIER_ID":4,
    "ONE_MORE_COLUMN":"More data",
    "SECOND_COLUMN":null,
    "THIRD_COLUMN":"Test third"
}

So now, if I add a new column (yet again) it SHOULDN'T create new JSON. When we are at it, let's add a PK for that table as well.

Create PK & new column (yet again)

SQL> alter table product_prod add constraint product_prod_pk primary key(PRODUCT_ID);

Table altered.

SQL> alter table product_prod add fourth_column number;

Table altered.

SQL> update product_prod set fourth_column=1 where product_id=1;

1 row updated.

SQL> commit;

Commit complete.

SQL> exit

Now, we can check again for the JSONs:

Check if HDFS, created new JSON

oracle@edvmr1p0 oggtrg]$ hdfs dfs -ls /user/oracle/oggsrc.product_prod
Found 3 items
-rw-r--r--   1 oracle supergroup       1230 2020-11-11 14:36 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-31-40.452.json
-rw-r--r--   1 oracle supergroup        699 2020-11-11 14:55 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-36-48.627.json
-rw-r--r--   1 oracle supergroup        394 2020-11-11 14:57 /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json
[oracle@edvmr1p0 oggtrg]$

No new JSON, let's examine again the last one:

Examine HDFS

[oracle@edvmr1p0 oggtrg]$ hdfs dfs -cat /user/oracle/oggsrc.product_prod/oggsrc.product_prod_2020-11-11_14-57-34.817.json

{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:57:31.000115",
    "current_ts":"2020-11-11T14:57:34.807000",
    "pos":"00000000020000006904",
    "PRODUCT_ID":3,
    "PRODUCT_CODE":"OKI26BPY3QK",
    "PRODUCT_NAME":"lorem, auctor quis, tristique ac, eleifend vitae,",
    "SUPPLIER_ID":3,
    "ONE_MORE_COLUMN":null,
    "SECOND_COLUMN":"Data for second"                   <- When I added the 2nd column
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 14:59:44.000139",
    "current_ts":"2020-11-11T14:59:49.201000",
    "pos":"00000000020000009296",
    "PRODUCT_ID":4,
    "PRODUCT_CODE":"SXV14CFH8FL",
    "PRODUCT_NAME":"vehicula risus. Nulla eget metus eu erat",
    "SUPPLIER_ID":4,
    "ONE_MORE_COLUMN":"More data",
    "SECOND_COLUMN":null,
    "THIRD_COLUMN":"Test third"                   <- When I added the 3rd column
}
{
    "table":"OGGSRC.PRODUCT_PROD",
    "op_type":"U",
    "op_ts":"2020-11-11 15:02:49.000101",
    "current_ts":"2020-11-11T15:02:54.274000",
    "pos":"00000000020000013152",
    "PRODUCT_ID":1,
    "FOURTH_COLUMN":1                   <- When I added the 4th column and PK
}

Do you notice that in the last record, we have way less values than the rest, that is becasue I create a PK and HDFS wasn't forced to include all columns (in pseudo-PK column) but used the PK (Product_id) :)

Configuration Destination

Create Replicat

Test

Modify Configuration

My wiki