Using the --append option, we can append newly imported data to an existing dataset in HDFS.
Example
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username "root" \
--password "cloudera" \
--table "customers" \
--target-dir /append_demo
The above snippet imports the contents of the customers table into the /append_demo folder.
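As the job output below points out, MySQL imports can be made faster with the --direct option, which switches Sqoop to a MySQL-specific fast path (for MySQL it typically delegates the transfer to mysqldump). A minimal sketch reusing the same connection details; /direct_demo is just a placeholder target directory:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username "root" \
--password "cloudera" \
--table "customers" \
--target-dir /direct_demo \
--direct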
[cloudera@quickstart ~]$ sqoop import \
> --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
> --username "root" \
> --password "cloudera" \
> --table "customers" \
> --target-dir /append_demo
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/04/03 20:53:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/04/03 20:53:34 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/04/03 20:53:34 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/04/03 20:53:34 INFO tool.CodeGenTool: Beginning code generation
22/04/03 20:53:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customers` AS t LIMIT 1
22/04/03 20:53:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customers` AS t LIMIT 1
22/04/03 20:53:35 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/9e62689c37d762a30f481e13b68a7876/customers.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/04/03 20:53:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/9e62689c37d762a30f481e13b68a7876/customers.jar
22/04/03 20:53:37 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/04/03 20:53:37 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/04/03 20:53:37 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/04/03 20:53:37 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/04/03 20:53:37 INFO mapreduce.ImportJobBase: Beginning import of customers
22/04/03 20:53:37 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/04/03 20:53:38 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/04/03 20:53:39 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/04/03 20:53:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/04/03 20:53:40 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 20:53:40 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 20:53:40 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 20:53:40 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 20:53:40 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 20:53:40 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 20:53:41 INFO db.DBInputFormat: Using read commited transaction isolation
22/04/03 20:53:41 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`customer_id`), MAX(`customer_id`) FROM `customers`
22/04/03 20:53:41 INFO db.IntegerSplitter: Split size: 3108; Num splits: 4 from: 1 to: 12435
22/04/03 20:53:41 INFO mapreduce.JobSubmitter: number of splits:4
22/04/03 20:53:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1649003113144_0003
22/04/03 20:53:41 INFO impl.YarnClientImpl: Submitted application application_1649003113144_0003
22/04/03 20:53:41 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1649003113144_0003/
22/04/03 20:53:41 INFO mapreduce.Job: Running job: job_1649003113144_0003
22/04/03 20:53:49 INFO mapreduce.Job: Job job_1649003113144_0003 running in uber mode : false
22/04/03 20:53:49 INFO mapreduce.Job: map 0% reduce 0%
22/04/03 20:54:05 INFO mapreduce.Job: map 25% reduce 0%
22/04/03 20:54:08 INFO mapreduce.Job: map 50% reduce 0%
22/04/03 20:54:10 INFO mapreduce.Job: map 75% reduce 0%
22/04/03 20:54:11 INFO mapreduce.Job: map 100% reduce 0%
22/04/03 20:54:11 INFO mapreduce.Job: Job job_1649003113144_0003 completed successfully
22/04/03 20:54:11 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=685816
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=487
        HDFS: Number of bytes written=953525
        HDFS: Number of read operations=16
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters
        Launched map tasks=4
        Other local map tasks=4
        Total time spent by all maps in occupied slots (ms)=57409
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=57409
        Total vcore-milliseconds taken by all map tasks=57409
        Total megabyte-milliseconds taken by all map tasks=58786816
    Map-Reduce Framework
        Map input records=12435
        Map output records=12435
        Input split bytes=487
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=644
        CPU time spent (ms)=4600
        Physical memory (bytes) snapshot=549847040
        Virtual memory (bytes) snapshot=6046851072
        Total committed heap usage (bytes)=243007488
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=953525
22/04/03 20:54:11 INFO mapreduce.ImportJobBase: Transferred 931.1768 KB in 32.0698 seconds (29.036 KB/sec)
22/04/03 20:54:11 INFO mapreduce.ImportJobBase: Retrieved 12435 records.
[cloudera@quickstart ~]$
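Note the warning near the top of the log: passing the password on the command line is insecure outside a sandbox VM like this one. A safer variant of the same import is sketched below; the -P flag makes Sqoop prompt for the password interactively (--password-file is another standard alternative):
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username "root" \
-P \
--table "customers" \
--target-dir /append_demo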
Let’s list the contents of the ‘/append_demo’ folder.
[cloudera@quickstart ~]$ hadoop fs -ls /append_demo
Found 5 items
-rw-r--r-- 1 cloudera supergroup 0 2022-04-03 20:54 /append_demo/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 237145 2022-04-03 20:54 /append_demo/part-m-00000
-rw-r--r-- 1 cloudera supergroup 237965 2022-04-03 20:54 /append_demo/part-m-00001
-rw-r--r-- 1 cloudera supergroup 238092 2022-04-03 20:54 /append_demo/part-m-00002
-rw-r--r-- 1 cloudera supergroup 240323 2022-04-03 20:54 /append_demo/part-m-00003
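Each of the four mappers wrote one part-m file. As a quick sanity check against the "Retrieved 12435 records" line in the log, we can count the lines across the part files (assuming the default text output format, with one record per line):
hadoop fs -cat /append_demo/part-m-* | wc -l
This should print 12435.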
Rerunning the same import as-is would fail, because Hadoop refuses to write to an output directory that already exists. Let’s execute the ‘sqoop import’ command with the --append option to confirm that the new data is appended to the existing dataset instead.
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username "root" \
--password "cloudera" \
--table "customers" \
--target-dir /append_demo \
--append
[cloudera@quickstart ~]$ sqoop import \
> --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
> --username "root" \
> --password "cloudera" \
> --table "customers" \
> --target-dir /append_demo \
> --append
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/04/03 20:58:53 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/04/03 20:58:53 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/04/03 20:58:53 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/04/03 20:58:53 INFO tool.CodeGenTool: Beginning code generation
22/04/03 20:58:53 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customers` AS t LIMIT 1
22/04/03 20:58:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customers` AS t LIMIT 1
22/04/03 20:58:54 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/d8bef25c5c94da00253aafbd3b32143e/customers.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/04/03 20:58:55 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/d8bef25c5c94da00253aafbd3b32143e/customers.jar
22/04/03 20:58:55 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/04/03 20:58:55 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/04/03 20:58:55 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/04/03 20:58:55 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/04/03 20:58:55 INFO mapreduce.ImportJobBase: Beginning import of customers
22/04/03 20:58:55 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/04/03 20:58:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/04/03 20:58:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/04/03 20:58:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/04/03 20:58:58 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 20:58:58 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 20:58:58 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 20:58:58 INFO db.DBInputFormat: Using read commited transaction isolation
22/04/03 20:58:58 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`customer_id`), MAX(`customer_id`) FROM `customers`
22/04/03 20:58:58 INFO db.IntegerSplitter: Split size: 3108; Num splits: 4 from: 1 to: 12435
22/04/03 20:58:58 INFO mapreduce.JobSubmitter: number of splits:4
22/04/03 20:58:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1649003113144_0004
22/04/03 20:58:59 INFO impl.YarnClientImpl: Submitted application application_1649003113144_0004
22/04/03 20:58:59 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1649003113144_0004/
22/04/03 20:58:59 INFO mapreduce.Job: Running job: job_1649003113144_0004
22/04/03 20:59:06 INFO mapreduce.Job: Job job_1649003113144_0004 running in uber mode : false
22/04/03 20:59:06 INFO mapreduce.Job: map 0% reduce 0%
22/04/03 20:59:23 INFO mapreduce.Job: map 25% reduce 0%
22/04/03 20:59:25 INFO mapreduce.Job: map 50% reduce 0%
22/04/03 20:59:26 INFO mapreduce.Job: map 75% reduce 0%
22/04/03 20:59:27 INFO mapreduce.Job: map 100% reduce 0%
22/04/03 20:59:27 INFO mapreduce.Job: Job job_1649003113144_0004 completed successfully
22/04/03 20:59:28 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=686020
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=487
        HDFS: Number of bytes written=953525
        HDFS: Number of read operations=16
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters
        Launched map tasks=4
        Other local map tasks=4
        Total time spent by all maps in occupied slots (ms)=60015
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=60015
        Total vcore-milliseconds taken by all map tasks=60015
        Total megabyte-milliseconds taken by all map tasks=61455360
    Map-Reduce Framework
        Map input records=12435
        Map output records=12435
        Input split bytes=487
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=599
        CPU time spent (ms)=4070
        Physical memory (bytes) snapshot=555954176
        Virtual memory (bytes) snapshot=6048366592
        Total committed heap usage (bytes)=243007488
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=953525
22/04/03 20:59:28 INFO mapreduce.ImportJobBase: Transferred 931.1768 KB in 31.1736 seconds (29.8706 KB/sec)
22/04/03 20:59:28 INFO mapreduce.ImportJobBase: Retrieved 12435 records.
22/04/03 20:59:28 INFO util.AppendUtils: Appending to directory append_demo
22/04/03 20:59:28 INFO util.AppendUtils: Using found partition 4
[cloudera@quickstart ~]$
Let’s query the ‘/append_demo’ folder and confirm that the new files were added. As the AppendUtils messages above indicate, Sqoop found the existing partitions 00000 through 00003 and continued the numbering from part-m-00004.
[cloudera@quickstart ~]$ hadoop fs -ls /append_demo
Found 9 items
-rw-r--r-- 1 cloudera supergroup 0 2022-04-03 20:54 /append_demo/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 237145 2022-04-03 20:54 /append_demo/part-m-00000
-rw-r--r-- 1 cloudera supergroup 237965 2022-04-03 20:54 /append_demo/part-m-00001
-rw-r--r-- 1 cloudera supergroup 238092 2022-04-03 20:54 /append_demo/part-m-00002
-rw-r--r-- 1 cloudera supergroup 240323 2022-04-03 20:54 /append_demo/part-m-00003
-rw-r--r-- 1 cloudera cloudera 237145 2022-04-03 20:59 /append_demo/part-m-00004
-rw-r--r-- 1 cloudera cloudera 237965 2022-04-03 20:59 /append_demo/part-m-00005
-rw-r--r-- 1 cloudera cloudera 238092 2022-04-03 20:59 /append_demo/part-m-00006
-rw-r--r-- 1 cloudera cloudera 240323 2022-04-03 20:59 /append_demo/part-m-00007
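The appended files carry the same sizes as the originals, since we imported the identical table twice. Counting the records again with the same check as before should now report double the rows:
hadoop fs -cat /append_demo/part-m-* | wc -l
Expected output is 24870 (12435 × 2). Keep in mind that --append does not deduplicate: importing the same rows twice stores them twice. To append only newly added rows, --append is typically combined with Sqoop's incremental import options (--incremental append, --check-column and --last-value).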