Saturday 30 April 2022

Sqoop: Specify compression format explicitly while importing

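You can also choose the compression codec explicitly with the --compression-codec option, which takes the name of a Hadoop codec class. If compression is enabled without naming a codec, Sqoop uses gzip.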

 

Example

--compression-codec BZip2Codec
--compression-codec org.apache.hadoop.io.compress.GzipCodec
--compression-codec org.apache.hadoop.io.compress.SnappyCodec

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username "root" \
--password "cloudera" \
--table "orders" \
--target-dir /compress_demo_2 \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec

The command above compresses the imported data with SnappyCodec.
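Typing the fully qualified class name each time is error-prone, so here is a minimal sketch of a wrapper that maps a short label to the codec class and prints the resulting command without running it. The helper names codec_class and print_import_cmd are hypothetical and not part of Sqoop. The connection details are copied from the example above, and -P (prompt for the password) is used because the Sqoop log itself recommends it over --password.

```shell
#!/bin/sh
# Hypothetical helper: map a short label to a Hadoop codec class name.
# Only the three codecs mentioned in this post are covered.
codec_class() {
  case "$1" in
    bzip2)  echo "org.apache.hadoop.io.compress.BZip2Codec" ;;
    gzip)   echo "org.apache.hadoop.io.compress.GzipCodec" ;;
    snappy) echo "org.apache.hadoop.io.compress.SnappyCodec" ;;
    *)      echo "unsupported codec label: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print (do not run) the import command
# for a codec label ($1) and an HDFS target directory ($2).
print_import_cmd() {
  _codec="$(codec_class "$1")" || return 1
  printf '%s\n' "sqoop import --connect \"jdbc:mysql://quickstart.cloudera:3306/retail_db\" --username \"root\" -P --table \"orders\" --target-dir $2 --compression-codec $_codec"
}

print_import_cmd snappy /compress_demo_2
```

Printing the command first makes it easy to review the codec class before anything is submitted to the cluster; once it looks right, it can be pasted into the shell as is.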

 

[cloudera@quickstart Desktop]$ sqoop import \
> --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
> --username "root" \
> --password "cloudera" \
> --table "orders" \
> --target-dir /compress_demo_2 \
> --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/04/03 02:40:25 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/04/03 02:40:25 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/04/03 02:40:25 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/04/03 02:40:25 INFO tool.CodeGenTool: Beginning code generation
22/04/03 02:40:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1
22/04/03 02:40:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1
22/04/03 02:40:26 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/d5bbae0f9786092e87d23a9d8ed7b8a3/orders.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/04/03 02:40:28 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/d5bbae0f9786092e87d23a9d8ed7b8a3/orders.jar
22/04/03 02:40:28 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/04/03 02:40:28 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/04/03 02:40:28 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/04/03 02:40:28 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/04/03 02:40:28 INFO mapreduce.ImportJobBase: Beginning import of orders
22/04/03 02:40:28 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/04/03 02:40:28 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/04/03 02:40:29 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/04/03 02:40:29 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/04/03 02:40:30 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:30 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:30 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:30 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:30 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:30 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:31 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:31 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:31 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:31 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:31 INFO db.DBInputFormat: Using read commited transaction isolation
22/04/03 02:40:31 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`order_id`), MAX(`order_id`) FROM `orders`
22/04/03 02:40:31 INFO db.IntegerSplitter: Split size: 17220; Num splits: 4 from: 1 to: 68883
22/04/03 02:40:31 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/03 02:40:31 INFO mapreduce.JobSubmitter: number of splits:4
22/04/03 02:40:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1647946797614_0018
22/04/03 02:40:32 INFO impl.YarnClientImpl: Submitted application application_1647946797614_0018
22/04/03 02:40:32 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1647946797614_0018/
22/04/03 02:40:32 INFO mapreduce.Job: Running job: job_1647946797614_0018
22/04/03 02:40:39 INFO mapreduce.Job: Job job_1647946797614_0018 running in uber mode : false
22/04/03 02:40:39 INFO mapreduce.Job:  map 0% reduce 0%
22/04/03 02:41:00 INFO mapreduce.Job:  map 25% reduce 0%
22/04/03 02:41:03 INFO mapreduce.Job:  map 50% reduce 0%
22/04/03 02:41:04 INFO mapreduce.Job:  map 75% reduce 0%
22/04/03 02:41:05 INFO mapreduce.Job:  map 100% reduce 0%
22/04/03 02:41:05 INFO mapreduce.Job: Job job_1647946797614_0018 completed successfully
22/04/03 02:41:06 INFO mapreduce.Job: Counters: 31
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=686148
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=469
		HDFS: Number of bytes written=891578
		HDFS: Number of read operations=16
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=8
	Job Counters 
		Killed map tasks=1
		Launched map tasks=4
		Other local map tasks=4
		Total time spent by all maps in occupied slots (ms)=82312
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=82312
		Total vcore-milliseconds taken by all map tasks=82312
		Total megabyte-milliseconds taken by all map tasks=84287488
	Map-Reduce Framework
		Map input records=68883
		Map output records=68883
		Input split bytes=469
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=942
		CPU time spent (ms)=8350
		Physical memory (bytes) snapshot=500830208
		Virtual memory (bytes) snapshot=6066114560
		Total committed heap usage (bytes)=243007488
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=891578
22/04/03 02:41:06 INFO mapreduce.ImportJobBase: Transferred 870.6816 KB in 36.3821 seconds (23.9316 KB/sec)
22/04/03 02:41:06 INFO mapreduce.ImportJobBase: Retrieved 68883 records.
[cloudera@quickstart Desktop]$

Let’s list the target directory /compress_demo_2 and confirm that the part files were written with the .snappy extension.

[cloudera@quickstart Desktop]$ hadoop fs -ls /compress_demo_2
Found 5 items
-rw-r--r--   1 cloudera supergroup          0 2022-04-03 02:41 /compress_demo_2/_SUCCESS
-rw-r--r--   1 cloudera supergroup     218873 2022-04-03 02:40 /compress_demo_2/part-m-00000.snappy
-rw-r--r--   1 cloudera supergroup     218610 2022-04-03 02:41 /compress_demo_2/part-m-00001.snappy
-rw-r--r--   1 cloudera supergroup     220683 2022-04-03 02:41 /compress_demo_2/part-m-00002.snappy
-rw-r--r--   1 cloudera supergroup     233412 2022-04-03 02:41 /compress_demo_2/part-m-00003.snappy
[cloudera@quickstart Desktop]$ 
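As a quick cross-check, the sizes of the four part files should add up to the "HDFS: Number of bytes written=891578" counter reported by the job. The sketch below sums the size column of the listing with awk. The function name sum_part_sizes is hypothetical, and the listing is pasted in as a here-document so that the snippet runs without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: read `hadoop fs -ls` output on stdin and
# sum the size column (field 5) of the part-m-* files (path is field 8).
sum_part_sizes() {
  awk '$8 ~ /part-m-/ { total += $5 } END { print total + 0 }'
}

# Listing copied from the output above.
sum_part_sizes <<'EOF'
Found 5 items
-rw-r--r--   1 cloudera supergroup          0 2022-04-03 02:41 /compress_demo_2/_SUCCESS
-rw-r--r--   1 cloudera supergroup     218873 2022-04-03 02:40 /compress_demo_2/part-m-00000.snappy
-rw-r--r--   1 cloudera supergroup     218610 2022-04-03 02:41 /compress_demo_2/part-m-00001.snappy
-rw-r--r--   1 cloudera supergroup     220683 2022-04-03 02:41 /compress_demo_2/part-m-00002.snappy
-rw-r--r--   1 cloudera supergroup     233412 2022-04-03 02:41 /compress_demo_2/part-m-00003.snappy
EOF
# prints 891578
```

On a live cluster the same function can read the listing directly: hadoop fs -ls /compress_demo_2 | sum_part_sizes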
