Thursday 26 May 2022

Sqoop export: Export data from HDFS to a database table

In this post, I am going to explain how to use Sqoop to export data from HDFS to a database table.

Step 1: Let’s create a database and an employee table.

Log in to the MySQL shell.

[cloudera@quickstart ~]$ mysql -u root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 75
Server version: 5.1.73 Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

 

‘cloudera’ is the password for the Cloudera QuickStart VM instance.

Create a database ‘export_demo’.

mysql> CREATE DATABASE export_demo;
Query OK, 1 row affected (0.00 sec)

Create an employee table in the export_demo database.

CREATE TABLE employee (id INT, name VARCHAR(20), age INT);

mysql> USE export_demo;
Database changed
mysql> 
mysql> CREATE TABLE employee (id INT, name VARCHAR(20), age INT);
Query OK, 0 rows affected (0.01 sec)

mysql> 
mysql> DESCRIBE employee;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id    | int(11)     | YES  |     | NULL    |       |
| name  | varchar(20) | YES  |     | NULL    |       |
| age   | int(11)     | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
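
As an aside: this demo only inserts rows, so the table does not need a key. But if you later re-run an export in update mode, Sqoop’s --update-key option needs a column that uniquely identifies each row. A hedged sketch (optional for this demo) that makes id the primary key:

mysql> ALTER TABLE employee ADD PRIMARY KEY (id);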

Step 2: Let’s create an emp.csv file and copy it to HDFS.

emp.csv

1,Ram,31
2,Krishna,32
3,Joel,41
4,Shankar,38
5,Shanthi,48
6,Sameer,29
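
If you prefer the terminal to an editor, a here-document is one way to create the file (a minimal sketch; any editor gives the same result):

[cloudera@quickstart ~]$ cat > emp.csv << 'EOF'
> 1,Ram,31
> 2,Krishna,32
> 3,Joel,41
> 4,Shankar,38
> 5,Shanthi,48
> 6,Sameer,29
> EOF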

Copy the emp.csv file to the HDFS location /export_demo.

[cloudera@quickstart ~]$ hadoop fs -mkdir /export_demo
[cloudera@quickstart ~]$ 
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp.csv /export_demo
[cloudera@quickstart ~]$ 
[cloudera@quickstart ~]$ hadoop fs -ls /export_demo
Found 1 items
-rw-r--r--   1 cloudera supergroup         70 2022-04-04 09:37 /export_demo/emp.csv
[cloudera@quickstart ~]$ 
[cloudera@quickstart ~]$ hadoop fs -cat /export_demo/emp.csv
1,Ram,31
2,Krishna,32
3,Joel,41
4,Shankar,38
5,Shanthi,48
6,Sameer,29

Step 3: Export the content of the emp.csv file to the employee table.

sqoop export \
--connect "jdbc:mysql://quickstart.cloudera:3306/export_demo" \
--username "root" \
--password "cloudera" \
--table "employee" \
--export-dir /export_demo/emp.csv
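
By default, Sqoop export parses each input line using a comma as the field delimiter, which matches our emp.csv. If the data were delimited differently, you would describe its format with --input-fields-terminated-by. A hedged sketch for tab-separated input (the /export_demo/emp_tabs path is hypothetical, just for illustration):

sqoop export \
--connect "jdbc:mysql://quickstart.cloudera:3306/export_demo" \
--username "root" \
--password "cloudera" \
--table "employee" \
--export-dir /export_demo/emp_tabs \
--input-fields-terminated-by '\t'

Here is the full run on the QuickStart VM: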

[cloudera@quickstart ~]$ sqoop export \
> --connect "jdbc:mysql://quickstart.cloudera:3306/export_demo" \
> --username "root" \
> --password "cloudera" \
> --table "employee" \
> --export-dir /export_demo/emp.csv
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/04/04 09:42:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/04/04 09:42:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/04/04 09:42:38 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/04/04 09:42:38 INFO tool.CodeGenTool: Beginning code generation
22/04/04 09:42:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
22/04/04 09:42:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
22/04/04 09:42:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/c9f4658e25d6663f2b3cec91b53fba16/employee.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/04/04 09:42:40 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/c9f4658e25d6663f2b3cec91b53fba16/employee.jar
22/04/04 09:42:40 INFO mapreduce.ExportJobBase: Beginning export of employee
22/04/04 09:42:40 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/04/04 09:42:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/04/04 09:42:41 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
22/04/04 09:42:41 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
22/04/04 09:42:41 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/04/04 09:42:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/04/04 09:42:43 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/04 09:42:43 INFO input.FileInputFormat: Total input paths to process : 1
22/04/04 09:42:43 INFO input.FileInputFormat: Total input paths to process : 1
22/04/04 09:42:43 INFO mapreduce.JobSubmitter: number of splits:4
22/04/04 09:42:43 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
22/04/04 09:42:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1649003113144_0010
22/04/04 09:42:44 INFO impl.YarnClientImpl: Submitted application application_1649003113144_0010
22/04/04 09:42:44 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1649003113144_0010/
22/04/04 09:42:44 INFO mapreduce.Job: Running job: job_1649003113144_0010
22/04/04 09:42:51 INFO mapreduce.Job: Job job_1649003113144_0010 running in uber mode : false
22/04/04 09:42:51 INFO mapreduce.Job:  map 0% reduce 0%
22/04/04 09:43:05 INFO mapreduce.Job:  map 25% reduce 0%
22/04/04 09:43:08 INFO mapreduce.Job:  map 50% reduce 0%
22/04/04 09:43:09 INFO mapreduce.Job:  map 100% reduce 0%
22/04/04 09:43:10 INFO mapreduce.Job: Job job_1649003113144_0010 completed successfully
22/04/04 09:43:11 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=683716
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=799
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=19
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Job Counters 
		Launched map tasks=4
		Data-local map tasks=4
		Total time spent by all maps in occupied slots (ms)=56387
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=56387
		Total vcore-milliseconds taken by all map tasks=56387
		Total megabyte-milliseconds taken by all map tasks=57740288
	Map-Reduce Framework
		Map input records=6
		Map output records=6
		Input split bytes=596
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=615
		CPU time spent (ms)=2250
		Physical memory (bytes) snapshot=502194176
		Virtual memory (bytes) snapshot=6032302080
		Total committed heap usage (bytes)=243007488
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=0
22/04/04 09:43:11 INFO mapreduce.ExportJobBase: Transferred 799 bytes in 29.0627 seconds (27.4923 bytes/sec)
22/04/04 09:43:11 INFO mapreduce.ExportJobBase: Exported 6 records.
[cloudera@quickstart ~]$
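
Notice the warning near the top of the log: setting the password on the command line is insecure. Outside a throwaway VM, prefer -P, which prompts for the password instead of recording it in your shell history. A sketch of the same export:

sqoop export \
--connect "jdbc:mysql://quickstart.cloudera:3306/export_demo" \
--username "root" \
-P \
--table "employee" \
--export-dir /export_demo/emp.csv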

Let’s query the employee table and confirm that all six records were exported.

mysql> SELECT * FROM employee;
+------+---------+------+
| id   | name    | age  |
+------+---------+------+
|    6 | Sameer  |   29 |
|    1 | Ram     |   31 |
|    2 | Krishna |   32 |
|    3 | Joel    |   41 |
|    4 | Shankar |   38 |
|    5 | Shanthi |   48 |
+------+---------+------+
6 rows in set (0.00 sec)
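
The rows come back in arbitrary order because the four mappers insert their splits in parallel. Add an ORDER BY if you want them sorted:

mysql> SELECT * FROM employee ORDER BY id;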


Note

Sqoop jobs, whether export or import, are map-only jobs; there is no reduce phase.
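
Since the job is map-only, its parallelism is simply the number of mappers, which defaults to 4 (hence “number of splits:4” in the log above). For a 70-byte file, per-task overhead dominates, so you could run the same export with a single mapper; a sketch:

sqoop export \
--connect "jdbc:mysql://quickstart.cloudera:3306/export_demo" \
--username "root" \
--password "cloudera" \
--table "employee" \
--export-dir /export_demo/emp.csv \
--num-mappers 1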



