In this post, I explain how to export data from HDFS to a MySQL table using Sqoop’s export tool.
Step 1: Let’s create a database and an employee table.
Log in to the MySQL shell.
[cloudera@quickstart ~]$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 75
Server version: 5.1.73 Source distribution
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
‘cloudera’ is the password for the Cloudera QuickStart VM instance.
Create a database ‘export_demo’.
mysql> CREATE DATABASE export_demo;
Query OK, 1 row affected (0.00 sec)
Create an employee table in the export_demo database.
CREATE TABLE employee (id INT, name VARCHAR(20), age INT);
mysql> USE export_demo;
Database changed
mysql>
mysql> CREATE TABLE employee (id INT, name VARCHAR(20), age INT);
Query OK, 0 rows affected (0.01 sec)
mysql>
mysql> DESCRIBE employee;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id    | int(11)     | YES  |     | NULL    |       |
| name  | varchar(20) | YES  |     | NULL    |       |
| age   | int(11)     | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
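One thing worth noting about this schema: the employee table has no primary key. That works for a one-off export, but if you run the same export twice, every row is inserted again and you end up with duplicates. If you want re-runs to update rows instead, you could add a key first (this ALTER statement is my suggestion, not part of the original walkthrough):

mysql> ALTER TABLE employee ADD PRIMARY KEY (id);

With a key in place, sqoop export can be given --update-key id (and optionally --update-mode allowinsert) so that repeated runs update existing rows rather than insert duplicates.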
Step 2: Let’s create an emp.csv file and copy it to HDFS.
emp.csv
1,Ram,31
2,Krishna,32
3,Joel,41
4,Shankar,38
5,Shanthi,48
6,Sameer,29
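If you prefer to create the file from the shell rather than a text editor, a heredoc does the job:

[cloudera@quickstart ~]$ cat > emp.csv <<'EOF'
1,Ram,31
2,Krishna,32
3,Joel,41
4,Shankar,38
5,Shanthi,48
6,Sameer,29
EOF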
Copy the emp.csv file to the HDFS location /export_demo.
[cloudera@quickstart ~]$ hadoop fs -mkdir /export_demo
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp.csv /export_demo
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ hadoop fs -ls /export_demo
Found 1 items
-rw-r--r-- 1 cloudera supergroup 70 2022-04-04 09:37 /export_demo/emp.csv
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ hadoop fs -cat /export_demo/emp.csv
1,Ram,31
2,Krishna,32
3,Joel,41
4,Shankar,38
5,Shanthi,48
6,Sameer,29
Step 3: Export the content of the emp.csv file to the employee table using sqoop export.
sqoop export \
--connect "jdbc:mysql://quickstart.cloudera:3306/export_demo" \
--username "root" \
--password "cloudera" \
--table "employee" \
--export-dir /export_demo/emp.csv
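Two notes on this command. First, sqoop export parses its input as comma-separated by default, which is why emp.csv needs no delimiter flag; for any other delimiter you would add --input-fields-terminated-by. Second, as the log below warns, passing --password on the command line is insecure. A safer variant prompts for the password at runtime (a sketch, assuming the same connection details):

sqoop export \
--connect "jdbc:mysql://quickstart.cloudera:3306/export_demo" \
--username "root" \
-P \
--table "employee" \
--export-dir /export_demo/emp.csv

For unattended jobs, --password-file pointing to a permission-restricted file is the usual alternative to -P.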
[cloudera@quickstart ~]$ sqoop export \
> --connect "jdbc:mysql://quickstart.cloudera:3306/export_demo" \
> --username "root" \
> --password "cloudera" \
> --table "employee" \
> --export-dir /export_demo/emp.csv
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/04/04 09:42:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/04/04 09:42:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/04/04 09:42:38 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/04/04 09:42:38 INFO tool.CodeGenTool: Beginning code generation
22/04/04 09:42:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
22/04/04 09:42:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
22/04/04 09:42:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/c9f4658e25d6663f2b3cec91b53fba16/employee.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/04/04 09:42:40 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/c9f4658e25d6663f2b3cec91b53fba16/employee.jar
22/04/04 09:42:40 INFO mapreduce.ExportJobBase: Beginning export of employee
22/04/04 09:42:40 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/04/04 09:42:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/04/04 09:42:41 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
22/04/04 09:42:41 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
22/04/04 09:42:41 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/04/04 09:42:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/04/04 09:42:43 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/04 09:42:43 INFO input.FileInputFormat: Total input paths to process : 1
22/04/04 09:42:43 INFO input.FileInputFormat: Total input paths to process : 1
22/04/04 09:42:43 INFO mapreduce.JobSubmitter: number of splits:4
22/04/04 09:42:43 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
22/04/04 09:42:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1649003113144_0010
22/04/04 09:42:44 INFO impl.YarnClientImpl: Submitted application application_1649003113144_0010
22/04/04 09:42:44 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1649003113144_0010/
22/04/04 09:42:44 INFO mapreduce.Job: Running job: job_1649003113144_0010
22/04/04 09:42:51 INFO mapreduce.Job: Job job_1649003113144_0010 running in uber mode : false
22/04/04 09:42:51 INFO mapreduce.Job: map 0% reduce 0%
22/04/04 09:43:05 INFO mapreduce.Job: map 25% reduce 0%
22/04/04 09:43:08 INFO mapreduce.Job: map 50% reduce 0%
22/04/04 09:43:09 INFO mapreduce.Job: map 100% reduce 0%
22/04/04 09:43:10 INFO mapreduce.Job: Job job_1649003113144_0010 completed successfully
22/04/04 09:43:11 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=683716
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=799
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=19
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Launched map tasks=4
        Data-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=56387
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=56387
        Total vcore-milliseconds taken by all map tasks=56387
        Total megabyte-milliseconds taken by all map tasks=57740288
    Map-Reduce Framework
        Map input records=6
        Map output records=6
        Input split bytes=596
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=615
        CPU time spent (ms)=2250
        Physical memory (bytes) snapshot=502194176
        Virtual memory (bytes) snapshot=6032302080
        Total committed heap usage (bytes)=243007488
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
22/04/04 09:43:11 INFO mapreduce.ExportJobBase: Transferred 799 bytes in 29.0627 seconds (27.4923 bytes/sec)
22/04/04 09:43:11 INFO mapreduce.ExportJobBase: Exported 6 records.
[cloudera@quickstart ~]$
Let’s query the employee table to confirm that all the records were exported.
mysql> SELECT * FROM employee;
+------+---------+------+
| id   | name    | age  |
+------+---------+------+
|    6 | Sameer  |   29 |
|    1 | Ram     |   31 |
|    2 | Krishna |   32 |
|    3 | Joel    |   41 |
|    4 | Shankar |   38 |
|    5 | Shanthi |   48 |
+------+---------+------+
6 rows in set (0.00 sec)
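Notice that the rows are not in insertion order: the export ran as four parallel map tasks, each writing its split to MySQL independently, so arrival order is arbitrary. An ORDER BY restores a predictable ordering:

mysql> SELECT * FROM employee ORDER BY id;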
Note
Sqoop jobs, whether export or import, are map-only jobs; there is no reduce phase.
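Since there is no reduce phase, the -m (or --num-mappers) flag directly sets the export parallelism. The run above used four map tasks (Sqoop’s default) for a six-line file, which is more than necessary; for small inputs a single mapper is enough. A variant of the earlier command, not from the original run:

sqoop export \
--connect "jdbc:mysql://quickstart.cloudera:3306/export_demo" \
--username "root" \
-P \
--table "employee" \
--export-dir /export_demo/emp.csv \
-m 1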