Monday, 2 May 2022

Sqoop: import a table without a primary key to HDFS

In this post, I am going to explain how to import a table without a primary key into HDFS.

 

The ‘sqoop import’ command is used to import data from an RDBMS table into HDFS.

 

Syntax

sqoop import --connect "{jdbc_connection_url}" --username "{user_name}" --password "{password}" --table "{table_name}" --target-dir {target_directory}

 

The 'target_directory' must not already exist when 'sqoop import' is executed.
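If the directory is already present, it either has to be removed first (for example with hadoop fs -rm -r), or Sqoop's --delete-target-dir option (available in the 1.4.x releases) can be added so that Sqoop deletes it before the import. A sketch using the same placeholders as above:

sqoop import --connect "{jdbc_connection_url}" --username "{user_name}" --password "{password}" --table "{table_name}" --target-dir {target_directory} --delete-target-dir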

 

Example
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username "root" --password "cloudera" --table "orders" --target-dir /import_demo

 

The above command copies the data from the ‘orders’ table to the HDFS directory /import_demo.

 

Internally, the ‘sqoop import’ command runs as a map-only MapReduce job (no reducers) and uses 4 mappers by default. The mappers divide the data among themselves based on the table's primary key. When the table has no primary key, the import fails by default.
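Roughly speaking, Sqoop first runs a bounding-values query on the split column (the primary key by default) and then assigns each mapper a slice of that range. A sketch of the idea, assuming the ‘orders’ table has an integer primary key named order_id (the actual queries Sqoop generates may differ slightly):

SELECT MIN(order_id), MAX(order_id) FROM orders;
-- Suppose this returns 1 and 100000; with 4 mappers, each mapper then reads roughly one quarter of the range:
SELECT * FROM orders WHERE order_id >= 1 AND order_id < 25001;       -- mapper 1
SELECT * FROM orders WHERE order_id >= 25001 AND order_id < 50001;   -- mapper 2
SELECT * FROM orders WHERE order_id >= 50001 AND order_id < 75001;   -- mapper 3
SELECT * FROM orders WHERE order_id >= 75001 AND order_id <= 100000; -- mapper 4

Without a primary key there is nothing to run this bounding query on, which is why the import fails unless we tell Sqoop what to do.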

 

How to import a table without a primary key?

There are two ways to address this problem:

a.   Set the number of mappers to 1.    (OR)

b.   Specify a split-by column to divide the work. In this case, the ‘sqoop import’ job uses the split-by column to divide the data among the mappers. Both options are sketched below and demonstrated in detail later in this post.
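In terms of options, the two approaches look like this (using the same placeholders as the syntax above; the full working commands with their output follow in the remainder of the post):

sqoop import --connect "{jdbc_connection_url}" --username "{user_name}" --password "{password}" --table "{table_name}" --target-dir {target_directory} -m 1

sqoop import --connect "{jdbc_connection_url}" --username "{user_name}" --password "{password}" --table "{table_name}" --target-dir {target_directory} --split-by "{column_name}"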

 

Let’s experiment with an example. I am using the Cloudera QuickStart VM for this demo.

 

Step 1: Create a table ‘employee_hobbies’ without a primary key.

 

Log in to MySQL by executing the below command.

mysql -u root -p

 

‘cloudera’ is the password.

$mysql -u root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 92
Server version: 5.1.73 Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Execute the command ‘show databases’ to print all the available databases.

 

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cm                 |
| firehose           |
| hue                |
| metastore          |
| mysql              |
| nav                |
| navms              |
| oozie              |
| retail_db          |
| rman               |
| sentry             |
+--------------------+
12 rows in set (0.00 sec)

 

Let’s create a database ‘import_demo’.

create database import_demo;

 

mysql> create database import_demo;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cm                 |
| firehose           |
| hue                |
| import_demo        |
| metastore          |
| mysql              |
| nav                |
| navms              |
| oozie              |
| retail_db          |
| rman               |
| sentry             |
+--------------------+
13 rows in set (0.00 sec)

 

Create the employee_hobbies table in the import_demo database. To create the table in a database, we need to switch to that database first. Execute the below command to select the database.

use import_demo;

mysql> use import_demo;
Database changed
mysql>
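As a side note, plain MySQL also lets us qualify the table name with the database, so the USE step is not strictly required:

CREATE TABLE import_demo.employee_hobbies(id INT, hobby VARCHAR(20));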

 

Create the employee_hobbies table by executing the below command.

CREATE TABLE employee_hobbies(id INT, hobby VARCHAR(20));

mysql> CREATE TABLE employee_hobbies(id INT, hobby VARCHAR(20));
Query OK, 0 rows affected (0.01 sec)

mysql> DESCRIBE employee_hobbies;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id    | int(11)     | YES  |     | NULL    |       |
| hobby | varchar(20) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

 

Let’s insert some data into the employee_hobbies table.

 

INSERT INTO employee_hobbies VALUES (1, "trekking");

INSERT INTO employee_hobbies VALUES (1, "playing cricket");

INSERT INTO employee_hobbies VALUES (2, "stamp collection");

INSERT INTO employee_hobbies VALUES (3, "singing");

INSERT INTO employee_hobbies VALUES (4, "bike riding");

INSERT INTO employee_hobbies VALUES (4, "cooking");

mysql> INSERT INTO employee_hobbies VALUES (1, "playing cricket");
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee_hobbies VALUES (2, "stamp collection");
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee_hobbies VALUES (3, "singing");
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee_hobbies VALUES (4, "bike riding");
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee_hobbies VALUES (4, "cooking");
Query OK, 1 row affected (0.00 sec)

mysql> 
mysql> SELECT * FROM employee_hobbies;
+------+------------------+
| id   | hobby            |
+------+------------------+
|    1 | trekking         |
|    1 | playing cricket  |
|    2 | stamp collection |
|    3 | singing          |
|    4 | bike riding      |
|    4 | cooking          |
+------+------------------+
6 rows in set (0.00 sec)
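For reference, MySQL also accepts a single multi-row INSERT that is equivalent to the individual statements above:

INSERT INTO employee_hobbies VALUES
  (1, "trekking"),
  (1, "playing cricket"),
  (2, "stamp collection"),
  (3, "singing"),
  (4, "bike riding"),
  (4, "cooking");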

 

Step 2: Let’s try to import employee_hobbies data to HDFS.

 

Open a new terminal and execute the below command.

 

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_1

 

[cloudera@quickstart Desktop]$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/03/31 11:35:08 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/03/31 11:35:08 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/03/31 11:35:08 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/03/31 11:35:08 INFO tool.CodeGenTool: Beginning code generation
22/03/31 11:35:09 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:35:09 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:35:09 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/112d78d1030654519832cd50f7c29564/employee_hobbies.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/03/31 11:35:11 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/112d78d1030654519832cd50f7c29564/employee_hobbies.jar
22/03/31 11:35:11 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/03/31 11:35:11 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/03/31 11:35:11 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/03/31 11:35:11 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/03/31 11:35:11 ERROR tool.ImportTool: Import failed: No primary key could be found for table employee_hobbies. Please specify one with --split-by or perform a sequential import with '-m 1'.

 

As you can see from the output, the import operation failed, stating that no primary key could be found for the table employee_hobbies.

 

How to import a table without a primary key?

As discussed above, there are two ways to address this problem: set the number of mappers to 1, or specify a split-by column so that the ‘sqoop import’ job can use it to divide the data among the mappers. Let's try both, starting with the single-mapper approach.

 

Approach 1: Set the number of mappers to 1.

 

Using the -m option, we can set the number of mappers.

 

Syntax

-m noOfMappers

 

Example

-m 1

The above option sets the number of mappers to 1.
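-m is the short form of Sqoop's --num-mappers option, so the same thing can also be written as:

--num-mappers 1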

 

Example

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_1 -m 1

[cloudera@quickstart Desktop]$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_1 -m 1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/03/31 11:38:07 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/03/31 11:38:07 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/03/31 11:38:07 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/03/31 11:38:07 INFO tool.CodeGenTool: Beginning code generation
22/03/31 11:38:08 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:38:08 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:38:08 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/5ba269158de6f17d50d504551219f467/employee_hobbies.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/03/31 11:38:10 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/5ba269158de6f17d50d504551219f467/employee_hobbies.jar
22/03/31 11:38:10 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/03/31 11:38:10 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/03/31 11:38:10 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/03/31 11:38:10 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/03/31 11:38:10 INFO mapreduce.ImportJobBase: Beginning import of employee_hobbies
22/03/31 11:38:10 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/03/31 11:38:10 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/03/31 11:38:11 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/03/31 11:38:11 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/03/31 11:38:11 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/03/31 11:38:12 INFO db.DBInputFormat: Using read commited transaction isolation
22/03/31 11:38:12 INFO mapreduce.JobSubmitter: number of splits:1
22/03/31 11:38:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1647946797614_0002
22/03/31 11:38:13 INFO impl.YarnClientImpl: Submitted application application_1647946797614_0002
22/03/31 11:38:13 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1647946797614_0002/
22/03/31 11:38:13 INFO mapreduce.Job: Running job: job_1647946797614_0002
22/03/31 11:38:20 INFO mapreduce.Job: Job job_1647946797614_0002 running in uber mode : false
22/03/31 11:38:20 INFO mapreduce.Job:  map 0% reduce 0%
22/03/31 11:38:26 INFO mapreduce.Job:  map 100% reduce 0%
22/03/31 11:38:26 INFO mapreduce.Job: Job job_1647946797614_0002 completed successfully
22/03/31 11:38:27 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=171177
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=87
		HDFS: Number of bytes written=82
		HDFS: Number of read operations=4
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=1
		Other local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3678
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=3678
		Total vcore-milliseconds taken by all map tasks=3678
		Total megabyte-milliseconds taken by all map tasks=3766272
	Map-Reduce Framework
		Map input records=6
		Map output records=6
		Input split bytes=87
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=53
		CPU time spent (ms)=650
		Physical memory (bytes) snapshot=137052160
		Virtual memory (bytes) snapshot=1510182912
		Total committed heap usage (bytes)=60751872
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=82
22/03/31 11:38:27 INFO mapreduce.ImportJobBase: Transferred 82 bytes in 15.8959 seconds (5.1586 bytes/sec)
22/03/31 11:38:27 INFO mapreduce.ImportJobBase: Retrieved 6 records.

Let’s confirm the same by listing the contents of the folder /import_demo_1.

[cloudera@quickstart Desktop]$ hadoop fs -ls /import_demo_1
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2022-03-31 11:38 /import_demo_1/_SUCCESS
-rw-r--r--   1 cloudera supergroup         82 2022-03-31 11:38 /import_demo_1/part-m-00000
[cloudera@quickstart Desktop]$

Let’s print the contents of the part file.

[cloudera@quickstart Desktop]$ hadoop fs -cat /import_demo_1/part-m-00000
1,trekking
1,playing cricket
2,stamp collection
3,singing
4,bike riding
4,cooking
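Since the job ran with a single mapper, all six rows land in one part file. A quick sanity check on the row count (plain HDFS shell, not part of the original run):

hadoop fs -cat /import_demo_1/part-m-00000 | wc -l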

Approach 2: Specify a split-by column to divide the work.

 

Example

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_2 -split-by "hobby"

The above command splits the data by the column ‘hobby’.

[cloudera@quickstart Desktop]$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_2 -split-by "hobby"
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/03/31 11:50:38 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/03/31 11:50:38 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/03/31 11:50:39 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/03/31 11:50:39 INFO tool.CodeGenTool: Beginning code generation
22/03/31 11:50:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:50:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:50:39 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/fd2ba11459c9d21e64608b7ef88978fb/employee_hobbies.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/03/31 11:50:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/fd2ba11459c9d21e64608b7ef88978fb/employee_hobbies.jar
22/03/31 11:50:42 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/03/31 11:50:42 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/03/31 11:50:42 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/03/31 11:50:42 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/03/31 11:50:42 INFO mapreduce.ImportJobBase: Beginning import of employee_hobbies
22/03/31 11:50:42 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/03/31 11:50:42 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/03/31 11:50:43 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/03/31 11:50:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/03/31 11:50:45 INFO db.DBInputFormat: Using read commited transaction isolation
22/03/31 11:50:45 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`hobby`), MAX(`hobby`) FROM `employee_hobbies`
22/03/31 11:50:45 WARN db.TextSplitter: Generating splits for a textual index column.
22/03/31 11:50:45 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
22/03/31 11:50:45 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
22/03/31 11:50:45 INFO mapreduce.JobSubmitter: number of splits:5
22/03/31 11:50:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1647946797614_0003
22/03/31 11:50:45 INFO impl.YarnClientImpl: Submitted application application_1647946797614_0003
22/03/31 11:50:45 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1647946797614_0003/
22/03/31 11:50:45 INFO mapreduce.Job: Running job: job_1647946797614_0003
22/03/31 11:50:53 INFO mapreduce.Job: Job job_1647946797614_0003 running in uber mode : false
22/03/31 11:50:53 INFO mapreduce.Job:  map 0% reduce 0%
22/03/31 11:51:13 INFO mapreduce.Job:  map 20% reduce 0%
22/03/31 11:51:18 INFO mapreduce.Job:  map 40% reduce 0%
22/03/31 11:51:19 INFO mapreduce.Job:  map 100% reduce 0%
22/03/31 11:51:20 INFO mapreduce.Job: Job job_1647946797614_0003 completed successfully
22/03/31 11:51:21 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=857490
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=686
		HDFS: Number of bytes written=82
		HDFS: Number of read operations=20
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=10
	Job Counters 
		Launched map tasks=5
		Other local map tasks=5
		Total time spent by all maps in occupied slots (ms)=106550
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=106550
		Total vcore-milliseconds taken by all map tasks=106550
		Total megabyte-milliseconds taken by all map tasks=109107200
	Map-Reduce Framework
		Map input records=6
		Map output records=6
		Input split bytes=686
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=1247
		CPU time spent (ms)=3300
		Physical memory (bytes) snapshot=568188928
		Virtual memory (bytes) snapshot=7548784640
		Total committed heap usage (bytes)=303759360
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=82
22/03/31 11:51:21 INFO mapreduce.ImportJobBase: Transferred 82 bytes in 37.6898 seconds (2.1757 bytes/sec)
22/03/31 11:51:21 INFO mapreduce.ImportJobBase: Retrieved 6 records.
[cloudera@quickstart Desktop]$

Let’s confirm it by listing the /import_demo_2 folder.

[cloudera@quickstart Desktop]$ hadoop fs -ls /import_demo_2
Found 6 items
-rw-r--r--   1 cloudera supergroup          0 2022-03-31 11:51 /import_demo_2/_SUCCESS
-rw-r--r--   1 cloudera supergroup          0 2022-03-31 11:51 /import_demo_2/part-m-00000
-rw-r--r--   1 cloudera supergroup         24 2022-03-31 11:51 /import_demo_2/part-m-00001
-rw-r--r--   1 cloudera supergroup          0 2022-03-31 11:51 /import_demo_2/part-m-00002
-rw-r--r--   1 cloudera supergroup          0 2022-03-31 11:51 /import_demo_2/part-m-00003
-rw-r--r--   1 cloudera supergroup         58 2022-03-31 11:51 /import_demo_2/part-m-00004
[cloudera@quickstart Desktop]$

Print the files in the /import_demo_2 folder using the cat command.

[cloudera@quickstart Desktop]$ hadoop fs -cat /import_demo_2/*
4,bike riding
4,cooking
1,trekking
1,playing cricket
2,stamp collection
3,singing

Note

22/03/31 11:50:45 WARN db.TextSplitter: Generating splits for a textual index column.

22/03/31 11:50:45 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

22/03/31 11:50:45 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

 

Even though Sqoop can split on a text column, it is recommended to use an integer column to perform the split.
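In this table, the integer column id would be the better split column. For example (same command as above, only the split column and target directory changed; /import_demo_id is just a fresh directory used for illustration):

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_id --split-by "id"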

 

--warehouse-dir

The --warehouse-dir option creates a folder named after the table inside the specified location and copies the content into it.

 

Example

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --warehouse-dir /import_demo_3 -split-by "hobby"

 

 

The above command creates an employee_hobbies folder inside /import_demo_3 and copies the part files into that folder.

[cloudera@quickstart Desktop]$ hadoop fs -ls /import_demo_3
Found 1 items
drwxr-xr-x   - cloudera supergroup          0 2022-03-31 11:57 /import_demo_3/employee_hobbies
[cloudera@quickstart Desktop]$ 
[cloudera@quickstart Desktop]$ hadoop fs -ls /import_demo_3/employee_hobbies
Found 6 items
-rw-r--r--   1 cloudera supergroup          0 2022-03-31 11:57 /import_demo_3/employee_hobbies/_SUCCESS
-rw-r--r--   1 cloudera supergroup          0 2022-03-31 11:57 /import_demo_3/employee_hobbies/part-m-00000
-rw-r--r--   1 cloudera supergroup         24 2022-03-31 11:57 /import_demo_3/employee_hobbies/part-m-00001
-rw-r--r--   1 cloudera supergroup          0 2022-03-31 11:57 /import_demo_3/employee_hobbies/part-m-00002
-rw-r--r--   1 cloudera supergroup          0 2022-03-31 11:57 /import_demo_3/employee_hobbies/part-m-00003
-rw-r--r--   1 cloudera supergroup         58 2022-03-31 11:57 /import_demo_3/employee_hobbies/part-m-00004
[cloudera@quickstart Desktop]$
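The part files under the employee_hobbies folder contain the same six rows as before, which can be verified with cat (a quick check, not shown in the original run):

hadoop fs -cat /import_demo_3/employee_hobbies/part-m-*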




 
