In this post, I am going to explain how to import a table that has no primary key into HDFS.
The ‘sqoop import’ command is used to import the data from an RDBMS table into HDFS.
Syntax
sqoop import --connect "{jdbc_connection_url}" --username "{user_name}" --password "{password}" --table "{table_name}" --target-dir {target_directory}
The 'target_directory' must not already exist when executing 'sqoop import'.
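If the target directory already exists, the import typically fails with an 'output directory already exists' error. A quick way to check for (and, if it is safe to discard, remove) the directory before importing:
hadoop fs -ls /import_demo
hadoop fs -rm -r /import_demo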
Example
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username "root" --password "cloudera" --table "orders" --target-dir /import_demo
The above command copies the data from the ‘orders’ table to the HDFS directory /import_demo.
Internally, ‘sqoop import’ runs as a map-only MapReduce job (no reducers) and uses 4 mappers by default. The mappers divide the table's rows among themselves based on the table's primary key. So when you try to import a table that has no primary key, the import fails by default.
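To see why the primary key matters: when a table has one, Sqoop first runs a bounding query on that key and then gives each mapper a slice of the key range. A rough sketch of the queries involved, assuming order_id is the primary key of the orders table (the exact conditions Sqoop generates may differ slightly):

SELECT MIN(`order_id`), MAX(`order_id`) FROM `orders`                          -- bounding query
SELECT * FROM `orders` WHERE `order_id` >= <lo> AND `order_id` < <hi>          -- one such range query per mapper

Without a primary key, Sqoop has no column to compute these ranges on, hence the default failure.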
How to import a table without a primary key?
There are two ways to address this problem:
a. Set the number of mappers to 1, OR
b. Specify a split-by column to divide the work. In this case, the ‘sqoop import’ job uses the split-by column to divide the data among the mappers.
Let’s experiment with an example. I am using the Cloudera QuickStart VM to demo this application.
Step 1: Create a table ‘employee_hobbies’ without a primary key.
Log in to MySQL by executing the below command.
mysql -u root -p
‘cloudera’ is the password.
$mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 92
Server version: 5.1.73 Source distribution
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
Execute the command ‘show databases;’ to print all the available databases.
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cm                 |
| firehose           |
| hue                |
| metastore          |
| mysql              |
| nav                |
| navms              |
| oozie              |
| retail_db          |
| rman               |
| sentry             |
+--------------------+
12 rows in set (0.00 sec)
Let’s create a database ‘import_demo’.
create database import_demo;
mysql> create database import_demo;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cm                 |
| firehose           |
| hue                |
| import_demo        |
| metastore          |
| mysql              |
| nav                |
| navms              |
| oozie              |
| retail_db          |
| rman               |
| sentry             |
+--------------------+
13 rows in set (0.00 sec)
Create the employee_hobbies table in the import_demo database. To create a table in a database, we need to switch to that database first. Execute the below command to get into the database.
use import_demo;
mysql> use import_demo;
Database changed
mysql>
Create the employee_hobbies table by executing the below command.
CREATE TABLE employee_hobbies(id INT, hobby VARCHAR(20));
mysql> CREATE TABLE employee_hobbies(id INT, hobby VARCHAR(20));
Query OK, 0 rows affected (0.01 sec)

mysql> DESCRIBE employee_hobbies;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id    | int(11)     | YES  |     | NULL    |       |
| hobby | varchar(20) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)
Let’s insert some data into the employee_hobbies table.
INSERT INTO employee_hobbies VALUES (1, "trekking");
INSERT INTO employee_hobbies VALUES (1, "playing cricket");
INSERT INTO employee_hobbies VALUES (2, "stamp collection");
INSERT INTO employee_hobbies VALUES (3, "singing");
INSERT INTO employee_hobbies VALUES (4, "bike riding");
INSERT INTO employee_hobbies VALUES (4, "cooking");
mysql> INSERT INTO employee_hobbies VALUES (1, "trekking");
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee_hobbies VALUES (1, "playing cricket");
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee_hobbies VALUES (2, "stamp collection");
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee_hobbies VALUES (3, "singing");
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee_hobbies VALUES (4, "bike riding");
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee_hobbies VALUES (4, "cooking");
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM employee_hobbies;
+------+------------------+
| id   | hobby            |
+------+------------------+
|    1 | trekking         |
|    1 | playing cricket  |
|    2 | stamp collection |
|    3 | singing          |
|    4 | bike riding      |
|    4 | cooking          |
+------+------------------+
6 rows in set (0.00 sec)
Step 2: Let’s try to import the employee_hobbies data to HDFS.
Open a new terminal and execute the below command.
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_1
[cloudera@quickstart Desktop]$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/03/31 11:35:08 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/03/31 11:35:08 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/03/31 11:35:08 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/03/31 11:35:08 INFO tool.CodeGenTool: Beginning code generation
22/03/31 11:35:09 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:35:09 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:35:09 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/112d78d1030654519832cd50f7c29564/employee_hobbies.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/03/31 11:35:11 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/112d78d1030654519832cd50f7c29564/employee_hobbies.jar
22/03/31 11:35:11 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/03/31 11:35:11 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/03/31 11:35:11 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/03/31 11:35:11 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/03/31 11:35:11 ERROR tool.ImportTool: Import failed: No primary key could be found for table employee_hobbies. Please specify one with --split-by or perform a sequential import with '-m 1'.
As you can see from the output, the import operation failed, stating that no primary key could be found for the table employee_hobbies.
How to import a table without a primary key?
As mentioned earlier, there are two ways to address this problem:
a. Set the number of mappers to 1, OR
b. Specify a split-by column to divide the work.
Approach 1: Set the number of mappers to 1.
Using the -m option, we can set the number of mappers.
Syntax
-m noOfMappers
Example
-m 1
The above statement sets the number of mappers to 1.
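Note that -m is the short form of Sqoop’s --num-mappers option, so the below is equivalent:
--num-mappers 1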
Example
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_1 -m 1
[cloudera@quickstart Desktop]$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_1 -m 1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/03/31 11:38:07 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/03/31 11:38:07 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/03/31 11:38:07 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/03/31 11:38:07 INFO tool.CodeGenTool: Beginning code generation
22/03/31 11:38:08 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:38:08 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:38:08 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/5ba269158de6f17d50d504551219f467/employee_hobbies.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/03/31 11:38:10 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/5ba269158de6f17d50d504551219f467/employee_hobbies.jar
22/03/31 11:38:10 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/03/31 11:38:10 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/03/31 11:38:10 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/03/31 11:38:10 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/03/31 11:38:10 INFO mapreduce.ImportJobBase: Beginning import of employee_hobbies
22/03/31 11:38:10 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/03/31 11:38:10 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/03/31 11:38:11 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/03/31 11:38:11 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/03/31 11:38:11 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/03/31 11:38:12 INFO db.DBInputFormat: Using read commited transaction isolation
22/03/31 11:38:12 INFO mapreduce.JobSubmitter: number of splits:1
22/03/31 11:38:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1647946797614_0002
22/03/31 11:38:13 INFO impl.YarnClientImpl: Submitted application application_1647946797614_0002
22/03/31 11:38:13 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1647946797614_0002/
22/03/31 11:38:13 INFO mapreduce.Job: Running job: job_1647946797614_0002
22/03/31 11:38:20 INFO mapreduce.Job: Job job_1647946797614_0002 running in uber mode : false
22/03/31 11:38:20 INFO mapreduce.Job: map 0% reduce 0%
22/03/31 11:38:26 INFO mapreduce.Job: map 100% reduce 0%
22/03/31 11:38:26 INFO mapreduce.Job: Job job_1647946797614_0002 completed successfully
22/03/31 11:38:27 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=171177
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=82
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=3678
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=3678
Total vcore-milliseconds taken by all map tasks=3678
Total megabyte-milliseconds taken by all map tasks=3766272
Map-Reduce Framework
Map input records=6
Map output records=6
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=53
CPU time spent (ms)=650
Physical memory (bytes) snapshot=137052160
Virtual memory (bytes) snapshot=1510182912
Total committed heap usage (bytes)=60751872
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=82
22/03/31 11:38:27 INFO mapreduce.ImportJobBase: Transferred 82 bytes in 15.8959 seconds (5.1586 bytes/sec)
22/03/31 11:38:27 INFO mapreduce.ImportJobBase: Retrieved 6 records.
Let’s confirm the same by querying the data in the folder /import_demo_1.
[cloudera@quickstart Desktop]$ hadoop fs -ls /import_demo_1
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2022-03-31 11:38 /import_demo_1/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 82 2022-03-31 11:38 /import_demo_1/part-m-00000
[cloudera@quickstart Desktop]$
Let’s print the contents of the part file.
[cloudera@quickstart Desktop]$ hadoop fs -cat /import_demo_1/part-m-00000
1,trekking
1,playing cricket
2,stamp collection
3,singing
4,bike riding
4,cooking
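Since only one mapper ran, all six rows landed in the single part file part-m-00000.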
Approach 2: Specify a split-by column to divide the work.
Example
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_2 --split-by "hobby"
The above command splits the data by the column ‘hobby’.
[cloudera@quickstart Desktop]$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_2 --split-by "hobby"
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/03/31 11:50:38 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/03/31 11:50:38 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/03/31 11:50:39 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/03/31 11:50:39 INFO tool.CodeGenTool: Beginning code generation
22/03/31 11:50:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:50:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_hobbies` AS t LIMIT 1
22/03/31 11:50:39 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/fd2ba11459c9d21e64608b7ef88978fb/employee_hobbies.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/03/31 11:50:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/fd2ba11459c9d21e64608b7ef88978fb/employee_hobbies.jar
22/03/31 11:50:42 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/03/31 11:50:42 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/03/31 11:50:42 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/03/31 11:50:42 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/03/31 11:50:42 INFO mapreduce.ImportJobBase: Beginning import of employee_hobbies
22/03/31 11:50:42 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/03/31 11:50:42 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/03/31 11:50:43 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/03/31 11:50:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/03/31 11:50:45 INFO db.DBInputFormat: Using read commited transaction isolation
22/03/31 11:50:45 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`hobby`), MAX(`hobby`) FROM `employee_hobbies`
22/03/31 11:50:45 WARN db.TextSplitter: Generating splits for a textual index column.
22/03/31 11:50:45 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
22/03/31 11:50:45 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
22/03/31 11:50:45 INFO mapreduce.JobSubmitter: number of splits:5
22/03/31 11:50:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1647946797614_0003
22/03/31 11:50:45 INFO impl.YarnClientImpl: Submitted application application_1647946797614_0003
22/03/31 11:50:45 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1647946797614_0003/
22/03/31 11:50:45 INFO mapreduce.Job: Running job: job_1647946797614_0003
22/03/31 11:50:53 INFO mapreduce.Job: Job job_1647946797614_0003 running in uber mode : false
22/03/31 11:50:53 INFO mapreduce.Job: map 0% reduce 0%
22/03/31 11:51:13 INFO mapreduce.Job: map 20% reduce 0%
22/03/31 11:51:18 INFO mapreduce.Job: map 40% reduce 0%
22/03/31 11:51:19 INFO mapreduce.Job: map 100% reduce 0%
22/03/31 11:51:20 INFO mapreduce.Job: Job job_1647946797614_0003 completed successfully
22/03/31 11:51:21 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=857490
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=686
HDFS: Number of bytes written=82
HDFS: Number of read operations=20
HDFS: Number of large read operations=0
HDFS: Number of write operations=10
Job Counters
Launched map tasks=5
Other local map tasks=5
Total time spent by all maps in occupied slots (ms)=106550
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=106550
Total vcore-milliseconds taken by all map tasks=106550
Total megabyte-milliseconds taken by all map tasks=109107200
Map-Reduce Framework
Map input records=6
Map output records=6
Input split bytes=686
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=1247
CPU time spent (ms)=3300
Physical memory (bytes) snapshot=568188928
Virtual memory (bytes) snapshot=7548784640
Total committed heap usage (bytes)=303759360
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=82
22/03/31 11:51:21 INFO mapreduce.ImportJobBase: Transferred 82 bytes in 37.6898 seconds (2.1757 bytes/sec)
22/03/31 11:51:21 INFO mapreduce.ImportJobBase: Retrieved 6 records.
[cloudera@quickstart Desktop]$
Let’s confirm it by querying the import_demo_2 folder.
[cloudera@quickstart Desktop]$ hadoop fs -ls /import_demo_2
Found 6 items
-rw-r--r-- 1 cloudera supergroup 0 2022-03-31 11:51 /import_demo_2/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 0 2022-03-31 11:51 /import_demo_2/part-m-00000
-rw-r--r-- 1 cloudera supergroup 24 2022-03-31 11:51 /import_demo_2/part-m-00001
-rw-r--r-- 1 cloudera supergroup 0 2022-03-31 11:51 /import_demo_2/part-m-00002
-rw-r--r-- 1 cloudera supergroup 0 2022-03-31 11:51 /import_demo_2/part-m-00003
-rw-r--r-- 1 cloudera supergroup 58 2022-03-31 11:51 /import_demo_2/part-m-00004
[cloudera@quickstart Desktop]$
Query the files in the import_demo_2 folder using the cat command.
[cloudera@quickstart Desktop]$ hadoop fs -cat /import_demo_2/*
4,bike riding
4,cooking
1,trekking
1,playing cricket
2,stamp collection
3,singing
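Notice that three of the five part files above are empty; with only six distinct hobby values, the text-based split boundaries did not line up evenly with the data.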
Note
22/03/31 11:50:45 WARN db.TextSplitter: Generating splits for a textual index column.
22/03/31 11:50:45 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
22/03/31 11:50:45 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
Even though Sqoop can split on a text column, it recommends using an integer column to perform the split.
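Since id is an integer column in our table, it is the better split column here. A sketch of the same import split on id (the target directory /import_demo_2b is just a hypothetical name for this example):
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --target-dir /import_demo_2b --split-by "id"
With an integer split column, Sqoop divides the MIN(id)..MAX(id) range numerically and avoids the case-sensitivity caveat above.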
--warehouse-dir
The --warehouse-dir option creates a folder named after the table inside the specified location and copies the content into it.
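For comparison, here is where the part files land with each option (a sketch using this post’s paths):
--target-dir /import_demo_3 writes the part files directly under /import_demo_3
--warehouse-dir /import_demo_3 writes them under /import_demo_3/employee_hobbies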
Example
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/import_demo" --username "root" --password "cloudera" --table "employee_hobbies" --warehouse-dir /import_demo_3 --split-by "hobby"
The above command creates an employee_hobbies folder inside /import_demo_3 and copies the part files into it.
[cloudera@quickstart Desktop]$ hadoop fs -ls /import_demo_3
Found 1 items
drwxr-xr-x - cloudera supergroup 0 2022-03-31 11:57 /import_demo_3/employee_hobbies
[cloudera@quickstart Desktop]$
[cloudera@quickstart Desktop]$ hadoop fs -ls /import_demo_3/employee_hobbies
Found 6 items
-rw-r--r-- 1 cloudera supergroup 0 2022-03-31 11:57 /import_demo_3/employee_hobbies/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 0 2022-03-31 11:57 /import_demo_3/employee_hobbies/part-m-00000
-rw-r--r-- 1 cloudera supergroup 24 2022-03-31 11:57 /import_demo_3/employee_hobbies/part-m-00001
-rw-r--r-- 1 cloudera supergroup 0 2022-03-31 11:57 /import_demo_3/employee_hobbies/part-m-00002
-rw-r--r-- 1 cloudera supergroup 0 2022-03-31 11:57 /import_demo_3/employee_hobbies/part-m-00003
-rw-r--r-- 1 cloudera supergroup 58 2022-03-31 11:57 /import_demo_3/employee_hobbies/part-m-00004
[cloudera@quickstart Desktop]$