In this post, I am going to explain how to handle null values while importing data with the ‘sqoop import’ command.
Sqoop provides the following options to handle null values.
a. --null-non-string
b. --null-string
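By default, Sqoop writes the literal string ‘null’ for SQL NULL values in text output; these two options let us override that representation per column type.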
--null-non-string
Using this option, we can specify a default value to write for non-text fields when the column value is null.
Example
--null-non-string "-1"
--null-string
Using this option, we can specify a default value to write for text fields when the column value is null.
Example
--null-string "not_set"
Let’s see it with an example.
Step 1: Log in to the MySQL console.
[cloudera@quickstart ~]$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 62
Server version: 5.1.73 Source distribution
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
‘cloudera’ is the password for the MySQL instance running on the Cloudera QuickStart VM.
Step 2: Create a database ‘sample’ and a new table ‘employee’.
mysql> CREATE DATABASE sample;
Query OK, 1 row affected (0.01 sec)

mysql> USE sample;
Database changed

mysql> CREATE TABLE employee (id INT, name VARCHAR(20), email VARCHAR(50), age INT);
Query OK, 0 rows affected (0.02 sec)

mysql> DESC employee;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id    | int(11)     | YES  |     | NULL    |       |
| name  | varchar(20) | YES  |     | NULL    |       |
| email | varchar(50) | YES  |     | NULL    |       |
| age   | int(11)     | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
Step 3: Let’s insert some data into the employee table.
INSERT INTO employee VALUES (1, "Krishna", "XXXXXXXXXXX", 23);
INSERT INTO employee VALUES (2, "Ram", "XXXXXXXXXXX", null);
INSERT INTO employee VALUES (3, null, "XXXXXXXXXXX", 23);
INSERT INTO employee VALUES (4, "nayan", "XXXXXXXXXXX", 23);
INSERT INTO employee VALUES (5, null, "XXXXXXXXXXX", null);
mysql> INSERT INTO employee VALUES (1, "Krishna", "XXXXXXXXXXX", 23);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee VALUES (2, "Ram", "XXXXXXXXXXX", null);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee VALUES (3, null, "XXXXXXXXXXX", 23);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee VALUES (4, "nayan", "XXXXXXXXXXX", 23);
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO employee VALUES (5, null, "XXXXXXXXXXX", null);
Query OK, 1 row affected (0.01 sec)

mysql> SELECT * FROM employee;
+------+---------+-------------+------+
| id   | name    | email       | age  |
+------+---------+-------------+------+
|    1 | Krishna | XXXXXXXXXXX |   23 |
|    2 | Ram     | XXXXXXXXXXX | NULL |
|    3 | NULL    | XXXXXXXXXXX |   23 |
|    4 | nayan   | XXXXXXXXXXX |   23 |
|    5 | NULL    | XXXXXXXXXXX | NULL |
+------+---------+-------------+------+
5 rows in set (0.00 sec)
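Notice that rows 2, 3, and 5 contain NULLs: row 2 has a null age (a non-text column), row 3 a null name (a text column), and row 5 both. These are the values the import should substitute.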
Step 4: Let’s import the data into HDFS, handling the null values.
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/sample" \
--username "root" \
--password "cloudera" \
--table "employee" \
--target-dir /null-values-demo \
--null-non-string "-1" \
--null-string "not_set" \
-m 1
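In the command above, --connect, --username, and --password supply the JDBC connection details, --table names the source table, and --target-dir sets the HDFS output directory. The option -m 1 runs the import with a single mapper, so all rows land in one output file. The two --null-* options tell Sqoop to write -1 for null non-text columns and not_set for null text columns.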
[cloudera@quickstart ~]$ sqoop import \
> --connect "jdbc:mysql://quickstart.cloudera:3306/sample" \
> --username "root" \
> --password "cloudera" \
> --table "employee" \
> --target-dir /null-values-demo \
> --null-non-string "-1" \
> --null-string "not_set" \
> -m 1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/04/04 00:14:22 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/04/04 00:14:22 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/04/04 00:14:22 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/04/04 00:14:22 INFO tool.CodeGenTool: Beginning code generation
22/04/04 00:14:23 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
22/04/04 00:14:23 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
22/04/04 00:14:23 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/f95fb039b159ba67971cea528f8a090e/employee.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/04/04 00:14:25 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/f95fb039b159ba67971cea528f8a090e/employee.jar
22/04/04 00:14:25 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/04/04 00:14:25 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/04/04 00:14:25 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/04/04 00:14:25 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/04/04 00:14:25 INFO mapreduce.ImportJobBase: Beginning import of employee
22/04/04 00:14:25 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/04/04 00:14:25 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/04/04 00:14:26 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/04/04 00:14:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/04/04 00:14:27 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/04 00:14:27 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/04 00:14:28 INFO db.DBInputFormat: Using read commited transaction isolation
22/04/04 00:14:28 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/04 00:14:28 INFO mapreduce.JobSubmitter: number of splits:1
22/04/04 00:14:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1649003113144_0009
22/04/04 00:14:29 INFO impl.YarnClientImpl: Submitted application application_1649003113144_0009
22/04/04 00:14:29 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1649003113144_0009/
22/04/04 00:14:29 INFO mapreduce.Job: Running job: job_1649003113144_0009
22/04/04 00:14:36 INFO mapreduce.Job: Job job_1649003113144_0009 running in uber mode : false
22/04/04 00:14:36 INFO mapreduce.Job: map 0% reduce 0%
22/04/04 00:14:42 INFO mapreduce.Job: map 100% reduce 0%
22/04/04 00:14:42 INFO mapreduce.Job: Job job_1649003113144_0009 completed successfully
22/04/04 00:14:42 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=171465
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=119
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=3506
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=3506
Total vcore-milliseconds taken by all map tasks=3506
Total megabyte-milliseconds taken by all map tasks=3590144
Map-Reduce Framework
Map input records=5
Map output records=5
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=52
CPU time spent (ms)=620
Physical memory (bytes) snapshot=137809920
Virtual memory (bytes) snapshot=1510182912
Total committed heap usage (bytes)=60751872
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=119
22/04/04 00:14:42 INFO mapreduce.ImportJobBase: Transferred 119 bytes in 15.9547 seconds (7.4586 bytes/sec)
22/04/04 00:14:42 INFO mapreduce.ImportJobBase: Retrieved 5 records.
[cloudera@quickstart ~]$
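One side note on the log above: it warns that passing --password on the command line is insecure. As the warning itself suggests, you can replace --password with -P so Sqoop prompts for the password at run time. A minimal sketch (if you rerun the import, remove the target directory first, since Sqoop fails when it already exists):

hadoop fs -rm -r /null-values-demo

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/sample" \
--username "root" \
-P \
--table "employee" \
--target-dir /null-values-demo \
--null-non-string "-1" \
--null-string "not_set" \
-m 1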
Let’s print the contents of the directory ‘/null-values-demo’ and confirm that the null values were replaced.
[cloudera@quickstart ~]$ hadoop fs -cat /null-values-demo/*
1,Krishna,XXXXXXXXXXX,23
2,Ram,XXXXXXXXXXX,-1
3,not_set,XXXXXXXXXXX,23
4,nayan,XXXXXXXXXXX,23
5,not_set,XXXXXXXXXXX,-1
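As expected, the NULLs in the text column name were written as not_set, and the NULLs in the non-text column age were written as -1, exactly as specified by --null-string and --null-non-string.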