Wednesday 25 May 2022

Sqoop: Handle null values while importing data

In this post, I am going to explain how to handle null values while importing data with the ‘sqoop import’ command.

 

By default, Sqoop writes the literal string ‘null’ for NULL column values in a text import. Sqoop provides the following options to override that.

a.   --null-non-string

b.   --null-string

 

--null-non-string

Using this option, we can specify the string to write for non-text (non-string) fields when the value is null.

 

Example

--null-non-string "-1"

 

--null-string

Using this option, we can specify the string to write for text (string) fields when the value is null.

 

Example

--null-string "not_set"
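
For comparison, if neither option is supplied, Sqoop falls back to its default and writes the literal string ‘null’. The import performed later in this post would then produce lines like these (a sketch of the expected default output, not captured from a run):

2,Ram,XXXXXXXXXXX,null
3,null,XXXXXXXXXXX,23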

 

Let’s see it with an example.

 

Step 1: Log in to the MySQL console.

[cloudera@quickstart ~]$ mysql -u root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 62
Server version: 5.1.73 Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 

 

‘cloudera’ is the password for the MySQL instance running on the Cloudera QuickStart VM.

 

Step 2: Create a database ‘sample’ and a new table ‘employee’.

mysql> CREATE DATABASE sample;
Query OK, 1 row affected (0.01 sec)

mysql> 
mysql> USE sample;
Database changed
mysql> 
mysql> CREATE TABLE employee (id INT, name VARCHAR(20), email VARCHAR(50), age int);
Query OK, 0 rows affected (0.02 sec)

mysql> DESC employee;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id    | int(11)     | YES  |     | NULL    |       |
| name  | varchar(20) | YES  |     | NULL    |       |
| email | varchar(50) | YES  |     | NULL    |       |
| age   | int(11)     | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
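
All four columns allow NULLs (Null = YES above), so any of them can hold a null value. In the next step, we deliberately leave some name (string) and age (non-string) values null.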

 

Step 3: Let’s insert some data into the employee table.

INSERT INTO employee VALUES (1, "Krishna", "XXXXXXXXXXX", 23);
INSERT INTO employee VALUES (2, "Ram", "XXXXXXXXXXX",null);
INSERT INTO employee VALUES (3,null, "XXXXXXXXXXX", 23);
INSERT INTO employee VALUES (4, "nayan", "XXXXXXXXXXX", 23);
INSERT INTO employee VALUES (5,null, "XXXXXXXXXXX",null);

mysql> INSERT INTO employee VALUES (1, "Krishna", "XXXXXXXXXXX", 23);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee VALUES (2, "Ram", "XXXXXXXXXXX",null);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee VALUES (3,null, "XXXXXXXXXXX", 23);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee VALUES (4, "nayan", "XXXXXXXXXXX", 23);
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO employee VALUES (5,null, "XXXXXXXXXXX",null);
Query OK, 1 row affected (0.01 sec)

mysql> 
mysql> SELECT * FROM employee;
+------+---------+-------------+------+
| id   | name    | email       | age  |
+------+---------+-------------+------+
|    1 | Krishna | XXXXXXXXXXX |   23 |
|    2 | Ram     | XXXXXXXXXXX | NULL |
|    3 | NULL    | XXXXXXXXXXX |   23 |
|    4 | nayan   | XXXXXXXXXXX |   23 |
|    5 | NULL    | XXXXXXXXXXX | NULL |
+------+---------+-------------+------+
5 rows in set (0.00 sec)
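
Note the NULLs in the result: rows 3 and 5 have a null name (a string column, which --null-string will handle), and rows 2 and 5 have a null age (a non-string column, which --null-non-string will handle).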

 

Step 4: Let’s import the data into HDFS, substituting defaults for the null values. The -m 1 option runs the job with a single mapper, so the result lands in a single output file.

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/sample" \
--username "root" \
--password "cloudera" \
--table "employee" \
--target-dir /null-values-demo \
--null-non-string "-1" \
--null-string "not_set" \
-m 1
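
Note that Sqoop itself warns in the log below that passing --password on the command line is insecure. A safer variant of the same command (a sketch, not run here) uses -P, which prompts for the password interactively:

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/sample" \
--username "root" \
-P \
--table "employee" \
--target-dir /null-values-demo \
--null-non-string "-1" \
--null-string "not_set" \
-m 1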

[cloudera@quickstart ~]$ sqoop import \
> --connect "jdbc:mysql://quickstart.cloudera:3306/sample" \
> --username "root" \
> --password "cloudera" \
> --table "employee" \
> --target-dir /null-values-demo \
> --null-non-string "-1" \
> --null-string "not_set" \
> -m 1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
22/04/04 00:14:22 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
22/04/04 00:14:22 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/04/04 00:14:22 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/04/04 00:14:22 INFO tool.CodeGenTool: Beginning code generation
22/04/04 00:14:23 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
22/04/04 00:14:23 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
22/04/04 00:14:23 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/f95fb039b159ba67971cea528f8a090e/employee.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/04/04 00:14:25 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/f95fb039b159ba67971cea528f8a090e/employee.jar
22/04/04 00:14:25 WARN manager.MySQLManager: It looks like you are importing from mysql.
22/04/04 00:14:25 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
22/04/04 00:14:25 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
22/04/04 00:14:25 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
22/04/04 00:14:25 INFO mapreduce.ImportJobBase: Beginning import of employee
22/04/04 00:14:25 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
22/04/04 00:14:25 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
22/04/04 00:14:26 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
22/04/04 00:14:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
22/04/04 00:14:27 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/04 00:14:27 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894) 
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/04 00:14:28 INFO db.DBInputFormat: Using read commited transaction isolation
22/04/04 00:14:28 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
22/04/04 00:14:28 INFO mapreduce.JobSubmitter: number of splits:1
22/04/04 00:14:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1649003113144_0009
22/04/04 00:14:29 INFO impl.YarnClientImpl: Submitted application application_1649003113144_0009
22/04/04 00:14:29 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1649003113144_0009/
22/04/04 00:14:29 INFO mapreduce.Job: Running job: job_1649003113144_0009
22/04/04 00:14:36 INFO mapreduce.Job: Job job_1649003113144_0009 running in uber mode : false
22/04/04 00:14:36 INFO mapreduce.Job:  map 0% reduce 0%
22/04/04 00:14:42 INFO mapreduce.Job:  map 100% reduce 0%
22/04/04 00:14:42 INFO mapreduce.Job: Job job_1649003113144_0009 completed successfully
22/04/04 00:14:42 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=171465
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=119
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3506
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=3506
        Total vcore-milliseconds taken by all map tasks=3506
        Total megabyte-milliseconds taken by all map tasks=3590144
    Map-Reduce Framework
        Map input records=5
        Map output records=5
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=52
        CPU time spent (ms)=620
        Physical memory (bytes) snapshot=137809920
        Virtual memory (bytes) snapshot=1510182912
        Total committed heap usage (bytes)=60751872
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=119
22/04/04 00:14:42 INFO mapreduce.ImportJobBase: Transferred 119 bytes in 15.9547 seconds (7.4586 bytes/sec)
22/04/04 00:14:42 INFO mapreduce.ImportJobBase: Retrieved 5 records.
[cloudera@quickstart ~]$

Let’s query the contents of the folder ‘/null-values-demo’ and confirm that the null values were replaced as specified.

 

[cloudera@quickstart ~]$ hadoop fs -cat /null-values-demo/*
1,Krishna,XXXXXXXXXXX,23
2,Ram,XXXXXXXXXXX,-1
3,not_set,XXXXXXXXXXX,23
4,nayan,XXXXXXXXXXX,23
5,not_set,XXXXXXXXXXX,-1
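
As expected, the null name values were replaced with ‘not_set’ and the null age values with ‘-1’. A related tip: if the imported files are meant to be queried from Hive, keep in mind that Hive encodes NULL as \N in text files, so in that case you would typically use the following options instead (the extra backslash escapes the sequence for Sqoop’s generated code):

--null-string '\\N' \
--null-non-string '\\N'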

 

 
