Saturday 2 January 2016

Hadoop: Java: globbing

Globbing lets you use wildcard characters to match a number of files with a single expression. The FileSystem class provides the following methods for processing globs.

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

Both methods return an array of FileStatus objects whose paths match the path pattern; the second overload additionally filters the matches through the supplied PathFilter.

Standard wildcards

Wildcard    Description
?           Matches a single character. "c?t" matches cat, cot, cit, etc.
*           Matches zero or more characters. "a*" matches the empty string, a, aa, aaa, ...
[]          Specifies a character range. "m[a-e]m" matches mam, mbm, mcm, mdm, mem.
{}          Alternation. "{a,b}" matches either expression a or expression b.
\           Escapes the following metacharacter.
[^]         Negated character range. "[^a-b]" matches any character outside the range a to b.
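These wildcards follow the familiar shell glob grammar. As a quick, Hadoop-free way to experiment with them, Java's built-in PathMatcher accepts the same ?, *, [] and {} constructs (note that java.nio uses [!...] rather than [^...] for negation, so that row differs):

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {

    /* True if the given glob pattern matches the file name. */
    public static boolean matches(String pattern, String name) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return m.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        System.out.println(matches("c?t", "cat"));        // ? matches one character
        System.out.println(matches("m[a-e]m", "mcm"));    // [] matches a range
        System.out.println(matches("m[a-e]m", "mzm"));    // z is outside a-e
        System.out.println(matches("{sun,moon}rise", "moonrise")); // {} alternation
    }
}
```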

Step 1: Set JAVA_HOME (If it is not set already)

Step 2: Set HADOOP_CLASSPATH as follows:
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar


Step 3: The following is an example application that uses globbing.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobEx {

    private static final String uri = "hdfs://localhost/user/harikrishna_gurram";
    private static final Configuration config = new Configuration();

    public static void printFileInfo() throws IOException {
        /* Get the FileSystem object for the given URI */
        FileSystem fs = FileSystem.get(URI.create(uri), config);

        /* Match all files */
        Path pattern = new Path("*");
        for (FileStatus status : fs.globStatus(pattern)) {
            System.out.println("Path : " + status.getPath());
        }

        System.out.println("*****************************");

        /* Match all files starting with 2015 */
        pattern = new Path("2015*");
        for (FileStatus status : fs.globStatus(pattern)) {
            System.out.println("Path : " + status.getPath());
        }

        fs.close();
    }

    public static void main(String[] args) throws IOException {
        printFileInfo();
    }
}

String uri = "hdfs://localhost/user/harikrishna_gurram";

The uri locates the files in HDFS. The host details for the above URI are configured in the "hadoop-2.6.0/etc/hadoop/core-site.xml" file.

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://localhost/</value>
                <description>NameNode URI</description>
        </property>
</configuration>

Please refer to the Hadoop setup here.

Path pattern = new Path("*");
The above statement matches all files and directories in the user's home directory (relative paths in HDFS are resolved against it).

pattern = new Path("2015*");
The above statement matches all files whose names start with 2015.
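Patterns from the wildcard table can narrow such matches further, e.g. to specific days. The sketch below checks a few candidate patterns against the example file names locally, using java.nio's PathMatcher, whose glob grammar for ?, *, [] and {} mirrors Hadoop's:

```java
import java.nio.file.FileSystems;
import java.nio.file.Paths;

public class PatternCheck {

    /* True if the glob pattern matches the file name. */
    public static boolean globMatches(String pattern, String fileName) {
        return FileSystems.getDefault()
                .getPathMatcher("glob:" + pattern)
                .matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        /* [12] restricts the day to 1 or 2 */
        System.out.println(globMatches("2015-01-0[12].txt", "2015-01-01.txt"));
        System.out.println(globMatches("2015-01-0[12].txt", "2015-01-03.txt"));
        /* {2014,2015} matches files from either year */
        System.out.println(globMatches("{2014,2015}-01-01.txt", "2014-01-01.txt"));
    }
}
```

The same pattern strings can then be passed to globStatus via new Path("2015-01-0[12].txt").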

Step 4: Compile the above Java file.
$ hadoop com.sun.tools.javac.Main GlobEx.java

Step 5: Create a jar file.
$ jar cf glob.jar GlobEx*class

Step 6: Run the jar file.

$ hadoop jar glob.jar GlobEx
Path : hdfs://localhost/user/harikrishna_gurram/2014-01-01.txt
Path : hdfs://localhost/user/harikrishna_gurram/2014-01-02.txt
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-01.txt
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-02.txt
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-03.txt
Path : hdfs://localhost/user/harikrishna_gurram/dir1
Path : hdfs://localhost/user/harikrishna_gurram/dir2
Path : hdfs://localhost/user/harikrishna_gurram/directory1
Path : hdfs://localhost/user/harikrishna_gurram/dummy.txt
Path : hdfs://localhost/user/harikrishna_gurram/first
Path : hdfs://localhost/user/harikrishna_gurram/sample.zip
*****************************
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-01.txt
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-02.txt
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-03.txt



