“Globbing” lets you use wildcard characters to match a number of files with a single expression. The FileSystem class provides the following methods for processing globs.
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
Both methods return an array of FileStatus objects for the paths that match the path pattern.
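The two-argument overload takes a PathFilter that can reject matches the glob pattern alone cannot express. As a hedged sketch (the class name `GlobWithFilterEx`, the filter class, and the regex are my own illustrations, not from the original example):

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class GlobWithFilterEx {

	/* Rejects any path whose string form matches the given regex. */
	static class RegexExcludeFilter implements PathFilter {
		private final String regex;

		RegexExcludeFilter(String regex) {
			this.regex = regex;
		}

		@Override
		public boolean accept(Path path) {
			return !path.toString().matches(regex);
		}
	}

	public static void main(String[] args) throws IOException {
		Configuration config = new Configuration();
		FileSystem fs = FileSystem.get(
				URI.create("hdfs://localhost/user/harikrishna_gurram"), config);

		/* All entries starting with 2015, except those from 2015-01-01 */
		FileStatus[] statuses = fs.globStatus(new Path("2015*"),
				new RegexExcludeFilter(".*2015-01-01.*"));
		for (FileStatus status : statuses) {
			System.out.println("Path : " + status.getPath());
		}
	}
}
```

The filter is applied after the glob expansion, so it narrows the matched set rather than widening it.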
Standard wildcards

Wildcard | Description
? | Matches a single character. “c?t” matches cot, cat, cit, etc.
* | Matches any number of characters (including none). “a*” matches the empty string, a, aa, aaa, …
[] | Specifies a character range. “m[a-e]m” matches mam, mbm, mcm, mdm, mem.
{a,b} | Matches either expression a or expression b.
\ | Escape character; put it before a wildcard to match that character literally.
[^a-b] | Negates the character range from a to b.
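These wildcard forms closely resemble the glob syntax supported by Java's own java.nio API, so their semantics can be illustrated locally without an HDFS cluster. A minimal sketch, assuming a plain JVM (note that java.nio writes range negation as [!a-b] rather than Hadoop's [^a-b]):

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobSemantics {

	/* Returns true when the glob pattern matches the given file name. */
	static boolean globMatches(String pattern, String name) {
		PathMatcher matcher =
				FileSystems.getDefault().getPathMatcher("glob:" + pattern);
		return matcher.matches(Paths.get(name));
	}

	public static void main(String[] args) {
		System.out.println(globMatches("c?t", "cat"));    // ? matches exactly one character
		System.out.println(globMatches("c?t", "ct"));     // false: ? requires one character
		System.out.println(globMatches("m[a-e]m", "mcm"));
		System.out.println(globMatches("{2014,2015}*", "2015-01-01.txt"));
	}
}
```

This is only a local analogue for experimenting with patterns; the Hadoop example below applies the same style of patterns against HDFS paths via globStatus.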
Step 1: Set JAVA_HOME (if it is not already set).
Step 2: Set HADOOP_CLASSPATH as follows:
export
HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Step 3: The following is an example application that uses globbing.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobEx {
	private static final String uri = "hdfs://localhost/user/harikrishna_gurram";
	private static final Configuration config = new Configuration();

	public static void printFileInfo() throws IOException {
		/* Get FileSystem object for given uri */
		FileSystem fs = FileSystem.get(URI.create(uri), config);

		/* Get all files */
		Path pattern = new Path("*");
		for (FileStatus status : fs.globStatus(pattern)) {
			System.out.println("Path : " + status.getPath());
		}

		System.out.println("*****************************");

		/* Get all files in 2015 */
		pattern = new Path("2015*");
		for (FileStatus status : fs.globStatus(pattern)) {
			System.out.println("Path : " + status.getPath());
		}
	}

	public static void main(String args[]) throws IOException {
		printFileInfo();
	}
}
String uri = "hdfs://localhost/user/harikrishna_gurram";
“uri” is used to locate the file in HDFS. The host details for the above uri are configured in the “hadoop-2.6.0/etc/hadoop/core-site.xml” file.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
    <description>NameNode URI</description>
  </property>
</configuration>
Please refer to the Hadoop setup instructions to install and configure Hadoop.
Path pattern = new Path("*");
The above statement matches all files and directories in the user's HDFS home directory (relative patterns are resolved against the working directory).
pattern = new Path("2015*");
The above statement matches all files and directories whose names start with “2015”.
Step 4: Compile the above Java file.
$ hadoop
com.sun.tools.javac.Main GlobEx.java
Step 5: Create a jar file.
$ jar cf glob.jar GlobEx*.class
Step 6: Run the jar file.
$ hadoop jar
glob.jar GlobEx
Path : hdfs://localhost/user/harikrishna_gurram/2014-01-01.txt
Path : hdfs://localhost/user/harikrishna_gurram/2014-01-02.txt
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-01.txt
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-02.txt
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-03.txt
Path : hdfs://localhost/user/harikrishna_gurram/dir1
Path : hdfs://localhost/user/harikrishna_gurram/dir2
Path : hdfs://localhost/user/harikrishna_gurram/directory1
Path : hdfs://localhost/user/harikrishna_gurram/dummy.txt
Path : hdfs://localhost/user/harikrishna_gurram/first
Path : hdfs://localhost/user/harikrishna_gurram/sample.zip
*****************************
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-01.txt
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-02.txt
Path : hdfs://localhost/user/harikrishna_gurram/2015-01-03.txt