Counting is one of the most fundamental and most frequently applied operations in graph traversal. In many analytical and operational scenarios, the primary concern is not which elements exist in the graph, but how many of them there are.
Typical questions include:
· How many users are stored in the graph?
· How many orders have been placed?
· How many relationships exist between entities?
· How many connections originate from a given type of vertex?
Each of these questions focuses on quantity, not identity. Apache TinkerPop addresses such needs through the count() step, which determines how many traversers reach a specific point in a traversal.
Before exploring more advanced techniques such as counting grouped or aggregated results, it is essential to understand how to count individual elements accurately and efficiently.
1. Example Domain: An Online Library System
To keep the discussion focused, consider a simple graph that represents an online library. The simplicity of the model allows attention to remain on traversal mechanics rather than domain complexity.
1.1 Graph Elements
Vertex labels
· book: represents a book in the library
· author: represents an author
Edge labels
· writtenBy: connects a book to its author
This model captures a common one-to-many relationship, an author may write multiple books, while each book is written by exactly one author.
1.2 Demo Example
The following Gremlin example creates a small, self-contained graph suitable for experimentation and validation of counting traversals.
Step 1: Create the Graph and Traversal Source
graph = TinkerGraph.open() g = graph.traversal()
gremlin> graph = TinkerGraph.open() ==>tinkergraph[vertices:0 edges:0] gremlin> g = graph.traversal() ==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
This creates an in-memory graph and a traversal source named g.
Step 2: Create Author Vertices
a1 = g.addV('author').property('name', 'R. K. Narayan').next() a2 = g.addV('author').property('name', 'Arundhati Roy').next() a3 = g.addV('author').property('name', 'Rabindranath Tagore').next()
gremlin> a1 = g.addV('author').property('name', 'R. K. Narayan').next() ==>v[0] gremlin> a2 = g.addV('author').property('name', 'Arundhati Roy').next() ==>v[2] gremlin> a3 = g.addV('author').property('name', 'Rabindranath Tagore').next() ==>v[4]
The graph now contains three author vertices.
Step 3: Create Book Vertices
b1 = g.addV('book').property('title', 'Malgudi Days').next() b2 = g.addV('book').property('title', 'Swami and Friends').next() b3 = g.addV('book').property('title', 'The God of Small Things').next() b4 = g.addV('book').property('title', 'Gitanjali').next()
At this stage, four book vertices exist in the graph.
Step 4: Create writtenBy Edges
g.V(b1).addE('writtenBy').to(a1).next() g.V(b2).addE('writtenBy').to(a1).next() g.V(b3).addE('writtenBy').to(a2).next() g.V(b4).addE('writtenBy').to(a3).next()
gremlin> g.V(b1).addE('writtenBy').to(a1).next() ==>e[14][6-writtenBy->0] gremlin> g.V(b2).addE('writtenBy').to(a1).next() ==>e[15][8-writtenBy->0] gremlin> g.V(b3).addE('writtenBy').to(a2).next() ==>e[16][10-writtenBy->2] gremlin> g.V(b4).addE('writtenBy').to(a3).next() ==>e[17][12-writtenBy->4]
Conceptually, the graph can be represented as:
(Book) Malgudi Days ──writtenBy──▶ R. K. Narayan
(Book) Swami and Friends ──writtenBy──▶ R. K. Narayan
(Book) The God of Small Things ──writtenBy──▶ Arundhati Roy
(Book) Gitanjali ──writtenBy──▶ Rabindranath Tagore
Example 1: How many books exist in the graph?
g.V().hasLabel('book').count()
How this traversal works?
· g.V(): Starts the traversal at all vertices in the graph.
· hasLabel('book'): Filters the traversers to include only vertices labeled book.
· count(): Counts how many traversers remain at this point in the traversal.
gremlin> g.V().hasLabel('book').count() ==>4
This result indicates that the graph currently contains 4 book vertices.
Example 2: Counting Relationships (Edges)
How many authorship relationships exist in the graph?
In other words, how many writtenBy edges are present?
One effective approach is to start from books and traverse their outgoing edges.
g.V().hasLabel('book').outE('writtenBy').count()
gremlin> g.V().hasLabel('book').outE('writtenBy').count() ==>4
This tells us that there are 4 book-to-author relationships in the graph. The number may be higher than the number of books if, for example, the model allows multiple authors per book.
Here we,
· Begin at all vertices.
· Restrict the traversal to book vertices.
· From each book, traverse outgoing edges labeled writtenBy.
· Count the number of such edges encountered.
Counting by Starting from Edges
Another valid approach is to begin the traversal directly from edges.
gremlin> g.E().hasLabel('writtenBy').count() ==>4
Although this traversal is syntactically correct, it is generally discouraged for large property graphs. In real-world datasets, edges often outnumber vertices by a significant margin. Starting from g.E() can therefore lead to unnecessary scanning and reduced performance.
Why Traversal Starting Points Matter?
All three traversals below return the same result.
· g.V().hasLabel('book').outE('writtenBy').count()
· g.V().outE('writtenBy').count()
· g.E().hasLabel('writtenBy').count()
However, they differ substantially in efficiency.
Practical Guidelines
· Prefer starting from vertices rather than edges.
· Apply restrictive filters as early as possible.
· Minimise the portion of the graph examined before counting.
As graph size grows from thousands to millions or billions of elements, these choices have a significant impact on query latency and resource usage.
In summary,
· count() reports how many traversers reach a given point in a traversal.
· It can be applied to:
o Vertices
o Edges
o Results of intermediate traversal steps
· Multiple traversals may yield the same numeric result, but not all traversals are equally efficient.
· Careful traversal design, especially the choice of starting point, is essential for scalable graph querying.
Previous Next Home
No comments:
Post a Comment