When most people think about Gremlin in Apache TinkerPop, they think about traversals: walking vertices and edges to discover relationships. But Gremlin is more than a navigation language. It also provides aggregation steps that perform numerical analysis directly inside a traversal.
Modern graph applications are not only about connections; they are also about measurement:
· How many connections exist?
· What is the average number of relationships per vertex?
· What is the maximum capacity, weight, or score in the graph?
· What is the minimum value across a dataset?
· What is the total of a numeric property?
Instead of exporting graph data into an external analytics system, Gremlin allows you to compute these insights in place, within the graph engine itself.
Why Statistical Steps Matter in Graph Systems
In relational databases, aggregation functions like COUNT, SUM, and AVG are common. Gremlin provides similar capabilities, with one key advantage: aggregations can be applied at any point in a traversal.
This means statistics can be:
· Global (across the entire graph)
· Label-specific (only certain vertices or edges)
· Traversal-scoped (computed per vertex using local())
· Relationship-aware (based on edge counts)
This flexibility turns Gremlin into a lightweight analytics framework embedded directly into your graph queries.
1. University Graph Model
Let's model university data as a Gremlin graph.
Vertex Labels
· student
· course
· professor
· department
Edge Labels
· enrolled_in (student → course)
· teaches (professor → course)
· belongs_to (course → department)
Step 1: Create the graph and its traversal source
graph = TinkerGraph.open()
g = graph.traversal()
Step 2: Create Departments
g.addV('department').property('name','Computer Science').property('budget',500000)
g.addV('department').property('name','Mathematics').property('budget',300000)
g.addV('department').property('name','Physics').property('budget',200000)
Step 3: Create Courses
g.addV('course').property('name','Algorithms').property('credits',4)
g.addV('course').property('name','Data Structures').property('credits',3)
g.addV('course').property('name','Calculus').property('credits',4)
g.addV('course').property('name','Quantum Mechanics').property('credits',5)
Step 4: Create Students
g.addV('student').property('name','Alice').property('age',20).property('gpa',3.8)
g.addV('student').property('name','Bob').property('age',22).property('gpa',3.2)
g.addV('student').property('name','Carol').property('age',21).property('gpa',3.6)
g.addV('student').property('name','David').property('age',23).property('gpa',2.9)
Step 5: Create Professors
g.addV('professor').property('name','Dr. Smith').property('salary',120000)
g.addV('professor').property('name','Dr. Brown').property('salary',95000)
g.addV('professor').property('name','Dr. Lee').property('salary',110000)
Step 6: Connect Courses → Departments
g.V().has('course','name','Algorithms').
  addE('belongs_to').
  to(__.V().has('department','name','Computer Science'))
g.V().has('course','name','Data Structures').
  addE('belongs_to').
  to(__.V().has('department','name','Computer Science'))
g.V().has('course','name','Calculus').
  addE('belongs_to').
  to(__.V().has('department','name','Mathematics'))
g.V().has('course','name','Quantum Mechanics').
  addE('belongs_to').
  to(__.V().has('department','name','Physics'))
Step 7: Connect Professors → Courses
g.V().has('professor','name','Dr. Smith').
  addE('teaches').
  to(__.V().has('course','name','Algorithms'))
g.V().has('professor','name','Dr. Smith').
  addE('teaches').
  to(__.V().has('course','name','Data Structures'))
g.V().has('professor','name','Dr. Brown').
  addE('teaches').
  to(__.V().has('course','name','Calculus'))
g.V().has('professor','name','Dr. Lee').
  addE('teaches').
  to(__.V().has('course','name','Quantum Mechanics'))
Step 8: Connect Students → Courses
g.V().has('student','name','Alice').
  addE('enrolled_in').
  to(__.V().has('course','name','Algorithms'))
g.V().has('student','name','Alice').
  addE('enrolled_in').
  to(__.V().has('course','name','Data Structures'))
g.V().has('student','name','Bob').
  addE('enrolled_in').
  to(__.V().has('course','name','Calculus'))
g.V().has('student','name','Carol').
  addE('enrolled_in').
  to(__.V().has('course','name','Algorithms'))
g.V().has('student','name','Carol').
  addE('enrolled_in').
  to(__.V().has('course','name','Calculus'))
g.V().has('student','name','David').
  addE('enrolled_in').
  to(__.V().has('course','name','Quantum Mechanics'))
Confirm the Graph by printing vertices and edges.
gremlin> g.V().valueMap(true)
==>[id:0,label:department,name:[Computer Science],budget:[500000]]
==>[id:33,label:student,name:[David],gpa:[2.9],age:[23]]
==>[id:3,label:department,name:[Mathematics],budget:[300000]]
==>[id:37,label:professor,name:[Dr. Smith],salary:[120000]]
==>[id:6,label:department,name:[Physics],budget:[200000]]
==>[id:40,label:professor,name:[Dr. Brown],salary:[95000]]
==>[id:9,label:course,credits:[4],name:[Algorithms]]
==>[id:43,label:professor,name:[Dr. Lee],salary:[110000]]
==>[id:12,label:course,credits:[3],name:[Data Structures]]
==>[id:15,label:course,credits:[4],name:[Calculus]]
==>[id:18,label:course,credits:[5],name:[Quantum Mechanics]]
==>[id:21,label:student,name:[Alice],gpa:[3.8],age:[20]]
==>[id:25,label:student,name:[Bob],gpa:[3.2],age:[22]]
==>[id:29,label:student,name:[Carol],gpa:[3.6],age:[21]]
gremlin> g.E().valueMap(true)
==>[id:46,label:belongs_to]
==>[id:47,label:belongs_to]
==>[id:48,label:belongs_to]
==>[id:49,label:belongs_to]
==>[id:50,label:teaches]
==>[id:51,label:teaches]
==>[id:52,label:teaches]
==>[id:53,label:teaches]
==>[id:54,label:enrolled_in]
==>[id:55,label:enrolled_in]
==>[id:56,label:enrolled_in]
==>[id:57,label:enrolled_in]
==>[id:58,label:enrolled_in]
==>[id:59,label:enrolled_in]
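Scanning the full listing works for a graph this small, but a few counts can confirm the load just as well. The following is a minimal sketch, assuming it runs in the same Gremlin Console session where g is the traversal source from Steps 1 to 8; the variable names are only illustrative.

```groovy
// Assumes `g` is the traversal source for the university graph built in Steps 1-8.
vertexCount = g.V().count().next()   // 3 departments + 4 courses + 4 students + 3 professors
edgeCount   = g.E().count().next()   // 4 belongs_to + 4 teaches + 6 enrolled_in

// groupCount() builds a map of key -> number of traversers; here the key is the element label.
vertexCountsByLabel = g.V().groupCount().by(label).next()
edgeCountsByLabel   = g.E().groupCount().by(label).next()
```

With the data above, both totals should be 14, and the per-label maps should match the number of addV() and addE() calls issued for each label.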
2. Statistical Operations
2.1 count(): Measuring Quantity
'count()' returns the number of traversers currently flowing through the traversal.
Example 1: Total Vertices
gremlin> g.V().count()
==>14
Example 2: Total Students
gremlin> g.V().hasLabel('student').count()
==>4
Example 3: How Many Courses Exist?
gremlin> g.V().hasLabel('course').count()
==>4
Example 4: Courses Per Student (Using project())
Each by() traversal is evaluated once per student, so count() here is scoped to that student's enrollments.
g.V().
  hasLabel('student').
  project('name','count').
    by(values('name')).
    by(out('enrolled_in').count())
gremlin> g.V().
......1>   hasLabel('student').
......2>   project('name','count').
......3>     by(values('name')).
......4>     by(out('enrolled_in').count())
==>[name:David,count:1]
==>[name:Alice,count:2]
==>[name:Bob,count:1]
==>[name:Carol,count:2]
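The same pattern works from the other side of the edge. As a sketch, assuming the university graph above is loaded in g, this counts students per course by following enrolled_in edges backwards. Because in is a reserved word in Groovy, the step has to be written as __.in() in the console. The key names course and students are arbitrary choices for this example.

```groovy
// Assumes `g` is the traversal source for the university graph built in Steps 1-8.
// For each course, count the incoming enrolled_in edges (one per enrolled student).
studentsPerCourse = g.V().hasLabel('course').
  project('course','students').
    by('name').
    by(__.in('enrolled_in').count()).
  toList()
```

With the sample enrollments, Algorithms and Calculus should each report 2 students, and Data Structures and Quantum Mechanics 1 each.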
2.2 sum(): Adding Values Together
'sum()' adds up the numeric values in the stream and returns their total. It works only on numbers, so extract the property with values() first.
Example 1: Total University Budget
g.V().
  hasLabel('department').
  values('budget').
  sum()
gremlin> g.V().
......1>   hasLabel('department').
......2>   values('budget').
......3>   sum()
==>1000000
Example 2: Total Professor Salary Expense
g.V().
  hasLabel('professor').
  values('salary').
  sum()
gremlin> g.V().
......1>   hasLabel('professor').
......2>   values('salary').
......3>   sum()
==>325000
Example 3: Total Credits Across All Courses
g.V().
  hasLabel('course').
  values('credits').
  sum()
gremlin> g.V().
......1>   hasLabel('course').
......2>   values('credits').
......3>   sum()
==>16
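sum() can also be scoped to each vertex rather than to the whole graph. The sketch below, which assumes the university graph above is loaded in g, totals the credits each student is enrolled in. Depending on the TinkerPop version, sum() over an empty stream may emit no value at all rather than 0, so the traversal wraps it in coalesce() with a constant(0) fallback; that guard is a defensive assumption, not something the sample data requires.

```groovy
// Assumes `g` is the traversal source for the university graph built in Steps 1-8.
// For each student, add up the credits of the courses reached via enrolled_in.
creditsPerStudent = g.V().hasLabel('student').
  project('name','total_credits').
    by('name').
    by(coalesce(
         out('enrolled_in').values('credits').sum(),
         constant(0))).   // fallback for a student with no enrollments
  toList()
```

With the sample data, Alice carries 4 + 3 = 7 credits, Bob 4, Carol 4 + 4 = 8, and David 5.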
2.3 mean(): Calculating Average
mean() calculates the arithmetic average.
mean = (Sum Of Values)/(Count Of Values)
Example 1: Average GPA
g.V().
  hasLabel('student').
  values('gpa').
  mean()
gremlin> g.V().
......1>   hasLabel('student').
......2>   values('gpa').
......3>   mean()
==>3.375
Example 2: Average Professor Salary
g.V().
  hasLabel('professor').
  values('salary').
  mean()
gremlin> g.V().
......1>   hasLabel('professor').
......2>   values('salary').
......3>   mean()
==>108333.33333333333
Example 3: Average Courses Per Student
g.V().
  hasLabel('student').
  local(out('enrolled_in').count()).
  mean()
gremlin> g.V().
......1>   hasLabel('student').
......2>   local(out('enrolled_in').count()).
......3>   mean()
==>1.5
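An average becomes more informative when it follows a relationship. As a rough sketch, assuming the university graph above is loaded in g, this computes the mean GPA of the students enrolled in each course. Every course in the sample data has at least one student; for a course with no enrollments, mean() may produce no value, so a coalesce() fallback would be worth adding in a real dataset.

```groovy
// Assumes `g` is the traversal source for the university graph built in Steps 1-8.
// For each course, walk enrolled_in edges backwards to the students and average their GPA.
avgGpaPerCourse = g.V().hasLabel('course').
  project('course','avg_gpa').
    by('name').
    by(__.in('enrolled_in').values('gpa').mean()).
  toList()
```

With the sample data, Algorithms should average roughly 3.7 (Alice and Carol) and Calculus roughly 3.4 (Bob and Carol), while Data Structures and Quantum Mechanics each reflect their single student.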
2.4 min(): Finding the Smallest Value
min() returns the smallest value in the traversal stream. As of TinkerPop 3.4, it works with numbers, strings, and any other comparable type.
Example 1: Youngest Student Age
g.V().
  hasLabel('student').
  values('age').
  min()
gremlin> g.V().
......1>   hasLabel('student').
......2>   values('age').
......3>   min()
==>20
Example 2: Lowest Salary
g.V().
  hasLabel('professor').
  values('salary').
  min()
gremlin> g.V().
......1>   hasLabel('professor').
......2>   values('salary').
......3>   min()
==>95000
Example 3: Alphabetically First Department
g.V().
  hasLabel('department').
  values('name').
  min()
gremlin> g.V().
......1>   hasLabel('department').
......2>   values('name').
......3>   min()
==>Computer Science
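min() returns only the value, not the vertex that holds it. When the question is "who is the youngest student?" rather than "what is the lowest age?", one common approach is to sort and take the first result. A minimal sketch, assuming the university graph above is loaded in g:

```groovy
// Assumes `g` is the traversal source for the university graph built in Steps 1-8.
// Sort students by age (ascending by default), keep the first, and report name and age.
youngest = g.V().hasLabel('student').
  order().by('age').
  limit(1).
  project('name','age').
    by('name').
    by('age').
  next()
```

With the sample data this should identify Alice, age 20. If several students shared the lowest age, limit(1) would return just one of them.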
2.5 max(): Finding the Largest Value
max() returns the largest value in the traversal stream.
Example 1: Highest GPA
g.V().
  hasLabel('student').
  values('gpa').
  max()
gremlin> g.V().
......1>   hasLabel('student').
......2>   values('gpa').
......3>   max()
==>3.8
Example 2: Largest Department Budget
g.V().
  hasLabel('department').
  values('budget').
  max()
gremlin> g.V().
......1>   hasLabel('department').
......2>   values('budget').
......3>   max()
==>500000
Example 3: Alphabetically Last Department
g.V().
  hasLabel('department').
  values('name').
  max()
gremlin> g.V().
......1>   hasLabel('department').
......2>   values('name').
......3>   max()
==>Physics
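min() and max() are not limited to stored properties; they can also reduce values computed earlier in the traversal. The sketch below, assuming the university graph above is loaded in g, finds the heaviest and lightest course load among students by first counting each student's enrollments.

```groovy
// Assumes `g` is the traversal source for the university graph built in Steps 1-8.
// local(...) emits one enrollment count per student; max()/min() then reduce those counts.
maxCourses = g.V().hasLabel('student').
  local(out('enrolled_in').count()).
  max().
  next()

minCourses = g.V().hasLabel('student').
  local(out('enrolled_in').count()).
  min().
  next()
```

With the sample enrollments, the largest load should be 2 courses (Alice and Carol) and the smallest 1 (Bob and David).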
3. Global Aggregation
A global aggregation combines all traversers in the current stream and then computes the statistic across the entire stream.
The following snippet calculates the total number of course enrollments across all students. A course taken by two students is counted twice.
g.V().
  hasLabel('student').
  out('enrolled_in').
  count()
gremlin> g.V().
......1>   hasLabel('student').
......2>   out('enrolled_in').
......3>   count()
==>6
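Because a global count() tallies every traverser, the result above is the number of enrollments, not the number of distinct courses. A small sketch, assuming the university graph above is loaded in g, shows the difference by adding dedup() before counting:

```groovy
// Assumes `g` is the traversal source for the university graph built in Steps 1-8.
// Every enrolled_in edge yields one traverser, so shared courses are counted repeatedly.
enrollments = g.V().hasLabel('student').
  out('enrolled_in').
  count().
  next()

// dedup() collapses traversers that sit on the same course vertex before the count.
distinctCourses = g.V().hasLabel('student').
  out('enrolled_in').
  dedup().
  count().
  next()
```

With the sample data, enrollments should be 6 and distinctCourses 4, since Algorithms and Calculus each have two students.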
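4. Local Aggregation
A local aggregation computes the statistic within the context of each traverser individually. Think of it as "for each student, compute the value separately." Inside by(), the wrapped traversal already runs once per traverser, so the explicit local() in the query below is optional.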
g.V().hasLabel('student').
  project('name','courses_count').
    by('name').
    by(local(out('enrolled_in').count())).
  order().by(select('courses_count'), desc)
gremlin> g.V().hasLabel('student').
......1>   project('name','courses_count').
......2>     by('name').
......3>     by(local(out('enrolled_in').count())).
......4>   order().by(select('courses_count'), desc)
==>[name:Alice,courses_count:2]
==>[name:Carol,courses_count:2]
==>[name:David,courses_count:1]
==>[name:Bob,courses_count:1]
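Besides the local() step, the statistical steps themselves accept a scope argument. With the local scope, the step is applied to a collection carried by a single traverser instead of to the whole stream. The following sketch, assuming the university graph above is loaded in g and that the console's static import of Scope.local is available, folds all GPAs into one list and then computes every statistic from that list in a single traversal.

```groovy
// Assumes `g` is the traversal source for the university graph built in Steps 1-8.
// fold() gathers all GPA values into one list; each by() reduces that list with local scope.
gpaSummary = g.V().hasLabel('student').values('gpa').fold().
  project('count','sum','mean','min','max').
    by(count(local)).
    by(sum(local)).
    by(mean(local)).
    by(min(local)).
    by(max(local)).
  next()
```

With the four sample students, the summary should report a count of 4, a sum of 13.5, a mean of 3.375, a minimum of 2.9, and a maximum of 3.8, which agrees with the individual results from section 2.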