Sunday, 17 May 2026

Labeling and Reusing Traversal Steps in Gremlin using as, select and project

  

When writing Gremlin traversals, it’s often useful to remember intermediate points in a traversal and refer back to them later. Gremlin provides several powerful steps—as, select, and project—that allow you to label traversal steps, reuse those labels, and shape query results in a structured way.

 

In this post, we’ll explore how the "as" step can be used to assign labels to vertices or steps during a traversal, and how the select step allows us to retrieve those labeled elements later in the query. We’ll also see how "by" modulators help to control exactly which properties are returned from selected elements.

 

While the path() step can often produce similar results, it can be memory and CPU intensive in complex traversals. Using "as" and "select" provides a more targeted and sometimes more efficient alternative, especially when you only need specific points in the traversal rather than the full path.

 

Finally, we’ll look at the project step, a more recent addition to TinkerPop, which offers a cleaner and more expressive way to construct structured results, often replacing the need for "as" and "select" altogether. Through practical examples, this post demonstrates when to use each approach and how to write Gremlin queries that are both readable and performant.

 

1. Modelling the Graph

Let's model an organization structure to demonstrate these examples.

 

Engineer → Manager → Director → SeniorDirector → VicePresident

 

In this model, each employee will be a vertex with

·      label: employee

·      name

·      role

 

Each reporting line will be an edge with label "reportsTo", pointing from the employee to the person they report to.

 

Step 1: Create VicePresident vertices.

g.addV('employee').
    property('name','Alice').
    property('role','VicePresident')

g.addV('employee').
    property('name','Bob').
    property('role','VicePresident')

Step 2: Create Senior Directors (Reporting to VPs)

We’ll add 3 Senior Directors, split across the two VPs.

g.addV('employee').
    property('name','Carol').
    property('role','SeniorDirector')

g.addV('employee').
    property('name','David').
    property('role','SeniorDirector')

g.addV('employee').
    property('name','Eve').
    property('role','SeniorDirector')

Now connect them.

g.V().has('employee','name','Carol').
  addE('reportsTo').
  to(__.V().has('employee','name','Alice'))

g.V().has('employee','name','David').
  addE('reportsTo').
  to(__.V().has('employee','name','Alice'))

g.V().has('employee','name','Eve').
  addE('reportsTo').
  to(__.V().has('employee','name','Bob'))

Step 3: Create Directors (More Density)

Let’s add 5 Directors.

['Frank','Grace','Heidi','Ivan','Judy'].each { name ->
    g.addV('employee').
      property('name', name).
      property('role','Director').
      iterate()
}

   

Connect them to Senior Directors.

 

g.V().has('name','Frank').addE('reportsTo').to(__.V().has('name','Carol'))
g.V().has('name','Grace').addE('reportsTo').to(__.V().has('name','Carol'))

g.V().has('name','Heidi').addE('reportsTo').to(__.V().has('name','David'))
g.V().has('name','Ivan').addE('reportsTo').to(__.V().has('name','David'))

g.V().has('name','Judy').addE('reportsTo').to(__.V().has('name','Eve'))

Step 4: Create Managers (Broader Middle Layer)

Add 6 Managers.

['Ken','Laura','Mallory','Niaj','Olivia','Peggy'].each { name ->
    g.addV('employee').
      property('name', name).
      property('role','Manager').
      iterate()
}

Connect them to Directors.

g.V().has('name','Ken').addE('reportsTo').to(__.V().has('name','Frank'))
g.V().has('name','Laura').addE('reportsTo').to(__.V().has('name','Frank'))

g.V().has('name','Mallory').addE('reportsTo').to(__.V().has('name','Grace'))

g.V().has('name','Niaj').addE('reportsTo').to(__.V().has('name','Heidi'))
g.V().has('name','Olivia').addE('reportsTo').to(__.V().has('name','Ivan'))

g.V().has('name','Peggy').addE('reportsTo').to(__.V().has('name','Judy'))

   

Step 5: Create Engineers (Largest Layer)

Now add 10 Engineers.

 

['Quinn','Ruth','Sybil','Trent','Uma',
 'Victor','Wendy','Xavier','Yvonne','Zack'].each { name ->
    g.addV('employee').
      property('name', name).
      property('role','Engineer').
      iterate()
}

   

Connect them to managers.

 

g.V().has('name','Quinn').addE('reportsTo').to(__.V().has('name','Ken'))
g.V().has('name','Ruth').addE('reportsTo').to(__.V().has('name','Ken'))

g.V().has('name','Sybil').addE('reportsTo').to(__.V().has('name','Laura'))

g.V().has('name','Trent').addE('reportsTo').to(__.V().has('name','Mallory'))
g.V().has('name','Uma').addE('reportsTo').to(__.V().has('name','Mallory'))

g.V().has('name','Victor').addE('reportsTo').to(__.V().has('name','Niaj'))
g.V().has('name','Wendy').addE('reportsTo').to(__.V().has('name','Niaj'))

g.V().has('name','Xavier').addE('reportsTo').to(__.V().has('name','Olivia'))
g.V().has('name','Yvonne').addE('reportsTo').to(__.V().has('name','Peggy'))
g.V().has('name','Zack').addE('reportsTo').to(__.V().has('name','Peggy'))
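
Before running the examples, it can help to confirm that the graph was loaded as intended. The following is a minimal sanity check, assuming the statements above were run exactly once against an empty graph in the Gremlin Console. The expected totals in the comments follow from the steps above (26 employees and one reportsTo edge for everyone except the two VPs).

```groovy
// Count employees per role; expect 2 VicePresident, 3 SeniorDirector,
// 5 Director, 6 Manager and 10 Engineer vertices.
g.V().hasLabel('employee').groupCount().by('role')

// Count reporting lines; every employee except the two VPs has exactly
// one, so 26 - 2 = 24 edges are expected.
g.E().hasLabel('reportsTo').count()
```

If the numbers are higher than expected, the most likely cause is that some of the setup statements were run more than once.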

   

Example 1: Pick Engineer and VP in the Same Traversal

 

g.V().
  has('employee','role','Engineer').
  as('engineer').
  out('reportsTo').
  out('reportsTo').
  out('reportsTo').
  out('reportsTo').
  as('vp').
  select('engineer', 'vp').
  by('name')

gremlin> g.V().
......1>   has('employee','role','Engineer').
......2>   as('engineer').
......3>   out('reportsTo').
......4>   out('reportsTo').
......5>   out('reportsTo').
......6>   out('reportsTo').
......7>   as('vp').
......8>   select('engineer', 'vp').
......9>   by('name')
==>[engineer:Ruth,vp:Alice]
==>[engineer:Sybil,vp:Alice]
==>[engineer:Trent,vp:Alice]
==>[engineer:Uma,vp:Alice]
==>[engineer:Victor,vp:Alice]
==>[engineer:Wendy,vp:Alice]
==>[engineer:Xavier,vp:Alice]
==>[engineer:Yvonne,vp:Bob]
==>[engineer:Zack,vp:Bob]
==>[engineer:Quinn,vp:Alice]

   

We can also write the above query using the repeat and until steps, which avoids hard-coding the number of hops.

 

g.V().
  has('employee','role','Engineer').
  as('engineer').
  repeat(out('reportsTo')).
  until(has('role','VicePresident')).
  as('vp').
  select('engineer', 'vp').
  by('name')

gremlin> g.V().
......1>   has('employee','role','Engineer').
......2>   as('engineer').
......3>   repeat(out('reportsTo')).
......4>   until(has('role','VicePresident')).
......5>   as('vp').
......6>   select('engineer', 'vp').
......7>   by('name')
==>[engineer:Ruth,vp:Alice]
==>[engineer:Sybil,vp:Alice]
==>[engineer:Trent,vp:Alice]
==>[engineer:Uma,vp:Alice]
==>[engineer:Victor,vp:Alice]
==>[engineer:Wendy,vp:Alice]
==>[engineer:Xavier,vp:Alice]
==>[engineer:Yvonne,vp:Bob]
==>[engineer:Zack,vp:Bob]
==>[engineer:Quinn,vp:Alice]

Using the 'as' and 'select' steps, we avoid returning the entire path and instead capture only the vertices we care about, making the traversal more expressive and often more efficient.
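
The by() modulators are applied to the selected labels in order, so each label can be rendered differently. The sketch below is an illustrative variation on the query above, not part of the original example: the extra 'manager' label is introduced here only for demonstration, and the VP is rendered by role instead of by name.

```groovy
// Label the engineer, the direct manager (first hop) and the VP.
// The three by() modulators map, in order, to 'engineer', 'manager'
// and 'vp': name, name, role.
g.V().
  has('employee','role','Engineer').
  as('engineer').
  out('reportsTo').
  as('manager').
  repeat(out('reportsTo')).
  until(has('role','VicePresident')).
  as('vp').
  select('engineer','manager','vp').
    by('name').
    by('name').
    by('role')
```

Each result should be a map with three keys, one per label, with the intermediate directors and senior directors left out because they were never labeled.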

 

Example 2: project() step

The project() step builds a result object (a map) from the current element in the traversal. You give project() a list of names (keys), and for each key, in order, a by() modulator tells Gremlin how to compute its value.

 

For example,

g.V().hasLabel('employee').
  project('my_name','my_role','my_direct_reports').
    by('name').
    by('role').
    by(in('reportsTo').count())

   

In the above example, the project step says: take the current employee and build a map with three fields: my_name, my_role, and my_direct_reports.

·      by('name'): For the first key my_name, read the name property from the current employee vertex

·      by('role'): For the second key my_role, read the role property from the current employee vertex

·      by(in('reportsTo').count()): For the third key my_direct_reports, from the current employee traverse incoming reportsTo edges and count how many vertices point to this employee.

 

 

gremlin> g.V().hasLabel('employee').
......1>   project('my_name','my_role','my_direct_reports').
......2>     by('name').
......3>     by('role').
......4>     by(in('reportsTo').count())
==>[my_name:Alice,my_role:VicePresident,my_direct_reports:2]
==>[my_name:Ruth,my_role:Engineer,my_direct_reports:0]
==>[my_name:Bob,my_role:VicePresident,my_direct_reports:1]
==>[my_name:Sybil,my_role:Engineer,my_direct_reports:0]
==>[my_name:Carol,my_role:SeniorDirector,my_direct_reports:2]
==>[my_name:Trent,my_role:Engineer,my_direct_reports:0]
==>[my_name:David,my_role:SeniorDirector,my_direct_reports:2]
==>[my_name:Uma,my_role:Engineer,my_direct_reports:0]
==>[my_name:Eve,my_role:SeniorDirector,my_direct_reports:1]
==>[my_name:Victor,my_role:Engineer,my_direct_reports:0]
==>[my_name:Wendy,my_role:Engineer,my_direct_reports:0]
==>[my_name:Frank,my_role:Director,my_direct_reports:2]
==>[my_name:Xavier,my_role:Engineer,my_direct_reports:0]
==>[my_name:Grace,my_role:Director,my_direct_reports:1]
==>[my_name:Yvonne,my_role:Engineer,my_direct_reports:0]
==>[my_name:Heidi,my_role:Director,my_direct_reports:1]
==>[my_name:Zack,my_role:Engineer,my_direct_reports:0]
==>[my_name:Ivan,my_role:Director,my_direct_reports:1]
==>[my_name:Judy,my_role:Director,my_direct_reports:1]
==>[my_name:Ken,my_role:Manager,my_direct_reports:2]
==>[my_name:Laura,my_role:Manager,my_direct_reports:1]
==>[my_name:Mallory,my_role:Manager,my_direct_reports:2]
==>[my_name:Niaj,my_role:Manager,my_direct_reports:2]
==>[my_name:Olivia,my_role:Manager,my_direct_reports:1]
==>[my_name:Peggy,my_role:Manager,my_direct_reports:2]
==>[my_name:Quinn,my_role:Engineer,my_direct_reports:0]
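
Because a by() modulator can hold a whole sub-traversal, project() can also express the Engineer/VP pairing from Example 1 without any step labels. The following is a sketch of that idea, assuming the sample graph built above, where every engineer has exactly one chain up to a VicePresident.

```groovy
// For each engineer, build a two-key map. The second by() runs a
// sub-traversal from the engineer up to the first VicePresident and
// takes the name of that vertex.
g.V().
  has('employee','role','Engineer').
  project('engineer','vp').
    by('name').
    by(repeat(out('reportsTo')).
       until(has('role','VicePresident')).
       values('name'))
```

On the sample graph this should yield the same engineer/vp pairs as the as/select version. One design difference to keep in mind: a by() sub-traversal contributes only its first result, so this form suits cases where each element maps to a single value.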

   

Example 3: path() Engineer → VP Reporting Chain

Show the full reporting chain from an Engineer up to the VP.

 

g.V().
  hasLabel('employee').
  outE('reportsTo').
  inV().
  outE('reportsTo').
  inV().
  outE('reportsTo').
  inV().
  outE('reportsTo').
  inV().
  path().
  by('name').
  by(label())

gremlin> g.V().
......1>   hasLabel('employee').
......2>   outE('reportsTo').
......3>   inV().
......4>   outE('reportsTo').
......5>   inV().
......6>   outE('reportsTo').
......7>   inV().
......8>   outE('reportsTo').
......9>   inV().
.....10>   path().
.....11>   by('name').
.....12>   by(label())
==>[Ruth,reportsTo,Ken,reportsTo,Frank,reportsTo,Carol,reportsTo,Alice]
==>[Sybil,reportsTo,Laura,reportsTo,Frank,reportsTo,Carol,reportsTo,Alice]
==>[Trent,reportsTo,Mallory,reportsTo,Grace,reportsTo,Carol,reportsTo,Alice]
==>[Uma,reportsTo,Mallory,reportsTo,Grace,reportsTo,Carol,reportsTo,Alice]
==>[Victor,reportsTo,Niaj,reportsTo,Heidi,reportsTo,David,reportsTo,Alice]
==>[Wendy,reportsTo,Niaj,reportsTo,Heidi,reportsTo,David,reportsTo,Alice]
==>[Xavier,reportsTo,Olivia,reportsTo,Ivan,reportsTo,David,reportsTo,Alice]
==>[Yvonne,reportsTo,Peggy,reportsTo,Judy,reportsTo,Eve,reportsTo,Bob]
==>[Zack,reportsTo,Peggy,reportsTo,Judy,reportsTo,Eve,reportsTo,Bob]
==>[Quinn,reportsTo,Ken,reportsTo,Frank,reportsTo,Carol,reportsTo,Alice]

   

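We can write the above query more compactly using the repeat and until steps. Note that the two are not strictly equivalent: the fixed four-hop version only matches employees exactly four levels below a VP (the engineers), whereas this version starts from every employee and stops at the first VicePresident, so it also returns the shorter chains for managers, directors, and senior directors.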

 

g.V().
  hasLabel('employee').
  repeat(outE('reportsTo').inV()).
  until(has('role','VicePresident')).
  path().
  by('name').
  by(label())

gremlin> g.V().
......1>   hasLabel('employee').
......2>   repeat(outE('reportsTo').inV()) .
......3>   until(has('role','VicePresident')).
......4>   path().
......5>   by('name').
......6>   by(label())
==>[Ruth,reportsTo,Ken,reportsTo,Frank,reportsTo,Carol,reportsTo,Alice]
==>[Sybil,reportsTo,Laura,reportsTo,Frank,reportsTo,Carol,reportsTo,Alice]
==>[Carol,reportsTo,Alice]
==>[Trent,reportsTo,Mallory,reportsTo,Grace,reportsTo,Carol,reportsTo,Alice]
==>[David,reportsTo,Alice]
==>[Uma,reportsTo,Mallory,reportsTo,Grace,reportsTo,Carol,reportsTo,Alice]
==>[Eve,reportsTo,Bob]
==>[Victor,reportsTo,Niaj,reportsTo,Heidi,reportsTo,David,reportsTo,Alice]
==>[Wendy,reportsTo,Niaj,reportsTo,Heidi,reportsTo,David,reportsTo,Alice]
==>[Frank,reportsTo,Carol,reportsTo,Alice]
==>[Xavier,reportsTo,Olivia,reportsTo,Ivan,reportsTo,David,reportsTo,Alice]
==>[Grace,reportsTo,Carol,reportsTo,Alice]
==>[Yvonne,reportsTo,Peggy,reportsTo,Judy,reportsTo,Eve,reportsTo,Bob]
==>[Heidi,reportsTo,David,reportsTo,Alice]
==>[Zack,reportsTo,Peggy,reportsTo,Judy,reportsTo,Eve,reportsTo,Bob]
==>[Ivan,reportsTo,David,reportsTo,Alice]
==>[Judy,reportsTo,Eve,reportsTo,Bob]
==>[Ken,reportsTo,Frank,reportsTo,Carol,reportsTo,Alice]
==>[Laura,reportsTo,Frank,reportsTo,Carol,reportsTo,Alice]
==>[Mallory,reportsTo,Grace,reportsTo,Carol,reportsTo,Alice]
==>[Niaj,reportsTo,Heidi,reportsTo,David,reportsTo,Alice]
==>[Olivia,reportsTo,Ivan,reportsTo,David,reportsTo,Alice]
==>[Peggy,reportsTo,Judy,reportsTo,Eve,reportsTo,Bob]
==>[Quinn,reportsTo,Ken,reportsTo,Frank,reportsTo,Carol,reportsTo,Alice]
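
To get only the full Engineer-to-VP chains from the repeat/until form, the starting set can be narrowed to engineers. This is a small variation on the query above, assuming the same sample graph.

```groovy
// Start only from engineers so that each result is a complete
// Engineer-to-VP chain, matching the fixed four-hop version.
g.V().
  has('employee','role','Engineer').
  repeat(outE('reportsTo').inV()).
  until(has('role','VicePresident')).
  path().
  by('name').
  by(label())
```

On the sample graph this should return ten paths, one per engineer.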

   

Example 4: Walk any depth of the org hierarchy without hard-coding the number of levels.

Let's find everyone above Quinn in the reporting chain, all the way to the top.

 

g.V().has('employee','name','Quinn').
  repeat(out('reportsTo')).
  emit().
  project('name','role').
    by('name').
    by('role')

gremlin> g.V().has('employee','name','Quinn').
......1>   repeat(out('reportsTo')).
......2>   emit().
......3>   project('name','role').
......4>     by('name').
......5>     by('role')
==>[name:Ken,role:Manager]
==>[name:Frank,role:Director]
==>[name:Carol,role:SeniorDirector]
==>[name:Alice,role:VicePresident]
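
The same pattern works in the opposite direction. As a hedged sketch, the query below walks down from Carol (chosen here only as an example) by following reportsTo edges backwards, and emits everyone found along the way. On the sample graph it should list the ten people in Carol's organization (2 directors, 3 managers and 5 engineers), in an order that may vary by graph provider.

```groovy
// __.in() is written with the explicit anonymous-traversal prefix
// because 'in' is a reserved word in Groovy.
// repeat() ends naturally once a vertex has no incoming reportsTo edges.
g.V().has('employee','name','Carol').
  repeat(__.in('reportsTo')).
  emit().
  project('name','role').
    by('name').
    by('role')
```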

   

2. Why Is path() Expensive in Gremlin?

path() is expensive because it forces Gremlin to remember and carry the full traversal history for every traverser as it moves through the graph.

 

In a normal traversal such as g.V().out('reportsTo').out('reportsTo'), each traverser knows only where it is now and does not remember how it got there. This makes traversals memory-efficient, fast, and easy to parallelize.

 

But when you add path(), as in g.V().out('reportsTo').path(), Gremlin must now:

·      Store every vertex and edge visited

·      Keep them in order

·      Attach that growing history to each traverser

 

Each traverser becomes: currentVertex + [v1, e1, v2, e2, v3, ...], and this history grows at every step.

 

2.1 In an organization hierarchy like the one below:

Engineer → Manager → Director → SeniorDirector → VicePresident

 

Each traverser path stores:

[Engineer, reportsTo, Manager, reportsTo, Director, reportsTo, SeniorDirector, reportsTo, VP]

 

Now imagine:

·      10,000 engineers

·      Each with a 9-element path (5 vertices and 4 edges)

 

That’s 90,000 elements (10,000 × 9) held in memory, just for paths.
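
The same arithmetic can be checked on the sample graph. The sketch below, assuming the graph built earlier in this post, sums the number of elements across all Engineer-to-VP paths. Each path holds 5 vertices and 4 edges, so with 10 engineers the total should come to 90.

```groovy
// Sum the sizes of all Engineer-to-VP paths in the sample graph.
// count(local) is applied to each path object rather than to the
// stream of paths, and sum() adds the per-path sizes together.
g.V().
  has('employee','role','Engineer').
  repeat(outE('reportsTo').inV()).
  until(has('role','VicePresident')).
  path().
  count(local).
  sum()
```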

 

2.2 Fan-Out Makes It Much Worse

In real graphs, vertices often have multiple outgoing edges.

 

For example, consider the following traversal:

g.V().out().out().out().path()

 

Each step may multiply the number of traversers:

 

10 → 100 → 1,000 → 10,000 traversers

 

Each traverser has its own path object with duplicated history. This leads to high memory usage, garbage collection pressure, and slower queries.
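
One way to observe this effect on a given graph is the profile() step, which reports how many traversers each step handled and how long it took. This is only a suggestion for investigation, and the format and level of detail of the output depend on the graph provider.

```groovy
// profile() replaces the normal results with per-step metrics such as
// traverser counts and timings; the exact layout depends on the provider.
g.V().out().out().out().path().profile()
```

Running the same traversal with and without path() gives a rough idea of what path tracking costs on a particular data set.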

 

2.3 Why Is as() + select() Cheaper?

Example 1: Expensive Operation

 

g.V().
  has('employee', 'role', 'Engineer').
  as('engineer').
  repeat(out('reportsTo')).
  until(has('role','VicePresident')).
  path().
  by('name')

gremlin> g.V().
......1>   has('employee', 'role', 'Engineer').
......2>   as('engineer').
......3>   repeat(out('reportsTo')).
......4>   until(has('role','VicePresident')).
......5>   path().
......6>   by('name')
==>[Ruth,Ken,Frank,Carol,Alice]
==>[Sybil,Laura,Frank,Carol,Alice]
==>[Trent,Mallory,Grace,Carol,Alice]
==>[Uma,Mallory,Grace,Carol,Alice]
==>[Victor,Niaj,Heidi,David,Alice]
==>[Wendy,Niaj,Heidi,David,Alice]
==>[Xavier,Olivia,Ivan,David,Alice]
==>[Yvonne,Peggy,Judy,Eve,Bob]
==>[Zack,Peggy,Judy,Eve,Bob]
==>[Quinn,Ken,Frank,Carol,Alice]

 

Example 2: Efficient Operation

g.V().
  has('employee', 'role', 'Engineer').
  as('engineer').
  repeat(out('reportsTo')).
  until(has('role','VicePresident')).
  as('vp').
  select('engineer','vp').
  by('name')

   

Why is this cheaper?

·      Only the two labeled vertices are remembered

·      No full traversal history is stored; only the labeled steps need to be retained

·      Smaller memory footprint per traverser

 

gremlin> g.V().
......1>   has('employee', 'role', 'Engineer').
......2>   as('engineer').
......3>   repeat(out('reportsTo')).
......4>   until(has('role','VicePresident')).
......5>   as('vp').
......6>   select('engineer','vp').
......7>   by('name')
==>[engineer:Ruth,vp:Alice]
==>[engineer:Sybil,vp:Alice]
==>[engineer:Trent,vp:Alice]
==>[engineer:Uma,vp:Alice]
==>[engineer:Victor,vp:Alice]
==>[engineer:Wendy,vp:Alice]
==>[engineer:Xavier,vp:Alice]
==>[engineer:Yvonne,vp:Bob]
==>[engineer:Zack,vp:Bob]
==>[engineer:Quinn,vp:Alice]

   

When Is path() Worth It?

Use path() when you truly need:

 

·      Full lineage

·      Debugging visibility

·      Audit trails

·      Visual graph exploration

 

Avoid it for large-scale analytics and repeated production queries.

In summary, path() is expensive because it forces Gremlin to retain and propagate the full traversal history for every traverser, dramatically increasing memory usage, CPU cost, and result size as traversal depth and fan-out grow.

 

Key points to remember

·      as() + select() let you capture and reuse specific traversal points

·      project() produces clean, API-friendly results

·      path() is powerful but should be used intentionally

·      repeat() is essential for hierarchical traversals

·      Performance improves when you avoid collecting unnecessary path data

 
