Tuesday, 19 May 2026

Removing Duplicate Results with dedup in Gremlin

  

Graph traversals often return repeated values, especially when querying properties shared by many vertices or when multiple traversal paths converge on the same elements. While these duplicates are sometimes useful, for example, when analyzing frequency or distribution, in many cases the goal is to work with a unique set of results.

 

Apache TinkerPop provides the dedup step to address this requirement. The dedup step removes duplicate objects from a traversal stream, ensuring that each remaining element is unique according to the criteria specified. Conceptually, dedup behaves similarly to the unique operation found in common collection libraries, but it is fully integrated into Gremlin’s traversal model and supports advanced usage patterns.

 

At its simplest, dedup removes repeated values from the traversal results. However, its capabilities extend well beyond basic value comparison. By using modulators such as by(), dedup can enforce uniqueness based on specific properties rather than entire elements. Additionally, when given step labels, dedup compares the elements bound to those labels in each traverser's path, which removes results that repeat the same combination of labeled elements.

 

Understanding how and when to apply dedup is essential for writing efficient, readable, and semantically correct Gremlin traversals. In the following sections, we will explore the different forms of dedup, beginning with simple value deduplication, progressing to property-based uniqueness, and finally examining label-aware deduplication in multi-step traversals.

 

1. Domain Model: Online Learning Platform

To demonstrate the dedup step, we will use a simple online learning platform domain. This domain naturally produces duplicate values and converging traversal paths, making it well suited for illustrating different forms of deduplication.

 

The graph consists of three primary vertex types:

·      Student: Represents a learner enrolled in one or more courses. A student vertex has the following properties:

o   name – student name

o   level – beginner, intermediate, or advanced

o   department – academic department (e.g., CS, Math, Physics)

·      Course: Represents a course offered on the platform. A course vertex has the following properties:

o   code – course identifier

o   title – course name

o   difficulty – difficulty level of the course

·      Instructor: Represents an instructor who teaches one or more courses. An instructor vertex has the following properties:

o   name – instructor name

o   experience – years of teaching experience

 

The graph uses two edge labels:

 

·      enrolledIn: Connects a student to a course

·      teaches: Connects an instructor to a course

 

Follow the steps below to build the graph.

 

Step 1: Create the graph and get a traversal source.

graph = TinkerGraph.open()
g = graph.traversal()

Step 2: Create vertices

// Students
alice = g.addV('student').
         property('name','Alice').
         property('level','beginner').
         property('department','CS').
         next()

bob = g.addV('student').
        property('name','Bob').
        property('level','beginner').
        property('department','CS').
        next()

charlie = g.addV('student').
            property('name','Charlie').
            property('level','intermediate').
            property('department','Math').
            next()

diana = g.addV('student').
          property('name','Diana').
          property('level','advanced').
          property('department','CS').
          next()

eric = g.addV('student').
        property('name','Eric').
        property('level','beginner').
        property('department','Physics').
        next()

// Courses
cs101 = g.addV('course').
         property('code','CS101').
         property('title','Intro to Computer Science').
         property('difficulty','beginner').
         next()

cs201 = g.addV('course').
         property('code','CS201').
         property('title','Data Structures').
         property('difficulty','intermediate').
         next()

math101 = g.addV('course').
           property('code','MATH101').
           property('title','Calculus I').
           property('difficulty','beginner').
           next()

// Instructors
smith = g.addV('instructor').
          property('name','Dr. Smith').
          property('experience',10).
          next()

lee = g.addV('instructor').
        property('name','Dr. Lee').
        property('experience',8).
        next()

Step 3: Creating Edges Using Anonymous Traversals (__)

When creating edges, we use anonymous traversals (__) to define the edge endpoints.

 

Anonymous traversals are the standard way to pass a traversal as an argument to another step. Here, __.V(cs101) resolves the target vertex inside the edge-creating traversal, which keeps the lookup scoped to that step.
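For comparison, from() and to() also accept Vertex objects directly. The sketch below is only an illustration: it uses a separate scratch graph, and the names sg, tmpStudent, tmpCourse, 'Temp', and 'TMP100' are placeholders that are not part of the learning platform model, so the tutorial graph is left untouched.

```groovy
// Scratch graph: keeps the tutorial graph free of extra vertices and edges.
sg = TinkerGraph.open().traversal()

tmpStudent = sg.addV('student').property('name', 'Temp').next()
tmpCourse  = sg.addV('course').property('code', 'TMP100').next()

// Pass the Vertex objects directly instead of wrapping them in __.V()
sg.addE('enrolledIn').from(tmpStudent).to(tmpCourse).iterate()

// Exactly one edge should now exist in the scratch graph
sg.E().count()
```

Both forms create the same kind of edge. The anonymous-traversal form is more general, because the endpoint can be found by any traversal rather than only by a vertex already held in a variable.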

 

Student Enrollments

g.V(alice).addE('enrolledIn').to(__.V(cs101)).iterate()
g.V(alice).addE('enrolledIn').to(__.V(math101)).iterate()

g.V(bob).addE('enrolledIn').to(__.V(cs101)).iterate()
g.V(bob).addE('enrolledIn').to(__.V(cs201)).iterate()

g.V(charlie).addE('enrolledIn').to(__.V(math101)).iterate()

g.V(diana).addE('enrolledIn').to(__.V(cs201)).iterate()

g.V(eric).addE('enrolledIn').to(__.V(cs101)).iterate()

   

Instructor Teaching Assignments

 

g.V(smith).addE('teaches').to(__.V(cs101)).iterate()
g.V(smith).addE('teaches').to(__.V(cs201)).iterate()

g.V(lee).addE('teaches').to(__.V(math101)).iterate()
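Before moving on, it may help to confirm that the graph matches the model. The following sanity check is a minimal sketch that assumes the Gremlin Console, where label refers to the statically imported T.label; in other environments it may need to be written as T.label.

```groovy
// Vertices per label: expect 5 students, 3 courses, 2 instructors
g.V().groupCount().by(label)

// Edges per label: expect 7 enrolledIn, 3 teaches
g.E().groupCount().by(label)
```

If the counts are higher than expected, an edge or vertex statement was probably executed more than once, which would change the outputs of the examples that follow.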

   

2. Removing Duplicate Results: Introducing dedup

When working with graph traversals, it is common to encounter duplicate results. These duplicates often arise because many vertices share the same property values or because multiple traversal paths converge on the same elements.

 

In the online learning platform model, for example, several students may be enrolled in the same course or belong to the same department. If we query such properties directly, Gremlin will return all matching values, including duplicates.

 

The dedup step allows us to remove these duplicates from the traversal stream.

 

At a conceptual level, dedup ensures that each element passing through the traversal is unique. What “unique” means depends on how dedup is applied, as we will see in the following examples.

 

Example 1: Duplicate Property Values

Let us start with a simple query that retrieves the level of every student.

 

g.V().
  hasLabel('student').
  values('level').
  fold()

gremlin> g.V().
......1>   hasLabel('student').
......2>   values('level').
......3>   fold()
==>[beginner,beginner,beginner,intermediate,advanced]

   

As the output shows, the level 'beginner' appears three times.

 

Let's apply the dedup step to remove the duplicates.

 

g.V().
  hasLabel('student').
  values('level').
  dedup().
  fold()

gremlin> g.V().
......1>   hasLabel('student').
......2>   values('level').
......3>   dedup().
......4>   fold()
==>[beginner,intermediate,advanced]

   

From the above output, you can confirm that each level now appears only once.
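When the repetition itself is the information of interest, such as the frequency of each level, counting is a better fit than deduplicating. A short sketch of both views on the same data:

```groovy
// How often each level occurs: beginner 3, intermediate 1, advanced 1
g.V().hasLabel('student').groupCount().by('level')

// How many distinct levels exist: 3
g.V().hasLabel('student').values('level').dedup().count()
```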

 

Example 2: Understanding What dedup Operates On

It is important to understand that dedup operates on whatever is currently flowing through the traversal. In the previous example, the traversal was emitting strings (level values). Therefore, dedup compared those strings and removed duplicates.

 

If we instead apply dedup before calling values, the behavior changes.

 

g.V().
  hasLabel('student').
  dedup().
  values('level').
  fold()

gremlin> g.V().
......1>   hasLabel('student').
......2>   dedup().
......3>   values('level').
......4>   fold()
==>[beginner,beginner,beginner,intermediate,advanced]

   

As the output shows, the levels are duplicated again. This happens because:

·      All student vertices are already unique, so dedup removes nothing

·      The duplicate values reappear when level is extracted afterwards

This distinction is critical when using dedup: place it after the step that produces the values you want to be unique.
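Counting the traversers at each point makes the difference concrete. The sketch below also shows that vertices themselves can repeat once several paths converge on the same element, in which case dedup on vertices does have an effect.

```groovy
// dedup on vertices: all 5 students are distinct, so nothing is removed
g.V().hasLabel('student').dedup().count()                      // 5

// Vertices can repeat too: 7 enrollments reach only 3 distinct courses
g.V().hasLabel('student').out('enrolledIn').count()            // 7
g.V().hasLabel('student').out('enrolledIn').dedup().count()    // 3
```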

 

Example 3: One Student per Level Using dedup().by()

 

Sometimes we want more than just unique values. Suppose we want one representative student for each level. To do this, we apply dedup to vertices, but tell Gremlin how uniqueness should be determined.

 

 

g.V().
  hasLabel('student').
  dedup().
  by('level').
  project('name', 'level').
  by('name').
  by('level')

gremlin> g.V().
......1>   hasLabel('student').
......2>   dedup().
......3>   by('level').
......4>   project( 'name', 'level').
......5>   by('name').
......6>   by('level')
==>[name:Alice,level:beginner]
==>[name:Charlie,level:intermediate]
==>[name:Diana,level:advanced]

   

This traversal:

·      Starts with student vertices

·      Removes duplicates based on the level property

·      Keeps one student per unique level
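Note that dedup keeps the first traverser it sees for each key, so which student represents a level depends on the order in which vertices arrive, and a graph does not guarantee that order unless the traversal sorts explicitly. A minimal sketch that makes the choice deterministic by sorting on name first:

```groovy
// Sort by name so the kept student is the alphabetically first one per level
g.V().
  hasLabel('student').
  order().by('name').
  dedup().by('level').
  values('name')
// Alice, Charlie, Diana

// To see every student per level rather than one representative, group instead
g.V().
  hasLabel('student').
  group().by('level').by('name')
```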

 

Example 4: Listing the Courses Students Are Enrolled In

 

g.V().
  hasLabel('student').
  out('enrolledIn').
  values('title')

gremlin> g.V().
......1>   hasLabel('student').
......2>   out('enrolledIn').
......3>   values('title')
==>Intro to Computer Science
==>Calculus I
==>Intro to Computer Science
==>Intro to Computer Science
==>Data Structures
==>Calculus I
==>Data Structures

   

As the output shows, several course titles are repeated. Let's apply the dedup step.

 

g.V().
  hasLabel('student').
  out('enrolledIn').
  values('title').
  dedup()

gremlin> g.V().
......1>   hasLabel('student').
......2>   out('enrolledIn').
......3>   values('title').
......4>   dedup()
==>Intro to Computer Science
==>Calculus I
==>Data Structures
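The same idea extends to combinations of elements. As a preview of label-aware deduplication, the sketch below (an illustration that goes slightly beyond the examples above) pairs each student with the instructors of their courses. Bob takes two courses taught by Dr. Smith, so that pair is reached twice unless dedup is given the step labels.

```groovy
// Every student/instructor pair reached through a course: 7 paths
g.V().hasLabel('student').as('s').
  out('enrolledIn').
  in('teaches').as('i').
  count()

// dedup('s', 'i') keeps one traverser per distinct (student, instructor) pair,
// leaving 6 results
g.V().hasLabel('student').as('s').
  out('enrolledIn').
  in('teaches').as('i').
  dedup('s', 'i').
  select('s', 'i').by('name')
```

Here uniqueness is decided by the pair of elements bound to the labels s and i, not by the instructor vertex the traverser currently sits on. A plain dedup() at the same position would instead leave only two results, one per instructor.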

 


 
