How to capture intersecting paths?

I have a graph that looks like this -


It's the same type of node interacting with one another. (:DataSet)-[:INTERACTS_WITH*]->(:DataSet)

I want to write a query that returns the starting nodes of the longest paths.

so far I have -

match p=(d:DataSet)-[r:INTERACTS_WITH*]->(d1:DataSet)
return nodes(p), length(p)
order by length(p) desc

And this kind of does the job but I see that the max path length is 8

CLEARLY that is not the length of the longest path. If I remove direction from the query, it becomes very expensive and unscalable. What am I doing wrong?

Removing the direction in your pattern would likely get you the answer you're after. Are you after the single longest path? If so you may want to add a LIMIT 1 to the end of your query, or otherwise look for a different way to get the top results (if you want all paths with the same longest path length). apoc.agg.maxItems() from APOC Procedures may do the trick.

Also, are there any particular restrictions you want on the paths explored? Do you want to ensure the nodes in each path never repeat (never looping back)?

  • Removing the direction is not an option as then it becomes a very very expensive query even with a total of only 4500 nodes.
  • I want to understand how the limit works in this case because for this query with a direction, the limit really affects how intersecting paths are returned. For example, for the neovis graph I posted above, the query is -
match p=(d:DataSet)-[r:INTERACTS_WITH*]->(d1:DataSet)
return p
order by length(p) desc limit 400
  • The image is pretty much the whole graph and it sure isn't 400 nodes!
  • Lastly, I tried the query with a limit of 20 and removing the direction. It times out. All node names of all types are indexed but not constrained.

LIMIT refers to rows. In your case, a row is a separate distinct path resulting from your MATCH.

The current query with LIMIT 400 means that only 400 paths are being returned (the top 400 paths by path length), and the nodes visualized are those that make up those 400 paths.

A LIMIT of 20 removing direction means that the top 20 paths will be returned. However in order to know if you have the top 20 paths, all paths of that pattern must be found first and evaluated (since you have a LIMIT following ORDER BY that should make it a Top operation so it should go easier on the heap, since only 20 paths need to be kept at a time). Another potential problem is that you have no upper bound on your var-length path. It is possible that the bottleneck is simply finding the sheer number of all possible permutations of paths of any length. You might consider adding upper bounds and possibly count() (without the ordering) to see what kind of numbers you're working with for the number of paths involved.

So is 1 path = 1 hop? If not, I don't see 400 paths in that graph. Actually on a closer look, I don't see 400 paths in there anyway. In the neovis graph I mean.

No. You don't have an upper bound on your pattern -[r:INTERACTS_WITH*]->, so 1 path can be a minimum of 1 hop, but no maximum. Paths can overlap, so you might have 1 path of 1 hop from node A to node B, but have another path of 2 hops from node A to node B to node C. Etc.

The paths collectively form your graph, and there is very likely intersection between the nodes and relationships comprising the paths.

If you want to get the number of paths, then check this:

match p=(d:DataSet)-[r:INTERACTS_WITH*]->(d1:DataSet)
return count(p) as pathCount
1 Like

Okay now it makes sense. 640 nodes on the lefthand side and 779 nodes on the righthand side form 2733 paths. So there are definitely overlaps. But now I'm even more curious as to the details of what the limit 400 is comprised of in the output.

LIMIT always applies to the number of rows being returned.

In this query:

match p=(d:DataSet)-[r:INTERACTS_WITH*]->(d1:DataSet) return count(p) as pathCount

Since I'm using an aggregation without a grouping key, this will return a single row with the count, so with a limit, without a limit, it doesn't matter, the result will be on a single row.

In your original query:

match p=(d:DataSet)-[r:INTERACTS_WITH*]->(d1:DataSet)
return p
order by length(p) desc limit 400

since you're returning p (and you're not using aggregations) each row will have a full path (which is an ordered series of nodes connected by relationships), each of which could be comprised of a potentially unlimited (but not infinite) number of nodes and relationships, since no upper bound is provided.

An aside, worth considering:

Your dataset has cycles (loops). The longest path is of length ∞.
image

Great. Thank you for pointing that out. So again, just trying to understand how neo4j works.... How come then, it's not going into an infinite loop when the query executes? And how come it comes back with a result that says longest path length = 8?

Cypher uses a type of uniqueness behavior called relationship isomorphism (also referred as RELATIONSHIP_PATH uniqueness within the traversal API), which means that per path, relationships must be unique, they cannot be reused. By definition an infinite path requires the ability to traverse the same relationships over and over, so with Cypher alone you cannot get caught in an infinite loop.

1 Like