Variable length paths based on intermediate nodes as well as relationships

I have a bi-modal data set similar to the movies database. For the sake of analogy, I'm trying to run metrics on the movies based on the people who acted in the movie. I want to know the number of movies at variable path lengths based on a specific node property. For the analogy we can use genre.

Nodes have the following labels and properties:

title: 'Serenity'
genre: 'Sci-fi'

name: 'Nathan Filian'
role: 'Malcom Reynolds'
// is_hub: 1 is only recorded for actors with over 800 movie credits

name: 'Joss Whedon'

name: 'Joss Whedon'

I can get an overall count with the following:

MATCH (movie:Movie {title: 'Serenity'})-[:ACTED_IN*2]-(m2:Movie {genre: 'action'})
RETURN movie, count(distinct m2) as movie_count

However, because some actors have over 800 movie credits I want to remove them from the search, due to delays in traversing at farther steps ( -[*4]- or -[*6]- ) and artificially inflating the movie count.

I initially converted the query to:

MATCH (movie:Movie {title: 'Serenity'})--(a:Actor)--(m2:Movie {genre: 'action'})
WHERE a.is_hub is null
RETURN movie, count(distinct m2)

But this gets unwieldy when I have to specify '.is_hub is null' at farther step lengths for every node in between the start and stop nodes (MATCH (movie:Movie {title: 'Serenity'})--(a:Actor)--(:Movie)--(a2:Actor)--(:Movie)--(a3:Actor)--(m2:Movie {genre: 'action'})

I was considering something along the lines of:

MATCH (movie:Movie {title: 'Serenity'})-[:ACTED_IN*6]-(m2:Movie {genre: 'action'})
WHERE NOT (movie)--({is_hub: 1)--(m2)
RETURN movie, count(distinct m2) as movie3_count

But due to the properties of null, just because it's null doesn't mean it isn't 1. So 'WHERE NOT (movie)--({is_hub: 1)--(m2)' doesn't actually filter anything.

Is there an easier way to do this?

Yes, you can use some list predicates, such as all() or none() to specify that you don't want any node in the path that is a hub.

MATCH path = (movie:Movie {title: 'Serenity'})-[:ACTED_IN*6]-(m2:Movie {genre: 'action'})
WHERE none(node in nodes(path) WHERE node.is_hub = 1)
RETURN movie, count(distinct m2) as movie3_count

That said, there are a few things you can do to make this quicker and easier.

First, since is_hub is really a boolean kind of property, you could either use a boolean value for it rather than an integer, or you could add a label to the node, such as :Hub.

Having this as a label would also work well if you decided to use APOC path expander procs for these kind of matches, which can be more efficient than just Cypher. Cypher queries look for all possible paths that match the pattern, including looking for all possible paths at variable lengths (within the bounds you give it) that end on nodes that you've already visited, making it poor when matching out to distinct nodes. The APOC path expanders use a different means of expansion so you only ever consider a single path to a node, and prune any path that ends on an already visited node. This results in much more efficient queries, able to cope with higher or sometimes even unbounded upper bounds, especially on sufficiently connected graphs like this one, where after we've found a path to all possible nodes no other paths will be considered.

When using the APOC path expanders, in this case apoc.path.subgraphNodes(), the query would look like this:

// first match to action genre movies
MATCH (m2:Movie {genre: 'action'})
WITH collect(m2) as actionMovies
MATCH (movie:Movie {title: 'Serenity'})
CALL apoc.path.subgraphNodes(movie, {endNodes:actionMovies, labelFilter:'-Hub', relationshipFilter:'ACTED_IN', maxLevel:6}) YIELD node
RETURN movie, count(node) as movie3_count

Fantastic. That's much easier. The pruning to the single shortest path will help on some of the other metrics I want to run as well.

Minor edit, when I adapted this to my dataset, I had to use this format:

MATCH path = (movie:Movie {title: 'Serenity'})-[:ACTED_IN*2]-(m2:Movie {genre: 'action'})
WHERE none(node in nodes(path) WHERE EXISTS(node.is_hub))
RETURN movie, count(distinct m2) as movie1_count

Supplementary question: According to the APOC User Guide, apoc.path.subgraphNodes will return all nodes in the subgraph, in this case all action movie nodes out to 6 steps. The RETURN statement in the subgraph example you provided would count all the nodes in the subgraph, not just the ones at 6 steps, correct? If yes, is there a way to group the distinct counts at each step level (ie: step_2: 33, step_4:77, step_6: 129)? Or would I run separate queries for each step level?

The way we're using subgraphNodes() here, by supplying the collection of possible end nodes, ensures that we only get paths to these nodes and return those end nodes that we reach, not any other node.

If you want to group nodes by the distance to each you can do a collect() near the end, keeping the length of the path in scope. Also, since this means we'll need the path to the node (to figure out the distance) and not just the node itself, we need to switch from using subgraphNodes() to using spanningTree(), which behaves identically except that it returns paths instead of just the end node of the path.

MATCH (m2:Movie {genre: 'action'})
WITH collect(m2) as actionMovies
MATCH (movie:Movie {title: 'Serenity'})
CALL apoc.path.spanningTree(movie, {endNodes:actionMovies, labelFilter:'-Hub', relationshipFilter:'ACTED_IN', maxLevel:6}) YIELD path
RETURN movie, length(path) as distance, count(path) as movie3_count
1 Like