Hi there,
My original use-case is fetching all mutual relationships between two groups of nodes.
One of the groups of nodes contains a supernode (i.e. a node with a very high degree).
Let's call n
the group of nodes that contains only regular nodes and sn
the group of nodes that contains a supernode.
My cypher query is the following:
MATCH (sn)-[r]-(n)
WHERE ID(sn) in [1, 999] // node 999 has high degree (1163)
AND ID(n) in [2, 3] // nodes 1, 2 and 3 have a low degree (3)
RETURN r
I'm currently using Neo4j 3.3.0 to benchmark this query, and I noticed something very odd: the performance of the query depends on the names of the variables I am using for sn
, n
and r
.
More particularly, the performances depend of the alphabetical order of the variables.
I have tried changing other things in the query:
- order of the
ID(x) in []
statements - switching
(sn)-[r]-(n)
for(sn)-[r]-(n)
But after lots of testing, the only change that has a consistent impact on the performance of the query is the alphabetical order of the variable names used for sn
, n
and r
.
To test my query, I have the following process:
- I rename
sn
,n
andr
to a different name (sn => v1
,n => v2
,r => v3
) and profile my query. - then, I change the alphabetical order (e.g.
sn => v2
,n => v3
,r => v1
) and profile again - etc.
Here are the results for all possible permutations:
-
sn => v1
,r => v2
,n => v3
: FAST -
sn => v1
,n => v2
,r => v3
: FAST -
r => v1
,sn => v2
,n => v3
: SLOW (i.e. at least of DB hit for each relationship of supernode999
) -
r => v1
,n => v2
,sn => v3
: FAST -
n => v1
,r => v2
,sn => v3
: SLOW (i.e. at least of DB hit for each relationship of supernode999
) -
n => v1
,sn => v2
,r => v3
: SLOW (i.e. at least of DB hit for each relationship of supernode999
)
Profiling for each query:
-
profile match (v1)-[v2]-(v3) where id(v1) in [1, 999] and id(v3) in [2, 3] return v2
-
profile match (v1)-[v3]-(v2) where id(v1) in [1, 999] and id(v2) in [2, 3] return v3
upload limit -
profile match (v2)-[v1]-(v3) where id(v2) in [1, 999] and id(v3) in [2, 3] return v1
-
profile match (v3)-[v1]-(v2) where id(v3) in [1, 99] and id(v2) in [2, 3] return v1
upload limit -
profile match (v3)-[v2]-(v1) where id(v3) in [1, 999] and id(v1) in [2, 3] return v2
upload limit -
profile match (v2)-[v3]-(v1) where id(v2) in [1, 999] and id(v1) in [2, 3] return v3
upload limit
Having to worry about the alphabetical ordering of variable names in a query makes it quite difficult to try and make queries fast
On the other hand, I might be on the wrong track, and I'm open to suggestions for alternative queries to get the same result in a more fast and stable way.
Best,
David