Dynamically create variable-length pattern where each relationship get it's property from a list

Hi,

I'm using neo4j to store filesystems and version them.
I have reimplemented the concept of git database with Trees, Blobs and SHA1 to reuse the same hierarchy if it already exists.

Git database concepts:

My implementation so far:


You can see the master branch in green, 2 OS commits in red, each one having their own filesystem root Tree.
The highlighted Blob is a new file that I inserted in the filesystem after the first commit, that's why only the new fs hierarchy is pointing to him.

What I would like to do now is generating a diff between these 2 hierarchies, find out what has changed:

  • new files/trees
  • removed files/tress
  • modified files/trees.

I managed to write a Cypher query that can retrieve all the modified Trees from the new hierarchies (the Trees who have a new SHA1, because one or more of their children has changed), and to match in the old filesystem the existing Tree, in order to compare them later on.

// params
//   old_root_sha1: sha1sum of the old filesystem root Tree
//   new_root_sha1: sha1sum of the new filesystem root Tree
//
// 1. MATCH all the subTrees* starting from the old filesystem's root Tree
MATCH (old_root:Tree {sha1sum: $old_root_sha1})-[:HAS_CHILD_TREE*]->(old_child_t:Tree)
// 2. Get them as a list of sha1sum ["ad233c48940b8355e457e11af561c32f1309e6c9", ...]
WITH old_root, collect(old_child_t.sha1sum) as old_sha1
// 3. MATCH all the subTrees* starting from the new filesystem's root Tree
MATCH new_paths = (new_root:Tree {sha1sum: $new_root_sha1})-[:HAS_CHILD_TREE*]->(new_child_t:Tree)
// 4. filter by selecting only those who don't appear in the sha1sum list of the old Trees
// we have the modified Trees
WHERE NOT new_child_t.sha1sum IN old_sha1
// 5. associate modified Tree sha1sum and list of the modified Tree path parts
//      "68804712c8f5f54b76b15c8f67fed90a51ddf5ea"	["bin"]
//      "7eae3c09dd5c6642f5ee58e2d7d1bd79adc592a1"	["boot", "grub", "`i386-pc`"]
// up to this point, query is fast (200 ms)
WITH old_root, new_child_t.sha1sum as new_tree_sha1, [rel IN relationships(new_paths) | rel.name] as new_path_parts
// 6. query the old root again to get its children
MATCH old_path = (old_root:Tree)-[:HAS_CHILD_TREE*]->(old_child_t:Tree)
// build the old path parts list
WHERE [rel IN relationships(old_path) | rel.name] = new_path_parts
RETURN new_tree_sha1, old_child_t.sha1sum as old_tree_sha1, new_path_parts

Retrieving the list of modified Trees is fast and causes me no trouble.
But I'm struggling with the part where, for each one of these modified Trees, I have to query the existing Tree in the old filesystem.

The query above is building relationships lists: eg ["boot", "grub", "i386-pc"], and building the same relationships for each child Tree of the old filesystem: eg ["usr", "bin"], and see if both matches.

But this is extremely slow, and there are many useless comparison.

What I would like to do is to dynamically build the pattern that I'm looking for.
Example:
currenty modified Tree path: ["boot", "grub", "i386-pc"]
pattern to build:

WITH xxxx as new_path_list // [["boot", "grub", "`i386-pc`"], ....]
// foreach path in new_path_list
//    build pattern based on path items in the llist
//    do the match and get the child Tree
MATCH (old_root:Tree)-[:HAS_CHILD_TREE {name: "boot"}->(:Tree)->[:HAS_CHILD_TREE {name: "grub"}]->(:Tree)->[:HAS_CHILD_TREE {name: "`i386-pc`"}]->(old_child_t:Tree)
RETURN old_child_t

Any idea how I could achieve this, and return the new_child_tree_sha1, old_child_tree_sha1, path_list ?

Please provide the following information if you ran into a more serious issue:

  • Neo4j: 3.4
  • py2neo 4.3.0

Thanks you !

The query shown above works, but it's super slow:

Profiling: