Hashing values through the graph

bill_dickenson · March 10, 2021, 9:25pm

We are comparing two very large tree graphs that are 95% identical. The goal is to flag the whole tree on the 2nd graph when a difference is found. I am hoping for a simpler solution using the hash function. I am struggling with updating one node, when it is part of a tree. That concept is just not sticking for me. In SQL this is a simple subquery. Neo4j, its probably simple but I'm not getting it. This is a variation on the fingerprinting example in the documentation, with the added complexity of needing to store the value.

To explain further, we organize by types. One of the important types is 'CallableUnit'. They represent blocks of nodes. If any node below "CallableUnit" changed (Red circle), the whole tree up to that CallableUnit needs to be marked. If any node has changed in a branch, I need to find the next highest "root" node (type:'CallableUnit') and mark it and all of its sub-nodes as a.changed = true. As you can see in this graphic, the whole path from the bottom to the CallableUnit node need have their attribute (a.changed) set to true. As you can also see from the Map of Application box, this is a decent sized application.

So we have two graphs, version 1 (original), and version 2 ( changed)

My plan is simple, but APOC is not cooperating. Every time a new graph is created, I was thinking of getting all of the CallableUnits (e.g. match (a:ProgNode {type:'CallableUnit')) and then iterate through that list of nodes (foreach) and calculate the hash value of that node (using a.text only), plus all of its subnodes (hash (a)-[*]->()) and storing it on the CallableUnit (set a.hash = ). That all gets done in batch and if it takes a bit, thats ok. Its still faster than a node by node comparison in Diff ( I think). Version 1 now has a hash value for every callableunit.

When the Version 2 graph is loaded, I would do the same load. Now both versions have a.hash set for every callableunit based on a.text

Then all i have to do is find CallableUnit nodes where the a.text = b.text, but the hash value (a.hash <> b.hash) is different. If I get a list of the CallableUnit roots that are different, I can mark the nodes from it to the bottom as a.changed = True. And just to make this MORE Complicated, I need to do it in batches because it is WAY too big to do in one shot.

CALL apoc.periodic.iterate(
"match 'b = (a:ProgNode {type:'CallableUnit', version:0})-[*]-()'", // get the sets of nodes below each callableunit
"foreach b ", // for each one of those
"c = apoc.hashing.fingerprint(b)", // calculate a hash
SET a.hash = c", // store it in the original
{batchSize:2000, iterateList:true, parallel:false, params: { nodes: $nodes_in } }
)

Note: for completeness, If the callableUnit has vanished in version 2, then it was deleted ( and thats ok). we account for that differently. If it did not exist in version 1, it was added. That is a change.

Note2: Yes, the hash code would have to be changed as the version changed. Easy enough to rerun.

bill_dickenson · March 15, 2021, 2:31pm

Anyone have any thoughts here ? Thanks

bill_dickenson · March 19, 2021, 2:30pm

2 weeks - no hits ! Could use the help on approach.

Topic		Replies	Views
How do we compare two graphs for equality Neo4j Graph Platform	4	1066	January 15, 2020
Help re-writing query Cypher performance , cypher	7	954	March 11, 2019
How do I compare two graphs for equality Cypher apoc , md5 , compare , knowledge-base	0	999	August 23, 2018
Visualize entire graph but with one condition filter Neo4j Graph Platform	2	179	February 24, 2022
Graph Search with path pruning Cypher cypher , operations	4	217	March 31, 2022

Hashing values through the graph

Related Topics