I have a graph with ~14m nodes and ~48m relationships. I know some proportion of the relationships are duplicates and want to delete them. I've found various queries (e.g., here) on the forum.
My challenge is that executing the query seems impossible -- it more or less destroys my dev machine. From the query profile, it looks like the problem is the AllNodesScan followed by the relationship expansion. I would like to batch or limit the work, but am finding this very difficult: no matter where I put the LIMIT clause (e.g., below), the EXPLAIN diagram still shows the whole graph being scanned at the start.
I can't use label scanning because the duplicate relationships span different node types. But surely there must be some way to limit a scan? If I can do that, I can use iterative commits to go through the graph in batches.
So the question is: how can I adjust the query below to scan only X nodes at a time?
```cypher
MATCH (a)-[r]->(b)
WITH a, b, type(r) AS tr, properties(r) AS pr, count(properties(r)) AS cpr LIMIT 10
RETURN sum(cpr - 1) AS numDuplicateRelationships
```
As you've observed, it is doing full graph scans because you are not using labels on (a) and (b). I don't know of a way around that (without a label, how could Cypher use anything but an all-nodes scan?).
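To make the difference concrete: anchoring even one side of the pattern on a label lets the planner start from a NodeByLabelScan instead of an AllNodesScan. A minimal sketch (`:Person` is a hypothetical label standing in for one of yours):

```cypher
// Without a label the plan starts with AllNodesScan over all ~14M nodes.
// With a label it starts from NodeByLabelScan over just that label's nodes.
EXPLAIN
MATCH (a:Person)-[r]->(b)
RETURN count(r);
```

Compare the EXPLAIN output of this against your unlabelled version to see the operator change.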
APOC can programmatically construct dynamic Cypher en masse; it sounds like it might come in handy here. It has solved a number of challenges for me.
For example, you could call db.labels() and try every possible label-pair combination with programmatically constructed Cypher.
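A sketch of that idea, using apoc.cypher.run to execute a query string built per label pair (note this is an illustration, not tested at your scale; nodes carrying multiple labels will be counted under more than one pair):

```cypher
// Enumerate every label pair and count duplicate relationships between them.
// apoc.cypher.run executes a dynamically constructed query string.
CALL db.labels() YIELD label AS l1
CALL db.labels() YIELD label AS l2
CALL apoc.cypher.run(
  'MATCH (a:`' + l1 + '`)-[r]->(b:`' + l2 + '`) ' +
  'WITH a, b, type(r) AS tr, properties(r) AS pr, count(r) AS c ' +
  'WHERE c > 1 ' +
  'RETURN sum(c - 1) AS dupes',
  {}) YIELD value
RETURN l1, l2, value.dupes AS duplicateRelationships;
```

Each inner query then starts from a NodeByLabelScan on l1 rather than scanning the whole graph.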
- If you have duplicate edges with different labels on the end nodes, this implies (to me) that the metamodel may need work, and probably the pre-graph-construction ETL too (isn't this something better fixed before Neo4j?).
- I guess there are duplicate nodes as well? Are they really duplicates, or sub-concepts? Should those duplicate nodes be merged and all given the same label, or are they slightly different and ought to have a shared second label? I have cases like this in my graph: a group of nodes all have a shared label plus a second, differentiating label. I index the shared label.
- It sounds like you are deleting relationships that are not identical, as best I can tell. Are you checking the node/relationship property values?
- Is deleting the edge what you want, or would aggregating the edges (collapsing or bundling, as some call it) be better? I combine the information across edges that provide the same information from different sources (e.g., min/avg/max, count, summary info); that way I retain a weight of evidence.
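On the batching side: the usual tool for "delete in batches with periodic commits" is apoc.periodic.iterate. A sketch, assuming exact duplicates means same endpoints, same type, and same properties (the batch size is a placeholder to tune):

```cypher
// Group relationships by (start, end, type, properties); keep the first
// of each group and delete the rest, committing every 10k deletes.
// The driving MATCH still streams over the whole graph once, but the
// deletes are applied in batched transactions, keeping memory bounded.
CALL apoc.periodic.iterate(
  'MATCH (a)-[r]->(b)
   WITH a, b, type(r) AS t, properties(r) AS p, collect(r) AS rels
   WHERE size(rels) > 1
   UNWIND rels[1..] AS dup
   RETURN dup',
  'DELETE dup',
  {batchSize: 10000})
YIELD batches, total
RETURN batches, total;
```

This doesn't avoid the initial scan, but it does avoid holding all the deletes in a single transaction, which is usually what kills a dev machine.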
Thanks! These are all super helpful pointers. I'm going to try out apoc.cypher.run().
What happened here is that there was an error in the ETL process. It's now fixed, we think, for future runs, but given the amount already loaded (we caught this late), I'm hoping to fix it by pruning rather than wiping and redoing the whole ETL. More generally, I'm also trying to get to grips with working on a graph of this size and complexity.
Thanks again - this is super helpful.