I made a mistake and ingested 160 million relationships the wrong way (on 32 million nodes). The nodes are PubMed article_ids and the relationships are citations. I have (a:Article)-[:CITES]->(b:Article) where it should be (a:Article)<-[:CITES]-(b:Article).
I have tried the following:
MATCH (a:Article)-[rel:CITES]->(b:Article)
CALL apoc.refactor.invert(rel)
YIELD input, output
RETURN COUNT(rel);
but keep getting (after about half an hour or more) "Server at localhost(127.0.0.1):7687 is no longer available".
I'm not sure how to deal with this error — is my large query crashing the server? I previously increased dbms.memory.heap.max_size to deal with an out-of-memory error.
My dedicated machine has 16 GB of RAM, and the nodes contain only article_ids (ranging from 1 to about 32 million).
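I have also seen apoc.periodic.iterate mentioned as a way to split work like this into many small transactions. Something along these lines might let the invert run in batches (just a sketch I have not verified on my data; the batchSize is a guess):

CALL apoc.periodic.iterate(
  "MATCH (a:Article)-[rel:CITES]->(b:Article) RETURN rel",
  "CALL apoc.refactor.invert(rel) YIELD input, output RETURN count(*)",
  {batchSize: 10000, parallel: false});

Would that avoid the single huge transaction, or does the invert hit the same memory problem either way?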
If the APOC approach won't run, is there another way of doing this? For instance, could I create all the reverse relationships manually and then delete all the old CITES relationships?
This works on my small test graph; is it a good idea to run it on such a large graph?
MATCH (m:Article)-[c:CITES]->(n:Article)
DELETE c
CREATE (m)<-[:CITES]-(n);
Is the operation guaranteed to be all-or-none (i.e., atomic)? The last thing I would want is for some unknown number of relationships to end up reversed.
I guess I could do:
MATCH (m:Article)-[c:CITES]->(n:Article)
DELETE c
CREATE (m)<-[:REFERENCES]-(n);
MATCH (m:Article)-[r:REFERENCES]->(n:Article)
DELETE r
CREATE (m)-[:CITES]->(n);
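The same two passes could presumably also be wrapped in apoc.periodic.iterate so that each pass commits in batches. Again only a sketch, with an arbitrary batch size:

CALL apoc.periodic.iterate(
  "MATCH (m:Article)-[c:CITES]->(n:Article) RETURN m, c, n",
  "DELETE c CREATE (m)<-[:REFERENCES]-(n)",
  {batchSize: 10000, parallel: false});

CALL apoc.periodic.iterate(
  "MATCH (m:Article)-[r:REFERENCES]->(n:Article) RETURN m, r, n",
  "DELETE r CREATE (m)-[:CITES]->(n)",
  {batchSize: 10000, parallel: false});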
Edit: I am running the two-statement version above on my large dataset, and the first statement has been running for over two hours.
Edit2: Three and a half hours in, it crashed with "Server at localhost(127.0.0.1):7687 is no longer available"
My next solution was to divide the problem into batches. In Python:
import neo4j
from tqdm import tqdm

batch_size = 5000
max_id = 33307598

# neo4j_uri and neo4j_auth are defined earlier
driver = neo4j.GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
session = driver.session()

# process one range of ArticleId values per query, so each transaction stays small
for batch in tqdm(range(max_id // batch_size + 1)):
    query = ("MATCH (m:Article)-[c:CITES]->(n:Article) "
             f"WHERE m.ArticleId >= {batch * batch_size} AND m.ArticleId < {(batch + 1) * batch_size} "
             "DELETE c "
             "CREATE (m)<-[:REFERENCES]-(n);")
    print(query)
    result = session.run(query)

session.close()
driver.close()
Unfortunately, the first iteration of this took nearly a minute, which extrapolates to roughly 100 hours for the full graph. At that rate it would be faster just to re-ingest the data correctly.
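One more thing I have not ruled out: if there is no index on :Article(ArticleId), each batch in the Python loop is probably doing a full label scan, which would explain the minute-per-iteration timing. If that is the culprit, creating the index first might make the batched approach viable (this is the 3.x syntax; I believe newer versions use CREATE INDEX ... FOR (a:Article) ON (a.ArticleId)):

CREATE INDEX ON :Article(ArticleId);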