LOAD CSV and apparently bogus RWLOCK deadlocks?

Hey all,

I am using Community Edition Neo4j 3.5.6 and APOC 3.5.0.2 on a 16 core 64Gb machine with an additional 64Gb of swap (that is never used, but just there for safety)

My machine is split into sharing memory 20/20/20/4 as described in:

...this seems stable now; it's small, but this is a proof of concept machine.

I am importing about 500M objects and about 3B relationships, with about 100 reltypes and about 50 labels/objtypes; the imports are from $big gigabytes of compressed, database-generated CSV file.

The Cypher code for importing data is statically generated from a python script, using templates, because (literally) 10,000 CSV files produced by data warehouse, and each one has to have its own LOAD CSV because LOAD CSV cannot be iterated over input filenames (eg: via UNWIND).

These templated cypher scripts (some over 5000 lines long) contain code like:

USING PERIODIC COMMIT 256
LOAD CSV WITH HEADERS
FROM 'file:/path/foo001.csv.gz' AS batch
MERGE (a:Account { uid:toInteger(batch.INTEGER__ID) })
SET
a.status = toInteger(batch.INTEGER__STATUS);

or

USING PERIODIC COMMIT 256
LOAD CSV WITH HEADERS
FROM 'file:/path/bar001.csv' AS batch
MATCH (x:Account { uid:toInteger(batch.INTEGER__X) })
MATCH (y:Account { uid:toInteger(batch.INTEGER__Y) })
MERGE (x)-[edge:Linky]->(y)
ON CREATE SET edge.reason = batch.REASON
ON MATCH SET edge.reason = batch.REASON;

...simply repeated for files foo001 through foo300 and so forth.

I am aware of and am considering eventual migration to neo4j-admin import as described here:

...which sounds amazing but I want to check that it fits my use case, which includes doing updates of pre-existing counters ON MATCH SET edge.weight = batch.more_weight + edge.weight; as well as other stuff like that.

Also, I am using small transactions of 256 having read in a previous posting that small transactions tend to reduce collisions.

Further I am following the advice of (I think it was) @michael.hunger, elsewhere in this community, to ORDER/GROUP my CSV data in order to minimise deadlock issues across parallel transactions.

I am also following Michael's advice to load distinct labels, and distinct relationships, in parallel:

WITH ALL OF THIS CONTEXT: here's the question:

  • I am collecting CSVs and importing them in phases, and executing them in parallel
  • in any given phase, the set of objects being imported, or relationships being imported, are completely disjoint
  • that is to say, I am importing (:A), (:B), (:C)-[:foo]->(:D) ... and in any given batch, none of A/B/C/foo/D are ever of the same type, and for each of A/B/C-foo-D the LOAD CSV is run serially for each shard/CSV input file.

In theory, there should be no lock contention? But no, I am wrong; it appears that LOAD CSV can tread upon itself:

LockClient[13625877] can't wait on resource RWLock[NODE(93708867), hash=515089118] since => LockClient[13625877] <-[:HELD_BY]- RWLock[NODE(469290148), hash=686715308] <-[:WAITING_FOR]- LockClient[13626010] <-[:HELD_BY]- RWLock[NODE(93708867), hash=515089118]

LockClient[13626299] can't wait on resource RWLock[NODE(469054858), hash=2000485859] since => LockClient[13626299] <-[:HELD_BY]- RWLock[NODE(469317257), hash=1199527701] <-[:WAITING_FOR]- LockClient[13626312] <-[:HELD_BY]- RWLock[NODE(469054858), hash=2000485859]

LockClient[13630461] can't wait on resource RWLock[RELATIONSHIP(2319588353), hash=477452471] since => LockClient[13630461] <-[:HELD_BY]- RWLock[NODE(411438449), hash=1187551531] <-[:WAITING_FOR]- LockClient[13630397] <-[:HELD_BY]- RWLock[RELATIONSHIP(2319588353), hash=477452471]

NB: 2x NODE deadlocks + 1x RELATIONSHIP deadlock, in examples above.

...I have perhaps a dozen of these, from the latest [edit: small, delta] import, and it seems that the problem occurs when the/a nodes is "hot", and has several (tens? hundreds?) of thousands of relationships pointing to it; this is the sort of thing/requirement that I have discussed previously:

I am not sure how I can mitigate this, if at all; possibly it is related to the issue described here:

...or perhaps it's because of the sorting/grouping, that the number of relations linking:

(foo)-[:Bar]-(where:City {name:"London"})

...greatly exceeds one-or-more transactions-worth of space.

The annoying thing here is that I apparently cannot fix this; all of the commentary that I see online about deadlocks, "solves" them by saying something like "simply trap the exception in Java and re-run the transaction"; but as above, I don't have any of that.

Hence this posting, with its vast amounts of context. Have I missed anything, please, or is anything in the above somehow a stupid choice, given what I am trying to achieve?

Thanks!

A couple things to check:

Make sure you have relevant indexes and/or unique constraints already created before your loads (such as :Account(uid)). This can ensure good matching speed, and prevent use of time and memory when it has to use a label scan instead.

Don't parallelize your relationship loads, unless you can be absolutely sure the end nodes of the relationship won't collide with each other (or other transactions involving those nodes). Relationship creation must take locks on the start and end nodes of the relationship, that's likely what's happening here. So CREATE (a)-[:KNOWS]->(b) needs to take locks on a and b for the duration of the batch.

As the error messages you encountered have node and relationship ids, you can check for yourself what nodes or relationships are involved, maybe that can help you find nodes in the graph that are being referenced by multiple CSVs.

Hi Andrew!

That's literally what I mean by in any given batch, none of A/B/C/foo/D are ever of the same type - that I select the objects and relationships for import on the basis that absolutely no two load-csvs should ever require locks on the same table.

From this, I am wondering whether individual LOAD CSVs are treading upon themselves, because...

Exactly; I have triaged 93708867 and 2319588353 from the above, and in all cases the/one node is a "hot" node ("has several (tens? hundreds?) of thousands of relationships pointing to it") — and so what I am wondering is how this can happen in a single and independent LOAD CSV?

Perhaps something in Neo is pipelining 2x transactions from a single LOAD CSV and they (the pipeline) are colliding with each other over the hot node?