Tuning for larger-than-memory, multi-TB graph node insertion

This is my first project using Neo4j and I only learned Cypher last month, so apologies in advance if this turns out not to be a performance problem and I've made the wrong assumption.

First off, all the code involved is open; the changes causing these performance issues come specifically from "Further performance and RAM usage improvements: batch inserts" by robobenklein · Pull Request #9 · utk-se/WorldSyntaxTree · GitHub

I've been trying to get past the ~250GB-of-data-stored mark for a while, and I've tuned the memory settings as described in a post on my own site: Neo4j Performance adventures for petabyte-scale datasets – Unhexium

At my current stage I'm hitting a lot of transient errors that don't seem to resolve fast enough, the most recent being:

neo4j.exceptions.TransientError: {code: Neo.TransientError.Transaction.BookmarkTimeout} {message: Database 'top1k' not up to the requested version: 609977. Latest database version is 609948}

When running a single file / worker process, node insertion runs at a few thousand nodes per second, which is nowhere near fast enough to get through all the parsed repos from GitHub, so I have 128 workers all submitting batch queries of ~1k nodes per query.
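For illustration, chunking the parsed nodes into those ~1k-node batches only takes a small generator along these lines (a generic sketch, not the repo's actual code):

from itertools import islice

def batched(nodes, size=1000):
    """Yield lists of at most `size` node-property dicts."""
    it = iter(nodes)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

Each chunk is then passed as a single Cypher parameter (e.g. $batch), so the whole ~1k-node insert happens in one query and one transaction.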

When using just a few workers (8-16 total, writing ~10k nodes/s) I hit these problems at a much lower rate, and I've verified that jobs which normally fail will pass just fine when run with only a few workers.

I would expect that, given the hardware I'm running this on (128 hardware threads, 512GB of RAM), I should be able to run at least 64 workers, but some jobs are failing due to the BookmarkTimeout errors.

I set the Python driver to use a managed transaction with a timeout of 30 minutes per batch insertion, since inserting 1k nodes shouldn't ever take longer than that, right?
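Concretely, that timeout goes through the driver's unit_of_work decorator; a minimal sketch (the query and label here are placeholders, not the real ones from the collector):

import neo4j

# 1800 s == 30 minutes: the server terminates any single batch transaction
# that runs longer than this.
@neo4j.unit_of_work(timeout=1800)
def insert_batch(tx, batch):
    tx.run("UNWIND $batch AS props CREATE (n:Node) SET n = props", batch=batch)

Each worker then calls session.write_transaction(insert_batch, batch), which also retries the function on TransientErrors, but only up to the driver's max_transaction_retry_time (30 s by default), which is far shorter than the transaction timeout above.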

Earlier, when I tried to insert 10k nodes per batch, I would quickly run out of memory:

Mar 04 21:07:31 caladan neo4j[2061807]: ERROR StatusLogger An exception occurred processing Appender log
Mar 04 21:07:31 caladan neo4j[2061807]:  org.neo4j.logging.shaded.log4j.core.appender.AppenderLoggingException: java.lang.OutOfMemoryError
Mar 04 21:07:31 caladan neo4j[2061807]: Caused by: java.lang.OutOfMemoryError
Mar 04 21:07:31 caladan neo4j[2061807]:         at java.base/java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:214)

Now with 1k/batch I get yet a different error:

Mar 04 22:34:59 caladan neo4j[2061807]: ERROR StatusLogger An exception occurred processing Appender log
Mar 04 22:34:59 caladan neo4j[2061807]:  java.lang.NegativeArraySizeException: -1508875789
Mar 04 22:34:59 caladan neo4j[2061807]:         at java.base/java.lang.StringCoding.encodeUTF8_UTF16(StringCoding.java:910)
Mar 04 22:34:59 caladan neo4j[2061807]:         at java.base/java.lang.StringCoding.encodeUTF8(StringCoding.java:885)
Mar 04 22:34:59 caladan neo4j[2061807]:         at java.base/java.lang.StringCoding.encode(StringCoding.java:415)
Mar 04 22:34:59 caladan neo4j[2061807]:         at java.base/java.lang.String.getBytes(String.java:941)
Mar 04 22:34:59 caladan neo4j[2061807]:         at org.neo4j.logging.shaded.log4j.core.layout.AbstractStringLayout.getBytes(AbstractStringLayout.java:218)

This doesn't seem like a UnicodeDecodeError, though, since I catch all of those in the Python program before they ever make it into a Cypher query.

The largest a single batch could realistically get is ~1GB, assuming the worst case where every inserted node has a 1MB text property, which is incredibly rare. Even so, with 256GB of RAM dedicated to Neo4j, what should I adjust to avoid these problems without sacrificing insert-rate performance?

I've dropped the batch size to 100, decreased the number of processes to just 8, and am still getting these errors, so the problem seems to be related to the total amount of data I'm trying to insert, not the rate.

Sorry that I missed your post. :(

Hmm, can you create a GitHub issue for this, either on the Python driver or the database repository?
It seems the server has issues with transactions not being applied correctly; I haven't seen this before.

I took a quick look at the diff you linked:

I would suggest using the UNWIND approach throughout.
Are you using one session per thread? Sessions shouldn't be shared across threads.
I'm also not sure if/where you create the transactions and whether you commit them. It's probably better to use transaction functions for the writes (see the sketch below).
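A generic version of that, assuming thread-based workers inside each process (the URL, credentials, database name, label, and query are all placeholders):

from concurrent.futures import ThreadPoolExecutor
from neo4j import GraphDatabase

# One driver per process, shared by all worker threads in that process.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def write_batch(tx, batch):
    # Transaction function: the driver can safely retry it on transient errors.
    tx.run("UNWIND $batch AS props CREATE (n:Node) SET n = props", batch=batch)

def worker(batches):
    # One session per worker thread; sessions must never be shared across threads.
    with driver.session(database="top1k") as session:
        for batch in batches:
            session.write_transaction(write_batch, batch)

# with ThreadPoolExecutor(max_workers=16) as pool:
#     pool.map(worker, batches_per_worker)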

Michael

If you want to write parent nodes to the db before they exist, you can also just create constraints and use MERGE for those nodes on the key that identifies them, then hydrate them later.
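A sketch of what that looks like, assuming a placeholder WSTNode label and an identifier property as the key (Neo4j 4.x constraint syntax):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session(database="top1k") as session:
    # The uniqueness constraint also gives MERGE an index to look the node up by.
    session.run(
        "CREATE CONSTRAINT parent_key IF NOT EXISTS "
        "ON (n:WSTNode) ASSERT n.identifier IS UNIQUE"
    )
    # MERGE creates a stub parent if it doesn't exist yet; a later pass can
    # fill in ("hydrate") the remaining properties.
    session.run(
        "UNWIND $parents AS p "
        "MERGE (n:WSTNode {identifier: p.identifier})",
        parents=[{"identifier": "example/parent/key"}],
    )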

I also read through your post but have some more questions:

Can you share a bit more about your data model? In the last few sentences you hint at larger text properties and indexing.

If I were to clone your repository and run it on a large machine, could I reproduce the issue? Does it pull the data files needed for the imports automatically?

I made this issue a while back and haven't heard anything since: Theoretically constant-time node insertions always eventually lead to BookmarkTimeout · Issue #12686 · neo4j/neo4j · GitHub

Each process creates its own connection to the db (GraphDatabase.driver) and opens one session (with driver.session() as ...), which is used for multiple managed write transactions (@neo4j.unit_of_work()). (Source: WorldSyntaxTree/neo4j_collector_worker.py at b1ede0da5461b1c841f897cbd952354370471a64 · utk-se/WorldSyntaxTree · GitHub)

Both of the queries within that managed batch-write function already use UNWIND to load their data; perhaps there's a different usage of UNWIND you have in mind?
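To illustrate, a generic pair of batched UNWIND queries (not the exact ones from neo4j_collector_worker.py; the labels and properties here are made up) would look like:

# Node and relationship batch queries; $batch is a list of property dicts.
NODE_QUERY = """
UNWIND $batch AS props
CREATE (n:WSTNode)
SET n = props
"""

RELATIONSHIP_QUERY = """
UNWIND $batch AS props
MATCH (parent:WSTNode {identifier: props.parent_id})
MATCH (child:WSTNode {identifier: props.identifier})
CREATE (parent)-[:PARENT_OF]->(child)
"""

def write_batch(tx, batch):
    tx.run(NODE_QUERY, batch=batch)
    tx.run(RELATIONSHIP_QUERY, batch=batch)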

As for running the collection itself, the only requirement beyond the setup.py install is the list of repo URLs you want to analyze (the list I used, a ~60MB csv: https://lufi.unhexium.net/r/KWYGmBHwnp#FL08xaJZRw2uITX7sa5aZmaFWFDA6ZA8mqKypHdV92M= ); GNU parallel will also make running the jobs easier.

From cloning the repo to filling the db and hitting the error:

virtualenv -p python3 venv
source venv/bin/activate
./setup.py install
# run 4 repos in parallel, each with 16 processes working on files, only the top 1000 repos from github
cat repos-by-stars.csv | tail +2 | head -1000 | cut -d , -f 1 | parallel --halt soon,fail=1 -j 4 --bar python -m wsyntree_collector -v add https://github.com/\{\}.git -w 16

This list was made a while back so I can make a new one if some of those repos aren't available anymore.