How to handle large data insertion in Neo4j

Hello All,
We want to create more than 20 million nodes in Neo4j. While pushing the node data into Neo4j we observed that write performance was quite slow, so to tackle this we have tried the following things:

We are using Py2neo, the client library and toolkit for working with Neo4j.

  1. We used the graph.create() API to create nodes, without using any transaction control APIs.
  2. We built a single query to create all the nodes and ran it with graph.run(), but we observed that this also took a lot of time.
  3. Then we used the transaction API: graph.begin() returns a transaction object, we call create() on it to insert the nodes, and finally call commit on the transaction object. We were able to commit when the file had a small number of nodes, but for large numbers of nodes the manual commit would not go through. To work around this we used the auto-commit APIs provided by py2neo and closed the transaction manually at the end of the code, but performance was still not as good as expected. A rough sketch of this attempt is below.
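(In the sketch, the connection details, input file, and node properties are simplified placeholders.)

```python
from py2neo import Graph, Node

# Connect to the database (credentials are placeholders)
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

tx = graph.begin()  # one explicit transaction for the whole file
with open("people.csv") as f:
    next(f)  # skip the header row
    for line in f:
        name, age = line.rstrip("\n").split(",")
        tx.create(Node("Person", name=name, age=int(age)))

# This single large commit is where things get stuck on big files
tx.commit()  # newer py2neo releases prefer graph.commit(tx)
```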

What is the best practice for inserting millions of nodes at a time into a Neo4j graph database?
Does importing the node data from a CSV file affect the speed of the Neo4j graph database?
For importing CSV files, Neo4j expects them to be in its local import directory; can we load CSV files stored elsewhere on our local drive?
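For reference, the kind of import we have in mind is sketched below; the file name and columns are placeholders, and the comments note our understanding of where the file has to live.

```python
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# By default LOAD CSV resolves file:/// URLs against Neo4j's import
# directory (dbms.directories.import in Neo4j 4.x), which is what our
# last question is about. USING PERIODIC COMMIT is 4.x syntax (Neo4j 5
# replaces it with CALL { ... } IN TRANSACTIONS) and needs an
# auto-commit transaction, so it may be simpler to run this query from
# cypher-shell or Neo4j Browser than through a driver.
query = """
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
CREATE (:Person {name: row.name, age: toInteger(row.age)})
"""
graph.run(query)
```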

Thank you in advance.

It's hard to tell what's going on in your situation without knowing your data and code. I'm going to link this page that has a number of good principles about how to get the best possible performance out of bulk Neo4j inserts. This is about writing to Neo4j with Spark, but the principles are the same whether you're using the Spark connector or you wrote your own program. It's about batching, parallelism, and what you're writing in the first place.
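To make the batching idea concrete, here's a minimal sketch using py2neo, since that's what you're on. The label, property names, and batch size are assumptions to tune for your data; the point is one parameterized UNWIND query reused for every batch, with a bounded amount of work per commit.

```python
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# One parameterized query, reused for every batch: the query plan is
# cached, and each call writes a bounded chunk of work.
QUERY = """
UNWIND $rows AS row
CREATE (n:Person)
SET n = row
"""

def write_in_batches(rows, batch_size=10_000):
    for i in range(0, len(rows), batch_size):
        graph.run(QUERY, rows=rows[i:i + batch_size])

# Usage: rows is a list of plain dicts, one per node
rows = [{"name": f"person-{i}", "age": i % 90} for i in range(100_000)]
write_in_batches(rows)
```

Parallelism is then a matter of handing different batches to different workers, as long as they aren't contending for the same locks.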

An anti-pattern I see a lot is a big, complicated Cypher query that writes many different labels and relationship types in one pass. This tends to create locks all over the graph and slows total insert performance. Another anti-pattern is unsorted or randomly ordered input data that churns the Neo4j page cache; sometimes you can get improved performance just by sorting the data before you start writing it, to improve the hit ratio, particularly when your database is memory constrained. Another big problem people run into is simply inappropriate memory tuning in the database.
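To make the sorting point concrete, here's a sketch of what "sort before you write" can look like; the column name is a placeholder for whatever key your writes cluster on.

```python
import csv

# Read the input and sort it once, up front, by the key consecutive
# writes will share. Nearby writes then touch nearby pages, improving
# the page-cache hit ratio; this matters most when memory is tight.
with open("people.csv") as f:
    rows = list(csv.DictReader(f))

rows.sort(key=lambda r: r["name"])

# ...then feed `rows` to a batched writer like the UNWIND sketch above.
```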

Good luck -- but if you continue to have issues after following this guidance, I'd recommend you follow up with a code & data example, state what your insert rates look like, and say where you're trying to get them to.
