Query took 11 hours to finish running. Need help tuning an APOC function that updates the graph

neo4j version: Community 4.2.4
dbms.memory.heap.initial_size=24G
dbms.memory.heap.max_size=24G
dbms.memory.pagecache.size=8G
desktop version: 1.4.3

Hello,

What advice or suggestions would you have for improving my Cypher query?

I've read several sources on Cypher tuning before posting this question.

From these readings, I've learned about several areas where I could change my query to improve it.

Here's an example of the offending query taking 11 hours to finish:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('file:///newdata_04_08_21.csv', {header:false}) YIELD list AS row RETURN row
","
MATCH(p:Person {person_id: row[0]}), (b:Business:NewLaw {business_id: row[1]})
MERGE(p)-[r:FIRST_LAW]->(b)
ON MATCH
	SET r.neo4jLastUpdate = datetime( {timezone: 'America/Los Angeles'} )
ON CREATE
	SET r.neo4jCreated = datetime( {timezone: 'America/Los Angeles'} )
RETURN count(*)",
{batchSize: 100000, parallel: false})

I also have uniqueness constraints on the Person and Business nodes as follows:

CREATE CONSTRAINT PersonUnique ON (p:Person) ASSERT p.person_id IS UNIQUE;
CREATE CONSTRAINT BusinessUnique ON (b:Business) ASSERT b.business_id IS UNIQUE;
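
Both constraints create backing indexes, which is what the MATCH lookups on person_id and business_id should be using. As a quick sanity check, I can confirm they are ONLINE with:

CALL db.indexes() YIELD name, labelsOrTypes, properties, state
RETURN name, labelsOrTypes, properties, state;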

This query is passed to my Python script, which is scheduled to run at 21:00 every night to update my Neo4j database. Last night, it took 11 hours for this query to finish.

The newdata_04_08_21.csv is about 2G and it does not have a header. This file is provided to me on a daily basis with millions of rows and its contents look similar to this:

joe-1234,electronics88,FIRST_LAW
jane-1234,retail145,FIRST_LAW
sam-5788,education179,FIRST_LAW

Based on what I learned from the Cypher tuning guides, it is best to avoid a cartesian product and to aggregate early. A better approach for my query would be this:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('file:///newdata_04_08_21.csv', {header:false}) YIELD list AS row RETURN row
","
MATCH(p:Person)
WHERE p.person_id = row[0]
WITH p
MATCH(b:Business:NewLaw)
WHERE b.business_id=row[1]
WITH p, b
MERGE(p)-[r:FIRST_LAW]->(b)
ON MATCH
	SET r.neo4jLastUpdate = datetime( {timezone: 'America/Los Angeles'} )
ON CREATE
	SET r.neo4jCreated = datetime( {timezone: 'America/Los Angeles'} )
RETURN count(*)",
{batchSize: 100000, parallel: false})

With this revised query, I avoid the cartesian product because I now have two separate MATCH clauses, each with a WHERE clause that looks up the unique node from the csv row. I also carry the matched person and business forward with WITH, and then pass those two variables to the MERGE clause.

Would this be an appropriate method to improve my cypher query?

I tried to generate the EXPLAIN execution plan, but it's not very descriptive with APOC statements.
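
Since the inner statement is just regular Cypher, I can at least EXPLAIN it on its own in the Browser, with a sample row passed in as a parameter (the values below are made-up sample data in the shape of the csv rows):

:param row => ['joe-1234', 'electronics88', 'FIRST_LAW']

EXPLAIN
MATCH (p:Person {person_id: $row[0]}), (b:Business:NewLaw {business_id: $row[1]})
MERGE (p)-[r:FIRST_LAW]->(b)
ON MATCH SET r.neo4jLastUpdate = datetime( {timezone: 'America/Los Angeles'} )
ON CREATE SET r.neo4jCreated = datetime( {timezone: 'America/Los Angeles'} )
RETURN count(*)
// both MATCH legs should show NodeUniqueIndexSeek if the constraints are being picked up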

I also experimented with switching the parallel parameter on apoc.periodic.iterate from false to true, but since I'm running this apoc call daily to update the relationships in my graph, I left it at false to avoid conflicting transactions.
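
If I do revisit parallel: true at some point, I'll probably also set the retries option on apoc.periodic.iterate so that a batch that hits a lock conflict gets retried instead of failing. Roughly (SET clauses omitted for brevity):

CALL apoc.periodic.iterate("
CALL apoc.load.csv('file:///newdata_04_08_21.csv', {header:false}) YIELD list AS row RETURN row
","
MATCH(p:Person {person_id: row[0]}), (b:Business:NewLaw {business_id: row[1]})
MERGE(p)-[r:FIRST_LAW]->(b)
RETURN count(*)",
{batchSize: 10000, parallel: true, retries: 3})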

Any advice here would be much appreciated.

Thanks

This is not an issue, as you have uniqueness constraints on the IDs. Each MATCH will find exactly one node, so you will only end up with one row here.

Can you tell us how big your database is?

Also, I see you are allocating more memory to the heap than to the page cache. This may cause performance to degrade over time as the database grows beyond 8 GB, since it will cause too many page faults when loading the data your queries match against. Can you reduce the heap and add that memory to the page cache?
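
For example (purely illustrative numbers for a 32G machine, to be tuned against your actual data set), something along these lines in neo4j.conf:

dbms.memory.heap.initial_size=8G
dbms.memory.heap.max_size=8G
dbms.memory.pagecache.size=16G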

Also, for it to take 11 hours, how big is the file?

Another aspect: since you are reading the file from the same disk you are writing to, disk performance also comes into the picture. If you have a regular HDD, it can slow things down drastically.

The size of my database is 6.7G, with 400,000 nodes and 500,000,000 relationships. We have a CentOS virtual machine with 32G of RAM, 8 CPU cores, and 20GB of storage dedicated to hosting this Neo4j instance. We could also increase the RAM on the VM or the memory allocated to Neo4j if necessary.

Also, in our graph database there are 490,000,000 FIRST_LAW relationships between 100,000 :Person nodes and 5,000 :Business nodes. However, restricting to the :Business:NewLaw nodes, the number of FIRST_LAW relationships to update drops to 11,000,000, so I wouldn't expect it to take hours to update this many relationships with APOC functions. This is what I'm updating on a daily basis with the APOC call that took 11 hours to finish the other night. My team had weeks of discussions on how to eliminate or diminish these two super nodes, but our business domain requires this many relationships between :Person nodes and :Business nodes.

Given the size of our database at 6.7G, it seems like 8G of page cache should suffice for mapping the store files on disk. Since this is a dedicated host for Neo4j, we won't be reducing the heap allocated to Neo4j.

We have 100,000 :Person nodes, 5,000 :Business nodes, and today there were 30 :Business:NewLaw nodes. This translated into required updates to 11,000,000 relationships listed in the newdata_04_08_21.csv file, which is about 2G in size. I need to either update or create the relationships between the :Person nodes and the :Business:NewLaw nodes, which is why I'm using MERGE inside apoc.periodic.iterate.

I found these two articles here. I'll take another shot after reading these two, and I'll report my findings.

I think I managed to solve my issue here. Thanks to all of these resources on Cypher query tuning, I learned a great deal more about how queries work in Neo4j.

In particular, I found one piece of the Cypher tuning guide extremely helpful.

So this was the offending query taking about 11 hours to finish:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('file:///newdata_04_08_21.csv', {header:false}) YIELD list AS row RETURN row
","
MATCH(p:Person {person_id: row[0]}), (b:Business:NewLaw {business_id: row[1]})
MERGE(p)-[r:FIRST_LAW]->(b)
ON MATCH
	SET r.neo4jLastUpdate = datetime( {timezone: 'America/Los Angeles'} )
ON CREATE
	SET r.neo4jCreated = datetime( {timezone: 'America/Los Angeles'} )
RETURN count(*)",
{batchSize: 100000, parallel: false})

And the solution was quite simple. All I needed to do was eliminate the redundant MATCH clauses and just use MERGE, since all this specific query needs to do is either match or create the relationships.

I originally wrote the query with the multiple MATCH clauses and the cartesian product based on my initial understanding of how to create relationships from the Neo4j beginner guides. However, that is unnecessary in my specific case: the relationships are already laid out in the csv file, and all I needed to do was iterate through it and either update the existing relationships between Person and Business nodes, or create new relationships between nodes that already exist in the database.

Here's the improved query:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('file:///newdata_04_08_21.csv', {header:false}) YIELD list AS row RETURN row
","
MERGE(p:Person {person_id: row[0]})-[r:FIRST_LAW]->(b:Business:NewLaw {business_id: row[1]})
ON MATCH
	SET r.neo4jLastUpdate = datetime( {timezone: 'America/Los Angeles'} )
ON CREATE
	SET r.neo4jCreated = datetime( {timezone: 'America/Los Angeles'} )
RETURN count(*)",
{batchSize: 100000, parallel: true})

I also parallelized this operation because all of the relationships listed in the daily csv file have been confirmed to be unique. With these improvements, my query went from 11 hours down to 60 seconds. This is huge! I'm still going through the results just to make sure everything was updated as expected, but so far I don't see any errors.
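
For that verification, I'm basically counting how many FIRST_LAW relationships into the :Business:NewLaw nodes got stamped today via the neo4jCreated / neo4jLastUpdate properties set above, roughly:

MATCH (:Person)-[r:FIRST_LAW]->(:Business:NewLaw)
WHERE date(r.neo4jCreated) = date() OR date(r.neo4jLastUpdate) = date()
RETURN count(r)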

With those multiple MATCH clauses in my initial query, I was processing twice as many rows as I needed to when matching the :Person nodes, as explained in the query tuning guides. That was duplicated work that I needed to eliminate. I essentially combined those three separate clauses (the two MATCHes and the MERGE) into a single MERGE, as demonstrated in the improved version.

I'm not completely done with the Cypher query tuning guide, but so far it has proven extremely valuable in optimizing my queries.

If there's anything I missed or misstated, please let me know.

Thanks!

Actually, no, I didn't solve my issue. I made a mistake here by inferring that the two cypher queries below were equivalent:

MATCH(p:Person {person_id: row[0]}), (b:Business:NewLaw {business_id: row[1]})
MERGE(p)-[r:FIRST_LAW]->(b)

and

MERGE(p:Person {person_id: row[0]})-[r:FIRST_LAW]->(b:Business:NewLaw {business_id: row[1]})

But in reality they are very different. In the second query, MERGE operates on the entire pattern: it matches only if all three components (the :Person node, the :Business:NewLaw node, and the r:FIRST_LAW relationship) already exist, and if any of them is missing it tries to create all three from scratch. In my database, the :Person and :Business:NewLaw nodes already exist, but the relationships do not. I need to establish the relationships as shown in the csv file, which has about 10 million records on a daily basis. When I try to run

MERGE(p:Person {person_id: row[0]})-[r:FIRST_LAW]->(b:Business:NewLaw {business_id: row[1]})

on just a few records, I get an error saying that the :Person and :Business:NewLaw nodes already exist.
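
To make the difference concrete with one of the sample rows from above (joe-1234 / electronics88):

// MERGE on the entire pattern: if this exact relationship does not exist yet,
// Cypher tries to create the whole pattern, nodes included, which collides with
// the uniqueness constraints on person_id and business_id.
MERGE (p:Person {person_id: 'joe-1234'})-[r:FIRST_LAW]->(b:Business:NewLaw {business_id: 'electronics88'})

// MATCH the existing nodes first, then MERGE only the relationship between them.
MATCH (p:Person {person_id: 'joe-1234'}), (b:Business:NewLaw {business_id: 'electronics88'})
MERGE (p)-[r:FIRST_LAW]->(b)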

I was under the impression that the second query was a valid statement for my use case according to the lessons in the Graph Academy. In that lesson, the syntax presented is:

MERGE (variable1:Label1 {nodeProperties1})-[:REL_TYPE]->
(variable2:Label2 {nodeProperties2})
RETURN variable

And the accompanying explanation is:
If there is an existing node with Label1 and nodeProperties1 with the :REL_TYPE relationship to an existing node with Label2 and nodeProperties2 in the graph, no relationship is created. If the relationship does not exist, the relationship is created.

The second sentence ("If the relationship does not exist, the relationship is created.") led me to believe that this MERGE clause would create just the relationship between existing nodes. It really should have said:
If the nodes and relationships do not exist, the nodes and relationships are created.

Even when I'm using the indexes on the nodes, this seems to take 11 hours. I'm not sure what else to do here. I guess I'll keep trying different approaches, like increasing the page cache or reorganizing my daily update process. Unfortunately, there's not much I can do about the super node on :Person. We spent weeks discussing the data model and it will remain the way it is for now.

The volume of changes that my Neo4j database experiences on a daily basis has reached a level where it now takes days for the updates to finish. Creating the Neo4j database from scratch using the daily csv files only takes about 3 hours. I've tried numerous approaches, but it seems like automating the daily creation of the Neo4j database is a less painful approach than having to update 500,000,000 relationships. I can create a daily cron job that does this overnight, and it would solve my problem here.

I've been battling these daily updates in my Neo4j graph database, and I think I finally managed to fix this without having to delete and recreate the database. The daily cron job that runs the Python update script, which contains the Cypher queries, finished overnight in about 4 hours.

I'll write a more detailed response here for my future self and others who may find this useful.

Basically, it was a mix of (a) better understanding my business domain to come up with better uniqueness constraints on important nodes, (b) getting better at working with Python, Pandas, NumPy, etc. to wrangle the data, and (c) understanding polynomial time complexity, which is a contributing factor when updating 500,000,000 relationships.

Based on what I've learned from my company's streaming data, it will behoove me to pick up a large-scale distributed data processing tool like Spark as this project continues to grow. For now, Pandas gives me the muscle I need to work with Neo4j and determine the relationships that need to be updated.

Just wanted to close this thread. I solved this problem: refactoring the underlying graph data model was the answer. I had a super node in my graph that was causing a bottleneck. My recommendations for my future self:

  1. Become an expert graph data modeller.
  2. Get really good at Cypher.

I'm going to add this experience to my list of blog posts. Out of all of the readings I did, these two helped me the most:

  1. Common Graph Structures - Graph Data Modeling for Neo4j

In addition, intermediate nodes can be invaluable organizing structures.

That quote cannot be emphasized enough.

  2. Graph Modeling: All About Super Nodes | by David Allen | Neo4j Developer Blog | Medium

Now my graph data model finishes updating in about 30 minutes every night, without needing to take down the system.
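
For anyone curious what the intermediate-node idea can look like in practice, here is a rough, illustrative sketch only; the :LawBatch label and the HAS_BATCH / FIRST_LAW_IN relationship types are hypothetical, not my actual schema. The idea is that the :Business super node gets one relationship per batch (for example, per day) instead of millions of direct FIRST_LAW relationships, and the per-person relationships attach to the batch node instead:

// Illustrative only: bucket relationships under an intermediate node per business per day.
MATCH (b:Business:NewLaw {business_id: 'electronics88'})
MERGE (b)-[:HAS_BATCH]->(batch:LawBatch {business_id: b.business_id, day: date()})
WITH batch
MATCH (p:Person {person_id: 'joe-1234'})
MERGE (p)-[:FIRST_LAW_IN]->(batch)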