Java heap space error when loading a huge csv file

Hi,

I'm loading data from a huge CSV file (4 GB, 10 000 000 lines in total). In each line there's a frame (defined by an id) and optionally a box (defined by two 2D points) that belongs to that frame. I create a node for each frame and each box, plus a relationship between them.
I also use constraints (id on Frame and code on Box - this is a little trick explained here: A way to mimic a composite constraint - #3 by lukasz.mika.
I do it this way as there's no composite index in the Neo4j version I'm using).

CREATE CONSTRAINT ON (b:Box) ASSERT b.code IS UNIQUE;
CREATE CONSTRAINT ON (f:Frame) ASSERT f.id IS UNIQUE;
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///mydata.csv' AS row
MERGE (f:Frame {id: row.id })
WITH f, row
FOREACH (n IN (CASE WHEN row.x0 IS NULL THEN [] ELSE [1] END) |
  MERGE (b:Box { code:(row.x0 + ',' + row.y0 + ',' + row.x1 + ',' + row.y1) })
  MERGE (f)-[r:HAS_BOX]->(b));

I get this error - 'java.lang.OutOfMemoryError: Java heap space'

I tried the settings below, but it didn't help:

dbms.memory.heap.initial_size=8G
dbms.memory.heap.max_size=8G
dbms.memory.pagecache.size=1G

It looks a bit suspicious to me, as it throws the error after creating only 761 000 nodes. However, on the same machine I'm able to load over 10 000 000 nodes of other node types.
It seems like there's something wrong with the 'Box' node type, but I'm clueless as to why...

Any ideas how to solve or investigate this? Any help appreciated.

Thanks,
lukasz

Hello,

The FOREACH in your query causes the planner to plan an Eager operation (run EXPLAIN on the query and note the dark-blue Eager operator), which prevents the query from being able to periodically commit, and that ends up blowing up your heap.

For more info about what the Eager operator is, and why it causes this, see this article.

To avoid this in your query, you'll have to ditch the FOREACH. If the query you posted is the query in full, then we can substitute a WHERE clause for the filter and keep the MERGE of the box and the relationship the same:

USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///mydata.csv' AS row 
MERGE (f:Frame {id: row.id })
WITH f, row 
WHERE row.x0 IS NOT NULL
MERGE (b:Box { code:(row.x0 + ',' + row.y0 + ',' + row.x1 + ',' + row.y1) })
MERGE (f)-[r:HAS_BOX]->(b);

Thanks a lot for the help! It works fine now for the case of one box per frame. But in reality I actually have more boxes per frame - between 0 and 5 boxes per frame. The column naming convention is e.g. b0_y1 = box 0, point 1, coordinate y. So I tried to do it this way:

USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///mydata.csv' AS row 
MERGE (f:Frame {id: row.id })
WITH f, row 
WHERE row.b0_x0 IS NOT NULL
MERGE (b0:Box { code:(row.b0_x0 + ',' + row.b0_y0 + ',' + row.b0_x1 + ',' + row.b0_y1) })
MERGE (f)-[r0:HAS_BOX]->(b0)
WITH f, row 
WHERE row.b1_x0 IS NOT NULL
MERGE (b1:Box { code:(row.b1_x0 + ',' + row.b1_y0 + ',' + row.b1_x1 + ',' + row.b1_y1) })
MERGE (f)-[r1:HAS_BOX]->(b1)

but I get:

java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects

Also, after re-organizing the query a bit, I got this one:

java.lang.OutOfMemoryError: Java heap space

I guess it just needs to be re-arranged and then it should work, but as far as Cypher goes I'm a beginner, so I'm struggling. Any help?

Thanks
lukasz

There are Eager operators in your plan, so those are again sabotaging the periodic commit: the query tries to walk through all rows at the same time for each step (one operation across all rows, then the next operation across all rows, etc.), which is blowing up your heap.

We need to adjust the query to avoid those Eagers and let the periodic commit do its job.

One way to do this is to break up each segment into a subquery. You need to be running Neo4j 4.1.x or higher for this:

USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///mydata.csv' AS row 
MERGE (f:Frame {id: row.id })
WITH f, row

CALL {
 WITH f, row
 WITH f, row
 WHERE row.b0_x0 IS NOT NULL
 MERGE (b0:Box { code:(row.b0_x0 + ',' + row.b0_y0 + ',' + row.b0_x1 + ',' + row.b0_y1) })
 MERGE (f)-[r0:HAS_BOX]->(b0)
 RETURN count(*) as count
}
WITH f, row 
CALL {
 WITH f, row 
 WITH f, row
 WHERE row.b1_x0 IS NOT NULL
 MERGE (b1:Box { code:(row.b1_x0 + ',' + row.b1_y0 + ',' + row.b1_x1 + ',' + row.b1_y1) })
 MERGE (f)-[r1:HAS_BOX]->(b1)
 RETURN count(*) as count
}
RETURN count(*)

The reason for the repeated WITH f, row lines in the subquery calls is due to a limitation right now where we cannot use a WHERE clause on the importing (first) WITH clause of the subquery, so we need to add a second one for this purpose.

Each subquery has to return something, and by returning a count(*) aggregation we can still return 1 row even if the WHERE clause filtered out all the rows of the subquery a few lines earlier. This ensures the query can continue executing afterwards.

Thank you. It finally works. Inside the CALL blocks I renamed the returned values to count1 and count2, as it complained about a conflict. Other than that it's perfect. Thanks a lot.
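
For reference, the version that works for me looks roughly like this (identical to the query above, only the two RETURN aliases differ):

USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///mydata.csv' AS row
MERGE (f:Frame {id: row.id })
WITH f, row
CALL {
 WITH f, row
 WITH f, row
 WHERE row.b0_x0 IS NOT NULL
 MERGE (b0:Box { code:(row.b0_x0 + ',' + row.b0_y0 + ',' + row.b0_x1 + ',' + row.b0_y1) })
 MERGE (f)-[r0:HAS_BOX]->(b0)
 RETURN count(*) as count1
}
WITH f, row
CALL {
 WITH f, row
 WITH f, row
 WHERE row.b1_x0 IS NOT NULL
 MERGE (b1:Box { code:(row.b1_x0 + ',' + row.b1_y0 + ',' + row.b1_x1 + ',' + row.b1_y1) })
 MERGE (f)-[r1:HAS_BOX]->(b1)
 RETURN count(*) as count2
}
RETURN count(*)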

I'm experiencing a similar problem. I tried aggregating as well, but it is not working.

We'd need to see the full query to help, as well as an EXPLAIN plan.

Also, you don't seem to be using USING PERIODIC COMMIT LOAD CSV; you would want to use that to batch the updates from your load. And if you're doing this through the browser, you would need to prefix your query with :auto, which lets the driver the browser uses perform the right type of transaction for that load. A minimal sketch of the prefix is below.
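
Something along these lines, where the file name, batch size, label and column are just placeholders, not taken from your query:

:auto USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///patients.csv' AS row
MERGE (p:patient {id: row.patient_id});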

Also make sure to check the EXPLAIN plan of the query; you want to make sure there are no Eager operators in the plan (they have a dark blue header), especially when using periodic commit load csv. See the example below.
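
You can inspect the plan without running anything by prefixing the query with EXPLAIN, for example (reusing the earlier file and label purely for illustration):

EXPLAIN USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///mydata.csv' AS row
MERGE (f:Frame {id: row.id});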


This is the query. I have increased the heap size and restarted many times, but it's still showing the same error.

You should be using USING PERIODIC COMMIT LOAD CSV for this (and as mentioned, prefix the query with :auto when running from the browser); that will commit in batches, which will save your heap.

Aside from that, the query looks mostly fine. Keep in mind that your last MERGE will only reuse :symptom nodes that are already attached to that specific patient node. If multiple patients have the same symptom, those nodes will not be reused by the MERGE, and you will end up with duplicates.

If :symptom nodes are supposed to be unique, then you need to change the query so that the MERGE on (s:symptom) by its name is separate from the MERGE of the (p)-[:HAS]->(s) relationship (see the sketch below).

Also you would want a unique constraint on :symptom(name).
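
Roughly like this, with the file name and column names made up for the example (your actual headers will differ). Run the constraint in its own statement before the load:

CREATE CONSTRAINT ON (s:symptom) ASSERT s.name IS UNIQUE;

:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///patients.csv' AS row
MERGE (p:patient {id: row.patient_id})
MERGE (s:symptom {name: row.symptom})
MERGE (p)-[:HAS]->(s);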