VM stop-the-world pause for simple LOAD CSV query for creating a graph

So, the page cache can be larger than the initial/max heap, and the total of the two should be less than the total available RAM on the system? Is that an accurate statement?

There is nothing else running on the servers except for some small logging processes.

I can try these settings out. I've been looking for a way to better visualize/monitor Neo4j. Is there a recommended tool?

dbms.memory.pagecache.size represents the amount of RAM allocated to caching the graph's store files in memory.

heap, however, is more a function of query concurrency/complexity. For example, 100 concurrent runs of

match (n) return n limit 1;

are going to consume a lot less heap than

match (n) where n.status='active'
with collect(n.id) as idsToMatch
match (n2:People) where n2.id in idsToMatch
return n2 order by n2.name, n2.age limit 100;

Yes: dbms.memory.pagecache.size + dbms.memory.heap.max_size < total RAM on the box.
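As a rough, hypothetical illustration only (the right split depends on your graph size and workload, not on a fixed formula), a 32GB machine might end up with something like:

dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=16g

That totals 24GB, leaving roughly 8GB of headroom for the OS and anything else on the box.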

This is described in the documentation links previously provided.


Thanks for the clarification. I was originally handed a configuration where the page cache was a fraction of the heap, and it was a false assumption on my part that the page cache was a subset of the heap. I really appreciate the help. Testing these updates now.

Hello everyone, is there any update on this problem? I recently saw the same error when loading millions of rows for edges. It didn't resolve even after I added an index for the edges, broke down the table, and set the heap_size and pagecache to 16GB (I only have 32GB of RAM). Can someone please help with this?

2022-02-18 16:24:58.497+0000 INFO  [o.n.k.a.p.GlobalProcedures] starting batching from `call apoc.load.jdbc($azure_url, "select  distinct m.person_id, m.measurement_id, m.measurement_concept_id,m.measurement_date, m.unit_concept_id, m.unit_concept_name, m.value_as_number from omop.measurement m where m.measurement_id between 100000000 and 200000000")
yield row
` operation using iteration `
with row
where row.unit_concept_id is not null
and row.value_as_number is not null
match (p:Person {person_id:row.person_id})
match (m:Measurement {measurement_concept_id: row.measurement_concept_id}) 
merge (p)-[r:HAS_MEASUREMENT {measurement_id:row.measurement_id, 
measurement_date:row.measurement_date, 
unit_concept_id:row.unit_concept_id, 
unit_concept_name:row.unit_concept_name, 
value_as_number:row.value_as_number}]->(m)
` in separate thread
2022-02-18 16:26:58.006+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=765, gcTime=803, gcCount=1}
2022-02-18 16:28:13.398+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=255, gcTime=307, gcCount=1}
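For reference, the load above is driven through apoc.periodic.iterate; reconstructed from the log, the call looks roughly like this (the batchSize, parallel, and params values here are illustrative, not my exact settings):

call apoc.periodic.iterate(
  'call apoc.load.jdbc($azure_url, "select distinct m.person_id, m.measurement_id, m.measurement_concept_id, m.measurement_date, m.unit_concept_id, m.unit_concept_name, m.value_as_number from omop.measurement m where m.measurement_id between 100000000 and 200000000") yield row',
  'with row
   where row.unit_concept_id is not null
     and row.value_as_number is not null
   match (p:Person {person_id:row.person_id})
   match (m:Measurement {measurement_concept_id: row.measurement_concept_id})
   merge (p)-[r:HAS_MEASUREMENT {measurement_id:row.measurement_id,
     measurement_date:row.measurement_date,
     unit_concept_id:row.unit_concept_id,
     unit_concept_name:row.unit_concept_name,
     value_as_number:row.value_as_number}]->(m)',
  {batchSize: 10000, parallel: false, params: {azure_url: $azure_url}}
);

If heap pressure is the issue, my understanding is that lowering batchSize shrinks each transaction and should reduce GC spikes.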

@mengjiakang2017

  1. You cannot 100% eliminate messages like

2022-02-18 16:26:58.006+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=765, gcTime=803, gcCount=1}

They are going to happen; it's Java being Java.

  2. But that line reports pauseTime=765, gcTime=803, and whichever figure you use, the pause lasted less than one second. You said you see the same error when loading millions of rows for edges; for millions of rows, is a pause of under a second really the problem?

  3. Regarding "set the heap_size and pagecache to 16GB (only have 32G ram)":
     it's not simply a matter of taking total RAM and allocating 50% to heap and 50% to pagecache.

In a world with unlimited RAM, one would configure the pagecache to be slightly larger than the size of the database. Heap, however, is more about query complexity and query concurrency.
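Also, since you mentioned adding an index for the edges: for a load like yours, what matters most is that the node properties used by the two match clauses are indexed, so each batch does lookups rather than scans. A sketch, using the labels/properties from your query and assuming Neo4j 4.x syntax (the index names are made up):

create index person_id_idx if not exists for (p:Person) on (p.person_id);
create index measurement_concept_id_idx if not exists for (m:Measurement) on (m.measurement_concept_id);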

Thank you @dana_canzano for helping out!

To answer 1, I'm pasting some more logs from debug.log. Basically the VM stop-the-world warning repeats, then a connection times out, then it stops running the current query and moves on to my next Cypher query in cypher-shell.

2022-02-17 01:45:48.077+0000 INFO [o.n.k.a.p.GlobalProcedures] starting batching from `call apoc.load.jdbc($azure_url,
"select distinct  m.measurement_id, m.visit_occurrence_id, m.measurement_concept_id, v.visit_concept_id, v.visit_start_date, v.visit_end_date
from omop.measurement m
join omop.visit_occurrence v
on v.visit_occurrence_id = m.visit_occurrence_id")
yield row
` operation using iteration `
match (m:Measurement {measurement_concept_id: row.measurement_concept_id})
match (v:VisitOccurrence {visit_concept_id:row.visit_concept_id})
merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,
visit_start_date:row.visit_start_date,
visit_end_date:row.visit_end_date,
measurement_id:row.measurement_id}]->(v)
` in separate thread
2022-02-17 01:46:44.330+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=276, gcTime=307, gcCount=1}
2022-02-17 01:47:21.996+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0xafcb7c11, L:/172.17.0.2:7687 - R:/10.120.14.85:60148] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
2022-02-17 01:47:21.997+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0x6f3b153e, L:/172.17.0.2:7687 - R:/10.120.14.85:60149] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
2022-02-17 01:48:46.708+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=578, gcTime=0, gcCount=0}

For 2, when I ran neo4j-admin memrec again after loading most of the nodes and edges, I got a very high pagecache recommendation of more than 450GB. Is this expected?

# Based on the above, the following memory settings are recommended:
dbms.memory.heap.initial_size=31g
dbms.memory.heap.max_size=31g
dbms.memory.pagecache.size=459300m

@mengjiakang2017

regarding

For 2, when I ran neo4j-admin memrec again after loading most of the nodes and edges, I got a very high pagecache recommendation of more than 450GB. Is this expected?

a. How much total memory is on the instance? What does the Linux command free -g return?

b. How large is the graph on disk? At Neo4j startup, your logs/debug.log will report, for each database, something similar to

                          [system/035qstu4] Total size of mapped files:  218kb
                          [neo4j/74aue41] Total size of mapped files:  104GB
                          [test/ak73sce4] Total size of mapped files:  11GB              

indicating three databases, namely system, neo4j, and test, consuming 218kb, 104GB, and 11GB on disk respectively.
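A quick way to pull those lines out of a long debug.log (assuming the default logs directory):

grep "Total size of mapped files" logs/debug.log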

In a perfect world, and if you have sufficient RAM, dbms.memory.pagecache.size should be set to the sum of these values. If you can configure it as such, the entire graph will reside in RAM. That's not a requirement, but it is a best practice.
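As a worked example with the hypothetical sizes above (rounding up a little to leave room for growth):

# 218kb + 104GB + 11GB comes to roughly 115GB:
dbms.memory.pagecache.size=116g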

Thank you @dana_canzano .

for a. When I ran free -g, I got

 total        used        free      shared  buff/cache   available
Mem:            503         216         138           0         148         285
Swap:             7           0           7

For b, I still have a query that's been running for hours, but the graph database size is about 2GB based on previous experience.

I also want to add that we are running Neo4j in Docker, and I'm very unfamiliar with it. We've updated neo4j.conf with the heap and pagecache settings, but I wanted to double-check that this is indeed applied to the container. Thanks!
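One way I'm planning to verify that the settings actually took effect (the container name and password below are placeholders) is to list the effective memory config via cypher-shell inside the container:

docker exec -it neo4j cypher-shell -u neo4j -p <password> \
  "CALL dbms.listConfig('dbms.memory') YIELD name, value RETURN name, value;"

I also understand the official image can take these settings as environment variables (for example NEO4J_dbms_memory_pagecache_size=16G), and as I understand it those get written into the container's neo4j.conf at startup, so I'll check which mechanism our setup uses.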