VM stop-the-world pause for simple LOAD CSV query for creating a graph

So, the page cache can be larger than the initial/max heap, and the total of the two should be less than the total available RAM on the system? Is that an accurate statement?

There is nothing else running on the servers except for some small logging processes.

I can try these settings out. I've been looking for a way to better visualize/monitor Neo4j. Is there a recommended tool?

dbms.memory.pagecache.size represents the amount of RAM allocated to caching the graph's store files in memory.

heap, however, is more a function of query concurrency/complexity. For example, 100 concurrent runs of

match (n) return n limit 1;

are going to consume a lot less heap than

match (n) where n.status='active'
with collect(n.id) as idsToMatch
match (n2:People) where n2.id in idsToMatch
return n2 order by n2.name, n2.age limit 100;

Yes: dbms.memory.pagecache.size + dbms.memory.heap.max_size < total RAM on the box.
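As a rough, hypothetical illustration only (the right split depends on your graph size and workload, not on a fixed formula), a 32GB machine might end up with something like:

dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=16g

That totals 24GB, leaving roughly 8GB of headroom for the OS and anything else on the box.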

This is described in the documentation links previously provided.


Thanks for the clarification. I was originally handed a configuration where the page cache was a fraction of the heap, and it was a false assumption on my part that the page cache was a subset of the heap. I really appreciate the help. Testing these updates now.

Hello everyone, is there any update on this problem? I recently saw the same error when loading millions of rows for edges. It didn't resolve even after I added an index for the edges, broke down the table, and set the heap_size and pagecache to 16GB (I only have 32GB of RAM). Can someone please help with this?

2022-02-18 16:24:58.497+0000 INFO  [o.n.k.a.p.GlobalProcedures] starting batching from `call apoc.load.jdbc($azure_url, "select  distinct m.person_id, m.measurement_id, m.measurement_concept_id,m.measurement_date, m.unit_concept_id, m.unit_concept_name, m.value_as_number from omop.measurement m where m.measurement_id between 100000000 and 200000000")
yield row
` operation using iteration `
with row
where row.unit_concept_id is not null
and row.value_as_number is not null
match (p:Person {person_id:row.person_id})
match (m:Measurement {measurement_concept_id: row.measurement_concept_id}) 
merge (p)-[r:HAS_MEASUREMENT {measurement_id:row.measurement_id, 
measurement_date:row.measurement_date, 
unit_concept_id:row.unit_concept_id, 
unit_concept_name:row.unit_concept_name, 
value_as_number:row.value_as_number}]->(m)
` in separate thread
2022-02-18 16:26:58.006+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=765, gcTime=803, gcCount=1}
2022-02-18 16:28:13.398+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=255, gcTime=307, gcCount=1}
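For reference, the load above is driven through apoc.periodic.iterate; reconstructed from the log, the call looks roughly like this (the batchSize, parallel, and params values here are illustrative, not my exact settings):

call apoc.periodic.iterate(
  'call apoc.load.jdbc($azure_url, "select distinct m.person_id, m.measurement_id, m.measurement_concept_id, m.measurement_date, m.unit_concept_id, m.unit_concept_name, m.value_as_number from omop.measurement m where m.measurement_id between 100000000 and 200000000") yield row',
  'with row
   where row.unit_concept_id is not null
     and row.value_as_number is not null
   match (p:Person {person_id:row.person_id})
   match (m:Measurement {measurement_concept_id: row.measurement_concept_id})
   merge (p)-[r:HAS_MEASUREMENT {measurement_id:row.measurement_id,
     measurement_date:row.measurement_date,
     unit_concept_id:row.unit_concept_id,
     unit_concept_name:row.unit_concept_name,
     value_as_number:row.value_as_number}]->(m)',
  {batchSize: 10000, parallel: false, params: {azure_url: $azure_url}}
);

If heap pressure is the issue, my understanding is that lowering batchSize shrinks each transaction and should reduce GC spikes.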

@mengjiakang2017

  1. You cannot 100% eliminate messages like

2022-02-18 16:26:58.006+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=765, gcTime=803, gcCount=1}

They are going to happen; it's Java being Java.

  2. But that line reports pauseTime=765, gcTime=803, and whichever figure you use, the pause lasted less than one second. You said you see the same error when loading millions of rows for edges; for millions of rows, is a pause of under a second really the problem?

  3. Regarding "set the heap_size and pagecache to 16GB (only have 32G ram)":
     it's not simply a matter of taking total RAM and allocating 50% to heap and 50% to pagecache.

In a world with unlimited RAM, one would configure the pagecache to be slightly larger than the size of the database. Heap, however, is more about query complexity and query concurrency.
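Also, since you mentioned adding an index for the edges: for a load like yours, what matters most is that the node properties used by the two match clauses are indexed, so each batch does lookups rather than scans. A sketch, using the labels/properties from your query and assuming Neo4j 4.x syntax (the index names are made up):

create index person_id_idx if not exists for (p:Person) on (p.person_id);
create index measurement_concept_id_idx if not exists for (m:Measurement) on (m.measurement_concept_id);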

Thank you @dana_canzano for helping out!

To answer 1, I'm pasting some more logs from debug.log. Basically the VM stop-the-world warning repeats, then a connection times out, then it stops running the current query and moves on to my next Cypher query in cypher-shell.

2022-02-17 01:45:48.077+0000 INFO [o.n.k.a.p.GlobalProcedures] starting batching from `call apoc.load.jdbc($azure_url,
"select distinct  m.measurement_id, m.visit_occurrence_id, m.measurement_concept_id, v.visit_concept_id, v.visit_start_date, v.visit_end_date
from omop.measurement m
join omop.visit_occurrence v
on v.visit_occurrence_id = m.visit_occurrence_id")
yield row
` operation using iteration `
match (m:Measurement {measurement_concept_id: row.measurement_concept_id})
match (v:VisitOccurrence {visit_concept_id:row.visit_concept_id})
merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,
visit_start_date:row.visit_start_date,
visit_end_date:row.visit_end_date,
measurement_id:row.measurement_id}]->(v)
` in separate thread
2022-02-17 01:46:44.330+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=276, gcTime=307, gcCount=1}
2022-02-17 01:47:21.996+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0xafcb7c11, L:/172.17.0.2:7687 - R:/10.120.14.85:60148] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
2022-02-17 01:47:21.997+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0x6f3b153e, L:/172.17.0.2:7687 - R:/10.120.14.85:60149] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
2022-02-17 01:48:46.708+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=578, gcTime=0, gcCount=0}

For 2, when I ran neo4j-admin memrec again after loading most of the nodes and edges, I got a very high pagecache recommendation of more than 450GB. Is this expected?

# Based on the above, the following memory settings are recommended:
dbms.memory.heap.initial_size=31g
dbms.memory.heap.max_size=31g
dbms.memory.pagecache.size=459300m

@mengjiakang2017

regarding

For 2, when I ran neo4j-admin memrec again after loading most of the nodes and edges, I got a very high pagecache recommendation of more than 450GB. Is this expected?

a. How much total memory is on the instance? What does the Linux command free -g return?

b. How large is the graph on disk? At Neo4j startup, your logs/debug.log will report, for each database, something similar to

                          [system/035qstu4] Total size of mapped files:  218kb
                          [neo4j/74aue41] Total size of mapped files:  104GB
                          [test/ak73sce4] Total size of mapped files:  11GB              

indicating three databases, namely system, neo4j, and test, consuming 218kb, 104GB, and 11GB on disk respectively.
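A quick way to pull those lines out of a long debug.log (assuming the default logs directory):

grep "Total size of mapped files" logs/debug.log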

In a perfect world, and if you have sufficient RAM, dbms.memory.pagecache.size should be set to the sum of these values. If you can configure it as such, the entire graph will reside in RAM. That's not a requirement, but it is a best practice.
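As a worked example with the hypothetical sizes above (rounding up a little to leave room for growth):

# 218kb + 104GB + 11GB comes to roughly 115GB:
dbms.memory.pagecache.size=116g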

Thank you @dana_canzano .

for a. When I ran free -g, I got

 total        used        free      shared  buff/cache   available
Mem:            503         216         138           0         148         285
Swap:             7           0           7

For b, I still have a query that's been running for hours, but the graph database size is about 2GB based on previous experience.

I also want to add that we are running Neo4j in Docker, and I'm very unfamiliar with it. We've updated neo4j.conf with the heap and pagecache settings, but I wanted to double-check that this is indeed applied to the container. Thanks!
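One way I'm planning to verify that the settings actually took effect (the container name and password below are placeholders) is to list the effective memory config via cypher-shell inside the container:

docker exec -it neo4j cypher-shell -u neo4j -p <password> \
  "CALL dbms.listConfig('dbms.memory') YIELD name, value RETURN name, value;"

I also understand the official image can take these settings as environment variables (for example NEO4J_dbms_memory_pagecache_size=16G), and as I understand it those get written into the container's neo4j.conf at startup, so I'll check which mechanism our setup uses.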