Neo4j periodically becoming unresponsive

Hi all,

We are in the process of building an python application, using neo4j. This application takes information about interactions of users with our website. These are stored in a graph in neo4j.

Using a queueing sytem one or more interactions of the user are send in one cypher query to the neo4j database. These queries contain multiple MERGE statements that add nodes if necessary. For all properties that are used in these MERGE statements the corresponding indexes are online.

Many of these requests are send to neo4j concurrently. At times this may be around 100 queries per second.

Neo4j community edition version 3.5.3 is running on an amazon ec2 instance (m5.2xlarge).

Most of the time this runs fine. However, periodically neo4j stops responding (or at least the queries take to long to execute, since they time-out in the application).

I have been looking into the logs and noticed these kind of lines:

2019-04-06 21:35:02.925+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=65457, gcTime=0, gcCount=0}
2019-04-06 21:35:14.029+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 10413195 to [/var/lib/neo4j/data/databases/graph.db/neostore.$
2019-04-06 21:40:21.188+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=241005, gcTime=0, gcCount=0}
2019-04-06 21:40:59.718+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=3206, gcTime=0, gcCount=0}
2019-04-06 21:41:00.298+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 10413114 checkpoint completed i$
2019-04-06 21:41:00.300+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 239
2019-04-06 21:41:09.282+0000 INFO [o.n.c.i.ExecutionEngine] Discarded stale query from the query cache after 45428 seconds:

When searching online some comments are made in relation to neo4j performance and garbage collection. Therefore I have tried several heap sizes (always setting initial and max size equal, as recommended). Sadly, the periodic pauses keep occuring (tried heap sizes up to 20GB). Although the time in between the pauses seems to differ with heap size. CPU and memory usage of the server look fine (in particular cpu percentage seems to peak during the pauses).

Also, in the logs it says gcTime=0, does this mean that this is actually not a garbage collection issue?

My question therefore is: what could be causing this periodic dips in performance? And what could be a solution?

Thanks, Jacob

Update:

Changing the the checkpoint interval time seems to influence the periodicity of the moments of unresponsiveness. More specifically: I set dbms.checkpoint.interval.time to 1800s and neo4j becomes unresponsive less often. Also, in the log files I see now that there indeed is a line about checkpoint triggering:

2019-04-10 14:19:22.845+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=33766, gcTime=0, gcCount=0}
2019-04-10 14:19:33.370+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 13116768 to [/var/lib/neo4j/data/databases/graph.db/neostore.counts.db.a], from [/var/lib/neo4j/data/databases/graph.db/neostore.counts.db.$
2019-04-10 14:27:51.995+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 13116647 checkpoint completed in 9m 47s 855ms
2019-04-10 14:27:52.374+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] Pruned log versions 137-137, last checkpoint was made in version 383

So it seems it has something to do with checkpointing. Any ideas?

Did you get a resolution to this? We are seeing sporadic issues in performance when the heap used seems to max out.

For heap maxing out (or going out of memory for heap) usually you'll want to look at the queries being run.

In particular you'll want to make sure you batch updates if a significant number of updates are expected from the query (you can use apoc.periodic.iterate() for this), and you'll want to PROFILE or EXPLAIN your query to ensure that indexes are being used to back your MATCH and MERGE operations.

Feel free to open a separate post for your issue if you haven't already.