Memory issue in Causal Cluster

Hello team,

we've run into a weird problem with memory consumption in our cluster setup.
Important note: the problem described here is specific to the cluster. When we run the same setup on a single node (a heftier one, though), the problem does not manifest itself.

Environment info:

  • Neo4j Enterprise 4.2;
  • the cluster consists of 5 members: 3 cores, 2 read replicas;
  • each member is a 16 GB / 4-core Azure VM instance;
  • load intensity is very low, averaging a few queries per second and peaking (not often, and unrelated to the problem) at a few dozen per second, mostly read queries;
  • CPU average is 8-9%, with peaks at 25%;
  • the graph size is around 400 000 nodes and 2 000 000 edges.

The problem is that memory usage on each member of the cluster grows steadily until the system OOM killer terminates the JVM.
The growth rate differs from node to node, but the pattern is always the same.

Here is the memory consumption pattern:

All our queries are profiled/optimized and parameterized, so they are eligible for query plan caching. Tuning the query plan cache parameters didn't help.

I then took a JVM dump from one of the members that was maxing out on memory and analyzed it with a memory analysis tool. I don't know if it is of any help, but here is some info:

Any hints on which direction to look in are greatly appreciated. We really want to go on with the cluster setup.

Hi, this sounds like a bug, so please either open a support ticket or a GitHub issue (Issues · neo4j/neo4j · GitHub).

The first memory dump points more to the stats/query collector running; I think you can disable that,
e.g. with CALL db.stats.stop(), but it shouldn't be running by default. I'm not sure if there's a setting for it.

db.stats.clear()
db.stats.collect()
db.stats.retrieve()
db.stats.retrieveAllAnonymized()
db.stats.status()
db.stats.stop()
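
For illustration, here is a minimal sketch of how you might check and stop the collector from Cypher Shell or Browser. Passing the 'QUERIES' section name is my assumption about which collector is involved; check the output of the status call first.

```cypher
// Show the current state of the statistics collector
CALL db.stats.status();

// Stop the query collector if it is reported as collecting
// ('QUERIES' is an assumed section name; use whatever status() reports)
CALL db.stats.stop('QUERIES');

// Optionally discard data that has already been collected
CALL db.stats.clear('QUERIES');
```

If status() shows the collector as idle, it is probably not the cause and the calls above won't change anything.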

As a workaround, you can also disable the PIPELINED runtime by setting the default runtime to SLOTTED:

unsupported.cypher.runtime=SLOTTED
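
If you'd rather test the workaround on individual queries before changing the configuration, the runtime can also be selected per query with a Cypher prefix. A sketch only; the label, relationship type, property, and parameter below are hypothetical placeholders, not taken from your model:

```cypher
// Force the slotted runtime for this one query only
// (:Person, :KNOWS, id and $id are hypothetical)
CYPHER runtime=slotted
MATCH (p:Person {id: $id})-[:KNOWS]->(friend)
RETURN friend.name;
```

Prefixing the query with EXPLAIN or PROFILE should show in the plan output which runtime was actually used.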

It seems a bug was introduced in one of the versions.

https://github.com/neo-technology/neo4j/pull/10720

The team recommends upgrading to 4.3.4 if you can.
I'm still trying to find out whether the fix also went into the latest 4.2 release.

Hi,

thanks for the resolution,
will try and get back with the results.

Hi, the GitHub link does not work anymore; do you have another link? I can't find anything related to memory problems and the pipelined runtime in the changelogs. Do you know which 4.2 version the fix went into?

I got an answer from the support team: the issue is fixed in 4.2.12, but it does not appear in the changelog under the expected title.