Cluster leader keeps changing

tim.hanssen · December 5, 2018, 9:08am

Our causal cluster has started holding elections continuously, which leaves the cluster in a sort of read-only mode because the leader changes so often. The cluster had been running without issues or additional elections for 7 days before this interruption.

Ubuntu 18 LTS
Neo4j 3.4.10
BOLT (without routing)
Causal cluster (3 nodes)

debug log (neo03 (starting state: follower))

debug log (neo04 (starting state: leader))

Any suggestions?

albertodelazzari · December 5, 2018, 10:39am

Hi Tim,
just to better understand your issue.
The first time you started the instances, did the cluster form correctly? I mean, was an election completed successfully within a certain amount of time, with one leader and two followers?

The first time you started the instances, were all of them in an initial "follower" state? (you can check in your log files).
If not, and something went wrong (maybe because of an initial misconfiguration of the instances), please take a look at the "data" subdirectory of your Neo4j home directory. There should be a directory holding the cluster state information; you could try deleting it on each instance and then restarting all the servers.

In general, when starting from scratch, no instance will start as a leader. The proper state workflow should be: follower, candidate, leader. Once a leader is elected, all the other instances become followers.
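The follower, candidate, leader workflow described above can be sketched as a small state machine. This is only a simplified Raft-style model for illustration, not Neo4j's internal implementation; the event names (`election_timeout`, `won_election`, and so on) are made up for this sketch.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# Allowed transitions in a simplified Raft-style model.
# Event names are hypothetical and chosen only for this illustration.
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "won_election"): Role.LEADER,
    (Role.CANDIDATE, "leader_discovered"): Role.FOLLOWER,
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,
    (Role.LEADER, "higher_term_seen"): Role.FOLLOWER,
}


def next_role(role, event):
    """Return the role after `event`; unknown events leave the role unchanged."""
    return TRANSITIONS.get((role, event), role)


def replay(events, start=Role.FOLLOWER):
    """Apply a sequence of events and return every role visited, in order."""
    role = start
    history = [role]
    for event in events:
        role = next_role(role, event)
        history.append(role)
    return history
```

Note that there is no direct follower-to-leader transition in this model: an instance always has to pass through the candidate state first, which is why an instance should not report "leader" as its starting state on a fresh cluster.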

I hope this helps.

tim.hanssen · December 5, 2018, 10:50am

Hi Alberto,

The first time you started the instances, did the cluster form correctly? I mean, was an election completed successfully within a certain amount of time, with one leader and two followers?

Yes, the cluster was running for about 7 days without any issue. The leader did not change in those 7 days.

So I think the first election was done properly. After 4 days we restarted one core server (a follower), and it came back online as a follower without any problem. Everything was fine until last night, when this happened.

albertodelazzari · December 5, 2018, 11:38am

Ok, I will check through your logs to see if there is something useful.

Was the restart related to a failure or to maintenance?

Just a correction to what I told you before: when you need to clear the state of a core member of the cluster, instead of deleting the directory you can use the "neo4j-admin unbind" command, which is the recommended way to do it.

Thanks

tim.hanssen · December 5, 2018, 11:40am

Thanks!

The restart was done for maintenance (backup settings, etc.).

After this incident I restarted one follower node, which fixed the issue for now.

albertodelazzari · December 5, 2018, 11:46am

Ok, in future, I think the best way to do any maintenance task on an instance is:

stop the instance
unbind the instance (that is, delete the instance's cluster state using the neo4j-admin unbind command)
do the tasks you have to perform
restart the instance

Once the instance is up and running, it will rejoin the cluster with a correct state.
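As a rough sketch of the steps above, the snippet below only builds the command lines and prints them unless told to execute. It assumes a tarball-style install where `neo4j` and `neo4j-admin` live under `<NEO4J_HOME>/bin`; the path `/opt/neo4j` and the helper names are hypothetical, and on a package install the service would more likely be stopped and started through the system service manager.

```python
import subprocess


def maintenance_commands(neo4j_home):
    """Build the commands to run before and after the maintenance work.

    Assumes <neo4j_home>/bin contains the `neo4j` and `neo4j-admin` scripts.
    """
    neo4j = f"{neo4j_home}/bin/neo4j"
    neo4j_admin = f"{neo4j_home}/bin/neo4j-admin"
    return {
        # Stop the instance, then remove its cluster state.
        "before": [[neo4j, "stop"], [neo4j_admin, "unbind"]],
        # After the maintenance tasks are done, start the instance again.
        "after": [[neo4j, "start"]],
    }


def run_steps(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True aborts the sequence if a step fails.
            subprocess.run(cmd, check=True)
```

Calling `run_steps(maintenance_commands("/opt/neo4j")["before"])` prints the stop and unbind commands without touching the instance; the maintenance itself happens between the "before" and "after" steps.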

tim.hanssen · December 5, 2018, 12:00pm

We will, but the restart did not trigger this issue. It happened a few days later.

tim.hanssen · December 17, 2018, 10:36am

This failure was triggered by the issue described in "Causual cluster follower falling behind" (#2 by tim.hanssen). We fixed it by changing some configuration.
