Neo4j Cluster Stuck in Discovery

Hi All,

To keep it crisp.

I have an application with a three-node cluster, and the Neo4j DB is running on two of them.
It was a working setup until an interruption caused both nodes to drop into the FOLLOWER role.
I restarted the services on both nodes, and since then my Neo4j instances have not come up.

If I track the logs, debug.log does not show any issues, and the neo4j-out logs show that both nodes are stuck in the discovery phase and never move forward:

2020-02-21 11:14:05.236-0500 INFO Bolt enabled on 21.0.0.100:7687.
2020-02-21 11:14:05.246-0500 INFO Initiating metrics...
2020-02-21 11:14:05.538-0500 INFO My connection info: [
Discovery: listen=21.0.0.100:5000, advertised=21.0.0.100:5000,
Transaction: listen=21.0.0.100:6000, advertised=21.0.0.100:6000,
Raft: listen=21.0.0.100:7000, advertised=21.0.0.100:7000,
Client Connector Addresses: bolt://21.0.0.100:7687,http://21.0.0.100:7474,https://21.0.0.100:7473
]
2020-02-21 11:14:05.539-0500 INFO Discovering cluster with initial members: [21.0.0.104:5000, 21.0.0.100:5000]
2020-02-21 11:14:05.544-0500 INFO Attempting to connect to the other cluster members before continuing...

The tricky part is that, even locally, Neo4j is opening its connections on the relevant ports.
Also, I am not sure why it is not able to form a connection with the other Neo4j node.

I changed the relevant files on my server to start it in standalone mode, but then I was not able to add the other nodes to the cluster due to an exception.
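For reference, the standalone change was essentially flipping the mode in neo4j.conf; this is a sketch from memory, so the exact surrounding lines may differ:

dbms.mode=SINGLE
# previously: dbms.mode=CORE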

I have tried changing the parameter in the neo4j.conf file from "causal_clustering.expected_core_cluster_size=2" to a value of three.

But after the restart, I see it set back to 2 again.

I have tried unbinding the graph DB and restarting the services; still no go.
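For completeness, the unbind was the standard neo4j-admin one, run on each node with the service stopped:

neo4j-admin unbind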

My question is: how can I make one of the nodes skip checking for the other nodes during discovery and get past that phase?
Secondly, if it is not able to reach the other node, why would it not still start the service locally so that the relevant ports respond there?

Starting Nping 0.7.60 ( https://nmap.org/nping ) at 2020-02-21 10:46 EST
SENT (0.0014s) Starting TCP Handshake > 21.0.0.100:7474
RCVD (0.0014s) Possible TCP RST received from 21.0.0.100:7474 --> Connection refused
SENT (1.0026s) Starting TCP Handshake > 21.0.0.100:7687
RCVD (1.0026s) Possible TCP RST received from 21.0.0.100:7687 --> Connection refused
SENT (2.0036s) Starting TCP Handshake > 21.0.0.100:7474
RCVD (2.0037s) Possible TCP RST received from 21.0.0.100:7474 --> Connection refused
SENT (3.0048s) Starting TCP Handshake > 21.0.0.100:7687
RCVD (3.0048s) Possible TCP RST received from 21.0.0.100:7687 --> Connection refused
SENT (4.0059s) Starting TCP Handshake > 21.0.0.100:7474
RCVD (4.0060s) Possible TCP RST received from 21.0.0.100:7474 --> Connection refused
SENT (5.0071s) Starting TCP Handshake > 21.0.0.100:7687
RCVD (5.0071s) Possible TCP RST received from 21.0.0.100:7687 --> Connection refused

Any help recovering my Neo4j from this state without a rebuild would be appreciated.

Thanks,
Akshay

Can you please try to set causal_clustering.minimum_core_cluster_size_at_formation=2 ?
See Configuration settings - Operations Manual
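That is, in neo4j.conf, alongside your existing causal_clustering settings:

causal_clustering.minimum_core_cluster_size_at_formation=2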

Hi Stefan,

I tried that option, but it looks like the neo4j.conf file is getting overwritten with the old content again at startup.
I am checking internally to see why that is happening.

But I have your suggestion noted.

Thanks,
Akshay

If you're on Docker you need to use --env NEO4J_<setting>=<value> instead; see Configuration - Operations Manual
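For example, following the documented naming convention (a dot in the setting name becomes an underscore, and an underscore already in the name is doubled), the setting above would be passed as:

NEO4J_causal__clustering_minimum__core__cluster__size__at__formation=2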

Hi Stefan, I am not on a Docker environment.
The parameter that you provided is not supported on the Neo4j version that I am running.

Anything else that I can try to get the services up?

Thanks,
Akshay

So which version are you on?

Hi Stefan, sorry for the delayed response.
I am running 3.3.9.

Is there anything else that I can change in the neo4j.conf file to get this to start up?
I am attaching the neo4j.conf file for your reference.

neo4j.txt (38.6 KB)

I see these lines in the file you have attached.

\f0\fs24 \cf0
POD-1-vManage1:/var/lib/neo4j/conf#
POD-1-vManage1:/var/lib/neo4j/conf# cat neo4j.conf \

This file seems to be associated with some type of containerization.

Are you using some type of VMs to run the cluster?

Can you please provide a bit more detail about the environment where you are trying to run the cluster?

It's running as a service on an application.
Each node is a different VM.

Can you please check these things?

  1. Check that neo4j.conf is clean. I see extra characters, as I pasted above.
  2. Check that the firewall is open for ports 7474, 7687, 5000, 6000, and 7000.
  3. Do a test to see if you can reach each server by doing "telnet host 5000" etc. to see if it makes a connection (see the example below this list).
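For example, something like this from each node (assuming nc is available; plain telnet works too):

for p in 5000 6000 7000 7474 7687; do nc -zv 21.0.0.104 $p; done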

Thanks
Ravi

The extra characters were due to me copying the file incorrectly. The file is clean.

There is no firewall in between the nodes. This was working fine before the cluster went down due to an outage.

Telnet to port 5000, even from the local node, gives me connection refused.

I see the below in the logs all the time. It never passes this phase:

2020-02-26 15:14:29.257-0500 INFO My connection info: [
Discovery: listen=21.0.0.100:5000, advertised=21.0.0.100:5000,
Transaction: listen=21.0.0.100:6000, advertised=21.0.0.100:6000,
Raft: listen=21.0.0.100:7000, advertised=21.0.0.100:7000,
Client Connector Addresses: bolt://21.0.0.100:7687,http://21.0.0.100:7474,https://21.0.0.100:7473
]
2020-02-26 15:14:29.259-0500 INFO Discovering cluster with initial members: [21.0.0.104:5000, 21.0.0.100:5000]
2020-02-26 15:14:29.259-0500 INFO Attempting to connect to the other cluster members before continuing...

Can you run this command and see if the ports are open:

 sudo firewall-cmd --list-ports

If you don't see the ports, then you need to add them to the firewall.
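For example, assuming firewalld is what actually manages the rules on these VMs:

sudo firewall-cmd --permanent --add-port=5000/tcp
sudo firewall-cmd --permanent --add-port=6000/tcp
sudo firewall-cmd --permanent --add-port=7000/tcp
sudo firewall-cmd --permanent --add-port=7474/tcp
sudo firewall-cmd --permanent --add-port=7687/tcp
sudo firewall-cmd --reload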

Hi,

The firewall-cmd did not work.

But I tried the below and can see port 5000 listening.

POD-1-vManage1:/home/admin# netstat -an | grep 5000 | grep -i listen
tcp6 0 0 21.0.0.100:5000 :::* LISTEN

These are all the ports it is listening on for the cluster communication link:

POD-1-vManage1:/home/admin# netstat -an | grep -i listen | grep 21.0.0.100
tcp6 0 0 21.0.0.100:5000 :::* LISTEN
tcp6 0 0 21.0.0.100:9200 :::* LISTEN
tcp6 0 0 21.0.0.100:9300 :::* LISTEN
tcp6 0 0 21.0.0.100:7000 :::* LISTEN

You are missing ports 6000, 7474, and 7687 here.

Also, what about the other servers? Are they also listening on those ports?
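For example, on each node, the same netstat you used above can cover all five ports at once:

netstat -an | grep -i listen | egrep ':(5000|6000|7000|7474|7687)'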

That's correct.
But when I start the service, it doesn't move any further from the point where it initializes service discovery.
So I am not sure why the service won't start itself if the other members are not available.
Are there any logs that we can enable to find out why the service is not opening the relevant ports at startup?
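The only logging knob I have found so far is the debug log level in neo4j.conf, though I am not sure it surfaces discovery details on 3.3:

dbms.logs.debug.level=DEBUG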

Also, I can successfully nping the cluster IP on ports 5000 and 7000, but not on any other port.

Hi Akshay,
Please check your memory configuration in the environment variables. The memory configuration should match the version of Neo4j you are trying to support. If you have upgraded from an earlier version of Neo4j, then you should follow the Neo4j knowledge base articles for additional help.
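For example, the usual settings to double-check in neo4j.conf are the heap and page cache sizes (the values below are only placeholders, not recommendations):

dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=2g
dbms.memory.pagecache.size=2g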
Yours faithfully,
Sameer S Gijare