To keep it crisp:
I have an application with a three-node cluster, and Neo4j is running on two of those nodes.
It was a working setup before an interruption caused both nodes to go into a FOLLOWER role.
I restarted the services on both nodes and since then my Neo4j instances are not coming up.
When I follow the logs, debug.log does not show any issues, and the neo4j-out log shows that both nodes are stuck in the discovery phase and never move forward:
2020-02-21 11:14:05.236-0500 INFO Bolt enabled on 220.127.116.11:7687.
2020-02-21 11:14:05.246-0500 INFO Initiating metrics...
2020-02-21 11:14:05.538-0500 INFO My connection info: [
Discovery: listen=18.104.22.168:5000, advertised=22.214.171.124:5000,
Transaction: listen=126.96.36.199:6000, advertised=188.8.131.52:6000,
Raft: listen=184.108.40.206:7000, advertised=220.127.116.11:7000,
Client Connector Addresses: bolt://18.104.22.168:7687,http://22.214.171.124:7474,https://126.96.36.199:7473
2020-02-21 11:14:05.539-0500 INFO Discovering cluster with initial members: [188.8.131.52:5000, 184.108.40.206:5000]
2020-02-21 11:14:05.544-0500 INFO Attempting to connect to the other cluster members before continuing...
The tricky part is that Neo4j is not even opening its connections on the relevant ports locally.
Also, I am not sure why it is not able to form a connection with the other Neo4j node.
I changed the relevant configuration files on one server to start it in standalone mode, but then I am not able to add the other nodes to the cluster because of an exception.
I have tried changing "causal_clustering.expected_core_cluster_size=2" in the neo4j.conf file to a value of 3.
But after the restart I see it set back to 2 again.
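To narrow down why the value keeps reverting, a minimal sketch like the one below could list every uncommented assignment of a setting in a neo4j.conf-style file. It only helps under the assumption that the revert is caused by a duplicate entry, or by something rewriting the file at startup, which I have not confirmed. The helper name `find_setting` and the sample excerpt are hypothetical.

```python
from typing import List, Tuple


def find_setting(conf_text: str, key: str) -> List[Tuple[int, str]]:
    """Return (line_number, value) for every uncommented assignment of `key`."""
    hits = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            hits.append((lineno, value.strip()))
    return hits


# Hypothetical excerpt, not my real file: the same key is set twice.
SAMPLE = """\
# neo4j.conf (hypothetical excerpt)
causal_clustering.expected_core_cluster_size=3
dbms.mode=CORE
causal_clustering.expected_core_cluster_size=2
"""

print(find_setting(SAMPLE, "causal_clustering.expected_core_cluster_size"))
# prints [(2, '3'), (4, '2')]
```

Running the same function on the real file before and after a restart (for example with `find_setting(open(path).read(), key)`, where `path` points at the actual neo4j.conf) would show whether the file itself is being changed or whether the setting is defined more than once.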
I have tried unbinding graphdb and restarting the services, still no go.
My question is: how can I make one of the nodes skip checking the other nodes during discovery and get past that phase?
Secondly, if it is not able to reach the other node, why does it not at least start the service locally so that the relevant ports respond locally?
Starting Nping 0.7.60 ( Nping — Network packet generation tool & ping utility ) at 2020-02-21 10:46 EST
SENT (0.0014s) Starting TCP Handshake > 220.127.116.11:7474
RCVD (0.0014s) Possible TCP RST received from 18.104.22.168:7474 --> Connection refused
SENT (1.0026s) Starting TCP Handshake > 22.214.171.124:7687
RCVD (1.0026s) Possible TCP RST received from 126.96.36.199:7687 --> Connection refused
SENT (2.0036s) Starting TCP Handshake > 188.8.131.52:7474
RCVD (2.0037s) Possible TCP RST received from 184.108.40.206:7474 --> Connection refused
SENT (3.0048s) Starting TCP Handshake > 220.127.116.11:7687
RCVD (3.0048s) Possible TCP RST received from 18.104.22.168:7687 --> Connection refused
SENT (4.0059s) Starting TCP Handshake > 22.214.171.124:7474
RCVD (4.0060s) Possible TCP RST received from 126.96.36.199:7474 --> Connection refused
SENT (5.0071s) Starting TCP Handshake > 188.8.131.52:7687
RCVD (5.0071s) Possible TCP RST received from 184.108.40.206:7687 --> Connection refused
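The nping run above only probes the client ports (7474 and 7687). A rough sketch of the same check extended to the cluster ports from the log (5000 for discovery, 6000 for transactions, 7000 for Raft) is below. It uses only the Python standard library; the function names and the default host are placeholders. Run from the other node, it would show whether the discovery port is reachable at all.

```python
import socket
import sys
from typing import Dict, Iterable

# Ports taken from the connection info in the log above.
CLUSTER_AND_CLIENT_PORTS = (5000, 6000, 7000, 7474, 7687)


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Map each port to whether a TCP handshake completed."""
    return {port: port_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Placeholder default; pass the other node's address as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    for port, is_open in check_ports(target, CLUSTER_AND_CLIENT_PORTS).items():
        print(f"{target}:{port} {'open' if is_open else 'closed or unreachable'}")
```

If 5000 answers but 7474 and 7687 do not, that would suggest the instance is listening for cluster traffic but has not started its client connectors yet, though that interpretation is my assumption.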
Any help recovering my Neo4j cluster from this state without a rebuild would be appreciated.