Issue setting up a single core and 1 read replica

Hi,

I'm trying to build an environment in AWS that has 1 core server and 1 read replica, following the directions in "Deploy a basic cluster - Operations Manual".

I set up both configuration files as specified there, but the read replica fails to connect to the core server. The debug.log file shows the error below, which says the outbound connection fails.
This is strange, because the security group and network ACL rules allow port 5000 both inbound and outbound. I'm not sure what else needs to be done beyond the directions on that page.

2021-09-27 17:04:51.378+0000 WARN [a.s.Materializer] [outbound connection to [akka://cc-discovery-actor-system@10.3.0.239:5000], message stream] Upstream failed, cause: StreamTcpException: Tcp command [Connect(10.3.0.239:5000,None,List(),Some(10000 milliseconds),true)] failed because of java.net.ConnectException: Connection refused
2021-09-27 17:04:51.383+0000 ERROR [a.e.DummyClassForStringSources] Outbound message stream to [akka://cc-discovery-actor-system@10.3.0.239:5000] failed. Restarting it. Tcp command [Connect(10.3.0.239:5000,None,List(),Some(10000 milliseconds),true)] failed because of java.net.ConnectException: Connection refused
akka.stream.StreamTcpException: Tcp command [Connect(10.3.0.239:5000,None,List(),Some(10000 milliseconds),true)] failed because of java.net.ConnectException: Connection refused
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777) ~[?:?]
at akka.io.TcpOutgoingConnection$$anonfun$connecting$1.$anonfun$applyOrElse$3(TcpOutgoingConnection.scala:103) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.io.TcpOutgoingConnection.akka$io$TcpOutgoingConnection$$reportConnectFailure(TcpOutgoingConnection.scala:50) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.io.TcpOutgoingConnection$$anonfun$connecting$1.applyOrElse(TcpOutgoingConnection.scala:103) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.actor.Actor.aroundReceive(Actor.scala:539) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.actor.Actor.aroundReceive$(Actor.scala:537) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.io.TcpConnection.aroundReceive(TcpConnection.scala:31) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:610) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.actor.ActorCell.invoke(ActorCell.scala:579) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.dispatch.Mailbox.run(Mailbox.scala:229) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.dispatch.Mailbox.exec(Mailbox.scala:241) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) ~[akka-actor_2.12-2.5.22.jar:2.5.22]
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ~[akka-actor_2.12-2.5.22.jar:2.5.22]

I would appreciate any pointers on how to solve this.
Thanks
Alex Ough

I checked whether the core server has port 5000 open, and it doesn't appear to: the command "sudo netstat -tnlp | grep :5000" returns nothing.

Based on this, the core server is not listening on port 5000, which would explain the refused connection. Is there any other configuration I need to add beyond what is in the document?
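For anyone who wants to run the same check from the read replica side, here is a minimal Python sketch. The function name probe_tcp is made up for illustration and is not part of Neo4j. It separates "refused" (the host answered, but nothing accepted the connection on that port) from "timeout" (packets are typically being dropped somewhere, for example by a firewall rule).

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt (hypothetical helper).

    Returns 'open', 'refused', 'timeout', or 'error'.
    """
    try:
        # create_connection resolves the host name and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host was reached, but no process accepted on this port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError:
        # Anything else, such as a host name that does not resolve.
        return "error"
```

Calling probe_tcp("10.3.0.239", 5000) from the read replica would be the equivalent of the connection attempt in the log above. A "refused" result points at the core server itself rather than at the security group or network ACL.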

Thanks
Alex Ough

I don't believe that deployment is supported. You need a valid core cluster before you can add read replicas.

The minimum for a cluster is 2 core nodes, and you would need to set the following two configuration properties to do so:

causal_clustering.minimum_core_cluster_size_at_formation=2
causal_clustering.minimum_core_cluster_size_at_runtime=2

Be aware that since causal clustering uses quorum commits, both core nodes must be online, healthy, and able to communicate with each other in order to support writes. If one of the cores goes down, you have lost quorum and the offline node needs to be restored (you cannot spin up a new core node, and you cannot run neo4j-admin unbind on the offline node in your recovery steps). A cluster of 3 core nodes offers more resilience and easier recovery scenarios.
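The arithmetic behind that can be sketched in a few lines of Python, assuming "quorum" means a strict majority of the core members. The function names here are illustrative only.

```python
def quorum_size(core_count: int) -> int:
    """Smallest strict majority of core_count members."""
    if core_count < 1:
        raise ValueError("core_count must be at least 1")
    return core_count // 2 + 1


def tolerated_failures(core_count: int) -> int:
    """How many cores can be offline while a majority is still available."""
    return core_count - quorum_size(core_count)


for n in (2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# prints:
# 2 2 0
# 3 2 1
# 5 3 2
```

Under that assumption, 2 cores tolerate no failures at all, while 3 cores tolerate one, which is why 3 is the more comfortable minimum.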

Thanks Andrew,
Yes, it is supported based on the document in the link I sent.

After realizing there was a version mismatch, I tried with 4.3.2, and the core server now works with port 5000 open.
But the new problem is that the read replica reports this error.

ERROR [a.e.DummyClassForStringSources] Outbound message stream to [akka://cc-discovery-actor-system@node0.neo4j-3:5000] failed. Restarting it. Handshake with [akka://cc-discovery-actor-system@node0.neo4j-3:5000] did not complete within 30000 ms
akka.remote.artery.OutboundHandshake$HandshakeTimeoutException: Handshake with [akka://cc-discovery-actor-system@node0.neo4j-3:5000] did not complete within 30000 ms

It looks like it fails to communicate with the core server over port 5000.
This is strange, because when I tested the connection to that port from the read replica using 'telnet node0.neo4j-3 5000', it worked.

Any guidance on fixing this would be much appreciated.
Thanks
Alex Ough

My mistake, this replica-only deployment is new to 4.3.x and I somehow missed this.

You will need to make sure that causal_clustering.initial_discovery_members is set for the read replica, though I don't think you want to set this on the single core.

Also make sure your advertised and listen addresses are set properly on your core node (it should be listening on all interfaces). You can check the core node's logs to see if connection attempts are even coming through.

Thanks Andrew.
I followed the directions in the document, but it doesn't seem to work.
Do you think there is any other configuration I need to change in addition to what is in the document?
If so, can you share it?

Thanks
Alex Ough

I think I found the issue.
It looks like the 'dbms.default_listen_address' value on the core server should be '0.0.0.0'.
That may be implicitly understood, but it is not in the document, so it would help beginners if it were documented explicitly.

Thanks
Alex Ough

Yes, that sounds right. That should already be documented in the default configuration file:

# With default configuration Neo4j only accepts local connections.
# To accept non-local connections, uncomment this line:
#dbms.default_listen_address=0.0.0.0

Perhaps it could be more explicit that 0.0.0.0 binds to all interfaces and therefore accepts non-local connections.

Our documentation page for this has additional guidance:

Default network interface to listen for incoming connections. To listen for connections on all interfaces, use "0.0.0.0".
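As a quick sanity check before starting a core node, something like the following Python sketch could flag a listen address that only accepts local connections. It is not a Neo4j tool. The helper names are hypothetical, it only understands simple key=value lines, and it treats a missing setting as local-only based on the configuration file comment quoted above.

```python
LOOPBACK = {"localhost", "127.0.0.1", "::1"}


def parse_conf(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def listen_address_problem(text: str):
    """Return a warning string if the listen address looks local-only, else None."""
    value = parse_conf(text).get("dbms.default_listen_address")
    if value is None:
        return "dbms.default_listen_address is not set (commented out or missing)"
    if value in LOOPBACK:
        return f"dbms.default_listen_address={value} only accepts local connections"
    return None
```

Passing the contents of the core server's neo4j.conf to listen_address_problem would return a warning while the line is still commented out, and None once it is set to 0.0.0.0.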

Hi Andrew,
Thanks for the additional document.

Alex Ough