Core server cannot rejoin Causal Cluster after unclean shutdown in Kubernetes

In a Kubernetes cluster on AWS EKS I am deploying a five-core Causal Cluster via a handful of manifests (very similar to the Helm package) plus two initContainers: one that installs the APOC plugin, and a second that runs @david_allen's restore script against a backup taken from a single instance in another environment.
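For context, the relevant piece of the StatefulSet pod spec looks roughly like this (images, versions, volume names, and script paths below are illustrative rather than my exact manifests):

initContainers:
  - name: install-apoc
    image: curlimages/curl            # any small image with curl; APOC version here is just an example
    command: ["sh", "-c"]
    args:
      - >-
        curl -L -o /plugins/apoc.jar
        https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/3.5.0.4/apoc-3.5.0.4-all.jar
    volumeMounts:
      - name: plugins
        mountPath: /plugins
  - name: restore-from-backup
    image: neo4j:3.5.2-enterprise
    command: ["/scripts/restore.sh"]  # modified copy of the community restore script
    volumeMounts:
      - name: datadir
        mountPath: /data
      - name: restore-scripts
        mountPath: /scripts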

When one of the five cores gets evicted from a node (which I'm assuming counts as an "unclean shutdown" as discussed here), it fails to rejoin the cluster with the following error:

2019-05-31 15:52:50.304+0000 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@6e02721d' was successfully initialized, but failed to start. Please see the attached cause exception "Unable to find transaction 195492 in any of my logical logs". Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@6e02721d' was successfully initialized, but failed to start. Please see the attached cause exception "Unable to find transaction 195492 in any of my logical logs".

I've tried many things and am out of ideas, other than perhaps finding a way to ensure that a core/pod never shuts down uncleanly, but I have no idea how to do that.

Any thoughts/ideas on this would be greatly appreciated.

You might want to check why it's getting evicted and post some logs on that. Getting this node to rejoin the cluster is one thing, but without knowing the underlying cause, fixing the immediate issue may or may not help.

I hope others will join in with other ideas -- but one thought that comes to mind is that you can restart the pod with an empty volume. If /data is empty, the core will join the cluster and then pull fresh state from the other core members. Before doing this I'd make absolutely sure the other cluster members are in good shape, as you don't want to blow away any data... but starting one cold should get it back into the fold.
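Roughly, the recovery flow would be something like this (the PVC name follows the usual <claimTemplate>-<pod> pattern, so adjust to match your volumeClaimTemplates):

# 1. Make sure the remaining cores are healthy before touching anything
kubectl exec neo4j-0 -- cypher-shell -u neo4j -p <password> "CALL dbms.cluster.overview();"

# 2. Delete the PVC backing the broken member (name assumed from your volumeClaimTemplates)
kubectl delete pvc datadir-neo4j-4

# 3. Delete the pod; the StatefulSet recreates it with a fresh, empty volume,
#    and the new core pulls its store from the rest of the cluster
kubectl delete pod neo4j-4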

Thanks for your reply @david_allen.

Agree completely about avoiding the problem to begin with.

As for bringing the uncleanly shut down/evicted core back up so that it can self-heal and rejoin the cluster, I do like your idea. However, implementing it would be complicated by the fact that I'm running a modified version of your restore script as an initContainer, so I'd have to condition that initContainer on whether the pod is coming up fresh or is recovering from an unclean shutdown.
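Roughly, I imagine I'd need a guard like this at the top of the restore script (purely a sketch; the path and skip condition are assumptions about my own setup):

#!/bin/sh
# Hypothetical guard for the restore initContainer: only seed from backup
# when the data volume has no existing store, so a core that is merely
# restarting keeps its state instead of being re-seeded from file.
if [ -d /data/databases/graph.db ]; then
  echo "Existing store found in /data, skipping restore"
  exit 0
fi

# otherwise continue into the existing restore-from-backup logic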

Assuming your suggestion is the best path forward, it seems like I'd have to redesign my setup so that there is no restore-from-file initContainer and instead seed the cluster some other way when it is first set up. That way, if a core goes down, it would just catch up with the others in the way you suggest. To your mind, what would be the best way to seed the original setup of the cluster with data from, say, a backup from another environment stored in S3?

I don't recommend letting your core members catch up from cold every time, because as your history and data volumes grow, replicating all of that state will become more and more taxing on the rest of your cluster. Using backup and restore is, in my view, the way to go.

If you're restoring from a good backup (ideally one where consistency checking was enabled when the backup was taken), I don't see why the data you restored should ever end up in this bad state, unless there's some other intervening factor you haven't considered.

To make the pipeline work, you can always take a trivial backup of an essentially empty database to "seed your restore process", and thereby establish the pattern that you always restore from a backup, no matter what.
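As a sketch of what that seeding step could look like (the source address, backup name, and bucket are placeholders, not anything specific to your setup):

# Take a backup of the (near-empty) source instance; the consistency check runs as part of it
neo4j-admin backup --from=localhost:6362 --backup-dir=/backups --name=seed \
  --fallback-to-full=true --check-consistency=true

# Push it somewhere durable (e.g. S3) so the restore initContainer always has something to restore from
aws s3 sync /backups/seed s3://my-neo4j-backups/seed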

You might want to have a look at the consistency checker to ensure that, prior to taking a backup, the store is in a known-good state.
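For example, you can run it against the backup set itself before it is ever used as a restore source (path assumed from wherever your backups land):

# Verify the backup is in a known-good state before treating it as a restore source
neo4j-admin check-consistency --backup=/backups/seed --report-dir=/backups/reports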

Thanks for this suggestion. Running the consistency checker on my backup, in the instance that just ran the restore initContainer, checks out fine:

2019-06-04 20:16:35.696+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Selected RecordFormat:StandardV3_4[v0.A.9] record format from store /data/backupset/data/knowledge-graph-production-backup-2019-06-04
2019-06-04 20:16:35.779+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Format not configured. Selected format from the store: RecordFormat:StandardV3_4[v0.A.9]
....................  10%
....................  20%
....................  30%
....................  40%
....................  50%
....................  60%
....................  70%
....................  80%
.........Checking node and relationship counts
....................  10%
....................  20%
....................  30%
....................  40%
....................  50%
....................  60%
....................  70%
....................  80%
....................  90%
.................... 100%

Also, interestingly, if an edit is made to the StatefulSet and the pods are updated, they fail to come back up. This makes me think the problem isn't necessarily unclean shutdowns.

Adding these here in case they help anyone else hitting this issue. They are the logs of the core starting, but only when it is restarting/updating:

~$ kubectl logs neo4j-4
Changed password for user 'neo4j'.
Active database: graph.db
Directories in use:
  home:         /var/lib/neo4j
  config:       /var/lib/neo4j/conf
  logs:         /var/lib/neo4j/logs
  plugins:      /plugins
  import:       /var/lib/neo4j/import
  data:         /var/lib/neo4j/data
  certificates: /var/lib/neo4j/certificates
  run:          /var/lib/neo4j/run
Starting Neo4j.
2019-06-04 20:47:34.711+0000 WARN  ha.host.data is deprecated.
2019-06-04 20:47:34.809+0000 INFO  ======== Neo4j 3.5.2 ========
2019-06-04 20:47:34.815+0000 INFO  Starting...
2019-06-04 20:47:41.601+0000 INFO  Initiating metrics...
2019-06-04 20:47:41.802+0000 INFO  Resolved initial host 'neo4j.default.svc.cluster.local:5000' to [192.168.237.22:5000, 192.168.74.234:5000, 192.168.219.216:5000, 192.168.71.174:5000]
2019-06-04 20:47:41.999+0000 INFO  My connection info: [
	Discovery:   listen=0.0.0.0:5000, advertised=neo4j-4.neo4j.default.svc.cluster.local:5000,
	Transaction: listen=0.0.0.0:6000, advertised=neo4j-4.neo4j.default.svc.cluster.local:6000, 
	Raft:        listen=0.0.0.0:7000, advertised=neo4j-4.neo4j.default.svc.cluster.local:7000, 
	Client Connector Addresses: bolt://neo4j-4.neo4j.default.svc.cluster.local:7687,http://neo4j-4.neo4j.default.svc.cluster.local:7474,https://neo4j-4.neo4j.default.svc.cluster.local:7473
]
2019-06-04 20:47:41.999+0000 INFO  Discovering other core members in initial members set: [192.168.219.216:5000, 192.168.237.22:5000, 192.168.71.174:5000, 192.168.74.234:5000]
2019-06-04 20:47:53.997+0000 INFO  Bound to cluster with id fdfbe6d3-88f5-4d11-a0df-0bbbd2c4b500
2019-06-04 20:47:54.098+0000 INFO  Discovered core member at 192.168.237.22:5000
2019-06-04 20:47:59.801+0000 INFO  started background thread for async index updates
2019-06-04 20:47:59.908+0000 INFO  Deprecated index providers in use: lucene+native-2.0 (26 indexes). Use procedure 'db.indexes()' to see what indexes use which index provider.
2019-06-04 20:48:06.106+0000 DEBUG shutdown in progress. Aborting index tracking thread.
2019-06-04 20:48:06.107+0000 DEBUG final commit in background thread
2019-06-04 20:48:06.107+0000 INFO  stopping background thread for async index updates
2019-06-04 20:48:06.826+0000 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@27dc627a' was successfully initialized, but failed to start. Please see the attached cause exception "Unable to find transaction 178335 in any of my logical logs". Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@27dc627a' was successfully initialized, but failed to start. Please see the attached cause exception "Unable to find transaction 178335 in any of my logical logs".
org.neo4j.server.ServerStartupException: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@27dc627a' was successfully initialized, but failed to start. Please see the attached cause exception "Unable to find transaction 178335 in any of my logical logs".
	at org.neo4j.server.exception.ServerStartupErrors.translateToServerStartupError(ServerStartupErrors.java:45)
	at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:184)
	at org.neo4j.server.ServerBootstrapper.start(ServerBootstrapper.java:123)
	at org.neo4j.server.ServerBootstrapper.start(ServerBootstrapper.java:90)
	at com.neo4j.server.enterprise.CommercialEntryPoint.main(CommercialEntryPoint.java:22)
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.server.database.LifecycleManagingDatabase@27dc627a' was successfully initialized, but failed to start. Please see the attached cause exception "Unable to find transaction 178335 in any of my logical logs".
	at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:473)
	at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:111)
	at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:177)
	... 3 more
Caused by: java.lang.RuntimeException: Error starting org.neo4j.graphdb.facade.GraphDatabaseFacadeFactory, /data/databases
	at org.neo4j.graphdb.facade.GraphDatabaseFacadeFactory.initFacade(GraphDatabaseFacadeFactory.java:216)
	at com.neo4j.causalclustering.core.CommercialCoreGraphDatabase.<init>(CommercialCoreGraphDatabase.java:28)
	at com.neo4j.server.database.CommercialGraphFactory.newGraphDatabase(CommercialGraphFactory.java:36)
	at org.neo4j.server.database.LifecycleManagingDatabase.start(LifecycleManagingDatabase.java:78)
	at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:452)
	... 5 more
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.causalclustering.core.state.CoreLife@45e617c4' was successfully initialized, but failed to start. Please see the attached cause exception "Unable to find transaction 178335 in any of my logical logs".
	at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:473)
	at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:111)
	at org.neo4j.graphdb.facade.GraphDatabaseFacadeFactory.initFacade(GraphDatabaseFacadeFactory.java:211)
	... 9 more
Caused by: java.lang.RuntimeException: org.neo4j.kernel.impl.transaction.log.NoSuchTransactionException: Unable to find transaction 178335 in any of my logical logs
	at org.neo4j.causalclustering.core.state.machines.tx.LastCommittedIndexFinder.getLastCommittedIndex(LastCommittedIndexFinder.java:52)
	at org.neo4j.causalclustering.core.state.machines.tx.RecoverConsensusLogIndex.findLastAppliedIndex(RecoverConsensusLogIndex.java:38)
	at org.neo4j.causalclustering.core.state.machines.CoreStateMachines.installCommitProcess(CoreStateMachines.java:122)
	at org.neo4j.causalclustering.core.state.CoreLife.start0(CoreLife.java:81)
	at org.neo4j.kernel.lifecycle.SafeLifecycle.transition(SafeLifecycle.java:124)
	at org.neo4j.kernel.lifecycle.SafeLifecycle.start(SafeLifecycle.java:138)
	at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:452)
	... 11 more
Caused by: org.neo4j.kernel.impl.transaction.log.NoSuchTransactionException: Unable to find transaction 178335 in any of my logical logs
	at org.neo4j.kernel.impl.transaction.log.PhysicalLogicalTransactionStore$TransactionPositionLocator.getAndCacheFoundLogPosition(PhysicalLogicalTransactionStore.java:208)
	at org.neo4j.kernel.impl.transaction.log.PhysicalLogicalTransactionStore.getTransactions(PhysicalLogicalTransactionStore.java:110)
	at org.neo4j.causalclustering.core.state.machines.tx.LastCommittedIndexFinder.getLastCommittedIndex(LastCommittedIndexFinder.java:43)
	... 17 more
2019-06-04 20:48:06.905+0000 INFO  Neo4j Server shutdown initiated by request