I have a 3-node cluster running in Kubernetes (GKE), deployed with the official Helm charts. Everything works very well, but from time to time I run into issues after restarting one of the nodes.
There are 10 graphs in the cluster, all of them relatively small, just a few thousand nodes each. What I noticed is that it usually takes 1-2 minutes for a restarted node to come back up and rejoin the cluster.
But sometimes it gets stuck waiting for a snapshot:
[c.n.c.c.s.CoreSnapshotService] [riotrecommendations6e707beadc635df065390062291002a5/bb88d792] Waiting for another raft group member to publish a core state snapshot
[c.n.c.c.s.CoreSnapshotService] [riotrecommendations86111ba0b10ed2d1e895820b6fe6b790/cb2eefac] Waiting for another raft group member to publish a core state snapshot
Sometimes it resolves itself within a few minutes; sometimes it stays stuck for hours, or until I restart the node again, which usually helps.
I'm mostly using the default configuration from the Helm chart.
Any idea where to look for potential logs/issues?
A possibly related side effect I observed: sometimes 8 out of 10 graphs are synced and ready, but the pod is still not reachable through its Service because the readinessProbe fails (it waits for port 7687 to open). So if the restarted node is the LEADER for one of the already-ready graphs, it stays unroutable within the Kubernetes network while the pod is not Ready, which obviously causes trouble for clients trying to write to the database.
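For reference, the readiness check in question is a plain TCP probe on the Bolt port, so the whole pod stays unready until 7687 opens, regardless of how many individual graphs are already caught up. A minimal sketch of such a probe is below; the timing values are illustrative assumptions, not necessarily the chart's defaults:

```yaml
# Sketch of a Kubernetes readinessProbe on the Bolt port (7687).
# Timing values are assumptions for illustration; check the chart's values
# for what is actually deployed.
readinessProbe:
  tcpSocket:
    port: 7687           # Bolt; the probe only checks that the port accepts TCP connections
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 30   # tolerate a long store-copy/snapshot wait before marking the pod unready
```

Loosening `failureThreshold`/`periodSeconds` only delays the unready verdict; it doesn't address the underlying snapshot wait, but it can reduce how often a slow-but-healthy restart gets cut off from client traffic.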