We recently migrated Neo4j from AWS to Azure by backing up our AWS instances and restoring them on Azure.
While the database spun up correctly, we noticed that our queries are taking much longer to complete. We now have 1000+ queries that have been stuck in the running state for over 12 hours. These are queries that used to take a few hundred ms before.
Do we need to do something else to restore the database to its original speed? (The machine types are equivalent, and the disk is slightly slower.)
Neo4j CPU usage: 50% (32 cores)
IOPS usage: 0.01% - 3.15%
Debug.log contains a bunch of messages like this:
2019-05-25 06:43:17.059+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0xf793ca34, L:/10.0.2.6:7687 - R:/54.251.151.37:53032] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
at io.netty.channel.unix.Errors.newIOException(Errors.java:122)
at io.netty.channel.unix.Errors.ioResult(Errors.java:146)
at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:178)
at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:348)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:794)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:404)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at java.lang.Thread.run(Thread.java:748)
Not sure how, but you've got a lock contention issue going on: all of your queries are waiting on locks, and either this isn't a true deadlock state or the deadlock detection isn't working, so you're stuck.
You'll either need to kill the queries or restart the db.
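If you go the kill route rather than a restart, you can find the stuck query ids and terminate them from the browser or cypher-shell. Roughly something like this; the query id at the end is just a placeholder, and the exact YIELD columns can vary a bit across 3.4.x patch releases:

// list running queries, longest-running first
CALL dbms.listQueries() YIELD queryId, query, status, elapsedTimeMillis
RETURN queryId, query, status, elapsedTimeMillis
ORDER BY elapsedTimeMillis DESC;

// terminate a specific stuck query by its id (placeholder id)
CALL dbms.killQuery('query-1234');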
You said you're on Neo4j 3.4. Is that 3.4.0? If so, get off that .0 release: .0 releases contain brand-new features and thus the highest potential for bugs on their first introduction. When you can, please upgrade to the highest patch release for your minor version, in this case 3.4.14. That at least removes the possibility that you're being affected by a bug that has since been fixed along the 3.4.x line.
We are on version 3.4.1, but we will update to 3.4.14.
I learned from one of my colleagues that when we moved from AWS to Azure, we started the database directly on the backup files and did not use the neo4j-admin restore tool (according to them, it was taking too long).
What are the implications of not using the restore tool? Could that have caused the error?
We typically recommend neo4j-admin restore because it does several housekeeping tasks for you that, among other things, may prevent some forms of corruption later (if used in a cluster).
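For reference, the restore itself is run against a stopped database; the backup path and database name below are placeholders for your own setup:

neo4j-admin restore --from=/backups/graph.db-backup --database=graph.db --force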
You may want to double-check :schema from the browser to ensure all your expected indexes are present and online. You can use dbms.listQueries() and dbms.listTransactions() to see what's executing; those also show each query's state.
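For example, something like this shows which transactions are holding the most locks (column names can differ slightly between 3.4 patch releases):

// transactions ordered by how many locks they currently hold
CALL dbms.listTransactions() YIELD transactionId, currentQuery, status, activeLockCount, elapsedTimeMillis
RETURN transactionId, currentQuery, status, activeLockCount, elapsedTimeMillis
ORDER BY activeLockCount DESC;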
You can also use the following query to get locking information via JMX beans:
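(This assumes the dbms.queryJmx procedure available in 3.x; the exact bean name can vary by version, so you can list all beans first and then pull the lock manager one.)

// list all Neo4j MBeans to confirm the exact bean name on your install
CALL dbms.queryJmx('org.neo4j:*') YIELD name RETURN name;

// then query the lock manager bean for its attributes
CALL dbms.queryJmx('org.neo4j:instance=kernel#0,name=Locking') YIELD name, attributes
RETURN name, attributes;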
Also, double-check the file permissions within your graph.db directory; I believe the read and write permissions should belong to the user that runs Neo4j.
Verified that all constraints and indexes are online
Upgrade to 3.4.14 was successful
Set a lockAcquisitionTimeout on the driver to see which queries are unable to acquire locks
None of the above worked, and now we have thousands of queries failing because they couldn't acquire a lock within 30s.
We are going to rebuild the indexes and see if that might help. What else can we do?
Update: We can't rebuild indexes either, as setting a new label on 3k nodes is taking over 3 minutes, and we have more than 1B nodes that need migration. Going to drop that idea. Our last resort is moving to a new machine type and OS.
Update 2: Weirdly, the queries that are running are not always shown in dbms.listQueries(). For example, one of the queries we executed in the browser reported that it took 10s to complete, but the result only came back after 100s. The query appeared in the output of dbms.listQueries() for those first 10s, but not afterwards. Neo4j thinks the query is done, but in reality it is still executing. Subsequent queries seem to be waiting for the locks held by these phantom queries to be released.
Hard to say without more info. There are some other tools I provided above for checking lock health; try those out.
You can EXPLAIN some of the queries that are running into contention issues; that's a good sanity check to ensure they're using index lookups where expected.
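For example, with a hypothetical lookup (the label, property, and relationship type are placeholders):

EXPLAIN
MATCH (u:User {email: $email})-[:FOLLOWS]->(other)
RETURN other.name;
// in the plan you want to see NodeIndexSeek on :User(email) rather than NodeByLabelScan or AllNodesScan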
You may also want to set dbms.transaction.timeout to the same value as your lock acquisition timeout.
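You can double-check what both timeouts are currently set to with dbms.listConfig, which takes a search string:

// show all config settings whose names contain "timeout"
CALL dbms.listConfig('timeout') YIELD name, value
RETURN name, value;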
As for adding new labels to many nodes, or otherwise processing more than ~50k-100k nodes/rels at once, you should batch the work, usually with apoc.periodic.iterate() from APOC Procedures.
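For example, something like this adds a label in batches of 10k; the Entity and Migrated labels are placeholders for your own model:

// outer query streams the nodes, inner query is applied per batch of 10k
CALL apoc.periodic.iterate(
  'MATCH (n:Entity) WHERE NOT n:Migrated RETURN n',
  'SET n:Migrated',
  {batchSize: 10000, parallel: false}
);

Keeping parallel: false here means the batches won't contend with each other for locks, which matters given the lock problems you're already seeing.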