RW Lock problem with Kubernetes

Hi,

I'm trying to populate a database with randomly generated users along with the company they work at ,as part of testing a larger application. I've started off small with approximately 36 random companies that the 100 users could be a part of. However, I intend to generate a few hundred thousand users in the end.

I'm working in python using the py2neo package. Looking at the timings I have found that it takes an average for 2 seconds to generate and insert each user. Because of this, I decided to set up a Kubernetes cluster (2 raspberry pis and 2 old laptops) so I can have multiple instances running and hopefully speed up the process.

To run my program on Kubernetes I split it into two services:

  1. Usernumber - Passes out the number of a user for the other service to create. This manages how many users I want to create and ensures no two users have the same user number.
  2. Usergenerator - Retrieves the user number from Usernumber, creates the user and inserts it into the dataabse.

My hope was to be able to run 1 Usernumber services and multiple Usergenerator services. Unfortunately, I have found that every now and they some users aren't being added to the database. Upon inspection of the logs I found it was because of a RW Lock (I assume the lock is on the company node and that two users are trying to be linked to that company).

At first I looked at changing:
dbms.transaction.timeout
dbms.lock.acquisition.timeout

which had no effect, so I included retries in my code along with checks that the user had been added to the database.
The code I used for inserting a user into the database is as follows:

    def user(cls, user_object):
        push_attempts = 1
        while push_attempts <= cls.max_retries:
            try:
                logger.debug(f'Attempting to insert user number {user_object.UserNumber}')
                with neo4j.GraphTransaction(graph=NeoGraph.graph, timeout=cls.timeout) as tx:
                    tx.merge(user_object)
                assert NeoGraph.graph.exists(user_object), Exception(f"Claims user {user_object.UserNumber} has been added but has not")
                logger.info(f'Successfully inserted user number {user_object.UserNumber}')
                break
            except Exception as e:
                logger.warning(
                    f'Failed to insert user number {user_object.UserNumber} on attempt {push_attempts} - {e}',
                    exc_info=True)
                push_attempts += 1
                logger.debug('Sleeping for 1 second and retrying')
                time.sleep(1)
        else:
            logger.critical(f'Failed to insert user number {user_object.UserNumber}')
            sys.exit()

I included the sys.exit() so I could stop the program when 1 user fails so don't have to search through hundreds of logs.

This still didn't work so I looked closely at my logs and the database logs (database logging set to debug).

My logs are:

{"asctime": "2020-07-29 08:57:22,368", "name": "Database.DbHandler", "levelname": "WARNING", "filename": "DbHandler.py", "funcName": "user", "message": "Failed to insert user number 10 on attempt 1 - <py2neo.database.GraphTransaction object at 0xffff820c74c0>", "exc_info": "Traceback (most recent call last):\n  File \"/app/Database/DbHandler.py\", line 244, in user\n    tx.merge(user_object)\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/database.py\", line 1060, in merge\n    merge(self, primary_label, primary_key)\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/ogm/__init__.py\", line 462, in __db_merge__\n    related_objects.__db_push__(tx)\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/ogm/__init__.py\", line 293, in __db_push__\n    tx.merge(related_object)\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/database.py\", line 1060, in merge\n    merge(self, primary_label, primary_key)\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/ogm/__init__.py\", line 462, in __db_merge__\n    related_objects.__db_push__(tx)\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/ogm/__init__.py\", line 300, in __db_push__\n    tx.run(\"MATCH (a) WHERE id(a) = $x MATCH (b) WHERE id(b) = $y \"\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/database.py\", line 916, in run\n    result = self._connector.run_in_tx(self._transaction, cypher, parameters, hydrant)\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/client/__init__.py\", line 1007, in run_in_tx\n    cx.sync(result)  # TODO: avoid sync on every tx.run\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/client/bolt.py\", line 320, in sync\n    self._audit(result)\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/client/bolt.py\", line 418, in _audit\n    task.audit()\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/client/bolt.py\", line 613, in audit\n    item.audit()\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/client/bolt.py\", line 783, in audit\n    raise self._failure\npy2neo.database.TransientError: LockClient[1077] can't wait on resource RWLock[NODE(647), hash=316713160] since => LockClient[1077] <-[:HELD_BY]- RWLock[NODE(666), hash=1732213899] <-[:WAITING_FOR]- LockClient[1144] <-[:HELD_BY]- RWLock[NODE(647), hash=316713160]\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/app/Database/DbHandler.py\", line 244, in user\n    tx.merge(user_object)\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/database.py\", line 878, in __exit__\n    self.rollback()\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/database.py\", line 943, in rollback\n    self._assert_unfinished()\n  File \"/usr/local/lib/python3.8/site-packages/py2neo/database.py\", line 882, in _assert_unfinished\n    raise GraphTransactionError(self)\npy2neo.database.GraphTransactionError: <py2neo.database.GraphTransaction object at 0xffff820c74c0>"}
{"asctime": "2020-07-29 08:57:22,787", "name": "Database.DbHandler", "levelname": "CRITICAL", "filename": "DbHandler.py", "funcName": "user", "message": "Failed to insert user number 7"}
{"asctime": "2020-07-29 08:57:23,390", "name": "Database.DbHandler", "levelname": "WARNING", "filename": "DbHandler.py", "funcName": "user", "message": "Failed to insert user number 10 on attempt 2 - Claims user 10 has been added but has not", "exc_info": "Traceback (most recent call last):\n  File \"/app/Database/DbHandler.py\", line 245, in user\n    assert NeoGraph.graph.exists(user_object), Exception(f\"Claims user {user_object.UserNumber} \"\nAssertionError: Claims user 10 has been added but has not"}
{"asctime": "2020-07-29 08:57:24,405", "name": "Database.DbHandler", "levelname": "WARNING", "filename": "DbHandler.py", "funcName": "user", "message": "Failed to insert user number 10 on attempt 3 - Claims user 10 has been added but has not", "exc_info": "Traceback (most recent call last):\n  File \"/app/Database/DbHandler.py\", line 245, in user\n    assert NeoGraph.graph.exists(user_object), Exception(f\"Claims user {user_object.UserNumber} \"\nAssertionError: Claims user 10 has been added but has not"}
{"asctime": "2020-07-29 08:57:25,419", "name": "Database.DbHandler", "levelname": "WARNING", "filename": "DbHandler.py", "funcName": "user", "message": "Failed to insert user number 10 on attempt 4 - Claims user 10 has been added but has not", "exc_info": "Traceback (most recent call last):\n  File \"/app/Database/DbHandler.py\", line 245, in user\n    assert NeoGraph.graph.exists(user_object), Exception(f\"Claims user {user_object.UserNumber} \"\nAssertionError: Claims user 10 has been added but has not"}
{"asctime": "2020-07-29 08:57:26,503", "name": "Database.DbHandler", "levelname": "WARNING", "filename": "DbHandler.py", "funcName": "user", "message": "Failed to insert user number 10 on attempt 5 - Claims user 10 has been added but has not", "exc_info": "Traceback (most recent call last):\n  File \"/app/Database/DbHandler.py\", line 245, in user\n    assert NeoGraph.graph.exists(user_object), Exception(f\"Claims user {user_object.UserNumber} \"\nAssertionError: Claims user 10 has been added but has not"}
{"asctime": "2020-07-29 08:57:27,713", "name": "Database.DbHandler", "levelname": "CRITICAL", "filename": "DbHandler.py", "funcName": "user", "message": "Failed to insert user number 10"}

and the database logs for the same time are:

2020-07-29 08:57:19.351+0000 INFO [o.n.c.i.ExecutionEngine] [neo4j] Discarded stale query from the query cache after 10 seconds: MATCH (a) WHERE id(a) = $x MATCH (b) WHERE id(b) = $y MERGE (a)-[_:DEFINED_BY]->(b) SET _ = $z
2020-07-29 08:57:19.353+0000 DEBUG [o.n.c.i.CommunityCompilerFactory] [neo4j] Discarded stale plan from the plan cache after 10 seconds: MATCH (a) WHERE id(a) = $x MATCH (b) WHERE id(b) = $y MERGE (a)-[_:DEFINED_BY]->(b) SET _ = $z
2020-07-29 08:57:21.427+0000 INFO [o.n.c.i.ExecutionEngine] [neo4j] Discarded stale query from the query cache after 10 seconds: MATCH (a)-[_:HAS]->(b) WHERE id(a) = $x AND NOT id(b) IN $y DELETE _
2020-07-29 08:57:21.428+0000 DEBUG [o.n.c.i.CommunityCompilerFactory] [neo4j] Discarded stale plan from the plan cache after 10 seconds: MATCH (a)-[_:HAS]->(b) WHERE id(a) = $x AND NOT id(b) IN $y DELETE _
2020-07-29 08:57:21.469+0000 INFO [o.n.c.i.ExecutionEngine] [neo4j] Discarded stale query from the query cache after 10 seconds: MATCH (a) WHERE id(a) = $x MATCH (b) WHERE id(b) = $y MERGE (a)-[_:HAS]->(b) SET _ = $z
2020-07-29 08:57:21.470+0000 DEBUG [o.n.c.i.CommunityCompilerFactory] [neo4j] Discarded stale plan from the plan cache after 10 seconds: MATCH (a) WHERE id(a) = $x MATCH (b) WHERE id(b) = $y MERGE (a)-[_:HAS]->(b) SET _ = $z
2020-07-29 08:57:21.546+0000 INFO [o.n.c.i.ExecutionEngine] [neo4j] Discarded stale query from the query cache after 10 seconds: MATCH (a)-[_:WORKS_AT]->(b) WHERE id(a) = $x AND NOT id(b) IN $y DELETE _
2020-07-29 08:57:21.547+0000 DEBUG [o.n.c.i.CommunityCompilerFactory] [neo4j] Discarded stale plan from the plan cache after 10 seconds: MATCH (a)-[_:WORKS_AT]->(b) WHERE id(a) = $x AND NOT id(b) IN $y DELETE _
2020-07-29 08:57:21.593+0000 INFO [o.n.c.i.ExecutionEngine] [neo4j] Discarded stale query from the query cache after 10 seconds: MATCH (a) WHERE id(a) = $x MATCH (b) WHERE id(b) = $y MERGE (a)-[_:WORKS_AT]->(b) SET _ = $z
2020-07-29 08:57:21.594+0000 DEBUG [o.n.c.i.CommunityCompilerFactory] [neo4j] Discarded stale plan from the plan cache after 10 seconds: MATCH (a) WHERE id(a) = $x MATCH (b) WHERE id(b) = $y MERGE (a)-[_:WORKS_AT]->(b) SET _ = $z
2020-07-29 08:57:21.659+0000 INFO [o.n.c.i.ExecutionEngine] [neo4j] Discarded stale query from the query cache after 10 seconds: OPTIONAL MATCH (a) WHERE id(a) IN $x OPTIONAL MATCH ()-[r]->() WHERE id(r) IN $y RETURN count(DISTINCT a) + count(DISTINCT r)
2020-07-29 08:57:21.661+0000 DEBUG [o.n.c.i.CommunityCompilerFactory] [neo4j] Discarded stale plan from the plan cache after 10 seconds: OPTIONAL MATCH (a) WHERE id(a) IN $x OPTIONAL MATCH ()-[r]->() WHERE id(r) IN $y RETURN count(DISTINCT a) + count(DISTINCT r)
2020-07-29 08:57:28.305+0000 INFO [o.n.c.i.ExecutionEngine] [neo4j] Discarded stale query from the query cache after 20 seconds: UNWIND $x AS data MERGE (_:Position {Id:data[0]}) SET _:Position SET _ = data[1] RETURN id(_)
2020-07-29 08:57:28.306+0000 DEBUG [o.n.c.i.CommunityCompilerFactory] [neo4j] Discarded stale plan from the plan cache after 20 seconds: UNWIND $x AS data MERGE (_:Position {Id:data[0]}) SET _:Position SET _ = data[1] RETURN id(_)

One thing I found interesting was I often found the "can't wait on resource RW Lock" error occurs only on the first push attempt. After that, the program believes it has pushed the user to the database but upon checking finds that it isn't actually on the database. I confirmed this by querying the database directly and finding the list of user numbers doesn't contain, in this case, user number 10.