Union find deadlock

Hi,

I tried running the union find algorithm on my graph, but it seems to have hit a deadlock. It's been running for 12+ hours but the server shows no CPU usage or I/O.

I suspect the issue is related to the size of my graph. It has 650 million nodes, 2.5 billion relationships, and is about 200 GB on disk.

My graph model is very simple: there is only one node label and one relationship type.

How can I resolve this deadlock? It seems most threads are stuck here:

	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
	at org.neo4j.graphalgo.impl.ParallelUnionFindQueue.lambda$compute$1(ParallelUnionFindQueue.java:85)
	at org.neo4j.graphalgo.impl.ParallelUnionFindQueue$$Lambda$450/1742932504.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:122)
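
For readers not used to this kind of trace: LockSupport.park inside ArrayBlockingQueue.take means the worker threads are all blocked waiting for work items that never arrive. The following is not the library's actual implementation, just a minimal Java sketch (all names made up) of one general pattern that produces exactly these frames, assuming every worker starts by consuming from a shared bounded queue that nothing ever fills:

	import java.util.concurrent.ArrayBlockingQueue;
	import java.util.concurrent.BlockingQueue;
	import java.util.concurrent.ExecutorService;
	import java.util.concurrent.Executors;

	// Toy illustration only -- class and variable names are invented for this sketch.
	public class QueueStallDemo {

	    public static void main(String[] args) {
	        // A bounded queue shared by a fixed pool of workers.
	        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4);
	        ExecutorService pool = Executors.newFixedThreadPool(4);

	        for (int i = 0; i < 4; i++) {
	            pool.submit(() -> {
	                try {
	                    while (true) {
	                        // Every worker begins by waiting for an item. If nothing
	                        // ever seeds the queue (or the producing side has already
	                        // finished or failed), all workers park here forever --
	                        // the same frames as in the trace above:
	                        // LockSupport.park -> ConditionObject.await -> ArrayBlockingQueue.take
	                        Integer item = queue.take();
	                        // ... process item, possibly queue.put(...) follow-up work ...
	                    }
	                } catch (InterruptedException e) {
	                    Thread.currentThread().interrupt();
	                }
	            });
	        }
	        // No items are ever offered, so the process just sits idle:
	        // zero CPU and zero I/O, which matches the symptom described above.
	    }
	}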

Cypher queries used:

// Load the graph; this completes successfully
CALL algo.graph.load('my-graph','Label','REL_TYPE',{graph:'heavy'})
  YIELD name, graph, direction, undirected, sorted, nodes, loadMillis, alreadyLoaded,
        nodeWeight, relationshipWeight, nodeProperty, loadNodes, loadRelationships;

// This query hangs
CALL algo.unionFind('User', 'FRIEND', {graph:'my-graph',write:true, partitionProperty:"partition"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
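
As a side note, the loaded named graph can be inspected before running the algorithm. Assuming algo.graph.info is available in this version of the library (the exact columns it yields may differ between releases), something like this should show whether 'my-graph' is still registered:

// Hypothetical sanity check, assuming algo.graph.info exists in graph algorithms 3.5.3.3
CALL algo.graph.info('my-graph');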

It seems I cannot attach text files, but I have uploaded the Java stack trace and log files here:
Stack trace
neo4j.log
debug.log

Neo4j version: 3.5.3
Neo4j graph algorithms version: 3.5.3.3

Thanks,
Steven


Thanks a lot for your detailed information. We're looking into it this week.

I also created this GH issue to track it:

Cheers, Michael


My program has done the same thing; here are the details...
Data store size: 1.92 TB
Neo4j version: 3.5.3 Enterprise
Neo4j graph algorithms version: 3.5.3.3

8.9 billion nodes, 18 billion relationships

Despite having a pretty complex graph model, the basics are 9 node types and 1 relationship type. I don't know how to check the thread stack (see the note at the end of this post), but my debug.log shows the exact same message:

2019-03-31 20:08:07.606+0000 INFO [o.n.k.i.p.Procedures] [neo4j.BoltWorker-5 [bolt] [/127.0.0.1:43070] ] LOADING 100% Memory usage: 239 GiB
2019-03-31 20:08:10.675+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1962, gcTime=2000, gcCount=1}
2019-03-31 20:08:12.306+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1321, gcTime=1329, gcCount=1}
2019-03-31 20:08:14.111+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1197, gcTime=1256, gcCount=1}
2019-03-31 20:08:15.906+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1192, gcTime=1263, gcCount=1}
2019-03-31 20:08:17.744+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1236, gcTime=1235, gcCount=1}
2019-03-31 20:08:19.538+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=991, gcTime=1045, gcCount=1}
2019-03-31 20:10:03.847+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=103707, gcTime=0, gcCount=0}
2019-03-31 20:10:05.322+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=973, gcTime=966, gcCount=1}
2019-03-31 20:10:15.187+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=9765, gcTime=9845, gcCount=3}
2019-03-31 21:00:58.017+0000 INFO [o.n.k.i.p.Procedures] [algo-19] CC(HugeParallelUnionFindQueue) 32%

It has been at 32% for more than 7 hours now.

My Cypher query was executed in a tmux window in cypher-shell on an EC2 instance with 488 GB RAM and 64 CPUs. Heap space is set at 380g and the page cache at 60g. The command was:

CALL algo.unionFind(null, 'OBSERVED_WITH', {write:true, graph:'huge', partitionProperty:"partition"})
       YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
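
For clarity, "heap space at 380g and page cache at 60g" above corresponds to memory settings along these lines in neo4j.conf (values as used here, not a recommendation):

# neo4j.conf memory settings for this run
# (initial heap size assumed equal to the max; only "380g heap / 60g page cache" was stated above)
dbms.memory.heap.initial_size=380g
dbms.memory.heap.max_size=380g
dbms.memory.pagecache.size=60g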

Altogether the algorithm has been running for 25 hours now. It took about 18 hours to reach the same point in the debug.log that the OP was stuck at.
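
For reference, since I mentioned not knowing how to check the thread stack: a thread dump of the running Neo4j JVM can be captured with the JDK's jstack tool (assuming a JDK is installed on the machine), roughly like this:

# find the Neo4j JVM's process id, then dump all thread stacks to a file
jps -l
jstack <neo4j-pid> > threads.txt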

This is resolved in the current release:
