I have a large amount of Spark output data that I'm attempting to load via LOAD CSV, and everything appears to go well until a checkpoint is hit, at which point server performance not only tanks, it never recovers.
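For context, each part file is loaded with a Cypher statement along these lines (the file name, label, and properties here are illustrative, not my exact query):

// Illustrative only: one statement per Spark part file, committing in batches
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///part-00000.csv' AS row
MERGE (a:Node {id: row.src_id})
MERGE (b:Node {id: row.dst_id})
MERGE (a)-[:RELATED_TO]->(b);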
Here's some example log output, both from my app (which is loading the part files generated by Spark) and from Neo4j's debug.log file:
11:59:15 DEBUG Uploading 200 part files...
11:59:20 DEBUG ...46307 nodes & 185228 relationships created (1/200)
11:59:25 DEBUG ...46183 nodes & 184732 relationships created (2/200)
11:59:30 DEBUG ...46363 nodes & 185452 relationships created (3/200)
11:59:35 DEBUG ...46642 nodes & 186568 relationships created (4/200)
11:59:39 DEBUG ...46632 nodes & 186528 relationships created (5/200)
<.. snipped another 75 part files, all similar ..>
12:05:19 DEBUG ...46360 nodes & 185440 relationships created (80/200)
12:05:24 DEBUG ...46662 nodes & 186648 relationships created (81/200)
12:05:29 DEBUG ...46623 nodes & 186492 relationships created (82/200)
12:05:33 DEBUG ...46525 nodes & 186100 relationships created (83/200)
12:05:38 DEBUG ...47153 nodes & 188612 relationships created (84/200)
12:05:42 DEBUG ...46397 nodes & 185588 relationships created (85/200)
<.. Notice the time jump at this point and from here on ..>
12:10:53 DEBUG ...46599 nodes & 186396 relationships created (86/200)
12:11:23 DEBUG ...46255 nodes & 185020 relationships created (87/200)
12:11:52 DEBUG ...46752 nodes & 187004 relationships created (88/200)
12:12:24 DEBUG ...46622 nodes & 186488 relationships created (89/200)
12:13:00 DEBUG ...46863 nodes & 187452 relationships created (90/200)
12:13:26 DEBUG ...46954 nodes & 187816 relationships created (91/200)
12:13:54 DEBUG ...46624 nodes & 186492 relationships created (92/200)
12:14:22 DEBUG ...47058 nodes & 188232 relationships created (93/200)
12:14:48 DEBUG ...46600 nodes & 186400 relationships created (94/200)
12:15:16 DEBUG ...46621 nodes & 186484 relationships created (95/200)
12:15:51 DEBUG ...46261 nodes & 185040 relationships created (96/200)
12:16:19 DEBUG ...46848 nodes & 187392 relationships created (97/200)
12:16:53 DEBUG ...46785 nodes & 187140 relationships created (98/200)
12:17:43 DEBUG ...46810 nodes & 187236 relationships created (99/200)
<.. Performance continues to get worse and worse ..>
And here's the debug log (note: its timestamps are in UTC, 6 hours ahead of my app's local time):
2019-02-26 18:04:16.625+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=116, gcTime=112, gcCount=1}
2019-02-26 18:04:44.788+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=104, gcTime=115, gcCount=1}
2019-02-26 18:04:48.809+0000 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/home/ec2-user/efs/neo4j/dev/data/databases/graph.db/neostore.transaction.db.3252] version=3252, last transaction in previous log=664880
2019-02-26 18:05:07.448+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=122, gcTime=127, gcCount=1}
2019-02-26 18:05:21.127+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=111, gcTime=113, gcCount=1}
2019-02-26 18:05:25.886+0000 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/home/ec2-user/efs/neo4j/dev/data/databases/graph.db/neostore.transaction.db.3253] version=3253, last transaction in previous log=664888
2019-02-26 18:05:52.986+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 664891 checkpoint started...
2019-02-26 18:05:53.448+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 664891 to [/home/ec2-user/efs/neo4j/dev/data/databases/graph.db/neostore.counts.db.b], from [/home/ec2-user/efs/neo4j/dev/data/databases/graph.db/neostore.counts.db.a].
2019-02-26 18:10:45.076+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 664891 checkpoint completed in 4m 52s 89ms
2019-02-26 18:10:47.428+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] Pruned log versions 2857-2863, last checkpoint was made in version 3253
2019-02-26 18:12:56.884+0000 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/home/ec2-user/efs/neo4j/dev/data/databases/graph.db/neostore.transaction.db.3254] version=3254, last transaction in previous log=664896
Note that the checkpoint kicks in at 18:05:52 in the debug.log file, and at the same time in my output the part files go from taking 4-5 seconds each to 5 minutes (!!) for the next one (okay, I can chalk that up to the checkpoint itself), but from then on they take 20, 30, 40+ seconds per part file. If I let it keep running in the hope that performance will eventually improve, it absolutely does not (I've let it go for a solid 8 hours and it never finished all 200 part files). This is horrid, given that the first 85 part files took only ~6 minutes to upload.
In addition, any other work thrown at the graph (even simple queries) can take a long time to return, and I end up having to stop and restart the server to get things back to something reasonable.
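(By "simple" I mean single-node lookups roughly like the following; the label and property are just placeholders.)

// Hypothetical example of the kind of query that slows to a crawl after the checkpoint
MATCH (n:Node {id: '12345'})
RETURN n;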
This is Neo4j 3.5.1 Community, running on an EC2 instance with 16 cores and 64 GB of RAM.
One final note: before the checkpoint hits, htop shows all 16 cores being regularly utilized. After the checkpoint, all concurrency appears to grind to a halt.
What is going on here? Is this a problem with Neo4j, or is there some configuration setting I should be looking at?
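For reference, these are the checkpoint- and memory-related settings from the 3.5 configuration docs that look relevant to me, though I don't know which (if any) are the right lever; the values below are only examples, not what I'm currently running:

# Candidate neo4j.conf settings (example values, not recommendations)
# Heap and page cache sizing
dbms.memory.heap.initial_size=16g
dbms.memory.heap.max_size=16g
dbms.memory.pagecache.size=32g
# How often checkpoints are triggered, and how hard they hit the disk
dbms.checkpoint.interval.time=15m
dbms.checkpoint.interval.tx=100000
dbms.checkpoint.iops.limit=300
# Transaction log rotation/pruning
dbms.tx_log.rotation.retention_policy=1 days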
Thanks for any insights.