Hi,
I'm doing a POC which raised the following problem (couldn't find an answer in the forums):
I'm trying to import a CSV containing 10M relationships into a DB pre-populated with ~1.5M nodes, with appropriate indexes (or so I think).
The import rate starts off fine (~1K relationships per second) but quickly deteriorates. I estimate it will take weeks to import the dataset, yet I've read that people had no problem importing datasets of this size in less than an hour.
My setup:
Neo4j 3.5.1
CentOS 6 64-bit, running in a VM with 8GB RAM and 4 CPUs.
I can see that Neo4j consumes 100% CPU during the import (and about 2.4GB of RAM).
Here is an example query (that just counts the matches), limited to 150,000 lines:
PROFILE
LOAD CSV WITH HEADERS FROM "file:/home/Elad/testdir/neo/foo.csv" AS row
WITH row LIMIT 150000
MATCH (subnet:Subnet {subnetID: toInteger(row.SubnetID)})-[]-(device:Device {deviceID: toInteger(row.DeviceID)})
RETURN count(*);
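For what it's worth, I'd expect the Subnet and Device lookups in that query to be simple index seeks; a standalone lookup like the one below (illustrative ID) is what I have in mind:
PROFILE MATCH (subnet:Subnet {subnetID: 12345}) RETURN subnet;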
If I limit the query to 100,000 lines (66% of the 150,000), it runs 4 times faster. I don't understand why this happens; I thought the time should scale linearly with the number of lines.
I have unique indexes on :Subnet(subnetID) and :Device(deviceID).
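In case the exact definitions matter, the constraints were created with something roughly like this (a sketch of the 3.5 syntax, not necessarily verbatim what I ran):
CREATE CONSTRAINT ON (s:Subnet) ASSERT s.subnetID IS UNIQUE;
CREATE CONSTRAINT ON (d:Device) ASSERT d.deviceID IS UNIQUE;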
I'm attaching the output of the profiled query above.
Any input would be appreciated.
Edit 1:
Here is the actual import:
LOAD CSV WITH HEADERS FROM "file:/home/elad/testdir/neo/foo.csv" AS row
WITH row LIMIT 150000
MATCH (subnet:Subnet {subnetID: toInteger(row.SubnetID)})
MATCH (device:Device {deviceID: toInteger(row.DeviceID)})
MERGE (device)-[:CONNECTS_TO{low:toInteger(row.Low), high:toInteger(row.High), size:toInteger(row.Size)}]->(subnet);
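For context, foo.csv has the columns SubnetID, DeviceID, Low, High and Size. A couple of illustrative rows (made-up values; the real file has 10M data lines):
SubnetID,DeviceID,Low,High,Size
17,42,1024,2048,1024
17,43,4096,8192,4096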
I think the source of the problem is related to the first query (the one that just counts), as it clearly runs a lot slower when given only slightly more lines. I assume it is processed line by line, so I expected the time to depend linearly on the number of lines, but it clearly does not.