I am running an A/B test to measure the throughput of node creation in Neo4j, and I find that throughput decreases significantly as the number of properties per node increases.
Setup: Neo4j cluster 3.5.7 (3 core instances: one leader and two followers). I tried the same experiment on a single instance as well and observed the same behavior, but all the results below were run against the 3-node cluster.
TestA: measures the throughput of creating nodes where each node has 20 properties.
TestB: measures the throughput of creating nodes where each node has 40 properties.
Result: throughput for TestB ≈ 1/2 × throughput for TestA.
Below is the code I used to generate the load and measure the throughput.
import org.neo4j.driver.v1.*;

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class UnwindCreateNodes {

    private static final int BATCH_SIZE = 5000;

    Driver driver;
    static int start;
    static int end;

    public UnwindCreateNodes(String uri, String user, String password) {
        Config config = Config.build()
                .withConnectionTimeout(10, TimeUnit.SECONDS)
                .toConfig();
        driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password), config);
    }

    private void addNodes() {
        List<Map<String, Object>> listOfProperties = new ArrayList<>();
        for (int inner = start; inner < end; inner++) {
            Map<String, Object> properties = new HashMap<>();
            properties.put("name", "Jhon " + inner);
            properties.put("last", "Alan" + inner);
            properties.put("id", 2 + inner);
            properties.put("key", "1234" + inner);
            // field5 .. field40: identical filler strings; stop at 20 for TestA, 40 for TestB
            for (int field = 5; field <= 40; field++) {
                properties.put("field" + field, "kfhc iahf uheguehuguaeghuszjxcb sd");
            }
            listOfProperties.add(properties);
        }

        String query = "UNWIND $events AS event CREATE (a:Label) SET a += event";
        int noOfNodes = 0;
        // round up so a final partial batch is not silently dropped
        int batches = (listOfProperties.size() + BATCH_SIZE - 1) / BATCH_SIZE;
        for (int i = 0; i < batches; i++) {
            List<Map<String, Object>> events = new ArrayList<>();
            for (; noOfNodes < (i + 1) * BATCH_SIZE && noOfNodes < listOfProperties.size(); noOfNodes++) {
                events.add(listOfProperties.get(noOfNodes));
            }

            Map<String, Object> params = new HashMap<>();
            params.put("events", events);

            Instant startTime = Instant.now();
            try (Session session = driver.session()) {
                session.writeTransaction(tx -> tx.run(query, params));
            }
            long timeElapsed = Duration.between(startTime, Instant.now()).toMillis();

            System.out.println("######################--timeElapsed NODES--############################");
            System.out.println("no of nodes per batch " + events.size());
            System.out.println(timeElapsed);
            System.out.println("############################--NODES--############################");
        }
    }

    public void close() {
        driver.close();
    }

    public static void main(String... args) {
        start = 200001;
        end = 400001;
        if (args.length == 2) {
            start = Integer.parseInt(args[0]);
            end = Integer.parseInt(args[1]);
        }
        UnwindCreateNodes unwindCreateNodes = new UnwindCreateNodes("bolt+routing://x.x.x.x:7687", "neo4j", "neo4j");
        unwindCreateNodes.addNodes();
        unwindCreateNodes.close();
    }
}
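As an aside, the chunking logic in addNodes can be written more directly with List.subList. This is only a standalone sketch of that partitioning (with integers standing in for the property maps, and BatchSketch being a hypothetical helper class, not part of my benchmark):

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch: split a list into fixed-size batches, including any
// final partial batch, using List.subList instead of a manual counter.
public class BatchSketch {
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // copy so each batch is independent of the backing list
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 12; i++) data.add(i);
        List<List<Integer>> batches = partition(data, 5);
        System.out.println(batches.size());        // 3 batches: 5 + 5 + 2
        System.out.println(batches.get(2).size()); // last batch keeps the remainder
    }
}
```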
Below is the graph of the results.
It takes 3.5 seconds to insert 5000 nodes when each node has 40 properties, and 1.8 seconds when each node has 20 properties.
This is a significant slowdown, and 40 is not a large number of properties. My requirement goes up to 100 properties per node; if it does not scale to 40, I am not sure how it will scale to 100.
Other approaches I tried include apoc.periodic.iterate, with and without UNWIND, and plain CREATE statements, but the behavior persists.
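For reference, the apoc.periodic.iterate variant I tried looked roughly like this (the batchSize value is illustrative, and passing driver parameters through the params config option depends on the APOC version):

```
CALL apoc.periodic.iterate(
  "UNWIND $events AS event RETURN event",
  "CREATE (a:Label) SET a += event",
  {batchSize: 1000, parallel: false, params: {events: $events}})
```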
I don't want to store the properties in an external store such as an RDBMS, because that complicates things for me: I am building a generic application and have no idea in advance which properties will be used.
I cannot use the CSV import tool either, because my data comes from Kafka and is not structured the way the tool expects, so that is not an option for me.
Any ideas on how to speed this up?