Throughput for creation of nodes in Neo4j 3.5.7 decreases significantly with the number of properties

I am doing A/B testing to measure the throughput of node creation in Neo4j, and I find that the throughput decreases significantly as the number of properties increases.

Setup: Neo4j cluster 3.5.7 (3 core instances, where one is the leader and the other two are followers). I tried the same experiment on a single node as well and observed the same behavior, but all the results below were run on the 3-node cluster.

TestA: measures the throughput for creation of nodes in Neo4j where each node has 20 properties.

TestB: measures the throughput for creation of nodes in Neo4j where each node has 40 properties.

Result: Throughput for TestB = 1/2 * Throughput for TestA

Below is the code I used to generate the load and measure the throughput.

import org.neo4j.driver.v1.*;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;


public class UnwindCreateNodes {

    Driver driver;
    static int start;
    static int end;

    public UnwindCreateNodes(String uri, String user, String password) {
        Config config = Config.build()
                .withConnectionTimeout(10, TimeUnit.SECONDS)
                .toConfig();
        driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password), config);
    }


    private void addNodes() {
        List<Map<String, Object>> listOfProperties = new ArrayList<>();
        for (int inner = start; inner < end; inner++) {
            Map<String, Object> properties = new HashMap<>();
            properties.put("name", "Jhon " + inner);
            properties.put("last", "Alan" + inner);
            properties.put("id", 2 + inner);
            properties.put("key", "1234" + inner);
            properties.put("field5", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field6", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field7", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field8", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field9", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field10", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field11", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field12", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field13", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field14", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field15", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field16", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field17", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field18", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field19", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field20", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field21",  "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field22", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field23", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field24", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field25", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field26", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field27", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field28", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field29", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field30", "kfhc iahf uheguehuguaeghuszjxcb sd");

            properties.put("field31",  "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field32", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field33", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field34", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field35", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field36", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field37", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field38", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field39", "kfhc iahf uheguehuguaeghuszjxcb sd");
            properties.put("field40", "kfhc iahf uheguehuguaeghuszjxcb sd");
            listOfProperties.add(properties);
        }

        int noOfNodes = 0;
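        // Send the nodes in batches of 5,000, one write transaction per batch.
        // Note: the integer division below means a final partial batch (when the
        // total isn't a multiple of 5,000) would be silently skipped.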
        for (int i = 0; i < listOfProperties.size() / 5000; i++) {
            List<Map<String, Object>> events = new ArrayList<>();
            for (; noOfNodes < (i + 1) * (5000) && noOfNodes < listOfProperties.size(); noOfNodes++) {
                events.add(listOfProperties.get(noOfNodes));
            }
            Map<String, Object> apocParam = new HashMap<>();
            apocParam.put("events", events);
            String query = "UNWIND $events AS event CREATE (a:Label) SET a += event";
            Instant startTime = Instant.now();
            try (Session session = driver.session()) {
                session.writeTransaction((tx) -> tx.run(query, apocParam));
            }
            Instant finish = Instant.now();
            long timeElapsed = Duration.between(startTime, finish).toMillis();
            System.out.println("######################--timeElapsed NODES--############################");
            System.out.println("no of nodes per batch " + events.size());
            System.out.println(timeElapsed);
            System.out.println("############################--NODES--############################");
        }
    }

    public void close() {
        driver.close();
    }

    public static void main(String... args) {
        start = 200001;
        end = 400001;
        if (args.length == 2) {
            start = Integer.valueOf(args[0]);
            end = Integer.valueOf(args[1]);
        }
        UnwindCreateNodes unwindCreateNodes = new UnwindCreateNodes("bolt+routing://x.x.x.x:7687", "neo4j", "neo4j");
        unwindCreateNodes.addNodes();
        unwindCreateNodes.close();
    }
}

Below is the graph.

[graph: time to insert each 5,000-node batch, 20 vs. 40 properties per node]

It takes 3.5 seconds to insert 5000 nodes where each node has 40 properties

It takes 1.8 seconds to insert 5000 nodes where each node has 20 properties

This is a significant slowdown, and 40 is not a large number of properties. My requirement goes up to 100 properties per node; if it does not scale to 40, I am not sure how it can scale to 100.

Other approaches I tried include apoc.periodic.iterate, as well as variants with and without UNWIND (just using CREATE), but the behavior persists.
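For reference, the apoc.periodic.iterate variant looked roughly like this (a minimal sketch, assuming the APOC plugin is installed on the cluster; the batch size and the reuse of the apocParam map from the listing above are illustrative):

String apocQuery =
        "CALL apoc.periodic.iterate("
        + "'UNWIND $events AS event RETURN event', "
        + "'CREATE (a:Label) SET a += event', "
        + "{batchSize: 1000, params: {events: $events}})";
try (Session session = driver.session()) {
    // apocParam is the same {events: [...]} map built in the code above
    session.writeTransaction(tx -> tx.run(apocQuery, apocParam));
}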

I don't want to store properties in an external store like an RDBMS, because that complicates things for me: I am building a generic application and have no idea in advance which properties are going to be used.

I cannot use the CSV import tool either, because my data is coming from Kafka and is not structured the way the CSV tool expects. So no CSV tool for me.

Any ideas on how to speed this up?

Do you have any indexes on any of those properties for :Label nodes? If so, you'll need to consider that the indexes need to be updated on insert as well. Keep in mind also that in a cluster you're dealing with network I/O for Raft transactions; there is latency there for consensus commits.
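If you're unsure, you can list existing indexes from the driver (a quick sketch; db.indexes() is the procedure name in 3.5, and its output columns vary by version):

try (Session session = driver.session()) {
    // prints one map per index, e.g. its label, properties, and state
    session.run("CALL db.indexes()").list()
           .forEach(record -> System.out.println(record.asMap()));
}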

Seems like when you double the data you're inserting, you're getting roughly double the execution time. That looks like linear scaling; is this unexpected?

We are always looking to improve, and we are definitely eyeing some changes to our property store with an aim to improve efficiency sometime after the next major 4.0 release. I'm not sure if there's anything else we can spot here, but if we see anything relevant we'll let you know.

There are no indexes. Moreover, I tried both with and without indexes, and the difference in slowdown was negligible.

"Seems like when you double the data that you're inserting, you're getting roughly double the execution time. That seems like a linear scaling, is this unexpected?"

To be precise, when I double the number of properties per node, the execution time doubles, which means the throughput is halved. This is a significant slowdown, and it is certainly not what I expected; I thought Neo4j would be able to handle it just fine.

When can we expect the 4.0 release? I don't need an exact date, but will it happen in fall, spring, or winter?

The properties often take up the most room in the store, so by doubling the properties on each node you're essentially doubling the data that needs to be inserted for the same number of nodes. Seeing roughly double the insertion time seems like a natural consequence.
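As a rough back-of-envelope illustration (assuming the standard 3.5 record format, where each fixed-size property record holds up to four property blocks and any string too long to inline spills into a separate dynamic string record): 40 non-inlined string properties per node cost about 10 property records plus 40 dynamic records, versus roughly 5 + 20 for 20 properties, so the bytes written per node double along with the property count.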

As previously mentioned, there are improvements scheduled on our store files, notably the properties store, that should see some improvement.

The 4.0 release is looking like late winter 2019. The store improvements won't come with the 4.0 release; we're more likely to see them in 2020.

@andrew_bowman OK, I modified my benchmark code. Now I keep the total data size the same when comparing the throughput of:

10 properties of type long vs. 20 properties of type int
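The intended arithmetic: 10 × 8-byte longs = 80 bytes of payload per node, matching 20 × 4-byte ints = 80 bytes per node (nominal Java primitive sizes; how the store actually encodes them is a separate question).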

Code for 10 properties of type long

This is identical to the first listing except for the property map built in addNodes:

Map<String, Object> properties = new HashMap<>();
for (int f = 1; f <= 10; f++) {
    properties.put("field" + f, 2L); // 10 long-valued properties per node
}
listOfProperties.add(properties);

Code for 20 properties of type int

Again, identical to the first listing except for the property map:

Map<String, Object> properties = new HashMap<>();
for (int f = 1; f <= 20; f++) {
    properties.put("field" + f, 2); // 20 int-valued properties per node
}
listOfProperties.add(properties);

The same behavior persists: the throughput drops by half when the number of properties doubles, even though the total data size is the same between the two experiments.

[chart: throughput comparison, 10 long properties vs. 20 int properties]

I also tried different types, like strings:

10 properties of type string of length 6 vs. 20 properties of type string of length 3
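The property maps for this variant looked roughly like this (a sketch with illustrative filler values, keeping the total string payload per node the same):

for (int f = 1; f <= 10; f++) {
    properties.put("field" + f, "abcdef"); // 10 strings of length 6
}
// versus
for (int f = 1; f <= 20; f++) {
    properties.put("field" + f, "abc");    // 20 strings of length 3
}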

The same behavior persists, so this strongly suggests the property store needs to be redesigned. But if that takes until 2020, that's a bit too long.

Did you try to parallelize your work?
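For example, something along these lines (a minimal sketch, not from the original code; the pool size and the pre-partitioned batches list are assumptions):

import org.neo4j.driver.v1.*;

import java.util.*;
import java.util.concurrent.*;

public class ParallelWrites {
    // Driver instances are thread-safe and meant to be shared;
    // Sessions are not, so each task opens its own.
    static void writeAll(Driver driver, List<List<Map<String, Object>>> batches)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is illustrative
        List<Future<?>> futures = new ArrayList<>();
        for (List<Map<String, Object>> batch : batches) {
            futures.add(pool.submit(() -> {
                Map<String, Object> params = new HashMap<>();
                params.put("events", batch);
                try (Session session = driver.session()) {
                    session.writeTransaction(tx -> tx.run(
                            "UNWIND $events AS event CREATE (a:Label) SET a += event",
                            params));
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get(); // block until each batch completes; surfaces write failures
        }
        pool.shutdown();
    }
}

Note that in a causal cluster all writes still go through the Raft leader, so parallelism mainly helps hide per-transaction network latency rather than adding write capacity.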

Just to note, Cypher only works with 64-bit numeric types (see the Cypher type system mappings), so using Integer vs. Long makes no difference when going through Cypher: values are converted to 64-bit longs either way. This is why you aren't seeing a difference.

Integer types (and others) can be used when using embedded Neo4j via the core API. I think you might be able to use them in custom procedures as well, if you use the core API to write the properties.
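For illustration, writing a genuine 32-bit int through the embedded core API looks roughly like this (a sketch against the 3.5 embedded API; the database path is illustrative, and none of this is available over Bolt):

import java.io.File;

import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class EmbeddedIntProperty {
    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase(new File("data/graph.db")); // path is illustrative
        try (Transaction tx = db.beginTx()) {
            Node n = db.createNode(Label.label("Label"));
            n.setProperty("field1", 2); // stays a 32-bit int in the core API
            tx.success();
        }
        db.shutdown();
    }
}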

@andrew_bowman Converting integers to longs doesn't sound efficient at all, but that's another problem in itself. As I said in my previous post, I also ran another experiment:

10 properties of type string length 6 vs 20 properties of type string length 3

[chart: 10 string properties of length 6 vs. 20 string properties of length 3]

You can see the blue vs. red lines: the same behavior persists. Again, this strongly suggests the property store needs to be redesigned.

What other data types should I try, to further demonstrate that the same behavior exists regardless of the total data size?

Should I run another experiment with 10 properties of type int vs. 80 properties of type boolean? (80 because I hear Java reserves 1 byte for a boolean but only uses 1 bit.)
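A sketch of what that boolean variant's property map could look like (values illustrative):

for (int f = 1; f <= 80; f++) {
    properties.put("field" + f, true); // 80 boolean properties per node
}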