Import via HTTP increasingly slow on a large dataset

I am importing several datasets, a couple of them large (500K-1.5M documents of the type shown below). I have an index created with the command

CREATE CONSTRAINT ON (n:_) ASSERT n.id IS UNIQUE;

Import time goes from a couple of milliseconds per document at the beginning to currently about 1 s per document (the import is still running).

Is this normal? How can this be prevented?

  • neo4j version, desktop version, browser version
    3.3.0 Community
  • what kind of API / driver do you use
    http /db/data/transaction/commit
  • screenshot of PROFILE or EXPLAIN with boxes expanded (lower right corner)
  • a sample of the data you want to import
"parameters": {
                "operation": "add",
                "dataset": "bag",
                "type": "hg:Building",
                "id": "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
                "data": {
                    "uri": "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
                    "type": "hg:Building",
                    "validSince": [
                        "1977-01-01",
                        "1977-12-31"
                    ],
                    "geometry": "{\\"type\\":\\"Polygon\\",\\"coordinates\\":[[[5.70891547486678,52.8504181579206],[5.70891457428957,52.8504168123607],[5.70894999504419,52.8504089883029],[5.70896652550447,52.8504363528222],[5.70900994956621,52.8504267996935],[5.70906180219923,52.8505135601466],[5.70903749603102,52.8505188379876],[5.70903433968836,52.8505135444138],[5.70898068889875,52.850525281936],[5.70896446255978,52.8504989051244],[5.70896638900196,52.8504984505955],[5.70893317961297,52.8504437219559],[5.70893125317285,52.8504441764843],[5.70891547486678,52.8504181579206]]]}",
                    "dataset": "bag",
                    "validSinceTimestamp": 220924800
                },
                "structure": "node"
            },
            "statement": "MERGE (n:_ {id: {id}})\\nON CREATE\\n   SET n = {data},\\n       n.created = timestamp(),\\n       n.id = {id},\\n       n.dataset = {dataset},\\n       n:`hg:Building`\\nON MATCH\\n   SET n = {data},\\n       n.accessTime = timestamp(),\\n       n.counter = coalesce(n.counter, 0) + 1,\\n       n.id = {id},\\n       n.dataset = {dataset},\\n       n:`hg:Building`\\n   REMOVE n:_VACANT\\nRETURN n\\n"
        },
  • which plugins / extensions / procedures do you use
  • neo4j.log and debug.log

I tried to run this from the browser by setting the parameters (this has to be run on one line). I couldn't get the n = {data} part to work; it fails with the error TypeError: Property values can only be of primitive types or arrays thereof. IMO that is a red herring, though: if your query is working over HTTP, there must be some difference in how parameters are handled in the Browser vs. the HTTP call you are doing.

I think the problem is that when you change the label from _ or _VACANT to hg:Building, you don't have an index or constraint on hg:Building. Try adding a uniqueness constraint on hg:Building.id as well.
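In 3.x syntax, that constraint would mirror the existing one on _ (the backticks are required because the label contains a colon):

```cypher
// Uniqueness constraint on hg:Building.id, matching the existing one on :_.
// Backticks are needed since the label name itself contains a colon.
CREATE CONSTRAINT ON (n:`hg:Building`) ASSERT n.id IS UNIQUE;
```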

When the label with the constraint is used, the index is used.

When I use hg:Building instead, it is not: the plan shows a node scan.

This will cause things to get slower as the number of nodes carrying the non-indexed label increases.
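The difference is easy to see with EXPLAIN (a sketch; the exact operator names depend on the Neo4j version):

```cypher
// With a uniqueness constraint on :Problem1(id), the MERGE plan contains
// an index seek on the id property:
EXPLAIN MERGE (n:Problem1 {id: {id}}) RETURN n;

// With no index or constraint on :`hg:Building`, the same MERGE falls back
// to NodeByLabelScan + Filter, touching every node with that label:
EXPLAIN MERGE (n:`hg:Building` {id: {id}}) RETURN n;
```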

The MERGE I used:
MERGE (n:Problem1 {id: {id}})
ON MATCH SET n.accessTime = timestamp(), n.counter = coalesce(n.counter, 0) + 1, n.id = {id}, n.dataset = {dataset},
    n.uri = $data.uri, n:`hg:Building`
RETURN n

Params:

:params {operation: "add",dataset: "bag",type: "hg:Building",id: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",data: {uri: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",type: "hg:Building",validSince: ["1977-01-01","1977-12-31"],geometry: {type:"Polygon",coordinates:[[[5.70891547486678,52.8504181579206],[5.70891457428957,52.8504168123607],[5.70894999504419,52.8504089883029],[5.70896652550447,52.8504363528222],[5.70900994956621,52.8504267996935],[5.70906180219923,52.8505135601466],[5.70903749603102,52.8505188379876],[5.70903433968836,52.8505135444138],[5.70898068889875,52.850525281936],[5.70896446255978,52.8504989051244],[5.70896638900196,52.8504984505955],[5.70893317961297,52.8504437219559],[5.70893125317285,52.8504441764843],[5.70891547486678,52.8504181579206]]]},dataset: "bag",validSinceTimestamp: 220924800},structure: "node"}

Hi David,
Thank you very much for your reply. Inspired by what you did, I have also run an EXPLAIN query on my DB:

EXPLAIN MERGE (n:_ {id: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084"})
ON CREATE
   SET n.id = "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
       n.dataset = "bag",
       n:`hg:Building`

ON MATCH
   SET n.id = "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
       n.dataset = "bag",
       n:`hg:Building`
   REMOVE n:_VACANT

RETURN n

and here is the result:

What I do not understand is where the NodeByLabelScan comes from, unless something is seriously wrong: I just want to remove the label from the node that was found, not from arbitrary nodes in the DB.

Do I have an error in the query?

Also, this problem did not show up before we changed the id from a short string to a full URL. Could that be related to the increased indexing time?

Thanks,
Stefano

Hi Stefano,

Try creating an index on the other labels you access id through, e.g. _ or _VACANT. The label scan indicates that the query doesn't have an index for that label.
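A sketch of the additional constraints, assuming id should be unique per label (plain indexes would also avoid the scan):

```cypher
// Uniqueness constraints for every label the query matches on id.
// The one on :_ should already exist; these cover the other labels used.
CREATE CONSTRAINT ON (n:_VACANT) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:`hg:Building`) ASSERT n.id IS UNIQUE;
```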

David

Hi David,
Thank you for your reply. I was convinced I had already created an index, but the script that was supposed to do so had not run correctly.

Thanks again for your support.

Stefano

Hi Stefano,
One key point about creating indexes while importing data at the same time:
if the data you are indexing is huge, populating the index may take a while to complete. So it is better to check the status of the index (ONLINE, POPULATING, or FAILED) and make sure it is ONLINE once all the data is indexed.
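In 3.x you can check the status from Cypher (a sketch; the exact columns returned vary slightly between versions):

```cypher
// List all indexes with their state (ONLINE, POPULATING or FAILED):
CALL db.indexes();

// Block until a specific index is ONLINE (times out after 300 s here):
CALL db.awaitIndex(":_(id)", 300);
```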

If a NodeByLabelScan shows up when you PROFILE your query (even though you created the indexes), it means the indexing is either not yet complete or has failed. It is always best practice to check the failureMessage in verbose mode.