10k writes taking around 4 minutes. Is this the limit?

I'm using the Pyingest script to read 10 CSV files with 10k rows and 7 columns each. The best result I've had so far was with a chunk size of 10000.
My PC has 16 GB of RAM and a 4-core/8-thread CPU.
The first two files always take about 1–2 minutes, then the time grows a bit, and by the last file the write takes around 5 minutes.
My DB heap settings are the following:

# Java Heap Size: by default the Java heap size is dynamically calculated based
# on available system resources. Uncomment these lines to set specific initial
# and maximum heap size.
dbms.memory.heap.initial_size=5G
dbms.memory.heap.max_size=5G

# The amount of memory to use for mapping the store files.
# The default page cache memory assumes the machine is dedicated to running
# Neo4j, and is heuristically set to 50% of RAM minus the Java heap size.
dbms.memory.pagecache.size=7G

I got these values from neo4j-admin memrec (not run by me, actually; I'm using the AppImage version of Neo4j, so I can't use memrec. I found these settings on this forum, posted for a similar machine).

Since Pyingest uses Pandas to optimize the CSV reads, my guess is that the timings get slower because of garbage-collector problems.

I'm trying to get the best result I can on my PC, so that when I move this to a server (obviously much more powerful than my machine) it will perform beautifully.

Is there any way to optimize more?

EDIT: I forgot to include the query I'm using.

      WITH $dict.rows as rows UNWIND rows as row
      MERGE (a: Cookie_id {domain: row.domain}) 
      MERGE (b: OS {version: row.version}) 
      MERGE (c: Device_type {classification: row.classification}) 
      MERGE (d: Device_model {model: row.model}) 
      MERGE (e: IP {addr: row.addr}) 
      MERGE (f: Access_time {hour_group: row.hour_group}) 
      MERGE (g: Access_day {is_weekend: row.is_weekend}) 
      MERGE (a)-[:USING_OS]->(b)
      MERGE (a)-[:BY_TYPE]->(c)
      MERGE (a)-[:ACCESSED_BY]->(d)
      MERGE (a)-[:HAS_IP]->(e)
      MERGE (a)-[:ACCESSED_AT_TIME]->(f)
      MERGE (a)-[:ACCESSED_AT_DAY]->(g)
      RETURN a

Hi there! Have you created indices on the nodes you're MERGE-ing?
As you may know, MERGE = MATCH + CREATE, so creating indices on the MERGE patterns boosts the speed of the preliminary MATCH.
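
As a minimal sketch, that would be one index per label/property pair used in your MERGE patterns. The statements below use the Neo4j 3.x syntax; from 4.x onward the equivalent form is CREATE INDEX FOR (n:Cookie_id) ON (n.domain).

      // one index per label/property used in the MERGE lookups
      CREATE INDEX ON :Cookie_id(domain);
      CREATE INDEX ON :OS(version);
      CREATE INDEX ON :Device_type(classification);
      CREATE INDEX ON :Device_model(model);
      CREATE INDEX ON :IP(addr);
      CREATE INDEX ON :Access_time(hour_group);
      CREATE INDEX ON :Access_day(is_weekend);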

Actually, I didn't.
I'll try this and see the boost. Thanks for your answer!

I've got a question about this. Can I set a constraint on any property to act as an index?
Or does it need to be the id?

Oh man! The total time to process the 10 files is now 15 seconds.
I set a constraint on the main property and made it the node key.
Thank you so much!
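
In case it helps anyone later, the constraint looks something like this (assuming Cookie_id.domain is the main property here; note that node key constraints are an Enterprise Edition feature, while on Community Edition a unique constraint gives you the same index-backed MERGE lookup):

      // node key constraint (Enterprise Edition), 3.5/4.x syntax
      CREATE CONSTRAINT ON (a:Cookie_id) ASSERT (a.domain) IS NODE KEY;

      // Community Edition alternative: a unique constraint, also backed by an index
      CREATE CONSTRAINT ON (a:Cookie_id) ASSERT a.domain IS UNIQUE;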
