We receive a large amount of data and build a Cypher query that needs to "sync" the relationships of a single `:Talent` node with a large number of `:Segment` nodes. On average around 900,000 relationships need to be synced per run, and we need to write that data to the database as quickly as possible. The current setup is as follows:
- Driver: the official Neo4j Golang driver
- Neo4j causal cluster with 3 core members and 2 read replicas. The cluster is managed by Kubernetes, so Neo4j runs inside Docker containers rather than directly on the machines
- Each core has a maximum of 3 CPU units and 20Gi RAM, with the heap configured so that `heap.max_size` and `heap.initial_size` are both `13Gi`
- Read replicas are configured with a maximum of 3 CPU units and 18Gi RAM (read replicas are irrelevant to this question)
The approach we've taken is to first delete all relationships between the `:Talent` and its `:Segment` nodes, and then write the updated relationships again. This is one of the questionable parts of the design, but we chose it over querying the existing relationships and diffing them at the application layer.
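Roughly, the delete step looks like this (a minimal sketch assuming the v4 Go driver; the function name, the `session` handling and the single-transaction delete are illustrative, not our exact code):

```go
import (
	"github.com/neo4j/neo4j-go-driver/v4/neo4j"
)

// deleteExistingRelationships removes every current BELONGS_TO relationship of a
// talent before the new set is written. With ~500k relationships this runs as a
// single transaction here, which is itself part of what we are questioning.
func deleteExistingRelationships(session neo4j.Session, talentID string) error {
	_, err := session.WriteTransaction(func(tx neo4j.Transaction) (interface{}, error) {
		return tx.Run(
			`MATCH (t:Talent {talent_id: $talentID})-[r:BELONGS_TO]->(:Segment) DELETE r`,
			map[string]interface{}{"talentID": talentID},
		)
	})
	return err
}
```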
Question 1: Is there a better way to sync the relationships than delete/create, i.e. a way to remove the existing relationships that are not part of the given list while creating the new ones?
The query to write relations is as follows:
```cypher
WITH $list AS list
UNWIND list AS item
MATCH (t:Talent {talent_id: $talentID}), (s:Segment {checksum: item.checksum})
MERGE (t)-[r:BELONGS_TO]->(s)
ON CREATE SET r.created_at = $created_at
SET r.updated_at = $updated_at,
    r.affinity = item.affinity,
    r.audience_size_percentage = item.audience_size_percentage
```
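For completeness, this is roughly how one chunk of that query is executed from Go (a sketch assuming the v4 Go driver; `SegmentItem`, `writeRelationshipChunk` and the timestamp format are illustrative assumptions):

```go
import (
	"time"

	"github.com/neo4j/neo4j-go-driver/v4/neo4j"
)

// SegmentItem is one entry of the received list.
type SegmentItem struct {
	Checksum               string
	Affinity               float64
	AudienceSizePercentage float64
}

const mergeRelationships = `
WITH $list AS list
UNWIND list AS item
MATCH (t:Talent {talent_id: $talentID}), (s:Segment {checksum: item.checksum})
MERGE (t)-[r:BELONGS_TO]->(s)
ON CREATE SET r.created_at = $created_at
SET r.updated_at = $updated_at,
    r.affinity = item.affinity,
    r.audience_size_percentage = item.audience_size_percentage`

// writeRelationshipChunk writes one chunk of segment items for a talent.
func writeRelationshipChunk(session neo4j.Session, talentID string, chunk []SegmentItem) error {
	// The driver only accepts primitives, maps and slices as parameters,
	// so each item is converted to a map before being sent.
	list := make([]interface{}, 0, len(chunk))
	for _, item := range chunk {
		list = append(list, map[string]interface{}{
			"checksum":                 item.Checksum,
			"affinity":                 item.Affinity,
			"audience_size_percentage": item.AudienceSizePercentage,
		})
	}
	now := time.Now().UTC().Format(time.RFC3339)
	_, err := session.WriteTransaction(func(tx neo4j.Transaction) (interface{}, error) {
		result, err := tx.Run(mergeRelationships, map[string]interface{}{
			"list":       list,
			"talentID":   talentID,
			"created_at": now,
			"updated_at": now,
		})
		if err != nil {
			return nil, err
		}
		return result.Consume()
	})
	return err
}
```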
Note that a single run of this query does not contain the full list of segments received; since we are using Golang, we decided to use its concurrency to split the list into chunks and write them in parallel, aiming for a high write volume in a short time, but we still haven't been able to achieve it.
Example: say we receive a list of 700,000 `:Segment` items that need to be written, and the `:Talent` already has 500,000 relationships.
- Before writing, we need to make sure that relationships to segments not present in the received list are removed
- Build and execute a batch of queries concurrently, in chunks of 5,000 items each, so we end up running 140 concurrent operations for the 700k items, each writing 5,000 relationships (see the sketch after this list)
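The chunking itself looks roughly like this (a sketch; it reuses `writeRelationshipChunk` from the earlier snippet, uses `golang.org/x/sync/errgroup` for the fan-out, and all names are illustrative):

```go
import (
	"github.com/neo4j/neo4j-go-driver/v4/neo4j"
	"golang.org/x/sync/errgroup"
)

const chunkSize = 5000

// syncRelationships splits the received items into chunks of 5,000 and writes
// them concurrently, one session per chunk (140 concurrent writers for 700k items).
func syncRelationships(driver neo4j.Driver, talentID string, items []SegmentItem) error {
	var g errgroup.Group
	for start := 0; start < len(items); start += chunkSize {
		end := start + chunkSize
		if end > len(items) {
			end = len(items)
		}
		chunk := items[start:end]
		g.Go(func() error {
			session := driver.NewSession(neo4j.SessionConfig{AccessMode: neo4j.AccessModeWrite})
			defer session.Close()
			return writeRelationshipChunk(session, talentID, chunk)
		})
	}
	return g.Wait()
}
```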
Question 2: This process takes up to an hour to finish per `:Talent`. How can we bring it down to the minimum possible (ideally around a minute, though that seems impossible)?
It is also worth mentioning the current schema:
Indexes:
```
ON :AudienceSegment(checksum) ONLINE
ON :Talent(talent_id) ONLINE
ON :AudienceSegment(uuid) ONLINE (for uniqueness constraint)
ON :Talent(uuid) ONLINE (for uniqueness constraint)
```
Constraints:
```
ON ( audiencesegment:AudienceSegment ) ASSERT audiencesegment.uuid IS UNIQUE
ON ( talent:Talent ) ASSERT talent.uuid IS UNIQUE
```
Note: `uuid` is a randomly generated UUIDv4 that is added to all nodes.
Note: `:Segment` is identified by the `checksum` field rather than `uuid`.