I'm new to Neo4j and Cypher, will appriciate help with a query, don't know where to start.
my dataset (CSV) is structured as demostrated below:
EntityID | EventType | Timestamp
A 11 02/05/21
A 54 03/05/21
A 11 04/05/21
B 62 01/02/20
B 54 08/03/20
B 49 15/08/20
I created a node fo each Entity and each unique EventType,
i would like to create a relationship between each Entity and each EventType whose timestamp is the earliest for that Entity (e.g. Entity A's node will be connected to EventType's 11 node)
I have over 1M rows and 40K entities in my dataset.
If you connect Entity "A" to EventType "11" with timestamp "02/05/21", then EventType "11" with timestamp "04/05/21" will not be connected to "A". This way, you cannot find if "A" attended the event on "04/05/21". It would be helpful if you can specify the Entity and EventType node properties in your data model.
Since you have over 1M rows, I would suggest you use neo4j-admin import . For that you would need two CSV files representing the Nodes (EntityId and EventType) and one for the Relationships.
- Import your CSV master data in a Pandas DataFrame
- Use .sort_values to sort the master DataFrame by TimeStamp column.
- Make a copy of it with only the Entity Column
- Use .drop_duplicates to retain only unique Entity IDs. Export to CSV
- Repeat Steps 2 & 3 for EventType
- Lastly use .drop_duplicates on the master DF passing the column EntityId for the subset parameter. By default, only the first row (which is also the earliest TimeStamp for that EntityId because you had sorted, is retained. Depending on your requirements, you can drop the TimeStamp column (unless you want it be a property of the EventType Node).
- Export this to CSV. This will be your Relationship CSV file.
- The rest is just a matter of using these three files in the Neo4j-admin import command with proper syntax.
I think this works. I tried it with the data sorted by date in both ascending and defending order, to ensure the result was not dependent on the order of the data.
Basically, it creates a new relationship between the entity and event in each row, setting the relationship date to the new date, or the min date if there was an existing relationship between the two entities. It then finds if there is a relationship between the row's entity and another event. If found, it deletes the one extra one with the greater date, leaving the relationship with the minimum event date.
LOAD CSV WITH HEADERS
FROM 'file:///xxx.csv' as rowStore
with rowStore.Entity as Entity, rowStore.EventType as EventType, date(rowStore.Timestamp) as EventDate
on create set r.EventDate = EventDate
on match set r.EventDate = apoc.coll.min([r.EventDate, EventDate])
with n, EventDate
where r.EventDate <> apoc.coll.min([r.EventDate, EventDate])
Solution is good But, we end up with lot of orphan nodes.
What is an orphan node? Do you not want the event nodes that are not connected to an entity? If that is the case, change the last line to:
delete r, p
Thanks! this is very helpful.