Very basic newbie q: Can I use neo4j to graphically show sequential activity from a log?

I posted my first question here and I should've posted in newbie section: Basics: Relationships of nodes where relationship info is the goal

I have a database copy log file and want to see where data went from original db to copies (...of copies of copies). First problem is importing the data, db IDs are fixed but names can change, so nodes are identified by ID but I want to display the name. Second problem is to show only most recent relationship: what data was last copied to that database since prior data was overwritten, and the relationship should show who copied the data on what date. Also I noticed neo4j doesn't like names to start with numbers, but names are arbitrary, and I also noticed a node can have multiple labels (maybe useful since an ID can have a different name later).

I'm pretty sure I'm on the wrong path:

  1. LOAD CSV WITH HEADERS FROM 'file:///desktop-csv-import/copy.csv' AS row
    MERGE (s:SourceID {source_db_id: row.source_db_id})
    RETURN count(s);
    Added 24648 labels, created 24648 nodes, set 24648 properties, started streaming 1 records after 2467427 ms and completed after 2467427 ms.
    Count 312324
  2. LOAD CSV WITH HEADERS FROM 'file:///desktop-csv-import/copy.csv' AS row
    MERGE (t:TargetID {target_db_id: row.target_db_id})
    RETURN count(t);
    Count 312324
    Added 38744 labels, created 38744 nodes, set 38744 properties, started streaming 1 records after 4096611 ms and completed after 4096611 ms.
  3. LOAD CSV WITH HEADERS FROM 'file:///desktop-csv-import/copy.csv' AS row
    MATCH (s:SourceID {source_db_id: row.source_db_id})
    MATCH (t:TargetID {target_db_id: row.target_db_id})
    MERGE (s)-[r:Copied_To]->(t)
    RETURN count(*);
    Count 312324
    Created 59218 relationships, started streaming 1 records after 12145906 ms and completed after 12145936 ms.
  4. create index on :TargetID (target_instance_id)
    (Too late, and I should've started with a shortened file after that last 3.3hr job)

Can someone show me the proper steps to graphically show the data below? Sample csv below, real csv has 312,000 rows and a few more columns.

copy.csv
source_name,source_id,target_name,target_id,user_name,date,sys_created_on
Hollywood,52a99701310c785,Hollywooddev,904f897bf9619ac,Charlie,11/6/19 10:41,11/6/19 10:25
Dallas,51e7b2f5d3ee434,Dallasdev,16639cc0dfdc527,Eve,9/9/19 9:24,9/9/19 9:05
Indiana,75ff897bf9619ab,Indianauat,bf6d5b0cf96199e,Rupert,8/22/19 16:17,8/22/19 15:58
Indiana,75ff897bf9619ab,Indianadev,b2efba8bf961931,Rupert,8/15/19 16:30,8/15/19 16:11
Gilroy,51e7b2f5d3ee445,Gilroydev,51e7b2f5d3ee47d,Alice,7/22/19 2:00,7/22/19 1:34
20Dallas,51e7b2f5d3ee434,20Dallastest,a75a9cc5d3ee474,Eve,7/9/19 13:53,7/9/19 13:34
Clevelanddev,99959efe8da152e,Clevelandgroupllcdemo1,2e3f70fbf961974,Bob,7/1/19 10:22,7/1/19 10:20
Clevelanddev,99959efe8da152e,Clevelandgroupllcdemo3,264f70fbf9619e4,Bob,5/30/19 8:40,5/30/19 8:37
Clevelanddev,99959efe8da152e,Clevelandgroupllcdemo3,264f70fbf9619e4,Bob,5/24/19 8:30,5/24/19 8:23
20Dallas,51e7b2f5d3ee434,20Dallastest,a75a9cc5d3ee474,Eve,5/21/19 7:19,5/21/19 7:03
Clevelanddev,99959efe8da152e,Clevelandgroupllcdemo3,264f70fbf9619e4,Bob,5/6/19 11:30,5/6/19 11:28
Indiana,75ff897bf9619ab,Indianadev,b2efba8bf961931,Rupert,4/30/19 19:42,4/30/19 19:22
Barrington,f035d00cf96192a,Barringtondev,36ad5b0cf961906,Sybil,3/12/19 0:11,3/11/19 23:52
Indiana,75ff897bf9619ab,Indianauat,bf6d5b0cf96199e,Rupert,3/11/19 17:59,3/11/19 17:42
Indiana,75ff897bf9619ab,Indianadev,b2efba8bf961931,Kilo,3/4/19 14:08,3/4/19 13:55
Indiana,75ff897bf9619ab,Indianadev,b2efba8bf961931,Kilo,3/3/19 17:09,3/3/19 16:50
Barrington,f035d00cf96192a,Barringtondev,36ad5b0cf961906,Sybil,3/2/19 19:10,3/2/19 19:08
Indiana,75ff897bf9619ab,Indianauat,bf6d5b0cf96199e,Kilo,2/27/19 13:56,2/27/19 13:37
Indiana,75ff897bf9619ab,Indianauat,bf6d5b0cf96199e,Kilo,2/26/19 19:30,2/26/19 19:11

I would recommend in your model to separate out the concept of the database versus the current state of the database. Have a node to represent the database which you mentioned maintains a consistent ID, and then in each of the copies of the database, link to back to that DB node. As a copy is made or added, you can create a linked list of the copies in the order they're created so you can easily traverse back through time. To easily help you always know where the end of the linked list is, create a LATEST relationship. When you made a new copy, don't forget to increment the relationship. On each of the Copy nodes, you can add attribute specific to the copy such as name and timestamp. You can link the copy back to a Person node of who made the copy. Here's a quick diagram made by the Arrows tool of the model I've been describing:

1 Like

Here is also some inspiration around event chain analysis using Neo4j:

Hi Bill,

to get you going and to build on what Michael wrote, I would try the following:

  1. LOAD CSV WITH HEADERS FROM 'file:///copy.csv' AS row
    MERGE (a:Source {id: row.source_id, name: row.source_name})

  2. LOAD CSV WITH HEADERS FROM 'file:///copy.csv' as row
    MATCH (a:Source {id: row.source_id, name: row.source_name})
    MERGE (a)<-[:IS_COPY_OF {copy_date: datetime(row.date), copy_user: row.user_name}]-(b:Target {id: row.target_id, name: row.target_name})

In the second command, you see, how I put the copy_date and the user who copied the source on the relationship. In order to convert the date to a Neo4j datetime, you need to modify your source file a bit. I changed it to YYYY-MM-DD HH:MM (e.g. for the first entry: 2019-06-11T10:41). Then the function datetime("2019-06-11T10:41") works which makes it then possible for you to sort by copy date.

The graph then looks like:

Now you can e.g. find all copies of Indiana:

MATCH (a:Target)-[:IS_COPY_OF]->(b:Source {name: "Indiana"}) RETURN a,b

You can also sort the copies of Indiana by latest:

MATCH (a:Target)-[s:IS_COPY_OF]->(b:Source {name: "Indiana"}) RETURN a.name, s.copy_date ORDER BY s.copy_date DESC

Here is the csv I used:
source_name,source_id,target_name,target_id,user_name,date
Hollywood,52a99701310c785,Hollywooddev,904f897bf9619ac,Charlie,2019-06-11T10:41
Dallas,51e7b2f5d3ee434,Dallasdev,16639cc0dfdc527,Eve,2019-09-09T9:24
Indiana,75ff897bf9619ab,Indianauat,bf6d5b0cf96199e,Rupert,2019-08-22T16:17
Indiana,75ff897bf9619ab,Indianadev,b2efba8bf961931,Rupert,2019-08-15T16:30
Gilroy,51e7b2f5d3ee445,Gilroydev,51e7b2f5d3ee47d,Alice,2019-07-22T2:00
20Dallas,51e7b2f5d3ee434,20Dallastest,a75a9cc5d3ee474,Eve,2019-07-09T13:53
Clevelanddev,99959efe8da152e,Clevelandgroupllcdemo1,2e3f70fbf961974,Bob,2019-07-01T10:22
Clevelanddev,99959efe8da152e,Clevelandgroupllcdemo3,264f70fbf9619e4,Bob,2019-05-30T8:40
Clevelanddev,99959efe8da152e,Clevelandgroupllcdemo3,264f70fbf9619e4,Bob,2019-05-24T8:30
20Dallas,51e7b2f5d3ee434,20Dallastest,a75a9cc5d3ee474,Eve,2019-05-21T7:19
Clevelanddev,99959efe8da152e,Clevelandgroupllcdemo3,264f70fbf9619e4,Bob,2019-05-06T11:30
Indiana,75ff897bf9619ab,Indianadev,b2efba8bf961931,Rupert,2019-04-30T19:42
Barrington,f035d00cf96192a,Barringtondev,36ad5b0cf961906,Sybil,2019-03-12T0:11
Indiana,75ff897bf9619ab,Indianauat,bf6d5b0cf96199e,Rupert,2019-03-11T17:59
Indiana,75ff897bf9619ab,Indianadev,b2efba8bf961931,Kilo,2019-03-04T14:08
Indiana,75ff897bf9619ab,Indianadev,b2efba8bf961931,Kilo,2019-03-03T17:09
Barrington,f035d00cf96192a,Barringtondev,36ad5b0cf961906,Sybil,2019-03-02T19:10
Indiana,75ff897bf9619ab,Indianauat,bf6d5b0cf96199e,Kilo,2019-02-27T13:56
Indiana,75ff897bf9619ab,Indianauat,bf6d5b0cf96199e,Kilo,2019-02-26T19:30

Regards,
Elena

1 Like

You people are awesome!

Thank you for pointing me in the right direction. Michael Black for conceptualizing my high level brain problem and Elena for being the first and sticking with me with your experienced implementation answers. I'm on my way with neo4j. And I will investigate Snowplow as Niclas pointed out.

2 Likes

The 'IS_COPY_OF' seems to limit the depth of relationships if I use WHERE b.name = "sourcedb".

I'd like to depict a graph of the complete hierarchy of copy relationships, is there a Cypher syntax where a variable of source includes the resulting set of targets?

Hi Bill, you may want to look at the Cypher docs for variable-length pattern matching, which can let you specify the depth of traversal of relationships.

If you have further questions let us know, along with the query so we can provide more relevant advice.

Thanks Andrew,

I've simplified the csv for my mental health.

Alice copied to bob on 9/19 copied to charlie on 9/20 copied to eve on 9/21 <-- I would like to see this chain of relationships
Alice copied to bob on 9/21 <-- I would like to see this relationship
Mallory copied to charlie on 9/20 so charlie no longer has a copy of alice <-- I don't want to display that

In my use case the above goes on for 55,000 events for a mix of different sources. Some relationships are 40 layers deep, but I want to filter out where the source data was overwritten.

source_name,source_id,target_name,target_id,user_name,date
mallory,16,charlie,13,dave,2019-09-21T15:10:00
alice,11,bob,12,sam,2019-09-21T15:00:00
charlie,13,eve,14,jack,2019-09-21T10:10:00
alice,11,bob,12,joe,2019-09-20T17:35:00
bob,12,charlie,13,joe,2019-09-20T15:00:00
alice,11,bob,13,dave,2019-09-19T15:10:00

Using the following

  1. LOAD CSV WITH HEADERS FROM 'file:///copy.csv' AS row
    MERGE (a:Source {id: row.source_id, name: row.source_name})

  2. LOAD CSV WITH HEADERS FROM 'file:///copy.csv' as rowMATCH (a:Source {id: row.source_id, name: row.source_name})
    MERGE (a)<-[:IS_COPY_OF {copy_date: datetime(row.date), copy_user: row.user_name}]-(b:Target {id: row.target_id, name: row.target_name})

  3. MATCH (a:Target)-[s:IS_COPY_OF]->(b:Source) RETURN *

From the above commands I get this image and would like to filter by original source and see relationship chains of only active copies of the source (eliminating where a target may have been overwritten by another source). The relationship should show who copied on what date.

The next step after the analysis is to print out or export the view of where source data ended up, and present that to the source owner. Hopefully screenshots are not the only option (newbie here).

So I guess what would make your life easier is if you had two csv files: one for the nodes and one for the edges which you load subsequently. E.g.:

nodes.csv:
name,id
mallory,16
alice,11
charlie,13
bob,12
eve,14

edges.csv: same as your copy.csv

Then you first load your nodes:
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' as row
MERGE (a:File {id: row.id, name: row.name})

And then your edges:
LOAD CSV WITH HEADERS FROM 'file:///edges.csv' as row
MATCH (a:File {id: row.source_id, name: row.source_name}), (b:File {id: row.target_id, name: row.target_name})
MERGE (a)<-[:IS_COPY_OF {copy_date: datetime(row.date), copy_user: row.user_name}]-(b)

This gives you
image

As for your last point, I am not sure how to model this best. Maybe looking for time-dependent graphs will help you there. Maybe even this: Representing time dependent graphs in Neo4j · SocioPatterns/neo4j-dynagraph Wiki · GitHub

Thank you that looks like a solution.

The need use a second CSV with the first two columns of the original, makes me curious why it could not be pulled that from the original CSV.