How to get all direct and indirect connected nodes within a time period backward(each node has timestamp in it)?

Can anyone help me to solve this problem in Neo4j ? specially the query-6 where it is required to find all connected node within a time period backward.


We should be able to help, this is a very relevant and interesting case, but it will go faster if we have a CSV with which to load the data. The format of the data as presented in the article isn't well designed for import.

Let's assume the node creation process in dynamic. and the graph expands with time.

Sure, I'm just asking for an example CSV to import so I can do some initial testing. As previously mentioned the data mentioned in the article is not in any good format for importing.

Okay, doing this by hand. As noted previously, an example CSV would make a huge difference here, the data in the article is not in a good format for easy import.

Here's how I'm creating the graph:

// create :Person nodes

{ID:2020020201, HealthStatus: 'Sick', ConfirmedTime: date('2020-01-24')},
{ID:2020020202, HealthStatus: 'Sick', ConfirmedTime: date('2020-01-25')},
{ID:2020020203, HealthStatus: 'Sick', ConfirmedTime: date('2020-01-25')},
{ID:2020020204, HealthStatus: 'Sick', ConfirmedTime: date('2020-01-21')},
{ID:2020020205, HealthStatus: 'Sick', ConfirmedTime: null}
] as personInfo
CREATE (p:Person)
SET p = personInfo

For shorthand, I'll be referring to these persons as Usr followed by the last number of the ID, so Usr1, Usr2, etc.

// create :Place nodes

{name:'Shopping mall Area A'},
{name:'Community A in Hepin District'},
{name:'Shopping mall Area B'},
{name:'Community B in Hexi District'},

{name:'Hotpot restaurant in Nankai District'},
{name:'Community B in Binhai District'},
{name:'Shopping mall Area C'}
] as placeInfo
CREATE (p:Place)
SET p = placeInfo

// create relationships

MATCH (p1:Person {ID:2020020201}),
(p2:Person {ID:2020020202}),
(p3:Person {ID:2020020203}),
(p4:Person {ID:2020020204}),
(p5:Person {ID:2020020205})

MATCH (pl1:Place {name:'Shopping mall Area A'}),
(pl2:Place {name:'Shopping mall Area B'}),
(pl3:Place {name:'Shopping mall Area C'}),
(pl4:Place {name:'Community A in Hepin District'}),
(pl5:Place {name:'Community B in Hexi District'}),
(pl6:Place {name:'Hotpot restaurant in Nankai District'}),
(pl7:Place {name:'Community B in Binhai District'})

CREATE (p1)-[:stay {start_time: dateTime('2020-01-23T12:00:00'), end_time: dateTime('2020-01-23T18:00:00')}]->(pl1),
(p1)-[:stay {start_time: dateTime('2020-01-23T18:00:00'), end_time: dateTime('2020-01-24T08:00:00')}]->(pl4),
(p2)-[:stay {start_time: dateTime('2020-01-23T12:00:00'), end_time: dateTime('2020-01-23T23:00:00')}]->(pl1),
(p3)-[:stay {start_time: dateTime('2020-01-23T15:00:00'), end_time: dateTime('2020-01-23T19:00:00')}]->(pl2),
(p3)-[:stay {start_time: dateTime('2020-01-23T12:00:00'), end_time: dateTime('2020-01-23T23:00:00')}]->(pl5),
(p4)-[:stay {start_time: dateTime('2020-01-23T11:00:00'), end_time: dateTime('2020-01-23T20:00:00')}]->(pl6),
(p4)-[:stay {start_time: dateTime('2020-01-23T20:00:00'), end_time: dateTime('2020-01-23T23:00:00')}]->(pl7),
(p5)-[:stay {start_time: dateTime('2020-01-23T11:00:00'), end_time: dateTime('2020-01-23T15:00:00')}]->(pl6),
(p5)-[:stay {start_time: dateTime('2020-01-23T16:00:00'), end_time: dateTime('2020-01-23T23:00:00')}]->(pl1),
(p5)-[:stay {start_time: dateTime('2020-01-23T16:00:00'), end_time: dateTime('2020-01-23T23:00:00')}]->(pl2),
(p5)-[:stay {start_time: dateTime('2020-01-23T16:00:00'), end_time: dateTime('2020-01-23T23:00:00')}]->(pl3)

We're using dateTime() here for the times, and did a slight formatting change so that these are parsable. We COULD use strings instead, but as a native temporal value we have the option of neat temporal operations, such as working with durations.

For any real set of data, we would want to use MERGE instead of CREATE for these nodes, and we would want to have indexes or unique constraints on :Person(ID) and :Place(name), and have richer naming/modeling for the places rather than just using direct string names.

Now for the Cypher equivalents of the data analysis queries.

1. Find out where Usr1 was on January 23

MATCH (p:Person {ID:2020020201})-[stay:stay]->(pl:Place)
WHERE localDateTime('2020-01-23T15:00:00') < stay.start_time < localDateTime('2020-01-23T23:00:00' )

which gives us the place: "Community A in Hepin District".

2. Check if Usr1 exposed to any confirmed cases during this time

MATCH (p:Person {ID:2020020201})-[stay1:stay]->(pl:Place)<-[stay2:stay]-(carrier)
WHERE dateTime('2020-01-23T15:00:00') < stay1.start_time < dateTime('2020-01-23T23:00:00')
 AND carrier.HealthStatus = 'Sick' AND carrier.ConfirmedTime <= date("2020-01-23")
 RETURN pl, carrier

What's going on here? We need to reuse the start of the previous query so we match to places Usr1 traveled on the date in question..

The query returns no results, which makes sense because in the small amount of data we have from the linked site, Usr1 is the only person that visited Community A in Hepin District.

There are a few problems with the query as presented on the site and translated to Cypher here.

We're only looking for a visit after 15:00:00 on 1-23, which doesn't cover all possible times of visitation on that date, and it won't detect if the start time fell on exactly the boundary lines. This should be checking for all of 1-23 and not this smaller range, and it should be using <= and >= for full inclusion of the boundary times. With the query as-is, we don't even consider the visit to Shopping mall Area A, since the start time to the visit was at 12:00:00 of that day, before the bounds which we're checking. This is likely either a typo or overlooked error in the query, not by intention.

Depending on what we're looking for, the criteria for finding carriers may be far too wide. Note that the criteria for the carrier is that the carrier must be sick, must have been confirmed before the date in question, and must have visited the same place as Usr1. But we have no constraints on when that visit took place. Maybe the visit took place days or months or years before they got sick or were confirmed. And what about if they got confirmed later...that doesn't mean they weren't sick and spreading this before! Even if that's okay to consider, are we tracking spread via potential touch, multi-person spread, or direct in-person spread? If it's in-person spread then we need to figure out of date ranges overlap, or at least if they visited on the same day.

Lastly, this doesn't count carriers that are sick but unconfirmed. That's probably fine, but a query to include these as-of-yet-unconfirmed individuals may at least show risk, if they're sick but we're missing the ConfirmedTime, as in Usr5 with a NULL confirmed time (this is equivalent to not having a confirmed time at all in Neo4j, since there is no such thing as a NULL property value, NULL is equivalent to nonexistence).

So with those flaws in the original query in mind, let's create something that covers all of 01-23, allows inclusivity on the date and time boundaries, and can also identify potential carriers that are sick but don't have a confirmed time:

MATCH (p:Person {ID:2020020201})-[stay1:stay]->(pl:Place)<-[stay2:stay]-(carrier)
WHERE dateTime('2020-01-23T00:00:00') < stay1.start_time < dateTime('2020-01-23T23:00:00')
 AND apoc.coll.max([stay1.start_time.epochMillis, stay2.start_time.epochMillis]) <= apoc.coll.min([stay1.end_time.epochMillis, stay2.end_time.epochMillis])
 AND carrier.HealthStatus = 'Sick' AND coalesce(carrier.ConfirmedTime, date("2020-01-23")) <= date("2020-01-23")
 RETURN pl, carrier

The second part of the WHERE clause dealing with maxes and mins uses functions from APOC procedures to perform scaler min() and max() across values within a single row, since the native min() and max() functions in Cypher are aggregation functions, which aren't what we need here.

Why are we using min() and max() like this? To determine date range overlap, this is a simple reduction in logic to do so (WHERE max(both start times) <= min(both end times)). This is assuming that we're only looking for person-to-person direct contact. (we're working on adding an APOC temporal overlap function to make this even easier)

Currently the APOC max() and min() scaler functions work only on numbers and not temporal types, so until we have a fix in place we project out the epochMillis values for the dates when using these functions.

Lastly we're using coalesce() to supply a default ConfirmedDate if the value is null, which will reveal any persons that are sick but haven't yet been confirmed.

The ConfirmedDate check here may also be irrelevant depending on the data set. It may work well for China and South Korea and locations that had early and strenuous testing, assuming testing before symptoms emerged, but not in other places with a lower testing percentage or where tests weren't available immediately, or if the person just decided to wait late in the process to get tested. How you treat this will alter whether you're only looking for absolutely confirmed spread (only considering dates of spread for after the confirmation), or if you want to consider times before confirmation when they may have still been sick and contagious.

And this query yields:

│"pl"                           │"carrier"                              │
│{"name":"Shopping mall Area A"}│{"HealthStatus":"Sick","ID":2020020205}│

Bingo. Usr5 who is sick but not confirmed, present in the Shopping mall Area A at the same time as Usr1, which wasn't previously identified due to the multiple problems in the original query mentioned previously. Since unconfirmed they may not be a carrier, but a tweak to the query might report that uncertainty as well.

Let's move on to the next one:

3. Check who have an undirected connection with Usr1

The query from the site basically describes this path:


The date comparisons in that query are on what are labeled above as stay3 and stay4.

If we translate the query from the site directly to Cypher we get something like:

MATCH (p:Person {ID:2020020201})-[stay1:stay]->(pl:Place)<-[stay2:stay]-(carrier1:Person)-[stay3:stay]->(pl2:Place)<-[stay4:stay]-(carrier2:Person)
WHERE dateTime('2020-01-23T15:00:00') < stay1.start_time < dateTime('2020-01-23T23:00:00')
 AND carrier2.HealthStatus = 'Sick' AND carrier2.ConfirmedTime <= date("2020-01-23")
 AND stay4.start_time > stay3.start_time AND stay4.end_time < stay3.end_time
 RETURN carrier1, pl2, carrier2

I am omitting other variables to concentrate on the two potential carriers at the end of the path and where they met.

This yields no results. This is because are several flaws here from the original query:

  1. Again we're only starting our check from 15:00:00 on 1-23 (again this is likely by typo or error, not intention), so we can fix that easily to be 00:00:00.

  2. The inequalities between stay4 and stay3 are not inclusive of the bounds, which is a problem because stay4.start_time = stay3.start_time on the desired path (Usr 4 and Usr5 staying at the hotpot restaurant, working backwards from their answer later on the site), so to get correct results we must use >= and <=.

  3. The overlap logic isn't correct. This is only looking for a single kind of overlap, where (stay4 associated with carrier2, and stay3 associated with carrier1) carrier2 arrives before carrier1, and leaves before carrier1. This doesn't consider other overlapping scenarios, where maybe carrier1 arrives before carrier2, or maybe carrier1 leaves before carrier2. Since we're only looking at one of 4 different possible kinds of overlaps, we are omitting potential results from this query. As mentioned previously, using the reduced formula of WHERE max(both start times) <= min(both end times) will cover all possible overlaps, and should be used instead.

Because the overlap logic isn't corrected, the expected solution path (Usr4 as carrier2, Usr5 as carrier, Hotpot restaurant as pl2) cannot be found. Even if we tweak the inequalities to be inclusive of the bounds (which solves the problem of stay4.start_time = stay3.start_time at the Hotpot restaurant), stay4's endtime is at 20:00:00, and stay3's endtime is at 15:00:00, so there's no way that stay4.end_time < stay3.end_time.

Let's fix all the problems above:

MATCH (p:Person {ID:2020020201})-[stay1:stay]->(pl:Place)<-[stay2:stay]-(carrier:Person)-[stay3:stay]->(pl2:Place)<-[stay4:stay]-(carrier2:Person)
WHERE dateTime('2020-01-23T00:00:00') < stay1.start_time < dateTime('2020-01-23T23:00:00')
 AND carrier2.HealthStatus = 'Sick' AND carrier2.ConfirmedTime <= date("2020-01-23")
 AND apoc.coll.max([stay3.start_time.epochMillis, stay4.start_time.epochMillis]) <= apoc.coll.min([stay3.end_time.epochMillis, stay4.end_time.epochMillis])
 RETURN carrier, pl2, carrier2

With correct overlap logic in place, we get the expected result:

│"carrier"                              │"pl2"                                          │"carrier2"                                                          │
│{"HealthStatus":"Sick","ID":2020020205}│{"name":"Hotpot restaurant in Nankai District"}│{"HealthStatus":"Sick","ConfirmedTime":"2020-01-21","ID":2020020204}│

Again, the other variables are omitted for the moment to concentrate on the two main carriers, but you can add pl back into the return to see where the carrier may have spread the virus to Usr1.

And again, if we're looking for person-to-person transfer, then the query misses that overlap check between Usr1 and Carrier, and doesn't even check that carrier is sick, both of these checks were present in our earlier query in step 2, but were omitted here.. Maybe that's something that needs to be added back in.

4. Find out who needs to be isolated

This is about finding, from Usr1, which other users they have been in direct contact with since getting sick, so this is again about some simple pattern matching and application of overlap logic.

If we did a literal translation from the original query to Cypher, we would get this:

MATCH (p:Person {ID:2020020201})-[stay1:stay]->(pl:Place)<-[stay2:stay]-(shouldIsolate:Person)
WHERE stay2.start_time < stay1.start_time AND stay2.end_time < stay1.end_time
 RETURN pl, shouldIsolate

This should get us the persons who may have been infected and the place of infection, but again there are several things wrong with this query and we get 0 results:

  1. We don't have the result data in the provided dataset. According to the original site,

We can see that Usr1 and Usr2 have connected with each other in a Community A in Heping District, Tianjin, which made Usr2 a suspicious case.

But we don't have any data that connects Usr2 to Community A in Heping District. Look at User2 data on the site, that data is missing, it only shows them at Shopping Mall Area A. I can't find any other link on the page to a fuller dataset, so there's nothing to be done.

  1. The inequality bounds are wrong again. They're still exclusive, they need to be inclusive.

  2. The overlap logic is wrong again. The original query is only looking for one particular kind of overlap, where the shouldIsolate person arrived before Usr1, and left before Usr1, leaving out 3 other kinds of overlap that could happen.

Let's fix this.

First let's address the inequality bounds and overlap logic, exactly the same was as we did above:

MATCH (p:Person {ID:2020020201})-[stay1:stay]->(pl:Place)<-[stay2:stay]-(shouldIsolate:Person)
WHERE apoc.coll.max([stay1.start_time.epochMillis, stay2.start_time.epochMillis]) <= apoc.coll.min([stay1.end_time.epochMillis, stay2.end_time.epochMillis])
 RETURN pl, shouldIsolate

This returns us something more expected:

│"pl"                           │"shouldIsolate"                                                     │
│{"name":"Shopping mall Area A"}│{"HealthStatus":"Sick","ID":2020020205}                             │
│{"name":"Shopping mall Area A"}│{"HealthStatus":"Sick","ConfirmedTime":"2020-01-25","ID":2020020202}│

We see our connection to our prior supposed spreader, Usr5, as well as the expected result of Usr2, who we note is now sick and was confirmed later at 1-25. They probably know by now to isolate, but had this data been available sooner, perhaps a warning may have reached this individual sooner, preventing any additional spread while they were infected and contagious but undiagnosed.

So since we see Usr2 in here, perhaps the error with the dataset we noted above isn't really a dataset omission...perhaps it was an error in the place name in the expected results text I quoted earlier. Perhaps it was meant to be Shopping Mall Area A instead. That would make sense.

We've fixed the overlap logic, but perhaps there's more to be done here. If we can positively identify when Usr1 was likely infected, then we can further narrow the dates in the query so that we exclude stay dates from before they were infected. As it is now, this query would be used to warn everyone that the user has been in contact with, including contact from before they were infected, and that may be far too wide of a net to throw.

1 Like

I added an analysis here from the site, but I see no sign of query-6, I only see up to query-4.