I was wondering if anyone has done a project that involves 2 independent sets of nodes.
The plan is to use a similarity algo to find the Jaccard score between my POSITIONS_1 nodes (1st day of data) and POSITIONS_2 nodes (2nd day of data).
Both sets of nodes contain the same type of information: latitude, longitude and time (string format).
In addition, I am planning to use it at production scale.
It depends how big "production scale" is - and how quickly you want the computation to finish.
Node Similarity is a brute force similarity algorithm, and uses jaccard similarity to score nodes based on neighbors. To use that algorithm, you'd need the information you're comparing (latitude, longitude, time) to be nodes.
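To make the neighbor-based scoring concrete, here is a minimal sketch of the Jaccard calculation in plain Python. It is not the GDS implementation; the grid-cell names are hypothetical and only stand in for whatever nodes your position nodes would be connected to.

```python
def jaccard(a, b):
    """Jaccard similarity of two neighbor sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical neighbor sets: the grid cells each position node is linked to.
day1_neighbors = {"cell_12_40", "cell_12_41", "cell_13_41"}
day2_neighbors = {"cell_12_41", "cell_13_41", "cell_14_42"}

# 2 shared cells out of 4 distinct cells overall.
score = jaccard(day1_neighbors, day2_neighbors)
```

The score only depends on shared neighbors, which is why the latitude, longitude and time values would have to exist as nodes rather than properties.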
You could also use K Nearest Neighbors, which is an approximate similarity algorithm that uses cosine similarity to compare nodes based on properties. It's much faster than Node Similarity (because it doesn't default to comparing every node with every other node).
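For reference, this is roughly what cosine similarity on property vectors looks like, sketched in plain Python. It only illustrates the metric: the `top_k` helper is a hypothetical brute-force ranking for small inputs, not the approximate neighbor search that makes K Nearest Neighbors fast.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def top_k(query, candidates, k):
    """Brute-force ranking of candidates (name -> vector) by similarity to query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

One thing to keep in mind: raw latitude/longitude pairs from the same region all point in nearly the same direction, so their cosine scores will be very close to 1 unless the values are rescaled first.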
Thanks for the advice,
Regarding "production scale": at least 1 million rows of data are fed in on a monthly basis, and ideally the computation should take less than 2 min to finish.
Currently I am applying the methodology from this article
Build a similarity graph based on node properties, even when no relationships are present in the data model.
In short, I replaced the info used in the article above and created the new metric based on latitude and longitude, since they are float numbers.
After the similarity relationships are created, I used a centrality algo to filter out most of the interconnected nodes, so I can focus on the nodes with the desired latitude and longitude.
My main objective is to find common stop locations for logistics purposes.
Not sure if my approach is right for this use case.
Any advice would be appreciated
Actually I do have a concern.
Can the graph algorithms in Neo4j process point and datetime data types?
In a lot of the tutorials and lessons I have seen, it seems that the algorithms can only process float or integer formats for geospatial data, which is what I am dealing with now.
Does the statement above make sense? Some further advice on this would be appreciated.
For GDS, you'll want to convert your date/time data into a numerical format that GDS can interpret.
@jennifer.reif has a great blog post on handling time data in Neo4j that's worth taking a look at: Cypher Data Formats: Dealing with Dates, Part 1 | Neo4j
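As a rough idea of what "numerical format" can mean for the time strings, here is a small Python sketch that turns a timestamp string into epoch seconds before loading. It assumes the strings are ISO 8601 (e.g. `2021-01-01T00:00:00`) and that timestamps without an offset are UTC; both are assumptions, since the thread doesn't say what the actual string format is.

```python
from datetime import datetime, timezone


def to_epoch_seconds(ts):
    """Convert an ISO 8601 timestamp string to seconds since 1970-01-01 UTC."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()
```

The resulting float can be stored as an ordinary numeric node property.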
For lat/long data, the only algorithm that can explicitly use lat/long data is A* (pathfinding). If you want to use lat/long for a similarity comparison, it can be compared as two numerical values, but we don't have any concept of spatial similarity built into the library.
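If a spatial measure is needed, one option is to compute it outside the library. The sketch below uses the haversine great-circle distance, which is a swapped-in technique and not something GDS provides; the Earth radius constant is the usual mean value of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Two positions could then be treated as the same stop when the distance falls under a threshold you pick for your data.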
Greatly appreciate the advice.
Currently my Graph has 5 mil nodes.
I tried running K Nearest Neighbors in the Graph Data Science Playground (NEuler).
However, I ran into the error shown below:
Algorithm failed to complete
Error: Neo4jError: Unable to start new transaction since limit of concurrently executed transactions is reached. See setting dbms.transaction.concurrency.maximum
Any suggestions on how to resolve this issue?
That looks like a database setting.
You'll want to go into the config file (instructions here: https://neo4j.com/docs/operations-manual/current/configuration/file-locations/#file-locations) and change the dbms.transaction.concurrency.maximum setting to a higher value, or only run one thing at a time.