Node Similarity algorithm for comparing second- and third-level relationships

I am trying to find similar nodes based on second- and third-level relationships created with the GraphAware NLP annotate text API.

My nodes (say, News) have no direct relationships to one another; they are related, and therefore similar, only through Tag nodes three levels down.

Can we use algo.nodeSimilarity for this purpose?

Also, I am trying to understand the graph parameter of this algorithm; there is not much information in the wiki.

```
neo4j> CALL algo.nodeSimilarity('Match(p:News) return id(p) limit 10', 'Match p1=((p:News)-[h:HAS_ANNOTATED_TEXT]->(:AnnotatedText)-[c:CONTAINS_SENTENCE]->(s:Sentence)-[h1:HAS_TAG]->(t:Tag)) return p1', {
         direction: 'OUTGOING',
         write: false,
         topK: 2,
         topN: 10,
         similarityCutoff: 0.7,
         concurrency: 2
       })
       YIELD nodesCompared, relationships, write, writeRelationshipType, writeProperty, p1, p50, p99, p100;
Failed to invoke procedure `algo.nodeSimilarity`: Caused by: java.lang.IllegalArgumentException: No column "id" exists
```

Any pointers on this would be helpful.

Yes! Node similarity is intended to help you identify how similar nodes are based on their neighbors (using the Jaccard similarity scoring function). Although node similarity is intended to work on a bipartite graph, you can use a Cypher projection to compare second and third degree neighbors (or just add a relationship in the graph directly, if you need something that is performant on large datasets).
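As a rough illustration of the Jaccard score mentioned above, here is a minimal sketch in plain Python. It is not the library's implementation, and the News identifiers and tag sets are made up for the example; each set stands for the Tag nodes reachable from one News node.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B| (0.0 if both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical tag sets reached from two News nodes via the three-hop path.
news_tags = {
    "news1": {"election", "economy", "budget"},
    "news2": {"election", "economy", "sports"},
}

# Two shared tags out of four distinct tags overall.
score = jaccard(news_tags["news1"], news_tags["news2"])
print(score)  # 0.5
```

Two News nodes with no tags in common score 0.0, which is why it matters that the projection connects each News node to its tags.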

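For the projection, the first Cypher statement defines the pool of nodes being considered (so it must return both the source and the target nodes, with their IDs in a column named id) and the second defines the relationships, returned as source and target IDs. Try something like: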

`MATCH (p) WHERE p:News OR p:AnnotatedText OR p:Sentence OR p:Tag RETURN id(p) as id`,
`MATCH (p:News)-[h:HAS_ANNOTATED_TEXT]->(:AnnotatedText)-[c:CONTAINS_SENTENCE]->(s:Sentence)-[h1:HAS_TAG]->(t:Tag) RETURN id(p) as source, id(t) as target`

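Specifying graph:'Cypher' in the configuration map tells the procedure that you're using a Cypher projection, that is, that the first two arguments are Cypher statements rather than a node label and relationship type or a named graph.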
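To make the effect of the relationship statement concrete, here is a small plain-Python sketch of the same idea: every News-to-Tag path of three hops is collapsed into one (source, target) pair. The dictionaries and identifiers are hypothetical stand-ins for the graph, and the sketch removes duplicate pairs, which the thread does not say the projection does.

```python
# Hypothetical adjacency data standing in for the graph, one dict per relationship type.
has_annotated_text = {"news1": ["at1"], "news2": ["at2"]}
contains_sentence = {"at1": ["s1", "s2"], "at2": ["s3"]}
has_tag = {"s1": ["election"], "s2": ["economy", "election"], "s3": ["election"]}


def project_news_to_tags():
    """Collapse News -> AnnotatedText -> Sentence -> Tag paths into (news, tag) pairs."""
    pairs = set()  # a set, so a tag reached through several sentences counts once
    for news, texts in has_annotated_text.items():
        for text in texts:
            for sentence in contains_sentence.get(text, []):
                for tag in has_tag.get(sentence, []):
                    pairs.add((news, tag))
    return sorted(pairs)


pairs = project_news_to_tags()
print(pairs)
```

The resulting pairs form a bipartite News/Tag graph, which is the shape node similarity expects.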

My only caution with using the Cypher projection is that this might be rather slow on a large graph. If that becomes a problem, you can speed it up by adding a direct relationship between News and Text and then using the huge graph loader.

Thank you Alicia.

I did figure out the Cypher projection. But as you said, its performance lags.

I have 20 million primary nodes and close to 1 million related nodes.

I am looking for a performance-oriented way to identify communities and then compute node similarity as well.

Instead of using a Cypher projection, create the relationship directly in your database, for example (p:News)-[a:AssociatedWith]->(t:Tag). Then you can bypass Cypher and use the huge graph loader directly: CALL algo.nodeSimilarity('News|Tag', 'AssociatedWith'). You'll notice that this can be orders of magnitude faster (for any algorithm, not just the similarity ones).

Other performance tips include:

  • For similarity algorithms, try preprocessing with algo.unionFind (aka WCC), which will identify all the disjoint subgraphs, and then run similarity over the separate partitions. This saves you from comparing nodes that have no neighbors in common.
  • Load a named graph once (with algo.graph.load) and then run your algorithms against that named graph - this saves you the time spent loading each time you want to run an algo.
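The partitioning tip can be sketched in plain Python as follows. This is only an illustration of the union-find idea, not how algo.unionFind is implemented, and the edge list is invented for the example.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into weakly connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical News -> Tag edges forming two disjoint subgraphs.
edges = [
    ("news1", "election"),
    ("news2", "election"),
    ("news3", "cricket"),
    ("news4", "cricket"),
]

components = connected_components(edges)
print(components)
```

Here news1 and news2 land in one partition and news3 and news4 in another, so similarity only needs to be computed within each partition, never across them.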

If you're using Community Edition (CE), there will be some limits to how performant it will be. All the algorithms are implemented to run in parallel, but there is a four-core limitation for CE users.

Yes, I am on CE. Once the PoC succeeds, we will be moving to EE.

Though I have only mentioned News and Tag, I have a few more node types connected to these, such as location, reporter, and news channel.

As per your recommendation, creating a new relationship type that links all these node types directly should improve performance, right? Let me take a look at that approach.

I have a similar case with recipes and ingredients, plus a taxonomy of ingredients, recipe categories, and some more. @aneeshmonn Did you manage to solve your use case successfully? How did you finally do it? I am curious.