Yes! Node similarity is intended to help you identify how similar nodes are based on their neighbors, using the Jaccard similarity scoring function (the size of the intersection of two nodes' neighbor sets divided by the size of their union). Although node similarity is intended to work on a bipartite graph, you can use a Cypher projection to compare second- and third-degree neighbors (or just add a relationship in the graph directly, if you need something that performs well on large datasets).
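To make the scoring concrete, here is a minimal Python sketch of Jaccard similarity over neighbor sets. It is only an illustration of the formula, not the library's implementation, and the node ids and tags are made up.

```python
from itertools import combinations

# Toy bipartite data (hypothetical): each news item maps to the set of tags it reaches.
neighbors = {
    "news1": {"politics", "economy", "europe"},
    "news2": {"politics", "economy", "asia"},
    "news3": {"sports"},
}


def jaccard(a, b):
    """Return |A & B| / |A | B|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pairwise_similarity(neighbors):
    """Score every unordered pair of source nodes by the overlap of their neighbor sets."""
    return {
        (u, v): jaccard(neighbors[u], neighbors[v])
        for u, v in combinations(sorted(neighbors), 2)
    }


scores = pairwise_similarity(neighbors)
for pair, score in scores.items():
    print(pair, score)
```

Here news1 and news2 share two of four distinct tags, so they score 0.5, while news3 shares nothing with either and scores 0.0.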
For the projection, the first Cypher query defines the pool of nodes being considered (so it needs to cover both source and target nodes) and the second defines the relationships between them. Try something like:
`MATCH (p) WHERE p:News OR p:AnnotatedText OR p:Sentence OR p:Tag RETURN id(p) AS id`,
`MATCH (p:News)-[:HAS_ANNOTATED_TEXT]->(:AnnotatedText)-[:CONTAINS_SENTENCE]->(:Sentence)-[:HAS_TAG]->(t:Tag) RETURN id(p) AS source, id(t) AS target`
Specifying `graph: 'cypher'` in the config map tells the procedure that its first two arguments are Cypher queries (a Cypher projection) rather than a named graph.
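To see what the relationship query hands to the algorithm, here is a small Python sketch that collapses the News → AnnotatedText → Sentence → Tag path into direct (source, target) pairs. The adjacency data and ids are hypothetical. One simplification to note: the Cypher query returns one row per matching path, so the same pair can appear more than once, whereas this sketch deduplicates with a set.

```python
# Toy adjacency lists (hypothetical ids) standing in for the three relationship types.
has_annotated_text = {"n1": ["a1"], "n2": ["a2"]}
contains_sentence = {"a1": ["s1", "s2"], "a2": ["s3"]}
has_tag = {"s1": ["t1"], "s2": ["t1", "t2"], "s3": ["t2"]}


def news_tag_pairs(has_annotated_text, contains_sentence, has_tag):
    """Walk News -> AnnotatedText -> Sentence -> Tag and keep only the endpoints."""
    pairs = set()
    for news, texts in has_annotated_text.items():
        for text in texts:
            for sentence in contains_sentence.get(text, []):
                for tag in has_tag.get(sentence, []):
                    pairs.add((news, tag))
    return sorted(pairs)


pairs = news_tag_pairs(has_annotated_text, contains_sentence, has_tag)
print(pairs)
```

The resulting pairs form the bipartite News/Tag graph that the similarity scoring is computed on; the intermediate AnnotatedText and Sentence nodes no longer play a role.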
My only caution with using the Cypher projection is that it might be rather slow on a large graph. If that becomes a problem, you can speed things up by adding a direct relationship between News and Tag and then using the huge graph loader.
Instead of using a Cypher projection, create the relationship directly in your database, for example `(p:News)-[a:AssociatedWith]->(t:Tag)`. Then you can bypass Cypher and use the huge graph loader directly: `CALL algo.nodeSimilarity('News|Tag', 'AssociatedWith')`. You'll notice that this can be orders of magnitude faster (for any algorithm, not just the similarity ones).
Other performance tips include:
For similarity algorithms, try preprocessing with algo.unionFind (aka WCC), which will identify all the disjoint subgraphs, and then run similarity over the separate partitions. This saves you from comparing nodes in different partitions, which cannot have any neighbors in common.
Load a named graph once (with algo.graph.load) and then run your algorithms against that named graph. This saves you the time spent loading the graph each time you want to run an algorithm.
If you're using Community Edition, there will be some limits on performance. All the algorithms are implemented to run in parallel, but there is a four-core limit for CE users.
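The WCC preprocessing tip above can be sketched in plain Python. This is a toy union-find over made-up edges, not the library's algo.unionFind, and it only shows why partitioning first shrinks the number of candidate pairs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (news, tag) edges forming two disconnected subgraphs.
edges = [("n1", "t1"), ("n2", "t1"), ("n2", "t2"), ("n3", "t3"), ("n4", "t3")]


def find(parent, x):
    """Return the representative of x's set, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def components(edges):
    """Group all nodes into weakly connected components with union-find."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    groups = defaultdict(set)
    for node in parent:
        groups[find(parent, node)].add(node)
    return list(groups.values())


def candidate_pairs(edges):
    """Only pair up source nodes that fall in the same component."""
    sources = {a for a, _ in edges}
    pairs = set()
    for component in components(edges):
        pairs.update(combinations(sorted(component & sources), 2))
    return pairs


all_pairs = set(combinations(sorted({a for a, _ in edges}), 2))
within = candidate_pairs(edges)
print(len(all_pairs), "pairs without partitioning,", len(within), "with")
```

With four news nodes there are six possible pairs, but only (n1, n2) and (n3, n4) sit in the same component, so the other four comparisons, which could only ever score zero, are skipped.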
Yes, I am on CE. Once the PoC is successful, we will be moving to EE.
Though I have only mentioned News and Tag, I have a few more node types connected to these, such as location, reporter, news channel, etc.
As per your recommendation, creating a new relationship type that links all these node types directly will improve performance, right? Let me take a look at that approach.
I have a similar case with recipes and ingredients, plus a taxonomy of ingredients and recipe categories, and some more. @aneeshmonn Did you manage to solve your use case successfully? How did you do it in the end? I am curious.