"The library contains both procedures" is a bit confusing


[edit] After re-reading this a bunch of times, I now realize that what's confusing is that the intro sentence needs to be more expansive. Now I'm not sure how it should be worded... [end-edit]

There is this sentence, which I found confusing because it hints at things not yet described. It's vague in terms of names and numbers:

The library contains both procedures and functions to calculate similarity between sets of data. The function is best used when calculating the similarity between small numbers of sets. The procedures parallelize the computation, and are therefore more appropriate for computing similarities on bigger datasets.

I suggest that this intro sentence summarize that there are read functions that calculate similarity, that there are write procedures that can write out the inter-relationships, and that there is a serial version as well as a parallel version for large data sets.

Or something.... I'm still a tad confused. Sorry. [I might edit this again, when I better understand what's going on...]

The Fantasy-SciFi category overlap example confuses me...

Fantasy is not a subset of SciFi, and vice versa. E.g. "The Once and Future King" is clearly (to me) in the Fantasy genre. I can imagine (though I can't think of an example) a very hard-core SciFi book that is nearly non-fiction and therefore contains no elements of Fantasy. In any case, it can be true that you have two categories that intersect and aren't proper subsets of each other: e.g. Classics and SciFi.

What is not mentioned in this doc is how the procedure decides which category to make the "From" and which to make the "To", and what happens if there is a tie (in the counts of the two set sizes). The doc should note that, with stream, count1 <= count2.
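To make the question concrete, here is a minimal Python sketch of what I think is going on. It assumes the standard overlap coefficient, |A ∩ B| / min(|A|, |B|), and it assumes that the smaller set becomes the "From" side. The tie-break shown (keep the input order) is purely my guess, which is exactly the part the doc should spell out. The names `overlap` and `directed_pair` and the book ids are made up for illustration; they are not GDS APIs or data.

```python
def overlap(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # assumption: an empty set scores 0 rather than raising
    return len(a & b) / min(len(a), len(b))


def directed_pair(name1, set1, name2, set2):
    """Order a pair so that the smaller set comes first (count1 <= count2)."""
    pairs = [(name1, set(set1)), (name2, set(set2))]
    # Assumption: on a tie the input order is kept (sorted() is stable).
    (from_name, from_set), (to_name, to_set) = sorted(pairs, key=lambda p: len(p[1]))
    return {
        "from": from_name,
        "to": to_name,
        "count1": len(from_set),
        "count2": len(to_set),
        "intersection": len(from_set & to_set),
        "similarity": overlap(from_set, to_set),
    }


# Hypothetical book ids: the two categories intersect, neither is a subset.
classics = {1, 2, 3}
scifi = {3, 4, 5, 6}
row = directed_pair("SciFi", scifi, "Classics", classics)
```

With these made-up sets, `row` puts Classics on the "From" side (3 books) and SciFi on the "To" side (4 books), with a similarity of 1/3. Because the intersection can never be larger than the smaller set, the result always stays between 0 and 1.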

This "section includes" needs to be broken out to include the write section.

Also not clear from the documentation is that the function gds.alpha.similarity.overlap can take ONLY two lists, whereas the procedure gds.alpha.similarity.overlap.stream can take two or more. (For me, the extra stream component in the name doesn't hint at that difference enough, so the documentation needs to help with this gap.)
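As I understand the difference, it looks roughly like the Python sketch below: one call scores exactly two lists, the other takes a whole collection and produces one row per pair. This is only an illustration of the shape of the two calls. `overlap`, `overlap_all_pairs` and the sample data are hypothetical names of mine, and the sketch says nothing about how GDS actually orders or parallelizes the pairs.

```python
from itertools import combinations


def overlap(a, b):
    """Two-list form: overlap coefficient of exactly two collections."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # assumption: an empty set scores 0
    return len(a & b) / min(len(a), len(b))


def overlap_all_pairs(named_sets):
    """Many-list form: yield (item1, item2, similarity) for every pair."""
    for (name1, set1), (name2, set2) in combinations(named_sets.items(), 2):
        yield name1, name2, overlap(set1, set2)


# Hypothetical categories mapped to made-up book ids.
categories = {
    "Fantasy": {1, 2},
    "SciFi": {2, 3, 4},
    "Classics": {4, 5},
}

single = overlap(categories["Fantasy"], categories["SciFi"])
rows = list(overlap_all_pairs(categories))
```

Here `single` is one number, while `rows` holds three rows (one per pair of the three categories). A sentence in the doc making that distinction would have saved me some time.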

Also... the documentation hints that the procedures can run in parallel, but it doesn't specify which ones. I wonder whether the write procedure is indeed parallel, as there could be race conditions if the underlying library isn't carefully programmed.

This is obvious, but there are a lot of newbies trying out Neo4j, so IMHO it's worth mentioning that 0 <= similarity <= 1.0.

Also not mentioned is that the YIELD'ed values item1 and item2 are Neo4j ids and that the function gds.util.asNode() is used to convert an id to a Node. (I had to look up gds.util.asNode.) [added] I see that item1 and item2 are described at the bottom of the page. I think it would be clearer if item1 and item2 were described where they are first used.

Another point for clarification (I believe) is that the values for max and p95 in the gds.alpha.similarity.overlap.write example are 1.0000038146972656. I presume this is due to a floating-point math issue and that the value should be 1.0. (Perhaps the code needs to enforce a maximum of 1.0?)
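What I mean by enforcing a maximum is something like the Python sketch below. The helper `clamp_similarity` is hypothetical and not part of GDS. As a side note, the reported value is exactly 1 + 2**-18 as a double, which may be a clue for whoever looks into it; the sketch doesn't try to identify the cause.

```python
def clamp_similarity(value, lo=0.0, hi=1.0):
    """Force a similarity score back into the range [lo, hi]."""
    return max(lo, min(hi, value))


reported = 1.0000038146972656  # max / p95 value shown in the write example
clamped = clamp_similarity(reported)

# Excess over 1.0, for whoever wants to track down where it comes from.
excess = reported - 1.0
```

Clamping only hides the symptom, so if the overshoot comes from how the summary statistics are computed, a note in the doc that max and p95 are approximate would also do.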

Also... the gds.alpha.similarity.overlap.stats example doesn't show the output.