In particular, if I use fastRPExtended to generate embeddings for nodes with 256 dimensions, named 'graphEmbedding', does it make sense to use 'graphEmbedding' as the parameter of the KNN agorithm, nodeWeightProperty? The following example documentation uses a single float 'age'.

In the original paper, 'Efficient k-nearest neighbor graph construction for generic similarity measures', it discussed the parameter K and D:

it experimented on several K values: 10, 20, 30, 40, 50, and recommend K=20 by default.

It experimented on 3 sets of D values, (2,5), (10,20), (50, 100), and it says their algorithm best suits for D=20 and K=20 too.

Is K in the paper the topK in neo4j KNN?
Is D in the paper the dimensionality of the nodeWeightProperty in neo4j KNN? If that's the case, it doesn't make sense to use a dimension of 256 to graphEmbedding for this property value? Because the optimal D=20 according to the paper.

Please clarify the implementation of KNN in GDS. Thanks.

Yes, the K parameter in the paper is the same as the topK parameter in the GDS algorithm.

The D parameter in the paper represents the intrinsic dimensionality of the dataset. Intrinsic dimensionality is a measure of the complexity of the dataset, and it is different from the dimensionality of the input features in the dataset. Table 1 in the paper shows you the dimensionality of the datasets that the authors worked with. All but one of them were greater than 20.

I think you should have good recall with 256 dimension vectors as the nodeWeightProperty and topK = 20.

A related question regarding the choice of K: for the general KNN algorithm, the sqrt(N) is considered as the best K value. If the graph have 1 million nodes, the K would be 1000, which is huge compared with the default K=20 in the Dong's paper. So we shouldn't use sqrt(N) but use a much smaller K as described in the paper?

When we talk about KNN, we are often thinking of the classification machine learning algorithm. We find the labels of the k-nearest neighbor to a point with an unknown label, and then infer the value of the unknown label based on the majority class of the neighbors. The K = sqrt(n) rule of thumb often turns out to work well for that purpose.

The Neo4j implementation of KNN is not a classifier. We are using the algorithm to find the nearest neighbors to each node, and usually creating relationships to those nodes. Without checking every single combination of nodes in the graph, the algorithm is able to efficiently give us approximately the k-closest neighbors to each node. Dong's algorithm doesn't have anything to say about node labels. The target is recall, in other words, the percentage of each nodes true nearest neighbors returned by the algorithm.

The right K to optimize recall without running for too long will depend on your use case and the complexity of your dataset. The values of K used in Dong's paper were much smaller than sqrt(N) for the datasets shown in table 1. If you are using graph embeddings of dimension 256, I would think your recall would be good at K=20, similar to the dense vectors in the Corel, Audio, and Shape data sets that Dong tested.