My goal is to compute the similarity between nodes. To do that, I first take the relevant properties from each node and pass them to Python code that generates a feature vector of dimension (200k, 1) per node. In Python, the feature vector comes out in sparse format. However, I can't store sparse data directly as a node property in Neo4j: I either have to convert it to dense format (which takes up a lot of space) or save it as two node properties, fv_index and fv_value.
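The split into two lists can be sketched as follows. This is a minimal illustration, assuming the feature vector comes out of the pipeline as a SciPy sparse column vector (the helper name to_index_value_lists is mine, not part of any library):

```python
# Sketch: split a SciPy sparse column vector into parallel index/value lists,
# suitable for storing as two Neo4j node properties (fv_index, fv_value).
from scipy.sparse import csr_matrix

def to_index_value_lists(fv):
    """Return (indices, values) for the nonzero entries of a sparse column vector."""
    coo = fv.tocoo()  # COO format exposes row indices and data arrays directly
    return coo.row.tolist(), coo.data.tolist()

# Toy (10, 1) vector with nonzeros at rows 3 and 7
fv = csr_matrix(([1.0, 2.5], ([3, 7], [0, 0])), shape=(10, 1))
idx, val = to_index_value_lists(fv)
print(idx, val)  # [3, 7] [1.0, 2.5]
```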
This trick saves space but creates a new problem. For similarity computation, I can't use the fv_value property directly, because for any two nodes the values have to correspond to the same indices. So I tried to do the following in Cypher.
Example:
A.fv_index = [1,2,3], A.fv_value = [0,0,1]
B.fv_index = [2,3,4], B.fv_value = [1,1,0]
Modified indices and values for similarity computation:
index = (A.fv_index) U (B.fv_index) = [1,2,3,4]
A.value = [0,0,1,0] and B.value = [0,1,1,0]
Essentially, I add a zero entry for every index that is present in one node but missing in the other.
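The alignment step above can be sketched in plain Python (a minimal illustration of the idea, not the Cypher implementation; the helper names align and cosine are mine):

```python
# Sketch: align two sparse (index, value) lists over the union of their
# indices, filling missing entries with 0, then compute cosine similarity.
import math

def align(idx_a, val_a, idx_b, val_b):
    da, db = dict(zip(idx_a, val_a)), dict(zip(idx_b, val_b))
    union = sorted(set(idx_a) | set(idx_b))  # union of indices, e.g. [1,2,3,4]
    return [da.get(i, 0) for i in union], [db.get(i, 0) for i in union]

def cosine(xs, ys):
    dot = sum(x * y for x, y in zip(xs, ys))
    na = math.sqrt(sum(x * x for x in xs))
    nb = math.sqrt(sum(y * y for y in ys))
    return dot / (na * nb) if na and nb else 0.0

# The example from the text
a_vals, b_vals = align([1, 2, 3], [0, 0, 1], [2, 3, 4], [1, 1, 0])
print(a_vals, b_vals)  # [0, 0, 1, 0] [0, 1, 1, 0]
```

Note that for cosine similarity specifically, the union is not strictly needed: the dot product only involves indices present in *both* vectors, and each norm only needs that node's own values, so the zero-filled vectors never have to be materialized.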
Now, this approach takes a long time: roughly 20 seconds to compute the similarity of a single node against ~30k nodes, which is far too slow if I want to do it for all 30k nodes.
MATCH (p1:Post) WHERE p1.Id = "8n"
MATCH (p2:Post)
WITH apoc.map.fromLists(p1.fv_index, p1.fv_value) AS x,
     apoc.map.fromLists(p2.fv_index, p2.fv_value) AS y, p1, p2
WITH apoc.map.merge(x, y) AS m, x, y, p1, p2
WITH apoc.map.mget(x, keys(m), [i IN range(0, size(keys(m)) - 1) | 0]) AS x1,
     apoc.map.mget(y, keys(m), [i IN range(0, size(keys(m)) - 1) | 0]) AS y1, p1, p2
WITH algo.similarity.cosine(x1, y1) AS similarity, p1, p2
WHERE similarity > 0.7
RETURN count(p2.Id)
Q: How do I store the sparse data efficiently, and/or how do I significantly improve the similarity computation time?