Playing with NLP and doing some simple metric calculations for TF-IDF (term frequency - inverse document frequency).

The query and calculations are very straightforward.
The problem is that the answers from Neo4j are wrong.

MATCH (:patent{num:'7547899'})<-[r:Is_in]-(a:Word)-[c:Is_in]->(b:patent)
RETURN a.term, r.count as Num, count(b), sum(c.count), r.count*(log(358/count(b))) as TFidf ORDER BY Num DESC

The first item is pretty obvious, but I double-checked the others. I exported the results and recalculated them within JMP (my standard statistical package). Using:

The results are close except for the first entry, but I was expecting much closer given it is a simple calculation.
Andy

I wonder if there's a rounding error somewhere in the Neo4j calculation. Sometimes I find that I need to multiply by 1.0 to make it do double-based calculations properly.

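So if we take the numbers from the first row, 358/225 is evaluated as integer division and gives 1, so the query is actually computing the log of 1, which is 0.

Whereas with the *1.0 it's computing the log of roughly 1.59, which is non-zero.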

Update your query to read like this:

MATCH (:patent{num:'7547899'})<-[r:Is_in]-(a:Word)-[c:Is_in]->(b:patent)
RETURN a.term, r.count as Num, count(b), sum(c.count), r.count*(log(358*1.0/count(b))) as TFidf ORDER BY Num DESC

Thanks, that resolved it. I was more concerned by the first returned value, since the difference seemed much greater than just rounding error. The fix seems to have resolved both concerns.
Thanks
Andy

This is a side effect of how Java interprets number types. If both the numerator and the denominator are integers, it performs integer division. If either of them is a decimal, it uses doubles and performs floating-point division. Integer division discards the fractional part of the quotient, which can lead to unwanted rounding errors.
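To make the difference concrete, here is a minimal sketch in Python rather than Cypher or Java. It assumes that Python's `//` operator behaves like integer division for positive operands and that `math.log` (natural log) matches the `log()` used in the query. The values 358 and 225 are the ones quoted in this thread for the first row; the variable names are only illustrative.

```python
import math

total_docs = 358  # corpus size hard-coded in the query
doc_count = 225   # count(b) for the first row, as quoted in this thread

# Integer division: the fractional part is discarded before log() is applied.
int_ratio = total_docs // doc_count

# Floating-point division, analogous to writing 358*1.0/count(b) in the query.
float_ratio = total_docs * 1.0 / doc_count

idf_int = math.log(int_ratio)
idf_float = math.log(float_ratio)

print("integer division:", int_ratio, "-> log =", idf_int)
print("float division:  ", float_ratio, "-> log =", idf_float)
```

With integer operands the ratio collapses to 1 and the log is exactly 0, whatever the term count is multiplied in afterwards.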

I think that is probably the root cause, but I am still a bit concerned if you look at the first result returned. The query returned 0.0 when the correct value was 3.25.... This is much more than a rounding error.

If your hypothesis is correct, I am guessing this calculation does integer math to get 358/225 = 1 and then does log(1), which equals 0. It looks like all the other count(b) values are less than half of 358, so their integer quotients were at least 2 and they returned a non-zero number.
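The reasoning above can be checked with a short sketch, again in Python as a stand-in for the query's arithmetic. It assumes the integer quotient is what gets passed to the log; `truncated_idf` and `threshold` are hypothetical names introduced only for this illustration.

```python
import math

TOTAL_DOCS = 358  # corpus size hard-coded in the query


def truncated_idf(doc_count):
    """Log of the integer quotient, as computed when both operands are integers."""
    return math.log(TOTAL_DOCS // doc_count)


# Smallest document count at which the integer quotient drops to 1,
# which is the point where the truncated IDF becomes exactly 0.
threshold = next(n for n in range(1, TOTAL_DOCS + 1) if TOTAL_DOCS // n == 1)

print("quotient becomes 1 at count(b) =", threshold)
print("truncated IDF for 225:", truncated_idf(225))
print("truncated IDF for 179:", truncated_idf(179))
```

Under these assumptions, any word appearing in more than half of the 358 patents would get a TF-IDF of 0 from the original query, while less common words get a value that is too low rather than zero.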
Andy

Yeah, sorry, maybe calling it a rounding error trivialises it too much. I think you are right that it's doing integer division when that isn't really what we want it to do. I find myself multiplying everything by 1.0 to make sure that floating-point calculations are being done!