Are there any plans to support sort order in fulltext search index?

Hi all,

Context

Currently we are evaluating using Full-text search (FTS) index for some of our use cases. In one of the use cases we want to show data matching hashtag and results are sorted by a property (creation time) and we were not able to sort by property and limit the results with FTS index query method (db.index.fulltext.queryNodes).

Here is simplified data model:

MERGE (g:Group {id: string})
MERGE (a:Article {
   content: string, group_id: string,
   created_at: unixtimestamp,
   created_at_str: "20210819T1200"})
MERGE (g)-[:CONTAINS]->(a)

What we have tried

Example case: group1 has 90.000 articles, group2 has 10.000 articles

// index
call db.index.fulltext.createNodeIndex("fts__article",
    ["Article"], ["content", "group_id", "created_at_str"]);

// query most recent articles with very common hashtag
// (lets assume this hashtag is available in all articles,
// neo4j is sorting 100k articles outside of full text search index
with 'group_id: (group1 OR group2) AND content: "#verycommonhashtag"' as query
call db.index.fulltext.queryNodes("fts__article", query)
yield node as a, score
return a
order by a.created_at desc
limit 100

Boosting

Imagine now we have 1M matching articles and query is getting slower and slower because of sorting 1M data.

We have considered boosting results by creation time and limit results to order by doing:

with 'group_id: (group1 OR group2) AND content: "#verycommonhashtag" AND (created_at_str: 202108*^10 OR created_at_str: 202107*^8 OR created_at_str:202*^1)' as query
call db.index.fulltext.queryNodes("fts__article", query, {limit: 1000}
yield node as a, score
return a
order by a.created_at desc
limit 100

Problem with boosting

which boosts articles from last 2 month then limits results by 1000, then we order outside of FTS index. This works fine if distribution of data on each group is almost similar every day or every month across all groups. But we have some groups which are very active in some days and in-active in some other weeks, which makes it difficult to create boosting condition, should boosting criteria be by month (202108^10), by week (2021-week-34^10), by day (20210819^10) or even by hour?

Imagine 1 group had 2.000 articles on a single day and we are boosting by day or week, when we limit by 1.000 we are missing some articles which might be more recent than articles returned as a result of limiting

Feature request

Based on these observations, we are thinking maybe ordering inside Lucene might be helpful for us and faster. Still 1M articles should be sorted somewhere if they match criteria, either by Lucene or by Neo4j, considering that data is already in Lucene it might be faster than (1) fetching matched results from Lucene (2) then order in Neo4j.

Do you guys have plans to add sort order in FTS index?

Thanks a lot for awesome product!

  1. you should not merge on multiple properties just the id(s) you have in your constraint

one thing you can try to utilize both indexes the regular one on :Article(created_at) and the FTS

with 'group_id: (group1 OR group2) AND content: "#verycommonhashtag" AND (created_at_str: 202108*^10 OR created_at_str: 202107*^8 OR created_at_str:202*^1)' as query
call db.index.fulltext.queryNodes("fts__article", query, {limit: 1000}
yield node as a, score
match (b:Article) where b.created_at > 0  // or < datetime() 
and a=b // alternatively where id(b) = id(a)
return b
order by b.created_at desc
limit 100

Hi Michael,

Thanks for taking time to look into my question/feature request.

you should not merge on multiple properties just the id(s) you have in your constraint

Not sure I understand this statement, what do you mean here, can you please expand little bit more what it means "merging" here? I am only returning results from FTS index ordered by time

regarding query you have shared, I have 2 questions:

  • you are still limiting results when querying FTS index boosted by time (which is the problem on its own as described in question, problem here is 1001st article which is not included in the results might be most recent article, which we want to include)
  • any reason why condition you set should be faster where b.created_at > 0 and id(b) == id(a)? (because we already have 1000 articles from FTS, why should we bother another index again to sort data?)

Thanks again!