What's a better way to search/compare list elements? any() function

TL;DR

Comparing array elements is inefficient - how to improve?
Should I use a proper programming language? Which can be recommended?

MATCH (a:example_label),(b:example_label)
WHERE any(temp IN a.list WHERE temp IN b.list)
MERGE (a)-[:relationship]->[b];

Setup / Data Structures

I am on Neo4J 4.3.2.
All values and labels made short for simplicity.
I have nodes with label nC. They have two list properties: list_intern and list_extern.
See code below.

MATCH (n:nC)
return * limit 1;
{
  "identity": 324,
  "labels": [
    "nC"
  ],
  "properties": {
"list_intern": [
      "123",
      "abc"
    ],
"list_extern": [
      "456",
      "123"
    ],
"id": "319414",
"name": "foobar"
  }
}

Lists are sometimes null (do not exist). Lists are sometimes empty. List items are sometimes equal to the string null.
Most lists contain 2 to 15 items, some lists have over 50 items.
Lists are sorted, but do sometimes contain duplicate elements. Duplicates do not provide any further use, it would be possible to discard them.

Goal

I need to find all pairs of nodes where at least one non-empty entry of list_intern matches any non-empty entry of list_extern. I need to create a relationship between these pair of nodes. My source for this data doesn't provide relationship information, I have to compute them.

Trivial example

Using the any() function.

MATCH (a:nC),(b:nC)
WHERE any(temp IN a.list_intern WHERE temp IN b.list_extern)
MERGE (a)-[:relationship]->[b];

Improved Version

I was able to exclude some useless data from being processed by any()

MATCH (a:nC),(b:nC)
WHERE a.list_intern IS NOT NULL AND b.list_extern IS NOT NULL
AND NOT all(temp IN b.list_extern WHERE temp="null")
AND any(temp IN a.list_intern WHERE temp IN b.list_extern)
MERGE (a)-[:relationship]->[b];

Recently I discovered, I could replace any() with

apoc.coll.intersection(a.list_intern, b.list_extern) <> []

but that didn't make a difference.

Perfomance improved by using SOLR-Index

So, conveniently the data this DB uses is provided by an Apache SOLR.
For those not familiar: This application creates inverted indices. Consequently you are able to search your dataset quickly. SOLR offers a REST-API where I am able to request all objects that fulfill certain conditions.

I can make a GET-Request to receive all docs where item is in list_extern:

CALL apoc.load.json('<URL of SOLR>select?omitHeader=true&wt=json&rows=10000&fl=*&q=type:nC%20AND%20list_extern:%22'+item+'%22')

where I replace <URL of SOLR> with a valid value.
Note, SOLR-documents have a property type which is equal to the label used. type=nC
.
This is my best attempt at this problem, still using cypher:

match (a:nC)
WHERE a.list_intern IS NOT NULL 
UNWIND a.list_intern as item
CALL apoc.load.json('<URL of SOLR>select?omitHeader=true&wt=json&rows=10000&fl=*&q=type:nC%20AND%20list_extern:%22'+item+'%22')
YIELD value UNWIND value.response.docs as bcsv
WITH a,bcsv.id AS bid
MATCH (b:nC{id:bid})
WITH a,b
MERGE (a)-[:relationship]->[b];

Machine has 8 GB RAM, tiny dataset 27k nodes, of which 25k have the label nC.
Too slow. Dataset is going to be increased with similar data, at least doubling the node-count.

Question

How can this query be improved?
I know that performance also heavily depends on hardware. I want to reduce time complexity and computation time needed to complete the query.

Should I instead compute these relationships using a programming language?

Recommendations?
Do you recommend using a Neo4J-Driver to send the merge instructions? or should I build a new json/csv containing relationship information? (Which is then processed using cypher)

Thanks for reading, appreciate any comments!

Looking at a different way, is it possible to represent the list elements as nodes and then replace the list of elements stored as properties with relationships to the terms? Assuming the item nodes have a label of Term, your query may look like the following:

MATCH (a:nC),(b:nC)
WHERE exists ( (a)-[:HAS_INTERNAL_TERM]->(:Term)<-[:HAS_EXTERNAL_TERM]-(b)  )
MERGE (a)-[:relationship]->(b)

If the lists come with the ingested data and are only used for the purposes of linking ‘a’ to ‘b’, then you can create the terms and links to the terms as you ingest each node. You can then add to the above query the deletion of the HAS_TERM relations. You can periodically clean up unused TERM nodes.

If the HAS_TERM relationships were permanent, would you need the ‘relationship’ between the two nodes, as you could use the common linkage to at least one term to relate any two ‘nC’ nodes.

Sorry in advance if this doesn’t make sense for your application.

Back to your question. How are you running these queries now, manually in neo4j browser? I don’t see how to use Java and the driver to improve this. You need to match two candidate nodes then test if the two nodes are similar by comparing their list of terms. I don’t see how you split this up between cypher and Java.

It would be possible to build a custom function using the neo4j Java API that takes two nodes and returns a Boolean value, true if they have a least one term in common and false otherwise. The function would use Java to compare the lists. Assuming this function is called ‘isSimilar’, then your query would be the following:

MATCH (a:nC),(b:nC)
WHERE isSimilar(a, b)
MERGE (a)-[:relationship]->(b)

The custom function solution requires deploying the code on the server. Do you have access to your server, or are you using Aura? This is not allowed with Aura. I can write the custom function if you want to try it.

Thanks for your reply.

This is possible. Great idea - indices on lists are not supported. This way indices on properties can be used.
Creating the Data Model I did think of that possibility, but discarded it, not knowing about indices in-depth.
I will test this asap.

Displayed relationships have to be (:nC)-[]->(:nC).
Making a request to the graph-application, users should not see :Term nodes - their content is not secret, it may still appear as list-properties of :nC nodes though.
apoc.create.vRelationship() will be useful, if I decide to keep them.

yes / cypher-shell sometimes.

I was thinking about manipulating the source data before importing, before creating any nodes. I would read the .json I get from SOLR, compute the relationships and add them to the .json or write to another file. Those files would be loaded in a cypher query like above.
I am thinking that the usage of a "real" programming language or an ETL-Tool, no DB-Reads, just some items in lists, will be quicker.

yes, full access.

If you want to, sure let's find out what works best! :grin:

Thank you so far for this in-depth response.

I finished the custom function. Do you have a domain name? If so, I can use it as the prefix to the method when calling it. it is best to have your own namespace so your custom methods don't conflict with others with the same name.

I am not sure what you are planning to do with the custom function. How are domain names relevant?
I have no idea - i am curious.
So, I assumed the custom function would be deployed somewhere local on the server that is also running Neo4J - don' t need to worry about name conflicts.

Regarding the topic:
I am currently testing apoc.periodic.iterate() and apoc.periodic.commit(), have read a ton about memory configuration, bookmarked similiar topics in this forum.
You'll hear from me again next week.
Also i found this comment:

which reinforces taking a look at computing relationships before importing the data.

Ok, I will use a reasonable name. I will just upload the source and get you the jar, so you don't have to build it. You can try it if you want to see if it helps.

Preprocessing the data before ingestion is also an option to investigate. If you can process the data and create pairs of nodes by finding the nodes with the common elements in the two arrays and create a file with these pairs, your ingestion should be much faster. Are these nodes being streamed in realtime, or do get them offline and can process them beforehand.

I can't upload binary items to the board. If you are interested, send me an email address and I can send you the source code and the jar so you don't have to build it to test.

Your query would look like what is shown below. I externalized the property names, incase they changed you could just change your query instead of the source. It's assumed that the two properties consist of a single string value or a list of string values. The two properties can have any combination of list and scalar value. isSimilar will return true when the two properties have at least one value in common, false otherwise.

MATCH (a:nC),(b:nC)
WHERE customFunctions.isSimilar('list_intern', a, 'list_extern', b)
MERGE (a)-[:relationship]->(b)

Hey again Gary.

I am sure most email providers will not accept executables. Also, I am not willing to run unverified code, sorry - security reasons. Can you paste or upload (as a text-file) the source code here?
This is also good documentation.
Send me a direct message if you want to discuss.

I realized my data source APACHE SOLR can export CSV-files. I will try that, as LOAD CSV is allegedly (!) faster.

haha, good point. I guess I could give you a jar containing anything. Following is the source code for the custom function, junit 5 tests, and the pom file. You can use this to build the jar. Once you have maven project setup, you can build the jar with 'mvn package'. You can see its usage in the unit tests.

package customFunctions;

import org.neo4j.graphdb.Node;
import org.neo4j.procedure.Description;
import org.neo4j.procedure.Name;
import org.neo4j.procedure.UserFunction;

import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class CustomFunctions {

    @UserFunction()
    @Description("Calculate the similarity between two nodes by verifying list of strings have at least one element in common")
    public boolean isSimilar(@Name("aProp") String aProp, @Name("a") Node a, @Name("bProp") String bProp, @Name("b") Node b) {

        Objects.requireNonNull(aProp);
        Objects.requireNonNull(a);
        Objects.requireNonNull(bProp);
        Objects.requireNonNull(b);

        if(!a.hasProperty(aProp) || !b.hasProperty(bProp)) {
            return false;
        }

       Object list_intern = a.getProperty(aProp);
       Object list_extern = b.getProperty(bProp);

        boolean is_list_intern_a_list = list_intern.getClass().isArray();
        boolean is_list_extern_a_list = list_extern.getClass().isArray();

        if(is_list_intern_a_list && is_list_extern_a_list) {
            List<String> list_intern_as_list = Arrays.asList((String[]) list_intern);
            List<String> list_extern_as_list = Arrays.asList((String[]) list_extern);
            return list_intern_as_list.stream().anyMatch(list_extern_as_list::contains);
        }

        if(is_list_intern_a_list) {
            return Arrays.asList((String[])list_intern).contains(list_extern);
        }

        if(is_list_extern_a_list) {
            return Arrays.asList((String[])list_extern).contains(list_intern);
        }

        return list_intern.equals(list_extern);
    }
}
import customFunctions.CustomFunctions;
import org.junit.jupiter.api.*;
import org.neo4j.driver.*;
import org.neo4j.harness.Neo4j;
import org.neo4j.harness.Neo4jBuilders;

class CustomFunctionTests {

    static Driver driver;

    @BeforeAll
    static void setup_db() {
        Neo4j neo4j = Neo4jBuilders.newInProcessBuilder()
                .withFunction(CustomFunctions.class)
                .build();

        driver = GraphDatabase.driver(neo4j.boltURI(), Config.builder()
                .withoutEncryption()
                .build());
    }

    @BeforeEach
    void delete_data() {
        try (Session session = driver.session()) {
            session.run("match(n) detach delete n");
        }
    }

    @Test
    @DisplayName("a is null, ClientException is thrown")
    void test_null_a_value_results_in_neo4j_client_exception() {
        Assertions.assertThrows(org.neo4j.driver.exceptions.ClientException.class, ()->{
            String cypher = "create (b{id: 2}) return customFunctions.isSimilar('list_intern', null, 'list_extern', b) as result";
            executeCypher(cypher);
        });
    }

    @Test
    @DisplayName("b is null, Neo4j ClientException is thrown")
    void test_null_b_value_results_in_neo4j_client_exception() {
        Assertions.assertThrows(org.neo4j.driver.exceptions.ClientException.class, ()->{
            String cypher = "create (a{id: 1}) return customFunctions.isSimilar('list_intern', a, 'list_extern', null) as result";
            executeCypher(cypher);
        });
    }

    @Test
    @DisplayName("aProp is null, ClientException is thrown")
    void test_null_aProp_value_results_in_neo4j_client_exception() {
        Assertions.assertThrows(org.neo4j.driver.exceptions.ClientException.class, ()->{
            String cypher = "create (a{id: 1}), (b{id: 2}) return customFunctions.isSimilar(null, a, 'list_extern', b) as result";
            executeCypher(cypher);
        });
    }

    @Test
    @DisplayName("bProp is null, ClientException is thrown")
    void test_null_bProp_value_results_in_neo4j_client_exception() {
        Assertions.assertThrows(org.neo4j.driver.exceptions.ClientException.class, ()->{
            String cypher = "create (a{id: 1}), (b{id: 2}) return customFunctions.isSimilar('list_intern', a, null, b) as result";
            executeCypher(cypher);
        });
    }

    @Test
    @DisplayName("a has null list, b has null list")
    void test_both_nodes_have_null_lists() {
        String cypher = "create (a{id: 1}), (b{id: 2})";
        boolean isSimilar = executeTest(cypher);
        Assertions.assertFalse(isSimilar);
    }

    @Test
    @DisplayName("a has null list, b has non-null list")
    void test_a_node_has_null_lists_b_node_has_non_null_list() {
        String cypher = "create (a{id: 1}), (b{id: 2, list_extern: ['1','2']})";
        boolean isSimilar = executeTest(cypher);
        Assertions.assertFalse(isSimilar);
    }

    @Test
    @DisplayName("a has non-null list, b has null list")
    void test_a_node_has_non_null_lists_b_node_has_null_list() {
        String cypher = "create (a{id: 1, list_intern: ['1','2']}), (b{id: 2})";
        boolean isSimilar = executeTest(cypher);
        Assertions.assertFalse(isSimilar);
    }

    @Test
    @DisplayName("a and b have lists of integers with no common elements")
    void test_a_and_b_have_lists_with_no_common_elements() {
        String cypher = "create (a{id: 1, list_intern: ['10','20']}), (b{id: 2, list_extern: ['1','2']})";
        boolean isSimilar = executeTest(cypher);
        Assertions.assertFalse(isSimilar);
    }

    @Test
    @DisplayName("a and b have lists of integers with one common element")
    void test_a_and_b_have_lists_with_one_common_element() {
        String cypher = "create (a{id: 1, list_intern: ['10','2']}), (b{id: 2, list_extern: ['1','2']})";
        boolean isSimilar = executeTest(cypher);
        Assertions.assertTrue(isSimilar);
    }

    @Test
    @DisplayName("a and b have lists of integers with multiple common element")
    void test_a_and_b_have_lists_with_multiple_common_element() {
        String cypher = "create (a{id: 1, list_intern: ['10','2','100']}), (b{id: 2, list_extern: ['100','1','2']})";
        boolean isSimilar = executeTest(cypher);
        Assertions.assertTrue(isSimilar);
    }

    @Test
    @DisplayName("a has a single element and b has a list that contains the single element in a")
    void test_a_has_a_single_element_and_b_has_a_list_that_contains_the_element_of_a() {
        String cypher = "create (a{id: 1, list_intern: '2'}), (b{id: 2, list_extern: ['100','1','2']})";
        boolean isSimilar = executeTest(cypher);
        Assertions.assertTrue(isSimilar);
    }

    @Test
    @DisplayName("b has a single element and a has a list that contains the single element in b")
    void test_b_has_a_single_element_and_a_has_a_list_that_contains_the_element_of_b() {
        String cypher = "create (a{id: 1, list_intern: ['100','1','2']}), (b{id: 2, list_extern: '2'})";
        boolean isSimilar = executeTest(cypher);
        Assertions.assertTrue(isSimilar);
    }

    @Test
    @DisplayName("a has a single element and b has the single element equal to b")
    void test_a_has_a_single_element_and_b_has_the_single_element_of_b() {
        String cypher = "create (a{id: 1, list_intern: '2'}), (b{id: 2, list_extern: '2'})";
        boolean isSimilar = executeTest(cypher);
        Assertions.assertTrue(isSimilar);
    }

    @Test
    @DisplayName("a and b have single elements that are not equal")
    void test_a_and_b_have_a_single_element_that_are_not_equal() {
        String cypher = "create (a{id: 1, list_intern: '2'}), (b{id: 2, list_extern: '4'})";
        boolean isSimilar = executeTest(cypher);
        Assertions.assertFalse(isSimilar);
    }

    private boolean executeTest(String setupCypher) {
        try (Session session = driver.session()) {
            session.run(setupCypher);
            Result result = session.run("match (a{id: 1}), (b{id: 2}) return customFunctions.isSimilar('list_intern', a, 'list_extern', b) as result");
            if (result.hasNext()) {
                Record record = result.next();
                return record.get("result").asBoolean();
            } else {
                throw new IllegalArgumentException();
            }
        }
    }

    private void executeCypher(String cypher) {
        try (Session session = driver.session()) {
            Result result = session.run(cypher);
        }
    }
}
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                      http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.domain.CustomFunction</groupId>
    <artifactId>customFunction</artifactId>
    <version>0.0.1</version>

    <packaging>jar</packaging>
    <name>Custom isSimilar Function</name>
    <description>Prototype of a custom function to compare two nodes for similarity</description>

    <properties>
        <java.version>11</java.version>
        <maven.compiler.source>${java.version}</maven.compiler.source>
        <maven.compiler.target>${java.version}</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <neo4j.version>4.4.6</neo4j.version>
        <neo4j-java-driver.version>4.4.5</neo4j-java-driver.version>
        <maven-shade-plugin.version>3.2.4</maven-shade-plugin.version>
        <maven-compiler-plugin.version>3.8.1</maven-compiler-plugin.version>
        <maven-surefire-plugin.version>2.22.2</maven-surefire-plugin.version>
        <maven-failsafe-plugin.version>2.22.2</maven-failsafe-plugin.version>
    </properties>

    <dependencies>
        <dependency>
            <!-- This gives us the Procedure API our runtime code uses.
                 We have a `provided` scope on it, because when this is
                 deployed in a Neo4j Instance, the API will be provided
                 by Neo4j. If you add non-Neo4j dependencies to this
                 project, their scope should normally be `compile` -->
            <groupId>org.neo4j</groupId>
            <artifactId>neo4j</artifactId>
            <version>${neo4j.version}</version>
            <scope>provided</scope>

        </dependency>

        <!-- Test Dependencies -->
        <dependency>
            <!-- This is used for a utility that lets us start Neo4j with
                 a specific Procedure, which is nice for writing tests. -->
            <groupId>org.neo4j.test</groupId>
            <artifactId>neo4j-harness</artifactId>
            <version>${neo4j.version}</version>
            <scope>test</scope>
        </dependency>

        <dependency>
            <!-- Used to send cypher statements to our procedure. -->
            <groupId>org.neo4j.driver</groupId>
            <artifactId>neo4j-java-driver</artifactId>
            <version>${neo4j-java-driver.version}</version>
            <scope>test</scope>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven-compiler-plugin.version}</version>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>${maven-surefire-plugin.version}</version>
                <configuration>
                    <skipTests>true</skipTests>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-failsafe-plugin</artifactId>
                <version>${maven-failsafe-plugin.version}</version>
            </plugin>
            <plugin>
                <!-- This generates a jar-file with our procedure code,
                     plus any dependencies marked as `compile` scope.
                     This should then be deployed in the `plugins` directory
                     of each Neo4j instance in your deployment.
                     After a restart, the procedure is available for calling. -->
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>${maven-shade-plugin.version}</version>
                <configuration>
                    <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

TL;DR

Huge lists (multiple thousand items) are bad. Don't search in them, don't make them a property. Also, don't query for data you are not going to use.

Explanation

When testing the graph (unrelated to the import), I noticed that for some :nC, queries would take much longer, some didn't finish. I realized that a lot of nodes had way more entries in list_extern, the largest being 70k (!!). That was the main issue. The queries took so long, because each and every row contained these huge lists over and over again. For testing purposes I deleted these properties - queries finished instantly.

Import

SOLR has to read all the items, send them over the network, Neo4J has to parse these huge documents. That takes time. Maybe some memory/swap stuff going on. Don't know.

Solution

This problem was more of a "correctly use SOLR-Query-Parameters"-problem than anything else.
So to solve this I need to be more precise with my query.
fl=* returns all the fields, but I only needed the id of matches. Everything else will not be processed anyway.
So i changed it to fl=id and voi·là, much quicker. Quick enough now.

Further Learning

It might be useful to create a new SOLR-collection with the result. Also, I played with Stream-Expressions, it worked! But when the query got too large (I was increasing amount of searched rows slowly), the SOLR-Server crashed lol.
Here it is, works but needs refinement.

innerJoin(
cartesianProduct(
  search(mycollection, q="type:nC",rows="10",fl="list_intern",sort="list_intern asc"),
  list_intern 
),
cartesianProduct(
  search(mycollection, q="type:nC",rows="10",fl="list_extern",sort="list_extern asc"),
  list_extern),
on="list_intern=list_extern")

In this case, when still sending these huge lists, it made a difference of 3 percent max.

(Before I discovered the solution)
I tried that and it didn't improve the situation. Maybe my new queries were inefficient.
Had no chance to test your CustomFunction Gary, sorry. Thanks for staying with me!