Cypher query validation for genome analysis

Hi,

I'm in the last couple of days before my dissertation is due and after a long few weeks of manipulating a large amount of genomic and phenotype data, it should be ready to import tomorrow... I know I'm cutting it very fine - so apologies for this, but any help in the next day or so would be hugely appreciated!

I'm after validation of my queries before my database is ready, as I won't have long to test it and it's a big database on a not very powerful computer. I'm using Neo4j 3.5.1 on the Desktop version in Linux Mint. My data has spent a few days being curated on my (relatively slow) computer and I hope to load it via Neo4j-import tomorrow, so this is pre-emptive.

I have the following labels:

(:Chromosome) <-[:BELONGS_TO]-(:Allele)<-[:CARRIES]-(:Sample)-[r]->(:Phenotype)

Where [r] has a variety of types between samples and phenotypes.

There is also a relationship on certain alleles in the same position on the chromosome of (:Allele)-[:ALTERNATIVE_OF]-(:Allele)

I first want to filter out unwanted samples and my initial thoughts are:

MATCH (s:SAMPLE) WHERE NOT s.id = 'sample1' AND 'sample2' AND 'sample3'.... etc
RETURN s

I'm hoping this would return a list of all Samples except those specified. Would this be right, or is there a better way to filter out nodes?

Next, I intend to find all Alleles with fewer than 7 [:CARRIES] relationships with Samples, but there's another factor. I have three different names on these relationships depending on their type: 'homozygous','heterozygous - haplotype A' and 'heterozygous - haplotype B'. I need to double the count of the homozygous relationships, but count the other two only once each. My guess is as follows:

MATCH (a:Allele)<-[r1:CARRIES, r.name = “homozygous”]-(s:Sample) AND (a:Allele)<-r2[:CARRIES, r.name =~ “heterozygous.*”]

WITH a, count(r1 * 2) AS homozygous_count AND r2 AS heterozygous_count

WHERE allele_count <= 7

RETURN a

I'm not sure about the use of the AND operators though. This should hopefully leave a list of Alleles with fewer than 7 relationships to Samples, with the homozygous ones counted double. Does that make any sense, and how would I include the initial filter in the query?

Following that, the next stage is to take each of these Alleles individually and identify Phenotypes linked to them through Samples. This one I'm assuming as:

MATCH (a:Allele)<-[:CARRIES]-(s:Sample)-[r]->(p:Phenotype)

RETURN p

COLLECT(a) as alleles

ORDER BY SIZE(alleles) DESC

This will hopefully generate a list of linked phenotypes, but is there a way to also get the properties of the relationships between Samples and Phenotypes? It might be even better if it were possible to get a mean or modal average value on certain properties, is that possible within a query?

An alternative option is to use something like below and analyse each different type of relationship between Sample and Phenotype nodes within specific parameters:

MATCH (a:Allele)<-[:CARRIES]-(s:Sample)-[:LDL]->(:Phenotype {name: cholesterol})

WHERE value:LDL > 7.0

RETURN phenotype,

COLLECT(a) as alleles

ORDER BY SIZE(alleles) DESC

If anyone can validate (or invalidate!), improve and help combine some of these queries then I would really appreciate the support. Thanks, Dave.

Hey, that's quite a number of questions.

I'll try to answer them in order:

For excluding things it often helps to tag them with a label, like :Excluded, or to tag the positive samples instead.
Also make sure not to misspell labels, which are case-sensitive: you had Sample vs. SAMPLE.

create constraint on (s:Sample) assert s.id is unique;

MATCH (s:Sample) 
WHERE NOT s.id IN $params
SET s:Excluded
RETURN count(*);
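
To make the filtering question concrete, here is a rough sketch of the two options. The ids 'sample1', 'sample2' and 'sample3' are just the placeholders from the question, and I'm assuming the list holds the samples you want to drop. Note that chaining string literals with AND after a single comparison does not work in Cypher; a list with IN is the usual way.

```cypher
// Option 1: filter inline. The ids are placeholders taken from the question.
MATCH (s:Sample)
WHERE NOT s.id IN ['sample1', 'sample2', 'sample3']
RETURN s;

// Option 2: tag the unwanted samples once, then filter on the label later.
// Assumption: the list holds the ids to exclude, so there is no NOT here.
MATCH (s:Sample)
WHERE s.id IN ['sample1', 'sample2', 'sample3']
SET s:Excluded
RETURN count(*);
```

Tagging costs one write pass, but every later query can then use a cheap label check instead of repeating the id list.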

For degrees you can use WHERE size( (a)<-[:CARRIES]-() ) < 7 (the arrow points from the Sample to the Allele in your model).

For your more complex expression you can use:

MATCH (a:Allele)<-[rel:CARRIES]-()
WITH a, sum(case rel.name when 'homozygous' then 2 else 1 end) as allele_count
WHERE allele_count < 7
....
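
On including the initial filter: one possible way to combine the two steps is sketched below. It assumes the unwanted samples already carry the :Excluded label from the first step, and that the zygosity is stored in a relationship property called name, as in the question; adjust both if your import uses different names.

```cypher
// Weighted carrier count per allele, ignoring samples tagged :Excluded.
// Assumptions: the :Excluded label was set earlier, and rel.name holds
// 'homozygous' or one of the 'heterozygous - haplotype ...' values.
MATCH (a:Allele)<-[rel:CARRIES]-(s:Sample)
WHERE NOT s:Excluded
WITH a, sum(CASE rel.name WHEN 'homozygous' THEN 2 ELSE 1 END) AS allele_count
WHERE allele_count < 7
RETURN a, allele_count
ORDER BY allele_count DESC;
```

Alleles with no remaining carriers never match the pattern, so they do not appear in the result at all.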

In your phenotype query you only missed a comma after RETURN p. If you have multiple phenotypes, DISTINCT helps, and you might want to add a LIMIT:

MATCH (a:Allele)<-[:CARRIES]-(s:Sample)-[r]->(p:Phenotype)

RETURN p, COLLECT(distinct a) as alleles

ORDER BY SIZE(alleles) DESC LIMIT 100
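
On the mean question: Cypher has avg() for the mean and percentileCont() for a median, but as far as I know no built-in mode. A minimal sketch, assuming the measurement sits in a property called value on the Sample-to-Phenotype relationship (a guess at your schema) and using the name property on Phenotype from your last query:

```cypher
// Mean and median of an assumed r.value property, per phenotype and
// relationship type. r.value is a hypothetical property name.
MATCH (a:Allele)<-[:CARRIES]-(s:Sample)-[r]->(p:Phenotype)
RETURN p.name AS phenotype,
       type(r) AS measurement,
       count(DISTINCT a) AS allele_count,
       avg(r.value) AS mean_value,
       percentileCont(r.value, 0.5) AS median_value
ORDER BY allele_count DESC
LIMIT 100;
```

Be aware that a sample carrying several alleles contributes one row per allele, so its value is weighted accordingly in the averages.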

Sure, you can also build up more complex structures.

RETURN p, COLLECT({allele:a, value1: s.foo, value2:carries.value, value3:r.value, ...}) as alleles

which gives you a list of maps/dictionaries as the result.

Filtering on value.LDL > 7 is also possible (use a dot, not a colon, for property access).
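
A cleaned-up version of the LDL query could look roughly like this. It is only a sketch: I'm assuming the measurement is stored as a value property on the :LDL relationship, which may not match your data. The phenotype name also needs quotes, otherwise Cypher reads cholesterol as a variable.

```cypher
// Rank alleles by how many samples carry them and have LDL above 7.0.
// Assumption: the reading is stored as r.value on the :LDL relationship.
MATCH (a:Allele)<-[:CARRIES]-(s:Sample)-[r:LDL]->(p:Phenotype {name: 'cholesterol'})
WHERE r.value > 7.0
RETURN a, count(DISTINCT s) AS samples
ORDER BY samples DESC
LIMIT 100;
```

Because the phenotype is fixed here, grouping by allele rather than by phenotype gives a ranking that is easier to read.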