Why does neo4j warn: "This query builds a cartesian product between disconnected patterns"?
The browser is telling you that:
- It is handling your query by doing a comparison between every Gene instance and every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25000 genes, so the DB would have to do 1150000 comparisons.
- You might be able to improve the complexity (and performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the node with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25000) "comparisons", and those comparisons would actually be quick index lookups! This approach should be much faster.
Once we have created the index, we can use this query:
MATCH (c:Chromosome) WITH c MATCH (g:Gene) WHERE g.chromosomeID = c.chromosomeID CREATE (g)-[:PART_OF]->(c);
It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.
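The index this answer relies on could be created along the following lines. This is only a sketch: the index name gene_chromosome_id is made up for illustration, and the right statement depends on your neo4j version (the named form is for 4.x and later, the commented-out form is the older 3.x syntax).

```cypher
// Neo4j 4.x and later. The index name gene_chromosome_id is hypothetical.
CREATE INDEX gene_chromosome_id FOR (g:Gene) ON (g.chromosomeID);

// Older 3.x syntax (deprecated in 4.x); use it instead of the statement above
// if you are on an older server.
// CREATE INDEX ON :Gene(chromosomeID);
```

Either way, the index covers the chromosomeID property on Gene nodes, which is what the WHERE clause in the query filters on.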
[UPDATE]
The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of neo4j (like 4.0.3) now generates the same plan even if the WITH is omitted, and without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with/without the WITH.
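As a rough illustration of that comparison, the two variants could be profiled like this. Keep in mind that PROFILE actually executes the statement, so running both against the same data would create the PART_OF relationships twice; swapping PROFILE for EXPLAIN shows the plan without executing anything.

```cypher
// Variant 1: with the WITH clause (same query as in the answer above).
PROFILE
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);

// Variant 2: WITH omitted. Run this on a fresh copy of the data, or use
// EXPLAIN instead of PROFILE, to avoid creating duplicate relationships.
PROFILE
MATCH (c:Chromosome)
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
```

In the resulting plans, look for a CartesianProduct operator versus an index seek on Gene to see which strategy the planner chose.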
As logisima mentions in the comments, this is just a warning. Matching a cartesian product is slow. In your case it should be OK, since you want to connect previously unconnected Gene and Chromosome nodes and you know the size of the cartesian product: there are not too many chromosomes and a smallish number of genes. If you were to MATCH, e.g., genes against proteins, the query might blow up.
I think the warning is intended for other problematic queries:
- if you MATCH a cartesian product but you don't know if there is a relationship, you could use OPTIONAL MATCH
- if you want to MATCH both a Gene and a Chromosome without any relationships, you should split up the query
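One possible reading of these two suggestions, sketched with the Gene and Chromosome labels and the PART_OF relationship type from the question (the RETURN clauses are only there for illustration):

```cypher
// Case 1: the relationship may or may not exist.
// OPTIONAL MATCH keeps every gene and binds c to null when there is no match.
MATCH (g:Gene)
OPTIONAL MATCH (g)-[:PART_OF]->(c:Chromosome)
RETURN g, c;

// Case 2: genes and chromosomes are wanted independently of each other.
// Two separate statements avoid pairing every gene with every chromosome.
MATCH (g:Gene) RETURN g;
MATCH (c:Chromosome) RETURN c;
```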
In case your query takes too long or does not finish, here is another question giving some hints on how to optimize cartesian products: How to optimize Neo4j Cypher queries with multiple node matches (Cartesian Product)