How to get cosine distance between two vectors in postgres?
About your table
First of all, I believe you should change your data type to plain array.
CREATE TABLE public.vector (
id serial NOT NULL,
vctor double precision [3] --for three dimensional vectors; of course you can change the dimension or leave it unbounded if you need it.
);
INSERT INTO public.vector (vctor) VALUES (ARRAY[2,3,4]);
INSERT INTO public.vector (vctor) VALUES (ARRAY[3,4,5]);
So
SELECT * FROM public.vector;
Will result in the following data
id | vctor
------|---------
1 | {2,3,4}
2 | {3,4,5}
Maybe not the answer you expected but consider this
As you may know already, calculating the cosine between the vectors involves calculating the magnitudes. I don't think the problem is the algorithm but the implementation; it requires calculating squares and square roots that is expensive for a RDBMS.
Now, talking about efficiency; the server process does not take the load when calling mathematical functions. In PostgreSQL, the mathematical functions (look here) run from the C library so they are pretty efficient. However, in the end, the host has to assign some resources to make these calculations.
I would indeed think carefully before implementing these rather costly operations inside the server. But there is not a right answer; it depends on how you are using the database. For example if it is a production database with thousands of concurrent users, I would move this kind of calculation elsewhere (a middle layer or a user application.) But if there are few users and your database is for a small research operation, then it is fine to implement it as a stored procedure or a process running inside your server but keep in mind this will affect scalability or portability. Of course, there are more considerations like how many rows will be processed, or whether or not you intend to fire triggers, etc.
Consider other alternatives
Make a client app
You can do a fast and decent program in VB or the language of your choice. And let the client app make the heavy calculation and use the database for what it does best that is storing and retrieving data.
Store the data differently
For this particular example, you could store the unit vectors plus the magnitude. In this way, finding the cosine between any two vectors reduces simply to the dot product of the unit vectors (only multiplication and division and no squares nor square roots.)
CREATE TABLE public.vector (
id serial NOT NULL,
uvctor double precision [3], --for three dimensional vectors; of course you can change the dimension or make it decimal if you need it
magnitude double precision
);
INSERT INTO public.vector (vctor) VALUES (ARRAY[0.3714, 0.5571, 0.7428], 5.385); -- {Ux, Uy, Uz}, ||V|| where V = [2, 3, 4];
INSERT INTO public.vector (vctor) VALUES (ARRAY[0.4243, 0.5657, 0.7071], 7.071); -- {Ux, Uy, Uz}, ||V|| where V = [3, 4, 5];
SELECT a.vctor as a, b.vctor as b, 1-(a.uvctor[1] * b.uvctor[1] + a.uvctor[2] * b.uvctor[2] + a.uvctor[3] * b.uvctor[3]) as cosine_distance FROM public.vector a
JOIN public.vector b ON a.id != b.id;
Resulting in
a | b | cosine_distance
-----------------------------|------------------------------|------------------
{0.3714,0.5571,0.7428,5.385} | {0.4243,0.5657,0.7071,7.071} | 0.00202963
{0.4243,0.5657,0.7071,7.071} | {0.3714,0.5571,0.7428,5.385} | 0.00202963
Even if you have to calculate the magnitude of the vector inside the server, you will make it once per vector and not every time you need to get the distance between two of them. This becomes more important as the number of rows is increasing. For 1000 vectors for example, you would have to calculate the magnitude 999000 times if you wanted to obtain the cosine difference between any two vectors using the original vector components.
Any combination of the above
Conclusion
When we pursue efficiency, most of the times there is not a canonical answer. Instead we have trade-offs that we have to consider and evaluate. It always depends on the ultimate goal we need to achieve. Databases are excellent for storing and retrieving data; they can definitely make other things but that comes with an added cost. If we can live with the added overhead then it's fine; otherwise we have to consider alternatives.
you can take reference to my code.
--for calculation of norm vector --
CREATE or REPLACE FUNCTION public.vector_norm(IN vector double precision[])
RETURNS double precision AS
$BODY$
BEGIN
RETURN(SELECT SQRT(SUM(pow)) FROM (SELECT POWER(e,2) as pow from unnest(vector) as e) as norm);
END;
$BODY$ LANGUAGE 'plpgsql';
ALTER FUNCTION public.vector_norm(double precision[]) OWNER TO postgres;
COMMENT ON FUNCTION public.vector_norm(double precision[]) IS 'This function is used to find a norm of vectors.';
--call function--
select public.vector_norm('{ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794}')
--for caculation of dot_product--
CREATE OR REPLACE FUNCTION public.dot_product(IN vector1 double precision[], IN vector2 double precision[])
RETURNS double precision
AS $BODY$
BEGIN
RETURN(SELECT sum(mul) FROM (SELECT v1e*v2e as mul FROM unnest(vector1, vector2) AS t(v1e,v2e)) AS denominator);
END;
$BODY$ LANGUAGE 'plpgsql';
ALTER FUNCTION public.dot_product(double precision[], double precision[]) OWNER TO postgres;
COMMENT ON FUNCTION public.dot_product(double precision[], double precision[])
IS 'This function is used to find a cosine similarity between two multi-dimensional vectors.';
--call fuction--
SELECT public.dot_product(ARRAY[ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794],ARRAY[ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794])
--for calculatuion of cosine similarity--
CREATE OR REPLACE FUNCTION public.cosine_similarity(IN vector1 double precision[], IN vector2 double precision[])
RETURNS double precision
LANGUAGE 'plpgsql'
AS $BODY$
BEGIN
RETURN(select ((select public.dot_product(ARRAY[ 0.63434,0.23487,0.324323], ARRAY[ 0.63434,0.23487,0.324323]) as dot_pod)/((select public.vector_norm(ARRAY[ 0.63434,0.23487,0.324323]) as norm1) * (select public.vector_norm(ARRAY[ 0.63434,0.23487,0.324323]) as norm2))) AS similarity_value)
END;
$BODY$;
ALTER FUNCTION public.cosine_similarity(double precision[], double precision[])
OWNER TO postgres;
COMMENT ON FUNCTION public.cosine_similarity(double precision[], double precision[])
IS 'this function is used to find a cosine similarity between two vector';