Are views harmful for performance in PostgreSQL?
The book is wrong.
Selecting from a view is exactly as fast or slow as running the underlying SQL statement – you can easily check that using explain analyze
.
The Postgres optimizer (and the optimizer for many other modern DBMSes) will be able to push down predicates on the view into the actual view statement – provided this is a simple statement (again, this can be verified using explain analyze
).
The "bad reputation" regarding performance stems – I think – from when you overuse views and start building views that use views that use views. Very often that results in statements that do too much compared to a statement that was hand-tailored without the views e.g. because some intermediate tables wouldn't be needed. In almost all cases the optimizer is not smart enough to remove those unneeded tables/joins or to push down predicates over multiple levels of views (this is true for other DBMSes as well).
To give you an example of what @a_horse explained:
Postgres implements the information schema, which consists of (sometimes complex) views providing information about DB objects in standardized form. This is convenient and reliable - and can be substantially more expensive than accessing the Postgres catalog tables directly.
Very simple example, to get all visible columns of a table
... from the information schema:
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'big'
AND table_schema = 'public';
... from the system catalog:
SELECT attname
FROM pg_catalog.pg_attribute
WHERE attrelid = 'public.big'::regclass
AND attnum > 0
AND NOT attisdropped;
Compare query plans and execution time for both with EXPLAIN ANALYZE
.
The first query is based on the view
information_schema.columns
, which joins to multiple tables we do not need for this at all.The second query only scans the one table
pg_catalog.pg_attribute
, hence much faster. (But the first query still only needs a few ms in common DBs.)
Details:
- Query to return output column names and data types of a query, table or view
EDIT:
With apologies, I need to retract my assertion that the accepted answer is not always correct - it states that the view is always identical to the same thing written as a subquery. I think that's indisputable, and I think I now know what's going on in my case.
I now also think there's a better answer to the original question.
The original question is about whether it should be guiding practice to use views (as opposed to, for example, repeating SQL in routines which may need to be maintained twice or more).
My answer would be "not if your query uses window functions or anything else which causes the optimiser to treat the query differently when it becomes a subquery, because the very act of creating the subquery (whether represented as a view or not) may degrade performance if you are filtering with parameters at runtime.
The complexity of my window function is unnecessary. The explain plan for this:
SELECT DISTINCT ts.train_service_key,
pc.assembly_key,
count(*) OVER
(PARTITION BY ts.train_service_key) AS train_records
FROM staging.train_service ts
JOIN staging.portion_consist pc
USING (ds_code, train_service_key)
WHERE assembly_key = '185132';
is much less costly than for this:
SELECT *
FROM (SELECT DISTINCT ts.train_service_key,
pc.assembly_key,
count(*) OVER
(PARTITION BY ts.train_service_key) AS train_records
FROM staging.train_service ts
JOIN staging.portion_consist pc
USING (ds_code, train_service_key)) AS query
WHERE assembly_key = '185132';
Hope that's a bit more specific and helpful.
In my recent experience (causing me to find this question), the accepted answer above isn't correct under all circumsances. I have a relatively simple query that includes a window function:
SELECT DISTINCT ts.train_service_key,
pc.assembly_key,
dense_rank() OVER (PARTITION BY ts.train_service_key
ORDER BY pc.through_idx DESC, pc.first_portion ASC,
((CASE WHEN (NOT ts.primary_direction)
THEN '-1' :: INTEGER
ELSE 1
END) * pc.first_seq)) AS coach_block_idx
FROM (staging.train_service ts
JOIN staging.portion_consist pc USING (ds_code, train_service_key))
If I add this filter:
where assembly_key = '185132'
The explain plan I get is as follows:
QUERY PLAN
Unique (cost=11562.66..11568.77 rows=814 width=43)
-> Sort (cost=11562.66..11564.70 rows=814 width=43)
Sort Key: ts.train_service_key, (dense_rank() OVER (?))
-> WindowAgg (cost=11500.92..11523.31 rows=814 width=43)
-> Sort (cost=11500.92..11502.96 rows=814 width=35)
Sort Key: ts.train_service_key, pc.through_idx DESC, pc.first_portion, ((CASE WHEN (NOT ts.primary_direction) THEN '-1'::integer ELSE 1 END * pc.first_seq))
-> Nested Loop (cost=20.39..11461.57 rows=814 width=35)
-> Bitmap Heap Scan on portion_consist pc (cost=19.97..3370.39 rows=973 width=38)
Recheck Cond: (assembly_key = '185132'::text)
-> Bitmap Index Scan on portion_consist_assembly_key_index (cost=0.00..19.72 rows=973 width=0)
Index Cond: (assembly_key = '185132'::text)
-> Index Scan using train_service_pk on train_service ts (cost=0.43..8.30 rows=1 width=21)
Index Cond: ((ds_code = pc.ds_code) AND (train_service_key = pc.train_service_key))
This is using the primary key index on the train service table and a non-unique index on the portion_consist table. It executes in 90ms.
I created a view (pasting it here to be absolutely clear but it's literally the query in a view):
CREATE OR REPLACE VIEW staging.v_unit_coach_block AS
SELECT DISTINCT ts.train_service_key,
pc.assembly_key,
dense_rank() OVER (PARTITION BY ts.train_service_key
ORDER BY pc.through_idx DESC, pc.first_portion ASC, (
(CASE
WHEN (NOT ts.primary_direction)
THEN '-1' :: INTEGER
ELSE 1
END) * pc.first_seq)) AS coach_block_idx
FROM (staging.train_service ts
JOIN staging.portion_consist pc USING (ds_code, train_service_key))
When I query this view with the identical filter:
select * from staging.v_unit_coach_block
where assembly_key = '185132';
This is the explain plan:
QUERY PLAN
Subquery Scan on v_unit_coach_block (cost=494217.13..508955.10 rows=3275 width=31)
Filter: (v_unit_coach_block.assembly_key = '185132'::text)
-> Unique (cost=494217.13..500767.34 rows=655021 width=43)
-> Sort (cost=494217.13..495854.68 rows=655021 width=43)
Sort Key: ts.train_service_key, pc.assembly_key, (dense_rank() OVER (?))
-> WindowAgg (cost=392772.16..410785.23 rows=655021 width=43)
-> Sort (cost=392772.16..394409.71 rows=655021 width=35)
Sort Key: ts.train_service_key, pc.through_idx DESC, pc.first_portion, ((CASE WHEN (NOT ts.primary_direction) THEN '-1'::integer ELSE 1 END * pc.first_seq))
-> Hash Join (cost=89947.40..311580.26 rows=655021 width=35)
Hash Cond: ((pc.ds_code = ts.ds_code) AND (pc.train_service_key = ts.train_service_key))
-> Seq Scan on portion_consist pc (cost=0.00..39867.86 rows=782786 width=38)
-> Hash (cost=65935.36..65935.36 rows=1151136 width=21)
-> Seq Scan on train_service ts (cost=0.00..65935.36 rows=1151136 width=21)
This is doing full scans on both tables and takes 17s.
Until I came across this I have been liberally using views with PostgreSQL (having understood the widely-held views expressed in the accepted answer). I'd specifically avoid using views if I need pre-aggregate filtering, for which I'd use set-returning functions.
I'm also aware that CTEs in PostgreSQL are strictly evaluated separately, by design, so I don't use them in the same way I would with SQL Server, for example, where they seem to be optimised as subqueries.
My answer, therefore, is, there are instances in which views do not perform exactly as the query upon which they are based, so caution is advised. I am using Amazon Aurora based on PostgreSQL 9.6.6.