Find first non-null values for multiple columns

Using `first_value()`

first_value(col) can be used with and OVER (ORDER BY CASE WHEN col IS NOT NULL THEN sortcol ELSE maxvalue END). ELSE maxvalue is required because SQL Server sorts nulls first)

CREATE TABLE foo(a int, b int, c int, sortCol int);
INSERT INTO foo VALUES
    (null, 4, 8, 1),
    (1, null, 0, 2),
    (5, 7, null, 3);

Now you can see what we have to do to force nulls to sort after the sortcol. To do desc you have to make sure they have a negative value.

SELECT TOP(1)
     first_value(a) OVER (ORDER BY CASE WHEN a IS NOT NULL THEN sortcol ELSE 2^31-1 END) AS a,
     first_value(b) OVER (ORDER BY CASE WHEN b IS NOT NULL THEN sortcol ELSE 2^31-1 END) AS b,
     first_value(c) OVER (ORDER BY CASE WHEN c IS NOT NULL THEN sortcol ELSE 2^31-1 END) AS c
FROM foo;

PostgreSQL

PostgreSQL is slightly simpler,

CREATE TABLE foo(a,b,c,sortCol)
AS VALUES
  (null, 4, 8, 1),
  (1, null, 0, 2),
  (5, 7, null, 3);

SELECT
     first_value(a) OVER (ORDER BY CASE WHEN a IS NOT NULL THEN sortcol END) AS a,
     first_value(b) OVER (ORDER BY CASE WHEN b IS NOT NULL THEN sortcol END) AS b,
     first_value(c) OVER (ORDER BY CASE WHEN c IS NOT NULL THEN sortcol END) AS c
FROM foo
FETCH FIRST ROW ONLY;

I believe all of this goes away when RDBMS start to adopt IGNORE NULLS. Then it'll just be first_value(a IGNORE NULLS).

Have you actually performance tested this solution before rejecting it?

SELECT
    (SELECT TOP(1) col1 FROM Table1 WHERE col1 IS NOT NULL ORDER BY SortCol) AS col1,
    (SELECT TOP(1) col2 FROM Table1 WHERE col2 IS NOT NULL ORDER BY SortCol) AS col2,
    (SELECT TOP(1) col3 FROM Table1 WHERE col3 IS NOT NULL ORDER BY SortCol) AS col3

If this is slow it's probably because you don't have an appropriate index. What indexes do you have?

The problem with implementing this as an aggregation (which you indeed could do if, for example, you implemented a "First-Non-Null" SQL CLR aggregate) is the wasted IO to read every row when you're typically only interested in the first few rows. The aggregation won't just stop after the first non-null even though its implementation would ignore further values. Aggregations are also unordered, so your result would depend on the ordering of the index selected by query engine.

The subquery solution, by contrast, reads minimal rows for each query (since you only need the first matching row) and supports any ordering. It will also work on database platforms where it's not possible to define custom aggregates.

Which one performs better will likely depend on the number of rows and columns in your table and how sparse your data is. Additional rows require reading more rows for the aggregate approach. Additional columns require additional subqueries. Sparser data requires checking more rows within each of the subqueries.

Here are some results for various table sizes:

Rows  Cols  Aggregation IO  CPU  Subquery IO  CPU
3     3                 2   0             6   0
1728  3                 8   63            6   0
1728  8                 12  266           16  0

The IO measured here is the number of logical reads. Notice that the number of logical reads for the subquery approach doesn't change with the number of rows in the table. Also keep in mind that the logical reads performed by each additional subquery will likely be for the same pages of data (containing the first few rows). Aggregation, on the other hand, has to process the entire table and involves some CPU time to do so.

This is the code I used for testing... the clustered index on SortCol is required since (in this case) it will determine the order of the aggregation.

Defining the table and inserting test data:

CREATE TABLE Table1 (Col1 int null, Col2 int null, Col3 int null, SortCol int);
CREATE CLUSTERED INDEX IX_Table1 ON Table1 (SortCol);

WITH R (i) AS
(
 SELECT null

 UNION ALL

 SELECT 0

 UNION ALL

 SELECT i + 1
 FROM R
 WHERE i < 10
)
INSERT INTO Table1
SELECT a.i, b.i, c.i, ROW_NUMBER() OVER (ORDER BY NEWID())
FROM R a, R b, R c;

Querying the table:

SET STATISTICS IO ON;

--aggregation
SELECT TOP(0) * FROM Table1 --shortcut to convert columns back to their types
UNION ALL
SELECT
 dbo.FirstNonNull(Col1),
 dbo.FirstNonNull(Col2),
 dbo.FirstNonNull(Col3),
 null
FROM Table1;


--subquery
SELECT
    (SELECT TOP(1) Col1 FROM Table1 WHERE Col1 IS NOT NULL ORDER BY SortCol) AS Col1,
    (SELECT TOP(1) Col2 FROM Table1 WHERE Col2 IS NOT NULL ORDER BY SortCol) AS Col2,
    (SELECT TOP(1) Col3 FROM Table1 WHERE Col3 IS NOT NULL ORDER BY SortCol) AS Col3;

The CLR "first-non-null" aggregate to test:

 [Serializable]
 [SqlUserDefinedAggregate(
  Format.UserDefined,
  IsNullIfEmpty = true,
  IsInvariantToNulls = true,
  IsInvariantToDuplicates = true,
  IsInvariantToOrder = false, 
#if(SQL90)
  MaxByteSize = 8000
#else
  MaxByteSize = -1
#endif
 )]
 public sealed class FirstNonNull : IBinarySerialize
 {
  private SqlBinary Value;

  public void Init()
  {
   Value = SqlBinary.Null;
  }

  public void Accumulate(SqlBinary next)
  {
   if (Value.IsNull && !next.IsNull)
   {
    Value = next;
   }
  }

  public void Merge(FirstNonNull other)
  {
   Accumulate(other.Value);
  }

  public SqlBinary Terminate()
  {
   return Value;
  }

  #region IBinarySerialize Members

  public void Read(BinaryReader r)
  {
   int Length = r.ReadInt32();

   if (Length < 0)
   {
    Value = SqlBinary.Null;
   }
   else
   {
    byte[] Buffer = new byte[Length];
    r.Read(Buffer, 0, Length);

    Value = new SqlBinary(Buffer);
   }
  }

  public void Write(BinaryWriter w)
  {
   if (Value.IsNull)
   {
    w.Write(-1);
   }
   else
   {
    w.Write(Value.Length);
    w.Write(Value.Value);
   }
  }

  #endregion
 }

Not exactly elegant, but it can do it in a single query. Though this will probably render any indexes rather useless, so as mentioned the multiple sub-query method is likely to be faster.


create table Foo (data1 tinyint, data2 tinyint, data3 tinyint, seq int not null)
go

insert into Foo (data1, data2, data3, seq)
values (NULL, 4, 8, 1), (1, NULL, 0, 2), (5, 7, NULL, 3)
go

with unpivoted as (
    select seq, value, col
    from (select seq, data1, data2, data3 from Foo) a
    unpivot (value FOR col IN (data1, data2, data3)) b
), firstSeq as (
    select min(seq) as seq, col
    from unpivoted
    group by col
), data as (
    select b.col, b.value
    from firstSeq a
    inner join unpivoted b on a.seq = b.seq and a.col = b.col
)
select * from data pivot (min(value) for col in (data1, data2, data3)) d
go

drop table Foo
go

Find first non-null values for multiple columns

Using `first_value()`

PostgreSQL

Tags:

Sql

Sql Server

Related

Recent Posts

Find first non-null values for multiple columns

Using first_value()

PostgreSQL

Tags:

Sql

Sql Server

Related

Using `first_value()`