Select a CSV string as multiple columns
A Word on the Performance of Different Types of Functions
Generally speaking, Scalar Functions and mTVFs (Multi-Statement Table Valued Functions) are a built-in performance problem. It's better, if you can, to use iTVFs (Inline Table Valued Functions) because their code is actually included in the execution plan (much like a VIEW, but parameterized) instead of being executed separately. Rather than replicate an entire article here, please see the following link for proof, including the case where an iTVF is used as an iSF (Inline Scalar Function). In many cases, Scalar Functions can be 7 times slower than iTVFs even when the code is identical. You can now get to the article without having to sign up to be a member. To summarize: if your function has the word "BEGIN" in it, it's NOT going to be a high-performance iTVF.
How to Make Scalar UDFs Run Faster (SQL Spackle)
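To make the difference concrete, here's a minimal sketch (the function names are mine, purely for illustration). The scalar version contains BEGIN, so it's executed once per row as a black box the optimizer can't see into; the iTVF version is a single RETURN of a SELECT, which gets folded directly into the calling query's plan.
--===== A Scalar Function: the "BEGIN" is the giveaway.
CREATE FUNCTION dbo.TrimBoth_SF (@pString VARCHAR(8000))
RETURNS VARCHAR(8000) AS
BEGIN
RETURN LTRIM(RTRIM(@pString));
END
GO
--===== The iTVF equivalent: no "BEGIN", just a single RETURN of a SELECT.
CREATE FUNCTION dbo.TrimBoth_iTVF (@pString VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN SELECT TrimmedString = LTRIM(RTRIM(@pString));
GO
To use the iTVF as an iSF, call it with CROSS APPLY instead of as a scalar in the SELECT list; the query gets the same value back, but the function's code lives in the execution plan.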
Test Data for SQL Server 2008 or Later:
Shifting gears to the original problem posted by @Reza, and borrowing heavily from the code that @stefan provided to generate some sample rows of data, here's the test data we'll use. I do this in a Temp Table so that we can easily drop the test table without having to worry about accidentally dropping a real table.
--===== If the test table exists, drop it to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#CSV','U') IS NOT NULL
DROP TABLE #CSV
;
--===== Create the test table.
CREATE TABLE #CSV --Using a Temp Table just for demo purposes
(
ID INT IDENTITY(1,1) --or whatever your PK is
,AString VARCHAR(8000)
)
;
--===== Insert some test data
INSERT INTO #CSV
(AString)
SELECT AString
FROM (
VALUES ('123,456,88789,null,null')
,('123,456,99789,1234,null')
,('123,456,00789,1234,null')
,('1,2222,77789,null,null')
,('11,222,88789,null,')
,('111,22,99789,,')
,('1111,2,00789,oooo,null')
) v (AString)
;
Test Data for SQL Server 2005 or Later:
If you're still using SQL Server 2005, don't despair. All of the code will still work there. We just need to change the way we generate the test data, because the VALUES clause couldn't be used as a table value constructor (the way we used it above) until SQL Server 2008. Instead, you need to use a series of SELECT/UNION ALL statements. There's virtually no performance difference between the two methods even if you need a larger volume of data, and the SELECT/UNION ALL method still works just fine through 2016.
Here's the test table generation code modified to work for 2005.
--===== If the test table exists, drop it to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#CSV','U') IS NOT NULL
DROP TABLE #CSV
;
--===== Create the test table.
CREATE TABLE #CSV --Using a Temp Table just for demo purposes
(
ID INT IDENTITY(1,1) --or whatever your PK is
,AString VARCHAR(8000)
)
;
--===== Insert some test data
INSERT INTO #CSV
(AString)
SELECT AString
FROM (
SELECT '123,456,88789,null,null' UNION ALL
SELECT '123,456,99789,1234,null' UNION ALL
SELECT '123,456,00789,1234,null' UNION ALL
SELECT '1,2222,77789,null,null' UNION ALL
SELECT '11,222,88789,null,' UNION ALL
SELECT '111,22,99789,,' UNION ALL
SELECT '1111,2,00789,oooo,null'
) v (AString)
;
Picking a Splitter
First of all, SQL Server isn't the greatest tool to use for string manipulation. It's been proven many times that a properly written SQLCLR splitter will smoke just about any pure T-SQL attempt at string splitting. The problems associated with SQLCLR solutions are 1) finding one that is properly written (that is, one that properly returns the expected result set) and 2) it's sometimes difficult to convince the resident DBAs to allow SQLCLR to be used. If you do find one that's properly written, it's definitely worth spending the time to convince the DBA, because it will usually handle MAX data-types, cares little whether you pass it VARCHAR or NVARCHAR data, and is faster than any T-SQL solution I've ever seen.
If you don't have to worry about NVARCHAR, don't have to worry about multi-character delimiters (or can change the delimiter "on the way in"), and don't have to worry about MAX data-types (you're limited to 8K strings), then the DelimitedSplit8K function may be for you. You can find the article for it, along with the proper "high cardinality" testing for it, at the following URL.
Tally OH! An Improved SQL 8K “CSV Splitter” Function
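As an aside, "changing the delimiter on the way in" can be as simple as a REPLACE before the split. A minimal sketch (my example, not from the article), using the DelimitedSplit8K function that's created later in this post:
--===== Collapse a 2-character delimiter ("||") to a single character
     -- before handing the string to the splitter.
SELECT split.ItemNumber, split.Item
FROM dbo.DelimitedSplit8K(REPLACE('123||456||789','||','|'),'|') split
;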
I'll also tell you that the high cardinality test data generation code in that article broke when SQL Server 2012 came out, because of changes to how XML can be used to concatenate. Wayne Sheffield did some separate testing to compare DelimitedSplit8K to the new splitter function in 2016, and made a repair to the test data generation code, in the following article. Note that 3 other splitters are also covered in that article, 2 of which are only available in 2016. He also tests against an SQLCLR splitter and lists some of the functional differences (shortcomings, in some cases) of each splitter.
Splitting Strings in SQL Server 2016
There's also an article by Eirikur Eiriksson that not only explores a serious performance improvement but also explores splitting "True CSVs", which can include delimiters between "text qualifiers" (although at considerable expense to performance). Here's the link to that article.
Reaping the benefits of the Window functions in T-SQL
All that notwithstanding, even he demonstrates that an SQLCLR is a whole lot faster than anything you can do using only T-SQL. I recommend that method over any T-SQL method, but DO extensive functionality testing to ensure it's going to return what you expect in some of the "odd" cases.
Solving the Posted Problem
Ok. Sorry for the long-winded diversion about splitters. Let's get back to the original problem.
We need to do two things:
1. Split the CSVs in such a fashion that we know which row each split-out element came from.
2. Reassemble the split-out elements into separate columns, all in the same row, for each row we split.
First, we need to create a splitter function. Here's a modification of DelimitedSplit8K that I'm working on (well, not lately ;-)) that still supports SQL Server 2005. The big difference is the use of a binary collation for extra performance. This rendition isn't yet fully tested (although I've personally been using it since Sep 2013) but, considering the simplicity of the changes and the fact that none of them actually change the functionality, I believe you'll find it satisfactory for your needs.
CREATE FUNCTION [dbo].[DelimitedSplit8K]
/**********************************************************************************************************************
Purpose:
Given a string containing multiple elements separated by a single character delimiter and that single character
delimiter, this function will split the string and return a table of the single elements (Item) and the element
position within the string (ItemNumber).
Notes:
1. Performance of this function approaches that of a CLR.
2. Note that this code implicitly converts NVARCHAR to VARCHAR and that conversion may NOT be faithful.
Revision History:
Note that this code is a modification of a well proven function created as a community effort and initially documented
at the following URL (http://www.sqlservercentral.com/articles/Tally+Table/72993/). This code is still undergoing
tests. Although every care has certainly been taken to ensure its accuracy, you are reminded to do your own tests to
ensure that this function is suitable for whatever application you might use it for.
--Jeff Moden, 01 Sep 2013
**********************************************************************************************************************/
--===== Define I/O parameters
(@pString VARCHAR(8000) , @pDelimiter CHAR(1)) --DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000).
WITH E1(N) AS (--==== Itzik Ben-Gan style of a cCTE (Cascading CTE) and
-- should not be confused with a much slower rCTE (Recursive CTE).
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d), --10E+4 or 10,000 rows max
cteTally(N) AS ( --=== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS ( --=== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT CASE WHEN SUBSTRING(@pString,t.N,1) = @pDelimiter COLLATE Latin1_General_BIN THEN t.N+1 END --added short circuit for casting
FROM cteTally t
WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter COLLATE Latin1_General_BIN
),
cteLen(N1,L1)AS ( --=== Return start position and length (for use in substring).
-- The ISNULL/NULLIF combo handles the length for the final (or only) element.
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter ,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
GO
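Once that's compiled, a quick smoke test (my example, not from the article) shows the shape of what it returns:
--===== Returns 3 rows: (1,'123'), (2,'456'), (3,'789').
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K('123,456,789',',')
;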
Once that bad boy is in place, problems like the one in the original post become trivial to solve using either a PIVOT or a CROSS TAB. I prefer the rather ancient CROSS TAB method because it's frequently faster (sometimes significantly so) than an actual PIVOT. Please see the following article for the testing behind that claim (it's pretty old and the performance measurements should probably be redone, but it does explain how to use both methods).
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Here's the code that uses the function above and a CROSS TAB to solve the original problem posted on this thread.
--===== Do the split for each row and repivot to columns using a
-- high performance CROSS TAB. It uses the function only once
-- for each row, which is another advantage iTVFs have over
-- Scalar Functions.
SELECT csv.ID
,Col1 = MAX(CASE WHEN ca.ItemNumber = 1 THEN Item ELSE '' END)
,Col2 = MAX(CASE WHEN ca.ItemNumber = 2 THEN Item ELSE '' END)
,Col3 = MAX(CASE WHEN ca.ItemNumber = 3 THEN Item ELSE '' END)
,Col4 = MAX(CASE WHEN ca.ItemNumber = 4 THEN Item ELSE '' END)
,Col5 = MAX(CASE WHEN ca.ItemNumber = 5 THEN Item ELSE '' END)
FROM #CSV csv
CROSS APPLY dbo.DelimitedSplit8K(csv.AString,',') ca
GROUP BY csv.ID
;
GO
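For comparison, here's the same thing written as an actual PIVOT (my sketch, not part of the original answer; the ISNULLs mimic the ELSE '' of the CROSS TAB):
--===== The same split/repivot done with PIVOT instead of a CROSS TAB.
SELECT p.ID
,Col1 = ISNULL(p.[1],'')
,Col2 = ISNULL(p.[2],'')
,Col3 = ISNULL(p.[3],'')
,Col4 = ISNULL(p.[4],'')
,Col5 = ISNULL(p.[5],'')
FROM (
SELECT csv.ID, ca.ItemNumber, ca.Item
FROM #CSV csv
CROSS APPLY dbo.DelimitedSplit8K(csv.AString,',') ca
) src
PIVOT (MAX(Item) FOR ItemNumber IN ([1],[2],[3],[4],[5])) p
;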
Here's the result set from the code above using the given test data (the CROSS TAB and the PIVOT versions return the same thing).
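Reconstructed from the test data above (the blank cells are the empty strings returned for the empty and trailing elements):
ID   Col1   Col2   Col3    Col4   Col5
1    123    456    88789   null   null
2    123    456    99789   1234   null
3    123    456    00789   1234   null
4    1      2222   77789   null   null
5    11     222    88789   null
6    111    22     99789
7    1111   2      00789   oooo   null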
If you have any questions or need clarification on any of this, please don't hesitate to ask.
For solving this, you will probably need some more procedural code. Different databases have different sets of built-in string functions (as you know), so to find a solution I have written "proof of concept" code that is rather generic and uses just the SUBSTR() function (the equivalent is SUBSTRING() in MS SQL).
Looking at LOAD DATA ... (MySQL), you can see that we need a .csv file and an existing table. Using my function, you will be able to SELECT sections of the stored CSV string by passing the column name plus 2 integers: one for the number of the "left-hand side delimiter", and one for the number of the "right-hand side delimiter". (Sounds horrible ...)
Example: suppose we have a comma-separated value looking like this, stored in a column called csv:
aaa,111,zzz
If we want to "extract" 111 out of this, we call the function like this:
select split(csv, 1, 2) ... ;
-- 1: start at the first comma
-- 2: end at the second comma
The "start" and the "end" of the string can be selected like this:
select split(csv, 0, 1) ... ; -- from the start of the string (no comma) up to the first comma
select split(csv, 2, 0) ... ; -- from the second comma right up to the end of the string
I know that the function code is not perfect, and that it can be simplified in places (e.g. in Oracle we should use INSTR(), which can find a particular occurrence of a substring; see the sketch after the function below). Also, there's no exception handling right now. It's just a first draft. Here goes ...
create or replace function split(
csvstring varchar2
, lcpos number
, rcpos number )
return varchar2
is
slen pls_integer := 0 ; -- string length
comma constant varchar2(1) := ',' ;
currentchar varchar2(1) := '' ;
commacount pls_integer := 0 ;
firstcommapos pls_integer := 0 ;
secondcommapos pls_integer := 0 ;
begin
slen := length(csvstring);
-- special case: leftmost value
if lcpos = 0 then
firstcommapos := 0 ;
for i in 1 .. slen
loop
currentchar := substr(csvstring, i, 1) ;
if currentchar = comma then
secondcommapos := i - 1 ;
exit ;
end if ;
end loop ;
return substr(csvstring, 1, secondcommapos) ;
end if ;
-- 2 commas somewhere in the middle of the string
if lcpos > 0 and rcpos > 0 then
for i in 1 .. slen
loop
currentchar := substr(csvstring, i, 1) ;
if currentchar = comma then
commacount := commacount + 1;
if commacount = lcpos then
firstcommapos := i ;
end if ;
if commacount = rcpos then
secondcommapos := i ;
end if ;
end if ;
end loop ;
return substr(csvstring, firstcommapos + 1, (secondcommapos-1) - firstcommapos ) ;
end if ;
-- special case: rightmost value
if rcpos = 0 then
secondcommapos := slen ;
for i in reverse 1 .. slen -- caution: count DOWN!
loop
currentchar := substr(csvstring, i, 1) ;
if currentchar = comma then
firstcommapos := i + 1 ;
exit ;
end if ;
end loop ;
return substr(csvstring, firstcommapos, secondcommapos-(firstcommapos-1)) ;
end if ;
end split;
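As mentioned above, INSTR() can locate the n-th occurrence of a substring directly, which would eliminate the character-by-character loops entirely. A minimal sketch of the "middle" case (my example, not part of the draft):
-- instr(string, substring, start_position, occurrence) returns the position
-- of the n-th comma, so the middle case collapses to a single expression.
select substr( 'aaa,111,zzz'
             , instr('aaa,111,zzz', ',', 1, 1) + 1
             , instr('aaa,111,zzz', ',', 1, 2) - instr('aaa,111,zzz', ',', 1, 1) - 1
             ) as middle_value
from dual ; -- returns '111'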
Testing:
-- test table, test data
create table csv (
id number generated always as identity primary key
, astring varchar2(256)
);
-- insert some test data
begin
insert into csv (astring) values ('123,456,88789,null,null');
insert into csv (astring) values ('123,456,99789,1234,null');
insert into csv (astring) values ('123,456,00789,1234,null');
insert into csv (astring) values ('1,2222,77789,null,null');
insert into csv (astring) values ('11,222,88789,null,');
insert into csv (astring) values ('111,22,99789,,');
insert into csv (astring) values ('1111,2,00789,oooo,null');
end;
-- testing:
select
split(astring,0,1) col1
, split(astring,1,2) col2
, split(astring,2,3) col3
, split(astring,3,4) col4
, split(astring,4,0) col5
from csv ;
-- output
COL1 COL2 COL3 COL4 COL5
123 456 88789 null null
123 456 99789 1234 null
123 456 00789 1234 null
1 2222 77789 null null
11 222 88789 null -
111 22 99789 - -
1111 2 00789 oooo null
... The function may seem to be overkill. However, if we write more procedural code, the SQL depending on it becomes rather "elegant". Best of luck with processing your csv!