Excel - Extract substring(s) from string using FILTERXML
Excel's FILTERXML
uses XPATH 1.0
which unfortunately means it is not as diverse as we would maybe want it to be. Also, Excel seems to not allow returning reworked node values and exclusively allows you to select nodes in order of appearance. However there is a fair share of functions we can still utilize. More information about that can be found here.
The function takes two parameters: =FILTERXML(<A string in valid XML format>,<A string in valid XPATH format>)
Let's say cell A1
holds the string: ABC|123|DEF|456|XY-1A|ZY-2F|XY-3F|XY-4f|xyz|123
. To create a valid XML string we use SUBSTITUTE
to change the delimiter to valid end- and start-tag constructs. So to get a valid XML construct for the given example we could do:
"<t><s>"&SUBSTITUTE(A1,"|","</s><s>")&"</s></t>"
For readability reasons I'll refer to the above construct with the word <XML>
as a placeholder. Below you'll find different usefull XPATH
functions in a valid construct to filter nodes:
1) All Elements:
=FILTERXML(<XML>,"//s")
Returns: ABC
, 123
, DEF
, 456
, XY-1A
, ZY-2F
, XY-3F
, XY-4f
, xyz
and 123
(all nodes)
2) Elements by position:
=FILTERXML(<XML>,"//s[position()=4]")
Or:
=FILTERXML(<XML>,"//s[4]")
Returns: 456
(node on index 4)
=FILTERXML(<XML>,"//s[position()<4]")
Returns: ABC
, 123
and DEF
(nodes on index < 4)
=FILTERXML(<XML>,"//s[position()=2 or position()>5]")
Returns: 123
, ZY-2F
, XY-3F
, XY-4f
, xyz
and 123
(nodes on index 2 or > 5)
=FILTERXML(<XML>,"//s[last()]")
Returns: 123
(node on last index)
=FILTERXML(<XML>,"//s[position() mod 2 = 1]")
Returns: ABC
, DEF
, XY-1A
, XY-3F
and xyz
(odd nodes)
=FILTERXML(<XML>,"//s[position() mod 2 = 0]")
Returns: 123
, 456
, ZF-2F
, XY-4f
and 123
(even nodes)
3) (Non) numeric elements:
=FILTERXML(<XML>,"//s[number()=.]")
Or:
=FILTERXML(<XML>,"//s[.*0=0]")
Returns: 123
, 456
, and 123
(numeric nodes)
=FILTERXML(<XML>,"//s[not(number()=.)]")
Or:
=FILTERXML(<XML>,"//s[.*0!=0)]")
Returns: ABC
, DEF
, XY-1A
, ZY-2F
, XY-3F
, XY-4f
and xyz
(non-numeric nodes)
4) Elements that (not) contain:
=FILTERXML(<XML>,"//s[contains(., 'Y')]")
Returns: XY-1A
, ZY-2F
, XY-3F
and XY-4f
(containing 'Y', notice XPATH
is case sensitive, exclusing xyz
)
=FILTERXML(<XML>,"//s[not(contains(., 'Y'))]")
Returns: ABC
, 123
, DEF
, 456
, xyz
and 123
(not containing 'Y', notice XPATH
is case sensitive, including xyz
)
5) Elements that (not) start or/and end with:
=FILTERXML(<XML>,"//s[starts-with(., 'XY')]")
Returns: XY-1A
, XY-3F
and XY-4f
(starting with 'XY')
=FILTERXML(<XML>,"//s[not(starts-with(., 'XY'))]")
Returns: ABC
, 123
, DEF
, 456
, ZY-2F
, xyz
and 123
(don't start with 'XY')
=FILTERXML(<XML>,"//s[substring(., string-length(.) - string-length('F') +1) = 'F']")
Returns: DEF
, ZY-2F
and XY-3F
(end with 'F', notice XPATH 1.0
does not support ends-with
)
=FILTERXML(<XML>,"//s[not(substring(., string-length(.) - string-length('F') +1) = 'F')]")
Returns: ABC
, 123
, 456
, XY-1A
, XY-4f
, xyz
and 123
(don't end with 'F')
=FILTERXML(<XML>,"//s[starts-with(., 'X') and substring(., string-length(.) - string-length('A') +1) = 'A']")
Returns: XY-1A
(start with 'X' and end with 'A')
6) Elements that are upper- or lowercase:
=FILTERXML(<XML>,"//s[translate(.,'abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQRSTUVWXYZ')=.]")
Returns: ABC
, 123
, DEF
, 456
, XY-1A
, ZY-2F
, XY-3F
and 123
(uppercase nodes)
=FILTERXML(<XML>,"//s[translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')=.]")
Returns: 123
, 456
, xyz
and 123
(lowercase nodes)
NOTE: Unfortunately XPATH 1.0
does not support upper-case()
nor lower-case()
so the above is a workaround. Add special characters if need be.
7) Elements that (not) contain any number:
=FILTERXML(<XML>,"//s[translate(.,'1234567890','')!=.]")
Returns: 123
, 456
, XY-1A
, ZY-2F
, XY-3F
, XY-4f
and 123
(contain any digit)
=FILTERXML(<XML>,"//s[translate(.,'1234567890','')=.]")
Returns: ABC
, DEF
and xyz
(don't contain any digit)
=FILTERXML(<XML>,"//s[translate(.,'1234567890','')!=. and .*0!=0]")
Returns: XY-1A
, ZY-2F
, XY-3F
and XY-4f
(holding digits but not a a number on it's own)
8) Unique elements or duplicates:
=FILTERXML(<XML>,"//s[preceding::*=.]")
Returns: 123
(duplicate nodes)
=FILTERXML(<XML>,"//s[not(preceding::*=.)]")
Returns: ABC
, 123
, DEF
, 456
, XY-1A
, ZY-2F
, XY-3F
, XY-4f
and xyz
(unique nodes)
=FILTERXML(<XML>,"//s[not(following::*=. or preceding::*=.)]")
Returns: ABC
, DEF
, 456
, XY-1A
, ZY-2F
, XY-3F
and XY-4f
(nodes that have no similar sibling)
9) Elements of certain length:
=FILTERXML(<XML>,"//s[string-length()=5]")
Returns: XY-1A
, ZY-2F
, XY-3F
and XY-4f
(5 characters long)
=FILTERXML(<XML>,"//s[string-length()<4]")
Returns: ABC
, 123
, DEF
, 456
, xyz
and 123
(shorter than 4 characters)
10) Elements based on preceding/following:
=FILTERXML(<XML>,"//s[preceding::*[1]='456']")
Returns: XY-1A
(previous node equals '456')
=FILTERXML(<XML>,"//s[starts-with(preceding::*[1],'XY')]")
Returns: ZY-2F
, XY-4f
, and xyz
(previous node starts with 'XY')
=FILTERXML(<XML>,"//s[following::*[1]='123']")
Returns: ABC
, and xyz
(following node equals '123')
=FILTERXML(<XML>,"//s[contains(following::*[1],'1')]")
Returns: ABC
, 456
, and xyz
(following node contains '1')
=FILTERXML(<XML>,"//s[preceding::*='ABC' and following::*='XY-3F']")
Returns: 123
, DEF
, 456
, XY-1A
and ZY-2F
(everything between 'ABC' and 'XY-3f')
11) Elements based on sub-strings:
=FILTERXML(<XML>,"//s[substring-after(., '-') = '3F']")
Returns: XY-3F
(nodes ending with '3F' after hyphen)
=FILTERXML(<XML>,"//s[contains(substring-after(., '-') , 'F')]")
Returns: ZY-2F
and XY-3F
(nodes containing 'F' after hyphen)
=FILTERXML(<XML>,"//s[substring-before(., '-') = 'ZY']")
Returns: ZY-2F
(nodes starting with 'ZY' before hyphen)
=FILTERXML(<XML>,"//s[contains(substring-before(., '-'), 'Y')]")
Returns: XY-1A
, ZY-2F
, XY-3F
and XY-4f
(nodes containing 'Y' before hyphen)
12) Elements based on concatenation:
=FILTERXML(<XML>,"//s[concat(., '|', following::*[1])='ZY-2F|XY-3F']")
Returns: ZY-2F
(nodes when concatenated with '|' and following sibling equals 'ZY-2F|XY-3F')
=FILTERXML(<XML>,"//s[contains(concat(., preceding::*[2]), 'FA')]")
Returns: DEF
(nodes when concatenated with sibling two indices to the left contains 'FA')
13) Empty vs. Non-empty:
=FILTERXML(<XML>,"//s[count(node())>0]")
Or:
=FILTERXML(<XML>,"//s[node()]")
Returns: ABC
, 123
, DEF
, 456
, XY-1A
, ZY-2F
, XY-3F
, XY-4f
, xyz
and 123
(all nodes that are not empty)
=FILTERXML(<XML>,"//s[count(node())=0]")
Or:
=FILTERXML(<XML>,"//s[not(node())]")
Returns: None (all nodes that are empty)
Now obviously the above is a demonstration of possibilities with XPATH 1.0
functions and you can get a whole range of combinations of the above and more! I tried to cover most commonly used string functions. If you are missing any please feel free to comment.
Whereas the question is quite broad on itself, I was hoping to give some general direction on how to use FILTERXML
for the queries at hand. The formula returns an array of nodes to be used in any other way. A lot of the times I would use it in TEXTJOIN()
or INDEX()
. But I guess other options would be new DA-functions to spill results.
Be alert that while parsing a string through FILTERXML()
, the ampersand character (&) and the left angle bracket (<) must not appear in their literal form. They will respectively need to be substituted with either &
or <
. Another option would be to use their numeric ISO/IEC 10646 character code being &
or <
respectively. After parsing, the function will return these characters back to you in their literal form. Needless to say that splitting a string by the semi-colon therefor just became tricky.
This post is meant as a little in-depth extra to show how we can make our own re-usable SPLIT()
function using FILTERXML()
but without the use of VBA. Though currently in BETA, LAMBDA()
is comming our way and with this function we can create our own function. Let me explain this at the hand of a 1st example:
The formula in B1
is simple =SPLIT(A1,"|","")
and it spills the delimited text values in order of appearance. However SPLIT()
is the name of our LAMBDA()
function we created in the "name manager".
=LAMBDA(txt,del,xpath,FILTERXML("<t><s>"&SUBSTITUTE(txt,del,"</s><s>")&"</s></t>","//s"&xpath))
As you can see, the function has 4 parameters:
txt
- A reference to our source value.del
- A delimiter we want to use.xpath
- Place for a xpath expression to apply some filter if need be.FILTERXML("<t><s>"&SUBSTITUTE(txt,del,"</s><s>")&"</s></t>","//s"&xpath)
The 4th parameter is where all the 3 previous parameters are being called to create the same construct as been covered in the main post. Now, since MS does not, we created our own SPLIT()
function with the three parameters.
The main post concentrates on the SUBSTITUTE()
of a specific delimiter, in our example the pipe-symbol. But what if you have several delimiters? You'd need multiple nested SUBSTITUTE()
functions, right? Wouldn't it be great if we can implement that in our SPLIT()
function too? Here is where LAMBDA()
is getting exciting for me personally because we can call it recursively if only we provide a way out, otherwise it will loop infinitely!
Let me demonstrate with a 2nd example:
The formula in B1
is again:
=SPLIT(A1,"|&#","")
The big difference this time is we have got several delimiters in a string in our 2nd paramter. And it still works..... crazy! What happend though in our LAMBDA()
is that I made it recursive:
=LAMBDA(txt,del,xpath,IF(del="",FILTERXML("<t><s>"&txt&"</s></t>","//s"&xpath),SPLIT(SUBSTITUTE(txt,LEFT(del),"</s><s>"),RIGHT(del,LEN(del)-1),xpath)))
The main portion of the LAMBDA()
still stands, but I nested an IF()
, and within the IF()
I'm calling the same SPLIT()
function over and over untill all delimiters have been substituted.
We could stretch this even a bit further to not only split by a single character but by whole words. In that case we will need an array to loop. I've tried so in the below example:
Formula in B1
is =SPLIT(A1,C1:C2;"")
, thus our string of characters changed into a cell reference. Note that in my case I made it so you'd either need to hardcode a single value or reference a single cell or you need to hardcode a vertical array or vertical range reference. This has an impact on our LAMBDA()
though:
=LAMBDA(txt,del,xpath,IF(COUNTA(del)=1,FILTERXML("<t><s>"&SUBSTITUTE(txt,@del,"</s><s>")&"</s></t>","//s"&xpath),SPLIT(SUBSTITUTE(txt,@del,"</s><s>"),INDEX(del,SEQUENCE(COUNTA(del)-1,,2)),xpath)))
Quite lengthy, but don't forget we can now use this as a function in our whole workbook.
We have now created our own SPLIT()
function with three parameters:
=SPLIT(<StringToBeSplited>,<YourDelimiters>,<OptionalXpath>)
Enjoy!