Why is `-lt` behaving differently for chars and strings?
A big thank-you to PetSerAl for all his invaluable input.
tl;dr:
- `-lt` and `-gt` compare `[char]` instances numerically by Unicode codepoint.
  - Confusingly, so do `-ilt`, `-clt`, `-igt`, `-cgt` - even though they only make sense with string operands, but that's a quirk in the PowerShell language itself (see bottom).
- `-eq` (and its alias `-ieq`), by contrast, compares `[char]` instances case-insensitively, which is typically, but not necessarily, like a case-insensitive string comparison (`-ceq` again compares strictly numerically).
  - `-eq`/`-ieq` ultimately also compares numerically, but first converts the operands to their uppercase equivalents using the invariant culture; as a result, this comparison is not fully equivalent to PowerShell's string comparison, which additionally recognizes so-called compatible sequences (distinct characters or even sequences considered to have the same meaning; see Unicode equivalence) as equal.
  - In other words: PowerShell special-cases the behavior of only `-eq`/`-ieq` with `[char]` operands, and does so in a manner that is almost, but not quite the same as case-insensitive string comparison.
- This distinction leads to counter-intuitive behavior such as `[char] 'A' -eq [char] 'a'` and `[char] 'A' -lt [char] 'a'` both returning `$true`.
- To be safe:
  - always cast to `[int]` if you want numeric (Unicode codepoint) comparison.
  - always cast to `[string]` if you want string comparison.
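As a minimal sketch of the two casts recommended above (the variable names are made up for illustration), the following contrasts numeric and string comparison of the same two characters:

```powershell
# Hypothetical variable names; any two [char] values would do.
$upper = [char] 'A'
$lower = [char] 'a'

# Explicit numeric (Unicode codepoint) comparison: 65 vs. 97
$numericLess  = [int] $upper -lt [int] $lower        # $true
$numericEqual = [int] $upper -eq [int] $lower        # $false

# Explicit string comparison (case-insensitive by default)
$stringLess  = [string] $upper -lt [string] $lower   # $false: equal once case is ignored
$stringEqual = [string] $upper -eq [string] $lower   # $true
```

With the casts in place, `-lt` and `-eq` no longer contradict each other, because both operate on the same kind of operand.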
For background information, read on.
PowerShell's usually helpful operator overloading can be tricky at times.
Note that in a numeric context (whether implicit or explicit), PowerShell treats characters (`[char]` (`[System.Char]`) instances) numerically, by their Unicode codepoint (not ASCII).
[char] 'A' -eq 65 # $true, in the 'Basic Latin' Unicode range, which coincides with ASCII
[char] 'Ā' -eq 256 # $true; 0x100, in the 'Latin Extended-A' Unicode range
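To inspect the codepoint behind a character, or to go the other way, explicit casts are enough. A small sketch (the variable names are only illustrative):

```powershell
# Character -> codepoint
$codepoint = [int] [char] 'A'                      # 65

# Codepoint -> character
$fromInt = [char] 65                               # 'A'

# Codepoint formatted as four hex digits; 0x100 is the character 'Ā'
$hex = '0x{0:X4}' -f [int] [char] 0x100            # '0x0100'
```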
What makes `[char]` unusual is that its instances are compared to each other numerically as-is, by Unicode codepoint, EXCEPT with `-eq`/`-ieq`.
`-ceq`, `-lt`, and `-gt` compare directly by Unicode codepoints, and - counter-intuitively - so do `-ilt`, `-clt`, `-igt` and `-cgt`:
[char] 'A' -lt [char] 'a' # $true; Unicode codepoint 65 ('A') is less than 97 ('a')
`-eq` (and its alias `-ieq`) first transforms the characters to uppercase, then compares the resulting Unicode codepoints:
[char] 'A' -eq [char] 'a' # !! ALSO $true; equivalent of 65 -eq 65
It's worth reflecting on this Buddhist turn: this and that: in the world of PowerShell, character 'A' is both less than and equal to 'a', depending on how you compare.
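The uppercase-then-compare behavior described above can be approximated by hand. This is only a sketch of the described behavior, assuming `[char]::ToUpperInvariant()` performs the same uppercasing; it is not a copy of PowerShell's internal code:

```powershell
$x = [char] 'A'
$y = [char] 'a'

# What PowerShell reports for the two operators
$builtInEq = $x -eq $y       # $true
$builtInLt = $x -lt $y       # $true: 65 is less than 97

# Hand-rolled approximation of -eq: uppercase both sides, then compare strictly
$emulatedEq = [char]::ToUpperInvariant($x) -ceq [char]::ToUpperInvariant($y)   # $true

# -ceq skips the uppercasing and compares the codepoints as-is
$strictEq = $x -ceq $y       # $false
```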
Also, directly or indirectly - after transformation to uppercase - comparing Unicode codepoints is NOT the same as comparing them as strings, because PowerShell's string comparison additionally recognizes so-called compatible sequences, where characters (or even character sequences) are considered "the same" if they have the same meaning (see Unicode equivalence); e.g.:
# Distinct Unicode characters U+2126 (Ohm Sign) and U+03A9 (Greek Capital Letter Omega)
# ARE recognized as the "same thing" in a *string* comparison:
"Ω" -ceq "Ω" # $true, despite having distinct Unicode codepoints
# -eq/-ieq: with [char], by only applying transformation to uppercase, the results
# are still different codepoints, which - compared numerically - are NOT equal:
[char] 'Ω' -eq [char] 'Ω' # $false: uppercased codepoints differ
# -ceq always applies direct codepoint comparison.
[char] 'Ω' -ceq [char] 'Ω' # $false: codepoints differ
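Because the two glyphs look identical on screen, it can help to build them from their codepoints so that it is unambiguous which character is which. A sketch along those lines (variable names are illustrative):

```powershell
$ohm   = [char] 0x2126   # Ohm Sign
$omega = [char] 0x03A9   # Greek Capital Letter Omega

# [char] comparisons: the codepoints differ, with or without uppercasing
$charEq  = $ohm -eq  $omega   # $false
$charCeq = $ohm -ceq $omega   # $false

# An explicitly ordinal string comparison also sees two different code units
$ordinalEq = [string]::Equals([string] $ohm, [string] $omega, [StringComparison]::Ordinal)   # $false

# PowerShell's own string comparison; this is the case the text above describes as equal
$stringCeq = [string] $ohm -ceq [string] $omega
```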
Note that use of prefixes `i` or `c` to explicitly specify case-matching behavior is NOT sufficient to force string comparison, even though conceptually operators such as `-ceq`, `-ieq`, `-clt`, `-ilt`, `-cgt`, `-igt` only make sense with strings.
Effectively, the i
and c
prefixes are simply ignored when applied to -lt
and -gt
while comparing [char]
operands; as it turns out (unlike what I originally thought), this is a general PowerShell pitfall - see below for an explanation.
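A quick way to see the prefixes being ignored is to run all three variants of `-lt` against the same pair of characters and compare with what happens once the operands are strings. A minimal sketch (variable names are illustrative):

```powershell
$upper = [char] 'A'
$lower = [char] 'a'

# All three variants compare 65 with 97, so each one yields $true
$charResults = @(
    $upper -lt  $lower
    $upper -ilt $lower
    $upper -clt $lower
)

# With string operands, -ilt really is a case-insensitive string comparison:
# 'A' and 'a' are equal when case is ignored, so neither is "less than" the other
$stringIlt = 'A' -ilt 'a'    # $false
```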
As an aside: `-lt` and `-gt` logic in string comparison is not numeric, but based on collation order (a human-centric way of ordering independent of codepoints / byte values), which in .NET terms is controlled by cultures (either by default by the one currently in effect, or by passing a culture parameter to methods).
As @PetSerAl demonstrates in a comment (and unlike what I originally claimed), PS string comparisons use the invariant culture, not the current culture, so their behavior is the same, irrespective of what culture is the current one.
Behind the scenes:
As @PetSerAl explains in the comments, PowerShell's parsing doesn't distinguish between the base form of an operator and its `i`-prefixed form; e.g., both `-lt` and `-ilt` are translated to the same value, `Ilt`.
Thus, PowerShell cannot implement differing behavior for `-lt` vs. `-ilt`, `-gt` vs. `-igt`, ..., because it treats them the same at the syntax level.
This leads to somewhat counter-intuitive behavior in that operator prefixes are effectively ignored when comparing data types where case-sensitivity has no meaning - as opposed to getting coerced to strings, as one might expect; e.g.:
"10" -cgt "2" # $false, because "2" comes after "1" in the collation order
10 -cgt 2 # !! $true; *numeric* comparison still happens; the `c` is ignored.
In the latter case I would have expected the use of `-cgt` to coerce the operands to strings, given that case-sensitive comparison is only a meaningful concept in string comparison, but that is NOT how it works.
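If string semantics are what you want with numeric operands, the cast has to be explicit. A minimal sketch contrasting the two (variable names are illustrative):

```powershell
# Numeric operands: the `c` prefix changes nothing, 10 is greater than 2
$numeric = 10 -cgt 2                      # $true

# Explicit [string] casts force string comparison: "10" sorts before "2"
$asStrings = [string] 10 -cgt [string] 2  # $false
```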
If you want to dig deeper into how PowerShell operates, see @PetSerAl's comments below.
Not quite sure what to post here other than the comparisons are all correct when dealing with strings/characters. If you want an Ordinal comparison, do an Ordinal comparison and you get results based on that.
Best Practices for Using Strings in the .NET Framework
`[string]::Compare('L','l')` returns 1
and
`[string]::Compare("L","l", [stringcomparison]::Ordinal)` returns -32
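The -32 lines up with the distance between the two codepoints. A small sketch that makes the connection visible; only the sign of the ordinal result is relied on here, since the exact magnitude may vary between .NET versions:

```powershell
# Ordinal comparison: based purely on the code units of the two strings
$ordinal     = [string]::Compare('L', 'l', [StringComparison]::Ordinal)
$ordinalSign = [Math]::Sign($ordinal)                      # -1: 'L' sorts before 'l' ordinally

# The codepoint difference itself: 76 ('L') minus 108 ('l')
$codepointDiff = ([int] [char] 'L') - ([int] [char] 'l')   # -32
```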
Not sure what to add here to help clarify.
Also see: Upper vs Lower Case