User Appreciation Challenge #1: Dennis ♦
Jelly, 11 bytes
ŒufØAe€ØCŒḂ
Try it online!
Alternate versions:
ŒlfØae€ØCŒḂ
ŒufØAe€ØcŒḂ
ŒlfØae€ØcŒḂ
Of course a challenge appreciating Dennis must have an answer in a language of his.
05AB1E, 9 bytes
žM¹álSåÂQ
Try it online!
-2 thanks to Adnan.
This attacks Jelly's pain point exactly. It uses l
and A
, 1-byte equivalents for Jelly's Œl
and Øa
respectively.
x86 32-bit machine-code function, 42 41 bytes
Currently the shortest non-golfing-language answer, 1B shorter than @streetster's q/kdb+.
With 0 for truthy and non-zero for falsy: 41 40 bytes. (in general, saves 1 byte for 32-bit, 2 bytes for 64-bit).
With implicit-length strings (C-style 0-terminated): 45 44 bytes
x86-64 machine-code (with 32-bit pointers, like the x32 ABI): 44 43 bytes.
x86-64 with implicit-length strings, still 46 bytes (the shift/mask bitmap strategy is break-even now).
This is a function with C signature _Bool dennis_like(size_t ecx, const char *esi)
. The calling convention is slightly non-standard, close to MS vectorcall/fastcall but with different arg registers: string in ESI and the length in ECX. It only clobbers its arg-regs and EDX. AL holds the return value, with the high bytes holding garbage (as allowed by the SysV x86 and x32 ABIs. IDK what MS's ABIs say about high-garbage when returning bool or narrow integers.)
Explanation of the algorithm:
Loop over the input string, filtering and classifying into a boolean array on the stack: For each byte, check if it's an alphabetic character (if not, continue to next char), and transform it to an integer from 0-25 (A-Z). Use that 0-25 integer to check a bitmap of vowel=0/consonant=1. (The bitmap is loaded into a register as a 32-bit immediate constant). Push 0 or 0xFF onto the stack according to the bitmap result (actually in the low byte of a 32-bit element, which may have garbage in the top 3 bytes).
The first loop produces an array of 0 or 0xFF (in dword elements padded with garbage). Do the usual palindrome check with a second loop that stops when pointers cross in the middle (or when they both point to the same element if there were an odd number of alphabetic characters). The upward-moving pointer is the stack pointer, and we use POP to load+increment. Instead of compare / setcc in this loop, we can just use XOR to detect same/different since there are only two possible values. We could accumulate (with OR) whether we found any non-matching elements, but an early-out branch on flags set by XOR is at least as good.
Notice that the second loop uses byte
operand-size, so it doesn't care what garbage the first loop leaves outside the low byte of each array element.
It uses the undocumented salc
instruction to set AL from CF, in the same way that sbb al,al
would. It's supported on every Intel CPU (except in 64-bit mode), even Knight's Landing! Agner Fog lists timings for it on all AMD CPUs as well (including Ryzen), so if x86 vendors insist on tying up that byte of opcode space ever since 8086, we might as well take advantage of it.
Interesting tricks:
- unsigned-compare trick for a combined isalpha() and toupper(), and zero-extends the byte to fill eax, setting up for:
- immediate bitmap in a register for
bt
, inspired by some nice compiler output forswitch
. - Creating a variable-sized array on the stack with push in a loop. (Standard for asm, but not something you can do with C for the implicit-length string version). It uses 4 bytes of stack space for every input character, but saves at least 1 byte vs. optimal golfing around
stosb
. - Instead of cmp/setne on the boolean array, XOR booleans together to get a truth value directly. (
cmp
/salc
isn't an option, becausesalc
only works for CF, and 0xFF-0 doesn't set CF.sete
is 3 bytes, but would avoid theinc
outside the loop, for a net cost of 2 bytes (1 in 64-bit mode)) vs. xor in the loop and fixing it with inc.
; explicit-length version: input string in ESI, byte count in ECX
08048060 <dennis_like>:
8048060: 55 push ebp
8048061: 89 e5 mov ebp,esp ; a stack frame lets us restore esp with LEAVE (1B)
8048063: ba ee be ef 03 mov edx,0x3efbeee ; consonant bitmap
08048068 <dennis_like.filter_loop>:
8048068: ac lods al,BYTE PTR ds:[esi]
8048069: 24 5f and al,0x5f ; uppercase
804806b: 2c 41 sub al,0x41 ; range-shift to 0..25
804806d: 3c 19 cmp al,0x19 ; reject non-letters
804806f: 77 05 ja 8048076 <dennis_like.non_alpha>
8048071: 0f a3 c2 bt edx,eax # AL = 0..25 = position in alphabet
8048074: d6 SALC ; set AL=0 or 0xFF from carry. Undocumented insn, but widely supported
8048075: 50 push eax
08048076 <dennis_like.non_alpha>:
8048076: e2 f0 loop 8048068 <dennis_like.filter_loop> # ecx = remaining string bytes
; end of first loop
8048078: 89 ee mov esi,ebp ; ebp = one-past-the-top of the bool array
0804807a <dennis_like.palindrome_loop>:
804807a: 58 pop eax ; read from the bottom
804807b: 83 ee 04 sub esi,0x4
804807e: 32 06 xor al,BYTE PTR [esi]
8048080: 75 04 jne 8048086 <dennis_like.non_palindrome>
8048082: 39 e6 cmp esi,esp ; until the pointers meet or cross in the middle
8048084: 77 f4 ja 804807a <dennis_like.palindrome_loop>
08048086 <dennis_like.non_palindrome>:
; jump or fall-through to here with al holding an inverted boolean
8048086: 40 inc eax
8048087: c9 leave
8048088: c3 ret
;; 0x89 - 0x60 = 41 bytes
This is probably also one of the fastest answers, since none of the golfing really hurts too badly, at least for strings under a few thousand characters where the 4x memory usage doesn't cause a lot of cache-misses. (It might also lose to answers that take an early-out for non-Dennis-like strings before looping over all the characters.) salc
is slower than setcc
on many CPUs (e.g. 3 uops vs. 1 on Skylake), but a bitmap check with bt/salc
is still faster than a string-search or regex-match. And there's no startup overhead, so it's extremely cheap for short strings.
Doing it in one pass on the fly would mean repeating the classification code for the up and down directions. That would be faster but larger code-size. (Of course if you want fast, you can do 16 or 32 chars at a time with SSE2 or AVX2, still using the compare trick by range-shifting to the bottom of the signed range).
Test program (for ia32 or x32 Linux) to call this function with a cmdline arg, and exit with status = return value. strlen
implementation from int80h.org.
; build with the same %define macros as the source below (so this uses 32-bit regs in 32-bit mode)
global _start
_start:
;%define PTRSIZE 4 ; true for x32 and 32-bit mode.
mov esi, [rsp+4 + 4*1] ; esi = argv[1]
;mov rsi, [rsp+8 + 8*1] ; rsi = argv[1] ; For regular x86-64 (not x32)
%if IMPLICIT_LENGTH == 0
; strlen(esi)
mov rdi, rsi
mov rcx, -1
xor eax, eax
repne scasb ; rcx = -strlen - 2
not rcx
dec rcx
%endif
mov eax, 0xFFFFAEBB ; make sure the function works with garbage in EAX
call dennis_like
;; use the 32-bit ABI _exit syscall, even in x32 code for simplicity
mov ebx, eax
mov eax, 1
int 0x80 ; _exit( dennis_like(argv[1]) )
;; movzx edi, al ; actually mov edi,eax is fine here, too
;; mov eax,231 ; 64-bit ABI exit_group( same thing )
;; syscall
A 64-bit version of this function could use sbb eax,eax
, which is only 2 bytes instead of 3 for setc al
. It would also need an extra byte for dec
or not
at the end (because only 32-bit has 1-byte inc/dec r32). Using the x32 ABI (32-bit pointers in long mode), we can still avoid REX prefixes even though we copy and compare pointers.
setc [rdi]
can write directly to memory, but reserving ECX bytes of stack space costs more code-size than that saves. (And we need to move through the output array. [rdi+rcx]
takes one extra byte for the addressing mode, but really we need a counter that doesn't update for filtered characters so it's going to be worse than that.)
This is the YASM/NASM source with %if
conditionals. It can be built with -felf32
(32-bit code) or -felfx32
(64-bit code with the x32 ABI), and with implicit or explicit length. I've tested all 4 versions. See this answer for a script to build a static binary from NASM/YASM source.
To test the 64-bit version on a machine without support for the x32 ABI, you can change the pointer regs to 64-bit. (Then simply subtract the number of REX.W=1 prefixes (0x48 bytes) from the count. In this case, 4 instructions need REX prefixes to operate on 64-bit regs). Or simply call it with the rsp
and the input pointer in the low 4G of address space.
%define IMPLICIT_LENGTH 0
; This source can be built as x32, or as plain old 32-bit mode
; x32 needs to push 64-bit regs, and using them in addressing modes avoids address-size prefixes
; 32-bit code needs to use the 32-bit names everywhere
;%if __BITS__ != 32 ; NASM-only
%ifidn __OUTPUT_FORMAT__, elfx32
%define CPUMODE 64
%define STACKWIDTH 8 ; push / pop 8 bytes
%else
%define CPUMODE 32
%define STACKWIDTH 4 ; push / pop 4 bytes
%define rax eax
%define rcx ecx
%define rsi esi
%define rdi edi
%define rbp ebp
%define rsp esp
%endif
; A regular x86-64 version needs 4 REX prefixes to handle 64-bit pointers
; I haven't cluttered the source with that, but I guess stuff like %define ebp rbp would do the trick.
;; Calling convention similar to SysV x32, or to MS vectorcall, but with different arg regs
;; _Bool dennis_like_implicit(const char *esi)
;; _Bool dennis_like_explicit(size_t ecx, const char *esi)
global dennis_like
dennis_like:
; We want to restore esp later, so make a stack frame for LEAVE
push rbp
mov ebp, esp ; enter 0,0 is 4 bytes. Only saves bytes if we had a fixed-size allocation to do.
; ZYXWVUTSRQPONMLKJIHGFEDCBA
mov edx, 11111011111011111011101110b ; consonant/vowel bitmap for use with bt
;;; assume that len >= 1
%if IMPLICIT_LENGTH
lodsb ; pipelining the loop is 1B shorter than jmp .non_alpha
.filter_loop:
%else
.filter_loop:
lodsb
%endif
and al, 0x7F ^ 0x20 ; force ASCII to uppercase.
sub al, 'A' ; range-shift to 'A' = 0
cmp al, 'Z'-'A' ; if al was less than 'A', it will be a large unsigned number
ja .non_alpha
;; AL = position in alphabet (0-25)
bt edx, eax ; 3B
%if CPUMODE == 32
salc ; 1B only sets AL = 0 or 0xFF. Not available in 64-bit mode
%else
sbb eax, eax ; 2B eax = 0 or -1, according to CF.
%endif
push rax
.non_alpha:
%if IMPLICIT_LENGTH
lodsb
test al,al
jnz .filter_loop
%else
loop .filter_loop
%endif
; al = potentially garbage if the last char was non-alpha
; esp = bottom of bool array
mov esi, ebp ; ebp = one-past-the-top of the bool array
.palindrome_loop:
pop rax
sub esi, STACKWIDTH
xor al, [rsi] ; al = (arr[up] != arr[--down]). 8-bit operand-size so flags are set from the non-garbage
jnz .non_palindrome
cmp esi, esp
ja .palindrome_loop
.non_palindrome: ; we jump here with al=1 if we found a difference, or drop out of the loop with al=0 for no diff
inc eax ;; AL transforms 0 -> 1 or 0xFF -> 0.
leave
ret ; return value in AL. high bytes of EAX are allowed to contain garbage.
I looked at messing around with DF (the direction flag that controls lodsd
/ scasd
and so on), but it just didn't seem to be a win. The usual ABIs require that DF is cleared on function entry and exit. Assuming cleared on entry but leaving it set on exit would be cheating, IMO. It would be nice to use LODSD / SCASD to avoid the 3-byte sub esi, 4
, especially in the case where there's no high-garbage.
Alternate bitmap strategy (for x86-64 implicit-length strings)
It turns out this doesn't save any bytes, because bt r32,r32
still works with high garbage in the bit-index. It's just not documented the way shr
is.
Instead of bt / sbb
to get the bit into / out of CF, use a shift / mask to isolate the bit we want from the bitmap.
%if IMPLICIT_LENGTH && CPUMODE == 64
; incompatible with LOOP for explicit-length, both need ECX. In that case, bt/sbb is best
xchg eax, ecx
mov eax, 11111011111011111011101110b ; not hoisted out of the loop
shr eax, cl
and al, 1
%else
bt edx, eax
sbb eax, eax
%endif
push rax
Since this produces 0/1 in AL at the end (instead of 0/0xFF), we can do the necessary inversion of the return value at the end of the function with xor al, 1
(2B) instead of dec eax
(also 2B in x86-64) to still produce a proper bool
/ _Bool
return value.
This used to save 1B for x86-64 with implicit-length strings, by avoiding the need to zero the high bytes of EAX. (I had been using and eax, 0x7F ^ 0x20
to force to upper-case and zero the rest of eax with a 3-byte and r32,imm8
. But now I'm using the 2-byte immediate-with-AL encoding that most 8086 instructions have, like I was already doing for the sub
and cmp
.)
It loses to bt
/salc
in 32-bit mode, and explicit-length strings need ECX for the count so this doesn't work there either.
But then I realized that I was wrong: bt edx, eax
still works with high garbage in eax. It apparently masks the shift count the same way shr r32, cl
does (looking only at the low 5 bits of cl). This is different from bt [mem], reg
, which can access outside the memory referenced by the addressing-mode/size, treating it as a bitstring. (Crazy CISC...)
Intel's insn set ref manual doesn't document the masking, so maybe it's undocumented behaviour that Intel is preserving for now. (That kind of thing is not uncommon. bsf dst, src
with src=0 always leaves dst unmodified, even though it's documented to leave dst holding an undefined value in that case. AMD actually documents the src=0 behaviour.) I tested on Skylake and Core2, and the bt
version works with non-zero garbage in EAX outside of AL.
A neat trick here is using xchg eax,ecx
(1 byte) to get the count into CL. Unfortunately, BMI2 shrx eax, edx, eax
is 5 bytes, vs. only 2 bytes for shr eax, cl
. Using bextr
needs a 2-byte mov ah,1
(for the number of bits to extract), so it's again 5 + 2 bytes like SHRX + AND.
The source code has gotten pretty messy after adding %if
conditionals. Here's disassembly of x32 implicit-length strings (using the alternate strategy for the bitmap, so it's still 46 bytes).
The main difference from the explicit-length version is in the first loop. Notice how there's a lods
before it, and at the bottom, instead of just one at the top of the loop.
; 64-bit implicit-length version using the alternate bitmap strategy
00400060 <dennis_like>:
400060: 55 push rbp
400061: 89 e5 mov ebp,esp
400063: ac lods al,BYTE PTR ds:[rsi]
00400064 <dennis_like.filter_loop>:
400064: 24 5f and al,0x5f
400066: 2c 41 sub al,0x41
400068: 3c 19 cmp al,0x19
40006a: 77 0b ja 400077 <dennis_like.non_alpha>
40006c: 91 xchg ecx,eax
40006d: b8 ee be ef 03 mov eax,0x3efbeee ; inside the loop since SHR destroys it
400072: d3 e8 shr eax,cl
400074: 24 01 and al,0x1
400076: 50 push rax
00400077 <dennis_like.non_alpha>:
400077: ac lods al,BYTE PTR ds:[rsi]
400078: 84 c0 test al,al
40007a: 75 e8 jne 400064 <dennis_like.filter_loop>
40007c: 89 ee mov esi,ebp
0040007e <dennis_like.palindrome_loop>:
40007e: 58 pop rax
40007f: 83 ee 08 sub esi,0x8
400082: 32 06 xor al,BYTE PTR [rsi]
400084: 75 04 jne 40008a <dennis_like.non_palindrome>
400086: 39 e6 cmp esi,esp
400088: 77 f4 ja 40007e <dennis_like.palindrome_loop>
0040008a <dennis_like.non_palindrome>:
40008a: ff c8 dec eax ; invert the 0 / non-zero status of AL. xor al,1 works too, and produces a proper bool.
40008c: c9 leave
40008d: c3 ret
0x8e - 0x60 = 0x2e = 46 bytes