User Appreciation Challenge #1: Dennis ♦

Jelly, 11 bytes


Try it online!

Alternate versions:




Of course a challenge appreciating Dennis must have an answer in a language of his.

05AB1E, 9 bytes


Try it online!

-2 thanks to Adnan.

This attacks Jelly's pain point exactly. It uses l and A, 1-byte equivalents for Jelly's Œl and Øa respectively.

x86 32-bit machine-code function, 42 41 bytes

Currently the shortest non-golfing-language answer, 1B shorter than @streetster's q/kdb+.

With 0 for truthy and non-zero for falsy: 41 40 bytes. (in general, saves 1 byte for 32-bit, 2 bytes for 64-bit).

With implicit-length strings (C-style 0-terminated): 45 44 bytes

x86-64 machine-code (with 32-bit pointers, like the x32 ABI): 44 43 bytes.

x86-64 with implicit-length strings, still 46 bytes (the shift/mask bitmap strategy is break-even now).

This is a function with C signature _Bool dennis_like(size_t ecx, const char *esi). The calling convention is slightly non-standard, close to MS vectorcall/fastcall but with different arg registers: string in ESI and the length in ECX. It only clobbers its arg-regs and EDX. AL holds the return value, with the high bytes holding garbage (as allowed by the SysV x86 and x32 ABIs. IDK what MS's ABIs say about high-garbage when returning bool or narrow integers.)

Explanation of the algorithm:

Loop over the input string, filtering and classifying into a boolean array on the stack: For each byte, check if it's an alphabetic character (if not, continue to next char), and transform it to an integer from 0-25 (A-Z). Use that 0-25 integer to check a bitmap of vowel=0/consonant=1. (The bitmap is loaded into a register as a 32-bit immediate constant). Push 0 or 0xFF onto the stack according to the bitmap result (actually in the low byte of a 32-bit element, which may have garbage in the top 3 bytes).

The first loop produces an array of 0 or 0xFF (in dword elements padded with garbage). Do the usual palindrome check with a second loop that stops when pointers cross in the middle (or when they both point to the same element if there were an odd number of alphabetic characters). The upward-moving pointer is the stack pointer, and we use POP to load+increment. Instead of compare / setcc in this loop, we can just use XOR to detect same/different since there are only two possible values. We could accumulate (with OR) whether we found any non-matching elements, but an early-out branch on flags set by XOR is at least as good.

Notice that the second loop uses byte operand-size, so it doesn't care what garbage the first loop leaves outside the low byte of each array element.

It uses the undocumented salc instruction to set AL from CF, in the same way that sbb al,al would. It's supported on every Intel CPU (except in 64-bit mode), even Knight's Landing! Agner Fog lists timings for it on all AMD CPUs as well (including Ryzen), so if x86 vendors insist on tying up that byte of opcode space ever since 8086, we might as well take advantage of it.

Interesting tricks:

  • unsigned-compare trick for a combined isalpha() and toupper(), and zero-extends the byte to fill eax, setting up for:
  • immediate bitmap in a register for bt, inspired by some nice compiler output for switch.
  • Creating a variable-sized array on the stack with push in a loop. (Standard for asm, but not something you can do with C for the implicit-length string version). It uses 4 bytes of stack space for every input character, but saves at least 1 byte vs. optimal golfing around stosb.
  • Instead of cmp/setne on the boolean array, XOR booleans together to get a truth value directly. (cmp/salc isn't an option, because salc only works for CF, and 0xFF-0 doesn't set CF. sete is 3 bytes, but would avoid the inc outside the loop, for a net cost of 2 bytes (1 in 64-bit mode)) vs. xor in the loop and fixing it with inc.
; explicit-length version: input string in ESI, byte count in ECX
08048060 <dennis_like>:
 8048060:       55                      push   ebp
 8048061:       89 e5                   mov    ebp,esp  ; a stack frame lets us restore esp with LEAVE (1B)
 8048063:       ba ee be ef 03          mov    edx,0x3efbeee ; consonant bitmap

08048068 <dennis_like.filter_loop>:
 8048068:       ac                      lods   al,BYTE PTR ds:[esi]
 8048069:       24 5f                   and    al,0x5f    ; uppercase
 804806b:       2c 41                   sub    al,0x41    ; range-shift to 0..25
 804806d:       3c 19                   cmp    al,0x19    ; reject non-letters
 804806f:       77 05                   ja     8048076 <dennis_like.non_alpha>
 8048071:       0f a3 c2                bt     edx,eax    # AL = 0..25 = position in alphabet
 8048074:       d6                      SALC     ; set AL=0 or 0xFF from carry.  Undocumented insn, but widely supported
 8048075:       50                      push   eax
08048076 <dennis_like.non_alpha>:
 8048076:       e2 f0                   loop   8048068 <dennis_like.filter_loop>   # ecx = remaining string bytes
 ; end of first loop

 8048078:       89 ee                   mov    esi,ebp  ; ebp = one-past-the-top of the bool array
0804807a <dennis_like.palindrome_loop>:
 804807a:       58                      pop    eax      ; read from the bottom
 804807b:       83 ee 04                sub    esi,0x4
 804807e:       32 06                   xor    al,BYTE PTR [esi]
 8048080:       75 04                   jne    8048086 <dennis_like.non_palindrome>
 8048082:       39 e6                   cmp    esi,esp             ; until the pointers meet or cross in the middle
 8048084:       77 f4                   ja     804807a  <dennis_like.palindrome_loop>

08048086 <dennis_like.non_palindrome>:
 ; jump or fall-through to here with al holding an inverted boolean
 8048086:       40                      inc    eax
 8048087:       c9                      leave  
 8048088:       c3                      ret    
;; 0x89 - 0x60 = 41 bytes

This is probably also one of the fastest answers, since none of the golfing really hurts too badly, at least for strings under a few thousand characters where the 4x memory usage doesn't cause a lot of cache-misses. (It might also lose to answers that take an early-out for non-Dennis-like strings before looping over all the characters.) salc is slower than setcc on many CPUs (e.g. 3 uops vs. 1 on Skylake), but a bitmap check with bt/salc is still faster than a string-search or regex-match. And there's no startup overhead, so it's extremely cheap for short strings.

Doing it in one pass on the fly would mean repeating the classification code for the up and down directions. That would be faster but larger code-size. (Of course if you want fast, you can do 16 or 32 chars at a time with SSE2 or AVX2, still using the compare trick by range-shifting to the bottom of the signed range).

Test program (for ia32 or x32 Linux) to call this function with a cmdline arg, and exit with status = return value. strlen implementation from

; build with the same %define macros as the source below (so this uses 32-bit regs in 32-bit mode)
global _start
    ;%define PTRSIZE 4   ; true for x32 and 32-bit mode.

    mov  esi, [rsp+4 + 4*1]  ; esi = argv[1]
    ;mov  rsi, [rsp+8 + 8*1]  ; rsi = argv[1]   ; For regular x86-64 (not x32)

        ; strlen(esi)
         mov     rdi, rsi
         mov     rcx, -1
        xor     eax, eax
        repne scasb    ; rcx = -strlen - 2
        not     rcx
        dec     rcx

    mov  eax, 0xFFFFAEBB   ; make sure the function works with garbage in EAX
    call dennis_like

    ;; use the 32-bit ABI _exit syscall, even in x32 code for simplicity
    mov ebx, eax
    mov eax, 1
    int 0x80           ; _exit( dennis_like(argv[1]) )

    ;; movzx edi, al   ; actually mov edi,eax is fine here, too
    ;; mov eax,231     ; 64-bit ABI exit_group( same thing )
    ;; syscall

A 64-bit version of this function could use sbb eax,eax, which is only 2 bytes instead of 3 for setc al. It would also need an extra byte for dec or not at the end (because only 32-bit has 1-byte inc/dec r32). Using the x32 ABI (32-bit pointers in long mode), we can still avoid REX prefixes even though we copy and compare pointers.

setc [rdi] can write directly to memory, but reserving ECX bytes of stack space costs more code-size than that saves. (And we need to move through the output array. [rdi+rcx] takes one extra byte for the addressing mode, but really we need a counter that doesn't update for filtered characters so it's going to be worse than that.)

This is the YASM/NASM source with %if conditionals. It can be built with -felf32 (32-bit code) or -felfx32 (64-bit code with the x32 ABI), and with implicit or explicit length. I've tested all 4 versions. See this answer for a script to build a static binary from NASM/YASM source.

To test the 64-bit version on a machine without support for the x32 ABI, you can change the pointer regs to 64-bit. (Then simply subtract the number of REX.W=1 prefixes (0x48 bytes) from the count. In this case, 4 instructions need REX prefixes to operate on 64-bit regs). Or simply call it with the rsp and the input pointer in the low 4G of address space.


; This source can be built as x32, or as plain old 32-bit mode
; x32 needs to push 64-bit regs, and using them in addressing modes avoids address-size prefixes
; 32-bit code needs to use the 32-bit names everywhere

;%if __BITS__ != 32   ; NASM-only
%ifidn __OUTPUT_FORMAT__, elfx32
%define CPUMODE 64
%define STACKWIDTH 8    ; push / pop 8 bytes
%define CPUMODE 32
%define STACKWIDTH 4    ; push / pop 4 bytes
%define rax eax
%define rcx ecx
%define rsi esi
%define rdi edi
%define rbp ebp
%define rsp esp

    ; A regular x86-64 version needs 4 REX prefixes to handle 64-bit pointers
    ; I haven't cluttered the source with that, but I guess stuff like %define ebp rbp  would do the trick.

    ;; Calling convention similar to SysV x32, or to MS vectorcall, but with different arg regs
    ;; _Bool dennis_like_implicit(const char *esi)
    ;; _Bool dennis_like_explicit(size_t ecx, const char *esi)
global dennis_like
    ; We want to restore esp later, so make a stack frame for LEAVE
    push  rbp
    mov   ebp, esp   ; enter 0,0 is 4 bytes.  Only saves bytes if we had a fixed-size allocation to do.

    mov  edx, 11111011111011111011101110b   ; consonant/vowel bitmap for use with bt

;;; assume that len >= 1
    lodsb   ; pipelining the loop is 1B shorter than  jmp .non_alpha

    and   al, 0x7F ^ 0x20  ; force ASCII to uppercase.
    sub   al, 'A'          ; range-shift to 'A' = 0
    cmp   al, 'Z'-'A'      ; if al was less than 'A', it will be a large unsigned number
    ja  .non_alpha
    ;; AL = position in alphabet (0-25)

    bt    edx, eax              ; 3B
%if CPUMODE == 32
    salc                        ; 1B   only sets AL = 0 or 0xFF.  Not available in 64-bit mode
    sbb   eax, eax              ; 2B   eax = 0 or -1, according to CF.
    push  rax

    test   al,al
    jnz .filter_loop
    loop .filter_loop
    ; al = potentially garbage if the last char was non-alpha
    ; esp = bottom of bool array

    mov   esi, ebp  ; ebp = one-past-the-top of the bool array
    pop   rax

    sub   esi, STACKWIDTH
    xor   al, [rsi]   ; al = (arr[up] != arr[--down]).  8-bit operand-size so flags are set from the non-garbage
    jnz .non_palindrome

    cmp   esi, esp
    ja .palindrome_loop

.non_palindrome:  ; we jump here with al=1 if we found a difference, or drop out of the loop with al=0 for no diff
    inc   eax     ;; AL transforms 0 -> 1  or  0xFF -> 0.
    ret           ; return value in AL.  high bytes of EAX are allowed to contain garbage.

I looked at messing around with DF (the direction flag that controls lodsd / scasd and so on), but it just didn't seem to be a win. The usual ABIs require that DF is cleared on function entry and exit. Assuming cleared on entry but leaving it set on exit would be cheating, IMO. It would be nice to use LODSD / SCASD to avoid the 3-byte sub esi, 4, especially in the case where there's no high-garbage.

Alternate bitmap strategy (for x86-64 implicit-length strings)

It turns out this doesn't save any bytes, because bt r32,r32 still works with high garbage in the bit-index. It's just not documented the way shr is.

Instead of bt / sbb to get the bit into / out of CF, use a shift / mask to isolate the bit we want from the bitmap.

    ; incompatible with LOOP for explicit-length, both need ECX.  In that case, bt/sbb is best
    xchg  eax, ecx
    mov   eax, 11111011111011111011101110b   ; not hoisted out of the loop
    shr   eax, cl
    and   al, 1
    bt    edx, eax
    sbb   eax, eax
    push  rax

Since this produces 0/1 in AL at the end (instead of 0/0xFF), we can do the necessary inversion of the return value at the end of the function with xor al, 1 (2B) instead of dec eax (also 2B in x86-64) to still produce a proper bool / _Bool return value.

This used to save 1B for x86-64 with implicit-length strings, by avoiding the need to zero the high bytes of EAX. (I had been using and eax, 0x7F ^ 0x20 to force to upper-case and zero the rest of eax with a 3-byte and r32,imm8. But now I'm using the 2-byte immediate-with-AL encoding that most 8086 instructions have, like I was already doing for the sub and cmp.)

It loses to bt/salc in 32-bit mode, and explicit-length strings need ECX for the count so this doesn't work there either.

But then I realized that I was wrong: bt edx, eax still works with high garbage in eax. It apparently masks the shift count the same way shr r32, cl does (looking only at the low 5 bits of cl). This is different from bt [mem], reg, which can access outside the memory referenced by the addressing-mode/size, treating it as a bitstring. (Crazy CISC...)

Intel's insn set ref manual doesn't document the masking, so maybe it's undocumented behaviour that Intel is preserving for now. (That kind of thing is not uncommon. bsf dst, src with src=0 always leaves dst unmodified, even though it's documented to leave dst holding an undefined value in that case. AMD actually documents the src=0 behaviour.) I tested on Skylake and Core2, and the bt version works with non-zero garbage in EAX outside of AL.

A neat trick here is using xchg eax,ecx (1 byte) to get the count into CL. Unfortunately, BMI2 shrx eax, edx, eax is 5 bytes, vs. only 2 bytes for shr eax, cl. Using bextr needs a 2-byte mov ah,1 (for the number of bits to extract), so it's again 5 + 2 bytes like SHRX + AND.

The source code has gotten pretty messy after adding %if conditionals. Here's disassembly of x32 implicit-length strings (using the alternate strategy for the bitmap, so it's still 46 bytes).

The main difference from the explicit-length version is in the first loop. Notice how there's a lods before it, and at the bottom, instead of just one at the top of the loop.

    ; 64-bit implicit-length version using the alternate bitmap strategy
    00400060 <dennis_like>:
      400060:       55                      push   rbp
      400061:       89 e5                   mov    ebp,esp
      400063:       ac                      lods   al,BYTE PTR ds:[rsi]

    00400064 <dennis_like.filter_loop>:
      400064:       24 5f                   and    al,0x5f
      400066:       2c 41                   sub    al,0x41
      400068:       3c 19                   cmp    al,0x19
      40006a:       77 0b                   ja     400077 <dennis_like.non_alpha>
      40006c:       91                      xchg   ecx,eax
      40006d:       b8 ee be ef 03          mov    eax,0x3efbeee  ; inside the loop since SHR destroys it
      400072:       d3 e8                   shr    eax,cl
      400074:       24 01                   and    al,0x1
      400076:       50                      push   rax
    00400077 <dennis_like.non_alpha>:
      400077:       ac                      lods   al,BYTE PTR ds:[rsi]
      400078:       84 c0                   test   al,al
      40007a:       75 e8                   jne    400064 <dennis_like.filter_loop>

      40007c:       89 ee                   mov    esi,ebp
    0040007e <dennis_like.palindrome_loop>:
      40007e:       58                      pop    rax
      40007f:       83 ee 08                sub    esi,0x8
      400082:       32 06                   xor    al,BYTE PTR [rsi]
      400084:       75 04                   jne    40008a <dennis_like.non_palindrome>
      400086:       39 e6                   cmp    esi,esp
      400088:       77 f4                   ja     40007e <dennis_like.palindrome_loop>

    0040008a <dennis_like.non_palindrome>:
      40008a:       ff c8                   dec    eax  ; invert the 0 / non-zero status of AL.  xor al,1 works too, and produces a proper bool.
      40008c:       c9                      leave  
      40008d:       c3                      ret    

   0x8e - 0x60 = 0x2e = 46 bytes