Compute the Adler-32 checksum
Mathematica, 46 bytes
{1,4^8}.Fold[##+{0,#&@@#}&,{1,0},#]~Mod~65521&
An anonymous function that takes an integer array and returns the Adler-32, with some improvements from miles and Martin (see comments).
miles' version is also 46 bytes, but faster:
{1,4^8}.{Tr@#+1,Tr[Accumulate@#+1]}~Mod~65521&
Julia, 46 bytes (down from 73)
x->[sum(x)+1;sum(cumsum(x)+1)]%65521⋅[1;4^8]
This is an anonymous function that accepts an array and returns an integer. To call it, assign it to a variable.
We combine sum(x) + 1 and sum(cumsum(x) + 1) into an array, where x is the input array, and take each modulo 65521. We then compute the dot product with 1 and 4^8, which gives us (sum(x) + 1) + 4^8 * sum(cumsum(x) + 1), which is exactly the Adler-32 formula.
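For reference, the same computation as a minimal C sketch (my own function name and types, not part of the Julia answer); it defers the modulo to the end, so it assumes the sums fit in 64 bits:
#include <stddef.h>
#include <stdint.h>

/* low  = 1 + sum(x); high = sum of low after each byte, which equals
 * sum(cumsum(x) + 1) = sum(cumsum(x)) + length(x).
 * Result = (high % 65521) << 16 | (low % 65521), i.e. the dot product
 * of [low % 65521, high % 65521] with [1, 4^8]. */
uint32_t adler32_ref(const uint8_t *x, size_t len) {
    uint64_t low = 1, high = 0;
    for (size_t i = 0; i < len; i++) {
        low += x[i];
        high += low;
    }
    return (uint32_t)(((high % 65521) << 16) | (low % 65521));
}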
Try it online! (Includes all test cases)
Saved 27 bytes thanks to Sp3000 and Dennis!
x86-64 machine code function: 32 bytes (down from 33), or 30 bytes (down from 31) with an int[] input instead of char[]
x86-32 machine code function: 31 bytes
As a GNU C inline-asm code fragment: saves 1B (just the ret insn).
Commented source and test driver on github
The 64bit version is callable directly from C with the standard System V x86-64 ABI (using 2 dummy args to get args in the regs I want). Custom calling conventions are not uncommon for asm code, so this is a bonus feature.
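On the C side, the declaration might look roughly like this (a sketch under that convention; the exact prototype and wrapper in the github test driver may differ):
#include <stdint.h>

/* SysV x86-64 passes integer args in rdi, rsi, rdx, rcx, so the two dummy
 * args land buf in rsi and len in rcx, where the asm loop wants them. */
uint32_t golfed_adler32_amd64(int dummy1, const char *buf, int dummy2, uint64_t len);

static uint32_t adler32_asm(const char *buf, uint64_t len) {
    return golfed_adler32_amd64(0, buf, 0, len);
}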
32bit machine code saves 1B, because merging the high and low halves with push16/push16 => pop32 only works in 32bit mode. A 32bit function would need a custom calling convention. We shouldn't hold that against it, but calling from C needs a wrapper function.
After processing 4096 '~' (ASCII 126) bytes, high = 0x3f040000 and low = 0x7e001, so high's most-significant bit isn't set yet. My code takes advantage of this, sign-extending eax into edx:eax with cdq as a way of zeroing edx.
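A standalone check of that arithmetic (not part of the golfed code):
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint32_t low = 1, high = 0;                /* no modulo inside the loop */
    for (int i = 0; i < 4096; i++) {           /* 4096 bytes of '~' == 126 */
        low += '~';
        high += low;
    }
    assert(low == 0x7e001 && high == 0x3f040000);
    assert((int32_t)high > 0);                 /* MSB clear, so cdq zeroes edx */
    return 0;
}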
# See the NASM source below
0000000000401120 <golfed_adler32_amd64>:
401120: 31 c0 xor eax,eax
401122: 99 cdq
401123: 8d 7a 01 lea edi,[rdx+0x1]
0000000000401126 <golfed_adler32_amd64.byteloop>:
401126: ac lods al,BYTE PTR ds:[rsi]
401127: 01 c7 add edi,eax
401129: 01 fa add edx,edi
40112b: e2 f9 loop 401126 <golfed_adler32_amd64.byteloop>
000000000040112d <golfed_adler32_amd64.end>:
40112d: 66 b9 f1 ff mov cx,0xfff1
401131: 92 xchg edx,eax
401132: 99 cdq
401133: f7 f1 div ecx
401135: 52 push rdx
401136: 97 xchg edi,eax
401137: 99 cdq
401138: f7 f1 div ecx
40113a: 66 52 push dx # this is the diff from last version: evil push/pop instead of shift/add
40113c: 58 pop rax
40113d: 66 5a pop dx
40113f: c3 ret
0000000000401140 <golfed_adler32_amd64_end>:
0x40 - 0x20 = 32 bytes.
Commented NASM source:
Tricks:
- xchg eax, r32 is one byte, so it's cheaper than mov. 8086 needed data in ax for a lot more stuff than >= 386 does, so they decided to spend a lot of opcode space on the now-rarely-used xchg ax, r16.
- Mixing push64 and push16 for merging high and low into a single register saves reg-reg data-movement instructions around the two divs. The 32bit version of this trick works even better: push16 / push16 / pop32 is only 5B total, not 6. (The C sketch just below shows the equivalent merge.)

Since we push/pop, this isn't safe for inline asm in the SysV amd64 ABI (with a red zone).
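Roughly what the push16/pop32 merge computes, written out in C (an illustration of the little-endian layout only, with my own helper name; it is not code the asm generates):
#include <stdint.h>
#include <string.h>

/* high % m is pushed first (higher address), low % m second (lower address);
 * reloading 32 bits from the lower address on a little-endian machine
 * reads back (high % m) << 16 | (low % m). */
uint32_t merge_halves(uint32_t high, uint32_t low) {
    uint16_t halves[2] = { (uint16_t)(low % 65521), (uint16_t)(high % 65521) };
    uint32_t adler;
    memcpy(&adler, halves, sizeof adler);
    return adler;
}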
golfed_adler32_amd64_v3: ; (int dummy, const char *buf, int dummy, uint64_t len)
;; args: len in rcx, const char *buf in rsi
;; Without dummy args, (unsigned len, const char *buf), mov ecx, edi is the obvious solution, costing 2 bytes
xor eax,eax ; scratch reg for loading bytes
cdq ; edx: high=0
lea edi, [rdx+1] ; edi: low=1
;jrcxz .end ; We don't handle len=0. unlike rep, loop only checks rcx after decrementing
.byteloop:
lodsb ; upper 24b of eax stays zeroed (no partial-register stall on Intel P6/SnB-family CPUs, thanks to the xor-zeroing)
add edi, eax ; low += zero_extend(buf[i])
add edx, edi ; high += low
loop .byteloop
.end:
;; exit when ecx = 0, eax = last byte of buf
;; lodsb at this point would load the terminating 0 byte, conveniently leaving eax=0
mov cx, 65521 ; ecx = m = adler32 magic constant. (upper 16b of ecx is zero from the loop exit condition. This saves 1B over mov r32,imm32)
;sub cx, (65536 - 65521) ; the immediate is small enough to use the imm8 encoding. No saving over mov, though, since this needs a mod/rm byte
xchg eax, edx ; eax = high, edx = buf[last_byte]
cdq ; could be removed if we could arrange things so the loop ended with a load of the 0 byte
div ecx ; div instead of idiv to fault instead of returning wrong answers if high has overflowed to negative. (-1234 % m is negative)
push rdx ; push high%m and 6B of zero padding
xchg eax, edi ; eax=low
cdq
div ecx ; edx = low%m
;; concatenate the two 16bit halves of the result by putting them in contiguous memory
push dx ; push low%m with no padding
pop rax ; pop high%m << 16 | low%m (x86 is little-endian)
pop dx ; add rsp, 2 to restore the stack pointer
;; outside of 16bit code, we can't justify returning the result in the dx:ax register pair
ret
golfed_adler32_amd64_end_v3:
I also considered using rcx as an array index instead of having two loop counters, but adler32(s) != adler32(reverse(s)), so we couldn't use loop (it counts rcx down, which would walk the buffer backwards). Counting from -len up towards zero and using movzx r32, [rsi+rcx] just uses way too many bytes.
If we want to consider incrementing the pointer ourselves, 32bit code is probably the way to go. Even the x32 ABI (32bit pointers) isn't sufficient, because inc esi is 2B on amd64 but 1B on i386. It appears hard to beat xor eax,eax / lodsb / loop: 4B total to get each element in turn zero-extended into eax. inc esi / movzx r32, byte [esi] / loop is 5B.
scas is another option for incrementing a pointer with a 1B instruction in 64bit mode (rdi/edi instead of rsi, so we'd take the pointer arg in rdi). We can't use the flag result from scas as a loop condition, though, because we don't want to keep eax zeroed. Different register allocation could maybe save a byte after the loop.
int[] input
The full function taking uint8_t[] is the "main" answer, because it's a more interesting challenge. Unpacking to int[] is an unreasonable thing to ask our caller to do in this language, but it does save 2B.
If we take our input as an unpacked array of 32bit integers, we can save one byte easily (use lodsd and replace xor eax,eax / cdq with just xor edx,edx).
We can save another byte by zeroing edx with lodsd/cdq, and re-arranging the loop so it loads the terminating 0 element before exiting. (We're still assuming it exists, even though this is an array of int, not a string.) A C rendering of this rotated loop follows the asm below.
; untested: I didn't modify the test driver to unpack strings for this
golfed_adler32_int_array:
; xor edx,edx
lodsd ; first element. only the low byte non-zero
cdq ; edx: high=0
lea edi, [rdx+1] ; edi: low=1
;jrcxz .end ; handle len=0? unlike rep, loop only checks rcx after decrementing
.intloop:
add edi, eax ; low += buf[i]
add edx, edi ; high += low
lodsd ; load buf[i+1] for next iteration
loop .intloop
.end:
;; exit when ecx = 0, eax = terminating 0
xchg eax, edx
;cdq ; edx=0 already, ready for div
; same as the char version
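A rough C rendering of that rotated loop (my own names; it makes the same assumptions as the asm: values small enough to skip intermediate modulo, and a 0 terminator at buf[len]):
#include <stddef.h>
#include <stdint.h>

uint32_t adler32_ints(const uint32_t *buf, size_t len) {
    uint32_t cur = *buf++;            /* like the first lodsd */
    uint32_t low = 1, high = 0;
    while (len--) {
        low  += cur;                  /* low  += buf[i] */
        high += low;                  /* high += low    */
        cur   = *buf++;               /* load buf[i+1]; the last read is the 0 terminator */
    }
    return (high % 65521) << 16 | (low % 65521);
}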
I also made an untested version that uses scasd (a 1B version of add edi,4) and add eax, [rdi] instead of lodsd, but it's also 30 bytes. The savings from having high in eax at the end of the loop are balanced out by larger code elsewhere. It has the advantage of not depending on a terminating 0 element in the input, though; requiring one is maybe unreasonable for an unpacked array where we're also given the length explicitly.
C++11 test driver
See the github link. This answer was getting too big, and the test driver kept gaining features (and code) as it grew.