Count the bytes of a program

Shell + coreutils, 6

This answer becomes invalid if an encoding other than UTF-8 is used.

wc -mc

Test output:

$ printf '%s' "(~R∊R∘.×R)/R←1↓ιR" | ./count.sh 
     17      27
$

If the output format is strictly enforced (exactly one space separating the two integers), then we can do this:

Shell + coreutils, 12

echo`wc -mc`

Thanks to @immibis for suggesting that I remove the space after the echo. It took me a while to figure out why that works: the shell expands this to echo<tab>n<tab>m, and since tabs are in $IFS by default, they are perfectly legal token separators in the resulting command.

GolfScript, 14 12 bytes

.,p{64/2^},,

Try it online on Web GolfScript.

Idea

GolfScript doesn't have a clue what Unicode is; all strings (input, output, internal) are composed of bytes. While that can be pretty annoying, it's perfect for this challenge.

UTF-8 encodes ASCII and non-ASCII characters differently:

All code points below 128 are encoded as 0xxxxxxx.
All other code points are encoded as 11xxxxxx 10xxxxxx ... 10xxxxxx.

This means that the encoding of each Unicode character contains either a single 0xxxxxxx byte or a single 11xxxxxx byte (and 0 to 5 10xxxxxx bytes).
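
To make the byte patterns concrete, here is a small Python sketch (not part of the golfed answer; the helper name utf8_bit_patterns is made up for illustration) that prints the UTF-8 bytes of a few characters from the test string in binary:

```python
def utf8_bit_patterns(text):
    """Return, for each character, its UTF-8 bytes as 8-bit binary strings."""
    return [[format(b, "08b") for b in ch.encode("utf-8")] for ch in text]

# One ASCII character, one 2-byte character, one 3-byte character.
for ch, bits in zip("R×∊", utf8_bit_patterns("R×∊")):
    print(ch, " ".join(bits))
# R 01010010
# × 11000011 10010111
# ∊ 11100010 10001000 10001010
```

Each character starts with exactly one byte that does not begin with 10, followed only by bytes that do.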

By dividing all bytes of the input by 64, we turn 0xxxxxxx into 0 or 1, 11xxxxxx into 3, and 10xxxxxx into 2. All that's left is to count the bytes whose quotient is not 2.
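
The same counting trick can be sketched in Python for readers who don't speak GolfScript. This is only an illustration of the idea, assuming the input is valid UTF-8; the function name count_chars_and_bytes is hypothetical.

```python
def count_chars_and_bytes(data: bytes):
    """Return (characters, bytes) for UTF-8 encoded data.

    A byte whose quotient by 64 is 2 has the form 10xxxxxx, i.e. it is a
    continuation byte; every other byte starts a new character.
    """
    chars = sum(1 for b in data if b // 64 != 2)
    return chars, len(data)

sample = "(~R∊R∘.×R)/R←1↓ιR".encode("utf-8")
print(count_chars_and_bytes(sample))  # (17, 27)
```

Unlike the GolfScript program, which prints the byte count first, this sketch returns the character count first to match the wc -mc output.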

Code

                (implicit) Read all input and push it on the stack.
.               Push a copy of the input.
 ,              Compute its length (in bytes).
  p             Print the length.
   {     },     Filter; for each byte in the original input:
    64/           Divide the byte by 64.
       2^         XOR the quotient with 2.
                If the result is non-zero, keep the byte.
           ,    Count the kept bytes.
                (implicit) Print the integer on the stack.

Python, 42 40 bytes

lambda i:[len(i),len(i.encode('utf-8'))]

Thanks to Alex A. for the two bytes off.

Straightforward, does what it says. Given the argument i, it returns the length of i in characters, followed by the length of i in bytes when encoded as UTF-8. Note that in order to accept multiline input, the function argument should be surrounded by triple quotes: '''.

EDIT: It didn't work for multiline input, so I just made it a function instead.

Some test cases (separated by blank lines):

f("Hello, World!")
13 13

f('''
friends = ['john', 'pat', 'gary', 'michael']
for i, name in enumerate(friends):
    print "iteration {iteration} is {name}".format(iteration=i, name=name)
''')
156 156

f("(~R∊R∘.×R)/R←1↓ιR")
17 27
