How can I convert two-valued text data to binary (bit-representation)
Another perl:
perl -pe 'BEGIN { binmode \*STDOUT } chomp; tr/AB/\0\1/; $_ = pack "B*", $_'
Proof:
$ echo ABBBAAAABBBBBABBABBBABBB | \
perl -pe 'BEGIN { binmode \*STDOUT } chomp; tr/AB/\0\1/; $_ = pack "B*", $_' | \
od -tx1
0000000 70 fb 77
0000003
The above reads input one line at a time. It's up to you to make sure the lines are exactly what they are supposed to be.
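To see where 70 fb 77 comes from, here is a minimal sketch of the same mapping in plain POSIX sh arithmetic, one 8-symbol group at a time. The name ab_to_hex is made up for illustration and is not part of the one-liner above; it assumes A means 0, B means 1, and the most significant bit comes first, which is what pack "B*" does.

```shell
# Hypothetical helper: turn one 8-symbol A/B group into a two-digit hex byte.
# Assumes A=0, B=1, most significant bit first (like perl's pack "B*").
ab_to_hex() {
    bits=$(printf %s "$1" | tr AB 01)
    val=0
    while [ -n "$bits" ]; do
        rest=${bits#?}              # everything after the first character
        digit=${bits%"$rest"}       # the first character itself
        val=$(( val * 2 + digit ))  # shift left and add the new bit
        bits=$rest
    done
    printf '%02x\n' "$val"
}

ab_to_hex ABBBAAAA   # 70
ab_to_hex BBBBBABB   # fb
ab_to_hex ABBBABBB   # 77
```

The three groups are the first, second and third bytes of the test string used in the proof.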
Edit: The reverse operation:
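#!/usr/bin/env perl
binmode \*STDIN;                        # read raw bytes
while ( defined ( $_ = getc ) ) {
    $_ = unpack "B*";                   # the byte in $_ becomes an 8-character bit string
    tr/01/AB/;
    print;
    print "\n" if ( not ++$cnt % 3 );   # newline after every 3 bytes (24 symbols)
}
print "\n" if ( $cnt % 3 );             # terminate a final partial line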
This reads a byte of input at a time.
Edit 2: Simpler reverse operation:
perl -pe 'BEGIN { $/ = \3; $\ = "\n"; binmode \*STDIN } $_ = unpack "B*"; tr/01/AB/'
The above reads 3 bytes at a time from STDIN (but receiving EOF in the middle of a sequence is not a fatal problem).
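For the reverse direction, the per-byte step can be sketched in POSIX sh as well. This is only an illustration of what unpack "B*" followed by tr/01/AB/ produces for a single byte; byte_to_ab is a hypothetical name, and it takes the byte as a decimal number (for instance as printed by od -An -tu1).

```shell
# Hypothetical helper: decimal byte value (0-255) -> 8 A/B symbols, MSB first.
byte_to_ab() {
    n=$1 out='' i=0
    while [ "$i" -lt 8 ]; do
        # peel off the lowest bit and prepend its symbol
        if [ $(( n % 2 )) -eq 1 ]; then out=B$out; else out=A$out; fi
        n=$(( n / 2 ))
        i=$(( i + 1 ))
    done
    printf '%s\n' "$out"
}

byte_to_ab 112   # ABBBAAAA  (0x70)
byte_to_ab 251   # BBBBBABB  (0xfb)
```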
{ printf '2i[q]sq[?z0=qPl?x]s?l?x'
tr -dc AB | tr AB 01 | fold -b24
} <infile | dc
In making the following statement, @lcd047 has pretty well nailed my earlier state of confusion:
You seem to be confused by the output of od. Use od -tx1 to look at bytes. od -x reads words, and on little endian machines that swaps bytes. I didn't follow closely the exchange above, but I think your initial version was correct, and you don't need to mess with byte order at all. Just use od -tx1, not od -x.
Now this makes me feel a lot better - the earlier need for dd conv=swab was bugging me all day. I couldn't pin it, but I knew there was something wrong w/ it. Being able to explain it away in my own stupidity is very comforting - especially since I learned something.
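A quick way to see the difference for yourself, assuming an od that accepts the POSIX -t option: dump the same two bytes once as single bytes and once as a 16-bit word (-tx2 is the word form that od -x stands for).

```shell
# Two bytes, 0x70 then 0xfb, written with octal escapes.
printf '\160\373' | od -An -tx1   # bytes in file order: 70 fb
printf '\160\373' | od -An -tx2   # one 16-bit word; on little-endian hosts it reads fb70
```

Only the second form depends on the machine's byte order, which is why the byte-wise dump is the one to trust here.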
Anyway, that will delete every byte which isn't [AB], then translate those to [01] accordingly, before folding the resulting stream at 24 bytes per line. dc's ? reads a line at a time, checks if input contained anything, and, if so, Prints the byte value of that number to stdout.
From man dc:
P - Pops off the value on top of the stack. If it is a string, it is simply printed without a trailing newline. Otherwise it is a number, and the integer portion of its absolute value is printed out as a "base (UCHAR_MAX+1)" byte stream.
i - Pops the value off the top of the stack and uses it to set the input radix.
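The "base (UCHAR_MAX+1)" wording just means the number is written out in base-256 digits, one digit per byte. A rough sketch of that decomposition in POSIX sh arithmetic, assuming 8-bit bytes (so UCHAR_MAX+1 is 256); to_base256 is a hypothetical name, and it prints the digits in decimal rather than as raw bytes so they can be read on a terminal.

```shell
# Hypothetical helper: list the base-256 digits of a non-negative integer,
# most significant first. Prints an empty line for 0.
to_base256() {
    n=$1 out=''
    while [ "$n" -gt 0 ]; do
        out="$(( n % 256 )) $out"   # lowest digit goes in front of the ones found so far
        n=$(( n / 256 ))
    done
    printf '%s\n' "${out% }"
}

to_base256 7404407   # 112 251 119, i.e. 0x70 0xfb 0x77
```

7404407 is the value of the 24-bit string 011100001111101101110111, which is the test input from the question after tr AB 01.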
some shell automation
Here is a shell function I wrote based on the above which can go both ways:
ABdc()( HOME=/dev/null A='[fc[fc]]sp[100000000o]p2o[fc]' B=2i
case $1 in
(-B) { echo "$B"; tr AB 01 | paste -dP - ~ ; }| dc;;
(-A) { echo "$A"; od -vAn -tu1 | paste -dlpx - ~ ~ ~; }| dc|
dc | paste - - - ~ | expand -t10,20,30 |
cut -c2-9,12-19,22-29 | tr ' 01' AAB ;;
(*) set '' "$1";: ${1:?Invalid opt: "'$2'"} ;;
esac
)
That will translate the ABABABA stuff to bytes with -B, so you can just do:
ABdc -B <infile
But with -A it will translate arbitrary input to ABABABA strings of 24 encoded bits each, one symbol per bit, in the same form as that presented for example in the question.
seq 5 | ABdc -A | tee /dev/fd/2 | ABdc -B
AABBAAABAAAABABAAABBAABA
AAAABABAAABBAABBAAAABABA
AABBABAAAAAABABAAABBABAB
AAAABABAAAAAAAAAAAAAAAAA
1
2
3
4
5
For -A output I rolled in cut, expand, and od here, which I'll get into in a minute, but I also added another dc. I dropped the line-for-line ? read dc script for another method which works an array at a time with f - which is a command that prints the full dc command-stack to stdout. Of course, because dc is a stack-oriented last-in, first-out type of application, that means that the full stack comes out in the reverse order it went in.
This might be a problem, but I use another dc anyway with an output radix set to 100000000 to handle all of the 0-padding as simply as possible. And when it reads the other's last-in, first-out stream, it applies that logic to it all over again, and it all comes out in the wash. The two dcs work in concert like this:
{ echo '[fc[fc]]sp[100000000o]p2o[fc]'
echo some data |
od -An -tu1 ###arbitrary input to unsigned decimal ints
echo lpx ###load macro stored in p and execute
} | tee /dev/fd/2 | ###just using tee to show stream stages
dc| tee /dev/fd/2 |dc
...the stream per the first tee...
[fc[fc]]sp[100000000o]pc2o[fc] ###dc's init cmd from 1st echo
115 111 109 101 32 100 97 116 97 10 ###od's output
lpx ###load p; execute
...per the second tee, as written from dc to dc...
100000000o ###first set output radix
1010 ###bin/rev vs of od's out
1100001 ###dc #2 reads it in, revs and pads it
1110100
1100001
1100100
100000
1100101
1101101
1101111 ###this whole process is repeated
1110011 ###once per od output line, so
fc ###each worked array is 16 bytes.
...and the output which the second dc writes is...
01110011
01101111
01101101
01100101
00100000
01100100
01100001
01110100
01100001
00001010
From there the function pastes it on <tabs>...
01110011 01101111 01101101
01100101 00100000 01100100
01100001 01110100 01100001
00001010
...expands <tabs> to spaces at 10 column intervals...
01110011 01101111 01101101
01100101 00100000 01100100
01100001 01110100 01100001
00001010
...cuts away all but bytes 2-9,12-19,22-29...
011100110110111101101101
011001010010000001100100
011000010111010001100001
00001010
...and translates <spaces> and zeroes to A and ones to B...
ABBBAABBABBABBBBABBABBAB
ABBAABABAABAAAAAABBAABAA
ABBAAAABABBBABAAABBAAAAB
AAAABABAAAAAAAAAAAAAAAAA
You can see on the last line there my primary motivation for including expand - it's such a lightweight filter, and it very easily ensures that every sequence written - even the last - is padded out to 24 encoded bits. When that process is reversed, and the strings are decoded to byte values with -B, there are two appended NULs:
ABdc -B <<\IN | od -tc
ABBBAABBABBABBBBABBABBAB
ABBAABABAABAAAAAABBAABAA
ABBAAAABABBBABAAABBAAAAB
AAAABABAAAAAAAAAAAAAAAAA
IN
...as you can see...
0000000 s o m e d a t a \n \0 \0
0000014
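The padding itself is easy to reproduce in isolation. The sketch below is not how the function does it (that is the paste, expand, cut and tr chain above); it swaps in printf's field width to get the same visible result for one short line. pad24 is a hypothetical name.

```shell
# Hypothetical helper: right-pad an A/B line to 24 symbols with A (zero bits).
pad24() {
    printf '%-24s\n' "$1" | tr ' ' A   # left-justify in 24 columns, then spaces become A
}

pad24 AAAABABA   # AAAABABAAAAAAAAAAAAAAAAA
```

The 16 added A symbols are the two NUL bytes that show up when the line is decoded again.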
real world data
I played with it, and tried it with some simple, realistic streams. I constructed this elaborate pipeline for staged reports...
{ ###dunno why, but I often use man man
( ###as a test input source
{ man man | ###streamed to tee
tee /dev/fd/3 | ###branched to stdout
wc -c >&2 ###and to count source bytes
} 3>&1 | ###the branch to stdout is here
ABdc -A | ###converted to ABABABA
tee /dev/fd/3 | ###branched again
ABdc -B ###converted back to bytes
times >&2 ###the process is timed
) | wc -c >&2 ###ABdc -B's output is counted
} 3>&1| wc -c ###and so is the output of ABdc -A
I don't have any good basis for performance comparison, here, though. I can only say that I was driven to this test when I was (perhaps naively) impressed enough to do so by...
man man | ABdc -A | ABdc -B
...which painted my terminal screen w/ man's output at the same discernible speed as the unfiltered command might do. The output of the test was...
37595 ###source byte count
0m0.000000s 0m0.000000s ###shell processor time nil
0m0.720000s 0m0.250000s ###shell children's total user, system time
37596 ###ABdc -B output byte count
313300 ###ABdc -A output byte count
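Those three counts are consistent with each other, which can be checked with a little shell arithmetic. The sketch assumes what the function's output suggests: every 3 source bytes become one 24-symbol line plus a newline (25 bytes), and the last line is padded to a full 24 symbols. ab_sizes is a hypothetical name.

```shell
# Hypothetical helper: expected ABdc -A and ABdc -B byte counts for a source size.
ab_sizes() {
    src=$1
    lines=$(( (src + 2) / 3 ))   # 3 bytes per line, rounded up
    # encoded size: 24 symbols + newline per line; decoded size: 3 bytes per line
    printf '%s %s\n' "$(( lines * 25 ))" "$(( lines * 3 ))"
}

ab_sizes 37595   # 313300 37596
```

The one extra byte in the decoded count is the single NUL of padding on the final line.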
initial tests
The rest is just a simpler proof of concept that it works at all...
printf %s ABBBAAAABBBBBABBABBBABBB|
tee - - - - - - - -|
tee - - - - - - - - - - - - - - - |
{ printf '2i[q]sq[?z0=qPl?x]s?l?x'
tr -dc AB | tr AB 01 | fold -b24
} | dc | od -tx1
0000000 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000020 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000040 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000060 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000100 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000120 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000140 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000160 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000200 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000220 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000240 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000260 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000300 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000320 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000340 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000360 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000400 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000420 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000440 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000460 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000500 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000520 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000540 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000560 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000600 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000620 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000640 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000660