Detokenizing without extra spaces?

It is impossible to distinguish \A . from \A. once TeX has converted those into tokens: the only solution if you need to preserve those spaces is to read the argument verbatim.
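
A quick way to check this is to store both forms in macros and compare them. This is only a minimal sketch; the names \withspace and \nospace are made up for the illustration, and the lines can go in any plain TeX or LaTeX file.

\def\withspace{\A .}
\def\nospace{\A.}
% The space after the control word \A is dropped while TeX reads the line,
% so both macros hold exactly the same two tokens.
\ifx\withspace\nospace
  \message{same tokens}
\else
  \message{different tokens}
\fi
% \detokenize cannot tell them apart either: both lines write "\A ."
\message{\detokenize{\A .}}
\message{\detokenize{\A.}}

The space in the output comes from \detokenize itself, which adds one after every control word, whether or not the source had one.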

However, if losing those spaces is acceptable, the simplest method is to update the l3kernel and l3experimental bundles (and l3packages) to a very recent version (February 2012), then use tools from the l3regex package to add \string in front of each token in the argument, and expand. The code below does that (replace \tl_show:N by whatever you want to do with the string).

\documentclass{article}
\usepackage{l3regex}
\ExplSyntaxOn
\cs_new_protected:Npn \test #1
  {
    \tl_set:Nn \l_tmpa_tl {#1}
    \regex_replace_all:nnN { . } { \c{string} \0 } \l_tmpa_tl
    \tl_set:Nx \l_tmpb_tl { \l_tmpa_tl }
    % now \l_tmpb_tl contains what you want:
    \tl_show:N \l_tmpb_tl
  }
\ExplSyntaxOff
\begin{document}
  \test{\A\d{2,}.+ Hello, world!\z}
\end{document}

How does it work? \regex_replace_all:nnN performs a replacement on a stored token list, so we need to store the argument.

\tl_set:Nn    % Set locally
  \l_tmpa_tl  % the "local temporary token list" `\l_tmpa_tl`
  {#1}        % to contain "#1" (the argument).
\regex_replace_all:nnN % Replace every occurrence of
  { . }                % any token, even braces etc.
  {                    % by
    \c{string}         %   \string
    \0                 %   what was matched (the token)
  } \l_tmpa_tl         % in \l_tmpa_tl
\tl_set:Nx        % Set locally, with expansion,
  \l_tmpb_tl      % the "local temporary token list b"
  { \l_tmpa_tl }  % to (the expansion of) `\l_tmpa_tl`
\tl_show:N    % Show the contents of
  \l_tmpb_tl  % the token list variable `\l_tmpb_tl`

Of course, l3regex does a lot of work under the hood, so whether this is fast enough will depend on how many such regular expressions you have to go through.
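
As an illustration of replacing \tl_show:N by something else, here is a variant that typesets the string in a typewriter font. It is a sketch under two assumptions that are not part of the code above: the command name \printstring is hypothetical, and the expl3 release is recent enough that the regex functions are part of the kernel, so expl3 is loaded in place of l3regex.

\documentclass{article}
% Assumption: a recent expl3 in which the regex module is part of the kernel.
\usepackage{expl3}
\ExplSyntaxOn
% \printstring is a hypothetical name chosen for this example.
\cs_new_protected:Npn \printstring #1
  {
    \tl_set:Nn \l_tmpa_tl {#1}
    % Put \string in front of every token, as before.
    \regex_replace_all:nnN { . } { \c{string} \0 } \l_tmpa_tl
    % Expanding turns each token into its string form.
    \tl_set:Nx \l_tmpb_tl { \l_tmpa_tl }
    % Typeset the result instead of showing it on the terminal.
    \texttt { \tl_use:N \l_tmpb_tl }
  }
\ExplSyntaxOff
\begin{document}
\printstring{\A\d{2,}.+ Hello, world!\z}
\end{document}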

EDIT: A plain TeX solution for the very specific task you are asking for. I am assuming that the strings never contain the character ^^A (char code 1). The idea is to use \lowercase to change all true space tokens to some recognizable character, then apply \detokenize and loop through the result one character at a time (this automatically skips spaces), replacing ^^A by a space.

\catcode64=11
\long\def\test#1%
  {%
    \begingroup
      % Ensure that every character is preserved by \lowercase.
      \count@\z@
      \loop\ifnum\count@<256
        \lccode\count@\z@
        \advance\count@\@ne
      \repeat
      % Except spaces, changed to ^^A
      \lccode32=\@ne
      \lowercase
        {%
          \endgroup
          \edef\result{\expandafter\test@\detokenize{#1}\relax}%
        }%
  }
% Then map {^^A => space, space =>} onto the string.
\def\test@#1%
  {%
    \ifx#1\relax\test@end\fi
    \ifnum`#1=\@ne\space\else#1\fi
    \test@
  }
\def\test@end\fi#1\test@{\fi}
\catcode64=12
\test{ab c\d e{f} \fg }\show\result
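
To see what the mapping changes, the following lines put the output of a bare \detokenize next to \result. This is an illustrative sketch meant to be appended after the definitions above; the macro name \plain is made up for the comparison.

\edef\plain{\detokenize{ab c\d e{f} \fg }}
\show\plain   % ab c\d e{f} \fg  (a space follows \d and \fg)
\test{ab c\d e{f} \fg }
\show\result  % ab c\de{f} \fg   (only spaces that were real tokens remain)

The space typed between \d and e never becomes a token, so it is absent from \result. This is the limitation stated at the top of this answer.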

Despite the question being tagged as tex-core, I would like to point to xparse. It has the argument specification v, which does the detokenization without extra spaces, as far as I understand it. From the manual:

Arguments of type “v” are read in verbatim mode, which will result in the grabbed argument consisting of tokens of category code 12 (“other”), except spaces, which are given category code 10 (“space”).

\DeclareDocumentCommand\foo{v}{\ttfamily #1}

And using it with

\foo!\A\d{2,}.+\z!

produces the same output as

\verb!\A\d{2,}.+\z!

There are no extra spaces introduced by xparse. In this sense the contents of argument #1 are “untouched”.
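
For reference, a complete file along these lines might look as follows. It is a sketch rather than something taken from the manual: the definition uses \texttt so that the font change stays local, and the last test line is added only to show that a space after a control word survives.

\documentclass{article}
\usepackage{xparse}
% Same "v" argument as above, with the font change limited to the argument.
\DeclareDocumentCommand\foo{v}{\texttt{#1}}
\begin{document}
\foo!\A\d{2,}.+\z!

% The space after \A is kept because the argument is read verbatim.
\foo!\A .! differs from \foo!\A.!
\end{document}

As with \verb, this only works when \foo is not itself used inside the argument of another command, because the characters must not have been tokenized before \foo reads them.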

A truly general solution seems difficult but here is my attempt.

Call \spaceparse with the argument to be detokenized and parsed for spaces after commands. The result of the parsing is stored in the macro \result, so you need to use (or \show) \result to see the outcome.

Because \detokenize doubles the hash character, we first reverse that action. If you don't require this default action, then use the star (*) form of \spaceparse.

You can copy this into a package and call the package.

\documentclass{article}
\usepackage{catoptions}
% No conflict with etoolbox.sty:
% \usepackage{etoolbox}
\makeatletter
\robust@def*\spaceparse{\cpt@testst\sp@ceparse}
\robust@def*\sp@ceparse#1{%
  \begingroup
  \edef\@tempa{\detokenize{#1}}%
  \ifboolTF{cpt@st}{}{\s@expandarg\cpt@pophash\@tempa\@tempa}%
  \edef\@tempa##1{##1\expandcsonce\@tempa\@space\cpt@nil}%
  \edef\@tempb##1{\def##1####1\@space####2\cpt@nil}%
  \@tempb\@tempb{%
    \ifblankTF{##2}{%
      \toks@\expandafter{\the\toks@##1}%
    }{%
      \countbackslash{##1}%
      \ifnum\nr=\@ne
        \xifinsetTF{\@car##2\relax\@nil}\cpt@oth@rchars{%
          \toks@\expandafter{\the\toks@##1}%
        }{%
          \cptexpanded{\toks@{\the\toks@\unexpanded{##1}\@space}}%
        }%
      \else
        \cptexpanded{\toks@{\the\toks@\unexpanded{##1}\@space}}%
      \fi
      \@tempb##2\cpt@nil
    }%
  }%
  \@tempa{\toks@{}\@tempb}%
  \edef\result{\the\toks@}%
  \postgroupdef\result\endgroup
}
\robust@def*\countbackslash#1{%
  \begingroup
  \@tempcnta\z@
  \def\@tempa##1{%
    \def\@tempa####1##1####2\@nil{%
      \ifblankTF{####2}{}{%
        \advance\@tempcnta\@ne
        \@tempa####2\@nil
      }%
    }%
    \@tempa#1##1\@nil
  }%
  \s@expandarg\@tempa\@backslashchar
  \cptexpanded{\endgroup\def\noexpand\nr{\the\@tempcnta}}%
}
\makeatother

Tests:

\def\x{##1\A\d{2,}.+\z A B\\x y}
% Content of \x is already read:
\expandafter\spaceparse\expandafter{\x}
\show\result

\spaceparse{xx\x{f} \fg x}
\show\result

\spaceparse{ab c\d e{f} \fg x}
\show\result

\spaceparse{#1\A\d{2,}.+\z A B}
\show\result

\begin{document}

\end{document}
