Utility to Strip Comments from LaTeX Source

latexpand

(distributed with TeXLive)

latexpand is a tool that expands \include/\input directives and also removes comments. It can be told to keep empty end-of-line comments (a bare %), because these are sometimes meaningful. First strip the comment text but keep the % signs, then remove lines containing only % in a second step:

$ latexpand --empty-comments mytexfile.tex > mytexfile-stripped.tex

Remove lines with only % and whitespace:

$ sed -i '/^\s*%/d' mytexfile-stripped.tex

and squeeze blank lines (cat -s may have problems with Windows line endings, and on Solaris -s has a different meaning):

$ cat -s mytexfile-stripped.tex | sponge mytexfile-stripped.tex

(sponge is part of moreutils from https://joeyh.name/code/moreutils/)
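If sed -i or cat -s behaves differently on your platform, the last two steps can also be done in a few lines of Python. This is only a sketch, assuming the file has already been run through latexpand --empty-comments; the function name tidy_stripped is made up for illustration and is not part of latexpand. Unlike the sed command above, it drops only lines that contain nothing but % and whitespace.

```python
import re

def tidy_stripped(text):
    """Drop lines holding only '%' and whitespace, then squeeze blank lines.

    Hypothetical helper, not part of latexpand. CRLF line endings are
    normalised to LF first so that Windows files are handled too.
    """
    lines = text.replace("\r\n", "\n").split("\n")
    # Keep every line that is not just an optional-whitespace-wrapped '%'.
    kept = [ln for ln in lines if not re.fullmatch(r"\s*%\s*", ln)]
    out = []
    for ln in kept:
        blank = ln.strip() == ""
        # Skip a blank line that directly follows another blank line.
        if blank and out and out[-1].strip() == "":
            continue
        out.append(ln)
    return "\n".join(out)
```

Like the shell version, this does not know about verbatim environments, so a line consisting of a lone % inside one would be dropped as well.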

I'm not sure how to do this. So, I'm posting a new solution. The code I posted yesterday will eat comments from within a verbatim environment.

Here's a new example file to be cleaned:

I've a LaTeX source.
I'm ready %for submission
%But first I would like
to strip its comments.

So I hope there are
100\% auto
ways to get this done.

\begin{comment}
   Because there are subtle ways to mess it up.
\end{comment}

\begin{verbatim}
        next two lines should not be lost
        % don't lose this line
% this line should stay too
\end{verbatim}

According to the verbatim package documentation, verbatim and comment environments should not be nested. The following code (similar to what I posted yesterday) will not eat commented lines that appear within a verbatim environment.

Here is the corrected Perl code:

#!/usr/bin/perl
use strict 'vars';
&MAIN(@ARGV);
sub MAIN {
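   my ($filename) = @_;

   ## Use a lexical filehandle and stop with a message if the file cannot be read
   open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
   my @doc = <$fh>;
   close $fh;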

   &removeComments(\@doc);

   foreach my $line ( @doc ){
      print $line;
    }

   return 1;
}

sub removeComments {
   my ($docarray) = @_;

   my $isCommentEnvironment  = "no";
   my $isVerbatimEnvironment = "no";

   my @newdoc;

   foreach my $line ( @{$docarray} ){
      $isVerbatimEnvironment = "yes" if ( $line =~ /^\\begin{verbatim}/ );
      $isCommentEnvironment  = "yes" if ( $line =~ /^\\begin{comment}/ );
      if ( ($isVerbatimEnvironment eq "no") && ($isCommentEnvironment eq "no") ){
         next if ($line =~ /^%/);
         ## Temporarily replace "%" that you want to keep with a dummy string
         ## that does not appear elsewhere in your document.  Then, remove remainder
         ## of lines that still contain "%".
         if ( $line =~ /\\%/){
            $line =~ s/\\%/TMP::PERCENT/g;
            $line =~ s/%.*//;
            $line =~ s/TMP::PERCENT/\\%/g;
         } else {
            ## do not remove trailing % marking NO SPACE in LaTeX: $line =~ s/%.*//;
            $line =~ s/\s*%.+//;
         }
         push @newdoc, $line;
      }
      push @newdoc, $line if ( $isVerbatimEnvironment eq "yes" );

      $isVerbatimEnvironment = "no" if ( $line =~ /^\\end{verbatim}/ );
      $isCommentEnvironment  = "no" if ( $line =~ /^\\end{comment}/ );
    }

   @{$docarray} = @newdoc;
   return 1;
 }
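For readers who do not use Perl, the same line-based idea can be sketched in Python. This is a loose transcription rather than the author's code: the name strip_comments_linewise is hypothetical, and the comment-removal rule is simplified to a single regular expression that keeps a bare trailing % in all cases. It shares the limitations of the Perl script: environments are only recognised at the start of a line, and a % preceded by a double backslash is treated as escaped.

```python
import re

def strip_comments_linewise(lines):
    """Line-based comment stripper modelled on the Perl script above.

    Hypothetical name; `lines` is a list of strings including line endings.
    """
    in_verbatim = False
    in_comment = False
    out = []
    for line in lines:
        if line.startswith(r"\begin{verbatim}"):
            in_verbatim = True
        if line.startswith(r"\begin{comment}"):
            in_comment = True

        if in_verbatim:
            out.append(line)  # verbatim content is kept untouched
        elif not in_comment and not line.startswith("%"):
            # Remove an unescaped % plus the text after it (and any
            # whitespace before it). A bare trailing % is kept because
            # ".+" needs at least one character after the %.
            line = re.sub(r"\s*(?<!\\)%.+", "", line)
            out.append(line)

        if line.startswith(r"\end{verbatim}"):
            in_verbatim = False
        if line.startswith(r"\end{comment}"):
            in_comment = False
    return out
```

It can be applied to a file with something like `sys.stdout.writelines(strip_comments_linewise(open(path).readlines()))`, where path is whatever file you want to clean.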

It can be done using the Python ply.lex module to write a simple tokenizer:

import ply.lex, argparse, io

#Usage
# python stripcomments.py input.tex > output.tex
# python stripcomments.py input.tex -e encoding > output.tex

def strip_comments(source):
    tokens = (
                'PERCENT', 'BEGINCOMMENT', 'ENDCOMMENT', 'BACKSLASH',
                'CHAR', 'BEGINVERBATIM', 'ENDVERBATIM', 'NEWLINE', 'ESCPCT',
             )
    states = (
                ('linecomment', 'exclusive'), 
                ('commentenv', 'exclusive'), 
                ('verbatim', 'exclusive')
            )

    #Deal with escaped backslashes, so we don't think they're escaping %.
    def t_ANY_BACKSLASH(t):
        r"\\\\"
        return t

    #One-line comments
    def t_PERCENT(t):
        r"\%"
        t.lexer.begin("linecomment")

    #Escaped percent signs
    def t_ESCPCT(t):
        r"\\\%"
        return t

    #Comment environment, as defined by verbatim package       
    def t_BEGINCOMMENT(t):
        r"\\begin\s*{\s*comment\s*}"
        t.lexer.begin("commentenv")

    #Verbatim environment (different treatment of comments within)   
    def t_BEGINVERBATIM(t):
        r"\\begin\s*{\s*verbatim\s*}"
        t.lexer.begin("verbatim")
        return t

    #Any other character in initial state we leave alone    
    def t_CHAR(t):
        r"."
        return t

    def t_NEWLINE(t):
        r"\n"
        return t

    #End comment environment    
    def t_commentenv_ENDCOMMENT(t):
        r"\\end\s*{\s*comment\s*}"
        #Anything after \end{comment} on a line is ignored!
        t.lexer.begin('linecomment')

    #Ignore contents of comment environment    
    def t_commentenv_CHAR(t):
        r"."
        pass

    def t_commentenv_NEWLINE(t):
        r"\n"
        pass

    #End of verbatim environment    
    def t_verbatim_ENDVERBATIM(t):
        r"\\end\s*{\s*verbatim\s*}"
        t.lexer.begin('INITIAL')
        return t

    #Leave contents of verbatim environment alone
    def t_verbatim_CHAR(t):
        r"."
        return t

    def t_verbatim_NEWLINE(t):
        r"\n"
        return t

    #End a % comment when we get to a new line
    def t_linecomment_ENDCOMMENT(t):
        r"\n"
        t.lexer.begin("INITIAL")
        #Newline at the end of a line comment is stripped.

    #Ignore anything after a % on a line        
    def t_linecomment_CHAR(t):
        r"."
        pass

    lexer = ply.lex.lex()
    lexer.input(source)
    return u"".join([tok.value for tok in lexer])

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('filename', help = 'the file to strip comments from')
    parser.add_argument('--encoding', '-e', default='utf-8')

    args = parser.parse_args()

    with io.open(args.filename, encoding=args.encoding) as f:
        source = f.read()

    print(strip_comments(source))

if __name__ == '__main__':
    main()

Alternately, here is a Gist.

This correctly handles comments which directly follow a double backslash, such as

Line text \\%comment-text

The first answer above seems to handle this incorrectly (as if the percent sign were escaped), though I don't have sufficient reputation to comment.
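The rule behind this can be shown in isolation: a % starts a comment when it is preceded by an even number of backslashes (zero included), and is an escaped percent sign when the count is odd. The following Python sketch finds the first comment-starting % on a line. The function name is hypothetical, and verbatim and comment environments are deliberately ignored here.

```python
def find_comment_start(line):
    """Return the index of the first % that starts a comment, or -1.

    Illustrative helper (hypothetical name). A % counts as escaped only
    when an odd number of backslashes comes directly before it.
    """
    backslashes = 0
    for i, ch in enumerate(line):
        if ch == "\\":
            backslashes += 1
            continue
        if ch == "%" and backslashes % 2 == 0:
            return i
        # Any other character (or an escaped %) ends the backslash run.
        backslashes = 0
    return -1
```

For the example line above, `find_comment_start(r"Line text \\%comment-text")` points at the % sign, whereas for `r"100\% auto"` it reports that there is no comment.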
