Utility to Strip Comments from LaTeX Source
latexpand (distributed with TeXLive)
latexpand is a tool to expand include/input directives which also removes comments. It can also be tweaked to keep end-of-line comments, because sometimes they are meaningful. With --empty-comments the comment text is dropped but the % itself is kept; lines containing only % are then removed in a second step:
$ latexpand --empty-comments mytexfile.tex > mytexfile-stripped.tex
Remove lines with only % and whitespace:
$ sed -i '/^\s*%/d' mytexfile-stripped.tex
and squeeze blank lines (cat -s might have problems with Windows line endings, and on Solaris -s has a different meaning):
$ cat -s mytexfile-stripped.tex | sponge mytexfile-stripped.tex
(sponge is part of moreutils, from https://joeyh.name/code/moreutils/)
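Where sed, cat -s, or sponge are not available, the two clean-up steps can be approximated in Python. This is only a sketch and not part of latexpand: the function name tidy_stripped is made up for illustration, and it assumes the text has already been through latexpand --empty-comments. It is also a bit stricter than the sed command above, since it deletes a line only when nothing but whitespace follows the %, and it treats whitespace-only lines as blank when squeezing.

```python
import re


def tidy_stripped(text):
    """Drop lines holding only '%' and whitespace, then squeeze blank lines.

    Hypothetical helper, not part of latexpand or moreutils.
    """
    kept = []
    for line in text.splitlines():  # splitlines() also copes with \r\n endings
        if re.fullmatch(r"\s*%\s*", line):
            continue  # line carries nothing but an emptied comment
        if line.strip() == "" and kept and kept[-1].strip() == "":
            continue  # collapse a run of blank lines into a single one
        kept.append(line)
    return "\n".join(kept) + "\n" if kept else ""
```

Reading the file, passing its contents through tidy_stripped, and writing the result back would then stand in for the sed and cat -s | sponge steps.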
I'm not sure how to do this, so I'm posting a new solution. The code I posted yesterday will eat comments from within a verbatim environment.
Here's a new example file to be cleaned:
I've a LaTeX source.
I'm ready %for submission
%But first I would like
to strip its comments.
So I hope there are
100\% auto
ways to get this done.
\begin{comment}
Because there are subtle ways to mess it up.
\end{comment}
\begin{verbatim}
next two lines should not be lost
% don't lose this line
% this line should stay too
\end{verbatim}
According to the verbatim package documentation, verbatim and comment environments should not be nested. The following code (similar to what I posted yesterday) will not eat commented lines that appear within a verbatim environment.
Here is the corrected Perl code:
#!/usr/bin/perl
use strict;
use warnings;

MAIN(@ARGV);

sub MAIN {
    my ($filename) = @_;
    die "Usage: $0 file.tex\n" unless defined $filename;
    open my $fh, '<', $filename or die "Cannot open $filename: $!\n";
    my @doc = <$fh>;
    close $fh;
    removeComments(\@doc);
    foreach my $line ( @doc ){
        print $line;
    }
    return 1;
}

sub removeComments {
    my ($docarray) = @_;
    my $isCommentEnvironment  = "no";
    my $isVerbatimEnvironment = "no";
    my @newdoc;
    foreach my $line ( @{$docarray} ){
        $isVerbatimEnvironment = "yes" if ( $line =~ /^\\begin\{verbatim\}/ );
        $isCommentEnvironment  = "yes" if ( $line =~ /^\\begin\{comment\}/ );
        if ( ($isVerbatimEnvironment eq "no") && ($isCommentEnvironment eq "no") ){
            ## Drop lines that are nothing but a comment.
            next if ($line =~ /^%/);
            ## Temporarily replace "%" that you want to keep with a dummy string
            ## that does not appear elsewhere in your document. Then, remove remainder
            ## of lines that still contain "%".
            if ( $line =~ /\\%/){
                $line =~ s/\\%/TMP::PERCENT/g;
                $line =~ s/%.*//;
                $line =~ s/TMP::PERCENT/\\%/g;
            } else {
                ## Keep a bare trailing "%" (it marks NO SPACE in LaTeX),
                ## hence ".+" here rather than ".*".
                $line =~ s/\s*%.+//;
            }
            push @newdoc, $line;
        }
        ## Lines inside verbatim are copied unchanged; lines inside comment are dropped.
        push @newdoc, $line if ( $isVerbatimEnvironment eq "yes" );
        $isVerbatimEnvironment = "no" if ( $line =~ /^\\end\{verbatim\}/ );
        $isCommentEnvironment  = "no" if ( $line =~ /^\\end\{comment\}/ );
    }
    @{$docarray} = @newdoc;
    return 1;
}
It can be done using the Python ply.lex module to write a simple tokenizer:
import ply.lex, argparse, io

#Usage
# python stripcomments.py input.tex > output.tex
# python stripcomments.py input.tex -e encoding > output.tex

def strip_comments(source):
    tokens = (
        'PERCENT', 'BEGINCOMMENT', 'ENDCOMMENT', 'BACKSLASH',
        'CHAR', 'BEGINVERBATIM', 'ENDVERBATIM', 'NEWLINE', 'ESCPCT',
    )
    states = (
        ('linecomment', 'exclusive'),
        ('commentenv', 'exclusive'),
        ('verbatim', 'exclusive')
    )

    #Deal with escaped backslashes, so we don't think they're escaping %.
    #Only active outside comments, so a \\ inside a comment is dropped with it.
    def t_INITIAL_verbatim_BACKSLASH(t):
        r"\\\\"
        return t

    #One-line comments
    def t_PERCENT(t):
        r"\%"
        t.lexer.begin("linecomment")

    #Escaped percent signs
    def t_ESCPCT(t):
        r"\\\%"
        return t

    #Comment environment, as defined by verbatim package
    def t_BEGINCOMMENT(t):
        r"\\begin\s*{\s*comment\s*}"
        t.lexer.begin("commentenv")

    #Verbatim environment (different treatment of comments within)
    def t_BEGINVERBATIM(t):
        r"\\begin\s*{\s*verbatim\s*}"
        t.lexer.begin("verbatim")
        return t

    #Any other character in initial state we leave alone
    def t_CHAR(t):
        r"."
        return t

    def t_NEWLINE(t):
        r"\n"
        return t

    #End comment environment
    def t_commentenv_ENDCOMMENT(t):
        r"\\end\s*{\s*comment\s*}"
        #Anything after \end{comment} on a line is ignored!
        t.lexer.begin('linecomment')

    #Ignore comments of comment environment
    def t_commentenv_CHAR(t):
        r"."
        pass

    def t_commentenv_NEWLINE(t):
        r"\n"
        pass

    #End of verbatim environment
    def t_verbatim_ENDVERBATIM(t):
        r"\\end\s*{\s*verbatim\s*}"
        t.lexer.begin('INITIAL')
        return t

    #Leave contents of verbatim environment alone
    def t_verbatim_CHAR(t):
        r"."
        return t

    def t_verbatim_NEWLINE(t):
        r"\n"
        return t

    #End a % comment when we get to a new line
    def t_linecomment_ENDCOMMENT(t):
        r"\n"
        t.lexer.begin("INITIAL")
        #Newline at the end of a line comment is stripped.

    #Ignore anything after a % on a line
    def t_linecomment_CHAR(t):
        r"."
        pass

    lexer = ply.lex.lex()
    lexer.input(source)
    return u"".join([tok.value for tok in lexer])

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('filename', help = 'the file to strip comments from')
    parser.add_argument('--encoding', '-e', default='utf-8')
    args = parser.parse_args()

    with io.open(args.filename, encoding=args.encoding) as f:
        source = f.read()

    print(strip_comments(source))

if __name__ == '__main__':
    main()
Alternately, here is a Gist.
This correctly handles comments which directly follow a double backslash, such as
Line text \\%comment-text
The first answer above seems to handle this incorrectly (as if the percent sign were escaped), though I don't have sufficient reputation to comment.
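The double-backslash case comes down to counting: a % starts a comment only when it is preceded by an even number of backslashes (zero included). For a quick line-based check without a tokenizer, that rule fits in a single regular expression. This is a sketch using only Python's re module; strip_line_comment and COMMENT are names I made up, and it knows nothing about verbatim, comment environments or \verb, so it complements the approaches above rather than replacing them.

```python
import re

# Group 1 captures complete pairs of backslashes (each pair is a literal \\).
# The lookbehind guarantees we start counting at the first backslash of the
# run, so a '%' is matched only after an even number of backslashes.
COMMENT = re.compile(r"(?<!\\)((?:\\\\)*)%.*")


def strip_line_comment(line):
    """Remove an unescaped % comment from one line, keeping any \\\\ before it."""
    return COMMENT.sub(lambda m: m.group(1), line)
```

With this, Line text \\%comment-text is reduced to Line text \\, while 100\% auto is left alone because its % follows a single (odd) backslash.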