Parsing command-line arguments from a STRING in Clojure

Updated with a new, even more convoluted version. This is officially ridiculous; the next iteration will use a proper parser (or c.c.monads and a little bit of Parsec-like logic on top of that). See the revision history on this answer for the original.

This convoluted bunch of functions seems to do the trick (not at my DRYest with this one, sorry!):

(defn initial-state [input]
  {:expecting nil
   :blocks (mapcat #(str/split % #"(?<=\s)|(?=\s)")
                   (str/split input #"(?<=(?:'|\"|\\))|(?=(?:'|\"|\\))"))
   :arg-blocks []})

(defn arg-parser-step [s]
  (if-let [bs (seq (:blocks s))]
    (if-let [d (:expecting s)]
      (loop [bs bs]
        (cond (= (first bs) d)
              [nil (-> s
                       (assoc-in [:expecting] nil)
                       (update-in [:blocks] next))]
              (= (first bs) "\\")
              [nil (-> s
                       (update-in [:blocks] nnext)
                       (update-in [:arg-blocks]
                                  #(conj (pop %)
                                         (conj (peek %) (second bs)))))]
              :else
              [nil (-> s
                       (update-in [:blocks] next)
                       (update-in [:arg-blocks]
                                  #(conj (pop %) (conj (peek %) (first bs)))))]))
      (cond (#{"\"" "'"} (first bs))
            [nil (-> s
                     (assoc-in [:expecting] (first bs))
                     (update-in [:blocks] next)
                     (update-in [:arg-blocks] conj []))]
            (str/blank? (first bs))
            [nil (-> s (update-in [:blocks] next))]
            :else
            [nil (-> s
                     (update-in [:blocks] next)
                     (update-in [:arg-blocks] conj [(.trim (first bs))]))]))
    [(->> (:arg-blocks s)
          (map (partial apply str)))
     nil]))

(defn split-args [input]
  (loop [s (initial-state input)]
    (let [[result new-s] (arg-parser-step s)]
      (if result result (recur new-s)))))

Somewhat encouragingly, the following yields true:

(= (split-args "asdf 'asdf \" asdf' \"asdf ' asdf\" asdf")
   '("asdf" "asdf \" asdf" "asdf ' asdf" "asdf"))

So does this:

(= (split-args "asdf asdf '  asdf \" asdf ' \" foo bar ' baz \" \" foo bar \\\" baz \"")
   '("asdf" "asdf" "  asdf \" asdf " " foo bar ' baz " " foo bar \" baz "))

Hopefully this should trim regular arguments, but not ones surrounded with quotes, handle double and single quotes, including quoted double quotes inside unquoted double quotes (note that it currently treats quoted single quotes inside unquoted single quotes in the same way, which is apparently at variance with the *nix shell way... argh) etc. Note that it's basically a computation in an ad-hoc state monad, just written in a particularly ugly way and in a dire need of DRYing up. :-P


This bugged me, so I got it working in ANTLR. The grammar below should give you an idea of how to do it. It includes rudimentary support for backslash escape sequences.

Getting ANTLR working in Clojure is too much to write in this text box. I wrote a blog entry about it though.

grammar Cmd;

options {
    output=AST;
    ASTLabelType=CommonTree;
}

tokens {
    DQ = '"';
    SQ = '\'';
    BS = '\\';
}

@lexer::members {
    String strip(String s) {
        return s.substring(1, s.length() - 1);
    }
}

args: arg (sep! arg)* ;
arg : BAREARG
    | DQARG 
    | SQARG
    ;
sep :   WS+ ;

DQARG  : DQ (BS . | ~(BS | DQ))+ DQ
        {setText( strip(getText()) );};
SQARG  : SQ (BS . | ~(BS | SQ))+ SQ
        {setText( strip(getText()) );} ;
BAREARG: (BS . | ~(BS | WS | DQ | SQ))+ ;

WS  :   ( ' ' | '\t' | '\r' | '\n');