How to generate random strings that matches a Regex in Julia?
It should be possible to use Automa.jl to build a DFA and randomly traverse it. Automa uses a simpler syntax than PCRE, so the languange you can describe by it should actually be regular.
I quickly threw together the following, based mostly on the code in dot.jl
:
julia> function rand_re(machine::Automa.Machine)
out = IOBuffer()
node = machine.start
while true
if node.state ∈ machine.final_states
(rand() ≤ 1 / (length(node.edges) + 1)) && break
end
edge, node = rand(node.edges)
label = rand(collect(edge.labels))
print(out, Char(label))
end
return String(take!(out))
end
rand_re (generic function with 1 method)
julia> rand_re(Automa.compile(re"a[0-9][ab]+"))
"a6bbb"
julia> rand_re(Automa.compile(re"a[0-9][ab]+"))
"a9b"
julia> rand_re(Automa.compile(re"a[0-9][ab]+"))
"a3aa"
julia> rand_re(Automa.compile(re"a[0-9][ab]+"))
"a1a"
julia> rand_re(Automa.compile(re"a[0-9][ab]+"))
"a5ba"
The caveat is that Automa uses byte-encoded sets for edge labels, so more care should be taken where I just write Char(label)
.
Since final states can still have outgoing edges, I chose to treat stopping and each edge with uniform probability. I think this will likely have the effect that potentially infinite terms will either be very short or very long; google "Boltzmann samplers" for how to solve that (not to be confused with sampling from a Boltzmann distribution!), but the solution is rather mathematically involved.
Alternatively, you could use ccall
or PyCall
to call rxvm_gen
or Rxvm.gen
of librxvm
, which contains (probably) a quite performant code for non-backtracking regular expressions.
Julia has PCRE, which means its regular expressions are far more powerful than true regular expressions. And are in-fact turing complete. I suspect there is a bunch of interesting theoretical computer science around this. I suspect your task for PCRE might be proved to be impossible because of the halting problem. But still what we can do is try a bunch of random strings and toss out those that don't match. And for simple regex that goes a long way. Its not guaranteed to give an answer though.
If one wanted stricter regex, like those covered by Automa.jl, there is probably something better that can be done, since you can walk the state machine solving it 1 bit at a time. Hopefully someone that knows Automa.jl can post their own answer.
Code
using Random: randstring
function rand_matching(regex; max_len=2^16, max_attempts=1000)
for _ in max_attempts
str = randstring(max_len)
m = match(regex, str)
if m != nothing
# rather than return whole string,
# just return the shortest bit that matches
return m.match
end
end
error("Could not find any string that matches regex")
end
demo:
julia> @time rand_matching(r"\d\d")
0.013517 seconds (34.34 k allocations: 1.998 MiB)
"38"
julia> @time rand_matching(r"\d\d")
0.001497 seconds (11 allocations: 128.656 KiB)
"44"
julia> @time rand_matching(r"a\d\d")
0.000670 seconds (11 allocations: 128.656 KiB)
"a19"
julia> @time rand_matching(r"a\d\d")
0.000775 seconds (11 allocations: 128.656 KiB)
"a83"
julia> @time rand_matching(r"a\d\db")
0.000670 seconds (11 allocations: 128.656 KiB)
"a44b"