A simple program reducer for any language
I’ve been fuzzing compilers for less familiar smart contract languages like Solidity and Move (for the Diem blockchain), and up-and-coming languages like Zig. More on that fuzzing effort here.
A fun part of that is looking at programs that crash the compilers. Reducing
those programs to submit in bug reports… less so. Thing is, I want to submit programs that are small and comprehensible for maintainers, and doing things by hand got tedious after about 5 reports. There are tools for reducing programs out there, but none really checked the boxes I wanted. So I wrote comby-reducer
to check those boxes, and things became a lot more fun:
Works on basically any language syntax or structured format (including e.g., DSLs)
Syntax-aware (not just regex), but without strictly requiring grammar definitions
Easy to define and include/exclude transformations (declarative, no scripting)
Easy to see what it did (so I can tweak transformations)
Easy to install and invoke
Fast on large inputs? Most of the crashing inputs are small, so this is not as important
Basically, something dead simple that allows easy tweaking for transforming syntax.
How it works
comby-reducer is built using comby, which supports a lightweight way of rewriting syntactic structures of a program’s parse tree, like expressions and function blocks. Comby is language-aware and understands the basic syntax of code, strings, and comments for many languages. Absent language-specific parsers, comby uses a generic parser that works well for syntactic constructs in DSLs and less mainstream languages. You can learn more about comby here.
comby-reducer uses a JavaScript library transpiled from comby's core parser engine, and a couple of functions for transforming a program to a fixed point.
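For a taste of comby’s match and rewrite templates: a transformation pairs a match template, where holes like :[a] bind to syntax, with a rewrite template that reuses those holes. The transformation below is an illustrative, made-up example (the swap function is hypothetical; the real transformations comby-reducer ships with appear later in this post), and would rewrite swap(x, y) to swap(y, x) wherever it matches:
[swap_arguments]
match='swap(:[a], :[b])'
rewrite='swap(:[b], :[a])'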
Let’s move on to examples. If you want, you can learn more about how program reducers work by checking out the resources at the end of this post.
A program reduction tour
Some fuzzing found that this program crashes the Move compiler:
module M {
    resource struct R {}
    struct Cup<T> {}
    fun t0(x8: u8, x64: u64, x128: u128) {
        continue;
        (false as u8);
        (true as u128);
        (() as u64);
        (1,1
        );
        (0 as bool);
        (0 as address);
        R = (0 as R);
        (0 as Cup<u8>);
        (0 as ());
        (0 as (u64, u8));
        (x"1234" as u64);
    }
}
comby-reducer reduces the program to something I’d be happy to submit in a bug report:
module M {
    resource struct R {}
    fun t0() {
        R = ();
    }
}
Move syntax is similar to Rust syntax, and like many languages, Move uses parentheses () and braces {} to delineate and nest expressions that correspond to an underlying parse tree. comby-reducer understands these syntactic constructs well, and can transform content inside balanced parentheses and braces (but won’t get confused if special characters like ( happen to appear inside strings or comments).
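As a quick sketch of why that matters, consider applying the delete-parentheses-content transformation shown later in this post to a made-up line (the log and compute names are hypothetical): the ( inside the string literal is ignored, and only the balanced argument list is rewritten.
log("unbalanced ( in a string", compute(1));
becomes
log();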
This program crashes the Zig compiler:
const BitField = packed struct {
    a: u3,
    b: u3,
    c: u2,
};
fn foo(error.Hi
) u3 {
    return bar(&bit_field.b);
}
fn bar(x: *const u3) u3 {
    return x.*;
}
export fn entry() usize { return @sizeOf(@TypeOf(foo)); }
and reduces to another happy candidate for a bug report:
fn foo(error.Hi) u3 {}
export fn entry() usize { return @sizeOf(@TypeOf(foo)); }
And here’s a reduced Solidity smart contract:
contract C {
    function f(uint256[] calldata x, uint256 s, uint256 e) external returns (uint256) {
        (x[s:e]).length;
    }
}
The original Solidity program:
pragma experimental ABIEncoderV2;
contract C {
    function f(uint256[] calldata x, uint256 s, uint256 e) external returns (uint256) {
        (x[s:e]).length;
    }
    function f(uint256[] calldata x, uint256 s, uint256 e, uint256 ss, uint256 ee) external returns (uint256) {
        return uint256[](x[s:e][ss:ee]).length;
    }
    function f_s_only(uint256[] calldata x, uint256 s) external returns (uint256) {
        return uint256[](x[s:]).length;
    }
    function f_e_only(uint256[] calldata x, uint256 e) external returns (uint256) {
        return uint256[](x[:e]).length;
    }
    function g(uint256[] calldata x, uint256 s, uint256 e, uint256 idx) external returns (uint256) {
        111111111111111111256[](x[s:e])[idx];
    }
    function gg(uint256[] calldata x, uint256 s, uint256 e, uint256 idx) external returns (uint256) {
        return x[s:e][idx];
    }
    function gg_s_only(uint256[] calldata x, uint256 s, uint256 idx) external returns (uint256) {
        return x[s:][idx];
    }
    function gg_e_only(uint256[] calldata x, uint256 e, uint256 idx) external returns (uint256) {
        return x[:e][idx];
    }
}
Declaring transformations
comby-reducer uses around 20 transformations to produce the above. These are in the repo, but a sample of them is shown below to give a sense of things. Transformations are defined using comby syntax, and we’ll walk through some of them.
Some simple reduction transformations
[delete_paren_content]
match='(:[1])'
rewrite='()'
rule='where nested'
[delete_brace_content]
match='{:[1]}'
rewrite='{}'
rule='where nested'
# Helps put blank bodies across newlines on the same line for line deletion.
[blank_brace]
match='{ }'
rewrite='{}'
[delete_line]
match=':[x\n]'
rewrite=''
[delete_string_content]
match='":[x]"'
rewrite='""'
[remove_first_paren_element]
match='(:[1],:[2])'
rewrite='(:[2])'
[delete_paren_contents]
match='(:[1])'
rewrite='()'
rule='where nested'
This transformation matches any content between balanced parentheses (including newlines) and deletes the content. The :[1] is a variable that binds to matched content. By default, comby-reducer will try to apply this transformation at the top level of a file, wherever it sees (...). The rule='where nested' tells comby-reducer that it should also attempt to reduce nested matches of (...) inside other matched (...). In general, parentheses are a common syntax for nesting expressions in programs, so it makes sense to add rule='where nested'.
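To make this concrete, here is a rough sketch of the candidates this transformation can generate on a made-up expression (the names f and g are hypothetical); as with any reducer, a candidate is only kept if the compiler still crashes on it:
(f(g(1)))      // original
()             // top-level match: delete the content of the outer parentheses
(f())          // nested match: delete the content of (g(1))
(f(g()))       // nested match: delete the content of (1)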
Another transformation removes the first comma-separated element inside parentheses:
[remove_first_paren_element]
match='(:[1],:[2])'
rewrite='(:[2])'
Programs often use call or function-like syntax with comma-separated parameters or arguments inside parentheses. This transformation attempts to remove elements in such syntax. It doesn’t have a rule part, since it might not be as fruitful to attempt nested reductions inside of :[1] or :[2]. But we could easily add one.
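For example, on a hypothetical call f(a,b,c), successive applications peel off the leading argument (again, each step is only kept if the program still crashes):
f(a,b,c)
f(b,c)
f(c)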
The default transformations aren’t very language-specific and worked well to reduce the languages I dealt with (Solidity, Move, Zig) to small, human-comprehensible programs. Of course, you can define your own transformations with the desired level of syntactic specificity. E.g., we might remove for loops in JavaScript with:
[remove_js_style_for_loop]
match='''
for (:[1]) {
:[2]
}
'''
rewrite=''
Observing and replaying reduction
While running comby-reducer, I noticed that programs would often reduce to structures like:
function foo() {
}
After some sequence of transformations, a block might end up empty but still contain whitespace. To crunch these down, I added the transformation:
[blank_brace]
match='{ }'
rewrite='{}'
This transformation simply deletes contiguous whitespace (including newlines) between braces. In turn, this often leads to a reduction like func () {}, which then ends up being removed by the transformation that deletes lines. Nice!
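Putting that chain together on a made-up snippet (the function name noop is hypothetical), and assuming the program keeps crashing at each step:
fun noop() {    // a body emptied by earlier transformations
}
fun noop() {}   // blank_brace collapses the whitespace-only body; delete_line then removes this line entirely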
Because I had an easy way to introduce new transformations, I wanted more insight into how transformations behaved. This helped me understand what I could add to help reduction along, or even just discover transformations that would improve formatting.
For example, I noticed one Zig program reduced to:
fn foo(error.Hi
) u3 {}
export fn entry() usize { return @sizeOf(@TypeOf(foo)); }
Here I just wanted to put the first pair of parentheses on one line, like fn foo(error.Hi) u3 {}. So I added something that matches content up to trailing whitespace within balanced parentheses:
[delete_trailing_paren_whitespace]
match='(:[1] )'
rewrite='(:[1])'
To make this process a bit more snazzy, I added a way to replay transformations. comby-reducer takes a --record argument, and the output can be replayed with comby-reduce-replay. This makes it possible to step through the process.
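A rough sketch of what that looks like on the command line; only --record and comby-reduce-replay come from the above, while the remaining arguments, the @@ file placeholder, and the paths are assumptions for illustration (see the repository for the exact invocation):
comby-reducer crash.zig --record /tmp/steps -- path/to/zig build-obj @@
comby-reduce-replay /tmp/steps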
Here’s an example where I manually step through the Zig reduction.
A rendering of the transformation steps:
------ 000.step 2021-03-26 04:50:53.499923-04:00
++++++ 001.step 2021-03-26 04:50:54.572018-04:00
@|-1,16 +1,16 ============================================================
 |
 |const BitField = packed struct {
 |    a: u3,
 |    b: u3,
 |    c: u2,
 |};
 |
 |fn foo(error.Hi) u3 {
!|    return bar(&bit_field.b);
 |}
 |
 |fn bar(x: *const u3) u3 {
 |    return x.*;
 |}
 |
 |export fn entry() usize { return @sizeOf(@TypeOf(foo)); }
------ 001.step 2021-03-26 04:50:54.572018-04:00
++++++ 002.step 2021-03-26 04:50:55.616110-04:00
@|-1,16 +1,16 ============================================================
 |
 |const BitField = packed struct {
 |    a: u3,
 |    b: u3,
 |    c: u2,
 |};
 |
 |fn foo(error.Hi) u3 {
 |    return bar();
 |}
 |
!|fn bar(x: *const u3) u3 {
 |    return x.*;
 |}
 |
 |export fn entry() usize { return @sizeOf(@TypeOf(foo)); }
------ 002.step 2021-03-26 04:50:55.616110-04:00
++++++ 003.step 2021-03-26 04:50:57.132244-04:00
@|-1,16 +1,12 ============================================================
 |
!|const BitField = packed struct {
!|    a: u3,
!|    b: u3,
!|    c: u2,
!|};
 |
 |fn foo(error.Hi) u3 {
 |    return bar();
 |}
 |
 |fn bar() u3 {
 |    return x.*;
 |}
 |
 |export fn entry() usize { return @sizeOf(@TypeOf(foo)); }
------ 003.step 2021-03-26 04:50:57.132244-04:00
++++++ 004.step 2021-03-26 04:50:57.748298-04:00
@|-1,12 +1,10 ============================================================
 |
 |const BitField = packed struct {};
 |
!|fn foo(error.Hi) u3 {
!|    return bar();
!|}
 |
 |fn bar() u3 {
 |    return x.*;
 |}
 |
 |export fn entry() usize { return @sizeOf(@TypeOf(foo)); }
------ 004.step 2021-03-26 04:50:57.748298-04:00
++++++ 005.step 2021-03-26 04:50:58.356352-04:00
@|-1,10 +1,8 ============================================================
 |
 |const BitField = packed struct {};
 |
 |fn foo(error.Hi) u3 {}
 |
!|fn bar() u3 {
!|    return x.*;
!|}
 |
 |export fn entry() usize { return @sizeOf(@TypeOf(foo)); }
------ 005.step 2021-03-26 04:50:58.356352-04:00
++++++ 006.step 2021-03-26 04:50:59.464450-04:00
@|-1,8 +1,6 ============================================================
 |
 |const BitField = packed struct {};
 |
 |fn foo(error.Hi) u3 {}
 |
-|fn bar() u3 {}
-|
 |export fn entry() usize { return @sizeOf(@TypeOf(foo)); }
------ 007.step 2021-03-26 04:51:01.000586-04:00
++++++ 008.step 2021-03-26 04:51:01.596638-04:00
@|-1,5 +1,4 ============================================================
-|const BitField = packed struct {};
 |
 |fn foo(error.Hi) u3 {}
 |
 |export fn entry() usize { return @sizeOf(@TypeOf(foo)); }
You can find out more about replaying in the repo.
Comparison to afl-tmin
It’s useful to compare some reductions to afl-tmin, which is a readily-available reducer for our afl-instrumented Solidity and Move compilers. Because afl-tmin exploits coverage information, it can be very effective at minimizing the size of inputs irrespective of format (it works for text inputs like our crashing programs, and for binary formats like JPEG). At the same time, this catch-all coverage-guided approach sacrifices some ability to exploit properties of the input domain. For bug reports, it’s useful to reduce around certain syntax like parentheses, or otherwise preserve syntactic elements, formatting, or symbol names. comby-reducer makes it easy to tailor the process around the input. For example, comby-reducer reduces the Move program to
module M {
    resource struct R {}
    fun t0() {
        R = ();
    }
}
while afl-tmin gives
module M{resource struct R{}struct C{}fun t(x:u){();();();R=();()}}
The afl-tmin variety has some redundant syntax, eagerly deletes whitespace that helps with readability, and renames the original function t0 to t. These are not necessarily bad, but there isn’t much room for tweaking.
Another example produced by comby-reducer:
module M {
    struct Box3<T1, T2, T3> {}
    fun cpy<C: copyable>() {}
    fun t3<U, C: copyable>() {
        cpy(Box3<U, U> {});
    }
}
versus afl-tmin:
module M{struct S{}resource struct C{}struct o{}struct Bo00<>{}fun h(r:R){}fun y(r:R){}fun t(){}fun t<>(){}fun t(){();(Bo00<U>{})}}
Original program for the above reductions
module M {
    struct S {}
    resource struct Coin {}
    struct Box<T> {}
    struct Box3<T1, T2, T3> {}
    fun both<R: resource, C: copyable>(r: R, c: C) {
        abort 0
    }
    fun cpy<C: copyable>(c: C) {
        abort 0
    }
    fun rsrc<R: resource>(r: R) {
        abort 0
    }
    fun t0() {
        both(S{}, Coin{});
        both(0, Coin{})
    }
    fun t1<R: resource, C: copyable>() {
        both(Box<C> {}, Box<R> {})
    }
    fun t2<R: resource, C: copyable>() {
        rsrc(Box3<C, C, C> {});
        cpy(Box3<R, C, C> {});
        cpy(Box3<C, R, C> {});
        cpy(Box3<C, C, R> {});
        cpy(Box3<C, R, R> {});
        cpy(Box3<R, C, R> {});
        cpy(Box3<R, R, C> {});
        cpy(Box3<R, R, R> {});
    }
    fun t3<U, C: copyable>() {
        cpy(Box3<U, C, C> {});
        cpy(Box3<C, U, C> {});
        cpy(Box3<C, C, U> {});
        cpy(Box3<C, U, U> {});
        cpy( C,Box3<U, U> {});
        cpy(Box3<U, U, C> {});
        cpy(Box3<U, U, U> {});
    }
}
Try it
See the GitHub repository for usage examples and more technical details. Note that comby-reducer is new, developed for simplicity, and not yet very battle tested; feel free to post issues in the GitHub issue tracker. If you want help writing transformations in comby syntax, post in the Gitter channel.
Learn more
There are a lot of reducer tools and techniques out there. The academic in me would love to explore and compare these more deeply, but the engineer in me doesn’t have the time. 🙂
So instead, here are some related tools and topics that you might want to explore further.
The Reducing chapter in The Fuzzing Book provides a deeper explanation of reducers, and particularly grammar-based reduction. Here, comby-reducer can be seen as a coarse, syntax-only grammar-based reducer that doesn’t need you to write or understand the grammar, nor write any scripts or visitors to hook into a parse tree. Instead, transformations are specified declaratively and appeal to intuitive notions around syntax common to many languages.
This deep-dive article on test-case reduction includes various references to existing tools.
A favorite tool mentioned here is C-reduce (which also works well on non-C languages). It is an especially nice resource for explaining interestingness tests, which can further guide how and when the input is transformed.