That time I moved a parenthesis and got a 13x speedup

First off, Rust is incredible.

I spent several days chasing this down, and I think there might be an opportunity to convert more old C++ graybeards like me to Rust if things like this could somehow be easier to detect…

The story began when I was porting a project that already had a C++ implementation. My code is https://github.com/donbright/earcutr and the original is https://github.com/mapbox/earcut.hpp; both have benchmarks.

My port was unfortunately slower than unoptimized C++ code until… I found the following speedup, after a few dozen runs of various profilers (Valgrind's callgrind, KCachegrind) and techniques like [Mike Dunlavey's profiling philosophy](https://stackoverflow.com/users/23771/mike-dunlavey).

By changing this

fn zorder(xf: f64, yf: f64, minx: f64, miny: f64, invsize: f64) -> i32 {
let mut x: i32 = 32767 * ((xf - minx) * invsize) as i32;
let mut y: i32 = 32767 * ((yf - miny) * invsize) as i32;

To this

fn zorder(xf: f64, yf: f64, minx: f64, miny: f64, invsize: f64) -> i32 {
let mut x: i32 = (32767.0 * ((xf - minx) * invsize)) as i32;
let mut y: i32 = (32767.0 * ((yf - miny) * invsize)) as i32;
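
Part of why the parenthesis matters is operator precedence: in Rust, `as` binds tighter than `*`, so the first form casts the scaled coordinate to an integer before multiplying by 32767. Here is a minimal, self-contained sketch (not the actual earcutr source; the helper names and the assumption that `invsize` normalizes coordinates into [0, 1], as in earcut.hpp, are mine) contrasting the two forms:

```rust
// Hypothetical one-axis versions of the two zorder variants above.

// Before: `as` binds tighter than `*`, so the f64 is truncated to an
// i32 *first*, and only then multiplied by 32767.
fn zorder_before(xf: f64, minx: f64, invsize: f64) -> i32 {
    32767 * ((xf - minx) * invsize) as i32
}

// After: the whole product is computed in f64, then cast once.
fn zorder_after(xf: f64, minx: f64, invsize: f64) -> i32 {
    (32767.0 * ((xf - minx) * invsize)) as i32
}

fn main() {
    // If invsize scales coordinates into [0, 1], the early cast
    // truncates nearly every value to 0 before the multiply.
    assert_eq!(zorder_before(0.5, 0.0, 1.0), 0);
    assert_eq!(zorder_after(0.5, 0.0, 1.0), 16383);
}
```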

Benchmarks went from this

test bench_water ... bench: 30,145,667 ns/iter (+/- 3,189,319)
test bench_water2 ... bench: 14,321,385 ns/iter (+/- 269,591)

to this

test bench_water ... bench: 2,305,028 ns/iter (+/- 99,240)
test bench_water2 ... bench: 2,684,839 ns/iter (+/- 251,179)

Now, that was not a linear process of getting from A to B; I took a dozen twisty roads, and one version with round() was running faster than one without. But the basic problem, I think, was revealed by callgrind: somehow the slow code was executing at least twice as many instructions as the C++ implementation. Somewhere in that float-to-int conversion there was something that Clang and G++ could optimize, but Rust did not.

I realize it is possibly a theoretically problematic optimization. What I did, if I am not mistaken, was change int32 = (int32 * (float64 as int32)) into int32 = ((float64 * float64) as int32). Since floats and integers have different ranges, this could overflow in the second case where it wouldn't necessarily in the first, or vice versa. It's like multiplying apples by oranges and assigning them to bananas.
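
The range concern can be made concrete. A small sketch, assuming Rust 1.45 or later (where float-to-int `as` casts saturate instead of being undefined) and using made-up out-of-range values for illustration:

```rust
fn main() {
    // A float-to-int `as` cast saturates at the integer's bounds,
    // so the (float64 * float64) as int32 form cannot overflow;
    // it just clamps.
    assert_eq!(1e12_f64 as i32, i32::MAX);
    assert_eq!(-1e12_f64 as i32, i32::MIN);

    // The int32 * (float64 as int32) form, however, multiplies two
    // integers, which can overflow: panic in debug builds, wrap in
    // release builds. checked_mul makes the overflow visible.
    let scaled: i32 = 100_000; // hypothetical out-of-range intermediate
    assert_eq!(32767i32.checked_mul(scaled), None);
}
```

So the two forms fail differently at the edges, which is why passing the upstream test suite mattered before trusting the change.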

But what if there is some alternative opportunity? What if there were a mode where the compiler could 'offer ideas' for the user to explore: potentially unsafe optimizations that could be made safe with a bit of thinking and some good testing?

That is what made the difference for me: just one more "what if I tried it one last time" look at this code. It also helped that the original Mapbox earcut.hpp authors had 34 tests in their code, and all of them still passed after this optimization, so I could be confident that even though it was unsafe in theory, it was probably good enough for this project (which is inherently built to trade off imperfection for speed).

Anyways, thanks for reading.
