
Do you have any notes or other artifacts from your recompiler? I’d love to learn more.

Yes, I agree that there is little harm in gathering too much code. I have tried out just scanning data memory for values that refer to addresses within the region marked as code and disassembling from those points, as well as scanning the instructions I traverse for any immediate values in the same range.
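The data-memory scan described above might look something like this (a minimal sketch; the function name and address ranges are invented for illustration, not taken from any real recompiler):

```rust
// Hypothetical sketch: scan a data segment for 32-bit little-endian values
// that fall inside the code region, collecting candidate disassembly entry
// points. Pointers stored in data need not be aligned, so use a sliding
// 4-byte window.
fn find_code_pointers(data: &[u8], code_start: u32, code_end: u32) -> Vec<u32> {
    let mut hits = Vec::new();
    for window in data.windows(4) {
        let value = u32::from_le_bytes([window[0], window[1], window[2], window[3]]);
        if value >= code_start && value < code_end {
            hits.push(value);
        }
    }
    hits.sort_unstable();
    hits.dedup();
    hits
}
```

The same loop works for scanning immediates pulled out of already-traversed instructions; the only difference is where the bytes come from.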

Looking through Wikipedia at least, it's not exactly clear to me. They have separate pages for 'binary recompiler' and 'binary translation' that link to each other, and the latter is more about going between architectures (which is the main objective here).

[post author] I went down some similar paths in retrowin32, though 32-bit x86 is likely easier.

I was also surprised by how much goop there is between startup and main. In retrowin32 I just implemented it all, though I wonder how much I could get away with not running it in the Theseus replace-some-parts model.

I mostly relied on my own x86 emulator, but I also implemented the thunking between 64-bit and 32-bit mode just to see how it went. It definitely involved some asm, but once I wrapped my head around it, it wasn't so bad; check out the 'trans64' and 'trans32' snippets in https://github.com/evmar/retrowin32/blob/ffd8665795ae6c6bdd7... for, I believe, all of it. One reframing that helped me (after a few false starts) was to put as much code as possible in my high-level language and use asm only to bridge to it.


Yeah, 32-bit x86 is somewhat easier because everything's in the same flat address space, and you at least have a system-wide code32 GDT entry, which means you can skip futzing around with the LDT. 16-bit means you get to deal with segmented memory, and the cherry on top is that gdb just stops being useful, since it doesn't know anything about segmented memory (to be fair, I don't think Linux even makes it possible to query the LDT of another process, even with ptrace).

As for trying to skip everything before main... well, the main benefit for me was being able to avoid emulating DOS interrupts entirely, between skipping the calls that set up various global variables, stubbing out some of the libc implementations, and manually marking in the emulator that code page X was 32-bit (something else that sends tools into a tizzy: a function switching from 16-bit to 32-bit code partway through).

16-bit is weird and kinda fun to work with at times... but there's also a reason that progress on this is incredibly slow for me.


Slow progress is fine, it took me like two years to get where I got! (Not that I was working on it full time or anything, but also there were just many false starts and I had no idea what I was doing...)

It depends on what results you’re expecting! Relative to an interpreter, even the simplest unoptimized translation of that code is already significantly more efficient at runtime, since it skips the per-instruction fetch/decode/dispatch loop.

In the post there is a godbolt link showing a compiler inlining a simple add, but a real implementation of x86 add would be much more complex (it also has to compute the flags, for example).

I have read about other projects where the authors put some effort into getting exactly the machine code they wanted out. For example, maybe you want the virtual regs.eax to actually live in a machine register, and one way you might convince a compiler to do that is by passing it around as a function parameter everywhere instead of as a struct field. I have not investigated this myself.
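A toy sketch of the two styles described above (all names are invented for illustration, and I haven't verified what any particular compiler emits for either form):

```rust
// Hypothetical register file for a translated guest program.
struct Regs {
    eax: u32,
    ecx: u32,
}

// Struct style: each translated op loads and stores regs.eax through
// memory, so eax lives in the Regs struct between ops.
fn add_ecx_mem(regs: &mut Regs) {
    regs.eax = regs.eax.wrapping_add(regs.ecx);
}

// Threaded style: eax is passed in and returned as a plain value, which
// can let the compiler keep it in a machine register across ops.
fn add_ecx_reg(eax: u32, ecx: u32) -> u32 {
    eax.wrapping_add(ecx)
}
```

The trade-off is that the threaded style infects every function signature with the registers it touches, which is part of why it takes real effort to get the machine code you want.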


I’m not too familiar with Firefox builds. Why are clobber builds common? At first glance it seems weird to add a cache around your build system vs fixing your build system.


Define “fixing.” If you’re building in ephemeral containers, an external cache is necessary even for outputs whose inputs haven’t changed.


Though I'm not actively working on Firefox and can't speak to their use cases, one important use case for clobber builds is CI.

I'm the author of BuildCache, and where I work we run thousands of clobber builds every day in our CI. Caching helps tremendously in keeping build times short.

There are a few use cases for local development too. For instance, if you switch between git branches you may be forced into near-full rebuilds (e.g. in C++, when a header file that many files include gets touched).

Another advantage for a local dev is that you can tap into the central CI cache: when you pull the latest trunk and build it, chances are the CI system has already built that version (e.g. as part of a merge gate), so you will get cache hits.


I see, I might be confused by the terminology. "clobber" to me suggests intentionally trying to throw away cached results (clobbering what you have), but it sounds like you might just use it to mean builds where you don't have any existing build state already present.


What even is a 'clobber build'?


Sorry for being unclear; I'm using Firefox build system lingo without explanation. It comes from the command `./mach clobber`, which is similar to but not the same as `make clean`. I use 'clobber build' to mean "a build with no existing build state", and the qualifiers "cold" and "warm" to indicate whether the cache is empty or filled.


Ah ok, thanks


[ninja author] My first post about Ninja goes into this: https://neugierig.org/software/chromium/notes/2011/02/ninja....


I'm afraid I still don't understand. One factor is having fewer features and not looking for obsolete files, that I can understand. I guess the other thing is using better rules to figure out when a file truly needs to be rebuilt?


To be honest, it's not clear to me why other systems are not faster. Ninja is relatively straightforward but also not too clever.

Now that I think about it, I did write more about some of the performance stuff we did here: https://aosabook.org/en/posa/ninja.html Looking back over that, I guess we did do some lower-level optimization work. I think a lot of it was just coming at it from a performance mindset.


[ninja author] I did some thinking about this problem and eventually revisited with what I think is a pretty neat solution. I wrote about it here: https://neugierig.org/software/blog/2022/03/n2.html


Imagine if filesystems had exposed the file hash next to its mtime.


I might be missing your sarcasm, but this is a common approach for large-scale builds. Virtual filesystems are used to provide a pre-computed tree hash as an xattr. In a more typical setup, you can read the git tree hash.


Not sure it was meant as sarcasm, really. I just think so many build (and other) problems could have been avoided if a file hash were available on every file by default.


That hash would be expensive to maintain, and the end result would still be racy, since the file could be modified after the hash was read.


In the current POSIX paradigm, yes, it would be expensive. But if the hash were defined as a hash over fixed-size blocks, it wouldn't be: rewriting one block would only require rehashing that block. How racy it is depends a lot on the semantics we would define. (In the context of a build system, it's no different from the file getting a new mtime after we read the mtime.)
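A rough sketch of the fixed-block idea (the block size and the use of std's DefaultHasher are illustrative stand-ins; a real design would use a cryptographic hash and filesystem-chosen block size):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BLOCK: usize = 4096;

// Hash each fixed-size block independently. Rewriting one block only
// invalidates that block's hash, not the hashes of the others.
fn block_hashes(data: &[u8]) -> Vec<u64> {
    data.chunks(BLOCK)
        .map(|chunk| {
            let mut h = DefaultHasher::new();
            chunk.hash(&mut h);
            h.finish()
        })
        .collect()
}

// Define the file hash as a hash over the per-block hashes, so it can be
// recomputed cheaply after a single-block write.
fn file_hash(blocks: &[u64]) -> u64 {
    let mut h = DefaultHasher::new();
    blocks.hash(&mut h);
    h.finish()
}
```

This is essentially a two-level Merkle tree; content-addressed stores and some virtual filesystems maintain structures like it incrementally.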


I had a similar experience implementing SIMD instructions in my emulator, where I needed to break a 64-bit value apart into eight 8-bit values, do an operation on each value, then pack it back together. My first implementation did it with all the bit shifts you’d expect, but my second one used two helpers: unpack into an array, map over the array to a second array, and pack the array again. The optimized output was basically the same.
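A minimal sketch of the unpack/map/pack approach, using a packed-byte add (in the spirit of MMX's paddb) as the per-lane operation; the helper names are my own illustration:

```rust
// Split a 64-bit value into its eight byte lanes (little-endian order).
fn unpack(v: u64) -> [u8; 8] {
    v.to_le_bytes()
}

// Reassemble eight byte lanes into a 64-bit value.
fn pack(lanes: [u8; 8]) -> u64 {
    u64::from_le_bytes(lanes)
}

// Packed byte add: each lane wraps independently, with no carry
// between lanes, matching SIMD byte-add semantics.
fn paddb(a: u64, b: u64) -> u64 {
    let (a, b) = (unpack(a), unpack(b));
    let mut out = [0u8; 8];
    for i in 0..8 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    pack(out)
}
```

An optimizer typically compiles this down to roughly the same code as the hand-written shift-and-mask version, which is what made the helper style a free win.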


I don't know the technical details, but the kernel docs say "It exists because implementation in user-space, using existing tools, cannot match Windows performance while offering accurate semantics." https://docs.kernel.org/userspace-api/ntsync.html

