Show HN: A minimal stack-based VM in C

matheusmoreira · on Sept 6, 2020

The design section of the README is greatly appreciated!

> The core loop uses computed goto, which means that new instructions must be added in identical order

> Values are represented as tagged unions.

> Fundamental types are global (as in not tied to a specific VM instance)

What exactly is a global fundamental type? Is there a local counterpart?

codr7 · on Sept 6, 2020

You may choose whatever storage duration you feel like for your own types, but built in types are globally declared and therefore shared between VM instances.

matheusmoreira · on Sept 6, 2020

I see, so it refers to the C storage class. There can be several virtual machine instances referencing the same built-in integer type instance. This imposes on users the need to call lg_init() and lg_deinit() before and after using the library. I feel like this could have been avoided by statically enumerating the built-in types instead of initializing a data structure dynamically.
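A minimal sketch of the static alternative, assuming a simplified type structure (struct type, struct val, int_type and val_add are hypothetical names, not the library's actual definitions). The built-in type is a constant initialized at compile time, so nothing has to be set up or torn down at run time:

```c
#include <assert.h>
#include <stdint.h>

struct type; /* forward declaration so struct val can point at it */

/* Hypothetical, simplified value: a type pointer plus an integer payload. */
struct val {
  const struct type *type;
  int64_t as_int;
};

/* Hypothetical, simplified type: a name and two operations. */
struct type {
  const char *id;
  void (*add)(struct val *x, const struct val *y);
  void (*sub)(struct val *x, const struct val *y);
};

static void int_add(struct val *x, const struct val *y) { x->as_int += y->as_int; }
static void int_sub(struct val *x, const struct val *y) { x->as_int -= y->as_int; }

/* Built at compile time and shared by every VM instance;
   no init/deinit call is required. */
const struct type int_type = {.id = "Int", .add = int_add, .sub = int_sub};

/* Dispatches through the type, as a generic add instruction might. */
void val_add(struct val *x, const struct val *y) {
  assert(x->type == y->type);
  x->type->add(x, y);
}
```

This only works as long as every field of the type can be computed at compile time; a type that needs allocation during setup would still need some form of initialization.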

The data type structure contains pointers to functions that implement operations such as addition, subtraction, copying and cloning. An integer type instance is initialized with addition and subtraction functions. The integer value representation is part of the value data structure though. So how could a user of the library define new data types? It seems like it would be necessary to modify the library's source code in order to add new members to the tagged union.

Also, perhaps the virtual machine could be optimized further by using tagged pointers to make integer values immediate, avoiding the need to dereference the pointer.
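One common way to do this, sketched under the assumption that every heap object is at least 2-byte aligned so the low bit of a real pointer is always 0. The names value_t, value_from_int and so on are made up for illustration and are not part of the library:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical value word: low bit 1 = immediate integer, low bit 0 = pointer. */
typedef uintptr_t value_t;

static inline int value_is_int(value_t v) { return (v & 1u) != 0; }

static inline value_t value_from_int(intptr_t i) {
  /* Shift in unsigned arithmetic so negative inputs are well defined.
     The top bit of i is lost: integers get one bit less range. */
  return ((uintptr_t)i << 1) | 1u;
}

static inline intptr_t value_to_int(value_t v) {
  assert(value_is_int(v));
  /* Relies on the unsigned-to-signed conversion wrapping and on an
     arithmetic right shift for negative values; both are
     implementation-defined, and GCC and Clang behave this way. */
  return (intptr_t)v >> 1;
}

static inline value_t value_from_ptr(void *p) {
  assert(((uintptr_t)p & 1u) == 0); /* alignment assumption */
  return (uintptr_t)p;
}

static inline void *value_to_ptr(value_t v) {
  assert(!value_is_int(v));
  return (void *)v;
}
```

The cost is one bit of integer range and a tag check on each access; the gain is that small integers never touch the heap.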

codr7 · on Sept 6, 2020

You would either have to reuse one of the existing representations or modify the union atm.

I'll add a void *as_any to the union eventually which means you're just one level of indirection from supporting any representation without touching the union.
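A rough sketch of what that could look like from the user's side, assuming a simplified value layout. Everything here except the as_any member name is hypothetical (struct type, struct val, the Point type and its helpers), and the anonymous union requires C11:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct type { const char *id; }; /* hypothetical, simplified */

struct val {
  const struct type *type;
  union {           /* anonymous union, C11 */
    int64_t as_int; /* existing built-in representation */
    void *as_any;   /* escape hatch for user-defined representations */
  };
};

/* A user-defined type that needs no change to the union. */
struct point { double x, y; };
static const struct type point_type = {"Point"};

struct val point_new(double x, double y) {
  struct point *p = malloc(sizeof *p);
  if (!p) abort(); /* keep the sketch short: no error reporting */
  p->x = x;
  p->y = y;
  struct val v;
  v.type = &point_type;
  v.as_any = p;
  return v;
}

double point_x(struct val v) {
  assert(v.type == &point_type); /* the extra indirection happens here */
  return ((const struct point *)v.as_any)->x;
}

void point_free(struct val v) {
  assert(v.type == &point_type);
  free(v.as_any);
}
```

The trade-off is the one mentioned above: every access to the payload goes through one more pointer, and ownership of the allocation has to be handled by the type's own operations.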

tarruda · on Sept 6, 2020

For those interested in the subject, another great read is the Lua register-based VM: https://webserver2.tecgraf.puc-rio.br/lua/local/source/5.1/l...

gergo_barany · on Sept 6, 2020

> ideas on how to improve its performance further without making a mess are most welcome.

One approach would be to design the input language for performance, in particular by giving it statically typed operations. A specialized iadd instruction for values that you are sure you want to treat as integers would save you a lot of indirection through function pointers. A disadvantage of static typing is that you need to implement type checks if you want to guarantee well-typedness.
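To make the difference concrete, here is a toy interpreter with both a generic add that dispatches through the value's type and a specialized iadd that does not. It is only a sketch: the opcode names, struct layouts and the run function are invented for illustration and do not mirror the project's code.

```c
#include <assert.h>
#include <stdint.h>

struct val;
struct type { void (*add)(struct val *x, const struct val *y); };
struct val { const struct type *type; int64_t as_int; };

static void int_add(struct val *x, const struct val *y) { x->as_int += y->as_int; }
static const struct type int_type = {int_add};

enum op { OP_PUSH, OP_ADD, OP_IADD, OP_STOP };

int64_t run(const int64_t *code) {
  struct val stack[16];
  int sp = 0;
  for (;;) {
    switch (*code++) {
    case OP_PUSH: /* operand follows the opcode */
      assert(sp < 16);
      stack[sp].type = &int_type;
      stack[sp].as_int = *code++;
      sp++;
      break;
    case OP_ADD: /* generic: indirect call through the type */
      assert(sp >= 2);
      sp--;
      stack[sp - 1].type->add(&stack[sp - 1], &stack[sp]);
      break;
    case OP_IADD: /* specialized: no type lookup, no indirect call */
      assert(sp >= 2);
      sp--;
      stack[sp - 1].as_int += stack[sp].as_int;
      break;
    case OP_STOP:
      assert(sp >= 1);
      return stack[sp - 1].as_int;
    default:
      assert(0);
      return 0;
    }
  }
}

/* The same computation, 2 + 3, with each flavor of add. */
const int64_t prog_add[] = {OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_STOP};
const int64_t prog_iadd[] = {OP_PUSH, 2, OP_PUSH, 3, OP_IADD, OP_STOP};
```

Note that OP_IADD never looks at the type field, so whatever emits the bytecode has to guarantee both operands are integers. That is the type-checking burden mentioned above.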

Another (orthogonal) approach would be to consider a JIT backend. Not one you write yourself, that could definitely be considered "making a mess". But in the past I've had success using LibJIT (https://www.gnu.org/software/libjit/) for speeding up a stack interpreter. In that case, it was a subset of Python bytecode (see https://github.com/gergo-/pylibjit, the code has very probably bitrotted).

jiive · on Sept 6, 2020

This piqued my interest. I’m a hobbyist C programmer so forgive me if this is a rudimentary question. What is the rationale/convention for naming a variable “_”? I’ve never seen this before.

For example:

  struct lg_buf *lg_buf_init(struct lg_buf *_) {
    _->data = NULL;
    _->len = _->cap = 0;
    return _;
  }

mostlylurks · on Sept 6, 2020

An underscore is primarily used as an identifier to denote a variable/parameter whose name does not really matter. It's more commonly used when you're not planning on using the aforementioned variable, but it's also sometimes used when the identifier is used but its name doesn't matter.

One such instance is the abbreviated Scala lambda syntax, where (x:Int) => x + 2 can be abbreviated to _ + 2 in the same way that Kotlin would allow abbreviating it as { it + 2 } with its equivalent default lambda parameter name, "it".

In the example you quoted, the identifier denotes the sole parameter, so in a sense its name does not matter, and as such people from certain programming circles might be inclined to use an underscore instead of taking the time to come up with an appropriate name. It's not like a more descriptive name would help in that example, the type name and function name already give sufficient context for it to be perfectly clear what the parameter is for, and it's not like parameter names have any semantic significance in C.

jiive · on Sept 6, 2020

> It's not like a more descriptive name would help in that example, the type name and function name already give sufficient context for it to be perfectly clear what the parameter is for, and it's not like parameter names have any semantic significance in C.

I agree, and I’ve been thinking about this today. There is also a minimalistic quality to this style, the _ pointer is more prominent simply by not having a real name. I like it!

nfoz · on Sept 6, 2020

I think this is borrowing a convention from Perl, where the variable $_ is often given an implicit value of "the thing I'm talking about" when you didn't bother to give a real variable name for it.

http://www.tutorialspoint.com/perl/perl_special_variables.ht...

(The $ sigil denotes that the variable is a scalar; all Perl variables have a sigil like that.)

So this reads naturally to me: _ is the lg_buf, it doesn't need a real name like "buf", because it's just "the thing" or "it".

rurban · on Sept 6, 2020

And Perl took this convention from Prolog. In Perl it's the default context, in Prolog the default unknown symbol.

makapuf · on Sept 6, 2020

I've seen variables named _ when you don't use them (and that's the way to avoid an error for not using a variable after declaration in Rust or Go, for example), but in that context I find it weird.

Randor · on Sept 6, 2020

Hi,

The ISO C standard in section 7.1.3 states that global functions and variables in compiler/system libraries should be prefixed with an underscore to avoid the risk of conflict with names in user programs.

I checked and indeed... lg_buf is a global variable.

But I've never seen anyone use just the underscore prefix without a name before.

chrisseaton · on Sept 6, 2020

> I checked and indeed... lg_buf is a global variable.

struct lg_buf is a type, not any kind of variable. And here _ is a local variable, not a global variable so can't conflict with anything else anyway.

So what you've said doesn't apply for more than one reason.

Randor · on Sept 6, 2020

Thanks,

In that case I don't really know why the guy named his variable with an underscore.

codr7 · on Sept 6, 2020

Because it's short, easy to recognize, and somewhat accepted as a sign that the name doesn't matter, or even as 'it' in Perl.

I find that using it for the default parameter (or self) simplifies the code substantially on a visual level.

fwsgonzo · on Sept 6, 2020

Funny, I actually have some fib benchmarks for my RISC-V emulator! It uses fib(40), but I added one for 20. Interestingly, that is the one benchmark that LuaJIT crushes my emulator in, so I'm still working on beating it, but I don't really have any plan. :)

    libriscv: fib(20) median 317ns    lowest: 310ns      highest: 356ns
    luajit: fib(20)   median 146ns    lowest: 145ns      highest: 170ns
    lua5.4: fib(20)   median 631ns    lowest: 598ns      highest: 694ns

Running your emulator: $ ./fibrec 567us

Modern compilers used with emulated machines can beat even v8 at times. Cool project! I divided your number by 100, is that correct? Is there some additional overhead in setting up / tearing down something?

codr7 · on Sept 6, 2020

Correct, all benchmarks run 100 repetitions of fib(20).

There is no additional overhead that I'm aware of except calling a function in a loop, which is intentional.

lebuffon · on Sept 6, 2020

You might want to peruse the C sources for GForth, which has been under continuous development for 20 years or so. It introduced a concept called super-instructions that speeds things up quite a bit. I am not an expert on the internals, just a casual user.
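For readers unfamiliar with the idea, here is a toy illustration of a super-instruction: a frequent opcode sequence (push a constant, then add) is fused into one opcode so the interpreter dispatches once instead of twice. This is not how GForth implements it; the opcodes, the fuse pass and the run loop are all made up for the sketch, and the bytecode has no jumps, which a real pass would have to retarget.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_PUSH_ADD, OP_STOP };

static int has_operand(int64_t op) { return op == OP_PUSH || op == OP_PUSH_ADD; }

/* Peephole pass: rewrite "PUSH n; ADD" as "PUSH_ADD n".
   out must have room for n words; returns the new length. */
size_t fuse(const int64_t *in, size_t n, int64_t *out) {
  size_t i = 0, o = 0;
  while (i < n) {
    if (in[i] == OP_PUSH && i + 2 < n && in[i + 2] == OP_ADD) {
      out[o++] = OP_PUSH_ADD;
      out[o++] = in[i + 1];
      i += 3;
    } else {
      /* Copy the opcode together with its operand, if it has one,
         so operands are never mistaken for opcodes. */
      int words = has_operand(in[i]) ? 2 : 1;
      assert(i + (size_t)words <= n);
      while (words--) out[o++] = in[i++];
    }
  }
  return o;
}

/* Runs the code and counts how many instructions were dispatched. */
int64_t run(const int64_t *code, int *dispatches) {
  int64_t stack[16];
  int sp = 0;
  for (;;) {
    ++*dispatches;
    switch (*code++) {
    case OP_PUSH:     assert(sp < 16); stack[sp++] = *code++; break;
    case OP_ADD:      assert(sp >= 2); sp--; stack[sp - 1] += stack[sp]; break;
    case OP_PUSH_ADD: assert(sp >= 1); stack[sp - 1] += *code++; break;
    case OP_STOP:     assert(sp >= 1); return stack[sp - 1];
    default:          assert(0); return 0;
    }
  }
}

/* 1 + 2 + 3 as plain bytecode: six dispatches including STOP. */
const int64_t sum_prog[] = {OP_PUSH, 1, OP_PUSH, 2, OP_ADD,
                            OP_PUSH, 3, OP_ADD, OP_STOP};
```

Running fuse over sum_prog yields PUSH 1; PUSH_ADD 2; PUSH_ADD 3; STOP, which computes the same result in four dispatches.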

addaon · on Sept 6, 2020

Thanks for the introduction to this. See paper at http://www.euroforth.org/ef03/ertl-gregg03.pdf.