Jump to content
  • Sky
  • Blueberry
  • Slate
  • Blackcurrant
  • Watermelon
  • Strawberry
  • Orange
  • Banana
  • Apple
  • Emerald
  • Chocolate
  • Charcoal
  • 0
evg-zhabotinsky

Preemptive multitasking and sysyield timeouts

Question

This is a suggestion on how and why to rework OC's sysyield timeouts. The required fixes basically give ability to implement preemptive multitasking for free.

1. Current implementation of sysyield timeouts is insufficient

Currently, it is not a real protection against executor thread hogging. It is only a safeguard, and a very efficient one.

Consider the following code:

local function dos()
  while true do
  end
end

while true do
  pcall(coroutine.resume, coroutine.create(dos))
  computer.pullSignal(0)
end

Flash it onto a EEPROM, and throw that into a bare T1 case with just a CPU, a smallest RAM stick, and some power to keep it going.

Turn on that computer, and you have a hogged a single executor thread until someone shuts it down. Boot some more computers off the same EEPROM, and you can easily hog all executor threads on a server.

On default config, running just 4 of such hogs makes other computers practically unusable, every sysyield causes a 5 second delay.

Tweaking the timeout won't really help, since then programs will have to pull(0) much more often, and in the presence of such hogs every yield will cause a delay.

Just killing the whole machine when it starts hogging executors won't help in some cases, specifically if computer is turned on by WoL, redstone signal or another computer.

Even though it's highly unlikely, such hogging can happen accidentally. Specifically, if a system is built so that it restarts on unknown errors, sysyield timeout will not stop an accidental infinite loop in it.

2. How it is currently implemented

I don't think I have a complete understanding yet, but that's how I see it:

The coroutine library is wrapped:

  • coroutine.yield() adds an extra nil as the first value, to differentiate normal yields from sysyields
  • coroutine.resume() repeatedly does following:
    1. Call sethook() on the coroutine to terminate it on a sysyield timeout
    2. Call the real coroutine.resume()
    3. If the first returned value is nil (i.e. normal yield), return the rest
    4. Otherwise (i.e. sysyield), yield all values and continue the loop

Therefore, when sysyield happens (i.e. on computer.pullSignal), the whole stack of running coroutines is unwound, and the outermost one yields the computer. On the next tick, or whenever a signal comes, the computer is resumed, and the same stack of coroutines is rebuilt, and the innermost one continues its execution.

The hook that is attached to each coroutine checks if computer has been running for too long, and throws an error if so, which terminates the currently running coroutine. It also extends the deadline by half a second exactly once, so that the parent coroutine has a chance to recover. So long as it does a sysyield in those half a second, or yields/returns and lets its parents do that, the deadline is reset and the computer can continue operating. If it does not, the remaining parent coroutines do not get a chance to recover, and the computer is halted with an error. And until the innermost coroutine times out, nothing can stop it unless it yields control.

For example, if you type while true do end in the LUA repl, it will start that code in a coroutine and print an error message when it times out. If that same loop is explicitly started inside a coroutine, the error is caught and another infinite loop is started, the REPL will not get a chance to do anything, and the whole computer will crash.

3. The suggested implementation

The basic problem is, the hook set by Lua code with debug.sethook() cannot yield a coroutine. It can only throw an error, terminating the coroutine, therefore the timeout has to be long enough to let the program do whatever it needs to without doing sysyields all the time. This is problematic, since all the computers share the same few executor threads, and the timeout also has to be short enough to allow other computers to run normally.

However, the debug hooks actually can yield, under certain very restrictive conditions: https://www.lua.org/manual/5.3/manual.html#lua_Hook The catch is, it can only happen from the native code (C for normal Lua, probably Java for LuaJ), and no values can be returned. I've written a simple test library in C to allow setting yieldable Lua hooks from Lua code: https://github.com/evg-zhabotinsky/yieldhook I tried to implement it myself at first, but then realized that I'm reimplementing debug.sethook... So I just copied it, cut out the unneeded bits and added a loader. If possible, it's probably cleaner to modify the original Lua library (ldblib.c for C version, not sure about LuaJ).

The final interface of the modified debug.sethook() is that if hook mask contains an 'y', event is either 'line' or 'count' and the hook function returns truth, the paused coroutine is yielded with no values, i.e. coroutine.resume() returns just true. The normal return from coroutine with no arguments can be differentiated from a hook yield by checking if coroutine is now dead, and the normal yields can be wrapped to prepend an extra true value.

tl;dr: see the test program in my repo, it demonstrates a simple preemptive multitasking implementation. Its wrapper for coroutine.resume returns:

  • true plus the values on normal yield or return
  • false plus the error message on error
  • nothing on a hook yield (that's what different from the standard behavior)

It should be pretty easy to modify the OC's existing wrapper and hook functions to sysyield instead of throwing an error on timeout. That timeout can then be made much shorter, since coroutines are no longer terminated. Preferably, it should be adjusted dynamically by machine.lua, so that cycles are distributed fairly (incl. according to CPU and memory tiers) with no leftovers and each computer runs exactly once every tick. Preferably, something like computer.getSliceTime() and computer.getRemainingTime() should return how much time the computer has been assigned during the tick and how much of it is left. The units of those functions can be declared "abstract cycle amounts" for the sake of "realism", in which case the slice time will be CPU frequency (quite unstable, just like the tooltip suggests xD) and the remaining time will be cycle count until the next bus synchronization or whatever.

Having done that, there basically will be preemptive multitasking between all the computers, and none of them will have to wait long for their turn to run. For compatibility, the current "die if hung for 5s" behavior can be implemented in OpenOS, and allowed to be disabled for the sake of "power users".

To allow for user-controlled preemptive multitasking on a single computer, I propose adding computer.setTimeout(time, coroutine), with time being the same units as computer.remainingTime() and coroutine being the timeout handler coroutine. The debug hook will work all the same, except it will yield prematurely if the soft deadline happens to be earlier. The wrapped coroutine.resume() will see that a timeout has happened, but hard timeout hasn't expired yet, and instead of yielding any further will disable soft timeout and resume the specified coroutine. The coroutine.status() has to be wrapped too to mark the "yielded by hook" coroutine as "normal", as if it called resume() on the handler coroutine. While the handler coroutine is running, soft timeouts are ignored. When it yields, the paused coroutine is resumed and soft timeout is enablet, if it was set again by the handler. However, if there is a pending soft timeout, the handler coroutine is immediately resumed again.

The setTimeout behavior suggested above matches closely how a timer interrupt would be handled in a real CPU, exactly what machine.lua should contain. The rest of multitasking implementation and protection is on the OS.

Apart of implementing a scheduler, synchronization mechanisms have to be implemented. Count hook can trigger in the middle of any non-trivial expression, so normal assignments cannot be used for unconditionally reliable synchronization. However, a coroutine cannot be resumed unless it's suspended, and everything on a computer runs in a single Lua state, so any coroutine body is a natural critical region, which can easily be leveraged to implement any kind of synchronization.

4. Who is going to implement all that?

Given some time, I can implement the Lua side of things. The native side, though... I'm afraid to break something there, and even though I've done the same modifications for the official Lua implementation, I don't know yet how to do the same for JNLua and LuaJ and in the context of OpenComputers specifically. Plus, I'm not keen on rebuilding OpenComputers from source. If someone confirms that there's nothing to learn or break and that it should just work, I'll probably do that too.

Once the core functionality is implemented, I'm thinking more like implementing a more Unix-like OS, aiming for POSIX compatibility, and the ability to implement proper preemptive multitasking is a must for that. The ultimate goal is to be able to compile and run most normal C programs without modifications. I'm already working on a primitive C to Lua compiler, because I gave up porting a simplistic MP3 decoder to Lua by hand. (I want to have online radio in my bunker!)

Aaanyway, comments and suggestions are welcome. I was more interested in the technical part, so I haven't really though much about API function names or how exactly to divide time among computers of different tiers, so there's that. (Also, is there a way to figure out tiers of arbitrary component from inside the computer?) I didn't think at all about how OpenOS would provide preemptive multitasking to its programs, too, and I have absolutely no idea if it would crash and burn when anything starts running "in parallel".

Link to post
Share on other sites

9 answers to this question

Recommended Posts

  • 0

I appreciate the study you've put into this. You aren't alone; we also care deeply about these issues.

The reason we don't do pre-emptive scheduling actually isn't due to yielding issues from the debug hook (as you've demonstrated) but because we can't persist the lua runtime state in a suspended debug hook. In other words, if the minecraft world chunk were to unload whilst we have a coroutine-in-a-suspended-debug-hook, then upon loading that lua state again (when the chunk loads again), we'd be forced to "stop" the machine. On the other hand, we definitely CAN persist when the root-level coroutine yields back to the machine (which is what you get when you call computer.pullSignal)

Link to post
Share on other sites
  • 0
10 hours ago, payonel said:

<...> we can't persist the lua runtime state in a suspended debug hook. <...> a coroutine-in-a-suspended-debug-hook <...>

I'd like to clarify that bit. Do you mean just stopping a machine while it's running? Yielding from Lua debug hook? Neither of the two is what I suggested. Moreover, the latter is mostly impossible: the Lua hook function cannot yield across a C boundary, which is the underlying C hook, and even that cannot yield arbitrarily.

I'm pretty sure that I got this correct by now. Line and count hooks are executed in the middle of instruction fetch, and if it yields in the end, it undoes the part of fetch that has already been done. (See vmfetch in lvm.c and luaG_traceexec in ldebug.c) Also, looking at lua_yieldk() implementation (in ldo.c) and coroutine.yield() implementation (luaB_yield() in lcorolib.c), it seems to me that the effect of yielding in the end of line/count C hook is exactly the same as if the previous opcode (in the interrupted coroutine) was a call to coroutine.yield(). I'm not positive on why no values can be passed, though; maybe that's because no opcode gets really executed.

So, basically, C hooks allow to inject normal coroutine.yield() calls, and my patch for debug.sethook() implementation passes that ability to Lua hooks. What I suggested was to inject such a yield instead of throwing an error whenever a deadline is hit, and to treat it as if it was a result of call to computer.pullSignal(0), mostly, except do not return the event, even if one is pending. (That's the other bit of native implementation that has to be done, I guess.) Plus, if hard deadline hasn't been reached yet (i.e. hook caused a yield due to soft deadline), there's no need to yield the whole computer yet, and a user-provided handler can be called instead.

Unless Eris has problems with coroutines that were suspended after an arbitrary opcode, as opposed to a call (to coroutine.yield), I don't see any reason for such an implementation not working.

Link to post
Share on other sites
  • 0
5 hours ago, payonel said:

it's a limitation of eris (edit: that is to say, eris is incompatible with hooks)

Yes, looks like it, now that I've taken a closer look at Eris. It even has a special error message for hook-yielded coroutines.

However, hook-yielded coroutines have an almost normal Lua frame at the top of the stack, and with slight modifications I seem to have managed to persist them.

See https://github.com/fnuecke/eris/pull/30

Link to post
Share on other sites
  • 0

I've rebuilt the mod with a freshly built native Lua library, and my patch seems to work in-game.

I'm currently updating machine.lua to use the new functionality, and looks like it's going to be a massive rework.

For example, user-provided GC hooks would be hard to update properly (and the current implementation is bugged), but I think I came up with a brand new implementation that can even be enabled by default! See this gist for the code.

What I wanted to ask right now is, how hard would it be to push an almost complete rework of machine.lua upstream?

P.S.  Is "Requests" the right subforum for this thread?

Link to post
Share on other sites
  • 0

First of all, if you pull off getting our machines to run without forced cooperative yields....that's just awesome. I'm pretty excited about the work you did with eris already.

Secondly, a fixed user-defined __gc solution also sounds very nice, but raises new risks. I would like to suggest we look at these two features separately. Obviously, the gc work coming after. The problem with __gc code is that debug hooks are disabled during __gc code execution. So a user could rob the thread and we'd not be able to yield it. So unless you have found a way to execute gc code WHILE not disabling debug hooks -- this is not something we can have enabled by default. (but still a thing we should fix, as yes, the current implementation is not perfect)

Thirdly, At this point, you can make a github issue as a feature request. We'll hold off making a PR until we have all the pieces and scope agreed on. You can also discuss these issues with me in a more real-time medium on our irc channel (esper.net) #oc

 

Link to post
Share on other sites
  • 0
39 minutes ago, payonel said:

So unless you have found a way to execute gc code WHILE not disabling debug hooks -- this is not something we can have enabled by default.

How about a way to execute gc code after gc is done with? That is, no user code at all runs during gc. Only grand total of 9 short lines of machine.lua code per hooked object are ran. The hooked objects can't be collected until the next gc pass anyway, so we might as well quickly dump them into a list and process their hooks immediately after GC finishes. (Or at least before any other user code runs, yielding hooks allow to do that.)

Anyway, I need to get __gc hooks out of the way to continue working on machine.lua upgrade, so I guess I'm leaving only the setmetatable() stub for now. The rest of the current __gc code won't be reused so upgrading it is pointless.

Link to post
Share on other sites
  • 0
1 hour ago, payonel said:

We'll hold off making a PR until we have all the pieces and scope agreed on.

Actually, I think PR is the proper way to manage such subprojects. It's not just a transaction from one branch to another, more like a dedicated single-issue single-branch repository, linking code changes and the related discussion together. Did you know that you can even create pull requests to other pull requests?

And when all is done, the PR branch can be squashed and rebased onto the target, and the original commit history should still remain attached to the discussion. (At least that's how it worked in Gitlab when I last checked.)

Buuut the repo isn't mine, so I won't insist on my idea of git workflow.

Link to post
Share on other sites
  • 0

Finally returned to the project. It sort-of works in-game already. Currently often breaks persistence when last yield was forced close to a Java function call, JNLua internal functions somehow end up in the persisted state and prevent persisting. Otherwise no problems noticed, save for a single very weird segfault on world save that I could not reproduce.

The issue was created here: https://github.com/MightyPirates/OpenComputers/issues/3036

Link to post
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Answer this question...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...

×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use and Privacy Policy.