This is a suggestion on how and why to rework OC's sysyield timeouts. The required fixes basically give ability to implement preemptive multitasking for free.
1. Current implementation of sysyield timeouts is insufficient
Currently, it is not a real protection against executor thread hogging. It is only a safeguard, and a very efficient one.
Flash it onto a EEPROM, and throw that into a bare T1 case with just a CPU, a smallest RAM stick, and some power to keep it going.
Turn on that computer, and you have a hogged a single executor thread until someone shuts it down. Boot some more computers off the same EEPROM, and you can easily hog all executor threads on a server.
On default config, running just 4 of such hogs makes other computers practically unusable, every sysyield causes a 5 second delay.
Tweaking the timeout won't really help, since then programs will have to pull(0) much more often, and in the presence of such hogs every yield will cause a delay.
Just killing the whole machine when it starts hogging executors won't help in some cases, specifically if computer is turned on by WoL, redstone signal or another computer.
Even though it's highly unlikely, such hogging can happen accidentally. Specifically, if a system is built so that it restarts on unknown errors, sysyield timeout will not stop an accidental infinite loop in it.
2. How it is currently implemented
I don't think I have a complete understanding yet, but that's how I see it:
The coroutine library is wrapped:
coroutine.yield() adds an extra nil as the first value, to differentiate normal yields from sysyields
coroutine.resume() repeatedly does following:
Call sethook() on the coroutine to terminate it on a sysyield timeout
Call the real coroutine.resume()
If the first returned value is nil (i.e. normal yield), return the rest
Otherwise (i.e. sysyield), yield all values and continue the loop
Therefore, when sysyield happens (i.e. on computer.pullSignal), the whole stack of running coroutines is unwound, and the outermost one yields the computer. On the next tick, or whenever a signal comes, the computer is resumed, and the same stack of coroutines is rebuilt, and the innermost one continues its execution.
The hook that is attached to each coroutine checks if computer has been running for too long, and throws an error if so, which terminates the currently running coroutine. It also extends the deadline by half a second exactly once, so that the parent coroutine has a chance to recover. So long as it does a sysyield in those half a second, or yields/returns and lets its parents do that, the deadline is reset and the computer can continue operating. If it does not, the remaining parent coroutines do not get a chance to recover, and the computer is halted with an error. And until the innermost coroutine times out, nothing can stop it unless it yields control.
For example, if you type while true do end in the LUA repl, it will start that code in a coroutine and print an error message when it times out. If that same loop is explicitly started inside a coroutine, the error is caught and another infinite loop is started, the REPL will not get a chance to do anything, and the whole computer will crash.
3. The suggested implementation
The basic problem is, the hook set by Lua code with debug.sethook() cannot yield a coroutine. It can only throw an error, terminating the coroutine, therefore the timeout has to be long enough to let the program do whatever it needs to without doing sysyields all the time. This is problematic, since all the computers share the same few executor threads, and the timeout also has to be short enough to allow other computers to run normally.
However, the debug hooks actually can yield, under certain very restrictive conditions: https://www.lua.org/manual/5.3/manual.html#lua_Hook The catch is, it can only happen from the native code (C for normal Lua, probably Java for LuaJ), and no values can be returned. I've written a simple test library in C to allow setting yieldable Lua hooks from Lua code: https://github.com/evg-zhabotinsky/yieldhook I tried to implement it myself at first, but then realized that I'm reimplementing debug.sethook... So I just copied it, cut out the unneeded bits and added a loader. If possible, it's probably cleaner to modify the original Lua library (ldblib.c for C version, not sure about LuaJ).
The final interface of the modified debug.sethook() is that if hook mask contains an 'y', event is either 'line' or 'count' and the hook function returns truth, the paused coroutine is yielded with no values, i.e. coroutine.resume() returns just true. The normal return from coroutine with no arguments can be differentiated from a hook yield by checking if coroutine is now dead, and the normal yields can be wrapped to prepend an extra true value.
tl;dr: see the test program in my repo, it demonstrates a simple preemptive multitasking implementation. Its wrapper for coroutine.resume returns:
true plus the values on normal yield or return
false plus the error message on error
nothing on a hook yield (that's what different from the standard behavior)
It should be pretty easy to modify the OC's existing wrapper and hook functions to sysyield instead of throwing an error on timeout. That timeout can then be made much shorter, since coroutines are no longer terminated. Preferably, it should be adjusted dynamically by machine.lua, so that cycles are distributed fairly (incl. according to CPU and memory tiers) with no leftovers and each computer runs exactly once every tick. Preferably, something like computer.getSliceTime() and computer.getRemainingTime() should return how much time the computer has been assigned during the tick and how much of it is left. The units of those functions can be declared "abstract cycle amounts" for the sake of "realism", in which case the slice time will be CPU frequency (quite unstable, just like the tooltip suggests xD) and the remaining time will be cycle count until the next bus synchronization or whatever.
Having done that, there basically will be preemptive multitasking between all the computers, and none of them will have to wait long for their turn to run. For compatibility, the current "die if hung for 5s" behavior can be implemented in OpenOS, and allowed to be disabled for the sake of "power users".
To allow for user-controlled preemptive multitasking on a single computer, I propose adding computer.setTimeout(time, coroutine), with time being the same units as computer.remainingTime() and coroutine being the timeout handler coroutine. The debug hook will work all the same, except it will yield prematurely if the soft deadline happens to be earlier. The wrapped coroutine.resume() will see that a timeout has happened, but hard timeout hasn't expired yet, and instead of yielding any further will disable soft timeout and resume the specified coroutine. The coroutine.status() has to be wrapped too to mark the "yielded by hook" coroutine as "normal", as if it called resume() on the handler coroutine. While the handler coroutine is running, soft timeouts are ignored. When it yields, the paused coroutine is resumed and soft timeout is enablet, if it was set again by the handler. However, if there is a pending soft timeout, the handler coroutine is immediately resumed again.
The setTimeout behavior suggested above matches closely how a timer interrupt would be handled in a real CPU, exactly what machine.lua should contain. The rest of multitasking implementation and protection is on the OS.
Apart of implementing a scheduler, synchronization mechanisms have to be implemented. Count hook can trigger in the middle of any non-trivial expression, so normal assignments cannot be used for unconditionally reliable synchronization. However, a coroutine cannot be resumed unless it's suspended, and everything on a computer runs in a single Lua state, so any coroutine body is a natural critical region, which can easily be leveraged to implement any kind of synchronization.
4. Who is going to implement all that?
Given some time, I can implement the Lua side of things. The native side, though... I'm afraid to break something there, and even though I've done the same modifications for the official Lua implementation, I don't know yet how to do the same for JNLua and LuaJ and in the context of OpenComputers specifically. Plus, I'm not keen on rebuilding OpenComputers from source. If someone confirms that there's nothing to learn or break and that it should just work, I'll probably do that too.
Once the core functionality is implemented, I'm thinking more like implementing a more Unix-like OS, aiming for POSIX compatibility, and the ability to implement proper preemptive multitasking is a must for that. The ultimate goal is to be able to compile and run most normal C programs without modifications. I'm already working on a primitive C to Lua compiler, because I gave up porting a simplistic MP3 decoder to Lua by hand. (I want to have online radio in my bunker!)
Aaanyway, comments and suggestions are welcome. I was more interested in the technical part, so I haven't really though much about API function names or how exactly to divide time among computers of different tiers, so there's that. (Also, is there a way to figure out tiers of arbitrary component from inside the computer?) I didn't think at all about how OpenOS would provide preemptive multitasking to its programs, too, and I have absolutely no idea if it would crash and burn when anything starts running "in parallel".
You can post now and register later.
If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.
This is a suggestion on how and why to rework OC's sysyield timeouts. The required fixes basically give ability to implement preemptive multitasking for free.
1. Current implementation of sysyield timeouts is insufficient
Currently, it is not a real protection against executor thread hogging. It is only a safeguard, and a very efficient one.
Consider the following code:
Flash it onto a EEPROM, and throw that into a bare T1 case with just a CPU, a smallest RAM stick, and some power to keep it going.
Turn on that computer, and you have a hogged a single executor thread until someone shuts it down. Boot some more computers off the same EEPROM, and you can easily hog all executor threads on a server.
On default config, running just 4 of such hogs makes other computers practically unusable, every sysyield causes a 5 second delay.
Tweaking the timeout won't really help, since then programs will have to pull(0) much more often, and in the presence of such hogs every yield will cause a delay.
Just killing the whole machine when it starts hogging executors won't help in some cases, specifically if computer is turned on by WoL, redstone signal or another computer.
Even though it's highly unlikely, such hogging can happen accidentally. Specifically, if a system is built so that it restarts on unknown errors, sysyield timeout will not stop an accidental infinite loop in it.
2. How it is currently implemented
I don't think I have a complete understanding yet, but that's how I see it:
The coroutine library is wrapped:
Therefore, when sysyield happens (i.e. on computer.pullSignal), the whole stack of running coroutines is unwound, and the outermost one yields the computer. On the next tick, or whenever a signal comes, the computer is resumed, and the same stack of coroutines is rebuilt, and the innermost one continues its execution.
The hook that is attached to each coroutine checks if computer has been running for too long, and throws an error if so, which terminates the currently running coroutine. It also extends the deadline by half a second exactly once, so that the parent coroutine has a chance to recover. So long as it does a sysyield in those half a second, or yields/returns and lets its parents do that, the deadline is reset and the computer can continue operating. If it does not, the remaining parent coroutines do not get a chance to recover, and the computer is halted with an error. And until the innermost coroutine times out, nothing can stop it unless it yields control.
For example, if you type while true do end in the LUA repl, it will start that code in a coroutine and print an error message when it times out. If that same loop is explicitly started inside a coroutine, the error is caught and another infinite loop is started, the REPL will not get a chance to do anything, and the whole computer will crash.
3. The suggested implementation
The basic problem is, the hook set by Lua code with debug.sethook() cannot yield a coroutine. It can only throw an error, terminating the coroutine, therefore the timeout has to be long enough to let the program do whatever it needs to without doing sysyields all the time. This is problematic, since all the computers share the same few executor threads, and the timeout also has to be short enough to allow other computers to run normally.
However, the debug hooks actually can yield, under certain very restrictive conditions: https://www.lua.org/manual/5.3/manual.html#lua_Hook The catch is, it can only happen from the native code (C for normal Lua, probably Java for LuaJ), and no values can be returned. I've written a simple test library in C to allow setting yieldable Lua hooks from Lua code: https://github.com/evg-zhabotinsky/yieldhook I tried to implement it myself at first, but then realized that I'm reimplementing debug.sethook... So I just copied it, cut out the unneeded bits and added a loader. If possible, it's probably cleaner to modify the original Lua library (ldblib.c for C version, not sure about LuaJ).
The final interface of the modified debug.sethook() is that if hook mask contains an 'y', event is either 'line' or 'count' and the hook function returns truth, the paused coroutine is yielded with no values, i.e. coroutine.resume() returns just true. The normal return from coroutine with no arguments can be differentiated from a hook yield by checking if coroutine is now dead, and the normal yields can be wrapped to prepend an extra true value.
tl;dr: see the test program in my repo, it demonstrates a simple preemptive multitasking implementation. Its wrapper for coroutine.resume returns:
It should be pretty easy to modify the OC's existing wrapper and hook functions to sysyield instead of throwing an error on timeout. That timeout can then be made much shorter, since coroutines are no longer terminated. Preferably, it should be adjusted dynamically by machine.lua, so that cycles are distributed fairly (incl. according to CPU and memory tiers) with no leftovers and each computer runs exactly once every tick. Preferably, something like computer.getSliceTime() and computer.getRemainingTime() should return how much time the computer has been assigned during the tick and how much of it is left. The units of those functions can be declared "abstract cycle amounts" for the sake of "realism", in which case the slice time will be CPU frequency (quite unstable, just like the tooltip suggests xD) and the remaining time will be cycle count until the next bus synchronization or whatever.
Having done that, there basically will be preemptive multitasking between all the computers, and none of them will have to wait long for their turn to run. For compatibility, the current "die if hung for 5s" behavior can be implemented in OpenOS, and allowed to be disabled for the sake of "power users".
To allow for user-controlled preemptive multitasking on a single computer, I propose adding computer.setTimeout(time, coroutine), with time being the same units as computer.remainingTime() and coroutine being the timeout handler coroutine. The debug hook will work all the same, except it will yield prematurely if the soft deadline happens to be earlier. The wrapped coroutine.resume() will see that a timeout has happened, but hard timeout hasn't expired yet, and instead of yielding any further will disable soft timeout and resume the specified coroutine. The coroutine.status() has to be wrapped too to mark the "yielded by hook" coroutine as "normal", as if it called resume() on the handler coroutine. While the handler coroutine is running, soft timeouts are ignored. When it yields, the paused coroutine is resumed and soft timeout is enablet, if it was set again by the handler. However, if there is a pending soft timeout, the handler coroutine is immediately resumed again.
The setTimeout behavior suggested above matches closely how a timer interrupt would be handled in a real CPU, exactly what machine.lua should contain. The rest of multitasking implementation and protection is on the OS.
Apart of implementing a scheduler, synchronization mechanisms have to be implemented. Count hook can trigger in the middle of any non-trivial expression, so normal assignments cannot be used for unconditionally reliable synchronization. However, a coroutine cannot be resumed unless it's suspended, and everything on a computer runs in a single Lua state, so any coroutine body is a natural critical region, which can easily be leveraged to implement any kind of synchronization.
4. Who is going to implement all that?
Given some time, I can implement the Lua side of things. The native side, though... I'm afraid to break something there, and even though I've done the same modifications for the official Lua implementation, I don't know yet how to do the same for JNLua and LuaJ and in the context of OpenComputers specifically. Plus, I'm not keen on rebuilding OpenComputers from source. If someone confirms that there's nothing to learn or break and that it should just work, I'll probably do that too.
Once the core functionality is implemented, I'm thinking more like implementing a more Unix-like OS, aiming for POSIX compatibility, and the ability to implement proper preemptive multitasking is a must for that. The ultimate goal is to be able to compile and run most normal C programs without modifications. I'm already working on a primitive C to Lua compiler, because I gave up porting a simplistic MP3 decoder to Lua by hand. (I want to have online radio in my bunker!)
Aaanyway, comments and suggestions are welcome. I was more interested in the technical part, so I haven't really though much about API function names or how exactly to divide time among computers of different tiers, so there's that. (Also, is there a way to figure out tiers of arbitrary component from inside the computer?) I didn't think at all about how OpenOS would provide preemptive multitasking to its programs, too, and I have absolutely no idea if it would crash and burn when anything starts running "in parallel".
Link to post
Share on other sites