Jump to content
  • Sky
  • Blueberry
  • Slate
  • Blackcurrant
  • Watermelon
  • Strawberry
  • Orange
  • Banana
  • Apple
  • Emerald
  • Chocolate
  • Charcoal
mpmxyz

Crunch - break the 4K limit!

Recommended Posts

Introduction

imagine you are programming a drone or a microcontroller... And then you are hitting the EEPROM limits - 4KiB isn't a lot of space. Now you could start writing ugly code but there is a better solution:

Crunch!

 

Crunch is a program that compresses Lua source files.

Depending on the input file size and the amount of memory available it is able to utilize a variety of methods to keep the output size small. That is starting with simple token based whitespace reduction and reaching to full parsing and scoping rules to replace local variables with shorter names. For even better compression it is also able to do a kind of lossy compression when replacing long globals with short local variables.

 

When using crunch you can focus on a more commented and nice looking code instead of worrying too much about keeping names short.

Even code that has been written with being small in mind can profit from it:

Skex has been reduced from 3990 bytes to 2357 bytes. More than 40% reduction!

 

Man Page

 

NAME
  crunch - a source code compressor for OpenComputers
 
SYNOPSIS
  crunch [options] INPUT.lua [OUTPUT, default: INPUT.cr.lua]
 
DESCRIPTION
  compresses lua source files with minimal changes to behaviour
  The most common use case is to prepare code for an EEPROM.
  (-> allows using more complex code within the EEPROM limits)
 
  Token Compression (lossless)
  ->remove comments and unnecessary whitespace
  (to be implemented: shorten strings and numbers)
  Compression with Tree Parsing
  ->includes everything from "token compression"
  ->shortens local variable names (lossless)
  ->replaces string, number, boolean and nil constants (lossless)
  ->replaces global variables by local variables (lossy when relying on side effects from outside code or _ENV changes)
 
OPTIONS
  --lz77[=10..230]
    enables lz77 compression
    The optional value defaults to 80 and defines the maximum length of a reference.
    All length values that are higher are used to define an uncompressed sequence of characters.
    Increasing the maximum reference length therefore reduces the maximum uncompressed sequence length.
    It is therefore best to find a balance between both extremes. (If you really need those 12 bytes...)
    
  --blacklist=name1,name2
    doesn't touch the given globals
    That prevents them being replaced by shortened locals.
    It is necessary e.g. if they are used with side effects from external scripts.
    
  --blacklist=*
    doesn't touch globals at all
    The output code is guaranteed to behave like the input code but it is a bit larger than normal compressed code.
    
  --tree
    forces tree parsing
    (->building a parsing table if it doesn't exist)
    
  --notree
    forces using only token compression (-> no variable renaming)
 
EXAMPLES
  crunch hello.lua
    compresses the file "hello.lua" using "hello.cr.lua" as output

  crunch --blacklist=_G,table a.lua b.lua
    compresses "a.lua" while leaving any part referring to the globals "_G" and "table" untouched
    The output is written to the file "b.lua".
 
  crunch --notree edit.lua
    compresses "edit.lua" using only token compression
    
  crunch --lz77 edit.lua
    compresses "edit.lua" with lz77 compression enabled

Depencencies

Software

All required libraries are included in the download.

 

Hardware

RAM: min. 512 KiB1, recommended: 1 MiB for small files, 2 MiB for big files2

Note 1: That needs removing the library "parser.lualr" from file system and only allows token based compression.

Note 2: Full parsing is quite demanding because it is using bottom up parsing which requires the whole tree structure to be in memory.

 

Installation

Simply download the tar archive and extract it into the root directory.

All files should then be there where they should be.

 

Download (last update: 13.04.17)

github

Ingame:

cd /
wget 'https://raw.githubusercontent.com/mpmxyz/ocprograms/master/tars/crunch.tar'
tar -xf crunch.tar

OR

oppm install crunch

Known Issues

 

The 'lossy' global-to-local compression cannot be used if a global variable is used with side effects to other files. Blacklist those globals to avoid errors. (example: "--blacklist=require" when compressing the default "Lua BIOS")

Edited by mpmxyz
added oppm download option
Link to post
Share on other sites

Very interesting and userful program! I want to be your faithful user  ;)

 

But, from first day of using, i get some errors.

 

First, i was unable to compress from my Windows command line. I downlod Crunch from github and unpack it to folder. For visibility, i move folders from /lib and /bin to root. After i tryed to compress this file, program finished without errors, but size of file was 0.

 

When i repeat with skex.lua, i got this:

 

C:\>crunch.lua skex.lua
C:\Program Files\Lua\5.1\lua.exe: C:\crunch.lua:510: .\parser\main.lua:1161: Syn
tax error: got name 'endendfunction', expected one of: 'until' 'end' ';' 'else'
'eof' 'elseif'
stack traceback:
        [C]: in function 'error'
        .\parser\main.lua:1161: in function <.\parser\main.lua:1138>
        (tail call): ?
        .\parser\lua.lua:74: in function 'lastAction'
        .\parser\main.lua:168: in function <.\parser\main.lua:164>
        [C]: in function 'xpcall'
        .\parser\main.lua:202: in function 'lexer'
        .\parser\main.lua:1178: in function 'parse'
        C:\crunch.lua:508: in function 'analyze'
        C:\crunch.lua:826: in main chunk
        [C]: ?
stack traceback:
        [C]: in function 'error'
        C:\crunch.lua:510: in function 'analyze'
        C:\crunch.lua:826: in main chunk
        [C]: ?

Maybe problem with Lua version 5.1, if you use new 5.2 methods?

 

 

After this, i was going to Minecraft and succesfully compress skex.lua. I started compressing random files from my and others projects

 

This file compressed succesfully, and don't losed functionality. Compressed file was about 75% of source, but i dont understand why from line 6 Crunch starting to unnecessary repeat keyword local.

 

When i tryed to compress this file, i got error:

6b7g6cis.png

 

Can you help me to understand why i got this errors?

Link to post
Share on other sites

#1+#2:

I'm using io.lines with an extra parameter to read the file in short 512 byte pieces.

The feature with the extra parameters doesn't exist in Lua 5.1. It instead just reads the file line by line but does not return newline characters.

That makes crunch think the file is just one line.

Now to #1: The file starts with a single line comment and crunch thinks there is only one line...  (->It's removed.)

And #2: The words are separated by newline characters only. Imagine what happens, when you ignore them! :-P

 

Even though I'm not completely supporting 5.1, I'll see if I can add a small workaround in there.

 

#3:

The keyword 'local' is repeated lot because it is repeated in the original file.

The feature to combine those is on my todo list but it isn't as easy as you might think:

a = 1
local a = 2
local b = a
print(

is not the same as:

a = 1
local a, b = 2, a
print(

#4:

Here we've got a well hidden bug: One of my regular expressions was wrong - I fixed it.

The download should be updated now.

Apart from that I'll have to look into storing line numbers somewhere in the parsing tree to make error messages more clear about where the error should be.

Link to post
Share on other sites

#1-2: Ok, i will install 5.2 on my computer then. 

#4: Thx for fix, will try again later. Will be glad to help you find more bugs =)

 

Also, about all compressing topics, what you think about self-extracting archives?

For example, we can paste in EEPROM string + dictionary, that will substring letters into words and then will be loaded? It will be cost some more RAM memory and processing time, but higher compress rate.

I noticed that after using Crunch, 20-40% of information is keywords. We can reduce all keywords to one character.

Simple example:

load([[λ∫p(s)print(s)Ελ∫w(s)io.write(s)Ε  (.. and more lua code..) ]]:gsub("∫","function"):gsub("λ"," local "):gsub("Ε"," end "))

What do you think about this?

Link to post
Share on other sites

Interesting idea. The only issue is that the epsilon, lambda and fancypants-s are non-ASCII characters and take multiple bytes when encoded as UTF-8:

Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:38:22) [MSC v.1600 32 bit (Intel)] on win32
>>> len("Ελ∫".encode("utf8"))
7
>>> "Ελ∫".encode("utf8")
b'\xce\x95\xce\xbb\xe2\x88\xab'

Though I guess two or three bytes are better than three (end), five (local) and eight (function). Alternatively you could use ASCII characters that are not allowed in Lua code, like !$?`@ or some control characters.

 

By the way, it isn't possible to do attribute access with . or : on literals (e. g. 42.5, "string", [[long string]]). The literals need to be wrapped in parentheses:

Lua 5.3.0  Copyright (C) 1994-2015 Lua.org, PUC-Rio
> "foo":len()
stdin:1: unexpected symbol near '"foo"'
> ("foo"):len()
3
Link to post
Share on other sites

as far as I'm aware there are three ways to do that sort of compression, each with its own upside/downsides

 

return load((source):gsub("\0","function"))()
--good if you're only have one key word

return load((source):gsub("\0(.)",{a="function",b="local",c="table",d="string"}))()
--good general solution for multiple key words, relatively easy to read the shortened version of

return load((source):gsub("\0(.)",function(a)return({"function","local","table","string"})[a:byte()]end))()
--grows slower than the table version as you add words, but requires more space in the beginning
--probably require more RAM than the table version
--should only use over the table version if you have more than 8 key words
where (source) is a string representation of the source code

I use \0 because you're almost guaranteed never to run in to it, but code that generates these should probably search for an unused character (preferably a printable one)

also, I'd like to note from my attempt at implementing this: the only characters that need escaping are the things that close your strings (double quote (") or single quote (')), newline, carriage return, and backslash. You could also try backslash to not have to worry about escaping the carriage return

Link to post
Share on other sites

Or use [[raw multiline strings]], which take all characters literally, except for the closing brackets. This is even possible if there are such strings in the code itself - Lua allows you to add any number of equals signs between the two brackets, and they will still be valid, as long as the opening and closing ones use the same number of equals signs. For example, the text

This string has [[brackets]] in it for some reason.
would break normal multiline strings, but can still be stored like this:

[=[This string has [[brackets]] in it for some reason.]=]
Link to post
Share on other sites

as far as I'm aware there are three ways to do that sort of compression, each with its own upside/downsides

 

return load((source):gsub("\0","function"))()
--good if you're only have one key word

return load((source):gsub("\0(.)",{a="function",b="local",c="table",d="string"}))()
--good general solution for multiple key words, relatively easy to read the shortened version of

return load((source):gsub("\0(.)",function(a)return({"function","local","table","string"})[a:byte()]end))()
--grows slower than the table version as you add words, but requires more space in the beginning
--probably require more RAM than the table version
--should only use over the table version if you have more than 8 key words
where (source) is a string representation of the source code

I use \0 because you're almost guaranteed never to run in to it, but code that generates these should probably search for an unused character (preferably a printable one)

also, I'd like to note from my attempt at implementing this: the only characters that need escaping are the things that close your strings (double quote (") or single quote (')), newline, carriage return, and backslash. You could also try backslash to not have to worry about escaping the carriage return

 

 

I like it. But, idea is not just swap keywords. We can find any other "words" - repeated sequences of characters, that we can swap with 2-bited control symbols. So, we can make self-unpacking load() string, using this "dictionary".

With multipass or others methods, we can reach 50-70% packing and use 12kb files for EEPROMs.

 

Simpliest example - we can swap two words local function in one keysymbol instead two of them separately.

 

I wish this "Self-extracting LUA archive" be in Crunch as option. Waiting answer from mpmxyz.

Link to post
Share on other sites

Or use [[raw multiline strings]], which take all characters literally, except for the closing brackets. This is even possible if there are such strings in the code itself - Lua allows you to add any number of equals signs between the two brackets, and they will still be valid, as long as the opening and closing ones use the same number of equals signs. For example, the text

This string has [[brackets]] in it for some reason.
would break normal multiline strings, but can still be stored like this:

[=[This string has [[brackets]] in it for some reason.]=]

actually, \n and \r characters at the begining and end will be striped out, and \r will be converted to \n

Link to post
Share on other sites

actually, \n and \r characters at the begining and end will be striped out, and \r will be converted to \n

True, it does assume that the contents are text. For Lua code that shouldn't be an issue though - leading newlines don't matter, and line endings are all the same to Lua.

Link to post
Share on other sites

True, it does assume that the contents are text. For Lua code that shouldn't be an issue though - leading newlines don't matter, and line endings are all the same to Lua.

Line endings may be the same to Lua, but they might not be to the compression code. For instance, if I were to replace all instances of "local " with "\0\13" and put that in a long string and read it in, my algorithm would break. (\13 is \r)

Link to post
Share on other sites

Line endings may be the same to Lua, but they might not be to the compression code. For instance, if I were to replace all instances of "local " with "\0\13" and put that in a long string and read it in, my algorithm would break. (\13 is \r)

That makes sense. I guess you could include the number 13 in the dictionary and always map it to e. g. an empty string, but never actually use it in any "compressed escape" sequence. That appears to be the only character that is treated specially in long strings. The short test

for i = 0, 255 do
    fn, error = load("return [[>" .. string.char(i) .. "<]]")
    if fn == nil or fn():byte(2) ~= i then
        print(i, fn and fn():byte(2) or error)
    end
end

only produces the output

13      10

meaning that all other characters should be safe to use. The only other problematic cases are \n at the start of the string (which is ignored) and ] at the end of a level 0 long string (which causes a syntax error). The former shouldn't cause any real issues when compressing text, and the latter can be prevented by increasing the level of the long string.

Link to post
Share on other sites

Update 06.06.15: --lz77 option is now available!

 

This option enables compression using a LZ77 variant.

-> You can now achieve space savings of around 50%!

 

@natedogith1 @dgelessus @Krutoy242:

Thanks for your help!

Your discussion gave me a lot of motivation to look into the topic of compression and your research into permitted characters in long strings helped me a lot since I could simply start writing code instead of beginning with my own tests.

 

Just for your entertainment, some bugs I ran into during development:

-reference length wasn't limited during compression: Above the limit it was interpreted as the length of a raw string. -> Decompression returned binary madness.

-"\r\n" within the original code has been processed as it is. But the compressed code is stored within a long string. So Lua reduces that to "\n" on compile time, which causes all related string lengths to be wrong. -> again some madness as output

-The decompression code within the compressed files might be stuck in an infinite loop when reading invalid data. (not fixed because the data created by crunch is supposed to be correct in the first place / a check would add unnecessary overhead)

-I had to use a hex editor a lot. Unfortunately the one I used wasn't able to change the file size. (for whatever reason, but at least not 'my' bug)

 

PS: Report any bugs you find! There are so many test cases I didn't have the time to check for...

Link to post
Share on other sites

Oh, i waited for this update!

You done awesome work, im very glad to have such tool.

 

I made some tests to find any bugs or mistakes.

 

  1. First, i want to ask you to make shorten equivalent of parameters like you done in tar. For example, for making LZ77 archives, can i use just -z option?
    And, why you make output as parametr. Can you make it like all OpenOS program works?
    name [options] input [output]
    
    crunch -z skex.lua skex.crz.lua
    
    Otherwise im bit confusing writing output path in line before input path.
  2. I succesfully done multiple crunching crunch itself without --lz77 option. Not bug, just small note, that second iteration of crunching given same size file, but inside they are a bit different. I used this commands:
    crunch --output=crunch_01.lua crunch.lua
    crunch --output=crunch_02.lua crunch_01.lua
    crunch --output=crunch_03.lua crunch_02.lua
  3. When i lunch with --lz77 parameters, i get awesome result - 78% reduction, about 4 times less then original file.
    Good news that crunched crunch.lua was works fine.
    Bad news that second iteration was crushed:
    C2U3CTt.png

    Also, same error i get even without --lz77 option in this file. Seems like there is some problem with multiline strings, because i can even remove or add 1 line and it will work fine, even if line will be after all source code (after ]]).
     
  4. I tryed to draw some multyline text with unicode and Russian letters. I crunched same file (from above) but with additional line after source, and get unexpected symbol inside text. You can see it in the end of second lunch.

Thats it for now!

Hope my tests will help.

Link to post
Share on other sites

#1:

You can put --options wherever you want. (i.e. --output behind the input file) "shell.parse" doesn't care!

But the whole "user interface" part was just quickly added to the program. I should revisit that part soon...

#2:

That's probably because the original file doesn't have that many local variables defined at the first line of the file.

They are added during compression to avoid repetition of long strings, numbers and global variables.

->The compressed file has a slightly different structure.

#3 + #4:

It's actually the same bug and it's not even mine: https://github.com/MightyPirates/OpenComputers/issues/1207

io.lines seems to have some problems with unicode characters.

 

Update 07.06.15: modified crunch to use file:lines with a binary stream instead of io.lines

Link to post
Share on other sites

Excellent! It works fine and i done 4 iterations of crunching crunch. Also, problem with my gui-layout file is gone too.

 

Now i had idea that i can even compress my serialized tables, when im using wireless modem or something like this. Honestly, this may be not use that you planned when wrote Crunch, but i still want to try =)

I have a big file, that is representation of data-center for using debug card and be "printed" with setBlocks().

First attempt to compress this 260kb file was failed cause run out of memory. Then i switched to server with four 3.5 memory cards.

Every try now i got yielding error:

 

9qfoKjY.png

Link to post
Share on other sites

Looks like my lazy programming started to haunt you:

--TODO: change pasting system to be top-bottom? (pasting takes n² time at worst case when moving stuff only one level at a time)

Just to explain that: In order to represent lists of values there have to be rules like:

table   = {<content>}
content = <empty> OR <value> , <content>

The parser therefore creates the following structure for "{1,2,3,}":

<table>
 {
 <content>
  1
  ,
  <content>
   2
   ,
   <content>
    3
    ,
    <content>
     <empty>
 }

This unnecessary nesting is currently removed by moving the contents of lower parts of the structure to higher parts.

This has the problem that it has a worst case runtime proportional to the square of the length of a list. And your list is long. Make that squared and you can see how the too long without yield error arises.

Fixing that should be possible.

In the meantime you should use this command for large files:

crunch --notree --lz77 DataCenter

--notree disables the full parsing part. (i.e. no variable replacement -> It shouldn't harm that much because you don't use them.)

--lz77 enables the powerful lz77 compressor which brings your file size down to just 95 KiB.

 

I updated the command syntax to be "crunch [options] input [output]" as you suggested. That makes it easier to use.

Link to post
Share on other sites

Sorry for double posting but I've got another...

 

Update 24.06.15:

added --verbose option  prints what crunch is doingadded --debug option to show stack traces of parsing error messages  That means that they aren't included by default anymore. (which increases readability)improved error messages  now with file and line numberimproved execution speed by writing a more efficient recursion flattening code  That's the thing I mentioned in my previous message.  Krutoy's Data Center is still limited by memory because it consists of around 18*8570=154260 tokens  That makes around 27 bytes of T3 server memory per token.  When you think about the fact that just a table entry in the array part of a table alone already uses around 16 bytes you can see the problem.improved execution speed by increasing the number of characters between forced sleeps during lz77 compression  I have to use os.sleep once in a while to avoid "too long without yield" errors. (->inefficient implementation)  I'm already thinking about an OC feature request to add a way to yield without pulling events. But I'll have to collect ideas first.@Krutoy242:I also noticed some users coming from computercraft.ru lately. I just wanted to confirm that I'm okay with that. You'd have to translate any feedback for me though. (bug reports or feature requests) 

;)

Link to post
Share on other sites

New update, great! Verbose is nice, will enable when packing huge files.

Sadly no new features, but improvments good too.

 

 

@Krutoy242:

I also noticed some users coming from computercraft.ru lately. I just wanted to confirm that I'm okay with that. You'd have to translate any feedback for me though. (bug reports or feature requests) 

;)

 

Yes. As i said before, i promote you on our forum a bit. I made blog entry and leave link in my signature. There already ~150 views - not so much with thousands here, but still number.

I still belive Crunch is very userful program for wrighting bios programs.

 

Feedback:

Feedback is poor.

One guy say that this is "magic".

Other guy noticed that he can use Crunch for obfuscation. I think, this is lame, but some paranoid guys only see how everybody wants to steal their code. Maybe, you should make option for this?

Next, on blog entry comments, our famouse admin, who made spawn and world on coords 100500x140300, was upset that Crunch only "changing local names". But then, other good programmer corrected, that it also shortned often used global names.

 

But anyway, i feel like im alone who using your program on our server, heh.


I have suggestion. Do not know where i should write this - here or in tar topic.

 

Idea was making self-extracting archives, or "installers".

Tar making archives, but i need tar program itself for open it. I can just add crunched Tar uncompresser after a .tar archive for self-extracting it. But also, i still want to compress result qith LZ77 or other compress method for less size.

So, for user it will looks simple - he downloading just 1 file, for example "crunch_installer.lua", and run it. This installer unpacking and put files in their places on computer, wrighting errors if needed.

Link to post
Share on other sites

Update 12.08.15:

Crunch has been rewritten to be modular. This made adding new features easier. Some of them are:

-compressing multiple occurrences of "t.veryLongIndex" with "t[x]" (x being a variable with the value of "veryLongIndex")

-unification of string and number values using the shortest representation with the same value ("\"test\"" vs. [["test"]] vs. '"test"')

-The program can be interrupted by pressing Ctrl + C. (It might be delayed when tree processing is enabled.)

 

Does anyone have suggestions for further improvements?

I'll see if I can implement them nicely using the modular system.

 

 

Regarding self extracting archives: How would such a file store its data?

-as a string: The extracting code is easier.

-hidden in comments: This would enable very large self extracting files if loadfile was modified to use a loader function instead of "file:read("*a")".

Additionally: Both have their problems with "forbidden characters".

 

I also have to decide if I add this feature to Tar or to a new program. (Maybe a program that uses Tars output?)

Link to post
Share on other sites

New update, great! Verbose is nice, will enable when packing huge files.

Sadly no new features, but improvments good too.

 

 

 

Yes. As i said before, i promote you on our forum a bit. I made blog entry and leave link in my signature. There already ~150 views - not so much with thousands here, but still number.

I still belive Crunch is very userful program for wrighting bios programs.

 

Feedback:

Feedback is poor.

One guy say that this is "magic".

Other guy noticed that he can use Crunch for obfuscation. I think, this is lame, but some paranoid guys only see how everybody wants to steal their code. Maybe, you should make option for this?

Next, on blog entry comments, our famouse admin, who made spawn and world on coords 100500x140300, was upset that Crunch only "changing local names". But then, other good programmer corrected, that it also shortned often used global names.

 

But anyway, i feel like im alone who using your program on our server, heh.


I have suggestion. Do not know where i should write this - here or in tar topic.

 

Idea was making self-extracting archives, or "installers".

Tar making archives, but i need tar program itself for open it. I can just add crunched Tar uncompresser after a .tar archive for self-extracting it. But also, i still want to compress result qith LZ77 or other compress method for less size.

So, for user it will looks simple - he downloading just 1 file, for example "crunch_installer.lua", and run it. This installer unpacking and put files in their places on computer, wrighting errors if needed.

 

LZF is conceptually very similar to LZ77. Generally speaking most modern compression algorithms give roughly the same compression, and with regard to the number of cores that you can use at once, it is up to you to decide how many you want to use. Generally speaking (unless you are creating large archives) there is no reason to need more than one though. In addition, with multiple cores doing the compression, the bottleneck may become the hard drive. Legacy zip compression is akin to the Deflate method in 7-zip, and will offer the most compatibility between different compression software.

 

John

Link to post
Share on other sites

Update 13.04.17:

-fixed a bug which was caused by filesystem.list returning file names without sorting in newer OC versions
-added crunch to oppm

I decided not to add a 'self extracting archive' option because that deserves to be another program. The output of that program could be compressed though.

My next target will be full Lua 5.3 support. I have to check memory requirements after that though. (lots of new operators)

Update 16.04.17:

-Added Lua 5.3 Support (Operators were missing.)

Memory requirement didn't increase noticeably. But the parsing table grew by 12KiB. So there is some increase.

Edited by mpmxyz
I don't double post unless it resurrects the dead. :-P
Link to post
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...

×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use and Privacy Policy.