Solra Bizna

Proposal: Universal interchange format


Fairly early in the OC-ARM development process, the question of component IO came up. I spent a good bit of time doing research and checking my assumptions. Then, I set out to create an interchange format with the following properties:

  • Trivially maps to every object type OpenComputers itself allows
  • (implied by the above) Trivially maps to the Lua representation
  • Feasible to implement in hardware
  • Simple to implement in software
  • Independent of architectural details
  • Compact

This is what came out.

There are two versions of this format: packed and unpacked. Packed is more compact, and is suitable for wire transmission or 8-/16-bit architectures. Unpacked is simpler to deal with on 32-bit architectures. (The packed representation, not coincidentally, takes up the same amount of space as is calculated for network packet size limiting purposes.)

Applications communicating over the network:

  • SHOULD use the "pickled" representation from the Lua `serialization` API, when possible (for compatibility with Lua)
  • SHOULD use the packed representation with network byte order, when the above is not possible

Architectures implementing a component IO medium:

  • MUST support the use of the "packed" representation
  • MUST support native byte order
  • SHOULD have optional support for network byte order, if that is different from native
  • MAY have optional support for "unpacked" IO, if this makes sense on the platform

A producer is a program that is producing Interchange Values. A consumer is a program that is consuming Interchange Values.

An Interchange Value is a type tag followed directly by the data. In the packed representation, tags are 16-bit. In the unpacked representation, they are 32-bit and sign-extended (such that 0xFFFE becomes 0xFFFFFFFE), but still limited to the same range.
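
The relationship between packed and unpacked tags can be sketched as follows. This is only an illustration; the function names `widen_tag` and `narrow_tag` are made up for this example and are not part of the proposal.

```python
def widen_tag(tag16: int) -> int:
    """Sign-extend a 16-bit packed tag to its 32-bit unpacked form."""
    if not 0 <= tag16 <= 0xFFFF:
        raise ValueError("packed tags are 16-bit")
    # Bit 15 set means the tag is negative when read as a signed value,
    # so the upper 16 bits are filled with ones.
    return (tag16 | 0xFFFF0000) if tag16 & 0x8000 else tag16


def narrow_tag(tag32: int) -> int:
    """Reduce a 32-bit unpacked tag to 16 bits, rejecting out-of-range tags."""
    tag16 = tag32 & 0xFFFF
    # An unpacked tag is "limited to the same range" only if widening the
    # low 16 bits reproduces it exactly.
    if widen_tag(tag16) != tag32:
        raise ValueError("unpacked tag is outside the 16-bit tag range")
    return tag16


print(hex(widen_tag(0xFFFE)))   # 0xfffffffe
print(hex(widen_tag(0x0005)))   # 0x5
```

String and byte array tags (0x0000 - 0x7FFF) have bit 15 clear, so they are unchanged by widening; only the special tags become "negative".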

  • Tag 0x0000 - 0x3FFF: ICTAG_STRING

    A UTF-8 string, whose byte length is the low 14 bits of the tag. In the unpacked representation, the string is padded to a multiple of 4 bytes; padding is not reflected in the length. No NUL terminator is required, and NUL may freely exist within the string, but producers SHOULD avoid producing strings containing embedded NULs. Consumers MAY (but SHOULD NOT) process an embedded NUL as a NUL terminator, and discard the portion of the string after it. If there are invalid UTF-8 sequences in the string, they MAY be lost during subsequent processing. Consumers that convert to another representation (such as UTF-16 for Java storage, or Unicode code points for display) SHOULD discard invalid UTF-8 sequences. Producers about to output an ICTAG_STRING that contains an exact, valid UUID value with lowercase digits SHOULD instead produce an ICTAG_UUID.

  • Tag 0x4000 - 0x7FFF: ICTAG_BYTE_ARRAY

    An array of byte values with no particular semantics, whose length is the low 14 bits of the tag. In the unpacked representation, the array is padded to a multiple of 4 bytes; padding is not reflected in the length. Consumers that expect a string MAY treat an ICTAG_BYTE_ARRAY as an ICTAG_STRING. Consumers that expect an ICTAG_BYTE_ARRAY (e.g. disk IO) MAY treat an ICTAG_STRING as an ICTAG_BYTE_ARRAY only if it is still available in its original serialized form, not having made a round-trip conversion.

  • Tag 0xFFF8 / -8: ICTAG_UUID

    A 128-bit UUID. Consumers that expect a string MUST convert an ICTAG_UUID into its canonical ICTAG_STRING equivalent, using lowercase digits. Note: Consumers that expect a UUID are NOT required to accept a well-formed ICTAG_STRING in its place.

  • Tag 0xFFF9 / -7: ICTAG_COMPOUND

    A list of key-value pairs stored key first, value second, terminated by ICTAG_END. Any type but ICTAG_BYTE_ARRAY, ICTAG_COMPOUND, ICTAG_ARRAY, and ICTAG_NULL may appear as a key. Any type may appear as a value. If a consumer encounters an ICTAG_END where a value should go, the entire Interchange Buffer MUST be discarded as invalid.

  • Tag 0xFFFA / -6: ICTAG_ARRAY

    A list of Interchange Values, terminated by ICTAG_END. Any type may appear as an element of an array.

  • Tag 0xFFFB / -5: ICTAG_INT

    A signed, two's-complement 32-bit integer.

  • Tag 0xFFFC / -4: ICTAG_DOUBLE

    An IEEE 754 64-bit double. Producers SHOULD convert doubles with exact signed two's-complement 32-bit integer values to ICTAG_INT. Consumers that expect an ICTAG_DOUBLE MUST be able to process an ICTAG_INT in its place, and MUST NOT treat it differently from an ICTAG_DOUBLE with the same numerical value.

  • Tag 0xFFFD / -3: ICTAG_BOOLEAN

    Either true or false. In the packed representation, it is a single byte. In the unpacked representation, it is 32 bits. Zero is false, any non-zero value is true. Producers SHOULD produce all-bits-set for true. Consumers MUST treat any non-zero value as equivalent to all-bits-set.

  • Tag 0xFFFE / -2: ICTAG_NULL

    A strongly-typed NULL, equivalent to Java's null and Lua's nil.

  • Tag 0xFFFF / -1: ICTAG_END

    Signifies the end of an Interchange Buffer, array, or compound.

All other values are reserved. An Interchange Buffer that contains any other tag value MUST be discarded as invalid.
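
To show how the tag ranges fit together, here is a rough sketch of a consumer for the packed representation. It assumes network byte order and assumes the ICTAG_UUID payload is 16 bytes in display order (one of the unresolved options under Issues). The names `read_value`, `read_buffer`, and `InvalidBuffer`, and the mapping onto Python types, are invented for this example. A Python `dict` is used for compounds for brevity, which means keys such as `True` and `1` would collide.

```python
import struct
import uuid

ICTAG_UUID, ICTAG_COMPOUND, ICTAG_ARRAY, ICTAG_INT = 0xFFF8, 0xFFF9, 0xFFFA, 0xFFFB
ICTAG_DOUBLE, ICTAG_BOOLEAN, ICTAG_NULL, ICTAG_END = 0xFFFC, 0xFFFD, 0xFFFE, 0xFFFF

END = object()  # sentinel returned when ICTAG_END is read


class InvalidBuffer(ValueError):
    """The whole Interchange Buffer must be discarded."""


def read_value(buf: bytes, pos: int):
    """Decode one packed value starting at pos; return (value, next_pos)."""
    def take(n):
        nonlocal pos
        if pos + n > len(buf):
            raise InvalidBuffer("truncated buffer")
        chunk = buf[pos:pos + n]
        pos += n
        return chunk

    (tag,) = struct.unpack(">H", take(2))
    if tag <= 0x3FFF:                 # ICTAG_STRING: the tag is the length
        # errors="ignore" discards invalid UTF-8 sequences
        return take(tag).decode("utf-8", errors="ignore"), pos
    if tag <= 0x7FFF:                 # ICTAG_BYTE_ARRAY: low 14 bits are the length
        return take(tag & 0x3FFF), pos
    if tag == ICTAG_UUID:             # assumption: 16 bytes in display order
        return uuid.UUID(bytes=take(16)), pos
    if tag == ICTAG_INT:
        return struct.unpack(">i", take(4))[0], pos
    if tag == ICTAG_DOUBLE:
        return struct.unpack(">d", take(8))[0], pos
    if tag == ICTAG_BOOLEAN:          # any non-zero byte is true
        return take(1) != b"\x00", pos
    if tag == ICTAG_NULL:
        return None, pos
    if tag == ICTAG_END:
        return END, pos
    if tag == ICTAG_ARRAY:
        items = []
        while True:
            item, pos = read_value(buf, pos)
            if item is END:
                return items, pos
            items.append(item)
    if tag == ICTAG_COMPOUND:
        pairs = {}
        while True:
            key, pos = read_value(buf, pos)
            if key is END:
                return pairs, pos
            if key is None or isinstance(key, (bytes, list, dict)):
                raise InvalidBuffer("type not allowed as a compound key")
            value, pos = read_value(buf, pos)
            if value is END:
                raise InvalidBuffer("ICTAG_END where a value should go")
            pairs[key] = value
    raise InvalidBuffer(f"reserved tag 0x{tag:04X}")


def read_buffer(buf: bytes) -> list:
    """Decode values until the terminating ICTAG_END."""
    values, pos = [], 0
    while True:
        value, pos = read_value(buf, pos)
        if value is END:
            return values
        values.append(value)
```

For example, `read_buffer(b"\x00\x02hi\xFF\xFB\x00\x00\x00\x2A\xFF\xFF")` yields the string `"hi"` followed by the integer 42, while a buffer starting with the reserved tag 0x8000 raises `InvalidBuffer`.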

An Interchange Buffer simply consists of zero or more Interchange Values, terminated by ICTAG_END.
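
A producer for the packed representation might look roughly like this. As before, this is a sketch under assumptions: network byte order, a 16-byte display-order UUID payload, and an invented mapping from Python types. The names `pack_value` and `pack_buffer` are hypothetical.

```python
import struct
import uuid

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
MAX_LENGTH = 0x3FFF  # lengths must fit in the low 14 bits of the tag


def pack_tag(tag: int) -> bytes:
    return struct.pack(">H", tag)


def pack_value(value) -> bytes:
    """Encode one Python value in the packed representation (big-endian)."""
    if value is None:
        return pack_tag(0xFFFE)                              # ICTAG_NULL
    if isinstance(value, bool):                              # must precede the int test
        return pack_tag(0xFFFD) + (b"\xFF" if value else b"\x00")
    if isinstance(value, int):
        if not INT32_MIN <= value <= INT32_MAX:
            raise ValueError("integer does not fit in ICTAG_INT")
        return pack_tag(0xFFFB) + struct.pack(">i", value)
    if isinstance(value, float):
        return pack_tag(0xFFFC) + struct.pack(">d", value)   # ICTAG_DOUBLE
    if isinstance(value, uuid.UUID):                         # assumption: display order
        return pack_tag(0xFFF8) + value.bytes
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) > MAX_LENGTH:
            raise ValueError("string too long")
        return pack_tag(len(data)) + data                    # ICTAG_STRING
    if isinstance(value, (bytes, bytearray)):
        if len(value) > MAX_LENGTH:
            raise ValueError("byte array too long")
        return pack_tag(0x4000 | len(value)) + bytes(value)  # ICTAG_BYTE_ARRAY
    if isinstance(value, (list, tuple)):                     # ICTAG_ARRAY ... ICTAG_END
        return pack_tag(0xFFFA) + b"".join(map(pack_value, value)) + pack_tag(0xFFFF)
    if isinstance(value, dict):                              # ICTAG_COMPOUND ... ICTAG_END
        body = b""
        for key, item in value.items():
            if key is None or isinstance(key, (bytes, bytearray, tuple)):
                raise TypeError("type not allowed as a compound key")
            body += pack_value(key) + pack_value(item)
        return pack_tag(0xFFF9) + body + pack_tag(0xFFFF)
    raise TypeError(f"no Interchange type for {type(value).__name__}")


def pack_buffer(values) -> bytes:
    """Zero or more values, terminated by ICTAG_END."""
    return b"".join(pack_value(v) for v in values) + pack_tag(0xFFFF)


print(pack_buffer(["hi", 42, True]).hex())
# 00026869fffb0000002afffdffffff
```

An empty buffer is just the two bytes of ICTAG_END.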

Issues

Should there be more integer/float types? e.g. ICTAG_SHORT, ICTAG_BYTE, ICTAG_LONG, ICTAG_FLOAT...

No. This would make writing simple processing code difficult. ICTAG_DOUBLE fully covers Lua's number range, and covers a significant portion of Java's number range as well. ICTAG_INT covers a sufficiently wide range to remove the need for floating-point math in the vast majority of IO operations.

8-/16-bit architectures cannot process 32-bit numbers easily. However, consumers on those architectures can, situationally, consume only the precision they expect, and gracefully fail when given too-large integers. (The situation would be far worse if they were expected to process an ICTAG_DOUBLE.)
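
As an illustration of "consume only the precision they expect", a 16-bit consumer could inspect the four payload bytes of an ICTAG_INT without doing any 32-bit arithmetic. Python stands in for the target architecture here, the payload is assumed to be in network byte order, and `read_int16` is a made-up name.

```python
def read_int16(payload: bytes) -> int:
    """Read an ICTAG_INT payload (4 bytes, big-endian) as a 16-bit value.

    Fails instead of silently truncating when the value does not fit.
    """
    if len(payload) != 4:
        raise ValueError("ICTAG_INT payload is 4 bytes")
    high, low = payload[:2], payload[2:]
    # The value fits in 16 bits only if the upper half is pure sign
    # extension of the lower half.
    expected_high = b"\xFF\xFF" if low[0] & 0x80 else b"\x00\x00"
    if high != expected_high:
        raise OverflowError("integer too large for a 16-bit consumer")
    return int.from_bytes(low, "big", signed=True)


print(read_int16(b"\x00\x00\x01\x00"))  # 256
print(read_int16(b"\xFF\xFF\xFF\xFE"))  # -2
```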

Should converting integer-valued ICTAG_DOUBLE to ICTAG_INT be mandatory?

No. In situations where only integers are expected, but the original representation supports only "reals", transparent conversion is valuable. However, in situations where non-integer values are common, the conversion does not add value. It should be up to the producer to determine whether it makes sense to do the conversion.
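
A producer-side sketch of this optional conversion, assuming the packed representation in network byte order. The function name `pack_number` and the `simplify` flag are invented for this example.

```python
import struct


def pack_number(value: float, simplify: bool = True) -> bytes:
    """Encode a number as ICTAG_INT when it is exactly a 32-bit integer."""
    value = float(value)
    # is_integer() is False for NaN and infinities, so those stay doubles.
    # Note: -0.0 counts as an integer here and loses its sign as ICTAG_INT.
    if simplify and value.is_integer() and -2**31 <= value <= 2**31 - 1:
        return struct.pack(">Hi", 0xFFFB, int(value))   # ICTAG_INT
    return struct.pack(">Hd", 0xFFFC, value)            # ICTAG_DOUBLE


print(pack_number(3.0).hex())                   # fffb00000003
print(pack_number(3.0, simplify=False).hex())   # fffc4008000000000000
```

A producer that mostly emits non-integer values can pass `simplify=False` and skip the check entirely.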

Should converting from UUID-valued ICTAG_STRING to ICTAG_UUID be mandatory?

No. If, as is the usual case, the producer knows that subsequent use of the string as a UUID is unlikely, it can avoid the overhead.

Should accepting ICTAG_UUID in place of ICTAG_STRING, and ICTAG_INT in place of ICTAG_DOUBLE, be mandatory?

Yes. In situations where the semantics of the values are not known, it is desirable to have "automatic simplification". "Automatic simplification" can only work if it can be expected to be reversed just as automatically when the conversion was not helpful.
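
On the consumer side, "reversing" the simplification could look like the following sketch. It assumes values have already been decoded into Python objects (`int` for ICTAG_INT, `float` for ICTAG_DOUBLE, `uuid.UUID` for ICTAG_UUID, `str` for ICTAG_STRING); that mapping and the names `as_number` and `as_string` are assumptions made for illustration.

```python
import uuid


def as_number(value) -> float:
    """For consumers expecting ICTAG_DOUBLE: ICTAG_INT is accepted alike."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("expected a number")
    return float(value)


def as_string(value) -> str:
    """For consumers expecting ICTAG_STRING: ICTAG_UUID is converted."""
    if isinstance(value, uuid.UUID):
        return str(value)  # canonical form, lowercase hex digits
    if isinstance(value, str):
        return value
    raise TypeError("expected a string")


print(as_number(3) == as_number(3.0))  # True
print(as_string(uuid.UUID("12345678-1234-5678-1234-567812ABCDEF")))
# 12345678-1234-5678-1234-567812abcdef
```

Note that there is deliberately no `as_uuid` that accepts strings, since consumers expecting a UUID are not required to accept an ICTAG_STRING.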

Should there be ICTAG_TRUE and ICTAG_FALSE instead of a single ICTAG_BOOLEAN?

Unresolved.

How is an ICTAG_UUID structured? This is important with byte orders other than network byte order.

Unresolved. Possibilities:

  • A sequence of bytes, in display order. No swapping is performed or necessary.
  • (as in Java) A pair of 64-bit integers.
  • (as in MFC) A 32-bit integer followed by three 16-bit integers followed by 6 loose bytes.
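
To make the difference concrete, the sketch below serializes one UUID under each of the three candidate layouts. The helper name `layouts` is invented for this example, and the third layout follows the field split described in the list above (32, 16, 16, 16 bits, then 6 bytes).

```python
import struct
import uuid

EXAMPLE = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")


def layouts(u: uuid.UUID, byte_order: str):
    """Return (display, pair64, mixed) encodings; byte_order is '<' or '>'."""
    # 1. Bytes in display order: never swapped.
    display = u.bytes
    # 2. A pair of 64-bit integers, most significant half first.
    pair64 = struct.pack(byte_order + "QQ", u.int >> 64, u.int & (2**64 - 1))
    # 3. One 32-bit integer, three 16-bit integers, six loose bytes.
    fields = struct.unpack(">IHHH6s", u.bytes)
    mixed = struct.pack(byte_order + "IHHH6s", *fields)
    return display, pair64, mixed


for name, data in zip(("display", "pair64", "mixed"), layouts(EXAMPLE, "<")):
    print(f"{name:8} {data.hex()}")
# display  00112233445566778899aabbccddeeff
# pair64   7766554433221100ffeeddccbbaa9988
# mixed    33221100554477669988aabbccddeeff
```

In network byte order all three layouts produce identical bytes, which is why the question only matters for little-endian native IO.
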

Why have both ICTAG_BYTE_ARRAY and ICTAG_STRING? Why not just have one or the other?

Java Strings are encoded as UTF-16. Interconversion between UTF-8 and UTF-16 is pretty simple, but it is still overhead that isn't necessary for tasks like disk IO. In addition, there is no obvious way to preserve invalid UTF-8 sequences such that code that wishes to deal with byte arrays (like disk IO) can get the original byte sequence back after a round-trip conversion through UTF-16. Creating such a system as part of the specification may be desired, but would then allow "invisible overhead" to creep in.
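
The round-trip problem is easy to demonstrate. In this sketch Python stands in for the host, with a decoder that substitutes U+FFFD for invalid sequences, as many do; `round_trip` is a name made up for the example.

```python
def round_trip(raw: bytes) -> bytes:
    """UTF-8 bytes -> UTF-16 storage -> UTF-8 bytes."""
    text = raw.decode("utf-8", errors="replace")   # invalid bytes become U+FFFD
    stored = text.encode("utf-16-be")              # what a UTF-16 host would keep
    return stored.decode("utf-16-be").encode("utf-8")


print(round_trip(b"abc"))    # b'abc'
print(round_trip(b"\xff"))   # b'\xef\xbf\xbd'
```

Valid text survives, but a raw disk sector containing the byte 0xFF comes back as three different bytes, so the original data is unrecoverable.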

Common Optimizations

(This information is only useful to architecture implementors.)

OC-ARM uses this format for component IO. It provides two facilities that usefully reduce the overhead of this system.

  • Byte array IO return: When a program expects an IO operation to return only a single byte array, it can determine the size of this byte array by reading the size needed to store the Interchange Buffer and subtracting overhead. It can then use a special method to (attempt to) read the bytes directly into memory, rather than into an Interchange Buffer.
  • Truncation: In situations (such as certain types of signal processing) where only a few values are important, the program can specify that it is interested only in the first N values before storing the Interchange Buffer. It can then proceed normally; extra values are simply discarded. This is useful in extremely memory-restricted situations.
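
The arithmetic behind both facilities can be sketched as follows. This is not the actual OC-ARM interface, which is not described here; the function names are hypothetical, and the sizes assume the packed representation, where a buffer holding a single byte array is one 16-bit tag, the data, and a 16-bit ICTAG_END.

```python
from itertools import islice

PACKED_TAG_SIZE = 2  # bytes per tag in the packed representation


def single_byte_array_length(buffer_size: int) -> int:
    """Length of the lone byte array in a packed buffer of the given size."""
    overhead = 2 * PACKED_TAG_SIZE          # the array's own tag + ICTAG_END
    length = buffer_size - overhead
    if not 0 <= length <= 0x3FFF:
        raise ValueError("size cannot be a buffer holding one byte array")
    return length


def truncate_values(values, wanted: int) -> list:
    """Keep only the first `wanted` values; the rest are simply discarded."""
    return list(islice(values, wanted))


print(single_byte_array_length(516))                      # 512
print(truncate_values(["key_down", "kbd", 97, 30], 2))    # ['key_down', 'kbd']
```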