Fory Xlang Serialization Format
Cross-language Serialization Specification
Format Version History:
- Version 0.1 - serialization spec formalized
Fory xlang serialization is an automatic object serialization framework that supports references and polymorphism. Fory converts objects to and from the Fory xlang binary format. Fory has two core concepts for xlang serialization:
- Fory xlang binary format
- Frameworks implemented in different languages to convert objects to/from the Fory xlang binary format
The serialization format is a dynamic binary format. Its dynamic nature and its reference/polymorphism support make Fory flexible and much easier to use, but they also make the format more complex than those of static serialization frameworks.
Type Systems
Data Types
- bool: a boolean value (true or false).
- int8: an 8-bit signed integer.
- int16: a 16-bit signed integer.
- int32: a 32-bit signed integer.
- var_int32: a 32-bit signed integer which uses fory var_int32 encoding.
- int64: a 64-bit signed integer.
- var_int64: a 64-bit signed integer which uses fory PVL encoding.
- sli_int64: a 64-bit signed integer which uses fory SLI encoding.
- float16: a 16-bit floating point number.
- float32: a 32-bit floating point number.
- float64: a 64-bit floating point number including NaN and Infinity.
- string: a text string encoded using Latin1/UTF16/UTF-8 encoding.
- enum: a data type consisting of a set of named values. Rust enums with non-predefined field values are not supported as an enum.
- named_enum: an enum whose value will be serialized as the registered name.
- struct: a morphic (final) type serialized by Fory Struct serializer, i.e. it doesn't have subclasses. Suppose we're deserializing List<SomeClass>, we can save dynamic serializer dispatch since SomeClass is morphic (final).
- compatible_struct: a morphic (final) type serialized by Fory compatible Struct serializer.
- named_struct: a struct whose type mapping will be encoded as a name.
- named_compatible_struct: a compatible_struct whose type mapping will be encoded as a name.
- ext: a type which will be serialized by a customized serializer.
- named_ext: an ext type whose type mapping will be encoded as a name.
- list: a sequence of objects.
- set: an unordered set of unique elements.
- map: a map of key-value pairs. Mutable types such as list/map/set/array/tensor/arrow are not allowed as key of map.
- duration: an absolute length of time, independent of any calendar/timezone, as a count of nanoseconds.
- timestamp: a point in time, independent of any calendar/timezone, as a count of nanoseconds. The count is relative to an epoch at UTC midnight on January 1, 1970.
- local_date: a naive date without timezone. The count is days relative to an epoch at UTC midnight on Jan 1, 1970.
- decimal: exact decimal value represented as an integer value in two's complement.
- binary: a variable-length array of bytes.
- array: only 1D numeric components are allowed. Other arrays will be taken as List. The implementation should support interoperability between array and list.
- bool_array: one dimensional bool array.
- int8_array: one dimensional int8 array.
- int16_array: one dimensional int16 array.
- int32_array: one dimensional int32 array.
- int64_array: one dimensional int64 array.
- float16_array: one dimensional float16 array.
- float32_array: one dimensional float32 array.
- float64_array: one dimensional float64 array.
- tensor: a multidimensional array in which every sub-array has the same size and type.
- arrow record batch: an arrow record batch object.
- arrow table: an arrow table object.
Note:
- Unsigned int/long are not added here, since not every language supports those types.
Polymorphisms
For polymorphism, if one non-final class is registered, and only one subclass is registered, then we can assume that all elements in a List/Map have the same type, which reduces the runtime type-check cost.
Collection/Array polymorphism is not fully supported, since some languages such as golang have only one collection type. If users want to get back exactly the type they passed, they must pass that type when deserializing or annotate the struct field with that type.
Type disambiguation
Due to differences between the type systems of languages, types can't be mapped one-to-one between languages. When deserializing, Fory uses the target data structure type and the data type in the data jointly to determine how to deserialize and populate the target data structure. For example:
class Foo {
  int[] intArray;
  Object[] objects;
  List<Object> objectList;
}
class Foo2 {
  int[] intArray;
  List<Object> objects;
  List<Object> objectList;
}
intArray has an int32_array type. But both the objects and objectList fields in the serialized data have the list data type. When deserializing, the implementation will create an Object array for objects, but create an ArrayList for objectList to populate its elements. The serialized data of Foo can be deserialized into Foo2 too.
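The rule above can be sketched in Python as follows. This is only an illustration under assumed names: `populate_field` is a hypothetical helper, not a Fory API, and `tuple` stands in for an array-like target such as `Object[]`.

```python
from typing import Any, List, Sequence, get_origin


def populate_field(declared_type: Any, wire_type: str, elements: Sequence[Any]) -> Any:
    """Build the target container from `list` wire data, guided by the declared field type.

    Hypothetical helper: `tuple` plays the role of an array target, `list` of a List target.
    """
    if wire_type != "list":
        raise ValueError(f"unsupported wire type: {wire_type}")
    # get_origin(List[object]) is list; for a bare class such as tuple it is None.
    origin = get_origin(declared_type) or declared_type
    if origin is tuple:
        return tuple(elements)
    if origin is list:
        return list(elements)
    raise TypeError(f"cannot populate {declared_type!r} from list data")
```

The same wire data yields a different container depending only on the declared target type, which is why data written for one class layout can be read into another compatible layout.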
Users can also provide meta hints for the fields of a type, or for the type as a whole. Here is an example in Java which uses annotations to provide such information.
@ForyObject(fieldsNullable = false, trackingRef = false)
class Foo {
  @ForyField(trackingRef = false)
  int[] intArray;
  @ForyField(polymorphic = true)
  Object object;
  @ForyField(tagId = 1, nullable = true)
  List<Object> objectList;
}
Such information can be provided in other languages too:
- cpp: use macro and template.
- golang: use struct tag.
- python: use typehint.
- rust: use macro.
Type ID
All internal data types are expressed using an ID in range 0~64. Users can use 0~4096 for representing their types.
Type mapping
See Type mapping
Spec overview
Here is the overall format:
| fory header | object ref meta | object type meta | object value data |
The data are serialized using little endian byte order overall. If byte swapping is costly for some object, Fory will write the byte order for that object into the data instead of converting it to little endian.
Fory header
The Fory header has the following layout:
|    2 bytes   |     4 bits    | 1 bit | 1 bit | 1 bit  | 1 bit |   1 byte   |          optional 4 bytes          |
+--------------+---------------+-------+-------+--------+-------+------------+------------------------------------+
| magic number | reserved bits |  oob  | xlang | endian | null  |  language  | unsigned int for meta start offset |
- magic number: used to identify the fory serialization protocol; the current version uses 0x62d4.
- null flag: 1 when object is null, 0 otherwise. If an object is null, other bits won't be set.
- endian flag: 1 when data is encoded by little endian, 0 for big endian.
- xlang flag: 1 when serialization uses xlang format, 0 when serialization uses Fory java format.
- oob flag: 1 when passed BufferCallback is not null, 0 otherwise.
- language: the language when serializing objects, such as JAVA, PYTHON, GO, etc. Fory can use this flag to determine whether to spend more time on serialization to make the deserialization faster for dynamic languages.
If meta share mode is enabled, an uncompressed unsigned int is appended to indicate the start offset of metadata.
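As a rough illustration of the header layout, the sketch below packs and unpacks the header with Python's `struct` module. Several details are assumptions rather than statements of the spec: the bit positions inside the flags byte (null in bit 0, endian in bit 1, xlang in bit 2, oob in bit 3), the little endian byte order of the magic number, and the numeric `language` value passed by the caller.

```python
import struct

MAGIC_NUMBER = 0x62D4  # value given in the spec text

# Assumed bit positions within the flags byte (bit 0 is the lowest bit).
NULL_FLAG = 1 << 0
LITTLE_ENDIAN_FLAG = 1 << 1
XLANG_FLAG = 1 << 2
OOB_FLAG = 1 << 3


def encode_header(language, is_null=False, xlang=True, oob=False, meta_start_offset=None):
    """Build header bytes; `language` is a hypothetical one-byte language id."""
    if is_null:
        flags = NULL_FLAG  # the text says no other bit is set for a null object
    else:
        flags = LITTLE_ENDIAN_FLAG
        if xlang:
            flags |= XLANG_FLAG
        if oob:
            flags |= OOB_FLAG
    header = struct.pack("<HBB", MAGIC_NUMBER, flags, language)
    if meta_start_offset is not None:  # only present when meta share mode is enabled
        header += struct.pack("<I", meta_start_offset)
    return header


def decode_header(data, meta_share=False):
    """Parse the header produced by encode_header into a dict."""
    magic, flags, language = struct.unpack_from("<HBB", data, 0)
    if magic != MAGIC_NUMBER:
        raise ValueError(f"unexpected magic number: {magic:#06x}")
    header = {
        "is_null": bool(flags & NULL_FLAG),
        "little_endian": bool(flags & LITTLE_ENDIAN_FLAG),
        "xlang": bool(flags & XLANG_FLAG),
        "oob": bool(flags & OOB_FLAG),
        "language": language,
    }
    if meta_share:
        (header["meta_start_offset"],) = struct.unpack_from("<I", data, 4)
    return header
```

Without meta share mode the header is 4 bytes long; with it, the 4-byte offset brings the total to 8 bytes.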
Reference Meta
Reference tracking handles whether the object is null, and whether to track reference for the object by writing corresponding flags and maintaining internal state.
Reference flags:
| Flag | Byte Value | Description |
|---|---|---|
| NULL FLAG | -3 | This flag indicates the object is a null value. We don't use another byte to indicate REF, so that we can save one byte. |
| REF FLAG | -2 | This flag indicates the object has already been serialized, and fory will write a ref id in unsigned varint format instead of serializing it again. |
| NOT_NULL VALUE FLAG | -1 | This flag indicates the object is a non-null value and fory doesn't track ref for this type of object. |
| REF VALUE FLAG | 0 | This flag indicates the object is referenceable and is being serialized for the first time. |
When reference tracking is disabled globally or for specific types, or for certain types within a particular context (e.g., a field of a type), only the NULL and NOT_NULL VALUE flags will be used for reference meta.
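The flag selection described above could look roughly like the following sketch. `RefWriter` and `write_var_uint` are hypothetical names, and the varint shown is a generic LEB128-style encoding assumed for illustration; this section does not define Fory's actual unsigned varint layout.

```python
NULL_FLAG = -3
REF_FLAG = -2
NOT_NULL_VALUE_FLAG = -1
REF_VALUE_FLAG = 0


def write_var_uint(buf: bytearray, value: int) -> None:
    """Assumed LEB128-style unsigned varint: 7 data bits per byte, high bit = continue."""
    while value >= 0x80:
        buf.append((value & 0x7F) | 0x80)
        value >>= 7
    buf.append(value)


class RefWriter:
    """Hypothetical helper that writes the reference meta byte for one object."""

    def __init__(self, track_ref: bool = True):
        self.track_ref = track_ref
        self._ref_ids = {}      # id(obj) -> ref id assigned at first write
        self._keep_alive = []   # keeps objects alive so id() values stay unique

    def write_ref_or_null(self, buf: bytearray, obj) -> bool:
        """Write the flag; return True if the caller must still write the value."""
        if obj is None:
            buf.append(NULL_FLAG & 0xFF)
            return False
        if not self.track_ref:
            buf.append(NOT_NULL_VALUE_FLAG & 0xFF)
            return True
        ref_id = self._ref_ids.get(id(obj))
        if ref_id is not None:
            buf.append(REF_FLAG & 0xFF)
            write_var_uint(buf, ref_id)  # point back at the earlier occurrence
            return False
        self._ref_ids[id(obj)] = len(self._ref_ids)
        self._keep_alive.append(obj)
        buf.append(REF_VALUE_FLAG)
        return True
```

With tracking disabled, only the NULL and NOT_NULL VALUE branches can be reached, which matches the reduced flag set described in the text.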
For languages which don't support references, such as rust, reference tracking must be disabled for correct deserialization by the fory rust implementation.
For languages whose object values are not null by default:
- In rust, Fory takes Option:None as a null value
- In c++, Fory takes std::nullopt as a null value
- In golang, Fory takes null interface/pointer as a null value
If one wants to deserialize such data in languages like Java/Python/JavaScript, the type should be marked with all fields not-null by default, or schema-evolution mode should be used to carry the not-null fields info in the data.
Type Meta
Every type to be serialized has a type id to indicate its type.
- basic types: the type id
- enum:
  - Type.ENUM + registered id
  - Type.NAMED_ENUM + registered namespace+typename
- list: Type.List
- set: Type.SET
- map: Type.MAP
- ext:
  - Type.EXT + registered id
  - Type.NAMED_EXT + registered namespace+typename
- struct:
  - Type.STRUCT + struct meta
  - Type.NAMED_STRUCT + struct meta
Every type must be registered with an ID or name first. The registration can be used for security check and type identification.
Struct is a special type: depending on whether schema compatibility is enabled, Fory will write struct meta differently.
Struct Schema consistent
- If schema consistent mode is enabled globally when creating fory, type meta will be written as a fory unsigned varint of type_id. Schema evolution related meta will be ignored.
- If schema evolution mode is enabled globally when creating fory, and the current class is configured to use schema consistent mode, like struct vs table in flatbuffers:
  - Type meta will be added to captured_type_defs: captured_type_defs[type def stub] = map size, ahead of time when registering the type.
  - Get index of the meta in captured_type_defs, write that index as | unsigned varint: index |.
Struct Schema evolution
If schema evolution mode is enabled globally when creating fory, and enabled for the current type, type meta will be written using one of the following modes. Which mode to use is configured when creating fory.
- Normal mode (meta share not enabled):
  - If type meta hasn't been written before, add type def to captured_type_defs: captured_type_defs[type def] = map size.
  - Get index of the meta in captured_type_defs, write that index as | unsigned varint: index |.
  - After the serialization of the object graph has finished, fory will start to write captured_type_defs:
    - First, set meta start offset in the fory header to the current write position.
    - Then write captured_type_defs one by one:

          buffer.write_var_uint32(len(writing_type_defs) - len(schema_consistent_type_def_stubs))
          for type_meta in writing_type_defs:
              if not type_meta.is_stub():
                  type_meta.write_type_def(buffer)
          writing_type_defs = copy(schema_consistent_type_def_stubs)

- Meta share mode: the writing steps are same as the normal mode, but captured_type_defs will be shared across multiple serializations of different objects. For example, suppose we have a batch to serialize:

      captured_type_defs = {}
      stream = ...
      # add `Type1` to `captured_type_defs` and write `Type1`
      fory.serialize(stream, [Type1()])
      # add `Type2` to `captured_type_defs` and write `Type2`, `Type1` is written before.
      fory.serialize(stream, [Type1(), Type2()])
      # `Type1` and `Type2` are written before, no need to write meta.
      fory.serialize(stream, [Type1(), Type2()])

- Streaming mode (streaming mode doesn't support meta share):
  - If type meta hasn't been written before, the data will be written as: | unsigned varint: 0b11111111 | type def |
  - If type meta has been written before, the data will be written as: | unsigned varint: written index << 1 |, where written index is the id in captured_type_defs.
  - With this mode, meta start offset can be omitted.
The normal mode and meta share mode forbid streaming writes, since the writer needs to go back and update the start offset after the whole object graph has been written and meta collection is finished. Only in this way can we ensure that a deserialization failure in meta share mode doesn't lose shared meta.
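To make the streaming mode encoding concrete, here is a minimal sketch of a writer that emits the 0b11111111 marker plus the type def on first use and `index << 1` afterwards. The class name, the opaque `bytes` type defs, and the LEB128-style varint are assumptions made for illustration only.

```python
NEW_TYPE_DEF_MARKER = 0b11111111  # marker value from the spec text


def write_var_uint(buf: bytearray, value: int) -> None:
    """Assumed LEB128-style unsigned varint: 7 data bits per byte, high bit = continue."""
    while value >= 0x80:
        buf.append((value & 0x7F) | 0x80)
        value >>= 7
    buf.append(value)


class StreamingTypeMetaWriter:
    """Hypothetical writer; a type def is modelled as an opaque bytes value."""

    def __init__(self):
        self.captured_type_defs = {}  # type def -> index, assigned in first-seen order

    def write_type_meta(self, buf: bytearray, type_def: bytes) -> None:
        index = self.captured_type_defs.get(type_def)
        if index is None:
            # First occurrence: remember it (index = current map size), write it inline.
            self.captured_type_defs[type_def] = len(self.captured_type_defs)
            write_var_uint(buf, NEW_TYPE_DEF_MARKER)
            buf.extend(type_def)
        else:
            # Seen before: the low bit is 0, the remaining bits carry the index.
            write_var_uint(buf, index << 1)
```

Because every type def is written inline where it is first needed, nothing has to be patched after the object graph is complete, which is why this mode can omit the meta start offset.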
Type Def
Here we mainly describe the meta layout for schema evolution mode:
|    8 bytes header    |   variable bytes   |  variable bytes   |
+----------------------+--------------------+-------------------+
| global binary header |    meta header     |    fields meta    |
For languages which support inheritance, if a parent class and a subclass have fields with the same name, the field in the subclass is used.
Global binary header
50 bits hash + 1 bit compress flag + 1 bit write fields meta + 12 bits meta size. Right is the lower bits.
- The lower 12 bits are used to encode meta size. If meta size >= 0b111_1111_1111, then write meta_size - 0b111_1111_1111 next.
- The 13th bit is used to indicate whether to write fields meta. When this class is schema-consistent or uses a registered serializer, fields meta will be skipped. Class Meta will be used to share namespace + type name only.
- The 14th bit is used to indicate whether meta is compressed.
- The other 50 bits are used to store the unique hash of flags + all layers class meta.
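The bit layout of the global binary header can be sketched as below. The function names are hypothetical, the overflow threshold is taken literally from the text, and the hash is treated as an already computed 50-bit integer since the hash function is not described here.

```python
META_SIZE_MASK = (1 << 12) - 1          # lower 12 bits hold the meta size
META_SIZE_THRESHOLD = 0b111_1111_1111   # threshold exactly as written in the text
WRITE_FIELDS_META_BIT = 1 << 12         # "13th bit"
COMPRESS_BIT = 1 << 13                  # "14th bit"
HASH_SHIFT = 14
HASH_MASK = (1 << 50) - 1


def pack_global_header(type_hash, meta_size, write_fields_meta, compressed):
    """Return (64-bit header, extra size to write next, or None if it fits)."""
    header = (type_hash & HASH_MASK) << HASH_SHIFT
    if compressed:
        header |= COMPRESS_BIT
    if write_fields_meta:
        header |= WRITE_FIELDS_META_BIT
    header |= min(meta_size, META_SIZE_THRESHOLD)
    extra = meta_size - META_SIZE_THRESHOLD if meta_size >= META_SIZE_THRESHOLD else None
    return header, extra


def unpack_global_header(header, extra=None):
    """Inverse of pack_global_header; `extra` is the follow-up size, if one was written."""
    stored_size = header & META_SIZE_MASK
    return {
        "type_hash": header >> HASH_SHIFT,
        "compressed": bool(header & COMPRESS_BIT),
        "write_fields_meta": bool(header & WRITE_FIELDS_META_BIT),
        "meta_size": stored_size + (extra or 0),
    }
```

How the extra size is encoded on the wire (for example as a varint) is left out, because the text only says that it is written next.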
Meta header
The meta header is an 8-bit number.
- The lowest 5 bits, 0b00000~0b11110, are used to record num fields. 0b11111 is reserved to indicate that Fory needs to read more bytes for the length using Fory unsigned int encoding. Note that num_fields is the number of compatible fields. Users can use tag id to mark some fields as compatible fields in a schema consistent context. In such cases, schema consistent fields will be serialized first, then compatible fields will be serialized next. At deserialization, Fory will use the fields info of those fields which aren't annotated by tag id for deserializing schema consistent fields, then use the fields info in meta for deserializing compatible fields.
- The 6th bit: 0 for registered by id, 1 for registered by name.
- The remaining 2 bits are reserved for future extension.
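A small sketch of the meta header byte follows. The names are hypothetical, and one detail is an assumption: when the 5-bit field is saturated, the sketch writes the remainder `num_fields - 0b11111` as the follow-up value, which the text does not state explicitly.

```python
NUM_FIELDS_MASK = 0b11111
NUM_FIELDS_EXTENDED = 0b11111     # reserved value: more length bytes follow
REGISTER_BY_NAME_BIT = 1 << 5     # "the 6th bit"


def pack_meta_header(num_fields, registered_by_name):
    """Return (header byte, extra value to write next, or None if it fits)."""
    header = min(num_fields, NUM_FIELDS_EXTENDED)
    if registered_by_name:
        header |= REGISTER_BY_NAME_BIT
    # Assumption: the follow-up value is the remainder beyond the reserved value.
    extra = num_fields - NUM_FIELDS_EXTENDED if num_fields >= NUM_FIELDS_EXTENDED else None
    return header, extra


def unpack_meta_header(header, extra=None):
    """Return (num_fields, registered_by_name)."""
    num_fields = header & NUM_FIELDS_MASK
    if num_fields == NUM_FIELDS_EXTENDED:
        if extra is None:
            raise ValueError("header signals an extended field count but none was given")
        num_fields += extra
    return num_fields, bool(header & REGISTER_BY_NAME_BIT)
```

The two highest bits are left at zero, matching their reserved status.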
Fields meta
Format:
| field info: variable bytes      | variable bytes  | ... |
+---------------------------------+-----------------+-----+
| header + type info + field name | next field info | ... |
Field Header
The field header is 8 bits. Annotations can be used to provide more specific info; if no annotation exists, fory will infer the info automatically.
The format for field header is:
2 bits field name encoding + 4 bits size + nullability flag + ref tracking flag
Detailed spec:
- 2 bits field name encoding:
  - encoding: UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID
  - If tag id is used, field name will be written by an unsigned varint tag id, and 2 bits encoding will be 11.
- size of field name:
  - The 4 bits size: 0~14 will be used to indicate length 1~15, the value 15 indicates to read more bytes, the encoding will encode size - 15 as a varint next.
  - If encoding is TAG_ID, then num_bytes of field name will be used to store tag id.
- ref tracking: when set to 1, ref tracking will be enabled for this field.
- nullability: when set to 1, this field can be null.
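The field header byte could be assembled as in the sketch below, which covers named fields only (the TAG_ID case is left out). Several points are assumptions: the fields are placed from high to low bits in the order the format line lists them, only the value 11 for TAG_ID is given by the text so the other encoding values simply follow listing order, and "size - 15" is read as the stored size (length minus one) minus 15.

```python
from enum import IntEnum


class FieldNameEncoding(IntEnum):
    # Only TAG_ID = 0b11 is stated in the text; the other values are assumed.
    UTF8 = 0b00
    ALL_TO_LOWER_SPECIAL = 0b01
    LOWER_UPPER_DIGIT_SPECIAL = 0b10
    TAG_ID = 0b11


SIZE_EXTENDED = 15          # stored size value meaning "read more bytes"
NULLABLE_BIT = 1 << 1       # assumed position
TRACK_REF_BIT = 1 << 0      # assumed position


def pack_field_header(encoding, name_length, nullable, track_ref):
    """Return (header byte, extra size to write as a varint, or None if it fits)."""
    if name_length < 1:
        raise ValueError("field name must have at least one byte")
    size = name_length - 1                       # stored 0~14 means length 1~15
    header = (encoding << 6) | (min(size, SIZE_EXTENDED) << 2)
    if nullable:
        header |= NULLABLE_BIT
    if track_ref:
        header |= TRACK_REF_BIT
    extra = size - SIZE_EXTENDED if size >= SIZE_EXTENDED else None
    return header, extra
```

For example, a 5-byte UTF8 name on a nullable field without ref tracking packs into a single byte with no follow-up varint.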