Xlang Serialization Format
Cross-language Serialization Specification
Apache Fory™ xlang serialization enables automatic cross-language object serialization with support for shared references, circular references, and polymorphism. Unlike traditional serialization frameworks that require IDL definitions and schema compilation, Fory serializes objects directly without any intermediate steps.
Key characteristics:
- Automatic: No IDL definition, no schema compilation, no manual object-to-protocol conversion
- Cross-language: Same binary format works seamlessly across Java, Python, C++, Rust, Go, JavaScript, and more
- Reference-aware: Handles shared references and circular references without duplication or infinite recursion
- Polymorphic: Supports object polymorphism with runtime type resolution
This specification defines the Fory xlang binary format. The format is dynamic rather than static, which enables flexibility and ease of use at the cost of additional complexity in the wire format.
Type Systems
Data Types
- bool: a boolean value (true or false).
- int8: an 8-bit signed integer.
- int16: a 16-bit signed integer.
- int32: a 32-bit signed integer.
- varint32: a 32-bit signed integer which uses Fory variable-length encoding.
- int64: a 64-bit signed integer.
- varint64: a 64-bit signed integer which uses Fory PVL encoding.
- tagged_int64: a 64-bit signed integer which uses Fory Hybrid encoding.
- uint8: an 8-bit unsigned integer.
- uint16: a 16-bit unsigned integer.
- uint32: a 32-bit unsigned integer.
- var_uint32: a 32-bit unsigned integer which uses Fory variable-length encoding.
- uint64: a 64-bit unsigned integer.
- var_uint64: a 64-bit unsigned integer which uses Fory PVL encoding.
- tagged_uint64: a 64-bit unsigned integer which uses Fory Hybrid encoding.
- float8: an 8-bit floating point number.
- float16: a 16-bit floating point number.
- bfloat16: a 16-bit brain floating point number.
- float32: a 32-bit floating point number.
- float64: a 64-bit floating point number including NaN and Infinity.
- string: a text string encoded using Latin1/UTF16/UTF-8 encoding.
- enum: a data type consisting of a set of named values. Rust enums with non-predefined field values are not supported as enums.
- named_enum: an enum whose value will be serialized as the registered name.
- struct: a final type serialized by the Fory Struct serializer, i.e. it has no subclasses. For example, when deserializing List<SomeClass>, dynamic serializer dispatch can be skipped because SomeClass is final.
- compatible_struct: a final type serialized by the Fory compatible Struct serializer.
- named_struct: a struct whose type mapping will be encoded as a name.
- named_compatible_struct: a compatible_struct whose type mapping will be encoded as a name.
- ext: a type which will be serialized by a customized serializer.
- named_ext: an ext type whose type mapping will be encoded as a name.
- list: a sequence of objects.
- set: an unordered set of unique elements.
- map: a map of key-value pairs. Mutable types such as list/map/set/array are not allowed as keys of a map.
- duration: an absolute length of time, independent of any calendar/timezone, as a count of nanoseconds.
- timestamp: a point in time, independent of any calendar/timezone, encoded as seconds (int64) and nanoseconds (uint32) since the epoch at UTC midnight on January 1, 1970.
- date: a naive date without timezone. The count is days relative to an epoch at UTC midnight on Jan 1, 1970.
- decimal: exact decimal value represented as an integer value in two's complement.
- binary: a variable-length array of bytes.
- array: only 1D numeric components are allowed. Other arrays are treated as List. Implementations should support interoperability between array and list.
- bool_array: one dimensional bool array.
- int8_array: one dimensional int8 array.
- int16_array: one dimensional int16 array.
- int32_array: one dimensional int32 array.
- int64_array: one dimensional int64 array.
- float8_array: one dimensional float8 array.
- float16_array: one dimensional float16 array.
- bfloat16_array: one dimensional bfloat16 array.
- float32_array: one dimensional float32 array.
- float64_array: one dimensional float64 array.
- union: a tagged union type that can hold one of several alternative types. The active alternative is identified by an index.
- typed_union: a union value with registered numeric union type ID.
- named_union: a union value with embedded union type name or shared TypeDef.
- none: represents an empty/unit value with no data (e.g., for empty union alternatives).
Note:
- Unsigned integer types use the same byte sizes as their signed counterparts; the difference is in value interpretation. See Type mapping for language-specific type mappings.
Polymorphisms
For polymorphism, if a non-final class is registered and only one of its subclasses is registered, we can assume that all elements in a List/Map have the same type, which reduces the cost of runtime type checks.
Collection/Array polymorphism is not fully supported, since some languages such as Go have only one collection type. Users who want to get back exactly the type they passed must pass that type when deserializing or annotate the struct field with it.
Type disambiguation
Due to differences between the type systems of languages, types can't be mapped one-to-one between languages. When deserializing, Fory uses the target data structure type and the data type in the serialized data jointly to determine how to deserialize and populate the target data structure. For example:
class Foo {
  int[] intArray;
  Object[] objects;
  List<Object> objectList;
}

class Foo2 {
  int[] intArray;
  List<Object> objects;
  List<Object> objectList;
}
intArray has an int32_array type, but both the objects and objectList fields in the serialized data have the list data type. When deserializing, the implementation will create an Object array for objects but an ArrayList for objectList to populate its elements. The serialized data of Foo can also be deserialized into Foo2.
Users can also provide meta hints for individual fields of a type or for the type as a whole. Here is an example in Java that uses annotations to provide such information.
@ForyObject(fieldsNullable = false, trackingRef = false)
class Foo {
  @ForyField(trackingRef = false)
  int[] intArray;
  @ForyField(polymorphic = true)
  Object object;
  @ForyField(tagId = 1, nullable = true)
  List<Object> objectList;
}
Such information can be provided in other languages too:
- cpp: use macro and template.
- golang: use struct tag.
- python: use typehint.
- rust: use macro.
Type ID
All internal data types use an 8-bit internal ID (0~255, with 0~56 defined here). Users can
register types by numeric ID (0~0xFFFFFFFE in current implementations). User IDs are encoded
separately from the internal type ID; there is no bit shifting/packing.
Named types (NAMED_*) do not embed a user ID; their names are carried in metadata instead.
Internal Type ID Table
| Type ID | Name | Description |
|---|---|---|
| 0 | UNKNOWN | Unknown type, used for dynamic typing |
| 1 | BOOL | Boolean value |
| 2 | INT8 | 8-bit signed integer |
| 3 | INT16 | 16-bit signed integer |
| 4 | INT32 | 32-bit signed integer |
| 5 | VARINT32 | Variable-length encoded 32-bit signed integer |
| 6 | INT64 | 64-bit signed integer |
| 7 | VARINT64 | Variable-length encoded 64-bit signed integer |
| 8 | TAGGED_INT64 | Hybrid encoded 64-bit signed integer |
| 9 | UINT8 | 8-bit unsigned integer |
| 10 | UINT16 | 16-bit unsigned integer |
| 11 | UINT32 | 32-bit unsigned integer |
| 12 | VAR_UINT32 | Variable-length encoded 32-bit unsigned integer |
| 13 | UINT64 | 64-bit unsigned integer |
| 14 | VAR_UINT64 | Variable-length encoded 64-bit unsigned integer |
| 15 | TAGGED_UINT64 | Hybrid encoded 64-bit unsigned integer |
| 16 | FLOAT8 | 8-bit floating point (float8) |
| 17 | FLOAT16 | 16-bit floating point (half precision) |
| 18 | BFLOAT16 | 16-bit brain floating point |
| 19 | FLOAT32 | 32-bit floating point (single precision) |
| 20 | FLOAT64 | 64-bit floating point (double precision) |
| 21 | STRING | UTF-8/UTF-16/Latin1 encoded string |
| 22 | LIST | Ordered collection (List, Array, Vector) |
| 23 | SET | Unordered collection of unique elements |
| 24 | MAP | Key-value mapping |
| 25 | ENUM | Enum registered by numeric ID |
| 26 | NAMED_ENUM | Enum registered by namespace + type name |
| 27 | STRUCT | Struct registered by numeric ID (schema consistent) |
| 28 | COMPATIBLE_STRUCT | Struct with schema evolution support (by ID) |
| 29 | NAMED_STRUCT | Struct registered by namespace + type name |
| 30 | NAMED_COMPATIBLE_STRUCT | Struct with schema evolution (by name) |
| 31 | EXT | Extension type registered by numeric ID |
| 32 | NAMED_EXT | Extension type registered by namespace + type name |
| 33 | UNION | Union value, schema identity not embedded |
| 34 | TYPED_UNION | Union value with registered numeric type ID |
| 35 | NAMED_UNION | Union value with embedded type name/TypeDef |
| 36 | NONE | Empty/unit type (no data) |
| 37 | DURATION | Time duration (seconds + nanoseconds) |
| 38 | TIMESTAMP | Point in time (seconds + nanoseconds since epoch) |
| 39 | DATE | Date without timezone (days since epoch) |
| 40 | DECIMAL | Arbitrary precision decimal |
| 41 | BINARY | Raw binary data |
| 42 | ARRAY | Generic array type |
| 43 | BOOL_ARRAY | 1D boolean array |
| 44 | INT8_ARRAY | 1D int8 array |
| 45 | INT16_ARRAY | 1D int16 array |
| 46 | INT32_ARRAY | 1D int32 array |
| 47 | INT64_ARRAY | 1D int64 array |
| 48 | UINT8_ARRAY | 1D uint8 array |
| 49 | UINT16_ARRAY | 1D uint16 array |
| 50 | UINT32_ARRAY | 1D uint32 array |
| 51 | UINT64_ARRAY | 1D uint64 array |
| 52 | FLOAT8_ARRAY | 1D float8 array |
| 53 | FLOAT16_ARRAY | 1D float16 array |
| 54 | BFLOAT16_ARRAY | 1D bfloat16 array |
| 55 | FLOAT32_ARRAY | 1D float32 array |
| 56 | FLOAT64_ARRAY | 1D float64 array |
Type ID Encoding for User Types
When registering user types (struct/ext/enum/union), the internal type ID is written as the 8-bit kind. The user type ID is written separately as an unsigned varint32 (small7); there is no bit shift or packing.
Examples:
| User ID | Type | Internal ID | Encoded User ID | Decimal |
|---|---|---|---|---|
| 0 | STRUCT | 27 | 0 | 0 |
| 0 | ENUM | 25 | 0 | 0 |
| 1 | STRUCT | 27 | 1 | 1 |
| 1 | COMPATIBLE_STRUCT | 28 | 1 | 1 |
| 2 | NAMED_STRUCT | 29 | 2 | 2 |
When reading type IDs:
- Read internal type ID from the type ID field.
- If the internal type is a user-registered kind, read user_type_id as varuint32.
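The two-step layout above (internal type ID, then an optional user ID) can be sketched in Python. This is a minimal illustration, not the Fory implementation: the helper names are hypothetical, and the varuint32 layout is assumed to be the conventional 7-bits-per-byte scheme with a continuation bit, since this section does not define it.

```python
# Internal IDs copied from the Internal Type ID Table; these are the kinds that carry a user ID.
ENUM, STRUCT, COMPATIBLE_STRUCT, EXT, TYPED_UNION = 25, 27, 28, 31, 34
USER_ID_KINDS = {ENUM, STRUCT, COMPATIBLE_STRUCT, EXT, TYPED_UNION}


def write_varuint32(out: bytearray, value: int) -> None:
    """Assumed layout: 7 payload bits per byte, high bit set while more bytes follow."""
    assert 0 <= value <= 0xFFFFFFFF
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_varuint32(data: bytes, pos: int):
    """Return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def write_type_id(out: bytearray, internal_id: int, user_id=None) -> None:
    """Write the internal ID, then the user ID only for user-registered kinds."""
    write_varuint32(out, internal_id)
    if internal_id in USER_ID_KINDS:
        write_varuint32(out, user_id)


def read_type_id(data: bytes, pos: int = 0):
    """Return (internal_id, user_id or None, next_position)."""
    internal_id, pos = read_varuint32(data, pos)
    user_id = None
    if internal_id in USER_ID_KINDS:
        user_id, pos = read_varuint32(data, pos)
    return internal_id, user_id, pos
```

Because every internal ID defined here is below 128, the internal ID always occupies a single byte under this assumed varint layout, so the user ID starts at the second byte.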
Type mapping
See Type mapping
Spec overview
Here is the overall format:
| fory header | object ref meta | object type meta | object value data |
The data are serialized using little endian byte order for all types.
Fory header
Fory header format for xlang serialization:
| 1 byte bitmap |
+--------------------------------+
| flags |
Detailed byte layout:
Byte 0: Bitmap flags
- Bit 0: null flag (0x01)
- Bit 1: xlang flag (0x02)
- Bit 2: oob flag (0x04)
- Bits 3-7: reserved
- null flag (bit 0): 1 when object is null, 0 otherwise. If an object is null, only this flag is set.
- xlang flag (bit 1): 1 when serialization uses Fory xlang format, 0 when serialization uses Fory language-native format.
- oob flag (bit 2): 1 when out-of-band serialization is enabled (BufferCallback is not null), 0 otherwise.
All data is encoded in little-endian format.
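As an illustration of the bitmap above, the following Python sketch builds and decodes the single header byte. The function names are hypothetical, and the reserved bits are simply ignored on decode, which is an assumption rather than something the format prescribes.

```python
NULL_FLAG = 0x01   # bit 0
XLANG_FLAG = 0x02  # bit 1
OOB_FLAG = 0x04    # bit 2


def encode_header(is_null: bool, xlang: bool = True, oob: bool = False) -> int:
    """Build the 1-byte header bitmap."""
    if is_null:
        # "If an object is null, only this flag is set."
        return NULL_FLAG
    flags = 0
    if xlang:
        flags |= XLANG_FLAG
    if oob:
        flags |= OOB_FLAG
    return flags


def decode_header(byte: int) -> dict:
    """Split the bitmap into named booleans (reserved bits 3-7 are ignored here)."""
    return {
        "null": bool(byte & NULL_FLAG),
        "xlang": bool(byte & XLANG_FLAG),
        "oob": bool(byte & OOB_FLAG),
    }
```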
Reference Meta
Reference tracking handles whether the object is null, and whether to track reference for the object by writing corresponding flags and maintaining internal state.
Reference Flags
| Flag | Byte Value (int8) | Hex | Description |
|---|---|---|---|
| NULL FLAG | -3 | 0xFD | Object is null. No further bytes are written for this object. |
| REF FLAG | -2 | 0xFE | Object was already serialized. Followed by unsigned varint32 reference ID. |
| NOT_NULL VALUE FLAG | -1 | 0xFF | Object is non-null but reference tracking is disabled for this type. Object data follows immediately. |
| REF VALUE FLAG | 0 | 0x00 | Object is referenceable and this is its first occurrence. Object data follows. Assigns next reference ID. |
Reference Tracking Algorithm
Writing:
function write_ref_or_null(buffer, obj):
    if obj is null:
        buffer.write_int8(NULL_FLAG)  // -3
        return true  // done, no more data to write
    if reference_tracking_enabled:
        ref_id = lookup_written_objects(obj)
        if ref_id exists:
            buffer.write_int8(REF_FLAG)  // -2
            buffer.write_varuint32(ref_id)
            return true  // done, reference written
        else:
            buffer.write_int8(REF_VALUE_FLAG)  // 0
            add_to_written_objects(obj, next_ref_id++)
            return false  // continue to serialize object data
    else:
        buffer.write_int8(NOT_NULL_VALUE_FLAG)  // -1
        return false  // continue to serialize object data
Reading:
function read_ref_or_null(buffer):
    flag = buffer.read_int8()
    switch flag:
        case NULL_FLAG (-3):
            return (null, true)  // null object, done
        case REF_FLAG (-2):
            ref_id = buffer.read_varuint32()
            obj = get_from_read_objects(ref_id)
            return (obj, true)  // referenced object, done
        case NOT_NULL_VALUE_FLAG (-1):
            return (null, false)  // non-null, continue reading
        case REF_VALUE_FLAG (0):
            reserve_ref_slot()  // will be filled after reading
            return (null, false)  // non-null, continue reading
Reference ID Assignment
- Reference IDs are assigned sequentially starting from 0
- The ID is assigned when REF_VALUE_FLAG is written (first occurrence)
- Objects are stored in a list/map indexed by their reference ID
- For reading, a placeholder slot is reserved before deserializing the object, then filled after
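The write side of the algorithm can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: RefWriter is a hypothetical class, objects are matched by identity via id(), and the varuint32 layout is assumed to be the conventional 7-bits-per-byte scheme.

```python
NULL_FLAG, REF_FLAG, NOT_NULL_VALUE_FLAG, REF_VALUE_FLAG = -3, -2, -1, 0


def write_varuint32(out: bytearray, value: int) -> None:
    """Assumed layout: 7 payload bits per byte, high bit set while more bytes follow."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


class RefWriter:
    """Hypothetical helper that writes the reference flag for one object at a time."""

    def __init__(self, tracking: bool = True):
        self.tracking = tracking
        self.written = {}     # id(obj) -> reference ID
        self.keepalive = []   # keeps objects alive so id() values are not reused

    def write_ref_or_null(self, out: bytearray, obj) -> bool:
        """Return True when nothing more must be written for obj."""
        if obj is None:
            out.append(NULL_FLAG & 0xFF)            # int8 -3 is byte 0xFD
            return True
        if not self.tracking:
            out.append(NOT_NULL_VALUE_FLAG & 0xFF)  # int8 -1 is byte 0xFF
            return False
        ref_id = self.written.get(id(obj))
        if ref_id is not None:
            out.append(REF_FLAG & 0xFF)             # int8 -2 is byte 0xFE
            write_varuint32(out, ref_id)
            return True
        out.append(REF_VALUE_FLAG)
        self.written[id(obj)] = len(self.written)   # sequential IDs starting from 0
        self.keepalive.append(obj)
        return False
```

Writing the same list twice with tracking enabled yields REF_VALUE_FLAG for the first occurrence and REF_FLAG followed by ID 0 for the second; a reader would mirror this by reserving slot 0 before deserializing the first occurrence.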
When Reference Tracking is Disabled
When reference tracking is disabled globally or for specific types, only the NULL and NOT_NULL VALUE flags
will be used for reference meta. This reduces overhead for types that are known not to have references.
Language-Specific Considerations
Languages with nullable and reference types by default (Java, Python, JavaScript):
In xlang mode, for cross-language compatibility:
- All fields are treated as not-null by default
- Reference tracking is disabled by default
- Users can explicitly mark fields as nullable or enable reference tracking via annotations
- Optional types (e.g., java.util.Optional, typing.Optional) are treated as nullable
Annotation examples:
// Java: use @ForyField annotation
public class MyClass {
  @ForyField(nullable = true, ref = true)
  private Object refField;
  @ForyField(nullable = false)
  private String requiredField;
}

# Python: use typing with fory field descriptors
from pyfory import Fory, ForyField

class MyClass:
    ref_field: ForyField(SomeType, nullable=True, ref=True)
    required_field: ForyField(str, nullable=False)
Languages with non-nullable types by default:
| Language | Null Representation | Reference Tracking Support |
|---|---|---|
| Rust | Option::None | Via Rc<T>, Arc<T>, Weak<T> |
| C++ | std::nullopt, nullptr | Via std::shared_ptr<T>, weak_ptr<T> |
| Go | nil interface/pointer | Via pointer/interface types |
Important: For languages like Rust that don't have implicit reference semantics, reference tracking must use
explicit smart pointers (Rc, Arc).
Type Meta
Every non-primitive value begins with a type ID that identifies its concrete type. The type ID is followed by optional type-specific metadata.
Type ID encoding
- The type ID is written as an unsigned varint32 (small7).
- Internal types use their internal type ID directly (low 8 bits).
- User-registered types write the internal type ID, then write user_type_id as varuint32.
  - user_type_id is a numeric ID (0~0xFFFFFFFE in current implementations).
  - internal_type_id is one of ENUM, STRUCT, COMPATIBLE_STRUCT, EXT, or TYPED_UNION.
- Named types do not embed a user ID. They use NAMED_* internal type IDs and carry a namespace and type name (or shared TypeDef) instead.
Type meta payload
After the type ID:
- ENUM / STRUCT / EXT / TYPED_UNION: no extra bytes beyond the user_type_id (registration by ID required on both sides).
- COMPATIBLE_STRUCT:
  - If meta share is enabled, write a shared TypeDef entry (see below).
  - If meta share is disabled, no extra bytes.
- NAMED_ENUM / NAMED_STRUCT / NAMED_COMPATIBLE_STRUCT / NAMED_EXT / NAMED_UNION:
  - If meta share is disabled, write namespace and type_name as meta strings.
  - If meta share is enabled, write a shared TypeDef entry (see below).
- UNION: no extra bytes at this layer.
- LIST / SET / MAP / ARRAY / primitives: no extra bytes at this layer.
Unregistered types are serialized as named types:
- Enums -> NAMED_ENUM
- Struct-like classes -> NAMED_STRUCT (or NAMED_COMPATIBLE_STRUCT when meta share is enabled)
- Custom extension types -> NAMED_EXT
- Unions -> NAMED_UNION
The namespace is the package/module name and the type name is the simple class name.
Shared Type Meta (streaming)
When meta share is enabled, TypeDef metadata is written inline the first time a type is encountered, and subsequent occurrences only reference it.
Encoding:
- marker = (index << 1) | flag
- flag = 0: new type definition follows
- flag = 1: reference to a previously written type definition
- index is the sequential index assigned to this type (starting from 0).
Write algorithm:
- Look up the class in the per-stream meta context map.
- If found, write (index << 1) | 1.
- If not found:
  - assign index = next_id
  - write (index << 1)
  - write the encoded TypeDef bytes immediately after
Read algorithm:
- Read marker as varuint32.
- flag = marker & 1, index = marker >>> 1.
- If flag == 1, use the cached TypeDef at index.
- If flag == 0, read a TypeDef, cache it at index, and use it.
TypeDef bytes include the 8-byte global header and optional size extension.
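The marker arithmetic can be illustrated with a short Python sketch. MetaShareWriter and decode_marker are hypothetical names; the sketch only computes marker values and leaves out writing the varuint32 and the TypeDef bytes.

```python
class MetaShareWriter:
    """Hypothetical per-stream meta context: maps a type key to its sequential index."""

    def __init__(self):
        self.index_by_type = {}

    def marker_for(self, type_key):
        """Return (marker, is_new). When is_new is True, TypeDef bytes must follow the marker."""
        index = self.index_by_type.get(type_key)
        if index is not None:
            return (index << 1) | 1, False   # flag = 1: reference to an earlier definition
        index = len(self.index_by_type)      # next sequential index, starting from 0
        self.index_by_type[type_key] = index
        return index << 1, True              # flag = 0: new definition follows


def decode_marker(marker: int):
    """Return (index, is_reference) for a marker read from the stream."""
    return marker >> 1, bool(marker & 1)
```

For example, the first type seen gets marker 0, the second gets marker 2, and a repeat of the first type is written as marker 1.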
TypeDef (schema evolution metadata)
TypeDef describes a struct-like type (or a named enum/ext) for schema evolution and name resolution. It is encoded as:
| 8-byte global header | [optional size varuint] | TypeDef body |
Global header
The 8-byte header is a little-endian uint64:
- Low 8 bits: meta size (number of bytes in the TypeDef body).
  - If meta size >= 0xFF, the low 8 bits are set to 0xFF and an extra varuint32(meta_size - 0xFF) follows immediately after the header.
- Bit 8: HAS_FIELDS_META (1 = fields metadata present).
- Bit 9: COMPRESS_META (1 = body is compressed; decompress before parsing).
- Bits 10-13: reserved for future extension (must be zero).
- High 50 bits: hash of the TypeDef body.
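The bit layout of the global header can be checked with a small Python sketch. The function names are hypothetical, and the hash is treated as an opaque 50-bit input because this section does not specify how it is computed.

```python
import struct

HAS_FIELDS_META = 1 << 8
COMPRESS_META = 1 << 9
HASH_SHIFT = 14            # 8 size bits + 2 flag bits + 4 reserved bits
HASH_MASK = (1 << 50) - 1


def pack_header(body_size: int, body_hash: int, has_fields: bool, compressed: bool) -> bytes:
    """Pack the 8-byte little-endian header.

    When body_size >= 0xFF the caller still has to append varuint32(body_size - 0xFF).
    """
    header = min(body_size, 0xFF)
    if has_fields:
        header |= HAS_FIELDS_META
    if compressed:
        header |= COMPRESS_META
    header |= (body_hash & HASH_MASK) << HASH_SHIFT
    return struct.pack("<Q", header)


def unpack_header(raw: bytes) -> dict:
    """Split the first 8 bytes into the fields described above."""
    (header,) = struct.unpack("<Q", raw[:8])
    size_low = header & 0xFF
    return {
        "size_low": size_low,
        "needs_size_extension": size_low == 0xFF,
        "has_fields_meta": bool(header & HAS_FIELDS_META),
        "compress_meta": bool(header & COMPRESS_META),
        "hash": header >> HASH_SHIFT,
    }
```

Since the header is little-endian, the meta size is the first byte on the wire and the two flag bits sit in the low bits of the second byte.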
TypeDef body
TypeDef body has a single layer (fields are flattened in class hierarchy order):
| meta header (1 byte) | type spec | field info ... |
Meta header byte:
- Bits 0-4: num_fields (0-30).
  - If num_fields == 31, read an extra varuint32 and add it.
- Bit 5: REGISTER_BY_NAME (1 = namespace + type name, 0 = numeric type ID).
- Bits 6-7: reserved.
Type spec:
- If REGISTER_BY_NAME is set:
  - namespace meta string
  - type_name meta string
- Otherwise:
  - type_id as varuint32 (small7)
Field info list:
Each field is encoded as:
| field header (1 byte) | field type info | [field name bytes] |
Field header layout:
- Bits 6-7: field name encoding (UTF8, ALL_TO_LOWER_SPECIAL, LOWER_UPPER_DIGIT_SPECIAL, or TAG_ID)
- Bits 2-5: size
  - For name encoding: size = (name_bytes_length - 1)
  - For tag ID: size = tag_id
  - If size == 0b1111, read varuint32(size - 15) and add it
- Bit 1: nullable flag
- Bit 0: reference tracking flag
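A Python sketch of decoding the field header byte follows. The function name is hypothetical, and the numeric values of the 2-bit name encodings are assumed to follow the order in which they are listed above (UTF8 = 0 through TAG_ID = 3); the list itself does not state the numbering.

```python
# Assumed numbering: the position in this tuple is the 2-bit value stored in bits 6-7.
FIELD_NAME_ENCODINGS = ("UTF8", "ALL_TO_LOWER_SPECIAL", "LOWER_UPPER_DIGIT_SPECIAL", "TAG_ID")


def parse_field_header(byte: int, extra_size: int = 0) -> dict:
    """Decode one field header byte.

    extra_size is the value of the varuint32 that follows when the 4-bit size is 0b1111.
    """
    encoding = FIELD_NAME_ENCODINGS[(byte >> 6) & 0b11]
    size = (byte >> 2) & 0b1111
    if size == 0b1111:
        size += extra_size
    info = {
        "encoding": encoding,
        "nullable": bool(byte & 0b10),
        "tracking_ref": bool(byte & 0b01),
    }
    if encoding == "TAG_ID":
        info["tag_id"] = size
    else:
        info["name_bytes_length"] = size + 1  # size stores length - 1
    return info
```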
Field type info:
- The top-level field type is written as varuint32(type_id) (small7) without flags.
- For LIST/SET, an element type follows, encoded as (nested_type_id << 2) | (nullable << 1) | tracking_ref.
- For MAP, key type and value type follow, both encoded the same way.
- One-dimensional primitive arrays use *_ARRAY type IDs; other arrays are encoded as LIST.
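As a worked example of the nested type encoding, the Python sketch below computes the values written for a list-of-strings field. The helper names are hypothetical, and the values are shown before varuint32 encoding.

```python
LIST, STRING = 22, 21  # from the Internal Type ID Table


def encode_nested_type(type_id: int, nullable: bool, tracking_ref: bool) -> int:
    """Pack a nested (element, key, or value) type with its two flag bits."""
    return (type_id << 2) | (int(nullable) << 1) | int(tracking_ref)


def decode_nested_type(value: int):
    """Return (type_id, nullable, tracking_ref)."""
    return value >> 2, bool(value & 0b10), bool(value & 0b01)


# A list field whose string elements are nullable and not reference-tracked:
# the top-level type is written without flags, the element type with flags.
field_type_values = [LIST, encode_nested_type(STRING, nullable=True, tracking_ref=False)]
```

Here field_type_values is [22, 86], since (21 << 2) | 0b10 equals 86.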
Field names:
- If TAG_ID encoding is used, no name bytes are written.
- Otherwise, write the encoded field name bytes as a meta string.
- For xlang, field names are converted to snake_case before encoding for cross-language compatibility.
Field order:
Field order is implementation-defined. Decoders must match fields by name or tag ID rather than position. Fory uses a stable grouping and sorting order to produce deterministic TypeDefs.
Meta String
Meta string is a compressed encoding for metadata strings such as field names, type names, and namespaces. This compression significantly reduces the size of type metadata in serialized data.
Encoding Type IDs
| ID | Name | Bits/Char | Character Set |
|---|---|---|---|
| 0 | UTF8 | 8 | Any UTF-8 character |
| 1 | LOWER_SPECIAL | 5 | a-z . _ $ \| |
| 2 | LOWER_UPPER_DIGIT_SPECIAL | 6 | a-z A-Z 0-9 . _ |
| 3 | FIRST_TO_LOWER_SPECIAL | 5 | First char uppercase, rest a-z . _ |
| 4 | ALL_TO_LOWER_SPECIAL | 5 | a-z A-Z . _ (uppercase escaped) |
Character Mapping Tables
LOWER_SPECIAL (5 bits per character)
| Character | Code (binary) | Code (decimal) |
|---|---|---|
| a-z | 00000-11001 | 0-25 |
| . | 11010 | 26 |
| _ | 11011 | 27 |
| $ | 11100 | 28 |
| \| | 11101 | 29 |
Note: The | character is used as an escape sequence in ALL_TO_LOWER_SPECIAL encoding.
LOWER_UPPER_DIGIT_SPECIAL (6 bits per character)
| Character | Code (binary) | Code (decimal) |
|---|---|---|
| a-z | 000000-011001 | 0-25 |
| A-Z | 011010-110011 | 26-51 |
| 0-9 | 110100-111101 | 52-61 |
| . | 111110 | 62 |
| _ | 111111 | 63 |
Encoding Algorithms
LOWER_SPECIAL Encoding
For strings containing only a-z, ., _, $, |:
function encode_lower_special(str):
    bits = []
    for char in str:
        bits.append(lookup_lower_special[char])  // 5 bits each
    // Pad to byte boundary, counting the 1-bit flag that is prepended below
    total_bits = len(str) * 5 + 1
    padding_bits = (8 - (total_bits % 8)) % 8
    // First bit indicates if last char should be stripped (due to padding)
    strip_last = (padding_bits >= 5)
    if strip_last:
        prepend bit 1
    else:
        prepend bit 0
    return pack_bits_to_bytes(bits)
FIRST_TO_LOWER_SPECIAL Encoding
For strings like MyFieldName where only the first character is uppercase:
function encode_first_to_lower_special(str):
    // Convert first char to lowercase
    modified = str[0].lower() + str[1:]
    // Then use LOWER_SPECIAL encoding
    return encode_lower_special(modified)
ALL_TO_LOWER_SPECIAL Encoding
For strings with multiple uppercase characters like MyTypeName:
function encode_all_to_lower_special(str):
    result = ""
    for char in str:
        if char.is_upper():
            result += "|" + char.lower()  // Escape uppercase with |
        else:
            result += char
    return encode_lower_special(result)
Example: MyType → |my|type → encoded with LOWER_SPECIAL
Encoding Selection Algorithm
function choose_encoding(str):
    if all chars in str are in [a-z . _ $ |]:
        return LOWER_SPECIAL
    if first char is uppercase AND rest are in [a-z . _]:
        return FIRST_TO_LOWER_SPECIAL
    if all chars are in [a-z A-Z . _]:
        lower_special_size = encode_all_to_lower_special(str).size
        luds_size = encode_lower_upper_digit_special(str).size
        if lower_special_size <= luds_size:
            return ALL_TO_LOWER_SPECIAL
        else:
            return LOWER_UPPER_DIGIT_SPECIAL
    if all chars are in [a-z A-Z 0-9 . _]:
        return LOWER_UPPER_DIGIT_SPECIAL
    return UTF8
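The selection rule can be sketched in Python without running the encoders, by computing the packed sizes arithmetically. The sketch assumes that both the 5-bit and 6-bit encodings spend one leading flag bit, so n characters at b bits each occupy ceil((n * b + 1) / 8) bytes; that size formula is an assumption, not something stated above.

```python
import string

LOWER = set(string.ascii_lowercase)
UPPER = set(string.ascii_uppercase)
DIGITS = set(string.digits)


def packed_size(num_chars: int, bits_per_char: int) -> int:
    """Bytes needed for num_chars codes plus one assumed flag bit."""
    return (num_chars * bits_per_char + 1 + 7) // 8


def choose_encoding(text: str) -> str:
    chars = set(text)
    if chars <= LOWER | set("._$|"):
        return "LOWER_SPECIAL"
    if text[0] in UPPER and set(text[1:]) <= LOWER | set("._"):
        return "FIRST_TO_LOWER_SPECIAL"
    if chars <= LOWER | UPPER | set("._"):
        uppercase = sum(ch in UPPER for ch in text)
        escaped_size = packed_size(len(text) + uppercase, 5)  # each uppercase adds a '|'
        luds_size = packed_size(len(text), 6)
        if escaped_size <= luds_size:
            return "ALL_TO_LOWER_SPECIAL"
        return "LOWER_UPPER_DIGIT_SPECIAL"
    if chars <= LOWER | UPPER | DIGITS | set("._"):
        return "LOWER_UPPER_DIGIT_SPECIAL"
    return "UTF8"
```

Under this size formula a short name with several capitals, such as MyType, is smaller in the 6-bit encoding (5 bytes versus 6), while a long name with a single capital favors the escaped 5-bit form.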
Meta String Header Format
Meta strings are written with a header that includes the encoding type:
| 3 bits encoding | 5+ bits length | encoded bytes |
Or for larger strings:
| varuint: (length << 3) | encoding | encoded bytes |
Special Character Sets by Context
Different contexts use different special characters:
| Context | Special Chars | Notes |
|---|---|---|
| Field Name | . _ $ \| | $ for inner classes, \| for escape |
| Namespace | . _ | Package/module separators |
| Type Name | $ _ | $ for inner classes in Java |
Deduplication
Meta strings are deduplicated within a serialization session:
First occurrence: | (length << 1) | [hash if large] | encoding | bytes |
Reference: | ((id + 1) << 1) | 1 |
- Bit 0 of the header indicates: 0 = new string, 1 = reference to previous
- Large strings (> 16 bytes) include 64-bit hash for content-based deduplication
- Small strings use exact byte comparison
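The header arithmetic for deduplication can be sketched in Python. This is a loose illustration only: MetaStringDedup is a hypothetical class, IDs are assumed to start at 0 in order of first occurrence, and strings are keyed on their encoded bytes alone, ignoring the hash shortcut for large strings and the encoding field.

```python
class MetaStringDedup:
    """Hypothetical per-session table of meta strings that were already written."""

    def __init__(self):
        self.ids = {}  # encoded bytes -> sequential ID (assumed to start at 0)

    def header_for(self, encoded: bytes):
        """Return (header_value, is_reference)."""
        known = self.ids.get(encoded)
        if known is not None:
            return ((known + 1) << 1) | 1, True   # bit 0 = 1: reference to previous string
        self.ids[encoded] = len(self.ids)
        return len(encoded) << 1, False           # bit 0 = 0: new string, length in upper bits
```

With this scheme, writing b"foo", then b"bar", then b"foo" again produces header values 6, 6, and 3: the last one refers back to ID 0.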