# Data Types

# 1. Summary

This specification describes the different data types supported for the fields in a document and how Meilisearch handles them.

# 2. Functional Specification

No matter the type, the value of a field is unchanged in the returned documents upon search.

For example, if you have a complex document structure with nested objects, the document is returned with the same complexity upon search.

However, based on their type, the fields are handled and used in different ways by Meilisearch.

# 2.1. Supported types

# 2.1.1. String

string is the primary type for indexing data in Meilisearch.

See 3.1. String Tokenization section.

# 2.1.2. Numeric

The engine internally converts a numeric typed value (integer/float) to a human-readable decimal number string representation to make them searchable.

The >, >=, <, and <= filter operators apply only to numerical values.

# 2.1.3. Boolean

The engine internally converts a boolean typed value (true/false) to a lowercase human-readable text to make it searchable.

# 2.1.4. Array

An array is recursively broken into separate string tokens, which means separate words.

After the tokenizing process, each word is indexed and stored in the global dictionary of the corresponding index.

See 3.2. Array Tokenization section.

Meilisearch accepts complex data structures, no matter the deepness level.

See 3.4. Nested structures section.

# 2.1.5. Object

The engine flattens JSON objects at the root level of a document.

After the tokenizing process, each word is indexed and stored in the global dictionary of the corresponding index.

See 3.3. Object section.

Meilisearch accepts complex data structures, no matter the deepness level.

See 3.4. Nested structures section.

# 2.1.6. null

The null type is not taken into account at indexing.

# 3. Technical Details

# 3.1. String Tokenization

String tokenization is the process of splitting a string into a list of individual terms that are called tokens.

A string is passed to a tokenizer and is then broken into separate string tokens. A token is a word.

For Latin-based languages, the words are separated by space. For Kanji characters, the words are separated by character.

For Latin-based languages, there are two kinds of space separators:

  • Soft spaces (distance: 1): whitespaces, quotes, - | _ | \ | : | / | \\ | @ | " | + | ~ | = | ^ | * | #
  • Hard spaces (distance: 8): . | ; | , | ! | ? | ( | ) | [ | ] | { | }| |

Distance plays an essential role in determining whether documents are relevant. The proximity ranking rule sorts the results by increasing distance between matched query terms. Two words separated by a soft space are closer and thus considered more relevant than two words separated by a hard space.

After the tokenizing process, each word is indexed and stored in the global dictionary of the corresponding index.

# 3.1.1. Examples

To demonstrate how a string is split by space, let's say you have the following string as an input:

"Bruce Willis,Vin Diesel"

In the example above, the distance between Bruce and Willis is equal to 1. The distance between Vin and Diesel is equal to 1 too. But, the distance between Bruce and Vin is equal to 8. The same goes for Bruce and Diesel, or Willis and Vin, or also Willis and Diesel.

Let's see another example. Given two documents:

[
    {
        "movie_id": "001",
        "description": "Bruce.Willis"
    },
    {
        "movie_id": "002",
        "description": "Bruce super Willis"
    }
]

When making a query on Bruce Willis, 002 will be the first document returned and 001 will be the second one. This will happen because the proximity distance between Bruce and Willis is equal to 2 in the document 002 whereas the distance between Bruce and Willis is equal to 8 in the document 001 since the full stop is a hard space.

# 3.2. Array Tokenization

An array is recursively broken into separate string tokens, which means separate words. After the tokenizing process, each word is indexed and stored in the global dictionary of the corresponding index.

# 3.2.1. Examples

The following input:

[
    [
        "Bruce Willis",
        "Vin Diesel"
    ],
    "Kung Fu Panda"
]

Will be processed as if all elements were arranged at the same level:

"Bruce Willis. Vin Diesel. Kung Fu Panda."

The strings above will be separated by soft and hard spaces exactly as explained in the string example.

See 3.1.1. Examples section.

# 3.3. Nested Structures

Nested structures (e.g. Object, Array of Objects, etc) are internally flattened to a document's root level.

It allows expressing a nested field in all Meilisearch parameters that accept document attributes.

Meilisearch accepts the . (dot-notation) to express a nested field location in a document structure.

Meilisearch is limited to 127 levels of depth.

# 3.3.1. Examples

# 3.3.1.1. Object

The following JSON document:

    {
        "a": {
            "b": "c",
            "d": "e",
            "f": "g"
        }
    }

Flattens to:

{
    "a.b": "c",
    "a.d": "e",
    "a.f": "g"
}
# 3.3.1.2. Array of objects

The following JSON document:

{
    "a": [
        { "b": "c" },
        { "b": "d" },
        { "b": "e" },
    ]
}

Flattens to:

{
    "a.b": ["c", "d", "e"],
}
# 3.3.1.3. Array of objects mixed with scalar value

The following JSON document:

{
    "a": [
        42,
        { "b": "c" },
        { "b": "d" },
        { "b": "e" },
    ]
}

Flattens to:

{
    "a": 42,
    "a.b": ["c", "d", "e"],
}
# 3.3.1.4. Array of objects of array of objects of ...

The following JSON document:

{
    "a": [
        "b",
        [
            "c",
            "d"
        ],
        {
            "e": [
                "f",
                "g"
            ]
        },
        [
            {
                "h": "i"
            },
            {
                "e": [
                    "j",
                    {
                        "z": "y"
                    }
                ]
            },
        ],
        ["l"],
        "m"
    ]
}

Flattens to:

{
    "a": ["b", "c", "d", "l", "m"],
    "a.e": ["f", "g", "j"],
    "a.h": "i",
    "a.e.z": "y"
}
# 3.3.1.5. Collision between a representation

The following JSON document:

{
    "a": {
        "b": "c",
    },
    "a.b": "d"
}

Flattens to:

{
    "a.b": ["c", "d"],
}
# 3.3.1.6. searchableAttributes default value case

By default, searchableAttributes is set to [*], making all document fields searchable.

In that case, Attribute ranking rule consider a field higher in the internal representation more important than a lower one.

User document field order can be lost if the engine flattens identical field values are not co-located in a document payload.

The following JSON document:

{
    "a.b": "T-shirt",
    "price": 2.0,
    "a": {
        "b": "Nice T-shirt"
    }
}

Is internally represented

{
    "a.b": ["T-shirt", "Nice T-shirt"],
    "price": 2.0
}

The second representation of a.b in its nested form is merged with the first representation of a.b.

Users can't and should not rely on a given document field order when searchableAttributes is [*].

# 3.3.1.7. Dot-notation Expression

Permits to express the nested object property.

# 3.3.1.7.1. Example

Given this document structure

{
    "id": 0,
    "person": {
        "firstname": "John",
        "lastname": "Doe",
        "address": {
            "country": "US",
            "city": "New York"
        }
    }
}

A precise field can be expressed using the dot-notation

{
    "attributesToHighlight": ["person.firstname"]
}
# 3.3.1.8. All Object Properties Expression

It is possible to express the definition of all properties of an object.

e.g. In this case person is an object containing properties. attributesToRetrieve: ["person"]

This notation is accepted on all parameters or settings allowing to specify attributes. This is due to the fact that several documents may not share the same schema. See 3.4.1.8.2. Edge Case section.

# 3.3.1.8.1. Example

Given this document structure

{
    "id": 0,
    "person": {
        "firstname": "John",
        "lastname": "Doe",
        "address": {
            "country": "US",
            "city": "New York"
        }
    }
}

All properties of a document nested object can be expressed this way

{
    "attributesToHighlight": ["person"]
}

It's equivalent to

{
    "attributesToHighlight": [
        "person",
        "person.firstname",
        "person.lastname",
        "person.address",
        "person.address.country",
        "person.address.city"
    ]
}
# 3.3.1.8.2. Edge Case

One document might have a non-nested field person while another has a person object containing properties.

The chosen behavior is not to force the user's hand and Meilisearch do not throw any errors.

E.g. If a user specifies "filter": "person = 'Guillaume'" at search time, the document that would have a nested object person would not be brought up by the filter.

# 4. Future Possibilities

  • Change the default behavior of searchableAttributes so that it is predictable. We may remove the priority based on a field position in a document.
  • Support the wildcard notation with the dot-notation. e.g. person.*, person.address.* or person.l*
  • Support the array notation. e.g. person.addresses[1]