
Concepts

This document explains Parseable design and terminology at a high level.

Authentication

The Parseable API requires HTTP basic authentication. All HTTP clients can generate the basic auth header from a username and password. Note that the username and password are set when you start the Parseable server (via the environment variables P_USERNAME and P_PASSWORD).

If you want to generate the basic auth value yourself, use the following command.

echo -n '<username>:<password>' | base64

Then add the following header to your HTTP request.

Authorization: Basic <output-from-previous-command>
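
Alternatively, curl can construct the header for you when you pass the credentials with the -u flag. The URL below is a placeholder for your Parseable server address, and /api/v1/logstream is used here only as an example of an authenticated endpoint.

# placeholder server address; any authenticated Parseable API endpoint works the same way
curl -u '<username>:<password>' http://localhost:8000/api/v1/logstream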

Ingestion

Log ingestion in Parseable is done via an HTTP POST request with a JSON payload. The payload can contain a single log event as a JSON object, or multiple log events as a JSON array.

You can use the HTTP output plugin of any popular logging agent (FluentBit, Vector, LogStash, among others) to send log events to Parseable. You can also integrate Parseable directly with your application via REST API calls.
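
As a sketch of what a direct REST API call looks like, the following curl command posts a JSON array of two events to a stream. The server address, port, and endpoint path here are assumptions for illustration; refer to the ingestion API reference for the exact URL for your version.

# placeholder URL and assumed endpoint path; check the ingestion API reference
curl --location --request POST 'http://localhost:8000/api/v1/logstream/<stream-name>' \
--header 'Authorization: Basic <output-from-previous-command>' \
--header 'Content-Type: application/json' \
--data-raw '[
    {"level": "info", "message": "server started"},
    {"level": "error", "message": "connection refused"}
]'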

Schema

With Parseable you don't need to explicitly define a schema for your log events. When you send the first log event to a Parseable stream, the server automatically detects the schema and enforces it for subsequent log events sent to that stream. You can fetch this schema using the Get schema API.
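
A sketch of fetching the schema with curl is below. The endpoint path is an assumption; see the Get schema API reference for the exact URL.

# assumed endpoint path; see the Get schema API reference
curl --location --request GET 'http://localhost:8000/api/v1/logstream/<stream-name>/schema' \
--header 'Authorization: Basic <output-from-previous-command>'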

Schema evolution

We're working on a fluid schema approach that will allow schema evolution over time. This means you will be able to add new fields to your log events as they evolve, without breaking the schema.

Flattening

Nested JSON objects are automatically flattened. For example, the following JSON object

{
    "foo": {
        "bar": "baz"
    }
}

will be flattened to

{
    "foo.bar": "baz"
}

before it gets stored. When querying, refer to this field as foo.bar, for example: select foo.bar from <stream-name>. The flattened field is available in the schema as well.

Storage

Once the JSON payload reaches the server, it is validated and parsed into the columnar Apache Arrow format in memory. Subsequent events are appended to the Arrow record batch in memory, and a copy is kept on disk (to prevent data loss). Finally, after a configurable duration, the Arrow record batch is converted to Parquet and pushed to an S3 (or compatible) bucket.

Parquet files on object storage are organized into prefixes based on stream name, date, and time. This lets the server fetch a very specific dataset based on the query time range (more on queries in the next section). We're working on a compaction approach that will further compress and optimize storage while keeping the data queryable at all times.
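
As an illustration only (the exact prefix layout may differ across versions), data for a stream could be laid out on the bucket along these lines:

<bucket>/<stream-name>/date=2022-11-17/hour=07/minute=03/<file>.parquet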

Storage Stats

To fetch the ingested data size and the actual compressed (stored) data size for a stream, use the Get Stats API. Sample response:

{
    "ingestion": {
        "format": "json",
        "size": "12800 Bytes"
    },
    "storage": {
        "format": "parquet",
        "size": "15517 Bytes"
    },
    "stream": "reactapplogs",
    "time": "2022-11-17T07:03:13.134992Z"
}
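
A sketch of calling the stats API with curl (the endpoint path is an assumption; see the Get Stats API reference for the exact URL):

# assumed endpoint path; see the Get Stats API reference
curl --location --request GET 'http://localhost:8000/api/v1/logstream/<stream-name>/stats' \
--header 'Authorization: Basic <output-from-previous-command>'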

Query

The Parseable query API works with standard SQL. In addition to the SQL query, you need to specify the time range for which the query should be executed, using the startTime and endTime parameters. The response is inclusive of both timestamps.
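
For illustration, a query request could look like the following. The endpoint path and timestamp format shown here are assumptions; the Postman collection linked below has the exact request.

# assumed endpoint path and timestamp format; see the Query API reference
curl --location --request POST 'http://localhost:8000/api/v1/query' \
--header 'Authorization: Basic <output-from-previous-command>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "query": "select foo.bar from <stream-name>",
    "startTime": "2022-11-17T00:00:00+00:00",
    "endTime": "2022-11-17T23:59:59+00:00"
}'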

Check out the Query API in Postman.

Parseable uses DataFusion, the Apache Arrow native query engine, in conjunction with an efficient Parquet reader to execute queries.