Flea-db

A java library for creating standalone, portable, schema-full object databases supporting pagination and faceted search, and offering strong-typed and generic APIs. Built on top of Apache Lucene.


Project maintained by brutusin.

Main features:

Motivation

Maven dependency

<dependency>
    <groupId>org.brutusin</groupId>
    <artifactId>flea-db</artifactId>
</dependency>

The latest released version is available from the Maven Central Repository.

If you are not using Maven and need help, you can ask for it (see Support, bugs and requests).

APIs

All flea-db functionality is defined by the FleaDB interface.

The library provides two implementations for it:

  1. A low-level generic implementation GenericFleaDB.
  2. A high-level strong-typed implementation ObjectFleaDB built on top of the previous one.

GenericFleaDB

GenericFleaDB is the lowest-level flea-db implementation. It defines the database schema using a JSON schema, and stores and indexes records of type JsonNode. It uses the Apache Lucene APIs and the org.brutusin:json SPI to maintain two different indexes (one for the terms and another for the taxonomy, see index structure), hiding the underlying complexity from the user.

This is how it works:

ObjectFleaDB

ObjectFleaDB is built on top of GenericFleaDB.

An ObjectFleaDB delegates all its functionality to a wrapped GenericFleaDB instance, using org.brutusin:json to perform the POJO<->JsonNode and Class<->JsonSchema transformations. This is why any flea-db database can also be used with GenericFleaDB.
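The delegation described above can be sketched with plain Java. This is only an illustration of the wrapping pattern, not the library's code: GenericStoreSketch and TypedStoreSketch are hypothetical names, strings stand in for JsonNode records, and the two conversion functions stand in for what org.brutusin:json does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a generic store of JSON-like text records.
class GenericStoreSketch {
    private final List<String> records = new ArrayList<>();

    void store(String json) {
        records.add(json);
    }

    List<String> all() {
        return new ArrayList<>(records);
    }
}

// Hypothetical typed facade: converts objects on the way in and out,
// and delegates all storage to the wrapped generic store.
class TypedStoreSketch<E> {
    private final GenericStoreSketch delegate;
    private final Function<E, String> toJson;
    private final Function<String, E> fromJson;

    TypedStoreSketch(GenericStoreSketch delegate,
                     Function<E, String> toJson,
                     Function<String, E> fromJson) {
        this.delegate = delegate;
        this.toJson = toJson;
        this.fromJson = fromJson;
    }

    void store(E element) {
        delegate.store(toJson.apply(element));
    }

    List<E> all() {
        List<E> result = new ArrayList<>();
        for (String json : delegate.all()) {
            result.add(fromJson.apply(json));
        }
        return result;
    }
}
```

Because the typed facade keeps no state of its own, everything it stores remains readable through the generic store it wraps.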

Schema

JSON SPI

As mentioned before, this library makes use of org.brutusin:json, so a JSON service provider like json-provider is needed at runtime. The chosen provider determines JSON serialization, validation, parsing, schema generation and expression semantics.

JSON Schema extension

The standard JSON Schema specification has been extended to declare indexable properties (the "index":"index" and "index":"facet" options). See the annotations section for more details.

Example:

{
  "type": "object",
  "properties": {
    "age": {
      "type": "integer",
      "index": "index"
    },
    "category": {
      "type": "string",
      "index": "facet"
    }
  }
}

Annotations

See documentation in JSON SPI for supported annotations used in the strong-typed scenario.

Indexed fields nomenclature

Databases are self-descriptive: they provide information about their schema and indexed fields (via Schema).

Field semantics are inherited from the expression semantics defined in org.brutusin:json-provider.

Indexation values

Suppose a JsonNode node is to be stored, and let fieldId be the expression identifying a database field, according to the previous section.

Expression exp = JsonCodec.getInstance().compile(fieldId);
JsonSchema fieldSchema = exp.projectSchema(rootSchema);
JsonNode fieldNode = exp.projectNode(node);

Then, the following rules apply to extract index and facet values for that field:

| fieldSchema | index:index                      | index:facet                      |
|-------------|----------------------------------|----------------------------------|
| String      | fieldNode.asString()             | fieldNode.asString()             |
| Boolean     | fieldNode.asString()             | fieldNode.asString()             |
| Integer     | fieldNode.asLong()               | Unsupported                      |
| Number      | fieldNode.asDouble()             | Unsupported                      |
| Object      | each of its property names       | each of its property names       |
| Array       | recurse for each of its elements | recurse for each of its elements |
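As a rough illustration of these rules, the following sketch applies the index:index column to plain Java values. It is not the library's code: IndexValuesSketch is a hypothetical name, and the runtime Java type (String, Boolean, Long, Double, Map, List) is used as a stand-in for the field's JSON schema type, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class IndexValuesSketch {

    // Collects index values for a field value, following the "index:index" rules.
    // Plain Java types stand in for JSON nodes (an assumption of this sketch).
    static List<Object> indexValues(Object value) {
        List<Object> out = new ArrayList<>();
        collect(value, out);
        return out;
    }

    private static void collect(Object value, List<Object> out) {
        if (value instanceof String || value instanceof Boolean) {
            out.add(value.toString());               // indexed as its string form
        } else if (value instanceof Long || value instanceof Integer) {
            out.add(((Number) value).longValue());   // integer: indexed as a long
        } else if (value instanceof Number) {
            out.add(((Number) value).doubleValue()); // number: indexed as a double
        } else if (value instanceof Map) {
            for (Object key : ((Map<?, ?>) value).keySet()) {
                out.add(key.toString());             // object: each property name
            }
        } else if (value instanceof List) {
            for (Object element : (List<?>) value) {
                collect(element, out);               // array: recurse into elements
            }
        }
        // null or unknown types contribute no values in this sketch
    }
}
```

An array field therefore contributes one value per element, which is why a single record can match several terms on the same field.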

Usage

Database persistence

Databases can be created in memory (RAM) or on disk, depending on the characteristics of the problem at hand (performance, dataset size, indexation time ...).

To create a persistent database, choose a constructor that takes a File argument:

FleaDB<JsonNode> db1 = new GenericFleaDB(indexFolder, jsonSchema);
// or
FleaDB<Record> db2 = new ObjectFleaDB(indexFolder, Record.class);

NOTE: Multiple instances can be used to read the same persistent database (for example different concurrent JVM executions), but only one can hold the writing file-lock (claimed the first time a write method is called).

Otherwise, with the constructors that take no File argument, the database is kept in RAM and is lost when the JVM exits.

FleaDB<JsonNode> db1 = new GenericFleaDB(jsonSchema);
// or
FleaDB<Record> db2 = new ObjectFleaDB(Record.class);

Write operations

The following operations perform modifications on the database.

Store

To store a record, use the store(...) method:

db1.store(jsonNode);
// or
db2.store(record);

Internally, this ends up calling addDocument on the underlying Lucene IndexWriter.

Delete

The API allows deleting a set of records via delete(Query q).

NOTE: Due to Lucene facet internals, categories are never deleted from the taxonomy index, even when they become orphaned.

Commit

The previous operations (store and delete) are not visible to readers until commit() is called. On commit, the underlying searchers and writers are released, to be lazily re-created by subsequent read or write operations.
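The visibility rule can be pictured with a toy model. The sketch below is not how flea-db or Lucene implement commits; StagedStoreSketch is a hypothetical class that only mimics the observable behaviour: stored records stay invisible to queries until commit() is called.

```java
import java.util.ArrayList;
import java.util.List;

// Toy store: writes are staged and only become visible to readers on commit().
class StagedStoreSketch {
    private final List<String> visible = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    void store(String record) {
        pending.add(record);             // staged, not yet searchable
    }

    void commit() {
        visible.addAll(pending);         // publish staged records
        pending.clear();
    }

    List<String> query() {
        return new ArrayList<>(visible); // readers only see committed state
    }
}
```

In practice this means a batch of store(...) calls is usually followed by a single commit(), as in the examples further down.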

Optimization

Databases can be optimized for better performance by calling optimize(). This method triggers a very costly merge (in terms of free disk space and computation) of the Lucene index segments into a single one.

Nevertheless, this operation is useful for immutable databases, which can be optimized once before use.

Read operations

Two kinds of read operations can be performed, both taking a Query argument that defines the search criteria.

Record queries

Record queries can be paginated and the ordering of the results can be specified via a Sort argument.
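To make the pagination arithmetic concrete, here is a small stand-alone sketch. PagingSketch is a hypothetical helper, not part of flea-db; it assumes 1-based page numbers, matching the loops in the examples below, and works on an in-memory list rather than on a Lucene result set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PagingSketch {

    // Number of pages needed for totalHits results at pageSize per page (ceiling division).
    static int totalPages(int totalHits, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return (totalHits + pageSize - 1) / pageSize;
    }

    // Returns the 1-based page of the given list; an out-of-range page yields an empty list.
    static <T> List<T> page(List<T> hits, int pageNumber, int pageSize) {
        int total = totalPages(hits.size(), pageSize);
        if (pageNumber < 1 || pageNumber > total) {
            return Collections.emptyList();
        }
        int from = (pageNumber - 1) * pageSize;
        int to = Math.min(from + pageSize, hits.size());
        return new ArrayList<>(hits.subList(from, to));
    }
}
```

The last page may be shorter than pageSize, so callers should iterate over the returned list rather than assume a fixed length.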

Facet queries

FacetResponse represents the faceting info returned by the database.

Faceting is provided by lucene-facet.
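For readers new to faceting, the idea is to return, for each facet value, the number of matching records that carry it. The sketch below shows only that counting concept over an in-memory list; FacetCountSketch is a hypothetical class, and it says nothing about the actual shape of FacetResponse or about how lucene-facet computes counts from the taxonomy index.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FacetCountSketch {

    // Counts how many times each facet value occurs among the matching records,
    // keeping values in first-seen order.
    static Map<String, Integer> count(List<String> facetValues) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : facetValues) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }
}
```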

Closing

Databases must be closed after use via the close() method, in order to free the resources and locks they hold. A closed database is no longer usable.

Threading issues

Both implementations are thread safe and can be shared across multiple threads.

Index structure

Persistent flea-db databases create the following index structure:

/flea-db/
|-- flea.json
|-- record-index
|   |-- ...
|-- taxonomy-index
|   |-- ...

where flea.json is the database descriptor containing its schema, and the record-index and taxonomy-index subfolders hold the underlying Lucene index structures.

ACID properties

flea-db offers the following ACID properties, inherited from Lucene:

Examples:

Generic API:

// Generic interaction with a previously created database
FleaDB<JsonNode> db = new GenericFleaDB(indexFolder);

// Store records
JsonNode json = JsonCodec.getInstance().parse("...");
db.store(json);
db.commit();

// Query records
Query q = Query.createTermQuery("$.id", "0");
Paginator<JsonNode> paginator = db.query(q);
int totalPages = paginator.getTotalPages(pageSize);
for (int i = 1; i <= totalPages; i++) {
    List<JsonNode> page = paginator.getPage(i, pageSize);
    for (int j = 0; j < page.size(); j++) {
        // named "result" to avoid clashing with the "json" variable above
        JsonNode result = page.get(j);
        System.out.println(result);
    }
}
db.close();

Strong-typed API:

// Create object database
FleaDB<Record> db = new ObjectFleaDB(indexFolder, Record.class);

// Store records
for (int i = 0; i < REC_NO; i++) {
    Record r = new Record();
    // ... populate record
    db.store(r);
}
db.commit();

// Query records
Query q = Query.createTermQuery("$.id", "0");
Paginator<Record> paginator = db.query(q);
int totalPages = paginator.getTotalPages(pageSize);
for (int i = 1; i <= totalPages; i++) {
    List<Record> page = paginator.getPage(i, pageSize);
    for (int j = 0; j < page.size(); j++) {
        Record r = page.get(j);
        System.out.println(r);
    }
}
db.close();

See available test classes for more examples.

Main stack

This module would not be possible without:

Lucene version

4.10.3 (Dec, 2014)

Support, bugs and requests

https://github.com/brutusin/flea-db/issues

Authors

Contributions are always welcome and greatly appreciated!

License

Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0