Delving into OrientDB

OrientDB looks very awesome for so many reasons and it seems like they crammed every possible buzzword and juicy feature a database could (ever?) have.

But like any responsible tech-geek, you must first do your due diligence and investigate as many aspects of the new kid on the block before committing to it and losing countless nights as a slave to missing documentation or hidden “features”.orientdb

In order to fulfill this due diligence, I decided to walk through and try to use the various available PHP-based drivers and see what they can do (out of the full spectrum the database offers) and what their underlying communications protocols allow them to do.

Trying out the driver will, in turn, force me to use the database’s features in various scenarios and the circle is complete.

The good:

Cross-datacenter replication: replication over wide area networks with all that it implies — possible loss of connectivity between data centers, slow response times etc.

Multi-master replication: writing to any of the available nodes and receiving a write-confirmation only after a minim quorum of nodes have confirmed their writes individually. This goes hand in hand with client-side connectivity to multiple nodes – which can also work via multicast (did I mention juicy features and buzzwords?).

Native sharding and clustering: you declare a class (~= table), say customer, you specify that it is sharded into multiple clusters, say customer_usa, customer_eu, customer_asia.

Then, each server can declare which clusters it works with. If a server works with one cluster and another works with another cluster… that’s sharding. If the first server is joined by another server which works with the same cluster… that’s replication.

Combine this with the fact that servers have auto discovery over the network and auto-join groups and get provisioned with the required data after they join – you have dynamic sharding.

Embeddable databases: I mean what good is a database if you can’t also embed it onto a mobile device (amirite?).

Full ACID transactions: missing feature in almost all other document-oriented databases, leaving it to be the last unturned stone and holy grail of the RDBMS — the only reason they are still tolerated.

Schema-full and schema-less documents: you can constrain a document to have a specified structure or just throw data at it.

Object oriented schema: Yes, CLASS, PROPERTY, EXTENDS and INSTANCEOF are keywords in OrientDB, it’s like Christmas all over again (or like that other thing only muuuch much better).

Direct object links: yes, that is correct, that thing kinda like on-disk pointers/references. What more do you want, people?

Bidirectional references: this is actually a whole spectrum of features regarding graph databases of which EDGEs are just the tip of the iceberg.

JSON based data input/output: You can either send just a JSON encoded document to the server with a new record or the changed values from an existing record, or you can send SQL queries with JSON encoded column assignments…

INSERT INTO Computer CONTENT {
    "cpu": "2x Xeon E5-2697",
    "gpu": "7x AMD R9 Nano",
    "motherboard": "Asus Z10PE-D8 WS",
    "memory": "8x Kingston 32GB DDR4 ECC",
    "ssd": "8x Kingston KC400 1TB Business SSD",
    "power": "EVGA SuperNOVA 1600 T2",
   "case": "Caselabs Mercury S8"
}

SQL compatibility: The above insert statement isn’t quite ANSI-SQL, now is it? But don’t worry, you can use ANSI-SQL if you want to. It doesn’t cover quite everything since some things don’t make sense to OrientDB.

INSERTs and UPDATEs are fully ANSI-SQL compatible plus their own additions with various features like REPLACE, UPSERT, MERGE etc… but, for example, JOINs don’t make sense in OrientDB because documents are LINKed and vertices are EDGEd together. Since JOINs don’t exists, you can never have more than one source table in your SELECT query so, there are no table aliases.

Live Queries: This is a server-push notification system to allow you to issue a query now and receive the results only when a new item matches the query conditions. Having to write a website chat client before websockets was hard, now, with websockets + PHP + MySQL, it’s still hard unless you’re thinking about message queues… in which case it’s not.

Graph queries with gremlin: the ”lingua franca” of graphs is gremlin and since OrientDB is also a graph database, it understands gremlin too.

User defined indexes (in Java)… User defined functions (in Java or JavaScript)… User defined sharding distribution algorithms…

They just keep coming at you with all of these features, you can’t even say half of these in one breath… wow…

The bad:

Indexes are never sharded – they are present on all servers. This implies that although your data is distributed over many servers, the indexes for said data are not. This limits speed and scalability – indexes can and will grow bigger than the RAM size of the machine.

Official Doctrine2 ODM driver

  • uses the HTTP API instead of the official PHP driver (which uses a binary protocol over TCP Sockets)
  • isn’t complete, it implements the ObjectManager and can find and hydrate objects/documents from the database, but persistence is missing entirely, only access to raw queries is provided
  • doesn’t have an integration with Symfony2

LINKs don’t have referential integrity  – LINKs normally won’t need referential integrity because you can’t alter/update the internal identifier of the object (the @rid).

But wait! Moving an object from a cluster to another does alter the @rid because the @rid contains the cluster id. So moving objects across clusters breaks referential integrity for LINKs.

Workaround: edges do have referential integrity even when vertices are moved across clusters.

The ugly:

Transactions are not interactive sessions They are batches of commands sent to the server in one communication flagged to be transactional. Both HTTP and Socket drivers (and their underlying protocols) can only buffer commands and send them to the server with a transaction flag (on commit). You can’t begin a transaction, select data, handle it in the client and then send more queries to the server based on said data.

Workaround: use OrientDB scripting in SQL or JavaScript to handle the logic directly on the server.

Prepared statement parameters aren’t uniformly available via all the API implementations. The HTTP API doesn’t allow prepared statement parameters in v2.1.9 and below; in v2.2 single commands allow prepared statement parameters, but not for the query or batch mode.

Workaround: feature-request, pull-request, wait for a new version to fix this.

Cross datacenter replication with write support is experimental. Write consensus/quorum settings when network splits occur can bring your database into incoherent states.

Transactions are limited to RAM size. While this might be a very reasonable limitation, it’s still the kind of thing that might allow you to one day.

The protocols

HTTP RESTful API: uses JSON for data interchange and accepts RESTful operations with documents as well as queries, commands and scripts in SQL, Gremlin and JavaScript.

Socket protocol: uses either a CSV-like serialization or a custom binary serialization and accepts custom operations with the serialized documents as well as queries, commands and scripts in SQL, Gremlin and JavaScript.

The drivers

ostico/phporient: Official PHP driver supported by OrientDB which works over the socket protocol with the CSV-like serialization. It covers a significant amount of the functionality that the protocol provides.

doctrine/orientdb-odm: Official Doctrine ODM driver supported by Doctrine which works over the HTTP RESTful API. Covers a small subset of the functionality of the protocol.

yapro/doctrine2-orientdb-restful-driver: Unofficial Doctrine ORM driver which works over the HTTP RESTful API, but here’s the kicker: it forces this splendid Document-Graph database to behave like a plain old RDBMS because of its support of SQL.

mihai-stancu/orientdb-orm: Unofficial Doctrine ORM driver, a cleaner, more-complete, reimplementation of yapro’s driver.

My driver

Yes, I know, ODM != ORM. Yes, I know, it’s a stretch to begin with.

Since documentation on either of these drivers is scarce, I decided to implement my own driver with a few goals in mind:

  • Reunite both Socket driver and HTTP driver under the same hood – to be able to feature test and compare them easier.
  • See what bumps I’ll hit along the road and which of them I can fix myself within the driver vs. which are protocol level problems.
  • Easily integrate with both Doctrine2 and Symfony2 – in the ODM this requires work, but in the ORM after writing the driver it just works via the ORM abstraction (like any other RDBMSs).
  • Benchmark the resulting driver with either Socket or HTTP protocols.

The result so far, as illustrated in the ROADMAP.md:

  • declare your DBAL connection with OrientDB
  • run raw queries using the DBAL connection (with prepared statements)
  • map your entities using only the classic ORM style annotations
  • use OneToMany or ManyToOne relationships wherever you want LINSET and LINK to be use (respectively)
  • generate and maintain your DB schema using doctrine:schema:create
  • run app/console doctrine:schema:update to keep your schema in sync
  • persist (and flush) an object with association (OneToMany or ManyToOne)
  • find objects with the find, findOne, findBy, findOneBy family of functions and receive a properly hydrated object

Major missing pieces at the moment:

  • Saving a LINKSET currently has a wonky behavior – it merges with the previous value of the LINKSET
  • ManyToMany mappings
  • Mapping EMBEDDED and EMBEDDEDSET objects
  • Using EDGEs instead of LINKs (by configuration)
  • DQL integration – working on it
  • Inheritance mapping integration
FacebookTwitterLinkedInShare
Tags: , ,

Mihai Stancu

I am passionately curious about how things work and most of the time I am also opinionated about how things could and should work.
Becoming a software developer was just a natural side effect.