Byte Rot: Machine Learning and APIs: introducing Mills in REST API Design

Level [C3]

REST (REpresentational State Transfer) was designed with the "state" at its heart, literally, standing for the S in the middle of the acronym.

TL;DR: Mill is a special type of resource where server's authority purely comes from exposing an algorithm, rather than "defining, exposing and maintaining integrity of a state". Unlike an RPC style endpoint, it has to adhere to a set of 5 constraints (see below).

Historically, when there were only a few thousand servers around, the state was predominantly documents. People were creating, editing and sharing a lot of text documents, and some HTML. With HTTP 1.1, caching and concurrency was built into the protocol and enabled it to represent richer distributed computing concerns and we have been building . With the rising popularity of REST over the last 10 years, much of today's web has been built on top of RESTful thinking, whether what is visible or what is behind the presentation (most external layer) servers. Nowadays when we talk of state, we normally mean data or rather records persisted in a data store (relational or non-relational). A lot of today's data, directly or indirectly, is created, updated and deleted using REST APIs. And this is all cool, of course.

When we design APIs, we map the state into REST Resources. It is very intuitive to think of resources as collection and instance. It is unambiguous and useful to communicate these concepts when for example we refer to /books and /books/123 URLs, as the collection or instance resources, respectively. We interact with these resources using verbs, and although HTTP verbs are not meant to be used just for CRUD operations, interacting with the state that exists on the server is inherent in the design.

But that is not all the story. Mainstream adoption of Machine Learning in the industry means we need to expose Machine Learning applications using APIs. The problem is the resource oriented approach of REST (where the state is at the heart of the design) does not work very well.

By the way, I am NOT 51...

How-old.net is an example of a Machine Learning application where instead of being an application, it could have been an API. For example (just for illustration, you could use other media types too):

POST /age_gender_classifier HTTP/1.1
Content-Type: image/jpeg

And the response:

200 OK
Content-Type: application/json

{
    "gender":"M"
    "age":37
}

Server is generating a response to the request by carrying out complex face recognition and running a model, most likely a deep network model. Server is not returning a state stored on the server, in fact this whole process is completely stateless.

And why does this matter? Well I feel if REST is supposed to move forward with our needs and use cases, it should define, clarify, internalise and finally digest edge cases. While such edge cases were pretty uncommon, with the rise and popularity of Machine Learning, such endpoints will be pretty standard.

A few days ago, on the second day of APIdays Mediterranea 2015 conference, I presented a talk on Machine Learning and APIs. And in this talk I presented simple concept of Mills. Mills, where you take your wheat to be ground and you carry back the flour.

Basically, it all goes back to the origin of a server's authority. To bring an example, a "Customer Profile" service, exposed by a REST API, is the authority to go to when another service requires access to a customer's profile. The "Customer Profile" service has defined a state, which is profile of the customers, and is responsible for ensuring integrity of the state (enforcing business rules on the state). For example, marketing email preference can have values of None, WeeklyDigest or All, it should not allow the value to be set to MonthlyDigest. We are quite used to these type of services and building REST APIs on top: CustomerProfile becomes a resource that we can query or interact with.

On the other hand, a server's authority could be exposing an algorithm. For example, tokenisation of text is a non-trivial problem that requires not only breaking the text to its words, but also maintaining muti-words and named entities intact. A REST API that exposes this functionality will be a mill.

5 constraints of a Mill

1) It encapsulates an algorithm not a state

Which was discussed ad nauseum, however, the distinction is very important. For example let's say we have an algorithm that you provide the postcode and it returns to you houses within 1 mile radius of that postcode - this is not an example of a mill.

2) Raw data in, processed data out

For example you send your text and get back the translation.

3) Calls are both safe and idempotent

Calling the endpoint should not directly change any state within the server. For example, the endpoint should not be directly mapped to the ML training engine, e.g. sending a text 1000 times skew the trained model for that text. The training endpoint is usually a normal resource, not a mill - see my slides

4) It has a single specialty

And as such, it accepts a single HTTP verb apart from OPTIONS, normally POST (although a GET with entity payload would be more semantically correct but not my preferred option for practical reasons).

5) It is named not as a verb but as a tool

A mill that exposes tokenisation, is to be called tokeniser. In a similar way, classifier would be the appropriate name for a system that classifies on top of a neural network, for example. Or normalising text, would have a normaliser mill.

No this is not the same as an RPC endpoint. No RPC smell. Honest :) That is why those 5 constraints exists.

Byte Rot

Sunday, 10 May 2015

Machine Learning and APIs: introducing Mills in REST API Design