Service API design guide

This document is intended as a short instructional design guide in building a service management API. It is certainly intended for someone who wishes to use mgmt resources and functions to interact with their facilities, however it may be of more general use as well. Hopefully this will help you make smarter design considerations early on, and prevent some amount of unnecessary technical debt.

Main aspects

What follows are some of the most common considerations which you may wish to take into account when building your service. This list is non-exhaustive. Of particular note, as of the writing of this document, many of these designs are not taken into account or not well-handled or implemented by the major API ("cloud") providers.

Authentication

The status-quo

Many services naturally require you to authenticate yourself. Usually the initial user who sets up the account and provides credit card details will need to download secret credentials in order to access the service. The onus is on the user to keep those credentials private, and to prevent leaking them. It is convenient (and insecure) to store them in git repositories containing scripts and configuration management code. Since it's likely you will use multiple different services, it also means you will have a ton of different credentials to guard.

An alternative

Instead, build your service to accept a public key that you store in the users account. Only consumers that can correctly sign messages matching this public key should be authorized. This mechanism is well-understood by anyone who has ever uploaded their public SSH key to a server. You can use SSH keys, GPG keys, or even get into Kerberos if that's appropriate. Best of all, if you and other services use a standardized mechanism like GPG, a user might only need to keep track of their single key-pair, even when they're using multiple services!

Events

The problem

People have been building "CRUD" and "REST"ful API's for years. The biggest missing part that most of them don't provide is events. If users want to know when a resource changes, they have to repeatedly poll the server, which is both network intensive, and introduces latency. When services were simpler, this wasn't as much of a consideration, but these days it matters. An embarrassingly small number of major software vendors implement these correctly, if at all.

Why events?

The mgmt tool is different from most other static tools in that it allows reading streams of incoming data, and stream of change events from resources we are managing. If an event API is not available, we can still poll, but this is not as desirable. An event-capable API doesn't prevent polling if that's preferred, you can always repeat a read request periodically.

Variants

The two common mechanisms for receiving events are "callbacks" and "long-polling". In the former, the service contacts the consumer when something happens. In the latter, the consumer opens a connection, and the service either closes the connection or sends the reply, when it's ready. Long-polling is often preferred since it doesn't require an open firewall on the consumers side. Callbacks are preferred because it's often cheaper for the service to implement that. It's also less reliable since it's hard to know if the callback message wasn't received because it was dropped, or if there just wasn't an event. And it requires static timeouts when retrying a callback message, and so on. It's best to implement long-polling or something equivalent at a minimum.

"Since" requests

When making an event request, some API's will let you tack on a "since" style parameter that tells the endpoint that we're interested in all of the events since a particular timestamp, or since a particular sequence ID. This can be very useful if missing an intermediate event is a concern. Implement this if you can, but it's better for all concerned if purely declarative facilities are all that is required. It also forces the endpoint to maintain some state, which may be undesirable for them.

Out of band

Some providers have the event system tacked on to a separate facility. If it's not part of the core API, then it's not useful. You shouldn't have to configure a separate system in order to start getting events.

Batching

With so many resources, you might expect to have 1000's of long-polling connections all sitting open and idle. That can't be efficient! It's not, which is why good API's need a batching facility. This lets the consumer group together many watches (all waiting on a long-poll) inside of a single call. That way, a single connection might only be needed for a large amount of information.

Don't auto-generate junk

Please build an elegant API. Many services auto-generate a "phone book" SDK of junk. It might seem inevitable, so if you absolutely need to do this, then put some extra effort into making it idiomatic. If I'm using an SDK generated for golang and I see an internal foo.String wrapper, then chances are you have designed your API and code to be easier to maintain for you, instead of prioritizing your customers. Surely the total volume of all customer code is more than your own, so why optimize for that instead of the putting the customer first?

Resources and functions

Mgmt has a concept of "resources" and "functions". Resources are used in an idempotent model to express desired state and perform that work, and "functions" are used to receive and pull data into the system. That separation has shown to be an elegant one. Consider it when designing your API's. For example, if some vital information can only be obtained after performing a modifying operation, then it might signal that you're missing some sort of a lookup or event-log system. Design your API's to be idempotent, this solves many distributed-system problems involving receiving duplicate messages, and so on.

Using mgmt as a library

Instead of building a new service from scratch, and re-inventing the typical management and CLI layer, consider using mgmt as a library, and directly benefiting from that work. This has not been done for a large production service, but the author believes it would be quite efficient, particularly if your application is written in golang. It's equivalently easy to do it for other languages as well, you just end up with two binaries instead of one. (Or you can embed the other binary into the new golang management tool.)

Cloud API considerations

Many "cloud" companies have a lot of technical debt and a lot of customers. As a result, it might be very hard for them to improve their API's, particularly without breaking compatibility promises for their existing customers. As a result, they should either add a versioned API, which lets newer consumers get the benefit, or add new parallel services which offer the modern features. If they don't, the only solution is for new competitors to build-in these better efficiencies, eventually offering better value to cost ratios, which will then make legacy products less lucrative and therefore unmaintainable as compared to their competitors.

Suggestions

If you have any ideas for suggestions or other improvements to this guide, please let us know! I hope this was helpful. Please reach out if you are building an API that you might like to have mgmt consume!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

service-guide.md

service-guide.md

Service API design guide

Main aspects

Authentication

The status-quo

An alternative

Events

The problem

Why events?

Variants

"Since" requests

Out of band

Batching

Don't auto-generate junk

Resources and functions

Using mgmt as a library

Cloud API considerations

Suggestions

Files

service-guide.md

Latest commit

History

service-guide.md

File metadata and controls

Service API design guide

Main aspects

Authentication

The status-quo

An alternative

Events

The problem

Why events?

Variants

"Since" requests

Out of band

Batching

Don't auto-generate junk

Resources and functions

Using mgmt as a library

Cloud API considerations

Suggestions