This document is intended as a short design guide for building a service
management API. It is primarily intended for someone who wishes to use mgmt
resources and functions to interact with their facilities, but it may be of
more general use as well. Hopefully this will help you make smarter design
decisions early on, and prevent some amount of unnecessary technical debt.
What follows are some of the most common considerations which you may wish to take into account when building your service. This list is non-exhaustive. Of particular note, as of the writing of this document, many of these designs are ignored, poorly handled, or not implemented at all by the major API ("cloud") providers.
Many services naturally require you to authenticate yourself. Usually the
initial user who sets up the account and provides credit card details will need
to download secret credentials in order to access the service. The onus is on
the user to keep those credentials private, and to prevent leaking them. It is
convenient (and insecure) to store them in git repositories containing scripts
and configuration management code. Since it's likely you will use multiple
different services, it also means you will have a ton of different credentials
to guard.
Instead, build your service to accept a public key that you store in the user's account. Only consumers whose message signatures verify against this public key should be authorized. This mechanism is well-understood by anyone who has ever uploaded their public SSH key to a server. You can use SSH keys, GPG keys, or even get into Kerberos if that's appropriate. Best of all, if you and other services use a standardized mechanism like GPG, a user might only need to keep track of their single key-pair, even when they're using multiple services!
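
To make the idea concrete, here is a minimal sketch of the server-side check. It assumes Ed25519 keys from the Go standard library rather than SSH or GPG keys, and the `Account` type, the `Authorize` method, and the request strings are hypothetical names chosen for illustration. A real service would also want to sign a nonce or timestamp to prevent replayed requests.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// Account is a hypothetical server-side record that stores only the
// public key the user uploaded. No shared secret is kept anywhere.
type Account struct {
	Name      string
	PublicKey ed25519.PublicKey
}

// Authorize reports whether sig is a valid signature of msg under the
// public key stored in the account.
func (a *Account) Authorize(msg, sig []byte) bool {
	// ed25519.Verify panics on a wrongly sized key, so check it first.
	if len(a.PublicKey) != ed25519.PublicKeySize {
		return false
	}
	return ed25519.Verify(a.PublicKey, msg, sig)
}

// demo signs one request and verifies it, then tries to reuse the same
// signature for a different (tampered) request.
func demo() (valid, tampered bool) {
	// The consumer generates the key-pair and keeps the private half.
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	acct := &Account{Name: "alice", PublicKey: pub}

	msg := []byte("GET /v1/servers")
	sig := ed25519.Sign(priv, msg)

	valid = acct.Authorize(msg, sig)
	tampered = acct.Authorize([]byte("DELETE /v1/servers"), sig)
	return valid, tampered
}

func main() {
	valid, tampered := demo()
	fmt.Println(valid, tampered) // prints "true false"
}
```

The service never sees the private key, so there is nothing secret on the server for the user to download, copy around, or accidentally commit.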
People have been building "CRUD" and "REST"ful APIs for years. The biggest missing part that most of them don't provide is events. If users want to know when a resource changes, they have to repeatedly poll the server, which is both network-intensive and adds latency. When services were simpler, this wasn't as much of a consideration, but these days it matters. An embarrassingly small number of major software vendors implement events correctly, if at all.
The mgmt tool is different from most other static tools in that it allows
reading streams of incoming data, and streams of change events from the
resources we are managing. If an event API is not available, we can still poll,
but this is not as desirable. An event-capable API doesn't prevent polling if
that's preferred: you can always repeat a read request periodically.
The two common mechanisms for receiving events are "callbacks" and "long-polling". In the former, the service contacts the consumer when something happens. In the latter, the consumer opens a connection, and the service either sends the reply when it's ready or closes the connection. Long-polling is often preferred since it doesn't require an open firewall on the consumer's side. Service providers sometimes prefer callbacks because they're often cheaper for the service to implement. However, callbacks are less reliable, since it's hard to know whether a callback message wasn't received because it was dropped or because there simply wasn't an event. They also require static timeouts when retrying a callback message, and so on. It's best to implement long-polling or something equivalent at a minimum.
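
As a rough illustration of the long-polling side, the sketch below holds an HTTP request open until a resource changes or a timeout expires. The `Resource` type, the `WatchHandler` method, and the use of a 204 status for "no event yet" are assumptions made for this example, not a prescribed protocol.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// Resource is a hypothetical resource that consumers want to watch.
type Resource struct {
	mu      sync.Mutex
	value   string
	changed chan struct{} // closed and replaced on every change
}

func NewResource(v string) *Resource {
	return &Resource{value: v, changed: make(chan struct{})}
}

// Set updates the value and wakes every waiting long-poll request.
func (r *Resource) Set(v string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.value = v
	close(r.changed)
	r.changed = make(chan struct{})
}

func (r *Resource) snapshot() (string, <-chan struct{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.value, r.changed
}

// WatchHandler keeps the request open until the resource changes, the
// timeout expires, or the client goes away.
func (r *Resource) WatchHandler(timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		_, ch := r.snapshot()
		select {
		case <-ch:
			v, _ := r.snapshot()
			fmt.Fprint(w, v)
		case <-time.After(timeout):
			w.WriteHeader(http.StatusNoContent) // no event; consumer re-polls
		case <-req.Context().Done():
			// client disconnected; nothing to send
		}
	}
}

// demo long-polls a local test server while another goroutine keeps
// changing the resource, so the result doesn't depend on exact timing.
func demo() (string, error) {
	res := NewResource("old")
	srv := httptest.NewServer(res.WatchHandler(5 * time.Second))
	defer srv.Close()

	done := make(chan struct{})
	defer close(done)
	go func() {
		for {
			select {
			case <-done:
				return
			case <-time.After(20 * time.Millisecond):
				res.Set("new")
			}
		}
	}()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	v, err := demo()
	fmt.Println(v, err)
}
```

Note that the consumer only ever makes outbound requests, which is why this approach works without opening a port on the consumer's firewall.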
When making an event request, some APIs will let you tack on a "since" style parameter that tells the endpoint that you're interested in all of the events since a particular timestamp, or since a particular sequence ID. This can be very useful if missing an intermediate event is a concern. Implement this if you can, but it's better for all concerned if purely declarative facilities are all that is required. A "since" parameter also forces the endpoint to maintain some state, which may be undesirable for the provider.
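
One possible shape for the sequence ID variant is sketched below. The `Log` and `Event` types are hypothetical, the log is not safe for concurrent use, and the bounded window is just one way of limiting the state the endpoint has to keep.

```go
package main

import "fmt"

// Event is a hypothetical change record with an increasing sequence ID.
type Event struct {
	Seq  uint64
	Data string
}

// Log retains only the most recent max events. This is the state that
// a "since" parameter forces the endpoint to maintain.
type Log struct {
	events []Event
	next   uint64
	max    int
}

func NewLog(max int) *Log { return &Log{next: 1, max: max} }

// Append records a new event and returns its sequence ID.
func (l *Log) Append(data string) uint64 {
	seq := l.next
	l.next++
	l.events = append(l.events, Event{Seq: seq, Data: data})
	if len(l.events) > l.max {
		l.events = l.events[len(l.events)-l.max:] // drop the oldest
	}
	return seq
}

// Since returns every retained event with a sequence ID greater than
// since. ok is false when events the consumer still needs have already
// been discarded, in which case the consumer must do a full re-read.
func (l *Log) Since(since uint64) (events []Event, ok bool) {
	if len(l.events) > 0 && since+1 < l.events[0].Seq {
		return nil, false
	}
	for _, e := range l.events {
		if e.Seq > since {
			events = append(events, e)
		}
	}
	return events, true
}

func main() {
	l := NewLog(3)
	for _, d := range []string{"a", "b", "c", "d", "e"} {
		l.Append(d) // sequence IDs 1 through 5; only 3, 4, 5 are retained
	}
	evs, ok := l.Since(3)
	fmt.Println(len(evs), ok) // prints "2 true"
	_, ok = l.Since(1)
	fmt.Println(ok) // prints "false": event 2 is gone
}
```

The `ok == false` case is the important one: when the window has moved on, the consumer falls back to reading the current state, which is why a purely declarative read is still needed.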
Some providers have the event system tacked on to a separate facility. If it's not part of the core API, then it's not useful. You shouldn't have to configure a separate system in order to start getting events.
With so many resources, you might expect to have thousands of long-polling connections all sitting open and idle. That can't be efficient! It's not, which is why good APIs need a batching facility. This lets the consumer group together many watches (all waiting on a long-poll) inside of a single call. That way, a single connection might be all that's needed for a large amount of information.
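
A batched watch could look something like the following sketch, where one blocking call covers many resources at once. The `Hub` type, the per-resource version counters, and the resource names are all assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Hub tracks a version number per resource and wakes watchers on change.
type Hub struct {
	mu       sync.Mutex
	versions map[string]uint64
	notify   chan struct{} // closed and replaced on every change
}

func NewHub() *Hub {
	return &Hub{versions: make(map[string]uint64), notify: make(chan struct{})}
}

// Touch records a change to the named resource.
func (h *Hub) Touch(name string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.versions[name]++
	close(h.notify)
	h.notify = make(chan struct{})
}

// WatchBatch blocks until at least one watched resource has a version
// different from the one the consumer last saw. It returns the current
// versions of every watched resource that changed.
func (h *Hub) WatchBatch(ctx context.Context, seen map[string]uint64) (map[string]uint64, error) {
	for {
		h.mu.Lock()
		changed := make(map[string]uint64)
		for name, v := range seen {
			if cur := h.versions[name]; cur != v {
				changed[name] = cur
			}
		}
		ch := h.notify
		h.mu.Unlock()

		if len(changed) > 0 {
			return changed, nil
		}
		select {
		case <-ch: // something changed; re-check our watched set
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}

// demo watches three resources with one call while two changes happen,
// only one of which is to a watched resource.
func demo() (map[string]uint64, error) {
	h := NewHub()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	go func() {
		h.Touch("unwatched")
		h.Touch("vm2")
	}()
	return h.WatchBatch(ctx, map[string]uint64{"vm1": 0, "vm2": 0, "disk1": 0})
}

func main() {
	changed, err := demo()
	fmt.Println(changed, err) // prints "map[vm2:1] <nil>"
}
```

Because the consumer sends the versions it has already seen, a change that happens just before the call is still reported rather than missed.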
Please build an elegant API. Many services auto-generate a "phone book" SDK of
junk. It might seem inevitable, so if you absolutely need to do this, then put
some extra effort into making it idiomatic. If I'm using an SDK generated for
golang and I see an internal foo.String wrapper, then chances are you have
designed your API and code to be easier to maintain for you, instead of
prioritizing your customers. Surely the total volume of all customer code is
more than your own, so why optimize for your own code instead of putting the
customer first?
Mgmt has a concept of "resources" and "functions". Resources are used in an
idempotent model to express desired state and perform that work, and "functions"
are used to receive and pull data into the system. That separation has proven to
be an elegant one. Consider it when designing your APIs. For example, if some
vital information can only be obtained after performing a modifying operation,
then it might signal that you're missing some sort of a lookup or event-log
system. Design your APIs to be idempotent. This solves many distributed-system
problems involving receiving duplicate messages, and so on.
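
Here is a small sketch of what that separation can look like in practice: one idempotent "make it so" operation and one read-only lookup. The `Store` type, its `Ensure` and `Lookup` methods, and the firewall-rule example are hypothetical and are not part of mgmt.

```go
package main

import "fmt"

// Store is a hypothetical in-memory service that manages named rules.
type Store struct {
	rules map[string]string // name -> desired spec
}

func NewStore() *Store { return &Store{rules: make(map[string]string)} }

// Ensure makes the named rule match spec. Sending the same request
// twice (for example, a duplicated message) leaves the state unchanged.
// It returns true only when something actually changed.
func (s *Store) Ensure(name, spec string) bool {
	if cur, ok := s.rules[name]; ok && cur == spec {
		return false
	}
	s.rules[name] = spec
	return true
}

// Lookup is a pure read. Nothing has to be modified to learn the
// current state of a rule.
func (s *Store) Lookup(name string) (string, bool) {
	spec, ok := s.rules[name]
	return spec, ok
}

func main() {
	s := NewStore()
	fmt.Println(s.Ensure("ssh", "allow tcp/22")) // true: rule created
	fmt.Println(s.Ensure("ssh", "allow tcp/22")) // false: duplicate is a no-op
	spec, ok := s.Lookup("ssh")
	fmt.Println(spec, ok) // allow tcp/22 true
}
```

Contrast this with an "add rule" call that appends a new entry every time: a retried or duplicated request would silently create a second rule.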
Instead of building a new service from scratch, and re-inventing the typical
management and CLI layer, consider using mgmt as a library, and directly
benefiting from that work. This has not been done for a large production
service, but the author believes it would be quite efficient, particularly if
your application is written in golang. It's equally easy to do it for other
languages as well; you just end up with two binaries instead of one. (Or you can
embed the other binary into the new golang management tool.)
Many "cloud" companies have a lot of technical debt and a lot of customers. As a result, it might be very hard for them to improve their APIs, particularly without breaking compatibility promises for their existing customers. They should therefore either add a versioned API, which lets newer consumers get the benefit, or add new parallel services which offer the modern features. If they don't, the only solution is for new competitors to build in these better efficiencies, eventually offering better value-to-cost ratios, which will then make legacy products less lucrative and therefore harder to maintain than their competitors' offerings.
If you have any suggestions or other improvements to this guide, please let us
know! I hope this was helpful. Please reach out if you are building an API that
you might like to have mgmt consume!