This repository contains the dataset and code for Cloudscape: A Study of Storage Services in Modern Cloud Architectures, presented at FAST25. If you use this dataset, you can cite it using: todo
.
This repository has 3 key parts:
-
Dataset: The dataset describes 396 cloud architectures. Each cloud architecture is a GraphML file in
data/graphs/
. The services used across these architectures are described indata/services.csv
. The schema section describes how to interpret the GraphML and csv files. -
Analysis Jupyter Notebook: We provide a Jupyter notebook that replicates the analysis in the paper. The notebook does not always produce the published version of the graphs. Instead, it favours demonstrating how the query is performed, and aims to miminize the presentation code. Our goal is to demonstrate how to traverse the dataset to future researchers. Refer to the Analysis section on how to setup the environment.
-
Web-based Architecture Explorer: For easier access to the dataset, we provide a web platform, hosted at cloudscape.cs.wisc.edu. This repository also contains the source code for this platform. The Explorer section describes its setup if you wish to modify it.
Each GraphML file describes one architecture. The architecture is encoded as a MultiDiGraph
, which means that the graph contains directed edges and potentially contains multiple edges between 2 nodes. There are many readers for GraphML files; we use the Python networkx
package.
Each graph has these attributes:
Attribute Name | Optional | Description | Example Value |
---|---|---|---|
name |
The title of the YouTube video that describes this architecture. | Snap: Journey of a Snap on Snapchat Using AWS | |
link |
The link to the source YouTube video. | https://www.youtube.com/watch?v=Cgv0kfp_6xQ | |
categories |
Comma separated list of functional goals. Possible goals:(data_ingestion, interactive, compute_intensive, control, other) | data_ingestion,compute_intesive | |
graph_usable |
Can the encoded graph be used for any quantitative graph analysis? The paper only analyzes architectures marked True. | True | |
notes |
yes | Distilled context from the video that captures the requirements of the architectures and how services are used. | 13TB of peak data to S3 daily. Use of DynamoDB to store metadata has increased throughput. |
Each node has these attributes:
Attribute name | Optional | Description | Example Value |
---|---|---|---|
service |
The name of the AWS service, or class of other service. Used in services.csv . |
Lambda or UserMobile | |
name |
yes | (ThirdParty nodes only). Name of the service. | nginx or legacy API |
notes |
yes | Additional context about how this service was used. | 6000 calls/s, 200k records/validation |
There are five types of nodes, AWS nodes, User nodes, Partner nodes, ThirdParty nodes, OnPremDC nodes.
AWS nodes represent AWS services and User nodes indicate an end user, and sometimes the device (for example, web app or Alexa). ThirdParty nodes are
nodes that don't fit in either, but are often mentioned in the video without much context (for example, 'internal company API'). Partner nodes are reserved for AWS partner companies that provide their software via the AWS marketplace (for example, SAP or Couchbase). OnPremDC nodes denote a company's on-premisis datacenter. Effectively, ThirdParty describes one off services that are only needed to explain one architecture, and thus each ThirdParty node is described by its name
.
Attribute name | Optional | Description | Example Value |
---|---|---|---|
flow_id |
Whole number that indicates which workflow this edge is a part of. All edges with the same flow_id are part of the same workflow. |
1 | |
seq |
Orders edges inside a workflow. When events are not strictly synchronous, branches are denoted by apostrophe('). events can have order within branches. Visit the explorer for examples. | 3'.2 | |
type |
data/meta. data edges convey movement of data, meta edges indicate request trigger or acknowledgement.response | data | |
notes |
yes | Additional context on how these 2 (connected) nodes communicate. | periodic |
Videos often describe a set of interactions that collectively result in a logical unit of work. We capture such interactions as workflows. A workflow is encoded using edge.flow_id
. All edges with the same flow_id
, and their corresponding nodes, describe the set of interactions for that workflow.
The data/services.csv
file describes all the different types of nodes used across the dataset.
Field | Description |
---|---|
name |
The name of the AWS service, or node type. As described in the node section. |
capability |
AWS services will have one of (storage , compute , networking , integration , control , other ). Ref to paper for description. For User, Partner, ThirdParty and OnPremDC nodes, it will contain the relevant (User , Partner , ThirdParty , OnPremDC ). |
is_aws |
True for AWS services, False otherwise. |
schema |
Only for AWS storage services. One of (fs , sql , nosql , specialized , object ). |
aws_product_categories |
Only for AWS services. Comma separated list of categories. |
image_url |
for explorer purposes only. refers to svg file in static/icons . |
To run the notebook locally, clone the repository and install the dependencies using pip install -r requirements-analysis.txt
. You should then be able to open the notebook with jupyter-notebook
. This was developed using Python 3.11
.
To run the explorer locally, clone the repository and install the dependencies using pip install -r requirements-explorer.txt
. You can then deploy the platform with streamlit run explorer_st.py
. This was developed using Python 3.11
.