We have refined some architecture graphs since the paper submission, so the current analysis differs slightly from the submitted version. The trends hold across versions, and the differences do not affect the claims and discussions in our paper.
This repository contains the dataset and code for *Cloudscape: A Study of Storage Services in Modern Cloud Architectures*, presented at FAST25. If you use this dataset, please cite it using: todo.
This repository has three key parts:

- Dataset: The dataset describes 396 cloud architectures. Each cloud architecture is a GraphML file in `data/architectures/`. The services used across these architectures are described in `data/services.csv`. The schema section describes how to interpret the GraphML and CSV files.
- Analysis Jupyter Notebook: We provide a Jupyter notebook that replicates the analysis in the paper. The notebook does not always produce the published version of the graphs; instead, it favours demonstrating how each query is performed and aims to minimize the presentation code. Our goal is to show future researchers how to traverse the dataset. Refer to the Analysis section for how to set up the environment.
- Web-based Architecture Explorer: For easier access to the dataset, we provide a web platform, hosted at cloudscape.cs.wisc.edu. This repository also contains the source code for this platform. The Explorer section describes its setup if you wish to modify it.
Each GraphML file describes one architecture. The architecture is encoded as a `MultiDiGraph`, which means the graph contains directed edges and may contain multiple edges between two nodes. There are many readers for GraphML files; we use the Python `networkx` package.
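As a quick orientation, here is a minimal sketch of loading one architecture with `networkx`; the file name is a placeholder, so substitute any file from `data/architectures/`.

```python
import networkx as nx

# Placeholder file name; substitute any GraphML file from data/architectures/.
path = "data/architectures/example.graphml"

# force_multigraph=True guarantees a MultiDiGraph even when a particular
# architecture happens to contain no parallel edges.
G = nx.read_graphml(path, force_multigraph=True)

print(type(G).__name__)  # MultiDiGraph
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```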
Each graph has these attributes:

| Attribute Name | Optional | Description | Example Value |
|---|---|---|---|
| `name` | | The title of the YouTube video that describes this architecture. | Snap: Journey of a Snap on Snapchat Using AWS |
| `link` | | The link to the source YouTube video. | https://www.youtube.com/watch?v=Cgv0kfp_6xQ |
| `categories` | | Comma-separated list of functional goals. Possible goals: (data_ingestion, interactive, compute_intensive, control, other) | data_ingestion,compute_intensive |
| `graph_usable` | | Can the encoded graph be used for any quantitative graph analysis? The paper only analyzes architectures marked True. | True |
| `notes` | yes | Distilled context from the video that captures the requirements of the architecture and how services are used. | 13TB of peak data to S3 daily. Use of DynamoDB to store metadata has increased throughput. |
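Graph-level attributes are exposed through `G.graph` after loading. The sketch below reuses the placeholder path from above; note that `graph_usable` may arrive as a bool or a string depending on how the attribute is typed in the file, so the check hedges for both.

```python
import networkx as nx

G = nx.read_graphml("data/architectures/example.graphml", force_multigraph=True)  # placeholder path

print(G.graph["name"])                     # title of the source video
print(G.graph["link"])                     # YouTube link
print(G.graph["categories"].split(","))    # functional goals
print(G.graph.get("notes", "(no notes)"))  # optional attribute

# Only architectures marked usable are analyzed in the paper.
if G.graph["graph_usable"] in (True, "True"):
    print("usable for quantitative analysis")
```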
Each node has these attributes:

| Attribute Name | Optional | Description | Example Value |
|---|---|---|---|
| `service` | | The name of the AWS service, or class of other service. Used in `services.csv`. | Lambda or UserMobile |
| `name` | yes | (ThirdParty nodes only.) Name of the service. | nginx or legacy API |
| `notes` | yes | Additional context about how this service was used. | 6000 calls/s, 200k records/validation |
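A sketch of iterating over node attributes, again with a placeholder path: `service` is always present, while `name` and `notes` are optional and therefore read with `.get()`.

```python
import networkx as nx

G = nx.read_graphml("data/architectures/example.graphml", force_multigraph=True)  # placeholder path

for node_id, attrs in G.nodes(data=True):
    service = attrs["service"]                 # e.g. "Lambda" or "UserMobile"
    third_party_name = attrs.get("name", "")   # only present on ThirdParty nodes
    notes = attrs.get("notes", "")
    print(node_id, service, third_party_name, notes)
```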
There are five types of nodes: AWS nodes, User nodes, Partner nodes, ThirdParty nodes, and OnPremDC nodes. AWS nodes represent AWS services, and User nodes indicate an end user and sometimes the device (for example, a web app or Alexa). ThirdParty nodes are nodes that don't fit the other categories and are often mentioned in the video without much context (for example, 'internal company API'). Partner nodes are reserved for AWS partner companies that provide their software via the AWS Marketplace (for example, SAP or Couchbase). OnPremDC nodes denote a company's on-premises datacenter. Effectively, ThirdParty describes one-off services that are only needed to explain one architecture, and thus each ThirdParty node is described by its `name`.
Each edge has these attributes:

| Attribute Name | Optional | Description | Example Value |
|---|---|---|---|
| `flow_id` | | Whole number that indicates which workflow this edge is part of. All edges with the same `flow_id` are part of the same workflow. | 1 |
| `seq` | | Orders edges inside a workflow. When events are not strictly synchronous, branches are denoted by an apostrophe ('). Events can have an order within branches. Visit the explorer for examples. | 3'.2 |
| `type` | | data/meta. `data` edges convey movement of data; `meta` edges indicate a request trigger or acknowledgement/response. | data |
| `notes` | yes | Additional context on how the two connected nodes communicate. | periodic |
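Edges can be traversed the same way; because the graph is a MultiDiGraph, asking for `keys=True` distinguishes parallel edges between the same pair of nodes. The path is once more a placeholder.

```python
import networkx as nx

G = nx.read_graphml("data/architectures/example.graphml", force_multigraph=True)  # placeholder path

for src, dst, key, attrs in G.edges(keys=True, data=True):
    flow = attrs["flow_id"]
    seq = attrs["seq"]
    kind = attrs["type"]  # "data" or "meta"
    print(f"flow {flow}, step {seq}: {src} -> {dst} ({kind})")
```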
Videos often describe a set of interactions that collectively result in a logical unit of work. We capture such interactions as workflows. A workflow is encoded using the edge `flow_id`. All edges with the same `flow_id`, and their corresponding nodes, describe the set of interactions for that workflow.
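For example, one way to pull out a single workflow is to filter edges by `flow_id` and take the induced edge subgraph. This is only a sketch under the assumptions above; the comparison goes through `str()` because the attribute may be read back as an int or a string.

```python
import networkx as nx

G = nx.read_graphml("data/architectures/example.graphml", force_multigraph=True)  # placeholder path

target_flow = "1"  # pick any flow_id present in this architecture
workflow_edges = [
    (u, v, k)
    for u, v, k, attrs in G.edges(keys=True, data=True)
    if str(attrs.get("flow_id")) == target_flow
]

# Subgraph containing only that workflow's edges and their endpoints.
workflow = G.edge_subgraph(workflow_edges)
print(sorted(workflow.nodes()))
```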
The `data/services.csv` file describes all the different types of nodes used across the dataset.

| Field | Description |
|---|---|
| `name` | The name of the AWS service, or node type, as described in the node section. |
| `capability` | AWS services have one of (`storage`, `compute`, `networking`, `integration`, `control`, `other`); refer to the paper for descriptions. For User, Partner, ThirdParty, and OnPremDC nodes, it contains the relevant type (`User`, `Partner`, `ThirdParty`, `OnPremDC`). |
| `is_aws` | True for AWS services, False otherwise. |
| `schema` | Only for AWS storage services. One of (`fs`, `sql`, `nosql`, `specialized`, `object`). |
| `aws_product_categories` | Only for AWS services. Comma-separated list of categories. |
| `image_url` | For explorer purposes only. Refers to an SVG file in `static/icons`. |
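To combine an architecture with this service metadata, the `service` node attribute can be used as a key into `data/services.csv`. The sketch below (standard-library `csv`, placeholder architecture path) lists the AWS storage services used by one architecture; the column names follow the schema above, and CSV values such as `is_aws` arrive as strings.

```python
import csv
import networkx as nx

# Service metadata keyed by service name.
with open("data/services.csv", newline="") as f:
    services = {row["name"]: row for row in csv.DictReader(f)}

G = nx.read_graphml("data/architectures/example.graphml", force_multigraph=True)  # placeholder path

# Nodes whose service is classified as AWS storage.
storage_nodes = [
    node_id
    for node_id, attrs in G.nodes(data=True)
    if services.get(attrs["service"], {}).get("capability") == "storage"
]
print(storage_nodes)
```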
To run the notebook locally, clone the repository and install the dependencies using `pip install -r requirements-analysis.txt`. You should then be able to open the notebook with `jupyter-notebook`. This was developed using Python 3.11.
To run the explorer locally, clone the repository and install the dependencies using `pip install -r requirements-explorer.txt`. You can then deploy the platform with `streamlit run explorer_st.py`. This was developed using Python 3.11.