---
title: "Proposal for JSphere: Classification of The Use of JavaScript on The Web"
subtitle: "Project A for CSci 651 by John Heidemann"
author:
    - name: Steven Hé (Sīchàng)
    - name: "Mentor: Harsha V. Madhyastha"
format:
    html:
        html-math-method: katex
    pdf:
        pdf-engine: latexmk
        papersize: a4
        margin-left: 1in
        margin-right: 1in
        margin-top: 1in
        margin-bottom: 1in
        indent: 2m
bibliography: main.bib
csl: acm-sig-proceedings-long-author-list.csl
---

JavaScript (JS) plays a major role in content delivery on the modern Internet.
For the median webpage, reports show that JS code amounts to
around a quarter of the bytes transferred [@httparchive2024web;
@httparchive2024javascript] and JS execution time accounts for about half of
the compute delay [@goel2024sprinter].
In this sense, in most end users' Internet browsing experiences,
JS is central for steering the traffic and displaying information.

<!--
1. How does your research "fit in"? (The competition, from NABC.)
    1. why is it interesting research (to the field)?
    2. why is it interesting for you (what will you learn)?
    3. what is relationship to other work (briefly, a few sentences or so,
        ideally with references to prior work or other projects)
-->
However, it remains understudied why this huge amount of JS code is used.
While observations from the industry have argued that most of
the JS is unnecessary (e.g., [@ceglowski2015website]),
research efforts have instead focused on its security vulnerabilities,
privacy invasions, or performance (e.g., [@snyder2016browser;
@iqbal2021fingerprinting; @mardani2021horcrux]).
Despite these efforts, a more fundamental question remains unanswered:
why is so much JS used, and how effectively does it serve its purposes?

<!-- 1. What need will your research address? -->
Better understanding of the use of JS will broadly benefit developers and
users.
Although JS was originally designed for beginner programmers to
script interactive user experience (UX) [@paolini1994netscape],
it has been adopted for a plethora of systems tasks, which
may explain its high resource usage.
For example, as the popularity of Single-Page Applications (SPAs)
[@oh2013automated] rose, many websites that could otherwise be served as
static HTML instead ship entire SPA frameworks, fetch JSON data from
the server, and convert them into DOM elements.
If we can identify common uses of JS, we may guide developers interested in
optimizing their websites, improve general online browsing experience, and
inform future utilization of the Internet for serving Web content.

<!-- 1. What research you plan to do (briefly, a paragraph or so).
    (What is approach, from NABC.) -->
We propose to classify the functionalities of JS code on top websites on
the public Web. Specifically, we aim to classify each JS file in
the top 1000 websites, without any logins, into one or more of
the following categories, which we call *spheres*:

- **frontend processing**, e.g., user interaction, form validation, and
    other logic;
- **DOM element generation**, including applying styles;
- **UX enhancement**, e.g., animations and effects;
- **extensional features**, e.g., authentication and analytics; or
- **silent**: code that does not call any JS APIs, likely never executed.

To infer these spheres, we will analyze JS browser API calls when
interacting with these websites.
We plan to follow prior practices [@snyder2016browser] when interacting with
each website: we will randomly interact with ("monkey test") its root page,
3 second-level linked webpages, and 9 third-level webpages,
five times each.
Meanwhile, we will record traces of JS API calls with VisibleV8,
a modified Chromium browser [@jueckstock2019visiblev8].
Using the API names and their call stacks in the recorded traces,
we will apply heuristics to
classify each corresponding JS file into the above spheres.
For example, the `addEventListener` API call indicates frontend processing,
`createElement` indicates DOM element generation, `requestAnimationFrame`
indicates animation (UX enhancement), and
`sendBeacon` indicates tracking (extensional features).
Although there are thousands of JS APIs,
only a fraction are widely used [@snyder2016browser]; we therefore expect to
classify most JS code through a few popular APIs.
We may select several key APIs, and
design the classification heuristics based on their call counts and
relative frequencies.
We will conduct random sampling to
manually validate the classification accuracy of our heuristics.
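
As a minimal sketch of what such a heuristic might look like, the Python
snippet below maps a JS file's recorded API call names to spheres.
The API-to-sphere table contains only the four example APIs named above, and
the `MIN_SHARE` cutoff, the `classify_file` function, and
the input format are hypothetical placeholders rather than
the final design, which we will derive from the collected traces.

```python
from collections import Counter

# Hypothetical API-to-sphere table, seeded with the four example APIs only.
API_SPHERES = {
    "addEventListener": "frontend processing",
    "createElement": "DOM element generation",
    "requestAnimationFrame": "UX enhancement",
    "sendBeacon": "extensional features",
}

# Assumed cutoff: a sphere is assigned when its APIs make up at least this
# share of the file's recognized calls. The real value is yet to be tuned.
MIN_SHARE = 0.1


def classify_file(api_calls):
    """Return the set of spheres inferred from one JS file's API call names."""
    if not api_calls:
        # No API calls at all: the file falls into the silent sphere.
        return {"silent"}
    counts = Counter(
        API_SPHERES[name] for name in api_calls if name in API_SPHERES
    )
    total = sum(counts.values())
    if total == 0:
        # The file called APIs, but none that this sketch recognizes.
        return set()
    return {sphere for sphere, n in counts.items() if n / total >= MIN_SHARE}


# Made-up trace: mostly element creation, some listeners, one beacon.
example_trace = ["createElement"] * 8 + ["addEventListener"] * 2 + ["sendBeacon"]
example_spheres = classify_file(example_trace)
```

Because a file may receive several spheres, the sketch returns a set instead of
a single label; the lone `sendBeacon` call in the made-up trace falls below
the assumed cutoff and is ignored.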

<!-- 1. What do you expect the results of your research to be?
    (What are the benefits, from NABC.) -->
We expect to produce a summary of inferred sphere distributions of
the websites we analyze, and explore the relationships between JS spheres and
website performance metrics.
First, we will report the percentage of JS code in each sphere,
the sphere distributions across websites, and
the overall JS-sphere relationships as a Venn diagram.
Then, we plan to report the distributions of the number of bytes of
JS files (bytes of JS) across the spheres.
Additionally, we may associate the bytes of JS per sphere with
each webpage's page load time, including time to first byte and time to
interactive.
Finally, we will show the most popular JS APIs and conduct case studies on
them to see what specific functionalities they are used for.
Our results will provide a first glance at the common uses of JS on the Web,
which may uncover patterns and trends in JS usage and inform
future investigation of JS usage optimality and impact.
For example, if *extensional features* occupies a significant portion of
the Venn diagram,
we may infer that websites frequently engage in special activities other than
content delivery; if *frontend processing* is common among JS files,
it may suggest that websites rely too much on JS for displaying content, and
should instead reconsider server-side rendering; if
*UX enhancement* is associated with longer page load time, it may reveal that
the use of JS usually works against the developers' intention in these cases.
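
The per-sphere byte shares and the Venn diagram regions could be derived from
the classification output roughly as sketched below.
The record layout, the `summarize` function, and the sample data are
all hypothetical and made up for illustration; they are not measurement results.

```python
from collections import Counter, defaultdict

# Hypothetical records: (JS file URL, size in bytes, inferred spheres).
sample_files = [
    ("https://a.example/app.js", 120_000,
     {"frontend processing", "DOM element generation"}),
    ("https://a.example/anim.js", 30_000, {"UX enhancement"}),
    ("https://b.example/tag.js", 50_000, {"extensional features"}),
]


def summarize(records):
    """Return per-sphere byte shares and file counts per sphere combination."""
    total_bytes = sum(size for _, size, _ in records)
    sphere_bytes = defaultdict(int)
    regions = Counter()
    for _, size, spheres in records:
        # A multi-sphere file counts fully toward each of its spheres,
        # so the shares may sum to more than 1.
        for sphere in spheres:
            sphere_bytes[sphere] += size
        # Each distinct combination of spheres is one Venn diagram region.
        regions[frozenset(spheres)] += 1
    if total_bytes == 0:
        return {}, regions
    shares = {s: b / total_bytes for s, b in sphere_bytes.items()}
    return shares, regions


shares, regions = summarize(sample_files)
```

Counting a multi-sphere file toward every sphere it belongs to is
one possible design choice; splitting its bytes among spheres would
require attributing bytes to individual functions, which
this sketch does not attempt.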

<!-- 1. what specific *deliverables* you will have for
    Project B.-->
We will include the following deliverables for Project B, as required:

- Weekly meetings between Steven and Harsha.
- Weekly progress notes in a Google Docs document. These notes will include:
    - What Steven accomplished last week.
    - Questions and problems Steven needs help with.
    - What Steven plans to do next week.
    - Comments from Harsha.

    Each week, Steven will prepare these notes and send them to
    Harsha the night before their meeting.
    Steven will update the notes after the meeting with answers to
    their questions and any revisions to their plan.
- The resulting code, intermediate measurement data, and
    analysis documentation.
    Specifically, we will include the code used to browse and interact with
    the websites, collect the API traces, and classify the JS files.
    As an intermediate stage, we will design heuristics only for
    classifying the *DOM element generation* sphere.
    We will include the process and results of the manual validation on
    this single sphere. The measurement data will include the website domains,
    <!-- page load time and bytes, -->
    the API traces, the JS files, and the relations among them.
- A 2-6 page Project B report summarizing the progress of the project as of
    the Project B deadline and the plan for Research Project C.
- If the Project C proposal is approved, a 3-8 page Project C report and
    an end-of-semester poster summarizing the progress of
    the project at the end of the semester.

# References