Add network docs #35

Merged
merged 10 commits into from
Mar 5, 2024
3 changes: 1 addition & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,5 @@ pip install -r docs/requirements-rtd.txt
then build the docs by executing the command...

```
mkdir -p docs/build/html
sphinx-build -M html docs docs/build/html
sphinx-build -M html docs docs/build
```
5 changes: 1 addition & 4 deletions docs/development.rst
Expand Up @@ -112,7 +112,4 @@ which is also indented for improved human readability and line-by-line GitHub tr
If this ``results`` folder eventually becomes too large for Git to reasonably handle, we will explore options to share via other data storage services.


Network Tracking
----------------

Stay tuned https://github.com/NeurodataWithoutBorders/nwb_benchmarks/issues/24
.. include:: network_tracking.rst
24 changes: 24 additions & 0 deletions docs/network_tracking.rst
@@ -0,0 +1,24 @@
.. _network-tracking:

Network Tracking
----------------

Network tracking is implemented as part of the ``nwb_benchmarks.core`` module and consists of the following main components:

* ``CaptureConnections`` : This class uses the ``psutil`` library to capture network connections and map them to process IDs (PIDs). This information is then used downstream to filter network packets by PID, which lets us distinguish network traffic generated by our own process from traffic generated by other processes running on the same system. See `core/_capture_connections.py <https://github.com/NeurodataWithoutBorders/nwb_benchmarks/blob/main/src/nwb_benchmarks/core/_capture_connections.py>`_
* ``NetworkProfiler`` : This class uses the ``tshark`` command line tool (and the ``pyshark`` package) to capture the network traffic (packets) generated by all processes on the system. In combination with ``CaptureConnections``, we can then filter the captured packets to retrieve those generated by a particular PID via the ``get_packets_for_connections`` function. See `core/_network_profiler.py <https://github.com/NeurodataWithoutBorders/nwb_benchmarks/blob/main/src/nwb_benchmarks/core/_network_profiler.py>`_
* ``NetworkStatistics`` : This class provides functions for processing the network packets captured by the ``NetworkProfiler`` to compute basic network statistics, such as the number of packets sent/received or the size of the data uploaded/downloaded. The ``get_statistics`` function provides a convenient way to retrieve all the metrics via a single function call. See `core/_network_statistics.py <https://github.com/NeurodataWithoutBorders/nwb_benchmarks/blob/main/src/nwb_benchmarks/core/_network_statistics.py>`_
* ``NetworkTracker`` and ``network_activity_tracker`` : The ``NetworkTracker`` class and the corresponding ``network_activity_tracker`` context manager build on the functionality implemented in the above modules to make it easy to track network traffic and compute network statistics for a given period during the execution of code.
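
The PID-based filtering described in the list above can be pictured with a small, self-contained sketch. It does not use the actual ``nwb_benchmarks`` classes: the ``Connection`` and ``Packet`` tuples and the ``packets_for_connections`` helper are hypothetical stand-ins, assumed here only to illustrate how packets might be matched against the address/port pairs recorded for one PID.

```python
from typing import NamedTuple


class Connection(NamedTuple):
    """Hypothetical record of one socket owned by a process."""
    pid: int
    local: tuple   # (ip, port)
    remote: tuple  # (ip, port)


class Packet(NamedTuple):
    """Hypothetical captured packet, reduced to endpoints and size."""
    src: tuple
    dst: tuple
    size: int


def packets_for_connections(packets, connections, pid):
    """Keep only packets whose endpoints match a connection owned by pid."""
    endpoints = set()
    for conn in connections:
        if conn.pid == pid:
            # Register both directions so sent and received packets match.
            endpoints.add((conn.local, conn.remote))
            endpoints.add((conn.remote, conn.local))
    return [p for p in packets if (p.src, p.dst) in endpoints]


# Made-up example data: two processes talking to the same server.
connections = [
    Connection(pid=100, local=("10.0.0.2", 50000), remote=("203.0.113.10", 443)),
    Connection(pid=200, local=("10.0.0.2", 50001), remote=("203.0.113.10", 443)),
]
packets = [
    Packet(src=("10.0.0.2", 50000), dst=("203.0.113.10", 443), size=120),
    Packet(src=("203.0.113.10", 443), dst=("10.0.0.2", 50000), size=1500),
    Packet(src=("10.0.0.2", 50001), dst=("203.0.113.10", 443), size=80),
]
mine = packets_for_connections(packets, connections, pid=100)
bytes_total = sum(p.size for p in mine)
```

Because packets themselves carry no PID, some mapping from connections to processes is needed before per-process statistics can be computed; this is the role the list above assigns to ``CaptureConnections``.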

.. note::

``CaptureConnections`` uses `psutil.net_connections() <https://psutil.readthedocs.io/en/latest/#psutil.net_connections>`_, which requires sudo/root access on macOS and AIX.

.. note::

Running the network tracking creates additional threads/processes in order to capture traffic while the main code is running: **1)** ``NetworkProfiler.start_capture`` launches a ``subprocess`` that runs the ``tshark`` command line tool and is terminated when ``NetworkProfiler.stop_capture`` is called, and **2)** ``CaptureConnections`` implements a ``Thread`` that runs in the background. The ``NetworkTracker`` automatically starts and terminates these processes/threads, so a user typically does not need to manage them directly.
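
As a rough illustration of the background-thread pattern mentioned in the note above, the following sketch shows a thread that polls until it is told to stop. This is not the actual ``CaptureConnections`` implementation; ``PollingThread`` and its attributes are hypothetical names, and the polling body simply records timestamps in place of real connection data.

```python
import threading
import time


class PollingThread(threading.Thread):
    """Hypothetical background poller that runs until stop() is called."""

    def __init__(self, interval: float = 0.01):
        super().__init__(daemon=True)
        self.interval = interval
        self.samples = []
        self._stop_event = threading.Event()

    def run(self):
        # Poll repeatedly until asked to stop.
        while not self._stop_event.is_set():
            self.samples.append(time.time())  # stand-in for real polling work
            self._stop_event.wait(self.interval)

    def stop(self):
        # Signal the loop to exit and wait for the thread to finish.
        self._stop_event.set()
        self.join()


poller = PollingThread()
poller.start()
time.sleep(0.05)  # stand-in for the code being measured
poller.stop()
```

Using an ``Event`` rather than a plain sleep lets ``stop()`` interrupt the wait immediately, so the thread shuts down promptly once tracking ends.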

Typical usage
^^^^^^^^^^^^^

In most cases, users will use the ``NetworkTracker`` or ``network_activity_tracker`` to track network traffic and statistics as illustrated in :ref:`network-tracking-benchmarks`.
4 changes: 3 additions & 1 deletion docs/running_benchmarks.rst
Expand Up @@ -14,7 +14,9 @@ use `psutil net_connections <https://psutil.readthedocs.io/en/latest/#psutil.net

sudo nwb_benchmarks run

Or drop the ``sudo`` if on Windows. Running on Windows may also require you to set the ``TSHARK_PATH`` environment variable beforehand, which should be the absolute path to the ``tshark.exe`` on your system.
Or drop the ``sudo`` if on Windows.

When running on Windows, or if ``tshark`` is not installed on the path, you may also need to set the ``TSHARK_PATH`` environment variable beforehand. It should be the absolute path to the ``tshark`` executable (e.g., ``tshark.exe``) on your system.
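
One way to think about this lookup is sketched below. How ``nwb_benchmarks`` itself resolves ``TSHARK_PATH`` is not shown here, so ``resolve_tshark_path`` is a hypothetical helper and the fallback to searching the system path is an assumption made for illustration.

```python
import os
import shutil
from typing import Optional


def resolve_tshark_path() -> Optional[str]:
    """Hypothetical helper: prefer TSHARK_PATH, else search the system path."""
    explicit = os.environ.get("TSHARK_PATH")
    if explicit:
        return explicit
    # shutil.which returns None when no tshark executable is found.
    return shutil.which("tshark")


# Example value only; use the real location of tshark on your system.
os.environ["TSHARK_PATH"] = "/opt/wireshark/bin/tshark"
resolved = resolve_tshark_path()
```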

Many of the current tests can take several minutes to complete; the entire suite will take many times that. Grab some coffee, read a book, or better yet (when the suite becomes larger) just leave it to run overnight.

45 changes: 44 additions & 1 deletion docs/writing_benchmarks.rst
Expand Up @@ -116,4 +116,47 @@ Notice how the ``read_hdf5_nwbfile_remfile`` function (which reads an HDF5-backe
nwbfile = io.read()
return (nwbfile, io, file, byte_stream)

and so we managed to save ~5 lines of code for every occurence of this logic in the benchmarks. Good choices of function names are critical to effectively communicating the actions being undertaken. Thorough annotation of signatures is likewise critical to understanding input/output relationships for these functions.
and so we managed to save ~5 lines of code for every occurrence of this logic in the benchmarks. Good choices of function names are critical to effectively communicating the actions being undertaken. Thorough annotation of signatures is likewise critical to understanding input/output relationships for these functions.


.. _network-tracking-benchmarks:


Writing a network tracking benchmark
------------------------------------

Functions that require network access, such as reading a file from S3, are often a black box, with functions in other libraries (e.g., ``h5py``, ``fsspec``, etc.) managing the access to the remote resources. The runtime performance of such functions is often driven by how they use the network to access those resources. It is therefore important to profile the network traffic being generated to better understand, e.g., the amount of data downloaded and uploaded and the number of requests sent and received.

To simplify the implementation of benchmarks for tracking network statistics, the ``nwb_benchmarks.core`` module provides various helper classes and functions. The network tracking functionality is designed to track the network traffic generated by the main Python process running our tests during a user-defined period of time. The ``network_activity_tracker`` context manager can be used to track the network traffic generated by the code within the context. A basic network benchmark then looks as follows:

.. code-block:: python

    from nwb_benchmarks import TSHARK_PATH
    from nwb_benchmarks.core import network_activity_tracker
    import requests  # Only used here for illustration purposes

    class SimpleNetworkBenchmark:

        def track_network_activity_uri_request(self):
            with network_activity_tracker(tshark_path=TSHARK_PATH) as network_tracker:
                x = requests.get('https://nwb-benchmarks.readthedocs.io/en/latest/setup.html')
            # The statistics are computed when the context exits
            return network_tracker.asv_network_statistics

In cases where a context manager may not be sufficient, we can alternatively use the ``NetworkTracker`` class directly to explicitly control when to start and stop the tracking.

.. code-block:: python

    from nwb_benchmarks import TSHARK_PATH
    from nwb_benchmarks.core import NetworkTracker
    import requests  # Only used here for illustration purposes

    class SimpleNetworkBenchmark:

        def track_network_activity_uri_request(self):
            tracker = NetworkTracker()
            tracker.start_network_capture(tshark_path=TSHARK_PATH)
            x = requests.get('https://nwb-benchmarks.readthedocs.io/en/latest/setup.html')
            # Stopping the capture computes the statistics
            tracker.stop_network_capture()
            return tracker.asv_network_statistics

By default, the ``NetworkTracker`` and ``network_activity_tracker`` track the network activity of the current process ID (i.e., ``os.getpid()``), but the PID to track can also be set explicitly if a different process needs to be monitored.
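
The default-PID behavior can be illustrated with a stand-in that mimics only the start/stop pattern, without capturing any traffic. ``SketchTracker`` and ``sketch_activity_tracker`` are hypothetical names used for this sketch; they are not part of ``nwb_benchmarks``, and the ``statistics`` dictionary they produce is a simplified assumption.

```python
import contextlib
import os
import time
from typing import Optional


class SketchTracker:
    """Hypothetical stand-in that mimics the start/stop pattern only."""

    def __init__(self):
        self.statistics = None
        self._start_time = None

    def start(self):
        self._start_time = time.time()

    def stop(self, pid: Optional[int] = None):
        # Fall back to the current process when no PID is given.
        if pid is None:
            pid = os.getpid()
        self.statistics = {
            "pid": pid,
            "total_time_in_seconds": time.time() - self._start_time,
        }


@contextlib.contextmanager
def sketch_activity_tracker(pid: Optional[int] = None):
    tracker = SketchTracker()
    try:
        tracker.start()
        yield tracker
    finally:
        # Statistics are computed on exit, so read them after the block.
        tracker.stop(pid=pid)


with sketch_activity_tracker() as default_tracker:
    time.sleep(0.01)  # stand-in for the code being measured

with sketch_activity_tracker(pid=12345) as explicit_tracker:  # arbitrary PID
    time.sleep(0.01)
```

In this sketch the results only become available once the ``with`` block has exited, which is why the benchmark examples above read the statistics after the tracked code has finished.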
42 changes: 29 additions & 13 deletions src/nwb_benchmarks/core/_network_tracker.py
Expand Up @@ -12,22 +12,21 @@


@contextlib.contextmanager
def network_activity_tracker(tshark_path: Union[pathlib.Path, None] = None):
"""Context manager for tracking network activity and statistics for the code executed in the context"""
def network_activity_tracker(tshark_path: Union[pathlib.Path, None] = None, pid: int = None):
"""
Context manager for tracking network activity and statistics for the code executed in the context

:param tshark_path: Path to the tshark CLI command to use for tracking network traffic
:param pid: The id of the process to compute the network statistics for. If set to None, then the
PID of the current process will be used.
"""
network_tracker = NetworkTracker()

try:
network_tracker.start_network_capture(tshark_path=tshark_path)
time.sleep(0.3)

t0 = time.time()
yield network_tracker
finally:
network_tracker.stop_network_capture()

t1 = time.time()
network_total_time = t1 - t0
network_tracker.network_statistics["network_total_time_in_seconds"] = network_total_time
network_tracker.stop_network_capture(pid=pid)


class NetworkTracker:
Expand All @@ -52,11 +51,14 @@ def __init__(self):
self.pid_packets = None
self.network_statistics = None
self.asv_network_statistics = None
self.__start_capture_time = None

def start_network_capture(self, tshark_path: Union[pathlib.Path, None] = None):
"""
Start capturing the connections on this machine as well as all network packets

:param tshark_path: Path to the tshark CLI command to use for tracking network traffic

Side effects: This function sets the following instance variables:
* self.connections_thread
* self.network_profiler
Expand All @@ -69,10 +71,16 @@ def start_network_capture(self, tshark_path: Union[pathlib.Path, None] = None):
self.network_profiler = NetworkProfiler()
self.network_profiler.start_capture(tshark_path=tshark_path)

def stop_network_capture(self):
# start the main timer
self.__start_capture_time = time.time()

def stop_network_capture(self, pid: int = None):
"""
Stop capturing network packets and connections.

:param pid: The id of the process to compute the network statistics for. If set to None, then the
PID of the current process (i.e., os.getpid()) will be used.

Note: This function will fail if `start_network_capture` was not called first.

Side effects: This function sets the following instance variables:
Expand All @@ -81,15 +89,23 @@ def stop_network_capture(self):
* self.network_statistics
* self.asv_network_statistics
"""
# stop capturing the network
self.network_profiler.stop_capture()
self.connections_thread.stop()

# get the connections for the PID of this process
self.pid_connections = self.connections_thread.get_connections_for_pid(os.getpid())
# compute the total time
stop_capture_time = time.time()
network_total_time = stop_capture_time - self.__start_capture_time

# get the connections for the PID of this process or the PID set by the user
if pid is None:
pid = os.getpid()
self.pid_connections = self.connections_thread.get_connections_for_pid(pid)
# Parse packets and filter out all the packets for this process pid by matching with the pid_connections
self.pid_packets = self.network_profiler.get_packets_for_connections(self.pid_connections)
# Compute all the network statistics
self.network_statistics = NetworkStatistics.get_statistics(packets=self.pid_packets)
self.network_statistics["network_total_time_in_seconds"] = network_total_time

# Very special structure required by ASV
# 'samples' is the value tracked in our results