evaluation.tex

\documentclass[thesis]{subfiles}
\begin{document}

\chapter{Evaluation}\label{chap:eval}
The following sections will focus on reasons for the missing parallelism, performance impacts from the Rust type system and testing of implemented features for the \gls{skill} binding.
This is predominantly done through the provided test suites as well as a custom benchmark.

\section{Parallelism}
  A feature that wasn't implemented was the parallelisation of the serialization.
  In \autoref{sec:badptr} the consequences of \PtrT towards parallelism were discussed and concluded that the Rust type system can't be used for graph like structures that should be used safely in parallel.

  Rusts guarantees for data race free programs stems from the borrowing rules.
  These are pushed to the runtime since the type system is unable to accommodate graph like structures while upholding the borrow rules.
  Wrapper types have to be used that enforce the rules at runtime.
  That means that the research point, how the \emph{type system} can be used to implement the serialization in a safe parallel manner, is a negative result.
  While it is not impossible to implement the serialisation in a `parallelized' way the implications make it quite clear that it would be unwise to do so.
  Since two mutex locks have to be made per field access, there is only a slow down to be obtained, which would be forced on the users code, too.

\section{Tests}
  There are two test suites present in the \gls{skill} repository~\autocite{skill-repo} that use 28 specifications.
  These test suites are used for the existing bindings generators and were used to develop in a test driven manner.

  The first test suite is used to test reading capabilities of bindings.
  To do that specifications are used to generate Rust bindings.
  Then a test file is generated that features a test method for every test data file.
  For example test data files test the built-in types or inheritance.
  The generated binding as well as the test file are then compiled and tested by \cod{cargo test}.
  All 644 tests from this test suite, that do not require restrictions to fail, are passed successfully.

  The second test suite tests the \gls{api} and with that requires more thorough adjustments for each generator.
  This test suite does not use \gls{skill} binary files as input.
  Instead it uses a specification file and a \gls{json} file that describes objects to create a graph with objects from the specification.
  This means that similarly to the first test suite a binding is generated.
  But instead of requiring only the reading from a given input file, here the tests, in the test file, have to use the \gls{api}, of the binding, to create objects and use these to set and get values.
  For Rusts test suite the tests are generated in a way that the \gls{api} is used to create a \SkillFile, create graphs based on the test suites \gls{json} input, write it to disc, read it back from disc and compare the read data to the \gls{json} input.
  Similarly to the first test suite, all 189 tests pass successfully that do not require restrictions to fail.% TODO mention amount

  Furthermore 10 additional tests were created.
  Some of these tests use multiple specifications to test the interoperability of Rust bindings and the \Foreign types.
  Another is used to test for memory leaks that can be created through the reference counting of \PtrT.
  Lastly there is a test that tests the \emph{custom} feature.
  These manually created tests are executed as part of the second test suite as they depend on the generated bindings.

  To validate that the created \gls{skill} binary files are interoperable with other existing tools and bindings \cod{skillView}\autocite{skill-view} was used throughout the development process.
  The tool allows to explore \gls{skill} binary files.
  Since the tool didn't show any issues with the files and because the first test suites input data comes from other generators, interoperability is demonstrated.

\section{Performance}
  In \autoref{fig:bench} a performance comparison of Rust, C++ and Java is shown.
  \autoref{fig:bench20} offers an more detailed view for files that are smaller than 20Mb while \autoref{fig:benchAll} features all data points.
  \autoref{fig:benchLog} shows a logarithmic plot of the data and features a guide line of $y=10x$.
  It can be seen that Rust has a steeper inclination than the guide line albeit much shallower than the other languages.
  Furthermore can be seen that Java matches the other transfer rates better if the files are larger as the start from the program and \gls{jvm} is more noticeable in smaller files.

  The displayed data was generated by a script that generated the bindings from the specification of \citetitle{skill-llvm}\autocite{skill-llvm}.
  It would then generate a main function that would read, write and again read a file given as an argument, 100 repetitions each.
  After compiling all binaries with the default optimisation levels, the harness would time the execution of the binaries while supplying a selection of data files from \citetitle{skill-llvm}\autocite{skill-llvm} that is listed in \autoref{tab:data}.
  Because of memory leaks in the C++ version this setup had to be chosen, instead of a more library focused one, where the binary would serialize the file multiple times in the same process.
  On the other hand, the C++ version has more features implemented than the Rust version.
  Because of these facts this comparison should be taken with caution but the large divergence of run times indicates that other more fundamental pieces differ.
  The test system information is listed in \autoref{tab:system}.
  The system was not interacted with and all user applications were closed while testing.

  \begin{table}[ht]
    \begin{minipage}{.49\linewidth}
      \centering
      \begin{tabu}{l|l}
        Component
         & Information    \\\hline
        CPU
         & i7-4702HQ      \\
        Memory
         & 16GB           \\
        Storage
         & SSD SM841      \\
        Arch Linux
         & 4.18.11        \\
        GCC
         & 8.2.1          \\
        Clang
         & 7.0.0          \\
        Java
         & Oracle 11      \\
        Rust
         & 1.30.0-nightly \\
        Optimization
         & 3              \\
        \gls{skill}
         & \cod{59e9c1f9} \\
        \cod{cppCommon}
         & \cod{dd6021f}  \\
      \end{tabu}
      \caption{Test System Information}\label{tab:system}
    \end{minipage}
    \begin{minipage}{.49\linewidth}
      \centering
      \begin{tabu}{l|r|r}
        File (release)
         & Size
         & Nodes      \\\hline
        tty.sf
         & 107K
         & 5.371      \\
        make.sf
         & 1.1M
         & 49.649     \\
        screen.sf
         & 3.2M
         & 139.779    \\
        bash.sf
         & 6.1M
         & 256.383    \\
        gnuplot.sf
         & 8.1M
         & 345.310    \\
        git.sf
         & 16M
         & 601.669    \\
        vim.sf
         & 19M
         & 781.822    \\
        llvm-split.sf
         & 52M
         & 2.043.963  \\
        lli.sf
         & 151M
         & 5.357.733  \\
        clang.sf
         & 256M
         & 9.044.428  \\
        opt.sf
         & 322M
         & 10.856.418 \\
      \end{tabu}
      \caption{Selected Performance Input Test Data from \citetitle{skill-llvm}\autocite{skill-llvm}}\label{tab:data}
    \end{minipage}
  \end{table}

  \begin{figure}[ht]
    \centering
    \hspace*{4em}\input{data/bench_legend}

    \subfloat[Files $<20\text{Mb}$\label{fig:bench20}]{\hspace*{4pt}\input{data/20mbbench}}

    \subfloat[All test files\label{fig:benchAll}]{\input{data/total_bench}}

    \subfloat[Lograrithmic graph of all test files, with guide line $y=10x$.\label{fig:benchLog}]{\input{data/total_bench_log}}
    \caption{Performance comparison between Rust, C++ and Java}\label{fig:bench}
  \end{figure}

  A \cod{perf} analysis didn't show any particular hot spot that could be optimized.
  The code seems to be generally slow which should be expected if every \PtrT instance access requires at least two integer additions for the borrow checking and at least four if it is obtained from a \UserTypePool.
  Since the memory locations, that are pointed to by \PtrT, are not one contiguous piece but fragmented memory locations it is further possible that the bigger spread of runtime per file is caused by memory fragmentation.
  It can also be assumed that even with a `perfectly' parallelized implementation the speedup of the parallel processing would not be enough to catch up to C++.
  While the GCC C++ single thread version is slower than the multi thread version ($\times 1.4$) the rust version is $\times 3.3$ slower than the single threaded GCC C++ version ($\times 4.6$ slower than the multi thread C++ version).
  LLVM is not responsible for the longer run-times since the run-times of the C++ versions, compiled with clang, almost match the ones of GCC.

\section{\glsentrytext{skill} Improvements}
  The one thing that could be improved in a new specification would be that annotation fields do not have to use two byes to signal that the annotation is \codc{NULL}.
  Currently the pool id and object id is given and set to 0 but since there can't be a pool with id 0 the additional byte for the object id can be saved.

  Another improvement was made to Rusts and C++ implementation of size calculation, reading and writing for variable length integers, used by \cod{v64} and pointers.
  \autoref{lst:v64EncOld} and \autoref{lst:v64DecOld} show reference implementations, found in \citetitle{skill-tr}\autocite{skill-tr}.
  The C++ implementation already featured the unrolled loop and defered size calculation, but it still used the same bit masks and shifts and needed two integers for decoding.
  \autoref{lst:v64OffNew}, \autoref{lst:v64EncNew} and \autoref{lst:v64DecNew} show the improved versions of size calculation, encoding and decoding.
  The previous decoding implementation needed one \cod{OR}, one \cod{SHIFT}, one \cod{COND}, two \cod{AND} and two \cod{ASSIG} operations per byte.
  The improved version needs one \cod{AND} operation and one \cod{ASSIG} operation less per byte and made an integer obsolete.
  The size calculation can be done with just one comparison per byte, removing a additional \cod{AND} operation.
  Likewise the additional \cod{AND} operation was removed from the encoding function.

  \LstTikzFig{data/v64_enc_old.cpp}{Reference encode implementation for \cod{v64}\autocite{skill-tr}}{lst:v64EncOld}[language=C++]
  \LstTikzFig{data/v64_dec_old.cpp}{Reference decode implementation for \cod{v64}\autocite{skill-tr}}{lst:v64DecOld}[language=C++]


  \LstTikzFig{data/v64_off_new.cpp}{Optimized size implementation for \cod{v64}}{lst:v64OffNew}[language=C++]
  \LstTikzFig{data/v64_enc_new.cpp}{Optimized encode implementation for \cod{v64}}{lst:v64EncNew}[language=C++]
  \LstTikzFig{data/v64_dec_new.cpp}{Optimized decode implementation for \cod{v64}}{lst:v64DecNew}[language=C++]

\end{document}