HTML parsers benchmark

Simple HTML DOM parser benchmark.

Competitors

Erlang

CPython

PyPy

  • BeautifulSoup 3
  • BeautifulSoup 4
  • html5lib

Node.JS

Ruby

C

Perl

Google Go

PHP

Haskell

Java

Dart

Mono

Preparation

Install the OS dependencies: python-virtualenv, erlang, pypy, a C compiler and the libxml2 dev packages.

sudo apt-get install ...
    libxml2-dev libxslt1-dev build-essential  # common
    python-virtualenv python-lxml             # python
    erlang-base                               # erlang
    pypy                                      # python PyPy
    nodejs npm                                # NodeJS
    cabal-install libicu-dev                  # Haskell
    php5-cli php5-tidy                        # PHP
    golang                                    # Go
    ruby1.9.1 ruby1.9.1-dev rubygems1.9.1     # Ruby
    maven2 default-jdk                        # Java
    mono-runtime mono-dmcs                    # Mono

Then run prepare.sh (it prepares virtual environments, fetches dependencies, compiles sources, etc.):

./prepare.sh

If you hit errors, I recommend also installing cython and python-dev, then retrying.

To prepare only some of the platforms, set the PLATFORMS environment variable:

PLATFORMS="pypy python" ./prepare.sh

Run

Just run:

./run.sh <number of parser iterations>

For example:

./run.sh 5000

To run tests only for some of the platforms, set the PLATFORMS environment variable:

PLATFORMS="pypy python" ./run.sh 5000

To run a series of tests, use a snippet like:

for C in 10 50 100 400 600 1000; do ./run.sh "$C" | tee "output_$C.txt"; done

Results

To convert the results to a CSV file, use to_csv.py:

./run.sh 5000 | ./to_csv.py

or something like:

./run.sh 5000 | tee output.txt
./to_csv.py < output.txt

or, for a series of runs:

for C in 10 50 100 400 600 1000; do ./to_csv.py < "output_$C.txt" > "results-$C.csv"; done

There is also an R script that can build some pretty graphs: stats/main.r.

How to add my %platformname% to the benchmark set?

Create a directory named %platformname%:

mkdir %platformname%

Create the run.sh and prepare.sh scripts:

  • run.sh - called every time the benchmark starts. It must use the print_header() and timeit() functions from lib.sh to format the output of each test. It must accept 2 arguments, the HTML file path and the number of iterations, and pass them unchanged to the benchmark scripts.
  • prepare.sh - called only once, before running any benchmarks. It can download dependencies, compile sources, etc.

Create your benchmark scripts. Each script:

  • Must accept 2 arguments: the path to the HTML file and the number of iterations
  • Must read the HTML file once, then perform "number of iterations" parse cycles
  • Must print the parser-loop runtime in seconds, calculated like start = time(); do_n_iterations(N); print time() - start
  • Must build the full DOM tree in memory on each iteration
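
The requirements above can be sketched as a small Python script. This is only an illustration, not one of the scripts shipped in this repository: the names Node, TreeBuilder, parse and run_benchmark are made up for the example, and the standard-library html.parser stands in for whatever parser you actually want to measure. The file is read once, outside the timed loop, and every iteration builds a complete tree in memory.

```python
#!/usr/bin/env python3
"""Hypothetical benchmark script: parse one HTML file N times, print loop time."""
import sys
import time
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal DOM node: a tag name, its attributes and its children."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []  # child Node objects and text strings


class TreeBuilder(HTMLParser):
    """Builds a full in-memory tree of Node objects from HTML text."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    """One parse cycle: build and return the whole tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def run_benchmark(path, iterations):
    """Read the file once, parse it `iterations` times, return seconds spent."""
    with open(path, encoding="utf-8", errors="replace") as f:
        html = f.read()  # read once, outside the timed loop
    start = time.time()
    for _ in range(iterations):
        parse(html)
    return time.time() - start


# Only run when called with exactly the two expected arguments.
if __name__ == "__main__" and len(sys.argv) == 3:
    print(run_benchmark(sys.argv[1], int(sys.argv[2])))
```

If the sketch were saved as, say, bench.py (a hypothetical name), the platform's run.sh would call it with the HTML file path and the iteration count it received, and the script would print a single number of seconds.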

Add %platformname% to the platforms.txt file.

How to add a new HTML page to the benchmark?

Just create an HTML file named page_<some_page_name>.html.
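
To check that a new file follows the naming rule, something like the following Python sketch can help. It assumes pages are recognised purely by the page_*.html file name pattern in a given directory. How the benchmark actually discovers them is not described here, and find_benchmark_pages and page_name are hypothetical helpers, not part of this repository.

```python
from pathlib import Path


def find_benchmark_pages(directory="."):
    """Return sorted paths of files named page_<name>.html in `directory`."""
    # Assumption: pages are recognised only by this file name pattern.
    return sorted(p for p in Path(directory).glob("page_*.html") if p.is_file())


def page_name(path):
    """Extract <some_page_name> from page_<some_page_name>.html."""
    return Path(path).stem[len("page_"):]
```

For example, [page_name(p) for p in find_benchmark_pages(".")] lists the page names found in the current directory.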