
Google Summer of Code 2014 Ideas

Carlos Agarie edited this page Jan 22, 2014 · 29 revisions

Contact

Feel free to reach us by joining #sciruby on chat.freenode.net or via our mailing list.

Instructions for students

We strongly recommend that you pick one of the ideas listed below. We value contributions in advance of GSoC, even if they're just little ones. Go pick out something in one of our trackers and work on it, talk to folks on the listserv, and get an idea for what features are needed.

You don't need to know a lot about Ruby to work on a project: depending on how much you already know, it'll be pretty easy to learn enough to be able to contribute. However, you may need some familiarity with scientific computation. If you don't have any, take a look at "Numerical Recipes in C", which you'll probably find in your university's library.

In any case, if you feel your skills aren't enough for some project, please ask us on our IRC channel (see contact section above) and we can help you.

Our number-one priority right now as an organization is NMatrix.

Project ideas

NMatrix projects

NMatrix is SciRuby's numerical matrix core, implementing dense matrices as well as two types of sparse matrix (linked-list-based and Yale/CSR). NMatrix is a fairly new but well-established project which has received Summer-of-Code-like grants from both Brighter Planet and the Ruby Association (in other words, from Matz, who created Ruby). Those who contribute to NMatrix will likely become authors of a jointly published, peer-reviewed science article on the library. Additionally, NMatrix is a good place to gain practical C and C++ experience while also working to improve Ruby.

  • Mentors: John Woods (@mohawkjohn)
  • ATLAS Functionality. NMatrix has many but not all ATLAS (cBLAS) and LAPACK functions exposed. We would like to see a consistent interface which makes sense in Ruby. We also want to be able to design and implement several NMatrix methods which depend upon ATLAS, cBLAS, and cLAPACK functions.
    • Rational Functionality. NMatrix includes some rational number capability, but support is lacking in areas where ATLAS functions are required, since ATLAS does not have a rational type. Rational-specific equivalents of ATLAS functions are needed. Along the way it may be possible to also implement some integer-specific ATLAS function equivalents. (This is a component of the ATLAS Functionality project, but could be proposed separately with sufficient justification.)
    • ATLAS-Free Support. NMatrix has some non-ATLAS versions of functions like gemm (matrix multiplication) which are typically used for rational matrices. Since we have to go to the trouble of implementing rational versions of many ATLAS functions, it might be useful to also have simpler ATLAS-free complex and floating-point versions of those functions, primarily for those who don't have LAPACK. They wouldn't be as heavily optimized, but would serve in a tight spot.
    • Basic matrix math functionality. Specifically, exponentials and square roots, matrix decomposition/factorization, calculation of norms, tensor products, principal component analysis (PCA). This is listed as a sub-project of ATLAS functionality, because several depend upon ATLAS, but it could really be proposed as a separate GSOC project. These functions are all enormously important, and would substantially improve the usability of NMatrix. Successful implementation would likely lead to co-authorship on a peer-reviewed article, and at the very least would look outstanding on a curriculum vitae.
  • Sparse improvements. The "new" Yale matrices used by NMatrix, which store diagonals (zero and non-zero) separately from non-diagonal non-zeros, are inefficient for matrices that are taller than they are wide. One way to address the problem would be to introduce an alternate "old" Yale storage. Another would be to allow matrices to be stored and operated on transposed. The goal, overall, is to be able to produce efficient Yale/sparse vectors regardless of the vectors' orientation.
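The storage idea behind the sparse-improvements item can be sketched in plain Ruby. This is a simplified illustration only, not NMatrix's actual C++ layout: the names `to_new_yale`, `yale_fetch`, and `stored_slots` are hypothetical, and the sketch assumes that one diagonal slot is reserved per row, whether or not that diagonal entry is zero or even exists.

```ruby
# Simplified illustration of "new" Yale storage: diagonal entries are kept
# in their own array, and only off-diagonal non-zeros go into CSR-style arrays.
# Assumption of this sketch: one diagonal slot is reserved for every row.
def to_new_yale(dense)
  diag    = Array.new(dense.length, 0) # diagonal values, zero or not
  row_ptr = [0]                        # where each row's off-diagonals start
  col_idx = []                         # column of each stored off-diagonal
  vals    = []                         # value of each stored off-diagonal
  dense.each_with_index do |row, i|
    row.each_with_index do |v, j|
      if i == j
        diag[i] = v
      elsif v != 0
        col_idx << j
        vals << v
      end
    end
    row_ptr << vals.length
  end
  { diag: diag, row_ptr: row_ptr, col_idx: col_idx, vals: vals }
end

# Look up element (i, j) in the structure built above.
def yale_fetch(yale, i, j)
  return yale[:diag][i] if i == j
  (yale[:row_ptr][i]...yale[:row_ptr][i + 1]).each do |k|
    return yale[:vals][k] if yale[:col_idx][k] == j
  end
  0
end

# Total number of array slots used by the structure.
def stored_slots(yale)
  yale.values.sum(&:length)
end

tall = to_new_yale([[0], [5], [0], [7]]) # 4x1 column vector
wide = to_new_yale([[0, 5, 0, 7]])       # 1x4 row vector
puts "tall: #{stored_slots(tall)} slots, wide: #{stored_slots(wide)} slots"
```

Under these assumptions the column vector needs more slots than the row vector for the same two non-zeros, which is the orientation cost the project aims to remove.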

SciRuby::Dataframe (provisional name)

  • Mentors: Carlos Agarie (@agarie), Claudio Bustos (@clbustos), John Prince (@jtprince)
  • SciRuby::Dataframe will implement a concept similar to Pandas (http://pandas.pydata.org/pandas-docs/dev/): it will provide Dataframes (tabular structures like data.frame in R) and Series (for 1-dimensional data), usable by more powerful data analysis packages.
  • Some requirements:
  • Have some simple statistics built-in (statsample as a dependency): averages, quartiles, standard deviation, median, etc.
  • Be really easy to plot. For example, a user should be able to plot a histogram from a Series object with only one method call, maybe two, without much hassle. Integration with Plotrb (see below) would be great.
  • Easily receive and interpret data from a CSV file (or any delimiter-separated value file), transforming it into a Dataframe with something as simple as SciRuby::Dataframe.csv("data.csv"). This would be a wrapper around Ruby's CSV class. Also, chunk processing of CSV files will be necessary. The faster_csv gem implements this.
  • Be able to add/remove columns and do operations on rows or columns. For simple operations, this should be very easy by using NMatrix's referenced slices.
  • Have labeled columns and indexed rows. This means that the underlying data structure (NMatrix wrapper) will need to store some metadata.
  • Use NMatrix for data storage. This also implies that we can use the NMatrix::IO module.
  • As some of the requirements of this project depend on others (visualization, statistics, etc), the most important part is to design and develop it in such a way for its API to be easy to use for new users (e.g. scientists without much programming background) but extensible enough for other projects to use it.
  • Inspired by Pandas and Statsample::Dataset.
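To make a few of these requirements concrete, here is a minimal sketch of labeled columns, CSV loading, column addition, and built-in statistics. `TinyFrame` and all of its methods are hypothetical stand-ins, not the SciRuby::Dataframe API, and the sketch stores columns in plain Ruby Arrays rather than NMatrix so that it runs with only the standard `csv` library.

```ruby
require "csv"

# Hypothetical stand-in for a dataframe: a Hash of column label => Array.
class TinyFrame
  attr_reader :columns

  def initialize(columns)
    @columns = columns
  end

  # Parse delimiter-separated text whose first line holds the column labels.
  def self.from_csv(text, col_sep: ",")
    table = CSV.parse(text, headers: true, col_sep: col_sep, converters: :numeric)
    new(table.headers.to_h { |label| [label, table[label]] })
  end

  # Fetch a column by its label.
  def [](label)
    @columns.fetch(label)
  end

  def nrows
    first = @columns.values.first
    first ? first.length : 0
  end

  # Add a column, refusing values that do not match the row count.
  def add_column(label, values)
    raise ArgumentError, "length mismatch" unless values.length == nrows
    @columns[label] = values
    self
  end

  def mean(label)
    col = self[label]
    col.sum.to_f / col.length
  end

  def median(label)
    sorted = self[label].sort
    mid = sorted.length / 2
    sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end

df = TinyFrame.from_csv("height,weight\n1.5,50\n1.7,70\n1.9,90\n")
bmi = df["weight"].zip(df["height"]).map { |w, h| w / (h * h) }
df.add_column("bmi", bmi)
puts df.mean("weight")
puts df.median("height")
```

A real implementation would also need row indexes, chunked reading of large files, and NMatrix-backed storage, all of which this sketch leaves out.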

Plotrb

  • Mentors: John Woods (@mohawkjohn), Raoul J.P. Bonnal, Pjotr Prins, Wan Zuhao
  • D3 is an incredible interactive data visualisation library written in JavaScript that runs in a browser.

Statsample, Distribution

  • Mentors: Claudio Bustos (@clbustos), John Woods for statistical distributions (@mohawkjohn)
  • Statsample is an essential scientific library which brings statistical functions to Ruby. Currently, it depends upon Ruby/GSL, which conflicts with NMatrix. To bring it up to spec, it needs to require the SciRuby fork of rb-gsl instead. Statsample also depends upon Distribution, which makes statistical distribution functions available to users of MRI (in pure Ruby and through GSL) and JRuby. Many of these functions remain unimplemented, or need a JRuby, GSL, or pure Ruby version written.
  • Two projects from last year became statsample extensions (as gems): statsample-timeseries and statsample-glm.

Minimization and Integration

Minimization and Integration are two SciRuby modules which are used by Claudio Bustos' statsample gem. For Minimization, students would research and suggest additional minimization methods, develop tests, and improve documentation. For Integration, students would implement additional numerical integration methods and add support for solving various types of (ordinary and/or partial) differential equations. We need to be explicit about the accuracy limits and performance of each method, so benchmarks will be necessary. As always, the student is expected to write tests and document code. There has been some talk of removing support for Ruby versions earlier than 1.9.3 for both Integration and Minimization.
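As a rough illustration of what being explicit about each method's error could look like, the sketch below compares two textbook quadrature rules against an integral whose exact value is known. The method names `trapezoid` and `simpson` are hypothetical and are not taken from the Integration gem.

```ruby
# Composite trapezoid rule on [a, b] with n subintervals.
def trapezoid(a, b, n, &f)
  h = (b - a) / n.to_f
  inner = (1...n).sum { |i| f.call(a + i * h) }
  h * (0.5 * (f.call(a) + f.call(b)) + inner)
end

# Composite Simpson's rule on [a, b]; n must be even.
def simpson(a, b, n, &f)
  raise ArgumentError, "n must be even" if n.odd?
  h = (b - a) / n.to_f
  odd  = (1...n).step(2).sum { |i| f.call(a + i * h) }
  even = (2...n).step(2).sum { |i| f.call(a + i * h) }
  h / 3.0 * (f.call(a) + 4 * odd + 2 * even + f.call(b))
end

# The integral of sin(x) over [0, pi] is exactly 2, so the error is measurable.
exact = 2.0
[10, 100].each do |n|
  t_err = (trapezoid(0, Math::PI, n) { |x| Math.sin(x) } - exact).abs
  s_err = (simpson(0, Math::PI, n) { |x| Math.sin(x) } - exact).abs
  puts "n=#{n}: trapezoid error #{t_err}, Simpson error #{s_err}"
end
```

A benchmark suite for the gem could report this kind of error alongside timing for each method and step count.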

  • Mentors: Claudio Bustos (@clbustos), John Woods (@mohawkjohn)
  • Standardized minimization framework. Right now, Minimization is pure Ruby. There are additional minimization algorithms in GSL and probably in Java which can be used in Ruby. It'd be great to have a standardized framework so that pure Ruby functions can be used, or C functions if using MRI/YARV, or Java functions if using JRuby. Such a thing is already done for the Distribution gem, so that can be used as a model.
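One possible shape for such a framework is a small registry in which each backend declares whether it can run on the current interpreter. Everything here is a hypothetical sketch: `MinimizerSketch`, `register`, `pick`, and the `:native` placeholder are invented for illustration and do not reflect the Minimization or Distribution gems. The pure Ruby backend uses golden-section search, chosen only because it is short.

```ruby
# Hypothetical backend registry; names are illustrative, not an existing API.
module MinimizerSketch
  BACKENDS = {}

  # Register a backend with a check for whether it is usable here.
  def self.register(name, available:, &impl)
    BACKENDS[name] = { available: available, impl: impl }
  end

  # Return the first usable backend, in registration order.
  def self.pick
    name = BACKENDS.keys.find { |n| BACKENDS[n][:available].call }
    raise "no minimization backend available" unless name
    name
  end

  # Minimize a unimodal function f on [lo, hi] with the chosen backend.
  def self.minimize(lo, hi, tol: 1e-8, backend: pick, &f)
    BACKENDS.fetch(backend)[:impl].call(f, lo, hi, tol)
  end
end

# Placeholder for a C or Java backend. A real check would test whether the
# native library loads; this sketch always reports it as unavailable.
MinimizerSketch.register(:native, available: -> { false }) do |_f, _lo, _hi, _tol|
  raise NotImplementedError, "native backend is only a placeholder"
end

# Pure Ruby fallback: golden-section search.
MinimizerSketch.register(:pure_ruby, available: -> { true }) do |f, lo, hi, tol|
  inv_phi = (Math.sqrt(5) - 1) / 2
  a = lo.to_f
  b = hi.to_f
  while (b - a).abs > tol
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    if f.call(c) < f.call(d)
      b = d # minimum lies in [a, d]
    else
      a = c # minimum lies in [c, b]
    end
  end
  (a + b) / 2
end

puts MinimizerSketch.pick
puts MinimizerSketch.minimize(0, 5) { |t| (t - 2)**2 }
```

For simplicity the loop re-evaluates both interior points on every pass; a production version would reuse one evaluation per iteration.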