____ _ _ _
| _ \ __ _ | |_ __ _ | | __ _ __| |
| | | | / _` | | __| / _` | | | / _` | / _` |
| |_| | | (_| | | |_ | (_| | | |___ | (_| | | (_| |
|____/ \__,_| \__| \__,_| |_____| \__,_| \__,_|
Read me
The full documentation is available at: http://docs.datalad.org
DataLad makes data management and data distribution more accessible. To do that it stands on the shoulders of Git and Git-annex to deliver a decentralized system for data exchange. This includes automated ingestion of data from online portals, and exposing it in readily usable form as Git(-annex) repositories, so-called datasets. The actual data storage and permission management, however, remains with the original data providers.
DataLad is under rapid development. While the code base is still growing, the focus is increasingly shifting towards robust and safe operation with a sensible API. Organization and configuration are still subject of considerable reorganization and standardization. However, DataLad is, in fact, usable today and user feedback is always welcome.
Neurostars is the preferred venue for DataLad
support. Forum login is possible with your existing Google, Twitter, or GitHub
account. Before posting a new
topic, please check the
previous posts tagged with
#datalad
. To get help on a datalad-related issue, please consider to follow
this message
template.
A growing number of datasets is made available from http://datasets.datalad.org .
Those datasets are just regular git/git-annex repositories organized into
a hierarchy using git submodules mechanism. So you can use regular
git/git-annex commands to work with them, but might need datalad
to be
installed to provide additional functionality (e.g., fetching from
portals requiring authentication such as CRCNS, HCP; or accessing data
originally distributed in tarballs). But datalad aims to provide higher
level interface on top of git/git-annex to simplify consumption and sharing
of new or derived datasets. To that end, you can install all of
those datasets using
datalad install -r ///
which will git clone
all of those datasets under datasets.datalad.org
sub-directory. This command will not fetch any large data files, but will
merely recreate full hierarchy of all of those datasets locally, which
also takes a good chunk of your filesystem meta-data storage. Instead of
fetching all datasets at once you could either specify specific dataset to
be installed, e.g.
datalad install ///openfmri/ds000113
or install top level dataset by omitting -r
option and then calling
datalad install
for specific sub-datasets you want to have installed,
possibly with -r
to install their sub-datasets as well, e.g.
datalad install ///
cd datasets.datalad.org
datalad install -r openfmri/ds000001 indi/fcon1000
You can navigate datasets you have installed in your terminal or browser,
while fetching necessary files or installing new sub-datasets using the
datalad get [FILE|DIR]
command. DataLad will take care about
downloading, extracting, and possibly authenticating (would ask you for
credentials) in a uniform fashion regardless of the original data location
or distribution serialization (e.g., a tarball). Since it is using git
and git-annex underneath, you can be assured that you are getting exact
correct version of the data.
Use-cases DataLad covers are not limited to "consumption" of data. DataLad aims also to help publishing original or derived data, thus facilitating more efficient data management when collaborating or simply sharing your data. You can find more documentation at http://docs.datalad.org .
See CONTRIBUTING.md if you are interested in internals or contributing to the project.
On Debian-based systems we recommend to enable NeuroDebian
from which we provide recent releases of DataLad. datalad package recommends
some relatively heavy packages (e.g. scrapy) which are useful only if you are
interested in using crawl
functionality. If you need just the base
functionality of the datalad, install without recommended packages
(e.g., apt-get install --no-install-recommends datalad
)
By default, installation via pip installs core functionality of datalad
allowing for managing datasets etc. Additional installation schemes
are available, so you could provide enhanced installation via
pip install datalad[SCHEME]
where SCHEME
could be
crawl
to also installscrapy
which is used in some crawling constructstests
to also install dependencies used by unit-tests battery of the dataladfull
to install all dependencies.
For installation through pip
you would need some external dependencies
not shipped from it (e.g. git-annex
, etc.) for which please refer to
the next section.
Our setup.py and accompanying packaging describe all necessary dependencies.
On Debian-based systems we recommend to enable NeuroDebian
since we use it to provide backports of recent fixed external modules we
depend upon, and up-to-date Git-annex is necessary for proper operation of
DataLad packaged (install git-annex-standalone
from NeuroDebian repository).
Additionally, if you would like to develop and run our tests battery see
CONTRIBUTING.md regarding additional dependencies.
Later we will provide bundled installations of DataLad across popular platforms.
MIT/Expat
It is in a alpha stage -- nothing is set in stone yet -- but already usable in a limited scope.