Provides instant access to many datasets right from PySpark (as Spark DataFrames).
Drop a star if you like the project 😃. It motivates me 💪 to keep working on such projects.
The idea is simple. Many datasets are available out there, but they are scattered in different places across the web. Is there a quick way (in PySpark) to access them instantly, without the hassle of searching, downloading, and reading them? SparkDataset tries to address that question :)
Start by importing data():
from sparkdataset import data
- To load a dataset:
titanic = data('titanic')
- To display the documentation of a dataset:
data('titanic', show_doc=True)
- To see the available datasets:
data()
- To search for datasets by a term:
data('ab')
Did you mean:
crabs, abbey, Vocab
That's it.
Go to this notebook for a demonstration of the functionality
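The term search shown above (data('ab') answering with "Did you mean: ...") can be approximated with Python's standard library. This is only a rough sketch: the source does not say how sparkdataset implements its search, so difflib is used here as a stand-in, and the function name suggest, its parameters, and the sample name list are all hypothetical.

```python
from difflib import get_close_matches


def suggest(term, names, limit=5, cutoff=0.6):
    """Return up to `limit` dataset names that resemble `term`, ignoring case.

    Hypothetical helper for illustration; not part of sparkdataset.
    """
    # Map lower-cased names back to their original spelling so that
    # matching is case-insensitive but results keep their original case.
    by_lower = {name.lower(): name for name in names}
    hits = get_close_matches(term.lower(), list(by_lower), n=limit, cutoff=cutoff)
    return [by_lower[hit] for hit in hits]


if __name__ == "__main__":
    # A made-up list of names; the real package ships its own catalogue.
    sample_names = ["titanic", "iris", "mtcars", "Vocab"]
    print("Did you mean:")
    print(", ".join(suggest("titanik", sample_names)))
```

Lowering the cutoff makes the search more forgiving and returns looser matches; raising it keeps only near-identical names.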
In R, there is a very easy and immediate way to access multiple statistical datasets with almost no effort. All it takes is one line: data(dataset_name).
This makes life easier for quick prototyping and testing.
Well, I am jealous that PySpark does not have similar functionality.
Thus, the aim of sparkdataset is to fill that gap.
Currently, sparkdataset has about 757 (mostly numerical) datasets, which are based on RDatasets.
In the future, I plan to scale it to include a larger set of datasets.
For example,
- include textual data for NLP-related tasks, and
- allow adding a new dataset to the in-module repository.
$ pip install sparkdataset
$ pip uninstall sparkdataset
$ rm -rf $HOME/.sparkdataset
1.0.0
- Added dataset search by name similarity.
- Example:
>>> data('heat')
Did you mean:
Wheat, heart, Heating, Yeast, eidat, badhealth, deaths, agefat, hla, heptathlon, azt
- Added support for Windows.
- pandas
- pyspark :: 3.1.2
- Tested on OSX and Linux (Debian).
- Supports Python 3 (3.8.8 and above).
- Add textual datasets (e.g. NLTK corpora).
- Add sample generators.
- RDatasets: R's datasets collection.
The logo credit goes to Aleksandar Savic