This repository contains Jupyter notebooks, images, PDFs, etc., prepared for the course Introduction to Data Science, offered to Ph.D. students at TIFR Hyderabad (https://moldis-group.github.io/teaching.html).
This material is made available on GitHub to encourage others to access it freely, maintain a local copy, and maybe even contribute corrections or new material. You can follow the three steps listed below freely, i.e., without committing to any responsibilities.
- If you think you will ever use (or reuse) this material, or a part of it, for any purpose, sign in to GitHub (create an account if you do not have one; maybe a Google account works for this too) and then click the 'Fork' button at the top right. You will then get your own copy to play with. GitHub will also show when changes are made to this master version, and you will be able to merge them into your own copy of this repository. Others can also pull the changes you make in your version.
- To download the content to your computer, type the following in a terminal:
git clone https://github.com/raghurama123/DataScience.git
or click the 'Code' button above and then click 'Download ZIP'.
If you have also forked the material, replace 'raghurama123' in the above line with your username.
- If you want to try the material in a web browser, i.e., to test the code or make small changes and run the code, you can access this repository at the interactive platform Binder by clicking the link: https://mybinder.org/v2/gh/raghurama123/DataScience/HEAD
If you have also forked the material, replace 'raghurama123' in the above link with your username.
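The three steps above can be sketched as a few shell helpers. This is only a minimal sketch, not part of the course material: the function names (`fork_url`, `binder_url`, `clone_fork`, `sync_fork`) and the remote name `upstream` are illustrative choices, the username passed to them is a placeholder for your own, and it assumes your fork keeps the repository name DataScience. The branch name given to `sync_fork` is also an assumption that you should check against the master version.

```shell
#!/bin/sh
# Sketch of the fork workflow; all function names here are hypothetical helpers.
set -eu

# Master version of the course material (URL taken from the clone step above).
UPSTREAM_URL="https://github.com/raghurama123/DataScience.git"

# Print the clone URL of a fork owned by user $1
# (assumes the fork keeps the repository name DataScience).
fork_url() {
    printf 'https://github.com/%s/DataScience.git\n' "$1"
}

# Print the Binder link for a fork owned by user $1.
binder_url() {
    printf 'https://mybinder.org/v2/gh/%s/DataScience/HEAD\n' "$1"
}

# Clone the fork at URL $1 into directory $3 and register the
# master version at URL $2 as a second remote named "upstream".
clone_fork() {
    git clone "$1" "$3"
    git -C "$3" remote add upstream "$2"
}

# Fetch the master version and merge its branch $2 into the branch
# currently checked out in directory $1.
sync_fork() {
    git -C "$1" fetch upstream
    git -C "$1" merge "upstream/$2"
}
```

For example, `clone_fork "$(fork_url your-username)" "$UPSTREAM_URL" DataScience` would clone your fork into a directory named DataScience, and `sync_fork DataScience <branch>` would later merge new changes from the master version, where `<branch>` is whatever its default branch is called.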
The syllabus of this course is evolving over time. The original plan was to cover the following topics:
- Data Science: Big Data, Facets of data (structured/unstructured data)
- Toolboxes: Python libraries, scikit-learn, pandas
- Statistics: Distributions, Outliers, Skewness, Pearson’s/Spearman’s/Kendall’s correlation coefficients, Kernel density estimation
- Statistical Inference: Hypothesis testing, Confidence Intervals
- Supervised Machine Learning: What is machine learning? Learning curves, Support Vector Machines, Random Forest
- Regression: Linear Regression, Logistic Regression
- Unsupervised Machine Learning: Clustering, Case studies
- Big Data concepts: Handling large data, Hadoop, Spark, NoSQL, Graph databases, Natural language processing, MapReduce
Resources:
- 10 minutes to pandas
- Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications, Laura Igual, Santi Seguí, Springer (2017).
- Introducing Data Science, Davy Cielen, Arno D. B. Meysman, Mohamed Ali, Manning (2016).
- Learn Git
For comments, questions, suggestions or requests please write to [email protected]