Wishlist for FSDL v2 #4
Comments
Suggestion from @sayakpaul: add a section to the troubleshooting lecture on practices that are common in the research world but may not be worth the added complexity in the real world. Examples:
@josh-tobin, do you mind sharing the platforms, approaches, and tools you have in mind for covering "Managing data at a larger scale"? Or do you plan to keep it platform-independent? One approach that I have found useful for quite a while now (apologies if it is naive):
This is somewhat similar to how I've done it in the past. Some considerations are HDFS vs. GCS/S3/etc., and how to build performant data loaders. I'd want to dig into this some more before making any concrete recommendations though.
Great! This could also be extended to show how much impact a data input pipeline has on training a model with good hardware utilization.
Data management should include newer data formats such as:
Delta Lake: originated at Databricks, open-sourced under the Linux Foundation, with the greatest momentum due to its integration with Databricks Cloud, Spark/SparkSQL/SparkStreaming, and MLflow.
Apache Hudi: originated at Uber, suitable for upserts.
All of these have the time-travel support needed for data versioning. MLflow is now integrated with Delta Lake to do data version control.
Monitoring --
More coverage of data engineering, such as massively parallel programming techniques and other cutting-edge solutions for making big data ready for DL/ML.
Data
How to effectively handle long-tail data
@DanielhCarranza, could you say more about the reactive/proactive approaches you have in mind?
I'd love to see a discussion on peer review (maybe it fits in the section on teams, or in the testing/deployment section?). There are a lot of pieces that need to be reviewed!
I'm aware of a couple of good blog posts on the topic, but I'm not sure anything definitive exists.
If we decide to do another version of the course, here are some new topics that could be exciting to add. This is off the top of my head, so feel free to suggest other topics.
Bias / fairness
Deployment
Troubleshooting
Testing
Monitoring
Data
Infrastructure / tooling
Model lifecycle management