Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download data #72

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Download data #72

wants to merge 8 commits into from

Conversation

JamesKohlsRepo
Copy link
Collaborator

Converted my fork into a branch called Download-Data

these are intended to be run as pipenv scripts:
pipenv run download_data
pipenv run delete_data

they can also be run from file
Updated Pipfile to specify versions.
Updated Scripts definitions.
Moved scripts to process_data repository.
Edited header comment in download_data.
Pipfile Show resolved Hide resolved
mediabridge/data_processing/download_data.py Show resolved Hide resolved
@jhanley634
Copy link
Collaborator

Hullo, James. Just checking in. This branch looks like it's nearly ready for a merge. Do my remarks mostly make sense? Is there any trouble with $ git mergetool for resolving merge conflicts against main? We are here to help -- pipe up and we can lend a hand.

Copy link
Collaborator

@skyfenton skyfenton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good, but make sure to merge changes from main and delete your other PR.

Corrected nits and did some clean-up. Simplified the download_data function with pathlib.
generate new pipfile
Manually combined pipfile and regenerated pipfile.lock, reverted requests version to try and avoid problems
tried to fix errors in pipfile


def main():
data_dir = os.path.join(os.path.dirname(__file__), "../../data")
Copy link
Collaborator

@jhanley634 jhanley634 Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly this works, and you can merge it as-is. But consider calling normpath() on it, for better quality diagnostics if / when something downstream goes south. The pathlib equivalent would be .resolve(). The "../.." in the middle of what we're passing around has unambiguous meaning to the machine. But humans tend to get a little lost with all the up and down when trying to interpret a message like "bad thing happened with foo/bar/../baz/file.txt".

If you do choose to revisit this prior to merging down, consider DRYing up the two different definitions of data_dir so there's just one. The project's usual idiom seems to be: from mediabridge.definitions import DATA_DIR

(The clean_data_folder() seems like a long-winded way of saying .rmtree() to nuke followed by mkdir(), but maybe there's some permission attribute on the data/ directory that we're trying to preserve.)


It looks like you still need to get Pipfile merged, or ask one of us to jump in on your feature branch and merge it for you. Fortunately it's small, so it's easy to eyeball the differences with main branch, which you can conveniently view on the github site in a web browser. Keeping entries alphabetical eases that task.

Don't worry too much about the .lock file, as Pipfile is the important one. A pipenv lock will regenerate it and we can just "use ours" to resolve any merge nonsense. Good luck! Remember to click "delete" on your feature branch once the squash merge is complete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide script that downloads Netflix prize data and puts it in the right place
3 participants