
csv file time history #115

Open · wants to merge 15 commits into main

Conversation

@jsantner jsantner commented Jun 14, 2018

  • Fixes #
  • Tests added
  • Added entry into CHANGELOG.md
  • Documentation updated

Changes proposed in this pull request:

  • Add a test .yaml file that uses a .csv file for its time history
  • Alter the validation module so that reading a time history from a .csv file does not raise errors.

@pr-omethe-us/chemked

@codecov codecov bot commented Jun 14, 2018

Codecov Report

Merging #115 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master   #115   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           4      4           
  Lines         966    987   +21     
  Branches      226    231    +5     
=====================================
+ Hits          966    987   +21
| Impacted Files      | Coverage Δ          |
|---------------------|---------------------|
| pyked/validation.py | 100% <100%> (ø) ⬆️ |
| pyked/chemked.py    | 100% <100%> (ø) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bryanwweber (Member)

@jsantner Why are these changes needed?

@jsantner (Author)

@bryanwweber
Previously, when loading a yaml file with a time history defined in a csv file, the validator gave this error:

  File "c:\users\jsantne\documents\github\pyked\pyked\validation.py", line 255, in _validate_isvalid_history
    n_cols = len(value['values'][0])

KeyError: 0

You can see this on the Travis report after my second commit on this branch, where I had only added a test yaml file with time history defined in a csv file.
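For illustration, a minimal sketch of the failure mode (the filename here is hypothetical): when the time history points at a csv file, value['values'] is a dict, so indexing it with 0 raises KeyError instead of returning the first row.

```python
value = {'values': {'filename': 'pressure_history.csv'}}  # csv-style entry (hypothetical name)
n_cols = len(value['values'][0])  # KeyError: 0 -- a dict is indexed by its keys, not by position
```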

@bryanwweber bryanwweber left a comment (Member)

@jsantner Thanks for submitting this! A few suggestions:

  1. I think the assumption is that the CSV filename will be specified relative to the directory of the YAML file, hence the directory argument should be unnecessary. If that's not working, I'd be interested to see a failing test case.
  2. If possible, I think we should load and check the CSV file as well. However, I think that will fit better when we refactor all of the validation, so we can skip it for now.

# If reading from a file, the file will not be validated.
# A file can have an arbitrary number of columns, and the columns
# to be used are specified.
if type(value['values']) is list:
Member

Why not 'filename' not in value['values'].keys()?

Author

I agree that we can assume the CSV file is specified relative to the yaml file, but DataPoint doesn't know where the yaml file directory is unless it's specified as an argument to __init__, right? Is there a simpler way to deal with a relative path?

I tried using 'filename' not in value['values'].keys() in a previous commit, but this fails when the values are given as a list. In that case, value['values'] is a list, not a dict, so calling value['values'].keys() raises an error.
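A quick sketch of that failure mode:

```python
values = [[0.0, 1.0], [0.1, 1.2]]  # time history given inline as a list of rows
values.keys()  # AttributeError: 'list' object has no attribute 'keys'
```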

Member

As far as I know, just trying to open the file will try to open it relative to the working directory of the Python process. Are you saying that if I have a file structure like

|- database
|---butanol
|------file_1.yaml
|------file_1.csv

and I start Python in the database directory, and load file_1.yaml like

>>> ChemKED('butanol/file_1.yaml')

it won't work, because Python will assume the file_1.csv is relative to database, not butanol?

As on your other PR, I think the simpler/more "pythonic" way to do this is a try...except block, rather than checking the type of the value.

Author

Yes, that's exactly what I'm saying. I was using a script to read multiple yaml files in a complex directory structure within a database directory, and Python was looking for the csv file in database, not in the folder with the yaml file.
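A sketch of the lookup problem with the layout above (paths hypothetical):

```python
from pathlib import Path

yaml_file = Path('butanol/file_1.yaml')  # loaded while the process runs in 'database/'
csv_name = 'file_1.csv'                  # as written inside the yaml file

# open() resolves relative names against the process CWD ('database'),
# so open(csv_name) looks for database/file_1.csv -> FileNotFoundError.
# Resolving against the yaml file's directory finds it:
# open(yaml_file.parent / csv_name)  -> database/butanol/file_1.csv
```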

I'm not sure that a try...except block would work well here. Are you thinking of something like this? It seems more complex and confusing this way, and it puts a lot of code between the try and except:

try:
    if 'filename' in value['values'].keys():
        # Don't do anything because csv files aren't validated
        pass
    else:
        # This should never happen. If value['values'] is a dictionary with keys, 'filename' should be a key
        self._error(field, 'must include filename or list of values')
except KeyError:
    # value['values'] is probably a list.
    # Code from earlier that checks the number of columns

Since this PR is related to time histories, I have a somewhat related question for you that I just stumbled on. Let's say somebody has a csv file with three columns - time, pressure, and volume. Right now, these must be implemented as two separate time-histories and the csv will be loaded twice, right? Is there interest in allowing the user to specify multiple time-histories using a single file?

Member

Just to keep this focused on the code here, I moved the other discussion to the main comment thread. Anyhow, two things:

  1. I want to validate csv files in the future, so we might as well set up for that here
  2. The code could be
try:
    if 'filename' not in value['values'].keys():
        self._error(field, 'must include filename or list of values')
except AttributeError:
    # Code from earlier that checks the number of columns

or

try:
    filename = value['values'].get('filename')
    if filename is None:
        self._error(field...
except AttributeError:
    # Code from earlier

which isn't all that confusing to me. The reason (to me) to avoid the type function is that it doesn't always handle inheritance in a straightforward way, so we'd be relying on the underlying YAML library to always return something whose type is exactly list, never a subclass. On the other hand, with the try-except, we're using the duck-typing in Python to try something that we expect to be the case, and catch the resulting errors.
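A sketch of that inheritance pitfall (the subclass name is hypothetical):

```python
class YAMLList(list):
    """Stand-in for a list subclass a YAML library might return."""

values = YAMLList([[0.0, 1.0]])
type(values) is list        # False: the exact-type check misses subclasses
isinstance(values, list)    # True: respects inheritance
```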

@bryanwweber bryanwweber (Member) commented Jun 26, 2018

> Yes, that's exactly what I'm saying. I was using a script to read multiple yaml files in a complex directory structure within a database directory, and Python was looking for the csv file in database, not in the folder with the yaml file.

OK, that's a case that we missed for sure. What if, rather than passing the directory name around, we turn the yaml_file into an instance of a Path object? Then we can find the path of the CSV file by doing yaml_file.parent/csv_filename. As of Python 3.6, the built-in open functions (which NumPy relies on) all accept instances of Path (see https://docs.python.org/3/library/functions.html#open), so we only really have to handle the Python 3.5 case, for which we can write a custom open function that just converts the filename to a string, something like

import sys

if sys.version_info < (3, 6):
    oldopen = open
    def open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None):
        return oldopen(str(file), mode, buffering, encoding, errors, newline, closefd, opener)

I should note that this is just off the top of my head, and there may be a better way to handle this backwards compatibility. Also, this doesn't help in the case of using a dictionary as the input. I'm not sure there's a good way to handle that, though.

> Since this PR is related to time histories, I have a somewhat related question for you that I just stumbled on. Let's say somebody has a csv file with three columns - time, pressure, and volume. Right now, these must be implemented as two separate time-histories and the csv will be loaded twice, right? Is there interest in allowing the user to specify multiple time-histories using a single file?

The only problem I see with loading twice is that it might take some time to load the file from disk. I think it makes more sense to keep the specifications of the time histories separate, and try to cache the data in the csv file somehow, rather than loading it from disk twice.
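One possible caching approach (a sketch, not PyKED's current API) is to memoize the csv reader so that two time histories naming the same file only hit the disk once:

```python
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=None)
def _load_csv(filename):
    # Cached by filename string; the returned array is shared between
    # callers, so treat it as read-only (or copy before mutating).
    return np.genfromtxt(filename, delimiter=',')
```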

pyked/chemked.py Outdated
-values = np.genfromtxt(hist['values']['filename'], delimiter=',')
+filename = hist['values']['filename']
+if not isabs(filename):
+    filename = join(directory, filename)
Member

I think this will fail if the input is a dict and therefore directory is None

Author

Changing the default directory to '' would fix that. But, if the input is a dict, then the filename must be specified as an absolute path, right? Since there's no yaml file, a path relative to a yaml file wouldn't make sense. So, line 693 won't be run anyway.

Member

I don't think that change will fix the problem on this line, because the directory argument gets set to None if yaml_file is None, so this line will join None and filename, which won't work. Actually, I think that if people provide a dictionary input, the CSV file has to be specified relative to the PWD of the Python process, or as an absolute file name, but we don't need to check for that case.

Author

Good catch. I just pushed another commit so that directory will never be set to None. Now, if yaml_file is None, then directory will be ''.
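For reference, a sketch of why '' is a safe default where None is not:

```python
from os.path import join

join('', 'file_1.csv')      # -> 'file_1.csv', still relative to the CWD
# join(None, 'file_1.csv')  # raises TypeError
```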

@jsantner (Author)

I've never used Path objects before; that's an interesting idea. How would the Path object be sent from ChemKED to DataPoint, where it's used to read the csv file, though? It still must be passed as an argument, right? I can add that functionality if you think it's better than passing the directory.

Using a dictionary input, I think the filename would have to be specified as an absolute path. If it were a relative path, what would it be relative to? There's no yaml file.

@bryanwweber (Member)

What if we do the path munging in the loop that creates the DataPoints? Something like

for point in self._properties['datapoints']:
    if 'time-histories' in point:
        for th in point['time-histories']:
            try:
                filename = Path(th['values']['filename'])
            except TypeError:
                pass
            else:
                if yaml_file is not None:
                    th['values']['filename'] = (yaml_file.parent/filename).resolve()
                else:
                    th['values']['filename'] = filename.resolve()
    self.datapoints.append(DataPoint(point))

(please correct any indentation errors, writing code with proportional fonts is hard...) Then we don't have to pass anything around. This will resolve the path into an absolute path relative to the yaml file (if given) or relative to the CWD of the Python process, which should handle the dictionary and the yaml file cases gracefully.

BTW, one of the reasons I'm pushing back here is that I don't want to change how DataPoint is called, if at all possible.

Can you please make sure to add tests for all these code branches? The diff coverage should be 100%, you can see the lines that haven't been run here: https://codecov.io/gh/pr-omethe-us/PyKED/pull/115/diff where the lines that are in the brighter red in the left column weren't executed during testing.

@jsantner (Author)

That's a smart way to do it; I'll add it in and test it.
I wasn't sure how to make a test for an error message (line 256 in validation.py); that's why the coverage isn't 100%. Can you point me to an example in the tests where an error message is tested?

@bryanwweber bryanwweber (Member) commented Jun 27, 2018

https://github.com/pr-omethe-us/PyKED/blob/master/pyked/tests/test_chemked.py#L59

By the way, you'll also have to turn the yaml_file into a Path when that gets processed in the __init__ method.

if yaml_file is not None:
    yaml_file = Path(yaml_file)
    with open...
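For the error-message test, a sketch of the usual pytest pattern; the fixture data, the dict_input call, and the exact exception type are assumptions for illustration, not PyKED's confirmed API:

```python
import pytest

from pyked.chemked import ChemKED

def test_history_must_include_filename_or_values(bad_properties):
    # bad_properties is a hypothetical dict fixture whose time history
    # has neither a 'filename' nor a list of values
    with pytest.raises(ValueError) as excinfo:
        ChemKED(dict_input=bad_properties)
    assert 'must include filename or list of values' in str(excinfo.value)
```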

@kyleniemeyer kyleniemeyer left a comment (Member)

this is a test PR review

@@ -11,7 +11,7 @@ A BibTeX entry for LaTeX users is
 ```TeX
 @misc{PyKED,
 author = {Kyle E Niemeyer and Bryan W Weber},
-year = 2017,
+year = 2018,
Member

It is now 2019

Member

Suggested change:
-year = 2018,
+year = 2020,
