Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in the basis set parser #14

Open
q-posev opened this issue Dec 26, 2022 · 4 comments
Open

Bug in the basis set parser #14

q-posev opened this issue Dec 26, 2022 · 4 comments

Comments

@q-posev
Copy link

q-posev commented Dec 26, 2022

The regex that is currently used to parse the basis set files (const prim_pat = r"([+-]?\d*?\.\d+[D+-]{0,2}\d\d)\s+?([+-]?\d*?\.\d+[D+-]{0,2}+\d\d)" in BasisParser.jl) is not applicable to all available basis sets.
For example, the 6-311+g file contains floating point values that are not in scientific notation (e.g. 1.0000 instead of 1.0D+00). So the parser crashes because it cannot parse these values. Perhaps this can be handled via the try/catch mechanism and using an alternative regex.
Let me know if you need help fixing this. I am new to Julia but can work on the PR in my free time.

@gustavojra
Copy link
Member

Hey @q-posev, thank you for pointing that out.

The original regex could handle numbers that are not in scientific notation. The quantifier {0,2} before [D+-] indicates that this bit was optional.

The problem was this particular piece after the period: \d+[D+-]{0,2}\d\d). Written like this, the regex expects at least one decimal after the period followed by the optional scientific notation marker (D+/D-) and then two more obligatory decimal digits... Consequently, numbers not in scientific notation would not be parsed unless they had at least 3 digits after the period.

This broke down as soon as I tried fluorine using the basis set you suggested. Moreover, it also breaks dramatically for potassium because its first entry:

K     0
S    6   1.00
 182594.                     0.000227747

No decimal digits at all...

I updated the regex to be more flexible, in case you're interested here is what I came up with

const prim_pat = r"([+-]?\d*?\.\d*(?:D[+-]\d+)?)\s+?([+-]?\d*?\.\d*(?:D[+-]\d+)?)"

Note that \.\d+ not became \.\d*. So I dropped the obligatory digits after the decimal mark. Also, I put the scientific notation bit inside a non-capturing group (?:X) where X = D[+-]\d+. The ? just after the group indicates that this part is optional. However, if D+ or D- is matched, there has to be at least one digit after it... Because a number like 1.0D+ does not make any sense to me.

Anyways, you can check out the PR right here #15 . I just wanted to give you some feedback.

@q-posev
Copy link
Author

q-posev commented Dec 28, 2022

Hi @gustavojra, thank you for the prompt fix and for the explanation! I will have a look at the PR.
Indeed, I forgot to mention that it failed when I tried to run molecules containing F and Cl atoms. This fix is also important since many basis sets from BasisSetExchange contain values in the decimal notation. I also noticed that the 6-311g basis set is missing in the lib folder. Was it intentional?

@gustavojra
Copy link
Member

Not, it is not intentional, back when I created the lib folder I just downloaded all basis set in the BSE. Maybe this one wasn't downloaded for some reason? Either way, for standard basis set files like that we can always create a simple PR including new files.

@q-posev
Copy link
Author

q-posev commented Jan 4, 2023

OK, thanks @gustavojra ! I can prepare a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants