readtimearray on duplicate timestamps behaviour #451

Open
klangner opened this issue Apr 19, 2020 · 9 comments

Comments

@klangner

Currently, when trying to read data from a CSV file with duplicate timestamps, the function crashes.

Maybe it would be better to add a parameter to this function so that it reads as many rows as possible and returns a partial result instead of crashing?
Or maybe just skip duplicate or out-of-order items?

BTW, does Julia have some kind of optional type, like Haskell's Maybe? Then the function could at least return that instead of crashing the program.

@iblislin
Collaborator

iblislin commented May 5, 2020

Hi @klangner

Maybe it would be better to add a parameter to this function so that it reads as many rows as possible and returns a partial result instead of crashing?
Or maybe just skip duplicate or out-of-order items?

Well, in this case, I think you can load the CSV into a DataFrame first, remove the duplicated rows, and then build it with TimeArray(df, timestamp = :MyTimeColumn).
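Something like this, for example (a sketch; the file name and column name are just placeholders):

using CSV, DataFrames, TimeSeries

# Load the CSV into a DataFrame; "prices.csv" and :MyTimeColumn are placeholders.
df = DataFrame(CSV.File("prices.csv"))

# Drop rows with a duplicated timestamp (keeps the first occurrence) and sort by time.
unique!(df, :MyTimeColumn)
sort!(df, :MyTimeColumn)

# The constructor should now accept the cleaned time index.
ta = TimeArray(df, timestamp = :MyTimeColumn)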

BTW, does Julia have some kind of optional type, like Haskell's Maybe? Then the function could at least return that instead of crashing the program.

I guess it's Missing?
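For instance (a rough illustration, not code from this package):

# Julia's "optional" is usually spelled as a Union with Missing.
x = missing                        # the single value of type Missing
v = Union{Missing, Float64}[1.0, missing, 3.0]

ismissing(x)                       # true
coalesce(x, 0.0)                   # 0.0, falls back to a default when missing
sum(skipmissing(v))                # 4.0, ignores the missing entries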

@imbrem

imbrem commented Jun 3, 2020

I currently implemented this with a very dirty hack, namely passing in open(`uniq FILE_NAME`), but I would appreciate a flag to just ignore out-of-order entries.
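Roughly what I mean (a sketch; the file name and format string are made up, and note that uniq only removes adjacent duplicate lines):

using TimeSeries

# open on a Cmd runs `uniq` and hands readtimearray an IO stream with
# adjacent duplicate lines already removed. "data.csv" is a placeholder.
ta = readtimearray(open(`uniq data.csv`); format = "yyyy-mm-dd HH:MM:SS")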

@iblislin
Collaborator

iblislin commented Jun 4, 2020

Hi @imbrem
Could you show an example case that contains duplicated timestamps?
I'm also wondering:

  1. If there is a time index in ascending order, 2011/1/1, 2011/1/2, 2011/1/2, 2011/1/2, 2011/1/3, with three duplicated timestamps, which ones do you expect to be skipped?
  2. If there is a time index in descending order, which ones do you expect to be skipped?

About out-of-order cases: I'm also curious whether there is an algorithm that can determine which entries are out of order.

@klangner
Author

klangner commented Jun 4, 2020

Hi @iblis17,
I would say that you can find duplicate timestamps when dealing with daylight saving time.
Quite often in the data you will see one hour missing and, half a year later, one hour of duplicated data.
It can also happen when the data is not added in increasing time order, e.g. you get the data from multiple sensors but in batch mode, so you end up with batches that can have overlapping timestamps.
IMHO, when you work with real data, anything can happen :-)

@iblislin
Collaborator

iblislin commented Jun 4, 2020

I would say that you can find duplicate timestamps when dealing with daylight saving time.
Quite often in the data you will see one hour missing and, half a year later, one hour of duplicated data.

Oh, so in this case the data is still in the proper order; only the time index is not ideal.
I think applying lag, lead, or other time series methods to it is still reasonable.
I will consider relaxing the constraint on the time index, maybe allowing duplicates.

It can also happen when the data is not added in increasing time order, e.g. you get the data from multiple sensors but in batch mode, so you end up with batches that can have overlapping timestamps.

But for this case, I do not think the methods provided by TimeSeries.jl can be applied to the data.
It makes no sense if the user wants to lag, lead, take a moving window, etc. on it.
So what functionality could we improve or provide to help with this kind of data?

@iblislin
Collaborator

iblislin commented Jun 4, 2020

Ah, and I just recalled that we have an unchecked option, so you can get an out-of-order or duplicated time index to work.

TimeArray(ts, vector; unchecked = true)
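For example (made-up data, just to show the keyword):

using Dates, TimeSeries

# A duplicated timestamp in the index; the values are made up.
ts   = [Date(2011, 1, 1), Date(2011, 1, 2), Date(2011, 1, 2), Date(2011, 1, 3)]
vals = [1.0, 2.0, 2.5, 3.0]

# unchecked = true skips the time-index validation, so this does not throw.
ta = TimeArray(ts, vals; unchecked = true)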

@iblislin
Collaborator

iblislin commented Jun 4, 2020

Anyway, I made a PR for accepting a duplicated but sorted time index.

#455

@imbrem

imbrem commented Jun 4, 2020

That works fine, but could it also be possible to add an option to actually remove out-of-order or duplicate timestamps, and/or actually go back and update their values in the result array? If desired, I can write the PR for this.

@iblislin
Collaborator

iblislin commented Jun 5, 2020

could it also be possible to add an option to actually remove out-of-order or duplicate timestamps

@imbrem yeah, PRs are welcome.

and/or actually go back and update their values in the result array?

The updating part still needs more discussion, and I need some time to think about it.
