Clean Data Export #36

Is it possible to create a 'clean' function that removes the flagged observations?

Suggestions:
- Add an argument 'which' that can either be 'ALL' or a list of rules, with the default being 'ALL'.
- Add an argument that simply deletes the whole row.

```r
clean(flaggedobject, which='ALL', replace=NA)
clean(flaggedobject, which=c(1,2,4), replace=99999)
clean(flaggedobject, which='ALL', deleterow=TRUE)
```

Comments
Cool, thanks @lukeloken. Great ideas. |
The one change I might suggest is:

```r
clean(flaggedobject, which='ALL', replace=NA)   # to replace w/ NA
clean(flaggedobject, which='ALL', replace=NULL) # to delete the row
```

and the ability to do this:

```r
clean(object, which, ..., replace) # as the function arguments
```

which would support this:

```r
clean(object, 'x > 9999', 'persist(x) > 10', 'MAD(x) > 3', replace=NA)
```

Thoughts @lukeloken and @aappling-usgs ? |
I like the replace=NULL option. Do you want the ability to 'clean' an unflagged object? Or is it more straightforward to run two commands? |
I like that option personally, because then I can use sensorQC to really rapidly clean a vector or data.frame with one line. The key is having it be a good thing for power users without getting in the way of a more normal workflow. |
I like having both options, but I don't know how difficult this is to code. |
oh yeah, for sure, I was thinking both. |
I have another question for you, Jordan: how would you recommend removing a single observation that is clearly an outlier if there are also NA values? I think the MAD function does not work properly if there are NA values in the series. The two {flag} commands do not treat the 10000 observation the same. |
Whoops, that is a bug. Good catch! #37

Also note that you can do:

```r
values1 <- c(runif(10,2,4), 10000, runif(5,2,4))
values2 <- c(runif(10,2,4), 10000, NA, runif(4,2,4))
flag(values1, "is.na(x)", 'MAD(x) > 3')
flag(values2, "is.na(x)", 'MAD(x) > 3')
```

for really simple, quick QAQC. |
@lukeloken you should be able to try this after doing another install. Check out the readme for the pattern: https://github.com/USGS-R/sensorQC/blob/master/README.md Can you let me know if that is what you expected? (Note that I changed 'ALL' to 'all' as an argument, just to be consistent with the other packages it is used with.) |
Jordan, I think this works great. The only comment I have is in regard to the output of {clean}. Is there any reason to keep this structured as a list rather than unlisting it to a data.frame class? That would save a line of code after processing. |
I think I need to think on that... |
No rush, I was just curious if there was a reason to keep the object class; ...$sensor and [[1]] both get to the data.frame. One reason to keep it as is: it will only print the first 15 observations, which is very handy. |
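A minimal sketch of those two access patterns (assuming a flagged object as in the examples above):

```r
cleaned <- clean(flagged, replace=NA)
df1 <- cleaned$sensor  # by name
df2 <- cleaned[[1]]    # by position; both reach the same data.frame
```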
Yeah, another little trick to that: you can use print(sensor, 20) to print any number of rows; 15 is the default. In the future, I want the sensor w/o any flags (the thing you get back from {clean}) to be a data.frame. |
Sounds good. That will make it easiest to merge multiple 'cleaned' objects stemming from the same original data file. |
Thanks Jordan. The next step for me will be building individual rules for each parameter. Essentially, I'd like to build a table of parameters and rules (variable, rule1, rule2, etc.) and then run a loop to apply each set of rules to the appropriate parameter:

```r
file <- read(...)
for (col in 2:ncol(file)){
  data <- file[, c(1, col)]
  rules <- # something that looks at names(data)[2] and references the table above
  newdata <- clean(data, rules, replace=NA)
  if (col == 2){
    finaldata <- newdata}
  if (col > 2){
    finaldata <- merge(finaldata, newdata, by="times", all=TRUE)}
}
```

If you have any suggestions, I'm all ears. |
Looks fine to me. You'll probably need

```r
newdata <- clean(data, rules, replace=NA)$sensor %>%
  setNames(c('times', variable))
```

though, to get the merge to work. Otherwise all of the data are going to have the same variable names (x). This is probably a more common use case than I imagined it would be originally, so let me know what the pain-points are if you have any. |
Hey Jordan, I think I'm getting somewhere with the automatic code. I may have found another bug in the {clean} function. It does not appear to accept data.frame objects:

```r
dates <- seq(as.POSIXct('1999-01-01'), by=1, length.out=500)
values1 <- c(runif(100,-5,12), 10000, runif(397,2,100), NA, -9)
file <- data.frame("DateTime"=dates, "sensor.obs1"=values1)
clean(file, 'x > 9999', 'persist(x) > 10', replace=NA)
```

Is there a quick way to convert a data.frame into a sensor object?

```r
# Make dataset
dates <- seq(as.POSIXct('1999-01-01'),by=1,length.out=500)
values1 <- c(runif(100,-5,12),10000, runif(397,2,100), NA, -9)
values2 <- c(runif(397,2,18), NA, -9, runif(100,-2,36),10000)
file <- data.frame("DateTime"=dates,"sensor.obs1"=values1, "sensor.obs2"=values2)
#Set rules
Variables=names(file)[-1]
Rule1<-c('x < 0', 'x < 0')
Rule2<-c('MAD(x) > 3', 'MAD(x) > 2')
RuleTable<-data.frame(Variables, Rule1, Rule2, stringsAsFactors=FALSE)
col=2
for (col in 2:ncol(file)){
data<-file[,c(1,col)]
name<-names(data)[2]
rules<- unlist(RuleTable[which(RuleTable[,1]==name), 2:3])
flagged<-flag(data, rules)
newdata<-clean(flagged, replace=NA)$sensor
# newdata<-clean(data, rules, replace=NA)$sensor
names(newdata)<-c('times',name)
if (col == 2){
finaldata<-newdata}
if (col>2){
finaldata<-merge(finaldata, newdata, by="times", all=TRUE)}
}
```
|
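A hedged alternative sketch of the same loop, using lapply() over the columns and Reduce() to chain the merges; it assumes flag() and clean() accept the same arguments as in the loop above:

```r
clean_column <- function(col){
  data <- file[, c(1, col)]
  name <- names(data)[2]
  rules <- unlist(RuleTable[which(RuleTable[,1] == name), 2:3])
  newdata <- clean(flag(data, rules), replace=NA)$sensor
  names(newdata) <- c('times', name)  # avoid duplicate column names before merging
  newdata
}
finaldata <- Reduce(function(a, b) merge(a, b, by="times", all=TRUE),
                    lapply(2:ncol(file), clean_column))
```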
Whoops, @lukeloken I hadn't created the method for data.frames. For now, the simple way to go from a data.frame to a sensor object is:

```r
sensor(data.frame(c(1), c(1)))
```
|
As soon as this #40 is merged, you can do clean(data.frame...). I would suggest keeping them as data.frames so you don't lose the times, as that is what you are going to join by. |
Hey Jordan, thanks for plugging away with sensorQC. I think we are getting somewhere. Quick question about MAD: how exactly does the code work? I know what mad() computes, but I'm not sure how the MAD(x) expression uses it. Hope you have a good weekend. Luke |
Nevermind, I think I found the correct code:

```r
abs(x - median(x)) / mad(x)
```
|
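A quick worked check of that formula (note that R's mad() includes the 1.4826 consistency constant by default):

```r
x <- c(2, 3, 2.5, 100, 3.2)
scores <- abs(x - median(x)) / mad(x)  # median(x) = 3, mad(x) = 0.5 * 1.4826
round(scores, 1)  # 1.3 0.0 0.7 130.9 0.3
scores > 3        # only the 100 is flagged
```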
Above figure: pH from Lake Mendota. Red = raw data, black = 'cleaned' data after running MAD(x) > 3.

So I have been running the MAD and some basic min/max expressions to remove outliers. In this example, the MAD function may not be appropriate, as it excludes values when we sampled for extended periods of time in different areas of the lake (see above). This occurs because we drove our boat into Cherokee Marsh, where the pH was drastically different than in the rest of the lake.

I'm thinking of additional rules to apply in the {flag} and {clean} functions. One method that comes to mind is applying the MAD function or other statistical tests over a rolling window, something similar to {runmean} in the 'caTools' package. Any thoughts on approaches to using a local outlier test rather than a global one? I realize this will increase the processing time, but it may be more appropriate for data strings that oscillate among multiple values. I could also see this being valid for stationary time series (e.g., a rain event triggers high NO3 and discharge for several days). |
Yeah, MAD should be used with a window if the data aren't expected to be stationary, so pretty much always. You can use MAD with a window as MAD(x, w). |
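A minimal sketch of that usage, composed from calls that appear elsewhere in this thread (sensor(), window(), flag()); the rolling window type arrives later in this thread (#42):

```r
sensed   <- sensor(data.frame(dates, values1))
windowed <- window(sensed, type='auto')    # burst-style windows
flagged  <- flag(windowed, 'MAD(x,w) > 3') # windowed MAD via the w slot
```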
Hey Jordan, I've been able to get the window function to work, but only with type='auto' and after generating a sensor object. It would be great to be able to 1) window a data.frame object and 2) specify the window width (e.g., type = 10-50 measurements or seconds). Hopefully we can pass this windowed object through {clean}/{flag} with 'MAD(w)>3'. I'm curious how your window procedure works. Specifically, does it look in both directions, and how does it handle the first/last observations?

```r
# col = column number of data values (2:ncol(datatable))
data <- datatable[, c(1, col)]
sensored <- sensor(data)
windowed <- window(sensored, type=10)
cleanedframe[, col] <- clean(windowed, 'MAD(w)>3', replace=NA)$sensor[, 2]
```
|
@lukeloken Another option would be to pass in a vector of window indices. Which is a priority to be implemented first: manual windowing based on time, or adding your own w values? |
@jread-usgs I see what your current window function does. It is designed for burst data in which you capture 30 s of 1 Hz data every 30 minutes. This obviously does not work for FLAMe data, which are collected continuously at 1 Hz. I would vote for building customized exclusion vectors. I've started a function below to compute a rolling MAD, but you probably have a faster/better way to do it. This is computationally heavy, but it seems like a good place to start (at least for my type of data). If the output of this function became the w slot, we could use 'w > 3' to flag.

One problem I've already encountered is with lots of observations of the same value. For example, in many of the windows for pH, more than half the observations are identical, making mad=0 and MAD=Inf. We certainly don't want to exclude observations for this reason.

```r
x <- data[,2] # vector of x values
window <- 20 # size of window in both directions to compute MAD
#Rolling MAD function. Windows subsequent observations and computes MAD statistic from local observations
rollingMAD<-function (x, window){
# Empty vector for computed MAD values
WindowedMAD<-rep(NA, length(x))
for (obs in 1:length(x)){
# Set interval of observations to compute MAD
# For observations near start and end only include real observations
seq<-(obs-window):(obs+window)
seq<-seq[which(seq<=length(x) & seq>=1)]
interval<-x[seq]
# Compute MAD
localmedian<-median(interval, na.rm=TRUE)
localmad<-mad(interval, na.rm=TRUE)
WindowedMAD[obs]<-abs(x[obs]-localmedian)/localmad
}
return (WindowedMAD)
}
```

We might also think about including some of the running sd code from caTools. |
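A hedged sketch of that caTools idea, swapping the explicit loop for runmed() and caTools::runmad() (assumptions: caTools is installed, and NA handling is ignored for brevity; both functions shrink the window near the ends of the series):

```r
library(caTools)  # assumed installed; provides runmad()

rollingMAD_fast <- function(x, window){
  k <- 2*window + 1                        # total width: `window` points each direction
  center <- runmed(x, k)                   # rolling median
  score  <- abs(x - center) / runmad(x, k, center=center)
  score[is.infinite(score)] <- NA          # guard against mad == 0
  score
}
```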
Made some progress on the rollingMAD function and cleaning a FLAMe dataframe. clean(x, 'w>3') will work after loading the local MAD values into the 'w' slot:

```r
sensored$sensor$w <- rollingMAD(sensored$sensor$x, window)
cleaned <- clean(sensored, rules, replace=NA)$sensor[,2]
```

No rush on looking at these, as it seems to be working at the moment. In the future, I'd be interested in how to increase processing speed.

```r
# ==========================================================
# Rolling MAD function.
# Windows subsequent observations and computes local MAD statistic
# ==========================================================
rollingMAD<-function (x, window){
# Create empty vector for computed MAD values
WindowedMAD<-as.numeric(rep(NA, length(x)))
for (obs in 1:length(x)){
# Set interval of observations
# truncate interval for observations near start and end
seq<-c((obs-window):(obs+window))
seq<-seq[which(seq<=length(x) & seq>=1)]
interval<-x[seq]
# Compute MAD
localmedian<-median(interval, na.rm=TRUE)
localmad<-mad(interval, na.rm=TRUE)
WindowedMAD[obs]<-abs(x[obs]-localmedian)/localmad
}
# Replace infinite values (mad==0) with NA
WindowedMAD[which(is.infinite(WindowedMAD))]<-NA
# Return vector of windowed MAD values.
# Equal length to x
return (WindowedMAD)
}
# ==========================================================
# Function to clean multiple parameters of a dataframe
# Col1 = times; col2 and up == variables
# Requires a data.frame of rules that match datatable column names
# window = number of observations to include in both directions for rollingMAD
# ==========================================================
sensorclean<-function(datatable, ruletable, window){
  # Create empty datatable to fill with cleaned values.
# Keep only column 1 (times)
cleanedframe<- datatable
cleanedframe[,-1]<-NA
#Clean all variables in datatable starting with column 2
for (col in 2:ncol(datatable)){
# Create sensor object and fill 'w' slot with rollingMAD
data<-datatable[,c(1,col)]
sensored<-sensor(data)
sensored$sensor$w<-rollingMAD(sensored$sensor$x, window)
# Set rules manually or from ruletable and clean sensor object
# rules<-as.character(c('w>3', 'is.na(x)'))
rules<- unlist(ruletable[which(ruletable[,1]==names(data)[2]), -1])
if (length(rules)==0){
cleanedframe[,col]<-data[,2]}
if (length(rules)>0){
cleanedframe[,col]<-clean(sensored, rules, replace=NA)$sensor[,2]
# Plot flagged (red) and retained (black) observations
flags<-data[,2]
flags[which(flags==cleanedframe[,col])]<-NA
plot(cleanedframe[,col], col="black", ylab=names(data)[2], xlab="time", ylim=range(data[,2], na.rm=TRUE))
points(flags, col="red", pch=16)
}
}
return(cleanedframe)
}
# =====================================================
# Workflow to clean FLAMe data after loading above functions
# =====================================================
# setwd("C:/Users/Luke/Dropbox/FLAME/Data/2015-08-24_SparklingLake")
datatable<-read.csv("SparklingLake2015-08-24.csv", header=TRUE, stringsAsFactors=FALSE)
datatable$ltime<-as.POSIXct(datatable$ltime, format="%Y-%m-%d %H:%M:%S", tz="UTC")
ruletable<-read.csv("C:/Users/Luke/Dropbox/FLAME/SensorQC/SensorQCRules_2015.csv", header=TRUE, stringsAsFactors=FALSE)
window <- 15 # size of window in both directions to compute MAD
CleanFLAME <- sensorclean(datatable, ruletable, window)
```

Example output from {sensorclean}: red = removed values. |
Nice! Looking good. Thanks for taking a stab at this. Happy to see if I can help the speed-up. |
@lukeloken how slow is this for you? Can you give me an indication of the processing time it takes for a reasonably sized dataset so I can see where the starting point is? |
On my Core i5 with 6 GB RAM, it takes about 14 s to perform {rollingMAD} on one column of data:

```r
sensored$sensor$w <- rollingMAD(sensored$sensor$x, Medianwindow, MADwindow)
```

To loop through all 46 columns and replace with NA using {sensorclean} takes about 5 minutes. This can obviously be shortened by not plotting, but for a first pass it seems useful to see what is removed.

```r
CleanFLAME <- sensorclean(datatable, ruletable)
```

Hilary just suggested looking at the {tsoutliers} package:

```r
library(tsoutliers)
fit <- arima(data[,2], order = c(1, 1, 0))
resid <- residuals(fit)
pars <- coefs2poly(fit)
outliers <- locate.outliers(resid, pars,types = c('AO'),cval=25)
plot(data[,1],data[,2],cex=0.5,pch=19,col='red', ylab=names(data)[2], xlab="")
points(data[outliers$ind,1], data[outliers$ind,2], cex=0.5, pch=19, col='black')
```

Also a little slow, but might be another method for assigning flags. |
Cool, I hadn't seen {tsoutliers}. I am trying out your code/data now. |
Harder than I thought to make this fast. Was trying to… |
And I love rolling it raw! |
@lukeloken I will have something for you to test (if you are up for it) shortly. |
1.5 hrs... |
Happy to be a guinea pig |
Give it a shot? #42. You can devtools::install_github from my fork (jread-usgs) and get this feature. See the readme here for an example on using the rolling window. It is centered right now by default. |
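A minimal sketch of that install (the fork's repository name is assumed to match the original, sensorQC):

```r
library(devtools)
install_github("jread-usgs/sensorQC")  # the fork, not USGS-R/sensorQC
```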
Hey Jordan, the code works and is much faster than my previous version. Is there any way I can see the code used to calculate {MAD.roller}?

A few issues: I can tell that if MAD==Inf, the value becomes flagged. Also, I'm finding that the threshold of 3 does not work well for a lot of the data. I've increased the threshold to 10, and it does a reasonable job for some parameters but poorly for others (see figures). I'm now starting to wonder how useful the MAD function is for moving data. Perhaps other outlier detection methods are needed.

Code to plot and assess rolling window outlier detection:

```r
for (col in 2:ncol(datatable)) {
sensor<-sensor(datatable[,c(1,col)])
sensor2<-window(sensor, n=40, type='rolling')
flagged<-flag(sensor2, 'x == 999999', 'MAD(x,w) > 10', 'MAD(x) > 3')
outliers<-flagged$flags[[2]]$flag.i
flagname<-flagged$flags[[2]]$expression
# plot(sensor2)
par(mar=c(4,4,1,1))
plot(sensor2$sensor$times, sensor2$sensor$x, col="black", type="p", pch=1, cex=1,main=names(datatable)[col], ylab=names(datatable)[col], xlab="")
points(sensor2$sensor[outliers, ], col="red", pch=16, cex=1)
legend("topleft", c(flagname), col="red", pch=16, cex=1, bty="n")
}
```
|
Wow, those are some pretty figs 😉

Code for MAD.roller is here. I put that warning in there because I haven't really tested it w/ different things (like a bunch of NAs or Inf).

Do you think that rolling window is big enough? I was having good luck with much longer periods, like 100 or 300, but that is for station data. There is also an issue here when data have trends. When doing windows in the past, I have detrended and then operated on the residuals. It is another computational cost, but it helps if you really just have a trend in the window that creates a wider spread in the data. |
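One hedged sketch of that detrend-then-score idea (assumption: a simple linear trend removed per series, with the MAD score computed on the residuals):

```r
# score a series by the MAD of residuals from a linear trend
detrended_MAD <- function(x){
  t <- seq_along(x)
  r <- residuals(lm(x ~ t, na.action=na.exclude))  # NAs kept in place
  abs(r - median(r, na.rm=TRUE)) / mad(r, na.rm=TRUE)
}
```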
I've been using a handful of window sizes and thought smaller would be better. This allows quick peaks to remain unflagged; otherwise the tops of the mountains get cut off because the median remains on the valley floor.

I think that detrending could be useful. The first step in the tsoutliers package is building an ARIMA model and then using the residuals as an indicator of outliers:

```r
library(tsoutliers)
fit <- arima(data[,2], order = c(1, 1, 0))
resid <- residuals(fit)
pars <- coefs2poly(fit)
outliers <- locate.outliers(resid, pars,types = c('AO'),cval=25)
plot(data[,1],data[,2],cex=0.5,pch=19,col='red', ylab=names(data)[2], xlab="")
points(data[outliers$ind,1], data[outliers$ind,2], cex=0.5, pch=19, col='black')
```

I haven't used these methods before, but am interested to dig into them if they seem like a good idea. |