Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behaviour of DataFrame(Raster) #778

Closed
edwardlavender opened this issue Oct 4, 2024 · 16 comments
Closed

Unexpected behaviour of DataFrame(Raster) #778

edwardlavender opened this issue Oct 4, 2024 · 16 comments

Comments

@edwardlavender
Copy link

edwardlavender commented Oct 4, 2024

I am confused by the behaviour of DataFrame(Raster). For the same file, I compared the values reported by DataFrame(Raster) and R's terra::extract(). On the same file, and for the same coordinates, there are notable discrepancies.

The reprex below is written in R, for convenient comparison of Rasters with terra. I read the same file into R and Julia. In Julia, I turn the Raster into a DataFrame that I send back to R. I then compare the difference between the values on the raster reported by Rasters and terra. These are up to 27 m for this example file.

> # install.packages("JuliaCall")
> # install.packages("terra")
> library(JuliaCall)
> 
> # Set up Julia
> julia_setup()
Julia version 1.10.5 at location /Users/lavended/.julia/juliaup/julia-1.10.5+0.aarch64.apple.darwin14/bin will be used.
Loading setup script for JuliaCall...
Finish loading setup script for JuliaCall.
> # julia_install_package("ArchGDAL")
> # julia_install_package("DataFrames")
> # julia_install_package("Rasters")
> julia_library("ArchGDAL")
> julia_library("DataFrames")
> julia_library("Rasters")
> 
> # Define path to raster (seabed depth)
> url <- "https://raw.githubusercontent.com/edwardlavender/patter/main/inst/extdata/dat_gebco.tif"
> julia_assign("url", url)
> 
> # Read raster in R and Julia
> geo <- terra::rast(url)
> julia_command('geo = Raster(url);')
> 
> # Convert Raster to DataFrame in Julia and send to R
> julia_command('df = DataFrame(geo);')
> df <- julia_eval('df')
> 
> # Clean up dataframe and remove NAs for convenience
> colnames(df) <- c("x", "y", "z1")
> df <- df[!is.na(df$z1), ]
> 
> # Compare Rasters to terra
> df$z2 <- terra::extract(geo, cbind(df$x, df$y))[, 1]
> 
> # There are notable differences between Rasters and terra
> head(df, 50)
            x       y        z1        z2
4512 709592.1 6270657  24.26738  24.26738
4513 709692.1 6270657  24.26738  24.26738
4514 709792.1 6270657  28.26466  28.26466
4515 709892.1 6270657  32.88578  32.88578
4516 709992.1 6270657  34.78776  34.78776
4517 710092.1 6270657  30.19141  30.19141
4518 710192.1 6270657  25.59505  25.59505
4519 710292.1 6270657  31.13120  31.13120
4520 710392.1 6270657  40.51071  40.51071
4521 710492.1 6270657  52.19024  52.19024
4522 710592.1 6270657  76.40860  76.40860
4523 710692.1 6270657 100.62696 100.62696
4524 710792.1 6270657 116.38429 116.38429
4525 710892.1 6270657 126.13935 116.38429
4526 710992.1 6270657 135.69778 135.69778
4527 711092.1 6270657 132.34482 132.34482
4528 711192.1 6270657 128.99184 132.34482
4529 711292.1 6270657 127.13093 127.13093
4702 709592.1 6270557  24.26738  24.26738
4703 709692.1 6270557  24.26738  24.26738
4704 709792.1 6270557  28.26466  28.26466
4705 709892.1 6270557  32.88578  32.88578
4706 709992.1 6270557  34.78776  34.78776
4707 710092.1 6270557  30.19141  30.19141
4708 710192.1 6270557  25.59505  25.59505
4709 710292.1 6270557  31.13120  31.13120
4710 710392.1 6270557  40.51071  40.51071
4711 710492.1 6270557  52.19024  52.19024
4712 710592.1 6270557  76.40860  76.40860
4713 710692.1 6270557 100.62696 100.62696
4714 710792.1 6270557 116.38429 116.38429
4715 710892.1 6270557 126.13935 116.38429
4716 710992.1 6270557 135.69778 135.69778
4717 711092.1 6270557 132.34482 132.34482
4718 711192.1 6270557 128.99184 132.34482
4719 711292.1 6270557 127.13093 127.13093
4892 709592.1 6270457  24.26738  36.89063
4893 709692.1 6270457  24.26738  34.37080
4894 709792.1 6270457  28.26466  38.41843
4895 709892.1 6270457  32.88578  43.35281
4896 709992.1 6270457  34.78776  46.28009
4897 710092.1 6270457  30.19141  44.41076
4898 710192.1 6270457  25.59505  42.54142
4899 710292.1 6270457  31.13120  48.73069
4900 710392.1 6270457  40.51071  57.97667
4901 710492.1 6270457  52.19024  68.92074
4902 710592.1 6270457  76.40860  89.12210
4903 710692.1 6270457 100.62696 109.32346
4904 710792.1 6270457 116.38429 121.98444
4905 710892.1 6270457 126.13935 121.98444
> 
> # The differences are up to 27 m!
> max(abs(df$z1 - df$z2), na.rm = TRUE)
[1] 27.01477
@edwardlavender edwardlavender changed the title Strange behaviour of DataFrame(Raster) Unexpected behaviour of DataFrame(Raster) Oct 4, 2024
@edwardlavender
Copy link
Author

For context, I also compared the behaviour of GeoArrays. With this package, values extracted via R match the behaviour of terra.

> # Use GeoArrays 
> # julia_install_package("GeoArrays")
> julia_library("GeoArrays")
> 
> # Read Raster as a GeoArray 
> julia_command('ga = GeoArrays.read(url);')
> 
> # Send updated data.frame to Julia
> julia_assign("df2", df)
> 
> # Define extract GeoArrays method 
> julia_command(
+   '
+   function extract(map::GeoArrays.GeoArray, x::Real, y::Real)
+     # Get row and column indices from coordinates
+     rc = GeoArrays.indices(map, [x, y])
+     if checkbounds(Bool, map, rc)
+         # Extract value of the GeoArray at the specified row/column
+         map[rc[1], rc[2], 1]
+     else
+         eltype(map)(NaN)
+     end
+ end;
+   '
+ )
> 
> # Confirm that GeoArrays and terra match
> df$z3 <- julia_eval('[extract(ga, df2.x[i], df2.y[i]) for i in 1:nrow(df2)];')
> all.equal(df$z2, df$z3)
[1] TRUE

@tiemvanderdeure
Copy link
Contributor

Are you sure this is not terra::extract misbehaving? My guess is that the x and y are the boundaries and sometimes it's taking the wrong cell because of floating point error

@rafaqz
Copy link
Owner

rafaqz commented Oct 4, 2024

You're assuming I can read R here, I haven't used it in over 5 years.

But why would DataFrame(raster) give you the same answer as terra extract? I don't get it.

DataFrame(raster) is literally just all the values in the raster as a vector (e.g. a DataFrame column), with the points in the lookups also as unrolled vectors (X and Y columns). Essentially nothing happens to the data at all

@edwardlavender
Copy link
Author

Thanks for your responses! I am not sure whether this is an issue of my understanding or an issue with Rasters or terra. DataFrame(Raster) gives me a dataframe of positions and values. I assumed that when I look up the raster values at those positions, I would get the same answer. But both GeoArrays.jl and terra give me different values for those coordinates than the ones reported by DataFrame(Raster).

@tiemvanderdeure
Copy link
Contributor

tiemvanderdeure commented Oct 4, 2024

But in GeoArrays you're using extract. If you try calling data.frame on the terra raster you'll get the same result as the DataFrame in Rasters

url <- "https://raw.githubusercontent.com/edwardlavender/patter/main/inst/extdata/dat_gebco.tif"
geo <- terra::rast(url)
df <- data.frame(geo)
head(df, 50)

just copy-pasting the last 5 rows here, which match z1 (not z2) in your original post

4901  52.19024
4902  76.40860
4903 100.62696
4904 116.38429
4905 126.13935

vs

4901 710492.1 6270457  52.19024  68.92074
4902 710592.1 6270457  76.40860  89.12210
4903 710692.1 6270457 100.62696 109.32346
4904 710792.1 6270457 116.38429 121.98444
4905 710892.1 6270457 126.13935 121.98444

@rafaqz
Copy link
Owner

rafaqz commented Oct 4, 2024

Sounds bad.

But you're going to need to make a Julia MWE so we can check it, if possible download your data in the script so it just runs

@rafaqz
Copy link
Owner

rafaqz commented Oct 4, 2024

But skipping Nas before you run extract seems risky, are you sure missing values are actually nas?

@rafaqz
Copy link
Owner

rafaqz commented Oct 4, 2024

My gut feeling is this may be because you are extracting the edges of pixels rather than centres, so floating point error is causing some to drop into the wrong pixel.

We probably need to swap out LinRange to StepRangeLen (but that won't guarantee it)

@rafaqz
Copy link
Owner

rafaqz commented Oct 4, 2024

Tiem sorry I missed your comment. Yes that makes more sense. But I think extract should work too?

@edwardlavender
Copy link
Author

Thanks for your responses! Okay that makes sense. So DataFrame(Raster) returns the coordinates at the pixel edges (I assumed it was cell centroids)? I can see how some floating point issues could then cause an extract function to then extract a value in the adjacent cell & give rise to these mismatches.

@rafaqz
Copy link
Owner

rafaqz commented Oct 4, 2024

GDAL defines pixels by their corners, netcdf by their centres. You can change that manually with shiftlocus but we don't do it for you to avoid the introduction of unnecessary floating point error.

@edwardlavender
Copy link
Author

edwardlavender commented Oct 4, 2024

Thanks! Please can you provide a very quick example of how to use shiftlocus with a raster? I can't find any examples of how to do this. Thank you!

@asinghvi17
Copy link
Collaborator

using Rasters
using Rasters.Lookups
# ras is some raster
# shift locus so that stored points are at the lower left of cell corners
ras2 = Rasters.shiftlocus(Start(), ras)
# shift locus so that stored points are at the center of the cell
ras3 = Rasters.shiftlocus(Center(), ras)
# shift locus so that stored points are at the top right of cell corners
ras4 = Rasters.shiftlocus(End(), ras)

You can also extract "interval bounds" that are tuples of (start, stop) for each dimension:

julia> Rasters.intervalbounds(cell_area, X)
1667-element Vector{Tuple{Float64, Float64}}:
 (1.8479035471683413e7, 1.8479102263377886e7)
 (1.8479102263377886e7, 1.8479169055072367e7)
 (1.8479169055072367e7, 1.847923584676684e7)
 (1.847923584676684e7, 1.8479302638461318e7)
 (1.8479302638461318e7, 1.847936943015579e7)
 (1.847936943015579e7, 1.847943622185027e7)
 (1.847943622185027e7, 1.8479503013544742e7)
 (1.8479503013544742e7, 1.847956980523922e7)
 (1.847956980523922e7, 1.8479636596933696e7)
 (1.8479636596933696e7, 1.8479703388628174e7)
 
 (1.8589709309430085e7, 1.858977610112456e7)
 (1.858977610112456e7, 1.8589842892819036e7)
 (1.8589842892819036e7, 1.8589909684513513e7)
 (1.8589909684513513e7, 1.858997647620799e7)
 (1.858997647620799e7, 1.8590043267902464e7)
 (1.8590043267902464e7, 1.8590110059596945e7)
 (1.8590110059596945e7, 1.8590176851291418e7)
 (1.8590176851291418e7, 1.8590243642985895e7)
 (1.8590243642985895e7, 1.859031043468037e7)
 (1.859031043468037e7, 1.8590377226374846e7)

(and same for Y)

@asinghvi17
Copy link
Collaborator

@lazarusA it would be good to add this to the docs in your PR, to show what shiftlocus means. Maybe also plot the DimPoints to really drive it home?

@edwardlavender
Copy link
Author

Thanks all for the responses & example code!

@rafaqz
Copy link
Owner

rafaqz commented Oct 6, 2024

We should probably change shiftlocus to be more like set, so you can do shiftlocus(raster, X=>Center(), Y=>Center()) where you need to e.g. skip the time dimension. Some of these helpers are not as clear.

(currently you can do that with shiftlocus(Center, rast; dims=(X, Y)) which I gues is fine too)

shiftlocus is also not listed here yet:
https://rafaqz.github.io/DimensionalData.jl/stable/object_modification

DD issue here now:
rafaqz/DimensionalData.jl#816

@rafaqz rafaqz closed this as completed Oct 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants