_Just to be clear... this is definitely an edge case and something I can easily work around; I do not want anyone to spend a ton of time on this! I'm sure you have better things to work on...
While 20k features and p >> n are not uncommon with genomics data, the historical convention there has been to store the data in the transposed orientation, which avoids this issue, or to use a different file format.
I just thought I would report it here to raise awareness / in case it points to an underlying issue with wider ramifications.
Greetings!
I noticed some unexpectedly slow performance for readr when trying to load a table with few rows but many columns.
I found some earlier issues relating to slow read times, but those seemed to relate to different issues.
I thought that it might be an issue with the column type guessing, but the performance is similar even when the column types are indicated.
I did not check to see how the performance scales with the number of rows, but the code snippet could be modified to check this.
Timings

Representative times to load a 40 x 20,000 table of random floats:

- `read.delim`: 13.7s
- `read_tsv`: 48.7s
- `read_tsv` + `col_types`: 49.2s

For comparison, the corresponding times for pandas/CSV.jl are:

- `pd.read_csv()`: 0.68s
- `DataFrame(CSV.File())`: 0.93s

These are just rough estimates intended to give a sense of the scale of the performance discrepancies.
Transposing the data results in a ~100x speed-up for this particular example (0.39s).
To reproduce:
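The original snippet is not included in this excerpt; the following is a minimal sketch of the kind of benchmark described above. The dimensions match the report, but the file path, seed, and use of `show_col_types` are assumptions.

```r
# Hypothetical reproduction sketch: generate a 40 x 20,000 table of random
# floats, write it as TSV, and time base R vs. readr (assumed setup; the
# issue's original snippet is not shown here).
library(readr)

set.seed(1)
n_rows <- 40
n_cols <- 20000

df <- as.data.frame(matrix(runif(n_rows * n_cols), nrow = n_rows))
tsv_path <- tempfile(fileext = ".tsv")
write.table(df, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Base R
system.time(base_df <- read.delim(tsv_path))

# readr, with column types guessed
system.time(readr_df <- read_tsv(tsv_path, show_col_types = FALSE))

# readr, with column types supplied explicitly
system.time(
  readr_typed <- read_tsv(
    tsv_path,
    col_types = cols(.default = col_double()),
    show_col_types = FALSE
  )
)
```

All three calls read the same file, so any timing gap isolates the parser rather than the data.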
Session Info:
Thanks for all of your work on readr!