-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make tinyformat aware of locales. #44
base: master
Are you sure you want to change the base?
Conversation
tinyformat.h
Outdated
template<typename... Args> | ||
std::string format(const char* fmt, const Args&... args) | ||
{ | ||
std::ostringstream oss; | ||
oss.imbue (std::locale::classic()); // force "C" locale with '.' decimal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little uneasy about interfering here with the default behaviour of using the global locale. But it's certainly more convenient than having to pass the desired locale as a parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative design is for the unadorned format() to use the native locale (as before) and have users who want the C locale use the new variety and explicitly pass std::locale::classic()
. But that seems error-prone to me, since most users would probably prefer to always have '.' and (like me) be shocked that this wasn't already the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tough one as there's two quite valid but competing use cases.
- For machine readable output, consistency is super important. Using the
classic()
locale is the right thing for this. It's also generally more consistent between environments which is great. - When generating content for users it's arguably be more correct - and consistent with the other
format
functions - to use the current global locale forformat()
.
One way out of the conundrum might just be to add a new cformat
function ("format in C locale"), and leave format
as it is for consistency with the other flavours of format
. Obviously a better name than cformat
would be good...
I updated it slightly with a couple spots where I realized I needed to transfer locale settings from one stream to another. |
By the way, it's perfectly acceptable for you to reject this and take the stance that tinyformat doesn't want this complication or any possible performance penalty, and so is just always going to use the global native locale, so be it, and that apps who are concerned about that ought to just set the locale globally. But that makes it very difficult for people writing libraries (who want to use tinyformat internal to the library) who don't think that the library internals have any business either assuming or changing anything about the global state of any apps that use them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Larry, thanks for this. I'm definitely open to making use with locales possible/sensible. I'm not sure on the best design yet, but this looks pretty close. The versions which take an input stream and use the existing stream locale seems clearly correct. The version takes an explicit locale is also obviously right.
Unfortunately I've never had cause to use a locale in production code, so if any of the watchers of this repo have experience and would like to chime in that would be useful.
tinyformat.h
Outdated
template<typename... Args> | ||
std::string format(const char* fmt, const Args&... args) | ||
{ | ||
std::ostringstream oss; | ||
oss.imbue (std::locale::classic()); // force "C" locale with '.' decimal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tough one as there's two quite valid but competing use cases.
- For machine readable output, consistency is super important. Using the
classic()
locale is the right thing for this. It's also generally more consistent between environments which is great. - When generating content for users it's arguably be more correct - and consistent with the other
format
functions - to use the current global locale forformat()
.
One way out of the conundrum might just be to add a new cformat
function ("format in C locale"), and leave format
as it is for consistency with the other flavours of format
. Obviously a better name than cformat
would be good...
I haven't used locale in production code before today, either! Meaning, I always naively assumed tinyformat would print the same for everybody as it did for me, "123.45". And hence the problem -- a Swedish OSL user had his system locale set to "sv_SE.UTF-8" and it was introducing lots of subtle problems related to numeric literals in his OSL source code. (His immediate bug was related to parsing, but the tinyformat side would have messed up writing the .oso file and many other things.) I don't have strong feelings about the naming conventions. If you want |
BTW, I agree that for the string case, there are potentially two use cases: force C locale, or use the native/global one. I really struggled with this issue myself, but in the end decided (for my wrapper functions) that most people are ignorant about locales and expect the "C" convention and will be surprised to get anything different if they inadvertently use the default calls. The only people who want locale-dependent formatting are the ones intentionally using internationalization, and they can be trusted with the burden of using extra arguments and specially-named functions. That was my rationale, anyway. |
I do buy this reasoning, it seems completely sensible. The thing that makes me pause is that naive use of the stream versions of I suppose we could change all versions of |
I'd be for the "all existing versions use C locale, new funcs for explicit", I think that's consistent and probably in line with what 99% of users think is really happening (though they are wrong). But then I'm not 100% sure how to handle the logistics of the "C" versions that write to existing streams. Does each of those functions, then, need to store the stream's original locale, change it to C, output, then restore? For that matter, is ostream::imbue() thread safe? And will it cause even more chaos if a stream that is everywhere else "localized" suddenly has individual writes (that used tinyformat) adhering to different conventions? It all hurts my head. I'm fine with us all sleeping on this for a few days to let it percolate, there is not really any urgency to decide. The good news is that MOST technical users will never see a change one way or the other, since they already have their global locale set to the "C" equivalent -- so none of these choices is likely to change things for anybody who isn't already living on one of those edge cases that's currently broken in some way. |
Yes - that's all I was thinking - all stream based Which thread safety issues are you particularly thinking about? The locale itself would be const (I assume), and the use of the stream/streambuf would already be non-threadsafe as per usual. So at a guess it seems fine to me. Could easily be wrong though! I'm certainly happy to let this percolate for a few days before a final decision. |
I didn't have any thread safety issues that I knew about for sure... just musing about unknowns, but I guess you're right, we're already accepting that tinyformat is changing (and hopefully restoring) persistent state about the stream. I'm coming around to seeing the wisdom of declaring that all the "default" functions should be consistent with each other and use C locale -- whether making a string or outputting to a stream -- and have an additional version that takes an explicit locale if you want anything else (including passing stream.getloc() to cause the output to conform to the stream's previous locale). That has the nice property of ensuring that
and
are 100% identical, regardless of any prior state of the stream. If you want something else, we'll have a
??? |
As a contrarian sort of argument, I think that tinyformat should continue defaulting to the global or stream-specified locale. That's the normal and well established behaviour, rightly or wrongly. So I see the problem more in terms of a convenient way to use "C" locale for string formatting specifically. Which brings me back to the I'd be inclined to explicitly set the global locale for all CLI tools via std::locale::global. This applies to That leaves a more complicated reality for GUI applications that likely depend on not having the global locale forced to "C" classic. In light of that -- mass-migrating code from |
So let's say you had a library that was concerned with generating valid GLSL source code, being linked into a pointy-clicky internationalised GUI application that happened to have some widgets using GLSL for their shaders. It's not desirable for the GLSL formatting of numbers to be locale-dependent. But the UI portions of the application are required to be localised into Japanese, or whatever. In this situation the solution was to actually eliminate string formatting (printf and std::iostreams) from the GLSL translating library, along with the locale dependency. Removing the format string was considered a security enhancement - at the time it was also being used as part of the web browser Java stack. No more printf, no more varargs, no more worries. (pre C++11) |
So what's the verdict, @c42f ? I think our choices are:
Chris Kulla @fpsunflower has been on vacation all week, but I'd also like to get his opinion before we make any final decisions. He's usually good at spotting any sloppiness in the decisions I'm making. |
@lgritz Out of curiosity, are you primarily in a CLI context, or a GUI context? |
Both! My use of tinyformat is in two software packages, and in both cases their heart is a library that's used from inside other applications (used by wide range of commercial, proprietary, and other open source projects), but both packages also contain CLI tools that are extensively used in their own right. Setting "C" locale is trivial for the CLI tools, but does nothing for the library internals. Interestingly, one of my projects where I use tinyformat is a programming language, so in that case, it's really crucial that the parsing and generating of output (text representation of floating point numbers, in particular) be canonical rather than locale-dependent. But the language JITs from the library, inside other apps that have extensive UI that may be localized, so I can't count on the global locale being "C". |
Right thanks for the ideas, I think the design tradeoffs are becoming clearer. To me it seems that:
With (1) in mind, I think it would be most useful to have two sets of functions
The difficult decision is then which of these should be called |
I'm currently of the opinion that the unadorned format() should always use "C" locale. It seems nearly foolproof, your points 2-3 are very compelling. I agree that the main use cases are "always C" and "use the global" and both of those should be simple, without having to pass an explicit locale. But it's so easy to also have the version that takes an explicit locale, I'd be inclined to keep that flexibility available in case somebody needs it. It does not compromise the design or simplicity of either of the other two varieties and is truly just a few lines of code. Of your choices, I like |
Wow, that world map of decimal separator use is a bit of a horror show. I didn't realize the countries were split close to half and half on this decision. I suppose the cost of certain internationalization bugs might be quite severe - for an obvious example, the difference between
True, though I'm kind of more inclined to add this later if someone actually needs it. Personally I think I'd be much more likely to use the global locale if I wanted locale specific formatting. |
At this time, I have no objection to just having the two common options (C and global), and we can add the other if we want it later. |
Uh oh. I've got quite bad news about performance. Before we go further I hacked up the speed tests - just on master - so that we save and restore the locale in
Applying the following patch
gives the following
It's a bit of a performance disaster. I was puzzled by the system time. Running |
Dammit, there's no winning on this. What about the string versions? Is there any performance difference? So, potential strategies to mitigate perf issues:
|
Yeah, it's pretty awful. Taking this into consideration, the best option seems like it might be a new family of For the string version, the cost might be somewhat mitigated by the other stuff it's doing, namely allocating string buffers, etc, and flushing doesn't involve a syscall to |
It's all surprisingly messy, and does make me wonder whether locale support in the standard library was a bit of an afterthought. |
If using C locale makes the string formatting slow, then I think the only choice is to leave the original format() to use global (for strings) and stream locales, and add format_c/format_loc when people have a specific preference and are willing to pay the cost. (My choice 1 a couple messages ago.) If the string formatting is the same speed either way (i.e., if the actual stream flush on a real file or console stream is the expensive part), then my choice 2 from earlier is still a possibility, if you want the string format() to do what most users expect and have the special call for locale-intentioned user. I am cursing the authors of the standard library for this. It's so obvious that locales should have been opt-in on each call, with the default calls being locale-independent. |
You want to make a final call on this, Chris? I don't have an especially strong preference, but I'd like to get it settled one way or the other so I can figure out what to do in my downstream projects and not have to change them a second time. |
I'm happy to revise my submission to match whatever decision you make. |
On the OIIO side, we've made the decision not to expose the location-explicit API calls, but instead to simply declare that since OIIO is all about file I/O at its core, it must be consistent across all users and platforms, therefore all of its numeric input and output are "C" locale. I think that this is probably not the right decision for tinyformat. My current recommendation is that you make the default ( |
@c42f Poke... let me know how you'd like me to revise this. |
Sorry Larry, I've been dithering on making a decision on this because there's no obvious nice choice. But it's still worth making a choice. I'd like to do a little more benchmarking, but my current feeling is that the std library forces us into leaving the family of As a side note - it's always been my intention to reproduce Coincidentally I've just had a nice PR in #45 which implements posix argument reordering, again for localization related reasons. @jrohel - since you're apparently in the process of dealing with localization, I'd be interested if you can share your experience as it relates to this PR, if you have some. |
Agreed. No performance implications and opt-in for forcing the locale. Sounds good to me. |
The default functions all are dependent on either the global native locale, or the locale associated with the stream they are outputting to. This can lead to situations where format("%g",123.45) might return "123,45" if it happens to be running on a machine with locale "fr_FR", for example. If that is written to an output file that is later read on a computer using a locale with '.' as the decimal mark, it could mis-parse as "123.0", leading to hilarious and tragic results. This patch adds format_c, printf_c, printfln_c, which force the classic "C" locale ('.' decimals, among other things). This incurs a performance penalty -- in particular, when being used on I/O streams it will force a flush when the locale for the stream is saved and restored -- but it is the safe thing to do for persistent, portable output.
I've updated the PR with a much less ambitious version. All existing functions are unchanged, but I added a format_c, printf_c, printfln_c that force C locale, even if a little more expensive. |
@c42f OK, my experiences:
If you want you can add another function "format_c()" or more general function "format_loc(locale, ...)" - first parameter will be locale. "format_c()" then will be only a special case of "format_loc()". |
Without all this, you will get situations where format("%g",123.45)
might return "123,45" if it happens to be running on a machine with
locale "fr_FR", for example. If that is written to an output file that
is later read on a computer using a locale with '.' as the decimal mark,
it could mis-parse as "123.0", leading to hilarious and tragic results.
It seems like the safe thing is to force the string-returning varieties
of tinyformat to use classic locale ('.' decimal), with an optional
means for the odd users to purposely specify a different (or native)
locale. It also seems that the best policy for the stream-appending
varieties of format() is to use the local already associated with those
streams (C++ lets each stream have its own locale).
format() to an existing stream (including cout) is now documented to
honor the stream's existing locale (as it always did).
format() that returns a string now will always format the string using
the classic "C" locale, in particular meaning that floating point
numbers will use '.' as the decimal mark, regardless of the "native"
locale set by the OS or process that may have an unexpected choice of
decimal point.
Add a new variety of format() that returns a string, but which takes
an explicit locale, for cases where you don't want the "C" locale
(for example, when generating strings that will appear in a UI that
you want localized properly). You can make this variety use the
current locale by passing std::locale() as the first argument (the
default constructor of std::locale just grabs the global process
locale).