Skip to content

Commit

Permalink
Fix binary encoding for the parser translator
Browse files Browse the repository at this point in the history
Skipping detecting the encoding is almost always right, just for binary it should actually happen.

A symbol containing escapes that are invalid
in utf-8 would fail to parse since symbols must be valid in the script encoding.
Additionally, the parser gem would raise an exception somewhere during string handling
  • Loading branch information
Earlopain committed Jan 11, 2025
1 parent 16d4e69 commit fa0154d
Show file tree
Hide file tree
Showing 8 changed files with 119 additions and 3 deletions.
20 changes: 17 additions & 3 deletions lib/prism/translation/parser.rb
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,7 @@ def parse(source_buffer)
source = source_buffer.source

offset_cache = build_offset_cache(source)
result = unwrap(Prism.parse(source, filepath: source_buffer.name, version: convert_for_prism(version), partial_script: true, encoding: false), offset_cache)
result = unwrap(Prism.parse(source, **prism_options), offset_cache)

build_ast(result.value, offset_cache)
ensure
Expand All @@ -64,7 +64,7 @@ def parse_with_comments(source_buffer)
source = source_buffer.source

offset_cache = build_offset_cache(source)
result = unwrap(Prism.parse(source, filepath: source_buffer.name, version: convert_for_prism(version), partial_script: true, encoding: false), offset_cache)
result = unwrap(Prism.parse(source, **prism_options), offset_cache)

[
build_ast(result.value, offset_cache),
Expand All @@ -83,7 +83,7 @@ def tokenize(source_buffer, recover = false)
offset_cache = build_offset_cache(source)
result =
begin
unwrap(Prism.parse_lex(source, filepath: source_buffer.name, version: convert_for_prism(version), partial_script: true, encoding: false), offset_cache)
unwrap(Prism.parse_lex(source, **prism_options), offset_cache)
rescue ::Parser::SyntaxError
raise if !recover
end
Expand Down Expand Up @@ -285,6 +285,20 @@ def build_range(location, offset_cache)
)
end

# Options for how prism should parse/lex the source.
def prism_options
options = {
filepath: @source_buffer.name,
version: convert_for_prism(version),
partial_script: true,
}
# The parser gem always encodes to UTF-8, unless it is binary.
# https://github.com/whitequark/parser/blob/v3.3.6.0/lib/parser/source/buffer.rb#L80-L107
options[:encoding] = false if @source_buffer.source.encoding != Encoding::BINARY

options
end

# Converts the version format handled by Parser to the format handled by Prism.
def convert_for_prism(version)
case version
Expand Down
9 changes: 9 additions & 0 deletions test/prism/fixtures/encoding_binary.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# encoding: binary

"\xcd"

:"\xcd"

/#{"\xcd"}/

%W[\xC0]
6 changes: 6 additions & 0 deletions test/prism/fixtures/encoding_euc_jp.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
# encoding: euc-jp

# \x8E indicates a double-byte character, \x01 is not a valid second byte in euc-jp
"\x8E\x01"

%W["\x8E\x01"]
12 changes: 12 additions & 0 deletions test/prism/ruby/parser_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,18 @@
# First, opt in to every AST feature.
Parser::Builders::Default.modernize

# The parser gem rejects some strings that would most likely lead to errors
# in consumers due to encoding problems. RuboCop however monkey-patches this
# method out in order to accept such code.
# https://github.com/whitequark/parser/blob/v3.3.6.0/lib/parser/builders/default.rb#L2289-L2295
Parser::Builders::Default.prepend(
Module.new {
def string_value(token)
value(token)
end
}
)

# Modify the source map == check so that it doesn't check against the node
# itself so we don't get into a recursive loop.
Parser::Source::Map.prepend(
Expand Down
1 change: 1 addition & 0 deletions test/prism/ruby/ruby_parser_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ def ==(other)
module Prism
class RubyParserTest < TestCase
todos = [
"encoding_euc_jp.txt",
"newline_terminated.txt",
"regex_char_width.txt",
"seattlerb/bug169.txt",
Expand Down
49 changes: 49 additions & 0 deletions test/prism/snapshots/encoding_binary.txt

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

24 changes: 24 additions & 0 deletions test/prism/snapshots/encoding_euc_jp.txt

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions test/prism/snippets_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
module Prism
class SnippetsTest < TestCase
except = [
"encoding_binary.txt",
"newline_terminated.txt",
"seattlerb/begin_rescue_else_ensure_no_bodies.txt",
"seattlerb/case_in.txt",
Expand Down

0 comments on commit fa0154d

Please sign in to comment.