Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support to import evm assembly json (updated). #13673

Merged
merged 4 commits into from
Oct 24, 2023

Conversation

aarlt
Copy link
Member

@aarlt aarlt commented Nov 2, 2022

Target branch is import-export-script-refactoring (see #13576).

Depends on #13576 (merged), #13577 (merged), #13578 (merged)and #13579 (merged)

Closes #11787. Replaces #12834.

TODO

@aarlt aarlt self-assigned this Nov 2, 2022
@aarlt
Copy link
Member Author

aarlt commented Nov 2, 2022

Execution of the EVM Assembly JSON import/export tests need ~20 minutes.

@aarlt aarlt force-pushed the import-asm-json-updated branch from 1eb319b to 9651527 Compare November 2, 2022 13:49
@nikola-matic
Copy link
Collaborator

You got tons of std:: qualifier usages in cpp files here by the way.

libevmasm/Assembly.cpp Outdated Show resolved Hide resolved
@nikola-matic
Copy link
Collaborator

nikola-matic commented Nov 2, 2022

Also, changelog entry since this introduces a new input mode right?

@nikola-matic
Copy link
Collaborator

This PR will replace #12834.

PR contains cherry-picked commits from #13576 - once these PRs are merged, it's commits will be removed from this PR.

Depends on #13576, #13577, #13578 and #13579

Closes #11787.

So I see that #13576 is a dependency, but it looks like you've deleted (or renamed) ASTImportTest.sh; is it still a dependency, and how do you plan to resolve #13576 so it's aligned with this PR?

@nikola-matic
Copy link
Collaborator

Also, if you rebase against develop, you can finally get them sweet sweet green builds.

@aarlt aarlt force-pushed the import-asm-json-updated branch 2 times, most recently from 79a3849 to 27da0cd Compare November 3, 2022 15:48
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@ethereum ethereum deleted a comment from stackenbotten Nov 3, 2022
@aarlt aarlt force-pushed the import-asm-json-updated branch from 27da0cd to 858ed50 Compare November 3, 2022 15:55
@aarlt
Copy link
Member Author

aarlt commented Nov 3, 2022

This PR will replace #12834.
PR contains cherry-picked commits from #13576 - once these PRs are merged, it's commits will be removed from this PR.
Depends on #13576, #13577, #13578 and #13579
Closes #11787.

So I see that #13576 is a dependency, but it looks like you've deleted (or renamed) ASTImportTest.sh; is it still a dependency, and how do you plan to resolve #13576 so it's aligned with this PR?

Yeah, initially we had import/export tests only for the AST (ASTImportTest.sh). But with this PR we will have an additional kind of import/export tests for the EVM Assembly. So now the script will support AST & EVM Assembly import/export tests, that's why I just renamed it to ImportExportTest.sh.

In this PR the renaming is done in a single commit. But we could also rename the script later. It is then probably easier to read the PR.

libevmasm/Assembly.cpp Outdated Show resolved Hide resolved
{
shared_ptr<Assembly> subassembly(Assembly::loadFromAssemblyJSON(code, sourceList, /* isCreation = */ false));
assertThrow(subassembly, AssemblyException, "");
result->m_subs.emplace_back(make_shared<Assembly>(*subassembly));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could m_subs be a vector of unique_ptrs? If it can, it should, since from what I can tell, we're only iterating over its contents later on, so no need for this to be a sharted_ptr.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that's possible to change that, but it would be an unrelated change. But if it makes sense, we could create another PR changing this.

Comment on lines 589 to 603
size_t index = static_cast<size_t>(std::stoi(dataItemID));
if (result->m_subs.size() <= index)
result->m_subs.resize(index + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it should handle weird indexes, but the import segfaults for me if I try to use a value out of range:

{
    ".code": [],
    ".data": {
        "1": {".code": []}
    }
}
Segmentation fault (core dumped)

Looks like we're missing an assert in some other place that relies on these indexes being in range.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, for this input, the result->m_subs will be resized to 2 but the index 0 will have a null pointer, so when the code reaches result->updatePaths() it crashes.

Copy link
Member

@r0qs r0qs Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the result->m_subs need to have the indexes as the item data IDs? It seems unnecessary to me, and we could fix this easily by just appending the subassembly items to the m_subs vector instead of inserting it to a specific index. I was not sure of that when I took this PR over, so I kept the logic of what was done initially.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is really that we have to stick to the format used by --asm-json and that apparently has numeric indexes as IDs. Actually it's even a bit more complex - the keys are numbers if the content is code but can be arbitrary strings if the content is hex data.

I don't think we can do anything about it. I mean, we can always handle more, but we should not change the format in a way that the output from --asm-json will not be accepted.

And if the indexes are there, we must use them to order the subs according to them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is really that we have to stick to the format used by --asm-json and that apparently has numeric indexes as IDs. Actually it's even a bit more complex - the keys are numbers if the content is code but can be arbitrary strings if the content is hex data.

Oh, ok, but in the case that the key on .data is an arbitrary string, it would not have any subassembly, would it?

I don't think we can do anything about it. I mean, we can always handle more, but we should not change the format in a way that the output from --asm-json will not be accepted.

And if the indexes are there, we must use them to order the subs according to them.

Hum...so I guess the alternative would be to have m_subs as an ordered map instead of a vector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, ok, but in the case that the key on .data is an arbitrary string, it would not have any subassembly, would it?

It wouldn't. Just data. See if (_json.isMember(".data")) in Assembly::fromJSON.

But looking at it myself now, I wonder if it shouldn't be more strict. Here's the bit that creates .data currently and looks like it's not an arbitrary string. Seems to be a hash (of the data name?):

root[".data"] = Json::objectValue;
Json::Value& data = root[".data"];
for (auto const& i: m_data)
if (u256(i.first) >= m_subs.size())
data[util::toHex(toBigEndian((u256)i.first), util::HexPrefix::DontAdd, util::HexCase::Upper)] = util::toHex(i.second);
for (size_t i = 0; i < m_subs.size(); ++i)
{
std::stringstream hexStr;
hexStr << std::hex << i;
data[hexStr.str()] = m_subs[i]->assemblyJSON(_sourceIndices, /*_includeSourceList = */false);
}

I'd adjust the code to accept only the things we can actually produce. We can always relax it later if we introduce something new. A small test covering the validation would be nice too.

Hum...so I guess the alternative would be to have m_subs as an ordered map instead of a vector.

Maybe. I would not be surprised if we were relying on there being no gaps. The simplest thing to do is to just reject any JSON with gaps in numbering as an error and I'd suggest doing just that.

You could start with a local map and only convert it to a vector once you validate it. This would also prevent another issue the current code probably has - that if you use a very large integer for the index, we'll probably try to allocate a very large vector and run out of memory.

Also, please add one case with a gap in numeration (and maybe also a very high index) as a CLI test.

Copy link
Member

@r0qs r0qs Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But looking at it myself now, I wonder if it shouldn't be more strict. Here's the bit that creates .data currently and looks like it's not an arbitrary string. Seems to be a hash (of the data name?):

Yes, it is the hash of the keys in .data.

I'd adjust the code to accept only the things we can actually produce. We can always relax it later if we introduce something new. A small test covering the validation would be nice too.

Will do.

Maybe. I would not be surprised if we were relying on there being no gaps. The simplest thing to do is to just reject any JSON with gaps in numbering as an error and I'd suggest doing just that.

But with the addition of the --import-asm-json such constraint will not be valid anymore, since you would be able to import a JSON assembly that does not contain all subassemblies, right? Like the one in your example, which starts with ID 1: #13673 (comment)

You could start with a local map and only convert it to a vector once you validate it. This would also prevent another issue the current code probably has - that if you use a very large integer for the index, we'll probably try to allocate a very large vector and run out of memory.

Also, please add one case with a gap in numeration (and maybe also a very high index) as a CLI test.

Ok.

Copy link
Member

@cameel cameel Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But with the addition of the --import-asm-json such constraint will not be valid anymore, since you would be able to import a JSON assembly that does not contain all subassemblies, right?

Only if we decide to allow it. What I'm saying is that we don't have to - we can just add an input validation against it. In fact, I don't see a real use case for this. The numbers are there purely for ordering and you cannot really express those gaps in the bytecode. The produced bytecode should be the same for "2", "0", "1" and for "999", "5", "30". It would only make sense to allow the latter if the compiler could produce that numbering. But I'm pretty sure it can't so allowing it just forces us to make the code for handling it more complex without any benefits.

@r0qs r0qs force-pushed the import-asm-json-updated branch from 772ff3e to e2e7c2b Compare October 20, 2023 17:03
solRequire(jsonParseStrict(_source, assemblyJson), AssemblyImportException, "Could not parse JSON file.");
solRequire(jsonParse(_source, assemblyJson), AssemblyImportException, "Could not parse JSON file.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How non-strict is jsonParse()? If the difference is just the null, it's fine, but I have a feeling that this will affect also some other aspects of parsing that we'd rather keep strict :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not restrictive at all, it uses the default settings. Please see setDefaults() and strictMode() for a comparison.

But we could change the strictRoot field in the strict settings to false, which will allow us to accept Json::null, however we need to double check possible consequences of this. The commit 772ff3e was just an example of the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted the changes anyway ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-  (*settings)["collectComments"] = true;
-  (*settings)["allowComments"] = true;
-  (*settings)["allowTrailingCommas"] = true;
-  (*settings)["strictRoot"] = false;
+  (*settings)["allowComments"] = false;
+  (*settings)["allowTrailingCommas"] = false;
+  (*settings)["strictRoot"] = true;
   (*settings)["allowDroppedNullPlaceholders"] = false;
   (*settings)["allowNumericKeys"] = false;
   (*settings)["allowSingleQuotes"] = false;
   (*settings)["stackLimit"] = 1000;
-  (*settings)["failIfExtra"] = false;
-  (*settings)["rejectDupKeys"] = false;
+  (*settings)["failIfExtra"] = true;
+  (*settings)["rejectDupKeys"] = true;
   (*settings)["allowSpecialFloats"] = false;
   (*settings)["skipBom"] = true;

I see that there are some important differences. For example rejectDupKeys does not sound like something we'd like to disable. And even things like comments or trailing commas would be something we could not roll back later in a non-breaking way when we finally switch to a better maintained JSON library.

Using strictMode() with just the strictRoot flag flipped would probably be acceptable, but still, this is really something better done in a separate PR, because here there are too many other concerns mixed in. We do need to consider whether the new JSON parser will allow that or not.

@cameel cameel force-pushed the import-asm-json-updated branch from 5794f98 to a2fdac2 Compare October 20, 2023 19:36
Comment on lines 13 to 20
dup3
/* "<stdin>":51:63 : 29,... */
sstore
/* "<stdin>":84:86 " */
0x20
/* "<stdin>":71:87 "PUSH",... */
calldataload
/* "<stdin>":68:188 ": "PUSH",... */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These code snippets should not be printed. The input file (<stdin>) is listed as one of the sources, which makes no sense but makes the compiler use it as the source file. The thing is, in the EVM asm import mode we never have access to sources so while printing locations makes sense, the snippets should be disabled. I.e. of all the values supported by --debug-info (ast-id location snippet) only location should be allowed. And the default should be different.

@cameel cameel force-pushed the import-asm-json-updated branch from a2fdac2 to 1bae0e1 Compare October 20, 2023 20:49
Copy link
Member

@cameel cameel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm finally done with a comprehensive review of this, including reviewing past comments. There were quite a few tweaks that I did myself as fixups. For the rest I added comments, but since this PR has tons of comments, here's a summary.

These are IMO both important and easy enough to still fix before we merge:

These are less important issues and improvements that would be fine done later, in follow-up PRs:

And these are general issues that I pointed out, but probably don't have a good, quick solution:

@r0qs r0qs force-pushed the import-asm-json-updated branch from 1bae0e1 to e33f75b Compare October 23, 2023 07:52
Comment on lines +183 to +198
if (!jumpType.empty())
{
if (item.instruction() == Instruction::JUMP || item.instruction() == Instruction::JUMPI)
{
std::optional<AssemblyItem::JumpType> parsedJumpType = AssemblyItem::parseJumpType(jumpType);
if (!parsedJumpType.has_value())
solThrow(AssemblyImportException, "Invalid jump type.");
item.setJumpType(parsedJumpType.value());
}
else
solThrow(
AssemblyImportException,
"Member 'jumpType' set on instruction different from JUMP or JUMPI (was set on instruction '" + name + "')"
);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!jumpType.empty())
{
if (item.instruction() == Instruction::JUMP || item.instruction() == Instruction::JUMPI)
{
std::optional<AssemblyItem::JumpType> parsedJumpType = AssemblyItem::parseJumpType(jumpType);
if (!parsedJumpType.has_value())
solThrow(AssemblyImportException, "Invalid jump type.");
item.setJumpType(parsedJumpType.value());
}
else
solThrow(
AssemblyImportException,
"Member 'jumpType' set on instruction different from JUMP or JUMPI (was set on instruction '" + name + "')"
);
}
if (!jumpType.empty())
if (item.instruction() == Instruction::JUMP || item.instruction() == Instruction::JUMPI)
{
std::optional<AssemblyItem::JumpType> parsedJumpType = AssemblyItem::parseJumpType(jumpType);
if (!parsedJumpType.has_value())
solThrow(AssemblyImportException, "Invalid jump type.");
item.setJumpType(parsedJumpType.value());
}
else
solThrow(
AssemblyImportException,
"Member 'jumpType' set on instruction different from JUMP or JUMPI (was set on instruction '" + name + "')"
);

Honestly, I'd be perfectly fine leaving this as is, since it's not checked by check_style.sh, and we'll eventually move to clang-format, where we'll have to auto format the whole project anyway. At your discretion then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when clang-format? haha

libevmasm/Assembly.cpp Show resolved Hide resolved
libevmasm/Assembly.cpp Show resolved Hide resolved
Comment on lines +183 to +198
if (!jumpType.empty())
{
if (item.instruction() == Instruction::JUMP || item.instruction() == Instruction::JUMPI)
{
std::optional<AssemblyItem::JumpType> parsedJumpType = AssemblyItem::parseJumpType(jumpType);
if (!parsedJumpType.has_value())
solThrow(AssemblyImportException, "Invalid jump type.");
item.setJumpType(parsedJumpType.value());
}
else
solThrow(
AssemblyImportException,
"Member 'jumpType' set on instruction different from JUMP or JUMPI (was set on instruction '" + name + "')"
);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!jumpType.empty())
{
if (item.instruction() == Instruction::JUMP || item.instruction() == Instruction::JUMPI)
{
std::optional<AssemblyItem::JumpType> parsedJumpType = AssemblyItem::parseJumpType(jumpType);
if (!parsedJumpType.has_value())
solThrow(AssemblyImportException, "Invalid jump type.");
item.setJumpType(parsedJumpType.value());
}
else
solThrow(
AssemblyImportException,
"Member 'jumpType' set on instruction different from JUMP or JUMPI (was set on instruction '" + name + "')"
);
}
if (!jumpType.empty())
if (item.instruction() == Instruction::JUMP || item.instruction() == Instruction::JUMPI)
{
std::optional<AssemblyItem::JumpType> parsedJumpType = AssemblyItem::parseJumpType(jumpType);
if (!parsedJumpType.has_value())
solThrow(AssemblyImportException, "Invalid jump type.");
item.setJumpType(parsedJumpType.value());
}
else
solThrow(
AssemblyImportException,
"Member 'jumpType' set on instruction different from JUMP or JUMPI (was set on instruction '" + name + "')"
);

Again, just pointing out - feel free to ignore (or not if Kamil sees this).

for (Json::Value const& sourceName: _json["sourceList"])
{
solRequire(
std::find(parsedSourceList.begin(), parsedSourceList.end(), sourceName.asString()) == parsedSourceList.end(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically just a nested for loop - so a bit wasteful; wouldn't it make more sense to make parsedSourceList a std::set<std::string> and use parsedSourceSet.count(sourceName.asString()) instead, and then just copy it to a vector when/if needed, i.e. std::vector<std::string> parsedSourceList(parsedSourceSet.begin(), parsedSourceSet.end())?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that destroy the order though? The order here is very important.

But other than that, yeah, that would be more efficient for long lists.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would, in which case we couldn't simply copy content over from the set into the vector, but could none the less use the same approach (i.e. using a set for a presence check, which technically wouldn't be any less efficient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case it would be better to first load all items into a vector and then only use the set as a temporary helper to check if items are unique. Even better, wrap that in a reusable isUnique() helper. Or maybe such a helper even already exists in boost?

But IMO this is also good enough as is, given that the feature is experimental and we gave some other inefficiencies and suboptimal things a pass too. This could be improved to the follow-up refactor PR.

{
solAssert(dataIter.key().isString());
std::string dataItemID = dataIter.key().asString();
Json::Value const& dataItem = data[dataItemID];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit convoluted - can't we just get dataItem directly from dataIter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same thing and apparently we can't. Though I only briefly looked at how the iterator is defined. I didn't see a member for it, but maybe you can find some mechanism for it if you look closer. In any case, I gave up myself on this because there were too many other things to adjust here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dereferencing the iterator should work, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. Feel free to check and suggest a variant that will work :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it works. In any case, regarding your latest comment about priorities - I wouldn't worry about this too much, and I'd prefer not to push to the branch in case @r0qs is actively working on (which I'm assuming he is). So we can leave it for another time, unless of course, he notices this comment and does it himself :)

Copy link
Member

@cameel cameel Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushing should not be a problem as long as you use --force-with-lease and only push new commits without modifying any existing ones or rebasing the whole branch. Then for @r0qs it would be a simple git rebase -i origin/import-asm-json-updated (and maybe some conflict resolutions, but this is a small localized change and everyone should be used to resolving conflicts by now anyway :)).

libevmasm/Assembly.cpp Outdated Show resolved Hide resolved
libevmasm/Assembly.cpp Show resolved Hide resolved
libevmasm/Assembly.cpp Show resolved Hide resolved
@cameel
Copy link
Member

cameel commented Oct 23, 2023

Just to keep priorities clear here - we absolutely need this merged very soon, preferably today so I think @r0qs should focus on the three must-haves I listed in #13673 (review) - unless of course someone finds something serious, like a bug.

For the other minor improvements and refactors - feel free to just push fixups directly. Or just leave them be for now to be potentially addressed in some follow-up cleanup PR. I already fixed a lot of these myself while reviewing this last week. All the big, structural problems, like the import being part of CompilerStack are fixed at this point and remaining ones IMO are not serious enough to block the PR.

@r0qs r0qs force-pushed the import-asm-json-updated branch from e4e6f59 to fa75409 Compare October 23, 2023 14:55
libevmasm/Assembly.cpp Outdated Show resolved Hide resolved
@r0qs r0qs force-pushed the import-asm-json-updated branch 3 times, most recently from 840e7e3 to 9d0a4de Compare October 23, 2023 22:45
Changelog.md Outdated Show resolved Hide resolved
@cameel cameel force-pushed the import-asm-json-updated branch from 9d0a4de to c3825ed Compare October 24, 2023 14:10
@cameel cameel force-pushed the import-asm-json-updated branch from c3825ed to da10cb9 Compare October 24, 2023 14:51
Co-authored-by: Kamil Śliwak <[email protected]>
Co-authored-by: r0qs <[email protected]>
@cameel cameel force-pushed the import-asm-json-updated branch from da10cb9 to 91c7b32 Compare October 24, 2023 15:07
@cameel cameel dismissed their stale review October 24, 2023 15:08

All the important problems have been addressed.

@ekpyron ekpyron merged commit c7e5212 into develop Oct 24, 2023
@ekpyron ekpyron deleted the import-asm-json-updated branch October 24, 2023 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

Compiler mode to reimport assembly json.
6 participants