Provides a fast scripting language for running statemachines or other decision making processes from bursted code in unity.
BLAST 1.0.4.c: #7
Language Reference - Namespace (doxygen) - Examples
Disclaimer: This is a partly experimental package, anything listed in the documentation is guaranteed to be supported indefinately, the rest is under active development and very open for suggestions. Please log any issues at https://github.com/nijnstein/BLAST-Documentation/issues so we can handle them as efficiently as possible.
In many large scale simulations many scripts are mostly the same but with different data. BLAST can execute those with the control flow inverted, meaning that it in essence performs the SIMD equivelent of scripting languages, each operation will translate in a vector operation easily compiled to beautiful packed assembly. This can result in script running faster then native code dueue to the ways memory is forced to be and execution is performed. Usually this kind of optimization is not simply possible in real-life situations with compiled code. In SSMD mode, BLAST can handle scripts faster then native code in single execution. It shows how it becomes possible to optimize otherwise hard to optimize parts, especially if they were entered by a user in an online game were there is no control.
In many situations things boil down to a handler executing actions, either predefined or defined by configuration. If each action is implemented in .net code then when handling many actions as they occur results in calling each action seperately, in a loop or not. Its whole sequence of operations is executed from start until end. Scripting languages normally do the same and when used to handle states in a statemachine they offer a flexible and configuratable but slow solution. BLAST's SSMD mode takes many equal actions taken on different data and executes them as one. If it encounters a multiplication for example then it handles the multiplication in loop, if available with SIMD instructions, over all datasegments at once before reading the next token. This would not have been possible to do with the code that was written in c# to handle a single action and often is much faster to execute especially when there are many datasegments. Offcoarse, you could hand-optimize each actions's .net code to also have a path to execute many but that adds complexity, mantainance and you loose configurability. Many games these days are defined by configuration and in these blast would allow for higher performance or lower battery use.
- Limited datatype support: Numerics and IDs
- Small memory footprint
- Code & Data seperated
- Integrates with DOTS
- Uses BURST function pointers to connect to basic functionality in its environment.
- Seperation of code, data and stack
- SSMD mode, executes multiple script with simular controlflow but different data at once as if they were vectors. Current no control flow support
- Interpretation is BURST compatible.
- Unmatched performance.
An equation like a = (3 3) * (4 4) + (1.2 1.2); results in only 18 bytes of code, 4 bytes of data and 8 bytes of stack use, resulting in a block of 30 bytes, easily fitting into a single cache line ensuring maximum performce running this as a kernel over large datasets:
# code
a = (3 3) * (4 4) + (1.2 1.2);
# ssmd assembly
push vector cf1 #154 #153 #153 #063 csref #006 set a expand_v2 3 * expand_v2 4 + pop
# bytecode
000| 035 034 064 154 153 153 063 066 006 001
010| 128 061 087 011 061 088 008 034
Constant values are used in the script, but in ssmd's view, these are parameters, they are just constant for the current run; resulting in optimizations not normally possible like calculating parts of a script only once for the dataset. From version 1.0.5 these will be made availeble as input parameters for ssmd scripts.
Variables dont need to be defined although users might want to force their order in the datasegment using input defines.
All data starts out as a float in BLAST and either grows into vectors or gets converted/defined as something else.
A statement is:
- a collection of operations assigned to a variable terminated with a
;
:a = 1 * 3 + maxa((1 2 3), (5 6 7));
- a procedure call: 'procedure_identifier( maxa(1,2,3));`
Control flows can nest statements:
- If Then Else:
if ( a | b ) then ( c = 1; ) else ( c = 2; );
- While loop:
while( i < 100) ( i = i + 1;);
- For loop:
for( i = 0; i < 100; i = i + 1) ( a = a / 4; );
- Switch:
switch (a) (case 3: ( a = 2; ) default: ( a = 3;));
_The intitial language version will demand use of compounds in all situations. Later version may relax on this constraint. _
Everything between (
and )
is called a compound, a compound may have several interpretations:
- A list of statements:
( a = 1; b = 2; c = a * b; );
- Structuring of sequences:
a = a + (a | b);
- A list of parameters:
(a, b, c);
- A vector definition:
a = (1 2 3 4);
There is limited datatype support, this will change in future versions:
full support: NUMERIC, vectorsize 1-4
partial support: BOOL32
BLAST connects to its enviroment in several ways
- data can be communicated through the packagadata of the script by setting up its values
- data can be mapped to a DOTS IComponentData structure or any other segment of memory
- data can be exchanged through functionpointers registered in BLAST connection data from your simulation with the script engine.
Vectors may be defined based on all supported datatypes and functions, constant and functions may be mixed but the number of dataelements must be equal for each element defined in the vector.
1> Defines a, a vector of 2 constants and 1 pop operation popping 1 numeric or id:
a = (1 pop 2);
2> Defines b, a vector of 2 constants of size 3 and a pop3:
b = ((1 2 3) (pop pop pop) (1 2 3));
- Vectors can be used in all operations that can be used with other variables of the same datatype (numeric, id)
- Due to restrictions set by the project we will not allow unlimited vector length (see arrays), this depends on a function used to decode the opcodes for functions and that in turn depends on the language version (information follows)
- The minimal support will (in all languageversions) be: size 1, 2, 3 and 4; other sizes are under discussion as the current encoding schema allows for only 8 different lengths (8 differing lengths, not max 8)
Different language versions allow the interpretor(s) to differentiate between very distinct outputs depending on compiler settings.
Arithmetic operators: + - / *
Boolean operators: & | ^ !
Boolean evaluators: < > <= >= = ! !=
Assignment: =
Enclosures: ( )
Statement terminator: ;
Value seperator: ,
Decimal seperator: .
Indexer .
Identifier: [a..z][0..9|a..z]*[.|[][a..z][0..9|a..z]*[]]
Control Flow: if then else
while | for loops
switch case default
The # is used to define constants, input and output(s) and validations. Any sequence not matching #[define/input/output/validate]
is considered a comment.
- A
#[definition]
must appear at the start of the line - Defines and inputs must appear in the code BEFORE any other code
- A comment can be started at any point in the line
#define variable 1234
#input variable 0 4
#output variable 4 8
#validate variable a 123
All other text that starts with # is a comment and may start at any point of a line, also after #[definition]
:
#input variable [offset] [bytesize] # comment 123
Different package modes for different needs:
- Normal: code, metadata, data and stack combined
[----CODE----|----METADATA----|----DATA----|----STACK----]
1 2 3 4
- SSMD: code and metadata seperated from data & stack in 2 seperate segments, stack can usually best be omitted with the
NoStack
option.
[----CODE----|----METADATA----] [----DATA----|----STACK----]
1 2 3 4
- Entity: code is seperated from metadata, data and stack which are mapped onto an IComponentData
[----CODE----] [----METADATA----|----DATA----|----STACK----]
1 2 3 4
- Compiler: mode for internal and debugging purposes, this will maintain a node tree and variable overviews.
When blast encounters the following equation: a = a * b * 4, the depending on alignment or not the following assembly is executed in burst (only important parts are shown)
**Aligned float * float **
To perform the multiplication between 2 variables in a datasegment that is aligned:
**Register float * constant ** The result is multiplied in a register (of the interpretor) with a constant, blast knows to execute the move only once, undependant of the number of records. In future version the ssmd interpretor will call a value constant if its equal acros the input, it can then apply these optmizations each run based on the data given, this is an optimization that would normally be very difficult to cleanly achieve.
The datasegment and stack are setup as arrays of arrays and are said to be aligned when each basepointer of the subarrays has the same offset to the other. If thats not the case we call the data unaligned and blast has to calculate each base index resuling in un-packed instructions but still very fast code:
The effect of alignment on processing speed can be seen in sample 10, after all cubes have gotten different states and thus run different scripts, the interpretor gets datasegment arrays with indices skipping because they are for other scripts. It can be up to 3 times slower to work with unaligned datasegments but it very much depends on the script involved. As a programmer you should know this happens and you can compensate by writing|using your code differently if it gives problems. It still is fast and still uses simd instructions in most cases.
Not supported in current release
Allow BLAST to compile the script into c# burst compatible function pointers, this allows native performance for known scripts while keeping the same workflow in decision libraries. This way there is no performance hit running compile time known scripts at runtime in your simulation. Only runtime compiled scripts will take the performance hit of being interpreted. This will allow the developer to create 1 code path and to not worry too much about performance. Note however, for SSMD operation the compiler will still need bytecode as it is not possible to run in single script multiple data mode using native compiled code.
- an example of use would be updating behaviour or balancing scripts in multiplayer games on server login, a later binary update of the game might provide the same scripts hardcoded in an update without any alteration in code paths for the developer of that game.
This should void a big drawback of using a scripting engine to do a lot of tasks in a simulation where every millisecond counts.
**Registration of known scripts from code during development: **
/// <summary>
/// design time definition of a script, any derived class with filled code will be precompiled into a burst function
/// </summary>
class BlastScript1 : BlastScript
{
bs1() {
Name = "bs_designtime_1";
Code = @"
#input position float3 0 0 0
position = position + random(-0.01, 0.01);
position = min((100 100 100), max((-100, -100, -100), position));";
}
};
/// <summary>
/// any string field, decorated with the BlastScriptAttribute will be precompiled during build into a bursted function
/// </summary>
class BlastScripts
{
[BlastScript(Name = "bs_designtime_2")]
const string bs2 = @"
#input position float3 0 0 0
position = position + random(-0.01, 0.01);
position = min((100 100 100), max((-100, -100, -100), position));";
}
Dont allocate package memory for stacks, instead use the interpretor's stack. When processing many data records in ssmd mode then memory becomes a serious limitation. If the script is small and the data is small they should be packeged together easily fitting many in cpu cache. When there is more data at some point the cpus cache will be the limiting factor, especially when running multiple threads. If the stack is mapped to the stackmemory of the interpretor the cpu will swap less memory in and out when processing large amounts of records. This is easily seen in the following video (a .net build) https://youtu.be/FnEtQ0tWAto As soon as the stack spans multiple cache lines it becomes very benificial to use the threads stack allowing it to make full use of multithreading. Note that for very small scripts this actually hurts performance as code and data or not on the same cacheline anymore (usually above 64byte|128bytes)
Built in functions are directly encoded in opcodes and are made available to blast in the form of a BlastScriptAPI class. Several math functions have been built in and many other functions will follow.
See (Language Reference)[LanguageReference.md] for an overview of all built in functions.
f = rsqrt(1.1);
g = sqrt(2);
h = max(f, g);
Blast can call external functions that have been burst compiled and registered in the blast api used during compilation and execution.
An external function example in c#
// the declaration:
[BurstCompatible]
static public float ExternalTest(IntPtr blast, IntPtr env, IntPtr caller, float a, float b)
{
return a * b;
}
// the registration:
Blast.ScriptAPI.Register(ExternalTest);
Functions can also be identified with BlastFunctionAttributes, blast will enumerate these on startup and registers them as external function calls.
[BurstCompatible]
static public float AttributeTest(float b)
{
return b * 2;
}
Blast also allows the use of inlined functions, albeit with restriction imposed due to blasts memory requirements:
function a(a, b, c)
(
return a + b + c;
)
p0 = 1;
p1 = 2;
p2 = 3;
p = a(p1, p2, p3);
The function call is inlined into the callsite, this means there are several limitation or advantages depending on how its used:
- Inlined functions share global scope, only function parameters are scoped to the function defining them.
- Inlined functions may not declare new variables, if you need extra variables, declare them as input, output or as normal variable in the main body.
- Vector size and datatype is propagated, ie, an inlined function is generic, unless the function uses variables from the globalscope that might force vectorsizes onto operations.
- Inlined functions dont use additional stackmemory due to these restrictions and the fact they are inlined (no jumps, no register pushes to stack, no declarations)
Inline functions could should be viewed as shortcut-macros in algorithms, larger reusable code blocks should be added as external function calls for maximum performance as nativecode will always be faster then interpreted bytecode.
Parameters are identifiers or sequences of operations seperated by a ,
and may be:
- values
- operation sequences resulting in values
- other functions
- vectors
All things resulting in a value can be indexed:
a = 1; # declares storage
b = (4 5 6 7);
a = b.x;
b.x = 7;
debug(a); # logs 4
debug(b); # logs 7 5 6 7
Currently the maximum vectorsize in datatypes is 4, so the only allowed indexers are:
.x|y|z|w and .r|g|b|a
This will change in future versions.
a = max (1, 2);
a = max ((1 1), (2 2));
a = max (a, 1, max(a, 2));
a = max (a - b, a + max(a, 2));
Validation script as used in some tests:
#define result_1 11111
#define result_2 22222
b = b * 1 * 1 * 1 * (1 / 10) * 1;
a = 1 + 1;
b = 1 * -2 * 3 * 4;
c = 1 * (2 * 3 / 4) * 5; # 1.5 * 5 = 7.5
if(a = 1) then ( d = 122 ) else ( d = 100 );
while(a <= d)
(
a = a + 1;
b = b * 1 * 1 * 1 * 1 / 10 * 1;
)
switch(a > b)
(
case 1:
(
e = result_1;
)
default:
(
e = result_2;
)
);
f = 1.1;
g = sqrt(2);
h = max(f, g);
for(i = 5; i < 15; i = i + 1)
(
h = h + 1;
)
#validate e 11111
#validate a 101
#validate c 7.5
#validate h 11.4142
Implementation of a state in a statemachine
#
# HOVERTANK STATEMACHINE
#
### Define possible states
#define IDLE 1
#define EXPLORE 2
#define FORTIFY 3
#define ATTACK 4
#define EVADE 5
#define EXPLODE 6
#define DESTROYED 7
# this directly maps to an IComponentData of 5 floats:
#input current_state 0 4
#input explore_cell 4 4
#input attack_target 8 4
#input counter 12 4
#input current_cell_id 16 4
# every 60 frames: check for a target in range
counter = counter + 1;
if(counter < 60) then
(
current_state = EXPLORE;
return;
)
counter = 0;
attack_target = FindClosestTarget();
# if there is a target attack it
if(attack_target >= 0) then
(
current_state = ATTACK;
StopMovement();
return;
)
# otherwise keep exploring
current_state = EXPLORE;
# no current explore target -> select new target and start moving to it
if(explore_cell <= 0) then
(
explore_cell = ExploreIsland();
if(explore_cell > 0) then ( MoveToCellId(explore_cell); );
)
current_cell_id = GetCurrentCellId();
# if the cell we are exploring became explored -> find another target
cell_is_explored = CellIsExplored(explore_cell);
if(cell_is_explored = 1) then
(
explore_cell = current_cell_id;
)
# reached target of exploration or cell already explored -> select new target
if(explore_cell = current_cell_id) then
(
explore_cell = ExploreIsland();
if(explore_cell > 0) then ( MoveToCellId(explore_cell); )
)
# no target to explore, nothing to attack -> fortify until other goals apply
if(explore_cell < 0) then
(
current_state = FORTIFY;
StopMovement();
)
Various operations for testing:
a = 1 * -2;
b = -1 * 2;
c = a / b;
d = c * 100;
e = 1 + 2 * 2;
f = 1 + 2 * 2 + 5;
g = 1 + 2 / 6;
h = 1 + -2 / 6;
i = 1 - 2 * 2 + 5;
j = 1 + 2 * 2 - 5;
k = 1 +-2 * 2 + 5;
l = 1 + 1 + 1; # should all optimize to adda or suba
m = 1 - 1 - 1 - 1;
n = 1 + 1 + 1 + 1 + 1;
o = 1 - 1 - 1 - 1 - 1 + 1 + 1; # suba and adda
t = 1 + 1 + 1 + 1 + 1 + 1 + 1; # 1 more then in patterns
p = 2 * 2 * 2; # should all optimize to mula
q = 2 * 2 * 2 * 2;
r = 2 * 2 * 2 / 2 * 2; # division is inverted by analyser, optimizer converts to mula
s = a * e * j * g * v / 2;
u = 2 * 2 * 2 * 2 * 2 * 2 * 2;
v = 2 / 2 / 2 / 2 / 2 / 2;
w = a / b / c / d;
x = a * (b / 2) * 2;
y = (a * b) * 2;
z = (a * b) + b * (a + b);
aa = 1 + 1 * (1 / 2) * 2;
ab = 2 * (2 * 2 * 2) * 2;
ac = 2 - 2 - 2 - 2 - 2 - 2;
ad = 1 & -2 | 1;
ae = 1 & 2 & 3 & 4 & 5;
ag = 1 * 10 * 3 * (3 + 4 * 2);
Vector example with some script debuging features and insights into BLAST's stack usage, it uses less data and code bytes while keeping execution flat for the interpretation allowing it to use less memory on the target cpu. Note difference between ssmd and normal packaging in code and datasegment, in normal mode constant data is allocated in the datasegment while for ssmd constant data is inlined optimizing for many executions of the same code with different data.
a = maxa((1 2 3), (4 5 6), (7 8 9));
debug(a);
debugstack();
# compiler node tree (PackageMode == normal|ssmd)
#
# root of 6
# function push
# vector compound statement of 3
# constant parameter 7
# constant parameter 8
# constant parameter 9
# /
# /
# function push
# vector compound statement of 3
# constant parameter 4
# constant parameter 5
# constant parameter 6
# /
# /
# function push
# vector compound statement of 3
# constant parameter 1
# constant parameter 2
# constant parameter 3
# /
# /
# assignment of a
# function maxa
# function pop
# function pop
# function pop
# /
# /
# compound statement of 1
# function debug
# parameter a
# /
# /
# compound statement of 1
# function debugstack
# /
# /
#
#
# Packagemode == Normal:
#
# push vector 51 7 8 9 push vector 51 4 5 6 push vector 51 1 2 3 setf a 50 | pop pop pop nop
#
# 000| 033 051 131 089 132 033 051 088 129 130
# 010| 033 051 085 086 087 003 128 050 013 032
# 020| 032 032 000
#
# Blast.Debug - codepointer: 29, id: 128, NUMERIC: 9,00, vectorsize: 1
#
# DATA 0 = 9 1 Numeric // [0] == numeric[1] assigned by var a
# DATA 1 = 5 1 Numeric // [1] == 5 => numeric data, not in constants ** NOTE HOW VECTORS USE STACK MEMORY **
# DATA 2 = 6 1 Numeric // [2] == 6 => numeric data, not in constants
# DATA 3 = 7 1 Numeric // [3] == 7 => numeric data, not in constants
# DATA 4 = 9 1 Numeric // [4] == 8 => numeric data, not in constants
#
#
#
# Packagemode == SSMD:
#
# push vector 51 cf1h #224 #064 8 cf1h #016 #065 push vector 51 4 cf1h #160 #064 cf1h #192 #064 push vector 51 1 2 3 setf a 50 | pop pop pop nop
#
# 000| 033 051 064 224 064 089 064 016 065 033
# 010| 051 088 064 160 064 064 192 064 033 051
# 020| 085 086 087 003 128 050 013 032 032 032
# 030| 000
#
# DATA 0 = 9 1 Numeric // [0] == numeric[1] assigned by var a
Example output for nested functions, the results are pushed and popped when needed. This ensures no recursion or stacking of register data during interpretation. The script code debug(max((1 + 2), 2));
or debug(max(1 + 2, 2));
results int the following output:
root of 3
function pushc
constant parameter 1
operation Add
constant parameter 2
/
function pushf
function max
function pop
constant parameter 2
/
/
function debug
function pop
/
/
000| push compound 1 + 2 nop push function max ^ pop 2
010| debug pop nop
000| 030 085 002 086 000 029 042 009 025 086
010| 255 253 025 000
Blast.Debug - codepointer: 12, id: 25, NUMERIC: 3,00, vectorsize: 1
Example output for nested vector operations with negative constants encoded in bytecode, although the output is flattened into 6 statements which seems more work its actually faster to execute because by pushing stuff on the stack the datatype becomes fixed for the interpretor and it wont need to grow vectors value by value, also not processing nested compounds but instead executing them in a linear fashion ensures less function stack use for the code running the interpretation (more constant memory).
**Note that for this test the optimizer is disabled as it removes unneeded compounds resulting in less statements to execute but a less usefull test for flattening. **
[InlineData("a = ((2 2 2 2) * (2, -2 2 2)) * (4 4 4 4);", 16)]
root of 6
vector[4]function pushv
constant parameter 2
constant parameter 2
constant parameter 2
constant parameter 2
/
function pushc
operation Substract
constant parameter 2
/
vector[4]function pushv
constant parameter 2
function pop
constant parameter 2
constant parameter 2
/
vector[4]function pushc
vector[4]function pop
operation Multiply
vector[4]function pop
/
vector[4]function pushv
constant parameter 4
constant parameter 4
constant parameter 4
constant parameter 4
/
vector[4]assignment of a
vector[4]function pop
operation Multiply
vector[4]function pop
/
/
000| push vector 68 2 2 2 2 push compound - 2 nop
010| push vector 68 2 pop 2 2 push compound pop * pop
020| nop push vector 68 4 4 4 4 set a pop
030| * pop nop nop
000| 026 068 086 086 086 086 030 003 086 000
010| 026 068 086 025 086 086 030 025 005 025
020| 000 026 068 088 088 088 088 001 128 025
030| 005 025 000 000
This demonstrates inverse replacements on division by constants followed by remapping the multiplication sequence into a function call.
Input code segment: a = 1 / 2.1 / 3.2 / 4.3;
Resulting nodes and bytecode:
root of 1
assignment of a
function mula
constant parameter 1
constant parameter 0,4761905
constant parameter 0,3125
constant parameter 0,23255813
/
/
/
000| set a mula yield 1 0,4761905 0,3125 0,23255813 nop
000| 001 128 033 017 085 129 130 131 000
For now the optimizer will only look at single nodes and will mainly perform replacements of operation sequences. A string of additions is better executed as a function where more information is known about its result and it reduces control flow in the interpretor.