-
Notifications
You must be signed in to change notification settings - Fork 12
API
This document provides an insight into basic classes and interfaces of PT Pattern Matching Engine and shows how these interact.
- Workflow
- Source code repositories
- Pattern repositories
- Parsers and converters
- Basic classes and interfaces
- Logging
- UST nodes
- Errors and exceptions
- ANTLR utility classes
- JSON serialization
- Other utility classes
Workflow
is a basic class that combines stages of reading and parsing
files, tree conversions, and matching UST with patterns. The Workflow
class is responsible for timing these stages. It also provides support
for parallelizing.
-
ISourceCodeRepository SourceCodeRepository
— a source code repository -
LanguageFlags languages
— a list of languages to be parsed. Based on these languages,Workflow
determines, whichILanguageParser
parsers andIParseTreeToUstConverter
converters should be created. -
IPatternsRepository PatternsRepository
— a pattern repository -
Stage
is the final operational stage:-
Read
— reading a file -
Parse
— parsing -
Convert
— converting to UAST -
Preprocess
— tree preprocessing (calculating arithmetic expressions, connecting lines, simplifying patterns) -
Match
— pattern matching -
Patterns
— a mode for checking the parsing of patterns If a certain stage is specified, the stages preceding it will also be performed. This does not apply to thePatterns
stage, which is responsible only for testing templates.
-
-
ILogger Logger
— logger that is embedded in other internal objects implementingILoggable
.
Additional parameters that are used for ANTLR parsers only are also available:
-
MaxStackSize
— the maximum stack size in bytes. If the value is not zero, then parsing is started in the instance ofThread
with a specified value. This option is useful if a StackOverflow exception occurred during ANTLR parsing. -
MemoryConsumptionMb
— approximate memory consumption in megabytes after which the procedure for cleaning the ANTLR cache is started (ClearDFA
for a lexer and parser).
See also the paragraph "Memory consumption" in the article "Theory and practice of source code parsing with ANTLR and Roslyn."
The result of processing is obtained as follows:
WorkflowResult result = workflow.Process()
. The resulting class has
the following properties:
- List of uploaded source code files
IReadOnlyList<SourceCodeFile> SourceCodeFiles
- List of parse trees
IReadOnlyList<ParseTree> ParseTrees
- The list of USTs:
IReadOnlyList<Ust> Usts
- List of matching results
IReadOnlyList<MatchingResult> MatchingResults
- Timing for the stages:
-
TotalReadTimeSpan
— total reading time span -
TotalParseTimeSpan
— total parsing time span -
TotalConvertTimeSpan
— total converting time span -
TotalPreprocessTimeSpan
— total preprocessing time span -
TotalMatchTimeSpan
— total matching time span -
TotalPatternsTimeSpan
— total pattern processing time span -
TotalLexerTicks
— total lexer processing time span (ANTLR parsers only) -
TotalParserTicks
— total parser processing time span (ANTLR parsers only)
-
-
TotalProcessedFileCount
— total amount of files processed -
TotalProcessedCharsCount
— total amount of characters processed -
TotalProcessedLinesCount
— total amount of lines processed -
ErrorCount
— total amount of errors -
Workflow
also generates real-time logger messages.
The figure below illustrates the workflow:
Workflow
Repositories implement the ISourceCodeRepository
interface; they are
located in the PT.PM.Common project.
-
FileCodeRepository
uploads the source code from thefileName
file. -
FilesAggregatorCodeRepository
uploads all files from the specified folderfilePath
. -
MemoryCodeRepository
uploads the source code from thefileName
line. -
ZipAtUrlCachedCodeRepository
downloads the GitHub repository and unpacks it. It is used to test parsers and converters.
Repositories implement the IPatternsRepository
interface; they are
located in the PT.PM.Patterns project.
-
FilePatternsRepository
uploadsPatternDto
objects from thefilePath
file. -
MemoryPatternsRepository
allows uploadingPatternDto
objects from memory. -
JsonPatternsRepository
allows uploading patterns from a JSON document. -
DslPatternRepository
allows uploading a DSL pattern from thepatternData
line forlanguageFlags
languages. -
DefaultPatternRepository
contains default hardcoded patterns.
-
ILanguageParser
describes aSourceCode
parser to theParseTree
. Implementations:-
CSharpRoslynParser
— a Roslyn parser for C# -
JavaAntlrParser
— an ANTLR parser for Java -
PhpAntlrParser
— an ANTLR parser for PHP -
TSqlAntlrParser
— an ANTLR parser for T-SQL -
PlSqlAntlrParser
— an ANTLR parser for PL/SQL -
JavaScriptAntlrParser
— an ANTLR parser for JavaScript
-
-
IParseTreeToUstConverter
describes aParseTree
converter toUST
. Implementations:-
CSharpRoslynParseTreeConverter
— a converter for C# -
JavaAntlrParseTreeConverter
— a converter for Java -
PhpAntlrParseTreeConverter
— a converter for PHP -
TSqlAntlrConverter
— a converter for T-SQL -
PlSqlAntlrConverter
— a converter for PL/SQL -
JavaAntlrParseTreeConverter
— a converter for JavaScript
-
-
IAstPatternMatcher<TPatternsDataStructure>
describes aUST
matcher andPattern
patterns. Returns a collection fromMatchingResult
. There is a single implementation yet:BruteForcePatternMatcher
located in the PT.PM.Matching project.
-
SourceCodeFile
— a source code object containing:-
Name
— name -
RelativePath
— a relative path to a file in the root of a repository -
Code
— source code
-
-
ParseTree
— a parse tree obtained as a result of parsing theSourceCodeFile
source code object. -
UST
(Universal Syntax Tree) — a universal syntax tree obtained as a result of converters operation implementingIParseTreeToUstConverter
. -
Pattern
— an object that stores a pattern in a structured format, ready for use in the pattern matching engine-
Key
— a uniquestring
pattern identifier -
LanguageFlags
— a list of languages for which this universal pattern is applicable For example, aRandom
class instance is applicable for C# and Java. -
PatternNode Data
— an AST fragment for the pattern which is matched against theUST
.
-
-
PatternDto
— a serializedPattern
intended for its storage and transmission.-
Name
— name (optional) -
Key
— a uniquestring
pattern identifier -
LanguageFlags
— a list of languages for which this universal pattern is applicable For example, aRandom
class instance is applicable for C# and Java. -
DataFormat
— a pattern formatJSON
andDSL
are available. -
Value
— the text of a pattern inJSON
orDSL
format (depends onDataFormat
). -
CweId
— a Common Weakness Enumeration identifier -
Description
— a pattern description -
DebugInfo
— information that helps with debugging
-
- An example of a pattern serialized in JSON:
{ "Key": "96", "Name": "Hardcoded Password", "Languages": "CSharp, Java, PHP, PLSQL, TSQL", "DataFormat": "Dsl", "Value": "<[(?i)password]> = <[ \"\\w*\" || null ]>" }
-
MatchingResult
— a result of matching aPattern
againstUST
. Contains the following properties:-
Pattern
— a reference to the pattern -
List<UstNode> Nodes
— a list of matched patterns. The list is used because several nodes can be specified for multi-line patterns. -
FileNode
— a reference to the source code file
-
-
MatchingResultDto
— a matching resultMatchingResult
that can be used for serialization.-
MatchedCode
— a fragment of the matched code -
BeginLine
,BeginColumn
,EndLine
,EndColumn
— coordinates in the line-column format -
PatternKey
— a pattern ID -
SourceFile
— a source code file
-
- An example of a matching result serialized in JSON:
{ "MatchedCode": "rand()", "BeginLine": 60, "BeginColumn": 30, "EndLine": 60, "EndColumn": 36, "PatternKey": "27", "SourceFile": "Patterns.php" }
-
TextSpan
— a linear text location that containsStart
andLength
(a start and length, respectively). Used in all conversions, except the output. -
LineColumnTextSpan
— a text location that containsBeginLine
,BeginColumn
,EndLine
, andEndColumn
. Used to output the found pattern in a handy format.
-
ILogger
— an interface abstracting the logging methods:-
LogError
— logging errors in the text or exception formatException
. There are several types of exceptions:ParsingException
,ConversionException
,MatchingException
, and a generalException
. -
LogInfo
— logging information messages Examples:- Command line arguments: -f Patterns.php --stage convert
- File Patterns.php has been parsed (Elapsed: 00:00:00.6350338)
-
LogDebug
— logging Debug messages They are not used in the Release configuration. Examples:- Arithmetic expression 60 * 60 has been folded to 3600
- Strings "a" + "b" has been concatenated to "ab"
-
-
ConsoleLogger
implementation is located in the PT.PM.Console project. This logger outputs the result to the console and to a file. NLog is embedded. -
ILoggable
— an interface with theILogger Logger
property. It is implemented in all classes where logging is used. By default, all classes use theDummyLogger
implementation, which contains empty methods. It is used to get rid of a necessity to check that theLogger
is notnull
before calling the methodsLogError
,LogInfo
, etc.
-
Children
— the nearest descendants of the current node (siblings) -
Descendant
— all descendants of the current node (siblings, grandchildren, etc.)
UstNode
is the basic class for all nodes. It has the following
properties:
-
NodeType
— node type. It is used for nodes matching. -
Parent
— parent of nodeNull
is used for the root node. -
Children
— a list of children It is used forUST
traversing. -
TextSpan
— a location in the text in linear coordinates.
And methods:
-
CompareTo
— a basic implementation for matching two nodes against each other. More details are given in the article "Tree structures processing and unified AST" in the section "Algorithm for matching AST and patterns." -
DoesAnyDescendantMatchPredicate
shows if any descendant matches the passed predicate. For example, it can be used forPatternExpressionInsideExpression
. -
DoesAllDescendantsMatchPredicate
shows if all descendants match the transmitted predicate. -
ApplyActionToDescendants
applies the action to all descendants. It is used to add a text location offset when parsing island languages. -
GetAllDescendants
returns all descendants of the given node. -
ToString
returns a string representation of a node, which is more like a C# syntax.
-
EntityDeclaration
— entity declaration:-
TypeDeclaration
— class or interface declaration -
ConstructorDeclaration
— constructor declaration -
FieldDeclaration
— field declaration -
MethodDeclaration
— method declaration -
ParameterDeclaration
- parameter declaration -
StatementDeclaration
— statement declaration
-
-
namespaceDeclaration
- namespace declaration -
Statement
— an instruction or statement (usually ends with a semicolon)-
BlockStatement
— example:{ Statement* }
-
BreakStatement
— example:break;
-
ContinueStatement
— example:continue;
-
DoWhileStatement
— example:do (Expression) while Statement
-
EmptyStatement
— example:;
-
ExpressionStatement
— example:Expression;
-
ForeachStatement
— example:foreach (var e in Expression) Statement
-
ForStatement
— example:for (int i = 0; i < n; i++) Statement
-
GotoStatement
— example:goto Id;
-
IfElseStatement
— example:if (Expression) Statement else Statement
-
ReturnStatement
— example:return Expression;
-
ThrowStatement
— example:throw Expression;
-
TypeDeclarationStatement
— declaration of type inside theStatement
-
WhileStatement
— example:while (Expression) Statement
-
WithStatement
— example:using (Expression) Statement
-
WrapperStatement
— an artificialStatement
that contains a node ofUstNode
type.
-
-
Expression
— an expression that returns a result (for example, a function call, an arithmetic operator, etc.)-
AnonymousMethodExpression
— an anonymous function -
AssignmentExpression
— an assignment expressiona = b
-
ArrayCreationExpression
- creation of array:a = new[5, 5]
-
BinaryOperatorExpression
— a binary expression:a * b
,a + b
-
CastExpression
— conversion to a type(type)expr
-
ConditionalExpression
— a conditional expressiona ? b : c
-
IndexerExpression
— an indexera[b]
-
InvocationExpression
— a function callTarget(Args)
-
MemberReferenceExpression
— reference to a member of the classA.B.C
-
MultichildExpression
— an artificialExpression
containing several expressionsExpression
-
ObjectCreateExpression
— creation of an objectnew A()
-
UnaryOperatorExpression
— an unary expression++a
,!a
-
VariableDeclarationExpression
— variable declaration,var a = b
-
WrapperExpression
— an artificialExpression
containing a node of a typeUstNode
-
-
Token
— a tail node:-
IdToken
— ID -
Literal
— a primitive constant value:-
BinaryOperatorLiteral
— a binary literal -
BooleanLiteral
—true
orfalse
-
CommentLiteral
— a comment// password="e@jf7!ke"
-
FloatLiteral
— a floating-point number42.42
-
IntLiteral
— an integer number42
-
ModifierLiteral
- modifiers of types and type membersstatic class A
-
ParameterModifierLiteral
- parameter modifierint f(const int a)
-
NullLiteral
—null
-
StringLiteral
— a string"hello world"
TypeTypeLiteral
-
UnaryOperatorLiteral
— an unary literal
-
-
ThisReferenceToken
-this
-
-
CollectionNode<TAstNode> : UstNode
— node collection-
EntitiesNode
— entities collection -
ArgsNode
— the collection of argumentsinvocation(a, b, c)
-
-
DslNode
— a node for wrapping the DSL information (target languages, variable definitions) -
PatternBooleanLiteral
— a boolean literalbool
,true
, orfalse
-
PatternComment
allows finding comments on a regular expression, equivalent toComment: regex
in DSL. -
PatternExpression
— a wildcard expression (any expression), equivalent to#
in DSL. Negation for expressions can also be used. -
PatternExpressionInsideExpression
wraps any expression and indicates that it can be met at any depth of the tree. Equivalent to<{ expression }>
in DSL. -
PatternExpressions
— a node for matching multiple expressions considering the constraints. For example,HashBytes(^(md2|md4|md5)$, ...)
-
PatternIdToken
— placement of identifiers by a regular expression. For example,<[\w+]>
will be matched against any identifiers. -
PatternIntLiteral
— placement of integers by a range, for example,<[..-20 || -10 || -5..5 || 010 || 0x10 || 30..]>
-
PatternMultipleExpressions
— placement for an arbitrary number of any expressions. Equivalent to...
In DSL. It can be used for function arguments. -
PatternStatement
wraps an instruction. -
PatternStatements
wraps multiple instructions. -
PatternStringLiteral
placement of strings by a regular expression. For example,<[""]>
will be matched against any strings. -
PatternTryCatchStatement
— an empty constructiontry catch { }
-
PatternVarDef
— definition of a variable that can take multiple values. It can be either named<["\w*" || null]>
or unnamed:<[@pwd:password]>
. -
PatternVarRef
— a reference to a pinned variable. For example,Response.Cookies.Add(<[@cookie]>);
IUstVisitor
and IUstListener
interfaces for traversing UST
are
located in the PT.PM.UstPreprocessing project. The UstVisitor
class implements the IUstVisitor
interface and deep copying of the
trees is performed in it by default. Thus far, the
dynamic
dispatching is used.
UstPreprocessor
is used to simplify the UST
. It overrides some
UstVisitor
methods. In fact, at the preprocessing stage, UST
is
transformed into a more simple UST
. Смотри также Simplifying an
AST.
-
ParsingException
occurs when parsing source code or template, if there are lexical or syntax errors. Examples:Error: no viable alternative at input '(?' at 1:1
Error: token recognition error at: '>' at 1:18
-
ConversionException
occurs when the parse tree is converted to a universal AST (UST). It is also used when converting templates. For example,NullReferenceException
will be wrapped inConversionException
in some visitor method. -
MatchingException
occurs during the execution of the algorithm for matching patterns against the UST nodes. -
ShouldNotBeVisitedException
is used to explicitly specify that a visitor method should not be visited.
-
AntlrCaseInsensitiveInputStream
— a case-insensitive input stream, which is used, for example, for PHP, PL/SQL, and T-SQL. -
AntlrDefaultVisitor
— implementation of the default visitor for ANTLR parse trees.-
Visit
calls thetree.Accept(this)
for a particular node. Contains an exception handler. -
VisitChildren
- If there is a single child, the
Visit
is called. - Otherwise, it returns a descendant to
MultichildExpression
, the descendant toExpression
.
- If there is a single child, the
-
VisitTerminal
tries to parse the value with different regular expressions and on the basis of this determine the type (String
,Float
,Int
, etc.)
-
-
AntlrHelper
— converting ANTLR text locations to unified ones, outputting the parse tree in a convenient text representation. -
AntlrMemoryErrorListener
— logging ANTLR lexer and parser errors -
AntlrParser
— a basic class for all ANTLR parsers. Implements the following:- Virtual method of preprocessing text
(
PreprocessText
) which by default normalizes line breaks (replaces single\r
with\n
). - Code parsing is performed first using a fast
SLL
algorithm, and, in case of failure, using a slow fullLL
algorithm. - Tracking the ANTLR cache and cleaning it under certain conditions (if memory consumption by the process exceeded a certain threshold).
- Virtual method of preprocessing text
(
PT.PM actively uses the JSON.NET library. The following classes are used to interact with it:
-
USTJsonConverter
— deserializer of UST trees. Creates children of the tree dynamically usingActivator.CreateInstance
depending on the value ofNodeType
. -
JsonUstNodeSerializer
allows enabling or disabling text locationsTextSpan
during serialization of trees, and using indents (which are not used by default). -
PatternLanguageFlagsSafeConverter
deserializesPatternDto
, without showing an exception in case of incorrect or unsupported languagesLanguageFlags
.
-
DummyLogger
— a dummy logger It is used by default in all objects implementingILoggable
, so thatLogger.LogInfo
can be written instead ofLogger?.LogInfo
and theNullReferenceException
can be avoided. -
LanguageInfo
information about the language:-
Language Language
— the value of the enumeration for a given language (for example, CSharp). -
string Title
— the title name of the language (for example, C#) -
string[] Extensions
— language extensions (for example, .cs) -
bool CaseInsensitive
shows if the language is case sensitive -
LanguageFlags DependentLanguages
— languages, the source code of which may be found within the source code of the given language (island languages). For example,JavaScript
can be met within thePHP
. -
bool HaveAntlrParser
shows whether the ANTLR parser is used for this language.
-
-
LanguageExt
contains a static dictionary with supportedLanguageInfo
and methods for working with them. -
LanguageDetector
— a source code detection by a fragment. Has a singleParserLanguageDetector
implementation, which uses different parsers and selects the language with the minimum amount of parsing errors. -
TextHelper
— various utilities to work with a text:-
LinearToLineColumn
— transformation of linear coordinates into two-dimensional ones. -
LineColumnToLinear
— transformation of two-dimensional coordinates into linear ones. -
GetLinesCount
returns the amount of line breaks for a string. -
NormDirSeparator
normalizes the directory separation operator. It is used to bring the physical address of a file to the correct format, because Windows OS uses backslashes, and Linux — direct slashes.
-
-
WorkflowLoggerHelper
— outputting information and statistics after the process of pattern matching. -
UstDotRenderer
— rendering an UST tree to DOT, which can be visualized using Graphviz. -
GraphvizGraph
saves the transmitted graph inDOT
to an image. Supported formats:Bmp, Png, Jpg, Plain, Svg, Pdf, Gif, Dot
.PNG
is used by default. -
TestHelper
— various utilities to perform unit tests