Skip to content
Dmitriy Rassadin edited this page Nov 18, 2017 · 3 revisions

PT.PM API

This document provides an insight into basic classes and interfaces of PT Pattern Matching Engine and shows how these interact.

Contents

Workflow

Workflow is a basic class that combines stages of reading and parsing files, tree conversions, and matching UST with patterns. The Workflow class is responsible for timing these stages. It also provides support for parallelizing.

Input parameters

  • ISourceCodeRepository SourceCodeRepository — a source code repository
  • LanguageFlags languages — a list of languages to be parsed. Based on these languages, Workflow determines, which ILanguageParser parsers and IParseTreeToUstConverter converters should be created.
  • IPatternsRepository PatternsRepository — a pattern repository
  • Stage is the final operational stage:
    • Read — reading a file
    • Parse — parsing
    • Convert — converting to UAST
    • Preprocess — tree preprocessing (calculating arithmetic expressions, connecting lines, simplifying patterns)
    • Match — pattern matching
    • Patterns— a mode for checking the parsing of patterns If a certain stage is specified, the stages preceding it will also be performed. This does not apply to the Patterns stage, which is responsible only for testing templates.
  • ILogger Logger — logger that is embedded in other internal objects implementing ILoggable.

Additional parameters that are used for ANTLR parsers only are also available:

  • MaxStackSize — the maximum stack size in bytes. If the value is not zero, then parsing is started in the instance of Thread with a specified value. This option is useful if a StackOverflow exception occurred during ANTLR parsing.
  • MemoryConsumptionMb — approximate memory consumption in megabytes after which the procedure for cleaning the ANTLR cache is started (ClearDFA for a lexer and parser).

See also the paragraph "Memory consumption" in the article "Theory and practice of source code parsing with ANTLR and Roslyn."

Results

The result of processing is obtained as follows: WorkflowResult result = workflow.Process(). The resulting class has the following properties:

  • List of uploaded source code files IReadOnlyList<SourceCodeFile> SourceCodeFiles
  • List of parse trees IReadOnlyList<ParseTree> ParseTrees
  • The list of USTs: IReadOnlyList<Ust> Usts
  • List of matching results IReadOnlyList<MatchingResult> MatchingResults
  • Timing for the stages:
    • TotalReadTimeSpan — total reading time span
    • TotalParseTimeSpan — total parsing time span
    • TotalConvertTimeSpan — total converting time span
    • TotalPreprocessTimeSpan — total preprocessing time span
    • TotalMatchTimeSpan — total matching time span
    • TotalPatternsTimeSpan — total pattern processing time span
    • TotalLexerTicks — total lexer processing time span (ANTLR parsers only)
    • TotalParserTicks — total parser processing time span (ANTLR parsers only)
  • TotalProcessedFileCount — total amount of files processed
  • TotalProcessedCharsCount — total amount of characters processed
  • TotalProcessedLinesCount — total amount of lines processed
  • ErrorCount — total amount of errors
  • Workflow also generates real-time logger messages.

The figure below illustrates the workflow:

Workflow

Workflow

Source code repositories

Repositories implement the ISourceCodeRepository interface; they are located in the PT.PM.Common project.

  • FileCodeRepository uploads the source code from the fileName file.
  • FilesAggregatorCodeRepository uploads all files from the specified folder filePath.
  • MemoryCodeRepository uploads the source code from the fileName line.
  • ZipAtUrlCachedCodeRepository downloads the GitHub repository and unpacks it. It is used to test parsers and converters.

Pattern repositories

Repositories implement the IPatternsRepository interface; they are located in the PT.PM.Patterns project.

  • FilePatternsRepository uploads PatternDto objects from the filePath file.
  • MemoryPatternsRepository allows uploading PatternDto objects from memory.
  • JsonPatternsRepository allows uploading patterns from a JSON document.
  • DslPatternRepository allows uploading a DSL pattern from the patternData line for languageFlags languages.
  • DefaultPatternRepository contains default hardcoded patterns.

Parsers and converters

  • ILanguageParser describes a SourceCode parser to the ParseTree. Implementations:
    • CSharpRoslynParser — a Roslyn parser for C#
    • JavaAntlrParser — an ANTLR parser for Java
    • PhpAntlrParser — an ANTLR parser for PHP
    • TSqlAntlrParser — an ANTLR parser for T-SQL
    • PlSqlAntlrParser — an ANTLR parser for PL/SQL
    • JavaScriptAntlrParser — an ANTLR parser for JavaScript
  • IParseTreeToUstConverter describes a ParseTree converter to UST. Implementations:
    • CSharpRoslynParseTreeConverter — a converter for C#
    • JavaAntlrParseTreeConverter — a converter for Java
    • PhpAntlrParseTreeConverter — a converter for PHP
    • TSqlAntlrConverter — a converter for T-SQL
    • PlSqlAntlrConverter — a converter for PL/SQL
    • JavaAntlrParseTreeConverter — a converter for JavaScript
  • IAstPatternMatcher<TPatternsDataStructure> describes a UST matcher and Pattern patterns. Returns a collection from MatchingResult. There is a single implementation yet: BruteForcePatternMatcher located in the PT.PM.Matching project.

Basic classes and interfaces

  • SourceCodeFile — a source code object containing:
    • Name — name
    • RelativePath — a relative path to a file in the root of a repository
    • Code — source code
  • ParseTree — a parse tree obtained as a result of parsing the SourceCodeFile source code object.
  • UST (Universal Syntax Tree) — a universal syntax tree obtained as a result of converters operation implementing IParseTreeToUstConverter.
  • Pattern — an object that stores a pattern in a structured format, ready for use in the pattern matching engine
    • Key — a unique string pattern identifier
    • LanguageFlags — a list of languages for which this universal pattern is applicable For example, a Random class instance is applicable for C# and Java.
    • PatternNode Data — an AST fragment for the pattern which is matched against the UST.
  • PatternDto — a serialized Pattern intended for its storage and transmission.
    • Name — name (optional)
    • Key — a unique string pattern identifier
    • LanguageFlags — a list of languages for which this universal pattern is applicable For example, a Random class instance is applicable for C# and Java.
    • DataFormat — a pattern format JSON and DSL are available.
    • Value — the text of a pattern in JSON or DSL format (depends on DataFormat).
    • CweId — a Common Weakness Enumeration identifier
    • Description — a pattern description
    • DebugInfo — information that helps with debugging
  • An example of a pattern serialized in JSON:
    {
        "Key": "96",
        "Name": "Hardcoded Password",
        "Languages": "CSharp, Java, PHP, PLSQL, TSQL",
        "DataFormat": "Dsl",
        "Value": "<[(?i)password]> = <[ \"\\w*\" || null ]>" 
    }
  • MatchingResult — a result of matching a Pattern against UST. Contains the following properties:
    • Pattern — a reference to the pattern
    • List<UstNode> Nodes — a list of matched patterns. The list is used because several nodes can be specified for multi-line patterns.
    • FileNode — a reference to the source code file
  • MatchingResultDto — a matching result MatchingResult that can be used for serialization.
    • MatchedCode — a fragment of the matched code
    • BeginLine, BeginColumn, EndLine, EndColumn — coordinates in the line-column format
    • PatternKey — a pattern ID
    • SourceFile — a source code file
  • An example of a matching result serialized in JSON:
    {
        "MatchedCode": "rand()",
        "BeginLine": 60,
        "BeginColumn": 30,
        "EndLine": 60,
        "EndColumn": 36,
        "PatternKey": "27",
        "SourceFile": "Patterns.php"
    }
  • TextSpan — a linear text location that contains Start and Length (a start and length, respectively). Used in all conversions, except the output.
  • LineColumnTextSpan — a text location that contains BeginLine, BeginColumn, EndLine, and EndColumn. Used to output the found pattern in a handy format.

Logging

  • ILogger — an interface abstracting the logging methods:
    • LogError — logging errors in the text or exception format Exception. There are several types of exceptions: ParsingException, ConversionException, MatchingException, and a general Exception.
    • LogInfo — logging information messages Examples:
      • Command line arguments: -f Patterns.php --stage convert
      • File Patterns.php has been parsed (Elapsed: 00:00:00.6350338)
    • LogDebug — logging Debug messages They are not used in the Release configuration. Examples:
      • Arithmetic expression 60 * 60 has been folded to 3600
      • Strings "a" + "b" has been concatenated to "ab"
  • ConsoleLogger implementation is located in the PT.PM.Console project. This logger outputs the result to the console and to a file. NLog is embedded.
  • ILoggable — an interface with the ILogger Logger property. It is implemented in all classes where logging is used. By default, all classes use the DummyLogger implementation, which contains empty methods. It is used to get rid of a necessity to check that the Logger is not null before calling the methods LogError, LogInfo, etc.

UST nodes

  • Children — the nearest descendants of the current node (siblings)
  • Descendant — all descendants of the current node (siblings, grandchildren, etc.)

UstNode is the basic class for all nodes. It has the following properties:

  • NodeType — node type. It is used for nodes matching.
  • Parent — parent of node Null is used for the root node.
  • Children — a list of children It is used for UST traversing.
  • TextSpan — a location in the text in linear coordinates.

And methods:

  • CompareTo — a basic implementation for matching two nodes against each other. More details are given in the article "Tree structures processing and unified AST" in the section "Algorithm for matching AST and patterns."
  • DoesAnyDescendantMatchPredicate shows if any descendant matches the passed predicate. For example, it can be used for PatternExpressionInsideExpression.
  • DoesAllDescendantsMatchPredicate shows if all descendants match the transmitted predicate.
  • ApplyActionToDescendants applies the action to all descendants. It is used to add a text location offset when parsing island languages.
  • GetAllDescendants returns all descendants of the given node.
  • ToString returns a string representation of a node, which is more like a C# syntax.

Basic node types

  • EntityDeclaration — entity declaration:
    • TypeDeclaration — class or interface declaration
    • ConstructorDeclaration — constructor declaration
    • FieldDeclaration — field declaration
    • MethodDeclaration — method declaration
    • ParameterDeclaration - parameter declaration
    • StatementDeclaration — statement declaration
  • namespaceDeclaration - namespace declaration
  • Statement — an instruction or statement (usually ends with a semicolon)
    • BlockStatement — example: { Statement* }
    • BreakStatement — example: break;
    • ContinueStatement — example: continue;
    • DoWhileStatement — example: do (Expression) while Statement
    • EmptyStatement — example: ;
    • ExpressionStatement — example: Expression;
    • ForeachStatement — example: foreach (var e in Expression) Statement
    • ForStatement — example: for (int i = 0; i < n; i++) Statement
    • GotoStatement — example: goto Id;
    • IfElseStatement — example: if (Expression) Statement else Statement
    • ReturnStatement — example: return Expression;
    • ThrowStatement — example: throw Expression;
    • TypeDeclarationStatement — declaration of type inside the Statement
    • WhileStatement — example: while (Expression) Statement
    • WithStatement — example: using (Expression) Statement
    • WrapperStatement — an artificial Statement that contains a node of UstNode type.
  • Expression — an expression that returns a result (for example, a function call, an arithmetic operator, etc.)
    • AnonymousMethodExpression — an anonymous function
    • AssignmentExpression — an assignment expression a = b
    • ArrayCreationExpression - creation of array: a = new[5, 5]
    • BinaryOperatorExpression — a binary expression: a * b, a + b
    • CastExpression — conversion to a type (type)expr
    • ConditionalExpression — a conditional expression a ? b : c
    • IndexerExpression — an indexer a[b]
    • InvocationExpression — a function call Target(Args)
    • MemberReferenceExpression — reference to a member of the class A.B.C
    • MultichildExpression — an artificial Expression containing several expressions Expression
    • ObjectCreateExpression — creation of an object new A()
    • UnaryOperatorExpression — an unary expression ++a, !a
    • VariableDeclarationExpression — variable declaration, var a = b
    • WrapperExpression — an artificial Expression containing a node of a type UstNode
  • Token — a tail node:
    • IdToken — ID
    • Literal — a primitive constant value:
      • BinaryOperatorLiteral — a binary literal
      • BooleanLiteraltrue or false
      • CommentLiteral — a comment // password="e@jf7!ke"
      • FloatLiteral — a floating-point number 42.42
      • IntLiteral — an integer number 42
      • ModifierLiteral - modifiers of types and type members static class A
      • ParameterModifierLiteral - parameter modifier int f(const int a)
      • NullLiteralnull
      • StringLiteral — a string "hello world"
      • TypeTypeLiteral
      • UnaryOperatorLiteral — an unary literal
    • ThisReferenceToken - this
  • CollectionNode<TAstNode> : UstNode — node collection
    • EntitiesNode — entities collection
    • ArgsNode — the collection of argumentsinvocation(a, b, c)

Types of pattern nodes

  • DslNode — a node for wrapping the DSL information (target languages, variable definitions)
  • PatternBooleanLiteral — a boolean literal bool, true, or false
  • PatternComment allows finding comments on a regular expression, equivalent to Comment: regex in DSL.
  • PatternExpression — a wildcard expression (any expression), equivalent to # in DSL. Negation for expressions can also be used.
  • PatternExpressionInsideExpression wraps any expression and indicates that it can be met at any depth of the tree. Equivalent to <{ expression }> in DSL.
  • PatternExpressions — a node for matching multiple expressions considering the constraints. For example, HashBytes(^(md2|md4|md5)$, ...)
  • PatternIdToken — placement of identifiers by a regular expression. For example, <[\w+]> will be matched against any identifiers.
  • PatternIntLiteral — placement of integers by a range, for example, <[..-20 || -10 || -5..5 || 010 || 0x10 || 30..]>
  • PatternMultipleExpressions — placement for an arbitrary number of any expressions. Equivalent to ... In DSL. It can be used for function arguments.
  • PatternStatement wraps an instruction.
  • PatternStatements wraps multiple instructions.
  • PatternStringLiteralplacement of strings by a regular expression. For example, <[""]> will be matched against any strings.
  • PatternTryCatchStatement — an empty construction try catch { }
  • PatternVarDef — definition of a variable that can take multiple values. It can be either named <["\w*" || null]> or unnamed: <[@pwd:password]>.
  • PatternVarRef — a reference to a pinned variable. For example, Response.Cookies.Add(<[@cookie]>);

UST Traversing

IUstVisitorand IUstListener interfaces for traversing UST are located in the PT.PM.UstPreprocessing project. The UstVisitor class implements the IUstVisitor interface and deep copying of the trees is performed in it by default. Thus far, the dynamic dispatching is used.

UstPreprocessor is used to simplify the UST. It overrides some UstVisitor methods. In fact, at the preprocessing stage, UST is transformed into a more simple UST. Смотри также Simplifying an AST.

Errors and exceptions

  • ParsingException occurs when parsing source code or template, if there are lexical or syntax errors. Examples:
    • Error: no viable alternative at input '(?' at 1:1
    • Error: token recognition error at: '>' at 1:18
  • ConversionException occurs when the parse tree is converted to a universal AST (UST). It is also used when converting templates. For example, NullReferenceException will be wrapped in ConversionException in some visitor method.
  • MatchingException occurs during the execution of the algorithm for matching patterns against the UST nodes.
  • ShouldNotBeVisitedException is used to explicitly specify that a visitor method should not be visited.

ANTLR utility classes

  • AntlrCaseInsensitiveInputStream — a case-insensitive input stream, which is used, for example, for PHP, PL/SQL, and T-SQL.
  • AntlrDefaultVisitor — implementation of the default visitor for ANTLR parse trees.
    • Visit calls the tree.Accept(this) for a particular node. Contains an exception handler.
    • VisitChildren
      • If there is a single child, the Visit is called.
      • Otherwise, it returns a descendant to MultichildExpression, the descendant to Expression.
    • VisitTerminal tries to parse the value with different regular expressions and on the basis of this determine the type (String, Float, Int, etc.)
  • AntlrHelper — converting ANTLR text locations to unified ones, outputting the parse tree in a convenient text representation.
  • AntlrMemoryErrorListener — logging ANTLR lexer and parser errors
  • AntlrParser — a basic class for all ANTLR parsers. Implements the following:
    • Virtual method of preprocessing text (PreprocessText) which by default normalizes line breaks (replaces single \r with \n).
    • Code parsing is performed first using a fast SLL algorithm, and, in case of failure, using a slow full LL algorithm.
    • Tracking the ANTLR cache and cleaning it under certain conditions (if memory consumption by the process exceeded a certain threshold).

JSON serialization

PT.PM actively uses the JSON.NET library. The following classes are used to interact with it:

  • USTJsonConverter — deserializer of UST trees. Creates children of the tree dynamically using Activator.CreateInstance depending on the value of NodeType.
  • JsonUstNodeSerializer allows enabling or disabling text locations TextSpan during serialization of trees, and using indents (which are not used by default).
  • PatternLanguageFlagsSafeConverter deserializes PatternDto, without showing an exception in case of incorrect or unsupported languages LanguageFlags.

Other utility classes

  • DummyLogger — a dummy logger It is used by default in all objects implementing ILoggable, so that Logger.LogInfo can be written instead of Logger?.LogInfo and the NullReferenceException can be avoided.
  • LanguageInfo information about the language:
    • Language Language — the value of the enumeration for a given language (for example, CSharp).
    • string Title — the title name of the language (for example, C#)
    • string[] Extensions — language extensions (for example, .cs)
    • bool CaseInsensitive shows if the language is case sensitive
    • LanguageFlags DependentLanguages — languages, the source code of which may be found within the source code of the given language (island languages). For example, JavaScript can be met within the PHP.
    • bool HaveAntlrParser shows whether the ANTLR parser is used for this language.
  • LanguageExt contains a static dictionary with supported LanguageInfo and methods for working with them.
  • LanguageDetector — a source code detection by a fragment. Has a single ParserLanguageDetector implementation, which uses different parsers and selects the language with the minimum amount of parsing errors.
  • TextHelper — various utilities to work with a text:
    • LinearToLineColumn — transformation of linear coordinates into two-dimensional ones.
    • LineColumnToLinear — transformation of two-dimensional coordinates into linear ones.
    • GetLinesCount returns the amount of line breaks for a string.
    • NormDirSeparator normalizes the directory separation operator. It is used to bring the physical address of a file to the correct format, because Windows OS uses backslashes, and Linux — direct slashes.
  • WorkflowLoggerHelper — outputting information and statistics after the process of pattern matching.
  • UstDotRenderer — rendering an UST tree to DOT, which can be visualized using Graphviz.
  • GraphvizGraph saves the transmitted graph in DOT to an image. Supported formats: Bmp, Png, Jpg, Plain, Svg, Pdf, Gif, Dot. PNG is used by default.
  • TestHelper — various utilities to perform unit tests