Language File Sections

Andreas Gohr edited this page Mar 13, 2017 · 1 revision

This section will look at all the sections of a language file, and how they relate to the final highlighting result.

The Header

The header of a language file is the first lines with the big comment and the start of the variable $language_data:

<?php
/*************************************************************************************
 * <name-of-language-file.php>
 * ---------------------------------
 * Author: <name> (<e-mail address>)
 * Copyright: (c) 2008 <name> (<website URL>)
 * Release Version: <GeSHi release>
 * Date Started: <date started>
 *
 * <name-of-language> language file for GeSHi.
 *
 * <any-comments...>
 *
 * CHANGES
 * -------
 * <date-of-release> (<GeSHi release>)
 *  -  First Release
 *
 * TODO (updated <date-of-release>)
 * -------------------------
 * <things-to-do>
 *
 *************************************************************************************
 *
 *     This file is part of GeSHi.
 *
 *   GeSHi is free software; you can redistribute it and/or modify
 *   it under the terms of the GNU General Public License as published by
 *   the Free Software Foundation; either version 2 of the License, or
 *   (at your option) any later version.
 *
 *   GeSHi is distributed in the hope that it will be useful,
 *   but WITHOUT ANY WARRANTY; without even the implied warranty of
 *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *   GNU General Public License for more details.
 *
 *   You should have received a copy of the GNU General Public License
 *   along with GeSHi; if not, write to the Free Software
 *   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 *
 ************************************************************************************/
 
$language_data = array (

The parts in angle brackets are the parts that you change for your language file. Everything else must remain the same!

Here are the parts you should change:

  • <name-of-language-file.php> - This should become the name of your language file. Language file names are in lower case and contain only alphanumeric characters, dashes and underscores. Language files end with .php (which you should put with the name of your language file, eg language.php)
  • <name> - Your name, or alias.
  • <e-mail address> - Your e-mail address. If you want your language file included with GeSHi you must include an e-mail address that refers to an inbox controlled by you.
  • <website URL> - The URL of a website of yours (perhaps a page that deals with your contribution to GeSHi, or your home page/blog).
  • <date started> - The date you started working on the language file. If you can’t remember, guesstimate.
  • <name-of-language> - The name of the language you made this language file for (probably similar to the language file name).
  • <any-comments...> - Any comments you have to make about this language file, perhaps on where you got the keywords from, what dialect of the language this language file is for, etc. If you don’t have any comments, remove the space for them.
  • <date-of-release> - The date you released the language file to the public. If you simply send it to me for inclusion in a new GeSHi and don’t release it, leave this blank, and I’ll replace it with the date of the GeSHi release that it is first added to.
  • <GeSHi release> - This is the version of the release that will contain the changes you made. So if you have version 1.0.8 of GeSHi running this will be the next version to be released, e.g. 1.0.8.1.

Everything else should remain the same.

Also: I’m not sure about the copyright on a new language file. I’m not a lawyer; could someone contact me about whether the copyright for a new language file should be exclusively the author’s, or joint with me (if included in a GeSHi release)?

The First Indices

Here is an example from the php language file of the first indices:

'LANG_NAME' => 'PHP',
'COMMENT_SINGLE' => array(1 => '//', 2 => '#'),
'COMMENT_MULTI' => array('/*' => '*/'),
'CASE_KEYWORDS' => GESHI_CAPS_NO_CHANGE,
'QUOTEMARKS' => array("'", '"'),
'ESCAPE_CHAR' => '\\',

The first indices are the first few lines of a language file before the KEYWORDS index. These indices specify:

  • ‘LANG_NAME’: The name of the language. This name should be a human-readable version of the name (e.g. HTML 4 (transitional) instead of html4trans)
  • ‘COMMENT_SINGLE’: An array of single-line comments in your language, indexed by integers starting from 1. A single line comment is a comment that starts at the marker and goes until the end of the line. These comments may be any length > 0, and since they can be styled individually, can be used for other things than comments (for example the Java language file defines “import” as a single line comment). If you are making a language that uses a ’ (apostrophe) as a comment (or in the comment marker somewhere), use double quotes. e.g.: “’”
  • ‘COMMENT_MULTI’: Used to specify multiline comments, an array in the form ‘OPEN’ => ‘CLOSE’. Unfortunately, all of these comments you add here will be styled the same way (an area of improvement for GeSHi 1.2.X). These comment markers may be any length > 0.
  • ‘CASE_KEYWORDS’: Used to set whether the case of keywords should be changed automatically as they are found. For example, in an SQL or BASIC dialect you may want all keywords to be upper case. The accepted values for this are:
    • GESHI_CAPS_UPPER: Convert the case of all keywords to upper case.
    • GESHI_CAPS_LOWER: Convert the case of all keywords to lower case.
    • GESHI_CAPS_NO_CHANGE: Don’t change the case of any keyword.
  • ‘QUOTEMARKS’: Specifies the characters that mark the beginning and end of a string. This is another example where if your language includes the ’ string delimiter you should use double quotes around it.
  • ‘ESCAPE_CHAR’: Specifies the escape character used in all strings. If your language does not have an escape character then make this the empty string (''). This is not an array! If found, any character after an escape character and the escape character itself will be highlighted differently, and the character after the escape character cannot end a string.
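To make the quoting rules above concrete, here is a minimal sketch of the first indices for a made-up BASIC-like dialect. The language name and the choice of comment markers are assumptions for illustration only and are not taken from any shipped language file.

```php
// Hypothetical BASIC-like dialect, for illustration only.
'LANG_NAME' => 'Example BASIC',
// The apostrophe comment marker is wrapped in double quotes.
'COMMENT_SINGLE' => array(1 => "'", 2 => 'REM'),
// This sketch has no multiline comments.
'COMMENT_MULTI' => array(),
// Convert all keywords to upper case as they are found.
'CASE_KEYWORDS' => GESHI_CAPS_UPPER,
'QUOTEMARKS' => array('"'),
// No escape character: an empty string, not an array.
'ESCAPE_CHAR' => '',
```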

In some language files you might see other indices here too, but those are dealt with later on.

Keywords

Keywords will make up the bulk of a language file. In this part you add keywords for your language, including inbuilt functions, data types, predefined constants etc.

Here’s a (shortened) example from the php language file:

'KEYWORDS' => array(
    1 => array(
        'as', 'break', 'case', 'do', 'else', 'elseif', 'endif',
        'endswitch', 'endwhile', 'for', 'foreach', 'if', 'include',
        'include_once', 'require', 'require_once', 'return',
        'switch', 'while'
        ),
    2 => array(
        '&lt;/script>', '&lt;?', '&lt;?php', '&lt;script language=',
        '?>', 'class', 'default', 'DEFAULT_INCLUDE_PATH', 'E_ALL',
        'E_COMPILE_ERROR', 'E_COMPILE_WARNING', 'E_CORE_ERROR',
        'E_CORE_WARNING', 'E_ERROR', 'E_NOTICE', 'E_PARSE',
        'E_USER_ERROR', 'E_USER_NOTICE', 'E_USER_WARNING',
        'E_WARNING', 'false', 'function', 'new', 'null',
        'PEAR_EXTENSION_DIR', 'PEAR_INSTALL_DIR', 'PHP_BINDIR',
        'PHP_CONFIG_FILE_PATH', 'PHP_DATADIR', 'PHP_EXTENSION_DIR',
        'PHP_LIBDIR', 'PHP_LOCALSTATEDIR', 'PHP_OS',
        'PHP_OUTPUT_HANDLER_CONT', 'PHP_OUTPUT_HANDLER_END',
        'PHP_OUTPUT_HANDLER_START', 'PHP_SYSCONFDIR', 'PHP_VERSION',
        'true', 'var', '__CLASS__', '__FILE__', '__FUNCTION__',
        '__LINE__', '__METHOD__'
        ),
    3 => array(
        'xml_parser_create', 'xml_parser_create_ns',
        'xml_parser_free', 'xml_parser_get_option',
        'xml_parser_set_option', 'xml_parse_into_struct',
        'xml_set_character_data_handler', 'xml_set_default_handler',
        'xml_set_element_handler',
        'xml_set_end_namespace_decl_handler',
        'xml_set_external_entity_ref_handler',
        'xml_set_notation_decl_handler', 'xml_set_object',
        'xml_set_processing_instruction_handler',
        'xml_set_start_namespace_decl_handler',
        'xml_set_unparsed_entity_decl_handler', 'yp_all', 'yp_cat',
        'yp_errno', 'yp_err_string', 'yp_first',
        'yp_get_default_domain', 'yp_master', 'yp_match', 'yp_next',
        'yp_order', 'zend_logo_guid', 'zend_version',
        'zlib_get_coding_type'
        )
    ),

You can see that the index ‘KEYWORDS’ refers to an array of arrays, indexed by positive integers. In each array, there are some keywords (in the actual php language file there are in fact many more keywords in the array indexed by 3). Here are some points to note about these keywords:

  • Indexed by positive integers: Use nothing else! I may change this in 1.2.X, but for the 1.0.X series, use positive integers only. Using strings here results in unnecessary overhead degrading performance when highlighting code with your language file!
  • Keywords sorted ascending: Keywords should be sorted in ascending order. This is mainly for readability. An issue with versions before 1.0.8 has been solved, so the reverse sorting order is no longer required and should thus be avoided. GeSHi itself sorts the keywords internally when building some of its caches, so the order doesn’t matter, but makes things easier to read and maintain.
  • Keywords are case sensitive (sometimes): If your language is case-sensitive, the correct casing of the keywords is defined as the case of the keywords in these keyword arrays. If you check the java language file you will see that everything is in exact casing. So if any of these keyword arrays are case sensitive, put the keywords in as their correct case! (Note that which groups are case sensitive and which are not is configurable, see later on.) If a keyword group is case insensitive, put the lowercase version of the keyword here, unless documentation links require a special casing (other than all lowercase or all uppercase), in which case use the casing the links require.
  • Keywords must be in htmlentities() form: All keywords should be written as if they had been run through the php function htmlentities(). E.g, the keyword is &lt;foo&gt;, not <foo>
  • Don’t use keywords to highlight symbols: Just don’t!!! It doesn’t work, and there is separate support for symbols since GeSHi 1.0.7.21.
  • Markup Languages are special cases: Check the html4strict language file for an example: You need to tweak the Parser control here to tell the surroundings of tagnames. In case of doubt, feel free to ask.
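As a quick illustration of the points above, here is a small sketch of a keywords section. The keywords themselves are made up for the example; only the structure (positive integer indices, ascending order, htmlentities() form) follows the rules listed.

```php
'KEYWORDS' => array(
    // Group 1: control flow keywords, sorted in ascending order.
    1 => array(
        'else', 'for', 'if', 'while'
        ),
    // Group 2: a keyword containing < and > is written in
    // htmlentities() form, i.e. '&lt;foo&gt;' rather than '<foo>'.
    2 => array(
        '&lt;foo&gt;'
        )
    ),
```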

Symbols and Case Sensitivity

So you’ve put all the keywords for your language in? Now for a breather before we style them :). Symbols define what symbols your language uses. These are things like colons, brackets/braces, and other such general punctuation. No alphanumeric stuff belongs here, just as no symbols belong in the keywords section.

As of GeSHi version 1.0.7.21 the symbols section can be used in two ways:

  • Flat usage:

    • This mode is the suggested way for existing language files and languages that only need few symbols where no further differentiation is needed or desired. You simply put all the characters in an array under symbols as shown in the first example below. All symbols in flat usage belong to symbol style group 0.
  • Group usage:

    • This is a slightly more enhanced way to provide GeSHi with symbol information. To use groups, you create several subarrays, each containing only a subset of the symbols to highlight. Every array needs a unique index so that you can assign the appropriate styles later.

Here’s an example for flat symbol usage:

'SYMBOLS' => array(
  '(', ')', '[', ']', '{', '}', '!', '@', '|', '&', '+', '-', '*', '/', '%', '=', '<', '>'
),

which is not too different from the newly introduced group usage shown below:

'SYMBOLS' => array(
  0 => array('(', ')', '[', ']', '{', '}'),
  1 => array('!', '@', '|', '&'),
  2 => array('+', '-', '*', '/', '%'),
  3 => array('=', '<', '>')
),

👉 Note:

Please note that versions before 1.0.7.21 will silently ignore this setting.

Also note that GeSHi 1.0.7.21 itself had some bugs in Symbol highlighting that could cause heavily scrambled code output.

The following case sensitivity group alludes to the keywords section: here you can set which keyword groups are case sensitive.

In the ‘CASE_SENSITIVE’ group there’s a special key GESHI_COMMENTS which is used to set whether comments are case sensitive or not. For example, BASIC has the REM statement, which is not case sensitive but is still alphanumeric; and as in the example given before about the Java language file using “import” as a single line comment, this can be useful sometimes. Set it to true if comments are case sensitive, false otherwise. All of the other indices correspond to indices in the 'KEYWORDS' section (see above).
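A sketch of what such a case sensitivity group might look like for a language file with three keyword groups follows. The true/false values are chosen for illustration and are not a recommendation for any particular language.

```php
'CASE_SENSITIVE' => array(
    // Comment markers (e.g. REM) match regardless of case.
    GESHI_COMMENTS => false,
    // One entry per keyword group defined in 'KEYWORDS'.
    1 => false,
    2 => false,
    3 => true
    ),
```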

Styles for your Language File

This is the fun part! Here you get to choose the colours, fonts, backgrounds and anything else you’d like for your language file.

Here’s an example:

'STYLES' => array(
    'KEYWORDS' => array(
        1 => 'color: #b1b100;',
        2 => 'color: #000000; font-weight: bold;',
        3 => 'color: #000066;'
        ),
    'COMMENTS' => array(
        1 => 'color: #808080; font-style: italic;',
        2 => 'color: #808080; font-style: italic;',
        'MULTI' => 'color: #808080; font-style: italic;'
        ),
    'ESCAPE_CHAR' => array(
        0 => 'color: #000099; font-weight: bold;'
        ),
    'BRACKETS' => array(
        0 => 'color: #66cc66;'
        ),
    'STRINGS' => array(
        0 => 'color: #ff0000;'
        ),
    'NUMBERS' => array(
        0 => 'color: #cc66cc;'
        ),
    'METHODS' => array(
        0 => 'color: #006600;'
        ),
    'SYMBOLS' => array(
        0 => 'color: #66cc66;'
        ),
    'REGEXPS' => array(
        0 => 'color: #0000ff;'
        ),
    'SCRIPT' => array(
        0 => '',
        1 => '',
        2 => '',
        3 => ''
        )
    ),

Note that all style rules should end with a semi-colon! This is important: GeSHi may add extra rules to the rules you specify (and will do so if a user tries to change your styles on the fly), so the last semi-colon in any stylesheet rule is important!

All strings here should contain valid stylesheet declarations (it’s also fine to have the empty string).

  • ‘KEYWORDS’: This is an array, from keyword index to style. The index you use is the index you used in the keywords section to specify the keywords belonging to that group.
  • ‘COMMENTS’: This is an array, from single-line comment index to style for that index. The index ‘MULTI’ is used for multiline comments (and cannot be an array). COMMENT_REGEXP use the style given for their key as if they were single-line comments.
  • ‘ESCAPE_CHAR’, ‘BRACKETS’ and ‘METHODS’: These are arrays with only one index: 0. You cannot add other indices to these arrays.
  • ‘STRINGS’: This defines the various styles for the Quotemarks you defined earlier. If you don’t use multiple styles for strings there’s only one index: 0. Please also add this index in case no strings are present.
  • ‘NUMBERS’: This sets the styles used to highlight numbers. The format used here depends on the format used to set the formats of numbers to highlight. If you just used an integer (bitmask) for numbers, you can specify one key per format constant, include a key 0 as a default style, or both. If you used an array for the number markup, copy the keys used there and assign the styles accordingly.
  • ‘SYMBOLS’: This provides one key for each symbol group you defined above. If you used the flat usage make sure you include a key for symbols group 0.
  • ‘REGEXPS’: This is an array with a style for each matching regex. Also, since 1.0.7.21, you can specify the name of a function to be called, that will be given the text matched by the regex, each time a match is found. Note that my testing found that create_function would not work with this due to a PHP bug, so you have to put the function definition at the top of the language file. Be sure to prefix the function name with geshi_[languagename]_ so as not to conflict with other functions!
  • ‘SCRIPT’: For languages that use script delimiters, this is where you can style each block of script. For example, HTML and XML have blocks that begin with < and end with > (i.e. tags) and blocks that begin with & and end with ; (i.e. character entities), and you can set a style to apply to each whole block. You specify the delimiters for the blocks below. Note that many languages will not need this feature.
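If you used group usage for symbols, the ‘SYMBOLS’ style array needs one key per group. Here is a sketch assuming four symbol groups indexed 0 to 3; the colours are arbitrary placeholders.

```php
'STYLES' => array(
    // ... other style groups as shown above ...
    'SYMBOLS' => array(
        // One style per symbol group; each rule ends with a semi-colon.
        0 => 'color: #66cc66;',
        1 => 'color: #339933;',
        2 => 'color: #006600;',
        3 => 'color: #000000; font-weight: bold;'
        )
    ),
```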

URLs for Functions

This section lets you specify a url to visit for each keyword group. Useful for pointing functions at their online manual entries.

Here is an example:

'URLS' => array(
    1 => '',
    2 => '',
    3 => 'http://www.php.net/{FNAME}',
    4 => ''
    ),

The indices of this array correspond to the keyword groups you specified in the keywords section. The string {FNAME} marks where the name of the function is substituted in. So for the example above, if the keyword being highlighted is “echo”, then the keyword will be a URL pointing to http://www.php.net/echo. Because some languages (Java!) don’t keep a uniform URL for functions/classes, you may have trouble in creating a URL for that language (though look in the java language file for a novel solution to its problem).

Number Highlighting Support

If your language supports different formats of numbers (e.g. integers and float representations) and you want GeSHi to handle them differently you can select from a set of predefined formats.

    'NUMBERS' =>
        GESHI_NUMBER_INT_BASIC | GESHI_NUMBER_INT_CSTYLE | GESHI_NUMBER_BIN_PREFIX_0B |
        GESHI_NUMBER_OCT_PREFIX | GESHI_NUMBER_HEX_PREFIX | GESHI_NUMBER_FLT_NONSCI |
        GESHI_NUMBER_FLT_NONSCI_F | GESHI_NUMBER_FLT_SCI_SHORT | GESHI_NUMBER_FLT_SCI_ZERO,

All the formats you want GeSHi to recognize are selected via a bitmask that is built by bitwise OR-ing the format constants. When styling you use these constants to assign the proper styles. A format with no style assigned automatically falls back to style group 0.
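With the bitmask form, the matching styles might look like the sketch below: a default under key 0 plus individual styles keyed by format constant. The colours are placeholders, and which formats get their own style is an arbitrary choice for the example.

```php
'STYLES' => array(
    // ... other style groups ...
    'NUMBERS' => array(
        // Default style for every format without its own entry.
        0 => 'color: #cc66cc;',
        // Formats that should stand out get their own style.
        GESHI_NUMBER_OCT_PREFIX => 'color: #208080;',
        GESHI_NUMBER_HEX_PREFIX => 'color: #208080;'
        )
    ),
```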

👉 Note:

For a complete list of formats supported by GeSHi have a look into the sources of geshi.php.

If you want to define your own formats for numbers or when you want to group the style for two or more formats you can use the array syntax.

    'NUMBERS' => array(
        1 => GESHI_NUMBER_INT_BASIC | GESHI_NUMBER_INT_CSTYLE,
        2 => GESHI_NUMBER_BIN_PREFIX_0B,
        3 => GESHI_NUMBER_OCT_PREFIX,
        4 => GESHI_NUMBER_HEX_PREFIX,
        5 => GESHI_NUMBER_FLT_NONSCI | GESHI_NUMBER_FLT_NONSCI_F | GESHI_NUMBER_FLT_SCI_SHORT | GESHI_NUMBER_FLT_SCI_ZERO
        ),

This creates 5 style groups 1..5 that will highlight each of the formats specified for each group. Styling of these groups doesn’t use the constants but uses the indices you just defined.
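For the array form, the styles are keyed by the group indices instead. A sketch for the five groups defined above, again with placeholder colours:

```php
'STYLES' => array(
    // ... other style groups ...
    'NUMBERS' => array(
        1 => 'color: #cc66cc;',  // plain and C-style integers
        2 => 'color: #208080;',  // binary with 0b prefix
        3 => 'color: #208080;',  // octal
        4 => 'color: #208080;',  // hexadecimal
        5 => 'color: #800080;'   // all floating point formats
        )
    ),
```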

Instead of using those predefined constants you also can assign a PCRE that matches a number when using this advanced format.

👉 Note:

The extended format hasn’t been exhaustively tested, so beware of bugs there.

Object Orientation Support

Now we’re reaching the least-used sections of a language file, which include such goodies as object orientation support and context support. GeSHi can highlight methods and data fields of objects easily: all you need to do is tell it to do so and what the “splitter” is between object/method etc.

'OOLANG' => true,
'OBJECT_SPLITTER' => '-&gt;',

If your language has object orientation, the value of 'OOLANG' is true, otherwise it is false. If it is object orientated, in the 'OBJECT_SPLITTER' value you put the htmlentities() version of the “splitter” between objects and methods/fields. If it is not, then make this the empty string.
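For comparison, here is a sketch of the two other common cases: a hypothetical language that uses a dot as its splitter, and a language without object orientation. Only one of the two variants would appear in a real language file.

```php
// Variant A: an object-oriented language whose splitter is a dot.
'OOLANG' => true,
'OBJECT_SPLITTER' => '.',

// Variant B (use instead of A): no object orientation,
// so the splitter is the empty string.
// 'OOLANG' => false,
// 'OBJECT_SPLITTER' => '',
```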

Using Regular Expressions

Regular expressions are a good way to catch any other lexical element that fits certain rules but can’t be listed as a keyword. A good example is variables in PHP: variables always start with either one or two “$” signs, then alphanumeric characters (a simplification). This is easy to catch with regular expressions.

And new to version 1.0.2, there is an advanced way of using regular expressions to catch certain things but highlight only part of those things. This is particularly useful for languages like XML.

Caution:

Regular expressions use the PCRE syntax (perl-style), not the ereg() style!

Here is an example (this time the PHP file merged with the XML file):

0 => array(
    GESHI_SEARCH => '(((xml:)?[a-z\-]+))(=)',
    GESHI_REPLACE => '\\1',
    GESHI_MODIFIERS => '',
    GESHI_BEFORE => '',
    GESHI_AFTER => '\\4'
    ),
1 => array(
    GESHI_SEARCH => '(>/?[a-z0-9]*(>)?)',
    GESHI_REPLACE => '\\1',
    GESHI_MODIFIERS => '',
    GESHI_BEFORE => '',
    GESHI_AFTER => ''
    ),
2 => "[\\$]{1,2}[a-zA-Z_][a-zA-Z0-9_]*"

As you can see there are two formats. One is the “simple” format used in GeSHi < 1.0.2, and the other is a more advanced syntax. Firstly, the simple syntax:

  • May be in double quotes: To make it easier for those who always place their regular expressions in double quotes, you may place any regular expression here in double quotes if you wish.
  • Don’t use round brackets where possible: If you want to use round brackets (()) then by all means give it a try, but I’m not sure whether under some circumstances GeSHi may throw a wobbly. You have been warned! If you want to use brackets, it would be better to use the advanced syntax.
  • Don’t use the “everything” regex: (That’s the .*? regex). Use advanced syntax instead.

And now for advanced syntax, which gives you much more control over exactly what is highlighted:

  • GESHI_SEARCH: This element specifies the regular expression to search for. If you plan to capture the output, use brackets (()). See how in the first example above, most of the regular expression is in one set of brackets (with the equals sign in other brackets). You should make sure that the part of the regular expression that is supposed to match what is highlighted is in brackets.
  • GESHI_REPLACE: This is what the stuff matched by the regular expression will be replaced with. If you’ve grouped the stuff you want highlighted into brackets in the GESHI_SEARCH element, then you can use \\number to match that group, where number is a number corresponding to how many open brackets are between the open bracket of the group you want highlighted and the start of the GESHI_SEARCH string + 1. This may sound confusing, and it probably is, but if you’re familiar with how PHP’s regular expressions work you should understand. In the example above, the opening bracket for the stuff we want highlighted is the very first bracket in the string, so the number of brackets before that bracket and the start of the string is 0. So we add 1 and get our replacement string of \\1 (whew!).

If you didn’t understand a word of that, make sure that there are brackets around the string in GESHI_SEARCH and use \\1 for GESHI_REPLACE ;)

  • GESHI_MODIFIERS: Specify modifiers for your regular expression. If your regular expression includes the everything matcher (.*?), then your modifiers should include “s” and “i” (e.g. use ‘si’ for this).
  • GESHI_BEFORE: Specifies a bracket group that should appear before the highlighted match (this bracketed group will not be highlighted). Use this if you had to match what you wanted by matching part of your regexp string to something before what you wanted to highlight, and you don’t want that part to disappear in the highlighted result.
  • GESHI_AFTER: Specifies a bracket group that should appear after the highlighted match (this bracketed group will not be highlighted). Use this if you had to match what you wanted by matching part of your regexp string to something after what you wanted to highlight, and you don’t want that part to disappear in the highlighted result.

Is that totally confusing? Here’s the test for if you’re an android or not: If you found that perfectly understandable then you’re an android ;). Here’s a better example:

Let’s say that I’m making a language, and variables in this language always start with a dollar sign ($), are always written in lowercase letters and always end with an ampersand (&). eg:

$foo& = 'bar'

I want to highlight only the text between the $ and the &. How do I do that? With simple regular expressions I can’t, but with advanced, it’s relatively easy:

1 => array(
    // search for a dollar sign, then one or more of the characters a-z, then an ampersand
    GESHI_SEARCH => '(\$)([a-z]+)(&)',
    // we wanna highlight the characters, which are in the second bracketed group
    GESHI_REPLACE => '\\2',
    // no modifiers, since we're not matching the "anything" regex
    GESHI_MODIFIERS => '',
    // before the highlighted characters should be the first
    // bracketed group (always a dollar sign in this example)
    GESHI_BEFORE => '\\1',
    // after the highlighted characters should be the third
    // bracketed group (always an ampersand in this example)
    GESHI_AFTER => '\\3'
    ),

So if someone tried to highlight using my language, all cases of $foo& would turn into:

$<span style="color: blue;">foo</span><span style="color: green;">&amp;</span>

(which would of course be viewed in a browser to get something like $foo&)
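The blue colour in that output would come from the ‘REGEXPS’ entry in the styles section. A sketch of the matching style, assuming the regular expression is registered under index 1 as in the example above:

```php
'STYLES' => array(
    // ... other style groups ...
    'REGEXPS' => array(
        // The key matches the index of the regular expression.
        1 => 'color: blue;'
        )
    ),
```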

Contextual Highlighting and Strict Mode

For languages like HTML, it’s good if we can highlight a tag (like <a> for example). But how do we stop every single “a” in the source getting highlighted? What about for attributes? If I’ve got the word “colspan” in my text I don’t want that highlighted! So how do you tell GeSHi not to highlight in that case? You do it with “Strict Blocks”.

Here is an example:

<? /* ... */
'STRICT_MODE_APPLIES' => GESHI_MAYBE,
'SCRIPT_DELIMITERS' => array(
    0 => array(
        '<?php' => '?>'
        ),
    1 => array(
        '<?' => '?>'
        ),
    2 => array(
        '<%' => '%>'
        ),
    3 => array(
        '<script language="php">' => '</script>'
        ),
    4 => "/(<\?(?:php)?)(?:'(?:[^'\\\\]|\\\\.)*?'|\"(?:[^\"\\\\]|\\\\.)*?\"|\/\*(?!\*\/).*?\*\/|.)*?(\?>|\Z)/sm",
    5 => "/(<%)(?:'(?:[^'\\\\]|\\\\.)*?'|\"(?:[^\"\\\\]|\\\\.)*?\"|\/\*(?!\*\/).*?\*\/|.)*?(%>|\Z)/sm"
    ),
'HIGHLIGHT_STRICT_BLOCK' => array(
    0 => true,
    1 => true,
    2 => true,
    3 => true,
    4 => true,
    5 => true
    )
/* ... */ ?>

What is strict mode? Strict mode says that highlighting only occurs inside the blocks you specify. You can see from the example above that highlighting will only occur if the source is inside <?php ... ?> (though note the GESHI_MAYBE!). Here are some points about strict highlighting:

  • ‘STRICT_MODE_APPLIES’: This takes three values (all constants):
    • GESHI_ALWAYS: Strict mode always applies for all of the blocks you specify. Users of your language file cannot turn strict mode off. This should be used for markup languages.
    • GESHI_NEVER: Strict mode is never used. Users of your language file cannot turn strict mode on. Use this value if there is no such thing as a block of code that would not be highlighted in your language (most languages, like C, Java etc. use this because anything in a C file should be highlighted).
    • GESHI_MAYBE: Strict mode sometimes applies. It defaults to “off”. Users can turn strict mode on if they please. If strict mode is off then everything in the source will be highlighted, even things outside the strict block markers. If strict mode is on then nothing outside strict block markers will be highlighted.
  • ‘SCRIPT_DELIMITERS’: This is an array of script delimiters, in the format of the above. The indices are used in the ‘SCRIPT’ part of the styles section for highlighting everything in a strict block in a certain way. For example, you could set up your language file to make the background yellow of any code inside a strict block this way. The delimiters are in the form 'OPEN' => 'CLOSE'. Delimiters can be of any length > 0. Delimiters are not formatted as if they were run through htmlentities()!
  • ‘HIGHLIGHT_STRICT_BLOCK’: specifies whether any highlighting should go on inside each block. Most of the time this should be true, but for example, in the XML language file highlighting is turned off for blocks beginning with <!DOCTYPE and ending with >. However, you can still style the overall block using the method described above, and the XML language file does just that.

👉 Note:

The delimiters should be in reverse alphabetical order. Note that in the above example, <?php comes before <?.
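To show how these three indices fit together for a markup language, here is a rough sketch loosely based on the XML behaviour described above. It is an illustration under those assumptions, not a copy of the actual XML language file.

```php
// Markup language: strict mode cannot be turned off.
'STRICT_MODE_APPLIES' => GESHI_ALWAYS,
'SCRIPT_DELIMITERS' => array(
    // Delimiters are written literally, not in htmlentities() form,
    // and in reverse alphabetical order ('<!DOCTYPE' before '<').
    0 => array(
        '<!DOCTYPE' => '>'
        ),
    1 => array(
        '<' => '>'
        )
    ),
'HIGHLIGHT_STRICT_BLOCK' => array(
    // No highlighting inside the DOCTYPE block; the block as a whole
    // can still be styled via 'SCRIPT' index 0 in the styles section.
    0 => false,
    1 => true
    ),
```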

Since GeSHi 1.0.8, instead of specifying an array with starter and ender you may also provide a regular expression that matches the full block you wish to highlight. If the regular expression match starts at the same position as a previous array declaration the Regexp match is taken. This is to allow for a fall-back when a preg_match doesn’t quite work as expected so you still get reasonably good results.

If you didn’t get this, you might want to look into the PHP or HTML language files as this feature is used there to fix some issues that have been there for about 3 years.

Special Parser Settings (Experimental)

Sometimes, for a language to render correctly, it is necessary to tweak some of the assumptions GeSHi usually makes so that they match the behaviour your language expects. To achieve this there is an experimental section called 'PARSER_CONTROL' which is optional and should be used only if necessary. With the help of this section some internal parameters of GeSHi can be set which are not overrideable by the API and thus their use should be limited as much as possible.

The syntax of the PARSER_CONTROL basically resembles an array structure similar to the one found in the rest of the language file. All subsections of the PARSER_CONTROL are optional. If a given setting isn’t present the usual default values of GeSHi are used. No validation of settings is performed for these settings. Also note that unknown settings are silently ignored.

Caution:

All PARSER_CONTROL settings are experimental and subject to change. So if you need a special setting in a public language file you should consider requesting it upstream. This is also the reason why documentation on these settings will only cover broad usage information as the underlying implementation might change without further notice.

One of the most common reasons why you might want to use the PARSER_CONTROL settings is to tweak what characters are allowed to surround a keyword. Usually GeSHi checks for a fixed set of characters like brackets and common symbols that denote the word boundary for a keyword. If this set conflicts with your language (e.g. - is allowed inside a keyword) or you want to limit the usage of a keyword to certain areas (e.g. for HTML tag names only match after <) you can change those conditions here.

Keyword boundary rules can either be set globally (directly within the PARSER_CONTROL’s KEYWORDS section) or on a per-group basis. E.g. the following sample from the HTML language file sets different settings for keyword matching only for Keyword Group 2 and leaves the other groups alone.

    'PARSER_CONTROL' => array(
        'KEYWORDS' => array(
            2 => array(
                'DISALLOWED_BEFORE' => '(?<=&lt;|&lt;\/)',
                'DISALLOWED_AFTER' => '(?=\s|\/|&gt;)',
            )
        )
    )

👉 Note:

The name 'DISALLOWED_BEFORE' and 'DISALLOWED_AFTER' might sound confusing at first, since they don’t define what to prevent, but what to match in order to find a keyword. The reason for this strange naming is based in the original implementation of this feature when Nigel implemented this in the old parser statically. When this implementation was brought out via the PARSER_CONTROL settings the original naming wasn’t altered since at that time this really was a blacklist of characters. Later on this implementation was changed from a blacklist of characters to a part of a PCRE regexp, but leaving the name. The naming might be subject to change though.
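For the global case, the same two settings go directly into the 'KEYWORDS' array rather than under a group index. The sketch below assumes a language in which a dash may appear inside a keyword; the exact character classes are an illustration and would need adjusting for a real language.

```php
'PARSER_CONTROL' => array(
    'KEYWORDS' => array(
        // Applies to all keyword groups: a keyword is only matched
        // when it is not directly preceded or followed by a letter,
        // digit, underscore or dash.
        'DISALLOWED_BEFORE' => '(?<![a-zA-Z0-9_\-])',
        'DISALLOWED_AFTER' => '(?![a-zA-Z0-9_\-])'
        )
    )
```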

Another option you can change since GeSHi 1.0.8.3 is whether to treat spaces within keywords as literals (only a single space as given) or if the space should match any whitespace at that location. The following code will enable this behaviour for the whole keyword set. As said above you can choose to enable this for single keyword groups only though.

    'PARSER_CONTROL' => array(
        'KEYWORDS' => array(
            'SPACE_AS_WHITESPACE' => true
        )
    ),
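To limit this to a single keyword group, the setting would presumably move under that group’s index, following the same per-group pattern used for the keyword boundary rules. This is a sketch on that assumption; group 1 is an arbitrary choice.

```php
'PARSER_CONTROL' => array(
    'KEYWORDS' => array(
        // Only keywords in group 1 treat a space as "any whitespace".
        1 => array(
            'SPACE_AS_WHITESPACE' => true
            )
        )
    ),
```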

Another option of interest might be disabling certain features for a given language. This might come in handy if the language file you are working on doesn’t support a given function or highlighting certain aspects won’t work properly or would interfere with custom implementations using regular expressions.

    'PARSER_CONTROL' => array(
        'ENABLE_FLAGS' => array(
            'ALL' => GESHI_NEVER,
            'NUMBERS' => GESHI_NEVER,
            'METHODS' => GESHI_NEVER,
            'SCRIPT' => GESHI_NEVER,
            'SYMBOLS' => GESHI_NEVER,
            'ESCAPE_CHAR' => GESHI_NEVER,
            'BRACKETS' => GESHI_NEVER,
            'STRINGS' => GESHI_NEVER,
        )
    )

Inside the 'ENABLE_FLAGS' section follows an array of 'name'=>value pairs. Valid names are the sections below the 'STYLES' section (well, not exactly, but you can look there for what the features are called inside GeSHi). Valid values are the GeSHi constants GESHI_NEVER (don’t process this feature), GESHI_ALWAYS (always process this feature, ignore the user) and GESHI_MAYBE (listen to the user if they want this highlighted). The value GESHI_MAYBE is the default one and thus need not be set explicitly.
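In practice you would usually set only the flags you need and leave the rest at their default. A sketch that turns off number highlighting and forces bracket highlighting, with both choices picked purely for illustration:

```php
'PARSER_CONTROL' => array(
    'ENABLE_FLAGS' => array(
        // Never highlight numbers, whatever the user asks for.
        'NUMBERS' => GESHI_NEVER,
        // Always highlight brackets, ignoring the user setting.
        'BRACKETS' => GESHI_ALWAYS
        // Everything not listed stays at GESHI_MAYBE.
        )
    )
```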

Another setting available through the PARSER_CONTROL settings is the possibility to limit the allowed characters before a single line comment.

    'PARSER_CONTROL' => array(
        'COMMENTS' => array(
            'DISALLOWED_BEFORE' => '$'
        )
    )

With the current implementation the DISALLOWED_BEFORE COMMENT-specific setting is a list of characters. But this is subject to change.

👉 Note:

There is no 'DISALLOWED_AFTER' setting with the 'COMMENTS'-PARSER_CONTROL.

Another PARSER_CONTROL setting for the environment around certain syntactic constructs refers to the handling of object-oriented languages.

    'PARSER_CONTROL' => array(
        'OOLANG' => array(
            'MATCH_BEFORE' => '',
            'MATCH_AFTER' => '[a-zA-Z_][a-zA-Z0-9_]*',
            'MATCH_SPACES' => '[\s]*'
        )
    )

Caution:

Please note that the settings discussed in this section are experimental and might be changed, removed or altered in their meaning at any time.

Tidying Up

Your files should end without a closing ?> and with exactly one newline at the end.
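Put together, the overall shape of a finished language file might look like the skeleton below. The language name is a placeholder and the sections are elided. The file stops after the closing bracket of $language_data, followed by a single newline.

```php
<?php
// ... header comment as described in "The Header" ...

$language_data = array (
    'LANG_NAME' => 'Example',
    // ... remaining sections in the order described above ...
);
```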