pandoc-export-index

This software is a collection of Pandoc Lua filters and custom writers to export a document with indices in these formats:

InDesign ICML
docx
odt
~~ConTeXt~~
~~LaTeX~~

Currently there's only a Writer to export an index (one level only) to an ICML standalone document (-s option in Pandoc).

What is needed to specify indices

For indices, you need:

the names of the indices (many formats support only one index);
the terms (topics) for every index;
the references to those terms in the text.

An example document

In the doc directory there's an indices-example.md document.

Defining indices for Pandoc

This software considers an index definition a Div block with

an index class; this is mandatory, because makes the Div an index database;
an index-name; this is optional; if not set, its value is considered to be "index"; please do use simple names without numbers or symbols for indices' names, like "index", "names", "topics", "biblio", "subjects", etc.
a ref-class attribute that specifies the class that Span inlines must have to be considered references to this index; this is optional; if not set, its value is considered to be "index-ref";
a put-index-ref attribute that can be "before" or "after", see below; this is optional; if not set, its value is considered to be "after".

Why a Div? Because it's a Block that carries arbitrary data within the Attr structure; and it's a container of Blocks.

References

Index references are Span inlines with:

a class that matches the ref-class of an index defined somewhere in the document; this is mandatory (unless you specify indexed-text, see below), since it's what makes this Span an index reference;
an idref attribute that matches the id attribute of a term of that index; this is optional, but if not set, you won't get this occurrence in the index;
an optional indexed-text attribute with the text it refers to; this is useful when you use empty references (an empty Span just put at the left or at the right of the text it refers to)

Why a Span? Because it's among inlines and carries arbitrary data within the Attr structure.

Terms (topics)

Index terms are Div blocks with:

an index-term class; this is mandatory, because it's what makes this Div an index term, instead of a generic Div;
an id; this is mandatory too, otherwise you can't reference this term in the text;
an index-name attribute whose value matches the one of an index; this is optional, especially when the term Div is inside an index Div;
an optional sort-key attribute, specifying a simple text according to which the term must be sorted; generally the filters and writers of this repository don't do sorting.

Why a Div? A Para or a Plain are enough in many cases, but they have no data attached (no Attr). An index topic could also be quite long and multi-paragraph (i.e. think of an index of people with biographical profiles or a glossary with references to the pages where a topic is discussed).

Currently there's no support for sub-topics, but it's planned.

How indices are modelled in different formats

AFAIK we can divide formats into two families from the indexing point of view:

ICML, docx, odt: there's a database of terms and references to them in the text; rendering indices in HTML and epub could follow this model too;
ConTeXt, LaTeX: the database is built incrementally from macro calls like \index{term}, \index{head+sub} (ConTeXt), \index{head!sub} (LaTeX).

This package follows the first model, so writers for ConTeXt and LaTeX should do some work to adapt it.

In ConTeXt I know it's possible, because I used this workaround in a project of mine:

\defineregister[myIndex][deeptextcommand=\IdToTerm]

\starttext
... foo\myIndex[foo]{fooId} bar\myIndex[bar]{barId} ...

\placeregister[myIndex]
\stoptext

where \IdToTerm is a macro that gets an id as input and places the TeX tokens of the corresponding term, while \myIndex must be followed by two parameters: the sorting key in brackets and the term id in braces.

Extracting indices as JSON objects: the `indices2json.lua` Writer

indices2json.lua is a custom writer to extract indices and terms defined in a document as JSON objects, that you may then use to build an external database.

Example: enter the src directory and type

pandoc -f markdown -t indices2json.lua ../test/test.md

and you'll get something like this:

{
  "indices": [
    {
      "name": "subjects",
      "prefix": "subjects",
      "refClass": "index-ref",
      "refWhere": "after"
    }
  ],
  "terms": {
    "subjects": [
      {
        "blocks": [
          {
            "c": [
              {
                "c": "Consequo",
                "t": "Str"
              }
            ],
            "t": "Para"
          }
        ],
        "id": "consequo",
        "sortKey": "consequo",
        "text": "Consequo\n"
      },
      {
        "blocks": [
          {
            "c": [
              {
                "c": [
                  {
                    "c": "Labor",
                    "t": "Str"
                  }
                ],
                "t": "Emph"
              }
            ],
            "t": "Para"
          }
        ],
        "id": "labor",
        "sortKey": "labor",
        "text": "Labor\n"
      }
    ]
  }
}

Exporting an index to ICML: the `icml_with_index.lua` Writer

InDesign has only one index, so you can't define more indices inside a document (actually there's a workaround, using the first level of the index to discriminate among different indices, but it may an option for future versions of this software).

In ICML, the actual index is in a <Index> element that lives outside the main <Story> element, so you can't add it through a filter, because filters can only modify the contents of the <Story> element.

So it looks like the only way to add an index is through templates, and a custom writer:

pandoc -f markdown -t icml_with_index.lua -s test.md

The custom writer can modify the default template for ICML on the fly, putting an $index$ before <Story Self="pandoc_story", then fill the index variable with the index contents.

Here's the custom writer's main function:

function Writer(doc, opts)
 local collected = pandocIndices.collectIndices(doc)
  indices = collected.indices
  terms = collected.terms
  local filtered = doc
  for i = 1, #indices_filters do
    logging_info("applying filter #" .. i)
    local filter = indices_filters[i]
    filtered = filtered:walk(filter)
  end
  -- make a clone of opts and add the index variable
  local options = pandoc.WriterOptions(opts)
  options.variables.index = index_var
  return pandoc.write(filtered, 'icml', options)
end

Some filters are applied to collect index data and fill the index_var variable, whose value is put into options.variables.index before calling pandoc.write(filtered, 'icml', options). The writer then replaces $index$ in the template with the value of options.variables.index.

Exporting indices to DOCX: the `docx_index.lua` filter

docx_index.lua is a filter that injects references to index terms in the text.

Here's an example:

pandoc -f markdown -t docx -o doc-with-index.docx -L docx_index.lua doc.md

When you open the resulting DOCX file, you won't see an index. You must create it explicitly (e.g. References -> Insert index) with your word processing app (e.g. Word).

Exporting indices to ODT: the `odt_index.lua` filter

odt_index.lua is a filter that injects references to index terms in the text.

Here's an example:

pandoc -f markdown -t odt -o doc-with-index.odt -L odt_index.lua doc.md

When you open the resulting ODT file, you won't see an index. You must create it explicitly with your word processing app. In LibreOffice, you can click on Insert - Table of Contents and Index - Table of Contents, Index or Bibliography.

Though LibreOffice supports many indices, for now the only one that is created is the alphabetical index.

Generating an index

To generate an index for your document(s), you can

start from a predefined list of words (e.g. names or topics) that represent the index terms
mark references to indices in your texts, and then collect and organize the references into a list of index terms

An index from a list of words or paragraphs

The filter paras_to_index_terms.lua takes a list of paragraphs (i.e. one term per line), and encapsulates all the paragraphs in Div blocks that represent index terms.

Then all those Div terms are encapsulated in one Div that represents the index.

The filter works only on paragraphs that are children of the main Pandoc "root" element. If your document has Div blocks that contain paragraphs, they are kept untouched in the output document.

Example (phonetic.md)

# Phonetic alphabet

alpha

bravo

charlie

# Another title

Running pandoc -f markdown -t native -L paras_to_index_terms.lua phonetic.md, you get:

[ Header
    1
    ( "phonetic-alphabet" , [] , [] )
    [ Str "Phonetic" , Space , Str "alphabet" ]
, Header
    1
    ( "another-title" , [] , [] )
    [ Str "Another" , Space , Str "title" ]
, Div
    ( ""
    , [ "index" ]
    , [ ( "index-name" , "index" )
      , ( "ref-class" , "index-ref" )
      ]
    )
    [ Div
        ( "" , [ "index-term" ] , [ ( "index-name" , "index" ) ] )
        [ Para [ Str "alpha" ] ]
    , Div
        ( "" , [ "index-term" ] , [ ( "index-name" , "index" ) ] )
        [ Para [ Str "bravo" ] ]
    , Div
        ( "" , [ "index-term" ] , [ ( "index-name" , "index" ) ] )
        [ Para [ Str "charlie" ] ]
    ]
]

You can specify the index name and the class for references in the text, like this:

pandoc -f markdown -t native -L paras_to_index_terms.lua -V index_name=phonetic -V ref_class=phonetic-ref phonetic.md

and you get:

[ Header
    1
    ( "phonetic-alphabet" , [] , [] )
    [ Str "Phonetic" , Space , Str "alphabet" ]
, Header
    1
    ( "another-title" , [] , [] )
    [ Str "Another" , Space , Str "title" ]
, Div
    ( ""
    , [ "index" ]
    , [ ( "ref-class" , "phonetic-ref" )
      , ( "index-name" , "phonetic" )
      ]
    )
    [ Div
        ( ""
        , [ "index-term" ]
        , [ ( "index-name" , "phonetic" ) ]
        )
        [ Para [ Str "alpha" ] ]
    , Div
        ( ""
        , [ "index-term" ]
        , [ ( "index-name" , "phonetic" ) ]
        )
        [ Para [ Str "bravo" ] ]
    , Div
        ( ""
        , [ "index-term" ]
        , [ ( "index-name" , "phonetic" ) ]
        )
        [ Para [ Str "charlie" ] ]
    ]
]

Indices from references in your documents

Suppose you have no predefined index, nor a list of words that represent the index terms.

You can put references to indices in the text, with Span inlines. Those Spans must have a class that identifies them (see above).

compile_raw_indices.lua is a filter that outputs one or more raw indices, as Div blocks with the index class (see above), whose contents are index terms (Div blocks with the index-term class).

The terms' texts are the ones marked in the main text as references, and they are sorted alphabetically.

Example:

pandoc -f markdown -t markdown -L compile_raw_indices.lua test.md

here's the result:

:::::::: {#index .index index-name="index"}
::: {#consequo .index-term index-name="index" count="3" sort-key="consequat"}
consequat
:::

::: {.index-term index-name="index" count="1" sort-key="dolor"}
dolor
:::

::: {#dolor .index-term index-name="index" count="1" sort-key="dolor"}
dolor
:::

::: {#labor .index-term index-name="index" count="1" sort-key="labore"}
labore
:::

::: {#labor .index-term index-name="index" count="3" sort-key="laborum"}
laborum
:::
::::::::

It's clearly a raw index, that needs some further processing, but most of the task of extraction and sorting is done.

Here's the example, once it's been manually reworked:

:::::::: {#index .index index-name="index"}
::: {#consequo .index-term index-name="index" count="3" sort-key="consequo"}
consequo
:::

::: {#dolor .index-term index-name="index" count="2" sort-key="dolor"}
dolor
:::

::: {#labor .index-term index-name="index" count="4" sort-key="labor"}
labor
:::
::::::::

Index references with a class and without `index-name`

You may have references in the text specified through a class, instead of the index-name attribute.

Since you start only with references, and without any index specifying a ref-class, you miss the matching between the reference classes and their corresponding indices.

You can pass that information setting the value of the index_ref_classes variable, e.g. with `-V index_ref_classes='{"name-ref":"names","subj-ref":"subjects"}'.

That tells the filter that the Spans with a name-ref class are references to the "names" index, while those with a subj-ref class are references to the "subjects" index.

The value of the index_ref_classes variable is a JSON object, whose keys are the classes characterizing the references' Span inlines and whose values are the names of their corresponding indices.

Sorting indices

The filter sort_indices.lua sorts all terms in every index of a document.

Currently (version 0.4.1), the terms are sorted accordingly to their sort-key attribute, in ascending alphabetical order.

Automatically assign identifiers to index terms

The filter assign_ids_to_index_terms.lua assigns an identifier to any index term of any index.

If a term has already an identifier, the filter does not change it, unless you set the variable ids_reset with the -V ids_reset=true option.

Identifiers are the concatenation of a prefix and a counter.

The default value for the prefix is the index name, but you can change it setting the ids_prefixes variable, e.g. `-V ids_prefixes='{"names":"n_"}'.

The value of ids_prefixes is a JSON object where keys are the index names and values are the corresponding prefixes.

Version

The current version is 0.4.4 (2024, November 1st).

Aknowledgements

This software

provides custom writers and filters for Pandoc;
and makes use of William Lupton's logging.lua module.

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
doc		doc
src		src
test		test
.gitignore		.gitignore
.luarc.json		.luarc.json
LICENSE		LICENSE
README.md		README.md
VERSION		VERSION

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pandoc-export-index

What is needed to specify indices

An example document

Defining indices for Pandoc

References

Terms (topics)

How indices are modelled in different formats

Extracting indices as JSON objects: the `indices2json.lua` Writer

Exporting an index to ICML: the `icml_with_index.lua` Writer

Exporting indices to DOCX: the `docx_index.lua` filter

Exporting indices to ODT: the `odt_index.lua` filter

Generating an index

An index from a list of words or paragraphs

Indices from references in your documents

Index references with a class and without `index-name`

Sorting indices

Automatically assign identifiers to index terms

Version

Aknowledgements

About

Releases

Packages

Languages

License

massifrg/pandoc-export-index

Folders and files

Latest commit

History

Repository files navigation

pandoc-export-index

What is needed to specify indices

An example document

Defining indices for Pandoc

References

Terms (topics)

How indices are modelled in different formats

Extracting indices as JSON objects: the indices2json.lua Writer

Exporting an index to ICML: the icml_with_index.lua Writer

Exporting indices to DOCX: the docx_index.lua filter

Exporting indices to ODT: the odt_index.lua filter

Generating an index

An index from a list of words or paragraphs

Indices from references in your documents

Index references with a class and without index-name

Sorting indices

Automatically assign identifiers to index terms

Version

Aknowledgements

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Extracting indices as JSON objects: the `indices2json.lua` Writer

Exporting an index to ICML: the `icml_with_index.lua` Writer

Exporting indices to DOCX: the `docx_index.lua` filter

Exporting indices to ODT: the `odt_index.lua` filter

Index references with a class and without `index-name`

Packages