This software is a collection of Pandoc Lua filters and custom writers to export a document with indices in these formats:

- InDesign ICML
- docx
- odt
- ConTeXt
- LaTeX

Currently there's only a Writer
to export an index (one level only) to an ICML standalone document (-s
option in Pandoc).
For indices, you need:

- the names of the indices (many formats support only one index);
- the terms (topics) for every index;
- the references to those terms in the text.

In the `doc` directory there's an `indices-example.md` document.
This software considers an index definition a `Div` block with:

- an `index` class; this is mandatory, because it makes the `Div` an index database;
- an `index-name` attribute; this is optional; if not set, its value is considered to be "index"; please use simple names without numbers or symbols for index names, like "index", "names", "topics", "biblio", "subjects", etc.;
- a `ref-class` attribute that specifies the class that `Span` inlines must have to be considered references to this index; this is optional; if not set, its value is considered to be "index-ref";
- a `put-index-ref` attribute that can be "before" or "after", see below; this is optional; if not set, its value is considered to be "after".

Why a `Div`? Because it's a `Block` that carries arbitrary data within the `Attr` structure, and it's a container of `Block`s.
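
The defaults listed above can be sketched in plain Lua. This is only an illustration, not the code of this repository: `has_class` and `index_settings` are hypothetical helpers, and the `div` argument is a plain table shaped like a Pandoc `Div` (a `classes` list and an `attributes` table).

```lua
-- Hypothetical helpers (not part of this repository): they read the settings
-- of an index definition from a plain table shaped like a Pandoc Div.
local function has_class(classes, name)
  for _, class in ipairs(classes or {}) do
    if class == name then
      return true
    end
  end
  return false
end

local function index_settings(div)
  -- without the "index" class the Div is not an index definition
  if not has_class(div.classes, "index") then
    return nil
  end
  local attributes = div.attributes or {}
  return {
    name = attributes["index-name"] or "index",
    refClass = attributes["ref-class"] or "index-ref",
    putIndexRef = attributes["put-index-ref"] or "after",
  }
end

-- usage: only the index name is set, the other values fall back to defaults
local settings = index_settings({
  classes = { "index" },
  attributes = { ["index-name"] = "names" },
})
print(settings.name, settings.refClass, settings.putIndexRef)
```

The field names `name` and `refClass` mirror the ones in the JSON output of `indices2json.lua`; `putIndexRef` is a made-up name for this sketch.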
Index references are `Span` inlines with:

- a class that matches the `ref-class` of an index defined somewhere in the document; this is mandatory (unless you specify `indexed-text`, see below), since it's what makes this `Span` an index reference;
- an `idref` attribute that matches the `id` attribute of a term of that index; this is optional, but if not set, you won't get this occurrence in the index;
- an optional `indexed-text` attribute with the text it refers to; this is useful when you use empty references (an empty `Span` just put at the left or at the right of the text it refers to).

Why a `Span`? Because it's among inlines and carries arbitrary data within the `Attr` structure.
Index terms are `Div` blocks with:

- an `index-term` class; this is mandatory, because it's what makes this `Div` an index term, instead of a generic `Div`;
- an `id`; this is mandatory too, otherwise you can't reference this term in the text;
- an `index-name` attribute whose value matches the one of an index; this is optional, especially when the term `Div` is inside an index `Div`;
- an optional `sort-key` attribute, specifying a simple text according to which the term must be sorted; generally the filters and writers of this repository don't do sorting.

Why a `Div`? A `Para` or a `Plain` is enough in many cases, but they have no data attached (no `Attr`). An index topic could also be quite long and multi-paragraph (e.g. think of an index of people with biographical profiles or a glossary with references to the pages where a topic is discussed).
Currently there's no support for sub-topics, but it's planned.
AFAIK we can divide formats into two families from the indexing point of view:

- ICML, docx, odt: there's a database of terms and references to them in the text; rendering indices in HTML and epub could follow this model too;
- ConTeXt, LaTeX: the database is built incrementally from macro calls like `\index{term}`, `\index{head+sub}` (ConTeXt), `\index{head!sub}` (LaTeX).

This package follows the first model, so writers for ConTeXt and LaTeX should do some work to adapt it.
In ConTeXt I know it's possible, because I used this workaround in a project of mine:

    \defineregister[myIndex][deeptextcommand=\IdToTerm]
    \starttext
    ... foo\myIndex[foo]{fooId} bar\myIndex[bar]{barId} ...
    \placeregister[myIndex]
    \stoptext

where `\IdToTerm` is a macro that gets an id as input and places the TeX tokens of the corresponding term, while `\myIndex` must be followed by two parameters: the sorting key in brackets and the term id in braces.
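
A ConTeXt writer following this workaround would have to turn every index reference into such a call. The sketch below shows one possible way to build it in plain Lua; `context_index_call` is a hypothetical helper, not something provided by this repository, and it assumes that the sort key and the id contain no characters that are special in TeX.

```lua
-- Hypothetical helper: builds the ConTeXt call for one index reference,
-- following the \myIndex[sort key]{term id} pattern of the workaround above.
-- No escaping is done: the sort key and the id are assumed to be TeX-safe.
local function context_index_call(register, sort_key, term_id)
  return string.format("\\%s[%s]{%s}", register, sort_key, term_id)
end

print(context_index_call("myIndex", "foo", "fooId"))
```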
`indices2json.lua` is a custom writer to extract indices and terms defined in a document as JSON objects, that you may then use to build an external database.

Example: enter the `src` directory and type

    pandoc -f markdown -t indices2json.lua ../test/test.md

and you'll get something like this:

    {
      "indices": [
        {
          "name": "subjects",
          "prefix": "subjects",
          "refClass": "index-ref",
          "refWhere": "after"
        }
      ],
      "terms": {
        "subjects": [
          {
            "blocks": [
              {
                "c": [
                  {
                    "c": "Consequo",
                    "t": "Str"
                  }
                ],
                "t": "Para"
              }
            ],
            "id": "consequo",
            "sortKey": "consequo",
            "text": "Consequo\n"
          },
          {
            "blocks": [
              {
                "c": [
                  {
                    "c": [
                      {
                        "c": "Labor",
                        "t": "Str"
                      }
                    ],
                    "t": "Emph"
                  }
                ],
                "t": "Para"
              }
            ],
            "id": "labor",
            "sortKey": "labor",
            "text": "Labor\n"
          }
        ]
      }
    }

InDesign has only one index, so you can't define more indices inside a document (actually there's a workaround, using the first level of the index to discriminate among different indices, and that may be an option for future versions of this software).
In ICML, the actual index is in an `<Index>` element that lives outside the main `<Story>` element, so you can't add it through a filter, because filters can only modify the contents of the `<Story>` element.
So it looks like the only way to add an index is through templates, and a custom writer:

    pandoc -f markdown -t icml_with_index.lua -s test.md

The custom writer can modify the default template for ICML on the fly, putting an `$index$` before `<Story Self="pandoc_story"`, then fill the `index` variable with the index contents.
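
The template change described above can be sketched as a plain string operation. This is an illustration under assumptions, not the writer's actual code: `add_index_placeholder` is a hypothetical helper and the template fragment is made up and much shorter than the real ICML template.

```lua
-- Hypothetical helper: puts the $index$ placeholder right before the first
-- <Story Self="pandoc_story" occurrence found in an ICML template.
local STORY_START = '<Story Self="pandoc_story"'

local function add_index_placeholder(template)
  -- plain search (4th argument true), so no character is read as a pattern
  local start = string.find(template, STORY_START, 1, true)
  if not start then
    -- marker not found: return the template unchanged
    return template
  end
  return template:sub(1, start - 1) .. "$index$\n" .. template:sub(start)
end

-- a made-up fragment standing in for the real template
local template = '<Document>\n<Story Self="pandoc_story">\n$body$\n</Story>\n</Document>'
print(add_index_placeholder(template))
```

Returning the template unchanged when the marker is missing keeps the output valid, at the cost of silently producing a document without an index; a real writer might prefer to log a warning.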
Here's the custom writer's main function:

    function Writer(doc, opts)
      -- collect the indices and the terms defined in the document
      local collected = pandocIndices.collectIndices(doc)
      indices = collected.indices
      terms = collected.terms
      -- apply the filters in order, each one to the result of the previous one
      local filtered = doc
      for i = 1, #indices_filters do
        logging_info("applying filter #" .. i)
        local filter = indices_filters[i]
        filtered = filtered:walk(filter)
      end
      -- make a clone of opts and add the index variable
      local options = pandoc.WriterOptions(opts)
      options.variables.index = index_var
      return pandoc.write(filtered, 'icml', options)
    end

Some filters are applied to collect index data and fill the `index_var` variable, whose value is put into `options.variables.index` before calling `pandoc.write(filtered, 'icml', options)`.
The writer then replaces `$index$` in the template with the value of `options.variables.index`.

`docx_index.lua` is a filter that injects references to index terms in the text.
Here's an example:

    pandoc -f markdown -t docx -o doc-with-index.docx -L docx_index.lua doc.md

When you open the resulting DOCX file, you won't see an index. You must create it explicitly (e.g. References -> Insert index) with your word processing app (e.g. Word).

`odt_index.lua` is a filter that injects references to index terms in the text.
Here's an example:

    pandoc -f markdown -t odt -o doc-with-index.odt -L odt_index.lua doc.md

When you open the resulting ODT file, you won't see an index. You must create it explicitly with your word processing app. In LibreOffice, you can click on Insert - Table of Contents and Index - Table of Contents, Index or Bibliography.
Though LibreOffice supports many indices, for now the only one that is created is the alphabetical index.
To generate an index for your document(s), you can:

- start from a predefined list of words (e.g. names or topics) that represent the index terms;
- mark references to indices in your texts, and then collect and organize the references into a list of index terms.

The filter `paras_to_index_terms.lua` takes a list of paragraphs (i.e. one term per line), and encapsulates all the paragraphs in `Div` blocks that represent index terms.
Then all those `Div` terms are encapsulated in one `Div` that represents the index.
The filter works only on paragraphs that are children of the main Pandoc "root" element.
If your document has `Div` blocks that contain paragraphs, they are kept untouched in the output document.

Example (`phonetic.md`):

    # Phonetic alphabet

    alpha

    bravo

    charlie

    # Another title

Running `pandoc -f markdown -t native -L paras_to_index_terms.lua phonetic.md`, you get:

    [ Header
        1
        ( "phonetic-alphabet" , [] , [] )
        [ Str "Phonetic" , Space , Str "alphabet" ]
    , Header
        1
        ( "another-title" , [] , [] )
        [ Str "Another" , Space , Str "title" ]
    , Div
        ( ""
        , [ "index" ]
        , [ ( "index-name" , "index" )
          , ( "ref-class" , "index-ref" )
          ]
        )
        [ Div
            ( "" , [ "index-term" ] , [ ( "index-name" , "index" ) ] )
            [ Para [ Str "alpha" ] ]
        , Div
            ( "" , [ "index-term" ] , [ ( "index-name" , "index" ) ] )
            [ Para [ Str "bravo" ] ]
        , Div
            ( "" , [ "index-term" ] , [ ( "index-name" , "index" ) ] )
            [ Para [ Str "charlie" ] ]
        ]
    ]

You can specify the index name and the class for references in the text, like this:

    pandoc -f markdown -t native -L paras_to_index_terms.lua -V index_name=phonetic -V ref_class=phonetic-ref phonetic.md

and you get:

    [ Header
        1
        ( "phonetic-alphabet" , [] , [] )
        [ Str "Phonetic" , Space , Str "alphabet" ]
    , Header
        1
        ( "another-title" , [] , [] )
        [ Str "Another" , Space , Str "title" ]
    , Div
        ( ""
        , [ "index" ]
        , [ ( "ref-class" , "phonetic-ref" )
          , ( "index-name" , "phonetic" )
          ]
        )
        [ Div
            ( ""
            , [ "index-term" ]
            , [ ( "index-name" , "phonetic" ) ]
            )
            [ Para [ Str "alpha" ] ]
        , Div
            ( ""
            , [ "index-term" ]
            , [ ( "index-name" , "phonetic" ) ]
            )
            [ Para [ Str "bravo" ] ]
        , Div
            ( ""
            , [ "index-term" ]
            , [ ( "index-name" , "phonetic" ) ]
            )
            [ Para [ Str "charlie" ] ]
        ]
    ]

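
The wrapping done by `paras_to_index_terms.lua`, as it appears in the outputs above, can be approximated in plain Lua. This is a sketch under assumptions, not the filter itself: `paras_to_index` is a hypothetical function and the blocks are plain tables with a made-up shape (`t`, `text`, `classes`, `attributes`, `content`) rather than real Pandoc AST elements.

```lua
-- Plain tables stand in for Pandoc blocks here; this is not the real AST.
local function paras_to_index(blocks, index_name, ref_class)
  index_name = index_name or "index"
  ref_class = ref_class or "index-ref"
  local others, terms = {}, {}
  for _, block in ipairs(blocks) do
    if block.t == "Para" then
      -- each top-level paragraph becomes the content of one index term
      terms[#terms + 1] = {
        t = "Div",
        classes = { "index-term" },
        attributes = { ["index-name"] = index_name },
        content = { block },
      }
    else
      -- anything else (headers, Divs with their own paragraphs) is kept as is
      others[#others + 1] = block
    end
  end
  -- the index Div, holding all the terms, goes after the other blocks
  others[#others + 1] = {
    t = "Div",
    classes = { "index" },
    attributes = { ["index-name"] = index_name, ["ref-class"] = ref_class },
    content = terms,
  }
  return others
end

local result = paras_to_index({
  { t = "Header", text = "Phonetic alphabet" },
  { t = "Para", text = "alpha" },
  { t = "Para", text = "bravo" },
  { t = "Header", text = "Another title" },
}, "phonetic", "phonetic-ref")
print(#result, #result[3].content)
```

Only the top-level list is scanned, so paragraphs nested inside other blocks are never turned into terms, which matches the behaviour described for the filter.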
Suppose you have no predefined index, nor a list of words that represent the index terms.
You can put references to indices in the text, with `Span` inlines.
Those `Span`s must have a class that identifies them (see above).
`compile_raw_indices.lua` is a filter that outputs one or more raw indices, as `Div` blocks with the `index` class (see above), whose contents are index terms (`Div` blocks with the `index-term` class).
The terms' texts are the ones marked in the main text as references, and they are sorted alphabetically.
Example:

    pandoc -f markdown -t markdown -L compile_raw_indices.lua test.md

Here's the result:

    :::::::: {#index .index index-name="index"}
    ::: {#consequo .index-term index-name="index" count="3" sort-key="consequat"}
    consequat
    :::
    ::: {.index-term index-name="index" count="1" sort-key="dolor"}
    dolor
    :::
    ::: {#dolor .index-term index-name="index" count="1" sort-key="dolor"}
    dolor
    :::
    ::: {#labor .index-term index-name="index" count="1" sort-key="labore"}
    labore
    :::
    ::: {#labor .index-term index-name="index" count="3" sort-key="laborum"}
    laborum
    :::
    ::::::::

It's clearly a raw index that needs some further processing, but most of the extraction and sorting work is done.
Here's the example, once it's been manually reworked:

    :::::::: {#index .index index-name="index"}
    ::: {#consequo .index-term index-name="index" count="3" sort-key="consequo"}
    consequo
    :::
    ::: {#dolor .index-term index-name="index" count="2" sort-key="dolor"}
    dolor
    :::
    ::: {#labor .index-term index-name="index" count="4" sort-key="labor"}
    labor
    :::
    ::::::::

You may have references in the text specified through a class, instead of the `index-name` attribute.
Since you start only with references, and without any index specifying a `ref-class`, the mapping between the reference classes and their corresponding indices is missing.
You can pass that information by setting the value of the `index_ref_classes` variable, e.g. with `-V index_ref_classes='{"name-ref":"names","subj-ref":"subjects"}'`.
That tells the filter that the `Span`s with a `name-ref` class are references to the "names" index, while those with a `subj-ref` class are references to the "subjects" index.
The value of the `index_ref_classes` variable is a JSON object, whose keys are the classes characterizing the references' `Span` inlines and whose values are the names of their corresponding indices.
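
The lookup that this mapping enables can be sketched as follows. It is only an illustration: `index_for_span` is a hypothetical helper, the mapping is written directly as a Lua table (decoding the JSON value of the variable is left out), and the `highlight` class is made up.

```lua
-- The mapping is written as a Lua table; decoding the JSON value of the
-- index_ref_classes variable is left out of this sketch.
local index_ref_classes = {
  ["name-ref"] = "names",
  ["subj-ref"] = "subjects",
}

-- Hypothetical helper: returns the name of the index a Span refers to,
-- or nil when none of its classes is a known reference class.
local function index_for_span(span_classes, ref_classes)
  for _, class in ipairs(span_classes) do
    local index_name = ref_classes[class]
    if index_name then
      return index_name
    end
  end
  return nil
end

print(index_for_span({ "highlight", "subj-ref" }, index_ref_classes))
```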
The filter `sort_indices.lua` sorts all terms in every index of a document.
Currently (version 0.4.1), the terms are sorted according to their `sort-key` attribute, in ascending alphabetical order.
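
As an illustration of that ordering (not the filter's actual code), sorting terms by sort key in plain Lua can look like this. The terms are plain tables whose `sortKey` field stands for the `sort-key` attribute; the shape is an assumption borrowed from the JSON output of `indices2json.lua`.

```lua
-- Terms are plain tables here; sortKey stands for the sort-key attribute.
local terms = {
  { id = "labor", sortKey = "labor" },
  { id = "consequo", sortKey = "consequo" },
  { id = "dolor", sortKey = "dolor" },
}

-- ascending order by sort key; the comparison follows Lua's string ordering,
-- which depends on the current locale and is not a full collation
table.sort(terms, function(a, b)
  return a.sortKey < b.sortKey
end)

for _, term in ipairs(terms) do
  print(term.id)
end
```

With such a comparison, keys that mix upper and lower case or carry accents may not end up where a reader expects, so normalized sort keys (e.g. lowercase ASCII) are the safer choice.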
The filter `assign_ids_to_index_terms.lua` assigns an identifier to any index term of any index.
If a term already has an identifier, the filter does not change it, unless you set the variable `ids_reset` with the `-V ids_reset=true` option.
Identifiers are the concatenation of a prefix and a counter.
The default value for the prefix is the index name, but you can change it by setting the `ids_prefixes` variable, e.g. `-V ids_prefixes='{"names":"n_"}'`.
The value of `ids_prefixes` is a JSON object where keys are the index names and values are the corresponding prefixes.
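
The rules above can be sketched in plain Lua. This is an assumption-laden illustration, not the filter's code: `assign_ids` is a hypothetical function, terms are plain tables, and the identifier is simply the prefix followed by the counter, since the exact format used by the filter (separator, padding, starting value) is not stated here. Skipping counters that clash with existing identifiers is also a choice of this sketch.

```lua
-- Hypothetical function: gives an id to every term of one index.
-- prefixes plays the role of ids_prefixes, reset the role of ids_reset.
local function assign_ids(index_name, terms, prefixes, reset)
  -- the prefix defaults to the index name
  local prefix = (prefixes or {})[index_name] or index_name
  -- remember the ids already in use, unless they are going to be replaced
  local used = {}
  if not reset then
    for _, term in ipairs(terms) do
      if term.id and term.id ~= "" then
        used[term.id] = true
      end
    end
  end
  local counter = 0
  for _, term in ipairs(terms) do
    if reset or not term.id or term.id == "" then
      -- skip counters that would clash with an existing id
      repeat
        counter = counter + 1
      until not used[prefix .. counter]
      term.id = prefix .. counter
      used[term.id] = true
    end
  end
  return terms
end

local names = assign_ids("names", {
  { text = "Ada" },
  { text = "Alan", id = "turing" },
  { text = "Grace" },
}, { names = "n_" }, false)
for _, term in ipairs(names) do
  print(term.id)
end
```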
The current version is 0.4.4 (2024, November 1st).
This software:

- provides custom writers and filters for Pandoc;
- and makes use of William Lupton's `logging.lua` module.