ODT to Diplodoc Converter

This tool turns a technical document from ODT format (LibreOffice Writer) into a set of files ready for Diplodoc — a modern system for technical documentation.

Below is a short list of routine tasks this converter does for you.

1. Split one ODT file into many MD files with folders

Without the tool:
Open ODT, copy each chapter by hand, create folders, write index.md, index.yaml, toc.yaml. If you change the document, do everything again.

With the tool:
The converter finds headings (#, ##, ### …) and creates nested folders. Each folder has index.md, index.yaml, toc.yaml. The structure follows the ODT exactly.
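The heading-to-folder mapping can be sketched as follows. This is a minimal illustration, not the converter's actual code: `slugify` and `plan_sections` are hypothetical names, the folder naming rule is an assumption, and headings inside fenced code blocks are not handled.

```python
import re
from pathlib import PurePosixPath

# ATX heading: one to six "#" characters followed by the title
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def slugify(title: str) -> str:
    """Turn a heading into a folder-safe name (assumed naming rule)."""
    slug = re.sub(r"[^\w]+", "-", title.lower()).strip("-")
    return slug or "section"


def plan_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (path to index.md, heading title) pairs that mirror heading nesting."""
    stack: list[tuple[int, str]] = []  # (level, slug) of the currently open ancestors
    plan = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        # Close every section at the same or a deeper level
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, slugify(title)))
        path = PurePosixPath(*[slug for _, slug in stack]) / "index.md"
        plan.append((str(path), title))
    return plan
```

Each returned path is where the body of that section would be written; the parent folders come from the headings above it.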

2. Move images and fix their paths

Without the tool:
Extract all images from ODT manually, copy them into subfolders, fix paths in Markdown files (they often have absolute paths like C:\...).

With the tool:
The script runs pandoc with --extract-media, finds all images, copies them into a media folder next to each index.md, and fixes links to relative paths (media/name.png).
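The link-fixing step might look roughly like this. It is a sketch under assumptions: `relink_images` is a hypothetical helper, and it only rewrites the path in the Markdown text (copying the files is left out).

```python
import re
from pathlib import PureWindowsPath

# Inline Markdown image: ![alt](source)
IMAGE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<src>[^)\s]+)\)")


def relink_images(markdown: str, media_dir: str = "media") -> str:
    """Rewrite every image link to point at <media_dir>/<file name>."""
    def repl(match: re.Match) -> str:
        # PureWindowsPath splits on both "\" and "/", so either path style works
        name = PureWindowsPath(match.group("src")).name
        return f"![{match.group('alt')}]({media_dir}/{name})"

    return IMAGE.sub(repl, markdown)
```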

3. Keep figure captions (including numbers)

Without the tool:
Rewrite each caption below the image, check the figure number, keep the same style (“Figure 1. Title”).

With the tool:
The script takes the full caption from the {alt="..."} field, adds a missing space after the number if needed, and puts the caption in italics right under the image.
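A possible shape for this caption step is shown below. The function name and the regular expressions are assumptions made for illustration; the real strategy may detect captions differently.

```python
import re

# An image immediately followed by Pandoc's {alt="..."} attribute
IMG_WITH_ALT = re.compile(r'(!\[[^\]]*\]\([^)]+\))\{alt="([^"]*)"\}')
# A figure number such as "1." or "1.2:" glued to the following word
NUM_NO_SPACE = re.compile(r"(\d+(?:\.\d+)*[.:])(?=[^\s\d])")


def caption_from_alt(markdown: str) -> str:
    """Move the alt text under the image as an italic caption."""
    def repl(match: re.Match) -> str:
        # Insert the missing space after the number, at most once
        caption = NUM_NO_SPACE.sub(r"\1 ", match.group(2), count=1)
        return f"{match.group(1)}\n\n*{caption}*"

    return IMG_WITH_ALT.sub(repl, markdown)
```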

4. Turn two-column table notes into `{% note info %}` blocks

Without the tool:
Edit HTML tables by hand, remove extra columns, move text into {% note %} tags.

With the tool:
The converter finds two-column grid tables like the one below, deletes the left column with the picture, keeps only the text, and wraps it into {% note info %}…{% endnote %}:

+-----+-----+
| ![](https://github.com/paulyeshchyk/odt2diplodoc/raw/HEAD/img) | Note text |
+-----+-----+
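The table-to-note conversion could be approximated like this. It is a simplified sketch: the helper names are hypothetical, only two-column grid tables whose left cell holds an image are converted, and every other table is passed through unchanged.

```python
import re

BORDER = re.compile(r"^\+[-=+:]+\+$")  # grid-table border line
ROW = re.compile(r"^\|\s*(?P<left>[^|]*?)\s*\|\s*(?P<right>[^|]*?)\s*\|$")  # exactly two cells


def convert_table(lines: list[str]) -> list[str]:
    """Return a note block for an icon+text table, else the lines unchanged."""
    rows = [ROW.match(l.strip()) for l in lines if not BORDER.match(l.strip())]
    if not rows or any(row is None for row in rows):
        return lines  # not a two-column table
    if not any(row.group("left").startswith("![") for row in rows):
        return lines  # no picture in the left column
    text = " ".join(row.group("right") for row in rows if row.group("right"))
    return ["{% note info %}", "", text, "", "{% endnote %}"]


def tables_to_notes(markdown: str) -> str:
    out: list[str] = []
    table: list[str] = []  # lines of the grid table being collected
    for line in markdown.splitlines():
        stripped = line.strip()
        if BORDER.match(stripped) or (table and stripped.startswith("|")):
            table.append(line)
            continue
        if table:
            out.extend(convert_table(table))
            table = []
        out.append(line)
    if table:
        out.extend(convert_table(table))
    return "\n".join(out)
```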

5. Remove Pandoc garbage attributes (`{alt="..."}`)

Without the tool:
Pandoc often leaves lines like {alt="Figure 49: …"}, which appear in the final HTML. You must delete them from every file.

With the tool:
The script automatically removes all such attributes, leaving clean Markdown.
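In its simplest form, such a cleanup is a single regular expression. The sketch below assumes the attribute block contains only `alt`; blocks that mix several attributes would need a broader pattern.

```python
import re

# {alt="..."}; the character class also matches newlines, so wrapped captions are covered
ALT_ATTR = re.compile(r'\{alt="[^"]*"\}')


def strip_alt_attributes(markdown: str) -> str:
    """Remove leftover {alt="..."} attribute blocks from the Markdown."""
    return ALT_ATTR.sub("", markdown)
```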

6. Generate the YAML files needed by Diplodoc

Without the tool:
Create index.yaml (fields title, href, meta) and toc.yaml (with items, include) by hand.

With the tool:
All YAML files are generated automatically:

inside each folder – index.yaml and toc.yaml;
in the root folder – index.yaml, toc.yaml, index.md;
links in toc.yaml point to index.md, and for sub-sections an include: {path: ..., mode: link} is added.
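To make the toc.yaml layout more concrete, here is a hedged sketch of the data such a file might hold. `build_toc` is a hypothetical helper, and the exact field layout the converter emits (for example, whether an item carries both `href` and `include`) is an assumption.

```python
def build_toc(title: str, subsections: list[tuple[str, str]]) -> dict:
    """Build the toc.yaml content for one folder.

    subsections: (display name, folder name) pairs for the direct children.
    """
    items = []
    for name, folder in subsections:
        items.append({
            "name": name,
            # Sub-sections are pulled in through their own toc.yaml
            "include": {"path": f"{folder}/toc.yaml", "mode": "link"},
        })
    return {"title": title, "href": "index.md", "items": items}
```

With PyYAML installed, the dictionary can be serialized with `yaml.safe_dump(toc, allow_unicode=True, sort_keys=False)`.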

7. Fix internal links (between chapters)

Without the tool:
In ODT, the author often puts cross-references like “see chapter 2”. After conversion, they become links such as [see chapter 2](#anchor-209), which do not work in Diplodoc because the target now lives in a different file. You have to find and replace each link manually.

With the tool:
The script builds a map of all anchors (anchor-209 -> path/to/section/index.md) and automatically replaces the links with relative paths that Diplodoc understands.
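The replacement step can be pictured as follows. This is an illustrative sketch, assuming the anchor map already exists; `fix_internal_links` and its parameters are hypothetical names.

```python
import posixpath
import re

# The target part of a Markdown link that points at an in-page anchor: ](#anchor-209)
LINK = re.compile(r"\]\(#(?P<anchor>[^)\s]+)\)")


def fix_internal_links(body: str, current_dir: str, anchor_map: dict[str, str]) -> str:
    """Replace #anchor targets with paths relative to the current section folder."""
    def repl(match: re.Match) -> str:
        target = anchor_map.get(match.group("anchor"))
        if target is None:
            return match.group(0)  # unknown anchor: leave the link untouched
        relative = posixpath.relpath(target, start=current_dir or ".")
        return f"]({relative})"

    return LINK.sub(repl, body)
```

Here `current_dir` is the folder of the file being processed, relative to the documentation root, and the map values are paths relative to that same root.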

8. Control caching and Pandoc options with config and CLI

Without the tool:
You have to run Pandoc every time, even when only the post-processing changes (for example, the note style). You also have no convenient way to pass Pandoc a custom format string such as markdown-raw_html+pipe_tables.

With the tool:

You can keep a cache (a folder with full_doc.md) and skip calling Pandoc on subsequent runs, which makes them faster.
Through cli.py or the PandocOptions dataclass you can set any format extension: enabled (+), disabled (-), or left unset.
You can use Lua filters (e.g. to remove image size attributes).
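The three-state extension setting could be modeled like this. The class below is a hypothetical stand-in, not the repository's PandocOptions: its field names and the `format_string` method are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PandocOptionsSketch:
    """Illustrative only; the real PandocOptions may have different fields."""
    base_format: str = "markdown"
    # True -> "+name", False -> "-name", None -> extension not mentioned
    extensions: dict[str, Optional[bool]] = field(default_factory=dict)

    def format_string(self) -> str:
        parts = [self.base_format]
        for name, state in self.extensions.items():
            if state is None:
                continue
            parts.append(("+" if state else "-") + name)
        return "".join(parts)
```

For example, `{"raw_html": False, "pipe_tables": True}` yields the format string `markdown-raw_html+pipe_tables`.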

9. Integration with Visual Studio Code

Without the tool:
You must run the scripts from the command line and remember paths and flags, and debugging is difficult.

With the tool:
The repository includes a .vscode/launch.json file with a ready-to-use Python debug configuration. Just open the project folder in VSCode, press F5, and the converter starts with your chosen parameters.

In addition, the official Diplodoc extension for VSCode lets you preview the result and check links and YAML structure right in the editor.

10. Flexibility for future changes (strategy architecture)

Without the tool:
Any change to the converter’s behaviour requires editing the main code – risky and time-consuming.

With the tool:
All post-processing is moved into strategies:

Global strategies (applied to the whole Markdown before parsing).
Section strategies (applied to the body of each article after parsing).

To add a new transformation (e.g. convert three-column tables into lists), you just write one class and register it in __init__.py. The rest of the code stays unchanged.
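A strategy setup of this kind might look roughly like the sketch below. All names here (`SectionStrategy`, `register`, `CollapseBlankLines`, `run_strategies`) are hypothetical, and the decorator-based registry is one possible design, not necessarily how the repository's __init__.py registers strategies.

```python
import re
from abc import ABC, abstractmethod


class SectionStrategy(ABC):
    """One self-contained transformation of a section body."""

    @abstractmethod
    def apply(self, text: str) -> str: ...


REGISTRY: list[SectionStrategy] = []


def register(strategy_cls):
    """Class decorator: instantiate the strategy and add it to the pipeline."""
    REGISTRY.append(strategy_cls())
    return strategy_cls


@register
class CollapseBlankLines(SectionStrategy):
    """Example strategy: squeeze runs of blank lines down to a single one."""

    def apply(self, text: str) -> str:
        return re.sub(r"\n{3,}", "\n\n", text)


def run_strategies(text: str) -> str:
    # Strategies run in registration order; the caller never changes
    for strategy in REGISTRY:
        text = strategy.apply(text)
    return text
```

Adding a transformation then means adding one registered class, while `run_strategies` and its callers stay as they are.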

11. Working with cross-references to figures (how it works)

Without the tool:
Pandoc loses figure numbers and turns cross-references into empty text like (fig. ). Restoring them manually (tracking numbers, finding image paths, and making links) is almost impossible.

With the tool:

How the converter solves it

Direct editing of XML inside the ODT
The script process_crossref.py unpacks a temporary copy of the ODT, uses lxml to find all <text:sequence> (figure captions) and <text:sequence-ref> (places where a reference should be).

Building a map
For each figure, the script remembers its number and its file path (e.g. Pictures/…). It builds a dictionary: number -> path.

Adding markers to captions
It adds {#fig:N} to the caption text. This marker is later removed, but it helps the strategy find the correct number.

Replacing reference fields
Every <text:sequence-ref> is replaced with plain text (рис. N), where “рис.” is the Russian abbreviation for “fig.”. Now the Markdown from Pandoc will not be empty: it will contain the figure number.
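The XML steps above can be sketched in a simplified form. Note the assumptions: this version uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, it works on the content.xml text rather than on an unpacked ODT, it pairs captions and references through the text:ref-name attribute, and `inline_figure_refs` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SEQ = f"{{{TEXT_NS}}}sequence"
SEQ_REF = f"{{{TEXT_NS}}}sequence-ref"
REF_NAME = f"{{{TEXT_NS}}}ref-name"

ET.register_namespace("text", TEXT_NS)  # keep the "text:" prefix when serializing


def inline_figure_refs(content_xml: str) -> str:
    """Replace each <text:sequence-ref> with the plain text "(рис. N)"."""
    root = ET.fromstring(content_xml)
    # ref-name -> figure number, read from the caption sequence fields
    numbers = {
        el.get(REF_NAME): (el.text or "").strip()
        for el in root.iter(SEQ)
        if el.get(REF_NAME)
    }
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != SEQ_REF:
                continue
            number = numbers.get(child.get(REF_NAME))
            if number is None:
                continue  # reference to an unknown sequence: leave as is
            text = f"(рис. {number})" + (child.tail or "")
            position = list(parent).index(child)
            if position == 0:
                parent.text = (parent.text or "") + text
            else:
                previous = parent[position - 1]
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```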

Post‑processing the Markdown (global strategy)

The FixFigureReferencesStrategy finds strings like (рис. N) in the generated Markdown and turns them into [@fig:N].
Then another strategy (or a separate function) replaces [@fig:N] with a full Markdown link:
[(fig. N)](https://github.com/paulyeshchyk/odt2diplodoc/blob/HEAD/media/filename.png).
The number comes from the map, and the path is taken from the file name (the media folder).
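The two Markdown passes described above can be sketched as follows. The function names are hypothetical, and the map is assumed to hold figure number -> image file name; references to unknown figures fall back to plain text.

```python
import re

REF_TEXT = re.compile(r"\(рис\.\s*(\d+)\)")  # plain text left by the XML step
REF_MARK = re.compile(r"\[@fig:(\d+)\]")     # intermediate marker


def mark_figure_refs(markdown: str) -> str:
    """First pass: (рис. N) -> [@fig:N]."""
    return REF_TEXT.sub(r"[@fig:\1]", markdown)


def link_figure_refs(markdown: str, figure_files: dict[int, str]) -> str:
    """Second pass: [@fig:N] -> [(fig. N)](media/<file name>)."""
    def repl(match: re.Match) -> str:
        number = int(match.group(1))
        name = figure_files.get(number)
        if name is None:
            return f"(fig. {number})"  # no image known for this number
        return f"[(fig. {number})](media/{name})"

    return REF_MARK.sub(repl, markdown)
```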

The result
In the final HTML, the link is clickable and points to the correct figure. No external filters (like pandoc-crossref) are needed.

What the user gets
Just add the flag --enable-crossref. The converter does everything else. Figure references work just like in the original ODT.

How to start

Install Python 3.11+, Pandoc, PyYAML.
Clone the repository, create a virtual environment.
Put your manual.odt in the root folder.
Run python cli.py manual.odt ./docs/ru --pandoc-format "markdown-raw_html".
The result is ready to build with diplodoc build.

For more settings, see comments in cli.py and config.py.

License

This project is licensed under the MIT License — free to use and modify.

Diplodoc Converter

paul.yestchick