Some Data Should Be Code

I write a lot of Makefiles. I use Make not as a command runner but as an ad-hoc build system for small projects, typically for compiling Markdown documents and their dependencies. Like so:

A build graph for a document. A central node `doc.md` represents the Markdown source. Two outgoing arrows point to `doc.html` and `doc.pdf`, representing the output formats. A chain through `graph.dot`, `graph.png`, and `doc.md` represents how a Graphviz .dot file can be rendered to a PNG. A chain through `data.csv`, `plot.py`, `plot.png`, and `doc.md` represents using a Python script to make a plot from a CSV file.

And the above graph was generated by this very simple Makefile:

# Render the Graphviz source to a PNG.
# $< expands to the first prerequisite (graph.dot), $@ to the target (graph.png).
graph.png: graph.dot
	dot -Tpng $< -o $@

clean:
	rm -f graph.png

(I could never remember the automatic variable syntax until I made flashcards for them.)

It works for simple projects, when you can mostly hand-write the rules. But the abstraction ceiling is very low. If you have a bunch of almost identical rules, e.g.:

a.png: a.csv plot.py
	python plot.py $< $@

b.png: b.csv plot.py
	python plot.py $< $@

c.png: c.csv plot.py
	python plot.py $< $@

You can use pattern matching to collapse them into a “rule schema”, by analogy to axiom schemata:

%.png: %.csv plot.py
	python plot.py $< $@

Which works backwards: when something in the build graph depends on a target matching %.png, Make synthesizes a rule instance with a dependency on the corresponding .csv file.

But pattern matching is still very limited. Lately I’ve been building my own plain-text accounting solution using some Python scripts. One of the tasks is to read a CSV of bank transactions from 2019–2024 and split it into TOML files for each year-month, to make subsequent processing parallelizable. So the rules might be something like:

ledger/2019-08.toml: inputs/checkbook_pro_export.csv
	uv run import_from_checkbook.py --year=2019 --month=8

ledger/2019-09.toml: inputs/checkbook_pro_export.csv
	uv run import_from_checkbook.py --year=2019 --month=9

# ...

I had to write a Python script to generate the complete Makefile. Makefiles look like code, but they are data: a container format for tiny fragments of shell that are run on demand by the Make engine. And because Make doesn’t scale, for complex tasks you have to bring out a real programming language to generate the Makefile.
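
That generator script doesn’t have to be long. A minimal sketch along these lines (the script name is made up; the rule text mirrors the rules above) could be:

# generate_makefile.py (hypothetical): print one import rule per year-month.
EXPORT = "inputs/checkbook_pro_export.csv"

for year in range(2019, 2025):
    for month in range(1, 13):
        target = f"ledger/{year}-{month:02d}.toml"
        print(f"{target}: {EXPORT}")
        print(f"\tuv run import_from_checkbook.py --year={year} --month={month}")
        print()

Run it with something like python generate_makefile.py > Makefile, then make as usual. It works, but now the build is described in two layers that have to be kept in sync.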

I wish I could, instead, write a make.py file with something like this:

from whatever import *

g = BuildGraph()

EXPORT: str = "inputs/checkbook_pro_export.csv"

# The (year, month) pairs I have bank transaction CSVs for.
year_months: list[tuple[int, int]] = [
    (y, m) for y in range(2019, 2025) for m in range(1, 13)
]

# Import transactions for each year-month into a separate ledger.
for year, month in year_months:
    ledger_path: str = f"ledger/{year}-{month:02d}.toml"
    g.rule(
        targets=[ledger_path],
        deps=[EXPORT],
        # Bind the loop variables now; a bare lambda would see only their final values.
        fn=lambda p=ledger_path, y=year, m=month: import_from_checkbook(p, y, m),
    )

Fortunately this exists: it’s called doit, but it’s not widely known.
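
To give a flavor of it, the ledger-import rules above might look roughly like this as a doit dodo.py; the dict keys are doit’s task metadata, while the file names and command just echo the example above:

# dodo.py: doit discovers functions named task_* and builds the dependency graph.
EXPORT = "inputs/checkbook_pro_export.csv"

def task_import():
    """Split the bank export into one TOML ledger per year-month."""
    for year in range(2019, 2025):
        for month in range(1, 13):
            ledger = f"ledger/{year}-{month:02d}.toml"
            yield {
                "name": f"{year}-{month:02d}",  # sub-task name, e.g. import:2019-08
                "actions": [f"uv run import_from_checkbook.py --year={year} --month={month}"],
                "file_dep": [EXPORT, "import_from_checkbook.py"],
                "targets": [ledger],
            }

Running doit then re-imports only the months whose dependencies changed, much as Make would.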


A lot of things are like Makefiles: data that should be lifted one level up to become code.

Consider CloudFormation. Nobody likes writing those massive YAML files by hand, so AWS introduced CDK, which is literally just a library¹ of classes that represent AWS resources. Running a CDK program emits CloudFormation YAML as though it were an assembly language for infrastructure. And so you get type safety, modularity, abstraction, conditionals and loops, all for free.
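
To make that concrete, here is a small sketch using the CDK v2 Python bindings. The stack and bucket names are made up; the point is that an ordinary Python loop becomes several CloudFormation resources when the app is synthesized:

# A hypothetical CDK app: three buckets from one loop, synthesized to CloudFormation.
from aws_cdk import App, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataLakeStack(Stack):
    def __init__(self, scope: Construct, construct_id: str) -> None:
        super().__init__(scope, construct_id)
        # One bucket per stage; in raw CloudFormation this would be three
        # hand-written (or templated) resource blocks.
        for stage in ["raw", "clean", "published"]:
            s3.Bucket(self, f"{stage}-bucket", versioned=(stage == "published"))

app = App()
DataLakeStack(app, "data-lake")
app.synth()  # writes the CloudFormation template into cdk.out/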

Consider GitHub Actions. How much better off would we be if, instead of writing the workflow-job-step tree by hand, we could just have a single Python script, executed on push, whose output is the GitHub Actions YAML-as-assembly? So you might write:

from ga import *
from checkout_action import CheckoutAction
from rust_action import RustSetupAction

# Define the workflow that runs on each commit.
commit_workflow = Workflow(
    name="commit",
    test=lambda ev: isinstance(ev, CommitEvent),
    jobs=[
        # The lint job.
        Job(
            name="lint",
            steps=[
                Step(
                    name="check out",
                    run=CheckoutAction(),
                ),
                Step(
                    name="set up Rust and Cargo",
                    run=RustSetupAction(),
                ),
                Step(
                    name="run cargo fmt",
                    run=Shell(["cargo", "fmt", "--check"])
                )
            ]
        )
    ]
)

Actions here would simply be ordinary Python libraries that the CI script depends on. Again: conditionals, loops, abstraction, and type safety all come for free, by virtue of using a language that was designed to be a language, rather than a data exchange format that slowly grows into a poorly-designed DSL.
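
And the payoff is immediate. With the same imaginary ga classes, a matrix of lint jobs across Rust toolchains is a list comprehension instead of a copy-pasted YAML block (the toolchain parameter on RustSetupAction is an invention of this sketch, like everything else in it):

from ga import *
from checkout_action import CheckoutAction
from rust_action import RustSetupAction

# One lint job per toolchain.
toolchains = ["stable", "beta", "nightly"]

commit_workflow = Workflow(
    name="commit",
    test=lambda ev: isinstance(ev, CommitEvent),
    jobs=[
        Job(
            name=f"lint-{tc}",
            steps=[
                Step(name="check out", run=CheckoutAction()),
                Step(name="set up Rust and Cargo", run=RustSetupAction(toolchain=tc)),
                Step(name="run cargo fmt", run=Shell(["cargo", "fmt", "--check"])),
            ],
        )
        for tc in toolchains
    ],
)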

Why do we repeatedly end up here? Static data has better safety and static-analysis properties than code, but I don’t think that’s foremost in mind when people design these systems. Besides, using code to emit data (as CDK does) gives you those exact same properties. Rather, I suspect some people find it cute and clever to build tiny DSLs in a data format. They’re proud that they can get away with a “simple”, static solution rather than a dynamic one.

If you’re building a new CI system/IaC platform/Make replacement: please just let me write code to dynamically create the workflow/infrastructure/build graph.

Footnotes

  1. Or rather, a polyglot collection of libraries, one per language, like Pulumi.