Installation procedures of early TeX installations
TeX was written at a time when programs operated under much harsher constraints of memory (and speed), so it had to resort to many tricks, and to things that now look like oddities but were common at the time. Three are relevant to this question (INITEX, fmt files, core dumps); more on each of them below (though I realize the question is mainly about the last one).
These are mentioned in section 1331 (part 51: The main program) of the TeX program, as quoted in David's answer. (There's also a talk by Knuth about just these things. Over 3 days in July 1982 he gave 12 lectures addressed (I think) to (what we would call) system administrators at other universities and such places, who would install the new “portable” TeX program on their systems. This is the last of them: The Internal Details of TeX82 - Session 12. Watching those videos may answer some of the questions.)
First, the source code of TeX has some compile-time variants (like `#ifdef ... #endif` in C, implemented as `init ... tini` in the TeX program) which, if turned on, result in a separate program INITEX (which has the additional ability to initialize certain internal data structures, compute hyphenation patterns, dump format files, etc.). The reason for having INITEX be a separate program is that it needs extra memory to do its work (e.g. computing hyphenation patterns requires allocating memory for a trie), which leaves less memory for typesetting.
When compiled without these flags we get a separate program, called the production version of TeX, which may be either VIRTEX or TEX (more about this later). (Today, with contemporary TeX distributions, e.g. TeX Live and MiKTeX, there are no longer two separate programs compiled and distributed; instead we have a single program that behaves like INITEX if given the `-ini` flag.)
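As a rough modern illustration of the two modes (assuming a current TeX Live or MiKTeX, where the `tex` binary is Knuth's TeX built with web2c):

```
$ tex -ini    # INITEX behaviour: no format is preloaded; \patterns and \dump work in this mode
$ tex         # production behaviour: loads a precompiled format at startup (the plain format, for the tex command)
```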
Second, format files. This is closely related to the first, as format files are dumped by INITEX. For example, after starting INITEX (or `tex -ini` today) we could do `\input plain` and then `\dump`. What this does is dump out a lot of TeX's program state to a `.fmt` file, so that the production TeX program can simply load the format file instead of having to process `plain.tex` (or, as it was called then, `BASIC.TEX`; incidentally this is why in The TeXbook the documentation of the `plain` format is in Appendix B: Basic Control Sequences).
In the video linked above, somewhere around the 13-minute mark, he mentions some numbers that give an idea of the order of magnitude of the time saved: he says that opening and reading about a dozen short `.tfm` files (as INITEX or `tex -ini` does with `\input plain`) took over 15 seconds, so it was faster to read a single format file.
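To make the dumping step concrete with a current distribution (a rough sketch; the file names are just the usual ones, and `\dump` writes the format into the current directory):

```
$ tex -ini '\input plain \dump'   # read plain.tex (and its .tfm files), then write plain.fmt and exit
```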
Third, core dumps / undump. As I understand it, the idea was that all of your program's memory could be dumped out to a file, and that you could "undump" it: when the program was next started, all its memory would be initialized to what it was at the time of the dump, and the program would start at the beginning.
There is an explanation and history of this feature here, in a comment by David R. Fuchs, who (see interview: 1, 2) was Knuth's “right-hand man” when writing TeX (the current version of TeX, i.e. TeX82):
Executable program files were not much more than memory images; to run a program, the OS pretty much just mapped the executable image into your address space and jumped to the start. But when the program stopped, your entire state was still there, sitting in your address space […]
The OS also had a built-in command to allow you to SAVE the current memory image back into a new executable file. There wasn't much to this command, either, since executables weren't much more than a memory image to begin with. So, the equivalent of dump/undump was really just built into the OS, and wasn't considered any big deal or super-special feature. Of course, all language runtimes knew all about this, so they were always written to understand as a matter of course that they had to be able to deal with it properly. It pretty much came naturally if you were used to that environment, and wasn't a burden.
Thus, when TeX (and I presume the various Lisp and Emacs and etc. that were birthed on these machines) were designed, it was completely expected that they'd work this way. Cycles were expensive, as was IO; so in TeX's case, for example, it took many seconds to read in the basic macro package and standard set of font metric files and to preprocess the hyphenation patterns into their data structure. By doing a SAVE of the resulting preloaded executable once during installation, everyone then saved these many seconds each time they ran TeX.
You can read more (about other systems/programs) in the comment, and in the two discussion threads in which this came up: one on the Emacs dumper (2016), and one on making the Atom editor about 50% faster (2017).
Also, in the video, Knuth says this sort of system would be available at most places, or at least the lucky ones. (13:57: “on most—anyway, on lar—on lucky systems, let's say”.) (If your system didn't have it, you could still use the format file feature: you'd start VIRTEX and load the plain format with `&plain`, instead of starting INITEX and doing `\input plain`.)
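(With today's binaries, the equivalent of starting VIRTEX and preloading the plain format looks roughly like this, assuming a plain.fmt can be found, e.g. the one dumped above sitting in the current directory. The `&` must be protected from the shell, and the whole thing is largely redundant nowadays since the default format the `tex` command loads is already the plain format; `story.tex` stands for whatever plain TeX document you want to typeset.)

```
$ tex '&plain' story   # load plain.fmt at startup, then typeset story.tex
```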
So the dump/undump system is not really as arcane as it may sound. And I think you have partly answered your question yourself, by pointing out that Emacs and Perl had/have similar features.
Fourth (ha!) is the Pascal coding trick that makes this dump/undump behaviour possible inside TeX: the global `ready_already` variable in the main program, which gets tested against the magic value 314159 to detect whether the program is starting from a saved core image (see the passage from section 1331 of tex.web quoted in David's answer, and reproduced at the end of this section).
Some analysis (and speculation)
Of the three tricks, the first one (INITEX as a separate program, with different memory characteristics) was just inevitable, to fit within the memory constraints. But the other two (fmt files and dump/undump) seem rather similar (for example the verb “dump” is used for both). Do we need both?
My suspicion is that the `fmt` file feature was added to TeX because the dump/undump feature was not available everywhere: it's basically an implementation within TeX of what was in most (but not all) places an OS feature. So the fmt file feature was necessary, as it was the only thing guaranteed to work everywhere.
Could TeX have just the format file feature and not use dump/undump? On certain systems it was forced to, and evidently this was acceptable. On other systems (“the best implementations […] have a format file pre-loaded”, “On systems that allow such preloading…”) using dump/undump was natural so there's no reason not to. While loading a format did save a lot of time (e.g. that of opening lots of TFM files for font information), there was still work that needed to be done (“The VIRTEX program cannot read a format file instantaneously, of course…”):
- Sanity-checking various compile-time constants for meaningful values,
- Initializing lots of variables to their correct values (look at section 21 and the over 30 “see also” sections mentioned there),
- Opening and loading the format file (then closing it): this is the bulk of the time, and involves undumping a lot of things (section 455): the string pool, the entire dynamic memory (TeX “does nearly all of its own memory allocation”—this is basically a giant array called `mem`), various tables, etc. Although simply writing the final value of all of these would be faster than arriving at them one step at a time while processing `plain.tex` (opening files, defining macros one by one versus simply writing out the final hash table), it would still take a fair bit of time.
So the dump/undump trick is worth doing if available. There are a few problems though:
Even if given the right memory, how would the Pascal program know (if started from the beginning) whether it has already read a format file (i.e. whether it is being started from such a dump) or not?
(Mentioned around 16:40.) The state of the program saved in the dump file is actually not exactly right: it has a timestamp of when you started VIRTEX to make the core dump, not the time of your job, and it has the wrong `\jobname`. (You can try this yourself with a modern distribution: if you start with `tex -ini` (the closest we can get to VIRTEX), then do `\input plain` and then your doc (`hello world \bye` or whatever), then you'll get output on `plain.dvi`.) So we need to re-initialize some of these.
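(A quick way to reproduce that wrong-`\jobname` effect with a current distribution, passing everything on the command line; the exact text of the document doesn't matter:)

```
$ tex -ini '\input plain Hello world. \bye'
$ ls plain.dvi plain.log   # the output is named after plain.tex, not after our document
```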
This is where the “dirty trick” comes in: by testing a global variable that would have been set to a specific value before the core dump, but would (almost certainly) not hold that value if the program is started from scratch, we can distinguish the two cases, and (if resuming from a core dump) initialize only those things that need initializing.
This is IMO why DEK says the trick “pays off handsomely” (making this dump/undump possible at all, and saving some time), even though it's a trick that only helps on some systems.
To answer the actual questions asked about dump/undump:
This feature was common: going by the comment of DRF and the video, it was just the usual/sensible way to start (all) programs on many systems (in use at “big CS departments”). But DEK knew this feature didn't exist on all systems.
It was several (20+) seconds faster to load `BASIC.FMT` (with VIRTEX) instead of `\input BASIC` (with INITEX). Not sure how much additional time saving came from the dump/undumping (starting TEX directly), but somehow I suspect it wouldn't have been more than this.

The feature stopped being used when most people moved to other operating systems that didn't natively support dump/undump, and it needed special programs:
But when TeX was ported over to Unix (and then Linux), it came as a bit of a surprise that the model was different, and that there was no convenient, predefined way to get this functionality, and that the runtimes weren't typically set up to make it easy to do. The undump stuff was created to deal with it, but it was never pretty, since it was bolted on.
(Similar thing with `unexec` in Emacs, apparently.)

Based on the cues we have, we can guess Knuth wouldn't have considered it hackish (this was just how all programs worked), and would have been happy to have found a way to make TeX even faster — see David Carlisle's answer. :-)
In answer to your question 4, he thought it "paid off handsomely"...
tex.web says:
The \.{VIRTEX} program cannot read a format file instantaneously, of course;
the best implementations therefore allow for production versions of \TeX\ that
not only avoid the loading routine for \PASCAL\ object code, they also have
a format file pre-loaded. This is impossible to do if we stick to standard
\PASCAL; but there is a simple way to fool many systems into avoiding the
initialization, as follows:\quad(1)~We declare a global integer variable
called |ready_already|. The probability is negligible that this
variable holds any particular value like 314159 when \.{VIRTEX} is first
loaded.\quad(2)~After we have read in a format file and initialized
everything, we set |ready_already:=314159|.\quad(3)~Soon \.{VIRTEX}
will print `\.*', waiting for more input; and at this point we
interrupt the program and save its core image in some form that the
operating system can reload speedily.\quad(4)~When that core image is
activated, the program starts again at the beginning; but now
|ready_already=314159| and all the other global variables have
their initial values too. The former chastity has vanished!
In other words, if we allow ourselves to test the condition
|ready_already=314159|, before |ready_already| has been
assigned a value, we can avoid the lengthy initialization. Dirty tricks
rarely pay off so handsomely.
@^dirty \PASCAL@>
@^system dependencies@>