THEORY.md - pymake - A build system based on Build Systems à la Carte

THEORY.md (8792B)
      1 We model a task as the (possibly independent) computation of a value, a hash:
      2 
      3 - Task:
      4   - value: LazyFuture[T]
      5   - hash: LazyFuture[HashLike]
      6 
      7 You should philosophically view the value as information your dependents need to progress,
      8 e.g. a file-reader task should output a file handle that the dependents can open and read etc.
      9 You should philosophically view the hash as a way to convince your dependents
     10 that they don't need to rerun because they would get the same result anyway.
     11 I.e. if the hash == some previously recorded hash, the dependent should assume that rerunning would
     12 cause the same end result as that previous run.
     13 
     14 ----------------------------------------------------------------------------------------------------
     15 
     16 Computing the Task promises
     17 
     18 At its core the value and hash computations should be written by the user,
     19 but we provide a bunch of helpers to help the user "cheat" the computation,
     20 primarily through storing traces of previous runs.
     21 
     22 There are actually a few versions in existence that we can look at.
     23 
     24 - t.hist: Any previously known good version of t.
     25 - t.curr: Any version we currently have "access" to, and we have no idea whether it's good or not.
     26 - t.good: The goal is to compute this, the good version we should arrive at post-build.
     27 
     28 Each version has information and not all of them we know or can compute.
     29 
     30 - t.X.value: The value (or candidate value) of the task.
     31 - t.X.hash: The hash (or candidate hash) of the task.
     32 - t.X.deps: The hashes of the deps of the task. (only applies to historical tasks, of course.)
     33 - t.X.seq: A version number for t.X.value (represented as an infinitely incrementing seqnum).
     34   - seq nums can be reused, as long as if X.seq == Y.seq, then X and Y are the same thing
     35 - t.X.modtime: t.X is a file on disk, we can check its modtime.
     36 
     37 From here, we can make logical inferences based on the knowledge we have.
     38 Here is a table of how each piece of info could be calculated.
     39 
     40 - replay:
     41   - for known good version Z in {hist, good},
     42     for any versions X, Y in {hist, curr, good},
     43     - {t.Z.deps}.X == {t.Z.deps}.Y  implies  t.X == t.Y
     44   - Traces from the paper are the special case where Z = hist, X = hist, Y = good
     45     - {t.hist.deps} == {t.hist.deps}.good
     46     and also in particular we compare the hashes for good vs hist
     47     - {t.hist.deps}.hash == {t.hist.deps}.good.hash
     48     to infer that either the historical hash or value is good at the present time
     49     - t.hist.value == t.good.value (constructive traces)
     50     - t.hist.hash == t.good.hash   (verifying traces)
     51 
     52 - verify:
     53   - for Z in {hash, seq, modtime},
     54     - t.X.Z == t.Y.Z  implies  t.X == t.Y
     55   - We typically use the special case
     56     - t.curr.hash == t.good.hash  implies  t.curr == t.good
     57 
     58 - Note: Verifying traces from the paper are essentially replay + verify combined.
     59   - In the paper, the rule is:
     60     - If X.good is consistent with t.hist.deps and t.curr.hash == t.hist.hash,
     61     - Then t.curr.value == t.good.value.
     62   - In our system, we make the following deductions:
     63     - Since X.good is consistent with t.hist.deps, then replay tells us t.good == t.hist.
     64     - Since t.curr.hash == t.hist.hash, then verify tells us t.curr == t.hist.
     65     - Thus, we have that t.good == t.curr.
     66     - In particular, we have t.good.value == t.curr.value.
     67 
     68 Finally, we seed the logical inference rules + data we have,
     69 and then compute as much of {...}.good while performing as little work as possible.
     70 
     71 - t.hist: In general, it's up to the user how much info of historical runs to put in the db.
     72   - It's probably sensible to store the most information about the last time a task run.
     73 - t.curr: These are all user-specified and may not exist if the task is not somehow persisted.
     74 - t.good: These are completely unknown unless an inference rule is used, or the task is run.
     75 
     76 Here's a table for what data we can usually expect for each version / type / info combo.
     77 
     78 | version |  hist (db) |  hist (file) |  curr (db) |  curr (file) |
     79 | ------- | ---------- | ------------ | ---------- | ------------ |
     80 | value   |  maybe     |  usually no  |  yes       |  yes         |
     81 | hash    |  yes       |  yes         |  yes       |  computed    |
     82 | seq     |  yes       |  yes         |  yes       |  no          |
     83 | modtime |  no        |  yes         |  no        |  yes         |
     84 
     85 ----------------------------------------------------------------------------------------------------
     86 
     87 Cyclic dependencies
     88 
     89 The computation graph can be cyclic, not just because users can do whatever they want,
     90 but even in the usual case, for instance:
     91 
     92 - value may depend on hash (verify_hash)
     93 - hash may depend on value (sha256 of the value)
     94 
     95 If the user introduces true dependency cycles, we leave them to fend for themselves,
     96 I doubt there's any sane way to deal with this in our setup.
     97 
     98 If there are no true dependency cycles at the task level,
     99 this leaves only the dependency cycle shown above,
    100 where verify_hash depends on the hash, which usually depends on the value.
    101 
    102 To resolve this, we'll have verify_hash do the following:
    103 
    104 - Skip the verify_hash phase when entered recursively (e.g. by moving onto construct_value).
    105   This is implemented simply by checking/adding to a set of tasks whenever we enter verify_hash.
    106 - When verify_hash awaits the hash, check if the value computation is already completed.
    107   If so, just return since we're already done.
    108 
    109 ----------------------------------------------------------------------------------------------------
    110 
    111 Types of dependencies
    112 
    113 - Get the dependency's value, e.g. a dependency to curl bunny API, then parse the json.
    114   Here, we actually need the value in memory.
    115 
    116 - Rebuild if the dependency's hash changed, e.g. recompiling C object files
    117   Here, we don't actually need the value at all.
    118   During a what-if build (pretend graph is up-to-date except for certain tasks),
    119   we can simply assume the hash from the previous run.
    120 
    121 ----------------------------------------------------------------------------------------------------
    122 
    123 File dependencies vs task dependencies
    124 - When depending on a task
    125 
    126 ----------------------------------------------------------------------------------------------------
    127 
    128 Each task produces a value, a hash, and a timestamp.
    129 
    130 The hash is not necessarily 1-1 with the value, and may have different re-computation rules.
    131 For example, a file reader task produces a file handle, which must be re-created for each call,
    132 even though the hash is on the file contents, which stays the same across calls.
    133 
    134 Hashes and timestamps are used for rebuilders / caching policies.
    135 When the hashes of all inputs to a task are verified to match with a previously stored run,
    136 The output of that run can simply be used.
    137 Note that a run is simply a trace of the inputs used and their hashes, and the output's value and hash.
    138 Hash based caching policies can range between storing 0 runs, storing the latest run, or storing multiple past runs.
    139 
    140 A different caching policy could be to check that all inputs are older than the output, rebuilding otherwise.
    141 Timestamp based caching policies can be used to assert that files have not changed since the latest run,
    142 focusing verification time only on files whose timestamps have changed.
    143 
    144 Sometimes, the value is NOT used, e.g. when compiling C files into an object file,
    145 a dependency on the sources files must be created, but their contents don't actually need to be read.
    146 In such cases, only the source file's hashes are used, and only if caching is used,
    147 first to be able to record a run, and subsequently to verify if the current file's hash matches a previous run.
    148 
    149 ----------------------------------------------------------------------------------------------------
    150 
    151 Each task has a "store policy" for where the output of the task is stored, if at all.
    152 
    153 - Default: Stored in memory, discarded at the end of the build.
    154 - File-backed: Stored on disk, fetching the task gives you a file handle.
    155 - Store-backed: Stored in the Store, fetching the task gives you the contents.
    156 - No-store: Not stored anywhere.
    157 
    158 Each "store policy" corresponds to a output hasher.
    159 
    160 - Default: Output hash = current build id.
    161 - File-backed: Hash is computed from the file.
    162 - Store-backed: Just fetch from the store.
    163 - No-store: Always-different hash.
    164 
    165 ----------------------------------------------------------------------------------------------------
    166 
    167 The hash is useful independent of the value itself due to caching policies.
    168 
    169 The simplest caching policy is "Verifying". Here, we use the dependency information
    170 
    171 ----------------------------------------------------------------------------------------------------
    172 
    173 Metadata collected
    174 
    175 - Dependency information of latest run (not hashes, just whether A fetched B)
    176 - Hashes of the outputs of latest run (individually per task)
    177 
    178 Constructive trace rebuilder
    179 -
	pymake A build system based on Build Systems à la Carte
	git clone https://git.grace.moe/pymake
	Log \| Files \| Refs \| README