THEORY.md (8792B)
1 We model a task as the (possibly independent) computation of a value, a hash: 2 3 - Task: 4 - value: LazyFuture[T] 5 - hash: LazyFuture[HashLike] 6 7 You should philosophically view the value as information your dependents need to progress, 8 e.g. a file-reader task should output a file handle that the dependents can open and read etc. 9 You should philosophically view the hash as a way to convince your dependents 10 that they don't need to rerun because they would get the same result anyway. 11 I.e. if the hash == some previously recorded hash, the dependent should assume that rerunning would 12 cause the same end result as that previous run. 13 14 ---------------------------------------------------------------------------------------------------- 15 16 Computing the Task promises 17 18 At its core the value and hash computations should be written by the user, 19 but we provide a bunch of helpers to help the user "cheat" the computation, 20 primarily through storing traces of previous runs. 21 22 There are actually a few versions in existence that we can look at. 23 24 - t.hist: Any previously known good version of t. 25 - t.curr: Any version we currently have "access" to, and we have no idea whether it's good or not. 26 - t.good: The goal is to compute this, the good version we should arrive at post-build. 27 28 Each version has information and not all of them we know or can compute. 29 30 - t.X.value: The value (or candidate value) of the task. 31 - t.X.hash: The hash (or candidate hash) of the task. 32 - t.X.deps: The hashes of the deps of the task. (only applies to historical tasks, of course.) 33 - t.X.seq: A version number for t.X.value (represented as an infinitely incrementing seqnum). 34 - seq nums can be reused, as long as if X.seq == Y.seq, then X and Y are the same thing 35 - t.X.modtime: t.X is a file on disk, we can check its modtime. 36 37 From here, we can make logical inferences based on the knowledge we have. 38 Here is a table of how each piece of info could be calculated. 39 40 - replay: 41 - for known good version Z in {hist, good}, 42 for any versions X, Y in {hist, curr, good}, 43 - {t.Z.deps}.X == {t.Z.deps}.Y implies t.X == t.Y 44 - Traces from the paper are the special case where Z = hist, X = hist, Y = good 45 - {t.hist.deps} == {t.hist.deps}.good 46 and also in particular we compare the hashes for good vs hist 47 - {t.hist.deps}.hash == {t.hist.deps}.good.hash 48 to infer that either the historical hash or value is good at the present time 49 - t.hist.value == t.good.value (constructive traces) 50 - t.hist.hash == t.good.hash (verifying traces) 51 52 - verify: 53 - for Z in {hash, seq, modtime}, 54 - t.X.Z == t.Y.Z implies t.X == t.Y 55 - We typically use the special case 56 - t.curr.hash == t.good.hash implies t.curr == t.good 57 58 - Note: Verifying traces from the paper are essentially replay + verify combined. 59 - In the paper, the rule is: 60 - If X.good is consistent with t.hist.deps and t.curr.hash == t.hist.hash, 61 - Then t.curr.value == t.good.value. 62 - In our system, we make the following deductions: 63 - Since X.good is consistent with t.hist.deps, then replay tells us t.good == t.hist. 64 - Since t.curr.hash == t.hist.hash, then verify tells us t.curr == t.hist. 65 - Thus, we have that t.good == t.curr. 66 - In particular, we have t.good.value == t.curr.value. 67 68 Finally, we seed the logical inference rules + data we have, 69 and then compute as much of {...}.good while performing as little work as possible. 70 71 - t.hist: In general, it's up to the user how much info of historical runs to put in the db. 72 - It's probably sensible to store the most information about the last time a task run. 73 - t.curr: These are all user-specified and may not exist if the task is not somehow persisted. 74 - t.good: These are completely unknown unless an inference rule is used, or the task is run. 75 76 Here's a table for what data we can usually expect for each version / type / info combo. 77 78 | version | hist (db) | hist (file) | curr (db) | curr (file) | 79 | ------- | ---------- | ------------ | ---------- | ------------ | 80 | value | maybe | usually no | yes | yes | 81 | hash | yes | yes | yes | computed | 82 | seq | yes | yes | yes | no | 83 | modtime | no | yes | no | yes | 84 85 ---------------------------------------------------------------------------------------------------- 86 87 Cyclic dependencies 88 89 The computation graph can be cyclic, not just because users can do whatever they want, 90 but even in the usual case, for instance: 91 92 - value may depend on hash (verify_hash) 93 - hash may depend on value (sha256 of the value) 94 95 If the user introduces true dependency cycles, we leave them to fend for themselves, 96 I doubt there's any sane way to deal with this in our setup. 97 98 If there are no true dependency cycles at the task level, 99 this leaves only the dependency cycle shown above, 100 where verify_hash depends on the hash, which usually depends on the value. 101 102 To resolve this, we'll have verify_hash do the following: 103 104 - Skip the verify_hash phase when entered recursively (e.g. by moving onto construct_value). 105 This is implemented simply by checking/adding to a set of tasks whenever we enter verify_hash. 106 - When verify_hash awaits the hash, check if the value computation is already completed. 107 If so, just return since we're already done. 108 109 ---------------------------------------------------------------------------------------------------- 110 111 Types of dependencies 112 113 - Get the dependency's value, e.g. a dependency to curl bunny API, then parse the json. 114 Here, we actually need the value in memory. 115 116 - Rebuild if the dependency's hash changed, e.g. recompiling C object files 117 Here, we don't actually need the value at all. 118 During a what-if build (pretend graph is up-to-date except for certain tasks), 119 we can simply assume the hash from the previous run. 120 121 ---------------------------------------------------------------------------------------------------- 122 123 File dependencies vs task dependencies 124 - When depending on a task 125 126 ---------------------------------------------------------------------------------------------------- 127 128 Each task produces a value, a hash, and a timestamp. 129 130 The hash is not necessarily 1-1 with the value, and may have different re-computation rules. 131 For example, a file reader task produces a file handle, which must be re-created for each call, 132 even though the hash is on the file contents, which stays the same across calls. 133 134 Hashes and timestamps are used for rebuilders / caching policies. 135 When the hashes of all inputs to a task are verified to match with a previously stored run, 136 The output of that run can simply be used. 137 Note that a run is simply a trace of the inputs used and their hashes, and the output's value and hash. 138 Hash based caching policies can range between storing 0 runs, storing the latest run, or storing multiple past runs. 139 140 A different caching policy could be to check that all inputs are older than the output, rebuilding otherwise. 141 Timestamp based caching policies can be used to assert that files have not changed since the latest run, 142 focusing verification time only on files whose timestamps have changed. 143 144 Sometimes, the value is NOT used, e.g. when compiling C files into an object file, 145 a dependency on the sources files must be created, but their contents don't actually need to be read. 146 In such cases, only the source file's hashes are used, and only if caching is used, 147 first to be able to record a run, and subsequently to verify if the current file's hash matches a previous run. 148 149 ---------------------------------------------------------------------------------------------------- 150 151 Each task has a "store policy" for where the output of the task is stored, if at all. 152 153 - Default: Stored in memory, discarded at the end of the build. 154 - File-backed: Stored on disk, fetching the task gives you a file handle. 155 - Store-backed: Stored in the Store, fetching the task gives you the contents. 156 - No-store: Not stored anywhere. 157 158 Each "store policy" corresponds to a output hasher. 159 160 - Default: Output hash = current build id. 161 - File-backed: Hash is computed from the file. 162 - Store-backed: Just fetch from the store. 163 - No-store: Always-different hash. 164 165 ---------------------------------------------------------------------------------------------------- 166 167 The hash is useful independent of the value itself due to caching policies. 168 169 The simplest caching policy is "Verifying". Here, we use the dependency information 170 171 ---------------------------------------------------------------------------------------------------- 172 173 Metadata collected 174 175 - Dependency information of latest run (not hashes, just whether A fetched B) 176 - Hashes of the outputs of latest run (individually per task) 177 178 Constructive trace rebuilder 179 -