commit 906d33d40421e70fa9d77d2b9d6a7d55d1745698
parent 4e14f0183908587ec8e72692c4a9ac9dac1e04a7
Author: gracefu <81774659+gracefuu@users.noreply.github.com>
Date: Fri, 18 Apr 2025 07:39:28 +0800
sketch second version
Diffstat:
| A | make2/THEORY.md | | | 179 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
| A | make2/__init__.py | | | 268 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
2 files changed, 447 insertions(+), 0 deletions(-)
diff --git a/make2/THEORY.md b/make2/THEORY.md
@@ -0,0 +1,179 @@
+We model a task as the (possibly independent) computation of a value and a hash:
+
+- Task:
+ - value: LazyFuture[T]
+ - hash: LazyFuture[HashLike]
+
+You should philosophically view the value as information your dependents need to progress,
+e.g. a file-reader task should output a file handle that dependents can open and read.
+You should philosophically view the hash as a way to convince your dependents
+that they don't need to rerun because they would get the same result anyway.
+I.e. if the hash == some previously recorded hash, the dependent should assume that rerunning
+would produce the same end result as that previous run.
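+
+As a minimal, self-contained sketch of this split (the file path and function
+names here are illustrative, not part of the design; the real Task machinery
+lives in make2/__init__.py below):
+
+```python
+import asyncio
+import hashlib
+
+# A file-reader task: the value (the contents here, a file handle in the
+# design above) and the hash (sha256 of the contents) are two separately
+# awaitable computations.
+async def read_value(path: str) -> bytes:
+    with open(path, "rb") as f:
+        return f.read()
+
+async def read_hash(path: str) -> str:
+    return hashlib.sha256(await read_value(path)).hexdigest()
+
+async def main() -> None:
+    # A dependent that only needs the hash never touches the value.
+    print(await read_hash("input.txt"))
+
+asyncio.run(main())
+```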
+
+----------------------------------------------------------------------------------------------------
+
+Computing the Task promises
+
+At its core, the value and hash computations should be written by the user,
+but we provide helpers that let the user "cheat" the computation,
+primarily by storing traces of previous runs.
+
+There are a few versions of a task that we can look at.
+
+- t.hist: Any previously known good version of t.
+- t.curr: Any version we currently have "access" to; we have no idea whether it's good or not.
+- t.good: The good version we should arrive at post-build; computing this is the goal.
+
+Each version carries several pieces of information, not all of which we know or can compute
+(see the sketch after the list below).
+
+- t.X.value: The value (or candidate value) of the task.
+- t.X.hash: The hash (or candidate hash) of the task.
+- t.X.deps: The hashes of the deps of the task. (only applies to historical tasks, of course.)
+- t.X.seq: A version number for t.X.value (represented as an infinitely incrementing seqnum).
+  - seq nums can be reused, as long as X.seq == Y.seq implies that X and Y are the same thing
+- t.X.modtime: If t.X is a file on disk, we can check its modtime.
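+
+To make the shape of this data concrete, here is a sketch (the VersionInfo
+record and its field types are illustrative assumptions, not part of the design):
+
+```python
+import dataclasses
+
+# Illustrative only: what we may know about one version (hist/curr/good) of
+# one task. Every field is optional because, as the table further below shows,
+# most combinations are only sometimes available.
+@dataclasses.dataclass
+class VersionInfo:
+    value: object | None = None          # the (candidate) value
+    hash: str | None = None              # the (candidate) hash
+    deps: dict[str, str] | None = None   # dep key -> dep hash (historical only)
+    seq: int | None = None               # version number for `value`
+    modtime: float | None = None         # only if backed by a file on disk
+```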
+
+From here, we can make logical inferences based on the knowledge we have.
+Here is a table of how each piece of info could be calculated.
+
+- replay:
+ - for known good version Z in {hist, good},
+ for any versions X, Y in {hist, curr, good},
+ - {t.Z.deps}.X == {t.Z.deps}.Y implies t.X == t.Y
+ - Traces from the paper are the special case where Z = hist, X = hist, Y = good
+    - {t.hist.deps}.hist == {t.hist.deps}.good
+      and in particular we compare the hashes for hist vs good
+    - {t.hist.deps}.hist.hash == {t.hist.deps}.good.hash
+ to infer that either the historical hash or value is good at the present time
+ - t.hist.value == t.good.value (constructive traces)
+ - t.hist.hash == t.good.hash (verifying traces)
+
+- verify:
+ - for Z in {hash, seq, modtime},
+ - t.X.Z == t.Y.Z implies t.X == t.Y
+ - We typically use the special case
+ - t.curr.hash == t.good.hash implies t.curr == t.good
+
+- Note: Verifying traces from the paper are essentially replay + verify combined.
+ - In the paper, the rule is:
+    - If the good versions of t's deps are consistent with t.hist.deps, and t.curr.hash == t.hist.hash,
+ - Then t.curr.value == t.good.value.
+  - In our system, we make the following deductions (written out as a sketch after this list):
+    - Since the good versions of t's deps are consistent with t.hist.deps, replay tells us t.good == t.hist.
+    - Since t.curr.hash == t.hist.hash, verify tells us t.curr == t.hist.
+    - Thus, we have that t.good == t.curr.
+    - In particular, t.good.value == t.curr.value.
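+
+As a sketch (reusing the illustrative VersionInfo from above;
+can_reuse_current and dep_good_hashes are hypothetical names):
+
+```python
+# The verifying-trace deduction: replay establishes t.good == t.hist,
+# verify establishes t.curr == t.hist, hence t.good == t.curr.
+def can_reuse_current(
+    hist: VersionInfo,
+    curr: VersionInfo,
+    dep_good_hashes: dict[str, str],
+) -> bool:
+    good_is_hist = hist.deps == dep_good_hashes  # replay
+    curr_is_hist = curr.hash == hist.hash        # verify
+    return good_is_hist and curr_is_hist         # so t.curr.value is good
+```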
+
+Finally, we seed the inference engine with these logical rules + the data we have,
+and then compute as much of {...}.good as possible while performing as little work as possible.
+
+- t.hist: In general, it's up to the user how much info of historical runs to put in the db.
+  - It's probably sensible to store the most information about the last time a task ran.
+- t.curr: These are all user-specified and may not exist if the task is not somehow persisted.
+- t.good: These are completely unknown unless an inference rule is used, or the task is run.
+
+Here's a table for what data we can usually expect for each version / type / info combo.
+
+| info    | hist (db)  | hist (file)  | curr (db)  | curr (file)  |
+| ------- | ---------- | ------------ | ---------- | ------------ |
+| value | maybe | usually no | yes | yes |
+| hash | yes | yes | yes | computed |
+| seq | yes | yes | yes | no |
+| modtime | no | yes | no | yes |
+
+----------------------------------------------------------------------------------------------------
+
+Cyclic dependencies
+
+The computation graph can be cyclic, not just because users can do whatever they want,
+but also in perfectly ordinary cases, for instance:
+
+- value may depend on hash (verify_hash)
+- hash may depend on value (sha256 of the value)
+
+If the user introduces true dependency cycles, we leave them to fend for themselves;
+I doubt there's any sane way to deal with this in our setup.
+
+If there are no true dependency cycles at the task level,
+this leaves only the dependency cycle shown above,
+where verify_hash depends on the hash, which usually depends on the value.
+
+To resolve this, we'll have verify_hash do the following:
+
+- Skip the verify_hash phase when entered recursively (e.g. by moving on to construct_value).
+  This is implemented simply by checking/adding to a set of tasks whenever we enter verify_hash
+  (see the sketch after this list).
+- When verify_hash awaits the hash, check if the value computation is already completed.
+  If so, just return, since we're already done.
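+
+A minimal sketch of the guard (verify_hash, construct_value, and the in_verify
+set are assumptions about the eventual implementation, not its real API):
+
+```python
+import asyncio
+
+in_verify: set[str] = set()     # tasks currently inside verify_hash
+values: dict[str, bytes] = {}   # completed value computations
+
+async def construct_value(key: str) -> bytes:
+    # Constructing a value may re-enter verify_hash for the same task
+    # (value -> hash -> verify_hash); the guard below breaks that cycle.
+    values.setdefault(key, f"contents of {key}".encode())
+    return values[key]
+
+async def verify_hash(key: str) -> None:
+    if key in in_verify:
+        # Entered recursively: skip verification, move on to the value.
+        await construct_value(key)
+        return
+    in_verify.add(key)
+    try:
+        # The hash usually depends on the value, which may loop back here.
+        digest = hash(await construct_value(key))
+        if key in values:
+            # The value computation completed while we awaited the hash,
+            # so there is nothing left to verify.
+            return
+        print(f"{key}: compare {digest} against a recorded hash")
+    finally:
+        in_verify.discard(key)
+
+asyncio.run(verify_hash("demo-task"))
+```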
+
+----------------------------------------------------------------------------------------------------
+
+Types of dependencies
+
+- Get the dependency's value, e.g. a dependency that curls the bunny API, then parses the JSON.
+  Here, we actually need the value in memory.
+
+- Rebuild if the dependency's hash changed, e.g. recompiling C object files.
+  Here, we don't actually need the value at all.
+  During a what-if build (pretend the graph is up-to-date except for certain tasks),
+  we can simply assume the hash from the previous run (see the sketch after this list).
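+
+A sketch of the two flavors (the function names and the API URL are hypothetical):
+
+```python
+import hashlib
+import json
+import urllib.request
+
+async def fetch_value(url: str) -> dict:
+    # Value dependency: the dependent genuinely needs the bytes in memory,
+    # e.g. to parse the JSON returned by the bunny API.
+    with urllib.request.urlopen(url) as resp:
+        return json.load(resp)
+
+async def dep_hash(path: str) -> str:
+    # Hash-only dependency: the dependent only compares this against a
+    # recorded hash; during a what-if build we could return the recorded
+    # hash from the previous run instead of touching the file.
+    with open(path, "rb") as f:
+        return hashlib.sha256(f.read()).hexdigest()
+```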
+
+----------------------------------------------------------------------------------------------------
+
+File dependencies vs task dependencies
+- When depending on a task
+
+----------------------------------------------------------------------------------------------------
+
+Each task produces a value, a hash, and a timestamp.
+
+The hash is not necessarily 1-1 with the value, and may have different re-computation rules.
+For example, a file reader task produces a file handle, which must be re-created for each call,
+even though the hash is on the file contents, which stays the same across calls.
+
+Hashes and timestamps are used for rebuilders / caching policies.
+When the hashes of all inputs to a task are verified to match a previously stored run,
+the output of that run can simply be used.
+Note that a run is simply a trace: the inputs used and their hashes, plus the output's value and hash.
+Hash-based caching policies can range from storing 0 runs, to storing the latest run, to storing multiple past runs.
+
+A different caching policy could be to check that all inputs are older than the output, rebuilding otherwise.
+Timestamp-based caching policies can be used to assert that files have not changed since the latest run,
+focusing verification time only on files whose timestamps have changed.
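+
+A sketch of such a timestamp policy (up_to_date is a hypothetical helper):
+
+```python
+import os
+
+def up_to_date(output: str, inputs: list[str]) -> bool:
+    # Rebuild unless the output exists and every input is older than it.
+    if not os.path.exists(output):
+        return False
+    out_mtime = os.path.getmtime(output)
+    return all(os.path.getmtime(p) < out_mtime for p in inputs)
+```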
+
+Sometimes, the value is NOT used: e.g. when compiling C files into an object file,
+a dependency on the source files must be created, but their contents don't actually need to be read.
+In such cases, only the source files' hashes are used, and only if caching is used:
+first to be able to record a run, and subsequently to verify whether the current file hashes match a previous run.
+
+----------------------------------------------------------------------------------------------------
+
+Each task has a "store policy" for where the output of the task is stored, if at all.
+
+- Default: Stored in memory, discarded at the end of the build.
+- File-backed: Stored on disk, fetching the task gives you a file handle.
+- Store-backed: Stored in the Store, fetching the task gives you the contents.
+- No-store: Not stored anywhere.
+
+Each "store policy" corresponds to a output hasher.
+
+- Default: Output hash = current build id.
+- File-backed: Hash is computed from the file.
+- Store-backed: Just fetch from the store.
+- No-store: Always-different hash.
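+
+A sketch of the correspondence (StorePolicy, BUILD_ID, and output_hash are
+illustrative assumptions, not the module's API):
+
+```python
+import enum
+import hashlib
+import uuid
+
+class StorePolicy(enum.Enum):
+    DEFAULT = "default"     # in memory, discarded after the build
+    FILE_BACKED = "file"    # on disk; fetching yields a file handle
+    STORE_BACKED = "store"  # in the Store; fetching yields the contents
+    NO_STORE = "none"       # not stored anywhere
+
+BUILD_ID = uuid.uuid4().hex  # one id per build
+
+def output_hash(policy: StorePolicy, payload: bytes = b"") -> str:
+    if policy is StorePolicy.DEFAULT:
+        return BUILD_ID  # only valid within the current build
+    if policy is StorePolicy.FILE_BACKED:
+        return hashlib.sha256(payload).hexdigest()  # computed from the file
+    if policy is StorePolicy.STORE_BACKED:
+        return hashlib.sha256(payload).hexdigest()  # in practice, fetched from the store
+    return uuid.uuid4().hex  # no-store: always different
+```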
+
+----------------------------------------------------------------------------------------------------
+
+The hash is useful independent of the value itself due to caching policies.
+
+The simplest caching policy is "Verifying". Here, we use the dependency information
+and output hashes of the latest run to check whether rerunning a task would produce
+the same result, in which case the current output can be reused.
+----------------------------------------------------------------------------------------------------
+
+Metadata collected
+
+- Dependency information of latest run (not hashes, just whether A fetched B)
+- Hashes of the outputs of latest run (individually per task)
+
+Constructive trace rebuilder
+
+- Additionally stores the output values of past runs, so a matching task's output
+  can be restored directly instead of rerun (t.hist.value == t.good.value).
diff --git a/make2/__init__.py b/make2/__init__.py
@@ -0,0 +1,268 @@
+import asyncio
+from asyncio import gather
+from asyncio.futures import Future
+from contextlib import asynccontextmanager
+from contextvars import ContextVar
+from types import CoroutineType
+from typing import (
+    Any,
+    Awaitable,
+    Callable,
+    ClassVar,
+    Generator,
+    Generic,
+    Protocol,
+    Self,
+    TypeAlias,
+    TypeVar,
+    runtime_checkable,
+)
+import dataclasses
+import weakref
+
+# Alias confusing names that clash with the build system
+AIOTask = asyncio.Task
+AIOTaskGroup = asyncio.TaskGroup
+AIOEvent = asyncio.Event
+
+
+CovT = TypeVar("CovT", covariant=True)
+ConT = TypeVar("ConT", contravariant=True)
+
+DbEntry: TypeAlias = object
+
+Factory: TypeAlias = Callable[[], CovT]
+
+
+@runtime_checkable
+class HashLike(Protocol):
+ def __hash__(self) -> int: ...
+ def __eq__(self: Self, other: Self, /) -> bool: ...
+
+
+@dataclasses.dataclass(frozen=True)
+class TaskKey:
+ rule_key: str
+    # Variable-length tuple: the bare form tuple["TaskKey | HashLike"] would mean a 1-tuple.
+    arg_keys: tuple["TaskKey | HashLike", ...]
+
+
+@dataclasses.dataclass(slots=True)
+class Box(Generic[CovT]):
+ x: CovT = dataclasses.field()
+
+
+build_tg_: ContextVar[AIOTaskGroup] = ContextVar("Build asyncio.TaskGroup")
+build_seq_: ContextVar[Box[int]] = ContextVar("Build sequence number")
+
+
+def build_tg():
+ return build_tg_.get()
+
+
+def build_seq():
+ seq = build_seq_.get()
+ seq.x += 1
+ return seq.x
+
+
+@asynccontextmanager
+async def BuildContext():
+ async with AIOTaskGroup() as tg:
+        tg_tok = build_tg_.set(tg)
+        seq_tok = build_seq_.set(Box(0))
+        try:
+            yield
+        finally:
+            build_tg_.reset(tg_tok)
+            build_seq_.reset(seq_tok)
+
+
+@dataclasses.dataclass
+class TaskContext:
+ task: "Task" = dataclasses.field()
+
+
+async def noop():
+ pass
+
+
+@dataclasses.dataclass(slots=True)
+class LazyTask(Generic[CovT]):
+ """
+ Lazy AIOTask
+
+ Lazy: The coroutine isn't run or spawned into a task until .start()ed or awaited.
+ Use await directly if you intend to await -- this runs the coroutine in the current task.
+ """
+
+ coro: CoroutineType[Any, None, CovT]
+ # Not started | Starting | Started
+ task: None | Future[CovT] = None
+
+ def start(self) -> Awaitable[CovT]:
+ """Ensures task is started, and returns a coroutine for its completion."""
+ if self.task is None:
+ self.task = build_tg().create_task(self.coro)
+ return self.task
+
+ def __await__(self) -> Generator[Any, None, CovT]:
+ if self.task is None:
+ # Run the coroutine directly to avoid creating an additional task.
+            # __await__ only runs with a loop already running, so this is safe.
+            self.task = asyncio.get_running_loop().create_future()
+ try:
+ res = yield from self.coro.__await__()
+ self.task.set_result(res)
+ return res
+ except BaseException as exc:
+ self.task.set_exception(exc)
+ raise
+ else:
+ res = yield from self.task.__await__()
+ return res
+
+
+@dataclasses.dataclass(slots=True, frozen=True)
+class LazyFuture(Generic[CovT]):
+ """
+ Future that lazily starts its own computation.
+
+ The handle is run the first time .start() is called or the future is awaited.
+
+    For convenience, LazyFutures are also context managers that forward exceptions into the future.
+ """
+
+ handle: LazyTask | None = dataclasses.field(default=None)
+ fut: Future[CovT] = dataclasses.field(
+ default_factory=lambda: asyncio.get_event_loop().create_future()
+ )
+
+ none: ClassVar["LazyFuture[None]"]
+
+ def __enter__(self):
+ return self.fut.set_result
+
+ def __exit__(self, exc_type, exc, tb):
+        # Only forward a real exception; exc is None on a clean exit.
+        if exc is not None and not self.fut.done():
+            self.fut.set_exception(exc)
+
+ def start(self):
+ """Ensures handle is started if any, then returns the future."""
+ if self.handle:
+ self.handle.start()
+ return self.fut
+
+ def __await__(self) -> Generator[Any, None, CovT]:
+ """Ensures handle is started if any, then awaits the future."""
+ if self.handle is not None:
+ yield from self.handle.start().__await__()
+ res = yield from self.fut.__await__()
+ return res
+
+
+LazyFuture.none = LazyFuture()
+LazyFuture.none.fut.set_result(None)
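+
+# Usage sketch (illustrative only; `produce` and `demo` are hypothetical):
+#
+#     async def produce() -> int:
+#         return 42
+#
+#     async def demo() -> None:
+#         async with BuildContext():
+#             lt = LazyTask(produce())   # nothing runs yet
+#             lt.start()                 # spawned into the build TaskGroup
+#             print(await lt)            # -> 42
+#
+#             lf: LazyFuture[str] = LazyFuture()
+#             with lf as set_result:     # __enter__ returns fut.set_result
+#                 set_result("done")
+#             print(await lf)            # -> "done"
+#
+#     asyncio.run(demo())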
+
+
+@dataclasses.dataclass(frozen=True)
+class Task(Generic[CovT]):
+ """
+ A task is a computation of the task's value and hash.
+
+ Values are returned when other tasks fetch this task.
+
+ Other tasks can also obtain this task's hash, but hashes should be viewed
+ as a way to certify that "a value" is the same as some "other value",
+ typically to assert that the value from a previous run is good.
+ """
+
+ key: TaskKey = dataclasses.field()
+ value: LazyFuture[CovT] = dataclasses.field(repr=False, compare=False)
+ hash: LazyFuture[HashLike] = dataclasses.field(repr=False, compare=False)
+
+
+"""
+Traces record information about a run so that work can be saved in the future.
+
+There are 3 variants: dep traces, hash traces, and value traces.
+
+- A dep trace records a task's deps.
+- A hash trace records a task's deps, dep hashes, and *out hash*.
+- A value trace records a task's deps, dep hashes, *out hash, and out value*.
+
+Trace builders are classes that aid in the production of a trace.
+Once complete, a trace builder can be finalized to obtain a true trace.
+
+As such, there are 6 classes:
+
+- DepTrace
+- HashTrace
+- ValueTrace
+- DepTraceBuilder
+- HashTraceBuilder
+- ValueTraceBuilder
+"""
+
+
+@dataclasses.dataclass
+class DepTrace:
+ key: TaskKey = dataclasses.field()
+ # We still record HashLike so HashTrace and ValueTrace can subclass DepTrace.
+ # The hash is simply ignored if we only want to know the dependencies of key.
+ in_hashes: dict[TaskKey, HashLike] = dataclasses.field()
+
+
+@dataclasses.dataclass
+class HashTrace(DepTrace):
+ out_hash: HashLike = dataclasses.field()
+
+
+@dataclasses.dataclass
+class ValueTrace(HashTrace, Generic[CovT]):
+ out_value: CovT = dataclasses.field()
+
+
+@dataclasses.dataclass
+class DepTraceBuilder:
+ key: TaskKey = dataclasses.field()
+ in_hashes: dict[TaskKey, LazyFuture[HashLike]] = dataclasses.field(
+ default_factory=dict
+ )
+
+ async def finalize(self) -> DepTrace:
+ return DepTrace(
+ key=self.key,
+ in_hashes={k: await v for k, v in self.in_hashes.items()},
+ )
+
+
+@dataclasses.dataclass
+class HashTraceBuilder(DepTraceBuilder):
+ out_hash: LazyFuture[HashLike] | None = dataclasses.field(default=None)
+
+ async def finalize(self) -> HashTrace:
+ if self.out_hash is None:
+ raise RuntimeError("Cannot finalize HashTraceBuilder without out_hash")
+ return HashTrace(
+ key=self.key,
+ in_hashes={k: await v for k, v in self.in_hashes.items()},
+ out_hash=await self.out_hash,
+ )
+
+
+@dataclasses.dataclass
+class ValueTraceBuilder(HashTraceBuilder, Generic[CovT]):
+ out_value: LazyFuture[CovT] | None = dataclasses.field(default=None)
+
+ async def finalize(self) -> ValueTrace[CovT]:
+ if self.out_hash is None:
+ raise RuntimeError("Cannot finalize ValueTraceBuilder without out_hash")
+ if self.out_value is None:
+ raise RuntimeError("Cannot finalize ValueTraceBuilder without out_value")
+ return ValueTrace(
+ key=self.key,
+ in_hashes={k: await v for k, v in self.in_hashes.items()},
+ out_hash=await self.out_hash,
+ out_value=await self.out_value,
+ )
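+
+
+# Illustrative sketch (hypothetical keys; LazyFuture.none stands in for real
+# hash/value futures): recording a run with ValueTraceBuilder and finalizing
+# it into a ValueTrace.
+if __name__ == "__main__":
+
+    async def _record_demo() -> None:
+        dep_key = TaskKey(rule_key="read_file", arg_keys=("input.txt",))
+        out_key = TaskKey(rule_key="compile", arg_keys=(dep_key,))
+
+        builder = ValueTraceBuilder(key=out_key)
+        builder.in_hashes[dep_key] = LazyFuture.none  # dep's hash future
+        builder.out_hash = LazyFuture.none            # output hash future
+        builder.out_value = LazyFuture.none           # output value future
+        print(await builder.finalize())
+
+    asyncio.run(_record_demo())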