pymake

A build system based on Build Systems à la Carte
git clone https://git.grace.moe/pymake

commit 906d33d40421e70fa9d77d2b9d6a7d55d1745698
parent 4e14f0183908587ec8e72692c4a9ac9dac1e04a7
Author: gracefu <81774659+gracefuu@users.noreply.github.com>
Date:   Fri, 18 Apr 2025 07:39:28 +0800

sketch second version

Diffstat:
A make2/THEORY.md   | 179+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A make2/__init__.py | 268+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 447 insertions(+), 0 deletions(-)

diff --git a/make2/THEORY.md b/make2/THEORY.md
@@ -0,0 +1,179 @@
+We model a task as the (possibly independent) computation of a value and a hash:
+
+- Task:
+  - value: LazyFuture[T]
+  - hash: LazyFuture[HashLike]
+
+You should philosophically view the value as information your dependents need to progress;
+e.g. a file-reader task should output a file handle that the dependents can open and read.
+You should philosophically view the hash as a way to convince your dependents
+that they don't need to rerun because they would get the same result anyway.
+I.e. if the hash == some previously recorded hash, the dependent should assume that rerunning
+would cause the same end result as that previous run.
+
+----------------------------------------------------------------------------------------------------
+
+Computing the Task promises
+
+At their core, the value and hash computations should be written by the user,
+but we provide a bunch of helpers that let the user "cheat" the computation,
+primarily by storing traces of previous runs.
+
+There are actually a few versions in existence that we can look at:
+
+- t.hist: Any previously known-good version of t.
+- t.curr: Any version we currently have "access" to; we have no idea whether it's good or not.
+- t.good: The goal is to compute this, the good version we should arrive at post-build.
+
+Each version carries several pieces of information, not all of which we know or can compute:
+
+- t.X.value: The value (or candidate value) of the task.
+- t.X.hash: The hash (or candidate hash) of the task.
+- t.X.deps: The hashes of the deps of the task. (Only applies to historical versions, of course.)
+- t.X.seq: A version number for t.X.value (represented as an infinitely incrementing seqnum).
+  - Seq nums can be reused, as long as X.seq == Y.seq implies X and Y are the same thing.
+- t.X.modtime: If t.X is a file on disk, we can check its modtime.
+
+From here, we can make logical inferences based on the knowledge we have.
+Here is a table of how each piece of info could be calculated.
+
+- replay:
+  - For a known-good version Z in {hist, good},
+    and for any versions X, Y in {hist, curr, good}:
+    - {t.Z.deps}.X == {t.Z.deps}.Y implies t.X == t.Y
+  - Traces from the paper are the special case where Z = hist, X = hist, Y = good:
+    - {t.hist.deps}.hist == {t.hist.deps}.good
+      and also, in particular, we compare the hashes for good vs hist:
+    - {t.hist.deps}.hist.hash == {t.hist.deps}.good.hash
+      to infer that either the historical hash or value is good at the present time:
+    - t.hist.value == t.good.value (constructive traces)
+    - t.hist.hash == t.good.hash (verifying traces)
+
+- verify:
+  - For Z in {hash, seq, modtime}:
+    - t.X.Z == t.Y.Z implies t.X == t.Y
+  - We typically use the special case:
+    - t.curr.hash == t.good.hash implies t.curr == t.good
+
+- Note: Verifying traces from the paper are essentially replay + verify combined.
+  - In the paper, the rule is:
+    - If {t.hist.deps}.good is consistent with t.hist.deps, and t.curr.hash == t.hist.hash,
+    - then t.curr.value == t.good.value.
+  - In our system, we make the following deductions:
+    - Since {t.hist.deps}.good is consistent with t.hist.deps, replay tells us t.good == t.hist.
+    - Since t.curr.hash == t.hist.hash, verify tells us t.curr == t.hist.
+    - Thus, we have that t.good == t.curr.
+    - In particular, we have t.good.value == t.curr.value.
+
+Finally, we seed the logical inference rules with the data we have,
+and then compute as much of {...}.good as possible while performing as little work as possible.
+
+- t.hist: In general, it's up to the user how much info about historical runs to put in the db.
+  - It's probably sensible to store the most information about the last time a task ran.
+- t.curr: These are all user-specified and may not exist if the task is not somehow persisted.
+- t.good: These are completely unknown unless an inference rule is used, or the task is run.
+
+Here's a table of what data we can usually expect for each version / type / info combo.
+
+| info    | hist (db) | hist (file) | curr (db) | curr (file) |
+| ------- | --------- | ----------- | --------- | ----------- |
+| value   | maybe     | usually no  | yes       | yes         |
+| hash    | yes       | yes         | yes       | computed    |
+| seq     | yes       | yes         | yes       | no          |
+| modtime | no        | yes         | no        | yes         |
+
+----------------------------------------------------------------------------------------------------
+
+Cyclic dependencies
+
+The computation graph can be cyclic, not just because users can do whatever they want,
+but even in the usual case. For instance:
+
+- value may depend on hash (verify_hash)
+- hash may depend on value (sha256 of the value)
+
+If the user introduces true dependency cycles, we leave them to fend for themselves;
+I doubt there's any sane way to deal with that in our setup.
+
+If there are no true dependency cycles at the task level,
+this leaves only the dependency cycle shown above,
+where verify_hash depends on the hash, which usually depends on the value.
+
+To resolve this, we'll have verify_hash do the following:
+
+- Skip the verify_hash phase when entered recursively (e.g. by moving on to construct_value).
+  This is implemented simply by checking/adding to a set of tasks whenever we enter verify_hash.
+- When verify_hash awaits the hash, check whether the value computation has already completed.
+  If so, just return, since we're already done.
+
+----------------------------------------------------------------------------------------------------
+
+Types of dependencies
+
+- Get the dependency's value, e.g. a dependency that curls the bunny API, then parses the JSON.
+  Here, we actually need the value in memory.
+
+- Rebuild if the dependency's hash changed, e.g. recompiling C object files.
+  Here, we don't actually need the value at all.
+  During a what-if build (pretend the graph is up to date except for certain tasks),
+  we can simply assume the hash from the previous run.
+
+----------------------------------------------------------------------------------------------------
+
+File dependencies vs task dependencies
+
+- When depending on a task
+
+----------------------------------------------------------------------------------------------------
+
+Each task produces a value, a hash, and a timestamp.
+
+The hash is not necessarily 1-1 with the value, and may have different re-computation rules.
+For example, a file-reader task produces a file handle, which must be re-created for each call,
+even though the hash is of the file contents, which stay the same across calls.
+
+Hashes and timestamps are used for rebuilders / caching policies.
+When the hashes of all inputs to a task are verified to match a previously stored run,
+the output of that run can simply be reused.
+Note that a run is simply a trace of the inputs used and their hashes, plus the output's value and hash.
+Hash-based caching policies can range from storing 0 runs, to storing the latest run, to storing multiple past runs.
+
+A different caching policy could be to check that all inputs are older than the output, rebuilding otherwise.
+Timestamp-based caching policies can be used to assert that files have not changed since the latest run,
+focusing verification time only on files whose timestamps have changed.
+
+Sometimes, the value is NOT used; e.g. when compiling C files into an object file,
+a dependency on the source files must be created, but their contents don't actually need to be read.
+In such cases, only the source files' hashes are used, and only if caching is used:
+first to be able to record a run, and subsequently to verify whether the current file's hash matches a previous run.
+
+----------------------------------------------------------------------------------------------------
+
+Each task has a "store policy" for where the output of the task is stored, if at all.
+
+- Default: Stored in memory, discarded at the end of the build.
+- File-backed: Stored on disk; fetching the task gives you a file handle.
+- Store-backed: Stored in the Store; fetching the task gives you the contents.
+- No-store: Not stored anywhere.
+
+Each "store policy" corresponds to an output hasher.
+
+- Default: Output hash = current build id.
+- File-backed: Hash is computed from the file.
+- Store-backed: Just fetch from the store.
+- No-store: Always-different hash.
+
+----------------------------------------------------------------------------------------------------
+
+The hash is useful independently of the value itself due to caching policies.
+
+The simplest caching policy is "Verifying". Here, we use the dependency information
+and hashes of the latest run to check whether the current output can be reused.
+
+----------------------------------------------------------------------------------------------------
+
+Metadata collected
+
+- Dependency information of the latest run (not hashes, just whether A fetched B)
+- Hashes of the outputs of the latest run (individually per task)
+
+Constructive trace rebuilder
+-
diff --git a/make2/__init__.py b/make2/__init__.py
@@ -0,0 +1,268 @@
+from typing import Self  # _typeshed.Self exists only for type checkers; use the runtime one
+import asyncio
+from asyncio import gather
+from asyncio.futures import Future
+from contextlib import asynccontextmanager
+from contextvars import ContextVar
+from types import CoroutineType
+from typing import (
+    Any,
+    Awaitable,
+    Callable,
+    ClassVar,
+    Generator,
+    Generic,
+    Protocol,
+    TypeAlias,
+    TypeVar,
+    runtime_checkable,
+)
+import dataclasses
+import weakref
+
+# Alias confusing names that clash with the build system
+AIOTask = asyncio.Task
+AIOTaskGroup = asyncio.TaskGroup
+AIOEvent = asyncio.Event
+
+
+CovT = TypeVar("CovT", covariant=True)
+ConT = TypeVar("ConT", contravariant=True)
+
+DbEntry: TypeAlias = object
+
+Factory: TypeAlias = Callable[[], CovT]
+
+
+@runtime_checkable
+class HashLike(Protocol):
+    def __hash__(self) -> int: ...
+    def __eq__(self: Self, other: Self, /) -> bool: ...
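Because HashLike is a `runtime_checkable` Protocol, ordinary hashable values satisfy it structurally via `isinstance`. A quick standalone illustration, re-declaring the protocol so it runs on its own:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class HashLike(Protocol):
    def __hash__(self) -> int: ...
    def __eq__(self, other, /) -> bool: ...

# Any ordinary hashable value structurally satisfies the protocol,
# since isinstance on a runtime_checkable Protocol checks method presence.
print(isinstance("sha256:deadbeef", HashLike))  # True
print(isinstance((1, "a"), HashLike))           # True
```

Note that runtime Protocol checks only verify that the methods exist; they say nothing about signatures or semantics.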
+
+
+@dataclasses.dataclass(frozen=True)
+class TaskKey:
+    rule_key: str
+    arg_keys: tuple["TaskKey | HashLike", ...]
+
+
+@dataclasses.dataclass(slots=True)
+class Box(Generic[CovT]):
+    x: CovT = dataclasses.field()
+
+
+build_tg_: ContextVar[AIOTaskGroup] = ContextVar("Build asyncio.TaskGroup")
+build_seq_: ContextVar[Box[int]] = ContextVar("Build sequence number")
+
+
+def build_tg():
+    return build_tg_.get()
+
+
+def build_seq():
+    seq = build_seq_.get()
+    seq.x += 1
+    return seq.x
+
+
+@asynccontextmanager
+async def BuildContext():
+    async with AIOTaskGroup() as tg:
+        tok = build_tg_.set(tg)
+        seq_tok = build_seq_.set(Box(0))
+        try:
+            yield
+        finally:
+            build_tg_.reset(tok)
+            build_seq_.reset(seq_tok)
+
+
+@dataclasses.dataclass
+class TaskContext:
+    task: "Task" = dataclasses.field()
+
+
+async def noop():
+    pass
+
+
+@dataclasses.dataclass(slots=True)
+class LazyTask(Generic[CovT]):
+    """
+    Lazy AIOTask
+
+    Lazy: The coroutine isn't run or spawned into a task until .start()ed or awaited.
+    Use await directly if you intend to await -- this runs the coroutine in the current task.
+    """
+
+    coro: CoroutineType[Any, None, CovT]
+    # Not started | Starting | Started
+    task: None | Future[CovT] = None
+
+    def start(self) -> Awaitable[CovT]:
+        """Ensures the task is started, and returns an awaitable for its completion."""
+        if self.task is None:
+            self.task = build_tg().create_task(self.coro)
+        return self.task
+
+    def __await__(self) -> Generator[Any, None, CovT]:
+        if self.task is None:
+            # Run the coroutine directly to avoid creating an additional task.
+            self.task = asyncio.get_event_loop().create_future()
+            try:
+                res = yield from self.coro.__await__()
+                self.task.set_result(res)
+                return res
+            except BaseException as exc:
+                self.task.set_exception(exc)
+                raise
+        else:
+            res = yield from self.task.__await__()
+            return res
+
+
+@dataclasses.dataclass(slots=True, frozen=True)
+class LazyFuture(Generic[CovT]):
+    """
+    Future that lazily starts its own computation.
+
+    The handle is run the first time .start() is called or the future is awaited.
+
+    For convenience, LazyFutures are also context managers that intercept exceptions.
+    """
+
+    handle: LazyTask | None = dataclasses.field(default=None)
+    fut: Future[CovT] = dataclasses.field(
+        default_factory=lambda: asyncio.get_event_loop().create_future()
+    )
+
+    none: ClassVar["LazyFuture[None]"]
+
+    def __enter__(self):
+        return self.fut.set_result
+
+    def __exit__(self, exc_type, exc, tb):
+        if exc is not None and not self.fut.done():
+            self.fut.set_exception(exc)
+
+    def start(self):
+        """Ensures the handle is started, if any, then returns the future."""
+        if self.handle:
+            self.handle.start()
+        return self.fut
+
+    def __await__(self) -> Generator[Any, None, CovT]:
+        """Ensures the handle is started, if any, then awaits the future."""
+        if self.handle is not None:
+            yield from self.handle.start().__await__()
+        res = yield from self.fut.__await__()
+        return res
+
+
+LazyFuture.none = LazyFuture()
+LazyFuture.none.fut.set_result(None)
+
+
+@dataclasses.dataclass(frozen=True)
+class Task(Generic[CovT]):
+    """
+    A task is a computation of the task's value and hash.
+
+    Values are returned when other tasks fetch this task.
+
+    Other tasks can also obtain this task's hash, but hashes should be viewed
+    as a way to certify that "a value" is the same as some "other value",
+    typically to assert that the value from a previous run is good.
+    """
+
+    key: TaskKey = dataclasses.field()
+    value: LazyFuture[CovT] = dataclasses.field(repr=False, compare=False)
+    hash: LazyFuture[HashLike] = dataclasses.field(repr=False, compare=False)
+
+
+"""
+Traces record information about a run so work can be saved in the future.
+
+There are 3 variants: dep traces, hash traces, and value traces.
+
+- A dep trace records a task's deps.
+- A hash trace records a task's deps, dep hashes, and *out hash*.
+- A value trace records a task's deps, dep hashes, *out hash, and out value*.
+
+Trace builders are classes that aid in the production of a trace.
+Once complete, a trace builder can be finalized to obtain a true trace.
+
+As such, there are 6 classes:
+
+- DepTrace
+- HashTrace
+- ValueTrace
+- DepTraceBuilder
+- HashTraceBuilder
+- ValueTraceBuilder
+"""
+
+
+@dataclasses.dataclass
+class DepTrace:
+    key: TaskKey = dataclasses.field()
+    # We still record HashLike so HashTrace and ValueTrace can subclass DepTrace.
+    # The hash is simply ignored if we only want to know the dependencies of key.
+    in_hashes: dict[TaskKey, HashLike] = dataclasses.field()
+
+
+@dataclasses.dataclass
+class HashTrace(DepTrace):
+    out_hash: HashLike = dataclasses.field()
+
+
+@dataclasses.dataclass
+class ValueTrace(HashTrace, Generic[CovT]):
+    out_value: CovT = dataclasses.field()
+
+
+@dataclasses.dataclass
+class DepTraceBuilder:
+    key: TaskKey = dataclasses.field()
+    in_hashes: dict[TaskKey, LazyFuture[HashLike]] = dataclasses.field(
+        default_factory=dict
+    )
+
+    async def finalize(self) -> DepTrace:
+        return DepTrace(
+            key=self.key,
+            in_hashes={k: await v for k, v in self.in_hashes.items()},
+        )
+
+
+@dataclasses.dataclass
+class HashTraceBuilder(DepTraceBuilder):
+    out_hash: LazyFuture[HashLike] | None = dataclasses.field(default=None)
+
+    async def finalize(self) -> HashTrace:
+        if self.out_hash is None:
+            raise RuntimeError("Cannot finalize HashTraceBuilder without out_hash")
+        return HashTrace(
+            key=self.key,
+            in_hashes={k: await v for k, v in self.in_hashes.items()},
+            out_hash=await self.out_hash,
+        )
+
+
+@dataclasses.dataclass
+class ValueTraceBuilder(HashTraceBuilder, Generic[CovT]):
+    out_value: LazyFuture[CovT] | None = dataclasses.field(default=None)
+
+    async def finalize(self) -> ValueTrace[CovT]:
+        if self.out_hash is None:
+            raise RuntimeError("Cannot finalize ValueTraceBuilder without out_hash")
+        if self.out_value is None:
+            raise RuntimeError("Cannot finalize ValueTraceBuilder without out_value")
+        return ValueTrace(
+            key=self.key,
+            in_hashes={k: await v for k, v in self.in_hashes.items()},
+            out_hash=await self.out_hash,
+            out_value=await self.out_value,
+        )
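The builder → finalize flow above can be demonstrated with a minimal, self-contained sketch: plain coroutines stand in for LazyFutures, and string keys stand in for TaskKeys, so the snippet runs without the rest of the module:

```python
import asyncio
import dataclasses

@dataclasses.dataclass
class HashTrace:
    key: str
    in_hashes: dict
    out_hash: object

@dataclasses.dataclass
class HashTraceBuilder:
    key: str
    in_hashes: dict = dataclasses.field(default_factory=dict)
    out_hash: object = None

    async def finalize(self) -> HashTrace:
        # Resolve every pending dep hash, then the output hash.
        return HashTrace(
            key=self.key,
            in_hashes={k: await v for k, v in self.in_hashes.items()},
            out_hash=await self.out_hash,
        )

async def const(x):
    # Stand-in for an already-resolved LazyFuture.
    return x

async def demo():
    b = HashTraceBuilder(key="compile:main")
    b.in_hashes["read:main.c"] = const("h-src")  # dep hash arrives lazily
    b.out_hash = const("h-obj")
    return await b.finalize()

trace = asyncio.run(demo())
```

The key design point mirrored here is that a builder accumulates *awaitables* of hashes while the build runs, and only `finalize` forces them, so recording a trace never triggers extra hash work on its own.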