commit 906d33d40421e70fa9d77d2b9d6a7d55d1745698
parent 4e14f0183908587ec8e72692c4a9ac9dac1e04a7
Author: gracefu <81774659+gracefuu@users.noreply.github.com>
Date: Fri, 18 Apr 2025 07:39:28 +0800
sketch second version
Diffstat:
| A | make2/THEORY.md | | | 179 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
| A | make2/__init__.py | | | 268 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
2 files changed, 447 insertions(+), 0 deletions(-)
diff --git a/make2/THEORY.md b/make2/THEORY.md
@@ -0,0 +1,179 @@
+We model a task as the (possibly independent) computation of a value and a hash:
+
+- Task:
+ - value: LazyFuture[T]
+ - hash: LazyFuture[HashLike]
+
+You should philosophically view the value as information your dependents need to progress,
+e.g. a file-reader task should output a file handle that dependents can open and read.
+You should philosophically view the hash as a way to convince your dependents
+that they don't need to rerun because they would get the same result anyway.
+I.e. if the hash == some previously recorded hash, the dependent should assume that rerunning
+would produce the same end result as that previous run.
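+
+As a minimal, self-contained sketch of this split (the file path and function
+names here are illustrative, not part of the design; the real Task machinery
+lives in make2/__init__.py below):
+
+```python
+import asyncio
+import hashlib
+
+# A file-reader task: the value (the contents here, a file handle in the
+# design above) and the hash (sha256 of the contents) are two separately
+# awaitable computations.
+async def read_value(path: str) -> bytes:
+    with open(path, "rb") as f:
+        return f.read()
+
+async def read_hash(path: str) -> str:
+    return hashlib.sha256(await read_value(path)).hexdigest()
+
+async def main() -> None:
+    # A dependent that only needs the hash never touches the value.
+    print(await read_hash("input.txt"))
+
+asyncio.run(main())
+```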
+
+----------------------------------------------------------------------------------------------------
+
+Computing the Task promises
+
+At its core, the value and hash computations should be written by the user,
+but we provide helpers that let the user "cheat" the computation,
+primarily by storing traces of previous runs.
+
+There are a few versions of a task that we can look at.
+
+- t.hist: Any previously known good version of t.
+- t.curr: Any version we currently have "access" to; we have no idea whether it's good or not.
+- t.good: The good version we should arrive at post-build; computing this is the goal.
+
+Each version carries several pieces of information, not all of which we know or can compute
+(see the sketch after the list below).
+
+- t.X.value: The value (or candidate value) of the task.
+- t.X.hash: The hash (or candidate hash) of the task.
+- t.X.deps: The hashes of the deps of the task. (only applies to historical tasks, of course.)
+- t.X.seq: A version number for t.X.value (represented as an infinitely incrementing seqnum).
+  - seq nums can be reused, as long as X.seq == Y.seq implies that X and Y are the same thing
+- t.X.modtime: If t.X is a file on disk, we can check its modtime.
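+
+To make the shape of this data concrete, here is a sketch (the VersionInfo
+record and its field types are illustrative assumptions, not part of the design):
+
+```python
+import dataclasses
+
+# Illustrative only: what we may know about one version (hist/curr/good) of
+# one task. Every field is optional because, as the table further below shows,
+# most combinations are only sometimes available.
+@dataclasses.dataclass
+class VersionInfo:
+    value: object | None = None          # the (candidate) value
+    hash: str | None = None              # the (candidate) hash
+    deps: dict[str, str] | None = None   # dep key -> dep hash (historical only)
+    seq: int | None = None               # version number for `value`
+    modtime: float | None = None         # only if backed by a file on disk
+```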
+
+From here, we can make logical inferences based on the knowledge we have.
+Here is a table of how each piece of info could be calculated.
+
+- replay:
+ - for known good version Z in {hist, good},
+ for any versions X, Y in {hist, curr, good},
+ - {t.Z.deps}.X == {t.Z.deps}.Y implies t.X == t.Y
+ - Traces from the paper are the special case where Z = hist, X = hist, Y = good
+    - {t.hist.deps}.hist == {t.hist.deps}.good
+      and in particular we compare the hashes for hist vs good
+    - {t.hist.deps}.hist.hash == {t.hist.deps}.good.hash
+ to infer that either the historical hash or value is good at the present time
+ - t.hist.value == t.good.value (constructive traces)
+ - t.hist.hash == t.good.hash (verifying traces)
+
+- verify:
+ - for Z in {hash, seq, modtime},
+ - t.X.Z == t.Y.Z implies t.X == t.Y
+ - We typically use the special case
+ - t.curr.hash == t.good.hash implies t.curr == t.good
+
+- Note: Verifying traces from the paper are essentially replay + verify combined.
+ - In the paper, the rule is:
+    - If the good versions of t's deps are consistent with t.hist.deps, and t.curr.hash == t.hist.hash,
+ - Then t.curr.value == t.good.value.
+  - In our system, we make the following deductions (written out as a sketch after this list):
+    - Since the good versions of t's deps are consistent with t.hist.deps, replay tells us t.good == t.hist.
+    - Since t.curr.hash == t.hist.hash, verify tells us t.curr == t.hist.
+    - Thus, we have that t.good == t.curr.
+    - In particular, t.good.value == t.curr.value.
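+
+As a sketch (reusing the illustrative VersionInfo from above;
+can_reuse_current and dep_good_hashes are hypothetical names):
+
+```python
+# The verifying-trace deduction: replay establishes t.good == t.hist,
+# verify establishes t.curr == t.hist, hence t.good == t.curr.
+def can_reuse_current(
+    hist: VersionInfo,
+    curr: VersionInfo,
+    dep_good_hashes: dict[str, str],
+) -> bool:
+    good_is_hist = hist.deps == dep_good_hashes  # replay
+    curr_is_hist = curr.hash == hist.hash        # verify
+    return good_is_hist and curr_is_hist         # so t.curr.value is good
+```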
+
+Finally, we seed the inference engine with these logical rules + the data we have,
+and then compute as much of {...}.good as possible while performing as little work as possible.
+
+- t.hist: In general, it's up to the user how much info of historical runs to put in the db.
+  - It's probably sensible to store the most information about the last time a task ran.
+- t.curr: These are all user-specified and may not exist if the task is not somehow persisted.
+- t.good: These are completely unknown unless an inference rule is used, or the task is run.
+
+Here's a table for what data we can usually expect for each version / type / info combo.
+
+| info    | hist (db)  | hist (file)  | curr (db)  | curr (file)  |
+| ------- | ---------- | ------------ | ---------- | ------------ |
+| value | maybe | usually no | yes | yes |
+| hash | yes | yes | yes | computed |
+| seq | yes | yes | yes | no |
+| modtime | no | yes | no | yes |
+
+----------------------------------------------------------------------------------------------------
+
+Cyclic dependencies
+
+The computation graph can be cyclic, not just because users can do whatever they want,
+but also in perfectly ordinary cases, for instance:
+
+- value may depend on hash (verify_hash)
+- hash may depend on value (sha256 of the value)
+
+If the user introduces true dependency cycles, we leave them to fend for themselves;
+I doubt there's any sane way to deal with this in our setup.
+
+If there are no true dependency cycles at the task level,
+this leaves only the dependency cycle shown above,
+where verify_hash depends on the hash, which usually depends on the value.
+
+To resolve this, we'll have verify_hash do the following:
+
+- Skip the verify_hash phase when entered recursively (e.g. by moving on to construct_value).
+  This is implemented simply by checking/adding to a set of tasks whenever we enter verify_hash
+  (see the sketch after this list).
+- When verify_hash awaits the hash, check if the value computation is already completed.
+  If so, just return, since we're already done.
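+
+A minimal sketch of the guard (verify_hash, construct_value, and the in_verify
+set are assumptions about the eventual implementation, not its real API):
+
+```python
+import asyncio
+
+in_verify: set[str] = set()     # tasks currently inside verify_hash
+values: dict[str, bytes] = {}   # completed value computations
+
+async def construct_value(key: str) -> bytes:
+    # Constructing a value may re-enter verify_hash for the same task
+    # (value -> hash -> verify_hash); the guard below breaks that cycle.
+    values.setdefault(key, f"contents of {key}".encode())
+    return values[key]
+
+async def verify_hash(key: str) -> None:
+    if key in in_verify:
+        # Entered recursively: skip verification, move on to the value.
+        await construct_value(key)
+        return
+    in_verify.add(key)
+    try:
+        # The hash usually depends on the value, which may loop back here.
+        digest = hash(await construct_value(key))
+        if key in values:
+            # The value computation completed while we awaited the hash,
+            # so there is nothing left to verify.
+            return
+        print(f"{key}: compare {digest} against a recorded hash")
+    finally:
+        in_verify.discard(key)
+
+asyncio.run(verify_hash("demo-task"))
+```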
+
+----------------------------------------------------------------------------------------------------
+
+Types of dependencies
+
+- Get the dependency's value, e.g. a dependency that curls the bunny API, then parses the JSON.
+  Here, we actually need the value in memory.
+
+- Rebuild if the dependency's hash changed, e.g. recompiling C object files.
+  Here, we don't actually need the value at all.
+  During a what-if build (pretend the graph is up-to-date except for certain tasks),
+  we can simply assume the hash from the previous run (see the sketch after this list).
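+
+A sketch of the two flavors (the function names and the API URL are hypothetical):
+
+```python
+import hashlib
+import json
+import urllib.request
+
+async def fetch_value(url: str) -> dict:
+    # Value dependency: the dependent genuinely needs the bytes in memory,
+    # e.g. to parse the JSON returned by the bunny API.
+    with urllib.request.urlopen(url) as resp:
+        return json.load(resp)
+
+async def dep_hash(path: str) -> str:
+    # Hash-only dependency: the dependent only compares this against a
+    # recorded hash; during a what-if build we could return the recorded
+    # hash from the previous run instead of touching the file.
+    with open(path, "rb") as f:
+        return hashlib.sha256(f.read()).hexdigest()
+```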
+
+----------------------------------------------------------------------------------------------------
+
+File dependencies vs task dependencies
+- When depending on a task
+
+----------------------------------------------------------------------------------------------------
+
+Each task produces a value, a hash, and a timestamp.
+
+The hash is not necessarily 1-1 with the value, and may have different re-computation rules.
+For example, a file reader task produces a file handle, which must be re-created for each call,
+even though the hash is on the file contents, which stays the same across calls.
+
+Hashes and timestamps are used for rebuilders / caching policies.
+When the hashes of all inputs to a task are verified to match a previously stored run,
+the output of that run can simply be used.
+Note that a run is simply a trace: the inputs used and their hashes, plus the output's value and hash.
+Hash-based caching policies can range from storing 0 runs, to storing the latest run, to storing multiple past runs.
+
+A different caching policy could be to check that all inputs are older than the output, rebuilding otherwise.
+Timestamp-based caching policies can be used to assert that files have not changed since the latest run,
+focusing verification time only on files whose timestamps have changed.
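+
+A sketch of such a timestamp policy (up_to_date is a hypothetical helper):
+
+```python
+import os
+
+def up_to_date(output: str, inputs: list[str]) -> bool:
+    # Rebuild unless the output exists and every input is older than it.
+    if not os.path.exists(output):
+        return False
+    out_mtime = os.path.getmtime(output)
+    return all(os.path.getmtime(p) < out_mtime for p in inputs)
+```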
+
+Sometimes, the value is NOT used: e.g. when compiling C files into an object file,
+a dependency on the source files must be created, but their contents don't actually need to be read.
+In such cases, only the source files' hashes are used, and only if caching is used:
+first to be able to record a run, and subsequently to verify whether the current file hashes match a previous run.
+
+----------------------------------------------------------------------------------------------------
+
+Each task has a "store policy" for where the output of the task is stored, if at all.
+
+- Default: Stored in memory, discarded at the end of the build.
+- File-backed: Stored on disk, fetching the task gives you a file handle.
+- Store-backed: Stored in the Store, fetching the task gives you the contents.
+- No-store: Not stored anywhere.
+
+Each "store policy" corresponds to a output hasher.
+
+- Default: Output hash = current build id.
+- File-backed: Hash is computed from the file.
+- Store-backed: Just fetch from the store.
+- No-store: Always-different hash.
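+
+A sketch of the correspondence (StorePolicy, BUILD_ID, and output_hash are
+illustrative assumptions, not the module's API):
+
+```python
+import enum
+import hashlib
+import uuid
+
+class StorePolicy(enum.Enum):
+    DEFAULT = "default"     # in memory, discarded after the build
+    FILE_BACKED = "file"    # on disk; fetching yields a file handle
+    STORE_BACKED = "store"  # in the Store; fetching yields the contents
+    NO_STORE = "none"       # not stored anywhere
+
+BUILD_ID = uuid.uuid4().hex  # one id per build
+
+def output_hash(policy: StorePolicy, payload: bytes = b"") -> str:
+    if policy is StorePolicy.DEFAULT:
+        return BUILD_ID  # only valid within the current build
+    if policy is StorePolicy.FILE_BACKED:
+        return hashlib.sha256(payload).hexdigest()  # computed from the file
+    if policy is StorePolicy.STORE_BACKED:
+        return hashlib.sha256(payload).hexdigest()  # in practice, fetched from the store
+    return uuid.uuid4().hex  # no-store: always different
+```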
+
+----------------------------------------------------------------------------------------------------
+
+The hash is useful independent of the value itself due to caching policies.
+
+The simplest caching policy is "Verifying". Here, we use the dependency information
+and output hashes of the latest run to check whether rerunning a task would produce
+the same result, in which case the current output can be reused.
+----------------------------------------------------------------------------------------------------
+
+Metadata collected
+
+- Dependency information of latest run (not hashes, just whether A fetched B)
+- Hashes of the outputs of latest run (individually per task)
+
+Constructive trace rebuilder
+
+- Additionally stores the output values of past runs, so a matching task's output
+  can be restored directly instead of rerun (t.hist.value == t.good.value).
diff --git a/make2/__init__.py b/make2/__init__.py
@@ -0,0 +1,268 @@
+import asyncio
+from asyncio import gather
+from asyncio.futures import Future
+from contextlib import asynccontextmanager
+from contextvars import ContextVar
+from types import CoroutineType
+from typing import (
+    Any,
+    Awaitable,
+    Callable,
+    ClassVar,
+    Generator,
+    Generic,
+    Protocol,
+    Self,
+    TypeAlias,
+    TypeVar,
+    runtime_checkable,
+)
+import dataclasses
+import weakref
+
+# Alias confusing names that clash with the build system
+AIOTask = asyncio.Task
+AIOTaskGroup = asyncio.TaskGroup
+AIOEvent = asyncio.Event
+
+
+CovT = TypeVar("CovT", covariant=True)
+ConT = TypeVar("ConT", contravariant=True)
+
+DbEntry: TypeAlias = object
+
+Factory: TypeAlias = Callable[[], CovT]
+
+
+@runtime_checkable
+class HashLike(Protocol):
+ def __hash__(self) -> int: ...
+ def __eq__(self: Self, other: Self, /) -> bool: ...
+
+
+@dataclasses.dataclass(frozen=True)
+class TaskKey:
+ rule_key: str
+    # Variable-length tuple: the bare form tuple["TaskKey | HashLike"] would mean a 1-tuple.
+    arg_keys: tuple["TaskKey | HashLike", ...]
+
+
+@dataclasses.dataclass(slots=True)
+class Box(Generic[CovT]):
+ x: CovT = dataclasses.field()
+
+
+build_tg_: ContextVar[AIOTaskGroup] = ContextVar("Build asyncio.TaskGroup")
+build_seq_: ContextVar[Box[int]] = ContextVar("Build sequence number")
+
+
+def build_tg():
+ return build_tg_.get()
+
+
+def build_seq():
+ seq = build_seq_.get()
+ seq.x += 1
+ return seq.x
+
+
+@asynccontextmanager
+async def BuildContext():
+ async with AIOTaskGroup() as tg:
+        tg_tok = build_tg_.set(tg)
+        seq_tok = build_seq_.set(Box(0))
+        try:
+            yield
+        finally:
+            build_tg_.reset(tg_tok)
+            build_seq_.reset(seq_tok)
+
+
+@dataclasses.dataclass
+class TaskContext:
+ task: "Task" = dataclasses.field()
+
+
+async def noop():
+ pass
+
+
+@dataclasses.dataclass(slots=True)
+class LazyTask(Generic[CovT]):
+ """
+ Lazy AIOTask
+
+ Lazy: The coroutine isn't run or spawned into a task until .start()ed or awaited.
+ Use await directly if you intend to await -- this runs the coroutine in the current task.
+ """
+
+ coro: CoroutineType[Any, None, CovT]
+ # Not started | Starting | Started
+ task: None | Future[CovT] = None
+
+ def start(self) -> Awaitable[CovT]:
+ """Ensures task is started, and returns a coroutine for its completion."""
+ if self.task is None:
+ self.task = build_tg().create_task(self.coro)
+ return self.task
+
+ def __await__(self) -> Generator[Any, None, CovT]:
+ if self.task is None:
+ # Run the coroutine directly to avoid creating an additional task.
+            # __await__ only runs with a loop already running, so this is safe.
+            self.task = asyncio.get_running_loop().create_future()
+ try:
+ res = yield from self.coro.__await__()
+ self.task.set_result(res)
+ return res
+ except BaseException as exc:
+ self.task.set_exception(exc)
+ raise
+ else:
+ res = yield from self.task.__await__()
+ return res
+
+
+@dataclasses.dataclass(slots=True, frozen=True)
+class LazyFuture(Generic[CovT]):
+ """
+ Future that lazily starts its own computation.
+
+ The handle is run the first time .start() is called or the future is awaited.
+
+    For convenience, LazyFutures are also context managers that forward exceptions into the future.
+ """
+
+ handle: LazyTask | None = dataclasses.field(default=None)
+ fut: Future[CovT] = dataclasses.field(
+ default_factory=lambda: asyncio.get_event_loop().create_future()
+ )
+
+ none: ClassVar["LazyFuture[None]"]
+
+ def __enter__(self):
+ return self.fut.set_result
+
+ def __exit__(self, exc_type, exc, tb):
+        # Only forward a real exception; exc is None on a clean exit.
+        if exc is not None and not self.fut.done():
+            self.fut.set_exception(exc)
+
+ def start(self):
+ """Ensures handle is started if any, then returns the future."""
+ if self.handle:
+ self.handle.start()
+ return self.fut
+
+ def __await__(self) -> Generator[Any, None, CovT]:
+ """Ensures handle is started if any, then awaits the future."""
+ if self.handle is not None:
+ yield from self.handle.start().__await__()
+ res = yield from self.fut.__await__()
+ return res
+
+
+LazyFuture.none = LazyFuture()
+LazyFuture.none.fut.set_result(None)
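+
+# Usage sketch (illustrative only; `produce` and `demo` are hypothetical):
+#
+#     async def produce() -> int:
+#         return 42
+#
+#     async def demo() -> None:
+#         async with BuildContext():
+#             lt = LazyTask(produce())   # nothing runs yet
+#             lt.start()                 # spawned into the build TaskGroup
+#             print(await lt)            # -> 42
+#
+#             lf: LazyFuture[str] = LazyFuture()
+#             with lf as set_result:     # __enter__ returns fut.set_result
+#                 set_result("done")
+#             print(await lf)            # -> "done"
+#
+#     asyncio.run(demo())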
+
+
+@dataclasses.dataclass(frozen=True)
+class Task(Generic[CovT]):
+ """
+ A task is a computation of the task's value and hash.
+
+ Values are returned when other tasks fetch this task.
+
+ Other tasks can also obtain this task's hash, but hashes should be viewed
+ as a way to certify that "a value" is the same as some "other value",
+ typically to assert that the value from a previous run is good.
+ """
+
+ key: TaskKey = dataclasses.field()
+ value: LazyFuture[CovT] = dataclasses.field(repr=False, compare=False)
+ hash: LazyFuture[HashLike] = dataclasses.field(repr=False, compare=False)
+
+
+"""
+Traces record information about a run so that work can be saved in the future.
+
+There are 3 variants: dep traces, hash traces, and value traces.
+
+- A dep trace records a task's deps.
+- A hash trace records a task's deps, dep hashes, and *out hash*.
+- A value trace records a task's deps, dep hashes, *out hash, and out value*.
+
+Trace builders are classes that aid in the production of a trace.
+Once complete, a trace builder can be finalized to obtain a true trace.
+
+As such, there are 6 classes:
+
+- DepTrace
+- HashTrace
+- ValueTrace
+- DepTraceBuilder
+- HashTraceBuilder
+- ValueTraceBuilder
+"""
+
+
+@dataclasses.dataclass
+class DepTrace:
+ key: TaskKey = dataclasses.field()
+ # We still record HashLike so HashTrace and ValueTrace can subclass DepTrace.
+ # The hash is simply ignored if we only want to know the dependencies of key.
+ in_hashes: dict[TaskKey, HashLike] = dataclasses.field()
+
+
+@dataclasses.dataclass
+class HashTrace(DepTrace):
+ out_hash: HashLike = dataclasses.field()
+
+
+@dataclasses.dataclass
+class ValueTrace(HashTrace, Generic[CovT]):
+ out_value: CovT = dataclasses.field()
+
+
+@dataclasses.dataclass
+class DepTraceBuilder:
+ key: TaskKey = dataclasses.field()
+ in_hashes: dict[TaskKey, LazyFuture[HashLike]] = dataclasses.field(
+ default_factory=dict
+ )
+
+ async def finalize(self) -> DepTrace:
+ return DepTrace(
+ key=self.key,
+ in_hashes={k: await v for k, v in self.in_hashes.items()},
+ )
+
+
+@dataclasses.dataclass
+class HashTraceBuilder(DepTraceBuilder):
+ out_hash: LazyFuture[HashLike] | None = dataclasses.field(default=None)
+
+ async def finalize(self) -> HashTrace:
+ if self.out_hash is None:
+ raise RuntimeError("Cannot finalize HashTraceBuilder without out_hash")
+ return HashTrace(
+ key=self.key,
+ in_hashes={k: await v for k, v in self.in_hashes.items()},
+ out_hash=await self.out_hash,
+ )
+
+
+@dataclasses.dataclass
+class ValueTraceBuilder(HashTraceBuilder, Generic[CovT]):
+ out_value: LazyFuture[CovT] | None = dataclasses.field(default=None)
+
+ async def finalize(self) -> ValueTrace[CovT]:
+ if self.out_hash is None:
+ raise RuntimeError("Cannot finalize ValueTraceBuilder without out_hash")
+ if self.out_value is None:
+ raise RuntimeError("Cannot finalize ValueTraceBuilder without out_value")
+ return ValueTrace(
+ key=self.key,
+ in_hashes={k: await v for k, v in self.in_hashes.items()},
+ out_hash=await self.out_hash,
+ out_value=await self.out_value,
+ )
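+
+
+# Illustrative sketch (hypothetical keys; LazyFuture.none stands in for real
+# hash/value futures): recording a run with ValueTraceBuilder and finalizing
+# it into a ValueTrace.
+if __name__ == "__main__":
+
+    async def _record_demo() -> None:
+        dep_key = TaskKey(rule_key="read_file", arg_keys=("input.txt",))
+        out_key = TaskKey(rule_key="compile", arg_keys=(dep_key,))
+
+        builder = ValueTraceBuilder(key=out_key)
+        builder.in_hashes[dep_key] = LazyFuture.none  # dep's hash future
+        builder.out_hash = LazyFuture.none            # output hash future
+        builder.out_value = LazyFuture.none           # output value future
+        print(await builder.finalize())
+
+    asyncio.run(_record_demo())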