Implement a new error handling library based on pkg/errors. It provides
stack saving on wrapping and exports some function to add stack saving
also to external errors.
It also implements custom zerolog error formatting without adding too
much verbosity by just printing the chain error file:line without a full
stack trace of every error.
* Add a --detailed-errors options to print error with they full chain
* Wrap all error returns. Use errors.WithStack to wrap without adding a
new messsage and error.Wrap[f] to add a message.
* Add golangci-lint wrapcheck to check that external packages errors are
wrapped. This won't check that internal packages error are wrapped.
But we want also to ensure this case so we'll have to find something
else to check also these.
Replace zap with zerolog.
zerolog has a cleaner interface and can be easily configured with custom
error chain printing using a new error handling library that will be
implemented in another PR.
skip fetching of tasks with status skipped, not only tasks marked as skip.
This avoid many wrong an noisy logs of type "executor task with id taskid
doesn't exist. This shouldn't happen. Skipping fetching"
updateRunTaskStatus should also accept transitions from not started to a
finished state like "success", "failed", "stopped" since we could miss some
status updates from the executor for many reasons.
rename activeExecutorTasks to scheduledExecutorTasks and don't filter out
finished tasks.
In some logic we need all the scheduled tasks and not only the not finished
ones.
Since the executor only periodically updates its state we could end up
scheduling much more tasks than the executor ActiveTasksLimit. This will happen
in the case of many parallel tasks that can all start at the same time.
To avoid this also considere the executor tasks saved in etcd that represent
the real view of scheduled tasks.
Currently when a run is marked to stop we are going to stop currently running
tasks and then their childs will be marked as skipped.
But tasks not depending on a stopped task (root task or childs with a finished
parent) that are just waiting for an executor slot, will be scheduled when
there will be a free slot also if the run is marked to stop (and then the
scheduler will stop them after some seconds).
This patch will mark all not started tasks as skipped when the run is marked to
stop.
During tests provide a zaptest Logger so all services output will be redirected
to golang testing logger.
When multiple services of the same type are provided add a unique name field to
distinguish them.
etcd PR 11104 (https://github.com/etcd-io/etcd/pull/11104) implemented mutex
TryLock. Since it's only available in etcd master just copy relevant code and
add a TODO to remove it when updating the etcd client to a version implementing
TryLock.
Use TryLock everywhere where it'll be useful.
* objectstorage: remove `types` package and move `ErrNotExist` in base package
* objectstorage: Implement .Is and add helper `IsErrNotExist` for `ErrNotExist`
* util: Rename `ErrNotFound` to `ErrNotExist`
* util: Add `IsErr*` helpers and use them in place of `errors.Is()`
* datamanager: add `ErrNoDataStatus` to report when there's not data status in ost
* runservice/common: remove `ErrNotExist` and use errors in util package
Reorganize ExecutorTask to better distinguish between the task Spec and
the Status.
Split the task Spec in a sub part called ExecutorTaskSpecData that contains
tasks data that don't have to be saved in etcd because it contains data that can
be very big and can be generated starting from the run and the runconfig.
Currently `advanceRunTasks` isn't deterministic and doesn't calculate the final
state in one call. So could happen that `getTasksToRun` will select a task to be
executed since its parent are finished (marked as skipped in advanceRunTasks)
but the task isn't marked to be skipped (because advanceRunTasks has calculated
this task before its parents).
Currently fix this doing the same task selection logic done in `advanceRunTasks`
and add a TODO to make `advanceRunTasks` be deterministic by processing tasks by
their level (from level 0).
Export clients and related packages.
The main rule is to not import internal packages from exported packages.
The gateway client and related types are totally decoupled from the gateway
service (not shared types between the client and the server).
Instead the configstore and the runservice client currently share many types
that are now exported (decoupling them will require that a lot of types must be
duplicated and the need of functions to convert between them, this will be done
in future when the APIs will be declared as stable).
* Don't fail tasks inside the delete executor action, just delete the executor
from etcd
* The scheduler, when detecting a task without a related executor will mark the
task as failed and correctly set end time of the task and its steps.
currently we are deleting the executor tasks only when all the run tasks
log/archives were fetched. But it'll better to remove a single executor task
when the task fetching is finished.
This could also fix possible issues on k8s since we are scheduling tasks but the
k8s scheduler may not schedule them if there aren't enough resources causing a
scheduling deadlock since we won't remove finished pods because their related
tasks are not removed and k8s cannot start new pods since it has no resources.
The cache group fields defines under which cache group the run cache data will
belong. This is needed/useful for some next changes:
* Make cache correctly work for user direct runs. Since the user direct runs all
belong to the same run group (the user id) all the use direct runs will share the
same caches. To distinguish between the different caches we need to use something
in addition to the user id (the local repo uuid generated by the direct run
start command)
* Share the cache between multiple projects
Just a raw replace of "github.com/pkg/errors".
Next steps will improve errors (like remote errors, api errors, not exist errors
etc...) to leverage its functionalities
rename the previous posix storage to posixflat and make it currently not user
selectable (since I'm not sure it's really worth using it).
The new posix storage uses the filesystem without any escaping so it's not a
real flat namespace.
This isn't a real issue since also minio is not a flat namespace and we are so
forced to use it like a hierarchycal filesystem.
Logs and archives can be shared by multiple runs. So removing a run doesn't
imply that we could also remote the logs and archives since they could be
"referenced" by another run.
Store also the runids as specific objects along with the logs and archives so,
we'll remove them only when no runids objects exist.
Also if they are logically part of the runservice the names runserviceExecutor
and runserviceScheduler are long and quite confusing for an external user
Simplify them separating both the code parts and updating the names:
runserviceScheduler -> runservice
runserviceExecutor -> executor