Reorganize ExecutorTask to better distinguish between the task Spec and
the Status.
Split the task Spec in a sub part called ExecutorTaskSpecData that contains
tasks data that don't have to be saved in etcd because it contains data that can
be very big and can be generated starting from the run and the runconfig.
Currently `advanceRunTasks` isn't deterministic and doesn't calculate the final
state in one call. So could happen that `getTasksToRun` will select a task to be
executed since its parent are finished (marked as skipped in advanceRunTasks)
but the task isn't marked to be skipped (because advanceRunTasks has calculated
this task before its parents).
Currently fix this doing the same task selection logic done in `advanceRunTasks`
and add a TODO to make `advanceRunTasks` be deterministic by processing tasks by
their level (from level 0).
Export clients and related packages.
The main rule is to not import internal packages from exported packages.
The gateway client and related types are totally decoupled from the gateway
service (not shared types between the client and the server).
Instead the configstore and the runservice client currently share many types
that are now exported (decoupling them will require that a lot of types must be
duplicated and the need of functions to convert between them, this will be done
in future when the APIs will be declared as stable).
* Don't fail tasks inside the delete executor action, just delete the executor
from etcd
* The scheduler, when detecting a task without a related executor will mark the
task as failed and correctly set end time of the task and its steps.
Implement runservice maintenance mode and export/import.
When runservice is set in maintenance mode it'll start only the maintenance and
export/import handlers.
Setting maintenance mode will set a key in etcd so all the runservice instances
will detect it and enter in maintenance mode. This is done asyncronously so it
could take some time (future improvements will add some api to show all the
runservice states)
Export is always available and will export the datamanager contents. Currently
only datamanager contents are exported (no logs and workspace archives).
Import is available only during maintenance, given a datamanager export will
import it and reset etcd to this import state.
Use the go sql context functions (ExecContext, QueryContext etc...)
The context is saved inside Tx so the library users should only pass it one time
to the db.Do function.
In runservice readdb Run method we could end with a deadlock if two of the
goroutines that call HandleEvents.* try to write to the errCh at the same
time before the errCh is read. If this happens one of the two will be blocked on
writing to the channel but the read won't happen since it'll blocked by
wg.Wait().
Fix this doing:
* use a buffered channel large as the number of executed goroutines.
* create a new errCh at every loop (so we'll ignore later errors after the first
one)
Note: we could also use a non blocking send to avoid this situation but we
should also start the wg.Wait before the goroutines or earlier errors could be
lost causing another kind of hang.
* Don't make cors enabled on all (*) by default.
* Handle related web.allowedOrigins options
* Only the gateway api should be called by a browser so setup the cors handler
only on it
currently we are deleting the executor tasks only when all the run tasks
log/archives were fetched. But it'll better to remove a single executor task
when the task fetching is finished.
This could also fix possible issues on k8s since we are scheduling tasks but the
k8s scheduler may not schedule them if there aren't enough resources causing a
scheduling deadlock since we won't remove finished pods because their related
tasks are not removed and k8s cannot start new pods since it has no resources.
Don't put datamanager base dirs inside the root of the ost but use a base path.
Let's do it now before releasing since this is a breaking change that requires
moving the ost data to the new path
Currently we aren't setting a basepath and it wasn't always correctly handled.
Fix missing basepath handling and improve tests to also use a non empty
basepath.
The cache group fields defines under which cache group the run cache data will
belong. This is needed/useful for some next changes:
* Make cache correctly work for user direct runs. Since the user direct runs all
belong to the same run group (the user id) all the use direct runs will share the
same caches. To distinguish between the different caches we need to use something
in addition to the user id (the local repo uuid generated by the direct run
start command)
* Share the cache between multiple projects
* add a config option allowPrivilegedContainers
* fail task setup if privileged containers are requested but they aren't
allowed.
* report if privileged containers are allowed to the runservice
ErrInternal is an internal error that should be provided to the user (http api
will return a 500 with the error message)
It'll be used for any kind of error that are not auth or bad requests (like
errors to communicate to another service)
* Don't use a complex UnmarshalJSON for RunConfigTask and ExecutorTask but
introduce a Steps type as a slice of Step (where Step is an empty interface)
and declare an UnmarshalJSON method on the Step type.
split data files in multiple files of a max size (default 10Mib)
In this way every data snapshot will change only the datafiles that have some
changes instead of the whole single file.
Don't create an ErrFromRemote wrapping the returned error but
wrap the ErrFromRemote
Also use xerrors Is/As to get the underlying error to return to api clients
while maintaining context for logging
Just a raw replace of "github.com/pkg/errors".
Next steps will improve errors (like remote errors, api errors, not exist errors
etc...) to leverage its functionalities
rename the previous posix storage to posixflat and make it currently not user
selectable (since I'm not sure it's really worth using it).
The new posix storage uses the filesystem without any escaping so it's not a
real flat namespace.
This isn't a real issue since also minio is not a flat namespace and we are so
forced to use it like a hierarchycal filesystem.
Logs and archives can be shared by multiple runs. So removing a run doesn't
imply that we could also remote the logs and archives since they could be
"referenced" by another run.
Store also the runids as specific objects along with the logs and archives so,
we'll remove them only when no runids objects exist.