The cache group fields defines under which cache group the run cache data will
belong. This is needed/useful for some next changes:
* Make cache correctly work for user direct runs. Since the user direct runs all
belong to the same run group (the user id) all the use direct runs will share the
same caches. To distinguish between the different caches we need to use something
in addition to the user id (the local repo uuid generated by the direct run
start command)
* Share the cache between multiple projects
* add a config option allowPrivilegedContainers
* fail task setup if privileged containers are requested but they aren't
allowed.
* report if privileged containers are allowed to the runservice
ErrInternal is an internal error that should be provided to the user (http api
will return a 500 with the error message)
It'll be used for any kind of error that are not auth or bad requests (like
errors to communicate to another service)
* Don't use a complex UnmarshalJSON for RunConfigTask and ExecutorTask but
introduce a Steps type as a slice of Step (where Step is an empty interface)
and declare an UnmarshalJSON method on the Step type.
split data files in multiple files of a max size (default 10Mib)
In this way every data snapshot will change only the datafiles that have some
changes instead of the whole single file.
Don't create an ErrFromRemote wrapping the returned error but
wrap the ErrFromRemote
Also use xerrors Is/As to get the underlying error to return to api clients
while maintaining context for logging
Just a raw replace of "github.com/pkg/errors".
Next steps will improve errors (like remote errors, api errors, not exist errors
etc...) to leverage its functionalities
rename the previous posix storage to posixflat and make it currently not user
selectable (since I'm not sure it's really worth using it).
The new posix storage uses the filesystem without any escaping so it's not a
real flat namespace.
This isn't a real issue since also minio is not a flat namespace and we are so
forced to use it like a hierarchycal filesystem.
Logs and archives can be shared by multiple runs. So removing a run doesn't
imply that we could also remote the logs and archives since they could be
"referenced" by another run.
Store also the runids as specific objects along with the logs and archives so,
we'll remove them only when no runids objects exist.
Also if they are logically part of the runservice the names runserviceExecutor
and runserviceScheduler are long and quite confusing for an external user
Simplify them separating both the code parts and updating the names:
runserviceScheduler -> runservice
runserviceExecutor -> executor
* runservice: use generic task annotations instead of approval annotations
* runservice: add method to set task annotations
* gateway: when an user call the run task approval action, it will set in the
task annotations the approval users ids. The task won't be approved.
* scheduler: when the number of approvers meets the required minimum number
(currently 1) call the runservice to approve the task
In this way we could easily implement some approval features like requiring a
minimum number of approvers (saved in the task annotations) before marking the
run as approved in the runservice.
On s3 limit the max object size to 1GiB when the size is not provided (-1) or
the minio client will calculate a big part size since it tries to use the
maximum object size (5TiB) and will allocate a very big buffer in ram. Also
leave as commented out the previous hack that was firstly creating the file
locally to calculate the size and then put it (for future reference).
This options is a noop on s3 but on the posix implementation it becomes useful
when there isn't the need to have a persistent file, thus avoiding some fsync
calls.
* Rename to datamanager since it handles a complete "database" backed by an
objectstorage and etcd
* Don't write every single entry as a single file but group them in a single
file. In future improve this to split the data in multiple files of a max size.
`lts` was choosen to reflect a "long term storage" but currently it's just an
object storage implementation. So use this term and "ost" as its abbreviation
(to not clash with "os").
An executor can handle multiple archs (an executor that talks with a k8s cluster
with multi arch nodes). Don't use a label for archs but a custom executor
field.
* Add the concept of executor groups and siblings executors
* Add the concept of dynamic executor: an executor in an executor group that
doesn't need to be manually deleted from the scheduler since the other sibling
executors will take care of cleaning up its pods.
* Remove external labels visibility from pod.
* Add functions to return the sibling executors and the executor group
* Delete pods of disappeared sibling executors
take and change a copy of the current run so we'll change newRun and use curRun
status for logic decision. In this way result are reproducible or they will be
affected by the random run.Tasks map iteration order.
Handle the task dependencies conditions:
* on_success (default if no conditions are specified)
* on_failure
* on_skipped
Not the runservice won't stop run but continue executing tasks that depends on a
parent also if this is failed
* split functions in sub parts to ease future testing
* save run fewer times
* rework events logic to considere both run phase and result changes (emit an
event on every phase or result change)
Add the ability to define a run with a setuperror phase.
When the run setup has errors client could submit a run with a list of setup
errors. In such case the run will be created in the setuperror phase.
Setup errors are currently generated by the webhook receiver and the run service
when it checks the run config for possible issues.
* Use just RunConfig
* Use StaticEnvironment vs Environment in RunConfig to distinguish between env
that won't change at run recreation from env that could change at every
recreation
* The RunCreate api will just receive the runtasks instead of a runconfig (more
right)
* client: always parse the json error message field and return its contents
* Use ErrBadRequest and ErrNotFound in every handler and command
* Gateway: by default pass underlying service error (configstore, runservice) to
client keeping the status code and message. In future, if some errors must be
masked, we should change the specific parts that need special handling.
* Command: use ErrBadRequest
* Always return a json message also on error. For internal errors return a
generic "internal server error" message to not leak the real internal error to
clients
* Return 201 Created on resource creation
* Return 204 No Content on resource deletion and other action with no json
output
* Remove all the small index files on the lts
* Keep on s3 only a full index of all runs containing the runid, grouppath and phase
million of runs can take only some hundred of megabytes
* Periodically create a new dump of the index