"Seegrid will be due for a migration to confluence on the 1st of August. Any update on or after the 1st of August will NOT be migrated"

VGL Architecture and Design

System Architecture

The following is a high level architecture diagram describing the various components of VGL and their dependencies on external services.
vgl-system-architecture.png

Job Flow

The following diagram describes the processing workflow for any given VGL job.

vgl-workflow.png

Boot Process

The following diagram illustrates how the VM actually receives and processes the job inputs.

vgl-script-architecture.png

Accessing Files

vl-cloudstore.png

Tracking State

Simplified State Transitions

The following is a simplified state transition diagram from the perspective of a VL backend. All monitoring checks are polled, with the exception of the Provisioning -> Pending transition, which is handled via a long-running REST call on a separate thread.

vl-statechange.png
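
The split between polled checks and the threaded Provisioning -> Pending transition might look roughly like the Python sketch below. It is an illustration only, not the VL backend code: the `Job` class, `provision`, `poll_once` and the two callbacks are hypothetical names, and the states other than Provisioning and Pending are assumed from the job life cycle described later on this page.

```python
import threading
from enum import Enum


class State(Enum):
    PROVISIONING = "Provisioning"
    PENDING = "Pending"
    ACTIVE = "Active"   # assumed from the job life cycle sections
    DONE = "Done"       # assumed from the job life cycle sections


class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.state = State.PROVISIONING
        self.lock = threading.Lock()


def provision(job, blocking_request):
    """Run the long-running request on its own thread.

    `blocking_request` stands in for the long-running REST call; the job
    moves to Pending only once that call returns.
    """
    def worker():
        blocking_request(job)
        with job.lock:
            job.state = State.PENDING

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def poll_once(job, check_remote_state):
    """One polling pass over a single job.

    Jobs that are still provisioning (handled by the worker thread) or
    already done are left alone; otherwise the polled state is adopted.
    """
    with job.lock:
        if job.state in (State.PROVISIONING, State.DONE):
            return job.state
        job.state = check_remote_state(job)
        return job.state
```

A scheduler would call `poll_once` for every tracked job at a fixed interval, while `provision` is called once per submission.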

State Diagrams

The following state diagram describes the life cycle of a job in VGL v1:

VGLv1-JobLifeCycle.png

- The cancel action for an Unsubmitted job (the red arrow line) is currently enabled. We need to review whether this action is applicable or relevant.
- There is currently no cancel action for a Pending job (the blue arrow line). We also need to review whether this action is needed.
- There is currently no workflow that lets the user modify and re-submit an Unsubmitted or Failed job. This needs review too.

VGLv1.1-JobLifeCycle-FirstDraft.png

Changes
a. Removed 'Cancelled' state or job status.
b. Introduced the 'Edit' functionality or action to allow the updating of a saved job.
c. Changed the 'repeat' action label to 'clone/duplicate'.
d. Introduced a new life cycle for changing a 'Pending' state to 'Failed' state upon 'refresh' action.

Issues that need further clarification
a. Do we allow an 'Active' job to be deleted? VGL v1 doesn't allow an 'Active' job to be removed but it allows the job to be cancelled.

Note(s)
- VGL doesn't allow a 'Failed' job to be edited and re-submitted. As such, we need to prevent failures that the user could rectify before job submission (e.g. invalid cloud storage credentials or storage store id).

Reviewer(s)
- Adam and Richard

VGLv1.1-JobLifeCycle-SecondDraft.png



Changes

a. Removed 'delete' action for an 'Active' job and replaced it with the 'cancel' action.
b. The 'cancel' action will change the 'Pending' or 'Active' job status to 'Saved'.

Comment(s)

About current 'failed' state

In general, there are two types of Job failure in VGL. I'll call them pre-submit and post-submit failures respectively.

A pre-submit failure can be caused by a database failure, network connection failure, filesystem failure, S3 (storage) or EC2 (compute) credentials failure, S3 or EC2 service unavailability, etc. This type of failure occurs before the job gets uploaded and executed in EC2. Such failures can normally be rectified by the user (e.g. a credentials issue) or retried when the service becomes available again. VGL v1 currently doesn't allow a 'Failed' job to be edited and retried. We probably need to either introduce a new state for this type of failure or introduce a failure flag, so that it can be differentiated from the post-submit failure.

A post-submit failure occurs after the Job is uploaded to EC2 and the VM instance is launched to execute the job. Possible failures include an error in the scientific code, a VM failure during Job execution, a network connection failure that prevents the scientific code from downloading its dataset, an S3 failure, etc. This type of failure is much harder to deal with and is normally not recoverable without submitting a new Job. VGL v1 currently allows a 'Failed' job to be cloned or deleted even though it doesn't actually capture this type of failure. We need to decide whether this type of failure needs to be captured. If so, we then need to work out which errors can and need to be captured, and how.
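
The "failure flag" option could be sketched as follows. This is a hypothetical Python illustration of the idea, not something VGL implements: `FailureKind`, `JobFailure` and `classify_failure` are invented names, and using "has the VM been launched" as the dividing line is an assumption drawn from the two descriptions above.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    PRE_SUBMIT = "pre-submit"    # before the job is uploaded and run in EC2
    POST_SUBMIT = "post-submit"  # after the VM instance has been launched


@dataclass(frozen=True)
class JobFailure:
    kind: FailureKind
    cause: str

    @property
    def retryable(self) -> bool:
        # Pre-submit failures can normally be rectified or retried by the
        # user; post-submit failures normally need a new job.
        return self.kind is FailureKind.PRE_SUBMIT


def classify_failure(vm_launched: bool, cause: str) -> JobFailure:
    """Flag a failure as pre- or post-submit (hypothetical helper)."""
    kind = FailureKind.POST_SUBMIT if vm_launched else FailureKind.PRE_SUBMIT
    return JobFailure(kind, cause)
```

With such a flag, the UI could offer edit and retry only when `retryable` is true, without adding a second failure state.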

Additional Note(s)

Construct workflow (Existing from VGL v1)
Step1: Select existing or create new series
Step2: Enter Job details
Step3: Manage Job input files
Step4: Define Job script (via script builder or plain text editor with syntax highlighting)
Step5: Review Job before submission

Edit workflow (NEW)
Step1: Edit Job details
Step2: Manage Job input files
Step3: Edit Job script (in plain text editor with syntax highlighting)
Step4: Review Job before submission

Clone/duplicate workflow (Existing from VGL v1)
Step1: Select files to copy into new Job
Step2: Edit Job details
Step3: Review Job before submission

Reviewer(s)
- Terry, Josh and Richard

VGLv1.1-JobLifeCycle-ProposedFinalDraft.png

VGLv1.1-JobLifeCycle-ProposedFinalDraftTwo.png


Note(s):
a. Changed the job submission failure transition to reflect what was discussed in the last review meeting. The proposed final draft 2 is the latest state diagram.
b. The 'Failed' state and its related transitions (in red arrows) will not be implemented in VGL v1.1.
c. Text written in blue below requires further review and attention.

Changes (from v1 to current proposed final draft)
a. Removed 'Cancelled' state or job status.
b. Introduced the 'Edit' functionality or action to allow the updating of a saved job.
c. Changed the 'repeat' action label to 'clone/duplicate'.
d. Removed 'refresh' action from 'Pending' status to 'Failed' status.
e. Ignored 'Failed' state for VGL v1.1. A saved Job which failed upon submission will be returned to 'Saved' state. For the purpose of error reporting on UI, a new field may be introduced to the existing 'jobs' table to record the cause of a job submission failure.
f. Introduced an audit trail for state transitions (data to be captured: job id, from state, to state and transition date) - not shown in the above diagram.
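
Items e and f can be illustrated with a short Python sketch. It is not VGL's implementation: the `Job` and `Transition` classes, the `failure_cause` field name, the `submit` helper and the in-memory `trail` list are hypothetical, and the Saved -> Pending transition on a successful submission is an assumption. The source only states that a failed submission returns the job to 'Saved', that a new field may record the cause, and which audit data is captured.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Transition:
    """One audit trail entry: job id, from state, to state, date."""
    job_id: int
    from_state: str
    to_state: str
    transition_date: datetime


@dataclass
class Job:
    job_id: int
    state: str = "Saved"
    failure_cause: Optional[str] = None  # hypothetical new 'jobs' column


def move(job, to_state, trail):
    """Change state and append an audit entry for the transition."""
    now = datetime.now(timezone.utc)
    trail.append(Transition(job.job_id, job.state, to_state, now))
    job.state = to_state


def submit(job, launch, trail):
    """Submit a saved job; on failure it stays 'Saved' with a cause."""
    if job.state != "Saved":
        raise ValueError("only a 'Saved' job can be submitted")
    try:
        launch(job)  # stand-in for the real submission call
    except Exception as exc:
        job.failure_cause = str(exc)  # kept for error reporting on the UI
        return False
    job.failure_cause = None
    move(job, "Pending", trail)
    return True
```

Because the job never leaves 'Saved' on a failed submission, no separate 'Failed' state is needed and the user can edit and re-submit directly.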

Issue(s) that need to be reviewed for final release in this iteration
a. What happens if the 'cancel' action fails? Comment: I think it probably doesn't matter, provided the 'Pending' or 'Active' state hasn't been changed to 'Done'. We could ignore the 'cancel' action failure, update the Job state to 'Saved' and let the user re-submit and overwrite the previous run's results.
b. A 'Pending' or 'Active' job can change to 'Done' at any point in time without this being reflected on the UI, so the user may try to cancel a Job that is already done. Recommendation: the 'cancel' action needs additional logic to check whether the Job is already done and, if so, notify the user that the Job cannot be cancelled and update the Job status on the UI.
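
The two points above could combine into a cancel routine along these lines. This is a minimal Python sketch under stated assumptions: `cancel`, `fetch_latest_state` and `terminate_vm` are hypothetical names, and the returned messages are placeholders rather than actual UI text.

```python
def cancel(fetch_latest_state, terminate_vm):
    """Return (new_state, message) for a cancel request.

    fetch_latest_state: callable returning the job's current state,
        re-read at cancel time rather than taken from the UI.
    terminate_vm: callable that stops the compute instance; may raise.
    """
    latest = fetch_latest_state()
    if latest == "Done":
        # Issue (b): the job finished before the UI was refreshed.
        return "Done", "Job is already done and cannot be cancelled"
    if latest not in ("Pending", "Active"):
        return latest, "Job is not running; nothing to cancel"
    try:
        terminate_vm()
    except Exception:
        # Issue (a): a failed termination is ignored; the job still
        # returns to 'Saved' so the user can re-submit it.
        pass
    return "Saved", "Job cancelled"
```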

Review by Ben Caradoc-Davies

Suggestions:
  • Do not add Failed state as the Saved/Pending/Active/Done and resolution status (Success/Failure) are orthogonal concerns.
  • Remove state diagram arrows from Done/Active back to issue creation as these are not state transitions. Instead, add two job factories Create New / Clone Existing with a note that the second copies settings from an existing job.
  • Submission failure should not be called "failed"; this could be an input validation step. Maybe "submission failed" or "not submitted".
Other than these suggestions, the new job lifecycle is a simplification and improvement on the earlier versions.

VGLv1.1-JobLifeCycle-Final.png

List of implemented changes:

a. Removed 'Cancelled' and 'Failed' states.
b. Introduced 'Deleted' state:
- Previously, deleting a job physically removed its record from the database. In VGL v1.1, a deleted job is instead marked as 'Deleted'.
- VGL v1.1 adds a job transition audit trail use case, and a job's details must be retained to make sense of its audit trail records.
- A deleted job is not displayed on the UI or to the user.
c. Introduced 'Edit' action or workflow to allow updating of an unsubmitted or cancelled (Saved) job.
d. Changed the 'repeat' action label to 'duplicate' on the UI.
e. Changed job files clean up logic:
- Previously, all files were removed from the job staging directory after a successful job submission. In VGL v1.1, files are removed from the job staging directory only when the job status changes from 'Pending' or 'Active' to 'Done'. This keeps the simplified workflow easy to implement and maintain: cancelling a 'Pending' or 'Active' job reverts it to the 'Saved' state, and because the user can edit a 'Saved' job, its files are kept in the staging directory so they can be retrieved and edited easily.
f. Introduced job transition audit trail.
- Failure or error in the audit trail operation will not affect user action or operation. The audit trail recording is not an atomic operation.
g. Disabled 'delete' action from 'Pending' or 'Active' job. To delete a 'Pending' or an 'Active' job, the user must first cancel the job (this will terminate the compute instance) and then delete it (mark the job as 'Deleted' and remove its files from staging directory and cloud storage).
h. Added a guard to the 'cancel' action that checks for a status update and notifies the user that the cancel operation was aborted if the job has already been processed. This notification is only available for the individual job cancel action.
i. Changed the job cancel action to fail gracefully. This means the job status will be changed to 'Saved' regardless of VM instance termination status. The user can then re-submit the cancelled (Saved) job, and the results from the previous run will be overwritten.
j. Added additional clean-up logic to the job 'delete' action. It now also removes job files from S3 storage if and only if the job has not been registered to GeoNetwork.
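
Read together, the changes above imply a small transition table. The Python sketch below is one reading of the list, not a transcription of VGLv1.1-JobLifeCycle-Final.png: the action names, the `apply` and `cleanup_targets` helpers, and the assumption that 'delete' is available from 'Saved' and 'Done' are all hypothetical.

```python
# (current state, action) -> next state; anything absent is not allowed.
TRANSITIONS = {
    ("Saved", "edit"): "Saved",
    ("Saved", "submit"): "Pending",
    ("Saved", "delete"): "Deleted",    # assumed: only Pending/Active are barred
    ("Pending", "cancel"): "Saved",
    ("Pending", "monitor:active"): "Active",
    ("Pending", "monitor:done"): "Done",
    ("Active", "cancel"): "Saved",
    ("Active", "monitor:done"): "Done",
    ("Done", "delete"): "Deleted",     # assumed, as above
}


def apply(state, action):
    """Return the next state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"'{action}' is not allowed for a '{state}' job"
        ) from None


def cleanup_targets(registered_in_geonetwork):
    """Locations whose job files the 'delete' action removes (item j)."""
    targets = ["staging directory"]
    if not registered_in_geonetwork:
        targets.append("S3 storage")
    return targets
```

Under this reading, deleting a running job takes two steps, `apply("Active", "cancel")` followed by `apply("Saved", "delete")`, which matches item g.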

Known limitations

a. Script changes made in Step 3 of the Edit workflow are not saved until the next button is clicked. The user could potentially lose the changes by clicking the previous button after editing the script.


-- JoshVote - 24 Aug 2012
Topic attachments
Attachment | Size | Date | Who | Comment
VGLv1-JobLifeCycle.dia | 3.2 K | 05 Sep 2012 - 16:25 | RichardGoh |
VGLv1-JobLifeCycle.png | 23.4 K | 05 Sep 2012 - 16:25 | RichardGoh |
VGLv1.1-JobLifeCycle-Final.dia | 2.9 K | 10 Oct 2012 - 11:59 | RichardGoh | The final version of VGL v1.1 Job Life Cycle
VGLv1.1-JobLifeCycle-Final.png | 24.0 K | 10 Oct 2012 - 11:58 | RichardGoh | The final version of VGL v1.1 Job Life Cycle in PNG format
VGLv1.1-JobLifeCycle-FirstDraft.png | 19.9 K | 06 Sep 2012 - 17:44 | RichardGoh | The first draft of VGL v1.1 Job Life Cycle in PNG format
VGLv1.1-JobLifeCycle-ProposedFinalDraft.dia | 2.9 K | 12 Sep 2012 - 10:03 | RichardGoh | The proposed final draft of VGL v1.1 Job Life Cycle in Dia binary format
VGLv1.1-JobLifeCycle-ProposedFinalDraft.png | 22.6 K | 12 Sep 2012 - 10:02 | RichardGoh |
VGLv1.1-JobLifeCycle-ProposedFinalDraftTwo.dia | 2.9 K | 12 Sep 2012 - 15:22 | RichardGoh |
VGLv1.1-JobLifeCycle-ProposedFinalDraftTwo.png | 22.0 K | 12 Sep 2012 - 15:22 | RichardGoh |
VGLv1.1-JobLifeCycle-SecondDraft.dia | 2.9 K | 10 Sep 2012 - 15:02 | RichardGoh | The second draft of VGL v1.1 Job Life Cycle in Dia binary format
VGLv1.1-JobLifeCycle-SecondDraft.png | 22.9 K | 10 Sep 2012 - 15:00 | RichardGoh | VGL v1.1 Job Life Cycle Second Draft
VGLv1.1-JobLifeCycle.dia | 2.8 K | 07 Sep 2012 - 11:45 | RichardGoh | The first draft of VGL v1.1 Job Life Cycle
vgl-script-architecture.graphml | 45.7 K | 24 Aug 2012 - 16:22 | JoshVote |
vgl-script-architecture.png | 59.1 K | 27 Nov 2013 - 12:15 | JoshVote |
vgl-system-architecture.graphml | 58.6 K | 24 Aug 2012 - 14:34 | JoshVote |
vgl-system-architecture.png | 44.7 K | 24 Aug 2012 - 14:34 | JoshVote |
vgl-workflow.graphml | 25.4 K | 24 Aug 2012 - 14:51 | JoshVote |
vgl-workflow.png | 21.1 K | 24 Aug 2012 - 14:51 | JoshVote |
vl-cloudstore.graphml | 16.3 K | 27 Jul 2016 - 16:02 | JoshVote |
vl-cloudstore.png | 19.9 K | 27 Jul 2016 - 16:02 | JoshVote |
vl-statechange.graphml | 18.7 K | 27 Jul 2016 - 15:55 | JoshVote |
vl-statechange.png | 46.3 K | 27 Jul 2016 - 15:55 | JoshVote |
Topic revision: r18 - 27 Jul 2016, JoshVote
 

Current license: All material on this collaboration platform is licensed under a Creative Commons Attribution 3.0 Australia Licence (CC BY 3.0).