Launching Jobs
Good to know: All Deque tools, such as jobs and notebooks, are actually stored and executed on the Deque Platform. The app gives the experience of native execution, but everything runs in the cloud. The user can even shut down their computer and the job will continue running uninterrupted.
A job is a Python program, typically intended to run for a long time. A job can run on a single node or on multiple nodes. Because Deque components (the instrumentation services and the orchestrator) integrate deeply with the most popular deep learning frameworks (PyTorch and TensorFlow), Deque jobs are highly automated.
When the user logs into Jobs for the first time, two sample jobs are automatically created for them. The user can explore these jobs and their settings, and even run them on the Platform to experience a complete job lifecycle. Once ready to create their own jobs, the user can click the plus icon at the top right. (ADD SCREENSHOT)
Job Settings: The app will open a job creation window where the user can create the job by checking out a repository from GitHub or opening a folder from their local machine. The next step is to provide the main file: the file that launches the job, which must contain a standard Python main function. The user must also specify an environment for the job. Environments are defined separately and contain a valid Anaconda YAML file.
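As a minimal sketch of what a main file might look like (the file name, print message, and training stub are assumptions for illustration, not Deque requirements beyond having a standard main function):

```python
# train.py — hypothetical main file for a Deque job (name is an assumption).

def main():
    # Long-running training logic would go here; this stub just
    # illustrates the standard Python main function the job launcher calls.
    print("starting training job")
    return 0

if __name__ == "__main__":
    main()
```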
Compute Settings: The user must also provide an instance type for the job. As with Notebooks, the instance type can be a GPU instance, which appears with a blue highlight, or a CPU-only instance, which appears with a grey highlight in the drop-down menu. Each instance type label also shows the number of GPUs, number of CPUs, GPU memory, and RAM associated with that instance type. This transparency is a benefit of the Deque app that is not available on other platforms; having this information allows users to choose an instance that optimally balances training cost and speed. (ATTACH SCREENSHOT) Additionally, for Jobs the user can select on-prem as a compute provider, in which case the instance type drop-down is dynamically populated with the instances available in the user's on-prem environment. This feature is enabled by installing the Deque Instrumentation Service on any on-prem machine the user wants to make available for training. See Instrumentation Service (ADD HYPERLINK).
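The specs shown in the drop-down make the cost/speed trade-off concrete. As a hypothetical sketch (instance names, specs, and hourly prices are invented for illustration, not real Deque offerings), a user could reason about the cheapest instance that fits their model:

```python
# Hypothetical catalog mirroring the fields shown in the drop-down:
# GPU count, CPU count, GPU memory (GB), RAM (GB). Prices are invented.
instances = [
    {"name": "gpu-small", "gpus": 1, "cpus": 8,  "gpu_mem_gb": 16, "ram_gb": 64,  "usd_per_hour": 1.20},
    {"name": "gpu-large", "gpus": 4, "cpus": 32, "gpu_mem_gb": 80, "ram_gb": 256, "usd_per_hour": 6.50},
    {"name": "cpu-only",  "gpus": 0, "cpus": 16, "gpu_mem_gb": 0,  "ram_gb": 128, "usd_per_hour": 0.40},
]

def cheapest_fit(catalog, min_gpu_mem_gb, min_ram_gb):
    """Return the cheapest instance meeting the model's memory needs, or None."""
    fits = [i for i in catalog
            if i["gpu_mem_gb"] >= min_gpu_mem_gb and i["ram_gb"] >= min_ram_gb]
    return min(fits, key=lambda i: i["usd_per_hour"]) if fits else None

# A model needing 12 GB of GPU memory and 32 GB of RAM fits on gpu-small,
# the cheaper of the two GPU instances that satisfy it.
choice = cheapest_fit(instances, min_gpu_mem_gb=12, min_ram_gb=32)
```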
Experiment Tracking: Experiment tracking allows you to track the performance of your model over many iterations of training. It shows you the accuracy of the model and the value of the loss function, both of which provide intuition for adjusting your hyperparameters (a process known as hyperparameter tuning). Deque has integrated TensorBoard and MLflow (both open source) right into the app, allowing users to seamlessly and visually evaluate and monitor model performance. To turn on experiment tracking within a Deque job, the user simply selects either TensorBoard or MLflow under the experiment settings within Job Settings.
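Under the hood, trackers like TensorBoard and MLflow record scalar metrics (loss, accuracy) at each training step. As a framework-agnostic, stdlib-only sketch of the data being captured (the `log_metric` helper and the training loop are hypothetical stand-ins, not Deque, TensorBoard, or MLflow APIs):

```python
# Stdlib-only sketch of the per-step scalars an experiment tracker records;
# real code would call a tracker's logging API instead of filling a dict.
history = {"loss": [], "accuracy": []}

def log_metric(name, value, step):
    # Hypothetical stand-in for a tracker's "log scalar" call.
    history[name].append((step, value))

# Simulated training loop: loss decreases, accuracy increases.
for step in range(5):
    loss = 1.0 / (step + 1)   # placeholder for a real loss value
    accuracy = 1.0 - loss     # placeholder for a real accuracy value
    log_metric("loss", loss, step)
    log_metric("accuracy", accuracy, step)
```

Plotting `history` over steps is essentially what the integrated TensorBoard/MLflow views render for the user.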
Placement Settings: Here the user decides which compute provider to use for each job and the geography in which to run it. This matters because proximity to data is important: if the data lives on the West Coast and the job runs on the East Coast, network latency makes the setup suboptimal.
Distributed Settings: One of Deque's unique features is distributed training. To train a job on multiple GPUs and multiple nodes, the user simply enables a toggle during or after job creation in the settings menu and then specifies the number of nodes to utilize. This streamlines the entire distributed training experience, eliminating hours of effort administering clusters and jobs and allowing the user to focus on their code.
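Behind such a toggle, distributed training frameworks shard the work across workers. As an illustrative stdlib-only sketch of round-robin data sharding (similar in spirit to PyTorch's DistributedSampler, but not Deque's actual implementation):

```python
def shard_indices(num_samples, world_size, rank):
    """Round-robin assignment of sample indices to one worker.

    world_size: total number of workers across all nodes.
    rank: this worker's id, with 0 <= rank < world_size.
    """
    return list(range(rank, num_samples, world_size))

# With 10 samples and 4 workers, each worker sees a disjoint slice
# and every sample is covered exactly once per epoch.
shards = [shard_indices(10, 4, r) for r in range(4)]
```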