NOTE: This system currently has limited availability.

For a comparison of ATS-4 systems, see: Using El Capitan Systems: Hardware Overview

Job Limits

Each LC platform is a shared resource. Users are expected to adhere to the following usage policies to ensure that the resources can be effectively and productively used by everyone. You can view the policies on a system itself by running:

news job.lim.MACHINENAME
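On El Capitan, whose machine name elcap appears in the login node names listed below, that is:

news job.lim.elcap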

Web Version of El Capitan Job Limits

El Capitan is the CORAL-2 flagship system. There are 32 login nodes, 56 debug nodes, and 11,040 batch nodes. Each compute node has 96 AMD EPYC cores, 4 AMD MI300A GPUs, and 512 GB of memory. El Capitan runs TOSS 4 with Cray compilers.

System documentation is available on the SCF at https://hpc.llnl.gov/documentation/user-guides/using-el-capitan-systems

Batch jobs are scheduled through Flux. The queue can be viewed by typing flux jobs -A at the prompt.
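For example, to see what is queued and what the queue limits are, the stock Flux commands below should work (a sketch using standard flux-jobs and flux-queue options, nothing El Capitan-specific):

flux jobs -A       # all users' jobs, running and queued
flux queue list    # configured queues with their limits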

Jobs are scheduled per node. El Capitan has three main scheduling queues:

  • pdebug—56 nodes
  • pci—8 nodes
  • pbatch/plarge—11,040 nodes
Queue       Nodes/job          Max runtime
-------------------------------------------
pdebug      maximum     32           2 hrs
pci         maximum      1           4 hrs
pbatch      maximum   4150          24 hrs
plarge      minimum*  4096          24 hrs
-------------------------------------------
*Interactive DATs may be smaller; see "plarge and DATs" below.
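As an illustration, a small pbatch submission within these limits might look like the following; run_app.sh and my_app are hypothetical names, and the flags are standard flux batch and flux run options:

flux batch -N 2 -q pbatch -t 8h ./run_app.sh

Inside run_app.sh, one task per GPU across both nodes could then be launched with:

flux run -N 2 -n 8 -g 1 ./my_app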

plarge and DATs

Only one of the pbatch and plarge queues is active at a time. The plarge queue is activated weekly on Thursdays if there are pending jobs or outstanding interactive DAT requests. Jobs can be submitted to plarge at any time; non-interactive jobs should use at least 4096 nodes.

Interactive DATs may be smaller than 4096 nodes and can be requested via lc-hotline@llnl.gov or using the ASC DAT Request form.
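A non-interactive plarge submission is analogous to the pbatch example above (a sketch; the script name is hypothetical):

flux batch -N 4096 -q plarge -t 24h ./run_app.sh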

pdebug

pdebug is intended for debugging, visualization, and other inherently interactive work. It is NOT intended for production work. Do not use pdebug to run batch jobs. Do not chain jobs to run one after the other. Do not use more than half of the nodes during normal business hours. Individuals who misuse the pdebug queue in these or similar ways will be denied access to it.
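For interactive work of this kind, a standard Flux interactive allocation can be used (a sketch with stock flux alloc options):

flux alloc -N 2 -q pdebug -t 1h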

Other Policies

Do NOT run computationally intensive work on the login nodes. There are a limited number of login nodes, and they are meant primarily for editing files and launching jobs. When a login node is laggy, it is most often because a user has started a compile on it.

Interactive access to a batch node is allowed while you have a batch job running on that node, and only for the purpose of monitoring your job. When logging into a batch node, be mindful of the impact your work has on the other jobs running on the node.
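To see which nodes your job holds before logging in, the documented flux jobs format fields can be used (a sketch):

flux jobs -o "{id} {name} {nodelist}"

Once on the node, keep monitoring tools such as top or ps lightweight so they do not perturb running jobs.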

Documentation

Using El Capitan Systems

Topics include: Quickstart, hardware, compilers, GPU programming, flux, rabbits, tools, and more.

Support

Please call or send email to the LC Hotline if you have questions.

Zone:                       SCF
Vendor:                     HPE Cray

User-Available Nodes
  Login Nodes*:             32 nodes: elcap[1001-1016,12121-12136]
  Batch Nodes:              11,040
  Debug Nodes:              64
  Total Nodes:              11,136

APUs
  APU Architecture:         AMD MI300A

CPUs
  CPU Architecture:         4th Generation AMD EPYC
  Cores/Node:               96
  Total Cores:              1,069,056

GPUs
  GPU Architecture:         CDNA 3
  Total GPUs:               44,544
  GPUs per compute node:    4
  GPU global memory (GiB):  512.00

Memory Total (GiB):         5,701,632

Peak Performance
  Peak PFLOPS (CPUs+GPUs):  2792.900
  Clock Speed (GHz):        2.0

OS:                         TOSS 4
Interconnect:               HPE Slingshot 11
Parallel job type:          multiple nodes per job
Scheduler:                  Flux
Recommended location for parallel file space:
Program:                    ASC
Class:                      ATS-4, CORAL-2
Year Commissioned:          2024
Compilers:
Documentation:              Using El Capitan Systems (https://hpc.llnl.gov/documentation/user-guides/using-el-capitan-systems)