NOTE: This system currently has limited availability.

For a comparison of ATS-4 systems, see: Using El Capitan Systems: Hardware Overview

Job Limits

Each LC platform is a shared resource. Users are expected to adhere to the following usage policies to ensure that the resources can be effectively and productively used by everyone. You can view the policies on a system itself by running:

news job.lim.MACHINENAME
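On El Capitan, whose machine name elcap appears in the login node names listed below, that is:

news job.lim.elcap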

Web Version of El Capitan Job Limits

El Capitan is the CORAL-2 flagship system. There are 32 login nodes, 56 debug nodes, and 11,040 batch nodes. Each compute node has 96 AMD EPYC cores, 4 AMD MI300A GPUs, and 512 GB of memory. El Capitan runs TOSS 4 with Cray compilers.

System documentation is available on the SCF at https://hpc.llnl.gov/documentation/user-guides/using-el-capitan-systems

Batch jobs are scheduled through Flux. The queue can be viewed by typing flux jobs -A at the prompt.
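For example, to see what is queued and what the queue limits are, the stock Flux commands below should work (a sketch using standard flux-jobs and flux-queue options, nothing El Capitan-specific):

flux jobs -A       # all users' jobs, running and queued
flux queue list    # configured queues with their limits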

Jobs are scheduled per node. El Capitan has three main scheduling queues:

  • pdebug—56 nodes
  • pci—8 nodes
  • pbatch/plarge—11,040 nodes
Queue       Nodes/job          Max runtime
-------------------------------------------
pdebug      maximum     32           2 hrs
pci         maximum      1           4 hrs
pbatch      maximum   4150          24 hrs
plarge      minimum*  4096          24 hrs
-------------------------------------------
*Interactive DATs may be smaller; see "plarge and DATs" below.
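As an illustration, a small pbatch submission within these limits might look like the following; run_app.sh and my_app are hypothetical names, and the flags are standard flux batch and flux run options:

flux batch -N 2 -q pbatch -t 8h ./run_app.sh

Inside run_app.sh, one task per GPU across both nodes could then be launched with:

flux run -N 2 -n 8 -g 1 ./my_app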

plarge and DATs

Only one of the pbatch and plarge queues is active at a time. The plarge queue is activated weekly on Thursdays if there are pending jobs or outstanding interactive DAT requests. Jobs can be submitted to plarge at any time; non-interactive jobs should use at least 4096 nodes.

Interactive DATs may be smaller than 4096 nodes and can be requested via lc-hotline@llnl.gov or using the ASC DAT Request form.
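A non-interactive plarge submission is analogous to the pbatch example above (a sketch; the script name is hypothetical):

flux batch -N 4096 -q plarge -t 24h ./run_app.sh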

pdebug

pdebug is intended for debugging, visualization, and other inherently interactive work. It is NOT intended for production work. Do not use pdebug to run batch jobs. Do not chain jobs to run one after the other. Do not use more than half of the nodes during normal business hours. Individuals who misuse the pdebug queue in these or similar ways will be denied access to it.
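For interactive work of this kind, a standard Flux interactive allocation can be used (a sketch with stock flux alloc options):

flux alloc -N 2 -q pdebug -t 1h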

Other Policies

Do NOT run computationally intensive work on the login nodes. There are a limited number of login nodes, and they are meant primarily for editing files and launching jobs. When a login node is laggy, it is most often because a user has started a compile on it.

Interactive access to a batch node is allowed while you have a batch job running on that node, and only for the purpose of monitoring your job. When logging into a batch node, be mindful of the impact your work has on the other jobs running on the node.
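To see which nodes your job holds before logging in, the documented flux jobs format fields can be used (a sketch):

flux jobs -o "{id} {name} {nodelist}"

Once on the node, keep monitoring tools such as top or ps lightweight so they do not perturb running jobs.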

Documentation

Using El Capitan Systems

Topics include: Quickstart, hardware, compilers, GPU programming, flux, rabbits, tools, and more.

Support

Please call or send email to the LC Hotline if you have questions.

Zone:                       SCF
Vendor:                     HPE Cray

User-Available Nodes
  Login Nodes*:             32 nodes: elcap[1001-1016,12121-12136]
  Batch Nodes:              11,040
  Debug Nodes:              64
  Total Nodes:              11,136

APUs
  APU Architecture:         AMD MI300A

CPUs
  CPU Architecture:         4th Generation AMD EPYC
  Cores/Node:               96
  Total Cores:              1,069,056

GPUs
  GPU Architecture:         CDNA 3
  Total GPUs:               44,544
  GPUs per compute node:    4
  GPU global memory (GiB):  512.00

Memory Total (GiB):         5,701,632

Peak Performance
  Peak PFLOPS (CPUs+GPUs):  2792.900
  Clock Speed (GHz):        2.0

OS:                         TOSS 4
Interconnect:               HPE Slingshot 11
Parallel job type:          multiple nodes per job
Scheduler:                  Flux
Recommended location for parallel file space:
Program:                    ASC
Class:                      ATS-4, CORAL-2
Year Commissioned:          2024
Compilers:
Documentation:              Using El Capitan Systems (https://hpc.llnl.gov/documentation/user-guides/using-el-capitan-systems)