IT InfiniBand/GPU -Sr Staff Systems Engineer

by Admin on 12-27-2023 at 6:46 pm

Full Time
San Jose, CA
Posted 2 years ago
Applications have closed

Company: Cadence

Cadence is looking for a Sr Staff Systems Engineer at our San Jose location to accelerate strategic customer deployments, ensure on-time bring-up, deployment, and troubleshooting of HPC infrastructure, and support technical roles working with HPC, InfiniBand, and GPUs!

The successful candidate will be a hands-on technical contributor within the infrastructure team and will have customer-facing exposure involving the Windows and Linux operating systems.

The Systems Engineer will need experience in Linux environments and proficiency in tasks such as shell scripting.

Role: IT - Sr Staff Systems Engineer

Location: San Jose, CA (on-site, not remote)

Must Haves

15+ years of experience in system administration and engineering.
Minimum of five years of overall experience in technical roles supporting GPU infrastructure setup using InfiniBand
Experience with interconnections between InfiniBand and GPUs
Experience with GPU-enabled MPI implementations
Experience with NVIDIA CUDA or AMD ROCm
Experience with H100 and AMD MI210 GPU servers in a cluster
Support customer deployments and ensure on-time bring-up of GPU servers, including InfiniBand fabric bring-up, configuration, and subnet management on the IB switch
Participate in engagements with various SW and FW (BMC/SBIOS/OS/drivers, etc.) teams to develop best-in-class practices and tools, and analyze, debug, and resolve critical firmware and software issues affecting workload performance at scale
Provide engineering solutions that enable large-scale performance strategies for Datacenter GPU Computing products and software stacks, maintain technical relationships with internal and external engineering teams, and assist systems engineers in building creative solutions
Strong knowledge of Linux operating systems, networking, and security concepts.
Document and drive acceptance and qualification test plans, procedures, and reports

Requirements

Accelerate strategic customer deployments and ensure on-time bring-up and deployment of HPC infrastructure
Develop and implement server- and rack-level telemetry, and collaborate to establish continuous improvements in our design flows
Recent experience in critical data center technologies such as server architectures, software containers, job schedulers, and parallel computing; deployment and operation of large-scale systems; resilient system design; and clustering of computing resources
Manage HPC clusters, actively communicate with management about any problems with the equipment, and propose resolutions
Establish and maintain IT infrastructure and procedures for customer-facing and internal systems
Actively establish technical relationships with our customers' engineers, management, and architects at focus accounts
Create and develop test plans for new features on each product. Recommend improvements to enable automated scripting for testing and archiving of results. Develop HPC computing strategies for cloud-based computing, GPU-accelerated computing, etc.
Provide remote cluster support to large environments, including scalability/flexibility and troubleshooting end-user issues involving job submission, runtime, and resource access.
InfiniBand fabric configuration and administration on Red Hat/CentOS/Linux, with experience configuring PKeys and troubleshooting the end-to-end InfiniBand environment
InfiniBand fabric bring-up, configuration, subnet management, and monitoring on the IB switch and client side for multi-tenancy setups, with an understanding of IPoIB communication modes
Compare the performance of the InfiniBand network with cluster interconnects and debug InfiniBand performance-related issues
Automate configuration management, software updates, and system availability maintenance and monitoring using modern DevOps tools (Ansible, GitLab, etc.)
Be a technical specialist on GPU computing and networking products, directly supporting GPU customers
Direct experience and strong knowledge of parallel programming, GPU CUDA/ROCm development, and applications.
Actively partner with the R&D teams delivering services on our infrastructure to gather the requirements their services need to run within it.
Automate repetitive tasks and implement custom solutions using scripting/programming languages such as Bash or Python
Configure and troubleshoot a heterogeneous (QDR, FDR, EDR) InfiniBand network and associated subnet manager
Experience with high-performance computing interconnects (e.g., 10 and 40 Gigabit Ethernet, InfiniBand)
Able to move 50+ pounds

The annual salary range for California is $130,200 to $241,800. You may also be eligible to receive incentive compensation: bonus, equity, and benefits. Sales positions generally offer a competitive On Target Earnings (OTE) incentive compensation structure. Please note that the salary range is a guideline and compensation may vary based on factors such as qualifications, skill level, competencies and work location. Our benefits programs include: paid vacation and paid holidays, 401(k) plan with employer match, employee stock purchase plan, a variety of medical, dental and vision plan options, and more.
