FPGA, Data and CASPA: Spring into AI (2 of 2)

by Alex Tan on 03-23-2018 at 12:00 pm

Adding color to the talks, Dr. Jeff Welser, VP and Director of the IBM Almaden Research Lab, showed how AI and recent computing resources could be harnessed to contain the data explosion. Unstructured data is expected to grow to the order of 50 zettabytes by 2020 (a zettabyte is 10^21 bytes). One example of such computing resources is the Summit supercomputer, developed by IBM for the Oak Ridge National Lab, which uses over 27K NVIDIA GPUs and targets 100+ petaflops.

The next question is where AI algorithms run. Figure 4 shows the segregation of AI algorithms by compute platform. Advanced analytics on big data is usually done on CPUs; ML (learning without explicit programming) is performed on a mix of CPUs, FPGAs, and GPUs; and deep learning (many-layer neural networks) uses GPUs for training and CPUs or FPGAs for inference, with the current trend being a race to ASICs.

Jeff showed how, over the years, error rates in image and speech recognition have dropped dramatically to single digits, approaching human-level accuracy. Interestingly, he also shared how custom hardware implementation for AI has circled from FPGA/ASIC back to FPGA again, as illustrated in Figure 5.

How can we continue the rate of progress beyond the GPU? The trend is an improvement of 2.5x in GFLOPS/W per year. One option is a reduced- or mixed-precision accelerator. One example is the use of Phase Change Memory (PCM) to deliver up to a 500x speed-up over current NVM devices at accuracy equivalent to a GPU. Some researchers are trying to optimize resistive devices for symmetric switching by exploring new materials and devices as building blocks for new AI chips with neuron-like networks. A spiking-based, non-von Neumann architecture has been adopted for low-power inference in IBM’s TrueNorth chip, which consists of 1 million neurons and 256 million synapses, consumes ~70 mW, and has a ~4 cm² footprint. Last year, IBM collaborated with the Air Force Research Lab on a 64-chip TrueNorth array consisting of 64 million neurons.
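To put the single-chip TrueNorth figures quoted above into per-neuron terms, here is a minimal Python sketch. It uses only the approximate numbers from the talk (1 million neurons, 256 million synapses, ~70 mW, ~4 cm²); the variable names are illustrative, and the results are rough ratios rather than measured values.

```python
# Approximate single-chip TrueNorth figures as quoted in the talk.
NEURONS = 1_000_000
SYNAPSES = 256_000_000
POWER_W = 0.070   # ~70 mW
AREA_CM2 = 4.0    # ~4 cm^2 footprint

# Derived ratios (illustrative only; the inputs are themselves approximate).
synapses_per_neuron = SYNAPSES // NEURONS
power_per_neuron_nw = POWER_W / NEURONS * 1e9        # watts -> nanowatts
power_density_mw_per_cm2 = POWER_W * 1e3 / AREA_CM2  # watts -> milliwatts

print(f"Synapses per neuron: {synapses_per_neuron}")
print(f"Power per neuron:    {power_per_neuron_nw:.1f} nW")
print(f"Power density:       {power_density_mw_per_cm2:.1f} mW/cm^2")
```

With these inputs each neuron has 256 synapses and a power budget of roughly 70 nW, which gives a sense of why the architecture is aimed at low-power inference.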

Another approach is massively distributed deep learning, which optimizes communication and data movement for the available hardware. Scaling ResNet training to 64 Minsky nodes (IBM’s nickname for its Power8-based HPC servers) with 256 GPUs cut the training time from 16 days to 7 hours. To sum up, he anticipates the following key technologies for the next era of computing:

Context and Learning; Visual Analytics and Interaction; Software Defined Environment; Data-centric Systems; and Atomic and Nano-scale.
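As a back-of-the-envelope check on the distributed-training figures quoted earlier (16 days down to 7 hours on 64 nodes with 256 GPUs), the sketch below computes the implied speed-up. The scaling-efficiency line rests on an assumption that is not stated in the talk summary: that the 16-day baseline ran on a single node. If the baseline was different, that number changes accordingly.

```python
# Figures quoted in the talk summary.
BASELINE_HOURS = 16 * 24   # 16 days of training
SCALED_HOURS = 7           # training time after scaling out
NODES = 64
GPUS = 256

speedup = BASELINE_HOURS / SCALED_HOURS
gpus_per_node = GPUS // NODES

# ASSUMPTION (not from the source): the 16-day baseline used one node.
# Under that assumption, efficiency is speed-up divided by node count.
scaling_efficiency = speedup / NODES

print(f"Speed-up:           {speedup:.1f}x")
print(f"GPUs per node:      {gpus_per_node}")
print(f"Scaling efficiency: {scaling_efficiency:.0%} (single-node baseline assumed)")
```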

While the previous speakers examined system-level aspects of the AI ecosystem, Dr. Steven Woo, VP of Systems and Solutions at Rambus, shared his perspective from a different vantage point: a bottom-up view of the memory element. The growth in data has spurred demand for high-speed memory that can move data quickly for AI training and inference.
The performance bottleneck has shifted because of the growing volume of data from interconnected devices. AI has driven the development of new system architectures, as manifested in NVIDIA’s Tesla V100 and Wave Computing’s Dataflow Processing Units. The Roofline model can be used for performance prediction: it shows where an application’s performance lands relative to the underlying hardware’s memory bandwidth and processing power. Rooflines vary across system architectures. The plot in Figure 6 captures performance (operations per second) versus operational intensity (operations per byte). The two architectural limits are illustrated by the green lines, which intersect at the ridge point to form a roofline shape.
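The Roofline bound described above can be sketched in a few lines of Python: attainable performance is the smaller of the compute ceiling and the memory bandwidth multiplied by operational intensity. The peak-compute and bandwidth values below are hypothetical placeholders chosen for easy arithmetic, not the specifications of any device mentioned in the talk.

```python
def attainable_performance(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Roofline bound: the lower of the compute ceiling and the bandwidth slope."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Operational intensity (ops/byte) at which the two limits intersect."""
    return peak_ops_per_s / bandwidth_bytes_per_s


# Hypothetical machine, for illustration only.
PEAK_OPS_PER_S = 10e12          # 10 Tops/s compute ceiling
BANDWIDTH_BYTES_PER_S = 500e9   # 500 GB/s memory bandwidth

ridge = ridge_point(PEAK_OPS_PER_S, BANDWIDTH_BYTES_PER_S)
print(f"Ridge point: {ridge:.1f} ops/byte")

for intensity in (1, 5, 20, 100):
    perf = attainable_performance(PEAK_OPS_PER_S, BANDWIDTH_BYTES_PER_S, intensity)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:>4} ops/byte -> {perf / 1e12:5.1f} Tops/s ({regime})")
```

Applications whose operational intensity falls to the left of the ridge point are limited by memory bandwidth, so faster memory helps them; those to the right are limited by processing power.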

How can the bandwidth issue be eased? There are several ways, each with its own trade-offs:

  • Reduced data precision – narrower bit widths.
  • On-chip memory – highest bandwidth and power efficiency (tens of TB/s), but less storage than DRAM.
  • High Bandwidth Memory (HBM) – very high bandwidth and density (256 GB/s, high I/O count), but with interposer-related challenges.
  • Graphics Double Data Rate (GDDR) – a good trade-off between bandwidth, power efficiency, cost, and reliability; traditionally used for graphics and now for AI; its main application challenge is ensuring clean signal integrity.
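To illustrate the first option in the list above, the following sketch estimates how much data must cross the memory interface when the same set of values is stored at different precisions. The parameter count and the bandwidth figure are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Bytes needed to store one value at each precision.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "int8": 1}


def traffic_bytes(num_values, dtype):
    """Bytes moved to read num_values once at the given precision."""
    return num_values * BYTES_PER_VALUE[dtype]


def transfer_time_s(num_values, dtype, bandwidth_bytes_per_s):
    """Time to stream the values once, ignoring latency and protocol overhead."""
    return traffic_bytes(num_values, dtype) / bandwidth_bytes_per_s


# Hypothetical workload and memory system (assumptions, not from the talk).
NUM_WEIGHTS = 25_000_000
BANDWIDTH_BYTES_PER_S = 100e9   # 100 GB/s

for dtype in ("fp32", "fp16", "int8"):
    megabytes = traffic_bytes(NUM_WEIGHTS, dtype) / 1e6
    millis = transfer_time_s(NUM_WEIGHTS, dtype, BANDWIDTH_BYTES_PER_S) * 1e3
    print(f"{dtype}: {megabytes:6.1f} MB moved, {millis:.2f} ms per pass")
```

Halving the bit width halves the bytes moved per operation, which doubles the operational intensity of the same computation; the trade-off is the accuracy that may be lost at lower precision.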

Because random-access latency is inherently long, architects may convert random accesses into more streaming-friendly access patterns.

Let’s look into how AI may lend a hand in perfecting our senses.
Dr. Achin Bhowmik, CTO and EVP of Engineering for Starkey Hearing Technologies, pointed out the $3.2 billion healthcare market value and described the hearing technology products his company provides. Formerly an Intel engineering executive in the Perceptual Computing Group, where he covered various aspects of IoT designs and AI applications, he is passionate about pushing the envelope of hearing technology to augment human perception, leading to a better life.

As we know, our sensory system comprises vision, hearing, touch, taste, and smell (he noted that spatial sense, or sense of balance, should be added to the list). He quoted statistics from the National Council on Aging: every 13 seconds an older adult is treated in an emergency room for a fall, and every 20 minutes an older adult dies from one, at an estimated cost of $67.7B by 2020. These figures underscore how restoring hearing (and hence spatial sense) can help reduce such incidents.

In contrast to the advances already seen in computer vision and face recognition, hearing technology is still evolving. Like IoT devices, a hearing aid requires a small form factor: it must be small enough to be practical, yet it packs many interacting components (six sensors, including a spatial one; a DSP chip; and a radio/receiver). He highlighted a recently introduced hearing aid that fits inside the ear canal. It has 7 days of battery life, runs at 5 mW, and can use a phone as a gateway. The device has a two-layer data security protocol to prevent snooping, with one layer at the device and another at the smartphone app. Figure 8 provides a snapshot of the miniaturization trend.
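The battery-life and power figures quoted above imply a rough energy budget for the device, sketched below. This treats 5 mW as a continuous average draw over the full 7 days, which is an assumption; the talk summary does not say whether the figure is an average or a peak.

```python
BATTERY_LIFE_DAYS = 7
# ASSUMPTION: 5 mW is treated as the continuous average power draw.
AVERAGE_POWER_MW = 5.0

hours = BATTERY_LIFE_DAYS * 24
energy_mwh = AVERAGE_POWER_MW * hours   # milliwatt-hours
energy_joules = energy_mwh * 3.6        # 1 mWh = 3.6 J

print(f"Operating time: {hours} h")
print(f"Energy budget:  {energy_mwh:.0f} mWh ({energy_joules:.0f} J)")
```

Under that assumption the whole week of operation fits in well under one watt-hour, which illustrates how tight the power envelope is for a device that must sit inside the ear canal.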

The product uses a technology referred to as Acuity Immersion, which leverages microphone placement to preserve high-frequency information for improved sound quality and a sense of spatial awareness, helping users relearn key acoustic brain cues that support clear speech, a sense of presence, and spatial attention for a vital connection to their environment. It also provides a sense of directionality, restoring front-to-back cues for a more natural listening experience.

Completing the talks, Dr. Yunsup Lee, CTO and co-founder of SiFive, the first fabless company to build customized silicon based on the free and open RISC-V instruction set architecture, pointed out the high barrier to entry for building custom chips at scale. The company provides scalable chip development in Amazon’s AWS environment and offers two flavors. The first, the E300-Everywhere, targets the embedded processing, IoT, and wearables markets and is designed in TSMC 180 nm.

The second, the U500-Unleashed, is a customizable RISC-V SoC IP that contains the configurable U5-Coreplex, a 1.6 GHz+ cache-coherent 64-bit multiprocessor with up to eight cores and application-specific custom hardware. The company’s selling point is that its fully Linux-capable IP can reduce NRE and time-to-market for customized SoC designs in markets such as ML, storage, and networking. High-speed peripherals such as PCIe 3.0, USB 3.0, GbE, and DDR3/4 are also available. The design is compatible with a TSMC 28 nm process.

Delivering intelligence from the edge to the compute brain involves an entire AI ecosystem. Against the backdrop of an Intel-based environment, one can anticipate that scalability and elasticity across the networked components are key to keeping up with the data explosion. This includes not only pushing the envelope of current switching technology (CMOS, FinFET) and memory limits, but also probing non-conventional solutions such as neuron-like approaches.