

# Intel<sup>®</sup> HPC Solutions Update: Focus on FPGA and ML

Dr. Jean-Laurent PHILIPPE, PhD, EMEA HPC Technical Sales Specialist

With Dell Amsterdam, October 27, 2016



# Legal Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at [intel.com].

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at <a href="https://www-ssl.intel.com/content/www/us/en/high-performance-computing/path-to-aurora.html">https://www-ssl.intel.com/content/www/us/en/high-performance-computing/path-to-aurora.html</a>.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit <a href="http://www.intel.com/performance">http://www.intel.com/performance</a>.

Intel, the Intel logo, Xeon, Intel Xeon Phi, Intel Optane and 3D XPoint are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

\*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation. All rights reserved.





#### Introduction

Intel<sup>®</sup> Xeon<sup>®</sup> Processor and FPGA

Machine Learning

Conclusion

Backup

# **Growing Challenges in HPC**

#### System Bottlenecks: "The Walls"

Memory | I/O | Storage | Energy Efficient Performance | Space | Resiliency | Unoptimized Software

#### Divergent Workloads

Resources Split Among Modeling and Simulation | Big Data Analytics | Machine Learning | Visualization

#### Barriers to Extending Usage

Democratization at Every Scale | Cloud Access | Exploration of New Parallel Programming Models



# A Holistic Architectural Approach is Required



# Intel<sup>®</sup> Scalable System Framework



# Many Workloads – One Framework

A Flexible Framework for Today & Tomorrow



#### Delivering Breakthrough System Performance



# How Intel<sup>®</sup> Scalable System Framework Works

#### Innovative Technologies, Tighter Integration and Co-Design










## Suitable Workloads for FPGAs




Source: Bain FPGA market research (October 2015) survey of 400 developers



## Data Center Workloads for FPGA



Figure legend: bubble size indicates CPU intensity. Applicability categories: Very Applicable, Applicable, Less Common.



# Intel<sup>®</sup> Xeon<sup>®</sup> processor + FPGA Value Proposition



#### Delivered through combined hardware & software features of Xeon + FPGA


# Xeon + FPGA Target Workloads

| FPGA Activity | Workload Examples |
|---|---|
| Compute intensive algorithms | <ul> <li>Visual Understanding/Deep Learning classification</li> <li>Compression/decompression</li> <li>Video Motion Estimation</li> <li>Genomics (Pair HMM, Smith-Waterman)</li> <li>Memory copy routines</li> </ul> |
| Latency sensitive pre-filtering & processing for CPU | <ul> <li>Bump-in-the-wire network processing</li> <li>FSI market data pre-filtering</li> <li>HPC Radar data pre-processing</li> <li>Automotive video input</li> <li>Security appliance, targeted vSwitch</li> </ul> |
| Evolving algorithms or stable algorithms on low latency and inline interconnect | <ul> <li>New compression algorithms</li> <li>High compression ratios</li> <li>Custom crypto algorithms</li> </ul> |



# Xeon + FPGA Use Case Examples by Segment

|                                 | Cloud SPs                                     | Comm SPs                      | Enterprise IT                            | Tech<br>Computing     |
|---------------------------------|-----------------------------------------------|-------------------------------|------------------------------------------|-----------------------|
| Example End user                | SaaS/IaaS<br>provider                         | NFV adopter                   | Database, Big<br>Data Analytics<br>user  | FSI user              |
| Workload accelerated<br>on FPGA | Visual<br>Understanding                       | VM-to-VM<br>Packet Processing | Database<br>Compression                  | Trading<br>algorithms |
| Sample FPGA IP<br>Libraries     | Convolutional<br>Neural Network<br>algorithms | Parse, Lookup,<br>Modify      | Compression,<br>Sort, Join<br>algorithms | Proprietary           |



# Differences between Discrete FPGA & Xeon + FPGA

|  | Discrete FPGA | Xeon + FPGA |
|---|---|---|
| Workload type best<br>suited for | Coarse-grained acceleration offload: FPGA works on independent task, returns result to CPU when complete | Fine-grained workload acceleration: CPU/FPGA jointly working on task, access shared data set |
| Where to deploy<br>FPGA | <ul><li>On PCIe card</li><li>On motherboard</li></ul> | In server form factor inside CPU socket (up to 2 sockets) |
| FPGA Options | <ul> <li>Option of any FPGA to deploy</li> <li>Option to deploy multiple FPGAs together</li> </ul> | <ul> <li>1 FPGA option available</li> <li>1 FPGA integrated with CPU as multi-chip package</li> </ul> |
| Memory Options | <ul> <li>Option for memory local to FPGA</li> <li>System memory access is via PCIe &amp; not cache-coherent</li> </ul> | <ul> <li>FPGA shares system memory with CPU</li> <li>System memory access is low latency &amp; cache-coherent</li> </ul> |
| Power Options | FPGA powered separately from CPU | FPGA &amp; CPU share socket TDP |
| Tools &amp;<br>Programming | <ul> <li>Same Altera tool suite for discrete &amp; integrated FPGA</li> <li>Program FPGA with OpenCL or RTL</li> </ul> |  |

The choice between these reconfigurable accelerators will depend on workload demands & deployment environment.
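The coarse-grained versus fine-grained distinction in the table above can be illustrated with a toy cost model. This is only a sketch: the `Link` class, the `transfer_time` function and every latency and bandwidth figure are hypothetical placeholders chosen to make the contrast visible. They are not measurements of PCIe, of the Xeon + FPGA package, or of any Intel product.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical CPU-to-FPGA link. Both figures are placeholders."""
    latency_s: float        # fixed cost paid on every hand-off, in seconds
    bandwidth_bps: float    # sustained throughput, in bytes per second


def transfer_time(link: Link, total_bytes: int, handoffs: int) -> float:
    """Seconds spent moving total_bytes across the link in `handoffs` pieces."""
    return handoffs * link.latency_s + total_bytes / link.bandwidth_bps


# Assumed values for illustration only: same bandwidth, different hand-off cost.
high_latency_link = Link(latency_s=10e-6, bandwidth_bps=8e9)
low_latency_link = Link(latency_s=0.5e-6, bandwidth_bps=8e9)

total = 64 * 1024 * 1024  # 64 MiB of data shared between CPU and FPGA

for handoffs in (1, 1_000, 100_000):
    slow = transfer_time(high_latency_link, total, handoffs)
    fast = transfer_time(low_latency_link, total, handoffs)
    print(f"{handoffs:>7} hand-offs: high-latency {slow:.4f} s, low-latency {fast:.4f} s")
```

With a single hand-off (coarse-grained offload) the two hypothetical links behave almost identically, because bandwidth dominates. As the number of hand-offs grows (fine-grained cooperation on a shared data set), the fixed per-hand-off cost dominates, which is the situation where a low-latency link matters most.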






# Artificial Intelligence

Artificial Intelligence is Human Intelligence Exhibited by Machines

# Machine Learning

Machine Learning is a small but fast-growing workload

- Training: Simple math applied at massive scale to analyze data & create a model
- Scoring: Trained models are applied to new data to generate predictions
- Future: Autonomous computation methods that learn from experience
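The training and scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming a one-variable linear model fitted by gradient descent; the function names `train` and `score`, the toy data and the hyperparameters are all hypothetical and are not taken from any Intel library or framework.

```python
def train(xs, ys, lr=0.05, epochs=2000):
    """Training: fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def score(model, x):
    """Scoring: apply the trained model to a new input to get a prediction."""
    w, b = model
    return w * x + b


# Toy training data generated from y = 2x + 1 (illustrative only).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]

model = train(train_x, train_y)
prediction = score(model, 10.0)  # an input the model has not seen
print(f"w={model[0]:.3f}, b={model[1]:.3f}, prediction for x=10: {prediction:.3f}")
```

Training repeats the same simple arithmetic over the whole data set many times, which is why it is the compute-heavy phase. Scoring is a single cheap evaluation per new input.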

# Deep Learning

Deep Learning is One Branch of Machine Learning

# Intel's AI Framework



**Fuel** the development of vertical specific solutions

**Accelerate** adoption of analytics platforms

**Drive** CPU optimizations across open source machine learning frameworks

**Enable** maximum performance with Intel libraries

**Deliver** best single node and multi-node performance

# Knights Mill: Optimal Deep Learning Throughput



#### **Faster Time to Train Machines**

- Provides High Single Precision Peak performance
- Provides High Variable Precision QVNNI performance
- Bootable Host-CPU avoids PCIe latency & bottlenecks
- Efficient Scaling with Multi-node optimizations for top ML frameworks
- High memory bandwidth for seamlessly training Complex Neural Network datasets






# IA for AI: Better Hardware Today & Tomorrow





Best Performance

Maximum scalability





Other names and brands may be claimed as the property of others. All comparative descriptions used in this slide are made based on the comparison of Intel's own products.



# Thank you ...







# Backup: Dell Offering

# Dell PowerEdge C4130 Accelerator platform



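[Figure: Dell PowerEdge C4130 platform diagram. Surviving labels: Xeon Phi, DDR4 DIMMs, Xeon CPU. Surviving caption fragments: "height"; "density and unmatched flexibility".]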

# Dell PowerEdge C6320 high-performance platform



#### Performance optimized





# Dell PowerEdge R930 Large-Memory platform



# Designed for the most demanding **HPDA** applications







# The Dell H-Series Omni-Path Architecture

The **next generation** of High-Performance Computing fabrics



#### Dell Intel® EE Lustre\* Software



7GB/s Peak Write per building block

#### Up to **44GB/s** in a single rack

#### The **Ultimate** HPDA File System


