Thursday, September 23, 2021

Tesla Memo [3/3] : Tesla AI Day, Aug 19, 2021

A memo compiled from my past tweets about Tesla.

Tesla Memo [1/3] : Tesla Autonomy Investor Day, Apr 2019 - Sep 2020
Tesla Memo [2/3] : before Tesla AI Day, Aug 19, 2021

Overview

  • Tesla AI Day, Aug 19, 2021
  • Tesla CFP8: 8-bit Configurable Floating-Point
  • Hybrid 8-bit Floating Point (HFP8) Training and Inference for DNNs, IBM
  • 2x Dojo D1 Chip Module?
  • "Comparative Study on Power Module Architectures for Modularity and Scalability", Review, Dec 2020
  • "Scaled compute fabric for accelerated deep learning", Cerebras, Patent Application, Mar 5, 2020 (Aug 28, 2018)
  • Closing
  • Extras
    • AMD Exascale APU, Chiplets
  • ※ I don't know how to set up in-page links, so please scroll.

Tesla AI Day, Aug 19, 2021

Tesla AI Day, Aug 19, 2021
※ From the Dojo introduction onward, the presenter was Ganesh Venkataramanan.
Venkataramanan is a co-inventor of this patent covered in the previous entry:
"Packaged device having embedded array of components", US20200221568A1
Tesla
Sr Director Autopilot Hardware at Tesla
Dec 2018 – Present
Director Autopilot Hardware
Mar 2016 – Dec 2018
AMD
Senior Director Design Engineering
Jul 2014 – Mar 2016
...
Senior Design Engineer/ Design Engineer
Jun 2001 – Jun 2004
...

 

High Performance Training Node
64 bit Superscalar CPU
Vector Dispatch with 4x (8x8 Matrix Multiplication) & SIMD
1024 GFLOPS: BFP16, CFP8 (8 bit Configurable Floating Point)
64 GFLOPS: FP32
Int32, Int16, Int8
1.25 MB ECC SRAM
Low-Latency High Bandwidth 2-D Network Switch (1 Cycle Hop)
512 GB/s in Each Cardinal Direction
D1 Chip
354 Training Nodes
362 TFLOPS: BFP16, CFP8
22.6 TFLOPS: FP32
10 TBps/dir: On-Chip Bandwidth
4 TBps/edge: Off-Chip Bandwidth
400 W TDP 
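
As a quick sanity check, the per-node and per-chip numbers above are mutually consistent (my own arithmetic from the quoted figures, not from the slides):

```python
# Back-of-the-envelope check of the training-node vs. D1 numbers above
# (my arithmetic from the quoted figures, not Tesla's slides).
nodes_per_d1 = 354
node_bf16_gflops = 1024   # BFP16 / CFP8 per training node
node_fp32_gflops = 64

d1_bf16_tflops = nodes_per_d1 * node_bf16_gflops / 1000
d1_fp32_tflops = nodes_per_d1 * node_fp32_gflops / 1000

print(f"D1 BFP16/CFP8: {d1_bf16_tflops:.1f} TFLOPS (quoted: 362)")   # ~362.5
print(f"D1 FP32:       {d1_fp32_tflops:.1f} TFLOPS (quoted: 22.6)")  # ~22.7
```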

500,000 nodes
Dojo Interface Processor
PCIe Gen4 for Host
High Bandwidth DRAM Shared Memory for the Compute Plane
Higher Radix Network Connection
Hmm, hmm, hmm...

Training Tile
25x Known Good D1 Die onto Fan-Out Wafer
9 PFLOPS: BFP16, CFP8
9 TBps/edge: Off-Tile Bandwidth
Vertically, Directly Connected Voltage-Regulator Modules (VRM)
52 V DC
<1 cu Ft
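
The tile-level figure also follows from the chip-level one (again my own arithmetic, not from the presentation):

```python
# Training Tile check: 25 known-good D1 dies per tile (my arithmetic).
d1_per_tile = 25
d1_bf16_tflops = 362

tile_pflops = d1_per_tile * d1_bf16_tflops / 1000
print(f"Training Tile: {tile_pflops:.2f} PFLOPS BFP16/CFP8 (quoted: 9)")  # ~9.05
```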

The Training Tile ran for the first time (at 2 GHz) in the week before August 19, 2021.

※ I have various thoughts about the 2-D mesh scale-out cluster concept from this point on, but I will leave them out this time.


Patents that are probably related to the packaging introduced in the previous entry:
"Voltage regulator module with cooling structure", US10887982B2
Mohamed Nasr, Aydin Nabovati, Jin Zhao
2018-03-22: Priority to US201862646835P
...
2019-09-26: Publication of US20190297723A1
2021-01-05: Publication of US10887982B2
2021-01-05: Application granted 

"Mechanical architecture for a multi-chip module", WO2020061330A1
Robert Yinan CAO, Mitchell Heschke, Mengzhi Pang, Shishuang Sun, Vijaykumar Krithivasan
2018-09-19: Priority to US201862733559P
...
2020-03-26: Publication of WO2020061330A1 



Tesla CFP8: 8-bit Configurable Floating-Point

Tesla AI Day gave no details on CFP8 (8-bit Configurable Floating-Point) supported by the Dojo chip, but re-checking the Tesla patent introduced in the previous entry:
"Scalable matrix node engine with configurable data formats", US20200348909A1
Debjit Das Sarma, William McGee, Emil TALPES
2019-05-03: Priority to US16/403,083
...
2020-11-05: Publication of US20200348909A1
Abstract
"The matrix processor instruction specifies a floating-point operand formatted with an exponent that has been biased with a specified bias."

FIG. 4 is a block diagram illustrating embodiments of an 8-bit floating-point format.
1-4-3 (400)
1-5-2 (410)
"Depending on the use case, one or another format is selected. In the event a high precision number is needed, more bit can be allocated to the mantissa and a format, such as format 400 with more mantissa bits than format 410, is selected. A format with more mantissa bits may be selected for performing gradient descent where very small deltas are required to preserve accuracy. As another example, a format with more mantissa bits may be selected for performing forward propagation to compute a cost function. As another optimization, each floating-point format utilizes a configurable bias."

FIG. 5 is a block diagram illustrating an embodiment of a 21-bit floating-point format.
1-7-13 (500)
"Using 8-bit elements as input elements and storing the intermediate results using a 21-bit format preserves the precision and accuracy required for training while also maintaining high input bandwidth to the matrix processor."

A 27-bit format also appears to be under consideration, but there is no figure explaining it.
27-bit Floating-Point
1-9-17


Hybrid 8-bit Floating Point (HFP8) Training and Inference for DNNs, IBM

However, training and inference using multiple 8-bit floating-point formats had already been presented by IBM in a NeurIPS 2019 poster, and the patent applications predate even that (and have since been granted):
"Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks", 
Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Xiaodong Cui, Wei Zhang, Kailash Gopalakrishnan, IBM, Poster, NeurIPS 2019
Training
(1-4-3) for forward propagation
(1-5-2) for backward propagation

"Finally, through hardware design experiments, we’ve confirmed that floating-point units (FPUs) that can support both formats are only 5% larger than the original FPUs that only support 1-5-2."
"For errors and gradients in HFP8 back-propagation, we employ the (1-5-2) FP8 format, which has proven to be optimal across various deep learning tasks."

Naturally, patent prosecution was already under way:
"Hybrid floating point representation for deep learning acceleration", US10963219B2
Naigang Wang, Jungwook CHOI, Kailash Gopalakrishnan, Ankur Agrawal, Silvia Melitta Mueller
2019-02-06: Application filed by IBM
...
2020-08-06: Publication of US20200249910A1
2021-03-30: Application granted
2021-03-30: Publication of US10963219B2

"Neural network circuitry having floating point format with asymmetric range"、US20210064976A1
Xiao Sun, Jungwook CHOI, Naigang Wangm, Chia-Yu Chen, Kailash Gopalakrishnan
2019-09-03: Application filed by IBM
...
2021-03-04: Publication of US20210064976A1

A chip implementation was presented at ISSCC 2021:
"Introducing the AI chip leading the world in precision scaling", Jan 18, 2021
"A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling", 
Ankur Agrawal, SaeKyu Lee, Joel Silberman, ..., Kailash Gopalakrishnan, IBM, ISSCC 2021
FP8-fwd (1, 4, 3)
FP8-bwd (1, 5, 2)
FP9 (1, 5, 3)
※ FP8-fwd and FP8-bwd are both widened to FP9 before the arithmetic is executed. Clever.
※ There are also other tricks in the precision of the arithmetic units to preserve accuracy, and so on.
"IBM Brings 8-Bit AI Training to Hardware", Feb 26, 2021

Patent filing dates for IBM and Tesla:
"Hybrid floating point representation for deep learning acceleration", IBM
2019-02-06: Application filed by IBM
"Scalable matrix node engine with configurable data formats", Tesla
2019-05-03: Priority to US16/403,083
2019-05-23: Application filed by Tesla
"Neural network circuitry having floating point format with asymmetric range", IBM
2019-09-03: Application filed by IBM
IBM's filings for using different floating-point representations of the same bit width in the forward and backward propagation of machine learning precede Tesla's by several months, so some caution will likely be needed around training algorithms built on Tesla's CFP8 (8-bit Configurable Floating-Point). (The two companies appear to assign the formats the opposite way round, but that difference is probably not essential.)


2x Dojo D1 Chip Module?

At Tesla AI Day, the D1 was shown as a single chip.

A packaging-related patent application that may relate to small-scale modularization for in-vehicle use:
※ The last listed inventor is Ganesh VENKATARAMANAN.
"Electronic assembly having multiple substrate segments", US20210005546A1
Mengzhi Pang, Shishuang Sun, Ganesh VENKATARAMANAN
2017-12-08: Priority to US201762596217P
...
2021-01-07: Publication of US20210005546A1




"Comparative Study on Power Module Architectures for Modularity and Scalability", Review, Dec 2020

Tesla says various in-house know-how went into realizing the Dojo Training Tile, so a survey paper on automotive power modules, which covers the Tesla Model 3 inverter, may be a useful reference:
"Comparative Study on Power Module Architectures for Modularity and Scalability", Review, Journal of Electronic Packaging, Dec 2020
A sample of the Tesla Model 3 inverter report:
[Sample] STMicroelectronics SiC Module: Tesla Model 3 Inverter, Jun 2018



"Scaled compute fabric for accelerated deep learning", Cerebras, Patent Application, Mar 5, 2020 (Aug 28, 2018)

The patent application for Cerebras's modular system configuration, mentioned in the previous entry:
"Scaled compute fabric for accelerated deep learning", WO2020044152A1
Gary R. Lauterbach, Sean Lie, Michael Morrison, Michael Edwin JAMES, Srikanth Arekapudi
2018-08-28: Priority to US201862723636P
...
2019-08-11: Application filed by Cerebras Systems Inc.
2020-03-05: Publication of WO2020044152A1

"An array of clusters communicates via high- speed serial channels.
The array and the channels are implemented on a Printed Circuit Board (PCB).
Each cluster comprises respective processing and memory elements. 
Each cluster is implemented via a plurality of 3D-stacked dice, 2.5D-stacked dice, or both in a Ball Grid Array (BGA). "

"A memory portion of the cluster is implemented via one or more High Bandwidth Memory (HBM) dice of the stacked dice."

"PE Cluster 481 comprises 64 instances of PE 499 on a die, each PE with 48KB of local memory and operable at 500MHz. PEs+HBM 483 comprises the HBM2 3D stack 3D-stacked on top of the PE die in a BGA package with approximately 800 pins and dissipating approximately 20 watts during operation. There is 4GB/64 = 64MB of non-local memory capacity per PE. "

"The 48KB local memory of each PE is used to store instructions (e.g., all or any portions of Task SW on PEs 260 of Fig. 2) and data, such as parameters and activations (e.g., all or any portions of (weight) wAD 1080 and (Activation) aA 1061 of Fig. 10B). "

[(64x PE (48 KB local Memory)) + HBM2: 4GB (64 MB/PE)] 
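
The per-PE and per-cluster memory figures in the excerpt are internally consistent (my arithmetic from the quoted numbers):

```python
# Cluster memory arithmetic from the quoted Cerebras figures (my arithmetic).
pes_per_cluster = 64
local_kb_per_pe = 48
hbm2_gb_per_cluster = 4

local_total_mb = pes_per_cluster * local_kb_per_pe / 1024     # SRAM per cluster
hbm_mb_per_pe = hbm2_gb_per_cluster * 1024 / pes_per_cluster  # HBM share per PE

print(f"Local SRAM per cluster: {local_total_mb:.0f} MB")      # 3 MB
print(f"Non-local HBM per PE:   {hbm_mb_per_pe:.0f} MB (quoted: 4GB/64 = 64MB)")
```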

※ Since this is only a patent application, whether it is actually being pursued is unknown, but it is worth watching.


Closing

A comment from nikonikoten:
"Still, who would have imagined a car maker developing HPC for AI. Pushing vertical integration of every technology is breathtaking. Has there ever been a manufacturer like this?"


Extras

Jim Keller left AMD and moved to Tesla in February 2016.
According to LinkedIn:
"Scalable matrix node engine with configurable data formats", US20200348909A1
Debjit Das Sarma, William McGee, Emil TALPES
the three inventors moved from AMD to Tesla in February and April 2016.
Tenstorrent, where Jim Keller became CTO in December 2020, was founded in May 2016; Ljubisa Bajic (Founder and CEO) also came from AMD.

Broadly speaking, both take a many-core approach.

2015 was most likely when AMD's Exascale efforts and architectural direction were settled: the APU integrating CPU and GPU, and the chiplets that led to EPYC. Did Jim Keller and the others leave AMD over that difference in direction?

"DOE's Fast Forward and Design Forward R&D Projects: Influence Exascale Hardware", Sandia National Lab, Feb 2015

"AMD Accelerating Technologies for Exascale Computing", AMD, CCDSC, Oct 3 2016

"Achieving Exascale Capabilities through Heterogeneous Computing", AMD, IEEE Micro, July-Aug. 2015
"Design and Analysis of an APU for Exascale Computing", AMD, Industry Session, HPCA 2017
※ What is the current status of the AMD APU???

Chiplets
"NoC Architectures for Silicon Interposer Systems", Natalie Enright Jerger, Ajaykumar Kannan. ..., Gabriel H. Loh, University of Toronto and AMD, Micro 2014
"Enabling Interposer-based Disintegration of Multi-core Processors", Ajaykumar Kannan, Natalie Enright Jerger, Gabriel H. Loh, University of Toronto and AMD, Micro 2015
https://dl.acm.org/doi/10.1145/2830772.2830808
"Exploiting Interposer Technologies to Disintegrate and Reintegrate Multicore Processors", Ajaykumar Kannan, Natalie Enright Jerger, Gabriel H. Loh, University of Toronto and AMD, IEEE Micro, May/Jun 2016

"Enabling Interposer-based Disintegration of Multi-Core Processors", Ajaykumar Kannan,  MSc Thesis, Nov, 2015
Acknowledgements "to thank Gabriel Loh at AMD Corp"
Gabriel H. Loh's ISCA 2021 talk and lecture slides (the video should be released eventually):
"Understanding Chiplets Today to Anticipate Future Integration Opportunities and Limits", Gabriel H. Loh, et al., AMD, Industry Track, ISCA 2021
"An Overview of Chiplet Technology for the AMD EPYC and Ryzen Processor Families", Gabriel Loh, Senior Fellow, AMD, IEEE SCV - Industry Spotlight, Aug 24, 2021

An interesting analysis by Ian Cutress (AnandTech) of AMD chiplets (NoC):
"Does an AMD Chiplet Have a Core Count Limit?", Sep 7, 2021

The quoted RTs are my past tweets related to AMD's NoC work.
NoC innovations will be essential for further chiplet progress.
