Video on the topic: deep learning GPU
Why deep learning uses graphics processing units [Hot Clip] / YTN Science
Dr. Ham Yu-geun built a climate-change prediction system based on deep learning.
It runs not on CPUs (central processing units)
but on GPUs (graphics processing units).
The clip looks at why GPUs are used.
More reading on the topic: deep learning GPU
[AI ChipLearning] GPU, the heart of AI… sparking the deep learning boom (Part 1)
Deep learning (DL) dramatically raised AI performance. … But when comparing computing power for AI, differences in the GPU (graphics processing unit) are cited more often than differences in the CPU.
Date Published: 9/1/2022
[Hardware] Why deep learning uses GPUs – Uno
In short: deep learning algorithms essentially perform a huge volume of simple arithmetic (matrix multiplication and the like), and GPUs are specialized for exactly this kind of simple arithmetic …
Date Published: 3/4/2022
[Deep Learning Lecture] GPUs and Deep Learning Software – Deep Play
Why does deep learning use the GPU rather than the CPU? The GPU's original purpose is rendering graphics. GPU makers fall broadly into NVIDIA and AMD.
Date Published: 3/15/2022
Deep Learning and GPUs – Naver blog
Before applying a deep learning model to test data organized as two-dimensional data … this post discusses the CPU and GPU, the main processing units for deep learning.
Date Published: 7/28/2021
A Graphics Card Buying Guide for Deep Learning – Wanho Choi
See the original article for details. A performance comparison table normalized to the RTX 2080 Ti (11 GB). Note 1) Nvidia Turing vs Volta vs Pascal …
Date Published: 4/7/2022
The Best GPUs for Deep Learning in 2020 – Tim Dettmers
Will AMD GPUs + ROCm ever catch up with NVIDIA GPUs + CUDA? When is it better to use the cloud vs a dedicated GPU …
Date Published: 10/30/2022
Best GPU for Deep Learning in 2022 (so far) – Lambda Labs
While waiting for NVIDIA’s next-generation consumer and professional GPUs, we decided to write a blog about the best GPU for Deep Learning …
Date Published: 2/18/2021
Recommended GPU instances – Deep Learning AMI
Choose a DLAMI GPU instance suited to your specific deep learning goal. … Amazon EC2 P3 instances provide up to 8 NVIDIA Tesla V100 GPUs.
Date Published: 1/24/2021
Article rating for the topic: deep learning GPU
- Author: YTN Science
- Views: 2,868
- Likes: 21
- Date Published: 2020-11-10
- Video URL: https://www.youtube.com/watch?v=yEXJmEnQazI
[AI ChipLearning] GPU, the heart of AI… sparking the deep learning boom (Part 1)
(Image source: Shutterstock) [Editor's note] Software (SW) sits at the center of artificial intelligence (AI) technology. Which model to build, which language to use, how to classify the data: the answers to such questions come from software. But implementing complex AI software requires high-performance hardware (HW). Machine learning (ML), one of AI's representative methodologies; deep learning (DL), which dramatically raised AI performance; artificial neural networks (ANN), modeled on the structure of human neurons: these concepts were already actively researched in the 1980s. Yet practical implementations are only a few years old, because computing performance could not keep up. Only with the arrival of HPC (high-performance computing), AI accelerators, AI processors, and high-performance memory did the AI era begin in earnest. Through ChipLearning, let us look at the hardware that implements AI: the industry and technology of the semiconductors, or 'chips', behind it.
When people talk about computing power, the first thing they name is the CPU (central processing unit). There was even a time when 'processor' simply meant the CPU.
In AI, however, computing power is more often classified by differences in the GPU (graphics processing unit) than in the CPU.
AI models such as natural language processing (NLP) and machine vision are graded by how fast the processor can perform parallel arithmetic, and the GPU is the archetypal parallel processor.
◆ Why GPUs for AI?… The birth of GPGPU
Today a variety of parallel processors beyond the GPU are being developed, such as NPUs (neural processing units) and IPUs (image processing units).
But in classical computer hardware, processors came in essentially two kinds: the CPU, which maximizes single-threaded performance, and the GPU, which maximizes parallel throughput.
The GPU was, as its name says, conceived to process computer graphics. Once it became clear that GPUs were also useful for general computation, GPGPU (General-Purpose computing on GPU) technology was developed.
With the development of GPGPU, AI performance grew rapidly. Michael Azoff, chief analyst at the UK market research firm Kisaco Research, has said that the use of deep learning took off from 2010, when NVIDIA opened up its GPGPU technology.
With deep learning, training a large neural network, which used to take months, shrank to somewhere between a few hours and a few days.
Difference between a CPU and a GPU (image: NVIDIA)
A GPU has a completely different structure from a CPU.
A CPU is built with a complex structure to execute tasks quickly in diverse environments. It excels at performing several calculations with a single instruction and at handling complex expressions.
A GPU, to run specialized workloads such as graphics at high speed, boldly strips out everything it does not need. Where a CPU consists of fewer than ten high-performance cores plus supporting units, a GPU is simply hundreds of cores bolted together.
A GPU therefore cannot run a job on its own; control is delegated to the CPU.
In exchange, the GPU specializes in raw arithmetic. A 3D workload that takes hours on a CPU alone can be finished by GPGPU in minutes to tens of minutes.
GPGPU is a natural fit for deep learning. Deep learning raises performance by scaling up models on big data, which means executing the same specific operations an enormous number of times; broadly, the more computation, the lower the error and the more accurate the system.
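To get a feel for the scale of that "enormous number of times", here is a back-of-the-envelope count of the multiply-add operations in a single dense neural-network layer (a sketch; the layer sizes are illustrative assumptions, not figures from the article):

```python
# Multiply-add count for one dense layer: every output element is a
# dot product of `in_features` multiply-adds, repeated per batch row.
def dense_layer_macs(batch: int, in_features: int, out_features: int) -> int:
    return batch * in_features * out_features

# Even a modest layer needs hundreds of millions of identical
# multiply-adds per forward pass, exactly the kind of uniform
# workload that a GPU's many simple cores are built for.
macs = dense_layer_macs(batch=256, in_features=1024, out_features=1024)
print(macs)  # 268435456
```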
A period had arrived in which advances in GPGPU technology, that is, improvements in GPU performance, translated directly into AI performance.
◆ Has NVIDIA monopolized the AI accelerator market?
Today NVIDIA nearly monopolizes the market for GPGPU-class GPUs. Across AI accelerators as a whole, not just GPUs, the company holds a share above 90%.
In May of last year the market research firm Liftr reported that 97.4% of the AI accelerators used in the four major cloud data centers, Amazon Web Services (AWS), Microsoft (MS) Azure, Google Cloud, and Alibaba Cloud, were NVIDIA accelerators. AMD GPUs accounted for about 1%, Xilinx FPGAs for 1%, and FPGAs from Intel, which had acquired Altera, for about 0.6%.
Accelerator share across the four major cloud providers (source: Liftr, May 2019)
AMD's recent acquisition of Xilinx raises its stake in the accelerator market somewhat, but even as a simple sum its roughly 2% share still falls far short.
According to Liftr, it is largely NVIDIA's own products that compete with one another in the AI accelerator market.
On AWS, the Tesla M60 accounted for 36%, the Tesla V100 for 29%, and the Tesla K80 for 24%, all NVIDIA products. Around 11% used the Xilinx Virtex UltraScale+ VU9P FPGA as an accelerator.
MS Azure and Google Cloud used NVIDIA Tesla GPUs exclusively, chiefly the Tesla K80, M60, P100, T4, and V100.
China's Alibaba mainly used NVIDIA GPUs but adopted solutions from several vendors: AMD FirePro S7150 GPUs at 10%, Intel Arria 10 GX 1150 FPGAs at 6%, and Xilinx UltraScale+ FPGAs at 2%.
AI accelerator usage by the four major cloud providers (source: Liftr, May 2019)
This year NVIDIA unveiled the A100 GPU, based on its new Ampere architecture. The A100 integrates 54 billion transistors and delivers up to 9.7 teraflops (TF; one trillion operations per second) of FP64 compute; NVIDIA claims up to 20x the performance of the Volta-based V100.
Recently AWS and Azure added the A100 to their data centers; both announced A100-based instances (the server offerings of a cloud provider) not long ago. [Related article] NVIDIA A100 GPUs power AWS's latest machine learning and HPC servers
In the latest MLPerf benchmarks, the A100 took top marks in both training and inference. In round 3 of the MLPerf training benchmark, NVIDIA and Google took first place across eight categories. In inference, the A100 came first, outperforming a CPU by a factor of 237. [Related article] NVIDIA tops latest MLPerf inference results… downsides are price and size
NVIDIA A100 GPU (photo: NVIDIA)
◆ How did NVIDIA come to monopolize AI accelerators?
NVIDIA's solo run atop the AI accelerator market, built on high-performance GPUs like the A100, looks set to continue. Beyond raw accelerator performance, the company also dominates AI software development through CUDA.
To use the NVIDIA GPGPUs found in AI accelerators, developers must learn NVIDIA's own CUDA software.
GPGPU technologies similar to CUDA exist, such as OpenCL and DirectCompute, but they are built to industry standards and, in an environment overwhelmingly made up of NVIDIA GPUs, tend to be less convenient or more limited than CUDA.
In short, they are easily passed over by developers accustomed to CUDA, which can access and control NVIDIA hardware directly.
According to NVIDIA's recent announcements, the number of developers actually using CUDA keeps growing: as of the end of August, more than two million developers worldwide had registered for the NVIDIA developer program.
NVIDIA says more than a billion CUDA-capable GPUs run its GPGPU platform, and that beyond registered developer-program members, a great many non-members use CUDA as well.
CUDA can be downloaded free without registration, and download counts far outstrip developer-program membership: on average about 39,000 people join the developer program each month, while CUDA downloads run to about 438,000.
This article continues in [AI ChipLearning] GPU, the heart of AI… sparking the deep learning boom (Part 2). [Related article] Can AMD's 'DNA' overtake the NVIDIA A100?
[Deep Learning Lecture] GPUs and Deep Learning Software
Deep Learning Frameworks (https://analyticsindiamag.com/evaluation-of-major-deep-learning-frameworks/)
CPU vs. GPU
Why does deep learning use the GPU rather than the CPU? The GPU's original purpose is rendering graphics. GPU makers fall broadly into NVIDIA and AMD, and communities endlessly debate which is better. For deep learning, however, NVIDIA's GPUs are the better choice: AMD GPUs are hard to use for deep learning, so when people say "GPU" in a deep learning context they almost always mean an NVIDIA GPU.
Rendering is the process (or technique) of producing a realistic three-dimensional image from a flat picture by accounting for shadows, color, and shading that vary with shape, position, lighting, and other external information. In other words, it is the computer-graphics step that adds realism by giving a flat-looking object depth through shadows and gradations of tone. [Naver encyclopedia] rendering
Differences between CPU and GPU
1) Type and number of cores
A CPU has fewer cores than a GPU, but each CPU core has far more computing power than a GPU core, so CPUs are stronger at sequential tasks. A GPU has many more cores, but each is weaker than a CPU core, so GPUs are stronger at parallel tasks. A current PC CPU typically has about 4 to 10 cores, and hyperthreading roughly doubles the thread count; an 8-core, 16-thread CPU, for example, can run 16 tasks in parallel. By contrast, NVIDIA's Titan XP GPU has 3,840 cores, and the more recently released 2080 Ti has 4,352. Even accounting for threading, the gap in core counts between CPU and GPU is more than 200-fold.
2) Memory
Another important difference between the CPU and the GPU is how memory is used. A CPU has cache memory but mostly uses memory shared with the rest of the system; 8 GB or 16 GB of RAM is typical in a PC. A GPU can also share system memory, but that creates a bottleneck between the GPU and memory and lowers performance, so a GPU carries its own memory on the card. The Titan XP, for example, has 12 GB of memory.
The CPU's strength is that it can do many different things (music, movies, deep learning, games) and handles sequential processing quickly. The GPU cannot do everything well, but anything that can be parallelized it can process fast. One such parallelizable job is matrix multiplication.
Parallelizable matrix multiplication
If each GPU core computes one row-by-column product and the results are then combined, the whole multiplication finishes faster than sequential processing would. The way to program a GPU is NVIDIA's CUDA library: with CUDA you can write GPU code in languages such as C and C++. Above CUDA sit higher-level APIs such as cuBLAS and cuDNN, which focus on implementing matrix operations and deep learning on top of CUDA, making the GPU far easier to use for these tasks. Writing raw CUDA is not easy; for example, without exploiting the memory hierarchy (such as cache memory), it is hard to write efficient code.
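The row-by-column independence described above can be sketched in plain Python (an illustration of the parallelizable pattern only, not actual CUDA code): every output element C[i][j] depends on nothing but row i of A and column j of B, so all (i, j) pairs could be computed at once.

```python
# Naive matrix multiply: each C[i][j] is an independent dot product,
# which is exactly what a GPU kernel distributes across its cores.
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```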
So for deep learning, nobody writes CUDA code from scratch: NVIDIA already provides a high-level API (cuDNN) for working with CUDA, and a variety of deep learning frameworks built on cuDNN already exist.
Framework and maintaining organization:
- Caffe2: Facebook
- PyTorch: Facebook
- TensorFlow: Google
- CNTK: Microsoft
- Paddle: Baidu
- MXNet: Amazon
New deep learning frameworks keep appearing, and TensorFlow (with Keras) and PyTorch can be considered the most widely used libraries at the moment. GluonCV, an MXNet-based computer vision toolkit, was recently presented at ICCV, a sign of how much development continues; it can be seen as a high-level MXNet library that makes state-of-the-art computer vision models easy to use.
A deep learning framework lets us implement deep learning efficiently. In detail, the advantages are:
1) Deep learning models are easy to build.
2) Gradients are computed quickly (and automatically).
3) GPUs are easy to use and run efficiently (the frameworks wrap cuDNN and cuBLAS, so performance is good).
PyTorch (https://pytorch.org/) is one of the representative deep learning frameworks used for the reasons above, and it has grown very popular recently. Its popularity comes from its ease of use, which simplifies the underlying technology, together with its use of a dynamic computational graph. TensorFlow builds a static computational graph once and then runs the model many times, whereas a dynamic computational graph is redrawn anew on every forward pass.
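The static-vs-dynamic distinction can be illustrated with a toy operation recorder (a hypothetical sketch, not PyTorch's actual machinery): the recorded "graph" is rebuilt on every forward pass, and its size depends on the input.

```python
# Toy dynamic computation graph: the list of recorded ops is rebuilt
# per forward pass, and its shape depends on the input (here, a
# variable-length sequence, as in an RNN).
def forward(sequence):
    graph = []           # ops recorded for this pass only
    h = 0.0
    for x in sequence:   # one "RNN step" node per input element
        h = 0.5 * h + x
        graph.append("step")
    return h, graph

_, g1 = forward([1.0, 2.0])
_, g2 = forward([1.0, 2.0, 3.0, 4.0])
print(len(g1), len(g2))  # 2 4 -- the graph differs per input
```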
cs231n lecture 8. Deep Learning Software
Dynamic computational graphs
What advantages does a dynamic computation graph have? It is more flexible when building complex architectures. For example, in image captioning with an RNN, the length of the output sequence (the text) varies with the input, so the computational graph changes dynamically with the input image. Since the structure changes on every forward pass, a dynamic computation graph is the better fit.
image captioning (https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2)
Neural module networks (Neuromodule) for the similar field of Visual Question Answering (VQA) are another example where dynamic graphs fit. A neural module network takes two inputs, an "image" and a "question", builds a custom architecture for answering, and outputs the answer. For example, give it a photo of a cat and ask "What color is the cat?": here too the network must be composed dynamically. Give it some photo and ask "Are there more cats than dogs in this image?", and the network must be composed differently from the previous one. Dynamic computation graphs make it possible to assemble such varied networks.
PyTorch was built as a library for research, so it may be less suited to production services that must respond to requests quickly. But with the arrival of a new project, the Open Neural Network Exchange (https://onnx.ai/), this constraint is loosening. ONNX is a project for machine learning model interoperability: a model is exported in the ONNX format and then imported in another environment. Using it, a PyTorch model can be converted into a production Caffe2 model for deployment. Beyond PyTorch, the ONNX format also works with CNTK, MXNet, TensorFlow, Keras, and Scikit-Learn.
Environment: Ubuntu 16.04, Jupyter Lab (with JupyterHub), Python 3.8, CUDA
1) Create a Python virtual environment and register it with Jupyter Lab
conda create --name pytorch python==3.8
source activate pytorch
pip install ipykernel
python -m ipykernel install --name pytorch
2) Check the CUDA version (the output below is from nvcc --version)
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
3) Install PyTorch
Selecting your environment on the PyTorch homepage prints the install command. I am on CUDA 9.0, but installing with the command produced by selecting CUDA 9.2 seems to work fine.
conda install pytorch torchvision cudatoolkit=9.2 -c pytorch
After installation finishes, start the pytorch kernel in Jupyter Lab and confirm that import torch succeeds.
Reference: Deep Learning with PyTorch (Vishnu Subramanian), chapters 1 and 2
Deep Learning and GPUs
Before discussing a solution that applies a deep learning model to test data organized as two-dimensional data, I want to talk about hardware: the CPU and the GPU, the main processing units for deep learning.
Deep learning is defined as a family of machine learning algorithms that attempt a high level of abstraction, distilling the essential content or features out of large volumes of data or complex material through combinations of nonlinear transformations; broadly, it is a branch of machine learning that teaches a computer something like a human way of thinking.
The heart of deep learning is prediction through classification. It finds patterns in masses of data so that a computer can distinguish objects much as a human does: it draws lines that separate the data well and repeatedly warps and combines those spaces, aiming to produce the optimal separating boundary within a complex space.
As the representative algorithms of deep learning advanced, through unsupervised and supervised learning, CNNs, RNNs, and beyond, the hardware that runs them evolved as well. Where the performance of the CPU, the traditional central processing unit, once determined everything, today the GPU's role determines performance.
CPUs and GPUs both read in data and compute answers, and both act as a computer's brain, but their internal structures differ greatly. A processor divides broadly into the arithmetic logic unit (ALU), which does the computation; the control unit (CU), which decodes and executes instructions; and caches, which hold data.
Difference between CPU and GPU
The CPU has a structure, as in Figure 1, specialized for serial (sequential) processing, handling data in the order the instructions arrive. Because it processes one instruction at a time, it does not need many ALUs. More than half of a CPU's die area is filled with cache memory. The cache exists to prevent the bottleneck that arises from the speed gap between the CPU and RAM: data to be processed is fetched from RAM in advance and staged in the CPU's cache to raise throughput.
The GPU, by contrast, processes many instructions at once in parallel. As Figure 2 shows, cache memory takes up little of the die, and a single core carries hundreds or thousands of ALUs.
A computer stores data digitally as 0s and 1s, and to store the real numbers of everyday life, beyond simple integers, it uses fixed-point or floating-point representations. With 10 bits, fixed point can express 1,024 values, 0 through 1023. Floating point spends some of those bits, say 2, on an exponent that shifts the position of the point, so an 8-bit significand (up to 255) scaled over several orders of magnitude can reach values as large as roughly 255,000. With the same bit count, floating point thus covers a far wider range of values. Fixed point, in turn, performs addition and subtraction quickly, while floating point handles multimedia data such as graphics and audio comparatively faster.
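The fixed-point versus floating-point trade-off can be poked at directly. Python floats are IEEE 754 doubles, a much larger format than the 10-bit illustration above, but the trade-off is the same: an enormous range, at the cost of exact representation.

```python
import sys

# Fixed point: 10 unsigned integer bits give exactly 2**10 evenly
# spaced values (0..1023), nothing more.
fixed_values = 2 ** 10
print(fixed_values)  # 1024

# Floating point spends bits on an exponent, trading even spacing
# for dynamic range:
print(sys.float_info.max > 10 ** 300)  # True: a huge representable range
print(0.1 + 0.2 == 0.3)                # False: at the cost of exactness
```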
Thus the CPU suits everyday work that leans on integers and fixed-point data, such as web browsing and document editing, while the GPU suits time-consuming multimedia work, especially 3D graphics and sound. Depending on the nature of the instructions and data being processed, sometimes the CPU is faster and sometimes the GPU is.
The Best GPUs for Deep Learning in 2020 — An In-depth Analysis
Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores? How to make a cost-efficient choice? This blog post will delve into these questions, tackle common misconceptions, give you an intuitive understanding of how to think about GPUs, and will lend you advice, which will help you to make a choice that is right for you.
This blog post is designed to give you different levels of understanding of GPUs and the new Ampere series GPUs from NVIDIA. You have the choice: (1) If you are not interested in the details of how GPUs work, what makes a GPU fast, and what is unique about the new NVIDIA RTX 30 Ampere series, you can skip right to the performance and performance per dollar charts and the recommendation section. These form the core of the blog post and the most valuable content.
(2) If you worry about specific questions, I have answered and addressed the most common questions and misconceptions in the later part of the blog post.
(3) If you want to get an in-depth understanding of how GPUs and Tensor Cores work, the best is to read the blog post from start to finish. You might want to skip a section or two based on your understanding of the presented topics.
I will head each major section with a small summary, which might help you to decide if you want to read the section or not.
This blog post is structured in the following way. First, I will explain what makes a GPU fast. I will discuss CPUs vs GPUs, Tensor Cores, memory bandwidth, and the memory hierarchy of GPUs and how these relate to deep learning performance. These explanations might help you get a more intuitive sense of what to look for in a GPU. Then I will make theoretical estimates for GPU performance and align them with some marketing benchmarks from NVIDIA to get reliable, unbiased performance data. I discuss the unique features of the new NVIDIA RTX 30 Ampere GPU series that are worth considering if you buy a GPU. From there, I make GPU recommendations for 1-2, 4, 8 GPU setups, and GPU clusters. After that follows a Q&A section of common questions posed to me in Twitter threads; in that section, I will also address common misconceptions and some miscellaneous issues, such as cloud vs desktop, cooling, AMD vs NVIDIA, and others.
How do GPUs work?
If you use GPUs frequently, it is useful to understand how they work. This knowledge will come in handy in understanding why GPUs might be slow in some cases and fast in others. In turn, you might be able to understand better why you need a GPU in the first place and how other future hardware options might be able to compete. You can skip this section if you just want the useful performance numbers and arguments to help you decide which GPU to buy. The best high-level explanation for the question of how GPUs work is my following Quora answer:
This is a high-level explanation that explains quite well why GPUs are better than CPUs for deep learning. If we look at the details, we can understand what makes one GPU better than another.
The Most Important GPU Specs for Deep Learning Processing Speed
This section can help you build a more intuitive understanding of how to think about deep learning performance. This understanding will help you to evaluate future GPUs by yourself.
Tensor Cores reduce the cycles needed for multiply-and-add operations 16-fold — in my example, for a 32×32 matrix, from 128 cycles to 8 cycles.
Tensor Cores reduce the reliance on repetitive shared memory access, thus saving additional cycles for memory access.
Tensor Cores are so fast that computation is no longer a bottleneck. The only bottleneck is getting data to the Tensor Cores.
There are now enough cheap GPUs that almost everyone can afford a GPU with Tensor Cores. That is why I only recommend GPUs with Tensor Cores. It is useful to understand how they work to appreciate the importance of these computational units specialized for matrix multiplication. Here I will show a simple example of A*B=C matrix multiplication, where all matrices have a size of 32×32, and what the computational pattern looks like with and without Tensor Cores. This is a simplified example, not the exact way a high-performing matrix multiplication kernel would be written, but it has all the basics. A CUDA programmer would take this as a first “draft” and then optimize it step-by-step with concepts like double buffering, register optimization, occupancy optimization, instruction-level parallelism, and many others, which I will not discuss at this point.
To understand this example fully, you have to understand the concept of cycles. If a processor runs at 1 GHz, it can do 10^9 cycles per second. Each cycle represents an opportunity for computation. However, most operations take longer than one cycle, which creates a pipeline: for one operation to start, it must wait the number of cycles it takes for the previous operation to finish. This is also called the latency of the operation.
Here are some important cycle timings or latencies for operations:
Global memory access (up to 48GB): ~200 cycles
Shared memory access (up to 164 kb per Streaming Multiprocessor): ~20 cycles
Fused multiplication and addition (FFMA): 4 cycles
Tensor Core matrix multiply: 1 cycle
Furthermore, you should know that the smallest unit of threads on a GPU is a pack of 32 threads — this is called a warp. Warps usually operate in a synchronous pattern — threads within a warp have to wait for each other. All memory operations on the GPU are optimized for warps. For example, loading from global memory happens at a granularity of 32*4 bytes, exactly 32 floats, exactly one float for each thread in a warp. We can have up to 32 warps = 1024 threads in a streaming multiprocessor (SM), the GPU-equivalent of a CPU core. The resources of an SM are divided up among all active warps. This means that sometimes we want to run fewer warps to have more registers/shared memory/Tensor Core resources per warp.
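The warp arithmetic quoted above works out as follows (simple arithmetic on the numbers given in the text, nothing more):

```python
# Warp-level numbers for NVIDIA GPUs, as quoted in the text.
threads_per_warp = 32
bytes_per_float = 4

# One global-memory load is served at warp granularity:
print(threads_per_warp * bytes_per_float)  # 128 bytes = 32 floats

# An SM hosts at most 1024 threads, i.e. 32 warps:
print(1024 // threads_per_warp)  # 32
```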
For both of the following examples, we assume we have the same computational resources. For this small example of a 32×32 matrix multiply, we use 8 SMs (about 10% of an RTX 3090) and 8 warps per SM.
Matrix multiplication without Tensor Cores
If we want to do an A*B=C matrix multiply, where each matrix is of size 32×32, then we want to load memory that we repeatedly access into shared memory because its latency is about ten times lower (200 cycles vs 20 cycles). A memory block in shared memory is often referred to as a memory tile or just a tile. Loading two 32×32 floats into a shared memory tile can happen in parallel by using 2*32 warps. We have 8 SMs with 8 warps each, so due to parallelization, we only need to do a single sequential load from global to shared memory, which takes 200 cycles.
To do the matrix multiplication, we now need to load a vector of 32 numbers from shared memory A and shared memory B and perform a fused multiply-and-accumulate (FFMA). Then store the outputs in registers C. We divide the work so that each SM does 8x dot products (32×32) to compute 8 outputs of C. Why this is exactly 8 (4 in older algorithms) is very technical. I recommend Scott Gray’s blog post on matrix multiplication to understand this. This means we have 8x shared memory access at the cost of 20 cycles each and 8 FFMA operations (32 in parallel), which cost 4 cycles each. In total, we thus have a cost of:
200 cycles (global memory) + 8*20 cycles (shared memory) + 8*4 cycles (FFMA) = 392 cycles
Let’s look at the cycle cost of using Tensor Cores.
Matrix multiplication with Tensor Cores
With Tensor Cores, we can perform a 4×4 matrix multiplication in one cycle. To do that, we first need to get memory into the Tensor Core. Similarly to the above, we need to read from global memory (200 cycles) and store in shared memory. To do a 32×32 matrix multiply, we need to do 8×8=64 Tensor Core operations. A single SM has 8 Tensor Cores. So with 8 SMs, we have 64 Tensor Cores — just the number that we need! We can transfer the data from shared memory to the Tensor Cores with one memory transfer (20 cycles) and then do those 64 parallel Tensor Core operations (1 cycle). This means the total cost for Tensor Cores matrix multiplication, in this case, is:
200 cycles (global memory) + 20 cycles (shared memory) + 1 cycle (Tensor Core) = 221 cycles.
Thus we reduce the matrix multiplication cost significantly from 392 cycles to 221 cycles via Tensor Cores. In this simplified case, the Tensor Cores reduced the cost of both shared memory access and FFMA operations.
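Both cycle counts can be reproduced from the latency table above (the same simplified cost model as in the text, not a real kernel's cost):

```python
# Latencies from the text (cycles).
GLOBAL, SHARED, FFMA, TENSOR = 200, 20, 4, 1

# 32x32 matmul without Tensor Cores: one global load, then 8 shared
# loads and 8 FFMA rounds per SM.
without_tc = GLOBAL + 8 * SHARED + 8 * FFMA
# With Tensor Cores: one global load, one shared transfer, one
# round of 64 parallel Tensor Core ops.
with_tc = GLOBAL + SHARED + TENSOR

print(without_tc, with_tc)             # 392 221
print(round(without_tc / with_tc, 2))  # 1.77
```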
While this example roughly follows the sequence of computational steps for both with and without Tensor Cores, please note that this is a very simplified example. Real cases of matrix multiplication involve much larger shared memory tiles and slightly different computational patterns.
However, I believe from this example, it is also clear why the next attribute, memory bandwidth, is so crucial for Tensor-Core-equipped GPUs. Since global memory is the most considerable portion of cycle cost for matrix multiplication with Tensor Cores, we would even have faster GPUs if the global memory latency could be reduced. We can do this by either increasing the clock frequency of the memory (more cycles per second, but also more heat and higher energy requirements) or by increasing the number of elements that can be transferred at any one time (bus width).
From the previous section, we have seen that Tensor Cores are very fast. So fast, in fact, that they are idle most of the time as they are waiting for memory to arrive from global memory. For example, during BERT Large training, which uses huge matrices — the larger, the better for Tensor Cores — we have a Tensor Core TFLOPS utilization of about 30%, meaning that 70% of the time, Tensor Cores are idle.
This means that when comparing two GPUs with Tensor Cores, one of the single best indicators for each GPU’s performance is their memory bandwidth. For example, The A100 GPU has 1,555 GB/s memory bandwidth vs the 900 GB/s of the V100. As such, a basic estimate of speedup of an A100 vs V100 is 1555/900 = 1.73x.
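That bandwidth-ratio estimate is a one-liner (using the figures quoted in the text):

```python
# First-order speedup estimate from memory bandwidth alone, since
# Tensor-Core GPUs spend most of their time waiting on global memory.
a100_bw, v100_bw = 1555, 900  # GB/s
speedup = round(a100_bw / v100_bw, 2)
print(speedup)  # 1.73
```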
Shared Memory / L1 Cache Size / Registers
Since memory transfers to the Tensor Cores are the limiting factor in performance, we are looking for other GPU attributes that enable faster memory transfer to Tensor Cores. Shared memory, L1 Cache, and amount of registers used are all related. To understand how a memory hierarchy enables faster memory transfers, it helps to understand how matrix multiplication is performed on a GPU.
To perform matrix multiplication, we exploit the memory hierarchy of a GPU that goes from slow global memory to fast local shared memory, to lightning-fast registers. However, the faster the memory, the smaller it is. As such, we need to separate the matrix into smaller matrices. We perform matrix multiplication across these smaller tiles in local shared memory that is fast and close to the streaming multiprocessor (SM) — the equivalent of a CPU core. With Tensor Cores, we go a step further: We take each tile and load a part of these tiles into Tensor Cores. A matrix memory tile in shared memory is ~10-50x faster than the global GPU memory, whereas the Tensor Cores’ registers are ~200x faster than the global GPU memory.
Having larger tiles means we can reuse more memory. I wrote about this in detail in my TPU vs GPU blog post. In fact, you can see TPUs as having very, very, large tiles for each Tensor Core. As such, TPUs can reuse much more memory with each transfer from global memory, which makes them a little bit more efficient at matrix multiplications than GPUs.
Each tile size is determined by how much memory we have per streaming multiprocessor (SM) — the equivalent to a “CPU core” on a GPU. We have the following shared memory sizes on the following architectures:
Volta: 96kb shared memory / 32 kb L1
Turing: 64kb shared memory / 32 kb L1
Ampere: 164 kb shared memory / 32 kb L1
We see that Ampere has a much larger shared memory allowing for larger tile sizes, which reduces global memory access. Thus, Ampere can make better use of the overall memory bandwidth on the GPU memory. This improves performance by roughly 2-5%. The performance boost is particularly pronounced for huge matrices.
The Ampere Tensor Cores have another advantage in that they share more data between threads. This reduces the register usage. Registers are limited to 64k per streaming multiprocessor (SM) or 255 per thread. Comparing the Volta vs Ampere Tensor Core, the Ampere Tensor Core uses 3x fewer registers, allowing for more tensor cores to be active for each shared memory tile. In other words, we can feed 3x as many Tensor Cores with the same amount of registers. However, since bandwidth is still the bottleneck, you will only see tiny increases in actual vs theoretical TFLOPS. The new Tensor Cores improve performance by roughly 1-3%.
Overall, you can see that the Ampere architecture is optimized to make the available memory bandwidth more effective by using an improved memory hierarchy: from global memory to shared memory tiles, to register tiles for Tensor Cores.
Estimating Ampere Deep Learning Performance
Theoretical estimates based on memory bandwidth and the improved memory hierarchy of Ampere GPUs predict a speedup of 1.78x to 1.87x.
NVIDIA provides accuracy benchmark data of Tesla A100 and V100 GPUs. These data are biased for marketing purposes, but it is possible to build a debiased model of these data.
Debiased benchmark data suggests that the Tesla A100 compared to the V100 is 1.70x faster for NLP and 1.45x faster for computer vision.
This section is for those who want to understand the more technical details of how I derive the performance estimates for Ampere GPUs. If you do not care about these technical aspects, it is safe to skip this section.
Theoretical Ampere Speed Estimates
Putting together the reasoning above, we would expect the difference between two Tensor-Core-equipped GPU architectures to be mostly about memory bandwidth. Additional benefits come from more shared memory / L1 cache and better register usage in Tensor Cores.
If we take the Tesla A100 GPU bandwidth vs Tesla V100 bandwidth, we get a speedup of 1555/900 = 1.73x. Additionally, I would expect a 2-5% speedup from the larger shared memory and 1-3% from the improved Tensor Cores. This puts the speedup range between 1.78x and 1.87x. With similar reasoning, you would be able to estimate the speedup of other Ampere series GPUs compared to a Tesla V100.
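The 1.78x to 1.87x range follows directly from multiplying the three factors (the same assumptions as in the text):

```python
# Theoretical Ampere speedup range: bandwidth ratio times the low/high
# estimates for shared-memory and Tensor-Core improvements.
bandwidth = 1555 / 900            # ~1.73x
low = bandwidth * 1.02 * 1.01     # +2% shared memory, +1% Tensor Cores
high = bandwidth * 1.05 * 1.03    # +5% shared memory, +3% Tensor Cores
print(round(low, 2), round(high, 2))  # 1.78 1.87
```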
Practical Ampere Speed Estimates
Suppose we have an estimate for one GPU of a GPU-architecture like Ampere, Turing, or Volta. It is easy to extrapolate these results to other GPUs from the same architecture/series. Luckily, NVIDIA already benchmarked the A100 vs V100 across a wide range of computer vision and natural language understanding tasks. Unfortunately, NVIDIA made sure that these numbers are not directly comparable by using different batch sizes and the number of GPUs whenever possible to favor results for the A100. So in a sense, the benchmark numbers are partially honest, partially marketing numbers. In general, you could argue that using larger batch sizes is fair, as the A100 has more memory. Still, to compare GPU architectures, we should evaluate unbiased memory performance with the same batch size.
To get an unbiased estimate, we can scale the V100 and A100 results in two ways: (1) account for the differences in batch size, (2) account for the differences in using 1 vs 8 GPUs. We are lucky that we can find such an estimate for both biases in the data that NVIDIA provides.
Doubling the batch size increases throughput in terms of images/s (CNNs) by 13.6%. I benchmarked the same problem for transformers on my RTX Titan and found, surprisingly, the very same result: 13.5% — it appears that this is a robust estimate.
As we parallelize networks across more and more GPUs, we lose performance due to some networking overhead. The A100 8x GPU system has better networking (NVLink 3.0) than the V100 8x GPU system (NVLink 2.0) — this is another confounding factor. Looking directly at the data from NVIDIA, we can find that for CNNs, a system with 8x A100 has a 5% lower overhead than a system of 8x V100. This means if going from 1x A100 to 8x A100 gives you a speedup of, say, 7.00x, then going from 1x V100 to 8x V100 only gives you a speedup of 6.67x. For transformers, the figure is 7%.
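One possible way to fold these two corrections into a single de-biasing step is sketched below. The exact form of the correction is my assumption, not the author's published method, and the raw 2.00x input is hypothetical.

```python
# Strip the throughput gained from a doubled batch size and from better
# 8-GPU scaling out of a raw measured A100-vs-V100 speedup (a sketch).
def debias(raw_speedup: float, batch_gain: float, scaling_gain: float) -> float:
    return raw_speedup / (batch_gain * scaling_gain)

# CNN figures from the text: +13.6% from batch doubling, 5% lower
# multi-GPU overhead on the A100 system. Raw 2.00x is hypothetical.
print(round(debias(2.00, 1.136, 1.05), 2))  # 1.68
```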
Using these figures, we can estimate the speedup for a few specific deep learning architectures from the direct data that NVIDIA provides. The Tesla A100 offers the following speedup over the Tesla V100:
Transformer (12 layer, Machine Translation, WMT14 en-de): 1.70x
Thus, the figures are a bit lower than the theoretical estimate for computer vision. This might be due to smaller tensor dimensions, overhead from operations that are needed to prepare the matrix multiplication like img2col or Fast Fourier Transform (FFT), or operations that cannot saturate the GPU (final layers are often relatively small). It could also be artifacts of the specific architectures (grouped convolution).
The practical transformer estimate is very close to the theoretical estimate. This is probably because algorithms for huge matrices are very straightforward. I will use these practical estimates to calculate the cost efficiency of GPUs.
Possible Biases in Estimates
The estimates above are for A100 vs V100. In the past, NVIDIA sneaked unannounced performance degradations into the “gaming” RTX GPUs: (1) Decreased Tensor Core utilization, (2) gaming fans for cooling, (3) disabled peer-to-peer GPU transfers. It might be possible that there are unannounced performance degradations in the RTX 30 series compared to the full Ampere A100.
As of now, one of these degradations was found: Tensor Core performance was decreased so that RTX 30 series GPUs are not as good as Quadro cards for deep learning purposes. This was also done for the RTX 20 series, so it is nothing new, but this time it was also done for the Titan equivalent card, the RTX 3090. The RTX Titan did not have performance degradation enabled.
I will update this blog post as information about further unannounced performance degradation becomes available.
Additional Considerations for Ampere / RTX 30 Series
Ampere allows for sparse network training, which accelerates training by a factor of up to 2x.
Sparse network training is still rarely used but will make Ampere future-proof.
Ampere has new low-precision data types, which make using low precision much easier, but not necessarily faster than on previous GPUs.
The new fan design is excellent if you have space between GPUs, but it is unclear if multiple GPUs with no space in-between them will be efficiently cooled.
The 3-slot design of the RTX 3090 makes 4x GPU builds problematic. Possible solutions are 2-slot variants or the use of PCIe extenders.
A 4x RTX 3090 setup will need more power than any standard power supply unit on the market can provide right now.
The new NVIDIA Ampere RTX 30 series has additional benefits over the NVIDIA Turing RTX 20 series, such as sparse network training and inference. Other features, such as the new data types, should be seen more as an ease-of-use feature, as they provide the same performance boost as Turing does but without any extra programming required.
Sparse Network Training
Ampere allows for automatic fine-grained structured sparse matrix multiplication at dense speeds. How does this work? Take a weight matrix and slice it into pieces of 4 elements. Now imagine 2 of these 4 elements to be zero. Figure 1 shows how this could look.
Figure 1: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. The figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.
When you multiply this sparse weight matrix with some dense inputs, the sparse matrix tensor core feature in Ampere automatically compresses the sparse matrix to a dense representation that is half the size as can be seen in Figure 2. After this compression, the densely compressed matrix tile is fed into the tensor core which computes a matrix multiplication of twice the usual size. This effectively yields a 2x speedup since the bandwidth requirements during matrix multiplication from shared memory are halved.
Figure 2: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. The figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.
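The compression step can be sketched in plain NumPy. This is only an illustration of the 2:4 pattern described above, not NVIDIA's actual Tensor Core kernel; the helper names `compress_2to4` and `sparse_matvec` are my own:

```python
import numpy as np

def compress_2to4(w):
    """Compress a matrix whose every group of 4 columns has >= 2 zeros
    into a half-width dense value matrix plus the kept column indices."""
    rows, cols = w.shape
    assert cols % 4 == 0
    vals = np.zeros((rows, cols // 2), dtype=w.dtype)
    idx = np.zeros((rows, cols // 2), dtype=np.int64)
    for r in range(rows):
        for g in range(cols // 4):
            block = w[r, 4 * g : 4 * g + 4]
            keep = np.sort(np.argsort(np.abs(block))[-2:])  # 2 largest magnitudes
            vals[r, 2 * g : 2 * g + 2] = block[keep]
            idx[r, 2 * g : 2 * g + 2] = 4 * g + keep        # original positions
    return vals, idx

def sparse_matvec(vals, idx, x):
    """Multiply the compressed matrix by a dense vector x."""
    return (vals * x[idx]).sum(axis=1)
```

Because the value matrix is half as wide, a kernel reading it from shared memory moves half the bytes per output element, which is where the up-to-2x speedup comes from.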
I was working on sparse network training in my research and I also wrote a blog post about sparse training. One criticism of my work was that “You reduce the FLOPS required for the network, but it does not yield speedups because GPUs cannot do fast sparse matrix multiplication.” Well, with the addition of the sparse matrix multiplication feature for Tensor Cores, my algorithm, or other sparse training algorithms, now actually provide speedups of up to 2x during training.
Figure 3: The sparse training algorithm that I developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layer. Read more about my work in my sparse training blog post.
While this feature is still experimental and training sparse networks is not commonplace yet, having this feature on your GPU means you are ready for the future of sparse training.
New Data Types
In my work, I’ve previously shown that new data types can improve stability during low-precision backpropagation.
Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantizes the range [0, 0.9], while all previous bits are used for the exponent. This allows numbers that are both large and small to be represented dynamically with high precision.
Currently, if you want stable backpropagation with 16-bit floating-point numbers (FP16), the big problem is that ordinary FP16 data types only support numbers in the range [-65,504, 65,504]. If your gradient slips past this range, it explodes into NaN values. To prevent this during FP16 training, we usually perform loss scaling: you multiply the loss by a small constant before backpropagating so that the gradients stay within range.
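A toy NumPy example of the overflow problem and the scaling trick (an illustration only, not a framework's mixed-precision implementation):

```python
import numpy as np

# A gradient outside [-65504, 65504] overflows in FP16
big_grad = 1e5
assert np.isinf(np.float16(big_grad))      # becomes inf, then NaN in updates

# Loss scaling: backpropagate on (scale * loss) so gradients stay in range,
# then unscale in FP32 before the weight update.
scale = 1.0 / 1024.0
scaled_grad = np.float16(big_grad * scale)   # finite in FP16
recovered = np.float32(scaled_grad) / scale  # approximately the true gradient
```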
The Brain Float 16 format (BF16) uses more bits for the exponent, such that the range of possible numbers is the same as for FP32: [-3*10^38, 3*10^38]. BF16 has less precision, that is, fewer significant digits, but gradient precision is not that important for learning. With BF16, you no longer need to do any loss scaling or worry about the gradient blowing up quickly. As such, we should see an increase in training stability with the BF16 format at the cost of a slight loss of precision.
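The ranges quoted above follow directly from the bit layouts; a quick sketch, assuming the standard IEEE-style layouts (`max_finite` is a hypothetical helper):

```python
def max_finite(exp_bits, frac_bits):
    """Largest finite value of a float format, given its exponent and
    fraction (mantissa) bit counts; all-ones exponent is reserved for inf/NaN."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2.0 ** -frac_bits) * 2.0 ** max_exp

fp16_max = max_finite(5, 10)   # 65504.0
bf16_max = max_finite(8, 7)    # ~3.39e38: same exponent range as FP32
fp32_max = max_finite(8, 23)   # ~3.40e38
```

BF16 trades 3 fraction bits for 3 extra exponent bits, which is why its range matches FP32 while its precision drops.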
What this means for you: With BF16 precision, training might be more stable than with FP16 precision while providing the same speedups. With TF32 precision, you get near-FP32 stability while providing speedups close to FP16. The good thing is, to use these data types, you can just replace FP32 with TF32 and FP16 with BF16 — no code changes required!
Overall, though, these new data types can be seen as lazy data types in the sense that you could have gotten all the benefits with the old data types with some additional programming efforts (proper loss scaling, initialization, normalization, using Apex). As such, these data types do not provide speedups but rather improve ease of use of low precision for training.
New Fan Design / Thermal Issues
The new fan design for the RTX 30 series features both a blower fan and a push/pull fan. The design is ingenious and will be very effective if you have space between GPUs. So if you have 2 GPUs and one slot space between them (+3 PCIe slots), you will be fine, and there will be no cooling issues. However, it is unclear how the GPUs will perform if you have them stacked next to each other in a setup with more than 2 GPUs. The blower fan will be able to exhaust through the bracket away from the other GPUs, but it is impossible to tell how well that works since the blower fan is of a different design than before. So my recommendation: If you want to buy 1 GPU or 2 GPUs in a 4 PCIe slot setup, then there should be no issues. However, if you’re going to use 3-4 RTX 30 GPUs next to each other, I would wait for thermal performance reports to know if you need different GPU coolers, PCIe extenders, or other solutions. I will update the blog post with this information as it becomes available.
To overcome thermal issues, water cooling will provide a solution in any case. Many vendors offer water cooling blocks for RTX 3080/RTX 3090 cards, which will keep them cool even in a 4x GPU setup. Beware of all-in-one water cooling solutions for GPUs, though, if you want to run a 4x GPU setup: it is difficult to spread out the radiators in most desktop cases.
Another solution to the cooling problem is to buy PCIe extenders and spread the GPUs within the case. This is very effective, and other fellow PhD students at the University of Washington and I use this setup with great success. It does not look pretty, but it keeps your GPUs cool! It can also help if you do not have enough space to spread the GPUs. For example, if you can find the space within a desktop computer case, it might be possible to buy standard 3-slot-width RTX 3090 and spread them with PCIe extenders within the case. With this, you might solve both the space issue and cooling issue for a 4x RTX 3090 setup with a single simple solution.
Figure 5: 4x GPUs with PCIe extenders. It looks like a mess, but it is very effective for cooling. I used this rig for 2 years and cooling is excellent despite problematic RTX 2080 Ti Founders Edition GPUs.
3-slot Design and Power Issues
The RTX 3090 is a 3-slot GPU, so one will not be able to use it in a 4x setup with the default fan design from NVIDIA. This is kind of justified because it runs at 350W TDP, and it will be difficult to cool in a multi-GPU 2-slot setting. The RTX 3080 is only slightly better at 320W TDP, and cooling a 4x RTX 3080 setup will also be very difficult.
It is also difficult to power a 4x 350W = 1400W system in the 4x RTX 3090 case. Power supply units (PSUs) of 1600W are readily available, but having only 200W to power the CPU and motherboard can be too tight. The components’ maximum power is only used if the components are fully utilized, and in deep learning, the CPU is usually only under weak load. With that, a 1600W PSU might work quite well with a 4x RTX 3080 build, but for a 4x RTX 3090 build, it is better to look for high wattage PSUs (+1700W). Some of my followers have had great success with cryptomining PSUs — have a look in the comment section for more info about that. Otherwise, it is important to note that not all outlets support PSUs above 1600W, especially in the US. This is the reason why in the US, there is currently not a standard desktop PSU above 1600W on the market. If you get a server or cryptomining PSUs, beware of the form factor — make sure it fits into your computer case.
Power Limiting: An Elegant Solution to Solve the Power Problem?
It is possible to set a power limit on your GPUs. So you would be able to programmatically set the power limit of an RTX 3090 to 300W instead of its standard 350W. In a 4x GPU system, that is a saving of 200W, which might just be enough to make a 4x RTX 3090 system feasible with a 1600W PSU. It also helps to keep the GPUs cool. So setting a power limit can solve the two major problems of a 4x RTX 3080 or 4x RTX 3090 setup, cooling and power, at the same time. For a 4x setup, you still need effective blower GPUs (and the standard design may prove adequate for this), but this resolves the PSU problem.
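Setting the limit is a one-liner with `nvidia-smi`; a small sketch (the helper `power_limit_cmd` is hypothetical; `-pl` requires root, and the limit resets on reboot unless persistence mode is enabled):

```python
import shlex
import subprocess

def power_limit_cmd(index: int, watts: int) -> str:
    """Build the nvidia-smi call that caps GPU `index` at `watts` watts."""
    return f"sudo nvidia-smi -i {index} -pl {watts}"

cmd = power_limit_cmd(0, 300)  # cap GPU 0 at 300 W instead of 350 W
# On a machine with an NVIDIA driver installed, run it with:
# subprocess.run(shlex.split(cmd), check=True)
```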
Figure 6: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly, and the fans run more quietly.
You might ask, “Doesn’t this slow down the GPU?” Yes, it does, but the question is by how much. I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 under different power limits to test this. I benchmarked the time for 500 mini-batches for BERT Large during inference (excluding the softmax layer). I chose BERT Large inference since, from my experience, this is the deep learning model that stresses the GPU the most. As such, I would expect power limiting to have the most massive slowdown for this model, and the slowdowns reported here are probably close to the maximum slowdowns that you can expect. The results are shown in Figure 7.
Figure 7: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).
As we can see, setting the power limit does not seriously affect performance. Limiting the power by 50W — more than enough to handle 4x RTX 3090 — decreases performance by only 7%.
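The trade-off can be checked with quick arithmetic: a 50 W cap costs ~7% of throughput but still improves performance per watt (toy numbers taken from the measurements above):

```python
baseline_watts, capped_watts = 350, 300
slowdown = 0.07                       # measured ~7% at the 300 W limit
throughput = 1.0 - slowdown           # relative to the uncapped GPU
perf_per_watt = throughput / (capped_watts / baseline_watts)
# ~1.085: about 8.5% more work per watt despite the slowdown
```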
GPU Deep Learning Performance
The following benchmark includes not only the Tesla A100 vs Tesla V100 benchmarks: I built a model that fits those data as well as four other benchmarks based on the Titan V, Titan RTX, RTX 2080 Ti, and RTX 2080 [1,2,3,4]. In an update, I also factored in the recently discovered performance degradation in RTX 30 series GPUs. And since I wrote this blog post, we now also have the first solid benchmark for computer vision, which confirms my numbers.
Beyond this, I scaled intermediate cards like the RTX 2070, RTX 2060, or the Quadro RTX 6000 & 8000 cards by interpolating between those benchmark data points. Usually, within an architecture, GPUs scale quite linearly with respect to streaming multiprocessors and bandwidth, and my within-architecture model is based on that.
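A minimal sketch of such a within-architecture model, assuming an even split between compute and bandwidth scaling (the 0.5/0.5 weights and the function `relative_perf` are my own illustrative assumptions, not the exact fitted model):

```python
def relative_perf(sms, bandwidth_gbs, ref_sms, ref_bandwidth_gbs):
    """Estimate performance relative to a reference GPU of the same
    architecture by scaling linearly in SM count and memory bandwidth."""
    return 0.5 * (sms / ref_sms) + 0.5 * (bandwidth_gbs / ref_bandwidth_gbs)

# A hypothetical card with half the SMs and half the bandwidth of the
# reference lands at 0.5x its estimated performance.
half_card = relative_perf(34, 224, 68, 448)
```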
I collected only benchmark data for mixed-precision FP16 training since I believe there is no good reason why one should use FP32 training.
Figure 8: Normalized GPU deep learning performance relative to an RTX 2080 Ti.
Compared to an RTX 2080 Ti, the RTX 3090 yields a speedup of 1.41x for convolutional networks and 1.35x for transformers while having a 15% higher release price. Thus the Ampere RTX 30 yields a substantial improvement over the Turing RTX 20 series in raw performance and is also cost-effective (if you do not have to upgrade your power supply and so forth).
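The cost-effectiveness claim is simple arithmetic, using the speedup and price figures just given:

```python
speedup = 1.41       # RTX 3090 vs RTX 2080 Ti, convolutional networks
price_ratio = 1.15   # ~15% higher release price
gain = speedup / price_ratio   # ~1.23x more performance per dollar
```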
GPU Deep Learning Performance per Dollar
What is the GPU that gives you the best bang for your buck? It depends on the cost of the overall system. If you have an expensive system, it makes sense to invest in more expensive GPUs.
Here I have three PCIe 3.0 builds, which I use as base costs for 2/4 GPU systems. I take these base costs and add the GPU costs on top of them. The GPU costs are the mean of the GPU’s Amazon and eBay prices. For the new Ampere GPUs, I use just the release price. Together with the performance values from above, this yields performance per dollar values for these systems of GPUs. For the 8-GPU system, I use a Supermicro barebone — the industry standard for RTX servers — as baseline cost. Note that these bar charts do not account for memory requirements. You should think about your memory requirements first and then look for the best option in the chart. Here are some rough guidelines for memory:
Using pretrained transformers; training a small transformer from scratch: >= 11 GB
Training large transformers or convolutional nets in research / production: >= 24 GB
Prototyping neural networks (either transformers or convolutional nets): >= 10 GB
Kaggle competitions: >= 8 GB
Applying computer vision: >= 10 GB
Neural networks for video: 24 GB
Reinforcement learning: >= 10 GB + a strong deep learning desktop with the largest Threadripper or EPYC CPU you can afford.
Figure 9: Normalized deep learning performance-per-dollar relative to RTX 3080.
Figure 10: Normalized 4-GPU deep learning performance-per-dollar relative to RTX 3080.
Figure 11: Normalized 8-GPU deep learning performance-per-dollar relative to RTX 3080.
The first thing I need to emphasize again: if you choose a GPU, you need to make sure that it has enough memory for what you want to do. The steps in selecting the best deep learning GPU for you should be:
(1) What do I want to do with the GPU(s): Kaggle competitions, machine learning, learning deep learning, hacking on small projects (GAN fun or big language models?), doing research in computer vision / natural language processing / other domains, or something else? (2) How much memory do I need for what I want to do? (3) Use the cost/performance charts from above to figure out which GPU that fulfills the memory criteria is best for you. (4) Are there additional caveats for the GPU that I chose? For example, if it is an RTX 3090, can I fit it into my computer? Does my power supply unit (PSU) have enough wattage to support my GPU(s)? Will heat dissipation be a problem, or can I somehow cool the GPU effectively?
Some of these details require you to self-reflect about what you want and maybe research a bit about how much memory the GPUs have that other people use for your area of interest. I can give you some guidance, but I cannot cover all areas here.
When do I need >= 11 GB of Memory?
I mentioned before that you should have at least 11 GB of memory if you work with transformers, and better yet, >= 24 GB of memory if you do research on transformers. This is so because most previous models that are pretrained have pretty steep memory requirements, and these models were trained with at least RTX 2080 Ti GPUs that have 11 GB of memory. Thus having less than 11 GB can create scenarios where it is difficult to run certain models.
Other areas that require large amounts of memory are anything medical imaging, some state-of-the-art computer vision models, anything with very large images (GAN, style transfer).
In general, if you seek to build models that give you the edge in competition, be it research, industry, or Kaggle competition, extra memory will provide you with a possible edge.
When is <11 GB of Memory Okay?

The RTX 3070 and RTX 3080 are mighty cards, but they lack a bit of memory. For many tasks, however, you do not need that amount of memory.

The RTX 3070 is perfect if you want to learn deep learning. This is so because the basic skills of training most architectures can be learned by just scaling them down a bit or using slightly smaller input images. If I were to learn deep learning again, I would probably roll with one RTX 3070, or even multiple if I had the money to spare.

The RTX 3080 is currently by far the most cost-efficient card and thus ideal for prototyping. For prototyping, you want the largest memory that is still cheap. By prototyping, I mean prototyping in any area: research, competitive Kaggle, hacking ideas/models for a startup, experimenting with research code. For all these applications, the RTX 3080 is the best GPU.

Suppose I were to lead a research lab/startup. I would put 66-80% of my budget into RTX 3080 machines and 20-33% into “rollout” RTX 3090 machines with a robust water cooling setup. The idea is that the RTX 3080 is much more cost-effective and can be shared via a Slurm cluster setup as prototyping machines. Since prototyping should be done in an agile way, it should be done with smaller models and smaller datasets. The RTX 3080 is perfect for this. Once students/colleagues have a great prototype model, they can roll out the prototype on the RTX 3090 machines and scale to larger models.

How can I fit +24GB models into 10GB memory?

It is a bit contradictory that I just said that if you want to train big models, you need lots of memory. But we have been struggling with big models a lot since the onslaught of BERT, and solutions exist to train 24 GB models in 10 GB of memory. If you do not have the money, or want to avoid the cooling/power issues of the RTX 3090, you can get an RTX 3080 and just accept that you need to do some extra programming by adding memory-saving techniques.
There are enough techniques to make it work, and they are becoming more and more commonplace. Here is a list of common techniques:

FP16/BF16 training (Apex)
Gradient checkpointing (only store some of the activations and recompute them in the backward pass)
GPU-to-CPU memory swapping (swap layers not needed to the CPU; swap them back in just-in-time for backprop)
Model parallelism (each GPU holds a part of each layer; supported by fairseq)
Pipeline parallelism (each GPU holds a couple of layers of the network)
ZeRO parallelism (each GPU holds partial layers)
3D parallelism (model + pipeline + ZeRO)
CPU optimizer state (store and update Adam/momentum on the CPU while the next GPU forward pass is happening)

If you are not afraid to tinker a bit and implement some of these techniques (which usually means integrating packages that support them with your code), you will be able to fit that 24 GB network on a smaller GPU. With that hacking spirit, the RTX 3080, or any GPU with less than 11 GB of memory, might be a great GPU for you.

Is upgrading from an RTX 20 to an RTX 30 GPU worth it? Or should I wait for the next GPU?

If I were you, I would think twice about upgrading from an RTX 20 GPU to an RTX 30 GPU. You might be eager to get that 30% faster training or so, but it can be a big headache to deal with all the other RTX 30 GPU problems: the power supply, the cooling, and you need to sell your old GPUs. Is it all worth it?

I could imagine that if you need the extra memory, for example, to go from an RTX 2080 Ti to an RTX 3090, or if you want a huge boost in performance, say from an RTX 2060 to an RTX 3080, then it can be well worth it. But if you stay “in your league,” that is, going from a Titan RTX to an RTX 3090, or from an RTX 2080 Ti to an RTX 3080, it is hardly worth it. You gain a bit of performance, but you will have headaches with the power supply and cooling, and you are a good chunk of money lighter. I do not think it is worth it.
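Of these techniques, gradient checkpointing is the easiest to sketch in isolation. Here is a toy pure-Python version for a chain of scalar linear layers, entirely illustrative; the function names are my own, and real implementations (e.g., `torch.utils.checkpoint`) operate on tensors inside autograd:

```python
def forward_with_checkpoints(ws, x, every=2):
    """Run y_i = w_i * y_{i-1}, storing only every `every`-th activation."""
    ckpts = {0: x}
    y = x
    for i, w in enumerate(ws, start=1):
        y = w * y
        if i % every == 0:
            ckpts[i] = y
    return y, ckpts

def backward_with_recompute(ws, ckpts):
    """Gradients dL/dw_i for L = y_n, recomputing dropped activations
    from the nearest earlier checkpoint instead of storing them all."""
    n = len(ws)
    grads = [0.0] * n
    upstream = 1.0  # dL/dy_n
    for i in range(n, 0, -1):
        start = max(k for k in ckpts if k <= i - 1)
        y = ckpts[start]
        for j in range(start, i - 1):   # recompute y_{i-1}
            y = ws[j] * y
        grads[i - 1] = upstream * y      # dL/dw_i = upstream * y_{i-1}
        upstream = upstream * ws[i - 1]  # dL/dy_{i-1}
    return grads
```

The memory saving is the point: only every `every`-th activation is kept alive, at the cost of recomputing the rest during the backward pass.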
I would wait until a better alternative to GDDR6X memory is released. This will make GPUs use less power and might even make them faster. Maybe wait a year and see how the landscape has changed since then.

It is worth mentioning that technology is slowing down anyway, so waiting a year might net you a GPU that stays current for more than 5 years. There will be a time when cheap HBM memory can be manufactured. If that time comes and you buy such a GPU, you will likely stay on it for more than 7 years. Such GPUs might be available in 3-4 years. As such, playing the waiting game can be a pretty smart choice.

General Recommendations

In general, the RTX 30 series is very powerful, and I recommend these GPUs. Be aware of memory, as discussed in the previous section, but also of power requirements and cooling. If you have one PCIe slot between GPUs, cooling will be no problem at all. Otherwise, with RTX 30 cards, make sure you get water cooling, PCIe extenders, or effective blower cards (data in the next weeks will show whether the NVIDIA fan design is adequate).

In general, I would recommend the RTX 3090 for anyone who can afford it. It will equip you not only for now but will remain a very effective card for the next 3-7 years. As such, it is a good investment that will stay strong. It is unlikely that HBM memory will become cheap within three years, so the next GPU would only be about 25% better than the RTX 3090. We will probably see cheap HBM memory in 3-5 years, so after that, you definitely want to upgrade.

For PhD students, those who want to become PhD students, or those who are getting started with a PhD, I recommend RTX 3080 GPUs for prototyping and RTX 3090 GPUs for doing rollouts. If your department has a GPU cluster, I would highly recommend a Slurm GPU cluster with 8-GPU machines. However, since the cooling of RTX 3080 GPUs in an 8x GPU server setup is questionable, it is unlikely that you will be able to run these.
If the cooling works, I would recommend 66-80% RTX 3080 GPUs and the rest of the GPUs being either RTX 3090 or Tesla A100. If the cooling does not work, I would recommend 66-80% RTX 2080 and the rest being Tesla A100s. Again, it is crucial that you make sure heating issues in your GPU servers are taken care of before you commit to specific GPUs for your servers. More on GPU clusters below.

If you have multiple RTX 3090s, make sure you choose solutions that guarantee sufficient cooling and power. I will update the blog post on what a proper setup is as more and more data rolls in.

For anyone without strictly competitive requirements (research, competitive Kaggle, competitive startups), I would recommend, in order: used RTX 2080 Ti, used RTX 2070, new RTX 3080, new RTX 3070. If you do not like used cards, buy the RTX 3080. If you cannot afford the RTX 3080, go with the RTX 3070. All of these cards are very cost-effective solutions and will ensure fast training of most networks. If you use the right memory tricks and are fine with some extra programming, there are now enough tricks to make a 24 GB neural network fit into a 10 GB GPU. As such, if you accept a bit of uncertainty and some extra programming, the RTX 3080 might also be a better choice compared to the RTX 3090, since performance is quite similar between these cards.

If your budget is limited and an RTX 3070 is too expensive, a used RTX 2070 is about $260 on eBay. It is not clear yet if there will be an RTX 3060, but if you are on a limited budget, it might also be worth waiting a bit longer. If priced similarly to the RTX 2060 and GTX 1060, you can expect a price of $250 to $300 and pretty strong performance.

If your budget is limited, but you still need large amounts of memory, then old, used Tesla or Quadro cards from eBay might be best for you. The Quadro M6000 has 24 GB of memory and goes for $400 on eBay. The Tesla K80 is a 2-in-1 GPU with 2x 12 GB of memory for about $200.
These cards are slow compared to more modern cards, but the extra memory can come in handy for specific projects where memory is paramount.

Recommendations for GPU Clusters

GPU cluster design depends highly on use. For a +1,024 GPU system, networking is paramount, but if users only use at most 32 GPUs at a time on such a system, investing in powerful networking infrastructure is a waste. Here, I would go with similar prototyping-rollout reasoning as mentioned in the RTX 3080 vs RTX 3090 case.

In general, RTX cards are banned from data centers via the CUDA license agreement. However, universities can often get an exemption from this rule. It is worth getting in touch with someone from NVIDIA to ask for an exemption. If you are allowed to use RTX cards, I would recommend standard Supermicro 8-GPU systems with RTX 3080 or RTX 3090 GPUs (if sufficient cooling can be assured). A small set of 8x A100 nodes ensures effective “rollout” after prototyping, especially if there is no guarantee that the 8x RTX 3090 servers can be cooled sufficiently. In this case, I would recommend the A100 over the RTX 6000 / RTX 8000, because the A100 is pretty cost-effective and future-proof.

If you want to train vast networks on a GPU cluster (+256 GPUs), I would recommend the NVIDIA DGX SuperPOD system with A100 GPUs. At a +256 GPU scale, networking becomes paramount. If you want to scale to more than 256 GPUs, you need a highly optimized system, and putting together standard solutions is no longer cutting it. Especially at a scale of +1,024 GPUs, the only competitive solutions on the market are the Google TPU Pod and the NVIDIA DGX SuperPod. At that scale, I would prefer the Google TPU Pod, since their custom-made networking infrastructure seems to be superior to the NVIDIA DGX SuperPod system, although both systems come quite close to each other.
The GPU system offers a bit more flexibility for deep learning models and applications than the TPU system, while the TPU system supports larger models and provides better scaling. So both systems have their advantages and disadvantages.

Do Not Buy These GPUs

I do not recommend buying multiple RTX Founders Editions (any) or RTX Titans unless you have PCIe extenders to solve their cooling problems. They will simply run too hot, and their performance will be way below what I report in the charts above. 4x RTX 2080 Ti Founders Edition GPUs will readily dash beyond 90C, will throttle down their core clock, and will run slower than properly cooled RTX 2070 GPUs.

I do not recommend buying a Tesla V100 or A100 unless you are forced to buy them (companies banned from putting RTX cards in data centers) or unless you want to train very large networks on a huge GPU cluster; these GPUs are just not very cost-effective.

If you can afford better cards, do not buy GTX 16 series cards. These cards do not have tensor cores and, as such, provide relatively poor deep learning performance. I would choose a used RTX 2070 / RTX 2060 / RTX 2060 Super over a GTX 16 series card. If you are short on money, however, the GTX 16 series cards can be a good option.

When Is it Best Not to Buy New GPUs?

If you already have RTX 2080 Tis or better GPUs, an upgrade to the RTX 3090 may not make sense. Your GPUs are already pretty good, and the performance gains are negligible compared to worrying about the PSU and cooling problems of the new power-hungry RTX 30 cards; it is just not worth it. The only reason I would want to upgrade from 4x RTX 2080 Ti to 4x RTX 3090 would be if I did research on huge transformers or other highly compute-dependent network training. However, if memory is a problem, you may first consider some memory tricks to fit large models on your 4x RTX 2080 Tis before upgrading to RTX 3090s.

If you have one or multiple RTX 2070 GPUs, I would think twice about an upgrade. These are pretty good GPUs.
Reselling those GPUs on eBay and getting RTX 3090s could make sense, though, if you find yourself often limited by the 8 GB of memory. This reasoning is valid for many other GPUs: if memory is tight, an upgrade is right.

Questions & Answers & Misconceptions

Summary:

PCIe 4.0 and PCIe lanes do not matter in 2x GPU setups. For 4x GPU setups, they still do not matter much.
RTX 3090 and RTX 3080 cooling will be problematic. Use water-cooled cards or PCIe extenders.
NVLink is not useful; it is only useful for GPU clusters.
You can use different types of GPUs in one computer (e.g., GTX 1080 + RTX 2080 + RTX 3090), but you will not be able to parallelize across them efficiently.
You will need Infiniband +50Gbit/s networking to parallelize training across more than two machines.
AMD CPUs are cheaper than Intel CPUs; Intel CPUs have almost no advantage.
Despite heroic software engineering efforts, AMD GPUs + ROCm will probably not be able to compete with NVIDIA for at least 1-2 years, due to the lacking community and the lack of a Tensor Core equivalent.
Cloud GPUs are useful if you use them for less than 1 year. After that, a desktop is the cheaper solution.

Do I need PCIe 4.0?

Generally, no. PCIe 4.0 is great if you have a GPU cluster. It is okay if you have an 8x GPU machine, but otherwise, it does not yield many benefits. It allows better parallelization and a bit faster data transfer. Data transfers are not a bottleneck in any application. In computer vision, the data storage can be a bottleneck in the data transfer pipeline, but not the PCIe transfer from CPU to GPU. So there is no real reason to get a PCIe 4.0 setup for most people. The benefit will be maybe 1-7% better parallelization in a 4-GPU setup.

Do I need 8x/16x PCIe lanes?

Same as with PCIe 4.0: generally, no. PCIe lanes are needed for parallelization and fast data transfers, which are seldom a bottleneck. Operating GPUs on 4x lanes is fine, especially if you only have 2 GPUs.
For a 4-GPU setup, I would prefer 8x lanes per GPU, but running them at 4x lanes will probably only decrease performance by around 5-10% if you parallelize across all 4 GPUs.

How do I fit 4x RTX 3090 if they take up 3 PCIe slots each?

You need to get one of the two-slot variants, or you can try to spread them out with PCIe extenders. Besides space, you should also immediately think about cooling and a suitable PSU. The most manageable solution seems to be 4x RTX 3090 EVGA Hydro Copper cards with a custom water cooling loop. This will keep the cards very cool. EVGA has produced hydro copper versions of GPUs for years, and I believe you can trust the quality of their water-cooled GPUs. There might also be other variants that are cheaper, though.

PCIe extenders might also solve both space and cooling issues, but you need to make sure that you have enough space in your case to spread out the GPUs. Make sure your PCIe extenders are long enough!

How do I cool 4x RTX 3090 or 4x RTX 3080?

See the previous section.

Can I use multiple GPUs of different GPU types?

Yes, you can! But you cannot parallelize efficiently across GPUs of different types. I could imagine that a 3x RTX 3070 + 1x RTX 3090 setup could make sense for a prototyping-rollout split. On the other hand, parallelizing across 4x RTX 3070 GPUs would be very fast if you can make the model fit onto those GPUs. The only other reason why you would want to do this that I can think of is if you want to keep using your old GPUs. This works just fine, but parallelization across those GPUs will be inefficient, since the fastest GPU will wait for the slowest GPU to catch up to a synchronization point (usually the gradient update).

What is NVLink, and is it useful?

Generally, NVLink is not useful. NVLink is a high-speed interconnect between GPUs. It is useful if you have a GPU cluster with +128 GPUs. Otherwise, it yields almost no benefits over standard PCIe transfers.

I do not have enough money, even for the cheapest GPUs you recommend.
What can I do?

Definitely buy used GPUs. A used RTX 2070 ($400) or RTX 2060 ($300) is great. If you cannot afford that, the next best option is to try to get a used GTX 1070 ($220) or GTX 1070 Ti ($230). If that is too expensive, try a used GTX 980 Ti (6 GB, $150) or a used GTX 1650 Super ($190). If that is too expensive, it is best to roll with free GPU cloud services. These usually provide a GPU for a limited amount of time/credits, after which you need to pay. Rotate between services and accounts until you can afford your own GPU.

What about the carbon footprint of GPUs?

I built a carbon calculator for calculating your carbon footprint as an academic (carbon from flights to conferences + GPU time). The calculator can also be used to calculate a pure GPU carbon footprint. You will find that GPUs produce much, much more carbon than international flights. As such, you should make sure you have a green source of energy if you do not want to have an astronomical carbon footprint. If no electricity provider in your area provides green energy, the best way is to buy carbon offsets.

Many people are skeptical about carbon offsets. Do they work? Are they scams? I believe skepticism just hurts in this case, because not doing anything would be more harmful than risking the probability of getting scammed. If you worry about scams, just invest in a portfolio of offsets to minimize risk.

I worked on a project that produced carbon offsets about ten years ago. The carbon offsets were generated by burning leaking methane from mines in China. UN officials tracked the process, and they required clean digital data and physical inspections of the project site. In that case, the carbon offsets that were produced were highly reliable. I believe many other projects have similar quality standards.

What do I need to parallelize across two machines?

If you want to be on the safe side, you should get at least +50Gbit/s network cards to gain speedups if you want to parallelize across machines.
I recommend having at least an EDR InfiniBand setup, meaning a network card with at least 50 Gbit/s bandwidth. Two EDR cards with cable are about $500 on eBay. In some cases, you might be able to get away with 10 Gbit/s Ethernet, but this is usually only the case for special networks (certain convolutional networks) or if you use certain algorithms (Microsoft DeepSpeed).

Is the sparse matrix multiplication feature suitable for sparse matrices in general?

It does not seem so. Since the granularity of the sparse matrix requires 2 zero-valued elements in every 4 elements, sparse matrices need to be quite structured. It might be possible to adjust the algorithm slightly by pooling 4 values into a compressed representation of 2 values, but this also means that precise, arbitrary sparse matrix multiplication is not possible with Ampere GPUs.

Do I need an Intel CPU to power a multi-GPU setup?

I do not recommend Intel CPUs unless you heavily use CPUs in Kaggle competitions (heavy linear algebra on the CPU). Even for Kaggle competitions, AMD CPUs are still great, though. AMD CPUs are cheaper and better than Intel CPUs in general for deep learning. For a 4x GPU build, my go-to CPU would be a Threadripper. We built dozens of systems at our university with Threadrippers, and they all work great, with no complaints yet. For 8x GPU systems, I would usually go with CPUs that your vendor has experience with. CPU and PCIe/system reliability is more important in 8x systems than straight performance or straight cost-effectiveness.

Does computer case design matter for cooling?

No. GPUs are usually perfectly cooled if there is at least a small gap between GPUs. Case design will give you 1-3 °C better temperatures, while space between GPUs will provide you with 10-30 °C improvements. The bottom line: if you have space between GPUs, cooling does not matter.
If you have no space between GPUs, you need the right cooler design (blower fan) or another solution (water cooling, PCIe extenders), but in either case, case design and case fans do not matter.

Will AMD GPUs + ROCm ever catch up with NVIDIA GPUs + CUDA?

Not in the next 1-2 years. It is a three-way problem: Tensor Cores, software, and community.

AMD GPUs are great in terms of pure silicon: great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or an equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive. Rumors suggested that some data center card with a Tensor Core equivalent was planned for 2020, but no new data has emerged since then. Having only data center cards with a Tensor Core equivalent would also mean that few could afford such AMD GPUs, which would give NVIDIA a competitive advantage.

Let's say AMD introduces a Tensor-Core-like hardware feature in the future. Then many people would say, "But there is no software that works for AMD GPUs! How am I supposed to use them?" This is mostly a misconception. The AMD software stack via ROCm has come a long way, and support via PyTorch is excellent. While I have not seen many experience reports for AMD GPUs + PyTorch, all the software features are integrated. It seems that if you pick any network, you will be just fine running it on AMD GPUs. So here AMD has come a long way, and this issue is more or less solved.

However, even if you solve the software and the lack of Tensor Cores, AMD still has a problem: the lack of community. If you have a problem with NVIDIA GPUs, you can Google the problem and find a solution. That builds a lot of trust in NVIDIA GPUs. You have the infrastructure that makes using NVIDIA GPUs easy (any deep learning framework works, any scientific problem is well supported).
You have the hacks and tricks that make usage of NVIDIA GPUs a breeze (e.g., apex). You can find experts on NVIDIA GPUs and programming around every other corner, while I know of far fewer AMD GPU experts.

In the community aspect, AMD is a bit like Julia vs Python. Julia has a lot of potential, and many would say, and rightly so, that it is the superior programming language for scientific computing. Yet, Julia is barely used compared to Python. This is because the Python community is very strong. NumPy, SciPy, and Pandas are powerful software packages that a large number of people congregate around. This is very similar to the NVIDIA vs AMD issue.

Thus, it is likely that AMD will not catch up until a Tensor Core equivalent is introduced (1/2 to 1 year?) and a strong community is built around ROCm (2 years?). AMD will always snatch a part of the market share in specific subgroups (e.g., cryptocurrency mining, data centers). Still, in deep learning, NVIDIA will likely keep its monopoly for at least a couple more years.

When is it better to use the cloud vs a dedicated GPU desktop/server?

Rule of thumb: if you expect to do deep learning for longer than a year, it is cheaper to get a desktop GPU. Otherwise, cloud instances are preferable unless you have extensive cloud computing skills and want the benefits of scaling the number of GPUs up and down at will.

The exact point in time when a cloud GPU becomes more expensive than a desktop depends highly on the service that you are using, and it is best to do a little math on this yourself. Below I do an example calculation for an AWS V100 spot instance with 1x V100 and compare it to the price of a desktop with a single RTX 3090 (similar performance). The desktop with the RTX 3090 costs $2,200 (2-GPU barebone + RTX 3090). Additionally, assuming you are in the US, there is an additional $0.12 per kWh for electricity. This compares to $2.14 per hour for an AWS on-demand instance.
At 15% utilization per year, the desktop uses:

(350 W (GPU) + 100 W (CPU)) * 0.15 (utilization) * 24 hours * 365 days = 591 kWh per year

So 591 kWh of electricity per year, which is an additional $71. The break-even point for a desktop vs a cloud instance at 15% utilization (you use the cloud instance 15% of the time during the day) would be about 300 days ($2,311 vs $2,270):

$2.14/h * 0.15 (utilization) * 24 hours * 300 days = $2,311

So if you expect to run deep learning models after 300 days, it is better to buy a desktop instead of using AWS on-demand instances.

AWS spot instances are a bit cheaper, at about $0.90 per hour. However, many users on Twitter told me that on-demand instances are a nightmare, and that spot instances are hell. AWS itself lists the average frequency of interruptions of V100 GPU spot instances to be above 20%. This means you need a pretty good spot-instance management infrastructure to make spot instances worth it. But if you have it, AWS spot instances and similar services are pretty competitive. You would need to own and run a desktop for 20 months to break even compared to AWS spot instances. This means that if you expect to run deep learning workloads in the next 20 months, a desktop machine will be cheaper (and easier to use).

You can do similar calculations for any cloud service to decide between a cloud service and a desktop. Common utilization rates are the following:

PhD student personal desktop: < 15%
PhD student slurm GPU cluster: > 35%
Company-wide slurm research cluster: > 60%
In general, utilization rates are lower for professions where thinking about cutting-edge ideas is more important than developing practical products. Some areas have low utilization rates (interpretability research), while other areas have much higher rates (machine translation, language modeling). In general, the utilization of personal machines is almost always overestimated. Most personal systems have a utilization rate between 5% and 10%. This is why I would highly recommend slurm GPU clusters for research groups and companies instead of individual desktop GPU machines.
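The effect of utilization on real cost can be made concrete. A minimal sketch, reusing the desktop figures from the calculation above ($2,200 hardware, 450 W system draw, $0.12/kWh) together with an assumed 3-year lifetime (the lifetime is my assumption, not a figure from the text):

```python
def cost_per_used_gpu_hour(hardware_price, lifetime_years, utilization,
                           system_watts=450, price_per_kwh=0.12):
    """Amortized hardware + electricity cost per hour of actual GPU use."""
    hours_used = lifetime_years * 365 * 24 * utilization
    electricity = system_watts / 1000 * hours_used * price_per_kwh
    return (hardware_price + electricity) / hours_used

# Personal desktop at a realistic 7% utilization vs a shared cluster at 35%:
personal = cost_per_used_gpu_hour(2200, 3, 0.07)   # ~$1.25 per used hour
cluster = cost_per_used_gpu_hour(2200, 3, 0.35)    # ~$0.29 per used hour
```

Low utilization makes every productive GPU-hour several times more expensive, which is the economic argument for shared slurm clusters.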
Best GPU overall: RTX 3080 and RTX 3090.
GPUs to avoid (as an individual): Any Tesla card; any Quadro card; any Founders Edition card; Titan RTX, Titan V, Titan XP.
Cost-efficient but expensive: RTX 3080.
Cost-efficient and cheaper: RTX 3070, RTX 2060 Super.
I have little money: Buy used cards. Hierarchy: RTX 2070 ($400), RTX 2060 ($300), GTX 1070 ($220), GTX 1070 Ti ($230), GTX 1650 Super ($190), GTX 980 Ti (6GB $150).
I have almost no money: There are a lot of startups that promote their clouds: use free cloud credits and switch company accounts until you can afford a GPU.
I do Kaggle: RTX 3070.
I am a competitive computer vision, pretraining, or machine translation researcher: 4x RTX 3090. Wait until working builds with good cooling and enough power are confirmed (I will update this blog post).
I am an NLP researcher: If you do not work on machine translation, language modeling, or pretraining of any kind, an RTX 3080 will be sufficient and cost-effective.
I started deep learning, and I am serious about it: Start with an RTX 3070. If you are still serious after 6-9 months, sell your RTX 3070 and buy 4x RTX 3080. Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GPUs, and buy something more appropriate after about three years (next-gen RTX 40s GPUs).
I want to try deep learning, but I am not serious about it: The RTX 2060 Super is excellent but may require a new power supply to be used. If your motherboard has a PCIe x16 slot and you have a power supply with around 300 W, a GTX 1050 Ti is a great option since it will not require any other computer components to work with your desktop computer.
GPU Cluster used for parallel models across less than 128 GPUs: If you are allowed to buy RTX GPUs for your cluster: 66% 8x RTX 3080 and 33% 8x RTX 3090 (only if sufficient cooling is guaranteed/confirmed). If cooling of RTX 3090s is not sufficient buy 33% RTX 6000 GPUs or 8x Tesla A100 instead. If you are not allowed to buy RTX GPUs, I would probably go with 8x A100 Supermicro nodes or 8x RTX 6000 nodes.
GPU Cluster used for parallel models across 128 GPUs or more: Think about 8x Tesla A100 setups. If you use more than 512 GPUs, you should think about getting a DGX A100 SuperPOD system that fits your scale.
2020-09-20: Added discussion of using power limiting to run 4x RTX 3090 systems. Added older GPUs to the performance and cost/performance charts. Added figures for sparse matrix multiplication.
2020-09-07: Added NVIDIA Ampere series GPUs. Included lots of good-to-know GPU details.
2019-04-03: Added RTX Titan and GTX 1660 Ti. Updated TPU section. Added startup hardware discussion.
2018-11-26: Added discussion of overheating issues of RTX cards.
2018-11-05: Added RTX 2070 and updated recommendations. Updated charts with hard performance data. Updated TPU section.
2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis
2017-04-09: Added cost-efficiency analysis; updated recommendation with NVIDIA Titan Xp
2017-03-19: Cleaned up blog post; added GTX 1080 Ti
2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
2015-02-23: Updated GPU recommendations and memory calculations
2014-09-28: Added emphasis for memory requirement of CNNs
I want to thank Agrin Hilmkil, Ari Holtzman, Gabriel Ilharco, Nam Pho for their excellent feedback on the current version of this blog post.
For past updates of this blog post, I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances. I want to thank Brad Nemire for providing me with an RTX Titan for benchmarking purposes.
Best GPU for Deep Learning in 2022 (so far)
While waiting for NVIDIA’s next-generation consumer and professional GPUs, we decided to write a blog about the best GPU for Deep Learning currently available as of March 2022. For readers who use pre-Ampere generation GPUs and are considering an upgrade, here is what you need to know:
Ampere GPUs have significant improvement over pre-Ampere GPUs based on the throughput and throughput-per-dollar metrics. This is especially true for language models, where Ampere Tensor Cores can leverage structured sparsity.
Ampere GPUs do not offer a significant upgrade on memory. For example, if you have a Quadro RTX 8000 from the Turing generation, upgrading it to its Ampere successor, the A6000, would not enable you to train a larger model.
Three Ampere GPU models are good upgrades: the A100 SXM4 for multi-node distributed training, the A6000 for single-node multi-GPU training, and the 3090 as the most cost-effective choice, as long as your training jobs fit within its memory.
Other members of the Ampere family may also be your best choice when combining performance with budget, form factor, power consumption, thermals, and availability.
The above claims are based on our benchmark for a wide range of GPUs across different Deep Learning applications. Without further ado, let’s dive into the numbers.
Ampere or not Ampere
First, we compare Ampere and pre-Ampere GPUs in the context of single-GPU training. We hand-picked a few image and language models and focused on three metrics:
Maximum Batch Size: This is the largest number of samples that can fit into GPU memory. We usually prefer GPUs that can accommodate a larger batch size, because larger batches lead to more accurate gradients for each optimization step and are more future-proof for larger models.
Throughput: This is the number of samples that can be processed per second by a GPU. We measure the throughput for each GPU with its own maximum batch size to avoid GPU starvation (GPU cores staying idle due to a lack of data to process).
Throughput-per-dollar: This is the throughput of a GPU normalized by its market price. It reflects how cost-effective the GPU is in terms of the computation/purchase-price ratio.
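The third metric is just a ratio, which a short sketch makes concrete. The throughput figures here are ResNet50 numbers from the benchmark tables below, but the prices are rough assumed street prices, not quotes from this post, so the exact values are illustrative:

```python
def throughput_per_dollar(samples_per_sec, price_usd):
    """Throughput normalized by the market price of the GPU."""
    return samples_per_sec / price_usd

# ResNet50 throughput (samples/sec) paired with assumed prices (USD).
gpus = {
    "RTX 3090": (471, 1500),
    "RTX A6000": (437, 4650),
    "A100 80GB SXM4": (925, 15000),
}
ranked = sorted(gpus, key=lambda g: throughput_per_dollar(*gpus[g]),
                reverse=True)
# Lower-end cards tend to win on this metric even when flagships win on
# raw throughput.
```

With these assumed prices the ranking comes out 3090 > A6000 > A100, which matches the trend discussed later in this post.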
We give detailed numbers for some GPUs that are popular choices we’ve seen from the Deep Learning community. We include both the current Ampere generation (A100, A6000, and 3090) and the previous Turing/Volta generation (Quadro RTX 8000, Titan RTX, RTX 2080Ti, V100) for readers who are interested in comparing their performance and considering an upgrade in the near future. We also include the 3080 Max-Q since it is one of the most powerful mobile GPUs currently available.
Maximum batch size
| Model / GPU | A100 80GB SXM4 | RTX A6000 | RTX 3090 | V100 32GB | RTX 8000 | Titan RTX | RTX 2080Ti | 3080 Max-Q |
|---|---|---|---|---|---|---|---|---|
| ResNet50 | 720 | 496 | 224 | 296 | 496 | 224 | 100 | 152 |
| ResNet50 FP16 | 1536 | 912 | 448 | 596 | 912 | 448 | 184 | 256 |
| SSD | 256 | 144 | 80 | 108 | 144 | 80 | 32 | 48 |
| SSD FP16 | 448 | 288 | 140 | 192 | 288 | 140 | 56 | 88 |
| Bert Large Finetune | 32 | 18 | 8 | 12 | 18 | 8 | 2 | 4 |
| Bert Large Finetune FP16 | 64 | 36 | 16 | 24 | 36 | 16 | 4 | 8 |
| TransformerXL Large | 24 | 16 | 4 | 8 | 16 | 4 | 0 | 2 |
| TransformerXL Large FP16 | 48 | 32 | 8 | 16 | 32 | 8 | 0 | 4 |
No surprise: the maximum batch size is closely correlated with GPU memory size. The A100 80GB has the largest GPU memory on the current market, while the A6000 (48GB) and 3090 (24GB) match their Turing-generation predecessors, the RTX 8000 and Titan RTX. The 3080 Max-Q has a sizable 16GB of RAM, making it a safe choice for running inference on most mainstream DL models. Released three and a half years ago, the RTX 2080Ti (11GB) could cope with the state-of-the-art image models of its time but is now falling behind. This is especially the case for anyone who works on large image models or language models (it cannot fit a single training example for TransformerXL Large in either FP32 or FP16 precision).
We also see roughly a 2x increase in maximum batch size when switching the training precision from FP32/TF32 to FP16.
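This ratio can be read straight off the A100 column of the maximum-batch-size table above; a quick check of the "roughly 2x" claim:

```python
# (FP32/TF32 max batch, FP16 max batch) on the A100 80GB SXM4,
# taken from the maximum-batch-size table above.
a100_batches = {
    "ResNet50": (720, 1536),
    "SSD": (256, 448),
    "Bert Large Finetune": (32, 64),
    "TransformerXL Large": (24, 48),
}
ratios = {model: fp16 / fp32 for model, (fp32, fp16) in a100_batches.items()}
# ResNet50: 2.13x, SSD: 1.75x, Bert: 2.0x, TransformerXL: 2.0x
```

Halving the precision of activations and weights roughly doubles the number of samples that fit, though the exact factor varies per model because some buffers stay in full precision.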
Throughput

| Model / GPU | A100 80GB SXM4 | RTX A6000 | RTX 3090 | V100 32GB | RTX 8000 | Titan RTX | RTX 2080Ti | 3080 Max-Q |
|---|---|---|---|---|---|---|---|---|
| ResNet50 | 925 | 437 | 471 | 368 | 300 | 306 | 275 | 193 |
| ResNet50 FP16 | 1386 | 775 | 801 | 828 | 646 | 644 | 526 | 351 |
| SSD | 272 | 135 | 137 | 136 | 116 | 119 | 106 | 55 |
| SSD FP16 | 420 | 230 | 214 | 224 | 180 | 181 | 139 | 90 |
| Bert Large Finetune | 60 | 25 | 17 | 12 | 11 | 11 | 6 | 7 |
| Bert Large Finetune FP16 | 123 | 63 | 47 | 49 | 41 | 40 | 23 | 20 |
| TransformerXL Large | 12847 | 6114 | 4062 | 2329 | 2158 | 1878 | 0 | 967 |
| TransformerXL Large FP16 | 18289 | 11140 | 7582 | 4372 | 4109 | 3579 | 0 | 1138 |
Throughput is impacted by GPU cores, GPU memory size, and memory bandwidth. Imagine you are in a restaurant: memory bandwidth decides how fast food is brought to your table, memory size is your table size, and the GPU cores decide how fast you can eat. The amount of food consumed in a fixed amount of time (training/inference throughput) could be limited by any one, or several, of these three factors. But most likely, the bottleneck is how fast you can eat (the cores).
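The restaurant analogy is a bottleneck model: realized throughput is capped by whichever resource saturates first. A toy sketch with illustrative numbers (not benchmark data):

```python
def realized_throughput(core_rate, bandwidth_rate):
    """Samples/sec is capped by the slower of compute and data delivery,
    assuming the working set fits in GPU memory at all."""
    return min(core_rate, bandwidth_rate)

# Compute-bound (the common case for deep learning): cores are the cap.
compute_bound = realized_throughput(core_rate=500, bandwidth_rate=900)  # 500
# Bandwidth-bound: the cores sit idle waiting for data.
memory_bound = realized_throughput(core_rate=500, bandwidth_rate=300)   # 300
```

This is the same intuition behind roofline models: raising the non-bottleneck resource does nothing until the bottleneck itself moves.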
Overall, we see Ampere GPUs deliver a significant boost in throughput compared to their Turing/Volta predecessors. For example, let’s examine the current flagship GPU, the A100 80GB SXM4.
For image models (ResNet50 and SSD):
2.25x faster than the V100 32GB in 32-bit (TF32 for the A100 and FP32 for the V100)
1.77x faster than the V100 32GB in FP16
For language models (Bert Large and TransformerXL Large):
5.26x faster than the V100 32GB in 32-bit (TF32 for the A100 and FP32 for the V100)
3.35x faster than the V100 32GB in FP16
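These speedups are simple averages of the per-model throughput ratios from the table above; re-deriving them:

```python
# Throughput (samples/sec) from the table above.
a100 = {"ResNet50": 925, "SSD": 272, "ResNet50 FP16": 1386, "SSD FP16": 420,
        "Bert": 60, "TXL": 12847, "Bert FP16": 123, "TXL FP16": 18289}
v100 = {"ResNet50": 368, "SSD": 136, "ResNet50 FP16": 828, "SSD FP16": 224,
        "Bert": 12, "TXL": 2329, "Bert FP16": 49, "TXL FP16": 4372}

def avg_speedup(models):
    """Mean of per-model A100/V100 throughput ratios."""
    return sum(a100[m] / v100[m] for m in models) / len(models)

image_32bit = avg_speedup(["ResNet50", "SSD"])             # ~2.25x
image_fp16 = avg_speedup(["ResNet50 FP16", "SSD FP16"])    # ~1.77x
lang_32bit = avg_speedup(["Bert", "TXL"])                  # ~5.26x
lang_fp16 = avg_speedup(["Bert FP16", "TXL FP16"])         # ~3.35x
```

Averaging ratios (rather than averaging raw throughputs) is what keeps the small Bert numbers from being drowned out by the large TransformerXL numbers.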
While a performance boost is guaranteed by switching to Ampere, the most significant improvement comes from training language models in TF32 vs. FP32, where the latest Ampere Tensor Cores can leverage structured sparsity. So if you are training language models on Turing/Volta or an even older generation of GPUs, definitely consider upgrading to Ampere-generation GPUs.
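The structured sparsity these Tensor Cores accelerate is the 2:4 pattern: at least 2 zeros in every contiguous group of 4 values. A small illustrative checker (pure Python, not any vendor API) makes the constraint concrete:

```python
def is_2_to_4_sparse(values):
    """True if every consecutive group of 4 values contains at least
    2 zeros -- the 2:4 pattern Ampere sparse Tensor Cores accelerate."""
    assert len(values) % 4 == 0
    return all(
        sum(1 for v in values[i:i + 4] if v == 0) >= 2
        for i in range(0, len(values), 4)
    )

# 50% sparsity in the right structure -> eligible for sparse Tensor Cores.
ok = is_2_to_4_sparse([1, 0, 2, 0, 0, 3, 0, 4])      # True
# Same 50% sparsity, but all zeros bunched in one group -> not eligible.
not_ok = is_2_to_4_sparse([1, 2, 3, 4, 0, 0, 0, 0])  # False
```

This is why overall sparsity level alone does not determine whether the hardware speedup applies; the zeros must be distributed in the prescribed pattern.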
Throughput per Dollar
Throughput per dollar is a somewhat over-simplified estimate of how cost-effective a GPU is, since it does not include the cost of operating the GPU (time and electricity). Frankly speaking, we didn’t know what to expect, but when we tested, we found two interesting trends:
Ampere GPUs show an overall increase in throughput per dollar over Turing/Volta. For example, the A100 80GB SXM4 has a higher throughput per dollar than the V100 32GB for ALL the models above. The same goes for the A6000 and 3090 when compared with the RTX 8000 and Titan RTX. This suggests that upgrading your old GPUs to the newest generation is a wise move.
Among the latest generation, lower-end GPUs usually have a higher throughput per dollar than higher-end GPUs. For example, 3090 > A6000 > A100 80GB SXM4. This means that, if you are budget-limited, buying lower-end GPUs in quantity might be a better choice than chasing the pricey flagship GPU.
You can find “Throughput per Watt” figures on our benchmark website. Take these metrics with a grain of salt, since the value of time and power can be entirely subjective (how complicated is your problem? how tolerant are you of long runtimes and electricity bills?). Nonetheless, for users who demand fast R&D iterations, upgrading to Ampere GPUs seems to be a very worthwhile investment, as you will save lots of time in the long term.
Scalability test for server grade GPUs (1x – 8x)
We also tested the scalability of these GPUs with multi-GPU training jobs. We observed nearly perfect linear scaling for the A100 80GB SXM4 (blue line), thanks to the fast device-to-device communication of NVSwitch. Other server-grade GPUs, including the A6000, V100, and RTX 8000, all scored high scaling factors. For example, the A6000 delivered 7.7x and 7.8x performance with 8x GPUs in TF32 and FP16, respectively.
We didn’t include them in this graph, but the scaling factors for Geforce cards are significantly worse. For example, the 3090 delivered only about 5x more throughput with 8x GPUs. This is mainly because GPUDirect peer-to-peer is disabled on the latest generations of Geforce cards, so communication between GPUs (to gather the gradients) must go through the CPU, which leads to severe bottlenecks as the number of GPUs increases.
We observed that 4x Geforce GPUs could give a 2.5x – 3x speed up, depending on the hardware setting and the problem at hand. And it appears to be inefficient to go beyond 4x GPUs with Geforce cards.
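The scaling factors quoted above translate directly into a parallel-efficiency number, which makes the server-vs-Geforce gap easy to compare:

```python
def scaling_efficiency(speedup, n_gpus):
    """Fraction of ideal linear scaling achieved with n GPUs."""
    return speedup / n_gpus

# From the text: A6000 reaches 7.7x with 8 GPUs in TF32.
a6000_eff = scaling_efficiency(7.7, 8)    # 0.9625 -> near-linear
# Geforce 3090: only ~5x with 8 GPUs (no GPUDirect peer-to-peer).
geforce_8x = scaling_efficiency(5.0, 8)   # 0.625 -> communication-bound
# 4x Geforce at ~2.75x speedup (midpoint of the 2.5x-3x range above):
geforce_4x = scaling_efficiency(2.75, 4)  # ~0.69
```

The efficiency drop from 4x to 8x Geforce cards is why the text advises against going beyond 4 GPUs with consumer cards.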
The above analysis used a few hand-picked image & language models to compare popular choices of GPUs for Deep Learning. You can find more comprehensive comparisons from our benchmark website. The following two figures give a high-level summary of our studies on the relative performance of a wider range of GPUs across a more extensive set of models:
Relative Training Throughput w.r.t 1xV100 32GB in TF32/FP32 (Averaged across all 11 models)
Relative Training Throughput w.r.t 1xV100 32GB in FP16 (Averaged across all 11 models)
The models used in the above studies include:
The A100 family (80GB/40GB, in PCIe and SXM4 form factors) has a clear lead over the rest of the Ampere cards. The A6000 comes second, followed closely by the 3090, A40, and A5000. There is a large gap between them and the lower-tier 3080 and A4000, but the latter two are more affordable.
So, which GPU should you choose if you need an upgrade for Deep Learning in early 2022? We feel there are two yes/no questions that help you choose between the A100, A6000, and 3090. These three together probably cover most of the use cases in training Deep Learning models:
Do you need multi-node distributed training? If the answer is yes, go for the A100 80GB/40GB SXM4, because they are the only GPUs here that support Infiniband. Without Infiniband, your distributed training simply would not scale. If the answer is no, see the next question.
How big is your model? That helps you choose between the A100 PCIe (80GB), A6000 (48GB), and 3090 (24GB). A couple of 3090s are adequate for mainstream academic research. Choose the A6000 if you work with large image/language models and need multi-GPU training to scale efficiently. An A6000 system should cover most use cases in the context of a single node. Only choose the A100 PCIe 80GB when working on extremely large models.
Of course, options such as the A40, A5000, 3080, and A4000 may be your best choice when combining performance with other factors such as budget, form factor, power consumption, thermals, and availability.
For example, power consumption and thermals can be an issue when you use multiple 3090s (360 watts each) or 3080s (350 watts each) in a workstation chassis, and we recommend no more than three of these cards for workstations. In contrast, although the A5000 (also 24GB) is up to 20% slower than the 3090, it consumes much less power (only 230 watts) and offers better thermal performance, which allows you to build a higher-performance system with four cards.
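A quick way to sanity-check such builds is to budget the PSU wattage; the 300 W allowance for the rest of the system and the 20% transient headroom are my assumptions, not figures from this post:

```python
def required_psu_watts(gpu_watts, n_gpus, rest_of_system_watts=300,
                       headroom=0.2):
    """Rough PSU sizing: GPUs + rest of system, plus transient headroom."""
    return (gpu_watts * n_gpus + rest_of_system_watts) * (1 + headroom)

three_3090s = required_psu_watts(360, 3)   # ~1656 W
four_a5000s = required_psu_watts(230, 4)   # ~1464 W
```

Under these assumptions, three 360 W cards already push past common 1600 W PSUs once headroom is included, while four A5000s stay under that limit, which mirrors the recommendation above.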
Still having problems identifying the best GPU for your needs? Feel free to start a conversation with our engineers for recommendations.
PyTorch benchmark software stack