Setting Up a Deep Learning Environment on Google Cloud (CUDA, cuDNN, TensorFlow 2, VS Code Remote-SSH, and More)

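Preface
This guide is based on my own setup: macOS Catalina, using only the built-in Terminal.
Setting up a deep learning environment on Google Cloud cost me a lot of pain, but it finally works. I am writing the steps down in the hope that they help whoever comes next.
Overall, setting up the environment right now (July 2020) is fairly troublesome, and some of the tutorials online no longer work. There are two reasons: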

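TensorFlow moved to version 2.0 last year. The install command for the GPU build has changed, and the supported driver and CUDA versions keep changing as well. Tutorials have not caught up.
NVIDIA released CUDA 11 early this month, but TensorFlow does not support it yet, so choose versions with care.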

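Let's get started!
Contents

Preface
Registering for Google Cloud Platform (GCP) and setting up a VM instance
Connecting to the VM instance and setting up the GPU computing environment
Setting up Jupyter Notebook on the server
Setting up Remote-SSH in VS Code to manage and run code remotely
References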

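Registering for Google Cloud Platform (GCP) and setting up a VM instance

GCP registration has not changed much, so the registration part of the existing guides (see References) still applies. China is not offered in the country list; you can choose Hong Kong. The street address can be anything, but the credit card billing address must be correct.
After you submit the form, GCP attempts a charge of HKD 8 to verify the card and refunds it afterwards.
Click the link in the verification email, answer a few survey questions, and the account is ready to use.
From the GCP home page, add a VM instance:
Menu (left) -> Compute Engine -> VM instances -> Create instance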

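Note: choose the region asia-east1 (Taiwan); for the zone you can pick a.

For the machine configuration the general-purpose N1 series is enough, and it can be changed later. Keep the defaults for now. Do not add a GPU yet: without GPU quota, adding one has no effect.

For the boot disk, click Change, then select Ubuntu, version 16.04 LTS, with a boot disk size of 40 GB.

Tick both firewall checkboxes.

Click Create, and the instance is created. It may start running right away. Stop it and request GPU quota first.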

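Requesting GPU quota
Menu (left) -> IAM & Admin -> Quotas

A yellow banner usually appears at the top of this page, saying that you need to upgrade the account for the full experience. Click Upgrade; you are not charged, and a confirmation appears once the upgrade succeeds. GPU quota can only be requested after upgrading.
Then, in the filter of the quota table, select: Limit name -> GPUs (all regions)

Tick the service named GPUs (all regions) in the filtered results and click Edit quotas at the top.
Fill in your details on the right and click Next.

I had already been granted the increase, so the current limit here shows 1. Enter 1 as the new limit, and in the request description just say that you are working on a research project. Click Done, then Submit request below. Google sends an email confirming that the request was received.

The email typically says the request will be processed within 2 days. Mine was approved two minutes later; perhaps I was lucky. If yours has not arrived yet, that is fine: we can set up everything else in the meantime.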

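Configuring a static external IP
Menu (left) -> VPC network -> External IP addresses

The VM instance we just created is already listed here. Change its type to Static and give it any name.
Configuring the SSH connection
Next comes the SSH connection. There are three steps: generate a key pair, upload the public key, log in over SSH.
First open Terminal: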

# Go to the .ssh folder
cd ~/.ssh
# Generate a key pair ("google" is the key file name)
ssh-keygen -f google

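You are prompted for a passphrase to protect the key:

Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):

Enter the passphrase and confirm it. The key pair is then generated and a randomart image is printed. Next, view your public key.
If you are not familiar with vim, avoid pressing keys at random after opening the file. If you did, Control-C (or Esc) leaves insert mode; look up the vim commands for anything beyond that. We only need two: i to start editing and :wq to save and quit.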

vim google.pub

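Copy the entire file content with Command-C. Back in the Google Cloud console, go to Menu (left) -> Compute Engine -> Metadata -> SSH keys, paste your public key there, and save.

To make logging in easier, create a config file in the local .ssh folder: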

vim config

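Once inside, press i to enter insert mode; the word INSERT appears at the bottom of the terminal. Paste the following:

Host gcp
    HostName {the static external IP of the VM instance reserved earlier}
    User {the username shown in the metadata just now}
    IdentityFile ~/.ssh/google

Then press Control-C (or Esc) to leave insert mode and type :wq to save and quit.
The VM instance and the connection are now configured.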
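
The Host entry has a fixed shape, so it can also be generated instead of typed by hand. The following is a minimal Python sketch under stated assumptions: `render_ssh_host` is an illustrative helper name, and the IP `203.0.113.10` and username `alice` are placeholders, not values from this setup.

```python
def render_ssh_host(alias, hostname, user, identity_file):
    """Return an OpenSSH client config block for one host alias."""
    lines = [
        f"Host {alias}",
        f"    HostName {hostname}",
        f"    User {user}",
        f"    IdentityFile {identity_file}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; substitute your own static IP and username.
    print(render_ssh_host("gcp", "203.0.113.10", "alice", "~/.ssh/google"), end="")
```

The printed block can be pasted into ~/.ssh/config as is.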
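Important: for security, GCP by default drops an SSH connection after 120 s without receiving any message from it. SSH clients on Windows usually let you set the keepalive interval directly; here we set it on the local machine: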

sudo vim /etc/ssh/ssh_config

Add the line ServerAliveInterval 20 to the file, then save and quit.
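
To double-check that the keepalive line is really in the file, the config text can be scanned for the keyword. This is a rough Python sketch: it only does a line-by-line scan and does not model Host or Match scoping, and `find_server_alive_interval` is an illustrative name.

```python
def find_server_alive_interval(config_text):
    """Return the first ServerAliveInterval value in an ssh_config text, or None.

    Simplified: ignores Host/Match scoping and only reports the first match.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # ssh_config accepts both "Keyword value" and "Keyword=value".
        parts = line.replace("=", " ", 1).split()
        if len(parts) >= 2 and parts[0].lower() == "serveraliveinterval":
            return int(parts[1])
    return None


if __name__ == "__main__":
    sample = "Host *\n    ServerAliveInterval 20\n"
    print(find_server_alive_interval(sample))
```

To check the real file, read /etc/ssh/ssh_config and pass its content to the function.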
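Connecting to the VM instance and setting up the GPU computing environment
By the time you finish the steps above, the quota request may already have been approved, so let's try it. First add a GPU:
Open the VM instances page and click the instance to open its details. With the instance stopped, click Edit, expand the CPU platform and GPU section, and add one GPU; the default is a Tesla K80. Save.
Now the instance can be started. This may take a few seconds. It usually starts fine, but you may get an error. The most common one is that the zone has no GPUs available at the moment; in that case, try again later.

Connecting to the instance over SSH
Because we added a config file to the .ssh folder earlier, the command no longer needs the private key path, the login user, or the instance IP. Just run: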

ssh gcp

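On the first login SSH asks you to confirm the host; answer yes, then enter the passphrase set earlier.
After a successful login the prompt turns green.

Next comes the environment setup.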

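Installing the GPU driver, CUDA, and cuDNN
Take care here: the TensorFlow version is tightly coupled to the CUDA and cuDNN versions. Whenever you read this, check the software requirements section of the GPU support page on the TensorFlow website. At the time of writing it calls for CUDA 10.1 and cuDNN 7.6, which are the versions installed below.
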
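
The version coupling can be written down as a small lookup table. The sketch below is only an illustration: the table holds the tested Linux GPU configurations published on tensorflow.org for the 2.0 to 2.2 releases, and `is_supported` is a hypothetical helper, not part of TensorFlow.

```python
# Tested GPU build configurations (Linux) as published on tensorflow.org
# for the releases available in mid-2020. Newer releases are not listed.
TESTED_CONFIGS = {
    "2.0": {"cuda": "10.0", "cudnn": "7.4"},
    "2.1": {"cuda": "10.1", "cudnn": "7.6"},
    "2.2": {"cuda": "10.1", "cudnn": "7.6"},
}


def is_supported(tf_version, cuda_version, cudnn_version):
    """Return True if the CUDA/cuDNN pair matches the tested config for tf_version."""
    key = ".".join(tf_version.split(".")[:2])  # "2.2.0" -> "2.2"
    expected = TESTED_CONFIGS.get(key)
    if expected is None:
        return False  # unknown release: look it up on tensorflow.org
    cudnn_key = ".".join(cudnn_version.split(".")[:2])  # "7.6.5" -> "7.6"
    return cuda_version == expected["cuda"] and cudnn_key == expected["cudnn"]


if __name__ == "__main__":
    print(is_supported("2.2.0", "10.1", "7.6.5"))
    print(is_supported("2.2.0", "11.0", "8.0.1"))
```

For any release not in the table, the TensorFlow website remains the authority.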
Method 1 (from the official site, not tested by me):
Further down that page TensorFlow provides a script for a convenient install:

# Add NVIDIA package repositories
# Add HTTPS support for apt-key
sudo apt-get install gnupg-curl
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.1.243-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo dpkg -i cuda-repo-ubuntu1604_10.1.243-1_amd64.deb
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt-get update

# Install NVIDIA driver
# Issue with driver install requires creating /usr/lib/nvidia
sudo mkdir /usr/lib/nvidia
sudo apt-get install --no-install-recommends nvidia-418
# Reboot. Check that GPUs are visible using the command: nvidia-smi

# Install development and runtime libraries (~4GB)
sudo apt-get install --no-install-recommends \
    cuda-10-1 \
    libcudnn7=7.6.4.38-1+cuda10.1 \
    libcudnn7-dev=7.6.4.38-1+cuda10.1

# Install TensorRT. Requires that libcudnn7 is installed above.
sudo apt-get install -y --no-install-recommends \
    libnvinfer6=6.0.1-1+cuda10.1 \
    libnvinfer-dev=6.0.1-1+cuda10.1 \
    libnvinfer-plugin6=6.0.1-1+cuda10.1

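Method 2 (tested by me):
Install directly. First connect to the instance with ssh gcp and enter the passphrase.
We install CUDA 10.1, the version TensorFlow 2 currently supports. The CUDA package now installs the GPU driver for you as well: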

wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1604-10-1-local-10.1.105-418.39_1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604-10-1-local-10.1.105-418.39_1.0-1_amd64.deb
# Use the cuda-repo-* directory that dpkg created under /var
sudo apt-key add /var/cuda-repo-/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda

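After the last command you need to confirm with y twice, then wait for it to finish. CUDA 10.1 is installed first, followed by version 418 of the GPU driver. The whole thing takes about 10 minutes. In the meantime you can register on the cuDNN website to save time later; clicking Download redirects you to the registration page.
Then reboot the machine with sudo reboot. The connection drops during the reboot, so connect again with ssh afterwards. Run nvidia-smi to see the GPU status: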

Thu Jul 23 08:05:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P8    30W / 149W |     16MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1743      G   /usr/lib/xorg/Xorg                            15MiB |
+-----------------------------------------------------------------------------+
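
The same information can be read in machine-friendly form through the CSV query mode of nvidia-smi. A small Python sketch, assuming nvidia-smi is on the PATH of the instance; `parse_gpu_csv` is an illustrative helper name:

```python
import shutil
import subprocess

# --query-gpu and --format are standard nvidia-smi options.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_csv(text):
    """Turn 'name, driver, memory' CSV lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "driver": driver, "memory_total": memory})
    return gpus


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        result = subprocess.run(QUERY, capture_output=True, text=True)
        if result.returncode == 0:
            print(parse_gpu_csv(result.stdout))
        else:
            print("nvidia-smi failed:", result.stderr.strip())
    else:
        print("nvidia-smi not found; is the driver installed?")
```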

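If the output looks like this, the installation succeeded.
Next install cuDNN. Log in at the cuDNN website mentioned above and agree to the license.

Click Download cuDNN v7.6.5 (November 5th, 2019), for CUDA 10.1. Do not copy the link of cuDNN Library for Linux from that page to the instance directly; it returns 403 Forbidden. Instead, start the download in your browser, then right-click the entry on Chrome's downloads page and copy the download link address.

The link I copied looks like the one below. The long string at the end is my token, so copying my link will not work for you.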

https://developer.download.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.1_20191031/cudnn-10.1-linux-x64-v7.6.5.32.tgz?iceCj5uvMJUMcXOqWNvKyAO44He_JQgo6I71TxNKcGdbiqcPoJ4wAXWNb3WpQYhBUZN8NMAk_5El5kyADfUx4r8af5WfLPk4WumDm27oJZJhWzBMgsBtJPH9sSPmlMxY0ID_6eQod-8WgkBT0HvM_BwDaGWQ_w05f1rRBjTGqm-X-VeCBppxtKUFcxYY2AW91s492p06zkVhMwIWOl5LHQ4p7eycCRGv3Q

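With the link copied, go back to the Terminal session on the instance and download cuDNN. The downloaded file gets a very long name, so rename it with mv for convenience, then extract it.

wget "{your link}"
mv {name of the downloaded file} cudnn.tgz
tar -xzvf cudnn.tgz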
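
The long file name appears because the token after the question mark becomes part of the saved name. A short Python sketch of how the clean archive name can be derived from such a link; the URL in the example is a made-up placeholder and `clean_filename` is an illustrative name:

```python
import posixpath
from urllib.parse import urlsplit


def clean_filename(url):
    """Return the last path component of url, without the query string."""
    return posixpath.basename(urlsplit(url).path)


if __name__ == "__main__":
    # Placeholder link; a real one carries your personal token after the "?".
    url = "https://example.com/secure/cudnn-10.1-linux-x64-v7.6.5.32.tgz?sometoken"
    print(clean_filename(url))
```

Passing a short name to wget with its -O option avoids the rename step altogether.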

Then copy the files:

sudo cp cuda/include/cudnn.h /usr/local/cuda-10.1/include/
sudo cp cuda/lib64/* /usr/local/cuda-10.1/lib64/

Installation complete!
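
To confirm which cuDNN version ended up in the CUDA directory, the version macros in cudnn.h can be read back. A hedged Python sketch, assuming a cuDNN 7.x header (which defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL in cudnn.h) and the install path used above; the function name is illustrative:

```python
import os
import re


def cudnn_version_from_header(header_text):
    """Extract (major, minor, patch) from the text of a cuDNN 7.x cudnn.h."""
    values = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{name}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None  # macro missing: not a cuDNN 7.x header
        values.append(int(match.group(1)))
    return tuple(values)


if __name__ == "__main__":
    header = "/usr/local/cuda-10.1/include/cudnn.h"  # path used in this guide
    if os.path.exists(header):
        with open(header) as fh:
            print(cudnn_version_from_header(fh.read()))
    else:
        print("cudnn.h not found at", header)
```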

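Installing Anaconda
It is best to look up the link for the latest version on the official site. Here is the one for version 2020.02:

wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
bash Anaconda3-2020.02-Linux-x86_64.sh

This starts the Anaconda installer. Press ENTER to open the license agreement and press the space bar to skip through it. Then type yes and press ENTER to confirm the install location, and the installation begins. At the end it asks whether to run conda init; be sure to answer yes.
When the installer finishes, test the conda command. If it is not found, run source ~/.bashrc and try again. If it still fails, the Anaconda location is probably not on your PATH: open ~/.bashrc with vim and append the line export PATH=$PATH:/home/{your username}/anaconda3/bin, then run source ~/.bashrc and test conda again. If it tells you to run init, follow the steps it prints. A reboot is needed afterwards.
After a successful install the prompt shows the active conda environment; if no specific environment is activated, it is (base).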
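
The PATH troubleshooting above can be sketched in Python as follows. Both helpers are illustrative names, and the home directory `/home/alice` is a placeholder; the sketch only computes the new PATH string and does not modify ~/.bashrc.

```python
import os
import shutil


def find_conda(path=None):
    """Return the full path of the conda executable, or None if it is not on PATH."""
    return shutil.which("conda", path=path)


def path_with_anaconda(current_path, home):
    """Append <home>/anaconda3/bin to a PATH string unless it is already present."""
    bin_dir = os.path.join(home, "anaconda3", "bin")
    entries = current_path.split(os.pathsep) if current_path else []
    if bin_dir not in entries:
        entries.append(bin_dir)
    return os.pathsep.join(entries)


if __name__ == "__main__":
    print("conda found at:", find_conda())
    # Placeholder home directory; use your own username.
    print(path_with_anaconda(os.environ.get("PATH", ""), "/home/alice"))
```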

Installing the GPU build of TensorFlow

conda install tensorflow-gpu

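Confirm the installation with y here as well. The TensorFlow GPU environment is now complete.

Testing that TensorFlow uses the GPU
We write a Python file that calls TensorFlow's own test module to check whether the GPU is enabled.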

vim test.py

The file content is:

import tensorflow as tf
print(tf.test.is_gpu_available())

Now run python test.py and look at the output:

(base) f@instance-2:~$ python test.py
WARNING:tensorflow:From test.py:2: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2020-07-23 12:50:56.552239: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-07-23 12:50:56.560837: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2200000000 Hz
2020-07-23 12:50:56.561175: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560faca2aab0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-23 12:50:56.561213: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-23 12:50:56.562632: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-23 12:50:56.582969: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-23 12:50:56.583828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-07-23 12:50:56.584151: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-07-23 12:50:56.586376: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-07-23 12:50:56.588334: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-07-23 12:50:56.588812: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-07-23 12:50:56.591138: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-07-23 12:50:56.592375: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-07-23 12:50:56.598009: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-23 12:50:56.598203: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-23 12:50:56.599128: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-23 12:50:56.599942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-23 12:50:56.600047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-07-23 12:50:56.948921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-23 12:50:56.948999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-07-23 12:50:56.949022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-07-23 12:50:56.949809: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-23 12:50:56.950670: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-23 12:50:56.951456: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-23 12:50:56.952191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/device:GPU:0 with 10676 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2020-07-23 12:50:56.954963: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560fad3ee950 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-23 12:50:56.955001: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
True

See the True on the last line? It finally works.
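
The deprecation warning in the log recommends `tf.config.list_physical_devices('GPU')` instead. A minimal sketch of the same check with that call; `list_gpus` is an illustrative wrapper, and returning an empty list when TensorFlow cannot be imported is a choice made for this sketch only.

```python
def list_gpus():
    """Return the names of the GPUs visible to TensorFlow.

    Returns an empty list if no GPU is visible or TensorFlow is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = list_gpus()
    print(gpus if gpus else "No GPU visible to TensorFlow")
```

A non-empty list plays the same role as the True printed by the original test.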
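Setting up Jupyter Notebook on the server

Configuring the firewall
Menu (left) -> VPC network -> Firewall -> Create firewall rule
Name: jupyter
Source IP ranges: 0.0.0.0/0
Protocols and ports -> Specified protocols and ports -> tcp:5000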
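
The source range 0.0.0.0/0 matches every IPv4 address, so the rule opens port 5000 to the whole internet. The Python sketch below only illustrates how a source range is matched against a client address; the addresses are documentation placeholders and `is_allowed` is an illustrative name. Narrowing the range to your own address is an option if you want to restrict access.

```python
import ipaddress


def is_allowed(client_ip, source_range):
    """Return True if client_ip falls inside the CIDR source_range."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(source_range)


if __name__ == "__main__":
    print(is_allowed("198.51.100.7", "0.0.0.0/0"))        # any IPv4 address matches
    print(is_allowed("198.51.100.7", "203.0.113.0/24"))   # outside the range
```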
Creating the Jupyter Notebook config file

jupyter notebook --generate-config

The command prints the location of the generated config file. Edit that file:

vim /home/{your username}/.jupyter/jupyter_notebook_config.py

Insert these lines:

c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 5000

After saving with :wq, run jupyter notebook to start the Jupyter Notebook app.
When it starts it prints several links that contain a token:

[I 14:11:24.544 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 14:11:24.549 NotebookApp]
    To access the notebook, open this file in a browser:
        file:///home/f/.local/share/jupyter/runtime/nbserver-4059-open.html
    Or copy and paste one of these URLs:
        http://localhost:5000/?token=c8e314854ec5c254e4b2b09e5284c403bf9186eed6f69a28
     or http://127.0.0.1:5000/?token=c8e314854ec5c254e4b2b09e5284c403bf9186eed6f69a28

Copy a link into your browser and replace 127.0.0.1 with the external IP of the server instance; the notebook is then reachable from any device.
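
The host swap can also be done programmatically, which helps avoid damaging the token while editing. A small Python sketch; the external IP `203.0.113.10` and the token `abc` are placeholders, and `with_external_host` is an illustrative name:

```python
from urllib.parse import urlsplit, urlunsplit


def with_external_host(url, external_ip):
    """Replace the host of url with external_ip, keeping port, path and token."""
    parts = urlsplit(url)
    port = f":{parts.port}" if parts.port else ""
    return urlunsplit(parts._replace(netloc=f"{external_ip}{port}"))


if __name__ == "__main__":
    # Placeholder IP and token; use your instance's external IP and real link.
    print(with_external_host("http://127.0.0.1:5000/?token=abc", "203.0.113.10"))
```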

Note: when running deep learning jobs we sometimes want the notebook server to keep running after the SSH session ends. In that case start it with nohup jupyter notebook --allow-root &
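Setting up Remote-SSH in VS Code to manage and run code remotely
VS Code has a Remote-SSH extension. After installing it, press Command+Shift+P to open the command palette, type remote-SSH, choose Connect to Host, then Configure SSH Hosts. You are asked which config file to use; pick the config file in the .ssh folder that we created earlier for quick login to gcp. Nothing needs changing, so save and close the file.
Press Command+Shift+P again and choose Connect to Host; the gcp entry we wrote earlier is now listed. Press Enter and a new window opens. Enter the passphrase as before. The first launch takes a while to initialize.
Then install extensions on the VM instance from the Extensions view, including the usual Python, Anaconda extension, and Code Runner.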

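Open a folder and you can edit and run Python files as usual. Remember to select the right interpreter at the bottom of the window.

Unfortunately, I wanted to use the kernel of the remote Jupyter Notebook so that I could edit ipynb files in VS Code and run them remotely, but I have not managed to get that working yet. For now only the browser-based Jupyter Notebook interface is available.
Here is a tutorial for reference: Connecting VS Code to a Jupyter Notebook on a remote server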
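References
A step-by-step guide to building your own deep learning workstation on Google Cloud Platform
Building a deep learning platform on a Google Cloud server
Jupyter Notebook configuration
Deploying Jupyter Notebook in the cloud
tensorflow-gpu test code
Remote-SSH in VS Code
SSH automatic disconnect settings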