Multi-GPU

Overview

If you have two GPUs but do not specify device placement in your code, TensorFlow puts all variables on device('/gpu:0') by default, so only GPU 0 does any computing.
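You can confirm the placement with log_device_placement, which logs the device chosen for every op. A minimal sketch (TF 1.x API):

import tensorflow as tf

# No explicit device: with a GPU visible, these ops land on /gpu:0
a = tf.constant([1.0, 2.0], name='a')
b = tf.constant([3.0, 4.0], name='b')
c = a + b

# log_device_placement prints the device assigned to each op
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))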

By default, TensorFlow reserves (nearly) all GPU memory up front and hands it out internally as the program needs it. This is a deliberate design decision in TensorFlow, and opinions on it are mixed.

Limiting GPU resources

Growing GPU memory on demand

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
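With allow_growth the process starts with a small allocation and grows it as needed. Note that TensorFlow never returns memory to the OS once allocated, so the footprint only grows over the life of the process.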

Capping the GPU memory fraction

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpu_options)
session = tf.Session(config=config)

Here 0.333 is the fraction of GPU memory the process is allowed to allocate; on a 12 GB card, for example, that is about 4 GB.
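A coarser way to limit GPU use is to hide devices entirely with the CUDA_VISIBLE_DEVICES environment variable, set before TensorFlow initializes CUDA; a minimal sketch:

import os

# Expose only the second physical GPU; TensorFlow will see it as /gpu:0
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf
session = tf.Session()  # allocates memory only on the visible GPU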

Example 1

import datetime
import numpy as np
import tensorflow as tf

# Setup, as in the full multigpu_basics example (reference 1 below):
# two large random matrices and a recursive matrix-power op
n = 10
A = np.random.rand(10000, 10000).astype('float32')
B = np.random.rand(10000, 10000).astype('float32')
c2 = []
log_device_placement = True

def matpow(M, n):
    if n < 1:
        return M
    return tf.matmul(M, matpow(M, n - 1))

# Multi GPU computing
# GPU:0 computes A^n
with tf.device('/gpu:0'):
    # compute A^n and store result in c2
    a = tf.constant(A)
    c2.append(matpow(a, n))

# GPU:1 computes B^n
with tf.device('/gpu:1'):
    # compute B^n and store result in c2
    b = tf.constant(B)
    c2.append(matpow(b, n))

with tf.device('/cpu:0'):
    sum = tf.add_n(c2)  # Addition of all elements in c2, i.e. A^n + B^n

t1_2 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Runs the op.
    sess.run(sum)
t2_2 = datetime.datetime.now()
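Because the two matpow towers have no data dependency on each other, TensorFlow schedules them on their two GPUs concurrently; only the final tf.add_n, placed on the CPU, waits for both results. Comparing t2_2 - t1_2 against a single-GPU run of the same graph is how the original example demonstrates the speedup.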

Example 2

# The hyperparameters (num_gpus, num_input, num_classes, batch_size, dropout,
# learning_rate, num_steps, display_step), the conv_net model function, the
# assign_to_device / average_gradients helpers, the mnist dataset object and
# the numpy/time imports are all defined in the full Multi-GPU Training
# Example (reference 2 below).

# Place all ops on CPU by default
with tf.device('/cpu:0'):
    tower_grads = []
    reuse_vars = False

    # tf Graph input
    X = tf.placeholder(tf.float32, [None, num_input])
    Y = tf.placeholder(tf.float32, [None, num_classes])

    # Loop over all GPUs and construct their own computation graph
    for i in range(num_gpus):
        with tf.device(assign_to_device('/gpu:{}'.format(i), ps_device='/cpu:0')):

            # Split data between GPUs
            _x = X[i * batch_size: (i + 1) * batch_size]
            _y = Y[i * batch_size: (i + 1) * batch_size]

            # Because Dropout behaves differently at training and prediction time,
            # we need two distinct computation graphs that share the same weights.

            # Create a graph for training
            logits_train = conv_net(_x, num_classes, dropout,
                                    reuse=reuse_vars, is_training=True)
            # Create another graph for testing that reuses the same weights
            logits_test = conv_net(_x, num_classes, dropout,
                                   reuse=True, is_training=False)

            # Define loss and optimizer (with train logits, for dropout to take effect)
            loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
                logits=logits_train, labels=_y))
            optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
            grads = optimizer.compute_gradients(loss_op)

            # Only the first GPU computes accuracy
            if i == 0:
                # Evaluate model (with test logits, so dropout is disabled)
                correct_pred = tf.equal(tf.argmax(logits_test, 1), tf.argmax(_y, 1))
                accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

            reuse_vars = True
            tower_grads.append(grads)

    tower_grads = average_gradients(tower_grads)
    train_op = optimizer.apply_gradients(tower_grads)

    # Initializing the variables
    init = tf.global_variables_initializer()

    # Launch the graph
    with tf.Session() as sess:
        sess.run(init)
        # Keep training until we reach max iterations
        for step in range(1, num_steps + 1):
            # Get a batch for each GPU
            batch_x, batch_y = mnist.train.next_batch(batch_size * num_gpus)
            # Run optimization op (backprop)
            ts = time.time()
            sess.run(train_op, feed_dict={X: batch_x, Y: batch_y})
            te = time.time() - ts
            if step % display_step == 0 or step == 1:
                # Calculate batch loss and accuracy
                loss, acc = sess.run([loss_op, accuracy], feed_dict={X: batch_x,
                                                                     Y: batch_y})
                print("Step " + str(step) + ": Minibatch Loss= " +
                      "{:.4f}".format(loss) + ", Training Accuracy= " +
                      "{:.3f}".format(acc) + ", %i Examples/sec" % int(len(batch_x) / te))
        print("Optimization Finished!")

        # Calculate accuracy for 1000 mnist test images
        print("Testing Accuracy:",
              np.mean([sess.run(accuracy, feed_dict={X: mnist.test.images[i:i + batch_size],
                                                     Y: mnist.test.labels[i:i + batch_size]})
                       for i in range(0, len(mnist.test.images), batch_size)]))
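The two helpers deserve a note: assign_to_device pins variable ops to the parameter-server device (here the CPU) while placing compute ops on the given GPU, and average_gradients averages each variable's gradients across the towers. A sketch consistent with the full example (the exact code lives in reference 2 below):

PS_OPS = ['Variable', 'VariableV2', 'AutoReloadVariable']

def assign_to_device(device, ps_device='/cpu:0'):
    """Device function: variables go to ps_device, everything else to device."""
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        return ps_device if node_def.op in PS_OPS else device
    return _assign

def average_gradients(tower_grads):
    """Average the (gradient, variable) pairs computed by each tower."""
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # grad_and_vars = ((grad_gpu0, var), (grad_gpu1, var), ...):
        # the same variable, one gradient per GPU
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        # The variable is shared across towers, so the first tower's handle suffices
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads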
References:
  1. multigpu_basics (code)
  2. Multi-GPU Training Example (code)
  3. TensorFlow allocates memory on both GPUs, but only one GPU actually computes

Multi-threading

When initializing tf.ConfigProto(), you can also set the intra_op_parallelism_threads and inter_op_parallelism_threads parameters to control how many threads each op uses for parallel computation.

The difference between the two:

intra_op_parallelism_threads controls parallelism inside a single op

When a single op can be parallelized internally, such as a matrix multiplication or a reduce_sum, TensorFlow parallelizes it using a thread pool whose size is set by intra_op_parallelism_threads (intra meaning "within").

inter_op_parallelism_threads controls parallelism across multiple ops

When the graph contains several mutually independent ops, with no directed path connecting one to another, TensorFlow tries to run them concurrently, using a thread pool whose size is set by inter_op_parallelism_threads.

Setting either parameter to 0 lets the system pick an appropriate value.

import tensorflow as tf

config = tf.ConfigProto(device_count={"CPU": 4},  # upper bound on CPU devices TF may create
                        inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=4,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    pass  # To Do: build and run the graph here
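The numbers below come from a benchmark of this kind; here is a minimal sketch that times a single large matmul under different intra-op thread counts (the matrix size and step count are arbitrary choices of this sketch, not from the original measurement):

import time
import numpy as np
import tensorflow as tf

def time_matmul(intra_threads, steps=10):
    """Average seconds per run of one large matmul with the given intra-op thread count."""
    tf.reset_default_graph()
    a = tf.constant(np.random.rand(2000, 2000).astype('float32'))
    prod = tf.matmul(a, a)  # one op that parallelizes internally
    config = tf.ConfigProto(intra_op_parallelism_threads=intra_threads,
                            inter_op_parallelism_threads=1)
    with tf.Session(config=config) as sess:
        sess.run(prod)  # warm-up run
        ts = time.time()
        for _ in range(steps):
            sess.run(prod)
    return (time.time() - ts) / steps

for n in (2, 4):
    print("intra_op_parallelism_threads=%d: %.0f ms/step" % (n, time_matmul(n) * 1000))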

A concrete comparison with 2 vs. 4 threads, measuring the average run time per batch:

With intra_op_parallelism_threads = 2, the average time per step dropped from 610 ms to 380 ms.
With intra_op_parallelism_threads = 4, the average time per step dropped from 610 ms to 230 ms.

To sum up: under a fixed budget of CPU cores, choosing a sensible thread count can noticeably speed up a TensorFlow program.