Multi-GPU

Overview

If you have two GPUs but do not specify device placement in your code, TensorFlow puts all variables on device('/gpu:0') by default, so only GPU 0 does any computing.
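You can confirm the placement with log_device_placement, which logs the device chosen for every op. A minimal sketch (TF 1.x API):

import tensorflow as tf

# No explicit device: with a GPU visible, these ops land on /gpu:0
a = tf.constant([1.0, 2.0], name='a')
b = tf.constant([3.0, 4.0], name='b')
c = a + b

# log_device_placement prints the device assigned to each op
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))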

By default, TensorFlow reserves (nearly) all GPU memory up front and hands it out internally as the program needs it. This is a deliberate design decision in TensorFlow, and opinions on it are mixed.

Limiting GPU resources

Growing GPU memory on demand

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
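With allow_growth the process starts with a small allocation and grows it as needed. Note that TensorFlow never returns memory to the OS once allocated, so the footprint only grows over the life of the process.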

Capping the GPU memory fraction

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpu_options)
session = tf.Session(config=config)

Here 0.333 is the fraction of GPU memory the process is allowed to allocate; on a 12 GB card, for example, that is about 4 GB.
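A coarser way to limit GPU use is to hide devices entirely with the CUDA_VISIBLE_DEVICES environment variable, set before TensorFlow initializes CUDA; a minimal sketch:

import os

# Expose only the second physical GPU; TensorFlow will see it as /gpu:0
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf
session = tf.Session()  # allocates memory only on the visible GPU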

Example 1

import datetime
import numpy as np
import tensorflow as tf

# Setup, as in the full multigpu_basics example (reference 1 below):
# two large random matrices and a recursive matrix-power op
n = 10
A = np.random.rand(10000, 10000).astype('float32')
B = np.random.rand(10000, 10000).astype('float32')
c2 = []
log_device_placement = True

def matpow(M, n):
    if n < 1:
        return M
    return tf.matmul(M, matpow(M, n - 1))

# Multi GPU computing
# GPU:0 computes A^n
with tf.device('/gpu:0'):
    # compute A^n and store result in c2
    a = tf.constant(A)
    c2.append(matpow(a, n))

# GPU:1 computes B^n
with tf.device('/gpu:1'):
    # compute B^n and store result in c2
    b = tf.constant(B)
    c2.append(matpow(b, n))

with tf.device('/cpu:0'):
    sum = tf.add_n(c2)  # Addition of all elements in c2, i.e. A^n + B^n

t1_2 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Runs the op.
    sess.run(sum)
t2_2 = datetime.datetime.now()
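Because the two matpow towers have no data dependency on each other, TensorFlow schedules them on their two GPUs concurrently; only the final tf.add_n, placed on the CPU, waits for both results. Comparing t2_2 - t1_2 against a single-GPU run of the same graph is how the original example demonstrates the speedup.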

Example 2

# The hyperparameters (num_gpus, num_input, num_classes, batch_size, dropout,
# learning_rate, num_steps, display_step), the conv_net model function, the
# assign_to_device / average_gradients helpers, the mnist dataset object and
# the numpy/time imports are all defined in the full Multi-GPU Training
# Example (reference 2 below).

# Place all ops on CPU by default
with tf.device('/cpu:0'):
    tower_grads = []
    reuse_vars = False

    # tf Graph input
    X = tf.placeholder(tf.float32, [None, num_input])
    Y = tf.placeholder(tf.float32, [None, num_classes])

    # Loop over all GPUs and construct their own computation graph
    for i in range(num_gpus):
        with tf.device(assign_to_device('/gpu:{}'.format(i), ps_device='/cpu:0')):

            # Split data between GPUs
            _x = X[i * batch_size: (i + 1) * batch_size]
            _y = Y[i * batch_size: (i + 1) * batch_size]

            # Because Dropout behaves differently at training and prediction time,
            # we need two distinct computation graphs that share the same weights.

            # Create a graph for training
            logits_train = conv_net(_x, num_classes, dropout,
                                    reuse=reuse_vars, is_training=True)
            # Create another graph for testing that reuses the same weights
            logits_test = conv_net(_x, num_classes, dropout,
                                   reuse=True, is_training=False)

            # Define loss and optimizer (with train logits, for dropout to take effect)
            loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
                logits=logits_train, labels=_y))
            optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
            grads = optimizer.compute_gradients(loss_op)

            # Only the first GPU computes accuracy
            if i == 0:
                # Evaluate model (with test logits, so dropout is disabled)
                correct_pred = tf.equal(tf.argmax(logits_test, 1), tf.argmax(_y, 1))
                accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

            reuse_vars = True
            tower_grads.append(grads)

    tower_grads = average_gradients(tower_grads)
    train_op = optimizer.apply_gradients(tower_grads)

    # Initializing the variables
    init = tf.global_variables_initializer()

    # Launch the graph
    with tf.Session() as sess:
        sess.run(init)
        # Keep training until we reach max iterations
        for step in range(1, num_steps + 1):
            # Get a batch for each GPU
            batch_x, batch_y = mnist.train.next_batch(batch_size * num_gpus)
            # Run optimization op (backprop)
            ts = time.time()
            sess.run(train_op, feed_dict={X: batch_x, Y: batch_y})
            te = time.time() - ts
            if step % display_step == 0 or step == 1:
                # Calculate batch loss and accuracy
                loss, acc = sess.run([loss_op, accuracy], feed_dict={X: batch_x,
                                                                     Y: batch_y})
                print("Step " + str(step) + ": Minibatch Loss= " +
                      "{:.4f}".format(loss) + ", Training Accuracy= " +
                      "{:.3f}".format(acc) + ", %i Examples/sec" % int(len(batch_x) / te))
        print("Optimization Finished!")

        # Calculate accuracy for 1000 mnist test images
        print("Testing Accuracy:",
              np.mean([sess.run(accuracy, feed_dict={X: mnist.test.images[i:i + batch_size],
                                                     Y: mnist.test.labels[i:i + batch_size]})
                       for i in range(0, len(mnist.test.images), batch_size)]))
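The two helpers deserve a note: assign_to_device pins variable ops to the parameter-server device (here the CPU) while placing compute ops on the given GPU, and average_gradients averages each variable's gradients across the towers. A sketch consistent with the full example (the exact code lives in reference 2 below):

PS_OPS = ['Variable', 'VariableV2', 'AutoReloadVariable']

def assign_to_device(device, ps_device='/cpu:0'):
    """Device function: variables go to ps_device, everything else to device."""
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        return ps_device if node_def.op in PS_OPS else device
    return _assign

def average_gradients(tower_grads):
    """Average the (gradient, variable) pairs computed by each tower."""
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # grad_and_vars = ((grad_gpu0, var), (grad_gpu1, var), ...):
        # the same variable, one gradient per GPU
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        # The variable is shared across towers, so the first tower's handle suffices
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads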
References:
  1. multigpu_basics (code)
  2. Multi-GPU Training Example (code)
  3. TensorFlow allocates memory on both GPUs, but only one GPU actually computes

Multi-threading

When initializing tf.ConfigProto(), you can also set the intra_op_parallelism_threads and inter_op_parallelism_threads parameters to control how many threads each op uses for parallel computation.

The difference between the two:

intra_op_parallelism_threads controls parallelism inside a single op

When a single op can be parallelized internally, such as a matrix multiplication or a reduce_sum, TensorFlow parallelizes it using a thread pool whose size is set by intra_op_parallelism_threads (intra meaning "within").

inter_op_parallelism_threads controls parallelism across multiple ops

When the graph contains several mutually independent ops, with no directed path connecting one to another, TensorFlow tries to run them concurrently, using a thread pool whose size is set by inter_op_parallelism_threads.

Setting either parameter to 0 lets the system pick an appropriate value.

import tensorflow as tf

config = tf.ConfigProto(device_count={"CPU": 4},  # upper bound on CPU devices TF may create
                        inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=4,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    pass  # To Do: build and run the graph here
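The numbers below come from a benchmark of this kind; here is a minimal sketch that times a single large matmul under different intra-op thread counts (the matrix size and step count are arbitrary choices of this sketch, not from the original measurement):

import time
import numpy as np
import tensorflow as tf

def time_matmul(intra_threads, steps=10):
    """Average seconds per run of one large matmul with the given intra-op thread count."""
    tf.reset_default_graph()
    a = tf.constant(np.random.rand(2000, 2000).astype('float32'))
    prod = tf.matmul(a, a)  # one op that parallelizes internally
    config = tf.ConfigProto(intra_op_parallelism_threads=intra_threads,
                            inter_op_parallelism_threads=1)
    with tf.Session(config=config) as sess:
        sess.run(prod)  # warm-up run
        ts = time.time()
        for _ in range(steps):
            sess.run(prod)
    return (time.time() - ts) / steps

for n in (2, 4):
    print("intra_op_parallelism_threads=%d: %.0f ms/step" % (n, time_matmul(n) * 1000))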

A concrete comparison with 2 vs. 4 threads, measuring the average run time per batch:

With intra_op_parallelism_threads = 2, the average time per step dropped from 610 ms to 380 ms.
With intra_op_parallelism_threads = 4, the average time per step dropped from 610 ms to 230 ms.

To sum up: under a fixed budget of CPU cores, choosing a sensible thread count can noticeably speed up a TensorFlow program.