
tf.keras.layers.Dense leads to significant differences between CPU and GPU runs of the model implementation code #67829

Open
PhyllisJi opened this issue May 17, 2024 · 3 comments
Labels: comp:keras (Keras related issues), TF2.14 (for issues related to TensorFlow 2.14.x), type:bug (Bug)

Comments

@PhyllisJi

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

tf 2.14.0

Custom code

Yes

OS platform and distribution

Ubuntu 20.04

Mobile device

No response

Python version

3.10

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

12.2

GPU model and memory

No response

Current behavior?

When the same model is run on CPU and on GPU, the forward-pass outputs of the whole network differ by more than 0.05. The outputs are consistent until the line fc3_output = tf.keras.layers.Dense(units=10, use_bias=True, name="linear3_mutated")(relu4_output) is added.
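A minimal sketch of how such a CPU/GPU comparison can be made in one script (the helper name compare_devices and this driver are illustrative additions, not part of the original report; it copies the CPU weights onto the GPU model so any difference comes from the kernels rather than from random initialization, and it assumes the os.environ['CUDA_VISIBLE_DEVICES'] = '' line in the repro below is removed so a GPU is visible):

import numpy as np
import tensorflow as tf

def compare_devices(build_model, x):
    # Build on CPU, capture the weights, run one forward pass.
    with tf.device('/CPU:0'):
        cpu_model = build_model(x.shape[1:])
        weights = cpu_model.get_weights()
        cpu_out = cpu_model(x).numpy()
    # Rebuild on GPU with the exact same weights and compare.
    with tf.device('/GPU:0'):
        gpu_model = build_model(x.shape[1:])
        gpu_model.set_weights(weights)
        gpu_out = gpu_model(x).numpy()
    return np.max(np.abs(cpu_out - gpu_out))

x = tf.convert_to_tensor(np.random.random((8, 28, 28, 1)).astype(np.float32))
print(compare_devices(Model_VlysjQxB81qtaIXsA_VkCXmPGmE7aDNP, x))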

Standalone code to reproduce the issue

import tensorflow as tf
import numpy as np
import os


os.environ['CUDA_VISIBLE_DEVICES'] = ''

def Model_VlysjQxB81qtaIXsA_VkCXmPGmE7aDNP(input):
    input = tf.keras.Input(shape=input)
    _zeropadding_input = tf.keras.layers.ZeroPadding2D(padding=((0, 0), (0, 0)))(input)
    conv1_output = tf.keras.layers.Conv2DTranspose(filters=6, kernel_size=(5, 5), strides=(1, 1), padding="valid", output_padding=(0, 0), data_format="channels_last", dilation_rate=(1, 1), use_bias=True, name="conv1_mutated")(input)
    relu1_output = tf.nn.relu(conv1_output)
    _zeropadding_relu1_output = tf.keras.layers.ZeroPadding2D(padding=((0, 0), (0, 0)))(relu1_output)
    maxpool1_output = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding="valid", data_format="channels_last", name="pool1")(_zeropadding_relu1_output)
    _zeropadding_maxpool1_output = tf.keras.layers.ZeroPadding2D(padding=((0, 0), (0, 0)))(maxpool1_output)
    conv2_output = tf.keras.layers.Conv2D(filters=16, kernel_size=(6, 8), strides=(1, 1), padding="valid", data_format="channels_last", dilation_rate=(1, 1), groups=1, use_bias=True, name="conv2_mutated")(_zeropadding_maxpool1_output)
    relu2_output = tf.math.softsign(conv2_output)
    _zeropadding_relu2_output = tf.keras.layers.ZeroPadding2D(padding=((0, 0), (0, 0)))(relu2_output)
    maxpool2_output = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding="valid", data_format="channels_last", name="pool2")(_zeropadding_relu2_output)
    # Rank-indexed permutations (NHWC -> NCHW for rank-4 tensors); the rank-1
    # entry is written as a tuple so every element can be passed to list().
    output_transpose = [(0,), (0, 1), (0, 2, 1), (0, 3, 1, 2), (0, 4, 1, 2, 3)]
    maxpool2_output = tf.transpose(maxpool2_output, list(output_transpose[len(maxpool2_output.shape) - 1]))
    flatten_output = tf.keras.layers.Flatten(data_format="channels_last", name="flatten")(maxpool2_output)
    fc1_output = tf.keras.layers.Dense(units=120, use_bias=True, name="linear1")(flatten_output)
    relu3_output = tf.keras.layers.ThresholdedReLU(theta=0.1, name="relu3_mutated")(fc1_output)
    fc2_output = tf.keras.layers.Dense(units=84, use_bias=True, name="linear2_mutated")(relu3_output)
    relu4_output = tf.math.erf(fc2_output)
    fc3_output = tf.keras.layers.Dense(units=10, use_bias=True, name="linear3_mutated")(relu4_output)
    output_transpose = [(0,), (0, 1), (0, 2, 1), (0, 3, 1, 2), (0, 4, 1, 2, 3)]
    fc3_output = tf.transpose(fc3_output, list(output_transpose[len(fc3_output.shape) - 1]))
    tail_flatten_output = tf.keras.layers.Flatten(data_format="channels_last", name="tail_flatten")(fc3_output)
    tail_fc_output = tf.keras.layers.Dense(units=10, use_bias=True, name="tail_fc")(tail_flatten_output)

    model = tf.keras.models.Model(inputs=input, outputs=tail_fc_output)
    return model


def go():
    with tf.device('/CPU:0'):
        try:
            shape = [1, 1, 28, 28]
            _numpy = np.random.random(shape).astype(np.float32)
            tf_input = tf.convert_to_tensor(_numpy.transpose(0, 2, 3, 1), dtype=tf.float32)
            tf_model = Model_VlysjQxB81qtaIXsA_VkCXmPGmE7aDNP(tf_input.shape[1:])
            tf_output = tf_model(tf_input)
            flag = True
        except Exception:
            flag = False
        return flag


def initialize(model):
    # Load reference weights/biases saved per layer as .npz files and transpose
    # them from the saved layout into the Keras kernel/bias layout.
    module_dir = os.path.dirname(__file__)
    gradient_transpose = [(0,), (1, 0), (2, 1, 0), (2, 3, 1, 0), (2, 3, 4, 1, 0)]
    for layer in model.layers:
        matrix_path = module_dir + '/../initializer/' + layer.name
        if hasattr(layer, 'kernel_initializer'):
            weight_init_path = matrix_path + '/weight.npz'
            weight_init = np.load(weight_init_path)
            weight_init = weight_init['matrix']
            tf_weight = tf.convert_to_tensor(weight_init, dtype=tf.float32)
            tf_weight = tf.transpose(tf_weight, gradient_transpose[len(tf_weight.shape) - 1])
            layer.kernel.assign(tf.keras.initializers.Constant(tf_weight)(layer.kernel.shape))
        if hasattr(layer, 'bias_initializer') and layer.use_bias:
            bias_init_path = matrix_path + '/bias.npz'
            bias_init = np.load(bias_init_path)
            bias_init = bias_init['matrix']
            tf_bias = tf.convert_to_tensor(bias_init, dtype=tf.float32)
            tf_bias = tf.transpose(tf_bias, gradient_transpose[len(tf_bias.shape) - 1])
            layer.bias.assign(tf.keras.initializers.Constant(tf_bias)(layer.bias.shape))

def train(inp, label):
    with tf.device('/CPU:0'):
        shape = inp.shape
        tf_input = tf.convert_to_tensor(inp.transpose(0, 2, 3, 1), dtype=tf.float32)
        tf_model = Model_VlysjQxB81qtaIXsA_VkCXmPGmE7aDNP(tf_input.shape[1:])

        initialize(tf_model)
        tf_output = tf_model(tf_input)
        output_transpose = [(0,), (0, 1), (0, 2, 1), (0, 3, 1, 2), (0, 4, 1, 2, 3)]
        tf_output_trans = tf.transpose(tf_output, list(output_transpose[len(tf_output.shape) - 1])).numpy()

        tf_targets = tf.convert_to_tensor(label)
        with tf.GradientTape() as tape:
            tf_predictions = tf_model(tf_input)
            tf_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(tf_targets, tf_predictions)
        tf_gradients = tape.gradient(tf_loss, tf_model.trainable_variables)
        tf_gradients_dic = {}
        for var, gradient in zip(tf_model.trainable_variables, tf_gradients):
            gradient_transpose = [(0, ), (1, 0), (2, 0, 1), (3, 2, 0, 1), (4, 3, 0, 1, 2)]
            tf_gradient = tf.transpose(gradient, list(gradient_transpose[len(gradient.shape) - 1])).numpy()
            tf_gradients_dic.setdefault(var.name.replace('/', '.')[:-2], tf_gradient)
        return tf_gradients_dic, float(tf_loss.numpy()), tf_output_trans
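A hypothetical driver for the functions above (not part of the original report): it builds the model with its default initializers and runs a single forward pass. train() additionally needs the per-layer weight.npz/bias.npz files under ../initializer/ from the linked repository, so it is not called here.

if __name__ == '__main__':
    # Random NHWC batch of four 28x28 single-channel images.
    x = np.random.random((4, 28, 28, 1)).astype(np.float32)
    model = Model_VlysjQxB81qtaIXsA_VkCXmPGmE7aDNP(x.shape[1:])
    out = model(tf.convert_to_tensor(x, dtype=tf.float32))
    print(out.shape)  # expected: (4, 10)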

Relevant log output

No response

@PhyllisJi
Author

Data from /Users/pinji/Desktop/MoCoDiff/tf-0513/tensorflow-LeNet/LeNet-12-654/case/tensorflow_cpu/output.npz:
Array name: output
[[-0.19694163, -0.09662387, -0.2447289 , ..., -0.03054091, -0.01255638,
0.11781679],
[-0.09944421, -0.1292284 , -0.20278051, ..., -0.01450334, 0.1526829 ,
0.2054872 ],
[ 0.07628068, -0.14517091, -0.16096437, ..., -0.21423727, 0.14428714,
0.09608107],
...,
[-0.20587711, -0.10656235, -0.33036825, ..., -0.13541238, 0.3215404 ,
0.2165656 ],
[-0.11513966, -0.1102111 , -0.3434586 , ..., -0.19525276, 0.08722814,
-0.05503507],
[-0.09822497, -0.10671525, -0.2368006 , ..., -0.16467465, 0.15050802,
0.10809175]]

========================================
Data from /Users/pinji/Desktop/MoCoDiff/tf-0513/tensorflow-LeNet/LeNet-12-654/case/tensorflow_gpu/output.npz:
Array name: output
[[-0.19693479, -0.09660047, -0.24478191, ..., -0.03068957, -0.01259471,
0.1177731 ],
[-0.09948827, -0.12920822, -0.20286846, ..., -0.01475166, 0.15262538,
0.20549278],
[ 0.07635015, -0.14510253, -0.16103148, ..., -0.21449052, 0.14431402,
0.09607705],
...,
[-0.20597562, -0.1065662 , -0.33040133, ..., -0.13553414, 0.32143065,
0.2166584 ],
[-0.1152329 , -0.11016633, -0.34351912, ..., -0.19536608, 0.08703801,
-0.05509032],
[-0.09832728, -0.10680126, -0.2368007 , ..., -0.1649086 , 0.1503351 ,
0.10806311]]

========================================
543 diff: 0.055949583649635315
cpu-543: [-0.1139553 -0.04668004 -0.4390249 0.20775127 0.0899936 0.22854462
-0.46038878 -0.15526699 0.23111053 -0.03327706]
gpu-543: [-0.12328502 -0.0404007 -0.4342668 0.21254592 0.05579592 0.24491714
-0.479826 -0.15429592 0.17516094 -0.05355967]

@tilakrayal added the TF2.14 and comp:keras labels on May 20, 2024
@tilakrayal
Contributor

@PhyllisJi,
I tried to execute the mentioned code. Kindly find the gist of it here. In the given code snippet, you have defined the class and its methods but are not calling them anywhere. Also, please try executing the code with Keras 3.0, which is the default for TensorFlow 2.16, and let us know if you are still facing the same issue. Thank you!
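A quick way to confirm which TensorFlow and Keras versions an environment actually uses (a minimal check, assuming the standalone keras package is importable):

import tensorflow as tf
import keras

print(tf.__version__)     # e.g. 2.16.x
print(keras.__version__)  # 3.x when Keras 3 is the default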

@tilakrayal added the stat:awaiting response label on May 20, 2024
@google-ml-butler bot removed the stat:awaiting response label on May 21, 2024
@PhyllisJi
Author

@PhyllisJi, I tried to execute the mentioned code. Kindly find the gist of it here. In the given code snippet, you have defined the class and its methods but are not calling them anywhere. Also, please try executing the code with Keras 3.0, which is the default for TensorFlow 2.16, and let us know if you are still facing the same issue. Thank you!

Because the issue needs specific data to reproduce, I have put the reproduction code, data, and steps in a repository, which you can clone and run: https://github.com/PhyllisJi/MoCoDiff_Bug/tree/tf-issue%2367829
