Error Debugging Experience

1. Dynamic Graph to Static Graph Error Log

1.1 How to Read the Error Log

The following is example code that triggers a Dynamic-to-Static error:

import paddle
import numpy as np

@paddle.jit.to_static
def func(x):
    two = paddle.full(shape=[1], fill_value=2, dtype="int32")
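    # x has 3 elements, so reshaping it to [1, 2] below is invalid and raises the error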
    x = paddle.reshape(x, shape=[1, two])
    return x

def train():
    x = paddle.to_tensor(np.ones([3]).astype("int32"))
    func(x)

if __name__ == '__main__':
    train()

After execution, the error log is shown below:

The error log can be divided into 4 parts from top to bottom:

  • Native Python error stack: as shown in the first two lines, it is the chain of errors triggered by the call to the function train() on line 145 of the file /workspace/Paddle/run_dy2stat_error.py.

  • Start flag of the Dynamic-to-Static error stack: the line In transformed code marks the beginning of the Dynamic-to-Static error stack, i.e. the error messages produced while the transformed code was running. In practice, you can search directly for the In transformed code keyword and read the error log starting from that line.

  • User code error stack: useless framework-level messages are hidden, and only the error stack of the user code is reported. A wavy line and a HERE indicator are placed under the erroneous code to mark the exact error location, and the surrounding lines are shown as context to help you locate the error quickly. As shown in the third part of the figure above, the last erroneous user code is x = paddle.reshape(x, shape=[1, two]).

  • Framework-level error message: provides the static-graph construction error information. Generally, the last three lines tell you which Op's OpDesc generation raised the error; this is usually reported by that Op's infershape logic. The message in the figure above indicates that the reshape Op failed: the shape of tensor x is [3], and it is not allowed to reshape it to [1, 2].

NOTE: In some scenarios, the error type is recognized and revision suggestions are given, as shown in the figure below. The content under Revise suggestion lists troubleshooting suggestions for the error; you can check and modify the code accordingly.

1.2 Customized Display of Error Information

1.2.1 Native error messages not processed by the Dynamic-to-Static error reporting module

If you want to view Paddle's native error stack, i.e. the error stack that has not been processed by the Dynamic-to-Static error reporting module, set the environment variable TRANSLATOR_DISABLE_NEW_ERROR=1 to turn the module off. The default value of this environment variable is 0, which means the module is enabled by default. Add the following code to the example in section 1.1 to view the native error message:

import os
os.environ["TRANSLATOR_DISABLE_NEW_ERROR"] = '1'

You can get the following error message:

1.2.2 C++ error stack

The C++ error stack is hidden by default. You can set the environment variable FLAGS_call_stack_level=2 to display it. For example, run export FLAGS_call_stack_level=2 in the terminal, after which the error stack on the C++ side becomes visible:
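Alternatively, the flag can also be set inside the script, in the same way as TRANSLATOR_DISABLE_NEW_ERROR above; a minimal sketch (the flag must be set before any Paddle code executes):

import os
os.environ["FLAGS_call_stack_level"] = "2"   # show the C++ error stack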

2. Debugging Method

Before debugging, make sure that the dynamic-graph code runs successfully before conversion. Several debugging methods recommended for Dynamic-to-Static are introduced below.

2.1 Pdb Debugging

pdb is a Python module that defines an interactive Python source code debugger. It supports setting breakpoints, single-stepping through source lines, listing source code and variables, running Python code, and more.

2.1.1 Debugging steps

  • Step 1: insert import pdb; pdb.set_trace() before the code where you want to start pdb debugging.

    import paddle
    import numpy as np
    
    @paddle.jit.to_static
    def func(x):
        x = paddle.to_tensor(x)
        import pdb; pdb.set_trace()       # <------ enable pdb debugging
        two = paddle.full(shape=[1], fill_value=2, dtype="int32")
        x = paddle.reshape(x, shape=[1, two])
        return x
    
    func(np.ones([3]).astype("int32"))
    
  • Step 2: run the .py file as usual. Output similar to the following appears in the terminal; enter pdb commands after the (Pdb) prompt to debug.

    > /tmp/tmpm0iw5b5d.py(9)func()
    -> two = paddle.full(shape=[1], fill_value=2, dtype='int32')
    (Pdb)
    
  • Step 3: enter commands such as l and p in the pdb interactive mode to view the code and variables of the static graph produced by the Dynamic-to-Static transformation, and troubleshoot from there.

    > /tmp/tmpm0iw5b5d.py(9)func()
    -> two = paddle.full(shape=[1], fill_value=2, dtype='int32')
    (Pdb) l
      4     import numpy as np
      5     def func(x):
      6         x = paddle.assign(x)
      7         import pdb
      8         pdb.set_trace()
      9  ->     two = paddle.full(shape=[1], fill_value=2, dtype='int32')
     10         x = paddle.reshape(x, shape=[1, two])
     11         return x
    [EOF]
    (Pdb) p x
    var assign_0.tmp_0 : LOD_TENSOR.shape(3,).dtype(int32).stop_gradient(False)
    (Pdb)
    

2.1.2 Common commands

Commonly used commands include:

  • l(ist): show source code around the current line
  • p expression: print the value of an expression
  • n(ext): execute the current line and stop at the next one
  • s(tep): step into a function call
  • c(ontinue): resume execution until the next breakpoint
  • b(reak): set a breakpoint
  • q(uit): quit the debugger

For more pdb usage, see the official Python documentation: https://docs.python.org/3/library/pdb.html

2.3 Use Print to View Variables

The print function can be used to inspect variables, and the call itself is transformed: when the argument is a Paddle Tensor, it is converted to the Paddle operator Print; otherwise, the native Python print is run.

import paddle
import numpy as np

@paddle.jit.to_static
def func(x):
    x = paddle.to_tensor(x)

    # x is a Paddle Tensor, so the Paddle operator Print(x) runs
    print(x)
    # A plain string is not a Paddle Tensor, so the native Python print runs
    print("Here call print function.")

    if len(x) > 3:
        x = x - 1
    else:
        x = paddle.ones(shape=[1])
    return x

func(np.ones([1]))

After running, you can see the value of x:

Variable: assign_0.tmp_0
  - lod: {}
  - place: CUDAPlace(0)
  - shape: [1]
  - layout: NCHW
  - dtype: double
  - data: [1]

3. Quickly Determine the Cause of the Problem

Summarizing the common error messages, Dynamic-to-Static problems can be roughly divided into the following categories:

3.1 (NotFound) Input("X")

The error message is roughly as follows:

RuntimeError: (NotFound) Input("Filter") of ConvOp should not be null.
    [Hint: Expected ctx->HasInputs("Filter") == true, but received ctx->HasInputs("Filter"):0 != true:1.]
    [operator < conv2d > error]

The general cause of such problems is:

When execution reaches the erroneous line, some input or weight is still a dynamic-graph Tensor rather than a static-graph Variable or Parameter.

Troubleshooting suggestions:

  • First confirm that the sublayer containing this code inherits from nn.Layer

  • Check whether the function in this line is called on its own, bypassing the forward function (an issue before version 2.1)

  • To check whether a value is of Tensor or Variable type, debug with pdb (see the sketch below)
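As an illustration (a hypothetical sketch, not taken from a real error report), a sublayer that does not inherit nn.Layer keeps its weights as dynamic-graph Tensors, which can trigger this class of error:

import paddle

class ConvBlock:                         # <----- Wrong: does not inherit paddle.nn.Layer
    def __init__(self):
        self.conv = paddle.nn.Conv2D(3, 8, kernel_size=3)

    def __call__(self, x):
        return self.conv(x)

class MyNet(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        # The Conv2D weights inside ConvBlock are not registered with MyNet,
        # so they stay dynamic-graph Tensors during the to_static conversion
        self.block = ConvBlock()

    @paddle.jit.to_static
    def forward(self, x):
        return self.block(x)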

3.2 Expected input_dims[i] == input_dims[0]

The error message is roughly as follows:

[Hint: Expected input_dims[i] == input_dims[0], but received input_dims[i]:-1, -1 != input_dims[0]:16, -1.]
    [operator < xxx_op > error]

The general cause of such problems is:

While append_op builds the static-graph Program op by op, the compile-time infershape check of some Paddle API is not satisfied.

Troubleshooting suggestions:

  • At the code level, check whether an upstream reshape spreads -1 (unknown dimension) pollution

Since shapes are known at execution time in the dynamic graph, reshape(x, [-1, 0, 128]) works fine there. During static-graph construction, however, only compile-time shapes (which may contain -1) are available, so try to minimize the use of -1 with the reshape interface.

  • Combine this with the debugging techniques above to determine whether the output shape of some API differs between the dynamic and static graphs (see the sketch after this list)

For example, some Paddle APIs return a 1-D Tensor in the dynamic graph, while in the static graph the output shape always matches the input, e.g. ctx->SetOutputDim("Out", ctx->GetInputDim("X"));
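As a hedged illustration of reducing -1 usage in reshape (a minimal sketch, assuming a simple to_static function):

import paddle

@paddle.jit.to_static
def net(x):
    # At static-graph compile time, x.shape may contain -1 (unknown dims);
    # deriving the batch size from the runtime shape avoids spreading -1
    batch = paddle.shape(x)[0]
    return paddle.reshape(x, shape=[batch, 128])   # instead of shape=[-1, 128]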

3.3 desc->CheckGuards() == true

The error message is roughly as follows:

[Hint: Expected desc->CheckGuards() == true, but received desc->CheckGuards():0 != true: 1.]

The general cause of such problems is:

The model code contains complex Tensor slice or slice-assignment operations whose behavior is not yet fully aligned between the dynamic and static graphs.

The following summarizes the slice syntax currently supported by the dynamic and static graphs:

Troubleshooting suggestions:

  • Check whether the model code contains the complex Tensor slice operations described above

  • Prefer the paddle.slice interface as a replacement for complex Tensor slice operations (see the sketch below)
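For example, a minimal sketch of replacing Python-style slicing with the explicit paddle.slice API (the tensor x here is hypothetical):

import paddle

x = paddle.arange(12, dtype='float32').reshape([3, 4])

# Python-style slicing: y = x[0:2, 1:3]
# Equivalent explicit call, which converts more reliably:
y = paddle.slice(x, axes=[0, 1], starts=[0, 1], ends=[2, 3])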

3.4 Segment Fault

When a segmentation fault occurs in the Dynamic-to-Static module, very little error stack information is available, but the cause of such problems is generally clear-cut. The general causes are:

Some sublayer does not inherit nn.Layer, yet its __init__ method calls the paddle.to_tensor interface. As a result, dynamic-graph Tensor data is accessed in static-graph mode while the Program is generated or while model parameters are saved.

Troubleshooting suggestions:

  • Ensure that every sublayer inherits from nn.Layer (see the sketch below)
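An illustrative sketch of the pitfall described above (hypothetical code, mirroring the cause rather than a real report):

import paddle

class Helper:                        # <----- Wrong: does not inherit paddle.nn.Layer
    def __init__(self):
        # Creates a dynamic-graph Tensor at construction time; accessing it
        # later under static-graph mode can crash with a segmentation fault
        self.scale = paddle.to_tensor([0.5])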

3.5 Recommendations for Using Container

Under the dynamic graph, the following container classes are provided:

  • ParameterList

    class MyLayer(paddle.nn.Layer):
        def __init__(self, num_stacked_param):
            super().__init__()
    
            w1 = paddle.create_parameter(shape=[2, 2], dtype='float32')
            w2 = paddle.create_parameter(shape=[2], dtype='float32')
    
            # In this usage, MyLayer.parameters() returns empty
            self.params = [w1, w2]                            # <----- Wrong usage
    
            self.params = paddle.nn.ParameterList([w1, w2])   # <----- Correct usage
    
  • LayerList

    class MyLayer(paddle.nn.Layer):
        def __init__(self):
            super().__init__()
    
            layer1 = paddle.nn.Linear(10, 10)
            layer2 = paddle.nn.Linear(10, 16)
    
            # In this usage, MyLayer.parameters() returns empty
            self.linears = [layer1, layer2]                        # <----- Wrong usage
    
            self.linears = paddle.nn.LayerList([layer1, layer2])   # <----- Correct usage
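A quick way to verify the fix (a sketch assuming the MyLayer definitions above): with a plain Python list, parameters() returns an empty list, while the container versions register the parameters properly.

net = MyLayer()
# Non-empty only when ParameterList / LayerList is used
print(len(net.parameters()))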