Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17933

sanity test_63a: FAIL: failed grant check: client:700186624 server:368967680

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.17.0
    • Lustre 2.16.0
    • RHEL 9.4 zfs server
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/faabd6b5-a956-42bd-a631-867ad6924185

      test_63a failed with the following error:

      failed grant check: client:700186624 server:368967680
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/105370 - 5.14.0-362.24.1.el9_3.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/105370 - 5.14.0-427.20.1_lustre.el9.x86_64

      <<Please provide additional information about the failure here>>

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_63a - failed grant check: client:700186624 server:368967680

      Attachments

        Issue Links

          Activity

            [LU-17933] sanity test_63a: FAIL: failed grant check: client:700186624 server:368967680
            pjones Peter Jones added a comment -

            green should we revert this patch?

            pjones Peter Jones added a comment - green should we revert this patch?

            with this patch landed:

            Subject: LU-17933 target: do not break grants on RPC failure

            I hit this problem:

            [ 4745.188760] Lustre: DEBUG MARKER: == sanity-quota test 58: project ID should be kept for new mirrors created by FID ========================================================== 23:17:31 (1746314251)
            [ 4770.124961] LustreError: 208496:0:(file.c:247:ll_close_inode_openhandle()) lustre-clilmv-ffff9e1973355000: inode [0x200000402:0xac:0x0] mdc close failed: rc = -22
            [ 4805.518750] LustreError: 207592:0:(tgt_grant.c:747:tgt_grant_check()) lustre-OST0001: cli 86e86cb6-1593-4c0f-9b46-63889d50caa5 claims 1703936 GRANT, real grant 0
            [ 4805.571439] LustreError: 207592:0:(tgt_grant.c:560:tgt_grant_incoming()) lustre-OST0001: cli 86e86cb6-1593-4c0f-9b46-63889d50caa5/ffff9e1965267800 dirty 10223616 pend 0 grant -3407872
            [ 4805.571632] LustreError: 207592:0:(tgt_grant.c:562:tgt_grant_incoming()) LBUG
            [ 4805.571689] CPU: 1 PID: 207592 Comm: ll_ost_io00_006 Tainted: G           O     --------- -  - 4.18.0 #12
            [ 4805.571746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
            [ 4805.571805] Call Trace:
            [ 4805.571851]  dump_stack+0x6e/0xa0
            [ 4805.571888]  lbug_with_loc.cold.4+0x5/0x4e [libcfs]
            [ 4805.571932]  tgt_grant_incoming.isra.1+0x575/0x580 [ptlrpc]
            [ 4805.572065]  tgt_grant_prepare_write+0x397/0xf30 [ptlrpc]
            [ 4805.572194]  ofd_preprw+0xa47/0x27f0 [ofd]
            [ 4805.572278]  tgt_brw_write+0xda5/0x2720 [ptlrpc]
            [ 4805.572400]  ? tgt_request_preprocess.isra.1+0xdd/0x1440 [ptlrpc]
            [ 4805.572531]  tgt_request_handle+0x36c/0x1af0 [ptlrpc]
            [ 4805.572648]  ptlrpc_main+0x109b/0x3570 [ptlrpc]
            [ 4805.572765]  ? trace_hardirqs_on+0x1c/0xe0
            [ 4805.572795]  ? _raw_spin_unlock_irqrestore+0x43/0x70
            [ 4805.572829]  ? ptlrpc_wait_event+0x530/0x530 [ptlrpc]
            [ 4805.572944]  kthread+0x16e/0x1a0
            [ 4805.572977]  ? set_kthread_struct+0x40/0x40
            [ 4805.573008]  ret_from_fork+0x24/0x30
            [ 4805.573042] Kernel panic - not syncing: LBUG
            

            bisection:

            COMMIT          TESTED  PASSED  FAILED          COMMIT DESCRIPTION
            8dbf049479      1       0       1       BAD     LU-17427 mdt: reduce hold time for BFL rename lock
            9fe3951c44      7       3       4       BAD     LU-16350 ldiskfs: update for kernel 6.12
            df2b5d99ad      6       2       4       BAD     LU-17933 target: do not break grants on RPC failure
            a2e3a2f5a3      10      10      0       GOOD    LU-14590 utils: merge similar list_param params
            
            bzzz Alex Zhuravlev added a comment - with this patch landed: Subject: LU-17933 target: do not break grants on RPC failure I hit this problem: [ 4745.188760] Lustre: DEBUG MARKER: == sanity-quota test 58: project ID should be kept for new mirrors created by FID ========================================================== 23:17:31 (1746314251) [ 4770.124961] LustreError: 208496:0:(file.c:247:ll_close_inode_openhandle()) lustre-clilmv-ffff9e1973355000: inode [0x200000402:0xac:0x0] mdc close failed: rc = -22 [ 4805.518750] LustreError: 207592:0:(tgt_grant.c:747:tgt_grant_check()) lustre-OST0001: cli 86e86cb6-1593-4c0f-9b46-63889d50caa5 claims 1703936 GRANT, real grant 0 [ 4805.571439] LustreError: 207592:0:(tgt_grant.c:560:tgt_grant_incoming()) lustre-OST0001: cli 86e86cb6-1593-4c0f-9b46-63889d50caa5/ffff9e1965267800 dirty 10223616 pend 0 grant -3407872 [ 4805.571632] LustreError: 207592:0:(tgt_grant.c:562:tgt_grant_incoming()) LBUG [ 4805.571689] CPU: 1 PID: 207592 Comm: ll_ost_io00_006 Tainted: G O --------- - - 4.18.0 #12 [ 4805.571746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 [ 4805.571805] Call Trace: [ 4805.571851] dump_stack+0x6e/0xa0 [ 4805.571888] lbug_with_loc.cold.4+0x5/0x4e [libcfs] [ 4805.571932] tgt_grant_incoming.isra.1+0x575/0x580 [ptlrpc] [ 4805.572065] tgt_grant_prepare_write+0x397/0xf30 [ptlrpc] [ 4805.572194] ofd_preprw+0xa47/0x27f0 [ofd] [ 4805.572278] tgt_brw_write+0xda5/0x2720 [ptlrpc] [ 4805.572400] ? tgt_request_preprocess.isra.1+0xdd/0x1440 [ptlrpc] [ 4805.572531] tgt_request_handle+0x36c/0x1af0 [ptlrpc] [ 4805.572648] ptlrpc_main+0x109b/0x3570 [ptlrpc] [ 4805.572765] ? trace_hardirqs_on+0x1c/0xe0 [ 4805.572795] ? _raw_spin_unlock_irqrestore+0x43/0x70 [ 4805.572829] ? ptlrpc_wait_event+0x530/0x530 [ptlrpc] [ 4805.572944] kthread+0x16e/0x1a0 [ 4805.572977] ? set_kthread_struct+0x40/0x40 [ 4805.573008] ret_from_fork+0x24/0x30 [ 4805.573042] Kernel panic - not syncing: LBUG bisection: COMMIT TESTED PASSED FAILED COMMIT DESCRIPTION 8dbf049479 1 0 1 BAD LU-17427 mdt: reduce hold time for BFL rename lock 9fe3951c44 7 3 4 BAD LU-16350 ldiskfs: update for kernel 6.12 df2b5d99ad 6 2 4 BAD LU-17933 target: do not break grants on RPC failure a2e3a2f5a3 10 10 0 GOOD LU-14590 utils: merge similar list_param params
            pjones Peter Jones added a comment -

            Merged for 2.17

            pjones Peter Jones added a comment - Merged for 2.17

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56547/
            Subject: LU-17933 target: do not break grants on RPC failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: df2b5d99adfc6117a327f8e9102ba30076253213

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56547/ Subject: LU-17933 target: do not break grants on RPC failure Project: fs/lustre-release Branch: master Current Patch Set: Commit: df2b5d99adfc6117a327f8e9102ba30076253213

            "Vladimir Saveliev <vladimir.saveliev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56547
            Subject: LU-17933 target: do not break grants on RPC failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 166ad18f8168190882d96a45723b6f12ed92eebb

            gerrit Gerrit Updater added a comment - "Vladimir Saveliev <vladimir.saveliev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56547 Subject: LU-17933 target: do not break grants on RPC failure Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 166ad18f8168190882d96a45723b6f12ed92eebb
            [  140.727559] Lustre: DEBUG MARKER: == sanity test 60g: transaction abort won't cause MDT hung ========================================================== 10:45:27 (1724942727)
            ..
            [  141.236486] LustreError: 2008:0:(osc_request.c:2447:osc_brw_redo_request()) @@@ redo for recoverable error -5  req@ffff8800a9dcb480 x1808733430536576/t0(0) o4->lustre-OST0001-osc-ffff8800b3957800@192.168.203.102@tcp:6/4 lens 488/448 e 0 to 0 dl 1724942744 ref 2 fl Interpret:RMQU/200/0 rc -5/-5 job:'dmesg.0' uid:0 gid:0
            ..
            [  178.765575] Lustre: DEBUG MARKER: == sanity test 63a: Verify oig_wait interruption does not crash ================================================================= 10:46:05 (1724942765)
            ..
            [  278.785023] Lustre: DEBUG MARKER: sanity test_63a: @@@@@@ FAIL: failed grant check: client:574177280 server:92524544
            

            The above is from https://testing.whamcloud.com/gerrit-janitor/45308/testresults/sanity2-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/oleg302-client-console.txt.

            The grant accounting might break when write RPC x1808733430536576 failed.

            Grants are allocated in:

              tgt_brw_write
                obd_preprw
                  ofd_preprw
                    ofd_preprw_write
                      tgt_grant_prepare_write
                        oa->o_grant = tgt_grant_alloc()
                          tgd->tgd_tot_granted += grant;
                          ted->ted_grant += grant;
                          return grant;
            

            If further it happens to return error, oa->o_grant is zeroed:

              ofd_preprw_write
                ..
              out:
                tgt_grant_prepare_read
                  oa->o_grant = 0;
            

            After that client does not get grant update, while tgd->tgd_tot_granted and ted->ted_grant remain increased.

            vsaveliev Vladimir Saveliev added a comment - [ 140.727559] Lustre: DEBUG MARKER: == sanity test 60g: transaction abort won't cause MDT hung ========================================================== 10:45:27 (1724942727) .. [ 141.236486] LustreError: 2008:0:(osc_request.c:2447:osc_brw_redo_request()) @@@ redo for recoverable error -5 req@ffff8800a9dcb480 x1808733430536576/t0(0) o4->lustre-OST0001-osc-ffff8800b3957800@192.168.203.102@tcp:6/4 lens 488/448 e 0 to 0 dl 1724942744 ref 2 fl Interpret:RMQU/200/0 rc -5/-5 job:'dmesg.0' uid:0 gid:0 .. [ 178.765575] Lustre: DEBUG MARKER: == sanity test 63a: Verify oig_wait interruption does not crash ================================================================= 10:46:05 (1724942765) .. [ 278.785023] Lustre: DEBUG MARKER: sanity test_63a: @@@@@@ FAIL: failed grant check: client:574177280 server:92524544 The above is from https://testing.whamcloud.com/gerrit-janitor/45308/testresults/sanity2-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/oleg302-client-console.txt . The grant accounting might break when write RPC x1808733430536576 failed. Grants are allocated in: tgt_brw_write obd_preprw ofd_preprw ofd_preprw_write tgt_grant_prepare_write oa->o_grant = tgt_grant_alloc() tgd->tgd_tot_granted += grant; ted->ted_grant += grant; return grant; If further it happens to return error, oa->o_grant is zeroed: ofd_preprw_write .. out: tgt_grant_prepare_read oa->o_grant = 0; After that client does not get grant update, while tgd->tgd_tot_granted and ted->ted_grant remain increased.
            yujian Jian Yu added a comment -

            sanity test 63b, 64a, 64c, and 64d also failed with the same issue.
            The same tests passed on RHEL 9.4 ldiskfs server test session:
            https://testing.whamcloud.com/test_sessions/f00b5ec1-c22b-49fd-a786-04f9815b1746

            yujian Jian Yu added a comment - sanity test 63b, 64a, 64c, and 64d also failed with the same issue. The same tests passed on RHEL 9.4 ldiskfs server test session: https://testing.whamcloud.com/test_sessions/f00b5ec1-c22b-49fd-a786-04f9815b1746

            People

              bzzz Alex Zhuravlev
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: