Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-18343

interop: sanity test_160a: FAIL: mkdir parent '/mnt/lustre/d160a.sanity/pics/2008' failed

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.16.0, Lustre 2.15.6
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/04e77c61-78cc-47c8-b19b-c8f083816002

      test_160a failed with the following error:

      Registered 4 changelog users: 'cl1 cl1 cl1 cl1'
      CMD: onyx-136vm12 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users
      mkdir: cannot create directory '/mnt/lustre/d160a.sanity/pics/2008': Protocol error
       sanity test_160a: @@@@@@ FAIL: mkdir parent '/mnt/lustre/d160a.sanity/pics/2008' failed
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-b2_15/94 - 4.18.0-553.5.1.el8_10.x86_64
      servers: https://build.whamcloud.com/job/lustre-master/4581 - 4.18.0-553.16.1.el8_lustre.x86_64

      <<Please provide additional information about the failure here>>

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_160a - mkdir parent '/mnt/lustre/d160a.sanity/pics/2008' failed

      Attachments

        Issue Links

          Activity

            [LU-18343] interop: sanity test_160a: FAIL: mkdir parent '/mnt/lustre/d160a.sanity/pics/2008' failed
            pjones Peter Jones added a comment -

            Merged for 2.16

            pjones Peter Jones added a comment - Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56731/
            Subject: LU-18343 lod: old client mkdir on wrong MDT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c1117ce67f7660af912f73467639e681ab79cd01

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56731/ Subject: LU-18343 lod: old client mkdir on wrong MDT Project: fs/lustre-release Branch: master Current Patch Set: Commit: c1117ce67f7660af912f73467639e681ab79cd01

            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56731
            Subject: LU-18343 lod: old client mkdir on wrong MDT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 07e8a6f0cff954cefd61bdba099e5b109444ea27

            gerrit Gerrit Updater added a comment - "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56731 Subject: LU-18343 lod: old client mkdir on wrong MDT Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 07e8a6f0cff954cefd61bdba099e5b109444ea27
            laisiyao Lai Siyao added a comment -

            This is a bug in master code: after LU-15971 clients maintain default LMV on client side, so they can always send mkdir to correct MDT, while 2.15 clients may not. But in the code listed by Emoly, the server side applied strict check on this (it returns -EREMOTE previously, and client will retry upon this).

            I will push a fix to master to resolve this.

            laisiyao Lai Siyao added a comment - This is a bug in master code: after LU-15971 clients maintain default LMV on client side, so they can always send mkdir to correct MDT, while 2.15 clients may not. But in the code listed by Emoly, the server side applied strict check on this (it returns -EREMOTE previously, and client will retry upon this). I will push a fix to master to resolve this.
            ys Yang Sheng added a comment -

            If you want to verify some thing. I have made livedebug environment on test cluster as YuJian point.

            Session lustre-reviews-lustre-master-el8.10-x86_64_lustre-b2_15-el8.10-x86_64-custom finished testing. The cluster will remain available for debugging until 2024-10-19 12:53:21 UTC.
            The nodes are as follows:
            Clients: onyx-146vm1, onyx-146vm2
            MDSs: onyx-146vm10, onyx-146vm11
            OSSs: onyx-146vm3
            
            Data collected for this session can be found in directory: /autotest/autotest-2/2024-10-17/lustre-reviews_custom_108387_1001_a853630d-8867-49a1-84df-8d0b250c7ed0/
            No action is necessary when the debug period expires, the nodes will automatically return to Autotest's resource pool.
            
            ys Yang Sheng added a comment - If you want to verify some thing. I have made livedebug environment on test cluster as YuJian point. Session lustre-reviews-lustre-master-el8.10-x86_64_lustre-b2_15-el8.10-x86_64-custom finished testing. The cluster will remain available for debugging until 2024-10-19 12:53:21 UTC. The nodes are as follows: Clients: onyx-146vm1, onyx-146vm2 MDSs: onyx-146vm10, onyx-146vm11 OSSs: onyx-146vm3 Data collected for this session can be found in directory: /autotest/autotest-2/2024-10-17/lustre-reviews_custom_108387_1001_a853630d-8867-49a1-84df-8d0b250c7ed0/ No action is necessary when the debug period expires, the nodes will automatically return to Autotest's resource pool.
            ys Yang Sheng added a comment - - edited

            The issue is very easy to reproduce. Just run

            # lfs mkdir -i 0 -c 1 dir1
            # cd dir1
            # mkdir -p 1/2/3   
            

            Can got the EPROTO. Client log.

            00800000:00000001:1.0:1729183177.246269:0:555344:0:(lmv_obd.c:1919:lmv_create()) Process entered
            40000000:00000001:1.0:1729183177.246278:0:555344:0:(fid_request.c:313:seq_client_alloc_fid()) Process entered
            40000000:00000040:1.0:1729183177.246279:0:555344:0:(fid_request.c:350:seq_client_alloc_fid()) cli-cli-lustre-MDT0000-mdc-ffff991819094800: Allocated FID [0x200000403:0x2b:0x0]
            40000000:00000001:1.0:1729183177.246280:0:555344:0:(fid_request.c:354:seq_client_alloc_fid()) Process leaving (rc=0 : 0 : 0)
            00800000:00000002:1.0:1729183177.246283:0:555344:0:(lmv_obd.c:1980:lmv_create()) CREATE name '1' [0x200000403:0x2b:0x0] on [0x200000402:0x1:0x0] -> mds #0
            00800000:00000002:1.0:1729183177.246965:0:555344:0:(lmv_obd.c:1988:lmv_create()) Created - [0x200000403:0x2b:0x0]
            00800000:00000001:1.0:1729183177.246966:0:555344:0:(lmv_obd.c:1994:lmv_create()) Process leaving (rc=0 : 0 : 0)
            00800000:00000001:1.0:1729183177.248092:0:555344:0:(lmv_obd.c:1919:lmv_create()) Process entered
            00800000:00000001:1.0:1729183177.248095:0:555344:0:(lmv_obd.c:1473:lmv_locate_tgt_qos()) Process entered
            00800000:00000001:1.0:1729183177.248096:0:555344:0:(lmv_obd.c:1476:lmv_locate_tgt_qos()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
            40000000:00000001:1.0:1729183177.248123:0:555344:0:(fid_request.c:313:seq_client_alloc_fid()) Process entered
            40000000:00000040:1.0:1729183177.248124:0:555344:0:(fid_request.c:350:seq_client_alloc_fid()) cli-cli-lustre-MDT0000-mdc-ffff991819094800: Allocated FID [0x200000403:0x2c:0x0]
            40000000:00000001:1.0:1729183177.248125:0:555344:0:(fid_request.c:354:seq_client_alloc_fid()) Process leaving (rc=0 : 0 : 0)
            00800000:00000002:1.0:1729183177.248127:0:555344:0:(lmv_obd.c:1980:lmv_create()) CREATE name '2' [0x200000403:0x2c:0x0] on [0x200000403:0x2b:0x0] -> mds #0
            00800000:00000002:1.0:1729183177.248627:0:555344:0:(lmv_obd.c:1988:lmv_create()) Created - [0x200000403:0x2c:0x0]
            00800000:00000001:1.0:1729183177.248627:0:555344:0:(lmv_obd.c:1994:lmv_create()) Process leaving (rc=0 : 0 : 0)
            00800000:00000001:1.0:1729183177.249785:0:555344:0:(lmv_obd.c:1919:lmv_create()) Process entered
            00800000:00000001:1.0:1729183177.249790:0:555344:0:(lmv_obd.c:1473:lmv_locate_tgt_qos()) Process entered
            00000020:00000001:1.0:1729183177.249791:0:555344:0:(lu_tgt_descs.c:459:ltd_qos_penalties_calc()) Process entered
            00000020:00000001:1.0:1729183177.249792:0:555344:0:(lu_tgt_descs.c:596:ltd_qos_penalties_calc()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
            00800000:00000001:1.0:1729183177.249793:0:555344:0:(lmv_obd.c:1485:lmv_locate_tgt_qos()) Process leaving via unlock (rc=18446744073709551605 : -11 : 0xfffffffffffffff5)
            40000000:00000001:1.0:1729183177.249820:0:555344:0:(fid_request.c:313:seq_client_alloc_fid()) Process entered
            40000000:00000040:1.0:1729183177.249821:0:555344:0:(fid_request.c:350:seq_client_alloc_fid()) cli-cli-lustre-MDT0001-mdc-ffff991819094800: Allocated FID [0x240000402:0x10:0x0]
            40000000:00000001:1.0:1729183177.249822:0:555344:0:(fid_request.c:354:seq_client_alloc_fid()) Process leaving (rc=0 : 0 : 0)
            00800000:00000002:1.0:1729183177.249825:0:555344:0:(lmv_obd.c:1980:lmv_create()) CREATE name '3' [0x240000402:0x10:0x0] on [0x200000403:0x2c:0x0] -> mds #1
            00800000:00000001:1.0:1729183177.302918:0:555344:0:(lmv_obd.c:1994:lmv_create()) Process leaving (rc=18446744073709551545 : -71 : ffffffffffffffb9)
            
            ys Yang Sheng added a comment - - edited The issue is very easy to reproduce. Just run # lfs mkdir -i 0 -c 1 dir1 # cd dir1 # mkdir -p 1/2/3 Can got the EPROTO. Client log. 00800000:00000001:1.0:1729183177.246269:0:555344:0:(lmv_obd.c:1919:lmv_create()) Process entered 40000000:00000001:1.0:1729183177.246278:0:555344:0:(fid_request.c:313:seq_client_alloc_fid()) Process entered 40000000:00000040:1.0:1729183177.246279:0:555344:0:(fid_request.c:350:seq_client_alloc_fid()) cli-cli-lustre-MDT0000-mdc-ffff991819094800: Allocated FID [0x200000403:0x2b:0x0] 40000000:00000001:1.0:1729183177.246280:0:555344:0:(fid_request.c:354:seq_client_alloc_fid()) Process leaving (rc=0 : 0 : 0) 00800000:00000002:1.0:1729183177.246283:0:555344:0:(lmv_obd.c:1980:lmv_create()) CREATE name '1' [0x200000403:0x2b:0x0] on [0x200000402:0x1:0x0] -> mds #0 00800000:00000002:1.0:1729183177.246965:0:555344:0:(lmv_obd.c:1988:lmv_create()) Created - [0x200000403:0x2b:0x0] 00800000:00000001:1.0:1729183177.246966:0:555344:0:(lmv_obd.c:1994:lmv_create()) Process leaving (rc=0 : 0 : 0) 00800000:00000001:1.0:1729183177.248092:0:555344:0:(lmv_obd.c:1919:lmv_create()) Process entered 00800000:00000001:1.0:1729183177.248095:0:555344:0:(lmv_obd.c:1473:lmv_locate_tgt_qos()) Process entered 00800000:00000001:1.0:1729183177.248096:0:555344:0:(lmv_obd.c:1476:lmv_locate_tgt_qos()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5) 40000000:00000001:1.0:1729183177.248123:0:555344:0:(fid_request.c:313:seq_client_alloc_fid()) Process entered 40000000:00000040:1.0:1729183177.248124:0:555344:0:(fid_request.c:350:seq_client_alloc_fid()) cli-cli-lustre-MDT0000-mdc-ffff991819094800: Allocated FID [0x200000403:0x2c:0x0] 40000000:00000001:1.0:1729183177.248125:0:555344:0:(fid_request.c:354:seq_client_alloc_fid()) Process leaving (rc=0 : 0 : 0) 00800000:00000002:1.0:1729183177.248127:0:555344:0:(lmv_obd.c:1980:lmv_create()) CREATE name '2' [0x200000403:0x2c:0x0] on [0x200000403:0x2b:0x0] -> mds #0 00800000:00000002:1.0:1729183177.248627:0:555344:0:(lmv_obd.c:1988:lmv_create()) Created - [0x200000403:0x2c:0x0] 00800000:00000001:1.0:1729183177.248627:0:555344:0:(lmv_obd.c:1994:lmv_create()) Process leaving (rc=0 : 0 : 0) 00800000:00000001:1.0:1729183177.249785:0:555344:0:(lmv_obd.c:1919:lmv_create()) Process entered 00800000:00000001:1.0:1729183177.249790:0:555344:0:(lmv_obd.c:1473:lmv_locate_tgt_qos()) Process entered 00000020:00000001:1.0:1729183177.249791:0:555344:0:(lu_tgt_descs.c:459:ltd_qos_penalties_calc()) Process entered 00000020:00000001:1.0:1729183177.249792:0:555344:0:(lu_tgt_descs.c:596:ltd_qos_penalties_calc()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5) 00800000:00000001:1.0:1729183177.249793:0:555344:0:(lmv_obd.c:1485:lmv_locate_tgt_qos()) Process leaving via unlock (rc=18446744073709551605 : -11 : 0xfffffffffffffff5) 40000000:00000001:1.0:1729183177.249820:0:555344:0:(fid_request.c:313:seq_client_alloc_fid()) Process entered 40000000:00000040:1.0:1729183177.249821:0:555344:0:(fid_request.c:350:seq_client_alloc_fid()) cli-cli-lustre-MDT0001-mdc-ffff991819094800: Allocated FID [0x240000402:0x10:0x0] 40000000:00000001:1.0:1729183177.249822:0:555344:0:(fid_request.c:354:seq_client_alloc_fid()) Process leaving (rc=0 : 0 : 0) 00800000:00000002:1.0:1729183177.249825:0:555344:0:(lmv_obd.c:1980:lmv_create()) CREATE name '3' [0x240000402:0x10:0x0] on [0x200000403:0x2c:0x0] -> mds #1 00800000:00000001:1.0:1729183177.302918:0:555344:0:(lmv_obd.c:1994:lmv_create()) Process leaving (rc=18446744073709551545 : -71 : ffffffffffffffb9)
            emoly.liu Emoly Liu added a comment -

            laisiyao , could you please look into this issue? Thanks.

            emoly.liu Emoly Liu added a comment - laisiyao  , could you please look into this issue? Thanks.
            emoly.liu Emoly Liu added a comment -

            After some investigation and testing, I found this issue(-EPROTO) was caused by the fix of LU-12998 commit 6dbb4c at https://review.whamcloud.com/c/fs/lustre-release/+/53437 

            +                       lod_foreach_mdt(lod, mdt) {
            +                               if (mdt->ltd_index ==
            +                                   lo->ldo_dir_stripe_offset) {
            +                                       rc = -EPROTO;
            +                                       /* refresh statfs */
            +                                       dt_statfs(env, mdt->ltd_tgt,
            +                                                 &mdt->ltd_statfs);
            +                                       no_create = (mdt->ltd_statfs.os_state &
            +                                                    OS_STATFS_NOCREATE);
            +                                       break;
            

            I will ask Lai to have a look.

            emoly.liu Emoly Liu added a comment - After some investigation and testing, I found this issue(-EPROTO) was caused by the fix of LU-12998 commit 6dbb4c at https://review.whamcloud.com/c/fs/lustre-release/+/53437   + lod_foreach_mdt(lod, mdt) { + if (mdt->ltd_index == + lo->ldo_dir_stripe_offset) { + rc = -EPROTO; + /* refresh statfs */ + dt_statfs(env, mdt->ltd_tgt, + &mdt->ltd_statfs); + no_create = (mdt->ltd_statfs.os_state & + OS_STATFS_NOCREATE); + break; I will ask Lai to have a look.
            emoly.liu Emoly Liu added a comment -

            According to the Maloo testing results, this issue should be introduced between 2.15.64 and 2.15.63, build #4544 to #4540.

            Still in investigation.

            emoly.liu Emoly Liu added a comment - According to the Maloo testing results, this issue should be introduced between 2.15.64 and 2.15.63, build #4544 to #4540. Still in investigation.

            "Emoly Liu <emoly@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56686
            Subject: LU-18343 mdt: revert commit 652032d6c18 (debug)
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3f74264d7d08ac55e02d919ff212f77d7d792a78

            gerrit Gerrit Updater added a comment - "Emoly Liu <emoly@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56686 Subject: LU-18343 mdt: revert commit 652032d6c18 (debug) Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 3f74264d7d08ac55e02d919ff212f77d7d792a78

            People

              laisiyao Lai Siyao
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: