  LU-9408

Client fails to mount with ZFS master (0.7.0) and Lustre master (2.9.56)

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Fix Version: Lustre 2.10.0
    • Severity: 3

    Description

      ZFS: zfs-0.7.0-rc3-225-g7a25f08
      SPL: spl-0.7.0-rc3-8-g481762f
      Lustre: v2_9_56_0-11-gbfa524f

      This is a straightforward single MDS with MDT/MGT and an OSS with a single OST.
      (also reproduced with split MDT and MGT on a slightly earlier ZFS).

      This setup works without problems with ldiskfs backing.
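
      For reference, the client mount is presumably something along these lines (the MGS NID is taken from the OSS log below; the mount point is an assumption):

      mount -t lustre 192.168.56.13@tcp:/scratch /mnt/scratch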

      Snippet from MDS log on initial mount:

      Apr 26 14:01:18 ieel-mds04 kernel: SPL: Loaded module v0.7.0-rc3_8_g481762f
      Apr 26 14:01:19 ieel-mds04 kernel: ZFS: Loaded module v0.7.0-rc3_225_g7a25f08, ZFS pool version 5000, ZFS filesystem version 5
      ...
      Apr 26 14:20:05 ieel-mds04 kernel: SPL: using hostid 0x7e3a4ec9
      Apr 26 14:20:52 ieel-mds04 kernel: Lustre: MGS: Connection restored to ec5ab9aa-e46a-19dd-47d3-1aa7d07fdc3f (at 0@lo)
      Apr 26 14:20:52 ieel-mds04 kernel: Lustre: srv-scratch-MDT0001: No data found on store. Initialize space
      Apr 26 14:20:52 ieel-mds04 kernel: Lustre: scratch-MDT0001: new disk, initializing
      Apr 26 14:20:52 ieel-mds04 kernel: Lustre: scratch-MDT0001: Imperative Recovery not enabled, recovery window 300-900
      Apr 26 14:20:52 ieel-mds04 kernel: LustreError: 21965:0:(osd_oi.c:497:osd_oid()) unsupported quota oid: 0x16
      Apr 26 14:20:52 ieel-mds04 kernel: LustreError: 22330:0:(fid_handler.c:329:__seq_server_alloc_meta()) srv-scratch-MDT0001: Allocated super-sequence failed: rc = -115
      Apr 26 14:20:52 ieel-mds04 kernel: LustreError: 22330:0:(fid_request.c:227:seq_client_alloc_seq()) cli-scratch-MDT0001: Can't allocate new meta-sequence,rc -115
      Apr 26 14:20:52 ieel-mds04 kernel: LustreError: 22330:0:(fid_request.c:383:seq_client_alloc_fid()) cli-scratch-MDT0001: Can't allocate new sequence: rc = -115
      Apr 26 14:20:52 ieel-mds04 kernel: LustreError: 22330:0:(lod_dev.c:419:lod_sub_recovery_thread()) scratch-MDT0001-osd getting update log failed: rc = -115
      

      OST Log on initial mount:

      Apr 26 14:20:56 ieel-oss03 kernel: SPL: Loaded module v0.7.0-rc3_8_g481762f
      Apr 26 14:20:58 ieel-oss03 kernel: ZFS: Loaded module v0.7.0-rc3_225_g7a25f08, ZFS pool version 5000, ZFS filesystem version 5
      Apr 26 14:21:08 ieel-oss03 kernel: SPL: using hostid 0x5d9bdb4b
      Apr 26 14:25:01 ieel-oss03 kernel: LNet: HW nodes: 1, HW CPU cores: 2, npartitions: 1
      Apr 26 14:25:01 ieel-oss03 kernel: alg: No test for adler32 (adler32-zlib)
      Apr 26 14:25:01 ieel-oss03 kernel: alg: No test for crc32 (crc32-table)
      Apr 26 14:25:01 ieel-oss03 kernel: Lustre: Lustre: Build Version: 2.9.56_11_gbfa524f
      Apr 26 14:25:01 ieel-oss03 kernel: LNet: Added LNI 192.168.56.22@tcp [8/256/0/180]
      Apr 26 14:25:01 ieel-oss03 kernel: LNet: Accept secure, port 988
      Apr 26 14:25:02 ieel-oss03 kernel: Lustre: scratch-OST0000: new disk, initializing
      Apr 26 14:25:02 ieel-oss03 kernel: Lustre: srv-scratch-OST0000: No data found on store. Initialize space
      Apr 26 14:25:02 ieel-oss03 kernel: Lustre: scratch-OST0000: Imperative Recovery not enabled, recovery window 300-900
      Apr 26 14:25:02 ieel-oss03 kernel: LustreError: 13214:0:(osd_oi.c:497:osd_oid()) unsupported quota oid: 0x16
      Apr 26 14:25:07 ieel-oss03 kernel: Lustre: scratch-OST0000: Connection restored to scratch-MDT0001-mdtlov_UUID (at 192.168.56.13@tcp)
      

      Client attempting to mount (messages):

      Apr 26 14:30:44 ieel-c03 kernel: LNet: HW CPU cores: 2, npartitions: 1
      Apr 26 14:30:44 ieel-c03 kernel: alg: No test for adler32 (adler32-zlib)
      Apr 26 14:30:44 ieel-c03 kernel: alg: No test for crc32 (crc32-table)
      Apr 26 14:30:49 ieel-c03 kernel: sha512_ssse3: Using AVX optimized SHA-512 implementation
      Apr 26 14:30:52 ieel-c03 kernel: Lustre: Lustre: Build Version: 2.8.0.51-1-PRISTINE-3.10.0-514.6.1.el7.x86_64
      Apr 26 14:30:52 ieel-c03 kernel: LNet: Added LNI 192.168.56.32@tcp [8/256/0/180]
      Apr 26 14:30:52 ieel-c03 kernel: LNet: Accept secure, port 988
      Apr 26 14:30:52 ieel-c03 kernel: LustreError: 2336:0:(lmv_obd.c:553:lmv_check_connect()) scratch-clilmv-ffff88003b98d800: no target configured for index 0.
      Apr 26 14:30:52 ieel-c03 kernel: LustreError: 2336:0:(llite_lib.c:265:client_common_fill_super()) cannot connect to scratch-clilmv-ffff88003b98d800: rc = -22
      Apr 26 14:30:52 ieel-c03 kernel: LustreError: 2363:0:(lov_obd.c:922:lov_cleanup()) scratch-clilov-ffff88003b98d800: lov tgt 0 not cleaned! deathrow=0, lovrc=1
      Apr 26 14:30:52 ieel-c03 kernel: Lustre: Unmounted scratch-client
      Apr 26 14:30:52 ieel-c03 kernel: LustreError: 2336:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount  (-22)
      

      The attached logs.tbz2 contains debug_kernel dumps.

          Activity


            utopiabound Nathaniel Clark added a comment -

            If I format mdt00 with --index=0, everything works just fine. This could be closed as "Not a Bug", I guess, though it's kind of a strange one to debug.
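
            In other words, the working variant is presumably just the format command quoted later in this ticket, with --index=0 instead:

            mkfs.lustre --reformat --backfstype=zfs --mgs --mdt --index=0 --fsname=scratch mdt00/mdt
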
            ihara Shuichi Ihara (Inactive) added a comment - edited

            Ah, are you sure "--index=1" works without --index=0?

            Apr 26 14:20:52 ieel-mds04 kernel: Lustre: scratch-MDT0001: new disk, initializing
            Apr 26 14:20:52 ieel-mds04 kernel: Lustre: scratch-MDT0001: Imperative Recovery not enabled, recovery window 300-900

            It seems you also set up --index=1 without --index=0 in your original description.

            However, you have --index=0 for MDT in your ldiskfs setup.

            8 UP osp scratch-OST0000-osc-MDT0000 scratch-MDT0000-mdtlov_UUID 5
            9 UP osp scratch-OST0001-osc-MDT0000 scratch-MDT0000-mdtlov_UUID 5

            utopiabound Nathaniel Clark added a comment -

            zpool create -f -o ashift=12 -o cachefile=none mdt00 /dev/sdc /dev/sdd
            mkfs.lustre --reformat --backfstype=zfs --mgs --mdt --index=1 --fsname=scratch mdt00/mdt
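
            For completeness, the OSS side would presumably be formatted along the same lines (the pool name, device names, and MGS NID below are assumptions):

            zpool create -f -o ashift=12 -o cachefile=none ost00 /dev/sde /dev/sdf
            mkfs.lustre --reformat --backfstype=zfs --ost --index=0 --fsname=scratch --mgsnode=192.168.56.13@tcp ost00/ost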

            wangshilong Wang Shilong (Inactive) added a comment -

            Hi Nathaniel Clark,

            Could you show me your exact mkfs options, so that I can try to reproduce this here?

            Thanks,
            Shilong

            utopiabound Nathaniel Clark added a comment - edited

            ihara,

            Oh, you are correct, I've updated the client code to match the OSS/MDS but I didn't unload the old modules.

            Using the quoted version above, I get the same result. Works with ldiskfs, does not work with zfs.

            I've got two setups running the same version of Lustre: 2.9.56_11_gbfa524f
            ZFS:

            [root@ieel-mds03 ~]# lctl dl
              0 UP osd-zfs scratch-MDT0001-osd scratch-MDT0001-osd_UUID 8
              1 UP mgs MGS MGS 7
              2 UP mgc MGC192.168.56.12@tcp 3e0eccdf-f828-338f-d3fc-2e717a638014 5
              3 UP mds MDS MDS_uuid 3
              4 UP lod scratch-MDT0001-mdtlov scratch-MDT0001-mdtlov_UUID 4
              5 UP mdt scratch-MDT0001 scratch-MDT0001_UUID 5
              6 UP mdd scratch-MDD0001 scratch-MDD0001_UUID 4
              7 UP osp scratch-OST0000-osc-MDT0001 scratch-MDT0001-mdtlov_UUID 5
            

            ldiskfs:

            [root@ieel-mds04 ~]# lctl dl
              0 UP osd-ldiskfs scratch-MDT0000-osd scratch-MDT0000-osd_UUID 10
              1 UP mgs MGS MGS 7
              2 UP mgc MGC192.168.56.13@tcp 0e5f0018-97cf-c2a4-4817-f51b7410ec7b 5
              3 UP mds MDS MDS_uuid 3
              4 UP lod scratch-MDT0000-mdtlov scratch-MDT0000-mdtlov_UUID 4
              5 UP mdt scratch-MDT0000 scratch-MDT0000_UUID 13
              6 UP mdd scratch-MDD0000 scratch-MDD0000_UUID 4
              7 UP qmt scratch-QMT0000 scratch-QMT0000_UUID 4
              8 UP osp scratch-OST0000-osc-MDT0000 scratch-MDT0000-mdtlov_UUID 5
              9 UP osp scratch-OST0001-osc-MDT0000 scratch-MDT0000-mdtlov_UUID 5
             10 UP lwp scratch-MDT0000-lwp-MDT0000 scratch-MDT0000-lwp-MDT0000_UUID 5
            

            Should qmt be missing from ZFS?
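
            One quick way to eyeball the difference between the two device stacks (nothing Lustre-specific, just comparing the device types reported by lctl dl on each MDS):

            lctl dl | awk '{ print $3 }' | sort > /tmp/devs-zfs.txt      # run on ieel-mds03
            lctl dl | awk '{ print $3 }' | sort > /tmp/devs-ldiskfs.txt  # run on ieel-mds04
            diff /tmp/devs-zfs.txt /tmp/devs-ldiskfs.txt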


            ihara Shuichi Ihara (Inactive) added a comment -

            This confused me. As far as I can tell from your original description, the client version is "2.8.0.51-1", not the same version as the OSS/MDS (2.9.56_11_gbfa524f).

            utopiabound Nathaniel Clark added a comment -

            Yes, that's the exact version for the OSS/MDS.

            The client version is lustre-client-2.9.56_11_gbfa524f-1.el7.centos.x86_64.

            I'm going to retest with latest master.


            ihara Shuichi Ihara (Inactive) added a comment -

            Nathaniel,

            Just to confirm: these are the exact Lustre versions you are testing, correct?

            OSS/MDS

            Apr 26 14:25:01 ieel-oss03 kernel: Lustre: Lustre: Build Version: 2.9.56_11_gbfa524f
            

            Client

            Apr 26 14:30:52 ieel-c03 kernel: Lustre: Lustre: Build Version: 2.8.0.51-1-PRISTINE-3.10.0-514.6.1.el7.x86_64
            

            We have tested the latest master (both server and client) with zfs-0.6.5 and didn't reproduce your problem. But our tested client version might be different from your setup.
            I'm wondering whether there is some version interoperability issue here. We will investigate this again.


            utopiabound Nathaniel Clark added a comment -

            I turned on tracing and got this back:

            40000000:00000001:1.0:1493296626.206119:0:5772:0:(fid_request.c:350:seq_client_alloc_fid()) Process entered
            40000000:00000001:1.0:1493296626.206120:0:5772:0:(fid_request.c:219:seq_client_alloc_seq()) Process entered
            40000000:00000001:1.0:1493296626.206122:0:5772:0:(fid_request.c:180:seq_client_alloc_meta()) Process entered
            40000000:00000001:1.0:1493296626.206123:0:5772:0:(fid_handler.c:351:seq_server_alloc_meta()) Process entered
            40000000:00000001:1.0:1493296626.206124:0:5772:0:(fid_handler.c:322:__seq_server_alloc_meta()) Process entered
            40000000:00000001:1.0:1493296626.206126:0:5772:0:(fid_handler.c:277:seq_server_check_and_alloc_super()) Process entered
            40000000:00000001:1.0:1493296626.206127:0:5772:0:(fid_request.c:148:seq_client_alloc_super()) Process entered
            40000000:00000001:1.0:1493296626.206129:0:5772:0:(fid_request.c:165:seq_client_alloc_super()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d)
            40000000:00080000:1.0:1493296626.206131:0:5772:0:(fid_handler.c:290:seq_server_check_and_alloc_super()) srv-scratch-MDT0001: Can't allocate super-sequence: rc -115
            40000000:00000001:1.0:1493296626.206132:0:5772:0:(fid_handler.c:291:seq_server_check_and_alloc_super()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d)
            40000000:00020000:1.0:1493296626.206134:0:5772:0:(fid_handler.c:329:__seq_server_alloc_meta()) srv-scratch-MDT0001: Allocated super-sequence failed: rc = -115
            

            Which means that in lustre/fid/fid_request.c::seq_client_alloc_super(),
            in struct lu_client_seq *seq, both the sequence server (seq->lcs_srv) and the sequence export (seq->lcs_exp) are NULL.
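
            (For anyone re-tracing this from the attached debug dumps, the relevant allocation-path entries can be pulled out with a simple grep; the file name below is a placeholder:)

            grep -E 'fid_(request|handler)\.c' mds-debug-dump.txt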


            wangshilong Wang Shilong (Inactive) added a comment -

            Actually, I could not reproduce the problem in my local setup.

            adilger Andreas Dilger added a comment -

            The -115 = -EINPROGRESS error means that the server can't perform the operation for some reason, but that the client should retry until it succeeds. It is possible that the server code doesn't expect to see this, so it isn't retrying, which is something that we should fix.

            That said, it also isn't clear why the server would be returning -EINPROGRESS for something like sequence allocation, unless e.g. LFSCK is running and it can't look up the object(s) where the last-used sequence number is stored.

            It probably makes sense to run this with -1 debugging and find out where the -115 error is coming from, and then we can see what needs to be fixed.
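
            A minimal sketch of capturing that full debug log on the MDS (buffer size and output path are illustrative):

            lctl set_param debug=-1
            lctl set_param debug_mb=512
            # reproduce the failing mount, then dump the trace:
            lctl dk /tmp/lustre-debug.txt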


            People

              Assignee: wangshilong Wang Shilong (Inactive)
              Reporter: utopiabound Nathaniel Clark
              Votes: 0
              Watchers: 7
