[LU-9408] Client fails to mount with ZFS master (0.7.0) and Lustre master (2.9.56) Created: 26/Apr/17  Updated: 17/May/17  Resolved: 05/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Nathaniel Clark Assignee: Wang Shilong (Inactive)
Resolution: Not a Bug Votes: 0
Labels: zfs

Attachments: File logs.tbz2    
Issue Links:
Related
is related to LU-7991 Add project quota for ZFS Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

ZFS: zfs-0.7.0-rc3-225-g7a25f08
SPL: spl-0.7.0-rc3-8-g481762f
Lustre: v2_9_56_0-11-gbfa524f

This is a straightforward single MDS with MDT/MGT and an OSS with a single OST.
(also reproduced with split MDT and MGT on a slightly earlier ZFS).

This setup works without problems with ldiskfs backing.

Snippet from MDS log on initial mount:

Apr 26 14:01:18 ieel-mds04 kernel: SPL: Loaded module v0.7.0-rc3_8_g481762f
Apr 26 14:01:19 ieel-mds04 kernel: ZFS: Loaded module v0.7.0-rc3_225_g7a25f08, ZFS pool version 5000, ZFS filesystem version 5
...
Apr 26 14:20:05 ieel-mds04 kernel: SPL: using hostid 0x7e3a4ec9
Apr 26 14:20:52 ieel-mds04 kernel: Lustre: MGS: Connection restored to ec5ab9aa-e46a-19dd-47d3-1aa7d07fdc3f (at 0@lo)
Apr 26 14:20:52 ieel-mds04 kernel: Lustre: srv-scratch-MDT0001: No data found on store. Initialize space
Apr 26 14:20:52 ieel-mds04 kernel: Lustre: scratch-MDT0001: new disk, initializing
Apr 26 14:20:52 ieel-mds04 kernel: Lustre: scratch-MDT0001: Imperative Recovery not enabled, recovery window 300-900
Apr 26 14:20:52 ieel-mds04 kernel: LustreError: 21965:0:(osd_oi.c:497:osd_oid()) unsupported quota oid: 0x16
Apr 26 14:20:52 ieel-mds04 kernel: LustreError: 22330:0:(fid_handler.c:329:__seq_server_alloc_meta()) srv-scratch-MDT0001: Allocated super-sequence failed: rc = -115
Apr 26 14:20:52 ieel-mds04 kernel: LustreError: 22330:0:(fid_request.c:227:seq_client_alloc_seq()) cli-scratch-MDT0001: Can't allocate new meta-sequence,rc -115
Apr 26 14:20:52 ieel-mds04 kernel: LustreError: 22330:0:(fid_request.c:383:seq_client_alloc_fid()) cli-scratch-MDT0001: Can't allocate new sequence: rc = -115
Apr 26 14:20:52 ieel-mds04 kernel: LustreError: 22330:0:(lod_dev.c:419:lod_sub_recovery_thread()) scratch-MDT0001-osd getting update log failed: rc = -115

OST Log on initial mount:

Apr 26 14:20:56 ieel-oss03 kernel: SPL: Loaded module v0.7.0-rc3_8_g481762f
Apr 26 14:20:58 ieel-oss03 kernel: ZFS: Loaded module v0.7.0-rc3_225_g7a25f08, ZFS pool version 5000, ZFS filesystem version 5
Apr 26 14:21:08 ieel-oss03 kernel: SPL: using hostid 0x5d9bdb4b
Apr 26 14:25:01 ieel-oss03 kernel: LNet: HW nodes: 1, HW CPU cores: 2, npartitions: 1
Apr 26 14:25:01 ieel-oss03 kernel: alg: No test for adler32 (adler32-zlib)
Apr 26 14:25:01 ieel-oss03 kernel: alg: No test for crc32 (crc32-table)
Apr 26 14:25:01 ieel-oss03 kernel: Lustre: Lustre: Build Version: 2.9.56_11_gbfa524f
Apr 26 14:25:01 ieel-oss03 kernel: LNet: Added LNI 192.168.56.22@tcp [8/256/0/180]
Apr 26 14:25:01 ieel-oss03 kernel: LNet: Accept secure, port 988
Apr 26 14:25:02 ieel-oss03 kernel: Lustre: scratch-OST0000: new disk, initializing
Apr 26 14:25:02 ieel-oss03 kernel: Lustre: srv-scratch-OST0000: No data found on store. Initialize space
Apr 26 14:25:02 ieel-oss03 kernel: Lustre: scratch-OST0000: Imperative Recovery not enabled, recovery window 300-900
Apr 26 14:25:02 ieel-oss03 kernel: LustreError: 13214:0:(osd_oi.c:497:osd_oid()) unsupported quota oid: 0x16
Apr 26 14:25:07 ieel-oss03 kernel: Lustre: scratch-OST0000: Connection restored to scratch-MDT0001-mdtlov_UUID (at 192.168.56.13@tcp)

Client attempting to mount (messages):

Apr 26 14:30:44 ieel-c03 kernel: LNet: HW CPU cores: 2, npartitions: 1
Apr 26 14:30:44 ieel-c03 kernel: alg: No test for adler32 (adler32-zlib)
Apr 26 14:30:44 ieel-c03 kernel: alg: No test for crc32 (crc32-table)
Apr 26 14:30:49 ieel-c03 kernel: sha512_ssse3: Using AVX optimized SHA-512 implementation
Apr 26 14:30:52 ieel-c03 kernel: Lustre: Lustre: Build Version: 2.8.0.51-1-PRISTINE-3.10.0-514.6.1.el7.x86_64
Apr 26 14:30:52 ieel-c03 kernel: LNet: Added LNI 192.168.56.32@tcp [8/256/0/180]
Apr 26 14:30:52 ieel-c03 kernel: LNet: Accept secure, port 988
Apr 26 14:30:52 ieel-c03 kernel: LustreError: 2336:0:(lmv_obd.c:553:lmv_check_connect()) scratch-clilmv-ffff88003b98d800: no target configured for index 0.
Apr 26 14:30:52 ieel-c03 kernel: LustreError: 2336:0:(llite_lib.c:265:client_common_fill_super()) cannot connect to scratch-clilmv-ffff88003b98d800: rc = -22
Apr 26 14:30:52 ieel-c03 kernel: LustreError: 2363:0:(lov_obd.c:922:lov_cleanup()) scratch-clilov-ffff88003b98d800: lov tgt 0 not cleaned! deathrow=0, lovrc=1
Apr 26 14:30:52 ieel-c03 kernel: Lustre: Unmounted scratch-client
Apr 26 14:30:52 ieel-c03 kernel: LustreError: 2336:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount  (-22)

Attached logs.tbz2 has debug_kernel dumps.



 Comments   
Comment by Andreas Dilger [ 26/Apr/17 ]

As Bob pointed out in chat, this is related to the recent project quota feature landing in patch https://review.whamcloud.com/23947 "LU-4017 quota: add project quota support for Lustre". It isn't clear why this is not failing our current ZFS tests with 0.6.5.9, but it definitely needs to be fixed.

Comment by Peter Jones [ 27/Apr/17 ]

Wang Shilong

Do you have any suggestions here?

Peter

Comment by Nathaniel Clark [ 27/Apr/17 ]

Actually, I've just tested this with ZFS/SPL 0.6.5.7 and I get the same error. It must be something in my setup, but I'm at a loss as to what; ldiskfs works fine. The quota error, I think, is a red herring. I think the issue is related to these lines:

Apr 27 08:32:26 ieel-mds03 kernel: LustreError: 5565:0:(fid_handler.c:329:__seq_server_alloc_meta()) srv-scratch-MDT0001: Allocated super-sequence failed: rc = -115
Apr 27 08:32:26 ieel-mds03 kernel: LustreError: 5565:0:(fid_request.c:227:seq_client_alloc_seq()) cli-scratch-MDT0001: Can't allocate new meta-sequence,rc -115
Apr 27 08:32:26 ieel-mds03 kernel: LustreError: 5565:0:(fid_request.c:383:seq_client_alloc_fid()) cli-scratch-MDT0001: Can't allocate new sequence: rc = -115
Apr 27 08:32:26 ieel-mds03 kernel: LustreError: 5565:0:(lod_dev.c:419:lod_sub_recovery_thread()) scratch-MDT0001-osd getting update log failed: rc = -115
Comment by Andreas Dilger [ 27/Apr/17 ]

The -115 = -EINPROGRESS error means that the server can't perform the operation for some reason, but that the client should retry until it succeeds. It is possible that the server code doesn't expect to see this, so it isn't retrying, which is something that we should fix.

That said, it also isn't clear why the server would be returning -EINPROGRESS for something like sequence allocation, unless e.g. LFSCK is running and it can't look up the object(s) where the last-used sequence number is stored.

It probably makes sense to run this with -1 debugging and find out where the -115 error is coming from, and then we can see what needs to be fixed.
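For reference, capturing that on the MDS would look something like the following (the log path is illustrative):

# enable all debug flags and clear any stale entries
lctl set_param debug=-1
lctl clear
# ... remount the MDT and reproduce the failure ...
# dump the kernel debug buffer to a file for analysis
lctl dk /tmp/lustre-debug-mds.log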

Comment by Wang Shilong (Inactive) [ 28/Apr/17 ]

Actually, I could not reproduce the problem in my local setup.

Comment by Nathaniel Clark [ 28/Apr/17 ]

I turned on tracing and got this back:

40000000:00000001:1.0:1493296626.206119:0:5772:0:(fid_request.c:350:seq_client_alloc_fid()) Process entered
40000000:00000001:1.0:1493296626.206120:0:5772:0:(fid_request.c:219:seq_client_alloc_seq()) Process entered
40000000:00000001:1.0:1493296626.206122:0:5772:0:(fid_request.c:180:seq_client_alloc_meta()) Process entered
40000000:00000001:1.0:1493296626.206123:0:5772:0:(fid_handler.c:351:seq_server_alloc_meta()) Process entered
40000000:00000001:1.0:1493296626.206124:0:5772:0:(fid_handler.c:322:__seq_server_alloc_meta()) Process entered
40000000:00000001:1.0:1493296626.206126:0:5772:0:(fid_handler.c:277:seq_server_check_and_alloc_super()) Process entered
40000000:00000001:1.0:1493296626.206127:0:5772:0:(fid_request.c:148:seq_client_alloc_super()) Process entered
40000000:00000001:1.0:1493296626.206129:0:5772:0:(fid_request.c:165:seq_client_alloc_super()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d)
40000000:00080000:1.0:1493296626.206131:0:5772:0:(fid_handler.c:290:seq_server_check_and_alloc_super()) srv-scratch-MDT0001: Can't allocate super-sequence: rc -115
40000000:00000001:1.0:1493296626.206132:0:5772:0:(fid_handler.c:291:seq_server_check_and_alloc_super()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d)
40000000:00020000:1.0:1493296626.206134:0:5772:0:(fid_handler.c:329:__seq_server_alloc_meta()) srv-scratch-MDT0001: Allocated super-sequence failed: rc = -115

Which means that in lustre/fid/fid_request.c::seq_client_alloc_super():
In struct lu_client_seq *seq, the sequence server seq->lcs_srv and the sequence export seq->lcs_exp are both NULL.

Comment by Shuichi Ihara (Inactive) [ 30/Apr/17 ]

Nathaniel,

Just want to confirm: the exact Lustre versions you are testing are the ones below, correct?

OSS/MDS

Apr 26 14:25:01 ieel-oss03 kernel: Lustre: Lustre: Build Version: 2.9.56_11_gbfa524f

Client

Apr 26 14:30:52 ieel-c03 kernel: Lustre: Lustre: Build Version: 2.8.0.51-1-PRISTINE-3.10.0-514.6.1.el7.x86_64

We have tested the latest master (both server and client) with zfs-0.6.5 and didn't reproduce your problem. However, the client version we tested might be different from your setup.
I'm wondering whether there is a version interoperability issue here. We will investigate this again.

Comment by Nathaniel Clark [ 01/May/17 ]

Yes that's the exact version for OSS/MDS.

The client version is lustre-client-2.9.56_11_gbfa524f-1.el7.centos.x86_64.

I'm going to retest with latest master.

Comment by Shuichi Ihara (Inactive) [ 01/May/17 ]

This confused me. As far as I can tell from your original description, the client version is "2.8.0.51-1", not the same version as the OSS/MDS (2.9.56_11_gbfa524f).

Comment by Nathaniel Clark [ 02/May/17 ]

ihara,

Oh, you are correct. I had updated the client code to match the OSS/MDS, but I didn't unload the old modules.

Using the quoted version above, I get the same result: it works with ldiskfs but does not work with ZFS.

I have two setups running the same version of Lustre: 2.9.56_11_gbfa524f
ZFS:

[root@ieel-mds03 ~]# lctl dl
  0 UP osd-zfs scratch-MDT0001-osd scratch-MDT0001-osd_UUID 8
  1 UP mgs MGS MGS 7
  2 UP mgc MGC192.168.56.12@tcp 3e0eccdf-f828-338f-d3fc-2e717a638014 5
  3 UP mds MDS MDS_uuid 3
  4 UP lod scratch-MDT0001-mdtlov scratch-MDT0001-mdtlov_UUID 4
  5 UP mdt scratch-MDT0001 scratch-MDT0001_UUID 5
  6 UP mdd scratch-MDD0001 scratch-MDD0001_UUID 4
  7 UP osp scratch-OST0000-osc-MDT0001 scratch-MDT0001-mdtlov_UUID 5

ldiskfs:

[root@ieel-mds04 ~]# lctl dl
  0 UP osd-ldiskfs scratch-MDT0000-osd scratch-MDT0000-osd_UUID 10
  1 UP mgs MGS MGS 7
  2 UP mgc MGC192.168.56.13@tcp 0e5f0018-97cf-c2a4-4817-f51b7410ec7b 5
  3 UP mds MDS MDS_uuid 3
  4 UP lod scratch-MDT0000-mdtlov scratch-MDT0000-mdtlov_UUID 4
  5 UP mdt scratch-MDT0000 scratch-MDT0000_UUID 13
  6 UP mdd scratch-MDD0000 scratch-MDD0000_UUID 4
  7 UP qmt scratch-QMT0000 scratch-QMT0000_UUID 4
  8 UP osp scratch-OST0000-osc-MDT0000 scratch-MDT0000-mdtlov_UUID 5
  9 UP osp scratch-OST0001-osc-MDT0000 scratch-MDT0000-mdtlov_UUID 5
 10 UP lwp scratch-MDT0000-lwp-MDT0000 scratch-MDT0000-lwp-MDT0000_UUID 5

Should qmt be missing from ZFS?

Comment by Wang Shilong (Inactive) [ 03/May/17 ]

Hi Nathaniel Clark,

could you show me your exact mkfs options so that I can try to reproduce this here?

Thanks,
Shilong

Comment by Nathaniel Clark [ 03/May/17 ]
zpool create -f -o ashift=12 -o cachefile=none mdt00 /dev/sdc /dev/sdd
mkfs.lustre --reformat --backfstype=zfs --mgs --mdt --index=1 --fsname=scratch mdt00/mdt
Comment by Shuichi Ihara (Inactive) [ 03/May/17 ]

Ah, are you sure "--index=1" works without --index=0?

Apr 26 14:20:52 ieel-mds04 kernel: Lustre: scratch-MDT0001: new disk, initializing
Apr 26 14:20:52 ieel-mds04 kernel: Lustre: scratch-MDT0001: Imperative Recovery not enabled, recovery window 300-900

It seems you also set up --index=1 without --index=0 in your original description.

However, you have --index=0 for MDT in your ldiskfs setup.

8 UP osp scratch-OST0000-osc-MDT0000 scratch-MDT0000-mdtlov_UUID 5
9 UP osp scratch-OST0001-osc-MDT0000 scratch-MDT0000-mdtlov_UUID 5

Comment by Nathaniel Clark [ 05/May/17 ]

If I format mdt00 with --index=0, everything works just fine. This could be closed as "not a bug", I guess, though it's kind of a strange one to debug.
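For the record, the working format is simply the reproducer above with --index=0 (sketch, same pool layout):

zpool create -f -o ashift=12 -o cachefile=none mdt00 /dev/sdc /dev/sdd
mkfs.lustre --reformat --backfstype=zfs --mgs --mdt --index=0 --fsname=scratch mdt00/mdt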

Comment by Peter Jones [ 05/May/17 ]

Thanks Nathaniel

Comment by Andreas Dilger [ 16/May/17 ]

It would be good to get a patch to quiet the spurious "osd_oid()) unsupported quota oid: 0x16" message at startup, since even I find that confusing and wonder whether there is something wrong. We know this is for project quota, which isn't supported in ZFS yet.

Comment by Wang Shilong (Inactive) [ 17/May/17 ]

Andreas, Fan Yong is working on project quota for ZFS; I think that message will be removed once ZFS project quota is supported.

Comment by nasf (Inactive) [ 17/May/17 ]

The osd_oid() function has been removed entirely by the patch:
https://review.whamcloud.com/#/c/27093/
