[LU-11375] Client 2.10 fails to mount Server 2.11.54 Created: 13/Sep/18  Updated: 25/Oct/18  Resolved: 10/Oct/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Blocker
Reporter: Nathaniel Clark Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: interop
Environment:

2 MDS, 2 OSS ~2.11.54_91
Client 2.10.4


Issue Links:
Related
is related to LU-10018 MDT as a statfs proxy Resolved
is related to LU-11488 sanity test_133b: @@@@@@ FAIL: OST go... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Mounting client fails with:

# mount -t lustre 192.168.56.12@tcp:/zfs2 /mnt/fs2
mount.lustre: mount 192.168.56.12@tcp:/zfs2 at /mnt/fs2 failed: Bad address

MDT0 has the following error in debug log:

00000100:02000000:1.0:1536858056.517840:0:2585:0:(import.c:1534:ptlrpc_import_recovery_state_machine()) zfs2-MDT0000: Connection restored to 880830c7-7d38-ac3b-c683-8d949f550eaa (at 192.168.56.30@tcp)
00000100:00020000:1.0:1536858056.518193:0:2585:0:(layout.c:2067:__req_capsule_get()) @@@ Wrong buffer for field `mdt_body' (1 of 1) in format `MDS_STATFS': 0 vs. 216 (client)
  req@ffff9a06cd249b00 x1611512228611360/t0(0) o41->1953ab2d-accd-2169-bea6-82979f16a7d4@192.168.56.30@tcp:452/0 lens 224/0 e 0 to 0 dl 1536858067 ref 1 fl Interpret:/0/ffffffff rc 0/-1
00000020:00000080:1.0:1536858056.524976:0:2585:0:(genops.c:1561:class_disconnect()) disconnect: cookie 0x3f757c9924a4d7eb
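
For reading the `Wrong buffer` message above, a small parser can pull out the field name, the request format and the two sizes. This is only a sketch: the regular expression is written against the single log line quoted in this ticket. The reading of the two numbers as "actual vs. expected buffer length" and of "(1 of 1)" as "buffer index of buffer count" is an assumption, not something the ticket states. The names `WRONG_BUFFER_RE` and `parse_wrong_buffer` are made up for this example.

```python
import re

# Pattern written against the message quoted in this ticket only.
WRONG_BUFFER_RE = re.compile(
    r"Wrong buffer for field `(?P<field>[^']+)' "
    r"\((?P<index>\d+) of (?P<count>\d+)\) "
    r"in format `(?P<format>[^']+)': "
    r"(?P<actual>\d+) vs\. (?P<expected>\d+) \((?P<side>\w+)\)"
)


def parse_wrong_buffer(line):
    """Return the parsed parts of a 'Wrong buffer' message, or None if absent."""
    match = WRONG_BUFFER_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the numeric captures so they can be compared directly.
    for key in ("index", "count", "actual", "expected"):
        info[key] = int(info[key])
    return info


sample = (
    "(layout.c:2067:__req_capsule_get()) @@@ Wrong buffer for field "
    "`mdt_body' (1 of 1) in format `MDS_STATFS': 0 vs. 216 (client)"
)
info = parse_wrong_buffer(sample)
print(info)
```

Under that assumed reading, the server looked for a 216-byte `mdt_body` in the client's `MDS_STATFS` request and found a zero-length buffer.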

Setup:
mds03: MGT, MDT1
mds04: MDT0
oss03: OST0, OST2
oss04: OST1, OST3
This is all ZFS based (ZFS 0.7.9)



 Comments   
Comment by John Hammond [ 13/Sep/18 ]

Likely caused by https://review.whamcloud.com/29136 (LU-10018 protocol: MDT as a statfs proxy).

Comment by James Nunez (Inactive) [ 13/Sep/18 ]

We run automated testing between 2.10.4 clients and master servers, and we do see these tests failing in lustre-initialization. Looking at https://testing.whamcloud.com/test_sets/73262750-b59f-11e8-9df3-52540065bddc, we see the mount fail:

2018-09-11T07:51:28 Starting client: trevis-19vm1.trevis.whamcloud.com:  -o user_xattr,flock trevis-19vm4@tcp:/lustre /mnt/lustre
2018-09-11T07:51:28 CMD: trevis-19vm1.trevis.whamcloud.com mkdir -p /mnt/lustre
2018-09-11T07:51:28 CMD: trevis-19vm1.trevis.whamcloud.com mount -t lustre -o user_xattr,flock trevis-19vm4@tcp:/lustre /mnt/lustre
2018-09-11T07:51:28 mount.lustre: mount trevis-19vm4@tcp:/lustre at /mnt/lustre failed: Bad address

On that client (vm1), we see

[  221.250210] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock trevis-19vm4@tcp:/lustre /mnt/lustre
[  221.731388] LustreError: 11-0: lustre-MDT0000-mdc-ffff90943bde8800: operation mds_statfs to node 10.9.4.227@tcp failed: rc = -14
[  221.732646] LustreError: 15598:0:(lmv_obd.c:1387:lmv_statfs()) can't stat MDS #0 (lustre MDT0000-mdc-ffff90943bde8800), error -14
[  221.738793] LustreError: 15598:0:(lov_obd.c:878:lov_cleanup()) lustre-clilov-ffff90943bde8800: lov tgt 0 not cleaned! deathrow=0, lovrc=1
[  221.742206] Lustre: Unmounted lustre-client
[  221.743475] LustreError: 15598:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount (-14)
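
The `rc = -14` in the client log and the `Bad address` text printed by mount.lustre are the same error: on Linux, errno 14 is EFAULT. A minimal sketch of that lookup follows; `describe_rc` is a helper name invented for this example.

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel-style return code to 'NAME: message'."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# The return code reported by mds_statfs and lustre_fill_super() above.
print(describe_rc(-14))
```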

Comment by Gerrit Updater [ 13/Sep/18 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33162
Subject: LU-11375 mdc: use old statfs format
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a9914b892add64cf2b584b91c6da9872c871d6ef

Comment by Alex Zhuravlev [ 14/Sep/18 ]

The patch isn't ready yet. I've hit an interesting issue: the initial MDS_STATFS request is formed before the connection (and its connect flags) is set up properly, so the client and the server end up with different connect flags.
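
The ordering problem described in this comment can be shown with a toy model. It is not the merged patch and uses no real Lustre code. All names here (`ClientImport`, `pick_statfs_layout`, `server_can_unpack`, `peer_supports_new_statfs`) are hypothetical. The "old" layout having only a `ptlrpc_body` buffer is inferred from the "(1 of 1)" in the server log, not confirmed by the ticket. The sketch assumes one way to avoid the mismatch: the client falls back to the old layout until the connect reply confirms the newer one.

```python
from dataclasses import dataclass

# Assumed buffer layouts for the MDS_STATFS request (illustrative only).
OLD_STATFS = ("ptlrpc_body",)
NEW_STATFS = ("ptlrpc_body", "mdt_body")


@dataclass
class ClientImport:
    """Hypothetical stand-in for the client's connection state."""
    connected: bool = False
    peer_supports_new_statfs: bool = False


def pick_statfs_layout(imp):
    """Use the new layout only once the connect flags are known."""
    if imp.connected and imp.peer_supports_new_statfs:
        return NEW_STATFS
    return OLD_STATFS


def server_can_unpack(layout, server_expects):
    """True if every buffer the server wants to read is in the request."""
    return all(field in layout for field in server_expects)


# A request built before the connection is set up uses the old layout.
early = pick_statfs_layout(ClientImport())
print(early, server_can_unpack(early, NEW_STATFS))

# After negotiation with a capable peer, the new layout is safe to send.
late = pick_statfs_layout(ClientImport(True, True))
print(late, server_can_unpack(late, NEW_STATFS))
```

In this model, a server that insists on the new layout rejects the early request. That corresponds to the failure reported here and is why the server side also has to tolerate the old layout.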

Comment by James Nunez (Inactive) [ 18/Sep/18 ]

Since master build #3787, it looks like Ubuntu 18.04 clients can't mount due to this issue. In the MDS console log, we see:

[  191.353309] LustreError: 15466:0:(layout.c:2067:__req_capsule_get()) @@@ Wrong buffer for field `mdt_body' (1 of 1) in format `MDS_STATFS': 0 vs. 216 (client)

Logs at
https://testing.whamcloud.com/test_sets/6afaa572-b86c-11e8-8c12-52540065bddc

Comment by Alex Zhuravlev [ 18/Sep/18 ]

Please try the updated patch.

 

Comment by Alex Zhuravlev [ 19/Sep/18 ]

@James, please suggest how to test this with the appropriate Commit parameter.

Comment by James Nunez (Inactive) [ 19/Sep/18 ]

Added the following to the commit message to test b2_10 clients with servers from this patch:
Test-Parameters: clientjob=lustre-b2_10 clientbuildno=136 testgroup=review-ldiskfs
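
The `Test-Parameters:` line above is a list of space-separated key=value pairs. As a sketch, it can be split into a dictionary for inspection. The helper `parse_test_parameters` is invented for this example and only handles the simple form shown in this comment; it is not the parser the test system uses.

```python
def parse_test_parameters(line):
    """Split a 'Test-Parameters:' line into a dict of key=value pairs."""
    prefix = "Test-Parameters:"
    if not line.startswith(prefix):
        raise ValueError("not a Test-Parameters line")
    pairs = {}
    for token in line[len(prefix):].split():
        key, _, value = token.partition("=")
        pairs[key] = value
    return pairs


params = parse_test_parameters(
    "Test-Parameters: clientjob=lustre-b2_10 clientbuildno=136 "
    "testgroup=review-ldiskfs"
)
print(params)
```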

Comment by James Nunez (Inactive) [ 19/Sep/18 ]

Looks like this patch allows a 2.10.5 client to mount a later server. Results at https://testing.whamcloud.com/test_sessions/319c062d-ad5a-4f4c-bcb1-a7c23ebff084

Comment by James A Simmons [ 05/Oct/18 ]

The Lustre Linux client (also 2.10) now works as well.

Comment by Gerrit Updater [ 10/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33162/
Subject: LU-11375 mdc: use old statfs format
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e70a6fd8a6400c0460a6c66668103c23b7997d30

Comment by Peter Jones [ 10/Oct/18 ]

Landed for 2.12
