[LU-6321] Clean downgrade from 2.7.0 to 2.6.0 failed: fail to init namespace LFSCK component: rc = -5 Created: 02/Mar/15  Updated: 04/Mar/15  Resolved: 04/Mar/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0, Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Sarah Liu Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5707 LFSCK 3: cannot load namespace LFSCK ... Resolved
is related to LU-5820 LFSCK 4: Record linkEA verification h... Resolved
Severity: 3
Rank (Obsolete): 17682

 Description   

1. Formatted and set up Lustre 2.6.0, then did a clean upgrade of the system to 2.7.0; this succeeded.

2. Downgraded the system to 2.6.0; mounting the filesystem then failed.

MDS shows:

LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
LustreError: 33604:0:(lfsck_namespace.c:1786:lfsck_namespace_setup()) lustre-MDT0000-osd: fail to init namespace LFSCK component: rc = -5
LustreError: 33604:0:(mdd_device.c:1051:mdd_prepare()) lustre-MDD0000: failed to initialize lfsck: rc = -5
LustreError: 33604:0:(obd_mount_server.c:1769:server_fill_super()) Unable to start targets: -5
LustreError: 33712:0:(qsd_reint.c:54:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1010000:0x0], rc:-5
Lustre: Failing over lustre-MDT0000
LustreError: 33712:0:(qsd_reint.c:54:qsd_reint_completion()) Skipped 1 previous similar message
Lustre: 33604:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425324784/real 1425324784]  req@ffff88081d695400 x1494561337639040/t0(0) o251->MGC10.2.4.47@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1425324790 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount lustre-MDT0000 complete
LustreError: 33604:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-5)
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
LustreError: 34030:0:(lfsck_namespace.c:1786:lfsck_namespace_setup()) lustre-MDT0000-osd: fail to init namespace LFSCK component: rc = -5
LustreError: 34030:0:(mdd_device.c:1051:mdd_prepare()) lustre-MDD0000: failed to initialize lfsck: rc = -5
LustreError: 34030:0:(obd_mount_server.c:1769:server_fill_super()) Unable to start targets: -5
LustreError: 34125:0:(qsd_reint.c:54:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
LustreError: 34126:0:(qsd_reint.c:54:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1010000:0x0], rc:-5
Lustre: Failing over lustre-MDT0000
Lustre: 34030:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425324879/real 1425324879]  req@ffff8804187e4800 x1494561337639168/t0(0) o251->MGC10.2.4.47@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1425324885 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount lustre-MDT0000 complete
LustreError: 34030:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-5)
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted
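
The failure chain in the MDS log above can be pulled out mechanically. The following is a minimal sketch, not part of any Lustre tooling: `failure_chain`, `HEADER_RE` and `RC_RE` are hypothetical names, and the regular expressions assume the `LustreError: pid:0:(file:line:function()) message` layout seen in this ticket, with the return code as the last token of the message.

```python
import re

# Hypothetical helper names; the layout is taken from the log lines in this ticket.
HEADER_RE = re.compile(
    r"^LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)
# The return code is assumed to be the trailing negative number, e.g. "rc = -5" or "(-5)".
RC_RE = re.compile(r"(-\d+)\)?\s*$")


def failure_chain(log_text, pid):
    """Return (function, rc) pairs logged by one thread, in log order."""
    chain = []
    for raw in log_text.splitlines():
        m = HEADER_RE.match(raw.strip())
        if not m or int(m.group("pid")) != pid:
            continue
        rc = RC_RE.search(m.group("msg"))
        if rc:
            chain.append((m.group("func"), int(rc.group(1))))
    return chain


# Lines copied from the MDS log in the description.
SAMPLE = """\
LustreError: 33604:0:(lfsck_namespace.c:1786:lfsck_namespace_setup()) lustre-MDT0000-osd: fail to init namespace LFSCK component: rc = -5
LustreError: 33604:0:(mdd_device.c:1051:mdd_prepare()) lustre-MDD0000: failed to initialize lfsck: rc = -5
LustreError: 33604:0:(obd_mount_server.c:1769:server_fill_super()) Unable to start targets: -5
LustreError: 33712:0:(qsd_reint.c:54:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1010000:0x0], rc:-5
LustreError: 33604:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-5)
"""

if __name__ == "__main__":
    for func, rc in failure_chain(SAMPLE, 33604):
        print(f"{func}: rc = {rc}")
```

Filtering by the mount thread's PID (33604 in the first attempt) separates the LFSCK setup failure that propagates up to `lustre_fill_super()` from the quota messages logged by other threads.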

client shows:

Setup mgs, mdt, osts
Starting mds1: -o user_xattr,acl  /dev/sdb1 /mnt/mds1
Start of /dev/sdb1 on mds1 failed 5
Starting ost1:   /dev/sdb1 /mnt/ost1
Start of /dev/sdb1 on ost1 failed 19
Starting client: onyx-28: -o user_xattr,flock onyx-25@tcp:/lustre /mnt/lustre
Lustre: 74562:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425324792/real 1425324792]  req@ffff8804364bec00 x1494557626728452/t0(0) o250->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 400/544 e 0 to 1 dl 1425324797 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 76865:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8804364be800 x1494557626728456/t0(0) o101->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 76878:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8804364be000 x1494557626728464/t0(0) o101->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Lustre: 74562:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425324817/real 1425324817]  req@ffff8804364be400 x1494557626728468/t0(0) o250->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 400/544 e 0 to 1 dl 1425324827 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 76865:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8804364be800 x1494557626728460/t0(0) o101->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 15c-8: MGC10.2.4.47@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Lustre: Unmounted lustre-client
LustreError: 76865:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-5)
Starting client onyx-23.onyx.hpdd.intel.com,onyx-27,onyx-28: -o user_xattr,flock onyx-25@tcp:/lustre /mnt/lustre
Lustre: 74562:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425324829/real 1425324829]  req@ffff8804364be000 x1494557626728472/t0(0) o250->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 400/544 e 0 to 1 dl 1425324834 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 76949:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8804364bec00 x1494557626728476/t0(0) o101->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 76962:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8804364be400 x1494557626728484/t0(0) o101->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Lustre: 74562:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425324854/real 1425324854]  req@ffff8804364be000 x1494557626728488/t0(0) o250->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 400/544 e 0 to 1 dl 1425324864 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 76949:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8804364bec00 x1494557626728480/t0(0) o101->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 15c-8: MGC10.2.4.47@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Lustre: Unmounted lustre-client
LustreError: 76949:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-5)
Using TIMEOUT=20
Lustre: DEBUG MARKER: Using TIMEOUT=20
jobstats not supported by server
disable quota as required
 upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted 
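
For readers unfamiliar with the numeric codes in these logs, here is a small decoding sketch. The kernel return codes (`rc = -5`, `-11`) are negated Linux errno values; treating the test framework's `failed 5` and `failed 19` as errno values too is an assumption. `LINUX_ERRNO` and `decode_rc` are hypothetical names, and the table covers only the three codes that appear in this ticket.

```python
# Linux errno values for the codes seen in this ticket (hard-coded so the
# sketch does not depend on the platform it runs on).
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    11: ("EAGAIN", "Resource temporarily unavailable"),
    19: ("ENODEV", "No such device"),
}


def decode_rc(rc):
    """Describe a return code; kernel-style negative values are accepted."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"{rc}: {name} ({text})"


if __name__ == "__main__":
    for rc in (-5, -11, 19):
        print(decode_rc(rc))
```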


 Comments   
Comment by Oleg Drokin [ 02/Mar/15 ]

Getting an MDS-side debug log with an increased debug level would be great, to better understand what failed and why.

Comment by Andreas Dilger [ 02/Mar/15 ]

This may be related to the "LU-5820 lfsck: use multiple namespace LFSCK trace files" patch (http://review.whamcloud.com/12809) or the "LU-5707 lfsck: store namespace LFSCK statistics info in new EA" patch (http://review.whamcloud.com/12321). Weren't those supposed to allow the LFSCK code to ignore the new namespace log file and create a new one?

Comment by nasf (Inactive) [ 03/Mar/15 ]

Originally, the "lfsck_namespace" file stored both the namespace LFSCK statistics information and the FIDs to be double scanned. To improve namespace LFSCK performance, since Lustre 2.7 we split the single trace file into multiple ones named "lfsck_namespace_xx". From then on, the original "lfsck_namespace" only needs to record the namespace LFSCK statistics information, so we made it a regular file, NOT an index file. After a downgrade to Lustre 2.6, the old LFSCK expects an index trace file instead of a regular file, so it fails.

There are two possible solutions:
1) On Lustre 2.6, manually remove the old "lfsck_namespace" file under ldiskfs mode.
2) Patch the Lustre 2.7 and master code to make "lfsck_namespace" an index file.

I prefer the latter solution.
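
The mismatch and the two solutions can be illustrated with a toy model. This is only a sketch under stated assumptions, not Lustre code: `FileKind`, `old_lfsck_namespace_setup` and the dictionary standing in for the on-disk state are all hypothetical, and the behaviour "create a fresh index file when none exists" is assumed from solution 1 rather than taken from the source.

```python
from enum import Enum

EIO = 5  # Linux errno; the log reports it as rc = -5


class FileKind(Enum):
    REGULAR = "regular"
    INDEX = "index"


def old_lfsck_namespace_setup(on_disk):
    """Toy model of a 2.6-style loader that only accepts an index trace file.

    `on_disk` maps file names to FileKind (a hypothetical stand-in for the
    backing filesystem).
    """
    kind = on_disk.get("lfsck_namespace")
    if kind is None:
        # Assumption: with no trace file present, a fresh index file is created.
        on_disk["lfsck_namespace"] = FileKind.INDEX
        return 0
    if kind is not FileKind.INDEX:
        # A regular file left behind by unpatched 2.7 cannot be opened as an index.
        return -EIO
    return 0


if __name__ == "__main__":
    # State after running unpatched 2.7: the trace file is a regular file.
    unpatched = {"lfsck_namespace": FileKind.REGULAR}
    print("unpatched 2.7 -> 2.6:", old_lfsck_namespace_setup(unpatched))

    # Solution 1: remove the file by hand before mounting under 2.6.
    removed = {}
    print("file removed:", old_lfsck_namespace_setup(removed))

    # Solution 2: patched 2.7 keeps the trace file as an index file.
    patched = {"lfsck_namespace": FileKind.INDEX}
    print("patched 2.7 -> 2.6:", old_lfsck_namespace_setup(patched))
```

Solution 2 needs no manual step on the downgraded system, because the on-disk object type stays compatible with what the older code expects.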

Comment by Gerrit Updater [ 03/Mar/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13945
Subject: LU-6321 lfsck: make lfsck_namespace trace file as index
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2398502aaa7c4e9153f9c89ecb3086169daae81d

Comment by nasf (Inactive) [ 03/Mar/15 ]

Sarah, would you please help verify the above patch when you have time? Thanks!

Comment by Gerrit Updater [ 03/Mar/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13946
Subject: LU-6321 lfsck: make lfsck_namespace trace file as index
Project: fs/lustre-release
Branch: b2_7
Current Patch Set: 1
Commit: b6ef06c39f4c0dfff1e22f2ab6d805b816a08857

Comment by Sarah Liu [ 04/Mar/15 ]

Fan Yong, the patch works!

Comment by nasf (Inactive) [ 04/Mar/15 ]

Thanks Sarah!

Comment by Gerrit Updater [ 04/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13946/
Subject: LU-6321 lfsck: make lfsck_namespace trace file as index
Project: fs/lustre-release
Branch: b2_7
Current Patch Set:
Commit: 1ece3b3ffdf3dc112be19dd0ee2563b3e22d4b57

Comment by Gerrit Updater [ 04/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13945/
Subject: LU-6321 lfsck: make lfsck_namespace trace file as index
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ca8067522d6a6928e33dc8d34d5ad208c7eb535f
