[LU-7590] After downgrade system from master RHEL7 to 2.5.5 RHEL6.6, hit cannot access /mnt/lustre: Permission denied Created: 21/Dec/15  Updated: 12/Jan/18  Resolved: 12/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

before upgrade: 2.5.5 RHEL6.6 ldiskfs
after upgrade: master build # 3264 RHEL7


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After a clean downgrade from master build #3264 RHEL7 to 2.5.5 RHEL6.6, the client cannot access /mnt/lustre. This issue also happened with ZFS.

client console

[root@onyx-27 ~]# mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)
onyx-25:/lustre on /mnt/lustre type lustre (rw,user_xattr)
[root@onyx-27 ~]# pwd
/root
[root@onyx-27 ~]# ls /mnt/lustre
LustreError: 11808:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -13
LustreError: 11808:0:(file.c:3128:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -13
ls: cannot access /mnt/lustre: Permission denied
[root@onyx-27 ~]# 

MDS dmesg

alg: No test for crc32 (crc32-pclmul)
Lustre: Lustre: Build Version: 2.5.5-RC2--PRISTINE-2.6.32-504.23.4.el6_lustre.x86_64
LNet: Added LNI 10.2.4.47@tcp [8/256/0/180]
LNet: Accept secure, port 988
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: MGC10.2.4.47@tcp: Connection restored to MGS (at 0@lo)
Lustre: lustre-MDT0000: used disk, loading
LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
LustreError: 11876:0:(lfsck_namespace.c:154:lfsck_namespace_load()) lustre-MDT0000-o: fail to load lfsck_namespace, expected = 256, rc = 4
Lustre: lustre-MDT0000-lwp-MDT0000: Connection restored to lustre-MDT0000 (at 0@lo)
Lustre: 11949:0:(client.c:1943:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1450734129/real 1450734129]  req@ffff88082d934c00 x1521204989001764/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.4.56@tcp:28/4 lens 400/544 e 0 to 1 dl 1450734134 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 11949:0:(client.c:1943:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1450734154/real 1450734154]  req@ffff88082d934400 x1521204989001888/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.4.56@tcp:28/4 lens 400/544 e 0 to 1 dl 1450734164 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 11949:0:(client.c:1943:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1450734169/real 1450734169]  req@ffff88082d473000 x1521204989001916/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.4.56@tcp:28/4 lens 400/544 e 0 to 1 dl 1450734184 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 10.2.4.56@tcp)
LustreError: 12091:0:(mdt_identity.c:136:mdt_identity_do_upcall()) lustre-MDT0000: error invoking upcall /sbin/l_getidentity lustre-MDT0000 0: rc -2; check /proc/fs/lustre/mdt/lustre-MDT0000/identity_upcall, time 312us
[root@onyx-25 ~]# 



 Comments   
Comment by Sarah Liu [ 22/Dec/15 ]

I ran the exact same test, but upgrading from 2.5.5 RHEL6.6 to master RHEL6.7 and then downgrading did not hit this problem.

Comment by Andreas Dilger [ 22/Dec/15 ]

The -13 = -EPERM, which is what would be expected if /sbin/l_getidentity wasn't found (in the old days this returned -EIDRM, which was much easier to diagnose, but also confused users):

LustreError: 12091:0:(mdt_identity.c:136:mdt_identity_do_upcall()) lustre-MDT0000: error invoking upcall /sbin/l_getidentity lustre-MDT0000 0: rc -2; check /proc/fs/lustre/mdt/lustre-MDT0000/identity_upcall, time 312us

The -2 = -ENOENT, which means /sbin/l_getidentity could not be found for some reason.

The other error is about revalidating FID [0x200000007:0x1:0x0], which is FID_SEQ_ROOT, but I suspect that is just printed because it got an error when checking permissions on the root directory.

Comment by Joseph Gmitter (Inactive) [ 23/Dec/15 ]

Hi Dmitry,
Can you please have a look at this issue to gather more info?
Thanks.
Joe

Comment by Sarah Liu [ 30/Dec/15 ]

Before downgrading from master RHEL7.1 to the lower Lustre version, per the comment in LU-7410, I remounted the MDS with the abort_recovery option. After doing this, mount only shows "ro". This is different from what I saw after upgrading the system to master RHEL6.7: in that scenario, mount showed "rw,abort_recovery" after remounting the MDS with the same option.

[root@onyx-25 ~]# mount -t lustre -o abort_recovery /dev/sdb1 /mnt/mds1
[ 3437.966195] LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
[ 3438.206440] Lustre: MGS: Connection restored to MGC10.2.4.47@tcp_0 (at 0@lo)
[ 3438.217099] Lustre: Skipped 4 previous similar messages
[ 3439.039274] LustreError: 10157:0:(mdt_handler.c:5603:mdt_iocontrol()) lustre-MDT0000: Aborting recovery for device
[root@onyx-25 ~]# [ 3443.513614] Lustre: 3306:0:(client.c:1994:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1451508192/real 1451508192]  req@ffff8804026ac800 x1522015822538812/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.4.56@tcp:28/4 lens 520/544 e 0 to 1 dl 1451508197 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 3443.554336] Lustre: lustre-MDT0000: Connection restored to 10.2.4.47@tcp (at 0@lo)

[root@onyx-25 ~]# mount|grep lustre
/dev/sdb1 on /mnt/mds1 type lustre (ro)

[root@onyx-25 ~]# rpm -qa|grep lustre
lustre-osd-ldiskfs-mount-2.7.64-3.10.0_229.20.1.el7_lustre.x86_64.x86_64
lustre-osd-ldiskfs-2.7.64-3.10.0_229.20.1.el7_lustre.x86_64.x86_64
lustre-iokit-2.7.64-3.10.0_229.20.1.el7_lustre.x86_64.x86_64
lustre-modules-2.7.64-3.10.0_229.20.1.el7_lustre.x86_64.x86_64
lustre-2.7.64-3.10.0_229.20.1.el7_lustre.x86_64.x86_64
lustre-tests-2.7.64-3.10.0_229.20.1.el7_lustre.x86_64.x86_64
kernel-3.10.0-229.20.1.el7_lustre.x86_64
[root@onyx-25 ~]# 
Comment by Dmitry Eremin (Inactive) [ 29/Jan/16 ]

I think this happens because mkfs.lustre on RHEL7 recorded the wrong path to l_getidentity. There is a symbolic link /sbin -> /usr/sbin on RHEL 7.x, but this is not true for RHEL 6.x. Originally --param=mdt.identity_upcall=/sbin/l_getidentity was specified or auto-discovered. This parameter should be changed, or a correct link to the binary should be provided.
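The path-resolution difference Dmitry describes can be illustrated with a small, self-contained Python sketch (temporary directories stand in for the real /sbin and /usr/sbin; this is not Lustre code):

```python
import os
import shutil
import tempfile

# Build a throwaway tree that mimics the RHEL 7.x layout, where the
# binary physically lives in usr/sbin and sbin is a symlink to it.
root = tempfile.mkdtemp()
usr_sbin = os.path.join(root, "usr", "sbin")
os.makedirs(usr_sbin)
binary = os.path.join(usr_sbin, "l_getidentity")
with open(binary, "w") as f:
    f.write("#!/bin/sh\n")
os.chmod(binary, 0o755)

sbin = os.path.join(root, "sbin")
os.symlink(usr_sbin, sbin)  # the RHEL 7.x /sbin -> /usr/sbin link

# With sbin first in PATH, `which` resolves through the symlink and
# reports the sbin path, even though the file is under usr/sbin.
found = shutil.which("l_getidentity", path=os.pathsep.join([sbin, usr_sbin]))
print(found)  # a path ending in sbin/l_getidentity

# On "RHEL 6.x" there is no such symlink, so a stored
# /sbin/l_getidentity upcall path fails with -ENOENT.
os.remove(sbin)
print(os.path.exists(os.path.join(sbin, "l_getidentity")))
```

A path recorded on RHEL 7.x via the symlink thus stops resolving once the filesystem is mounted on RHEL 6.x, which matches the -ENOENT from the upcall in the MDS log.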

Comment by Dmitry Eremin (Inactive) [ 29/Jan/16 ]

If the Lustre FS was created by the llmount.sh script, then we will have this issue. In this script, if the L_GETIDENTITY environment variable is undefined, it is set with `which l_getidentity`. Therefore on RHEL 7.x the which utility returns /sbin/l_getidentity, instead of the /usr/sbin/l_getidentity it returns on RHEL 6.x.

As a workaround, you can run the llmount.sh script with the L_GETIDENTITY environment variable set explicitly. For example:

L_GETIDENTITY=/usr/sbin/l_getidentity llmount.sh
Comment by Sarah Liu [ 29/Jan/16 ]

Thank you for the suggestion; I will try the workaround.

Comment by Dmitry Eremin (Inactive) [ 12/Jan/18 ]

I hope this workaround works. If you disagree, please reopen this ticket.
