[LU-8725] Rolling Downgrade 2.8.x<->master : FAIL: unable to write to /mnt/lustre/d0_runas_test as UID 500 Created: 18/Oct/16  Updated: 11/May/17  Resolved: 11/May/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Saurabh Tandan (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

RHEL 6.7 2.8.x <-> RHEL 7
master build# 3456


Attachments: Text File mds_debug.log     Text File mds_dmesg.log     Text File oss_debug.log     Text File oss_dmesg.log    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>


This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/b7dc839c-957d-11e6-bc10-5254006e85c2.
The sanity test failed after downgrading the MDS, with the message:

unable to write to /mnt/lustre/d0_runas_test as UID 500

suite_log:

-----============= acceptance-small: sanity ============----- Sat Oct 15 23:27:53 PDT 2016
Running: bash /usr/lib64/lustre/tests/sanity.sh
onyx-23vm7: Checking config lustre mounted on /mnt/lustre
onyx-23vm8: Checking config lustre mounted on /mnt/lustre
Checking servers environments
Checking clients onyx-23vm7,onyx-23vm8 environments
Using TIMEOUT=20
disable quota as required
osd-ldiskfs.track_declares_assert=1
osd-ldiskfs.track_declares_assert=1
running as uid/gid/euid/egid 500/500/500/500, groups:
 [touch] [/mnt/lustre/d0_runas_test/f11686]
touch: cannot touch `/mnt/lustre/d0_runas_test/f11686': Permission denied
 sanity : @@@@@@ FAIL: unable to write to /mnt/lustre/d0_runas_test as UID 500.
        Please set RUNAS_ID to some UID which exists on MDS and client or
        add user 500:500 on these nodes. 
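The failure message itself suggests a remediation. As a quick first check before suspecting anything deeper, one can verify that the RUNAS_ID user actually exists on each node (a sketch; the username "runas500" is a hypothetical placeholder, not from the ticket):

```shell
# Check whether the UID that sanity.sh runs as exists on this node.
# UID/GID 500 come from the failure message; "runas500" is a
# hypothetical username for the suggested useradd remediation.
RUNAS_ID=500
if getent passwd "$RUNAS_ID" > /dev/null; then
    echo "UID $RUNAS_ID exists on $(hostname)"
else
    echo "UID $RUNAS_ID missing; the log suggests:"
    echo "  groupadd -g $RUNAS_ID runas500 && useradd -u $RUNAS_ID -g $RUNAS_ID runas500"
fi
```

In this run the user appears to have existed on the client (the touch was attempted and denied rather than failing on an unknown UID), which is why the later comments point at the MDS-side identity upcall instead.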

Steps Followed:
1. Set up the Lustre file system with the old version, 2.8.x.
2. Upgraded the OSS and ran sanity.sh.
3. Upgraded the MDS and ran sanity.sh.
4. Upgraded the clients and ran sanity.sh.
5. Downgraded the clients and ran sanity.sh.
6. Downgraded the MDS (using the extra abort_recovery step for the remount, and '-f' for the unmount again). The file system mounted with no issues, but running sanity.sh from the client produced the error message above.

Since the test failed, I also tried unmounting the file system once and remounting it on all nodes.
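For reference, the remount sequence described in step 6 could be sketched as follows (a dry run: the device, mount point, and target name are taken from the attached MDS logs, and DRYRUN=echo keeps the commands from actually executing):

```shell
# Dry-run sketch of the step-6 MDS downgrade remount. /dev/sdb1,
# /mnt/mds0 and lustre-MDT0000 come from the attached logs; set
# DRYRUN= (empty) on a real MDS to execute for real.
DRYRUN=echo
$DRYRUN umount -f /mnt/mds0                          # forced unmount before downgrade
$DRYRUN mount -t lustre -o acl,user_xattr /dev/sdb1 /mnt/mds0
$DRYRUN lctl --device lustre-MDT0000 abort_recovery  # the extra abort_recovery step
```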

MDS log when mounted:

[root@onyx-25 ~]# mount -t lustre -o acl,user_xattr /dev/sdb1 /mnt/mds0
mount.lustre: increased /sys/block/sdb/queue/max_sectors_kb from 1024 to 16384
LNet: HW CPU cores: 32, npartitions: 4
alg: No test for adler32 (adler32-zlib)
alg: No test for crc32 (crc32-table)
alg: No test for crc32 (crc32-pclmul)
Lustre: Lustre: Build Version: jenkins-arch=x86_64,build_type=server,distro=el6.7,ib_stack=inkernel-39-gf08239d-PRISTINE-2.6.32-573.26.1.el6_lustre.g948c890.x86_64
LNet: Added LNI 10.2.4.47@tcp [8/256/0/180]
LNet: Accept secure, port 988
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: MGS: Connection restored to MGC10.2.4.47@tcp_0 (at 0@lo)

MDS log when sanity.sh was run:

Lustre: DEBUG MARKER: -----============= acceptance-small: sanity ============----- Sat Oct 15 23:27:53 PDT 2016
Lustre: DEBUG MARKER: Using TIMEOUT=20
LustreError: 20325:0:(mdt_identity.c:135:mdt_identity_do_upcall()) lustre-MDT0000: error invoking upcall /sbin/l_getidentity lustre-MDT0000 0: rc -2; check /proc/fs/lustre/mdt/lustre-MDT0000/identity_upcall, time 117us
LustreError: 20325:0:(mdt_identity.c:135:mdt_identity_do_upcall()) lustre-MDT0000: error invoking upcall /sbin/l_getidentity lustre-MDT0000 0: rc -2; check /proc/fs/lustre/mdt/lustre-MDT0000/identity_upcall, time 220us
LustreError: 20325:0:(mdt_identity.c:135:mdt_identity_do_upcall()) Skipped 4 previous similar messages
Lustre: DEBUG MARKER: sanity : @@@@@@ FAIL: unable to write to /mnt/lustre/d0_runas_test as UID 500.

OSS log when sanity.sh was run:

[root@onyx-26 ~]# [118430.582487] Lustre: DEBUG MARKER: -----============= acceptance-small: sanity ============----- Sat Oct 15 23:27:53 PDT 2016
[118433.837028] Lustre: DEBUG MARKER: Using TIMEOUT=20
[118436.286200] Lustre: DEBUG MARKER: sanity : @@@@@@ FAIL: unable to write to /mnt/lustre/d0_runas_test as UID 500.


 Comments   
Comment by Joseph Gmitter (Inactive) [ 19/Oct/16 ]

Hi Saurabh,

Can you try the changes to the test script for l_getidentity that we discussed on the QE call?

Thanks.
Joe

Comment by Andreas Dilger [ 19/Oct/16 ]

It looks like this is a problem with identity_upcall being set explicitly by the test framework when the MDT is formatted. That doesn't happen for normal Lustre configurations, but is needed when developers are running directly out of the build tree:

    export L_GETIDENTITY=${L_GETIDENTITY:-"$LUSTRE/utils/l_getidentity"}
    if [ ! -f "$L_GETIDENTITY" ]; then
        if `which l_getidentity > /dev/null 2>&1`; then
            export L_GETIDENTITY=$(which l_getidentity)
        else
            export L_GETIDENTITY=NONE
        fi
    fi

                opts+=${L_GETIDENTITY:+" --param=mdt.identity_upcall=$L_GETIDENTITY"}
Comment by Andreas Dilger [ 19/Oct/16 ]

I believe in earlier distros (RHEL6 and earlier) the upcall is /usr/sbin/l_getidentity, but on newer distros (RHEL7 and later) it is /sbin/l_getidentity because of distro packaging changes? If this path is explicitly stored in the configuration log, it will be incorrect after a downgrade. That said, I don't see where it gets set to /sbin/l_getidentity on RHEL7 installs except by the test framework, so there may still be a problem for normal usage?
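Following this line of reasoning, the rc = -2 (-ENOENT) in the MDS log would mean the configured upcall path simply does not exist on the downgraded node. A sketch of how one might inspect and repoint the live setting on the MDS (parameter names assumed from standard Lustre lctl usage; a permanent fix would go through the configuration log rather than set_param):

```shell
# Sketch: on the MDS, check where identity_upcall points and repoint
# it at wherever this distro actually installed l_getidentity.
# Guarded so the snippet is harmless on a non-Lustre node.
if command -v lctl > /dev/null; then
    lctl get_param mdt.lustre-MDT0000.identity_upcall
    UPCALL=$(command -v l_getidentity || echo NONE)
    lctl set_param mdt.lustre-MDT0000.identity_upcall="$UPCALL"
else
    echo "lctl not found; nothing to inspect on this node"
fi
```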

Comment by Saurabh Tandan (Inactive) [ 11/May/17 ]

Cannot reproduce; hence closing the ticket.

Generated at Sat Feb 10 02:20:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.