[LU-4016] metadata-updates failure: attributes check failed Created: 26/Sep/13  Updated: 31/Dec/13  Resolved: 20/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Build # 1687
OpenSFS cluster with combined MGS/MDS, single OSS with two OSTs, four clients; one agent (c07), one with robinhood/db running (c08) and just running as Lustre clients (c09, c10)


Severity: 3
Rank (Obsolete): 10782

 Description   

Test results at https://maloo.whamcloud.com/test_sessions/eb9d697c-2658-11e3-8d26-52540035b04c

metadata-updates fails with "attributes check failed". Node logs don't exist, but the suite_log shows:

Checking file(s) attributes ... 
c08: /lustre/scratch/d0.metadata-updates/c07/testfile [ uid gid size mode ] expected : 0 0 1024 0644 ;  got : root 0 1024 0644 
c08: Wrong file attributes
c09: /lustre/scratch/d0.metadata-updates/c07/testfile [ uid gid size mode ] expected : 0 0 1024 0644 ;  got : root 0 1024 0644 
c09: Wrong file attributes
pdsh@c10: c08: ssh exited with exit code 56
c07: /lustre/scratch/d0.metadata-updates/c07/testfile [ uid gid size mode ] expected : 0 0 1024 0644 ;  got : root 0 1024 0644 
c07: Wrong file attributes
pdsh@c10: c09: ssh exited with exit code 56
pdsh@c10: c07: ssh exited with exit code 56
c10: /lustre/scratch/d0.metadata-updates/c07/testfile [ uid gid size mode ] expected : 0 0 1024 0644 ;  got : root 0 1024 0644 
c10: Wrong file attributes
pdsh@c10: c10: ssh exited with exit code 56
 metadata-updates : @@@@@@ FAIL: attributes check failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
  = /usr/lib64/lustre/tests/metadata-updates.sh:227:main()

Although this was run during HSM testing, the copytool was not started and HSM was not enabled on the MDS, but all HSM attributes were in the test configuration files.



 Comments   
Comment by James Nunez (Inactive) [ 27/Sep/13 ]

Another similar failure while HSM was enabled: https://maloo.whamcloud.com/test_sessions/277b4f38-2779-11e3-8a10-52540035b04c

There are additional failures, after truncate and chmod:

Checking file(s) attributes ... 
c10: /lustre/scratch/d0.metadata-updates/c08/testfile [ uid gid size mode ] expected : 0 0 1024 0644 ;  got : root 0 1024 0644 
c10: Wrong file attributes
c08: /lustre/scratch/d0.metadata-updates/c08/testfile [ uid gid size mode ] expected : 0 0 1024 0644 ;  got : root 0 1024 0644 
c08: Wrong file attributes
pdsh@c10: c10: ssh exited with exit code 56
c09: /lustre/scratch/d0.metadata-updates/c08/testfile [ uid gid size mode ] expected : 0 0 1024 0644 ;  got : root 0 1024 0644 
c09: Wrong file attributes
pdsh@c10: c08: ssh exited with exit code 56
pdsh@c10: c09: ssh exited with exit code 56
 metadata-updates : @@@@@@ FAIL: attributes check failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
  = /usr/lib64/lustre/tests/metadata-updates.sh:227:main()
Dumping lctl log to /tmp/test_logs/2013-09-26/124150/metadata-updates..*.1380231630.log
Part 2. file(s) attributes modification.
Performing chmod 00222 ...
Checking file(s) attributes ... 
c10: /lustre/scratch/d0.metadata-updates/c08/testfile [ uid gid size mode ] expected : 0 0 1024 0222 ;  got : root 0 1024 0222 
c10: Wrong file attributes
pdsh@c10: c10: ssh exited with exit code 56
c09: /lustre/scratch/d0.metadata-updates/c08/testfile [ uid gid size mode ] expected : 0 0 1024 0222 ;  got : root 0 1024 0222 
c09: Wrong file attributes
c08: /lustre/scratch/d0.metadata-updates/c08/testfile [ uid gid size mode ] expected : 0 0 1024 0222 ;  got : root 0 1024 0222 
c08: Wrong file attributes
pdsh@c10: c08: ssh exited with exit code 56
pdsh@c10: c09: ssh exited with exit code 56
 metadata-updates : @@@@@@ FAIL: wrong attributes after chmod 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
  = /usr/lib64/lustre/tests/metadata-updates.sh:232:main()
Dumping lctl log to /tmp/test_logs/2013-09-26/124150/metadata-updates..*.1380231632.log
Changing atime, mtime ...
Checking atime, mtime ... 
Part 3. truncate file(s) to 0 size, check new file size.
Truncating file(s) ... 
Checking file(s) attributes ... 
c10: /lustre/scratch/d0.metadata-updates/c08/testfile [ uid gid size mode ] expected : 0 0 0 0222 ;  got : root 0 0 0222 
c10: Wrong file attributes
pdsh@c10: c10: ssh exited with exit code 56
c08: /lustre/scratch/d0.metadata-updates/c08/testfile [ uid gid size mode ] expected : 0 0 0 0222 ;  got : root 0 0 0222 
c08: Wrong file attributes
pdsh@c10: c08: ssh exited with exit code 56
c09: /lustre/scratch/d0.metadata-updates/c08/testfile [ uid gid size mode ] expected : 0 0 0 0222 ;  got : root 0 0 0222 
c09: Wrong file attributes
pdsh@c10: c09: ssh exited with exit code 56
 metadata-updates : @@@@@@ FAIL: wrong attributes after truncate 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
  = /usr/lib64/lustre/tests/metadata-updates.sh:240:main()
Dumping lctl log to /tmp/test_logs/2013-09-26/124150/metadata-updates..*.1380231634.log
Comment by James Nunez (Inactive) [ 15/Oct/13 ]

Another failure at https://maloo.whamcloud.com/test_sessions/4574cfe8-35e1-11e3-b051-52540035b04c

Comment by Stephen Champion [ 22/Oct/13 ]

I've been seeing this on rhel and sles - I think since I started testing with 1.8.5. I observed it on 2.4.1 today. Infiniband cluster, two clients, one MDS, two OSS with one OST each. Upcall is default, nsswitch point to files nis.

I've never looked at it much - I just excluded the failing tests. It looks like stat is reporting %U, instead of %u. Can't say why. I looked for a bug in coreutils, but didn't see anything obvious. I've never observed this behavior outside of running acceptance.

Comment by Stephen Champion [ 22/Oct/13 ]

for HOST in ${HOSTS//,/ } ; do
TESTFILE=$TESTDIR/\\\$HOST/$FILE;
+ echo \\\stat -c \\\"%u %g %s 0%a\\\" \\\$TESTFILE
tmp=\\\$(stat -c \\\"%u %g %s 0%a\\\" \\\$TESTFILE);
echo \\\"\\\$TESTFILE [ uid gid size mode ] expected : $attr ; got : \\\$tmp \\\";

n006: stat -c root %g %s 0%a /mnt/accfs/d0.metadata-updates/n006/testfile
n007: stat -c root %g %s 0%a /mnt/accfs/d0.metadata-updates/n006/testfile

So our %u was eaten by pdsh:

  1. pdsh -w n006,n007 echo '%u'
    n007: root
    n006: root
  2. pdsh -L
    3 modules loaded:

Module: rcmd/exec
Author: Mark Grondona <mgrondona@llnl.gov>
Descr: arbitrary command rcmd connect method
Active: yes

Module: misc/netgroup
Author: Mark Grondona <mgrondona@llnl.gov>
Descr: Target netgroups from pdsh
Active: yes
Options:
-g groupname target hosts in netgroup "groupname"
-X groupname exclude hosts in netgroup "groupname"

Module: rcmd/ssh
Author: Jim Garlick <garlick@llnl.gov>
Descr: ssh based rcmd connect method
Active: yes

Less clear is what we can do about it. I do not see an option to pdsh to prohibit this processing. Escaping it as %%u works for me - but would presumably break configurations it currently works for.

Comment by James A Simmons [ 22/Oct/13 ]

This is a old bug that has been around since the 1.6 days. I thought I was the only one that saw this problem. The way I work around it is with the below patch.

diff -urp metadata-updates.sh.orig metadata-updates.sh
— metadata-updates.sh.orig 2013-05-16 09:54:56.000000000 -0400
+++ metadata-updates.sh 2013-05-20 13:33:15.000000000 -0400
@@ -32,7 +32,7 @@
NEW_ATIME="2001-01-01 GMT"
NEW_MTIME="2005-05-05 GMT"

-test_UID=$(id -u)
+test_UID=$(id -u -n)
test_GID=$(id -g)

SUMFILE=$TESTDIR/mdsum
@@ -106,7 +106,7 @@
do_nodesv $NODES_TO_USE "set $TRACE;
for HOST in ${HOSTS//,/ } ; do
TESTFILE=$TESTDIR/\\\$HOST/$FILE;

  • tmp=\\\$(stat -c \\\"%u %g %s 0%a\\\" \\\$TESTFILE);
    + tmp=\\\$(stat -c \\\"%U %g %s 0%a\\\" \\\$TESTFILE);
    echo \\\"\\\$TESTFILE [ uid gid size mode ] expected : $attr ; got : \\\$tmp \\\";
    if [ x\\\"\\\$tmp\\\" != x\\\"$attr\\\" ] ; then
    echo \\\"Wrong file attributes\\\";
Comment by Stephen Champion [ 23/Oct/13 ]

That works. I assumed that test_UID needed to be numerical, but it's not actually used anywhere else. I think we should rename it (because it's not a UID anymore) and change GID, too, for consistency.

Mind if I submit a patch with those changes, based on your idea?

Comment by James A Simmons [ 23/Oct/13 ]

Sure go right ahead

Comment by Stephen Champion [ 23/Oct/13 ]

http://review.whamcloud.com/8052

Against master, applies cleanly to 2.4.1 and works in my environment.

Comment by James Nunez (Inactive) [ 05/Nov/13 ]

Landed to master.

Comment by James A Simmons [ 18/Nov/13 ]

Can we cherry pick it against 2.4.1. We will later need it cherry picked against 2.5.1

Comment by James Nunez (Inactive) [ 18/Nov/13 ]

The patch should apply to 2.4 and 2.5 with no problems. I'll request that it be cherry-picked against 2.4 and 2.5.

Comment by James Nunez (Inactive) [ 19/Nov/13 ]

Reopened to apply patch to 2.5 and 2.4.

2.4 patch: http://review.whamcloud.com/#/c/8333/
2.5 patch: http://review.whamcloud.com/#/c/8332/

Comment by Jian Yu [ 23/Nov/13 ]

Patch http://review.whamcloud.com/8052 was cherry-picked to Lustre b2_4 branch for 2.4.2.

Comment by Peter Jones [ 20/Dec/13 ]

Landed for 2.4.2 and 2.6. Will land for 2.5.1 shortly

Generated at Sat Feb 10 01:38:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.