[LU-7656] replay-single_70c test failed tar: Exiting with failure status due to previous errors Created: 12/Jan/16  Updated: 13/May/16  Resolved: 13/May/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Noopur Maheshwari (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: patch
Environment:

Configuration : 4 Node - ( 1 MDS/1 OSS/2 Clients)
Release
191_2.6.32_431.17.1.x2.0.62.x86_64_gb0424d1 Build Date: Thu 03 Sep 2015 12:25:48 AM UTC
2.6.32_431.29.2.el6.x86_64_g01ca899 Build Date: Sat 05 Sep 2015 05:39:37 PM UTC
Server 2.5.1.x6
Client 2.7.59


Attachments: File 70c.lctl.tgz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
== replay-single test 70c: tar 1mdts recovery == 02:32:52 (1441506772)
Starting client fre1211,fre1212:  -o user_xattr,flock fre1209@tcp:/lustre /mnt/lustre
Started clients fre1211,fre1212: 
fre1209@tcp:/lustre on /mnt/lustre type lustre (rw,user_xattr,flock)
fre1209@tcp:/lustre on /mnt/lustre type lustre (rw,user_xattr,flock)
Started tar 8730
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
Filesystem          1K-blocks  Used Available Use% Mounted on
fre1209@tcp:/lustre   1377952 68056   1233908   6% /mnt/lustre
tar: Removing leading `/' from member names
test_70c fail mds1 1 times
Failing mds1 on fre1209
Stopping /mnt/mds1 (opts:) on fre1209
pdsh@fre1211: fre1209: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to fre1209
02:35:20 (1441506920) waiting for fre1209 network 900 secs ...
02:35:20 (1441506920) network interface is UP
mount facets: mds1
Starting mds1: -o rw,user_xattr  /dev/vdb /mnt/mds1
fre1209: mount.lustre: set /sys/block/vdb/queue/max_sectors_kb to 2147483647
fre1209: 
Started lustre-MDT0000
fre1212: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 11 sec
fre1211: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 11 sec
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
Filesystem          1K-blocks  Used Available Use% Mounted on
fre1209@tcp:/lustre   1377952 68056   1237060   6% /mnt/lustre
test_70c fail mds1 2 times
Failing mds1 on fre1209
Stopping /mnt/mds1 (opts:) on fre1209
pdsh@fre1211: fre1209: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to fre1209
02:38:01 (1441507081) waiting for fre1209 network 900 secs ...
02:38:01 (1441507081) network interface is UP
mount facets: mds1
Starting mds1: -o rw,user_xattr  /dev/vdb /mnt/mds1
fre1209: mount.lustre: set /sys/block/vdb/queue/max_sectors_kb to 2147483647
fre1209: 
Started lustre-MDT0000
fre1212: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 9 sec
fre1211: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 9 sec
Resetting fail_loc on all nodes.../usr/lib64/lustre/tests/test-framework.sh: line 2976:  8730 Killed                  ( while true; do
    test_mkdir -p -c$MDSCOUNT $DIR/$tdir || break; if [ $MDSCOUNT -ge 2 ]; then
        $LFS setdirstripe -D -c$MDSCOUNT $DIR/$tdir || error "set default dirstripe failed";
    fi; cd $DIR/$tdir || break; tar cf - /etc | tar xf - || error "tar failed"; cd $DIR || break; rm -rf $DIR/$tdir || break;
done )
done.
tar: etc/ssl: Cannot stat: No such file or directory
tar: etc/sysconfig/network-scripts: Cannot stat: No such file or directory
tar: etc/sysconfig: Cannot stat: No such file or directory
tar: etc/pam.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc0.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc5.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc2.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc4.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc6.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc3.d: Cannot stat: No such file or directory
tar: etc/rc.d/rc1.d: Cannot stat: No such file or directory
tar: etc/rc.d: Cannot stat: No such file or directory
tar: etc/profile.d: Cannot stat: No such file or directory
tar: etc/alternatives: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors



 Comments   
Comment by Gerrit Updater [ 12/Jan/16 ]

Noopur Maheshwari (noopur.maheshwari@seagate.com) uploaded a new patch: http://review.whamcloud.com/17959
Subject: LU-7656 tests: tar a temporary folder
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 76c60777cc75f9d1d6c870ce986d793517e21969

Comment by Joseph Gmitter (Inactive) [ 14/Jan/16 ]

James,
Can you have a look at the patch?
Thanks.
Joe

Comment by Andreas Dilger [ 14/Jan/16 ]

Have you verified that this is related to trying to archive dangling symlinks from the source /etc folder, or what is the source of the error? Have you tried using "tar -cf --ignore-failed-read" to avoid an error on tar during read? It may also be that these errors are generated at restore time because the files are being deleted during cleanup while tar is still running.

Comment by Noopur Maheshwari (Inactive) [ 03/Feb/16 ]

Hello Andreas,

Dangling symlinks do not cause tar to fail: I created a dangling symlink in a temporary folder and ran tar on that folder, and tar succeeded.
I also tried the "--ignore-failed-read" option; it turns a read error into a warning, so tar no longer exits with an error during read.
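The two observations above can be reproduced with a small sketch, assuming GNU tar (the temporary paths and file names are illustrative, not from the test suite):

```shell
# Illustrative check, assuming GNU tar is in PATH.
workdir=$(mktemp -d)
mkdir "$workdir/src"
ln -s /nonexistent/target "$workdir/src/dangling"   # dangling symlink

# 1. Archiving a dangling symlink succeeds: tar stores the link itself
#    and never follows it.
tar -C "$workdir" -cf "$workdir/a.tar" src
echo "exit status with dangling symlink: $?"

# 2. --ignore-failed-read downgrades an unreadable/unstattable member
#    from a fatal error to a warning.
tar --ignore-failed-read -C "$workdir" -cf "$workdir/b.tar" \
    src "$workdir/src/gone" 2>/dev/null
echo "exit status with --ignore-failed-read: $?"

rm -rf "$workdir"
```

Note that the option must come before the archive name; in `tar -cf --ignore-failed-read`, tar would take `--ignore-failed-read` as the archive file name.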

Comment by James Nunez (Inactive) [ 22/Feb/16 ]

Noopur - In the patch, you stated "Changing directory to /tmp does not help in this case. We see these tar failures without Lustre mounted as well. There is a problem with the tar utility, OS or VM (kvm or vmware). This isn't a lustre problem. Abandoning."

So, I am closing this ticket as "Not a Bug".

Comment by Noopur Maheshwari (Inactive) [ 29/Feb/16 ]

Hello James,

I figured out that this isn't a tar utility issue; it is a test case issue.

kill -0, used in the test case, only checks whether a signal could be delivered to a running process, i.e. whether the process still exists and we have permission to signal it.
kill -0 neither kills tar nor waits for it to complete.

So the tar process keeps running in its infinite loop, and the removal/cleanup of files races with it and causes tar to fail.
The main process should wait for the tar process to complete before cleanup, and then exit gracefully. I'll push a patch for this.

Could you please reopen the ticket?

Thanks

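The distinction Noopur describes can be seen in a minimal shell sketch (the stand-in loop below replaces the real background tar loop; the actual change landed via http://review.whamcloud.com/18732):

```shell
# Stand-in for the background tar loop from test_70c.
( while true; do sleep 1; done ) &
tar_pid=$!

# kill -0 only asks "can this PID be signalled?"; it neither stops the
# process nor waits for it, so cleanup could still race with a live tar.
kill -0 "$tar_pid" && echo "loop $tar_pid still running"

# The fix: stop the loop and wait for it to exit BEFORE removing the
# files it operates on.
kill "$tar_pid"
wait "$tar_pid" 2>/dev/null || true   # reap it; a nonzero status is expected
echo "loop reaped; cleanup is now safe"
```

After `wait` returns, the PID is reaped and a subsequent `kill -0` fails, which is the guarantee the cleanup path needs.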
Comment by Gerrit Updater [ 01/Mar/16 ]

Noopur Maheshwari (noopur.maheshwari@seagate.com) uploaded a new patch: http://review.whamcloud.com/18732
Subject: LU-7656 tests: tar fix for replay-single/70c
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b346da54ead50afc6f72615a33f4ed0e1f27b41e

Comment by Gerrit Updater [ 11/May/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18732/
Subject: LU-7656 tests: tar fix for replay-single/70c
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 13f4d2a5ab81b479fcc1cd2263c2cd8db8b616c5

Comment by Joseph Gmitter (Inactive) [ 13/May/16 ]

Landed to master for 2.9.0

Generated at Sat Feb 10 02:10:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.