[LU-3175] recovery-mds-scale test_failover_mds: unlink ./clients/client1/~dmtmp/PWRPNT/PPTC112.TMP failed (Read-only file system) Created: 15/Apr/13 Updated: 18/Apr/13 Resolved: 18/Apr/13 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None |
| Environment: | Lustre Branch: master |
| Severity: | 3 |
| Rank (Obsolete): | 7735 |
| Description |
|
While running recovery-mds-scale test_failover_mds, dbench and iozone operations failed on the client nodes as follows.

dbench output (on wtm-79):

copying /usr/share/dbench/client.txt to /mnt/lustre/d0.dbench-wtm-79/client.txt
running 'dbench 2' on /mnt/lustre/d0.dbench-wtm-79 at Mon Apr 15 08:12:41 PDT 2013
dbench PID=11113
dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
Running for 600 seconds with load 'client.txt' and minimum warmup 120 secs
0 of 2 processes prepared for launch 0 sec
2 of 2 processes prepared for launch 0 sec
releasing clients
2 241 18.48 MB/sec warmup 1 sec latency 20.124 ms
2 496 17.34 MB/sec warmup 2 sec latency 16.341 ms
2 664 14.10 MB/sec warmup 3 sec latency 608.471 ms
2 666 10.58 MB/sec warmup 4 sec latency 1093.980 ms
2 722 8.52 MB/sec warmup 5 sec latency 649.730 ms
2 724 7.10 MB/sec warmup 6 sec latency 1189.957 ms
2 724 6.09 MB/sec warmup 7 sec latency 1332.253 ms
2 724 5.32 MB/sec warmup 8 sec latency 2332.481 ms
2 727 4.73 MB/sec warmup 9 sec latency 3176.583 ms
2 729 4.26 MB/sec warmup 10 sec latency 632.289 ms
2 731 3.87 MB/sec warmup 11 sec latency 804.657 ms
2 731 3.55 MB/sec warmup 12 sec latency 1804.771 ms
2 761 3.29 MB/sec warmup 13 sec latency 2337.010 ms
2 791 3.07 MB/sec warmup 14 sec latency 1105.492 ms
[811] unlink ./clients/client1/~dmtmp/PWRPNT/PPTC112.TMP failed (Read-only file system) - expected NT_STATUS_OK
ERROR: child 1 failed at line 811
[811] unlink ./clients/client0/~dmtmp/PWRPNT/PPTC112.TMP failed (Read-only file system) - expected NT_STATUS_OK
ERROR: child 0 failed at line 811
Child failed with status 1

iozone output (on wtm-81):

Machine = Linux wtm-81 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 U
Excel chart generation enabled
Verify Mode. Pattern 3a3a3a3a
Performance measurements are invalid in this mode.
Using maximum file size of 102400 kilobytes.
Using Maximum Record Size 512 KB
Command line used: iozone -a -M -R -V 0xab -g 100M -q 512k -i0 -i1 -f /mnt/lustre/d0.iozone-wtm-81/iozone-file
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
64 4
Can not open temp file: /mnt/lustre/d0.iozone-wtm-81/iozone-file
Dmesg on the client nodes showed:

Lustre: DEBUG MARKER: Starting failover on mds1
LustreError: 11115:0:(llite_lib.c:1294:ll_md_setattr()) md_setattr fails: rc = -30
LustreError: 11114:0:(llite_lib.c:1294:ll_md_setattr()) md_setattr fails: rc = -30
LustreError: 11115:0:(file.c:158:ll_close_inode_openhandle()) inode 144115205289279635 mdc close failed: rc = -30
LustreError: 11115:0:(file.c:158:ll_close_inode_openhandle()) inode 144115205289279635 mdc close failed: rc = -30

The test results are still in the Maloo import queue. |
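For reference, rc = -30 in the messages above is -EROFS, matching the "Read-only file system" failures reported by dbench and iozone. A quick, hypothetical way to confirm the errno mapping on any node with Python installed (not part of the original test output):

# Decode Linux errno 30 (works with python 2 or 3)
python -c 'import errno, os; print(errno.errorcode[30] + ": " + os.strerror(30))'
# expected output: EROFS: Read-only file system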
| Comments |
| Comment by Peter Jones [ 15/Apr/13 ] |
|
Niu, could you please look into this one? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 16/Apr/13 ] |
|
Yujian, is it possible to get the MDS & client debug logs? It looks like b2_1 has a similar problem (see |
| Comment by Jian Yu [ 16/Apr/13 ] |
Hi Niu, the debug logs are in /scratch/logs/2.4.0/recovery-mds-scale.test_failover_mds.debug_log.tar.bz2 on the brent node. The debug level is -1. The Maloo report is https://maloo.whamcloud.com/test_sets/3d09403c-a5f9-11e2-b0a9-52540035b04c. |
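(For reference, a minimal sketch of how a full debug log like this can be captured with stock lctl; the dump path and buffer size are arbitrary examples, not the ones used in this run:)

# Enable full debug tracing on the node (debug level -1 = all masks)
lctl set_param debug=-1
# Optionally enlarge the in-memory debug buffer (value in MB)
lctl set_param debug_mb=512
# ... reproduce the failure ...
# Dump the kernel debug buffer to a file for attachment
lctl dk /tmp/lustre-debug.$(hostname).log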
| Comment by Niu Yawei (Inactive) [ 16/Apr/13 ] |
|
Thank you, Yujian. Unfortunately, it looks like the MDS log is truncated; what we got is the post-test log: the mds (wtm-82) log starts at 1366038913, but the last -EROFS seen in the client (wtm-79) log is at 1366038775. |
| Comment by Jian Yu [ 16/Apr/13 ] |
This is because of the hard failure mode: wtm-82 was powered off and back on during the test, so the debug log on wtm-82 could only be gathered after it came back up. |
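(Since a hard power-cycle discards the in-memory debug buffer, one way to keep MDS-side traces across the failover is the debug daemon, which streams the trace to a file as it is generated. A sketch only, assuming the output path lives on storage that survives the reboot; anything not yet flushed at power-off is still lost:)

# On the MDS, before the failover: stream the debug trace to a file (limit in MB)
lctl debug_daemon start /var/log/lustre-debug-mds.bin 1024
# ... hard failover happens; the daemon dies with the node ...
# After reboot, convert whatever was flushed to text for analysis
lctl debug_file /var/log/lustre-debug-mds.bin /var/log/lustre-debug-mds.txt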
| Comment by Jian Yu [ 17/Apr/13 ] |
|
Lustre Branch: master

A new test result: https://maloo.whamcloud.com/test_sessions/ec1c08ae-a737-11e2-b3cc-52540035b04c |
| Comment by Niu Yawei (Inactive) [ 18/Apr/13 ] |
|
Thank you, Yujian!

Shutting down system logger: [ OK ]
Stopping iscsi: sd 17:0:0:1: [sdm] Synchronizing SCSI cache
sd 9:0:0:1: [sdn] Synchronizing SCSI cache
sd 16:0:0:1: [sdg] Synchronizing SCSI cache
sd 12:0:0:1: [sdk] Synchronizing SCSI cache
sd 13:0:0:1: [sdf] Synchronizing SCSI cache
sd 14:0:0:1: [sdj] Synchronizing SCSI cache
sd 15:0:0:1: [sdh] Synchronizing SCSI cache
sd 10:0:0:1: [sdl] Synchronizing SCSI cache
sd 11:0:0:1: [sde] Synchronizing SCSI cache
Aborting journal on device sdi-8.
JBD2: I/O error detected when updating journal superblock for sdi-8.
LustreError: 9599:0:(osd_handler.c:636:osd_trans_commit_cb()) transaction @0xffff88061c03e780 commit error: 2
LDISKFS-fs error (device sdi): ldiskfs_journal_start_sb: Detected aborted journal
LDISKFS-fs (sdi): Remounting filesystem read-only
LustreError: 9599:0:(osd_handler.c:636:osd_trans_commit_cb()) transaction @0xffff880c18e70880 commit error: 2
journal commit I/O error
journal commit I/O error
LDISKFS-fs error (device sdi) in osd_trans_stop: IO failure
LustreError: 12393:0:(osd_handler.c:846:osd_trans_stop()) Failure to stop transaction: -5
LDISKFS-fs error (device sdi): ldiskfs_find_entry:
sd 8:0:0:1: [sdi] Synchronizing SCSI cache
When rebooting the MDS, iscsi was stopped before the MDS targets were unmounted, which is why EROFS happened. Maybe there is something wrong in the (iscsi) shutdown script? I think all filesystems should be unmounted before iscsi is stopped. |
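A minimal sketch of the ordering the MDS shutdown path should follow when the ldiskfs targets sit on iSCSI LUNs (service names are the stock RHEL6 ones; adjust to the local init scripts):

# Stop Lustre targets first so the ldiskfs-backed devices are cleanly unmounted
umount -a -t lustre
lustre_rmmod                 # optional: unload modules once the targets are down
# Only then tear down the iSCSI sessions that back those devices
service iscsi stop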
| Comment by Jian Yu [ 18/Apr/13 ] |
|
The "pm -h powerman --off" command on the Rosso cluster did not power off the physical test node directly; it gracefully brought the test node down in a safe way (like the shutdown command). On VM nodes, the "pm -h powerman --off" command worked correctly, which is why autotest did not hit the issue in this ticket. After I changed to the "--reset" option, which really powers off the test node, I did not hit the issue again. |
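For completeness, a hypothetical sketch of how such a change can be expressed in the test configuration, assuming the standard POWER_DOWN/POWER_UP hooks read by test-framework.sh (variable names and file location are assumptions, not taken from this ticket):

# e.g. in the site cfg file sourced by the test framework
POWER_DOWN="pm -h powerman --reset"   # hard power-cycle instead of graceful --off
POWER_UP="pm -h powerman --on"        # unchanged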