[LU-1999] recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds Created: 20/Sep/12 Updated: 18/Apr/13 Resolved: 18/Oct/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Johann Lombardi (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4071 |
| Description |
|
This issue was created by maloo for Oleg Drokin <green@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/9d79385c-038b-11e2-83cf-52540035b04c. The sub-test test_106 failed with the following error:
Info required for matching: recovery-small 106 |
| Comments |
| Comment by Oleg Drokin [ 20/Sep/12 ] |
|
Johann, this failure comes from a test added by you yesterday |
| Comment by Johann Lombardi (Inactive) [ 21/Sep/12 ] |
|
It looks like a test issue since the client got evicted as expected: LustreError: 167-0: lustre-MDT0000-mdc-ffff880078abb000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail. Lustre: Evicted from MGS (at 10.10.4.118@tcp) after server handle changed from 0x1345e746bb7e9980 to 0x1345e746bb7e9edc Lustre: MGC10.10.4.118@tcp: Reactivating import Lustre: DEBUG MARKER: /usr/sbin/lctl mark recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds Lustre: DEBUG MARKER: recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mdd The issue is the the client reconnected before the check in the test was made. |
| Comment by Johann Lombardi (Inactive) [ 21/Sep/12 ] |
|
I have disabled the test on master until i come up with a test script fix to avoid those failures. |
| Comment by Johann Lombardi (Inactive) [ 21/Sep/12 ] |
|
Tentative patch: |
| Comment by Johann Lombardi (Inactive) [ 08/Oct/12 ] |
|
Patch landed. No new reoccurrence since then. |
| Comment by Oleg Drokin [ 12/Oct/12 ] |
|
It seems that I am still hitting this in my local testing (ie SLOW=yes REFORMAT=yes sh recovery-small.sh ), though I cannot speak of the frequence. == recovery-small test 106: lightweight connection support == 23:16:05 (1350098165)
fail_loc=0x805
Starting client: centos6-4.localnet: -o user_xattr,flock centos6-4.localnet@tcp:/lustre /mnt/lustre2
fail_loc=0
Filesystem 1K-blocks Used Available Use% Mounted on
192.168.10.214@tcp:/lustre
374928 52384 302004 15% /mnt/lustre
Failing mds1 on node centos6-4.localnet
Stopping /mnt/mds1 (opts:) on centos6-4.localnet
affected facets: mds1
Failover mds1 to centos6-4.localnet
23:16:24 (1350098184) waiting for centos6-4.localnet network 900 secs ...
23:16:24 (1350098184) network interface is UP
Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000
touch: setting times of `/mnt/lustre2/f.recovery-small.106': Input/output error
recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds
Trace dump:
= ./../tests/test-framework.sh:3844:error_noexit()
= ./../tests/test-framework.sh:3866:error()
= recovery-small.sh:1528:test_106()
= ./../tests/test-framework.sh:4108:run_one()
= ./../tests/test-framework.sh:4138:run_one_logged()
= ./../tests/test-framework.sh:4009:run_test()
= recovery-small.sh:1537:main()
Dumping lctl log to /tmp/test_logs/1350095359/recovery-small.test_106.*.1350098192.log
Dumping logs only on local client.
192.168.10.214@tcp:/lustre /mnt/lustre2 lustre rw,flock,user_xattr 0 0
Stopping client centos6-4.localnet /mnt/lustre2 (opts:)
FAIL 106 (32s)
|
| Comment by Oleg Drokin [ 12/Oct/12 ] |
|
Just did a bit of research, it's happening, but not super freequently: [root@intelbox ~]# grep 'lightweight client not evicted by mds' /var/log/remotemessages Oct 9 03:43:14 centos6-0 kernel: [ 2489.388687] Lustre: DEBUG MARKER: recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds Oct 9 13:27:35 centos6-1 kernel: [ 2898.929903] Lustre: DEBUG MARKER: recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds Oct 11 13:33:30 centos6-0 kernel: [ 6723.295214] Lustre: DEBUG MARKER: recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds Oct 12 23:16:32 centos6-4 kernel: [ 3261.398403] Lustre: DEBUG MARKER: recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds |
| Comment by Johann Lombardi (Inactive) [ 15/Oct/12 ] |
|
Oleg, could you please attach dmesg output when the issue happened? |
| Comment by Li Wei (Inactive) [ 15/Oct/12 ] |
|
I saw this with today's master plus my == recovery-small test 106: lightweight connection support == 16:41:54 (1350290514) fail_loc=0x805 Starting client: h221f: -o user_xattr,flock h221f@tcp:/lustre /mnt/lustre2 Failed to load ZFS module stack. Load the module manually by running 'insmod <location>/zfs.ko' as root. fail_loc=0 Filesystem 1K-blocks Used Available Use% Mounted on h221f@tcp:/lustre 374928 51792 303136 15% /mnt/lustre Failing mds1 on node h221f Stopping /mnt/mds1 (opts:) on h221f affected facets: mds1 Failover mds1 to h221f 16:42:05 (1350290525) waiting for h221f network 900 secs ... 16:42:05 (1350290525) network interface is UP Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1 Failed to load ZFS module stack. Load the module manually by running 'insmod <location>/zfs.ko' as root. Started lustre-MDT0000 touch: setting times of `/mnt/lustre2/f.recovery-small.106': Input/output error recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds Trace dump: = ./../tests/test-framework.sh:3869:error_noexit() = ./../tests/test-framework.sh:3891:error() = recovery-small.sh:1528:test_106() = ./../tests/test-framework.sh:4133:run_one() = ./../tests/test-framework.sh:4163:run_one_logged() = ./../tests/test-framework.sh:4034:run_test() = recovery-small.sh:1537:main() Dumping lctl log to /tmp/test_logs/1350288504/recovery-small.test_106.*.1350290525.log Dumping logs only on local client. 192.168.56.4@tcp:/lustre /mnt/lustre2 lustre rw,flock,user_xattr 0 0 Stopping client h221f /mnt/lustre2 (opts:) sh: lsof: command not found FAIL 106 (12s) The logs will be attached in a minute. |
| Comment by Johann Lombardi (Inactive) [ 15/Oct/12 ] |
|
hm, strange, the message is in the lustre log: 00000100:02020000:0.0:1350290525.905481:0:6240:0:(import.c:1325:ptlrpc_import_recovery_state_machine()) 167-0: lustre-MDT0000-mdc-ffff880006762400: This client was evicted by lustre-MDT0000; in progress operations using this service will fail. but not in dmesg output ... |
| Comment by Johann Lombardi (Inactive) [ 17/Oct/12 ] |
|
I can now reproduce. The issue comes from the console rate limit code which skips the messages expected by the test. |
| Comment by Johann Lombardi (Inactive) [ 17/Oct/12 ] |
| Comment by Johann Lombardi (Inactive) [ 18/Oct/12 ] |
|
Patch landed. Close the bug again. |