[LU-1999] recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds Created: 20/Sep/12  Updated: 18/Apr/13  Resolved: 18/Oct/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Johann Lombardi (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File lu-1999-logs.tar.bz2    
Severity: 3
Rank (Obsolete): 4071

 Description   

This issue was created by maloo for Oleg Drokin <green@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/9d79385c-038b-11e2-83cf-52540035b04c.

The sub-test test_106 failed with the following error:

CMD: client-24vm3 lctl get_param -n *.lustre-MDT0000.num_exports | cut -d' ' -f2
Update not seen after 60s: wanted '3' got '4'
recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds

Info required for matching: recovery-small 106



 Comments   
Comment by Oleg Drokin [ 20/Sep/12 ]

Johann, this failure comes from a test added by you yesterday

Comment by Johann Lombardi (Inactive) [ 21/Sep/12 ]

It looks like a test issue since the client got evicted as expected:

LustreError: 167-0: lustre-MDT0000-mdc-ffff880078abb000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
Lustre: Evicted from MGS (at 10.10.4.118@tcp) after server handle changed from 0x1345e746bb7e9980 to 0x1345e746bb7e9edc
Lustre: MGC10.10.4.118@tcp: Reactivating import
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds 
Lustre: DEBUG MARKER: recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mdd

The issue is the the client reconnected before the check in the test was made.

Comment by Johann Lombardi (Inactive) [ 21/Sep/12 ]

I have disabled the test on master until i come up with a test script fix to avoid those failures.
http://review.whamcloud.com/4066

Comment by Johann Lombardi (Inactive) [ 21/Sep/12 ]

Tentative patch:
http://review.whamcloud.com/4069

Comment by Johann Lombardi (Inactive) [ 08/Oct/12 ]

Patch landed. No new reoccurrence since then.

Comment by Oleg Drokin [ 12/Oct/12 ]

It seems that I am still hitting this in my local testing (ie SLOW=yes REFORMAT=yes sh recovery-small.sh ), though I cannot speak of the frequence.
This is with today's master checkout.

== recovery-small test 106: lightweight connection support == 23:16:05 (1350098165)
fail_loc=0x805
Starting client: centos6-4.localnet: -o user_xattr,flock centos6-4.localnet@tcp:/lustre /mnt/lustre2
fail_loc=0
Filesystem           1K-blocks      Used Available Use% Mounted on
192.168.10.214@tcp:/lustre
                        374928     52384    302004  15% /mnt/lustre
Failing mds1 on node centos6-4.localnet
Stopping /mnt/mds1 (opts:) on centos6-4.localnet
affected facets: mds1
Failover mds1 to centos6-4.localnet
23:16:24 (1350098184) waiting for centos6-4.localnet network 900 secs ...
23:16:24 (1350098184) network interface is UP
Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000
touch: setting times of `/mnt/lustre2/f.recovery-small.106': Input/output error
 recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds 
  Trace dump:
  = ./../tests/test-framework.sh:3844:error_noexit()
  = ./../tests/test-framework.sh:3866:error()
  = recovery-small.sh:1528:test_106()
  = ./../tests/test-framework.sh:4108:run_one()
  = ./../tests/test-framework.sh:4138:run_one_logged()
  = ./../tests/test-framework.sh:4009:run_test()
  = recovery-small.sh:1537:main()
Dumping lctl log to /tmp/test_logs/1350095359/recovery-small.test_106.*.1350098192.log
Dumping logs only on local client.
192.168.10.214@tcp:/lustre /mnt/lustre2 lustre rw,flock,user_xattr 0 0
Stopping client centos6-4.localnet /mnt/lustre2 (opts:)
FAIL 106 (32s)
Comment by Oleg Drokin [ 12/Oct/12 ]

Just did a bit of research, it's happening, but not super freequently:

[root@intelbox ~]# grep 'lightweight client not evicted by mds' /var/log/remotemessages 
Oct  9 03:43:14 centos6-0 kernel: [ 2489.388687] Lustre: DEBUG MARKER: recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds
Oct  9 13:27:35 centos6-1 kernel: [ 2898.929903] Lustre: DEBUG MARKER: recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds
Oct 11 13:33:30 centos6-0 kernel: [ 6723.295214] Lustre: DEBUG MARKER: recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds
Oct 12 23:16:32 centos6-4 kernel: [ 3261.398403] Lustre: DEBUG MARKER: recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds
Comment by Johann Lombardi (Inactive) [ 15/Oct/12 ]

Oleg, could you please attach dmesg output when the issue happened?

Comment by Li Wei (Inactive) [ 15/Oct/12 ]

I saw this with today's master plus my LU-1717 patch locally on a single-node setup:

== recovery-small test 106: lightweight connection support == 16:41:54 (1350290514)
fail_loc=0x805
Starting client: h221f: -o user_xattr,flock h221f@tcp:/lustre /mnt/lustre2
Failed to load ZFS module stack.
Load the module manually by running 'insmod <location>/zfs.ko' as root.
fail_loc=0
Filesystem           1K-blocks      Used Available Use% Mounted on
h221f@tcp:/lustre       374928     51792    303136  15% /mnt/lustre
Failing mds1 on node h221f
Stopping /mnt/mds1 (opts:) on h221f
affected facets: mds1
Failover mds1 to h221f
16:42:05 (1350290525) waiting for h221f network 900 secs ...
16:42:05 (1350290525) network interface is UP
Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
Failed to load ZFS module stack.
Load the module manually by running 'insmod <location>/zfs.ko' as root.
Started lustre-MDT0000
touch: setting times of `/mnt/lustre2/f.recovery-small.106': Input/output error
 recovery-small test_106: @@@@@@ FAIL: lightweight client not evicted by mds 
  Trace dump:
  = ./../tests/test-framework.sh:3869:error_noexit()
  = ./../tests/test-framework.sh:3891:error()
  = recovery-small.sh:1528:test_106()
  = ./../tests/test-framework.sh:4133:run_one()
  = ./../tests/test-framework.sh:4163:run_one_logged()
  = ./../tests/test-framework.sh:4034:run_test()
  = recovery-small.sh:1537:main()
Dumping lctl log to /tmp/test_logs/1350288504/recovery-small.test_106.*.1350290525.log
Dumping logs only on local client.
192.168.56.4@tcp:/lustre /mnt/lustre2 lustre rw,flock,user_xattr 0 0
Stopping client h221f /mnt/lustre2 (opts:)
sh: lsof: command not found
FAIL 106 (12s)

The logs will be attached in a minute.

Comment by Johann Lombardi (Inactive) [ 15/Oct/12 ]

hm, strange, the message is in the lustre log:

00000100:02020000:0.0:1350290525.905481:0:6240:0:(import.c:1325:ptlrpc_import_recovery_state_machine()) 167-0: lustre-MDT0000-mdc-ffff880006762400: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.

but not in dmesg output ...

Comment by Johann Lombardi (Inactive) [ 17/Oct/12 ]

I can now reproduce. The issue comes from the console rate limit code which skips the messages expected by the test.

Comment by Johann Lombardi (Inactive) [ 17/Oct/12 ]

http://review.whamcloud.com/#change,4288

Comment by Johann Lombardi (Inactive) [ 18/Oct/12 ]

Patch landed. Close the bug again.

Generated at Sat Feb 10 01:21:30 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.