[LU-180] Sometimes evicted clients never reconnect Created: 30/Mar/11 Updated: 19/Apr/11 Resolved: 19/Apr/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 8539 |
| Description |
|
Hi,

On our benchmarking cluster we observe very strange Lustre behavior: sometimes clients that are evicted by an OST never reconnect. As a consequence we end up with many defunct processes on the affected compute nodes, and we have to reboot them. For instance, on a compute node we can see that one OST connection is missing:

[root@kay297 ~]# lfs df
filesystem summary: 215354196400 16076170152 199823591804 7% /scratch_lustre

The phenomenon is erratic; it does not always affect the same clients or the same OSTs. I am attaching the syslogs of kay297 (client) and kay3 (OSS).
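For what it's worth, a quick way to spot a stuck connection like this is to scan the import state of every OSC on the client. A minimal sketch, assuming the standard /proc layout of Lustre 2.0 (the filesystem name and exact field names are site- and version-specific):

```
# Minimal sketch: list every OSC import on the client with its target
# and connection state. A healthy import reports "state: FULL"; the
# missing OST shows up as an import stuck in another state (or absent).
for imp in /proc/fs/lustre/osc/*/import; do
    echo "== $imp"
    grep -E 'target|state' "$imp"
done
``` |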
| Comments |
| Comment by Sebastien Buisson (Inactive) [ 31/Mar/11 ] |
|
Hi,

After less than 24 hours we have a new occurrence of this issue. The client node is kay310 (10.17.1.211@o2ib) and the OST is scratch-OST000b, hosted by OSS kay2.

[root@kay310 ~]# lctl dl
[root@kay310 ~]# cat /proc/fs/lustre/osc/scratch-OST000b-osc-ffff8803313ff000/import
[buissons@kay2 ~]$ lctl dl
[buissons@kay2 ~]$ cat /proc/fs/lustre/obdfilter/scratch-OST000b/exports/10.17.1.211@o2ib/uuid

This is very annoying: I tried to reactivate the import on the client, but without success.

[root@kay310 ~]# lctl --device 21 activate

The only option we have is to reboot the impacted Lustre clients, which is not acceptable on a benchmarking cluster.

Sebastien.
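For reference, the device index passed to lctl activate comes from the first column of lctl dl output. A minimal sketch of the sequence tried above (the device number 21 and the OST name are taken from this occurrence; the grep pattern assumes the usual layout of the import file):

```
# Minimal sketch of the reactivation attempt.
lctl dl | grep scratch-OST000b-osc      # find the OSC device index (21 here)
lctl --device 21 activate               # try to re-enable the import
# Verify the result; in this bug the import never returns to FULL.
grep state /proc/fs/lustre/osc/scratch-OST000b-osc-*/import
``` |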
| Comment by Peter Jones [ 31/Mar/11 ] |
|
Niu, could you look into this one please? Thanks, Peter |
| Comment by Sebastien Buisson (Inactive) [ 31/Mar/11 ] |
|
Hi,

This problem seems similar to Bugzilla 21636. The initial description in that ticket is not quite the same, but if you look at comment 82 you will see errors exactly matching what we are suffering from on our benchmarking cluster.

HTH |
| Comment by Niu Yawei (Inactive) [ 01/Apr/11 ] |
|
It looks like the invalidate thread never finished the import invalidation job; I suspect a deadlock somewhere.

Hi Sebastien, could you also collect the stack traces of all threads on the abnormal client when this issue happens (echo t > /proc/sysrq-trigger)? I want to see where the invalidate thread is blocked.
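For completeness, a minimal sketch of collecting such a dump (sysrq may need to be enabled first; the output file name is illustrative):

```
echo 1 > /proc/sys/kernel/sysrq     # make sure sysrq is enabled
echo t > /proc/sysrq-trigger        # dump every task's stack to the kernel log
dmesg > /tmp/sysrq-stacks.log       # save the dump before the ring buffer wraps
``` |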
| Comment by Niu Yawei (Inactive) [ 18/Apr/11 ] |
|
This bug is probably caused by a deadlock in clio, which can lead to the client being evicted; because of the deadlock, the invalidation thread (ll_imp_inval) is then blocked on a semaphore, so the import can never be reactivated. The deadlock issue is

Hi Sebastien, when you hit the issue next time, could you please check the stack trace to see whether it is similar to the trace shown in the attached file? Thanks.
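When comparing a new occurrence against the attached trace, something like the following pulls out the relevant thread (a sketch; the log file name is a placeholder, and exact frame names depend on the kernel and Lustre build):

```
# Look for the ll_imp_inval thread sleeping in a semaphore/mutex wait path.
grep -B 2 -A 25 'll_imp_inval' /tmp/sysrq-stacks.log
``` |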
| Comment by Gregoire Pichon [ 19/Apr/11 ] |
|
There is a new occurrence of the bug. Attached is the trace log of the client
Tell me if you need any other information from the client node. |
| Comment by Gregoire Pichon [ 19/Apr/11 ] |
|
The stack trace of the new occurrence looks similar to the one in deadlock-trace.log. We are going to install the fix described by |
| Comment by Niu Yawei (Inactive) [ 19/Apr/11 ] |
|
Hi Gregoire, yes, I checked the log and I believe it's the deadlock problem in |