[LU-3007] Umount of OST caused an oops in ptlrpc_free_rqbd Created: 21/Mar/13 Updated: 17/Jul/13 Resolved: 17/Jul/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: | OSS server running Lustre 2.3.61 |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 7320 |
| Description |
|
After finishing a test run on the 2.4 file system, when we went to umount the entire file system, one of the OSSes experienced an oops during the unmount.

Mar 9 20:11:12 widow-oss11c2 kernel: [112998.231220] Lustre: server umount routed1-OST00b1 complete |
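For readers without the stack trace (which did not survive in this export): ptlrpc_free_rqbd() tears down a request buffer descriptor when a PtlRPC service shuts down. A rough, simplified sketch of its shape in the 2.x tree — approximate, and field names may differ between versions — showing why it ends up in the vmalloc teardown path discussed in the comments below:

/* Simplified sketch of ptlrpc_free_rqbd() from the Lustre 2.x tree
 * (approximate; exact fields vary by version). The interesting part
 * is the final OBD_FREE_LARGE(): for buffers too big for kmalloc it
 * ends up in vfree(), and hence in remove_vm_area(). */
static void ptlrpc_free_rqbd(struct ptlrpc_request_buffer_desc *rqbd)
{
	struct ptlrpc_service_part *svcpt = rqbd->rqbd_svcpt;

	spin_lock(&svcpt->scp_lock);
	list_del(&rqbd->rqbd_list);	/* drop it from the service's rqbd list */
	svcpt->scp_nrqbds_total--;
	spin_unlock(&svcpt->scp_lock);

	/* request buffers are large (srv_buf_size), so this is a vfree() */
	OBD_FREE_LARGE(rqbd->rqbd_buffer, svcpt->scp_service->srv_buf_size);
	OBD_FREE_PTR(rqbd);
}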
| Comments |
| Comment by Bruno Faccini (Inactive) [ 21/Mar/13 ] |
|
James, |
| Comment by Oleg Drokin [ 21/Mar/13 ] |
|
So, judging from the further messages, it seems the unmount actually succeeded. |
| Comment by James A Simmons [ 22/Mar/13 ] |
|
Bruno, the soft-lockup only happened once, on a single OSS, when unmounting this particular OST. The total number of OSTs was 376 if I remember correctly. The others were fine. |
| Comment by Bruno Faccini (Inactive) [ 22/Mar/13 ] |
|
Humm, having a look at some assembly of the remove_vm_area() routine, it seems that the soft-lockup occurred during the for loop that finds the matching vm_area:

struct vm_struct *remove_vm_area(const void *addr)
{
	struct vmap_area *va;

	va = find_vmap_area((unsigned long)addr);
	if (va && va->flags & VM_VM_AREA) {
		struct vm_struct *vm = va->private;

		if (!(vm->flags & VM_UNLIST)) {
			struct vm_struct *tmp, **p;
			/*
			 * remove from list and disallow access to
			 * this vm_struct before unmap. (address range
			 * confliction is maintained by vmap.)
			 */
			write_lock(&vmlist_lock);
			for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next)   <<<<<<<<<<<<
				;
			*p = tmp->next;
			write_unlock(&vmlist_lock);
		}

		vmap_debug_free_range(va->va_start, va->va_end);
		free_unmap_vmap_area(va);
		vm->size -= PAGE_SIZE;

		return vm;
	}
	return NULL;
}
Thus it is kind of strange that this thread was the only one "impacted"!... |
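To make the suspected cost concrete: the marked loop walks the global vmlist linearly, under vmlist_lock, to find the predecessor of the vm_struct being unlinked, so each removal is O(length of vmlist). A minimal stand-alone sketch of that unlink idiom (illustrative only, not kernel code):

/* Minimal stand-alone sketch of the marked unlink idiom: walk the
 * list to find the pointer that references the target, then splice
 * the target out. With n live areas this is O(n) per removal; done
 * inside write_lock(&vmlist_lock), a long enough walk can keep the
 * CPU busy past the soft-lockup watchdog threshold. */
struct node {
	struct node *next;
};

static struct node *list_head;

static void unlink_node(struct node *vm)
{
	struct node *tmp, **p;

	for (p = &list_head; (tmp = *p) != vm; p = &tmp->next)
		;	/* empty body: the walk is the work */
	*p = tmp->next;
}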
| Comment by James A Simmons [ 28/Mar/13 ] |
|
I found out from our admin that it was more than one OST unmount that had a problem. The log you saw was for one that completed. The rest never completed, and after 30 minutes the admin rebooted the machine instead. |
| Comment by Oleg Drokin [ 28/Mar/13 ] |
|
Can we get logs from one of the nodes that did not complete unmounting, please? |
| Comment by James A Simmons [ 28/Mar/13 ] |
|
Here is the entire log of the day for this OSS that crashed when unmounting. |
| Comment by Bruno Faccini (Inactive) [ 05/Apr/13 ] |
|
Thanks for the log, but next time this kind of hung situation requires a hard reboot/reset, would it be possible to take a crash dump instead? |
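For reference, with kdump/kexec configured, a hung node can be made to produce a vmcore by triggering a crash through magic SysRq ("echo c > /proc/sysrq-trigger" as root). A tiny C equivalent of that command, assuming kdump is set up on the node:

/* Trigger a kernel crash dump via magic SysRq, equivalent to
 * "echo c > /proc/sysrq-trigger" as root. Assumes kdump/kexec is
 * configured; otherwise this simply panics the node. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sysrq-trigger", "w");

	if (f == NULL) {
		perror("fopen /proc/sysrq-trigger");
		return 1;
	}
	fputc('c', f);	/* 'c' = crash: panic now, kdump captures the vmcore */
	fclose(f);
	return 0;
}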
| Comment by Bruno Faccini (Inactive) [ 15/Apr/13 ] |
|
James, |
| Comment by James A Simmons [ 23/Apr/13 ] |
|
We have changed our policy now for test shots. During the last test shot we did not encounter this problem. If we don't encounter it at the next test shot, I'd say we can close the ticket. |
| Comment by Bruno Faccini (Inactive) [ 17/Jul/13 ] |
|
Hello James, |
| Comment by James A Simmons [ 17/Jul/13 ] |
|
Yes, you can close this ticket. If we encounter it again we can reopen it, but we haven't seen this problem lately. |
| Comment by Bruno Faccini (Inactive) [ 17/Jul/13 ] |
|
Ok, thanks James. |