LU-1274

Client threads block for some time before being evicted and can never reconnect afterward


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.3.0, Lustre 2.1.2
    • Affects Version/s: Lustre 2.1.0
    • None
    • Environment: RHEL 6.1 clients with Bull's Advanced Edition suite.
      Client version: lustre-modules-2.1.0-2.6.32_131.0.15.el6.x86_64_Bull.2.204.el6
      Server version: lustre-modules-2.1.0-2.6.32_131.17.1.bl6.Bull.27.0.x86_64_Bull.2.207.bl6
    • 3
    • 4602

    Description

      We have a new customer hitting something very similar to LU-816.
      The difference is that they are running Lustre 2.1.

      The sequence of the problem is as follows (times are taken from the logs):
      1/ The client reports a process blocked for more than 120 seconds, with the trace below (no kernel or Lustre messages in the preceding 40 minutes):
      (the hang is reported at 13:43:27 after 2 minutes, so the thread has been blocked since 13:41:27)

      INFO: task ls:12058 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      ls D 0000000000000008 0 12058 11471 0x00000084
      ffff8805ee2dfa58 0000000000000086 ffff8805ee2dfa68 ffff880bbc28e600
      ffff8805ee2dfd08 ffff880616207800 ffff8805ee2dfad8 ffff880620230a38
      ffff880604529038 ffff8805ee2dffd8 000000000000f598 ffff880604529038
      Call Trace:
      [<ffffffff814dc84e>] __mutex_lock_slowpath+0x13e/0x180
      [<ffffffff814dc6eb>] mutex_lock+0x2b/0x50
      [<ffffffffa068c7b8>] cl_lock_mutex_get+0x78/0xd0 [obdclass]
      [<ffffffffa068eb4a>] cl_lock_hold_mutex+0xca/0x710 [obdclass]
      [<ffffffffa09dff73>] ? lov_io_init_raid0+0x4b3/0x920 [lov]
      [<ffffffffa069060e>] cl_lock_request+0x5e/0x1f0 [obdclass]
      [<ffffffffa09d6bf9>] ? lov_io_init+0x99/0x120 [lov]
      [<ffffffffa0a73ab0>] cl_glimpse_lock+0x180/0x390 [lustre]
      [<ffffffffa0a74274>] cl_glimpse_size+0x184/0x190 [lustre]
      [<ffffffffa0a2a28f>] ll_inode_revalidate_it+0x6f/0x1f0 [lustre]
      [<ffffffff81181804>] ? do_path_lookup+0x94/0xa0
      [<ffffffff8117efc5>] ? putname+0x45/0x50
      [<ffffffffa0a2a459>] ll_getattr_it+0x49/0x170 [lustre]
      [<ffffffffa0a2a5b7>] ll_getattr+0x37/0x40 [lustre]
      [<ffffffff81177c11>] vfs_getattr+0x51/0x80
      [<ffffffffa0a1572e>] ? ll_ddelete+0x5e/0x300 [lustre]
      [<ffffffff81177ca0>] vfs_fstatat+0x60/0x80
      [<ffffffff81177d2e>] vfs_lstat+0x1e/0x20
      [<ffffffff81177d54>] sys_newlstat+0x24/0x50
      [<ffffffff810d1b62>] ? audit_syscall_entry+0x272/0x2a0
      [<ffffffff8100c635>] ? math_state_restore+0x45/0x60
      [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
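
      As an aside, the 120-second threshold in the message above comes from the generic kernel hung-task watchdog rather than from Lustre. A minimal sketch of the standard knob involved (generic kernel interface; nothing here is taken from this system):

      # Kernel hung-task watchdog (generic, not Lustre-specific).
      # Show the current report threshold, 120 s by default on these kernels:
      cat /proc/sys/kernel/hung_task_timeout_secs
      # Writing 0 only silences the reports, as the message itself notes;
      # it does not unblock the stuck task:
      echo 0 > /proc/sys/kernel/hung_task_timeout_secs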

      2/ A thread on the server side is blocked as well and is reported 200 seconds later with the following trace:
      (reported as inactive for 200 s at 13:45:42, so blocked since 13:42:22)

      Lustre: Service thread pid 24618 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 24618, comm: ll_ost_425

      Call Trace:
      [<ffffffff810657ec>] ? lock_timer_base+0x3c/0x70
      [<ffffffff8147eab0>] schedule_timeout+0x190/0x2d0
      [<ffffffff81065900>] ? process_timeout+0x0/0x10
      [<ffffffffa0595731>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [<ffffffffa0747453>] ptlrpc_set_wait+0x323/0x730 [ptlrpc]
      [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
      [<ffffffffa0750e84>] ? lustre_msg_set_status+0x94/0x110 [ptlrpc]
      [<ffffffffa0747916>] ptlrpc_queue_wait+0xb6/0x290 [ptlrpc]
      [<ffffffffa02a85ec>] ? lprocfs_counter_add+0x12c/0x170 [lvfs]
      [<ffffffffa072cf3e>] ldlm_server_glimpse_ast+0x10e/0x3d0 [ptlrpc]
      [<ffffffffa081130d>] filter_intent_policy+0x35d/0x710 [obdfilter]
      [<ffffffffa05a4805>] ? cfs_hash_bd_lookup_intent+0xe5/0x130 [libcfs]
      [<ffffffffa070bb8a>] ldlm_lock_enqueue+0x2da/0xa50 [ptlrpc]
      [<ffffffffa072a485>] ? ldlm_export_lock_get+0x15/0x20 [ptlrpc]
      [<ffffffffa05a3872>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      [<ffffffffa0732407>] ldlm_handle_enqueue0+0x447/0x1090 [ptlrpc]
      [<ffffffffa0754240>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      [<ffffffffa07330b6>] ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
      [<ffffffffa07330c0>] ? ldlm_server_completion_ast+0x0/0x680 [ptlrpc]
      [<ffffffffa06dc870>] ? ost_blocking_ast+0x0/0xe40 [ost]
      [<ffffffffa072ce30>] ? ldlm_server_glimpse_ast+0x0/0x3d0 [ptlrpc]
      [<ffffffffa06e4c6a>] ost_handle+0x22ba/0x4b90 [ost]
      [<ffffffff8104c792>] ? default_wake_function+0x12/0x20
      [<ffffffff8103b9b9>] ? __wake_up_common+0x59/0x90
      [<ffffffffa0750444>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa0761459>] ptlrpc_main+0xc79/0x19d0 [ptlrpc]
      [<ffffffff810017ac>] ? __switch_to+0x1ac/0x320
      [<ffffffffa07607e0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
      [<ffffffff810041aa>] child_rip+0xa/0x20
      [<ffffffffa07607e0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
      [<ffffffff810041a0>] ? child_rip+0x0/0x20
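
      For reference, when an OSS service-thread watchdog like the one above fires, the Lustre kernel debug buffer can also be dumped for offline analysis. A minimal sketch, assuming the standard lctl debug interface on 2.1 (not something reported as done in this ticket):

      # Hypothetical collection step on the OSS, not taken from the ticket.
      lctl set_param debug=+dlmtrace   # widen the debug mask to include DLM tracing
      lctl dk /tmp/oss-debug.log       # dump and clear the kernel debug buffer to a file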

      3/ The client reports another thread, this one handling the server's glimpse callback, blocked since 13:43:27 (reported at 13:45:27):

      kernel INFO: task ldlm_cb_05:9780 blocked for more than 120 seconds.
      kernel "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel ldlm_cb_05 D 0000000000000006 0 9780 2 0x00000080
      ffff880604557b90 0000000000000046 ffff880604557b58 ffff880604557b54
      ffff880604557b50 ffff88063fc24f00 ffff880655435f80 0000000100610371
      ffff88060ac5e6b8 ffff880604557fd8 000000000000f598 ffff88060ac5e6b8
      Call Trace:
      [<ffffffff814dc84e>] __mutex_lock_slowpath+0x13e/0x180
      [<ffffffff814dc6eb>] mutex_lock+0x2b/0x50
      [<ffffffffa068c7b8>] cl_lock_mutex_get+0x78/0xd0 [obdclass]
      [<ffffffffa097d91e>] osc_ldlm_glimpse_ast+0x7e/0x160 [osc]
      [<ffffffffa076e304>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa074d6bb>] ldlm_callback_handler+0x142b/0x21c0 [ptlrpc]
      [<ffffffffa07709ac>] ? lustre_msg_get_transno+0x7c/0xe0 [ptlrpc]
      [<ffffffffa076e304>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa077ef59>] ptlrpc_main+0xbb9/0x1990 [ptlrpc]
      [<ffffffffa077e3a0>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
      [<ffffffff8100c1ca>] child_rip+0xa/0x20
      [<ffffffffa077e3a0>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
      [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

      4/ The two threads on the client stay blocked (and are reported every 2 minutes) until the server finally evicts the client a first time, at 13:49:38:

      LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 155s: evicting client at 172.17.1.230@o2ib ns: filter-scratch-OST001c_UUID lock: ffff8800774e0d80/0xaa844a441355c6d7 lrc: 3/0,0 mode: PW/PW res: 3545740/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0x49baa5ebdb4811c0 expref: 15 pid: 5451 timeout 4476555308
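
      The 155 s lock callback timer in the message above is typically derived from the server's obd_timeout and adaptive-timeout settings. A hedged sketch of how those tunables can be read, assuming the standard parameter names on this version (any values shown would not be from this ticket):

      # Read the timeout tunables feeding the lock callback timer (assumed standard names).
      lctl get_param timeout          # base obd_timeout
      lctl get_param at_min at_max    # adaptive timeout bounds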

      5/ The client-side "ls" thread unblocks (no further watchdog reports) with the following messages:

      1333111826 2012 Mar 30 13:50:26 willowb11 kern err kernel LustreError: 11982:0:(cl_io.c:1700:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 14 remaining pages
      1333111826 2012 Mar 30 13:50:26 willowb11 kern err kernel LustreError: 11982:0:(cl_io.c:965:cl_io_cancel()) Canceling ongoing page trasmission

      But the ldlm_cb_02 thread is still blocked at that point.

      6/ The server evicts the client a second time at 13:53:34:

      Lustre: scratch-OST001c: haven't heard from client e33a9e90-14c3-9556-290a-e2f09abad36a (at 172.17.1.230@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8805ebfa1400, cur 1333112014 expire 1333111864 last 1333111787

      7/ At that point no threads are blocked any more, but the client cannot reconnect to the OST and keeps getting these messages:

      LustreError: 167-0: This client was evicted by scratch-OST0023; in progress operations using this service will fail.
      LustreError: 2829:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805f7714000 x1397861693879877/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2830:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805e483d400 x1397861693879881/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2830:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805e4beec00 x1397861693879901/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2830:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 1 previous similar message
      LustreError: 11-0: an error occurred while communicating with 172.17.0.7@o2ib. The ldlm_enqueue operation failed with -107
      Lustre: scratch-OST001c-osc-ffff880620978000: Connection to service scratch-OST001c via nid 172.17.0.7@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      LustreError: 167-0: This client was evicted by scratch-OST001c; in progress operations using this service will fail.
      LustreError: 2831:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805e5af9800 x1397861693879966/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2831:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
      Lustre: 9014:0:(cl_lock.c:2025:cl_lock_page_out()) Writing 1 pages error: -108
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805da30f400 x1397861693880082/t0(0) o-1->scratch-OST001c_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 17 previous similar messages
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805d9f09400 x1397861693880324/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 1 previous similar message
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805ec787c00 x1397861693881645/t0(0) o-1->scratch-OST0023_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 14 previous similar messages
      LustreError: 2833:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8806228ff000 x1397861693882624/t0(0) o-1->scratch-OST001c_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2833:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 13 previous similar messages
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8805e4a01c00 x1397861693883727/t0(0) o-1->scratch-OST001c_UUID@172.17.0.7@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 2825:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 14 previous similar messages
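
      For completeness, a sketch of how the stuck import state described above could be inspected from userspace, assuming the standard Lustre 2.x client interfaces; none of this is reported as tried in the ticket, and in this case only a reboot helped (see below):

      # Hypothetical inspection steps, assuming standard lctl/lfs client interfaces.
      lctl dl                       # list local obd devices, including the OSC imports
      lctl get_param osc.*.import   # show per-target import/connection state (e.g. EVICTED)
      lfs check servers             # report which server connections are healthy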

      I attach the full logs of the client and of the server hosting the affected OST, plus a stack trace of all threads on the client once it is in the final state (unable to reconnect).

      The only workaround found is to reboot the client. This is critical, as the problem can happen several times an hour and occurs on the cluster's login nodes, impacting many users at once.
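
      The per-thread stack dump in the attachments below is the kind of data the generic sysrq facility produces; a minimal sketch of how such a dump is captured (standard kernel mechanism, file name hypothetical):

      # Capture a stack trace of every task via sysrq (generic kernel facility).
      echo 1 > /proc/sys/kernel/sysrq      # ensure sysrq is enabled
      echo t > /proc/sysrq-trigger         # dump all task stacks to the kernel log
      dmesg > /tmp/all-thread-stacks.log   # save the dump (hypothetical path)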

      Attachments

        1. syslog-CLIENT_willowb11-20120330
          34 kB
          Sebastien Piechurski
        2. syslog-CLIENT_willowb11-20120330-sysrq-trigger
          1.02 MB
          Sebastien Piechurski
        3. syslog-OSS_willowb6-20120330
          18 kB
          Sebastien Piechurski


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: spiechurski Sebastien Piechurski
              Votes: 0
              Watchers: 4
