[LU-9352] Lustre clients asserting (lcommon_cl.c:148:cl_file_inode_init()) ASSERTION( (((inode->i_mode) & 00170000) == 0100000) ) failed Created: 17/Apr/17 Updated: 28/Nov/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | John Salinas (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | LS_RZ | ||
| Issue Links: |
|
||||||||
| Severity: | 4 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We had 4 Lustre clients wolf-6-9 and the Lustre server nodes had crashed, taken a dump and come back up. When Lustre servers were back up and the file system was restored all the clients started asserting with: [22181.389692] Lustre: 6338:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 82 previous similar messages [22806.001932] Lustre: 6338:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1492387170/real 1492387170] req@ffff8809df2aa100 x1564860107186640/t0(0) o38->lsdraid-MDT0000-mdc-ffff880832a4b000@192.168.1.5@o2ib:12/10 lens 520/544 e 0 to 1 dl 1492387225 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 [22806.044008] Lustre: 6338:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 82 previous similar messages [23130.818757] LustreError: 167-0: lsdraid-OST0000-osc-ffff880832a4b000: This client was evicted by lsdraid-OST0000; in progress operations using this service will fail. [23131.304299] Lustre: 6376:0:(llite_lib.c:2619:ll_dirty_page_discard_warn()) lsdraid: dirty page discard: 192.168.1.5@o2ib:/lsdraid/fid: [0x200000401:0x18:0x0]// may get corrupted (rc -108) [23131.304639] Lustre: lsdraid-OST0000-osc-ffff880832a4b000: Connection restored to 192.168.1.3@o2ib (at 192.168.1.3@o2ib) [23131.315453] Lustre: 6353:0:(llite_lib.c:2619:ll_dirty_page_discard_warn()) lsdraid: dirty page discard: 192.168.1.5@o2ib:/lsdraid/fid: [0x200000401:0x16:0x0]// may get corrupted (rc -108) [23131.321505] Lustre: 6377:0:(llite_lib.c:2619:ll_dirty_page_discard_warn()) lsdraid: dirty page discard: 192.168.1.5@o2ib:/lsdraid/fid: [0x20000040c:0xb43b:0x0]// may get corrupted (rc -108) [23155.804921] Lustre: Evicted from MGS (at 192.168.1.5@o2ib) after server handle changed from 0x9b83d124804f915d to 0x3fceeb65e42b218b [23155.824502] LustreError: 167-0: lsdraid-MDT0000-mdc-ffff880832a4b000: This client was evicted by lsdraid-MDT0000; in progress operations using this service will fail. [23155.824529] Lustre: MGC192.168.1.5@o2ib: Connection restored to 192.168.1.5@o2ib (at 192.168.1.5@o2ib) [23155.824532] Lustre: Skipped 1 previous similar message [23155.869652] LustreError: Skipped 1 previous similar message [23155.879415] LustreError: 35603:0:(file.c:180:ll_close_inode_openhandle()) lsdraid-clilmv-ffff880832a4b000: inode [0x200000401:0x59:0x0] mdc close failed: rc = -5 [23155.879478] LustreError: 48874:0:(file.c:3403:ll_inode_revalidate_fini()) lsdraid: revalidate FID [0x200000007:0x1:0x0] error: rc = -5 [23155.879537] LustreError: 27229:0:(vvp_io.c:1429:vvp_io_init()) lsdraid: refresh file layout [0x20000040c:0xc035:0x0] error -5. [23155.880185] traps: fsx[6464] general protection ip:7f42c4af1ab4 sp:7ffd375bc060 error:0 in libc-2.17.so[7f42c4aa9000+1b7000] [23155.880204] traps: fsx[6463] general protection ip:7f45b5297ab4 sp:7ffc1eae97e0 error:0 in libc-2.17.so[7f45b524f000+1b7000] [23155.969495] LustreError: 35603:0:(file.c:180:ll_close_inode_openhandle()) Skipped 7 previous similar messages [23156.181966] LustreError: 48928:0:(lcommon_cl.c:148:cl_file_inode_init()) ASSERTION( (((inode->i_mode) & 00170000) == 0100000) ) failed: [23156.181969] LustreError: 48928:0:(lcommon_cl.c:148:cl_file_inode_init()) LBUG [23156.181970] Pid: 48928, comm: prove [23156.181970] Call Trace: [23156.181994] [<ffffffffa09507d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs] [23156.182001] [<ffffffffa0950d75>] lbug_with_loc+0x45/0xc0 [libcfs] [23156.182041] [<ffffffffa0dc28f8>] cl_file_inode_init+0x248/0x300 [lustre] [23156.182054] [<ffffffffa0d9ae65>] ll_update_inode+0x305/0x5d0 [lustre] [23156.182064] [<ffffffffa0b68c29>] ? lmv_get_lustre_md+0x149/0x2d0 [lmv] [23156.182077] [<ffffffffa0d9d08d>] ll_prep_inode+0x12d/0xc90 [lustre] [23156.182090] [<ffffffffa0da9753>] ? ll_i2gids+0x33/0xb0 [lustre] [23156.182110] [<ffffffff81101302>] ? from_kgid+0x12/0x20 [23156.182121] [<ffffffffa0d687c1>] ll_revalidate_it_finish+0x71/0x170 [lustre] [23156.182132] [<ffffffffa0d83b9b>] __ll_inode_revalidate+0x19b/0xc00 [lustre] [23156.182143] [<ffffffffa0d84a22>] ll_getattr+0x62/0x7c0 [lustre] [23156.182150] [<ffffffff811e3c66>] vfs_getattr+0x46/0x80 [23156.182153] [<ffffffff811e3d95>] vfs_fstatat+0x75/0xc0 [23156.182156] [<ffffffff811e42ee>] SYSC_newstat+0x2e/0x60 [23156.182163] [<ffffffff81642363>] ? do_page_fault+0x23/0x80 [23156.182165] [<ffffffff8163e648>] ? page_fault+0x28/0x30 [23156.182168] [<ffffffff811e45ce>] SyS_newstat+0xe/0x10 [23156.182174] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b [23156.182175] [23156.182316] Kernel panic - not syncing: LBUG [23156.182319] CPU: 38 PID: 48928 Comm: prove Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1 [23156.182320] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [23156.182326] ffffffffa096ddef 0000000056b71082 ffff88103be23ad0 ffffffff81636431 [23156.182331] ffff88103be23b50 ffffffff8162fcc0 ffffffff00000008 ffff88103be23b60 [23156.182336] ffff88103be23b00 0000000056b71082 ffffffffa0dd3780 0000000000000000 [23156.182336] Call Trace: [23156.182346] [<ffffffff81636431>] dump_stack+0x19/0x1b [23156.182349] [<ffffffff8162fcc0>] panic+0xd8/0x1e7 [23156.182357] [<ffffffffa0950ddb>] lbug_with_loc+0xab/0xc0 [libcfs] [23156.182373] [<ffffffffa0dc28f8>] cl_file_inode_init+0x248/0x300 [lustre] [23156.182385] [<ffffffffa0d9ae65>] ll_update_inode+0x305/0x5d0 [lustre] [23156.182391] [<ffffffffa0b68c29>] ? lmv_get_lustre_md+0x149/0x2d0 [lmv] [23156.182403] [<ffffffffa0d9d08d>] ll_prep_inode+0x12d/0xc90 [lustre] [23156.182416] [<ffffffffa0da9753>] ? ll_i2gids+0x33/0xb0 [lustre] [23156.182419] [<ffffffff81101302>] ? from_kgid+0x12/0x20 [23156.182429] [<ffffffffa0d687c1>] ll_revalidate_it_finish+0x71/0x170 [lustre] [23156.182440] [<ffffffffa0d83b9b>] __ll_inode_revalidate+0x19b/0xc00 [lustre] [23156.182451] [<ffffffffa0d84a22>] ll_getattr+0x62/0x7c0 [lustre] [23156.182455] [<ffffffff811e3c66>] vfs_getattr+0x46/0x80 [23156.182458] [<ffffffff811e3d95>] vfs_fstatat+0x75/0xc0 [23156.182461] [<ffffffff811e42ee>] SYSC_newstat+0x2e/0x60 [23156.182464] [<ffffffff81642363>] ? do_page_fault+0x23/0x80 [23156.182467] [<ffffffff8163e648>] ? page_fault+0x28/0x30 [23156.182470] [<ffffffff811e45ce>] SyS_newstat+0xe/0x10 [23156.182473] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b Before this we did see this on the various clients
[ 7621.463514] Lustre: 6338:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[ 7680.841904] INFO: task fsx:6463 blocked for more than 120 seconds.
[ 7680.850637] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7680.861238] fsx D ffff880830affeb0 0 6463 6459 0x00000080
[ 7680.870886] ffff880830affae0 0000000000000082 ffff8808501a7300 ffff880830afffd8
[ 7680.880858] ffff880830afffd8 ffff880830afffd8 ffff8808501a7300 ffff88103c4c5c10
[ 7680.890701] ffff88103c4c5c18 7fffffffffffffff ffff8808501a7300 ffff880830affeb0
[ 7680.900454] Call Trace:
[ 7680.904601] [<ffffffff8163bb39>] schedule+0x29/0x70
[ 7680.911469] [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
[ 7680.919304] [<ffffffffa0a51a53>] ? ptlrpc_set_add_new_req+0xe3/0x160 [ptlrpc]
[ 7680.928615] [<ffffffffa0e47e80>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[ 7680.937161] [<ffffffffa0a80c2c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
[ 7680.945840] [<ffffffffa0a63f30>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[ 7680.955264] [<ffffffff8163bf06>] wait_for_completion+0x116/0x170
[ 7680.963198] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
[ 7680.970537] [<ffffffffa0e48044>] osc_io_setattr_end+0xc4/0x180 [osc]
[ 7680.978788] [<ffffffffa08ee3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 7680.987737] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7680.995551] [<ffffffffa08ee48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 7681.003654] [<ffffffffa08ee9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
[ 7681.011803] [<ffffffffa08eea96>] lov_io_end+0x36/0xb0 [lov]
[ 7681.019088] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7681.026856] [<ffffffffa0c202e3>] cl_io_loop+0xb3/0x190 [obdclass]
[ 7681.034706] [<ffffffffa0dc2550>] cl_setattr_ost+0x240/0x3a0 [lustre]
[ 7681.042840] [<ffffffffa0d9c7f9>] ll_setattr_raw+0x12a9/0x1330 [lustre]
[ 7681.051174] [<ffffffffa0d9c8e3>] ll_setattr+0x63/0xc0 [lustre]
[ 7681.058714] [<ffffffff811fbe29>] notify_change+0x279/0x3d0
[ 7681.065875] [<ffffffff811dcf13>] do_truncate+0x73/0xc0
[ 7681.072636] [<ffffffff811e17a8>] ? __sb_start_write+0x58/0x110
[ 7681.080185] [<ffffffff811dd2bc>] do_sys_ftruncate.constprop.17+0x11c/0x180
[ 7681.088901] [<ffffffff811dd35e>] SyS_ftruncate+0xe/0x10
[ 7681.095774] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[ 7681.103428] INFO: task fsx:6464 blocked for more than 120 seconds.
[ 7681.111278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7681.120987] fsx D ffff88083b5efeb0 0 6464 6459 0x00000080
[ 7681.129878] ffff88083b5efae0 0000000000000086 ffff8808501a3980 ffff88083b5effd8
[ 7681.139209] ffff88083b5effd8 ffff88083b5effd8 ffff8808501a3980 ffff88103c5ba650
[ 7681.148535] ffff88103c5ba658 7fffffffffffffff ffff8808501a3980 ffff88083b5efeb0
[ 7681.157861] Call Trace:
[ 7681.161544] [<ffffffff8163bb39>] schedule+0x29/0x70
[ 7681.168024] [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
[ 7681.175467] [<ffffffffa0a51a53>] ? ptlrpc_set_add_new_req+0xe3/0x160 [ptlrpc]
[ 7681.184430] [<ffffffffa0e47e80>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[ 7681.192611] [<ffffffffa0a80c2c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
[ 7681.200963] [<ffffffffa0a63f30>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[ 7681.210050] [<ffffffff8163bf06>] wait_for_completion+0x116/0x170
[ 7681.217661] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
[ 7681.224683] [<ffffffffa0e48044>] osc_io_setattr_end+0xc4/0x180 [osc]
[ 7681.232649] [<ffffffffa08ee3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 7681.241303] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7681.248866] [<ffffffffa08ee48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 7681.256715] [<ffffffffa08ee9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
[ 7681.264660] [<ffffffffa08eea96>] lov_io_end+0x36/0xb0 [lov]
[ 7681.271743] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7681.279302] [<ffffffffa0c202e3>] cl_io_loop+0xb3/0x190 [obdclass]
[ 7681.286951] [<ffffffffa0dc2550>] cl_setattr_ost+0x240/0x3a0 [lustre]
[ 7681.294894] [<ffffffffa0d9c7f9>] ll_setattr_raw+0x12a9/0x1330 [lustre]
[ 7681.303025] [<ffffffffa0d9c8e3>] ll_setattr+0x63/0xc0 [lustre]
[ 7681.310366] [<ffffffff811fbe29>] notify_change+0x279/0x3d0
[ 7681.317316] [<ffffffff811dcf13>] do_truncate+0x73/0xc0
[ 7681.323870] [<ffffffff811e17a8>] ? __sb_start_write+0x58/0x110
[ 7681.331196] [<ffffffff811dd2bc>] do_sys_ftruncate.constprop.17+0x11c/0x180
[ 7681.339693] [<ffffffff811dd35e>] SyS_ftruncate+0xe/0x10
[ 7681.346340] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[ 7707.260211] Lustre: 6361:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492372056/real 1492372056] req@ffff88016636a400 x1564860107137792/t0(0) o103->lsdraid-MDT0000-mdc-ffff880832a4b000@192.168.1.5@o2ib:17/18 lens 416/224 e 0 to 1 dl 1492372063 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[ 7707.297453] Lustre: 6361:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[ 7707.309810] Lustre: lsdraid-MDT0000-mdc-ffff880832a4b000: Connection to lsdraid-MDT0000 (at 192.168.1.5@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 7739.314486] LustreError: 166-1: MGC192.168.1.5@o2ib: Connection to MGS (at 192.168.1.5@o2ib) was lost; in progress operations using this service will fail
[ 7801.288244] INFO: task fsx:6463 blocked for more than 120 seconds.
[ 7801.296906] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7801.307434] fsx D ffff880830affeb0 0 6463 6459 0x00000080
[ 7801.317015] ffff880830affae0 0000000000000082 ffff8808501a7300 ffff880830afffd8
[ 7801.326920] ffff880830afffd8 ffff880830afffd8 ffff8808501a7300 ffff88103c4c5c10
[ 7801.336722] ffff88103c4c5c18 7fffffffffffffff ffff8808501a7300 ffff880830affeb0
[ 7801.346435] Call Trace:
[ 7801.350508] [<ffffffff8163bb39>] schedule+0x29/0x70
[ 7801.357404] [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
[ 7801.365221] [<ffffffffa0a51a53>] ? ptlrpc_set_add_new_req+0xe3/0x160 [ptlrpc]
[ 7801.374549] [<ffffffffa0e47e80>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[ 7801.383050] [<ffffffffa0a80c2c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
[ 7801.391743] [<ffffffffa0a63f30>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[ 7801.401132] [<ffffffff8163bf06>] wait_for_completion+0x116/0x170
[ 7801.409054] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
[ 7801.416405] [<ffffffffa0e48044>] osc_io_setattr_end+0xc4/0x180 [osc]
[ 7801.424667] [<ffffffffa08ee3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 7801.433634] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7801.441454] [<ffffffffa08ee48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 7801.449568] [<ffffffffa08ee9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
[ 7801.457722] [<ffffffffa08eea96>] lov_io_end+0x36/0xb0 [lov]
[ 7801.465010] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7801.472742] [<ffffffffa0c202e3>] cl_io_loop+0xb3/0x190 [obdclass]
[ 7801.480564] [<ffffffffa0dc2550>] cl_setattr_ost+0x240/0x3a0 [lustre]
[ 7801.488662] [<ffffffffa0d9c7f9>] ll_setattr_raw+0x12a9/0x1330 [lustre]
[ 7801.496959] [<ffffffffa0d9c8e3>] ll_setattr+0x63/0xc0 [lustre]
[ 7801.504464] [<ffffffff811fbe29>] notify_change+0x279/0x3d0
[ 7801.511594] [<ffffffff811dcf13>] do_truncate+0x73/0xc0
[ 7801.518325] [<ffffffff811e17a8>] ? __sb_start_write+0x58/0x110
[ 7801.525833] [<ffffffff811dd2bc>] do_sys_ftruncate.constprop.17+0x11c/0x180
[ 7801.534511] [<ffffffff811dd35e>] SyS_ftruncate+0xe/0x10
[ 7801.541335] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[ 7801.548948] INFO: task fsx:6464 blocked for more than 120 seconds.
[ 7801.556758] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7801.566424] fsx D ffff88083b5efeb0 0 6464 6459 0x00000080
[ 7801.575267] ffff88083b5efae0 0000000000000086 ffff8808501a3980 ffff88083b5effd8
[ 7801.584558] ffff88083b5effd8 ffff88083b5effd8 ffff8808501a3980 ffff88103c5ba650
[ 7801.593846] ffff88103c5ba658 7fffffffffffffff ffff8808501a3980 ffff88083b5efeb0
[ 7801.603132] Call Trace:
[ 7801.606799] [<ffffffff8163bb39>] schedule+0x29/0x70
[ 7801.613263] [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
[ 7801.620689] [<ffffffffa0a51a53>] ? ptlrpc_set_add_new_req+0xe3/0x160 [ptlrpc]
[ 7801.629636] [<ffffffffa0e47e80>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[ 7801.637804] [<ffffffffa0a80c2c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
[ 7801.646153] [<ffffffffa0a63f30>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[ 7801.655416] [<ffffffff8163bf06>] wait_for_completion+0x116/0x170
[ 7801.663196] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
[ 7801.670366] [<ffffffffa0e48044>] osc_io_setattr_end+0xc4/0x180 [osc]
[ 7801.678492] [<ffffffffa08ee3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 7801.687310] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7801.695042] [<ffffffffa08ee48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 7801.703072] [<ffffffffa08ee9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
[ 7801.711180] [<ffffffffa08eea96>] lov_io_end+0x36/0xb0 [lov]
[ 7801.718425] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7801.725989] [<ffffffffa0c202e3>] cl_io_loop+0xb3/0x190 [obdclass]
[ 7801.733633] [<ffffffffa0dc2550>] cl_setattr_ost+0x240/0x3a0 [lustre]
[ 7801.741573] [<ffffffffa0d9c7f9>] ll_setattr_raw+0x12a9/0x1330 [lustre]
[ 7801.749702] [<ffffffffa0d9c8e3>] ll_setattr+0x63/0xc0 [lustre]
[ 7801.757038] [<ffffffff811fbe29>] notify_change+0x279/0x3d0
[ 7801.763987] [<ffffffff811dcf13>] do_truncate+0x73/0xc0
[ 7801.770540] [<ffffffff811e17a8>] ? __sb_start_write+0x58/0x110
[ 7801.777865] [<ffffffff811dd2bc>] do_sys_ftruncate.constprop.17+0x11c/0x180
[ 7801.786350] [<ffffffff811dd35e>] SyS_ftruncate+0xe/0x10
[ 7801.792996] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[ 7839.277447] Lustre: 6338:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1492372195/real 1492372195] req@ffff8809df2a9b00 x1564860107138192/t0(0) o38->lsdraid-MDT0000-mdc-ffff880832a4b000@192.168.1.5@o2ib:12/10 lens 520/544 e 0 to 1 dl 1492372226 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 7839.315041] Lustre: 6338:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
[ 7921.733573] INFO: task fsx:6463 blocked for more than 120 seconds.
[ 7921.742066] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7921.752431] fsx D ffff880830affeb0 0 6463 6459 0x00000080
[ 7921.761858] ffff880830affae0 0000000000000082 ffff8808501a7300 ffff880830afffd8
[ 7921.771634] ffff880830afffd8 ffff880830afffd8 ffff8808501a7300 ffff88103c4c5c10
[ 7921.781416] ffff88103c4c5c18 7fffffffffffffff ffff8808501a7300 ffff880830affeb0
[ 7921.791098] Call Trace:
[ 7921.795146] [<ffffffff8163bb39>] schedule+0x29/0x70
[ 7921.801932] [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
[ 7921.809719] [<ffffffffa0a51a53>] ? ptlrpc_set_add_new_req+0xe3/0x160 [ptlrpc]
[ 7921.818969] [<ffffffffa0e47e80>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[ 7921.827468] [<ffffffffa0a80c2c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
[ 7921.836091] [<ffffffffa0a63f30>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[ 7921.845456] [<ffffffff8163bf06>] wait_for_completion+0x116/0x170
[ 7921.853295] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
[ 7921.860546] [<ffffffffa0e48044>] osc_io_setattr_end+0xc4/0x180 [osc]
[ 7921.868721] [<ffffffffa08ee3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 7921.877597] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7921.885333] [<ffffffffa08ee48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 7921.893362] [<ffffffffa08ee9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
[ 7921.901445] [<ffffffffa08eea96>] lov_io_end+0x36/0xb0 [lov]
[ 7921.908666] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7921.916316] [<ffffffffa0c202e3>] cl_io_loop+0xb3/0x190 [obdclass]
[ 7921.924069] [<ffffffffa0dc2550>] cl_setattr_ost+0x240/0x3a0 [lustre]
[ 7921.932109] [<ffffffffa0d9c7f9>] ll_setattr_raw+0x12a9/0x1330 [lustre]
[ 7921.940338] [<ffffffffa0d9c8e3>] ll_setattr+0x63/0xc0 [lustre]
[ 7921.947769] [<ffffffff811fbe29>] notify_change+0x279/0x3d0
[ 7921.954813] [<ffffffff811dcf13>] do_truncate+0x73/0xc0
[ 7921.961460] [<ffffffff811e17a8>] ? __sb_start_write+0x58/0x110
[ 7921.968883] [<ffffffff811dd2bc>] do_sys_ftruncate.constprop.17+0x11c/0x180
[ 7921.977476] [<ffffffff811dd35e>] SyS_ftruncate+0xe/0x10
[ 7921.984224] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[ 7921.991749] INFO: task fsx:6464 blocked for more than 120 seconds.
[ 7921.999474] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7922.009057] fsx D ffff88083b5efeb0 0 6464 6459 0x00000080
[ 7922.017819] ffff88083b5efae0 0000000000000086 ffff8808501a3980 ffff88083b5effd8
[ 7922.027030] ffff88083b5effd8 ffff88083b5effd8 ffff8808501a3980 ffff88103c5ba650
[ 7922.036242] ffff88103c5ba658 7fffffffffffffff ffff8808501a3980 ffff88083b5efeb0
[ 7922.045439] Call Trace:
[ 7922.049028] [<ffffffff8163bb39>] schedule+0x29/0x70
[ 7922.055436] [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
[ 7922.062832] [<ffffffffa0a51a53>] ? ptlrpc_set_add_new_req+0xe3/0x160 [ptlrpc]
[ 7922.071769] [<ffffffffa0e47e80>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[ 7922.079951] [<ffffffffa0a80c2c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
[ 7922.088305] [<ffffffffa0a63f30>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[ 7922.097396] [<ffffffff8163bf06>] wait_for_completion+0x116/0x170
[ 7922.105008] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
[ 7922.112014] [<ffffffffa0e48044>] osc_io_setattr_end+0xc4/0x180 [osc]
[ 7922.119982] [<ffffffffa08ee3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 7922.128637] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7922.136386] [<ffffffffa08ee48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 7922.144462] [<ffffffffa08ee9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
[ 7922.152623] [<ffffffffa08eea96>] lov_io_end+0x36/0xb0 [lov]
[ 7922.159869] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 7922.167613] [<ffffffffa0c202e3>] cl_io_loop+0xb3/0x190 [obdclass]
[ 7922.175421] [<ffffffffa0dc2550>] cl_setattr_ost+0x240/0x3a0 [lustre]
[ 7922.183526] [<ffffffffa0d9c7f9>] ll_setattr_raw+0x12a9/0x1330 [lustre]
[ 7922.191819] [<ffffffffa0d9c8e3>] ll_setattr+0x63/0xc0 [lustre]
[ 7922.199319] [<ffffffff811fbe29>] notify_change+0x279/0x3d0
[ 7922.206268] [<ffffffff811dcf13>] do_truncate+0x73/0xc0
[ 7922.212823] [<ffffffff811e17a8>] ? __sb_start_write+0x58/0x110
[ 7922.220147] [<ffffffff811dd2bc>] do_sys_ftruncate.constprop.17+0x11c/0x180
[ 7922.228632] [<ffffffff811dd35e>] SyS_ftruncate+0xe/0x10
[ 7922.235281] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[ 7923.592450] LustreError: 27229:0:(osc_cache.c:951:osc_extent_wait()) extent ffff88012eaeb068@{[69632 -> 73727/73727], [3|0|+|rpc|wiY|ffff8806c50ac850], [17432576|1|+|-|ffff8802ae377000|4096|ffff88104d932280]} lsdraid-OST0001-osc-ffff880832a4b000: wait ext to 0 timedout, recovery in progress?
[ 7923.625757] LustreError: 27229:0:(osc_cache.c:951:osc_extent_wait()) ### extent: ffff88012eaeb068 ns: lsdraid-OST0001-osc-ffff880832a4b000 lock: ffff8802ae377000/0x766280d790dde765 lrc: 11/0,1 mode: PW/PW res: [0x85588:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20000000000 nid: local remote: 0x17e15c0cc7f9c1bf expref: -99 pid: 26746 timeout: 0 lvb_type: 1
[ 8042.175864] INFO: task fsx:6463 blocked for more than 120 seconds.
[ 8042.184502] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8042.195019] fsx D ffff880830affeb0 0 6463 6459 0x00000080
[ 8042.204575] ffff880830affae0 0000000000000082 ffff8808501a7300 ffff880830afffd8
[ 8042.214464] ffff880830afffd8 ffff880830afffd8 ffff8808501a7300 ffff88103c4c5c10
[ 8042.224251] ffff88103c4c5c18 7fffffffffffffff ffff8808501a7300 ffff880830affeb0
[ 8042.233932] Call Trace:
[ 8042.238032] [<ffffffff8163bb39>] schedule+0x29/0x70
[ 8042.244837] [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
[ 8042.252625] [<ffffffffa0a51a53>] ? ptlrpc_set_add_new_req+0xe3/0x160 [ptlrpc]
[ 8042.261883] [<ffffffffa0e47e80>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[ 8042.270308] [<ffffffffa0a80c2c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
[ 8042.278917] [<ffffffffa0a63f30>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[ 8042.288229] [<ffffffff8163bf06>] wait_for_completion+0x116/0x170
[ 8042.296078] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
[ 8042.303299] [<ffffffffa0e48044>] osc_io_setattr_end+0xc4/0x180 [osc]
[ 8042.311491] [<ffffffffa08ee3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 8042.320343] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 8042.328108] [<ffffffffa08ee48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 8042.336118] [<ffffffffa08ee9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
[ 8042.344221] [<ffffffffa08eea96>] lov_io_end+0x36/0xb0 [lov]
[ 8042.351457] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 8042.359196] [<ffffffffa0c202e3>] cl_io_loop+0xb3/0x190 [obdclass]
[ 8042.367001] [<ffffffffa0dc2550>] cl_setattr_ost+0x240/0x3a0 [lustre]
[ 8042.375099] [<ffffffffa0d9c7f9>] ll_setattr_raw+0x12a9/0x1330 [lustre]
[ 8042.383383] [<ffffffffa0d9c8e3>] ll_setattr+0x63/0xc0 [lustre]
[ 8042.390874] [<ffffffff811fbe29>] notify_change+0x279/0x3d0
[ 8042.397978] [<ffffffff811dcf13>] do_truncate+0x73/0xc0
[ 8042.404692] [<ffffffff811e17a8>] ? __sb_start_write+0x58/0x110
[ 8042.412185] [<ffffffff811dd2bc>] do_sys_ftruncate.constprop.17+0x11c/0x180
[ 8042.420849] [<ffffffff811dd35e>] SyS_ftruncate+0xe/0x10
[ 8042.427670] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[ 8042.435264] INFO: task fsx:6464 blocked for more than 120 seconds.
[ 8042.443057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8042.452712] fsx D ffff88083b5efeb0 0 6464 6459 0x00000080
[ 8042.461542] ffff88083b5efae0 0000000000000086 ffff8808501a3980 ffff88083b5effd8
[ 8042.470819] ffff88083b5effd8 ffff88083b5effd8 ffff8808501a3980 ffff88103c5ba650
[ 8042.480091] ffff88103c5ba658 7fffffffffffffff ffff8808501a3980 ffff88083b5efeb0
[ 8042.489357] Call Trace:
[ 8042.493008] [<ffffffff8163bb39>] schedule+0x29/0x70
[ 8042.499480] [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
[ 8042.506915] [<ffffffffa0a51a53>] ? ptlrpc_set_add_new_req+0xe3/0x160 [ptlrpc]
[ 8042.515873] [<ffffffffa0e47e80>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[ 8042.524049] [<ffffffffa0a80c2c>] ? ptlrpcd_add_req+0x1ec/0x2e0 [ptlrpc]
[ 8042.532586] [<ffffffffa0a63f30>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[ 8042.541853] [<ffffffff8163bf06>] wait_for_completion+0x116/0x170
[ 8042.549637] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
[ 8042.556810] [<ffffffffa0e48044>] osc_io_setattr_end+0xc4/0x180 [osc]
[ 8042.564949] [<ffffffffa08ee3b0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 8042.573771] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 8042.581502] [<ffffffffa08ee48b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 8042.589512] [<ffffffffa08ee9a6>] lov_io_call.isra.9+0x86/0x140 [lov]
[ 8042.597456] [<ffffffffa08eea96>] lov_io_end+0x36/0xb0 [lov]
[ 8042.604533] [<ffffffffa0c1db3d>] cl_io_end+0x5d/0x150 [obdclass]
[ 8042.612090] [<ffffffffa0c202e3>] cl_io_loop+0xb3/0x190 [obdclass]
[ 8042.619734] [<ffffffffa0dc2550>] cl_setattr_ost+0x240/0x3a0 [lustre]
[ 8042.627675] [<ffffffffa0d9c7f9>] ll_setattr_raw+0x12a9/0x1330 [lustre]
[ 8042.635802] [<ffffffffa0d9c8e3>] ll_setattr+0x63/0xc0 [lustre]
[ 8042.643140] [<ffffffff811fbe29>] notify_change+0x279/0x3d0
[ 8042.650088] [<ffffffff811dcf13>] do_truncate+0x73/0xc0
[ 8042.656639] [<ffffffff811e17a8>] ? __sb_start_write+0x58/0x110
[ 8042.663963] [<ffffffff811dd2bc>] do_sys_ftruncate.constprop.17+0x11c/0x180
[ 8042.672448] [<ffffffff811dd35e>] SyS_ftruncate+0xe/0x10
[ 8042.679094] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[ 8114.126337] Lustre: 6338:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1492372470/real 1492372470] req@ffff8809df2acb00 x1564860107139152/t0(0) o38->lsdraid-MDT0000-mdc-ffff880832a4b000@192.168.1.5@o2ib:12/10 lens 520/544 e 0 to 1 dl 1492372525 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
All clients that had the Lustre FS mounted crashed with the same error. |
| Comments |
| Comment by John Salinas (Inactive) [ 17/Apr/17 ] |
|
Dumps are on onyx: /scratch/johnsali/LU-9352.tgz |
| Comment by Peter Jones [ 18/Apr/17 ] |
|
Hongchao Could you please advise on this one? Thanks Peter |
| Comment by Hongchao Zhang [ 20/Apr/17 ] |
|
Hi John, |
| Comment by John Salinas (Inactive) [ 20/Apr/17 ] |
|
It appears the clients are at:
|
| Comment by Peter Jones [ 25/Apr/17 ] |
|
John As per our recent discussion, please concentrate your testing on the latest master tag Thanks Peter |
| Comment by Andreas Dilger [ 25/Apr/17 ] |
|
Is there a reason you are still testing with 2.8.57? That is a development tag that is 222 patches behind 2.9.0, and many, many patches behind 2.10. |
| Comment by John Salinas (Inactive) [ 25/Apr/17 ] |
|
Peter this bug was filed before that discussion. Does anyone have reason to think this has been fixed in 2.9 or in master I will be happy re-test! |
| Comment by Peter Jones [ 25/Apr/17 ] |
|
John I do not have a specific change to point to but, given the hundreds of changes that have been made since you hit this issue and our limited engineering resources to close out on everything necessary for the 2.10.0 release that will be needed for CORAL and many other purposes, I defintiely want to make sure we are focusing our efforts on the latest code. I'm sorry if that was not clear in our recent meeting but, yes, please do repeat the testing to verify whether this issue still manifests itself in the same way in the latest tag Thanks Peter |
| Comment by Peter Jones [ 12/May/17 ] |
|
It does not seem like this is blocking testing of CORAL deliverables at present so removing the fix version. If new information comes to light we will engage. |