Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.7.0
-
Hyperon
-
3
-
17581
Description
While performing failover testing:
LustreError: 167-0: lustre-OST0000-osc-ffff88083b339000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:777:ldlm_resource_complain()) lustre-OST0000-osc-ffff88083b339000: namespace resource [0x2d097bb:0x0:0x0].0 (ffff88106a992e40) refcount nonzero (1) after lock cleanup; forcing cleanup.
Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:1374:ldlm_resource_dump()) --- Resource: [0x2d097bb:0x0:0x0].0 (ffff88106a992e40) refcount = 3
Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:1377:ldlm_resource_dump()) Granted locks (in reverse order):
Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:1380:ldlm_resource_dump()) ### ### ns: lustre-OST0000-osc-ffff88083b339000 lock: ffff88106a60d540/0xa1f402a8812988f4 lrc: 3/0,1 mode: PW/PW res: [0x2d097bb:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->8191) flags: 0x126400020000 nid: local remote: 0x72a9f4c80e66e07b expref: -99 pid: 51570 timeout: 0 lvb_type: 1
Feb 23 09:22:48 iwc260 kernel: Lustre: lustre-OST0000-osc-ffff88083b339000: Connection restored to lustre-OST0000 (at 192.168.120.13@o2ib)
Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:
Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:
Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) LBUG
Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) LBUG
Feb 23 09:22:49 iwc260 kernel: Pid: 90252, comm: ldlm_bl_36
Feb 23 09:22:49 iwc260 kernel:
Feb 23 09:22:49 iwc260 kernel: Call Trace:
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0435895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0435e97>] lbug_with_loc+0x47/0xb0 [libcfs]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2ba56>] discard_cb+0x156/0x190 [osc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2bdcc>] osc_page_gang_lookup+0x1ac/0x330 [osc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2b900>] ? discard_cb+0x0/0x190 [osc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2c094>] osc_lock_discard_pages+0x144/0x240 [osc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa04461c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2b900>] ? discard_cb+0x0/0x190 [osc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2298b>] osc_lock_flush+0x8b/0x260 [osc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b22e08>] osc_ldlm_blocking_ast+0x2a8/0x3c0 [osc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0761a6c>] ldlm_cancel_callback+0x6c/0x170 [ptlrpc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077450a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0779120>] ldlm_cli_cancel+0x60/0x360 [ptlrpc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b22c3b>] osc_ldlm_blocking_ast+0xdb/0x3c0 [osc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa04461c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077cb60>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077d0c1>] ldlm_bl_thread_main+0x291/0x3f0 [ptlrpc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077ce30>] ? ldlm_bl_thread_main+0x0/0x3f0 [ptlrpc]
Feb 23 09:22:49 iwc260 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
Feb 23 09:22:49 iwc260 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 23 09:22:49 iwc260 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Feb 23 09:22:49 iwc260 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
This bug is not resolved. At a customer site today, one of the issues reported with the patch series was reproduced:
Here is the gathered information from one of the engineers:
Nov 13 21:22:32 r2i1n4 kernel: [41261.553520] LustreError: 11-0: work-OST0018-osc-ffff880f4a556c00: Communicating with 10.148.254.146@o2ib, operation obd_ping failed with -107.
Nov 13 21:22:32 r2i1n4 kernel: [41261.553537] Lustre: work-OST0018-osc-ffff880f4a556c00: Connection to work-OST0018 (at 10.148.254.146@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Nov 13 21:22:32 r2i1n4 kernel: [41261.553744] LustreError: 167-0: work-OST0018-osc-ffff880f4a556c00: This client was evicted by work-OST0018; in progress operations using this service will fail.
Nov 13 21:22:19 oss3 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.148.4.3@o2ib ns: filter-work-OST0018_UUID lock: ffff881bf46fa1c0/0x1b850f2869999e97 lrc: 3/0,0 mode: PW/PW res: [0xcc6a24:0x0:0x0].0 rrc: 31 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 10.148.4.3@o2ib remote: 0x7a3c9c39e0502fcc expref: 5 pid: 20187 timeout: 4335795111 lvb_type: 0
r2i1n4:~ # lctl dl
0 UP mgc MGC10.148.254.141@o2ib 2dc4c54f-dab6-1190-fca7-d7747093f936 5
…
35 UP osc work-OST0017-osc-ffff880f4a556c00 9d007967-5de8-eb2e-eaf9-99d4b17d74e6 5
36 IN osc work-OST0018-osc-ffff880f4a556c00 9d007967-5de8-eb2e-eaf9-99d4b17d74e6 5
37 UP osc work-OST0019-osc-ffff880f4a556c00 9d007967-5de8-eb2e-eaf9-99d4b17d74e6 5
r2i1n4:~ # lctl --device 36 activate
error: activate: failed: Invalid argument
r2i1n4:~ # cat /proc/fs/lustre/osc/work-OST0018-osc-ffff880f4a556c00/import
import:
name: work-OST0018-osc-ffff880f4a556c00
target: work-OST0018_UUID
state: EVICTED
instance: 21
connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type]
import_flags: [ invalid, replayable, pingable, connect_tried ]
connection:
failover_nids: [ 10.148.254.146@o2ib, 10.148.254.147@o2ib ]
current_connection: 10.148.254.146@o2ib
connection_attempts: 2
generation: 2
in-progress_invalidations: 1
rpcs:
inflight: 0
unregistering: 0
timeouts: 0
avg_waittime: 803 usec
service_estimates:
services: 1 sec
network: 1 sec
transactions:
last_replay: 0
peer_committed: 90194391109
last_checked: 90194391109
read_data_averages:
bytes_per_rpc: 19550
usec_per_rpc: 938
MB_per_sec: 20.84
write_data_averages:
bytes_per_rpc: 508
usec_per_rpc: 5386
MB_per_sec: 0.09
r2i1n4:~ # ps -ef | grep ll_imp_inval
root 832 2 0 Nov13 ? 00:00:00 [ll_imp_inval]
root 25849 25765 0 14:18 pts/0 00:00:00 grep ll_imp_inval
Nov 16 14:22:19 r2i1n4 kernel: [274907.717827] ll_imp_inval S ffff881ffc253b00 0 832 2 0x00000000
Nov 16 14:22:19 r2i1n4 kernel: [274907.717832] ffff881ffc34dc50 0000000000000046 ffff881ffc34c010 0000000000010900
Nov 16 14:22:19 r2i1n4 kernel: [274907.717838] 0000000000010900 0000000000010900 0000000000010900 ffff881ffc34dfd8
Nov 16 14:22:19 r2i1n4 kernel: [274907.717845] ffff881ffc34dfd8 0000000000010900 ffff881fd7ffa580 ffff881020f2c300
Nov 16 14:22:19 r2i1n4 kernel: [274907.717852] Call Trace:
Nov 16 14:22:19 r2i1n4 kernel: [274907.717879] [<ffffffffa0bcf361>] osc_object_invalidate+0x1c1/0x280 [osc]
Nov 16 14:22:19 r2i1n4 kernel: [274907.717907] [<ffffffffa0bb5bae>] osc_ldlm_resource_invalidate+0xae/0x170 [osc]
Nov 16 14:22:19 r2i1n4 kernel: [274907.717930] [<ffffffffa0716d68>] cfs_hash_for_each_relax+0x178/0x340 [libcfs]
Nov 16 14:22:19 r2i1n4 kernel: [274907.717971] [<ffffffffa0718fc0>] cfs_hash_for_each_nolock+0x70/0x1c0 [libcfs]
Nov 16 14:22:19 r2i1n4 kernel: [274907.718008] [<ffffffffa0bc989c>] osc_import_event+0xfac/0x1420 [osc]
Nov 16 14:22:19 r2i1n4 kernel: [274907.718064] [<ffffffffa0a88592>] ptlrpc_invalidate_import+0x292/0x890 [ptlrpc]
Nov 16 14:22:19 r2i1n4 kernel: [274907.718151] [<ffffffffa0a8a702>] ptlrpc_invalidate_import_thread+0x42/0x2e0 [ptlrpc]
Nov 16 14:22:19 r2i1n4 kernel: [274907.718197] [<ffffffff810843f6>] kthread+0x96/0xa0
Nov 16 14:22:19 r2i1n4 kernel: [274907.718205] [<ffffffff8146f364>] kernel_thread_helper+0x4/0x10