[LU-10160] LBUG ldlm_lock.c:2259:__ldlm_reprocess_all()) ASSERTION( list_empty(&rpc_list) ) failed: Created: 25/Oct/17  Updated: 27/Nov/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Oleg Drokin
Resolution: Unresolved Votes: 0
Labels: soak
Environment:

Soak cluster version=2.10.54_13_g84f690e
lustre-master-ib build 6 with DNE configured


Attachments: File soak-2-lustre-log.1508920865.26469.txt.gz     Text File vmcore-dmesg.txt    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After OSS failover and recovery:

Oct 25 08:41:02 soak-2 kernel: Lustre: soaked-OST0012: Recovery over after 0:35, of 33 clients 33 recovered and 0 were evicted.
Oct 25 08:41:03 soak-2 kernel: LustreError: 26237:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0012: BRW to missing obj 0x0:1943200
Oct 25 08:41:03 soak-2 kernel: LustreError: 26369:0:(lustre_dlm.h:1371:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)
Oct 25 08:41:03 soak-2 kernel: format at lustre_dlm.h:1097:ldlm_lvbo_fill doesn't end in newline
Oct 25 08:41:03 soak-2 kernel: LustreError: 26237:0:(ofd_io.c:616:ofd_preprw_write()) Skipped 3 previous similar messages
Oct 25 08:41:03 soak-2 kernel: format at lustre_dlm.h:1097:ldlm_lvbo_fill doesn't end in newline
Oct 25 08:41:03 soak-2 kernel: LustreError: 12692:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0012: BRW to missing obj 0x780000400:835459
Oct 25 08:41:03 soak-2 kernel: LustreError: 12692:0:(ofd_io.c:616:ofd_preprw_write()) Skipped 1 previous similar message
Oct 25 08:41:03 soak-2 kernel: Lustre: soaked-OST0006: Connection restored to e5f19f2d-79a3-0744-c0c5-e9d37cbaf536 (at 192.168.1.118@o2ib)
Oct 25 08:41:03 soak-2 kernel: Lustre: soaked-OST0000: Connection restored to e5f19f2d-79a3-0744-c0c5-e9d37cbaf536 (at 192.168.1.118@o2ib)
Oct 25 08:41:03 soak-2 kernel: Lustre: Skipped 43 previous similar messages
Oct 25 08:41:03 soak-2 kernel: Lustre: Skipped 43 previous similar messages
Oct 25 08:41:03 soak-2 kernel: LustreError: 26369:0:(lustre_dlm.h:1097:ldlm_lvbo_fill()) lock ffff8800a4621c00: delayed lvb init failed (rc -2)
Oct 25 08:41:03 soak-2 kernel: LustreError: 26469:0:(lustre_dlm.h:1097:ldlm_lvbo_fill()) lock ffff8800afe75200: delayed lvb init failed (rc -2)
Oct 25 08:41:03 soak-2 kernel: LustreError: 26469:0:(ldlm_lock.c:2259:__ldlm_reprocess_all()) ASSERTION( list_empty(&rpc_list) ) failed:
Oct 25 08:41:04 soak-2 kernel: LustreError: 26469:0:(ldlm_lock.c:2259:__ldlm_reprocess_all()) LBUG
Oct 25 08:41:04 soak-2 kernel: Pid: 26469, comm: ldlm_cn00_003
Oct 25 08:41:04 soak-2 kernel: #012Call Trace:
Oct 25 08:41:04 soak-2 kernel: [<ffffffffc0d3e7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Oct 25 08:41:04 soak-2 kernel: [<ffffffffc0d3e83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Oct 25 08:41:04 soak-2 kernel: [<ffffffffc112a88b>] __ldlm_reprocess_all+0x32b/0x3a0 [ptlrpc]
Oct 25 08:41:04 soak-2 kernel: [<ffffffffc112a910>] ldlm_reprocess_all+0x10/0x20 [ptlrpc]
Oct 25 08:41:04 soak-2 kernel: [<ffffffffc114dc8d>] ldlm_request_cancel+0x46d/0x720 [ptlrpc]
Oct 25 08:41:04 soak-2 kernel: [<ffffffffc115417a>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
Oct 25 08:41:04 soak-2 kernel: [<ffffffffc1154451>] ldlm_cancel_handler+0x141/0x490 [ptlrpc]
Oct 25 08:41:05 soak-2 kernel: [<ffffffffc118537e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
Oct 25 08:41:05 soak-2 kernel: [<ffffffffc1182248>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Oct 25 08:41:05 soak-2 kernel: [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
Oct 25 08:41:05 soak-2 kernel: [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
Oct 25 08:41:05 soak-2 kernel: [<ffffffffc1188b22>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Oct 25 08:41:05 soak-2 kernel: [<ffffffff81029557>] ? __switch_to+0xd7/0x510
Oct 25 08:41:05 soak-2 kernel: [<ffffffff816a8f01>] ? __schedule+0x311/0x8b0
Oct 25 08:41:05 soak-2 kernel: [<ffffffffc1188090>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
Oct 25 08:41:05 soak-2 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Oct 25 08:41:05 soak-2 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
Oct 25 08:41:05 soak-2 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
Oct 25 08:41:05 soak-2 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0

Lustre-log dump attached



 Comments   
Comment by James Nunez (Inactive) [ 25/Oct/17 ]

Niu - Would you please review this error and comment on possible causes?

Thanks

Comment by Cliff White (Inactive) [ 25/Oct/17 ]

We do have multiple earlier hits of the 'BRW to missing obj'
Example:

ct 12 08:29:22 soak-3 kernel: Lustre: soaked-OST000d: Will be in recovery for at least 2:30, or until 34 clients reconnect
Oct 12 08:29:29 soak-3 kernel: Lustre: soaked-OST000d: Connection restored to 19d86900-89d1-0961-b2cc-97d38ab6ed62 (at 192.168.1.137@o2ib)
Oct 12 08:29:29 soak-3 kernel: Lustre: Skipped 68 previous similar messages
Oct 12 08:29:40 soak-3 kernel: Lustre: soaked-OST0013: Recovery over after 0:32, of 34 clients 34 recovered and 0 were evicted.
Oct 12 08:29:41 soak-3 kernel: LustreError: 9714:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0007: BRW to missing obj 0x0:51358755
Oct 12 08:29:52 soak-3 kernel: LustreError: 43590:0:(osd_index.c:223:__osd_xattr_load_by_oid()) soaked-OST000d: can't get bonus, rc = -17
Oct 12 08:29:53 soak-3 kernel: Lustre: 9643:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507796987/real 1507796987]  req@ffff880816ff1200 x1581039691499472/t0(0) o38->soaked-MDT0000-lwp-OST000d@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1507796993 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
........
Oct 13 13:46:50 soak-4 kernel: Lustre: soaked-OST000f: Will be in recovery for at least 2:30, or until 34 clients reconnect
Oct 13 13:46:57 soak-4 kernel: Lustre: soaked-OST0009: Recovery over after 1:02, of 34 clients 34 recovered and 0 were evicted.
Oct 13 13:46:57 soak-4 kernel: LustreError: 9585:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0009: BRW to missing obj 0x0:67095807
Oct 13 13:46:57 soak-4 kernel: LustreError: 9585:0:(ofd_io.c:616:ofd_preprw_write()) Skipped 10 previous similar messages
Comment by Oleg Drokin [ 26/Oct/17 ]

Hm, another unrelated stuff from the logs, seems the obd to target name is not working especially well:

00000100:00080000:1.0:1508920863.284572:0:26256:0:(import.c:104:ptlrpc_import_enter_resend()) ffff880179734000 0<E4><CF><EE>^G<88><FF><FF>0<E4><CF><EE>^G<88><FF><FF>: changing import state from NEW to RECOVER
00000100:00080000:1.0:1508920863.284578:0:26256:0:(import.c:1541:ptlrpc_import_recovery_state_machine()) ffff880179734000 0<E4><CF><EE>^G<88><FF><FF>0<E4><CF><EE>^G<88><FF><FF>: changing import state from RECOVER to FULL
Comment by Cliff White (Inactive) [ 17/Nov/17 ]

Crashed again with 2.10.55, vmcore-dmesg attached, crash dump available on soak.

Comment by Cliff White (Inactive) [ 26/Feb/18 ]

Crashed again after full weekend, multiple successful failovers. Crash dump available on spirit.

[37601.153508] Lustre: soaked-OST000c: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[37601.282466] Lustre: soaked-OST000c: in recovery but waiting for the first client to connect
[37601.515428] Lustre: soaked-OST000c: Will be in recovery for at least 2:30, or until 33 clients reconnect
[37610.255005] Lustre: soaked-OST0012: Recovery over after 0:52, of 33 clients 33 recovered and 0 were evicted.
[37610.255159] LustreError: 5409:0:(ofd_io.c:671:ofd_preprw_write()) soaked-OST0012: BRW to missing obj 0x740000402:491210
[37610.256286] Lustre: soaked-OST0012: deleting orphan objects from 0x0:830355 to 0x0:831537
[37610.258346] LustreError: 5387:0:(lustre_dlm.h:1376:ldlm_lvbo_update()) delayed lvb init failed (rc -2)
[37610.509648] format at lustre_dlm.h:1103:ldlm_lvbo_fill doesn't end in newline
[37610.509688] LustreError: 157856:0:(lustre_dlm.h:1103:ldlm_lvbo_fill()) lock ffff880820f860c0: delayed lvb init failed (rc -2)
[37610.509688] LustreError: 157856:0:(ldlm_lock.c:2244:__ldlm_reprocess_all()) ASSERTION( list_empty(&rpc_list) ) failed:
[37610.509691] LustreError: 157856:0:(ldlm_lock.c:2244:__ldlm_reprocess_all()) LBUG
[37610.509693] Pid: 157856, comm: ldlm_cn00_011
[37610.509693]
Call Trace:
[37610.509715]  [<ffffffffc0a477ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[37610.509726]  [<ffffffffc0a4783c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[37610.509795]  [<ffffffffc0ddd60b>] __ldlm_reprocess_all+0x30b/0x380 [ptlrpc]
[37610.509846]  [<ffffffffc0ddd690>] ldlm_reprocess_all+0x10/0x20 [ptlrpc]
[37610.509904]  [<ffffffffc0e002ad>] ldlm_request_cancel+0x41d/0x710 [ptlrpc]
[37610.509962]  [<ffffffffc0e0534a>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
[37610.510018]  [<ffffffffc0e05621>] ldlm_cancel_handler+0x141/0x490 [ptlrpc]
[37610.510104]  [<ffffffffc0e3658e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[37610.510168]  [<ffffffffc0e33448>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[37610.510174]  [<ffffffff810c6452>] ? default_wake_function+0x12/0x20
[37610.510179]  [<ffffffff810bc0f8>] ? __wake_up_common+0x58/0x90
[37610.510242]  [<ffffffffc0e39d42>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[37610.510305]  [<ffffffffc0e392b0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[37610.510310]  [<ffffffff810b252f>] kthread+0xcf/0xe0
[37610.510315]  [<ffffffff810c0d34>] ? finish_task_switch+0x54/0x160
[37610.510318]  [<ffffffff810b2460>] ? kthread+0x0/0xe0
[37610.510323]  [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[37610.510326]  [<ffffffff810b2460>] ? kthread+0x0/0xe0
[37610.510327]
[37610.510328] Kernel panic - not syncing: LBUG
Comment by Gerrit Updater [ 27/Nov/19 ]

Li Xi (lixi@ddn.com) uploaded a new patch: https://review.whamcloud.com/36885
Subject: LU-10160 ldlm: fix message of ldlm_lvbo_fill
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d7eef551256205643ee70fe8cd102498470cae1d

Generated at Sat Feb 10 02:32:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.