[LU-13069] osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 308 changes, 111 in progress, 0 in flight Created: 12/Dec/19  Updated: 19/Oct/22  Resolved: 10/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

osp llog processing stops before umount, because it doesn't see some records at catalog

REFORMAT=yes PTLDEBUG="-malloc-trace" SUBSYSTEM_DEBUG="+log+osc+osd-rpc-lnet-ldlm" DEBUG_SIZE=200 ONLY=135 sh lustre/tests/sanity.sh
 
[ 3647.083597] Lustre: DEBUG MARKER: == sanity test 135: Race catalog processing ========================================================== 07:07:08 (1573646828)
[ 3669.794725] Lustre: lustre-OST0000-osc-ffff88003711b800: disconnect after 24s idle
[ 4257.120915] LNet: 20833:0:(lib-move.c:3914:lnet_parse_put()) Dropping PUT from 12345-10.0.2.9@tcp portal 4 match 1650088281856656 offset 0 length 192: 4
[ 4600.368617] Lustre: lustre-OST0001-osc-ffff88003711b800: reconnect after 931s idle
[ 4600.369232] Lustre: lustre-OST0000: Connection restored to 10.0.2.9@tcp (at 10.0.2.9@tcp)
[ 4600.369235] Lustre: Skipped 5 previous similar messages
[ 4640.122168] LNet: 20832:0:(lib-move.c:3914:lnet_parse_put()) Dropping PUT from 12345-10.0.2.9@tcp portal 16 match 1650088287536976 offset 0 length 192: 4
[ 4896.183677] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) lustre-OST0001-osc-MDT0000: catlog [0x5:0x1:0x0] crosses index zero
[ 4896.975199] Lustre: 23911:0:(llog_cat.c:918:llog_cat_process_or_fork()) lustre-OST0000-osc-MDT0000: catlog [0x2:0x1:0x0] crosses index zero
[ 4896.975203] Lustre: 23911:0:(llog_cat.c:918:llog_cat_process_or_fork()) Skipped 1 previous similar message
[ 4898.956196] Lustre: 23911:0:(llog_cat.c:918:llog_cat_process_or_fork()) lustre-OST0000-osc-MDT0000: catlog [0x2:0x1:0x0] crosses index zero
[ 4898.956201] Lustre: 23911:0:(llog_cat.c:918:llog_cat_process_or_fork()) Skipped 3 previous similar messages
[ 4901.807494] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) lustre-OST0001-osc-MDT0000: catlog [0x5:0x1:0x0] crosses index zero
[ 4901.807504] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) Skipped 3 previous similar messages
[ 4906.198002] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) lustre-OST0001-osc-MDT0000: catlog [0x5:0x1:0x0] crosses index zero
[ 4906.198007] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) Skipped 9 previous similar messages
[ 4914.323032] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) lustre-OST0001-osc-MDT0000: catlog [0x5:0x1:0x0] crosses index zero
[ 4914.323037] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) Skipped 15 previous similar messages
[ 4931.388799] Lustre: 23911:0:(llog_cat.c:918:llog_cat_process_or_fork()) lustre-OST0000-osc-MDT0000: catlog [0x2:0x1:0x0] crosses index zero
[ 4931.388804] Lustre: 23911:0:(llog_cat.c:918:llog_cat_process_or_fork()) Skipped 31 previous similar messages
[ 4963.746106] Lustre: 23911:0:(llog_cat.c:918:llog_cat_process_or_fork()) lustre-OST0000-osc-MDT0000: catlog [0x2:0x1:0x0] crosses index zero
[ 4963.746112] Lustre: 23911:0:(llog_cat.c:918:llog_cat_process_or_fork()) Skipped 55 previous similar messages
[ 5027.798533] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) lustre-OST0001-osc-MDT0000: catlog [0x5:0x1:0x0] crosses index zero
[ 5027.798541] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) Skipped 121 previous similar messages
[ 5156.633800] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) lustre-OST0001-osc-MDT0000: catlog [0x5:0x1:0x0] crosses index zero
[ 5156.633808] Lustre: 23913:0:(llog_cat.c:918:llog_cat_process_or_fork()) Skipped 245 previous similar messages
[ 5218.830741] Lustre: DEBUG MARKER: == sanity test complete, duration 1597 sec =========================================================== 07:33:20 (1573648400)
[ 5219.039200] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[ 5228.383189] LustreError: 23913:0:(osp_sync.c:1293:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 308 changes, 111 in progress, 0 in flight
[ 5228.383242] LustreError: 23913:0:(osp_sync.c:1293:osp_sync_thread()) LBUG
[ 5228.383260] Pid: 23913, comm: osp-syn-1-0 3.10.0-693.21.1.x3.1.9.x86_64 #1 SMP Tue Jun 26 09:38:31 PDT 2018
[ 5228.383261] Call Trace:
[ 5228.383271]  [<ffffffff8103a212>] save_stack_trace_tsk+0x22/0x40
[ 5228.383275]  [<ffffffffc055f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 5228.383283]  [<ffffffffc055f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 5228.383288]  [<ffffffffc11ea71b>] osp_sync_thread+0xa1b/0xa70 [osp]
[ 5228.383296]  [<ffffffff810b4031>] kthread+0xd1/0xe0

and test 136

[ 1303.271639] Lustre: 20681:0:(llog_cat.c:919:llog_cat_process_or_fork()) lustre-OST0000-osc-MDT0000: catlog [0x2:0x1:0x0] crosses index zero
[ 1336.588370] LustreError: 20681:0:(osp_sync.c:1293:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 2 changes, 971 in progress, 4 in flight
[ 1336.591561] LustreError: 20681:0:(osp_sync.c:1293:osp_sync_thread()) LBUG
Log

00000040:00080000:3.0:1575562250.250807:0:20681:0:(llog.c:665:llog_process_thread()) index: 438, lh_last_idx: 786 synced_idx: 0 lgh_last_idx: 787
00000040:00080000:3.0:1575562250.250808:0:20681:0:(llog_cat.c:814:llog_cat_process_common()) processing log [0xf85d7:0x1:0x0]:0 at index 438 of catalog [0x2:0x1:0x0]
00000040:00080000:3.0:1575562250.250855:0:20681:0:(llog.c:822:llog_process_or_fork()) Processing [0xf85d7:0x1:0x0] flags 0x005 startcat -1054886720 startidx -1 first_idx -1 last_idx -1
00000004:00080000:1.0:1575562250.250966:0:20043:0:(osp_sync.c:564:osp_sync_interpret()) reply req ffff880043637500/2, rc 0, transno 32314
00002000:00080000:2.0:1575562250.251029:0:21009:0:(ofd_dev.c:1798:ofd_destroy_hdl()) lustre-OST0000: Destroy object 0x0:1316016 count 1
00000004:00080000:3.0:1575562250.251034:0:20041:0:(osp_sync.c:564:osp_sync_interpret()) reply req ffff880050e1c380/2, rc 0, transno 32315
00000004:00080000:3.0:1575562250.251039:0:20681:0:(osp_sync.c:1166:osp_sync_process_queues()) 349 changes, 1363 in progress, 6 in flight
00000040:00080000:3.0:1575562250.251040:0:20681:0:(llog.c:715:llog_process_thread()) stop processing plain 0xf85d7:1:0 index 2 count 2
00000040:00080000:3.0:1575562250.251042:0:20681:0:(llog.c:715:llog_process_thread()) stop processing catalog 0x2:1:0 index 438 count 1713
00000040:00000400:3.0:1575562250.251045:0:20681:0:(llog_cat.c:919:llog_cat_process_or_fork()) lustre-OST0000-osc-MDT0000: catlog [0x2:0x1:0x0] crosses index zero
00000040:00080000:3.0:1575562250.251051:0:20681:0:(llog.c:822:llog_process_or_fork()) Processing [0x2:0x1:0x0] flags 0x012 startcat 439 startidx 0 first_idx 439 last_idx 787
00000040:00080000:3.0:1575562250.251057:0:20681:0:(llog.c:665:llog_process_thread()) index: 440, lh_last_idx: 787 synced_idx: 0 lgh_last_idx: 787
00000040:00080000:3.0:1575562250.251058:0:20681:0:(llog_cat.c:814:llog_cat_process_common()) processing log [0xf85d9:0x1:0x0]:0 at index 440 of catalog [0x2:0x1:0x0]

So 438 was processed, and then loop restart and next index is 440.
Catalog dump shows the same thing.

Number of records: 973
Target uuid :
-----------------------
#439 (064)id=[0xf85d8:0x1:0x0]:0 path=oi.1/0x1:0xf85d8:0x0
#788 (064)id=[0xf8735:0x1:0x0]:0 path=oi.1/0x1:0xf8735:0x0
#10327 (064)id=[0xfac78:0x1:0x0]:0 path=oi.1/0x1:0xfac78:0x0
#10328 (064)id=[0xfac79:0x1:0x0]:0 path=oi.1/0x1:0xfac79:0x0
#10329 (064)id=[0xfac7a:0x1:0x0]:0 path=oi.1/0x1:0xfac7a:0x0
#10330 (064)id=[0xfac7b:0x1:0x0]:0 path=oi.1/0x1:0xfac7b:0x0
#10331 (064)id=[0xfac7c:0x1:0x0]:0 path=oi.1/0x1:0xfac7c:0x0

The root case lay on a loop restart, for a wrapped catalog processing starts from a next index and it increase a last processed. But llog_process_thread do the one more increasing for a incoming index.

  d->opd_sync_last_catalog_idx = 0;
        do {
                int     size;
 
                wrapped = (llh->lgh_hdr->llh_cat_idx >= llh->lgh_last_idx &&
                           llh->lgh_hdr->llh_count > 1);
 
                rc = llog_cat_process(&env, llh, osp_sync_process_queues, d,
                                      d->opd_sync_last_catalog_idx, 0);
 
                size = OBD_FAIL_PRECHECK(OBD_FAIL_CAT_RECORDS) ?
                       cfs_fail_val : (LLOG_HDR_BITMAP_SIZE(llh->lgh_hdr) - 1);
                /* processing reaches catalog bottom */
                if (d->opd_sync_last_catalog_idx == size)
                        d->opd_sync_last_catalog_idx = LLOG_CAT_FIRST;
                else if (wrapped)
                        /* If catalog is wrapped we can`t predict last index of
                         * processing because lgh_last_idx could be changed.
                         * Starting form the next one */
                        d->opd_sync_last_catalog_idx++;
 
        } while (rc == 0 && (wrapped ||
                             d->opd_sync_last_catalog_idx == LLOG_CAT_FIRST));
  if (cd != NULL) {
                last_called_index = cd->lpcd_first_idx;
                index = cd->lpcd_first_idx + 1;
        }


 Comments   
Comment by Gerrit Updater [ 12/Dec/19 ]

Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/36996
Subject: LU-13069 obdclass: don't skip records for wrapped catalog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7309fae5cc65a4fa16d69abc08fd171ed356ae0d

Comment by Gerrit Updater [ 10/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36996/
Subject: LU-13069 obdclass: don't skip records for wrapped catalog
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cc10922919325c212ae98a69d63328c0efbd4f83

Comment by Peter Jones [ 10/Jan/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 18/Jan/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46165
Subject: LU-13069 obdclass: don't skip records for wrapped catalog
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7fa058df8fdaffbb5f16cf2877de2f5402210839

Generated at Sat Feb 10 02:58:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.