[LU-16720] large-scale test_3a osp_precreate_rollover_new_seq()) ASSERTION( fid_seq(fid) != fid_seq(last_fid) ) failed: fid [0x240000bd0:0x1:0x0], last_fid [0x240000bd0:0x3fff:0x0] Created: 07/Apr/23 Updated: 12/Apr/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Starting from March 30, right after the landings on that day, a new assertion crash appeared in large-scale test 3a (it only gets run in full testing I guess, so it flew under the radar):
LustreError: 676976:0:(osp_precreate.c:488:osp_precreate_rollover_new_seq()) ASSERTION( fid_seq(fid) != fid_seq(last_fid) ) failed: fid [0x240000bd0:0x1:0x0], last_fid [0x240000bd0:0x3fff:0x0]
LustreError: 676976:0:(osp_precreate.c:488:osp_precreate_rollover_new_seq()) LBUG
Pid: 676976, comm: osp-pre-0-0 4.18.0-425.10.1.el8_lustre.x86_64 #1 SMP Thu Mar 2 00:54:22 UTC 2023
Call Trace TBD:
[<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[<0>] osp_precreate_thread+0x121d/0x1230 [osp]
[<0>] kthread+0x10b/0x130
[<0>] ret_from_fork+0x35/0x40
Example crashes:
https://testing.whamcloud.com/test_sets/5173c0c5-ff80-4f5b-aec2-d6e1419cbd85
https://testing.whamcloud.com/test_sets/68c90481-1450-4526-a659-b6d5d6b97f0a
https://testing.whamcloud.com/test_sets/20a4a76a-e1bf-4f46-985c-b8cbed94e51b
I suspect this is due to |
| Comments |
| Comment by Andreas Dilger [ 07/Apr/23 ] |
|
Dongyang, this shouldn't be a case involving replay_barrier, just creating a lot of files. It isn't exactly the same as LU-16692, since this is an LASSERT that the sequences are different, while that ticket has an LASSERT that they are the same. It seems like there is an off-by-one in the rollover? Also, it may be that we need to replace the LASSERTs with error handling, since they seem too easy to hit. |
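As a rough illustration of the "error handling instead of LASSERT" idea, here is a minimal standalone sketch. It is not the code in osp_precreate.c: `struct sketch_fid` and `sketch_check_rollover()` are hypothetical names made up for this example, and the choice of `-EINVAL` is an assumption. It only shows the same-sequence condition from the crash being reported as an error rather than triggering an LBUG.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for a FID: [seq:oid:ver]. */
struct sketch_fid {
	uint64_t f_seq;	/* sequence number */
	uint32_t f_oid;	/* object id within the sequence */
	uint32_t f_ver;	/* version */
};

/*
 * Hypothetical check for a sequence rollover: the newly allocated FID
 * is expected to live in a different sequence than the last FID of the
 * exhausted one. Instead of asserting, report the problem to the caller.
 *
 * Returns 0 if the sequences differ, -EINVAL if the "new" sequence is
 * the same as the old one.
 */
int sketch_check_rollover(const struct sketch_fid *new_fid,
			  const struct sketch_fid *last_fid)
{
	if (new_fid->f_seq == last_fid->f_seq)
		return -EINVAL;

	return 0;
}
```

With the values from the crash, fid [0x240000bd0:0x1:0x0] and last_fid [0x240000bd0:0x3fff:0x0] share the sequence 0x240000bd0, so this sketch would return -EINVAL and leave it to the caller to decide how to recover.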
| Comment by Dongyang Li [ 10/Apr/23 ] |
|
This is a different issue to LU-16692. Looks like the LASSERT happened in osp_precreate_rollover_new_seq() |
| Comment by Dongyang Li [ 11/Apr/23 ] |
|
I think I know what's going on.
[ 9541.509199] Lustre: DEBUG MARKER: == replay-ost-single test 12b: write after OST failover to a missing object ========================================================== 03:08:10 (1680059290)
[ 9545.683083] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n debug
[ 9546.092712] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0
[ 9546.795469] Lustre: lustre-OST0000-osc-MDT0000: update sequence from 0x240000401 to 0x240000bd0
The replay_barrier on ost1 drops writes, so we lost the seq range update. After that, as we progress to large-scale, when we need to allocate a new SEQ from ofd we still get the old one, because the seq range update was lost. |
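The suspected sequence of events can be mimicked with a toy model. This is only a sketch of the reasoning above, not Lustre code: `struct toy_ost`, `toy_alloc_seq()` and `toy_failover()` are hypothetical names, and the model assumes that an allocation made while the barrier is active updates only the in-memory state, which is then discarded on failover.

```c
#include <stdint.h>

/* Hypothetical model of the sequence allocator state on an OST. */
struct toy_ost {
	uint64_t committed_next_seq;	/* value that reached disk */
	uint64_t in_memory_next_seq;	/* value the running server uses */
	int barrier_active;		/* nonzero: writes are dropped */
};

/* Hand out the next sequence; persist the update unless writes are dropped. */
uint64_t toy_alloc_seq(struct toy_ost *ost)
{
	uint64_t seq = ost->in_memory_next_seq++;

	if (!ost->barrier_active)
		ost->committed_next_seq = ost->in_memory_next_seq;

	return seq;
}

/* Failover: in-memory state is lost and reloaded from what was committed. */
void toy_failover(struct toy_ost *ost)
{
	ost->in_memory_next_seq = ost->committed_next_seq;
	ost->barrier_active = 0;
}
```

In this model, allocating 0x240000bd0 while the barrier is active, failing over, and allocating again returns 0x240000bd0 a second time, which matches the fid and last_fid in the assertion sharing one sequence.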