Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.15.2
None
3
Description
The server crashed in the following test case.
The idea is to create, read, and delete files with mdtest. After mdtest finishes, increase osp.*.max_rpcs_in_flight and osp.*.max_rpcs_in_progress to speed up background object deletion, then change them back to their default values and repeat mdtest.
for i in `seq 10`; do
    salloc -p 40n -N 40 -n 640 --ntasks-per-node=16 /usr/mpi/gcc/openmpi-4.1.4rc1/bin/mpirun /work/tools/bin/mdtest -n 50000 -t -P -G=-1573035764 -d /exafs/home/sihara/mdt0@/exafs/home/sihara/mdt1@/exafs/home/sihara/mdt2@/exafs/home/sihara/mdt3 -x /exafs/home/sihara/stonewall -C -Y -E -u -F -i 1 -r -v
    clush -w root@ai400x2-1-vm[1-4] lctl set_param osp.*.max_rpcs_in_flight=128 osp.*.max_rpcs_in_progress=32768
    sleep 120
    clush -w root@ai400x2-1-vm[1-4] lctl set_param osp.*.max_rpcs_in_flight=8 osp.*.max_rpcs_in_progress=4096
done
Here is the console log from the server at the time of the crash.
[40284.439365] LustreError: 6866:0:(osp_sync.c:644:osp_sync_send_new_rpc()) ASSERTION( atomic_read(&d->opd_sync_rpcs_in_flight) <= d->opd_sync_max_rpcs_in_flight ) failed:
[40284.444545] LustreError: 6866:0:(osp_sync.c:644:osp_sync_send_new_rpc()) LBUG
[40284.446786] Pid: 6866, comm: osp-syn-4-0 3.10.0-1062.18.1.el7_lustre.ddn12.x86_64 #1 SMP Wed Dec 23 06:55:33 PST 2020
[40284.446787] Call Trace:
[40284.446808] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[40284.446813] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[40284.446821] [<0>] osp_sync_send_new_rpc+0xed/0xf0 [osp]
[40284.446827] [<0>] osp_sync_process_record+0x3e9/0x1040 [osp]
[40284.446833] [<0>] osp_sync_process_queues+0x564/0xde0 [osp]
[40284.446863] [<0>] llog_process_thread+0xa2a/0x1b20 [obdclass]
[40284.446880] [<0>] llog_process_or_fork+0xd9/0x560 [obdclass]
[40284.446908] [<0>] llog_cat_process_cb+0x2c1/0x2d0 [obdclass]
[40284.446925] [<0>] llog_process_thread+0xa2a/0x1b20 [obdclass]
[40284.446941] [<0>] llog_process_or_fork+0xd9/0x560 [obdclass]
[40284.446958] [<0>] llog_cat_process_or_fork+0x201/0x3a0 [obdclass]
[40284.446975] [<0>] llog_cat_process+0x2e/0x30 [obdclass]
[40284.446981] [<0>] osp_sync_thread+0x19e/0xc30 [osp]
[40284.446994] [<0>] kthread+0xd1/0xe0
[40284.446998] [<0>] ret_from_fork_nospec_begin+0x7/0x21
[40284.447016] [<0>] 0xfffffffffffffffe
[40284.447027] Kernel panic - not syncing: LBUG
[40284.448636] CPU: 17 PID: 6866 Comm: osp-syn-4-0 Kdump: loaded Tainted: G OE ------------ T 3.10.0-1062.18.1.el7_lustre.ddn12.x86_64 #1
[40284.452338] Hardware name: DDN SFA400NVX2E, BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[40284.454559] Call Trace:
[40284.455935] [<ffffffff95f7b416>] dump_stack+0x19/0x1b
[40284.457623] [<ffffffff95f74a0b>] panic+0xe8/0x21f
[40284.459248] [<ffffffffc095658b>] lbug_with_loc+0x9b/0xa0 [libcfs]
[40284.461041] [<ffffffffc17392ed>] osp_sync_send_new_rpc+0xed/0xf0 [osp]
[40284.462872] [<ffffffffc173e309>] osp_sync_process_record+0x3e9/0x1040 [osp]
[40284.464779] [<ffffffffc0da0360>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[40284.466678] [<ffffffffc173f4c4>] osp_sync_process_queues+0x564/0xde0 [osp]
[40284.468511] [<ffffffff958c7410>] ? wake_up_atomic_t+0x30/0x30
[40284.470211] [<ffffffffc0a50efa>] llog_process_thread+0xa2a/0x1b20 [obdclass]
[40284.472066] [<ffffffffc0a56a58>] ? llog_cat_id2handle+0x3b8/0x670 [obdclass]
[40284.473894] [<ffffffffc173ef60>] ? osp_sync_process_record+0x1040/0x1040 [osp]
[40284.475757] [<ffffffffc0a520c9>] llog_process_or_fork+0xd9/0x560 [obdclass]
[40284.477581] [<ffffffffc0a56e31>] ? llog_cat_process_common+0x121/0x470 [obdclass]
[40284.479457] [<ffffffffc0a580d1>] llog_cat_process_cb+0x2c1/0x2d0 [obdclass]
[40284.481249] [<ffffffffc0a50efa>] llog_process_thread+0xa2a/0x1b20 [obdclass]
[40284.483040] [<ffffffff9596eafd>] ? tracing_record_cmdline+0x1d/0x120
[40284.484755] [<ffffffffc0a57e10>] ? llog_cat_cancel_records+0x1d0/0x1d0 [obdclass]
[40284.486594] [<ffffffffc0a520c9>] llog_process_or_fork+0xd9/0x560 [obdclass]
[40284.488336] [<ffffffff958d7a0f>] ? ttwu_do_activate+0x6f/0x80
[40284.489939] [<ffffffffc0a57e10>] ? llog_cat_cancel_records+0x1d0/0x1d0 [obdclass]
[40284.491736] [<ffffffffc0a54341>] llog_cat_process_or_fork+0x201/0x3a0 [obdclass]
[40284.493493] [<ffffffff958db612>] ? default_wake_function+0x12/0x20
[40284.495084] [<ffffffff958d38c2>] ? __wake_up_common+0x82/0x120
[40284.496625] [<ffffffffc173ef60>] ? osp_sync_process_record+0x1040/0x1040 [osp]
[40284.498322] [<ffffffffc0a5450e>] llog_cat_process+0x2e/0x30 [obdclass]
[40284.499918] [<ffffffffc173b90e>] osp_sync_thread+0x19e/0xc30 [osp]
[40284.501446] [<ffffffff95f80e02>] ? __schedule+0x402/0x840
[40284.502884] [<ffffffffc173b770>] ? osp_sync_process_committed+0xd70/0xd70 [osp]
[40284.504539] [<ffffffff958c6321>] kthread+0xd1/0xe0
[40284.505870] [<ffffffff958c6250>] ? insert_kthread_work+0x40/0x40
[40284.507344] [<ffffffff95f8ed1d>] ret_from_fork_nospec_begin+0x7/0x21
[40284.508850] [<ffffffff958c6250>] ? insert_kthread_work+0x40/0x40
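
One possible reading of the failed assertion, which this report does not confirm: if max_rpcs_in_flight is lowered from 128 back to 8 while more than 8 sync RPCs are still outstanding, the comparison in osp_sync_send_new_rpc() would no longer hold. The sketch below is a toy model of that comparison only. The function name check_in_flight and its arguments are hypothetical and do not read anything from a real Lustre server.

```shell
#!/bin/sh
# Toy model (assumption): mimics the comparison
#   opd_sync_rpcs_in_flight <= opd_sync_max_rpcs_in_flight
# from the assertion in the log. Nothing here talks to Lustre.

# Prints "ok" if in_flight <= max, otherwise "assert-fail".
check_in_flight() {
    in_flight=$1
    max=$2
    if [ "$in_flight" -le "$max" ]; then
        echo "ok"
    else
        echo "assert-fail"
    fi
}

# Limit raised to 128 and the window is assumed to be full.
check_in_flight 128 128

# Limit lowered back to 8 while 128 RPCs are assumed to be outstanding.
check_in_flight 128 8
```

If this reading is right, the crash would depend on whether the in-flight count has drained below the new limit when the parameter is lowered, which would fit a failure that only shows up when the loop is repeated.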