  1. Lustre
  2. LU-16098

LustreError: 6866:0:(osp_sync.c:644:osp_sync_send_new_rpc()) ASSERTION( atomic_read(&d->opd_sync_rpcs_in_flight) <= d->opd_sync_max_rpcs_in_flight ) failed

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.15.2
    • None
    • 3

    Description

      The server crashed while running the following test case.
      The idea is to create, read, and delete files with mdtest. After mdtest finishes, osp.*.max_rpcs_in_flight and osp.*.max_rpcs_in_progress are increased to speed up the background object deletion process, then set back to their default values, and mdtest is run again.

      for i in $(seq 10); do
      	# Create, read, and delete files with mdtest across the four MDT directories.
      	salloc -p 40n -N 40 -n 640 --ntasks-per-node=16 /usr/mpi/gcc/openmpi-4.1.4rc1/bin/mpirun /work/tools/bin/mdtest -n 50000 \
      		-t -P -G=-1573035764 -d /exafs/home/sihara/mdt0@/exafs/home/sihara/mdt1@/exafs/home/sihara/mdt2@/exafs/home/sihara/mdt3 \
      		-x /exafs/home/sihara/stonewall -C -Y -E -u -F -i 1 -r -v

      	# Raise the OSP limits to speed up background object deletion.
      	clush -w root@ai400x2-1-vm[1-4] lctl set_param osp.*.max_rpcs_in_flight=128 osp.*.max_rpcs_in_progress=32768
      	sleep 120
      	# Set the limits back before the next mdtest run.
      	clush -w root@ai400x2-1-vm[1-4] lctl set_param osp.*.max_rpcs_in_flight=8 osp.*.max_rpcs_in_progress=4096
      done
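
The loop above hard-codes 8 and 4096 as the values to set back. As a hedged sketch, the current settings could instead be recorded before they are raised and replayed afterwards. This assumes that `lctl get_param` prints one `name=value` line per parameter and that `lctl set_param` accepts the same `name=value` strings as arguments. The helper names `save_osp_limits` and `restore_args` and the `LCTL` override variable are hypothetical and not part of the original test, which also runs the commands on four servers through clush rather than locally.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the original reproducer.

# Print the current OSP limits, one name=value line per parameter.
# LCTL may be pointed at a stub for dry runs (assumption for illustration).
save_osp_limits() {
	"${LCTL:-lctl}" get_param 'osp.*.max_rpcs_in_flight' 'osp.*.max_rpcs_in_progress'
}

# Read saved name=value lines on stdin and print them as a single
# space-separated argument list; blank or malformed lines are skipped.
restore_args() {
	awk -F= 'NF == 2 && $2 != "" { printf "%s%s=%s", sep, $1, $2; sep = " " }
	         END { if (sep) print "" }'
}
```

On a server, the saved output would be captured once with `saved=$(save_osp_limits)` before the limits are raised, and the word-split output of `printf '%s\n' "$saved" | restore_args` would later be passed to `lctl set_param`.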
      

      Here is the console log from the server crash.

      [40284.439365] LustreError: 6866:0:(osp_sync.c:644:osp_sync_send_new_rpc()) ASSERTION( atomic_read(&d->opd_sync_rpcs_in_flight) <= d->opd_sync_max_rpcs_in_flight ) failed: 
      [40284.444545] LustreError: 6866:0:(osp_sync.c:644:osp_sync_send_new_rpc()) LBUG
      [40284.446786] Pid: 6866, comm: osp-syn-4-0 3.10.0-1062.18.1.el7_lustre.ddn12.x86_64 #1 SMP Wed Dec 23 06:55:33 PST 2020
      [40284.446787] Call Trace:
      [40284.446808] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [40284.446813] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [40284.446821] [<0>] osp_sync_send_new_rpc+0xed/0xf0 [osp]
      [40284.446827] [<0>] osp_sync_process_record+0x3e9/0x1040 [osp]
      [40284.446833] [<0>] osp_sync_process_queues+0x564/0xde0 [osp]
      [40284.446863] [<0>] llog_process_thread+0xa2a/0x1b20 [obdclass]
      [40284.446880] [<0>] llog_process_or_fork+0xd9/0x560 [obdclass]
      [40284.446908] [<0>] llog_cat_process_cb+0x2c1/0x2d0 [obdclass]
      [40284.446925] [<0>] llog_process_thread+0xa2a/0x1b20 [obdclass]
      [40284.446941] [<0>] llog_process_or_fork+0xd9/0x560 [obdclass]
      [40284.446958] [<0>] llog_cat_process_or_fork+0x201/0x3a0 [obdclass]
      [40284.446975] [<0>] llog_cat_process+0x2e/0x30 [obdclass]
      [40284.446981] [<0>] osp_sync_thread+0x19e/0xc30 [osp]
      [40284.446994] [<0>] kthread+0xd1/0xe0
      [40284.446998] [<0>] ret_from_fork_nospec_begin+0x7/0x21
      [40284.447016] [<0>] 0xfffffffffffffffe
      [40284.447027] Kernel panic - not syncing: LBUG
      [40284.448636] CPU: 17 PID: 6866 Comm: osp-syn-4-0 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.18.1.el7_lustre.ddn12.x86_64 #1
      [40284.452338] Hardware name: DDN SFA400NVX2E, BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [40284.454559] Call Trace:
      [40284.455935]  [<ffffffff95f7b416>] dump_stack+0x19/0x1b
      [40284.457623]  [<ffffffff95f74a0b>] panic+0xe8/0x21f
      [40284.459248]  [<ffffffffc095658b>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [40284.461041]  [<ffffffffc17392ed>] osp_sync_send_new_rpc+0xed/0xf0 [osp]
      [40284.462872]  [<ffffffffc173e309>] osp_sync_process_record+0x3e9/0x1040 [osp]
      [40284.464779]  [<ffffffffc0da0360>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
      [40284.466678]  [<ffffffffc173f4c4>] osp_sync_process_queues+0x564/0xde0 [osp]
      [40284.468511]  [<ffffffff958c7410>] ? wake_up_atomic_t+0x30/0x30
      [40284.470211]  [<ffffffffc0a50efa>] llog_process_thread+0xa2a/0x1b20 [obdclass]
      [40284.472066]  [<ffffffffc0a56a58>] ? llog_cat_id2handle+0x3b8/0x670 [obdclass]
      [40284.473894]  [<ffffffffc173ef60>] ? osp_sync_process_record+0x1040/0x1040 [osp]
      [40284.475757]  [<ffffffffc0a520c9>] llog_process_or_fork+0xd9/0x560 [obdclass]
      [40284.477581]  [<ffffffffc0a56e31>] ? llog_cat_process_common+0x121/0x470 [obdclass]
      [40284.479457]  [<ffffffffc0a580d1>] llog_cat_process_cb+0x2c1/0x2d0 [obdclass]
      [40284.481249]  [<ffffffffc0a50efa>] llog_process_thread+0xa2a/0x1b20 [obdclass]
      [40284.483040]  [<ffffffff9596eafd>] ? tracing_record_cmdline+0x1d/0x120
      [40284.484755]  [<ffffffffc0a57e10>] ? llog_cat_cancel_records+0x1d0/0x1d0 [obdclass]
      [40284.486594]  [<ffffffffc0a520c9>] llog_process_or_fork+0xd9/0x560 [obdclass]
      [40284.488336]  [<ffffffff958d7a0f>] ? ttwu_do_activate+0x6f/0x80
      [40284.489939]  [<ffffffffc0a57e10>] ? llog_cat_cancel_records+0x1d0/0x1d0 [obdclass]
      [40284.491736]  [<ffffffffc0a54341>] llog_cat_process_or_fork+0x201/0x3a0 [obdclass]
      [40284.493493]  [<ffffffff958db612>] ? default_wake_function+0x12/0x20
      [40284.495084]  [<ffffffff958d38c2>] ? __wake_up_common+0x82/0x120
      [40284.496625]  [<ffffffffc173ef60>] ? osp_sync_process_record+0x1040/0x1040 [osp]
      [40284.498322]  [<ffffffffc0a5450e>] llog_cat_process+0x2e/0x30 [obdclass]
      [40284.499918]  [<ffffffffc173b90e>] osp_sync_thread+0x19e/0xc30 [osp]
      [40284.501446]  [<ffffffff95f80e02>] ? __schedule+0x402/0x840
      [40284.502884]  [<ffffffffc173b770>] ? osp_sync_process_committed+0xd70/0xd70 [osp]
      [40284.504539]  [<ffffffff958c6321>] kthread+0xd1/0xe0
      [40284.505870]  [<ffffffff958c6250>] ? insert_kthread_work+0x40/0x40
      [40284.507344]  [<ffffffff95f8ed1d>] ret_from_fork_nospec_begin+0x7/0x21
      [40284.508850]  [<ffffffff958c6250>] ? insert_kthread_work+0x40/0x40
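
The failed check is the one printed in the first log line: the in-flight sync RPC counter must not exceed the configured maximum. A minimal sketch of that comparison follows. The scenario it uses, where the limit drops from 128 to 8 while RPCs issued under the higher limit are still outstanding, is an assumption drawn from the test loop and not a root cause confirmed in this report. The function name `limit_violated` is hypothetical and this is not Lustre code.

```shell
#!/bin/sh
# Hypothetical illustration of the asserted condition; not Lustre code.
# Returns 0 (true) when in_flight exceeds max, that is, when
# ASSERTION( in_flight <= max ) would fail.
limit_violated() {
	in_flight=$1
	max=$2
	[ "$in_flight" -gt "$max" ]
}

# Assumed scenario: 128 RPCs still outstanding when the limit is set to 8.
if limit_violated 128 8; then
	echo "assertion would fail: 128 in flight > max 8"
fi
```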
      

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Shuichi Ihara (sihara)