[LU-14302] obdfilter-survey test 2b hangs in case=netdisk Created: 06/Jan/21  Updated: 21/Dec/23

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.12.7, Lustre 2.15.0, Lustre 2.15.3, Lustre 2.15.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-15756 obdfilter-survey test_2b:osc_request.... Resolved
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

obdfilter-survey test_2b started hanging on 02 Oct 2020; logs are at https://testing.whamcloud.com/test_sets/258c8761-9d71-4d82-8a5e-1c473e5a089c. So far, this hang has only been seen with ZFS.

Looking at the suite_log for the failure at https://testing.whamcloud.com/test_sessions/296ca0ce-3daa-42aa-a92d-cdbdb7af75da, the last test run is the netdisk case:

Unable to detect ip address for host: ''
+ NETTYPE=tcp thrlo=4 nobjhi=1 thrhi=4 size=1024 case=netdisk rslt_loc=/tmp targets="10.9.7.117:lustre-OST0000 10.9.7.117:lustre-OST0001 10.9.7.117:lustre-OST0002 10.9.7.117:lustre-OST0003 10.9.7.117:lustre-OST0004 10.9.7.117:lustre-OST0005 10.9.7.117:lustre-OST0006" /usr/bin/obdfilter-survey
Fri Dec 25 12:00:08 UTC 2020 Obdfilter-survey for case=netdisk from trevis-201vm1.trevis.whamcloud.com
ost  7 sz  7340032K rsz 1024K obj    7 thr   28 write    5.28 [   0.00,    8.15] rewrite    4.88 [   0.00,    4.00] read 

We see some errors in the console logs. On client1 (vm1), there is an error message from osc_init_grant():

 [22622.025189] Lustre: DEBUG MARKER: == obdfilter-survey test 2b: Stripe F/S over the Network, async journal ============================== 12:00:01 (1608897601)
[25456.252361] LustreError: 200837:0:(osc_request.c:1043:osc_init_grant()) lustre-OST0003_osc: granted 368640 but already consumed 1048576
[26276.964898] sysrq: SysRq : Changing Loglevel
[26276.965724] sysrq: Loglevel set to 8
[26276.967609] sysrq: SysRq : Show State
[26276.968269]   task                        PC stack   pid father
[26276.969210] systemd         S    0     1      0 0x00000000
[26276.970088] Call Trace:
[26276.970548]  __schedule+0x2a6/0x700
[26276.971139]  schedule+0x38/0xa0
[26276.971677]  schedule_hrtimeout_range_clock+0xf7/0x110
[26276.972514]  ? ep_scan_ready_list.constprop.22+0x20c/0x230
[26276.973406]  ep_poll+0x3c8/0x410
[26276.973953]  ? wake_up_q+0x80/0x80
[26276.974534]  do_epoll_wait+0xb0/0xd0
[26276.975131]  __x64_sys_epoll_wait+0x1a/0x20
[26276.975821]  do_syscall_64+0x5b/0x1a0
[26276.976447]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[26276.977339] RIP: 0033:0x7f023c889257
[26276.977937] Code: Bad RIP value.
[26276.978495] RSP: 002b:00007ffc57bbe450 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[26276.979678] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f023c889257
[26276.980796] RDX: 0000000000000056 RSI: 00007ffc57bbe490 RDI: 0000000000000004
[26276.981913] RBP: 00007ffc57bbe490 R08: 0000000000000000 R09: 7465677261742e79
[26276.983033] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000056
[26276.984152] R13: 00000000ffffffff R14: 00007ffc57bbe490 R15: 0000000000000001
[26276.985270] kthreadd        S    0     2      0 0x80004000
[26276.986143] Call Trace:
[26276.986580]  __schedule+0x2a6/0x700
[26276.987166]  schedule+0x38/0xa0
[26276.987703]  kthreadd+0x2e8/0x300
[26276.988275]  ? kthread_create_on_cpu+0xa0/0xa0
[26276.989001]  ret_from_fork+0x35/0x40
[26276.989601] rcu_gp          I    0     3      2 0x80004000
[26276.990486] Call Trace:
[26276.990916]  __schedule+0x2a6/0x700
[26276.991509]  schedule+0x38/0xa0
[26276.992048]  rescuer_thread+0x29e/0x340
[26276.992686]  ? process_one_work+0x360/0x360
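
For reference, the osc_init_grant() message above comes from a client-side consistency check: on (re)connect the server reports how much grant the client holds, and the client compares that against what it has already consumed for dirty/in-flight pages. A minimal userspace sketch of the shape of that check (names and values are illustrative assumptions, not the actual osc_request.c code):

#include <stdio.h>

/* Illustrative stand-in for the client import's grant accounting. */
struct client_grant {
	long granted;   /* grant the server says we hold (bytes) */
	long consumed;  /* bytes already pinned by dirty/in-flight pages */
	long avail;     /* grant still usable for new cached writes */
};

static void init_grant(struct client_grant *cli, const char *osc)
{
	if (cli->granted < cli->consumed) {
		/* mirrors "granted 368640 but already consumed 1048576" */
		fprintf(stderr, "%s: granted %ld but already consumed %ld\n",
			osc, cli->granted, cli->consumed);
		cli->avail = 0;	/* assumption: no grant left, writes must wait */
	} else {
		cli->avail = cli->granted - cli->consumed;
	}
}

int main(void)
{
	struct client_grant cli = { .granted = 368640, .consumed = 1048576 };

	init_grant(&cli, "lustre-OST0003_osc");
	printf("avail grant = %ld bytes\n", cli.avail);
	return 0;
}

If the client ends up with no usable grant while the server believes it holds some, cached writes can stall waiting for grant that never arrives, which would be consistent with the hang.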

In the OSS (vm3) console log, we see some grant errors:

[22642.753377] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param obdfilter.lustre-*.sync_journal=0
[22643.253588] LustreError: 366155:0:(tgt_grant.c:919:tgt_grant_alloc()) lustre-OST0000: client lustre-OST0000_osc_UUID/00000000089e081d requesting > max (2147483647), 34359738368
[22643.256053] LustreError: 366155:0:(tgt_grant.c:919:tgt_grant_alloc()) Skipped 6 previous similar messages
[26311.031755] sysrq: SysRq : Changing Loglevel
[26311.032550] sysrq: Loglevel set to 8
[26311.034388] sysrq: SysRq : Show State
[26311.034963]   task                        PC stack   pid father
[26311.035888] systemd         S    0     1      0 0x00000000
[26311.036726] Call Trace:
[26311.037144]  __schedule+0x2a6/0x700
[26311.037701]  schedule+0x38/0xa0
[26311.038200]  schedule_hrtimeout_range_clock+0xf7/0x110
[26311.038999]  ? ep_scan_ready_list.constprop.22+0x20c/0x230
[26311.039846]  ep_poll+0x3c8/0x410
[26311.040381]  ? wake_up_q+0x80/0x80
[26311.040903]  do_epoll_wait+0xb0/0xd0
[26311.041468]  __x64_sys_epoll_wait+0x1a/0x20
[26311.042113]  do_syscall_64+0x5b/0x1a0
[26311.042695]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[26311.043491] RIP: 0033:0x7ffbb0799257
[26311.044053] Code: Bad RIP value.
[26311.044562] RSP: 002b:00007fffda2f4c30 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[26311.045720] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007ffbb0799257
[26311.046806] RDX: 0000000000000057 RSI: 00007fffda2f4c70 RDI: 0000000000000004
[26311.047893] RBP: 00007fffda2f4c70 R08: 0000000000000000 R09: 7465677261742e79
[26311.048978] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000057
[26311.050066] R13: 00000000ffffffff R14: 00007fffda2f4c70 R15: 0000000000000001
[26311.051152] kthreadd        S    0     2      0 0x80004000
[26311.052001] Call Trace:
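
The tgt_grant_alloc() message is the server-side counterpart: the grant a client may request is bounded by 2147483647 (INT_MAX) bytes, and here the client asked for 34359738368 (32 GiB). A minimal sketch of that bound check, assuming the oversized request is simply refused (illustrative only, not the actual tgt_grant.c code):

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

/* Illustrative stand-in for the server-side grant allocation bound. */
static long grant_alloc(const char *tgt, const char *cli, uint64_t want)
{
	if (want > INT_MAX) {
		/* mirrors "requesting > max (2147483647), 34359738368" */
		fprintf(stderr, "%s: client %s requesting > max (%d), %llu\n",
			tgt, cli, INT_MAX, (unsigned long long)want);
		return 0;	/* assumption: refuse the bogus request */
	}
	/* the real code would also check free space before granting */
	return (long)want;
}

int main(void)
{
	long g = grant_alloc("lustre-OST0000", "lustre-OST0000_osc_UUID",
			     34359738368ULL);

	printf("granted %ld bytes\n", g);
	return 0;
}

A 32 GiB request suggests the client's requested-grant value went negative or overflowed somewhere, which would fit the inconsistent grant accounting seen on the client side.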

We’ve seen this hang in the past at
https://testing.whamcloud.com/test_sets/c037a9f9-459a-41c2-9eea-ea54cb91eb3b



 Comments   
Comment by James Nunez (Inactive) [ 20/Jan/21 ]

sanity-benchmark test_fsx has a similar issue, with the following call trace on each node:

[17061.275242] sysrq: SysRq : Changing Loglevel
[17061.276192] sysrq: Loglevel set to 8
[17061.278000] sysrq: SysRq : Show State
[17061.278766]   task                        PC stack   pid father
[17061.279880] systemd         S    0     1      0 0x00000000
[17061.280938] Call Trace:
[17061.281501]  __schedule+0x2a6/0x700
[17061.282203]  schedule+0x38/0xa0
[17061.282834]  schedule_hrtimeout_range_clock+0xf7/0x110
[17061.283834]  ? ep_scan_ready_list.constprop.22+0x20c/0x230
[17061.284870]  ep_poll+0x3c8/0x410
[17061.285565]  ? wake_up_q+0x80/0x80
[17061.286245]  do_epoll_wait+0xb0/0xd0
[17061.286954]  __x64_sys_epoll_wait+0x1a/0x20
[17061.287781]  do_syscall_64+0x5b/0x1a0
[17061.288537]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[17061.289513] RIP: 0033:0x7f1833964257
[17061.290221] Code: Bad RIP value.
[17061.290865] RSP: 002b:00007fffdf992860 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[17061.292251] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f1833964257
[17061.293825] RDX: 0000000000000055 RSI: 00007fffdf9928a0 RDI: 0000000000000004
[17061.295585] RBP: 00007fffdf9928a0 R08: 0000000000000000 R09: 7465677261742e79
[17061.297085] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000055
[17061.298425] R13: 00000000ffffffff R14: 00007fffdf9928a0 R15: 0000000000000001
[17061.299748] kthreadd        S    0     2      0 0x80004000

https://testing.whamcloud.com/test_sets/55bc0e09-caa5-4d08-a4b5-7b53173af78a

Comment by James Nunez (Inactive) [ 15/Feb/21 ]

Here is a small fraction of the test hangs attributed to this ticket. I think the true stack trace/issue for each hang below is being masked by the entry_SYSCALL_64 call trace; see the sketch after the list for one way to filter out that noise.
sanity-pcc test 1a - https://testing.whamcloud.com/test_sets/1fcf362d-9c64-4b02-9e41-6a213615adb2
sanity test 413b - https://testing.whamcloud.com/test_sets/25bd46f6-32a6-4508-ace2-b83c79b50249
sanity-benchmark test fsx - https://testing.whamcloud.com/test_sets/fbf24af3-578f-44c0-9992-95f517bc5028
sanity-quota test 61 - https://testing.whamcloud.com/test_sets/17cb0fe2-25d9-4ab2-8d52-2e98192bf50a
parallel-scale-nfsv3 test racer_on_nfs - https://testing.whamcloud.com/test_sets/163068e8-97c2-4470-b009-856fb6f8b0c5
obdfilter-survey test 2b - https://testing.whamcloud.com/test_sets/f892168b-a967-4d6c-8c94-c1475c82420b
sanity test 27F - https://testing.whamcloud.com/test_sets/65c9dadf-a86b-4919-8af2-f7b6c195c4e1
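
As a small helper for the masking problem above (an illustrative assumption, not part of the test framework): rather than reading the full sysrq-t dump, one can list only the tasks in uninterruptible sleep ('D'), since tasks idle in epoll_wait, like systemd in the traces above, report state 'S':

#include <stdio.h>
#include <dirent.h>
#include <ctype.h>

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;

	if (proc == NULL)
		return 1;
	while ((de = readdir(proc)) != NULL) {
		char path[300], comm[64], state;
		int pid;
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;	/* not a pid directory */
		snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
		f = fopen(path, "r");
		if (f == NULL)
			continue;	/* task exited meanwhile */
		/* /proc/[pid]/stat starts with: pid (comm) state ... */
		if (fscanf(f, "%d (%63[^)]) %c", &pid, comm, &state) == 3 &&
		    state == 'D')
			printf("%6d  %s\n", pid, comm);
		fclose(f);
	}
	closedir(proc);
	return 0;
}

On a test node the quicker equivalent is "echo w > /proc/sysrq-trigger", since sysrq-w dumps only blocked (D-state) tasks instead of every task on the system.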

Comment by James Nunez (Inactive) [ 09/Jul/21 ]

Another test hang, for sanity-sec test 25 on 2.12.7 RC1, at https://testing.whamcloud.com/test_sets/3abb9645-7bc7-4630-bc41-557da7b04375

On client1 and client2, we see the mount point is 'still busy':

[17855.295093] Lustre: DEBUG MARKER: == sanity-sec test 25: test save and reload nodemap config =========================================== 16:59:52 (1625072392)
[17855.928251] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
[17855.928251] if [ $running -ne 0 ] ; then
[17855.928251] echo Stopping client $(hostname) /mnt/lustre opts:;
[17855.928251] lsof /mnt/lustre || need_kill=no;
[17855.928251] if [ x != x -a x$need_kill != xno ]; then
[17855.928251]     pids=$(lsof -t /mnt/lustre | sort -u);
[17855.928251]     if 
[21488.693672] sysrq: SysRq : Changing Loglevel
[21488.694779] sysrq: Loglevel set to 8
[21488.696709] sysrq: SysRq : Show State
[21488.697557]   task                        PC stack   pid father
[21488.698776] systemd         S    0     1      0 0x00000000
[21488.699937] Call Trace:
[21488.700608]  __schedule+0x292/0x880
[21488.700631]  schedule+0x32/0x80
[21488.702137]  schedule_hrtimeout_range_clock+0x106/0x110
[21488.702169]  ? ep_scan_ready_list.isra.12+0x1af/0x1d0
[21488.704339]  ep_poll+0x2b6/0x360
[21488.704368]  ? wake_up_q+0x70/0x70
[21488.704371]  SyS_epoll_wait+0xb2/0xe0
[21488.706704]  do_syscall_64+0x74/0x160
[21488.706722]  entry_SYSCALL_64_after_hwframe+0x59/0xbe
[21488.708632] RIP: 0033:0x7f5917ae78a3
[21488.708633] RSP: 002b:00007ffebbb08010 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[21488.708634] RAX: ffffffffffffffda RBX: 00005607a2f63de0 RCX: 00007f5917ae78a3
[21488.708635] RDX: 0000000000000031 RSI: 00007ffebbb08020 RDI: 0000000000000004
[21488.708636] RBP: 00007ffebbb08370 R08: 5566e87192e6acdb R09: 00005607a2d099c0
[21488.708636] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000001
[21488.708638] R13: 00007ffebbb08020 R14: ffffffffffffffff R15: 0000000000000000
[21488.718338] kthreadd        S    0     2      0 0x00000000

Comment by Sarah Liu [ 22/Feb/22 ]

Hit a similar error on master with ZFS; obdfilter-survey and replay-dual hung with the following trace:
https://testing.whamcloud.com/test_sets/42fe79b4-b112-43e3-b813-ac9ffbe552c7
https://testing.whamcloud.com/test_sets/9d7643af-3d54-4dbc-b445-c2bdce57d12d

[17639.357284] Lustre: DEBUG MARKER: == obdfilter-survey test 2a: Stripe F/S over the Network ========================================================== 03:48:52 (1644378532)
[24905.814376] sysrq: SysRq : Changing Loglevel
[24905.815333] sysrq: Loglevel set to 8
[24905.819043] sysrq: SysRq : Show State
[24905.819814] task:systemd         state:S stack:    0 pid:    1 ppid:     0 flags:0x00000000
[24905.821371] Call Trace:
[24905.821936]  __schedule+0x2c4/0x700
[24905.822647]  schedule+0x37/0xa0
[24905.823246]  schedule_hrtimeout_range_clock+0xf7/0x110
[24905.824250]  ? ep_scan_ready_list.constprop.23+0x20c/0x230
[24905.825291]  ep_poll+0x403/0x460
[24905.825968]  ? wake_up_q+0x80/0x80
[24905.826636]  do_epoll_wait+0xb0/0xd0
[24905.827316]  __x64_sys_epoll_wait+0x1a/0x20
[24905.828125]  do_syscall_64+0x5b/0x1a0
[24905.828857]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[24905.829846] RIP: 0033:0x7f0e2cb5a0f7
[24905.830547] Code: Unable to access opcode bytes at RIP 0x7f0e2cb5a0cd.
[24905.831770] RSP: 002b:00007fffa13c46a0 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[24905.833189] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f0e2cb5a0f7
[24905.834530] RDX: 0000000000000059 RSI: 00007fffa13c46e0 RDI: 0000000000000004
[24905.835860] RBP: 00007fffa13c46e0 R08: 0000000000000000 R09: 00007fffa13c4c70
[24905.837189] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000059
[24905.838516] R13: 00000000ffffffff R14: 00007fffa13c46e0 R15: 0000000000000001
[24905.839849] task:kthreadd        state:S stack:    0 pid:    2 ppid:     0 flags:0x80004000
[24905.841412] Call Trace:
[24905.841893]  __schedule+0x2c4/0x700
[24905.842565]  schedule+0x37/0xa0
[24905.843179]  kthreadd+0x2e8/0x300
[24905.843843]  ? kthread_create_on_cpu+0xa0/0xa0
[24905.844704]  ret_from_fork+0x35/0x40
[24905.845398] task:rcu_gp          state:I stack:    0 pid:    3 ppid:     2 flags:0x80004000
[24905.846963] Call Trace:
[24905.847465]  __schedule+0x2c4/0x700
[24905.848134]  schedule+0x37/0xa0
[24905.848752]  rescuer_thread+0x29e/0x340
[24905.849501]  ? process_one_work+0x360/0x360
[24905.850295]  kthread+0x116/0x130
[24905.850933]  ? kthread_flush_work_fn+0x10/0x10
[24905.851787]  ret_from_fork+0x35/0x40
[24905.852489] task:rcu_par_gp      state:I stack:    0 pid:    4 ppid:     2 flags:0x80004000
[24905.854047] Call Trace:
...

Another one on an ldiskfs DNE config, parallel-scale:
https://testing.whamcloud.com/test_sets/28eb47b3-d0c0-4184-9dbb-561da28a53b9

Comment by Andreas Dilger [ 30/Nov/22 ]

Has not failed in 6 months.

Comment by Jian Yu [ 14/Dec/22 ]

sanity test 24v on b2_15 branch:
https://testing.whamcloud.com/test_sets/3ec921b2-742b-41af-b0c3-8914b574c31b

Comment by Sarah Liu [ 20/Dec/22 ]

Hit this again in 2.15.2-rc1:

https://testing.whamcloud.com/test_sets/5ed91b10-10ab-4dd9-b960-6a72d84f11df

client

[Sun Dec 11 23:16:42 2022] Lustre: 19950:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670800585/real 1670800585] req@00000000532f8e4e x1751941312768064/t0(0) o4->lustre-OST0004_osc@10.240.39.106@tcp:6/4 lens 488/448 e 0 to 1 dl 1670800604 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'ptlrpcd_01_00.0'
[Sun Dec 11 23:16:42 2022] Lustre: lustre-OST0004_osc: Connection to lustre-OST0004 (at 10.240.39.106@tcp) was lost; in progress operations using this service will wait for recovery to complete
[Sun Dec 11 23:16:42 2022] Lustre: lustre-OST0004_osc: Connection restored to host_6_UUID (at 10.240.39.106@tcp)
[Sun Dec 11 23:24:14 2022] LustreError: 19947:0:(osc_request.c:1037:osc_init_grant()) lustre-OST0005_osc: granted 372736 but already consumed 1048576
[Sun Dec 11 23:24:27 2022] Lustre: 19950:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670801059/real 1670801059] req@00000000c6ac8d21 x1751941313020928/t0(0) o4->lustre-OST0003_osc@10.240.39.106@tcp:6/4 lens 488/448 e 0 to 1 dl 1670801069 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'ptlrpcd_01_00.0'
[Sun Dec 11 23:24:27 2022] Lustre: 19948:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670801059/real 1670801059] req@000000009b3144c2 x1751941313020800/t0(0) o4->lustre-OST0003_osc@10.240.39.106@tcp:6/4 lens 488/448 e 0 to 1 dl 1670801069 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'ptlrpcd_00_00.0'
[Sun Dec 11 23:24:27 2022] Lustre: 19950:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[Sun Dec 11 23:24:27 2022] Lustre: 19948:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[Sun Dec 11 23:24:27 2022] Lustre: lustre-OST0003_osc: Connection to lustre-OST0003 (at 10.240.39.106@tcp) was lost; in progress operations using this service will wait for recovery to complete
[Sun Dec 11 23:24:27 2022] Lustre: Skipped 1 previous similar message
[Sun Dec 11 23:24:27 2022] Lustre: lustre-OST0003_osc: Connection restored to host_6_UUID (at 10.240.39.106@tcp)
[Sun Dec 11 23:24:27 2022] Lustre: Skipped 1 previous similar message
[Sun Dec 11 23:55:08 2022] LustreError: 19947:0:(osc_request.c:1037:osc_init_grant()) lustre-OST0003_osc: granted 372736 but already consumed 1048576


OSS

[Sun Dec 11 23:07:10 2022] Lustre: lustre-OST0000: Client lustre-OST0000_osc_UUID (at 10.240.39.104@tcp) reconnecting
[Sun Dec 11 23:08:25 2022] Lustre: lustre-OST0005: Client lustre-OST0005_osc_UUID (at 10.240.39.104@tcp) reconnecting
[Sun Dec 11 23:08:30 2022] LustreError: 353209:0:(ldlm_lib.c:3490:target_bulk_io()) @@@ bulk WRITE failed: rc = -107 req@00000000b67af265 x1751941312481472/t0(0) o4->lustre-OST0005_osc_UUID@10.240.39.104@tcp:0/0 lens 488/448 e 0 to 0 dl 1670800121 ref 1 fl Interpret:/0/0 rc 0/0 job:'ptlrpcd_00_01.0'
[Sun Dec 11 23:08:30 2022] LustreError: 353209:0:(ldlm_lib.c:3490:target_bulk_io()) Skipped 6 previous similar messages
[Sun Dec 11 23:08:30 2022] Lustre: lustre-OST0005: Bulk IO write error with lustre-OST0005_osc_UUID (at 10.240.39.104@tcp), client will retry: rc = -107
[Sun Dec 11 23:08:30 2022] Lustre: Skipped 1 previous similar message
[Sun Dec 11 23:09:46 2022] Lustre: 353215:0:(service.c:2329:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/1s); client may timeout req@0000000005918235 x1751941312522240/t8589941733(0) o4->lustre-OST0004_osc_UUID@10.240.39.104@tcp:0/0 lens 488/448 e 0 to 0 dl 1670800187 ref 1 fl Complete:/0/0 rc 0/0 job:'ptlrpcd_00_00.0'
[Sun Dec 11 23:14:23 2022] Lustre: lustre-OST0004: Client lustre-OST0004_osc_UUID (at 10.240.39.104@tcp) reconnecting
[Sun Dec 11 23:14:25 2022] LustreError: 353232:0:(ldlm_lib.c:3490:target_bulk_io()) @@@ bulk WRITE failed: rc = -107 req@00000000e6dde848 x1751941312695872/t0(0) o4->lustre-OST0003_osc_UUID@10.240.39.104@tcp:0/0 lens 488/448 e 0 to 0 dl 1670800483 ref 1 fl Interpret:/0/0 rc 0/0 job:'ptlrpcd_01_01.0'
[Sun Dec 11 23:14:25 2022] Lustre: lustre-OST0003: Bulk IO write error with lustre-OST0003_osc_UUID (at 10.240.39.104@tcp), client will retry: rc = -107
[Sun Dec 11 23:14:28 2022] LustreError: 353216:0:(ldlm_lib.c:3490:target_bulk_io()) @@@ bulk WRITE failed: rc = -107 req@0000000004060a7c x1751941312694464/t0(0) o4->lustre-OST0004_osc_UUID@10.240.39.104@tcp:0/0 lens 488/448 e 0 to 0 dl 1670800480 ref 1 fl Interpret:/0/0 rc 0/0 job:'ptlrpcd_00_00.0'
[Sun Dec 11 23:14:28 2022] Lustre: lustre-OST0004: Bulk IO write error with lustre-OST0004_osc_UUID (at 10.240.39.104@tcp), client will retry: rc = -107
[Sun Dec 11 23:14:52 2022] Lustre: lustre-OST0006: Client lustre-OST0006_osc_UUID (at 10.240.39.104@tcp) reconnecting
