[LU-3239] ofd_internal.h:518:ofd_info_init()) ASSERTION( info ) Created: 28/Apr/13  Updated: 02/May/13  Resolved: 02/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Minh Diep Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: LB
Environment:

https://maloo.whamcloud.com/test_sets/0e21f9ba-afec-11e2-b8a3-52540035b04c


Issue Links:
Blocker
is blocking LU-398 NRS (Network Request Scheduler) Resolved
Severity: 3
Rank (Obsolete): 7954

 Description   

Lustre: DEBUG MARKER: == parallel-scale test simul: simul == 22:08:33 (1367125713)
LustreError: 19190:0:(ofd_internal.h:518:ofd_info_init()) ASSERTION( info ) failed:
LustreError: 19182:0:(ofd_internal.h:518:ofd_info_init()) ASSERTION( info ) failed:
LustreError: 19182:0:(ofd_internal.h:518:ofd_info_init()) LBUG
Pid: 19182, comm: ll_ost_io00_001

Call Trace:
[<ffffffffa044e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa044ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0e03e62>] ofd_info_init+0x92/0x130 [ofd]
[<ffffffffa0e05835>] ofd_get_info+0x2e5/0xa90 [ofd]
[<ffffffff812805cd>] ? pointer+0x8d/0x830
[<ffffffffa029f7e5>] ? lprocfs_counter_add+0x125/0x182 [lvfs]
[<ffffffffa078528a>] nrs_orr_range_fill_physical+0x18a/0x540 [ptlrpc]
[<ffffffffa0762dd6>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
[<ffffffffa073e630>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
[<ffffffffa07871d7>] nrs_orr_res_get+0x817/0xb80 [ptlrpc]
[<ffffffffa077d306>] nrs_resource_get+0x56/0x110 [ptlrpc]
[<ffffffffa077dccb>] nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
[<ffffffffa0780248>] ptlrpc_nrs_req_initialize+0x38/0x90 [ptlrpc]
[<ffffffffa074cff0>] ptlrpc_main+0x1170/0x16f0 [ptlrpc]
[<ffffffffa074be80>] ? ptlrpc_main+0x0/0x16f0 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa074be80>] ? ptlrpc_main+0x0/0x16f0 [ptlrpc]
[<ffffffffa074be80>] ? ptlrpc_main+0x0/0x16f0 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1367125716.19182
LustreError: 19190:0:(ofd_internal.h:518:ofd_info_init()) LBUG
Pid: 19190, comm: ll_ost_io03_000

Call Trace:
[<ffffffffa044e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa044ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0e03e62>] ofd_info_init+0x92/0x130 [ofd]
[<ffffffffa0e05835>] ofd_get_info+0x2e5/0xa90 [ofd]
[<ffffffff812805cd>] ? pointer+0x8d/0x830
[<ffffffffa029f7e5>] ? lprocfs_counter_add+0x125/0x182 [lvfs]
[<ffffffffa078528a>] nrs_orr_range_fill_physical+0x18a/0x540 [ptlrpc]
[<ffffffffa0762dd6>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
[<ffffffffa073e630>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
[<ffffffffa07871d7>] nrs_orr_res_get+0x817/0xb80 [ptlrpc]
[<ffffffffa077d306>] nrs_resource_get+0x56/0x110 [ptlrpc]
[<ffffffffa077dccb>] nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
[<ffffffffa0780248>] ptlrpc_nrs_req_initialize+0x38/0x90 [ptlrpc]
[<ffffffffa074cff0>] ptlrpc_main+0x1170/0x16f0 [ptlrpc]
[<ffffffffa074be80>] ? ptlrpc_main+0x0/0x16f0 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa074be80>] ? ptlrpc_main+0x0/0x16f0 [ptlrpc]
[<ffffffffa074be80>] ? ptlrpc_main+0x0/0x16f0 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1367125719.19190
LNet: Service thread pid 19182 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 19182, comm: ll_ost_io00_001

I set the NRS policy on the MDS to crrn, on oss02 to orr, and on oss03 to trr, then ran parallel-scale.



 Comments   
Comment by Nathaniel Clark [ 29/Apr/13 ]

Possibly related: a similar stack trace, but from the next ASSERTION line (sanityn test_71):

https://maloo.whamcloud.com/test_sets/4c4286e0-b0de-11e2-bece-52540035b04c
https://maloo.whamcloud.com/test_sets/f78b8b88-b100-11e2-b2c4-52540035b04c

Comment by Di Wang [ 29/Apr/13 ]

It seems to me that NRS causes get_info to step into the OFD before the OFD environment is initialized.

Comment by Andreas Dilger [ 29/Apr/13 ]

Alex, should the OFD environment already be initialized by this point? Is there something that can easily be done to avoid this problem?

Comment by Di Wang [ 29/Apr/13 ]

IMHO, we can simply add lu_env_refill in ofd_get_info. I will cook a patch now.

Comment by Di Wang [ 29/Apr/13 ]

http://review.whamcloud.com/6204

Comment by Minh Diep [ 30/Apr/13 ]

Wang Di,

I hit this after running your patch

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:05.0/0000:84:00.0/local_cpus
CPU 0
Modules linked in: osp(U) ofd(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) jbd2 mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode serio_raw mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core ses enclosure sg igb dca ptp pps_core ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 13927, comm: ll_ost_io00_003 Not tainted 2.6.32-358.6.1.el6_lustre.gce942d0.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffffa05aa678>] [<ffffffffa05aa678>] keys_fill+0x48/0x190 [obdclass]
RSP: 0018:ffff8806a9a4d9c0 EFLAGS: 00010246
RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000000000000f8 RSI: ffff8806a9a4dbd6 RDI: ffff88069d88af40
RBP: ffff8806a9a4da00 R08: 0000000000000000 R09: ffff8806a9a4dcd0
R10: 000000000000000f R11: 0000000000000001 R12: ffff88069d88af40
R13: ffffffffa0613a00 R14: ffff8806a9a4dbd0 R15: ffff880fb6667800
FS: 00007f4b3d486700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006d4420 CR3: 00000010335b5000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ll_ost_io00_003 (pid: 13927, threadinfo ffff8806a9a4c000, task ffff8806b04deaa0)
Stack:
ffffffffffffffff 0000000000000000 0000000000000001 ffff8806ae9a06c0
<d> ffff880fccc03000 0000000000000000 ffff8806a9a4dbd0 ffff880fb6667800
<d> ffff8806a9a4da10 ffffffffa05aa7dd ffff8806a9a4da30 ffffffffa05aa818
Call Trace:
[<ffffffffa05aa7dd>] lu_context_refill+0x1d/0x30 [obdclass]
[<ffffffffa05aa818>] lu_env_refill+0x28/0x30 [obdclass]
[<ffffffffa0e06852>] ofd_get_info+0x302/0xba0 [ofd]
[<ffffffff812805fd>] ? pointer+0x8d/0x830
[<ffffffffa029f7e5>] ? lprocfs_counter_add+0x125/0x182 [lvfs]
[<ffffffffa078528a>] nrs_orr_range_fill_physical+0x18a/0x540 [ptlrpc]
[<ffffffffa0762dd6>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
[<ffffffffa073e630>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
[<ffffffffa07871c6>] nrs_orr_res_get+0x806/0xb90 [ptlrpc]
[<ffffffffa077d306>] nrs_resource_get+0x56/0x110 [ptlrpc]
[<ffffffffa077dccb>] nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
[<ffffffffa0780248>] ptlrpc_nrs_req_initialize+0x38/0x90 [ptlrpc]
[<ffffffffa074cff0>] ptlrpc_main+0x1170/0x16f0 [ptlrpc]
[<ffffffffa074be80>] ? ptlrpc_main+0x0/0x16f0 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa074be80>] ? ptlrpc_main+0x0/0x16f0 [ptlrpc]
[<ffffffffa074be80>] ? ptlrpc_main+0x0/0x16f0 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Comment by Mikhail Pershin [ 30/Apr/13 ]

This is not about missing keys but about a completely uninitialized context, because obd_get_info() is being called from ptlrpc_server_handle_req_in() rather than from ptlrpc_server_handle_request(). We need the same lu_context setup in ptlrpc_server_handle_req_in() as is done in ptlrpc_server_handle_request(), since ptlrpc_server_handle_req_in() can now make calls into the device stack.

Comment by Di Wang [ 30/Apr/13 ]

Yes, I forgot this is in handle_req_in rather than handle_request.

Comment by Di Wang [ 30/Apr/13 ]

I just updated the patch. To avoid the hassle in handle_req_in, ofd_get_info(KEY_FIEMAP) will initialize the env itself. Since ofd_get_info(KEY_FIEMAP) is the only caller from handle_req_in, adding env initialization in handle_req_in itself might not be worth it for now.

Comment by Mikhail Pershin [ 30/Apr/13 ]

I tend to agree, if this really is the only case.

Comment by Alex Zhuravlev [ 30/Apr/13 ]

We do already have an env in ptlrpc_main().

Comment by Di Wang [ 30/Apr/13 ]

Well, those envs might be initialized too early, before the OFD stack is set up. Isn't that why we do env_refill in ost_handle?

Comment by Alex Zhuravlev [ 30/Apr/13 ]

env_refill() is fine, but recreating the env from scratch on potentially every read/write is not, I think.

Comment by Di Wang [ 30/Apr/13 ]

Hmm, only for ofd_get_info, because get_info is the only use case from handle_req_in, where the env (le_ses) is not initialized correctly. So I figured initializing the env (le_ses) for all requests in req_in might not be worth it; doing the env init only in ofd_get_info should be enough. Besides, we cannot use env_refill here, because le_ses might be in some uninitialized state, as you can see from Minh's test. Hmm, I did not see ofd_get_info being called in the read/write path. Am I missing something?

Comment by Alex Zhuravlev [ 30/Apr/13 ]

get_info(KEY_FIEMAP) is used to order requests with regard to physical offset, AFAIK.

Comment by Di Wang [ 30/Apr/13 ]

Hmm, yes, env_refill should probably be better. I changed it back to env_refill in the updated patch.

Comment by Andreas Dilger [ 01/May/13 ]

Please do not close this bug until there is a test that exercises at least basic NRS policy functionality.

Comment by Peter Jones [ 02/May/13 ]

Landed for 2.4

Generated at Sat Feb 10 06:23:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.