[LU-17069] Crash when writing to a deactivated OSC Created: 31/Aug/23 Updated: 31/Aug/23 Resolved: 31/Aug/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Etienne Aujames | Assignee: | Etienne Aujames |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This does not impact Lustre versions above 2.15.

Call Trace:

[12029.369503] BUG: unable to handle kernel NULL pointer dereference at (null)
[12029.369605] IP: [<ffffffffc1041a67>] osc_lru_reserve+0x27/0x170 [osc]
[12029.369702] PGD 800000006895f067 PUD 4a719067 PMD 0
[12029.369767] Oops: 0000 [#1] SMP
...
[12029.389698] RIP: 0010:[<ffffffffc1041a67>] [<ffffffffc1041a67>] osc_lru_reserve+0x27/0x170 [osc]
[12029.391039] RSP: 0018:ffff9756a8b83a70 EFLAGS: 00010206
[12029.392217] RAX: ffff97567fb4b000 RBX: ffff9756ba9cab88 RCX: 0000000000000000
[12029.393346] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9756818946c0
[12029.394480] RBP: ffff9756a8b83a88 R08: 0000000000000001 R09: 0000000000000000
[12029.395523] R10: ffff9756a4b6cf28 R11: 000000000000000f R12: ffff9756818946c0
[12029.396445] R13: 0000000000020000 R14: ffff9756ba9cab88 R15: ffff975682b0ba00
[12029.397361] FS: 00007efdb61e4740(0000) GS:ffff9756bfd00000(0000) knlGS:0000000000000000
[12029.398336] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12029.399246] CR2: 0000000000000000 CR3: 000000004cf86000 CR4: 00000000000606e0
[12029.400145] Call Trace:
[12029.400914] [<ffffffffc1047ac6>] osc_io_write_iter_init+0x86/0x1e0 [osc]
[12029.401708] [<ffffffffc0afb1bf>] cl_io_iter_init+0x5f/0x120 [obdclass]
[12029.402488] [<ffffffffc10d4ff3>] lov_io_add_sub.isra.34+0xc3/0x380 [lov]
[12029.403245] [<ffffffffc10d9967>] lov_io_iter_init+0x257/0x720 [lov]
[12029.403974] [<ffffffffc10da38a>] lov_io_rw_iter_init+0x38a/0x520 [lov]
[12029.404717] [<ffffffffc0afb1bf>] cl_io_iter_init+0x5f/0x120 [obdclass]
[12029.405401] [<ffffffffc0afd4d2>] cl_io_loop+0x42/0x1c0 [obdclass]
[12029.406016] [<ffffffffc169a3bb>] ll_file_io_generic+0x63b/0xc90 [lustre]
[12029.406637] [<ffffffffc169aea9>] ll_file_aio_write+0x289/0x660 [lustre]
[12029.407251] [<ffffffffc169b380>] ll_file_write+0x100/0x1c0 [lustre]
[12029.407850] [<ffffffff9d44e590>] vfs_write+0xc0/0x1f0
[12029.408448] [<ffffffff9d9aaed5>] ? system_call_after_swapgs+0xa2/0x13a
[12029.409027] [<ffffffff9d44f36f>] SyS_write+0x7f/0xf0
[12029.409603] [<ffffffff9d9aaed5>] ? system_call_after_swapgs+0xa2/0x13a
[12029.410153] [<ffffffff9d9aaf92>] system_call_fastpath+0x25/0x2a
[12029.410649] [<ffffffff9d9aaed5>] ? system_call_after_swapgs+0xa2/0x13a

Code:

unsigned long osc_lru_reserve(struct client_obd *cli, unsigned long npages)
{
	unsigned long reserved = 0;
	unsigned long max_pages;
	unsigned long c;

	/* reserve a full RPC window at most to avoid that a thread accidentally
	 * consumes too many LRU slots */
	max_pages = cli->cl_max_pages_per_rpc * cli->cl_max_rpcs_in_flight;
	if (npages > max_pages)
		npages = max_pages;

	c = atomic_long_read(cli->cl_lru_left); <-------- Crash
	....

Reproducer:

The issue occurs only after remounting the client: that way, the client does not try to connect to the OST and does not initialize the LRU cache:

lctl dl
...
4 UP osc lustrefs-OST0000-osc-ffff8cb84915e000
...
crash> p obd_devs[4]->u.cli.cl_lru_left
$3 = (atomic_long_t *) 0x0

Most jobs do not cause a crash because they usually perform some other operation before a write (e.g. stats, seek, etc.). Those operations are not impacted by this bug and return an EIO error (in osc_io_iter_init()).

This is the same crash that |
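The failure mode above (a NULL cl_lru_left dereferenced on the write path) can be modelled in userspace. The following is only a minimal sketch under stated assumptions: `client_obd_model` and `lru_reserve_guarded` are hypothetical names, the `-EIO` return mirrors what the description says the other operations get, and the real osc_lru_reserve() waits for free LRU slots rather than returning 0. It is not the patch uploaded to Gerrit.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the struct client_obd fields used by
 * osc_lru_reserve(); this is not the real Lustre definition. */
struct client_obd_model {
	atomic_long *cl_lru_left;	/* NULL until the LRU cache is initialized */
	unsigned long cl_max_pages_per_rpc;
	unsigned long cl_max_rpcs_in_flight;
};

/* Returns the number of LRU slots reserved, or -EIO when the LRU counter was
 * never set up (the deactivated-OSC case). The signed return type is an
 * assumption made so that the sketch can report the error. */
long lru_reserve_guarded(struct client_obd_model *cli, unsigned long npages)
{
	unsigned long max_pages;
	long c;

	assert(cli != NULL);

	/* Guard (assumption): fail the I/O instead of dereferencing NULL. */
	if (cli->cl_lru_left == NULL)
		return -EIO;

	/* Same clamp as in the excerpt: at most one full RPC window. */
	max_pages = cli->cl_max_pages_per_rpc * cli->cl_max_rpcs_in_flight;
	if (npages > max_pages)
		npages = max_pages;

	/* Take up to npages slots with a compare-and-swap loop; on failure the
	 * current counter value is reloaded into c and the loop retries. */
	c = atomic_load(cli->cl_lru_left);
	while (c > 0) {
		long take = (unsigned long)c < npages ? c : (long)npages;

		if (atomic_compare_exchange_weak(cli->cl_lru_left, &c, c - take))
			return take;
	}
	return 0;	/* simplification: the real code would wait for slots */
}
```

With a model whose cl_lru_left is NULL, as in the crash dump output above, the call returns -EIO instead of faulting. With an initialized counter it reserves at most one RPC window of slots.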
| Comments |
| Comment by Etienne Aujames [ 31/Aug/23 ] |
|
On Lustre 2.15, the issue seems to have been resolved by But the backport is too complicated on 2.12: too many changes in clio. I will push a special patch for 2.12. |
| Comment by Peter Jones [ 31/Aug/23 ] |
|
Nice detective work, Etienne, but the community 2.12.x branch is no longer active, so I will close out this ticket. |
| Comment by Gerrit Updater [ 31/Aug/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52197 |
| Comment by Etienne Aujames [ 31/Aug/23 ] |
|
Thanks Peter. CEA is still using 2.12 (they are testing their first 2.15 FS now), so I have pushed the patch for 2.12 to resolve the current production issue. |