Legacy framebuffer broken since 5.18.13 #341

Closed
muellerjoel opened this issue Aug 1, 2022 · 41 comments
Labels
bug Something isn't working

Comments

@muellerjoel

NVIDIA Open GPU Kernel Modules Version

nvidia-dkms 515.57.10

Does this happen with the proprietary driver (of the same version) as well?

Yes

Operating System and Version

Arch Linux

Kernel Release

Linux 5erver 5.18.15-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 29 Jul 2022 22:52:39 +0000 x86_64 GNU/Linux

Hardware: GPU

GPU 0: NVIDIA GeForce GT 1030 (UUID: GPU-c59b6402-5cbe-7921-cefb-54226fe287f9)

Describe the bug

I can't boot the kernel fully to the console and log in as a user.

git clone https://github.com/archlinux/linux.git

git checkout 848b2b6b5a582418eccf9c757594a51dab2b30d0 #5.18.15 kernel

git revert 07186778cf645cc79e6913a28dadf445cd3e2439

resolves the issue.

https://bbs.archlinux.org/viewtopic.php?id=278519

To Reproduce

Install Arch Linux with kernel >= 5.18.13
Install the NVIDIA drivers
Boot the kernel; the console freezes after the cryptsetup password prompt
Despite the freeze, you can still log in over SSH
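
A rough way to tell whether a given `uname -r` string falls into the range discussed in this thread is sketched below. The lower bound (5.18.13) comes from this report; the upper bound (6.5) is an assumption taken from a later comment naming 6.5 as the first mainline release carrying the fix. Distribution backports and kernel configuration are not modelled, so treat a `True` result as "possibly affected", nothing more. The function names are made up for illustration.

```python
import re

# Assumed bounds, taken only from versions named in this thread.
FIRST_BROKEN = (5, 18, 13)          # stable release that picked up the backport
FIRST_FIXED_MAINLINE = (6, 5, 0)    # mainline release reported to contain the fix


def parse_release(release):
    """Extract (major, minor, patch) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level (e.g. "6.5-rc1") is treated as 0.
    return int(major), int(minor), int(patch or 0)


def may_be_affected(release):
    """True if the release lies in the version range reported in this thread."""
    return FIRST_BROKEN <= parse_release(release) < FIRST_FIXED_MAINLINE
```

On a live system the input would typically be `platform.release()`, for example `may_be_affected(platform.release())`.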

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

@muellerjoel muellerjoel added the bug Something isn't working label Aug 1, 2022
@aritger
Collaborator

aritger commented Aug 1, 2022

GP108 (Pascal) is not supported by the open kernel modules in this git repo. But if this happens with the proprietary driver, that is definitely a bug.

I can't find much related to the problem in the attached logs. Was the attached nvidia-bug-report.log.gz generated right after the symptom was reproduced? In journalctl, I see:

Jul 31 13:51:53 archlinux kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 234
Jul 31 13:51:53 archlinux kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  515.57  Wed Jun 22 22:44:07 UTC 2022
Jul 31 13:51:53 archlinux kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  515.57  Wed Jun 22 22:31:08 UTC 2022
Jul 31 13:51:53 archlinux kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Jul 31 13:51:53 archlinux kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Jul 31 13:51:53 archlinux kernel: nvidia-uvm: Loaded the UVM driver, major device number 510.
[...]
Aug 01 17:18:32 5erver sudo[1113208]:    morta : TTY=pts/0 ; PWD=/tmp/linux ; USER=root ; COMMAND=/usr/bin/nvidia-bug-report.sh

Would it be possible to reboot, reproduce the problem with the proprietary driver, ssh into the system and then run nvidia-bug-report.sh? It would be nice to get a fresh dmesg right after the repro.

@muellerjoel
Author

I have since switched to a custom kernel that works with this GeForce card, but there is a fresh journalctl from me in this thread:

https://bbs.archlinux.org/viewtopic.php?id=278519

@loqs

loqs commented Aug 1, 2022

More information is available in https://bugzilla.kernel.org/show_bug.cgi?id=216303#c2
Since https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ee7a69aa38d87a3bbced7b8245c732c05ed0c6ec was backported to 5.18.13 as https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=07186778cf645cc79e6913a28dadf445cd3e2439, generic framebuffer registration is disabled when any driver asks for a platform device to be unregistered.
The kernel developers suggested the issue could be resolved by the nvidia driver providing an emulated framebuffer which it would register with the drm subsystem.

@aritger aritger changed the title Kernel freeze since 5.18.13 with NVIDIA Corporation GP108 [GeForce GT 1030] Legacy framebuffer broken since 5.18.13 Aug 1, 2022
@aritger
Collaborator

aritger commented Aug 1, 2022

Thanks, @loqs. I've renamed the Issue subject from

Kernel freeze since 5.18.13 with NVIDIA Corporation GP108 [GeForce GT 1030]

to

Legacy framebuffer broken since 5.18.13

@aritger
Collaborator

aritger commented Aug 1, 2022

Thanks for the report. We're tracking this internally as NVIDIA bug 3740048.

@Elias011980

I see a different issue (no freeze, but VESA (legacy) frame buffer unusable after X has started), due to a commit that went into kernel v5.18.13, so it is likely not NVIDIA driver-version specific, but most likely impacting any version...
Here is the kernel issue (that kernel folks are unwilling to "fix" by reverting the "culprit" commit):
https://bugzilla.kernel.org/show_bug.cgi?id=216303

@aaronp24
Member

aaronp24 commented Aug 1, 2022

Hi @Joel-Muller, thanks for reporting this. The NVIDIA driver has traditionally not installed itself as a framebuffer console driver because doing so breaks an important use case: closing all graphical applications, unloading and uninstalling the existing NVIDIA driver, installing a new one, and starting graphics back up without having to reboot. It has not traditionally been a requirement that drm drivers also act as fbdev drivers, so this works out on earlier kernels.

If newer kernels introduce the requirement that drm drivers must install a framebuffer console, then we will look into implementing that.

@YusufKhan-gamedev

@aaronp24 wouldn't that mean that the tty resolution is something absurd like 640x480? That's a huge dealbreaker!

@aaronp24
Member

aaronp24 commented Aug 3, 2022

@Joel-Muller , can you check whether this commit breaks the console even if no NVIDIA driver is installed at all? While trying to reproduce this problem we noticed that the call to sysfb_disable is disabling the framebuffer console on the NVIDIA device when a drm driver for a completely different device (ASPEED in your case, Intel in our internal testing) is loaded, regardless of whether or not a drm driver exists for the NVIDIA device. That seems like a bug: the kernel shouldn't consider the efifb or similar console on an NVIDIA GPU to be conflicting with the PCI resources of a different GPU.

@muellerjoel
Author

@aaronp24

OK, so I would do:

sudo pacman -R nvidia-dkms
Remove nvidia and nvidia_drm from the kernel parameters

How do I remove a self-compiled dirty kernel that I installed with make?!

sudo pacman -U linux-5.18.15 linux-headers-5.18.5 from /var/cache/pacman/pkg/

Is that OK?

@muellerjoel
Author

It freezes as well, even without the nvidia modules and with the nvidia-dkms driver removed!

@aaronp24
Member

aaronp24 commented Aug 4, 2022

Thanks for confirming. We'll continue investigating, but that behavior of the upstream kernel doesn't seem right and sounds like it wasn't intentional.

@Elias011980

I can confirm that the proposed patch does not work for me either (tested with vanilla kernel v5.19 and patched NVIDIA modules for the official (closed source) v515.65.01 drivers); I observe the exact same problem as before: the virtual consoles' video mode is not set properly, which makes them unusable after X has started.

@YusufKhan-gamedev

If what aplattner said is true, then the patches shouldn't fix this issue.

@anderspitman

Has there been any progress on this, or at least a workaround? I can't boot if I update to kernel 5.18.13, and I'm fairly confident I'm hitting this bug.

@ionenwks

ionenwks commented Nov 18, 2022

Has there been any progress on this, or at least a workaround? I can't boot if I update to kernel 5.18.13, and I'm fairly confident I'm hitting this bug.

As far as I know(?), it shouldn't prevent booting, at worst you get a blank console until Xorg or so starts and should still be able to ssh in either way.

Personally, I still get a console fine (currently on kernel-6.0.9, 525.53-closed, single gpu 1070, Gentoo) using FB_EFI or FB_VESA though (with SYSFB_SIMPLEFB / FB_SIMPLEFB / DRM_SIMPLEDRM disabled). On that note, DRM_SIMPLEDRM would in fact prevent Xorg from working in most cases. Fedora, which uses it by default, is for now patching the kernel to skip simpledrm and let the kernel use FB_EFI, so normally you should aim to disable it (or build it as a module, which is harmless).
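
To compare a kernel build against the setup described above, a small parser for the `.config` text can help. This is only a sketch: the option names are the ones written in the comment, the in-tree symbol for simplefb may be spelled `FB_SIMPLE` rather than `FB_SIMPLEFB` (so both are checked, as an assumption), and "module is harmless" is applied to every simple-framebuffer option although the comment only says so for DRM_SIMPLEDRM. The function names are hypothetical.

```python
import re

LEGACY_FB = ("FB_EFI", "FB_VESA")
# FB_SIMPLE is included as an assumed alternative spelling of FB_SIMPLEFB.
SIMPLE_FB = ("SYSFB_SIMPLEFB", "FB_SIMPLEFB", "FB_SIMPLE", "DRM_SIMPLEDRM")


def parse_kernel_config(text):
    """Map option names (without the CONFIG_ prefix) to 'y', 'm', 'n' or a value."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.match(r"# CONFIG_(\w+) is not set$", line)
        if unset:
            options[unset.group(1)] = "n"
            continue
        assigned = re.match(r"CONFIG_(\w+)=(.*)$", line)
        if assigned:
            options[assigned.group(1)] = assigned.group(2)
    return options


def matches_reported_setup(options):
    """True if a legacy fbdev driver is built in and no simple-fb option is."""
    has_legacy_fb = any(options.get(name) == "y" for name in LEGACY_FB)
    simple_builtin = [name for name in SIMPLE_FB if options.get(name) == "y"]
    return has_legacy_fb and not simple_builtin
```

The input would usually be the contents of the build's `.config` file; a `False` result only means the build differs from the one described here, not that the console is necessarily broken.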

Unsure what your distro is using if it's not your own build.

@anderspitman

I solved my problem by going to Best Buy and picking up an AMD card. Working great now. Hopefully nvidia has open source drivers at some point.

@aaronp24
Member

Just to set expectations here: this is definitely a kernel bug. The kernel is disabling the framebuffer console on all devices whenever DRM is initialized on any device. In the case of affected users, this breaks the console on an NVIDIA GPU when another device (such as aspeed or i915) has a DRM driver, even if no driver for the NVIDIA GPU is installed.

The underlying kernel bug needs to be fixed so that use cases where no NVIDIA driver is installed will work again.

@vldevel

vldevel commented Dec 12, 2022

Linux v6.1 is broken beyond repair (former versions were still "patchable", but I failed to produce a working patch with v6.1)...

@Elias011980

For v6.1, simply remove the call to sysfb_disable() from drivers/video/aperture.c

@sl1pkn07

sl1pkn07 commented Dec 29, 2022

For v6.1, simply remove the call to sysfb_disable() from drivers/video/aperture.c

a patch please?

greetings

EDIT:

simply this?

diff --git a/drivers/video/aperture.c b/drivers/video/aperture.c
index 41e77de1ea82..7ca0730ed1c5 100644
--- a/drivers/video/aperture.c
+++ b/drivers/video/aperture.c
@@ -294,7 +294,7 @@ int aperture_remove_conflicting_devices(resource_size_t base, resource_size_t si
         * ask for this, so let's assume that a real driver for the display
         * was already probed and prevent sysfb to register devices later.
         */
-       sysfb_disable();
+       // sysfb_disable();
 
        aperture_detach_devices(base, size);
 

(If it fails to apply, that is because the tabs were lost in copy and paste.)

EDIT: works!!! thanks @Elias011980 !!!

@aaronp24
Member

aaronp24 commented Jan 23, 2023

I threw together some experimental patches based on #342 and #356 to get the drm core to set up a framebuffer console. This is enough to get drm_fbdev_generic to initialize, and it works alright with drm-kms clients such as typical Wayland compositors. Unfortunately, more work will be needed to make it interoperate correctly with native nvidia-modeset clients such as the NVIDIA X driver or Vulkan direct-to-display. I'm going to keep working on that part of it.

@sl1pkn07

Hi Aaron, can I apply that patch to the official drivers?

My GPU lacks GSP (GTX 1070 Ti).

greetings

@aaronp24
Member

You won't be able to apply the nvidia-modeset change to make nvidia-drm's client privileged, but you can achieve something similar by having nvidia-drm take modeset ownership as soon as it loads rather than when a client goes through the set_master path. Note that this will prevent any nvidia-modeset clients from setting modes at all as long as nvidia-drm is loaded, rather than just not interacting properly with nvidia-drm's console handling.

  1. Download nvidia-525.85.05-nvidia-drm-fbdev.patch.gz
  2. Extract with gunzip nvidia-525.85.05-nvidia-drm-fbdev.patch.gz
  3. Apply it to the driver package: bash NVIDIA-Linux-x86_64-525.85.05.run --apply-patch nvidia-525.85.05-nvidia-drm-fbdev.patch
  4. Install the patched driver: bash NVIDIA-Linux-x86_64-525.85.05-custom.run

@sl1pkn07

Hi, I have installed the patched driver following the steps, but it does not work (this is exactly #342 (comment)).

You won't be able to apply the nvidia-modeset change to make nvidia-drm's client privileged, but you can achieve something similar by having nvidia-drm take modeset ownership as soon as it loads rather than when a client goes through the set_master path. Note that this will prevent any nvidia-modeset clients from setting modes at all as long as nvidia-drm is loaded, rather than just not interacting properly with nvidia-drm's console handling.

How can I do this?

It only works if I disable sysfb_disable(); in the kernel.

greetings

@aaronp24
Member

The patch I attached to #341 (comment) has the change in it to take modeset ownership earlier. You need to load nvidia-drm with modeset=1 fbdev=1 to turn on the new code.
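
To confirm that the module really was loaded with both options, one could read the parameter values back from sysfs. The sketch below assumes the standard `/sys/module/<name>/parameters/` layout, that the module shows up there as `nvidia_drm`, and that both parameters are exported as kernel bools (read back as `Y` or `N`) and are readable by the caller, which may require root. None of that is confirmed in this thread, and the function names are made up for illustration.

```python
from pathlib import Path

# Assumed location; the module name nvidia-drm appears as nvidia_drm in sysfs.
DEFAULT_PARAM_DIR = "/sys/module/nvidia_drm/parameters"


def read_nvidia_drm_params(param_dir=DEFAULT_PARAM_DIR):
    """Return the raw modeset/fbdev values, or None where a value is unreadable."""
    root = Path(param_dir)
    values = {}
    for name in ("modeset", "fbdev"):
        try:
            values[name] = (root / name).read_text().strip()
        except OSError:
            # Module not loaded, parameter not exported, or permission denied.
            values[name] = None
    return values


def fbdev_console_requested(values):
    """True only if both parameters read back as enabled ('Y')."""
    return values.get("modeset") == "Y" and values.get("fbdev") == "Y"
```

A `None` value does not distinguish between "module not loaded" and "no permission", so check `lsmod` or rerun as root before drawing conclusions.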

@sl1pkn07

thanks

It seems to work now with fbdev=1.

greetings!!

@sl1pkn07

sl1pkn07 commented Feb 2, 2023

When I try to open an X session with the binary driver (patched drivers + kernel 6.1.8):

[    27.158] (EE) NVIDIA(GPU-0): Failed to acquire modesetting permission.
[    27.158] (EE) NVIDIA(0): Failing initialization of X screen
----
[    27.179] (EE) Screen(s) found, but none have a usable configuration.
[    27.179] (EE) 

Is this expected? The Wayland session works.

greetings

@aaronp24
Member

aaronp24 commented Feb 8, 2023

Yes, that's expected. The patch causes nvidia-drm to take modeset ownership as soon as it initializes and not give it up. That prevents other nvidia-modeset clients from setting modes while nvidia-drm is loaded. DRM clients such as Wayland compositors should still work since they go through nvidia-drm instead of talking to nvidia-modeset directly.
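
For anyone checking whether their own X failure is this same ownership conflict, a small log scan is enough. The sketch below assumes the log uses the `[ timestamp] (EE) message` layout seen in the excerpt above and looks for the exact "Failed to acquire modesetting permission" text from that excerpt; the function names are hypothetical.

```python
import re

# Matches lines such as "[    27.158] (EE) NVIDIA(0): ..." from an Xorg log.
EE_LINE = re.compile(r"^\[\s*\d+\.\d+\]\s+\(EE\)\s*(?P<msg>.*)$")


def xorg_errors(log_text):
    """Return the non-empty (EE) messages found in the log text, in order."""
    errors = []
    for line in log_text.splitlines():
        match = EE_LINE.match(line)
        if match and match.group("msg"):
            errors.append(match.group("msg"))
    return errors


def hit_modeset_ownership_conflict(log_text):
    """True if X reported that it could not acquire modesetting permission."""
    needle = "Failed to acquire modesetting permission"
    return any(needle in message for message in xorg_errors(log_text))
```

The log text would typically come from the Xorg log file of the failed session; its location varies by distribution and is not stated in this thread.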

@sl1pkn07

sl1pkn07 commented Mar 2, 2023

Hi @aaronp24, your patchset fails to build with kernel 6.2.1 and the new beta 530.xx

/var/lib/dkms/nvidia/530.30.02/build/nvidia-drm/nvidia-drm-drv.c: In function ‘nv_drm_register_drm_device’:
/var/lib/dkms/nvidia/530.30.02/build/nvidia-drm/nvidia-drm-drv.c:1501:9: error: implicit declaration of function ‘drm_fbdev_generic_setup’ [-Werror=implicit-function-declaration]
 1501 |         drm_fbdev_generic_setup(dev, 32);
      |         ^~~~~~~~~~~~~~~~~~~~~~~
ld -r -o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-interface.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-pci.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-dmabuf.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-nano-timer.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-acpi.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-cray.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-dma.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-i2c.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-mmap.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-p2p.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-pat.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-procfs.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-usermap.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-vm.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-vtophys.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/os-interface.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/os-mlock.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/os-pci.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/os-registry.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/os-usermap.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-modeset-interface.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-pci-table.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-kthread-q.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-memdbg.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-ibmnpu.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-report-err.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-rsync.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-msi.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-caps.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv-frontend.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nv_uvm_interface.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nvlink_linux.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/nvlink_caps.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/linux_nvswitch.o 
/var/lib/dkms/nvidia/530.30.02/build/nvidia/procfs_nvswitch.o /var/lib/dkms/nvidia/530.30.02/build/nvidia/i2c_nvswitch.o

greetings

@sl1pkn07

With 530.41.03 I still have the issue; it needs the patch in #341 (comment). The patch from @aaronp24 does not work at all or is very limited (and does not work on 6.2.1+).

@Lucretia

The Fedora patch doesn't work for me, see the issue I linked above.

@BalkanMadman

BalkanMadman commented Jul 20, 2023

Have a look at Linux kernel commit 5ae3716cfdcd286268133867f67d0803847acefc ("video/aperture: Only remove sysfb on the default vga pci device"). It is to be included with 6.5 kernels and might be backported to LTS versions. This commit should fix the issue. Those who are able, please test it out.
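
The behavioural difference can be pictured with a toy model. This is not kernel code: it only encodes what this thread says, namely that before the fix sysfb is disabled as soon as a DRM driver probes any device, while the commit title says sysfb is only removed for the default VGA PCI device. Aperture overlap, non-PCI devices and probe ordering are ignored, and the `Gpu` type and function name are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    is_primary: bool      # default VGA device, i.e. the one the firmware console lives on
    has_drm_driver: bool  # a DRM driver probes this device during boot


def sysfb_stays_enabled(gpus, fixed):
    """Model whether the firmware framebuffer (sysfb) survives driver probing."""
    for gpu in gpus:
        if not gpu.has_drm_driver:
            continue
        if not fixed:
            return False  # old behaviour: any DRM probe disables sysfb
        if gpu.is_primary:
            return False  # new behaviour: only the primary device's driver does
    return True
```

For the case reported here (NVIDIA as the console device without a DRM driver, plus an ASPEED or Intel device with one), the model gives `False` before the fix and `True` after it. When the primary device itself gets a DRM driver, sysfb is still disabled in both cases, and the console then depends on that driver providing one.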

@Lucretia

Yeah, that file doesn't quite match the 6.4.x kernel. Would this be correct?

int aperture_remove_conflicting_pci_devices(struct pci_dev *pdev, const char *name)
{
        bool primary = false;
        resource_size_t base, size;
        int bar, ret;

#ifdef CONFIG_X86
        primary = pdev->resource[PCI_ROM_RESOURCE].flags & IORESOURCE_ROM_SHADOW;
#endif

+	if (primary)
+		sysfb_disable();
+
        for (bar = 0; bar < PCI_STD_NUM_BARS; ++bar) {
                if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
                        continue;

                base = pci_resource_start(pdev, bar);
                size = pci_resource_len(pdev, bar);
-                ret = aperture_remove_conflicting_devices(base, size, primary, name);
-                if (ret)
-                        return ret;
+		aperture_detach_devices(base, size);
        }

        /*
         * WARNING: Apparently we must kick fbdev drivers before vgacon,
         * otherwise the vga fbdev driver falls over.
         */
        ret = vga_remove_vgacon(pdev);
        if (ret)
                return ret;

        return 0;

}
EXPORT_SYMBOL(aperture_remove_conflicting_pci_devices);

@BalkanMadman

Yeah, that file doesn't quite match the 6.4.x kernel. Would this be correct?

Nope, sadly. The file has changed a bit, therefore your best bet is to use kernel v6.5-rc1+. If you want this to work on older kernels, there's a dirty workaround that may blow the Earth up. You've been warned.

  1. Clone the Linux kernel git repository (git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) or pull from remote if you already have one.
  2. Check out the git tag of the version you need (in your case v6.4), like so: git checkout v6.4
  3. Get the diff from the checked-out version to commit 5ae3716cfdcd286268133867f67d0803847acefc: git diff HEAD 5ae3716cfdcd286268133867f67d0803847acefc -- drivers/video/aperture.c
  4. Here is your patch; now you can save it to a file (e.g. i-love-amd.patch) and apply it either via git apply or plain patch
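
The steps above can be written down as a dry-run command plan. The sketch only builds the command lines and does not execute anything. It names both endpoints of the diff explicitly (tag first, fix commit second) so that the resulting patch turns the tagged file into the fixed one, writes the patch with git's `--output` option, and adds a `git apply --check` step before applying. The default patch file name and the helper names are made up for illustration, and the caveat from the comment still holds: this is a dirty backport with no guarantee that it builds.

```python
import shlex

FIX_COMMIT = "5ae3716cfdcd286268133867f67d0803847acefc"
TARGET_FILE = "drivers/video/aperture.c"


def backport_plan(tag="v6.4", patch_name="aperture-backport.patch"):
    """Return the git commands, as argv lists, for the backport described above."""
    return [
        ["git", "checkout", tag],
        # Diff from the tagged file to the file as of the fix commit.
        ["git", "diff", "--output=" + patch_name, tag, FIX_COMMIT, "--", TARGET_FILE],
        # Verify that the patch applies cleanly before touching the tree.
        ["git", "apply", "--check", patch_name],
        ["git", "apply", patch_name],
    ]


def as_shell(plan):
    """Render the plan as shell command lines for review or copy and paste."""
    return [shlex.join(argv) for argv in plan]
```

Printing `"\n".join(as_shell(backport_plan()))` inside a kernel checkout gives the four commands to review before running them by hand.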

@Lucretia

Lucretia commented Jul 23, 2023

I'm on gentoo. I had to remove the net-misc/r8168 drivers and installed sys-kernel/git-sources to test it and it works here.

@BalkanMadman

BalkanMadman commented Jul 23, 2023

I'm on gentoo. I had to remove the net-misc/r8168 drivers and installed sys-kernel/git-sources to test it and it works here.

I'm glad to hear that! By the way, there's no need to use a separate package for Realtek 8168 Gigabit Ethernet support; the kernel supports it on its own. Just enable CONFIG_R8169 and you will have support for your chip.

@Lucretia

there's no need to use a separate package for Realtek 8168 Gigabit Ethernet support; the kernel supports it on its own. Just enable CONFIG_R8169 and you will have support for your chip.

Yeah I know, the 6.5 ebuild just gives a weird error if it's installed, but yeah, don't need them, needed them for some previous kernel.

@Lucretia

The 6.4.7 kernel, on gentoo anyway, seems to have the patches.

FYI, check what happens when going back to a console from X after the machine has slept; the console seems to disappear.

@Nyangawa

Nyangawa commented Sep 11, 2023

Have a look at Linux kernel commit 5ae3716cfdcd286268133867f67d0803847acefc ("video/aperture: Only remove sysfb on the default vga pci device"). It is to be included with 6.5 kernels and might be backported to LTS versions. This commit should fix the issue. Those who are able, please test it out.

Running linux 6.5.2 with an Nvidia GPU, I can confirm that the issue has been fixed.

@aaronp24
Member

While newer kernels have patched this issue, as of the 545.23.06 release you can now also work around the problem by loading nvidia-drm with the modeset=1 fbdev=1 parameters. Closing this issue.

14 participants