If you’re subscribed to Fedora’s devel list, then you probably noticed this thread about improving Fedora’s boot experience explode over the past two days.
So I have this thing on my desk at Red Hat that basically defines a simple design process. (Yes, it also uses the word ‘ideate’ and yes, it sounds funny but it is a real word apparently!) While the mailing list thread on the topic at this point is high-volume and a bit chaotic, there is a lot of useful information and suggestions in there that I think could be pulled into a design process and sorted out. So I took 3 hours (yes, 3 hours) this morning to wade through the thread and attempt to do this.
1. Define the problem
What problem(s) are we actually trying to solve in the boot process? You have to know what you’re trying to solve – then you’ll know whether or not a given solution will fix the problem, and you’ll also know later on how to evaluate (after the work is done) whether or not you actually fixed the problem.
So what exactly is the problem with Fedora’s boot experience? Here is the high-level problem:
Fedora 18’s boot experience is disjointed and lacks polish. It is not as smooth and seamless as it could be.
Okay, but what is meant by that? What specific polish is it lacking? Where specifically is it not smooth? Let’s dive into this and enumerate out the specific issues folks on the thread identified:
1. Grub2’s theme is out-of-date (it’s based on F17’s artwork.) [mclasen]
Fedora 18 does indeed ship with a graphical grub2 theme that is based on Fedora 17’s wallpaper. Shipping with out-of-date artwork is inarguably not very polished.
2. Grub2 has a progress bar that indicates grub’s timeout to exit the menu; this could be confused with progress for the booting of the entire system. [mclasen]
Can you see how this poses a confusing situation for a user who might confuse grub2’s graphical screen and progress bar with the loading screen and progress bar for starting up their computer itself? If you’ve never seen grub2 before and don’t know what a bootloader is, this is especially the case, I think.
3. The Fedora logo on the login screen (GDM) is very small and doesn’t match the one used by plymouth – it’s the full logo with logotype rather than just the Fedora logomark. [mclasen]
This I think is also a fair point – it would look a bit more polished for the size & position of the Fedora logomark in the plymouth bootsplash to be mirrored in the GDM login screen so one can fade into the other seamlessly. It’s definitely not a life-or-death issue, but it would be a nice touch that would make the transition between the two more seamless.
4. It takes too long a time to load the desktop from GDM login. [Ignacio]
Oddly I can’t link to his post in the devel-list archives, but Ignacio brought up this point which sounds like a valid problem if that is what he experiences. Nobody actually responded to his post, though. I think to understand this issue a bit better, someone needs to take it up and do a bit of research to figure out how long it actually takes on various systems, and maybe do some profiling to see why it’s taking so long.
5. Newly-installed kernels are added to the main grub2 menu rather than placed under the ‘advanced options’ submenu as intended. [Elad]
According to Elad, this is because the kernel package doesn’t use grub2-mkconfig – it uses grubby – and the kernel team cites issues using grub2-mkconfig as the reason. This seems like a valid problem since the grub2 menu is not functioning as it was designed for Fedora.
6. Braille display users can’t install their systems without help because we don’t have a brltty daemon running at boot startup time. [zan]
Seems like a valid accessibility issue to me.
7. Changing video modes makes the screen flash unnecessarily, especially if the boot time is so fast the mode you’re loading into only shows on screen for a few seconds. [mizmo]
I pointed this out in one of my contributions to the thread – to have the screen flash between video modes during boot up does look unpolished – it’s kind of like when you have a loose cable to your TV or monitor and you get that flashing. Depending on your display hardware, the flashing may be more or less distracting / disruptive. Minimizing this ‘flashing’ would help the boot experience look more polished though, I think.
Lennart added that you’ll also get the flash when parts of bootup are so fast that their output is only on screen for a super short period of time. He suggested suppressing any ‘fancy’ plymouth display output until 10s into boot (so you’re not displaying anything unless it will be on screen long enough for the user to see it.) Another idea he had was to use performance data to decide whether it’s worth showing any fancy output at all on the particular system you’re booting on.
8. Early boot options are not (at least not easily) localizable and require the inclusion of half the graphical stack. [Tomasz, mizmo]
Both myself and Tomasz suggested this as an issue with our boot process as it is today. I would argue that it’s definitely a valid problem if you can’t understand English.
9. Some weird grub2 config file errors flash to the screen very briefly during bootup. [Jóhann]
Jóhann brought this one up – it really is just an outright bug. We shouldn’t be flashing error messages about conditions that honestly don’t even matter, and certainly not so quickly that they can’t be read.
10. On some systems, bootup is so fast there isn’t enough time to display anything meaningful. [Lennart]
Lennart cited some benchmarks for this – on laptop systems, BIOS POST takes 500 ms, kernel 1s, userspace 1s. But other folks on the thread with different systems had very different benchmarks. For example, DJ said his system takes 45 seconds to get to grub. Peter Robinson said that he’s used modern EFI servers that take 15 minutes to POST. Lennart also brought up that Windows 8 certified HW, in order to be certified, has to get through POST in less than 2 seconds.
So we have some very slow systems and some very fast systems, and both should have a smooth experience.
11. Grub’s menu didn’t display by default up through Fedora 15. Now it displays by default in single-boot / final release Fedora systems. [drago01, mizmo]
We used to suppress the menu, and now we don’t after the move to grub2. According to Peter Jones, we had patches against grub1 that suppressed the menu and they don’t work in grub2. “If somebody contributes a patch upstream,” he said, “that’d be fine, but it’s unlikely we’d want it by default.”
Is the menu showing up by default a problem? I would argue that for many users who never need to access the menu (and certainly if they have to, it’s not something they have to do often) it not only lengthens their boot process by the timeout (I believe it’s 5 seconds?), but it also displays information that could be confusing. It also requires yet another video mode switch which necessitates a flash of the screen to get into, and a flash of the screen to get out of. The design of the screen also doesn’t fit in with the rest of the system, so from an aesthetics point-of-view it’s unpolished as well.
12. The LUKS password box is confusing. [Mirek]
Mirek pointed out that the LUKS password box that displays during plymouth is confusing. I think a big reason for this is it’s just a blank input box with a lock icon. There’s no text – I don’t think text/translations can be displayed at this point.
13. We may not be adhering to the bootloader spec. [cmurphy]
I think Matthew Garrett brought up this spec in the thread as well: FreeDesktop.org Bootloader Spec. It would definitely improve our compatibility with other distros in multiboot situations.
14. New kernels break things for users frequently. [Jiri]
Jiri brought this problem up, “New kernels bring a lot of regressions and we don’t have enough test coverage to avoid them. The general solution to those problems is to go back to the last working kernel version. But by making it less obvious we make these frequent problems more difficult to solve.”
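To put the laptop numbers from issue 10 next to the menu timeout from issue 11, here is a quick back-of-the-envelope sum in Python. Only the three phase timings and the 5-second figure come from the thread; the dictionary layout, the names, and the assumption that the timeout simply adds on top of the other phases (nobody presses a key) are mine.

```python
# Boot phase timings Lennart cited for a fast laptop, in seconds.
# The dictionary layout and key names are illustrative only.
fast_laptop_phases = {"bios_post": 0.5, "kernel": 1.0, "userspace": 1.0}

# Grub menu timeout, believed to be 5 seconds on Fedora 18.
GRUB_MENU_TIMEOUT = 5.0


def total_boot_seconds(phases, menu_timeout=0.0):
    """Sum the per-phase timings, plus any time spent waiting at the menu."""
    return sum(phases.values()) + menu_timeout


without_menu = total_boot_seconds(fast_laptop_phases)
with_menu = total_boot_seconds(fast_laptop_phases, GRUB_MENU_TIMEOUT)
print(f"without menu: {without_menu:.1f}s, with menu: {with_menu:.1f}s")
```

On a system like this, waiting out the menu would account for about two thirds of the time from power-on to login, which is why the timeout matters so much more on fast hardware than on a server that spends minutes in POST.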
2. Define the scope
Okay, so now that we have a list of problems, what do we do? Which ones should we solve, which ones are higher-priority, which ones could we let go for a while? I’m going to take a stab at breaking them into three categories – outright bugs, polish items, and bigger issues to work out. Maybe you don’t agree 100% with my categorization, but I think for the most part it shouldn’t be so controversial. It seems like the items in the ‘Bigger Issues’ category are what created the greatest volume of messages on the list, which makes sense – people don’t necessarily agree on what exactly the problem is or how it would ideally work.
Outright bugs: these are issues that negatively impact functionality or usability, and most people would agree they don’t currently work the way they should.
- Grub2’s theme is out-of-date (it’s based on F17’s artwork.)
- Grub2 has a progress bar that indicates grub’s timeout to exit the menu; this could be confused with progress for the booting of the entire system.
- It takes too long a time to load the desktop from GDM login.
- Newly-installed kernels are added to the main grub2 menu rather than placed under the ‘advanced options’ submenu as intended.
- Braille display users can’t install their systems without help because we don’t have a brltty daemon running at boot startup time.
- Early boot options are not (at least not easily) localizable and require the inclusion of half the graphical stack.
- Some weird grub2 config file errors flash to the screen very briefly during bootup.
- The LUKS password box is confusing.
- We may not be adhering to the bootloader spec.
Polish items: these are issues that negatively impact the appearance / look & feel and polish of the experience, but don’t really impact functionality.
- The Fedora logo on the login screen (GDM) is very small and doesn’t match the one used by plymouth – it’s the full logo with logotype rather than just the Fedora logomark.
- Changing video modes makes the screen flash unnecessarily, especially if the boot time is so fast the mode you’re loading into only shows on screen for a few seconds.
Bigger issues: these are issues that are complex to unpack, and not everyone appears to agree on what the ideal behavior is.
- On some systems, bootup is so fast there isn’t enough time to display anything meaningful.
- Grub’s menu didn’t display by default up through Fedora 15. Now it displays by default in single-boot / final release Fedora systems.
- New kernels break things for users frequently.
So I think categorizing these issues breaks down the scope a little bit. The things that are outright bugs could be filed, and their fixes are likely pretty obvious. The things that are design / polish issues should be discussed by the relevant designers and developers and will hopefully end in agreed-upon solutions that are then implemented. The bigger issues are probably where the bulk of the discussion should happen. So we went from a list of 14 issues to think about down to 3.
So a little bit of research actually came out of the discussion. First, Jóhann provided some links referencing how other operating systems handle this situation:
- Windows & Windows 8 – MSDN Blog post on accessing advanced boot settings on Windows 8 – In summary, they consolidated all the options into a single ‘boot options’ menu. Instead of triggering the boot options menu during boot, you click on a button in the desktop UI that reboots you into ‘advanced startup’ mode, which shows the boot menu by default.
- Apple OS X – Startup key combinations for Intel-based Macs and Startup Keys – Boot Options – Apple appears to have an entire menagerie of keys you can press during startup to access various modes and controls.
There was also a lot of discussion about various use cases that may need to be handled differently. We should make sure we consider each of these use cases when working through the problems we’re trying to solve. These are the ones I was able to extract from the thread:
1. The single-boot, final release user.
This person only has Fedora installed on their system – no other operating systems, at least, not bare-metal. (They may have other OSes in VMs of course.) They are using a final release of Fedora, not a development build. They want to be able to boot to their desktop quickly and get to work. They are likely using a laptop, and that laptop probably has a very fast boot time.
2. The server user
3. The user whose system can’t boot
4. The multi-boot user
5. The encrypted disk user
6. The developer / tester
So let’s go through each of the three major issues identified and the discussion around each. I’ll start with the biggest one – whether or not we should display the GRUB2 menu by default.
1. Should GRUB2’s boot menu display by default or not?
I believe the history here is that we used to display grub by default for development & test releases of Fedora and turn it off by default for final releases. This changed with our move to grub2 in Fedora 16.
There are, of course, two major arguments around the display of this menu:
- We should display the boot menu by default.
- We should not display the boot menu by default.
Put in terms of a car analogy, the two positions look like this:
- Remove the hood of the car, and keep it off in case something goes wrong, or to entice new drivers to look in there and guess what is going on.
- Keep the hood of the car on, and if something goes wrong, pop it. If the driver wants to tweak, or have a look around let them pull the lever and pop the hood.
I really like this because it points out, using a classic free software analogy, that nobody is proposing that we weld the hood shut here. So we should get that off the table right away.
The arguments for displaying the boot menu by default
Here are the arguments that support displaying the boot menu by default:
- If the grub menu is suppressed by default, people won’t know how to access it anymore when they need it. [Alec, skvidal, jflorian, Bjorn]
- If the grub menu is suppressed by default and grub goes by too quickly, the window of time to press the key would be too short to hit reliably. [Bjorn]
- We shouldn’t make information secret or hard to discover in order to recruit more professionals into learning and using Linux. [skvidal]
- There isn’t a GUI tool yet available for configuring the boot loader. We should keep displaying it by default until this tool is available. [Hans]
The arguments against displaying the boot menu by default
Here are the arguments that support suppressing the boot menu by default – not displaying it at all:
- Changing video modes makes the screen flash unnecessarily. Not displaying the boot menu by default would eliminate some of this flashing. The video mode changing also screws up how our X setup works and results in unnecessary bugs for users.
- We used to suppress the boot menu by default in earlier releases and its suppression didn’t cause major problems.
- There are other ways for the user to indicate wanting to enter the menu besides boot-time keypresses – other OSes have methods to enter these menus by rebooting from a running system (systemd is working on this) or automatically loading the menu when an error condition is encountered.
- If we don’t listen for keypresses, we don’t have to load/probe USB, which makes boot even faster.
- (Nobody explicitly stated this, but) Displaying information geared towards power users by default is intimidating / confusing to less-knowledgeable users.
So the main concern from the folks who want the menu displayed by default is that when they need it, it will be too difficult to discover how to access it, and too difficult to actually access it even if you know how. The concerns from the people who don’t support displaying it by default are that there are better ways of accessing the menu (or even automatically displaying it when a situation where it is needed is detected). Their concerns also fall along lines of polish and making bootup look cleaner, since the majority of the time you’re not in a condition where you need the boot menu.
To me it seems like both sides’ concerns could be mostly allayed by not displaying the menu by default but also ensuring that it was accessible when needed and making sure it is not difficult to access on demand.
The proposals for the mechanics of not displaying the menu by default fell into a few categories:
- Use a timeout and keypress (or keyhold) combination hit at the right time to opt-in to displaying the menu. Suggestions for what key to press ranged from all keys to a set of multiple keys (shift, enter, esc, various F-keys, etc.)
- Have a label telling people what the keypress is instead of enabling multiple keys to enter the boot menu.
- Have a menu or application in the desktop that would reboot the system into the desired mode.
- If the system cannot boot up until a certain point, automatically reboot and make the menu visible.
- Have the boot menu always display if it’s a multi-boot system.
The most comprehensive proposal came from Peter Jones, who is the grub maintainer in Fedora:
The idea would be to have a positive indication from systemd that we’ve gotten to some pre-defined point on the previous boot (say, starting your login manager), and not to show you any menu unless the previous boot didn’t get that far. So when you install a new kernel, the process would look like:
1) install kernel
2) set it to boot once with grub2-set-default
3) upon reboot, set it as default if and only if we get to the “success” point
4) if we see a second boot happen without the success flag set, don’t set it default, and wait the normal 5 seconds for input
This has a number of advantages when booting on some systems. On UEFI systems, which is most new desktops:
1) we don’t need any grub UI whatsoever
2) we don’t need the 5 second timeout
3) we don’t need to indicate to the firmware that we need USB probed unless it’s the device we’re booting from.
Together, these currently represent the majority of time from poweron to login. On new desktop hardware, this would be a dramatically faster boot experience. Note that getting to the system firmware menus or switching kernels would have to be selected before reboot, except in the case where the previous boot failed – in that case, we’d display the menus, probe the keyboard, and wait the 5 seconds.
On BIOS machines I think we can still accomplish #1 and #2 as well, but there’s no guarantee of a way to disable firmware timeouts or “press f2 for setup” screens and loading the usb stack.
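To make sure I understand the flow Peter describes, here is a minimal sketch of it as a little state machine in Python. Everything in it is my own illustration: the `BootState` record, the function names, and the kernel labels are hypothetical, and it says nothing about how grub2 or systemd would actually store or exchange the success flag.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MENU_TIMEOUT_SECONDS = 5  # the "normal 5 seconds" from the proposal


@dataclass
class BootState:
    """Hypothetical state that would persist across reboots."""
    default_kernel: str
    trial_kernel: Optional[str] = None   # newly installed, set to boot once
    booted_kernel: Optional[str] = None  # what the current boot is running
    last_boot_succeeded: bool = True     # the "success" flag


def install_kernel(state: BootState, version: str) -> None:
    # Steps 1 and 2: install the kernel and mark it to boot once.
    state.trial_kernel = version


def begin_boot(state: BootState) -> Tuple[str, int]:
    """Bootloader side: return (kernel to boot, menu timeout in seconds)."""
    if not state.last_boot_succeeded:
        # Step 4: the previous boot never reached the success point, so
        # drop the unproven kernel, fall back, and show the menu.
        state.trial_kernel = None
        kernel, timeout = state.default_kernel, MENU_TIMEOUT_SECONDS
    else:
        # Normal case: no menu and no timeout at all.
        kernel, timeout = state.trial_kernel or state.default_kernel, 0
    state.booted_kernel = kernel
    state.last_boot_succeeded = False  # stays cleared until success is reported
    return kernel, timeout


def reach_success_point(state: BootState) -> None:
    """Userspace side: called once, say, the login manager has started."""
    state.last_boot_succeeded = True
    if state.trial_kernel is not None and state.booted_kernel == state.trial_kernel:
        # Step 3: promote the trial kernel to default only after it worked.
        state.default_kernel = state.trial_kernel
        state.trial_kernel = None


state = BootState(default_kernel="kernel-A")
install_kernel(state, "kernel-B")
print(begin_boot(state))   # new kernel is tried once, no menu
print(begin_boot(state))   # no success was reported, so fall back with menu
```

In the second call the flag was never set, so the sketch falls back to the old default and waits the 5 seconds. Had `reach_success_point(state)` run in between, the new kernel would have become the default and the menu would have stayed hidden.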
He also gave a good list of reasons why relying instead on keypresses to access the menu is problematic:
So, the problems with that when we implemented it on grub1 were numerous, but basically they’re all of one variety:
1) we have to clear the buffer at some point because BIOSes often leave junk in them
2) it’s unclear to the user when the buffers are cleared
3) if the user holds down the key, the BIOS complains that the key is stuck,
4) if the user doesn’t hold down the key, but just presses it, it’s easy to do so too early
So I’d really rather have it so that /under normal circumstances/, if the user wants the non-default kernel or parameters, they tell us so before rebooting.
Kevin also asked about how we could detect error conditions that were a bit more complex, for example, the display manager loads but isn’t displaying correctly. Peter responded that we can always add more logic to define more ways to tell when we think something hasn’t worked right.
Let’s walk through how this proposal would affect each of the use cases we came up with:
- The single-boot, final release user will be able to boot their system faster, they will experience less ‘flashing’ while boot-up happens, and they won’t need to see a screen they don’t understand every time they turn their computer on.
- The server user will be able to boot their system faster, and if they need to access the menu they can reboot into it. If their boot time is exceedingly slow, however, this will make it take longer for them to get into the menu since they’ll have to fully boot just to initiate a reboot. However, there’s no guarantee they would have timed the boot menu keypress trigger exactly right anyway so they may have needed an extra boot otherwise. They are also, of course, better equipped to configure their servers to always display the boot menu by default.
- The user whose system can’t boot will be automatically taken to the boot menu.
- The multi-boot user should be able to boot into their other OS of choice from a fully-loaded Fedora desktop. They could also modify their grub config file to always display the boot menu. (Or, we could always display the boot menu if we detect a multi-boot system.) There’s a few possibilities here.
- The encrypted disk user isn’t really affected either way.
- The developer/tester could probably be fine if we turned the menu on by default in testing/development versions again.
Whether or not this is ultimately an acceptable solution, I don’t know, but it certainly sounds good to me after some analysis.
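Pulling the walkthrough above together, one possible reading of the combined policy is sketched below in Python. This is just my summary of the options floated for each use case, not anything grub2 implements: the `BootContext` fields and the `menu_reason` function are hypothetical names, and showing the menu for multi-boot systems and development releases are only two of the possibilities mentioned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BootContext:
    """Hypothetical facts the bootloader would know at startup."""
    previous_boot_failed: bool = False
    menu_requested_before_reboot: bool = False
    other_os_detected: bool = False
    development_release: bool = False


def menu_reason(ctx: BootContext) -> Optional[str]:
    """Return why the menu would be shown, or None to boot straight through."""
    if ctx.previous_boot_failed:
        return "previous boot did not reach the success point"
    if ctx.menu_requested_before_reboot:
        return "user asked for the menu before rebooting"
    if ctx.other_os_detected:
        return "multi-boot system"
    if ctx.development_release:
        return "development / test release"
    return None


# The single-boot, final release user never sees the menu.
print(menu_reason(BootContext()))
# The user whose system can't boot is taken to it automatically.
print(menu_reason(BootContext(previous_boot_failed=True)))
```

Returning a reason rather than a plain yes/no is a deliberate choice in the sketch: it makes it easy to see which use case triggered the menu when testing the policy.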
2. On some systems, bootup is so fast there isn’t enough time to display anything meaningful.
The issue here is that posts to the list reported boot times ranging from a few seconds for the whole boot to 45 seconds just to POST. If boot-up is so fast that anything you display to the screen (including the plymouth splash) won’t show up long enough to be visible, it does seem pointless to display it. However, if boot-up is really slow, it does make sense to have progress bars and information on the screen updating so the user knows that progress is being made.
Lennart’s proposal of not displaying the plymouth bootsplash unless boot takes at least 10 seconds would enable progress display for slower-booting systems while allowing faster-booting systems to skip it altogether. I didn’t see other proposals around this issue, but since this one handles both fast and slow machines, it seems to make sense.
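Both of Lennart’s ideas (the 10-second delay, and using performance data from the particular system) are simple enough to sketch in Python. The function names and the use of a median over past boot durations are my assumptions for illustration; this is not how plymouth works today, and the sample timings are only loosely based on the numbers people posted.

```python
SPLASH_DELAY_SECONDS = 10.0  # the threshold Lennart suggested


def should_show_splash(elapsed_seconds, boot_finished):
    """Idea 1: only show the splash once boot has run past the delay."""
    return (not boot_finished) and elapsed_seconds >= SPLASH_DELAY_SECONDS


def splash_worthwhile(previous_boot_durations, threshold=SPLASH_DELAY_SECONDS):
    """Idea 2: decide up front from boot durations recorded on this system."""
    if not previous_boot_durations:
        # No data yet. Assumption: err on the side of showing progress.
        return True
    ordered = sorted(previous_boot_durations)
    typical = ordered[len(ordered) // 2]  # middle value (upper median if even)
    return typical >= threshold


# A fast laptop finishes long before the delay expires, so no splash.
print(should_show_splash(elapsed_seconds=2.5, boot_finished=True))
# A slow machine that is still booting after 12 seconds gets the splash.
print(should_show_splash(elapsed_seconds=12.0, boot_finished=False))
# Recorded history says whether the splash is worth preparing at all.
print(splash_worthwhile([2.4, 2.5, 2.7]), splash_worthwhile([40.0, 45.0, 50.0]))
```

The first idea needs no stored data but means slow systems show nothing for the first 10 seconds. The second lets a consistently slow system show progress right away, at the cost of having to record boot durations somewhere.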
3. New kernels break things for users frequently.
Jiri pointed this out, and it’s a real problem. I think we’ve all hit a kernel update that broke suspend or sound or network drivers, and it’s painful to deal with. If there’s a way to reboot a system in this broken state into the boot menu, or even an easy way to set the default kernel from the desktop once you’ve confirmed a particular older kernel fixes the problem – isn’t that an okay way to get around this problem?
Generally, though, how do we prevent this kind of breakage given that we don’t have unlimited QA resources? Is there any way we can avoid it?
Well, I hope this makes some sense out of a very long and at times convoluted thread. What do you think?
Thanks to Ryan Lerch for the screenshots!