PCI Power Management — The Linux Kernel documentation (2024)

Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.

An overview of concepts and the Linux kernel’s interfaces related to PCI power management. Based on previous work by Patrick Mochel <mochel@transmeta.com> (and others).

This document only covers the aspects of power management specific to PCI devices. For a general description of the kernel’s interfaces related to device power management, refer to Device Power Management Basics and Runtime Power Management Framework for I/O Devices.

1. Hardware and Platform Support for PCI Power Management

1.1. Native and Platform-Based Power Management

In general, power management is a feature allowing one to save energy by putting devices into states in which they draw less power (low-power states) at the price of reduced functionality or performance.

Usually, a device is put into a low-power state when it is underutilized or completely inactive. However, when it is necessary to use the device once again, it has to be put back into the “fully functional” state (full-power state). This may happen when there are some data for the device to handle or as a result of an external event requiring the device to be active, which may be signaled by the device itself.

PCI devices may be put into low-power states in two ways: by using the device capabilities introduced by the PCI Bus Power Management Interface Specification, or with the help of platform firmware, such as an ACPI BIOS. In the first approach, which is referred to as the native PCI power management (native PCI PM) in what follows, the device power state is changed as a result of writing a specific value into one of its standard configuration registers. The second approach requires the platform firmware to provide special methods that may be used by the kernel to change the device’s power state.

Devices supporting the native PCI PM usually can generate wakeup signals called Power Management Events (PMEs) to let the kernel know about external events requiring the device to be active. After receiving a PME the kernel is supposed to put the device that sent it into the full-power state. However, the PCI Bus Power Management Interface Specification doesn’t define any standard method of delivering the PME from the device to the CPU and the operating system kernel. It is assumed that the platform firmware will perform this task and therefore, even though a PCI device is set up to generate PMEs, it also may be necessary to prepare the platform firmware for notifying the CPU of the PMEs coming from the device (e.g. by generating interrupts).

In turn, if the methods provided by the platform firmware are used for changing the power state of a device, usually the platform also provides a method for preparing the device to generate wakeup signals. In that case, however, it often also is necessary to prepare the device for generating PMEs using the native PCI PM mechanism, because the method provided by the platform depends on that.

Thus in many situations both the native and the platform-based power management mechanisms have to be used simultaneously to obtain the desired result.

1.2. Native PCI Power Management

The PCI Bus Power Management Interface Specification (PCI PM Spec) was introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a standard interface for performing various operations related to power management.

The implementation of the PCI PM Spec is optional for conventional PCI devices, but it is mandatory for PCI Express devices. If a device supports the PCI PM Spec, it has an 8-byte power management capability field in its PCI configuration space. This field is used to describe and control the standard features related to the native PCI power management.
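To make the capability field more concrete, the sketch below locates it in an in-memory copy of a device's 256-byte configuration space and rewrites the power state bits of its control/status register. It is a user-space model, not kernel code: the helper names (cfg_read16(), find_pm_cap(), set_pm_state()) are hypothetical, and the register offsets are the conventional PCI ones as assumed here (Status at 0x06, capabilities pointer at 0x34, capability ID 0x01, control/status register at offset 4 of the capability with the state in bits 1:0).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed conventional PCI configuration space layout. */
#define CFG_STATUS       0x06   /* Status register, 16 bits */
#define CFG_STATUS_CAPS  0x0010 /* Status bit 4: capability list present */
#define CFG_CAP_PTR      0x34   /* Offset of the first capability */
#define CAP_ID_PM        0x01   /* Power Management capability ID */
#define PM_CTRL          4      /* Control/status register offset in the capability */
#define PM_CTRL_STATE    0x0003 /* Control/status bits 1:0: requested power state */

/* Read a little-endian 16-bit register from the configuration image. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned int off)
{
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Return the PM capability offset in a 256-byte image, or 0 if absent. */
static uint8_t find_pm_cap(const uint8_t *cfg)
{
	uint8_t pos;
	int ttl = 48;	/* bound the walk in case the list is corrupted */

	if (!(cfg_read16(cfg, CFG_STATUS) & CFG_STATUS_CAPS))
		return 0;

	pos = cfg[CFG_CAP_PTR] & 0xfc;
	while (pos >= 0x40 && ttl-- > 0) {
		if (cfg[pos] == CAP_ID_PM)
			return pos;
		pos = cfg[pos + 1] & 0xfc;	/* follow the "next" pointer */
	}
	return 0;
}

/* Request state 0..3 (D0..D3hot) by rewriting bits 1:0 only. */
static void set_pm_state(uint8_t *cfg, uint8_t pm_cap, unsigned int state)
{
	uint16_t ctrl;

	assert(pm_cap >= 0x40 && pm_cap <= 0xf8);
	assert(state <= 3);

	ctrl = cfg_read16(cfg, pm_cap + PM_CTRL);
	ctrl = (uint16_t)((ctrl & ~PM_CTRL_STATE) | state);
	cfg[pm_cap + PM_CTRL] = (uint8_t)(ctrl & 0xff);
	cfg[pm_cap + PM_CTRL + 1] = (uint8_t)(ctrl >> 8);
}
```

On real hardware the same steps go through configuration space accessors rather than a byte array, and the transition delays required by the specification have to be observed; both are left out of this model.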

The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses (B0-B3). The higher the number, the less power is drawn by the device or bus in that state. However, the higher the number, the longer the latency for the device or bus to return to the full-power state (D0 or B0, respectively).

There are two variants of the D3 state defined by the specification. The first one is D3hot, referred to as the software accessible D3, because devices can be programmed to go into it. The second one, D3cold, is the state that PCI devices are in when the supply voltage (Vcc) is removed from them. It is not possible to program a PCI device to go into D3cold, although there may be a programmable interface for putting the bus the device is on into a state in which Vcc is removed from all devices on the bus.

PCI bus power management, however, is not supported by the Linux kernel at the time of this writing and therefore it is not covered by this document.

Note that every PCI device can be in the full-power state (D0) or in D3cold, regardless of whether or not it implements the PCI PM Spec. In addition to that, if the PCI PM Spec is implemented by the device, it must support D3hot as well as D0. The support for the D1 and D2 power states is optional.

PCI devices supporting the PCI PM Spec can be programmed to go to any of the supported low-power states (except for D3cold). While in D1-D3hot the standard configuration registers of the device must be accessible to software (i.e. the device is required to respond to PCI configuration accesses), although its I/O and memory spaces are then disabled. This allows the device to be programmatically put into D0. Thus the kernel can switch the device back and forth between D0 and the supported low-power states (except for D3cold) and the possible power state transitions the device can undergo are the following:

Current State | New State
D0            | D1, D2, D3
D1            | D2, D3
D2            | D3
D1, D2, D3    | D0

The transition from D3cold to D0 occurs when the supply voltage is provided to the device (i.e. power is restored). In that case the device returns to D0 with a full power-on reset sequence and the power-on defaults are restored to the device by hardware just as at initial power up.
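The transition rules above can be condensed into a small predicate. The following is only an illustrative user-space sketch: the enum and the function name sw_transition_ok() are hypothetical, "D3" in the table is taken to mean D3hot, and transitions into or out of D3cold are rejected because they result from removing or restoring power rather than from programming the device.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state labels for this sketch only. */
enum pm_state { PM_D0, PM_D1, PM_D2, PM_D3HOT, PM_D3COLD };

/*
 * Return true if software may program the device directly from
 * "from" to "to" according to the transition table.
 */
static bool sw_transition_ok(enum pm_state from, enum pm_state to)
{
	assert(from <= PM_D3COLD && to <= PM_D3COLD);

	/* D3cold is entered and left by power removal/restoration only. */
	if (from == PM_D3COLD || to == PM_D3COLD)
		return false;

	/* Any programmable low-power state may return to D0. */
	if (to == PM_D0)
		return from != PM_D0;

	/* Otherwise only moves to a deeper state are listed. */
	return to > from;
}
```

For example, sw_transition_ok(PM_D1, PM_D2) holds, while going from D2 to D1 requires passing through D0 first.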

PCI devices supporting the PCI PM Spec can be programmed to generate PMEs while in any power state (D0-D3), but they are not required to be capable of generating PMEs from all supported power states. In particular, the capability of generating PMEs from D3cold is optional and depends on the presence of additional voltage (3.3Vaux) allowing the device to remain sufficiently active to generate a wakeup signal.

1.3. ACPI Device Power Management

The platform firmware support for the power management of PCI devices is system-specific. However, if the system in question is compliant with the Advanced Configuration and Power Interface (ACPI) Specification, like the majority of x86-based systems, it is supposed to implement device power management interfaces defined by the ACPI standard.

For this purpose the ACPI BIOS provides special functions called “control methods” that may be executed by the kernel to perform specific tasks, such as putting a device into a low-power state. These control methods are encoded using a special byte-code language called the ACPI Machine Language (AML) and stored in the machine’s BIOS. The kernel loads them from the BIOS and executes them as needed using an AML interpreter that translates the AML byte code into computations and memory or I/O space accesses. This way, in theory, a BIOS writer can provide the kernel with a means to perform actions depending on the system design in a system-specific fashion.

ACPI control methods may be divided into global control methods, which are not associated with any particular devices, and device control methods, which have to be defined separately for each device supposed to be handled with the help of the platform. This means, in particular, that ACPI device control methods can only be used to handle devices that the BIOS writer knew about in advance. The ACPI methods used for device power management fall into that category.

The ACPI specification assumes that devices can be in one of four power states labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM D0-D3 states (although the difference between D3hot and D3cold is not taken into account by ACPI). Moreover, for each power state of a device there is a set of power resources that have to be enabled for the device to be put into that state. These power resources are controlled (i.e. enabled or disabled) with the help of their own control methods, _ON and _OFF, that have to be defined individually for each of them.

To put a device into the ACPI power state Dx (where x is a number between 0 and 3 inclusive) the kernel is supposed to (1) enable the power resources required by the device in this state using their _ON control methods and (2) execute the _PSx control method defined for the device. In addition to that, if the device is going to be put into a low-power state (D1-D3) and is supposed to generate wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI 3.0) control method defined for it has to be executed before _PSx. Power resources that are not required by the device in the target power state and are not required any more by any other device should be disabled (by executing their _OFF control methods). If the current power state of the device is D3, it can only be put into D0 this way.

However, quite often the power states of devices are changed during a system-wide transition into a sleep state or back into the working state. ACPI defines four system sleep states, S1, S2, S3, and S4, and denotes the system working state as S0. In general, the target system sleep (or working) state determines the highest power (lowest number) state the device can be put into and the kernel is supposed to obtain this information by executing the device’s _SxD control method (where x is a number between 0 and 4 inclusive). If the device is required to wake up the system from the target sleep state, the lowest power (highest number) state it can be put into is also determined by the target state of the system. The kernel is then supposed to use the device’s _SxW control method to obtain the number of that state. It also is supposed to use the device’s _PRW control method to learn which power resources need to be enabled for the device to be able to generate wakeup signals.
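The way the two limits combine can be shown with a short sketch. It is a user-space model under stated assumptions: the values of _SxD and _SxW are taken as already evaluated integers (sxd and sxw), both methods are assumed to exist, and the struct and function names are hypothetical. Firmware that reports inconsistent values (a wakeup limit shallower than the _SxD limit) is not handled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type: the range of D-states usable in a sleep state. */
struct dstate_range {
	int shallowest;	/* highest-power state allowed (lowest number) */
	int deepest;	/* lowest-power state allowed (highest number) */
};

/*
 * sxd:    value returned by _SxD for the target sleep state (0..3)
 * sxw:    value returned by _SxW for the target sleep state (0..3)
 * wakeup: whether the device has to wake up the system
 */
static struct dstate_range dstate_range_for_sleep(int sxd, int sxw, bool wakeup)
{
	struct dstate_range r;

	assert(sxd >= 0 && sxd <= 3);
	assert(sxw >= 0 && sxw <= 3);

	r.shallowest = sxd;
	/* Without wakeup the device may go as deep as D3. */
	r.deepest = wakeup ? sxw : 3;
	return r;
}
```

With sxd = 2 and sxw = 2 and wakeup required, the only usable state is D2; without the wakeup requirement the device could also be put into D3.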

1.4. Wakeup Signaling

Wakeup signals generated by PCI devices, either as native PCI PMEs, or as a result of the execution of the _DSW (or _PSW) ACPI control method before putting the device into a low-power state, have to be caught and handled as appropriate. If they are sent while the system is in the working state (ACPI S0), they should be translated into interrupts so that the kernel can put the devices generating them into the full-power state and take care of the events that triggered them. In turn, if they are sent while the system is sleeping, they should cause the system’s core logic to trigger wakeup.

On ACPI-based systems wakeup signals sent by conventional PCI devices are converted into ACPI General-Purpose Events (GPEs) which are hardware signals from the system core logic generated in response to various events that need to be acted upon. Every GPE is associated with one or more sources of potentially interesting events. In particular, a GPE may be associated with a PCI device capable of signaling wakeup. The information on the connections between GPEs and event sources is recorded in the system’s ACPI BIOS from where it can be read by the kernel.

If a PCI device known to the system’s ACPI BIOS signals wakeup, the GPE associated with it (if there is one) is triggered. The GPEs associated with PCI bridges may also be triggered in response to a wakeup signal from one of the devices below the bridge (this also is the case for root bridges) and, for example, native PCI PMEs from devices unknown to the system’s ACPI BIOS may be handled this way.

A GPE may be triggered when the system is sleeping (i.e. when it is in one of the ACPI S1-S4 states), in which case system wakeup is started by its core logic (the device that was the source of the signal causing the system wakeup to occur may be identified later). The GPEs used in such situations are referred to as wakeup GPEs.

Usually, however, GPEs are also triggered when the system is in the working state (ACPI S0) and in that case the system’s core logic generates a System Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI handler identifies the GPE that caused the interrupt to be generated which, in turn, allows the kernel to identify the source of the event (that may be a PCI device signaling wakeup). The GPEs used for notifying the kernel of events occurring while the system is in the working state are referred to as runtime GPEs.

Unfortunately, there is no standard way of handling wakeup signals sent by conventional PCI devices on systems that are not ACPI-based, but there is one for PCI Express devices. Namely, the PCI Express Base Specification introduced a native mechanism for converting native PCI PMEs into interrupts generated by root ports. For conventional PCI devices native PMEs are out-of-band, so they are routed separately and they need not pass through bridges (in principle they may be routed directly to the system’s core logic), but for PCI Express devices they are in-band messages that have to pass through the PCI Express hierarchy, including the root port on the path from the device to the Root Complex. Thus it was possible to introduce a mechanism by which a root port generates an interrupt whenever it receives a PME message from one of the devices below it. The PCI Express Requester ID of the device that sent the PME message is then recorded in one of the root port’s configuration registers from where it may be read by the interrupt handler allowing the device to be identified. [PME messages sent by PCI Express endpoints integrated with the Root Complex don’t pass through root ports, but instead they cause a Root Complex Event Collector (if there is one) to generate interrupts.]
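As an illustration of how an interrupt handler might identify the source device, the sketch below decodes a 32-bit value read from the root port register that records the Requester ID. It is a user-space model with hypothetical names; the bit layout (Requester ID in bits 15:0, a PME status flag in bit 16, a "more PMEs pending" flag in bit 17) and the bus/device/function split of the Requester ID (8/5/3 bits) are assumptions based on the usual PCI Express conventions, not something stated in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the root port status value (hypothetical macro names). */
#define ROOT_STATUS_RID_MASK  0x0000ffffu /* Requester ID of the PME source */
#define ROOT_STATUS_PME       0x00010000u /* a PME has been recorded */
#define ROOT_STATUS_PENDING   0x00020000u /* another PME is waiting */

struct pme_source {
	bool valid;		/* false if no PME was recorded */
	bool more_pending;
	uint8_t bus, dev, fn;
};

static struct pme_source decode_root_status(uint32_t status)
{
	struct pme_source src = { false, false, 0, 0, 0 };
	uint16_t rid;

	if (!(status & ROOT_STATUS_PME))
		return src;

	rid = (uint16_t)(status & ROOT_STATUS_RID_MASK);
	src.valid = true;
	src.more_pending = (status & ROOT_STATUS_PENDING) != 0;
	src.bus = (uint8_t)(rid >> 8);		/* bits 15:8 */
	src.dev = (uint8_t)((rid >> 3) & 0x1f);	/* bits 7:3 */
	src.fn = (uint8_t)(rid & 0x07);		/* bits 2:0 */

	/* The three fields must recombine into the original Requester ID. */
	assert((((unsigned int)src.bus << 8) | ((unsigned int)src.dev << 3) |
		src.fn) == rid);
	return src;
}
```

For instance, a status value of 0x000103a9 would decode to bus 3, device 21, function 1 under these assumptions.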

In principle the native PCI Express PME signaling may also be used on ACPI-based systems along with the GPEs, but to use it the kernel has to ask the system’s ACPI BIOS to release control of root port configuration registers. The ACPI BIOS, however, is not required to allow the kernel to control these registers and if it doesn’t do that, the kernel must not modify their contents. Of course the native PCI Express PME signaling cannot be used by the kernel in that case.

2. PCI Subsystem and Device Power Management

2.1. Device Power Management Callbacks

The PCI Subsystem participates in the power management of PCI devices in a number of ways. First of all, it provides an intermediate code layer between the device power management core (PM core) and PCI device drivers. Specifically, the pm field of the PCI subsystem’s struct bus_type object, pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing pointers to several device power management callbacks:

    const struct dev_pm_ops pci_dev_pm_ops = {
            .prepare = pci_pm_prepare,
            .complete = pci_pm_complete,
            .suspend = pci_pm_suspend,
            .resume = pci_pm_resume,
            .freeze = pci_pm_freeze,
            .thaw = pci_pm_thaw,
            .poweroff = pci_pm_poweroff,
            .restore = pci_pm_restore,
            .suspend_noirq = pci_pm_suspend_noirq,
            .resume_noirq = pci_pm_resume_noirq,
            .freeze_noirq = pci_pm_freeze_noirq,
            .thaw_noirq = pci_pm_thaw_noirq,
            .poweroff_noirq = pci_pm_poweroff_noirq,
            .restore_noirq = pci_pm_restore_noirq,
            .runtime_suspend = pci_pm_runtime_suspend,
            .runtime_resume = pci_pm_runtime_resume,
            .runtime_idle = pci_pm_runtime_idle,
    };

These callbacks are executed by the PM core in various situations related to device power management and they, in turn, execute power management callbacks provided by PCI device drivers. They also perform power management operations involving some standard configuration registers of PCI devices that device drivers need not know or care about.

The structure representing a PCI device, struct pci_dev, contains several fields that these callbacks operate on:

    struct pci_dev {
            ...
            pci_power_t current_state;      /* Current operating state. */
            int pm_cap;                     /* PM capability offset in the
                                               configuration space */
            unsigned int pme_support:5;     /* Bitmask of states from which PME#
                                               can be generated */
            unsigned int pme_poll:1;        /* Poll device's PME status bit */
            unsigned int d1_support:1;      /* Low power state D1 is supported */
            unsigned int d2_support:1;      /* Low power state D2 is supported */
            unsigned int no_d1d2:1;         /* D1 and D2 are forbidden */
            unsigned int wakeup_prepared:1; /* Device prepared for wake up */
            unsigned int d3hot_delay;       /* D3hot->D0 transition time in ms */
            ...
    };

They also indirectly use some fields of the struct device that is embedded in struct pci_dev.

2.2. Device Initialization

The PCI subsystem’s first task related to device power management is to prepare the device for power management and initialize the fields of struct pci_dev used for this purpose. This happens in two functions defined in drivers/pci/, pci_pm_init() and pci_acpi_setup().

The first of these functions checks if the device supports native PCI PM and if that’s the case the offset of its power management capability structure in the configuration space is stored in the pm_cap field of the device’s struct pci_dev object. Next, the function checks which PCI low-power states are supported by the device and from which low-power states the device can generate native PCI PMEs. The power management fields of the device’s struct pci_dev and the struct device embedded in it are updated accordingly and the generation of PMEs by the device is disabled.

The second function checks if the device can be prepared to signal wakeup with the help of the platform firmware, such as the ACPI BIOS. If that is the case, the function updates the wakeup fields in struct device embedded in the device’s struct pci_dev and uses the firmware-provided method to prevent the device from signaling wakeup.

At this point the device is ready for power management. For driverless devices, however, this functionality is limited to a few basic operations carried out during system-wide transitions to a sleep state and back to the working state.

2.3. Runtime Device Power Management

The PCI subsystem plays a vital role in the runtime power management of PCI devices. For this purpose it uses the general runtime power management (runtime PM) framework described in Runtime Power Management Framework for I/O Devices. Namely, it provides subsystem-level callbacks:

pci_pm_runtime_suspend()
pci_pm_runtime_resume()
pci_pm_runtime_idle()

that are executed by the core runtime PM routines. It also implements the entire mechanics necessary for handling runtime wakeup signals from PCI devices in low-power states, which at the time of this writing works for both the native PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in Section 1.

First, a PCI device is put into a low-power state, or suspended, with the help of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call pci_pm_runtime_suspend() to do the actual job. For this to work, the device’s driver has to provide a pm->runtime_suspend() callback (see below), which is run by pci_pm_runtime_suspend() as the first action. If the driver’s callback returns successfully, the device’s standard configuration registers are saved, the device is prepared to generate wakeup signals and, finally, it is put into the target low-power state.

The low-power state to put the device into is the lowest-power (highest number) state from which it can signal wakeup. The exact method of signaling wakeup is system-dependent and is determined by the PCI subsystem on the basis of the reported capabilities of the device and the platform firmware. To prepare the device for signaling wakeup and put it into the selected low-power state, the PCI subsystem can use the platform firmware as well as the device’s native PCI PM capabilities, if supported.
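For the native PCI PM part of that decision, the selection can be sketched as a scan over a bitmask like the pme_support field shown in Section 2.1. This is a simplified user-space model, not the kernel's implementation: the function name deepest_pme_state() and the state constants are hypothetical, bit n of the mask is assumed to mean "PME can be generated from state n", platform firmware input is ignored, and D3cold is skipped because a device cannot be programmed into it.

```c
#include <assert.h>

/* Hypothetical state numbers matching the assumed bit positions. */
enum { ST_D0, ST_D1, ST_D2, ST_D3HOT, ST_D3COLD };

/*
 * Return the deepest programmable state (D3hot at most) from which the
 * device can generate PMEs, or -1 if there is none.  What to do in the
 * -1 case is a policy decision left to the caller.
 */
static int deepest_pme_state(unsigned int pme_support)
{
	int state;

	assert(pme_support < (1u << 5));	/* five states, five bits */

	for (state = ST_D3HOT; state >= ST_D0; state--) {
		if (pme_support & (1u << state))
			return state;
	}
	return -1;
}
```

A device whose mask has only the D0 and D1 bits set would therefore be put into D1 when it has to signal wakeup, even though deeper states may be supported.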

It is expected that the device driver’s pm->runtime_suspend() callback will not attempt to prepare the device for signaling wakeup or to put it into a low-power state. The driver ought to leave these tasks to the PCI subsystem that has all of the information necessary to perform them.

A suspended device is brought back into the “active” state, or resumed, with the help of pm_request_resume() or pm_runtime_resume() which both call pci_pm_runtime_resume() for PCI devices. Again, this only works if the device’s driver provides a pm->runtime_resume() callback (see below). However, before the driver’s callback is executed, pci_pm_runtime_resume() brings the device back into the full-power state, prevents it from signaling wakeup while in that state and restores its standard configuration registers. Thus the driver’s callback need not worry about the PCI-specific aspects of the device resume.

Note that generally pci_pm_runtime_resume() may be called in two different situations. First, it may be called at the request of the device’s driver, for example if there are some data for it to process. Second, it may be called as a result of a wakeup signal from the device itself (this sometimes is referred to as “remote wakeup”). Of course, for this purpose the wakeup signal is handled in one of the ways described in Section 1 and finally converted into a notification for the PCI subsystem after the source device has been identified.

The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle() and pm_request_idle(), executes the device driver’s pm->runtime_idle() callback, if defined, and if that callback doesn’t return an error code (or is not present at all), suspends the device with the help of pm_runtime_suspend(). Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for example, it is called right after the device has just been resumed), in which cases it is expected to suspend the device if that makes sense. Usually, however, the PCI subsystem doesn’t know whether the device can actually be suspended, so it lets the device’s driver decide by running its pm->runtime_idle() callback.

2.4. System-Wide Power Transitions

There are a few different types of system-wide power transitions, described in Device Power Management Basics. Each of them requires devices to be handled in a specific way and the PM core executes subsystem-level power management callbacks for this purpose. They are executed in phases such that each phase involves executing the same subsystem-level callback for every device belonging to the given subsystem before the next phase begins. These phases always run after tasks have been frozen.
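The "one phase for all devices, then the next phase" rule can be pictured with a tiny model. The sketch below is purely illustrative and uses hypothetical names; it takes the three phases of system suspend as an example, identifies devices by index and records the order of (phase, device) steps instead of calling real callbacks. Dependency ordering between devices, asynchronous execution and error handling are all left out.

```c
#include <assert.h>
#include <stddef.h>

/* Phases of the modeled transition, in the order they are carried out. */
enum pm_phase { PHASE_PREPARE, PHASE_SUSPEND, PHASE_SUSPEND_NOIRQ, NR_PHASES };

struct pm_step {
	enum pm_phase phase;
	int dev;	/* index of the device the callback ran for */
};

/*
 * Record the order in which the per-phase callback would run for "ndevs"
 * devices.  Returns the number of steps written to "trace".
 */
static size_t run_suspend_phases(int ndevs, struct pm_step *trace, size_t cap)
{
	size_t n = 0;
	int p, d;

	for (p = PHASE_PREPARE; p < NR_PHASES; p++) {
		/* Every device finishes phase p before phase p + 1 starts. */
		for (d = 0; d < ndevs; d++) {
			assert(n < cap);
			trace[n].phase = (enum pm_phase)p;
			trace[n].dev = d;
			n++;
		}
	}
	return n;
}
```

With two devices the trace reads prepare(0), prepare(1), suspend(0), suspend(1), suspend_noirq(0), suspend_noirq(1).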

2.4.1. System Suspend

When the system is going into a sleep state in which the contents of memory will be preserved, such as one of the ACPI sleep states S1-S3, the phases are:

prepare, suspend, suspend_noirq.

The following PCI bus type’s callbacks, respectively, are used in these phases:

pci_pm_prepare()
pci_pm_suspend()
pci_pm_suspend_noirq()

The pci_pm_prepare() routine first puts the device into the “fully functional” state with the help of pm_runtime_resume(). Then, it executes the device driver’s pm->prepare() callback if defined (i.e. if the driver’s struct dev_pm_ops object is present and the prepare pointer in that object is valid).

The pci_pm_suspend() routine first checks if the device’s driver implements legacy PCI suspend routines (see Section 3), in which case the driver’s legacy suspend callback is executed, if present, and its result is returned. Next, if the device’s driver doesn’t provide a struct dev_pm_ops object (containing pointers to the driver’s callbacks), pci_pm_default_suspend() is called, which simply turns off the device’s bus master capability and runs pcibios_disable_device() to disable it, unless the device is a bridge (PCI bridges are ignored by this routine). Next, the device driver’s pm->suspend() callback is executed, if defined, and its result is returned if it fails. Finally, pci_fixup_device() is called to apply hardware suspend quirks related to the device if necessary.

Note that the suspend phase is carried out asynchronously for PCI devices, so the pci_pm_suspend() callback may be executed in parallel for any pair of PCI devices that don’t depend on each other in a known way (i.e. none of the paths in the device tree from the root bridge to a leaf device contains both of them).

The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has been called, which means that the device driver’s interrupt handler won’t be invoked while this routine is running. It first checks if the device’s driver implements legacy PCI suspend routines (Section 3), in which case the legacy late suspend routine is called and its result is returned (the standard configuration registers of the device are saved if the driver’s callback hasn’t done that). Second, if the device driver’s struct dev_pm_ops object is not present, the device’s standard configuration registers are saved and the routine returns success. Otherwise the device driver’s pm->suspend_noirq() callback is executed, if present, and its result is returned if it fails. Next, if the device’s standard configuration registers haven’t been saved yet (one of the device driver’s callbacks executed before might do that), pci_pm_suspend_noirq() saves them, prepares the device to signal wakeup (if necessary) and puts it into a low-power state.

The low-power state to put the device into is the lowest-power (highest number) state from which it can signal wakeup while the system is in the target sleep state. Just like in the runtime PM case described above, the mechanism of signaling wakeup is system-dependent and determined by the PCI subsystem, which is also responsible for preparing the device to signal wakeup from the system’s target sleep state as appropriate.

PCI device drivers (that don’t implement legacy power management callbacks) are generally not expected to prepare devices for signaling wakeup or to put them into low-power states. However, if one of the driver’s suspend callbacks (pm->suspend() or pm->suspend_noirq()) saves the device’s standard configuration registers, pci_pm_suspend_noirq() will assume that the device has been prepared to signal wakeup and put into a low-power state by the driver (the driver is then assumed to have used the helper functions provided by the PCI subsystem for this purpose). PCI device drivers are not encouraged to do that, but in some rare cases doing that in the driver may be the optimum approach.
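The "registers already saved" convention described above can be illustrated with a small user-space model of the last part of the suspend_noirq step. Everything here is hypothetical (the mock device structure, the two sample driver callbacks and model_suspend_noirq()); it only covers drivers that provide a struct dev_pm_ops object, with a NULL callback standing for a pm->suspend_noirq() pointer that is not set, and it reduces saving registers and changing power states to flag and field updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant bits of a PCI device. */
struct mock_pci_dev {
	bool state_saved;	/* configuration registers already saved? */
	bool wakeup_prepared;	/* prepared to signal wakeup? */
	int power_state;	/* 0 = D0 ... 3 = D3hot */
};

typedef int (*noirq_cb)(struct mock_pci_dev *dev);

/* Typical driver: leaves all PCI-specific work to the subsystem. */
static int drv_plain(struct mock_pci_dev *dev)
{
	(void)dev;
	return 0;
}

/* Rare driver: saves the registers and picks the power state itself. */
static int drv_self_managed(struct mock_pci_dev *dev)
{
	dev->state_saved = true;
	dev->power_state = 1;
	return 0;
}

static int model_suspend_noirq(struct mock_pci_dev *dev, noirq_cb cb,
			       int target, bool wakeup)
{
	assert(target >= 0 && target <= 3);

	if (cb != NULL) {
		int err = cb(dev);

		if (err)
			return err;	/* driver failure aborts the step */
	}

	/* Only act if no driver callback has saved the registers. */
	if (!dev->state_saved) {
		dev->state_saved = true;	/* stands for saving registers */
		dev->wakeup_prepared = wakeup;
		dev->power_state = target;
	}
	return 0;
}
```

With drv_plain() the model moves the device to the target state itself; with drv_self_managed() it leaves the state chosen by the driver untouched.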

2.4.2. System Resume

When the system is undergoing a transition from a sleep state in which the contents of memory have been preserved, such as one of the ACPI sleep states S1-S3, into the working state (ACPI S0), the phases are:

resume_noirq, resume, complete.

The following PCI bus type’s callbacks, respectively, are executed in these phases:

pci_pm_resume_noirq()
pci_pm_resume()
pci_pm_complete()

The pci_pm_resume_noirq() routine first puts the device into the full-power state, restores its standard configuration registers and applies early resume hardware quirks related to the device, if necessary. This is done unconditionally, regardless of whether or not the device’s driver implements legacy PCI power management callbacks (this way all PCI devices are in the full-power state and their standard configuration registers have been restored when their interrupt handlers are invoked for the first time during resume, which allows the kernel to avoid problems with the handling of shared interrupts by drivers whose devices are still suspended). If legacy PCI power management callbacks (see Section 3) are implemented by the device’s driver, the legacy early resume callback is executed and its result is returned. Otherwise, the device driver’s pm->resume_noirq() callback is executed, if defined, and its result is returned.

The pci_pm_resume() routine first checks if the device’s standard configuration registers have been restored and restores them if that’s not the case (this only is necessary in the error path during a failing suspend). Next, resume hardware quirks related to the device are applied, if necessary, and if the device’s driver implements legacy PCI power management callbacks (see Section 3), the driver’s legacy resume callback is executed and its result is returned. Otherwise, the device’s wakeup signaling mechanisms are blocked and its driver’s pm->resume() callback is executed, if defined (the callback’s result is then returned).

The resume phase is carried out asynchronously for PCI devices, like the suspend phase described above, which means that if two PCI devices don’t depend on each other in a known way, the pci_pm_resume() routine may be executed for both of them in parallel.

The pci_pm_complete() routine only executes the device driver’s pm->complete() callback, if defined.

2.4.3. System Hibernation

System hibernation is more complicated than system suspend, because it requires a system image to be created and written into a persistent storage medium. The image is created atomically and all devices are quiesced, or frozen, before that happens.

The freezing of devices is carried out after enough memory has been freed (at the time of this writing the image creation requires at least 50% of system RAM to be free) in the following three phases:

prepare, freeze, freeze_noirq

that correspond to the PCI bus type’s callbacks:

pci_pm_prepare()
pci_pm_freeze()
pci_pm_freeze_noirq()

This means that the prepare phase is exactly the same as for system suspend. The other two phases, however, are different.

The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs the device driver’s pm->freeze() callback, if defined, instead of pm->suspend(), and it doesn’t apply the suspend-related hardware quirks. It is executed asynchronously for different PCI devices that don’t depend on each other in a known way.

The pci_pm_freeze_noirq() routine, in turn, is similar to pci_pm_suspend_noirq(), but it calls the device driver’s pm->freeze_noirq() routine instead of pm->suspend_noirq(). It also doesn’t attempt to prepare the device for signaling wakeup and put it into a low-power state. Still, it saves the device’s standard configuration registers if they haven’t been saved by one of the driver’s callbacks.

Once the image has been created, it has to be saved. However, at this point all devices are frozen and they cannot handle I/O, while their ability to handle I/O is obviously necessary for the image saving. Thus they have to be brought back to the fully functional state and this is done in the following phases:

thaw_noirq, thaw, complete

using the following PCI bus type’s callbacks:

pci_pm_thaw_noirq()
pci_pm_thaw()
pci_pm_complete()

respectively.

The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(). It puts the device into the full power state and restores its standard configuration registers. It also executes the device driver’s pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq().

The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device driver’s pm->thaw() callback instead of pm->resume(). It is executed asynchronously for different PCI devices that don’t depend on each other in a known way.

The complete phase is the same as for system resume.

After saving the image, devices need to be powered down before the system can enter the target sleep state (ACPI S4 for ACPI-based systems). This is done in three phases:

prepare, poweroff, poweroff_noirq

where the prepare phase is exactly the same as for system suspend. The other two phases are analogous to the suspend and suspend_noirq phases, respectively. The PCI subsystem-level callbacks they correspond to:

pci_pm_poweroff()
pci_pm_poweroff_noirq()

work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively, although they don’t attempt to save the device’s standard configuration registers.

2.4.4. System Restore

System restore requires a hibernation image to be loaded into memory and the pre-hibernation memory contents to be restored before the pre-hibernation system activity can be resumed.

As described in Device Power Management Basics, the hibernation image is loaded into memory by a fresh instance of the kernel, called the boot kernel, which in turn is loaded and run by a boot loader in the usual way. After the boot kernel has loaded the image, it needs to replace its own code and data with the code and data of the “hibernated” kernel stored within the image, called the image kernel. For this purpose all devices are frozen just like before creating the image during hibernation, in the

prepare, freeze, freeze_noirq

phases described above. However, the devices affected by these phases are onlythose having drivers in the boot kernel; other devices will still be in whateverstate the boot loader left them.

Should the restoration of the pre-hibernation memory contents fail, the bootkernel would go through the “thawing” procedure described above, using thethaw_noirq, thaw, and complete phases (that will only affect the devices havingdrivers in the boot kernel), and then continue running normally.

If the pre-hibernation memory contents are restored successfully, which is theusual situation, control is passed to the image kernel, which then becomesresponsible for bringing the system back to the working state. To achieve this,it must restore the devices’ pre-hibernation functionality, which is done muchlike waking up from the memory sleep state, although it involves differentphases:

restore_noirq, restore, complete

The first two of these are analogous to the resume_noirq and resume phases described above, respectively, and correspond to the following PCI subsystem callbacks:

pci_pm_restore_noirq()
pci_pm_restore()

These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(), respectively, but they execute the device driver’s pm->restore_noirq() and pm->restore() callbacks, if available.

The complete phase is carried out in exactly the same way as during system resume.
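The phase sequences named in the hibernation and restore descriptions can be summarized as a small lookup table. The following is only a userspace sketch: the enum, the table and the helper names (all prefixed model_) are hypothetical and do not exist in the kernel; only the phase names come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PHASES_PER_TRANSITION 3

/* Hypothetical labels for the transitions covered in 2.4.3 and 2.4.4. */
enum model_transition {
	MODEL_HIBERNATE_FREEZE,    /* before the image is created */
	MODEL_HIBERNATE_THAW,      /* after the image has been created */
	MODEL_HIBERNATE_POWEROFF,  /* after the image has been saved */
	MODEL_RESTORE_FREEZE,      /* boot kernel, image loaded */
	MODEL_RESTORE_FAILED,      /* boot kernel, restoration failed */
	MODEL_RESTORE_OK,          /* image kernel has taken control */
	MODEL_TRANSITION_COUNT
};

static const char *const model_phases[MODEL_TRANSITION_COUNT][PHASES_PER_TRANSITION] = {
	[MODEL_HIBERNATE_FREEZE]   = { "prepare", "freeze", "freeze_noirq" },
	[MODEL_HIBERNATE_THAW]     = { "thaw_noirq", "thaw", "complete" },
	[MODEL_HIBERNATE_POWEROFF] = { "prepare", "poweroff", "poweroff_noirq" },
	[MODEL_RESTORE_FREEZE]     = { "prepare", "freeze", "freeze_noirq" },
	[MODEL_RESTORE_FAILED]     = { "thaw_noirq", "thaw", "complete" },
	[MODEL_RESTORE_OK]         = { "restore_noirq", "restore", "complete" },
};

/* Return the name of phase number 'step' (0-based), or NULL past the end. */
const char *model_phase(enum model_transition t, size_t step)
{
	assert((unsigned int)t < MODEL_TRANSITION_COUNT);
	if (step >= PHASES_PER_TRANSITION)
		return NULL;
	return model_phases[t][step];
}

/* Convenience check: does phase 'step' of transition 't' have this name? */
bool model_phase_is(enum model_transition t, size_t step, const char *name)
{
	const char *phase = model_phase(t, step);

	return phase != NULL && strcmp(phase, name) == 0;
}
```

For instance, model_phase(MODEL_RESTORE_OK, 0) yields "restore_noirq", the first phase run by the image kernel.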

3. PCI Device Drivers and Power Management

3.1. Power Management Callbacks

PCI device drivers participate in power management by providing callbacks to be executed by the PCI subsystem’s power management routines described above and by controlling the runtime power management of their devices.

At the time of this writing there are two ways to define power management callbacks for a PCI device driver, the recommended one, based on using a dev_pm_ops structure described in Device Power Management Basics, and the “legacy” one, in which the .suspend() and .resume() callbacks from struct pci_driver are used. The legacy approach, however, doesn’t allow one to define runtime power management callbacks and is not really suitable for any new drivers. Therefore it is not covered by this document (refer to the source code to learn more about it).

It is recommended that all PCI device drivers define a struct dev_pm_ops object containing pointers to power management (PM) callbacks that will be executed by the PCI subsystem’s PM routines in various circumstances. A pointer to the driver’s struct dev_pm_ops object has to be assigned to the driver.pm field in its struct pci_driver object. Once that has happened, the “legacy” PM callbacks in struct pci_driver are ignored (even if they are not NULL).

The PM callbacks in struct dev_pm_ops are not mandatory and if they are not defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI subsystem will handle the device in a simplified default manner. If they are defined, though, they are expected to behave as described in the following subsections.
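The wiring described above can be illustrated with a userspace sketch. The struct definitions below are cut-down stand-ins for the kernel types (only the members needed here, with simplified legacy callback signatures), and the foo_* names and model_uses_legacy_pm() are hypothetical. The helper only models the stated rule that the legacy callbacks are ignored once driver.pm is set.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the members used here. */
struct device;

struct dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
};

struct device_driver {
	const struct dev_pm_ops *pm;
};

struct pci_driver {
	const char *name;
	int (*suspend)(void *pdev);   /* "legacy" callback, signature simplified */
	int (*resume)(void *pdev);    /* "legacy" callback, signature simplified */
	struct device_driver driver;
};

/* Hypothetical driver callbacks. */
static int foo_suspend(struct device *dev) { (void)dev; return 0; }
static int foo_resume(struct device *dev)  { (void)dev; return 0; }
static int foo_legacy_suspend(void *pdev)  { (void)pdev; return 0; }

static const struct dev_pm_ops foo_pm_ops = {
	.suspend = foo_suspend,
	.resume  = foo_resume,
};

struct pci_driver foo_driver = {
	.name      = "foo",
	.suspend   = foo_legacy_suspend,   /* present, but ignored: see below */
	.driver.pm = &foo_pm_ops,          /* the assignment the text requires */
};

/* Model of the rule: legacy callbacks matter only while driver.pm is unset. */
int model_uses_legacy_pm(const struct pci_driver *drv)
{
	assert(drv != NULL);
	return drv->driver.pm == NULL && (drv->suspend || drv->resume);
}
```

With foo_driver as defined, model_uses_legacy_pm() returns 0 even though a legacy .suspend() pointer is set.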

3.1.1. prepare()

The prepare() callback is executed during system suspend, during hibernation (when a hibernation image is about to be created), during power-off after saving a hibernation image and during system restore, when a hibernation image has just been loaded into memory.

This callback is only necessary if the driver’s device has children that in general may be registered at any time. In that case the role of the prepare() callback is to prevent new children of the device from being registered until one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.

In addition to that the prepare() callback may carry out some operations preparing the device to be suspended, although it should not allocate memory (if additional memory is required to suspend the device, it has to be preallocated earlier, for example in a suspend/hibernate notifier as described in Suspend/Hibernation Notifiers).

3.1.2. suspend()

The suspend() callback is only executed during system suspend, after prepare() callbacks have been executed for all devices in the system.

This callback is expected to quiesce the device and prepare it to be put into a low-power state by the PCI subsystem. It is not required (in fact it even is not recommended) that a PCI driver’s suspend() callback save the standard configuration registers of the device, prepare it for waking up the system, or put it into a low-power state. All of these operations can very well be taken care of by the PCI subsystem, without the driver’s participation.

However, in some rare cases it is convenient to carry out these operations in a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and pci_set_power_state() should be used to save the device’s standard configuration registers, to prepare it for system wakeup (if necessary), and to put it into a low-power state, respectively. Moreover, if the driver calls pci_save_state(), the PCI subsystem will not execute either pci_prepare_to_sleep() or pci_set_power_state() for its device, so the driver is then responsible for handling the device as appropriate.
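The division of labour between the driver and the PCI subsystem described above can be modelled in userspace as follows. This is a sketch under stated assumptions, not kernel code: struct model_pci_dev, the model_* helpers (stand-ins named after pci_save_state(), pci_prepare_to_sleep() and pci_set_power_state()) and the foo_* callbacks are all hypothetical, and each helper does only what the text attributes to its kernel counterpart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_power_state { MODEL_D0, MODEL_D3HOT };

struct model_pci_dev {
	bool state_saved;     /* standard configuration registers saved */
	bool wakeup_armed;    /* prepared for system wakeup */
	enum model_power_state state;
};

/* Stand-ins named after the helpers mentioned in the text. */
static void model_save_state(struct model_pci_dev *d)
{
	d->state_saved = true;
}

static void model_prepare_to_sleep(struct model_pci_dev *d)
{
	d->wakeup_armed = true;
}

static void model_set_power_state(struct model_pci_dev *d,
				  enum model_power_state s)
{
	d->state = s;
}

/* Recommended: the driver's suspend() only quiesces the device. */
int foo_suspend_default(struct model_pci_dev *d)
{
	(void)d;   /* device-specific quiescing would go here */
	return 0;
}

/* Rare case: the driver performs all three operations itself. */
int foo_suspend_manual(struct model_pci_dev *d)
{
	model_save_state(d);
	model_prepare_to_sleep(d);
	model_set_power_state(d, MODEL_D3HOT);
	return 0;
}

/*
 * Model of the subsystem's follow-up step. If the driver has already
 * saved the state, the device is left alone; otherwise the subsystem
 * does the work. Returns true if the subsystem handled the device.
 */
bool model_subsystem_finish_suspend(struct model_pci_dev *d)
{
	assert(d != NULL);
	if (d->state_saved)
		return false;
	model_save_state(d);
	model_prepare_to_sleep(d);
	model_set_power_state(d, MODEL_D3HOT);
	return true;
}
```

In this model a driver that saves the state but then forgets to set the power state would leave the device in D0, which is the responsibility the text warns about.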

While the suspend() callback is being executed, the driver’s interrupt handler can be invoked to handle an interrupt from the device, so all suspend-related operations relying on the driver’s ability to handle interrupts should be carried out in this callback.

3.1.3. suspend_noirq()

The suspend_noirq() callback is only executed during system suspend, after suspend() callbacks have been executed for all devices in the system and after device interrupts have been disabled by the PM core.

The difference between suspend_noirq() and suspend() is that the driver’s interrupt handler will not be invoked while suspend_noirq() is running. Thus suspend_noirq() can carry out operations that would cause race conditions to arise if they were performed in suspend().

3.1.4. freeze()

The freeze() callback is hibernation-specific and is executed in two situations: during hibernation, after prepare() callbacks have been executed for all devices in preparation for the creation of a system image, and during restore, after a system image has been loaded into memory from persistent storage and the prepare() callbacks have been executed for all devices.

The role of this callback is analogous to the role of the suspend() callback described above. In fact, they only need to be different in the rare cases when the driver takes the responsibility for putting the device into a low-power state.

In those cases the freeze() callback should not prepare the device for system wakeup or put it into a low-power state. Still, either it or freeze_noirq() should save the device’s standard configuration registers using pci_save_state().

3.1.5. freeze_noirq()

The freeze_noirq() callback is hibernation-specific. It is executed during hibernation, after prepare() and freeze() callbacks have been executed for all devices in preparation for the creation of a system image, and during restore, after a system image has been loaded into memory and after prepare() and freeze() callbacks have been executed for all devices. It is always executed after device interrupts have been disabled by the PM core.

The role of this callback is analogous to the role of the suspend_noirq() callback described above and it very rarely is necessary to define freeze_noirq().

The difference between freeze_noirq() and freeze() is analogous to the difference between suspend_noirq() and suspend().

3.1.6. poweroff()

The poweroff() callback is hibernation-specific. It is executed when the system is about to be powered off after saving a hibernation image to a persistent storage. prepare() callbacks are executed for all devices before poweroff() is called.

The role of this callback is analogous to the role of the suspend() and freeze() callbacks described above, although it does not need to save the contents of the device’s registers. In particular, if the driver wants to put the device into a low-power state itself instead of allowing the PCI subsystem to do that, the poweroff() callback should use pci_prepare_to_sleep() and pci_set_power_state() to prepare the device for system wakeup and to put it into a low-power state, respectively, but it need not save the device’s standard configuration registers.

3.1.7. poweroff_noirq()

The poweroff_noirq() callback is hibernation-specific. It is executed after poweroff() callbacks have been executed for all devices in the system.

The role of this callback is analogous to the role of the suspend_noirq() and freeze_noirq() callbacks described above, but it does not need to save the contents of the device’s registers.

The difference between poweroff_noirq() and poweroff() is analogous to the difference between suspend_noirq() and suspend().

3.1.8. resume_noirq()

The resume_noirq() callback is only executed during system resume, after the PM core has enabled the non-boot CPUs. The driver’s interrupt handler will not be invoked while resume_noirq() is running, so this callback can carry out operations that might race with the interrupt handler.

Since the PCI subsystem unconditionally puts all devices into the full power state in the resume_noirq phase of system resume and restores their standard configuration registers, resume_noirq() is usually not necessary. In general it should only be used for performing operations that would lead to race conditions if carried out by resume().

3.1.9. resume()

The resume() callback is only executed during system resume, after resume_noirq() callbacks have been executed for all devices in the system and device interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-suspend configuration of the device and bringing it back to the fully functional state. The device should be able to process I/O in a usual way after resume() has returned.

3.1.10. thaw_noirq()

The thaw_noirq() callback is hibernation-specific. It is executed after a system image has been created and the non-boot CPUs have been enabled by the PM core, in the thaw_noirq phase of hibernation. It also may be executed if the loading of a hibernation image fails during system restore (it is then executed after enabling the non-boot CPUs). The driver’s interrupt handler will not be invoked while thaw_noirq() is running.

The role of this callback is analogous to the role of resume_noirq(). The difference between these two callbacks is that thaw_noirq() is executed after freeze() and freeze_noirq(), so in general it does not need to modify the contents of the device’s registers.

3.1.11. thaw()

The thaw() callback is hibernation-specific. It is executed after thaw_noirq() callbacks have been executed for all devices in the system and after device interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-freeze configuration of the device, so that it will work in a usual way after thaw() has returned.

3.1.12. restore_noirq()

The restore_noirq() callback is hibernation-specific. It is executed in the restore_noirq phase of hibernation, when the boot kernel has passed control to the image kernel and the non-boot CPUs have been enabled by the image kernel’s PM core.

This callback is analogous to resume_noirq() with the exception that it cannot make any assumption on the previous state of the device, even if the BIOS (or generally the platform firmware) is known to preserve that state over a suspend-resume cycle.

For the vast majority of PCI device drivers there is no difference between resume_noirq() and restore_noirq().

3.1.13. restore()

The restore() callback is hibernation-specific. It is executed after restore_noirq() callbacks have been executed for all devices in the system and after the PM core has enabled device drivers’ interrupt handlers to be invoked.

This callback is analogous to resume(), just like restore_noirq() is analogous to resume_noirq(). Consequently, the difference between restore_noirq() and restore() is analogous to the difference between resume_noirq() and resume().

For the vast majority of PCI device drivers there is no difference between resume() and restore().

3.1.14. complete()

The complete() callback is executed in the following situations:

  • during system resume, after resume() callbacks have been executed for all devices,

  • during hibernation, before saving the system image, after thaw() callbacks have been executed for all devices,

  • during system restore, when the system is going back to its pre-hibernation state, after restore() callbacks have been executed for all devices.

It also may be executed if the loading of a hibernation image into memory fails (in that case it is run after thaw() callbacks have been executed for all devices that have drivers in the boot kernel).

This callback is entirely optional, although it may be necessary if the prepare() callback performs operations that need to be reversed.

3.1.15. runtime_suspend()

The runtime_suspend() callback is specific to device runtime power management (runtime PM). It is executed by the PM core’s runtime PM framework when the device is about to be suspended (i.e. quiesced and put into a low-power state) at run time.

This callback is responsible for freezing the device and preparing it to be put into a low-power state, but it must allow the PCI subsystem to perform all of the PCI-specific actions necessary for suspending the device.

3.1.16. runtime_resume()

The runtime_resume() callback is specific to device runtime PM. It is executed by the PM core’s runtime PM framework when the device is about to be resumed (i.e. put into the full-power state and programmed to process I/O normally) at run time.

This callback is responsible for restoring the normal functionality of the device after it has been put into the full-power state by the PCI subsystem. The device is expected to be able to process I/O in the usual way after runtime_resume() has returned.

3.1.17. runtime_idle()

The runtime_idle() callback is specific to device runtime PM. It is executed by the PM core’s runtime PM framework whenever it may be desirable to suspend the device according to the PM core’s information. In particular, it is automatically executed right after runtime_resume() has returned in case the resume of the device has happened as a result of a spurious event.

This callback is optional, but if it is not implemented or if it returns 0, the PCI subsystem will call pm_runtime_suspend() for the device, which in turn will cause the driver’s runtime_suspend() callback to be executed.
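The idle rule stated above (a missing runtime_idle() or one returning 0 leads to a suspend request) can be sketched in userspace. Everything here is a model: struct model_dev, model_request_suspend() (a stand-in for pm_runtime_suspend() that merely counts requests), model_pci_runtime_idle() and the foo_runtime_idle() callback with its busy indicator are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_dev {
	int busy;                /* hypothetical "device is in use" indicator */
	int suspend_requests;    /* how many times a suspend was requested */
	int (*runtime_idle)(struct model_dev *dev);   /* may be NULL */
};

/* Stand-in for pm_runtime_suspend(); here it only counts the requests. */
static void model_request_suspend(struct model_dev *dev)
{
	dev->suspend_requests++;
}

/* Hypothetical driver callback: a nonzero return vetoes the suspend. */
int foo_runtime_idle(struct model_dev *dev)
{
	return dev->busy ? -EBUSY : 0;
}

/* Model of the rule in the text: no callback, or a 0 return, means suspend. */
int model_pci_runtime_idle(struct model_dev *dev)
{
	int ret = 0;

	assert(dev != NULL);
	if (dev->runtime_idle)
		ret = dev->runtime_idle(dev);
	if (ret == 0)
		model_request_suspend(dev);
	return ret;
}
```

A driver that wants to keep a busy device active would return a nonzero value from its callback, as foo_runtime_idle() does here.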

3.1.18. Pointing Multiple Callback Pointers to One Routine

Although in principle each of the callbacks described in the previous subsections can be defined as a separate function, it often is convenient to point two or more members of struct dev_pm_ops to the same routine. There are a few convenience macros that can be used for this purpose.

The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one suspend routine pointed to by the .suspend(), .freeze(), and .poweroff() members and one resume routine pointed to by the .resume(), .thaw(), and .restore() members. The other function pointers in this struct dev_pm_ops are unset.

The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it additionally sets the .runtime_resume() pointer to the same value as .resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to the same value as .suspend() (and .freeze() and .poweroff()).

The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside of a declaration of struct dev_pm_ops to indicate that one suspend routine is to be pointed to by the .suspend(), .freeze(), and .poweroff() members and one resume routine is to be pointed to by the .resume(), .thaw(), and .restore() members.
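The pointer sharing that these macros provide can be illustrated with userspace look-alikes. The MODEL_* macros and struct model_dev_pm_ops below are written for this example only and are not the kernel's definitions (which live in the kernel headers and may differ in detail, for example in their argument lists); the foo_* routines are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct device;
typedef int (*pm_cb)(struct device *);

/* Cut-down stand-in for struct dev_pm_ops. */
struct model_dev_pm_ops {
	pm_cb suspend, resume, freeze, thaw, poweroff, restore;
	pm_cb runtime_suspend, runtime_resume, runtime_idle;
};

/* One suspend routine and one resume routine for all system sleep members. */
#define MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	.suspend = suspend_fn, .freeze = suspend_fn, .poweroff = suspend_fn, \
	.resume = resume_fn, .thaw = resume_fn, .restore = resume_fn,

/* Declares a complete object; every other pointer stays unset (NULL). */
#define MODEL_SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct model_dev_pm_ops name = { \
		MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	}

/* Same as above, plus the two runtime PM pointers described in the text. */
#define MODEL_UNIVERSAL_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct model_dev_pm_ops name = { \
		MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
		.runtime_suspend = suspend_fn, \
		.runtime_resume = resume_fn, \
	}

/* Hypothetical driver routines. */
static int foo_suspend(struct device *dev) { (void)dev; return 0; }
static int foo_resume(struct device *dev)  { (void)dev; return 0; }

MODEL_SIMPLE_DEV_PM_OPS(foo_simple_pm_ops, foo_suspend, foo_resume);
MODEL_UNIVERSAL_DEV_PM_OPS(foo_universal_pm_ops, foo_suspend, foo_resume);
```

After these declarations foo_simple_pm_ops.freeze and foo_simple_pm_ops.poweroff both point to foo_suspend, while foo_simple_pm_ops.runtime_suspend remains NULL.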

3.1.19. Driver Flags for Power Management

The PM core allows device drivers to set flags that influence the handling of power management for the devices by the core itself and by middle layer code including the PCI bus type. The flags should be set once at the driver probe time with the help of the dev_pm_set_driver_flags() function and they should not be updated directly afterwards.

The DPM_FLAG_NO_DIRECT_COMPLETE flag prevents the PM core from using the direct-complete mechanism allowing device suspend/resume callbacks to be skipped if the device is in runtime suspend when the system suspend starts. That also affects all of the ancestors of the device, so this flag should only be used if absolutely necessary.

The DPM_FLAG_SMART_PREPARE flag causes the PCI bus type to return a positive value from pci_pm_prepare() only if the ->prepare callback provided by the driver of the device returns a positive value. That allows the driver to opt out from using the direct-complete mechanism dynamically (whereas setting DPM_FLAG_NO_DIRECT_COMPLETE means permanent opt-out).

The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver’s perspective the device can be safely left in runtime suspend during system suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff() to avoid resuming the device from runtime suspend unless there are PCI-specific reasons for doing that. Also, it causes pci_pm_suspend_late/noirq() and pci_pm_poweroff_late/noirq() to return early if the device remains in runtime suspend during the “late” phase of the system-wide transition under way. Moreover, if the device is in runtime suspend in pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime PM status will be changed to “active” (as it is going to be put into D0 going forward).

Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver allows its “noirq” and “early” resume callbacks to be skipped if the device can be left in suspend after a system-wide transition into the working state. This flag is taken into consideration by the PM core along with the power.may_skip_resume status bit of the device which is set by pci_pm_suspend_noirq() in certain situations. If the PM core determines that the driver’s “noirq” and “early” resume callbacks should be skipped, the dev_pm_skip_resume() helper function will return “true” and that will cause pci_pm_resume_noirq() and pci_pm_resume_early() to return upfront without touching the device and executing the driver callbacks.
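The difference between the permanent and the dynamic opt-out from direct-complete can be sketched with a small userspace model. The MODEL_FLAG_* bit values, struct model_dev and all model_* and foo_* functions are hypothetical; model_bus_prepare() is a simplification that assumes there is never a PCI-specific reason to resume the device, and it treats a driver without a ->prepare callback as not objecting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit positions; not taken from the kernel headers. */
#define MODEL_FLAG_NO_DIRECT_COMPLETE (1u << 0)
#define MODEL_FLAG_SMART_PREPARE      (1u << 1)

struct model_dev {
	unsigned int driver_flags;
	int (*prepare)(struct model_dev *dev);   /* driver's ->prepare, may be NULL */
};

/* Stand-in for dev_pm_set_driver_flags(): meant to be called once, in probe. */
void model_set_driver_flags(struct model_dev *dev, unsigned int flags)
{
	dev->driver_flags = flags;
}

static bool model_test_driver_flags(const struct model_dev *dev,
				    unsigned int flags)
{
	return (dev->driver_flags & flags) != 0;
}

/* Simplified bus-level prepare step; a positive result permits direct-complete. */
int model_bus_prepare(struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->prepare) {
		int ret = dev->prepare(dev);

		if (ret < 0)
			return ret;   /* error reported by the driver */
		if (ret == 0 &&
		    model_test_driver_flags(dev, MODEL_FLAG_SMART_PREPARE))
			return 0;     /* driver opted out for this transition */
	}
	return 1;   /* assumes no PCI-specific reason to resume the device */
}

/* Direct-complete needs a runtime-suspended device that has not opted out. */
bool model_may_direct_complete(const struct model_dev *dev, int prepare_ret,
			       bool runtime_suspended)
{
	if (model_test_driver_flags(dev, MODEL_FLAG_NO_DIRECT_COMPLETE))
		return false;   /* permanent opt-out */
	return prepare_ret > 0 && runtime_suspended;
}

/* Hypothetical driver callbacks. */
int foo_prepare_zero(struct model_dev *dev)     { (void)dev; return 0; }
int foo_prepare_positive(struct model_dev *dev) { (void)dev; return 1; }
```

In this model a driver with MODEL_FLAG_SMART_PREPARE set can decide per transition, through the return value of its ->prepare callback, whether direct-complete is acceptable.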

3.2. Device Runtime Power Management

In addition to providing device power management callbacks PCI device drivers are responsible for controlling the runtime power management (runtime PM) of their devices.

The PCI device runtime PM is optional, but it is recommended that PCI device drivers implement it at least in the cases where there is a reliable way of verifying that the device is not used (like when the network cable is detached from an Ethernet adapter or there are no devices attached to a USB controller).

To support the PCI runtime PM the driver first needs to implement the runtime_suspend() and runtime_resume() callbacks. It also may need to implement the runtime_idle() callback to prevent the device from being suspended again every time right after the runtime_resume() callback has returned (alternatively, the runtime_suspend() callback will have to check if the device should really be suspended and return -EAGAIN if that is not the case).

The runtime PM of PCI devices is enabled by default by the PCI core. PCI device drivers do not need to enable it and should not attempt to do so. However, it is blocked by pci_pm_init(), which runs the pm_runtime_forbid() helper function. In addition to that, the runtime PM usage counter of each PCI device is incremented by local_pci_probe() before executing the probe callback provided by the device’s driver.

If a PCI driver implements the runtime PM callbacks and intends to use the runtime PM framework provided by the PM core and the PCI subsystem, it needs to decrement the device’s runtime PM usage counter in its probe callback function. If it doesn’t do that, the counter will always be different from zero for the device and it will never be runtime-suspended. The simplest way to do that is by calling pm_runtime_put_noidle(), but if the driver wants to schedule an autosuspend right away, for example, it may call pm_runtime_put_autosuspend() instead for this purpose. Generally, it just needs to call a function that decrements the device’s usage counter from its probe routine to make runtime PM work for the device.

It is important to remember that the driver’s runtime_suspend() callback may be executed right after the usage counter has been decremented, because user space may already have caused the pm_runtime_allow() helper function unblocking the runtime PM of the device to run via sysfs, so the driver must be prepared to cope with that.

The driver itself should not call pm_runtime_allow(), though. Instead, it should let user space or some platform-specific code do that (user space can do it via sysfs as stated above), but it must be prepared to handle the runtime PM of the device correctly as soon as pm_runtime_allow() is called (which may happen at any time, even before the driver is loaded).

When the driver’s remove callback runs, it has to balance the decrementation of the device’s runtime PM usage counter at the probe time. For this reason, if it has decremented the counter in its probe callback, it must run pm_runtime_get_noresume() in its remove callback. [Since the core carries out a runtime resume of the device and bumps up the device’s usage counter before running the driver’s remove callback, the runtime PM of the device is effectively disabled for the duration of the remove execution and all runtime PM helper functions incrementing the device’s usage counter are then effectively equivalent to pm_runtime_get_noresume().]
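The usage counter bookkeeping across probe and remove can be traced with a userspace model. The names below (struct model_dev, the model_* helpers standing in for pm_runtime_get_noresume() and pm_runtime_put_noidle(), and the foo_* callbacks) are hypothetical. The model further assumes that, on unbinding, the core drops both the reference it takes around the remove callback and the one taken before probe; the text above only states the increments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev {
	int usage_count;   /* runtime PM usage counter */
};

/* Stand-ins for pm_runtime_get_noresume() and pm_runtime_put_noidle(). */
static void model_get_noresume(struct model_dev *d) { d->usage_count++; }
static void model_put_noidle(struct model_dev *d)   { d->usage_count--; }

/* Hypothetical driver probe: drop the reference taken by the core. */
static int foo_probe(struct model_dev *d)
{
	model_put_noidle(d);   /* runtime suspend becomes possible from here */
	return 0;
}

/* Hypothetical driver remove: 'balance' selects the correct behaviour. */
static void foo_remove(struct model_dev *d, bool balance)
{
	if (balance)
		model_get_noresume(d);   /* undo the decrement done in probe */
}

/* Core side of binding, as described for local_pci_probe(). */
int model_core_probe(struct model_dev *d)
{
	assert(d != NULL);
	model_get_noresume(d);   /* counter bumped before probe runs */
	return foo_probe(d);
}

/* Core side of unbinding, under the assumption stated in the lead-in. */
void model_core_remove(struct model_dev *d, bool driver_balances)
{
	assert(d != NULL);
	model_get_noresume(d);   /* counter bumped before remove runs */
	foo_remove(d, driver_balances);
	model_put_noidle(d);     /* reference taken around remove */
	model_put_noidle(d);     /* reference taken before probe */
}
```

In this model a driver that balances the counter ends with it at zero after unbinding, whereas one that skips the increment in remove leaves it at -1, which is the imbalance the text warns against.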

The runtime PM framework works by processing requests to suspend or resume devices, or to check if they are idle (in which cases it is reasonable to subsequently request that they be suspended). These requests are represented by work items put into the power management workqueue, pm_wq. Although there are a few situations in which power management requests are automatically queued by the PM core (for example, after processing a request to resume a device the PM core automatically queues a request to check if the device is idle), device drivers are generally responsible for queuing power management requests for their devices. For this purpose they should use the runtime PM helper functions provided by the PM core, discussed in Runtime Power Management Framework for I/O Devices.

Devices can also be suspended and resumed synchronously, without placing a request into pm_wq. In the majority of cases this also is done by their drivers that use helper functions provided by the PM core for this purpose.

For more information on the runtime PM of devices refer to Runtime Power Management Framework for I/O Devices.

4. Resources

PCI Local Bus Specification, Rev. 3.0

PCI Bus Power Management Interface Specification, Rev. 1.2

Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b

PCI Express Base Specification, Rev. 2.0

Device Power Management Basics

Runtime Power Management Framework for I/O Devices
