The disclosed device for packet coalescing includes detecting a trigger condition for initiating packet coalescing of packet traffic and sending, to an endpoint device, a notification to start packet coalescing. The device can observe a status in response to starting the packet coalescing and report a performance of the packet coalescing. A system can include a controller that detects a trigger condition for packet coalescing and notifies an endpoint device via a notification register. The controller can read a status register to report, based on the read status, a packet coalescing performance. Various other methods, systems, and computer-readable media are also disclosed.
A method for performing prefetching operations is disclosed. The method includes storing a recorded access pattern indicating a set of accesses for a region; in response to an access within the region, fetching the recorded access pattern; and performing prefetching based on the access pattern.
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
G06F 3/06 - Digital input from, or digital output to, record carriers
A method and system for distributing keys in a key distribution system includes receiving a connection for communication from a first component. A determination is made whether the first component requires a key be generated and distributed. Based upon a security mode for the communication, the key generated and distributed to the first component.
G06F 21/57 - Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
In response to receiving a scene description, a processing system [100] generates a set of planes [242] in the scene and a bounding volume [128] representing a partition of the scene. Using the set of planes in the scene, a compute unit [230] of an accelerated processing unit [200] performs a spatial test on the bounding volume to determine whether the bounding volume intersects one or more planes of the set of planes in the scene. Based on the spatial test, the compute unit generates intersection data [435] indicating whether the bounding volume intersects one or more planes of the set of planes in the scene. The accelerated processing unit then uses the intersection data to render the scene.
A method for hierarchical work scheduling includes consuming a work item at a first scheduling domain having a local scheduler circuit and one or more workgroup processing elements. Consuming the work item produces a set of new work items. Subsequently, the local scheduler circuit distributes at least one new work item of the set of new work items to be executed locally at the first scheduling domain. If the local scheduler circuit of the first scheduling domain determines that the set of new work items includes one or more work items that would overload the first scheduling domain with work if scheduled for local execution, those work items are distributed to the next higher-level scheduler circuit in a scheduling domain hierarchy for redistribution to one or more other scheduling domains.
Connection modification based on traffic pattern is described. In accordance with the described techniques, a traffic pattern of memory operations across a set of connections between at least one device and at least one memory is monitored. The traffic pattern is then compared to a threshold traffic pattern condition, such as an amount of data traffic in different directions across the connections. A traffic direction of at least one connection of the set of connections is modified based on the traffic pattern corresponding to the threshold traffic pattern condition.
A memory system includes a PHY embodied on an integrated circuit, the PHY coupling to a memory over conductive traces on a substrate. The PHY includes a reference clock generation circuit providing a reference clock signal to the memory, a first group of driver circuits providing CA signals to the memory, and a second group of driver circuits providing DQ signals to the memory. A plurality of the conductive traces which carry the DQ signals are constructed with a length longer than that of conductive traces carrying the reference clock signal in order to reduce an effective insertion delay associated with coupling the reference clock signal to latch respective incoming DQ signals.
An apparatus for component cooling includes a first heat transfer element configured to be thermally coupled to a heat-generating electronic component, and a second heat transfer element. The apparatus further includes a plurality of heat transfer paths thermally coupled between the first heat transfer element and the second heat transfer element. Each of the plurality of heat transfer paths configured to provide a separate heat conduction path from the first heat transfer element to the second heat transfer element. The apparatus further includes a manifold including a first fluid passage providing a first portion of a heat transfer fluid in thermal contact with the first heat transfer element, and a second fluid passage providing a second portion of the heat transfer fluid in thermal contact with the second heat transfer element.
H01L 23/467 - Arrangements for cooling, heating, ventilating or temperature compensation involving the transfer of heat by flowing fluids by flowing gases, e.g. air
F25B 21/02 - Machines, plants or systems, using electric or magnetic effects using Nernst-Ettinghausen effect
H01L 23/473 - Arrangements for cooling, heating, ventilating or temperature compensation involving the transfer of heat by flowing fluids by flowing liquids
F28B 1/02 - Condensers in which the steam or vapour is separated from the cooling medium by walls, e.g. surface condenser using water or other liquid as the cooling medium
9.
CONNECTING A CHIPLET TO AN INTERPOSER DIE AND TO A PACKAGE INTERFACE USING A SPACER INTERCONNECT COUPLED TO A PORTION OF THE CHIPLET
A semiconductor package assembly includes a package interface. An interposer die has a first surface and a second surface opposite to the first surface, where the first surface of the interposer is die positioned on the package interface. The interposer die includes a plurality of conductive connections between the first surface and second surface. A chiplet includes a connectivity region having conductive pathways, with a first portion of the connectivity region coupled to a conductive connection of the interposer die and a second portion of the connectivity region cantilevered from the interposer die.
H01L 23/538 - Arrangements for conducting electric current within the device in operation from one component to another the interconnection structure between a plurality of semiconductor chips being formed on, or in, insulating substrates
H01L 25/07 - Assemblies consisting of a plurality of individual semiconductor or other solid state devices all the devices being of a type provided for in the same subgroup of groups , or in a single subclass of , , e.g. assemblies of rectifier diodes the devices not having separate containers the devices being of a type provided for in group
H01L 25/065 - Assemblies consisting of a plurality of individual semiconductor or other solid state devices all the devices being of a type provided for in the same subgroup of groups , or in a single subclass of , , e.g. assemblies of rectifier diodes the devices not having separate containers the devices being of a type provided for in group
10.
LOW POWER PROCESSING OF REMOTE MANAGEABILITY REQUESTS
An apparatus and method for efficiently performing power management for multiple clients of a semiconductor chip that supports remote manageability. In various implementations, a network interface receives a packet, and sends at least an indication of the packet to a manageability processing circuity (MPC) of a processing node with multiple clients for processing tasks. The MPC determines whether a client or itself is a destination needed to process the packet. If the destination is the MPC, then packet processing is done by the MPC without involvement from the clients, which can be in an idle state. For example, the MPC can process a remote manageability packet requesting diagnostic information from one or more clients of the processing node. The network interface and the MPC use a sideband communication channel for data transmission, which foregoes lane training for further reduction in latency and power consumption.
A disclosed method can include (i) reporting, by a microcontroller, detection of a violation of a physical infrastructure constraint to a machine check architecture, (ii) triggering, by the machine check architecture in response to the reporting, a machine-check exception such that the violation of the physical infrastructure constraint is recorded, and (iii) performing a corrective action based on the triggering of the machine-check exception. Various other apparatuses, systems, and methods are also disclosed.
The disclosed method includes detecting a stalled memory request, for a first level cache within a cache hierarchy of a processor, that includes an outstanding memory request associated with a second level cache within the cache hierarchy that is higher than the first level cache. The method further includes indicating, to the second level cache, that the first level cache is experiencing a starvation issue due to stalled memory requests to enable the second level cache to perform starvation-remediation actions. Various other methods, systems, and computer-readable media are also disclosed.
A disclosed computing device includes at least one prefetcher and a processing device communicatively coupled to the prefetcher. The processing device is configured to detect a throttling instruction that indicates a start of a throttling region. The computing device is further configured to prevent the prefetcher from being trained on one or more memory instructions included in the throttling region in response to the throttling instruction. Various other apparatuses, systems, and methods are also disclosed.
G06F 3/06 - Digital input from, or digital output to, record carriers
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
Methods, devices, and systems for retrieving information based on cache miss prediction. It is predicted, based on a history of cache misses at a private cache, that a cache lookup for the information will miss a shared victim cache. A speculative memory request is enabled based on the prediction that the cache lookup for the information will miss the shared victim cache. The information is fetched based on the enabled speculative memory request.
A method for operating a memory having a plurality of banks accessible in parallel, each bank including a plurality of grains accessible in parallel is provided. The method includes: based on a memory access request that specifies a memory address, identifying a set that stores data for the memory access request, wherein the set is spread across multiple grains of the plurality of grains; and performing operations to satisfy the memory access request, using entries of the set stored across the multiple grains of the plurality of grains.
A processor [101] implements a simultaneous multithreading (SMT) protection mode that, when enabled, prevents execution of particular software [106] (e.g., a virtual machine) at a processor core [102] when a thread associated with different software (e.g., a different virtual machine or a hypervisor) is currently executing at the processor core. By preventing execution of the software, data, software execution patterns, and other potentially sensitive information is kept protected from unauthorized access or detection. Further, in at least some embodiments the SMT protection mode is implemented on a per-software basis, so that different software can choose whether to implement the protection mode, thereby allowing the processor to be employed in a wide variety of computing environments.
A memory controller for generating accesses for a memory includes a row hammer logic circuit for providing a sample request. In response to the sample request, the memory controller generates a sample command for dispatch to the memory to cause the memory to capture a current row. In response to a completion of the sample command, the memory controller generates a mitigation command for dispatch to the memory.
A method for receiving a multi-level error signal having more than two logic levels includes oversampling the multi-level error signal to provide sampled symbols, wherein a first level of the multi-level error signal indicates no error, and second and third levels of the multi-level error signal indicate first and second error conditions, respectively. The sampled signals are de-serialized to provide sets of symbols. A start of a symbol period is determined in response to detecting that a given sample is different from a prior sample, and the prior sample indicates no error. The sets of symbols are filtered to provide corresponding output symbols based on the start.
A remote display synchronization technique preserves the presence of a local display device for a remotely-rendered video stream. A server and a client device cooperate to dynamically determine a target frame rate for a stream of rendered frames suitable for the current capacities of the server and the client device and networking conditions. The server generates from this target frame rate a synchronization signal that serves as timing control for the rendering process. The client device may provide feedback to instigate a change in the target frame rate, and thus a corresponding change in the synchronization signal. In this approach, the rendering frame rate and the encoding frequency may be "synchronized" in a manner consistent with the capacities of the server, the network, and the client device, resulting in generation, encoding, transmission, decoding, and presentation of a stream of frames that mitigates missed encoding of frames while providing acceptable latency.
H04N 21/242 - Synchronization processes, e.g. processing of PCR [Program Clock References]
H04N 21/2662 - Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
H04N 21/24 - Monitoring of processes or resources, e.g. monitoring of server load, available bandwidth or upstream requests
A memory controller coupled to a memory module receives both processing-in-memory (PIM) requests and memory requests from a host (e.g., a host processor). The memory controller issues PIM requests to one group of memory banks and concurrently issues memory requests to one or more other groups of memory banks. Accordingly, memory requests are performed on groups of memory banks that would otherwise be idle while PIM requests are performed on the one group of memory banks. Optionally, the memory controller coupled to the memory module also takes various actions when switching between operating in a PIM mode and a non-processing-in-memory mode to reduce or hide overhead when switching between the two modes.
A processing unit includes a plurality of adders and a plurality of carry bit generation circuits. The plurality of adders add first and second X bit binary portion values of a first Y bit binary value and a second Y bit binary value. Y is a multiple of X. The plurality of adders further generate first carry bits. The plurality of carry bit generation circuits is coupled to the plurality of adders, respectively, and receive the first carry bits. The plurality of carry bit generation circuits generate second carry bits based on the first carry bits. The plurality of adders use the second carry bits to add the first and second X bit binary portions of the first and second Y bit binary values, respectively.
G06F 7/506 - Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination with simultaneous carry generation for, or propagation over, two or more stages
G06F 7/509 - Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination for multiple operands, e.g. digital integrators
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
H04L 9/32 - Arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system
22.
SIGNAL INTERFERENCE TESTING USING RELIABLE READ WRITE INTERFACE
A memory controller includes a first arbiter for selecting memory commands for dispatch to a memory over a first channel, a second arbiter for selecting memory commands for dispatch to the memory over a second channel, and a test circuit. The test circuit generates a respective testing sequence of read commands and write commands for each of the first channel and second channel, and causes the testing sequences to be transmitted over the first and second channels at least partially overlapping in time without selection by the first or second arbiters.
Predicates for processing in memory is described. In accordance with the described techniques, a predicate instruction to compute a conditional value based on data stored in a memory is provided to a processing-in-memory component. A response that includes the conditional value computed by the processing-in-memory component is received, and the conditional value is stored in a predicate register. One or more conditional instructions are provided to the processing-in-memory component based on the conditional value stored in the predicate register.
A method includes, in a cache directory, storing a set of entries corresponding to one or more memory regions having a first region size when the cache directory is in a first configuration, and based on a workload sparsity metric, reconfiguring the cache directory to a second configuration. In the second configuration, each entry in the set of entries corresponds to a memory region having a second region size.
G06F 12/0893 - Caches characterised by their organisation or structure
G06F 12/0891 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using clearing, invalidating or resetting means
25.
SELECTING BETWEEN BASIC AND GLOBAL PERSISTENT FLUSH MODES
Selecting between basic and global persistent flush modes is described. In accordance with the described techniques, a system includes a data fabric in electronic communication with at least one cache and a controller configured to select between operating in a global persistent flush mode and a basic persistent flush mode based on an available flushing latency of the system, control the at least one cache to flush dirty data to the data fabric in response to a flush event trigger while operating in the global persistent flush mode, and transmit a signal to switch control to an application to flush the dirty data from the at least one cache to the data fabric while operating in the basic persistent flush mode.
Runtime flushing to persistency in heterogenous systems is described. In accordance with the described techniques, a system may include a persistent memory in electronic communication with at least one cache and a controller configured to command the at least one cache to flush dirty data to the persistent memory in response to a dirtiness of the at least one cache reaching a cache dirtiness threshold.
An apparatus for sub-cooling components includes a component cooling device, a processor thermally coupled to the component cooling device, an electronic component, a fan configured to direct an airflow across the processor and the electronic component, and a thermoelectric cooling device thermally coupled to the component cooling device. The thermoelectric cooling device is configured to cool the airflow from a first temperature to a second temperature.
H05K 7/20 - Modifications to facilitate cooling, ventilating, or heating
F25B 21/02 - Machines, plants or systems, using electric or magnetic effects using Nernst-Ettinghausen effect
H01L 23/467 - Arrangements for cooling, heating, ventilating or temperature compensation involving the transfer of heat by flowing fluids by flowing gases, e.g. air
H01L 23/473 - Arrangements for cooling, heating, ventilating or temperature compensation involving the transfer of heat by flowing fluids by flowing liquids
H01L 23/38 - Cooling arrangements using the Peltier effect
Data reuse cache techniques are described. In one example, a load instruction is generated by an execution unit of a processor unit. In response to the load instruction, data is loaded by a load-store unit for processing by the execution unit and is also stored to a data reuse cache communicatively coupled between the load-store unit and the execution unit. Upon receipt of a subsequent load instruction for the data from the execution unit, the data is loaded from the data reuse cache for processing by the execution unit.
Systems, apparatuses, and methods for prefetching data by a display controller are proposed. From time to time, a performance-state change of a memory is performed. During such changes, a memory clock frequency is changed for a memory subsystem (220) storing frame buffer(s) (230) used to drive pixels to a display device (250). During the performance-state change, memory accesses may be temporarily blocked. To sustain a desired quality of service for the display, a display controller (150) is configured to prefetch data in advance of the performance-state change. In order to ensure the display controller has sufficient memory bandwidth to accomplish the prefetch, bandwidth reduction circuitry (112A, 112N) in clients (205) of the system are configured to temporarily reduce memory bandwidth of corresponding clients.
G09G 5/00 - Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
G06F 1/324 - Power saving characterised by the action undertaken by lowering clock frequency
G06F 1/3234 - Power saving characterised by the action undertaken
G06F 3/06 - Digital input from, or digital output to, record carriers
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
G09G 5/393 - Arrangements for updating the contents of the bit-mapped memory
G11C 7/22 - Read-write [R-W] timing or clocking circuits; Read-write [R-W] control signal generators or management
G09G 5/395 - Arrangements specially adapted for transferring the contents of the bit-mapped memory to the screen
Systems, apparatuses, and methods for implementing a hierarchical scheduler. In various implementations, a processor includes a global scheduler, and a plurality of independent local schedulers with each of the local schedulers coupled to a plurality of processors. In one implementation, the processor is a graphics processing unit and the processors are computation units. The processor further includes a shared cache that is shared by the plurality of local schedulers. Each of the local schedulers also includes a local cache used by the local scheduler and processors coupled to the local scheduler. To schedule work items for execution, the global scheduler is configured to store one or more work items in the shared cache and convey an indication to a first local scheduler of the plurality of local schedulers which causes the first local scheduler to retrieve the one or more work items from the shared cache. Subsequent to retrieving the work items, the local scheduler is configured to schedule the retrieved work items for execution by the coupled processors. Each of the plurality of local schedulers is configured to schedule work items for execution independent of scheduling performed by other local schedulers.
Systems, apparatuses, and methods for implementing a message passing system to schedule work in a computing system. In various implementations, a processor includes a global scheduler, and a plurality of local schedulers with each of the local schedulers coupled to a plurality of processors. The processor further includes a shared cache that is shared by the plurality of local schedulers. Also, a plurality of mailboxes are implemented to enable communication between the local schedulers and the global scheduler. To schedule work items for execution, the global scheduler is configured to store one or more work items in the shared cache and store an indication in a mailbox for a first local scheduler of the plurality of local schedulers. Responsive to detecting the message in the mailbox, the first local scheduler identifies a location of the one or more work items in the shared cache and retrieves them for scheduling locally.
The disclosed computing device includes a cache memory and at least one processor coupled to the cache memory. The at least one processor is configured to copy data written to one or more nonredundant wordlines of the cache memory to one or more redundant wordlines of the cache memory. The at least one processor is additionally configured to detect a mismatch between data read from the one or more nonredundant wordlines and data stored in the one or more redundant wordlines. The at least one processor is also configured to perform a remediation action in response to detecting the mismatch. Various other methods, systems, and computer-readable media are also disclosed.
The disclosed computer-implemented method for generating remedy recommendations for power and performance issues within semiconductor software and hardware. For example, the disclosed systems and methods can apply a rule-based model to telemetry data to generate rule-based root-cause outputs as well as telemetry-based unknown outputs. The disclosed systems and methods can further apply a root-cause machine learning model to the telemetry-based unknown outputs to analyze deep and complex failure patterns with the telemetry-based unknown outputs to ultimately generate one or more root-cause remedy recommendations that are specific to the identified failure and the client computing device that is experiencing that failure.
Address translation service management techniques are described. These techniques are based on metadata that is usable to provide a hint as insight into memory access, and based on this, use of a translation lookaside buffer is optimized to control which entries are maintained in the queue and manage address translation requests.
G06F 12/1027 - Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
G06F 12/1009 - Address translation using page tables, e.g. page table structures
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
35.
OFFSET DATA INTEGRITY CHECKS FOR LATENCY REDUCTION
Data integrity checks for reducing communication latency is described. A transmitting endpoint transmits data to a receiving endpoint by generating an integrity tag for a first subset of data blocks and a second integrity tag for a second subset of data blocks. In implementations, the first and second integrity tags overlap at least one data block and are offset based on computational complexities of generating the integrity tags. A receiving endpoint generates comparison tags for each of the integrity tags and uses the comparison tags to validate an authenticity of received data. In response to validating the first and second integrity tags, data blocks covered by both the first and second integrity tags are released for use. Additional integrity tags are generated and validated for subsequent subsets of data blocks during data communication, thus reducing latency by offsetting times at which comparison tags are generated and validated.
H04L 9/32 - Arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system
36.
HYBRID MEMORY ARCHITECTURE FOR ADVANCED 3D SYSTEMS
Disclosed wherein stacked memory dies that utilize a mix of high and low operational temperature memory and non-volatile based memory dies, and chip packages containing the same. High temperature memory dies, such as those using non-volatile memory (NVM) technologies are in a memory stack with low temperature memory dies, such as those having volatile memory technologies. In some cases, the high temperature memory technologies could be used together, in some cases, on the same IC die as logic circuitry. In one example, a memory stack is provided that include a first memory IC die having high temperature memory circuitry, such as non-volatile memory, stacked below a second memory IC die. The second memory IC die has high temperature memory circuitry, such as volatile memory circuitry.
G11C 5/02 - Disposition of storage elements, e.g. in the form of a matrix array
G11C 11/22 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using ferroelectric elements
G11C 11/401 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming cells needing refreshing or charge regeneration, i.e. dynamic cells
37.
3D LAYOUT AND ORGANIZATION FOR ENHANCEMENT OF MODERN MEMORY SYSTEMS
Memory stacks having substantially vertical bitlines, and chip packages having the same, are disclosed herein. In one example, a memory stack is provided that includes a first memory IC die and a second memory IC die. The second memory IC die is stacked on the first memory IC die. Bitlines are routed through the first and second IC dies in a substantially vertical orientation. Wordlines within the first memory IC die are oriented orthogonal to the bitlines.
Error correction for stacked memory is described. In accordance with the described techniques, a system includes a plurality of error correction code engines to detect vulnerabilities in a stacked memory and coordinate at least one vulnerability detected for a portion of the stacked memory to at least one other portion of the stacked memory.
Dynamic memory operations are described. In accordance with the described techniques, a system includes a stacked memory and one or more memory monitors configured to monitor conditions of the stacked memory. A system manager is configured to receive the monitored conditions of the stacked memory from the one or more memory monitors, and dynamically adjust operation of the stacked memory based on the monitored conditions. In one or more implementations, a system includes a memory and at least one register configured to store a ranking for each of a plurality of portions of the memory. Each respective ranking is determined based on an associated retention time of the respective portion of the memory. A memory controller is configured to dynamically refresh the portions of the memory at different times based on the ranking for each of the plurality of portions of the memory stored in the at least one register.
G11C 5/04 - Supports for storage elements; Mounting or fixing of storage elements on such supports
G11C 11/4074 - Power supply or voltage generation circuits, e.g. bias voltage generators, substrate voltage generators, back-up power, power control circuits
G06F 3/06 - Digital input from, or digital output to, record carriers
40.
MEMORY CONTOLLER AND NEAR-MEMORY SUPPORT FOR SPARSE ACCESSES
A data processing system includes a data processor and a memory controller receiving memory access requests from the data processor and generating at least one memory access cycle to a memory system in response to the receiving. The memory controller includes a command queue and a sparse element processor. The command queue is for receiving and storing the memory access requests including a first memory access request including a small element request. The sparse element processor is for causing the memory controller to issue a second memory access request to the memory system in response to the first memory access request with a density greater than a density indicated by the first memory access request.
A data processing node includes a processor element and a data fabric circuit. The data fabric circuit is coupled to the processor element and to a local memory element and includes a crossbar switch. The data fabric circuit is operable to bypass the crossbar switch for memory access requests between the processor element and the local memory element.
An electronic device includes a processor having processor circuitry and a leader memory controller, a controller coupled to the processor and having a follower memory controller, and a memory. The processor circuitry is operable to access the memory by issuing memory access requests to the leader memory controller. The leader memory controller is operable to complete the memory access requests using the follower memory controller to issue memory commands to the at least one memory die.
A method for forming a semiconductor assembly that includes forming a first set of layers on a first wafer, where one or more layers of the first set includes one or more devices of the semiconductor assembly. The method further includes forming a second set of layers on a second wafer, where one or more layers of the second set include connections between one or more of the devices of the semiconductor assembly. The method additionally includes coupling a layer of the first set to a layer of the second set using metal to metal hybrid bonding.
H01L 23/00 - SEMICONDUCTOR DEVICES NOT COVERED BY CLASS - Details of semiconductor or other solid state devices
H01L 25/065 - Assemblies consisting of a plurality of individual semiconductor or other solid state devices all the devices being of a type provided for in the same subgroup of groups , or in a single subclass of , , e.g. assemblies of rectifier diodes the devices not having separate containers the devices being of a type provided for in group
Random access memory (RAM) is attached to an input/output (I/O) controller of a chipset (e.g., on a motherboard). This chipset attached RAM is optionally used as part of a tiered storage solution with other tiers including, for example, nonvolatile memory (e.g., a solid state drive (SSD)) or a hard disk drive. The chipset attached RAM is separate from the system memory, allowing the chipset attached RAM to be used to speed up access to frequently used data stored in the tiered storage solution without reducing the amount of system memory available to an operating system running on the one or more processing units.
A disclosed method can include (i) positioning a first surface of a component of a semiconductor device on a first plated through-hole, (ii) covering, with a layer of dielectric material, at least a second surface of the component that is opposite the first surface of the component, (iii) removing a portion of the layer of dielectric material covering the second surface of the component to form at least one cavity, and (iv) depositing conductive material in the cavity to form a second plated through-hole on the second surface of the component. Various other apparatuses, systems, and methods are also disclosed.
H01L 23/538 - Arrangements for conducting electric current within the device in operation from one component to another the interconnection structure between a plurality of semiconductor chips being formed on, or in, insulating substrates
Systems, apparatuses, and methods for implementing defogging techniques for images and video are disclosed. A defogging engine generates a defog filter result from a grayscale format version of an input image. An estimation engine generates an enhancement strength variable from a hue-saturation-value (HSV) format version of the input image. An enhancement engine receives both the defog filter result from the defogging engine and the enhancement strength variable from the estimation engine. The enhancement engine also receives the original red-green-blue (RGB) color space format version of the input image. The enhancement engine generates an enhanced version of the input image from the original RGB format version based on the defog filter result and the enhancement strength variable. The enhanced version of the input image mitigates fog, haze, mist or other environmental impediments that obscured the original input image.
An apparatus and method for efficiently performing power management for a multi-client computing system. In various implementations, a computing system includes multiple clients that process tasks corresponding to applications. The clients store generated requests of a particular type while processing tasks. A client receives an indication specifying that another client is having requests of the particular type being serviced. In response to receiving this indication, the client inserts a first urgency level in one or more stored requests of the particular type prior to sending the requests for servicing. When the client determines a particular time interval has elapsed, the client sends an indication to other clients specifying that requests of the particular type are being serviced. The client also inserts a second urgency level different from the first urgency level in one or more stored requests of the particular type prior to sending the requests for servicing.
Systems, apparatuses, and methods for automatic firmware provision of high speed serializer/deserializer (SERDES) links are disclosed. A system on chip (SoC) includes one or more microcontrollers, a programmable interconnect, a plurality of physical layer engines, and a plurality of SERDES lanes. The programmable interconnect and the plurality of SERDES lanes are able to support communication protocols for interfaces such as PCIE, SATA, Ethernet, and others. On bootup, the SoC receives a custom specification of how the plurality of SERDES lanes are to be configured. The one or more microcontrollers generate a mapping for the programmable interconnect based on the specification. The mapping is used to configure the programmable interconnect to match the specification. The result is the programmable interconnect connecting the plurality of SERDES lanes to the appropriate physical layer circuits to implement the desired configuration.
An apparatus and method for efficiently supporting multiple peripheral communication protocols in a computing system. A computing system includes multiple servers with one or more of the servers using multiple connectors for connecting to multiple peripheral devices such as data storage devices. At least one of the connectors is able to support multiple communication protocols, rather than a single communication protocol. A processor of the server determines a peripheral device has been attached to a connector that supports multiple communication protocols, and the processor determines whether one of the multiple communication protocols supported by the particular connector matches the attached peripheral device's communication protocol. If so, the processor configures the connector with the matching communication protocol. Otherwise, the processor generates an indication that specifies that there is no match.
A system and method for efficiently designing a through silicon via (TSV) macro blocks are described. In various implementations, the circuitry of a processor executes instructions of a place and route tool that provides automatic placement of macro blocks and standard cells on an integrated circuit die based on a copy of a netlist of the integrated circuit being designed and a copy of a standard cell library that includes a variety of standard cells and macro blocks. The processor places two functional macros in the floorplan with a channel between them. In the channel, the processor places a TSV macro that includes at least one boundary cell inside of the TSV macro. The processor prevents placement of a boundary cell adjacent to at least one side of the TSV macro despite empty space exists due to no standard cells or macros about the at least one side.
A technique for operating a cache is disclosed. The technique includes based on a workload change, identifying a first allocation permissions policy; operating the cache according to the first allocation permissions policy; based on set sampling, identifying a second allocation permissions policy; and operating the cache according to the second allocation permissions policy.
Systems, apparatuses, and methods for dynamically estimating power losses in a computing system. A system management circuit tracks a state of a computing system and dynamically estimates power losses in the computing system based in part on the state. Based on the estimated power losses, power consumption of the computing system is estimated. In response to detecting reduced power losses in at least a portion of the computing system, the system management circuit is configured to increase a power-performance state of one or more circuits of the computing system while remaining within a power allocation limit of the computing system.
Systems, apparatuses, and methods for managing power allocation in a computing system. A system management unit detects a condition indicating a change in power is indicated. Such a change may be detecting an indication that a power change is either required, possible, or requested. In response to detecting a reduction in power is indicated, the system management unit identifies currently executing tasks of the computing system and accesses sensitivity data to determine which of a number of computing units (or power domains) to select for power reduction. Based at least in part on the data, a unit is identified that is determined to have a relatively low sensitivity to power state changes under the current operating conditions. A relatively low sensitivity indicates that a change in power to the corresponding unit will not have as significant an impact on overall performance of the computing system than if another unit was selected. Power allocated for the selected unit is then decreased.
An apparatus and method for efficiently transferring information as signals through a silicon package substrate. A semiconductor fabrication process (or process) begins with a relatively thin package substrate core layer and uses lasers to create openings in the package substrate at locations of the signal routes. The use of each of the relatively thin core layer and the lasers allows for reduction in the pitch of the signal routes. The process creates signal routes in the openings using stacked vias from one side of the package substrate to an opposite side of the package substrate. Additionally, the process forms the package substrate with multiple embedded passive components with different thicknesses in different layers of the package substrate. The embedded passive components are used to improve signal integrity of the signal routes.
H01L 23/538 - Arrangements for conducting electric current within the device in operation from one component to another the interconnection structure between a plurality of semiconductor chips being formed on, or in, insulating substrates
H01L 21/48 - Manufacture or treatment of parts, e.g. containers, prior to assembly of the devices, using processes not provided for in a single one of the groups
A data processor is adapted to couple to a memory. The data processor includes a memory operation array, a power engine, and an initialization circuit. The memory operation array includes a command portion and a data portion. The power engine has an input for receiving power state change request signals and an output for providing memory operations responsive to instructions stored in the command portion. The initialization circuit populates the data portion such that consecutive memory operations are separated by an amount corresponding to a predetermined minimum timing parameter.
A memory includes a link training circuit with a pseudo-random bit sequence (PRBS) generator and a burst error detection counter. The burst error detection counter including a comparator, a first input coupled to the data input, a second input coupled to the PRBS generator, and a counter operable to increase an error count value by one responsive to detecting any number of errors greater than zero in a sequence of symbols including a predetermined number of symbols.
A processing system including a parallel processing unit selectively allocating pages of memory for interleaving across configurable subsets of channels based on a mode of allocation. In some embodiments, in a first mode, a page of memory is allocated to and interleaved across a plurality of channels, and in a second mode, a page of memory is allocated to and interleaved across a subset of the plurality of channels.
A processing system divides an image to be rendered into one or more tiles and performs a visibility pass on the primitives of the image. During the visibility pass, the processing system generates visibility data for each primitive of a draw call of the image based on a visible primitive count and a visible draw call count. In response to a primitive of the draw call being visible in the first tile, the processing system increments the visible primitive count and generates visibility data indicating that the primitives of the draw call are to be rendered using draw call index data stored in an on-chip memory. If the primitive is the first visible primitive of the draw call, the processing system further increments the visible draw call count. Additionally, the processing system renders the primitives of the draw call using the draw call index data stored in the on-chip memory.
The disclosed system may include a processor configured to encode, using an encoding scheme that reduces a number of bits needed to represent one or more instructions from a set of instructions in an instruction buffer represented by a dependency matrix, a dependency indicating that a child instruction represented in the dependency matrix depends on a parent instruction represented in the dependency matrix. The processor may also be configured to store the encoded dependency in the dependency matrix and dispatch instructions in the instruction buffer based at least on decoding one or more dependencies stored in the dependency matrix for the instructions. Various other methods, systems, and computer-readable media are also disclosed.
A technique for operating a cache is disclosed. The technique includes utilizing a first portion of a cache in a directly accessed manner; and utilizing a second portion of the cache as a cache.
G06F 12/0891 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using clearing, invalidating or resetting means
A technique for operating a cache is disclosed. The technique includes recording access data for a first set of memory accesses of a first frame; identifying parameters for a second set of memory accesses of a second frame subsequent to the first frame, based on the access data; and applying the parameters to the second set of memory accesses.
Methods and systems are disclosed for managing performance states of a data fabric of a system on chip (SoC). Techniques disclosed include determining a performance state of the data fabric based on data fabric bandwidth utilizations of respective components of the SoC. A metric, characteristic of a workload centric to cores of the SoC, is derived from hardware counters, and, based on the metric, it is determined whether to alter the performance state.
An encryption circuit includes an iterative block cipher circuit. The iterative block cipher circuit has a counter input for a row index, a key input for receiving a secret key, and an output for providing an encrypted counter value in response to performing a block cipher process using the row index as a counter the secret key. The encryption circuit uses the iterative block cipher circuit during a row operation to a memory.
G06F 21/72 - Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information in cryptographic circuits
H04L 9/06 - Arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for blockwise coding, e.g. D.E.S. systems
G06F 21/79 - Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data in semiconductor storage media, e.g. directly-addressable memories
64.
CHANNEL AND SUB-CHANNEL THROTTLING FOR MEMORY CONTROLLERS
An arbiter is operable to pick commands from a command queue for dispatch to a memory. The arbiter includes a traffic throttle circuit for mitigating excess power usage increases in coordination with one or more additional arbiters. The traffic throttle circuit includes a monitoring circuit and a throttle circuit. The monitoring circuit is for measuring a number of read and write commands picked by the arbiter and the one or more additional arbiters over a first predetermined period of time. The throttle circuit, responsive to a low activity state, limits a number of read and write commands issued by the arbiter during a second predetermined period of time.
A disclosed method for making efficient picks of micro-operations for execution includes selecting a first set of micro-operations that are ready for execution during a certain clock cycle. The method also includes selecting a second set of micro-operations that are ready for execution during the certain clock cycle. The method additionally includes replacing one or more of the complex micro-operations included in the first set of micro-operations with one or more simple micro-operations included in the second set of micro-operations due at least in part to a number of complex micro-operations included in the first set of micro-operations exceeding a set of complex resources capable of executing the complex micro-operations. Various other apparatuses, systems, and methods are also disclosed.
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06F 7/57 - Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups or for performing logical operations
G06F 5/01 - Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
An apparatus and method for efficiently managing balanced performance among replicated partitions of an integrated circuit despite loss of functionality due to manufacturing defects. A processing unit includes at least two replicated partitions, each assigned to operation parameters of a respective power domain. The partitions include multiple compute units. The compute units include multiple lanes of execution. Due to a variety of types of manufacturing defects, one or more of the partitions of the processing unit has less than a predetermined number of operational compute units. To balance the throughput of the multiple partitions, a power manager generates both static and dynamic scaling factors based on at least the corresponding number of operational compute units. Using these scaling factors, the power manager adjusts the operation parameters of power domains for the partitions relative to one another.
An apparatus and method for efficiently scheduling tasks in a dynamic manner to multiple cores that support a heterogeneous computing architecture. A computing system includes multiple cores with at least two cores being capable of executing instructions of a same instruction set architecture (ISA), and therefore, are architecturally compatible. In an implementation, each of the at least two cores is a general-purpose central processing unit (CPU) core capable of executing instructions of a same ISA. However, the throughput and the power consumption greatly differ between the at least two cores based on their hardware designs. An operating system scheduler assigns a thread to a first core, and the first core measures thread dynamic behavior of the thread over a time interval. Based on the thread dynamic behavior, the scheduler reassigns the thread to a second core different from the first core.
A data processor is for accessing a memory having a first pseudo channel and a second pseudo channel. The data processor includes at least one memory accessing agent, a memory controller, and a data fabric. The at least one memory accessing agent generates generating memory access requests including first memory access requests that access the memory. The memory controller provides memory commands to the memory in response to the first memory access requests. The data fabric routes the first memory access requests to a first downstream port in response to a corresponding first memory request accessing the first pseudo channel, and to a second downstream port in response to the corresponding first memory request accessing the second pseudo channel. The memory controller has first and second upstream ports coupled to the first and second downstream ports of the data fabric, respectively, and a downstream port coupled to the memory.
A data processor accesses a memory having a first pseudo channel and a second pseudo channel. The data processor includes at least one memory accessing agent for generating a memory access request, a memory controller for providing a memory command to the memory in response to a normalized request selectively using a first pseudo channel pipeline circuit and a second pseudo channel pipeline circuit, and a data fabric for converting the memory access request into the normalized request selectively for the first pseudo channel and the second pseudo channel.
A processing system [100] includes a processing device [230] coupled to a memory [106] configured to check for and correct faults in requested data. In response to correcting the faults of the requested data, the memory sends the corrected data and unused check bits to the processing device as a plurality of fetch returns. The memory also sends a parity fetch based on the corrected data and one or more operations to the processing device. After receiving the plurality of fetch returns and the unused check bits, the processing device checks each fetch return for faults based on the unused check bits. In response to determining that a fetch return includes a fault, the processing device erases the fetch return and reconstructs the fetch return based on one or more other received fetch returns and the parity fetch.
One or more rotated bounding volumes are generated for one or more nodes of a bounding volume hierarchy (BVH). Volume intersection ray tracing tests are then performed using the rotated bounding volumes with the aim of reducing the number of calculations required relative to an original, non-rotated bounding volume. Rotated bounding volumes are selected from a plurality of candidate rotations, and selection of one of the candidate rotations are based on surface areas, such as minimum total surface areas, of bounding volumes corresponding to each of the candidate rotations. In order to minimize data storage and increase performance, a number of candidate rotations may be limited to a predetermined set of rotations.
The disclosed computer-implemented method for interpolating register-based lookup tables can include identifying, within a set of registers, a lookup table that has been encoded for storage within the set of registers. The method can also include receiving a request to look up a value in the lookup table and responding to the request by interpolating, from the encoded lookup table stored in the set of registers, a representation of the requested value. Various other methods, systems, and computer-readable media are also disclosed.
The disclosed method may include detecting, by a control circuit coupled to a first read only memory (ROM) device and a second ROM device, a failure of a first output signal from the first ROM device to a common output. The first ROM device is connected to the common output and the second ROM device is disconnected from the common output. The method also includes switching, by the control circuit in response to detecting the failure, the common output from the first ROM device to the second ROM device. Various other methods, systems, and computer-readable media are also disclosed.
A technique for rendering is provided. The technique includes for a set of primitives processed in a coarse binning pass, outputting early draw data to an early draw buffer; while processing the set of primitives in the coarse binning pass, processing the early draw data in a fine binning pass; and processing remaining primitives of the set of primitives in the fine binning pass.
Real time workload-based system adjustment is described. In accordance with the described techniques, a processor and a memory are operated according to first settings associated with a first workload. A second workload configured to utilize the processor and the memory is detected. The second workload is associated with second settings. Responsive to detecting the second workload, operation of the processor and the memory are adjusted to operate according to the second settings without rebooting.
A technique for operating a cache is disclosed. The technique includes in response to a power down trigger that indicates that the cache effectiveness is considered to be low, powering down the cache.
Methods and systems are disclosed for cross-chiplet performance data streaming. Techniques disclosed include accumulating, by a subservient chiplet, event data associated with an event indicative of a performance aspect of the subservient chiplet; sending, by the subservient chiplet, the event data over a chiplet bus to a master chiplet; and adding, by the master chiplet, the received event data to an event record, the event record containing previously received, from the subservient chiplet over the chiplet bus, event data associated with the event.
Core activation and deactivation for a multi-core processor is described. In accordance with the described techniques, a processor having multiple cores operates using a first core configuration. A request to switch from the first core configuration to a second core configuration is received. Responsive to the request, a switch from the first core configuration to the second core configuration occurs by adjusting a number of active cores of the processor without rebooting.
G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
G06F 11/20 - Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
G06F 1/3234 - Power saving characterised by the action undertaken
Array of pointers prefetching is described. In accordance with described techniques, a pointer target instruction is detected by identifying that a destination location of a load instruction is used in an address compute for a memory operation and the load instruction is included in a sequence of load instructions having addresses separated by a step size. An instruction for fetching data of a future load instruction is injected in an instruction stream of a processor. The data of the future load instruction is stored in a temporary register. An additional instruction is injected in the instruction stream for prefetching a pointer target based on an address of the memory operation and the data of the future load instruction.
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
Package lids with carveouts configured for processor connection and alignment are described. Lid carveouts are configured to align and mechanically secure a cooling device to the package lid by receiving protrusions of the cooling device. Because the lid carveouts ensure precise alignment and orientation of a cooling device relative to a package lid, the lid design enables targeted cooling of discrete portions of the lid. Lid carveouts are further configured to expose one or more connectors disposed on a surface that supports package internal components. When contacted by corresponding connectors of a cooling device, the lid carveouts enable direct connections between the package and the attached cooling device. By creating a direct connection between package components and an attached cooling device, the lid carveouts enable a high-speed connection for proactive and on-demand cooling actuation.
A technique for performing ray tracing operations is provided, The technique includes, in response to detecting that a threshold number of traversal stage work-items of a wavefront have terminated, increasing intersection test parallelization for non-terminated work-items..
Package lids with carveouts configured to expose lights directly connected to an internal component of a processor are described. Lid carveouts are configured to precisely align and mechanically secure a cooling device to the package lid by receiving protrusions of the cooling device via a press fit connection, while maintaining visibility of lights directly connected to processor internal components when the cooling device is connected. Lid carveouts are further configured to expose one or more connectors disposed on a processor surface that supports its internal component. When contacted by corresponding connectors of an auxiliary device, such as a light not integrated into a processor package or a cooling device, the lid carveouts enable direct connections between the package's internal components and the auxiliary device.
Load dependent branch prediction is described. In accordance with described techniques, a load dependent branch instruction is detected by identifying that a destination location of a load instruction is used in an operation for determining whether a conditional branch is taken or not taken. The load instruction is included in a sequence of load instructions having addresses separated by a step size. An instruction is injected in an instruction stream of a processor for fetching data of a future load instruction using an address of the load instruction offset by a distance based on the step size. An additional instruction is injected in the instruction stream of the processor for precomputing an outcome of a load dependent branch using an address computed based on an address of the operation and the data of the future load instruction.
User configurable hardware settings for overclocking is described. In accordance with the described techniques, user input to adjust hardware settings for operating a processing unit in an overclocking mode is received. The user input, for example, adjusts at least one of a voltage droop threshold or a frequency adjustment of the clock rate. A voltage droop is detected while operating the processing unit in the overclocking mode. Responsive to detecting the voltage droop, a clock rate of the processing unit is adjusted based at least in part on the adjusted hardware settings.
A first frame of a video stream is obtained. The first frame is defined by a plurality of pixels associated with a set of color data. A determination is made that a pixel of the plurality of pixels comprises high-frequency information. Responsive to the determination that the pixel comprises high-frequency information, a pixel lock is generated for the pixel such that color data associated with the pixel is maintained during a color accumulation process for at least one of the first frame or a second frame of the video stream that is subsequent to the first frame.
A first frame of a video stream rendered at a first resolution is obtained. A second frame of the video stream upscaled to a second higher resolution is also obtained. The first plurality of pixels is upscaled to the second resolution. The upsampling generates upsampled color data for the upsampled first plurality of pixels. The upsampled color data is accumulated with a second set of color data associated with a second plurality of pixels defining the second frame to generate final color data for the upsampled first plurality of pixels. Color data of the second set of color data associated with a pixel lock contributes more to the final color data than corresponding color data of the upsampled color data. The upsampled first plurality of pixels is stored with the final color data as an upscaled frame representing the first frame at the second resolution.
A memory package includes first, second, third, and fourth channels arranged consecutively in a clockwise direction on the memory package, each of the first, second, third, and fourth channels having access circuitry and memory arrays. In a first mode, the first channel controls access to the memory arrays in the second channel and the fourth channel controls access to the memory arrays in the third channel.
An apparatus includes a processor configured to determine a first distribution associated with an artificial agent based on behavior associated with the artificial agent and a second distribution based on behavior of a user. The processor is further configured to generate a human-likeness similarity measurement by comparing the first distribution to the second distribution and modify the behavior of the artificial agent in response to the similarity measurement failing to satisfy a similarity threshold.
Methods and systems are disclosed for traversing nodes in a BVH tree by an intersection engine. Techniques disclosed comprise receiving, by the intersection engine, a traversal instruction, including a tracing-mode, ray data, and an identifier of a node to be traversed. Where the tracing-mode includes a closest hit mode and a first hit mode. If the node to be traversed is an internal node, the intersection engine determines, based on the tracing-mode, an order in which children nodes of the node are to be next traversed and output identifiers of the children nodes in the determined order.
A technique for performing ray tracing operations is provided. The technique includes combining one or more common exponent values of a compressed box node with one or more minimum vertex mantissas of the compressed box node and one or more maximum vertex mantissas of the compressed box node to obtain one or more minimum vertices and one or more maximum vertices; and combining a minimum vertex of the one or more minimum vertices and a maximum vertex of the one or more maximum vertices to obtain a bounding box for a compressed box data item of the compressed box node.
A method and apparatus for managing power states in a computer system includes, responsive to an event received by a processor, powering up a first circuitry. Responsive to the event not being serviceable by the first circuitry, powering up at least a second circuitry of the computer system to service the event.
A technique for performing ray tracing operations is provided. The technique includes traversing through a bounding volume hierarchy to an instance node; performing an instance node transform using common circuitry; traversing to a leaf node of the bounding volume hierarchy; and performing an intersection test for the leaf node using the common circuitry.
Devices, methods and systems for managing resources in a computing device. Information regarding resource usage is captured. A prediction is generated, based on the information, that resource usage by a processor will exceed a threshold during an upcoming time. An operating parameter of the processor is adjusted, based on the prediction. In some implementations, information regarding memory bandwidth is captured. A prediction is generated, based on the information, that a memory region stored in a first memory device will be addressed by a memory intensive instruction during an upcoming time period. Data stored in the memory region is moved to a second memory device, based on the prediction.
Methods and systems are disclosed for inline suspension of an accelerated processing unit (APU). Techniques include receiving a packet, including a mode of operation and commands to be executed by the APU; suspending execution of commands received in previous packets when the mode of operation is a suspension initiation mode; and executing, by the APU, the commands in the received packet. The execution of the suspended commands is restored when the mode of operation in a subsequently received packet is a suspension conclusion mode.
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
G06T 1/20 - Processor architectures; Processor configuration, e.g. pipelining
95.
EMULATING PERFORMANCE OF PRIOR GENERATION PLATFORMS
Methods and systems are disclosed for emulating, in a platform, the performance of a target platform. Techniques disclosed include receiving, by the platform, values of system features, associated with a target performance of the target platform; and setting, by the platform, one or more configuration knobs, based on the received values of system features, to match a performance of the platform to the target performance of the target platform.
Cascading execution of atomic operations, including: receiving a request for each thread of a plurality of threads to perform an atomic operation, wherein the plurality of threads comprises a plurality of thread subsets each corresponding to a local memory, wherein the local memory for a thread subset is accessible by the thread subset and inaccessible to a remainder of threads in the plurality of threads; generating a plurality of intermediate results by performing, by each thread subset, the atomic operation in the local memory corresponding to the thread subset; and generating a result for the request by aggregating the plurality of intermediate results in a shared memory accessible to all threads in the plurality of threads.
Parallel processors typically allocate resources to workloads based on workload priority. Priority inversion of resource allocation between workloads of different priorities reduces the operating efficiency of a parallel processor in some cases. A parallel processor [100] mitigates priority inversion by soft-locking resources to prevent their allocation for the processing of lower priority workloads. Soft-locking is enabled responsive to a soft-lock condition, such as one or more priority inversion heuristics [110] exceeding corresponding thresholds or multiple failed allocations of higher priority workloads within a time period. In some cases, priority inversion heuristics include quantities of higher priority workloads and lower priority workloads that are in-flight or incoming, ratios between such quantities, quantities of render targets, or a combination of these. The soft-lock is released responsive to expiry of a soft-lock timer or incoming or in-flight higher priority workloads falling below a threshold, for example.
Methods and systems are disclosed for clock delay compensation in a multiple chiplet system. Techniques disclosed include distributing, by a clock generator, a clock signal across distribution trees of respective chiplets; measuring phases, by phase detectors, where each phase measurement is associated with a chiplet of the chiplets and is indicative of a propagation speed of the clock signal through the distribution tree of the chiplet. Then, for each chiplet, techniques are further disclosed that determine, by a microcontroller, based on the phase measurements associated with the chiplet, a delay offset, and that delay, based on the delay offset, the propagation of the clock signal through the distribution tree of the chiplet using a delay unit associated with the chiplet.
Methods and systems are disclosed for frequency transitioning in a memory interface system. Techniques disclosed include receiving a signal indicative of a change in operating frequency, into a new frequency, in a processing unit interfacing with memory via the memory interface system; switching the system from a normal mode of operation into a transition mode of operation; updating control and state register (CSR) banks of respective transceivers of the system through a mission bus used during the normal mode of operation; and operating the system in the new frequency.
Restricting peripheral device protocols in confidential compute architectures, the method including: receiving a first address translation request from a peripheral device supporting a first protocol, wherein the first protocol supports cache coherency between the peripheral device and a processor cache; determining that a confidential compute architecture is enabled; and providing, in response to the first address translation request, a response including an indication to the peripheral device to not use the first protocol.