A representative reconfigurable processing circuit and a reconfigurable arithmetic circuit are disclosed, each of which may include input reordering queues; a multiplier shifter and combiner network coupled to the input reordering queues; an accumulator circuit; and a control logic circuit, along with a processor and various interconnection networks. A representative reconfigurable arithmetic circuit has a plurality of operating modes, such as floating point and integer arithmetic modes, logical manipulation modes, Boolean logic, shift, rotate, conditional operations, and format conversion, and is configurable for a wide variety of multiplication modes. Dedicated routing connecting multiplier adder trees allows multiple reconfigurable arithmetic circuits to be reconfigurably combined, in pair or quad configurations, for larger adders, complex multiplies and general sum of products use, for example.
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
G06F 5/01 - Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
G06F 7/544 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices for evaluating functions by calculation
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
H03K 19/21 - EXCLUSIVE-OR circuits, i.e. giving output if input signal exists at only one input; COINCIDENCE circuits, i.e. giving output only if all input signals are identical
2.
METHOD, APPARATUS, AND COMPUTER-READABLE MEDIUM FOR PARALLELIZATION OF A COMPUTER PROGRAM ON A PLURALITY OF COMPUTING CORES
An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
A system and method to efficiently configure an array of processing cores to perform functions of a program. A function of the program is converted to a configuration of cores. The configuration is laid out in a first subset of the array of cores. The configuration is stored. The configuration is replicated to perform the function on a second subset of the array of cores.
A system and method to allocate serial interconnection lanes on a die to multiple communication protocols is disclosed. The die has at least one processing core. The die incudes a first communication subsystem including a controller, a protocol coding sublayer (PCS) for interchanging data, and a data interface coupled to the core. The die includes a second communication subsystem including a controller, a PCS for interchanging data, and a data interface coupled to the core. A mode input selects at least one of the first or second communication protocol. A data router has an input coupled to the PCS of the first communication subsystem and an input coupled to the PCS of the second communication subsystem. The data router has an output coupled to the set of serial interconnection lanes, and a selection input coupled to the mode input to allocate some of the lanes for the selected protocol.
Systems and methods for configuring a reduced instruction set computer processor architecture to execute fully homomorphic encryption (FHE) logic gates as a streaming topology. The method includes parsing sequential FHE logic gate code, transforming the FHE logic gate code into a set of code modules that each have in input and an output that is a function of the input and which do not pass control to other functions, creating a node wrapper around each code module, configuring at least one of the primary processing cores to implement the logic element equivalents of each element in a manner which operates in a streaming mode wherein data streams out of corresponding arithmetic logic units into the main memory and other ones of the plurality arithmetic logic units.
G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
G06F 7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices
G06F 17/14 - Fourier, Walsh or analogous domain transformations
A system and method to compress application control data, such as weights for a layer of a convolutional neural network, is disclosed. A multi-core system for executing at least one layer of the convolutional neural network includes a storage device storing a compressed weight matrix of a set of weights of the at least one layer of the convolutional network and a decompression matrix. The compressed weight matrix is formed by matrix factorization and quantization of a floating point value of each weight to a floating point format. A decompression module is operable to obtain an approximation of the weight values by decompressing the compressed weight matrix through the decompression matrix. A plurality of cores executes the at least one layer of the convolutional neural network with the approximation of weight values to produce an inference output.
H03M 7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
H03M 7/40 - Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
H03M 7/02 - Conversion to or from weighted codes, i.e. the weight given to a digit depending on the position of the digit within the block or code word
A representative reconfigurable processing circuit and a reconfigurable arithmetic circuit are disclosed, each of which may include input reordering queues; a multiplier shifter and combiner network coupled to the input reordering queues; an accumulator circuit; and a control logic circuit, along with a processor and various interconnection networks. A representative reconfigurable arithmetic circuit has a plurality of operating modes, such as floating point and integer arithmetic modes, logical manipulation modes, Boolean logic, shift, rotate, conditional operations, and format conversion, and is configurable for a wide variety of multiplication modes. Dedicated routing connecting multiplier adder trees allows multiple reconfigurable arithmetic circuits to be reconfigurably combined, in pair or quad configurations, for larger adders, complex multiplies and general sum of products use, for example.
G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
H03K 19/21 - EXCLUSIVE-OR circuits, i.e. giving output if input signal exists at only one input; COINCIDENCE circuits, i.e. giving output only if all input signals are identical
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06F 5/01 - Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 7/544 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices for evaluating functions by calculation
G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
A method and system for providing robust streaming of data from a multi-core die is disclosed. The techniques include using a high bandwidth memory (HBM) device as retransmit buffers for large amounts of data to ensure robust communication in relatively high round trip-transmission time (RTT) transmission. Another technique is supporting two or more Ethernet ports between components to both transmit the same data packets on the two ports to insure robustness. Another technique is to use sequence numbers and send data packets from the different ports in a round robin fashion and reorder the packets upon receipt of an external device. Another technique is dynamically adding and removing paths for data packets between devices with multiple ports based on the quality of the path.
A representative reconfigurable processing circuit and a reconfigurable arithmetic circuit are disclosed, each of which may include input reordering queues; a multiplier shifter and combiner network coupled to the input reordering queues; an accumulator circuit; and a control logic circuit, along with a processor and various interconnection networks. A representative reconfigurable arithmetic circuit has a plurality of operating modes, such as floating point and integer arithmetic modes, logical manipulation modes, Boolean logic, shift, rotate, conditional operations, and format conversion, and is configurable for a wide variety of multiplication modes. Dedicated routing connecting multiplier adder trees allows multiple reconfigurable arithmetic circuits to be reconfigurably combined, in pair or quad configurations, for larger adders, complex multiplies and general sum of products use, for example.
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
G06F 5/01 - Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06F 7/544 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices for evaluating functions by calculation
H03K 19/21 - EXCLUSIVE-OR circuits, i.e. giving output if input signal exists at only one input; COINCIDENCE circuits, i.e. giving output only if all input signals are identical
Systems and methods for configuring a reduced instruction set computer processor architecture to execute fully homomorphic encryption (FHE) logic gates as a streaming topology. The method includes parsing sequential FHE logic gate code, transforming the FHE logic gate code into a set of code modules that each have in input and an output that is a function of the input and which do not pass control to other functions, creating a node wrapper around each code module, configuring at least one of the primary processing cores to implement the logic element equivalents of each element in a manner which operates in a streaming mode wherein data streams out of corresponding arithmetic logic units into the main memory and other ones of the plurality arithmetic logic units.
G06F 7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices
G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
G06F 17/14 - Fourier, Walsh or analogous domain transformations
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
11.
RECONFIGURABLE REDUCED INSTRUCTION SET COMPUTER PROCESSOR ARCHITECTURE WITH FRACTURED CORES
Systems and methods for reconfiguring a reduced instruction set computer processor architecture are disclosed. Exemplary implementations may: provide a primary processing core consisting of a RISC processor; provide a node wrapper associated with each of the plurality of secondary cores, the node wrapper comprising access memory associates with each secondary core, and a load/unload matrix associated with each secondary core; operate the architecture in a manner in which, for at least one core, data is read from and written to the at least cache memory in a control-centric mode; the secondary cores are selectively partitioned to operate in a streaming mode wherein data streams out of the corresponding secondary core into the main memory and other ones of the plurality of secondary cores.
G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
12.
Parallel processing of data having data dependencies for accelerating the launch and performance of operating systems and other computing applications
Representative embodiments are disclosed for a rapid and highly parallel decompression of compressed executable and other files, such as executable files for operating systems and applications, having compressed blocks including run length encoded (“RLE”) data having data-dependent references. An exemplary embodiment includes a plurality of processors or processor cores to identify a start or end of each compressed block; to partially decompress, in parallel, a selected compressed block into independent data, dependent (RLE) data, and linked dependent (RLE) data; to sequence the independent data, dependent (RLE) data, and linked dependent (RLE) data from a plurality of partial decompressions of a plurality of compressed blocks, to obtain data specified by the dependent (RLE) data and linked dependent (RLE) data, and to insert the obtained data into a corresponding location in an uncompressed file. The representative embodiments are also applicable to other types of data processing for applications having data dependencies.
H03M 7/46 - Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind
G06F 3/06 - Digital input from, or digital output to, record carriers
H03M 7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
H03M 7/40 - Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
H03M 7/42 - Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code using table look-up for the coding or decoding process, e.g. using read-only memory
A method and system for providing robust streaming of data from a multi-core die is disclosed. The techniques include using a high bandwidth memory (HBM) device as retransmit buffers for large amounts of data to ensure robust communication in relatively high round trip-transmission time (RTT) transmission. Another technique is supporting two or more Ethernet ports between components to both transmit the same data packets on the two ports to insure robustness. Another technique is to use sequence numbers and send data packets from the different ports in a round robin fashion and reorder the packets upon receipt of an external device. Another technique is dynamically adding and removing paths for data packets between devices with multiple ports based on the quality of the path.
A system and method to compress application control data, such as weights for a layer of a convolutional neural network, is disclosed. A multi-core system for executing at least one layer of the convolutional neural network includes a storage device storing a compressed weight matrix of a set of weights of the at least one layer of the convolutional network and a decompression matrix. The compressed weight matrix is formed by matrix factorization and quantization of a floating point value of each weight to a floating point format. A decompression module is operable to obtain an approximation of the weight values by decompressing the compressed weight matrix through the decompression matrix. A plurality of cores executes the at least one layer of the convolutional neural network with the approximation of weight values to produce an inference output.
G06F 7/483 - Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
H03M 7/40 - Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
H03M 7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
G06K 9/62 - Methods or arrangements for recognition using electronic means
A representative reconfigurable processing circuit and a reconfigurable arithmetic circuit are disclosed, each of which may include input reordering queues; a multiplier shifter and combiner network coupled to the input reordering queues; an accumulator circuit; and a control logic circuit, along with a processor and various interconnection networks. A representative reconfigurable arithmetic circuit has a plurality of operating modes, such as floating point and integer arithmetic modes, logical manipulation modes, Boolean logic, shift, rotate, conditional operations, and format conversion, and is configurable for a wide variety of multiplication modes. Dedicated routing connecting multiplier adder trees allows multiple reconfigurable arithmetic circuits to be reconfigurably combined, in pair or quad configurations, for larger adders, complex multiplies and general sum of products use, for example.
G06F 5/01 - Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
G06F 7/544 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices for evaluating functions by calculation
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
H03K 19/21 - EXCLUSIVE-OR circuits, i.e. giving output if input signal exists at only one input; COINCIDENCE circuits, i.e. giving output only if all input signals are identical
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
A representative reconfigurable processing circuit and a reconfigurable arithmetic circuit are disclosed, each of which may include input reordering queues; a multiplier shifter and combiner network coupled to the input reordering queues; an accumulator circuit; and a control logic circuit, along with a processor and various interconnection networks. A representative reconfigurable arithmetic circuit has a plurality of operating modes, such as floating point and integer arithmetic modes, logical manipulation modes, Boolean logic, shift, rotate, conditional operations, and format conversion, and is configurable for a wide variety of multiplication modes. Dedicated routing connecting multiplier adder trees allows multiple reconfigurable arithmetic circuits to be reconfigurably combined, in pair or quad configurations, for larger adders, complex multiplies and general sum of products use, for example.
G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
H03K 19/21 - EXCLUSIVE-OR circuits, i.e. giving output if input signal exists at only one input; COINCIDENCE circuits, i.e. giving output only if all input signals are identical
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06F 5/01 - Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 7/544 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices for evaluating functions by calculation
17.
Method and apparatus for configuring a reduced instruction set computer processor architecture to execute a fully homomorphic encryption algorithm
Systems and methods for configuring a reduced instruction set computer processor architecture to execute fully homomorphic encryption (FHE) logic gates as a streaming topology. The method includes parsing sequential FHE logic gate code, transforming the FHE logic gate code into a set of code modules that each have in input and an output that is a function of the input and which do not pass control to other functions, creating a node wrapper around each code module, configuring at least one of the primary processing cores to implement the logic element equivalents of each element in a manner which operates in a streaming mode wherein data streams out of corresponding arithmetic logic units into the main memory and other ones of the plurality arithmetic logic units.
G06F 7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices
G06F 17/14 - Fourier, Walsh or analogous domain transformations
G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
Systems and methods for reconfiguring a reduced instruction set computer processor architecture are disclosed. Exemplary implementations may: provide a primary processing core consisting of a RISC processor; provide a node wrapper associated with each of the plurality of secondary cores, the node wrapper comprising access memory associates with each secondary core, and a load/unload matrix associated with each secondary core; operate the architecture in a manner in which, for at least one core, data is read from and written to the at least cache memory in a control-centric mode; the secondary cores are selectively partitioned to operate in a streaming mode wherein data streams out of the corresponding secondary core into the main memory and other ones of the plurality of secondary cores.
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
G06F 7/57 - Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups or for performing logical operations
19.
Method and apparatus for a multi-core system for implementing stream-based computations having inputs from multiple streams
A method and system of efficient use and programming of a multi-processing core device. The system includes a programming construct that is based on stream-domain code. A programmable core based computing device is disclosed. The computing device includes a plurality of processing cores coupled to each other. A memory stores stream-domain code including a stream defining a stream destination module and a stream source module. The stream source module places data values in the stream and the stream conveys data values from the stream source module to the stream destination module. A runtime system detects when the data values are available to the stream destination module and schedules the stream destination module for execution on one of the plurality of processing cores.
Representative embodiments are disclosed for a rapid and highly parallel decompression of compressed executable and other files, such as executable files for operating systems and applications, having compressed blocks including run length encoded (“RLE”) data having data-dependent references. An exemplary embodiment includes a plurality of processors or processor cores to identify a start or end of each compressed block; to partially decompress, in parallel, a selected compressed block into independent data, dependent (RLE) data, and linked dependent (RLE) data; to sequence the independent data, dependent (RLE) data, and linked dependent (RLE) data from a plurality of partial decompressions of a plurality of compressed blocks, to obtain data specified by the dependent (RLE) data and linked dependent (RLE) data, and to insert the obtained data into a corresponding location in an uncompressed file. The representative embodiments are also applicable to other types of data processing for applications having data dependencies.
H03M 7/46 - Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind
G06F 3/06 - Digital input from, or digital output to, record carriers
H03M 7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
H03M 7/40 - Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
H03M 7/42 - Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code using table look-up for the coding or decoding process, e.g. using read-only memory
21.
Method, apparatus, and computer-readable medium for parallelization of a computer program on a plurality of computing cores
An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
A method and system of compiling and linking source stream programs for efficient use of multi-node devices. The system includes a compiler, a linker, a loader and a runtime component. The process converts a source code stream program to a compiled object code that is used with a programmable node based computing device having a plurality of processing nodes coupled to each other. The programming modules include stream statements for input values and output values in the form of sources and destinations for at least one of the plurality of processing nodes and stream statements that determine the streaming flow of values for the at least one of the plurality of processing nodes. The compiler converts the source code stream based program to object modules, object module instances and executables. The linker matches the object module instances to at least one of the multiple cores. The loader loads the tasks required by the object modules in the nodes and configure the nodes matched with the object module instances. The runtime component runs the converted program.
An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
G06Q 40/00 - Finance; Insurance; Tax strategies; Processing of corporate or income taxes
Representative embodiments are disclosed for a rapid and highly parallel decompression of compressed executable and other files, such as executable files for operating systems and applications, having compressed blocks including run length encoded (“RLE”) data having data-dependent references. An exemplary embodiment includes a plurality of processors or processor cores to identify a start or end of each compressed block; to partially decompress, in parallel, a selected compressed block into independent data, dependent (RLE) data, and linked dependent (RLE) data; to sequence the independent data, dependent (RLE) data, and linked dependent (RLE) data from a plurality of partial decompressions of a plurality of compressed blocks, to obtain data specified by the dependent (RLE) data and linked dependent (RLE) data, and to insert the obtained data into a corresponding location in an uncompressed file. The representative embodiments are also applicable to other types of data processing for applications having data dependencies.
H03M 7/40 - Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
G06F 17/30 - Information retrieval; Database structures therefor
H03M 7/46 - Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind
H03M 7/42 - Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code using table look-up for the coding or decoding process, e.g. using read-only memory
25.
Method, apparatus, and computer-readable medium for parallelization of a computer program on a plurality of computing cores
An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
G06Q 40/00 - Finance; Insurance; Tax strategies; Processing of corporate or income taxes
G06F 9/45 - Compilation or interpretation of high level programme languages
26.
Method, apparatus, and computer-readable medium for parallelization of a computer program on a plurality of computing cores
An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
G06F 9/45 - Compilation or interpretation of high level programme languages
G06Q 40/00 - Finance; Insurance; Tax strategies; Processing of corporate or income taxes
27.
Method, apparatus, and computer-readable medium for parallelization of a computer program on a plurality of computing cores
An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
G06F 9/45 - Compilation or interpretation of high level programme languages
G06Q 40/00 - Finance; Insurance; Tax strategies; Processing of corporate or income taxes
28.
Parallel processing of data having data dependencies for accelerating the launch and performance of operating systems and other computing applications
Representative embodiments are disclosed for a rapid and highly parallel decompression of compressed executable and other files, such as executable files for operating systems and applications, having compressed blocks including run length encoded (“RLE”) data having data-dependent references. An exemplary embodiment includes a plurality of processors or processor cores to identify a start or end of each compressed block; to partially decompress, in parallel, a selected compressed block into independent data, dependent (RLE) data, and linked dependent (RLE) data; to sequence the independent data, dependent (RLE) data, and linked dependent (RLE) data from a plurality of partial decompressions of a plurality of compressed blocks, to obtain data specified by the dependent (RLE) data and linked dependent (RLE) data, and to insert the obtained data into a corresponding location in an uncompressed file. The representative embodiments are also applicable to other types of data processing for applications having data dependencies.
H03M 7/40 - Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
H03M 7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
H03M 7/46 - Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind
G06F 3/06 - Digital input from, or digital output to, record carriers
29.
Method and apparatus for a compiler and related components for stream-based computations for a general-purpose, multiple-core system
A method and system of compiling and linking source stream programs for efficient use of multi-node devices. The system includes a compiler, a linker, a loader and a runtime component. The process converts a source code stream program to a compiled object code that is used with a programmable node based computing device having a plurality of processing nodes coupled to each other. The programming modules include stream statements for input values and output values in the form of sources and destinations for at least one of the plurality of processing nodes and stream statements that determine the streaming flow of values for the at least one of the plurality of processing nodes. The compiler converts the source code stream based program to object modules, object module instances and executables. The linker matches the object module instances to at least one of the multiple cores. The loader loads the tasks required by the object modules in the nodes and configure the nodes matched with the object module instances. The runtime component runs the converted program.
An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
G06F 9/45 - Compilation or interpretation of high level programme languages
G06Q 40/00 - Finance; Insurance; Tax strategies; Processing of corporate or income taxes
31.
Method and apparatus for a general-purpose multiple-core system for implementing stream-based computations
A method and system of efficient use and programming of a multi-processing core device. The system includes a programming construct that is based on stream-domain code. A programmable core based computing device is disclosed. The computing device includes a plurality of processing cores coupled to each other. A memory stores stream-domain code including a stream defining a stream destination module and a stream source module. The stream source module places data values in the stream and the stream conveys data values from the stream source module to the stream destination module. A runtime system detects when the data values are available to the stream destination module and schedules the stream destination module for execution on one of the plurality of processing cores.
A method and system of compiling and linking source stream programs for efficient use of multi-node devices. The system includes a compiler, a linker, a loader and a runtime component. The process converts a source code stream program to a compiled object code that is used with a programmable node based computing device having a plurality of processing nodes coupled to each other. The programming modules include stream statements for input values and output values in the form of sources and destinations for at least one of the plurality of processing nodes and stream statements that determine the streaming flow of values for the at least one of the plurality of processing nodes. The compiler converts the source code stream based program to object modules, object module instances and executables. The linker matches the object module instances to at least one of the multiple cores. The loader loads the tasks required by the object modules in the nodes and configure the nodes matched with the object module instances. The runtime component runs the converted program.
A method and system of efficient use and programming of a multi-processing core device. The system includes a programming construct that is based on stream-domain code. A programmable core based computing device is disclosed. The computing device includes a plurality of processing cores coupled to each other. A memory stores stream-domain code including a stream defining a stream destination module and a stream source module. The stream source module places data values in the stream and the stream conveys data values from the stream source module to the stream destination module. A runtime system detects when the data values are available to the stream destination module and schedules the stream destination module for execution on one of the plurality of processing cores.
Aspects for achieving individualized protected space in an operating system are provided. The aspects include performing on demand hardware instantiation via an ACE (an adaptive computing engine), and utilizing the hardware for monitoring predetermined software programming to protect an operating system.
The disclosure relates to the management of PKI digital certificates, including certificate discovery, installation, verification and replacement for endpoints over an insecure network. A database of certificates may be maintained through discovery, replacement and other activities. Certificate discovery identifies certificates and associated information including network locations, methods of access, applications of use and non-use, and may produce logs and reports. Automated requests to certificate authorities for new certificates, renewals or certificate signing requests may precede the installation of issued certificates to servers using installation scripts directed to a particular application or product, which may provide notification or require approval or intervention. An administrator may be notified of expiring certificates, using a database or scanning or server agents. Detailed information on various example embodiments of the inventions are provided in the Detailed Description below, and the inventions are defined by the appended claims.
H04L 9/00 - Arrangements for secret or secure communications; Network security protocols
G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
36.
System, method and software for static and dynamic programming and configuration of an adaptive computing architecture
The present invention provides a system, method and software for programming and configuring an adaptive computing architecture or device. The invention utilizes program constructs which correspond to and map directly to the adaptive hardware having a plurality of reconfigurable nodes coupled through a reconfigurable matrix interconnection network. A first program construct corresponds to a selected node. A second program construct corresponds to an executable task of the selected node and includes one or more firing conditions capable of determining the commencement of the executable task of the selected node. A third program construct corresponds to at least one input port coupling the selected node to the matrix interconnect network for input data to be consumed by the executable task. A fourth program construct corresponds to at least one output port coupling the selected node to the matrix interconnect network for output data to be produced by the executable task.
A memory controller to provide memory access services in an adaptive computing engine is provided. The controller comprises: a network interface configured to receive a memory request from a programmable network; and a memory interface configured to access a memory to fulfill the memory request from the programmable network, wherein the memory interface receives and provides data for the memory request to the network interface, the network interface configured to send data to and receive data from the programmable network.
Task definitions are used by a task scheduler and prioritizer to allocate task operation to a plurality of processing units. The task definition is an electronic record that specifies researching needed by, and other characteristics of, a task to be executed. Resources include types of processing nodes desired to execute the task, needed amount or rate of processing cycles, amount of memory capacity, number of registers, input/output ports, buffer sizes, etc. Characteristics of a task include maximum latency tome, frequency of execution of a task, communication ports, and other characteristics. An exemplary task definition language and syntax is described that uses constructs including other of attempted scheduling operations, percentage or amount of resources desired by different operations, handling of multiple executable images or modules, overlays, port aliases and other features.
G06F 15/76 - Architectures of general purpose stored program computers
G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
A hardware task manager for managing operations in an adaptive computing system. The task manager indicates when input and output buffer resources are sufficient to allow a task to execute. The task can require an arbitrary number of input values from one or more other (or the same) tasks. Likewise, a number of output buffers must also be available before the task can start to execute and store results in the output buffers. The hardware task manager maintains a counter in association with each input and output buffer. For input buffers, a negative value for the counter means that there is no data in the buffer and, hence, the respective input buffer is not ready or available. Thus, the associated task can not run. Predetermined numbers of bytes, or “units,” are stored into the input buffer and an associated counter is incremented. When the counter value transitions from a negative value to a zero the high-order bit of the counter is cleared, thereby indicating the input buffer has sufficient data and is available to be processed by a task.
G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
40.
System for authorizing functionality in adaptable hardware devices
A system for authorizing new or ongoing functional use of an adaptable device. The device generates usage information including the times that the device is used, types of functionality provided, indication of amount and type of resources used, and other information. The usage information is transmitted back to a controlling entity, such as an original manufacturer of the adaptable device. The controlling entity can act to enable or prevent use of the provided functionality, as desired. Part of the requirement for using functionality can be monetary, by predetermined agreement, or by other criteria.