Mauricio Breternitz

Papers

A unified model for accelerating unsupervised iterative re‐ranking algorithm Pisani, F.; Valem, L. P.; Guimarães Pedronette, D. C.; Torres, R.; Borin, Ed.; Breternitz, M.; Concurrency and Computation: Practice and Experience 2020 [PDF]
Efficiency and Scalability of Multi-Lane Capsule Networks (MLCN) M.Breternitz; Vanderson Martins do Rosario; Edson Borin; Int’l Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) 2019 [PDF]
Memory Efficient Weightless Neural Network using Bloom Filters Felipe Franca; M.Breternitz; Leandro Araujo; 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning [PDF]
The multi-lane capsule network (MLCN) Vanderson Martins do Rosario; Borin, Edson; M.Breternitz; IEEE Signal Processing Letters 2019 [PDF]
ComP-Net: command processor networking for efficient intra-kernel communications on GPUs Michael W. Lebeane; Khaled Hamidouche; Brad Benton; Maurício Breternitz; Steven K. Reinhardt; Lizy K. John; Parallel Architectures and Compilation Techniques - Conference Proceedings PACT, 2018 [PDF]
Mixed reality application to support infrastructure maintenance, Silva, H.; Resende, R.; Breternitz, M. 2nd International Young Engineers Forum, YEF-ECE 2018
GPU Triggered Networking for Intra-Kernel Communications M.Lebeane et al Supercomputing, 2017 [PDF]
Extended Task Queueing: Active Messages for Heterogeneous Systems M.Lebeane et al Supercomputing, 2016 [PDF]
PY-PITS: A scalable Python Runtime System for the computation of Partially Idempotent Tasks E.Borin et al MPP 2016 Workshop on Parallel Programming Models (best paper award) [PDF]
HadoopCL2: Motivating the Design of a Distributed, Heterogeneous Programming System With Machine-Learning Applications M. Grossman, M. Breternitz, V. Sarkar. IEEE Transactions on Parallel and Distributed Systems, issue 99, 2015 [PDF]
Optimizing Big Data Analytics on Heterogeneous Processors M. Daga, J. Gu, M. Breternitz Tutorial, 2015 IEEE Conference on Big Data
Adaptive global power optimization for Web servers. Piga, Leonardo, Mauricio Breternitz et al. The Journal of Supercomputing (2014): 1-25. [PDF]
Implementation and evaluation of deep neural networks (DNN) on mainstream heterogeneous systems Junli Gu, Mauricio Breternitz et al. Proceedings of 5th Asia-Pacific Workshop on Systems. ACM, 2014.
Microcode Compression Using Structured-Constrained Clustering E. Borin, G. Araujo, M. Breternitz and Y. Wu. International Journal of Parallel Programming, Vol. 42, Issue 1, pp. 140-164, 2014.
HadoopCL: MapReduce on Distributed Heterogeneous Platforms Through Seamless Integration of Hadoop and OpenCL. M. Grossman, M. Breternitz, V. Sarkar. 2013 International Workshop on High Performance Data Intensive Computing. May 2013.
Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, M. Grossman, M. Breternitz, V. Sarkar AMD Developer Summit 2013.
Cloud Workload Analysis with SWAT M.Breternitz, K.Lowery, A.Chernoff, P.Kaminski, L.Piga, SBAC-PAD 2012 - International Conference on Computer Architecture, New York, NY
Efficient Image Re-Ranking Computation on GPUs D.Pedronette, R.Torres, Ed.Borin, M.Breternitz ISPA 2012
LAR-CC: Large Atomic regions with conditional Commits Borin, E; Wu, Y; Breternitz, M; Wang, C. CGO'2011 - IEEE/ACM 9th Annual International Symposium on Code Generation and Optimization, Apr 2-6, 2011
Structure-Constrained Microcode Compression Borin, E; Araujo, G; Breternitz, M; Wu, Y SBAC-PAD 2011 - 23rd Int’l Symposium on Computer Architecture and High-Performance Computing, Oct 2011
Face Detection: Performance opportunities for CPU-GPU Kernel Migration in Fusion Architecture Breternitz, M; Chernoff, A; Kaminski, P; Lowery, K. AMD Fusion Developer Summit, June 11-14, 2011
TAO - Two Level Atomicity for Dynamic Binary Optimizations E.Borin, Y.Wu, C.Wang, W.Liu, M.Breternitz, S.Hu, E.Natanzon, S.Rotem, R.Rosner. CGO'2010 - IEEE/ACM 8th Annual International Symposium on Code Generation and Optimization
Segmented Bloom Filter Algorithm for Efficient Predictors M. Breternitz, G.H.Loh, B.Black, J.Rupley, P.Sassone, W.Attrot, Y.Wu SBAC-PAD Conference, 2008
StarDBT: An Efficient Multi-platform Dynamic Binary Translation System C.Wang, S.Hu, H-S Kim,S.Nair, M.Breternitz, Z.Ying, Y.Wu APAC Conference, 2007
Clustering-Based Microcode Compression E. Borin, Mauricio Breternitz Jr, Y. Wu, G. Araujo ICCD2006
Enhanced Code Density of Embedded CISC Processors with Echo Technology Youfeng Wu, Mauricio Breternitz, Herbert Hum, Ramesh Peri, and Jay Pickett CODES+ISSS 2005
Echo Technology for Memory Constrained Processors Youfeng Wu, Mauricio Breternitz Jr., Herbert Hum, Ramesh Peri, Jay Pickett CTCES 2004
The Accuracy of Initial Prediction in Two-Phase Dynamic Binary Translators Youfeng Wu, Mauricio Breternitz Jr., Justin Quek, Orna Etzion, Jesse Fang International Symposium on Code Generation and Optimization with Special Emphasis on Feedback-Directed and Runtime Optimization CGO 2004; Page(s): 227-238.
Continuous Trip Count Profiling for Loop Optimizations in Two-Phase Dynamic Binary Translators Youfeng Wu, Mauricio Breternitz Jr., Tevi Devor INTERACT-8 Interaction between Compilers and Computer Architectures, 2004; Page(s): 3-12.
Compilation, Architectural Support, and Evaluation of SIMD Graphics Pipeline Programs on a General-Purpose CPU Mauricio Breternitz Jr., Herbert H. J. Hum, Sanjeev Kumar Proceedings 12th International Conference on Parallel Architectures and Compilation Techniques - PACT 2003; Page(s): 135-145.
Enhanced Compression Techniques to Simplify Program Decompression and Execution Mauricio Breternitz Jr., Roger Smith ICCD 1997; Page(s): 170-176.
The Motorola PowerPC PEEK profiler Stewart, K.; Butt, F.; Sarkisian, D.; Breternitz, M., Jr. Performance, Computing, and Communications Conference, 1997. IPCCC 1997., IEEE International , 1997; Page(s): 342-349.
Design tradeoffs and experience with Motorola PowerPC migration tools Breternitz, M.; Manikonda, A.; Ommerman, M.; Su, W.; Thornton, A. Computer Design: VLSI in Computers and Processors, 1996. ICCD '96. Proceedings., 1996 IEEE International Conference on , 1996; Page(s): 301-308.
Motorola PowerPC Migration Tools-emulation and translation Afzal, T.; Breternitz, M.; Kacher, M.; Menyhert, S.; Ommerman, M.; Su, W. Compcon '96. 'Technologies for the Information Superhighway' Digest of Papers, 1996.
Solutions and debugging for data consistency in multiprocessors with noncoherent caches Bernstein, D.; Breternitz, M., Jr.; Gheith, A.M.; Mendelson, B. International Journal of Parallel Programming, vol.23, no.1, Feb. 1995.; Page(s) 83-103.
An optimal asynchronous scheduling algorithm for software cache consistency Simons, B.; Sarkar, V.; Breternitz, M., Jr.; Lai, M.; System Sciences, 1994. Vol.II: Software Technology, Proceedings of the Twenty-Seventh Hawaii International Conference, 1994; Page(s): 502-511.
Implementation Optimization Techniques for Architecture Synthesis of Application-Specific Processors. Mauricio Breternitz Jr., John Paul Shen MICRO-24 1991; Page(s): 114-123.
Adapting AIX to a shared memory cluster. Breternitz, M., Jr.; Gheith, A.; Jindal, A.; Lehr, T. Proceedings. SHARE Europe Anniversary Meeting. Client/Server -the Promise and the Reality. Carouge/Geneva, Switzerland: SHARE Europe, 1993.; Page(s): 415-428.
Architecture synthesis of high-performance application-specific processors Breternitz, M., Jr.; Shen, J.P. Design Automation Conference, 1990. 27th ACM/IEEE , 1990; Pg(s): 542-548.
Tradeoffs between pipelining and multiple functional units in fine grain parallelism exploitation M. Breternitz and A. Nicolau. ICS-90 International Conference on Supercomputing, Santa Clara CA, April 1989.
Organization Of Array Data For Concurrent Memory Access Breternitz, M.; Shen, J.P.; Microprogramming and Microarchitecture, 1988., Proceeding of the 21st Annual Workshop on; pp. 97-99.
The White Dwarf: a high-performance application-specific processor Wolfe, A.; Breternitz, M., Jr.; Stephens, C.; Ting, A.L.; Kirk, D.B.; Bianchini, R.P., Jr.; Shen, J.P. 15th Annual International Symposium on Computer Architecture, 1988; Page(s): 212-222.
THESIS Architecture Synthesis of High Performance Application Specific Processors; Mauricio Breternitz; Ph.D. Thesis, Electrical and Computer Engineering Department, Carnegie-Mellon University, April 1991 [PDF]

Talks

Efficiency and Scalability of Multi-Lane Capsule Networks (MLCN)
CIENCIA 2019 Encontro com a Ciencia e Tecnologia em Portugal [PDF]
ASR Automatic Speech Recognition for European Portuguese with the Kaldi Framework [PDF]
CIENCIA-2018, Encontro com a Ciencia e Tecnologia em Portugal
Microarchitecture, Computing Systems, High Performance Computing and End-to-End Deep Neural Networks (Keynote)
WSCAD-2018, XIX Simposium em Sistemas Computacionais de Alto Desempenho, 3/10/2018
Sao Paulo, Brazil [PDF]
E2eML: High Performance, Power Efficient Application of Machine Learning Systems
NII Shonan Meeting Seminar 134, 08/20/2018 http://shonan.nii.ac.jp/seminar/134/ [PDF]
AMD’s Open Compute and Open Source Cross Platform Solutions for Machine Learning Invited Lecture
Deep Learning Tools and Methods Workshop, IDIAP, EPFL Martigny, 2016 https://www.idiap.ch/workshop/dltm/front-page [talk]
Optimizing Big Data Analytics on Heterogeneous Processors M. Daga, J. Gu, M. Breternitz
Tutorial, 2015 IEEE Conference on Big Data [PDF]

U.S. Patents

US-20210173591-A1 STORAGE LOCATION ASSIGNMENT AT A CLUSTER COMPUTE SERVER [link]

10,558,466 System and method for parallelization of data processing in a processor [link]
10,318,340 NVRAM-aware data processing system
10,318,153 Techniques for changing management modes of multilevel memory hierarchy
10,271,008 Enhanced resolution video and security via machine learning
10,198,349 Programming in-memory accelerators to improve the efficiency of datacenter operations
10,089,155 Power aware work stealing [link]
10,067,709 Page migration acceleration using a two-level bloom filter on high bandwidth memory systems
10,019,365 Adaptive value range profiling for enhanced system performance
9,817,644 Apparatus, method, and system for providing a decision mechanism for conditional commits in an atomic region
9,766,936 Selecting a resource from a set of resources for performing an operation
9,658,895 System and method for configuring boot-time parameters of nodes of a cloud computing system
9,639,140 Power management of interactive workloads driven by direct and indirect user feedback
9,479,449 Workload partitioning among heterogeneous processing nodes
9,274,585 Combined dynamic and static power and performance optimization on data centers
9,262,231 System and method for modifying a hardware configuration of a cloud computing system
9,251,069 Mechanisms to bound the presence of cache blocks with specific properties in caches
9,223,714 Instruction boundary prediction for variable length instruction set
9,183,055 Selecting a resource from a set of resources for performing an operation
9,170,854 Thread assignment for power and performance efficiency using multiple power states
9,152,601 Power-efficient nested map-reduce execution on a cloud of heterogeneous accelerated processing units
9,152,532 System and method for configuring a cloud computing system with a synthetic test workload
9,146,846 Programmable physical address mapping for memory
9,146,844 Apparatus, method, and system for providing a decision mechanism for conditional commits in an atomic region
9,116,703 Semi-static power and performance optimization of data centers
8,935,472 Processing device with independently activatable working memory bank and methods
8,929,220 Processing system using virtual network interface controller addressing as flow control metadata
8,887,056 System and method for configuring cloud computing systems
8,782,645 Automatic load balancing for heterogeneous cores
8,738,877 Processor with garbage-collection based classification of memory
8,683,468 Automatic kernel migration for heterogeneous cores
8,549,504 Apparatus, method, and system for providing a decision mechanism for conditional commits in an atomic region
8,146,106 On-demand emulation via user-level exception handling
8,099,587 Compressing and accessing a microcode ROM
7,840,953 Method and system for reducing program code size
7,757,221 Apparatus and method for dynamic binary translator to support precise exceptions with minimal optimization constraints
7,725,887 Method and system for reducing program code size
7,694,281 Two-pass MRET trace selection for dynamic optimization
7,620,781 Efficient Bloom filter
7,451,121 Genetic algorithm for microcode compression
7,430,574 Efficient execution and emulation of bit scan operations
7,428,731 Continuous trip count profiling for loop optimizations in two-phase dynamic binary translators
7,095,342 Compressing microcode
6,823,070 Method for key escrow in a communication system and apparatus therefor
6,523,095 Method and data processing system for using quick decode instructions
6,484,228 Method and apparatus for data compression and decompression for a data processor system
6,381,739 Method and apparatus for hierarchical restructuring of computer code
6,343,354 Method and apparatus for compression, decompression, and execution of program code
6,216,213 Method and apparatus for compression, decompression, and execution of program code
6,044,220 Method and apparatus for operating a data processor to execute software written using a foreign instruction set
5,966,143 Data allocation into multiple memories for concurrent access
5,889,999 Method and apparatus for sequencing computer instruction execution in a data processing system
5,805,895 Method and apparatus for code translation optimization
5,737,576 Method and system for efficient instruction execution in a data processing system having multiple prefetch units
5,659,699 Method and system for managing cache memory utilizing multiple hash functions
5,634,025 Method and system for efficiently fetching variable-width instructions in a data processing system having multiple prefetch units
5,537,620 Redundant load elimination on optimizing compilers

Plus 55 more U.S. patents pending