Papers

  • A unified model for accelerating unsupervised iterative re-ranking algorithms Pisani, F.; Valem, L. P.; Guimarães Pedronette, D. C.; Torres, R.; Borin, Ed.; Breternitz, M.; Concurrency and Computation: Practice and Experience 2020 [PDF]
  • Efficiency and Scalability of Multi-Lane Capsule Networks (MLCN) M.Breternitz; Vanderson Martins do Rosario; Edson Borin; Int’l Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) 2019 [PDF]
  • Memory Efficient Weightless Neural Network using Bloom Filters Felipe Franca; M.Breternitz; Leandro Araujo; 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning [PDF]
  • The multi-lane capsule network (MLCN) Vanderson Martins do Rosario; Borin, Edson; M.Breternitz; IEEE Signal Processing Letters 2019 [PDF]
  • ComP-Net: command processor networking for efficient intra-kernel communications on GPUs Lebeane, M. W.; Hamidouche, K.; Benton, B.; Breternitz, M.; Reinhardt, S. K.; John, L. K.; Parallel Architectures and Compilation Techniques - Conference Proceedings PACT, 2018 [PDF]
  • Mixed reality application to support infrastructure maintenance, Silva, H.; Resende, R.; Breternitz, M. 2nd International Young Engineers Forum, YEF-ECE 2018
  • GPU Triggered Networking for Intra-Kernel Communications M.Lebeane et al Supercomputing, 2017 [PDF]
  • Extended Task Queueing: Active Messages for Heterogeneous Systems M.Lebeane et al Supercomputing, 2016 [PDF]
  • PY-PITS: A scalable Python Runtime System for the computation of Partially Idempotent Tasks E.Borin et al MPP 2016 Workshop on Parallel Programming Models (best paper award) [PDF]
  • HadoopCL2: Motivating the Design of a Distributed, Heterogeneous Programming System With Machine-Learning Applications M. Grossman, M. Breternitz, V. Sarkar. IEEE Transactions on Parallel and Distributed Systems, issue 99, 2015 [PDF]
  • Optimizing Big Data Analytics on Heterogeneous Processors M. Daga, J. Gu, M. Breternitz Tutorial, 2015 IEEE Conference on Big Data
  • Adaptive global power optimization for Web servers. Piga, Leonardo, Mauricio Breternitz et al. The Journal of Supercomputing (2014): 1-25. [PDF]
  • Implementation and evaluation of deep neural networks (DNN) on mainstream heterogeneous systems Junli Gu, Mauricio Breternitz et al. Proceedings of 5th Asia-Pacific Workshop on Systems. ACM, 2014.
  • Microcode Compression Using Structure-Constrained Clustering E. Borin, G. Araujo, M. Breternitz and Y. Wu. International Journal of Parallel Programming, Vol. 42, Issue 1, pp. 140 – 164, 2014.
  • HadoopCL: MapReduce on Distributed Heterogeneous Platforms Through Seamless Integration of Hadoop and OpenCL. M. Grossman, M. Breternitz, V. Sarkar. 2013 International Workshop on High Performance Data Intensive Computing. May 2013.
  • Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, M. Grossman, M. Breternitz, V. Sarkar AMD Developer Summit 2013.
  • Cloud Workload Analysis with SWAT M.Breternitz, K.Lowery, A.Chernoff, P.Kaminski, L.Piga, SBAC-PAD 2012 -International Conference on Computer Architecture, New York, NY
  • Efficient Image Re-Ranking Computation on GPUs D.Pedronette, R.Torres, Ed.Borin, M.Breternitz ISPA 2012
  • LAR-CC: Large Atomic regions with conditional Commits Borin, E; Wu, Y; Breternitz, M; Wang, CGO'2011-IEEE/ACM 9th Annual International Symposium on Code Generation and Optimization, Apr 2-6, 2011
  • Structure-Constrained Microcode Compression Borin, E; Araujo, G; Breternitz, M; Wu, Y SBAC-PAD 2011 - 23rd Int’l Symposium on Computer Architecture and High-Performance Computing, Oct/2011
  • Face Detection: Performance opportunities for CPU-GPU Kernel Migration in Fusion Architecture Breternitz, M; Chernoff, A; Kaminski; P; Lowery, K. AMD Fusion Developer Summit, June 11-14/2011
  • TAO - Two Level Atomicity for Dynamic Binary Optimizations E.Borin, Y.Wu, C.Wang, W.Liu, M.Breternitz, S.Hu, E.Natanzon, S.Rotem, R.Rosner. CGO'2010-IEEE/ACM 8th Annual International Symposium on Code Generation and Optimization
  • Segmented Bloom Filter Algorithm for Efficient Predictors M. Breternitz, G.H.Loh, B.Black, J.Rupley, P.Sassone, W.Attrot, Y.Wu SBAC-PAD Conference, 2008
  • StarDBT: An Efficient Multi-platform Dynamic Binary Translation System C.Wang, S.Hu, H-S Kim, S.Nair, M.Breternitz, Z.Ying, Y.Wu APAC Conference, 2007
  • Clustering-Based Microcode Compression E. Borin, Mauricio Breternitz Jr, Y. Wu, G. Araujo ICCD2006
  • Enhanced Code Density of Embedded CISC Processors with Echo Technology Youfeng Wu, Mauricio Breternitz, Herbert Hum, Ramesh Peri, and Jay Pickett CODES+ISSS 2005
  • Echo Technology for Memory Constrained Processors Youfeng Wu, Mauricio Breternitz Jr., Herbert Hum, Ramesh Peri, Jay Pickett CTCES 2004
  • The Accuracy of Initial Prediction in Two-Phase Dynamic Binary Translators Youfeng Wu, Mauricio Breternitz Jr., Justin Quek, Orna Etzion, Jesse Fang International Symposium on Code Generation and Optimization with Special Emphasis on Feedback-Directed and Runtime Optimization CGO 2004; Page(s): 227-238.
  • Continuous Trip Count Profiling for Loop Optimizations in Two-Phase Dynamic Binary Translators Youfeng Wu, Mauricio Breternitz Jr., Tevi Devor INTERACT-8 Interaction between Compilers and Computer Architectures, 2004; Page(s): 3-12.
  • Compilation, Architectural Support, and Evaluation of SIMD Graphics Pipeline Programs on a General-Purpose CPU Mauricio Breternitz Jr., Herbert H. J. Hum, Sanjeev Kumar Proceedings 12th International Conference on Parallel Architectures and Compilation Techniques -PACT 2003; Page(s): 135-145.
  • Enhanced Compression Techniques to Simplify Program Decompression and Execution Mauricio Breternitz Jr., Roger Smith ICCD 1997; Page(s): 170-176.
  • The Motorola PowerPC PEEK profiler Stewart, K.; Butt, F.; Sarkisian, D.; Breternitz, M., Jr. Performance, Computing, and Communications Conference, 1997. IPCCC 1997., IEEE International , 1997; Page(s): 342-349.
  • Design tradeoffs and experience with Motorola PowerPC migration tools Breternitz, M.; Manikonda, A.; Ommerman, M.; Su, W.; Thornton, A. Computer Design: VLSI in Computers and Processors, 1996. ICCD '96. Proceedings., 1996 IEEE International Conference on , 1996; Page(s): 301-308.
  • Motorola PowerPC Migration Tools-emulation and translation Afzal, T.; Breternitz, M.; Kacher, M.; Menyhert, S.; Ommerman, M.; Su, W. Compcon '96. 'Technologies for the Information Superhighway' Digest of Papers, 1996.
  • Solutions and debugging for data consistency in multiprocessors with noncoherent caches Bernstein, D.; Breternitz, M., Jr.; Gheith, A.M.; Mendelson, B. International Journal of Parallel Programming, vol.23, no.1, Feb. 1995.; Page(s) 83-103.
  • An optimal asynchronous scheduling algorithm for software cache consistency Simons, B.; Sarkar, V.; Breternitz, M., Jr.; Lai, M.; System Sciences, 1994. Vol.II: Software Technology, Proceedings of the Twenty-Seventh Hawaii International Conference, 1994; Page(s): 502-511.
  • Implementation Optimization Techniques for Architecture Synthesis of Application-Specific Processors. Mauricio Breternitz Jr., John Paul Shen MICRO-24 1991; Page(s): 114-123.
  • Adapting AIX to a shared memory cluster. Breternitz, M., Jr.; Gheith, A.; Jindal, A.; Lehr, T. Proceedings. SHARE Europe Anniversary Meeting. Client/Server -the Promise and the Reality. Carouge/Geneva, Switzerland: SHARE Europe, 1993.; Page(s): 415-428.
  • Architecture synthesis of high-performance application-specific processors Breternitz, M., Jr.; Shen, J.P. Design Automation Conference, 1990. 27th ACM/IEEE , 1990; Page(s): 542-548.
  • Tradeoffs between pipelining and multiple functional units in fine grain parallelism exploitation M. Breternitz and A. Nicolau. ICS-90 International Conference on Supercomputing, Santa Clara CA, April 1989.
  • Organization Of Array Data For Concurrent Memory Access Breternitz, M.; Shen, J.P.; Microprogramming and Microarchitecture, 1988., Proceedings of the 21st Annual Workshop on; pp. 97-99.
  • The White Dwarf: a high-performance application-specific processor Wolfe, A.; Breternitz, M., Jr.; Stephens, C.; Ting, A.L.; Kirk, D.B.; Bianchini, R.P., Jr.; Shen, J.P. 15th Annual International Symposium on Computer Architecture, 1988; Page(s): 212-222.
  • THESIS Architecture Synthesis of High Performance Application Specific Processors; Mauricio Breternitz; Ph.D. Thesis, Electrical and Computer Engineering Department, Carnegie-Mellon University, April 1991 [PDF]

Talks

  • Efficiency and Scalability of Multi-Lane Capsule Networks (MLCN)
    CIENCIA 2019 Encontro com a Ciencia e Tecnologia em Portugal [PDF]
  • ASR Automatic Speech Recognition for European Portuguese with the Kaldi Framework [PDF]
    CIENCIA-2018, Encontro com a Ciencia e Tecnologia em Portugal
  • Microarchitecture, Computing Systems, High Performance Computing and End-to-End Deep Neural Networks (Keynote)
    WSCAD-2018, XIX Simpósio em Sistemas Computacionais de Alto Desempenho, 3/10/2018
    Sao Paulo, Brazil [PDF]
  • E2eML: High Performance, Power Efficient Application of Machine Learning Systems
    NII Shonan Meeting Seminar 134, 08/20/2018 http://shonan.nii.ac.jp/seminar/134/ [PDF]
  • AMD’s Open Compute and Open Source Cross Platform Solutions for Machine Learning Invited Lecture
    Deep Learning Tools and Methods Workshop, IDIAP, EPFL Martigny, 2016 https://www.idiap.ch/workshop/dltm/front-page [talk]
  • Optimizing Big Data Analytics on Heterogeneous Processors M. Daga, J. Gu, M. Breternitz
    Tutorial, 2015 IEEE Conference on Big Data [PDF]

U.S. Patents

  • 10,558,466 System and method for parallelization of data processing in a processor [link]
  • 10,318,340 NVRAM-aware data processing system
  • 10,318,153 Techniques for changing management modes of multilevel memory hierarchy
  • 10,271,008 Enhanced resolution video and security via machine learning
  • 10,198,349 Programming in-memory accelerators to improve the efficiency of datacenter operations
  • 10,089,155 Power aware work stealing [link]
  • 10,067,709 Page migration acceleration using a two-level bloom filter on high bandwidth memory systems
  • 10,019,365 Adaptive value range profiling for enhanced system performance
  • 9,817,644 Apparatus, method, and system for providing a decision mechanism for conditional commits in an atomic region
  • 9,766,936 Selecting a resource from a set of resources for performing an operation
  • 9,658,895 System and method for configuring boot-time parameters of nodes of a cloud computing system
  • 9,639,140 Power management of interactive workloads driven by direct and indirect user feedback
  • 9,479,449 Workload partitioning among heterogeneous processing nodes
  • 9,274,585 Combined dynamic and static power and performance optimization on data centers
  • 9,262,231 System and method for modifying a hardware configuration of a cloud computing system
  • 9,251,069 Mechanisms to bound the presence of cache blocks with specific properties in caches
  • 9,223,714 Instruction boundary prediction for variable length instruction set
  • 9,183,055 Selecting a resource from a set of resources for performing an operation
  • 9,170,854 Thread assignment for power and performance efficiency using multiple power states
  • 9,152,601 Power-efficient nested map-reduce execution on a cloud of heterogeneous accelerated processing units
  • 9,152,532 System and method for configuring a cloud computing system with a synthetic test workload
  • 9,146,846 Programmable physical address mapping for memory
  • 9,146,844 Apparatus, method, and system for providing a decision mechanism for conditional commits in an atomic region
  • 9,116,703 Semi-static power and performance optimization of data centers
  • 8,935,472 Processing device with independently activatable working memory bank and methods
  • 8,929,220 Processing system using virtual network interface controller addressing as flow control metadata
  • 8,887,056 System and method for configuring cloud computing systems
  • 8,782,645 Automatic load balancing for heterogeneous cores
  • 8,738,877 Processor with garbage-collection based classification of memory
  • 8,683,468 Automatic kernel migration for heterogeneous cores
  • 8,549,504 Apparatus, method, and system for providing a decision mechanism for conditional commits in an atomic region
  • 8,146,106 On-demand emulation via user-level exception handling
  • 8,099,587 Compressing and accessing a microcode ROM
  • 7,840,953 Method and system for reducing program code size
  • 7,757,221 Apparatus and method for dynamic binary translator to support precise exceptions with minimal optimization constraints
  • 7,725,887 Method and system for reducing program code size
  • 7,694,281 Two-pass MRET trace selection for dynamic optimization
  • 7,620,781 Efficient Bloom filter
  • 7,451,121 Genetic algorithm for microcode compression
  • 7,430,574 Efficient execution and emulation of bit scan operations
  • 7,428,731 Continuous trip count profiling for loop optimizations in two-phase dynamic binary translators
  • 7,095,342 Compressing microcode
  • 6,823,070 Method for key escrow in a communication system and apparatus therefor
  • 6,523,095 Method and data processing system for using quick decode instructions
  • 6,484,228 Method and apparatus for data compression and decompression for a data processor system
  • 6,381,739 Method and apparatus for hierarchical restructuring of computer code
  • 6,343,354 Method and apparatus for compression, decompression, and execution of program code
  • 6,216,213 Method and apparatus for compression, decompression, and execution of program code
  • 6,044,220 Method and apparatus for operating a data processor to execute software written using a foreign instruction set
  • 5,966,143 Data allocation into multiple memories for concurrent access
  • 5,889,999 Method and apparatus for sequencing computer instruction execution in a data processing system
  • 5,805,895 Method and apparatus for code translation optimization
  • 5,737,576 Method and system for efficient instruction execution in a data processing system having multiple prefetch units
  • 5,659,699 Method and system for managing cache memory utilizing multiple hash functions
  • 5,634,025 Method and system for efficiently fetching variable-width instructions in a data processing system having multiple prefetch units
  • 5,537,620 Redundant load elimination on optimizing compilers

Plus 55 more U.S. patents pending