Iulian Brumar

I’m an engineer who designs hardware and low-level software for AI. I'm currently working with Qualcomm on AI compilers such as MLIR. My work on AI compilers enables faster and power-efficient model execution on hardware accelerators.

My background is a PhD from Harvard in Computer Architecture and Compilers. As a PhD student in Computer Science I was working with compiler infrastructures such as LLVM, Machine Learning Modelling, Integer Linear/Quadratic Programming and Simulators for Domain Specific System-on-Chip design under the supervision of Prof. David Brooks and Gu-Yeon Wei. One key characteristic I'm interested in while designing these systems is the potential to approximate operations for specific application domains such as AI. With the end/slowdown of Moore's law there is a lot of focus on finding non-conventional computing paradigms and in my research I'm working on merged dataflow accelerators as well as architectures robust to approximate computations.

Iulian Brumar

iulianbrumarv [at] yahoo.com

Computer Science
Harvard University
33 Oxford St
Cambridge
US

Google Scholar

Interests

Compilers, AI, Computer Architecture.

Publications

Early-Stage Non-Conventional Hardware Accelerator Discovery via Optimization Methods and Compiler Analysis, PhD Dissertation. Harvard University.

Iulian Brumar

Guac: Energy-Aware and SSA-Based Generation of Coarse-Grained Merged Accelerators from LLVM-IR, Under submission.

Iulian Brumar, Rodrigo Rocha, Alex Bernat, Devashree Tripathy, David Brooks, Gu-Yeon Wei

Early DSE and Automatic Generation of Coarse Grained Merged Accelerators, in ACM Transactions on Embedded Computing Systems, Volume 22, Issue 2.

Iulian Brumar, Georgios Zacharopoulos, Yuan Yao, Saketh Rama, David Brooks, Gu-Yeon Wei

Trireme: Exploration of hierarchical multi-level parallelism for hardware acceleration, in ACM Transactions on Embedded Computing Systems, Volume 22, Issue 2.

Georgios Zacharopoulos, Adel Ejjeh, Ying Jing, En-Yu Yang, Tianyu Jia, Iulian Brumar, Jeremy Intan, Muhammad Huzaifa, Sarita Adve, Vikram Adve, Gu-Yeon Wei, David Brooks

Hardware Support for Approximate Task Memoization, Iulian Brumar, et al.

Iulian Brumar, Emilio Castillo, Miquel Moreto, Marc Casas, Gurindar S. Sohi

Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, HPEC'19.

*Brian Plancher, *Camelia D. Brumar, *Iulian Brumar, *Lillian Pentecost, *Saketh Rama, David Brooks

ATM: Approximate Task Memoization in the Runtime System, Iulian Brumar, et al., IPDPS'17.

Iulian Brumar, Miquel Moreto, Marc Casas, Miquel Moreto, Mateo Valero, Gurindar S. Sohi

Sita: A Methodology for Automatic Matching and Merging of Accelerator Operations for Systems-on-Chip. DARPA Report.

*Iulian Brumar, *Saketh Rama, Burcin Cakir, David Brooks, Gu-Yeon Wei, Vijay Reddi

*Equal first co-authors.

Teaching

Teaching Fellow
Harvard University

Fall 2019

Cambridge, Massachusetts, United States

Harvard CS246 and CS146 Computer Architecture - In this course, grad and undergrad students are presented with how an out-of-order processor is designed and what are the main memory system problems together with classical solutions for these problems. The students are required to read seminal computer architecture papers. I've helped with leading the research paper discussions and with setting up the Pin-based architectural simulations, performance counters-based assignments and I've supervised students taking final projects in FPGA and simulator based designs, Value Prediction and Branch Prediction.

Teaching Assistant
Harvard University

Fall 2016

Barcelona, Spain

UPC Operating Systems - Course introducing the students to the fundamentals of operating system design and use. Some course topics were process, threads, memory, filesystem and I/O management

Teaching Assistant
Universitat Politècnica de Catalunya

Summer 2015

Barcelona Area, Spain

Helped high school students learn programming concepts during the summer school organized by Prof. Salvador Roura at the School of Mathematics and Statistics (FME). The course is aimed at students who wanted to participate at the Spanish Programming Olympiad (https://olimpiada-informatica.org/).

Work Experience

PhD Graduate Student
Harvard University

Sep 2016 - Aug 2018 • 2 yrs

Cambridge, Massachusetts, United States

Developing AccelMerger, an early stage design space exploration tools that aims at solving three main challenges in domain System-on-Chip design: automation, flexible granularity and computational pattern reuse/merging.

This work has three main components:

Compiler (LLVM) level function merging techniques that include semantically correct code generation of merged functions that are more profitable when mapped to hardware via HLS than non-merged functions,
Neural Networks cost model to evaluate the feasibility of profitable merges at the IR level to guide the decisions of codegen,
Early Stage design exploration with state of the art optimization techniques such as Integer Linear Programming. This work is developed under the supervision of Prof. David Brooks.

Silicon Performance Intern
Facebook, Inc.

Jun 2021 - Sep 2021 • 3 mos

Remote, United States

Early Stage Design Space Exploration of NoC. Helping Facebook AR/VR Sillicon Performance team, in developing Network on Chip models and implement them in Facebook's Early Stage Design Space Exploration tool FARSI. Error in FARSI was reduced from 432.23\% to 7.46\% in memory intense scenarios. The implementation was validated under different latency, number of NoC queues, bus widths, software/hardware configurations. Work developed in collaboration with Amit Kumar and Vivek Venkateshan.

Research Student
Universitat Politècnica de Catalunya and Bacelona Supercomputing Center

Sep 2014 - Aug 2018 • 4 yrs

Barcelona Area, Spain

Extending state of the art runtime systems to allow the application to make a disciplined use of the hardware by learning from the execution history. For example, we have performed extensive research on the concept of Approximate Task Memoization (ATM) which allows to reuse the results of previous computations at slight costs of application correctness. In this kind of research, performance analysis also proves to be very useful to enable the runtime to optimize the most common application patterns. ATM for example relies on extensive studies using performance analysis tools. This is the work we have managed to successfully publish in the current IPDPS17 conference. The next step was to create hardware support for the approximate task memoization. For example we've identified hashing and the memoization table management as the key overheads in performing coarse grained memoization and the hardware helped speed up many of the applications with aggressive value reuse.

Multicore Performance Modelling Research Intern
ARM

Oct 2016 - Jan 2017 • 4 mos

Cambridge, United Kingdom

Research on Medium Grained Parallelism discovery. Medium grained parallelism includes granularities from Tasks as in "Task Based Programming Models" and larger granularities than instruction level parallelism(ILP).

This is motivated by the need of leveraging all possible sources of parallelism to exploit the concurrency available in current and future hardware substrates. While many approaches have looked at exploiting loop-level parallelism using different techniques (e.g. thread-level speculation), not much material is available on the limits of parallelisation at coarser granularities.

The first use case for the analysis tools has been simulating the scenario where functions can be invoked asynchronously, and we studied the impact of syscalls and external library calls on the available parallelism (with limited support for data dependency tracking).

Visiting Research Student
University of Wisconsin

Sep 2017 - Dec 2017 • 4 mos

Madison, United States

Increasing parallelism using redundant task elimination. Creating architectural and micro-architectural support in gem5 to support fine grained approximate task memoization. Collaboration with Prof. Gurindar S. Sohi.

Research Fellow
Universitat Politècnica de Catalunya

Sep 2016 - Jul 2018 • 2 yrs

Barcelona, Spain

Approximate Task Memoization in Runtime Systems. Implementing task memoization in the runtime system of the OmpSs parallel programming model.

Education

Phd in Computer Science
Harvard University

Current

Cambridge, MA, United States

GPA: 4.0/4.0

Master in Innovation and Research in Informatics
Universitat Politècnica de Catalunya

Oct 2016

Barcelona, Spain

120 ECTS credits. Grade: 9.5/10

Highest Honors: Best Academic Record Award among 36 students.

Degree in Informatics Engineering
Universitat Politècnica de Catalunya

Jul 2014

Barcelona, Spain

240 ECTS credits. Grade: 8.77/10

Highest Honors: Best Academic Record Award among 119 students.

Interests

Publications

Teaching

Work Experience

Education

Honors and Awards