I’m an engineer who designs hardware and low-level software for AI. I'm currently working with Qualcomm on AI compiler infrastructure such as MLIR. My work on AI compilers enables faster and more power-efficient model execution on hardware accelerators.
My background is a PhD from Harvard in Computer Architecture and Compilers. As a PhD student in Computer Science I worked with compiler infrastructures such as LLVM, machine learning modelling, integer linear/quadratic programming and simulators for domain-specific System-on-Chip design, under the supervision of Profs. David Brooks and Gu-Yeon Wei. One key characteristic I'm interested in when designing these systems is the potential to approximate operations for specific application domains such as AI. With the end/slowdown of Moore's law there is a lot of focus on finding non-conventional computing paradigms. In my research I'm working on merged dataflow accelerators as well as architectures robust to approximate computations.
iulianbrumarv [at] yahoo.com
Computer Science
Harvard University
33 Oxford St
Cambridge
US
Compilers, AI, Computer Architecture.
Iulian Brumar
Guac: Energy-Aware and SSA-Based Generation of Coarse-Grained Merged Accelerators from LLVM-IR, Under submission. Iulian Brumar, Rodrigo Rocha, Alex Bernat, Devashree Tripathy, David Brooks, Gu-Yeon Wei
Early DSE and Automatic Generation of Coarse Grained Merged Accelerators, in ACM Transactions on Embedded Computing Systems, Volume 22, Issue 2. Iulian Brumar, Georgios Zacharopoulos, Yuan Yao, Saketh Rama, David Brooks, Gu-Yeon Wei
Trireme: Exploration of hierarchical multi-level parallelism for hardware acceleration, in ACM Transactions on Embedded Computing Systems, Volume 22, Issue 2. Georgios Zacharopoulos, Adel Ejjeh, Ying Jing, En-Yu Yang, Tianyu Jia, Iulian Brumar, Jeremy Intan, Muhammad Huzaifa, Sarita Adve, Vikram Adve, Gu-Yeon Wei, David Brooks
Hardware Support for Approximate Task Memoization. Iulian Brumar, Emilio Castillo, Miquel Moreto, Marc Casas, Gurindar S. Sohi
Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, HPEC'19. *Brian Plancher, *Camelia D. Brumar, *Iulian Brumar, *Lillian Pentecost, *Saketh Rama, David Brooks
ATM: Approximate Task Memoization in the Runtime System, IPDPS'17. Iulian Brumar, Marc Casas, Miquel Moreto, Mateo Valero, Gurindar S. Sohi
Sita: A Methodology for Automatic Matching and Merging of Accelerator Operations for Systems-on-Chip. DARPA Report. *Iulian Brumar, *Saketh Rama, Burcin Cakir, David Brooks, Gu-Yeon Wei, Vijay Reddi
*Equal first co-authors.
Fall 2019
Cambridge, Massachusetts, United States
Harvard CS246 and CS146 Computer Architecture - This course shows graduate and undergraduate students how an out-of-order processor is designed and presents the main memory system problems together with their classical solutions. Students are required to read seminal computer architecture papers. I helped lead the research paper discussions and set up the Pin-based architectural simulations and the performance-counter-based assignments, and I supervised students doing final projects on FPGA- and simulator-based designs, value prediction and branch prediction.
Teaching Assistant
Fall 2016
Barcelona, Spain
UPC Operating Systems - Course introducing students to the fundamentals of operating system design and use. Course topics included process, thread, memory, filesystem and I/O management.
Teaching Assistant
Summer 2015
Barcelona Area, Spain
Helped high school students learn programming concepts during the summer school organized by Prof. Salvador Roura at the School of Mathematics and Statistics (FME). The course is aimed at students who want to participate in the Spanish Programming Olympiad (https://olimpiada-informatica.org/).
Sep 2016 - Aug 2018 • 2 yrs
Cambridge, Massachusetts, United States
Developing AccelMerger, an early-stage design space exploration tool that aims to solve three main challenges in domain-specific System-on-Chip design: automation, flexible granularity and computational pattern reuse/merging.
This work has three main components:
Jun 2021 - Sep 2021 • 3 mos
Remote, United States
Early Stage Design Space Exploration of NoC. Helped the Facebook AR/VR Silicon Performance team develop Network-on-Chip models and implement them in FARSI, Facebook's early-stage design space exploration tool. Error in FARSI was reduced from 432.23% to 7.46% in memory-intensive scenarios. The implementation was validated under different latencies, numbers of NoC queues, bus widths and software/hardware configurations. Work developed in collaboration with Amit Kumar and Vivek Venkateshan.
Research Student
Sep 2014 - Aug 2018 • 4 yrs
Barcelona Area, Spain
Extending state-of-the-art runtime systems so that applications make disciplined use of the hardware by learning from the execution history. For example, we performed extensive research on Approximate Task Memoization (ATM), which reuses the results of previous computations at a slight cost in application correctness. In this kind of research, performance analysis is also very useful, since it lets the runtime optimize the most common application patterns; ATM, for example, relies on extensive studies with performance analysis tools. This work was published at the IPDPS'17 conference. The next step was to create hardware support for approximate task memoization. We identified hashing and memoization table management as the key overheads in coarse-grained memoization, and the hardware support sped up many of the applications with aggressive value reuse.
Multicore Performance Modelling Research Intern
Oct 2016 - Jan 2017 • 4 mos
Cambridge, United Kingdom
Research on medium-grained parallelism discovery. Medium-grained parallelism covers granularities coarser than instruction-level parallelism (ILP), up to tasks as in task-based programming models.
This is motivated by the need to leverage all possible sources of parallelism to exploit the concurrency available in current and future hardware substrates. While many approaches have looked at exploiting loop-level parallelism using different techniques (e.g. thread-level speculation), not much material is available on the limits of parallelisation at coarser granularities.
The first use case for the analysis tools has been simulating the scenario where functions can be invoked asynchronously, and we studied the impact of syscalls and external library calls on the available parallelism (with limited support for data dependency tracking).
Visiting Research Student
Sep 2017 - Dec 2017 • 4 mos
Madison, United States
Increasing parallelism using redundant task elimination. Building architectural and micro-architectural support in gem5 for fine-grained approximate task memoization. Collaboration with Prof. Gurindar S. Sohi.
Research Fellow
Sep 2016 - Jul 2018 • 2 yrs
Barcelona, Spain
Approximate Task Memoization in Runtime Systems. Implementing task memoization in the runtime system of the OmpSs parallel programming model.
Current
Cambridge, MA, United States
GPA: 4.0/4.0
Master in Innovation and Research in Informatics
Oct 2016
Barcelona, Spain
120 ECTS credits. Grade: 9.5/10
Highest Honors: Best Academic Record Award among 36 students.
Degree in Informatics Engineering
Jul 2014
Barcelona, Spain
240 ECTS credits. Grade: 8.77/10
Highest Honors: Best Academic Record Award among 119 students.
Cambridge, MA, United States
Spanish National Teaching Fellowship (FPU)
Barcelona Area, Spain
Funding awarded to 800 top research students in Spain across all disciplines.
Computer Architecture Research Fellowship
Barcelona Area, Spain
Awarded to only 3 students in the Computer Architecture Department.
Best Master's Record Award for Computer Science Students at UPC. 2014-2016 Promotion
Barcelona Area, Spain
Only 3 awards were offered among 65 graduates.
Best Overall Undergraduate Record Award for Computer Science Students at UPC. 2010-2014 Promotion
Barcelona Area, Spain
Only 5 awards were offered among 119 graduates.
2012 Best Sophomore Record Award in Informatics
Barcelona Area, Spain
Featured in the "Premis i Distincions" (Prizes) section:
http://www.fib.upc.edu/img/pdf/memoria/Memoria_2012-2013.pdf
Universitat Politecnica de Catalunya (UPC) offers this award to freshmen and sophomores in computer science once every year.