# PARAMETER MONTE CARLO TREE SEARCH: EFFI CIENT CHIP PLACEMENT VIA TRANSFER LEARNING

Anonymous authors

Paper under double-blind review

#### ABSTRACT

Automated chip placement is an important problem in enhancing the design and effectiveness of computer chips. Previous approaches have employed transfer learning to adapt knowledge obtained via machine learning from one chip placement task to another. However, these approaches have not notably reduced the necessary chip design time, which is crucial for minimizing the total resource utilization. This paper introduces a novel transfer learning approach called Parameter Monte Carlo Tree Search (PMCTS) that utilizes MCTS to transfer the learned knowledge from deep reinforcement learning (RL) models trained on one chip design task to another chip design by searching directly over the model parameters to generate models for efficient chip placement. We employ MCTS to escape the local optima reached by training from scratch and fine-tuning methods. We evaluate our methodology on four chip design tasks from the literature: Ariane, Ariane133, IBM01, and IBM02. Through extensive experiments, we find that our approach can generate models for optimized chip placement in less time than training from scratch and fine-tuning methods when transferring knowledge from complex chip designs to simpler ones.

025 026 027

024

004

010 011

012

013

014

015

016

017

018

019

021

#### 1 INTRODUCTION

028 029

Chip placement is a fundamental and time-consuming stage in the chip design process (Cheng et al., 2022). The process involves placing the various components, such as macros and standard cells, of 031 a netlist file, which contains a logical description of a circuit's interconnections, in specific positions on the chip layout (Ellis-Monaghan & Gutwin, 2003). Standard cells refer to the fundamental logic 033 gates such as NAND, NOR, and XOR, while macros represent functional blocks like static random-034 access memory (SRAM). The goal of placement is to minimize power, performance, and area (PPA) 035 metrics while adhering to restrictions like placement density and routing congestion (Cheng & Yan, 2021). This is a longstanding problem, but the exponential growth of artificial intelligence (AI) 037 technology has generated an increased demand for advanced computational hardware. However, 038 due to the end of Moore's Law and Dennard scaling, it is essential to shift towards specialized architecture to cope with the rapidly increasing computational requirements of AI (Mirhoseini et al., 2020). As such, AI-based techniques are expected to accelerate the chip design cycle, establishing a 040 mutually beneficial connection between hardware and AI. 041

042 Transfer learning methods can prove beneficial in reducing the cost and resource requirements of 043 developing machine learning (ML) models from scratch for novel domains (You et al., 2023). These 044 methods utilise pre-existing model weights rather than random initialization to reduce the computing demand and enable faster adaptation, which in turn can support more design iterations (Cao et al., 2010). Many transfer learning approaches exist, including zero-shot learning (Xian et al., 2017) and 046 domain adaptation (Farahani et al., 2021) to transfer knowledge from a source domain to a target 047 domain. However, zero-shot learning often requires the class sets or features to be identical, and in 048 the case of domain adaptation approaches, a degree of parallelism between the domains is necessary. As such, for chip design tasks, fine-tuning has been the most common approach Mirhoseini et al. (2020), though they have not demonstrated the expected increases in speed. 051

001

Search algorithms have played a crucial role in domains such as Neural Architecture Search (NAS)
 (Li & Peng, 2020) and Hyper Parameter Optimization (HPO) (Alibrahim & Ludwig, 2021). These approaches are less commonly applied directly to the parameters of a model (Singamsetti et al.,

 2021; Mahajan & Guzdial, 2022; Doosti & Guzdial, 2023). Building on prior search-based transfer learning approaches, we propose an approach for chip placement that directly updates network parameters via search.

057 Our study introduces a novel transfer learning method based on Monte Carlo Tree Search (MCTS) 058 and MCTS using Upper Confidence Bounds for Trees (UCB) to balance exploration with exploita-059 tion (Kocsis & Szepesvári, 2006; Auer et al., 2002). MCTS is probabilistic and heuristic-driven, 060 allowing it to adapt to the most promising search regions (Deng & Wu, 2023) and guiding our 061 proposed approach to effectively explore inside promising search regions. Our approach aims at 062 directly optimizing model parameters within a complex search space. It iteratively updates a search 063 tree by simulating possible outcomes for an RL agent in the chip design environment. Every node 064 in the tree corresponds to two networks: a policy network and a value network, and each node is associated with a cost value: the performance of these networks in a specific chip design environ-065 ment. The policy network, denoted as  $\pi_{\theta}(a|s)$ , determines the probability of selecting an action a 066 in a given state s, with the parameter  $\theta$  influencing this behaviour, and the value network provides 067 an estimation of the expected reward for the chip placement. The cost value is the weighted total of 068 wirelength, congestion, and density costs associated with the chip design task; a node is considered 069 more efficient if it creates chip placements with less cost value.

- 071 The contributions of this paper are summarized as follows:
  - We propose a novel transfer learning-based optimization approach named Parameter Monte Carlo Tree Search (PMCTS) utilizing the MCTS algorithm to dynamically update the parameters of a neural network to adapt models with a relatively small amount of computation and which can generate efficient chip placements.
  - We validate the proposed methodology across four distinct chip design tasks—Ariane, Ariane133, IBM01, and IBM02—highlighting its effectiveness and stability in diverse chip design tasks.
  - We show that transferring knowledge from a complex chip design task, such as IBM02, to a simpler chip design, like Ariane, outperforms both training from scratch and fine-tuning approaches in terms of creating chip placements with reduced cost value in lesser time-frames. However, transferring knowledge from a simpler chip design to a more complex one does not yield the same benefits.

# 2 RELATED WORK

This section covers related work in transfer learning, parameter search, and chip placement problems.

2.1 TRANSFER LEARNING

092 Transfer learning using deep neural networks (DNNs) refers to the process of transferring knowledge 093 and parameters from a DNN trained on a source dataset to another DNN aiming to address a similar 094 target problem (Shafahi et al., 2019). Several methods, including zero-shot (Xian et al., 2017), 095 one-shot (Fei-Fei et al.), and few-shot (Ravi & Larochelle, 2017) approaches, exist to transfer 096 knowledge from a source domain to a target domain. However, these approaches often rely on 097 manually authored features or significant additional data to effectively guide the knowledge transfer process. Our proposed approach does not require the incorporation of additional manually authored 098 features or significant additional training to adapt to a target domain. 099

100 101

072

073

074

075

076

077

078

079

081

082

084 085

087

088

089 090

091

2.2 PARAMETER OPTIMIZATION

Hyperparameter Paramater Optimization (HPO) is the most common search problem associated with
 DNNs (Feurer & Hutter, 2019). However, our primary focus is on optimizing parameters rather than
 hyperparameters.

The majority of current neural network (NN) research has been on deep learning, where the pri mary approach for training NNs is backpropagation (Rumelhart et al., 1986), an algorithm used to compute the gradient of the loss function. Backpropagation-based approaches have been useful in

108 different domains for determining optimal parameters. However, backpropagation can often become 109 stuck in local minima (Siddique & Tokhi, 2001). Further, the optimal parameters for backpropaga-110 tion appear to differ depending on the specific scenario. Sparse regularization methods in NNs focus 111 on a subset of the network to enhance computational and space efficiency without compromising 112 performance by addressing the issue of redundant and correlated structures. The backpropagation technique proposed by Sun et al. (2017) involves computing a very small but vital part of the gra-113 dient and modifying only the corresponding small portion of the parameters in each update. This 114 approach results in highly sparsified gradients that only modify highly relevant parameters for the 115 given training sample. Additionally, pruning strategies may compress a model without significantly 116 reducing its capacity for prediction (Yeom et al., 2021). Our approach can be understood as a type 117 of sparse optimization, where we rely on search to identify parameters for optimization. 118

Neuroevolution stands as an alternate optimization approach that involves the training of NNs using 119 evolutionary algorithms. Making use of evolutionary search facilitates significant functionalities 120 that are not commonly available to gradient-based methodologies, such as the ability to learn neural 121 network's activation functions and architectures (Stanley et al., 2019). Aly et al. (2019) introduce 122 an evolutionary metaheuristic aimed at optimizing DNNs. The objective is to search multiple regions 123 of the search space while preserving a specific distance between those regions to guarantee diversity. 124 Zhou et al. (2024) introduce a neuroevolutionary diversity policy search method for multi-objective 125 reinforcement learning (RL) challenges, utilizing gradient-based genetic operations to equip each 126 policy with a buffer for gathering past experiences. A deep genetic algorithm proposed by Such 127 et al. (2017) competitively trains DNNs for difficult RL tasks. Their genetic algorithm exhibits 128 remarkable competitiveness with widely used algorithms in the domain of deep RL challenges. Our 129 proposed transfer learning method utilizes MCTS, a distinct search approach not previously applied to these cases. 130

There has been prior work on applying search algorithms to transfer learning tasks (Singamsetti et al., 2021; Mahajan & Guzdial, 2022; Doosti & Guzdial, 2023). These approaches have shown success at few-shot transfer learning and have generally found that tree-based optimization approaches like MCTS outperform other search algorithms, such as hill-climbing and beam search (Mahajan & Guzdial, 2022; Doosti & Guzdial, 2023). However, these approaches have not previously been applied in an RL setting.

137 138

139

#### 2.3 AUTOMATED CHIP PLACEMENT

140 The chip placement process involves determining the optimal positioning of various macrocom-141 ponents, a task that requires sophisticated multi-objective optimization to balance factors such as 142 wirelength, congestion, and density. Various approaches have been developed to address this prob-143 lem, including conventional divide-and-conquer (Fiduccia & Mattheyses, 1988) and hierarchical 144 placement strategies (Tsay et al., 1988), as well as advanced deep RL techniques (Mirhoseini et al., 145 2020). Myung et al. (2023) propose that pre-training networks to extract features from different 146 netlists and utilize them as encoders in the policy network can greatly decrease the duration of train-147 ing. Moreover, Lai et al. (2022) present an approach for placement using RL. This method utilizes convolutional NNs to learn position, wirelength, and view information for circuit modules, resulting 148 in a rich visual representation. Lee et al. (2024) discuss using diffusion models for chip placement. 149 They developed a NN with interleaved graph convolutions and multi-headed attention layers and 150 also a synthetic data generation algorithm. However, there are certain limitations associated with 151 this approach. In particular, synthetic datasets with shorter wirelengths may result in models ex-152 hibiting poorer performance to out-of-distribution circuits. We note that we specifically focus on 153 transfer learning as a way to speed up chip design tasks and, as such, do not compare with these 154 approaches that train from scratch. 155

Historically, simulated annealing (SA) has been a long-standing method with proven success in chip placement (Sarrafzadeh et al., 2003). SA is a search-based approach but optimizes the chip design directly rather than optimizing a chip design model. But SA shows a significant lack of speed, presents challenges in parallelization, and encounters difficulties adapting to current circuits' growing complexity and size.

161 The RL-based methodology of Mirhoseini et al. (2020) explores an RL agent trained to optimize the arrangement of chips. In our paper, we build upon the policy and value networks from their study.

To the best of our knowledge, our approach is the first to demonstrate that transferring knowledge
 from a complex chip design task to a simpler chip design can quickly create models that can generate
 efficient chip placements with reduced cost values compared to training from scratch and fine-tuning
 approaches.

166 167

168 169

170

171

172

173

174

# 3 SYSTEM OVERVIEW

This section outlines our method, which is divided into two main components. First, we train a model on a source chip design task. After that, our proposed transfer learning method based on MCTS directly modifies the parameters of the trained source model through interactions with the target chip design environment. This method creates a tree of child nodes, each of which represents a distinct model. Lastly, we return the best model, which has the lowest cost value, from the tree.

### 175 3.0.1 STEP 1: SOURCE TRAINING

176 The first step involves training a model on the source chip design task  $Task_S$  as described in step 1 of 177 Algorithm 1. We apply a specific training configuration for each chip design task due to differences 178 in the numbers of horizontal and vertical routes per micron, as well as the allocation values for 179 macro horizontal and macro vertical routing. We employ an existing architecture of policy and 180 value networks and all other domain-specific hyperparameters from Mirhoseini et al. (2020). The 181 parameters of the policy model are updated using PPO (Schulman et al., 2017) based on the source 182 chip design task  $Task_S$ . We use an existing proxy cost function from Cheng et al. (2023) as shown in Equation 3. 183

184 185

#### 3.0.2 STEP 2: PARAMETER SEARCH

186 During the second step, we utilize parameter search as a form of transfer learning to adapt the source 187 model to the target chip design task. Initially, the root node  $N_R$  corresponds to the source model  $S_m$ . 188 From this root node  $N_R$ , we generate a child node by applying a randomly selected search operator to 189 modify its parameters. Child nodes are created until the tree T has reached its maximum number of 190 nodes,  $N_M$  which can be specified. To balance exploration with exploitation, we employ an  $\epsilon$  value 191 of 0.1, so the best-performing child nodes are selected ninety percent of the time based on the Cost 192 value calculated on the target chip design task  $Task_T$  by using the proxy cost function as shown in Equation 3, but less-explored nodes are chosen periodically from all the nodes  $N_i$ . The selected 193 child node  $N_c$  will function as the new root node  $N_R$ . The node  $N_R$  will later go through further 194 parameter modification using a randomly selected search operator to generate a child node  $N_i$ , 195 calculate the Cost value for the target chip design  $Task_T$ , and backpropagate this information up 196 the tree. These operations will terminate when the node reaches its maximum length of child nodes 197 L. The search operator directly modifies the parameters of the model, allowing a potentially more efficient optimization process, as these can be understood as sparse updates. Lastly, the algorithm 199 outputs the most efficient node  $N_{LC}$  based on the Cost value from the search tree T for the target 200 chip design task  $Task_T$ .

201 202

203

#### 3.1 SEARCH OPERATORS

The formulation of our search operators is an essential component of our methodology since these functions directly determine potential sparse parameter modifications during the optimization process. The operators contain two distinct functions that modify the parameter values of the models, with only one operation being randomly selected at a time. These are similar to the mutation functions from prior search-based transfer learning work Singamsetti et al. (2021). We selected these operations because they are simple and systematic, providing a comprehensive range of options to explore the search space. We note that the operators can be applied sequentially to the same layer, allowing for arbitrary value changes in theory. The search operators are as follows:

- 211
- 212
- 213 214

• The first operation randomly selects a specific layer,  $L_r \in \{L_1, L_2, \ldots, L_n\}$ , of the policy network  $\pi_{\theta}(a|s)$ , where  $r \sim \mathcal{U}\{1, 2, \ldots, n\}$  and modifies the weights by adding a scalar value, a, uniformly distributed within the range (-1, 1) as shown in Equation (1).

$$W'_{i,j} = W_{i,j} + a, \quad a \sim \mathcal{U}(-1,1)$$
 (1)

Algorithm 1 PMCTS Algorithm **Input**: Source Chip Design Task  $Task_S$ **Output**: Least cost node  $N_{LC}$  for Target Chip Task  $Task_T$ 1: Step 1: Source Training 2:  $S_m = PPO(Task_S)$ 3: Save  $S_m$ 4: Step 2: Parameter Search 5: Initialize  $S_m$  as root node  $N_R$ 6: while  $N_M$  not met **do** 7:  $N_c = \text{Selection}(N_i)$  $N_R \leftarrow N_c$ 8: 9: for step l = 1 to L do 10:  $N_i$  = Search Operator( $N_R$ ) 11:  $Cost(N_j, Task_T)$ 12:  $Backprop(N_i)$ end for 13: 14: end while 15: Select  $N_{LC}$  from tree T 16: return  $N_{LC}$ 

• The second operation randomly chooses an individual layer of the policy network and involves multiplying the weights of the chosen layer by a scalar value uniformly distributed across the interval [0, 1), as shown in Equation (2).

$$W'_{i,j} = W_{i,j} \cdot a, \quad a \sim \mathcal{U}[0,1) \tag{2}$$

#### 3.2 COST FUNCTION

The cost function is identical to the proxy cost calculation function developed by Cheng et al. (2023). It is represented by the *Proxy Cost* estimation, which is the weighted sum of wirelength, congestion, and density costs of the chip design task, as depicted in Equation 3. Domain experts identify these three variables as the most important variables in determining the effectiveness of a specific chip design. The parameters  $\gamma$  and  $\lambda$  are set to 0.5 in our experiments, as specified by Cheng et al. (2023).

$$Proxy \ Cost = Wirelength \ Cost + \gamma \cdot Density \ Cost + \lambda \cdot Congestion \ Cost$$
(3)

The *Proxy Cost* is very essential as it directly influences the learning of the RL agent. The *Wirelength Cost* is the first element of the *Proxy Cost* equation, and it is the normalized half perimeter wirelength (HPWL) shown in Equation 4 where *width* and *height* are the width and height of the chip canvas, respectively, and the *weight*, set to 1 by default, specified by the source pin.

Wirelength Cost = 
$$\frac{1}{|nets|} \sum_{net} \frac{net.weight \times HPWL(net)}{width + height}$$
 (4)

Secondly, according to the Equation 5, the *Density Cost* is the mean density of the top 10 % most dense grid cells, where *n* is the number of grid cells in the top 10%.

Density 
$$Cost = \frac{1}{n} \sum_{i=1}^{n} Density of the top 10\% densest grid cells$$
 (5)

Lastly, the *Congestion Cost* is divided into two components: congestion resulting from macros and
 congestion led by net routing. Macro congestion results from the extra routing layer resources
 used by macros, while routing congestion is caused by the use of routing resources by each routed
 network. In the process of calculating macro congestion, the horizontal and vertical values of a grid

cell are obtained by summing the routing resources utilized by macros that cross the right and top boundaries of the cell. The total congestion for each grid cell is calculated by summing the macro congestion and routing congestion for each direction individually, as shown in Equation 6. Next, the *Congestion Cost* is calculated as the mean of the top 10% of all  $H_{-cong}$  and  $V_{-cong}$  values of grid cells in the chip canvas defined by the Equation 7. A detailed overview can be found in the prior work Cheng et al. (2023) and the *Circuit Training* repository Guadarrama et al. (2021).

$$H\_cong = H\_macro\_cong + H\_net\_cong$$
<sup>(6)</sup>

(7)

$$V\_cong = V\_macro\_cong + V\_net\_cong$$

276 277 278

282

# 283 284

285

286

287

288

289

290 291

292

302

303

# 4 EXPERIMENTAL SETUP

In this paper, we focus on the application of our proposed methodology across different chip design tasks. We evaluate our approach on well-understood tasks that have been studied in prior transfer learning work of Mirhoseini et al. (2020). This setting allows us to demonstrate that our method can produce optimized models in less time than corresponding baselines, which typically rely on traditional backpropagation.

Congestion Cost =  $\frac{1}{n} \sum_{i \in \text{Top 10\%}}^{n} (H_{\text{cong},i} + V_{\text{cong},i})$ 

4.1 Environments

293 We examine our approach on four distinct chip designs. The first design refers to the Ariane RISC-V CPU found in the Google Circuit Training repository (Mirhoseini et al., 2020). It consists of 133 295 hard macro units and serves as a standard reference for the chip placement problem. Hard macros refer to pre-built building blocks (Lavin et al., 2013). The second design, Ariane133, is a slight 296 variant of the first design while maintaining the 133-macro unit configuration (Cheng et al., 2023). 297 In addition, we assess our approach using two designs from the ICCAD04 benchmark, namely 298 IBM01 and IBM02, which consist of 246 and 271 hard macro units, respectively (Cheng et al., 299 2023; Adya et al., 2004; Adya & Markov, 2002). The complexity of the IBM variations exceeds that 300 of the Ariane versions. 301

4.2 BASELINES

304 In this research, we utilize two primary baselines: the scratch model and the fine-tuning approach 305 from the prior work of Mirhoseini et al. (2020). Both baselines utilize conventional gradient descent 306 techniques for optimization, providing a comparison to evaluate the efficacy of our proposed method. 307 The training from scratch model baseline is used as a reference for how well an RL agent can perform 308 when trained on the target domain from random initialization. The fine-tuning baseline instead takes 309 a model trained on a source domain and fine-tunes it on a target domain. Through the comparison with these baselines, we show the effectiveness of our approach in creating optimized models for 310 chip placement without primarily depending on gradient descent methods. 311

312 313

4.3 COMPUTE RESOURCES

314 All experiments are conducted using the cloud computing resource provided by <315 Redacted for Anonymity >, which consists of 18 CPU cores and 2xNVIDIA v1001 GPUs with 316 32GB of memory each. For both training from scratch and fine-tuning approaches, we ran experi-317 ments for 12 hours to allow for convergence. In contrast, our PMCTS approach typically runs for 318 shorter durations, ranging from 2 to 8 hours depending on the complexity of the source and target 319 chip designs. For example, using Ariane as the source and Ariane133 as the target chip design, 320 PMCTS completed in 50 minutes with a max nodes value of 100, meaning the tree generated 100 321 nodes before stopping. However, when using Ariane as the source and IBM02 as the target, PMCTS required 2.5 hours to complete due to the complexity of IBM02, where generating new models and 322 calculating the cost for such complex chip designs takes longer. We employ three arbitrary seed 323 values for all experiments, which we give in full in the Appendix A.

| 324 | Method           | Source          |                 |                 |
|-----|------------------|-----------------|-----------------|-----------------|
| 325 | Method           | IBM02           | IBM01           | Ariane133       |
| 326 | Scratch (Ariane) |                 | $1.29 \pm 0.02$ | ·               |
| 327 | Fine Tuning      | $1.27 \pm 0.03$ | $1.30 \pm 0.05$ | $1.28 \pm 0.05$ |
| 328 | Zero Shot        | $1.36 \pm 0.03$ | $1.35 \pm 0.00$ | $1.32 \pm 0.01$ |
| 329 | Our Approach     | 1.19±0.01       | 1.17±0.04       | 1.24±0.07       |

Table 1: Estimated average cost value and standard deviations for the baselines and our approach: when transferring from complex to simple chip (Ariane) designs.

#### 334 335 4.4 EXPERIMENTS

336 We evaluate the performance of our approach utilizing four separate chip design tasks. We use all 337 permutations of source and target domains. For example, when assessing our approach's transfer 338 learning performance, we use the Araine chip design task as the source domain to train a model. 339 This trained model is then used as the source input model for our approach, and then we evaluate 340 the performance of the output models from our approach across the three target chip design tasks, 341 namely Ariane133, IBM01, and IBM02. We then repeat this process for each of the other chip 342 design tasks. In addition, for the fine-tuning approach, we take a model trained on a source chip design task and fine-tune it on the remaining target chip designs. We take all our architectures for 343 both scratch and fine-tuning using similar model settings from the same prior study Mirhoseini 344 et al. (2020). For training from scratch and fine-tuning, we execute the models for 12 hours. For 345 our suggested approach, we only consider models from the first hour of running PMCTS. We note 346 that it is very likely, based on prior work, that both from scratch and fine-tuning approaches would 347 eventually outperform the PMCTS models based on the results of prior work of Mirhoseini et al. 348 (2020), but we are focused on speeding up the optimization process, thereby emphasizing this lower 349 timeframe setting.

350 351 352

330

331

332 333

### 5 Results

353 354

In this section, we present our results. The tables show the results obtained by comparing the 355 baseline methodologies with our PMCTS approach. The figures illustrate the average and combined 356 normalized cost values for all relevant approaches, where normalization was performed by dividing 357 each cost value by the maximum cost value. Regarding the figures, our method avoids retraining 358 the same model repeatedly. Instead, it iteratively generates new models by modifying the layers' weights through mutation functions, evaluating the cost value for each newly created model. This 359 iterative process introduces variations in the cost values over time, which explains the observed 360 trends in the figures. The shaded regions of the figures represent the 95% confidence interval range. 361 We separate the results into two categories—one for transferring knowledge from complex to simple 362 chip designs and the other for transferring from simple to complex designs—to clarify the analysis. 363 This splitting allows focused comparisons by emphasizing the differences in performance between 364 the two directions of knowledge transfer. The approach is considered to be better when the cost value is lower. The detailed cost values for each seed corresponding to each method are provided in 366 Appendix A.

367 Table 1 shows a comparison of the average cost values and standard deviations for the three seed val-368 ues across four different methods: scratch training, fine-tuning, zero shot and our proposed approach 369 for transferring knowledge from complex source chip designs to the simpler target Ariane chip de-370 sign task. For the scratch training approach, training the model from scratch on the Ariane chip 371 design results in comparatively higher costs, indicating that the method is unable to identify effi-372 cient models within a certain amount of time. For fine-tuning, zero shot and our proposed approach, 373 we use separate source models trained on IBM02, IBM01, and Ariane133 chip designs and then 374 transfer the knowledge to the target Ariane chip design. Fine-tuning offers some improvements over 375 training from scratch. In contrast, our proposed approach consistently outperforms scratch training, fine-tuning and zero shot, resulting in models with lower cost values in all scenarios in shorter times. 376 The results show the usefulness of our method, establishing it as the most cost-effective method for 377 quickly creating efficient models for chip placement tasks.

| Method              | Source          |                 |  |
|---------------------|-----------------|-----------------|--|
| wichiou             | IBM02           | IBM01           |  |
| Scratch (Ariane133) | $1.32 \pm 0.04$ |                 |  |
| Fine Tuning         | $1.27 \pm 0.02$ | $1.34{\pm}0.06$ |  |
| Zero Shot           | $1.33 \pm 0.06$ | $1.41 \pm 0.06$ |  |
| Our Approach        | $1.24{\pm}0.03$ | $1.27{\pm}0.02$ |  |

| Method          | Source          |                 |  |
|-----------------|-----------------|-----------------|--|
| Michiou         | Ariane133       | Ariane          |  |
| Scratch (IBM01) | $2.18 \pm 0.08$ |                 |  |
| Fine Tuning     | $2.23 \pm 0.15$ | $2.25 \pm 0.11$ |  |
| Zero Shot       | $2.35 \pm 0.07$ | $2.34{\pm}0.06$ |  |
| Our Approach    | 2.1±0.05        | 2.23±0.11       |  |

Table 2: Estimated average cost value and standard deviations for the baselines and our approach: when transferring from complex to simple chip (Ariane133) designs.

Table 3: Estimated average cost value and standard deviations for the baselines and our approach: when transferring from simple to complex chip (IBM01) designs.

| Method          | Source          |                 |           |  |
|-----------------|-----------------|-----------------|-----------|--|
| Methou          | IBM01           | Ariane133       | Ariane    |  |
| Scratch (IBM02) |                 | $2.72 \pm 0.21$ |           |  |
| Fine Tuning     | $2.92{\pm}0.26$ | 2.27±0.25       | 2.47±0.12 |  |
| Zero Shot       | $2.59{\pm}0.15$ | $2.37 \pm 0.10$ | 2.61±0.17 |  |
| Our Approach    | 2.32±0.26       | $2.44{\pm}0.08$ | 2.43±0.09 |  |

Table 4: Estimated average cost value and standard deviations for the baselines and our approach: when transferring from simple to complex chip (IBM02) designs.

398 399

397

386

387

Table 2 demonstrates the efficiency of our approach for transferring knowledge while using Ariane 133 as the target chip design task. As table 2 shows the values for transferring knowledge from the complex chip design tasks, namely IBM02 and IBM01, to the simple chip design, Ariane 133.
We do not include Ariane as the source model because it is not more complex than the Arian 133.
When using IBM02 and IBM01 as source designs, our method consistently outperforms baseline approaches, achieving the lowest costs. These results further demonstrate the robustness and efficiency of our method compared to the baselines for complex-to-simple transfer learning tasks.

407 Tables 3 and 4 present the cost comparisons and standard deviations for knowledge transfer from simpler chip designs to complex ones. In Table 3, when transferring knowledge from Ariane133 and 408 Ariane to IBM01, our approach outperforms fine-tuning but is unable to outperform training from 409 scratch. In Table 3, we leave out IBM02 as the source model for knowledge transfer to IBM01, as 410 IBM02 is not simpler than IBM01. Our approach demonstrates superior performance when using 411 IBM02 as the target chip design task and IBM01 as the source chip shown in Table 4, consistently 412 outperforming scratch training and fine-tuning approaches. Despite these exceptions, our method 413 demonstrates significant overall performance. The experiments indicate there may be a certain level 414 of similarity between IBM02 and IBM01, resulting in the performance of our approach exceeding 415 both initial training and fine-tuning.

416 Figure 1 illustrates the combined and normalized average cost values of scratch training for the 417 four chip design tasks, fine-tuning approaches for the complex to simpler chip designs, and our 418 proposed method for the complex to simpler chip designs (e.g., using IBM02 as the source and 419 Ariane as the target chip design, etc.). The key focus is on the first hour of training, where our method 420 consistently outperforms both scratch training and fine-tuning in terms of speed and efficiency. The 421 graph demonstrates that our approach quickly identifies the best model, outperforming fine-tuning 422 and scratch training in the same time frame. Furthermore, our method maintains a lower range of normalized cost values throughout the entire period, highlighting its rapid convergence. The faster 423 detection of the best models highlights the efficiency of our approach compared to conventional 424 methods. 425

Figure 2 displays the combined and normalized average cost values for the four chip designs when
training from scratch. It also includes the fine-tuning approaches for simple-to-complex chip designs and our proposed method for simple-to-complex chip designs, such as using Ariane as the
source chip design and IBM02 as the target chip design. We observe that our approach encounters
challenges when transferring knowledge from simple chip designs to more complex chip designs.
Although it does not consistently surpass scratch training and fine-tuning in later time frames, it
avoids falling considerably behind.



Figure 1: Showing the combined and normalized averaged cost value for all training from scratch models, as well as the complex-to-simple chip design approach of fine-tuning and our proposed method



Figure 2: Showing the combined and normalized averaged cost value for all training from scratch models, as well as the simple-to-complex chip design approach of fine-tuning and our proposed method

#### 

# 6 DISCUSSION

Our proposed transfer learning approach, PMCTS, offers several key advantages that make it highly effective for solving the chip placement problem. One of the primary benefits is its simplicity, as it directly modifies the network's parameters without requiring complex architectural changes or additional training on the target chip domains. Our experiments show that traditional methods such as training from scratch or fine-tuning are time-consuming to reach high-performing models. In contrast, our approach demonstrates greater speed and efficiency, resulting in reduced time and computational costs. This makes it particularly appealing for chip placements where rapid devel-opment and cost savings are critical. The shorter period required to generate effective models of our approach, particularly when transferring knowledge from complex to simple chip designs, of-fers significant advantages, thereby presenting a more economical and scalable solution to the chip industry.

486 However, it is crucial to acknowledge that this benefit does not apply when the knowledge transfer is 487 reversed, going from simpler to more complex designs. This suggests that the complexity inherent 488 in certain chip designs contains vital information that greatly helps the transfer learning process. 489 However, simpler designs lack sufficient depth to transfer knowledge to complex designs. In the 490 majority of cases, our PMCTS approach shows higher efficiency compared to both training from scratch and fine-tuning approaches. It would be worthwhile to explore the possibility of using either 491 the scratch model or the fine-tuned model as input to PMCTS to further enhance its performance. 492 However, considering that one key advantage of PMCTS is the ability to obtain a more efficient 493 model with significantly less computation time, we believe that this would contradict the purpose of 494 this research. Therefore, we prefer to conduct an additional study of this in the future. 495

496 We believe that part of our results, which demonstrate superior performance in transferring knowledge from complex to simple chip designs compared to the reverse, may be due to our selection 497 of search operators. Particularly, our second search operator essentially has the effect of reducing 498 parameter magnitudes in the network. This may align with pruning or other sparse optimization ap-499 proaches in terms of simplifying a given model, making it more general, and encompassing a simpler 500 task. In Table 9 in the Appendix A, we analyzed the sensitivity of PMCTS to our mutation func-501 tions, demonstrating the role of the multiplication operation. When transferring knowledge from 502 complex to simple chip designs, removing the multiplication operator led to quicker convergence to better nodes. For instance, in the IBM01 to Ariane transfer, the minimum cost value of 1.1930 was 504 achieved after creating 30 nodes with all functions, whereas removing the multiplication operator 505 reduced this to just 17 nodes. In contrast, removing the multiplication operator significantly reduced 506 performance when transferring knowledge from simple to complex chip designs. These insights will 507 inform future refinements for our search operators to optimize performance in diverse transfer learn-508 ing settings. In future work, we hope to explore alternative search operators, which may improve the robustness of our approach. 509

We plan to extend our approach to more complex chip design tasks, like BlackParrot (Quad-Core) and MemPool Group, to validate the effective trend of complex-to-simple knowledge transfer
(Petrisko et al., 2020; Riedel et al., 2023). In addition, alternative search algorithms like Quality-Diversity can be explored given how they offer a wide range of optimized solutions that vary based on a number of user-specified parameters of interest (Chatzilygeroudis et al., 2021).

- 515
- 516 517

# 7 CONCLUSIONS

518 519 520

521

522

523

524

525

526

527

528

This paper presents a novel transfer learning approach named Parameter Monte Carlo Tree Search (PMCTS) that utilizes MCTS to optimize model parameters for chip placement to generate efficient models in a short timeframe while transferring knowledge from complex chip design tasks to less complex ones and without requiring additional authored transfer features. The speed, along with its effectiveness, makes our approach potentially useful for chip placement scenarios where rapid optimization and implementation are crucial. By conducting experiments on four different chip design tasks—Ariane, Ariane133, IBM01, and IBM02—we demonstrate that knowledge transfer from complex chip designs to less complex ones results in improved performance, surpassing the training from scratch and fine-tuning in terms of both time and performance. In contrast, the transfer from simpler to more complex designs does not produce the same advantage in all scenarios.

- 529 530
- 531 532

# 8 ETHICS STATEMENT

533 534

We identify that our work was done in partnership with an international technology company involved in chip manufacturing. While our partner hopes to benefit from this research, we will make all of our code accessible to the public, and we utilize publicly available chip design environments for evaluation. We also note that while job loss is possible with any AI application, presently this type of automated chip design task is handled via search-based approaches in industry. As such, we do not identify any negative potential repercussions in terms of job loss from our approach.

# 540 9 REPRODUCIBILITY STATEMENT

For reproducibility, we make use of only publicly available chip design task environments as covered in Subsection 4.1. We also make use of previously published network architectures and optimization approaches as covered in Subsection 4.4. Finally, we include all code in the supplementary materials and identify the seeds used for all experiments in our Appendix A.

REFERENCES

546 547

548

552

553

554

558

559

561

562

567

568

569

570

576

580

- Saurabh N Adya and Igor L Markov. Consistent placement of macro-blocks using floorplanning and standard-cell placement. In *Proceedings of the 2002 international symposium on Physical design*, pp. 12–17, 2002.
  - SN Adya, S Chaturvedi, JA Roy, D Papa, and IL Markov. Unification of partitioning, floorplanning and placement. In *Proc. ICCAD*, pp. 550–557, 2004.
- Hussain Alibrahim and Simone A Ludwig. Hyperparameter optimization: Comparing genetic al gorithm against grid search and bayesian optimization. In 2021 IEEE Congress on Evolutionary
   *Computation (CEC)*, pp. 1551–1559. IEEE, 2021.
  - Ahmed Aly, David Weikersdorfer, and Claire Delaunay. Optimizing deep neural networks with multiple search neuroevolution. *arXiv preprint arXiv:1901.05988*, 2019.
  - Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. *Machine learning*, 47:235–256, 2002.
- Bin Cao, Sinno Jialin Pan, Yu Zhang, Dit-Yan Yeung, and Qiang Yang. Adaptive transfer
  learning. Proceedings of the AAAI Conference on Artificial Intelligence, 24(1):407-412, Jul.
  2010. doi: 10.1609/aaai.v24i1.7682. URL https://ojs.aaai.org/index.php/AAAI/
  article/view/7682.
  - Konstantinos Chatzilygeroudis, Antoine Cully, Vassilis Vassiliades, and Jean-Baptiste Mouret. Quality-diversity optimization: a novel branch of stochastic optimization. In *Black Box Optimization, Machine Learning, and No-Free Lunch Theorems*, pp. 109–135. Springer, 2021.
- 571 Chung-Kuan Cheng, Andrew B Kahng, Sayak Kundu, Yucheng Wang, and Zhiang Wang. Assess 572 ment of reinforcement learning for macro placement. In *Proceedings of the 2023 International* 573 *Symposium on Physical Design*, pp. 158–166, 2023.
- Ruoyu Cheng and Junchi Yan. On joint learning for solving placement and routing in chip design,
   2021. URL https://arxiv.org/abs/2111.00234.
- Ruoyu Cheng, Xianglong Lyu, Yang Li, Junjie Ye, Jianye Hao, and Junchi Yan. The policy-gradient placement and generative routing neural networks for chip design. *Advances in Neural Information Processing Systems*, 35:26350–26362, 2022.
  - TianJin Deng and Jia Wu. Efficient graph neural architecture search using monte carlo tree search and prediction network. *Expert Systems with Applications*, 213:118916, 2023.
- Anahita Doosti and Matthew Guzdial. Transfer learning for underrepresented music generation.
   *arXiv preprint arXiv:2306.00281*, 2023.
- Joanna Ellis-Monaghan and Paul Gutwin. Graph theoretical problems in next-generation chip de sign. *Congressus Numerantium*, 01 2003.
- Abolfazl Farahani, Sahar Voghoei, Khaled Rasheed, and Hamid R. Arabnia. A brief review of domain adaptation. In Robert Stahlbock, Gary M. Weiss, Mahmoud Abou-Nasr, Cheng-Ying Yang, Hamid R. Arabnia, and Leonidas Deligiannidis (eds.), *Advances in Data Science and Information Engineering*, pp. 877–894, Cham, 2021. Springer International Publishing. ISBN 978-3-030-71704-9.
  - L Fei-Fei, R Fergus, and P Perona. One-shot learning of object categories. ieee trans. *Pattern Recognition and Machine Intelligence*.

616

633

634

- Matthias Feurer and Frank Hutter. Hyperparameter optimization. Automated machine learning: Methods, systems, challenges, pp. 3–33, 2019.
- 597 Charles M Fiduccia and Robert M Mattheyses. A linear-time heuristic for improving network parti 598 tions. In *Papers on Twenty-five years of electronic design automation*, pp. 241–247. 1988.
- Sergio Guadarrama, Summer Yue, Toby Boyd, Joe Wenjie Jiang, Ebrahim Songhori, Goldie Anna Tam, Terence, and Azalia Mirhoseini. Circuit Training: An open-source framework for generating chip floor plans with distributed deep reinforcement learning., 2021. URL https://github.
   com/google\_research/circuit\_training.
- Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In *European conference* on machine learning, pp. 282–293. Springer, 2006.
- Yao Lai, Yao Mu, and Ping Luo. Maskplace: Fast chip placement via reinforced visual representation learning. *Advances in Neural Information Processing Systems*, 35:24019–24030, 2022.
- Christopher Lavin, Brent Nelson, and Brad Hutchings. Improving clock-rate of hard-macro designs. In 2013 International Conference on Field-Programmable Technology (FPT), pp. 246–253. IEEE, 2013.
- Vint Lee, Chun Deng, Leena Elzeiny, Pieter Abbeel, and John Wawrzynek. Chip placement with
   diffusion. *arXiv preprint arXiv:2407.12282*, 2024.
- 4 Yichen Li and Xingchao Peng. Network architecture search for domain adaptation. *arXiv preprint* arXiv:2008.05706, 2020.
- Anmol Mahajan and Matthew Guzdial. Modeling individual humans via a secondary task transfer learning method. In *Federated and Transfer Learning*, pp. 259–281. Springer, 2022.
- Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang,
   Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, et al. Chip placement with deep
   reinforcement learning. *arXiv preprint arXiv:2004.10746*, 2020.
- Wooshik Myung, Donghyun Lee, Chenhang Song, Guanrui Wang, and Cheng Ma. Policy gradient-based core placement optimization for multichip many-core systems. *IEEE Transactions on Neural Networks and Learning Systems*, 34(8):4529–4543, 2023. doi: 10.1109/TNNLS.2021. 3117878.
- Daniel Petrisko, Farzam Gilani, Mark Wyse, Dai Cheol Jung, Scott Davidson, Paul Gao, Chun Zhao,
   Zahra Azad, Sadullah Canakci, Bandhav Veluri, et al. Blackparrot: An agile open-source risc-v
   multicore for accelerator socs. *IEEE Micro*, 40(4):93–102, 2020.
- Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In *International conference on learning representations*, 2017.
  - Samuel Riedel, Matheus Cavalcante, Renzo Andri, and Luca Benini. MemPool: A scalable manycore architecture with a low-latency shared L1 memory. *IEEE Transactions on Computers*, 72 (12):3561–3575, 2023. doi: 10.1109/TC.2023.3307796.
- David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back propagating errors. *nature*, 323(6088):533–536, 1986.
- Majid Sarrafzadeh, Maogang Wang, and Xiaojian Yang. *Modern placement techniques*. Springer Science & Business Media, 2003.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
   optimization algorithms. *arXiv preprint arXiv:1707.06347*, 2017.
- Ali Shafahi, Parsa Saadatpanah, Chen Zhu, Amin Ghiasi, Christoph Studer, David Jacobs, and Tom
   Goldstein. Adversarially robust transfer learning. *arXiv preprint arXiv:1905.08232*, 2019.
- MNH Siddique and Mohammad O Tokhi. Training neural networks: backpropagation vs. genetic algorithms. In *IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No. 01CH37222)*, volume 4, pp. 2673–2678. IEEE, 2001.

- Mohan Singamsetti, Anmol Mahajan, and Matthew Guzdial. Conceptual expansion neural architecture search (cenas). *International Conference on Computational Creativity*, 2021.
- Kenneth O Stanley, Jeff Clune, Joel Lehman, and Risto Miikkulainen. Designing neural networks
   through neuroevolution. *Nature Machine Intelligence*, 1(1):24–35, 2019.
- Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and
   Jeff Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training
   deep neural networks for reinforcement learning. *arXiv preprint arXiv:1712.06567*, 2017.
- Ku Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. meprop: Sparsified back propagation
   for accelerated deep learning with reduced overfitting. In *International Conference on Machine Learning*, pp. 3299–3308. PMLR, 2017.
- Ren-Song Tsay, Ernest S Kuh, and Chi-Ping Hsu. Proud: A sea-of-gates placement algorithm. *IEEE Design & Test of Computers*, 5(6):44–56, 1988.
  - Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning-the good, the bad and the ugly. In *Proceedings of the IEEE conference on computer vision and pattern recognition*, pp. 4582–4591, 2017.
  - Seul-Ki Yeom, Philipp Seegerer, Sebastian Lapuschkin, Alexander Binder, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Pruning by explaining: A novel criterion for deep neural network pruning. *Pattern Recognition*, 115:107899, 2021.
- Kaichao You, Guo Qin, Anchang Bao, Meng Cao, Ping Huang, Jiulong Shan, and Mingsheng Long.
   Efficient convbn blocks for transfer learning and beyond. *arXiv preprint arXiv:2305.11624*, 2023.
- Dan Zhou, Jiqing Du, and Sachiyo Arai. Neuroevolutionary diversity policy search for multi objective reinforcement learning. *Information Sciences*, 657:119932, 2024.

#### A APPENDIX

Here we give the results of the experiments in detail, with one table for each of the source target tasks.

| Method           | Source    | Cost (Seed 10) | Cost (Seed 55) | Cost (Seed 111) |
|------------------|-----------|----------------|----------------|-----------------|
| Scratch (Ariane) |           | 1.312          | 1.303          | 1.268           |
| Fine Tuning      | IBM02     | 1.273          | 1.296          | 1.232           |
| Our Approach     | IDIVI02   | 1.202          | 1.176          | 1.193           |
| Fine Tuning      | IBM01     | 1.267          | 1.267          | 1.360           |
| Our Approach     |           | 1.193          | 1.132          | 1.193           |
| Fine Tuning      | Ariane133 | 1.245          | 1.333          | 1.268           |
| Our Approach     | Anale 155 | 1.193          | 1.312          | 1.202           |

Table 5: Estimated cost for baselines and our approach: transfer learning from complex to simple chip (Ariane) designs.

| Method              | Source  | Cost (Seed 10) | Cost (Seed 55) | Cost (Seed 111) |
|---------------------|---------|----------------|----------------|-----------------|
| Scratch (Ariane133) |         | 1.347          | 1.352          | 1.272           |
| Fine Tuning         | IBM02   | 1.27           | 1.255          | 1.289           |
| Our Approach        | IDIVI02 | 1.245          | 1.206          | 1.264           |
| Fine Tuning         | IBM01   | 1.272          | 1.352          | 1.389           |
| Our Approach        | IDMUI   | 1.265          | 1.266          | 1.292           |

Table 6: Estimated cost for baselines and our approach: transfer learning from complex to simple chip (Ariane133) designs.

| 7 | 0 | 2 |
|---|---|---|
| 7 | 0 | 3 |
| 7 | 0 | 4 |

| Method          | Source    | Cost (Seed 10) | Cost (Seed 55) | Cost (Seed 111) |
|-----------------|-----------|----------------|----------------|-----------------|
| Scratch (IBM01) |           | 2.222          | 2.233          | 2.096           |
| Fine Tuning     | Ariane    | 2.343          | 2.133          | 2.264           |
| Our Approach    | Ariane    | 2.194          | 2.145          | 2.359           |
| Fine Tuning     | Ariane133 | 2.085          | 2.381          | 2.216           |
| Our Approach    | Anale155  | 2.039          | 2.145          | 2.101           |

Table 7: Estimated cost for baselines and our approach: transfer learning from simple to complex chip (IBM01) designs.

| Method          | Source    | Cost (Seed 10) | Cost (Seed 55) | Cost (Seed 111) |
|-----------------|-----------|----------------|----------------|-----------------|
| Scratch (IBM02) |           | 2.932          | 2.504          | 2.714           |
| Fine Tuning     | Ariane    | 2.324          | 2.551          | 2.526           |
| Our Approach    | Analie    | 2.397          | 2.526          | 2.364           |
| Fine Tuning     | Ariane133 | 2.266          | 2.532          | 2.025           |
| Our Approach    | Analie155 | 2.360          | 2.526          | 2.444           |
| Fine Tuning     | IBM01     | 2.704          | 2.84           | 3.211           |
| Our Approach    | IDMOT     | 2.537          | 2.028          | 2.390           |

Table 8: Estimated cost for baselines and our approach: transfer learning from simple to complex chip (IBM02) designs.

| Source    | Target  | <b>Operation Type</b> | Min Cost Value | Min Nodes |
|-----------|---------|-----------------------|----------------|-----------|
| IBM01     | Ariane  | All Operations        | 1.1930         | 30        |
| IDMUT     | Analie  | Without Multiply      | 1.1930         | 17        |
| IBM02     | Ariane  | All Operations        | 1.1930         | 164       |
| IBM02     | Ariane  | Without Multiply      | 1.1930         | 111       |
| Ariane133 | IBM02   | All Operations        | 2.4436         | 88        |
| Analie155 | IDIVI02 | Without Multiply      | 3.270          | 37        |
| Ariane133 | IBM01   | All Operations        | 2.101          | 109       |
|           |         | Without Multiply      | 2.347          | 79        |

Table 9: Sensitivity Analysis of the Search Operators of the PMCTS Approach.