## A Appendix

### A.1 Module, Net and Pin

**Module.** A chip is composed of numerous modules of two types: macros and standard cells. Macros are relatively large and include DRAMs, caches, and IO interfaces. Standard cells are mainly logic gates; they are much smaller than macros, so their size can be ignored. Fig.8 (a), for example, contains four macros and several standard cells. Placement methods usually place macros first and then the standard cells, to ensure there is enough space for the macros [36]. Because the number of standard cells is very large, we currently apply MaskPlace to macro placement only.

**Pin.** Pins are the input/output interfaces of modules. They are connected directly by wires and have fixed positions relative to their modules. We define the position of pin $P^{(i,j)}$ relative to the bottom-left corner of the module it belongs to as $\Delta^{(i,j)} = (\Delta_x^{(i,j)}, \Delta_y^{(i,j)})$. For example, there are five pins and three macros in Fig.9 (a), with the pin offset information shown at the bottom. Pin positions should not be ignored in the placement task because they determine the wirelength. However, graph neural network-based models [3, 22] ignore them when converting circuits into graphs, which may lead to sub-optimal results.

**Net.** A net is a set of pins connected by the same wires, so all of its pins carry the same signal (0/1 in digital circuits). For example, in Fig.8 (a) four pins belong to Net 1 and the other three pins belong to Net 2. Usually, a pin belongs to only one net, and a net has two or more pins (one input and one or more outputs). The pins of a net define a net bounding box, as in Fig.8 (a)(b).



Figure 8: **Metrics for placement.** In a practical placement scenario, HPWL is the optimization objective, while congestion and density are constraints. HPWL should be as small as possible, while congestion and density must stay below given thresholds. Placement (b) is better than (a) because the HPWL and congestion of (b) are smaller. Placement (c) is invalid because modules overlap in grid cells $g_{7,5}$ and $g_{7,8}$.

### A.2 Metric

**HPWL.** HPWL (Half-Perimeter Wire Length) is widely used to estimate wirelength at low computational cost [24]. It is the sum of the half-perimeters of all net bounding boxes, as in Fig.8 (a)(b), where a net's bounding box is the smallest rectangle that contains all pins of that net.
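The definition above can be illustrated with a minimal sketch. The function name `hpwl`, the dictionary layout, and the pin coordinates are hypothetical and chosen only for illustration; they are not taken from our implementation.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def hpwl(nets: Dict[str, List[Point]]) -> float:
    """Sum, over all nets, of the half-perimeter of the net bounding box."""
    total = 0.0
    for pins in nets.values():
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        # Half-perimeter = bounding-box width + bounding-box height.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Hypothetical pin coordinates for two nets.
nets = {
    "net1": [(1, 1), (4, 3), (2, 5)],  # bounding box 3 wide, 4 high
    "net2": [(0, 0), (3, 0)],          # bounding box 3 wide, 0 high
}
print(hpwl(nets))  # 10.0
```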

**Congestion.** The congestion metric is used to avoid routing congestion, which increases the actual wirelength because routing resources on a real chip are limited. There are many ways to estimate congestion. One is to compute a rough routing result [3], but this is computationally intensive. We instead use RUDY [25], a common congestion estimator. In RUDY, each grid cell accumulates the sum of the inverse height and inverse width ($1/h + 1/w$) of every net bounding box that covers it, and the congestion value is the maximum (or the average of the top $k$ values) over all grid cells, as in Fig.8 (a)(b).
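A minimal sketch of this RUDY-style estimate follows. It assumes that bounding boxes are given as inclusive grid-cell index ranges and that width and height are measured in grid cells (so a degenerate box still has $w, h \geq 1$); these conventions, and the names `rudy_map` and `congestion`, are assumptions made for illustration.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) as inclusive grid-cell indices (assumed convention).
BBox = Tuple[int, int, int, int]


def rudy_map(bboxes: List[BBox], n: int) -> List[List[float]]:
    """Accumulate 1/w + 1/h of every net bounding box into the cells it covers."""
    grid = [[0.0] * n for _ in range(n)]
    for x0, y0, x1, y1 in bboxes:
        w = x1 - x0 + 1  # width in grid cells
        h = y1 - y0 + 1  # height in grid cells
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                grid[x][y] += 1.0 / w + 1.0 / h
    return grid


def congestion(grid: List[List[float]], k: int = 1) -> float:
    """Maximum cell value (k=1) or the average of the k largest cell values."""
    values = sorted((v for row in grid for v in row), reverse=True)
    return sum(values[:k]) / k


# Two hypothetical net bounding boxes on a 3x3 grid.
grid = rudy_map([(0, 0, 1, 1), (1, 1, 1, 2)], n=3)
print(congestion(grid))       # 2.5, reached in cell (1, 1)
print(congestion(grid, k=2))  # 2.0
```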

**Density.** Density is a metric for reducing overlaps while avoiding the time-consuming evaluation of $O(V^2)$ constraints [1], so it is essentially an approximation. It is defined as the maximum, over all grid cells on the chip canvas, of the stackable coverage area ratio of a cell (area covered by several modules is counted once per module). For example, in Fig.8 (c) the stackable coverage area ratio of grid cell $g_{7,5}$ is 2.0 because two modules fully occupy it. However, a density below a small threshold is not a sufficient condition for the absence of overlap. Because our method guarantees no overlaps, we only consider density in evaluation. In practical chip design, HPWL is the optimization objective, whereas congestion and density are constraints.
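The sketch below illustrates the density definition and why a low density does not rule out overlap. Modules are assumed to be axis-aligned rectangles `(x, y, w, h)` on a canvas of unit grid cells; the function names and the example modules are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height), assumed layout


def density_map(modules: List[Rect], n: int, cell: float = 1.0) -> List[List[float]]:
    """Per-cell stackable coverage ratio: overlapping area counts once per module."""
    grid = [[0.0] * n for _ in range(n)]
    for mx, my, mw, mh in modules:
        for i in range(n):
            for j in range(n):
                ox = max(0.0, min(mx + mw, (i + 1) * cell) - max(mx, i * cell))
                oy = max(0.0, min(my + mh, (j + 1) * cell) - max(my, j * cell))
                grid[i][j] += ox * oy / (cell * cell)
    return grid


def density(modules: List[Rect], n: int) -> float:
    return max(max(row) for row in density_map(modules, n))


def rects_overlap(a: Rect, b: Rect) -> bool:
    return (a[0] < b[0] + b[2] and b[0] < a[0] + a[2]
            and a[1] < b[1] + b[3] and b[1] < a[1] + a[3])


# Two 2x2 modules that both fully cover cell (1, 1): density 2.0.
big = [(0, 0, 2, 2), (1, 1, 2, 2)]
print(density(big, n=4))  # 2.0

# Two small overlapping modules inside one cell: density is only 0.5,
# yet the modules overlap, so a low density is not sufficient.
small = [(0, 0, 0.5, 0.5), (0.25, 0.25, 0.5, 0.5)]
print(density(small, n=1), rects_overlap(*small))  # 0.5 True
```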

**Examples.** Fig.8 gives a set of placement results to explain the metrics. HPWL is the sum of the widths and heights of the net bounding boxes. Congestion (RUDY) is the maximum congestion value over grid cells $g_{i,j}$, where the value in each grid cell accumulates the reciprocals of the width and height of every net bounding box containing that cell. Placements (a) and (b) are from the same circuit, but (b) is better because it has lower HPWL and congestion. Density is the maximum density value over grid cells $g_{i,j}$, where the value in each grid cell is its stackable coverage area ratio. The density of Fig.8 (c) is 2.0 because $g_{7,5}$ is completely covered by two modules.

**Relationship between pin offset and HPWL.** The pin offset affects the HPWL. In graph-based methods, the input features of a module include its size $(M_w, M_h)$, position $(M_x, M_y)$, and type. The network can therefore hardly infer the real positions of pins and tends to use the module centers as pin positions instead. As a result, the agent aligns the centers of the two modules horizontally, producing a placement like Fig.9 (b) with wirelength 6. However, since the pins are near the bottom of the modules, it is better to align the bottoms of the two modules as in Fig.9 (c), which reduces the wirelength to 2.
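The effect can be reproduced with a small numerical example. The module sizes, pin offsets, and positions below are hypothetical and are not the ones drawn in Fig.9; they only show that aligning module centers and aligning pins can give different wirelengths.

```python
def pin_position(module_xy, offset):
    """Absolute pin position = module bottom-left corner + pin offset."""
    return (module_xy[0] + offset[0], module_xy[1] + offset[1])


def two_pin_hpwl(p, q):
    """HPWL of a net with exactly two pins."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


# Hypothetical modules: A is 2 wide and 6 high, B is 2 wide and 2 high.
# Both pins sit near the bottom of their module.
a_xy, a_offset = (0, 0), (2, 1)
b_offset = (0, 1)

# Center alignment: B's center (y = 3) matches A's center (y = 3).
b_center_aligned = (4, 2)
# Bottom alignment: B's bottom edge matches A's bottom edge.
b_bottom_aligned = (4, 0)

pin_a = pin_position(a_xy, a_offset)
wl_center = two_pin_hpwl(pin_a, pin_position(b_center_aligned, b_offset))
wl_bottom = two_pin_hpwl(pin_a, pin_position(b_bottom_aligned, b_offset))
print(wl_center, wl_bottom)  # 4 2
```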



Figure 9: **Explanation for module, pin and net.** (a) gives an example of pin offset information. Without pin offset information, the model tends to align the centers of the two modules horizontally as in (b), because it uses the module centers to estimate pin locations. When the pins are known to be located at the bottom of the modules, (c) is a better design.

### A.3 Algorithms

**Reward Computation.** The dense reward generation algorithm is shown in Algorithm 1. It can generate dense rewards without an efficiency decrease. For simplicity, we omit the calculation of the y dimension, which is the same as the x dimension.

Algorithm 1: Dense HPWL Reward Computation (omit y-dimension)

 $\begin{array}{l} \textbf{Data:} \text{ Placed position of module } M^t\ (M^t_x, M^t_y), \text{ max/min x/y coordinates of nets } MaxMinCoord, \\ \qquad \text{pin offsets } (\Delta_x^{(t,j)}, \Delta_y^{(t,j)}), \text{ pin to net connection } P_n^{(t,j)} \\ \textbf{Result:} \text{ Incremental HPWL reward } reward \\ reward \leftarrow 0; \\ \textbf{foreach } \Delta_x^{(t,j)}, P_n^{(t,j)} \text{ of all pins } P^{(t,j)} \text{ from } M^t \textbf{ do} \\ \quad x \leftarrow M_x^t + \Delta_x^{(t,j)}; \quad // \text{ calculate pin coordinates} \\ \quad \textbf{if } P_n^{(t,j)} \text{ not in } MaxMinCoord \textbf{ then} \\ \qquad // \text{ The net gets a definite pin location for the first time} \\ \qquad MaxMinCoord[P_n^{(t,j)}].x.max \leftarrow x; \\ \qquad MaxMinCoord[P_n^{(t,j)}].x.min \leftarrow x; \\ \quad \textbf{else} \\ \qquad // \text{ Update the bounding box range} \\ \qquad \textbf{if } MaxMinCoord[P_n^{(t,j)}].x.max < x \textbf{ then} \\ \qquad\quad reward \leftarrow reward + (x - MaxMinCoord[P_n^{(t,j)}].x.max); \\ \qquad\quad MaxMinCoord[P_n^{(t,j)}].x.max \leftarrow x; \\ \qquad \textbf{else if } MaxMinCoord[P_n^{(t,j)}].x.min > x \textbf{ then} \\ \qquad\quad reward \leftarrow reward + (MaxMinCoord[P_n^{(t,j)}].x.min - x); \\ \qquad\quad MaxMinCoord[P_n^{(t,j)}].x.min \leftarrow x; \\ \qquad \textbf{end} \\ \quad \textbf{end} \\ \textbf{end} \end{array}$
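Algorithm 1 can be sketched in Python as follows, with the y dimension included. The function name `dense_hpwl_increment` and the dictionary layout used for `max_min_coord` are assumptions made for this illustration. The function returns the accumulated quantity that Algorithm 1 calls *reward*; any sign convention or scaling applied to it afterwards is not shown.

```python
from typing import Dict, List, Tuple

Pin = Tuple[int, int, str]  # (offset_x, offset_y, net_id), assumed layout


def dense_hpwl_increment(module_xy: Tuple[int, int],
                         pins: List[Pin],
                         max_min_coord: Dict[str, dict]) -> int:
    """Return the HPWL increase caused by placing one module.

    max_min_coord maps net_id -> {"min": [x, y], "max": [x, y]} and is
    updated in place, as MaxMinCoord is in Algorithm 1.
    """
    mx, my = module_xy
    increment = 0
    for dx, dy, net in pins:
        coords = (mx + dx, my + dy)  # absolute pin coordinates
        if net not in max_min_coord:
            # First pin of this net with a definite location.
            max_min_coord[net] = {"min": list(coords), "max": list(coords)}
            continue
        box = max_min_coord[net]
        for axis, value in enumerate(coords):  # axis 0 is x, axis 1 is y
            if value > box["max"][axis]:
                increment += value - box["max"][axis]
                box["max"][axis] = value
            elif value < box["min"][axis]:
                increment += box["min"][axis] - value
                box["min"][axis] = value
    return increment


# Placing three hypothetical modules one after another.
boxes: Dict[str, dict] = {}
r1 = dense_hpwl_increment((0, 0), [(1, 1, "n1")], boxes)
r2 = dense_hpwl_increment((3, 2), [(0, 0, "n1"), (1, 1, "n2")], boxes)
r3 = dense_hpwl_increment((0, 0), [(0, 0, "n1")], boxes)
print(r1, r2, r3)  # 0 3 2
```

In this example the increments sum to 5, which equals the final HPWL of net `n1` (width 3 plus height 2), so the per-step values add up to the total HPWL.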

**Position Mask Generation.** The efficient position mask generation algorithm is in Algorithm 2.

Algorithm 2: Position Mask Generation

 $\begin{array}{l} \textbf{Data:} \text{ Width, height and position of the } t-1 \text{ placed modules } M^{1:t-1}\ (M_w^{1:t-1}, M_h^{1:t-1}, M_x^{1:t-1}, M_y^{1:t-1}) \\ \textbf{Result:} \text{ Position mask } f_p^t \text{ for module } M^t \\ f_p^t \leftarrow ones(N, N); \quad // \ ones(N, N) \text{ is the all-ones } N \times N \text{ matrix} \\ \textbf{for } i \leftarrow 1 \text{ to } t-1 \textbf{ do} \\ \quad tmp \leftarrow ones(N, N); \\ \quad // \text{ find positions that will cause } M^t \text{ and } M^i \text{ to overlap} \\ \quad tmp[M_x^i - M_w^t + 1 : M_x^i + M_w^i - 1,\ M_y^i - M_h^t + 1 : M_y^i + M_h^i - 1] \leftarrow 0; \\ \quad // \text{ exclude infeasible positions} \\ \quad f_p^t \leftarrow tmp \odot f_p^t; \quad // \odot \text{ is the element-wise product} \\ \textbf{end} \end{array}$
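A minimal pure-Python sketch of Algorithm 2 is given below. It assumes integer grid coordinates, inclusive index ranges, and that a module at position `(x, y)` occupies cells `x .. x + w - 1` and `y .. y + h - 1`. It zeroes the infeasible entries directly instead of building `tmp` and multiplying, which gives the same mask. The function name and tuple layout are hypothetical.

```python
from typing import List, Tuple

Placed = Tuple[int, int, int, int]  # (x, y, width, height) of a placed module


def position_mask(placed: List[Placed], w_t: int, h_t: int, n: int) -> List[List[int]]:
    """mask[x][y] == 1 iff putting the new module's corner at (x, y) overlaps nothing."""
    mask = [[1] * n for _ in range(n)]
    for x_i, y_i, w_i, h_i in placed:
        # Corner positions of the new module that would overlap module i,
        # clipped to the canvas.
        x_lo, x_hi = max(0, x_i - w_t + 1), min(n - 1, x_i + w_i - 1)
        y_lo, y_hi = max(0, y_i - h_t + 1), min(n - 1, y_i + h_i - 1)
        for x in range(x_lo, x_hi + 1):
            for y in range(y_lo, y_hi + 1):
                mask[x][y] = 0
    return mask


# One 2x2 module at (2, 2) on a 6x6 canvas; the new module is also 2x2.
mask = position_mask([(2, 2, 2, 2)], w_t=2, h_t=2, n=6)
print(sum(map(sum, mask)))  # 27 feasible positions (36 minus the 3x3 blocked region)
```

Note that, like Algorithm 2, this sketch only excludes overlaps with placed modules; it does not by itself exclude positions where the new module would extend past the canvas boundary.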




**Wire Mask Generation.** The efficient wire mask generation algorithm is shown in Algorithm 3. For simplicity, we omit the calculation of the y dimension, which is the same as the x dimension.

**Congestion Satisfaction.** The algorithm implemented in the congestion satisfaction block is shown in Algorithm 4.

Algorithm 3: Wire Mask Generation (omit y-dimension)

 $\begin{array}{l} \textbf{Data:} \text{ Hash map of max/min x/y coordinates of nets } MaxMinCoord, \text{ pin offsets } (\Delta_x^{(t,j)}, \Delta_y^{(t,j)}), \\ \qquad \text{pin to net connection } P_n^{(t,j)} \\ \textbf{Result:} \text{ Wire mask } f_w^t \text{ for module } M^t \\ f_w^t \leftarrow zeros(N,N); \\ // \text{ Accumulate the wirelength increase for each net} \\ \textbf{foreach } \Delta_x^{(t,j)}, P_n^{(t,j)} \text{ of all pins } P^{(t,j)} \text{ from } M^t \textbf{ do} \\ \quad // \text{ If the pin is to the left of the net bounding box} \\ \quad \textbf{for } i \leftarrow 0 \text{ to } MaxMinCoord[P_n^{(t,j)}].x.min + \Delta_x^{(t,j)} - 1 \textbf{ do} \\ \qquad f_w^t[i,:] \leftarrow f_w^t[i,:] + MaxMinCoord[P_n^{(t,j)}].x.min + \Delta_x^{(t,j)} - i; \\ \quad \textbf{end} \\ \quad // \text{ If the pin is to the right of the net bounding box} \\ \quad \textbf{for } i \leftarrow MaxMinCoord[P_n^{(t,j)}].x.max + \Delta_x^{(t,j)} + 1 \text{ to } N - 1 \textbf{ do} \\ \qquad f_w^t[i,:] \leftarrow f_w^t[i,:] + i - (MaxMinCoord[P_n^{(t,j)}].x.max + \Delta_x^{(t,j)}); \\ \quad \textbf{end} \\ \textbf{end} \end{array}$
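The idea behind the wire mask can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 3: it assumes the pin-coordinate convention of Algorithm 1 (a pin sits at module position plus offset), and it computes the increase for each candidate position directly from the pin coordinate rather than with the two loops above. The names `axis_increase` and `wire_mask` and the tuple layouts are hypothetical.

```python
from typing import Dict, List, Tuple

Pin = Tuple[int, int, str]        # (offset_x, offset_y, net_id)
Box = Tuple[int, int, int, int]   # (x_min, x_max, y_min, y_max) of a net


def axis_increase(entries: List[Tuple[int, int, int]], n: int) -> List[int]:
    """For one axis, HPWL increase for each candidate module coordinate 0..n-1.

    entries holds (pin offset, bounding-box min, bounding-box max) per pin.
    """
    inc = [0] * n
    for offset, lo, hi in entries:
        for i in range(n):
            pin = i + offset  # pin coordinate if the module is placed at i
            if pin < lo:
                inc[i] += lo - pin   # pin falls before the bounding box
            elif pin > hi:
                inc[i] += pin - hi   # pin falls after the bounding box
    return inc


def wire_mask(pins: List[Pin], max_min_coord: Dict[str, Box], n: int) -> List[List[int]]:
    """mask[x][y] = HPWL increase if the module corner is placed at (x, y)."""
    # Nets without any placed pin contribute no increase.
    known = [(dx, dy, max_min_coord[net]) for dx, dy, net in pins if net in max_min_coord]
    inc_x = axis_increase([(dx, b[0], b[1]) for dx, _, b in known], n)
    inc_y = axis_increase([(dy, b[2], b[3]) for _, dy, b in known], n)
    return [[inc_x[i] + inc_y[j] for j in range(n)] for i in range(n)]


# One pin with offset (1, 0) on a net whose bounding box spans x in [2, 3], y = 1.
mask = wire_mask([(1, 0, "a")], {"a": (2, 3, 1, 1)}, n=6)
print([row[1] for row in mask])  # [1, 0, 0, 1, 2, 3]
```

Because the x and y contributions are independent, the sketch computes one profile per axis and adds them, which corresponds to the row-wise updates `f_w^t[i,:]` in the x dimension.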

Algorithm 4: Placement with Congestion Constraint

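 $\begin{array}{l} \textbf{Data:} \text{ Trained placement agent } agent, \text{ expected congestion threshold } C_{th} \\ \textbf{Result:} \text{ A placement plan } [a_1, a_2, ..., a_V] \text{ that meets the congestion requirement} \\ \textbf{for } i \leftarrow 1 \text{ to } V \textbf{ do} \\ \quad \text{Choose } a_i \text{ from the probability matrix generated by the policy network } agent; \\ \quad Cong \leftarrow \text{ congestion matrix from the state after taking } a_i; \\ \quad \text{Compute congestion value } c \text{ from } Cong; \\ \quad \textbf{if } c > C_{th} \textbf{ then} \\ \qquad \text{Randomly sample } N \text{ different actions } a_i^{1:N} \text{ from the action space}; \\ \qquad \text{Compute } N \text{ congestion values } c_i^{1:N} \text{ from congestion metrics}; \\ \qquad \text{Get } N \text{ wirelength values } w_i^{1:N} \text{ from wire masks}; \\ \qquad \text{Sort the } N \text{ actions according to } w_i^{1:N} \text{ (as the 1st key) and } c_i^{1:N} \text{ (as the 2nd key)}; \\ \qquad flag \leftarrow False; \\ \qquad \textbf{for } j \leftarrow 1 \text{ to } N \textbf{ do} \\ \qquad\quad \textbf{if } c_i^j \leq C_{th} \textbf{ then} \\ \qquad\qquad flag \leftarrow True; \\ \qquad\qquad a_i \leftarrow a_i^j; \\ \qquad\qquad \textbf{break}; \\ \qquad\quad \textbf{end} \\ \qquad \textbf{end} \\ \qquad // \text{ If no sampled action satisfies the congestion threshold, choose the one with minimal congestion increase} \\ \qquad \textbf{if } flag \text{ is } False \textbf{ then} \\ \qquad\quad a_i \leftarrow \text{ the action } a_i^j \text{ with minimum } c_i^j; \\ \qquad \textbf{end} \\ \quad \textbf{end} \\ \quad \text{Take action } a_i \text{ as the final action}; \\ \textbf{end} \end{array}$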
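The per-step selection logic of Algorithm 4 can be sketched as below. The policy network, the congestion computation, and the wire mask are abstracted as the callables `congestion_of` and `wirelength_of`, and the sampled actions are passed in as `candidates`; all of these names are hypothetical stand-ins, and the sketch assumes `candidates` is non-empty.

```python
from typing import Callable, Hashable, Sequence


def pick_action(policy_action: Hashable,
                candidates: Sequence[Hashable],
                congestion_of: Callable[[Hashable], float],
                wirelength_of: Callable[[Hashable], float],
                c_th: float) -> Hashable:
    """Keep the policy's action if it meets the threshold, else fall back."""
    if congestion_of(policy_action) <= c_th:
        return policy_action
    # Sort sampled actions by wirelength (1st key) and congestion (2nd key).
    ranked = sorted(candidates, key=lambda a: (wirelength_of(a), congestion_of(a)))
    for action in ranked:
        if congestion_of(action) <= c_th:
            return action  # shortest-wirelength action that meets the threshold
    # No sampled action meets the threshold: take the least congested one.
    return min(candidates, key=congestion_of)


# Hypothetical congestion and wirelength values for a few actions.
cong = {"p": 1.5, "a": 2.0, "b": 0.8, "c": 0.5, "d": 1.2, "ok": 0.3}
wl = {"p": 0, "a": 1, "b": 2, "c": 3, "d": 0, "ok": 5}

print(pick_action("p", ["a", "b", "c"], cong.get, wl.get, c_th=1.0))  # b
print(pick_action("p", ["a", "d"], cong.get, wl.get, c_th=1.0))       # d
print(pick_action("ok", ["a", "b"], cong.get, wl.get, c_th=1.0))      # ok
```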

### A.4 Details of Model Architecture

The parameters of layers in model architecture are in Table 8. Also, the features used by pixel-level mask generation are in Table 9. The comparison of features for the placement order in different methods can be seen in Table 10.

Table 8: Model Architecture

| Block               | Layer     | Kernel Size  | Output shape  |
|---------------------|-----------|--------------|---------------|
| Local Mask Fusion   | Conv      | $1 \times 1$ | (224, 224, 8) |
|                     | Conv      | $1 \times 1$ | (224, 224, 8) |
|                     | Conv      | $1 \times 1$ | (224, 224, 1) |
| Global Mask Encoder | ResNet-18 | -            | 1000          |
|                     | FC        | -            | 768           |
| Global Mask Decoder | Deconv    | $3 \times 3$ | (14, 14, 8)   |
|                     | Deconv    | $3 \times 3$ | (28, 28, 4)   |
|                     | Deconv    | $3 \times 3$ | (56, 56, 2)   |
|                     | Deconv    | $3 \times 3$ | (112, 112, 1) |
|                     | Deconv    | $3 \times 3$ | (224, 224, 1) |
| Merge               | Conv      | $1 \times 1$ | (224, 224, 1) |
| Position Embedding  | -         | -            | 64            |
| FC for Value        | FC        | -            | 512           |
|                     | FC        | -            | 64            |
|                     | FC        | -            | 1             |

Table 9: State Features

| Module status | Index          | Feature               | Notation             | Dimension per module          |
|---------------|----------------|-----------------------|----------------------|-------------------------------|
| Placed        | $M^{1:t-1}$    | Width                 | $M_w$                | 1                             |
|               |                | Height                | $M_h$                | 1                             |
|               |                | Position              | $M_x, M_y$           | 2                             |
|               |                | Pin Offset            | $\Delta_x, \Delta_y$ | $2 \times \text{num of pins}$ |
|               |                | Pin to Net Connection | $P_n$                | num of pins                   |
| Unplaced      | $M^t, M^{t+1}$ | Width                 | $M_w$                | 1                             |
|               |                | Height                | $M_h$                | 1                             |
|               |                | Pin Offset            | $\Delta_x, \Delta_y$ | $2 \times \text{num of pins}$ |
|               |                | Pin to Net Connection | $P_n$                | num of pins                   |

Table 10: Features used for placement order

| Method              | Features for place order                                               |
|---------------------|------------------------------------------------------------------------|
| Graph Placement [3] | Topological order, Area                                                |
| DeepPR [22]         | None                                                                   |
| MaskPlace           | Number of nets, Area, Number of connected modules that have been placed |

### A.5 Training Configuration

The detailed configuration and hyperparameter settings of our model are in Table 11.

Table 11: Model Configuration

| Configuration            | Value             | Configuration      | Value                                         |
|--------------------------|-------------------|--------------------|-----------------------------------------------|
| Optimizer                | Adam              | Learning rate      | $2.5 \times 10^{-3}$                          |
| Total epoch              | 150               | Epoch for update   | 10                                            |
| Batch size               | 64                | Buffer capacity    | $10 \times \mathrm{num} \mathrm{~of~modules}$ |
| Clip $\epsilon$          | 0.2               | Clip gradient norm | 0.5                                           |
| Reward discount $\gamma$ | 0.95              | Num GPUs           | 1                                             |
| CPU                      | AMD Ryzen 9 5950X | GPU                | GeForce RTX 3090                              |

Also, we run DREAMPlace<sup>5</sup> [9], Graph Placement<sup>6</sup> [3], and DeepPR<sup>7</sup> [22] using their open-source code with default settings.

### A.6 Details of Benchmark

The detailed statistics of the benchmarks are in Table 12. Hard macros are the macros placed by the RL method in Graph Placement [3]; the remaining macros, also called soft macros, are placed by a classic optimization-based method. This distinction does not apply to our method, which places all macros by RL. The statistics for nets, pins, and area utilization are computed over macros. Ports are terminals connecting to an external circuit and are treated as fixed, zero-size modules. Our method also applies to circuits with ports without additional modification.

| Benchmark | Macros | Hard Macros | Standard Cells | Nets   | Pins   | Ports | Area Util(%) |
|-----------|--------|-------------|----------------|--------|--------|-------|--------------|
| adaptec1  | 543    | 63          | 210,904        | 3,709  | 4,768  | 0     | 55.62        |
| adaptec2  | 566    | 190         | 254,457        | 4,346           | 10,663 | 0            | 74.46 |
| adaptec3  | 723    | 201         | 450,927        | 6,252           | 11,521 | 0            | 61.51 |
| adaptec4  | 1,329  | 92          | 494,716        | 5,939           | 13,720 | 0            | 48.62 |
| bigblue1  | 560    | 32          | 277,604        | 657             | 1,897  | 0            | 31.58 |
| bigblue3  | 1,293  | 138         | 1,095,519      | 5,537           | 15,225 | 0            | 66.81 |
| ariane    | 932    | 134         | 0              | 12,404          | 22,802 | 1,231        | 78.39 |
| ibm01     | 246    | 246         | 12,506         | 908             | 1,928  | 246          | 61.94 |
| ibm02     | 280    | 272         | 19,321         | 602             | 1,466  | 259          | 64.63 |
| ibm03     | 290    | 290         | 22,846         | 614             | 1,237  | 283          | 57.97 |
| ibm04     | 608    | 296         | 26,899         | 1,512           | 3,167  | 287          | 54.88 |
| ibm06     | 178    | 178         | 32,320         | 83              | 175    | 166          | 54.77 |
| ibm07     | 507    | 292         | 45,419         | 2,471           | 5,992  | 287          | 46.03 |
| ibm08     | 309    | 302         | 51,000         | 1,725           | 3,721  | 286          | 47.13 |
| ibm09     | 253    | 56          | 53,142         | 446             | 898    | 285          | 44.52 |
| ibm10     | 786    | 56          | 68,643         | 2,160           | 4,720  | 744          | 61.40 |
| ibm11     | 373    | 56          | 70,185         | 682             | 1,371  | 406          | 41.40 |
| ibm12     | 651    | 205         | 70,425         | 1,589           | 3,468  | 637          | 53.85 |
| ibm13     | 424    | 100         | 83,775         | 804             | 1,669  | 490          | 39.43 |
| ibm14     | 614    | 91          | 146,991        | 1,620           | 3,960  | 517          | 22.49 |
| ibm15     | 393    | 22          | 161,177        | 748             | 1,521  | 383          | 28.89 |
| ibm16     | 458    | 37          | 183,026        | 1,755           | 3,981  | 504          | 39.46 |
| ibm17     | 760    | 107         | 184,735        | 2,055           | 4,366  | 743          | 19.11 |
| ibm18     | 285    | 285         | 210,328        | 727             | 1,600  | 272          | 11.09 |

Table 12: Statistics of different chip benchmarks.

<sup>5</sup>github.com/limbo018/DREAMPlace

<sup>6</sup>github.com/google-research/circuit\_training

<sup>7</sup>github.com/Thinklab-SJTU/EDA-AI

### A.7 Supplementary Experiment

**Placement w/o real size.** Considering that DeepPR ignores the module size, we implement MaskPlace in the same setting; the results are in Table 13. They show that our method has significant advantages.

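| Method      | adaptec1 | adaptec2 | adaptec3 | adaptec4 | bigblue1 | bigblue3 |
|-------------|----------|----------|----------|----------|----------|----------|
| DeepPR [22] | 5298     | 22256    | 32839    | 63560    | 8602     | 94083    |
| MaskPlace   | 2941     | 20593    | 18553    | 2331     | 27403    | 16181    |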

Table 13: Routing wirelength for macro placement w/o real size

**More benchmarks.** We also conducted experiments on the IBM benchmark suite (ICCAD 2004) [31], which has been used to evaluate placement for more than a decade. This benchmark suite comprises 18 chip designs with 178~786 macros and 12k~210k standard cells. We remove "ibm05" because it does not contain any macros. We use MaskPlace to place macros and then use DREAMPlace [9] to place standard cells. We compare our method with Graph Placement [3] and the simulated annealing baseline used in [3]. The results are in Table 14.

Table 14: Comparisons of HPWL ( $\times 10^5$ ) for macro and standard cell placement in ibm benchmark.

| Method                   | ibm01  | ibm02  | ibm03  | ibm04  | ibm05  | ibm06  |
|--------------------------|--------|--------|--------|--------|--------|--------|
| Graph Placement [3]      | 31.71  | 55.12  | 80.00  | 86.86  | -      | 63.48  |
| Simulated Annealing [3]  | 25.85  | 54.87  | 80.68  | 83.32  | -      | 69.09  |
| MaskPlace+DREAMPlace [9] | 24.18  | 47.45  | 71.37  | 78.76  | -      | 55.70  |
| Method                   | ibm07  | ibm08  | ibm09  | ibm10  | ibm11  | ibm12  |
| Graph Placement [3]      | 117.71 | 134.77 | 148.74 | 440.78 | 218.73 | 438.57 |
| Simulated Annealing [3]  | 117.71 | 144.89 | 141.67 | 463.04 | 228.79 | 435.77 |
| MaskPlace+DREAMPlace [9] | 95.27  | 120.64 | 122.91 | 367.55 | 202.23 | 397.25 |
| Method                   | ibm13  | ibm14  | ibm15  | ibm16  | ibm17  | ibm18  |
| Graph Placement [3]      | 278.93 | 455.31 | 520.06 | 642.08 | 814.37 | 450.67 |
| Simulated Annealing [3]  | 259.89 | 405.80 | 510.06 | 614.54 | 720.40 | 442.00 |
| MaskPlace+DREAMPlace [9] | 246.49 | 302.67 | 457.86 | 584.67 | 643.75 | 398.83 |

## B Related Work

**Classic optimization-based methods.** Optimization has been the dominant approach to placement for decades. These methods can be divided into three categories: partitioning-based methods [4, 5], simulated annealing methods [10, 11], and analytical methods [6–9, 12–21].

Partitioning-based methods [4, 5] cluster the whole circuit into several parts so as to minimize the connections between parts. Following the divide-and-conquer idea, they first place the modules within each part and then arrange the parts at suitable positions on the chip. However, the modules in each part are optimized in isolation, and it is sometimes hard to cut a circuit into relatively independent parts, since this depends heavily on the topology of the circuit.

Simulated Annealing (SA) methods [10, 11] are a widely used family of iterative heuristics for combinatorial optimization and can be seen as a stochastic variant of hill climbing. SA starts from a random state and searches for a low-cost state by moving from the current state to a neighboring state. If the metrics of the neighbor are better than those of the current state, it moves to the neighbor. Otherwise, the move may still be taken with a probability that decreases over time. The advantage is that SA can be used when the metrics have no analytical form or are not differentiable. However, it is not efficient enough, and the placement results depend heavily on the random initial state.
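The acceptance rule described above can be illustrated with a generic sketch. It is not the SA placer of [10, 11]: the exponential acceptance probability, the geometric cooling schedule, and the toy cost function are common textbook choices assumed here for illustration only.

```python
import math
import random


def simulated_annealing(cost, neighbor, state, t0=1.0, cooling=0.95, steps=200, seed=0):
    """Generic SA loop; returns the best state seen and its cost."""
    rng = random.Random(seed)
    current_cost = cost(state)
    best, best_cost = state, current_cost
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(state, rng)
        candidate_cost = cost(candidate)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature decreases.
        if (candidate_cost <= current_cost
                or rng.random() < math.exp((current_cost - candidate_cost) / temperature)):
            state, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = state, current_cost
        temperature *= cooling
    return best, best_cost


# Toy problem: minimize (x - 3)^2 over the integers, starting from x = 20.
best, best_cost = simulated_annealing(
    cost=lambda x: (x - 3) ** 2,
    neighbor=lambda x, rng: x + rng.choice((-1, 1)),
    state=20,
)
print(best, best_cost)
```

Only the cost function is evaluated, never its gradient, which is why this scheme applies to metrics without an analytical or differentiable form.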

Analytical methods are gradually replacing the above two families because they perform best. They can be divided into quadratic methods [12–18] and nonlinear (non-quadratic) methods [6–9, 19–21]. Quadratic methods [12–18] transform the placement problem into a sequence of convex quadratic problems, for which there are well-established solvers. However, this is a very rough approximation. Nonlinear methods [6–9, 19–21] design a single differentiable objective function and optimize it. The advantage is that they can handle large numbers of modules. However, the objective function is still an approximation, and these methods cannot avoid overlaps when combining multiple metrics in one objective function. Methods in this category achieve the highest placement quality among traditional methods [9].

**Learning-based methods.** With the development of deep learning, some learning-based approaches [11, 37–39] have been proposed to assist the traditional methods. Huang et al. [37] use convolutional neural networks to estimate congestion for SA placement. Vashisht et al. [11] use reinforcement learning models to generate the initial placement for SA. Kirby et al. [38] and Agnesina et al. [39] use reinforcement learning to help classic placement tools choose the most suitable hyperparameters. However, these methods do not perform end-to-end placement with deep learning, so the placement results depend heavily on the classic methods.

Pure reinforcement learning methods [3, 22, 23, 40] view placement as a process of placing modules sequentially. Mirhoseini et al. [3] use reinforcement learning to place hard macros and the force-directed method [18] to place the remaining soft macros. Jiang et al. [23] build on Graph Placement [3] and replace the force-directed method with DREAMPlace [9] to place soft macros. Cheng and Yan [22] propose a reinforcement learning method that uses wirelength as the reward. Chang et al. [40] put all metrics in the RL reward. What these methods have in common is that they convert the circuit into a graph structure and feed it to graph neural networks [41]. However, the pin information is lost in this conversion, leading to sub-optimal placement. Also, they cannot avoid overlaps because of the reduction in search space. These methods still have room for improvement in terms of realistic chip placement. For instance, DeepPR [22] ignores the realistic size of the modules, although module sizes vary widely in most circuits. It also proposes to use routing wirelength instead of HPWL as the reward, but this affects efficiency and leads to a sparse reward, making models difficult to train. In contrast, HPWL is a high-quality wirelength estimate, and we do not need to discard this inherently dense reward.