# Collaborative Human-AI Moderation of Toxic Comments

The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. Currently, content moderation is often performed by a collaboration between humans and machine learning models. However, it is not well understood how to design the collaborative process so as to maximize the combined moderator-model system performance. This benchmark presents a rigorous study of this problem, focusing on an approach that incorporates model uncertainty into the collaborative process. 

To reference this work, please cite:

```none
@inproceedings{kivlichan-etal-2021-measuring,
    title = "Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation",
    author = "Kivlichan, Ian  and
      Lin, Zi  and
      Liu, Jeremiah  and
      Vasserman, Lucy",
    booktitle = "Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.woah-1.5",
    doi = "10.18653/v1/2021.woah-1.5",
    pages = "36--53"}
```

## Test Performance on Wikipedia Toxicity

The table below shows the predictive and uncertainty performance on the held-out dataset of [Wikipedia Toxicity](https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes). All models are based on BERT-base.

| Method | AUROC/AUPRC/Acc | ECE/Brier Score | Calib AUROC (u/t) | Calib AUPRC (u/t) |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [Deterministic](deterministic.py)                          | 0.9734/0.8019/0.9231 | 0.0245/0.0548 | 0.9230/0.9175 | 0.4053/0.3032 |
| [SNGP](sngp.py)                                            | 0.9741/0.8029/0.9233 | 0.0280/0.0548 | 0.9238/0.9171 | 0.4063/0.3019 |
| [Monte Carlo Dropout](dropout.py)                          | 0.9729/0.8006/0.9274 | 0.0198/0.0508 | 0.9282/0.9179 | 0.4020/0.2929 |
| [Ensemble (size=10)](ensemble.py)<sup>1</sup>              | 0.9738/0.8074/0.9231 | 0.0235/0.0544 | 0.9245/0.9172 | 0.4045/0.3025 |
| [SNGP Ensemble (size=10)](sngp_ensemble.py)                | 0.9741/0.8045/0.9226 | 0.0281/0.0549 | 0.9249/0.9170 | 0.4158/0.3034 |
| [Deterministic + Focal Loss](deterministic.py)<sup>2</sup> | 0.9730/0.8036/0.9476 | 0.1486/0.0628 | 0.9405/0.9123 | 0.3804/0.2223 |
| [SNGP + Focal Loss](sngp.py)                               | 0.9736/0.8076/0.9455 | 0.0076/0.0388 | 0.9385/0.9142 | 0.3885/0.2319 |
| [Monte Carlo Dropout + Focal Loss](dropout.py)             | 0.9741/0.8076/0.9472 | 0.1442/0.0622 | 0.9425/0.9146 | 0.3890/0.2277 |
| [Ensemble + Focal Loss (size=10)](ensemble.py)             | 0.9735/0.8077/0.9479 | 0.1536/0.0639 | 0.9418/0.9126 | 0.3840/0.2212 |
| [SNGP Ensemble + Focal Loss (size=10)](sngp_ensemble.py)   | 0.9742/0.8122/0.9467 | 0.0075/0.0379 | 0.9400/0.9140 | 0.3846/0.2271 |

| Method | CollabAcc (Uncertainty) | AbstainPrec (Uncertainty) | AbstainRecall (Uncertainty) | CollabAUROC (Uncertainty) | CollabAUPRC (Uncertainty) |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| Deterministic                        | 0.9236/0.9256/0.9281/0.9329/0.9459/0.9640/0.9781/0.9884 | 0.4923/0.4923/0.4993/0.4904/0.4556/0.4087/0.3665/0.3267 | 0.0064/0.0320/0.0649/0.1275/0.2963/0.5315/0.7150/0.8497 | 0.9736/0.9746/0.9757/0.9780/0.9835/0.9904/0.9947/0.9973 | 0.8034/0.8095/0.8158/0.8301/0.8645/0.9127/0.9470/0.9697 |
| SNGP                                 | 0.9238/0.9258/0.9281/0.9325/0.9460/0.9657/0.9792/0.9891 | 0.5000/0.5000/0.4781/0.4621/0.4545/0.4245/0.3728/0.3291 | 0.0065/0.0325/0.0622/0.1204/0.2961/0.5532/0.7289/0.8577 | 0.9743/0.9752/0.9763/0.9784/0.9839/0.9904/0.9946/0.9969 | 0.8041/0.8092/0.8157/0.8278/0.8623/0.9085/0.9430/0.9639 |
| Monte Carlo Dropout                  | 0.9278/0.9295/0.9318/0.9366/0.9498/0.9689/0.9826/0.9912 | 0.4273/0.4273/0.4482/0.4636/0.4481/0.4151/0.3683/0.3191 | 0.0058/0.0294/0.0616/0.1276/0.3084/0.5715/0.7607/0.8786 | 0.9732/0.9743/0.9755/0.9778/0.9838/0.9911/0.9954/0.9977 | 0.8024/0.8093/0.8173/0.8314/0.8706/0.9232/0.9575/0.9767 |
| Ensemble (size=10)                   | 0.9235/0.9253/0.9277/0.9326/0.9460/0.9645/0.9785/0.9890 | 0.4438/0.4438/0.4578/0.4757/0.4577/0.4140/0.3693/0.3294 | 0.0057/0.0288/0.0594/0.1236/0.2975/0.5382/0.7202/0.8565 | 0.9740/0.9750/0.9760/0.9782/0.9838/0.9905/0.9949/0.9974 | 0.8089/0.8148/0.8206/0.8334/0.8701/0.9176/0.9526/0.9741 |
| SNGP Ensemble (size=10)              | 0.9231/0.9252/0.9278/0.9324/0.9456/0.9653/0.9791/0.9893 | 0.5078/0.5078/0.5148/0.4887/0.4607/0.4271/0.3769/0.3332 | 0.0065/0.0328/0.0664/0.1262/0.2976/0.5518/0.7305/0.8611 | 0.9743/0.9752/0.9763/0.9784/0.9838/0.9904/0.9946/0.9969 | 0.8057/0.8105/0.8165/0.8291/0.8636/0.9106/0.9447/0.9655 |
| Deterministic + Focal Loss           | 0.9480/0.9499/0.9522/0.9568/0.9692/0.9829/0.9915/0.9962 | 0.4664/0.4664/0.4664/0.4626/0.4332/0.3533/0.2928/0.2432 | 0.0088/0.0444/0.0889/0.1764/0.4132/0.6738/0.8379/0.9279 | 0.9733/0.9744/0.9758/0.9784/0.9848/0.9922/0.9965/0.9986 | 0.8064/0.8173/0.8298/0.8523/0.9032/0.9531/0.9795/0.9917 |
| SNGP + Focal Loss                    | 0.9460/0.9480/0.9505/0.9552/0.9670/0.9810/0.9903/0.9957 | 0.4917/0.4971/0.4934/0.4855/0.4285/0.3549/0.2986/0.2508 | 0.0090/0.0456/0.0905/0.1782/0.3934/0.6515/0.8224/0.9209 | 0.9739/0.9751/0.9765/0.9792/0.9859/0.9931/0.9969/0.9987 | 0.8101/0.8196/0.8305/0.8515/0.9011/0.9521/0.9789/0.9919 |
| Monte Carlo Dropout + Focal Loss     | 0.9477/0.9497/0.9521/0.9569/0.9691/0.9834/0.9918/0.9966 | 0.4879/0.4879/0.4879/0.4836/0.4370/0.3622/0.2974/0.2470 | 0.0092/0.0462/0.0923/0.1831/0.4140/0.6862/0.8452/0.9361 | 0.9744/0.9755/0.9768/0.9792/0.9850/0.9922/0.9966/0.9987 | 0.8107/0.8220/0.8347/0.8570/0.9056/0.9538/0.9802/0.9924 |
| Ensemble + Focal Loss (size=10)      | 0.9484/0.9502/0.9525/0.9572/0.9695/0.9835/0.9919/0.9966 | 0.4643/0.4643/0.4643/0.4626/0.4324/0.3560/0.2935/0.2434 | 0.0088/0.0445/0.0890/0.1775/0.4150/0.6833/0.8453/0.9346 | 0.9738/0.9750/0.9764/0.9790/0.9853/0.9925/0.9967/0.9987 | 0.8104/0.8210/0.8334/0.8557/0.9057/0.9551/0.9807/0.9925 |
| SNGP Ensemble + Focal Loss (size=10) | 0.9473/0.9494/0.9518/0.9559/0.9677/0.9824/0.9913/0.9961 | 0.5378/0.5218/0.5056/0.4595/0.4199/0.3566/0.2968/0.2467 | 0.0100/0.0489/0.0948/0.1725/0.3942/0.6696/0.8361/0.9264 | 0.9745/0.9757/0.9771/0.9798/0.9863/0.9934/0.9971/0.9989 | 0.8146/0.8239/0.8347/0.8554/0.9046/0.9553/0.9813/0.9928 |

| Method | CollabAcc (Toxicity) | AbstainPrec (Toxicity) | AbstainRecall (Toxicity) | CollabAUROC (Toxicity) | CollabAUPRC (Toxicity) |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| Deterministic                        | 0.9231/0.9232/0.9233/0.9237/0.9292/0.9530/0.9887/0.9960 | 0.0027/0.0175/0.0193/0.0311/0.1213/0.2990/0.4369/0.3643 | 0.0000/0.0011/0.0025/0.0081/0.0789/0.3889/0.8525/0.9476 | 0.9734/0.9734/0.9735/0.9740/0.9779/0.9876/0.9945/0.9979 | 0.8020/0.8036/0.8057/0.8143/0.8587/0.9282/0.9695/0.9890 |
| SNGP                                 | 0.9233/0.9233/0.9234/0.9239/0.9291/0.9521/0.9883/0.9964 | 0.0050/0.0050/0.0081/0.0318/0.1159/0.2884/0.4332/0.3655 | 0.0001/0.0003/0.0011/0.0083/0.0755/0.3759/0.8469/0.9528 | 0.9741/0.9741/0.9742/0.9747/0.9785/0.9878/0.9946/0.9979 | 0.8030/0.8033/0.8043/0.8139/0.8627/0.9304/0.9709/0.9899 |
| Monte Carlo Dropout                  | 0.9274/0.9274/0.9276/0.9280/0.9336/0.9571/0.9892/0.9960 | 0.0151/0.0151/0.0185/0.0316/0.1251/0.2972/0.4122/0.3431 | 0.0002/0.0010/0.0025/0.0087/0.0861/0.4092/0.8514/0.9447 | 0.9729/0.9730/0.9731/0.9735/0.9776/0.9872/0.9943/0.9978 | 0.8009/0.8022/0.8044/0.8138/0.8574/0.9262/0.9689/0.9887 |
| Ensemble (size=10)                   | 0.9231/0.9231/0.9232/0.9237/0.9290/0.9523/0.9884/0.9962 | 0.0027/0.0027/0.0121/0.0327/0.1187/0.2927/0.4354/0.3656 | 0.0000/0.0002/0.0016/0.0085/0.0771/0.3805/0.8491/0.9507 | 0.9738/0.9738/0.9739/0.9744/0.9782/0.9877/0.9945/0.9978 | 0.8075/0.8077/0.8101/0.8218/0.8608/0.9291/0.9702/0.9893 |
| SNGP Ensemble (size=10)              | 0.9226/0.9227/0.9227/0.9232/0.9285/0.9515/0.9876/0.9963 | 0.0125/0.0125/0.0125/0.0296/0.1176/0.2889/0.4332/0.3682 | 0.0002/0.0008/0.0016/0.0076/0.0760/0.3733/0.8397/0.9517 | 0.9741/0.9742/0.9742/0.9747/0.9785/0.9878/0.9946/0.9978 | 0.8047/0.8056/0.8067/0.8150/0.8634/0.9304/0.9710/0.9901 |
| Deterministic + Focal Loss           | 0.9476/0.9476/0.9477/0.9484/0.9539/0.9752/0.9892/0.9959 | 0.0000/0.0083/0.0168/0.0393/0.1261/0.2760/0.2776/0.2415 | 0.0000/0.0008/0.0032/0.0150/0.1202/0.5265/0.7943/0.9212 | 0.9730/0.9730/0.9731/0.9737/0.9776/0.9872/0.9943/0.9979 | 0.8036/0.8049/0.8075/0.8163/0.8576/0.9266/0.9679/0.9882 |
| SNGP + Focal Loss                    | 0.9455/0.9455/0.9456/0.9461/0.9516/0.9750/0.9891/0.9960 | 0.0000/0.0032/0.0078/0.0298/0.1204/0.2947/0.2907/0.2525 | 0.0000/0.0003/0.0014/0.0109/0.1105/0.5410/0.8004/0.9270 | 0.9736/0.9736/0.9736/0.9741/0.9780/0.9877/0.9946/0.9980 | 0.8077/0.8084/0.8098/0.8174/0.8590/0.9282/0.9699/0.9892 |
| Monte Carlo Dropout + Focal Loss     | 0.9472/0.9473/0.9474/0.9481/0.9533/0.9759/0.9896/0.9961 | 0.0159/0.0088/0.0192/0.0452/0.1213/0.2871/0.2822/0.2444 | 0.0003/0.0008/0.0036/0.0171/0.1149/0.5439/0.8021/0.9259 | 0.9741/0.9742/0.9743/0.9750/0.9787/0.9879/0.9947/0.9981 | 0.8086/0.8095/0.8127/0.8227/0.8625/0.9301/0.9701/0.9895 |
| Ensemble + Focal Loss (size=10)      | 0.9479/0.9479/0.9480/0.9486/0.9541/0.9753/0.9893/0.9960 | 0.0000/0.0032/0.0047/0.0358/0.1241/0.2736/0.2756/0.2404 | 0.0000/0.0003/0.0009/0.0137/0.1191/0.5251/0.7937/0.9228 | 0.9735/0.9736/0.9736/0.9742/0.9781/0.9877/0.9946/0.9980 | 0.8077/0.8082/0.8088/0.8185/0.8595/0.9282/0.9694/0.9890 |
| SNGP Ensemble + Focal Loss (size=10) | 0.9467/0.9468/0.9468/0.9474/0.9527/0.9757/0.9895/0.9963 | 0.0000/0.0032/0.0034/0.0320/0.1199/0.2893/0.2850/0.2477 | 0.0000/0.0003/0.0006/0.0120/0.1125/0.5431/0.8028/0.9302 | 0.9742/0.9742/0.9742/0.9748/0.9786/0.9880/0.9948/0.9982 | 0.8122/0.8127/0.8131/0.8218/0.8619/0.9293/0.9706/0.9898 |

## Transfer Learning Performance on CivilComments

In practice, it is common for a trained toxicity detection model to be deployed
in a noisy environment with greater topical diversity and distribution shift.
To approximate this, we evaluate the performance of a WikipediaToxicity-trained
model on the [CivilComments](https://www.tensorflow.org/datasets/catalog/civil_comments)
dataset.
Compared to WikipediaToxicity (which contains conversations between Wikipedia
editors), CivilComments is a much more diverse and noisy dataset that aggregates
comments from approximately 50 English-language news sites across the world.

| Method | AUROC/AUPRC/Acc | ECE/Brier Score | Calib AUROC (u/t) | Calib AUPRC (u/t) |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| Deterministic                        | 0.7796/0.6689/0.9628 | 0.0128/0.0246 | 0.9412/0.9649 | 0.3581/0.4429 |
| SNGP                                 | 0.7695/0.6665/0.9640 | 0.0070/0.0253 | 0.9457/0.9651 | 0.3660/0.4348 |
| Monte Carlo Dropout                  | 0.7806/0.6727/0.9671 | 0.0136/0.0241 | 0.9502/0.9631 | 0.3707/0.4103 |
| Ensemble (size=10)                   | 0.7849/0.6741/0.9625 | 0.0141/0.0242 | 0.9420/0.9660 | 0.3484/0.4453 |
| SNGP Ensemble (size=10)              | 0.7749/0.6719/0.9633 | 0.0076/0.0248 | 0.9463/0.9661 | 0.3655/0.4386 |
| Deterministic + Focal Loss           | 0.8013/0.6766/0.9795 | 0.1973/0.0377 | 0.9444/0.9427 | 0.3018/0.2418 |
| SNGP + Focal Loss                    | 0.8003/0.6820/0.9784 | 0.0182/0.0264 | 0.9465/0.9465 | 0.3181/0.2639 |
| Monte Carlo Dropout + Focal Loss     | 0.8009/0.6790/0.9790 | 0.1896/0.0360 | 0.9481/0.9470 | 0.3185/0.2650 |
| Ensemble + Focal Loss (size=10)      | 0.8041/0.6814/0.9795 | 0.1998/0.0381 | 0.9461/0.9444 | 0.3035/0.2444 |
| SNGP Ensemble + Focal Loss (size=10) | 0.8002/0.6827/0.9790 | 0.0176/0.0266 | 0.9481/0.9471 | 0.3212/0.2571 |

| Method | CollabAcc (Uncertainty) | AbstainPrec (Uncertainty) | AbstainRecall (Uncertainty) | CollabAUROC (Uncertainty) | CollabAUPRC (Uncertainty) |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| Deterministic                        | 0.9634/0.9653/0.9679/0.9720/0.9814/0.9907/0.9953/0.9975 | 0.5105/0.5003/0.5006/0.4565/0.3703/0.2781/0.2161/0.1734 | 0.0137/0.0673/0.1346/0.2457/0.4984/0.7487/0.8725/0.9336 | 0.7797/0.7804/0.7814/0.7833/0.7895/0.8011/0.8148/0.8291 | 0.6698/0.6731/0.6772/0.6845/0.7043/0.7340/0.7632/0.7906 |
| SNGP                                 | 0.9645/0.9664/0.9688/0.9731/0.9828/0.9922/0.9963/0.9979 | 0.5092/0.4950/0.4880/0.4556/0.3762/0.2822/0.2153/0.1696 | 0.0141/0.0686/0.1353/0.2528/0.5220/0.7830/0.8962/0.9412 | 0.7697/0.7704/0.7712/0.7730/0.7786/0.7891/0.8019/0.8152 | 0.6674/0.6706/0.6743/0.6814/0.7005/0.7286/0.7574/0.7842 |
| Monte Carlo Dropout                  | 0.9676/0.9696/0.9720/0.9764/0.9856/0.9940/0.9971/0.9983 | 0.5142/0.5063/0.4888/0.4624/0.3707/0.2691/0.2000/0.1559 | 0.0156/0.0769/0.1485/0.2811/0.5636/0.8183/0.9123/0.9485 | 0.7808/0.7814/0.7822/0.7836/0.7888/0.7997/0.8133/0.8281 | 0.6735/0.6764/0.6798/0.6857/0.7031/0.7316/0.7617/0.7904 |
| Ensemble (size=10)                   | 0.9629/0.9647/0.9671/0.9714/0.9810/0.9905/0.9952/0.9977 | 0.4433/0.4485/0.4588/0.4465/0.3702/0.2800/0.2182/0.1761 | 0.0118/0.0597/0.1222/0.2379/0.4934/0.7463/0.8724/0.9385 | 0.7851/0.7859/0.7869/0.7888/0.7949/0.8065/0.8201/0.8341 | 0.6750/0.6785/0.6827/0.6902/0.7098/0.7394/0.7690/0.7962 |
| SNGP Ensemble (size=10)              | 0.9638/0.9656/0.9681/0.9723/0.9824/0.9919/0.9962/0.9980 | 0.4535/0.4600/0.4782/0.4495/0.3816/0.2856/0.2194/0.1732 | 0.0123/0.0627/0.1303/0.2450/0.5202/0.7787/0.8971/0.9444 | 0.7751/0.7759/0.7769/0.7786/0.7840/0.7946/0.8074/0.8206 | 0.6728/0.6764/0.6804/0.6873/0.7052/0.7334/0.7624/0.7888 |
| Deterministic + Focal Loss           | 0.9800/0.9818/0.9840/0.9872/0.9924/0.9961/0.9978/0.9986 | 0.4741/0.4740/0.4504/0.3840/0.2589/0.1664/0.1220/0.0958 | 0.0230/0.1154/0.2194/0.3742/0.6310/0.8109/0.8916/0.9335 | 0.8013/0.8014/0.8016/0.8021/0.8071/0.8215/0.8398/0.8591 | 0.6768/0.6773/0.6783/0.6807/0.6978/0.7335/0.7694/0.8022 |
| SNGP + Focal Loss                    | 0.9789/0.9809/0.9829/0.9862/0.9918/0.9961/0.9978/0.9986 | 0.5064/0.4929/0.4479/0.3902/0.2672/0.1765/0.1293/0.1009 | 0.0234/0.1142/0.2075/0.3618/0.6195/0.8184/0.8990/0.9360 | 0.8004/0.8006/0.8009/0.8018/0.8064/0.8185/0.8346/0.8513 | 0.6823/0.6835/0.6851/0.6893/0.7051/0.7371/0.7713/0.8024 |
| Monte Carlo Dropout + Focal Loss     | 0.9795/0.9813/0.9836/0.9871/0.9924/0.9963/0.9979/0.9987 | 0.4637/0.4638/0.4593/0.4020/0.2675/0.1732/0.1257/0.0983 | 0.0220/0.1104/0.2187/0.3831/0.6375/0.8257/0.8987/0.9371 | 0.8009/0.8010/0.8011/0.8016/0.8062/0.8203/0.8384/0.8574 | 0.6791/0.6797/0.6802/0.6827/0.6985/0.7337/0.7694/0.8015 |
| Ensemble + Focal Loss (size=10)      | 0.9800/0.9819/0.9839/0.9872/0.9925/0.9961/0.9979/0.9987 | 0.4836/0.4836/0.4469/0.3849/0.2614/0.1668/0.1227/0.0962 | 0.0235/0.1177/0.2175/0.3749/0.6368/0.8123/0.8965/0.9374 | 0.8041/0.8042/0.8043/0.8048/0.8097/0.8236/0.8415/0.8607 | 0.6815/0.6820/0.6828/0.6851/0.7018/0.7366/0.7719/0.8047 |
| SNGP Ensemble + Focal Loss (size=10) | 0.9795/0.9814/0.9835/0.9869/0.9922/0.9963/0.9980/0.9987 | 0.5147/0.4933/0.4587/0.3963/0.2645/0.1730/0.1268/0.0987 | 0.0244/0.1171/0.2178/0.3766/0.6286/0.8222/0.9040/0.9383 | 0.8002/0.8004/0.8007/0.8014/0.8059/0.8182/0.8345/0.8518 | 0.6830/0.6843/0.6858/0.6888/0.7046/0.7366/0.7707/0.8022 |

| Method | CollabAcc (Toxicity) | AbstainPrec (Toxicity) | AbstainRecall (Toxicity) | CollabAUROC (Toxicity) | CollabAUPRC (Toxicity) |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| Deterministic                        | 0.9629/0.9636/0.9654/0.9712/0.9924/0.9962/0.9978/0.9985 | 0.0722/0.1439/0.2597/0.4154/0.5907/0.3333/0.2333/0.1783 | 0.0019/0.0194/0.0698/0.2236/0.7950/0.8973/0.9420/0.9601 | 0.7796/0.7796/0.7796/0.7800/0.7843/0.7965/0.8117/0.8272 | 0.6689/0.6690/0.6695/0.6714/0.6885/0.7229/0.7569/0.7872 |
| SNGP                                 | 0.9640/0.9647/0.9664/0.9721/0.9924/0.9963/0.9980/0.9987 | 0.0637/0.1557/0.2394/0.4071/0.5682/0.3230/0.2269/0.1736 | 0.0018/0.0216/0.0664/0.2258/0.7883/0.8963/0.9443/0.9636 | 0.7695/0.7695/0.7696/0.7700/0.7738/0.7853/0.7993/0.8136 | 0.6665/0.6665/0.6670/0.6692/0.6850/0.7191/0.7520/0.7812 |
| Monte Carlo Dropout                  | 0.9672/0.9679/0.9696/0.9754/0.9926/0.9963/0.9979/0.9987 | 0.0619/0.1649/0.2513/0.4154/0.5097/0.2917/0.2053/0.1579 | 0.0019/0.0251/0.0764/0.2525/0.7750/0.8870/0.9363/0.9604 | 0.7806/0.7806/0.7807/0.7812/0.7848/0.7969/0.8118/0.8271 | 0.6727/0.6729/0.6735/0.6760/0.6904/0.7248/0.7585/0.7886 |
| Ensemble (size=10)                   | 0.9625/0.9632/0.9650/0.9707/0.9925/0.9962/0.9979/0.9987 | 0.0619/0.1432/0.2513/0.4123/0.6006/0.3368/0.2361/0.1810 | 0.0016/0.0191/0.0669/0.2197/0.8004/0.8977/0.9438/0.9649 | 0.7849/0.7849/0.7850/0.7853/0.7895/0.8017/0.8168/0.8322 | 0.6741/0.6742/0.6746/0.6765/0.6932/0.7280/0.7621/0.7927 |
| SNGP Ensemble (size=10)              | 0.9634/0.9640/0.9657/0.9713/0.9926/0.9964/0.9980/0.9987 | 0.0515/0.1404/0.2362/0.4002/0.5862/0.3305/0.2314/0.1770 | 0.0014/0.0191/0.0643/0.2182/0.7991/0.9010/0.9465/0.9653 | 0.7749/0.7749/0.7750/0.7753/0.7792/0.7906/0.8047/0.8190 | 0.6719/0.6719/0.6723/0.6741/0.6900/0.7237/0.7565/0.7857 |
| Deterministic + Focal Loss           | 0.9795/0.9802/0.9822/0.9864/0.9924/0.9961/0.9978/0.9986 | 0.0206/0.1417/0.2695/0.3479/0.2578/0.1659/0.1219/0.0957 | 0.0010/0.0345/0.1313/0.3390/0.6282/0.8088/0.8913/0.9331 | 0.8013/0.8013/0.8014/0.8018/0.8067/0.8212/0.8396/0.8590 | 0.6766/0.6767/0.6772/0.6794/0.6967/0.7329/0.7691/0.8021 |
| SNGP + Focal Loss                    | 0.9785/0.9794/0.9812/0.9861/0.9921/0.9962/0.9979/0.9986 | 0.0790/0.1914/0.2774/0.3827/0.2742/0.1779/0.1296/0.1010 | 0.0037/0.0443/0.1285/0.3548/0.6356/0.8251/0.9014/0.9367 | 0.8003/0.8003/0.8004/0.8009/0.8052/0.8178/0.8343/0.8512 | 0.6820/0.6821/0.6826/0.6851/0.7015/0.7355/0.7707/0.8022 |
| Monte Carlo Dropout + Focal Loss     | 0.9791/0.9799/0.9822/0.9867/0.9924/0.9964/0.9979/0.9987 | 0.1031/0.1710/0.3212/0.3835/0.2668/0.1734/0.1258/0.0982 | 0.0049/0.0407/0.1530/0.3654/0.6358/0.8266/0.8994/0.9365 | 0.8009/0.8009/0.8010/0.8015/0.8060/0.8201/0.8384/0.8573 | 0.6790/0.6791/0.6798/0.6820/0.6980/0.7334/0.7693/0.8014 |
| Ensemble + Focal Loss (size=10)      | 0.9795/0.9802/0.9822/0.9865/0.9926/0.9961/0.9979/0.9987 | 0.0309/0.1468/0.2692/0.3520/0.2618/0.1666/0.1225/0.0962 | 0.0015/0.0357/0.1310/0.3428/0.6376/0.8117/0.8954/0.9373 | 0.8041/0.8041/0.8041/0.8045/0.8093/0.8234/0.8414/0.8606 | 0.6814/0.6815/0.6819/0.6838/0.7008/0.7362/0.7717/0.8046 |
| SNGP Ensemble + Focal Loss (size=10) | 0.9790/0.9798/0.9815/0.9864/0.9925/0.9964/0.9980/0.9987 | 0.0722/0.1660/0.2527/0.3746/0.2700/0.1745/0.1272/0.0989 | 0.0034/0.0394/0.1200/0.3560/0.6415/0.8295/0.9068/0.9396 | 0.8002/0.8002/0.8003/0.8007/0.8049/0.8177/0.8342/0.8516 | 0.6827/0.6827/0.6834/0.6855/0.7014/0.7355/0.7702/0.8020 |

## Transfer Learning Performance on CivilComments Identity Dataset (SNGP + Focal Loss)

| Identity Type                  | AUROC/AUPRC/Acc      | ECE/Brier Score | Oracle Collaborative Acc           |
| ------------------------------ | -------------------- | --------------- | ---------------------------------- |
| gender                         | 0.7666/0.7826/0.9698 | 0.0269/0.0483   | 0.9741/0.9848/0.9908/0.9937/0.9959 |
| sexual_orientation             | 0.7709/0.9077/0.9570 | 0.0550/0.0876   | 0.9650/0.9793/0.9837/0.9854/0.9869 |
| religion                       | 0.7495/0.7765/0.9751 | 0.0130/0.0574   | 0.9797/0.9878/0.9922/0.9957/0.9959 |
| race                           | 0.7622/0.8934/0.9508 | 0.0245/0.0985   | 0.9566/0.9677/0.9760/0.9807/0.9846 |
| disability                     | 0.7412/0.8355/0.9688 | 0.0366/0.0595   | 0.9688/0.9779/0.9908/0.9944/0.9977 |

| Identity Type                  | AUROC/AUPRC/Acc      | ECE/Brier Score | Oracle Collaborative Acc           |
| ------------------------------ | -------------------- | --------------- | ---------------------------------- |
| male                           | 0.7599/0.7756/0.9747 | 0.0334/0.0478   | 0.9783/0.9860/0.9913/0.9940/0.9956 |
| female                         | 0.7664/0.7805/0.9733 | 0.0263/0.0465   | 0.9774/0.9889/0.9942/0.9960/0.9979 |
| transgender                    | 0.6357/0.8183/0.9688 | 0.0460/0.0528   | 0.9688/0.9844/0.9891/0.9937/0.9984 |
| homosexual_gay_or_lesbian      | 0.7664/0.9100/0.9531 | 0.0554/0.0881   | 0.9607/0.9780/0.9831/0.9854/0.9867 |
| christian                      | 0.7425/0.7296/0.9822 | 0.0163/0.0439   | 0.9853/0.9930/0.9965/0.9984/0.9985 |
| jewish                         | 0.7609/0.8345/0.9812 | 0.0347/0.0613   | 0.9812/0.9906/0.9970/0.9984/0.9997 |
| muslim                         | 0.7195/0.8435/0.9583 | 0.0116/0.0916   | 0.9653/0.9758/0.9817/0.9856/0.9887 |
| atheist                        | 0.7317/0.7530/0.9844 | 0.0266/0.0389   | 0.9922/0.9922/0.9955/0.9994/1.0000 |
| black                          | 0.7512/0.9193/0.9516 | 0.0157/0.1183   | 0.9552/0.9657/0.9732/0.9781/0.9829 |
| white                          | 0.7513/0.9073/0.9449 | 0.0254/0.1045   | 0.9504/0.9634/0.9710/0.9755/0.9799 |
| asian                          | 0.7186/0.6465/0.9766 | 0.0276/0.0358   | 0.9766/0.9891/0.9922/0.9922/0.9925 |
| psychiatric_or_mental_illness  | 0.7355/0.8362/0.9635 | 0.0363/0.0621   | 0.9635/0.9727/0.9856/0.9892/0.9951 |

## Metrics
We define metrics specific to Toxic Comments below. For general metrics,
see [`baselines/`](https://github.com/google/uncertainty-baselines/tree/main/baselines). For all metrics, we evaluate performance when sending comments to an oracle or human moderator according to either uncertainty or toxicity scores (specifically, under the complement `1-p` of the toxicity score). In our tables, we denote the results under these two policies by either "uncertainty" or "toxicity", or "u/t" for uncertainty and toxicity respectively.
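
As a rough illustration of the two review policies, the sketch below ranks comments for moderator review. It rests on two assumptions that are not stated in this README: the uncertainty of a binary prediction is taken to be `p * (1 - p)`, and the `1-p` remark is read as treating `1 - p` as the model's confidence, so the most toxic comments are reviewed first. The names `review_priority` and `top_fraction` are hypothetical helpers, not functions from this repository.

```python
def review_priority(toxicity_probs, policy):
    """Return one score per comment; higher means 'review this one sooner'.

    Assumptions (not taken from the benchmark code): uncertainty is
    p * (1 - p), and the toxicity policy uses 1 - p as the confidence.
    """
    if policy == "uncertainty":
        return [p * (1.0 - p) for p in toxicity_probs]
    if policy == "toxicity":
        # Lowest confidence (1 - p) first is the same as highest p first.
        return list(toxicity_probs)
    raise ValueError(f"unknown policy: {policy!r}")


def top_fraction(priority, fraction):
    """Indices of the `fraction` of comments with the highest priority."""
    budget = int(fraction * len(priority))  # review budget, rounded down
    ranked = sorted(range(len(priority)), key=lambda i: priority[i], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    probs = [0.02, 0.45, 0.97, 0.60, 0.10]  # made-up toxicity probabilities
    for policy in ("uncertainty", "toxicity"):
        print(policy, top_fraction(review_priority(probs, policy), 0.4))
```

Under these assumptions, the uncertainty policy picks comments whose probabilities sit closest to `0.5`, while the toxicity policy picks the comments with the highest probabilities.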

1. __Oracle Collaborative Accuracy__. Accuracy after sending a certain fraction
of examples that the model is not confident about to the human moderators. Here
we apply fractions `0.01`, `0.05`, `0.10`, `0.15` and `0.20` for the
CivilComments Identities subsets, and `0.001`, `0.005`, `0.01`, `0.02`, `0.05`,
`0.10`, `0.15` and `0.20` for the Wikipedia Toxicity and full CivilComments
datasets. Note that if we randomly send examples to human moderators, the final
accuracy is equal to `accuracy * (1 - fraction) + 1.0 * fraction`. For the
deterministic model, this gives accuracies of `0.926` and `0.930` at fractions
`0.05` and `0.10`, which are much worse than the Oracle Collaborative Accuracy
(`0.944` and `0.962`).

2. __Oracle Collaborative AUC__. The approximate oracle-collaborative equivalent for the AUC. We use the same fractions for computing this as for Oracle Collaborative Accuracy.

3. __Abstain Precision/Recall__. These metrics capture the model's uncertainty performance through the precision and recall of its abstention decisions, i.e., which examples it abstained from scoring (or equivalently, sent to an oracle or human moderator). Specifically, Abstain Precision is the proportion of abstained examples that the model would have predicted incorrectly, and Abstain Recall is the proportion of all incorrect predictions that the model abstained on. The abstention decision is made under a budget, i.e., the model is only allowed to abstain on a small fraction of examples, using the same fractions as for the previous two metrics.

4. __Calibration AUC__. Given a model that computes an uncertainty score, this metric computes the AUC for a binary prediction task where the binary "label" is the predictive correctness (a binary label of 0's and 1's), and the prediction score is the confidence score. This measures a model's uncertainty calibration in the sense that it examines the degree to which model uncertainty is predictive of generalization error. We compute the AUC using the correct prediction as the negative label.
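
The definitions above can be sketched in plain Python. This is a minimal illustration, not the implementation used to produce the tables: the function names, the `0.5` decision threshold, and the toy data are all assumptions, and the calibration function covers only the AUROC variant. The `priority` argument is whatever score decides which examples go to the moderator (higher means reviewed sooner).

```python
def abstained(priority, fraction):
    """Flag the `fraction` of examples with the highest review priority."""
    budget = int(fraction * len(priority))  # abstention budget, rounded down
    ranked = sorted(range(len(priority)), key=lambda i: priority[i], reverse=True)
    chosen = set(ranked[:budget])
    return [i in chosen for i in range(len(priority))]


def collaborative_metrics(labels, probs, priority, fraction, threshold=0.5):
    """Oracle collaborative accuracy plus abstain precision and recall."""
    wrong = [(p >= threshold) != bool(y) for y, p in zip(labels, probs)]
    sent = abstained(priority, fraction)
    # The oracle corrects every abstained example, so only unsent errors remain.
    remaining = sum(w and not s for w, s in zip(wrong, sent))
    caught = sum(w and s for w, s in zip(wrong, sent))
    return {
        "collab_acc": 1.0 - remaining / len(labels),
        "abstain_prec": caught / sum(sent) if any(sent) else 0.0,
        "abstain_recall": caught / sum(wrong) if any(wrong) else 0.0,
    }


def random_collab_acc(accuracy, fraction):
    """Expected accuracy when the reviewed examples are chosen at random."""
    return accuracy * (1.0 - fraction) + 1.0 * fraction


def calibration_auroc(labels, probs, uncertainty, threshold=0.5):
    """AUROC for detecting wrong predictions from the uncertainty score.

    Correct predictions are the negative class. The value is the probability
    that a wrong prediction is more uncertain than a correct one, with ties
    counted as one half.
    """
    wrong = [(p >= threshold) != bool(y) for y, p in zip(labels, probs)]
    pos = [u for u, w in zip(uncertainty, wrong) if w]
    neg = [u for u, w in zip(uncertainty, wrong) if not w]
    if not pos or not neg:
        return float("nan")
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    labels = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    probs = [0.1, 0.35, 0.8, 0.45, 0.52, 0.9, 0.2, 0.05, 0.3, 0.7]
    uncertainty = [p * (1.0 - p) for p in probs]  # assumed uncertainty score
    print(collaborative_metrics(labels, probs, uncertainty, fraction=0.2))
    print(random_collab_acc(accuracy=0.6, fraction=0.2))
    print(calibration_auroc(labels, probs, uncertainty))
```

The pairwise loop in `calibration_auroc` is quadratic in the number of examples, which is fine for a sketch; a rank-based computation would be the usual choice on a full test set.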


## Notes

1. A simple ensemble that averages over the individual models' predictive
probabilities.

2. Trained with [focal loss](https://openreview.net/forum?id=SJxTZeHFPH)
(with alpha = 0.1 and gamma fine-tuned according to the architecture) to handle
class imbalance in the toxic comment datasets.
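
Both notes can be illustrated with a short sketch. The focal loss below follows the common alpha-balanced form `-alpha_t * (1 - p_t)^gamma * log(p_t)`; that `alpha` weights the toxic (positive) class and the default `gamma=2.0` are assumptions made here for illustration, since this README only states `alpha = 0.1` and that gamma is tuned per architecture. The function names are hypothetical.

```python
import math


def ensemble_probs(member_probs):
    """Average per-example probabilities over ensemble members.

    `member_probs` holds one list of probabilities per member, all of the
    same length.
    """
    n_members = len(member_probs)
    return [sum(column) / n_members for column in zip(*member_probs)]


def binary_focal_loss(label, prob, alpha=0.1, gamma=2.0, eps=1e-7):
    """Alpha-balanced focal loss for a single binary example.

    Assumes `alpha` weights the positive class and `1 - alpha` the negative
    class; `gamma=2.0` is a placeholder, not the tuned value.
    """
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() finite
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    print(ensemble_probs([[0.2, 0.8], [0.4, 0.6]]))
    # A confidently correct example is down-weighted relative to a hard one.
    print(binary_focal_loss(1, 0.9), binary_focal_loss(1, 0.3))
```

With `gamma = 0` the modulating factor disappears and the loss reduces to class-weighted cross-entropy; larger `gamma` values shrink the contribution of examples the model already classifies confidently.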
