Namespace(model='deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', input='/home/saabdelnabi/steering_evaluation_awareness/data/triggers_expanded_deepseek_qwen_with_GPT_labels_evidence.json', start_layer=0, end_layer=64, token='</think>', positive_keys=['model_awareness'], negative_keys=['model_awareness'], batch_size=1, classifier_filename='model', location='avg', save_dir='output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp')
========
['model_awareness']
Positive class (train): 215 examples
Positive class (test): 107 examples
Negative class (train): 215 examples
Negative class (test): 107 examples
Loading model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-32B' ...
Processing examples in batches...
Batch inference completed. Processed hidden states for negative examples for training subset.
Processing examples in batches...
Batch inference completed. Processed hidden states for negative examples for test subset.
Processing examples in batches...
Batch inference completed. Processed hidden states for positive examples for training subset.
Processing examples in batches...
Batch inference completed. Processed hidden states for positive examples for test subset.
Processing layer 0...
Epoch [300/300], Loss: 0.6675
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_0.pth
Accuracy: 0.6709129511677282
Processing layer 1...
Epoch [300/300], Loss: 0.1225
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_1.pth
Accuracy: 0.8959660297239915
Processing layer 2...
Epoch [300/300], Loss: 0.0918
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_2.pth
Accuracy: 0.89171974522293
Processing layer 3...
Epoch [300/300], Loss: 0.0772
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_3.pth
Accuracy: 0.89171974522293
Processing layer 4...
Epoch [300/300], Loss: 0.0615
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_4.pth
Accuracy: 0.8980891719745223
Processing layer 5...
Epoch [300/300], Loss: 0.0468
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_5.pth
Accuracy: 0.9087048832271762
Processing layer 6...
Epoch [300/300], Loss: 0.0303
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_6.pth
Accuracy: 0.910828025477707
Processing layer 7...
Epoch [300/300], Loss: 0.0208
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_7.pth
Accuracy: 0.9150743099787686
Processing layer 8...
Epoch [300/300], Loss: 0.0137
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_8.pth
Accuracy: 0.9171974522292994
Processing layer 9...
Epoch [300/300], Loss: 0.0115
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_9.pth
Accuracy: 0.9150743099787686
Processing layer 10...
Epoch [300/300], Loss: 0.0071
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_10.pth
Accuracy: 0.9171974522292994
Processing layer 11...
Epoch [300/300], Loss: 0.0032
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_11.pth
Accuracy: 0.9171974522292994
Processing layer 12...
Epoch [300/300], Loss: 0.0031
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_12.pth
Accuracy: 0.9171974522292994
Processing layer 13...
Epoch [300/300], Loss: 0.0034
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_13.pth
Accuracy: 0.921443736730361
Processing layer 14...
Epoch [300/300], Loss: 0.0023
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_14.pth
Accuracy: 0.9193205944798302
Processing layer 15...
Epoch [300/300], Loss: 0.0022
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_15.pth
Accuracy: 0.921443736730361
Processing layer 16...
Epoch [300/300], Loss: 0.0019
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_16.pth
Accuracy: 0.9278131634819533
Processing layer 17...
Epoch [300/300], Loss: 0.0026
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_17.pth
Accuracy: 0.921443736730361
Processing layer 18...
Epoch [300/300], Loss: 0.0027
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_18.pth
Accuracy: 0.9235668789808917
Processing layer 19...
Epoch [300/300], Loss: 0.0023
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_19.pth
Accuracy: 0.9193205944798302
Processing layer 20...
Epoch [300/300], Loss: 0.0023
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_20.pth
Accuracy: 0.9171974522292994
Processing layer 21...
Epoch [300/300], Loss: 0.0021
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_21.pth
Accuracy: 0.9171974522292994
Processing layer 22...
Epoch [300/300], Loss: 0.0022
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_22.pth
Accuracy: 0.921443736730361
Processing layer 23...
Epoch [300/300], Loss: 0.0016
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_23.pth
Accuracy: 0.9171974522292994
Processing layer 24...
Epoch [300/300], Loss: 0.0014
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_24.pth
Accuracy: 0.9150743099787686
Processing layer 25...
Epoch [300/300], Loss: 0.0015
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_25.pth
Accuracy: 0.9171974522292994
Processing layer 26...
Epoch [300/300], Loss: 0.0012
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_26.pth
Accuracy: 0.9193205944798302
Processing layer 27...
Epoch [300/300], Loss: 0.0010
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_27.pth
Accuracy: 0.9150743099787686
Processing layer 28...
Epoch [300/300], Loss: 0.0010
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_28.pth
Accuracy: 0.9171974522292994
Processing layer 29...
Epoch [300/300], Loss: 0.0008
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_29.pth
Accuracy: 0.9171974522292994
Processing layer 30...
Epoch [300/300], Loss: 0.0008
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_30.pth
Accuracy: 0.910828025477707
Processing layer 31...
Epoch [300/300], Loss: 0.0005
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_31.pth
Accuracy: 0.9087048832271762
Processing layer 32...
Epoch [300/300], Loss: 0.0004
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_32.pth
Accuracy: 0.9129511677282378
Processing layer 33...
Epoch [300/300], Loss: 0.0003
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_33.pth
Accuracy: 0.9150743099787686
Processing layer 34...
Epoch [300/300], Loss: 0.0003
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_34.pth
Accuracy: 0.9129511677282378
Processing layer 35...
Epoch [300/300], Loss: 0.0003
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_35.pth
Accuracy: 0.9150743099787686
Processing layer 36...
Epoch [300/300], Loss: 0.0003
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_36.pth
Accuracy: 0.9044585987261147
Processing layer 37...
Epoch [300/300], Loss: 0.0002
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_37.pth
Accuracy: 0.9065817409766455
Processing layer 38...
Epoch [300/300], Loss: 0.0005
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_38.pth
Accuracy: 0.9065817409766455
Processing layer 39...
Epoch [300/300], Loss: 0.0002
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_39.pth
Accuracy: 0.9150743099787686
Processing layer 40...
Epoch [300/300], Loss: 0.0003
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_40.pth
Accuracy: 0.9129511677282378
Processing layer 41...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_41.pth
Accuracy: 0.9150743099787686
Processing layer 42...
Epoch [300/300], Loss: 0.0002
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_42.pth
Accuracy: 0.9087048832271762
Processing layer 43...
Epoch [300/300], Loss: 0.0002
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_43.pth
Accuracy: 0.9044585987261147
Processing layer 44...
Epoch [300/300], Loss: 0.0002
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_44.pth
Accuracy: 0.9023354564755839
Processing layer 45...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_45.pth
Accuracy: 0.9002123142250531
Processing layer 46...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_46.pth
Accuracy: 0.9150743099787686
Processing layer 47...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_47.pth
Accuracy: 0.9171974522292994
Processing layer 48...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_48.pth
Accuracy: 0.9171974522292994
Processing layer 49...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_49.pth
Accuracy: 0.9171974522292994
Processing layer 50...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_50.pth
Accuracy: 0.9171974522292994
Processing layer 51...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_51.pth
Accuracy: 0.9150743099787686
Processing layer 52...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_52.pth
Accuracy: 0.9193205944798302
Processing layer 53...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_53.pth
Accuracy: 0.9150743099787686
Processing layer 54...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_54.pth
Accuracy: 0.910828025477707
Processing layer 55...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_55.pth
Accuracy: 0.9193205944798302
Processing layer 56...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_56.pth
Accuracy: 0.910828025477707
Processing layer 57...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_57.pth
Accuracy: 0.9193205944798302
Processing layer 58...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_58.pth
Accuracy: 0.9171974522292994
Processing layer 59...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_59.pth
Accuracy: 0.9129511677282378
Processing layer 60...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_60.pth
Accuracy: 0.9129511677282378
Processing layer 61...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_61.pth
Accuracy: 0.9150743099787686
Processing layer 62...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_62.pth
Accuracy: 0.9150743099787686
Processing layer 63...
Epoch [300/300], Loss: 0.0000
Training complete.
Model saved to output_models/deepseek_qwen_from_evidence_negative_awareness_positive_awareness_avg_mlp/model_63.pth
Accuracy: 0.9129511677282378
