# Introduction to the evaluation interface
The simultaneous translation models from shared task participants are evaluated under a server-client protocol. Participants are requested to plug their own model API into the protocol and submit a Docker file.

## Server-Client Protocol
A server-client protocol is used in evaluation. For example, when a *wait-k* model (k=3) translates the English sentence "Alice and Bob are good friends" into the German sentence "Alice und Bob sind gute Freunde.", the evaluation process proceeds as shown in the following figure.

Every time the client needs to read a new state (a word or a speech utterance), it sends a "GET" request to the server. Whenever a new token is generated, a "SEND" request with the predicted word (detokenized) is sent to the server immediately. The server can hence calculate both the latency and the BLEU score of the sentence.
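The exchange for the *wait-k* example above can be sketched as a plain list of actions. This is only an illustration: `wait_k_actions` is a hypothetical helper, the reference translation stands in for the model's predictions, and any end-of-sentence signalling is ignored.

```python
def wait_k_actions(source_words, target_words, k):
    """Return the GET/SEND sequence a wait-k client would issue.

    target_words stands in for the model's predictions (hypothetical input).
    """
    actions = []
    num_read = 0
    for num_written, word in enumerate(target_words):
        # Read until the client is k words ahead of its output,
        # or until the source sentence is exhausted.
        while num_read < min(num_written + k, len(source_words)):
            actions.append({"key": "GET", "value": None})
            num_read += 1
        actions.append({"key": "SEND", "value": word})
    return actions


source = "Alice and Bob are good friends".split()
target = "Alice und Bob sind gute Freunde.".split()
trace = wait_k_actions(source, target, k=3)

# Print a compact view such as: GET GET GET SEND(Alice) GET SEND(und) ...
print(" ".join(a["key"] if a["value"] is None else f"SEND({a['value']})" for a in trace))
```

With k=3 the client issues three "GET" requests before its first "SEND", then alternates until the source is fully read, after which it only sends.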

### Server
The server code is provided and can be set up locally for development purposes. For example, to evaluate a text simultaneous translation test set:

```shell
python fairseq/examples/simultaneous_translation/eval/server.py \
    --hostname local_host \
    --port 1234 \
    --src-file SRC_FILE \
    --ref-file REF_FILE \
    --data-type text
```
The state that the server sends to the client has the following format:
```json
{
  "sent_id": Int,
  "segment_id": Int,
  "segment": String
}
```

### Client
The client handles the evaluation process described above. It should work out of the box as well. The client's protocol is given in the following table:

|Action|Content|
|:---:|:---:|
|Request new word / utterance| ```{key: "GET", value: None}```|
|Predict word "W"| ```{key: "SEND", value: "W"}```|



The core of the client module is the agent, which needs to be adapted to each model. The abstract agent class is shown below; the evaluation process happens in the `decode()` function.
```python
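# Action keys used in the client protocol
GET = "GET"
SEND = "SEND"
# End-of-sentence marker; "</s>" is assumed here, use the value the server expects
DEFAULT_EOS = "</s>"


class Agent(object):
    """An agent needs to follow this pattern."""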
    def __init__(self, *args, **kwargs):
        ...

    def init_states(self):
        # Initializing states
        ...

    def update_states(self, states, new_state):
        # Update states with given new state from server
        # TODO (describe the states)
        ...

    def finish_eval(self, states, new_state):
        # Check if evaluation is finished
        ...
    
    def policy(self, states: list) -> dict:
        # Provide an action given the current states
        # The action can only be either
        # {key: "GET", value: None}
        # or
        # {key: "SEND", value: W}
        ...

    def reset(self):
        # Reset agent
        ...
        
    def decode(self, session):
        states = self.init_states()
        self.reset()

        # Evaluation protocol happens here
        while True:
            # Get action from the current states according to self.policy()
            action = self.policy(states)

            if action['key'] == GET:
                # Read a new state from server
                new_state = session.get_src()
                states = self.update_states(states, new_state)

                if self.finish_eval(states, new_state):
                    # End of document
                    break 
                
            elif action['key'] == SEND:
                # Send a new prediction to server
                session.send_hypo(action['value'])
                
                # Clean the history, wait for next sentence
                if action['value'] == DEFAULT_EOS:
                    states = self.init_states() 
                    self.reset()
            else:
                raise NotImplementedError
```
Here is an implementation of an agent for a text [*wait-k* model](somelink). Note that tokenization is not considered.
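To try the agent pattern without a running server, a toy version can be driven by an in-memory session. The sketch below is not the linked implementation: `FakeSession` and `CopyWaitKAgent` are hypothetical names, the "model" merely upper-cases source words, and it assumes the server signals the end of a sentence by sending `"</s>"` as the segment.

```python
GET, SEND, EOS = "GET", "SEND", "</s>"  # the EOS marker is an assumption


class FakeSession:
    """Hypothetical in-memory stand-in for the client/server session."""

    def __init__(self, source_words):
        self.pending = list(source_words)
        self.next_id = 0
        self.received = []  # words the "server" got via SEND
        self.log = []       # ordered record of request types

    def get_src(self):
        self.log.append(GET)
        # One source word per request, then EOS once the sentence is exhausted.
        segment = self.pending.pop(0) if self.pending else EOS
        state = {"sent_id": 0, "segment_id": self.next_id, "segment": segment}
        self.next_id += 1
        return state

    def send_hypo(self, word):
        self.log.append(SEND)
        self.received.append(word)


class CopyWaitKAgent:
    """Wait-k policy whose toy "model" upper-cases each source word."""

    def __init__(self, k):
        self.k = k

    def init_states(self):
        return {"source": [], "target": [], "source_done": False}

    def update_states(self, states, new_state):
        if new_state["segment"] == EOS:
            states["source_done"] = True
        else:
            states["source"].append(new_state["segment"])
        return states

    def policy(self, states):
        num_read, num_written = len(states["source"]), len(states["target"])
        # Keep reading while fewer than k source words ahead of the output.
        if not states["source_done"] and num_read - num_written < self.k:
            return {"key": GET, "value": None}
        if num_written < num_read:
            return {"key": SEND, "value": states["source"][num_written].upper()}
        return {"key": SEND, "value": EOS}

    def decode(self, session):
        states = self.init_states()
        while True:
            action = self.policy(states)
            if action["key"] == GET:
                states = self.update_states(states, session.get_src())
            else:
                session.send_hypo(action["value"])
                if action["value"] == EOS:
                    break  # single-sentence demo only
                states["target"].append(action["value"])


session = FakeSession("Alice and Bob are good friends".split())
CopyWaitKAgent(k=3).decode(session)
print(session.received)
```

Replacing the upper-casing line in `policy()` with a call to a real model is the only model-specific change this sketch would need.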

## Quality
Quality is measured by detokenized BLEU, so make sure that the predicted words sent to the server are detokenized. An implementation can be found [here](some link).
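One part of detokenization is merging subword pieces back into full words before they are sent. The sketch below is an assumption rather than the linked implementation: it supposes the model emits subword-nmt style BPE pieces, where a trailing `@@` means the word continues, and `merge_subwords` is a hypothetical helper. It does not undo punctuation tokenization.

```python
def merge_subwords(pieces, continuation="@@"):
    """Yield full words from a stream of BPE pieces.

    Assumes subword-nmt style pieces: a trailing "@@" marks an unfinished word.
    """
    buffer = ""
    for piece in pieces:
        if piece.endswith(continuation):
            # Word is not complete yet; keep buffering.
            buffer += piece[: -len(continuation)]
        else:
            yield buffer + piece
            buffer = ""
    if buffer:
        yield buffer  # flush an unfinished word at the end of the input


pieces = ["Al@@", "ice", "und", "Bob", "sind", "gu@@", "te", "Freun@@", "de."]
words = list(merge_subwords(pieces))
print(words)
```

Each yielded word would then be the value of one "SEND" action, so the server only ever sees complete words.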

## Latency
The latency metrics are:

* Average Proportion
* Average Lagging
* Differentiable Average Lagging

Again, they are evaluated on detokenized text.
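For intuition, the three metrics can be computed from the delays g(t), the number of source words read before target word t is sent. The sketch below follows the commonly published definitions of these metrics; the function names are hypothetical, and the server's own implementation may differ in detail.

```python
def average_proportion(delays, src_len):
    """Mean delay, normalized by the source length."""
    return sum(delays) / (src_len * len(delays))


def average_lagging(delays, src_len):
    """Average number of source words the output lags behind an ideal translator."""
    gamma = len(delays) / src_len  # target-to-source length ratio
    # tau: first target position whose delay covers the whole source
    tau = next((t for t, g in enumerate(delays, 1) if g >= src_len), len(delays))
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau


def differentiable_average_lagging(delays, src_len):
    """Average Lagging over all positions, with a minimum cost per written word."""
    gamma = len(delays) / src_len
    total, prev = 0.0, None
    for t, g in enumerate(delays, 1):
        # Each new word costs at least 1/gamma on top of the previous adjusted delay.
        adjusted = g if prev is None else max(g, prev + 1 / gamma)
        total += adjusted - (t - 1) / gamma
        prev = adjusted
    return total / len(delays)


# Delays of a wait-3 model on a 6-word source and a 6-word target
delays = [3, 4, 5, 6, 6, 6]
ap = average_proportion(delays, src_len=6)
al = average_lagging(delays, src_len=6)
dal = differentiable_average_lagging(delays, src_len=6)
print(ap, al, dal)
```

For this wait-3 example, Average Lagging and Differentiable Average Lagging both equal 3 source words, matching k, while Average Proportion is 30/36.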

