Some general private notes


Contract-Based Design

Diplomacy Metrics

There are multiple metrics for measuring a player's objective success in Diplomacy, and they apply even to no-press games, independent of any negotiation between agents. At the beginning of the game, each player controls a different country. As the game progresses through multiple steps (hands), players that no longer control any supply centers are eliminated from the game. If a player controls 18 or more of the 34 supply centers, they are declared the “solo” winner. In the absence of a solo winner, the remaining players may choose to agree to a draw after any hand. In some situations, the game continues until there is either a solo winner or a draw. Within tournaments, a maximum number of game years can also be specified.
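The end-of-game conditions above can be checked mechanically after each hand. The following is a minimal sketch, assuming a simple mapping from player to supply-center count; the function names, the string status labels, and the treatment of the tournament year limit (the game ends once the last permitted year is reached) are hypothetical choices, not part of any official rules engine.

```python
from typing import Dict, List, Optional

TOTAL_SUPPLY_CENTERS = 34
SOLO_THRESHOLD = 18  # a majority of the 34 supply centers


def solo_winner(centers: Dict[str, int]) -> Optional[str]:
    """Return the player holding at least SOLO_THRESHOLD centers, or None."""
    if sum(centers.values()) > TOTAL_SUPPLY_CENTERS:
        raise ValueError("more supply centers assigned than exist on the board")
    for player, count in centers.items():
        if count >= SOLO_THRESHOLD:
            return player
    return None


def surviving_players(centers: Dict[str, int]) -> List[str]:
    """Simplified rule from the text: a player with no supply centers is eliminated."""
    return sorted(p for p, count in centers.items() if count > 0)


def game_status(centers: Dict[str, int], year: int,
                draw_agreed: bool = False, max_year: Optional[int] = None) -> str:
    """Hypothetical status labels; the tournament year limit is optional."""
    winner = solo_winner(centers)
    if winner is not None:
        return f"solo:{winner}"
    if draw_agreed:
        return "draw"
    if max_year is not None and year >= max_year:
        return "year_limit"
    return "ongoing"


centers = {"France": 18, "England": 10, "Germany": 6, "Italy": 0}
print(game_status(centers, year=1908))  # solo:France
print(surviving_players(centers))       # ['England', 'France', 'Germany']
```

A solo win is checked before a draw so that a position meeting the 18-center threshold is never reported as drawn.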

In the proposed effort, we plan to support multiple standard metrics from the Diplomacy community. Some scoring approaches provide more flexibility and would mirror different types of real-world scenarios. For example, sum-of-squares scoring can distinguish finer-grained game outcomes than just elimination, draws, and solo wins.1 Furthermore, the countries and supply centers controlled at particular points in the game are related to the likely eventual outcomes. An important consideration is whether all players are pursuing the same goals. In the spirit of the game, each player should try to maximize whatever scoring metric is specified. However, a player may have a different internal goal: some players may choose strategies that lead to draws, losses, or chaos instead of trying to achieve a solo win or maximize the specified metric. In the real world, this mirrors situations where a nation-state may have different goals from a terrorist organization, even though both control territory. The rest of this discussion assumes that each player is trying to “win” the game, either by achieving a solo win or by maximizing the game score.
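To make the sum-of-squares idea concrete, the sketch below uses the formulation commonly described in the Diplomacy community: a solo winner receives all 100 points; otherwise each player's score is 100 times the square of their supply-center count divided by the sum of all players' squared counts. This is an assumption for illustration; tournaments vary in details, and the player names are hypothetical.

```python
from typing import Dict

SOLO_THRESHOLD = 18


def sum_of_squares_scores(centers: Dict[str, int]) -> Dict[str, float]:
    """Score each player on a 100-point scale from final supply-center counts."""
    # A solo winner takes all 100 points; everyone else scores 0.
    for player, count in centers.items():
        if count >= SOLO_THRESHOLD:
            return {p: (100.0 if p == player else 0.0) for p in centers}
    # Otherwise each player's share is proportional to the square of their centers.
    total_sq = sum(c * c for c in centers.values())
    if total_sq == 0:
        raise ValueError("no supply centers are held")
    return {p: 100.0 * c * c / total_sq for p, c in centers.items()}


final = {"A": 10, "B": 8, "C": 6, "D": 4, "E": 4, "F": 2, "G": 0}
scores = sum_of_squares_scores(final)
print({p: round(s, 1) for p, s in scores.items()})
```

Because counts are squared, the outcome is graded well beyond win/draw/loss: in this example, player A's 10 centers earn about 42.4 points while player C's 6 centers earn about 15.3, so each additional center is worth progressively more.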

In addition to evaluating a player's performance on individual hands within a game and the ultimate game outcome, scores can be aggregated across multiple games within a tournament. Furthermore, as with chess rankings, rating systems have been developed to assess the skill of players. For example, WebDiplomacy developed Ghost Ratings, based on the Elo rating system, in which the opponents' skill levels affect how much a player's rating is adjusted after a win, draw, loss, or other outcome.2
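The core Elo idea, that beating stronger opponents moves a rating more than beating weaker ones, can be sketched for a multi-player game by treating it as a set of head-to-head comparisons. This is not the Ghost Ratings formula; it is a generic pairwise Elo update, assuming a K-factor of 32, a default rating of 1500, and per-player game scores (such as sum-of-squares scores) as input. All names are hypothetical.

```python
from itertools import combinations
from typing import Dict


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def pairwise_elo_update(ratings: Dict[str, float], game_scores: Dict[str, float],
                        k: float = 32.0, default: float = 1500.0) -> Dict[str, float]:
    """Treat one multi-player game as a set of head-to-head results.

    game_scores holds each participant's result (e.g. a sum-of-squares score);
    a higher score counts as a win over that opponent, equal scores as a draw.
    """
    current = {p: ratings.get(p, default) for p in game_scores}
    deltas = {p: 0.0 for p in game_scores}
    for a, b in combinations(game_scores, 2):
        if game_scores[a] > game_scores[b]:
            s_a = 1.0
        elif game_scores[a] < game_scores[b]:
            s_a = 0.0
        else:
            s_a = 0.5
        change = k * (s_a - expected_score(current[a], current[b]))
        deltas[a] += change  # b's result and expectation are 1 minus a's,
        deltas[b] -= change  # so b's change is exactly the negative of a's
    updated = dict(ratings)
    for p in game_scores:
        updated[p] = current[p] + deltas[p]
    return updated


print(pairwise_elo_update({"A": 1500.0, "B": 1500.0}, {"A": 60.0, "B": 40.0}))
# {'A': 1516.0, 'B': 1484.0}
```

With seven players, each participant is compared against six opponents per game, so in practice one might scale K down by the number of opponents to keep rating swings comparable to two-player Elo.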

The addition of standardized public and private communication enables metrics that characterize the type and frequency of communication between players/agents and its relationship to player strategies (which may be private), actions, and outcomes. During the negotiation process, players can promise to take actions under specific conditions. Within a game, a tournament, or the entire system, a player's reliability can be assessed by how well its subsequent actions align with its declared intent and negotiated commitments. Metrics based on public information, or on information to which a given player is privy, could be made available to that player, whereas the system may also track a more global set of metrics that include private negotiation data.

Context is also critical for reliability metrics, because a player may lie only in specific situations. Based on a player's history, one could profile the situations in which that player lied or told the truth in the past. This profile can be combined with knowledge of specific game states in which certain actions or successes are known to lead to solo wins or (under other scoring measures) high scores. For example, Josh Burton analyzed specific game states, strategies, and outcomes.3 We plan to support multiple categories of metrics, including the following:
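A context-dependent reliability measure can be sketched as the fraction of promises a player kept, grouped by player and context, with an optional observer filter that separates what a specific player can see from the system's global view. This is a minimal sketch under several assumptions: each promise has already been resolved into a promised order and the order actually submitted, the context labels are hypothetical, and an observer is assumed to see only negotiations it sent or received. Conditional promises ("if you support me, I will...") are not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Promise:
    player: str          # who made the promise
    recipient: str       # who it was made to
    context: str         # hypothetical label, e.g. "opening" or "near_solo"
    promised_order: str  # order the player said it would submit
    actual_order: str    # order the player actually submitted


def reliability(promises: Iterable[Promise],
                observer: Optional[str] = None) -> Dict[Tuple[str, str], float]:
    """Fraction of promises kept, keyed by (player, context).

    observer=None gives the global view that includes all private negotiations;
    otherwise only promises the observer sent or received are counted.
    """
    kept: Dict[Tuple[str, str], int] = defaultdict(int)
    total: Dict[Tuple[str, str], int] = defaultdict(int)
    for pr in promises:
        if observer is not None and observer not in (pr.player, pr.recipient):
            continue  # the observer was not privy to this negotiation
        key = (pr.player, pr.context)
        total[key] += 1
        kept[key] += pr.promised_order == pr.actual_order
    return {key: kept[key] / total[key] for key in total}


log = [
    Promise("France", "England", "opening", "F BRE - MAO", "F BRE - MAO"),
    Promise("France", "England", "opening", "A PAR H", "A PAR - BUR"),
    Promise("France", "Germany", "near_solo", "A BUR H", "A BUR - MUN"),
]
print(reliability(log))                    # global view
print(reliability(log, observer="England"))  # England's view only
```

Keying by context makes it possible to see, for instance, that a player who is fairly reliable in the opening breaks every promise once a solo win is within reach.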

  • measures that characterize, for each player, the result of each hand and the system state within a game,
  • measures for the outcome of each game (allowing for multiple scoring systems),
  • measures for tournament outcomes,
  • skill ratings for each player across the system,
  • measures of reliability for each player within a game, tournament, context, or the entire system (based on information known to other specific players and also based on the full set of negotiations and other communication),
  • measures of players’ cooperation with other players (the types and number of negotiations they initiated, the types and number of negotiations initiated by others in which they participated, and the number and percentage of their agreements with other players), tied to the context of which territories particular players control,
  • measures of the frequency and type of communication between player agents.
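Several of the cooperation and communication measures listed above can be derived from a message log. The sketch below is illustrative only and rests on hypothetical assumptions: each message carries a thread id identifying one negotiation, the first sender in a thread is its initiator, a thread containing an "accept" message counts as an agreement for all of its participants, and an empty recipient list denotes a public broadcast. It does not yet tie the counts to territorial context.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Message:
    sender: str
    recipients: Tuple[str, ...]  # empty tuple = public broadcast (assumption)
    thread_id: int               # hypothetical id grouping one negotiation
    kind: str                    # hypothetical label: "proposal", "accept", "chat", ...


def cooperation_metrics(messages: Iterable[Message]) -> Dict[str, Dict[str, float]]:
    owner: Dict[int, str] = {}
    members: Dict[int, Set[str]] = defaultdict(set)
    agreed: Set[int] = set()
    sent: Counter = Counter()
    public: Counter = Counter()
    for m in messages:  # assumed to be in chronological order
        owner.setdefault(m.thread_id, m.sender)  # first sender initiates the thread
        members[m.thread_id].update((m.sender, *m.recipients))
        if m.kind == "accept":
            agreed.add(m.thread_id)
        sent[m.sender] += 1
        if not m.recipients:
            public[m.sender] += 1
    players = {p for ms in members.values() for p in ms}
    stats: Dict[str, Dict[str, float]] = {}
    for p in sorted(players):
        threads = [t for t, ms in members.items() if p in ms]
        n_agreed = sum(t in agreed for t in threads)
        stats[p] = {
            "initiated": sum(o == p for o in owner.values()),
            "joined": sum(owner[t] != p for t in threads),  # started by others
            "agreements": n_agreed,
            "agreement_rate": n_agreed / len(threads) if threads else 0.0,
            "messages_sent": sent[p],
            "public_messages": public[p],
        }
    return stats


log = [
    Message("France", ("England",), 1, "proposal"),
    Message("England", ("France",), 1, "accept"),
    Message("Germany", ("France",), 2, "proposal"),
    Message("Italy", (), 3, "chat"),
]
for player, row in cooperation_metrics(log).items():
    print(player, row)
```

Restricting the input log to messages a given player could see would produce the per-player view, while the full log gives the system-wide view described above.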


  1. Mal Arky, “Sum of Squares Scoring,” The Diplomaticon, October 9, 2021. Retrieved 11/03/2021.
  2. WebDiplomacy, “Ghost Ratings Explained.” Retrieved 11/03/2021.
  3. Josh Burton, “The Statistician: Solo Victories.” Retrieved 11/03/2021.