Cross-posted from my website on s-risks. Articles on the FRI blog reflect the opinions of individual researchers and not necessarily of FRI as a whole, nor have they necessarily been vetted by other team members. For more background, see our post on launching the FRI blog.
Surrogate goals might be one of the most promising approaches to reduce (the disvalue resulting from) threats. The idea is to add to one’s current goals a surrogate goal that one did not initially care about, hoping that any potential threats will target this surrogate goal rather than what one initially cared about.
In this post, I will outline two key obstacles to a successful implementation of surrogate goals.
In most settings, the threatener or the threatenee will not have perfect knowledge of the relative attractiveness of threat against the surrogate goal compared to threats against the original goal. For instance, the threatener may possess private information about how costly it is for her to carry out threats against either goal, while the threatenee may know more exactly how bad the execution of threats is compared to the loss of resources from giving in. This private information affects the feasibility of threats against either goal.
Now, it is possible that the surrogate goal may be a better threat target given the threatenee’s information, but the initial goal is better given the threatener’s (private) information. Surrogate goals don’t work in this case because the threatener will still threaten the initial goal.
The most straightforward way to deal with this problem is to make the surrogate goal more threatener-friendly so that the surrogate goal will still be the preferred target even with some private information pointing in the other direction. However, that introduces a genuine tradeoff between the probability of successfully deflecting threats to the surrogate goal and the expected loss of utility due to a worsened bargaining position. (Without private information, surrogate goals would only require an infinitesimally small concession in terms of vulnerability to threats.)
Tension between credibility and non-interference
Surrogate goals fail if it is not credible – in the eyes of potential threateners – that you actually care about the surrogate goal. But apart from human psychology, is there a strong reason why surrogate goals may be less credible than initial goals?
Unfortunately, one of the main ways how an observer can gain information about an agent’s values is to observe that agent’s behaviour and evaluate how consistent that is with a certain set of values. If an agent frequently takes actions to avoid death, that is (strong) evidence that the agent cares about survival (whether instrumentally or intrinsically). The problem is that surrogate goals should also not interfere with one’s initial goals, i.e. an agent will ideally not waste resources by pursuing surrogate goals. But in that case, threateners will find the agent’s initial goal credible but not their surrogate goal, and will thus choose to threaten the initial goal.
So the desiderata of credibility and non-interference are mutually exclusive if observing actions is a main source of evidence about values. An agent might be willing to spend some resources pursuing a surrogate goal to establish credibility, but that introduces another tradeoff between the benefits of a surrogate goal and the waste of resources. Ideally, we can avoid this tradeoff by finding other ways to make a surrogate goal credible. For instance, advanced AI systems could be built in a way that makes their goals (including surrogate goals) transparent to everyone.