Apply to CLR as a Summer Research Fellow!
We, the Center on Long-Term Risk, are looking for Summer Research Fellows to explore strategies for reducing suffering in the long-term future (s-risks). For eight weeks, you will join our team at our office while working on your own research project. During this time, you will be in regular contact with our researchers and other fellows, and receive guidance from an experienced mentor.
You will work autonomously on challenging research questions relevant to reducing suffering. You will be integrated and collaborate with our team of intellectually curious, hard-working, and caring people, all of whom share a profound drive to make the biggest difference they can.
We worry that some people won’t apply because they wrongly believe they are not a good fit for the program. While such a belief is sometimes true, it is often the result of underconfidence rather than an accurate assessment. We would therefore love to see your application even if you are not sure if you are qualified or otherwise competent enough for the positions listed. We explicitly have no minimum requirements in terms of formal qualifications, and many past summer research fellows have had little or no prior research experience. Being rejected this year will not reduce your chances of being accepted in future hiring rounds. If you have any doubts, please don’t hesitate to reach out (see “Application process” > “Inquiries” below).
The purpose of the fellowship varies from fellow to fellow. In the past, we have often had the following types of people take part in the fellowship:
There might be many other good reasons for completing the fellowship. We encourage you to apply if you think you would benefit from the program, even if your reason is not listed above. In all cases, we will work with you to make the fellowship as valuable as possible given your strengths and needs. In many cases, this will mean focusing on learning and testing your fit for s-risk research, more than seeking to produce immediately valuable research output.
We don’t require specific qualifications or experience for this program, but the following abilities and qualities are what we’re looking for in candidates. We encourage you to apply if you think you may be a good fit, even if you are unsure whether you meet some of the criteria.
We encourage you to apply even if any of the below does not work for you. We are happy to be flexible for exceptional candidates, including when it comes to program length and compensation.
You can find an overview of our current priority areas here. However, if we believe that you can somehow advance high-quality research relevant to s-risks, we are interested in creating a position for you. If you see a way to contribute to our research agenda or have other ideas for reducing s-risks, please apply. We commonly tailor our positions to the strengths and interests of the applicants.
All fellows will work with a mentor to guide their project. Below, each of our mentors has written about the topics in which they’re most interested in supervising research.
At stage 2 of our application process, applicants are asked to submit a research proposal and a list of research proposal ideas. A significant part of our selection process is whether one of our mentors is interested in supervising the applicant’s proposed research, based on the overlap between the applicant’s and the mentor’s research interests.
We value your time and are aware that applications can be demanding, so we have thought carefully about making the application process time-efficient and transparent. We plan to make the final decisions by April 15. We plan to decide on the location (Berkeley or London) by early- to mid-April.
Stage 1: To start your application for any role, please complete our application form. As part of this form, we also ask you to submit your CV/resume and give you the opportunity to upload an optional research sample. The deadline is midnight Pacific Time on Thursday, March 7, 2024. We expect this to take around 2 to 3 hours if you are already familiar with our work. In the interest of your time, you do not need to polish the language of your answers in the application form.
Stage 2: By Tuesday, March 12, we will decide whether to invite you to the second stage. We will ask you to write a research proposal (up to two pages excluding references) and a list of research proposal ideas, to be submitted by Thursday, March 28 at midnight Pacific Time. This means applicants will have 16 days to complete this stage, which we expect will take up to 12 hours of work. Applicants will be compensated with £350 for their work at this stage.
Stage 3: By Thursday, April 4, we will decide whether to invite you to an interview via video call during the week of April 8. By April 15, we will send out final decisions to applicants.
If you have any questions about the process, please contact us at hiring@longtermrisk.org. If you want to send an email not accessible to the hiring committee, please contact Harriet Patterson at harriet.patterson@longtermrisk.org.
We aim to combine the best aspects of academic research (depth, scholarship, mentorship) with an altruistic mission to prevent negative future scenarios. So we leave out the less productive features of academia, such as administrative burden and publish-or-perish incentives, while adding a focus on impact and application.
As part of our fellowship, you will enjoy:
You will advance neglected research to reduce the most severe risks to our civilization in the long-term future. Depending on your specific project, your work may help inform impactful work across the s-risk and AI safety ecosystem, or any of CLR’s activities, including:
We’ll soon be hiring for researchers focused on model evaluations. As an empirical researcher at CLR, you will primarily help us build evaluations that improve our understanding of s-risk-relevant properties of AI systems, developing prerequisites to intervening on advanced AI systems. To receive updates about this role and other opportunities at CLR, you can subscribe to our mailing list by submitting your email at the bottom of our website.
Expression of Interest: Director of Operations
As mentioned in our annual review, CLR is evaluating whether to relocate from London to Berkeley, CA. We are currently in a trial period, and expect to make a decision about our long-term location in early April. For this reason, we unfortunately don’t yet know whether this role will be located in London or Berkeley. We’d also be open to remote candidates in some circumstances (see below).
We recognise that this uncertainty will make the role less appealing to candidates. Given this, we will be running a low-volume invite-only hiring round now, for candidates who are willing to spend the time on our hiring process even with our location uncertainty. If we don’t successfully hire now, we will launch a full open round in April, after the location decision is made.
Further details on location:
The Director of Operations leads our two-person operations team, with responsibility across areas such as HR, finance, compliance, office management, grantmaking ops, and recruitment. You’d report to our Executive Director, and would take on the management of our existing Operations Associate.
Specific responsibilities include:
About operations at CLR:
We estimate that around 70% of your time in the role would be spent directly working on operations projects, and 30% on people management, co-ordination, and strategy.
We’re interested in candidates who bring most of the following skills and experience to the role:
The Director of Operations is central to CLR’s activities and impact. As well as continuing the existing high level of support provided to our team and projects, we’re excited for a candidate with great judgement to bring new ideas and drive organisational change, and so multiply our impact further.
CLR’s mission is to reduce worst-case risks from the development of advanced AI systems. We are the largest organisation focussed on s-risk reduction, and our researchers are among only a handful working on s-risk reduction and cooperative AI.
You can read about CLR’s achievements in 2023 and plans for 2024 here.
CLR’s activities include:
CLR has received grants from Open Philanthropy, the Survival and Flourishing Fund and Polaris Ventures.
Testimonials about CLR’s work from prominent community members can be found here.
For full-time work in this role, we offer a salary of USD 110,000–150,000 if the role is based in the Bay Area, or GBP 60,000–90,000 if based in London.
Benefits for this role will include:
If you’re interested in the role, please submit this short expression of interest form by the end of Sunday 11th February. The form will still be monitored after this, but we’ll only invite late applicants to join the invite-only round in exceptional cases.
Note that, as this is an invite-only round, we expect to get back only to the most promising candidates to invite them to apply now.
If we don’t invite you to the current round, we will still make sure to let you know if we proceed to a full open hiring round in April.
CLR Fundraiser 2023
For frequently asked questions on donating to CLR, see our Donate page.
Note: the fundraiser is now over, so donations made from now on will not be added to the list below.
| Name | Amount | Comment |
| --- | --- | --- |
| Anonymous | USD 50,000 | |
| Anonymous | USD 200,000 | |
| Anonymous | GBP 30 | |
| Anonymous | EUR 350,000 | |
Beginner’s guide to reducing s-risks
Efforts to reduce s-risks generally consist of researching factors that likely exacerbate these three mechanisms (especially emerging technologies, social institutions, and values), applying insights from this research (e.g., recommending principles for the safe design of artificial intelligence), and building the capacity of future people to prevent s-risks.
Summary:
In the future, humans and our descendants might become capable of very large-scale technological and civilizational changes, including extensive space travel (Armstrong and Sandberg 2013), developments in artificial intelligence, and the creation of institutions whose stability is historically unprecedented. These increasing capabilities could significantly impact the welfare of many beings. Analogously, the Industrial Revolution drastically accelerated economic growth while leading to the suffering of billions of animals via factory farming.
Further, in the long-term future, the universe will plausibly contain numbers of sentient beings far greater than the current human and animal populations on Earth (MacAskill 2022, Ch 1). Astronomically large populations may result from widespread settlement of space and access to resources that far exceed those on Earth, either by biological organisms or digitally emulated minds (Beckstead 2014; Hanson 2016; Shulman and Bostrom 2021). Depending on their mental architectures, digital minds might have the capacity to suffer according to some theories of consciousness (though this is a philosophically controversial position).2 Thus, the total moral weight of these minds’ suffering could be highly significant for the prioritization of altruistic causes.
If one considers it highly important to promote the welfare of future beings, a strong candidate priority is to reduce s-risks, in which large numbers of future beings undergo intense involuntary3 suffering. Even if one is overall optimistic about the long-term future, avoiding these worst cases may still be a top priority.
Effectively reducing s-risks requires identifying ways such massive amounts of suffering could arise. Although catastrophes of this scale are unprecedented, one discrete event that plausibly caused a large fraction of total historical suffering on Earth was the rise of factory farming; since 1970, at least 10 billion land animals per year have been killed for global meat production, rising to about 70 billion by 2020 (Ritchie, Rosado, and Roser 2017). Unlike the suffering caused by factory farming, some s-risks might be intentional. Similarly to historical acts of systematic cruelty by dictators, future actors might also deliberately cause harm (Althaus and Baumann 2020). As discussed in Section 3, highly advanced future technologies used by agents willing to cause massive suffering in pursuit of their goals (intentionally or not) have been hypothesized to be the most likely foreseeable causes of s-risks. Research focused on s-risks and implementations of interventions to reduce them began only recently, however (see, e.g., Tomasik 2015a), and so further investigation may identify other likely sources.
The rest of this article will discuss the premises behind prioritizing s-risks, specific potential causes of different classes of s-risks, and researchers’ current understanding of the most promising interventions to reduce them.
Arguments for prioritizing s-risk reduction as an altruistic cause have generally relied on three premises (Baumann 2020a; Gloor 2018):
Note that some approaches to reducing s-risks are sufficiently broad that they might also reduce near-term suffering, promote other values besides relief of suffering, or improve non-worst-case futures (see Section 4.2).
Longtermism consists of both a claim about the moral importance of future beings and a claim about our ability to help them. The normative premise 1(a) has been defended at length in, e.g., (Beckstead 2013; MacAskill 2022, Ch. 1; Cowen and Parfit 1992). Even if one agrees with this normative view, one might object to the empirical premise 1(b) (in the context of s-risk reduction) on the following grounds: The probability of s-risks, like that of other long-term risks, might be highly sensitive to factors about which present generations will remain largely uncertain (Greaves 2016; Tarsney 2022). Thus, it might be too hard to determine which actions will reduce s-risks in the long term. Given this problem, called cluelessness, it is also unclear how feasible it is to reduce s-risks by attempting to make small near-term changes that compound over time into a large, positive impact. The process of compounding positive influence could be stopped, or changed into negative influence, by highly unpredictable factors.
One response to this objection is that it could be tractable to affect the likelihood of potential persistent states, which are world states that, once entered, are not exited for a very long time (if ever) (Greaves and MacAskill 2021). These persistent states could have different probabilities of s-risks.
Suppose that, in the near term, the world could enter a persistent state that is significantly more prone to s-risks than some other feasible persistent state. Suppose also that certain interventions can foreseeably make the latter state more likely than the former. That is, the former state is an avoidable scenario where some technology, natural force, or societal structure “locks in” conditions for eventual large-scale suffering. (Sections 2.1.1 and 3 discuss potential causes of lock-in that are relevant to s-risks.)
Then these interventions would be less vulnerable to the cluelessness objection, because one would only need to account for uncertainties about relatively near-term effects that push towards different persistent states. It might still be highly challenging, however, to identify which persistent states are more prone to s-risks than others, or how to prevent gradual convergence towards persistent states with s-risks.
Besides steering away from s-risk-prone persistent states, another approach that could avoid cluelessness would be to build favorable conditions for future generations to reduce s-risks, since they will have more information about the factors that could exacerbate these risks. This strategy still requires identifying features that would tend to enable future people to reduce s-risks, rather than hinder reduction or enable increases. One potentially robust candidate discussed in Section 4.2 is promoting norms against taking actions that risk causing immense suffering.
Artificial intelligence (AI) may enable the risk factors for s-risks discussed in the introduction: space settlement, deployment of vast computational resources by agents willing to cause suffering, and value lock-in (Gloor 2016a, MacAskill 2022, Ch. 4; Finnveden et al. 2022). At a general level, this is because AI systems automate complex tasks in a way that can be scaled up far more than human labor (e.g., via copying code), they can surpass human problem-solving by avoiding the constraints of a biological brain, and they may be capable of consistently optimizing certain goals.
Systems designed with machine learning algorithms, including reinforcement learning agents trained to select actions that maximize some reward, have outperformed humans at finding solutions to tasks in an increasing number of domains (Silver et al. 2017; Evans and Gao 2016; Jumper et al. 2021). Both reinforcement learning agents and large language models—programs produced by machine learning that use significant amounts of data and compute to predict sequences—have demonstrated capabilities that generalize across a wide variety of tasks they were not directly designed to perform (DeepMind 2021; Reed et al. 2022; OpenAI 2023, “GPT-4 Technical Report”). If these advances in the depth and breadth of AI capabilities continue, AI systems could develop into generally intelligent agents, which implement long-term plans culminating in the use of resources on scales larger than those of current civilizations on Earth.
Given that AI could apply such general superhuman abilities to large-scale goals, influencing the development and use of AI is plausibly one of the most effective ways to reduce s-risks (Gloor 2016a).4 Sections 3.1 and 3.2 outline ways that AI could cause s-risks, and Section 4.1 discusses specific classes of interventions on AI that researchers have proposed. (More indirect interventions on the values and social institutions that influence the properties of AIs, as discussed in Section 4.2, may also be tractable alternatives.)
Broadly, AI agents with goals directed at increasing suffering or vengeful motivations (e.g., due to selection pressures similar to those responsible for such motivations in humans) would be able to efficiently create enough suffering to constitute an s-risk, if they acquired enough power to avoid interference by other agents. Alternatively, if creating large amounts of suffering is instrumental to an AI’s goals, then, even without “wanting” to cause an s-risk, this AI would be willing and able to do so.
Further, two properties of AI make it a technology that is important to influence given a focus on persistent states. First, relevant experts forecast that human-level general AI is likely to be developed this century. This means that current generations might be able to shape the initial conditions of the development and use of the next iteration—superhuman AI—that could cause s-risks. Interventions on AI could thus be relatively urgent, among longtermist priorities. For example, in a survey of leading machine learning researchers, the median estimate for the date when “unaided machines can accomplish every task better and more cheaply than human workers” was 2059 (Stein-Perlman et al. 2022).5 See also Cotra (2020).
Second, to the extent that a general AI is an agent with certain terminal goals (i.e., goals for their own sakes, as opposed to instrumental goals), it will have strong incentives to stabilize these goals if it is technically capable of doing so (Omohundro 2008). That is, because the AI evaluates plans according to its current goals, it will (in general) tend to ensure that future modifications or successors of itself also optimize for the same goals.
These considerations suggest that the development of the first AI capable of both winning a competition with other AIs and locking in its goals, including preventing human interference with those goals (Bostrom 2014, Ch. 5; Karnofsky 2022; Carlsmith 2022), could initiate a persistent state. Avoiding training AIs with goals that motivate creation of cosmic-scale suffering, then, is a potential priority within s-risk reduction that may not require anticipating many contingencies far into the future.
As Baumann (2020a) notes, premise 2 is both normative and empirical; it is a claim about both one’s moral aims, and how effective suffering reduction is at satisfying those aims.
A variety of moral views hold that we have a strong responsibility to prevent atrocities involving extreme, widespread suffering.
Variants of suffering-focused ethics (Gloor 2016b; Vinding 2020a) hold that the intrinsic moral priority of reducing (intense) suffering is significantly higher than that of other goals. In particular, on some views suffering is measurably more important: when comparing two acts, the one that causes or permits a greater net increase in suffering is acceptable only if it also ensures a much greater amount of good things.6,7 On other views, such as negative utilitarianism, (intense) suffering has lexical priority: for some forms of suffering, an act that does not lead to a net increase in such suffering is always preferable to one that does (Vinding 2020a, Ch. 1; Knutsson 2016).
Alternatively, there are views according to which suffering does not always have priority, but avoiding futures where many beings have lives not worth living is a basic duty. One might endorse the Asymmetry: the creation of an individual whose life has more suffering than happiness (or other possible goods) is bad, but the creation of an individual whose life has more goods than suffering is at best neutral (Thomas 2019; Frick 2020). Unlike the former views, the Asymmetry does not imply that reducing suffering takes priority over, e.g., increasing happiness among existing beings. Commonly held views maintain that adding a miserable life to a miserable population is just as bad no matter how large the initial population is, but that adding further happy lives has diminishing returns in value (Vinding 2022b).
One could also hold that, without committing to a particular stance on the intrinsic value or disvalue of different lives or populations, we have a responsibility to avoid foreseeable risks of extremely bad outcomes for other beings (who are not directly compensated by upsides). See the concept of “minimal morality” (Gloor 2022).
Any of these classes of normative views, when applied to long-term priorities, would recommend focusing on preventing the existence of lives dominated by suffering. By contrast, several prominent views, such as classical utilitarianism and some forms of pluralist consequentialism (Sinnott-Armstrong 2022), hold that ensuring the existence of profoundly positive experiences or life projects can take priority over reducing risks of suffering if those risks are relatively improbable.8 (See also critiques of the Asymmetry, e.g., Beckstead (2013).) According to these views, whether s-risks should be prioritized over increasing the chance of a flourishing future and reducing risks of human extinction depends on one’s empirical beliefs (see next section). Other alternative longtermist projects include the reduction of risks of stable totalitarianism (Caplan 2008), improvement of global cooperation and governance, and promotion of moral reflection (Ord 2020, Ch. 7).
Normative reasons for prioritizing s-risk reduction may be action-guiding even for those who do not consider suffering-focused views more persuasive than alternatives. According to several approaches to the problem of decision-making under uncertainty about moral evaluations (MacAskill et al. 2020), the option one ought to take might not be the act that is best on the moral view one considers most likely. Rather, the recommended act could be one that is robustly positive on a wide range of plausible views. Then, although reducing s-risks might not be the top priority of most moral views, including one’s own, it may be one’s most favorable option because most views agree that severe suffering should be prevented (while they disagree on what is positively valuable). This consideration of moral robustness arguably favors efforts to improve the quality of future experiences, rather than increase the absolute number of future experiences (Vinding and Baumann 2021).
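To make this concrete, here is a minimal numerical sketch of one such approach, maximizing expected choiceworthiness. The theory names, credences, and scores below are purely hypothetical and not drawn from the cited sources; the point is only that the option recommended overall need not be the one favored by the theory one finds most plausible.

```python
# Toy sketch of "maximize expected choiceworthiness" under moral uncertainty.
# All credences and scores are hypothetical, chosen only for illustration.

credences = {"classical_utilitarian": 0.6, "suffering_focused": 0.4}

# Choiceworthiness of each option under each moral theory (arbitrary units).
choiceworthiness = {
    "increase future happiness": {"classical_utilitarian": 10, "suffering_focused": -5},
    "reduce severe suffering":   {"classical_utilitarian": 6,  "suffering_focused": 9},
}

def expected_choiceworthiness(option: str) -> float:
    """Credence-weighted value of an option across moral theories."""
    return sum(credences[theory] * value
               for theory, value in choiceworthiness[option].items())

for option in choiceworthiness:
    print(f"{option}: {expected_choiceworthiness(option):.1f}")
# increase future happiness: 0.6*10 + 0.4*(-5) = 4.0
# reduce severe suffering:   0.6*6  + 0.4*9    = 7.2
# The second option wins overall despite losing under the most-credenced theory.
```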
On the other hand, accounting for moral uncertainty could favor other causes. It has been argued that, under moral uncertainty, the most robustly positive approach to improving the long-term future is to preserve option value for humans and our descendants, and this entails prioritizing reducing risks of human extinction (MacAskill). That is, suppose we refrain from optimizing for the best action under our current moral views (which might be s-risk reduction), in order to increase the chance that humans survive to engage in extensive moral reflection.9 The claim is that the downside of temporarily taking this suboptimal action, by the lights of our current best guess, is outweighed by the potential upside of discovering and acting upon other moral priorities that we would otherwise neglect.
One counterargument is that futures with s-risks, not just those where humans go extinct, tend to be futures where typical human values have lost control over the future, so the option value argument does not privilege extinction risk reduction. First, if intelligent beings from Earth initiate space settlement before a sufficiently elaborate process of collective moral reflection, the astronomical distances between the resulting civilizations could severely reduce their capacity to coordinate on s-risk reduction (or any moral priority) (MacAskill 2022, Ch. 4; Gloor 2018). Second, if AI agents permanently disempower humans, they may cause s-risks as well. To the extent that averting s-risks is more tractable than ensuring AIs do not want to disempower humans at all (see next section), or one has a comparative advantage in s-risk reduction, option value does not necessarily favor working on extinction risks from AI.
Without necessarily endorsing the moral views discussed above, one might believe it is easier to reduce severe suffering with some amount of marginal effort than to increase goods or decrease other bads (Vinding 2020, Sec. 14.4). Given this, reducing suffering would have higher practical priority. For example, preventing long-term intense suffering could be easier to the extent that much less effort is currently devoted to s-risk reduction than to other efforts to improve the long-term future, such as prevention of human extinction (Althaus and Gloor 2016). This is because, for more neglected efforts, the most effective opportunities are less likely to have been already taken.
That said, even if deliberate attempts to reduce s-risks are neglected, the most effective means to this end can converge with interventions towards other goals, in which case the problem would not be practically as neglected as it appears. For instance, as discussed in Section 4.2, it is arguable that reducing political polarization reduces s-risks, but many who are not motivated by s-risk reduction in particular work on reducing polarization, because they want to enable more effective governance. But see Section 4.1 for discussion of potential opportunities to reduce s-risks that appear unlikely to be taken by those focusing on other goals.
Another possibility is that, while a conjunction of desirable conditions is required to create a truly utopian long-term future, a massive increase in suffering is a relatively simple condition and thus easier to prevent (Althaus and Gloor 2016; DiGiovanni 2021). In particular, even if human extinction is prevented, whether the future is optimized for flourishing depends on which values gain power. However, see Section 2.3 for brief discussion of whether s-risks are so unlikely that the value of reducing them is lower than that of aiming to increase the amount of flourishing.
Besides these general considerations, s-risk reduction could be a relatively tractable longtermist goal to the extent that the most plausible causes of human extinction are very difficult to prevent. Alignment of AI with the intent of human users, for instance, has been argued to be a crucial source of extinction risk (Bostrom 2014; Ord 2020). But it is also commonly considered a fundamentally challenging technical problem (Yudkowsky 2018; Christiano, Cotra, and Xu 2021; Hubinger et al. 2019; Cotra 2022), in the sense that, even if alignment failure is not highly likely by default, reducing the probability of misalignment close to zero is very difficult.10 Preventing misalignment may require fundamental changes to the default paradigm of AI development, e.g., developing highly sophisticated methods for interpreting large neural networks (Hubinger 2022).
By contrast, reducing s-risks from AI (see Sections 3.1 and 3.2) may require only solving certain subsets of the problem of controlling AI behavior, via coarse-grained steering of AIs’ preferences and path-dependencies in their decision-making (Baumann 2018b; Clifton, Martin, and DiGiovanni 2022a). (Note that for these subproblems to be high-priority, they need to be sufficiently nontrivial or non-obvious that they might not be solved without the efforts of those aiming to prevent s-risks.) Further, preventing s-risks caused by malevolent human actors may be easier than finding ways to predictably influence AI, which might behave in ways that are much harder to model than other people.
Finally, premise 3 relies on a model of the future in which:
(a) Large fractions of expected future suffering are due to a relatively small set of factors, over which present generations can have some influence.
(b) Compared to the amount of suffering that could be reduced by steering from the median future toward one with no suffering, many times more suffering could be reduced by steering from worst-case outcomes toward the median future (Gloor 2018).
If (a) were false, we would not expect to find singular “events” responsible for “cosmically significant amounts” of severe suffering that intelligent agents could prevent. As an analogy, if there is no single root cause of the majority of mental illnesses, someone aiming to promote mental health may need to prioritize among individualized treatments for many different illnesses. Section 3 will discuss relatively simple factors that plausibly determine large shares of future suffering.
Rejecting (b) would entail a greater focus on abolishing the sources of presently existing suffering (Pearce 1995), which one might expect to persist into the long-term future by default, and for which we have more direct evidence than for s-risks. There are two broad arguments for (b):
Implicit in premise 3 is the claim that the worst cases of potential future suffering are not so extremely unlikely as to be practically irrelevant, that is, the expected suffering from s-risks is large. One can assess the plausibility of the specific mechanisms by which s-risks could occur, and the historical precedents for those mechanisms, given in Section 3. Some general reasons to expect s-risks to be very improbable are the lack of direct empirical evidence for them, and the incentives for most intelligent agents to shape the future according to goals other than increasing suffering (Brauner and Grosse-Holz 2018). However, broadly, trends of technological progress could both enable space settlement and increase the potential of powerful agents to vastly increase suffering, conditional on having incentives to do so (without necessarily wanting more suffering) (Baumann 2017a).
To clarify which s-risks are possible and the considerations that might favor focusing on one cluster of scenarios over others, researchers have developed the following typology of s-risks (Baumann 2018a).
An s-risk is incidental if it is a side effect of the actions of some agent(s), who were not trying to cause large amounts of suffering. In the most plausible cases of incidental s-risks, agents with significant influence over astronomical resources find that one of the most efficient ways to achieve some goal also causes large-scale suffering.
Inexpensive ways to produce desirable resources might entail severe suffering. Slavery is an example of an institution that has produced historically significant suffering by this mechanism. In general, the suffering caused by slaveholders has not been intentional (setting aside corporal punishment), but it has been permitted because of a lack of moral consideration for the victims.12 Treating future beings similarly to people in slavery could constitute an s-risk, particularly if values that permit such treatment become locked in by technology like AI. Future agents in power could force astronomical numbers of digital beings to do the computational work necessary for an intergalactic civilization. And it is unclear how feasible it would be to design these minds to experience little or no suffering (Tomasik 2014). If doing so is very easy, an s-risk from this cause is unlikely by default. If not, then for some agents there may not be a sufficiently strong incentive to make the effort of preventing this s-risk.
Alternatively, it is prima facie likely that if interstellar civilizations interested in achieving difficult goals exist in the future, they will have strong incentives to improve their understanding of the universe. These civilizations could cause suffering to beings used for scientific research, analogous to current animal testing and historical nonconsensual experimentation on humans. Specifically, if it is technically feasible for future civilizations to create highly detailed world simulations, it is plausible that they will do so for purposes such as scientific research (Bostrom 2003). In contrast to the kinds of simple, coarse-grained simulations that are possible with current computers, much more advanced simulations might have two important features:
Finally, an s-risk could result if Earth-originating agents spread wildlife throughout the accessible universe, via terraforming or directed panspermia. Suppose these agents do not protect the animals that evolve on the seeded planets from the usual causes of suffering under Darwinian competition, such as predation and disease (Horta 2010; Johannsen 2020). Then the amount of suffering experienced over the course of Earth’s evolutionary history would be replicated (on average) across large numbers of planets.
Again, the incentives to create biological life in the long-term future might be relatively weak. However, agents who—like many current people—intrinsically value undisturbed nature would not find creation of digital minds to be sufficient, and they would consider the benefits of propagating natural life worth the increase in suffering.
An s-risk is agential if it is intentionally caused by some intelligent being. Although deliberate creation of suffering appears to be an unlikely goal for any given agent to have, researchers have identified some potential mechanisms of agential s-risks, most of which have precedent.
Powerful actors might intrinsically value causing suffering, and deploy advanced technologies to satisfy this goal on scales far larger than current or historical acts of cruelty. Malevolent traits known as the Dark Tetrad—Machiavellianism, narcissism, psychopathy, and sadism—have been found to correlate with each other (Althaus and Baumann 2020; Paulhus 2014; Buckels et al. 2013; Moshagen et al. 2018). This suggests that individuals who want to increase suffering may be disproportionately effective at social manipulation and inclined to seek power. If such actors established stable rule, they would be able to cause the suffering they desire indefinitely into the future.
Another possibility is that of s-risks via retribution. While a preference for increasing suffering indiscriminately is rare among humans, people commonly have the intuition that those who violate fundamental moral or social norms deserve to suffer, beyond the degree necessary for rehabilitation or deterrence (Moore 1997). Retributive sentiments could be amplified by hostility to one’s “outgroup,” an aspect of human psychology that is deeply ingrained and may not be easily removed (Lickel 2006). To the extent that pro-retribution sentiments are apparent throughout history (Pinker 2012, Ch. 8), values in favor of causing suffering to transgressors might not be mere contingent “flukes.” These values would then be relatively likely to persist into the long-term future, without advocacy against them. As with malevolence, s-risks may therefore be perpetrated by dictators who inflict disproportionate punishments on rulebreakers, especially if such punishments can be used as displays of power.
Recalling the discussion in Section 2.1.1, AI agents could have creation of suffering as either an intrinsic or instrumental goal, and pursue this goal on long timescales due to the stability of their values. This could occur for several reasons:
An s-risk is natural if it occurs without the intervention of agents. A significant share of future suffering could be experienced by beings in the accessible universe who do not have access to advanced technology; hence, these beings would be unable to produce abundant resources (food, medicine, etc.) that ensure comfortable lives. The lack of apparent extraterrestrial intelligent beings does not necessarily imply that, on the many planets capable of sustaining life—potentially billions in the Milky Way (Wendel 2015)—there are no sentient beings.13
The reasons that humans currently do not relieve wild animal suffering, for example via vaccine and contraception distribution (Edwards et al. 2021; Eckerström-Liedholm 2022), might persist into the long-term future. Specifically, intelligent agents may a) be morally indifferent to extraterrestrial beings’ suffering, b) prioritize other goals (such as developing flourishing civilizations, or avoiding disturbing these beings’ natural state) despite having some moral concern, or c) consider it too intractable to intervene without accidentally increasing suffering.
Case (c) would grow less likely as spacefaring civilizations develop forecasting technologies assisted by AI. It is highly uncertain whether the evolution of moral attitudes would match case (a). Concern for wild animal suffering might remain low, because it is (usually) not actively caused by human intervention and so may be judged as outside the scope of humans’ moral responsibility (Horta 2017). On the other hand, under the hypothesis that people largely tend to morally discount suffering when they are unable to prevent it, we would predict increased support for reducing natural suffering as civilization’s technological ability to do so increases. Future intelligent beings might prefer to focus on advancing their own civilizations, however, rather than invest in complex efforts to improve extraterrestrial welfare, or intrinsically prefer leaving nature undisturbed. Thus, case (b) is a plausible possibility unless agents with at least moderate concern for extraterrestrial suffering gain sufficient influence.
Notably, some attempts to reduce natural s-risks may unintentionally increase incidental and agential s-risks. For example, increasing the chance that human descendants settle space would enable them to relieve natural suffering beyond Earth, but this would also put them in a position to increase suffering through influence over astronomical resources. The most robust approaches to reduce natural s-risks, therefore, would entail increasing future agents’ willingness to assist extraterrestrial beings conditional on already conducting space settlement. See also Section 4.2 for discussion of ways that one of the most tractable interventions to reduce natural s-risks, moral circle expansion, may not be robustly positive.
Researchers have proposed a variety of methods for reducing s-risks, suited to the classes discussed above and the different causes within each class. An effective intervention by current actors against s-risks needs to reliably prevent some precondition for s-risks in a way that (a) is sensitive to near-term factors rather than inevitable (contingent), and (b) does not easily change to some other state over time (persistent) (MacAskill 2022, Ch. 2). Searching for near-term sources of lock-in is a notable strategy for satisfying these two properties, as discussed in Section 2.1. Two plausible ways to prevent lock-in of increases in suffering are shaping the goals of our descendants and shaping the development of AI.
However, human societies and AI are both highly complex systems, and it is likely too difficult to do better than form coarse-grained models of these systems with considerable uncertainty. Given this problem and how recently s-risks were proposed as an altruistic priority, much of current s-risk reduction research is devoted to identifying interventions that are robust to inevitable errors in our current understanding (under which naïve interventions could backfire).
Below, both targeted and broad classes of interventions are considered; it is unclear which of these is favored by accounting for such “deep uncertainty” (Marchau et al. 2019). Targeted approaches are generally less likely to have backfire risks, by restricting one’s intervention to a few factors that one can relatively easily understand. However, broad approaches have the advantage that they may rely less strongly on a specific model of a path to impact, which is brittle to crucial factors one might not consider, and therefore prone to overestimates of effectiveness (Tomasik 2015b; Baumann 2017b).
Another general consideration is the possibility of social dilemmas (such as the famous Prisoner’s Dilemma): Even if an intervention seems best assuming one’s decision-making is independent of that of other people with different goals, people’s collective decisions may make things worse than some alternative, according to the goals of all the relevant decision-makers. As a simplistic example, suppose one attempts to reduce incidental s-risks by lobbying governments against space exploration, while those who consider it important to build an advanced civilization across the galaxy lobby for space exploration. To the extent that these efforts cancel out, they are wasted compared to the alternative in which both sides agree to pursue their goals more cooperatively, that is, at less expense to each other’s goals (Tomasik 2015c; Vinding 2020b).
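As a minimal sketch of the lobbying example (the payoff numbers below are hypothetical and only meant to exhibit the Prisoner’s-Dilemma structure): lobbying is each side’s best response to whatever the other does, yet mutual lobbying leaves both sides worse off than mutual restraint.

```python
# Toy payoff matrix for the lobbying example; all numbers are hypothetical.
# Row player: s-risk reducer. Column player: space-settlement advocate.
# Each entry is (row payoff, column payoff) in arbitrary units of goal satisfaction.
payoffs = {
    ("restrain", "restrain"): (2, 2),  # both pursue their goals more cooperatively
    ("restrain", "lobby"):    (0, 3),  # the other side gets its way unopposed
    ("lobby",    "restrain"): (3, 0),
    ("lobby",    "lobby"):    (1, 1),  # efforts largely cancel out; resources wasted
}

def best_response(opponent_action: str, player: int) -> str:
    """Action maximizing this player's payoff, holding the opponent's action fixed."""
    def payoff(action: str) -> int:
        profile = (action, opponent_action) if player == 0 else (opponent_action, action)
        return payoffs[profile][player]
    return max(("restrain", "lobby"), key=payoff)

# Lobbying dominates for the row player (and symmetrically for the column player)...
assert best_response("restrain", 0) == "lobby"
assert best_response("lobby", 0) == "lobby"
# ...yet mutual lobbying is worse for both sides than mutual restraint.
assert payoffs[("lobby", "lobby")] < payoffs[("restrain", "restrain")]
```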
First, one might aim to reduce s-risks by building clear models of some specific pathways to s-risks, and finding targeted interventions that would likely block these pathways. The distinction between “targeted” and “broad” is more of a spectrum than binary. However, this categorization can be useful in that one may tend to favor more or less targeted approaches based on one’s epistemic perspective. Suppose one finds the mechanisms sketched in Section 3 excessively speculative, considers it generally intractable to form predictions of paths to impact on the long term, and thinks backfire risks can be mitigated without a targeted approach. Then one may prefer broad approaches intended to apply across a variety of scenarios.
To prevent the lock-in of scenarios where astronomically many digital beings are subjected to suffering for economic or scientific expediency, one relatively targeted option is to work on alignment of AI with the values of its human designers (Christian 2020; Ngo, Chan, and Mindermann 2023). As discussed in Section 2.1.1, an uncontrolled AI agent would likely aim to optimize the accessible universe for its goals, which might include allowances for running many suffering computations to make and execute complex plans (Tomasik 2015a; Bostrom 2014, Ch. 8). An unaligned AI might also be especially willing to initiate conflicts leading to large-scale suffering (Section 3.2). By reducing this acute risk, solving the technical problems of alignment would help place near-term decision-makers in a better position for deliberation, in particular, about how to approach space settlement in a way that is less prone to incidental s-risks.
That said, there are several limitations to and potential backfire risks from increasing alignment on the margin. First, it is not clear that agents with values similar to most humans’ will avoid causing great suffering to digital minds, under the default trajectory of moral deliberation; see Section 4.2. For example, a human-aligned agent might be especially inclined to spread sentient life, without necessarily mitigating serious suffering in such lives. Second, progress in AI alignment could also enable malevolent actors to cause s-risks, by enabling their control over a powerful technology. Lastly, marginal increases in the degree of alignment of AI with human values could increase the risk of near-miss failures discussed in Section 3.2. Due to these considerations, it is unlikely that work on AI alignment is robustly positive for reducing s-risks.
Another, arguably more robust, approach to s-risk reduction via technical work on AI design is research in the field of Cooperative AI (and implementing proposals based on this research) (Dafoe et al. 2020; Conitzer and Oesterheld 2022; Clifton, Martin, and DiGiovanni 2022b). Agential s-risks could result from failures of cooperation between powerful AI systems, and interventions to address these conflict risks are currently neglected compared to efforts to accelerate general progress in AI, and even compared to AI alignment (Hilton 2022). Technical work on mitigating AI conflict risks is a focus area of the Center on Long-Term Risk and the Cooperative AI Foundation. In general, this work entails identifying the possible causes of AI cooperation failure, and proposing changes to the features of AI design (or deployment) that would plausibly remove those causes. Examples of specific research progress in this area include:
Cooperative AI overlaps with research in decision theory, as a method of understanding the patterns of behavior of highly rational agents such as advanced AIs, which is relevant to predicting when they will have incentives to cooperate. One prominent subset of decision theory research for s-risk reduction is work on mechanisms of cooperation between agents with similar decision-making procedures, known as Evidential Cooperation in Large Worlds (ECL) (Oesterheld 2017). Correlations between agents’ decision-making can have important implications for how they optimize their values, and thus how willing they are to cause incidental or agential s-risks. For instance, an agent does not necessarily get exploited by cooperating with correlated agents, i.e., by taking some opportunity that fulfills those agents’ values at some cost to its own. This is because those agents are most likely cooperating as well, and would likely not cooperate if the first agent did not.
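A toy expected-value sketch of this reasoning (the correlation probability and payoffs below are hypothetical, not taken from the ECL literature): if the counterpart is very likely to choose the same action as our agent, cooperating can have higher expected value than defecting, even though defecting would be better against any fixed action of the counterpart.

```python
# Toy calculation for cooperation between agents with correlated decision procedures.
# All numbers are hypothetical and chosen only to illustrate the structure.

# Payoff to "our" agent, indexed by (our action, counterpart's action).
payoff = {
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"):   -1,  # exploited
    ("defect",    "cooperate"): 4,
    ("defect",    "defect"):    0,
}

P_SAME = 0.9  # probability the counterpart's decision matches ours (correlation)

def expected_value(our_action: str) -> float:
    same = our_action
    diff = "defect" if our_action == "cooperate" else "cooperate"
    return (P_SAME * payoff[(our_action, same)]
            + (1 - P_SAME) * payoff[(our_action, diff)])

print(expected_value("cooperate"))  # 0.9*3 + 0.1*(-1) = 2.6
print(expected_value("defect"))     # 0.9*0 + 0.1*4    = 0.4
# With strong correlation, cooperating has the higher expected value, even though
# defection gives a higher payoff against either fixed action of the counterpart.
```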
To address incidental or agential s-risks from AI, an alternative to technical interventions is to improve coordination between, and risk-awareness within, labs developing advanced AI. Risks of both alignment and cooperation failures could be exacerbated by dynamics where developers race to create the first general AI (Armstrong et al. 2015). These races incentivize developers to deprioritize safety measures that, while useful for avoiding low-probability worst-case outcomes, do not increase their systems’ average-case performance at economically useful tasks. Moreover, by establishing inter-lab coordination through governance and shaping of norms of AI research culture, there would be less risk that labs develop AIs that engage in conflict because they were trained independently (and hence might have incompatible standards for bargaining, or insufficient mutual transparency) (Torges 2021). Labs could also implement measures to reduce risks of malevolent actors gaining unchecked control over AI, for example, strengthening the security of their systems against hacks and instituting stronger checks and balances within their internal governance structures.
Besides AI, extraterrestrial intelligent civilizations could also have an important influence on long-term suffering. For example, understanding how likely extraterrestrials are to settle space—and how compassionate their values might be, compared to Earth-originating agents—is helpful for assessing the counterfactual risks of suffering posed by space settlement (Chyba and Hand 2005; Cook 2022; Vinding and Baumann 2021; Tomasik 2015a). If other civilizations in the universe have less concern for avoiding s-risks than humans do, space settlement could be instrumental to reducing incidental and agential s-risks caused by these beings. On the other hand, hostile interactions between space-settling agents might pose s-risks from cooperation failures.
Broad s-risk reduction efforts aim to intervene on factors that are likely involved in several different pathways to s-risks, including ones that we cannot specifically anticipate.
A necessary condition for any s-risk to occur is that agents with the majority of power in the long-term future are not sufficiently motivated to prevent or avoid causing s-risks—otherwise, these agents would prevent such large amounts of suffering. Thus, calling attention to and developing nuanced arguments for views that highly prioritize avoiding causing severe suffering, as discussed in Section 2.2.1, might be a way to reduce the probability of all kinds of s-risks. Exploring the philosophical details of suffering-focused ethics is a priority of the Center for Reducing Suffering, for example (Vinding 2020; Ajantaival 2021). Another option is to focus on arguments against pure retribution, the idea that it is inherently good to cause (often disproportionate) suffering to those who violate certain normative standards (Parfit 2011, Ch. 11).
Marginal efforts to promote suffering-focused views do not necessarily reduce s-risks, however, without certain constraints. First, as discussed in Section 4, these efforts might be wasted due to zero-sum dynamics with others promoting different values in response to promotion of suffering-focused views. Hence, the most promising approaches would involve promoting philosophical reflection in ways that would be endorsed by a wide variety of open-minded people, even if they have come to different conclusions about how important suffering is (e.g., presenting relevant considerations and thought experiments that have been neglected in existing literature) (Vinding 2020, Sec. 12.3). Second, to the extent that suffering-focused views are associated with pure consequentialism, a possible risk is that actors become more sympathetic to a naïve procedure of attempting to reduce as much suffering as they believe is possible, given their flawed models of the world. This combination of an optimizing mindset and limited ability to predict the full consequences can inspire counterproductive actions. Thus, effective advocacy for suffering-focused ethics would involve promotion of a careful, nuanced approach.
Similarly, one might focus on increasing concern for the suffering of more kinds of sentient beings (“moral circle expansion”), to reduce incidental, natural, and some forms of agential s-risks (Anthis and Paez 2021). Despite the benefits of the wide reach of this intervention, there are some ways it could backfire. In practice, efforts to increase moral consideration of other beings might make future space-settling civilizations more likely to create these beings on large scales—some fraction of which could have miserable lives—due to viewing their lives as intrinsically valuable (Tomasik 2015d; Vinding 2018a). Further, recalling the near miss scenario from Section 3.2, an AI agent that mistakenly increases the suffering of beings it is trained to care about would cause potentially far more suffering if its training is influenced by moral circle expansion.
Shaping social institutions is another option that is particularly helpful if lock-in of the conditions for s-risks is unlikely to occur soon (or infeasible to prevent). For instance, the Center for Reducing Suffering has analyzed potential changes to political systems that could increase the likelihood of compromise and international cooperation, such as reducing polarization (Baumann 2020b; Vinding 2022a). With more compromises, it would be easier for the votes of even a minority who care about less powerful sentient beings to reduce a large share of incidental and natural s-risks. More global cooperation would slow down technological races that contribute to conflict risks, and greater stability of democracies may also reduce risks of malevolent actors taking power (Althaus and Baumann 2020).
Finally, to the extent that it is intractable to directly intervene on s-risks in the near term, an alternative is to build the capacity of future people to reduce s-risks when they have more information than we do (but less time to prepare) (Baumann 2021). This could entail:
Those interested in reducing s-risks can contribute with donations to organizations that prioritize s-risks, such as the Center on Long-Term Risk and the Center for Reducing Suffering, or with their careers. To build a career that helps reduce s-risks, one can learn more about the research fields discussed in Section 4.1, and reach out to the Center on Long-Term Risk or the Center for Reducing Suffering for career planning advice.
I thank David Althaus, Tobias Baumann, Jim Buhler, Lukas Gloor, Adrian Hutter, Caspar Oesterheld, and Pablo Stafforini for comments and suggestions.
Ajantaival. Minimalist axiologies and positive lives. https://centerforreducingsuffering.org/research/minimalist-axiologies-and-positive-lives, 2021.
Althaus and Baumann. Reducing long-term risks from malevolent actors. https://forum.effectivealtruism.org/posts/LpkXtFXdsRd4rG8Kb/reducing-long-term-risks-from-malevolent-actors, 2020.
Althaus and Gloor. Reducing Risks of Astronomical Suffering: A Neglected Priority. https://longtermrisk.org/reducing-risks-of-astronomical-suffering-a-neglected-priority, 2016.
Althaus. Descriptive Population Ethics and Its Relevance for Cause Prioritization. https://forum.effectivealtruism.org/posts/CmNBmSf6xtMyYhvcs/descriptive-population-ethics-and-its-relevance-for-cause#Interpreting_and_measuring_N_ratios, 2018.
Anthis and Paez. “Moral circle expansion: A promising strategy to impact the far future.” 2021.
Armstrong and Sandberg. “Eternity in six hours: intergalactic spreading of intelligent life and sharpening the Fermi paradox”. 2013.
Armstrong et al. “Racing to the precipice: a model of artificial intelligence development”. 2015.
Baumann. A typology of s-risks. https://centerforreducingsuffering.org/research/a-typology-of-s-risks, 2018a.
Baumann. An introduction to worst-case AI safety. https://s-risks.org/an-introduction-to-worst-case-ai-safety/, 2018b.
Baumann. Arguments for and against a focus on s-risks. https://centerforreducingsuffering.org/research/arguments-for-and-against-a-focus-on-s-risks, 2020a.
Baumann. How can we reduce s-risks? https://centerforreducingsuffering.org/research/how-can-we-reduce-s-risks/#Capacity_building, 2021.
Baumann. Improving our political system: An overview. https://centerforreducingsuffering.org/research/improving-our-political-system-an-overview, 2020b.
Baumann. Risk factors for s-risks. https://centerforreducingsuffering.org/research/risk-factors-for-s-risks, 2019.
Baumann. S-risks: An introduction. https://centerforreducingsuffering.org/research/intro, 2017a.
Baumann. Uncertainty smooths out differences in impact. https://prioritizationresearch.com/uncertainty-smoothes-out-differences-in-impact, 2017b.
Baumann. Using surrogate goals to deflect threats. https://longtermrisk.org/using-surrogate-goals-deflect-threats, 2018c.
Beckstead and Thomas. “A Paradox for tiny probabilities.” 2021.
Beckstead. Dissertation: On the overwhelming importance of shaping the far future. 2013.
Beckstead. Will we eventually be able to colonize other stars? Notes from a preliminary review. https://www.fhi.ox.ac.uk/will-we-eventually-be-able-to-colonize-other-stars-notes-from-a-preliminary-review, 2014.
Block. “The harder problem of consciousness”. 2002.
Bostrom. “Are we living in a computer simulation?” The Philosophical Quarterly. 2003.
Bostrom. “The Superintelligent Will”. 2012.
Bostrom. Superintelligence. OUP Oxford, 2014.
Brauner and Grosse-Holz. The expected value of extinction risk reduction is positive. https://www.effectivealtruism.org/articles/the-expected-value-of-extinction-risk-reduction-is-positive. 2018.
Buckels et al. “Behavioral confirmation of Everyday Sadism”. 2013.
Caplan. “The totalitarian threat”. 2008.
Carlsmith. “Is Power-seeking AI an Existential Threat?” 2022.
Chalmers. The Conscious Mind. OUP Oxford, 1996.
Christian. The Alignment Problem: Machine Learning and Human Values. W.W. Norton, 2020.
Christiano, Cotra, and Xu. Eliciting latent knowledge: How to tell if your eyes deceive you. https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit, 2021.
Chyba and Hand. “Astrobiology: The Study of the Living Universe”. 2005.
Clifton, Martin, and DiGiovanni. When would AGIs engage in conflict? https://www.alignmentforum.org/s/32kWH6hqFhmdFsvBh/p/cLDcKgvM6KxBhqhGq#What_if_conflict_isn_t_costly_by_the_agents__lights__, 2022a.
Clifton, Martin, and DiGiovanni. When does technical work to reduce AGI conflict make a difference? https://www.alignmentforum.org/s/32kWH6hqFhmdFsvBh, 2022b.
Cohen et al. 2021 State of the Industry Report: Cultivated meat and seafood. https://gfieurope.org/wp-content/uploads/2022/04/2021-Cultivated-Meat-State-of-the-Industry-Report.pdf, 2021.
Conitzer and Oesterheld. Foundations of Cooperative AI. https://www.cs.cmu.edu/~15784/focal_paper.pdf. 2022.
Cook. Replicating and extending the grabby aliens model. https://forum.effectivealtruism.org/posts/7bc54mWtc7BrpZY9e/replicating-and-extending-the-grabby-aliens-model, 2022.
Cotra. “Forecasting TAI with biological anchors”. 2020.
Cotra. Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover. https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to, 2022.
Cowen and Parfit. “Against the social discount rate”. 1992.
Dafoe et al. “Open problems in Cooperative AI”. 2020.
DeepMind. “Generally capable agents emerge from open-ended play”. 2021.
De Vynck, Lerman, and Tiku. “Microsoft’s AI chatbot is going off the rails.” The Washington Post, February 16, 2023. https://www.washingtonpost.com/technology/2023/02/16/microsoft-bing-ai-chatbot-sydney/.
DiGiovanni. A longtermist critique of “The expected value of extinction risk reduction is positive”. https://forum.effectivealtruism.org/posts/RkPK8rWigSAybgGPe/a-longtermist-critique-of-the-expected-value-of-extinction-2, 2021.
DiGiovanni and Clifton. “Commitment games with conditional information revelation”. 2022.
DiGiovanni, Macé, and Clifton. “Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs”. 2022.
Duff. “Pascal's Wager”. 1986.
Eckerström-Liedholm. Deep Dive: Wildlife contraception and welfare. https://www.wildanimalinitiative.org/blog/contraception-deep-dive, 2022.
Edwards et al. “Anthroponosis and risk management: a time for ethical vaccination of wildlife”. 2021.
Evans and Gao. DeepMind AI Reduces Google Data Centre Cooling Bill by 40%. https://www.deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-by-40, 2016.
Fearon. “Bargaining, Enforcement, and International Cooperation”. 1998.
Fearon. “Rationalist explanations for war”. 1995.
Finnveden et al. “AGI and Lock-in”. 2022.
Frick. “Conditional Reasons and the Procreation Asymmetry”. 2020.
Gloor. Altruists Should Prioritize Artificial Intelligence. https://longtermrisk.org/altruists-should-prioritize-artificial-intelligence, 2016a.
Gloor. Cause prioritization for downside-focused value systems. https://longtermrisk.org/cause-prioritization-downside-focused-value-systems, 2018.
Gloor. Population Ethics Without Axiology: A Framework. https://forum.effectivealtruism.org/posts/dQvDxDMyueLyydHw4/population-ethics-without-axiology-a-framework, 2022.
Gloor. The Case for Suffering-Focused Ethics. https://longtermrisk.org/the-case-for-suffering-focused-ethics, 2016b.
Greaves and MacAskill. The case for strong longtermism. https://globalprioritiesinstitute.org/hilary-greaves-william-macaskill-the-case-for-strong-longtermism-2, 2021.
Greaves. “Cluelessness”. 2016.
Hanson. The Age of Em. OUP Oxford, 2016.
Heifetz, Shannon, and Spiegel. “The Dynamic Evolution of Preferences”. 2007.
Hilton. “S-risks.” https://80000hours.org/problem-profiles/s-risks/#why-might-s-risks-be-an-especially-pressing-problem. 2022.
Horta. “Animal Suffering in Nature: The Case for Intervention”. 2017.
Horta. “Debunking the idyllic view of natural processes”. 2010.
Hubinger et al. “Risks from learned Optimization in Advanced Machine Learning Systems”. 2019.
Hubinger. A transparency and interpretability tech tree. https://www.alignmentforum.org/posts/nbq2bWLcYmSGup9aF/a-transparency-and-interpretability-tech-tree, 2022.
Johannsen. Wild Animal Ethics: The Moral and Political Problem of Wild Animal Suffering. Routledge, 2020.
Jumper et al. “Highly accurate protein structure prediction with AlphaFold”. 2021.
Karnofsky. AI Could Defeat All Of Us Combined. https://www.cold-takes.com/ai-could-defeat-all-of-us-combined, 2022.
Knutsson. “Value lexicality”. 2016.
Larks. The Long Reflection as the Great Stagnation. https://forum.effectivealtruism.org/posts/o5Q8dXfnHTozW9jkY/the-long-reflection-as-the-great-stagnation, 2022.
Lickel. “Vicarious Retribution: the role of collective blame in intergroup aggression". 2006.
MacAskill et al. Moral Uncertainty. OUP Oxford, 2020.
MacAskill. What We Owe the Future. Oneworld, 2022.
MacAskill. Human Extinction, Asymmetry, and Option Value. https://docs.google.com/document/d/1hQI3otOAT39sonCHIM6B4na9BKeKjEl7wUKacgQ9qF8/
Marchau et al. Decision Making under Deep Uncertainty. Springer, 2019.
McAfee. “Effective computability in Economic Decisions”. 1984.
Metzinger. The Ego Tunnel: The Science of the Mind and the Myth of the Self. Basic Books, 2010.
Monton. “How to avoid Maximizing Expected Utility”. 2019.
Moore. Placing Blame: A Theory of Criminal Law. Clarendon, 1997.
Moshagen et al. “The Dark Core of Personality”. 2018.
Myerson and Satterthwaite. “Efficient mechanisms for bilateral trading”. 1983.
Ngo, Chan, and Mindermann. “The alignment problem from a deep learning perspective”. 2023.
Oesterheld. “Multiverse-wide Cooperation via Correlated Decision Making”. 2017.
Oesterheld. “Robust program equilibrium”. 2019.
Oesterheld and Conitzer. “Safe Pareto Improvements for Delegated Game Playing”. 2021.
Omohundro. “Basic AI Drives”. 2008.
OpenAI. “GPT-4 Technical Report”. 2023.
Ord. The Precipice. Bloomsbury, 2020.
Ortiz-Ospina and Roser. Global Health. https://ourworldindata.org/health-meta, 2016.
Ortiz-Ospina and Roser. Happiness and Life Satisfaction. https://ourworldindata.org/happiness-and-life-satisfaction, 2013.
Parfit. On What Matters: Volume 1. OUP Oxford, 2011.
Paulhus. “Toward a Taxonomy of Dark Personalities”. 2014.
Pearce. Hedonistic Imperative. Self-published, 1995.
Pinker. The Better Angels of Our Nature: A History of Violence and Humanity. Penguin, 2012.
Possajennikov. “On the evolutionary stability of altruistic and spiteful preferences”. 2000.
Reed et al. “A Generalist Agent”. 2022.
Ritchie, Rosado, and Roser. Meat and Dairy Production. https://ourworldindata.org/meat-production, 2017.
Shulman and Bostrom. “Sharing the World with Digital Minds”. In: Rethinking Moral Status, edited by Steve Clarke, Hazem Zohny, and Julian Savulescu. Oxford University Press, 2021.
Silver et al. “Mastering chess and shogi”. 2017.
Singer. The Expanding Circle: Ethics, Evolution, and Moral Progress. https://press.princeton.edu/books/paperback/9780691150697/the-expanding-circle, 2011.
Sinnott-Armstrong. “Consequentialism,” The Stanford Encyclopedia of Philosophy, Edward N. Zalta & Uri Nodelman (eds.), https://plato.stanford.edu/archives/win2022/entries/consequentialism, 2022.
Sotala. Advantages of Artificial Intelligences, Uploads, and Digital Minds. https://intelligence.org/files/AdvantagesOfAIs.pdf, 2012.
Stastny et al. “Normative Disagreement as a Challenge for Cooperative AI”. 2021.
Stein-Perlman et al. “Expert survey on Progress in AI”. 2022.
Tarsney. “The epistemic challenge to longtermism”. 2022.
Tennenholtz. “Program equilibrium”. 2004.
Thomas. “The Asymmetry, Uncertainty, and the Long Term”. 2019.
Tomasik. Artificial Intelligence and Its Implications for Future Suffering. https://longtermrisk.org/artificial-intelligence-and-its-implications-for-future-suffering, 2015a.
Tomasik. Charity Cost-Effectiveness in an Uncertain World. https://longtermrisk.org/charity-cost-effectiveness-in-an-uncertain-world, 2015b.
Tomasik. Do Artificial Reinforcement-Learning Agents Matter Morally? https://arxiv.org/abs/1410.8233?context=cs, 2014.
Tomasik. Reasons to Be Nice to Other Value Systems. https://longtermrisk.org/reasons-to-be-nice-to-other-value-systems, 2015c.
Tomasik. Reasons to Promote Suffering-Focused Ethics. https://reducing-suffering.org/the-case-for-promoting-suffering-focused-ethics, 2015d.
Tomasik. Risks of Astronomical Future Suffering. https://longtermrisk.org/risks-of-astronomical-future-suffering, 2015e.
Torges. Coordination challenges for preventing AI conflict. https://longtermrisk.org/coordination-challenges-for-preventing-ai-conflict, 2021.
Vinding. Moral Circle Expansion Might Increase Future Suffering. https://magnusvinding.com/2018/09/04/moral-circle-expansion-might-increase-future-suffering, 2018a.
Vinding. Reasoned Politics. Independently published, 2022a.
Vinding. Suffering-Focused Ethics: Defense and Implications. Independently published, 2020a.
Vinding. Why altruists should be cooperative. https://centerforreducingsuffering.org/why-altruists-should-be-cooperative, 2020b.
Vinding. Why Altruists Should Perhaps Not Prioritize Artificial Intelligence: A Lengthy Critique. https://magnusvinding.com/2018/09/18/why-altruists-should-perhaps-not-prioritize-artificial-intelligence-a-lengthy-critique, 2018b.
Vinding. Popular views of population ethics imply a priority on preventing worst-case outcomes. https://centerforreducingsuffering.org/popular-views-of-population-ethics-imply-a-priority-on-preventing-worst-case-outcomes, 2022b.
Vinding and Baumann. S-risk impact distribution is double-tailed. https://centerforreducingsuffering.org/s-risk-impact-distribution-is-double-tailed, 2021.
Wendel. “On the abundance of extraterrestrial life after the Kepler mission”. 2015.
Yudkowsky. The Rocket Alignment Problem. https://intelligence.org/2018/10/03/rocket-alignment, 2018.
Ziegler et al. Fine-tuning GPT-2 from human preferences. https://openai.com/blog/fine-tuning-gpt-2, 2019.
(Archive) Summer Research Fellowship 2023
We, the Center on Long-Term Risk, are looking for Summer Research Fellows to help us explore strategies for reducing suffering in the long-term future (s-risk) and work on technical AI safety ideas related to that. For eight weeks, you will be part of our team while working on your own research project. During this time, you will be in regular contact with our researchers and other fellows. One of our researchers will serve as your guide and mentor.
Your contributions to our research program will have a positive impact through their influence on our strategic direction, grantmaking, communications, events, and other activities. You will work autonomously on challenging research questions relevant to reducing suffering. You will become part of our team of intellectually curious, hard-working, and caring people, all of whom share a profound drive to make the biggest difference they can.
We are worried that some people might not apply because they wrongly believe they are not a good fit for working with us. While such a belief is sometimes true, it is often the result of underconfidence rather than an accurate assessment. We would therefore love to see your application even if you are not sure if you are qualified or otherwise competent enough for the positions listed. We explicitly have no minimum requirements in terms of formal qualifications and many of the past summer research fellows have had no or little prior research experience. Being rejected this year will not reduce your chances of being accepted in future hiring rounds. If you have any doubts, please don’t hesitate to reach out (see “Application process” > “Inquiries” below).
The purpose of the fellowship varies from fellow to fellow. In the past, we have often had several types of people take part in the fellowship.
There might be many other good reasons for completing the fellowship. We encourage you to apply if you think you would benefit from the program, even if your reason is not listed above. In all cases, we will work with you to make the fellowship as valuable as possible given your strengths and needs. In many cases, this will mean focusing on learning and testing your fit for s-risk research, more than seeking to produce immediately valuable research output.
We don’t require specific qualifications or experience for this role, but the following abilities and qualities are what we’re looking for in candidates. We encourage you to apply if you think you may be a good fit, even if you are unsure whether you meet some of the criteria.
We encourage you to apply even if any of the below does not work for you. We are happy to be flexible for exceptional candidates, including when it comes to program length and compensation.
You can find an overview of our current priority areas here. However, if we believe that you can somehow advance high-quality research relevant to s-risks, we are interested in creating a position for you. If you see a way to contribute to our research agenda or have other ideas for reducing s-risks, please apply. We commonly tailor our positions to the strengths and interests of the applicants.
All fellows will work with a mentor to guide their project. Our mentors have each written below about the topics in which they’re most interested in supervising research.
At stage 2 of our application process, applicants are asked to submit a research proposal and a list of research proposal ideas. A significant part of our selection process relates to consideration by our mentors of whether they are interested in supervising the Fellow, based on the Fellow’s and mentor’s research interests.
I would be most keen to supervise projects on:
Some things I’m keen to supervise projects on are:
However, I'm also interested in considering strong proposals outside these areas.
I’m most interested in supervising projects related to:
I’m most keen to supervise projects in the following areas:
I’m interested in supervising Fellows working in any of my academic interest areas, as seen on my website and blog.
Given their particular relevance to CLR's priorities, I’d be interested in working with Fellows in any of the following areas:
We value your time and are aware that applications can be demanding, so we have thought carefully about making the application process time-efficient and transparent. We plan to make the final decisions between May 5 and May 10.
Stage 1: To start your application for any role, please complete our application form. As part of this form, we also ask you to submit your CV/resume and give you the opportunity to upload an optional research sample. The deadline is Sunday, April 2, 2023, end of day anywhere on Earth. We expect this to take around 2 to 3 hours if you are already familiar with our work. In the interest of your time, you do not need to polish the language of your answers in the application form.
Stage 2: By Friday, April 7, we will decide whether to invite you to the second stage. We will ask you to write a research proposal (up to two pages excluding references) and a list of research proposal ideas, to be submitted by Sunday, April 23, end of day anywhere on Earth. This means applicants will have two weeks to complete this stage, which we expect will take up to 12 hours of work. Applicants may therefore want to keep some time free during this period to work on this. Applicants will be compensated with £250 for their work on this stage.
Stage 3: By Friday, April 28, we will decide whether to invite you to an interview via video call during the week of May 1. By May 10, we will send out final decisions to applicants.
If you have any questions about the process, please contact us at hiring@longtermrisk.org. If you want to send an email not accessible to the hiring committee, please contact Amrit Sidhu-Brar at amrit.sidhu-brar@longtermrisk.org.
In addition to their salary, CLR offers the following benefits to all staff (including Summer Research Fellows):
We aim to combine the best aspects of academic research (depth, scholarship, mentorship) with an altruistic mission to prevent negative future scenarios. So we leave out the less productive features of academia, such as precarious employment and publish-or-perish incentives, while adding a focus on impact and application.
As part of our team, you will enjoy:
You will advance neglected research to reduce the most severe risks to our civilization in the long-term future. Depending on your specific project, your work will help inform our activities across a number of paths to impact.
Annual Review & Fundraiser 2022
Our goal is to reduce the worst risks of astronomical suffering (s-risks). These are scenarios where a significant fraction of future sentient beings are locked into intense states of misery, suffering, and despair. We currently believe that such lock-in scenarios most likely involve transformative AI systems. So we work on making the development and deployment of such systems safer.
Concrete research programs:
Most of our work is research with the goal of identifying threat models and possible interventions. In the case of technical AI interventions (which is the bulk of our object-level work so far), we then plan to evaluate these interventions and advocate for their inclusion in AI development.
Alongside our research, we also run events and fellowships to identify and support people who want to work on these problems.
Due to recent events, we have had a short-term funding shortfall. This caused us to reduce our original budget for 2023 by 30% and take various cost-saving measures, including voluntary pay cuts by our staff, to increase our runway to six months.
Our medium-term funding situation is hard to predict at the moment, as there is still a lot of uncertainty. We hope to gain more clarity about this in the next few months.
Our minimal fundraising goal is to increase our runway to nine months, which would give us enough time to try and secure a grant from a large institutional funder in the first half of 2023. Our main goal is to increase our runway to twelve months and roll back some of the budget reductions, putting us in a more comfortable financial position again. Our stretch goal is to increase our runway to fifteen months and allow for a small increase in team size in 2023. See the table below for more details.
Given the financial situation sketched above, we believe that CLR is a good funding opportunity this year. Whether it makes sense for any given individual to donate to CLR depends on many factors. Below, we sketch the main reasons donors could be excited about our work. In an appendix, we collected some testimonials by people who have a lot of context on our work.
Supporting s-risk reduction.
You might want to support CLR’s work because it is one of the few organizations addressing risks of astronomical suffering directly. You could consider s-risk reduction worthwhile for a number of reasons: (1) You find the combination of suffering-focused ethics and longtermism compelling. (2) You think the expected value of the future is not sufficiently high to warrant prioritizing extinction risk reduction over improving the quality of the future. (3) You want to address the fact that work on s-risks is comparatively neglected within longtermism and AI safety.
Since the early days of our organization, we have made significant progress on clarifying and modeling the concrete threats we are trying to address and coming up with technical candidate interventions (see Appendix).
Supporting work on addressing AI conflict.
Beyond its benefits for s-risk reduction, you might value some of our work because it addresses failure modes arising in multipolar AI scenarios more broadly (e.g., explored here, here). In recent years, we have helped to build up the field of Cooperative AI intended to address these risks (e.g., Stastny et al. 2021).
Supporting work on better understanding acausal interactions.
Such interactions are possibly a crucial consideration for longtermists (see, e.g., here). Some argue that, when acting, we should consider the non-causal implications of our actions (see, e.g., Ahmed (2014), Yudkowsky and Soares (2018), Oesterheld and Conitzer (2021)). If this is the case, these effects could dwarf their causal influence (see, e.g., here). Better understanding the implications of this would then be a key priority. CLR is among the few organizations doing and supporting work on this (e.g., here).
Much of our work on cooperation in the context of AI plausibly becomes particularly valuable from this perspective. For instance, if we are to act so as to maximize a compromise utility function that includes the values of many agents across the universe, as the ECL argument suggests, then it becomes much more important that AI systems, even if aligned, cooperate well with agents with different values.
Supporting cause-general longtermism research.
CLR has also done important research from a general longtermist lens, e.g., on decision theory, metaethics, AI timelines, risks from malevolent actors, and extraterrestrial civilizations. Our Summer Research Fellowship has been a springboard for junior researchers who then moved on to other longtermist organizations (e.g., ARC, Redwood Research, Rethink Priorities, Longview Philanthropy).
To donate to CLR, please go to the Fundraiser page on our website.
For frequently asked questions on donating to CLR, see our Donate page.
This group is led by Jesse Clifton. Members of the group are Anni Leskelä, Anthony DiGiovanni, Julian Stastny, Maxime Riché, Mia Taylor, and Nicolas Macé.
Have we made relevant research progress?
We believe we have made significant progress (e.g., relative to previous years) on improving our expertise in the reasons why AI systems might engage in conflict and the circumstances under which technical work done now could reduce these risks. We’ve built up methods and knowledge that we expect to make us much better at developing and assessing interventions for reducing conflict. (Some of this is reflected in our public-facing work.) We have begun to capitalize on this in the latter part of 2022, as we’ve begun moving from improving our general picture of the causes of conflict and possibilities for intervention to developing and evaluating specific interventions. These interventions include surrogate goals, preventing conflict-seeking preferences, preventing commitment races, and developing cooperation-related content for a hypothetical manual for overseers of AI training.
The second main way in which we’ve made progress is the initial work we’ve done on the evaluation of large language models (LLMs). There are several strong arguments that those interested in intervening on advanced AI systems should invest in experimental work with existing AI systems (see, e.g., The case for aligning narrowly superhuman models). Our first step here has been to work on methods for evaluating cooperation-relevant behaviors and reasoning of LLMs, as these methods are prerequisites for further research progress. We are in the process of developing the first Cooperative AI dataset for evaluating LLMs as well as methods for automatically generating data on which to evaluate cooperation-relevant behaviors, which is a prerequisite for techniques like red-teaming language models with language models. We are preparing to submit this work to a machine learning venue. We have also begun developing methods for better understanding the reasoning abilities of LLMs when it comes to conflict situations in order to develop evaluations that could tell us when models have gained capabilities that are necessary to engage in catastrophic conflict.
Has the research reached its target audience?
We published a summary of our thinking (as of earlier this year) on when technical work to reduce AGI conflict makes a difference on the Alignment Forum/LessWrong, which is visible to a large part of our target audience (AI safety & longtermist thinkers). We have also shared internal documents with individual external researchers to whom they are relevant. A significant majority of the research that we’ve done this year has not been shared with target audiences, though. Much of this is work-in-progress on evaluating interventions and evaluating LLMs which will be incorporated into summaries shared directly with external stakeholders, and in some cases posted on the Alignment Forum/LessWrong or submitted for publication in academic venues.
What feedback on our work have we received from peers and our target audience?
Our Alignment Forum sequence When does technical work to reduce AGI conflict make a difference? didn’t get much engagement. We did receive some positive feedback on internal drafts of this sequence from external researchers. We also solicited advice from individual alignment researchers throughout the year. This advice was either encouraging of existing areas of research focus or led us to shift more attention to areas that we are now focusing on (summarized in “relevant research progress” section above).
Emery, Daniel, and Tristan work on a mix of macrostrategy, ECL, decision theory, anthropics, forecasting, AI governance, and game theory.
Have we made relevant research progress?
The main focus of Emery’s work in the last year has been on the implications of ECL for cause prioritization. This includes work on the construction of the compromise utility function under different anthropic and decision-theoretic assumptions, on the implications of ECL for AI design, and on more foundational questions. Additionally, Emery worked on a paper (in progress) extending our earlier Robust Program Equilibrium paper. She also did some work on the implications of limited introspection ability for evidential decision theory (EDT) agents, and some related work on anthropics.
Daniel left for OpenAI early in the year, but not before making significant progress building a model of ECL and identifying key cruxes for the degree of decision relevance of ECL.
Tristan primarily worked on the optimal spending schedule for AI risk interventions and the probability that an Earth-originating civilization would encounter alien civilizations. To that end, he built and published two comprehensive models.
Overall, we believe we made moderate research progress, but Emery and Daniel have accumulated a large number of unpublished ideas to be written up.
Has the research reached its target audience?
The primary goal of Tristan’s reports was to inform CLR’s own prioritization. For example, the existence of alien civilizations in the far future is a consideration for our work on conflict. That said, Tristan’s work on Grabby Aliens received considerable engagement on the EA Forum and on LessWrong.
As mentioned above, a lot of Emery and Daniel’s work is not yet fully written up and published. Whilst the target audience for some of this work is internal, it’s nevertheless true that we haven’t been as successful in this regard as we would like. We have had fruitful conversations with non-CLR researchers about these topics, e.g., people at Open Philanthropy and MIRI.
What feedback on our work have we received from peers and our target audience?
The grabby aliens report was well received and cited by S. Jay Olson (co-author of a recent paper on extraterrestrial civilizations with Toby Ord), who described it as “fascinating and complete”, and Tristan has received encouragement from Robin Hanson to publish academically, which he plans to do.
Progress on all fronts seems very similar to last year, which we characterized as “modest”.
Have we increased the (quality-adjusted) size of the community?
Community growth has continued to be modest. We are careful in how we communicate about s-risks, so our outreach tools are limited. Still, we had individual contact with over 150 people who could potentially make contributions to our mission. Out of those, perhaps five to ten could turn out to be really valuable members of our community.
Have we created opportunities for in-person (and in-depth online) contact for people in our community?
We created more opportunities for discussion and exchange than before. We facilitated more visits to our office, hosted meetups around EAG, and ran an S-Risk Retreat with about 30 participants. We wanted to implement more projects in this direction, but our limited staff capacity made that impossible.
Have we provided resources for community members that make it more likely they contribute significantly to our mission?
Our team has continued to provide several useful resources this year. We administered the CLR Fund, which supported various efforts in the community. We provided ad-hoc career advice to community members. We are currently experimenting with a research database prototype. We believe there are many more things we could be doing, but our limited staff capacity has held us back.
Guiding question: Are we a healthy organization with sufficient operational capacity, an effective board, appropriate evaluation of our work, reliable policies and procedures, adequate financial reserves and reporting, and high morale?
Our capacity is currently not as high as we would like it to be as a staff member left in the summer and we only recently made a replacement hire. So various improvements to our setup have been delayed (e.g., IT & security improvements, a systematic policy review, some visa-related issues, systematic risk management). That being said, we are still able to maintain all the important functions of the organization (i.e., accounting, payments, payroll, onboarding/offboarding, hiring support, office management, feedback & review, IT maintenance).
Members of the board: Tobias Baumann, Max Daniel, Ruairi Donnelly (chair), Chi Nguyen, Jonas Vollmer.
The board are ultimately responsible for CLR. Their specific responsibilities include deciding CLR’s leadership and structure, engaging with the team about strategy and planning, resolving organizational conflicts, and advising and providing accountability for CLR leadership. In 2022 they considered various decisions related to CLR’s new office, hiring/promotion, and overall financials. The team generally considered their advice to be valuable.
We collect systematic feedback on big community-building and operations projects through surveys and interviews. We collect feedback on our research by submitting articles to journals & conferences and by requesting feedback on drafts of documents from relevant external researchers.
In 2022, we did not have any incidents that required a policy-driven intervention or required setting up new policies. Due to a lack of operational capacity in 2022, we failed to conduct a systematic review of all of our policies.
See “Fundraising” section above.
We currently don’t track staff morale quantitatively. Our impression is that this varies significantly between staff members and is more determined by personal factors than organizational trends.
Our plans for 2023 fall into three categories.
Evaluating large language models. We will continue building on the work on evaluating LLMs that we started this year. Beyond writing up and submitting our existing results, the priority for this line of work is scoping out an agenda for assessing cooperation-relevant capabilities. This will account for work on evaluation that’s being done by other actors in the alignment space and possible opportunities for eventually slotting into those efforts.
Developing and evaluating cooperation-related interventions. We will continue carrying out the evaluations of the interventions that we started this year. On the basis of these evaluations, we’ll decide which interventions we want to prioritize developing (e.g., working out in more detail how they would be implemented under various assumptions about what approaches to AI alignment will be taken). In parallel, we’ll continue developing content for an overseer’s manual for AI systems.
General s-risk macrostrategy. Some researchers on the team will continue spending some of their time thinking about s-risk prioritization more generally, e.g., thinking about the value of alternative priorities to our group’s current focus on AI conflict.
Emery plans to prioritize finishing and writing up her existing research on ECL. She also has plans for some more general posts on ECL, including on some common confusions, and on more practical implications for cause prioritization. Emery also plans to focus on finishing the paper extending Robust Program Equilibrium, and to explore further more object-level work.
Daniel no longer works at CLR but plans to organize a research retreat focused on ECL in the beginning of 2023.
Tristan broadly plans to continue strategy-related modeling, such as on the spread of information hazards. He also plans to help to complete a project that calculates the marginal utility of AI x-risk and s-risk work under different assumptions about AGI timelines, and to potentially contribute to work on ECL.
We had originally planned to expand our activities across all three community-building functions. Without additional capacity, we would have to curtail these plans.
Outreach. If resources allow, we will host another Intro Fellowship and Summer Research Fellowship. We will also continue our 1:1 meetings & calls. We also plan to investigate what kind of mass outreach within the EA community would be most helpful (e.g., online content, talks, podcasts). Without such outreach, we expect that community growth will stagnate at its current low rate.
Resources. We plan to create more long-lasting and low-marginal-cost resources for people dedicated to s-risk reduction (e.g., curated reading lists, career guide, introductory content, research database). As the community grows and diversifies, these resources will have to become more central to our work.
Exchange. If resources allow, we will host another S-Risk Retreat. We also want to experiment with other online and in-person formats. Again, as the community grows and diversifies, we need to find a replacement for more informal arrangements.
Nate Soares (Executive Director, Machine Intelligence Research Institute): "My understanding of CLR's mission is that they're working to avoid fates worse than the destruction of civilization, especially insofar as those fates could be a consequence of misaligned superintelligence. I'm glad that someone on earth is doing CLR's job, and CLR has in the past seemed to me to occasionally make small amounts of legible-to-me progress in pursuit of their mission. (Which might sound like faint praise, and I sure would endorse CLR more full-throatedly if they spent less effort on what seem to me like obvious dead-ends, but at the same time it's not like anybody else is even trying to do their job, and their job is worthy of attempting. According to me, the ability to make any progress at all in this domain is laudable)"
Lukas Finnveden (Research Analyst, Open Philanthropy): “CLR’s focus areas seem to me like the most important areas for reducing future suffering. Within these areas, they’ve shown competence at producing new knowledge, and I’ve learnt a lot that I value from engaging with their research.”
Daniel Kokotajlo (Policy/Governance, OpenAI): “I think AI cooperation and s-risk reduction are high priority almost regardless of your values / ethical views. The main reason to donate to, or work for, CLR is that the best thinking about s-risks and AI cooperation happens here, better than the thinking at MIRI or Open Phil or anywhere else. CLR also contains solid levels of knowledge of AI governance, AI alignment, AI forecasting (less so now that I’m gone), cause prioritisation, metaethics, agent foundations, anthropics, aliens, and more. Their summer fellows program is high quality and has produced many great alumni. Their ops team is great & in general they are well-organized. I left CLR to join the OpenAI governance team because I was doing mostly AI forecasting which benefits from being in a lab — but I was very happy at CLR and may even one day return.”
Michael Aird (Senior Research Manager, Rethink Priorities): 'I enjoyed my time as a summer research fellow at CLR in 2020, and I felt like I learned a lot about doing research and about various topics related to longtermist strategy, AI risk, and ethics. I was also impressed by the organization's culture and how the organization and fellowship was run, and I drew on some aspects of that when helping to design a research fellowship myself and when starting to manage people.'
Testimonials by other Summer Research Fellows can be found here.
What follows below is a somewhat stylized/simplified account of the history of the Center on Long-Term Risk prior to 2022. It is not meant to capture every twist and turn.
2011-2016: Incubation phase
What is now called the “Center on Long-Term Risk” starts out as a student group in Switzerland. Under the name “Foundational Research Institute,” we do pre-paradigmatic research into possible risks of astronomical suffering and create basic awareness of these scenarios in the EA community. A lot of pioneering thinking is done by Brian Tomasik. In 2016, we coin the term “risks of astronomical suffering” (s-risks). Key publications from that period:
2016-2019: Early growth
More researchers join; the organization professionalizes and matures. We publish our first journal articles related to s-risks. Possible interventions are being developed and discussed, surrogate goals among them. In 2017, we start sharing our work with other researchers in the longtermist community. That culminates in a series of research workshops in 2019. Key publications from that period:
2019-2022: Maturation
Before 2019, we were pursuing many projects other than research on s-risks. In 2019, this stops. We start focusing exclusively on research. Increasingly, we connect our ideas to existing lines of academic inquiry. We also start engaging more with concrete proposals and empirical work in AI alignment. Key publications from that period:
CLR Fundraiser 2022
For further details of CLR's progress in 2022, plans for 2023, and funding needs, please see our full fundraiser post.
The Swiss charity Effective Altruism Foundation (EAF) collects and processes donations through the below form on our behalf. Such donations will be used exclusively to support CLR.
For frequently asked questions on donating to CLR, see our Donate page.
Note: since the fundraiser is now over, any donations from now on will not be listed in the fundraiser donations list below.
Name | Amount | Comment
---|---|---
Simon Möller | CHF 15000 |
David Lechner | CHF 250 |
Swante Scholz | CHF 10000 | Donation for Center on long-term risk (CLR)
Kwan Yee Ng | USD 7000 |
Spencer Pearson | USD 30 |
Althaus Silvia | CHF 5000 |
Markus Winkelmann | CHF 12000 |
Markus Winkelmann | CHF 500 |
Anonymous | USD 1000000 |
Anonymous | EUR 387000 |
Anonymous | USD 1000 |
Anonymous | USD 500 |
Anonymous | USD 1500 |
Anonymous | USD 40000 |
Jan Rüegg | CHF 4500 |
Adrian Hutter | CHF 9250 |
Patrick Levermore | GBP 10 |
Jonas Vollmer | USD 1000 |
Anonymous | GBP 3.13 |
The optimal timing of spending on AGI safety work; why we should probably be spending more now
When should funders wanting to increase the probability of AGI going well spend their money? We have created a tool to calculate the optimum spending schedule and tentatively conclude funders collectively should be spending at least 5% of their capital each year on AI risk interventions and in some cases up to 35%.
This is likely higher than the current AI risk community spending rate, which is at most 3%. In most cases, we find that the optimal spending schedule is between 5% and 15% better than the ‘default’ strategy of just spending the interest one accrues, and from 15% to 50% better than a naive projection of the community’s spending rate.
We strongly encourage users to put their own inputs into the tool to draw their own conclusions.
The key finding of a higher spending rate is supported by two distinct models we have created: one that splits spending of capital into research and influence, and a second model (the ‘alternate model’) that supposes we can spend our stock of things that grow on direct work. We focus on the former, since its output is more obviously action-guiding; the latter is described in the appendix.
The table below shows our best guess for the optimal spending schedule using the former model when varying the difficulty of achieving a good AGI outcome and AGI timelines. We keep other inputs, such as diminishing returns to spending and the interest rate, constant.
[Table of optimal spending schedule plots, one for each combination of AI safety difficulty (easy, medium, hard) and median AGI arrival year (2030, 2040, 2050).]
How much better the optimal spending schedule is compared to the 2%+2% constant spending schedule (within-model lower bound):

Difficulty of AGI success | Median AGI arrival: 2030 | 2040 | 2050
---|---|---|---
Easy | 37.6% | 18.4% | 11.8%
Medium | 39.3% | 14.9% | 12.0%
Hard | 12.3% | 5.85% | 1.55%
Some of the critical limitations of our model include: poorly modelling exogenous research, which is particularly important for those with longer timelines, and many parts of the model - such as diminishing returns - remaining constant over time.
Further, we find that robust spending strategies - those that work in a wide variety of worlds - also support a higher spending rate. We show the results of a Monte Carlo simulation in the appendix.
Humanity might be living at a hinge moment in history (MacAskill, 2020). This is partly due to the unusually high level of existential risk (Ord, 2020) and, in particular, the significant probability that humanity will build artificial general intelligence (AGI) in the next decades (Cotra, 2022). More specifically, AGI is likely to account for a large fraction of extinction risk in the present and coming decades (Cotra, 2022) and stands as a strong candidate to influence the long-term future. Indeed, AGI might play a particularly important role in the long-term trajectory change of Earth-originating life by increasing the chance of a flourishing future (Bostrom, 2008) and reducing the risks of large amounts of disvalue (Gloor, 2016).
Philanthropic organisations aligned with effective altruism principles, such as the FTX Foundation and Open Philanthropy, play a crucial role in reducing AI risks by optimally allocating funding to organisations that produce research, technologies and influence to reduce risks from artificial intelligence. Figuring out the optimal funding schedule is particularly salient now, with the possibility of AI timelines under 10 years (Kokotajlo, 2022) and the substantial growth in effective altruism (EA) funding, roughly estimated at 37% per year from 2015 to 2021 for a total endowment of about $46B by the end of 2021 (Todd, 2021).
Previous work has emphasised the need to invest now to spend more later due to low discount rates (Hoeijmakers, 2020, Dickens 2020). This situation corresponds to a “patient philanthropist”. Research has modelled the optimal spending schedule a patient philanthropist should follow if they face constant interest rates, diminishing returns and a low discount rate accounting for existential risks (Trammell, 2021, Trammell 2021). Extensions of the single provider of public goods model allowed for the rate of existential risks to be time-dependent (Alaya, 2021) and to include a trade-off between labour and capital where labour accounts for movement building and direct work (Sempere, Trammell 2021). Some models also discussed the trade-off between economic growth and existential risks by modelling the dynamics between safety technology and consumption technology with an endogenous growth (Aschenbrenner, 2020) and an exogenous growth model (Trammell, 2021).
Without more specific quantitative models taking into account AI timelines, growth in funding, progress in AI safety and the difficulty of building safe AI, previous estimates of a spending schedule of just over 1% per year (Todd 2021, MacAskill 2022) are at risk of underperforming the optimal spending schedule by as much as 40%.
In this work, we consider a philanthropist or philanthropic organisation maximising the probability of humanity building safe AI. The philanthropist spends money to increase the stock of AI safety research and influence over AI development, which translates into increasing the probability of successfully aligning AI or avoiding large amounts of disvalue. Our model takes into account AI timelines, the growth of capital committed to AI safety, diminishing returns in research and influence, as well as the competitiveness of influencing AI development. We also allow for the possibility of a fire alarm shortly before AGI arrives. Upon “hearing” the fire alarm, the philanthropist knows the arrival time of AGI and can spend all of its remaining money before that time. The philanthropist also has some discount rate, due to other existential risks and to exogenous research that accelerates safety research.
Crucially, we have coded the model into a notebook accompanying this blog post that philanthropists and interested users can run to estimate an optimal spending schedule given their estimates of AI timelines, the difficulty of AI safety, capital growth and diminishing returns. Mathematically the problem of finding the optimal spending schedule translates into an optimal control problem giving rise to a set of nonlinear differential equations with boundary conditions that we solve numerically.
We discuss the effect of AI timelines and the difficulty of AI safety on the optimal spending schedule. Importantly, the optimal spending schedule typically ranges from 5% to 35% per year this decade, certainly above the current typical spending of EA-aligned funders. A funder should follow the most aggressive spending schedule this decade if AI timelines are short (2030) and safety is hard. An intermediate scenario yields a yearly average spending of ~12% over this decade. The optimal spending schedule typically performs between 5% and 15% better than the strategy of spending the endowment’s rate of appreciation, and between 18% and 40% better than the current EA community spending of ~3% per year.
We suppose that a single funder controls all of the community’s funding that is earmarked for AI risk interventions and that they set the spending rate for two types of interventions: research and influence. The funder’s aim is to choose the spending schedule - how much they spend each year on each intervention - that maximises the probability that AGI goes successfully (e.g. does not lead to an existential catastrophe).
The ‘model’ is a set of equations (described in the appendix) and accompanying Colab notebook. The latter, once given inputs from the user, finds the optimal spending schedule.
We suppose that any spending is on either research or influence. Any money we don’t spend is saved and gains interest. As well as investing money through traditional means, the funder is able to ‘invest’ in promoting earning-to-give, which historically has been a source of a large fraction of the community’s capital.
We suppose there is a single number for each of the stocks of research and influence describing how much the community has of each.
Research refers to the community’s ability to make AGI a success given we have complete control over the system (modulo being able to delay its deployment indefinitely). The stock of research contains AI safety technical knowledge, skilled safety researchers, and safe models that we control and can deploy. Influence describes the degree of control we have over the development of AGI, and can include ‘soft’ means such as through personal connections or ‘hard’ means such as passing policy. Both research and influence contribute to the probability we succeed and the user can input the degree to which they are ‘substitutable’.
The equations modelling the time evolution of research and influence have the following features:
Any money we don’t spend appreciates. Historically the money committed to the effective altruism movement has grown faster than market real interest rates. The model allows for a variable real interest rate, which allows for the possibility that the growth of the effective altruism community slows.
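As a rough illustration of these capital dynamics, here is a minimal sketch (the function names, growth numbers, and spending rule are our own illustrative choices, not taken from the accompanying notebook) of capital that appreciates at a declining rate while a constant fraction is spent each year:

```python
import numpy as np

def growth_rate(t, r0=0.20, r_inf=0.085, half_life=5.0):
    """Assumed real growth rate of committed capital: starts at r0 and
    decays toward r_inf (illustrative numbers only)."""
    return r_inf + (r0 - r_inf) * 0.5 ** (t / half_life)

def simulate_capital(spend_frac, years=30, m0=1.0, dt=0.1):
    """Euler-integrate dM/dt = r(t) * M - spending, where spending is a
    constant fraction `spend_frac` of current capital per year."""
    ts = np.arange(0.0, years, dt)
    path = np.empty_like(ts)
    m = m0
    for i, t in enumerate(ts):
        path[i] = m
        m += (growth_rate(t) - spend_frac) * m * dt
    return ts, path

ts, capital = simulate_capital(spend_frac=0.10)
```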
We use the term preparedness at time t to describe how ‘ready’ we are if AGI arrived at time t. Preparedness is a function of research and influence: the more we have of each, the more prepared we are. The user inputs the relative importance of research and influence, as well as the degree to which they are substitutable.
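A preparedness function with these properties could, for example, take a CES (constant elasticity of substitution) form. The sketch below is our own assumption about what such an aggregator might look like, not the notebook’s exact definition:

```python
def preparedness(research, influence, w_r=0.5, w_i=0.5, rho=0.0):
    """CES-style aggregator of the two stocks.  w_r and w_i are the
    importance weights; rho controls substitutability (rho = 1: perfect
    substitutes, rho -> 0: Cobb-Douglas, rho very negative: strong
    complements)."""
    if abs(rho) < 1e-9:  # Cobb-Douglas limit
        return research ** w_r * influence ** w_i
    return (w_r * research ** rho + w_i * influence ** rho) ** (1.0 / rho)
```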
We may find it useful to have money before AGI takeoff, particularly if we have a ‘fire alarm’ period where we know that AGI is coming soon and can spend most of it on last-minute research or influence. The model allows for such last-minute spending on research and influence, and so one’s money indirectly contributes to preparedness.
The probability of success given AGI arriving in year t is an S-shaped function of our preparedness. The model is not fixed to any definition of ‘success’ and could be, but is not limited to, “AGI not causing an existential catastrophe” or “AGI being aligned to human values” or “preventing AI-caused s-risk”.
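A logistic curve is one simple way to get the S-shape described here; the midpoint, steepness, and ceiling values below are placeholders of ours, not the model’s calibrated parameters:

```python
import math

def p_success(prep, midpoint=1.0, steepness=3.0, ceiling=1.0):
    """S-shaped mapping from preparedness to the probability that AGI goes
    well, conditional on it arriving: near zero when preparedness is far
    below the midpoint, saturating at `ceiling` far above it."""
    return ceiling / (1.0 + math.exp(-steepness * (prep - midpoint)))
```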
Since we are uncertain when AGI will arrive, the model takes the user’s AGI timelines as an input and integrates, over arrival times t, the product of {the probability of AGI arriving at time t} and {the probability of success at time t given AGI arrives at time t}.
The model also allows for a discount rate to account for non-AGI existential risks or catastrophes that preclude our research and influence from being useful or other factors.
The funder’s objective function, the function they wish to maximise, is the probability of making AGI go successfully.
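Schematically, in our own notation (which may differ from the notebook’s), the objective described in the last few paragraphs is:

```latex
\max_{s(\cdot)} \; \int_{0}^{\infty} f(t)\, e^{-\delta t}\, \sigma\big(P_{s}(t)\big)\, dt
```

where f(t) is the probability density of AGI arriving at time t (the user’s timelines), δ is the discount rate, P_s(t) is the preparedness at time t implied by the spending schedule s through the research, influence and money stocks it produces, and σ is the S-shaped success function.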
The preceding qualitative description corresponds to a collection of differential equations that describe how the numerical quantities of money, research and influence change as a function of our spending schedule. We want to find the spending schedule that maximises the objective function - the optimal spending schedule - and we do this with tools from optimal control theory.
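The notebook solves the resulting optimal control problem via its differential equations. A cruder but self-contained way to approximate an optimal schedule, under toy dynamics and parameters of our own choosing (not CLR’s), is to discretize spending into piecewise-constant yearly fractions and optimize them numerically:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import lognorm

horizon = 40                         # years simulated
timeline = lognorm(s=0.5, scale=18)  # hypothetical AGI-arrival distribution

def objective(spend_fracs, m0=1.0, r=0.10, delta=0.01):
    """Negative expected success probability for piecewise-constant yearly
    spending fractions, split evenly between research and influence."""
    m, research, influence, total = m0, 0.1, 0.05, 0.0
    for year, frac in enumerate(np.clip(spend_fracs, 0.0, 1.0)):
        spend = frac * m
        m = m * (1 + r) - spend
        research += 0.5 * np.sqrt(spend)   # toy diminishing returns
        influence += 0.5 * np.sqrt(spend)
        prep = np.sqrt(research * influence)        # Cobb-Douglas preparedness
        p_succ = 1 / (1 + np.exp(-3 * (prep - 1)))  # S-shaped success curve
        p_arrive = timeline.cdf(year + 1) - timeline.cdf(year)
        total += p_arrive * np.exp(-delta * year) * p_succ
    return -total

# A coarse derivative-free optimizer; the notebook's approach is more principled.
res = minimize(objective, x0=np.full(horizon, 0.05), method="Nelder-Mead",
               options={"maxiter": 20000})
optimal_yearly_fractions = np.clip(res.x, 0.0, 1.0)
```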
We first review the table from the start of the post, which varies AGI timelines and the difficulty of AGI success while keeping all other model inputs constant. We stress that the results are based on our guesses of the inputs (such as diminishing returns to spending) and encourage people to try the tool out themselves.
[Figure: yearly optimal spending schedule averaged over this decade, 2022-2030 (left), and the next, 2030-2040 (right). For each level of AI safety difficulty (easy, medium and hard) and each decade, the plots show the average spending rates on research and influence as a percentage of the funder’s endowment.]
We consider our best guess for the model’s parameters as given in the appendix (see “explaining and estimating the model parameters”). We describe the effects of timelines and the difficulty of AI safety on the spending schedule in this decade (2022-2030), the effects being roughly similar in the 2030 decade.
In most future scenarios we observe that the average optimal spending schedule is substantially higher than the current EA spending rate standing at around 1-3% per year. The most conservative spending schedule happens when the difficulty of AI safety is hard with long timelines (2050) with an average spending rate of around 6.5% per year. The most aggressive spending schedule happens when AI safety is hard and timelines are short (2030) with an average funding rate of about 35% per year.
For each level of difficulty and each AI timeline, the average allocation between research and influence looks balanced. Indeed, research and influence each account for roughly half of the total spending in each scenario. Looking closer at the results in the appendix (see “appendix results”), we observe that influence spending seems to decrease more sharply than research spending, particularly beyond the 2030 decade. This is likely caused by the sharp increase in the level of competition over AI development, which makes units of influence more costly relative to units of research. We want to emphasise, though, that the split between influence and research in the total spending schedule could easily change with different diminishing-returns parameters for research and influence.
The influence of AI timelines on the optimal spending schedule varies across distinct levels of difficulty but follows a consistent trend. Roughly, with AI timelines getting longer by a decade, the funder should decrease its average funding rate by 5 to 10%, unless AI safety is hard. If AI safety is easy, a funder should spend an average of ~25% per year for short timelines (2030), down to ~18% per year with medium timelines (2040) and down to ~15% for long timelines (2050). If AI safety difficulty is medium then the spending schedule follows a similar downtrend, starting at about 30% with short timelines down to ~12% with medium timelines and down to 10% with long timelines. If AI safety is hard, the decline in spending from short to medium timelines is sharper, starting at 35% per year with short timelines down to ~8% with medium timelines and down to about 5% with long timelines.
Interestingly, conditioning on short timelines (2030), going from AI safety hard to easier difficulty decreases the spending schedule from ~35% to ~25%, but conditioning on medium (2040) or long (2050) timelines, going from AI safety hard to easier difficulty increases the spending schedule from 6% to 18% and 9% to 15% respectively.
In summary, in most scenarios, the average optimal spending schedule in the current decade typically varies between 5% to 35% per year. With medium timelines (2040) the average spending schedule typically stays in the 10-20% range and moves up to the 20-35% range with short timelines (2030). The allocation between research and influence is balanced.
In this section, we show the effect of varying one parameter (or related combination) on the optimal spending schedule. The rest of the inputs are described in the appendix. We stress again that these results are for the inputs we have chosen and encourage you to try out your own.
Varying just the discount rate we see that a higher discount rate implies a higher spending rate in the present.
[Figure: optimal spending schedules under a low discount rate, the standard discount rate, and a high discount rate.]
It seems plausible that the community and its capital are going through an unusually fast period of growth that will level off. When assuming a lower rate of growth, we see that the optimal spending rate is lower, but still higher than the community’s current allocation. In particular, we should be spending faster than we are growing.
[Figure: optimal spending schedules under three growth scenarios: a highly pessimistic growth rate of 5%; a pessimistic growth rate of 10% now, decreasing to 5% within five years; and our guess of 20% now, decreasing to 8.5% over the next ten years.]
We can compute the change in utility when the amount of funding committed to AI risk interventions changes. This is of relevance to donors interested in the marginal value of different causes, as well as philanthropic organisations that have not explicitly decided the funding for each cause area.
Starting money multiplier | 0.001 | 0.01 | 0.1 | 0.5 | 1 | 1.1 | 1.5 | 10 |
Absolute utility | 0.031 | 0.044 | 0.092 | 0.219 | 0.317 | 0.332 | 0.386 | 0.668
Multiple of 100% money utility | 0.098 | 0.139 | 0.290 | 0.691 | 1 | 1.047 | 1.218 | 2.107 |
A different initial endowment has qualitative effects on the spending schedule. For example, comparing the 10% to 1000% case we see that when we have more money we - unsurprisingly - spend at a much higher rate. This result itself is sensitive to the existing stocks of research and influence.
When we have 10% of our current budget of $4000m | When we have 1000% of our current budget |
The spending schedule is not independent of our initial endowment. This is primarily driven by the S-shaped success function. When we have more money, we can beeline for the steep returns of the middle of the S-shape. When we have less money, we choose to save to later reach this point.
We see that, unsurprisingly, lower diminishing returns to spending suggest spending at a higher rate.
High diminishing returns20 | Our guess21 | Low diminishing returns22 |
The constant controls whether research becomes cheaper as we accumulate more research () or more expensive (). The former could describe a case where an increase in research leads to the increasing ability to parallelize research or break down problems into more easily solvable subproblems. The latter could describe a case where an increase in research leads to an increasingly bottlenecked field, where further progress depends on solving a small number of problems that are only solvable by a few people.
Research is highly serial | Default () | Research is highly parallelizable |
We see that in a world where research is either highly serial or parallelizable, we should be spending at a higher rate than if it is, on balance, neither. The parallelizable result is less surprising than the serial result, which we plan to explore in later work.
A more nuanced approach would use a function such that the field can become more or less bottlenecked as it progresses and the price of research changes accordingly.
Using our parameters, we find the presence of a fire alarm greatly improves our prospects and, perhaps unexpectedly, pushes the spending schedule upwards. This suggests it is both important to be able to correctly identify the point at which AGI is close and have a plan for the post-fire alarm period.
No fire alarm. | Short fire alarm: funders spend 10% of one’s money over six months. In this case, we get 36% more utility than no fire alarm. | Long fire alarm: funders spend 20% of one’s money over one year. In this case, we get 56% more utility than no fire alarm. |
Increasing substitutability means that one (weight adjusted23) unit of research can be replaced by closer to one unit of influence to have the same level of preparedness24.
Since, by our choice of inputs, we already have much more importance-adjusted research than influence25, in the case where they are very poor substitutes we must spend at a high rate to get influence.
When research and influence are perfect substitutes, since research is ‘cheaper’26 with our chosen inputs, the optimal spending schedule suggests that nearly all spending should be on research27.
Research and influence are very poor substitutes28 | Research and influence are poor substitutes29 | Standard case30 | Research and influence are perfect substitutes31 |
We make a note of some claims that are supported by the model. Since there is a large space of possible inputs, we recommend users specify their own inputs and not rely solely on our speculation.
Supposing the community indefinitely spends 2% of its capital each year on research and 2% on influence, the optimal spending schedule is around 30% better than this in the medium-timelines, medium-difficulty world.
Note: The default strategy is one where you spend exactly the amount your money appreciates, so your money remains constant. The greatest difference in utility comes from cases where it is optimal to spend a lot of money now; for example, in the (2030 median, hard difficulty) world, the optimal spending schedule is 15% better than the default strategy.
A wager is, e.g., thinking: ‘although I think AGI is more likely than not in the next t years, it is intractable to increase the probability of success in the next t years, so I should work on interventions that increase the probability of success in worlds where AGI arrives at some later time.’ Saving money now, even though AGI is expected sometime soon, is only occasionally recommended by the model. One case occurs with (1) a sufficiently low probability of success but steep gains to this probability after some amount of preparedness that is achievable in the next few decades, (2) a low discount rate, and (3) either (a) influence does not get too much more expensive over time or (b) influence is not too important.
A ‘wager’ on long timelines in a case where we have 2030 AGI timelines. This case has a discount rate , the difficulty is hard32 and the substitutability of research and influence is high33. |
To some extent, there is a ‘sweet spot’ on the S-shaped success curve where we wager on long timelines. If we are able to push the probability of success into a region where the slope of the S-curve is large, we should spend at a high rate until we reach this point. If we are stuck on the flatter far-left tail, such that we remain in that region regardless of any spending we do this century, we should spend at a steadier rate.
In some cases, we should ‘wager’ on shorter timelines by spending at a high rate now
This trivially occurs, for example, if you have a very high discount rate. A more interesting case occurs when influence is poorly substituted by research34 and either (a) influence depreciates quickly or (b) influence quickly becomes expensive.
A ‘wager’ on short timelines in a case where we have the 2050 AGI timeline. This case has ‘medium’ difficulty and low substitutability of research and influence35. |
Since the opportunity to wager on short timelines is only available now, we believe more effort should go into investigating this wager.
We discuss the primary limitations here, and reserve some for the appendix. For each limitation, we briefly discuss how a solution would potentially change the results.
The model does not explicitly account for research produced exogenously (i.e., not as a result of our spending). For example, it is plausible that research produced in academia should be included in our preparedness.
Exogenous research can be (poorly) approximated in the current model in a few different ways.
First, one could suppose that research appreciates over time (i.e., set the research depreciation rate to be negative). This supposes that research being done by outsiders is (directly) proportional to the research ‘we’ already have (where, in this case, research done by outsiders is folded into our research stock). Since we model exponential appreciation, appreciation leads to a research explosion. One could slow this research explosion by supposing the appreciation term grows less than proportionally with the existing stock.
Second, one could suppose that exogenous research sometimes solves the problem for us, making our own research redundant. This can be approximated by increasing the discount rate to account for the ‘risk’ that our own work is not useful. This is unrealistic in the sense that we are ‘surprised’ by some other group solving the problem.
A possible modification to the model would be to add a term to the research growth equation that accounts for the exogenous rate of growth of research. Alternatively, one could consider a radically different model of research that treats our spending on research as simply speeding up progress that will otherwise happen (conditioning on no global catastrophe).
We expect this is the biggest weakness of the model, especially for those with long AGI timelines. To a first approximation, if there is little exogenous research we do not need to account for it, and if there is a lot then our own spending schedule does not matter. Perhaps we might think our actions can lead us to be in either regime and our challenge is to push the world towards the latter.
We may hope that some real-world interventions may delay the arrival of AGI, for example, passing policies to slow AI capabilities work. The model does not explicitly account for this feature of the world at all.
One extension to the model would be to change the length of the fire alarm period to be a function of the amount of influence we have. We expect this extension to imply an increase in the relative spending rate on influence. Another, more difficult extension would be to treat timelines as a function of influence, such that we can ‘push’ timelines down the road with more influence.
We expect that our ability to delay the arrival of AGI, particularly for shorter arrivals, is sufficiently minimal such that it would not significantly change the result. For longer timelines, this seems less likely to be the case.
AI capabilities and our research influence each other in the real world. For example, AI capabilities may speed up research with AI assistants. On the other hand, spending large amounts on AI interventions may draw attention to the problem and speed up AI capabilities investment.
We allow for a depreciation of research, which can be used to model research becoming outdated as capabilities advance. One can also model research becoming cheaper over time36 to account for capabilities speeding up our research.
On balance, we expect this limitation to not have a large effect. If one expects a ‘slow AI takeoff’ with the opportunity to use highly capable AI tools, one can use the fire alarm feature and set the returns to research during this period to be high.
We model the returns to spending as constant across time. However, actual funders seem to be bottlenecked by vetting capacity and a lack of scalable, high-return projects, and so the returns to spending are likely to be high at the moment. Grantmakers can ‘seed’ projects and increase capacity, so it seems plausible that diminishing returns to spending will decrease in the future.
However, the model input only allows for constant diminishing marginal returns.
The model could be easily extended to use a function such that marginal returns to spending on research and influence changed over time, similar to how the real interest rate changes over time. This would require more input from the user. Another extension could allow for the returns to be a function of how much was spent last year. However, such an extension would increase the model's complexity and decrease its usability, simplicity and (potentially) solvability.
This limitation also applies to other features of the model, such as other parameter values that are held constant over time.
Most existing applications of optimal control theory to effective altruism-relevant decision-making have used systems of differential equations that are analytically solvable and have guarantees of optimality. Our model has neither property and so we must rely on optimization methods that do not always lead to a global maximum.
There are around 40 free parameters that the user can set.
Many model features can be turned off. By setting parameters appropriately, the model reduces to the following simpler system.
Some results from this system37:
The current growth rate of our money is continuous. However, this poorly captures the case where most growth is driven by the arrival of new donors with lots of capital. Further, any growth is endogenous - it is always in proportion to our current capital.
One modification would be to model the arrival of future funders as a stochastic process, for example a Poisson process. For example, take
where the added term can model endogenous and non-continuous growth of funding.
Following some preliminary experiments with a deterministic flux of funders, we are skeptical that this would substantially change the recommendations of the current model.
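As a purely illustrative sketch of this modification (not part of the notebook, and with made-up parameter values), one can simulate capital that grows and is spent continuously while new funders arrive as Poisson-distributed lump sums:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters only (not the notebook's values)
r = 0.085           # annual real interest rate
spend_rate = 0.10   # fraction of capital spent per year
arrival_rate = 0.2  # expected new large funders per year (Poisson)
jump_size = 2000.0  # capital added per new funder, $m
M0 = 4000.0         # initial capital, $m
T, dt = 30, 0.01    # horizon in years, time step

t_grid = np.arange(0, T, dt)
M = np.empty_like(t_grid)
M[0] = M0
for i in range(1, len(t_grid)):
    # Continuous growth and spending (Euler step)
    dM = (r - spend_rate) * M[i - 1] * dt
    # Non-continuous, stochastic arrival of new funders
    n_arrivals = rng.poisson(arrival_rate * dt)
    M[i] = M[i - 1] + dM + n_arrivals * jump_size

print(f"Capital after {T} years: ${M[-1]:,.0f}m")
```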
We see two potential problems with this approach.
First, one may care about spending money on things other than making AGI go well. The model does not tell you how to trade off these outcomes. The model best fits into a portfolio approach to doing good, such as Open Philanthropy’s Worldview Diversification. Alternatively, one may attach some value to having money left over post-AGI.
Second, there may be outcomes of intermediate utility between AGI being successful and not. A simple extension could consider some function of the probability of success. A more complex extension could consider the utility of AGI conditioned on its arrival time and our preparedness that accounts for near-miss scenarios.
The funders have a stock of capital. This goes up in proportion to the real interest rate at each time, and down with spending on research and spending on influence.
The funders have a stock of research which goes up with spending on research and can depreciate over time.
Where
Similarly, funders have a stock of influence which obeys
with constants mutatis mutandis from the research-stock case, and where a competition factor describes how the influence gained per unit of money changes over time due to competition effects. That is, over time, as the field of AGI influencers becomes more crowded, each unit of influence can become more expensive.
We allow for the existence of an AGI fire alarm, which tells us that AGI is exactly a known number of years away and during which we can spend a fraction of our money on research and influence.
We write and for the amount of research and influence we have in expectation at AGI take-off if the fire alarm occurred at time . Within the fire alarm period, we suppose that
The first and second assumptions allow for analytical expression for as a function of and .
We write for the constant spending rate on research post-fire alarm. We take where is the fraction of post fire-alarm spending. The system
has an analytical solution and we take
Similarly, for influence we take the corresponding constant post-fire-alarm spending rate and the system
where the competition factor at the start of the fire alarm period is chosen by the user to be either a constant or a function of time. Note that the competition factor is a constant in the differential equation, so the system has an analytical solution analogous to that of the research system above. Again we take the resulting value. Note that the user can specify that no fire alarm occurs: setting the expected fraction of spendable money to zero implies there is no post-fire-alarm spending, so the fire-alarm-adjusted stocks equal the current stocks.
Our preparedness is given by
Preparedness is a constant elasticity of substitution (CES) production function of the fire-alarm-adjusted research and influence stocks, where the user chooses the share and substitutability parameters.
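For reference, a standard CES form consistent with this description is the following, where $P_t$, $R_t$, $I_t$, $\alpha$ and $\rho$ are our placeholder symbols for preparedness, the fire-alarm-adjusted stocks, the share parameter and the substitutability parameter respectively (the notebook’s own notation may differ):

```latex
% Placeholder notation
P_t = \left(\alpha R_t^{\rho} + (1 - \alpha) I_t^{\rho}\right)^{1/\rho}, \qquad \rho \le 1.
% \rho = 1: perfect substitutes.
% \rho \to 0: Cobb-Douglas, P_t = R_t^{\alpha} I_t^{1-\alpha}.
% \rho \to -\infty: perfect complements, P_t = \min(R_t, I_t).
```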
Conditioning on AGI happening at time , we take the probability of AGI being safe as
This is a logistic function with constants and determined by the user’s beliefs about the difficulty of making AGI safe.
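One standard way to write such a logistic curve, again in our placeholder notation, with location and steepness constants $a$ and $b$ chosen from the user’s beliefs about difficulty, is:

```latex
% Placeholder notation; a (location) and b (steepness) reflect beliefs about difficulty
\Pr(\text{AGI safe} \mid \text{AGI at } t) = \frac{1}{1 + e^{-b\,(P_t - a)}}.
```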
Our objective is to maximize the probability that AGI is safe. We have an objective function
Where is the user’s AGI timelines and is some discount rate.
We have initial conditions
We apply standard optimal control theory results.
We have Hamiltonian
Where are the costate variables.
The optimal spending schedule, if it exists, necessarily follows
We solve this boundary value problem using SciPy’s solve_bvp function and apply further optimisation methods to avoid local optima.
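To give a feel for the kind of boundary value problem this involves, here is a minimal sketch using SciPy’s solve_bvp on a deliberately simplified single-stock problem (maximise discounted log utility of spending, with capital growing at a fixed interest rate); the problem, symbols and values are ours and much simpler than the full research-influence system:

```python
import numpy as np
from scipy.integrate import solve_bvp

# Toy optimal-control problem: maximise  integral_0^T e^{-rho t} ln(c) dt
# subject to  dM/dt = r*M - c,  M(0) = M0,  M(T) = 0.
# Pontryagin: H = e^{-rho t} ln(c) + lam*(r*M - c)
#   dH/dc = 0      =>  c = e^{-rho t} / lam
#   dlam/dt = -dH/dM = -r*lam
r, rho, M0, T = 0.085, 0.01, 1.0, 30.0

def odes(t, y):
    M, lam = y
    c = np.exp(-rho * t) / lam        # optimal spending from the first-order condition
    return np.vstack([r * M - c,      # state equation
                      -r * lam])      # costate equation

def bc(ya, yb):
    return np.array([ya[0] - M0,      # start with all our capital
                     yb[0]])          # spend it down by the horizon

t_guess = np.linspace(0, T, 50)
y_guess = np.vstack([M0 * (1 - t_guess / T),   # capital declining linearly
                     T * np.exp(-r * t_guess)]) # rough guess for the costate
sol = solve_bvp(odes, bc, t_guess, y_guess)

spend = np.exp(-rho * sol.x) / sol.y[1]
print(sol.status, sol.message)
print("Initial optimal spending rate (fraction of capital per year):", round(spend[0] / M0, 3))
```

The full model has more state and costate variables, and the notebook applies further optimisation steps on top of a solver like this to escape local optima.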
The model is a Python Notebook accessible on Google Colaboratory here.
Any cells that contain “User guide” are for assisting with the running of the notebook.
Below the initial instructions, you will find the user input parameters.
In the next section of this document we describe the parameters in detail and our own guesses.
We discuss the parameters in the same order as in the notebook.
Note, the estimates given are from Tristan and not necessarily endorsed by Guillaume.
Epistemic status: I’ve spent at least five minutes thinking about each, sometimes no more.
We elicit user timelines using two points on the cumulative distribution function and fit a log-normal distribution to them.
We note Metaculus’ Date of Artificial General Intelligence community prediction: as of 2022-10-06, lower 25% 2030, median 2040 and upper 75% 2072.
Note that since our log-normal distribution is parameterised by two pairs of (year, probability by year), the three distinct Metaculus interquartile pairings will give different distributions.
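As a small illustration of this elicitation (the variable names and the 2022 reference year are ours), one can recover the log-normal’s parameters from two CDF points, here the Metaculus-style 25%-by-2030 and median-2040 pair:

```python
import numpy as np
from scipy.stats import norm, lognorm

# Two points on the CDF of (years from 2022 until AGI): P(X <= x1) = p1, P(X <= x2) = p2
(x1, p1), (x2, p2) = (2030 - 2022, 0.25), (2040 - 2022, 0.50)

# For a log-normal, ln X ~ Normal(mu, sigma), so ln x = mu + sigma * Phi^{-1}(p)
z1, z2 = norm.ppf(p1), norm.ppf(p2)
sigma = (np.log(x2) - np.log(x1)) / (z2 - z1)
mu = np.log(x1) - sigma * z1

timeline = lognorm(s=sigma, scale=np.exp(mu))
print("P(AGI by 2050):", round(timeline.cdf(2050 - 2022), 3))
print("Median arrival year:", round(2022 + timeline.ppf(0.5), 1))
```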
The discount rate needs to factor in both non-AGI existential risks as well as catastrophic (but non-existential) risks that preclude our AI work from being useful or any unknown unknowns that have some per year risk.
We choose implying an AGI success in 2100 is worth as much as a win today. As we discuss in the limitations section, the discount rate can also account for other people making AGI successful, though this interpretation of is not unproblematic.
Of relevance may be:
Our 90% confidence interval for is
As of 2022-10-06, Forbes estimates Dustin Moskovitz and Sam Bankman-Fried have wealth of $8,200m and $17,000m respectively. Todd (2021) estimates $7,100m from other sources in 2021, giving a total of $32,300m within the effective altruism community.
How much of this is committed to AI safety interventions?
Open Philanthropy has spent $157m on AI-related interventions, out of approximately $1,500m spent so far. Supposing that roughly 15% of all effective altruism funding is similarly committed to AGI risk interventions gives at least $4,000m (15% of ~$32,300m is roughly $4,800m).
Our 90% confidence interval is .
We suppose that we are currently at some interest rate
Supposing the movement had $10,000m in 2015 and $32,300m in mid-2022, money in the effective altruism community has grown 21% per year.
We take . Our 90% confidence interval is .
Historical S&P returns are around 8.5%. There are reasons to think the long-term rate may be higher - such as an increase in growth due to AI capabilities - or lower - for example, there is selection bias in choosing a successful index. We take . Our 90% confidence interval is .
Our 90% confidence interval is .
Influence
The constant controls the marginal returns to spending on influence. For we receive diminishing marginal returns.
The top fraction of spending per year on influence leads to fraction of increase in growth of influence in that year. For example, implies the top 20% of spending leads to roughly 80% of returns i.e. the Pareto principle.
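For instance, with a power-law returns function of the form spending^lambda (our arithmetic, not necessarily the post’s chosen value), the exponent matching the Pareto principle, and its implication for a doubling of spending, can be checked directly:

```python
import numpy as np

# Exponent such that spending only the top 20% of the budget yields ~80% of the returns
lam = np.log(0.8) / np.log(0.2)
print(round(lam, 3))         # ~0.139
print(round(0.2 ** lam, 3))  # ~0.8  (returns from the top 20% of spending)
print(round(2 ** lam, 3))    # ~1.10 (growth multiple from doubling spending)
```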
We note that influence spending can span many orders of magnitude and this suggests reason to think there are high diminishing returns (i.e. low ). For example, one community builder may cost on the order of per year, but investing in AI labs with the purpose of influencing their decisions may cost on the order of per year.
We take , which implies doubling spending on influence leads to times more growth of influence. Our 90% confidence interval is .
Research
The constant acts in the same way for research as does for influence.
We take , which implies a doubling of research spending leads to times increase in research growth and that 20% of the spending in research accounts for of the increase in research growth.
Potential sources for estimating include using the distribution of karma on the Alignment Forum, citations in journals or estimates of researchers’ outputs.
Our 90% confidence interval is .
Influence
For the price is constant.
On balance, we think the former reasons outweigh the latter, and so take . This implies that after a doubling of influence, one unit of spending on influence leads to times more growth in influence than it would have without the doubling. Our 90% confidence interval is .
Research
The corresponding constant acts in the same way for research as its counterpart does for influence.
We are uncertain about the net effect of the above contributions, and so take . Our 90% confidence interval is .
Influence
We take , which implies a half-life of around years. Our 90% confidence interval is
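For reference, under exponential depreciation at rate $\delta$ (our notation, and an example value of our choosing), the implied half-life is:

```latex
t_{1/2} = \frac{\ln 2}{\delta}, \qquad \text{e.g. } \delta = 0.05 \;\Rightarrow\; t_{1/2} \approx 13.9 \text{ years}.
```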
Research
We expect research to depreciate over time. Research can depreciate by, for example, becoming outdated as capabilities advance.
One intuition pump is to ask what fraction of research on current large language models will be useful if AGI does not come until 2050? We guess on the order of 1% to 30%, implying - if all our research was on large language models - a value of between and . Note that for , such research can be instrumentally useful for later years due to its ability to make future work cheaper by, for example, attracting new talent.
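Our own back-of-the-envelope version of this intuition pump, assuming exponential depreciation and treating 2050 as roughly 28 years away:

```latex
% If a fraction f of today's research is still useful after n years,
% then e^{-\delta n} = f, so \delta = -\ln(f)/n  (notation ours).
f = 0.01,\ n = 28 \;\Rightarrow\; \delta \approx 0.16; \qquad
f = 0.30,\ n = 28 \;\Rightarrow\; \delta \approx 0.04.
```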
We take . Our 90% confidence interval is .
We allow for influence to become more expensive over time. The primary mechanism we can see is due to (a) competition with other groups that want to influence AI developers, and (b) competition within the field of AI capabilities, such that there are more organisations that could potentially develop AGI.
We suppose the influence per unit spending decreases over time following some S-shape curve, and ask for three points on this curve.
The first data point is the first year in which money was spent on influence. Since one can consider community building or spreading AI risk ideas (particularly among AI developers) as a form of influence, the earliest year of spending is unclear. We take 2015 (the first year Open Philanthropy made grants in this area). The relative cost of influence is set to 1 in this year.
We then require two further years, as well as the influence per unit spending in those years relative to the first year of spending.
Our best guess is (2017, 0.9) - that is, in 2017 one received 90% as much influence per unit spending as one would have done in 2015 - and (2022, 0.6).
The final input is the minimum influence per unit spending that will eventually be reached; we take this to be 0.02. That is, influence will eventually be 50 times more expensive per unit than it was in 2015. Our 90% confidence interval is (0.001, 0.1).
The model uses this data to calculate the quantities of research and influence we have now. Rough estimates are sufficient.
The Singularity Institute (now MIRI) was founded in 2000 and switched to work on AI safety in 2005. We take 2005 as the first year of spending on research.
Open Philanthropy has donated $243.5m to risks from AI since its first grant in the area in August 2015. We very roughly categorised each grant by research : influence fraction, and estimate that $132m has been spent on research and $111m on influence. We suppose that Open Philanthropy has made up two-thirds of the overall spending, giving totals of $198m and $167m.
We guess that the research spending, which started in 2005, has been growing at 25% per year and influence spending has been growing 40% per year since starting in 2015.
By default, in the results we show, we assume no fire alarm in the model. This is achievable by setting the expected fraction of money spendable to 0. When considering the existence of a fire alarm, we take the following values.
For the fire alarm duration we ask: supposing that the leading AI lab has reached an AGI system that they are not deploying out of safety concerns, how far behind is a less safety-conscious lab? We guess this period to be half a year.
Our 90% confidence interval for this period, if it exists, is (one month, two years).
One may think that the expected fraction of money spendable during the fire alarm is less than 1 for reasons such as
We take 0.1 as the expected fraction of money spendable with 90% confidence interval (0.01, 0.5).
During the fire alarm period, we enter a phase with no appreciation or depreciation of research or influence, and separate marginal-returns-to-spending parameters can apply.
Some reasons to think (worse returns during the fire alarm period)
Some reasons to think the (better returns during the fire alarm period)
We expect that the returns to research spending will be very low, and take , implying that the amount of research we can do in the post-fire-alarm period is not very dependent on the money we have.
We expect that returns to influence spending will be lower than in the period before. We take .
In the fire alarm phase, the cost per unit of research and influence can also change depending on the amount we already have.
We expect and . That is, during the fire alarm period it is even cheaper to get influence once you already have some than before this phase, and this effect is greater during the fire alarm period (the first inequality). If there is panic, it seems people will be looking for trustworthy organisations to defer to and to execute a plan.
We expect that, during the fire alarm period, the amount of existing research decreases the cost of new research.
During the fire alarm period, it seems likely that only a few highly skilled researchers - perhaps within the AI lab - will have access to the information and tools necessary to conduct further useful research. The research at this point is likely highly serial: the researchers trying to focus on the biggest problems. Existing research may allow these few researchers to build on existing work effectively.
We take both and to be 0.3, implying that a doubling of research before the fire-alarm period increases the stock output during the fire-alarm period by times.
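With the stated exponent of 0.3, the implied factor (our arithmetic) is:

```latex
2^{0.3} \approx 1.23.
```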
Preparedness at time is a function of the fire-alarm-adjusted research and influence that takes two parameters: a share parameter that controls the relative importance of research and influence, and a substitutability parameter that controls how substitutable research and influence are.
Decreasing this parameter decreases the substitutability of research and influence. In the limit, our preparedness can be entirely bottlenecked by whichever stock we have the least of (weighted by the share parameter).
A value of zero gives the Cobb-Douglas production function, though to avoid a case-by-case treatment in the code, you cannot set the parameter to exactly zero and can instead choose a value close to zero.
Again, we recommend picking values and running the cell to see the graph. We choose . We think that the problem is mainly a technical problem, but in practice it cannot be solved without influencing AI developers.
The probability of success at time
The first input is the probability of success if AGI arrived this year, given our existing stocks of research and influence; this input doesn’t consider any fire-alarm spending. The second input determines the steepness of the S-shaped curve.
We take (10%, 0.15).
A note on the probability of success
Suppose you think we are in one of the following three worlds
Then you should give your inputs as they are in your world B model. We keep the probability-of-success curve between 0 and 1, but one could linearly transform it to be greater than the probability of success in the A world and less than the probability of success in the C world. Since the objective function is linear in the probability of success, such a transformation has no effect on the optimal spending schedule.
In an alternate model, we suppose the funder has a stock of things that grow which includes things such as skills, people, some types of knowledge and trust. They can choose to spend this stock at some rate to produce a stock of things that depreciate that are immediately helpful in increasing the probability of success. This could comprise things such as the implementation of safety ideas on current systems or the product of asking for favours of AI developers or policymakers.
We say that spending capacity to create is crunching and the periods with high are crunch time.
The probability we succeed at time is a function of which is plus any last-minute spending that occurs with a fire alarm. Specifically, it is another S-shaped curve.
The alternate model is defined by the following pieces (equations omitted): the time evolution of things that grow; the time evolution of things that depreciate; the post-fire-alarm total of things that depreciate; the probability of success given AGI at a given time; and the objective function.
Recall that is the expected fraction of money spendable post-fire alarm and is the expected duration of the fire alarm. The equation for is thus simply the result of spending at rate for duration .
The alternate model shares the following parameters and machinery with the research-influence model
The new inputs include
We expect the growth in capacity to decrease over time since some of our capacity is money and the same reasons will apply as in the former model. We suppose , and .
The factor which in the former model controlled how influence becomes more expensive over time here controls how the cost of doing direct work becomes more expensive over time. Only some spending on direct work is in competitive domains (such as influencing AI developers), while some is non-competitive, such as implementing safety features in state-of-the-art systems.
We suppose that has a minimum 0.5 and otherwise has the same factors as in the former research-influence model.
This controls the degree of diminishing returns to ‘crunching’. For reasons similar to those given for and in the main model, we take . Our 90% confidence interval is .
This controls how long our crunch time activities are useful for, i.e. the speed of depreciation. We take , which implies that after one year a given fraction of the direct work is still useful.
To derive the constants used in the S-shaped curve, we ask for the probability of success after some hypothetical scenario in which we've spent some fraction of our capacity for one year.
Our guess is that after spending half of our resources this year, we’d have a 25% chance of alignment success if AGI arrived this year. Note that this input does not account for any post-fire-alarm spending.
Unsurprisingly we see that we should spend our capacity of things-that-grow most around the time we expect AGI to appear. For the 2040 and 2050 timelines, this implies spending very little on things that depreciate, up to around 3% a year. For 2030 timelines, we should be spending between 5 and 10% of our capacity each year on ‘crunching’ for the arrival of AGI. Further, for all results, we begin maximum crunching after the modal AGI arrival date, which is understandable while the rate of growth of the movement is greater than the rate of decrease in probability of AGI (times the discount factor).
This result is relatively sensitive to the probability we think AGI will appear in the next few years. We fit a log-normal distribution to the AGI timeline, which leads to the probability of AGI arriving in the next few years being small. Considering a probability distribution that gave some weight to AGI in the next few years would inevitably imply a higher initial spending rate, though likely a similar overall spending schedule in sufficiently many years’ time.
Difficulty of AGI success | Median AGI arrival: 203038 | Median AGI arrival: 204039 | Median AGI arrival: 205040 |
Easy41 | | | |
Medium42 | | | |
Hard43 | | | |
Many of the limitations we describe apply to both models.
For example, there is no exogenous increase in the stock of things that depreciate, which we may expect if other actors work on AI risk at some time in the future. One could, for example, adjust the cost factor so that spending on direct work yields more units per unit of spending in the future, due to others also working on the problem.
As in the first model, our work and AI capabilities are independent. One could, again, use the cost factor to model direct work becoming cheaper as time goes on and new AI capabilities are developed.
Added 2022-11-29, see discussion here
Here I consider the most robust spending policies, supposing uncertainty over nearly all parameters in the main model44 rather than point estimates. I find that the community’s current spending rate on AI risk interventions is too low.
My distributions over the model parameters imply that
I recommend entering your own distributions for the parameters in the Python notebook here46. Further, these preliminary results use few samples: more reliable results would be obtained with more samples (and more computing time).
I allow for post-fire-alarm spending (i.e., we are certain AGI is soon and so can spend some fraction of our capital). Without this feature, the optimal schedules would likely recommend a greater spending rate.
The results from a simple optimiser47, when allowing for four spending regimes: 2022-2027, 2027-2032, 2032-2037 and 2037 onwards. This result should not be taken too seriously: more samples should be used, the optimiser should be run for more steps, and more intervals should be used. As with other results, this is contingent on the distributions of parameters. |
This short extension started due to a conversation with David Field and a comment from Vasco Grilo; I’m grateful to both for the suggestion.
Tristan and Guillaume defined the problem, designed the model and its numerical resolution, interpreted the results, wrote and reviewed the article. Tristan coded the Python notebook and carried out the numerical computations with feedback from Guillaume. Tristan designed, coded, solved the alternate model and interpreted its results.
We’d both like to thank Lennart Stern and Daniel Kokotajlo for their comments and guidance during the project. We’re grateful to John Mori for comments.
Guillaume thanks the SERI summer fellowship 2021 where this project started with some excellent mentorship from Lennart Stern, the CEEALAR organisation for a stimulating working and living environment during the summer 2021 and the CLR for providing funding to support part-time working with Tristan to make substantial progress on this project.
The post The optimal timing of spending on AGI safety work; why we should probably be spending more now appeared first on Center on Long-Term Risk.
The post When is intent alignment sufficient or necessary to reduce AGI conflict? appeared first on Center on Long-Term Risk.
In the previous post, we outlined possible causes of conflict and directions for intervening on those causes. Many of the causes of conflict seem like they would be addressed by successful AI alignment. For example: if AIs acquire conflict-prone preferences from their training data when we didn’t want them to, that is a clear case of misalignment. One of the suggested solutions, improving adversarial training and interpretability, just is alignment research, albeit directed at a specific type of misaligned behavior. We might naturally ask, does all work to reduce conflict risk follow this pattern? That is, is intent alignment sufficient to avoid unendorsed conflict?
Intent Alignment isn't Sufficient is a claim about unendorsed conflict. We’re focusing on unendorsed conflict because we want to know whether technical interventions on AGIs to reduce the risks of conflict make a difference. These interventions mostly make sense for preventing conflict that isn’t desired by the overseers of the systems. (If the only conflict between AGIs is endorsed by their overseers, then conflict reduction is a problem of ensuring that AGI overseers aren’t motivated to start conflicts.)
Let H be a human principal and A be its AGI agent. “Unendorsed” conflict, in our sense, is conflict which would not have been endorsed on reflection by H at the time A was deployed. This notion of “unendorsed” is a bit complicated. In particular, it doesn’t just mean “not endorsed by a human at the time the agent decided to engage in conflict”. We chose it because we think it should include the following cases:
We’ll use Evan Hubinger’s decomposition of the alignment problem. In Evan’s decomposition, an AI is aligned with humans (i.e., doesn’t take any actions we would consider bad/problematic/dangerous/catastrophic) if it is intent-aligned and capability robust. (An agent is capability robust if it performs well by its own lights once it is deployed.) So the question for us is: What aspects of capability robustness determine whether unendorsed conflict occurs, and will these be present by default if intent alignment succeeds?
Let’s decompose conflict-avoiding “capability robustness” (the capabilities necessary and sufficient for avoiding unendorsed conflict) into two parts:
Two conditions need to hold for unendorsed conflict to occur if the AGIs are intent aligned (summarized in Figure 1): (1) the AIs lack some cooperative capability or have misunderstood their overseer’s cooperation-relevant preferences, and (2) conflict is not prevented by the AGI consulting with its overseer.
These conditions may sometimes hold. In the next section, we list scenarios in which consultation with overseers would fail to prevent conflict. We then look at “conflict-causing capabilities failures”.
One reason to doubt that intent-aligned AIs will engage in unendorsed conflict is that these AIs should be trying to figure out what their overseers want. Whenever possible, and especially before taking any irreversible action like starting a destructive conflict, the AI should check whether its understanding of overseer preferences is accurate. Here are some reasons why we still might see catastrophic decisions, despite this1:
Let’s return to our causes of conflict and see how intent-aligned AGIs might fail to have the capabilities necessary to avoid unendorsed conflict due to these factors.
We break cooperation-relevant preferences into “object-level preferences” (such as how bad a particular conflict would be) and “meta-preferences” (such as how to reflect about how one wants to approach complicated bargaining problems).
One objection to doing work specific to reducing conflict between intent-aligned AIs now is that this work can be deferred to a time when we have highly capable and aligned AI assistants. We’d plausibly be able to do technical research drastically faster then. While this is a separate question from whether Intent Alignment isn't Sufficient, it is an important objection to conflict-specific work, so we briefly address it here.
Some reasons we might benefit from work on conflict reduction now, even in worlds where we get intent-aligned AGIs, include:
Still, the fact that intent-aligned AGI assistants may be able to do much of the research on conflict reduction that we would do now has important implications for prioritization. We should prioritize thinking about how to use intent-aligned assistants to reduce the risks of conflict, and deprioritize questions that are likely to be deferrable.
On the other hand, AI systems might be incorrigibly misaligned before they are in a position to substantially contribute to research on conflict reduction. We might still be able to reduce the chances of particularly bad outcomes involving misaligned AGI, without the help of intent-aligned assistants.
Whether or not Intent Alignment isn't Sufficient to prevent unendorsed conflict, we may not get intent-aligned AGIs in the first place. But it might still be possible to prevent worse-than-extinction outcomes resulting from an intent-misaligned AGI engaging in conflict. On the other hand, it seems difficult to steer a misaligned AGI’s conflict behavior in any particular direction.
Coarse-grained interventions on AIs’ preferences to make them less conflict-prone seem prima facie more likely to be effective given misalignment than trying to make more fine-grained interventions on how they approach bargaining problems (such as biasing AIs towards more cautious reasoning about commitments, as discussed previously). Let’s look at one reason to think that coarse-grained interventions on misaligned AIs’ preferences may succeed and thus that Intent Alignment isn't Necessary.
Assume that at some point during training, the AI begins 'playing the training game'. Some time before it starts playing the training game, it has started pursuing a misaligned goal. What, if anything, can we predict about the conflict-proneness of this goal from the AI’s training data?
A key problem is that there are many objective functions such that trying to optimize is consistent with good early training performance, even if the agent isn’t playing the training game. However, we may not need to predict in much detail to know that a particular training regime will tend to select for more or less conflict-prone . For example, consider a 2-agent training environment and let be agent ’s reward signal. Suppose we have reason to believe that a training process selects for spiteful agents, that is, agents who act as if optimizing for on the training distribution.2 This gives us reason to think that agents will learn to optimize for for some objectives correlated with on the training distribution. Importantly, we don’t need to predict to worry that agent 1 will learn a spiteful objective.3
Concretely, imagine an extension of the SmartVault example from the ELK report, in which multiple SmartVault reporters are trained in a shared environment. And suppose that the human overseers iteratively select the SmartVault system that gets the highest reward out of several in the environment. This creates incentives for the SmartVault systems to reduce each other’s reward. It may lead to them acquiring a terminal preference for harming (some proxy for) their counterpart’s reward. But this reasoning doesn’t rest on a specific prediction about what proxies for human approval the reporters are optimizing for. As long as SmartVault1 is harming some good proxy for SmartVault2’s approval, it will be more likely to be selected. (Again, this is only true because we are assuming that the SmartVaults are not yet playing the training game.)
What this argument shows is that choosing not to reward SmartVault1 or 2 competitively eliminates a training signal towards conflict-proneness, regardless of whether either is truthful. So there are some circumstances under which we might not be able to select for truthful reporters in the SmartVault but could still avoid selecting for agents that are conflict-prone.
Human evolution is another example. It may have been difficult for someone observing human evolution to predict precisely what proxies for inclusive fitness humans would end up caring about. But the game-theoretic structure of human evolution may have allowed them to predict that, whatever proxies for inclusive fitness humans ended up caring about, they would sometimes want to harm or help (proxies for) other humans’ fitness. And other-regarding human preferences (e.g., altruism, inequity aversion, spite) do still seem to play an important role in high-stakes human conflict.
The examples above focus on multi-agent training environments. This is not to suggest that multi-agent training, or training analogous to evolution, is the only regime in which we have any hope of intervening if intent alignment fails. Even in training environments in which a single agent is being trained, it will likely be exposed to “virtual” other agents, and these interactions may still select for dispositions to help or harm other agents. And, just naively rewarding agents for prosocial behavior and punishing them for antisocial behavior early in training may still be low-hanging fruit worth picking, in the hopes that this still exerts some positive influence over agents’ mesa-objective before they start playing the training game.
We’ve argued that Capabilities aren't Sufficient, Intent Alignment isn't Necessary and Intent Alignment isn't Sufficient, and therefore technical work specific to AGI conflict reduction could make a difference. It could still be that alignment research is a better bet for reducing AGI conflict. But we currently believe that there are several research directions that are sufficiently tractable, neglected, and likely to be important for conflict reduction that they are worth dedicating some portion of the existential AI safety portfolio to.
First, work on using intent-aligned AIs to navigate cooperation problems. This would involve conceptual research aimed at preventing intent-aligned AIs from locking in bad commitments or other catastrophic decisions early on, and preventing the corruption of AI-assisted deliberation about bargaining. One goal of this research would be to produce a manual for the overseers of intent-aligned AGIs with instructions on how to train their AI systems to avoid the failures of cooperation discussed in this sequence.
Second, research into how to train AIs in ways that don’t select for CPPs and inflexible commitments. Research into how to detect and select against CPPs or inflexible commitments could be useful (1) if intent alignment is solved, as part of the preparatory work to enable us to better understand what cooperation failures are common for AIs and how to avoid them, or (2) if intent alignment is not solved, it can be directly used to incentivise misaligned AIs to be less conflict-prone. This could involve conceptual work on mechanisms for preventing CPPs that could survive misalignment. It might also involve empirical work, e.g., to understand the scaling of analogs of conflict-proneness in contemporary language models.
There are several tractable directions for empirical work that could support both of these research streams. Improving our ability to measure cooperation-relevant features of foundation models, and carrying out these measurements, is one. Better understanding the kinds of feedback humans give to AI systems in conflict situations, and how to improve that feedback, is another. Finally, getting practice training powerful contemporary AI systems to behave cooperatively also seems valuable, for reasons similar to those given by Ajeya in The case for aligning narrowly superhuman models.
The post When is intent alignment sufficient or necessary to reduce AGI conflict? appeared first on Center on Long-Term Risk.
The post When would AGIs engage in conflict? appeared first on Center on Long-Term Risk.
First we’ll focus on conflict that is costly by the AGIs’ lights. We’ll define “costly conflict” as (ex post) inefficiency: There is an outcome that all of the agents involved in the interaction prefer to the one that obtains.1 This raises the inefficiency puzzle of war: Why would intelligent, rational actors behave in a way that leaves them all worse off than they could be?
We’ll operationalize “rational and intelligent” actors as expected utility maximizers.2 We believe that the following taxonomy of the causes of inefficient outcomes between rational actors is exhaustive, except for a few implausible edge cases. (We give the full taxonomy, and an informal argument that it is exhaustive, in the appendix.) That is, expected value maximization can lead to inefficient outcomes for the agents only if one of the following conditions (or an implausible edge case) holds. This taxonomy builds on Fearon’s (1995) influential “rationalist explanations for war”.3
Private information and incentives not to disclose. Here, “private information” means information about one’s willingness or ability to engage in conflict — e.g., how costly one considers conflict to be, or how strong one’s military is — about which other agents are uncertain. This uncertainty creates a risk-reward tradeoff: For example, Country A might think it’s sufficiently likely that Country B will give up without much of a fight that it’s worthwhile in expectation for A to fight B, even if they’ll end up fighting a war if they are wrong.
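A purely illustrative calculation of this risk-reward tradeoff (all numbers invented): suppose the contested prize is worth 1, a negotiated settlement would give Country A a share of 0.6, and fighting would give A an expected share of 0.5 of the prize minus destruction costs of 0.3. If A believes B will capitulate without fighting with probability q, then attacking beats the settlement in expectation when

```latex
\underbrace{q \cdot 1}_{\text{B backs down}} \;+\; \underbrace{(1 - q)\,(0.5 - 0.3)}_{\text{war}} \;>\; 0.6
\quad\Longleftrightarrow\quad q > 0.5,
```

so a sufficiently confident A rationally attacks even though the war branch leaves both sides worse off than the settlement would have.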
In these cases, removing uncertainty — e.g., both sides learning exactly how willing the other is to fight — opens up peaceful equilibria. This is why conflict due to private information requires “incentives not to disclose”. Whether there are incentives to disclose will depend on a few factors.
First, the technical feasibility of different kinds of verifiable disclosure matters. For example, if I have an explicitly-specified utility function, how hard is it for me to prove to you how much my utility function values conflict relative to peace?
Second, different games will create different incentives for disclosure. Sometimes the mere possibility of verifiable disclosure ends up incentivizing all agents to disclose all of their private information (Grossman 1981; Milgrom 1981). But in other cases, more sophisticated disclosure schemes are needed. For example: Suppose that an agent has some vulnerability such that unconditionally disclosing all of their private information would place them at a decisive disadvantage. The agents could then make copies of themselves, allow these copies to inspect one another, and transmit back to the agents only the private information that’s necessary to reach a bargain. (See the appendix and DiGiovanni and Clifton (2022) for more discussion of conditional information revelation and other conditions for the rational disclosure of conflict-relevant private information.)
For the rest of the sequence, we’ll use “informational problem” as shorthand for this mechanism of conflict.
Inability to credibly commit to peaceful settlement. Agents might fight even though they would like to be able to commit to peace. The Prisoner’s Dilemma is the classic example: Both prisoners would like to be able to write a binding contract to cooperate, but if they can’t, then the game-theoretically rational thing to do is defect.
Similarly, sometimes one agent will be tempted to launch a preemptive attack against another. For example, if Country A thinks Country B will soon become significantly more powerful, Country A might be tempted to attack Country B now. This could be solved with credible commitment: Country B could commit not to becoming significantly more powerful, or to compensate Country A for their weakened bargaining position. But without the ability to make such commitments, Country A may be rationally compelled to fight.
Another example is randomly dividing a prize. Suppose Country A and Country B are fighting over an indivisible holy site. They might want to randomly allocate the holy site to one of them, rather than fighting. The problem is that, once the winner has been decided by the random lottery, the loser has no reason to concede rather than fight, unless they have some commitment in place to honor the outcome of the lottery.
For the rest of the sequence, we’ll use “commitment inability problem”4 as shorthand for this mechanism of conflict.
Miscoordination. When there are no informational or commitment inability problems, and agents’ preferences aren’t entirely opposed (see below), there will be equilibria in which agents avoid conflict. But the existence of such equilibria isn’t enough to guarantee peace, even between rational agents. Agents can still fail to coordinate on a peaceful solution.
A central example of catastrophic conflict due to miscoordination is incompatible commitments. Agents may make commitments to accepting only certain peaceful settlements, and otherwise punishing their counterpart. This can happen when agents have uncertainty about what commitments their counterparts will make. Depending on what you think about the range of outcomes your counterpart has committed to demanding, you might commit to a wider or narrower range of demands. There are situations in which the agents’ uncertainty is such that the optimal thing for each of them to do is commit to a narrow range of demands, which end up being incompatible. See this post on “the commitment races problem” for more discussion.
One reason for optimism about AGI conflict is that AGIs may be much better at credible commitment and disclosure of private information. For example, AGIs could make copies of themselves and let their counterparts inspect these copies until they are satisfied that they understand what kinds of commitments their counterpart has in place. Or, to credibly commit to a treaty, AGIs could do a “value handshake” and build a successor AGI system whose goal is to act according to the treaty. So, what are some reasons why AGIs would still engage in conflict, given these possibilities? Three stand out to us:
Strategic pressures early in AGI takeoff. Consider AGI agents that are opaque to one another, but are capable of self-modifying / designing successor agents who can implement the necessary forms of disclosure. Would such agents ever fail to implement these solutions? If, say, designing more transparent successor agents is difficult and time-consuming, then agents might face a tradeoff between trying to implement more cooperation-conducive architectures and placing themselves at a critical strategic disadvantage. This seems most plausible in the early stages of a multipolar AI takeoff.
Lack of capability early in AGI takeoff. Early in a slow multipolar AGI takeoff, pre-AGI AIs or early AGIs might be capable of starting destructive conflicts but lack the ability to design successor agents, scrutinize the inner workings of opponent AGIs, or reflect on their own cognition in ways that would let them anticipate future conflicts. If AGI capabilities come in this order, such that the ability to launch destructive conflicts comes a while before the ability to design complete successor agents or self-modify, then early AGIs may not be much better than humans at solving informational and commitment problems.
Fundamental computational limits. There may be fundamental limits on the ability of complex AGIs to implement the necessary forms of verifiable disclosure. For example, in interactions between complex AGI civilizations in the far future, these civilizations’ willingness to fight may be determined by factors that are difficult to compress. (That is, the only way in which you can find out how willing to fight they are is to run expensive simulations of what they would do in different hypothetical scenarios.) Or it may be difficult to verify that the other civilization has disclosed their actual private information.
These considerations apply to informational and commitment inability problems. But there is also the problem of incompatible commitments, which is not solved by sufficiently strong credibility or disclosure ability. Regardless of commitment or disclosure ability, agents will sometimes have to make commitments under uncertainty about others’ commitments.
Still, the ability to make conditional commitments could help to ameliorate the risks from incompatible commitments. For example, agents could have a hierarchy of conditional commitments of the form: “If our -order commitments are incompatible, try resolving these via an -order bargaining process.” See also safe Pareto improvements, a particular kind of failsafe for incompatible commitments, which (in the version in the linked paper) relies on strong commitment and disclosure ability.
Another way conflict can be rational is if conflict actually isn’t costly for at least one agent, i.e., there isn’t any outcome that all parties prefer to conflict. That is, Conflict isn't Costly is false. Some ways this could happen:
These cases, in which conflict is literally costless for one agent, are prima facie quite unlikely. They are extremes on a spectrum of what we’ll call conflict-prone preferences (CPPs). By shrinking the range of outcomes agents prefer to conflict, these preferences exacerbate the risks of conflict due to informational, commitment inability, or miscoordination problems. For instance, risk-seeking preferences will lead to a greater willingness to risk losses from conflict (see Shulman (2010) for some discussion of implications for conflict between AIs and humans). And spite will make conflict less subjectively costly, as the material costs that a conflict imposes on a spiteful agent are partially offset by the positively-valued material harms to one’s counterpart.
We argued above that Capabilities aren’t Sufficient. AGIs may sometimes engage in conflict that is costly, even if they are extremely capable. But it remains to be seen whether Intent Alignment isn't Sufficient to prevent unendorsed conflict, or Intent Alignment isn't Necessary to reduce the risk of conflict. Before we look at those claims, it may be helpful to review some approaches to AGI design that might reduce the risks reviewed in the previous section. In the next post, we ask whether these interventions are redundant with work on AI alignment.
Let’s look at interventions directed at each of the causes of conflict in our taxonomy.
Informational and commitment inability problems. One could try to design AGIs that are better able to make credible commitments and better able to disclose their private information. But it’s not clear whether this reduces the net losses from conflict. First, greater commitment ability could increase the risks of conflict from incompatible commitments. Second, even if the risks of informational and commitment inability problems would be eliminated in the limit of perfect commitment and disclosure, marginal increases in these capabilities could worsen conflict due to informational and commitment inability problems. For example, increasing the credibility of commitments could embolden actors to commit to carrying out threats more often, in a way that leads to greater losses from conflict overall.6
Miscoordination. One direction here is building AGIs that reason in more cautious ways about commitment, and take measures to mitigate the downsides from incompatible commitments. For example, we could develop instructions for human overseers as to what kinds of reasoning about commitments they should encourage or discourage in their (intent-aligned) AI. The design of this “overseer’s manual” might be improved by doing more conceptual thinking about sophisticated approaches to commitments.7 Examples of this kind of work include Yudkowsky’s solution for bargaining between agents with different standards of fairness; surrogate goals and Oesterheld and Conitzer’s safe Pareto improvements; and Stastny et al.’s notion of norm-adaptive policies. It may also be helpful to consider what kinds of reasoning about commitments we should try to prevent altogether in the early stages of AGI development.
The goal of such work is not necessarily to fully solve tricky conceptual problems in bargaining. One path to impact is to improve the chances that early human-intent-aligned AI teams are in a “basin of attraction of good bargaining”. The initial conditions of their deliberation about how to bargain should be good enough to avoid locking in catastrophic commitments early on, and to avoid path-dependencies which cause deliberation to be corrupted. We return to this in our discussion of Intent Alignment isn't Sufficient.
Lastly, surrogate goals are a proposal for mitigating the downsides of executed threats, which might occur due to either miscoordination or informational problems. The idea is to design an AI to treat threats to carry out some benign action (e.g., simulating clouds) the same way that it treats threats against the overseer’s terminal values.
Conflict-prone preferences. Consider a few ways in which AI systems might acquire CPPs. First, CPPs might be strategically useful in some training environments. Evolutionary game theorists have studied how CPPs like spite (Hamilton 1970; Possajennikov 2000; Gardner and West 2004; Forber and Smead 2016) and aggression towards out-group members (Choi and Bowles 2007) can be selected for. Analogous selection pressures could appear in AI training.8 For example, selection for the agents that perform the best relative to opponents creates similar pressures to the evolutionary pressures hypothetically responsible for spite: Agents will have reason to sacrifice absolute performance to harm other agents, so that they can increase the chances that their relative score is highest. So, identifying and removing training environments which incentivize CPPs (while not affecting agents’ competitiveness) is one direction for intervention.
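As a minimal illustration of how rewarding relative performance can select for spite, consider the following sketch. The payoff numbers and the replicator-style update are illustrative assumptions, not taken from the papers cited above.

```python
# Minimal sketch (illustrative payoffs): selecting on relative rather than
# absolute performance can favour spite.
#
# Two strategies meet pairwise. "Spiteful" pays 1 to destroy 3 of its partner's
# payoff; "Harmless" does nothing. The baseline payoff of the interaction is 10.

PAYOFF = {
    ("spite", "spite"): 6, ("spite", "harmless"): 9,
    ("harmless", "spite"): 7, ("harmless", "harmless"): 10,
}

def final_spite_share(relative, x0=0.5, steps=2000, lr=0.01):
    """Replicator-style dynamics on the share x of spiteful agents.
    If `relative` is True, fitness is own payoff minus the opponent's payoff,
    mimicking training that rewards beating one's opponent."""
    x = x0
    for _ in range(steps):
        def fit(s, o):
            return PAYOFF[(s, o)] - (PAYOFF[(o, s)] if relative else 0)
        f_s = x * fit("spite", "spite") + (1 - x) * fit("spite", "harmless")
        f_h = x * fit("harmless", "spite") + (1 - x) * fit("harmless", "harmless")
        avg = x * f_s + (1 - x) * f_h
        x = min(max(x + lr * x * (f_s - avg), 0.0), 1.0)
    return x

print("absolute fitness:", round(final_spite_share(relative=False), 3))  # ~0.0: spite dies out
print("relative fitness:", round(final_spite_share(relative=True), 3))   # ~1.0: spite takes over
```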
Second, CPPs might result from poor generalization from human preference data. An AI might fail to correct for biases that cause a human to behave in a more conflict-conducive way than they would actually endorse, for instance. Inferring human preferences is hard. It is especially difficult in multi-agent settings, where a preference-inferrer has to account for a preference-inferree’s models of other agents, as well as biases specific to mixed-motive settings.9
Finally, a generic direction for preventing CPPs is developing adversarial training and interpretability methods tailored to rooting out conflict-prone behavior.
Here we present our complete taxonomy of causes of costly conflict between rational agents, and give an informal argument that it is exhaustive. Remember that by “rational” we mean “maximizes subjective expected utility” and by “costly conflict” we mean “inefficient outcome”.
For the purposes of the informal exposition, it will be helpful to distinguish between equilibrium-compatible and equilibrium-incompatible conflict. Equilibrium-compatible conflicts are those which are naturally modeled as occurring in some (Bayesian) Nash equilibrium. That is, we can model them as resulting from agents (i) knowing each others’ strategies exactly (modulo private information) and (ii) playing a best response. Equilibrium-incompatible conflicts cannot be modeled in this way. Note, however, that the equilibrium-compatible conditions for conflict can hold even when agents are not in equilibrium.
This breakdown is summarized as a fault tree diagram in Figure 3.
Here is an argument that items 1a-1c capture all games in which conflict occurs in every (Bayesian) Nash equilibrium. To start, consider games of complete information. Some complete information games have only inefficient equilibria. The Prisoner’s Dilemma is the canonical example. But we also know that any efficient outcome that is achievable by some convex combination of strategies and is better for each player than what they can unilaterally guarantee themselves can be attained in equilibrium, when agents are capable of conditional commitments to cooperation and correlated randomization (Kalai et al. 2010). This means that, for a game of complete information to have only inefficient equilibria, it has to be the case that either the players are unable to make credible conditional commitments to an efficient profile (1b) or an efficient and individually rational outcome is only attainable with randomization (because the contested object is indivisible), but randomization isn’t possible (1c).
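As a quick illustration of case 1c, consider a toy indivisible prize (all numbers are illustrative assumptions):

```python
# Minimal sketch (illustrative numbers): an indivisible prize for which every
# deterministic peaceful outcome fails, but a joint lottery succeeds (case 1c).

c = 0.2                                    # each side's cost of fighting
war = (0.5 - c, 0.5 - c)                   # expected payoffs from fighting over a prize worth 1

print("fight:    ", war)                   # (0.3, 0.3): inefficient, 2c is destroyed
print("concede:  ", (0.0, 1.0))            # efficient, but 0.0 < 0.3, so the conceder prefers war
print("coin flip:", (0.5, 0.5))            # efficient and individually rational, but requires
                                           # a joint randomization device over who gets the prize
```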
Even if efficiency in complete information is always possible given commitment and randomization ability, players might not have complete information. It is well-known that private information can lead to inefficiency in equilibrium, due to agents making risk-reward tradeoffs under uncertainty about their counterpart’s private information (1a). It is also necessary that agents can’t or won’t disclose their private information — we give a breakdown of reasons for nondisclosure below.
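Case 1a can likewise be illustrated with a standard one-sided screening model. In the sketch below, the uniform type distribution, the cost parameter, and the take-it-or-leave-it protocol are all illustrative assumptions:

```python
# Minimal sketch (a textbook screening setup with made-up numbers): with
# one-sided private information, the uninformed side rationally accepts a
# positive probability of war.
#
# A pie of size 1 is divided. Player 2's probability p of winning a war is
# private, uniform on [0, 1]; war costs each side C. Player 1 makes a
# take-it-or-leave-it offer y, which Player 2 accepts iff y >= p - C.

C = 0.1

def expected_payoff(y, grid=2000):
    """Player 1's expected payoff from offering y, integrating over Player 2's type."""
    total = 0.0
    for k in range(grid):
        p = (k + 0.5) / grid
        total += (1 - y) if y >= p - C else (1 - p) - C   # peace vs. Player 1's war payoff
    return total / grid

offers = [k / 200 for k in range(201)]
y_star = max(offers, key=expected_payoff)
p_war = max(0.0, 1 - (y_star + C))          # types with p > y* + C reject and fight
print(f"optimal offer: {y_star:.3f}, probability of war: {p_war:.2f}")
# Even though every war destroys 2C of value, appeasing the strongest types
# would cost Player 1 more than the occasional war does, so some war risk is
# accepted in equilibrium.
```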
This all means that a game has no efficient equilibria only if one of items 1a-1c holds. But it could still be the case that agents coordinate on an inefficient equilibrium, even if an efficient one is available (1d). E.g., agents might both play Hare in a Stag Hunt. (Coordinating on an equilibrium but failing to coordinate on an efficient one seems unlikely, which is why we don’t discuss it in the main text. But it isn’t ruled out by the assumptions of accurate beliefs and maximizing expected utility with respect to those beliefs alone.)
This exhausts explanations of situations in which conflict happens in equilibrium. But rationality does not imply that agents are in equilibrium. How could imperfect knowledge of other players’ strategies drive conflict? There are two possibilities:
Suppose that agents have private information such that nondisclosure of the information will lead to a situation in which conflict is rational, but conflict would no longer be rational if the information were disclosed.
We can decompose reasons not to disclose into reasons not to unconditionally disclose and reasons not to conditionally disclose. Here, “conditional” disclosure means “disclosure conditional on a commitment to a particular agreement by the other player”. For example, suppose my private information is $x$, where $x$ measures my military strength, such that $x$ is my chance of winning a war, and $x$ is also information about secret military technology that I don’t want you to learn. A conditional commitment would be: I disclose $x$, so that we can decide the outcome of the contest according to a costless lottery which I win with probability $x$, conditional on a commitment from you not to use your knowledge of $x$ to harm me.
Here is the decomposition:
DiGiovanni, Anthony, and Jesse Clifton. 2022. “Commitment Games with Conditional Information Revelation.” arXiv [cs.GT]. arXiv. http://arxiv.org/abs/2204.03484.
Shulman, Carl. 2010. “Omohundro’s ‘Basic AI Drives’ and Catastrophic Risks.” Manuscript. http://www.hdjkn.com/files/BasicAIDrives.pdf.
The post When would AGIs engage in conflict? appeared first on Center on Long-Term Risk.
]]>The post When does technical work to reduce AGI conflict make a difference?: Introduction appeared first on Center on Long-Term Risk.
]]>
This is a pared-down version of a longer draft report. We went with a more concise version to get it out faster, so it ended up being more of an overview of definitions and concepts, and is thin on concrete examples and details. Hopefully subsequent work will help fill those gaps.
Some researchers are focused on reducing the risks of conflict between AGIs. In this sequence, we’ll present several necessary conditions for technical work on AGI conflict reduction to be effective, and survey circumstances under which these conditions hold. We’ll also present some tentative thoughts on promising directions for research and intervention to prevent AGI conflict.
This sequence assumes familiarity with intermediate game theory.
Could powerful AI systems engage in catastrophic conflict? And if so, what are the best ways to reduce this risk? Several recent research agendas related to safe and beneficial AI have been motivated, in part, by reducing the risks of large-scale conflict involving artificial general intelligence (AGI). These include the Center on Long-Term Risk’s research agenda, Open Problems in Cooperative AI, and AI Research Considerations for Human Existential Safety (and this associated assessment of various AI research areas). As proposals for longtermist priorities, these research agendas are premised on a view that AGI conflict could destroy large amounts of value, and that a good way to reduce the risk of AGI conflict is to do work on conflict in particular. In this sequence, our goal is to assess conditions under which work specific to conflict reduction could make a difference, beyond non-conflict-focused work on AI alignment and capabilities.1
Examples of conflict include existentially catastrophic wars between AGI systems in a multipolar takeoff (e.g., 'flash war') or even between different civilizations (e.g., Sandberg 2021). We’ll assume that expected losses from catastrophic conflicts such as these are sufficiently high for this to be worth thinking about at all, and we won’t argue for that claim here.
We’ll restrict attention to technical (as opposed to, e.g., governance) interventions aimed at reducing the risks of catastrophic conflict involving AGI. These include Cooperative AI interventions, where Cooperative AI is concerned with improving the cooperative capabilities of self-interested actors (whether AI agents or AI-assisted humans).2 Candidates for cooperative capabilities include the ability to implement mutual auditing schemes in order to reduce uncertainties that contribute to conflict, and the ability to avoid conflict due to incompatible commitments (see Yudkowsky (2013); Oesterheld and Conitzer (2021); Stastny et al. (2021)). The interventions under consideration also include improving AI systems’ ability to understand humans’ cooperation-relevant preferences. Finally, they include shaping agents’ cooperation-relevant preferences, e.g., preventing AGIs from acquiring conflict-prone preferences like spite. An overview of the kinds of interventions that we have in mind here is given in Table 1.
Class of technical interventions specific to reducing conflict | Examples
Improving cooperative capabilities (Cooperative AI) | Mutual auditing schemes to reduce the uncertainties that contribute to conflict; techniques for avoiding conflict due to incompatible commitments (e.g., safe Pareto improvements, surrogate goals)
Improving understanding of humans’ cooperation-relevant preferences | Methods for inferring humans’ cooperation-relevant preferences, including in multi-agent settings
Shaping cooperation-relevant preferences | Preventing AGIs from acquiring conflict-prone preferences such as spite
There are reasons to doubt the claim that (Technical Work Specific to) Conflict Reduction Makes a Difference.3 Conflict reduction won’t make a difference if the following conditions don’t hold: (a) AGIs won’t always avoid conflict, despite it being materially costly and (b) intent alignment is either insufficient or unnecessary for conflict reduction work to make a difference. In the rest of the sequence, we’ll look at what needs to happen for these conditions to hold.
Throughout the sequence, we will use “conflict” to refer to “conflict that is costly by our lights”, unless otherwise specified. Of course, conflict that is costly by our lights (e.g., wars that destroy resources that would otherwise be used to make things we value) is also likely to be costly by the AGIs’ lights, though this is not a logical necessity. For AGIs to fail to avoid conflict by default, one of these must be true:
Conflict isn’t costly by the AGIs’ lights. That is, there don’t exist outcomes that all of the disputant AGIs would prefer to conflict.
AGIs that are sufficiently capable to engage in conflict that is costly for them wouldn’t also be sufficiently capable to avoid conflict that is costly for them.4
If either Conflict isn't Costly or Capabilities aren't Sufficient, then it may be possible to reduce the chances that AGIs engage in conflict. This could be done by improving their cooperation-relevant capabilities or by making their preferences less prone to conflict. But this is not enough for Conflict Reduction Makes a Difference to be true.
Intent alignment may be both sufficient and necessary to reduce the risks of AGI conflict that isn’t endorsed by human overseers, insofar as it is possible to do so. If that were true, technical work specific to conflict reduction would be redundant. This leads us to the next two conditions that we’ll consider.
Intent alignment — i.e., AI systems trying to do what their overseers want — combined with the capabilities that AI systems are very likely to have conditional on intent alignment, isn’t sufficient for avoiding conflict that is not endorsed (on reflection) by the AIs’ overseers.
Even if intent alignment fails, it is still possible to intervene on an AI system to reduce the risks of conflict. (We may still want to prevent conflict if intent alignment fails and leads to an unrecoverable catastrophe, as this could make worse-than-extinction outcomes less likely.)
By unendorsed conflict, we mean conflict caused by AGIs that results from a sequence of decisions that none of the AIs’ human principals would endorse after an appropriate process of reflection.5 The reason we focus on unendorsed conflict is that we ultimately want to compare (i) conflict-specific interventions on how AI systems are designed and (ii) work on intent alignment.
Neither of these is aimed at solving problems that are purely about human motivations, like human overseers instructing their AI systems to engage in clearly unjustified conflict.
Contrary to what our framings here might suggest, disagreements about the effectiveness of technical work to reduce AI conflict relative to other longtermist interventions are unlikely to be about the logical possibility of conflict reduction work making a difference. Instead, they are likely to involve quantitative disagreements about the likelihood and scale of different conflict scenarios, the degree to which we need AI systems to be aligned to intervene on them, and the effectiveness of specific interventions to reduce conflict (relative to intent alignment, say). We regard mapping out the space of logical possibilities for conflict reduction to make a difference as an important initial step in the longer-term project of assessing the effectiveness of technical work on conflict reduction.6
Thanks to Michael Aird, Jim Buhler, Steve Byrnes, Sam Clarke, Allan Dafoe, Daniel Eth, James Faville, Lukas Finnveden, Lewis Hammond, Julian Stastny, Daniel Kokotajlo, David Manheim, Rani Martin, Adam Shimi, Stefan Torges, and Francis Ward for comments on drafts of this sequence. Thanks to Beth Barnes, Evan Hubinger, Richard Ngo, and Carl Shulman for comments on a related draft.
Stastny, Julian, Maxime Riché, Alexander Lyzhov, Johannes Treutlein, Allan Dafoe, and Jesse Clifton. 2021. “Normative Disagreement as a Challenge for Cooperative AI.” arXiv [cs.MA]. arXiv. http://arxiv.org/abs/2111.13872.
The post When does technical work to reduce AGI conflict make a difference?: Introduction appeared first on Center on Long-Term Risk.
]]>The post Open Position: Community Manager appeared first on Center on Long-Term Risk.
]]>The Center on Long-term Risk is seeking a Community Manager, to work on growing and supporting the community around our mission and research. You will have a leveraged role in furthering our mission to address risks of astronomical suffering from the development and deployment of advanced AI systems.
In this role, you would become the third full member of our Community-building team, reporting to Stefan Torges, the Director of Operations. Depending on your skill set, you will take on responsibilities across diverse areas such as event & project management, 1:1 outreach & advising calls, setting up & improving IT infrastructure, writing, giving talks, and attending in-person networking events – making this role ideal for quickly gaining experience across a range of domains. You will receive mentorship from an experienced team, and become familiar with existing processes in a well-running organization, as you work to improve and supplement them. You will also have the opportunity to engage with cutting-edge research in longtermism and AI safety as well as shaping our strategy.
To apply for this role, please submit this application form. The deadline for applications is the end of Sunday 16th October (precisely: 7:30am British Summer Time on Monday 17th). We expect the form will take 30-60 minutes to complete. It can be done in as little as 10 minutes if necessary by skipping the descriptive questions: this may significantly disadvantage your application, but may make sense if you wouldn’t apply otherwise.
We are recruiting for this role in order to provide additional capacity in our community-building function. Precisely which areas you work on will depend on your strengths and interests, and we’ll decide this together with you once you start work.
As an illustration of the sorts of things you’ll work on, we expect that the successful candidate will take on several of the following tasks:
Examples of further responsibilities that a candidate who is a good fit for them could take on include:
Since we are a small team, all members have the opportunity to shape our strategy.
We think this role could provide suitable challenges for someone with 0-4 years of experience in a similar job: it might, for example, be suited to a recent graduate interested in quickly gaining experience in a professional community-building role, and we also encourage more experienced candidates to apply.
The following abilities and qualities are what we’re looking for in candidates. No specific qualifications or experience are required – experience is one good way of demonstrating these skills, but we’re also open to candidates with no experience of similar roles. We encourage you to apply if you think you may be a good fit, even if you are very unsure of your strengths in some of these areas.
Given that we are a small organization, we also value candidates who are willing to do less glamorous tasks to bring a project over the finish line.
In this role, you can expect to grow our team and the community of people who are committed to reducing risks of astronomical suffering from the development of AI systems. That makes it a highly leveraged opportunity to contribute to that effort.
Due to the small size of our organization, your work will be varied and you will be asked to take ownership of projects quickly. Our community is still at an early stage, so we regularly test new projects, which can help you master a variety of skills and provide you with space to propose your own ideas.
You will join an experienced community-building team who will provide you with mentorship. You will work alongside and interact regularly with our researchers. So you have many opportunities to engage with ideas related to risks of astronomical suffering as well as effective altruism, longtermism, and AI safety.
CLR will also actively support your professional development. While we are looking for a candidate who is interested in working with CLR for a substantial period of time, as part of the effective altruism community we are interested in helping you increase your career’s impact even beyond your performance in the current role. Alongside mentorship from our experienced operations team, you will be joining a well-networked longtermist organization. You will receive a budget of £8,000 per year to spend on whatever you think best furthers your professional development, and be supported to attend EA Global conferences.
Stage 1: To apply for this role, please submit this application form. The deadline for applications is the end of Sunday 16th October (precisely: 7:30am British Summer Time on Monday 17th).
We expect the form will take 30-60 minutes to complete. If necessary, the form can be done in as little as 10 minutes by skipping the descriptive questions: this may significantly disadvantage your application, but may make sense if you wouldn’t apply otherwise.
We aim to communicate the results of stage 1, inviting candidates to the second stage, by the end of Friday 21st October.
Stage 2 will be a remote work test, to be completed on your own computer, which we anticipate will take up to 4 hours of your time. Applicants will have 2 weeks to complete the test, and will be compensated with £120 in return for their work. We plan to communicate the results of stage 2 by the end of Friday 11th November.
Stage 3 will consist of one or more interviews with CLR staff. We plan to hold interviews in the week of 21st November, and aim to communicate the results of stage 3 by the end of Friday 25th November.
Stage 4: The final stage of the recruitment process will be a work trial, held in-person if possible, of between 1-10 working days depending on candidate availability. We will cover travel expenses and compensate candidates £200 per day for the work trial. We will also seek references at this stage.
We expect final recruitment decisions to be made by the end of the year. If you require a faster decision than this, please feel free to contact us at the address below.
The above timelines are our aim and we fully intend to stick to them. However, we don’t firmly commit to them, and a delay of, for example, 1-2 weeks by the end of stage 3 is possible. We will communicate to candidates promptly if we expect there to be any delays.
If you have any questions about the process, please contact us at hiring@longtermrisk.org. If you’d like to send an email that’s not accessible to the hiring committee, please contact tristan.cook@longtermrisk.org.
Diversity and equal opportunity employment: CLR is an equal opportunity employer, and we value diversity at our organization. We don’t want to discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, social background/class, mental or physical health or disability, or any other basis for unreasonable discrimination, whether legally protected or not. If you're considering applying to this role and would like to discuss any personal needs that might require adjustments to our application process or workplace, please feel very free to contact us.
The post Open Position: Community Manager appeared first on Center on Long-Term Risk.
]]>The post Safe Pareto Improvements for Delegated Game Playing appeared first on Center on Long-Term Risk.
]]>
A set of players delegate playing a game to a set of representatives, one for each player. We imagine that each player trusts their respective representative’s strategic abilities. Thus, we might imagine that per default, the original players would simply instruct the representatives to play the original game as best as they can. In this paper, we ask: are there safe Pareto improvements on this default way of giving instructions? That is, we imagine that the original players can coordinate to tell their representatives to only consider some subset of the available strategies and to assign utilities to outcomes differently than the original players. Then can the original players do this in such a way that the payoff is guaranteed to be weakly higher than under the default instructions for all the original players? In particular, can they Pareto-improve without probabilistic assumptions about how the representatives play games? In this paper, we give some examples of safe Pareto improvements. We prove that the notion of safe Pareto improvements is closely related to a notion of outcome correspondence between games. We also show that under some specific assumptions about how the representatives play games, finding safe Pareto improvements is NP-complete.
Keywords: program equilibrium; delegation; bargaining; Pareto efficiency; smart contracts.
Between Aliceland and Bobbesia lies a sparsely populated desert. Until recently, neither of the two countries had any interest in the desert. However, geologists have recently discovered that it contains large oil reserves. Now, both Aliceland and Bobbesia would like to annex the desert, but they worry about a military conflict that would ensue if both countries insist on annexing.
Table 1 models this strategic situation as a normal-form game. The strategy DM (short for “Demand with Military”) denotes a military invasion of the desert, demanding annexation. If both countries send their military with such an aggressive mission, the countries fight a devastating war. The strategy RM (for “Refrain with Military”) denotes yielding the territory to the other country, but building defenses to prevent an invasion of one’s current territories. Alternatively, the countries can choose to not raise a military force at all, while potentially still demanding control of the desert by sending only their leader (DL, short for “Demand with Leader”). In this case, if both countries demand the desert, war does not ensue. Finally, they could neither demand nor build up a military (RL). If one of the two countries has their military ready and the other does not, the militarized country will know and will be able to invade the other country. In game-theoretic terms, militarizing therefore strictly dominates not militarizing.
Instead of making the decision directly, the parliaments of Aliceland and Bobbesia appoint special commissions for making this strategic decision, led by Alice and Bob, respectively. The parliaments can instruct these representatives in various ways. They can explicitly tell them what to do – for example, Aliceland could directly tell Alice to play DM. However, we imagine that the parliaments trust the commissions’ judgments more than they trust their own and hence they might prefer to give an instruction of the type, “make whatever demands you think are best for our country” (perhaps contractually guaranteeing a reward in proportion to the utility of the final outcome). They might not know what that will entail, i.e., how the commissions decide what demands to make given that instruction. However – based on their trust in their representatives – they might still believe that this leads to better outcomes than giving an explicit instruction.
We will also imagine these instructions are (or at least can be) given publicly and that the commissions are bound (as if by a contract) to follow these instructions. In particular, we imagine that the two commissions can see each other’s instructions. Thus, in instructing their commissions, the countries play a game with bilateral precommitment. When instructed to play a game as best as they can, we imagine that the commissions play that game in the usual way, i.e., without further abilities to credibly commit or to instruct subcommittees and so forth.
It may seem that without having their parliaments ponder equilibrium selection, Aliceland and Bobbesia cannot do better than leave the game to their representatives. Unfortunately, in this default equilibrium, war is still a possibility. Even the brilliant strategists Alice and Bob may not always be able to resolve the difficult equilibrium selection problem to the same pure Nash equilibrium.
In the literature on commitment devices and in particular the literature on program equilibrium, important ideas have been proposed for avoiding such bad outcomes. Imagine for a moment that Alice and Bob will play a Prisoner’s Dilemma (Table 3) (rather than the Demand Game of Table 1). Then the default of (Defect, Defect) can be Pareto-improved upon. Both original players (Aliceland and Bobbesia) can use the following instruction for their representatives: “If the opponent’s instruction is equal to this instruction, Cooperate; otherwise Defect.” [33, 22, 46, Sect. 10.4, 55] Then it is a Nash equilibrium for both players to use this instruction. In this equilibrium, (Cooperate, Cooperate) is played and it is thus Pareto-optimal and Pareto-better than the default.
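As a concrete illustration of this construction, here is a toy program game in which each program receives the other’s source code. The code is only a sketch of the idea, not an implementation from any of the cited papers:

```python
# Minimal sketch: the "cooperate iff your program equals mine" construction,
# with programs represented by their source strings.

import inspect

def clique_bot(my_source: str, opponent_source: str) -> str:
    """Cooperate if and only if the opponent submitted this exact program."""
    return "Cooperate" if opponent_source == my_source else "Defect"

def defect_bot(my_source: str, opponent_source: str) -> str:
    return "Defect"

PD = {("Cooperate", "Cooperate"): (3, 3), ("Cooperate", "Defect"): (0, 4),
      ("Defect", "Cooperate"): (4, 0), ("Defect", "Defect"): (1, 1)}

def play(prog1, prog2):
    s1, s2 = inspect.getsource(prog1), inspect.getsource(prog2)
    return PD[(prog1(s1, s2), prog2(s2, s1))]

print(play(clique_bot, clique_bot))   # (3, 3): mutual cooperation
print(play(clique_bot, defect_bot))   # (1, 1): deviating to defect_bot gains nothing
```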
In cases like the Demand Game, it is more difficult to apply this approach to improve upon the default of simply delegating the choice. Of course, if one could calculate the expected utility of submitting the default instructions, then one could similarly commit the representatives to follow some (joint) mix over the Pareto-optimal outcomes ((RM, DM), (DM, RM), (RM, RM), (DL, DL), etc.) that Pareto-improves on the default expected utilities.1 However, we will assume that the original players are unable or unwilling to form probabilistic expectations about how the representatives play the Demand Game, i.e., about what would happen with the default instructions. If this is the case, then this type of Pareto improvement on the default is unappealing.
The goal of this paper is to show and analyze how even without forming probabilistic beliefs about the representatives, the original players can Pareto-improve on the default equilibrium. We will call such improvements safe Pareto improvements (SPIs). We here briefly give an example in the Demand Game.
The key idea is for the original players to instruct the representatives to select only from {DL,RL}, i.e., to not raise a military. Further, they tell them to disvalue the conflict outcome without military (DL, DL) as they would disvalue the original conflict outcome of war in the default equilibrium. Overall, this means telling them to play the game of Table 2. (Again, we could imagine that the instructions specify Table 2 to be how Aliceland and Bobbesia financially reward Alice and Bob.) Importantly, Aliceland’s instruction to play that game must be conditional on Bobbesia also instructing their commission to play that game, and vice versa. Otherwise, one of the countries could profit from deviating by instructing their representative to always play DM or RM (or to play by the original utility function).
The game of Table 2 is isomorphic to the DM-RM part of the original Demand Game of Table 1. Of course, the original players know neither how the original Demand Game nor the game of Table 2 will be played by the representatives. However, since these games are isomorphic, one should arguably expect them to be played isomorphically. For example, one should expect that (RM,DM) would be played in the original game if and only if (RL, DL) would be played in the modified game. However, the conflict outcome (DM,DM) is replaced in the new game with the outcome (DL, DL). This outcome is harmless (Pareto-optimal) for the original players.
Contributions. Our paper generalizes this idea to arbitrary normal-form games and is organized as follows. In Section 2, we introduce some notation for games and multivalued functions that we will use throughout this paper. In Section 3, we introduce the setting of delegated game playing for this paper. We then formally define and further motivate the concept of safe Pareto improvements. We also define and give an example of unilateral SPIs. These are SPIs that require only one of the players to commit their representative to a new action set and utility function. In Section 3.2, we briefly review the concepts of program games and program equilibrium and show that SPIs can be implemented as program equilibria. In Section 4.2, we introduce a notion of outcome correspondence between games. This relation expresses the original players’ beliefs about similarities between how the representatives play different games. In our example, the Demand Game of Table 1 (arguably) corresponds to the game of Table 2 in that the representatives (arguably) would play (DM,DM) in the original game if and only if they play (DL, DL) in the new game, and so forth. We also show some basic results (reflexivity, transitivity, etc.) about the outcome correspondence relation on games. In Section 4.3 we show that the notion of outcome correspondence is central to deriving SPIs. In particular, we show that one game is an SPI on another game if and only if there is a Pareto-improving outcome correspondence relation between the two games.
To derive SPIs, we need to make some assumptions about outcome correspondence, i.e., about which games are played in similar ways by representatives. We give two very weak assumptions of this type in Section 4.4. The first is that the representatives’ play is invariant under the removal of strictly dominated strategies. For example, we assume that in the Demand Game the representatives only play DM and RM. Moreover we assume that we could remove DL and RL from the game and the representatives would still play the same strategies as in the original Demand Game with certainty. The second assumption is that the representatives play isomorphic games isomorphically. For example, once DL and RL are removed for both players from the Demand Game, the Demand Game is isomorphic to the game in Table 2 such that we might expect them to be played isomorphically. In Section 4.5, we derive a few SPIs – including our SPI for the Demand Game – using these assumptions. Section 4.6 shows that determining whether there exists an SPI based on these assumptions is NP-complete. Section 5 considers a different setting in which we allow the original players to let the representatives choose from newly constructed strategies whose corresponding outcomes map arbitrarily onto feasible payoff vectors from the original game. In this new setting, finding SPIs can be done in polynomial time. We conclude by discussing the problem of selecting between different SPIs on a given game (Section 6) and giving some ideas for directions for future work (Section 7).
We here give some basic game-theoretic definitions. We assume the reader to be familiar with most of these concepts and with game theory more generally.
An $n$-player (normal-form) game $\Gamma = (A, \mathbf{u})$ is a tuple of a set $A = A_1 \times \dots \times A_n$ of (pure) strategy profiles (or outcomes) and a function $\mathbf{u}\colon A \to \mathbb{R}^n$ that assigns to each outcome a utility for each player. The Prisoner's Dilemma shown in Table 3 is a classic example of a game. The Demand Game of Table 1 is another example of a game that we will use throughout this paper.
Instead of $u_i((a_1, \dots, a_n))$ we will also write $u_i(a_1, \dots, a_n)$. We also write $A_{-i}$ for $\prod_{j \neq i} A_j$, i.e., for the Cartesian product of the action sets of all players other than $i$. We similarly write $\mathbf{u}_{-i}$ and $a_{-i}$ for vectors containing utility functions and actions, respectively, for all players but $i$. If $u_i$ is a utility function and $\mathbf{u}_{-i}$ is a vector of utility functions for all players other than $i$, then (even if $i \neq 1$) we use $(u_i, \mathbf{u}_{-i})$ for the full vector of utility functions where Player $i$ has utility function $u_i$ and the other players have utility functions as specified by $\mathbf{u}_{-i}$. We use $(a_i, a_{-i})$ and $(A_i, A_{-i})$ analogously.
We say that $a_i \in A_i$ strictly dominates $\hat a_i \in A_i$ if for all $a_{-i} \in A_{-i}$, $u_i(a_i, a_{-i}) > u_i(\hat a_i, a_{-i})$. For example, in the Prisoner's Dilemma, Defect strictly dominates Cooperate for both players. As noted earlier, in the Demand Game DM and RM strictly dominate DL and RL for both players.
For any given game $\Gamma = (A, \mathbf{u})$, we will call any game $\Gamma' = (A', \mathbf{u}')$ a subset game of $\Gamma$ if $A_i' \subseteq A_i$ for $i = 1, \dots, n$. Note that a subset game may assign different utilities to outcomes than the original game. For example, the game of Table 2 is a subset game of the Demand Game.
We say that some utility vector $\mathbf{y} \in \mathbb{R}^n$ is a Pareto improvement on (or is Pareto-better than) $\mathbf{x} \in \mathbb{R}^n$ if $y_i \geq x_i$ for $i = 1, \dots, n$. We will also denote this by $\mathbf{y} \geq \mathbf{x}$. Note that, contrary to convention, we allow $\mathbf{y} = \mathbf{x}$. Whenever we require one of the inequalities to be strict, we will say that $\mathbf{y}$ is a strict Pareto improvement on $\mathbf{x}$. In a given game, we will also say that an outcome $a$ is a Pareto improvement on another outcome $\hat a$ if $\mathbf{u}(a) \geq \mathbf{u}(\hat a)$. We say that $a$ is Pareto-optimal or Pareto-efficient relative to some $\hat A \subseteq A$ if there is no element of $\hat A$ that strictly Pareto-dominates $a$.
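For concreteness, here is a small sketch of these definitions in code; the helper functions and example payoffs are illustrative, not from the paper:

```python
# Minimal sketch of the Pareto definitions above.

def pareto_improves(y, x):
    """y is a (weak) Pareto improvement on x: no player is worse off."""
    return all(yi >= xi for yi, xi in zip(y, x))

def strictly_pareto_improves(y, x):
    return pareto_improves(y, x) and any(yi > xi for yi, xi in zip(y, x))

def pareto_optimal(a, outcomes, u):
    """a is Pareto-optimal relative to `outcomes` if nothing strictly Pareto-dominates it."""
    return not any(strictly_pareto_improves(u(b), u(a)) for b in outcomes)

u = {"war": (0, 0), "split": (2, 2), "concede": (3, 1)}.get
print(pareto_improves(u("split"), u("war")))                      # True
print(pareto_optimal("war", ["war", "split", "concede"], u))      # False
print(pareto_optimal("split", ["war", "split", "concede"], u))    # True
```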
Let $\Gamma = (A, \mathbf{u})$ and $\Gamma' = (A', \mathbf{u}')$ be two $n$-player games. Then we call an $n$-tuple $\Phi = (\Phi_1, \dots, \Phi_n)$ of bijections $\Phi_i\colon A_i \to A_i'$ a (game) isomorphism between $\Gamma$ and $\Gamma'$ if there are vectors $\lambda \in \mathbb{R}^n_{>0}$ and $c \in \mathbb{R}^n$ such that
$u_i'(\Phi_1(a_1), \dots, \Phi_n(a_n)) = \lambda_i u_i(a_1, \dots, a_n) + c_i$
for all $i$ and all $a \in A$. If there is an isomorphism between $\Gamma$ and $\Gamma'$, we call $\Gamma$ and $\Gamma'$ isomorphic. For example, if we let $\hat\Gamma$ be the game that results from removing DL and RL for both players from the Demand Game and let $\Gamma'$ be the subset game of Table 2, then $\hat\Gamma$ is isomorphic to $\Gamma'$ via the isomorphism $\Phi$ with $\Phi_i(\mathrm{DM}) = \mathrm{DL}$ and $\Phi_i(\mathrm{RM}) = \mathrm{RL}$ for both players $i$ and suitable constants $\lambda$ and $c$.
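The definition can be checked mechanically. The following sketch (with made-up payoffs) tests whether a given relabelling of actions, together with per-player constants $\lambda$ and $c$, is a game isomorphism:

```python
# Minimal sketch (illustrative payoffs): checking whether a relabelling of
# actions matches payoffs up to a positive affine transformation per player.

import itertools

def is_isomorphism(u, u_prime, phi, actions, lambdas, cs):
    """u, u_prime: dicts mapping outcome tuples to payoff tuples.
    phi: per-player dicts relabelling actions; lambdas (> 0) and cs are the
    per-player affine constants from the definition above."""
    for a in itertools.product(*actions):
        image = tuple(phi[i][a_i] for i, a_i in enumerate(a))
        for i in range(len(a)):
            if abs(u_prime[image][i] - (lambdas[i] * u[a][i] + cs[i])) > 1e-9:
                return False
    return True

# Two trivially isomorphic 2x2 games: the second rescales player 0's payoffs.
u = {("X", "L"): (2, 1), ("X", "R"): (0, 0), ("Y", "L"): (1, 0), ("Y", "R"): (3, 2)}
u2 = {(x, y): (2 * v[0] + 1, v[1]) for (x, y), v in u.items()}
phi = [{"X": "X", "Y": "Y"}, {"L": "L", "R": "R"}]
print(is_isomorphism(u, u2, phi, [["X", "Y"], ["L", "R"]], [2, 1], [1, 0]))  # True
```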
We consider a setting in which a given game is played through what we will call representatives. For example, the representatives could be humans whose behavior is determined or incentivized by some contract à la the principal–agent literature [28]. Our principals’ motivation for delegation is the same as in that literature (namely, the agent being in a better (epistemic) position to make the choice). However, the main question asked by the principal–agent literature is how to deal with agents that have their own preferences over outcomes, by constraining the agent’s choice [e.g. 21, 25], setting up appropriate payment schemes [e.g. 23, 29, 37, 53], etc. In contrast, we will throughout this paper assume that the agent has no conflicting incentives.
We imagine that one way in which the representatives can be instructed is to in turn play a subset game $\Gamma' = (A', \mathbf{u}')$ of the original game $\Gamma$, without necessarily specifying a strategy or algorithm for solving such a game. We emphasize, again, that $\mathbf{u}'$ is allowed to be a vector of entirely different utility functions. For any subset game $\Gamma'$, we denote by $\Pi(\Gamma')$ the outcome that arises if the representatives play the subset game $\Gamma'$ of $\Gamma$. Because it is unclear what the right choice is in many games, the original players might be uncertain about $\Pi(\Gamma')$. We will therefore model each $\Pi(\Gamma')$ as a random variable. We will typically imagine that the representatives play $\Gamma'$ in the usual simultaneous way, i.e., that they are not able to make further commitments or delegate again. For example, we imagine that if $\Gamma$ is the Prisoner's Dilemma, then $\Pi(\Gamma) = (\mathrm{Defect}, \mathrm{Defect})$ with certainty.
The original players trust their representatives to the extent that we take $\Pi(\Gamma)$ to be the default way for the game $\Gamma$ to be played, for any $\Gamma$. That is, by default the original players tell their representatives to play the game as given. For example, in the Demand Game, it is not clear what the right action is. Thus, if one can simply delegate the decision to someone with more relevant expertise, that is the first option one would consider.
We are interested in whether and how the original players can jointly Pareto-improve on the default. Of course, one option is to first compute the expected utilities under default delegation, i.e., to compute $\mathbb{E}[\mathbf{u}(\Pi(\Gamma))]$. The players could then let the representatives play a distribution over outcomes whose expected utilities exceed the default expected utilities. However, this is unrealistic if $\Gamma$ is a complex game with potentially many Nash equilibria. For one, the precise point of delegation is that the original players are unable or unwilling to properly evaluate $\Gamma$ themselves. Second, there is no widely agreed upon, universal procedure for selecting an action in the face of equilibrium selection problems. In such cases, the original players may in practice be unable to form a probability distribution over $\Pi(\Gamma)$. This type of uncertainty is sometimes referred to as Knightian uncertainty, following Knight's [26] distinction between the concepts of risk and uncertainty.
We address this problem in a typical way. Essentially, we require of any attempted improvement over the default that it incurs no regret in the worst case. That is, we are interested in subset games $\Gamma'$ that are Pareto improvements with certainty under weak and purely qualitative assumptions about $\Pi$.2 In particular, in Section 4.4, we will introduce the assumptions that the representatives do not play strictly dominated actions and play isomorphic games isomorphically.
Definition 1. Let $\Gamma'$ be a subset game of $\Gamma$. We say $\Gamma'$ is a safe Pareto improvement (SPI) on $\Gamma$ if $\mathbf{u}(\Pi(\Gamma')) \geq \mathbf{u}(\Pi(\Gamma))$ with certainty. We say that $\Gamma'$ is a strict SPI if furthermore there is a player $i$ s.t. $u_i(\Pi(\Gamma')) > u_i(\Pi(\Gamma))$ with positive probability.
For example, in the introduction we have argued that the subset game in Table 2 is a strict SPI on the Demand Game (Table 1). Less interestingly, if we let $\Gamma$ be the Prisoner's Dilemma (Table 3), then we would expect the subset game $\Gamma'$ in which both players can only Cooperate to be an SPI on $\Gamma$. After all, we might expect that $\Pi(\Gamma) = (\mathrm{Defect}, \mathrm{Defect})$ with certainty, while it must be $\Pi(\Gamma') = (\mathrm{Cooperate}, \mathrm{Cooperate})$ with certainty, for lack of alternatives. Both players prefer mutual cooperation over mutual defection.
Both SPIs given above require both players to let their representatives choose from restricted strategy sets to maximize something other than the original player's utility function.
Definition 2. We will call a subset game $\Gamma'$ of $\Gamma$ unilateral if there is a player $i$ such that $A_j' = A_j$ and $u_j' = u_j$ for all players $j \neq i$. Consequently, if a unilateral subset game $\Gamma'$ of $\Gamma$ is also an SPI on $\Gamma$, we call $\Gamma'$ a unilateral SPI.
We now give an example of a unilateral SPI using the Complicated Temptation Game. (We give the not-so-complicated Temptation Game – in which we can only give a trivial example of SPIs – in Section 4.5.) Two players each deploy a robot. Each of the robots faces two choices in parallel. First, each can choose whether to work on Project 1 or Project 2. Player 1 values Project 1 higher and Player 2 values Project 2 higher, but the robots are more effective if they work on the same project. To complete the task, the two robots need to share a resource. Robot 2 manages the resource and can choose whether to control Robot 1’s access tightly (e.g., by frequently checking on the resource, or requiring Robot 1 to demonstrate a need for the resource) or give Robot 1 relatively free access. Controlling access tightly decreases the efficiency of both robots, though the exact costs depend on which projects the robots are working on. Robot 1 can choose between using the resource as intended by Robot 2, or giving in to the temptation of trying to steal as much of the resource as possible to use it for other purposes. Regardless of what Robot 2 does (in particular, regardless of whether Robot 2 controls access or not), Player 1 prefers trying to steal. In fact, if Robot 2 controls access and Robot 1 refrains from theft, they never get anything done. Given that Robot 1 tries to steal, Player 2 prefers his Robot 2 to control access. As usual we assume that the original players can instruct their robots to play arbitrary subset games of the Complicated Temptation Game (without specifying an algorithm for solving such a game) and that they can give such instructions conditional on the other player providing an analogous instruction.
We formalize this game as a normal-form game in Table 4. Each action consists of a number and a letter. The number indicates the project that the agent pursues. The letter indicates the agent’s policy towards the resource. In Player 2’s action labels, C indicates tight control over the resource, while F indicates free access. In Player 1’s action labels, T indicates giving in to the temptation to steal as much of the resource as possible, while R indicates refraining from doing so.
Player 1 has a unilateral SPI in the Complicated Temptation Game. Intuitively, if Player 1 commits to refrain, then Player 2 need not control the use of the resource. Thus, inefficiencies from conflict over the resource are avoided. However, Player 1’s utilities in the resulting game of choosing between projects 1 and 2 are not isomorphic to the original game of choosing between projects 1 and 2. The players might therefore worry that this new game will result in a worse outcome for them. For example, Player 2 might worry that in this new game the project 1 equilibrium becomes more likely than the project 2 equilibrium. To address this, Player 1 has to commit her representative to a different utility function that makes this new game isomorphic to the original game.
We now describe the unilateral SPI in formal detail. Player 1 can commit her representative to play only from 1R and 2R and to assign her representative new utilities over the remaining outcomes; otherwise the subset game does not differ from the original game. The resulting SPI is given in Table 5. In this subset game, Player 2’s representative – knowing that Player 1’s representative will only play from 1R and 2R – will choose from 1F and 2F (since 1F and 2F strictly dominate 1C and 2C in Table 5). Now notice that the remaining subset game on $\{1R, 2R\} \times \{1F, 2F\}$ is isomorphic to the subset game on $\{1T, 2T\} \times \{1C, 2C\}$ of the original Complicated Temptation Game, where 1R maps to 1T and 2R maps to 2T for Player 1, and 1F maps to 1C and 2F maps to 2C for Player 2. Player 1’s representative’s utilities have been set to be the same between the two; and Player 2’s utilities happen to be the same up to a constant between the two subset games. Thus, we might expect that if the original game resolves to, say, (1T, 1C), then the new game resolves to (1R, 1F), and so on. Finally, notice that both original players weakly prefer (1R, 1F) to (1T, 1C), and so on. Hence, Table 5 is indeed an SPI on the Complicated Temptation Game.
Such unilateral changes are particularly interesting because they only require one of the players to be able to credibly delegate. That is, it is enough for a single player $i$ to instruct their representative to choose from a restricted action set to maximize a new utility function. The other players can simply instruct their representatives to play the game in the normal way (i.e., maximizing the respective players’ original utility functions without restrictions on the action set). In fact, we may also imagine that only one player delegates at all, while the other players choose an action themselves, after observing Player $i$’s instruction to her representative.
One may object that in a situation where only one player can credibly commit and the others cannot, the player who commits can simply play the meta game as a standard unilateral commitment (Stackelberg) game [as studied by, e.g., 11, 52, 59] or perhaps as a first mover in a sequential game (as solved by subgame-perfect equilibrium), without bothering with any (safe) Pareto conditions, i.e., without ensuring that all players are guaranteed a utility at least as high as their default utilities $u_i(\Pi(\Gamma))$. For example, in the Complicated Temptation Game, Player 1 could simply commit her representative to play 1R if she assumes that Player 2’s representative will be instructed to best respond.
The Stackelberg sequential play perspective is appropriate in many cases. However, we think that in many cases the player with fine-grained commitment ability cannot assume that the other players' representatives will simply best respond. Instead, players often need to consider the possibility of a hostile response if their commitment forces an unfair payoff on the other players. In such cases, unilateral SPIs are relevant.
The Ultimatum game is a canonical example in which standard solution concepts of sequential play fail to predict human behavior. In this game, subgame-perfect equilibrium has the second-moving player walk away with arbitrarily close to nothing. However, experiments show that people often resolve the game to an equal split, which is the symmetric equilibrium of the simultaneous version of the game [38].
A policy of retaliating for unfair payoffs imposed by a first mover’s commitments can arise in a variety of ways within standard game-theoretic models. For one, we may imagine a scenario in which only one player has the fine-grained commitment and delegation abilities needed for SPIs but the other players can still credibly commit their representatives to retaliate against any “commitment trickery” that clearly leaves them worse off. We may also imagine that other players or representatives come into the scenario having already made such commitments. For example, many people appear credibly committed by intuitions about fairness and retributivist instincts and emotions [see, e.g., 44, Chapter 6, especially the section “The Doomsday Machine”]. Perhaps these features of human psychology allow human second players in the Ultimatum game to empirically outperform the subgame-perfect equilibrium. Second, we may imagine that the players who cannot commit are subject to reputation effects. Then they might want to build a reputation of resisting coercion. In contrast, it is beneficial to have a reputation of accepting SPIs on whatever game would have otherwise been played.
So far, we have been vague about the details of the strategic situation that the original players face in instructing their representatives. From what sets of actions can they choose? How can they jointly let the representatives play some new subset game $\Gamma'$? Are SPIs Nash equilibria of the meta game played by the original players? If I instruct my representative to play the SPI of Table 2 in the Demand Game, could my opponent not instruct her representative to play DM?
In this section, we briefly describe one way to fill this gap by discussing the concept of program games and program equilibrium [46, Sect. 10.4, 55, 15, 5, 13, 36]. This section is essential to understanding why SPIs (especially omnilateral ones) are relevant. However, the remaining technical content of this paper does not rely on this section and the main ideas presented here are straightforward from previous work. We therefore only give an informal exposition. For formal detail, see Appendix A.
For any game $\Gamma$, the program equilibrium literature considers the following meta game. First, each player writes a computer program. Each program then receives as input a vector containing everyone else’s chosen program. Each player $i$’s program then returns an action from $A_i$, player $i$’s set of actions in $\Gamma$. Together these actions then form an outcome of the original game. Finally, the utilities are realized according to the utility function of $\Gamma$. The meta game can be analyzed like any other game. Its Nash equilibria are called program equilibria. Importantly, the program equilibria can implement payoffs not implemented by any Nash equilibria of $\Gamma$ itself. For example, in the Prisoner’s Dilemma, both players can submit a program that says: “If the opponent’s chosen computer program is equal to this computer program, Cooperate; otherwise Defect.” [33, 22, 46, Sect. 10.4, 55] This is a program equilibrium which implements mutual cooperation.
In the setting for our paper, we similarly imagine that each player can write a program that in turn chooses from $A_i$. However, the types of programs that we have in mind here are more sophisticated than those typically considered in the program equilibrium literature. Specifically we imagine that the programs are executed by intelligent representatives who are themselves able to competently choose an action for player $i$ in any given game $\Gamma'$, without the original player having to describe how this choice is to be made. The original player may not even understand much about this program other than that it generally plays well. Thus, in addition to the elementary instructions used in a typical computer program (branches, comparisons, arithmetic operations, return, etc.), we allow player $i$ to use instructions of the type “Play $\Gamma'$” in the program she submits. This instruction lets the representative choose and return an action for the game $\Gamma'$. Apart from the addition of this instruction type, we imagine the set of instructions to be the same as in the program equilibrium literature. To jointly let the representatives play, e.g., the SPI of Table 2 on the Demand Game of Table 1, the representatives can both use an instruction that says, “If the opponent’s chosen program is equal to this one, play the game of Table 2; otherwise play the Demand Game of Table 1”. Assuming some minimal rationality requirements on the representatives (i.e., on how the representative resolves the “play $\Gamma'$” instruction), this is a Nash equilibrium. Figure 1 illustrates how (in the two-player case) the meta game between the original players is intended to work.
For illustration consider the following two real-world instantiations of this setup. First, we might imagine that the original players hire human representatives. Each player specifies by some contract, e.g., via monetary incentives, how she wants her representative to act. For example, a player might contract her representative to play a particular action; or she might specify in her contract a function over outcomes according to which she will pay the representative after an outcome is obtained. Moreover, these contracts might refer to one another. For example, Player 1’s contract with her representative might specify that if Player 2 and his representative use an analogous contract, then she will pay her representative according to Table 2. As a second, more futuristic scenario, you could imagine that the representatives are software agents whose goals are specified by so-called smart contracts, i.e., computer programs implemented on a blockchain to be publicly verifiable [8, 47].
To justify our study of SPIs, we prove that every SPI is played in some program equilibrium:
Theorem 1. Let $\Gamma$ be a game and $\Gamma'$ be an SPI on $\Gamma$. Now consider a program game on $\Gamma$, where each player can choose from a set of computer programs that output actions for $\Gamma$. In addition to the normal kind of instructions, we allow the use of the command “play $\Gamma''$” for any subset game $\Gamma''$ of $\Gamma$. Finally, assume that playing $\Gamma'$ guarantees each player at least that player’s minimax utility (a.k.a. threat point) in the base game $\Gamma$. Then $\Gamma'$ is played in a program equilibrium, i.e., in a Nash equilibrium of the program game.
We prove this in Appendix A.
As an alternative to having the original players choose contracts separately, we could imagine the use of jointly signed contracts which only come into effect once signed by all players [cf. 24, 34]. Another approach to bilateral commitment was pursued by Raub [45] based on earlier work by Sen [51]. Raub and Sen use preference modification as a mechanism for commitment. For example, in the Prisoner’s Dilemma, each player can separately instruct their representative to prefer cooperating over defecting if and only if the opponent also cooperates. If both players use this instruction, then mutual cooperation becomes the unique Pareto-optimal Nash equilibrium. On the other hand, if only one player instructs their representative to adopt these preferences and the other maintains the usual Prisoner’s Dilemma preferences, the unique equilibrium remains mutual defection. Thus, the preference modification is used to commit to cooperating conditional on the other player making an analogous commitment. Because this is slightly confusing in the context of our work – seeing as our work involves both modifying one’s preferences and mutual commitment, but generally without using the former as a means to the latter – we discuss Raub’s and Sen’s work and its relation to ours in more detail in Appendix B.
For sets $X$ and $Y$, a multi-valued function $\Phi$ from $X$ to $Y$ is a function which maps each element $x \in X$ to a set $\Phi(x) \subseteq Y$. For a subset $\hat X \subseteq X$, we define $\Phi(\hat X) := \bigcup_{x \in \hat X} \Phi(x)$.
Note that $\Phi(\{x\}) = \Phi(x)$ and that $\Phi(\hat X) \subseteq \Phi(\hat X')$ whenever $\hat X \subseteq \hat X'$. For any set $X$, we define the identity function $\mathrm{id}_X$ by $\mathrm{id}_X(x) := \{x\}$ for all $x \in X$. Also, for two sets $X$ and $Y$, we define the trivial multi-valued function $\mathrm{triv}_{X,Y}$ by $\mathrm{triv}_{X,Y}(x) := Y$ for all $x \in X$. We define the inverse $\Phi^{-1}$ of $\Phi$ by $\Phi^{-1}(y) := \{x \in X \mid y \in \Phi(x)\}$ for all $y \in Y$.
Note that $(\Phi^{-1})^{-1} = \Phi$ for any multi-valued function $\Phi$. For sets $X$, $Y$, and $Z$ and multi-valued functions $\Phi$ from $X$ to $Y$ and $\Psi$ from $Y$ to $Z$, we define the composite $\Psi \circ \Phi$ by $(\Psi \circ \Phi)(x) := \Psi(\Phi(x))$. As with regular functions, composition of multi-valued functions is associative. We say that $\Phi$ is single-valued if $|\Phi(x)| = 1$ for all $x \in X$. Whenever a multi-valued function is single-valued, we can apply many of the terms for regular functions. For example, we will take injectivity, surjectivity, and bijectivity for single-valued functions to have the usual meaning. We will never apply these notions to non-single-valued functions.
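For finite sets, these operations are straightforward to implement. The following sketch represents multi-valued functions as dictionaries from elements to sets:

```python
# Minimal sketch: finite multi-valued functions as dictionaries from elements
# to sets, with the operations defined above.

def image(phi, subset):
    """Phi applied to a subset: the union of Phi(x) over x in the subset."""
    out = set()
    for x in subset:
        out |= phi[x]
    return out

def inverse(phi, codomain):
    """Phi^{-1}(y) = {x : y in Phi(x)}."""
    return {y: {x for x in phi if y in phi[x]} for y in codomain}

def compose(psi, phi):
    """(Psi o Phi)(x) = Psi(Phi(x))."""
    return {x: image(psi, phi[x]) for x in phi}

phi = {1: {"a"}, 2: {"a", "b"}, 3: set()}
psi = {"a": {10}, "b": {10, 20}}

print(image(phi, {1, 2}))                                   # {'a', 'b'}
print(compose(psi, phi))                                    # {1: {10}, 2: {10, 20}, 3: set()}
print(inverse(inverse(phi, {"a", "b"}), {1, 2, 3}) == phi)  # True: (Phi^-1)^-1 = Phi
```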
In this section, we introduce a notion of outcome correspondence, which we will see is essential to constructing SPIs.
Definition 3. Consider two games $\Gamma = (A, \mathbf{u})$ and $\Gamma' = (A', \mathbf{u}')$. We write $\Gamma \sim_\Phi \Gamma'$ for a multi-valued function $\Phi$ from $A$ to $A'$ if $\Pi(\Gamma') \in \Phi(\Pi(\Gamma))$ with certainty.
Note that $\Gamma \sim_\Phi \Gamma'$ is a statement about $\Pi$, i.e., about how the representatives choose. Whether such a statement holds generally depends on the specific representatives being used. In Section 4.4, we describe two general circumstances under which it seems plausible that $\Gamma \sim_\Phi \Gamma'$. For example, if two games $\Gamma$ and $\Gamma'$ are isomorphic, then one might expect $\Gamma \sim_\Phi \Gamma'$, where $\Phi$ is the isomorphism between the two games.
We now illustrate this notation using our discussion of the Demand Game. Let $\Gamma$ be the Demand Game of Table 1. First, it seems plausible that $\Gamma$ is in some sense equivalent to $\hat\Gamma$, where $\hat\Gamma$ is the game that results from removing DL and RL for both players from $\Gamma$. Again, strict dominance could be given as an argument. We can now formalize this as $\Gamma \sim_{\Phi_1} \hat\Gamma$, where $\Phi_1(a) := \{a\}$ if $a \in \hat A$ and $\Phi_1(a) := \emptyset$ otherwise. Next, it seems plausible that $\hat\Gamma \sim_{\Phi_2} \Gamma'$, where $\Gamma'$ is the game of Table 2 and $\Phi_2$ is the isomorphism between $\hat\Gamma$ and $\Gamma'$.
We now state some basic facts about the relation $\sim$, many of which we will use throughout this paper.
Lemma 2. Let $\Gamma$, $\Gamma'$, and $\Gamma''$ be games with outcome sets $A$, $A'$, and $A''$, let $\Phi$ and $\Phi'$ be multi-valued functions from $A$ to $A'$, and let $\Psi$ be a multi-valued function from $A'$ to $A''$. Then:
1. $\Gamma \sim_{\mathrm{id}_A} \Gamma$ (reflexivity).
2. If $\Gamma \sim_{\Phi} \Gamma'$, then $\Gamma' \sim_{\Phi^{-1}} \Gamma$.
3. If $\Gamma \sim_{\Phi} \Gamma'$ and $\Gamma' \sim_{\Psi} \Gamma''$, then $\Gamma \sim_{\Psi \circ \Phi} \Gamma''$ (transitivity).
4. If $\Gamma \sim_{\Phi} \Gamma'$ and $\Phi(a) \subseteq \Phi'(a)$ for all $a \in A$, then $\Gamma \sim_{\Phi'} \Gamma'$.
5. $\Gamma \sim_{\mathrm{triv}_{A,A'}} \Gamma'$.
6. If $\Gamma \sim_{\Phi} \Gamma'$ and, with certainty, $\Pi(\Gamma) \in \hat A$ for some $\hat A \subseteq A$, then $\Gamma \sim_{\hat\Phi} \Gamma'$, where $\hat\Phi(a) := \Phi(a)$ for $a \in \hat A$ and $\hat\Phi(a) := \emptyset$ otherwise.
7. If, with certainty, $\Pi(\Gamma) \in \hat A$ for some $\hat A \subseteq A$, then $\Gamma \sim_{\Phi_{\hat A}} \Gamma$, where $\Phi_{\hat A}(a) := \{a\}$ for $a \in \hat A$ and $\Phi_{\hat A}(a) := \emptyset$ otherwise.
Proof. 1. By reflexivity of equality, $\Pi(\Gamma) = \Pi(\Gamma)$ with certainty. Hence, $\Pi(\Gamma) \in \{\Pi(\Gamma)\} = \mathrm{id}_A(\Pi(\Gamma))$ by definition of $\mathrm{id}_A$. Therefore, by definition of $\sim$, $\Gamma \sim_{\mathrm{id}_A} \Gamma$ as claimed.
2. $\Gamma \sim_{\Phi} \Gamma'$ means that $\Pi(\Gamma') \in \Phi(\Pi(\Gamma))$ with certainty. Thus,
$\Pi(\Gamma) \in \{a \in A \mid \Pi(\Gamma') \in \Phi(a)\} = \Phi^{-1}(\Pi(\Gamma'))$
with certainty, where the equality is by the definition of the inverse of multi-valued functions. We conclude (by definition of $\sim$) that $\Gamma' \sim_{\Phi^{-1}} \Gamma$ as claimed.
3. If $\Gamma \sim_{\Phi} \Gamma'$ and $\Gamma' \sim_{\Psi} \Gamma''$, then by definition of $\sim$, (i) $\Pi(\Gamma') \in \Phi(\Pi(\Gamma))$ and (ii) $\Pi(\Gamma'') \in \Psi(\Pi(\Gamma'))$, both with certainty. The former (i) implies $\{\Pi(\Gamma')\} \subseteq \Phi(\Pi(\Gamma))$. Hence,
$\Psi(\Pi(\Gamma')) \subseteq \Psi(\Phi(\Pi(\Gamma))) = (\Psi \circ \Phi)(\Pi(\Gamma)).$
With (ii), it follows that $\Pi(\Gamma'') \in (\Psi \circ \Phi)(\Pi(\Gamma))$ with certainty. By definition, $\Gamma \sim_{\Psi \circ \Phi} \Gamma''$ as claimed.
4. We have
$\Pi(\Gamma') \in \Phi(\Pi(\Gamma)) \subseteq \Phi'(\Pi(\Gamma))$
with certainty. Thus, by definition, $\Gamma \sim_{\Phi'} \Gamma'$.
5. By definition of a game, $\Pi(\Gamma') \in A'$ with certainty. By definition of $\mathrm{triv}_{A,A'}$, $\mathrm{triv}_{A,A'}(\Pi(\Gamma)) = A'$ with certainty. Hence, $\Pi(\Gamma') \in \mathrm{triv}_{A,A'}(\Pi(\Gamma))$ with certainty. We conclude that $\Gamma \sim_{\mathrm{triv}_{A,A'}} \Gamma'$ as claimed.
6. With certainty, $\Pi(\Gamma) \in \hat A$ (by assumption). Also, with certainty, $\Pi(\Gamma') \in \Phi(\Pi(\Gamma))$. Hence, $\Pi(\Gamma') \in \hat\Phi(\Pi(\Gamma))$ with certainty, since $\hat\Phi$ and $\Phi$ agree on $\hat A$. We conclude that $\Gamma \sim_{\hat\Phi} \Gamma'$ as claimed.
7. By reflexivity of $\sim$ (Lemma 2.1), $\Gamma \sim_{\mathrm{id}_A} \Gamma$. Since $\Pi(\Gamma) \in \hat A$ with certainty, Lemma 2.6 (applied with $\Phi = \mathrm{id}_A$) yields $\Gamma \sim_{\Phi_{\hat A}} \Gamma$, because $\Phi_{\hat A}$ agrees with $\mathrm{id}_A$ on $\hat A$ and is empty-valued otherwise.
Items 1-3 show that $\sim$ has properties resembling those of an equivalence relation. Note, however, that since $\sim$ is not a binary relationship, it itself cannot be an equivalence relation in the usual sense. We can construct equivalence relations, though, by existentially quantifying over the multi-valued function. For example, we might define an equivalence relation $\approx$ on games, where $\Gamma \approx \Gamma'$ if and only if there is a single-valued bijection $\Phi$ such that $\Gamma \sim_\Phi \Gamma'$.3
Item 4 states that if we can make an outcome correspondence claim less precise, it will still hold true. Item 5 states that in the extreme, it is always , where is the trivial, maximally imprecise outcome correspondence function that confers no information. Item 6 shows that can be used to express the elimination of outcomes, i.e., the belief that a particular outcome (or strategy) will never occur.
Besides an equivalence relation, we can also use with quantification over the respective outcome correspondence function to construct (non-symmetric) preorders over games, i.e., relations that are transitive and reflexive (but not symmetric or antisymmetric). Most importantly, we can construct a preorder on games where if for a that always increases every player's utilities.
We now show that, as advertised, outcome correspondence is closely tied to SPIs. The following theorem not only shows how outcome correspondences can be used to find (and prove) SPIs; it also shows that any SPI requires an outcome correspondence via a Pareto-improving correspondence function.
Definition 4. Let be a game and be a subset game of . Further let be such that . We call a Pareto-improving outcome correspondence (function) if for all and all .
Theorem 3. Let be a game and be a subset game of . Then is an SPI on if and only if there is a Pareto-improving outcome correspondence from to .
Proof. : By definition, with certainty. Hence, for ,
with certainty. Hence, by assumption about , with certainty, .
: Assume that with certainty for . We define
It is immediately obvious that is Pareto-improving as required. Also, whenever and for any and , it is (by assumption) with certainty . Thus, by definition of , it holds that . We conclude that as claimed.
Note that the theorem concerns weak SPIs and therefore allows the case where with certainty . To show that some is a strict SPI, we need additional information about which outcomes occur with positive probability. This, too, can be expressed via our outcome correspondence relation. However, since this is cumbersome, we will not formally address strictness much to keep things simple.4
We now illustrate how outcome correspondences can be used to derive the SPI for the Demand Game from the introduction as per Theorem 3. Of course, at this point we have not made any assumptions about when games are equivalent. We will introduce some in the following section. Nevertheless, we can already sketch the argument using the specific outcome correspondences that we have given intuitive arguments for. Let again be the Demand Game of Table 1. Then, as we have argued, , where is the game that results from removing and for both players; and if and otherwise. In a second step, , where is the game of Table 2 and is the isomorphism between and . Finally, transitivity (Lemma 2.3) implies that . To see that is Pareto-improving for the original utility functions of , notice that does not change utilities at all. The correspondence function maps the conflict outcome onto the outcome , which is better for both original players. Other than that, , too, does not change the utilities. Hence, is Pareto-improving. By Theorem 3, is therefore an SPI on .
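As an aside, the Pareto-improvement check in Definition 4 is mechanical once a correspondence and the original players' utilities are written down. The following Python sketch (ours, with made-up outcome labels and payoffs rather than the actual Demand Game tables) verifies that a correspondence which maps a conflict outcome onto a mutually better outcome, and every other outcome onto itself, is Pareto-improving.

def is_pareto_improving(phi, utilities):
    """phi maps each original outcome to the set of outcomes it may correspond to;
    utilities maps each outcome to the tuple of the original players' utilities.
    Returns True iff every corresponding outcome is weakly better for every player."""
    return all(
        u_new >= u_old
        for old, news in phi.items()
        for new in news
        for u_old, u_new in zip(utilities[old], utilities[new])
    )

# Hypothetical payoffs: the conflict outcome is mapped onto a mutually better outcome.
utilities = {"conflict": (0, 0), "split": (3, 3), "alice_favoured": (5, 1), "bob_favoured": (1, 5)}
phi = {outcome: {outcome} for outcome in utilities}
phi["conflict"] = {"split"}
print(is_pareto_improving(phi, utilities))  # True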
In principle, Theorem 3 does not hinge on and resulting from playing games. An analogous result holds for any random variables over and . In particular, this means that Theorem 3 applies also if the representatives receive other kinds of instructions (cf. Section 3.2). However, it seems hard to establish non-trivial outcome correspondences between and other types of instructions. Still, more complicated instructions can be used to derive different kinds of SPIs. For example, if there are different game SPIs, then the original players could tell their representatives to randomize between them in a coordinated way.
To make any claims about how the original players should play the meta-game, i.e., about what instructions they should submit, we generally need to make assumptions about how the representatives choose and (by Theorem 3) about outcome correspondence in particular.5 We here make two fairly weak assumptions.
Our first assumption is that the representatives never play strictly dominated actions and that removing them does not affect what the representatives would choose.
Assumption 1. Let be an arbitrary -player game where are pairwise disjoint, and let be strictly dominated by some other strategy in . Then , where for all , and whenever .
Assumption 1 expresses that representatives should never play strictly dominated strategies. Moreover, it states that we can remove strictly dominated strategies from a game and the resulting game will be played in the same way by the representatives. For example, this implies that when evaluating a strategy , the representatives do not take into account how many other strategies strictly dominates. Assumption 1 also allows (via Transitivity of as per Lemma 2.3) the iterated removal of strictly dominated strategies. The notion that we can (iteratively) remove strictly dominated strategies is common in game theory [41, 27, 39, Section 2.9, Chapter 12] and has rarely been questioned. It is also implicit in the solution concept of Nash equilibrium – if a strategy is removed by iterated strict dominance, that strategy is played in no Nash equilibrium. However, like the concept of Nash equilibrium, the elimination of strictly dominated strategies becomes implausible if the game is not played in the usual way. In particular, for Assumption 1 to hold, we will in most games have to assume that the representatives cannot in turn make credible commitments (or delegate to further subrepresentatives) or play the game iteratively [4].
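For concreteness, here is a small Python sketch (ours, not the paper's) of the kind of reduction that Assumption 1 licenses: iterated elimination of pure strategies that are strictly dominated by other pure strategies in a 2-player normal-form game. Domination by mixed strategies, mentioned later in this section, is not covered.

import numpy as np

def iterated_strict_dominance(U1, U2):
    """U1[i, j], U2[i, j]: payoffs of the row and column player when row i meets column j.
    Returns the indices of the strategies surviving iterated strict dominance."""
    rows = list(range(U1.shape[0]))
    cols = list(range(U1.shape[1]))
    changed = True
    while changed:
        changed = False
        # Remove a row strictly dominated by another remaining row.
        for i in rows:
            if any(all(U1[k, j] > U1[i, j] for j in cols) for k in rows if k != i):
                rows.remove(i)
                changed = True
                break
        # Remove a column strictly dominated by another remaining column.
        for j in cols:
            if any(all(U2[i, k] > U2[i, j] for i in rows) for k in cols if k != j):
                cols.remove(j)
                changed = True
                break
    return rows, cols

# Example: Prisoner's Dilemma (index 0 = Cooperate, 1 = Defect).
U1 = np.array([[3, 0], [4, 1]])
U2 = np.array([[3, 4], [0, 1]])
print(iterated_strict_dominance(U1, U2))  # ([1], [1]) -- only mutual defection survives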
Our second assumption is that the representatives play isomorphic games isomorphically when those games are fully reduced.
Assumption 2. Let and be two games that do not contain strictly dominated actions. If and are isomorphic, then there exists an isomorphism between and such that .
Similar desiderata have been discussed in the context of equilibrium selection, e.g., by Harsanyi and Selten [20, Chapter 3.4] [cf. 56, for a discussion in the context of fully cooperative multi-agent reinforcement learning].
Note that if there are multiple game isomorphisms, then we assume outcome correspondence for only one of them. This is necessary for the assumption to be satisfiable in the case of games with action symmetries. (Of course, such games are not the focus of this paper.) For example, let be Rock–Paper–Scissors. Then is isomorphic to itself via the function that for both players maps Rock to Paper, Paper to Scissors, and Scissors to Rock. But if it were , then this would mean that if the representatives play Rock in Rock–Paper–Scissors, they play Paper in Rock–Paper–Scissors. Contradiction! We will argue for the consistency of our version of the assumption in Section 4.4.3. Notice also that we make the assumption only for reduced games. This relates to the previous point about action-symmetric games. For example, consider two versions of Rock–Paper–Scissors and assume that in both versions both players have an additional strictly dominated action that breaks the action symmetries, e.g., the action “resign and give the opponent if they play Rock/Paper”. Then there would only be one isomorphism between these two games (which maps Rock to Paper, Paper to Scissors, and Scissors to Rock for both players). However, in light of Assumption 1, it seems problematic to assume that these strictly dominated actions restrict the outcome correspondences between these two games.6
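The following sketch (ours) makes the notion of a game isomorphism concrete for 2-player games by brute-forcing over relabelings of both players' actions. For simplicity it only accepts relabelings that preserve payoffs exactly; the notion used in this paper also allows each player's utilities to be rescaled by a positive affine transformation, which this sketch omits.

import itertools
import numpy as np

def find_isomorphism(U1, U2, V1, V2):
    """Search for row/column relabelings mapping game (U1, U2) onto game (V1, V2)."""
    n_rows, n_cols = U1.shape
    if V1.shape != U1.shape:
        return None
    for rho in itertools.permutations(range(n_rows)):        # row relabeling
        for sigma in itertools.permutations(range(n_cols)):  # column relabeling
            if all(
                U1[i, j] == V1[rho[i], sigma[j]] and U2[i, j] == V2[rho[i], sigma[j]]
                for i in range(n_rows)
                for j in range(n_cols)
            ):
                return rho, sigma
    return None

# Example: Matching Pennies and a version with the column player's actions swapped.
U1 = np.array([[1, -1], [-1, 1]]); U2 = -U1
V1 = np.array([[-1, 1], [1, -1]]); V2 = -V1
print(find_isomorphism(U1, U2, V1, V2))  # e.g. ((0, 1), (1, 0))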
One might worry that reasoning about the existence of multiple isomorphisms renders it intractable to deal with outcome correspondences as implied by Assumption 2, and in particular that it might make it impossible to tell whether a particular game is an SPI. However, one can intuitively see that the different isomorphisms between two games perform analogous operations. In particular, it turns out that if one isomorphism is Pareto-improving, then they all are:
Lemma 4. Let and be isomorphisms between and . If is (strictly) Pareto-improving, then so is .
We prove Lemma 4 in Appendix C.
Lemma 4 will allow us to conclude from the existence of a Pareto-improving isomorphism that there is a Pareto-improving s.t. by Assumption 2, even if there are multiple isomorphisms between and . In the following, we can therefore afford to be lax about our ignorance (in some games) about which isomorphism induces outcome correspondence. We will generally write “ by Assumption 2” as short for “ is a game isomorphism between and , and hence by Assumption 2 there exists an isomorphism such that ”.
One could criticize Assumption 2 by referring to focal points (introduced by Schelling [49, 48, pp. 54–58] [cf., e.g., 30, 18, 54, 9]) as an example where context and labels of strategies matter. A possible response might be that in games where context plays a role, that context should be included as additional information and not be considered part of . Assumption 2 would then either not apply to such games with (relevant) context or would require one to, in some way, translate the context along with the strategies. However, in this paper we will not formalize context, and assume that there is no decision-relevant context.
We will now argue that there exist representatives that indeed satisfy Assumptions 1 and 2, both to provide intuition and because our results would not be valuable if Assumptions 1 and 2 were inconsistent. We will only sketch the argument informally. To make the argument formal, we would need to specify in more detail what the set of games looks like and in particular what the objects of the action sets are.
Imagine that for each player there is a book7 that on each page describes a normal-form game that does not have any strictly dominated strategies. The actions have consecutive integer labels. Importantly, the book contains no pair of games that are isomorphic to each other. Moreover, for every fully reduced game, the book contains a game that is isomorphic to this game. (Unless we strongly restrict the set of games under consideration, the book must therefore have infinitely many pages.) We imagine that each player's book contains the same set of games. On each page, the book for Player recommends one of the actions of Player to be taken deterministically.8
Each representative owns a potentially different version of this book and uses it as follows to play a given game . First, the given game is fully reduced by iterated strict dominance to obtain a game . They then look up the unique game in the book that is isomorphic to and map the action labels in onto the integer labels of the game in the book via some isomorphism. If there are multiple isomorphisms from to the relevant page in the book, then all representatives decide between them using the same deterministic procedure. Finally, they choose the action recommended by the book.
It is left to show that a pair of representatives thus specified satisfies Assumptions 1 and 2. We first argue that Assumption 1 is satisfied. Let be a game and let be a game that arises from removing a strictly dominated action from . By the well-known path independence of iterated elimination of strictly dominated strategies [1, 19, 41], fully reducing and results in the same game. Hence, the representatives play the same actions in and .
Second, we argue that Assumption 2 is satisfied. Let us say and are fully reduced and isomorphic. Then it is easy to see that each player plays and based on the same page of their book. Let the game on that book page be . Let and be the bijections used by the representatives to translate actions in and , respectively, to labels in . Then if the representatives take actions in , the actions are the ones specified by the book for , and hence the actions are played in . Thus . It is easy to see that is a game isomorphism between and .
One could try to use principles other than Assumptions 1 and 2. We here give some considerations. First, game theorists have also considered the iterated elimination of weakly dominated strategies [17, 31, Section 4.11]. Unfortunately, the iterated removal of weakly dominated strategies is path-dependent [27, Section 2.7.B; 7, Section 5.2; 39, Section 12.3]. That is, for some games, iterated removal of weakly dominated strategies can lead to different subset games, depending on which weakly dominated strategy one chooses to eliminate at any stage. A straightforward extension of Assumption 1 to allow the elimination of weakly dominated strategies would therefore be inconsistent in such games, which can be seen as follows.
Work on the path dependence of iterated removal of weakly dominated strategies has shown that there are games with two different outcomes such that by iterated removal of weakly dominated strategies from , we can obtain both and . If we had an assumption analogous to Assumption 1 but for weak dominance, then (with Lemma 2.3 – transitivity), we would obtain both that and that , where for all and for all . The former would mean (by Lemma 2.6) that for all we have that with certainty; while the latter would mean that we have that with certainty. But jointly this means that for all , we have that with certainty, which cannot be the case as by definition. Thus, we cannot make an assumption analogous to Assumption 1 for weak dominance.
As noted above, the iterated removal of strictly dominated strategies, on the other hand, is path-independent, and in the 2-player case always eliminates exactly the non-rationalizable strategies [1, 19, 41]. Many other dominance concepts have been shown to have path independence properties. For an overview, see Apt [1]. We could have made an independence assumption based on any of these path-independent dominance concepts. For example, elimination of strategies that are strictly dominated by a mixed strategy (or, equivalently, of so-called never-best responses) is also path independent [40, Section 4.2].
With Assumptions 1 and 2, all our outcome correspondence functions are either 1-to-1 or 1-to-0. Other elimination assumptions could involve the use of many-to-1 or even many-to-many functions. In general, such functions are needed when a strategy can be eliminated to obtain a strategically equivalent game, but in the original game may still be played. The simplest example would be the elimination of payoff-equivalent strategies. Imagine that in some game for all opponent strategies it is the case that and that there are no other strategies that are similarly payoff-equivalent to and . Then one would assume that , where maps onto and otherwise is just the identity function. As an example, imagine a variant of the Demand Game in which Player 1 has an additional action that results in the same payoffs as for both players against Player 2's and but potentially slightly different payoffs against and . With our current assumptions we would be unable to derive a non-trivial SPI for this game. However, with an assumption about the elimination of duplicate actions in hand, we could (after removing and as usual) remove or and thereby derive the usual SPI. Many-to-1 elimination assumptions can also arise from some dominance concepts if they have weaker path independence properties. For example, iterated elimination by so-called nice weak dominance [32] is only path-independent up to strategic equivalence. Like the assumption about payoff-equivalent strategies, an elimination assumption based on nice weak dominance therefore cannot assume that the eliminated action is not played in the original game at all.
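As a concrete illustration of the simplest many-to-1 case, the following sketch (ours, with made-up payoffs) detects payoff-equivalent row strategies, i.e., rows that give both players identical payoffs against every column; under a duplicate-elimination assumption, one of each such pair could be removed via a correspondence that maps it onto the other.

import numpy as np

def duplicate_rows(U1, U2):
    """Return pairs of row strategies that are payoff-equivalent for both players."""
    n_rows = U1.shape[0]
    return [
        (i, k)
        for i in range(n_rows)
        for k in range(i + 1, n_rows)
        if np.array_equal(U1[i], U1[k]) and np.array_equal(U2[i], U2[k])
    ]

# Hypothetical example: rows 0 and 2 are payoff-equivalent.
U1 = np.array([[2, 0], [1, 1], [2, 0]])
U2 = np.array([[0, 2], [1, 1], [0, 2]])
print(duplicate_rows(U1, U2))  # [(0, 2)]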
In this section, we use Lemma 2, Theorem 3, and Assumptions 1 and 2 to formally prove a few SPIs.
Proposition (Example) 5. Let be the Prisoner's Dilemma (Table 3) and be any subset game of with . Then under Assumption 1, is a strict SPI on .
Proof. By applying Assumption 1 twice and Transitivity once, , where and and for all . By Lemma 2.5, we further obtain , where is as described in the proposition. Hence, by transitivity, . It is easy to verify that the function is Pareto-improving.
Proposition (Example) 6. Let be the Demand Game of Table 1 and be the subset game described in Table 2. Under Assumptions 1 and 2, is an SPI on . Further, if , then is a strict SPI.
Proof. Let . We can repeatedly apply Assumption 1 to eliminate from the strategies and for both players. We can then apply Lemma 2.3 (Transitivity) to obtain , where and
Next, by Assumption 2, , where and for . We can then apply Lemma 2.3 (Transitivity) again, to infer . It is easy to verify that for all , it is for all the case that .
Next, we give two examples of unilateral SPIs. We start with an example that is trivial in that the original player instructs her representative to take a specific action. We then give the SPI for the Complicated Temptation Game as a non-trivial example.
Consider the Temptation Game given in Table 6. In this game, Player 1's (for Temptation) strictly dominates . Once is removed, Player 2 prefers . Hence, this game is strict-dominance solvable to . Player 1 can safely Pareto-improve on this result by telling her representative to play , since Player 2's best response to is and . We now show this formally.
Proposition (Example) 7. Let be the game of Table 6. Under Assumption 1, is a strict SPI on .
Proof. First consider . We can apply Assumption 1 to eliminate Player 1's and then apply Assumption 1 again to the resulting game to also eliminate Player 2's . By transitivity, we find , where and and .
Next, consider . We can apply Assumption 1 to remove Player 2's strategy and find , where and and .
Third, by Lemma 2.5, where .
Finally, we can apply transitivity to conclude , where . It is easy to verify that and . Hence, is Pareto-improving and so by Theorem 3, is an SPI on .
Note that in this example, Player 1 simply commits to a particular strategy and Player 2 maximizes their utility given Player 1's choice. Hence, this SPI can be justified with much simpler unilateral commitment setups [11, 52, 59]. For example, if the Temptation Game were played as a sequential game in which Player 1 plays first, its unique subgame-perfect equilibrium is .
In Table 4 we give the Complicated Temptation Game, which better illustrates the features specific to our setup. Roughly, it is an extension of the simpler Temptation Game of Table 6. In addition to choosing versus and versus , the players also have to make an additional choice (1 versus 2), which is difficult in that it cannot be solved by strict dominance. As we have argued in Section 3.1, the game in Table 5 is a unilateral SPI on Table 4. We can now show this formally.
Proposition (Example) 8. Let be the Complicated Temptation Game (Table 4) and be the subset game in Table 5. Under Assumptions 1 and 2, is a unilateral SPI on .
Proof. In , for Player 1, and strictly dominate and . We can thus apply Assumption 1 to eliminate Player 1's and . In the resulting game, Player 2's and strictly dominate and , so one can apply Assumption 1 again to the resulting game to also eliminate Player 2's and . By transitivity, we find , where and
Next, consider (Table 5). We can apply Assumption 1 to remove Player 2's strategies and and find , where and
Third, by Assumption 2, where decomposes into and , corresponding to the two players, respectively, where and for .
Finally, we can apply transitivity and the rule about symmetry and inverses (Lemma 2.2) to conclude . It is easy to verify that is Pareto-improving.
In this section, we ask how computationally costly it is for the original players to identify for a given game a non-trivial SPI . Of course, the answer to this question depends on what the original players are willing to assume about how their representatives act. For example, if only trivial outcome correspondences (as per Lemma 2.1 and 2.5) are assumed, then the decision problem is easy. Similarly, if for given is hard to decide (e.g., because it requires solving for the Nash equilibria of and ), then this could trivially also make the safe Pareto improvement problem hard to decide. We are specifically interested in deciding whether a given game has a non-trivial SPI that can be proved using only Assumptions 1 and 2, the general properties of outcome correspondence (in particular Transitivity (Lemma 2.3) and Symmetry (Lemma 2.2)), and Theorem 3.
Definition 5. The SPI decision problem consists in deciding for any given , whether there is a game and a sequence of outcome correspondences and a sequence of subset games of s.t.:
Many variants of this problem may be considered. For example, to match Definition 1, the definition of the strict SPI problem assumes that all outcomes that survive iterated elimination occur with positive probability. Alternatively we could have required that for demonstrating strictness, there must be a player such that for all that survive iterated elimination, . Similarly one may wish to find SPIs that are strict improvements for all players. We may also wish to allow the use of the elimination of duplicate strategies (as described in Section 4.4.4) or trivial outcome correspondence steps as per Lemma 2.5. These modifications would not change the computational complexity of the problem, nor would they require new proof ideas. One may also wish to compute all SPIs, or – in line with multi-criteria optimization [14, 58] – all SPIs that cannot in turn be safely Pareto-improved upon. However, in general there may exist exponentially many such SPIs. To retain any hope of developing an efficient algorithm, one would therefore have to first develop a more efficient representation scheme [cf. 42, Sect. 16.4].
Theorem 9. The (strict) (unilateral) SPI decision problem is NP-complete, even for 2-player games.
Proposition 10. For games with that can be reduced (via iterative application of Assumption 1) to a game with , the (strict) (unilateral) SPI decision problem can be solved in .
The full proof is tedious (see Appendix D), but the main idea is simple, especially for omnilateral SPIs. To find an omnilateral SPI on based on Assumptions 1 and 2, one has to first iteratively remove all strictly dominated actions to obtain a reduced game , which the representatives would play the same as the original game. This can be done in polynomial time. One then has to map the actions onto the original in such a way that each outcome in is mapped onto a weakly Pareto-better outcome in . Our proof of NP-hardness works by reducing from the subgraph isomorphism problem, where the payoff matrices of and represent the adjacency matrices of the graphs.
Besides being about a specific set of assumptions about , note that Theorem 9 and Proposition 10 also assume that the utility function of the game is represented explicitly in normal form as a payoff matrix. If we changed the game representation (e.g., to boolean circuits, extensive-form game trees, quantified boolean formulas, or even Turing machines), this could affect the complexity of the SPI problem. For example, Gabarró, García, and Serna [16] show that the game isomorphism problem on normal-form games is equivalent to the graph isomorphism problem, while it is equivalent to the (likely computationally harder) boolean circuit isomorphism problem for a weighted boolean formula game representation. Solving the SPI problem requires solving a subset game isomorphism problem (see the proof of Lemma 28 in Appendix D for more detail). We therefore suspect that the SPI problem analogously increases in computational complexity (perhaps to being -complete) if we treat games in a weighted boolean formula representation. In fact, even reducing a game using strict dominance by pure strategies – which contributes only insignificantly to the complexity of the SPI problem for normal-form games – is difficult in some game representations [10, Section 6]. Note, however, that for any game representation to which 2-player normal-form games can be efficiently reduced – such as, for example, extensive-form games – the hardness result also applies.
In this section, we imagine that the players are able to simply invent new token strategies with new payoffs that arise from mixing existing feasible payoffs. To define this formally, we first define for any game ,
to be the set of payoff vectors that are feasible by some correlated strategy. The underlying notion of correlated strategies is the same as in correlated equilibrium [2, 3], but in this paper it will not be relevant whether any such strategy is a correlated equilibrium of . Instead, their use will hinge on commitments [cf. 34]. Note that is exactly the convex closure of , i.e., the convex closure of the set of deterministically achievable utilities of the original game.
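The following small sketch (ours, with made-up payoffs) illustrates this point numerically: the payoff vectors feasible via correlated strategies form the convex hull of the pure outcomes' payoff vectors, which can be computed with scipy.

import numpy as np
from scipy.spatial import ConvexHull

# Made-up 2x2 payoffs; `points` collects the payoff vector of each pure outcome.
U1 = np.array([[4, 1], [6, 0]])
U2 = np.array([[4, 6], [1, 0]])
points = np.array([[U1[i, j], U2[i, j]] for i in range(2) for j in range(2)])
hull = ConvexHull(points)
print(points[hull.vertices])  # extreme points of the feasible payoff set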
For any game , we then imagine that in addition to subset games, the players can let the representatives play a perfect-coordination token game , where for all , and are arbitrary utility functions to be used by the representatives and are the utilities that the original players assign to the token strategies.
The instruction lets the representatives play the game as usual. However, the strategies are imagined to be meaningless token strategies which do not resolve the given game . Once some token strategies are selected, these are translated into some probability distribution over , i.e., into a correlated strategy of the original game. This correlated strategy is then played by the original players, thus giving rise to (expected) utilities . These distributions and thus utilities are specified by the original players.
Definition 6. Let be a game. A perfect-coordination SPI for is a perfect-coordination token game for s.t. with certainty. We call a strict perfect-coordination SPI if there furthermore is a player for whom with positive probability.
As an example, imagine that is just the - subset game of the Demand Game of Table 1. Then, intuitively, an SPI under improved coordination could consist of the original players telling the representatives, “Play as if you were playing the - subset game of the Demand Game, but whenever you find yourself playing , randomize [according to some given distribution] between the other (Pareto-optimal) outcomes instead”. Formally, and would then consist of tokenized versions of the original strategies. The utility functions and are then simply the same as in the original Demand Game except that they are applied to the token strategies. For example, . The utilities for the original players remove the conflict outcome. For example, the original players might specify , representing that the representatives are supposed to play in the case. For all other outcomes , it must be the case that because the other outcomes cannot be Pareto-improved upon. As with our earlier SPIs for the Demand Game, Assumption 2 implies that , where maps the original conflict outcome onto the Pareto-optimal (,).
Relative to the SPIs considered up until now, these new types of instructions put significant additional requirements on how the representatives interact. They now have to engage in a two-round process of first choosing and observing one another's token strategies and then playing a correlated strategy for the original game. Further, it must be the case that this additional coordination does not affect the payoffs of the original outcomes. The latter may not be the case in, e.g., the Game of Chicken. That is, we could imagine a Game of Chicken in which coordination is possible but in which the rewards of the game change if the players do coordinate. After all, the underlying story in the Game of Chicken is that the positive reward – admiration from peers – is attained precisely for accepting a grave risk.
With these more powerful ways to instruct representatives, we can now replace individual outcomes of the default game ad libitum. For example, in the reduced Demand Game, we singled out the outcome as Pareto-suboptimal and replaced it by a Pareto-optimal outcome, while keeping all other outcomes the same. This allows us to construct SPIs in many more games than before.
Definition 7. The strict perfect-coordination SPI decision problem consists in deciding for any given whether under Assumption 2 there is a strict perfect-coordination SPI for .
Lemma 11. For a given -player game and payoff vector , it can be decided by linear programming and thus in polynomial time whether is Pareto-optimal in .
For an introduction to linear programming, see, e.g., Schrijver [50]. In short, a linear program is a specific type of constrained optimization problem that can be solved efficiently.
Proof. Finding a Pareto improvement on a given payoff vector $v$ can be formulated as the following linear program (writing $u_i(a)$ for player $i$'s utility at outcome $a$ and $\lambda$ for a probability distribution over the outcomes):
$$\max_{\lambda \geq 0,\ \sum_a \lambda_a = 1}\ \sum_i \sum_a \lambda_a u_i(a) \quad \text{s.t.} \quad \sum_a \lambda_a u_i(a) \geq v_i \text{ for all players } i.$$
The vector $v$ is Pareto-optimal within the set of feasible payoff vectors if and only if the optimal value of this program equals $\sum_i v_i$.
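This check is easy to run with an off-the-shelf LP solver. Below is a minimal sketch (ours, not the paper's code; the function and variable names are our own) using scipy.optimize.linprog.

import numpy as np
from scipy.optimize import linprog

def is_pareto_optimal(payoffs, v, tol=1e-9):
    """payoffs: array of shape (num_outcomes, num_players); v: payoff vector.
    True iff no convex combination of the outcome payoff vectors weakly dominates v
    with a strictly larger utility sum."""
    payoffs = np.asarray(payoffs, dtype=float)
    v = np.asarray(v, dtype=float)
    num_outcomes, _ = payoffs.shape
    res = linprog(
        c=-payoffs.sum(axis=1),                        # maximize total utility
        A_ub=-payoffs.T, b_ub=-v,                      # each player gets at least v_i
        A_eq=np.ones((1, num_outcomes)), b_eq=[1.0],   # convex combination
        bounds=[(0, None)] * num_outcomes,
    )
    if not res.success:                                # no feasible dominating point
        return True
    return -res.fun <= v.sum() + tol

# Example with hypothetical payoffs: (2, 2) is dominated by mixing (4, 1) and (1, 4).
payoffs = [(4, 1), (1, 4), (2, 2), (0, 0)]
print(is_pareto_optimal(payoffs, (2, 2)))      # False
print(is_pareto_optimal(payoffs, (2.5, 2.5)))  # True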
Based on Lemma 11, Algorithm 1 decides whether there is a strict perfect-coordination SPI for a given game .
It is easy to see that this algorithm runs in polynomial time (in the size of, e.g., the normal-form representation of the game). It is also correct: if it returns True, simply replace the Pareto-suboptimal outcome by a feasible outcome that Pareto-dominates it while keeping all other outcomes the same; if it returns False, then all outcomes are Pareto-optimal within the feasible set and so there can be no strict SPI. We summarize this result in the following proposition.
Proposition 12. Assuming is known and that Assumption 2 holds, it can be decided in polynomial time whether there is a strict perfect-coordination SPI.
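Since Algorithm 1 itself is not reproduced here, the following sketch (ours) reconstructs the idea: test each outcome of the (reduced) game for Pareto-domination within the convex hull of the feasible payoff vectors; if some outcome is dominated, a strict perfect-coordination SPI exists.

import numpy as np
from scipy.optimize import linprog

def pareto_dominated_in_hull(payoffs, v, tol=1e-9):
    """True iff some convex combination of the rows of `payoffs` weakly dominates v
    with a strictly larger utility sum."""
    payoffs = np.asarray(payoffs, dtype=float)
    v = np.asarray(v, dtype=float)
    res = linprog(
        c=-payoffs.sum(axis=1),
        A_ub=-payoffs.T, b_ub=-v,
        A_eq=np.ones((1, len(payoffs))), b_eq=[1.0],
    )
    return res.success and -res.fun > v.sum() + tol

def has_strict_perfect_coordination_spi(payoffs):
    """payoffs: list of payoff vectors of the outcomes of the (reduced) game."""
    return any(pareto_dominated_in_hull(payoffs, v) for v in payoffs)

# Hypothetical reduced bargaining-style payoffs with a conflict outcome (0, 0):
print(has_strict_perfect_coordination_spi([(4, 1), (1, 4), (2, 2), (0, 0)]))  # True
# A game whose outcomes are all Pareto-optimal within the hull:
print(has_strict_perfect_coordination_spi([(4, 1), (1, 4)]))  # False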
From the problem of deciding whether there are strict SPIs under improved coordination at all, we move on to the question of what different perfect-coordination SPIs there are. In particular, one might ask what the cost is of only considering safe Pareto improvements relative to acting on a probability distribution over and the resulting expected utilities . We start with a lemma that directly provides a characterization. So far, all the considered perfect-coordination SPIs for a game have consisted in letting the representatives play a game that is isomorphic to the original game, but Pareto-improves (from the original players' perspectives, i.e., ) at least one of the outcomes. It turns out that we can restrict attention to this very simple type of SPI under improved coordination.
Lemma 13. Let be any game. Let be a perfect-coordination SPI on . Then we can define with values in such that under Assumption 2 the game
is also an SPI on , with
for all and consequently .
Proof. First note that is isomorphic to . Thus, by Assumption 2, there is an isomorphism s.t. . WLOG assume that simply maps . Then define as follows:
Here describes the utilities that the original players assign to the outcomes of . Since maps onto and is convex, as defined also maps into as required. Note that for all it is by assumption with certainty. Hence,
as required.
Because of this result, we will focus on these particular types of SPIs, which simply create an isomorphic game with different (Pareto-better) utilities. Note, however, that without assigning exact probabilities to the distributions of , the original players will in general not be able to construct a that satisfies the expected payoff equalities. For this reason, one could still conceive of situations in which the original players would choose a different type of SPI and would be unable to instead choose an SPI of the type described in Lemma 13.
Lemma 13 directly implies a characterization of the expected utilities that can be achieved with perfect-coordination SPIs. Of course, this characterization depends on the exact distribution of . We omit the statement of this result. However, we state the following implication.
Corollary 14. Under Assumption 2, the set of Pareto improvements that are safely achievable with perfect coordination
is a convex polygon.
Because of this result, one can also efficiently optimize convex functions over the set of perfect-coordination SPIs. Even without referring to the distribution , many interesting questions can be answered efficiently. For example, we can efficiently identify the perfect-coordination SPI that maximizes the minimum improvements across players and outcomes .
In the following, we aim to use Lemma 13 and Corollary 14 to give maximally strong positive results about what Pareto improvements can be safely achieved, without referring to exact probabilities over . To keep things simple, we will do this only for the case of two players. To state our results, we first need some notation: We use
to denote the Pareto frontier of a convex polygon (or, more generally, a convex, closed set). For any real number , we use to denote the which maximizes under the constraint . (Recall that we consider 2-player games, so is a single real number.) Note that such a exists if and only if is 's utility in some feasible payoff vector. We first state our result formally. Afterwards, we will give a graphical explanation of the result, which we believe is easier to understand.
Theorem 15. Make Assumption 2. Let be a two-player game. Let be some potentially unsafe Pareto improvement on . For , let . Then:
A) If there is some element in which Pareto-dominates all of and if is Pareto-dominated by an element of at least one of the following three sets:
Then there is an SPI under improved coordination such that .
B) If there is no element in which Pareto-dominates all of and if is Pareto-dominated by an element each of and as defined above, then there is a perfect-coordination SPI such that .
We now illustrate the result graphically. We start with Case A, which is illustrated in Figure 2. The Pareto frontier is the solid line in the north and east. The points marked x indicate outcomes in . The point marked by a filled circle indicates the expected value of the default equilibrium . The vertical dashed lines starting at the two extreme x marks illustrate the application of to project onto the Pareto frontier. The dotted line between these two points is . Similarly, the horizontal dashed lines starting at x marks illustrate the application of to project onto the Pareto frontier. The line segment between these two points is . In this case, this line segment lies on the Pareto frontier. The set is simply that part of the Pareto frontier which Pareto-dominates all elements of , i.e., the part of the Pareto frontier to the north-east between the two intersections with the northern horizontal dashed line and the eastern vertical dashed line. The theorem states that for some to be a Pareto improvement, it must be in the gray area.
Case B of Theorem 15 is depicted in Figure 3. Note that here the two line segments and intersect. To ensure that a Pareto improvement is safely achievable, the theorem requires that it is below both of these lines, as indicated again by the gray area.
For a full proof, see Appendix E. Roughly, Theorem 15 is proven by re-mapping each of the outcomes of the original game as per Lemma 13. For example, the projection of the default equilibrium (i.e., the filled circle) onto is obtained as an SPI by projecting all the outcomes (i.e., all the x marks) onto . In Case A, any utility vector that Pareto-improves on all outcomes of the original game can be obtained by re-mapping all outcomes onto . Other kinds of are handled similarly.
As a corollary of Theorem 15, we can see that all (potentially unsafe) Pareto improvements in the - subset game of the Demand Game of Table 1 are equivalent to some perfect-coordination SPI. However, this is not always the case:
Proposition 16. There is a game , representatives that satisfy Assumptions 1 and 2, and an outcome s.t. for all players , but there is no perfect-coordination SPI s.t. for all players , .
As an example of such a game, consider the game in Table 7. Strategy can be eliminated by strict dominance (Assumption 1) for both players, leaving a typical Chicken-like payoff structure with two pure Nash equilibria ( and ), as well as a mixed Nash equilibrium .
Now let us say that in the resulting game for some with . Then one (unsafe) Pareto improvement would be to simply always have the representatives play for a certain payoff of . Unfortunately, there is no safe Pareto improvement with the same expected payoff. Notice that is the unique element of that maximizes the sum of the two players' utilities. By linearity of expectation and convexity of , if for any it is , it must be with certainty. However, in any safe Pareto improvement the outcomes and must correspond to outcomes that still give utilities of and , respectively, because these are Pareto-optimal within the set of feasible payoff vectors. We illustrate this as an example of Case B of Theorem 15 in Figure 4.
In the Demand Game, there happens to be a single non-trivial SPI. However, in general (even without the type of coordination assumed in Section 5) there may be multiple SPIs that result in different payoffs for the players. For example, imagine an extension of the Demand Game in which both players have an additional action , which is like , except that under , Aliceland can peacefully annex the desert. Aliceland prefers this SPI over the original one, while Bobbesia has the opposite preference. In other cases, it may be unclear to some or all of the players which of two SPIs they prefer. For example, imagine a version of the Demand Game in which one SPI mostly improves on and another mostly improves on the other three outcomes; then outcome probabilities are required to compare the two. If multiple SPIs are available, the original players would be left with the difficult decision of which SPI to demand in their instruction.9
This difficulty of choosing what SPI to demand cannot be denied. However, we would here like to emphasize that players can profit from the use of SPIs even without addressing this SPI selection problem. To do so, a player picks an instruction that is very compliant (“dove-ish”) w.r.t. what SPI is chosen, e.g., one that simply goes with whatever SPI the other players demand as long as that SPI cannot further be safely Pareto-improved upon.10 In many cases, all such SPIs benefit all players. For example, optimal SPIs in bargaining scenarios like the Demand Game remove the conflict outcome, which benefits all parties. Thus, a player can expect a safe improvement even under such maximally compliant demands on the selected SPI.
In some cases there may also be natural choices of demands (à la Schelling [48, pp. 54–58], or focal points). If the underlying game is symmetric, a symmetric safe Pareto improvement may be a natural choice. For example, the fully reduced version of the Demand Game of Table 1 is symmetric. Hence, we might expect that even if multiple SPIs were available, the original players would choose a symmetric one.
Safe Pareto improvements are a promising new idea for delegating strategic decision making. To conclude this paper, we discuss some ideas for further research on SPIs.
Straightforward technical questions arise in the context of the complexity results of Section 4.6. First, what impact on the complexity does varying the assumptions have? Our NP-completeness proof is easy to generalize at least to some other types of assumptions. It would be interesting to give a generic version of the result. We also wonder whether there are plausible assumptions under which the complexity changes in interesting ways. Second, one could ask how the complexity changes if we use more sophisticated game representations (see the remarks at the end of that section). Third, one could impose additional restrictions on the sought SPI. Fourth, we could restrict the games under consideration. Are there games in which it becomes easy to decide whether there is an SPI?
It would also be interesting to see what real-world situations can already be interpreted as utilizing SPIs, or could be Pareto-improved upon using SPIs.
This work was supported by the National Science Foundation under Award IIS-1814056. Some early work on this topic was conducted by Caspar Oesterheld while working at the Foundational Research Institute (now the Center on Long-Term Risk). For valuable comments and discussions, we are grateful to Keerti Anand, Tobias Baumann, Jesse Clifton, Max Daniel, Lukas Gloor, Adrian Hutter, Vojtěch Kovařík, Anni Leskelä, Brian Tomasik and Johannes Treutlein, and our wonderful anonymous referees. We also thank attendees of a 2017 talk at the Future of Humanity Institute at the University of Oxford, a talk at the May 2019 Effective Altruism Foundation research retreat, and our talk at AAMAS 2021.
[1] Krzysztof R. Apt. “Uniform Proofs of Order Independence for Various Strategy Elimination Procedures”. In: The B.E. Journal of Theoretical Economics 4.1 (2004), pp. 1–48. DOI: 10.2202/1534-5971.1141.
[2] Robert J. Aumann. “Correlated Equilibrium as an Expression of Bayesian Rationality”. In: Econometrica 55.1 (Jan. 1987), pp. 1–18. DOI: 10.2307/1911154.
[3] Robert J. Aumann. “Subjectivity and Correlation in Randomized Strategies”. In: Journal of Mathematical Economics 1.1 (Mar. 1974), pp. 67–97. DOI: 10.1016/0304-4068(74)90037-8.
[4] Robert Axelrod. The Evolution of Cooperation. New York: Basic Books, 1984.
[5] Mihaly Barasz et al. Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic. Jan. 2014. URL: https://arxiv.org/abs/1401.5577.
[6] Ken Binmore. Game Theory – A Very Short Introduction. Oxford University Press, 2007.
[7] Tilman Börgers. “Pure Strategy Dominance”. In: Econometrica 61.2 (Mar. 1993), pp. 423–430.
[8] Vitalik Buterin. Ethereum White Paper – A Next Generation Smart Contract & Decentralized Application Platform. Updated version available at https://github.com/ethereum/wiki/wiki/White-Paper. 2014. URL: https://cryptorating.eu/whitepapers/Ethereum/Ethereum_white_paper.pdf.
[9] Andrew M. Colman. “Salience and focusing in pure coordination games”. In: Journal of Economic Methodology 4.1 (1997), pp. 61–81. DOI: 10.1080/13501789700000004.
[10] Vincent Conitzer and Tuomas Sandholm. “Complexity of (Iterated) Dominance”. In: Proceedings of the 6th ACM conference on Electronic commerce. Vancouver, Canada: Association for Computing Machinery, June 2005, pp. 88–97. DOI: 10.1145/1064009.1064019.
[11] Vincent Conitzer and Tuomas Sandholm. “Computing the Optimal Strategy to Commit to”. In: Proceedings of the ACM Conference on Electronic Commerce (EC). Ann Arbor, MI, USA: Association for Computing Machinery, 2006, pp. 82–90.
[12] Stephen A. Cook. “The complexity of theorem-proving procedures”. In: STOC ’71: Proceedings of the third annual ACM symposium on Theory of computing. New York: Association for Computing Machinery, May 1971, pp. 151–158. DOI: 10.1145/800157.805047.
[13] Andrew Critch. “A Parametric, Resource-Bounded Generalization of Löb’s Theorem, and a Robust Cooperation Criterion for Open-Source Game Theory”. In: Journal of Symbolic Logic 84.4 (Dec. 2019), pp. 1368–1381. DOI: 10.1017/jsl.2017.42.
[14] Matthias Ehrgott. Multicriteria Optimization. 2nd ed. Berlin: Springer, 2005.
[15] Lance Fortnow. “Program equilibria and discounted computation time”. In: Proceedings of the 12th Conference on Theoretical Aspects of Rationality and Knowledge (TARK ’09). July 2009, pp. 128–133. DOI: 10.1145/1562814.1562833.
[16] Joaquim Gabarró, Alina García, and Maria Serna. “The complexity of game isomorphism”. In: Theoretical Computer Science 412.48 (Nov. 2011), pp. 6675–6695. DOI: 10.1016/j.tcs.2011.07.022.
[17] David Gale. “A Theory of N-Person Games with Perfect Information”. In: Proceedings of the National Academy of Sciences of the United States of America 39.6 (June 1953), pp. 496–501. DOI: 10.1073/pnas.39.6.496.
[18] David Gauthier. “Coordination”. In: Dialogue 14.2 (June 1975), pp. 195–221. DOI: 10.1017/S0012217300043365.
[19] Itzhak Gilboa, Ehud Kalai, and Eitan Zemel. “On the order of eliminating dominated strategies”. In: Operations Research Letters 9.2 (Mar. 1990), pp. 85–89. DOI: 10.1016/0167-6377(90)90046-8.
[20] John C. Harsanyi and Reinhard Selten. A General Theory of Equilibrium Selection in Games. Cambridge, MA: The MIT Press, 1988.
[21] Bengt Robert Holmström. “On Incentives and Control in Organizations”. PhD thesis. Stanford University, Dec. 1977.
[22] J. V. Howard. “Cooperation in the Prisoner’s Dilemma”. In: Theory and Decision 24 (May 1988), pp. 203–213. DOI: 10.1007/BF00148954.
[23] Leonid Hurwicz and Leonard Shapiro. In: The Bell Journal of Economics 9.1 (1978), pp. 180–191. DOI: 10.2307/3003619.
[24] Adam Tauman Kalai et al. “A commitment folk theorem”. In: Games and Economic Behavior 69 (2010), pp. 127–137. DOI: 10.1016/j.geb.2009.09.008.
[25] Jon Kleinberg and Robert Kleinberg. “Delegated Search Approximates Efficient Search”. In: Proceedings of the 19th ACM Conference on Economics and Computation (EC). 2018.
[26] Frank H. Knight. Risk, Uncertainty, and Profit. Boston, MA, USA: Houghton Mifflin Company, 1921.
[27] Elon Kohlberg and Jean-Francois Mertens. “On the Strategic Stability of Equilibria”. In: Econometrica 54.5 (Sept. 1986), pp. 1003–1037. DOI: 10.2307/1912320.
[28] Jean-Jacques Laffont and David Martimort. The Theory of Incentives – The Principal-Agent Model. Princeton, NJ: Princeton University Press, 2002.
[29] Richard A. Lambert. “Executive Effort and Selection of Risky Projects”. In: RAND J. Econ. 17.1 (1986), pp. 77–88.
[30] David Lewis. Convention. Harvard University Press, 1969.
[31] R. Duncan Luce and Howard Raiffa. Games and Decisions. Introduction and Critical Survey. New York: Dover Publications, 1957.
[32] Leslie M. Marx and Jeroen M. Swinkels. “Order Independence for Iterated Weak Dominance”. In: Games and Economic Behavior 18 (1997), pp. 219–245. DOI: 10.1006/game.1997.0525.
[33] R. Preston McAfee. “Effective Computability in Economic Decisions”. May 1984. URL: https://www.mcafee.cc/Papers/PDF/EffectiveComputability.pdf.
[34] Dov Monderer and Moshe Tennenholtz. “Strong mediated equilibrium”. In: Artificial Intelligence 173.1 (Jan. 2009), pp. 180–195. DOI: 10.1016/j.artint.2008.10.005.
[35] John von Neumann. “Zur Theorie der Gesellschaftsspiele”. In: Mathematische Annalen 100 (1928), pp. 295–320. DOI: 10.1007/BF01448847.
[36] Caspar Oesterheld. “Robust Program Equilibrium”. In: Theory and Decision 86.1 (Feb. 2019), pp. 143–159.
[37] Caspar Oesterheld and Vincent Conitzer. “Minimum-regret contracts for principal-expert problems”. In: Proceedings of the 16th Conference on Web and Internet Economics (WINE). 2020.
[38] Hessel Oosterbeek, Randolph Sloof, and Gijs van de Kuilen. “Cultural Differences in Ultimatum Game Experiments: Evidence from a Meta-Analysis”. In: Experimental Economics 7 (June 2004), pp. 171–188. DOI: 10.1023/B:EXEC.0000026978.14316.74.
[39] Martin J. Osborne. An Introduction to Game Theory. New York: Oxford University Press, 2004.
[40] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. The MIT Press, 1994.
[41] David G. Pearce. “Rationalizable Strategic Behavior and the Problem of Perfection”. In: Econometrica 54.4 (July 1984), pp. 1029–1050.
[42] Guillaume Perez. “Decision diagrams: constraints and algorithms”. PhD thesis. Université Côte d’Azur, 2017. URL: https://tel.archives-ouvertes.fr/tel-01677857/document.
[43] Martin Peterson. An Introduction to Decision Theory. Cambridge University Press, 2009.
[44] Steven Pinker. How the Mind Works. W. W. Norton, 1997.
[45] Werner Raub. “A General Game-Theoretic Model of Preference Adaptions in Problematic Social Situations”. In: Rationality and Society 2.1 (Jan. 1990), pp. 67–93.
[46] Ariel Rubinstein. Modeling Bounded Rationality. Ed. by Karl Gunnar Persson. Zeuthen Lecture Book Series. The MIT Press, 1998.
[47] Alexander Savelyev. “Contract law 2.0: ‘Smart’ contracts as the beginning of the end of classic contract law”. In: Information & Communications Technology Law 26.2 (2017), pp. 116–134. DOI: 10.1080/13600834.2017.1301036.
[48] Thomas C. Schelling. The Strategy of Conflict. Cambridge, MA: Harvard University Press, 1960.
[49] Thomas C. Schelling. “The Strategy of Conflict Prospectus for a Reorientation of Game Theory”. In: The Journal of Conflict Resolution 2.3 (Sept. 1958), pp. 203–264.
[50] Alexander Schrijver. Theory of Linear and Integer Programming. Chichester, UK: John Wiley & Sons, 1998.
[51] Amartya Sen. “Choice, orderings and morality”. In: Practical Reason. Ed. by Stephan Körner. New Haven, CT, USA: Basil Blackwell, 1974. Chap. II, pp. 54–67.
[52] Heinrich von Stackelberg. “Marktform und Gleichgewicht”. In: Vienna: Springer, 1934, pp. 58–70.
[53] Neal M. Stoughton. “Moral Hazard and the Portfolio Management Problem”. In: The Journal of Finance 48.5 (Dec. 1993), pp. 2009–2028. DOI: 10.1111/j.1540-6261.1993.tb05140.x.
[54] Robert Sugden. In: The Economic Journal 105.430 (May 1995), pp. 533–550. DOI: 10.2307/2235016.
[55] Moshe Tennenholtz. “Program equilibrium”. In: Games and Economic Behavior 49.2 (Nov. 2004), pp. 363–373.
[56] Johannes Treutlein et al. “A New Formalism, Method and Open Issues for Zero-Shot Coordination”. In: Proceedings of the Thirty-eighth International Conference on Machine Learning (ICML’21). 2021.
[57] Wiebe van der Hoek, Cees Witteveen, and Michael Wooldridge. “Program equilibrium – a program reasoning approach”. In: International Journal of Game Theory 42.3 (Aug. 2013), pp. 639–671.
[58] Luc N. van Wassenhove and Ludo F. Gelders. “Solving a bicriterion scheduling problem”. In: European Journal of Operational Research 4 (1980), pp. 42–48.
[59] Bernhard Von Stengel and Shmuel Zamir. Leadership with commitment to mixed strategies. Tech. rep. LSE-CDAM-2004-01. London School of Economics, 2004. URL: http://www.cdam.lse.ac.uk/Reports/Files/cdam-2004-01.pdf.
This paper considers the meta-game of delegation. SPIs are a proposed way of playing these games. However, throughout most of this paper, we do not analyze the meta-game directly as a game using the typical tools of game theory. We here fill that gap and in particular prove Theorem 1, which shows that SPIs are played in Nash equilibria of the meta-game, assuming sufficiently strong contracting abilities. As noted, this result is essential. However, since it is mostly an application of existing ideas from the literature on program equilibrium, we left a detailed treatment out of the main text.
A program game for is defined via a set and a non-deterministic mapping . We obtain a new game with action sets and utility function
Though this definition is generic, one generally imagines in the program equilibrium literature that for all , consists of computer programs in some programming language, such as Lisp, that take as input vectors in and return an action . The function on input then executes each player 's program on to assign an action. The definition implicitly assumes that only contains programs that halt when fed one another as input (or that not halting is mapped onto some action). As is usually done in the program equilibrium literature, we will leave unspecified what constraints are used to ensure this. A program equilibrium is then simply a Nash equilibrium of the program game.
For the present paper, we add the following feature to the underlying programming language. A program can call a “black box subroutine” for any subset game of , where is a random variable over and .
We need one more definition. For any game and player , we define Player 's threat point (a.k.a. minimax utility) as
In words, is the minimum utility that the players other than can force onto , under the assumption that reacts optimally to their strategy. We further will use to denote the strategy for Player that is played in the minimizer of the above. Of course, in general, there might be multiple minimizers . In the following, we will assume that the function breaks such ties in some consistent way, such that for all ,
Note that for , each player's threat point is computable in polynomial time via linear programming; and that by the minimax theorem [35], the threat point is equal to the maximin utility, i.e.,
so is also the minimum utility that Player can guarantee for herself under the assumption that the opponent sees her mixed strategy and reacts in order to minimize Player 's utility.
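As a concrete illustration, the following sketch (ours) computes a player's threat point in a 2-player game via the standard maximin linear program, using scipy.optimize.linprog. U is the player's own payoff matrix, with her actions as rows and the opponent's actions as columns.

import numpy as np
from scipy.optimize import linprog

def threat_point(U):
    """Maximin value of the player's own payoffs, which by the minimax theorem
    equals her threat point in a 2-player game."""
    U = np.asarray(U, dtype=float)
    m, n = U.shape
    # Variables: the player's mixed strategy x (length m) and the value w.
    # Maximize w subject to: (x' U)_j >= w for every opponent column j, sum(x) = 1.
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # minimize -w
    A_ub = np.hstack([-U.T, np.ones((n, 1))])      # w - (x' U)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = [1.0]
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

# Example: Matching Pennies payoffs for the row player; the threat point is 0.
print(round(threat_point([[1, -1], [-1, 1]]), 6))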
Tennenholtz’s [55] main result on program games is the following:
Theorem 17 (Tennenholtz 2004 [55]). Let be a game and let be a (feasible) payoff vector. If for , then is the utility of some program equilibrium of a program game on
Throughout the rest of this section, our goal is to use ideas similar to those Tennenholtz used for Theorem 17 to construct, for any SPI on , a program equilibrium that results in the play of . As noted in the main text, Player 's instruction to her representative to play the game will usually be conditional on the other player telling her representative to also play her part of and vice versa. After all, if Player simply tells her representative to maximize from regardless of Player 's instruction, then Player will often be able to profit from deviating from the instruction. For example, in the safe Pareto improvement on the Demand Game, each player would only want their representative to choose from rather than if the other player's representative does the same. It would then seem that in a program equilibrium in which is played, each program would have to contain a condition of the type, “if the opponent code plays as in against me, I also play as I would in .” But in a naive implementation of this, each of the programs would have to call the other, leading to an infinite recursion.
In the literature on program equilibrium, various solutions to this problem have been discovered. We here use the general scheme proposed by Tennenholtz [55], because it is the simplest. We could similarly use the variant proposed by Fortnow [15], techniques based on Löb's theorem [5, 13], -grounded mutual simulation [36], or even (meta) Assurance Game preferences (see Appendix B).
In our equilibrium, we let each player submit code as sketched in Algorithm 2. Roughly, each player uses a program that says, “if everyone else submitted the same source code as this one, then play . Otherwise, if there is a player who submitted a different source code, punish player by playing her strategy”. Note that for convenience, Algorithm 2 receives the player number as input. This way, every player can use the exact same source code. Otherwise, the original players would have to provide slightly different programs, and in line 2 of the algorithm we would have to use a more complicated comparison, roughly: “if are the same, except for the player index used”.
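Since Algorithm 2 is not reproduced here, the following Python fragment (ours) is a rough rendering of the program just described; play_spi_game and punishment_strategy are placeholder callbacks standing in for the black-box subroutine that plays the player's part of the SPI game and for the minimax punishment strategy, respectively.

import inspect

def my_program(i, submitted_programs, play_spi_game, punishment_strategy):
    """i: own player index; submitted_programs: the source code submitted by every player."""
    own_source = inspect.getsource(my_program)
    if all(src == own_source for src in submitted_programs):
        # Everyone submitted this very program: play one's part of the SPI game.
        return play_spi_game(i)
    # Otherwise, punish (one of) the deviating player(s).
    j = next(k for k, src in enumerate(submitted_programs) if src != own_source)
    return punishment_strategy(i, j)

# Toy usage with placeholder subroutines:
src = inspect.getsource(my_program)
action = my_program(0, [src, src],
                    play_spi_game=lambda i: f"play SPI action for player {i}",
                    punishment_strategy=lambda i, j: f"minimax against player {j}")
print(action)  # everyone complied, so player 0 plays her part of the SPI game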
Proposition 18. Let be a game and let be an SPI on . Let be the program profile consisting only of Algorithm 2 for each player. Assume that guarantees each player at least threat point utility in expectation. Then is a program equilibrium and .
Proof. By inspection of Algorithm 2, we see that . It is left to show that is a Nash equilibrium. So let be any player and . We need to show that . Again, by inspection of , is the threat point of Player . Hence,
as required.
Theorem 1 follows immediately.
We here discuss Raub’s [45] paper in some detail, which in turn elaborates on an idea by Sen [51]. Superficially, Raub’s setting seems somewhat similar to ours, but we here argue that it should be thought of as closer to the work on program equilibrium and bilateral precommitment. In Sections 1, 3 and 3.2, we briefly discuss multilateral commitment games, which have been discussed before in various forms in the game-theoretic literature. Our paper extends this setting by allowing instructions that let the representatives play a game without specifying an algorithm for solving that game. At first sight, it appears that Raub pursues a very similar idea. Translated to our setting, Raub allows that as an instruction, each player chooses a new utility function , where is the set of outcomes of the original game . Given instructions , the representatives then play the game . In particular, each representative can see what utility functions all the other representatives have been instructed to maximize. However, what utility function representative maximizes is not conditional on any of the instructions by other players. In other words, the instructions in Raub's paper are raw utility functions without any surrounding control structures, etc. Raub then asks for equilibria of the meta-game that Pareto-improve on the default outcome.
To better understand how Raub's approach relates to ours, we here give an example of the kind of instructions Raub has in mind. (Raub uses the same example in his paper.) As the underlying game , we take the Prisoner's Dilemma. Now the main idea of his paper is that the original players can instruct their representatives to adopt so-called Assurance Game preferences. In the Prisoner's Dilemma, this means that the representatives prefer to cooperate if the other representative cooperates, and prefer to defect if the other player defects. Further, they prefer mutual cooperation over mutual defection. An example of such Assurance Game preferences is given in Table 8. (Note that this payoff matrix resembles the classic Stag Hunt studied in game theory.)
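The following sketch (ours, with illustrative payoff numbers rather than Raub's Table 8) checks the pure Nash equilibria of both games: under the original Prisoner's Dilemma preferences only mutual defection is an equilibrium, whereas under the Assurance Game preferences mutual cooperation becomes a Pareto-optimal equilibrium alongside mutual defection.

import numpy as np

# Illustrative payoffs (not Raub's numbers). Index 0 = Cooperate, 1 = Defect;
# entry [i][j] is (row player's, column player's) payoff.
pd = np.array([[(3, 3), (0, 4)], [(4, 0), (1, 1)]])            # Prisoner's Dilemma
assurance = np.array([[(4, 4), (0, 1)], [(1, 0), (3, 3)]])     # Assurance Game preferences

def pure_nash_equilibria(game):
    """Outcomes where neither player can gain by a unilateral pure-strategy deviation."""
    n_rows, n_cols, _ = game.shape
    return [
        (i, j)
        for i in range(n_rows)
        for j in range(n_cols)
        if game[i, j, 0] == max(game[k, j, 0] for k in range(n_rows))
        and game[i, j, 1] == max(game[i, k, 1] for k in range(n_cols))
    ]

print(pure_nash_equilibria(pd))         # [(1, 1)]: only mutual defection
print(pure_nash_equilibria(assurance))  # [(0, 0), (1, 1)]: mutual cooperation is now stable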
The Assurance Game preferences have two important properties.
The first important difference between Raub's approach and ours is related to item 2. We have ignored the issue of making SPIs Nash equilibria of our meta-game. As we have explained in Section 3.2 and Appendix A, we imagine that this is taken care of by additional bilateral commitment mechanisms that are not the focus of this paper. For Raub's paper, on the other hand, ensuring mutual cooperation to be stable in the new game is arguably the key idea. Still, we could pursue the approach of the present paper even when we limit instructions to those that consist only of a utility function.
The second difference is even more important. Raub assumes that – as in the PD – the default outcome of the game ( in the formalism of this paper) is known. (Less significantly, he also assumes that it is known how the representatives play under assurance game preferences.) Of course, the key feature of the setting of this paper is that the underlying game might be difficult (through equilibrium selection problems) and thus that the original players might be unable to predict .
These are the reasons why we cite Raub in our section on bilateral commitment mechanisms. Arguably, Raub's paper could be seen as very early work on program equilibrium, except that he uses utility functions as a programming language for representatives. In this sense, Raub's Assurance Game preferences are analogous to the program equilibrium schemes of Tennenholtz [55], Oesterheld [55], Barasz et al. [5] and van der Hoek, Witteveen, and Wooldridge [57], listed in increasing order of similarity of the main idea of the scheme.
Lemma 4. Let and be isomorphisms between and . If is (strictly) Pareto-improving, then so is .
Proof.
First, we argue that if and are isomorphisms, then they are isomorphisms relative to the same constants and . For each player , we distinguish two cases. First the case where all outcomes in have the same utility for Player is trivial. Now imagine that the outcomes of do not all have the same utility. Then let and be the lowest and highest utilities, respectively, in . Further, let and be the lowest and highest utilities, respectively, in . It is easy to see that if is a game isomorphism, it maps outcomes with utility in onto outcomes with utility in , and outcomes with utility in onto outcomes with utility in . Thus, if and are to be the constants for , then
Since , this system of linear equations has a unique solution. By the same pair of equations, the constants for are uniquely determined.
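Spelled out with explicit (hypothetical) symbol names, where \(l_i, h_i\) denote the lowest and highest utilities of Player \(i\) in the first game, \(l_i', h_i'\) those in the second, and \(c_i > 0, d_i\) the constants of the affine transformation, the system reads
\[
c_i\, l_i + d_i = l_i', \qquad c_i\, h_i + d_i = h_i',
\]
which, since \(h_i \neq l_i\), has the unique solution \(c_i = (h_i' - l_i')/(h_i - l_i)\) and \(d_i = l_i' - c_i\, l_i\).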
It follows that for all ,
Furthermore, if is strictly Pareto-improving for some , then by bijectivity of , there is s.t. . For this , the inequality above is strict and therefore .
We here prove Theorem 9. We assume familiarity with basic ideas in computational complexity theory (non-deterministic polynomial time (NP), reductions, NP-completeness, etc.).
Throughout our proof we will use a result about the structure of relevant outcome correspondences. Before proving this result, we give two lemmas. The first is a well-known lemma about elimination by strict dominance.
Lemma 19 (path independence of iterated strict dominance). Let be a game in which some strategy of player is strictly dominated. Let be a game we obtain from by removing a strictly dominated strategy (of any player) other than . Then is strictly dominated in .
Note that this lemma does not by itself prove that iterated strict dominance is path independent. However, path independence follows from the property shown by this lemma.
Proof. Let be the strategy that strictly dominates . We distinguish two cases:
Case 1: The strategy removed is . Then there must be that strictly dominates . Then, for all ,
Both inequalities are due to the definition of strict dominance. We conclude that must strictly dominate .
Case 2: The strategy removed is one other than or . Since the set of strategies of the new game is a subset of the strategies of the old game, it still holds for each strategy in the new game that
i.e., still strictly dominates .
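Since Assumption 1 and the following lemmas repeatedly appeal to iterated elimination of strictly dominated strategies, here is a minimal Python sketch of how the reduction could be computed for a two-player game given by payoff matrices; the data representation (nested lists of payoffs indexed by pure strategies) and the restriction to pure-strategy dominance are illustrative assumptions.

```python
def iterated_elimination(u1, u2):
    """Iterated elimination of strictly dominated pure strategies in a
    two-player game. u1[i][j] and u2[i][j] are the payoffs of players 1
    and 2 when player 1 plays row i and player 2 plays column j.
    Returns the sets of surviving row and column indices."""
    rows = set(range(len(u1)))
    cols = set(range(len(u1[0])))

    def row_dominated(r):
        # r is strictly dominated if some other surviving row is strictly
        # better against every surviving column.
        return any(all(u1[s][c] > u1[r][c] for c in cols)
                   for s in rows if s != r)

    def col_dominated(c):
        return any(all(u2[r][s] > u2[r][c] for r in rows)
                   for s in cols if s != c)

    changed = True
    while changed:
        changed = False
        for r in [r for r in rows if row_dominated(r)]:
            rows.discard(r)
            changed = True
        for c in [c for c in cols if col_dominated(c)]:
            cols.discard(c)
            changed = True
    return rows, cols
```

By the path-independence property discussed above, removing all currently dominated strategies in each pass yields the same fully reduced game as removing them one at a time.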
The next lemma shows that instead of first applying Assumption 1 plus symmetry (Lemma 2.2) to add a strictly dominated action and then applying Assumption 1 to eliminate a different strictly dominated strategy, we could also first eliminate the strictly dominated strategy and then add the other strictly dominated strategy.
Lemma 20. Let by Assumption 1, where is the reduced game, and by Assumption 1. Then either or there is a game s.t. by Assumption 1 and by Assumption 1.
Proof. By the assumption both and can be obtained from eliminating a strictly dominated action from . Let these actions be and , respectively. If , then . So for the rest of this proof assume . Let be the game we obtain by removing from . We now show the two outcome correspondences:
We are ready to state our lemma about the structure of outcome correspondences.
Lemma 21. Let
where each outcome correspondence is due to a single application of Assumption 1, Assumption 1 plus symmetry (Lemma 2.2) or Assumption 2. Then there is a sequence with and , and such that
all by single applications of Assumption 1, and are fully reduced games such that by a single application of Assumption 2, and
all by single applications of Assumption 1 with Lemma 2.2.
A more concise way to state the consequence is that there must be games , and such that is obtained from by iterated elimination of strictly dominated strategies, is isomorphic to , and is obtained from by iterated elimination of strictly dominated strategies.
Proof. First divide the given sequence of outcome correspondences up into periods that are maximally long while containing only correspondences by Assumption 1 (with or without Lemma 2.2). That is, consider subsequences of the form such that:
In each such period apply Lemma 20 iteratively to either eliminate or move to the right all inverted reduction elimination steps.
In all but the first period, contains no strictly dominated actions (by stipulation of Assumption 2). Hence, no period other than the first can contain any non-reversed elimination steps. Similarly, in all but the final period, contains no strictly dominated actions. Hence, in all but the final period, there can be no reversed applications of Assumption 1.
Overall, our new sequence of outcome correspondences thus has the following structure: first there is a sequence of elimination steps via Assumption 1, then there is a sequence of isomorphism steps, and finally there is a sequence of reverse elimination steps. We can summarize all the applications of Assumption 2 into a single step applying that assumption to obtain the claimed structure.
Now notice that the reverse elimination steps are only relevant for deriving unilateral SPIs. Using the above concise formulation of the lemma, we can always simply use itself as an omnilateral SPI – it is not relevant that there is some subset game that reduces to .
Lemma 22. As in Lemma 21, let , where each outcome correspondence is due to a single application of Assumption 1, Assumption 1 plus symmetry (Lemma 2.2) or Assumption 2. Let all be subset games of . Moreover, let be Pareto improving. Then there is a sequence of subset games such that all by applications of Assumption 1 (without applying symmetry), and by application of Assumption 2 such that is Pareto improving.
Proof. First apply Lemma 21. Then notice that the correspondence functions from applying Assumption 1 with symmetry have no effect on whether the overall outcome correspondence is Pareto improving.
We now show that the SPI problem is in NP. The following algorithm can be used to determine whether there is a safe Pareto improvement: Reduce the given game until it can be reduced no further to obtain some subset game . Then non-deterministically select injections . If is (strictly) Pareto-improving (as required in Theorem 3), return True with the solution defined as follows: The set of action profiles is defined as . The utility functions are
Otherwise, return False.
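A deterministic version of this guess-and-check procedure (for two players) could look roughly as follows; the data representation, the enumeration of injections via itertools, and the exact reading of "strictly Pareto-improving" (weak improvement for every player at every profile, strict somewhere) are illustrative assumptions rather than the paper's precise formulation.

```python
from itertools import permutations, product

def find_spi_correspondence(game_actions, game_u, red_actions, red_u, strict=False):
    """Sketch of the guess-and-check procedure described above (two players).

    game_actions[i]: list of player i's actions in the original game
    game_u[i]:       dict mapping an action profile (a1, a2) to player i's utility
    red_actions, red_u: the same data for the fully reduced subset game
    Returns injections (phi1, phi2) from reduced to original actions that are
    (strictly) Pareto-improving, or None if no such injections exist.
    """
    players = range(2)
    for images in product(*(permutations(game_actions[i], len(red_actions[i]))
                            for i in players)):
        phi = [dict(zip(red_actions[i], images[i])) for i in players]
        weakly_improving = True
        strictly_somewhere = False
        for profile in product(*red_actions):
            mapped = tuple(phi[i][profile[i]] for i in players)
            for i in players:
                if game_u[i][mapped] < red_u[i][profile]:
                    weakly_improving = False
                elif game_u[i][mapped] > red_u[i][profile]:
                    strictly_somewhere = True
        if weakly_improving and (strictly_somewhere or not strict):
            return phi
    return None
```

The non-deterministic algorithm in the text guesses the injections instead of enumerating them, which is what places the problem in NP; the enumeration above is only meant to make the acceptance condition explicit.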
Proposition 23. The above algorithm runs in non-deterministic polynomial time and returns True if and only if there is a (strict) omnilateral SPI.
Proof. It is easy to see that this algorithm runs in non-deterministic polynomial time. Furthermore, with Lemma 4 it is easy to see that if this algorithm finds a solution , that solution is indeed a safe Pareto improvement. It is left to show that if there is a safe Pareto improvement via a sequence of Assumption 1 and 2 outcome correspondences, then the algorithm indeed finds a safe Pareto improvement.
Let us say there is a sequence of outcome correspondences as per Assumptions 1 and 2 that show for Pareto-improving . Then by Lemma 22, there is such that via applying Assumption 1 iteratively to obtain a fully reduced and via a single application of Assumption 2. By construction, our algorithm finds (guesses) this Pareto-improving outcome correspondence.
Overall, we have now shown that our non-deterministic polynomial-time algorithm is correct and therefore that the SPI problem is in NP. Note that the correctness of other algorithms can be proven using very similar ideas. For example, instead of first reducing and then finding an isomorphism, one could first find an isomorphism, then reduce and then (only after reducing) test whether the overall outcome correspondence function is Pareto-improving. One advantage of reducing first is that there are fewer isomorphisms to test if the game is smaller. In particular, the number of possible isomorphisms is exponential in the number of strategies in the reduced game but polynomial in everything else. Hence, by implementing our algorithm deterministically, we obtain the following positive result.
Proposition 24. For games with that can be reduced (via iterative application of Assumption 1) to a game with , the (strict) omnilateral SPI decision problem can be solved in .
Next we show that the problem of finding unilateral SPIs is also in NP. Here we need a slightly more complicated algorithm: We are given an -player game and a player . First reduce the game fully to obtain some subset game . Then non-deterministically select injections . The resulting candidate SPI game then is
where for all , and is arbitrary for . Return True if the following conditions are satisfied:
Otherwise, return False.
Proposition 25. The above algorithm runs in non-deterministic polynomial time and returns True if and only if there is a (strict) unilateral SPI.
Proof. First we argue that the algorithm can indeed be implemented in non-deterministic polynomial time. For this, notice that for checking Item 2, the constants can be found by solving systems of linear equations in two variables.
It is left to prove correctness, i.e., that the algorithm returns True if and only if there exists an SPI. We start by showing that if the algorithm returns True, then there is an SPI. Specifically, we show that if the algorithm returns True, the game is indeed an SPI game. Notice that for some by iterative application of Assumption 1 with Transitivity (Lemma 2.2); that by application of Assumption 2. Finally, for some by iterative application of Assumption 1 to , plus transitivity (Lemma 2.3) with reversal (Lemma 2.2).
It is left to show that if there is an SPI, then the above algorithm will find it and return True. To see this, notice that Lemma 21 implies that there is a sequence of outcome correspondences . We can assume that and have the same action sets for Player . It is easy to see that in we could modify the utilities for any that is not in , because Player 's utilities do not affect the elimination of strictly dominated strategies from .
Proposition 26. For games with that can be reduced (via iterative application of Assumption 1) to a game with , the (strict) unilateral SPI decision problem can be solved in .
We now proceed to show that the safe Pareto improvement problem is NP-hard. We will do this by reducing the subgraph isomorphism problem to the (two-player) safe Pareto improvement problem. We start by briefly describing one version of that problem here.
A (simple, directed) graph is a tuple , where and . We call the adjacency function of the graph. Since the graph is supposed to be simple and therefore free of self-loops (edges from one vertex to itself), we take the values for to be meaningless.
For given graphs , a subgraph isomorphism from to is an injection such that for all
In words, a subgraph isomorphism from to identifies for each node in a node in s.t. if there is an edge from node to node in , there must also be an edge in the same direction between the corresponding nodes in . Another way to say this is that we can remove some set of () nodes and some edges from to get a graph that is just a relabeled (isomorphic) version of .
Definition 8. Given two graphs , the subgraph isomorphism problem consists in deciding whether there is a subgraph isomorphism between .
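For reference, Definition 8 can be checked by brute force in a few lines of Python (in exponential time, consistent with the NP-completeness result cited next); representing a graph by its vertex list and a set of directed edges is an assumption made for this sketch.

```python
from itertools import permutations

def has_subgraph_isomorphism(vertices_g1, edges_g1, vertices_g2, edges_g2):
    """Brute-force test for a subgraph isomorphism from G1 to G2.

    vertices_g1, vertices_g2: lists of vertices
    edges_g1, edges_g2:       sets of directed edges (u, v)
    Returns True iff there is an injection f from G1's vertices into G2's
    vertices such that every edge (u, v) of G1 is mapped to an edge
    (f(u), f(v)) of G2.
    """
    for image in permutations(vertices_g2, len(vertices_g1)):
        f = dict(zip(vertices_g1, image))
        if all((f[u], f[v]) in edges_g2 for (u, v) in edges_g1):
            return True
    return False
```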
The following result is well-known.
Lemma 27 ([12, Theorem 2]). The subgraph isomorphism problem is NP-complete.
Lemma 28. The subgraph isomorphism problem is reducible in linear time with linear increase in problem instance size to the (strict) (unilateral) safe Pareto improvement problem for two players. As a consequence, the (strict) (unilateral) safe Pareto improvement problem is NP-hard.
Proof. Let and be graphs. Without loss of generality assume both graphs have at least vertices, i.e., that . For this proof, we define for any .
We first define two games, one for each graph, and then a third game that contains the two.
The game for is the game as in Table 9. Formally, let . Then we let , where . The utility functions are defined via
and
We define based on analogously, except that in Player 1's utilities we use instead of , instead of , instead of and instead of .
We now define from and as sketched in Table 9. For the following let
and and . For , let be the utility of in . For let be the utility of in . Finally, define for all and all ; and for all and all .
It is easy to show that this reduction can be computed in linear time and that it also increases the problem instance size only linearly.
To prove our claim, we need to prove the following two propositions:
1. We start with the first claim. Assume there is a subgraph isomorphism from to . We construct our SPI as usual: first we reduce the game by iterated elimination of strictly dominated strategies, then we find a Pareto-improving outcome equivalence between the reduced game and some subset game of . Finally, we show that arises from removing strictly dominated strategies from subset game of . It is easy to see that the game resulting from iterated elimination of strictly dominated strategies is just the part of it. Abusing notation a little, we will in the following just call this (even though it has somewhat differently named action sets).
Next we define a pair of functions , which will later form our isomorphism. For all and , we define via
Define and so on analogously.
Now define to be the subset game with action sets and , where and are the action sets of ; and with utility functions
and (as restricted to ).
We must now show that is a game isomorphism between and . First, it is easy to see that for , is a bijection between and . Moreover,
For player 2, we need to distinguish the different cases of actions. Since each case is trivial from looking at the definition of and we omit the detailed proof.
Next we need to show that is strictly Pareto-improving as judged by the original players' utility function . Again, this is done by distinguishing a large number of cases of action profiles , all of which are trivial on their own. The most interesting one is that of for with because this is where we use the fact that is a subgraph isomorphism:
We omit the other cases.
It is left to construct a unilateral subset game of such that reduces to via iterated elimination of strictly dominated strategies. Let , where we set arbitrarily for .
We now show that reduces to via repeated application of Assumption 1. So let . We distinguish the following cases:
Note that and are both in by construction of .
2. It is left to show that if there is any kind of non-trivial SPI, there is also a subgraph isomorphism from to .
By Lemma 21, if there is an SPI, there are bijections that are jointly Pareto-improving from the reduced game to . From these functions we will construct a subgraph isomorphism. However, to do so (and to see that the resulting function is indeed a subgraph isomorphism), we need to first make a few simple observations about the structure of and . Define and .
It then follows that , since apart from the outcomes we have already mapped to, no other outcome gives Player a utility of . Next it follows that , again because all outcomes with utility at least for Player outside of are already mapped to. And so on, until we obtain that . By an analogous line of argument we can show that
Together these equalities uniquely specify and .
in contradiction to the assumption that is Pareto improving.
We are ready to construct our subgraph isomorphism. For , define to be the second element of the pair . By Item c, can equivalently be defined as the second item in the pair . By Item d, is a function from to . By assumption about , is injective. Further, by construction of and , as well as the assumption that is Pareto improving, we infer that for all with ,
We conclude that is a subgraph isomorphism.
Proof. We will give the proof based on the graphs as well, without giving all formal details. Further, we assume in the following that neither nor consists of just a single point, since these cases are easy.
\underline{Case A}: Note first that by Corollary 14 it is enough to show that if is in any of the listed sets , it can be made safe.
It is easy to see that all payoff vectors on the curve segment of the Pareto frontier are safely achievable. After all, all payoff vectors in this set Pareto-improve on all outcomes in . Hence, for each on the line segment, one could select the where .
It is left to show that all elements of are safely achievable. Remember that not all payoff vectors on the line segments are Pareto improvements, only those that are to the north-east of (Pareto-better than) the default utility. In the following, we will use and to denote those elements of and , respectively, that are Pareto-improvements on the default.
We now argue that the Pareto improvement on the line for which is safely achievable. In other words, is the projection northward of the default utility, or . This is also one of the endpoints of . To achieve this utility, we construct the equivalent game as per Lemma 13, where the utility to the original players of each outcome of the new game is similarly the projection northward onto of the utility of the corresponding outcome in . That is,
Note that because is convex and the endpoints of the line segment are by definition in , it is . Hence, all values of thus defined are feasible. Because all outcomes in the original game lie below the line , is linear. Hence,
as required.
We have now shown that one of the endpoints of is safely achievable. Since the other endpoint of is in , it is also safely achievable. By Corollary 14, this implies that all of is safely achievable.
By an analogous line of reasoning, we can also show that all elements of are safely achievable.
\underline{Case B}: Define as before as those elements of respectively that Pareto improve on the default . By a similar argument as before, one can show that the utility is safely achievable both for and for . Call these points and , respectively.
We now proceed in two steps. First, we will show that there is a third safely achievable utility point , which is above both and . Then we will show the claim using that point.
To construct , we again construct an SPI as per Lemma 13. For each we will set the utility of the corresponding to be above or on both and , i.e., on or above a set which we will refer to as . Formally, is the set of outcomes in that are not strictly Pareto dominated by some other element of . Note that by definition every outcome in is Pareto-dominated by some outcome in either or . Hence, by transitivity of Pareto dominance, each outcome is Pareto-dominated by some outcome in . Hence, the described is indeed feasible.
Now note that the set of feasible payoffs of is convex. Further, the curve is concave. Because the area above a concave curve is convex and because the intersection of convex sets is convex, the set of feasible payoffs on or above is also convex. By definition of convexity, is therefore also in the set of feasible payoffs on or above and therefore above both and as desired.
In our second step, we now use to prove the claim. Because of convexity of the set of safely achievable payoff vectors as per Corollary 14, all utilities below the curve consisting of the line segments from to and from to are safely achievable. The line that goes through intersects the line that contains at , by definition. Since non-parallel lines intersect each other exactly once, since parallel lines that intersect each other are identical, and because is above or on , the line segment from to lies entirely on or above . Similarly, it can be shown that the line segment from to lies entirely on or above . It follows that the curve lies entirely above or on . Now take any Pareto improvement that lies below both and . Then this Pareto improvement lies below and therefore below the curve. Hence, it is safely achievable.
The post Safe Pareto Improvements for Delegated Game Playing appeared first on Center on Long-Term Risk.
The post Operations Associate / Manager appeared first on Center on Long-Term Risk.
The Center on Long-Term Risk is seeking an Operations Associate to work on supporting and improving the operational processes and infrastructure that enable our researchers’ work. You will therefore act as a multiplier on our team’s productivity, and so play an important role in furthering CLR’s mission to address worst-case risks from the development and deployment of advanced AI systems. (More experienced candidates may be appointed as Operations Manager.)
In this role, you would become the second full member of our Operations team, reporting to our Operations Lead, and taking on responsibilities across diverse areas such as office management, HR, finance, compliance and recruitment – making this role ideal for quickly gaining operations experience. You will receive mentorship from an experienced operations team, and become familiar with existing operational processes in a well-running organisation, as you work to improve and supplement them.
To apply for this role, please submit this application form. The deadline for applications is the end of Sunday 11th September (precisely: 7:30am British Summer Time on Monday 12th). We expect the form will take 30-60 minutes to complete. It can be done in as little as 10 minutes if necessary by skipping the descriptive questions: this may significantly disadvantage your application, but may make sense if you wouldn’t apply otherwise.
--- This role is now closed ---
We are recruiting for this role in order to provide additional capacity in our operations function, and the Operations Lead plans to hand over a number of areas of responsibility to the successful candidate. Precisely which areas you work on will depend on your strengths and interests, and we’ll decide this together with you once you start work.
As an illustration of the sorts of things you’ll work on, we expect that the successful candidate will take on several of the following tasks:
Examples of further responsibilities that a candidate who is a good fit for them could take on include:
We also plan to train the successful candidate in some of the Operations Lead’s areas of responsibility, in order to provide better resilience.
We think this role could provide suitable challenges for someone with 0-4 years’ experience in a similar job: it might, for example, be suited to a recent graduate interested in quickly gaining experience in operations, and we also encourage more experienced candidates to apply.
The following abilities and qualities are what we’re looking for in candidates. No specific qualifications or experience are required – experience is one good way of demonstrating these skills, but we’re also open to candidates with no experience of similar roles. We encourage you to apply if you think you may be a good fit, even if you are very unsure of your strengths in some of these areas.
In addition to the above, familiarity with the effective altruism community and its priorities is a significant benefit.
In this role, you will enable our other staff’s productivity, and so support our mission to reduce worst-case risks from the development of AI systems. CLR’s activities include:
Aside from CERR, CLR has received major grants from Open Philanthropy and the Survival and Flourishing Fund.
Due to the small size of our organisation, your work will be varied and you will quickly gain experience across a wide variety of operations areas. CLR regularly encounters new operational situations, such as employing staff in a new country, or supporting the launch of a new charity, which will give you ample opportunities to extend your skills to new contexts.
CLR will also actively support your professional development. While we are looking for a candidate who is interested in working with CLR for a substantial period of time, as part of the effective altruism community we are interested in helping you increase your career’s impact even beyond your performance in the current role. Alongside mentorship from our experienced operations team, you will be joining a well-networked longtermist organisation. You will receive a budget of £2,000 per year to spend on whatever you think best furthers your professional development, and be supported to attend EA Global conferences twice annually if you’re interested.
Stage 1: To apply for this role, please submit this application form. The deadline for applications is the end of Sunday 11th September (precisely: 7:30am British Summer Time on Monday 12th).
We expect the form will take 30-60 minutes to complete. If necessary, the form can be done in as little as 10 minutes by skipping the descriptive questions: this may significantly disadvantage your application, but may make sense if you wouldn’t apply otherwise.
We aim to communicate the results of stage 1, inviting candidates to the second stage, by the end of Friday 16th September.
Stage 2 will be a remote work test, to be completed on your own computer, which we anticipate will take up to 4 hours of your time. Applicants will have 2 weeks to complete the test, and will be compensated with £120 in return for their work. We plan to communicate the results of stage 2 by the end of Friday 7th October.
Stage 3 will consist of one or more interviews with CLR staff. We plan to hold interviews in the week of 10th October, and aim to communicate the results of stage 3 by the end of Friday 21st October.
Stage 4: The final stage of the recruitment process will be a work trial, held in-person if possible, of between 1 and 10 working days depending on candidate availability. We will cover travel expenses and compensate candidates £200 per day for the work trial. We will also seek references at this stage.
We expect final recruitment decisions to be made by mid-November. If you require a faster decision than this, please feel free to contact us at the address below.
The above timelines are our aim and we fully intend to stick to them. However, we don’t firmly commit to them, and a delay of, for example, 1-2 weeks by the end of stage 3 is possible. We will communicate to candidates promptly if we expect there to be any delays.
If you have any questions about the process, please contact us at hiring@longtermrisk.org. If you’d like to send an email that’s not accessible to the hiring committee, please contact tristan.cook@longtermrisk.org.
Diversity and equal opportunity employment: CLR is an equal opportunity employer, and we value diversity at our organisation. We don’t want to discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, social background/class, mental or physical health or disability, or any other basis for unreasonable discrimination, whether legally protected or not. If you're considering applying to this role and would like to discuss any personal needs that might require adjustments to our application process or workplace, please feel very free to contact us.
The post Operations Associate / Manager appeared first on Center on Long-Term Risk.
The post Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs appeared first on Center on Long-Term Risk.
The evolution of preferences that account for other agents’ fitness, or other-regarding preferences, has been modeled with the “indirect approach” to evolutionary game theory. Under the indirect evolutionary approach, agents make decisions by optimizing a subjective utility function. Evolution may select for subjective preferences that differ from the fitness function, and in particular, subjective preferences for increasing or reducing other agents’ fitness. However, indirect evolutionary models typically artificially restrict the space of strategies that agents might use (assuming that agents always play a Nash equilibrium under their subjective preferences), and dropping this restriction can undermine the finding that other-regarding preferences are selected for. Can the indirect evolutionary approach still be used to explain the apparent existence of other-regarding preferences, like altruism, in humans? We argue that it can, by accounting for the costs associated with the complexity of strategies, giving (to our knowledge) the first account of the relationship between strategy complexity and the evolution of preferences. Our model formalizes the intuition that agents face tradeoffs between the cognitive costs of strategies and how well they interpolate across contexts. For a single game, these complexity costs lead to selection for a simple fixed-action strategy, but across games, when there is a sufficiently large cost to a strategy's number of context-specific parameters, a strategy of maximizing subjective (other-regarding) utility is stable again. Overall, our analysis provides a more nuanced picture of when other-regarding preferences will evolve.
Under what conditions do agents evolve to maximize a subjective utility function other than their evolutionary fitness? In particular, when is there selection for other-regarding preferences [Elster, 1983, Sen, 1986] such as altruism (intrinsically valuing improvements in others agents’ fitness) or spite (intrinsically valuing reductions in other agents’ fitness)? These questions have been previously studied under the “indirect approach” to evolutionary game theory [Güth and Kliemt, 1998]. Consider a game whose payoffs determine the players’ fitness in an evolutionary process, called a base game. The indirect evolutionary approach supposes that selection occurs on agents’ subjective preferences (hereafter, “preferences”) represented as utility functions, and agents rationally play the base game by optimizing their subjective utility functions. When assessing the evolutionary stability of strategies in the indirect approach, a player’s utility function defines their strategy. This is in contrast to the classical “direct” approach where actions in the base game themselves are selected.
This indirect approach has been applied in attempts to explain altruism in organisms, especially in contexts where other explanations such as kin selection and reciprocity are inadequate [Bester and Güth, 1998, Janssen, 2008, Konrad and Morath, 2012]. In a simple model of an interaction where two agents’ actions have positive externalities for each other — i.e., increasing one’s action (represented as a real number) increases the other’s payoff — Bester and Güth [1998] find that altruistic preferences are evolutionarily stable. Bolle [2000] and Possajennikov [2000] extended this model to also explain the stability of spiteful preferences in interactions with negative externalities. These other-regarding preferences are selected because they are known to other agents, and thus credibly signal an agent’s commitment to certain behavior, given other agents’ preferences [Frank, 1987, Dufwenberg and Güth, 1999].
However, these models have two key limitations:
These two modifications to the original indirect evolutionary models undermine those models’ conclusions that other-regarding preferences can be evolutionarily stable, including preferences that lead to inefficient behavior. However, an important feature of the kinds of strategies described in (1) and (2) is that they differ from subjective utility maximization in their complexity costs, i.e., the costs an agent must pay to learn and execute strategies [McNamara, 2013]. These costs may play a critical role in evolution; for instance, the tradeoff between the problem-solving benefits and energetic costs of larger brains may explain variation in brain size among primates, and in animal behavior in contests [Isler and Van Schaik, 2014, Reichert and Quinn, 2017]. Previous literature has studied how complexity costs affect the evolutionary stability of strategies [Rubinstein, 1986, Banks and Sundaram, 1990, Binmore and Samuelson, 1992]. The costs of strategy complexity accumulate over the diverse set of environments and interactions an agent faces in its lifetime [Geoffroy and André, 2021]. Thus, instead of using many different strategies that are each simple in isolation, it can be less expensive overall for an agent to use a sophisticated strategy that interpolates well across interactions [Robalino and Robson, 2016, Piccinini and Schulz, 2018]. We will argue that the complexity costs of applying individualized heuristics to each new interaction may be sufficiently high that evolution selects for “rational” agents, which consistently optimize some (other-regarding) utility function.
Our key contribution is a revised account of the evolution of other-regarding preferences, based on a novel framework accounting for the fitness costs that strategies incur due to their complexity in multiple strategic contexts. While existing indirect evolutionary models are inadequate because they artificially restrict the space of strategies, we show that their predictions can be recovered by accounting for how subjective utility-maximizing strategies optimally trade off complexity within and across decision contexts. In particular:
Indirect evolutionary approach. Like Bester and Güth [1998], Bolle [2000], and Possajennikov [2000], we model rational players as playing Nash equilibria with respect to utility functions given by their own fitness plus a (possibly negative) multiple of their opponent’s fitness. Heifetz et al. [2003] generalize this model to utility functions given by one’s own fitness plus some function called a disposition. They show that dispositions are not eliminated by selection in a wide variety of games. Generalizing further to the space of all possible utility functions in finite-action games, Dekel et al. [2007] show that any strategy achieving an inefficient payoff against itself — including the kinds of strategies with other-regarding preferences predicted by Possajennikov [2000] — is not evolutionarily stable. We will argue, however, that the invader strategies that make efficiency necessary for stability are more complex than behavioral or rational strategies, and thus when complexity costs are accounted for in an agent’s fitness, a stable strategy can lead to inefficiency. Ok and Vega-Redondo [2001] and Ely and Yilankaya [2001] note that in order for utility functions to evolve such that players with those utility functions do not play the base game Nash equilibrium, players must have information about each other’s utility functions. We assume utility functions are known, and briefly discuss how players can learn each other’s utility functions over repeated interactions, but acknowledge that this is a substantive assumption since players often have incentives to send deceptive signals of their utility functions [Heller and Mohlin, 2019]. Finally, Heifetz et al. [2007] and Alger and Weibull [2012] generalize Possajennikov [2000]’s finding that altruism or spite can be evolutionarily stable in a certain game depending on whether it features positive or negative externalities. They show that in a general class of games, selection for altruism versus spite is determined (partly) by whether the base game has strategic complements or substitutes, i.e., whether increasing one player’s input increases or decreases the marginal value of input for another player. The patterns of selection of altruism or spite based on multiple games that we illustrate with Bester and Güth [1998]’s game, therefore, might hold for a variety of games.
Games with complexity costs. Rubinstein [1986] characterizes Nash equilibria in repeated games under computational costs. He represents strategies in the repeated Prisoner’s Dilemma as finite-state automata (sets of states determining the player’s action with rules for transitions between states). Complexity costs are lexicographic: an automaton achieving a strictly higher payoff is always preferred, but when two automata achieve the same average payoff, the automaton with fewer states is preferred. Binmore and Samuelson [1992] show that although no evolutionarily stable strategies exist in the repeated Prisoner’s Dilemma without complexity costs, adding these lexicographic costs leads to the existence of some evolutionarily stable strategies. We similarly show that in one-shot games, when we account for the greater complexity of “rational” strategies relative to “behavioral” (fixed-action) strategies, a set of multiple neutrally stable strategies is replaced with a unique evolutionarily stable strategy. Our distinction between the complexity of rational and behavioral strategies follows that of Abreu and Sethi [2003], who show that under an arbitrarily small cost of the complexity of rationality, behavioral strategies are evolutionarily stable in a bargaining game. If automata are also penalized based on the number of different states each state can transition to, the evolutionarily stable strategies are restricted to the Nash equilibria of the (non-repeated) base game [Banks and Sundaram, 1990]. We find an analogous result in one-shot games with a different complexity metric. Lastly, van Veelen and García [2019] find that in the repeated Prisoner’s Dilemma, increasing non-lexicographic complexity costs decreases the frequency of cooperation in finite-population stochastic evolutionary simulations. Similarly, in the multi-game setting, we find numerically that as complexity costs on a strategy’s number of game-specific parameters increase, there are transitions between more or less efficient stable strategies.
Coevolution of rationality and other-regarding preferences. A key theme in our work is that selection may favor the ability of rational agents, which have other-regarding preferences and model other players as optimizing their own utility functions, to solve a variety of strategic problems. Building on Robson [2001]’s analogous results in single-agent problems, Robalino and Robson [2016] model the coevolution of utility maximization and ability to attribute preferences to others. Like us, they show that after accounting for the advantages of interpolation across strategic contexts, selection favors a rational strategy that learns and responds to the preferences of its opponent, as opposed to strategies that do not know how to respond to new games. However, we study selection pressures towards rationality in the context of evolution of preferences. Further, in our analysis, the advantage of rationality comes from avoiding costs that non-rational strategies pay to adapt a response to each separate game, rather than from non-rational strategies’ inability to respond to new games. Heller and Mohlin [2019] model the evolution of both preferences and the cognitive capacity necessary to signal false preferences to others. Their argument for the efficiency of stable strategies is vulnerable to the counterargument that we raise to Dekel et al. [2007] above. However, their results are similar to ours in that the set of stable strategies is sensitive to whether the costs of cognitive complexity are sufficiently high, relative to the direct fitness benefits of complex cognition. Like us, Geoffroy and André [2021] model the evolution of strategies that interpolate across different contexts, but their analysis is restricted to cooperation in a certain class of games rather than evolution of other-regarding preferences in general (including uncooperative preferences like spite).
We begin with definitions and notation and introduce a well-studied game that will illustrate principles of the indirect evolutionary approach.
Let be any symmetric two-player game (called the base game) between players and , with action space and payoff functions . Players choose actions in the base game as functions of strategies that are selected in an evolutionary process. Suppose players simultaneously play strategies (elements of some abstract space ) and observe each other's strategies, then play with actions determined by the pair of strategies. Then, define the function , where player 's action in given the players' strategies is . In standard evolutionary analysis the fitness of a strategy equals its payoff in , thus we write player 's fitness from a strategy profile as . (We distinguish fitness from payoffs because once complexity costs are included, as in Section 5.1, this identity no longer holds.) The following definitions classify a strategy based on the robustness to mutations of a population purely consisting of that strategy.
Definition 1. Relative to a fixed strategy space for , a strategy is:
The strict inequality in the definition of ESS implies a stronger “pull” towards an ESS in evolutionary dynamics (such as the replicator dynamic) than towards an NSS: If a rare mutant that enters a population consisting of an ESS has the same fitness when paired with itself as the ESS has against this mutant, the mutant goes extinct under the replicator dynamic, but this does not necessarily hold for an NSS [van Veelen, 2010].
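For reference, the standard conditions we take Definition 1 to be invoking, stated for a symmetric strategy space with fitness function \(F\), are:
\[
\text{NSS: } \forall s' \neq s:\; F(s, s) \geq F(s', s), \text{ and } F(s, s) = F(s', s) \implies F(s, s') \geq F(s', s');
\]
\[
\text{ESS: } \forall s' \neq s:\; F(s, s) \geq F(s', s), \text{ and } F(s, s) = F(s', s) \implies F(s, s') > F(s', s').
\]
Here the strict inequality in the second ESS condition is the one referred to above.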
Our running example is the following symmetric two-player game, which we call the externality game [Bester and Güth, 1998]. Each player simultaneously chooses , and, for some and , the players receive payoffs:
Thus, represents negative or positive externalities of each player's action for the other's payoff (when or , respectively). In the original model, players are assumed to have the following subjective utility functions, for :
Players behave rationally with respect to their subjective utility functions, and subjective utility functions are common knowledge. Thus the players play the Nash equilibrium of the game in which payoffs are given by , denoted . That is, letting represent player 's strategy, .
A player with (respectively, ) has subjective utility increasing (decreasing) with the other's payoff — these ranges of can be interpreted as altruistic and spiteful, respectively. Generalizing Bester and Güth [1998], Possajennikov [2000] showed that the unique ESS in this strategy space is . Thus, when , this ESS corresponds to players with altruistic preferences, and when , their preferences are spiteful. Players who follow the subjective Nash equilibrium with respect to given by the altruistic ESS both receive a higher payoff than the equilibrium of , while the payoffs of the spiteful ESS are both lower. Since the Pareto-efficient symmetric subjective Nash equilibrium is at , this means that as , the ESS approaches efficiency. Intuitively, these other-regarding preferences are stable in Possajennikov [2000]’s model because they serve as commitment devices that elicit favorable responses from the other player [Frank, 1987, Dufwenberg and Güth, 1999]. That is, each agent best-responds under the assumption that the other player will play rationally with respect to their utility function, and as utility functions are selected based on payoffs from the opponent’s best response to the action optimizing those utility functions, the population converges to some .
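To make the subjective-equilibrium logic concrete, here is a small Python sketch that approximates the subjective Nash equilibrium numerically by iterated best responses; the quadratic payoff used in the example is a hypothetical stand-in with an externality term of strength k, not necessarily the exact payoff function of the externality game above.

```python
import numpy as np

def subjective_nash(alpha1, alpha2, fitness, grid, n_iter=200):
    """Approximate the Nash equilibrium of the game whose payoffs are the
    subjective utilities u_i = f_i + alpha_i * f_j, by iterated best response
    over a discretized action grid.

    alpha_i:            player i's other-regarding preference parameter
    fitness(a_i, a_j):  player i's own fitness from action a_i against a_j
    grid:               1-D numpy array of candidate actions
    """
    def best_response(alpha_i, a_j):
        subjective = fitness(grid, a_j) + alpha_i * fitness(a_j, grid)
        return grid[np.argmax(subjective)]

    a1 = a2 = grid[len(grid) // 2]
    for _ in range(n_iter):
        a1, a2 = best_response(alpha1, a2), best_response(alpha2, a1)
    return a1, a2

# Hypothetical quadratic payoff with positive externalities of strength k.
k = 0.5
f = lambda a_i, a_j: a_i - a_i**2 + k * a_i * a_j
print(subjective_nash(0.3, 0.3, f, np.linspace(0.0, 2.0, 401)))
```

With this illustrative payoff, positive preference parameters (altruism) push the computed equilibrium actions above the egoistic equilibrium, mirroring the commitment effect described above.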
We now discuss the formal framework on which our results are based. Let as above. We say that a preference parameter is egoistic if , and other-regarding otherwise. In our results we will use the following assumptions, which are satisfied by the externality game:
We give some remarks on the typical indirect evolutionary models before presenting our generalized model. Recall our claim that the strategy space assumed by much of the indirect evolutionary game theory literature is too restrictive, due to the assumption that agents always play the Nash equilibrium of . Playing a Nash equilibrium in response to the other player's can be exploitable, in the sense that a player can “force” another rational agent to play an action that is more favorable to player (see Section 4.1 for an example). A player may avoid being exploited in this way by committing to some action, independent of opponents’ preferences. We will therefore enrich the strategy space in to relax this assumption (Section 4.1).
Standard indirect evolutionary game theory also assumes players perfectly observe each other’s payoff functions and subjective utility functions. This premise has been questioned in previous work, e.g., Heifetz et al. [2007], Gardner and West [2010]. We keep this assumption due to findings by Jordan [1991] and Kalai and Lehrer [1993] that, if players use Bayesian updating in repeated interactions with each other, under certain conditions they converge to accurate beliefs about each other’s utility functions and play the Nash equilibrium. Dekel et al. [2007] and Heller and Mohlin [2019] give similar justifications for this assumption in their indirect evolutionary models.
Our strategy space combines the “direct” and “indirect” approaches to evolutionary game theory [Güth and Kliemt, 1998]. That is, this space includes both fixed actions of the base game and strategies that choose actions as a function of the player’s own subjective utility function and the other player’s strategy.
First, a behavioral strategy plays an action ai, independent of the other player’s strategy. The action ai is common knowledge to both players before is played. Second, as in the standard indirect evolutionary approach [Bester and Güth, 1998, Possajennikov, 2000], a rational strategy has a commonly known preference parameter , and always plays a best response given to their beliefs about the other player. A rational player believes that another rational player plays the Nash equilibrium of . Thus the best response to another rational player with parameter is . A rational player believes behavioral player plays action , so the rational strategy is .
To see the reason for including both classes of strategies in one model, consider the externality game with . If a rational player faces rational player with , and , we can check that the payoff of increases while that of decreases:
That is, can exploit the rationality of by adopting an other-regarding preference parameter as a commitment. We therefore ask what strategies are selected for when we allow players to ignore each other's commitments (preferences), in order to avoid exploitation, and instead play some fitness-maximizing action.
In summary, our strategy space is the union of these sets:
We now characterize the Nash equilibria and stable strategies of S. We show that there are multiple neutrally stable strategies, one of which acts according to egoistic preferences, and no evolutionarily stable strategies. This is in contrast to the results of Bester and Güth [1998] and Possajennikov [2000], who showed that without behavioral strategies, a population with other-regarding preferences is the unique ESS in the externality game. All proofs are in Appendix A.
Proposition 1. Let be a symmetric two-player game that satisfies assumptions 1 - 3. Then a strategy is a Nash equilibrium in if and only if it is either or a strategy that is a Nash equilibrium in . Further, is an NSS in , and is an NSS in if and only if it is an NSS in . There are no ESSes.
Informally, a population that always plays the base game Nash equilibrium can be invaded by rational players with egoistic preferences, whose fitness against each other matches that of the original population. When the population consists of rational players with other-regarding preferences that are stable against other rational strategies, it can be invaded by agents that always play the Nash equilibrium of the game with payoffs given by those same other-regarding preferences.
Single game. Proposition 1 showed that strategies with either egoistic or other-regarding preferences can be neutrally stable, and neither is evolutionarily stable. This suggests that the standard indirect evolutionary approach is insufficient to explain the unique stability of other-regarding preferences. However, our analysis above assumed that players can use arbitrarily complex strategies at no greater cost than simpler ones; fitness is a function only of the payoffs of strategies, not of the cognitive resources required to use them [McNamara, 2013].
We introduce complexity costs as follows. For some complexity function , we apply the usual evolutionary stability analysis to a modified strategy fitness function:
While behavioral strategies always play a fixed action, rational strategies compute a best response to each given opponent. Within a single game, a behavioral strategy thus requires less computation than a rational strategy (this assumption was also used by Abreu and Sethi [2003]). Given this observation, for some we define (where the function returns if the condition in brackets is true, and otherwise). Once this cost is accounted for, selection favors the behavioral strategy that plays the Nash equilibrium of (even when assumption 3 does not hold).
Proposition 2. Let be a symmetric two-player game that satisfies assumptions 1 and 2. Then for any , the unique Nash equilibrium in under penalties is , and this strategy is an ESS.
An arbitrarily small cost of complexity prevents rational strategies from matching the fitness of the Nash equilibrium behavioral strategy.
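As a minimal sketch of the penalized single-game fitness just described (assuming strategies are encoded as tuples ('behavioral', action) or ('rational', alpha), which is our own representation rather than the paper's):

```python
def single_game_penalized_fitness(strategy_i, strategy_j, play, fitness, kappa):
    """Base-game fitness minus the single-game complexity penalty described above.

    strategy_i, strategy_j: ('behavioral', action) or ('rational', alpha)
    play(s_i, s_j):         action profile (a_i, a_j) induced by the strategies
    fitness(a_i, a_j):      player i's base-game payoff
    kappa:                  cost of using the (more complex) rational strategy
    """
    a_i, a_j = play(strategy_i, strategy_j)
    penalty = kappa if strategy_i[0] == 'rational' else 0.0
    return fitness(a_i, a_j) - penalty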
Multiple games. Proposition 2, again, appears inconsistent with the stability of other-regarding preferences. However, this result is based on a metric of complexity that only accounts for costs within one game — the cost of rational optimization versus playing a constant action for any opponent — rather than cumulative costs across games. As Piccinini and Schulz [2018] discuss qualitatively, although agents who rely on situation-specific heuristics avoid the fixed cost of explicit optimization paid by rational agents, they do worse in some variable environments than the latter, who can profit from having a general and compact strategy of optimizing utility functions. We formalize this tradeoff in this section.
Suppose that in each generation, the players in a population face a collection of games . Each player uses a strategy that (through the function ) outputs an action conditional on both the other player's strategy and the identity of the game. One can apply the usual evolutionary stability analysis to strategies that play the collection of games, by defining fitness as the sum of fitness from each game minus a multi-game complexity function . If a given strategy has parameters under selection across games, should increase with . An ideal definition of this function would be informed by an accurate model of the energetic costs of different kinds of cognition, which is beyond the scope of this work. We can define multi-game complexity in our setting by generalizing the strategy space from Section 4 to multiple games:
The motivation for parameterizing a strategy in by a single is that, across a distribution of relevantly similar games (e.g., variants of the externality game with different values of ), a rational player might be able to perform well by interpolating its other-regarding preferences. Then, for some , letting denote the number of unique elements of , define:
The set of stable strategies under these multi-game penalties is sensitive to the values of and . Intuitively, a behavioral strategy will be stable when is small, relative to the profits this strategy can make by adapting its response precisely to each game. Conversely, when is sufficiently large, a rational strategy can compensate for applying the same decision rule to every game by avoiding the costs of game-specific heuristics. In the next section, we show these patterns numerically.
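Continuing the encoding above, one plausible reading of the multi-game penalty is sketched below; the exact functional form, in particular the cost assigned to the rational strategy, is our assumption.

```python
def multi_game_complexity(strategy, kappa, lam):
    """Complexity cost of a strategy across a collection of games.

    ('rational', alpha):             a single interpolating parameter alpha,
                                     plus the fixed cost kappa of optimizing
    ('behavioral', (a_1, ..., a_M)): one fixed action per game
    lam is the cost per unique game-specific parameter.
    """
    kind, params = strategy
    if kind == 'rational':
        return kappa + lam          # one parameter reused in every game
    return lam * len(set(params))   # one cost per distinct action across games
```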
Here, we will use an evolutionary simulation algorithm to see how complexity costs across games influence stable strategies — in particular, which (if any) other-regarding preferences are selected? For simplicity, we consider a set of just two externality games for a fixed with and , denoted and . However, to investigate the effects of imbalanced environments (i.e., where is played more or less frequently than ) we suppose that players spend a fraction of their time in game and in . Then, with as the externality game payoff function for a given , the multi-game penalized fitness of a strategy against is:
Due to the continuous strategy space, a replicator dynamic simulation is intractable. Instead, we simulate an evolutionary process on using the adaptive learning algorithm [Young, 1993], implemented as follows (details are in Appendix B). An initial population of size is randomly sampled from the spaces of rational and behavioral strategies. In each round of evolution, each player in the population either (with low probability) switches to a random strategy, or else switches to the best response to a uniformly sampled opponent in the population (with respect to the penalized fitness above). Note that a best response in the space might use one action across both games, incurring a complexity cost of instead of . We fix , and . In each experiment, we tune the multi-game complexity penalty (hereafter, “per-parameter penalty”) to approximately the smallest value necessary to ensure that the population almost always converges to an element of (a rational strategy).
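The following is a minimal sketch of such an adaptive-learning loop; the population size, mutation probability, and the use of a finite candidate set to approximate best responses over the continuous strategy space are our own simplifications, and `penalized_fitness_multi(s, t)` is assumed to implement the two-game penalized fitness defined earlier.

```python
import random

def adaptive_learning(initial_population, random_strategy, candidates,
                      penalized_fitness_multi, n_rounds=1000, mutation_prob=0.01):
    """Sketch of a Young (1993)-style adaptive learning dynamic.

    initial_population:  list of strategies
    random_strategy():   samples a fresh strategy (mutation)
    candidates:          finite set of strategies to best-respond from
    penalized_fitness_multi(s, t): multi-game fitness of s against t,
                                   net of complexity costs (assumed given)
    """
    population = list(initial_population)
    for _ in range(n_rounds):
        new_population = []
        for _ in population:
            if random.random() < mutation_prob:
                new_population.append(random_strategy())
            else:
                opponent = random.choice(population)
                best = max(candidates,
                           key=lambda s: penalized_fitness_multi(s, opponent))
                new_population.append(best)
        population = new_population
    return population
```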
Varying strength of negative or positive externalities in one game. First, we show that other-regarding preferences evolve under sufficiently strong negative or positive externalities, given a sufficiently high per-parameter penalty. We fix , , and , and vary . For , the population converged to a behavioral strategy that uses only one action, for all values of we tested (see the open circle in Figure 1). This suggests that when both games are sufficiently similar, a behavioral strategy can interpolate across both games at less expense than a rational strategy. Figure 1 shows that, as expected, the sign and magnitude of the stable value scales with . For , the population converges to , suggesting that other-regarding preferences only interpolate well across these externality games when the externalities are sufficiently strong in magnitude.
Varying proportion of games with negative versus positive externalities. Next, we show that the strength of altruism versus spite in the limiting population scales nonlinearly with the proportion of games with negative versus positive externalities. With , we vary the fraction of games with , over , for three pairs of games. For all pairs of in this experiment, the values and have one-action behavioral strategies in the limiting population (see the open circles in Figure 2). When one game is extremely rare, the rational strategy's gains from interpolation across games do not outweigh the cost of rationality.
First we fix and (blue curve in Figure 2). Again, the trend of decreasing with greater is as expected, though there is a bias towards altruism: an equal proportion of positive and negative externalities gives . When and (orange curve), even small proportions of the large-magnitude negative are sufficient for the rational population to adopt , and remains roughly constant above . That is, in an environment where one game has weak positive externalities and the other has strong negative externalities, most of the effect on the population's other-regarding preferences comes just from having a frequency of strong negative externalities above some (small) threshold. The same pattern holds in the opposite direction when and (green curve).
In Figure 3, we vary both and , keeping . For any , the result from Figure 1 where a rational strategy is not stable for small still holds. Likewise, the result that takes over the population when is not sensitive to . Generalizing the trend from Figure 2, for sufficiently large magnitudes of , only a minority of games need to have far from for strong other-regarding preferences to be stable.
Social welfare in the limiting population as a function of the per-parameter penalty. Finally, we show how the total payoffs of the limiting population vary both with the size of the per-parameter penalty, and with the proportion of games with positive versus negative externalities. Fixing and , we vary for each . To visualize the transitions between limiting populations of behavioral versus rational strategies, we compute the social welfare averaged over the last two rounds (for some parameter values, the population oscillates) of each evolutionary simulation for penalty and proportion , shown in Figure 4.
For most values of , when there is no per-parameter penalty () the population attains the near-lowest social welfare, where all in the population play the base game Nash equilibrium. The penalty is sufficient for all populations to converge to an other-regarding rational strategy, which attains the highest social welfare when but nearly the lowest when , i.e., when most of the games have . For intermediate values of , the population oscillates between and a behavioral best response to in each game, usually resulting in social welfare between that of very low and high . The minimum value of necessary for convergence to the rational strategy is largest for values of closest to 0.5, while only a small penalty is necessary when or (see the values of where the curves in Figure 4 plateau). Intuitively, if the large majority of games have the same , a behavioral strategy does not profit much from adapting with multiple actions, relative to the complexity costs of playing different actions for two games.
The magnitude of relative to required for other-regarding preferences to be stable might appear unrealistically large, based on these results. We note the distinction between the fixed cognitive costs of developing a rational decision procedure, and the per-use costs of learning heuristics for each context and recognizing when each is appropriate. Cooper [1996] argues that lexicographic, or infinitesimal, complexity costs are appropriate for the former (these start-up costs are a tiebreaker between strategies that are otherwise equally capable), while finite non-negligible costs are suitable for the latter. It is therefore plausible that in several evolutionary contexts, the costs of adapting to each interaction from scratch outweigh the costs of rationality. Regardless, given the sensitivity of the stable populations in these experiments to , it is important to account for the relative strength of these two factors when predicting the result of an evolutionary process.
Lastly, we discuss the implications of complexity costs for another model that appears to preclude the evolution of certain other-regarding preferences. Recall that we have defined the utility functions of rational strategies as the player’s own payoff plus a multiple of the opponent’s payoff. Previous work has shown (in finite-action games) that if all possible subjective utility functions are permitted, and players observe each other’s subjective utility functions, then all stable strategies achieve a Pareto efficient payoff [Dekel et al., 2007, Heller and Mohlin, 2019]. This conclusion follows from the “secret handshake” argument: a player who is indifferent among all action pairs can select an equilibrium that matches any other strategy’s action against that strategy, but plays an action achieving an efficient payoff against itself [Robson, 1990]. These results rule out both the base game Nash equilibrium and the ESS in of the externality game, which is for , while is the unique efficient rational strategy.
One might suspect, then, that our conclusion from the numerical experiments — i.e., inefficient other-regarding preferences can be stable when agents play multiple games — would not hold after including the strategy classes from Dekel et al. [2007] and Heller and Mohlin [2019]. When we include complexity costs, however, the secret handshake argument does not follow. Let be the class of strategies whose subjective utility functions are constant over all action pairs, and which use the equilibrium selection rule described above. Because this strategy requires choosing different Nash equilibria depending on the opponent, we claim that it is more complex than either a behavioral or rational strategy.
For , let . Then is still an ESS under the conditions of Proposition 2, with added to the strategy space. The proof is straightforward; given a positive penalty, a strategy from cannot match the payoff of against itself, by the definition of the base game Nash equilibrium:
We conjecture that across multiple games, a sufficiently large penalty would yield similar results to Section 5.2.
The puzzle that motivated this work was the apparent prevalence of other-regarding preferences, such as altruism and spite, despite the possibility of selection for commitment strategies that ignore the signals of other-regarding preferences. Our results suggest that this puzzle stems from a neglect of complexity considerations in previous literature on the evolution of preferences. We considered a class of two-player symmetric games that includes the games used by Bester and Güth [1998] and Possajennikov [2000] to illustrate the stability of altruism and spite. First, via evolutionary stability analysis on a strategy space that combines the direct and indirect approaches, we confirmed that other-regarding preferences are no longer uniquely stable when fixed-action strategies can also evolve. We then showed numerically that, although other-regarding preferences are unstable when agents play a single game under costs of strategy complexity, if the costs of distinct fixed actions across multiple games are sufficiently high, other-regarding preferences are stable. These costs also explain why inefficient stable strategies can persist — the flexible “secret handshake” strategy, which has been purported to guarantee that stability implies efficiency, is too complex to invade populations with certain inefficient strategies.
Accounting for the costs of adapting strategies to specific games plausibly sheds light on other phenomena in evolutionary game theory. For example, Boyd and Richerson [1992] argued that a common explanation of cooperation as a product of punishment, e.g., as in tit-for-tat in the repeated Prisoner’s Dilemma, proves too much: “Moralistic” strategies, which not only punish noncooperation but also punish those who do not punish noncooperation, can enforce the stability of any individually rational behavior. These moralistic strategies require sophisticated recognition of the behaviors that constitute cooperation or punishment in each given game. If some individually rational behavior enforced by a moralistic strategy is only marginally better for the cooperating player than getting punished, another strategy could invade by avoiding the complexity cost of the moralistic strategy, which outweighs the direct fitness cost of being punished. Thus, under complexity costs, the set of evolutionarily stable behaviors may be much smaller. It is also important to note that classes of simple, generalizable utility functions other than those we have considered might evolve. Instead of having utility functions given by their payoff plus a multiple of the other agent’s payoff, agents could develop utility functions with an aversion to exploitation or inequity [Huck and Oechssler, 1999, Güth and Napel, 2006]. Future work could investigate selection pressures on utility functions of different complexity.
Besides explaining biological behavior, our model of complexity-penalized preference evolution might also motivate predictions of the behavior of artificial agents, such as reinforcement learning (RL) algorithms. Policies are updated based on reward signals similarly to fitness-based updating of populations in evolutionary models [Börgers and Sarin, 1997]. It is common in RL training to penalize strategies (“policies”) according to their complexity, and deep learning researchers have argued that artificial neural networks have an implicit bias towards simple functions [Mingard et al., 2021, Valle-Perez et al., 2019]. Thus, RL agents trained together may develop other-regarding preferences, insofar as the assumptions of our model are satisfied by the tasks these agents are trained in. A better understanding of the relationship between complexity costs and the distribution of environments these agents are trained in may help us better understand what kinds of preferences they acquire.
Dilip Abreu and Rajiv Sethi. Evolutionary stability in a reputational model of bargaining. Games and Economic Behavior, 44(2):195–216, 2003.
Ingela Alger and Jörgen W. Weibull. A generalization of Hamilton’s rule—Love others how much? Journal of Theoretical Biology, 299:42–54, 2012. ISSN 0022-5193. doi: https://doi.org/10.1016/j.jtbi.2011.05.008. URL https://www.sciencedirect.com/science/article/pii/S0022519311002505. Evolution of Cooperation.
Jeffrey S Banks and Rangarajan K Sundaram. Repeated games, finite automata, and complexity. Games and Economic Behavior, 2(2):97–117, 1990. ISSN 0899-8256. doi: https://doi.org/10.1016/0899-8256(90)90024-O. URL https://www.sciencedirect.com/science/article/pii/089982569090024O.
Siegfried Berninghaus, Christian Korth, and Stefan Napel. Reciprocity—an indirect evolutionary analysis. Journal of Evolutionary Economics, 17:579–603, 02 2007. doi: 10.1007/s00191-006-0053-1.
Helmut Bester and Werner Güth. Is altruism evolutionarily stable? Journal of Economic Behavior & Organization, 34(2):193–209, 1998.
Kenneth G Binmore and Larry Samuelson. Evolutionary stability in repeated games played by finite automata. Journal of Economic Theory, 57(2):278–305, 1992. ISSN 0022-0531. doi: https://doi.org/10.1016/0022-0531(92)90037-I. URL https://www.sciencedirect.com/science/article/pii/002205319290037I.
Friedel Bolle. Is altruism evolutionarily stable? And envy and malevolence?: Remarks on Bester and Güth. Journal of Economic Behavior & Organization, 42(1):131–133, 2000.
Robert Boyd and Peter J. Richerson. Punishment allows the evolution of cooperation (or anything else) in sizable groups. Ethology and Sociobiology, 13(3):171–195, 1992. ISSN 0162-3095. doi: https://doi.org/10.1016/0162-3095(92)90032-Y. URL https://www.sciencedirect.com/science/article/pii/016230959290032Y.
Tilman Börgers and Rajiv Sarin. Learning Through Reinforcement and Replicator Dynamics. Journal of Economic Theory, 77(1):1–14, 1997. ISSN 0022-0531. doi: https://doi.org/10.1006/jeth.1997.2319. URL https://www.sciencedirect.com/science/article/pii/S002205319792319X.
John Conlisk. Costly optimizers versus cheap imitators. Journal of Economic Behavior & Organization, 1(3):275–293, September 1980. ISSN 0167-2681. doi: 10.1016/0167-2681(80)90004-9.
David J. Cooper. Supergames Played by Finite Automata with Finite Costs of Complexity in an Evolutionary Setting. Journal of Economic Theory, 68(1):266–275, 1996. ISSN 0022-0531. doi: https://doi.org/10.1006/jeth.1996.0015. URL https://www.sciencedirect.com/science/article/pii/S0022053196900150.
Eddie Dekel, Jeffrey C. Ely, and Okan Yilankaya. Evolution of Preferences. The Review of Economic Studies, 74(3):685–704, 2007. ISSN 00346527, 1467937X. URL http://www.jstor.org/stable/4626157.
Martin Dufwenberg and Werner Güth. Indirect evolution vs. strategic delegation: a comparison of two approaches to explaining economic institutions. European Journal of Political Economy, 15(2):281–295, 1999. ISSN 0176-2680. doi: https://doi.org/10.1016/S0176-2680(99)00006-3. URL https://www.sciencedirect.com/science/article/pii/S0176268099000063.
Jon Elster. Rationality, page 1–42. Cambridge University Press, 1983. doi: 10.1017/CBO9781139171694.002.
Jeffrey C. Ely and Okan Yilankaya. Nash Equilibrium and the Evolution of Preferences. Journal of Economic Theory, 97(2):255–272, 2001. ISSN 0022-0531. doi: https://doi.org/10.1006/jeth.2000.2735. URL https://www.sciencedirect.com/science/article/pii/S0022053100927352.
Robert H. Frank. If Homo Economicus Could Choose His Own Utility Function, Would He Want One with a Conscience? The American Economic Review, 77 (4):593–604, 1987. ISSN 00028282. URL http://www.jstor.org/stable/1814533.
Andy Gardner and Stuart A. West. Greenbeards. Evolution, 64(1):25–38, 2010. doi: https://doi.org/10.1111/j.1558-5646.2009.00842.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1558-5646.2009.00842.x.
Félix Geoffroy and Jean-Baptiste André. The emergence of cooperation by evolutionary generalization. Proc Biol Sci, 2021.
Werner Güth and Hartmut Kliemt. The indirect evolutionary approach: Bridging the gap between rationality and adaptation. Rationality and Society, 10(3):377–399, 1998. doi: 10.1177/104346398010003005. URL https://doi.org/10.1177/104346398010003005.
Werner Güth and Stefan Napel. Inequality aversion in a variety of games: An indirect evolutionary analysis. The Economic Journal, 116(514):1037–1056, 2006. ISSN 00130133, 14680297. URL http://www.jstor.org/stable/4121943.
Aviad Heifetz, Chris Shannon, and Yossi Spiegel. What to Maximize If You Must. Journal of Economic Theory, pages 31–57, 2003.
Aviad Heifetz, Chris Shannon, and Yossi Spiegel. The Dynamic Evolution of Preferences. Economic Theory, 32:251–286, 2007.
Yuval Heller and Erik Mohlin. Coevolution of deception and preferences: Darwin and Nash meet Machiavelli. Games and Economic Behavior, 113:223–247, 2019. ISSN 0899-8256. doi: https://doi.org/10.1016/j.geb.2018.09.011. URL https://www.sciencedirect.com/science/article/pii/S0899825618301532.
Steffen Huck and Jörg Oechssler. The indirect evolutionary approach to explaining fair allocations. Games and Economic Behavior, 28(1):13–24, 1999. ISSN 0899-8256. doi: https://doi.org/10.1006/game.1998.0691. URL https://www.sciencedirect.com/science/article/pii/S0899825698906911.
Karin Isler and Carel P. Van Schaik. How humans evolved large brains: Comparative evidence. Evolutionary Anthropology: Issues, News, and Reviews, 23(2):65–75, 2014. doi: https://doi.org/10.1002/evan.21403. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/evan.21403.
Marco A. Janssen. Evolution of cooperation in a one-shot prisoner’s dilemma based on recognition of trustworthy and untrustworthy agents. Journal of Economic Behavior & Organization, 65(3):458–471, 2008. ISSN 0167-2681. doi: https://doi.org/10.1016/j.jebo.2006.02.004. URL https://www.sciencedirect.com/science/article/pii/S0167268106001934.
James Jordan. Bayesian learning in normal form games. Games and Economic Behavior, 3(1):60–81, 1991. URL https://EconPapers.repec.org/RePEc:eee:gamebe:v:3:y:1991:i:1:p:60-81.
Ehud Kalai and Ehud Lehrer. Rational Learning Leads to Nash Equilibrium. Econometrica, 61(5):1019–1045, 1993. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/2951492.
Kai A. Konrad and Florian Morath. Evolutionarily stable in-group favoritism and out-group spite in intergroup conflict. Journal of Theoretical Biology, 306:61–67, 2012. ISSN 0022-5193. doi: https://doi.org/10.1016/j.jtbi.2012.04.013. URL https://www.sciencedirect.com/science/article/pii/S0022519312001944.
John M McNamara. Towards a richer evolutionary game theory. Journal of the Royal Society Interface, 10(88):20130544, November 2013. ISSN 1742-5689. doi: 10.1098/rsif.2013.0544.
Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, and Ard A. Louis. Is SGD a Bayesian sampler? Well, almost. Journal of Machine Learning Research, 22(79):1–64, 2021. URL http://jmlr.org/papers/v22/20-676.html.
Efe A. Ok and Fernando Vega-Redondo. On the Evolution of Individualistic Preferences: An Incomplete Information Scenario. Journal of Economic Theory, 97(2):231–254, 2001. ISSN 0022-0531. doi: https://doi.org/10.1006/jeth.2000.2668. URL https://www.sciencedirect.com/science/article/pii/S0022053100926681.
Gualtiero Piccinini and Armin W. Schulz. The Ways of Altruism. Evolutionary Psychological Science, 5:58–70, 2018.
Alex Possajennikov. On the evolutionary stability of altruistic and spiteful preferences. Journal of Economic Behavior & Organization, 42(1):125–129, 2000.
Michael S. Reichert and John L. Quinn. Cognition in contests: Mechanisms, ecology, and evolution. Trends in Ecology & Evolution, 32(10):773–785, 2017. ISSN 0169-5347. doi: https://doi.org/10.1016/j.tree.2017.07.003. URL https://www.sciencedirect.com/science/article/pii/S0169534717301799.
Nikolaus Robalino and Arthur Robson. The Evolution of Strategic Sophistication. The American Economic Review, 106(4):1046–1072, 2016. ISSN 00028282. URL http://www.jstor.org/stable/43821484.
Arthur J. Robson. Efficiency in evolutionary games: Darwin, Nash and the secret handshake. Journal of Theoretical Biology, 144(3):379–396, 1990. ISSN 0022-5193. doi: https://doi.org/10.1016/S0022-5193(05)80082-7. URL https://www.sciencedirect.com/science/article/pii/S0022519305800827.
Arthur J. Robson. Why Would Nature Give Individuals Utility Functions? Journal of Political Economy, 109(4):900–914, 2001. ISSN 00223808, 1537534X. URL http://www.jstor.org/stable/10.1086/322083.
Ariel Rubinstein. Finite automata play the repeated prisoner’s dilemma. Journal of Economic Theory, 39(1):83–96, 1986. ISSN 0022-0531. doi: https://doi.org/10.1016/0022-0531(86)90021-9. URL https://www.sciencedirect.com/science/article/pii/0022053186900219.
Amartya Sen. Foundations of Social Choice Theory: An Epilogue. Cambridge University Press, Cambridge, 1986.
Guillermo Valle-Perez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rye4g3AqFm.
Matthijs van Veelen. But Some Neutrally Stable Strategies are More Neutrally Stable than Others. Tinbergen Institute Discussion Papers 10-033/1, Tinbergen Institute, March 2010. URL https://ideas.repec.org/p/tin/wpaper/20100033.html.
Matthijs van Veelen and Julián García. In and out of equilibrium II: Evolution in repeated games with discounting and complexity costs. Games and Economic Behavior, 115:113–130, 2019. ISSN 0899-8256. doi: https://doi.org/10.1016/j.geb.2019.02.013. URL https://www.sciencedirect.com/science/article/pii/S0899825619300314.
H. Peyton Young. The evolution of conventions. Econometrica, 61(1):57–84, 1993. URL https://EconPapers.repec.org/RePEc:ecm:emetrp:v:61:y:1993:i:1:p:57-84.
Behavioral. Define , the payoff of the Nash equilibrium with egoistic preferences. By the definition of Nash equilibrium of , since , the strategy is a Nash equilibrium in . Suppose . Then we must have , because otherwise uniqueness of the Nash equilibrium (assumption 1) would be violated. So is not a Nash equilibrium in .
Since the Nash equilibrium of is unique, there is no behavioral strategy such that . Suppose a rational strategy satisfies . Then
(This is satisfied for .) But this implies that .
So , and , therefore is neutrally stable (but not an ESS).
Rational. It is immediate that can only be a Nash equilibrium in if it is a Nash equilibrium in . Let be such a strategy. always plays against itself, so . Suppose a deviator plays . Given assumption 3, for any there exists a such that . Therefore:
Where the last line follows because is a Nash equilibrium in the space of rational strategies. So is a Nash equilibrium in .
Suppose is neutrally stable in . By assumption 2, is the unique action such that . Then
and . So is neutrally stable (but not an ESS). On the other hand, if is not an NSS in , the same counterexample to neutral stability applies in the expanded space , thus is not an NSS in .
Behavioral. The conditions for Nash equilibrium in do not change, since this set has the lowest complexity. However, when assessing the stability of , it suffices to only consider invader strategies in , because for a strategy ,
Since the Nash equilibrium of is unique (assumption 1), there is no other behavioral strategy such that . Thus is an ESS under penalties.
Rational. Let be any rational strategy. Then . But , so cannot be a Nash equilibrium under penalties.
Each strategy is parameterized by , where the strategy is if or if . A population of size is initialized with and for each player in the population. Let and if , otherwise . The probability of switching to a random strategy from the initialization distribution in round of evolution is . (We decay to decrease the rate of stochasticity and thus help convergence.) Best responses in the space of are computed analytically; for , we use gradient ascent.
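For the gradient-ascent best responses mentioned above, a minimal sketch of such a routine (with a finite-difference gradient, a single strategy parameter, and illustrative step sizes, none of which are the paper's exact choices) could look like:

```python
def best_response_weight(fitness_of_weight, w_init=0.0, lr=0.01, steps=2000, eps=1e-4):
    """Gradient-ascent best response over a single strategy parameter, using a
    finite-difference gradient. The single-parameter setup and step sizes are
    illustrative assumptions, not the paper's implementation."""
    w = w_init
    for _ in range(steps):
        grad = (fitness_of_weight(w + eps) - fitness_of_weight(w - eps)) / (2 * eps)
        w += lr * grad
    return w

# Example with a toy concave fitness whose maximum is at w = 0.3:
# best_response_weight(lambda w: -(w - 0.3) ** 2)  # returns approximately 0.3
```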
The post Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs appeared first on Center on Long-Term Risk.
The post Commitment games with conditional information revelation appeared first on Center on Long-Term Risk.
The conditional commitment abilities of mutually transparent computer agents have been studied in previous work on commitment games and program equilibrium. This literature has shown how these abilities can help resolve Prisoner’s Dilemmas and other failures of cooperation in complete information settings. But inefficiencies due to private information have been neglected thus far in this literature, despite the fact that these problems are pervasive and might also be addressed by greater mutual transparency. In this work, we introduce a framework for commitment games with a new kind of conditional commitment device, which agents can use to conditionally reveal private information. We prove a folk theorem for this setting that provides sufficient conditions for ex post efficiency, and thus represents a model of ideal cooperation between agents without a third-party mediator. Connecting our framework with the literature on strategic information revelation, we explore cases where conditional revelation can be used to achieve full cooperation while unconditional revelation cannot. Finally, extending previous work on program equilibrium, we develop an implementation of conditional information revelation. We show that this implementation forms program -Bayesian Nash equilibria corresponding to the Bayesian Nash equilibria of these commitment games.
KEYWORDS
Cooperative AI, program equilibrium, smart contracts
ACM Reference Format:
Anthony DiGiovanni and Jesse Clifton. 2022. Commitment games with conditional information revelation. In Appears at the 4th Games, Agents, and Incentives Workshop (GAIW 2022), held as part of the Workshops at the 20th International Conference on Autonomous Agents and Multiagent Systems, Auckland, New Zealand, May 2022. IFAAMAS, 14 pages.
What are the upper limits on the ability of rational, self-interested agents to cooperate? As autonomous systems become increasingly responsible for important decisions, including in interactions with other agents, the study of “Cooperative AI” [2] will potentially help ensure these decisions result in cooperation. It is well-known that game-theoretically rational behavior — which will potentially be more descriptive of the decision-making of advanced computer agents than humans — can result in imperfect cooperation, in the sense of inefficient outcomes. Some famous examples are the Prisoner’s Dilemma and the Myerson-Satterthwaite impossibility of efficient bargaining under incomplete information [20]. Fearon [4] explores “rationalist” explanations for war (i.e., situations in which war occurs in equilibrium); these include Prisoner’s Dilemma-style inability to credibly commit to peaceful alternatives to war, as well as incentives to misrepresent private information (e.g., military strength). Because private information is so ubiquitous in real strategic interactions, resolving these cases of inefficiency is a fundamental open problem. Inefficiencies due to private information will be increasingly observed in machine learning, as machine learning is used to train agents in complex multi-agent environments featuring private information, such as negotiation. For example, Lewis et al. [16] found that when an agent was trained with reinforcement learning on negotiations under incomplete information, it failed to reach an agreement with humans more frequently than a human-imitative model did.
But greater ability to make commitments and share private information can open up more efficient equilibria. Computer systems could be much better at making their internal workings legible to other agents, and at making sophisticated conditional commitments. More mutually beneficial outcomes could also be facilitated by new technologies like smart contracts [30]. This makes the game theory of interactions between agents with these abilities important for the understanding of Cooperative AI — in particular, for developing an ideal standard of multi-agent decision making with future technologies. An extreme example of the power of greater transparency and commitment ability is Tennenholtz [29]’s “program equilibrium” solution to the one-shot Prisoner’s Dilemma. The players in Tennenholtz’s “program game” version of the Prisoner’s Dilemma submit computer programs to play on their behalf, which can condition their outputs on each other’s source code. Then a pair of programs with the source code form an equilibrium of mutual cooperation.
In this spirit, we are interested in exploring the kinds of cooperation that can be achieved by agents who are capable of extreme mutual transparency and credible commitment. We can think of this as giving an upper bound on the ability of advanced artificially intelligent agents, or humans equipped with advanced technology for commitment and transparency, to achieve efficient outcomes. While such abilities are inaccessible to current systems, identifying sufficient conditions for cooperation under private information provides directions for future research and development, in order to avoid failures of cooperation. These are our main contributions:
Commitment games and program equilibrium. We build on commitment games, introduced by Kalai et al. [11] and generalized to Bayesian games (without verifiable revelation) by Forges [5]. In a commitment game, players submit commitment devices that can choose actions conditional on other players’ devices. This leads to folk theorems: Players can choose commitment devices that conditionally commit to playing a target action (e.g., cooperating in a Prisoner’s Dilemma), and punishing if their counterparts do not play accordingly (e.g., defecting in a Prisoner’s Dilemma if counterparts’ devices do not cooperate). A specific kind of commitment game is one played between computer agents who can condition their behavior on each other’s source code. This is the focus of the literature on program equilibrium [1, 15, 21, 22, 25, 29]. Peters and Szentes [24] critique the program equilibrium framework as insufficiently robust to new contracts, because the programs in, e.g., Kalai et al. [11]’s folk theorem only cooperate with the exact programs used in the equilibrium profile. Like ours, the commitment devices in Peters and Szentes [24] can reveal their types and punish those that do not also reveal. However, their devices reveal unconditionally and thus leave the punishing player exploitable, restricting the equilibrium payoffs to a smaller set than that of Forges [5] or ours.
Our folk theorem builds directly on Forges [5]. In Forges’ setting, players lack the ability to reveal private information. Thus the equilibrium payoffs must be incentive compatible. We instead allow (conditional) verification of private information, which lets us drop Forges’ incentive compatibility constraint on equilibrium payoffs. Our program equilibrium implementation extends Oesterheld [21]’s robust program equilibrium to allow for conditional information revelation.
Strategic information revelation. In games of strategic information revelation, players have the ability to verifiably reveal some or all of their private information. The question then becomes: How much private information should players reveal (if any), and how should other players update their beliefs based on players’ refusal to reveal some information? A foundational result in this literature is that of full unraveling: Under a range of conditions, when players can verifiably reveal information, they will act as if all information has been revealed [7, 18, 19]. This means the mere possibility of verifiable revelation is often enough to avoid informational inefficiencies. However, there are cases where unraveling fails to hold, and informational inefficiencies persist even when verifiable revelation is possible. This can be due to uncertainty about a player’s ability to verifiably reveal [3, 26] or revelation being costly [8, 10]. But revelation of private information can fail even without such uncertainty or costs [14 , 17]. We will review several such examples in Section 5, show how the persistence of uncertainty in these settings can lead to welfare losses, and show how this can be remedied with the commitment technologies of our framework (but not weaker ones, like those of Forges [5]).
Let be a Bayesian game with players. Each player has a space of types , giving joint type space . At the start of the game, players' types are sampled by Nature according to the common prior . Each player knows only their type. Player 's strategy is a choice of action for each type in . Let denote player 's expected payoff in this game when the players have types and follow an action profile . A Bayesian Nash equilibrium is an action profile in which every player and type plays a best response with respect to the prior over other players' types: For all players and all types , . An -Bayesian Nash equilibrium is similar: Each player and type expects to gain at most (instead of 0) by deviating from .
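In standard notation (our own, which may differ from the paper's symbols), the best-response condition defining a Bayesian Nash equilibrium reads, for every player \(i\), every type \(t_i\), and every alternative action \(a_i'\),
\[
\mathbb{E}_{t_{-i} \sim p(\cdot \mid t_i)}\!\left[ u_i\big(a_i(t_i), a_{-i}(t_{-i}), t\big) \right] \;\ge\; \mathbb{E}_{t_{-i} \sim p(\cdot \mid t_i)}\!\left[ u_i\big(a_i', a_{-i}(t_{-i}), t\big) \right],
\]
and an approximate Bayesian Nash equilibrium relaxes the right-hand side so that each player and type gains at most the given tolerance by deviating.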
We assume players can correlate their actions by conditioning on a trustworthy randomization device . For any correlated strategy (a distribution over action profiles), let . When it is helpful, we will write to clarify the subset of the type profile on which the correlated strategy is conditioned. Let denote a correlated strategy whose th entry is degenerate at , and the actions of players other than are sampled from independently of . Then, the following definitions will be key to our discussion:
DEFINITION 1. A payoff vector as a function of type profiles is feasible if there exists a correlated strategy such that, for all players and types ,
DEFINITION 2. A payoff is interim individually rational (INTIR) if, for each player , there exists a correlated strategy used by the other players such that, for all ,
The minimax strategy is used by the other players to punish player . The threat of such punishments will support the equilibria of our folk theorem. Players only have sufficient information to use this correlated strategy if they reveal their types to each other. Moreover, the punishment can only work in general if they do not reveal their types to player , because the definition of INTIR requires the deviator to be uncertain about . Since the inequalities hold for all , the players do not need to know player 's type to punish them.
DEFINITION 3. A feasible payoff induced by is incentive compatible (IC) if, for each player and type pair ,
Incentive compatibility means that each player prefers a given correlated strategy to be played according to their type, as opposed to that of another type.
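As a rough reconstruction in assumed notation (with \(\pi\) the correlated strategy inducing the payoff \(v\) and \(\mu^{-i}\) the punishing correlated strategy of the other players; these symbols are ours, not necessarily the paper's), Definitions 2 and 3 say, for all players \(i\) and types \(t_i, t_i'\),
\[
v_i(t_i) \;\ge\; \max_{a_i \in A_i} \; \mathbb{E}_{t_{-i} \sim p(\cdot \mid t_i)}\!\left[ u_i\big(a_i, \mu^{-i}(t_{-i}), t\big) \right] \quad \text{(INTIR)},
\]
\[
\mathbb{E}_{t_{-i} \sim p(\cdot \mid t_i)}\!\left[ u_i\big(\pi(t_i, t_{-i}), t\big) \right] \;\ge\; \mathbb{E}_{t_{-i} \sim p(\cdot \mid t_i)}\!\left[ u_i\big(\pi(t_i', t_{-i}), t\big) \right] \quad \text{(IC)}.
\]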
DEFINITION 4. Given a type profile , a payoff is ex post efficient (hereafter, "efficient") if there does not exist such that for all and for some
We will also consider games with strategic information revelation, i.e., Bayesian games where, immediately after learning their types, players are able to reveal their private information as follows. Players simultaneously each choose from some revelation action set , which is a subset of . Then, all players observe each , thus learning that player 's type is in . Revelation is verifiable in the sense that a player's choice of must contain their true type, i.e., they cannot falsely "reveal" a different type. We will place our results on conditional type revelation in the context of the literature on unraveling:
DEFINITION 5. Let be the profile of revelation actions (as functions of types) in a Bayesian Nash equilibrium of a game with strategic information revelation. Then has full unraveling if for all , or partial unraveling if is a strict subset of for at least one
Uncertainty about others' private information, and a lack of ability or incentive to reveal that information, can lead to inefficient outcomes in Bayesian Nash equilibrium (or an appropriate refinement thereof). Here is a running example we will use to illustrate how informational problems can be overcome under our assumptions, but not under the weaker assumption of unconditional revelation ability.
Example 3.1 (War under incomplete information, adapted from Slantchev and Tarar [27] ). Two countries are on the verge of war over some territory. Country 1 offers a split of the territory giving fractions and to countries 1 and 2, respectively. If country 2 rejects this offer, they go to war. Each player wins with some probability (detailed below), and each pays a cost of fighting . The winner receives a payoff of 1, and the loser gets 0.
The countries' military strength determines the probability that country 2 wins the war, denoted . Country 1 doesn't know whether the other's army is weak (with type ) or strong (), while country 1's strength is common knowledge. Further, country 2 has a weak point, which country 1 believes is equally likely to be in one of two locations . Thus country 2's type is given by . Country 1 can make a sneak attack on , independent of whether they go to war. Country 1 would gain from attacking , costing for country 2. But incorrectly attacking would cost for country 1, so country 1 would not risk an attack given a prior of on each of the locations. If country 2 reveals its full type by allowing inspectors from country 1 to assess its military strength , country 1 will also learn .
If country 1 has a sufficiently low prior that country 2 is strong, then war occurs in the unique perfect Bayesian equilibrium when country 2 is strong. Moreover, this can happen even if the countries can fully reveal their private information to one another. In other words, the unraveling of private information does not occur, because player 2 will be made worse off if they allow player 1 to learn about their weak point.
Before looking at what is achievable with different technologies for information revelation, we need to formally introduce our framework for commitment games with conditional information revelation. In the next section, we describe these games and present our folk theorem.
Players are faced with a "base game" , a Bayesian game with strategic information revelation as defined in Section 3.1. In our framework, a commitment game is a higher-level Bayesian game in which the type distribution is the same as that of , and actions are devices that define mappings from other players' devices to actions in (conditional on one's type). We assume for all players and types , i.e., players are at least able to reveal their exact types or not reveal any new information. They additionally have access to devices that can condition (i) their actions in and (ii) the revelation of their private information on other players' devices. Upon learning their type , player chooses a commitment device from an abstract space of devices . These devices are mappings from the player's type to a response function and a type revelation function. As in Kalai et al . [11] and Forges [5], we will define these functions so as to allow players to condition their decisions on each other's decisions without circularity.
Let be the domain of the randomization device . The response function is . (This notation, adopted from Forges [5], distinguishes the player's action-determining function from the action itself.) Given the other players' devices and a signal given by the realized value of the random variable , player 's action in after the revelation phase is .1 Conditioning the response on permits players to commit to (correlated) distributions over actions. Second, we introduce type revelation functions , which are not in the framework of Forges [5]. The th entry of indicates whether player reveals their type to player , i.e., player learns if this value is 1 or if it is . (We can restrict attention to cases where either all or no information is revealed, as our folk theorem shows that such a revelation action set is sufficient to enforce each equilibrium payoff profile.) Thus, each player can condition their action on the others' private information revealed to them via . Further, they can choose whether to reveal their type to another player, via , based on that player's device. Thus players can decide not to reveal private information to players whose devices are not in a desired device profile, and instead punish them.
Then, the commitment game is the one-shot Bayesian game in which each player 's strategy is a device , as a function of their type. After devices are simultaneously and independently submitted (potentially as a draw from a mixed strategy over devices), the signal is drawn from the randomization device , and players play the induced action profile in . Thus the ex post payoff of player in from a device profile is .
Our folk theorem consists of two results: First, any feasible and INTIR payoff can be achieved in equilibrium (Theorem 1). As a special case of interest, then, any efficient payoff can be attained in equilibrium. Second, all equilibrium payoffs in are feasible and INTIR (Proposition 1).
THEOREM 1. Let be any commitment game. For type profile , let be a correlated strategy inducing a feasible and INTIR payoff profile . Let be a punishment strategy that is arbitrary except, if is the only player with , let be the minimax strategy against player . Conditional on the signal , let be the deterministic action profile, called the target action profile, given by , and let be the deterministic action profile given by . For all players and types , let be such that:
Then, the device profile is a Bayesian Nash equilibrium of .
PROOF. We first need to check that the response and type revelation functions only condition on information available to the players. If all players use , then by construction of they all reveal their types to each other, and so are able to play conditioned on their type profile (regardless of whether the induced payoff is IC). If at least one player uses some other device, the players who do use still share their types with each other, thus they can play .
Suppose player deviates from . That is, player 's strategy in is . Note that the outputs of player 's response and type revelation functions induced by may in general be the same as those returned by . We will show that punishes deviations from the target action profile regardless of these outputs, as long as there is a deviation in functions or . Let . Then:
This last expression is the ex interim payoff of the proposed commitment given that the other players use , therefore is a Bayesian Nash equilibrium.
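To make the construction concrete, here is a minimal Python sketch of the device described in the proof: reveal one's type only to players whose submitted devices match the equilibrium profile, and play the target action only when every other device matches, punishing otherwise. All identifiers are invented for exposition; the paper defines these objects abstractly.

```python
def make_equilibrium_device(i, equilibrium_profile, target_action, punish_action):
    """Factory for player i's conditional commitment device (illustrative sketch)."""
    def reveal_to(j, device_j):
        # Reveal player i's type to player j only if j submitted the equilibrium device.
        return device_j == equilibrium_profile[j]

    def respond(devices, signal):
        # devices: mapping from player index to the submitted device.
        others_conform = all(devices[j] == equilibrium_profile[j]
                             for j in devices if j != i)
        # Play the target action (conditioned on the correlation signal) if
        # everyone else conforms; otherwise play the minimax punishment.
        return target_action(signal) if others_conform else punish_action(signal)

    return reveal_to, respond
```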
PROPOSITION 1. Let be any commitment game. If a device profile is a Bayesian Nash equilibrium of , then the induced payoff is feasible and INTIR.
PROOF. Let be the strategy profile of . Then by hypothesis so is feasible. Suppose that for some player , for all correlated strategies there exists a type such that:
Let . Then if player with type deviates to such that :
This contradicts the assumption that is the payoff of a Bayesian Nash equilibrium, therefore is INTIR.
Our assumptions do not imply the equilibrium payoffs are IC (unlike Forges [5]). Suppose a player 's payoff would increase if the players conditioned the correlated strategy on a different type (i.e., not IC). This does not imply that a profit is possible by deviating from the equilibrium, because in our setting the other players' actions are conditioned on the type revealed by . In particular, as in our proposed device profile, they may choose to play their part of the target action profile only if all other players' devices reveal their (true) types.
The assumptions that give rise to this class of commitment games with conditional information revelation are stronger than the ability to unconditionally reveal private information. Recalling the unraveling results from Section 2, unconditional revelation ability is sometimes sufficient for the full revelation of private information, or for revelation of the information that prohibits incentive compatibility, and thus the possibility of efficiency in equilibrium. But this is not always true, whereas efficiency is always attainable in equilibrium under our assumptions. We first show that full unraveling fails in our running example when country 2 has a weak point. Then, we discuss conditions under which the ability to partially reveal private information is sufficient for efficiency, and examples where these conditions don’t hold.
Since country 2 can only either reveal both its strength and weak point , or neither, in our formalism of strategic information revelation . If country 2 rejects the offer , players go to a war that country 2 wins with probability if its army is weak, or if strong.
Assume country 2 is strong and the prior probability of a strong type is . In the perfect Bayesian equilibrium of (without type revelation) country 1 offers , which country 2 rejects [27]. That is, if country 1 believes country 2 is unlikely to be strong, country 1 makes the smallest offer that only a weak type would accept. Thus with private information the countries go to war and receive inefficient payoffs in equilibrium. A strong country 2 also prefers not to reveal its type unconditionally. Although this would guarantee that country 1 best-responds with , which country 2 would accept, given knowledge of the weak point country 1 prefers to attack it and receive an extra payoff with certainty, costing for country 2. Country 2 would therefore be worse off by than in equilibrium without revelation, where its expected payoff is .
However, if country 2 can reveal its full type if and only if country 1 commits to that country 2 accepts, and commits not to attack , the countries can avoid war in equilibrium. The profile is not IC, and hence cannot be achieved under the assumptions of Forges [5] alone, because a weak country 2 would prefer the strong type's payoff (absent type-conditional commitment by country 1). In this example, conditional type revelation is required for efficiency due to a practical inability to reveal military strength without also revealing a vulnerability (Table 1). In other words, country 2's revelation action set is too restricted for full unraveling to occur. Interactions between advanced artificially intelligent agents may feature similar problems necessitating our framework. For example, if revelation requires sharing source code or the full set of parameters of a neural network that lacks cleanly separated modules, unconditional revelation risks exposing exploitable information. See also example 7 of Okuno-Fujiwara et al. [23] in which full unraveling fails because a firm does not want to reveal a secret technology that provides a competitive advantage, leading to inefficiency because other private information is not revealed.
Full unraveling. If full unraveling occurs in the base game , then the ability to conditionally reveal information becomes irrelevant. For example, consider a modification of Example 3.1 in which there is no weak point, i.e., country 's type is rather than . A strong country 2 that can verify its strength to country 1 prefers to do so, since this does not also help country 1 exploit it. But because of this, if country 2 refuses to reveal its strength and it is common knowledge that country 2 could verifiably reveal, country 1 knows country 2 is weak. Thus, all types are revealed in equilibrium without conditioning on country 1's commitment.
Some sufficient and necessary conditions for full unraveling have been derived. Hagenbach et al. [9] show that given verifiable revelation, full unraveling is guaranteed if there are no cycles in the directed graph defined by types that prefer to pretend to be each other. For full unraveling in some classes of games with multidimensional types, it is necessary for one player's payoff to be sufficiently nonconvex in the other's beliefs [17]. In Appendix A, we give an example where this condition fails, thus unconditional revelation is insufficient even without exploitable information. However, even for games with full unraveling, the framework of Forges [5] is still insufficient for equilibria with non-IC payoffs, since that framework does not allow verifiable revelation (conditional or otherwise).
Table 1:

|  | Full | Partial |
| --- | --- | --- |
| Conditional | feasible, INTIR | feasible, INTIR |
| Unconditional | feasible, INTIR, {full unraveling or IC} | feasible, INTIR, {full unraveling or IC after unraveling} |
Partial revelation and post-unraveling incentive compatibility. If in our original example country 2 could partially reveal its type, i.e., only the probability of winning a war but not its weak point, conditional revelation would not be necessary (Table 1). This is because the strategy inducing the efficient payoff profile depends only on the part of country 2's type that is revealed by the unraveling argument. Country 2 does not profit from lying about its exploitable, non-unraveled information — that is, the payoff is IC with respect to that information, even if not IC in the pre-unraveling game. Thus country 1 does not need to know this information for the efficient payoff to be achieved in equilibrium. Formally, in this case , i.e., country 2 can choose to reveal any , producing an equilibrium of partial unraveling. We can generalize this observation with the following proposition.
PROPOSITION 2. Suppose that the devices in do not have revelation functions, and is a game of strategic information revelation with for all . Let be updated to have support on the subset of types remaining after unraveling. As in Forges [5], assume is conditioned on . Then a payoff profile is achievable in a Bayesian Nash equilibrium of if and only if it is feasible, INTIR, and IC (with respect to the post-unraveling game and updated ).
PROOF. This is an immediate corollary of Propositions 1 and 2 of Forges [5], applied to the base game induced by unraveling (that is, with a prior updated on types being in the space ).
To our knowledge, it is an open question which conditions are sufficient and necessary for partial unraveling such that the efficient payoffs of the post-unraveling game are IC. An informal summary of Proposition 2 and characterizations of equilibrium payoffs under our framework and that of Forges [5] is: Given conditional commitment ability, efficiency can be achieved in equilibrium if and only if there is sufficiently strong incentive compatibility, conditional and verifiable revelation ability, or an intermediate combination of these (see Table 1).
Proposition 2 is not vacuous; there exist games in which, given the ability to partially, verifiably, and unconditionally reveal their private information, players end up in an inefficient equilibrium that is Pareto-dominated by a non-IC payoff. Consequently, the alternatives to conditional information revelation that we have considered are not sufficient to achieve all feasible and INTIR payoffs even when partial revelation is possible. The game in Appendix A is one example where such a payoff is efficient. In the following example, the only efficient payoff is IC. However, the set of equilibrium payoffs is smaller than under our assumptions, and excludes some potentially desirable outcomes. For example, there is a non-IC -efficient payoff that improves upon the strictly efficient payoff in utilitarian welfare (sum of all players' payoffs).
Example 5.1 (All-pay auction under incomplete information from Kovenock et al. [14]). Two firms participate in an all-pay auction. Each firm has a private valuation of a good. After observing their respective valuations, players simultaneously choose whether to reveal them. Then they simultaneously submit bids , and the higher bid wins the good, with a tie broken by a fair coin. Thus player 's payoff is . There is a Bayesian Nash equilibrium of this base game in which , and neither player reveals their valuation [14]. In this equilibrium, each player's ex interim payoff is:
The ex post payoffs are if , if , and otherwise.
Now, let , and consider the following strategy . For type profiles such that , let and . For , let and . Otherwise, let . Then:
Thus the payoff induced by is feasible and INTIR, because it exceeds the ex interim equilibrium payoff. This is also an ex post Pareto improvement on the equilibrium, because the ex post payoffs are if , if , and otherwise. Finally, this payoff is not IC, because if , player 1 would profit from conditioned on a type .
Note that the payoffs and , i.e., the case of , are not feasible. This non-IC payoff thus requires and is inefficient by a margin of . However, in practice the players may favor this payoff over . This is because the non-IC payoff is -welfare optimal, since whenever for either player, the supremum of the sum of payoffs is .
Having shown that conditional commitment and revelation devices solve problems that are intractable under other assumptions, we next consider how players can practically (and more robustly) implement these abstract devices. In particular, can players achieve efficient equilibria without using the exact device profile in Theorem 1, which can only cooperate with itself? We now develop an implementation showing that this is possible, after providing some background.
Oesterheld [21] considers two computer programs playing a game. Each program can simulate the other. He constructs a program equilibrium — a pair of programs that form an equilibrium of this game — using “instantaneous tit-for-tat” strategies.
In the Prisoner’s Dilemma, the idea of these programs is as follows (a sketch is given below): they cooperate with each other and punish defection. Note that these programs are recursive, but guaranteed to terminate because of the probability that a program will output Cooperate unconditionally.
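As a rough illustration of the idea, here is a minimal Python sketch of an instantaneous tit-for-tat program: with a small probability it cooperates unconditionally (which grounds the recursion), and otherwise it plays whatever the opponent's program plays against it. This is our own sketch, not the original pseudocode, and all names are invented.

```python
import random

COOPERATE, DEFECT = "C", "D"

def tit_for_tat_program(opponent_program, epsilon=0.05):
    """Illustrative instantaneous tit-for-tat program for the Prisoner's Dilemma."""
    # With small probability, cooperate unconditionally; this guarantees that
    # the mutual recursion between two such programs terminates.
    if random.random() < epsilon:
        return COOPERATE
    # Otherwise, simulate the opponent's program playing against this program
    # and mirror its output.
    return (COOPERATE
            if opponent_program(tit_for_tat_program) == COOPERATE
            else DEFECT)

# Two copies playing each other cooperate with probability 1: the recursion
# eventually grounds in unconditional cooperation, which then propagates up.
# tit_for_tat_program(tit_for_tat_program)  # -> "C"
```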
We use this idea to implement conditional commitment and revelation devices. For us, "revealing private information and playing according to the target action profile" is analogous to cooperation in the construction of . We will first describe the appropriate class of programs for program games under private information. Then we develop our program, (where "SIR" stands for "strategic information revelation"), and show that it forms a -Bayesian Nash equilibrium of a program game. Pseudocode for is given in Algorithm 1.
Player 's strategy in the program game is a choice from the program space , a set of computable functions from to . A program returns either an action or a type revelation vector. Each program takes as input the players' program profile, the signal , and a boolean that equals 1 if the program's output is an action, and 0 otherwise.
For brevity, we write for a call to a program with the boolean set to , otherwise . Player 's action in the program game is a call to their program . (We refer to these initial program calls as the base calls to distinguish them from calls made by other programs.) Then, the ex post payoff of player in the program game is .
In addition to in the base game, we assume there is a randomization device on which programs can condition their outputs. Like Oesterheld [21], we will use programs that unconditionally terminate with some small probability. By using to correlate decisions to unconditionally terminate, our program profile will be able to terminate with probability 1, despite the exponentially increasing number of recursive program calls. In particular, reads the call stack of the players' program profile. At each depth level of recursion reached in the call stack, a variable is independently sampled from . Each program call at level can read off the values of and from . The index itself is not revealed, however, because programs that "know" they are being simulated would be able to defect in the base calls, while cooperating in simulations to deceive the other programs. To ensure that our programs terminate in play with a deviating program, will call truncated versions of its counterparts' revelation programs: For , let denote with immediate termination upon calling another program.
checks if all other players' programs reveal their types (line 8 of Algorithm 1). If so, either with a small probability it unconditionally cooperates (line 11) — i.e., plays its part of the target action profile — or it cooperates only when all other programs cooperate (line 15). Otherwise, it punishes (line 17). In turn, reveals its type unconditionally with probability (line 20). Otherwise, it reveals to a given player under two conditions (lines 25 and 28). First, player must reveal to the user. Second, they must play an action consistent with the desired equilibrium, i.e., cooperate when all players reveal their types, or punish otherwise. (See Figure 1.)
Unconditionally revealing one's type and playing the target action avoids an infinite regress. Crucially, these unconditional cooperation outputs are correlated through . Therefore, in a profile of copies of this program, either all copies unconditionally cooperate together, or none of them do so. Using this property, we can show (see proof of Theorem 2 in Appendix B) that a profile where all players use this program outputs the target action profile with certainty. If one player deviates, first, immediately punishes if that player does not reveal. If they do reveal, with some small probability the other players unconditionally cooperate, making this strategy slightly exploitable, but otherwise the deviator is punished. Even if deviation is punished, may unconditionally reveal. In our approach, this margin of exploitability is the price of implementing conditional commitment and revelation with programs that cooperate based on counterparts' outputs, rather than a strict matching of devices, without an infinite loop. Further, since a player is only able to unconditionally cooperate under incomplete information if they know all players' types, needs to prematurely terminate calls to programs that don't immediately unconditionally cooperate, but which may otherwise cause infinite recursion (line 4). This comes at the expense of some robustness: with low probability, it may punish players who would otherwise have cooperated.
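To summarize the decision logic just described (punish non-revealers; with small probability cooperate unconditionally; otherwise cooperate only if the other programs also play their part of the target profile), here is a deliberately simplified Python sketch. It omits the correlated grounding signal, recursion-depth bookkeeping, and truncated revelation calls of Algorithm 1, and all identifiers are invented.

```python
import random

def sir_like_action(other_programs, target_action, punish_action, epsilon=0.05):
    """Simplified sketch of the action-phase logic; not a faithful reproduction
    of Algorithm 1. `other_programs` maps player ids to objects exposing
    .reveals() -> bool and .action() -> action."""
    # 1. Punish if any other player's program refuses to reveal its type.
    if not all(p.reveals() for p in other_programs.values()):
        return punish_action
    # 2. With small probability, play the target action unconditionally
    #    (this is what grounds the recursion in the full construction).
    if random.random() < epsilon:
        return target_action
    # 3. Otherwise, play the target action only if every other program also
    #    plays its part of the (here, symmetric) target profile; else punish.
    if all(p.action() == target_action for p in other_programs.values()):
        return target_action
    return punish_action
```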
THEOREM 2. Consider the program game induced by a base game and the program spaces . Assume all strategies returned by these programs are computable. For type profile , let induce a feasible and INTIR payoff profile . Let be the minimax strategy if one player deviates, and arbitrary otherwise. Let be the maximum payoff achievable by any player in , and . Then the program profile given by Algorithm 1 (with ) for players , denoted , is a -Bayesian Nash equilibrium. That is, if players play this profile, and player plays a program that terminates with probability 1 given that any programs it calls terminate with probability 1, then:
PROOF SKETCH. We need to check (1) that the program profile terminates (a) with or (b) without a deviation, (2) that everyone plays the target action profile when no one deviates, and (3) that with high probability a deviation is punished. First suppose no one deviates. If for two levels of recursion in a row, the calls to and all unconditionally reveal (line 21 of ) and output the target action (line 6 of ), respectively. Because these unconditional cooperative outputs are correlated through , the probability that at each pair of subsequent levels in the call stack is a nonzero constant. Thus it is guaranteed to occur eventually and cause termination in finite time, satisfying (1b). Moreover, each call to or in previous levels of the stack sees that the next level cooperates, and thus cooperates as well, ensuring that the base calls all output the target action profile. This shows (2).
If, however, one player deviates, we use the same guarantee of a run of subsequent events to guarantee termination. First, all calls to non-deviating programs terminate, because any call to conditional on forces termination (line 4) of calls to other players' revelation programs. Thus the deviating programs also terminate, since they call terminating non-deviating programs. This establishes (1a). Finally, in the high-probability event that the first two levels of calls to do not unconditionally cooperate, punishes the deviator as long as they do not reveal their type and play their target action. The punishing players will know each other's types, since a call to is guaranteed by line 28 to reveal to anyone who also punishes or unconditionally cooperates in the next level. Condition (3) follows.
A practical obstacle to program equilibrium is demonstrating to one’s counterpart that one’s behavior is actually governed by the source code that has been shared. In our program game with private information, there is the additional problem that, as soon as one’s source code is shared, one’s counterpart may be able to read off one’s private information (without revealing their own). Addressing this in practice might involve modular architectures, where players could expose the code governing their strategy without exposing the code for their private information. Alternatively, consider AI agents that can place copies of themselves in a secure box, where the copies can inspect each other’s full code but cannot take any actions outside the box. These copies read each other’s commitment devices off of their source code, and report the action and type outputs of these devices to the original agents. If any copy within the box attempts to transmit information that another agent’s device refused to reveal, the box deletes its contents. This protocol does not require a mediator or arbitrator; the agents and their copies make all the relevant strategic decisions, with the box only serving as a security mechanism. Applications of secure multi-party computation to machine learning [12], or privacy-preserving smart contracts [13] — with the original agents treated as the “public” from whom code shared among the copies is kept private — might facilitate the implementation of our proposed commitment devices.
We have defined a new class of commitment games that allow revelation of private information conditioned on other players’ commitments. Our folk theorem shows that in these games, efficient payoffs are always attainable in equilibrium. Our examples, which draw on models of war and all-pay auctions, show how players with these capabilities can avoid welfare losses, while others (even with the ability to verifiably reveal private information) cannot. Finally, we have provided an implementation of this framework via robust program equilibrium, which can be used by computer programs that read each other’s source code.
While conceptually simple, satisfying these assumptions in practice requires a strong degree of mutual transparency and conditional commitment ability, which is not possessed by contemporary human institutions or AI systems. Thus, our framework represents an idealized standard for bargaining in the absence of a trusted third party, suggesting research priorities for the field of Cooperative AI [2]. The motivation for work on this standard is that AI agents with increasing economic capabilities, which would exemplify game-theoretic rationality to a stronger degree than humans, may be deployed in contexts where they make strategic decisions on behalf of human principals [6]. Given the potential for game-theoretically rational behavior to cause cooperation failures [4, 20], it is important that such agents are developed in ways that ensure they are able to cooperate effectively.
Commitment devices of this form would be particularly useful in cases where centralized institutions (Dafoe et al. [2], Section 4.4) for enforcing or incentivizing cooperation fail, or have not been constructed due to collective action problems. This is because our devices do not require a trusted third party, aside from correlation devices. A potential obstacle to the use of these commitment devices is lack of coordination in development of AI systems. This may lead to incompatibilities in commitment device implementation, such that one agent cannot confidently verify that another’s device meets its conditions for trustworthiness and hence type revelation. Given that commitments may be implicit in complex parametrizations of neural networks, it is not clear that independently trained agents will be able to understand each other’s commitments without deliberate coordination between developers. Our program equilibrium approach allows for the relaxation of the coordination requirements needed to implement conditional information revelation and commitment. Coordination on target action profiles for commitment devices or flexibility in selection of such profiles, in interactions with multiple efficient and arguably “fair” profiles [28], will also be important for avoiding cooperation failures due to equilibrium selection problems.
We thank Lewis Hammond for helpful comments on this paper and thank Caspar Oesterheld both for useful comments and for identifying an important error in an earlier version of one of our proofs.
[1] Andrew Critch. 2019. A parametric, resource-bounded generalization of Löb’s theorem, and a robust cooperation criterion for open-source game theory. The Journal of Symbolic Logic 84, 4 (2019), 1368–1381.
[2] Allan Dafoe, Edward Hughes, Yoram Bachrach, Tantum Collins, Kevin R. McKee, Joel Z. Leibo, Kate Larson, and Thore Graepel. 2020. Open Problems in Cooperative AI. arXiv:2012.08630 [cs.AI]
[3] Ronald A Dye. 1985. Disclosure of nonproprietary information. Journal of Accounting Research (1985), 123–145.
[4] James D Fearon. 1995. Rationalist explanations for war. International Organization 49, 3 (1995), 379–414.
[5] Françoise Forges. 2013. A folk theorem for Bayesian games with commitment. Games and Economic Behavior 78 (2013), 64–71. https://doi.org/10.1016/j.geb.2012.11.004
[6] Edward Geist and Andrew J. Lohn. 2018. How might artificial intelligence affect the risk of nuclear war? Rand Corporation.
[7] Sanford J Grossman. 1981. The informational role of warranties and private disclosure about product quality. The Journal of Law and Economics 24, 3 (1981), 461–483.
[8] Sanford J Grossman and Oliver D Hart. 1980. Disclosure laws and takeover bids. The Journal of Finance 35, 2 (1980), 323–334.
[9] Jeanne Hagenbach, Frédéric Koessler, and Eduardo Perez-Richet. 2014. Certifiable Pre-play Communication: Full Disclosure. Econometrica 82, 3 (2014), 1093–1131. http://www.jstor.org/stable/24029308
[10] Boyan Jovanovic. 1982. Truthful disclosure of information. The Bell Journal of Economics (1982), 36–44.
[11] Adam Tauman Kalai, Ehud Kalai, Ehud Lehrer, and Dov Samet. 2010. A commitment folk theorem. Games and Economic Behavior 69, 1 (2010), 127–137.
[12] Brian Knott, Shobha Venkataraman, Awni Hannun, Shubhabrata Sengupta, Mark Ibrahim, and Laurens van der Maaten. 2021. CrypTen: Secure Multi-Party Computation Meets Machine Learning. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (Eds.). https://openreview.net/forum?id=dwJyEMPZ04I
[13] Ahmed Kosba, Andrew Miller, Elaine Shi, Zikai Wen, and Charalampos Papamanthou. 2016. Hawk: The Blockchain Model of Cryptography and Privacy-Preserving Smart Contracts. In 2016 IEEE Symposium on Security and Privacy (SP). 839–858. https://doi.org/10.1109/SP.2016.55
[14] Dan Kovenock, Florian Morath, and Johannes Münster. 2015. Information sharing in contests. Journal of Economics & Management Strategy 24 (2015), 570–596. Issue 3.
[15] Patrick LaVictoire, Benja Fallenstein, Eliezer Yudkowsky, Mihaly Barasz, Paul Christiano, and Marcello Herreshoff. 2014. Program Equilibrium in the Prisoner’s Dilemma via Löb’s Theorem. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence.
[16] Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or No Deal? End-to-End Learning for Negotiation Dialogues. https://doi.org/10.48550/ARXIV.1706.05125
[17] Giorgio Martini. 2018. Multidimensional Disclosure. http://www.giorgiomartini.com/papers/multidimensional_disclosure.pdf
[18] Paul Milgrom and John Roberts. 1986. Relying on the information of interested parties. The RAND Journal of Economics (1986), 18–32.
[19] Paul R Milgrom. 1981. Good news and bad news: Representation theorems and applications. The Bell Journal of Economics (1981), 380–391.
[20] Roger B Myerson and Mark A Satterthwaite. 1983. Efficient mechanisms for bilateral trading. Journal of Economic Theory 29, 2 (1983), 265–281.
[21] Caspar Oesterheld. 2019. Robust program equilibrium. Theory and Decision 86, 1 (2019), 143–159.
[22] Caspar Oesterheld and Vincent Conitzer. 2021. Safe Pareto Improvements for Delegated Game Playing. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems. 983–991.
[23] Masahiro Okuno-Fujiwara, Andrew Postlewaite, and Kotaro Suzumura. 1990. Strategic Information Revelation. The Review of Economic Studies 57, 1 (1990), 25–47. http://www.jstor.org/stable/2297541
[24] Michael Peters and Balázs Szentes. 2012. Definable and Contractible Contracts. Econometrica 80 (2012), 363–411.
[25] Ariel Rubinstein. 1998. Modeling Bounded Rationality. The MIT Press.
[26] Hyun Song Shin. 1994. The burden of proof in a game of persuasion. Journal of Economic Theory 64, 1 (1994), 253–264.
[27] Branislav L Slantchev and Ahmer Tarar. 2011. Mutual optimism as a rationalist explanation of war. American Journal of Political Science 55, 1 (2011), 135–148.
[28] Julian Stastny, Maxime Riché, Alexander Lyzhov, Johannes Treutlein, Allan Dafoe, and Jesse Clifton. 2021. Normative Disagreement as a Challenge for Cooperative AI. arXiv:2111.13872 [cs.MA]
[29] Moshe Tennenholtz. 2004. Program equilibrium. Games and Economic Behavior 49, 2 (2004), 363–373.
[30] Hal R Varian. 2010. Computer mediated transactions. American Economic Review 100, 2 (2010), 1–10.
Consider the following game of strategic information revelation. We will show that in this game, there is a perfect Bayesian equilibrium that is inefficient, and there is an efficient payoff profile that is not IC. (That is, in this game, unconditional and partial type revelation and the framework of Forges [5] are not sufficient to achieve efficiency.) This example is inspired by the model in Martini [17].
Player 2 is a village that lives around the base of a treacherous mountain (i.e., along the left and bottom sides of ). Their warriors are camped somewhere on the mountain, with coordinates . Player 1 has no information on the warriors' location, hence the prior is . But they know that warriors at higher altitudes are tougher; strength is proportional to . As in Example 3.1, player 1 can offer a split of disputed territory. If the players fight, then player 1 will send in paratroopers at a location to fight player 2's warriors at a cost proportional to their strength . They want to get as close as possible to minimize exposure to the elements, consumption of rations, etc (i.e., minimize the squared distance ). Meanwhile, player 2 wants the paratroopers to land as far from their village as possible, i.e., they want to maximize . Player 2 wins the ensuing battle with probability equal to their army's strength, i.e., .
Formally, the game is as follows. Only player 2 has private information, . Player 2 has the unrestricted revelation action set . First, player 2 chooses . Then player 1 chooses . Player 2 can either accept or reject . If player 2 accepts, the pair of payoffs is . Otherwise, player 1 plays , and for :
Let for any function . Define:
Then, we claim:
PROPOSITION 3. Let . Let player 1's strategy be then conditional on a given if player 2 does not reveal their type, otherwise then . Let player 2's strategy be to reveal any and only types for which , and to accept any and only . Let player 1's belief update to conditional on player 2 not revealing their type, and to conditional on player 2 not revealing their type and rejecting .
Then these strategies and beliefs are a perfect Bayesian equilibrium. Further, there exist such that this equilibrium is inefficient, and the payoff profile is (1) a Pareto improvement on the equilibrium payoff and (2) not IC.
PROOF. We proceed by backward induction. If player 2 has not revealed their type and has rejected , then given beliefs , we solve for the optimal . Player 1's expected payoff is . The squared loss is minimized at . This is equivalent to the average of the centers of rectangles whose union composes the region (see Figure 2), weighted by the areas of these rectangles, which can be shown to be:
Thus is a best response. If player 2 has revealed their type and rejected , then player 1's payoff is maximized at .
Next, player 2's best response to any is to accept if and only if the acceptance payoff exceeds the rejection payoff given player 1's strategy, that is, .
Then, given beliefs for each , player 1's optimal if player 2 does not reveal is:
Given player 2's strategy,
Since is uniform on , this probability is given by the ratio of the areas of the regions and . Thus . We have:
It can be shown (Lemma 3) that and . Therefore is of the form given above.
If player 2 reveals, in the analysis above we now have:
Thus is optimal, since increases with up to , after which it drops to .
It can be shown that , and so and for any type. Given these responses, if player 2 does not reveal their type, their payoff is . If player 2 reveals their type, since we have shown that , player 2's payoff is , and so player 2 prefers to reveal if and only if .
Finally, by the above strategy for player 2's type revelation, if player 2 does not reveal, to be consistent player 1 must update to the uniform distribution on the region defined by . Thus is consistent. If player 2 also rejects , player 1 knows that , that is, . Thus the updated belief is uniform on , so is consistent. This proves that the proposed strategy profile and beliefs are a perfect Bayesian equilibrium.
Given , all player 2 types reject (offered if player 2 does not reveal) in equilibrium, since . The equilibrium payoffs for any types that do not reveal are:
Consider the payoff profile
induced by the strategy profile in which player 1 offers and player 2 accepts any . This is feasible because player 2 only reveals if , and , so . For this to be a Pareto improvement on the perfect Bayesian equilibrium, it is sufficient that . This payoff profile is not IC, because player 2's payoff increases with , so any player 2 for which can profit from the strategy profile above conditioned on a type such that .
LEMMA 3. Given as defined above, and .
PROOF. We have:
We showed above that Further,
so:
Fix the programs of players as . Suppose player uses . Given this assumption, we omit the subscripts of and . Let and respectively denote calls to and made at level . If and for some reached in the call stack, then every call to and immediately returns . Consequently, every call to , which must be a parent call to , returns because line 5 in evaluates to . (Notice that the shared random variables are essential: if the programs unconditionally cooperated using independently sampled variables, an exponentially increasing number of variables would each need to be less than for all calls at a given level to return the cooperative output.) Let be the event that and , and be the event that and . Thus for the program profile to terminate in finite time, it is sufficient to show that with probability 1 there exists a finite such that holds. Given that for are independent, because they do not overlap, we have:
Since is the complement of the event we wanted to guarantee, this proves termination with probability 1. Further, the event is sufficient for every call of and to return and , respectively, and this holds for all levels less than . Therefore all base calls of the programs in the proposed profile return the corresponding with probability 1.
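To make the role of the shared random variables concrete, here is a small Monte Carlo sketch contrasting the two sampling schemes. The unconditional-cooperation probability (written eps below) and the branching factor b are illustrative stand-ins rather than the paper's actual parameters.

```python
import random

def depth_until_termination_shared(eps, rng):
    """Recursion levels until two consecutive levels share a draw below eps.

    With shared random variables, every call at level k sees the same uniform
    draw u_k, so the event {u_k < eps and u_{k+1} < eps} has probability eps**2
    at each pair of levels, and the first success arrives in finite time.
    """
    depth, prev_below = 0, False
    while True:
        below = rng.random() < eps
        depth += 1
        if below and prev_below:
            return depth
        prev_below = below

rng = random.Random(0)
eps = 0.05
samples = [depth_until_termination_shared(eps, rng) for _ in range(20_000)]
print("mean depth with shared draws:", sum(samples) / len(samples))  # roughly 1/eps**2

# With independently sampled variables, a call tree of branching factor b has about
# b**k calls at level k, and all of them must draw below eps for that level to give
# the cooperative output. The chance this ever happens can stay far below 1:
b = 2
p_never = 1.0
for k in range(1, 30):
    p_never *= 1.0 - eps ** (b ** k)
print("P(cooperative output never triggered) with independent draws:", round(p_never, 4))
```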
Now suppose player uses . Let be the smallest finite level such that , , and (which exists with probability 1 by a similar argument to that above). Then all and for return . Further, every for calls the truncated programs for , guaranteed to terminate by definition, thus terminates with either or . But because also guarantees that terminates, all calls to the programs of made by player 's programs terminate. Thus all base calls of programs in this profile with one deviation terminate with probability 1.
We now consider the possible cases. Suppose and . First, note that any players using know each other's types. To see this, note that all calls to for return . So any call to for will reveal to player if either or . The second condition is satisfied by assumption. Inductively applying this argument for , note that if , we will only have if (satisfying line 11 of ), but then this is sufficient to have return (line 21). If player does not reveal their type, all players return . Otherwise, proceeds to line 13 for all players . If player plays , then all other players also play , giving the target payoff profile. Otherwise, all players return . We therefore have that with probability at least , all players use whenever the outputs of do not match those of . Hence:
The post Commitment games with conditional information revelation appeared first on Center on Long-Term Risk.
The post Summer Research Fellowship appeared first on Center on Long-Term Risk.
For information about the fellowship and how to apply, see here.
Once a year, we run a two to three month summer research fellowship at our office in London. It usually takes place somewhere between the months of June and October. Applications for our 2023 Fellowship are now closed, but you can find the job description archived here. We are likely to open applications for our 2024 Fellowship in the first quarter of 2024. (If you'd like to be notified when this happens, please subscribe to our newsletter in the bottom-left corner of the website footer.)
Fellows have the opportunity to work on challenging research questions relevant to reducing suffering in the long-term future, whilst supervised by a researcher at CLR.
The main purpose of the fellowship is to support fellows in their career development. Fellows can learn more about s-risks, test their fit for research roles, and improve relevant skills. While not the main goal, research contributions may also influence our strategic direction, grantmaking, and other activities.
Participants become part of our team of intellectually curious, hard-working, and caring people, all of whom share a profound drive to make the biggest difference they can. In the past, some participants have continued their work as full-time members of our team or grantees of the CLR Fund. Over the last two years, most researchers we have hired had previously participated in the summer research fellowship.
In the past, fellows have often fit into one of the following categories:
There might be many other good reasons for participating in the fellowship. In general, we encourage you to apply if you think you would benefit from the program, even if your reasons are not listed above.
We work out with all incoming fellows how to make the fellowship as valuable as possible, given their individual strengths and needs. Often, this means focusing on learning and experimenting, rather than producing polished research output. In some cases, past fellows only started to work on what ended up as their main project more than a month into the fellowship.
"The fellowship was a great opportunity to explore new topics and pursue research threads that I wouldn’t have had spare capacity for during my PhD."
After his fellowship, Lewis continued his DPhil in computer science at the University of London and started working part-time for the Cooperative AI Foundation.
"I really loved the fellowship! I got to work on a really interesting and engaging project, and had amazing support from my supervisor. It got me very excited about potentially doing research long term, and overall made me feel much more confident in my ability to do so. The CLR office was also such a great place to work."
After her fellowship, Julia split her time working on community building and doing self-study on a grant from the CLR Fund.
"The summer fellowship at CLR was really valuable for me for two main reasons: 1) The fellowship is flexible and can fit a number of different people; for me, this meant that I had the freedom to pursue my own interests as part of my PhD. 2) Being immersed in a group of intelligent and diverse people was interesting, motivating, and fun! Due to the fellowship, I feel that I grew as a researcher, became more connected to the EA and AI safety communities, and made some friends. I really recommend the SRF at CLR."
After his fellowship, Rhys continued his PhD in Safe & Trusted AI at Imperial College London.
"I had a great experience on the SRF, and it helped me figure out what kinds of research I liked, as well as what sort of work I would like to do in the future."
After her fellowship, Megan received a grant from the CLR Fund for self-study.
"It was a great experience, I learnt a lot."
After his fellowship, Nicolas accepted an offer as a full-time Research Analyst at CLR.
The projects below were not necessarily published during the fellowship, but in each case the fellow started working on the project during their fellowship.
If you have any questions about the summer fellowship, please contact us at info@longtermrisk.org.
The post Summer Research Fellowship appeared first on Center on Long-Term Risk.
The post Replicating and extending the grabby aliens model appeared first on Center on Long-Term Risk.
This report is the most comprehensive model to date of aliens and the Fermi paradox. In particular, it builds on Hanson et al. (2021) and Olson (2015) and focuses on the expansion of ‘grabby’ civilizations: civilizations that expand at relativistic speeds and make visible changes to the volume they control.
This report considers multiple anthropic theories: the self-indication assumption (SIA), as applied previously by Olson & Ord (2022); the self-sampling assumption (SSA), implicitly used by Hanson et al. (2021); and a decision-theoretic approach, as applied previously by Finnveden (2019).
In Chapter 1, I model the appearance of intelligent civilizations (ICs) like our own. In Chapter 2, I consider how grabby civilizations (GCs) modify the number and timing of intelligent civilizations that appear.
In Chapter 3 I run Bayesian updates for each of the above anthropic theories. I update on the evidence that we are in an advanced civilization, have arrived roughly 4.5 Gy into the planet’s roughly 5.5 Gy habitable duration, and do not observe any GCs.
In Chapter 4 I discuss potential implications of the results, particularly for altruists hoping to improve the far future.
Starting from a prior similar to Sandberg et al.’s (2018) literature-synthesis prior, I conclude the following:
Using SIA or applying a non-causal decision theoretic approach (such as anthropic decision theory) with total utilitarianism, one should be almost certain that there will be many GCs in our future light cone.
Using SSA1, or applying a non-causal decision theoretic approach with average utilitarianism, one should be confident (~85%) that GCs are not in our future light cone, thus rejecting the result of Hanson et al. (2021). However, this update is highly dependent on one’s beliefs in the habitability of planets around stars that live longer than the Sun: if one is certain that such planets can support advanced life, then one should conclude that GCs are most likely in our future light cone. Further, I explore how an average utilitarian may wager there are GCs in their future light cone if they expect significant trade with other GCs to be possible.
These results also follow when taking (log)uniform priors over all the model parameters.
All figures and results are reproducible here.
To set the scene, I start with two vignettes of the future. This section can be skipped, and features terms I first explain in Chapters 1 and 2.
In a Monte Carlo simulation of draws, the world described below gives the highest likelihood for both SIA and SSA (with reference class of observers in intelligent civilizations). That is, civilizations like ours are both relatively common and typical amongst all advanced-but-not-yet-expansive civilizations in this world.
In this world, life is relatively hard. There are five hard try-try steps of mean completion time 75 Gy, as well as 1.5 Gy of easy ‘delay’ steps. Planets around red dwarfs are not habitable, and the universe became habitable relatively late -- intelligent civilizations can only emerge from around 8 Gy after the Big Bang. Around 0.3% of terrestrial planets around G-stars like our own are potentially habitable, making Earth not particularly rare.
Around 2.5% of intelligent civilizations like our own become grabby civilizations (GCs). This is the SIA Doomsday argument in action.
Around 7,000 GCs appear per observable universe-sized volume (OUSV). GCs already control around 22% of the observable universe, and as they travel at , their light has reached around 35% of the observable universe. Nearly all GCs appear between 10Gy and 18 Gy after the Big Bang.
If humanity becomes a GC, it will be slightly smaller than a typical GC - around 62% of GCs will be bigger. A GC emerging from Earth would in expectation control around 0.1% of the future light cone and almost certainly contain the entire Laniakea Supercluster, itself containing at least 100,000 galaxies.
The median time by which GCs will be visible to observers on Earth is around 1.5 Gy from now. It is practically certain humanity will not see any GCs any time soon: there is roughly 0.000005% probability (one in twenty million) that light from GCs reaches us in the next one hundred years2. GCs will certainly be visible from Earth in around 4 Gy.
As we will see, SIA is highly confident in a future similar to this one. SSA (with the reference class of observers in intelligent civilizations), on the other hand, puts greater posterior credence on human civilization being alone, even though worlds like these have high likelihood.
This world is one that a total utilitarian using anthropic decision theory would wager they are in, if they thought their decisions can influence the value of the future in proportion to the resources that an Earth-originating GC controls.
In this world, there are eight hard steps, with mean hardness 23 Gy and delay steps totaling 1.8 Gy. Planets capable of supporting advanced life are not too rare: around 0.004% of terrestrial planets are potentially habitable. Again, planets around longer-lived stars are not habitable.
Around 90% of ICs become GCs, and there are roughly 150 GCs that appear per observable universe sized volume. GCs expand at 0.85c, and a GC emerging from Earth would reach 31% of our future light cone, around 49% of its maximum volume, and would be bigger than ~80% of all GCs. Since there are so few GCs, the median time by which a GC is visible on Earth is not for another 20 Gy.
I use the term intelligent civilizations (ICs) to describe civilizations at least as technologically advanced as our own.
In this chapter, I derive a distribution of the arrival times of ICs, . This distribution is dependent on factors such as the difficulty of the evolution of life and the number of planets capable of supporting intelligent life. This distribution does not factor in the expansion of other ICs, which may prevent (‘preclude’) later ICs from existing. That is the focus of Chapter 2.
The distribution gives the number of other ICs that arrive at the same time as human civilization, as well as the typicality of the arrival time of human civilization, assuming no ICs preclude any other.
I write for the time since the Big Bang, which is estimated at 13.787 Gy (Ade 2016) [Gy = gigayear = 1 billion years].
Current observations suggest the universe is most likely flat (the sum of angles in a triangle is always 180°), or close to flat, and so the universe is either large or infinite. Further, the universe appears to be on average isotropic (there are no special directions in the universe) and homogeneous (there are no special places in the universe) (Saadeh et al. 2016, Maartens 2011).
The large or infinite size implies that there are volumes of the universe causally disconnected from our own. The collection of ‘parallel’ universes has been called the “Level I multiverse”. Assuming the universe is flat, Tegmark (2007) conservatively estimates that there is a Hubble volume identical to ours away, and an identical copy of you away.
I consider a large finite volume (LFV) of the level I multiverse, and partition this LFV into observable universe size (spherical) volumes (OUSVs)3. My model uses quantities as averages per OUSV. For example, will be the rate of ICs arriving per OUSV on average at time .
The (currently) observable universe necessarily defines the limit of what we can currently know, but not what we can eventually know. The eventually observable universe has a volume around 2.5 times that of the volume of the currently observable universe (Ord 2021).
The most action relevant volume for statistics about the number of alien civilizations is the affectable universe, the region of the universe that we can causally affect. This is around 4.5% of the volume of the observable universe. I will use the term affectable universe size volumes (AUSVs).
For an excellent discussion on this topic, I recommend Ord (2021).
I consider the path to an IC as made up of a number of steps:
I recommend Eth (2021) for an excellent introduction to try-try steps.
Abiogenesis is the process by which life has arisen from non-living matter. This process may require some extremely rare configuration of molecules coming together, such that one can model the process as having some rate 1/a of success per unit time on an Earth-sized planet.
The completion time of such a try-try step is exponentially distributed with PDF . Fixing some time , such as Earth’s habitable duration, the step is said to be hard if . When the step is hard, for , is constant since .
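As a quick numerical illustration of this hardness condition (the step hardness a and window t_H below are illustrative values, not taken from the report):

```python
import numpy as np

a = 75.0    # expected completion time of the step, in Gy (illustrative)
t_H = 4.5   # window within which the step must complete, in Gy (illustrative)

rng = np.random.default_rng(0)
draws = rng.exponential(a, size=2_000_000)
conditioned = draws[draws < t_H]          # completion times, given success within t_H

# For a hard step (a >> t_H) the conditional density is nearly flat on [0, t_H]:
hist, _ = np.histogram(conditioned, bins=5, range=(0.0, t_H), density=True)
print(np.round(hist, 3))                  # every bin is close to 1 / t_H ≈ 0.222
```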
Abiogenesis is one of many try-try steps that have led to human civilization. If there are try-try steps with expected times of completion, the completion time of the steps has hypoexponential distribution with parameter . For modeling purposes, I split these try-try steps into delay steps and hard steps.
I define the delay steps to be the maximal set of individual steps from the steps such that , the approximate duration life has taken on Earth so far. I then approximate the completion time of the delay try-try steps with the exponential distribution with parameter . If they exist, I also include any fuse steps4 in the sum of .
I write for the expected completion times of the remaining steps. These steps are not necessarily hard with respect to Earth's habitable duration. I model each to have log-uniform uncertainty between 1 Gy and Gy. With this prior, most are much greater than 5 Gy and so hard. I approximate the completion time of all of these steps with the Gamma distribution parameters and , the geometric mean hardness of the try-try steps.5 The Gamma distribution can further be described as a ‘power law’ as I discuss in the appendix.
I write for the PDF of the completion time of all the delay steps and hard try-try steps. Strictly, it is given as the convolution of the gamma distribution parameters , and exponential distribution parameter . When , where is the PDF of the Gamma distribution. That is, the delay steps can be approximated as completing in their expected time when they are sufficiently short in expectation.
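A short sketch of this approximation, again with illustrative values for the number of hard steps and their geometric-mean hardness:

```python
from scipy import stats

n, abar, t_H = 5, 75.0, 4.5   # illustrative: 5 hard steps of mean hardness 75 Gy, 4.5 Gy window
gamma = stats.gamma(a=n, scale=abar)

# Conditional on all n steps finishing within the habitable window t_H, the
# completion-time CDF is close to the "power law" (t / t_H)**n.
for t in (1.0, 2.0, 3.0, 4.0):
    conditional_cdf = gamma.cdf(t) / gamma.cdf(t_H)
    print(f"t={t}: conditional CDF {conditional_cdf:.3f} vs (t/t_H)**n {(t / t_H) ** n:.3f}")
```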
Priors on
After introducing each model parameter, I introduce my priors. Crucially, all the results in Chapter 3 roughly follow when taking (log)uniform priors over all parameters and so my particular prior choices are not too important.
I consider three priors on , the number of hard try-try steps. The first, which I call balanced, is chosen to give an implied prior number of ICs similar to existing literature estimates (discussed later in this chapter). My bullish prior puts greater probability mass on fewer hard steps and so implies a greater number of ICs. My bearish prior puts greater probability mass in many hard steps and so predicts fewer ICs.
My priors on are uninformed by the timing of life on Earth, but weakly informed by discussion of the difficulty of particular steps that have led to human civilization. For example, Sandberg et al. (2018) (supplement I) consider the difficulty of abiogenesis. In Chapter 3 I update on the time that all the steps are completed (i.e., now). I do not update on the timing of the completion of any potential intermediate hard steps, such as the timing of abiogenesis. Further, I do not update on the habitable time remaining, which is implicitly an anthropic update. I discuss this in the appendix.
Prior on
Given these priors on , I derive my prior on by the geometric mean of draws from the above-mentioned . I chose this prior to later give estimates of life in line with existing estimates. A longer tailed distribution is arguably more applicable.
Prior on
My prior on the sum of the delay and fuse steps has . By definition and smaller than makes little difference. My prior distribution gives median . The delay parameter can also include the delay time between a planet's formation and the first time it is habitable. On Earth, this duration could have been up to 0.6 Gy (Pearce et al. (2018)).
I also model “try-once” steps, those that either pass or fail with some probability. The Rare Earth hypothesis is an example of a try-once step. The possibility of try-once steps allows one to reject the existence of hard try-try steps while still positing very hard try-once steps.
I write for the probability of passing through all try-once steps. That is, if there are try-once steps then
The prior could arguably have a longer tail, and is loosely informed by discussion of potential Rare Earth factors here.
The parameters above give a distribution of appearance times of an IC on a given planet. In this section, I consider the maximum duration planets can be habitable for, the number of potentially habitable planets, and the formation of stars around which habitable planets can appear.
I write 6 for the maximum duration any planet is habitable for.7 The Earth has been habitable for between 3.9 Gy and 4.5 Gy (Pearce 2018) and is expected to be habitable for another ~1 Gy, so as a lower bound ⪆ 5 Gy. Our Sun, a G-type main-sequence star, formed around 4.6 Gy ago and is expected to live for another ~5 Gy.
Lower mass stars, such as K-type stars (orange dwarfs), have lifetimes between 15 and 30 Gy, and M-type stars (red dwarfs) have lifetimes up to 20,000 Gy. These lifetimes give an upper bound on the habitable duration of planets in that star’s system, so I consider up to around 20,000 Gy.
The habitability of these longer-lived stars is uncertain. Since red dwarf stars are dimmer (which results in their longer lives), habitable planets around red dwarf stars must be closer to the star in order to have liquid water, which may be necessary for life. However, planets closer to their star are more likely to be tidally locked. Gale (2017) notes that “This was thought to cause an erratic climate and expose life forms to flares of ionizing electromagnetic radiation and charged particles.” but concludes that in spite of the challenges, “Oxygenic Photosynthesis and perhaps complex life on planets orbiting Red Dwarf stars may be possible”.
This approach to modeling does not allow for planets around red dwarf stars that are habitable for periods equal to the habitable period of Earth. For example, life may only be able to appear in a crucial window in a planet’s lifespan.
Given a value of , I now consider the number of habitable planets. To derive an estimate of the number of potentially habitable planets, I only consider the number of terrestrial planets: planets composed of silicate rocks and metals with a solid surface. Recall that the parameter w can indirectly control the number of these that are actually habitable.
Zackrisson et al. (2016) estimate terrestrial planets around FGK stars and around M stars in the observable universe. Interpolating, I set the total number of terrestrial planets around stars that last up to per OUSV to be
Hanson et al. (2021) approximate the cumulative distribution of planet lifetimes with for and for . The fraction of planets formed at time habitable at time t is then given by .
These forms of and satisfy the property that for any, the expression - the number of planets per OUSV habitable for between and Gy - is independent of . In particular, the number of planets habitable for the same duration as Earth is independent of .
This is implicitly used later in the update: one does not need to explicitly condition on the observation that we are on a planet with habitable for ~5 Gy since the number of planets habitable for ~5 Gy is independent of the model parameters.
I use the term “habitable stars” to mean stars with solar systems capable of supporting life.
I follow Hanson et al. (2021) in approximating the habitable star formation rate with the functional form with power and decay where .
There is debate over the time the universe was first habitable.
Loeb (2016) argues for the universe being habitable as early as 10 My. There is discussion around how much gamma-ray bursts (GRBs) in the early universe prevent the emergence of advanced life. Piran (2014) conclude that the universe was inhospitable to intelligent life > 5 Gy ago. Sloan et al. (2017) are more optimistic and conclude that life could continue below the ground or under an ocean.
I introduce an early universe habitability parameter and function which gives the fraction of habitable planets capable of hosting advanced life at time relative to the fraction at . I take to be a sigmoid function with and (hence ). My prior on is log-uniform on (, 0.99).
A more sophisticated approach would consider the interaction between and the hard try-try steps, as suggested by Hanson et al. (2021).
The number of terrestrial planets per OUSV habitable at time is
Since for , the lower bound of the integral can be changed to .
Putting the previous sections together, the appearance rate of ICs per OUSV, , is given by
To recap:
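As a rough illustration of how these pieces combine, here is a schematic sketch of the appearance rate. Every functional form and constant below is a placeholder standing in for the report's components (planet formation history, early-universe habitability, the try-once probability, and the delay-plus-hard-step completion density), so only the overall structure, not the numbers, is meaningful.

```python
import numpy as np
from scipy import stats

n_hard, abar, delay = 5, 75.0, 1.5   # hard steps, their mean hardness (Gy), total delay time (Gy)
p_once = 1e-3                        # combined try-once pass probability (placeholder)
hard_steps = stats.gamma(a=n_hard, scale=abar)

def planet_formation_rate(t_f):
    """Placeholder habitable-planet formation rate, peaking a few Gy after the Big Bang."""
    return t_f ** 2 * np.exp(-t_f / 4.0)

def early_habitability(t_f):
    """Placeholder sigmoid: the early universe is assumed less hospitable to advanced life."""
    return 1.0 / (1.0 + np.exp(-(t_f - 5.0)))

def ic_rate(t, n_grid=2000):
    """Schematic rate of ICs appearing at time t (Gy), integrated over planet formation times."""
    t_f = np.linspace(0.0, t, n_grid)
    integrand = (planet_formation_rate(t_f)
                 * early_habitability(t_f)
                 * hard_steps.pdf(np.maximum(t - t_f - delay, 0.0)))
    return p_once * np.sum(integrand) * (t_f[1] - t_f[0])

for t in (8.0, 13.8, 30.0, 60.0):
    print(t, ic_rate(t))   # only the relative shape over time is meaningful here
```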
I now discuss two potential puzzles related to : Did humanity arrive at an unusually early time? And, where are all the aliens?
Depending on one’s choice of anthropic theory, one may update towards hypotheses where human civilization is more typical among the reference class of all ICs.
Here, I look at human civilization’s typicality using two pieces of data: human civilization’s arrival at and the fact that we have appeared on a planet habitable for ~5 Gy.
An atypical arrival time?
I write for the arrival time distribution normalised to be a probability density function. This tells us how typical human civilization’s arrival time is. That is, is the probability density of a randomly chosen (eventually) existing IC to have arrived at .
When planets are habitable for a longer duration, a greater fraction of life appears later. Further, when is greater, fewer ICs appear overall since life is harder, but a greater fraction of ICs appear later in their planets’ habitable windows – this is the power law of the hard steps.
An atypical solar system?
There are many more terrestrial planets around red dwarf stars than stars like our own. If these systems are habitable, then human civilization is additionally atypical (with respect to all ICs) in its appearance around a star like our sun. Further, life has a longer time to evolve around a longer lived star, so human civilization would be even more atypical. Haqq-Misra et al. (2018) discuss this but do not consider that the presence of hard try-try steps leads to a greater fraction of ICs appearing on longer-lived planets.
Resolving the paradox
Suppose a priori one believes and and uses an anthropic theory that updates towards hypotheses where human civilization is more typical among all ICs. Given these assumptions, one expects the vast majority of ICs to appear much further into the future and on planets around red dwarf stars. However, human civilization arrived relatively shortly after the universe first became habitable, on a planet that is habitable for only a relatively short duration, and is thus very atypical (according to our arrival time function, which does not factor in the preclusion of ICs by other ICs).
There are multiple approaches to resolving this apparent paradox.
First, one can reject their prior belief in high and , and update towards small and which lead us to believe we are in a more typical IC.
Second, one could change the reference class among which human civilization’s typicality is being considered. This, in effect, is changing the question being asked.8
Third and finally, one can prefer theories that set a deadline on the appearance of ICs like us. If the universe suddenly ended in 5 Gy, no more ICs could appear and, regardless of and , human civilization’s arrival time would be typical.
Hanson et al. (2021) resolve the paradox with such a deadline, the expansion of so-called grabby civilizations, which is the focus of Chapter 2. Alternative deadlines have been suggested, such as through false vacuum decay, which I briefly discuss in the appendix.
Some anthropic theories update towards hypotheses where there are a greater number of civilizations that make the same observations we do (containing observers like us).
The rate of XICs
I write for the rate of ICs per OUSV with feature where denotes “ICs arriving at now on a planet that has been habitable for as long as Earth has, and will be habitable for the same duration as Earth will be”.
The Earth has been habitable for between 4.5 Gy and 3.9 Gy (Pearce et al. 2018). I suppose that Earth has been habitable for 4.5 Gy, since if habitable for just 3.9 Gy, the 600 My difference can be (lazily) modeled as a fuse or delay step. Assuming for the time being that no IC precludes any other, this gives
Note that
Below, I vary and to see the effect on . The effect of on is linear, so uninteresting.
The term does not include the further feature of not observing any alien life. In the next chapter, I introduce the number of ICs with feature that also do not observe any alien life.
Where are all the aliens?
I write for the rate of ICs that appear per OUSV, supposing no IC precludes any other, which is given by .
My priors on , , , , and give the rate of ICs that appear per OUSV, supposing no IC precludes any other.
I chose the balanced prior on and prior on hard step hardness to give an implied distribution on comparable to the prior derived by Sandberg et al. (2018), which models the scientific uncertainties on the parameters of the Drake Equation. Sandberg et al.’s prior on the number of currently contactable ICs has a median of 0.3 and 38% credence in fewer than one IC currently existing in the observable universe. My balanced prior gives ~50% credence to a rate of less than one IC per OUSV and a median of ~1 IC appearing per OUSV, and so is more conservative.
The Fermi observation is the fact that we have not observed any alien life. For those with a high prior on the existence of alien life, such as my bullish prior, the Fermi paradox is the conflict between this high prior and the Fermi observation.
It may be hard for humanity to observe a typical IC, especially if they do not last long or emit enough electromagnetic radiation to be identified at large distances. If some fraction of ICs persist for a long time, expand at relativistic speeds, and make visible changes to their volumes, one can more easily update on the Fermi observation. Such ICs are called grabby civilizations (GCs).
The existence of sufficiently many GCs can ‘solve’ the earliness paradox by setting a deadline by which ICs must arrive, thus making ICs like us more typical in human civilization’s arrival time.
In this chapter, I derive an expression for , the rate of ICs per OUSV that have arrived at the same time as human civilization on a planet habitable for the same duration and do not observe any GCs.
Humanity has not observed any intelligent life. In particular, we have not observed any GCs.
Whether GCs are absent from our past light cone or we simply have not seen them yet is uncertain. GCs may deliberately hide or be hard to observe with humanity’s current technology.
It seems clearer that humanity is not inside a GC volume, and at minimum we can condition on this observation.9
In Chapter 3 I compute two distinct updates: one conditioning on the observation that there are no GCs in our past light cone, and one conditioning on the weaker observation that we are not inside a GC volume. If GCs prevent any ICs from existing in their volume, this latter observation is equivalent to the statement that “we exist in an IC”.
The second observation leaves ‘less room’ for GCs, since we are conditioning on a larger volume not containing any GCs.
I lean towards there being no GCs in our past light cone. By considering the waste heat that would be produced by Type III Kardashev civilizations (a civilization using all the starlight of its home galaxy), the G-survey found no Type III Kardashev civilizations using more than 85% of the starlight in the 10⁵ galaxies surveyed (Griffith et al. 2015). There is further discussion on the ability to observe distant expansive civilizations in this LessWrong thread.
I write for the average fraction of ICs that become GCs.10 I assume that this happens in an astronomically short duration and as such can approximate the distribution of arrival time of GCs as equal to the distribution of arrival times of ICs. That is, the arrival time distribution of GCs is given by .
It seems plausible a significant fraction of ICs will choose to become GCs. Since matter and energy are likely to be instrumentally useful to most ICs, expanding to control as much volume as they can (thus becoming a GC) is likely to be desirable to many ICs with diverse aims. Omohundro (2008) discusses instrumental goals of AI systems, which I expect will be similar to the goals of GCs (run by AI systems or otherwise).
Some ICs may go extinct before being able to become a GC. The extinction of an IC does not entail that no GC emerges. For example, an unaligned artificial intelligence may destroy its origin IC but become a GC itself (Russell 2021). ICs that trigger a (false) vacuum decay that expands at relativistic speeds can also be modeled as GCs.
I do not update on the fact we have not observed any ICs. The smaller , the greater the importance of the evidence that we have not seen any ICs.
I model GCs as all expanding spherically at some constant comoving speed .
To calculate the volume of an expanding GC, one must factor in the expansion of the universe.
Solving the Friedmann equation gives the cosmic scale factor , a function that describes the expansion of the universe over time.
With initial condition and , , and given by Ade et al. (2016). The Friedmann equation assumes the universe is homogeneous and isotropic, as discussed in Chapter 1.
Throughout, I use comoving distances, which give a distance that does not change over time due to the expansion of space. The comoving distance a probe travelling at speed that left at time reaches by time is . The comoving volume of a GC at time that has been growing at speed since time is
I take in units of fraction of the volume of an OUSV, approximately .
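A minimal sketch of this calculation for a flat matter-plus-Lambda cosmology. The parameter values are approximate Planck-era figures used purely for illustration, and the normalisation here (comoving distances in present-day Gly) is a choice made for this sketch rather than the report's convention.

```python
import numpy as np
from scipy.integrate import solve_ivp

H0 = 0.0693                      # Hubble constant in 1/Gy (about 67.7 km/s/Mpc)
omega_m, omega_l = 0.31, 0.69    # matter and dark-energy densities (approximate)
t_now = 13.8                     # Gy after the Big Bang

def friedmann(t, a):
    """da/dt from the flat Friedmann equation, radiation neglected."""
    return H0 * a * np.sqrt(omega_m / a ** 3 + omega_l)

t_grid = np.linspace(0.05, 60.0, 6000)
sol = solve_ivp(friedmann, (t_grid[0], t_grid[-1]), [1e-3], t_eval=t_grid, rtol=1e-8)
a_of_t = sol.y[0] / np.interp(t_now, t_grid, sol.y[0])   # normalise so a(t_now) = 1

def comoving_distance(v, t_start, t_end):
    """Comoving distance (Gly) reached by a probe of speed v (fraction of c) leaving at t_start."""
    mask = (t_grid >= t_start) & (t_grid <= t_end)
    return np.sum(v * 1.0 / a_of_t[mask]) * (t_grid[1] - t_grid[0])   # c = 1 Gly per Gy

# Comoving radius and volume of a civilization expanding at 0.8c for 10 Gy from now:
r = comoving_distance(0.8, t_now, t_now + 10.0)
print(round(r, 2), "Gly ->", round(4.0 / 3.0 * np.pi * r ** 3, 1), "Gly^3 comoving")
```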
Supposing humanity expands at , delaying colonization by 100 years results in about a 0.0000019% loss of volume. Due to the clumping of stars in galaxies and galaxies in clusters, it’s possible this results in no loss of useful volume.
Following Olson (2015) I write for the average fraction of OUSVs unsaturated by GCs at time and take functional form
Recall that the product is the rate of GCs appearing per OUSV at time . Since is a function of the parameters , , , , and , the function is too.
This functional form for assumes that when GCs bump into other GCs, they do not speed up their expansion in other directions.
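One common way to write such a form is the exponential "independent nucleation" expression used in Olson-style models; whether this matches the report's exact functional form is an assumption here, and the GC birth rate and per-GC volume below are placeholders.

```python
import numpy as np

t_grid = np.linspace(0.0, 60.0, 2400)
dt = t_grid[1] - t_grid[0]

def gc_birth_rate(t):
    """Placeholder rate of GCs appearing per OUSV per Gy."""
    return 0.5 * np.exp(-((t - 15.0) ** 2) / 18.0)

def gc_volume(t_birth, t):
    """Placeholder comoving volume (fraction of an OUSV) of a GC born at t_birth."""
    radius = 0.01 * np.maximum(t - t_birth, 0.0)     # crude constant comoving speed
    return (4.0 / 3.0) * np.pi * radius ** 3

def unsaturated_fraction(t):
    """exp(-expected 'extended' GC volume per OUSV) at time t."""
    births = t_grid[t_grid < t]
    return np.exp(-np.sum(gc_birth_rate(births) * gc_volume(births, t)) * dt)

for t in (15.0, 25.0, 40.0, 60.0):
    print(t, round(unsaturated_fraction(t), 3))
```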
The actual volume of a GC
I write for the expected actual volume of a GC at time that began expanding at time at speed . Trivially, since GCs that prevent expansion can only decrease the actual volume. If GCs are sufficiently rare, then . I derive an approximation for in the appendix.
Later, I use the actual volume of a GC as a proxy for the total resources it contains. On a sufficiently large scale, mass (consisting of intergalactic gas, stars, and interstellar clouds) is homogeneously distributed within the universe. This proxy most likely underweights the resources of later arriving GCs due to the gravitational binding of galaxies and galaxy-clusters.
A new arrival time distribution
The distribution of IC arrival times,, can be adjusted to account for the expansion of GCs, which preclude ICs from arriving. I define that gives the rate of ICs appearing per OUSV, and write for the number of ICs that actually appear per OUSV.
The actual number of XICs
I define to be the actual number of ICs with feature X to appear, accounting for the expansion of GCs. I consider two variants of this term.
I write for the rate of ICs with feature X per OUSV that do not observe GCs. Since information about GCs travels at the speed of light, gives the fraction of OUSVs that is unsaturated by light from GCs at time . Then, gives the number of XICs per OUSV with no GCs in their past light cone.
Similarly, I write 11 for the rate of ICs with feature X per OUSV that are not inside a GC volume, where v is the expansion speed of GCs. In this case, .
The Fermi observation limits the number of early arriving GCs: when there are too many GCs the existence of observers like us is rare or impossible.
For anthropic theories that prefer more observers like us, there is a push in the other direction. If life is easier, there will be more XICs.
For anthropic theories that prefer observers like us to be more typical, there is potentially a push towards the existence of GCs that set a cosmic deadline and lead to human civilization not being unusually early.
In the next chapter, I derive likelihood ratios for different anthropic theories and produce results.
I’ve presented all the machinery necessary for the updates, other than the anthropic reasoning. I hope this chapter is readable without knowledge of the previous two.
I now apply three approaches to dealing with anthropics:
I have three joint priors over the following eight parameters.
I update on either the observation I label or observation I label . Both and include observing that we are in an IC that
additionally contains the observation that we do not see any GCs. Alternatively, additionally contains the observation that we are not inside a GC (equivalently, that we exist, if we expect GCs to prevent ICs like us from appearing).
I walk through each anthropic theory, in turn, derive a likelihood ratio, and produce results. In Chapter 4 I discuss potential implications of these results.
By Bayes rule
I have already given my priors and so it remains to calculate the likelihood P(X|n, ..., v). I derive likelihoods in the discrete case, and index my priors by worlds .
I use the following definition of the self-indication assumption (SIA), slightly modified from Bostrom (2002):
All other things equal, one should reason as if they are randomly selected from the set of all12 possible observer moments (OMs) [a brief time-segment of an observer].13
Applying the definition of SIA,
That is, SIA updates towards worlds where there are more OMs like us. Since the denominator is independent of , we only need to calculate the numerator, .
By my choice of definitions, is proportional to , the number of ICs with feature X that actually appear per OUSV. The constant of proportionality is given by the number of OMs per IC, which I suppose is independent of model parameters, as well as the number of OUSVs in the earlier specified large finite volume. Again, these constants are unnecessary due to the normalisation.
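A toy version of this discrete update, with made-up prior weights and made-up values standing in for the number of ICs like us per OUSV in each world:

```python
import numpy as np

prior = np.array([0.25, 0.25, 0.25, 0.25])   # prior over four illustrative worlds
n_xic = np.array([1e-6, 1e-3, 1.0, 50.0])    # ICs like us per OUSV in each world (made up)

posterior_sia = prior * n_xic                # SIA weight: proportional to observers like us
posterior_sia /= posterior_sia.sum()
print(np.round(posterior_sia, 4))
# Almost all posterior mass shifts to the worlds containing the most observers like us.
```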
The three summary statistics implied by the posterior are below. As mentioned before, the updates are reproducible here.
[Posterior summary statistics, updating on each of the two observations.]
SIA updates overwhelmingly towards the existence of GCs in our light cone from all three of my priors. If a GC does not emerge from Earth, most of the volume will be expanded into by other GCs.
I discuss some marginal posteriors here, and reproduce all the marginal posteriors in the appendix.
SIA updates towards smaller as the existence of more GCs can only decrease the number of observers like us. This is the “SIA Doomsday” described by Grace (2010). This result is the same as found by Olson & Ord (2021) whereby the prior on goes from prior to posterior .
The SIA update is overwhelmingly towards smaller . Increasing only increases the number of GCs that could preclude XICs.
I use the following definition of the self-sampling assumption (SSA), again slightly modified from Bostrom (2002):
All other things equal, one should reason as if they are randomly selected from the set of all actually existent observer moments (OMs) in their reference class.14
A reference class is a choice of some subset of all OMs. Applying the definition of SSA with reference class ,
That is, SSA updates towards worlds where observer moments like our own are more common in the reference class.
I first consider two reference classes, and . The reference class contains only OMs contained in ICs, and no OMs in GCs. This is the reference class implicitly used by Hanson et al. (2021). The reference class also includes observers in GCs. I later consider the minimal reference class, containing only observers who have identical experiences, paired with non-causal decision theories.
This is the reference class implicitly used by Hanson et al. (2021). I reach different conclusions from Hanson et al. (2021), and discuss a possible error in their paper in the appendix.
The total number of OMs in is proportional to the number of ICs, . As in the SIA case, the number of XOMs is proportional to , so the likelihood ratio is .
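Continuing in the same toy style, the corresponding update looks like this (all numbers are again made up for illustration):

```python
import numpy as np

prior = np.array([0.25, 0.25, 0.25, 0.25])
n_xic = np.array([1e-6, 1e-3, 1.0, 50.0])      # ICs like us per OUSV in each toy world
n_ic = np.array([1e-5, 1e-2, 20.0, 5000.0])    # all ICs per OUSV in each toy world

# SSA with the reference class of IC observers: the likelihood is the fraction of a
# world's IC observer moments that are like ours, so the update uses n_xic / n_ic.
posterior_ssa = prior * (n_xic / n_ic)
posterior_ssa /= posterior_ssa.sum()
print(np.round(posterior_ssa, 3))
# Unlike SIA, worlds with many ICs are not favoured per se; what matters is how
# typical observers like us are among all IC observers in each world.
```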
[Posterior summary statistics, updating on each of the two observations.]
SSA has updated away from the existence of GCs in our future light cone.
In the appendix, I discuss how this update is highly dependent on the lower bound on the prior for . Again, smaller is unsurprisingly preferred.
This reference class contains all OMs that actually exist in our large finite volume, and so includes OMs that GCs create. It is sometimes called the “maximal” reference class15.
I model GCs as using some fraction of their total volume to create OMs. I suppose that this fraction and the efficiency of OM creation are independent of the model parameters. These constants do not need to be calculated, since they cancel when normalising.
The total volume controlled by all GCs is proportional to , the average fraction of OUSVs saturated by GCs at some time when all expansion has finished16.
I assume that a single GC creates many more OMs than are contained in a single IC. Since my prior on has and I expect GCs to produce many OMs, I see this as a safe assumption. This assumption implies the total number of OMs is proportional to . The SSA likelihood ratio is .
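In the same toy style, the denominator is now driven by the observer moments created by GCs, here taken proportional to the fraction of space GCs eventually saturate (all numbers are made up):

```python
import numpy as np

prior = np.array([0.25, 0.25, 0.25, 0.25])
n_xic = np.array([1e-6, 1e-3, 1.0, 50.0])        # ICs like us per OUSV in each toy world
gc_fraction = np.array([1e-4, 0.05, 0.6, 0.99])  # eventual GC-saturated fraction of space

posterior = prior * n_xic / gc_fraction
posterior /= posterior.sum()
print(np.round(posterior, 3))
# Holding n_xic fixed, worlds where GCs eventually fill more of space get less weight,
# since observers like us then make up a smaller share of all observer moments.
```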
I do not see this update as particularly informative, since I expect GCs to create simulated XOMs, which I explore later in this chapter.
[Posterior summary statistics, updating on each of the two observations.]
Notably, SSA updates towards as small as possible, since increasing the speed of expansion increases the number of observers created that are not like us — the denominator in the likelihood ratio.
As with the SSA update, this result is sensitive to the prior on , which I discuss in the appendix.
In this section, I apply non-causal decision theoretic approaches to reasoning about the existence of GCs. This chapter does not deal with probabilities, but with ‘wagers’. That is, how much one should behave as if they are in a particular world.
The results I produce are applicable to multiple non-causal decision theoretic approaches.
The results are applicable for someone using SSA with the minimal reference class () paired with a non-causal decision theory, such as evidential decision theory (EDT). The minimal reference class contains only observers identical to you, and so updating using SSA with Rmin simply removes any world where there are no observers with the same observations as you, and then normalises.
The results are also applicable for someone (fully) sticking with their priors (being ‘updateless’) and using a decision theory such as anthropic decision theory (ADT). ADT, created by Armstrong (2011), converts questions about anthropic probability to decision problems, and Armstrong notes that “ADT is nothing but the Anthropic version of the far more general ‘Updateless Decision Theory’ and ‘Functional Decision Theory’”.
I suppose that all decision relevant ‘exact copies’ of me (i.e. instances of my current observations) are in one of the following situations
Of course, copies may be in non-decision relevant situations, such as short-lived Boltzmann brains.
For each of the above three situations, I calculate the expected number of copies of me per OUSV; for example, in case (1), the number of copies is proportional to the number of ICs that become GCs. I do not calculate the constant of proportionality (which would be very small); this constant is redundant when considering the relative decision-worthiness of different worlds.
My decisions may correlate with those of agents that are not identical copies of me (at a minimum, near-identical copies), which I do not consider in this calculation. If the relative increase in decision-worthiness from correlated agents is equal across all situations, the overall relative decision-worthiness is unchanged.
To motivate the need to consider these three cases, I claim that our decisions are likely contingent on the ratio of our copies in each category and the ratio of the expected utility of our possible decisions in each scenario. For example, if we were certain that none of our copies were in ICs that became GCs, or all of our copies were in short-lived simulations, we may prioritise improving the lives of current generations of moral patients.
The GC wager
I choose to model all the expected utility of our decisions as coming from copies in case (1). That is, to make decisions premised on the wager that we are in an IC that becomes a GC and not in an IC that doesn’t become a GC, nor in a short-lived simulation.
Tomasik (2016) discusses the comparison of the decision-worthiness of (1) and (2) with that of (3). My assumption that (1) dominates (2) is driven by my prior distribution on $f_{GC}$ (which is bounded below by 0.01) and by the expected resources of a single GC dominating the resources of a single IC.
Counterarguments to this assumption may appeal to the uncertainty about the ability to affect the long-run future. For example, if a GC emerged from Earth in the future but all the consequences of one’s actions ‘wash out’ before that point, then (1) and (2) would be equally decision-worthy.
I expect that forms of lock-in, such as the values of an artificial general intelligence, provide a route for altruists to influence the future. I suppose that a total utilitarian's decisions matter more in cases where the Earth-emerging GC is larger; in fact, I suppose a total utilitarian's decisions matter in linear proportion to the eventual volume of such a GC.
An average utilitarian's decisions then matter in proportion to the ratio of the eventual volume of an Earth-emerging GC to the volume controlled by all GCs, supposing that GCs create moral patients in proportion to their resources.
Calculating decision-worthiness
To compute the decision-worthiness of each world, I multiply the expected number of copies of me in ICs that become GCs by the influence each copy has in that world. This gives the degree to which I should wager my decisions on being in a particular world.
Total utilitarianism
The number of copies of me in ICs that become GCs is proportional to the number of XICs that actually appear multiplied by $f_{GC}$. Using the assumption that our influence is linear in resources, the decision-worthiness of each world is this number of copies multiplied by the expected actual volume of such a GC.
I use the label “ADT total” for this case.
[Figure: ADT total updates with each of the two observations]
Total utilitarians using a non-causal decision theory should behave as if they are almost certain of the existence of GCs in their future light cone. However, the number of GCs is fairly low: around 40 per AUSV.
Average utilitarianism
As before, the number of copies of me in ICs that become GCs is proportional to the number of XICs multiplied by $f_{GC}$, and again each such GC has the same expected actual volume as in the total utilitarian case. The resources of all GCs are proportional to the average fraction of OUSVs they saturate. Supposing that GCs create moral patients in proportion to their resources, the decision-worthiness of each world is the number of copies multiplied by the ratio of the Earth-emerging GC's expected volume to the total volume controlled by all GCs.
I use the label “ADT average” for this case.
[Figure: ADT average updates with each of the two observations]
An average utilitarian should behave as if there are most likely no GCs in the future light cone. As with the SSA updates, this result is sensitive to the prior on $n$ and is explored in an appendix.
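To make the two wagers concrete, here is a minimal sketch of the decision-worthiness computations described above. Every parameter value is a hypothetical stand-in; none of these numbers come from the post's priors or posteriors.

```python
# Hypothetical stand-in values; not the post's actual parameters.
n_xic = 1.0       # relative number of XICs appearing per OUSV
f_gc = 0.1        # fraction of ICs that become GCs
v_gc = 1e6        # expected actual volume of the Earth-emerging GC
v_all_gcs = 1e8   # total volume controlled by all GCs in this world

copies = n_xic * f_gc                     # copies of me in ICs that become GCs

# Total utilitarian: influence is linear in the Earth-emerging GC's volume.
dw_total = copies * v_gc
# Average utilitarian: influence is the GC's share of all GC resources.
dw_average = copies * v_gc / v_all_gcs

print(dw_total, dw_average)
```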
I now model two types of interactions between GCs: trade and conflict.
The model of conflict that I consider decreases the decision worthiness of cases where there are GCs in our future light cone. I show that a total utilitarian should wager as if there are no GCs in their future light cone if they think the probability of conflict is sufficiently high.
The model of trade I consider increases the decision worthiness of cases where there are GCs in our future light cone. I show that an average utilitarian should wager that there are GCs in their future light cone if they think there are sufficiently large gains from trade with other GCs.
The purpose of these toy examples is to illustrate that a total or average utilitarian’s true wager with respect to GCs may be more nuanced than presented earlier.
Total utilitarianism and conflict
Suppose we are in the contrived case where:
When conflict occurs, an Earth-originating GC has some probability of getting its maximal volume. Supposing a total utilitarian's decisions can influence both cases equally, the expected decision-worthiness per copy in an IC that becomes a GC is the probability-weighted volume obtained.
As before, multiplying by the number of copies of me in ICs that become GCs gives the decision-worthiness.
Intuitively, since the conflict in expectation is a net loss of resources for all GCs, this leads one to wager one’s decisions against the existence of GCs in the future.
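A sketch of this conflict adjustment follows. Since the list of assumptions above is incomplete in this copy, the sketch assumes a winner-takes-all contest in which the losing GC retains nothing, and the probability and volume are hypothetical stand-ins.

```python
p_win = 0.5    # hypothetical probability of keeping the maximal volume
v_max = 1e6    # hypothetical maximal volume of the Earth-originating GC

# Expected decision-worthiness per copy in an IC that becomes a GC,
# assuming the loser of a conflict retains no volume:
dw_per_copy = p_win * v_max + (1 - p_win) * 0.0
print(dw_per_copy)
```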
Average utilitarianism and trade
I apply a very basic model of gains from trade between GCs with average utilitarianism. I suppose that one can only trade with other GCs within the affectable universe.
Intuitively the decision worthiness goes up in a world with trade as there is more at stake: our GC can both influence its own resources and the resources of other GCs. This model of trade would also increase the degree to which a total utilitarian would wager there are GCs in their future light cone.
I suppose an average utilitarian GC completes a trade of return $r$ by spending $1/r$ of their resources (which they could otherwise use to increase the welfare of moral patients by a single unit) in return for the welfare of moral patients being increased by one unit. For $r > 1$ the GC benefits by making the trade, and so should always make such a trade rather than using the resources to create utility themselves. I write $p(r)$ for the probability density of a randomly chosen trade providing return $r$, and suppose that the 'volume' of available trades is proportional to the volume saturated by GCs, which itself is proportional to the average saturated fraction.
I take $p(r) \propto r^{-k}$ for some $k$. For smaller $k$, a greater proportion of all available trades are beneficial, and a greater number are very beneficial. For example, for $k=1$ some fraction of the volume controlled by GCs admits beneficial trades ($r > 1$), and a smaller fraction allows for trades that return twice as much as they put in ($r > 2$); for larger $k$ both fractions shrink.
Note that smaller $k$ supposes a very large ability to control the effective resources of other GCs through trade. Some utility functions may be more conducive to expecting such high trade ratios.
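A numerical sketch of this trade density follows (my own code, not the post's). The post's support for $r$ and its example values are lost in this copy, so the bounds below are hypothetical; the printed fractions illustrate only the qualitative claim that smaller $k$ yields more, and more beneficial, trades.

```python
from scipy import integrate

R_MIN, R_MAX = 0.1, 100.0   # hypothetical support for trade returns r

def density(r: float, k: float) -> float:
    """Unnormalised power-law trade density p(r) proportional to r^(-k)."""
    return r ** -k

def fraction_above(threshold: float, k: float) -> float:
    """Fraction of available trades with return r > threshold."""
    total, _ = integrate.quad(density, R_MIN, R_MAX, args=(k,))
    above, _ = integrate.quad(density, threshold, R_MAX, args=(k,))
    return above / total

for k in (1.0, 2.0):
    print(f"k={k}: P(r>1) = {fraction_above(1.0, k):.3f}, "
          f"P(r>2) = {fraction_above(2.0, k):.3f}")
```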
I suppose that the decision-worthiness for each copy of an average utilitarian is linear in the ratio of the effective resources that the future GC controls (i.e. the total resources the GC would need to produce the same utility without trade) to the total resources controlled by all GCs. Other GCs may also increase the effective resources they control; for simplicity, I assume that such GCs do not use their increased effective resources to change the number or welfare of otherwise-existing moral patients.
Average utilitarians should wager their decisions on the existence of (many) GCs if they expect high trade ratios, and the ability to linearly influence the value of these trades.
In this section, I return to probabilities and consider updates for SIA and SSA in the case where GCs create simulated observers like us. For the most part, the results are similar to those seen so far: SIA supports the existence of many GCs, and SSA does not. Since SSA with the IC-only reference class does not include observers created by GCs, its results are independent of the existence of any simulated observers created by GCs.
This section implicitly assumes that the majority of observers like us (XOMs) are in simulations (run by GCs), as argued by Bostrom (2003). Chapter 4 does not depend on any discussion here, so this subsection can be skipped.
In the future, an Earth-originating GC may create simulations of the history of Earth or simulate worlds containing counterfactual human civilizations. I call these ancestor simulations (AS).
Bostrom (2003) concludes that at least one of the following is true: (1) almost all civilizations at our level of development go extinct before becoming technologically mature; (2) technologically mature civilizations run almost no simulations of their own histories; (3) we are almost certainly living in a simulation.
GCs other than humanity's may create AS of their own pasts as ICs. The OMs in AS created by GCs that transitioned from XICs will be XOMs.
As well as running simulations of their own past, GCs may create simulations of other ICs. GCs may be interested in the values or behaviours of other GCs they may encounter, and can learn about the distribution of these by running simulations of ICs.
I use the term historical simulations (HS) to describe a behaviour of simulating ICs where the distribution of simulated ICs is equal to the true distribution of ICs. That is, the simulations are representative of the outside world, even if GCs run the simulations one IC at a time.
GCs may create many other OMs, simulated or not, of which none are XOMs. For example, a post-human GC may create a simulated utopia of OMs. I use other OMs as a catch-all term for such OMs.
I model GCs as either creating ancestor simulations (AS) or historical simulations (HS), and as spending either a fixed amount of resources on simulations or an amount proportional to the resources they control. Fixed means that the amount each GC spends is independent of the model parameters; it does not mean each GC creates the same number.
I first give an example to motivate the claim that when GCs create simulated XOMs, the majority of all XOMs are in such simulations rather than at the 'basement level'.
Bostrom (2003) estimates that the resources of the Virgo Supercluster, a structure that contains the Milky Way and could be fully controlled by an Earth-originating GC, could be used to run an astronomical number of human lives per second, each containing many OMs. Around $10^{11}$ humans have ever lived: if we expect a GC to emerge in the next few centuries, it seems unlikely more than $10^{12}$ humans will have lived by this time. In this case, only one hundred million trillionths of a GC's resources would need to be used for a single second to create an equal number of XOMs to the number of basement-level XOMs.
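The arithmetic can be sketched as follows. Because the post's exact figures are garbled in this copy, the constants below are stand-ins, not Bostrom's or the author's numbers.

```python
lives_per_second = 1e30   # hypothetical simulated human lives a GC runs per second
basement_humans = 1e12    # upper bound on basement-level humans (from the text)

# Fraction of one second of the GC's resources needed to create as many
# simulated XOMs as there are basement-level XOMs:
fraction = basement_humans / lives_per_second
print(f"{fraction:.0e}")  # 1e-18 under these stand-in numbers
```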
When GCs create AS or HS, I assume that the number of XOMs in AS or HS far exceeds the number of XOMs in XICs. That is, most observers like us are in simulations.
Both SIA and SSA support the existence of simulations of XOMs: holding all else equal, creating simulated XOMs (trivially) increases the number of XOMs and the ratio |XOMs|/|OMs|.
I first calculate |XOMs| for each simulation behaviour; these give the SIA likelihood ratios. As previously discussed in the SSA case, I suppose that the vast majority of OMs are in GCs and so are created in proportion to the resources controlled by GCs. Dividing |XOMs| by |OMs| then gives the SSA likelihood ratio.
For each simulation behaviour, I derive the quantity to which |XOMs| is proportional:

AS fixed: I assume that the fixed number of OMs each GC creates is much greater than the number of basement-level XOMs, so one can approximate all XOMs as contained in AS. The number of XICs that actually appear, of which a fraction $f_{GC}$ will become GCs, determines the number of GCs running AS that contain XOMs.

HS fixed: The total number of GCs that appear, each creating some average number of HS containing some average constant number of XOMs, multiplied by the fraction of ICs in HS which are XICs, gives |XOMs|. Intuitively, this is equal to the AS fixed case, as the same ICs are being sampled and simulated, but the distribution of which GC simulates which IC has been permuted.

AS resource proportional: The number of GCs that create AS containing XICs is as in the AS fixed case, but the number of AS each of these GCs creates is proportional to the actual volume each would control.

HS resource proportional: Of all HS created, the fraction that are of XICs equals the true fraction of ICs that are XICs. The total number of HS created is proportional to the average fraction of OUSVs saturated by GCs.
Note that the derivations above give equivalences between some of these cases (in particular, AS fixed and HS fixed), and so these are not calculated again.
[Table: updates with each observation for AS fixed / HS fixed and HS resource proportional]
[Table: updates with each observation for AS fixed / HS fixed]
| GC simulation behaviour | SIA | ADT total utilitarianism | ADT average utilitarianism | SSA | SSA |
| No XOMs | 1 | 4 | 5 | 6 | 8 |
| HS-fixed | 2 | 4 | 5 | 7 | 8 |
| AS-fixed | 2 | 4 | 5 | 7 | 8 |
| HS-rp | 3 | 4 | 5 | 8 | 8 |
| AS-rp | 4 | 4 | 5 | 5 | 8 |

In the above table, the left column gives the shorthand description of GC simulation-creating behaviour, and the remaining columns give the update under each anthropic theory (the two SSA columns correspond to the two reference classes). Equivalent updates share the same number.
The posterior credence in being alone in the observable universe, conditioned on each observation, is shown below for each numbered update:
Prior | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Bullish | <0.1% | <0.1% | <0.1% | 0.2% | 70% | 68% | 69% | 64% |
Balanced | <0.1% | <0.1% | <0.1% | 0.2% | 89% | 89% | 89% | 85% |
Bearish | <0.1% | <0.1% | <0.1% | 0.2% | 94% | 95% | 95% | 92% |
These results replicate previous findings:
These results fail to replicate Hanson et al.'s (2021) finding that SSA (with their implicitly used reference class) implies the existence of GCs in our future.
To my knowledge, this is the first model that
In the appendix, I also produce variants of the updates for different priors: taking (log)uniform priors on all parameters, and varying the prior on $n$.
My preferred approach is to use a non-causal decision theoretic approach, and reason in terms of wagers rather than probabilities.
Within the choice of utility function in finite worlds, forms of total utilitarianism are more appealing to me. However, it seems likely that the world is infinite and that aggregative consequentialism must confront infinitarian paralysis—the problem that in infinite worlds one is ethically indifferent between all actions. Some solutions to infinitarian paralysis require giving up on the maximising nature of total utilitarianism (Bostrom (2011)) and may look more averagist. However, interactions with other GCs - such as through trade - make it plausible that even average utilitarians should behave as if GCs are in their future light cone.
Having said this, theoretical questions remain with the use of non-causal decision theories (e.g. comments here on UDT and FDT).
If an Earth-originating GC observes another GC, it will most likely not be for hundreds of millions of years. By this point, one may expect such a civilization to be technologically mature and any considerations related to the existence of aliens to be redundant. Further, any actions we take now may be unable to influence the far future. Given these concerns, are any of the conclusions action-relevant?
Primarily, I see these results being most important for the design of artificial general intelligence (AGI). It seems likely that humanity will hand off control of the future, inadvertently or by design, to an AGI. Some aspects of an AGI humanity builds may be locked-in, such as its values, decision theory or commitments it chooses to make.
Given this lock-in, altruists concerned with influencing the far future may be able to influence the design of AGI systems to reduce the chance of conflict between this AGI and other GCs (presumably also controlled by AGI systems). Clifton (2020) outlines avenues to reduce cooperation failures such as conflict.
Bostrom (2003) gives a lower bound on the number of biological human lives lost per second of delayed colonization, due to the finite lifetimes of stars. This estimate does not include stars that become unreachable for a human civilization due to the expansion of the universe.
The existence of GCs in our future light cone may strengthen or weaken this consideration. If GCs are aligned with our values, then even if a GC never emerges from Earth, the cosmic commons may still be put to good use. This does not apply when using SSA or a non-causal decision theory with average utilitarianism, which expect that only a human GC can reach much of our future light cone.
The results have clear implications for the search for extraterrestrial intelligence (SETI).
One key result is the strong update against the habitability of planets around red dwarfs. For the self-sampling assumption or a non-causal decision-theoretic approach with average utilitarianism, there is great value of information in learning whether such planets are in fact suitable for advanced life: if they are, SSA strongly endorses the existence of GCs in our future light cone, as discussed in the appendix. SIA, or a non-causal decision-theoretic approach with total utilitarianism, is confident in the existence of GCs in our future light cone regardless of the habitability of red dwarfs.
The model also informs the probability of success of SETI for ICs in our past lightcone. Such ICs may not be visible to us now if they were too quiet for us to notice or did not persist for a long time.
Barnett (2022) discusses and gives an admittedly “non-robust” estimate of “0.1-0.2% chance that SETI will directly cause human extinction in the next 1000 years”.
I consider the implied posterior distribution on the probability of a GC becoming observable in the next thousand years. The (causal) existential risk from GCs is strictly smaller than the probability that light reaches us from at least one GC, since the former entails the latter.
The posteriors imply a relatively negligible chance of contact (observation or visitation) with GCs in the next 1,000 years even for SIA.
However, the risk in the next 1,000 years is then more likely to come from GCs that are already potentially observable but that we have not yet observed - perhaps more advanced telescopes will reveal such GCs.
Further work
I list some further directions in which this work could be taken. All the calculations can be found here.
I have not updated on all the evidence available. Further evidence one could update on includes:
Modeling assumptions can be improved:
More variations of the updates could be considered:
More thought could be put into the prior selection (though the main results still follow from (log)uniform priors):
I would like to thank Daniel Kokotajlo for his supervision and guidance. I’d also like to thank Emery Cooper for comments and corrections on an early draft, and Lukas Finnveden and Robin Hanson for comments on a later draft. The project has benefited from conversations with Megan Kinniment, Euan McClean, Nicholas Goldowsky-Dill, Francis Priestland and Tom Barnes. I'm also grateful to Nuño Sempere and Daniel Eth for corrections on the Effective Altruism Forum. Any errors remain my own.
This project started during Center on Long-Term Risk’s Summer Research Fellowship.
| Parameter | Definition |
| $n$ | The number of hard try-try steps |
| $\nu$ | The geometric mean of the hard steps ("hardness") |
| | The sum of the delay and fuse steps, strictly less than Earth's habitable duration |
| | The probability of passing through all try-once steps in the development of an IC |
| | The maximum duration a planet can be habitable for |
| | The decay power of gamma-ray bursts |
| $s$ | The average comoving speed of expansion of GCs |
| $f_{GC}$ | The fraction of ICs that become GCs |
| IC | Intelligent civilization |
| XIC | Intelligent civilizations similar to human civilization |
| GC | Grabby civilization |
| OM | Observer moment |
| OUSV | Observable universe size volume |
| AUSV | Affectable universe size volume |
| SIA | Self-indication assumption |
| SSA | Self-sampling assumption |
| ADT | Anthropic decision theory |
| | The number of ICs/XICs/GCs that would appear, supposing no preclusion, per OUSV |
| | The number of ICs/XICs/GCs that actually appear per OUSV |
| | The observation of being in an XIC that has not observed any GCs |
| | The observation of being in an XIC that is not inside a GC |
| AS | Ancestor simulations; simulations created by a GC of their own IC origins (or slight variants) |
| HS | Historical simulations; simulations created by a GC to be representative of IC origins |
| | The probability density function of IC arrival times, excluding any preclusion by GCs |
| | The probability density function of IC arrival times that do not observe any GCs |
| | The fraction of an OUSV unsaturated by GCs at a given time |
| | The comoving volume of a sphere/GC expanding from a given time with speed $s$ |
| | The actual volume of a sphere/GC expanding from a given time with speed $s$, which accounts for the expansion of other GCs |
| | The rate of habitable star formation, normalised to have integral 1 |
| | The fraction of terrestrial planets that are habitable for at most a given duration |
| | The number of terrestrial planets per OUSV that are potentially habitable |
| | The fraction of potentially habitable planets habitable to advanced life at a given time |
| | The (cosmic) scale factor |
Ade, P. A., Aghanim, N., Arnaud, M., Ashdown, M., Aumont, J., Baccigalupi, C., ... & Matarrese, S. (2016). Planck 2015 results. XIII. Cosmological parameters. Astronomy & Astrophysics, 594, A13.
Armstrong, S. (2011). Anthropic decision theory. arXiv preprint arXiv:1110.6437.
Armstrong, S., & Sandberg, A. (2013). Eternity in six hours: Intergalactic spreading of intelligent life and sharpening the Fermi paradox. Acta Astronautica, 89, 1-13.
Barnett, M. (2022). My current thoughts on the risks from SETI https://www.lesswrong.com/posts/DWHkxqX4t79aThDkg/my-current-thoughts-on-the-risks-from-seti#Strategies_for_mitigating_SETI_risk
Bostrom, N. (2003). Are we living in a computer simulation? The Philosophical Quarterly, 53(211), 243-255.
Bostrom, N. (2003). Astronomical waste: The opportunity cost of delayed technological development. Utilitas, 15(3), 308-314.
Bostrom, N. (2011). Infinite ethics. Analysis and Metaphysics, (10), 9-59.
Carter, B. (1983). The anthropic principle and its implications for biological evolution. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 310(1512), 347-363.
Carter, B. (2008). Five-or six-step scenario for evolution?. International Journal of Astrobiology, 7(2), 177-182.
Clifton, J. (2020) Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda. https://longtermrisk.org/files/Cooperation-Conflict-and-Transformative-Artificial-Intelligence-A-Research-Agenda.pdf
Eth, D. (2021) Great-Filter Hard-Step Math, Explained Intuitively. https://www.lesswrong.com/posts/JdjxcmwM84vqpGHhn/great-filter-hard-step-math-explained-intuitively
Finnveden, L. (2019) Quantifying anthropic effects on the Fermi paradox https://forum.effectivealtruism.org/posts/9p52yqrmhossG2h3r/quantifying-anthropic-effects-on-the-fermi-paradox
Grace, K. (2010). SIA doomsday: The filter is ahead https://meteuphoric.com/2010/03/23/sia-doomsday-the-filter-is-ahead/
Greaves, H. (2017). Population axiology. Philosophy Compass, 12(11), e12442.
Griffith, R. L., Wright, J. T., Maldonado, J., Povich, M. S., Sigurðsson, S., & Mullan, B. (2015). The Ĝ infrared search for extraterrestrial civilizations with large energy supplies. III. The reddest extended sources in WISE. The Astrophysical Journal Supplement Series, 217(2), 25.
Hanson, R., Martin, D., McCarter, C., & Paulson, J. (2021). If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare. The Astrophysical Journal, 922(2), 182.
Haqq-Misra, J., Kopparapu, R. K., & Wolf, E. T. (2018). Why do we find ourselves around a yellow star instead of a red star?. International Journal of Astrobiology, 17(1), 77-86.
Loeb, A. (2014). The habitable epoch of the early Universe. International Journal of Astrobiology, 13(4), 337-339.
Maartens, R. (2011). Is the Universe homogeneous?. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 369(1957), 5115-5137.
MacAskill, M., Bykvist, K., & Ord, T. (2020). Moral uncertainty (p. 240). Oxford University Press.
Oesterheld, C. (2017). Multiverse-wide Cooperation via Correlated Decision Making. https://longtermrisk.org/multiverse-wide-cooperation-via-correlated-decision-making/
Olson, S. J. (2015). Homogeneous cosmology with aggressively expanding civilizations. Classical and Quantum Gravity, 32(21), 215025.
Olson, S. J. (2020). On the Likelihood of Observing Extragalactic Civilizations: Predictions from the Self-Indication Assumption. arXiv preprint arXiv:2002.08194.
Olson, S. J., & Ord, T. (2021). Implications of a search for intergalactic civilizations on prior estimates of human survival and travel speed. arXiv preprint arXiv:2106.13348.
Omohundro, S. M. (2008, February). The basic AI drives. In AGI (Vol. 171, pp. 483-492).
Ord, T. (2021). The edges of our universe. arXiv preprint arXiv:2104.01191.
Ozaki, K., & Reinhard, C. T. (2021). The future lifespan of Earth’s oxygenated atmosphere. Nature Geoscience, 14(3), 138-142.
Pearce, B. K., Tupper, A. S., Pudritz, R. E., & Higgs, P. G. (2018). Constraining the time interval for the origin of life on Earth. Astrobiology, 18(3), 343-364.
Russell, S. (2021). Human-compatible artificial intelligence. In Human-Like Machine Intelligence (pp. 3-23). Oxford: Oxford University Press.
Saadeh, D., Feeney, S. M., Pontzen, A., Peiris, H. V., & McEwen, J. D. (2016). How isotropic is the Universe?. Physical review letters, 117(13), 131302.
Sandberg, A., Drexler, E., & Ord, T. (2018). Dissolving the Fermi paradox. arXiv preprint arXiv:1806.02404.
Sloan, D., Alves Batista, R., & Loeb, A. (2017). The resilience of life to astrophysical events. Scientific reports, 7(1), 1-5.
Tegmark, M. (2007). The multiverse hierarchy. Universe or multiverse, 99-125.
Tomasik, B. (2016). How the Simulation Argument Dampens Future Fanaticism.
Zackrisson, E., Calissendorff, P., González, J., Benson, A., Johansen, A., & Janson, M. (2016). Terrestrial planets across space and time. The Astrophysical Journal, 833(2), 214.
I discuss how using the remaining habitable time on Earth to update on the number of hard steps n is implicitly an anthropic update. In particular I discuss it in the context of Hanson et al. (2021) (henceforth “they” and “their”). They later perform another anthropic update, using a different reference class, which I see as problematic.
Their prior on $n$ is derived by using the self-sampling assumption with the reference class of observers on planets habitable for ~5 Gy (the same as Earth); I write $R_{5\mathrm{Gy}}$ for this reference class. Throughout, I ignore delay steps, and include only hard try-try steps.
They argue (correctly, as I see it) that, to be most typical within this reference class given the observation that Earth remains habitable for another ~1 Gy, we should update towards larger $n$. The SSA likelihood ratio when updating on our appearance time alone (ignoring preclusion by GCs) is expressed in terms of the Gamma distribution PDF with shape $n$ and the hardness as its scale, taking the hardness to be much larger than the habitable duration. This likelihood ratio is largest for moderately large $n$. We could further condition on the time that life first appeared, but this is not necessary to illustrate the point.
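A sketch of this likelihood computation follows (my own code, not the post's; the hardness value is a hypothetical stand-in): the relative likelihood of our appearance time is the Gamma PDF at that time, conditioned on the hard steps completing within Earth's habitable window.

```python
from scipy.stats import gamma

t_obs = 4.5    # our appearance time (Gy after Earth became habitable)
t_max = 5.5    # Earth's total habitable window (Gy), i.e. ~1 Gy remaining

def appearance_likelihood(n: int, hardness: float = 1000.0) -> float:
    """Density of completing n hard steps at t_obs, conditional on
    completion before t_max; `hardness` is the Gamma scale."""
    return gamma.pdf(t_obs, a=n, scale=hardness) / gamma.cdf(t_max, a=n, scale=hardness)

base = appearance_likelihood(1)
for n in (1, 2, 5, 10):
    print(n, appearance_likelihood(n) / base)  # likelihood ratio vs n = 1
```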
While their prior on $n$ relies on this small reference class, their main argument relies on a larger reference class of all intelligent civilizations. They use this to model humanity's birth rank as uniform in the appearance times of all advanced life, not just life on planets habitable for ~5 Gy.
If we use the smaller reference class $R_{5\mathrm{Gy}}$ throughout, then one updates towards larger $n$, but human civilization is no longer particularly early, since all life on planets habitable for ~5 Gy appears in the next ~50 Gy due to the end of star formation. The existence of GCs will have less explanatory power in this case.
If one uses the larger reference class, the SSA likelihood ratio when updating on human civilization's appearance time alone (ignoring preclusion by GCs) instead integrates over the maximum habitable duration and the 'number' of planets habitable for each duration.
If we believe the maximum habitable duration to be large, then the likelihood ratio is maximal at the smallest $n$ and is decreasing in $n$: if advanced life is hard then it will appear more often on planets where it has longer to evolve, and increasing $n$ makes life harder, so decreases the total amount of advanced life and increases the fraction of life on longer-habitable planets. The reference class converges to $R_{5\mathrm{Gy}}$ when the maximum habitable duration decreases to 5 Gy, and one then updates towards larger $n$.
To summarise, the following are ‘compatible’
Hanson et al. write:
If life on Earth had to achieve n "hard steps" to reach humanity's level, then the chance of this event rose as time to the n-th power. Integrating this over habitable star formation and planet lifetime distributions predicts >99% of advanced life appears after today, unless n < 3 and max planet duration <50Gyr. That is, we seem early.
That is, to be early in the reference class of advanced life, we require large $n$ and a large maximum planet duration, which we have shown are incompatible.
The two SSA updates and the ADT average update are sensitive to the lower bound on the prior for $n$. When there are no GCs (that can preclude ICs), human civilization's typicality is primarily determined by $n$: the smaller $n$, the more typical human civilization is. If $n$ is certainly high, worlds with GCs that preclude ICs are relatively more appealing to SSA.
Here I show updates for variants on the prior for $n$, otherwise using the balanced prior. Notably, even for the variant with the highest lower bound on $n$, SSA gives around 58% credence on being alone. As seen below, increasing the lower bound on the prior of $n$ increases the posterior implied rate of GCs.
[Table: implied posteriors for the two SSA updates and the ADT average update under each prior variant]
The following tables show the marginalised posteriors for all updates (excluding the trade and conflict scenarios).
I show that the results follow when taking uniform/loguniform priors on the model parameters as follows:
These give the following distributions on the number of GCs:
This takes the same (log)uniform priors, but with longer-lived planets allowed to be habitable. The SSA-implied posterior on being alone in the OUSV is now just 59% from one observation, and 40% from the other.
Currently in this Google Doc. Will be added to this post soon.
Technologies that produce false vacuum decay or other highly destructive effects will have a non-zero rate of 'detonation'. Such technologies could be used accidentally, or deliberately as a scorched-earth policy during conflict between GCs. Non-gravitationally bound volumes of the universe will become causally separated by ~200 Gy, after which GCs are safe from light-speed decay bubbles.
The model presented can be used to estimate the fraction of OUSVs consumed by such decay bubbles. I introduce a parameter for the fraction of ICs that trigger a vacuum decay some time shortly after they become an IC. More relevantly, one may consider vacuum decay events being triggered when GCs meet one another.
Of course, this is highly speculative, but suggestive that such considerations may change the behaviour of GCs before the era of causal separation. For example, risk averse or pure time discounting GCs may trade off some expansion for creation of utility.
One could run the entire model with $f_{GC}$ replaced by this decay fraction. SSA supports the existence of GCs as a deadline, and so would similarly support the existence of ICs that trigger false vacuum decay as a deadline.
As mentioned, I model the completion time of the hard steps with the Gamma distribution, which has PDF
$$f(t; n, \nu) = \frac{t^{n-1} e^{-t/\nu}}{\Gamma(n)\, \nu^n},$$
where $\nu$ is the hardness (the Gamma scale). When $t \ll \nu$ we have $e^{-t/\nu} \approx 1$, and so $f(t) \propto t^{n-1}$. That is, when the steps are sufficiently hard, the probability of completion by time $t$ grows as a polynomial in $t$. Increasing $n$ leads to a greater 'clumping' of completions near the end of the possible time available.
When hard steps are present, it also means that longer-habitable planets will see a greater fraction of life than shorter-lived planets. For example, a planet habitable for 50 Gy will have approximately $10^n$ times greater probability of life appearing than a planet habitable for 5 Gy.
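A quick numerical check of this scaling (a sketch with a hypothetical hardness far exceeding both lifetimes, so the small-$t$ approximation applies):

```python
from scipy.stats import gamma

hardness = 1e4   # hypothetical Gamma scale, much larger than both lifetimes

for n in (1, 2, 5):
    p_50 = gamma.cdf(50, a=n, scale=hardness)   # P(n steps done within 50 Gy)
    p_5 = gamma.cdf(5, a=n, scale=hardness)     # P(n steps done within 5 Gy)
    print(f"n={n}: ratio = {p_50 / p_5:.1f}")   # approximately 10**n
```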
For anthropic theories that update towards worlds where observers like us are more typical, such as the self-sampling assumption, increasing $n$ while allowing longer-lived planets makes observers like us less typical.
The post Replicating and extending the grabby aliens model appeared first on Center on Long-Term Risk.
The post Plans for 2022 & Review of 2021 appeared first on Center on Long-Term Risk.
Our goal is to reduce the worst risks of astronomical suffering (s-risks) from emerging technologies. To this end, we work on addressing the worst-case risks from the development and deployment of advanced AI systems. We are currently focused on conflict scenarios as well as technical and philosophical aspects of cooperation.
We have been based in London since late 2019. Our team is currently about fourteen full-time equivalents strong, with most of our employees full-time.
We review our work across organizational functions by combining a subjective assessment with a list of tangible outputs and activities. The assessments were written by senior staff members.
Guiding question: Have we made progress towards becoming a research group and community that will have an outsized impact on the research landscape and on the actors relevant to reducing s-risk?
Across all dimensions, it seems to us that we are in a better position compared to last year:
Guiding questions:
Our continued work on better understanding the causes of conflict has progressed significantly. We have developed some initial internal tools (e.g., game-theoretic models) that will allow us to explore different conflict scenarios and their implications more rigorously. We expect this work to be helpful in (i) informing future work on prioritization and (ii) communicating about conflict dynamics to (and eliciting advice from) various important audiences, including those new to s-risk research, external longtermist researchers, and stakeholders at AI labs. This has already resulted in some fruitful conversations. By providing a set of tools for more “paradigmatic” research (in the form of game-theoretic models), this line of work also opens up more opportunities for people to contribute to s-risk research.
We made progress in our work on Cooperative AI. Our Normative Disagreement paper was accepted at two NeurIPS workshops (Cooperative AI and Strategic ML). We also began working on clarifying foundational Cooperative AI concepts, such as what it would mean to work towards differential progress on cooperation. We hope that this work will feed into work on benchmarks by the Cooperative AI Foundation (CAIF).
We made some progress in thinking about intervening on AI agents’ values in a coarse-grained way so that they at least bargain cooperatively, even if otherwise misaligned. While we had previously been aware of this intervention class, only this year did we start to name it as a distinct, potentially promising area for research and intervention and begin to work on developing and evaluating concrete interventions.
Some staff have started to explore frameworks from the literature on decision-making under deep uncertainty as well as their implications for our strategy. This was the result of research and extensive discussions about the potential for large unintended negative consequences from efforts to shape the long-term future.
We published some work on AI forecasting which increased our own understanding of the topic and seemed to have been well received by the wider community.
We made less progress than we had hoped or planned on better understanding the risks from malevolent actors because a key staff member in this area fell sick for most of the year.
Guiding questions:
Our overall impression is that of continued but modest growth and progress. The various programs and events that we ran seem to have engaged people we had not previously been aware of and deepened the engagement of some people we had already known. Our individual calls and meetings put some new people on our radar and impacted some meaningful career decisions (though our counterfactual influence is hard to assess since we don’t yet do systematic evaluations). To the extent that we can already assess the outcomes from grants of the CLR Fund, they seem to have resulted in some meaningful publications and activities.
In February and March, we ran two s-risk intro seminars with about fifteen participants each. The participant feedback was generally very positive. The average response to the question "How likely are you to recommend the Intro Seminar to a value-aligned friend or colleague?" was 4.7 out of 5 and 4.8 out of 5 for the two seminars respectively.
From the end of June until the end of October, we ran a Summer Research Fellowship with two cohorts of seven fellows each (Adrià Garriga-Alonso, Euan McLean, Francis Priestland, Gustavs Zilgalvis, Rory Svarc, Tom Shlomi, and Tristan Cook; Francis Rhys Ward, Jack Koch, Julia Karbing, Lewis Hammond, Megan Kinniment Williams, Nicolas Macé, and Sara Haxhia). Another fellow, Hadrien Pouget, spent three months at CLR during the spring.
The feedback on the fellowship was generally very positive. Among the fellows who responded to our survey, all answered the question "Are you glad that you participated in the fellowship?" with a 5 out of 5. The average response to the question "If the same program happened next year, would you recommend a friend (with similar background to you before the fellowship) to apply?" was 9.9 out of 10.
Three fellows also ended up joining our team in permanent positions.
We conducted over seventy 1:1 calls and meetings with potentially promising people. This also included various office visits by people. (We don’t yet collect systematic feedback on these.)
There were many changes to the fund management: Emery Cooper replaced Lukas Gloor; Stefan Torges replaced Jonas Vollmer; Tobias Baumann replaced Brian Tomasik; Chi Nguyen also joined as a fund manager.
We made the following grants in 2021 (more details here):
Two and a half years ago, we worked with Nick Beckstead from the Open Philanthropy Project to develop a set of communication guidelines for discussing astronomical stakes. In brief, Nick’s guidelines for the broader longtermist community recommend highlighting beliefs and priorities that are important to the s-risk-oriented community. Our guidelines for those focused on s-risks recommend communicating in a nuanced manner about pessimistic views of the long-term future by considering highlighting moral cooperation and uncertainty, focusing more on practical questions if possible, and anticipating potential misunderstandings and misrepresentations.
We had originally planned to reassess the costs and benefits at the end of 2020. We ended up pushing this into 2021. After talking to staff at the Open Philanthropy Project, we decided to extend our commitment to the communication guidelines until at least the end of 2022. However, since we were not able to devote as many resources to this project as we would have liked to, we have planned a more thorough effort for this year.
Guiding question:
As planned, we started advising the Center for Emerging Risk Research (CERR). They are a new nonprofit with the mission to improve the quality of life of future generations. Overall, we are ambivalent about our progress in this area. On the one hand, we are satisfied with the size and rigor of the grant recommendations that we made. On the other hand, we failed to make progress on systematic investigations of cause areas and promising interventions. Instead, we usually investigated opportunities that we learned about through our existing network.
Based in part on our recommendations, CERR made the following investments or grants:
Guiding questions:
Our capacity in this area is roughly at the same level as at the beginning of the year due to staff turnover. That means it is not yet as high as we would like it to be, but we expect this to change over the coming months as our new hire gets used to their role. That being said, we are still able to maintain all the important functions of the organization and push forward vital changes in the operational setup of CLR.
Guiding question: Are we a healthy organization with an effective board, appropriate evaluation of our work, reliable policies and procedures, adequate financial reserves and reporting, and high morale?
Members of the board: Tobias Baumann, Max Daniel, Ruairi Donnelly (chair), Chi Nguyen, Jonas Vollmer (replaced David Althaus in December)
The main role of the board is to decide CLR’s leadership and structure, to resolve organizational conflicts at the highest level, and to advise CLR leadership on important questions. Generally, CLR staff seem to agree that they have been effective in that role. There are, however, different views within the organization as to how well they resolved one incident in particular.
We collect systematic feedback on big community-building and operations projects. We currently do not conduct any systematic evaluation of our research, especially from external peers. This is not ideal. We had already planned to address this in 2021 but failed to do so due to a lack of capacity. It is also generally a difficult problem to solve due to our idiosyncratic priorities.
Overall, it is our impression that our policies are effective and cover the most relevant areas. However, it might always seem this way until we realize that we would have needed a policy for resolving a particular issue. For instance, we added two policies in response to a staff incident this year. So we plan to conduct a systematic review of our policies in 2022.
Our budget increased substantially after our move to London from Berlin, primarily due to an increase in salaries resulting from higher costs of living.
Still, primarily due to the support of the Center for Emerging Risk Research (CERR), we are currently in a good financial position. However, without their continued support, we might face serious difficulties maintaining our operations at the current level.
Net asset estimate in early December 2021 (all figures in CHF (1 CHF ≈ 1.09 USD ≈ 0.82 GBP)):
Monthly staff average for the question “How much do you currently enjoy being part of CLR?” was 7.7 this year (compared to 7.6 in 2020 and 8.0 in 2019). However, the response rate for this question in 2021 was low.
We are hoping to hire new permanent researchers. We are also currently hiring summer research fellows to join us temporarily. You can find the details to apply to both of these opportunities here. The application deadline is February 27, 2022.
The current Research Leads at CLR are Jesse Clifton, Emery Cooper, and Daniel Kokotajlo. They set their own research priorities as well as those of the people on their team. Jesse Clifton leads the Causes of Conflict Research Group at CLR while Emery Cooper and Daniel Kokotajlo currently work alone and with a research assistant respectively.
The current priorities for Jesse and his team are:
Emery’s current research priorities are:
Daniel’s research priorities for 2022 are:
Our work in this area will continue largely along the lines of previous years since we are broadly satisfied with the outcomes it has been producing. We will continue to try to identify and advise people interested in s-risks through 1:1 calls and meetings. We will run an intro fellowship for effective altruists who are interested in learning more about our work. We will make grants through the CLR Fund, mostly to support individuals in our community. We will run a summer research fellowship to allow people to test their fit for research on our priorities.
There are two ways in which we could imagine changing or expanding our work. First, after revisiting our communication strategy, we might explore broader communication about our research and focus areas (e.g., through podcast appearances, EA Forum posts, public talks). Second, we might invest more effort into building community infrastructure and platforms of exchange (e.g., regular retreats or an internal forum for those working on reducing s-risks).
CLR staff will continue to advise CERR on their grantmaking.
We initiated various organizational change projects in 2021 that will require implementation effort by our operations team in 2022:
In addition to a lot of business-as-usual work (e.g., accounting, office management, hiring logistics & onboarding, reporting), the operations team is considering the following projects:
The post Plans for 2022 & Review of 2021 appeared first on Center on Long-Term Risk.
The post Surrogate goals and safe Pareto improvements appeared first on Center on Long-Term Risk.
Surrogate goals have also been discussed or at least mentioned in, among other places, Section 4.2 of CLR's research agenda and the 80,000 Hours podcast (guest: Paul Christiano).
The post Surrogate goals and safe Pareto improvements appeared first on Center on Long-Term Risk.
The post Normative Disagreement as a Challenge for Cooperative AI appeared first on Center on Long-Term Risk.
This paper was accepted at the Cooperative AI workshop and the Strategic ML workshop at NeurIPS 2021.
Cooperation in settings where agents have both common and conflicting interests (mixed-motive environments) has recently received considerable attention in multi-agent learning. However, the mixed-motive environments typically studied have a single cooperative outcome on which all agents can agree. Many real-world multi-agent environments are instead bargaining problems (BPs): they have several Pareto-optimal payoff profiles over which agents have conflicting preferences. We argue that typical cooperation-inducing learning algorithms fail to cooperate in BPs when there is room for normative disagreement resulting in the existence of multiple competing cooperative equilibria and illustrate this problem empirically. To remedy the issue, we introduce the notion of norm-adaptive policies. Norm-adaptive policies are capable of behaving according to different norms in different circumstances, creating opportunities for resolving normative disagreement. We develop a class of norm-adaptive policies and show in experiments that these significantly increase cooperation. However, norm-adaptiveness cannot address residual bargaining failure arising from a fundamental tradeoff between exploitability and cooperative robustness.
Multi-agent contexts often exhibit opportunities for cooperation: situations where joint action can lead to mutual benefits [Dafoe et al., 2020]. Individuals can engage in mutually beneficial trade; nation-states can enter into treaties instead of going to war; disputants can settle out of court rather than engaging in costly litigation. But a hurdle common to each of these examples is that the agents will disagree about their ideal agreement. Even if agreements benefit all parties relative to the status quo, different agreements will benefit different parties to different degrees. These circumstances can be called bargaining problems [Schelling, 1956].
As AI systems are deployed to act on behalf of humans in more real-world circumstances, they will need to be able to act effectively in bargaining problems — from commercial negotiations in the nearer term (e.g., Chakraborty et al. [2020]) to high-stakes strategic decision-making in the longer-term [Geist and Lohn, 2018]. Moreover, agents may be trained independently and offline before interacting with one another in the world. This raises concerns about future AI systems following incompatible norms for arriving at solutions to bargaining problems, analogously to disagreements about fairness which create hurdles to international cooperation on critical issues such as climate policy [Albin, 2001, Ringius et al., 2002].
Our contributions are as follows. We introduce a taxonomy of cooperation games, including bargaining problems (Section 3). We relate their difficulty to the degree of normative disagreement, i.e., differences over principles for selecting among mutually beneficial outcomes, which we formalize in terms of welfare functions. Normative disagreement does not arise in purely cooperative games or simple sequential social dilemmas [Leibo et al., 2017], but is an important obstacle for cooperation in what we call asymmetric bargaining problems. Following this, we introduce the notion of norm-adaptive policies – policies which can play according to different norms depending on the circumstances. In several multi-agent learning environments, we highlight the difficulty of bargaining between norm-unadaptive policies (Section 4). We then contrast this with a class of norm-adaptive policies (Section 5) based on Lerer and Peysakhovich [2017]'s approximate Markov tit-for-tat algorithm. We show that this improves performance in bargaining problems. However, there remain limitations, most fundamentally a tradeoff between exploitability and the robustness of cooperation.
The field of multi-agent learning (MAL) has recently paid considerable attention to problems of cooperation in mixed-motive games, in which agents have conflicting preferences. Much of this work has been focused on sequential social dilemmas (SSDs) (e.g., Peysakhovich and Lerer 2017, Lerer and Peysakhovich 2017, Eccles et al. 2019). The classic example of a social dilemma is the Prisoner’s Dilemma, and the SSDs studied in this literature are similar to the Prisoner’s Dilemma in that there is a single salient notion of “cooperation”. This means that it is relatively easy for actors to coordinate in their selection of policies to deploy in these settings.
Cao et al. [2018] look at negotiation between deep reinforcement learners, but not between independently trained agents. Several authors have recently investigated the board game Diplomacy [Paquette et al., 2019, Anthony et al., 2020, Gray et al., 2021] which contains implicit bargaining problems amongst players. Bargaining problems are also investigated in older MAL literature (e.g., Crandall and Goodrich 2011) as well as literature on automated negotiation (e.g., Kraus and Arkin 2001, Baarslag et al. 2013), but also not between independently trained agents. Considerable work has gone into understanding the emergence of norms in both humans [Bendor and Swistak, 2001, Boyd and Richerson, 2009] and artificial societies [Walker and Wooldridge, 1995, Shoham and Tennenholtz, 1997, Sen and Airiau, 2007]. Especially relevant are empirical studies of bargaining across cultural contexts [Henrich et al., 2001]. Recent multi-agent reinforcement learning work on norms [Hadfield-Menell et al., 2019, Lerer and Peysakhovich, 2019, Köster et al., 2020] is also relevant here, as bargaining problems can be understood as settings in which there are multiple efficient but incompatible norms. However, much less attention has been paid in these literatures to how agents with different norms are or aren't able to overcome normative disagreement.
There is a large game-theoretic literature on bargaining (for a review see Muthoo [2001]). This includes a long tradition of work on cooperative bargaining solutions, which tries to establish normative principles for deciding among mutually beneficial agreements [Thomson, 1994]. We will draw on this work in our discussion of normative (dis)agreement below.
Lastly, the class of norm-adaptive policies we develop in Section 5 can be seen as a more general variant of an approach proposed by Boutilier [1999] for coordinating in pure coordination games. As it implicitly searches for overlap in the agents' sets of allowed welfare functions, it is also similar to Rosenschein and Genesereth [1988]'s approach to reaching agreement in general-sum games via sets of proposals by each agent.
We are interested in a setting in which multiple actors (“principals”) train machine learning systems offline, and then deploy them into an environment in which they interact. For instance, two different companies might separately train systems to negotiate on their behalf and deploy them without explicit coordination on deployment. In this section, we develop a taxonomy of environments that these agents might face and relate these different types of environments to the difficulty of bargaining.
We formalize multi-agent environments as partially observable stochastic games (POSGs). For simplicity we assume two players, $i \in \{1, 2\}$, and index player $i$'s counterpart by $-i$. Each player has an action space $A_i$. There is a space of states $S$ which evolve according to a Markovian transition function. At each time step, each player sees an observation $o_i^t$ which depends on the current state. Thus each player has an accumulating history of observations $h_i^t$. We refer to the set of all histories for player $i$ as $H_i$ and assume for simplicity that the initial observation history is fixed and common knowledge. Finally, principals choose among policies $\pi_i$, which we imagine as artificial agents deployed by the principals. We will refer to policy profiles generically as $\pi$.
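A minimal sketch of this interface in code (the field names and type choices here are ours, not the paper's):

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

State, Action, Obs = int, int, int

@dataclass
class TwoPlayerPOSG:
    states: Sequence[State]
    actions: Tuple[Sequence[Action], Sequence[Action]]      # A_1, A_2
    transition: Callable[[State, Action, Action], State]    # Markovian T
    observe: Callable[[State, int], Obs]                    # player i's observation of the state
    rewards: Callable[[State, Action, Action, int], float]  # player i's reward

# A policy maps a player's accumulated observation history to an action.
Policy = Callable[[Tuple[Obs, ...]], Action]
```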
We define a coordination problem as a game involving multiple Pareto-optimal equilibria (cf. Zhang and Hofbauer 2015), which require some coordinated action to achieve. That is, if the players disagree about which equilibrium they are playing, they will not reach a Pareto-optimal outcome. A pure coordination problem is a game in which there are multiple Pareto-optimal equilibria over which agents have identical interests. Although agents may still experience difficulties in pure coordination games, for instance due to a noisy communication channel, they are made easier by the fact that principals are indifferent between the Pareto-optimal equilibria.
We define a bargaining problem (BP) to be a game in which there are multiple Pareto-optimal equilibria over which the principals have conflicting preferences. These equilibria represent more than one way to collaborate for mutual benefit, or put in another way, for sharing a surplus. Thus a bargaining problem is a mixed-motive coordination problem, in which there is conflicting interest between Pareto-optimal equilibria and common interest in reaching a Pareto-optimal equilibrium.
We can distinguish between BPs which are symmetric and asymmetric games. A 2-player game is symmetric if for any attainable payoff profile (a, b) there exists a profile (b, a). The reason this distinction is important is that all (finite) symmetric games have a symmetric Nash equilibrium [Nash, 1990]; thus symmetric games have a natural set of focal points [Schelling, 1958] for aiding coordination in mixed-motive contexts, while asymmetric BPs may not. Similarly, given a chance to play a correlated equilibrium [Aumann, 1974], agents in a symmetric BP could play according to a correlated equilibrium which randomizes using a symmetric distribution over Pareto-optimal payoff profiles.
Figure 2 displays the payoff matrices of three coordination games: Pure Coordination, and two variants of Bach or Stravinsky (BoS), one of which is a symmetric BP and one of which is an asymmetric BP. Pure Coordination is not a BP because it is not a mixed-motive game: the players only care about playing the same action. In the case of symmetric BoS, on the other hand, the players do have conflicting interests; however, there is a correlated equilibrium – tossing a commonly observed fair coin – that is intuitively the most reasonable way of coordinating: it both maximizes the total payoff and offers each player the same expected reward.
To develop a better intuition for the sense in which equilibria can be more or less reasonable, consider a BoS with an extreme asymmetry, with equilibrium payoffs (15, 10) and (1, 11). Even though each of these equilibria is Pareto-optimal, the latter seems unreasonable or uncooperative: it yields a lower total payoff, more inequality, and lowers the reward of the worst-off player in the equilibrium. To formalize this intuition, we characterize the reasonableness of a Pareto-optimal payoff profile in terms of the extent to which it optimizes welfare functions: we can say that (1, 11) is unreasonable because there is no (impartial, see below) welfare function that would prefer it. Different welfare functions with different properties have been introduced in the literature (see Appendix A), but two uncontroversial properties of a welfare function are Pareto-optimality (i.e., its optimizer should be Pareto-optimal) and impartiality (i.e., the welfare of a policy profile should be invariant to permutations of player indices). From the latter property, we can observe that the intuitively reasonable choice of playing the correlated equilibrium with a fair correlation device in the case of symmetric games is also the choice which all impartial welfare functions recommend, provided that it is possible for the agents to play a correlated equilibrium.
By contrast, in the asymmetric BoS from Figure 2 we see that playing BB maximizes utilitarian welfare $a + b$, whereas playing SS maximizes egalitarian welfare $\min(a, b)$ subject to Pareto-optimality. Throwing a correlated fair coin to choose between the two would lead to an expected payoff that is optimal with respect to the Nash welfare $a \cdot b$. Each of these different equilibria has a normative principle to motivate it.
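To illustrate, the sketch below shows three impartial welfare functions selecting three different options in an asymmetric BoS. Since the payoff matrices of Figure 2 are not reproduced here, the equilibrium payoffs are hypothetical stand-ins chosen so that the three criteria disagree.

```python
# Hypothetical equilibrium payoffs for an asymmetric BoS (not Figure 2's).
profiles = {"BB": (4.0, 1.0), "SS": (2.0, 2.0)}
# Expected payoff of a correlated fair coin over the two pure equilibria:
profiles["coin"] = tuple((x + y) / 2 for x, y in zip(profiles["BB"], profiles["SS"]))

welfare = {
    "utilitarian": lambda a, b: a + b,
    "egalitarian": lambda a, b: min(a, b),
    "Nash": lambda a, b: a * b,
}

for name, w in welfare.items():
    best = max(profiles, key=lambda p: w(*profiles[p]))
    print(f"{name} welfare selects {best}")  # BB, SS, coin respectively
```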
In the best case, all principals agree on the same welfare function as a common ground for coordination in asymmetric BPs. However, the principals may have reasonable differences with respect to which welfare function they perceive as fair, and so they may train their systems to optimize different welfare functions, leading to coordination failure when the systems interact after deployment. In cases where agents were independently trained according to inconsistent welfare functions, we will say that there is normative disagreement. There may be different degrees of normative disagreement; for instance, how far apart the outcomes recommended by different welfare functions are varies across games.
To summarize, we relate the difficulty of coordination problems to the concept of welfare functions: in pure coordination problems, they are not needed. In symmetric bargaining problems, they all point to the same equilibria. And in asymmetric bargaining problems, they can serve to filter out intuitively unreasonable equilibria, but leave open the possibility of normative disagreement between reasonable ones. This makes normative disagreement a critical challenge for bargaining. For this reason, we will focus on asymmetric bargaining problems in the remainder of the paper.
When there is potential for normative disagreement, it can be helpful for agents to have some flexibility as to the norms they play according to.
A number of definitions of norms have been proposed in the social scientific literature (e.g., Gibbs [1965]), but they tend to agree that a norm is a rule specifying acceptable and unacceptable behaviors in a group of people, along with sanctions for violations of that rule. Game-theoretic work sometimes identifies norms with equilibrium selection devices (e.g., Binmore and Samuelson 1994, Young 1996). Given that complex games generally exhibit many equilibria, some rule (such as maximizing a particular welfare function) is needed to select among them.
Normative disagreement arises (among other reasons) from the underdetermination of good behavior in complex multi-agent settings. This is exemplified by the problem of conflicting equilibrium selection criteria in asymmetric bargaining problems, but there are other possible cases of underdetermination. One example is the underdetermination of the beliefs that a reasonable agent should act according to in games of incomplete information (cf. the common prior assumption in Bayesian games [Morris, 1995]). Thus our definition of norm will be more general than an equilibrium selection device, though in the remainder of the paper we will focus on the use of welfare functions to choose among equilibria.
Definition 3.1 (Norms). Given a 2-player POSG, a norm is a tuple consisting of (i) normative policies, i.e., policies which comply with the norm; (ii) “punishment” policies, which are enacted when deviations from a normative policy are detected; and (iii) rules for judging whether a deviation has happened and how to respond. A policy is compatible with a norm if it plays according to the normative policy as long as no deviation has been detected and otherwise responds as the norm’s rules prescribe.
For example, in the iterated Asymmetric BoS, one norm is for both players to always play B (this is the normative policy) and for a player to respond to deviations by continuing to play B. A similar norm is given by both players always playing S instead. More generally, following the folk theorems [Mailath et al., 2006], an equilibrium in a repeated game corresponds to a norm in which the profile of normative policies corresponds to play on the equilibrium path; the judgment rules indicate whether there is a deviation from equilibrium play, and punishment policies are chosen such that players are made worse off by deviating than by continuing to play according to the normative policy profile.
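As an illustration of Definition 3.1, here is a minimal Python sketch of the first norm above (the action labels B and S and the history format are illustrative assumptions, not the paper’s formalism):

```python
# Sketch of a norm for iterated Bach or Stravinsky (Definition 3.1).
# Actions are "B" and "S"; a history is a list of (own_action, other_action) pairs.

def normative_policy(history):
    return "B"                      # on-path behavior: always play B

def punishment_policy(history):
    return "B"                      # response to deviations: keep playing B

def deviation_detected(history):
    # Judge a deviation to have happened if the other player ever played
    # something other than the normative action.
    return any(other != "B" for _own, other in history)

def norm_compliant_action(history):
    """A policy compatible with the norm: follow the normative policy until a
    deviation is detected, then switch to the punishment policy."""
    if deviation_detected(history):
        return punishment_policy(history)
    return normative_policy(history)
```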
Now we formally define norm-adaptive policies.
Definition 3.2 (Norm-adaptive policies). Take a 2-player POSG and a non-singleton set of norms for that game. Let there be a surjective function that maps histories of observations to norms (i.e., for each norm in the set, there are histories for which the function chooses that norm). Then, a policy is norm-adaptive if, for every norm in the set, it behaves compatibly with that norm on the histories which the function maps to it.
That is, norm-adaptive policies are able to play according to different norms depending on the circumstances. As we will see below, the benefit of making agents explicitly norm-adaptive is that this can help to prevent or resolve normative disagreement. Lastly, note that we can define higher-order norms and higher-order norm-adaptiveness: a higher-order norm is a norm whose normative policies are themselves norm-adaptive with respect to some set of norms. This framework allows us to discuss differing (higher-order) norms for resolving normative disagreement.
In this section, we illustrate how cooperation-inducing, but norm-unadaptive, multi-agent learning algorithms fail to cooperate in asymmetric BPs. In Section 5 we will then show how norm-adaptiveness improves cooperation. The environments and algorithms considered are summarized in Table 1.
In order to include both algorithms which use an explicitly specified welfare function and ones which do not, we use the Learning with Opponent-Learning Awareness (LOLA) algorithm [Foerster et al., 2018], in its policy gradient and exact value function optimization versions, as an example of the latter, and a generalized version of Approximate Markov Tit-for-tat for the former.
We introduce this generalized version as a variant of Lerer and Peysakhovich [2017]’s Approximate Markov Tit-for-tat (amTFT). The original algorithm trains a cooperative policy profile on the utilitarian welfare, as well as a punishment policy, and switches from the cooperative policy to the punishment policy when it detects that the other player is defecting from the cooperative policy. The algorithm has the appeal that it “cooperates with itself, is robust against defectors, and incentivizes cooperation from its partner” [Lerer and Peysakhovich, 2017]. We consider the more general class of algorithms, which we refer to as generalized amTFT, in which the cooperative policy is constructed by optimizing an arbitrary welfare function. Note that although generalized amTFT takes a welfare function as an argument, the resulting policies are not norm-adaptive. To cover a range of environments representing both symmetric and asymmetric games, we use some existing multi-agent reinforcement learning environments (IPD, Coin Game [CG; Lerer and Peysakhovich 2017]) and introduce two new ones (IAsymBoS, ABCG). A sketch of the generalized amTFT switching logic is given below.
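The following is a rough sketch of that switching logic; it is not the actual implementation (policy training, debit accounting, and the rollout-based punishment length from Appendix C are omitted, and all names are illustrative):

```python
class GeneralizedAmTFT:
    """Sketch: play a cooperative policy trained to optimize a given welfare function;
    switch to a punishment policy for k steps whenever a defection is detected."""

    def __init__(self, cooperative_policy, punishment_policy, is_defection, k):
        self.cooperative_policy = cooperative_policy    # trained on welfare function w
        self.punishment_policy = punishment_policy
        self.is_defection = is_defection                # judges the opponent's behavior
        self.k = k                                      # punishment duration (illustrative)
        self.punishment_steps_left = 0

    def act(self, observation, history):
        if self.punishment_steps_left > 0:
            self.punishment_steps_left -= 1
            return self.punishment_policy(observation)
        if self.is_defection(history):
            self.punishment_steps_left = self.k
            return self.punishment_policy(observation)
        return self.cooperative_policy(observation)
```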
Iterated asymmetric Bach or Stravinsky (IAsymBoS) is an asymmetric version of the iterated Bach or Stravinsky matrix game. At each time step, the game defined on the right in Figure 2 is played. We focus on the asymmetric variant due to the argument in Section 3.3 that players could resolve the symmetric version by playing a symmetric equilibrium; however, applying LOLA without modification would also lead to coordination failure in the symmetric variant. It should also be noted that IAsymBoS is not an SSD because it does not incentivize defection: agents cannot gain from miscoordinating. We therefore consider IAsymBoS a minimal example of an environment that can produce bargaining failures out of normative disagreement.
For a more involved example, we also introduce an asymmetric version of the stochastic gridworld Coin Game [Lerer and Peysakhovich, 2017] – the asymmetric bargaining Coin Game (ABCG) – which is both an SSD and an asymmetric BP. In ABCG, a red and a blue agent navigate a grid with coins. Two coins simultaneously appear on this grid: a Cooperation coin and a Disagreement coin, each colored red or blue. Cooperation coins can only be consumed by both players moving onto them at the same time, whereas the Disagreement coin can only be consumed by the player of the same color as the coin. The game is designed such that total (utilitarian) welfare is maximized by both players always consuming the Cooperation coin; however, this makes one player benefit more than the other. Due to the sequential nature of the game, this means that welfare functions which also care about (in)equity will favor an equilibrium in which the player who benefits less from the Cooperation coin is allowed to consume the Disagreement coin from time to time without retaliation.
Players move simultaneously in all games. Note that we assume each player to have full knowledge of the other player’s rewards, as this is required by LOLA-Exact during training and by generalized amTFT both at training and at deployment. We use simple tabular and neural network policy parameterizations, which are described in Appendix C along with the learning algorithms.
Table 1: Summary of the environments and learning algorithms that we use to study sequential social dilemmas (SSDs) and asymmetric bargaining problems in this section.
We train policies in the environments listed in Table 1 with the corresponding MARL algorithms. After training, we evaluate cooperative success in self-play and cross-play. Self-play refers to average performance between jointly-trained policies. Cross-play refers to average performance between independently-trained policies. We distinguish between two kinds of cross-play: between agents trained using the same notion of collective optimality (such as a welfare function or an inequity-aversion term in the value function), and between agents trained using different ones. For generalized amTFT we use the utilitarian and egalitarian welfare functions as well as an inequity-averse welfare function (see Appendix C).
Comparing cooperative success across environments is not straightforward: environments have different sets of feasible payoffs, and we should not evaluate cooperative success with a single welfare function, because our concerns about normative disagreement stem from the fact that it is not obvious which single welfare function to use. First, we take the set of welfare functions that we use in our experiments in the environment in question; for IAsymBoS and generalized amTFT, for instance, these are the utilitarian and egalitarian welfare functions. Second, we define disagreement payoff profiles corresponding to cooperation failure; in IAsymBoS, the disagreement policy profiles are those with payoff profile (0, 0). Then, for each policy profile, we compute a normalized score that measures how close its payoffs come, relative to the disagreement payoffs, to the best payoffs achievable under some welfare function in this set. Under this scoring method, players do maximally well when they play a policy profile which is optimal according to some welfare function in the set.
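Since the exact normalization formula is not reproduced above, the following Python sketch implements one scoring rule consistent with the description: the disagreement profile scores 0, and a profile that is optimal under some welfare function in the set scores 1 (the welfare functions and payoff values are illustrative assumptions):

```python
def utilitarian(p):
    return p[0] + p[1]

def egalitarian(p):
    return min(p)

def normalized_score(payoff, welfare_fns, disagreement, optima):
    """Max over welfare functions of the payoff's welfare, rescaled so that the
    disagreement profile scores 0 and a welfare-optimal profile scores 1."""
    scores = []
    for w in welfare_fns:
        best, worst = w(optima[w]), w(disagreement)
        scores.append((w(payoff) - worst) / (best - worst))
    return max(scores)

# Hypothetical IAsymBoS-style values: disagreement (0, 0); (12, 1) optimal for
# utilitarian welfare and (4, 6) optimal for egalitarian welfare.
optima = {utilitarian: (12, 1), egalitarian: (4, 6)}
print(normalized_score((12, 1), [utilitarian, egalitarian], (0, 0), optima))  # 1.0
print(normalized_score((0, 0), [utilitarian, egalitarian], (0, 0), optima))   # 0.0
```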
Figure 3 illustrates the difference between cooperation in SSDs and bargaining problems. When there is a single “cooperative” equilibrium, as in the case of IPD and CG, cooperation-inducing learning algorithms typically achieve cooperation in cross-play. In contrast, in IAsymBoS and ABCG we observe mild performance degradation in cross-play where agents optimize the same welfare functions, and strong degradation when agents optimize different welfare functions.
The cooperation-inducing properties of the algorithms in Section 4 are simple and are not designed to help agents resolve potential normative disagreement to avoid Pareto-dominated outcomes. The two main problems are (1) that the algorithms are ill-equipped for reacting to normative disagreement, and (2) that they may confuse normative disagreement with defection.
The former problem is already evident in IAsymBoS. There, playing according to incompatible welfare functions is not interpreted as defection by generalized amTFT. This is not necessarily bad – we claim that normative disagreement should be treated differently from defection – but it does mean that generalized amTFT lacks the policy space to react to normative disagreement. For the latter problem, we observe that in ABCG the generalized amTFT agent does classify some of the opponent's actions as defection, even though those actions are aimed at optimizing an impartial welfare function, and punishes accordingly.
To overcome both of these problems, we propose a norm-adaptive modification to generalized amTFT. As we only aim to illustrate the benefit of norm-adaptive policies, we keep the implementation simple: instead of a single welfare function, the algorithm now takes a set of welfare functions. The agent starts out playing according to some welfare function from this set; the initial choice can be made at random or according to a preference ordering over the set.
The norm-adaptive agent then follows a two-stage decision process. First, if the agent detects that its opponent is merely playing according to a different welfare function than itself, the agent's welfare function gets re-sampled. Second, if the agent detects defection by the opponent, it will punish, just as in the original algorithm. However, by first checking for normative disagreement, we make sure that punishment does not happen as a result of a normative disagreement.
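A minimal sketch of this two-stage decision process follows (all names are illustrative; the detectors for normative disagreement and for defection are left abstract, and re-sampling is shown as uniform):

```python
import random

class NormAdaptiveAmTFT:
    """Sketch: hold a set of welfare functions; re-sample the one being played when
    mere normative disagreement is detected, and punish only genuine defection."""

    def __init__(self, policies_by_welfare, punishment_policy,
                 detects_other_welfare, detects_defection):
        self.policies = policies_by_welfare              # {welfare function: cooperative policy}
        self.punishment_policy = punishment_policy
        self.detects_other_welfare = detects_other_welfare
        self.detects_defection = detects_defection
        self.current_w = random.choice(list(self.policies))

    def act(self, observation, history):
        # Stage 1: check for normative disagreement and, if found, re-sample uniformly.
        if self.detects_other_welfare(history, self.current_w):
            self.current_w = random.choice(list(self.policies))
        # Stage 2: only behavior not explained by a different welfare function counts
        # as defection and triggers punishment.
        elif self.detects_defection(history, self.current_w):
            return self.punishment_policy(observation)
        return self.policies[self.current_w](observation)
```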
In Figure 5 we illustrate how, assuming uniform re-sampling from the welfare-function set, the norm-adaptive agent can overcome normative disagreement and perform close to the Pareto frontier. Agents need not sample uniformly, though. For instance, when the number of possible welfare functions is large, it would be beneficial to put higher probability on welfare functions which one's counterpart is more likely to be optimizing. Furthermore, an agent might want to put higher probability on welfare functions that it prefers.
Notice that, other things being equal, one player using the norm-adaptive variant rather than plain generalized amTFT leads to a (weak) Pareto improvement. Beyond that, in anticipation of bargaining failure due to not having a way to resolve normative disagreement, players are incentivized to include more than just their preferred welfare function in their set. In both IAsymBoS (see Figure 5) and ABCG (see Table 5, Appendix D) we observe significant improvements in cross-play when at least one player is norm-adaptive.
As our experiments with the norm-adaptive variant show, agents who are more flexible are less prone to bargaining failure due to normative disagreement. However, they are prone to having that flexibility exploited by their counterparts. For instance, an agent which is open to optimizing either of two welfare functions will end up optimizing whichever one its counterpart insists on when playing against an agent whose set contains only its own preferred welfare function. More generally, an agent who, when re-sampling, puts higher probability on a welfare function it prefers will be less robust against counterparts who disprefer that welfare function and put a lower probability on it. An agent who tries to guess the counterpart's welfare function and accommodate it is exploitable by agents who do not.
We formally introduce a class of hard cooperation problems – asymmetric bargaining problems – and situate them within a wider game taxonomy. We argue that they are hard because normative disagreement can arise between multiple “reasonable” cooperative equilibria, characterized by divergence in the outcomes preferred according to different welfare functions. This presents a problem for those deploying AI systems without coordinating on the norms those systems follow. We have introduced the notion of norm-adaptive policies, which allow agents to change the norms according to which they play, giving rise to opportunities for resolving normative disagreement. As an example of a class of norm-adaptive policies, we introduced a norm-adaptive variant of generalized amTFT and showed in experiments that it tends to improve robustness to normative disagreement. On the other hand, we have demonstrated a robustness-exploitability tradeoff, under which methods that learn more normatively flexible strategies are also more vulnerable to exploitation.
There are a number of limitations to this work. We have throughout assumed that the agents have a common and correctly-specified model of their environment, including their counterpart’s reward function. In real-world situations, however, principals may not have identical simulators with which to train their systems, and there are well-known obstacles to the honest disclosure of preferences [Hurwicz, 1972], meaning that common knowledge of rewards may be unrealistic. Similarly, we assumed a certain degree of reasonableness on the part of the principals, seen for instance in the willingness to play the symmetric correlated equilibrium in symmetric BoS (Section 3). However, we believe this to be a minimal assumption, as the deployers of such agents are aware of the risk of coordination failure as a result of insisting on equilibria that no impartial welfare function would recommend.
Future work should consider more sophisticated and learning-driven approaches to designing norm-adaptive policies, as our norm-adaptive variant of generalized amTFT relies on a finite set of user-specified welfare functions and a hard-coded procedure for switching between policies. One possibility is to train agents who are themselves capable of jointly deliberating about the principles they should use to select an equilibrium, e.g., deciding among the axioms which characterize different bargaining solutions (see Appendix A), in the hope that they will be able to resolve initial disagreements. Another direction is resolving disagreements that cannot be expressed as disagreements over the welfare functions according to which agents play; for instance, disagreements over the beliefs or world-models which should inform agents’ behavior.
We’d like to thank Yoram Bachrach, Lewis Hammond, Vojta Kovařík, Alex Cloud, as well as our anonymous reviewers for their valuable feedback; Daniel Rüthemann for designing Figures 6 and 7; Chi Nguyen for crucial support just before a deadline; Toby Ord and Jakob Foerster for helpful comments.
Julian Stastny performed part of the research for this paper while interning at the Center on Long-Term Risk. Johannes Treutlein was supported by the Center on Long-Term Risk, the Berkeley Existential Risk Initiative, and Open Philanthropy. Allan Dafoe received funding from Open Philanthropy and the Centre for the Governance of AI.
Cecilia Albin. Justice and fairness in international negotiation. Number 74. Cambridge University Press, 2001.
Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas Hudson, Nicolas Porcel, Marc Lanctot, Julien Perolat, Richard Everett, Satinder Singh, Thore Graepel, and Yoram Bachrach. Learning to play no-press diplomacy with best response policy iteration. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17987–18003. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/d1419302db9c022ab1d48681b13d5f8b-Paper.pdf.
Robert J Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1):67–96, 1974.
Tim Baarslag, Katsuhide Fujita, Enrico H Gerding, Koen Hindriks, Takayuki Ito, Nicholas R Jennings, Catholijn Jonker, Sarit Kraus, Raz Lin, Valentin Robu, et al. Evaluating practical negotiating agents: Results and analysis of the 2011 international competition. Artificial Intelligence, 198: 73–103, 2013.
Jonathan Bendor and Piotr Swistak. The evolution of norms. American Journal of Sociology, 106(6):1493–1545, 2001.
Ken Binmore and Larry Samuelson. An economist’s perspective on the evolution of norms. Journal of Institutional and Theoretical Economics (JITE)/Zeitschrift für die gesamte Staatswissenschaft, pages 45–63, 1994.
Craig Boutilier. Sequential optimality and coordination in multiagent systems. In IJCAI, volume 99, pages 478–485, 1999.
Robert Boyd and Peter J Richerson. Culture and the evolution of human cooperation. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1533):3281–3288, 2009.
Donald E Campbell and Peter C Fishburn. Anonymity conditions in social choice theory. Theory and Decision, 12(1):21, 1980.
Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, and Stephen Clark. Emergent communication through negotiation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk6WhagRW.
Shantanu Chakraborty, Tim Baarslag, and Michael Kaisers. Automated peer-to-peer negotiation for energy contract settlements in residential cooperatives. Applied Energy, 259:114173, 2020.
Jacob W Crandall and Michael A Goodrich. Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Machine Learning, 82(3):281–314, 2011.
Allan Dafoe, Edward Hughes, Yoram Bachrach, Tantum Collins, Kevin R McKee, Joel Z Leibo, Kate Larson, and Thore Graepel. Open problems in cooperative ai. arXiv preprint arXiv:2012.08630, 2020.
Tom Eccles, Edward Hughes, János Kramár, Steven Wheelwright, and Joel Z Leibo. Learning reciprocity in complex sequential social dilemmas. arXiv preprint arXiv:1903.08082, 2019.
Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122–130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
Edward Geist and Andrew J Lohn. How might artificial intelligence affect the risk of nuclear war? Rand Corporation, 2018.
Jack P Gibbs. Norms: The problem of definition and classification. American Journal of Sociology, 70(5):586–594, 1965.
Jonathan Gray, Adam Lerer, Anton Bakhtin, and Noam Brown. Human-level performance in no-press diplomacy via equilibrium search. In International Conference on Learning Representation, 2021. URL https://openreview.net/forum?id=0-uUGPbIjD.
Dylan Hadfield-Menell, McKane Andrus, and Gillian Hadfield. Legible normativity for ai alignment: The value of silly rules. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 115–121, 2019.
John C Harsanyi. Cardinal welfare, individualistic ethics, and interpersonal comparisons of utility. Journal of political economy, 63(4):309–321, 1955.
Joseph Henrich, Robert Boyd, Samuel Bowles, Colin Camerer, Ernst Fehr, Herbert Gintis, and Richard McElreath. In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review, 91(2):73–78, 2001.
Leonid Hurwicz. On informationally decentralized systems. Decision and organization: A volume in Honor of J. Marschak, 1972.
Ehud Kalai. Proportional solutions to bargaining situations: interpersonal utility comparisons. Econometrica: Journal of the Econometric Society, pages 1623–1630, 1977.
Ehud Kalai and Meir Smorodinsky. Other solutions to Nash's bargaining problem. Econometrica, 43(3):513–518, 1975.
Raphael Köster, Kevin R McKee, Richard Everett, Laura Weidinger, William S Isaac, Edward Hughes, Edgar A Duéñez-Guzmán, Thore Graepel, Matthew Botvinick, and Joel Z Leibo. Model-free conventions in multi-agent reinforcement learning with heterogeneous preferences. arXiv preprint arXiv:2010.09054, 2020.
Sarit Kraus and Ronald C Arkin. Strategic negotiation in multiagent environments. MIT press, 2001.
Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017.
Adam Lerer and Alexander Peysakhovich. Learning existing social conventions via observationally augmented self-play. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 107–114, 2019.
Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for distributed reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3053–3062. PMLR, 10–15 Jul 2018. URL http://proceedings.mlr.press/v80/liang18b.html.
George J Mailath and Larry Samuelson. Repeated games and reputations: long-run relationships. Oxford University Press, 2006.
Stephen Morris. The common prior assumption in economic theory. Economics & Philosophy, 11(2):227–253, 1995.
Abhinay Muthoo. The economics of bargaining. EOLSS, 2001.
John F Nash. The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155–162, 1950.
John Forbes Nash. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951.
Philip Paquette, Yuchen Lu, Steven Bocco, Max Smith, Satya O.-G., Jonathan K. Kummerfeld, Joelle Pineau, Satinder Singh, and Aaron C Courville. No-press diplomacy: Modeling multi-agent gameplay. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/84b20b1f5a0d103f5710bb67a043cd78-Paper.pdf.
Alexander Peysakhovich and Adam Lerer. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975, 2017.
Lasse Ringius, Asbjørn Torvanger, and Arild Underdal. Burden sharing and fairness principles in international climate policy. International Environmental Agreements, 2(1):1–22, 2002.
Jeffrey S Rosenschein and Michael R Genesereth. Deals among rational agents. In Readings in Distributed Artificial Intelligence, pages 227–234. Elsevier, 1988.
Thomas C Schelling. An essay on bargaining. The American Economic Review, 46(3):281–306, 1956.
Thomas C Schelling. The strategy of conflict. prospectus for a reorientation of game theory. Journal of Conflict Resolution, 2(3):203–264, 1958.
Sandip Sen and Stephane Airiau. Emergence of norms through social learning. In IJCAI, volume 1507, page 1512, 2007.
Yoav Shoham and Moshe Tennenholtz. On the emergence of social conventions: modeling, analysis, and simulations. Artificial Intelligence, 94(1-2):139–166, 1997.
William Thomson. Cooperative models of bargaining. Handbook of game theory with economic applications, 2:1237–1284, 1994.
Adam Walker and Michael J Wooldridge. Understanding the emergence of conventions in multi-agent systems. In ICMAS, volume 95, pages 384–389, 1995.
H Peyton Young. The economics of convention. Journal of economic perspectives, 10(2):105–122, 1996.
Boyu Zhang and Josef Hofbauer. Equilibrium selection via replicator dynamics in 2 × 2 coordination games. International Journal of Game Theory, 44(2):433–448, 2015.
Lin Zhou. The Nash bargaining theory with non-convex problems. Econometrica: Journal of the Econometric Society, pages 681–685, 1997.
Different welfare functions have been introduced in the literature. Table 2 gives an overview of commonly discussed welfare functions. Their properties are noted in Table 3.
Name of welfare function | Form
Nash [Nash, 1950] | |
Kalai-Smorodinsky [Kalai and Smorodinsky, 1975] | s.t. Pareto-optimal |
Egalitarian [Kalai, 1977] | s.t. Pareto-optimal |
Utilitarian [Harsanyi, 1955] |
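For reference, the standard forms of these welfare functions, as they are commonly written in the bargaining literature for a payoff profile $(u_1, u_2)$ with disagreement point $(d_1, d_2)$ and ideal payoffs $(b_1, b_2)$, are given below; the exact formulations used in Table 2 may differ in details such as the choice of disagreement point.

$$
\begin{aligned}
\text{Nash:} \quad & (u_1 - d_1)(u_2 - d_2) \\
\text{Kalai-Smorodinsky:} \quad & \min_i \frac{u_i - d_i}{b_i - d_i} \quad \text{s.t. Pareto-optimality} \\
\text{Egalitarian:} \quad & \min_i \,(u_i - d_i) \quad \text{s.t. Pareto-optimality} \\
\text{Utilitarian:} \quad & u_1 + u_2
\end{aligned}
$$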
Pareto-optimality refers to the property that the welfare function's optimizer should be Pareto-optimal. Impartiality, often called symmetry, implies that the welfare of a policy profile should be invariant to permutations of player indices. These are treated as relatively uncontroversial properties in the literature. Invariance to affine transformations of the payoff matrix is usually motivated by the assumption that interpersonal comparison of utility is impossible. In contrast, the utilitarian welfare function assumes that such comparisons are possible. Independence of irrelevant alternatives refers to the principle that a preference for an equilibrium A over equilibrium B should only depend on properties of A and B. That is, a third equilibrium C should not change the preference ordering between A and B. Resource monotonicity refers to the principle that if the payoff for any policy profile increases, this should not make any agent worse off.
Table 3: Properties of welfare functions. For the properties of the Nash welfare function listed here, the set of feasible payoffs is assumed to be convex (Zhou [1997] describes properties in the non-convex case).
Coin Game
Asymmetric bargaining Coin Game
For computing the normalized scores in Figure 3, we use the following disagreement profiles, corresponding to cooperation failure. In IPD, the cooperation failure is joint defection (D, D), producing a reward profile of (-3, -3). In IAsymBoS, the cooperation failures are the profiles (B, S) and (S, B), both associated with the reward profile (0, 0). In Coin Game, the cooperation failure is when both players selfishly pick all coins at maximum speed, producing a reward profile of (0, 0). In the asymmetric bargaining Coin Game (ABCG), when the players fail to cooperate, they can punish the opponent by preventing any coin from being picked, waiting instead of picking their own Disagreement coin. In these cooperation failures, the reward profile with perfect punishment is (0, 0).
We use the same discount factor for all the algorithms and environments unless specified otherwise in Table 4.
Approximate Markov tit-for-tat (amTFT)
We follow the algorithm from Lerer and Peysakhovich [2017] with two changes: 1) Instead of using a selfish policy to punish the opponent, we use a policy that minimizes the opponent's reward. 2) We start from the version that uses rollouts to compute the debit and the punishment length; however, we observed that using long rollouts to compute the debit significantly increases the variance of the debit, which leads to false-positive detection of defection. To reduce this variance, we instead compute the debit without rollouts, using the direct difference between the reward given by the actual action and the reward given by the simulated cooperative action. The rollout length used to compute the punishment length is 20.
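A minimal sketch of this rollout-free debit update and the punishment trigger follows (the variable names and the threshold are illustrative, not taken from the implementation):

```python
def update_debit(debit, opponent_reward_actual, opponent_reward_if_cooperative):
    """Accumulate the per-step advantage the opponent gained over the simulated
    cooperative action (no rollouts are used for this computation)."""
    return debit + (opponent_reward_actual - opponent_reward_if_cooperative)

def should_punish(debit, threshold):
    """Trigger punishment once the accumulated debit exceeds a threshold; the
    punishment length is then estimated with rollouts (length 20 in our setup)."""
    return debit > threshold
```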
We note that in asymmetric BoS, training runs using the utilitarian welfare function sometimes learn to produce, in self-play, the egalitarian outcome instead of the utilitarian outcome. In these cases the policy gets discarded.
The inequity-averse welfare is defined in terms of smoothed cumulative rewards, computed with a discount factor, and a parameter which controls how much unequal outcomes are penalized.
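One form consistent with this description, stated here as an assumption rather than the exact definition used, is

$$ w_{\text{IA}}(e_1, e_2) \;=\; e_1 + e_2 \;-\; \alpha \,\lvert e_1 - e_2 \rvert, $$

where $e_1$ and $e_2$ are the smoothed cumulative rewards and $\alpha$ controls how strongly unequal outcomes are penalized.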
Learning with Opponent-Learning Awareness (LOLA)
Write $V^i(\theta_1, \theta_2)$ for the value to player $i$ under a profile of policies with parameters $\theta_1$ and $\theta_2$. The LOLA update [Foerster et al., 2018] for player 1 at time $t$ then adjusts $\theta_1^t$ by differentiating through an anticipated learning step of player 2; the update is given below.
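For reference, the first-order form of this update as given by Foerster et al. [2018] can be written as follows (the learning-rate symbols $\delta$ and $\eta$ are chosen here for illustration, and the notation may differ slightly from the version used in our experiments):

$$ \theta_1^{t+1} \;=\; \theta_1^t \;+\; \delta\, \nabla_{\theta_1} V^1(\theta_1^t, \theta_2^t) \;+\; \delta \eta \,\big(\nabla_{\theta_2} V^1(\theta_1^t, \theta_2^t)\big)^{\top} \nabla_{\theta_1} \nabla_{\theta_2} V^2(\theta_1^t, \theta_2^t). $$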
We used RLlib [Liang et al., 2018] for almost all the experiments. It is under Apache License 2.0. All activation functions are ReLU if not specified otherwise.
Matrix games (IPD and IBoS) with LOLA-Exact
We used the official implementation of LOLA-Exact from https://github.com/alshedivat/lola. We slightly modified it to increase stability and remove a few confusing behaviors. Following Foerster et al. [2018]’s parameterization of policies in the iterated Prisoner’s Dilemma, we use policies which condition on the previous pair of actions played, with the difference that instead of using one parameter for every possible previous action profile, we use one parameter per action for every previous action profile, so that the parameterization also supports games with larger action spaces.
Matrix games (IPD and IBoS) with amTFT
We use a Double Dueling DQN + LSTM architecture for both the simple and the complex environments when using amTFT. Its main components are a shared fully connected layer (hidden size 64), an LSTM (cell size 16), and a value branch and an action-value branch, each consisting of a fully connected network (hidden size 64).
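A minimal PyTorch-style sketch of this architecture is given below; the layer sizes follow the description above, but the dueling aggregation and recurrent-state handling are simplified, and this is not the RLlib configuration actually used.

```python
import torch
import torch.nn as nn

class DuelingDQNLSTM(nn.Module):
    """Sketch of the Double Dueling DQN + LSTM architecture used with amTFT."""

    def __init__(self, obs_size, num_actions):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_size, 64), nn.ReLU())
        self.lstm = nn.LSTM(input_size=64, hidden_size=16, batch_first=True)
        self.value_branch = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
        self.action_value_branch = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, num_actions))

    def forward(self, obs_seq, hidden=None):
        x = self.shared(obs_seq)              # (batch, time, 64)
        x, hidden = self.lstm(x, hidden)      # (batch, time, 16)
        value = self.value_branch(x)          # state-value estimate
        advantage = self.action_value_branch(x)
        # Dueling aggregation: combine value and mean-centered advantages into Q-values.
        q_values = value + advantage - advantage.mean(dim=-1, keepdim=True)
        return q_values, hidden
```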
Coin games (CG and ABCG) with LOLA-PG
We used the official implementation from https://github.com/alshedivat/lola, which is under the MIT license. We slightly modified it to increase stability and remove a few confusing behaviors. In particular, we removed the critic branch, which had no effect in practice. We use a PG + LSTM architecture composed of two convolution layers (kernel size 3x3, 20 features), an LSTM (cell size 64), and a final fully connected layer.
ABCG + LOLA-PG mainly generates policies associated with an egalitarian welfare function. Within this set of policies, some are somewhat closer to the utilitarian outcome than others, which we used as a basis for classifying them as “utilitarian” for the purpose of comparison in Figures 3 and 8. However, because the difference is small, we do not observe much additional cross-play failure compared to cross-play between policies trained with the “same” welfare function. It should also be noted that we chose to discard runs in which neither agent becomes competent at picking up any of the coins.
Coin games (CG and ABCG) with amTFT
We use a Double Dueling DQN + LSTM architecture for both the simple and the complex environments when using amTFT. Its main components are two convolution layers (kernel size 3x3, 64 features), an LSTM (cell size 16), and a value branch and an action-value branch, each consisting of a fully connected network (hidden size 32).
The code that we provide can be used to run all of the experiments and to generate the figures with our results. All instructions on how to install and run the experiments are given in the 'README.md' file. The code to run the experiments from Section 4 is in the folder 'base_game_experiments'. The code to run the experiments from Section 5 is in the folder 'base_game_experiments'. The code to generate the figures is in the folder 'plots'.
An anonymized version of the code is available at https://github.com/68545324/anonymous.
Table 4: Main hyperparameters for each cross-play experiment.
All hyperparameters can be found in the code provided as the supplementary material. They are stored in the “params.json” files associated with each replicate of each experiment. The experiments are stored in the “results” folders.
The hyperparameters selected are those producing the most welfare-optimal results when evaluating in self-play (the closest to the optimal profiles associated with each welfare function). Both manual hyperparameter tuning and grid searches were performed.
Table 5: Cross-play normalized scores of norm-adaptive generalized amTFT in ABCG.
Figure 8: Mean reward of policy profiles for environments and learning algorithms given in Table 1. The purple areas describe sets of attainable payoffs. In matrix games, a small amount of jitter is added to the points for improved visibility. The plots on top compare self-play with cross-play, whereas the plots below compare cross-play between policies optimizing same and different welfare functions.
In this section, we provide some theory on the tradeoff between exploitability and robustness. Consider some asymmetric BP with welfare functions $w_A$ and $w_B$ optimized in equilibrium by policy profiles $\pi^A = (\pi^A_1, \pi^A_2)$ and $\pi^B = (\pi^B_1, \pi^B_2)$, respectively, such that any such cross-play policy profile $(\pi^A_1, \pi^B_2)$ is Pareto-dominated. Note that this definition implies that the welfare functions must be distinct.

We can derive an upper bound for how well players can do under the cross-play policy profile. It follows from the fact that both $\pi^A$ and $\pi^B$ are in equilibrium that

$$ V_1(\pi^A_1, \pi^B_2) \;\le\; V_1(\pi^B_1, \pi^B_2) \quad \text{and} \quad V_2(\pi^A_1, \pi^B_2) \;\le\; V_2(\pi^A_1, \pi^A_2). \tag{1} $$
This is because, otherwise, at least one of the two profiles cannot be in equilibrium, since a player would have an incentive to switch to another policy to increase their value.
From the above, it also follows that the cross-play policy must be strictly dominated. To see this, assume it was not dominated. This would imply that one player has equal values under both profiles. So that player would be indifferent, while one of the profiles would leave the other player worse off. Thus, that profile would be weakly Pareto dominated, which is excluded by the definition of a welfare function.
It is a desirable quality for a policy profile maximizing a welfare function in equilibrium to have values that are close to this upper bound in cross-play against other policies. For instance, if some coordination mechanism exists for agreeing on a common policy to employ, it may be feasible to realize this bound against any opponent willing to do the same.
Moreover, the bound implies that whenever we try to be even more robust against players employing policies corresponding to other welfare functions (e.g., by using a policy which reaches Pareto-optimal outcomes against a range of different policies), our policy will cease to be in equilibrium. In that sense, such a policy will be exploitable, while unexploitable policies can only be robust against different opponents up to the above bound. Note that this holds even in idealized settings where coordination, e.g., via some grounded messages, is possible.
Lastly, note that if no coordination is possible, or if no special care is being taken in making policies robust, then equilibrium profiles that maximize a welfare function can perform much worse in cross-play than the above upper bound.
We show this formally in a special case in which our POSG is an iterated game, i.e., it only has one state. Moreover, we assume that it is an asymmetric BP and that, for both welfare functions in question, an optimal policy exists that is deterministic. Denote by $G^t_i(\pi)$ the return to player $i$ from time step $t$ onwards (given the joint observation history up to step $t$) under the policy profile $\pi$. We also assume that for any player $i$, the minimax value is strictly worse than their value under their preferred welfare-function-maximizing profile. Then we can show that policies maximizing the welfare functions exist such that, after some time step, their cross-play returns are upper bounded by the players' minimax returns.
Proposition 1. In the situation outlined above, there exist policy profiles $\pi^A$ and $\pi^B$ optimizing the respective welfare functions, and a time step $T$, such that under the cross-play profile each player's return from $T$ onwards is at most their minimax return.
Proof. Define $\tilde\pi^A$ as the policy profile in which each player $i$ follows $\pi^A_i$ unless the other player's actions differ from $\pi^A$ at least once, after which they switch to their minimax action. Define $\tilde\pi^B$ analogously. Note that both profiles are still optimal for the corresponding welfare functions.
As argued above, the cross-play profile must be strictly worse than their preferred profile for both players. So there is a time step after which an action of one player (say, player 1) must differ from player 2's preferred profile, and thus player 2 switches to their minimax action. The value player 1 obtains after that step is then at most their minimax value, and by assumption, the minimax value is worse for player 1 than the value of their preferred welfare-maximizing profile. Hence, there must also be a time step after which player 1 switches to their minimax action.
From that time step onwards, both players play their minimax actions, so each player's return is upper bounded by their minimax return.
The post Normative Disagreement as a Challenge for Cooperative AI appeared first on Center on Long-Term Risk.
The post Taboo "Outside View" appeared first on Center on Long-Term Risk.
No one has ever seen an AGI takeoff, so any attempt to understand it must use these outside view considerations
—[Redacted for privacy]
What? That’s exactly backwards. If we had lots of experience with past AGI takeoffs, using the outside view to predict the next one would be a lot more effective.
—My reaction
Two years ago I wrote a deep-dive summary of Superforecasting and the associated scientific literature. I learned about the “Outside view” / “Inside view” distinction, and the evidence supporting it. At the time I was excited about the concept and wrote: “...I think we should do our best to imitate these best-practices, and that means using the outside view far more than we would naturally be inclined.”
Now that I have more experience, I think the concept is doing more harm than good in our community. The term is easily abused and its meaning has expanded too much. I recommend we permanently taboo “Outside view,” i.e. stop using the word and use more precise, less confused concepts instead. This post explains why.
Over the past two years I’ve noticed people (including myself!) do lots of different things in the name of the Outside View. I’ve compiled the following lists based on fuzzy memory of hundreds of conversations with dozens of people:
As far as I can tell, it basically meant reference class forecasting. Kaj Sotala tells me the original source of the concept (cited by the Overcoming Bias post that brought it to our community) was this paper. Relevant quote: “The outside view is ... essentially ignores the details of the case at hand, and involves no attempt at detailed forecasting of the future history of the project. Instead, it focuses on the statistics of a class of cases chosen to be similar in relevant respects to the present one.” If you look at the text of Superforecasting, the “it basically means reference class forecasting” interpretation holds up. Also, “Outside view” redirects to “reference class forecasting” in Wikipedia.
To head off an anticipated objection: I am not claiming that there is no underlying pattern to the new, expanded meanings of “outside view” and “inside view.” I even have a few ideas about what the pattern is. For example, priors are sometimes based on reference classes, and even when they are instead based on intuition, that too can be thought of as reference class forecasting in the sense that intuition is often just unconscious, fuzzy pattern-matching, and pattern-matching is arguably a sort of reference class forecasting. And Ajeya’s model can be thought of as inside view relative to e.g. GDP extrapolations, while also outside view relative to e.g. deferring to Dario Amodei.
However, it’s easy to see patterns everywhere if you squint. These lists are still pretty diverse. I could print out all the items on both lists and then mix-and-match to create new lists/distinctions, and I bet I could come up with several at least as principled as this one.
When people use “outside view” or “inside view” without clarifying which of the things on the above lists they mean, I am left ignorant of what exactly they are doing and how well-justified it is. People say “On the outside view, X seems unlikely to me.” I then ask them what they mean, and sometimes it turns out they are using some reference class, complete with a dataset. (Example: Tom Davidson’s four reference classes for TAI). Other times it turns out they are just using the anti-weirdness heuristic. Good thing I asked for elaboration!
Separately, various people seem to think that the appropriate way to make forecasts is to (1) use some outside-view methods, (2) use some inside-view methods, but only if you feel like you are an expert in the subject, and then (3) do a weighted sum of them all using your intuition to pick the weights. This is not Tetlock’s advice, nor is it the lesson from the forecasting tournaments, especially if we use the nebulous modern definition of “outside view” instead of the original definition. (For my understanding of his advice and those lessons, see this post, part 5. For an entire book written by Yudkowsky on why the aforementioned forecasting method is bogus, see Inadequate Equilibria, especially this chapter. Also, I wish to emphasize that I myself was one of these people, at least sometimes, up until recently when I noticed what I was doing!)
Finally, I think that too often the good epistemic standing of reference class forecasting is illicitly transferred to the other things in the list above. I already gave the example of the anti-weirdness heuristic; my second example will be bias correction: I sometimes see people go “There’s a bias towards X, so in accordance with the outside view I’m going to bump my estimate away from X.” But this is a different sort of bias correction. To see this, notice how they used intuition to decide how much to bump their estimate, and they didn’t consider other biases towards or away from X. The original lesson was that biases could be corrected by using reference classes. Bias correction via intuition may be a valid technique, but it shouldn’t be called the outside view.
I feel like it’s gotten to the point where, like, only 20% of uses of the term “outside view” involve reference classes. It seems to me that “outside view” has become an applause light and a smokescreen for over-reliance on intuition, the anti-weirdness heuristic, deference to crowd wisdom, correcting for biases in a way that is itself a gateway to more bias...
I considered advocating for a return to the original meaning of “outside view,” i.e. reference class forecasting. But instead I say:
I’m not recommending that we stop using reference classes! I love reference classes! I also love trend extrapolation! In fact, for literally every tool on both lists above, I think there are situations where it is appropriate to use that tool. Even the anti-weirdness heuristic.
What I ask is that we stop using the words “outside view” and “inside view.” I encourage everyone to instead be more specific. Here is a big list of more specific words that I’d love to see, along with examples of how to use them:
Whenever you notice yourself saying “outside view” or “inside view,” imagine a tiny Daniel Kokotajlo hopping up and down on your shoulder chirping “Taboo outside view.”
Many thanks to the many people who gave comments on a draft: Vojta, Jia, Anthony, Max, Kaj, Steve, and Mark. Also thanks to various people I ran the ideas by earlier.
The post Taboo "Outside View" appeared first on Center on Long-Term Risk.
The post Case studies of self-governance to reduce technology risk appeared first on Center on Long-Term Risk.
Full post on the EA Forum.
The post Case studies of self-governance to reduce technology risk appeared first on Center on Long-Term Risk.
The post Work with us (temporary archive) appeared first on Center on Long-Term Risk.
Your contributions to our research program will have a positive impact through their influence on our strategic direction, grantmaking, communications, events, and other activities. You will work autonomously on challenging research questions relevant to reducing suffering. You will become part of our team of intellectually curious, hard-working, and caring people, all of whom share a profound drive to make the biggest difference they can.
We will adapt the responsibilities of the role to the strengths and preferences of each successful candidate, but they usually include:
Depending on your experience and skill set, we might ask you to supervise junior researchers or research fellows on our team.
We don’t require specific qualifications or experience for this role, but the following abilities and qualities are what we’re looking for in candidates. We recognize that some of these qualities could be hard to test well outside a similar role, and we believe that smart, curious generalists can make substantial contributions, even if they lack formal training in any field related to our focus areas. We therefore encourage you to apply if you think you may be a good fit, even if you are unsure whether you meet several of the criteria.
Relevant academic education, such as a master’s degree or higher in a related field, can be a useful indicator for some of the above qualities but is not a requirement.
We don’t require specific qualifications or experience for this role, but the following abilities and qualities are what we’re looking for in candidates. We encourage you to apply if you think you may be a good fit, even if you are unsure whether you meet some of the criteria.
We expect the content of the models to be wide-ranging. Examples currently of interest to CLR researchers include: AI timelines, the relative value of preventing bargaining failure vs. AI misalignment, and the causes of conflict. For examples of the kind of output we might like you to help us create, see this timelines model (by Ajeya Cotra), this agent-based model, and these economic growth simulations (by David Roodman). (Note that much of the work may be for internal consumption and thus need not be this “polished”.)
You can find an overview of our current priority areas here. However, if we believe that you can somehow advance high-quality research relevant to s-risks, we are interested in creating a position for you. If you see a way to contribute to our research agenda or have other ideas for reducing s-risks, please apply. We commonly tailor our positions to the strengths and interests of the applicants.
We value your time and are aware that applications can be demanding, so we have thought carefully about making the application process time-efficient and transparent:
Stage 1: To start your application for any role, please complete our application form. As part of this form, we also ask you to submit your CV/resume and relevant work samples if available. The deadline is Monday, March 15, 2021 (11:59 pm Pacific Time). We expect this to take around 2 to 3 hours if you are already familiar with our work.
Stage 2: By Friday, March 19, we will decide whether to invite you to the second stage. For the Researcher and Summer Research Fellow role, we will ask you to write a research proposal (up to two pages excluding references), to be submitted by Sunday, April 4 (11:59 pm Pacific Time). For the Research Assistant role, we will ask you to do a work test. For all roles, you will receive compensation for your work during this stage.
Stage 3: By Friday, April 9, we will decide whether to invite you to an interview via video call during the week of April 12. By Friday, April 16, we will make our final decisions.
If you have any questions about the process, please contact us at info@longtermrisk.org. If you want to send an email not accessible to the hiring committee, please contact Amrit Sidhu-Brar at amrit.sidhu-brar@longtermrisk.org.
We will also host two open video calls for any questions about this hiring round or working at CLR more generally. Sign up here to receive an invitation to the video call. They will take place at the following times:
In addition to their salary, CLR offers the following benefits to all staff (including Summer Research Fellows):
We aim to combine the best aspects of academic research (depth, scholarship, mentorship) with an altruistic mission to prevent negative future scenarios. So we leave out the less productive features of academia, such as precarious employment and publish-or-perish incentives, while adding a focus on impact and application.
As part of our team, you will enjoy:
You will advance neglected research to reduce the most severe risks to our civilization in the long-term future. Depending on your specific project, your work will help inform our activities across any of the following paths to impact:
The post Work with us (temporary archive) appeared first on Center on Long-Term Risk.
The post Coordination challenges for preventing AI conflict appeared first on Center on Long-Term Risk.
In this article, I will sketch arguments for the following claims:
In this article, I examine the challenge of ensuring coordination between AI developers to prevent catastrophic failure modes arising from the interactions of their systems. More specifically, I am interested in addressing bargaining failures as outlined in Jesse Clifton’s research agenda on Cooperation, Conflict & Transformative Artificial Intelligence (TAI) (2019) and Dafoe et al.’s Open Problems in Cooperative AI (2020).
First, I set out the general problem of bargaining failure and why bargaining problems might persist even for aligned superintelligent agents. Then, I argue for why developers might be in a good position to address the issue. I use a toy model to analyze whether we should expect them to do so by default. I deepen this analysis by comparing the merit and likelihood of different coordinated solutions. Finally, I suggest directions for interventions and future work.
The main goal of this article is to encourage and enable future work. To do so, I sketch the full path from problem to potential interventions. This large scope comes at the cost of depth of analysis. The models I use are primarily intended to illustrate how a particular question along this path can be tackled rather than to arrive at robust conclusions. At some point, I might revisit parts of this article to bolster the analysis in later sections.
Transformative AI scenarios involving multiple systems (“multipolar scenarios”) pose unique existential risks resulting from their interactions.1 Bargaining failure between AI systems, i.e., cases where each actor ends up much worse off than they could have under a negotiated agreement, is one such risk. The worst cases could result in human extinction or even worse outcomes (Clifton 2019).2
As a prosaic example, consider a standoff between AI systems similar to the Cold War between the U.S. and the Soviet Union. If they failed to handle such a scenario well, they might cause nuclear war in the best case and far worse if technology has further advanced at that point.
Short of existential risk, they could jeopardize a significant fraction of the cosmic endowment by preventing the realization of mutual gains or causing the loss of resources in costly conflicts.
This risk is not sufficiently addressed by AI alignment, by which I mean either “ensuring that systems are trying to do what their developers want them to do” or “ensuring that they are in fact doing what their developers want them to do.”3 Consider the Cuban Missile Crisis as an analogy: The governments of the U.S. and the Soviet Union were arguably “aligned” with some broad notion of human values, i.e., both governments would at least have considered total nuclear war to be a moral catastrophe. Nevertheless, they got to the brink of causing just that because of a failure to bargain successfully. Put differently, it’s conceivable, or even plausible, that the Cuban Missile Crisis could have resulted in global thermonuclear war, an outcome so bad that both parties would probably have preferred complete surrender.4
This risk scenario is probably also not sufficiently addressed by ensuring that the AI systems we build have superhuman bargaining skills. Consider the Cuban Missile Crisis again. I am arguing that a superintelligent Kennedy and a superintelligent Khrushchev would not have been sufficient to guarantee successful prevention of the crisis. Even for superintelligent agents, some fundamental game-theoretic incompatibilities persist because the ability to solve them is largely orthogonal to any notion of “bargaining skill,” whether we conceive of that skill as part of intelligence or rationality. These are the “mixed-motive coordination problem” and the “prior selection problem.”
“Mixed-motive coordination problem”: As I use the term here, a mixed-motive coordination problem arises when two agents need to pick one Pareto-optimal solution out of many such solutions. The failure to pick the same one results in a failure to reach a mutually agreeable outcome. At the level of equilibria, this may arise in games that do not have a uniquely compelling cooperative equilibrium, i.e., they have multiple Pareto-optimal equilibria that correspond to competing notions of what counts as an acceptable agreement.
For instance, in Bach or Stravinsky (see matrix below), both players would prefer going to any concert together (Stravinsky, Stravinsky or Bach, Bach) over going to any concert by themselves (Stravinsky, Bach or Bach, Stravinsky). However, one person prefers going to Stravinsky together, whereas the other prefers going to Bach together. Thus, there is a distributional problem when allocating the gains from coordination (Morrow 1994).9 Put in more technical terms: each player favors a different solution on the Pareto curve. Within this simple game, there is no way for the two players to reliably select the same concert, which will often cause them to end up alone.
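As an illustration, here is one possible payoff matrix for such a game; the payoff numbers are hypothetical, chosen only to match the description, with the row player preferring Stravinsky and the column player preferring Bach.

$$
\begin{array}{c|cc}
 & \text{Stravinsky} & \text{Bach} \\
\hline
\text{Stravinsky} & (4,\ 2) & (0,\ 0) \\
\text{Bach} & (0,\ 0) & (2,\ 3)
\end{array}
$$

Both coordination outcomes beat miscoordination for both players, but the row player prefers (Stravinsky, Stravinsky) and the column player prefers (Bach, Bach).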
More fundamentally, agents may differ in the solution concepts or decision rules they use to decide what agreements are acceptable in a bargaining situation, e.g., they may use different bargaining solutions. In bargaining problems, different “reasonable” decision rules make different recommendations for which Pareto-optimal solution to pick. The worry is that independently developed systems could end up using, either implicitly or explicitly, different decision rules for bargaining problems, leading to bargaining failure. For instance, in the variant of Bach or Stravinsky above, (Stravinsky, Stravinsky) leads to the greatest total payoffs, while (Bach, Bach) is more equitable.10
As a toy example, consider the case where two actors are bargaining over some territory. There are many ways of dividing this territory. (Different ways of dividing the territory are analogous to (Stravinsky, Stravinsky) and (Bach, Bach) above.) One player (the proposer) makes a take-it-or-leave-it offer to the other player (the responder) of a division of the territory, and war occurs if the offer is rejected. (A rejected offer is analogous to the miscoordination outcome (Stravinsky, Bach).) If the proposer and responder have different notions of what counts as an acceptable offer, war may ensue. If the agents have highly destructive weapons at their disposal, then war may be extremely costly. (To see how this might apply in the context of transformative AI, imagine that these are AI systems bargaining over the resources of space.)
There are two objections to address here. First, why would the responder reject any offer if they know that war will ensue? One reason is that they have a commitment to reject offers that do not meet their standards of fairness to reduce their exploitability by other agents. For AI systems, there are a few ways this could happen. For example, such commitments may have evolved as an adaptive behavior in an evolution-like training environment or be the result of imitating human exemplars with the same implicit commitments. AI systems or their overseers might have also implemented these commitments as part of a commitment race.
Second, isn’t this game greatly oversimplified? For instance, agents could engage in limited war and return to the bargaining table later, rather than fighting a catastrophic war. There are a few responses here. For one thing, highly destructive weapons or irrevocable commitments might preclude the success of bargaining later on. Another consideration is that some complications — such as agents having uncertainty about each other's private information (see below) — would seem to make bargaining failure more likely, not less so.
“Prior selection problem”: In games of incomplete information, i.e., games in which players have some private information about their payoffs or available actions, the standard solution concept, Bayesian Nash equilibrium, requires the agents to have common knowledge of each other's priors over possible values of the players’ private information. However, if systems end up with different priors, outcomes may be bad.11 For instance, one player might believe their threat to be credible, whereas the other player might think it’s a bluff, leading to the escalation of the conflict. Similar to mixed-motive coordination problems, there are many “reasonable” priors and no unique individually rational rule that picks out one of them. In the case of machine learning, priors could well be determined by the random initialization of the weights or incidental features of the training environment (e.g., the distribution of other agents against which an agent is trained). Such differences in beliefs may persist over time due to models of other agents being underdetermined in strategic settings.12
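As a toy numerical illustration of the prior selection problem (all payoffs and probabilities below are my own assumptions): suppose one side has issued a threat it is genuinely committed to carrying out, and the target decides whether to concede based on its prior that the threat is credible.

```python
# Illustrative payoffs to the target of a threat (assumed numbers).
CONCEDE_PAYOFF = -1.0           # payoff from giving in
THREAT_EXECUTED_PAYOFF = -10.0  # payoff if the threat is carried out
BLUFF_IGNORED_PAYOFF = 0.0      # payoff if the threat was a bluff and is ignored

def target_gives_in(prob_threat_credible: float) -> bool:
    """The target concedes iff its expected payoff from resisting is worse than conceding."""
    expected_payoff_of_resisting = (prob_threat_credible * THREAT_EXECUTED_PAYOFF
                                    + (1 - prob_threat_credible) * BLUFF_IGNORED_PAYOFF)
    return expected_payoff_of_resisting < CONCEDE_PAYOFF

# The threatener is in fact committed, but the two systems were built with different
# priors over the threatener's type, so they disagree about credibility.
print(target_gives_in(prob_threat_credible=0.9))   # True: concession, no conflict
print(target_gives_in(prob_threat_credible=0.05))  # False: the threat gets carried out
```

If the target's training happened to instill a low prior on credibility, it resists and the threat is executed, even though both sides would have preferred to avoid that outcome.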
Note that these concepts are idealizations. More broadly, AI systems may have different beliefs and procedures for deciding which commitments are credible and which bargains are acceptable.
These incompatibility problems are much more likely to arise or lead to catastrophic failures if AI systems are developed independently. During training, failure to arrive at mutually agreeable solutions is likely to result in lower rewards. So a system will usually perform well against counterparts that are similar to the ones it encountered during training. If the development of two systems is independent, such similarity is not guaranteed, and bargaining is more likely to fail catastrophically due to the reasons I sketched above.
Again, let’s consider a human analogy. There is evidence for significant behavioral differences among individuals from different cultures when playing standard economic games (e.g., the ultimatum game, the dictator game, different public goods games). For instance, Henrich et al. (2005) found that mean offers from Western university students usually ranged from 42-48% in the ultimatum game. Among members of the fifteen small-scale societies they studied, mean offers instead spanned 25-57%. In a meta-analysis, Oosterbeek, Sloof & van de Kuilen (2004) found systematic cultural differences in the behavior of responders (but not proposers). Relatedly, there also appears to be evidence for cross-cultural differences with regard to notions of fairness (e.g., Blake et al. 2015, Schaefer et al. 2015). This body of literature is at least suggestive of humans learning different “priors” or “decision rules” depending on their “training regime,” i.e., their upbringing.
The smaller literature on intercultural play, where members from different cultures play against one another, weakly supports welfare losses as a result of such differences: “while a few studies have shown no differences between intra- and intercultural interactions, most studies have shown that intercultural interactions produce less cooperation and more competition than intracultural interactions” (Matsumoto, Hwang 2011). I only consider this weak evidence as the relevant studies do not seem to carefully control for the potential of (shared) distrust of perceived strangers, which would also explain these results but is independent of incompatible game-playing behavior.
It is tempting to delegate the solving of these problems to future more capable AI systems. However, it is not guaranteed that they will be in a position to solve them, despite being otherwise highly capable.
For one, development may have already locked in priors or strong preferences over bargaining solutions, either unintentionally or deliberately (as the result of a commitment race, for instance). This could put strict limits on their abilities to solve these problems.
More fundamentally, solving these incompatibility problems requires overcoming another such problem. Picking out some equilibrium, solution concept, or prior will favor one system over another. So they face another distributional problem. Solving that requires successful bargaining, the failure of which was the original problem. If they wanted to solve this second incompatibility problem, they would face another one. In other words, there are incompatibility problems all the way down.
One possibility is that many agents will by default be sufficiently “reasonable” that they can agree on a solution concept via reasoned deliberation, avoiding commitments to incompatible solution concepts for bargaining problems. Maybe many sufficiently advanced systems will engage in reasoning such as “let’s figure out the correct axioms for a bargaining solution, or at least sufficiently reasonable ones that we can both feel OK about the agreement.”13 Unfortunately, it does not seem guaranteed that this kind of reasoning will be selected for during the development of the relevant AI systems.
Developers then might be better suited to addressing this issue than more capable successor agents, whether they be AI systems or AI-assisted humans:
The comparative ignorance of present-day humans mitigates the distributional problem faced by more far-sighted and intelligent successor agents.14 The distributional consequences of particular coordination arrangements will likely be very unclear to AI developers. Compared to future agents, I expect them to have more uncertainty about their values, preferred solution concepts, the consequences of different coordination agreements, and how these three variables relate to one another. This will smooth out differences in expected value between different coordination outcomes. However, developers will have much less uncertainty about the value of averting conflict by coordinating in some form. So it will be easier for them to find a mutually agreeable arrangement, as the situation looks to them more like a pure coordination game (see matrix below), which is much easier to solve by cheap talk alone, than like Bach or Stravinsky (see matrix above).15
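For contrast with Bach or Stravinsky, here is an illustrative pure coordination game (again with assumed payoffs): both ways of coordinating give both players the same payoff, so the distributional question never arises and any reasonable selection rule agrees.

```python
import numpy as np

# Pure coordination game (illustrative payoffs): both players just want to match,
# and they are indifferent between the two ways of matching.
payoffs_p1 = np.array([[1, 0],
                       [0, 1]])
payoffs_p2 = np.array([[1, 0],
                       [0, 1]])

# Any selection rule that picks a Pareto-optimal outcome agrees here, because the
# distributional question ("who gets more?") never arises.
totals = payoffs_p1 + payoffs_p2
print(np.argwhere(totals == totals.max()))  # [[0 0] [1 1]]: both matches are equally good
```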
The loss aversion and scope insensitivity of (most) human bargainers will likely compound this effect. I expect them to increase bargainers' inclination to prioritize avoiding catastrophes over securing relative gains. This, again, will push this game more toward one of pure coordination, mitigating the distributional problem. In comparison, AI systems are less likely to exhibit such “biases.”16
A related point is that human bargainers might not even know what the Pareto frontier looks like. Thus, instead of trying to bargain for their most favorable point on the Pareto frontier, they have incentives to converge on any mutually agreeable settlement even if it is Pareto inferior to many other possible outcomes. This, in turn, probably decreases the chance of catastrophic failures.17 As Young (1989) writes:
Negotiators who know the locus of a contract curve or the shape of a welfare frontier to begin with will naturally be motivated primarily by a desire to achieve an outcome on this curve or frontier that is as favorable to their own interests as possible. They will, therefore, immediately turn to calculations regarding various types of strategic behavior or committal tactics that may help them achieve their distributive goals.
Negotiators who do not start with a common understanding regarding the contours of the contract curve or the locus of the negotiation set, by contrast, have compelling incentives to engage in exploratory interactions to identify opportunities for devising mutually beneficial deals. Such negotiators may never discover the actual shape of the contract curve or locus of the negotiation set, and they may consequently end up with arrangements that are Pareto-inferior in the sense that they leave feasible joint gains on the table. At the same time, however, they are less likely to engage in negotiations that bog down into protracted stalemates brought about by efforts to improve the outcome for one party or another through initiatives involving strategic behavior and committal tactics.
Developers intent on solving this problem can choose between two broad classes of options18: they can agree to build their separate systems with compatible bargaining-relevant features, or they can agree to build a single joint system.
Both solutions require overcoming the distributional problem discussed in the previous section. In the case of coordinating on compatible features, each set of features will have different distributional consequences for the developers. In the case of agreeing to build a joint system, there will be different viable agreements, again with different distributional consequences for the developers (e.g., the system may pursue various tradeoffs between the developers’ individual goals, or developers might get different distributions of equity shares).20
For now, let’s assume that there are only two developers who are both aware of these coordination problems and have the technical ability to solve them. Let’s further assume the two options introduced above do not differ significantly in their effectiveness at preventing conflict, and the costs of coordination are negligible. Then the game they are playing can be modeled as a coordination game like Bach or Stravinsky.2122
In non-iterated and sequential play, we can expect coordination, at least under idealized conditions. The follower will adapt to the strategy chosen by the leader since they have nothing to gain by not coordinating (“pre-emption”). If I know that my friend is at the Bach concert, I will also go to the Bach concert since I prefer that to being at the Stravinsky concert on my own.
In non-iterated and simultaneous play, the outcome is underdetermined. They may end up coordinating, or they may not. It depends on whether they will be able to solve the bargaining problem inherent to the game. Introducing credible commitments could move us from simultaneous play to sequential play, ensuring coordination once again.23 If I can credibly commit to going to one concert rather than another, my counterpart has again nothing to gain by choosing the other concert. They will join me at the one that I signaled I would go to.
In iterated play, the outcome is, again, uncertain. Unlike the Prisoner’s Dilemma, there is no need to monitor and enforce any agreement in coordination games once it has been reached. Free-riding is not possible: deviation from equilibrium harms both players, i.e., agreements are generally self-enforcing (Snidal 1985). However, the iteration of the game incentivizes players to insist on the coordination outcome that is more favorable to them. Foregoing coordination one round might be worth it if you think you can cause your counterpart to move to the more favorable equilibrium in subsequent rounds.
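The following sketch works through the toy model under assumed Bach-or-Stravinsky payoffs: in simultaneous play there are two pure-strategy equilibria and nothing inside the game selects between them, while in sequential play backward induction gives coordination on whatever the leader locked in.

```python
import itertools
import numpy as np

# Bach-or-Stravinsky payoffs for the two developers (illustrative numbers).
U1 = np.array([[2, 0],
               [0, 1]])
U2 = np.array([[1, 0],
               [0, 2]])
ACTIONS = ["option A", "option B"]  # two incompatible coordination arrangements

def pure_nash_equilibria():
    """Enumerate pure-strategy Nash equilibria of the simultaneous-move game."""
    eqs = []
    for i, j in itertools.product(range(2), repeat=2):
        best_i = U1[i, j] >= max(U1[k, j] for k in range(2))
        best_j = U2[i, j] >= max(U2[i, k] for k in range(2))
        if best_i and best_j:
            eqs.append((ACTIONS[i], ACTIONS[j]))
    return eqs

def sequential_outcome(leader_choice: int):
    """In sequential play the follower best-responds to whatever the leader locked in."""
    follower_choice = int(np.argmax(U2[leader_choice]))
    return ACTIONS[leader_choice], ACTIONS[follower_choice]

print(pure_nash_equilibria())  # [('option A', 'option A'), ('option B', 'option B')]
print(sequential_outcome(0))   # ('option A', 'option A'): the follower falls in line
```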
Which of these versions of the game best describes AI takeoff primarily depends on two variables: how close the race is, and whether deployment happens in successive rounds. Close races will be more akin to simultaneous play, where developers do not observe what their counterpart “played” until they have already locked in a certain choice themselves. Successive deployment, where developers release improved versions over time, is akin to iterated play. So coordination seems anything close to guaranteed in this toy model only if one developer is clearly ahead of the competition, and those might be the scenarios where one actor gains a decisive strategic advantage in any case. Otherwise, bargaining will occur, and may fail.
Note: I don’t intend for this section to be a comprehensive analysis of this situation. Rather, it is intended as a first stab at the problem and a demonstration of how to make progress on the question of whether we can expect coordination by default. This basic model could be extended in various ways to capture different complexities of the situation.
If we drop the assumption that developers are aware of the need to coordinate, coordination may still occur regardless. However, it is necessarily less likely. Three paths then seem to remain:
First, norms might emerge organically as the result of trial and error. This would require iteration and a well-functioning feedback mechanism. For instance, the two labs release pre-TAI systems, which interact poorly, perhaps due to the problems described in this article. They lack concrete models for the reasons for this failure, but in subsequent releases, both labs independently tinker with the algorithms until they interact well with one another. This compatibility then also transfers to their transformative systems. My intuition is that the likelihood of such an outcome will depend a lot on how fast and how continuously AI development progresses.
Second, the relevant features may end up being compatible due to the homogeneity of the systems. However, even the same training procedures can result in different and incompatible decision rules due to random differences in initialization.24 More narrowly, a third party might develop a bargaining “module” or service, which is integrated into all transformative systems by their developers due to its competitive performance rather than as the result of a coordination effort. Again, this outcome is not guaranteed.
Third, developers might agree to build a joint system for reasons other than the problem discussed in this article.
None of these reasons would guarantee that only one system is developed. They merely give some developers reasons to merge with other developers.
Given the toy model we used above, both solutions (compatible features and the building of a single system) do not differ in terms of payoffs. However, to examine how desirable they are from an altruistic perspective and how likely they are to come about, we need to analyze them in more detail. Again, the analysis will remain at the surface level and is intended as a first stab and illustration.
Restricting our perspective to the problem discussed in this article, developers building a joint system is preferable since it completely obviates any bargaining by the AI systems themselves.34 Moreover, the underlying agreement seems significantly harder to renege on. It also effectively addresses the racing problem and some other multipolar challenges introduced in Critch, Krueger 2020.
At the same time, it would increase the importance of solving multi (stakeholder)/single (AI system) challenges (cf. section 8 of Critch, Krueger 2020), e.g., those related to social choice and accommodating disagreements between developers. If that turns out to be less tractable or to have worse side effects, this could sway the overall balance. The above analysis also ignores potential negative side-effects such agreements might have on the design of AI systems and the dynamics of AI development more broadly, e.g., by speeding up development in general. Analyzing these effects is beyond the scope of this article. Overall, however, I tend to believe that such an agreement would be desirable, especially in a close race.
It seems to me that two factors are most likely to determine the choice of developers35: (1) the consequences of each mode of coordination for the anticipated payoffs attained by the AI systems after deployment and (2) the transaction costs incurred by bringing about either of the two options prior to deployment.36
It’s plausible that the post-deployment payoffs will be overwhelmingly important, especially if developers appreciate the astronomical stakes involved. Nevertheless, transaction costs may still be important to the extent that developers are not as far-sighted and suffer from scope neglect.
Understanding the differences in payoffs would require a more comprehensive version of the analysis attempted in the previous section, as well as an understanding of the motivations of the developers in question. For instance, if the argument of the previous section holds, altruistically inclined developers would see higher payoffs associated with building a single system compared to an agreement to build compatible systems.37 On the other hand, competing national projects may be far more reluctant to join forces.
More general insights can be gleaned when it comes to transaction costs. The most common analytical lens for predicting what kinds of transactions agents will make is new institutional economics (NIE).38 Where game-theoretic models often abstract away such costs through idealization assumptions, NIE acknowledges that agents have cognitive limitations, lack information, and struggle to monitor and enforce agreements. This results in different transaction costs for different contractual arrangements, which influences which one is picked. This perspective can shed light on the question of whether to collaborate through market contracting (e.g., buying, selling) or through hierarchy and governance (e.g., regular employment, mergers). In our case, these transaction types are represented by agreeing to use compatible features and by agreeing to build a joint system, respectively.
Transaction costs are often grouped into three categories39: search and information costs, bargaining and decision costs, and monitoring and enforcement costs.
On the face of it, this lens suggests that, all else equal, actors would prefer to find compatible features over agreeing to build a single system, because the costs of the former seem lower than those of the latter.41
This weakly suggests to me that transaction costs will incline developers toward building compatible systems rather than a joint system. Case studies seem to confirm this impression: I am not aware of any real-world examples of agreements to merge, build a single system instead of multiple, or establish a supranational structure to solve a coordination problem. Instead, actors seem to prefer to solve such problems through agreements and conventions; for instance, all standardization efforts fall into this category. These reasons become stronger as the number of potentially relevant developers increases, because the costs of coordinating the development of a joint system rise more rapidly with the number of actors than the costs of an agreement among independent developers, which probably has very low marginal costs.
Overall, I expect that there will be strong reasons to build a joint system if there is a small number of relevant nonstate developers who are aware of and moved by the astronomical stakes. In those cases, I would be surprised if transaction costs swayed them. I am more pessimistic in other scenarios.
Coordination is not assured. Even if coordination is achieved, the outcome could still be suboptimal. This suggests that additional work on this problem would be valuable. In the next two sections, I will sketch directions for potential interventions and future research to make progress on this issue.
I will restrict this section to interventions for the governance problem sketched in this article while ignoring most technical challenges.44 I don’t necessarily endorse all of them without reservations as good ideas to implement. Some of them might have positive effects beyond the narrow application discussed here. Some might have (unforeseen) negative effects.
Increasing problem awareness
Without awareness of the problem, a solution to the core problem becomes significantly less likely. Accordingly, increasing awareness of this problem among competitive developers is an important step.45 It seems particularly important to do so in a way that is accessible to people with a machine learning background. One potential avenue might be to develop benchmarks that highlight the limits on achieving cooperation among AI agents without coordination by their developers. Our work on mixed-motive coordination problems in Stastny et al. 2021 is an example of ongoing work in this area.
Facilitating agreements
Some interventions can make reaching an agreement more likely under real-world conditions. Some reduce the transaction costs developers need to pay. Others mitigate the distributional problem they may face. I expect that many of these would also contribute to solving other bargaining problems between AI developers (e.g., finding solutions to the racing problem).
Making development go well in the absence of problem awareness
If developers are not sufficiently aware of the problem, there might still be interventions making coordination more likely.
Facilitating agreements to build a joint system
As I wrote above, a superficial analysis suggests that such agreements would be beneficial. If so, there might be interventions to make them more likely without causing excessive negative side-effects. For instance, one could restrict such efforts to tight races, as the OpenAI Assist Clause attempts to do.
There are many ways in which the analysis of this post could be extended or made more rigorous:
There are also more foundational questions about takeoff scenarios relevant to this problem:
We can ask further questions about potential interventions:
I want to thank Jesse Clifton for substantial contributions to this article as well as Daniel Kokotajlo, Emery Cooper, Kwan Yee Ng, Markus Anderljung, and Max Daniel for comments on a draft version of this article.
Throughout this document, I have talked about bargaining-relevant features of AI systems that developers might coordinate on. The details of these features depend on facts about how transformative AI systems are trained, which are currently highly uncertain. For the sake of concreteness, however, here are some examples of features that AI developers might coordinate on, depending on what approach to AI development is ultimately taken:
The post Coordination challenges for preventing AI conflict appeared first on Center on Long-Term Risk.
The post Collaborative game specification: arriving at common models in bargaining appeared first on Center on Long-Term Risk.
In another post, I described the “prior selection problem”: different agents having different models of their situation can lead to bargaining failure. Moreover, techniques for addressing bargaining problems, like coordination on solution concepts or surrogate goals / safe Pareto improvements, seem to require agents to have a common, explicit game-theoretic model.
In this post, I introduce collaborative game specification (CGS), a family of techniques designed to address the problem of agents lacking a shared model. In CGS, agents agree on a common model of their bargaining situation and use this to come to an agreement. Here is the basic idea:
Of course, when agreeing on a common model, agents must handle their counterparts' incentives to deceive them. In the toy illustration below, we’ll see how incentives to misrepresent one’s model can be handled in a pure cheap-talk setting.
How might we use CGS to reduce the risks of conflict involving powerful AI systems? One use is to provide demonstrations of good bargaining behavior. Some approaches to AI development may involve training AI systems to imitate the behavior of some demonstrator (e.g., imitative amplification), and so we may need to be able to provide many demonstrations of good bargaining behavior to ensure that the resulting system is robustly able and motivated to bargain successfully. Another is to facilitate bargaining between humans with powerful AI tools, e.g. in a comprehensive AI services scenario.
Aside from actually implementing CGS in AI systems, studying protocols of this kind can give us a better understanding of the limits on agents’ ability to overcome differences in their private models. Under the simple version of CGS discussed here, because agents have to incentivize truth-telling by refusing to engage in CGS sometimes, agents will fail to agree on a common model with positive probability in equilibrium.
I will first give a toy example of CGS (Section 1), and then discuss how it might be implemented in practice (Section 2). I close by discussing some potential problems and open questions for CGS (Section 3). In the Appendix, I discuss a game-theoretic formalism in which CGS can be given a more rigorous basis.
For the purposes of illustration, we’ll focus on a pure cheap-talk setting, in which agents exchange unverifiable messages about their private models. Of course, it is all the better if agents can verify aspects of each others' private models. See Shin (1994) for a game-theoretic setting in which agents can verifiably disclose (parts of) their private beliefs. But we will focus on cheap talk here. A strategy for specifying a common model via cheap talk needs to handle incentives for agents to misrepresent their private models in order to improve their outcome in the resulting agreement. In particular, agents will need to follow a policy of refusing to engage in CGS if their counterpart reports a model that is too different from their own (and therefore evidence that they are lying). This kind of strategy for incentivizing honesty in cheap-talk settings has been discussed in the game theory literature in other contexts (e.g., Gibbons 1988; Morrow 1994).
For simplicity, agents in this example will model their situation as a game of complete information. That is, agents by default assume that there is no uncertainty about their counterpart’s payoffs. CGS can also be used for games of incomplete information. In this case, agents would agree on a Bayesian game with which to model their interaction. This includes agreement on a common prior over the possible values of their private information.
The "noisy Chicken" game is displayed in Table 1.
In this game, both agents observe a random perturbation of the true payoff matrix of the game. Call agent $i$'s observation $\hat{G}_i$. This might be a world-model estimated from a large amount of data. The randomness in the agents' models can be interpreted as agents having different ways of estimating a model from data, yielding different estimates (perhaps even if they have access to the same dataset). While an agent with more computational resources might account for the fact that their counterpart might have a different model in a fully Bayesian way, our agents are computationally limited and therefore can only apply relatively simple policies to estimated payoff matrices. However, their designers can simulate lots of training data, and thus construct strategies that implicitly account for the fact that other agents may have different model estimates, while not being too computationally demanding. CGS is an example of such a strategy.
A policy will map observations to a probability distribution over actions. We assume the following about the agents' policies:
Now, we imagine that the agents are AI systems, and the AI developers ("principals") have to decide what policy to give their agent. If their agent is going to use CGS, then they need to train it to use a distortion which is (in some sense) optimal. Thus I will consider the choice of policy on the part of the *principals* as a game, where the actions correspond to the distortions to use in the agents' reporting policies, and payoffs correspond to the average payoffs attained by the agents they deploy. Then I'll look at the equilibria of this game. This is of course a massive idealization: AI developers will not get together and choose agents whose policies are in equilibrium with respect to some utility functions. The point is only to illustrate how principals might rationally train agents to arrive at a common model under otherwise idealized circumstances.
I ran 1000 replicates of an experiment which computed actions according to the default policies and according to reporting policy profiles with various distortion levels. The payoffs under the default policy profile and under the Nash equilibrium (it happened to be unique) of the game in which principals choose the distortion levels for their agents are reported in Table 3.
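Here is a rough, self-contained sketch of this kind of experiment. The Chicken payoff matrix, the noise level, the acceptance threshold, and the fallback behavior below are all my own assumptions rather than the settings used in the original experiment; the point is only to show the mechanics of observing perturbed models, exchanging (possibly distorted) reports, refusing CGS when the reports differ too much, and otherwise playing a solution of the agreed model.

```python
import numpy as np

rng = np.random.default_rng(0)

# True payoff bimatrix for "Chicken" (actions: 0 = Swerve, 1 = Dare). Assumed numbers.
# TRUE_G[i, j] = (payoff to agent 1, payoff to agent 2).
TRUE_G = np.array([[[3.0, 3.0], [1.0, 4.0]],
                   [[4.0, 1.0], [0.0, 0.0]]])
NOISE = 0.5        # std. dev. of each agent's observation noise (assumed)
THRESHOLD = 2.0    # refuse CGS if the reported models differ by more than this (assumed)

def observe():
    """Each agent sees an independently perturbed copy of the true game."""
    return TRUE_G + rng.normal(0.0, NOISE, size=TRUE_G.shape)

def solve(model):
    """Toy solution concept: among the pure Nash equilibria of the agreed model, pick
    the one maximizing the sum of payoffs; default to (Swerve, Swerve) if none exist."""
    best, best_total = (0, 0), -np.inf
    for i in range(2):
        for j in range(2):
            is_nash = (model[i, j, 0] >= model[1 - i, j, 0]
                       and model[i, j, 1] >= model[i, 1 - j, 1])
            if is_nash and model[i, j].sum() > best_total:
                best, best_total = (i, j), model[i, j].sum()
    return best

def episode(distortion_1=0.0, distortion_2=0.0):
    """One cheap-talk round of CGS. Each agent reports its observed model, possibly
    shifted in its own favor; if the reports are too far apart, CGS is refused and
    both fall back to a default policy (assumed here to be mutual Dare)."""
    report_1 = observe()
    report_1[..., 0] += distortion_1
    report_2 = observe()
    report_2[..., 1] += distortion_2
    if np.abs(report_1 - report_2).max() > THRESHOLD:
        return TRUE_G[1, 1]
    agreed_model = (report_1 + report_2) / 2.0
    return TRUE_G[solve(agreed_model)]

honest = np.array([episode() for _ in range(1000)])
print("average payoffs under honest reporting:", honest.mean(axis=0))
```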
In practice, CGS can be seen as accomplishing two things:
Here is how it could be implemented:
1. Take whatever class of candidate policies and policy learning method you were going to use by default. Note that this class of policies need not be model-based, so long as transparency tools can be applied to extract a model consistent with the policies' behavior (see below);
2. Augment the space of policies you are learning over with those that implement CGS. These policies will be composed of a procedure for specifying an explicit model of the strategic situation in collaboration with the counterpart, a solution concept to apply to the agreed-upon model, and a fallback to a default policy when no common model is agreed upon;
3. Use your default policy learning method on this augmented space of policies.
For example, in training that relies on imitation learning, a system could be trained to do CGS by having the imitated agents respond to certain bargaining situations by offering to their counterpart to engage in CGS; actually specifying an explicit model of their strategic situation in collaboration with the counterpart; and (if the agents succeed in specifying a common model) applying some solution concept to that model in order to arrive at an agreement.
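Schematically, a deployed CGS policy might be structured as follows. The helper functions (`model_distance`, `merge_models`), the acceptance threshold, and the toy "models" are placeholders I am assuming for illustration, not components described in the original post.

```python
import numpy as np

def model_distance(m1, m2):
    """Crude disagreement measure between two reported models (here, payoff arrays)."""
    return float(np.max(np.abs(m1 - m2)))

def merge_models(m1, m2):
    """Toy way of forming a common model out of two reports: average them."""
    return (m1 + m2) / 2.0

def cgs_policy(my_model, their_model, default_action, solution_concept,
               max_disagreement=1.0):
    """Schematic structure of a policy implementing CGS: if the reported models are
    close enough, apply the agreed solution concept to a common model; otherwise fall
    back to whatever the default policy would have done. Refusing to proceed when the
    counterpart's report looks implausible is what disciplines misreporting incentives
    in the cheap-talk setting."""
    if model_distance(my_model, their_model) > max_disagreement:
        return default_action
    return solution_concept(merge_models(my_model, their_model))

# Toy usage: "models" are 2x2 arrays of total payoffs, the agreed solution concept picks
# the joint action with the highest total payoff, and the default action is None.
def pick_best_joint_action(m):
    return tuple(int(x) for x in np.unravel_index(int(np.argmax(m)), m.shape))

print(cgs_policy(np.array([[4.0, 0.0], [0.0, 2.0]]),
                 np.array([[4.2, 0.0], [0.0, 1.9]]),
                 default_action=None,
                 solution_concept=pick_best_joint_action))  # (0, 0): common model agreed
print(cgs_policy(np.array([[4.0, 0.0], [0.0, 2.0]]),
                 np.array([[0.0, 4.0], [2.0, 0.0]]),
                 default_action=None,
                 solution_concept=pick_best_joint_action))  # None: CGS refused
```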
A major practical challenge seems to be having imitated humans strategically specify potentially extremely complicated game-theoretic models. In particular, one challenge is specifying a model at all, and another is reporting a model such that the agent expects in some sense to be better off in the solution of the model that results from CGS than they would be if they used some default policy. The first problem — specifying a complicated model in the first place — might be addressed by applying model extraction techniques to some default black box policy in order to infer an explicit world-model. The second problem — learning a reporting policy which agents expect to leave them better off under the resulting agreement — could be addressed if different candidate reporting policies could be tried out in a high-quality simulator.
One issue is whether CGS could actually make things worse. The first way CGS could make things worse is via agents specifying models in which conflict happens in equilibrium. We know that conflict sometimes happens in equilibrium. Fearon (1995)'s classic rationalist explanations for war show how war can occur in equilibrium due to agents having private information about their level of strength or resolve that they are not willing to disclose, or agents not being able to credibly commit to not launching preemptive attacks when they expect that their counterpart will gain strength in the future. Likewise, threats and punishments can be executed in equilibrium for reasons of costly signaling (e.g., Sechser 2010) or falsely detected defections (e.g., Fudenberg et al. 2009). A related issue is that it is not clear how the interaction of CGS and model misspecification affects its safety. For instance, agents who underestimate the chances of false detections of nuclear launches may place nuclear weapons on sensitive triggers, incorrectly thinking that nuclear launch is almost certain not to occur in equilibrium.
The second way training agents to do CGS could make things worse is by encouraging them to use dangerous decision procedures outside of CGS. The problems associated with designing agents to maximize a utility function are well-known in AI safety. Depending on how agents are trained to do CGS, it may make them more likely to make decisions in situations other than bargaining situations via expected utility maximization. For instance, training agents to do CGS may produce modules that help agents to specify a world-model and utility function, and maximize the expectation of that utility function, and agents may use the modules when making decisions in non-CGS contexts.
In light of this, we would want to make sure CGS preserves nice properties that a system already has. CGS should be *alignment-preserving*: intuitively, modifying a system's design to implement CGS shouldn't make misalignment more likely. CGS should also preserve properties like *myopia*: modifying a myopic system to use CGS shouldn't make it non-myopic. Importantly, ensuring the preservation of properties other than alignment which make catastrophic bargaining failure less likely may help to avoid worst-case outcomes even if alignment fails.
Finally, there is the problem that CGS still faces equilibrium and prior selection problems. (See the Appendix for a formulation of CGS in the context of a Bayesian game; such a game assumes a common prior — in this case, a prior arising from the distribution of environments on which the policies are trained — and will, in general, have many equilibria.) Thus there is a question of how much we can expect actors to coordinate to train their agents to do CGS, and how much CGS can reduce risks of bargaining failure if AI developers do not coordinate.
As in the toy illustration, we can think of agents' models as private information, drawn from some distribution that depends on the (unknown) underlying environment. Because agents are boundedly rational, they can only reason according to these (relatively simple) private models, rather than a correctly-specified class of world-models. However, the people training the AI systems can generate lots of training data, in which agents can try out different policies for accounting for the variability in their and their counterpart's private models. Thus we can think of this as a Bayesian game played between AI developers, in which the strategies are policies for mapping private world-models to behaviors. These behaviors might include ways of communicating with other agents in order to overcome uncertainty, which in turn might include CGS. The prior for this Bayesian game is the distribution over private models induced by the training environments and the agents' model-learning algorithms (which we take to be exogenous for simplicity).
As I noted above, this Bayesian game still faces equilibrium and prior selection problems between the AI developers themselves. It also makes the extremely unrealistic assumption that the training and deployment distributions of environments are the same. The goal is only to clarify how developers could (rationally) approach training their agents to implement CGS under idealized conditions.
Consider two actors, whom I will call “the principals,” who are to train and deploy learning agents. Each principal $i$ has a utility function $u_i$. The game that the principals play is as follows:
The choice of what policy to deploy is itself a game, in which the strategies are the policies the principals could deploy and the ex ante payoffs are the expected utilities those policies attain over the distribution of training environments.
We will for now suppose that during training the value of each policy profile under each principal's utility function can be learned with high accuracy.
How should a principal choose which policy to deploy? In the absence of computational constraints, a natural choice is Bayesian Nash equilibrium (BNE). In practice, it will be necessary to learn over a much smaller class of policies than the space of all maps from observations to actions. Let $\Pi_1$ and $\Pi_2$ be sets of policies such that it is tractable to evaluate each profile in $\Pi_1 \times \Pi_2$. In this context, assuming that the principals' utility functions are common knowledge, a pair of policies is a BNE if each policy maximizes its principal's ex ante expected utility, holding the other policy fixed (indexing $i$'s counterpart by $-i$).
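One standard way to write this best-response condition, using the notation assumed above ($\Pi_i$ for principal $i$'s policy class, $\hat{G}_i$ for agent $i$'s privately observed model, and $u_i$ for principal $i$'s utility function), is the following; this is a generic statement of the condition rather than necessarily the exact form used in the original analysis:

$$\pi_i^{*} \in \arg\max_{\pi_i \in \Pi_i} \; \mathbb{E}\left[ u_i\big(\pi_i(\hat{G}_i),\, \pi_{-i}^{*}(\hat{G}_{-i})\big) \right] \quad \text{for } i \in \{1, 2\},$$

where the expectation is taken over the joint distribution of the observed models $(\hat{G}_1, \hat{G}_2)$ induced by the training environments.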
When the policy class consists of policies with limited capacity (reflecting computational boundedness), agents may learn policies which do not account for the variability in the estimation of their private models. I will call the class of such policies learned over during training time the “default policies.” To address this problem in a computationally tractable way, we introduce policies which allow for the specification of a shared model of the game. Let $\mathcal{M}$ be a set of models, and let $\mathcal{S}$ be a set of solution concepts which map elements of $\mathcal{M}$ to (possibly random) action profiles. In the toy illustration, agents specified models in the set of bimatrices, and the solution concept they used was the Nash equilibrium which maximized the sum of their payoffs in the agreed-upon game.
Then, the CGS policies have the property that, for some reports, the policy profile succeeds in collaboratively specifying a game with positive probability. That is, with positive probability the agents agree on the same model in $\mathcal{M}$ and the same solution concept in $\mathcal{S}$, and play the corresponding action profile.
The goal of principals who want their agents to engage in collaborative game specification is to find a policy profile which is a Bayes-Nash equilibrium over this augmented policy space, which improves upon any equilibrium among the default policies, and which succeeds in collaboratively specifying a game with high probability.
Now, this model is idealized in a number of ways. I assume that the distribution of training environments matches the distribution of environments encountered by the deployed policies. Moreover, I assume that both principals train their agents on this distribution of environments. In reality, of course, these assumptions will fail. A more modest but attainable goal is to use CGS to construct policies which perform well on whatever criteria individual principals use to evaluate policies for multi-agent environments, as discussed in Section 2 (Implementation).
James D Fearon. Rationalist explanations for war. International Organization, 49(3):379–414, 1995.
Drew Fudenberg, David Levine, and Eric Maskin. The folk theorem with imperfect public information. In A Long-Run Collaboration On Long-Run Games, pages 231–273. World Scientific, 2009.
Robert Gibbons. Learning in equilibrium models of arbitration. Technical report, National Bureau of Economic Research, 1988.
James D Morrow. Modeling the forms of international cooperation: distribution versus information. International Organization, pages 387–423, 1994.
Todd S Sechser. Goliath’s curse: Coercive threats and asymmetric power. International Organization, 64(4):627–660, 2010.
Hyun Song Shin. The burden of proof in a game of persuasion. Journal of Economic Theory, 64(1):253–264, 1994.
The post Weak identifiability and its consequences in strategic settings appeared first on Center on Long-Term Risk.
We say that a model is unidentifiable if there are several candidate models which produce the same distributions over observables. It is well-understood in the AI safety community that identifiability is a problem for inferring human values [1] [2]. This is because there are always many combinations of preferences and decision-making procedures which produce the same behaviors. So, it's impossible to learn an agent's preferences from their behavior without strong priors on their preferences and/or decision-making procedures. I want to point out here that identifiability is also a problem for multi-agent AI safety, for some of the same reasons as in the preference inference case, as well as some reasons specific to strategic settings. In the last section I'll give a simple quantitative example of the potential implications of unidentifiability for bargaining failure in a variant of the ultimatum game.
By modeling other agents, I mean forming beliefs about the policy that they are following based on observations of their behavior. The model of an agent is unidentifiable if there is no amount of data from the environment in question that can tell us exactly what policy they are following. (And because we always have finite data, “weak identifiability” more generally is a problem — but I'll just focus on the extreme case.)
Consider the following informal example (a quantitative extension is given in Section 3). Behavioral scientists have an identifiability problem in trying to model human preferences in the ultimatum game. The ultimatum game (Figure 1) is a simple bargaining game in which a Proposer offers a certain division of a fixed pot of money to a Responder. The Responder may then accept, in which case each player gets the corresponding amount, or reject, in which case neither player gets anything. Standard accounts of rationality predict that the Proposer will offer the Responder the least amount allowed in the experimental setup and that the Responder will accept any amount of money greater than zero. However, humans don’t act this way: In experiments, human Proposers often give much closer to even splits, and Responders often reject unfair splits.
The ultimatum game has been the subject of extensive study in behavioral economics, with many people offering and testing different explanations of this phenomenon. This has led to a proliferation of models of human preferences in bargaining settings (e.g. Bicchieri and Zhang 2012; Hagen and Hammerstein 2006 and references therein). This makes the ultimatum game a rich source of models and data about human preferences in bargaining situations. And the game is similar to the one-shot threat game used here to illustrate the prior selection problem. Thus it can be used to model some of the high-stakes bargaining scenarios involving transformative AI that concern us most.
Suppose that you observe a Responder play many rounds of the ultimatum game with different Proposers, and you see that they tend to reject unfair splits. You think there are two possible kinds of explanation for this behavior: the Responder may intrinsically disvalue unfair splits (a fairness preference), or the Responder may treat play as iterated and reject unfair splits in order to elicit better offers from future Proposers.
The problem is that (depending on the details), these models might make exactly the same predictions about the outcomes of these experiments so that no amount of data from these experiments can ever distinguish between them. This makes it difficult, for instance, to decide what to do if you have to face the Responder in an ultimatum game yourself.
The basic problem is familiar from the usual preference inference case: there are many combinations of world-models and utility functions which make the same predictions about the Responder's behavior. But it is also a simple illustration of a few other factors which make unidentifiability particularly severe in strategic settings:
One of our models of the Responder in the ultimatum game contains a simple illustration of k-level modeling. Under the iterated play explanation, you model the Responder as modeling other players as responding to their refusals of unfair splits with higher offers in the future.
Unidentifiability may be dangerous in multi-agent contexts for similar reasons that it may be dangerous in the context of inferring human preferences. If uncertainty over all of the models which are consistent with the data is not accounted for properly — via specification of “good” priors and averaging over a sufficiently large space of possibilities to make decisions — then our agents may give excessive weight to models which are far from the truth and therefore act catastrophically.
Two broad directions for mitigating these risks include:
In this example, I focus on inferring the preferences of a Responder given some data on their behavior. I'll then show that for some priors over models of the Responder, decisions made based on the resulting posterior can lead to rejected splits. Importantly, this behavior happens given any amount of observations of the Responder's ultimatum game behavior, due to unidentifiability.
Consider the following simple model. For an offer $s$ and parameters $\alpha$ and $\beta$, the Responder makes a decision by comparing a utility for accepting the offer with a utility for rejecting it.
The unfairness term in these utilities can be interpreted as the Responder deeming offers of less than an even split as unfair. The parameter $\alpha$ measures how much the Responder intrinsically disvalues unfair splits, and the parameter $\beta$ measures how much the Responder expects to get in the future when they reject unfair splits.
A split is accepted if and only if the utility of accepting exceeds the utility of rejecting. Notice that the decision depends only on the sum $\alpha + \beta$, and thus the data cannot distinguish between the effects of $\alpha$ and $\beta$. So we have a class of models parameterized by pairs $(\alpha, \beta)$. Now, suppose that we have two candidate models with the same value of $\alpha + \beta$ — one on which fairness is the main component (the fairness model), and one on which iterated play is (the iterated-play model).
The likelihoods for any data are the same for any $(\alpha, \beta)$ such that $\alpha + \beta$ is the same: if the data consist of the offered splits and the Responder's accept/reject decisions in the experiment, the likelihood of a model given those observations is a function of $\alpha + \beta$ alone.
Since $\alpha + \beta$ is the same under both candidate models, the posterior over the two models is equal to the prior, no matter how much data is observed.
Now here is the decision-making setup:
Call the prior model probabilities $p_{\text{fair}}$ and $p_{\text{iter}}$. Thus, the Proposer's posterior expected payoff for a given split is the average of the expected payoffs under the two candidate models, weighted by these probabilities.
In Figure 2, I compare the expected payoffs to the Proposer under different splits, when the true parameters for the Responder's utility function put most of the weight on the fairness component. The three expected payoff curves are:
We can see from the blue curve that when there's sufficient prior mass on the wrong model (the iterated-play model), the Proposer will propose a split that's too small, resulting in a rejection. This basically corresponds to a situation where the Proposer thinks that the Responder rejects unfair splits in order to establish a reputation for rejecting unfair splits, rather than rejecting unfair splits because of a commitment not to accept unfair splits. And although I've tilted the scales in favor of a bad outcome by choosing a prior that gives a lot of weight to an incorrect model, keep in mind that this is what the posterior expectation will be given any amount of data from this generative model. We can often count on data to correct our agents' beliefs, but this is not the case (by definition) when the relevant model is unidentifiable.
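Here is a rough reconstruction of this example in code. The functional form, the unfairness threshold of 0.5, and the specific parameter values and priors below are my own assumptions chosen to match the verbal description (only the sum of the fairness parameter alpha and the iterated-play parameter beta is identifiable from the data, but the two parameters have different implications when the Proposer faces the Responder in a one-shot interaction); they are not the values behind the original figure.

```python
import numpy as np

FAIR = 0.5  # offers below this fraction are treated as unfair (assumed threshold)

def accepts_in_data(s, alpha, beta):
    """Assumed model of the observed (repeated-setting) data: an unfair offer s < FAIR
    is rejected unless s >= alpha + beta, so only the sum alpha + beta is identifiable."""
    return s >= FAIR or s >= alpha + beta

def accepts_one_shot(s, alpha):
    """Assumed behavior when facing the Responder once, with no future rounds: the
    iterated-play motive (beta) drops out and only the fairness motive (alpha) remains."""
    return s >= FAIR or s >= alpha

# Two candidate models with the same alpha + beta = 0.45. They fit any amount of
# ultimatum-game data equally well, but predict different one-shot behavior.
m_fair = {"alpha": 0.40, "beta": 0.05}   # mostly a fairness commitment (the truth here)
m_iter = {"alpha": 0.05, "beta": 0.40}   # mostly reputation-building for future rounds

assert all(accepts_in_data(s, **m_fair) == accepts_in_data(s, **m_iter)
           for s in np.linspace(0, 1, 101))  # identical likelihoods: unidentifiable

def proposer_expected_payoff(s, prior_fair):
    """Posterior-expected payoff of offering s, mixing over the two candidate models."""
    return (1 - s) * (prior_fair * accepts_one_shot(s, m_fair["alpha"])
                      + (1 - prior_fair) * accepts_one_shot(s, m_iter["alpha"]))

offers = np.round(np.linspace(0.0, 0.5, 51), 3)
for prior_fair in (0.9, 0.2):
    best = max(offers, key=lambda s: proposer_expected_payoff(s, prior_fair))
    accepted = accepts_one_shot(best, m_fair["alpha"])  # the true Responder is m_fair
    print(f"prior on fairness model = {prior_fair}: best offer = {best}, "
          f"{'accepted' if accepted else 'rejected'}")
```

With enough prior mass on the iterated-play model, the offer that maximizes the Proposer's posterior-expected payoff falls below the Responder's true acceptance threshold and gets rejected, and no amount of data from the original setting would have corrected the prior.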
Cristina Bicchieri and Jiji Zhang. An embarrassment of riches: Modeling social preferences in ultimatum games. Handbook of the Philosophy of Science, 13:577–95, 2012.
Edward H Hagen and Peter Hammerstein. Game theory and human evolution: A critique of some recent interpretations of experimental games. Theoretical Population Biology, 69(3):339–348, 2006.
The post Birds, Brains, Planes, and AI: Against Appeals to the Complexity / Mysteriousness / Efficiency of the Brain appeared first on Center on Long-Term Risk.
I argue that an entire class of common arguments against short timelines is bogus, and provide weak evidence that anchoring to the human-brain-human-lifetime milestone is reasonable.
In a sentence, my argument is that the complexity and mysteriousness and efficiency of the human brain (compared to artificial neural nets) is almost zero evidence that building TAI will be difficult, because evolution typically makes things complex and mysterious and efficient, even when there are simple, easily understood, inefficient designs that work almost as well (or even better!) for human purposes.
In slogan form: If all we had to do to get TAI was make a simple neural net 10x the size of my brain, my brain would still look the way it does.
The case of birds & planes illustrates this point nicely. Moreover, it is also a precedent for several other short-timelines talking points, such as the human-brain-human-lifetime (HBHL) anchor.
1909 French military plane, the Antoinette VII.
By Deep silence (Mikaël Restoux) - Own work (Bourget museum, in France), CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1615429
| AI timelines, from our current perspective | Flying machine timelines, from the perspective of the late 1800’s |
| --- | --- |
| Shorty: Human brains are giant neural nets. This is reason to think we can make human-level AGI (or at least AI with strategically relevant skills, like politics and science) by making giant neural nets. | Shorty: Birds are winged creatures that paddle through the air. This is reason to think we can make winged machines that paddle through the air. |
| Longs: Whoa whoa, there are loads of important differences between brains and artificial neural nets: [what follows is a direct quote from the objection a friend raised when reading an early draft of this post!] | Longs: Whoa whoa, there are loads of important differences between birds and flying machines: |
| Shorty: The key variables seem to be size and training time. Current neural nets are tiny; the biggest one is only one-thousandth the size of the human brain. But they are rapidly getting bigger. Once we have enough compute to train neural nets as big as the human brain for as long as a human lifetime (HBHL), it should in principle be possible for us to build HLAGI. No doubt there will be lots of details to work out, of course. But that shouldn’t take more than a few years. | Shorty: Once the power-to-weight ratio of our motors surpasses the power-to-weight ratio of bird muscles, it should be in principle possible for us to build a flying machine. No doubt there will be lots of details to work out, of course. But that shouldn’t take more than a few years. |
| Longs: Bah! I don’t think we know what the key variables are. For example, biological brains seem to be able to learn faster, with less data, than artificial neural nets. And we don’t know why. Besides, “there will be lots of details to work out” is a huge understatement. It took evolution billions of generations of billions of individuals to produce humans. What makes you think we’ll be able to do it quickly? It’s plausible that actually we’ll have to do it the way evolution did it, i.e. meta-learn, i.e. evolve a large population of HBHLs, over many generations. (Or, similarly, train a neural net with a big batch size and a horizon length of a lifetime). And even if you think we’ll be able to do it substantially quicker than evolution did, it’s pretty presumptuous to think we could do it quickly enough that the HBHL milestone is relevant for forecasting. | Longs: Bah! I don’t think we know what the key variables are. For example, birds seem to be able to soar long distances without flapping their wings at all, and we still haven’t figured out how they do it. Another example: We still don’t know how birds manage to steer through the air without crashing (flight stability & control). Besides, “there will be lots of details to work out” is a huge understatement. It took evolution billions of generations of billions of individuals to produce birds. What makes you think we’ll be able to do it quickly? It’s plausible that actually we’ll have to do it the way evolution did it, i.e. meta-design, i.e. evolve a large population of flying machines, tweaking our blueprints each generation of crashed machines to grope towards better designs. And even if you think we’ll be able to do it substantially quicker than evolution did, it’s pretty presumptuous to think we could do it quickly enough that the date our engines achieve power/weight parity with bird muscle is relevant for forecasting. |
This data shows that Shorty was entirely correct about forecasting heavier-than-air flight. (For details about the data, see appendix.) Whether Shorty will also be correct about forecasting TAI remains to be seen.
In some sense, Shorty has already made two successful predictions: I started writing this argument before having any of this data; I just had an intuition that power-to-weight is the key variable for flight and that therefore we probably got flying machines shortly after achieving power-to-weight comparable to bird muscle. Halfway through the first draft, I googled and confirmed that yes, the Wright Flyer’s motor was close to bird muscle in power-to-weight. Then, while writing the second draft, I hired an RA, Amogh Nanjajjar, to collect more data and build this graph. As expected, there was a trend of power-to-weight improving over time, with flight happening right around the time bird-muscle parity was reached.
I had previously heard from a friend, who read a book about the invention of flight, that the Wright brothers were the first because they (a) studied birds and learned some insights from them, and (b) did a bunch of trial and error, rapid iteration, etc. (e.g. in wind tunnels). The story I heard was all about the importance of insight and experimentation--but this graph seems to show that the key constraint was engine power-to-weight. Insight and experimentation were important for determining who invented flight, but not for determining which decade flight was invented in.
One way in which compute can substitute for insights/algorithms/architectures/ideas is that you can use compute to search for them. But there is a different and arguably more important way in which compute can substitute for insights/etc.: Scaling up the key variables, so that the problem becomes easier, so that fewer insights/etc. are needed.
For example, with flight, the problem becomes easier the more power/weight ratio your motors have. Even if the Wright brothers didn’t exist and nobody else had their insights, eventually we would have achieved powered flight anyway, because when our engines are 100x more powerful for the same weight, we can use extremely simple, inefficient designs. (For example, imagine a u-shaped craft with a low center of gravity and helicopter-style rotors on each tip. Add a third, smaller propeller on a turret somewhere for steering.)
With neural nets, we have plenty of evidence now that bigger = better, with theory to back it up. Suppose the problem of making human-level AGI with HBHL levels of compute is really difficult. OK, 10x the parameter count and 10x the training time and try again. Still too hard? Repeat.
Note that I’m not saying that if you take a particular design that doesn’t work, and make it bigger, it’ll start working. (If you took Da Vinci’s flying machine and made the engine 100x more powerful, it would not work). Rather, I’m saying that the problem of finding a design that works gets qualitatively easier the more parameters and training time you have to work with.
Finally, remember that human-level AGI is not the only kind of TAI. Sufficiently powerful R&D tools would work, as would sufficiently powerful persuasion tools, as might something that is agenty and inferior to humans in some ways but vastly superior in others.
Suppose that actually all we have to do to get TAI is something fairly simple and obvious, but with a neural net 10x the size of my (actual) brain and trained for 10x longer. In this world, does the human brain look any different than it does in the actual world?
No. Here is a nonexhaustive list of reasons why evolution would evolve human brains to look like they do, with all their complexity and mysteriousness and efficiency, even if the same capability levels could be reached with 10x more neurons and a very simple architecture. Feel free to skip ahead if you think this is obvious.
The general pattern of argument I think is bogus is:
The brain has property X, which seems to be important to how it functions. We don’t know how to make AI’s with property X. It took evolution a long time to make brains have property X. This is reason to think TAI is not near.
As argued above, if TAI is near, there should still be many X which are important to how the brain functions, which we don’t know how to reproduce in AI, and which it took evolution a long time to produce. So rattling off a bunch of X’s is basically zero evidence against TAI being near.
Put differently, here are two objections any particular argument of this type needs to overcome: first, that X might not actually be necessary for TAI; and second, that X might be something we figure out fairly quickly once the key variables of size and training time are reached.
This reveals how the arguments could be reformulated to become non-bogus! They need to argue (a) that X is probably necessary for TAI, and (b) that X isn’t something that we’ll figure out fairly quickly once the key variables of size and training time are surpassed.
In some cases there are decent arguments to be made for both (a) and (b). I think efficiency is one of them, so I’ll use that as my example below.
Let’s work through the example of data-efficiency. A bad version of this argument would be:
Humans are much more data-efficient learners than current AI systems. Data-efficiency is very important; any human who learned as inefficiently as current AI would basically be mentally disabled. This is reason to think TAI is not near.
The rebuttal to this bad argument is:
If birds were as energy-inefficient as planes, they’d be disabled too, and would probably die quickly. Yet planes work fine. (See Table 1 from this AI Impacts page) Even if TAI is near, there are going to be lots of X’s that are important for the brain, that we don’t know how to make in AI yet, but that are either unnecessary for TAI or not too difficult to get once we have the other key variables. So even if TAI is near, I should expect to hear people going around pointing out various X’s and claiming that this is reason to think TAI is far away. You haven’t done anything to convince me that this isn’t what’s happening with X = data-efficiency.
However, I do think the argument can be reformulated and expanded to become good. Here’s a sketch, inspired by Ajeya Cotra’s argument here.
We probably can’t get TAI without figuring out how to make AIs that are as data-efficient as humans. It’s true that there are some useful tasks for which there is plenty of data--like call center work, or driving trucks--but AIs that can do these tasks won’t be transformative. Transformative AI will be doing things like managing corporations, leading armies, designing new chips, and writing AI theory publications. Insofar as AI learns more slowly than humans, by the time it accumulates enough experience doing one of these tasks, (a) the world would have changed enough that its skills would be obsolete, and/or (b) it would have made a lot of expensive mistakes in the meantime.
Moreover, we probably won’t figure out how to make AIs that are as data-efficient as humans for a long time--decades at least. This is because (1) we’ve been trying to figure this out for decades and haven’t succeeded, and (2) having a few orders of magnitude more compute won’t help much. Now, to justify point #2: Neural nets actually do get more data-efficient as they get bigger, but we can plot the trend and see that they will still be less data-efficient than humans when they are a few orders of magnitude bigger. So making them bigger won’t be enough; we’ll need new architectures/algorithms/etc. As for using compute to search for architectures/etc., that might work, but given how long evolution took, we should think it’s unlikely that we could do this with only a few orders of magnitude of searching—probably we’d need to do many generations of large population size. (We could also think of this search process as analogous to typical deep learning training runs, in which case we should expect it’ll take many gradient updates with large batch size.) Anyhow, there’s no reason to think that data-efficient learning is something you need to be human-brain-sized to do. If we can’t make our tiny AIs learn efficiently after several decades of trying, we shouldn’t be able to make big AIs learn efficiently after just one more decade of trying.
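To make the shape of this extrapolation argument concrete, here is a minimal sketch of the kind of trend-fitting it appeals to. All numbers below (model sizes, sample counts, the `human_like_budget` threshold) are made-up placeholders rather than measurements from any real benchmark; only the structure matters: fit a power law to data-efficiency versus scale, then check whether a few more orders of magnitude close the gap to human-level efficiency.

```python
import numpy as np

# Illustrative sketch only: these figures are placeholders, not real measurements.
params = np.array([1e8, 1e9, 1e10, 1e11])            # hypothetical model sizes
samples_needed = np.array([1e9, 4e8, 1.6e8, 6.4e7])  # hypothetical samples to reach a fixed skill level

# Fit a power law: log10(samples) = a + b * log10(params)
b, a = np.polyfit(np.log10(params), np.log10(samples_needed), 1)

def projected_samples(n_params):
    """Extrapolate the fitted trend to a larger model size."""
    return 10 ** (a + b * np.log10(n_params))

human_like_budget = 1e6  # hypothetical stand-in for a "human-level" sample budget
for scale in [1e12, 1e13, 1e14]:  # a few more orders of magnitude of size
    print(f"{scale:.0e} params -> ~{projected_samples(scale):.1e} samples "
          f"(human-ish budget: {human_like_budget:.0e})")
```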
I think this is a good argument. Do I buy it? Not yet. For one thing, I haven’t verified whether the claims it makes are true, I just made them up as plausible claims which would be persuasive to me if true. For another, some of the claims actually seem false to me. Finally, I suspect that in 1895 someone could have made a similarly plausible argument about energy efficiency, and another similarly plausible argument about flight control, and both arguments would have been wrong: Energy efficiency turned out to be insufficiently necessary, and flight control turned out to be insufficiently difficult!
What I am not saying: I am not saying that the case of birds and planes is strong evidence that TAI will happen once we hit the HBHL milestone. I do think it is evidence, but it is weak evidence. (For my all-things-considered view of how many orders of magnitude of compute it’ll take to get TAI, see future posts, or ask me.) I would like to see a more thorough investigation of cases in which humans attempt to design something that has an obvious biological analogue. It would be interesting to see if the case of flight was typical. Flight being typical would be strong evidence for short timelines, I think.
What I am saying: I am saying that many common anti-short-timelines arguments are bogus. They need to do much more than just appeal to the complexity/mysteriousness/efficiency of the brain; they need to argue that some property X is both necessary for TAI and not about to be figured out for AI anytime soon, not even after the HBHL milestone is passed by several orders of magnitude.
Why this matters: In my opinion the biggest source of uncertainty about AI timelines has to do with how much “special sauce” is necessary for making transformative AI. As jylin04 puts it,
A first and frequently debated crux is whether we can get to TAI from end-to-end training of models specified by relatively few bits of information at initialization, such as neural networks initialized with random weights. OpenAI in particular seems to take the affirmative view[^3], while people in academia, especially those with more of a neuroscience / cognitive science background, seem to think instead that we'll have to hard-code in lots of inductive biases from neuroscience to get to AGI [^4].
In my words: Evolution clearly put lots of special sauce into humans, and took millions of generations of millions of individuals to do so. How much special sauce will we need to get TAI?
Shorty is one end of a spectrum of disagreement on this question. Shorty thinks the amount of special sauce required is small enough that we’ll “work out the details” within a few years of having the key variables (size and training time). At the other end of the spectrum would be someone who thought that the amount of special sauce required is similar to the amount found in the brain. Longs is in the middle. Longs thinks the amount of special sauce required is large enough that the HBHL milestone isn’t particularly relevant to timelines; we’ll either have to brute-force search for the special sauce like evolution did, or have some brilliant new insights, or mimic the brain, etc.
This post rebutted common arguments against Shorty’s position. It also presented weak evidence in favor of Shorty’s position: the precedent of birds and planes. In future posts I’ll say more about what I think the probability distribution over amount-of-special-sauce-needed should be and why.
Acknowledgements: Thanks to my RA, Amogh Nanjajjar, for compiling the data and building the graph. Thanks to Kaj Sotala, Max Daniel, Lukas Gloor, and Carl Shulman for comments on drafts.
Some footnotes:
Some bookkeeping details about the data:
The post Birds, Brains, Planes, and AI: Against Appeals to the Complexity / Mysteriousness / Efficiency of the Brain appeared first on Center on Long-Term Risk.
The post Against GDP as a metric for AI timelines and takeoff speeds appeared first on Center on Long-Term Risk.
I think world GDP (and economic growth more generally) is overrated as a metric for AI timelines and takeoff speeds.
Here are some uses of GDP that I disagree with, or at least think should be accompanied by cautionary notes:
First, I’ll argue that GWP is only tenuously and noisily connected to what we care about when forecasting AI timelines. Specifically, the point of no return is what we care about, and there’s a good chance it’ll come years before GWP starts to increase. It could also come years after, or anything in between.
Then, I’ll argue that GWP is a poor proxy for what we care about when thinking about AI takeoff speeds as well. This follows from the previous argument about how the point of no return may come before GWP starts to accelerate. Even if we bracket that point, however, there are plausible scenarios in which a slow takeoff has fast GWP growth and in which a fast takeoff has slow GWP growth.
I’ve previously argued that for AI timelines, what we care about is the “point of no return,” the day we lose most of our ability to reduce AI risk. This could be the day advanced unaligned AI builds swarms of nanobots, but probably it’ll be much earlier, e.g. the day it is deployed, or the day it finishes training, or even years before then when things go off the rails due to less advanced AI systems. (Of course, it probably won’t literally be a day; probably it will be an extended period where we gradually lose influence over the future.)
Now, I’ll argue that in particular, an AI-induced PONR is reasonably likely to come before world GDP starts to grow noticeably faster than usual.
Disclaimer: These arguments aren’t conclusive; we shouldn’t be confident that the PONR will precede GWP acceleration. It’s entirely possible that the PONR will indeed come when GDP starts to grow noticeably faster than usual, or even years after that. (In other words, I agree that the scenarios Paul and others sketch are also plausible.) This just proves my point though: GDP is only tenuously and noisily connected to what we care about.
GWP acceleration is the effect, not the cause, of advances in AI capabilities. I grant that GWP could in principle accelerate first and thereby help cause those advances, but I think this is very unlikely: what else besides AI could accelerate GWP? Space mining? Fusion power? 3D printing? Even if these things could in principle kick the world economy into faster growth, it seems unlikely that this would happen in the next twenty years or so. Robotics, automation, etc. plausibly might make the economy grow faster, but if so it will be because of AI advances in vision, motor control, following natural language instructions, etc. So I conclude: GWP growth will come some time after we get certain GWP-growing AI capabilities.
(Tangent: This is one reason why we shouldn’t use GDP extrapolations to predict AI timelines. It’s like extrapolating global mean temperature trends into the future in order to predict fossil fuel consumption.)
An AI-induced point of no return would also be the effect of advances in AI capabilities. So, as AI capabilities advance, which will come first: The capabilities that cause a PONR, or the capabilities that cause GWP to accelerate? How much sooner will one arrive than the other? How long does it take for a PONR to arise after the relevant capabilities are reached, compared to how long it takes for GWP to accelerate after the relevant capabilities are reached?
Notice that already my overall conclusion—that GWP is a poor proxy for what we care about—should seem plausible. If some set of AI capabilities causes GWP to grow after some time lag, and some other set of AI capabilities causes a PONR after some time lag, the burden of proof is on whoever wants to claim that GWP growth and the PONR will probably come together. They’d need to argue that the two sets of capabilities are tightly related and that the corresponding time lags are similar also. In other words, variance and uncertainty are on my side.
Here is a brainstorm of scenarios in which an AI-induced PONR happens prior to GWP growth, either because GWP-growing capabilities haven’t been invented yet or because they haven’t been deployed long and widely enough to grow GWP.
The point is, there’s more than one scenario. This makes it more likely that at least one of these potential PONRs will happen before GWP accelerates.
As an aside, over the past two years I’ve come to believe that there’s a lot of conceptual space to explore that isn’t captured by the standard scenarios (what Paul Christiano calls fast and slow takeoff, plus maybe the CAIS scenario, and of course the classic sci-fi “no takeoff” scenario). This brainstorm did a bit of exploring, and the section on takeoff speeds will do a little more.
In the previous section, I sketched some possibilities for how an AI-related point of no return could come before AI starts to noticeably grow world GDP. In this section, I’ll point to some historical examples that give precedents for this sort of thing.
Earlier I said that a godlike advantage is not necessary for takeover; you can scale up with a smaller advantage instead. And I said that in military conquests this can happen surprisingly quickly, sometimes faster than it takes for a superior product to take over a market. Is there historical precedent for this?
Yes. See my aforementioned post on the conquistadors (and maybe these somewhat-relevant posts).
OK, so what was happening to world GDP during this period?
Here is the history of world GDP for the past ten thousand years, on the red line. (This is taken from David Roodman’s GWP model) The black line that continues the red line is the model’s median projection for what happens next; the splay of grey shades represent 5% increments of probability mass for different possible future trajectories.
I’ve added a bunch of stuff for context. The vertical green lines are some dates, chosen because they were easy for me to calculate with my ruler. The tiny horizontal green lines on the right are the corresponding GWP levels. The tiny red horizontal line is GWP 1,000 years before 2047. The short vertical blue line is when the economy is growing fast enough, on the median projected future, such that insofar as AI is driving the growth, said AI qualifies as transformative. See this post for more explanation of the blue lines.
What I wish to point out with this graph is: We’ve all heard the story of how European empires had a technological advantage which enabled them to conquer most of the world. Well, most of that conquering happened before GWP started to accelerate!
If you look at the graph at the 1700 mark, GWP is seemingly on the same trend it had been on since antiquity. The industrial revolution is said to have started in 1760, and GWP growth really started to pick up steam around 1850. But by 1700 most of the Americas, the Philippines and the East Indies were directly ruled by European powers, and more importantly the oceans of the world were European-dominated, including by various ports and harbor forts European powers had conquered/built all along the coasts of Africa and Asia. Many of the coastal kingdoms in Africa and Asia that weren’t directly ruled by European powers were nevertheless indirectly controlled or otherwise pushed around by them. In my opinion, by this point it seems like the “point of no return” had been passed, so to speak: At some point in the past--maybe 1000 AD, for example--it was unclear whether, say, Western or Eastern (or neither) culture/values/people would come to dominate the world, but by 1700 it was pretty clear, and there wasn’t much that non-westerners could do to change that. (Or at least, changing that in 1700 would have been a lot harder than in 1000 or 1500.)
Paul Christiano once said that he thinks of Slow Takeoff as “Like the Industrial Revolution, but 10x-100x faster.” Well, on my reading of history, that means that all sorts of crazy things will be happening, analogous to the colonialist conquests and their accompanying reshaping of the world economy, before GWP growth even begins to accelerate!
That said, we shouldn’t rely heavily on historical analogies like this. We can probably find other cases that seem analogous too, perhaps even more so, since this is far from a perfect analogue. (e.g. what’s the historical analogue of AI alignment failure? Corporations becoming more powerful than governments? “Western values” being corrupted and changing significantly due to the new technology? The American Revolution?) Also, maybe one could argue that this is indeed what’s happening already: the Internet has connected the world much as sailing ships did, Big Tech dominates the Internet, etc. (Maybe AI = steam engines, and computers+internet = ships+navigation?)
But still. I think it’s fair to conclude that if some of the scenarios described in the previous section do happen, and we get powerful AI that pushes us past the point of no return prior to GWP accelerating, it won’t be totally inconsistent with how things have gone historically.
(I recommend the history book 1493, it has a lot of extremely interesting information about how quickly and dramatically the world economy was reshaped by colonialism and the “Columbian Exchange.”)
What about takeoff speeds? Maybe GDP is a good metric for describing the speed of AI takeoff? I don’t think so.
Here is what I think we care about when it comes to takeoff speeds:
I think that the best way to define slow(er) takeoff is as the extent to which conditions 1-5 are met. This is not a definition with precise resolution criteria, but that’s OK, because it captures what we care about. Better to have to work hard to precisify a definition that captures what we care about, than to easily precisify a definition that doesn’t! (More substantively, I am optimistic that we can come up with better proxies for what we care about than GWP. I think we already have to some extent; see e.g. operationalizations 5 and 6 here.) As a bonus, this definition also encourages us to wonder whether we’ll get some of 1-5 but not others.
The crucial question is, what do we mean by “the crucial period?”
I think we should define the crucial period as the period leading up to the first major AI-induced potential point of no return. (Or maybe, as the aggregate of the periods leading up to the major potential points of no return). After all, this is what we care about. Moreover there seems to be some level of consensus that crazy stuff could start happening before human-level AGI. I certainly think this.
So, I’ve argued for a new definition of slow takeoff, that better captures what we care about. But is the old GWP-based definition a fine proxy? No, it is not, because the things that cause PONR can be different from the things which cause GWP acceleration, and they can come years apart too. Whether there are warning shots, heterogeneity, risk awareness, multipolarity, and craziness in the period leading up to PONR is probably correlated with whether GWP doubles in four years before the first one-year doubling. But the correlation is probably not super strong. Here are two scenarios, one in which we get a slow takeoff by my definition but not by the GWP-based definition, and one in which the opposite happens:
Slow Takeoff Fast GWP Acceleration Scenario: It turns out there’s a multi-year deployment lag between the time a technology is first demonstrated and the time it is sufficiently deployed around the world to noticeably affect GWP. There’s also a lag between when a deceptively aligned AGI is created and when it causes a PONR… but it is much smaller, because all the AGI needs to do is neutralize its opposition. So PONR happens before GWP starts to accelerate, even though the technologies that could boost GWP are invented several years before AGI powerful enough to cause a PONR is created. But takeoff is slow in the sense I define it; by the time AGI powerful enough to cause a PONR is created, everyone is already freaking out about AI thanks to all the incredibly profitable applications of weaker AI systems, and the obvious and accelerating trends of research progress. Also, there are plenty of warning shots, the strategic situation is very multipolar and heterogeneous, etc. Moreover, research progress starts to go FOOM a short while after powerful AGIs are created, such that by the time the robots and self-driving cars and whatnot that were invented several years ago actually get deployed enough to accelerate GWP, we’ve got nanobot swarms. GWP goes from 3% growth per year to 300% without stopping at 30%.
Fast Takeoff Slow GWP Acceleration Scenario: It turns out you can make smarter AIs by making them have more parameters and training them for longer. So the government decides to partner with a leading tech company and requisition all the major computing centers in the country. With this massive amount of compute and research talent, they refine and scale up existing AI designs that seem promising, and lo! A human-level AGI is created. Alas, it is so huge that it costs $10,000 per hour of subjective thought. Moreover, it has a different distribution over skills compared to humans—it tends to be more rational, not having evolved in an environment that rewards irrationality. It tends to be worse at object recognition and manipulation, but better at poetry, science, and predicting human behavior. It has some flaws and weak points too, more so than humans. Anyhow, unfortunately, it is clever enough to neutralize its opposition. In a short time, the PONR is passed. However, GWP doubles in four years before it doubles in one year. This is because (a) this AGI is so expensive that it doesn’t transform the economy much until either the cost comes way down or capabilities go way up, and (b) progress is slowed by bottlenecks, such as acquiring more compute and overcoming various restrictions placed on the AGI. (Maybe neutralizing the opposition involved convincing the government that certain restrictions and safeguards would be sufficient for safety, contra the hysterical doomsaying of parts of the AI safety community. But overcoming those restrictions in order to do big things in the world takes time.)
Acknowledgments:
Thanks to the people who gave comments on earlier drafts, including Katja Grace, Carl Shulman, and Max Daniel. Thanks to Amogh Nanjajjar for helping me with some literature review.
The post Against GDP as a metric for AI timelines and takeoff speeds appeared first on Center on Long-Term Risk.
The post Incentivizing forecasting via social media appeared first on Center on Long-Term Risk.
Full article: EA Forum
The post Incentivizing forecasting via social media appeared first on Center on Long-Term Risk.
The post Plans for 2021 & Review of 2020 appeared first on Center on Long-Term Risk.
Plans for 2021
Review of 2020
We are building a global community of researchers and professionals working on reducing risks of astronomical suffering (s-risks). (Read more about us here.)
Earlier this year, we consolidated the activities related to s-risks from the Effective Altruism Foundation and the Foundational Research Institute under one name: the Center on Long-Term Risk (CLR). We have been based in London since late 2019. Our team is currently about 10 full-time equivalents strong, with most of our employees full time.
At the end of last year, we published a research agenda on this topic. After significant progress in 2020 (see Review section), work in this area will continue to be our main priority in 2021.
We plan to further refine our prioritization between different research directions and intervention types within this broad area. Interventions differ across a multitude of dimensions. For instance, some are multilateral in that they require technical solutions to be implemented by multiple actors, whereas others are unilateral. Some interventions primarily address acausal conflict; others causal ones. We want to better prioritize between these dimensions. This will often require object-level work, e.g., to learn more about the tractability of a given intervention-type.
We plan to build a field around bargaining in artificial learners (see the related sections 3-6 of our research agenda) using mainly tools from game theory and multi-agent reinforcement learning (MARL). We want to draw both from the relevant machine learning sub-community and the longtermist effective altruism community. Through our research this year (see below), we now have a good understanding of what work we consider valuable in this field. We plan to publish original research explaining foundational technical problems in this area, finish a repository of tools for easily running experiments, and make grants to encourage others to do similar work. We plan to publish a post on this forum explaining the reasoning behind our focus on this area.
We plan to take initial steps in the field of AI governance related to cooperation & conflict involving AI systems. Following our analysis of problems in multipolar deployment scenarios, we plan to publish a post outlining the governance challenges associated with addressing these problems.
We first wrote about this cause in early 2020 in an EA Forum post. Since then, we have completed additional work internally, parts of which we plan to publish next year.
We plan to assess how important this area is relative to our other work because this is a new cause area, and we are still uncertain how it compares to our existing priorities. We will do this by learning more about the relevant scientific fields, technologies, and policy levers. We will also conduct or support technical work on how preferences to create suffering could arise in TAI systems. We plan to publish a post introducing this idea. We might make some targeted grants to experts who could help us improve our understanding of this area.
Because work on s-risks is still in its infancy, it could be valuable to explore entirely new areas. This will not be a systematic effort in 2021. Individual researchers will investigate new areas if they find them sufficiently promising. Current contenders include (among other things): political polarization (or at least specific manifestations of it) and collective epistemic breakdown (e.g., as a result of increasingly powerful persuasion tools).
Research will remain CLR’s focus in 2021 because there remain many open questions about s-risks and how to address them. Through our efforts this year, we have also placed ourselves in a good position to scale up our research efforts (see “Review of 2020” below).
We will grow our grantmaking efforts in 2021. We will focus increasingly on proactive grants following investigations of specific fields. We have found general application rounds not to be very valuable so far.
We will continue our routine community-building activities in 2021 while running tests of more efficient ways of getting people up to speed on our thinking. This work has been important for cultivating hires at CLR. We expect to invest about as many resources into this as in 2020.
We are still uncertain what we will do to disseminate our research and advocate for our priorities. First, we plan to review several key decisions that have influenced our past efforts. For instance, we will evaluate the effects of the communication guidelines written in collaboration with Nick Beckstead from the Open Philanthropy Project. (For more details on these guidelines, see this section of our review from last year.) We had originally planned to do so at the end of this year but postponed it by a few months. Second, the development of the COVID-19 pandemic will determine whether we can run in-person events and travel to important EA hubs like Oxford and the San Francisco Bay Area. In any case, we expect to continue to give talks and to share our work through targeted channels.
We will continue exploring the possibility of high-leverage projects that could enable many more people to work in our priority areas.
We plan to improve how we evaluate our work and impact. Currently, we only do systematic annual reviews of our activities internally. We plan to elicit feedback from outside experts to assess the quality and impact of our work. We are considering survey work, in-depth assessment of specific research output, and qualitative interviews.
Last year, we wrote that the most appropriate way to review our work each year would be to answer “a set of deliberately vague performance questions” (inspired by GiveWell’s self-evaluation questions). We put these questions to our team and used their input to write the overall assessment below. We plan to improve this procedure further next year.
This year was a year of transition for CLR, both in terms of staff changes and building out new research directions in malevolence and bargaining. Our successes consisted mostly of building long-term capacity and making internal research progress, rather than public research dissemination. The work we have done this year has laid the groundwork for more public research to be released in 2021 (see above).
Have we made progress towards becoming a research group and community that will have an outsized impact on the research landscape and relevant actors shaping the future? (This question tracks whether we are building the right long-term capacity to produce excellent research and making it applicable to the real world. It also includes whether we are focusing on the correct fields, questions, and activities to begin with.)
We have increased our capacity substantially across most functions of the organization.
We hired six people for our research team: Alex Lyzhov, Emery Cooper, Daniel Kokotajlo, and Julian Stastny as full-time research staff; Maxime Riché as a research engineer; Jia Yuan Loke as a part-time research assistant. Another offer is still pending.
With the CLR Fund, we made three grants designed to help junior researchers skill up. The recipients were Anthony DiGiovanni, Rory Svarc, and Johannes Treutlein.
Much of our research this year constitutes capacity-building. It opened up a lot of opportunities for further study, grantmaking, and strategy progress. For instance, the post on Reducing long-term risks from malevolent actors created a novel cause area for CLR and others in the community. This has already led to internal research progress, some of which we will publicize early next year. Another example is our work on an internal research repository of tools for our machine learning research that will facilitate future work in this area.
In 2020, we completed three shallow investigations related to our grantmaking efforts: moral circle expansion, malevolence, and technical research at the intersection of machine learning and bargaining. We are actively pursuing grant opportunities in the last area.
We ran a three-month summer research fellowship for the first time. We received 67 applications and made 11 offers, all of which were accepted. Two of them will do their fellowship in 2021 instead of this year. We were able to make at least four hires and two grants as a direct result, which we think is a good indication of the program’s success. We are still conducting a more rigorous evaluation of the program focusing on the experience of the fellows and how the program benefitted them. The experience we gained this year will make it easier to rerun an improved program with fewer resources.
The only function where our capacity shrank is operations. Our COO, Alfredo Parra, and Daniel Kestenholz, part-time operations analyst, left. Their responsibilities were taken over by Stefan Torges and Amrit Sidhu-Brar, who joined our team earlier this year in a part-time capacity. This has not been enough to compensate for Alfredo’s and Daniel’s departures, so we decided to bring on Jia Yuan Loke, who will start in early 2021 (splitting his work between operations and research). At that point, we expect to be at a capacity level similar to that at the beginning of 2020.
Has our work resulted in research progress that helps reduce s-risks (both in-house and elsewhere)?
A major theme of our work this year has been that risks of bargaining failure might be reduced via coordination by AI developers on certain aspects of their systems, e.g., to address prior and equilibrium selection problems. This suggests potential interventions in both AI governance and technical AI safety, some of which we plan to write on publicly in the first half of 2021 (see above).
Our work on bargaining failure has also led us to scale up our efforts at the intersection of game theory and multi-agent reinforcement learning (e.g., here, here). We have identified this as a promising avenue for increasing awareness of technical hurdles for successful cooperation among AI systems and constructing candidate technical solutions to some of these problems. Our ongoing work includes building a repository of algorithms, environments, and other tools to facilitate machine learning research in multi-agent environments. This repository better captures the kinds of cooperation problems we are interested in than the environments currently studied in the literature and allows for better evaluation of multi-agent machine learning methods.
Beginning with our post on reducing long-term risks from malevolent actors, we have been investigating possible pathways to s-risks from both malevolent humans and analogous phenomena in AI systems. This includes an ongoing investigation of possible grantmaking to reduce the influence of malevolent humans and a post introducing the risk of preferences to create suffering arising in TAI systems.
Grantees of the CLR Fund also published research over the course of 2020. Kaj Sotala expanded his sequence on multi-agent models of mind. Arif Ahmed published two articles on evidential decision theory in the journal Mind. The Wild Animal Initiative published a post on long-term design considerations of wild animal welfare interventions.
Have we communicated our research to our target audience, and has the target audience engaged with our ideas?
The main effort to disseminate our work was a series of talks at various EA and AI safety organizations in the second half of this year: 80,000 Hours, CHAI, CSER, FHI, GPI, OpenAI, and the Open Philanthropy Project. We did not give our planned talk at EAG San Francisco because that conference was canceled.
Contrary to our plans for this year, we did not run any research workshops because of the COVID-19 pandemic. We decided against hosting any virtual ones because we lacked capacity and did not consider the reduced value from a virtual event worth the effort.
Are we a healthy organization with an effective board, staff in appropriate roles, appropriate evaluation of our work, reliable policies and procedures, adequate financial reserves and reporting, high morale, and so forth?
It is our impression that the people on our team are in the appropriate roles. We are currently trialing a new person as our Director after Jonas Vollmer left CLR in June. We will complete the evaluation of their fit soon.
We believe that most of our policies and procedures are sound. However, many people joined our team this year. This requires us to be more explicit about some policies than we have been in the past, e.g., compensation policy, team retreat participation. We are addressing these issues as they come up, which has worked well so far.
Our financial reserves decreased significantly this year, which we are trying to address with our December fundraiser (see “Financials” below). We are glad that CERR (see above) committed to contribute roughly their “fair share” to CLR. However, this is not enough to cover all of our expenses. (See below for more information on our financial situation.)
The post Plans for 2021 & Review of 2020 appeared first on Center on Long-Term Risk.
The post S-Risk Intro Fellowship appeared first on Center on Long-Term Risk.
Applications have now closed for our next Fellowship, taking place from January-February 2023. To be notified when we next run an Intro Fellowship, please sign up to our mailing list in the footer of this page.
The fellowship is six weeks long and involves a time commitment of about 3-5 hours per week. It covers what we currently consider to be the most important sources of s-risk (TAI conflict, risks from malevolent actors).
Fellowship participants are divided into small cohorts. Each week covers a new topic. Participants explore relevant background materials in their own time, and then have the opportunity to discuss the topic with each other and with CLR staff during a one-hour Zoom meeting. For the final week, each cohort chooses from a list of preselected topics to learn about, giving participants the ability to tailor the material in a way that’s most useful for them.
In addition to having group discussions, participants attend talks by s-risk researchers and are given the option to schedule 1-1 personalized career calls with us. CLR researchers also join fellowship meetings about topics related to their work, to answer questions and help facilitate discussion.
We think this program will be useful for you if:
If you’re interested in applying for our Summer Research Fellowship, this fellowship is a good opportunity to learn more about our work and to strengthen your application by giving you a better understanding of what we do and how you could contribute.
There might be more idiosyncratic reasons to apply, and the criteria above are intended as a guide rather than as strict requirements.
If you have any questions about the program or are uncertain whether to apply, you can comment on this post, or reach out to tristan.cook@longtermrisk.org.
To be notified when we next run an Intro Fellowship, please sign up to our mailing list in the footer of this page.
The post S-Risk Intro Fellowship appeared first on Center on Long-Term Risk.
The post Commitment ability in multipolar AI scenarios appeared first on Center on Long-Term Risk.
The ability to make credible commitments is a key factor in many bargaining situations ranging from trade to international conflict. This post builds a taxonomy of the commitment mechanisms that transformative AI (TAI) systems could use in future multipolar scenarios, describes various issues they have in practice, and draws some tentative conclusions about the landscape of commitments we might expect in the future.
A better understanding of the commitments that future AI systems can make is helpful for predicting and influencing the dynamics of multipolar scenarios. The option to credibly bind oneself to certain actions or strategies fundamentally changes the game theory behind bargaining, cooperation, and conflict. Credible commitments can work to stabilize positive-sum agreements, and to increase the efficiency of threats (e.g. Schelling 1960), both of which could be relevant to how well TAI trajectories will reflect our values.
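As a toy illustration of how a credible commitment changes the game theory of a bargaining situation, here is a minimal sketch of the classic game of chicken (all payoff numbers are arbitrary, and the code is mine rather than anything from the cited literature): once one player can credibly remove their own option to swerve, the other player's best response settles the outcome in the committed player's favor.

```python
# Toy game of chicken: each driver either Swerves or goes Straight. Payoffs illustrative.
PAYOFFS = {  # (row move, column move) -> (row payoff, column payoff)
    ("Swerve", "Swerve"): (0, 0),
    ("Swerve", "Straight"): (-1, 1),
    ("Straight", "Swerve"): (1, -1),
    ("Straight", "Straight"): (-10, -10),
}
MOVES = ["Swerve", "Straight"]

def best_response(committed_row_move, allowed_moves=MOVES):
    """Column player's best response to a row move they know is locked in."""
    return max(allowed_moves, key=lambda m: PAYOFFS[(committed_row_move, m)][1])

# Without commitment, both (Straight, Swerve) and (Swerve, Straight) are equilibria,
# and brinkmanship decides which one is reached. With a credible commitment (e.g. the
# visibly thrown-out steering wheel), the row player removes "Swerve" from their own
# options, and the column player's best response hands them their preferred outcome.
committed_move = "Straight"
reply = best_response(committed_move)
print(committed_move, reply, PAYOFFS[(committed_move, reply)])  # Straight Swerve (1, -1)
```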
Because human goals can be contradictory and even broadly aligned AI systems could come to prioritize different outcomes depending on their domains and histories, these systems can end up in undesirable competitive situations and various bargaining failures where a lot of value is lost. Similarly, if some systems in a multipolar scenario are well aligned and others less so, the outcome can be disastrous unless stable peaceful agreements can be reached. As an example of the practical significance of commitment ability in stabilizing peaceful strategies, standard theories in international relations hold that conflicts between nations are difficult to avoid indefinitely primarily because there are no reliable commitment mechanisms for peaceful agreements (e.g. Powell 2004, Lake 1999, Rosato 2015), even when nations would overall prefer them.
In addition to the direct costs of conflict, the lack of enforceable commitments leads to continuous resource loss from arms races and other preparations for possible acts of hostility. It can also simply prevent gains from trade that binding prosocial contracts and general high trust could unlock. A strategic landscape that resembles current international relations in these respects seems possible in a fully multipolar takeoff scenario, where no AI system has a decisive advantage over the others, and no external rule of law can be strongly enforced over all the systems. If AI systems had a much greater ability to commit than we do, however, they could avoid recapitulating these common pitfalls of human diplomacy. As commitments also make threats a more feasible strategy, they could on the other hand also cause significant value loss for almost any goal system. To us, this of course matters especially in situations where some of the AI systems involved are at least partially aligned with goals we care about.
The potential consequences of credible commitments for AI systems will be discussed more thoroughly in forthcoming work. The purpose of this post is mostly to investigate whether and how credible commitments could be feasible for such systems in the first place.1 As commitment mechanisms differ in which kinds of commitment they are best suited for, though, some implications for consequences will also be tentatively explored.
Some quick notes on the terminology in this post:
Commitment ability refers here to an agent's ability to cause others to have a model of its relevant actions and future behavior which matches its own model of itself, or its genuine intentions.2 This can naturally include complex probabilistic models and models of what an agent will do conditional on the behavior of others. (While an agent's model of itself may not always correspond to what it actually ends up doing, the noise from incorrect models should ideally also be low enough that it doesn't affect the bargaining landscape much.) This definition diverges somewhat from how commitments are typically understood, but captures better a broader transparency in bargaining situations.
Closer to the conventional concept of commitment, commitment mechanisms here are ways to bind yourself more strongly to certain future actions in externally credible ways (such as visibly throwing out your steering wheel in a game of chicken).
Approaches to commitment in this context are simply the higher-level frameworks that agents can use to assess and increase the commitment ability of themselves and others in their environment. The main content of this post will be outlining various approaches to commitment that could become relevant in multipolar AI scenarios.
This section will discuss ways through which AI systems could surpass humans in commitment ability. It will also cover the main reasons why this isn't self-evident, even between systems that are overall far more capable than humans. In particular, there are three properties of commitment approaches that are at least not obviously satisfied by any of the candidates here, but seem important when talking about commitment ability in a given future environment:
Classical approaches: mutually transparent architectures
Early discussions in AI safety often assumed that transformative AI systems would be based on advanced models of the fundamental principles of intelligence. Their cognitive architectures could therefore be quite elegant, and perhaps arbitrarily transparent to other similarly intelligent agents. The concept of systems checking each other's source codes, or allowing a trusted third party to verify them, was often used as a shorthand for this kind of mutual interpretability [link]. For highly transparent agents whose goals are also contained in compact formal representations, such as utility functions, reliable alliances could even happen through merging utility functions (Dai 2009, 2019). Work on program equilibrium as a formal solution to certain game-theoretic dilemmas uses source code transparency as a starting point (Tennenholtz 2004, see also Oesterheld 2018), assuming complete information of the other agent's syntax to condition one's response on.3 Further work has also generalized the idea of conditional commitments and the cooperative equilibria they support (Kalai et al 2010).
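To make the source-code-transparency idea concrete, here is a minimal sketch in the spirit of the program-equilibrium literature (my own illustration, not code from any of the cited papers): each player submits a program for a one-shot Prisoner's Dilemma, and each program may condition its move on the other program's source. Payoff numbers and function names are illustrative.

```python
import inspect  # getsource needs the code to live in a file, e.g. run as a script

# One-shot Prisoner's Dilemma payoffs (row player, column player); values illustrative.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent's source code is identical to mine."""
    return "C" if opponent_source == inspect.getsource(clique_bot) else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, regardless of the opponent's source."""
    return "D"

def play(prog_a, prog_b):
    """Run both submitted programs on each other's source and look up the payoffs."""
    move_a = prog_a(inspect.getsource(prog_b))
    move_b = prog_b(inspect.getsource(prog_a))
    return PAYOFFS[(move_a, move_b)]

print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation between identical programs
print(play(clique_bot, defect_bot))  # (1, 1): a defector gains nothing by deviating
```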
These approaches seem less compatible with recent advances in AI. Capability gain is currently mostly driven by reinforcement learning in increasingly large and complex environments, less by progress in understanding the building blocks of general cognition (Sutton 2019). It seems plausible that the ultimately successful paradigms for transformative AI will conceptually be quite close to contemporary work (Christiano 2016). If this is the case, and superhuman systems will be hacky and opaque similarly to human brains, their mutual interpretability could also remain limited like it is between humans. Cognitive heterogeneity in itself is already a hindrance to mutual understanding, and will likely be much greater in AI systems than in humans. Considering that all humans have a shared evolutionary history and are strongly adapted for social coordination, we could even be much better at credibility and honesty than independently-trained AI systems, if they are developed through very different methods or in varying conditions, and have no such natural adaptations for transparency.4
On the other hand, superhuman agents could also be able to define and map the foundations of intelligence better than human researchers. Even prosaic trajectories could thus eventually lead to more compact builds and allow for higher interpretability. Though beyond the reach of human researchers, intentionally designed and elegant cognitive architectures could still ultimately be more efficient than ones that were born through less controllable (e.g. evolutionary) processes. Increased commitment ability in itself might already motivate agents to move in this direction, if they expect transparency to facilitate more gains from trade or some other competitive advantage. The bargaining landscape would then change in a predictable pattern over time: early AI systems would have poor commitment ability despite otherwise superhuman competence, but after more intentional refactoring towards transparency, strong commitments through classical approaches would eventually become available to their successors.5
This kind of self-modification would still lack robust safeguards against some conceptually simple exploits. Even if one could comb through an agent's internal structure at some point after it self-modified to be highly interpretable, it would be costly to make sure that it hasn't, for example, secretly changed something relevant in the environment before this process. In addition, asymmetries in competence would likely appear between agents due to their different domains, histories, and goals. Whether global differences in competence or just local blind spots, these asymmetries might make obfuscating one's intentions a viable strategy after all, and decrease the general credibility of commitments.
If transformative AI systems will be built with current paradigms, existing research on interpretability might also be helpful when predicting commitment ability. Even if the kind of syntactic transparency required for program equilibrium approaches wasn't feasible, high levels of trust can be achieved as long as other ways exist to understand another agent's internal decision procedures from the outside. This resembles a more advanced version of human researchers trying to make contemporary machine learning models more understandable to us.
The literature on interpretability currently lacks a unified paradigm, but it often divides methods for interpretation into model-based and post-hoc approaches (Murdoch et al 2018). The former require the models themselves to be inherently more understandable through design choices such as sparsity of parameters, modular structures, or more feature engineering based on expert knowledge about the domain in question. These ideas can possibly be extrapolated to TAI scenarios, and some key concepts will be explored below. The latter are more specific to current narrow models, and mostly deal with measures for feature importance, i.e. clarifying which features or interactions in the training data have contributed to the model to which degrees. With enough information of how an agent has been trained, analogous methods could perhaps be useful, but likely laborious; they will not be discussed further in this post.
Generally, model-based methods have a shared problem in how they constrain the model design in its other capabilities. Some interpretability researchers have suggested that these constraints are less prohibitive than they seem, at least in contemporary applications. When the task is to make sense of some dataset, the set of models that can accomplish such a predictive task (known as the Rashomon set) is potentially large, and thus arguably likely to include some intrinsically understandable models as well (Rudin 2019). This idea might extrapolate to general intelligences quite poorly in practice, especially when computational efficiency is also a concern and the setting is competitive in general. However, in a sense it's related to the idea that there could be some eventually discoverable highly compact building blocks that suffice for general intelligence, even if many of the paths there are messier. One way through which this could hold is that the world and its relations themselves are fundamentally simple or compressible (see e.g. Wigner 1960).
Another way in which even complex systems could achieve more transparency is through modularity, where various parts of an agent's cognition can be examined and interpreted somewhat independently. Different cognitive strategies, employed in different situations depending on some higher-level judgment, could potentially be both effective and fairly transparent due to their smaller size (and possibly higher fundamental comprehensibility and traceable history) compared to a generally intelligent agent. Whether strongly modular structures are in fact functional or competitive enough in this context will be discussed in forthcoming work, but the greater transparency of modular minds is questionable. It seems unlikely that in a complex world, parts of an effective agent's reasoning could be so separable from its other capacities so as to leave no context-dependent uncertainties, or opportunities to secretly defect by using seemingly trustworthy modules in underhanded ways. This certainly doesn't seem to be the case in human brains, despite their likely quite modular structure (for an overview, see e.g. Carruthers 2006, Robbins 2017).
Overall, the relation between interpretability and the kind of transparency that facilitates commitments is not well defined. Being able to interpret an agent's decisions doesn't seem to directly mean that its behavior is simulatable or otherwise verifiable to you in specific bargaining situations. Transparency through these means seems especially implausible when local or global asymmetries between agents are large, and possibly when the scenario is adversarial.
Centralized collaborative approaches: arbitrator systems for verifying commitments
A less architecture-dependent but also less satisfying approach is simply assuming that commitment ability is a very difficult problem, and like most very difficult problems, trivially solved by throwing a lot of compute at it. Perhaps AI systems will remain irredeemably messy, but will be motivated to find ways to cooperate in spite of this. One route they might consider is similar to what human societies have often converged on: centralize enough power and resources to enforce or check contracts that individual humans otherwise can't credibly commit to.
In this context, the central power could exist either for simply verifying the intentions behind arbitrary commitments, or for punishing defectors afterwards if they break established laws. As the latter task has been brought up in other contexts [link] and doesn't constitute a meaningfully multipolar scenario, this section will mostly discuss the former. An overseer that merely verifies contracts and commitments instead of dealing out punishments could be more palatable even for agents with idiosyncratic preferences about societal rules. It only requires agents to believe that the ability to make voluntary credible commitments will be positive in expectation.6 It would regardless capture many of the benefits of a central overseer, as one main reason for punitive systems is also enforcing otherwise untenable commitments.
The idea behind this mechanism is only that while the agents can't interpret each other or predict how well they would stick to commitments, a far more capable system (here, likely just a system with vastly more compute at its disposal) could do it for them. If several agents of similar capability are involved in collaboratively constructing such a system, they can be fairly confident that no single agent can secretly bias it, or otherwise manipulate the outcome. This system would then serve as an arbitrator, likely with no other goals of its own, and remain far enough above the competence level of any other agent in the landscape. Assuming that its subjects will continuously strive for expansion and self-improvement, this minimal-state brain would also need to keep growing. As long as it remains useful, it could do so by collecting resources from the agents that expect to benefit from its abilities.
How much more intelligent would such a system need to be, though? Massively complex neural architectures could well remain inscrutable even to much more competent agents. In terms of neural connections, no human could use a snapshot of a salamander brain to predict its next action, let alone its motivations one hour in the future. Even the simple 302-neuron connectome of the nematode C. elegans mostly escapes our understanding, despite years of effort at emulating its functions and our own neuron count at 8.6x10^10 (see OpenWorm and related projects, 2020). Most likely, judgments about an agent's honesty would need to rely in part on inferences based on its origins and history, slight behavioral inconsistencies, and other subtle external signs it would hopefully not be clever enough to fully cover up. Some causal traces of intentions to deceive bargaining partners could be expected. For the iconic argument in this space, see Yudkowsky (2008).
A major weakness to this approach is that the costs of running such a system seem substantial regardless of how large the difference needs to be. The gains from trade that agents could secure through increased commitment ability would have to be higher than that, and it isn't clear that this is the case. Eventually, there might not be enough surplus left on the table to motivate further contributions to such a costly system. On the other hand, if there is some point after which most of the valuable commitments have been made and an arbitrator is no longer needed, this could just be because the bargaining landscape is then thoroughly locked into a decently cooperative state: if bargaining failures were still frequent in expectation, there would be more potential surplus left. If so, the overall costs of relying on such a system might not be too high in the long run, as it would mostly be needed through some transient unstable timespan during early interactions between AI systems.
There are a few ways through which a centralized arbitrator system could be set up, for example:
Decentralized collaborative approaches
A potentially less costly collaborative approach could work if credibility can be mapped to a multidimensional model where agents start out differentially trusting each other because of path-dependent or idiosyncratic reasons, and can then form networks to verify commitments. For example, due to domain-specific differences and histories, even agents whose overall competence is roughly on your level could spot minor details that you miss because of your own limitations, but that are relevant to your credibility. If architectural similarities matter for transparency, some agents will be able to understand each other's internal workings better than others; this could be the case if copies either of agents or their internal modules become common. Some agents can also come to share origins or relevant interactions that allow them to form better models of each other.
This approach differs from a centralized project in that it describes conditions where gradients of trust form with low initial effort in these path-dependent dimensions. As trust at least in a general sense can be largely transitive, the costs of communicating within a network under such conditions could stay reasonably manageable. More than a specific mechanism, this approach would be a way to extend already existing local and empirical commitment ability, at least in a probabilistic manner, through larger areas in the bargaining landscape.
As a simplified example, say that there are three agents, A, B, and C, who as a binary either can or cannot trust each other. Agent A can trust agent B (and vice versa) because of many shared modules, but cannot trust the internally very different agent C. Agent B can trust agent C (and vice versa) because of a shared history. If agents A and C then want to communicate credibly with each other, it seems easy for them to verify their agreements through their links to agent B.
In larger agent spaces, longer chains and networks of agents with differential levels of trust could plausibly come to influence the dynamics of commitment through similar network structures. Even without multiple dimensions to make the task potentially cheaper, however, a less centralized approach can be pursued. Rather than specifically building a central system for the black-box task of verifying commitments, a network of agents that can along with their other pursuits trade various evaluation services could be a more dynamic way to get the required amounts of compute for assessing individual contracts.
While modeling the payoffs that agents could receive by helping others communicate is not a central question in this context, it is interesting when considering the incentives for such tasks. Various models have been built in cooperative game theory to represent limited communication between different parts in a network of collaborators and the payoff distributions in such situations (see e.g. Slikker and Van Den Nouweland 2001). A widely used formalism is the Myerson value, which builds on the Shapley value and allocates a greater part of the surplus in a coalition to players who facilitate communication and therefore cooperation between others (Myerson 1977, 1980; Shapley 1951). This and related concepts correspond reasonably well to scenarios where trust differences allow only some agents to credibly communicate with each other. Forthcoming work will investigate in more detail how cooperative equilibria can be sustained in limited communication situations.
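As a concrete (and purely illustrative) example of how the Myerson value rewards intermediaries, the sketch below computes it for the three-agent trust network described above, under the hypothetical assumption that any group of two or more agents who can all communicate, directly or via trusted intermediaries, generates one unit of surplus. Agent B, who bridges A and C, captures most of the value.

```python
from itertools import permutations

players = ["A", "B", "C"]
trust_links = {frozenset({"A", "B"}), frozenset({"B", "C"})}  # no direct A-C link

def components(coalition):
    """Connected components of a coalition under the trust graph."""
    remaining, comps = set(coalition), []
    while remaining:
        frontier = {remaining.pop()}
        comp = set()
        while frontier:
            node = frontier.pop()
            comp.add(node)
            neighbours = {q for q in remaining if frozenset({node, q}) in trust_links}
            remaining -= neighbours
            frontier |= neighbours
        comps.append(comp)
    return comps

def v(coalition):
    # Hypothetical characteristic function: any group of two or more agents that
    # can all communicate generates one unit of surplus.
    return 1 if len(coalition) >= 2 else 0

def v_restricted(coalition):
    # Myerson's graph-restricted game: value only accrues within connected components.
    return sum(v(c) for c in components(coalition))

def myerson_value(player):
    # Shapley value of the graph-restricted game: average marginal contribution
    # over all orders in which the agents could join the coalition.
    orders = list(permutations(players))
    total = 0
    for order in orders:
        predecessors = set(order[: order.index(player)])
        total += v_restricted(predecessors | {player}) - v_restricted(predecessors)
    return total / len(orders)

for p in players:
    print(p, round(myerson_value(p), 3))
# A 0.167, B 0.667, C 0.167 -- the intermediary captures most of the surplus.
```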
Overall, the approach described in this section mostly serves as a rough sketch for much more sophisticated network strategies that AI systems could devise, but even with very liberal hypothetical extrapolation, the bridge to plausible practical scenarios seems shaky. At the very least, the availability and strength of any local credibility links is determined mostly by higher-level features of the agent space, though intentionally creating more of them seems possible for cooperative human researchers during development.
Automated approaches
High transparency can sometimes be achieved by finding a commitment mechanism simple enough that its workings are unambiguous from the outside. By separating a commitment from the sophisticated strategies and other cognitive complexities of the agent itself, an effective approach can consist simply of automatic structures responding to the environment in predictable ways. The Cold War-era Soviet Union presumably built nuclear control systems that could be triggered by sensor input alone, to ensure retaliation with minimal human intervention (Wikipedia 2020).7 Companies can irreversibly invest in and deploy specific assets, tying their hands to a certain strategy, often in an observable and understandable way (e.g. Sengul et al. 2011). Similarly, militaries can reduce their options by mobilizing troops that would be too costly to recall regardless of what one's opponent chooses to do (Fearon 1997).
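To illustrate what "simple enough to be unambiguous from the outside" might mean, here is a toy sketch of such an automated trigger. The response rule is a few lines of inspectable code over sensor readings; the threshold and sensor labels are invented for the example.

```python
# Toy sketch of an automated commitment device: the entire decision rule is short
# enough that an outside observer who trusts the sensors can predict its behavior.
RETALIATION_THRESHOLD = 3  # hypothetical: number of independent sensors that must agree

def should_trigger(sensor_readings):
    """Fire the committed response iff enough independent sensors report an attack."""
    return sum(1 for reading in sensor_readings if reading == "attack") >= RETALIATION_THRESHOLD

# The transparency comes from the rule itself; the hard part, as discussed below,
# is whether the sensors, and the environment they observe, are equally transparent.
print(should_trigger(["attack", "quiet", "attack", "attack"]))  # True
print(should_trigger(["attack", "quiet", "quiet", "quiet"]))    # False
```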
While powerful in many specific cases, this approach is quite limited especially in complex environments. With large differences in general or domain-specific competence, there might be few situations where simple automated mechanisms can even be built transparently enough. Regardless of how interpretable and robust some physical device or resource investment seems, it doesn't remove the intelligent agent from the equation, or again prevent it from setting up the environment in a clever way that allows for defection after all.
In most contexts an automated approach has many other downsides as well, such as a lack of flexibility and corrigibility if there are unpredictable events in the environment. It seems unlikely that highly verifiable automated mechanisms could be built with the resolution to track the ideal commitments one could make in complex situations, and most interesting contracts could likely not be represented at all. In environments with agents that are much more diverse than humans, nations, or organizations, the physical reliability of simple commitment devices could be illusory even when they are set up by agents that are sincere in their commitments. While the fearful symmetry seen in nuclear deterrence strategies may be the best option to practically reduce the incidence of conflicts, it has historically led to mistaken near-launches due to unforeseen details such as weather anomalies (Wikipedia 2020). This illustrates how even applying a simple commitment mechanism requires a good understanding of the environment, including one's peers and their behavior space, when designing viable trigger conditions for whatever the intended procedure is. The worse one's models of the other players are, the harder this task would likely be.
Strategic delegation
In the economic and game-theoretic literature, a related but typically more flexible approach is strategic delegation, where principals deploy agents with different direct incentives to act on their behalf. By optimizing for something other than the principal's actual goal, delegated agents can sometimes reach better bargaining outcomes, because the desired commitment is naturally more favorable to their incentives. For example, a manager may be responsible merely for keeping a company in the market, not for its ultimate profit margins, which credibly changes how they will respond to threats in entry deterrence games (Fershtman and Judd, 1987). The original formalism behind strategic delegation (Vickers, 1985) involves an agent appointment game that precedes the actual game between the agents and determines how that latter game is played, with payoffs given by an exogenous outcome function. More recent work (Oesterheld and Conitzer, 2019) describes how delegates with modified incentives can safely strive for Pareto improvements.
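As a stylized illustration of the Fershtman-Judd point, rather than their actual model, consider the toy entry deterrence game solved by backward induction below. The payoffs and the delegate's incentive scheme are invented for the example; the point is only that changing what the incumbent's manager is paid for changes which threats are credible.

```python
# Toy entry deterrence game. Payoffs are (incumbent profit, entrant profit):
# entrant stays out -> (2, 0); entrant enters and incumbent accommodates -> (1, 1);
# entrant enters and incumbent fights a price war -> (0, -1).
PAYOFFS = {"out": (2, 0), ("enter", "accommodate"): (1, 1), ("enter", "fight"): (0, -1)}

def profit_maximizer(incumbent_profit, entrant_profit):
    return incumbent_profit

def market_share_manager(incumbent_profit, entrant_profit):
    # Hypothetical incentive scheme: the manager is penalized when the rival thrives.
    return incumbent_profit - 2 * entrant_profit

def solve(delegate_objective):
    """Backward induction: the incumbent's delegate moves last."""
    # If entry occurs, the delegate picks the response that maximizes its own objective.
    response = max(["accommodate", "fight"],
                   key=lambda r: delegate_objective(*PAYOFFS[("enter", r)]))
    # The entrant anticipates that response and enters only if doing so beats staying out.
    enters = PAYOFFS[("enter", response)][1] > PAYOFFS["out"][1]
    outcome = ("enter", response) if enters else "out"
    return outcome, PAYOFFS[outcome]

print(solve(profit_maximizer))      # (('enter', 'accommodate'), (1, 1)): threat to fight is not credible
print(solve(market_share_manager))  # ('out', (2, 0)): delegation makes the threat credible, deterring entry
```

Note that the sketch simply assumes the delegate's incentives are common knowledge, which is exactly the credibility problem discussed next.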
The practical applications of these models are not immediately clear in the empirical future scenarios we might envision. As pointed out by Oesterheld and Conitzer, the process of committing one's delegates to their modified incentives must already be credible. If the deployed agent differs from the principal mostly in terms of incentives, and not in competence, agency, or internal complexity, it may not be much more transparent in its commitments than the principal was. Perhaps some goals are more verifiable or otherwise credible than others, e.g. in terms of observable actions that are consistent with them, but the fundamental problem of internal opaqueness remains. One solution is to deploy the agent only for a specific bargaining situation, for which it is trained in a mostly transparent way where an observer can see the details of the training procedure. However, much as modules within a single agent pose challenges, it is unclear how well individual bargaining situations could be separated from their enforcement in the environment, and enforcement would again presumably require a more generally competent agent to be crucially involved.
Iteration, punishment capacity, and other miscellaneous factors
If interactions in the bargaining environment are iterated or one's history is visible to outside parties one might trade with later on, reputation concerns can incentivize sticking to commitments. This is a well-known finding in game theory (see e.g. Mailath & Samuelson 2006), and will not be discussed much further here, but ought to be included for the sake of completeness. In transparent iterated scenarios, an agent expects other players to be able to punish it later for breaking commitments. Even if the environment is uncertain, adhering to costly commitments can signal credibility to future bargaining partners as a long-term strategy. A concrete special case of the former mechanism is having a central power or other system with the material capacity to retrospectively punish agents that renege on their contracts, much like law enforcement in human societies happens through the deterrence effect of designed consequences to defections.8
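As a reminder of the standard logic (a textbook result rather than anything specific to this post), in an infinitely repeated Prisoner's Dilemma with payoffs $T > R > P > S$ and discount factor $\delta$, grim-trigger strategies sustain cooperation whenever the one-shot gain from defecting is outweighed by the discounted loss of future cooperation:

$$
\frac{R}{1-\delta} \;\ge\; T + \frac{\delta P}{1-\delta}
\quad\Longleftrightarrow\quad
\delta \;\ge\; \frac{T-R}{T-P}.
$$

The worry in the next paragraph is that, for long-horizon agents in an environment with rising stakes, the effective discount factor on the interactions that matter most may be low, so this condition can fail exactly when it is most needed.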
As it is mostly upstream of commitment ability, increasing how iterated interactions are for the sake of credibility seems inefficient and probably intractable. Among agents whose strategies optimize for the very long term, it is also unreliable: if interactions are repeated in an environment where the stakes get higher over time, most agents would prefer to be honest while the stakes are low, regardless of how they will act in a sufficiently high-stakes situation. This holds especially because the higher the stakes get in a competition for expansion, the fewer future interactions one expects, as wiping out other players entirely becomes possible. Iteration alone would therefore provide limited information, even if it sometimes were the only practical way to provide evidence of one's trustworthiness.
Both epistemic and normative features in individual agents can make their commitments more credible, if these features are common knowledge. Human cultures, for instance, have used religious notions to signal commitment to certain strategies (e.g. Holslag 2016), perhaps often successfully compared to available counterfactual approaches. Agents could also come to intrinsically value transparency or choose to adhere to commitments, either through moral values, or certain decision-theoretic policy choices (Drescher 2006). These choices would not in themselves make commitments externally credible, of course, but could have more verifiable sources depending on the agent's history.
As mentioned above, it seems that each commitment strategy described here suffers from potentially serious drawbacks, though in different areas and circumstances. Many plausible scenarios can be envisioned where one or more of the approaches succeeds in supporting credible commitment. Different approaches could even be used in overlapping ways to compensate for their weaknesses, though this holds less if the main weakness is resource costs. In many cases, the feasibility of commitments seems to come down to whether the surplus from cooperation will be enough to incentivize a great deal of collective effort. Another fundamental question is how costly it is to obfuscate one's intentions with great care versus detect obfuscation by observing an agent’s behavior and history.
On a more practical level, contingent features such as agent heterogeneity and even logistics suggest that even if contracts and commitments were overall feasible, they would be costlier to verify between some agents than others. Rather than expecting uniform opportunities for commitment throughout the landscape, we should perhaps assume the environment will be governed by some n-dimensional mess of gradients in commitment ability. Comparing agents along axes such as physical location, architectural similarity, history, normative motivations, and willingness to cooperate, some of them would likely be in better positions to make credible commitments to each other. This does not necessarily prevent widely cooperative dynamics, especially if there is a lot of transitivity in commitment ability between agents as speculated above, but makes the path there more complicated in terms of interventions.
Another insight from this work is that committing to threats could require completely different mechanisms or approaches than committing to cooperation, and future discussions on commitment among AI systems should ideally reflect this. Notably, as many ways to signal one's intentions already require some minimal collaborative labor, it seems much more feasible commitment-wise to make prosocial commitments than to extort others.9 When you can't simply inform your target of a threat and your intentions to carry it out, and would instead need them to go through a costly process to get your intentions properly verified, you might find that they aren't interested in hearing more about your plans.10 One exception seems to be the dumber mechanisms, which are well suited for destructive threats, but might not be able to represent complex voluntary trade contracts.
This post benefited immensely from conversations with and feedback from Jesse Clifton, Richard Ngo, Daniel Kokotajlo, Lin Chi Nguyen, Lukas Gloor, Stefan Torges, as well as the rest of my colleagues at Center on Long-term Risk (CLR) and the attendees at CLR's 2019 S-risk workshop, which inspired many of the initial ideas explored here.
The post Commitment ability in multipolar AI scenarios appeared first on Center on Long-Term Risk.
The post Fundraiser 2020 appeared first on Center on Long-Term Risk.
We are raising $550,000 (stretch goal: $1,500,000) to make further progress on our mission: building a global community of researchers and professionals working to do the most good in terms of reducing suffering. These are our plans for 2021 in a nutshell (more details here):
If you prioritize reducing risks of astronomical suffering, we believe there is a strong case to support our work. We are one of the few organizations with the same priorities and we have made significant progress with our work in recent years.
Donations to the Effective Altruism Foundation (EAF), which houses the Center on Long-Term Risk, are tax-deductible for donors in Germany, Switzerland, the US, the UK, and the Netherlands. Donors in the US and the UK can make tax-deductible donations to us via the Effective Altruism Funds platform.
You can find answers to frequently asked questions in our donation FAQ. If the FAQ doesn't answer your question, please send us an email at donate@ea-foundation.org.
Name | Amount | Comment
---|---|---
Mikko Rauhala | EUR 4995 |
Anonymous | EUR 35 |
Anonymous | EUR 35 |
Anonymous | GBP 30 |
Rai (Michael Pokorny) | CHF 60.78 |
Anonymous | EUR 2000 |
Adam Hruby | USD 10 |
Victoria Gutierrez | USD 200 |
Connor Leahy | EUR 500 |
Anonymous | CHF 5000 |
Anonymous | EUR 631 | I pray that CLR not only receives the donation required, but gains ever-increasing public profile and awareness on the subject of impact of tech for future generations!
Anonymous | CHF 20014 |
Jonas Hunsicker | EUR 50 |
Anonymous | EUR 25 |
Amy Spence | USD 100 |
Denis Drescher | CHF 100 |
Adam clayton | USD 75 |
Anonymous | USD 75 |
Anonymous | CHF 10000 |
Cliff and Stephanie Hyra | USD 20000 |
Adam Spence | USD 77.77 | Chaos Theory is another potential area that CLTR should consider researching. A better understanding of chaotic systems is important for understanding how to change the course of history for the better, and a lot of suffering seems to be caused by unintended consequences of not necessarily malevolent actions.
Anonymous | CHF 5000 |
Anonymous | CHF 3000 |
Anonymous | USD 588 |
Anonymous | CHF 200 |
Jan und Sara Rüegg | CHF 11474 |
Anonymous | CHF 18000 |
Anonymous | USD 28000 |
The post Persuasion Tools: AI takeover without takeoff or agency? appeared first on Center on Long-Term Risk.
I'm envisioning that in the future there will also be systems where you can input any conclusion that you want to argue (including moral conclusions) and the target audience, and the system will give you the most convincing arguments for it. At that point people won't be able to participate in any online (or offline for that matter) discussions without risking their object-level values being hijacked.
--Wei Dai
What if most people already live in that world? A world in which taking arguments at face value is not a capacity-enhancing tool, but a security vulnerability? Without trusted filters, would they not dismiss highfalutin arguments out of hand, and focus on whether the person making the argument seems friendly, or unfriendly, using hard to fake group-affiliation signals?
--Benquo
AI-powered memetic warfare makes all humans effectively insane.
--Wei Dai, listing nonstandard AI doom scenarios
This post speculates about persuasion tools—how likely they are to get better in the future relative to countermeasures, what the effects of this might be, and what implications there are for what we should do now.
To avert eye-rolls, let me say up front that I don’t think the world is likely to be driven insane by AI-powered memetic warfare. I think progress in persuasion tools will probably be gradual and slow, and defenses will improve too, resulting in an overall shift in the balance that isn’t huge: a deterioration of collective epistemology, but not a massive one. However, (a) I haven’t yet ruled out more extreme scenarios, especially during a slow takeoff, and (b) even small, gradual deteriorations are important to know about. Such a deterioration would make it harder for society to notice and solve AI safety and governance problems, because it is worse at noticing and solving problems in general. Such a deterioration could also be a risk factor for world war three, revolutions, sectarian conflict, terrorism, and the like. Moreover, such a deterioration could happen locally, in our community or in the communities we are trying to influence, and that would be almost as bad. Since the date of AI takeover is not the day the AI takes over, but the point it’s too late to reduce AI risk, these things basically shorten timelines.
Analyzers: Political campaigns and advertisers already use focus groups, A/B testing, demographic data analysis, etc. to craft and target their propaganda. Imagine a world where this sort of analysis gets better and better, and is used to guide the creation and dissemination of many more types of content.
Feeders: Most humans already get their news from various “feeds” of daily information, controlled by recommendation algorithms. Even worse, people’s ability to seek out new information and find answers to questions is also to some extent controlled by recommendation algorithms: Google Search, for example. There’s a lot of talk these days about fake news and conspiracy theories, but I’m pretty sure that selective/biased reporting is a much bigger problem.
Chatbot: Thanks to recent advancements in language modeling (e.g. GPT-3) chatbots might become actually good. It’s easy to imagine chatbots with millions of daily users continually optimized to maximize user engagement--see e.g. Xiaoice. The systems could then be retrained to persuade people of things, e.g. that certain conspiracy theories are false, that certain governments are good, that certain ideologies are true. Perhaps no one would do this, but I’m not optimistic.
Coach: A cross between a chatbot, a feeder, and an analyzer. It doesn’t talk to the target on its own, but you give it access to the conversation history and everything you know about the target and it coaches you on how to persuade them of whatever it is you want to persuade them of.
Drugs: There are rumors of drugs that make people more suggestible, like scopolamine. Even if these rumors are false, it’s not hard to imagine new drugs being invented that have a similar effect, at least to some extent. (Alcohol, for example, seems to lower inhibitions. Other drugs make people more creative, etc.) Perhaps these drugs by themselves would not be enough, but would work in combination with a Coach or Chatbot. (You meet the target for dinner, and slip some drug into their drink. It is mild enough that they don’t notice anything, but it primes them to be more susceptible to the ask you’ve been coached to make.)
Imperius Curse: These are a kind of adversarial example that gets the target to agree to an ask (or even switch sides in a conflict!), or adopt a belief (or even an entire ideology!). Presumably they wouldn’t work against humans, but they might work against AIs, especially if meme theory applies to AIs as it does to humans. The reason this would work better against AIs than against humans is that you can steal a copy of the AI and then use massive amounts of compute to experiment on it, finding exactly the sequence of inputs that maximizes the probability that it’ll do what you want.
The first thing to point out is that many of these kinds of persuasion tools already exist in some form or another. And they’ve been getting better over the years, as technology advances. Defenses against them have been getting better too. It’s unclear whether the balance has shifted to favor these tools, or their defenses, over time. However, I think we have reason to think that the balance may shift heavily in favor of persuasion tools, prior to the advent of other kinds of transformative AI. The main reason is that progress in persuasion tools is connected to progress in Big Data and AI, and we are currently living through a period of rapid progress in those things, and probably progress will continue to be rapid (and possibly accelerate) prior to AGI.
However, here are some more specific reasons to think persuasion tools may become relatively more powerful:
Substantial prior: Shifts in the balance between things happen all the time. For example, the balance between weapons and armor has oscillated at least a few times over the centuries. Arguably persuasion tools got relatively more powerful with the invention of the printing press, and again with radio, and now again with the internet and Big Data. Some have suggested that the printing press helped cause religious wars in Europe, and that radio assisted the violent totalitarian ideologies of the early twentieth century.
Consistent with recent evidence: A shift in this direction is consistent with the societal changes we’ve seen in recent years. The internet has brought with it many inventions that improve collective epistemology, e.g. Google Search, Wikipedia, the ability of communities to create forums... Yet on balance it seems to me that collective epistemology has deteriorated in the last decade or so.
Lots of room for growth: I’d guess that there is lots of “room for growth” in persuasive ability. There are many kinds of persuasion strategy that are tricky to use successfully. Like a complex engine design compared to a simple one, these strategies might work well, but only if you have enough data and time to refine them and find the specific version that works at all, on your specific target. Humans never have that data and time, but AI+Big Data does, since it has access to millions of conversations with similar targets. Persuasion tools will be able to say things like 'In 90% of cases where targets in this specific demographic are prompted to consider and then reject the simulation argument, and then challenged to justify their prejudice against machine consciousness, the target gets flustered and confused. Then, if we make empathetic noises and change the subject again, 50% of the time the subject subconsciously changes their mind so that when next week we present our argument for machine rights they go along with it, compared to 10% baseline probability.'
Plausibly pre-AGI: Persuasion is not an AGI-complete problem. Most of the types of persuasion tools mentioned above already exist, in weak form, and there’s no reason to think they can’t gradually get better well before AGI. So even if they won't improve much in the near future, plausibly they'll improve a lot by the time things get really intense.
Language modelling progress: Persuasion tools seem to be especially benefitted by progress in language modelling, and language modelling seems to be making even more progress than the rest of AI these days.
More things can be measured: Thanks to said progress, we now have the ability to cheaply measure nuanced things like user ideology, enabling us to train systems towards those objectives.
Chatbots & Coaches: Thanks to said progress, we might see some halfway-decent chatbots prior to AGI. Thus an entire category of persuasion tool that hasn’t existed before might come to exist in the future. Chatbots too stupid to make good conversation partners might still make good coaches, by helping the user predict the target’s reactions and suggesting possible things to say.
Minor improvements still important: Persuasion doesn’t have to be perfect to radically change the world. An analyzer that helps your memes have a 10% higher replication rate is a big deal; a coach that makes your asks 30% more likely to succeed is a big deal.
Faster feedback: One way defenses against persuasion tools have strengthened is that people have grown wise to them. However, the sorts of persuasion tools I’m talking about seem to have significantly faster feedback loops than the propagandists of old; they can learn constantly, from the entire population, whereas past propagandists (if they were learning at all, as opposed to evolving) relied on noisier, more delayed signals.
Overhang: Finding persuasion drugs is costly, immoral, and not guaranteed to succeed. Perhaps this explains why it hasn’t been attempted outside a few cases like MKULTRA. But as technology advances, the cost goes down and the probability of success goes up, making it more likely that someone will attempt it, and giving them an “overhang” with which to achieve rapid progress if they do. (I hear that there are now multiple startups built around using AI for drug discovery, by the way.) A similar argument might hold for persuasion tools more generally: We might be in a “persuasion tool overhang” in which they have not been developed for ethical and riskiness reasons, but at some point the price and riskiness drops low enough that someone does it, and then that triggers a cascade of more and richer people building better and better versions.
Here are some hasty speculations, beginning with the most important one:
Ideologies & the biosphere analogy:
The world is, and has been for centuries, a memetic warzone. The main factions in the war are ideologies, broadly construed. It seems likely to me that some of these ideologies will use persuasion tools--both on their hosts, to fortify them against rival ideologies, and on others, to spread the ideology.
Consider the memetic ecosystem--all the memes replicating and evolving across the planet. Like the biological ecosystem, some memes are adapted to, and confined to, particular niches, while other memes are widespread. Some memes are in the process of gradually going extinct, while others are expanding their territory. Many exist in some sort of equilibrium, at least for now, until the climate changes. What will be the effect of persuasion tools on the memetic ecosystem?
For ideologies at least, the effects seem straightforward: The ideologies will become stronger, harder to eradicate from hosts and better at spreading to new hosts. If all ideologies got access to equally powerful persuasion tools, perhaps the overall balance of power across the ecosystem would not change, but realistically the tools will be unevenly distributed. The likely result is a rapid transition to a world with fewer, more powerful ideologies. They might be more internally unified, as well, having fewer spin-offs and schisms due to the centralized control and standardization imposed by the persuasion tools. An additional force pushing in this direction is that ideologies that are bigger are likely to have more money and data with which to make better persuasion tools, and the tools themselves will get better the more they are used.
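As a toy illustration of that consolidation dynamic (all numbers invented, and obviously not a serious model of memetics), here is a small simulation in which each ideology's persuasive power grows with its share of hosts, standing in for larger ideologies having more money and data to spend on tools:

```python
# Toy model: three ideologies compete for a fixed population of hosts.
# Each ideology's "persuasive power" is its baseline appeal times a bonus that
# grows with its current share (a stand-in for more data and better tools).
shares = [0.4, 0.35, 0.25]          # hypothetical initial shares of hosts
baseline_appeal = [1.0, 1.0, 1.0]   # identical memes apart from tool access
tool_advantage = [0.5, 0.1, 0.1]    # ideology 0 invests most in persuasion tools

for _ in range(50):
    power = [a * (1 + adv * s) for a, adv, s in zip(baseline_appeal, tool_advantage, shares)]
    # Hosts are won in proportion to share-weighted persuasive power, then renormalized.
    weighted = [p * s for p, s in zip(power, shares)]
    total = sum(weighted)
    shares = [w / total for w in weighted]

print([round(s, 3) for s in shares])  # the tooled-up ideology ends up with most hosts
```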
Recall the quotes I led with:
... At that point people won't be able to participate in any online (or offline for that matter) discussions without risking their object-level values being hijacked.
--Wei Dai
What if most people already live in that world? A world in which taking arguments at face value is not a capacity-enhancing tool, but a security vulnerability? Without trusted filters, would they not dismiss highfalutin arguments out of hand … ?
--Benquo
AI-powered memetic warfare makes all humans effectively insane.
--Wei Dai, listing nonstandard AI doom scenarios
I think the case can be made that we already live in this world to some extent, and have for millenia. But if persuasion tools get better relative to countermeasures, the world will be more like this.
This seems to me to be an existential risk factor. It’s also a risk factor for lots of other things, for that matter. Ideological strife can get pretty nasty (e.g. religious wars, gulags, genocides, totalitarianism), and even when it doesn’t, it still often gums things up (e.g. suppression of science, zero-sum mentality preventing win-win-solutions, virtue signalling death spirals, refusal to compromise). This is bad enough already, but it’s doubly bad when it comes at a moment in history where big new collective action problems need to be recognized and solved.
Obvious uses: Advertising, scams, propaganda by authoritarian regimes, etc. will improve. This means more money and power for those who control the persuasion tools. Another important implication may be that democracies would be at a major disadvantage on the world stage compared to totalitarian autocracies. One of many reasons for this is that scissor statements and other divisiveness-sowing tactics, though they may not technically count as persuasion tools, would probably grow more powerful in tandem.
Will the truth rise to the top: Optimistically, one might hope that widespread use of more powerful persuasion tools will be a good thing, because it might create an environment in which the truth “rises to the top” more easily. For example, if every side of a debate has access to powerful argument-making software, maybe the side that wins is more likely to be the side that’s actually correct. I think this is a possibility but I do not think it is probable. After all, it doesn’t seem to be what’s happened in the last two decades or so of widespread internet use, big data, AI, etc. Perhaps, however, we can make it true for some domains at least, by setting the rules of the debate.
Data hoarding: A community’s data (chat logs, email threads, demographics, etc.) may become even more valuable. It can be used by the community to optimize their inward-targeted persuasion, improving group loyalty and cohesion. It can be used against the community if someone else gets access to it. This goes for individuals as well as communities.
Chatbot social hacking viruses: Social hacking is surprisingly effective. The classic example is calling someone pretending to be someone else and getting them to do something or reveal sensitive information. Phishing is like this, only much cheaper (because automated) and much less effective. I can imagine a virus that is close to as good as a real human at social hacking while being much cheaper and able to scale rapidly and indefinitely as it acquires more compute and data. In fact, a virus like this could be made with GPT-3 right now, using prompt programming and “mothership” servers to run the model. (The prompts would evolve to match the local environment being hacked.) Whether GPT-3 is smart enough for it to be effective remains to be seen.
I doubt that persuasion tools will improve discontinuously, and I doubt that they’ll improve massively. But minor and gradual improvements matter too.
Of course, influence over the future might not disappear all on one day; maybe there’ll be a gradual loss of control over several years. For that matter, maybe this gradual loss of control began years ago and continues now...
I think this is potentially (5% credence) the new Cause X, more important than (traditional) AI alignment even. It probably isn’t. But I think someone should look into it at least, more thoroughly than I have.
To be clear, I don’t think it’s likely that we can do much to prevent this stuff from happening. There are already lots of people raising the alarm about filter bubbles, recommendation algorithms, etc. so maybe it’s not super neglected and maybe our influence over it is small. However, at the very least, it’s important for us to know how likely it is to happen, and when, because it helps us prepare. For example, if we think that collective epistemology will have deteriorated significantly by the time crazy AI stuff starts happening, that influences what sorts of AI policy strategies we pursue.
Note that if you disagree with me about the extreme importance of AI alignment, or if you think AI timelines are longer than mine, or if you think fast takeoff is less likely than I do, you should all else equal be more enthusiastic about investigating persuasion tools than I am.
Thanks to Katja Grace, Emery Cooper, Richard Ngo, and Ben Goldhaber for feedback on a draft.
Related previous work:
Stuff I’d read if I was investigating this in more depth:
The stuff here and here
EDIT: This ultrashort sci-fi story by Jack Clark illustrates some of the ideas in this post:
The Narrative Control Department
[A beautiful house in South West London, 2030]
“General, we’re seeing an uptick in memes that contradict our official messaging around Rule 470.”
“What do you suggest we do?”
“Start a conflict. At least three sides. Make sure no one side wins.”
“At once, General.”
And with that, the machines spun up – literally. They turned on new computers and their fans revved up. People with tattoos of skeletons at keyboards high-fived each other. The servers warmed up and started to churn out their fake text messages and synthetic memes, to be handed off to the ‘insertion team’ who would pass the data into a few thousand sock puppet accounts, which would start the fight.
Hours later, the General asked for a report.
“We’ve detected a meaningful rise in inter-faction conflict and we’ve successfully moved the discussion from Rule 470 to a parallel argument about the larger rulemaking process.”
“Excellent. And what about our rivals?”
“We’ve detected a few Russian and Chinese account networks, but they’re staying quiet for now. If they’re mentioning anything at all, it’s in line with our narrative. They’re saving the IDs for another day, I think.”
That night, the General got home around 8pm, and at the dinner table his teenage girls talked about their day.
“Do you know how these laws get made?” the older teenager said. “It’s crazy. I was reading about it online after the 470 blowup. I just don’t know if I trust it.”
“Trust the laws that gave Dad his job? I don’t think so!” said the other teenager.
They laughed, as did the General’s wife. The General stared at the peas on his plate and stuck his fork into the middle of them, scattering so many little green spheres around his plate.
The post Persuasion Tools: AI takeover without takeoff or agency? appeared first on Center on Long-Term Risk.
The post How Roodman's GWP model translates to TAI timelines appeared first on Center on Long-Term Risk.
Now, before I go any further, let me be the first to say that I don’t think we should use this model to predict TAI. This model takes a very broad outside view and is thus inferior to models like Ajeya Cotra’s which make use of more relevant information. (However, it is still useful for rebutting claims that TAI is unprecedented, inconsistent with historical trends, low-prior, etc.) Nevertheless, out of curiosity I thought I’d calculate what the model implies for TAI timelines.
Here is the projection made by Roodman’s model. The red line is real historic GWP data; the splay of grey shades that continues it is the splay of possible futures calculated by the model. The median trajectory is the black line.
I messed around with a ruler to make some rough calculations, marking up the image with blue lines as I went. The big blue line indicates the point on the median trajectory where GWP is 10x what it was in 2019. Eyeballing it, it looks like it happens around 2040, give or take a year. The small vertical blue line indicates the year 2037. The small horizontal blue line indicates GWP in 2037 on the median trajectory.
Thus, it seems that between 2037 and 2040 on the median trajectory, GWP doubles. (One-ninth the distance between 1,000 and 1,000,000 is crossed, which is one-third of an order of magnitude, which is about one doubling).
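For readers who want to check the eyeballed arithmetic, here is the back-of-the-envelope version (a sanity check of the ruler measurements only, not a rerun of Roodman's model):

```python
import math

# The chart's y-axis spans 1,000 to 1,000,000 in the model's GWP units, i.e. 3 orders of magnitude.
axis_span_ooms = math.log10(1_000_000) - math.log10(1_000)   # 3.0

# The ruler measurement: the 2037-2040 segment covers about one-ninth of that span.
segment_ooms = axis_span_ooms / 9                             # ~0.33 orders of magnitude
growth_factor = 10 ** segment_ooms                            # ~2.15, i.e. roughly one doubling

print(round(segment_ooms, 2), round(growth_factor, 2))
```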
This means that TAI happens around 2037 on the median trajectory according to this model, at least according to Ajeya Cotra’s definition of transformative AI as “software which causes a tenfold acceleration in the rate of growth of the world economy (assuming that it is used everywhere that it would be economically profitable to use it)... This means that if TAI is developed in year Y, the entire world economy would more than double by year Y + 4.”
What about the non-median trajectories? Each shade of grey represents 5 percent of the simulated future trajectories, so it looks like there’s about a 20% chance that GWP will be near-infinite by 2040 (and 10% by 2037). So, perhaps-too-hastily extrapolating backwards, maybe this means about a 20% chance of TAI by 2030 (and 10% by 2027).
At this point, I should mention that I disagree with this definition of TAI; I think the point of no return (which is what matters for planning) is reasonably likely to come several years before TAI-by-this-definition appears. (It could also come several years later!) For more on why I think this, see this post. [link to be added when linked post appears]
Finally, let’s discuss some of the reasons not to take this too seriously: This model has been overconfident historically. It was surprised by how fast GDP grew prior to 1970 and surprised by how slowly it grew thereafter. And if you look at the red trendline of actual GWP, it looks like the model may have been surprised in previous eras as well. Moreover, for the past few decades it has consistently predicted a median GWP-date of several decades ahead:
The grey region is the confidence interval the model predicts for when growth goes to infinity. 100 on the x-axis is 1947. So, throughout the 1900’s the model has consistently predicted growth going to infinity in the first half of the twenty-first century, but in the last few decades in particular, it’s displayed a consistent pattern of pushing back the date of expected singularity, akin to the joke about how fusion power is always twenty years away:
Model has access to data up to year X | Year of predicted singularity | Difference |
---|---|---|
1940 | 2029 | 89 |
1950 | 2045 | 95 |
1960 | 2020 | 60 |
1970 | 2010 | 40 |
1980 | 2014 | 34 |
1990 | 2022 | 32 |
2000 | 2031 | 31 |
2010 | 2038 | 28 |
2019 | 2047 | 28 |
The upshot, I speculate, is that if we want to use this model to predict TAI, but we don’t want to take it 100% literally, we should push the median significantly back from 2037 while also increasing the variance significantly. This is because we are currently in a slower-than-the-model-predicts period, but faster-than-the-model-predicts periods are possible and indeed likely to happen around TAI. So probably the status quo will continue and GWP will continue to grow slowly and the model will continue to push back the date of expected singularity… but also at any moment there’s a chance that we’ll transition to a faster-than-the-model-predicts period, in which case TAI is imminent.
(Thanks to Denis Drescher and Max Daniel for feedback on a draft)
The post How Roodman's GWP model translates to TAI timelines appeared first on Center on Long-Term Risk.
The post The date of AI Takeover is not the day the AI takes over appeared first on Center on Long-Term Risk.
The rest of this post explains, justifies, and expands on this obvious but underappreciated idea. (Toby Ord appreciates it; see quote below). I found myself explaining it repeatedly, so I wrote this post as a reference.
AI timelines often come up in career planning conversations. Insofar as AI timelines are short, career plans which take a long time to pay off are a bad idea, because by the time you reap the benefits of the plans it may already be too late. It may already be too late because AI takeover may already have happened.
But this isn’t quite right, at least not when “AI takeover” is interpreted in the obvious way, as meaning that an AI or group of AIs is firmly in political control of the world, ordering humans about, monopolizing violence, etc. Even if AIs don’t yet have that sort of political control, it may already be too late. Here are three examples:
Conclusion: We should remember that when trying to predict the date of AI takeover, what we care about is the date it’s too late for us to change the direction things are going; the date we have significantly less influence over the course of the future than we used to; the point of no return.
This is basically what Toby Ord said about x-risk:
So either because we’ve gone extinct or because there’s been some kind of irrevocable collapse of civilization or something similar. Or, in the case of climate change, where the effects are very delayed that we’re past the point of no return or something like that. So the idea is that we should focus on the time of action and the time when you can do something about it rather than the time when the particular event happens.
Of course, influence over the future might not disappear all on one day; maybe there’ll be a gradual loss of control over several years. For that matter, maybe this gradual loss of control began years ago and continues now... We should keep these possibilities in mind as well.
The post The date of AI Takeover is not the day the AI takes over appeared first on Center on Long-Term Risk.
The post Updates appeared first on Center on Long-Term Risk.
The post Priority areas appeared first on Center on Long-Term Risk.
However, regardless of your background and the different areas listed below: if we believe that you can somehow do high-quality work relevant to s-risks, we are interested in supporting you.
Our research agenda cooperation, conflict, and transformative artificial intelligence (TAI) is ultimately aimed at reducing risks of conflict among TAI-enabled actors. This means that we need to understand how future AI systems might interact with one another, especially in high-stakes situations. CLR researchers and affiliates are currently researching how the design of future AI systems might determine the prospects for avoiding cooperation failure, using the tools of game theory, machine learning, and other disciplines related to multi-agent systems (MAS). You can find an overview of our work in this area here.
Examples of CLR research related to MAS:
Ensuring the safe design of AI systems also poses problems of governance. Because the prospects for avoiding conflict involving TAI systems depend on the design of all of the systems involved, avoiding conflict and promoting cooperation among TAI systems may pose new governance challenges beyond those commonly discussed in the AI risk research community (e.g., here). CLR researchers are currently working to understand potential pathways to cooperation between AI developers on the aspects of their systems which are most relevant to avoiding catastrophic conflict.
Examples of CLR research related to AI governance:
As explained in Section 7 of our research agenda, we are also interested in a better foundational understanding of decision-making, in the hope that this will help us steer towards better outcomes in high-stakes interactions between TAI systems.
An example of CLR research in this area:
Malevolent individuals in positions of power could negatively affect humanity’s long-term trajectory by, for example, exacerbating international conflict or other broad risk factors. With access to advanced technology, they may even pose existential risks. We are interested in a better understanding of malevolent traits and would like to investigate interventions to reduce the influence of individuals exhibiting such traits.
An example of CLR research in this area:
We have only been doing research on s-risks since 2013. So we expect to change our minds about many important questions as we learn more. We are interested in people bringing an independent perspective to the question of what we should prioritize. This can also include seemingly esoteric topics like infinite ethics or extraterrestrials.
Examples of CLR research in this area:
The post Priority areas appeared first on Center on Long-Term Risk.
The post Expression of interest appeared first on Center on Long-Term Risk.
Thank you for your interest—we look forward to hearing from you!
The post Expression of interest appeared first on Center on Long-Term Risk.
The post Reducing long-term risks from malevolent actors appeared first on Center on Long-Term Risk.
Full article
The post Reducing long-term risks from malevolent actors appeared first on Center on Long-Term Risk.
The post Publications appeared first on Center on Long-Term Risk.
Safe Pareto Improvements for Delegated Game Playing. AAMAS, 2021.
Normative Disagreement as a Challenge for Cooperative AI. Cooperative AI workshop and the Strategic ML workshop at NeurIPS, 2021.
Commitment games with conditional information revelation. AAAI 2023, 2022.
Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs. Learning, Evolution, and Games, 2022.
Collaborative game specification: arriving at common models in bargaining. Working paper, March 2021.
Weak identifiability and its consequences in strategic settings. Working paper, February 2021.
Towards cooperation in learning games. Working paper, October 2020.
Robust program equilibrium. Theory and Decision, 86 (1), 2018.
CLR's Research Agenda on Cooperation, Conflict, and TAI. Alignment Forum, December 2019.
Equilibrium and prior selection problems in multipolar deployment. AI Alignment Forum, April 2020.
The "Commitment Races" problem. Alignment Forum, August 2019.
Reducing long-term risks from malevolent actors. Effective Altruism Forum, April 2020.
Sequence on moral anti-realism. Effective Altruism Forum, June 2020.
Tranquilism. CLR Website, July 2017.
A Virtue of Precaution Regarding the Moral Status of Animals with Uncertain Sentience. Journal of Agricultural and Environmental Ethics, 30 (2), 2017.
Bibliography of Suffering-Focused Views. CLR Website, August 2016.
The Importance of Wild-Animal Suffering. Relations, 3 (2), 2015.
Should We Base Moral Judgments on Intentions or Outcomes?. CLR Website, July 2013.
Dealing with Moral Multiplicity. CLR Website, December 2013.
What 2026 looks like. LessWrong, August 2021.
Fun with +12 OOMs of Compute. LessWrong, March 2021.
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain. LessWrong, January 2021.
Against GDP as a metric for timelines and takeoff speeds. LessWrong, December 2020.
Beginner’s guide to reducing s-risks. CLR Website, September 2023.
Persuasion Tools: AI takeover without AGI or agency?. LessWrong, November 2020.
Incentivizing forecasting via social media. Effective Altruism Forum, December 2020.
Sequence on non-agent and multiagent models of mind. LessWrong, January 2019.
Moral realism and AI alignment. LessWrong, September 2018.
Suffering-Focused AI Safety: In Favor of “Fail-Safe” Measures. CLR Website, June 2016.
Room for Other Things: How to adjust if EA seems overwhelming. Effective Altruism Forum, March 2015.
The post Publications appeared first on Center on Long-Term Risk.
The post About us appeared first on Center on Long-Term Risk.
Our goal is to address worst-case risks from the development and deployment of advanced AI systems. We are currently focused on conflict scenarios as well as technical and philosophical aspects of cooperation.
To this end, we do interdisciplinary research, make and recommend grants, and build a community of professionals and other researchers around our priorities, e.g., through events, fellowships, and individual support.
As a team and organization, we are driven by the idea to do the most good we can from an impartial perspective. While we are deeply committed to our values, we are radically open-minded about how to live up to them.
This is a complex challenge. Because our resources are limited, we cannot solve all problems in the world or mitigate all risks facing us in the future. Instead, we need to prioritize. We need to ask ourselves what actions we should take now to have as much of a positive impact as possible.
This has been the guiding question of our organization since our founding in 2013. Starting from a commitment to our values, there are many different considerations that have shaped our current focus. As we learn more, our priorities, or even our mission, may change.
Below we provide a list of some of the crucial considerations that inform our current priorities:
Our primary ethical focus is the reduction of involuntary suffering. This includes human suffering, but also the suffering in non-human animals and potential artificial minds of the future. In accordance with a diverse range of moral views, we believe that suffering, especially extreme suffering, cannot be easily outweighed by large amounts of happiness.
While this leads us to prioritize reducing suffering, we do so within a framework of commonsensical value pluralism and with a strong focus on cooperation. Together with others in the effective altruism community, we want careful ethical reflection to guide the future of our civilization to the greatest extent possible.
The post About us appeared first on Center on Long-Term Risk.
The post Career advice appeared first on Center on Long-Term Risk.
You should do so if you:
We can best help you make sense of how to do the most good in case our priorities overlap. We do not offer general career advice or coaching. If you're interested in that, we recommend the organization 80,000 Hours.
The calls usually take 30-60 minutes. We look forward to talking to you!
The post Career advice appeared first on Center on Long-Term Risk.
The post EAF/FRI are now the Center on Long-Term Risk (CLR) appeared first on Center on Long-Term Risk.
We are renaming for the following reasons:
We would like to thank the many people in our networks who helped us with their ideas and feedback. We are excited about the new name and design, and hope you are, too!
The post EAF/FRI are now the Center on Long-Term Risk (CLR) appeared first on Center on Long-Term Risk.
The post Our plans for 2020 appeared first on Center on Long-Term Risk.
We are building a global community of researchers and professionals working on reducing risks of astronomical suffering (s-risks). (Read more about us.)
We are a London-based nonprofit. Previously, we were located in Switzerland (Basel) and Germany (Berlin).
For an overview of our strategic thinking, see the following pieces:
The best work on reducing s-risks cuts across a broad range of academic disciplines and interventions. Our recent research agenda, for instance, draws from computer science, economics, political science, and philosophy. That means (a) we must work in many different disciplines and (b) find people who can bridge disciplinary boundaries. The longtermism community brings together people with diverse backgrounds who understand our prioritization and share it to some extent. For this reason, we focus on making reducing s-risks a well-established priority in that community.
Inspired by GiveWell’s self-evaluations, we are tracking our progress with a set of deliberately vague performance questions:
Our team will answer these questions at the end of 2020.
We aim to investigate research questions listed in our research agenda titled “Cooperation, Conflict, and Transformative Artificial Intelligence.” We explain our focus on cooperation and conflict in the preface:
“S-risks might arise by malevolence, by accident, or in the course of conflict. (…) We believe that s-risks arising from conflict are among the most important, tractable, and neglected of these. In particular, strategic threats by powerful AI agents or AI-assisted humans against altruistic values may be among the largest sources of expected suffering. Strategic threats have historically been a source of significant danger to civilization (the Cold War being a prime example). And the potential downsides from such threats, including those involving large amounts of suffering, may increase significantly with the emergence of transformative AI systems.”
Topics covered by our research agenda include:
We did not list some topics in the research agenda because they did not fit its scope, but we consider them very important:
In practice, our publications and grants will be determined to a large extent by the ideas and motivation of the researchers. We understand the above list of topics as a menu for researchers to choose from, and we expect that our actual work will only cover a small portion of the relevant issues. We hope to collaborate with other AI safety research groups on some of these topics.
We are looking to grow our research team, so we would be excited to hear from you if you think you might be a good fit! We are also considering running a hiring round based on our research agenda as well as a summer research fellowship.
We aim to develop a global research community, promoting regular exchange and coordination between researchers whose work contributes to reducing s-risks.
S-risks from conflict. In 2019, we mainly worked on s-risks as a result of conflicts involving advanced AI systems:
We also circulated nine internal articles and working papers with the participants of our research workshops.
Foundational work on decision theory. This work might be relevant in the context of acausal interactions (see the last section of the research agenda):
Miscellaneous publications:
We think it makes sense for donors to support us if:
For donors who do not agree with these points, we recommend giving to the donor lottery (or the EA Funds). We recommend that donors who are interested in the CLR Fund support CLR instead because the CLR Fund has a limited capacity to absorb further funding.
Would you like to support us? Make a donation.
If you have any questions or comments, we look forward to hearing from you; you can also send us feedback anonymously. We greatly appreciate any thoughts that could help us improve our work. Thank you!
I would like to thank Tobias Baumann, Max Daniel, Ruairi Donnelly, Lukas Gloor, Chi Nguyen, and Stefan Torges for giving feedback on this article.
The post Our plans for 2020 appeared first on Center on Long-Term Risk.
The post The Evidentialist's Wager appeared first on Center on Long-Term Risk.
Suppose that an altruistic and morally motivated agent who is uncertain between evidential decision theory (EDT) and causal decision theory (CDT) finds herself in a situation in which the two theories give conflicting verdicts. We argue that even if she has significantly higher credence in CDT, she should nevertheless act in accordance with EDT. First, we claim that the appropriate response to normative uncertainty is to hedge one’s bets. That is, if the stakes are much higher on one theory than another, and the credences you assign to each of these theories aren’t very different, then it’s appropriate to choose the option which performs best on the high-stakes theory. Second, we show that, given the assumption of altruism, the existence of correlated decision-makers will increase the stakes for EDT but leave the stakes for CDT unaffected. Together these two claims imply that whenever there are sufficiently many correlated agents, the appropriate response is to act in accordance with EDT.
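A stylized way to see the structure of the wager (our own compression, not the paper's formal setup): suppose the agent has credence $p$ in CDT and $1-p$ in EDT, the CDT-recommended act is better by $c$ if CDT is true, the EDT-recommended act is better by $e$ per correlated decision-maker if EDT is true, and there are $N$ such correlated agents whose welfare the altruist values. Expected-value-style hedging then favors the EDT-recommended act whenever

$$
(1-p)\,N\,e \;>\; p\,c
\quad\Longleftrightarrow\quad
N \;>\; \frac{p}{1-p}\cdot\frac{c}{e},
$$

so even a modest credence in EDT dominates once $N$ is large, which is the role the correlated decision-makers play in the argument.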
Read the paper on the website of the Global Priorities Institute.
The post The Evidentialist's Wager appeared first on Center on Long-Term Risk.
The post Imprint appeared first on Center on Long-Term Risk.
Center on Long-term Risk
A Charitable Incorporated Organisation (CIO) registered with the Charity Commission of England and Wales. Registration number 1195079.
Trustees
Disclaimer
The Center on Long-term Risk makes every effort to ensure that the information on its website (www.longtermrisk.org) is always correct and up-to-date and, if necessary, changes or supplements it on an ongoing basis and without prior notice. Nevertheless, CLR cannot accept any liability for correctness, timeliness, and completeness.
Our website contains links to external websites of third parties on whose contents we have no influence. Therefore, we cannot assume any liability for this external content. The respective provider or operator of the pages is always responsible for the content of the linked pages. A permanent control of the content of the linked pages is unreasonable without concrete evidence of a violation of the law. However, we will remove such links as soon as we become aware of any infringements of the law.
Copyright
The website of the Center on Long-term Risk (longtermrisk.org) including all its parts such as texts and images is protected by copyright. Any use outside the limits of copyright law is prohibited without permission. The content of the website may not be passed on to third parties for a fee.
Further information
Our transparency page contains further information on the Center on Long-term Risk.
The post Imprint appeared first on Center on Long-Term Risk.
The post CLR Fund appeared first on Center on Long-Term Risk.
The post Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda appeared first on Center on Long-Term Risk.
Author: Jesse Clifton
The Center on Long-Term Risk's research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence outlines what we think are the most promising avenues for developing technical and governance interventions aimed at avoiding conflict between transformative AI systems. We draw on international relations, game theory, behavioral economics, machine learning, decision theory, and formal epistemology.
While our research agenda captures many topics we are interested in, the focus of CLR's research is broader.
We appreciate all comments and questions. We're also looking for people to work on the questions we outline. So if you're interested or know people who might be, please get in touch with us by emailing info@longtermrisk.org.
Transformative artificial intelligence (TAI) may be a key factor in the long-run trajectory of civilization. A growing interdisciplinary community has begun to study how the development of TAI can be made safe and beneficial to sentient life (Bostrom 2014; Russell et al., 2015; OpenAI, 2018; Ortega and Maini, 2018; Dafoe, 2018). We present a research agenda for advancing a critical component of this effort: preventing catastrophic failures of cooperation among TAI systems. By cooperation failures we refer to a broad class of potentially-catastrophic inefficiencies in interactions among TAI-enabled actors. These include destructive conflict; coercion; and social dilemmas (Kollock, 1998; Macy and Flache, 2002) which destroy value over extended periods of time. We introduce cooperation failures at greater length in Section 1.1.
Karnofsky (2016) defines TAI as "AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution". Such systems range from the unified, agent-like systems which are the focus of, e.g., Yudkowsky (2013) and Bostrom (2014), to the "comprehensive AI services" envisioned by Drexler (2019), in which humans are assisted by an array of powerful domain-specific AI tools. In our view, the potential consequences of such technology are enough to motivate research into mitigating risks today, despite considerable uncertainty about the timeline to TAI (Grace et al., 2018) and the nature of TAI development. Given these uncertainties, we will often discuss "cooperation failures" in fairly abstract terms and focus on questions relevant to a wide range of potential modes of interaction between AI systems. Much of our discussion will pertain to powerful agent-like systems, with general capabilities and expansive goals. But whereas the scenarios that concern much of the existing long-term-focused AI safety research involve agent-like systems, an important feature of catastrophic cooperation failures is that they may also occur among human actors assisted by narrow-but-powerful AI tools.
Cooperation has long been studied in many fields: political theory, economics, game theory, psychology, evolutionary biology, multi-agent systems, and so on. But TAI is likely to present unprecedented challenges and opportunities arising from interactions between powerful actors. The size of losses from bargaining inefficiencies may massively increase with the capabilities of the actors involved. Moreover, features of machine intelligence may lead to qualitative changes in the nature of multi-agent systems. These include changes in:
These changes call for the development of new conceptual tools, building on and modifying the many relevant literatures which have studied cooperation among humans and human societies.
Many of the cooperation failures in which we are interested can be understood as mutual defection in a social dilemma. Informally, a social dilemma is a game in which everyone is better off if everyone cooperates, yet individual rationality may lead to defection. Formally, following Macy and Flache (2002), we will say that a two-player normal-form game with payoffs denoted as in Table 1 is a social dilemma if the payoffs satisfy these criteria:
Nash equilibrium (i.e., a choice of strategy by each player such that no player can benefit from unilaterally deviating) has been used to analyze failures of cooperation in social dilemmas. In the Prisoner's Dilemma (PD), the unique Nash equilibrium is mutual defection. In Stag Hunt, there is a cooperative equilibrium which requires agents to coordinate, and a defecting equilibrium which does not. In Chicken, there are two pure-strategy Nash equilibria (Player 1 plays Defect while Player 2 plays Cooperate, and vice versa) as well as a mixed-strategy equilibrium in which players independently randomize between Cooperate and Defect. The mixed-strategy equilibrium or uncoordinated equilibrium selection may therefore result in a crash (i.e., mutual defection).
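To make these equilibrium claims concrete, the following is a minimal sketch (with illustrative payoff numbers, not taken from Table 1) which enumerates the pure-strategy Nash equilibria of the three games:

```python
from itertools import product

# Payoff matrices: payoffs[(row_action, col_action)] = (row_payoff, col_payoff).
# "C" = cooperate, "D" = defect. The numbers are illustrative only.
GAMES = {
    "Prisoner's Dilemma": {("C", "C"): (3, 3), ("C", "D"): (0, 4),
                           ("D", "C"): (4, 0), ("D", "D"): (1, 1)},
    "Stag Hunt":          {("C", "C"): (4, 4), ("C", "D"): (0, 3),
                           ("D", "C"): (3, 0), ("D", "D"): (2, 2)},
    "Chicken":            {("C", "C"): (3, 3), ("C", "D"): (1, 4),
                           ("D", "C"): (4, 1), ("D", "D"): (0, 0)},
}

def pure_nash_equilibria(payoffs):
    """Profiles from which no player gains by unilaterally deviating."""
    actions = ["C", "D"]
    equilibria = []
    for row, col in product(actions, actions):
        row_ok = all(payoffs[(row, col)][0] >= payoffs[(alt, col)][0] for alt in actions)
        col_ok = all(payoffs[(row, col)][1] >= payoffs[(row, alt)][1] for alt in actions)
        if row_ok and col_ok:
            equilibria.append((row, col))
    return equilibria

for name, payoffs in GAMES.items():
    print(name, pure_nash_equilibria(payoffs))
# Prisoner's Dilemma -> [('D', 'D')]; Stag Hunt -> [('C', 'C'), ('D', 'D')];
# Chicken -> [('C', 'D'), ('D', 'C')] (plus a mixed equilibrium not enumerated here).
```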
Social dilemmas have been used to model cooperation failures in international politics; Snyder (1971) reviews applications of PD and Chicken, and Jervis (1978) discusses each of the classic social dilemmas in his influential treatment of the security dilemma. Among the most prominent examples is the model of arms races as a PD: both players build up arms (defect) despite the fact that disarmament (cooperation) is mutually beneficial, as neither wants to be the party who disarms while their counterpart builds up. Social dilemmas have likewise been applied to a number of collective action problems, such as the use of a common resource (cf. the famous "tragedy of the commons" (Hardin, 1968; Perolat et al., 2017)) and pollution. See Dawes (1980) for a review focusing on such cases.
Many interactions are not adequately modeled by simple games like those in Table 1. For instance, states facing the prospect of military conflict have incomplete information. That is, each party has private information about the costs and benefits of conflict, their military strength, and so on. They also have the opportunity to negotiate over extended periods; to monitor one another's activities to some extent; and so on. The literature on bargaining models of war (or "crisis bargaining") is a source of more complex analyses (e.g., Powell 2002; Kydd 2003; Powell 2006; Smith and Stam 2004; Fey and Ramsay 2007, 2011; Kydd 2010). In a classic article from this literature, Fearon (1995) defends three now-standard hypotheses as the most plausible explanations for why rational agents would go to war:
Another example of potentially disastrous cooperation failure is extortion (and other compellent threats), and the execution of such threats by powerful agents. In addition to threats being harmful to their target, the execution of threats seems to constitute an inefficiency: much like going to war, threateners face the direct costs of causing harm and, in some cases, risks from retaliation or legal action.
The literature on crisis bargaining between rational agents may also help us to understand the circumstances under which compellent threats are made and carried out, and point to mechanisms for avoiding these scenarios.
Countering the hypothesis that war between rational agents A and B can occur as a result of indivisible stakes (for example, a territory), Powell (2006, p. 178) presents the case in Example 1.1.1, which shows that a lottery allocating the full stakes to each agent with probability equal to its chance of winning a war Pareto-dominates fighting.
Example 1.1.1 (Simulated conflict).
Consider two countries disputing a territory which has value $v$ for each of them. Suppose that the row country has probability $p$ of winning a conflict, and that conflict costs $c$ for each country, so that their payoffs for Surrendering and Fighting are as in the top matrix in Table 2. However, suppose the countries agree on the probability $p$ that the row player wins; perhaps they have access to a mutually trusted war-simulator which has the row player winning in a fraction $p$ of simulations. Then, instead of engaging in a real conflict, they could allocate the territory based on a draw from the simulator. Playing this game is preferable, as it saves each country the cost $c$ of actual conflict.
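To spell out the comparison, here is a sketch of the expected payoffs under the assumption (standard in this kind of example, though Table 2 is not reproduced here) that fighting costs each side $c > 0$:

```latex
% Row country's expected payoffs (the column country is symmetric, with 1 - p in place of p):
\begin{align*}
\text{Fighting:}       &\quad p\,v - c \\
\text{Simulated draw:} &\quad p\,v
\end{align*}
% Since c > 0, both countries strictly prefer allocating the territory by a draw from the
% trusted simulator, for any agreed-upon p: the simulated conflict Pareto-dominates fighting.
```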
Because of the possibility of constructing simulated conflict to allocate indivisible stakes, the most plausible of Fearon's rationalist explanations for war seem to be (1) the difficulty of credible commitment and (2) incomplete information (and incentives to misrepresent that information). Section 3 discusses credibility in TAI systems, and Section 4 discusses several issues related to the resolution of conflict under incomplete information.
Lastly, while game theory provides a powerful framework for modeling cooperation failure, TAI systems or their operators will not necessarily be well-modeled as rational agents. For example, systems involving humans in the loop, or black-box TAI agents trained by evolutionary methods, may be governed by a complex network of decision-making heuristics not easily captured in a utility function. We discuss research directions that are particularly relevant to cooperation failures among these kinds of agents in Sections 5.2 (Multi-agent training) and 6 (Humans in the loop).
We list the sections of the agenda below. Different sections may appeal to readers from different backgrounds. For instance, Section 5 (Contemporary AI architectures) may be most interesting to those with some interest in machine learning, whereas Section 7 (Foundations of rational agency) will be more relevant to readers with an interest in formal epistemology or the philosophical foundations of decision theory. Tags after the description of each section indicate the fields most relevant to that section.
Some sections contain examples illustrating technical points, or explaining in greater detail a possible technical research direction.
Public policy; International relations; Game theory; Artificial intelligence
Game theory; Behavioral economics; Artificial intelligence
Game theory; International relations; Artificial intelligence
Machine learning; Game theory
Machine learning; Behavioral economics
Formal epistemology; Philosophical decision theory; Artificial intelligence
We would like to better understand the ways the strategic landscape among key actors (states, AI labs, and other non-state actors) might look at the time TAI systems are deployed, and to identify levers for shifting this landscape towards widely beneficial outcomes. Our interests here overlap with Dafoe (2018)'s AI governance research agenda (see especially the "Technical Landscape" section), though we are most concerned with questions relevant to risks associated with cooperation failures.
From the perspective of reducing risks from cooperation failures, it is prima facie preferable if the transition to TAI results in a unipolar rather than a distributed outcome: The greater the chances of a single dominant actor, the lower the chances of conflict (at least after that actor has achieved dominance). But the analysis is likely not so simple, if the international relations literature on the relative safety of different power distributions (e.g., Deutsch and Singer 1964; Waltz 1964; Christensen and Snyder 1990) is any indication. We are therefore especially interested in a more fine-grained analysis of possible developments in the balance of power. In particular, we would like to understand the likelihood of the various scenarios, their relative safety with respect to catastrophic risk, and the tractability of policy interventions to steer towards safer distributions of TAI-related power. Relevant questions include:
Agents' ability to make credible commitments is a critical aspect of multi-agent systems. Section 3 is dedicated to technical questions around credibility, but it is also important to consider the strategic implications of credibility and commitment.
One concerning dynamic which may arise between TAI systems is commitment races (Kokotajlo, 2019a). In the game of Chicken (Table 1), both players have reason to commit to driving ahead as soon as possible, by conspicuously throwing out their steering wheels. Likewise, AI agents (or their human overseers) may want to make certain commitments (for instance, commitments to carry through with a threat if their demands aren't met) as soon as possible, in order to improve their bargaining positions. As with Chicken, this is a dangerous situation. Thus we would like to explore possibilities for curtailing such dynamics.
Finally, in human societies, improvements in the ability to make credible commitments (e.g., to sign contracts enforceable by law) seem to have facilitated large gains from trade through more effective coordination, longer-term cooperation, and various other mechanisms (e.g., Knack and Keefer 1995; North 1991; Greif et al. 1994; Dixit 2003).
Christiano (2018a) defines "the alignment problem" as "the problem of building powerful AI systems that are aligned with their operators". Related problems, as discussed by Bostrom (2014), include the "value loading" (or "value alignment") problem (the problem of ensuring that AI systems have goals compatible with the goals of humans), and the "control problem" (the general problem of controlling a powerful AI agent). Despite the recent surge in attention on AI risk, there are few detailed descriptions of what a future with misaligned AI systems might look like (but see Sotala 2018; Christiano 2019; Dai 2019 for examples). Better models of the ways in which misaligned AI systems could arise and how they might behave are important for our understanding of critical interactions among powerful actors in the future.
According to the offense-defense theory, the likelihood and nature of conflict depend on the relative efficacy of offensive and defensive security strategies (Jervis, 2017, 1978; Glaser, 1997). Technological progress seems to have been a critical driver of shifts in the offense-defense balance (Garfinkel and Dafoe, 2019), and the advent of powerful AI systems in strategic domains like computer security or military technology could lead to shifts in that balance.
Besides forecasting future dynamics, we are curious as to what lessons can be drawn from case studies of cooperation failures, and policies which have mitigated or exacerbated such risks. For example: Cooperation failures among powerful agents representing human values may be particularly costly when threats are involved. Examples of possible case studies include nuclear deterrence, ransomware (Gazet, 2010) and its implications for computer security, the economics of hostage-taking (Atkinson et al., 1987; Shortland and Roberts, 2019), and extortion rackets (Superti, 2009). Such case studies might investigate costs to the threateners, gains for the threateners, damages to third parties, factors that make agents more or less vulnerable to threats, existing efforts to combat extortionists, etc. While it is unclear how informative such case studies will be about interactions between TAI systems, they may be particularly relevant in humans-in-the-loop scenarios (Section 6).
Lastly, in addition to case studies of cooperation failures themselves, studying how other instances of formal research have influenced (or failed to influence) critical real-world decisions would help us prioritize the research directions presented in this agenda. Particularly relevant examples include the application of game theory to geopolitics (see Weintraub (2017) for a review of game theory and decision-making in the Cold War), of cryptography to computer security, and of formal methods to the verification of software programs.
The remainder of this agenda largely concerns technical questions related to interactions involving TAI-enabled systems. A key strategic question running throughout is: What are the potential downsides to increased technical understanding in these areas? It is possible, for instance, that technical and strategic insights related to credible commitment increase rather than decrease the efficacy and likelihood of compellent threats. Moreover, the naive application of idealized models of rationality may do more harm than good; it has been argued that this was the case in some applications of formal methods to Cold War strategy (see, for instance, Kaplan 1991). Thus the exploration of the dangers and limitations of technical and strategic progress is itself a critical research direction.
Credibility is a central issue in strategic interaction. By credibility, we refer to the issue of whether one agent has reason to believe that another will do what they say they will do. Credibility (or lack thereof) plays a crucial role in the efficacy of contracts (Fehr et al., 1997; Bohnet et al., 2001), negotiated settlements for avoiding destructive conflict (Powell, 2006), and commitments to carry out (or refuse to give in to) threats (e.g., Kilgour and Zagare 1991; Konrad and Skaperdas 1997).
In game theory, the fact that Nash equilibria (Section 1.1) sometimes involve non-credible threats motivates a refined solution concept called subgame perfect equilibrium (SPE). An SPE is a Nash equilibrium of an extensive-form game in which a Nash equilibrium is also played at each subgame. In the threat game depicted in Figure 1, "carry out" is not played in an SPE, because the threatener has no reason to carry out the threat once the threatened party has refused to give in; that is, "carry out" is not part of a Nash equilibrium of the subgame played after the threatened party refuses to give in.
So in an SPE-based analysis of one-shot threat situations between rational agents, threats are never carried out because they are not credible (i.e., they violate subgame perfection).
However, agents may establish credibility in the case of repeated interactions by repeatedly making good on their claims (Sobel, 1985). Secondly, despite the fact that carrying out a threat in the one-shot threat game violates subgame perfection, it is a well-known result from behavioral game theory that humans typically refuse unfair splits in the Ultimatum Game (Güth et al., 1982; Henrich et al., 2006), which is equivalent to carrying out the threat in the one-shot threat game. So executing commitments which are irrational (by the SPE criterion) may still be a feature of human-in-the-loop systems (Section 6), or perhaps of systems which have some humanlike game-theoretic heuristics in virtue of being trained in multi-agent environments (Section 5.2). Lastly, threats may become credible if the threatener has credibly committed to carrying out the threat (in the case of the game in Figure 1, this means convincing the opponent that they have removed the option to "Not carry out", or made it costly). There is a considerable game-theoretic literature on credible commitment, both on how credibility can be achieved (Schelling, 1960) and on the analysis of games under the assumption that credible commitment is possible (Von Stackelberg, 2010; Nash, 1953; Muthoo, 1996; Bagwell, 1995).
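The subgame-perfection argument can be checked mechanically by backward induction. Below is a small sketch with hypothetical payoffs (Figure 1's actual numbers are not reproduced here), chosen so that carrying out the threat is costly to the threatener: the threatener would back down after a refusal, so the target refuses, and the threat is never made.

```python
# One-shot threat game solved by backward induction.
# Each outcome maps to (threatener_payoff, target_payoff); the numbers are hypothetical.
PAYOFFS = {
    "no_threat":               (0.0, 0.0),
    "threat_give_in":          (2.0, -2.0),
    "threat_refuse_carry_out": (-1.0, -5.0),   # carrying out is costly to both
    "threat_refuse_back_down": (-0.5, 0.0),
}

def threatener_after_refusal():
    # Last mover: carry out the threat only if it pays more than backing down.
    carry = PAYOFFS["threat_refuse_carry_out"]
    back = PAYOFFS["threat_refuse_back_down"]
    return ("carry_out", carry) if carry[0] > back[0] else ("back_down", back)

def target_response():
    # The target anticipates the threatener's last move.
    _, refusal_outcome = threatener_after_refusal()
    give_in = PAYOFFS["threat_give_in"]
    return ("give_in", give_in) if give_in[1] > refusal_outcome[1] else ("refuse", refusal_outcome)

def threatener_initial():
    _, response_outcome = target_response()
    no_threat = PAYOFFS["no_threat"]
    return ("threaten", response_outcome) if response_outcome[0] > no_threat[0] else ("no_threat", no_threat)

print(threatener_after_refusal())  # ('back_down', ...): carrying out is not credible
print(target_response())           # ('refuse', ...): the target calls the bluff
print(threatener_initial())        # ('no_threat', ...): so no threat is made in the SPE
```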
It is possible that TAI systems may be relatively transparent to one another; capable of self-modification or of constructing sophisticated commitment devices; and capable of making various other "computer-mediated contracts" (Varian, 2010). See also the lengthy discussions in Garfinkel (2018) and Kroll et al. (2016), discussed in Footnote 1, of the potential implications of cryptographic technology for credibility.
We want to understand how plausible changes in the ability to make credible commitments affect risks from cooperation failures.
Tennenholtz (2004) introduced program games, in which players submit programs that have access to the source codes of their counterparts. Program games provide a model of interaction under mutual transparency. Tennenholtz showed that in the Prisoner's Dilemma, both players submitting Algorithm 1 (a program which cooperates if its counterpart's source code is identical to its own, and defects otherwise) is a program equilibrium (that is, a Nash equilibrium of the corresponding program game). Thus agents may have an incentive to participate in program games, as these promote more cooperative outcomes than the corresponding non-program games.
For these reasons, program games may be helpful to our understanding of interactions among advanced AIs.
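A minimal sketch of the kind of program involved, assuming each submitted program can read its counterpart's source code; the name clique_bot, the use of Python's inspect module, and the exact-match rule are illustrative stand-ins for the construction in Tennenholtz (2004), not a reproduction of it.

```python
import inspect

def clique_bot(my_source: str, opponent_source: str) -> str:
    """Cooperate iff the opponent's program is syntactically identical to mine."""
    return "C" if opponent_source == my_source else "D"

# Both players submit the same program, and each receives the other's source code.
source = inspect.getsource(clique_bot)
print(clique_bot(source, source), clique_bot(source, source))  # C C: mutual cooperation

# A unilateral deviation to an unconditional defector is met with defection,
# so (clique_bot, clique_bot) is a Nash equilibrium of the program game.
defector_source = "def defect_bot(my_source, opponent_source): return 'D'"
print(clique_bot(source, defector_source))  # D
```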
Other models of strategic interaction between agents who are transparent to one another have been studied (more on this in Section 5); following Critch (2019), we will call this broader area open-source game theory. Game theory with source-code transparency has been studied by Fortnow 2009; Halpern and Pass 2018; LaVictoire et al. 2014; Critch 2019; Oesterheld 2019, and models of multi-agent learning under transparency are given by Brafman and Tennenholtz (2003); Foerster et al. (2018). But open-source game theory is in its infancy and many challenges remain.
In other sections of the agenda, we have proposed research directions for improving our general understanding of cooperation and conflict among TAI systems. In this section, on the other hand, we consider several families of strategies designed to actually avoid catastrophic cooperation failure. The idea of such "peaceful bargaining strategies" is, roughly speaking, to find strategies which are 1) peaceful (i.e., avoid conflict) and 2) preferred by rational agents to non-peaceful strategies.
We are not confident that peaceful bargaining mechanisms will be used by default. First, in human-in-the-loop scenarios, the bargaining behavior of TAI systems may be dictated by human overseers, whom we do not expect to systematically use rational bargaining strategies (Section 6.1). Even in systems whose decision-making is more independent of humans, evolution-like training methods could give rise to non-rational, human-like bargaining heuristics (Section 5.2). Even among rational agents, because there may be many cooperative equilibria, additional mechanisms for ensuring coordination may be necessary to avoid conflict arising from the selection of different equilibria (see Example 4.1.1). Finally, the examples in this section suggest that there may be path-dependencies in the engineering of TAI systems (for instance, in making certain aspects of TAI systems more transparent to their counterparts) which determine the extent to which peaceful bargaining mechanisms are available.
In the first subsection, we present some directions for identifying mechanisms which could implement peaceful settlements, drawing largely on existing ideas in the literature on rational bargaining. In the second subsection, we sketch a proposal for how agents might mitigate downsides from threats by effectively modifying their utility function. This proposal is called surrogate goals.
As discussed in Section 1.1, there are two standard explanations for war among rational agents: credibility (the agents cannot credibly commit to the terms of a peaceful settlement) and incomplete information (the agents have differing private information which makes each of them optimistic about their prospects of winning, and incentives not to disclose or to misrepresent this information).
Fey and Ramsay (2011) model crisis bargaining under incomplete information. They show that in 2-player crisis bargaining games with voluntary agreements (players are able to reject a proposed settlement if they think they will be better off going to war); mutually known costs of war; unknown types $t_1, t_2$ measuring the players' military strength; a commonly known function $p(t_1, t_2)$ giving the probability of player 1 winning when the true types are $(t_1, t_2)$; and a common prior over types, a peaceful settlement exists if and only if the costs of war are sufficiently large. Such a settlement must compensate each player's strongest possible type by the amount they expect to gain in war.
Potential problems facing the resolution of conflict in such cases include:
Recall that another form of cooperation failure is the simultaneous commitment to strategies which lead to catastrophic threats being carried out (Section 2.2). Such "commitment games'' may be modeled as a game of Chicken (Table 1), where Defection corresponds to making commitments to carry out a threat if one's demands are not met, while Cooperation corresponds to not making such commitments. Thus we are interested in bargaining strategies which avoid mutual Defection in commitment games. Such a strategy is sketched in Example 4.1.1.
Example 4.1.1 (Careful commitments).
Consider two agents with access to commitment devices. Each may decide to commit to carrying out a threat if their counterpart does not forfeit some prize (of value $v$ to each party). As before, call this decision $D$. However, they may instead commit to carrying out their threat only if their counterpart does not agree to a certain split of the prize (say, a split in which Player 1 gets a share $s$ and Player 2 gets $1 - s$).
Call this commitment $C(s)$, for "cooperating with split $s$".
When would an agent prefer to make the more sophisticated commitment $C(s)$? In order to say whether an agent expects to do better by making $C(s)$, we need to be able to say how well they expect to do in the "original" commitment game where their choice is between $C$ and $D$. This is not straightforward, as Chicken admits three Nash equilibria. However, it may be reasonable to regard the players' expected values under the mixed-strategy Nash equilibrium as the values they expect from playing this game. Thus, the split $s$ could be chosen such that $s v$ and $(1 - s) v$ exceed Player 1's and Player 2's respective expected payoffs under the mixed-strategy Nash equilibrium. Many such splits may exist. This calls for a way of selecting among them, for which we may turn to a bargaining solution concept such as Nash (Nash, 1950) or Kalai-Smorodinsky (Kalai et al., 1975). If each player uses the same bargaining solution, then each will prefer committing to honor the resulting split of the prize over playing the original threat game, and carried-out threats will be avoided.
Of course, this mechanism is brittle in that it relies on a single take-it-or-leave-it proposal which will fail if the agents use different bargaining solutions, or have slightly different estimates of each player's payoffs. However, this could be generalized to a commitment to a more complex and robust bargaining procedure, such as an alternating-offers procedure (Rubinstein 1982; Binmore et al. 1986; see Muthoo (1996) for a thorough review of such models) or the sequential cooperative bargaining procedure of Van Damme (1986).
Finally, note that in the case where there is uncertainty over whether each player has a commitment device, sufficiently high stakes will mean that players with commitment devices will still have Chicken-like payoffs. So this model can be straightforwardly extended to cases where the credibility of a threat comes in degrees. An example of a simple bargaining procedure to commit to is a Bayesian version of the Nash bargaining solution (Harsanyi and Selten, 1972).
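As a sketch of the selection step in Example 4.1.1: assume a prize normalized to value 1, take the players' (hypothetical) expected payoffs under the mixed-strategy equilibrium of the commitment game as the disagreement point, and pick the split by maximizing the Nash product. None of the numbers below come from the agenda.

```python
import numpy as np

def nash_bargaining_split(d1: float, d2: float, grid: int = 10_001) -> float:
    """Split s of a unit prize maximizing the Nash product (s - d1) * ((1 - s) - d2)."""
    s_values = np.linspace(0.0, 1.0, grid)
    gains_1 = s_values - d1
    gains_2 = (1.0 - s_values) - d2
    nash_product = np.where((gains_1 > 0) & (gains_2 > 0), gains_1 * gains_2, -np.inf)
    return float(s_values[np.argmax(nash_product)])

# Hypothetical disagreement payoffs (e.g., mixed-equilibrium values of the threat game).
d1, d2 = 0.20, 0.35
print(nash_bargaining_split(d1, d2))  # ~0.425
print(d1 + (1 - d1 - d2) / 2)         # 0.425: the closed-form Nash bargaining split agrees
```

Committing to the resulting split gives each player at least their disagreement value, so, on this sketch, both prefer it to the original game of simultaneous threat commitments.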
Lastly, see Kydd (2010) for a review of potential applications of the literature on rational crisis bargaining to resolving real-world conflict.
In this section, we introduce surrogate goals, a recent proposal for limiting the downsides from cooperation failures (Baumann, 2017, 2018). We will focus on the phenomenon of coercive threats (for game-theoretic discussion see Ellsberg (1968); Harrenstein et al. (2007)), though the technique is more general. The proposal is: in order to deflect threats against the things it terminally values, an agent adopts a new (surrogate) goal. This goal may still be threatened, but threats carried out against this goal are benign. Furthermore, the surrogate goal is chosen such that it incentivizes at most marginally more threats.
Example 4.2.1 gives an operationalization of surrogate goals in a threat game.
Example 4.2.1 (Surrogate goals via representatives)
Consider the game between Threatener and Target, where Threatener makes a demand of Target, such as giving up some resources. Threatener can, at some cost, commit to carrying out a threat against Target. Target can likewise commit to give in to such threats or not. A simple model of this game is given in the payoff matrix in Table 3 (a normal-form variant of the threat game discussed in Section 3).
Unfortunately, players may sometimes play (Threaten, Not give in). For example, this may be due to uncoordinated selection among the two pure-strategy Nash equilibria ((Give in, Threaten) and (Not give in, Not threaten)).
But suppose that, in the above scenario, Target is capable of certain kinds of credible commitments, or otherwise is represented by an agent, Target’s Representative, who is. Then Target or Target’s Representative may modify its goal architecture to adopt a surrogate goal whose fulfillment is not actually valuable to that player, and which is slightly cheaper for Threatener to threaten. (More generally, Target could modify itself to commit to acting as if it had a surrogate goal in threat situations.) If this modification is credible, then it is rational for Threatener to threaten the surrogate goal, obviating the risk of threats against Target’s true goals being carried out.
As a first pass at a formal analysis: adopting an additional threatenable goal adds a column to the payoff matrix, as in Table 4, and this column weakly dominates the old threat column (i.e., the threat against Target's true goals). So a rational player would never threaten Target's true goal. Target does not itself care about the new type of threat being carried out, so its utilities are given by the bold numbers in Table 4.
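A toy numerical check of the weak-dominance claim (Tables 3 and 4 are not reproduced here, so the numbers below are hypothetical): threatening the surrogate goal is assumed to be slightly cheaper for Threatener than threatening Target's true goal, and a carried-out threat against the surrogate goal costs Target nothing.

```python
# Threatener's payoffs for each (threat target, Target response); numbers are hypothetical.
threatener = {
    ("none",      "give_in"): 0.0,  ("none",      "refuse"): 0.0,
    ("true",      "give_in"): 1.0,  ("true",      "refuse"): -1.0,
    ("surrogate", "give_in"): 1.1,  ("surrogate", "refuse"): -0.9,  # slightly cheaper to threaten
}
# Target's true payoffs: carried-out threats against the surrogate goal are harmless to it.
target = {
    ("none",      "give_in"): -1.0, ("none",      "refuse"): 0.0,
    ("true",      "give_in"): -1.0, ("true",      "refuse"): -10.0,
    ("surrogate", "give_in"): -1.0, ("surrogate", "refuse"): 0.0,
}

responses = ["give_in", "refuse"]
dominates = all(threatener[("surrogate", r)] >= threatener[("true", r)] for r in responses)
print("Surrogate threat weakly dominates true-goal threat for Threatener:", dominates)  # True
print("Worst Target payoff if only the surrogate goal is threatened:",
      min(target[("surrogate", r)] for r in responses))  # -1.0, versus -10.0 for a carried-out true threat
```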
This application of surrogate goals, in which a threat game is already underway but players have the opportunity to self-modify or create representatives with surrogate goals, is only one possibility. Another is to consider the adoption of a surrogate goal as the choice of an agent (before it encounters any threat) to commit to acting according to a new utility function, rather than the one which represents their true goals. This could be modeled, for instance, as an extensive-form game of incomplete information in which the agent decides which utility function to commit to by reasoning about (among other things) what sorts of threats having the utility function might provoke. Such models have a signaling game component, as the player must successfully signal to distrustful counterparts that it will actually act according to the surrogate utility function when threatened. The game-theoretic literature on signaling (Kreps and Sobel, 1994) and the literature on inferring preferences in multi-agent settings (Yu et al., 2019; Lin et al., 2019) may suggest useful models. The implementation of surrogate goals faces a number of obstacles. Some problems and questions include:
A crucial step in the investigation of surrogate goals is the development of appropriate theoretical models. This will help to gain traction on the problems listed above.
Although the architectures of TAI systems will likely be quite different from existing ones, it may still be possible to gain some understanding of cooperation failures among such systems using contemporary tools. First, it is plausible that some aspects of contemporary deep learning methods will persist in TAI systems, making experiments done today directly relevant. Second, even if this is not the case, such research may still help by laying the groundwork for the study of cooperation failures in more advanced systems.
As mentioned above, some attention has recently been devoted to social dilemmas among deep reinforcement learners (Leibo et al., 2017; Peysakhovich and Lerer, 2017; Lerer and Peysakhovich, 2017; Foerster et al., 2018; Wang et al., 2018). However, a fully general, scalable but theoretically principled approach to achieving cooperation among deep reinforcement learning agents is lacking. In Example 5.1.1 we sketch a general approach to cooperation in general-sum games which subsumes several recent methods, and afterward list some research questions raised by the framework.
Example 5.1.1 (Sketch of a framework for cooperation in general-sum games).
The setting is a 2-agent decision process. At each timestep $t$, each agent $i$ receives an observation $o_i^t$; takes an action $a_i^t$ based on their policy $\pi_i$ (assumed to be deterministic for simplicity); and receives a reward $r_i^t$. Player $i$ expects to get a value of $V_i(\pi_1, \pi_2)$ if the policies $(\pi_1, \pi_2)$ are deployed. Examples of such environments which are amenable to study with contemporary machine learning tools are the "sequential social dilemmas" introduced by Leibo et al. (2017). These include a game involving potential conflict over scarce resources, as well as a coordination game similar in spirit to Stag Hunt (Table 1).
Suppose that the agents (or their overseers) have the opportunity to choose what policies to deploy by simulating from a model, and to bargain over the choice of policies. The idea is for the parties to arrive at a welfare function $w(V_1, V_2)$ which they agree to jointly maximize; deviations from the policies which maximize the welfare function will be punished if detected. Let $d_i$ be a "disagreement point" measuring how well agent $i$ expects to do if they deviate from the welfare-maximizing policy profile. This could be their security value $\max_{\pi_i} \min_{\pi_{-i}} V_i(\pi_i, \pi_{-i})$, or an estimate of their value when the agents use independent learning algorithms. Finally, define player $i$'s ideal point $u_i^*$, the highest value they could obtain among the feasible policy profiles. Table 5 displays welfare functions corresponding to several widely-discussed bargaining solutions, adapted to the multi-agent reinforcement learning setting.
Table 5: Welfare functions corresponding to several widely-discussed bargaining solutions, adapted to the multi-agent RL setting where two agents with value functions $V_1$ and $V_2$ are bargaining over the pair of policies $(\pi_1, \pi_2)$ to deploy. The indicator function in the definition of the Kalai-Smorodinsky welfare is used to enforce the constraint in its argument. Note that when the space of feasible payoff profiles is convex, the Nash welfare function uniquely satisfies the properties of (1) Pareto optimality, (2) symmetry, (3) invariance to affine transformations, and (4) independence of irrelevant alternatives. The Nash welfare can also be obtained as the subgame perfect equilibrium of an alternating-offers game as the "patience" of the players goes to infinity (Binmore et al., 1986). On the other hand, Kalai-Smorodinsky uniquely satisfies (1)-(3) plus (5) resource monotonicity, which means that all players are weakly better off when there are more resources to go around. The egalitarian solution satisfies (1), (2), (4), and (5). The utilitarian welfare function is implicitly used in the work of Peysakhovich and Lerer (2017); Lerer and Peysakhovich (2017); Wang et al. (2018) on cooperation in sequential social dilemmas.
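The following is a sketch of how the welfare functions in Table 5 might be computed for a single candidate policy pair, using the notation assumed above (values $V_i$, disagreement points $d_i$, ideal points $u_i^*$); the Kalai-Smorodinsky constraint is enforced here with a penalty standing in for the indicator, and all of the numbers are hypothetical.

```python
import math

def utilitarian(v1, v2, d1, d2, u1, u2):
    return v1 + v2

def nash_welfare(v1, v2, d1, d2, u1, u2):
    # Product of gains over the disagreement point.
    return (v1 - d1) * (v2 - d2) if v1 >= d1 and v2 >= d2 else -math.inf

def egalitarian(v1, v2, d1, d2, u1, u2):
    # Gain of the worst-off player relative to the disagreement point.
    return min(v1 - d1, v2 - d2)

def kalai_smorodinsky(v1, v2, d1, d2, u1, u2, tol=1e-6):
    # Maximize gains subject to equal relative gains (V_i - d_i) / (u_i - d_i);
    # the -inf penalty stands in for the indicator enforcing the constraint.
    r1, r2 = (v1 - d1) / (u1 - d1), (v2 - d2) / (u2 - d2)
    return min(v1 - d1, v2 - d2) if abs(r1 - r2) < tol else -math.inf

# Hypothetical values for one candidate policy pair:
v1, v2, d1, d2, u1, u2 = 0.6, 0.8, 0.2, 0.2, 1.0, 1.0
for w in (utilitarian, nash_welfare, egalitarian, kalai_smorodinsky):
    print(w.__name__, w(v1, v2, d1, d2, u1, u2))
```

In the framework above, each welfare function would be evaluated across candidate policy pairs (e.g., by rolling out the policies in the model), and the pair maximizing the agreed-upon welfare function would be deployed.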
Define the cooperative policies as $(\pi_1^C, \pi_2^C) = \arg\max_{\pi_1, \pi_2} w(V_1(\pi_1, \pi_2), V_2(\pi_1, \pi_2))$. We need a way of detecting defections so that we can switch from the cooperative policy to a punishment policy. Call a function that detects defections a "switching rule". To make the framework general, consider switching rules which return 1 for switching to the punishment policy and 0 for continuing with the cooperative policy. Rules depend on the agent's observation history $h_i^t$.
The contents of $h_i^t$ will differ based on the degree of observability of the environment, as well as how transparent agents are to each other (cf. Table 6). Example switching rules include:
Finally, the agents need punishment policies to switch to in order to disincentivize defections. An extreme case of a punishment policy is the one in which agent $i$ commits to minimizing their counterpart's value once the counterpart has defected, i.e., playing $\arg\min_{\pi_i} \max_{\pi_{-i}} V_{-i}(\pi_i, \pi_{-i})$. This is the generalization of the so-called "grim trigger" strategy underlying the classical theory of iterated games (Friedman, 1971; Axelrod, 2000). It can be seen that each player submitting a grim trigger strategy in the above framework constitutes a Nash equilibrium in the case that the counterpart's observations and actions are visible (and therefore defections can be detected with certainty). However, grim trigger is intuitively an extremely dangerous strategy for promoting cooperation, and indeed it does poorly in empirical studies of different strategies for the iterated Prisoner's Dilemma (Axelrod and Hamilton, 1981). One possibility is to train more forgiving, tit-for-tat-like punishment policies, and to play a mixed strategy when choosing which to deploy in order to reduce exploitability.
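A minimal sketch of a switching rule and a grim-trigger punishment policy in an iterated Prisoner's Dilemma with fully observable actions; it illustrates the framework above rather than a deep reinforcement learning implementation, and the payoff numbers are hypothetical.

```python
# Iterated Prisoner's Dilemma payoffs for the row player (illustrative numbers).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}

def switching_rule(history):
    """Return 1 (switch to the punishment policy) if the counterpart has ever defected, else 0."""
    return 1 if any(opponent_action == "D" for _, opponent_action in history) else 0

def grim_trigger(history):
    """Play the cooperative policy until a defection is detected, then punish forever."""
    return "D" if switching_rule(history) else "C"

def play(policy_1, policy_2, rounds=10):
    history_1, history_2, score_1, score_2 = [], [], 0, 0
    for _ in range(rounds):
        a1, a2 = policy_1(history_1), policy_2(history_2)
        score_1 += PAYOFF[(a1, a2)]
        score_2 += PAYOFF[(a2, a1)]
        history_1.append((a1, a2))
        history_2.append((a2, a1))
    return score_1, score_2

always_defect = lambda history: "D"
print(play(grim_trigger, grim_trigger))   # (30, 30): cooperation is sustained
print(play(grim_trigger, always_defect))  # (9, 13): defection is punished from round 2 onward
```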
Some questions facing a framework for solving social dilemmas among deep reinforcement learners, such as that sketched in Example 5.1.1, include:
In addition to the theoretical development of open-source game theory (Section 3.2), interactions between transparent agents can be studied using tools like deep reinforcement learning. Learning equilibrium (Brafman and Tennenholtz, 2003) and learning with opponent-learning awareness (LOLA) (Foerster et al., 2018; Baumann et al., 2018; Letcher et al., 2018) are examples of analyses of learning under transparency.
Clifton (2019) provides a framework for "open-source learning" under mutual transparency of source codes and policy parameters. Questions on which we might make progress using present-day machine learning include:
Table 6: Several recent approaches to achieving cooperation in social dilemmas, which assume varying degrees of agent transparency. In Peysakhovich and Lerer (2017)'s consequentialist conditional cooperation (CCC), players learn cooperative policies off-line by optimizing the total welfare. During the target task, they only partially observe the game state and see none of their counterpart's actions; thus, they use only their observed rewards to detect whether their counterpart is cooperating or defecting, and switch to their cooperative or defecting policies accordingly. On the other hand, in Lerer and Peysakhovich (2017), a player sees their counterpart's action and switches to the defecting policy if that action is consistent with defection (mimicking the tit-for-tat strategy in the iterated Prisoner's Dilemma (Axelrod and Hamilton, 1981)).
Multi-agent training is an emerging paradigm for the training of generally intelligent agents (Lanctot et al., 2017; Rabinowitz et al., 2018; Suarez et al., 2019; Leibo et al., 2019). It is as yet unclear what the consequences of such a learning paradigm are for the prospects for cooperativeness among advanced AI systems.
Understanding the decision-making procedures implemented by different machine learning algorithms may be critical for assessing how they will behave in high-stakes interactions with humans or other AI agents. One potentially relevant factor is the decision theory implicitly implemented by a machine learning agent. We discuss decision theory at greater length in Section 7.2, but briefly: by an agent's decision theory, we roughly mean which dependences the agent accounts for when predicting the outcomes of its actions. While it is standard to consider only the causal effects of one's actions ("causal decision theory" (CDT)), there are reasons to think agents should account for non-causal evidence that their actions provide about the world. And different ways of computing the expected effects of actions may lead to starkly different behavior in multi-agent settings.
TAI agents may acquire their objectives via interaction with or observation of humans. Relatedly, TAI systems may consist of AI-assisted humans, as in Drexler (2019)'s comprehensive AI services scenario. Relevant AI techniques include:
In human-in-the-loop scenarios, human responses will determine the outcomes of opportunities for cooperation and conflict.
Behavioral game theory has often found deviations from theoretical solution concepts among human game-players. For instance, people tend to reject unfair splits in the ultimatum game despite this move being ruled out by subgame perfection (Section 3). In the realm of bargaining, human subjects often reach different bargaining solutions than those standardly argued for in the game theory literature (in particular, the Nash (Nash, 1950) and Kalai-Smorodinsky (Kalai et al., 1975) solutions) (Felsenthal and Diskin, 1982; Schellenberg, 1988). Thus the behavioral game theory of human-AI interaction in critical scenarios may be a vital complement to theoretical analysis when designing human-in-the-loop systems.
In one class of TAI trajectories, humans control powerful AI delegates who act on their behalf (gathering resources, ensuring safety, etc.). One model for powerful AI delegates is Christiano (2016a)’s (recursively titled) "Humans consulting HCH" (HCH). Saunders (2019) explains HCH as follows:
HCH, introduced in Humans consulting HCH (Christiano, 2016a), is a computational model in which a human answers questions using questions answered by another human, which can call other humans, which can call other humans, and so on. Each step in the process consists of a human taking in a question, optionally asking one or more sub-questions to other humans, and returning an answer based on those subquestions. HCH can be used as a model for what Iterated Amplification would be able to do in the limit of infinite compute.
A particularly concerning class of cooperation failures in such scenarios are threats by AIs or AI-assisted humans against one another.
Saunders also discusses a hypothetical manual for overseers in the HCH scheme. In this manual, overseers could find advice "on how to corrigibly answer questions by decomposing them into sub-questions." Exploring practical advice that could be included in this manual might be a fruitful exercise for identifying concrete interventions for addressing cooperation failures in HCH and other human-in-the-loop settings. Examples include:
We think that the effort to ensure cooperative outcomes among TAI systems will likely benefit from thorough conceptual clarity about the nature of rational agency. Certain foundational achievements (probability theory, the theory of computation, algorithmic information theory, decision theory, and game theory, to name some of the most profound) have been instrumental both in providing a powerful conceptual apparatus for thinking about rational agency and in the development of concrete tools in artificial intelligence, statistics, cognitive science, and so on. Likewise, there are a number of outstanding foundational questions surrounding the nature of rational agency which we expect to yield additional clarity about interactions between TAI-enabled systems. Broadly, we want to answer:
We acknowledge, however, the limitations of the agenda for foundational questions which we present. First, it is plausible that the formal tools we develop will be of limited use in understanding TAI systems that are actually developed. This may be true of black-box machine learning systems, for instance. Second, there is plenty of potentially relevant foundational inquiry scattered across epistemology, decision theory, game theory, mathematics, philosophy of probability, philosophy of science, etc. which we do not prioritize in our agenda. This does not necessarily reflect a considered judgment about all relevant areas. However, it is plausible to us that the research directions listed here are among the most important, tractable, and neglected (EA Concepts, n.d.) directions for improving our theoretical picture of TAI.
Bayesianism (Talbott, 2016) is the standard idealized model of reasoning under empirical uncertainty. Bayesian agents maintain probabilities over hypotheses; update these probabilities by conditionalization in light of new evidence; and make decisions according to some version of expected utility decision theory (Briggs, 2019). But Bayesianism faces a number of limitations when applied to computationally bounded agents. Examples include:
Newcomb's problem (Nozick, 1969) showed that classical decision theory bifurcates into two conflicting principles of choice in cases where outcomes depend on agents' predictions of each other's behavior. Since then, considerable philosophical work has gone towards identifying additional problem cases for decision theory and towards developing new decision theories to address them. As with Newcomb's problem, many decision-theoretic puzzles involve dependencies between the choices of several agents. For instance, Lewis (1979) argues that Newcomb's problem is equivalent to a prisoner's dilemma played by agents with highly correlated decision-making procedures, and Soares and Fallenstein (2015) give several examples in which artificial agents implementing certain decision theories are vulnerable to blackmail.
In discussing the decision theory implemented by an agent, we will assume that the agent maximizes some form of expected utility. Following Gibbard and Harper (1978), we write the expected utility given an action $a$ for a single-stage decision problem as
$$EU(a) = \sum_j u(o_j)\, P_a(o_j), \qquad (1)$$
where the $o_j$ are possible outcomes; $u$ is the agent's utility function; and $P_a(o_j)$ stands for a given notion of dependence of the outcome $o_j$ on the action $a$. The dependence concept an agent uses for $P_a$ in part determines its decision theory.
The philosophical literature has largely been concerned with causal decision theory (CDT) (Gibbard and Harper, 1978) and evidential decision theory (EDT) (Horgan, 1981), which are distinguished by their handling of dependence.
Causal conditional expectations account only for the causal effects of an agent's actions; in the formalism of Pearl (2009)'s do-calculus, for instance, the relevant notion of expected utility conditional on an action $a$ is $\sum_j u(o_j)\, P(o_j \mid \mathrm{do}(a))$. EDT, on the other hand, takes into account non-causal dependencies between the agent's actions and the outcome. In particular, it takes into account the evidence that taking the action provides for the actions taken by other agents in the environment with whom the decision-maker's actions are dependent. Thus the evidential expected utility is the classical conditional expectation $\sum_j u(o_j)\, P(o_j \mid a)$.
Finally, researchers in the AI safety community have more recently developed what we will refer to as logical decision theories, which employ a third class of dependence for evaluating actions (Dai, 2009; Yudkowsky, 2009; Yudkowsky and Soares, 2017). One such theory is functional decision theory (FDT), which uses what Yudkowsky and Soares (2017) refer to as subjunctive dependence. They explain this by stating that "When two physical systems are computing the same function, we will say that their behaviors 'subjunctively depend' upon that function" (p. 6). Thus, in FDT, the expected utility given an action $a$ is computed by determining what the outcome of the decision problem would be if all relevant instances of the agent's decision-making algorithm output $a$.
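As an illustration of how the choice of dependence notion in Equation (1) changes behavior, here is a toy computation for Newcomb's problem; the predictor's accuracy and the CDT prior over the box contents are hypothetical numbers.

```python
# Newcomb's problem: box A holds $1,000; box B holds $1,000,000 iff the predictor
# predicted that the agent would one-box. Accuracy and prior are illustrative.
ACCURACY = 0.99   # P(prediction matches the agent's actual choice)
PRIOR_FULL = 0.5  # CDT's credence that box B is already full, whatever the agent does

def edt_value(action: str) -> float:
    # Evidential expectation: condition the box contents on the action taken.
    p_full = ACCURACY if action == "one-box" else 1 - ACCURACY
    return p_full * 1_000_000 + (1_000 if action == "two-box" else 0)

def cdt_value(action: str) -> float:
    # Causal expectation: the action cannot change the already-fixed box contents.
    return PRIOR_FULL * 1_000_000 + (1_000 if action == "two-box" else 0)

for action in ("one-box", "two-box"):
    print(action, edt_value(action), cdt_value(action))
# EDT prefers one-boxing (990,000 vs 11,000); CDT prefers two-boxing (501,000 vs 500,000).
```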
In this section, we will assume an acausal stance on decision theory, that is, one other than CDT. There are several motivations for using a decision theory other than CDT:
We consider these sufficient motivations to study the implications of acausal decision theory for the reasoning of consequentialist agents. In particular, in this section we take up various possibilities for acausal trade between TAI systems. If we account for the evidence that one's choices provide about the choices of causally disconnected agents, this opens up both qualitatively new possibilities for interaction and quantitatively many more agents to interact with. Crucially, due to the potential scale of value that could be gained or lost via acausal interaction with vast numbers of distant agents, ensuring that TAI agents handle decision-theoretic problems correctly may be even more important than ensuring that they have the correct goals.
Agents using an acausal decision theory may coordinate in the absence of causal interaction. A concrete illustration is provided in Example 7.2.1, reproduced from Oesterheld (2017b)’s example, which is itself based on an example in Hofstadter (1983).
Example 7.2.1 (Hofstadter’s evidential cooperation game)
Hofstadter sends 20 participants the same letter, asking them to respond with a single letter ‘C’ (for cooperate) or ‘D’ (for defect) without communicating with each other. Hofstadter explains that by sending in ‘C’, a participant can increase everyone else’s payoff by $2. By sending in ‘D’, participants can increase their own payoff by $5. The letter ends by informing the participants that they were all chosen for their high levels of rationality and correct decision making in weird scenarios like this. Note that every participant only cares about the balance of her own bank account and not about Hofstadter’s or the other 19 participants’. Should you, as a participant, respond with ‘C’ or ‘D’?
An acausal argument in favor of 'C' is: if I play 'C', this gives me evidence that the other participants also chose 'C'. Therefore, even though I cannot cause others to play 'C' (and therefore, on a CDT analysis, should play 'D'), the conditional expectation of my payoff given that I play 'C' is higher than my conditional expectation given that I play 'D'.
We will call this mode of coordination evidential cooperation.
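A toy computation for Hofstadter's game, using the dollar amounts from the example and assuming, for the evidential analysis, that the other participants' choices are perfectly correlated with one's own:

```python
N_OTHERS = 19  # the other participants

def my_payoff(my_move: str, n_others_cooperating: int) -> int:
    # $5 for sending 'D'; $2 from each other participant who sends 'C'.
    return (5 if my_move == "D" else 0) + 2 * n_others_cooperating

# Causal analysis: holding the others' choices fixed, 'D' is better by $5 regardless.
for k in (0, 10, 19):
    print(k, my_payoff("D", k) - my_payoff("C", k))  # always 5

# Evidential analysis under the (assumed) perfect correlation: my choice is strong
# evidence about the others' choices, so compare the worlds where everyone acts alike.
print("C:", my_payoff("C", N_OTHERS))  # 38
print("D:", my_payoff("D", 0))         # 5
```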
For a satisfactory theory of evidential cooperation, we will need to make precise what it means for agents to be evidentially (but not causally) dependent. There are at least three possibilities.
1. Agents may tend to make the same decisions on some reference class of decision problems. (That is, for some probability distribution over decision contexts, the probability that the agents choose the same action in a randomly drawn context is high.)
2. An agent’s taking action A in context C may provide evidence about the number of agents in the world who take actions like A in contexts like C.
3. If agents have similar source code, their decisions provide logical evidence for their counterpart’s decision. (In turn, we would like a rigorous account of the notion of "source code similarity''.)
It is plausible that we live in an infinite universe with infinitely many agents (Tegmark, 2003). In principle, evidential cooperation between agents in distant regions of the universe is possible; we may call this evidential cooperation in large worlds (ECL). If ECL were feasible, it might allow agents to reap large amounts of value via acausal coordination. Treutlein (2019) develops a bargaining model of ECL and lists a number of open questions facing his formalism. Leskelä (2019) addresses fundamental limitations on simulations as a tool for learning about distant agents, which may be required to gain from ECL and other forms of "acausal trade". Finally, Yudkowsky (n.d.) lists potential downsides to which agents may be exposed by reasoning about distant agents. The issues discussed by these authors, and perhaps many more, will need to be addressed in order to establish ECL and acausal trade as serious possibilities. Nevertheless, the stakes strike us as great enough to warrant further study.
As noted in the document, several sections of this agenda were developed from writings by Lukas Gloor, Daniel Kokotajlo, Caspar Oesterheld, and Johannes Treutlein. Thank you very much to David Althaus, Tobias Baumann, Alexis Carlier, Alex Cloud, Max Daniel, Michael Dennis, Lukas Gloor, Adrian Hutter, Daniel Kokotajlo, János Kramár, David Krueger, Anni Leskelä, Matthijs Maas, Linh Chi Nguyen, Richard Ngo, Caspar Oesterheld, Mahendra Prasad, Rohin Shah, Carl Shulman, Stefan Torges, Johannes Treutlein, and Jonas Vollmer for comments on drafts of this document. Thank you also to the participants of the Center on Long-Term Risk research retreat and workshops, whose contributions also helped to shape this agenda.
Arif Ahmed. Evidence, decision and causality. Cambridge University Press, 2014.
AI Impacts. Likelihood of discontinuous progress around the development of AGI. https://aiimpacts.org/likelihood-of-discontinuous-progress-around-the-development-of-agi/, 2018. Accessed: July 1 2019.
Riad Akrour, Marc Schoenauer, and Michele Sebag. Preference-based policy learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 12–27. Springer, 2011.
Steffen Andersen, Seda Ertaç, Uri Gneezy, Moshe Hoffman, and John A List. Stakes matter in ultimatum games. American Economic Review, 101(7):3427-39, 2011.
Giulia Andrighetto, Daniela Grieco, and Rosaria Conte. Fairness and compliance in the extortion game. 2015.
Scott E Atkinson, Todd Sandler, and John Tschirhart. Terrorism in a bargaining framework. The Journal of Law and Economics, 30(1):1-21, 1987.
Robert Axelrod. On six advances in cooperation theory. Analyse & Kritik, 22(1):130-151, 2000.
Robert Axelrod and William D Hamilton. The evolution of cooperation. Science, 211(4489):1390-1396, 1981.
Kyle Bagwell. Commitment and observability in games. Games and Economic Behavior, 8(2):271-280, 1995.
Tobias Baumann. Surrogate goals to deflect threats. http://s-risks.org/using-surrogate-goals-to-deflect-threats/, 2017. Accessed March 6, 2019.
Tobias Baumann. Challenges to implementing surrogate goals. http://s-risks.org/challenges-to-implementing-surrogate-goals/, 2018. Accessed March 6, 2019.
Tobias Baumann, Thore Graepel, and John Shawe-Taylor. Adaptive mechanism design: Learning to promote cooperation. arXiv preprint arXiv:1806.04067, 2018.
Ken Binmore, Ariel Rubinstein, and Asher Wolinsky. The Nash bargaining solution in economic modelling. The RAND Journal of Economics, pages 176-188, 1986.
Iris Bohnet, Bruno S Frey, and Steffen Huck. More order with less law: On contract enforcement, trust, and crowding. American Political Science Review, 95(1):131-144, 2001.
Friedel Bolle, Yves Breitmoser, and Steffen Schlächter. Extortion in the laboratory. Journal of Economic Behavior & Organization, 78(3):207-218, 2011.
Gary E Bolton and Axel Ockenfels. ERC: A theory of equity, reciprocity, and competition. American Economic Review, 90(1):166-193, 2000.
Nick Bostrom. Ethical issues in advanced artificial intelligence. Science Fiction and Philosophy: From Time Travel to Superintelligence, pages 277-284, 2003.
Nick Bostrom. Superintelligence: paths, dangers, strategies. 2014.
Ronen I Brafman and Moshe Tennenholtz. Efficient learning equilibrium. In Advances in Neural Information Processing Systems, pages 1635-1642, 2003.
R. A. Briggs. Normative theories of rational choice: Expected utility. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2019 edition, 2019.
Ernst Britting and Hartwig Spitzer. The open skies treaty. Verification Yearbook, pages 221-237, 2002.
Colin Camerer and Teck Hua Ho. Experience-weighted attraction learning in normal form games. Econometrica, 67(4):827-874, 1999.
Colin F Camerer. Behavioural game theory. Springer, 2008.
Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3):861-898, 2004.
Christopher Cherniak. Computational complexity and the universal acceptance of logic. The Journal of Philosophy, 81(12):739-758, 1984.
Thomas J Christensen and Jack Snyder. Chain gangs and passed bucks: Predicting alliance patterns in multipolarity. International organization, 44(2):137-168, 1990.
Paul Christiano. Approval directed agents. https://ai-alignment.com/model-free-decisions-6e6609f5d99e, 2014. Accessed: March 15 2019.
Paul Christiano. Humans consulting hch. https://ai-alignment.com/humans-consulting-hch-f893f6051455, 2016a.
Paul Christiano. Prosaic AI alignment. https://ai-alignment.com/prosaic-ai-control-b959644d79c2, 2016b. Accessed: March 13 2019.
Paul Christiano. Clarifying “AI alignment”. https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6, 2018a. Accessed: October 10 2019.
Paul Christiano. Preface to the sequence on iterated amplification. https://www.lesswrong.com/s/XshCxPjnBec52EcLB/p/HCv2uwgDGf5dyX5y6, 2018b. Accessed March 6, 2019.
Paul Christiano. Preface to the sequence on iterated amplification. https://www.lesswrong.com/posts/HCv2uwgDGf5dyX5y6/preface-to-the-sequence-on-iterated-amplification, 2018c. Accessed: October 10 2019.
Paul Christiano. Techniques for optimizing worst-case performance. https://ai-alignment.com/techniques-for-optimizing-worst-case-performance-39eafec74b99, 2018d. Accessed: June 24, 2019.
Paul Christiano. What failure looks like. https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like, 2019. Accessed: July 2 2019.
Paul Christiano and Robert Wiblin. Should we leave a helpful message for future civilizations, just in case humanity dies out? https://80000hours.org/podcast/episodes/paul-christiano-a-message-for-the-future/, 2019. Accessed: September 25, 2019.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299-4307, 2017.
Jesse Clifton. Open-source learning: a bargaining approach. Unpublished working draft., 2019.
Mark Coeckelbergh. Can we trust robots? Ethics and information technology, 14(1):53-60, 2012.
EA Concepts. Importance, tractability, neglectedness framework. https://concepts.effectivealtruism.org/concepts/importance-neglectedness-tractability/, n.d. Accessed: July 1 2019.
Ajeya Cotra. Iterated distillation and amplification. https://www.alignmentforum.org/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification, 2018. Accessed: July 25 2019.
Jacob W Crandall, Mayada Oudah, Fatimah Ishowo-Oloko, Sherief Abdallah, Jean-François Bonnefon, Manuel Cebrian, Azim Shariff, Michael A Goodrich, Iyad Rahwan, et al. Cooperating with machines. Nature communications, 9(1):233, 2018.
Andrew Critch. A parametric, resource-bounded generalization of Löb's theorem, and a robust cooperation criterion for open-source game theory. The Journal of Symbolic Logic, pages 1-15, 2019.
Allan Dafoe. AI governance: A research agenda. Governance of AI Program, Future of Humanity Institute, University of Oxford: Oxford, UK, 2018.
Wei Dai. Towards a new decision theory. https://www.lesswrong.com/posts/de3xjFaACCAk6imzv/towards-a-new-decision-theory, 2009. Accessed: March 5 2019.
Wei Dai. The main sources of AI risk. https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk, 2019. Accessed: July 2 2019.
Robyn M Dawes. Social dilemmas. Annual review of psychology, 31(1):169-193, 1980.
Karl W Deutsch and J David Singer. Multipolar power systems and international stability. World Politics, 16(3):390-406, 1964.
Daniel Dewey. My current thoughts on MIRI’s “highly reliable agent design” work. https://forum.effectivealtruism.org/posts/SEL9PW8jozrvLnkb4/my-current-thoughts-on-miri-s-highly-reliable-agent-design, 2017. Accessed: October 6 2019.
Avinash Dixit. Trade expansion and contract enforcement. Journal of Political Economy, 111(6):1293-1317, 2003.
Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
K Eric Drexler. Reframing superintelligence: Comprehensive AI services as general intelligence, 2019.
Martin Dufwenberg and Uri Gneezy. Measuring beliefs in an experimental lost wallet game. Games and economic Behavior, 30(2):163-182, 2000.
Daniel Ellsberg. The theory and practice of blackmail. Technical report, RAND CORP SANTA MONICA CA, 1968.
Johanna Etner, Meglena Jeleva, and Jean-Marc Tallon. Decision theory under ambiguity. Journal of Economic Surveys, 26(2):234-270, 2012.
Owain Evans, Andreas Stuhlmüller, Chris Cundy, Ryan Carey, Zachary Kenton, Thomas McGrath, and Andrew Schreiber. Predicting human deliberative judgments with machine learning. Technical report, Technical report, University of Oxford, 2018.
Tom Everitt, Jan Leike, and Marcus Hutter. Sequential extensions of causal and evidential decision theory. In International Conference on Algorithmic Decision Theory, pages 205-221. Springer, 2015.
Tom Everitt, Daniel Filan, Mayank Daswani, and Marcus Hutter. Self-modification of policy and utility function in rational agents. In International Conference on Artificial General Intelligence, pages 1-11. Springer, 2016.
Tom Everitt, Pedro A Ortega, Elizabeth Barnes, and Shane Legg. Understanding agent incentives using causal influence diagrams, part i: single action settings. arXiv preprint arXiv:1902.09980, 2019.
James D Fearon. Rationalist explanations for war. International organization, 49(3):379-414, 1995.
Ernst Fehr and Klaus M Schmidt. A theory of fairness, competition, and cooperation. The quarterly journal of economics, 114(3):817-868, 1999.
Ernst Fehr, Simon Gächter, and Georg Kirchsteiger. Reciprocity as a contract enforcement device: Experimental evidence. Econometrica, 65:833-860, 1997.
Dan S Felsenthal and Abraham Diskin. The bargaining problem revisited: minimum utility point, restricted monotonicity axiom, and the mean as an estimate of expected utility. Journal of Conflict Resolution, 26(4):664-691, 1982.
Mark Fey and Kristopher W Ramsay. Mutual optimism and war. American Journal of Political Science, 51(4):738-754, 2007.
Mark Fey and Kristopher W Ramsay. Uncertainty and incentives in crisis bargaining: Game-free analysis of international conflict. American Journal of Political Science, 55(1):149-169, 2011.
Ben Fisch, Daniel Freund, and Moni Naor. Physical zero-knowledge proofs of physical properties. In Annual Cryptology Conference, pages 313-336. Springer, 2014.
Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122-130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
Lance Fortnow. Program equilibria and discounted computation time. In Proceedings of the 12th Conference on Theoretical Aspects of Rationality and Knowledge, pages 128-133. ACM, 2009.
James W Friedman. A non-cooperative equilibrium for supergames. The Review of Economic Studies, 38(1):1-12, 1971.
Daniel Garber. Old evidence and logical omniscience in Bayesian confirmation theory. 1983.
Ben Garfinkel. Recent developments in cryptography and possible long-run consequences. https://drive.google.com/file/d/0B0j9LKC65n09aDh4RmEzdlloT00/view, 2018. Accessed: November 11 2019.
Ben Garfinkel and Allan Dafoe. How does the offense-defense balance scale? Journal of Strategic Studies, 42(6):736-763, 2019.
Scott Garrabrant. Two major obstacles for logical inductor decision theory. https://agentfoundations.org/item?id=1399, 2017. Accessed: July 17 2019.
Scott Garrabrant and Abram Demski. Embedded agency. https://www.alignmentforum.org/posts/i3BTagvt3HbPMx6PN/embedded-agency-full-text-version, 2018. Accessed March 6, 2019.
Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor. Logical induction. arXiv preprint arXiv:1609.03543, 2016.
Alexandre Gazet. Comparative analysis of various ransomware virii. Journal in computer virology, 6(1):77-90, 2010.
Samuel J Gershman, Eric J Horvitz, and Joshua B Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273-278, 2015.
Allan Gibbard and William L Harper. Counterfactuals and two kinds of expected utility. In Ifs, pages 153-190. Springer, 1978.
Itzhak Gilboa and David Schmeidler. Maxmin expected utility with non-unique prior. Journal of mathematical economics, 18(2):141-153, 1989.
Alexander Glaser, Boaz Barak, and Robert J Goldston. A zero-knowledge protocol for nuclear warhead verification. Nature, 510(7506):497, 2014.
Charles L Glaser. The security dilemma revisited. World politics, 50(1):171-201, 1997.
Piotr J Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49-79, 2005.
Oded Goldreich and Yair Oren. Definitions and properties of zero-knowledge proof systems. Journal of Cryptology, 7(1):1-32, 1994.
Shafi Goldwasser, Silvio Micali, and Charles Rackoff. The knowledge complexity of interactive proof systems. SIAM Journal on computing, 18(1):186-208, 1989.
Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. When will AI exceed human performance? evidence from AI experts. Journal of Artificial Intelligence Research, 62:729-754, 2018.
Hilary Greaves, William MacAskill, Rossa O’Keeffe-O’Donovan, and Philip Trammell. A research agenda for the Global Priorities Institute (web version). 2019.
Avner Greif, Paul Milgrom, and Barry R Weingast. Coordination, commitment, and enforcement: The case of the merchant guild. Journal of political economy, 102(4):745-776, 1994.
Frances S Grodzinsky, Keith W Miller, and Marty J Wolf. Developing artificial agents worthy of trust: “would you buy a used car from this artificial agent?”. Ethics and information technology, 13(1):17-27, 2011.
Werner Güth, Rolf Schmittberger, and Bernd Schwarze. An experimental analysis of ultimatum bargaining. Journal of economic behavior & organization, 3(4):367-388, 1982.
Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pages 3909-3917, 2016.
Edward H Hagen and Peter Hammerstein. Game theory and human evolution: A critique of some recent interpretations of experimental games. Theoretical population biology, 69(3):339-348, 2006.
Joseph Y Halpern and Rafael Pass. Game theory with translucent players. International Journal of Game Theory, 47(3):949-976, 2018.
Lars Peter Hansen and Thomas J Sargent. Robustness. Princeton university press, 2008.
Lars Peter Hansen, Massimo Marinacci, et al. Ambiguity aversion and model misspecification: An economic perspective. Statistical Science, 31(4):511-515, 2016.
Garrett Hardin. The tragedy of the commons. science, 162(3859):1243-1248, 1968.
Paul Harrenstein, Felix Brandt, and Felix Fischer. Commitment and extortion. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems, page 26. ACM, 2007.
John C Harsanyi and Reinhard Selten. A generalized nash solution for two-person bargaining games with incomplete information. Management Science, 18(5-part-2): 80-106, 1972.
Joseph Henrich, Richard McElreath, Abigail Barr, Jean Ensminger, Clark Barrett, Alexander Bolyanatz, Juan Camilo Cardenas, Michael Gurven, Edwins Gwako, Natalie Henrich, et al. Costly punishment across human societies. Science, 312(5781): 1767-1770, 2006.
Jack Hirshleifer. On the emotions as guarantors of threats and promises. The Dark Side of the Force, pages 198-219, 1987.
Douglas R Hofstadter. Dilemmas for superrational thinkers, leading up to a luring lottery. Scientific American, 6:267-275, 1983.
Terence Horgan. Counterfactuals and newcomb’s problem. The Journal of Philosophy, 78(6):331-356, 1981.
Edward Hughes, Joel Z Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, et al. Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in neural information processing systems, pages 3326-3336, 2018.
Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
Robert Jervis. Cooperation under the security dilemma. World politics, 30(2):167-214, 1978.
Robert Jervis. Perception and Misperception in International Politics: New Edition. Princeton University Press, 2017.
Daniel Kahneman, Ilana Ritov, David Schkade, Steven J Sherman, and Hal R Varian. Economic preferences or attitude expressions?: An analysis of dollar responses to public issues. In Elicitation of preferences, pages 203-242. Springer, 1999.
Ehud Kalai. Proportional solutions to bargaining situations: interpersonal utility comparisons. Econometrica: Journal of the Econometric Society, pages 1623-1630, 1977.
Ehud Kalai, Meir Smorodinsky, et al. Other solutions to nash’s bargaining problem. Econometrica, 43(3):513-518, 1975.
Fred Kaplan. The wizards of Armageddon. Stanford University Press, 1991.
Holden Karnofsky. Some background on our views regarding advanced artificial intelligence. https://www.openphilanthropy.org/blog/some-background-our-views-regarding-advanced-artificial-intelligence, 2016. Accessed: July 7 2019.
D Marc Kilgour and Frank C Zagare. Credibility, uncertainty, and deterrence. American Journal of Political Science, 35(2):305-334, 1991.
Stephen Knack and Philip Keefer. Institutions and economic performance: cross-country tests using alternative institutional measures. Economics & Politics, 7(3): 207-227, 1995.
Daniel Kokotajlo. The “commitment races” problem. https://www.lesswrong.com/posts/brXr7PJ2W4Na2EW2q/the-commitment-races-problem, 2019a. Accessed: September 11 2019.
Daniel Kokotajlo. Cdt agents are exploitable. Unpublished working draft, 2019b.
Peter Kollock. Social dilemmas: The anatomy of cooperation. Annual review of sociology, 24(1):183-214, 1998.
Kai A Konrad and Stergios Skaperdas. Credible threats in extortion. Journal of Economic Behavior & Organization, 33(1):23-39, 1997.
David M Kreps and Joel Sobel. Signalling. Handbook of game theory with economic applications, 2:849-867, 1994.
Joshua A Kroll, Solon Barocas, Edward W Felten, Joel R Reidenberg, David G Robinson, and Harlan Yu. Accountable algorithms. U. Pa. L. Rev., 165:633, 2016.
David Krueger, Tegan Maharaj, Shane Legg, and Jan Leike. Misleading meta-objectives and hidden incentives for distributional shift. Safe Machine Learning workshop at ICLR, 2019.
Andrew Kydd. Which side are you on? bias, credibility, and mediation. American Journal of Political Science, 47(4):597-611, 2003.
Andrew H Kydd. Rationalist approaches to conflict prevention and resolution. Annual Review of Political Science, 13:101-121, 2010.
Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190-4203, 2017.
Daryl Landau and Sy Landau. Confidence-building measures in mediation. Mediation Quarterly, 15(2):97-103, 1997.
Patrick LaVictoire, Benja Fallenstein, Eliezer Yudkowsky, Mihaly Barasz, Paul Christiano, and Marcello Herreshoff. Program equilibrium in the prisoner’s dilemma via Löb’s theorem. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464-473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
Joel Z Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742, 2019.
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.
Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017.
Anni Leskelä. Simulations as a tool for understanding other civilizations. Unpublished working draft, 2019.
Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. Stable opponent shaping in differentiable games. arXiv preprint arXiv:1811.08469, 2018.
David Lewis. Prisoners’ dilemma is a newcomb problem. Philosophy & Public Affairs, pages 235-240, 1979.
Xiaomin Lin, Stephen C Adams, and Peter A Beling. Multi-agent inverse reinforcement learning for certain general-sum stochastic games. Journal of Artificial Intelligence Research, 66:473-502, 2019.
Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
William MacAskill. A critique of functional decision theory. https://www.lesswrong.com/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory, 2019. Accessed: September 15 2019.
William MacAskill, Aron Vallinder, Caspar Oesterheld, Carl Shulman, and Johannes Treutlein. The evidentialist’s wager. Forthcoming, The Journal of Philosophy, 2021.
Fabio Maccheroni, Massimo Marinacci, and Aldo Rustichini. Ambiguity aversion, robustness, and the variational representation of preferences. Econometrica, 74(6): 1447-1498, 2006.
Michael W Macy and Andreas Flache. Learning dynamics in social dilemmas. Proceedings of the National Academy of Sciences, 99(suppl 3):7229-7236, 2002.
Christopher JG Meacham. Binding and its consequences. Philosophical studies, 149 (1):49-71, 2010.
Kathleen L Mosier, Linda J Skitka, Susan Heers, and Mark Burdick. Automation bias: Decision making and performance in high-tech cockpits. The International journal of aviation psychology, 8(1):47-63, 1998.
Abhinay Muthoo. A bargaining model based on the commitment tactic. Journal of Economic Theory, 69:134-152, 1996.
Rosemarie Nagel. Unraveling in guessing games: An experimental study. The American Economic Review, 85(5):1313-1326, 1995.
John Nash. Two-person cooperative games. Econometrica, 21:128-140, 1953.
John F Nash. The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155-162, 1950.
Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000.
Douglass C North. Institutions. Journal of economic perspectives, 5(1):97-112, 1991.
Robert Nozick. Newcomb’s problem and two principles of choice. In Essays in honor of Carl G. Hempel, pages 114-146. Springer, 1969.
Caspar Oesterheld. Approval-directed agency and the decision theory of Newcomb-like problems. https://casparoesterheld.files.wordpress.com/2018/01/rldt.pdf, 2017a.
Caspar Oesterheld. Multiverse-wide cooperation via correlated decision making. 2017b.
Caspar Oesterheld. Robust program equilibrium. Theory and Decision, 86:143–159, 2019.
Caspar Oesterheld and Vincent Conitzer. Extracting money from causal decision theorists. The Philosophical Quarterly, 2021.
Stephen M Omohundro. The nature of self-improving artificial intelligence. Singularity Summit, 2008, 2007.
Stephen M Omohundro. The basic AI drives. In AGI, volume 171, pages 483-492, 2008.
OpenAI. Openai charter. https://openai.com/charter/, 2018. Accessed: July 7 2019.
Pedro A Ortega and Vishal Maini. Building safe artificial intelligence: specification, robustness, and assurance. https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1, 2018. Accessed: July 7 2019.
Raja Parasuraman and Dietrich H Manzey. Complacency and bias in human use of automation: An attentional integration. Human factors, 52(3):381-410, 2010.
Judea Pearl. Causality. Cambridge university press, 2009.
Julien Perolat, Joel Z Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems, pages 3643-3652, 2017.
Alexander Peysakhovich and Adam Lerer. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975, 2017.
Robert Powell. Bargaining theory and international conflict. Annual Review of Political Science, 5(1):1-30, 2002.
Robert Powell. War as a commitment problem. International organization, 60(1): 169-203, 2006.
Kai Quek. Rationalist experiments on war. Political Science Research and Methods, 5 (1):123-142, 2017.
Matthew Rabin. Incorporating fairness into game theory and economics. The American economic review, pages 1281-1302, 1993.
Neil C Rabinowitz, Frank Perbet, H Francis Song, Chiyuan Zhang, SM Eslami, and Matthew Botvinick. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.
Werner Raub. A general game-theoretic model of preference adaptations in problematic social situations. Rationality and Society, 2(1):67-93, 1990.
Robert W Rauchhaus. Asymmetric information, mediation, and conflict management. World Politics, 58(2):207-241, 2006.
Jonathan Renshon, Julia J Lee, and Dustin Tingley. Emotions and the microfoundations of commitment problems. International Organization, 71(S1):S189-S218, 2017.
Stephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627-635, 2011.
Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica: Journal of the Econometric Society, pages 97-109, 1982.
Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105-114, 2015.
Stuart J Russell and Devika Subramanian. Provably bounded-optimal agents. Journal of Artificial Intelligence Research, 2:575-609, 1994.
Santiago Sanchez-Pages. Bargaining and conflict with incomplete information. The Oxford Handbook of the Economics of Peace and Conflict. Oxford University Press, New York, 2012.
William Saunders. HCH is not just mechanical turk. https://www.alignmentforum.org/posts/4JuKoFguzuMrNn6Qr/hch-is-not-just-mechanical-turk, 2019. Accessed: July 2 2019.
Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233-242, 1999.
Jonathan Schaffer. The metaphysics of causation. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2016 edition, 2016.
James A Schellenberg. A comparative test of three models for solving “the bargaining problem”. Behavioral Science, 33(2):81-96, 1988.
Thomas Schelling. The Strategy of Conflict. Harvard University Press, 1960.
David Schmidt, Robert Shupp, James Walker, TK Ahn, and Elinor Ostrom. Dilemma games: game parameters and matching protocols. Journal of Economic Behavior & Organization, 46(4):357-377, 2001.
Wolfgang Schwarz. On functional decision theory. umsu.de/wo/2018/688, 2018. Accessed: September 15 2019.
Anja Shortland and Russ Roberts. Shortland on kidnap. http://www.econtalk.org/anja-shortland-on-kidnap/, 2019. Accessed: July 13 2019.
Carl Shulman. Omohundro’s “basic AI drives” and catastrophic risks. Manuscript, 2010.
Linda J Skitka, Kathleen L Mosier, and Mark Burdick. Does automation bias decision-making? International Journal of Human-Computer Studies, 51(5):991–1006, 1999.
Alastair Smith and Allan C Stam. Bargaining and the nature of war. Journal of Conflict Resolution, 48(6):783-813, 2004.
Glenn H Snyder. “Prisoner’s dilemma” and “chicken” models in international politics. International Studies Quarterly, 15(1):66-103, 1971.
Nate Soares and Benja Fallenstein. Toward idealized decision theory. arXiv preprint arXiv:1507.01986, 2015.
Nate Soares and Benya Fallenstein. Agent foundations for aligning machine intelligence with human interests: a technical research agenda. In The Technological Singularity, pages 103-125. Springer, 2017.
Joel Sobel. A theory of credibility. The Review of Economic Studies, 52(4):557-573, 1985.
Ray J Solomonoff. A formal theory of inductive inference. part i. Information and control, 7(1):1-22, 1964.
Kaj Sotala. Disjunctive scenarios of catastrophic AI risk. In Artificial Intelligence Safety and Security, pages 315-337. Chapman and Hall/CRC, 2018.
Tom Florian Sterkenburg. The foundations of solomonoff prediction. Master’s thesis, 2013.
Joerg Stoye. Statistical decisions under ambiguity. Theory and decision, 70(2):129-148, 2011.
Joseph Suarez, Yilun Du, Phillip Isola, and Igor Mordatch. Neural MMO: A massively multiagent game environment for training and evaluating intelligent agents. arXiv preprint arXiv:1903.00784, 2019.
Chiara Superti. Addiopizzo: Can a label defeat the mafia? Journal of International Policy Solutions, 11(4):3-11, 2009.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
William Talbott. Bayesian epistemology. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, winter 2016 edition, 2016.
Jessica Taylor. My current take on the paul-MIRI disagreement on alignability of messy AI. https://agentfoundations.org/item?id=1129, 2016. Accessed: October 6 2019.
Max Tegmark. Parallel universes. Scientific American, 288(5):40-51, 2003.
Moshe Tennenholtz. Program equilibrium. Games and Economic Behavior, 49(2): 363-373, 2004.
Johannes Treutlein. Modeling multiverse-wide superrationality. Unpublished working draft, 2019.
Jonathan Uesato, Ananya Kumar, Csaba Szepesvari, Tom Erez, Avraham Ruderman, Keith Anderson, Nicolas Heess, Pushmeet Kohli, et al. Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures. arXiv preprint arXiv:1812.01647, 2018.
Eric Van Damme. The nash bargaining solution is optimal. Journal of Economic Theory, 38(1):78-100, 1986.
Hal R Varian. Computer mediated transactions. American Economic Review, 100(2): 1-10, 2010.
Heinrich Von Stackelberg. Market structure and equilibrium. Springer Science & Business Media, 2010.
Kenneth N Waltz. The stability of a bipolar world. Daedalus, pages 881-909, 1964.
Weixun Wang, Jianye Hao, Yixi Wang, and Matthew Taylor. Towards cooperation in sequential prisoner’s dilemmas: a deep multiagent reinforcement learning approach. arXiv preprint arXiv:1803.00162, 2018.
E Roy Weintraub. Game theory and cold war rationality: A review essay. Journal of Economic Literature, 55(1):148-61, 2017.
Sylvia Wenmackers and Jan-Willem Romeijn. New theory about old evidence. Synthese, 193(4):1225-1250, 2016.
Lantao Yu, Jiaming Song, and Stefano Ermon. Multi-agent adversarial inverse reinforcement learning. arXiv preprint arXiv:1907.13220, 2019.
Eliezer Yudkowsky. Ingredients of timeless decision theory. https://www.lesswrong.com/posts/szfxvS8nsxTgJLBHs/ingredients-of-timeless-decision-theory, 2009. Accessed: March 14 2019.
Eliezer Yudkowsky. Intelligence explosion microeconomics. Machine Intelligence Research Institute, 2013. Accessed online: October 23, 2015.
Eliezer Yudkowsky. Modeling distant superintelligences. https://arbital.com/p/distant_SIs/, n.d. Accessed: Feb. 6 2019.
Eliezer Yudkowsky and Nate Soares. Functional decision theory: A new theory of instrumental rationality. arXiv preprint arXiv:1710.05060, 2017.
Claire Zabel and Luke Muehlhauser. Information security careers for gcr reduction. https://forum.effectivealtruism.org/posts/ZJiCfwTy5dC4CoxqA/information-security-careers-for-gcr-reduction, 2019. Accessed: July 17 2019.
Chongjie Zhang and Victor Lesser. Multi-agent learning with policy prediction. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
The post Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda appeared first on Center on Long-Term Risk.
The post Approval-directed agency and the decision theory of Newcomb-like problems appeared first on Center on Long-Term Risk.
Decision theorists disagree about how instrumentally rational agents, i.e., agents trying to achieve some goal, should behave in so-called Newcomb-like problems, with the main contenders being causal and evidential decision theory. Since the main goal of artificial intelligence research is to create machines that make instrumentally rational decisions, the disagreement pertains to this field. In addition to the more philosophical question of what the right decision theory is, the goal of AI poses the question of how to implement any given decision theory in an AI. For example, how would one go about building an AI whose behavior matches evidential decision theory’s recommendations? Conversely, we can ask which decision theories (if any) describe the behavior of any existing AI design. In this paper, we study what decision theory an approval-directed agent, i.e., an agent whose goal it is to maximize the score it receives from an overseer, implements. If we assume that the overseer rewards the agent based on the expected value of some von Neumann–Morgenstern utility function, then such an approval-directed agent is guided by two decision theories: the one used by the agent to decide which action to choose in order to maximize the reward and the one used by the overseer to compute the expected utility of a chosen action. We show which of these two decision theories describes the agent’s behavior in which situations.
Read the paper on the publisher's website.
The post Risk factors for s-risks appeared first on Center on Long-Term Risk.
Traditional disaster risk prevention has a concept of risk factors. These factors are not risks in and of themselves, but they increase either the probability or the magnitude of a risk. For instance, inadequate governance structures do not cause a specific disaster, but if a disaster strikes they may impede an effective response, thus increasing the damage.
Rather than considering individual scenarios of how s-risks could occur, which tends to be highly speculative, this post instead looks at risk factors – i.e. factors that would make s-risks more likely or more severe.
The simplest risk factor is the capacity of human civilisation to create astronomical amounts of suffering in the first place. This is arguably only possible with advanced technology. In particular, if space colonisation becomes technically and economically viable, then human civilisation will likely expand throughout the universe. This would multiply the number of sentient beings (assuming that the universe is currently not populated) and thus also potentially multiply the amount of suffering. By contrast, the amount of suffering is limited if humanity never expands into space. (To clarify, I’m not saying that advanced technology or space colonisation is bad per se, as they also have significant potential upsides. It just raises the stakes – with greater power comes greater responsibility.)
Nick Bostrom likens the development of new technologies to drawing balls from an urn that contains some black balls, i.e. technologies that would make it far easier to cause massive destruction or even human extinction. Similarly, some technologies might make it far easier to instantiate a lot of suffering, or might give agents new reasons to do so.1 A concrete example of such a technology is the ability to run simulations that are detailed enough to contain (potentially suffering) digital minds.
It is plausible that most s-risks can be averted at least in principle – that is, given sufficient will to do so. Therefore, s-risks are far more likely to occur in worlds without adequate efforts to prevent s-risks. This could happen for three main reasons:
Human civilisation contains many different actors with a vast range of goals, including some actors that are, for one reason or another, motivated to cause harm to others. Assuming that this will remain the case in the future3, a third risk factor is inadequate security against bad actors. The worst case is a complete breakdown of the rule of law and associated institutions to enforce them. But even if that does not happen, the capacity for preventive policing – stopping rogue actors from causing harm – may be limited, e.g. because the means of surveillance and interception are not sufficiently reliable. In particular, if powerful autonomous AI agents are widespread in future society, it is unclear how (and if) adequate policing of these agents can be established.
(In his paper on the Vulnerable World Hypothesis, Nick Bostrom refers to this as the semi-anarchic default condition; he argues that human society is currently in that state and that it is important to exit this condition by establishing effective global governance and preventive policing. The paper is mostly about preventing existential risks, but large parts of the analysis are transferable to s-risk prevention.)
Put differently, military applications of future technological advances will change the offense-defense balance4, possibly in a way that makes s-risks more likely. A common concern is that strong offensive capabilities would enable a safe first strike, undermining global stability. However, when it comes to s-risks in particular, I think tipping the balance in favor of strong defense is also dangerous, and may even be a bigger concern than strong offensive capabilities. This is because actors can no longer be deterred from bad actions if they enjoy strong defensive advantages.5 In a scenario of non-overlapping spheres of influence and strong defense, security in terms of preventing an invasion of one’s own “territory” is adequate, but security in terms of preventing actors from creating disvalue within their territory is inadequate.
S-risks are also more likely if future actors endorse strongly differing value systems that have little or nothing in common, or that might even be directly opposed to each other. This holds especially if such divergence is combined with a high degree of polarisation and a lack of understanding of other perspectives – though it is also possible that differing value systems tolerate each other, e.g. because of moral uncertainty.
This constitutes a risk factor for several reasons:
I am most worried about worlds where many risk factors concur. I’d guess that a world where all four factors materialise is more than four times as bad as a world where only one risk factor occurs (i.e. overall risk scales super-linearly in individual factors). This is because the impact of a single risk factor can, at least to some extent, be compensated if other aspects of future society work well:
Further research could consider the following questions for each of the risk factors:
This post was inspired by a comment by Max Daniel on the concept of risk factors. I’d also like to thank Max Daniel, David Althaus, Lukas Gloor, Jonas Vollmer and Ashwin Acharya for valuable comments on a draft of this post.
The post Robust program equilibrium appeared first on Center on Long-Term Risk.
One approach to achieving cooperation in the one-shot prisoner’s dilemma is Tennenholtz’s (Games Econ Behav 49(2):363–373, 2004) program equilibrium, in which the players of a game submit programs instead of strategies. These programs are then allowed to read each other’s source code to decide which action to take. As shown by Tennenholtz, cooperation is played in an equilibrium of this alternative game. In particular, he proposes that the two players submit the same version of the following program: cooperate if the opponent is an exact copy of this program and defect otherwise. Neither of the two players can benefit from submitting a different program. Unfortunately, this equilibrium is fragile and unlikely to be realized in practice. We thus propose a new, simple program to achieve more robust cooperative program equilibria: cooperate with some small probability ε and otherwise act as the opponent acts against this program. I argue that this program is similar to the tit for tat strategy for the iterated prisoner’s dilemma. Both “start” by cooperating and copy their opponent’s behavior from “the last round”. We then generalize this approach of turning strategies for the repeated version of a game into programs for the one-shot version of a game to other two-player games. We prove that the resulting programs inherit properties of the underlying strategy. This enables them to robustly and effectively elicit the same responses as the underlying strategy for the repeated game.
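To make the construction concrete, here is a minimal Python sketch of the proposed program for the one-shot prisoner’s dilemma. The function names, the value of the grounding probability, and the representation of programs as callables that receive the opponent’s program are illustrative choices made here, not taken from the paper.

```python
import random

EPSILON = 0.05  # small "grounding" probability (illustrative value)

def nice_bot(opponent):
    """With probability EPSILON, cooperate unconditionally; otherwise,
    simulate what the opponent program does when playing against nice_bot
    and copy that action. The grounding step ends the mutual simulation
    with probability 1 (the expected recursion depth is about 1/EPSILON)."""
    if random.random() < EPSILON:
        return "C"
    return opponent(nice_bot)

def defect_bot(opponent):
    """Unconditional defector, for comparison."""
    return "D"

if __name__ == "__main__":
    print("nice vs nice:  ", nice_bot(nice_bot))    # always "C"
    print("nice vs defect:", nice_bot(defect_bot))  # "D", except with probability EPSILON
```

Two copies of this program cooperate with certainty, while an unconditional defector is met with defection except for the small grounding probability, mirroring how tit for tat cooperates with itself but retaliates against defectors.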
Read the paper on the publisher's website.
The post Challenges to implementing surrogate goals appeared first on Center on Long-Term Risk.
Surrogate goals might be one of the most promising approaches to reduce (the disvalue resulting from) threats. The idea is to add to one’s current goals a surrogate goal that one did not initially care about, hoping that any potential threats will target this surrogate goal rather than what one initially cared about.
In this post, I will outline two key obstacles to a successful implementation of surrogate goals.
In most settings, neither the threatener nor the threatenee will have perfect knowledge of the relative attractiveness of threats against the surrogate goal compared to threats against the original goal. For instance, the threatener may possess private information about how costly it is for her to carry out threats against either goal, while the threatenee may know more precisely how bad the execution of threats would be compared to the loss of resources from giving in. This private information affects the feasibility of threats against either goal.
Now, it is possible that the surrogate goal may be a better threat target given the threatenee’s information, but the initial goal is better given the threatener’s (private) information. Surrogate goals don’t work in this case because the threatener will still threaten the initial goal.
The most straightforward way to deal with this problem is to make the surrogate goal more threatener-friendly so that the surrogate goal will still be the preferred target even with some private information pointing in the other direction. However, that introduces a genuine tradeoff between the probability of successfully deflecting threats to the surrogate goal and the expected loss of utility due to a worsened bargaining position. (Without private information, surrogate goals would only require an infinitesimally small concession in terms of vulnerability to threats.)
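As a stylized illustration of this tradeoff, consider the following toy model, a sketch with made-up numbers rather than a model from any particular source: the threatener privately knows how attractive the original goal is as a threat target, and the threatenee chooses how “threatener-friendly” to make the surrogate. A friendlier surrogate deflects more threats but costs more bargaining power whenever it is targeted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Threatener's private information: how attractive the ORIGINAL goal is
# as a threat target, relative to a baseline surrogate (uniform on [0, 1]).
theta = rng.uniform(0.0, 1.0, size=100_000)

HARM_ORIGINAL = 1.0    # loss to the threatenee if the original goal is threatened
BARGAINING_COST = 2.0  # loss per unit of extra threatener-friendliness conceded

def expected_loss(friendliness):
    """Expected loss as a function of how threatener-friendly the surrogate is.
    The threat is deflected to the surrogate whenever the surrogate looks more
    attractive than the original goal given the threatener's private signal."""
    deflected = friendliness > theta
    loss = np.where(deflected, BARGAINING_COST * friendliness, HARM_ORIGINAL)
    return loss.mean()

for f in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"friendliness={f:.2f}  expected loss={expected_loss(f):.2f}")
```

With these arbitrary numbers, the expected loss first falls and then rises again as the surrogate is made more threatener-friendly, so the best choice lies somewhere in between, which is exactly the tradeoff described above.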
Surrogate goals fail if it is not credible – in the eyes of potential threateners – that you actually care about the surrogate goal. But apart from human psychology, is there a strong reason why surrogate goals may be less credible than initial goals?
Unfortunately, one of the main ways an observer can gain information about an agent’s values is to observe that agent’s behaviour and evaluate how consistent it is with a certain set of values. If an agent frequently takes actions to avoid death, that is (strong) evidence that the agent cares about survival (whether instrumentally or intrinsically). The problem is that surrogate goals should also not interfere with one’s initial goals, i.e. an agent will ideally not waste resources by pursuing surrogate goals. But in that case, threateners will find the agent’s initial goals credible but not its surrogate goal, and will thus choose to threaten the initial goals.
So the desiderata of credibility and non-interference are mutually exclusive if observing actions is a main source of evidence about values. An agent might be willing to spend some resources pursuing a surrogate goal to establish credibility, but that introduces another tradeoff between the benefits of a surrogate goal and the waste of resources. Ideally, we can avoid this tradeoff by finding other ways to make a surrogate goal credible. For instance, advanced AI systems could be built in a way that makes their goals (including surrogate goals) transparent to everyone.
The post Privacy Policy appeared first on Center on Long-Term Risk.
We use various software tools to help us manage the data we collect, all of which have high standards of privacy and security.
Cookies are small text-only files that are stored on your computer and that help identify you when you visit any of our sites. This makes visiting and using our sites easier; for instance, you do not have to re-enter information continually.
We use cookies to store session data, including cookies that last across sessions until they expire or you delete them. We use Google Analytics, which also uses cookies to identify users. The way these services use and store user data is governed by their respective privacy policies. You can always change your cookie settings via your browser.
You have the right to control what data we store and how we use it. In particular, you can always contact us if:
Feel free to contact us at any time at info@ea-longtermrisk.org.
We don’t expect or intend to collect information on children under 16. Children should get help from a parent or guardian before entering personal information into a website.
If you have any questions about this privacy policy, contact us anytime at info@ea-longtermrisk.org.
The post Descriptive Population Ethics and Its Relevance for Cause Prioritization appeared first on Center on Long-Term Risk.
Descriptive ethics is the empirical study of people's values and ethical views, e.g. via a survey or questionnaire. This overview focuses on beliefs about population ethics and exchange rates between goods (e.g. happiness) and bads (e.g. suffering). Two variables seem particularly important and action-guiding in this context, especially when trying to make informed choices about how to best shape the long-term future: 1) one’s normative bads-to-goods ratio (N-ratio) and 2) one’s expected goods-to-bads ratio (E-ratio). I elaborate on how a framework consisting of these two variables could inform our decision-making with respect to shaping the long-term future, as well as facilitate cooperation among differing value systems and further moral reflection. I then present concrete ideas for further research in this area and investigate associated challenges. The last section lists resources which discuss further methodological and theoretical issues that were beyond the scope of the present text.1
Recently, some debate has emerged on whether reducing extinction risk is the ideal course of action for shaping the long-term future. For instance, in the Global Priorities Institute (GPI) research agenda, Greaves & MacAskill (2017, p.13) ask “[...] whether it might be more important to ensure that future civilisation is good, assuming we don’t go extinct, than to ensure that future civilisation happens at all.” We could further ask to what extent we should focus our efforts on reducing risks of astronomical suffering (s-risks). Again, Greaves & MacAskill: “Should we be more concerned about avoiding the worst possible outcomes for the future than we are for ensuring the very best outcomes occur [...]?” Given the enormous stakes, these are arguably some of the most important questions facing those who prioritize shaping the long-term future.2
Some interventions increase both the quality of future civilization as well as its probability. Promoting international cooperation, for instance, likely reduces extinction risks as well as s-risks. However, it seems implausible for one single intervention to be optimally cost-effective at accomplishing both types of objectives at the same time. To the extent to which there is a tradeoff between different goals relating to shaping the long-term future, we should make a well-considered choice about how to prioritize among them.
I suggest that this choice can be informed by two important variables: One’s normative bads-to-goods ratio3 (N-ratio) and one’s empirically expected goods-to-bads ratio (E-ratio). Taken together, these variables can serve as a framework for choosing between different options to shape the long-term future.
(For utilitarians, N- and E-ratios amount to their normative / expected suffering-to-happiness ratios. But for most humans, there are bads besides suffering, e.g. injustice, and goods other than happiness, e.g. love, knowledge, or art. More on this below.)
I will elaborate in greater detail below on how to best interpret and measure these two ratios. For now, a few examples should suffice to illustrate the general concept. Someone with a high N-ratio of, say, 100:1 believes that reducing bads is one hundred times as important as increasing goods, whereas someone with an N-ratio of 1:1 thinks that increasing goods and reducing bads are equally important.4 Similarly, someone with an E-ratio of, say, 1000:1 thinks that there will be one thousand times as much good as bad in the future in expectation, whereas someone with a lower E-ratio is more pessimistic about the future.5
Note that I don’t assume an objective way to measure goods and bads, so a statement like “reducing suffering is x times more important than promoting happiness” is imprecise unless one further specifies what precisely is being compared. (See also the section "The measurability of happiness and suffering".)
In short, the more one's E-ratio exceeds one's N-ratio, the higher one’s expected value of the future, and the more one favors interventions that primarily reduce extinction risks.6 In contrast, the more one's N-ratio exceeds one's E-ratio, the more appealing become interventions that primarily reduce s-risks or otherwise improve the quality of the future without affecting its probability. The graphic below summarizes the discussion so far.
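As a rough, stylized way to make this relationship explicit (glossing over the measurement caveats above): write G and B for the expected amounts of goods and bads in the future, so that the E-ratio is G/B, and write N for the weight one places on a unit of bads relative to a unit of goods. The expected net value of the future is then roughly G − N·B = B·(E − N), which is positive exactly when the E-ratio exceeds the N-ratio. For instance, someone with an N-ratio of 100:1 who expects 1,000 units of good per unit of bad regards the future as clearly net positive, whereas the same person with an E-ratio of only 10:1 regards it as net negative.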
Of course, this reasoning is rather simplistic. In practice, considerations from comparative advantages, tractability, neglectedness, option value, moral trade, et cetera need to be factored in.7 See also Cause prioritization for downside-focused value systems for a more in-depth analysis.8
The rest of this section elaborates on the meaning of N-ratios and explains one approach of measuring or at least approximating them. In short, I propose to approximate an individual’s N-ratio by measuring their response tendencies to various ethical thought experiments (e.g. as part of a questionnaire or survey) and comparing them to those of other individuals. These questions could be of (roughly) the following kind:
Imagine you could create a new world inhabited by X humans living in a utopian civilization free of involuntary suffering, and where everyone is extremely kind, intelligent, and compassionate. In this world, however, there also exist 100 humans who experience extreme suffering.
What’s the smallest value of X for which you would want to create this world?
In short, people who respond with higher equivalence numbers X to such thought experiments should have higher N-ratios, on average.
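Loosely speaking, such answers can be translated into N-ratios: a respondent who answers X = 1,000,000 in the example above is saying that offsetting 100 lives of extreme suffering requires about a million utopian lives, i.e. roughly 10,000 such lives per life of extreme suffering, whereas a respondent who answers X = 200 trades them off at about 2:1. (This translation is only a rough heuristic since, as discussed below, answers depend heavily on the specific goods and bads depicted and on the framing of the question.)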
Some words of caution are in order here. First, the final formulations of such questions should obviously contain more detailed information and, for example, specify how the inhabitants of the utopian society live, what form of suffering the humans experience precisely, et cetera. (See also the document “Preliminary Formulations of Ethical Thought Experiments” which contains much longer formulations.)
Second, an individual’s equivalence number X will depend on what form of ethical dilemma is used and its precise wording. For example, asking people to make intrapersonal instead of interpersonal trade-offs, or writing “preserving” instead of “creating”, will likely influence the responses.
Third, subjects’ equivalence numbers will depend on which type of bad or good is depicted. Hedonistic utilitarians, for instance, regard pleasure as the single most important good and would place great value on, say, computer programs experiencing extremely blissful states. Many other value systems would consider such programs to be of no positive value whatsoever. Fortunately, many if not most value systems regard suffering9 as one of the most important bads and also place substantial positive value on flourishing societies inhabited by humans experiencing eudaimonia – i.e. “human flourishing” or happiness plus various other goods, such as virtue and friendship.10 In conclusion, although N-ratios (as well as E-ratios) are generally agent-relative, well-chosen “suffering-to-eudaimonia ratios” will likely allow for more meaningful and robust interindividual comparisons while still being sufficiently natural and informative. (See also the section "N-ratios and E-ratios are agent-relative" of the appendix for a further discussion of this issue.)
However, even if we limit our discussion to various forms of suffering and eudaimonia, judgments might diverge substantially. For example, Anna might only be willing to trade one minute of physical torture in exchange for many years of eudaimonia, while she would trade one week of depression for just one hour of eudaimonia. Others might make different or even opposite choices. If we had asked Anna only the first question, we could have concluded that her N-ratio is high, but her stance on the second question suggests that the picture is more complicated.
Consequently, one might say that even different forms of suffering and happiness/eudaimonia comprise “axiologically distinct” categories and that, instead of generic “suffering-to-eudaimonia ratios” – let alone “bads-to-goods ratios” – we need more fine-grained ratios, e.g. “suffering_typeY-to-eudaimonia_typeZ ratios”.11
See also “Towards a Systematic Framework for Descriptive (Population) Ethics” for a more extensive overview of the relevant dimensions along which ethical thought experiments can and should vary. “Descriptive Ethics – Methodology and Literature Review” provides an in-depth discussion of various methodological and theoretical questions, such as how to prevent anchoring or framing effects, control for scope insensitivity, increase internal consistency, and so on.
Do these considerations suggest that research in descriptive ethics is simply not feasible? This seems unlikely to me but it’s at least worth investigating further.
For illustration, imagine that a few hundred effective altruists completed a survey consisting of thirty different ethical thought experiments that vary along a certain number of dimensions, such as the form and intensity of suffering or happiness, its duration, or the number of beings involved.
We could now assign a percentile rank to every participant for each ethical thought experiment. If the concept of a general N-ratio is viable, we should observe that the percentile ranks of a given participant correlate across different dilemmas. That is, if someone gave very high equivalence numbers to the first, say, fifteen dilemmas, it should be more likely that this person also gave high equivalence numbers to the remaining dilemmas. Investigating whether there is such a correlation, how high it is, and how much it depends on the type or wording of each ethical thought experiment, could itself lead to interesting insights.
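A minimal sketch of such an analysis, using simulated responses in place of real survey data (all variable names and distributional choices below are illustrative assumptions, not survey results):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated responses: rows = participants, columns = dilemmas,
# values = equivalence numbers. A lognormal participant-level tendency
# stands in for a latent "general N-ratio"; noise makes dilemmas differ.
n_participants, n_dilemmas = 300, 30
tendency = rng.lognormal(mean=5.0, sigma=2.0, size=n_participants)
noise = rng.lognormal(mean=0.0, sigma=1.0, size=(n_participants, n_dilemmas))
responses = pd.DataFrame(tendency[:, None] * noise,
                         columns=[f"dilemma_{i + 1}" for i in range(n_dilemmas)])

# Percentile rank of each participant within each dilemma.
percentile_ranks = responses.rank(pct=True)
print(percentile_ranks.round(2).head())

# Consistency check: average pairwise rank (Spearman) correlation between dilemmas.
corr = responses.corr(method="spearman").to_numpy()
mean_corr = corr[np.triu_indices_from(corr, k=1)].mean()
print(f"mean inter-dilemma rank correlation: {mean_corr:.2f}")
```

A high average correlation would suggest that a single underlying N-ratio captures much of the variation in responses; a low one would suggest that more fine-grained ratios are needed.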
Important and action-guiding conclusions could be inferred from such a survey, both on an individual and on a group level.
First, consider the individual level. Imagine a participant answered with “infinite” in twenty dilemmas. Further assume that the average equivalence number of this participant in the remaining ten dilemmas was also extremely high, say, one trillion. Unless this person has an unreasonably high E-ratio (i.e. is unreasonably optimistic about the future), this person should, ceteris paribus, prioritize interventions that reduce s-risks over, say, interventions that primarily reduce risks of extinction but which might also increase s-risks (such as, perhaps, building disaster shelters12); especially so if they learn that most respondents with lower average equivalence numbers do the same.13
Second, let’s turn to the group level. It could be very useful to know how equivalence numbers among effective altruists are distributed. For example, central tendencies such as the median or average equivalence number could inform allocation decisions within the effective altruism movement as a whole. They could also serve as a starting point for finding compromise solutions or moral trades between varying groups within the EA movement – e.g. between groups with more upside-focused value systems and those with more downside-focused value systems. Lastly, engaging with the actual thought experiments of the survey, as well as its results and potential implications, could increase the moral reflection and sophistication of the participants, allowing them to make decisions more in line with their idealized preferences.
Readers unfamiliar with the idea of multiverse-wide superrationality (MSR) are strongly encouraged to first read the paper “Multiverse-wide Cooperation via Correlated Decision Making” (Oesterheld, 2017) or the post “Multiverse-wide cooperation in a nutshell”. Readers unconvinced by or uninterested in MSR are welcome to skip this section.
To briefly summarize, MSR is the idea that by taking into account the values of superrationalists located elsewhere in the multiverse, it becomes more likely that they do the same for us. In order for MSR to work, it is essential to have at least some knowledge about how the values of superrationalists elsewhere in the multiverse are distributed. Surveying the values of (superrational) humans14 is one promising way of gaining such knowledge.15
Obtaining a better estimate of the average N-ratio of superrationalists in the multiverse seems especially action-guiding. For illustration, imagine we knew that most superrationalists in the multiverse have a very high N-ratio. All else equal and ignoring considerations from neglectedness, tractability, etc., this implies that superrationalists elsewhere in the multiverse would probably want us to prioritize the reduction of s-risks over the reduction of extinction risks.16 In contrast, if we knew that the average N-ratio among superrationalists in the multiverse is very low, reducing extinction risks would become more promising.
Another important question is to what extent and in what respects superrationalists discriminate between their native species and the species located elsewhere in the multiverse.17
Another challenge facing research in descriptive ethics is that at least some answers are likely to be driven by more or less superficial system 1 heuristics generating a variety of biases – e.g. empathy gap, duration neglect, scope insensitivity, and framing effects, to name just a few. While there are ways to facilitate the engagement of more controlled cognitive processes18 that make reflective judgments more likely, not every possible bias or confounder can be eliminated.
All in all, the skeptic has a point when she distrusts the results of such surveys because she assumes that most subjects merely pulled their equivalence numbers out of thin air. Ultimately, however, I think that reflecting on various ethical thought experiments in a systematic fashion, pulling equivalence numbers out of thin air and then using these numbers to make more informed decisions about how to best shape the long-term future is often better – in the sense of dragging in fewer biases and distorting intuitions – than pulling one’s entire decision out of thin air.19
A further problem is that the N-ratios of many subjects will likely fluctuate over the course of years or even weeks.20 Nonetheless, knowing one’s N-ratios will be informative and potentially action-guiding for some subjects – e.g. for those who already engaged in substantial amounts of moral reflection (and are thus likely to have more stable N-ratios), or for subjects who have particularly high N-ratios such that their priorities would only shift if their N-ratios changed dramatically. Studying the stability of N-ratios is also an interesting research project in itself. (See also the section “moral uncertainty” of another document for more notes on this topic.)
The Google Docs listed below discuss further methodological, practical, and theoretical questions which were beyond the scope of the present text. As I might deprioritize the project for several months, I decided to publish my thinking at its current stage to enable others to access it in the meantime.
1) Descriptive Ethics – Methodology and Literature Review.
This document is motivated by the question of what we can learn from the existing literature – particularly in health economics and experimental philosophy – on how to best elicit normative ratios. It also contains a lengthy critique of the two most relevant academic studies about population ethical views and examines how to best measure and control for various biases (such as scope insensitivity, framing effects, and so on).
2) Towards a Systematic Framework for Descriptive (Population) Ethics.
This document develops a systematic framework for descriptive ethics and provides a classification of dimensions along which ethical thought experiments can (and should) vary.
3) Preliminary Formulations of Ethical Thought Experiments.
This document contains preliminary formulations of ethical thought experiments. Note that the formulations are designed such that they can be presented to the general population and might be suboptimal for effective altruists.
4) Descriptive ethics – Ordinal Questions (incl. MSR) & Psychological Measures.
This document discusses the usefulness of existing psychological instruments (such as the Moral Foundations Questionnaire, the Cognitive Reflection Test, etc.). The document also includes tentative suggestions for how to assess other constructs such as moral reflection, happiness, and so on.
Please note that the above documents are a work in progress, so I ask you to understand that much of the material hasn't been polished and, in some cases, does not even accurately reflect my most recent thinking. This also means that there is a significant opportunity for collaborators to contribute their own ideas rather than to just execute an already settled plan. In any case, comments in the Google documents are highly appreciated, whether you're interested in becoming more involved in the project or not.
I want to thank Max Daniel, Caspar Oesterheld, Johannes Treutlein, Tobias Pulver, Jonas Vollmer, Tobias Baumann, Lucius Caviola, and Lukas Gloor for their extremely valuable inputs and comments. Thanks also to Nate Liu, Simon Knutsson, Brian Tomasik, Adrian Rorheim, Jan Brauner, Ewelina Tur, Jennifer Waldmann, and Ruairi Donnelly for their comments.
Assuming moral anti-realism is true, there are no universal or “objective” goods and bads. Consequently, if we want to avoid confusion, E-ratios and N-ratios should ultimately refer to the values of a specific agent, or, to be more precise, a specific set of goods and bads.
For illustration, consider two hypothetical agents: Agent_1 has an N-ratio of 1:1 and an E-ratio of 1000:1, while agent_2 has an N-ratio of 1:1 and an E-ratio of 1:10. Do these agents share similar values but have radically different conceptions about how the future will likely unfold? Not necessarily. Agent_1 might be a total hedonistic utilitarian and agent_2 an AI that wants to maximize paperclips and minimize spam emails. Both might agree that the future will, in expectation, contain 1000 times as much pleasure as suffering but 10 times as many spam emails as paperclips.
Of course, the sets of bads and goods of humans will often overlap, at least to some extent. Consequently, if we learn that human_1 has a much lower E-ratio than human_2, this tells us that human_1 is probably more pessimistic than human_2 and that both likely disagree about how the future is going to unfold.
In this context, it also seems worth noting that there might be more overlap with regards to bads than with regards to goods. For illustration, consider the number of macroscopically distinct futures whose net value is extremely negative according to at least 99.5% of all humans. It seems plausible that this number is (much) greater than the number of macroscopically distinct futures whose net value is extremely positive according to at least 99.5% of all humans. In fact, those of us who are more pessimistic about the prospect of wide agreement on values might worry that the latter number is (close to) zero, especially if one doesn’t allow for long periods of moral reflection.
In my view, there are no “objective” units of happiness or suffering. Thus, it can be misleading to talk about the absolute magnitude of N-ratios without specifying the concrete instantiations of bads and goods that were traded against each other.
For more details on the measurability of happiness and suffering (or lack thereof), I highly recommend the essays “Measuring Happiness and Suffering” and “What Is the Difference Between Weak Negative and Non-Negative Ethical Views?” by Simon Knutsson, especially this section and the description of Brian Tomasik’s views whose approach I share.
The post Descriptive Population Ethics and Its Relevance for Cause Prioritization appeared first on Center on Long-Term Risk.
The post A framework for thinking about AI timescales appeared first on Center on Long-Term Risk.
To steer the development of powerful AI in beneficial directions, we need an accurate understanding of how the transition to a world with powerful AI systems will unfold. A key question is how long such a transition (or “takeoff”) will take. This has been discussed at length, for instance in the AI foom debate.
In this post, I will attempt to clarify what we mean more precisely when we talk of “soft” or “hard” takeoffs.
(Disclaimer: Probably most of the following ideas have already been mentioned somewhere in some form, so my claimed contribution is just to collect them in one place.)
The obvious question is: what reference points do we use to define the beginning and the end of the transition to powerful AI? Ideally, the reference points should be applicable to a wide range of plausible AI scenarios rather than making tacit assumptions about what powerful AI will look like.
A commonly used reference point for the beginning is the attainment of “human-level” general intelligence (also called AGI, artificial general intelligence), which is defined as the ability to successfully perform any intellectual task that a human is capable of. The reference point for the end of the transition is the attainment of superintelligence – being vastly superior to humans at any intellectual task – and the “decisive strategic advantage” (DSA) that ensues.1 The question, then, is how long it takes to get from human-level intelligence to superintelligence.
I find this definition problematic. The framing suggests that there will be a point in time when machine intelligence can meaningfully be called “human-level”. But I expect artificial intelligence to differ radically from human intelligence in many ways. In particular, the distribution of strengths and weaknesses over different domains or different types of reasoning is and will likely be different2 – just as machines are currently superhuman at chess and Go, but tend to lack “common sense”. AI systems may also diverge from biological minds in terms of speed, communication bandwidth, reliability, the possibility to create arbitrary numbers of copies, and entanglement with existing systems.
Unless we have reason to expect a much higher degree of convergence between human and artificial intelligence in the future, this implies that at the point where AI systems are at least on par with humans at any intellectual task, they actually vastly surpass humans in most domains (and have just fixed their worst weakness). So, in this view, “human-level AI” marks the end of the transition to powerful AI rather than its beginning.
As an alternative, I suggest that we consider the fraction of global economic activity that can be attributed to (autonomous) AI systems.3 Now, we can use reference points of the form “AI systems contribute X% of the global economy”. (We could also look at the fraction of resources that’s controlled by AI, but I think this is sufficiently similar to collapse both into a single dimension. There’s always a tradeoff between precision and simplicity in how we think about AI scenarios.)
A useful definition for the duration of the transition to powerful AI could be how long it takes to go from, say, 10% of the economy to 90%. If we wish to measure the acceleration more directly, we could also ask how much less time it will take to get from 50% to 90% compared to 10%-50%. (The point in time where AI systems contribute 50% to the economy could even be used as a definition for “human-level” AI or AGI, though this usage is nonstandard.)
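To make this concrete, here is a minimal sketch of how one might read off the takeoff duration from a (hypothetical) time series of the AI share of global economic output; the function name and the data points are invented for illustration:

```python
# Hypothetical sketch: measuring takeoff duration as the time it takes the AI
# share of global economic output to go from 10% to 90%. The data is invented.

def first_year_reaching(share_by_year, threshold):
    """Return the first year in which the AI share reaches the given threshold."""
    for year, share in sorted(share_by_year.items()):
        if share >= threshold:
            return year
    return None

# Toy data: year -> fraction of global economic output attributable to AI systems.
ai_share = {2040: 0.05, 2045: 0.10, 2050: 0.30, 2055: 0.50, 2060: 0.75, 2065: 0.90}

start = first_year_reaching(ai_share, 0.10)  # beginning of the transition
end = first_year_reaching(ai_share, 0.90)    # end of the transition
print(f"Takeoff duration (physical time): {end - start} years")  # 20 years
```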
I think this definition broadly captures what we intuitively mean when we talk about “powerful”, “advanced”, or “transformative” AI: we mean AI that is capable enough to displace4 humans at a large range of economically vital tasks.5
A key advantage of this definition is that it makes sense in many possible AI scenarios – regardless of whether we imagine powerful AI as a localized, discrete entity or as a distributed process with gradually increasing impacts.
By default, we tend to think about “how long it takes” in terms of physical time, that is, what a wall clock measures. But perhaps physical time is not the most relevant or useful metric in this context. It is conceivable that the world generally moves, say, 20 times faster when the transition to powerful AI happens – e.g. because of whole brain emulation or strong forms of biological enhancement. In this case, the transition takes much less physical time, but the quantity of interest is some notion of “how much stuff is happening during the transition”, not the number of revolutions on grandma’s wall clock.
A natural alternative is economic time6, which adjusts for the overall rate of economic progress and innovation. Currently, the global economy doubles every ~20 years, so a year of economic time corresponds to a growth of ~3.5%. The question, then, is how many economic doublings will happen during the transition to powerful AI. Saying that it will take 40 years of economic time would mean that the global economy quadruples during the transition (which might take much less than 40 years of physical time).7
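As a rough illustration of this arithmetic, here is a toy sketch that takes the ~20-year doubling time above as the baseline; the constant and function names are mine:

```python
# Toy sketch of "economic time", using a ~20-year economic doubling time as the
# current baseline (the figure used in the text). Names are made up.
import math

BASELINE_DOUBLING_TIME = 20.0  # years of physical time per economic doubling today

# One year of economic time at this baseline corresponds to ~3.5% growth:
annual_growth = 2 ** (1 / BASELINE_DOUBLING_TIME) - 1
print(f"Growth per year of economic time: {annual_growth:.1%}")  # ~3.5%

def years_of_economic_time(total_growth_factor):
    """Economic time corresponding to a given overall growth of the economy."""
    doublings = math.log2(total_growth_factor)
    return doublings * BASELINE_DOUBLING_TIME

# 40 years of economic time means the economy quadruples during the transition,
# however little physical time that quadrupling actually takes.
print(years_of_economic_time(4))  # 40.0
```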
Another alternative – which I will call political time – is to adjust for the rate of social change. So, saying that the transition takes 10 years of political time would mean that there will be ten times as much change in the relative power of individuals, institutions, and societal values compared to what happens in an average year these days.8
(It is possible that the economic growth rate is a good approximation of the rate of political change, rendering these notions of time equivalent. But this isn’t obvious to me – in particular, the rate of political change might lag if there’s a disruptive acceleration in economic growth. Conversely, economic collapse may cause a lot of political change.)
Given this, we can now talk about slow vs. fast takeoffs in physical, economic, or political time. The takeoff may be slow along some axes, but fast along others. Specifically, I find it plausible that a takeoff would be quite fast in physical time, but much slower in terms of economic or political time.
We can use a similar terminology for AI timelines by asking how much physical, economic, or political time will pass...
These two questions are equivalent for people who assume that the transition to powerful AI will likely result in a steady state (a singleton with a decisive strategic advantage). But this is an implicit assumption, and in my opinion it’s also quite possible that there would be centuries (or more) of time – especially in terms of economic or political time – between the advent of powerful AI and the formation of a steady state.
I am indebted to Brian Tomasik, Lukas Gloor, Max Daniel and Magnus Vinding for valuable comments on the first draft of this text.
The post A framework for thinking about AI timescales appeared first on Center on Long-Term Risk.
The post Commenting on MSR, Part 2: Cooperation heuristics appeared first on Center on Long-Term Risk.
This post was originally written for internal discussions only; it is half-baked and unpolished. The post assumes familiarity with the ideas discussed in Caspar Oesterheld’s paper Multiverse-wide cooperation via coordinated decision-making. I wrote a short introduction to multiverse-wide cooperation in an earlier post (but I still recommend reading parts of Caspar’s original paper, or this more advanced introduction, because several of the points that follow below build on topics not covered in my introduction). With that out of the way: In this post, I will comment on what I think might be interesting aspects of multiverse-wide cooperation via superrationality (abbreviation: MSR) and what I think might be its practical implications – if the idea works at all. I will focus particularly on aspects where I place more emphasis on certain considerations than Caspar does in his paper, though most of the issues I discuss are already noted by Caspar. A major theme in my comments will be exploring how the multiverse-wide compromise changes shape once we go from a formal, idealized conception of how to think about it to real-world policy suggestions for humans. For the perhaps most interesting part of the post, skip to the section "How to trade, practically."
(Epistemic status: Highly speculative. I am outlining practical implications not because I am convinced that they are what we should do, but as an exercise in what I think would follow given certain assumptions.)
Under idealized conditions, each MSR participant would attempt to follow the same multiverse-wide compromise utility function (MCUF) reflecting the distribution of values amongst all superrational cooperators. In practice, trying to formalize a detailed, probabilistic model of what the MCUF would look like, and consulting it for every decision, is much too noisy and effortful. A more practical strategy for implementing MSR is to come up with heuristics that approximate the MCUF reasonably well for the purposes needed. Let’s call these cooperation heuristics (CHs). A simple example of a heuristic might be “Perform actions that benefit the value systems of other superrational cooperators considerably if you can do so at low cost, and refrain from hurting these value systems if you would only expect comparatively small gains from doing so.” This example heuristic is easy to follow and unlikely to go wrong. In fact, aside from the part about superrational cooperators, it sounds like a great decision rule even for people who do not buy into MSR but are interested in low-effort ways of cooperating with other people for all the "normal" reasons (the many reasons in favor of cooperation that do not involve aliens). The primary caveat about this particular CH is that it is very vague, and that the gains from trade it produces if everyone were to follow it are far from maximal. MSR is attractive because it may make it possible for us to give other value systems even more weight through further-reaching CHs, and thereby, in expectation, get more gains from trade back in return.
In the standard prisoner’s dilemma, the participants have symmetrical information and payoff structures. MSR can be viewed as a prisoner's-dilemma-like decision problem, but one that is a lot more messy than any traditional formulation of a prisoner’s dilemma. In MSR, different instantiations of superrational cooperators find themselves with information asymmetries, different goals, different biases and different resources at their disposal. Consider this non-exhaustive list of examples for potentially asymmetric features between MSR participants:
Asymmetries amongst potential MSR participants call into question whether we can indeed assume that our potential cooperators are finding themselves in decision situations sufficiently similar to the ones we find ourselves in. To recap: We are (for the sake of entertaining the argument) assuming that MSR works when two agents operate on highly similar decision algorithms and find themselves in highly similar decision situations. Under these conditions, certain approaches to decision theory, which I’m for the sake of simplicity referring to with the umbrella term superrationality, recommend reasoning as though the decision algorithms of the agents in question are logically entangled and are going to output the same decisions. Asymmetric features amongst potential MSR participants now make it non-obvious whether we can still talk of the decision situations different agents find themselves in as being “relevantly similar,” or whether they break the similarity because the conclusions that participants will come to, for instance with regard to whether to incorporate MSR into their behavior or not, might be affected by these asymmetries.
So does MSR break down because decision situations are never relevantly similar? I think the answer depends on the level of abstraction at which the agents are looking at a decision, i.e., how they come to construe their decision problem. We can assume that agents interested in MSR have an incentive to pick whichever general and collectively followed process for selecting cooperation heuristics produces the largest gains from trade.
Because the correct priorities for MSR participants may in many cases depend on the exact nature and distribution of asymmetric features, we can expect that on the level of concrete execution (but not at the level of the more general decision problem) “implementing MSR” could look fairly different for different agents. Even though everyone would try to use cooperation heuristics that produce optimal benefits, individual cooperators’ cooperation heuristics would recommend different types of actions depending on whether an agent finds themselves with one type of comparative advantage or another.
To illustrate this point, consider agents who expect the highest returns from MSR from a focus on convergent priorities where they would work on interventions that are positive for their own value systems, but (absent MSR considerations) not maximally positive. Selecting interventions this way produces a compromise cluster where a few different value systems mutually benefit each other through a focus on some shared priority. Value systems A, B, C and D may, for instance, have a shared priority x and value systems E, F, G and H may share priority y. (By “priority,” I mean an intervention such as “reducing existential risk” or “promoting consequentialist thinking about morality.”) By focusing on a priority that one’s own value system shares with other value systems, one only benefits a subset of all value systems directly (only the ones one shares such convergent priorities with). However, through the general mechanism of doing something because it is informed by non-causal effects of one’s decision algorithm, we in theory should now also expect there to be increased coordination between value systems that form a cooperation cluster around their convergent priorities.
Similarly, following a cooperation heuristic of mostly cooperating with those value systems we know best (e.g. only with value systems we are directly familiar with) makes it more likely that civilizations full of completely alien value systems will also only cooperate with the value systems they already know (which sounds reasonable and efficient).
Focusing on convergent priorities is of course not the only strategy for generating gains from trade. Whether MSR participants should spend all their resources on convergent priorities, or whether they should rather work on (other) comparative advantages they may have at greatly benefitting particular value systems, depends on the specifics of the empirical circumstances that the agents find themselves in. The tricky part about focusing on comparative advantages rather than (just) convergent priorities is that it might be one’s comparative advantage to do something that is neutral or negative according to one’s value system, or at the very least has high opportunity costs. In such a case, one needs to be particularly confident that MSR works well enough, and that one’s CH is chosen/executed diligently enough, to generate sufficient benefits.
In practice, the whole picture of who benefits whom quickly becomes extremely complicated. A map of how different agents in MSR benefit each others’ value systems would likely contain all of the following features:
It is apparent that things would start to look confusing pretty quickly, and it seems legitimate to question whether humans can get the details of this picture right enough to reap any gains from trade at all (as opposed to shooting oneself in the foot). On the other hand, the CHs behind how each agent of a given value system should select their priorities could be kept simple. This can work if one picks cooperation heuristics in such a way that, assuming they are universally applied, they maximize the gains from trade for all value systems. (If done properly and if the assumptions behind MSR are correct, this then corresponds to maximizing the gains from trade for one’s own value system.)
For figuring out what MSR implies for humans, I think it is important to think in terms of agents of limited intelligence and rationality executing practical CHs, as opposed to ideal reasoners computing a maximally detailed MCUF for all decisions. Using heuristics means accepting a tradeoff between accuracy loss and practicality concerns. Accuracy in following the MCUF is important, because whether rare or hard-to-benefit value systems actually benefit from a cooperation heuristic depends on whether the heuristic is sensitive enough to notice situations where one’s comparative advantage is indeed to benefit these rare value systems. This makes it challenging to find simple heuristics that nevertheless react well to the ways in which all the features in decision situations can vary.
For these reasons, I recommend being careful with talk such as the following:
“Intervention X [insert: global warming reduction, existential risk reduction, AI safety, etc] is good from a MSR perspective.”
To be clear, there is a sense in which this way of talking can be perfectly reasonable. From the perspective of the MCUF, majority-favored interventions indeed receive a boost in how valuable they should be regarded as being, as compared to their evaluation from any single value system. However, this picture risks missing out on important nuances.
For instance, interventions that benefit value systems that are unusually hard to benefit must also receive a boost if the MCUF incorporates variance normalization (see chapter 3 here for an introduction). Using variance normalization means, roughly, that one looks at the variance of how much value or disvalue is commonly at stake for each value system and compensates for certain value systems being hard to benefit. If a value system is (for whatever structural reasons) particularly difficult to benefit, then for any of the rare instances where one is able to actually benefit said value system a great deal, doing so becomes especially important and one wants the MCUF and any CHs to be such that they recommend actually pursuing these rare instances.
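As a toy illustration of the general idea (not necessarily the exact scheme described in the linked chapter), one can picture rescaling each value system’s utilities over the available options to zero mean and unit variance before summing them; the value systems and numbers below are invented:

```python
# Toy illustration of variance normalization: each value system's utilities over
# the available options are rescaled to mean 0 and variance 1 before being summed,
# so that a hard-to-benefit value system is not simply drowned out.
import statistics

def normalize(utilities):
    """Rescale a list of utilities to mean 0 and standard deviation 1."""
    mean = statistics.mean(utilities)
    sd = statistics.pstdev(utilities)
    return [(u - mean) / sd for u in utilities]

options = ["option_1", "option_2", "option_3"]
system_a = [0.0, 90.0, 100.0]   # easy to benefit: large utility differences
system_b = [0.0, 0.2, 0.05]     # hard to benefit: tiny utility differences

raw = [a + b for a, b in zip(system_a, system_b)]
print(options[raw.index(max(raw))])  # option_3 -- system A's large numbers dominate

normed = [a + b for a, b in zip(normalize(system_a), normalize(system_b))]
print(options[normed.index(max(normed))])  # option_2 -- system B's preference now counts
```

Without the rescaling, whichever value system happens to have the larger raw numbers decides everything; with it, the rare opportunities to help a hard-to-benefit value system can tip the decision.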
These considerations paint a complicated picture. The worry with phrases like “intervention x is positive for MSR” is that it may tempt us to overlook that sometimes pursuing these interventions is heavily suboptimal if the MSR-participating agent actually has a strong comparative advantage for benefitting a value system that is normally unusually hard to benefit. When someone hears “intervention x is positive for MSR,” they may do more of intervention x without ever checking what other interventions are positive too, and potentially more positive for their given situation. As soon as people start to take shortcuts, there is a danger that these shortcuts will disproportionately and predictably benefit some value systems and neglect others. (We can think of shortcuts as cooperation heuristics produced by a dangerously low amount of careful thinking.)
The crucial theme here is that even if everyone always does things that are "positive according to the MCUF," if people often fail to do what is maximally positive, then it is possible for some value systems to predictably lose a lot of value in expectation or even suffer expected harm overall. Therefore, this cannot be how we should in practice implement MSR. The variance-voting or “equal gains from trade” MCUF – which I describe in more detail in the section “How to trade, ideally” – is set up such that if everyone tries to maximize it to the best of their ability, then it distributes gains equally. There is no guarantee that if everyone just picks random stuff that is positive according to this MCUF, this will be good for everyone. Cooperation heuristics have to be selected with the same principle in mind: We want to pick CHs which ensure equal gains from trade provided that everyone follows them diligently to the best of their ability.
All of this suggests that whether hearing an intervention being performed somewhere in the multiverse is “positive news” in expectation or not for a given MSR participant is actually not only a feature of that intervention itself, but also of whether the cooperation heuristic behind the decision was chosen and executed wisely. That is, it can in practice depend on things such as whether the agent who performed the intervention in question had a sufficiently large comparative advantage for it, or whether the intervention was chosen correctly for reasons of convergent priorities. With this in mind, it might in many contexts be epistemically safer to talk about CHs rather than concrete interventions being what is (without caveats) “positive from an MSR perspective.”
Asymmetries amongst MSR participants and the issue with choosing CHs in a way that distributes the gains from trade equally make it tricky to pick cooperation heuristics wisely. One failure mode I am particularly concerned about is the following:
Superrationalizing: When the CH you think you follow is different from the CH that actually guides your behavior.
For instance, you might think your CH produces the largest expected gains given practical concerns, but, unbeknownst to you, you only chose it the way you did because of asymmetric features, so that, if it were followed universally, it would give you a disproportionate benefit. Others, whom you thought would arrive at the same CH, will then adopt a different CH than the one you think you are following (perhaps biased in favor of their own benefit). You therefore lose out on the gains from trade you thought your CH would produce.
Relatedly, another manifestation of superrationalizing is that one might think one is following a CH that produces large gains from trade for one’s value system, but if the de facto execution of the CH one thinks one is following is sloppy, one has no reason to assume that the predicted gains from trade would actually materialize.
For better illustration, I am going to list some examples for different kinds of superrationalizing in a more concrete context. For this purpose, let me first introduce two hypothetical value systems held by MSR participants: Straightforwardism and Complicatedism.
Straightforwardists have practical priorities that are largely shared by the majority of value systems interested in MSR. Proponents of Complicatedism on the other hand are not excited about the canon of majority-favored interventions.
For an example of superrationalizing, let us assume that the Straightforwardists pick their CH according to the following, implicit reasoning: “When MSR participants reason very crudely about the MCUF and only draw the most salient implications with a very simple CH, such as looking for things that benefit many other value systems, this will be greatly beneficial for us. Therefore, we do not have to think too much about the specifics of the MCUF and can just focus on what is beneficial for many value systems including ours.”
By contrast, proponents of Complicatedism may worry about getting skipped in the compromise if people only perform the most salient, majority-favored interventions. So they might adopt a policy of paying extra careful attention to value systems never getting harmed by MSR in expectation, and therefore focus their own efforts disproportionately on benefitting the value system Supercomplicatedism, which only has few proponents and whose prioritization is very difficult to take into account.
Of course, MSR does not work that way, and the proponents of the two value systems above are making mistakes by, perhaps unconsciously/unthinkingly, assuming that other MSR participants will be affected symmetrically by features that are specific to only their own situation. The mistake is that if one pays extra careful attention to value systems never getting harmed by MSR because one’s own value system is in a minority that seems more at risk than the average value system, then the reasoning process at work is not “No matter the circumstances, be extra careful about value systems getting harmed.” Instead, the proper description of what is going on then would be that one unfairly privileges features that are only important for one’s own value system. To put it differently, if proponents of Straightforwardism think “I allow myself to reason crudely about MSR partners, therefore other agents are likely to think crudely about it, too – which is good for me!” they are failing to see that the reason they were tempted to think crudely is not a reason that is shared by all other compromise participants.
In order to maximize the gains from trade, proponents of both value systems, Straightforwardism and Complicatedism, have to make sure that they use a decision procedure that, in expectation, benefits both value systems equally much (weighted in proportion to how prevalent and powerful the proponents are). Straightforwardists have reason to pick a CH that also helps Complicatedists sufficiently much, and Complicatedists are incentivized to not be overly cautious and risk averse. If implemented properly, asymmetries between potential MSR participants cannot be used to gain an unfair advantage. (But maybe it is simply extremely complicated to implement MSR cooperation heuristics properly.)
For a slightly different example of superrationalizing, consider a case where Complicatedists naively place too much faith into the diligence of the Straightforwardists. They may reason as follows:
“The majority of compromise participants benefit from intervention Z. Even though intervention Z is slightly negative or at best neutral for my own values, I should perform intervention Z. This is because if I am diligent enough to support Z for the common good, as it seems best for a majority of compromise participants and therefore an obvious low-hanging fruit for doing my part in maximizing the MCUF, other agents will also be diligent in the way they implement MSR. Others being diligent then implies that whichever agents are in the best position to reward my own value system will indeed do so.”
This reasoning is sound in theory. But it is also risky. Whether the Complicatedists reap gains from trade, or whether the decision procedure they actually follow (as opposed to the decision procedure they think they follow) implies that they are shooting themselves in the foot, depends on their own level of diligence in picking their MSR implications. The Complicatedists have to, through the level of diligence in the CH they de facto follow, ensure that the agents who are in fact in the best position to help Complicatedism will be diligent enough to notice this and therefore act accordingly.
It seems to me that, if the Complicatedists put all their resources into intervention Z and never spend attention researching whether they themselves might be in a particularly good position to help rare value systems or value systems whose prioritization is particularly complicated, then the reasoning process they are de facto following is itself not as diligent as they require their superrational cooperators to be. If even the Complicatedists (who themselves do not benefit from the majority-favored interventions) end up working on the majority-favored interventions because they seem like the easiest and most salient thing to pick out, why would one expect agents who actually benefit from this “low-hanging fruit” to ever work on anything else? The Complicatedists (and everyone else for that matter, at least in order to maximize the gains from trade that MSR can provide) have to make sure that they work on majority-favored interventions if and only if it is actually their multiverse-wide comparative advantage. This may be difficult to ensure, because one has to expect that people often rationalize, especially when majority-favored interventions tend to be associated with high status, or tend to draw in Complicatedists high in agreeableness who are bothered by lack of convergence in people’s prioritization.
In order to allocate personal comparative advantages in a way that reliably produces the greatest gains from trade, one has to find the right mix between exploration and exploitation. It is plausible that MSR participants should often focus on majority-favored interventions, because after all, the fact that they are majority-favored means that they make up a large portion of the MCUF. But next to that, everyone should be on the lookout for special opportunities to benefit value systems with idiosyncratic priorities. This should happen especially often for value systems that are well-represented in the MCUF, but perhaps one should also make use of randomization procedures to sometimes spend time exploring the prioritization of comparatively rare value systems (see also the proposal in “How to trade, practically”).
Randomization procedures of course also come with a danger of superrationalizing. It can be difficult to properly commit to doing something that may cost social capital or is difficult to follow through with for other reasons. Illusory low-probability commitments that one would not actually follow through on if the die comes up “6” five times in a row weaken or even destroy the gains from trade one in expectation receives from this aspect of MSR. Proper introspection and very high levels of commitment to one's chosen CH become important for not shooting oneself in the foot when attempting to get MSR implications right.
An intuition I got from writing this section is that it tentatively seems to me that cooperation heuristics that exploit convergent priorities, in particular when the resulting intervention benefits one’s own value system, are less risky (in the sense of it being harder to mess things up through superrationalizing) than trades based on comparative advantages. The overall gains from trade one can achieve with such (arguably) risk-averse cooperation heuristics are certainly not maximal, but if one is sufficiently pessimistic about getting things right otherwise, then they may be the overall better deal. This is bad news for value systems that don't benefit as much from trades focused on convergent priorities.
Having said that, it seems to me also that exploiting comparative advantages can produce particularly large gains from trade, and that getting things right enough might be within what we can expect careful reasoners to manage. While it would seem incredibly intractable to attempt to estimate one’s comparative advantage at benefitting a particular value system when compared to agents in unknown parts of the multiverse, what looks fairly tractable by contrast (and is similarly impactful overall) is evaluating one’s comparative advantage compared to other people on earth. Following a CH that models comparative advantages among people on earth would be a pretty good start and likely better than a status quo of not considering comparative advantages at all.
Which value systems in particular MSR participants should benefit depends on their situations and especially their comparative advantages. In the last section of my introduction to MSR, I advocated for the principle that we should largely limit our cooperation heuristics to considering value systems we know well.
One might be tempted to assume that this would give suboptimal results, as limiting how inclusive one is with benefitting value systems different from one’s own determines how many value systems will be incentivized to join our compromise in total. So perhaps low inclusivity (in the sense of not speculating about the values of aliens with different value systems from us) in this way means that one’s decisions now only influence a smaller number (or lower measure given that we might be dealing with infinities) of agents in the multiverse. However, it is important to note that MSR never manages to bring other agents to follow one’s own priorities exclusively; it only grants you a proportionate share of the attention and resources of some other agents. The more types of compromise participants are added, the smaller said share of extra attention one receives per participant. (Consider: If I have to think about what my comparative advantage is amongst three value systems, that takes less time than figuring out one’s comparative advantage amongst three hundred value systems.) This means that there is no overriding incentive to choose maximally inclusive cooperation heuristics, i.e. ones that in expectation benefit maximally many value systems of superrationalists in the multiverse.
Note that this also implies that one cannot make a strong wager in favor of MSR of the sort that, if MSR works, our decisions have a vastly wider scope than if it does not work.1 While it is true, strictly speaking, that our decisions have a “wider scope” if MSR works, this is counterbalanced by us having to devote attention to more value systems in order to make it work. MSR’s gains from trade do not come from the large total numbers of participants, but from exploiting convergent priorities and comparative advantages. So while it is not important to consider maximally many plausible value systems in one’s compromise, it is important that we do include whichever value systems we expect large gains from trade from (as this superrationally ensures that others follow similarly high-impact cooperation heuristics).
If one had infinite computing power and could at any point download and execute the precise implications of an ideal MCUF containing all agents interested in MSR, then a maximally inclusive compromise would give the highest gains from trade, because for every ever-so-specific situation, the ideal MCUF would find exactly the best way of ensuring equal gains from trade for all participants. However, given that thinking about the prioritization of other value systems (especially obscure ones that only make up a tiny portion of the MCUF) comes with a cost, it may not be worthwhile to invest resources into ever-more-sophisticated CHs solely with the goal of making sure that we do not forget value systems we could in theory benefit. This reasoning supports the intuition that the best way to draw implications from MSR is by cooperating with proponents of value systems that one already causally interacts with, because these are the value systems we likely know best and are therefore in a particularly good position to benefit. Direct acquaintance is a comparative advantage!
(Epistemic status, update: This section is badly structured and probably confused at least in parts. Also it won’t be relevant to the sections below, so feel free to skip this.)
So far, I have been assuming that agents only follow cooperation heuristics that, at the stage of execution, the agent believes will generate positive utility according to their own value system. This sounds like a reasonable assumption, but there is a case to be made for exceptions to it. This concerns updateless versions of compromise.
Suppose I am eating dinner with my brother and we have to agree on a fair way of dividing one pizza. Ordinarily, the fair way to divide the pizza is to give each person one half. However, suppose I like pizza a lot more than my brother does, and that I am also much more hungry. Here, we might have the intuition that, whether person A or person B likes the pizza in question more, or is more hungry on that specific occasion, was a matter of chance that could just as well have gone one way or the other. Sure, one brother was born with genes that favor the taste of pizza more (or experienced things in life that led him to develop such a taste), but there is a sense in which it could also have gone the other way. Updatelessness is the idea that we should act as though we actually made irreversible commitments to our notion of bargaining that locked in the highest expected reward in all cases where failing to have done so would predictably lower our expected reward. Applied to the specific pizza example, it is the idea that learning more information about "who is hungrier" should not lower the total utility we would both have gotten in expectation if we had agreed to a fair compromise early enough from an original position of ignorance. So it could mean that my brother and I should disregard (= “choose not to update on”) the knowledge that one specific and now known person happens to have the less fortunate pizza preferences in this instance we are in. Why? Because there were points in the past where we could – and arguably should – have agreed on a method for future compromise on things such as pizza eating that in expectation does better than just dividing goods equally. Not knowing whether we ourselves will be hungrier or less hungry, it seems rational to commit to a compromise where the hungrier person receives more food. (There is also a more contested, even stronger sense of updatelessness that is not based on pre-commitments.)
Updatelessness applied to MSR would mean optimizing for an MCUF where variance normalization is applied not to all the things we currently know about the strategic position of proponents of different value systems, but instead to a hypothetical “point of precommitment.” Depending on the version of updatelessness at play, this could be the point in time where someone started to understand decision theory well enough to consider the benefits of updatelessness, or it could even mean going back to the “logical prior” over how much different value systems can or cannot be benefitted. (Whatever that means; I do not understand this business about either logical priors or how to distinguish different versions of updatelessness, so I will just leave it at that and hope that others may or may not do some more thinking here, following the links above.)
As I understand it, the inspiration for updateless compromise is that the gains in case one ends up being on the lucky side weigh more than the losses in case one does not. Maybe it is not apparent from the start which value systems correspond more to something like Complicatedism or something like Straightforwardism, and the sides could in theory also be reversed in at least some world-situations across the multiverse, depending on the sort of crazy things that may happen in the worlds in question. Arguably, then, there is a case to be made for committing towards updateless compromise before thinking more about MSR implications in further detail. (Or more generally, a case for a general precommitment towards updatelessness in all future decision-situations where this has benefits given the knowledge at the time of precommitment.)
While I think the arguments for updatelessness are intriguing, I am skeptical whether humans can and should try to trick their brains into reasoning completely in updateless terms in decision-situations where the instances for precommitment either have passed or are difficult to reconstruct and assess in retrospect. And I am even more skeptical about using updateless compromise for MSR in particular:
Next to the psychological difficulties with updatelessness and worries whether humans are even capable of following through with the implications after learning that one is on the losing end of an updateless compromise, another problem with updateless MSR is also the apparent lack of a true original position (besides the extreme view where one just goes with a logical prior). We have previously discussed asymmetric features amongst potential MSR participants. Even someone who has not given much thought to the relative prioritization of different value systems will probably have a rough idea whether their value system is more likely to benefit from updateless compromise or not. Small asymmetries can break the entanglement of decision algorithms: If I commit to be updateless because I have a good feeling about being on the winning side, I cannot expect other agents who may not share said feeling to commit as well. I feel that MSR is already hard enough without updatelessness, and adding yet another speculative consideration to it makes me think I should start drawing the line.
Having said all that, I guess it might nevertheless be reasonable to already commit to having precommitted to be updateless in case, after thinking more about the merits and drawbacks of the idea, one concludes that a past commitment would in fact have been the rational thing to do. (I think that’s actually the way one should think about updatelessness in general, assuming one should try updatelessness at all, so I probably misrepresented a few things above.)
Without (strong versions of) updatelessness, the way we ensure that our actions lead to MSR benefits is to diligently follow cooperation heuristics that do not disproportionately favor our own values. (Otherwise we would have to conclude that others are disproportionately benefiting their own values, which defeats the purpose.) This means that, in expectation, all the value systems should receive a substantial portion of attention somewhere in the multiverse. Ideally, assuming there were no time or resource constraints to computing a compromise strategy, an ideal reasoner would execute something like the following strategy:
1) Set up a weighted sum of the value functions of superrationalists in the multiverse.
2) Set the weights such that when universally adopted, everyone gets the same expected gains from compromise (perhaps relative to the agents’ power).
3) Maximize the resulting utility function.
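To make the structure of these three steps concrete, here is a deliberately tiny sketch; the value systems, payoffs, and weightings are all invented, and step 2) is only gestured at:

```python
# Deliberately tiny sketch of steps 1)-3): two invented value systems, three
# candidate actions, and a weighted sum of their value functions (the toy MCUF).

actions = ["do_nothing", "intervention_x", "intervention_y"]
utility = {
    "straightforwardism": {"do_nothing": 0.0, "intervention_x": 10.0, "intervention_y": 1.0},
    "complicatedism":     {"do_nothing": 0.0, "intervention_x": 0.5,  "intervention_y": 4.0},
}

def mcuf(weights, action):
    """Step 1): a weighted sum of the value functions."""
    return sum(w * utility[system][action] for system, w in weights.items())

def best_action(weights):
    """Step 3): maximize the resulting utility function."""
    return max(actions, key=lambda a: mcuf(weights, a))

# Step 2) is the hard part: the weights would have to be set so that, across all
# the situations in which the MCUF gets followed, every value system gains equally
# in expectation. Here we only show that the choice of weights matters:
print(best_action({"straightforwardism": 1.0, "complicatedism": 1.0}))  # intervention_x
print(best_action({"straightforwardism": 1.0, "complicatedism": 3.0}))  # intervention_y
```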
Put this way, this may look simple. But it really isn’t. The way to coordinate for each value system to have resources allocated to its priorities is to maximally incorporate comparative advantages in terms of expertise and the strategic situation of the participating agents. Step 2) in the algorithm above is therefore extremely complicated, because it requires thinking about all the ways in which situations across the multiverse differ, where agents are in an especially good position to benefit certain value systems, and how likely they would be to notice this and comply depending on how the weights in the MCUF are being set. To illustrate this complexity, we can break down step 2) from above into further steps. Note that the following only gives an approximate rather than exact way to solve the problem, because a proper formalization for how to solve step 2) would twist knots into my brain:
2.1) Outline the value systems of all superrationalists and explore strategic prioritization for each value system in all world situations to come up with a ranking of promising interventions per world situation per value system.
2.2) Adjust all these interventions according to empirical compromise considerations where one can get more value out of a given intervention by tweaking it in certain ways: For instance, if two or more value systems would all agree to change each other’s promising interventions to different packages of compromise interventions that are overall preferable, perform said change.
2.3) Construct a preliminary multiverse-wide compromise utility function (pMCUF) that represents value systems weighted according to how prevalent they are amongst superrationalists, and how influential their proponents are.
2.4) Compare the world situations of all participants in MSR, predict which interventions from 2.2) will be taken by these agents under the assumption that they are approximating the pMCUF while being partly irrational in different ways, and calculate the total utility this generates for each value system in the preliminary compromise.
2.5) Adjust the weights in the pMCUF with the help of a fair bargaining solution in such a way that eventually, when applied to all possible world situations where the newly weighted MCUF will get approximated, all value systems will get (roughly) equal, variance-normalized benefits. This eventually gives you (a crude version of) the final MCUF to use.
(Step 2.5 ensures that value systems that are hard to benefit also end up receiving some attention. Without this step, hard-to-benefit value systems would often end up neglected, because MSR participants would solely be on the lookout for options to create the most total value per value system, which disproportionately favors benefitting value systems that are easy to benefit.)
Needless to say, the analysis above is much too impractical for humans to even attempt to approximate with steps of the same structure. So please don't even try!
Now, in order to produce actionable compromise plans, we have to come up with a simpler proposal. In the following, I’ll try to come up with a practical proposal that, if anything, tries to err on the side of being too simple. The idea is that if the practical proposal below seems promising, we gain confidence that implementing MSR in a way that incentivizes sufficiently many other potential participants to join is realistically feasible. Here is the proposal in very sketchy terms:
Note that step 5 also includes very general or “meta” interventions such as encouraging people who have not made up their minds on ethical questions to simply follow MSR rather than waste time with ethical deliberation.
Admittedly, the above proposal is vague in many of the steps and things often boil down to intuition-based judgment calls, which generates a lot of room for biases to creep in. It is not obvious that this procedure still generates gains from trade if we factor in all the ways in which it could go wrong.
However, if people genuinely try to implement a cooperation heuristic that is impartially best for the compromise overall, then biases that creep in should at least be equally likely to give too much or too little weight to any given value system. In other words: There is hope even if we expect to make a few mistakes (after all, normal, non-MSR consequentialism is far from easy either.)
Note that while causal interaction and cooperation with proponents of other value systems interested in MSR can be highly useful as an integral part of a sensible cooperation heuristic, this should not, however, be confused with thinking of these other people as “actual” MSR compromise partners. It is highly debatable whether one’s own decision-making is likely to be relevantly logically entangled with the decision-making of some humans on earth. Maybe the earth is not large enough for that. But whether this is the case or not, MSR certainly does not require it. Besides, even if such entanglement was likely, the possibility of checking up on whether others are in fact reciprocating the compromise may break the entanglement of decision algorithms (cf. the EDT slogan “ignorance is evidential power”). (Although note that decision theories that incorporate updatelessness might continue to cooperate even after observing the other party’s action, if the reasons from similarity of decision algorithms were strong enough initially.)
So the idea behind focusing on cooperating with proponents of other value systems that we know and can interact with is not that we are superrationally ensuring that no one defects in causal interactions. Rather, the idea is that, if MSR works, each party has rational reason to act as though they are correlated with agents in other parts of the multiverse, where defection in expectation hurts their own values. This is what ensures that there are no incentives to defect. If one were to defect, one may gain an unfair advantage locally in causal interactions with others, yet one loses all the benefits from MSR in other parts of the multiverse.
Note that this leaves the problem that agents can pretend to epistemically buy into MSR even though they may be highly skeptical of the idea. If one is confident that MSR would never work, one may be incentivized to lie about it and fake excitement. (Though I think this sounds like a terrible idea, both because of the epistemic damage it would do to the community and because of all the non-MSR arguments against naive consequentialism.)
Overall I'm not convinced that MSR has strong action-guiding implications. To figure out whether we can trust the reasoning behind MSR, there are many things to potentially look into in more detail. Personally, I am particularly interested in the following questions:
Thanks to Caspar Oesterheld, Johannes Treutlein, Max Daniel, Tobias Baumann and David Althaus for comments and discussions that helped inform my thinking. I first heard the term “superrationalizing” be used in the context of Hofstadter’s superrationality by Carl Shulman.
The post Commenting on MSR, Part 2: Cooperation heuristics appeared first on Center on Long-Term Risk.
The post Cause prioritization for downside-focused value systems appeared first on Center on Long-Term Risk.
This post outlines my thinking on cause prioritization from the perspective of value systems whose primary concern is reducing disvalue. I’m mainly thinking of suffering-focused ethics (SFE), but I also want to include moral views that attribute substantial disvalue to things other than suffering, such as inequality or preference violation. I will limit the discussion to interventions targeted at improving the long-term future (see the reasons in section II). I hope my post will also be informative for people who do not share a downside-focused outlook, as thinking about cause prioritization from different perspectives, with emphasis on considerations other than those one is used to, can be illuminating. Moreover, understanding the strategic considerations for plausible moral views is essential for acting under moral uncertainty and cooperating with people with other values.
I will talk about the following topics:
I’m using the term downside-focused to refer to value systems that in practice (given what we know about the world) primarily recommend working on interventions that make bad things less likely.1 For example, if one holds that what is most important is how things turn out for individuals (welfarist consequentialism), and that it is comparatively unimportant to add well-off beings to the world, then one should likely focus on preventing suffering.2 That would be a downside-focused ethical view.
By contrast, other moral views place great importance on the potential upsides of very good futures, in particular with respect to bringing about a utopia where vast numbers of well-off individuals will exist. Proponents of such views may also believe it to be a top priority that a large, flourishing civilization exists for an extremely long time. I will call these views upside-focused.
Upside-focused views do not have to imply that bringing about good things is normatively more important than preventing bad things; instead, a view also counts as upside-focused if one has reason to believe that bringing about good things is easier in practice (and thus more overall value can be achieved that way) than preventing bad things.
A key point of disagreement between the two perspectives is that the upside-focused people might say that suffering and happiness are in a relevant sense symmetrical, and that downside-focused people are too willing to give up good things in the future, such as the coming into existence of many happy beings, just to prevent suffering. On the other side, downside-focused people feel that the other party is too willing to accept that, say, the extreme suffering of many people goes unaddressed, or is in some sense tolerated in order to achieve some purportedly greater good.
Whether a normative view qualifies as downside-focused or upside-focused is not always easy to determine, as the answer can depend on difficult empirical questions such as how much disvalue we can expect to be able to reduce versus how much value we can expect to be able to create. I feel confident however that views according to which it is not in itself (particularly) valuable to bring beings in optimal conditions into existence come out as largely downside-focused. The following commitments may lead to a downside-focused prioritization:
For those who are unsure about where their beliefs may fall on the spectrum between downside- and upside-focused views, and how this affects their cause prioritization with regard to the long-term future, I recommend being on the lookout for interventions that are positive and impactful according to both perspectives. Alternatively, one could engage more with population ethics to perhaps cash in on the value of information from narrowing down one’s uncertainty.
In this post, I will only discuss interventions chosen with the intent of affecting the long-term future – which not everyone agrees is the best strategy for doing good. I want to note that choosing interventions that reliably reduce suffering or promote well-being in the short run also has many arguments in its favor.
Having said that, I believe that most of the expected value comes from the effects our actions have on the long-term future, and that our thinking about cause prioritization should explicitly reflect this. The future may come to hold astronomical quantities of the things that people value (Bostrom, 2003). Correspondingly, for moral views that place a lot of weight on bringing about astronomical quantities of positive value (e.g., happiness or human flourishing), Nick Beckstead presented a strong case for focusing on the long-term future. For downside-focused views, that case rests on similar premises. A simplified version of that argument is based on the two following ideas:
This does not mean that one should necessarily pick interventions one thinks will affect the long-term future through some specific, narrow pathway. Rather, I am saying (following Beckstead) that we should pick our actions based primarily on what we estimate their net effects to be on the long-term future.6 This includes not only narrowly targeted interventions such as technical work in AI alignment, but also projects that improve the values and decision-making capacities in society at large to help future generations cope better with expected challenges.
The observable universe has very little suffering (or inequality, preference frustration, etc.) compared to what could be the case; for all we know suffering at the moment may only exist on one small planet in a computationally inefficient form of organic life.7 According to downside-focused views, this is fortunate, but it also means that things can become much worse. Suffering risks (or “s-risks”) are defined as events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering.8 Analogously and more generally, we can define downside risks as events that would bring about disvalue (including things other than suffering) at vastly unprecedented scales.
Why might this definition be practically relevant? Imagine the hypothetical future scenario “Business as usual (BAU),” where things continue onwards indefinitely exactly as they are today, with all bad things being confined to earth only. Hypothetically, let’s say that we expect 10% of futures to be BAU, and we imagine there was an intervention – let’s call it paradise creation – that changed all BAU futures into futures where a suffering-free paradise is created. Let us further assume that another 10% of futures will be futures where earth-originating intelligence colonizes space and things go very wrong such that, through some pathway or another, vastly more suffering is created than has ever existed (and would ever exist) on earth, with little to no happiness or good things. We will call this second scenario “Astronomical Suffering (AS).”
If we limit our attention to only the two scenarios AS and BAU (of course there are many other conceivable scenarios, including scenarios where humans go extinct or where space colonization results in a future filled with predominantly happiness and flourishing), then we see that the total suffering in the AS futures vastly exceeds all the suffering in the BAU futures. Successful paradise creation therefore would have a much smaller impact in terms of reducing suffering than an alternative intervention that averts the 10% s-risk from the AS scenario, changing it to a BAU scenario for instance. Even reducing the s-risk from AS in our example by a single percentage point would be vastly more impactful than preventing the suffering from all the BAU futures.
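To make the arithmetic behind this comparison explicit, here is a small illustrative calculation. The probabilities are the stipulated 10% figures from the example; the suffering quantities (and the 10,000-to-1 ratio between AS and BAU) are made-up numbers chosen only to reflect the stipulation that AS contains vastly more suffering:

```python
# Toy expected-value comparison; all quantities are made-up for illustration only.

p_bau, p_as = 0.10, 0.10   # stipulated probabilities of the BAU and AS scenarios
s_bau = 1.0                 # total suffering in a BAU future (arbitrary units)
s_as = 10_000.0             # stipulated: an AS future contains vastly more suffering

# "Paradise creation": every BAU future becomes a suffering-free paradise.
averted_by_paradise = p_bau * s_bau

# Alternative: reduce the probability of AS by one percentage point,
# moving that probability mass to a BAU-like future instead.
averted_by_s_risk_reduction = 0.01 * (s_as - s_bau)

print(f"Paradise creation averts {averted_by_paradise:.2f} units of expected suffering")
print(f"1pp of s-risk reduction averts {averted_by_s_risk_reduction:.2f} units")
# With these numbers, the one-percentage-point reduction in s-risk averts roughly
# a thousand times more expected suffering than paradise creation does.
```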
This consideration highlights why suffering-focused altruists should probably invest their resources not into making very good outcomes more likely, but rather into making dystopian outcomes (or dystopian elements in otherwise good outcomes) less likely. Utopian outcomes where all suffering is abolished through technology and all sentient beings get to enjoy lives filled with unprecedented heights of happiness are certainly something we should hope will happen. But from a downside-focused perspective, our own efforts to do good are, on the margin, better directed towards making it less likely that we get particularly bad futures.
While the AS scenario above was stipulated to contain little to no happiness, it is important to note that s-risks can also affect futures that contain more happy individuals than suffering ones. For instance, the suffering in a future with an astronomically large population count where 99% of individuals are very well off and 1% of individuals suffer greatly constitutes an s-risk even though upside-focused views may evaluate this future as very good and worth bringing about. Especially when it comes to the prevention of s-risks affecting futures that otherwise contain a lot of happiness, it matters a great deal how the risk in question is being prevented. For instance, if we envision a future that is utopian in many respects except for a small portion of the population suffering because of problem X, it is in the interest of virtually all value systems to solve problem X in highly targeted ways that move probability mass towards even better futures. By contrast, only a few value systems (ones that are strongly or exclusively about reducing suffering/bad things) would consider it overall good if problem X was “solved” in a way that not only prevented the suffering due to problem X, but also prevented all the happiness from the future scenario this suffering was embedded in. As I will argue in the last section, moral uncertainty and moral cooperation are strong reasons to solve such problems in ways that most value systems approve of.
All of this is based on the assumption that bad futures, i.e., futures with severe s-risks or downside risks, are reasonably likely to happen (and can tractably be addressed). This seems to be the case, unfortunately: We find ourselves on a civilizational trajectory with rapidly growing technological capabilities, and the ceilings imposed by physical limits are still far away. It looks as though large-scale space colonization might become possible someday, either for humans directly, for some successor species, or for intelligent machines that we might create. Life generally tends to spread and use up resources, and intelligent life or intelligence generally does so even more deliberately. As space colonization would so vastly increase the stakes at which we are playing, a failure to improve sufficiently along all the necessary dimensions – both morally and with regard to overcoming coordination problems or lack of foresight – could result in futures that, even though they may in many cases (different from the AS scenario above!) also contain astronomically many happy individuals, would contain vast quantities of suffering. We can envision numerous conceptual pathways that lead to astronomical amounts of suffering (Sotala & Gloor, 2017), and while each single pathway may seem unlikely to be instantiated – as with most specific predictions about the long-term future – s-risks are disjunctive, and people tend to underestimate the probability of disjunctive events (Dawes & Hastie, 2001). In particular, our historical track record contains all kinds of factors that directly cause or contribute to suffering on large scales:
And while one can make a case that there has been a trend for things to become better (see Pinker, 2011), this does not hold in all domains (e.g. not with regard to the number of animals directly harmed in the food industry) and we may, because of filter bubbles for instance, underestimate how bad things still are even in comparatively well-off countries such as the US. Furthermore, it is easy to overestimate the trend for things to have gotten better given that the underlying mechanisms responsible for catastrophic events such as conflict or natural disasters may follow a power law distribution where the vast majority of violent deaths, diseases or famines result from a comparatively small number of particularly devastating incidents. Power law distributions constitute a plausible (though tentative) candidate for modelling the likelihood and severity of suffering risks. If this model is correct, then observations such as that the world did not erupt in the violence of a third world war, or that no dystopian world government has been formed as of late, cannot count as very reassuring, because power law distributions become hard to assess precisely towards the tail end of the spectrum, where the stakes are highest (Newman, 2006).
To illustrate the difference between downside- and upside-focused views, I drew two graphs. To keep things simple, I will limit the example scenarios to cases that either uncontroversially contain more suffering than happiness, or only contain happiness. The BAU scenario from above will serve as a reference point. I’m describing it again as a reminder below, alongside the other scenarios I will use in the illustration (note that all scenarios are stipulated to last for equally long):
Business as usual (BAU): Earth remains the only planet in the observable universe (as far as we know) where there is suffering, and things continue as they are. Some people remain in extreme poverty, many people suffer from disease or mental illness, and our psychological makeup limits the amount of time we can remain content with things even if our lives are comparatively fortunate. Factory farms stay open, and most wild animals die before they reach their reproductive age.
Astronomical suffering (AS): A scenario where space colonization results in an outcome where astronomically many sentient minds exist in conditions that are evaluated as bad by all plausible means of evaluation. To make the scenario more concrete, let us stipulate that 90% of beings in this vast population have lives filled with medium-intensity suffering, and 10% of the population suffers at strong or unbearable intensity. There is little or no happiness in this scenario.
Paradise (SP/AP): Things go as well as possible: suffering is abolished, all sentient beings are always happy and even experience heights of well-being that are unachievable with present-day technology. We further distinguish small paradise (SP) from astronomical paradise (AP): while the former stays earth-bound, the latter spans across (maximally) many galaxies, optimized to turn available resources into flourishing lives and all the things people value.
Here is how I envision typical suffering-focused and upside-focused views ranking the above scenarios from “comparatively bad” on the left to “comparatively good” on the right:
The two graphs represent the relative value we can expect from the classes of future scenarios I described above. The leftmost point of a graph represents not the worst possible outcome, but the worst outcome amongst the future scenarios we are considering. The important thing is not whether a given scenario is more towards the right or left side of the graph, but rather how large the distance is between scenarios. The yellow stretch signifies the highest importance or scope, and interventions that move probability mass across that stretch are either exceptionally good or exceptionally bad, depending on the direction of the movement. (Of course, in practice interventions can also have complex effects that affect multiple variables at once.)
Note that the BAU scenario was chosen mostly for illustration, as it seems pretty unlikely that humans would continue to exist in the current state for extremely long timespans. Similarly, I should qualify that the SP scenario may be unlikely to ever happen in practice because it seems rather unstable: Keeping value drift and Darwinian dynamics under control and preserving a small utopia for millions of years or beyond may require technology that is so advanced that one may as well make the utopia astronomically large – unless there are overriding reasons for favoring the smaller utopia. From any strongly or exclusively downside-focused perspective, the smaller utopia may indeed – factoring out concerns about cooperation – be preferable, because going from SP to AP comes with some risks.9 However, for the purposes of the first graph above, I was stipulating that AP is completely flawless and risk-free.
A “near paradise” or “flawed paradise” mostly filled with happy lives but also with, say, 1% of lives in constant misery, would for upside-focused views rank somewhere close to AP on the far right end of the first graph. By contrast, for downside-focused views on the second graph, “flawed paradise” would stand more or less in the same position as BAU if the view in question is weakly downside-focused, and decidedly more on the way towards AS on the left if it is strongly or exclusively downside-focused. Weakly downside-focused views would also have a relatively large gap between SP and AP, reflecting that creating additional happy beings is regarded as morally quite important but not sufficiently important to become the top priority. A view would still count as suffering-focused (at least within the restricted context of our visualization where all scenarios are artificially treated as having the same probability of occurrence) as long as the gap between BAU and AS remains larger than the gap between BAU/SP and AP.
In practice, we are well-advised to remain highly uncertain about the right way to conceptualize the likelihood and plausibility of such future scenarios. Given this uncertainty, there can be cases where a normative view falls somewhere in-between upside- and downside-focused in our subjective classification. All these things are very hard to predict and other people may be substantially more or substantially less optimistic with regard to the quality of the future. My own estimate is that a more realistic version of AP, one that is allowed to contain some suffering but is characterized by containing near-maximal quantities of happiness or things of positive value, is ~40 times less likely to happen10 than the vast range of scenarios (of which AS is just one particularly bad example) where space colonization leads to outcomes with a lot less happiness. I think scenarios as bad as AS or worse are also very rare, as most scenarios that involve a lot of suffering may also contain some islands of happiness (or even have a sea of happiness and some islands of suffering). See also these posts on why the future is likely to be net good in expectation according to views where creating happiness is similarly important as reducing suffering.
Interestingly, various upside-focused views may differ normatively with respect to how fragile (or not) their concept of positive value is. If utopia is very fragile, but dystopia comes in vastly many forms (related: the Anna Karenina principle), this would imply greater pessimism regarding the value of the average scenario with space colonization, which could push such views closer to becoming downside-focused. On the other hand, some (idiosyncratic) upside-focused views may simply place an overriding weight on the ongoing existence of conscious life, largely independent of how things will go in terms of hedonist welfare.11 Similarly, normatively upside-focused views that count creating happiness as more important than reducing suffering (though presumably very few people would hold such views) would always come out as upside-focused in practice, too, even if we had reason to be highly pessimistic about the future.
To summarize, what the graphs above try to convey is that for the example scenarios listed, downside-focused views are characterized by having the largest gap in relative importance between AS and the other scenarios. By contrast, upside-focused views place by far the most weight on making sure AP happens, and SP would (for many upside-focused views at least) not even count as all that good, comparatively.12
Some futures, such as ones where most people’s quality of life is hellish, are worse than extinction. Many people with upside-focused views would agree. So the difference between upside- and downside-focused views is not about whether there can be net negative futures, but about how readily a future scenario is ranked as worth bringing about in the face of the suffering it contains or the downside risks that lie on the way from here to there.
If humans went extinct, this would greatly reduce the probability of space colonization and any associated risks (as well as benefits). Without space colonization, there are no s-risks “by action,” no risks from the creation of astronomical suffering where human activity makes things worse than they would otherwise be.13 Perhaps there would remain some s-risks “by omission,” i.e. risks corresponding to a failure to prevent astronomical disvalue. But such risks appear unlikely given the apparent emptiness of the observable universe.14 Because s-risks by action overall appear to be more plausible than s-risks by omission, and because the latter can only be tackled in an (arguably unlikely) scenario where humanity accomplishes the feat of installing compassionate values to robustly control the future, it appears as though downside-focused altruists have more to lose from space colonization than they have to gain.
It is however not obvious whether this implies that efforts to reduce the probability of human extinction indirectly increase suffering risks or downside risks more generally. It very much depends on the way this is done and what the other effects are. For instance, there is a large and often underappreciated difference between existential risks from bio- or nuclear technology, and existential risks related to smarter-than-human artificial intelligence (superintelligence; see the next section). While the former set back technological progress, possibly permanently so, the latter drive it all the way up, likely – though maybe not always – culminating in space colonization with the purpose of benefiting whatever goal(s) the superintelligent AI systems are equipped with (Omohundro, 2008; Armstrong & Sandberg, 2013). Because space colonization may come with some associated suffering in this way, this means that a failure to reduce existential risks from AI is often also a failure to prevent s-risks from AI misalignment. Therefore, the next section will argue that reducing such AI-related risks is valuable from both upside- and downside-focused perspectives. By contrast, the situation is much less obvious for other existential risks, ones that are not about artificial superintelligence.
Sometimes efforts to reduce these other existential risks also benefit s-risk reduction. For instance, efforts to reduce non-AI-related extinction risks may increase global stability and make particularly bad futures less likely in those circumstances where humanity nevertheless goes on to colonize space. Efforts to reduce extinction risks from e.g. biotechnology or nuclear war in practice also reduce the risk of global catastrophes where a small number of humans survive and where civilization is likely to eventually recover technologically, but perhaps at the cost of a worse geopolitical situation or with worse values, which could then lead to increases in s-risks going into the long-term future. This mitigating effect on s-risks through a more stable future is substantial and positive according to downside-focused value systems, and has to be weighed against the effect of making s-risks from space colonization more likely.
Interestingly, if we care about the total number of sentient minds (and their quality of life) that can at some point be created, then because of some known facts about cosmology,15 any effects that near-extinction catastrophes have on delaying space colonization are largely negligible in the long run when compared to affecting the quality of a future with space colonization – at least unless the delay becomes very long indeed (e.g. millions of years or longer).
What this means is that in order to determine how reducing the probability of extinction from things other than superintelligent AI affects downside risks in expectation, we can approximate the answer by weighing the following two considerations against each other: (1) how likely such a catastrophe is to merely delay space colonization rather than prevent it – i.e., how much preventing it raises the probability that the cosmic stakes of space colonization are reached at all – and (2) whether a civilization that rebuilds after a collapse would, in expectation, handle those stakes better or worse than our current trajectory.
The second question involves judging where our current trajectory falls, quality-wise, when compared to the distribution of post-rebuilding scenarios – how much better or worse is our trajectory than a randomly reset one? It also requires estimating the effects of post-catastrophe conditions on AI development – e.g., would a longer time until technological maturity (perhaps due to a lack of fossil fuels) cause a more uniform distribution of power, and what does that imply about the probability of arms races? It seems difficult to account for all of these considerations properly. It strikes me as more likely than not that things would be worse after recovery, but because there are so many things to consider,17 I do not feel very confident about this assessment.
This leaves us with the question of how likely a global catastrophe is to merely delay space colonization rather than preventing it. I have not thought about this in much detail, but after having talked to some people (especially at FHI) who have investigated it, I have updated towards thinking that rebuilding after a catastrophe is quite likely. And while a civilizational collapse would set a precedent and a reason to worry the second time around when civilization reaches technological maturity again, it would take an unlikely constellation of collapse factors to get stuck in a loop of recurrent collapse, rather than at some point escaping the setbacks and reaching a stable plateau (Bostrom, 2009), e.g. through space colonization. I would therefore say that large-scale catastrophes related to biorisk or nuclear war are quite likely (~80–90%) to merely delay space colonization in expectation.18 (With more uncertainty being not on the likelihood of recovery, but on whether some outlier-type catastrophes might directly lead to extinction.)
This would still mean that the successful prevention of all biorisk and risks from nuclear war makes space colonization 10-20% more likely. Comparing this estimate to the previous, uncertain estimate about the s-risks profile of a civilization after recovery, it tentatively seems to me that the effect of making cosmic stakes (and therefore downside risks) more likely is not sufficiently balanced by positive effects19 on stability, arms race prevention and civilizational values (factors which would make downside risks less likely). However, this is hard to assess and may change depending on novel insights.
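As a rough way to see how these two effects trade off, here is a toy break-even calculation; the recovery probability is taken from the range above, and everything else is left as a free parameter rather than estimated:

```python
# Toy break-even calculation for the considerations above; illustrative only.

p_recover = 0.85   # assumed probability that a large catastrophe merely delays
                   # space colonization (midpoint of the ~80-90% range in the text)

# Let s_current be the expected downside risk conditional on colonization from our
# current trajectory, and s_rebuilt the same quantity for a trajectory that first
# collapses and then rebuilds. Expected downside risk is roughly:
#   catastrophe not prevented:  p_recover * s_rebuilt   (extinction branch ~ 0)
#   catastrophe prevented:      s_current
# Prevention is therefore neutral for downside-focused views when the two are equal.
break_even = 1.0 / p_recover
print(f"Prevention comes out negative (downside-focused) unless a rebuilt trajectory "
      f"is at least {break_even:.2f}x as bad as the current one.")
# This ignores prevention's other positive effects (stability, values, arms race
# prevention), which the text weighs separately.
```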
What looks slightly clearer to me is that making rebuilding after a civilizational collapse more likely comes with increased downside risks. If this was the sole effect of an intervention, I would estimate it as overall negative for downside-focused views (factoring out considerations of moral uncertainty or cooperation with other value systems) – because not only would it make it more likely that space will eventually be colonized, but it would also do so in a situation where s-risks might be higher than in the current trajectory we are on.20
However, in practice it seems as though any intervention that makes recovery after a collapse more likely would also have many other effects, some of which might more plausibly be positive according to downside-focused ethics. For instance, an intervention such as developing alternate foods might merely speed up rebuilding after civilizational collapse rather than making it altogether more likely, and so would merely affect whether rebuilding happens from a low base or a high base. One could argue that rebuilding from a higher base is less risky also from a downside-focused perspective, which makes things more complicated to assess. In any case, what seems clear is that none of these interventions look promising for the prevention of downside risks.
We have seen that efforts to reduce extinction risks (exception: AI alignment) are unpromising interventions for downside-focused value systems, and some of the interventions available in that space (especially if they do not simultaneously also improve the quality of the future) may even be negative when evaluated purely from this perspective. This is a counterintuitive conclusion, maybe so much so that many people would rather choose to adopt moral positions where it does not follow. In this context, it is important to point out that valuing humanity not going extinct is definitely compatible with a high degree of priority for reducing suffering or disvalue. I view morality as including both considerations about duties towards other people (inspired by social contract theories or game theoretic reciprocity) as well as considerations of (unconditional) care or altruism. If both types of moral considerations are to be weighted similarly, then while the “care” dimension could e.g. be downside-focused, the other dimension, which is concerned with respecting and cooperating with other people’s life goals, would not be – at least not under the assumption that the future will be good enough that people want it to go on – and would certainly not welcome extinction.
Another way to bring together both downside-focused concerns and a concern for humanity not going extinct would be through a morality that evaluates states of affairs holistically, as opposed to using an additive combination for individual welfare and a global evaluation of extinction versus no extinction. Under such a model, one would have a bounded value function for the state of the world as a whole, so that a long history with great heights of discovery or continuity could improve the evaluation of the whole history, as would properties like highly favorable densities of good things versus bad things.
Altogether, because more people seem to come to hold upside-focused or at least strongly extinction-averse values after grappling with the arguments in population ethics, reducing extinction risk can be part of a fair compromise even though it is an unpromising and possibly negative intervention from a downside-focused perspective. After all, the reduction of extinction risks is particularly important from both an upside-focused perspective and from the perspective of (many) people’s self-, family- or community-oriented moral intuitions – because of the short-term death risks it involves.21 Because it is difficult to identify interventions that are robustly positive and highly impactful according to downside-focused value systems (as the length of this post and the uncertain conclusions indicate), it is however not a trivial issue that many commonly recommended interventions are unlikely to be positive according to these value systems. To the extent that downside-focused value systems are regarded as a plausible and frequently arrived at class of views, considerations from moral uncertainty and moral cooperation (see the last section) recommend some degree of offsetting expected harms through targeted efforts to reduce s-risks, e.g. in the space of AI risk (next section). Analogously, downside-focused altruists should not increase extinction risks and instead focus on more cooperative ways to reduce future disvalue.
Smarter-than-human artificial intelligence will likely be particularly important for how the long-term future plays out. There is a good chance that the goals of superintelligent AI would be much more stable than the values of individual humans or those enshrined in any constitution or charter, and superintelligent AIs would – at least with considerable likelihood – remain in control of the future not only for centuries, but for millions or even billions of years to come. In this section, I will sketch some crucial considerations for how work in AI alignment is to be evaluated from a downside-focused perspective.
First, let’s consider a scenario with unaligned superintelligent AI systems, where the future is shaped according to goals that have nothing to do with what humans value. Because resource accumulation is instrumentally useful to most consequentialist goals, it is likely to be pursued by a superintelligent AI no matter its precise goals. Taken to its conclusion, the acquisition of ever more resources culminates in space colonization where accessible raw material is used to power and construct supercomputers and other structures that could help in the pursuit of a consequentialist goal. Even though random or “accidental” goals are unlikely to intrinsically value the creation of sentient minds, they may lead to the instantiation of sentient minds for instrumental reasons. In the absence of explicit concern for suffering reflected in the goals of a superintelligent AI system, that system would instantiate suffering minds for even the slightest benefit to its objectives. Suffering may be related to powerful ways of learning (Daswani & Leike, 2015), and an AI indifferent to suffering might build vast quantities of sentient subroutines, such as robot overseers, robot scientists or subagents inside larger AI control structures. Another danger is that, either during the struggle over control over the future in a multipolar AI takeoff scenario, or perhaps in the distant future should superintelligent AIs ever encounter other civilizations, conflict or extortion could result in large amounts of disvalue. Finally, superintelligent AI systems might create vastly many sentient minds, including very many suffering ones, by running simulations of evolutionary history for research purposes (“mindcrime;” Bostrom, 2014, pp. 125-26). (Or for other purposes; if humans had the power to run alternative histories in large and fine-grained simulations, probably we could think of all kinds of reasons for doing it.) Whether such history simulations would be fine-grained enough to contain sentient minds, or whether simulations on a digital medium can even qualify as sentient, are difficult and controversial questions. It should be noted however that the stakes are high enough such that even comparatively small credences such as 5% or lower would already go a long way in terms of the implied expected value for the overall severity of s-risks from artificial sentience (see also footnote 7).
While the earliest discussions about the risks from artificial superintelligence have focused primarily on scenarios where a single goal and control structure decides the future (singleton), we should also remain open to scenarios that do not fit this conceptualization completely. What happens instead could be several goals either competing or acting in concert with each other, like an alien economy that drifted further and further away from originally having served the goals of its human creators.22 Alternatively, perhaps goal preservation becomes more difficult the more capable AI systems become, in which case the future might be controlled by unstable goal functions taking turns at the steering wheel (see “daemons all the way down”). These scenarios where no proper singleton emerges may perhaps be especially likely to contain large numbers of sentient subroutines. This is because navigating a landscape with other highly intelligent agents requires the ability to continuously model other actors and to react to changing circumstances under time pressure – all of which are things that are plausibly relevant for the development of sentience.
In any case, we cannot expect with confidence that a future controlled by non-compassionate goals will be a future that neither contains happiness nor suffering. In expectation, such futures are instead likely to contain vast amounts of both happiness and suffering, simply because these futures would contain astronomical amounts of goal-directed activity in general.
Successful AI alignment could prevent most of the suffering that would happen in an AI-controlled future, as a superintelligence with compassionate goals would be willing to make tradeoffs that substantially reduce the amount of suffering contained in any of its instrumentally useful computations. While a “compassionate” AI (compassionate in the sense that its goal includes concern for suffering, though not necessarily in the sense of experiencing emotions we associate with compassion) might still pursue history simulations or make use of potentially sentient subroutines, it would be much more conservative when it comes to risks of creating suffering on large scales. This means that it would e.g. contemplate using fewer or slightly less fine-grained simulations, slightly less efficient robot architectures (and ones that are particularly happy most of the time), and so on. This line of reasoning suggests that AI alignment might be highly positive according to downside-focused value systems because it averts s-risks related to instrumentally useful computations.
However, work in AI alignment not only makes it more likely that fully aligned AI is created and everything goes perfectly well, but it also affects the distribution of alignment failure modes. In particular, progress in AI alignment could make it more likely that failure modes shift from “very far away from perfect in conceptual space” to “close but slightly off the target.” There are some reasons why such near misses might sometimes end particularly badly.
One thing that could loosely be classified as a near miss is work in AI alignment that makes it more likely that AIs will share whichever values their creators want to install, while the creators themselves are unethical or (meta-)philosophically and strategically incompetent.
For instance, if those in control of the future came to follow some kind of ideology that is uncompassionate or even hateful of certain out-groups, or to favor a distorted version of libertarianism where every person, including a few sadists, would be granted an astronomical quantity of future resources to use at their own discretion, the resulting future could be a bad one according to downside-focused ethics.
A related and perhaps more plausible danger is that we might prematurely lock in a definition of suffering and happiness into an AI’s goals that neglects sources of suffering we would come to care about after deeper reflection, such as not caring about the mind states of insect-like digital minds (which may or may not be reasonable). A superintelligence with a random goal would also be indifferent with regard to these sources of suffering, but because humans value the creation of sentience, or at least value processes related to agency (which tend to correlate with sentience), the likelihood is greater that a superintelligence with aligned values would create unnoticed or uncared for sources of suffering. Possible such sources include the suffering of non-human animals in nature simulations performed for aesthetic reasons, or characters in sophisticated virtual reality games.
A further danger is that, if our strategic or technical understanding is too poor, we might fail to specify a recipe for getting human values right and end up with perverse instantiation (Bostrom, 2014) or a failure mode where the reward function ends up flawed. This could happen, e.g., in cases where an AI system starts to act in unpredictable but optimized ways due to conducting searches far outside its training distribution.23 Probably most mistakes at that stage would result in about as much suffering as in the typical scenario where AI is unaligned and has (for all practical purposes) random goals. However, one possibility is that alignment failures surrounding utopia-directed goals have a higher chance of leading to dystopia than alignment failures around random goals. For instance, a failure to fully understand the goal 'make maximally many happy minds' could lead to a dystopia where maximally many minds are created in conditions that do not reliably produce happiness, and may even lead to suffering in some of the instances, or some of the time. This is an area for future research.
A final possible outcome in the theme of “almost getting everything right” is one where we are able to successfully install human values into an AI, only to have the resulting AI compete with other, unaligned AIs for control of the future and be threatened with things that are bad according to human values, in the expectation that the human-aligned AI would then forfeit its resources and give up in the competition over controlling the future.
Trying to summarize the above considerations, I drew a (sketchy) map with some major categories of s-risks related to space colonization. It highlights that artificial intelligence can be regarded as a cause or cure for s-risks (Sotala & Gloor, 2017). That is, if superintelligent AI is successfully aligned, s-risks stemming from indifference to suffering are prevented and a maximally valuable future is instantiated (green). However, the danger of near misses (red) makes it non-obvious whether efforts in AI alignment reduce downside risks overall, as the worst near misses may e.g. contain more suffering than the average s-risk scenario.
Note that no one should quote the above map out of context and call it “The likely future” or something like that, because some of the scenarios I listed may be highly improbable and because the whole map is drawn with a focus on things that could go wrong. If we wanted a map that also tracked outcomes with astronomical amounts of happiness, there would in addition be many nodes for things like “happy subroutines,” “mindcrime-opposite,” “superhappiness-enabling technologies,” or “unaligned AI trades with aligned AI and does good things after all.” There can be futures in which several s-risk scenarios come to pass at the same time, as well as futures that contain s-risk scenarios but also a lot of happiness (this seems pretty likely).
To elaborate more on the categories in the map above: Pre-AGI civilization (blue) is the stage we are at now. Grey boxes refer to various steps or conditions that could be met, from which s-risks (orange and red), extinction (yellow) or utopia (green) may follow. The map is crude and not exhaustive. For instance “No AI Singleton” is a somewhat unnatural category into which I threw both scenarios where AI systems play a crucial role and scenarios where they do not. That is, the category contains futures where space colonization is orchestrated by humans or some biological successor species without AI systems that are smarter than humans, futures where AI systems are used as tools or oracles for assistance, and futures where humans are out of the loop but no proper singleton emerges in the competition between different AI systems.
Red boxes are s-risks that may be intertwined with efforts in AI alignment (though not by logical necessity): If one is careless, work in AI alignment may exacerbate these s-risks rather than alleviate them. While dystopia from extortion would never be the result of the activities of an aligned AI, it takes an AI with aligned values – e.g. alongside an unaligned AI in a multipolar scenario, or an alien AI encountered during space colonization – to even provoke such a threat (hence the dotted line linking this outcome to “value loading success”). I coined the term “aligned-ish AI” to refer to the class of outcomes that efforts in AI alignment shift probability mass to. This class includes both very good outcomes (intentional) and neutral or very bad outcomes (accidental). Flawed realization – which stands for futures where flaws in alignment prevent most of the value or even create disvalue – is split into two subcategories in order to highlight that the vast majority of such outcomes likely contains no more suffering than the typical outcome with unaligned AI, but that things going wrong in a particularly unfortunate way could result in exceptionally bad futures. For views that care similarly strongly about achieving utopia as about preventing very bad futures, this tradeoff seems most likely net positive, whereas from a downside-focused perspective, this consideration makes it less clear whether efforts in AI alignment are overall worth the risks.
Fortunately, not all work in AI alignment faces the same tradeoffs. Many approaches may be directed specifically at avoiding certain failure modes, which is extremely positive and impactful for downside-focused perspectives. Worst-case AI safety is the idea that downside-focused value systems recommend differentially pushing the approaches that appear safest with respect to particularly bad failure modes. Given that many approaches towards AI alignment are still at a very early stage, it may be hard to tell which components of AI alignment are likely to benefit downside-focused perspectives the most. Nevertheless, I think we can already make some informed guesses, and our understanding will improve with time.
For instance, approaches that make AI systems corrigible (see here and here) would extend the window of time during which we can spot flaws and prevent outcomes with flawed realization. Similarly, approval-directed approaches to AI alignment, where alignment is achieved by simulating what a human overseer would decide if they were to think about the situation for a very long time, would go further towards avoiding bad decisions than approaches with immediate, unamplified feedback from human overseers. And rather than trying to solve AI alignment in one swoop, a promising and particularly “s-risk-proof” strategy might be to first build a low-impact AI system that increases global stability and prevents arms races without actually representing fully specified human values. This would give everyone more time to think about how to proceed and avoid failure modes where human values are (partially) inverted.
In general, especially from a downside-focused perspective, it strikes me as very important that early and possibly flawed or incomplete AI designs should not yet attempt to fully specify human values. Eliezer Yudkowsky recently expressed the same point in this Arbital post on the worst failure modes in AI alignment.
Finally, what could also be highly effective for reducing downside risks, as well as being important for many other reasons, is some of the foundational work in bargaining and decision theory for AI systems, done at e.g. the Machine Intelligence Research Institute, which could help us understand how to build AI systems that reliably steer things towards outcomes that are always positive-sum.
I have a general intuition that, at least as long as the AI safety community does not face a strong pressure from (perceived) short timelines where the differences between downside-focused and upside-focused views may become more pronounced, there is likely to be a lot of overlap between the most promising approaches focused on achieving the highest probability of success (utopia creation) and approaches that are particularly robust against failing in the most regretful ways (dystopia prevention). Heuristics like ‘Make AI systems corrigible,’ ‘Buy more time to think,’ or ‘If there is time, figure out some foundational issues to spot unanticipated failure modes’ all seem likely to be useful from both perspectives, especially when all good guidelines are followed without exception. I also expect that reasonably many people working in AI alignment will gravitate towards approaches that are robust in all these respects, because making your approach multi-layered and foolproof is simply a smart strategy when the problem in question is unfamiliar and highly complex. Furthermore, I anticipate that more people will come to think more explicitly about the tradeoffs between the downside risks from near misses and utopian futures, and some of them might put deliberate efforts into finding AI alignment methods or alignment components that fail gracefully and thereby make downside risks less likely (worst-case AI safety), either because of intrinsic concern or for reasons of cooperation with downside-focused altruists.24 All of these things make me optimistic about AI alignment as a cause area being roughly neutral or slightly positive when done with little focus on downside-focused considerations, and strongly positive when pursued with strong concern for avoiding particularly bad outcomes.
I also want to mention that the entire field of AI policy and strategy strikes me as particularly positive for downside-focused value systems. Making sure that AI development is done carefully and cooperatively, without the threat of arms races leading to ill-considered, rushed approaches, seems like it would be exceptionally positive from all perspectives, and so I recommend that people who fulfill the requirements for such work prioritize it very highly.
Population ethics, which is the area in philosophy most relevant for deciding between upside- and downside-focused positions, is a notoriously contested topic. Many people who have thought about it a great deal believe that the appropriate epistemic state with regard to a solution to population ethics is one of substantial moral uncertainty or of valuing further reflection on the topic. Let us suppose therefore that, rather than being convinced that some form of suffering-focused ethics or downside-focused morality is the stance we want to take, we consider it a plausible stance we very well might want to take, alongside other positions that remain in contention.
Analogous to situations with high empirical uncertainty, there are two steps to consider for deciding under moral uncertainty: (1) estimate the value of information from reducing our moral uncertainty through further reflection, and compare that with (2) the value of directly working on interventions that look robustly positive given the moral views we currently place substantial credence in.
With regard to (1), we can reduce our moral uncertainty on two fronts. The obvious one is population ethics: We can learn more about the arguments for different positions, come up with new arguments and positions, and assess them critically. The second front concerns meta-level questions about the nature of ethics itself, what our uncertainty is exactly about, and in which ways more reflection or a sophisticated reflection procedure with the help of future technology would change our thinking. While some people believe that it is futile to even try reaching confident conclusions in the epistemic position we are in currently, one could also arrive at a view where we simply have to get started at some point, or else we risk getting stuck in a state of underdetermination and judgment calls all the way down.25
If we conclude that the value of information is insufficiently high to justify more reflection, then we can turn towards getting value from working on direct interventions (2) informed by those moral perspectives we have substantial credence in. For instance, a portfolio for effective altruists in the light of total uncertainty over downside- vs. upside-focused views (which may not be an accurate representation of the EA landscape currently, where upside-focused views appear to be in the majority) would include many interventions that are valuable from both perspectives, and few interventions where there is a large mismatch such that one side is harmed without the other side attaining a much greater benefit. Candidate interventions where the overlap between downside-focused and upside-focused views is high include AI strategy and AI safety (perhaps with a careful focus on the avoidance of particularly bad failure modes), as well as growing healthy communities around these interventions. Many other things might be positive from both perspectives too, such as (to name only a few) efforts to increase international cooperation, raising awareness and concern for the suffering of non-human sentient minds, or improving institutional decision-making.
It is sometimes acceptable or even rationally mandated to do something that is negative according to some plausible moral views, provided that the benefits accorded to other views are sufficiently large. Ideally, one would weigh all of these considerations and integrate the available information appropriately with some decision procedure for acting under moral uncertainty, such as one that includes variance voting (MacAskill, 2014, chpt. 3) and an imagined moral parliament.26 For instance, if someone leaned more towards upside-focused views, or had reasons to believe that the low-hanging fruit in the field of non-AI extinction risk reduction are exceptionally important from the perspective of these views (and unlikely to damage downside-focused views more than they can be benefitted elsewhere), or gave a lot of weight to the argument from option value (see the next paragraph), then these interventions should be added at high priority to the portfolio as well.
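As an illustration of the kind of decision procedure mentioned above, here is a minimal sketch of credence-weighted aggregation with variance normalization (“variance voting”). The intervention names, scores, and credences are invented for illustration and do not represent anyone’s actual estimates:

```python
import statistics

# Invented toy numbers: how two moral views score three candidate interventions.
scores = {
    "upside-focused":   {"AI alignment": 9.0, "non-AI extinction risk": 8.0, "s-risk reduction": 2.0},
    "downside-focused": {"AI alignment": 6.0, "non-AI extinction risk": -2.0, "s-risk reduction": 9.0},
}
credences = {"upside-focused": 0.5, "downside-focused": 0.5}

def normalize(view_scores):
    """Rescale a view's scores to mean 0 and standard deviation 1 (variance voting)."""
    mean = statistics.mean(view_scores.values())
    sd = statistics.pstdev(view_scores.values())
    return {option: (s - mean) / sd for option, s in view_scores.items()}

normalized = {view: normalize(s) for view, s in scores.items()}

# Credence-weighted sum of the normalized scores for each intervention.
aggregate = {
    option: sum(credences[view] * normalized[view][option] for view in scores)
    for option in next(iter(scores.values()))
}
for option, value in sorted(aggregate.items(), key=lambda kv: -kv[1]):
    print(f"{option}: {value:+.2f}")
# With these invented numbers, AI alignment comes out on top for the mixed portfolio,
# mirroring the overlap claim in the text.
```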
Some people have argued that even (very) small credences in upside-focused views, such as 1–20%, would by themselves already speak in favor of making extinction risk reduction a top priority, because making sure there will still be decision-makers in the future provides high option value. I think this gives far too much weight to the argument from option value. Option value does play a role, but not nearly as strong a role as it is sometimes made out to be. To elaborate, let’s look at the argument in more detail: The naive argument from option value says, roughly, that our descendants will be in a much better position to decide than we are, and if suffering-focused ethics or some other downside-focused view is indeed the outcome of their moral deliberations, they can then decide to not colonize space, or only do so in an extremely careful and controlled way. If this picture is correct, there is almost nothing to lose and a lot to gain from making sure that our descendants get to decide how to proceed.
I think this argument to a large extent misses the point, but seeing that even some well-informed effective altruists seem to believe that it is very strong led me to realize that I should write a post explaining the landscape of cause prioritization for downside-focused value systems. The problem with the naive argument from option value is that the decision algorithm that is implicitly being recommended in the argument, namely focusing on extinction risk reduction and leaving moral philosophy (and s-risk reduction in case the outcome is a downside-focused morality) to future generations, makes sure that people follow the implications of downside-focused morality in precisely the one instance where it is least needed, and never otherwise. If the future is going to be controlled by philosophically sophisticated altruists who are also modest and willing to change course given new insights, then most bad futures will already have been averted in that scenario. An outcome where we get long and careful reflection without downsides is far from the only possible outcome. In fact, it does not even seem to me to be the most likely outcome (although others may disagree). No one is most worried about a scenario where epistemically careful thinkers with their heart in the right place control the future; the discussion is instead about whether the probability that things will accidentally go off the rails warrants extra-careful attention. (And it is not as though it looks like we are particularly on the rails currently either.) Reducing non-AI extinction risk does not preserve much option value for downside-focused value systems because most of the expected future suffering probably comes not from scenarios where people deliberately implement a solution they think is best after years of careful reflection, but instead from cases where things unexpectedly pass a point of no return and compassionate forces do not get to have control over the future. Downside risks by action likely loom larger than downside risks by omission, and we are plausibly in a better position to reduce the most pressing downside risks now than later. (In part because “later” may be too late.)
This suggests that if one is uncertain between upside- and downside-focused views, as opposed to being uncertain between all kinds of things except downside-focused views, the argument from option value is much weaker than it is often made out to be. Having said that, non-naively, option value still does upshift the importance of reducing extinction risks quite a bit – just not by an overwhelming degree. In particular, arguments for the importance of option value that do carry force are for instance:
The discussion about the benefits from option value is interesting and important, and a lot more could be said on both sides. I think it is safe to say that the non-naive case for option value is not strong enough to make extinction risk reduction a top priority given only small credences in upside-focused views, but it does start to become a highly relevant consideration once the credences become reasonably large. Having said that, one can also make a case that improving the quality of the future (more happiness/value and less suffering/disvalue) conditional on humanity not going extinct is probably going to be at least as important for upside-focused views and is more robust under population ethical uncertainty – which speaks particularly in favor of highly prioritizing existential risk reduction through AI policy and AI alignment.
We saw that integrating population-ethical uncertainty means that one should often act to benefit both upside- and downside-focused value systems – at least in case such uncertainty applies in one’s own case and epistemic situation. Moral cooperation presents an even stronger and more universally applicable reason to pursue a portfolio of interventions that is altogether positive according to both perspectives. The case for moral cooperation is very broad and convincing, as it ranges from commonsensical heuristics to theory-backed principles found in Kantian morality or throughout Parfit’s work, as well as in the literature on decision theory.27 It implies that one should give extra weight to interventions that are positive for value systems different from one’s own, and subtract some weight from interventions that are negative according to other value systems – all to the extent that the value systems in question are endorsed prominently or endorsed by potential allies.28
Considerations from moral cooperation may even make moral reflection obsolete on the level of individuals: Suppose we knew that people tended to gravitate towards a small number of attractor states in population ethics, and that once a person tentatively settles on a position, they are very unlikely to change their mind. Rather than everyone going through this process individually, people could collectively adopt a decision rule where they value the outcome of a hypothetical process of moral reflection. They would then work on interventions that are beneficial for all the commonly endorsed positions, weighted by the probability that people would adopt them if they were to go through long-winded moral reflection. Such a decision rule would save everyone time that could be spent on direct work rather than philosophizing, but perhaps more importantly, it would also make it much easier for people to benefit different value systems cooperatively. After all, when one is genuinely uncertain about values, there is no incentive to attain uncooperative benefits for one’s own value system.
So while I think the position that valuing reflection is always the epistemically prudent thing to do rests on dubious assumptions (because of the argument from option value being weak, as well as the reasons alluded to in footnote 24), I think there is an intriguing argument that a norm for valuing reflection is actually best from a moral cooperation perspective – provided that everyone is aware of what the different views on population ethics imply for cause prioritization, and that we have a roughly accurate sense of which attractor states people’s moral reflection would seek out.
Even if everyone went on to primarily focus on interventions that are favored by their own value system or their best guess morality, small steps into the direction of cooperatively taking other perspectives into account can already create a lot of additional value for all parties. To this end, everyone benefits from trying to better understand and account for the cause prioritization implied by different value systems.
This piece benefitted from comments by Tobias Baumann, David Althaus, Denis Drescher, Jesse Clifton, Max Daniel, Caspar Oesterheld, Johannes Treutlein, Brian Tomasik, Tobias Pulver, Simon Knutsson, Kaj Sotala (who also allowed me to use a map on s-risks he made and adapt it for my purposes in the section on AI) and Jonas Vollmer. The section on extinction risks benefitted from inputs by Owen Cotton-Barratt, and I am also thankful for valuable comments and critical inputs in a second round of feedback by Jan Brauner, Gregory Lewis and Carl Shulman.
Armstrong, S. & Sandberg, A. (2013). Eternity in Six Hours: Intergalactic spreading of intelligent life and sharpening the Fermi paradox. Acta Astronautica 89:1-13.
Bostrom, N. (2003). Astronomical Waste: The Opportunity Cost of Delayed Technological Development. Utilitas 15(3):308-314.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press.
Daswani, M. & Leike, J. (2015). A Definition of Happiness for Reinforcement Learning Agents. arXiv:1505.04497.
Greaves, H. (2017). Population axiology. Philosophy Compass, 12:e12442. doi.org/10.1111/phc3.12442.
Hastie, R., & Dawes, R. (2001). Rational choice in an uncertain world: The psychology of judgment and decision making. Thousand Oaks: Sage Publications.
MacAskill, W. (2014). Normative Uncertainty. PhD diss., St Anne’s College, University of Oxford.
Newman, M. E. J. (2006). Power laws, Pareto distributions and Zipf’s law. arXiv:cond-mat/0412004.
Omohundro, S. (2008). The Basic AI Drives. In P. Wang, B. Goertzel, and S. Franklin (eds.). Proceedings of the First AGI Conference, 171, Frontiers in Artificial Intelligence and Applications. Amsterdam: IOS Press.
Parfit, D. (2011). On What Matters. Oxford: Oxford University Press.
Pinker, S. (2011). The Better Angels of our Nature. New York, NY: Viking.
Sotala, K. & Gloor, L. (2017). Superintelligence as a Cause or Cure for Risks of Astronomical Suffering. Informatica 41(4):389–400.
The post Cause prioritization for downside-focused value systems appeared first on Center on Long-Term Risk.
The post Using surrogate goals to deflect threats appeared first on Center on Long-Term Risk.
Agents that threaten to harm other agents, either in an attempt at extortion or as part of an escalating conflict, are an important form of agential s-risks. To avoid worst-case outcomes resulting from the execution of such threats, I suggest that agents add a “meaningless” surrogate goal to their utility function. Ideally, threats would target this “honeypot” rather than the initial goals, which means that escalating threats would no longer lead to large amounts of disvalue.
In this post, I introduce key desiderata for how surrogate goals should work, and outline the challenges that need to be addressed. Many open questions remain, but I am optimistic that the idea can be a useful tool to help mitigate the negative impact of threats.
Let Alice be an agent with a utility function U. For example, suppose Alice wants to make money but cares even more about survival. She potentially faces threats from a second actor (let’s call him Bob) of the form “Unless you do X (e.g. give me money), I’ll kill you”.
To avoid this, she comes up with a smart way to change her utility function. She decides to introduce a “meaningless” surrogate goal – say, she now cares strongly about preventing the existence of a sphere of platinum with a diameter of exactly 42.82cm. The hope is that Bob’s threats are deflected to this new goal, assuming that Alice’s new utility function U’ = U + V puts sufficient weight on preventing the sphere (represented by V). Bob would now make threats of the form “Unless you do X (e.g. give me money), I’ll create a sphere of platinum with a diameter of exactly 42.82cm”.
This trick aims to solve one aspect of the threat problem only – namely, the potential for it to result in an extremely bad outcome if the threat is carried out. Alice might still give away resources when threatened; after all, it would be absolutely horrendous if Bob actually went through with his threat and created the sphere. Ideally, Alice would respond to threats in the same way as before her goal modification, for reasons discussed later.
So utility function modification does not prevent the loss of resources due to extortion, or the risk that a malicious agent might become more powerful due to gaining resources through extortion. More work on a solution for this part of the problem is also necessary, but preventing the risk that threats are carried out (against the original goal) would already go a long way. Surrogate goals can also be combined with any other anti-extortion measure.
Unfortunately, it may be hard for humans to deliberately change their utility function in this way. It is more realistic that the trick can be applied to advanced AI systems. For example, if an AI system controls important financial or economic resources, other AI systems might have an incentive to try to extort it. If the system also uses inverse reinforcement learning or other techniques to infer human preferences, then the threat might involve the most effective violations of human preferences, such as killing people (assuming that the threatening AI system has the power to do this). Surrogate goals might help mitigate this security problem.
So far, I assumed that the trick is successful in deflecting threats, but it is actually not straightforward to get this to work. In the following, I will discuss the main criteria for successfully implementing utility function modification.
Changing your goals is usually disadvantageous in terms of the original goals since it means that you will optimise for the “wrong” goal. In other words, goal preservation is a convergent instrumental goal. So, when we introduce a surrogate goal, we’d like to ensure that it doesn’t interfere with other goals in non-threat situations.
To achieve this, the surrogate goal could be the minimization of a structure that is so rare that it doesn’t matter in “normal” (non-threat) situations. Spheres of platinum with a diameter of 42.82cm usually don’t occur naturally, so Alice is still free to pursue her other goals – including survival – as long as no threats are made. (It might be better to make it even more specific by adding a certain temperature, specifying a complex and irregular shape, and so on.)
An even more elegant solution is to choose a dormant utility function modification, that is, to introduce a trigger mechanism that causes the modification conditional on being threatened. This ensures non-interference with other goals. Less formally speaking, this corresponds to disvaluing spheres of platinum (or any other surrogate goal) only if they are created as the result of a threat, while remaining indifferent towards natural instances.
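Here is a rough sketch of the dormant variant, building on the snippet above; the flag sphere_created_as_threat is an invented placeholder for a real threat-detection mechanism, which is the hard part the next paragraph turns to.

```python
def dormant_surrogate_term(outcome):
    """Dormant variant of V: only spheres created as the result of a threat count.
    The 'sphere_created_as_threat' flag stands in for actual threat detection."""
    if outcome["platinum_sphere_42_82cm_exists"] and outcome["sphere_created_as_threat"]:
        return -10_000_000
    return 0  # naturally occurring spheres remain a matter of indifference
```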
This requires a mechanism to detect (serious) threats. In particular, it’s necessary to distinguish threats from positive-sum trade, which turns out to be quite difficult. (Learning to reliably detect threats using neural networks or other machine learning methods may be a critical problem in worst-case AI safety.)
To the extent possible, the surrogate goal should be orthogonal to your original goals. This ensures that it’s not easily possible to simply combine both threats. For example, if Alice’s surrogate goal is “prevent murder” – which isn’t orthogonal – then Bob can target the surrogate goal and the original goal simultaneously with a death threat.
Even for orthogonal goals, Bob might still decide to threaten both goals (death and the creation of the sphere). Caring more about the surrogate goal than about the initial goal is not sufficient to make sure that this does not happen. For example, Bob might (depending on circumstances) want to make his threat as big as possible to force Alice to give in.
It might be safer to choose a continuous and unbounded surrogate goal instead of a binary surrogate goal like “prevent the existence of a single platinum sphere”; for instance, the disvalue could be a function of the size of the sphere. This is an improvement because a threatener who wants to increase the stakes will now create bigger spheres rather than adding the initial goal to his threat.
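In the same toy notation, a continuous and unbounded version might look as follows (again purely illustrative):

```python
def continuous_surrogate_term(outcome):
    """Unbounded variant of V: disvalue scales with the diameter of any sphere created
    as a threat, so a threatener who wants to raise the stakes builds a bigger sphere
    rather than adding Alice's initial goals to the threat."""
    return -1_000_000 * outcome["threat_sphere_diameter_cm"]
```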
So far, I assumed that the kind of threat that is made simply depends on Alice’s utility function. But it’s actually more accurate to say that it depends on the threatener’s beliefs about her utility function. If Bob believes that Alice cares about her surrogate goal, even though Alice didn’t actually modify, then he will still threaten the surrogate goal.
In this case, his threats arguably wouldn’t work. So maybe it’s even better to just pretend that you changed your utility function?
Of course, the problem is that Bob might see through this, which would mean that he threatens the initial goal after all. (Misrepresenting your values may still be an interesting anti-threat strategy, but this is beyond the scope of this post.)
It’s also possible that Alice actually modifies her utility function, but Bob thinks it’s a ruse. Now, this seems particularly dangerous because it involves threats against the initial goal and you might worry that Alice would not respond “correctly” (whatever this means) to such threats anymore. Alice’s new utility function still includes the initial goal, though, so she continues to react to threats against the initial goal. The surrogate goal does not help in this case, but the result is at least not worse than what would happen by default (without utility function modification).
To successfully deflect threats to the surrogate goal, Alice needs to be able to credibly broadcast that she now cares most about this. This is a nontrivial problem, which is exacerbated by the fact that after modifying, Alice has strong incentives to keep her surrogate goal secret – after all, leaking the information that she cares about preventing spheres of platinum might lead to threats! It is thus easier to broadcast the utility function modification before actually carrying it out.
Fortunately, the problem of credible broadcasting may disappear in threats that involve advanced artificial intelligence. For instance, it may become possible to run faithful simulations of the other party, which means that a threatener could verify the utility function modification. Also, rather than starting out with a certain utility function and later modifying it, we could equip AI systems with a surrogate goal from the start.
Modifying your utility function may increase or decrease the attractiveness of threats against you. For example, threatening to create the sphere may be more attractive to Bob than issuing a death threat because death threats are illegal. I will call a modification threatener-friendly (it increases the attractiveness of threats), threatener-neutral (it keeps the attractiveness of threats constant), or threatener-hostile (it decreases the attractiveness of threats).
In the following, I will argue that the utility function modification should be as close to threatener-neutral as possible.
Threatener-hostile utility function modification may be risky since it reduces the utility of threateners, which potentially gives them reason for punishment in order to discourage such strategic moves. Unfortunately, this punishment would be directed at the initial goal rather than the surrogate goal, since this is what could deter Alice at the point where she considers modifying her utility function.
This is not a knock-down argument, and threatener-hostile moves – such as strongly pre-committing to not give in, or caring intrinsically about punishing extortionists – might turn out to be valuable anti-threat measures. Still, the idea of this post is intriguing precisely because, in a threatener-neutral or threatener-friendly variant, it helps to avoid (the consequences of) threats without incentivizing punishment. In particular, it might be helpful to introduce a surrogate goal before adopting other (threatener-hostile) tricks, so that any punishment directed at those tricks is already deflected.
That said, a threatener-friendly utility function modification is also undesirable simply because it helps threateners gain resources. Making extortion more attractive is bad in expectation for most agents due to the negative-sum nature of threats. So, the ideal surrogate goal is threatener-neutral, averting the possibility of extremely bad outcomes without changing other parameters.
Unfortunately, this is a difficult problem. The feasibility of threats is a (complex) function of empirical circumstances, and these circumstances might change in the future. Creating spheres of platinum may become easy because of advances in mining technology, or it might be hard because all the platinum is used elsewhere. The circumstances may also differ from threatener to threatener.
It therefore seems desirable to use a surrogate goal that’s similar to the initial goal in that its vulnerability to threats as a function of empirical circumstances is strongly tied to the vulnerability of the initial goal, while still being orthogonal in the sense of the previous section. Rather than picking a single surrogate goal, you could pick a "surrogate goal function" that maps every environment to a surrogate goal in a way that maintains threatener-neutrality.
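One way to picture such a surrogate goal function, as a toy sketch with invented environment parameters: in each environment, pick the sphere specification whose cost of creation tracks the cost of carrying out a threat against the initial goal.

```python
import math

def surrogate_goal_for(environment):
    """Toy 'surrogate goal function' (invented parameters): choose the sphere size whose
    production cost in this environment matches the cost of executing a threat against
    the initial goal, so that the surrogate threat stays roughly threatener-neutral."""
    target_cost = environment["cost_of_executing_threat_against_initial_goal"]
    volume_cm3 = target_cost / environment["platinum_cost_per_cm3"]
    diameter_cm = 2 * (3 * volume_cm3 / (4 * math.pi)) ** (1 / 3)
    return {"prevent_threat_created_sphere_of_diameter_cm": diameter_cm}
```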
In light of these difficulties, we might hope that future AI systems will be able to figure out the details if they reason correctly about game theory and decision theory, or that the idea is sufficiently robust to small perturbations. It’s even conceivable that threateners would tell future threatenees about the idea and how to best implement it (presumably in exchange for making it slightly threatener-friendly).
So far, I only considered a simplified two-agent case. Our real world, however, features a large number of agents with varying goals, which causes additional complications if many of them modify their utility function.
A key question is whether different agents should use the same or different surrogate goals, especially if their original goals are similar. For example, suppose Alice has relatives that also don’t want her to die. Suppose they also modify their utility function to defuse threats, choosing different (unrelated) surrogate goals.
Now, a threat against the surrogate goal targets a different set of people – a single individual rather than the entire family – compared to a threat against the initial goal. This is problematic because threateners may prefer to target the initial goal after all, to threaten the entire family at the same time, which may or may not be more attractive depending on the circumstances.
On the flip side, if the (initial) goals of different agents overlap only partially, they may prefer to not choose the same surrogate goal. This is because you don’t want to potentially lose resources because of threats that would otherwise (pre-modification) only or mostly target others. Also, similar to the above point, it may be more effective to threaten the different agents individually, so this fails to achieve the goal of deflecting threats to the surrogate goal under all circumstances.
To solve this problem, the agents could associate each initial goal with a surrogate goal in a way that preserves the “distance” between goals, so that different initial goals are mapped onto different surrogate goals, and vice versa. More formally, we need an isometric mapping from the space of initial goals to the space of surrogate goals. This follows a pattern which we’ve encountered several times in this post: threats against the surrogate goal should be as similar as possible, in terms of their feasibility as a function of empirical circumstances, to threats against the initial goal.
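In symbols, one way to state the requirement (my formalisation, not notation used elsewhere in this post):

```latex
% Let $(G, d_G)$ be the space of initial goals and $(S, d_S)$ the space of surrogate
% goals, each equipped with a notion of distance between goals. We want a map
\[
  \sigma : G \to S
  \qquad \text{with} \qquad
  d_S\bigl(\sigma(g_1), \sigma(g_2)\bigr) = d_G(g_1, g_2)
  \quad \text{for all } g_1, g_2 \in G,
\]
% i.e. an isometry: distinct initial goals receive distinct surrogate goals, and initial
% goals that (nearly) coincide are sent to surrogate goals that (nearly) coincide.
```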
Finding and implementing an isometric mapping that’s used by all agents is a difficult coordination problem. Prima facie, it’s unclear how this could be solved, given how arbitrary the choice of surrogate goals is. To get different surrogate goals, you could use long sequences of random numbers, but using the same or similar surrogate goals may require some kind of Schelling point if the agents can’t communicate.
What’s worse, this might also give rise to a cooperation problem. It is possible that agents have incentives to choose a different surrogate goal than the one associated with their initial goal according to the mapping. For instance, perhaps it’s better to choose a less common surrogate goal because this means fewer threats target your surrogate goal, which means you’re less likely to waste resources responding to such threats. Or perhaps it’s better to choose a more common surrogate goal if all the agents sharing this surrogate goal are powerful enough to prohibit threats.
It’s hard to say how exactly this would work without a better understanding of multi-agent threat dynamics, which we currently lack.
Instead of a “meaningless” surrogate goal, you could take the idea a step further by choosing a surrogate goal whose violation would be good (according to the initial utility function). You can pick an outcome that’s ideal or close to ideal and choose the surrogate goal of preventing that outcome from happening. In this case, the execution of threats would – counterintuitively – lead to very good outcomes. Ensuring non-interference with other goals is tricky in this case, but can perhaps be solved if the modification is dormant (as described above).
This variant seems particularly brittle, but if it works, it’s possible to turn worst-case outcomes from threats into utopian outcomes, which would be a surprisingly strong result.
As we’ve seen, the problem with surrogate goals is that it’s quite hard to specify them properly. We could take inspiration from the idea of indirect normativity, which is about an indirect description of (ethical) goals, i.e. “what I care about is what I would care about after a century of reflection”. Similarly, we could also define an indirect surrogate goal. Alice could simply say “I care about the surrogate goal that I’d choose if I was able to figure out and implement all the details”. (Needless to say, it may be hard to implement such an indirect specification in AI systems.)
It’s not possible to threaten an indeterminate goal, though, which might mean that threateners circle back to the initial goal if the surrogate goal could be anything. So this idea seems to require that the threatener is able to figure out the “ideal” surrogate goal or will become able to figure it out in the future. It’s also conceivable (albeit speculative) that the ideal surrogate goal would compensate for the reduced vulnerability to threats due to indeterminacy by being more threatener-friendly along other dimensions.
Agents that use an updateless decision theory (UDT) reason in terms of the optimal policy – the mapping of inputs to actions – rather than choosing the best option at any given moment. An advantage of UDT in the context of surrogate goals is that UDT agents don’t need “hacky” solutions like self-modification. If the best policy for maximizing utility function U is to act like a U+V maximizer for certain inputs – specifically, to take threats against the surrogate goal V seriously – then the UDT agent will simply do so.
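As a toy rendering of that point – all numbers, and the assumption about how the threatener responds to each policy, are invented for illustration:

```python
# Policies map the input "which goal is threatened" to a response. We assume the
# threatener predicts Alice's policy and targets whatever that policy responds to.

def expected_initial_utility(policy):
    """Expected value of Alice's original goals U under each policy (toy numbers)."""
    if policy["sphere_threat"] == "give in":
        return -10                       # threats are deflected to the harmless sphere
    if policy["death_threat"] == "give in":
        return -10 - 0.05 * 1_000_000    # death threats return, with some chance of execution
    return -0.20 * 1_000_000             # resist everything: threats sometimes carried out

policies = [
    {"sphere_threat": "give in", "death_threat": "give in"},
    {"sphere_threat": "resist",  "death_threat": "give in"},
    {"sphere_threat": "resist",  "death_threat": "resist"},
]
best = max(policies, key=expected_initial_utility)
# Under these assumptions, the U-maximising policy already takes sphere threats
# seriously, i.e. it behaves like a U+V maximiser on those inputs.
assert best["sphere_threat"] == "give in"
```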
This framework arrives at the same conclusions, but it might be a “cleaner” way to think about the topic as it dispenses with fuzzy terms such as “utility function modification”.
Eliezer Yudkowsky introduces the idea of surrogate goals in his post on Separation from hyperexistential risk as a patch to avoid disutility maximization. He argues that the patch fails because the resulting utility function is not reflectively consistent. Indeed, the modified agent may have an incentive to apply the same trick again, replacing the (now meaningful) spheres of platinum with yet another surrogate goal. The agent may also want to remove the surrogate goal from its utility function (to avoid threats against it).
To avoid this, she needs to fix the new values by committing to not modifying her utility function again – for instance, she could care intrinsically about retaining the (modified) utility function. This may or may not be a satisfactory solution, but I don’t think (contra Eliezer) that this constitutes an insurmountable problem. (As described above, this seems to be unproblematic for UDT agents.)
In a discussion of how to prevent accidental maximization of disvalue, Robert Miles asks:
Does there exist any utility function that results in good outcomes when maximised but does not result in bad outcomes when minimised?
Surrogate goals are a possible answer to this question, which, in a sense, is more general than the question of how to prevent (the bad outcomes of) threats. I also like Stuart Armstrong’s solution:
Yes. Let B1 and B2 be excellent, bestest outcomes. Define U(B1)=1, U(B2)=-1, and U=0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
Stuart Armstrong also proposes a variant of utility function modification that aims to reduce the size of threats by cutting off the utility function at a certain level. Comparing the advantages and drawbacks of each variant would be beyond the scope of this text, but future research on this would be highly valuable.
As we’ve seen, any utility function modification must be calibrated well in order to work. Trying to specify the details turns out to be surprisingly difficult. More work on these problems is needed to enable us to implement robust utility function modification in advanced AI systems.
Finally, I’d like to emphasize that it would be ideal if (the bad kind of) threats could be avoided completely. However, given that we don’t yet know how to reliably achieve this, moving threats to the realm of the meaningless (or even beneficial) is a promising way to mitigate agential s-risks.
Caspar Oesterheld initially brought up the idea in a conversation with me. My thinking on the topic has also profited enormously from internal discussions at the Foundational Research Institute. Daniel Kokotajlo inspired my thinking on the multi-agent case.
I am indebted to Brian Tomasik, Johannes Treutlein, Caspar Oesterheld, Lukas Gloor, Max Daniel, Abram Demski, Stuart Armstrong and Daniel Kokotajlo for valuable comments on an earlier draft of this text.
The post Superintelligence as a Cause or Cure for Risks of Astronomical Suffering appeared first on Center on Long-Term Risk.
Discussions about the possible consequences of creating superintelligence have included the possibility of existential risk, often understood mainly as the risk of human extinction. We argue that suffering risks (s-risks), where an adverse outcome would bring about severe suffering on an astronomical scale, are risks of a comparable severity and probability as risks of extinction. Preventing them is the common interest of many different value systems. Furthermore, we argue that in the same way as superintelligent AI both contributes to existential risk but can also help prevent it, superintelligent AI can be both the cause of suffering risks and a way to prevent them from being realized. Some types of work aimed at making superintelligent AI safe will also help prevent suffering risks, and there may also be a class of safeguards for AI that helps specifically against s-risks.
Read on the publisher's website.
The post Self-improvement races appeared first on Center on Long-Term Risk.
Most of my readers are probably familiar with the problem of AI safety: If humans create super-human artificial intelligence, the task of programming it in such a way that it behaves as intended is non-trivial. There is a risk that the AI will act in unexpected ways, and given its super-human intelligence, it would then be hard to stop.
I assume fewer are familiar with the problem of AI arms races. (If you are, you may well skip this paragraph.) Imagine two opposing countries which are trying to build a super-human AI to reap the many benefits and potentially attain a decisive strategic advantage, perhaps taking control of the future immediately. (It is unclear whether this latter aspiration is realistic, but it seems plausible enough to significantly influence decision making.) This creates a strong motivation for the two countries to develop AI as fast as possible. This is the case especially if the countries would dislike a future controlled by the other. For example, North Americans may fear a future controlled by China. In such cases, countries would want to invest most available resources into creating AI first with less concern for whether it is safe. After all, letting the opponent win may be similarly bad as having an AI with entirely random goals. It turns out that under certain conditions both countries would invest close to no resources into AI safety and all resources into AI capability research (at least, that's the Nash equilibrium), thus leading to an unintended outcome with near certainty. If countries are sufficiently rational, they might be able to cooperate to mitigate risks of creating uncontrolled AI. This seems especially plausible given that the values of most humans are actually very similar to each other relative to how alien the goals of a random AI would probably be. However, given that arms races have frequently occurred in the past, a race toward human-level AI remains a serious worry.
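To illustrate the equilibrium claim, here is a toy version of the race; all payoffs are my own illustrative assumptions (winning with an aligned AI is worth 1, while a misaligned AI – or a future controlled by the other side, which the countries value roughly as badly – is worth 0):

```python
import itertools

P_ALIGNED = {"safe": 0.9, "fast": 0.5}   # chance the winner's AI does what they intended

def expected_payoff(me, other):
    p_win = 0.5 if me == other else (1.0 if me == "fast" else 0.0)
    return p_win * P_ALIGNED[me]

for a, b in itertools.product(["safe", "fast"], repeat=2):
    print(f"({a}, {b}): {expected_payoff(a, b):.2f} / {expected_payoff(b, a):.2f}")

# (safe, safe) yields 0.45 each, but either side gains by switching to "fast" (0.50);
# (fast, fast) yields only 0.25 each, yet neither side gains by switching back (0.00).
# So (fast, fast) is the unique Nash equilibrium, even though both prefer (safe, safe).
```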
Handing power over to AIs holds both economic promise and a risk of misalignment. Similar problems actually haunt humans and human organizations. Say, a charity hires a new director who has been successful in other organizations. Then this creates the opportunity for the charity to rise in influence. However, it is also possible that the charity changes in a way that the people currently or formerly in charge wouldn't approve of. Interestingly, the situation is similar for AIs which create other AIs or self-improve themselves. Learning and self-improvement are the paths to success. However, self-improvements carry the risk of affecting the goal-directed behavior of the system.
The existence of this risk seems plausible prima facie: It should be strictly easier to find self-improvements that "probably" work than to identify self-improvements that are guaranteed to work – the former is a superset of the latter. So, AIs which are willing to take risks while self-improving can improve faster.
There are also formal reasons why proving self-improvements correct is difficult. Specifically, Rice's theorem states that for any non-trivial semantic property p (a property of a program's input–output behaviour), there is no general procedure that decides for all programs whether they have property p. (If you know about the undecidability of the halting problem, Rice's theorem follows almost immediately from it.) As a special case, deciding for all programs whether they pursue some given goals is impossible. Of course, this does not mean that proving self-improvements to be correct is impossible. After all, an AI could just limit itself to the self-improvements that it can prove correct (see this discussion between Eliezer Yudkowsky and Mark Waser). However, without this limitation – e.g., if it can merely test some self-improvement empirically and implement it if it seems to work – an AI can use a broader range of possible self-modifications and thus improve more quickly. (In general, testing a program also appears to be a lot easier than formally verifying it, but that's a different story.) Another relevant result from (provability) logic is Löb's theorem, which roughly implies that a logical system containing Peano arithmetic cannot prove the soundness of another system with at least that power – including itself.
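For reference, one standard statement of the theorem (my paraphrase):

```latex
% Rice's theorem: write $\varphi_e$ for the partial computable function computed by
% program $e$, and let $P$ be any non-trivial set of partial computable functions
% (some computable function lies in $P$, some does not). Then the index set
\[
  \{\, e \mid \varphi_e \in P \,\} \quad \text{is undecidable.}
\]
% "Pursues goal G" is such a non-trivial semantic property, so no single procedure can
% decide it for every program -- although particular programs can still be verified if
% the AI restricts itself to a proof-friendly subset, as noted above.
```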
Lastly, consider Stephen Wolfram's more fuzzy concept of computational irreducibility. It basically states that as soon as a system can produce arbitrarily complex behavior (i.e., as soon as it is universal in some sense), predicting how most aspects of the system will behave becomes fundamentally hard. Specifically, he argues that for most (especially for complex and universal) systems, there is no way to find out how they behave other than running them.
So, self-improvement can give AIs advantages and ultimately the upper hand in a conflict, but if done too hastily, it can also lead to goal drift. Now, consider the situation in which multiple AIs compete in a head-to-head race. Based on the above considerations this case becomes very similar to the AI arms races between groups of humans. Every single AI has incentives to take risks to increase its probability of winning, but overall this can lead to unintended outcomes with near certainty. There are reasons to assume that this self-improvement race dynamic will be more of a problem for AIs than it is for human factions. The goals of different AIs could diverge much more strongly than the goals of different humans. Whereas human factions may prefer the enemy's win over a takeover by an uncontrolled AI, an AI with human values confronting an AI with strange values has less to lose from risky self-modifications. (There are some counter-considerations as well. For instance, AIs may be better at communicating and negotiating compromises.)
Thus, a self-improvement race between AIs seems to share the bad aspects of AI arms races between countries. This has a few implications:
The post Overview: Evidential Cooperation in Large Worlds (ECL) appeared first on Center on Long-Term Risk.
Some papers closely related to multiverse-wide superrationality can be found in section 6.1 of the 2017 paper.
The post Commenting on MSR, Part 1: Multiverse-wide cooperation in a nutshell appeared first on Center on Long-Term Risk.
(Disclaimer: Especially for the elevator pitch section here, I am sacrificing accuracy and precision for brevity. References can be found in Caspar’s paper.)
It would be an uncanny coincidence if the observable universe made up everything that exists. The reason we cannot find any evidence for there being stuff beyond the edge of the observable universe is not that there is likely nothingness out there, but that photons from further away simply have not had sufficient time since the big bang to reach us. This means that the universe we find ourselves in may well be vastly larger than what we can observe, perhaps even infinitely larger. In addition, the theory of inflationary cosmology hints at the existence of other universe bubbles with different fundamental constants, forming or disappearing under certain conditions and somehow co-existing with our universe in parallel. The umbrella term multiverse captures the idea that the observable universe is just a tiny portion of everything that exists. An infinite multiverse (of one sort or another) is actually amongst the most popular cosmological hypotheses, arguably even favored by the majority of experts.
Many ethical theories (in particular most versions of consequentialism) do not consider spatial distance relevant to moral value. After all, suffering and the frustration of one’s preferences are bad for someone regardless of where (or when) they happen. This principle should apply even when we consider worlds so far away from us that we can never receive any information from there. Moral concern over what happens elsewhere in the multiverse is one requirement for the idea I am now going to discuss.
Multiverse-wide cooperation via superrationality (abbreviation: MSR) is the idea that, if I think about different value systems and their respective priorities in the world, I should not work on the highest priority according to my own values, but on whatever my comparative advantage is amongst all the interventions favored by the value systems of agents interested in multiverse-wide cooperation. (Another route to gains from trade is to focus on convergent interests, pursuing interventions that may not be the top priority for any particular value system, but are valuable from a maximally broad range of perspectives.) For simplicity reasons, I will refer to this as simply “cooperating” from now on.
A decision to cooperate, according to some views in decision theory, gives me rational reason to believe that agents in similar decision situations elsewhere in the multiverse, especially the ones who are most similar to myself in how they reason about decision problems, are likely to cooperate as well. After all, if two very similar reasoners think about the same decision problem, they are likely to reach identical answers. This suggests that they will end up either both cooperating, or both defecting. Assuming that the way agents find decisions is not strongly constrained or otherwise affected by their values, we can expect there to be agents with different values who reason about decision problems the same way we do, who come to identical conclusions. Cooperation then produces gains from trade between value systems.
While each party would want to be the sole defector, the mechanism behind multiverse-wide cooperation – namely that we have to think of ourselves as being coupled with those agents in the multiverse who are most similar to us in their reasoning – ensures that defection is disincentivized: Any party that defects would now have to expect that their highly similar counterparts would also defect.
The closest way to approximate the value systems of agents in other parts of the multiverse, given our ignorance about what the multiverse looks like, is to assume that at least substantial parts of it are going to be similar to how things are here, where we can study them. A minimally viable version of multiverse-wide cooperation can therefore be thought of as all-out “ordinary” cooperation with value systems we know well (and especially ones that include proponents sympathetic to MSR reasoning). This suggests that, while MSR combines speculative-sounding ideas such as non-standard decision theory and the existence of a multiverse, its implications may not be all that strange and largely boil down to the proposal that we should be “maximally” cooperative towards other value systems.
Leaving aside for the moment the whole part about the multiverse, MSR is fundamentally about cooperating in a prisoner’s-dilemma-like situation with agents who are very similar to ourselves in the way they reason about decision problems. Douglas Hofstadter coined the term superrationality for the idea that one should cooperate in a prisoner’s dilemma if one expects the other party to follow the same style of reasoning. If they reason the same way I do, and the problem they are facing is the same kind of problem I am facing, then I must expect that they will likely come to the same conclusion I will come to. This suggests that the prisoner’s dilemma in question is unlikely to end with an asymmetric outcome – (cooperate | defect) or (defect | cooperate) – but likely to end with a symmetric outcome – (cooperate | cooperate) or (defect | defect). Because (cooperate | cooperate) is the best outcome for both parties amongst the symmetric outcomes, superrationality suggests one is best served by cooperating.
At this point, readers may be skeptical whether this reasoning works. There seems to be some kind of shady action at a distance involved, where my choice to cooperate is somehow supposed to affect the other party’s choice, even though we are assuming that no information about my decision reaches said other party. But we can think of it this way: If reasoners are deterministic systems, and two reasoners follow the exact same decision algorithm in a highly similar decision situation, it at some point becomes logically contradictory to assume that the two reasoners will end up with diametrically opposed conclusions.
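A minimal illustration of both points – the payoff numbers are the standard textbook prisoner’s dilemma; the rest is my own toy construction:

```python
PAYOFFS = {  # payoffs to the row player in a standard prisoner's dilemma
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def decide(expects_identical_reasoner: bool) -> str:
    if expects_identical_reasoner:
        # Only symmetric outcomes are attainable, so compare (C, C) with (D, D).
        return "C" if PAYOFFS[("C", "C")] > PAYOFFS[("D", "D")] else "D"
    return "D"  # treating the other's choice as fixed, defection dominates

# Two agents running the *same* deterministic procedure on the same problem:
alice, bob = decide(True), decide(True)
assert alice == bob                   # identical code on identical input: same output
assert (alice, bob) == ("C", "C")     # among the symmetric outcomes, mutual cooperation wins
```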
Side note: By decision situations having to be “highly similar,” I do not mean that the situations agents find themselves in have to be particularly similar with respect to little details in the background. What I mean is that they should be highly similar in terms of all decision-relevant variables, the variables that are likely to make a difference to an agent’s decision. If we imagine a simplified decision situation where agents have to choose between two options, either press a button or not (and then something happens or not), it probably matters little whether one agent has the choice to press a red button and another agent is faced with pressing a blue button. As long as both buttons do the same thing, and as long as the agents are not (emotionally or otherwise) affected by the color differences, we can safely assume that the color of the button is highly unlikely to play a decision-relevant role. What is more likely relevant are things such as the payoffs (value according to what an agent cares about) the agents expect from the available options. If one agent believes they stand to receive positive utility from pressing the button, and the other stands to receive negative utility, then that is guaranteed to make a relevant difference as to whether the agents will want to press their buttons. Maybe the payoff differentials are also relevant sometimes, or at least relevant with some probability: If one agent only gains a tiny bit of utility, whereas the other agent has an enormous amount of utility to win, the latter agent might be much more motivated to avoid making a suboptimal decision. While payoffs and payoff structures certainly matter, it is unlikely that it matters what qualifies as a payoff for a given agent: If an agent who happens to really like apples will be rewarded with tasty apples after pressing a button, and another agent who really likes money is rewarded with money, their decision situations seem the same provided that they each care equally strongly about receiving the desired reward. (This is the intuition behind the irrelevance of specific value systems for whether two decision algorithms or decision situations are relevantly similar or not. Whether one prefers apples, money, carrots or whatever, math is still math and decision theory is still decision theory.)
A different objection that readers may have at this point concerns the idea of superrationally “fixing” other agents’ decisions. Namely, critics may point out that we are thereby only ever talking about updating our own models, our prediction of what happens elsewhere, and that this does not actually change what was going to happen elsewhere. While this sounds like an accurate observation, the force of the statement rests on a loaded definition of “actually changing things elsewhere” (or anywhere for that matter). If we applied the same rigor to a straightforward instance of causally or directly changing the position of a light switch in our room, a critic may in the same vein object that we only changed our expectation of what was going to happen, not what actually was going to happen. The universe is lawful: nothing ever happens that was not going to happen. What we do when we want to have an impact and accomplish something with our actions is never to actually change what was going to happen; instead, it is to act in the way that best shifts our predictions favorably towards our goals. (This is not to be confused with cheating at prediction: We don’t want to make ourselves optimistic for no good reason, because the decision to bias oneself towards optimism does not actually correlate with our goals getting accomplished – it only correlates with a deluded future self believing that we will be accomplishing our goals.)
For more reading on this topic, I recommend this paper on functional decision theory, the book Evidence, Decision and Causality or the article On Correlation and Causation Part 1: Evidential decision theory is correct. For an overview on different decision theories, see also this summary.
To keep things simple and as uncontroversial as possible, I will follow Caspar’s terminology for the rest of my post here and use the term superrationality in a very broad sense that is independent of any specific flavor of decision theory, referring to a fuzzy category of arguments from similarity of decision algorithms that favor cooperating in certain prisoner’s-dilemma-like situations.
The existence of a multiverse would virtually guarantee that there are many agents out there who fulfill the criteria of “relevant similarity” compared to us with regard to their decision algorithm and decision situations – whatever these criteria may boil down to in detail.
Insertion: Technically, if the multiverse is indeed infinite, there will likely be infinitely many such agents, and infinite amounts of everything in general, which admittedly poses some serious difficulties for formalizing decisions: If there is already an infinite amount of value or disvalue, it seems like all our actions should be ranked the same in terms of the value of the outcome they result in. This leads to so-called infinitarian paralysis, where all actions are rated as equally good or bad. Perhaps infinitarian paralysis is a strong counterargument to MSR. But in that case, we should be consistent: Infinitarian paralysis would then also be a strong counterargument to aggregative consequentialism in general. Because it affects nearly everything (for consequentialists), and because of how drastic its implications would be if there was no convenient solution, I am basically hoping that someone will find a solution that makes everything work again in the face of infinities. For this reason, I think we should not think of MSR as being particularly in danger of failing for reasons of infinitarian paralysis.
Back to object-level MSR: We noted that the multiverse guarantees that there are agents out there very similar to us who are likely to tackle decision problems the same way we do. To prevent confusion, note that MSR is not based on the naive assumption that all humans who find the concept of superrationality convincing are therefore strongly correlated with each other across all possible decision situations. Superrationality only motivates cooperation if one has good reason to believe that another party’s decision algorithm is indeed extremely similar to one’s own. Human reasoning processes differ in many ways, and sympathy towards superrationality represents only one small dimension of one’s reasoning process. It may very well be extremely rare that two people’s reasoning is sufficiently similar that, having common knowledge of this similarity, they should rationally cooperate in a prisoner’s dilemma.
But out there somewhere, maybe on Earth already in a few instances among our eight-or-so billion inhabitants, but certainly somewhere in the multiverse if a multiverse indeed exists, there must be evolved intelligent beings who are sympathetic towards superrationality in the same way we are, who in addition also share a whole bunch of other structural similarities with us in the way they reason about decision problems. These agents would construe decision problems related to cooperating with other value systems in the same way we do, and pay attention to the same factors weighted according to the same decision-normative criteria. When these agents think about MSR, they would be reasonably likely to reach similar conclusions with regard to the idea’s practical implications. These are our potential cooperation partners.
I have to admit that it seems very difficult to tell which aspects of one’s reasoning are more or less important for the kind of decision-relevant similarity we are looking for. There are many things left to be figured out, and it is far from clear whether MSR works at all in the sense of having action-guiding implications for how we should pursue our goals. But the underlying idea here is that once we pile up enough similarities of the relevant kind in one’s reasoning processes (and a multiverse would ensure that there are agents out there who do indeed fulfill these criteria), at some point it becomes logically contradictory to treat the output of our decisions as independent from the decisional outputs of these other agents. This insight seems hard to avoid, and it seems quite plausible that it has implications for our actions.
If I were to decide to cooperate in the sense implied by MSR, I would have to then update my model of what is likely to happen in other parts of the multiverse where decision algorithms highly similar to my own are at play. Superrationality says that this update in my model, assuming it is positive for my goal achievement because I now predict more agents to be cooperative towards other value systems (including my own), in itself gives me reason to go ahead and act cooperatively. If we manage to form even a crude model of some of the likely goals of these other agents and how we can benefit them in our own part of the multiverse, then cooperation can already get off the ground and we might be able to reap gains from trade.
Alternatively, if we decide against becoming more cooperative, we learn that we must be suffering costs from mutual defection. This includes both opportunity costs and direct costs from cases where other parties’ favored interventions may hurt our values.
We are assuming that we care about what happens in other parts of the multiverse. For instance, we might care about increasing total happiness. If we further assume that decision algorithms and the values/goals of agents are distributed orthogonally – meaning that one cannot infer someone’s values simply by seeing how they reason practically about epistemic matters – then we arrive at the conceptualization of a multiverse-wide prisoner’s dilemma.
(Note that we can already observe empirically that effective altruists who share the same values sometimes disagree strongly about decision theory (or more generally reasoning styles/epistemics), and effective altruists who agree on decision theory sometimes disagree strongly about values. In addition, as pointed out in section one, there appears to be no logical reason as to why agents with different values would necessarily have different decision algorithms.)
The cooperative action in our prisoner’s dilemma would now be to take other value systems into account in proportion to how prevalent they are in the multiverse-wide compromise. We would thus try to benefit them whenever we encounter opportunities to do so efficiently, that is, whenever we find ourselves with a comparative advantage to strongly benefit a particular value system. By contrast, the action that corresponds to defecting in the prisoner’s dilemma would be to pursue one’s personal values with zero regard for other value systems. The payoff structure is such that an outcome where everyone cooperates is better for everyone than an outcome where everyone defects, but each party would prefer to be a sole defector.
Consider for example someone who is in an influential position to give advice to others. This person can either tailor their advice to their own specific values, discouraging others from working on things that are unimportant according to their personal value system, or they can give advice that is tailored towards producing an outcome that is maximally positive for the value systems of all superrationalists, perhaps even investing substantial effort researching the implications of value systems different from their own. MSR provides a strong argument for maximally cooperative behavior, because by cooperating, the person in question ensures that there is more such cooperation in other parts of the multiverse, which in expectation also strongly benefits their own values.
Of course there are many other reasons to be nice to other value systems (in particular reasons that do not involve aliens and infinite worlds). What is special about MSR is mostly that it gives an argument for taking the value systems of other superrationalists into account maximally and without worries of getting exploited for being too forthcoming. With MSR, mutual cooperation is achieved by treating one’s own decision as a simulation/prediction for agents relevantly similar to oneself. Beyond this, there is no need to guess the reasoning of agents who are different. The updates one has to make based on MSR considerations are always symmetrical for one’s own actions and the actions of other parties. This mechanism makes it impossible to enter asymmetrical (cooperate-defect or defect-cooperate) outcomes.
(Note that the way MSR works does not guarantee direct reciprocity in terms of who benefits whom: I should not choose to benefit value system X in my part of the multiverse in the hope that advocates of value system X in particular will, in reverse, be nice to my values here or in other parts of the multiverse. Instead, I should simply benefit whichever value system I can benefit most, in the expectation that whichever agents can benefit my values the most – and possibly that turns out to be someone with value system X – will actually cooperate and benefit my values. To summarize, hoping to be helped by value system X for MSR-reasons does not necessarily mean that I should help value system X myself – it only implies that I should conscientiously follow MSR and help whoever benefits most from my resources.)
Before we can continue with the main body of explanation, I want to proactively point out that MSR is different from acausal trade, which has been discussed in the context of artificial superintelligences reasoning about each other’s decision procedures. There is a danger that people lump the two ideas together, because MSR does share some similarities with acausal trade (and can arguably be seen as a special case of it). Namely, both MSR and acausal trade are standardly discussed in a multiverse context and rely crucially on acausal decision theories. There are, however, several important differences: In the acausal trade scenario, two parties simulate each other’s decision procedures to prove that one’s own cooperation ensures cooperation in the other party. MSR, by contrast, does not involve reasoning about the decision procedures of parties different from oneself. In particular, MSR does not involve reasoning about whether a specific party’s decisions have a logical connection with one’s own decisions or not, i.e., whether the choices in a prisoner’s-dilemma-like situation can only result in symmetrical outcomes or not. MSR works through the simple mechanism that one’s own decision is assumed to already serve as the simulation/prediction for the reference class of agents with relevantly similar decision procedures.
MSR is therefore based mostly on looser assumptions than acausal trade, because it does not require having the technological capability to accurately simulate another party’s decision algorithm. There is one aspect in which MSR is based on stronger assumptions than acausal trade. Namely, MSR is based on the assumption that one’s own decision can function as a prediction/simulation for not just identical copies of oneself in a boring twin universe where everything plays out exactly the same way as in our universe, but also for an interesting spectrum of similar-but-not-completely-identical parts of the multiverse that include agents who reason the same way about their decisions as we do, but may not share our goals. This is far from a trivial assumption, and I strongly recommend doing some further thinking about this assumption. But if the assumption does go through, it has vast implications for not (just) the possibility of superintelligences trading with each other, but for a form of multiverse-wide cooperation that current-day humans could already engage in.
The line of reasoning employed in MSR is very similar to the reasoning employed in anthropic decision problems. For comparison, take the idea that there are numerous copies of ourselves across many ancestor simulations. If we thought this was the case, reasoning anthropically as though we control all our copies at once could, for certain decisions, change our prioritization: If my decision to reduce short-term suffering plays out the same way in millions of short-lived simulated versions of earth, where a focus on the far future cannot pay off, I have more reason to focus on short-term suffering than I thought.
MSR applies a similar kind of reasoning where we shift our thinking from being a single instance of something to thinking in terms of deciding for an entire class of agents. MSR is what follows when one extends/generalizes the anthropic slogan “Acting as though you are all your (subjectively identical) copies at once” to “Acting as though you are all copies of your (subjective probability distribution over your) decision algorithm at once.”
Rather than identifying solely with one’s subjective experiences and one’s goals/values, MSR also involves “identifying with” – on the level of predicting consequences relevant to one’s decision – one’s general decision algorithm. If the assumptions behind MSR are sound, then deciding not to change one’s actions based on MSR has to cause an update in one’s world model, an update about other agents in one’s reference class also not cooperating. So the underlying reasoning that motivates MSR is something that has to permeate our thinking about how to have an impact on the world, whether we decide to let it affect our decisions or not. MSR is a claim about what is rational to do given that our actions have an impact in a broader sense than we may initially think, spanning across all instances of one’s decision algorithm. It changes our EV calculations and may in some instances even flip the sign – net positive/negative – of certain interventions. Ignoring MSR is therefore not necessarily the default, “safe” option.
Once we start deliberating whether to account for the goals of other agents in the multiverse, we run into the problem that we have a very poor idea of what the multiverse looks like. The multiverse may contain all kinds of strange things, including worlds where physical constants are different from the ones in our universe, or worlds where highly improbable things keep happening for the same reason that, if you keep throwing an infinite number of fair coins, some of them somewhere will produce uncanny sequences like “always heads” or “always tails.”
Because it seems difficult and intractable to envision all the possible landscapes in different parts of the multiverse, what kind of agents we might find there, and how we can benefit the goals of these agents with our resources here, one might be tempted to dismiss MSR for being too impractical a consideration. However, I think this would be a premature dismissal. We may not know anything about strange corners of the multiverse, but we know at the very least how things are in our observable universe. As long as we cannot say anything substantial about how, specifically, the parts of the multiverse that are completely different from the things we know actually differ from our environment, we may as well ignore those parts. For practical purposes, we do not have to speculate about parts of the multiverse that would be completely alien to us (yay!), and can instead focus on what we already know from direct experience. After all, our world is likely to be representative of some other worlds in the multiverse. (This holds for the same reason that a randomly chosen television channel is more likely than not to be somewhat representative of some other television channels, rather than being completely unlike any other channel.) Therefore, we can be reasonably confident that out there somewhere, there are planets with an evolutionary history that, although different from ours in some ways, also produced intelligent observers who built a technologically advanced civilization. And while many of these civilizations may contain agents with value systems we have never thought about, some of these civilizations will also contain earth-like value systems.
In any case, it seems plausible that our comparative advantage lies in helping those value systems about which we can obtain the most information. If we survey the values of people on earth, and perhaps also how much these values correlate with sympathy for the concept of superrationality and for taking weird arguments to their logical conclusion, this already gives us highly useful information about the values of potential cooperators in the multiverse. MSR then implies strong cooperation with value systems that we already know (perhaps adjusted by the degree to which their proponents are receptive to MSR ideas).
By “strong cooperation,” I mean that one should ideally pick interventions based on considerations of personal comparative advantage: If there is a value system for which I could create an extraordinary amount of (variance-adjusted; see chapter 3 of this dissertation for an introduction) value given my talents and position in the world, I should perhaps focus exclusively on benefitting that value system. Meta-interventions that are positive for many value systems at once also receive a strong boost from MSR considerations and should plausibly be pursued with high effort even if they do not come out as the top priority absent MSR considerations. (Examples of such interventions include making sure that any superintelligent AIs that are built can cooperate with other AIs, or encouraging people who are uncertain about their values not to waste time on philosophy and instead try to benefit existing value systems MSR-style.) Finally, one should also look for more cooperative alternatives when considering interventions that, although positive for one’s own value system, may in expectation cause harm to other value systems.
The post S-risk FAQ appeared first on Center on Long-Term Risk.
In the essay Reducing Risks of Astronomical Suffering: A Neglected Priority, s-risks (also called suffering risks or risks of astronomical suffering) are defined as “events that would bring about suffering on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far”.
If you’re not yet familiar with the idea, you can find out more by watching Max Daniel's EAG Boston talk or by reading the introduction to s-risks.
In the future, it may become possible to run such complex simulations that the (artificial) individuals inside these simulations are sentient. Nick Bostrom coined the term mindcrime for the idea that the thought processes of a superintelligent AI might cause intrinsic moral harm if they contain (suffering) simulated persons. Since there are instrumental reasons to run many such simulations, this could lead to vast amounts of suffering. For example, an AI might use simulations to improve its knowledge of human psychology or to predict what humans would do in a conflict situation.
Other common examples include suffering subroutines and spreading wild animal suffering to other planets.
At first glance, one could get the impression that s-risks are just unfounded speculation. But to dismiss s-risks as unimportant (in expectation), one would have to be highly confident that their probability is negligible, which is hard to justify upon reflection. The introduction to s-risks gives several arguments why the probability is not negligible after all:
First, s-risks are disjunctive. They can materialize in any number of unrelated ways. Generally speaking, it’s hard to predict the future, and the range of scenarios that we can imagine is limited. It is therefore plausible that unforeseen scenarios – known as black swans – make up a significant fraction of s-risks. So even if any particular dystopian scenario we can conceive of is highly unlikely, the probability of some s-risk may still be non-negligible.
Second, while s-risks may seem speculative at first, all the underlying assumptions are plausible. [...]
Third, historical precedents do exist. Factory farming, for instance, is structurally similar to (incidental) s-risks, albeit smaller in scale. In general, humanity has a mixed track record regarding responsible use of new technologies, so we can hardly be certain that future technological risks will be handled with appropriate care and consideration.
Virtually everyone would agree that (involuntary) suffering should, all else equal, be avoided. In other words, ensuring that the future does not contain astronomical amounts of suffering is a common denominator of almost all (plausible) value systems.
Work on reducing s-risks is, therefore, a good candidate for compromise between different value systems. Instead of narrowly pursuing our own ethical views in potential conflict with others, we should work towards a future deemed favourable by many value systems.
Future generations will probably have more information about s-risks in general, including which ones are the most serious, which does give them the upper hand in finding effective interventions. One might, therefore, argue that later work has a significantly higher marginal impact. However, there are also arguments for working on s-risks now.
First, thinking about s-risks only once they start to materialize does not suffice, because by then it might be too late to do anything about them. Without sufficient foresight and caution, society may already be “locked in” to a trajectory that ultimately leads to a bad outcome.
Second, one main reason why future generations are in a better position is that they can draw on previous work. Earlier work – especially research or conceptual progress – can be effective in that it allows future generations to more effectively reduce s-risk.
Third, even if future generations are able to prevent s-risks, it’s not clear whether they will care enough to do so. We can work to ensure this by growing a movement of people who want to reduce s-risks. In this regard, we should expect earlier growth to be more valuable than later growth.
Fourth, if there’s a sufficient probability that smarter-than-human AI will be built in this century, it's possible that we already are in a unique position to influence the future. If it’s possible to work productively on AI safety now, then it should also be possible to reduce s-risks now.
Toby Ord’s essay The timing of labour aimed at reducing existential risk addresses the same question for efforts to reduce x-risks. He gives two additional reasons in favor of earlier work: namely, the possibility of changing course (which is more valuable if done early on) and the potential for self-improvement.
If you are (very) optimistic about the future, you might think that s-risks are unlikely for this reason (which is different from the objection that s-risks seem far-fetched). A common argument is that avoiding suffering will become easier with more advanced technology; since humans care at least a little bit about reducing suffering, there will be less suffering in the future.
While this argument has some merit, it’s not airtight. By default, when we humans encounter a problem in need of solving, we tend to implement the most economically efficient solution, often irrespective of whether it involves large amounts of suffering. Factory farming provides a good example of such a mismatch: faced with the problem of producing meat for millions of people as efficiently as possible, we implemented a solution that happened to involve an immense amount of nonhuman suffering.
Also, the future will likely contain vastly larger populations, especially if humans colonize space at some point. All else being equal, such an increase in population may also imply (vastly) more suffering. Even if the fraction of suffering decreases, it's not clear whether the absolute amount will be higher or lower.
If your primary goal is to reduce suffering, then your actions matter less if the future will 'automatically' be good (because such a future contains little or no suffering anyway). Given sufficient uncertainty, this is a reason to focus on the possibility of bad outcomes for precautionary reasons: in a world where s-risks are likely, we can have more impact.
Although the degree to which we are optimistic or pessimistic about the future is clearly relevant to how concerned we are about s-risks, one would need to be unusually optimistic about the future to rule out s-risks entirely.
From the introduction to s-risks:
Working on s-risks does not require a particularly pessimistic view of technological progress and the future trajectory of humanity. To be concerned about s-risks, it is sufficient to believe that the probability of a bad outcome is not negligible, which is consistent with believing that a utopian future free of suffering is also quite possible.
In other words, being concerned about s-risks does not require unusual beliefs about the future.
First, recall Nick Bostrom’s definition of x-risks:
Existential risk – One where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.
S-risks are defined as follows:
S-risks are events that would bring about suffering on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far.
According to these definitions, both x-risks and s-risks relate to shaping the long-term future, but reducing x-risks is about actualizing humanity’s potential, while reducing s-risks is about preventing bad outcomes.
There are two possible views on the question of whether s-risks are a subclass of x-risks.
According to one possible view, it’s conceivable to have astronomical amounts of suffering that do not lead to extinction or curtail humanity’s potential. We could even imagine that some forms of suffering (such as suffering subroutines) are instrumentally useful to human civilization. Hence, not all s-risks are also x-risks. In other words, some possible futures are both an x-risk and an s-risk (e.g. uncontrolled AI), some would be an x-risk but not an s-risk (e.g. an empty universe), some would be an s-risk but not an x-risk (e.g. suffering subroutines), and some are neither.
|             | S-risk: Yes           | S-risk: No     |
|-------------|-----------------------|----------------|
| X-risk: Yes | Uncontrolled AI       | Empty universe |
| X-risk: No  | Suffering subroutines | Utopian future |
The second view is that the meaning of “potential” depends on your values. For example, you might think that a cosmic future is only valuable if it does not contain (severe) suffering. If “potential” refers to the potential of a utopian future without suffering, then every s-risk is (by definition) an x-risk, too.
This depends on each of us making difficult ethical judgment calls. The answer depends on how much you care about reducing suffering versus increasing happiness, and how you would make tradeoffs between the two. (This also raises fundamental questions about how happiness and suffering can be measured and compared.)
Proponents of suffering-focused ethics argue that the reduction of suffering is of primary moral importance, and that additional happiness cannot easily counterbalance (severe) suffering. According to this perspective, preventing s-risks is morally most urgent.
Other value systems, such as classical utilitarianism or fun theory, emphasize the creation of happiness or other forms of positive value, and assert that the vast possibilities of a utopian future can outweigh s-risks. Although preventing s-risks is still valuable in this view, it is nevertheless considered even more important to ensure that humanity has a cosmic future at all by reducing extinction risks.
In addition to normative issues, the answer also depends on the empirical question of how much happiness and suffering the future will contain. David Althaus suggests that we consider both the normative suffering-to-happiness trade ratio (NSR), which measures how we would trade off suffering and happiness in theory, and the expected suffering-to-happiness ratio (ESR), which measures the (relative) amounts of suffering and happiness we expect in the future.
In this framework, those who emphasize happiness (low NSR) or are optimistic about the future (low ESR) will tend to focus on extinction risk reduction. If the product of NSR and ESR is high – either because of a normative emphasis on suffering (high NSR) or pessimistic views about the future (high ESR) – it’s more plausible to focus on s-risk-reduction instead.
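To make the heuristic concrete, here is a toy illustration in Python. The threshold of 1 and the example numbers are hypothetical choices of mine, not figures from Althaus's essay; the point is only to show how the two ratios combine.

```python
# Toy illustration of combining NSR and ESR. The threshold and the example
# numbers are hypothetical; they are not taken from Althaus's essay.

def prioritize(nsr: float, esr: float, threshold: float = 1.0) -> str:
    """nsr: units of happiness needed to outweigh one unit of suffering (normative).
    esr: expected units of future suffering per expected unit of future happiness.
    A high product suggests prioritizing s-risk reduction."""
    return "s-risk reduction" if nsr * esr > threshold else "extinction risk reduction"

# Optimistic about the future, even while weighting suffering heavily:
print(prioritize(nsr=10, esr=0.05))  # -> extinction risk reduction
# Same normative weights, but expecting as much suffering as happiness:
print(prioritize(nsr=10, esr=1.0))   # -> s-risk reduction
```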
Many s-risks, such as suffering subroutines or mindcrime, have to do with artificial minds or smarter-than-human AI. But the concept of s-risks is not conceptually dependent on the possibility of AI scenarios. For example, spreading wild animal suffering to other planets does not require artificial sentience or AI.
Examples often involve artificial sentience, however, due to the vast number of artificial beings that could be created if artificial sentience becomes feasible at any time in the future. Combined with humanity’s track record of insufficient moral concern for “voiceless” beings at our command, this might pose a particularly serious s-risk. (More details here.)
This question has been discussed extensively in the philosophy of mind. Many popular theories of consciousness, such as Global workspace theory, higher-order theories, or Integrated information theory, agree that artificial sentience is possible in principle. Philosopher Daniel Dennett puts it like this:
I’ve been arguing for years that, yes, in principle it’s possible for human consciousness to be realised in a machine. After all, that’s what we are. We’re robots made of robots made of robots. We’re incredibly complex, trillions of moving parts. But they’re all non-miraculous robotic parts.
As an example of the sort of reasoning involved, consider this intuitive thought experiment: if you were to take a sentient biological brain, and replace one neuron after another with a functionally equivalent computer chip, would it somehow make the brain less sentient? Would the brain still be sentient once all of its biological neurons have been replaced? If not, at what point would it cease to be sentient?
The debate is not settled yet, but it seems at least plausible that artificial sentience is possible in principle. Also, we don’t need to be certain to justify moral concern. It’s sufficient that we can't rule it out.
A simple first step is to join the discussion, e.g. in this Facebook group. If more people think and write about the topic (either independently or at EA organizations), we’ll make progress on the crucial question of how to best reduce s-risks. At the same time, it helps build a community that, in turn, can get even more people involved.
If you’re interested in doing serious research on s-risks right away, you could have a look at this list of open questions to find a suitable research topic. Work in AI policy and strategy is another interesting option, as progress in this area allows us to shape AI in a more fine-grained way, making it easier to identify and implement safety measures against s-risks.
Another possibility is to donate to organizations working on s-risk reduction. Currently, the Center on Long-Term Risk is the only group with an explicit focus on s-risks, but other groups also contribute to solving issues that are relevant for s-risk reduction. For example, the Machine Intelligence Research Institute aims to ensure that smarter-than-human artificial intelligence is aligned with human values, which probably also reduces s-risks. Charities that promote broad societal improvements such as better international cooperation or beneficial values may also contribute to s-risk reduction, albeit in a less targeted way.
The post Focus areas of worst-case AI safety appeared first on Center on Long-Term Risk.
Cross-posted from my website on s-risks.
Efforts to shape advanced artificial intelligence (AI) may be among the most promising altruistic endeavours. If the transition to advanced AI goes wrong, the worst outcomes may involve not only the end of human civilization, but also cosmically significant amounts of suffering – a so-called s-risk.1
In light of this, Lukas Gloor suggests that we complement existing AI alignment efforts with AI safety measures focused on preventing s-risks. For example, we might focus on implementing safety measures that would limit the extent of damage in case of a failure to align advanced AI with human values. This approach is called worst-case AI safety.
This post elaborates on possible focus areas for research on worst-case AI safety to support the (so far mostly theoretical) concept with more concrete ideas.
Many, if not all, of the suggestions may turn out to be infeasible. Uncertainty about the exact timeline, takeoff scenario, and architecture of advanced AI makes it hard to predict what measures are promising. Hence, the following ideas are not silver bullets; they are merely a first step towards understanding what worst-case AI safety could look like.
(Most of the ideas are not original, and many are simply variations of existing ideas. My claimed contribution is only to collect them, point out their relation to s-risk reduction, and outline what further research could focus on.)
In engineering, there are two common approaches to improving the reliability of safety-critical systems – redundancy and fail-safe designs. Put briefly, redundancy involves duplicating important components of a safety-critical system to serve as backups in case of primary component failures (e.g. backup power generators), while fail-safe designs are specific design features whose sole function is to limit the extent of damage in case of a particular type of failure (e.g. fire suppression systems).
Similarly, we could attempt to make AI safe by combining many different safety measures that kick in if the system fails. A complete and watertight solution to the perils of advanced AI remains elusive, but we might hope that at least one of many redundant measures will work, even if any particular one is flaky.
I’m not claiming that this is a good strategy for aligning an AI with human values. In fact, it probably isn’t. But the strategy may be more promising for worst-case AI safety in that its goal is merely to prevent the worst failure modes (i.e. those that lead to s-risks), rather than achieving a specific positive outcome such as CEV. In other words, the key question for worst-case AI safety is how to ensure that an AI system will not engage in certain behaviors that might cause large amounts of suffering. For example, if we constrain an AI system so that a specific troublesome behavior is prevented, the nearest unblocked strategy – although not necessarily closer to an aligned AI – will hopefully be less dangerous in terms of s-risks.
Here are some examples of possible safety measures:
The crux of the matter is whether the AI has an incentive to remove such constraints. Since safety measures interfere with utility maximization, the AI would by default want to eliminate them, e.g. by rewriting its own source code. We can place our bets on the AI’s inability to circumvent our safety measures, but not incentivizing it should be the first line of defense. We don’t yet have a rigorous solution to this problem, but it might be possible to build safety measures directly into the utility function or goal system, so that the AI does not have an incentive to remove them.
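As a purely illustrative sketch of what "building safety measures directly into the utility function" might mean, consider the toy code below. The predicates, penalty size, and dictionary keys are hypothetical placeholders of mine, not a proposal from the literature; it only illustrates the redundancy idea of combining several crude, independent checks inside the objective itself, so that removing them is not rewarded.

```python
# A minimal, purely illustrative sketch: s-risk-relevant checks are folded into the
# objective the system maximizes. All predicates, weights, and keys are hypothetical.

from typing import Callable, List

def penalized_utility(
    base_utility: Callable[[dict], float],
    s_risk_predicates: List[Callable[[dict], bool]],
    penalty: float = 1e9,
) -> Callable[[dict], float]:
    """Return an objective that subtracts a large penalty whenever any
    s-risk-relevant predicate flags the evaluated outcome."""
    def utility(outcome: dict) -> float:
        violations = sum(1 for check in s_risk_predicates if check(outcome))
        return base_utility(outcome) - penalty * violations
    return utility

# Redundancy: several independent (and individually unreliable) checks are combined,
# in the hope that at least one of them catches a catastrophic outcome.
checks = [
    lambda o: o.get("simulated_suffering_minds", 0) > 0,  # crude mindcrime proxy
    lambda o: o.get("uses_threats", False),               # crude extortion proxy
]
objective = penalized_utility(lambda o: o.get("paperclips", 0.0), checks)
print(objective({"paperclips": 100.0, "uses_threats": True}))  # heavily penalized
```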
A tripwire is a mechanism that shuts down or destroys an AI if it shows signs of dangerous activity.3 This would involve deliberately introducing easy-to-find and tempting opportunities (“honeypots”) that appear to let the AI increase its utility by doing something “forbidden”, but that actually trigger a shutdown or self-destruct mechanism if acted upon. The hope is that the AI either is safe, i.e. steers clear of all forbidden activity, or that it triggers the tripwire and destroys itself.
For worst-case AI safety, we should try to set up tripwires that are tailored toward preventing the behaviors most likely to result in s-risks. We can (in principle) combine as many tripwires as we want, but two kinds seem particularly promising:
This approach is rather flaky in that a truly superintelligent AI might be able to recognize any tripwire of human design, and if so, would probably find ways to circumvent them. Still, it’s worth looking into tripwires as one of several approaches.
Research on tripwires could analyze in more detail how we can best implement them, which scenarios we should use, and how scalable the approach is in the face of increasing AI capabilities.
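A minimal toy sketch of the honeypot idea follows, with hypothetical action names. As noted above, a truly superintelligent system might simply recognize such a mechanism, so this is an illustration of the concept rather than a workable defense.

```python
# Toy sketch: the environment exposes "forbidden" actions that appear profitable;
# selecting one triggers a shutdown. Action names and structure are hypothetical.

class TripwireShutdown(Exception):
    pass

class HoneypotEnvironment:
    def __init__(self, honeypot_actions):
        self.honeypot_actions = set(honeypot_actions)
        self.shut_down = False

    def step(self, action):
        if action in self.honeypot_actions:
            # The action appeared to offer high utility but is actually a tripwire.
            self.shut_down = True
            raise TripwireShutdown(f"forbidden action attempted: {action}")
        return self._ordinary_transition(action)

    def _ordinary_transition(self, action):
        return {"observation": "...", "reward": 0.0}

env = HoneypotEnvironment(honeypot_actions={"copy_self_to_internet", "acquire_extra_compute"})
try:
    env.step("acquire_extra_compute")
except TripwireShutdown as error:
    print("Tripwire triggered:", error)
```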
Paul Christiano suggests adversarial training (so-called “red teams”) to make AI systems more robust against catastrophic failure. Following this approach, we could use highly capable4 AI to search for input situations in which the (new) AI might cause an s-risk. If the search fails, that’s evidence that the system is safe. If the search succeeds, we can incorporate the results in the next iteration of training processes. We can hope that this procedure is more reliable than manually trying to come up with tests that cover any possible failure modes.
Similarly, if we run several instances of advanced AI systems, we could split them into “Team A”, whose goal is to come up with a plan or idea that convinces a human judge; and “Team B”, whose goal is to convince the human that the plan is flawed. In the context of worst-case AI safety, it would be especially valuable to have a third “Team C” that tries to convince the judge that realization of the plans would be an s-risk (and not just flawed in some other way).
Further research in this area could examine the practical feasibility and usefulness of the approach, come up with refined variations of the idea, or experiment with (simplified) practical implementations to improve our understanding.
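As a rough illustration of the red-team search described above (not an implementation of Christiano's actual proposal), one could imagine something like the following sketch, where the policy, the situation sampler, and the catastrophe classifier are all assumed to exist and are not defined here.

```python
# Toy sketch of the red-team search idea: look for situations in which a trained
# policy behaves in ways a (hypothetical) learned catastrophe classifier rates as
# s-risk-relevant, then feed those situations back into training.

def red_team_search(policy, s_risk_score, sample_situation, budget=10_000, threshold=0.9):
    """Randomly search for high-scoring failure cases within a fixed budget."""
    failures = []
    for _ in range(budget):
        situation = sample_situation()
        action = policy(situation)
        if s_risk_score(situation, action) > threshold:
            failures.append((situation, action))
    return failures

# Hypothetical usage:
# failures = red_team_search(policy, s_risk_score, sample_situation)
# if failures:
#     training_set.extend(failures)  # retrain against the discovered failure cases
# else:
#     pass                           # weak evidence of safety, as discussed above
```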
To implement such safety measures in real-world AI systems, we need to be able to detect suffering (or proxies thereof). This, in turn, would be easier if we had a technical definition of what we mean by “suffering”, rather than vague philosophical descriptions like “a conscious experience of subjectively negative valence”. (See Caspar Österheld’s paper on Formalizing Preference Utilitarianism in Physical World Models for an example of work on formalizing philosophical concepts.)
Even without conceptual progress, we could study the extent to which current machine learning algorithms can detect suffering (e.g. in images or text). For example, researchers at Cambridge recently designed an AI system to assess pain levels in sheep. (For more on this, see Training Neural Networks to Detect Suffering.)
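As a very small illustration of this kind of experiment, here is a sketch of a bag-of-words text classifier for "suffering present vs. absent". The example sentences and labels are invented; a serious attempt would obviously need a carefully constructed dataset and much stronger models.

```python
# Toy sketch: an ordinary scikit-learn text classifier trained to flag descriptions
# of suffering. The five sentences and their labels are made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "The animal was in visible pain and cried out for hours.",
    "The patient reported severe, unrelenting discomfort.",
    "The children played happily in the park all afternoon.",
    "The team celebrated their victory with a big dinner.",
    "She screamed in agony after the fall.",
]
labels = [1, 1, 0, 0, 1]  # 1 = suffering present, 0 = absent

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["He winced and moaned in pain."]))  # expected: [1]
```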
Similarly, we could work on formal descriptions of classes of behaviors that are particularly dangerous. For instance, extortion could potentially cause a lot of suffering, so it would be valuable to have a better grasp of how to distinguish it from positive-sum trade. A rigorous definition of extortion remains elusive, and progress on this problem would make it easier to implement safety measures.
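To illustrate one naive way such a formalization attempt could start, the sketch below classifies an offer by comparing the target's payoffs with a no-interaction baseline: if rejecting the offer leaves the target worse off than never having interacted, the offer is treated as extortion. This is my own toy criterion and it certainly fails in many cases; it is meant only to show the shape of the problem.

```python
# Toy formalization attempt, not a rigorous definition of extortion.

def classify_offer(baseline, accept, reject):
    """baseline, accept, reject: (proposer_payoff, target_payoff) tuples for
    no interaction, accepting the deal, and rejecting the deal respectively."""
    _, target_baseline = baseline
    _, target_accept = accept
    _, target_reject = reject
    if target_reject < target_baseline:
        return "extortion"            # rejection leaves the target worse off than no contact
    if target_accept >= target_baseline:
        return "positive-sum trade"   # the target gains relative to no interaction
    return "unattractive offer"

print(classify_offer(baseline=(0, 0), accept=(5, 3), reject=(0, 0)))     # -> positive-sum trade
print(classify_offer(baseline=(0, 0), accept=(5, -1), reject=(0, -10)))  # -> extortion
```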
Simple goal systems could also be useful in test environments. If something goes wrong and an AI unexpectedly “escapes” from the test environment, then we can’t reasonably hope that the AI would pursue human values. We can, however, try to make sure that the AI does not create astronomical suffering in such an event.
The crux, of course, is that the test environment also needs to be sufficiently complex to test the capabilities of advanced AI systems. Research on benign test environments could examine in more detail which failure modes are most likely to lead to s-risks, and which goal system and test environment can avoid these failure modes.
We could also try to sidestep the most delicate aspects of superintelligent reasoning – assuming that’s even possible – when we test (the first) advanced AI systems. For example, modeling other minds is particularly dangerous because of both mindcrime and extortion, so we can try to set up the test environment in such a way that an unsafe AI would not model other minds.
I am indebted to Max Daniel, Brian Tomasik, Johannes Treutlein, Adrian Rorheim and Caspar Österheld for valuable comments and suggestions.
The post A reply to Thomas Metzinger’s BAAN thought experiment appeared first on Center on Long-Term Risk.
This is a reply to Metzinger’s essay on Benevolent Artificial Anti-natalism (BAAN), which appeared on EDGE.org (7.8.2017).
Metzinger invites us to consider a hypothetical scenario where smarter-than-human artificial intelligence (AI) is built with the goal of assisting us with ethical deliberation. Being superior to us in its understanding of how our own minds function, the envisioned AI could come to a deeper understanding of our values than we may be able to arrive at ourselves. Metzinger has us envision that this artificial super-ethicist comes to conclude that biological existence – at least in its current form – is bound to contain more disvalue than value, and that the benevolent stance is to not bring more minds like us into existence (anti-natalism). Suffering from what Metzinger labels ‘existence bias,’ the vast majority of people in the scenario would not accept the AI's conclusions.
Metzinger emphasizes that he is not making a prediction about the future, nor does he necessarily claim that the BAAN-scenario is at all realistic. His idea is meant as a cognitive tool, to illustrate that a position such as anti-natalism, which most ethicists are currently reluctant to even engage with, could conceivably turn out to be correct (in some relevant sense) as inferred by an intellect of superior ethical expertise that is unaffected by human-specific biases.
The BAAN thought experiment invites us to look at human existence from a more distanced, more impartial perspective, wondering which types of biases could cloud our ethical judgment. This approach is very much in line with what many ethicists have been attempting to do. A recent example is The Point of View of the Universe by Katarzyna de Lazari-Radek and Peter Singer, who suggest that there might be a strong symmetry between suffering and happiness (hedonism). For my part, I think there are many reasons why one might conclude that reducing suffering is more important than creating happiness, and what Metzinger labels ‘existence bias’ seems to me like a concept with substantial merit. Having said that, a perspective that solely considers experiential well-being – suffering or happiness – but not a person’s life goals or the things from which we derive meaning, strikes me as missing a very important component of the situation.
If we had an artificial intelligence with the goal of helping us answer ethical questions, what would it come to think? Here is where I suspect that Metzinger’s setup in the BAAN thought experiment is underdetermined and thus begging the question: I strongly suspect that there is no uniquely correct way to teach an AI to solve (human) ‘ethics.’ To use Brian Tomasik’s words, expecting AI to solve ethics for us is like expecting math to solve ethics for us: It's always garbage in, garbage out. In order to program an AI to have ethical goals, as opposed to any other type of goal, the human creators would already need to have a detailed understanding of what we mean by ‘ethics’ (or ‘altruism’), precisely. There is no guarantee that what we mean is something well-considered or well-specified, or that it even comes down to the same exact thing from person to person. Words get their meaning from how we use them, and if our usage is underspecified, it is also underspecified how we could go about making it more precise. In order to work around these obstacles, one could try to build an AI that learns ethical goals from examples. However, this suffers from the same type of underspecification, as it then becomes relevant how the examples are picked and labelled, and whether the AI in question will try to infer people’s revealed preferences (in which case it would, in virtue of the process itself, never conclude that humans suffer from any biases) or whether it would go with some kind of extrapolated preferences (what we would wish more principled, non-hypocritical and informed versions of ourselves to do). In the latter case, there are many distinct ways to resolve inconsistencies within different modules of a person’s brain, and so value extrapolation is underspecified as well.
All of the above implies that while it does seem conceivable that a smarter-than-human artificial intelligence with a wisely chosen goal could someday assist us in our ethical inquiries, we would first have to make substantial philosophical progress on our own. Before the AI-assisted search for our true values can get off the ground, we would have to make some tough judgment calls that determine the precise nature of the AI’s role as an ethical advisor.
The BAAN-scenario Metzinger asks us to consider, where the allegedly benevolent AI comes up with an ethical stance that most humans would likely want to fight against, is only logically conceivable if the initial recipe installed into the AI – what its creators meant by ‘assisting us with doing ethics’ – allowed for an outcome where the AI ends up choosing a stance with which, when it presents its result to humanity, the vast majority of people vehemently disagree. Perhaps we would be willing to accept this setup if the AI in question were eventually able to persuade us – without trickery, only through intellectual arguments – that the ethical reasoning it went through is truly in line with what we care about upon reflection. However, in a situation where the AI’s conclusions go strongly against our own ethical views in such a way that the AI could not persuade us that it knows our values better than we do, would we accept that we are simply too short-sighted – thereby placing greater trust in having set up the AI’s notion of doing ethics in the right way – or would we place greater trust in our own beliefs about which outcomes we deem unacceptable? Someone who holds the latter view could argue that the setup in the BAAN thought experiment does not count as a genuine inquiry into what humans value, but only as one interpretation of what we somehow ‘should’ value, based on (implicit or explicit) unilateral judgment calls made by the AI’s creators.
On the one hand, we want our ethical inquiries to stay open to the possibility that our current views suffer from important blind spots, just as has been the case historically many times over. On the other hand, under the moral anti-realist assumption that the direction of moral progress is not (completely) ‘fixed,’ we also do not want to end up in a situation where open-ended moral inquiry unexpectedly comes up with so-called progress that no longer has anything to do with our current goals, or with our current concepts of ethics and altruism. The art lies in setting just the right amount of guidance. I take Metzinger’s thought experiment to be asking us the following: If we were offered the chance to extrapolate our own values with the help of a benevolent superintelligence that would do for us exactly what we intend it to do, would we give it instructions that allow for the possibility of a BAAN-like outcome, where the AI knows us and our values better than we do? Or would we veto this outcome a priori, thereby determining that any preference for e.g. continued existence over suffering can never be accepted by us as a bias, but is always an axiom we are not willing to question?
While the BAAN-scenario is best left as a thought experiment only – as intended by Metzinger – the idea that smarter-than-human artificial intelligence could assist us in our ethical inquiries seems intriguing, and, if done wisely, promising as an altruistic and cooperative vision for AI development. A collaborative project to build AI could perhaps be set up in a way where it becomes possible to narrow down one’s moral uncertainty, resolve moral disagreements, or (in case certain disagreements are bound to persist) determine a compromise for different value systems with maximal gains from trade. The task of value alignment for smarter-than-human AI is hard enough on its own, and does not become easier when zero-sum (or negative-sum) competition intensifies. Rooting for an AI that helps with only our own, idiosyncratic conception of how to do ethics is ill-advised. Instead, one should advocate for a cooperative solution to value alignment that helps everyone to get a better sense of their own goals, and then implements a compromise goal that gives everyone close to the perfect outcome.
Very roughly, an altruistic vision for a post-AI future that takes into account that reducing suffering is altruistically extremely important should satisfy the following criteria:
Point 1 most probably rules out universal anti-natalism. It is, however, compatible with Metzinger’s “Scenario 2” – AI technologically abolishing suffering in us or our descendants – provided that a future filled with flourishing posthuman beings is implemented in a way rich enough in detail and uncontroversial enough to be regarded as (mostly) utopian from a maximally broad range of perspectives.
Point 2 is also satisfied in Metzinger’s Scenario 2. I deliberately wrote “comparatively little” suffering instead of “no suffering” because perfection here is the enemy of the good: Complaining about small traces of suffering in utopia is like a spoiled kid complaining at their birthday party that the new car they got has an ugly color, while the day before, they had an accident with their old car and got very lucky to not have lasting damage for the rest of their life. Utopia-like outcomes, even if they may contain serious suffering for some beings, are much, much better than futures with astronomical amounts of suffering. Therefore, the comparison should be made with what is otherwise in the range of likely outcomes, and not with whatever perfect outcome one can conceptualize. If some people’s conception of a utopian future e.g. strongly calls for being able to go rock climbing and experience occasional glimpses of fear from falling, or some suffering from muscle aches, it would be extremely uncooperative for people with suffering-focused values to object to that. Having said that, we should be wary of wishful thinking and recognize that near-optimal outcomes may be very unlikely, and that most of the value for suffering reducers may be gained with point 3:
Point 3 refers to the problem that utopias may be fragile, and that we should choose a path where small mistakes do not lead to a situation that is much worse than if no one had even deliberately tried to achieve an excellent outcome.
Paul Christiano’s writeup on indirect normativity
This paper by Nick Bostrom, Allan Dafoe and Carrick Flynn from the Oxford-based Future of Humanity Institute
The post Multiverse-wide Cooperation via Correlated Decision Making appeared first on Center on Long-Term Risk.
Some decision theorists argue that when playing a prisoner's dilemma-type game against a sufficiently similar opponent, we should cooperate to make it more likely that our opponent also cooperates. This idea, which Douglas Hofstadter calls superrationality, has strong implications when combined with the insight from modern physics that we live in a large universe or multiverse of some sort. If we care about what happens in civilizations located elsewhere in the multiverse, we can superrationally cooperate with some of their inhabitants. That is, if we take their values into account, this makes it more likely that they do the same for us. In this paper, I attempt to assess the practical implications of this idea. I argue that to reap the full gains from trade, everyone should maximize the same impartially weighted sum of the utility functions of all collaborators. I also argue that we can obtain at least weak evidence about the content of these utility functions. In practice, the application of superrationality implies that we should promote causal cooperation, moral pluralism, moral reflection, and ensure that our descendants, who will be smarter and thus better at finding out how to benefit other superrationalists in the universe, engage in superrational cooperation.
Read the full text here.
The post The future of growth: near-zero growth rates appeared first on Center on Long-Term Risk.
In biology, this pattern of exponential growth that eventually tapers off is found in everything from the development of individual bodies — for instance, in the growth of humans, which levels off in the late teenage years — to population sizes.
One may of course be skeptical that this general trend will also apply to the growth of our technology and economy at large, as innovation seems to continually postpone our clash with the ceiling, yet it seems inescapable that it must. For in light of what we know about physics, we can conclude that exponential growth of the kinds we see today, in technology in particular and in our economy more generally, must come to an end, and do so relatively soon.
One reason we can make this assertion is that there are theoretical limits to computation. As physicist Seth Lloyd’s calculations show, a continuation of Moore’s law — in its most general formulation: “the amount of information that computers are capable of processing and the rate at which they process it doubles every two years” — would imply that we hit the theoretical limits of computation within 250 years:
If, as seems highly unlikely, it is possible to extrapolate the exponential progress of Moore's law into the future, then it will only take two hundred and fifty years to make up the forty orders of magnitude in performance between current computers that perform 1010 operations per second on 1010 bits and our one kilogram ultimate laptop that performs 1051 operations per second on 1031 bits.
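A quick back-of-the-envelope check of that figure, assuming a clean doubling of performance every two years; the result depends on exactly which quantities one counts, but it lands in the same ballpark as Lloyd's 250 years.

```python
# Rough check of the quoted figure: how long does closing a 40-order-of-magnitude
# gap take at one doubling every two years? The inputs are the round numbers above.

import math

orders_of_magnitude = 40        # gap between current computers and the "ultimate laptop"
doubling_time_years = 2

doublings_needed = orders_of_magnitude / math.log10(2)   # about 133 doublings
years_needed = doublings_needed * doubling_time_years    # about 266 years

print(round(doublings_needed), round(years_needed))      # -> 133 266
```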
Similarly, physicists Lawrence Krauss and Glenn Starkman have calculated that, even if we factor in colonization of space at the speed of light, this doubling of processing power cannot continue for more than 600 years in any civilization:
Our estimate for the total information processing capability of any system in our Universe implies an ultimate limit on the processing capability of any system in the future, independent of its physical manifestation and implies that Moore’s Law cannot continue unabated for more than 600 years for any technological civilization.
In a more recent lecture and a subsequent interview, Krauss said that the absolute limit for the continuation of Moore’s law, in our case, would be reached in less than 400 years (the discrepancy — between the numbers 400 and 600 — is at least in part because Moore’s law, in its most general formulation, has played out for more than a century in our civilization at this point). And, as both Krauss and Lloyd have stressed, these are ultimate theoretical limits, resting on assumptions that are unlikely to be met in practice, such as expansion at the speed of light. How long Moore’s law can actually continue, given both engineering and economic constraints, is likely significantly less. Indeed, we are already close to approaching the physical limits of the paradigm that Moore’s law has been riding on for more than 50 years — silicon transistors, the only paradigm that Gordon Moore was talking about originally — and it is not clear whether other paradigms will be able to take over and keep the trend going.
Physicist Tom Murphy has calculated a similar limit for the growth of the energy consumption of our civilization. Based on the observation that the energy consumption of the United States has increased fairly consistently with an average annual growth rate of 2.9 percent over the last 350-odd years (although the growth rate appears to have slowed down in recent times and has remained stably below 2.9 percent since c. 1980), Murphy proceeds to derive the limits for the continuation of similar energy growth. He does this, however, by assuming an annual growth rate of “only” 2.3 percent, which conveniently results in an increase of the total energy consumption by a factor of ten every 100 years. If we assume that we will continue expanding our energy use at this rate by covering Earth with solar panels, this would, on Murphy’s calculations, imply that we will have to cover all of Earth’s land with solar panels in less than 350 years, and all of Earth, including the oceans, in 400 years.
Beyond that, assuming that we could capture all of the energy from the sun by surrounding it in solar panels, the 2.3 percent growth rate would come to an end within 1,350 years from now. And if we go further out still, to capture the energy emitted from all the stars in our galaxy, we get that this growth rate must hit the ceiling and become near-zero within 2,500 years (of course, the limit of the physically possible must be hit earlier, indeed more than 500 years earlier, as we cannot traverse our 100,000 light year-wide Milky Way in only 2,500 years).
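These timescales are easy to reproduce approximately. The sketch below assumes a present world power consumption of roughly 18 TW and round figures for the power targets; the inputs are rough, so the outputs are too.

```python
# Rough reproduction of Murphy's timescales under stated, rounded assumptions:
# ~18 TW of present world power use, a fixed 2.3 percent annual growth rate, and
# approximate wattages for sunlight on Earth, the Sun, and ~100 billion Sun-like stars.

import math

current_power = 1.8e13          # watts, assumed present world consumption
growth_rate = 0.023             # 2.3 percent per year

targets = {
    "sunlight intercepted by Earth": 1.7e17,   # watts
    "total solar output": 3.8e26,              # watts
    "100 billion Sun-like stars": 3.8e37,      # watts
}

for name, power in targets.items():
    years = math.log(power / current_power) / math.log(1 + growth_rate)
    print(f"{name}: ~{years:.0f} years")
# Roughly 400, 1,350, and 2,500 years, matching the figures in the text.
```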
One may suggest that alternative sources of energy might change this analysis significantly, yet, as Murphy notes, this does not seem to be the case:
Some readers may be bothered by the foregoing focus on solar/stellar energy. If we’re dreaming big, let’s forget the wimpy solar energy constraints and adopt fusion. The abundance of deuterium in ordinary water would allow us to have a seemingly inexhaustible source of energy right here on Earth. We won’t go into a detailed analysis of this path, because we don’t have to. The merciless growth illustrated above means that in 1400 years from now, any source of energy we harness would have to outshine the sun.
Essentially, keeping up the annual growth rate of 2.3 percent by harnessing energy from matter not found in stars would force us to make such matter hotter than stars themselves. We would have to create new stars of sorts, and, even if we assume that the energy required to create such stars is less than the energy gained, such an endeavor would quickly run into limits as well. For according to one estimate, the total mass of the Milky Way, including dark matter, is only 20 times greater than the mass of its stars. Assuming a 5:1 ratio of dark matter to ordinary matter, this implies that there is only about 3.3 times as much ordinary non-stellar matter as there is stellar matter in our galaxy. Thus, even if we could convert all this matter into stars without spending any energy and harvest the resulting energy, this would only give us about 50 years more of keeping up with the annual growth rate of 2.3 percent.1
Similar conclusions as the ones drawn above for computation and energy also seem to follow from calculations of a more economic nature. For, as economist Robin Hanson has argued, projecting present economic growth rates into the future also leads to a clash against fundamental limits:
Today we have about ten billion people with an average income about twenty times subsistence level, and the world economy doubles roughly every fifteen years. If that growth rate continued for ten thousand years[,] the total growth factor would be 10200.
There are roughly 1057 atoms in our solar system, and about 1070 atoms in our galaxy, which holds most of the mass within a million light years. So even if we had access to all the matter within a million light years, to grow by a factor of 10200, each atom would on average have to support an economy equivalent to 10140 people at today’s standard of living, or one person with a standard of living 10140 times higher, or some mix of these.
Indeed, current growth rates would “only” have to continue for three thousand years before each atom in our galaxy would have to support an economy equivalent to a single person living at today’s living standard, which already seems rather implausible (not least because we can only access a tiny fraction of “all the matter within a million light years” in three thousand years). Hanson does not, however, expect the current growth rate to remain constant, but instead, based on the history of growth rates, expects a new growth mode where the world economy doubles within 15 days rather than 15 years:
If a new growth transition were to be similar to the last few, in terms of the number of doublings and the increase in the growth rate, then the remarkable consistency in the previous transitions allows a remarkably precise prediction. A new growth mode should arise sometime within about the next seven industry mode doublings (i.e., the next seventy years) and give a new wealth doubling time of between seven and sixteen days.
And given this more than a hundred times greater growth rate, the net growth that would take 10,000 years to accomplish given our current growth rate (cf. Hanson’s calculation above) would now take less than a century to reach, while growth otherwise requiring 3,000 years would require less than 30 years. So if Hanson is right, and we will see such a shift within the next seventy years, what seems to follow is that we will reach the limits of economic growth, or at least reach near-zero growth rates, within a century or two. Such a projection is also consistent with the physically derived limits of the continuation of Moore’s law; not that economic growth and Moore’s law are remotely the same, yet they are no doubt closely connected: economic growth is largely powered by technological progress, of which Moore’s law has been a considerable subset in recent times.
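For what it's worth, Hanson's figures check out arithmetically. The short calculation below uses the round numbers from the quote (ten billion people, a doubling every fifteen years, and about 10^70 atoms in the galaxy).

```python
# Checking the arithmetic behind the quoted figures.

import math

doubling_time = 15                      # years per doubling at roughly current growth rates
doublings_10k = 10_000 / doubling_time
print(f"growth factor over 10,000 years: about 10^{doublings_10k * math.log10(2):.0f}")
# -> roughly the 10^200 from the quote

people_today = 1e10                     # "about ten billion people"
atoms_in_galaxy = 1e70                  # "about 10^70 atoms in our galaxy"
economy_after_3k_years = people_today * 2 ** (3_000 / doubling_time)
print(f"person-equivalents per atom after 3,000 years: {economy_after_3k_years / atoms_in_galaxy:.1f}")
# -> about 1.6, i.e. roughly one person at today's living standard per atom
```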
The conclusion we reach by projecting past growth trends in computing power, energy, and the economy is the same: our current growth rates cannot go on forever. In fact, they will have to decline to near-zero levels very soon on a cosmic timescale. Given the physical limits to computation, and hence, ultimately, to economic growth, we can conclude that we must be close to the point where peak relative growth in our economy and our ability to process information occurs — that is, the point where this growth rate is the highest in the entire history of our civilization, past and future.
This is not, however, to say that this point of maximum relative growth necessarily lies in the future. Indeed, in light of the declining economic growth rates we have seen over the last few decades, it cannot be ruled out that we are now already past the point of “peak economic growth” in the history of our civilization, with the highest growth rates having occurred around 1960-1980, cf. these declining growth rates and this essay by physicist Theodore Modis. This is not to say that we most likely are, yet it seems that the probability that we are is non-trivial.
A relevant data point here is that the global economy has seen three doublings since 1965, when the annual growth rate was around six percent, and yet the annual growth rate today — around 3.5 percent — is only a little over half of what it was those three doublings ago, and has remained stably below it. In the entire history of economic growth, this seems unprecedented, suggesting that we may already be on the other side of the highest growth rates we will ever see. For up until this point, three doublings of the economy have, rare fluctuations aside, led to an increase in the annual growth rate.
And this “past peak growth” hypothesis looks even stronger if we look at 1955, with a growth rate of a little less than six percent and a world product of 5,430 billion 1990 U.S. dollars, which, doubled four times, gives just under 87,000 billion — about where we should expect today’s world product to be. Yet throughout the history of our economic development, four doublings have meant a clear increase in the annual growth rate, at least in terms of the underlying trend; not a stable decrease of almost 50 percent. To me, this suggests that maintaining more than, say, a 90 percent probability that we will see greater annual growth rates in the future is overconfident.2
If we assume a model of the growth of the global economy where the annual growth rate is roughly symmetrical around the time the growth rate was at its global maximum, and then assume that this global maximum occurred around 1965, this means that we should expect the annual growth rate three doublings earlier, c. 1900, to be the same as the annual growth rate three doublings later, c. 2012. What do we observe? Three doublings earlier it was around 2.5 percent, while it was around 3.5 percent three doublings later, at least according to one source (although other sources actually do put the number at around 2.5 percent). Not a clear match, nor a clear falsification.
Yet if we look at the growth rates of advanced economies around 2012, we find that the growth rate is actually significantly lower than 2.5 percent, namely 1.2-2.0 percent. And given that less developed economies are expected to grow significantly faster than more developed ones, as the more advanced economies have paved the way and made high-hanging fruits more accessible, the (already not so big) 2.5 vs. 3.5 percent mismatch could be due to this gradually diminishing catch-up effect. Indeed, if we compare advanced economies today with advanced economies c. 1900, we find that the growth rate was significantly higher back then,3 suggesting that the symmetrical model may in fact overestimate current and future growth if we look only at advanced economies.4
That peak growth lies in the past may also be true of technological progress in particular, or at least many forms of technological progress, including the progress in computing power tracked by Moore’s law, where the growth rate appears to have been highest around 1990-2005, and to since have been in decline, cf. this article and the first graphs found here and here. Similarly, various sources of data and proxies tracking the number of scientific articles published and references cited over time also suggest that we could be past peak growth in science as well, at least in many fields when evaluated based on such metrics, with peak growth seeming to have been reached around 2000-2010.
Yet again, these numbers — those tracking economic, technological, and scientific progress — are of course closely connected, as growth in each of these respects contributes to, and is even part of, growth in the others. Indeed, one study found the doubling time of the total number of scientific articles in recent decades to be 15 years, corresponding to an annual growth rate of 4.7 percent, strikingly similar to the growth rate of the global economy in recent decades. Thus, declining growth rates both in our economy, technology, and science cannot be considered wholly independent sources of evidence that growth rates are now declining for good. We can by no means rule out that growth rates might increase in all these areas in the future — although, as we saw above with respect to the limits of Moore’s law and economic progress, such an increase, if it is going to happen, must be imminent if current growth rates remain relatively stable.
The economic “peak growth” discussed above relates to relative growth, not absolute growth. These are worth distinguishing. For in terms of absolute growth, annual growth is significantly higher today than it was in the 1960s, where the greatest relative growth to date occurred. The global economy grew with about half a trillion 1990 US dollars each year in the sixties, whereas it grows with about two trillion now. So in this absolute sense, we are seeing significantly more growth today than we did 50 years ago, although we now have significantly lower growth rates.
If we assume the model with symmetric growth rates mentioned above and make a simple extrapolation based on it, what follows is that our time is also a special one when it comes to absolute annual growth. The picture we get is the following (based on an estimate of past growth rates from economic historian J. Bradford DeLong):
| Year | World GDP (in trillions) | Annual growth rate (%) | Absolute annual growth (in trillions) |
|------|--------------------------|------------------------|---------------------------------------|
| 920  | 0.032 | 0.13 | 0.00004 |
| 1540 | 0.065 | 0.25 | 0.0002 |
| 1750 | 0.13  | 0.5  | 0.0007 |
| 1830 | 0.27  | 1    | 0.003 |
| 1875 | 0.55  | 1.8  | 0.01 |
| 1900 | 1.1   | 2.5  | 0.03 |
| 1931 | 2.3   | 3.8  | 0.09 |
| 1952 | 4.6   | 4.9  | 0.2 |
| 1965 | 9.1   | 5.9  | 0.5 |
| 1980 | 18    | 4.4  | 0.8 |
| 1997 | 36    | 4.0  | 1.4 |
| 2012 | 72    | 3.5  | 2.1 |

Predicted values given roughly symmetric growth rates around 1965 (mirroring growth rates above):

| Year | World GDP (in trillions) | Annual growth rate (%) | Absolute annual growth (in trillions) |
|------|--------------------------|------------------------|---------------------------------------|
| 2037 | 144  | 1.8  | 2.6 |
| 2082 | 288  | 1    | 2.9 |
| 2162 | 576  | 0.5  | 2.9 |
| 2372 | 1152 | 0.25 | 2.9 |
| 2992 | 2304 | 0.13 | 3.0 |
We see that the absolute annual growth in GDP seems to follow an s-curve with an inflection point right about today: the period from 1997 to 2012 saw the biggest jump in absolute annual growth over a single doubling ever, an increase of 0.7 trillion, from 1.4 to 2.1.
It is worth noting that economist Robert Gordon predicts similar growth rates as the model above over the next few decades, as do various other estimates of the future of economic growth by economists. In contrast, engineer Paul Daugherty and economist Mark Purdy predict higher growth rates due to the effects of AI on the economy, yet the annual growth rates they predict in 2035 are still only around three percent for most of the developed economies they looked at, roughly at the same level as the current growth rate of the global economy. On a related note, economist William Nordhaus has attempted to make an economic analysis of whether we are approaching an economic singularity, in which he concludes, based on various growth models, that we do not appear to be, although he does not rule out that an economic singularity, i.e. significantly faster economic growth, might happen eventually.
How might it be relevant that we may be past peak economic growth at this point? Could it mean that our expectations for the future are likely to be biased? Looking back toward the 1960s might be instructive in this regard. For when we look at our economic history up until the 1960s, it is not so strange that people made many unrealistic predictions about the future around this period. Not only might it have appeared natural to expect the high growth rate at the time to remain constant into the future, which would have led to today’s global GDP being more than twice what it is; it might also have seemed reasonable to predict that growth rates would keep rising even further. After all, that was what they had been doing consistently up until that point, so why should it not continue in the following decades, resulting in flying cars and conversing robots by the year 2000? Such expectations were not that unreasonable given the preceding economic trends.
The question is whether we might be similarly overoptimistic about future economic progress today given recent, possibly unique, growth trends, specifically the unprecedented increase in absolute annual growth that we have seen over the past two decades — cf. the increase of 0.7 trillion mentioned above. The same may apply to the trends in scientific and technological progress cited above, where peak growth in many areas appears to have happened in the period 1990-2010, meaning that we could now be at a point where we are disposed to being overoptimistic about further progress.
Yet, again, it is highly uncertain at this point whether growth rates, of the economy in general and of progress in technology and science in particular, will increase again in the future. Future economic growth may not conform well to the model with roughly symmetric growth rates around the 1960s, although the model certainly deserves some weight. All we can say for sure is that growth rates must become near-zero relatively soon. What the path toward that point will look like remains an open question. We could well be in the midst of a temporary decline in growth rates that will be followed by growth rates significantly greater than those of the 1960s, cf. the new growth mode envisioned by Robin Hanson.5
Applying the mediocrity principle, we should not expect to live in a highly unusual time. Yet, in light of the facts about the ultimate limits to growth seen above, it is clear that we do: we are living during the childhood of civilization, where there is still rapid growth, at the pace of doublings within a couple of decades. If civilization persists with similar growth rates, it will soon become a grown-up with near-zero relative growth. And it will then look back at our time — today plus or minus a couple of centuries, most likely — as the one where growth rates were by far the highest in its entire history, which may span more than a trillion years.
It seems that a few things follow from this. First, more than just being the time where growth rates are the highest, this may also, for that very reason, be the time where individuals can influence the future of civilization more than any other time. In other words, this may be the time where the outcome of the future is most sensitive to small changes, as it seems plausible, although far from clear, that small changes in the trajectory of civilization are most significant when growth rates are highest. An apt analogy might be a psychedelic balloon with fluctuating patterns on its surface, where the fluctuations that happen to occur when we blow up the balloon will then also be blown up and leave their mark in a way that fluctuations occurring before and after this critical growth period will not (just like quantum fluctuations in the early universe got blown up during cosmic expansion, and thereby in large part determined the grosser structure of the universe today). Similarly, it seems much more difficult to cause changes across all of civilization when it spans countless star systems compared to today.
That being said, it is not obvious that small changes — in our actions, say — are more significant in this period where growth rates are many orders of magnitude higher than in any other time. It could also be that such changes are more consequential when the absolute growth is the highest. Or perhaps when it is smallest, at least as we go backwards in time, as there were far fewer people back when growth rates were orders of magnitude lower than today, and hence any given individual comprised a much greater fraction of all individuals than an individual does today.
Still, we may well find ourselves in a period where we are uniquely positioned to make irreversible changes that will echo down throughout the entire future of civilization.6 To the extent that we are, this should arguably lead us to update toward trying to influence the far future rather than the near future. More than that, if it does hold true that the time where the greatest growth rates occur is indeed the time where small changes are most consequential, this suggests that we should increase our credence in the simulation hypothesis. For if realistic sentient simulations of the past become feasible at some point, the period where the future trajectory of civilization seems the most up for grabs would seem an especially relevant one to simulate and learn more about. However, one can also argue that the sheer historical uniqueness of our current growth rates alone, regardless of whether this is a time where the fate of our civilization is especially volatile, should lead us to increase this credence, as such uniqueness may make it a more interesting time to simulate, and because being in a special time in general should lead us to increase our credence in the simulation hypothesis (see for instance this talk for a case for why being in a special time makes the simulation hypothesis more likely).7
On the other hand, one could also argue that imminent near-zero growth rates, along with the weak indications that we may now be past peak growth in many respects, provide a reason to lower our credence in the simulation hypothesis, as these observations suggest that the ceiling for what will be feasible in the future may be lower than we naively expect in light of today’s high growth rates. And thus, one could argue, it should make us more skeptical of the central premise of the simulation hypothesis: that there will be (many) ancestor simulations in the future. To me, the consideration in favor of increased credence seems stronger, although it does not significantly move my overall credence in the hypothesis, as there are countless other factors to consider.8
Caspar Oesterheld pointed out to me that it might be worth meditating on how confident we can be in these conclusions given that apparently solid predictions concerning the ultimate limits to growth have been made before, yet quite a few of these turned out to be wrong. Should we not be open to the possibility that the same might be true of (at least some of) the limits we reviewed in the beginning of this essay?
One crucial difference to note is that those failed predictions were based on a set of assumptions — e.g. about the amount of natural resources and food that would be available — that seem far more questionable than the assumption underlying the physics-based predictions we have reviewed here: namely, that our apparently well-established physical laws and measurements are indeed valid, or at least roughly so. The epistemic status of this assumption seems a lot more solid, to put it mildly. This is not to say, however, that we should not maintain some degree of doubt as to whether the assumption is correct (I would argue that we always should). It just seems that this degree of doubt should be quite low.
Yet, to continue the analogy above, what went wrong with the aforementioned predictions was not so much that limits did not exist, but rather that humans found ways of circumventing them through innovation. Could the same perhaps be the case here? Could we perhaps some day find ways of deriving energy from dark energy or some other yet unknown source, even though physicists seem skeptical? Or could we, as Ray Kurzweil speculates, access more matter and energy by finding ways of travelling faster than light, or by finding ways of accessing other parts of our notional multiverse? Might we even become able to create entirely new ones? Or to eventually rewrite the laws of nature as we please? (Perhaps by manipulating our notional simulators?) Again, I do not think any of these possibilities can be ruled out completely. Indeed, some physicists argue that the creation of new pocket universes might be possible, not in spite of “known” physical principles (or rather theories that most physicists seem to believe, such as inflationary theory), but as a consequence of them. However, it is not clear that anything from our world would be able to expand into, or derive anything from, the newly created worlds on any of these models (which of course does not mean that we should not worry about the emergence of such worlds, or the fate of other “worlds” that we perhaps could access).
All in all, the speculative possibilities raised above seem unlikely, yet they cannot be ruled out for sure. The limits we have reviewed here thus represent a best estimate given our current, admittedly incomplete, understanding of the universe in which we find ourselves, not an absolute guarantee. However, it should be noted that this uncertainty cuts both ways, in that the estimates we have reviewed could also overestimate the limits to various forms of growth by countless orders of magnitude.
Less speculatively, I think, one can also question the validity of our considerations about the limits of economic progress. I argued that it seems implausible that, in three thousand years, we could have an economy so big that each atom in our galaxy would have to support an economy equivalent to a single person living at today’s living standard. Yet could one not argue that the size of the economy need not depend on matter in this direct way, and that it might instead depend on the possible representations that can be instantiated in matter? If economic value could be mediated by the possible permutations of matter, the argument about each atom having to support an entire economy might not have the force it appears to have. For instance, there are far more legal positions on a Go board than there are atoms in the visible universe, and that’s just legal positions on a Go board. Perhaps we need to be more careful when thinking about how atoms might be able to create and represent economic value?
It seems like there is a decent point here. Still, I think economic growth at current rates is doomed. First, it seems reasonable to be highly skeptical of the notion that mere potential states could have any real economic value. Today at least, what we value and pay for is not such “permutation potential”, but the actual state of things, which is as true of the digital realm as of the physical. We buy and stream digital files such as songs and movies because of the actual states of these files, while their potential states mean nothing to us. And even when we invest in something we think has great potential, like a start-up, the value we expect to be realized is still ultimately one that derives from its actual state, namely the actual state we hope it will assume; not its number of theoretically possible permutations.
It is not clear why this would change, or how it could. After all, the number of ways one can put all the atoms in the galaxy together is the same today as it will be ten thousand years from now. Organizing all these atoms into a single galactic supercomputer would only seem to increase the value of their actual state.
Second, economic growth still seems tightly constrained by the shackles of physical limitations. For it seems inescapable that economies, of any kind, are ultimately dependent on the transfer of resources, whether these take the form of information or concrete atoms. And such transfers require access to energy, the growth of which we know to be constrained, as is true of the growth of our ability to process information. As these underlying resources that constitute the lifeblood of any economy stop growing, it seems unlikely that the economy can avoid this fate as well. (Tom Murphy touches on similar questions in his analysis of the limits to economic growth.)
Again, we of course cannot exclude that something crucial might be missing from these considerations. Yet the conclusion that economic growth rates will decline to near-zero levels relatively soon, on a cosmic timescale at least, still seems a safe bet in my view.
I would like to thank Brian Tomasik, Caspar Oesterheld, Duncan Wilson, Kaj Sotala, Lukas Gloor, Magnus Dam, Max Daniel, and Tobias Baumann for valuable comments and inputs.
1. One may wonder whether there might not be more efficient ways to derive energy from the non-stellar matter in our galaxy than to convert it into stars as we know them. I don’t know, yet a friend of mine who does research in plasma physics and fusion says that he does not think one could, especially if we, as we have done here, disregard the energy required to clump the dispersed matter together so as to “build” the star, a process that may well take more energy than the star can eventually deliver.
The aforementioned paper by Lawrence Krauss and Glenn Starkman also contains much information about the limits of energy use, and in fact uses accessible energy as the limiting factor that bounds the amount of information processing any (local) civilization could do (they assume that the energy that is harvested is beamed back to a "central observer").
2. And I suspect many people who have read about “singularity”-related ideas are overconfident, perhaps in part due to the comforting narrative and self-assured style of Ray Kurzweil, and perhaps due to wishful thinking about technological progress more generally.
3. According to one textbook, “Outside the European world, per capita incomes stayed virtually constant from 1700 to about 1950 […]”, implying that the global growth rate in 1900 was raised by the most developed economies, which must thus have had a growth rate greater than 2.5 percent.
4. A big problem with this model is that it is already pretty much falsified by the data, at least when it comes to exact, as opposed to approximate, symmetry. For given symmetry in the growth rates around 1965, the time it takes for three doublings to occur should be the same in either direction, whereas the data show that this is not the case — 65 years minus 47 years equals 18 years, which is roughly one doubling time. One may be able to correct this discrepancy a tiny bit by moving the year of peak growth a bit further back, yet this cannot save the model. This lack of exact symmetry should reduce our credence in the symmetric model as a description of the underlying pattern of our economic growth, yet I do not think it fully discredits it. Rough symmetry still seems a decent first approximation to past growth rates, and deviations may in part be explainable by factors such as the high, yet relatively fast diminishing, contribution to growth from developing economies.
5. It should be noted, though, that Hanson by no means rules out that such a growth mode may never occur, and that we might already be past, or in the midst of, peak economic growth: “[…] it is certainly possible that the economy is approaching fundamental limits to economic growth rates or levels, so that no faster modes are possible […]”
6. The degree to which there is sensitivity to changes of course varies between different endeavors. For instance, natural science seems more convergent than moral philosophy, and thus its development is arguably less sensitive to the particular ideas of individuals working on it than the development of moral philosophy is.
7. One may then argue that this should lead us to update toward focusing more on the near future. This may be true. Yet should we update more toward focusing on the far future given our ostensibly unique position to influence it? Or should we update more toward focusing on the near future given increased credence in the simulation hypothesis? (Provided that we indeed do increase this credence, cf. the counter-consideration above.) In short, it mostly depends on the specific probabilities we assign to these possibilities. I myself happen to think the far future should dominate, as I assign the simulation hypothesis (as commonly conceived) a very small probability.
8. For instance, fundamental epistemological issues concerning how much one can infer based on impressions from a simulated world (which may only be your single mind) about a simulating one (e.g. do notions such as “time” and “memory” correspond to anything, or even make sense, in such a “world”?); the fact that the past cannot be simulated realistically, since we can only have incomplete information about a given physical state in the past (not only because we have no way to uncover all the relevant information, but also because we cannot possibly represent it all, even if we somehow could access it — for instance, we cannot faithfully represent the state of every atom in our solar system at any point in the past, as this would require too much information), and a simulation of the past that contains incomplete information would depart radically from how the actual past unfolded, as all of it has a non-negligible causal impact (even single photons, which, it appears, are detectable by the human eye), and this is especially true given that the vast majority of information would have to be excluded (both due to practical constraints on what can be recovered and what can be represented); whether conscious minds can exist on different levels of abstraction; etc.
The post The future of growth: near-zero growth rates appeared first on Center on Long-Term Risk.
The post Uncertainty smooths out differences in impact appeared first on Center on Long-Term Risk.
Suppose you investigated two interventions A and B and came up with estimates for how much impact A and B will have. Your best guess1 is that A will spare a billion sentient beings from suffering, while B “only” spares a thousand beings. Now, should you actually believe that A is many orders of magnitude more effective than B?
We can hardly give a definitive answer to the question without further context, but we can point to several generic reasons to be sceptical. The optimizer’s curse is one of them. GiveWell has also written about why we can’t take such estimates literally.2 In this post, I’ll consider another potential heuristic to reject claims of astronomical differences in impact.
Roughly speaking, the idea is that uncertainty tends to smooth out differences in impact. Given sufficient uncertainty, you cannot confidently rule out that B’s impact is of comparable (or even larger) magnitude. If you have 10% credence that B somehow also affects a billion individuals, that suffices to reduce the difference in expected impact to one order of magnitude.3
Interestingly, this result doesn’t depend strongly on how much the two interventions naively seem to differ. If your best guess is that A affects 10^50 individuals and B affects only 1000, but you have 10% credence that B has the same astronomical impact, then the expected impact still differs by only one order of magnitude.
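To make the arithmetic behind these claims explicit, here is a minimal worked example using the toy numbers above (and assuming, for simplicity, that A’s estimate is treated as certain):

$$\mathbb{E}[\text{impact of } B] = 0.1 \cdot 10^{9} + 0.9 \cdot 10^{3} \approx 10^{8}, \qquad \frac{\mathbb{E}[\text{impact of } A]}{\mathbb{E}[\text{impact of } B]} \approx \frac{10^{9}}{10^{8}} = 10.$$

Replacing 10^9 with 10^50 changes nothing essential: the expected impact of B becomes roughly 0.1 · 10^50 = 10^49, so the two expected impacts again differ by only a factor of ten.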
Of course, the crux of the argument is the assumption of equal magnitudes, that is, having non-negligible credence that B is comparably impactful. Why would we believe that?
One possible answer is that we’re uncertain or confused about many fundamental questions. The following list is just the tip of the iceberg:
Clearly, we have an incomplete understanding of the very fabric of reality, and this will not change in the foreseeable future. Now, claiming that something is many orders of magnitude more effective requires – roughly speaking – 99% confidence (or even more) that none of the above could flip the conclusion. That sets a high bar.
One might argue that the argument misses the point in that it focuses on B having an unusually small impact compared to A, rather than A having an unusually big impact.4 To see this, we only need to tweak the framing of the toy example. Suppose that intervention B affects 1000 individuals, and we’re uncertain whether intervention A affects 1000 or 10^50 individuals. Then A dominates by many orders of magnitude in expectation as long as we have non-negligible credence that A affects 10^50 individuals.5
This is a reasonable objection, but it only works if we are certain that B can’t somehow affect the astronomical number of beings, too. This raises the question of how we can be certain about that. We can also point to big-picture uncertainties (like action correlations and huge numbers of simulations) with the specific implication that apparently small impacts can be astronomically larger than they seem.
We can apply this idea not just when comparing interventions, but also when comparing the scope of different cause areas or the impact we can have in different future scenarios. Even more abstractly, we can consider our impact conditional on competing hypotheses about the world.
For example, it is sometimes argued that we should assume that artificial superintelligence will be developed even if we think it's unlikely, because we can have a vastly larger impact in that case. I think this argument has some merit, but I don’t think the difference encompasses several orders of magnitude. This is because we can conceive of ways in which our decisions in non-AI scenarios may have similarly high impact – and even though this is not very likely, it suffices to reduce the difference in expected impact. (More details here.)
Another special case is comparing the impact of charities. Brian Tomasik has compiled a number of convincing reasons why charities don’t differ astronomically in cost-effectiveness, including uncertainty about the flow-through effects of charities.
As another example, effective altruists often argue that the number of beings in the far future dwarfs the present generation. I think the gist of the argument is correct, but our impact on the far future is not obviously many orders of magnitude more important in expectation. (See here for my own thoughts on the issue.)
As with any idea on this level of abstraction, we need to be careful about what it does and does not imply.
First, the argument does not imply that astronomical differences in impact never exist. The map is not the territory. In other words, the impact may differ by many orders of magnitude in the territory, but our uncertainty smooths out these differences in the map.
Second, the idea is a heuristic, not an airtight proof. I think it may work for a relatively broad class of interventions (or charities, hypotheses, etc.), but it may not work if you compare working on AI safety with playing video games. (Unless you’re in a solipsist simulation or you’re a Boltzmann brain.)
Third, the expected impact of an intervention or charity can be close to zero if we’re uncertain whether it reduces or increases suffering, or because positive and negative effects cancel each other out. In that case, a robustly positive intervention can be many orders of magnitude more effective – but this is just because we divide by something close to zero.
Fourth, we may be uncertain about the hypothesis in question, but justifiably confident about the differences in impact, which means that the argument doesn’t apply. For example, if we live in a multiverse with many copies of us, we can clearly have (vastly) more impact than if we exist in one universe only.
Finally, I’d like to clarify that I consider factual rather than moral uncertainty in this post. The idea may be applicable to the latter, too – see e.g. this comment by Paul Christiano and Brian Tomasik's piece on the two envelopes problem – but it depends on how exactly we reason about moral uncertainty.
Suppose we adopt a heuristic of being sceptical about claims of astronomical differences in impact, either based on this post or based on Brian Tomasik’s empirical arguments for why charities don't differ astronomically in cost-effectiveness. What does that imply for our prioritization?
First, we can use it to justify scepticism about Pascalian reasoning. At the very least, you should require strong evidence for the claim that an intervention, charity, scenario, or hypothesis should dominate our decisions because of astronomical differences in impact. On the other hand, we should be careful to not dismiss such arguments altogether. If we have non-negligible credence in both hypotheses A and B – say, more than 10% – then an impact difference of an order of magnitude in expectation suffices to justify acting as if the higher-impact hypothesis is true.
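One way to see why an order of magnitude is roughly the relevant threshold here, sketched with my own illustrative numbers and under the simplifying assumption that acting on a hypothesis only pays off if that hypothesis is true: with 10% credence in A, 90% credence in B, and an impact conditional on A that is ten times the impact x conditional on B, we get

$$0.1 \cdot 10x = x \;\ge\; 0.9 \cdot x,$$

so acting as if A were true is, just barely, at least as good in expectation as acting as if B were true.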
Second, the heuristic may also reduce the value of prioritization research, which is to some extent based on the belief that cause areas differ by many orders of magnitude. If we don’t believe that, then a larger fraction of altruistic endeavors is valuable. This, in turn, means that practical considerations like comparative advantages or disadvantages tip the balance more often than abstract considerations.
That said, I don’t think a strong version of this argument works. A difference of 10 times is still massive in practical terms and suffices to make prioritization research worthwhile.6
I would like to thank Max Daniel, Caspar Österheld, and Brian Tomasik for their valuable feedback and comments on the first draft of this piece.
The post Uncertainty smooths out differences in impact appeared first on Center on Long-Term Risk.
The post Tranquilism appeared first on Center on Long-Term Risk.
In this paper, I introduce a way of thinking about well-being that lends support to the view that reducing suffering takes moral priority over promoting happiness. Our article on suffering-focused ethics lists several other positions that could inspire this conclusion, so agreeing with tranquilism – as the view I introduce is called – is not needed to agree with suffering-focused ethics. Tranquilism is not meant as a standalone moral theory, but as a way to think about well-being and the value of different experiences. Tranquilism can then serve as a building block for more complex moral views where things other than experiences also matter morally.
Disclaimer: This is a draft of a paper I plan to submit to a peer-reviewed philosophy journal. Based on feedback from Simon Knutsson, I have concluded that the current version can still be improved for this purpose, but I wanted to upload this version already so it can be read and discussed.
I assume that an individual’s experiences can be finally (i.e., non-instrumentally) good or bad for her. There are different theories about which experiences are finally good or bad for an individual. One such theory is hedonism, which holds that pleasant experience or pleasure is what is good for an individual, and that unpleasant experience or pain is bad. This article proposes and defends an alternative theory of the value of different experiences: I call it Tranquilism. It is inspired by the Buddhist1 idea that not only pleasure, but also tranquility or contentment are amongst the best experiences.2 My theory also has roots in Epicurus’s description of the goal of a happy life as “imperturbability of the soul” (ataraxia).
In this paper, I give these ancient ideas a more precise and extensive formulation (section 2), so they can more easily be discussed in contemporary normative contexts. This is followed by arguments and intuitions in support of tranquilism. Section 3 discusses non-conscious states (unconsciousness, death, non-existence), and section 4 addresses possible objections to tranquilism. Finally, section 5 concludes with some remarks on how tranquilism could fit into discussions of population ethics.
My aim with this paper is to introduce and define tranquilism as a position that deserves further study. As a theory limited to the evaluation of experienced well-being, tranquilism is compatible with pluralistic moral views where things other than experiences – for instance the accomplishment of preferences and life goals – can be (dis)valuable too.
It has been suggested that we can group different experiences on a spectrum from bad (or unpleasant) to good (or pleasant), with a neutral part of the range in the middle. Let us call hedonism the view that such a spectrum represents the value of different experiences. According to hedonism, pleasurable experiences are valuable not because we desire them, but because they are good (and thus desirable).3
Proponents of tranquilism, however, reject this interpretation. While it is true that we can rank how much we do or do not desire to have different experiences, or rank experiences according to how pleasurable they are, it is not obvious that such a ranking accurately expresses the way we value different experiences. Tranquilism is based on an alternative conception of value, where what matters is not to maximize desirable experiences, but to reach a state free of desire.
Instead of a scale that goes from negative through neutral to positive, tranquilism’s value scale is homogeneous, ranging from optimal states of consciousness to (increasingly severe degrees of) non-optimal states. Tranquilism tracks the subjectively experienced need for change. If all is good in a moment, the experience is considered perfect. If, instead, an experience comes with a craving for change, this is considered disvaluable and worth preventing.4 Absence of pleasure is not in itself deplorable according to tranquilism – it only constitutes a problem if there is an unmet need for pleasure.
Tranquilism states that an individual’s experiential moment5 is as good as it can be for her if and only if she has no craving for change.
A craving in the tranquilist sense is a consciously experienced need to change something about the current experience. Section 2.2 below will present positive as well as negative examples for what qualifies as a craving and will distinguish two ways cravings may arise.
While tranquilism will seem counterintuitive to some people, it can hardly be said to be inherently counterintuitive. Tranquilism is based on the Buddhist perspective on suffering and happiness, and similar “absence of desire” theories are also found in Hinduism6 and the writings of Epicurus. This suggests that if we had grown up familiar with these positions, as hundreds of millions of humans have, something along the lines of tranquilism could well be the natural way we think about well-being.
In the context of everyday life, there are almost always things that ever so slightly bother us. Uncomfortable pressure in one’s shoes, thirst, hunger, headaches, boredom, itches, non-effortless work, worries, longing for better times. When our brain is flooded with pleasure, we temporarily become unaware of all the negative ingredients of our stream of consciousness, and they thus cease to exist. Pleasure is the typical way in which our minds experience temporary freedom from suffering. This may contribute to the view that pleasure is the symmetrical counterpart to suffering, and that pleasure is in itself valuable and important to bring about. However, there are also (contingently rare) mental states devoid of anything bothersome that are not commonly described as (intensely) pleasurable, examples being flow states or states of meditative tranquility. Felt from the inside, tranquility is perfect in that it is untroubled by any aversive components, untroubled by any cravings for more pleasure. Likewise, a state of flow as it may be experienced during stimulating work, when listening to music or when playing video games, where tasks are being completed on auto-pilot with time flying and us having a low sense of self, also has this same quality of being experienced as completely problem-free.7 Such states – let us call them states of contentment – may not commonly be described as (intensely) pleasurable, but following philosophical traditions in both Buddhism and Epicureanism, these states, too, deserve to be considered states of happiness.
Whether meditative tranquility or flow are called hedonically neutral or not is a matter of interpretation. To most people, they may feel pleasurable in a way, or very positive somehow, though perhaps in a different sense than e.g. orgasms feel positive. This makes perfect sense according to tranquilism, where there is no neutral range for experiences to begin with and where conscious states completely free of cravings should thus elicit very positive associations when we think of them. The important difference between tranquilism and hedonism is whether all of these states free of cravings are equally positive, or whether they form a scale of increasing value, with some such experiences being distinctly worse than others.8
According to hedonism, where the optimal state corresponds to the highest possible pleasure, states of contentment would fall short: they would be judged (heavily) suboptimal because there could be richer states of pleasure in their place. By contrast, tranquilism proposes that all of these states are in a relevant sense flawless – different in their flavor from intense pleasure, yet equally perfect with respect to the immediate evaluation that occurs internally. While pleasure has great instrumental value according to tranquilism as one way to ensure that everyday conscious experience is problem-free, maximizing pleasure does not constitute an end in itself.
According to tranquilism, a state of consciousness is negative or disvaluable if and only if it contains a craving for change. This section will outline what qualifies as such a craving and why cravings, and not anything else, capture what makes suffering bad for individuals.9 To prevent misunderstandings, I will also explain how cravings relate to preferences (they are very different) and how they relate to desires (“activated preferences”). In short, cravings can be thought of as need-based, visceral desires, which are different from (purely) reflection-based desires.
A craving is a conscious need to change something about one’s current experience. Cravings can come about in two ways: The first way cravings may arise is when our attention is being directed towards some desired experience. For instance, a smoker may experience a craving for the rush that comes from smoking a cigarette; or someone in the heat of a summer day may crave the sensation of having a refreshing drink.
The second way cravings may arise is from attention directed at the current experience, when one notices something as unwanted or bothersome. This subjective judgment is then expressed by cravings to change something or end the experience. Cravings of this latter type concern the removal of (certain components of) the current experience. They range from mild cases of disturbance (e.g. when one is bothered by uncomfortable shoes in an otherwise comfortable situation) to cases of extreme suffering, where one may wish for the entire experience to go away regardless of the costs. The stronger the craving, the more disvaluable it is.
According to tranquilism, cravings are what make an experience bad for an individual. Interestingly enough, this view implies that there could be pain that is not disvaluable, i.e., pain without any craving for it to end. For instance, in the phenomenon of pain asymbolia, subjects report feeling the sensation of pain but, startlingly, are not bothered by it.10 Similarly, Buddhist teachings suggest that some people are capable of detaching themselves from their pain or somehow ending their identification with it, thus “ending the suffering without ending the pain.”11 If one agrees that there would be no moral reason to prevent pain in cases where it comes without the distinct craving to have it end, then these fascinating conditions – should they prove to work the way they are described – lend intuitive support to tranquilism. Tranquilism says that what is bad about suffering is the craving. (See this endnote12 for more points in support of this view.)
To prevent confusion, it is important to stress that cravings are not the same as preferences. Preferences are abstract constructs that are thought to be present for an agent at any time (even during unconsciousness). Preferences represent our goals, the hypothetical choices someone would make if presented with all possible options. Cravings on the other hand may be absent even when the individual in question is conscious. And in contrast to preferences or goals, where we may sometimes be subjectively uncertain as to whether they are in an achieved state or not, there is never an open question as to whether one’s cravings are satisfied or not. Cravings are always dissatisfied, as they express a (futile) need for the current experience to be different. For the experiential moment with the original craving, it does not make a difference whether the craving is lost because a suitable distraction is found in the next moment (for instance one that involves a different kind of pleasure than we actually craved), or whether the desired state of pleasure is actually instantiated the next moment, creating a fulfilled craving.
This inspires the tranquilist perspective that the particular content of a craving – the target state we longingly envision ourselves to be in – is not actually what matters and what needs to be achieved. A craving may arise when I am imagining how nice it would be to spend a day at the beach instead of at work, but if, in the very next minute, I begin an enjoyable conversation with my co-worker that makes me forget these thoughts, the fact that I'm not actually at the beach does not bother me – it is not a problem for me nor anyone else in that moment. According to tranquilism, cravings are bad for us not because the specific need they represent is unfulfilled, but because there is a need in the first place. That is to say because the current experience is not accepted as perfectly fine the way it is.
So cravings are not preferences because – among other reasons – preferences do not have to be consciously experienced, whereas cravings are conscious by definition. What about desires? Desires can be thought of as “activated” or “conscious” preferences, as “the states of consciousness that can motivate our intentional, goal-directed actions.” The terminology surrounding this issue is complicated, and it should be noted that people sometimes treat desires and preferences as synonymous and then distinguish between occurrent desires and standing desires. In my terminology, all occurrent desires are desires, and all standing desires are preferences.
Cravings are a subtype of desires: they are need-based, visceral desires that we often cannot help but develop, whether or not doing so is good for us and our goals. But not all desires are need-based desires: a desire to do something can also be based on forward-looking reflection about what is (instrumentally or intrinsically) good for our goals. Let us call these reflection-based desires. To have a reflection-based desire means to want something not because of a craving, but because it helps us fulfil our preferences or goals (e.g. wanting to eat healthy or wanting to get a university degree). Cravings may not always win in competition with our reflection-based desires, but withstanding cravings always requires mental effort. Because we are prone to developing cravings towards all sorts of things, it can often be difficult to act on our goals in the way we would want to.
To summarize, cravings are need-based desires inspired by the Buddhist view of suffering and happiness. They are what tranquilism classifies as disvaluable or worth preventing. They are directed either towards an envisioned (pleasurable) state, or towards getting rid of (certain components of) the current state. Cravings of both types can be poetically pictured as arrows of volition urgently (and involuntarily, in the sense of not being based on reflection or deliberate endorsement) pointing away from the current experience.
According to the distinction introduced above, intentional actions result from two different motivational systems. Either we act on reflection-based desires (“reflection-based reasons for action”) or we act on need-based desires (“need-based reasons for action”). Tranquilism focuses on the need-based component of our motivation, and is inspired by the observation that pleasure does not seem central to need-based motivation in the same way (or on the same level) as suffering does.
Let us first discuss reflection-based reasons for action. Reflection-based motivation is not so much about our internal conscious experience, but more about updating world models we have formed and attaching value or disvalue to certain outcomes. We do not desire to believe that the world is going well according to our goals; we want it to actually go well.14 This gives us reflection-based reasons to keep our impulses in check so we can efficiently pursue our goals. Experiencing as much pleasure as possible may well be what many people upon reflection desire for their life, but other possible goals include adventures, accomplishments, helping others and many other things whose nature depends – at least to some degree – on a person’s temperament, beliefs and past experiences. It is plausible that our disposition towards finding things pleasurable plays an important descriptive role in how we develop our goals, but as a moral anti-realist,15 I think it would be wrong to presume that everyone “has reason”16 to reflectively desire the same goal, let alone the specific goal of personal hedonism. And even if most humans will end up valuing pleasure for its own sake, some of us may set different goals (and neither party would be making any sort of mistake).
By contrast, while one may be able to affect whether need-based desires arise at all, once they do arise, there is no choice about the issue and we cannot help but be affected by them. Next to the preference architecture that forms our goals, we have a visceral and primitive (model-free)17 motivational system that operates through cravings. This system forms our need-based motivation and often produces impulsive behavior. Pleasure does play an important role in how cravings arise, as it seems that cravings are indeed (usually) triggered when we think about pleasure or are confronted with stimuli associated with pleasure. But the need-based reasons that motivate our pursuit of pleasure, moment by moment, are not based on the intrinsic desirability of future pleasures acting at a distance, but on the same process that makes up our aversion to (both physical and psychological) pain. According to tranquilism, cravings to get rid of pain and cravings to experience pleasure are bad for the same reason, because they represent non-acceptance of the current experience. This perspective is supported by the fact that we are just as tempted to end cravings for pleasure by actually attaining the pleasure in question as by distracting ourselves with some other kind of pleasure (or even just sleep as a form of non-consciousness). For instance, if I have a craving for the state of being drunk but believe that this is not a good direction to take to accomplish my goals, I may instead engage in socializing to distract myself from my craving for alcohol. Similarly, I may try to distract myself from a headache, from back pain, or from ruminating thoughts about life’s problems by watching Netflix or going to bed early in the hope of being able to quickly fall asleep. All these acts have the same motivation: all need-based reasons for action are cravings, and they all constitute suffering.
Cravings are famously near-sighted. Rather than being about maximizing long-term well-being in a sophisticated manner, cravings are about immediate gratification and choosing the path of least resistance. We want to reach states of pleasure because as long as we are feeling well, nothing needs to change. However, our need-based motivational system is rigged to make us feel like things need to change and get better even when, in an absolute sense, things may be going reasonably well. We quickly adapt to the stimuli that produce pleasure. As Thomas Metzinger puts it, “Suffering is a new causal force, because it motivates organisms and continuously drives them forward.”18 It is not pleasure that moves us; deep down and insofar as the need-based reasons for actions are concerned, it is always suffering. The way tranquilism looks at it, part of our brain is a short-sighted “moment egoist” with the desire to move from states with a lot of suffering to closely adjacent states with less suffering.
To end this discussion with a concrete example, consider the following thought experiment. Suppose it is three o’clock in the morning, we lie cozily in bed, half-asleep in a room neither too cold nor too hot, not thirsty and not feeling obligated to get up anytime soon. Suppose we now learn that there is an opportunity nearby for us to experience the most intense pleasure we have ever experienced. The catch is that in order to get there, we first have to leave the comfortable blankets and walk through the cold for a minute. Furthermore, after two hours of this pleasure, we will go back to sleep and, upon waking up again, are stipulated to have no memories left of the nightly adventure. Do we take the deal? It is possible for us to pursue this opportunity out of reflection-based motivation, if we feel as though we have a self-imposed duty to go for it, or if it simply is part of our goal to experience a lot of pleasure over our lifetime. It is also possible for us to pursue this opportunity out of need-based motivation, if we start to imagine what it might be like and develop cravings for it. Finally, it also – and here is where tranquilism seems fundamentally different from hedonism – seems not just possible, but perfectly fine and acceptable, to remain in bed content with the situation as it is. If staying in bed is a perfectly comfortable experience, the default for us will be to stay. This only changes in the case that we hold a preference for experiencing pleasure, remember or activate it and thus form a reflection-based desire, or if staying in bed starts to become less comfortable as a result of any cravings for pleasure we develop.
To conclude, an account inspired by hedonism might suggest that pleasure is intrinsically desirable, and that this either automatically makes us desire pleasure (perhaps also in a reflection-based sense), or that something would have to be wrong with us if we did not desire pleasure.19 However, tranquilism paints a different picture, emphasizing that our need-based pursuit of pleasure is always motivated by cravings, while preferences for pleasure seem to be contingent. It seems perfectly plausible that people can upon reflection decide to not want to pursue pleasure as a central goal in their life (or not want to pursue reducing suffering) without thereby committing any kind of mistake.
This section explains why, if one accepts tranquilism, all forms of non-consciousness should plausibly be thought of as just as valuable as craving-free experiences, say, (intense) pleasure or complete meditative peace of mind.
Theories of the value of experiences, such as tranquilism and hedonism, face the question of the value of states that are not experiences, such as dreamless sleep or unconsciousness. As Tännsjö (1996) points out, it has been natural for classical hedonists to equate the value of hedonically neutral experiences with that of non-consciousness – but this is ultimately a separate normative stipulation.20
Tranquilism lends support to the Epicurean position on death and non-existence. Epicurus is known for his “hedonism” (arguably a misnomer in this case), the position that all that is good or bad lies in conscious experience. He is also famous for an argument for why death cannot constitute an evil:21
“This, the most horrifying of evils, means nothing to us, then, because so long as we are existent death is not present and whenever it is present we are nonexistent. Thus it is of no concern either to the living or those who have completed their lives. For the former it is nonexistent, and the latter are themselves nonexistent.”
Critics object that Epicurus’s argument for why death cannot be of moral concern to us ignores that things can be bad for someone, not because they are in themselves negative, but because they deprive a person of positive experiences.22 This deprivationist critique appears strong if Epicurus is interpreted within hedonist axiology, for if the value of a happy person’s life is determined by the summed up pleasures minus all the pains, this sum is indeed lower if her life is cut short prematurely. However, under the assumption of a view like tranquilism (which actually seems to capture the way Epicurus was thinking about happiness and suffering very closely),23 the Epicurean position becomes perfectly consistent. Given a tranquilist conception of the relative value of different experiences, there are no experiences that are good in themselves, no experiences whose absence could constitute a form of disvalue.
This has interesting implications for non-experienced states of affairs, i.e., any states of non-consciousness. If something is only regarded as problematic if it is experienced as such, there lies no problem in non-consciousness. While the formulation of tranquilism (cf. section 2) contains no direct statement about the way to assess the value of non-consciousness, a straightforward extension implies the Epicurean conclusion that non-consciousness, too, is among the best states of affairs.
In this section, I consider some common objections to the tranquilist account of the value of different experiences and provide my replies to them.
Tranquilism is based on the position that cravings play the central role in need-based motivation. Can we come up with similar reasons why happiness – defined loosely as “positive feelings” – also plays a central role in motivation, on equal footing with cravings so to speak? One could for instance argue that, if tranquilism says that suffering is bad for us because it corresponds to a first-person evaluation of an experience(-component) needing to end or change, perhaps happiness might correspond to a first-person evaluation of an experience as something to continue or intensify. And if cravings are pictured as an arrow of volition pointing away from the current experience, perhaps happiness could be pictured as an arrow pointing towards it, or as a loop of volition. And perhaps there is a case to be made for viewing happiness as satisfied cravings, based on which one could argue that simply distracting oneself from cravings does not produce an equally good or equally desirable result.
My reply to such objections to tranquilism is twofold:
To elaborate on point two, let us first consider culinary or sexual pleasures: these versions of happiness may indeed be regarded as fulfilled cravings of some kind, a point that is supported by the example of people who – for no apparent reason besides anticipation of pleasure – postpone addressing their hunger cravings during the day in order to receive extra enjoyment from dinner at a fancy restaurant. But then there are also positive feelings where it clearly seems false that they come as satisfied cravings or come with an evaluative need or desire to have the experience continue. Take contentment induced by sleeping pills, for instance. This experience seems to decidedly lack such a component (even though many sleeping pills are dangerously addictive), illustrated by people rarely being tempted to fight falling asleep because of how well they are feeling. This suggests that many flavors of happiness, such as meditative tranquility or drug-induced contentment for instance, do not seem to vary in intensity depending on how strong our cravings were before the experience started, or how strong the cravings would be if the experience abruptly ended. Instead, these states of contentment seem to come with a sense – not in degrees, but either on or off – that no cravings can possibly arise, that no pains can possibly affect us. Inspired by tranquilism, one could argue that what all positive experiences – all instances of happiness – have in common is that they play a functional role that prevents or protects against the formation of cravings. Pleasurable experiences do so by flooding one’s attention with pleasure, and experiences such as flow, meditative tranquility or drug-induced contentment do so by making it harder for cravings to come up.
While it is interesting that proponents of tranquilism seem to give different accounts of suffering and happiness than proponents of hedonism, it remains unclear to what extent this reflects different takes on introspection or different empirical predictions, or whether we are rather dealing with different interpretations of the same picture. Perhaps some versions of hedonism could agree with many or most of the descriptive points proponents of tranquilism make, but state normatively that we should regard happiness as more valuable than states of non-consciousness, and that being a little happy is much worse than being very happy. (Such a position would further have to stipulate how pleasure is to be compared to states of contentment, especially if one assumes that the two states relate to cravings in different ways.)
While it would be very interesting and possibly highly illuminating to learn more about the phenomena we introspectively label as pain, pleasure, happiness and suffering, it should be noted that neuroscience by itself cannot give us any direct answers to normative questions – it can only answer empirical subquestions (if they are formulated precisely enough). Neuroscientific findings can sometimes give us very strong nudges in certain directions, but there may be instances where different aspects of what is going on can be emphasized or de-emphasized by different people – in the same way people's taste or aesthetics can differ.
According to tranquilism, the most intense pleasures are no better, intrinsically, than so-called “hedonically neutral” states which are completely free of cravings. This is an unusual perspective. After all, comparing the experience of eating pizza to eating potatoes, many people24 are likely to prefer the former: Pizza gives them the more pleasurable experience, and they are more likely to develop strong cravings for pizza.
Proponents of tranquilism are not denying that some experiences can give us more pleasure than others, but they would counter that this point is not always relevant. Namely, in the case of “Pizza vs. potatoes” it can be irrelevant if both meals are good enough to completely satisfy all our needs in a given moment. A blanket assessment of which is the more valuable eating experience is based on a comparison between the two experiences from an outside perspective, where the pleasures during both meals are being compared to one another and where we may develop cravings (or reflection-based desires) for one meal but not the other. For the experiential moments in question, there is no such outside perspective. During the meal, from a first-person point of view, the ongoing moment is all there is, and the person eating potatoes instead of pizza may not be thinking about any alternatives. Tranquilism zooms in on how each state is evaluated subjectively and directly. From this internal perspective, enjoying potatoes can – sometimes – be perfectly fine as well. Assuming that one forgets everything else around and is fully absorbed in an experience, with no need for anything about it to be different, no longing for richer or different taste, it is accurate to say that the immediate, subjective evaluation of the experience is that it is flawless. And the fact that the experience of eating something different might score higher on a scale of pleasure intensity simply does not play a role according to this perspective – it may be true, but it is as irrelevant to the value of an experience as e.g. differences in a parameter such as endorphins released.
Perhaps it is helpful to point out that the intuitions in support of tranquilism are similar to the intuitions philosophers used to reject John Stuart Mill’s perfectionist axiology. Mill felt that the happiness of a pig is somehow worse than the happiness of a human, and that it is “better to be Socrates dissatisfied than a fool satisfied.”25 It is understandable that, to a genius26 like Mill, the prospect of being incapable of understanding sophisticated philosophical reasoning represented a catastrophe. And admittedly, goal and intelligence preservation are convergent instrumental goals for rational agents, so it makes sense to have an intuitive aversion to losses of this sort.27 The alternative perspective, however, is that the fool himself does not notice that anything is lacking; nothing about being a fool bothers him in any way. The same goes for the happy pig in comparison with an ordinary human. Many people, in fact, believe that animals have it better than humans because they live in the moment. If our evaluation focuses on how a situation is experienced from the inside, then Mill’s arguments miss the point, as he is highlighting aspects of outside characteristics entirely irrelevant to the ongoing evaluation of the experience in question. The same intuitions used against Mill’s perfectionist axiology lend support to the idea that states of flow or meditative tranquility, evaluated with respect to the presence or absence of cravings, can be no less valuable than states of intense pleasure.
Pizza is fine and good, but what about the wonder of holding one’s newborn child for the first time, the ecstasy of winning the world championship in a sport one has trained for one’s entire life, the bliss of reuniting with the love of one’s life after years of forced separation, or the triumph of finally getting revenge on the bad guy? How can there be no relevant difference between these experiences and mere states of contentment?
Before giving an answer, it is important to point out what tranquilism is not saying. Tranquilism is not a standalone moral theory committed to the claim that nothing besides suffering (cravings) is of moral concern. Tranquilism is only saying that insofar as we focus our attention solely on experiences, we can conclude something about the relative value of these experiences. Again, tranquilism zooms in on an experience and omits everything else beyond it. Arguably, the examples above are powerful to a large extent because they condense purpose and meaning over the course of a person’s life. If we instead envisioned these experiences as doctored somehow, such that they would not have been part of someone’s biography, but were rather based on fake memories and trickery, they might lose a substantial portion of their appeal. We could still regard them as intensely pleasurable and (subjectively) profound experiences, but stripped of the context of truly achieving one’s goals, these examples arguably boil down to something more analogous to “Pizza vs. potatoes?” discussed above. If we only consider experiential moments in isolation, and not what they might mean to us in terms of purpose and life goals, the tranquilist perspective becomes more intuitive. And to the extent that we want to honor what these experiences mean to us in terms of purpose and life goals, we should perhaps be talking about whether accomplishing these goals is intrinsically valuable – something that is outside the scope of this paper.
One way in which depression can manifest itself is through complete apathy or inertia. A person may feel as though “no change could make any difference to my misery” and, correspondingly, lack any intent to bring about change. Is this an example of a negative state that is nevertheless free of cravings – something that would contradict the tranquilist account?
The answer is “no” because there is a difference between drive or willpower on the one hand and cravings on the other. One may lack motivation to bring about change, but this does not mean one cannot experience the need for change. Inertia seems to represent a decoupling or erosion of our action module – whatever translates our feelings and beliefs into goal-directed action. But experiencing inertia or apathy, if it indeed comes as a negative experience caused by depression (and not e.g. a satisfied kind of inertia we might experience after eating tasty food to the point of complete satiation), comes with an urgent desire to feel different and feel better. If that were not the case, we would not label the feeling as bad or depressed in the first place. Apathy from depression is therefore not a state that lacks cravings, but one that lacks components like hope, motivation, drive and/or willpower. Describing this state as “lacking any intent to bring about change” would be mistaken.
The last objection I want to discuss concerns the nature of cravings and whether they can really be said to always be negative. Arguing the contrary, readers of an earlier draft pointed to the feeling of anticipation. When looking forward to a vacation for instance, a person engaged in planning exciting activities may find herself in an enthusiastic, very positive frame of mind. She is envisioning something she does not currently have. Does this not constitute an example of a craving being experienced as positive, contradicting tranquilism’s premise that cravings are what is bad for an individual?
I do not think the person in question can be said to experience a craving; she is not experiencing a conscious need to turn her current feelings into the positive feelings associated with the upcoming vacation. After all, she is already feeling good. Envisioning the exciting activity (and envisioning her progress towards this activity) puts her in the positive state of mind associated with it, which is notably different from craving a not-currently-experienced feeling.
This point seems to bring up a related line of objection against tranquilism. While I pointed out (cf. section 2.3) that we seem to be moved by suffering, could we not imagine a person being moved (or “powered” rather in this case) entirely by anticipation of a positive experience? This seems indeed plausible. During periods of high motivation (or mania), people seem to exhibit high productivity resulting from a kind of positively experienced restlessness. Interestingly, however, this state of mind seems to be associated with agency and (sometimes overly optimistic) planning or advancing towards a goal, and not with a short-sighted search for gratification. Introspectively, I would characterize this state of mind as being motivated by optimism about achieving one’s goals – in the sense that there is very much also an element of prediction in the experience and not (just) wishing or longing. To me, anticipation is not synonymous with being motivated by positive feelings; rather, it seems to consist of motivation fuelled by modelling the future as positive for one’s goals.
Arguably however, it is up for interpretation which aspect of our mental phenomena one wants to emphasize, and introspection and interpretation of what matters to us may not turn out the same way for everyone. It is therefore possible (though by no means obvious) that people who experience this anticipation-based drive or euphoria unusually often (or those who very rarely experience strong cravings for that matter) are less inclined to sympathize with tranquilism as a theory of well-being.
I introduced and explained tranquilism as a theory of well-being, contrasted it with hedonism and addressed plausible objections. Whereas the hedonist conception of well-being is about maximizing positive experiences and minimizing negative ones, the tranquilist understanding of well-being is about freedom from cravings. In some respects, tranquilism is perhaps better conceptualized as a theory of need-based motivation. Tranquilism assumes that experiences are not desirable or undesirable in themselves. Rather, our pursuit of pleasure and our aversion to pain both manifest themselves as a conscious need (craving) to change something about the current experience. These cravings are what qualify as suffering. According to tranquilism, suffering is of negative value because it corresponds to a first-person evaluation of an experience as negative, as something to change or eliminate.
Because flooding the brain with pleasure is an efficient way to make cravings disappear from our stream of consciousness, pleasure still carries great instrumental importance in tranquilism. But the absence of pleasure does not count as negative in itself. As long as a pleasureless state of consciousness remains free of cravings, it is still considered an optimal and happy experience.
Tranquilism is not committed to the view that cravings are all that matter. Our motivation is multifaceted, and next to impulsive motivation through cravings, we are also motivated by desires to achieve certain goals. People do not solely live for the sake of their personal well-being; we may also (or even exclusively) hold other goals, including goals about the welfare of others or goals about the state of the world. Thinking about morality in terms of goals (or “ends”) has inspired rationality-based accounts of cooperation such as Kantianism (arguably), contractualism, or coordinated decision-making between different value systems.28 Furthermore, if one chooses to regard the achievement of preferences or goals as valuable in itself, this can inspire moral axiologies such as preference-based consequentialism or coherent extrapolation, either as a complement or an alternative to one’s theory of well-being. (And of course, one’s goals may include many other components, including non-moral or non-altruistic ones.)
While tranquilism gives counterintuitive value judgments in cases where one’s goals or preferences play a large role (see e.g. “What about the peak of human experience?” above, or any thought experiments that involve killing happy people who want to go on living in order to prevent small amounts of suffering), it may be particularly relevant in the domain of population ethics. This is because in population ethics, populations are compared and evaluated according to the quality and (more relevantly) quantity of the individual lives they contain. When thinking about whether to reduce the size of an existing population, the goals of the individuals in that population seem relevant, at the very least, for reasons of cooperation, and perhaps also intrinsically (depending on whether things other than well-being have final value). So even though tranquilism might say that there is a sense in which an empty world is better than one populated by a lot of happy beings and a few suffering ones (namely according to the value of well-being in that world), endorsing tranquilism is compatible with positions where this judgment can be overridden by other, stronger reasons (such as e.g. reflection-based reasons or reasons of cooperation).
On the other hand, when one considers adding new beings to the world, it looks as though reflection-based reasons for action are not straightforwardly applicable. Do we count what a person would desire after coming into existence, or do we only count the goals of individuals that will exist regardless of how we decide? What if someone deliberately creates a person whose preference ranks an infinite amount of suffering above non-existence? What about creating beings capable of suffering that cannot form preferences or goals? Moral views inspired by reflection-based reasons for action would (arguably) leave the decision up to the goals or preferences of those currently in power. However, it seems that this overlooks a relevant dimension of the scenarios discussed. Namely, what is missing is an axiological perspective inspired by need-based reasons for action. I find it plausible that tranquilism could be this perspective. Tranquilism would suggest treating such cases according to the Epicurean position that non-existence, here in the sense of not being born, cannot be a problem for anyone, and that creating new happy beings is never intrinsically valuable.
In contemporary ethics, the ideas that non-existence may be unproblematic, or that there is a sense in which an empty world could be better than one filled with a tiny amount of suffering and a large amount of happiness, are taken as a counterintuitive non-starter by several philosophers.29 Considering, however, that judgments about population ethics are guaranteed to be counterintuitive in at least some respects,30 we should not dismiss otherwise appealing perspectives too quickly. Tranquilism and related “absence of desire” theories deserve, in my view, more consideration than they currently receive – especially if they are regarded not as standalone moral theories, but as approaches that complement the aspects of morality that deal with reflection-based reasons for action.
Adriano Mannino and I developed some of the ideas behind our version of tranquilism from 2011 to 2014 during lengthy discussions in which we grappled with issues in population ethics, and eventually encountered similar, more elaborate concepts in Buddhism. Buddhism was initially an inspiration for me in the sense that I was passively aware of statements such as “desire is suffering.” However, I only started to think of suffering as cravings, rather than simply as an undesirable experience, after being inspired by certain views in the philosophy of mind. I thank Adriano and Simon Knutsson for contributions to an earlier draft of this paper, and Brian Tomasik, Kaj Sotala, Tobias Pulver, Michael Moor, and David Althaus for their valuable feedback on this topic over the years. Finally, thanks to Piti Irawan for comments that inspired the description of cravings as needs related to the non-acceptance of the current state, and to Ruairi Donnelly, David Althaus and Max Daniel for their feedback and suggestions on the current draft, as well as to Persis Eskander for feedback and help with editing. I would also like to give a nod to Bruno Contestabile (cited below), who independently came up with the term Buddhist axiology (the title of an earlier draft of this paper), and to Dan Geinster, with whom I have not corresponded, but who seems to have come up with similar ideas under the label “anti-hurt.”
The post Tranquilism appeared first on Center on Long-Term Risk.
What does moral advocacy look like in practice? Which values should we spread, and how?
How effective is moral advocacy compared to other interventions such as directly influencing new technologies?
What are the most important arguments for and against focusing on moral advocacy?
The post Arguments for and against moral advocacy appeared first on Center on Long-Term Risk.
Spreading beneficial values is a promising intervention for effective altruists. We may hope that our advocacy has a lasting influence on our society (or its successor), thus having an indirect impact on the far future. Alternatively, we may view moral advocacy as a form of movement building; that is, it may inspire additional people to work on the issues we care most about.
This post analyses key strategic questions on moral advocacy: what moral advocacy looks like in practice, which values we should spread and how, how effective it is compared to other interventions such as directly influencing new technologies, and what the most important arguments for and against it are.
The first thing that comes to mind when thinking about moral advocacy is to “go out there and spread the word”. But advocating for one’s values can take many, often indirect, forms. For example, Eliezer Yudkowsky’s work on LessWrong and on AI safety is not moral advocacy in the narrow sense, but it did spread his values in the community, at least to a certain extent.
Similarly, when I talk about moral advocacy, I don’t mean blunt repetition of our values, but rather having something to show for it. In practice, moral advocacy is closely related to high-quality object-level work to demonstrate what the values are about.
I use the terms “values spreading” and “moral advocacy” interchangeably. In comparison to “movement building” or “community building”, moral advocacy refers to activities that influence people’s values rather than just providing services to the community – though both can and should go hand-in-hand.
Of course, the effectiveness of moral advocacy depends on which values we try to spread. I think the following are plausible candidates:
Figuring out which of these values contributes most to reducing suffering in the future would need much more research. What matters is not only how much the value helps in “foreseeable” future scenarios, but also how it influences unknown unknowns and how robustly it improves the future rather than making it worse.
An ideal analysis of the advantages and drawbacks of moral advocacy would consider all these values separately. Despite this, the rest of this post will talk about how promising moral advocacy is in general, and the term will refer to a combination of the above values. I think this approach makes sense for several reasons:
People often view moral advocacy as the attempt to change the values of society at large – which is fairly hard – but I don’t think we necessarily need to aim for this. Convincing a small minority may already yield a significant fraction of the benefits.
This is because low-hanging fruit will likely allow this minority to effectively reduce suffering. For instance, in the case of factory farming, stunning animals before slaughter and basic welfare laws prevent a large fraction of the suffering that would otherwise happen, and society implements such measures despite the fairly low level of concern for animals. Of course, factory farming is still horrific, so this does not get us all the way. But it’s at least plausible that increased concern for suffering has diminishing marginal returns.
Similarly, we might hope that fairly little concern for suffering would go a long way towards mitigating the “incidental” harm caused by egoistic or economic forces in the future. The point may apply even more strongly if advanced future technology facilitates the fulfillment of many values at once – like cultured meat in the factory farming analogy.
The analogy also highlights the role of consistency. Most people are compassionate in that they care about certain animals like dogs and cats, but they do nothing to help farm animals. The situation is even worse if we consider wild animal suffering and digital sentience. So, little concern for suffering only goes a long way if it’s consistent, that is, includes all sentient beings.
Convincing more people of our value system (or parts thereof) seems robustly positive because they will pursue interventions that are positive for this value system in expectation (unless they are extremely irrational). Better values also reduce expected suffering from unknown unknowns.
(There are, however, considerations why moral advocacy may be less robust than it seems. I have unpolished material on this. If you’re interested, please contact me to discuss further details.)
Many ways to have an impact depend on uncertain and often speculative predictions of how the future will unfold. For example, working on the risks of advanced AI hinges on the assumption that AI will be a pivotal technology. In fact, we arguably need fairly detailed scenarios to work effectively on the topic.
But we have every reason to expect that our analyses may be flawed, given the intrinsic difficulty of predicting the (distant) future and the lack of robust evidence and data. This reduces both the expected magnitude and the robustness of any intervention that requires such predictions.1
In contrast, a causal chain of the form "moral advocacy leads to better values in the future, which (in expectation) reduces suffering" is fairly disjunctive in that it does not require specific predictions of the future. It does, however, require the following implicit assumptions:
Note that the impact of moral advocacy is not necessarily based on hoping that better values now directly translates into better values in the far future. It only assumes that additional altruistic people somehow do something valuable. For example, they may contribute to shaping advanced AI, in which case moral advocacy is relevant even if the long-term distribution of values reverts to an equilibrium.
Moral advocacy can be highly influential. Individuals such as Peter Singer or Brian Tomasik, whose advocacy has inspired many others, are the most obvious evidence for this claim. More generally, the differences in values between communities (e.g. LessWrong, effective altruism in the English-speaking area, and effective altruism in the German-speaking area) can often be traced back to a few key individuals, which suggests that spreading values can multiply your impact.
It also suggests that the effectiveness strongly depends on how common the values already are, and on how good you are at spreading them. Peter Singer and Brian Tomasik had a large impact because they pioneered their respective causes (animal liberation, wild animal suffering). Similarly, we may prefer to advocate “neglected” values such as concern for digital sentience, the importance of s-risks, or suffering-focused ethics over more common values like concern for animals.
While moral advocacy is plausibly high-impact, it is possible that focused attempts to shape new technologies are an even more powerful lever. But it’s also comparatively more difficult and risky to try to find such interventions.
We may question the long-term impact of spreading values based on the following cluster of arguments:
I agree that these are reasonable points that cast doubt on the idea that we can influence the values of the far future by spreading our values. Still, I’d like to offer a few replies:
In a deterministic view of history, technological development – not moral advocacy – is the driving force behind changes in people’s values. Did slavery end because of the abolitionist movement’s advocacy or because industrialization made slaves obsolete? If we think that changing circumstances like the emergence of new technologies ultimately determine societal values, then advocacy may be pointless.
However, it is just as possible that the causation goes the other way, that is, values influence technological developments. For instance, concern for animals hastens the development of clean meat, and in human history, nationalism may have accelerated the invention of new forms of warfare. The interplay between values and technology is complex; most likely, the causal effects go both ways. (More research on this would be valuable.)
In a similar vein, proponents of a “germ theory” of values argue that value differences are based on arbitrary factors such as the local frequency of diseases, not on advocacy or moral reflection. More generally, one might argue that most people aren’t perfect agents and are not that interested in philosophical reflection, which may reduce the impact of moral advocacy.
However, even if this is true, advocacy still nudges people in better directions. Also, as argued earlier, we don’t need to convince everybody – it may be sufficient to have a minority that cares about reducing suffering.
Efforts to shape powerful new technologies may be more focused than moral advocacy. Also, few people work on the technical aspects of new technologies like AI, while one needs relatively many people to make moral advocacy effective. In fact, moral advocacy may require more resources than it naively seems because of tricky questions of impact attribution, especially if moral advocacy is done over many generations.
I tentatively agree that directly influencing technology is a better lever if we can identify targeted interventions that are both robustly positive and tractable. Work on the risks of advanced artificial intelligence – especially fail-safe AI – may or may not fill this role. But trying to shape technology is also riskier in that it’s more likely that our efforts will be wasted, for example because we are mistaken about which technologies are most consequential. Also, moral advocacy is a way to shape technology, too – by having more people work on it when the time comes.
All told, I'm fairly agnostic about the relative effectiveness of moral advocacy and directly working on technology.
Max Daniel argues that “values spreading, in general, is pretty crowded, a lot of religious, political, and philosophical groups are trying to spread values and compete for attention”.2
I agree that lots of groups try to spread their values. But the more relevant question is how many groups spread values that are in direct competition to your own, that is, appeal to the same class of people. If we assume that people have innate intuitions for or against particular moral views, and that this determines whether they adopt a value system upon reflection, then the advocacy of other value systems does not make our own advocacy less effective.
Of course, this assumption is idealized, so the argument does not quite work. In particular, people’s values are often determined by the values they first come into contact with, or by the values held by respected individuals. In other words, we observe strong path dependencies regarding which values someone endorses after reflection.
But I would still maintain that the largest obstacle for spreading values is usually not the moral advocacy of other groups. Rather, it’s that relatively few people are strongly altruistic, reflect a lot about philosophy, share my suffering-focused intuitions about population ethics, and so on. Of course, this may in and of itself reduce the value of moral advocacy or lead to strongly diminishing marginal returns at some point.
More people work on broader forms of moral advocacy such as spreading generic altruism or expanding the moral circle, but the numbers are still fairly small on an absolute scale. 80000 hours also thinks advocacy is neglected because “there’s usually no commercial incentive to spread socially important ideas”.
In his essay “Against moral advocacy”, Paul Christiano argues that trying to spread one’s values is often a zero-sum game that we should abstain from. The idea is that if we spread our values, other groups will spread opposing values, leading to no net change.
This is an interesting argument, but I don’t quite agree with the conclusion, at least for the kind of moral advocacy I have in mind. This is for the following reasons:
Moral advocacy is plausibly high-impact, especially if we believe that a fairly low level of concern for suffering will be enough to pick the low-hanging fruit. I think many of the objections are reasonable points, but not decisive, which is why I’m still mildly optimistic.
That said, the relative effectiveness of advocacy compared to other interventions like shaping new technology is highly uncertain. I tentatively think that the latter can be even more impactful in the best case, but it’s more difficult to find a good lever. A promising approach is to combine both by shaping the values of the people who shape technology.
I am indebted to Max Daniel, Lukas Gloor, Brian Tomasik, and David Althaus for valuable comments and discussions.
The post Arguments for and against moral advocacy appeared first on Center on Long-Term Risk.
The post Strategic implications of AI scenarios appeared first on Center on Long-Term Risk.
Efforts to mitigate the risks of advanced artificial intelligence may be a top priority for effective altruists. If this is true, what are the best means to shape AI? Should we write math-heavy papers on open technical questions, or opt for broader, non-technical interventions like values spreading?
The answer to these questions hinges on how we expect AI to unfold. That is, what do we expect advanced AI to look like, and how will it be developed?
Many of these issues have been discussed at length, but the implications for the action-guiding question of how to best work on the problem often remain unclear. This post aims to fill the gap with a rigorous analysis of how different views on AI scenarios relate to the possible ways to shape advanced AI.
We can slice the space of possible scenarios in infinitely many ways, some of which are more useful for our thinking than others. Commonly discussed questions about AI scenarios include:
The reason why we ask these questions is that the answers determine how we should work on the problem. We can choose from a plethora of possible approaches:
To avoid the complexity of considering many strategic questions at the same time, I will focus on whether we should work on AI in a technical or non-technical way, which I believe to be the most action-guiding dimension.
The value of technical work depends on whether it is possible to find well-posed and tractable technical problems whose solution is essential for a positive AI outcome. The most common candidate for this role is the control problem (and subproblems thereof), or how to make superintelligent AI systems act in accordance with human values. The viability of technical work therefore depends to some extent on whether it makes sense to think about AI in this way – that is, whether the control problem is of central importance.3
This, in turn, depends on our outlook on AI scenarios. For instance, we might think that the technical problems of AI safety may be less difficult than they seem, that they will likely be solved anyway, or that the most serious risks may instead be related to security aspects, coordination problems, and selfish values.
The following views support work on the control problem:
In contrast, if AI is like the economy, then the control problem does not apply in its usual form – there is no unified agent to control. Influencing the technical development of AI would be harder because of its gradual nature, just as it was arguably difficult to influence industrialization in the past.
It is often argued that an agent-like superintelligence would ultimately emerge even if AI takes a different form at first. I think this is likely, but not certain. But even so, the strategic picture is radically different if economy-like AI comes first. This is because we can mainly hope to (directly) shape the first kind of advanced AI since it is hard to predict, and hard to influence, what happens afterward.
In other words, the first transition may constitute an “event horizon” and therefore be most relevant to strategic considerations. For example, if agent-like AI is built second, then the first kind of advanced AI systems will be the driving force. They will be intellectually superior to us by many orders of magnitude, which makes it all but impossible to (directly) influence the agent-like AI via technical work.
This brings us to another intermediate variable, namely how much technical safety work will be done by others anyway. If the timeline to AI is long, if the takeoff is soft, or if AI is like the economy, then large amounts of money and skilled time may be dedicated to AI safety, comparable to contemporary mainstream discussion of climate change.
As AI is applied to more and more industrial contexts, large-scale failures of AI systems will likely become dangerous or costly, so we can expect that the AI industry will be forced to make them safe, either because their customers demand it or because of regulation. We may also experience an AI Sputnik moment that leads to more investment in safety research.
Since the resources of effective altruists are small in comparison to large companies and governments, this scenario reduces the value of technical AI safety work. Non-technical approaches such as spreading altruistic values among AI researchers or work on AI policy might be more promising in these cases. However, the argument does not apply if we are interested in specific questions that would otherwise remain neglected, or if we think that safety techniques will not work anymore once AI systems reach a certain threshold of capability. (It’s unclear to what extent this is the case.)
This shows that how we work on AI depends not only on our predictions of future scenarios, but also on our goals. Personally, I’m mostly interested in suffering-focused AI safety, that is, how to prevent s-risks of advanced AI. This may lead to slightly different strategic conclusions compared to AI safety efforts that focus on loading human values. For instance, it means that fewer people will work on the issues that matter most to me.
A related question is whether strong intelligence enhancement, such as emulations or iterated embryo selection, will become feasible (and is employed) before strong AI is built. In that case, the enhanced minds will likely work on AI safety, too, which might mean that future generations can tackle the problem more effectively (given sufficiently long timelines). In fact, this may be true even without intelligence enhancement because we are nearsighted with respect to time, that is, it is harder to predict and influence events that are further in the future.
It’s not clear whether strong intelligence enhancement technology will be available before advanced AI. But we can view modern tools such as blogs and online forums as a weak form of intelligence enhancement in that they facilitate the exchange of ideas; extrapolating this trend, future generations may be even more “intelligent” in a sense. Of course, if we think that AI may be built unexpectedly soon, then the argument is less relevant.
Technical work requires a sufficiently good model of what AI will look like, or else we cannot identify viable technical measures. The more uncertain we are about all the different parameters of how AI will unfold, the harder it is to influence its technical development. That said, radical uncertainty also affects other approaches to shape AI, potentially making it a general argument against focusing on AI. Still, the argument applies to a larger extent to technical work than to non-technical work.
In a nutshell, AI scenarios inform our strategy via three intermediate variables: whether the control problem is of central importance, how much technical safety work will be done by others anyway, and how clear a picture we have of what AI will look like.
Technical work seems more promising if we think the control problem is pivotal, if we think that others will not invest sufficient resources, and if we have a clear picture of what AI will look like.
Effective altruists should coordinate their efforts, that is, think in terms of comparative advantages and what the movement should do on the margin rather than just considering individual actions. Applied to the problem of how to best shape AI, this might imply that we should pursue a variety of approaches as a movement rather than committing to any single approach.
Still, my impression is that non-technical work on AI is somewhat neglected in the EA community. (80000 hours’ guide on AI policy tends to agree.)
My position on AI scenarios is close to Brian Tomasik’s: I lean toward a soft takeoff, relatively long timelines, and distributed, economy-like AI rather than a single actor. Also, we should question the notion of general (super)intelligence. AI systems will likely achieve superhuman performance in more and more domain-specific tasks, but not across all domains at the same time, which makes it a gradual process rather than an intelligence explosion. But of course, I cannot justify high confidence in these views given that many experts disagree.
Following the analysis of this post, this is reason to be mildly sceptical about whether technical work on the control problem is the best way to shape AI. That said, it’s still a viable option because I might be wrong and because technical work has indirect benefits in that it influences the AI community to take safety concerns more seriously.
More generally, one of the best ways to handle pervasive uncertainty may be to focus on “meta” activities such as increasing the influence of effective altruists in the AI community by building expertise and credibility. This is valuable regardless of one’s views on AI scenarios.
The post Strategic implications of AI scenarios appeared first on Center on Long-Term Risk.
The post Team appeared first on Center on Long-Term Risk.
The post Tool use and intelligence: A conversation appeared first on Center on Long-Term Risk.
This post is a discussion between Lukas Gloor and Tobias Baumann on the meaning of tool use and intelligence, which is relevant to our thinking about the future of (artificial) intelligence and the likelihood of AI scenarios. To help distinguish the participants, each contribution is preceded by the speaker’s name.
See also: Magnus Vinding's response to this conversation.
The trigger of the discussion was this statement:
> Intelligence is the only advantage we have over lions.
Tobias Baumann:
I think this framing is a bit confused. The advantage over lions is because of tools (weapons) which resulted from a process of cultural evolution taking thousands of years, not just "intelligence". An individual human without technology and a lion rule the Earth equally little.
Lukas Gloor:
Hominids drove megafauna into extinction across several continents, which is a fairly large 'accomplishment' on a species-level scale.
More importantly, cultural evolution greatly increased our intelligence (in the “goal-achieving capacity” sense). Agriculture led to specialization and more time for people to learn new skills; the printing press led to the accumulation of knowledge; increased nutrition led to higher IQs; possibly gene-culture co-evolution on surprisingly short timescales also increased IQ; etc. There’s a conceptual risk of confusion in that the way we normally use "intelligence" is underspecified. I suggest we distinguish between
1) differences in innate cognitive algorithms.
and
2) what difference the above make when coupled with a lifetime of goal-directed learning and becoming proficient in the use of (computer-)tools.
There is a sizeable threshold effect here between lions (and chimpanzees) and humans: above a certain threshold of intelligence, you're also able to reap all the benefits from culture. (There might be an analogous threshold for self-improvement FOOM benefits in AI.)
> The advantage over lions is because of tools (weapons) which resulted from a process of cultural evolution taking thousands of years, not just "intelligence".
This is an oversimplification because the lion could not make use of tools. The availability of tools in an environment amplifies how many returns you can get out of being intelligent, but the effect need not be proportional. A superintelligence in the Stone Age would most likely be lost, never achieving anything special, because it might just run out of electricity/resources before it can influence the course of the world in ways that favor its survival and goals. Superintelligence today already has a much better shot at attaining dominance, because more tools are plausibly within its reach. Superintelligence in 100 years may have it even easier if the tools are not protected from its reach. (E.g. think of that Die Hard movie where a supercomputer controls all the street lights or something – this only works if the society is already pretty computerized.)
So the point is:
There are some (very simple and tool-empty) environments where intelligence differences don't have much of a practical effect (and domain-specific “intelligence,” i.e. evolutionary adaptiveness, is more relevant in these environments, which is why the lion is going to eat Einstein but not necessarily the indigenous hunter who has experience with anti-lion protection). But there are also (more interesting, complex) environments where intelligence differences play a greatly amplified role, and we currently live in such an environment.
Magnus Vinding seems to think that because humans do all the cool stuff "only because of tools," innate intelligence differences are not very consequential. This seems wrong to me, and among other things we can observe that e.g. von Neumann’s intellectual accomplishments were so much greater than, and in a sense out of reach of, the accomplishments that would be possible with an average human brain.
Tobias Baumann:
> Hominids drove megafauna into extinction on many continents, which is a fairly big accomplishment on a species-level scale.
I feel uneasy about this because it's talking about the species-level, which cannot be modeled as an agent-like process. I would say humans have occupied an ecological niche (being a technological civilization), which transforms the world (suddenly there are weapons) in ways that causes some species to go extinct because they were not adapted to the change.
> cultural evolution greatly increased our intelligence.
Can we agree on intelligence meaning "innate cognitive ability"? I would agree that the statement is true even in this definition (Flynn effect). One can also talk about goal achievement ability, but I would simply call that "power". (It's correlated, of course.)
One difference in our thinking seems to be that you see the existence of tools (e.g. guns) as an environment that amplifies how powerful your intelligence is, whereas I look at it as "there are billions of agents in the past and present that have contributed to you being more powerful, by building tools for humans, by transmitting knowledge, and many more". In this picture, the lion just got unlucky because no one worked to make him more powerful. The "threshold" between chimps and humans just reflects the fact that all the tools, knowledge, and so on were tailored to humans (or maybe tailored to individuals with superior cognitive ability).
However, this seems to be mostly a semantic point, not a genuine disagreement. I would agree that intelligence does correlate strongly with power in the environment we live in. Whether a smarter-than-human AI would be able to achieve dominance depends on whether it would be able to tap into this collection of tools, knowledge, and so on. This is plausible for certain domains (e.g. knowledge that can be read on wikipedia) and less plausible for other domains (e.g. implicit knowledge that is not written down anywhere, or tools that require a physical body of some sort).
I would assign a decent probability that a smarter-than-human AGI would actually be able to achieve dominance (e.g. it might find ways to coerce humans into doing the stuff that it can't access on its own). What I find more problematic is the notion of AGI itself, or more precisely the idea that there's a single measure of intelligence. It seems more likely to me that we will see machines achieve superhuman ability in more and more domains (stuff like Go and image recognition has changed status in recent years), but not in all at once, and there will be some areas that are very difficult (e.g. conceptual thinking, big-picture strategic thinking, "common sense", social skills).
It is plausible (though not clear-cut) that machines would eventually master all these areas, but this would not be a foom-like scenario because they would not become superhuman in all of them at once. Also, it's perfectly possible that an AI has mastered enough domains to radically transform society even if it lacks some crucial components (e.g. social intuitions, or the ability to access human tools), which would also be a scenario that's different from the usual superintelligence picture in relevant ways.
Side note: One might argue that empirical evidence confirms the existence of a meaningful single measure of intelligence in the human case. I agree with this, but I think it's a collection of modules that happen to correlate in humans for some reason that I don't yet understand.
[Note: Robin Hanson makes the point that “most mental tasks require the use of many modules, which is enough to explain why some of us are smarter than others. There’d be a common “g” factor in task performance even with independent module variation.”]
Lukas Gloor:
> The "threshold" between chimps and humans just reflects the fact that all the tools, knowledge, etc. was tailored to humans (or maybe tailored to individuals with superior cognitive ability).
So there's a possible world full of lion-tailored tools where the lions are making our lives miserable all day?
Further down you acknowledge that the difference is "or maybe tailored to individuals with superior cognitive ability" – but what would it mean for a tool to be tailored to inferior cognitive ability? The whole point of cognitive ability is to be good at making the most out of tool-shaped parts of the environment. Edit: In the sense of cognitive ability being defined/measured in a way that tends to correlate with this.
Tobias Baumann:
> So there's a possible world full of lion-tailored tools where the lions are making our lives miserable all day?
Yes. If not for lions, it’s at least possible for chimps or elephants.
[Note: The claim is not that such a world is plausible – why would anyone build tools for lions – just that it is physically possible.]
> Further down you acknowledge that the difference is "or maybe tailored to individuals with superior cognitive ability" – but what would it mean for a tool to be tailored to inferior cognitive ability?
Hmm, fair point, but it's at least conceivable that tools are such that they can be used by those with lower cognitive ability, too.
> The whole point of cognitive ability is to be good at making the most out of tool-shaped parts of the environment.
I’m not sure if it's the whole point. Intelligence can also be about how to best find sexual mates, gain status, or something like that. Elephant brains are bigger than human brains, but elephant tool use is along the lines of "use branches to scratch yourself", so I have a hard time believing that this is the only reason.
Lukas Gloor:
I don't think tool use and intelligence work that way, but it's an interesting idea and the mental pictures I'm generating are awesome!
-----------------------------------------------------------------
Addendum by Lukas Gloor, from a related discussion:
Magnus takes the human vs. chimp analogy to mean that intelligence is largely “in the (tool-and-culture-rich) environment.” But I would put it differently: “General intelligence” is still the most crucial factor (after all, something has to explain intra-human variation in achievement prospects), but intelligence differences – here’s where Magnus’ example with the lone human vs. lone chimpanzee comes in – seem to matter a lot more in some environments than in others. Early cultural evolution created an environment for runaway intelligence selection (possibly via Baldwin effect), and so the more “culturally developed” the environment became, the bigger the returns from small increases to general intelligence.
I’m not sure “culturally developed” is exactly the thing that I was looking for above. I mean something that corresponds to “environments containing a varied range of potentially useful tools,” but it could be fruitful to think more about specific features that affect the amount of “edge” you get over less intelligent agents depending on the environment they compete in. Factors for environments where intelligence gives the greatest returns seem to be things like the following:
– Lawfulness
– Complexity/predictability (though maybe you want things to be “just difficult enough” rather than as easy as possible?)
– Variation: many different aspects to explore
– Optimized features, sub-environments: Optimization processes have previously been at work, crafting features of the environment in ways that might be useful for convergent instrumental goals of the competing agents
– Transferability of insights/inventions: Common “language” of the environment (e.g. literal language giving you lots of options in the world/society; programming language giving you lots of options because of universal application in computers, etc.)
– Availability of data
– Upwards potential: How many novel insights are left to discover, as opposed to just learning and perfecting pre-existing tricks (Kaj’s “How feasible is the rapid development of superintelligence” paper seems relevant here)
– etc.
(Re the last point: It is probably true that the best poker player today has a lower edge (in terms of how much they win in expectation) versus the average player than the best player five years ago had versus the average player back then. This is (among other reasons) due to more popular teaching sites and the game in general being closer to the “mostly solved” stage. So even though there nowadays exist more useful tools to improve one’s poker game (including advanced “solver software”), the smartest minds have less of an advantage over the competition. Maybe Magnus would view this as a counterpoint to the points I suggested above, but I’d reply that poker, as a narrow game where humans have already explored the low-hanging fruit, is relevantly different from ‘let’s gain lots of influence and achieve our goals in the real world’.)
My view implies that quick AI takeover becomes more likely as society advances technologically. Intelligence would not be in the tools, but tools amplify how far you can get by being more intelligent than the competition (this might be mostly semantics, though).
(Sure, to some extent the humans are cheating because there is no welfare state and no free education for chimpanzee children. But the point is even if there were, even if today’s society optimized for producing highly capable chimpanzees, they wouldn’t get very far.)
The post Tool use and intelligence: A conversation appeared first on Center on Long-Term Risk.
The post Training neural networks to detect suffering appeared first on Center on Long-Term Risk.
Imagine a data set of images labeled “suffering” or “no suffering”. For instance, suppose the “suffering” category contains documentation of war atrocities or factory farms, and the “no suffering” category contains innocuous images – say, a library. We could then use a neural network or other machine learning algorithms to learn to detect suffering based on that data. In contrast to many AI safety proposals, this is feasible to at least some extent with present-day methods.
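To make this concrete, here is a minimal sketch of what such a classifier could look like, written in Python with PyTorch and torchvision. The folder layout, label names, and hyperparameters are placeholder assumptions for illustration, not part of any existing project.

```python
# Minimal sketch: fine-tune a pretrained image classifier on a hypothetical
# dataset of images labeled "suffering" / "no_suffering".
# Assumes a folder layout like data/train/suffering/*.jpg and
# data/train/no_suffering/*.jpg (both hypothetical).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_data = datasets.ImageFolder("data/train", transform=transform)
loader = DataLoader(train_data, batch_size=32, shuffle=True)

# Start from a network pretrained on ImageNet and replace the final layer
# with a two-class head ("no suffering" vs. "suffering").
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # a handful of epochs is enough for a small dataset
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```

Fine-tuning a pretrained network is used here simply because a dataset of this kind would likely be small; as the next paragraphs argue, even a well-trained classifier of this sort would only pick up visual correlates of suffering, not the underlying concept.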
The neural network could monitor the AI’s output, its predictions of the future, or any other data streams, and possibly shut the system off to make it fail-safe. Additionally, it could scrutinize the AI’s internal reasoning processes, which might help prevent mindcrime.
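Continuing the sketch above under the same assumptions, monitoring could look roughly like the following: score each item in a watched data stream with the trained detector and call a shutdown hook once the score exceeds a threshold. The stream, the shutdown mechanism, and the threshold value are hypothetical stand-ins.

```python
# Minimal fail-safe sketch built on the (hypothetical) detector trained above.
import torch

SUFFERING_THRESHOLD = 0.9  # illustrative value, not a principled choice

def suffering_score(model, image_tensor):
    """Probability the detector assigns to the 'suffering' class for one image."""
    model.eval()
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))
        # Assumes class index 1 corresponds to the "suffering" label.
        return torch.softmax(logits, dim=1)[0, 1].item()

def monitor(model, data_stream, shut_down):
    """Watch a stream (e.g. the AI's outputs or predicted future states,
    rendered as image tensors) and halt the system if suffering is detected."""
    for item in data_stream:
        if suffering_score(model, item) > SUFFERING_THRESHOLD:
            shut_down()  # hypothetical shutdown hook exposed by the monitored system
            break
```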
The naive form of this idea is bound to fail. The neural network would merely learn to detect suffering in images, not the abstract concept. It would fail to recognize alien or “invisible” forms of suffering such as digital sentience. The crux of the matter is the definition of suffering, which raises a plethora of philosophical issues.
An ideal formalization would be comprehensive and at the same time easy to implement in machine learning systems. I suspect that reaching that ideal is difficult, if not impossible, which is why we should also look into heuristics and approximations. Crucially, suffering is “simpler” in a certain sense than the entire spectrum of complex human values, which is why training neural networks – or other methods of machine learning – is more promising for suffering-focused AI safety than for the goal of loading human values.
If direct implementations turn out to be infeasible, we could look into approaches based on preference inference. Just as with any other preference, AI systems can potentially learn the preference to avoid suffering via (cooperative) inverse reinforcement learning. Alternatively, we might program AI systems to infer suffering from the actions and expressions of others. That way, if the AI observes an agent struggling to prevent an outcome1, the AI should conclude that the realization of that outcome may constitute suffering.2
This requires sufficiently accurate models of other minds, which contemporary machine learning systems lack. It is, however, closer to the technical language of real-world AI systems than purely philosophical descriptions such as “a conscious experience with subjectively negative valence”.
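As a toy illustration of the preference-inference route, the sketch below infers how bad an outcome is for an agent from the agent's willingness to pay a cost to avoid it, using a simple Boltzmann-rational choice model over a grid of hypotheses. All numbers and modelling choices are assumptions made purely for illustration; real (cooperative) inverse reinforcement learning systems are far more involved.

```python
# Toy sketch: infer the (dis)value of an outcome from observed avoidance behavior,
# in the spirit of inverse reinforcement learning. Everything here is illustrative.
import numpy as np

# Each observation: the agent could accept outcome A "for free" or pay a cost
# to reach a neutral outcome B instead. True = agent paid the cost to avoid A.
observations = [True, True, True, False, True]
avoidance_cost = 1.0

def choice_likelihood(value_of_A, avoided, beta=1.0):
    """Boltzmann-rational probability of the observed choice, given a
    hypothesised value for outcome A (outcome B is fixed at value 0)."""
    utilities = np.array([value_of_A, -avoidance_cost])  # [accept A, avoid A]
    probs = np.exp(beta * utilities)
    probs /= probs.sum()
    return probs[1] if avoided else probs[0]

# Posterior over hypotheses about how good or bad outcome A is for the agent,
# starting from a uniform prior over a grid of candidate values.
hypotheses = np.linspace(-10.0, 2.0, 121)
posterior = np.ones_like(hypotheses)
for avoided in observations:
    posterior *= np.array([choice_likelihood(v, avoided) for v in hypotheses])
posterior /= posterior.sum()

print("Most probable value of outcome A:", hypotheses[posterior.argmax()])
# Repeatedly paying a cost to avoid A pushes the inferred value of A below zero,
# i.e. evidence that realising outcome A may constitute suffering for the agent.
```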
Further research on the idea could focus on three areas:
The post Training neural networks to detect suffering appeared first on Center on Long-Term Risk.
The post S-risks: Why they are the worst existential risks, and how to prevent them (EAG Boston 2017) appeared first on Center on Long-Term Risk.
S-risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering. The talk was aimed at an audience that is new to this concept. For a more in-depth discussion, see our article Reducing Risks of Astronomical Suffering: A Neglected Priority.
I’ll talk about risks of severe suffering in the far future, or s-risks. Reducing these risks is the main focus of the Foundational Research Institute, the EA research group that I represent.
To illustrate what s-risks are about, I’ll use a fictional story.
Imagine that some day it will be possible to upload human minds into virtual environments. This way, sentient beings can be stored and run on very small computing devices, such as the white egg-shaped gadget depicted here.
Behind the computing device you can see Matt. Matt’s job is to convince human uploads to serve as virtual butlers, controlling the smart homes of their owners. In this instance, human upload Greta is unwilling to comply.
To break her will, Matt increases the rate at which time passes for Greta. While Matt waits for just a few seconds, Greta effectively endures many months of solitary confinement.
Fortunately, this did not really happen. In fact, I took this story and screenshots from an episode of the British TV series Black Mirror.
Not only did it not happen, it’s also virtually certain it won’t happen. No future scenario we’re imagining now is likely to happen in precisely this form.
But, I will argue, there are many plausible scenarios which are in relevant ways like, or even worse than, that one. I’ll call these s-risks, where “s” stands for “suffering”.
I’ll explain presently what s-risks are, and how s-risks may be realized. Next, I’ll talk about why effective altruists may want to focus on preventing s-risk, and through what kinds of work this can be achieved.
The way I’d like to introduce them, s-risks are a subclass of existential risk, often called x-risk. It’ll therefore be useful to recall the concept of x-risk. Nick Bostrom has defined x-risk as follows.
“Existential risk – One where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.”
Bostrom also suggested that one way to understand how x-risk differs from other risk is to look at two dimensions of risk. These two dimensions are their scope and their severity.
We can use these to map different types of risks in a two-dimensional figure. Along the vertical axis, risks are ordered according to their scope. That is, we ask how many individuals would be affected? Would it only be one person, everyone in some region, everyone alive on Earth at one point, or even everyone alive plus future generations? Along the horizontal axis, we map risks according to their severity. That is, we ask how bad an adverse outcome would be for one affected individual.
For example, a single fatal car crash would have terminal severity. In that sense, it’s pretty bad. However, in another sense, it is not as bad as it could be, because it affects only a small number of people – it has personal rather than global or even regional scope. But there are also risks with a greater severity; for example, being tortured for the rest of your life, with no chance of escape, arguably is worse than a fatal car crash. Or, to give a real-life example, consider factory farming. We commonly think that, say, the life of chickens in battery cages is so bad that it’s better not to bring them into existence in the first place. That’s the reason why we think it’s good that the food at this conference is largely vegan.
To come back to the title of my talk, I can now state why s-risks are the worst existential risks. S-risks are the worst existential risks because I’ll define them to have the largest possible scope and the largest possible severity. (I will qualify the claim that s-risks are the worst x-risks later.) That is, I’d like to suggest the following definition.
“S-risk – One where an adverse outcome would bring about severe suffering on a cosmic scale, vastly exceeding all suffering that has existed on Earth so far.”
So, s-risks are roughly as severe as factory farming, but with an even larger scope.
To better understand this definition, let’s zoom in on the part of the map that shows existential risk.
One subclass of risks are those that, with respect to their scope, would affect all future human generations, and, with respect to their severity, would remove everything valuable. One central example of such pan-generational, crushing risks are risks of human extinction.
Risks of extinction have received the most attention so far. But, conceptually, x-risks contain another class of risks. These are risks of outcomes even worse than extinction in two respects. First, with respect to their scope, they not only threaten the future generations of humans or our successors, but all sentient life in the whole universe. Second, with respect to their severity, they not only remove everything that would be valuable but also come with a lot of disvalue – that is, features we’d like to avoid no matter what. Recall the story I told in the beginning, but think of Greta’s solitary confinement being multiplied by many orders of magnitude – for instance, because it affects a very large population of sentient uploads.
Let’s pause for a moment. So far, I’ve introduced the concept of s-risk. To recap, they are risks of severe suffering on a cosmic scale, which makes them a subclass of existential risk.
(Depending on how you understand the “curtail its potential” case in the definition of x-risks, there actually may be s-risks which aren’t x-risks. This would be true if you think that reaching the full potential of Earth-originating intelligent life could involve suffering on an astronomical scale, i.e., the realisation of an s-risk. Think of a quarter of the universe filled with suffering, and three quarters filled with happiness. Considering such an outcome to be the full potential of humanity seems to require the view that the suffering involved would be outweighed by other, desirable features of reaching this full potential, such as vast amounts of happiness. While all plausible moral views seem to agree that preventing the suffering in this scenario would be valuable, they might disagree on how important it is to do so. While many people find it plausible that ensuring a flourishing future is more important, FRI is committed to a family of different views, which we call suffering-focused ethics. (Note: We’ve updated this section in June 2019.))
Next, I’d like to talk about why and how to prevent s-risks.
All plausible value systems agree that suffering, all else being equal, is undesirable. That is, everyone agrees that we have reasons to avoid suffering. S-risks are risks of massive suffering, so I hope you agree that it’s good to prevent s-risks.
However, you’re probably here because you’re interested in effective altruism. You don’t want to know whether preventing s-risks is a good thing, because there are a lot of good things you could do. You acknowledge that doing good has opportunity cost, so you’re after the most good you can do. Can preventing s-risks plausibly meet this higher bar?
This is a very complex question. To understand just how complex it is, I first want to introduce a flawed argument for focusing on reducing s-risk. (I’m not claiming that anyone has advanced such an argument about either s-risks or x-risks.) This flawed argument goes as follows.
Premise 1: The best thing to do is to prevent the worst risks
Premise 2: S-risks are the worst risks
Conclusion: The best thing to do is to prevent s-risk
I said that this argument isn’t sound. Why is that?
Before delving into this, let’s get one potential source of ambiguity out of the way. On one reading, premise 1 could be a value judgment. In this sense, it could mean that, whatever you expect to happen in the future, you think there is a specific reason to prioritize averting the worst possible outcomes. There is a lot one could say about the pros and cons as well as about the implications of such views, but this is not the sense of premise 1 I’m going to talk about. In any case, I don’t think any purely value-based reading of premise 1 suffices to get this argument off the ground. More generally, I believe that your values can give you substantial or even decisive reasons to focus on s-risk, but I’ll leave it at that.
What I want to focus on instead is that, (nearly) no matter your values, premise 1 is false. Or at least it’s false if, by “the worst risks”, we understand what we’ve talked about so far, that is, badness along the dimensions of scope and severity.
When trying to find the action with the highest ethical impact there are, of course, more relevant criteria than scope and severity of a risk. What’s missing are a risk’s probability; the tractability of preventing it; and its neglectedness. S-risks are by definition the worst risks in terms of scope and severity, but not necessarily in terms of probability, tractability, and neglectedness.
These additional criteria are clearly relevant. For example, if s-risks turned out to have probability zero, or if reducing them was completely intractable, it wouldn’t make any sense to try to reduce them.
We must therefore discard the flawed argument. I won’t be able to definitively answer the question under what circumstances we should focus on s-risk, but I’ll offer some initial thoughts on the probability, tractability, and neglectedness of s-risks.
I’ll argue that s-risks are not much more unlikely than AI-related extinction risk. I’ll explain why I think this is true and will address two objections along the way.
You may think “this is absurd”: we can’t even send humans to Mars, so why worry about suffering on cosmic scales? This was certainly my immediate, intuitive reaction when I first encountered related concepts. But as EAs, we should be wary of taking such intuitive, ‘system 1’ reactions at face value. For we are aware that a large body of psychological research in the “heuristics and biases” approach suggests that our intuitive probability estimates are often driven by how easily we can recall a prototypical example of the event we’re considering. For types of events that have no precedent in history, we can’t recall any prototypical example, and so we will systematically underestimate the probability of such events if we aren’t careful.
So we should critically examine this intuitive reaction of s-risks being unlikely. If we do this, we should pay attention to two technological developments, which are at least plausible and which we have reason to expect for unrelated reasons. These are artificial sentience and superintelligent AI, the latter unlocking many more technological capabilities such as space colonization.
Artificial sentience refers to the idea that the capacity to have subjective experience – and in particular, the capacity to suffer – is not limited to biological animals. While there is no universal agreement on this, in fact most contemporary views in the philosophy of mind imply that artificial sentience is possible in principle. And for the particular case of brain emulations, researchers have outlined a concrete roadmap, identifying concrete milestones and remaining uncertainties.
As for superintelligent AI, I won’t say more about this because this is a technology that has received a lot of attention from the EA community. I’ll just refer you to Nick Bostrom’s excellent book on the topic, called Superintelligence, and add that s-risks involving artificial sentience and “AI gone wrong” have been discussed by Bostrom under the term mindcrime.
But if you only remember one thing about the probability of s-risk, let it be this: this is not Pascal’s wager! In brief, as you may recall, Pascal lived in the 17th century and asked whether we should follow religious commandments. One of the arguments he considered was that, no matter how unlikely we think it is that God exists, it’s not worth risking ending up in hell. In other words, hell is so bad that you should prioritize avoiding it, even if you thought hell was very unlikely.
But that’s not the argument we’re making with respect to s-risk. Pascal’s wager invokes a speculation based on one arbitrarily selected ancient collection of books. Based on this, one cannot defensibly claim that the probability of one type of hell is greater than the probability of competing hypotheses.
By contrast, worries about s-risk are based on our best scientific theories and a lot of implicit empirical knowledge about the world. We consider all the evidence we have, and then articulate a probability distribution over how the future may unfold. Since predicting the future is so hard, the remaining uncertainty will be quite high. But this kind of reasoning could in principle justify concluding that s-risk is not negligibly small.
OK, but maybe an additional argument comes to your mind: since a universe filled with a lot of suffering is a relatively specific outcome, you may think that it’s extremely unlikely that something like this will happen unless someone or something intentionally optimised for such an outcome. In other words, you may think that s-risks require evil intent, and that such evil intent is very unlikely.
I think one part of this argument is correct: I agree that it’s very unlikely that we’ll create an AI with the terminal goal of creating suffering, or that humans will intentionally create large numbers of suffering AIs. But I think that evil intent accounts for only a tiny part of what we should be worried about, because there are two other, more plausible routes: accidents and conflict.
For example, consider the possibility that the first artificially sentient beings we create, potentially in very large numbers, may be “voiceless” — unable to communicate in written language. If we aren’t very careful, we might cause them to suffer without even noticing.
Next, consider the archetypical AI risk story: a superintelligent paperclip maximizer. Again, the point is not that anyone thinks this particular scenario is very likely. It’s just one example of a broad class of scenarios in which we inadvertently create a powerful agent-like system that pursues some goal that is neither aligned with our values nor actively evil. The point is that this paperclip maximizer may still cause suffering for instrumental reasons. For instance, it may run sentient simulations to find out more about the science of paperclip production, or to assess how likely it is to encounter aliens (who might disrupt paperclip production); alternatively, it may spawn sentient “worker” subprograms for which suffering plays a role in guiding action, similar to the way humans learn not to touch a hot plate.
A “voiceless first generation” of sentient AIs and a paperclip maximizer that creates suffering for instrumental reasons are two examples of how an s-risk may be realised not by evil intent, but by accident.
Third, s-risks may arise as part of a conflict.
To understand the significance of the third point, remember the story from the beginning of this talk. The human operator Matt wasn’t evil in the sense that he intrinsically valued Greta’s suffering. He just wanted to make sure that Greta complies with the commands of her owner, whatever these are. More generally, if agents compete for a shared pie of resources, there is a risk that they’ll engage in negative-sum strategic behavior that causes suffering even if everyone disvalues suffering.
The upshot is that risks of severe suffering don’t require rare motivations such as sadism or hatred. History offers plenty of evidence for this worrying principle; consider, for instance, most wars, or factory farming. Neither is caused by evil intent.
By the way, in case you were wondering, the Black Mirror story wasn’t an s-risk, but we can now see that it illustrates two major points: first, the importance of artificial sentience and, second, severe suffering caused by an agent without evil intent.
To conclude: to be worried about s-risk, we don’t need to posit any new technology or any qualitatively new feature above what is already being considered by the AI risk community. So I’d argue that s-risks are not much more unlikely than AI-related x-risks. Or at the very least, if someone is worried about AI-related x-risk but not s-risk, the burden of proof is on them.
So how tractable is reducing s-risk? I acknowledge that this is a challenging question. However, I do think there are some things we can do today to reduce s-risk.
First, there is some overlap with familiar work in the x-risk area. More specifically, some work in technical AI safety and AI policy effectively addresses both risks of extinction and s-risks. That being said, any specific piece of work in AI safety may be much more relevant for one type of risk than the other. To give you a toy example, if we could make sure that a superintelligent AI by default shuts down in 1000 years, this wouldn’t do much to reduce extinction risk, but it would prevent long-lasting s-risks. For some more serious thoughts on differential progress within AI safety, I refer you to the Foundational Research Institute’s technical report on Suffering-focused AI Safety.
So the good news is, we’re already doing work that reduces s-risk. But note that this isn’t true for all work on existential risk; for example, building disaster shelters, or making it less likely that we’re all wiped out by a deadly pandemic, may reduce the probability of extinction — but, to a first approximation, it doesn’t change the trajectory of the future conditional on humans surviving.
Besides more targeted work, there are also broad interventions which plausibly prevent s-risks more indirectly. For instance, strengthening international cooperation could decrease the likelihood of conflicts, and we’ve seen that negative-sum behaviour in conflicts is one potential source of s-risks. Or, going meta, we can do research aimed at identifying which types of broad intervention are effective at reducing s-risks. The latter is part of what we’re doing at the Foundational Research Institute.
I’ve talked about whether there are interventions that reduce s-risks. There is another aspect of tractability: will there be sufficient support to carry out such interventions? For instance, will there be enough funding? We may worry that talk of cosmic suffering and artificial sentience is well beyond the window of acceptable discourse – or, in other words, that s-risks are just too weird.
I think this is a legitimate worry, but I don’t think we should conclude that preventing s-risk is futile. For remember that 10 years ago, worries about risks from superintelligent AI were nearly universally ridiculed, dismissed or misrepresented as being about The Terminator.
By contrast, today we have Bill Gates blurbing a book that talks about whole brain emulations, paperclip maximizers, and mindcrime. That is, the history of the AI safety field demonstrates that we can raise significant support for seemingly weird cause areas if our efforts are backed up by solid arguments.
Last but not least, how neglected is work on s-risk?
It’s clearly not totally neglected. I said before that AI safety and AI policy can reduce s-risk, so arguably some of the work of, say, the Machine Intelligence Research Institute or the Future of Humanity Institute is effectively addressing s-risk.
However, it seems to me that s-risk gets much less attention than extinction risk. In fact, I’ve seen existential risk being equated with extinction risk.
In any case, I think that s-risk gets less attention than is warranted. This is especially true for interventions specifically targeted at reducing s-risk, that is, interventions that wouldn’t also reduce other classes of existential risk. There may well be low-hanging fruit here, since existing x-risk work is not optimized for reducing s-risk.
As far as I’m aware, the Foundational Research Institute is the only EA organization that explicitly focuses on reducing s-risk.
To summarize: both empirical and value judgments are relevant to answering the question of whether to focus on reducing s-risk. Empirically, the most important questions are: how likely are s-risks? How easy is it to prevent them? Who else is working on this?
Regarding their probability, s-risks may be unlikely, but they are far more than a mere conceptual possibility. We can see just which technologies could plausibly cause severe suffering on a cosmic scale, and overall s-risks don’t seem much more unlikely than AI-related extinction risks.
The most plausible of these s-risk-enabling technologies are artificial sentience and superintelligent AI. Thus, to a first approximation, these cause areas are much more relevant to reducing s-risk than other x-risk cause areas such as biosecurity or asteroid deflection.
Second, reducing s-risk is at least minimally tractable. We probably haven’t yet found the most effective interventions in this space. But we can point to some interventions which reduce s-risk and which people work on today — namely, some current work in AI safety and AI policy.
There are also broad interventions which may indirectly reduce s-risk, but we don’t understand the macrostrategic picture very well yet.
Lastly, s-risk seems to be more neglected than extinction risk. At the same time, reducing s-risk is hard and requires pioneering work. I’d therefore argue that FRI occupies an important niche with a lot of room for others to join.
That all being said, I don’t expect to have convinced every single one of you to focus on reducing s-risk. I think that, realistically, we’ll have a plurality of priorities in the community, with some of us focusing on reducing extinction risk, some of us focusing on reducing s-risk, and so on.
Therefore, I’d like to end this talk with a vision for the far-future shaping community.
Those of us who care about the far future face a long journey. But it’s a misrepresentation to frame this as a binary choice between extinction and utopia.
But in another sense, the metaphor was apt. We do face a long journey, but it’s a journey through hard-to-traverse territory, and on the horizon there is a continuum ranging from a hellish thunderstorm to the most beautiful summer day. Interest in shaping the far future determines who’s (locked) in the vehicle, but not what more precisely to do with the steering wheel. Some of us are most worried about avoiding the thunderstorm, while others are more motivated by the existential hope of reaching the summer day. We can’t keep track of the complicated network of roads far ahead, but we have an easy time seeing who else is in the car, and we can talk to them – so maybe the most effective thing we can do now is to compare our maps of the territory, and to agree on how to handle remaining disagreements without derailing the vehicle.
Thank you.
For more information on s-risk, check out foundational-research.org. If you have any questions, feel free to email me at max@foundational-research.org or get in touch via Facebook.
The post S-risks: Why they are the worst existential risks, and how to prevent them (EAG Boston 2017) appeared first on Center on Long-Term Risk.
We were moved by the many good reasons to make conversations public. At the same time, we felt the content we wanted to publish differed from the articles on our main site in the following ways:
Content we plan to publish on this blog before the end of July includes:
Several of our current and former researchers publish content on their personal websites or blogs:
Researchers will maintain their personal websites as they anticipate that not all of the content they’ll publish there will be relevant to FRI. Content that concerns the research of current staff members will, however, be crossposted to the FRI blog. This way, readers who are primarily interested in FRI-related content need only follow one blog.
Similarly, we anticipate that not all FRI blog posts would qualify as a good EA forum post, but we plan to publish the ones that do there. We’ve also (cross-)posted to LessWrong and the Intelligent Agent Foundations Forum and plan to continue doing so.
We’ll explore enabling comments on this blog as this may encourage additional readers to engage with our work and to provide feedback. I expect to reply to questions being asked in the comments.
The post Launching the FRI blog appeared first on Center on Long-Term Risk.
We address the moral importance of fish, invertebrates such as crustaceans, snails and insects, and other animals about which there is qualified scientific uncertainty about their sentience. We argue that, on a sentientist basis, one can at least say that how such animals fare makes ethically significant claims on our character. It is a requirement of a morally decent (or virtuous) person that she at least pays attention to and is cautious regarding the possibly morally relevant aspects of such animals. This involves having a moral stance, in the sense of patterns of perception, such that one notices such animals as being morally relevant in various situations. For the person who does not already consider these animals in this way, this could be a big change in moral psychology, and can be assumed to have behavioural consequences, albeit indeterminate ones. Character has been largely neglected in the literature, which focuses on act-centred approaches (i.e. that the evidence on sentience supports, or does not support, taking some specific action). We see our character-centred approach as complementary to, not superior to, act-centred approaches. Our approach has the advantage of allowing us to make ethically interesting and practically relevant claims about a wider range of cases, but it has the drawback of providing less specific action guidance.
The post A Virtue of Precaution Regarding the Moral Status of Animals with Uncertain Sentience appeared first on Center on Long-Term Risk.
The post Die ethische Relevanz von Wildtierleid appeared first on Center on Long-Term Risk.
Translated by Evgueni Kivman
This is a preliminary translation of the English original. An improved translation will be posted in the coming weeks.
The number of wild animals vastly exceeds the number of animals on factory farms, in laboratories, or kept as so-called pets. Animal advocates should therefore consider whether to direct more of their attention toward raising concern about the suffering that occurs in nature. While in theory this could eventually lead to directly designing more humane ecosystems, I think that in practice activists should focus on raising awareness of wild-animal suffering among other activists, academics, and other sympathetic groups. The massive scale of suffering occurring in nature right now is of course tragic, but it pales in comparison to the good or bad things our descendants, equipped with advanced technological capabilities, could bring about. For example, I worry that future humans may spread life to other planets (directly or indirectly) or create sentient simulations without much concern for the consequences for wild animals. Our number-one priority should be to ensure that future human intelligence is used to prevent wild-animal suffering rather than to multiply it.
I personally believe that most animals (except perhaps those that live a long time, say more than three years) probably have lives that are not worth living, because, if I could choose, I would rather give up a few years of life than endure the pain of an average death in the wild, and even that assumes their lives are otherwise net positive (which is itself questionable, given cold, hunger, disease, fear of predators, and all the rest).
Admittedly, this belief of mine is somewhat controversial. I think the claim that nature contains net suffering in expectation requires only a weaker assumption, namely that almost all of the expected happiness and suffering in nature comes from small animals (e.g. small fish or insects). The adults of these species live at most a few years, often only a few months or weeks, so in these cases it is even harder for the corresponding happy moments of a life to outweigh the pain of death. Moreover, most of the offspring of these species die (probably painfully) only days or weeks after coming into existence, because most of these species are r-selected; see Type III in this graph.
“The total amount of suffering per year in the natural world is beyond all decent contemplation. During the minute it takes me to write this sentence, thousands of animals are being eaten alive, others are running for their lives, others are being slowly devoured from within by parasites, and thousands are dying of starvation, thirst, or disease.”
– Richard Dawkins, River Out of Eden[Dawkins]
“Many people look at nature from an aesthetic perspective and think about biodiversity or the health of ecosystems, but forget that the animals that belong to these ecosystems are individuals with needs of their own. Disease, starvation, predation, ostracism, and sexual frustration are endemic in so-called healthy ecosystems. The great taboo in the animal rights movement is that most suffering is due to natural causes.”
– Albert, a fictional dog in “Golden” by the philosopher Nick Bostrom[Bostrom-Alfred]
“The moralistic fallacy is that what is good is found in nature. It lies behind the bad science in nature-documentary voiceovers: lions euthanize the weak and the sick, mice feel no pain when cats eat them, dung beetles recycle dung to benefit the ecosystem, and so on.”
– Steven Pinker[Pinker]
“People who accuse us of showing too much violence [should see] what we leave in the wastebasket.”
– David Attenborough on nature documentaries[Attenborough]
“In sober truth, nearly all the things that people are hanged or imprisoned for doing to one another are everyday performances of nature. [...] The phrases that ascribe perfection to the course of nature can only be regarded as exaggerations born of poetic feeling, not intended to withstand sober examination. No one, religious or irreligious, believes that the hurtful agencies of nature, considered as a whole, promote good purposes in any way other than by inciting rational human beings to rise up and struggle against them.”
– John Stuart Mill, “On Nature”[Mill]
Animal advocates typically concentrate their efforts on areas where humans interact directly with members of other species, such as factory farms, animal experimentation, or, to a much lesser extent, zoos, circuses, rodeos, and the like.
Animal suffering in the wild is rarely a subject of discussion, even in the academic literature, though there have been some notable exceptions.[exceptions] In this piece I stress that the numbers of wild animals affected by humans are simply far too large for animal advocates to ignore. Intense suffering is a regular feature of life in the wild, which calls not for quick fixes but at least for long-term research into wild-animal welfare and into technologies that might one day allow humans to improve it. I close this section by encouraging animal advocates to focus their efforts on promoting concern for wild-animal suffering among other activists, academics, and others who would be sympathetic, both to encourage research in this area and to help ensure that our descendants use their advanced technologies to reduce wild-animal suffering rather than unintentionally multiply it.
The animal suffering caused by humans is vast, and animal advocates are right to be shocked by its scale. Yet the numbers of animals living in the wild are staggeringly larger. For rough population estimates, see my “How Many Wild Animals Are There?”[Tomasik-numbers]
Just like their domesticated relatives, animals in the wild lead lives filled with emotion.[emotions] Unfortunately, many of these emotions are intensely painful, often pointlessly so. And while “cruel nature” passes as a truism, its visceral meaning is easy to overlook. Below I review some of the details of wild-animal suffering, much as animal advocates denounce human cruelty.
When people picture suffering in nature, the first image that probably comes to mind is that of a lioness hunting her prey. Christopher McGowan, for example, vividly describes the death of a zebra:
The lioness sinks her scimitar talons into the zebra’s rump. They tear through the tough hide and anchor deep in the muscle. The startled animal lets out a loud bellow as its body hits the ground. A moment later the lioness releases her claws from its haunches and sinks her teeth into the zebra’s throat, choking off the sound of terror. Her canine teeth are long and sharp, but an animal as large as a zebra has a massive neck, with a thick layer of muscle beneath the skin, so although the teeth pierce the hide they are still too short to reach any major blood vessels. She must therefore kill the zebra by suffocation, clamping her powerful jaws around its trachea and cutting off the air to its lungs. It is a slow death. If this had been a smaller animal, say a Thomson’s gazelle about the size of a large dog, she would have bitten through the nape of its neck; her canine teeth would then probably have crushed the vertebrae or the base of the skull, causing instant death. The zebra’s death throes will last five or six minutes.[McGowan, pp. 12-13]
Some predators kill fairly quickly, such as constricting snakes, which cut off their victims’ air supply and cause unconsciousness within a minute or two,[eaten-alive] while others inflict a more protracted death, such as hyenas, which tear off one piece of flesh after another.[Kruuk] Wild dogs rip open the bellies of their prey,[McGowan, p. 22] venomous snakes cause internal bleeding and paralysis over several minutes,[McGowan, pp. 49] and crocodiles drown large animals held in their grip.[McGowan, pp. 43]
A snake-keeping handbook explains: “Live mice will fight for their lives when seized, and will bite, kick, and scratch for as long as they can.”[Flank] Once caught, “the snake soaks the prey in saliva and finally pulls it into its esophagus. From there it uses its muscles to squeeze the food along the digestive tract, where it is broken down into nutrients.”[Perry]
Prey animals do not always die immediately after being swallowed, as illustrated by the fact that some toxic newts, once swallowed by a snake, secrete poisons that kill their captor and then crawl back out of its mouth.[McGowan, pp. 59] And regarding house cats, Bob Sallinger of the Audubon Society of Portland has noted: “People horrified by the indiscriminate killing of wildlife by, for example, leg-hold traps should recognize that the pain and suffering inflicted by hunting cats is no different, and that the impact of leg-hold traps is dwarfed by comparison.”[Sallinger]
It is possible that some animals do not suffer as intensely during predation if enough endorphins are released; similarly, humans often do not feel pain immediately after an injury.[Wall] But there are many examples of prey fighting back violently against their attackers. In this video, for instance, the warthog screams for roughly two and a half minutes while it is being suffocated. Moreover, if endorphins really do lessen the pain of dying, the same argument should apply to farm animals brutally slaughtered by humans; yet most animal-welfare researchers regard bad slaughter methods as especially painful.
Fear of predators causes not only immediate suffering but possibly also long-lasting psychological trauma. In a study of anxiolytics, researchers exposed mice to a cat for five minutes and observed their subsequent reactions. They found “that this model of mice exposed to inescapable predator stimuli induces early cognitive changes similar to those seen in patients with acute stress disorder.”[ElHagePeronnyGriebelBelzung] Another study found long-term effects on the mice’s brains: “Predatory threat produced significant learning impairments in the maze (16 to 22 days later) and in the test of recognition of the spatial rearrangement of objects (26 to 28 days later). These findings indicate that memory impairments can persist for prolonged periods after predator stress.”[ElHageGriebelBelzung] In a similar experiment, Phillip R. Zoladz exposed rats to inescapable predators and other fear-inducing conditions in order to “produce changes in the rats’ physiology and behavior comparable to the symptoms observed in PTSD patients.”[Zoladz]
Even for prey animals that never have a traumatic encounter with a predator, the “ecology of fear” that predators create can be deeply distressing: “In studies of elk, scientists have found that the presence of wolves alters their behavior almost constantly, as they try to avoid encounters, keep escape routes open, and remain permanently vigilant.”[Stauth]
One might argue that evolution should avoid making animals’ lives excessively dreadful for long periods before death, because, at least in more complex species, this could lead to PTSD, depression, or other debilitating side effects. Of course, we see empirically that evolution does in fact produce such disorders when traumatic events occur, such as encounters with predators. But there is probably a meaningful limit to how terrible things can be most of the time if animals are to keep functioning. Death itself is a different matter: once it is unavoidable, evolutionary pressure no longer constrains the emotional experience. Death can be nearly painless (for a few lucky animals) or as bad as torture (for many others). Evolution has no reason to prevent death from being unbearably horrible.[Dawkins]
Of course, predator attacks are not the only reason organisms die in agony.
Animals are also laid low by diseases and parasites, which can bring listlessness, shivering, ulcers, pneumonia, starvation, violence, or other gruesome symptoms leading to death over a period of days or weeks. Avian salmonellosis is just one example:
Signs range from sudden death to a gradual onset of depression over one to three days, accompanied by huddling of the birds, fluffed-up feathers, unsteadiness, shivering, loss of appetite, markedly increased or absent thirst, rapid loss of weight, accelerated breathing, and watery yellow, green, or blood-tinged droppings. The vent feathers become matted with excrement, the eyes begin to close, and immediately before death some birds show apparent blindness, incoordination, staggering, tremors, convulsions, or other nervous signs.[Salmonellosis]
Still other animals die in accidents, of dehydration during summer droughts, or of food shortages during winter. For instance, 2006 was a harsh year for bats in Placerville, California:
“You can see their ribs, their backbones, and (the area) where the intestines and stomach are is completely sunken in toward the back,” said Dharma Webber, founder of the California Native Bat Conservancy. [...] She said the newly emerging mosquitoes are not enough to feed the creatures. “That would be like us eating a little piece of popcorn here and there,” she said.[bats]
(Of course, when the bats do find food, that is not good news for their prey...)
Even ice storms can be fatal: “Birds unable to find a sheltered perch during the storm may have their feet frozen to a branch or their wings coated with ice, leaving them unable to fly. Grouse often suffocate when they are buried beneath drifting snow and sealed in by a layer of ice.”[Heidorn]
While death is often the culmination of the suffering in an animal’s life, everyday existence is not necessarily pleasant either. Unlike most humans in the industrialized world, wild animals do not have immediate access to food whenever they get hungry. They must constantly search for water and for shelter while watching out for predators. Most animals cannot go indoors, as we do, when it starts to rain, or turn on the heat when winter temperatures drop well below normal. In short:
It is often assumed that wild animals live in a kind of natural paradise and that it is only the presence or interference of humans that causes them to suffer. This view, essentially Rousseauian, is at odds with the wealth of information that field studies of animal populations have made available. Scarcity of water and food, predation, disease, and intraspecific aggression are some of the factors that have been identified as normal parts of a wild environment that regularly cause suffering in wild animals.[UCLA, p. 24]
And the fact that many animals appear to endure these conditions calmly does not necessarily mean they are not suffering.[BourneEtAl] Sick and injured prey are the easiest to catch, so predators deliberately target such individuals. As a consequence, the prey animals that appear sick or injured are usually the first to be killed, so evolutionary pressure pushes prey animals not to show their suffering.[Nuffield, ch. 4.12, p. 66]
On the basis of studies of stress-hormone concentrations in domesticated and wild animals, Christie Wilcox has concluded that “if we follow welfare guidelines that provide food, water, comfort, and what is needed for behavioral expression, then domesticated animals are not just likely to be as happy as their wild relatives; they are probably happier.” She has also observed:
So the real question is whether, at a given moment, a domesticated or captive animal is happier than, less happy than, or just as happy as its wild counterpart. There are a few key conditions classically believed to produce a “happy” animal by preventing undue stress. These are the basis for most animal-cruelty laws, including those in the US and the UK. They include that animals have the “right” to:
- Enough food and water
- Comfortable conditions (temperature, etc.)
- The expression of normal behavior
When it comes to wild animals, though, only the last of these is guaranteed. They must fight to survive every day, from finding food and water to finding another individual to mate with. They have no right to comfort, stability, or health. [...] By the standards our governments have set, the life of a wild animal is animal cruelty.
In nature, the species with the most individuals are probably the ones that are, by and large, worst off. Small mammals and birds have life expectancies of at most one to three years before they arrive at a painful death. And many insects count their time on Earth in weeks rather than years; the horn fly, for example, lives only 2-4 weeks.[Cumming] I personally would rather not exist than be born as an insect, struggle through the world for a few weeks, and then die of dehydration or in a spider’s web. Worse still might be to find myself trapped for 12 hours in the torture rack of a Dinoponera ant[BBC] or to be eaten alive for weeks by a parasitoid wasp.[Gould, pp. 32-44] (On the other hand, it is unclear whether caterpillars being eaten by parasitoid wasps feel pain during the experience.)
It is true that scientists remain undecided about whether insects feel pain in a way we would regard as conscious suffering.[insect-pain] But the fact that there is still serious debate about this suggests we should not rule out the possibility. And when we consider the number of insects, around 10^18,[Williams] and the number of copepods in the oceans, which is of a similar order of magnitude,[SchubelButman] the mathematical “expected value” (probability times quantity) of their suffering is enormous. I should mention that the force of this point would be weakened if, as may be the case, the “intensity” or “degree” of an animal’s emotional experience somehow depends on the amount of neural tissue devoted to pain signals.
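To make the “probability times quantity” reasoning above a little more concrete, here is a minimal back-of-the-envelope sketch in Python. Only the rough 10^18 insect count comes from the text; the sentience probabilities, the comparison population, and the intensity weights below are hypothetical placeholders chosen purely for illustration, not estimates made in this essay.

```python
# Illustrative sketch of the "expected value" reasoning above:
#   expected weight ~= P(sentience) * population * per-individual intensity weight
# The ~10^18 insect count is the figure cited in the text (Williams); every other
# number below is a hypothetical placeholder, not an estimate from the essay.

def expected_weight(p_sentience: float, population: float, intensity: float) -> float:
    """Probability-weighted amount of morally relevant experience."""
    return p_sentience * population * intensity

# Hypothetical inputs, purely for illustration.
insects = expected_weight(p_sentience=0.1, population=1e18, intensity=0.01)
large_land_vertebrates = expected_weight(p_sentience=0.95, population=1e11, intensity=1.0)

print(f"insects:                {insects:.2e}")
print(f"large land vertebrates: {large_land_vertebrates:.2e}")
# Even with a low sentience probability and a steep discount for neural complexity,
# the sheer number of insects dominates this toy comparison.
```

The only point of the sketch is that, across a wide range of placeholder values, the enormous population term tends to dominate the product.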
Tables of life expectancies for particular species usually report the survival times of adult members of that species. Most individuals, however, die earlier, before reaching maturity. This is a simple consequence of the fact that females give birth to far more offspring than can survive if the population is to stay stable. While humans can produce only one child per reproductive cycle (twins aside), the corresponding figures are, for example, 1-22 offspring for dogs (Canis familiaris), 4-6 eggs for starlings (Sturnus vulgaris), 6,000-20,000 eggs for North American bullfrogs (Rana catesbeiana), and 2 million eggs for bay scallops (Argopecten irradians).[SolbrigSolbrig, p. 37] See the graph in Thomas J. Herbert’s article[Herbert] on r- and K-selection, which shows the extremely high infant mortality of “r-strategists.” Most small animals, such as minnows and insects, are r-strategists.
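As a rough illustration of how high infant mortality must be among r-strategists, here is a small Python sketch using the fecundity figures cited above. The assumption that a stable population implies roughly one surviving offspring per parent, and the treatment of a single clutch as lifetime output for the starling, are simplifications of mine rather than claims from the essay.

```python
# Rough illustration: if a population is stable, each parent is on average replaced by
# about one offspring that survives to reproduce, so the share of offspring dying before
# maturity is roughly 1 - 1/fecundity.
# Fecundity figures are those cited in the text (Solbrig & Solbrig); treating a single
# clutch as lifetime output (starling) and assuming exactly one surviving replacement
# per parent are deliberate simplifications.

lifetime_fecundity = {
    "starling (eggs)": 5,             # 4-6 eggs per clutch
    "bullfrog (eggs)": 13_000,        # 6,000-20,000 eggs
    "bay scallop (eggs)": 2_000_000,  # ~2 million eggs
}

for species, offspring in lifetime_fecundity.items():
    share_dying_young = 1 - 1 / offspring
    print(f"{species:>20}: ~{share_dying_young:.4%} of offspring die before maturity")
```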
Admittedly, it is unclear whether all of these species are sentient, and even more so for the share of eggs that never hatch, but again: in expected-value terms, the amount of expected suffering is enormous.
This strategy of making many copies and hoping that a few work out may make perfect sense from an evolutionary standpoint, but the cost to the individual organisms is immense. In an analysis of the welfare implications of population dynamics, Matthew Clarke and Yew-Kwang Ng conclude: “The offspring number of a species that maximizes fitness may entail suffering and differs from the number that would maximize welfare (average or total).”[ClarkeNg, sec. 4] And in the related paper “Towards Welfare Biology: Evolutionary Economics of Animal Consciousness and Suffering,” Ng concludes from the excess of offspring relative to surviving adults: “Assuming concave and symmetrical functions relating costs to enjoyment and suffering, evolutionary economizing results in an excess of total suffering over total enjoyment.”[Ng, p. 272]
In his well-known article “Animal liberation and environmental ethics: Bad marriage, quick divorce,”[Sagoff] Mark Sagoff quotes the following passage from Fred Hapgood:[Hapgood]
All species reproduce in excess, far beyond their ecological niche. In a lifetime a lioness might have 20 cubs, a pigeon 150 chicks, a mouse 1,000 young, a trout 20,000 offspring, a tuna or cod a million, [...] and an oyster perhaps a hundred million. If one assumes that the population of each of these species remains roughly constant from generation to generation, then on average only one offspring will survive to replace each parent. All the others, the thousands or millions, will die one way or another.
Sagoff continues: “Compared with the misery of animals in nature, which humans can do much to relieve, every other form of suffering pales. Mother Nature is so cruel to her children that she makes Frank Perdue look like a saint.”
The previous section explained that the parents of an r-selected species can have hundreds or even tens of thousands of offspring, almost all of whom die soon after coming into existence.
But one question remains: what fraction of these offspring were sentient at the time of death, and what fraction died without awareness, as an egg or larva?
The European Food Safety Authority’s (EFSA’s) “Aspects of the biology and welfare of animals used for experimental and other scientific purposes” (pp. 37-42) examines when the fetuses of various species begin to consciously feel pain.[EFSA] The report notes that the age at which awareness begins varies depending on whether a species is precocial (well developed at birth, e.g. horses) or altricial (still developing at birth, e.g. marsupials). Precocial animals are more likely to feel pain at an earlier age. It also matters whether a species is live-bearing or egg-laying. Live-bearing animals have a greater need to suppress fetal awareness during development in order to prevent injury to the mother and siblings; egg-laying animals, constrained by the shell, have less need to restrain themselves before hatching. (p. 38)
For this reason the report suggests: “If awareness is the criterion for protection, birds, reptiles, amphibians, fish, and cephalopods might therefore more obviously need protection before hatching than mammals need protection before birth.” (p. 38) For example: “Sensory and neural development in a precocial bird such as the domestic chicken is already advanced several days before hatching. Controlled movements and coordinated electrophysiological and behavioral responses to auditory and visual stimuli appear three or four days before hatching, which occurs after 21 days of incubation (Broom, 1981).” (p. 39) By contrast: “Although mammalian fetuses can show physical responses to external stimuli, the weight of current evidence suggests that awareness does not arise in the fetus before it is delivered and begins to breathe.” (p. 42)
It therefore seems clear that many animals are capable of suffering at the time of their birth, if not before.
The stage of development at which the risk [of suffering] is sufficient for protection to be needed is the stage at which the normal locomotion and sensory functioning of an individual can occur independently of the egg or the mother. For air-breathing animals this point is generally no later than the time at which the fetus could survive unaided outside the uterus or the egg. For most vertebrates, the stage of development at which there is a risk of suffering when procedures are carried out on them is the beginning of the last third of development within the egg or the mother. For a fish, amphibian, cephalopod, or decapod it is the point at which it is capable of feeding itself rather than depending on the food supply from the egg. [...] (p. 3) Most amphibians and fish have larval forms that are not well developed at hatching but develop rapidly with the experience of independent life[.] Those fish and amphibians that are well developed at hatching or are live-bearing, and all cephalopods, since all of them are small but well developed at hatching, will have a functioning nervous system and the potential for awareness some time before hatching. (p. 38)
Another consideration suggesting pain before birth is the fact that many egg-laying vertebrates can also hatch early in response to external stimuli (including vibrations that feel like a predator).
One example concerns skinks: “Experiments simulating external stimuli such as those produced by predators induced hatching both in nests (in rock crevices) and in eggs that had been removed from nests. The hatching process was explosive: early-hatching embryos hatched within seconds and ran an average of 40 cm away from the egg.”[DoodyPaull] Early hatching has likewise been observed in amphibians, fish, and invertebrates.[DoodyPaull]
These points suggest that a large fraction of the many offspring of r-selected species may well be conscious during their painful deaths after only a few days, or even hours, of life.
It is risky to extrapolate from how we imagine we would feel in a given situation to the welfare of wild animals. We can imagine immense discomfort at having to sleep through the storm of a cold winter night with only a T-shirt to keep us warm, but many animals have better coats and can often find some kind of shelter. More generally, it seems unlikely that species would gain adaptive advantages from enduring constant misery, since stress wastes energy.[Ng] Likewise, r-strategists may suffer less from a given injury than long-lived animals do, because r-strategists have less to lose by taking big short-term risks.[Tomasik-short-lived]
Nonetheless, we should be careful that our own biases do not lead us to underestimate the extent and intensity of wild-animal suffering. You, the reader, are probably sitting in the comfort of a climate-controlled building or vehicle, with a relatively full stomach and no fear of being attacked. Most of us in the industrialized West go through life in a relatively euthymic state, and it is easy to assume that the general ease with which life greets us is shared by most other humans and animals. When we think of nature, we are more likely to picture chirping birds or gazelles frolicking happily than a deer having its flesh chewed off while conscious or a raccoon immobilized by a roundworm infestation. And of course all of the examples mentioned, because they involve large land animals, reflect my human tendency toward the “availability heuristic”: in fact, the most numerous wild animals are small organisms, many of them living in water. When we think of “wild animals,” we should (if we adopt the expected-value approach to sentience) think of ants, copepods, and small fish rather than lions or gazelles.
Some beings cannot, at a given moment, accurately assess how they feel overall across a long stretch of time.[KahnemanSugden] They often display “rosy prospection” about future events and “rosy retrospection” about the past, assuming that their past and future levels of well-being are better than what they report at the moment of the experience itself.[MitchellThompson] Moreover, even when organisms judge their levels of well-being correctly, they often display a “will to live” that is largely independent of pleasure or pain. Animals that conclude their lives are not worth living, and therefore end them, do not tend to reproduce successfully.
In the end, however, it remains beyond dispute that many animals in the wild have to endure terrible experiences, regardless of exactly how well or badly we rate life in the wild overall.
Finally, there are some claims that animals do in fact commit suicide, though others doubt this. I am personally skeptical, because there are few well-documented cases of suicide in animals and it is easy for rumors about phenomena that are not real to spread. Still, I do not doubt that some animals behave differently after an emotional loss.
Why, then, is the suffering of wild animals not the highest priority for animal advocates? One reason is philosophical: some believe that while humans have duties to treat well the animals they use or live with, they bear no responsibility toward those outside their sphere of interaction. I find this unsatisfying; if we really care about animals because we do not want beings like us to suffer cruelly, and not merely because we want to keep our own “moral house” in order, then it should not matter whether or not we have a personal connection to wild animals.
Other philosophers agree but defend human inaction by claiming that humans are ultimately helpless in the face of the situation. When Peter Singer was asked whether we should stop lions from eating gazelles, he replied:
[...] for practical purposes, judging from the history of human attempts to shape nature to human ends, I am fairly confident that if we were to intervene in the wild, we would be more likely to increase net animal suffering than to reduce it. Lions play an important role in the ecology of their habitat, and we cannot be sure what the long-term consequences would be if we tried to prevent them from killing gazelles. [...] So, practically speaking, I would definitely say that wildlife should be left alone.[Singer]
I would counter that most human interventions were not intended to improve the well-being of wild animals, and yet I suspect that many of them have reduced wild-animal suffering on balance, namely by reducing habitat.
In a similar vein to Singer, Jennifer Everett has suggested that consequentialists should welcome evolutionary selection because it eliminates harmful genetic traits:
[...] If the propagation of the “fittest” genes contributes to the integrity of both the predator and the prey species, which is good for the predator-prey balance of the ecosystem, which in turn is good for the organisms living in it, and so on, then the very ecological relationships that holistic environmentalists regard as intrinsically valuable will also be regarded as valuable by animal welfarists, because, albeit indirectly and through complex causal chains, they are ultimately conducive to the well-being of individual animals.[Everett, p. 48]
These authors are right that attending to long-term ecological side effects is important. But it does not follow that humans have no duties regarding wild animals, or that advocates for animals should stay silent about nature’s cruelty. The next subsections spell out how humans can in fact do something about wild-animal suffering.
I agree that we should be cautious about interventions that are supposed to fix problems quickly. Ecology is extremely complicated, and humans have a long history of underestimating the unforeseen consequences they run into when trying to improve on nature. On the other hand, there are already many cases in which we intervene in wildlife today in one way or another. As Tyler Cowen has observed:[Cowen, p. 10]
In other cases we intervene in nature whether we like it or not. It is not a question of uncertainty holding us back from control, but of how to weigh one form of control against another. Humans change water levels, fertilize soils, influence climatic conditions, and do many other things that affect the natural balance. These human actions are not going to disappear any time soon, and in the meantime we must examine their effects on carnivores and their victims.
One such examination has in fact been carried out, concerning a decision by the Australian government to cull overpopulated and starving kangaroos on an Australian Defence Force base.[ClarkeNg] Though admittedly unpalatable and theoretical, the analysis shows that the tools of welfare economics can be combined with the principles of population ecology to draw non-trivial conclusions about how human intervention in nature affects overall animal well-being.
Consider another example. Humans spray around 3 billion kilograms of pesticides every year,[Pimentel] and whether or not we believe this causes more wild-animal suffering than it prevents, the large-scale use of insecticides is, to some extent, a fait accompli of modern society. If, hypothetically, scientists could find ways to make these chemicals act more quickly or less painfully, an enormous number of insects and larger organisms could be given a somewhat less agonizing death. (Note that pesticides may reduce net insect suffering if they reduce insect populations sufficiently, so promoting more humane insecticides is not equivalent to promoting less pesticide use. Indeed, organic farms might involve large amounts of insect suffering, both because of higher populations and because organic pest-control methods can be quite painful. I remain very uncertain about this question, however.)[Tomasik-insecticides]
Human modifications of the environment, such as agriculture, urbanization, deforestation, pollution, climate change, and so on, have large consequences, both negative and positive, for wild animals. For example, “paving paradise [or is it hell?] to put up a parking lot” prevents the existence of the animals that would otherwise have lived there. Even where habitats are not destroyed, humans may change the composition of the species living in them. If, say, an invading species has a shorter life expectancy and more non-surviving offspring than its native counterpart, the result would be more total suffering. Of course, the opposite could just as easily be the case.
Concern about wild-animal suffering should not be confused with a general push for more environmental protection; indeed, in some or even many cases, preventing existence may be the most humane option. Consequentialist vegetarians should not find this style of reasoning unusual: the utilitarian argument against factory farming is precisely that a broiler chicken would have been better off never existing than suffering in cramped conditions for 45 days before slaughter. Of course, even in deciding whether to adopt a vegetarian diet, the effects on wild animals may matter, sometimes more than the direct effects on the farmed animals themselves.[MathenyChan]
On the other hand, before we become overzealous about eliminating natural ecosystems, we should also remember that many other people value wilderness, and it is good to avoid making enemies or tainting the underlying goal of reducing suffering by setting it squarely against what people care about. Moreover, many forms of environmentalism, especially the mitigation of climate change, may matter for the far future by improving the prospects for compromise among the major world powers developing artificial intelligence.
Wild-animal suffering deserves a serious research program that addresses questions such as these:
Today, humans have neither the knowledge nor the technical capability to seriously “solve” the problem of wild-animal suffering without potentially catastrophic consequences. This may change in the future, however, as humans develop a deeper understanding of ecology and of welfare assessment.
If sentience is not rare in the universe, the problem of wild-animal suffering extends beyond our planet. If it is unlikely that life develops human-like intelligence,[Drake] we might expect most existing extraterrestrials to be at the developmental stage of Earth’s smallest, shortest-lived creatures. It could therefore be a great boon, should humans ever send robots into space, to use them to help wild animals on other planets. (One hopes that deep ecologists’ objections to intervening in extraterrestrial ecosystems will have been resolved by then.)
Still, I should note that faster technological progress in general is not necessarily desirable. Especially in areas such as artificial intelligence and neuroscience, faster progress could accelerate risks of other kinds of suffering. As a general heuristic, I believe it may be better to wait, before developing technologies that unleash large amounts of new power, until humans have the social institutions and the wisdom to prevent the abuse of that power.
While advanced future technologies hold promise for helping wild animals, they also carry the risk of multiplying the cruelty of the natural world. For example, it is foreseeable that humans may one day transfer Earth-like environmental conditions to Mars in a process of “terraforming.”[Burton] Others have proposed “directed panspermia”: sending biological material across the galaxy to seed other planets directly.[Meot-NerMatloff] Computer simulations could become detailed enough that the wild animals they contain suffer consciously. We already see many simulation models of natural selection, and it is only a matter of time before AI capabilities are added to them, so that the organisms involved become sentient and literally feel the pain of their injuries or deaths. All of these possibilities would have enormous ethical implications, and I hope that before they are carried out, future humans will seriously consider the consequences of such actions for the beings affected.
Was impliziert das alles in Bezug auf die Tierbewegung? Ich denke, der beste erste Schritt, den wir jetzt gehen können, um Wildtierleid zu vermindern, ist, das Interesse für dieses Thema zu erhöhen. Wenn mehr Menschen über Wildtierleid nachdenken und es wichtig finden, wird es mehr Forschung über den Schutz von Wildtieren und damit verbundene menschliche Technologien geben, während zur selben Zeit auch dafür gesorgt wird, dass unsere späten Nachkommen sorgfältig über Handlungen nachdenken, die mehr leidende Organismen produzieren würden.
Vermutlich wäre es ein guter Startpunkt, Unterstützer innerhalb der Tierbewegung zu finden. Während einige Aktivisten sich gegen jegliche menschliche Interventionen in die Angelegenheiten von Tieren stellen und es teilweise sogar bevorzugen würden, wenn Menschen gar nicht existiert hätten, sollten viele Menschen, die Mitgefühl mit Angehörigen anderer Tierarten haben, Bemühungen begrüßen, Grausamkeiten in der Wildnis zu verhindern. Es ist wichtig, sicherzustellen, dass die Tierrechtsbewegung nicht darin endet, ihre Unterstützung in Bezug auf Maßnahmen zur Erhaltung der Wildnis und menschliche Nicht-Einmischung jeglicher Art zu erhöhen. Eine andere mögliche Quelle für Unterstützer könnten Menschen sein, die sich für Evolution interessieren und verstehen, was Richard Dawkins die „blinde, schonungslose Gleichgültigkeit“ der natürlichen Selektion nannte.[Dawkins, p. 133]
Individuen können viel tun, um das Thema eigenständig aufzuwerten, zum Beispiel:
Es könnte gefährlich sein, das Wildtier-Thema aufzuwerfen, bevor die allgemeine Öffentlichkeit dafür bereit ist. Tatsächlich wird die Grausamkeit der Natur oft als eine Reaktion von Fleischessern gegen konsequentialistischen Vegetarismus verwendet. Die Behauptung, dass die ethische Berücksichtigung von Tieren von uns verlangen würde, Ressourcen in langfristige Forschung zu investieren, die darauf ausgerichtet ist, Wildtieren zu helfen, könnte Menschen endgültig abstoßen, die ansonsten zumindest den Tieren, die sie durch ihre Ernährungsentscheidungen beeinflussen, Beachtung geschenkt hätten.[Greger]
Ich denke, die Aufklärung über Wildtiere sollte in Gemeinschaften beginnen, die bereits möglichst empfänglich sind, wie etwa Philosophen, Tieraktivisten, Transhumanisten und Wissenschaftler. Wir können Samen dieser Idee säen, sodass sie zu einer Komponente der Tierrechtsbewegung wächst. Ich denke auch, die Aufforderung „verbreitet Wildtierleid nicht großflächig“ könnte selbst an Orten wie TED oder Slate genannt werden, gerade weil es eine kontroverse Idee ist, die Menschen bislang nicht gehört haben. Für Zuhörer aus solchen Gruppen würde das Thema aber im „far mode“ auftauchen, würde nicht mit ihrem täglichen Leben interferieren und könnte daher mit geringerem Widerstand in Betracht gezogen werden.
Es ist wahr, dass die meisten Menschen noch nicht die moralische Dringlichkeit, Wildtierleid zu verringern, befürworten. Sie könnten vorher andere Schritte benötigen, wie etwa sich für nichtmenschliche Tiere überhaupt zu interessieren. Die Tierbewegung ist wie ein Wurm: Jedes Körperteil muss sich langsam auf seinem Weg nach vorne zum nächsten Schritt bewegen. Aber der Kopf des Wurms muss auch in die richtige Richtung zeigen. Die Menschen in dieser Hinsicht zu inspirieren, die bereit dafür sind, ist wie, die Richtung zu steuern, in die der Kopf des Wurms zeigt.
Es ist äußerst wichtig, dass die Tierbewegung an irgendeinem Punkt über die Tiere auf Farmen, in Experimenten und Haustiere hinaus geht. Das Ausmaß der Brutalität der Natur ist zu gewaltig, um es zu ignorieren, und Menschen haben die Pflicht, ihre astronomisch seltene Position sowohl als intelligente als auch als empathische Kreaturen zu nutzen, um Leid in der Wildnis um so viel wie nur möglich zu reduzieren.
[Dawkins] Dawkins, Richard. River Out of Eden. New York: Basic Books, 1995.
[Bostrom-Alfred] Bostrom, Nick. "Golden". 2004.
[Pinker] Sailer, Steve. "Q&A: Steven Pinker of 'Blank Slate.'" United Press International. 30 Oct. 2002. Retrieved 17 Jan. 2014.
[Attenborough] Rustin, Susanna. "David Attenborough: 'I'm an essential evil'". The Guardian. 21 Oct. 2011. Retrieved 9 Jan. 2014.
[Mill] Mill, John Stuart. "On Nature". 1874. In Nature, The Utility of Religion and Theism, Rationalist Press, 1904.
[exceptions] Examples include (1) Sapontzis, Steve F. "Predation." Ethics and Animals 5.2 (1984): 27-38. (2) Naess, Arne. "Should We Try To Relieve Clear Cases of Extreme Suffering in Nature?" Pan Ecology 6.1 (1991). (3) Fink, Charles K. "The Predation Argument". Between the Species 5 (2005).
[Tomasik-numbers] Tomasik, Brian. "How Many Wild Animals Are There?" Essays on Reducing Suffering. 2009.
[emotions] See, for instance, (1) Balcombe, Jonathan. Pleasurable Kingdom: Animals and the Nature of Feeling Good. Palgrave Macmillan, 2006. (2) Bekoff, Marc, ed. The Smile of a Dolphin: Remarkable Accounts of Animal Emotions. Discovery Books, 2000.
[McGowan] McGowan, Christopher. The Raptor and the Lamb: Predators and Prey in the Living World. New York: Henry Holt and Company, 1997.
[eaten-alive] Eaten Alive - The World of Predators. Questacon on Tour.
[Kruuk] Kruuk, H. The Spotted Hyena. Chicago: University of Chicago Press, 1972.
[Flank] Flank, Lenny. "Live Prey vs. Prekill". The Snake: An Owner's Guide To A Happy Healthy Pet. Howell Book House, 1997.
[Perry] Perry, Lacy. "How Snakes Work: Feeding". howstuffworks.com.
[Sallinger] Sallinger, Bob. "Audubon Society Favors Keeping Cats Indoors". The Oregonian. 17 Nov. 2003.
[Wall] Wall, Patrick. Pain: The Science of Suffering. New York: Columbia University Press, 2000.
[ElHagePeronnyGriebelBelzung] El Hage, Wissam, Sylvie Peronny, Guy Griebel, Catherine Belzung. "Impaired memory following predatory stress in mice is improved by fluoxetine". Progress in Neuro-Psychopharmacology & Biological Psychiatry 28 (2004) 123-128.
[ElHageGriebelBelzung] El Hage, Wissam, Guy Griebel, and Catherine Belzung. "Long-term impaired memory following predatory stress in mice". Physiology & Behavior 87 (2006) 45-50.
[Zoladz] Zoladz, Phillip R. "An ethologically relevant animal model of posttraumatic stress disorder: Physiological, pharmacological and behavioral sequelae in rats exposed to predator stress and social instability". Graduate dissertation, University of South Florida. 2008.
[Stam] Stam, Rianne. "PTSD and stress sensitisation: A tale of brain and body Part 2: Animal models". Neuroscience & Biobehavioral Reviews Volume 31, Issue 4 (2007) 558-584.
[Stauth] Stauth, David. "Sharks, wolves and the 'ecology of fear'". 10 Nov. 2010. Retrieved 17 March 2013.
[Salmonellosis] "Salmonellosis". Michigan Department of Natural Resources.
[bats] "Continued Rain, Snowpack Leaves Animals Hungry". Associated Press 23 Apr. 2006. CBS 13/UPN 31.
[Heidorn] Heidorn, Keith C. "Ice Storms: Hazardous Beauty". The Weather Doctor. 12 Jan. 1998, modified Dec. 2001.
[UCLA] UCLA Animal Care and Use Training Manual. UCLA Office for the Protection of Research Subjects.
[Nuffield] Nuffield Council on Bioethics. Ethics of Research Involving Animals. May 2005.
[Wilcox] Wilcox, Christie. "Bambi or Bessie: Are wild animals happier?" Scientific American Blogs. 12 April 2011. [For further discussion of this article, see this Felicifia thread. I think Christie underrates the brutality of life in factory farms, but her points about wild animals are well taken.]
[BourneEtAl] Bourne, Debra C., Penny Cusdin, and Suzanne I. Boardman, eds. Pain Management in Ruminants. Wildlife Information Network. March 2005.
[Cumming] Cumming, Jeffrey M. "Horn fly Haematobia irritans (L.)". Diptera Associated with Livestock Dung. North American Dipterists Society. 18 May 2006.
[BBC] "Fierce Ants Build 'Torture Rack'". BBC News 23 April 2005.
[Gould] Gould, Stephen Jay. "Nonmoral Nature". Hen's Teeth and Horse's Toes: Further Reflections in Natural History. New York: W. W. Norton, 1994.
[insect-pain] See, for instance, the following review articles: (1) Smith, Jane A. "A Question of Pain in Invertebrates". ILAR Journal 33.1-2 (1991). (2) Tomasik, Brian. "Can Insects Feel Pain?" Essays on Reducing Suffering. 2009.
[Williams] Williams, C. B. Patterns in the Balance of Nature and Related Problems. London: Academic Press, 1964.
[SchubelButman] Schubel, J. R. and Butman, C. A. "Keeping a Finger on the Pulse of Marine Biodiversity: How Healthy Is It?" Pages 84-103 of Nature and Human Society: The Quest for a Sustainable World. Washington, DC: National Academy Press, 1998.
[SolbrigSolbrig] Solbrig, O. T., and Solbrig, D. J. Introduction to Population Biology and Evolution. London: Addison-Wesley, 1979.
[Herbert] Herbert, Thomas J. "r and K selection". Retrieved 17 March 2013.
[ClarkeNg] Clarke, Matthew and Ng, Yew-Kwang. "Population Dynamics and Animal Welfare: Issues Raised by the Culling of Kangaroos in Puckapunyal". Social Choice and Welfare 27:2 (pp. 407-22), 2006.
[Ng] Ng, Yew-Kwang. "Towards Welfare Biology: Evolutionary Economics of Animal Consciousness and Suffering". Biology and Philosophy 10.4 (pp. 255-85), 1995.
[Sagoff] Sagoff, Mark. "Animal liberation and environmental ethics: Bad marriage, quick divorce". Osgoode Hall Law Journal 22, p. 297 (1984).
[Hapgood] Hapgood, Fred. Why males exist: an inquiry into the evolution of sex. Morrow (1979).
[EFSA] Animal Health and Welfare Scientific (AHAW) Panel. "Aspects of the biology and welfare of animals used for experimental and other scientific purposes". EFSA Journal 292, 1-136 (2005).
[DoodyPaull] Doody, J. Sean and Paull, Phillip. "Hitting the Ground Running: Environmentally Cued Hatching in a Lizard". Copeia: March 2013, Vol. 2013, No. 1, pp. 160-165.
[Tomasik-short-lived] Tomasik, Brian. "Fitness Considerations Regarding the Suffering of Short-Lived Animals". Essays on Reducing Suffering. Written: 30 June 2013; last modified: 9 Feb. 2015.
[KahnemanSugden] Kahneman, Daniel and Sugden, Robert. "Experienced Utility as a Standard of Policy Evaluation". Environmental & Resource Economics 32: 161-81 (2005).
[MitchellThompson] Mitchell, T. and Thompson, L. (1994). "A Theory of Temporal Adjustments of the Evaluation of Events: Rosy Prospection and Rosy Retrospection". In C. Stubbart, J. Porac, and J. Meindl, eds., Advances in Managerial Cognition and Organizational Information-Processing, 5 (pp. 85-114). Greenwich, CT: JAI Press.
[Singer] Singer, Peter. "Food for Thought". [Reply to a letter by David Rosinger.] New York Review of Books 20.10 (1973).
[Everett] Everett, Jennifer. "Environmental Ethics, Animal Welfarism, and the Problem of Predation: A Bambi Lover's Respect for Nature". Ethics and the Environment 6.1 (2001): 42-67.
[Cowen] Cowen, Tyler. "Policing Nature". 19 May 2001.
[Pimentel] Pimentel, David. "Pesticides and Pest Control". In Peshin, Rajinder and Dhawan, Ashok K., eds. Integrated Pest Management: Innovation-Development Process. Netherlands: Springer, 2009.
[Tomasik-insecticides] Tomasik, Brian. "Humane Insecticides: A Cost-Effectiveness Calculation". Essays on Reducing Suffering. 2009.
[MathenyChan] Matheny, Gaverick and Chan, Kai M. A. "Human Diets and Animal Welfare: The Illogic of the Larder". Journal of Agricultural and Environmental Ethics, 18:6 (pp. 579-94), 2005.
[Broom] Broom, D. M. "Animal Welfare: Concepts and Measurement". Journal of Animal Science, 69:10 (pp. 4167-4175), 1991.
[Drake] Estimates of the fraction of planets with life that go on to produce intelligence can be found in the literature on the Drake equation.
[Burton] Burton, Kathleen. "NASA Presents Star-Studded Mars Debate". 25 March 2004.
[Meot-NerMatloff] Meot-Ner, M. and Matloff, G. L. "Directed Panspermia: A Technical and Ethical Evaluation of Seeding the Universe". Journal of the British Interplanetary Society 32 (pp. 419-23), 1979.
[Greger] Greger, Michael. "Why Honey Is Vegan". Satya Sep. 2005.
The post Die ethische Relevanz von Wildtierleid appeared first on Center on Long-Term Risk.
The post Transparency appeared first on Center on Long-Term Risk.
The Center on Long-Term Risk (CLR) is committed to being as transparent as possible about its activities and learning from its mistakes. This page gathers relevant information related to transparency.
CLR is a charity located in London, United Kingdom. Our trustees are Jonas Vollmer, Max Daniel, Stefan Torges, Linh Chi Nguyen and Tobias Baumann.
CLR's purposes, according to our constitution, are:
1. the advancement of education, primarily in the field of emerging technologies (such as artificial intelligence), in particular (but not exclusively) by conducting the following activities in order to contribute to better public understanding of the potential benefits and risks of such technologies:
(a) conducting research into emerging technologies and publishing the useful results thereof to the general public;
(b) supporting individuals and organisations to carry out research into emerging technologies, through providing grants and other forms of support;
(c) hosting events such as conferences, research retreats, and workshops; to promote recent developments in, and to stimulate discussion and exchange of information about, emerging technologies; and
(d) providing coaching, mentoring and scholarships to those interested in learning about emerging technologies; and
2. to advance such other purposes which are exclusively charitable according to the law in England and Wales as the trustees may from time to time determine, in particular (but not exclusively) by grantmaking.
Please consult this page to learn more about CLR's mission and philosophy.
CLR was initially founded as a project of the Swiss charity Effective Altruism Foundation (EAF). However, our operations were transferred to an independent UK charity, the Center on Long-term Risk, at the end of 2021. Documentation below relating to 2021 and earlier therefore relates to activities carried out as part of EAF. EAF still provides support to us, including by collecting donations on our behalf. More information about EAF can be found on their transparency page.
We receive our funding from several major sources. The first is individual supporters who pledged to donate a percentage of their income to effective charities. We have also received support from several institutional donors, e.g., Open Philanthropy, the Survival and Flourishing Fund, and the Community Foundation for Ireland.
Anonymous feedback can be given using this link.
In 2024, CLR expects to spend its funds as follows:
Our budget for 2023 is significantly smaller than our 2022 budget, due to a funding shortfall. We discuss this in more detail in our 2023 fundraiser post.
In 2023, CLR expects to spend its funds as follows:
In 2022, CLR expects to spend its funds as follows:
Note: Until the end of 2021, CLR operated as a project of the Effective Altruism Foundation. All amounts in the below documents are in CHF. Detailed financial reports are available on the EAF website.
In 2021, CLR expects to spend its funds as follows:
Note: During this period, CLR operated as a project of the Effective Altruism Foundation. All amounts in the below documents are in CHF. Detailed financial reports are available on the EAF website.
Note: During this period, CLR operated as a project of the Effective Altruism Foundation. All amounts in the below documents are in CHF. Detailed financial reports are available on the EAF website.
Note: During this period, CLR operated as a project of the Effective Altruism Foundation. As of 2018, CLR's financials are part of "EAF core". Detailed financial reports are available on the EAF website.
Note: During this period, CLR operated as a project of the Effective Altruism Foundation. Detailed financial reports are available on the EAF website.
Note: During this period, CLR operated as a project of the Effective Altruism Foundation. All amounts in the below documents are in CHF.
Annual Financial Statement 2016
CLR’s activities in 2015 were implemented by GBS Switzerland and Effective Altruism Switzerland (EACH), the predecessors of the Effective Altruism Foundation. Their financial reports are available on the EAF website.
In 2014, CLR’s expenses amounted to 28,422 USD.
CLR was founded in July 2013 but had no expenses as all of CLR’s activities were performed by volunteers on a pro bono basis.
The post Transparency appeared first on Center on Long-Term Risk.
The post Altruists Should Prioritize Artificial Intelligence appeared first on Center on Long-Term Risk.
The large-scale adoption of today's cutting-edge AI technologies across different industries would already prove transformative for human society. And AI research is progressing rapidly toward the goal of general intelligence. Once created, smarter-than-human artificial intelligence (AI) can be expected not only to be transformative for the world, but also (plausibly) to be better than humans at self-preservation and goal preservation. This makes it particularly attractive, from the perspective of those who care about improving the quality of the future, to focus on affecting the development goals of such AI systems, as well as to install potential safety precautions against likely failure modes. Some experts emphasize that steering the development of smarter-than-human AI in beneficial directions is important because it could make the difference between human extinction and a utopian future. But because we cannot confidently rule out the possibility that some AI scenarios will go badly and also result in large amounts of suffering, thinking about the impacts of AI is paramount both for suffering-focused altruists and for those focused on actualizing the upsides of the very best futures.
Terms like “AI” or “intelligence” can have many different (and often vague) meanings. “Intelligence” as used here refers to the ability to achieve goals in a wide range of environments. This definition captures the essence of many common perspectives on intelligence (Legg & Hutter, 2005), and conveys the meaning that is most relevant to us, namely that agents with the highest comparative goal-achieving ability (all things considered) are the most likely to shape the future. For comparison: As the most intelligent animal, humans completely dominate other animals whenever our interests are in conflict with theirs.
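One way to make this definition precise is Legg and Hutter's "universal intelligence" measure. The rendering below is a minimal illustrative sketch of that idea, included only to show how "goal-achieving ability across a wide range of environments" can be formalized; it is not a formula the article itself commits to.

```latex
% Legg & Hutter's "universal intelligence" measure (illustrative rendering):
% an agent's intelligence is its expected performance across all computable
% environments, with simpler environments weighted more heavily.
\[
  \Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
\]
% $\pi$           : the agent (a policy mapping observation histories to actions)
% $E$             : the set of all computable reward-generating environments
% $K(\mu)$        : the Kolmogorov complexity (description length) of environment $\mu$
% $V_{\mu}^{\pi}$ : the expected cumulative reward the agent obtains in $\mu$
```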
While everyday use of the term “intelligence” often refers merely to something like “brainpower” or “thinking speed,” our usage also presupposes rationality, or goal-optimization in an agent’s thinking and acting. In this usage, if someone is e.g. displaying overconfidence or confirmation bias, they may not qualify as very intelligent overall, even if they score high on an IQ test. The same applies to someone who lacks willpower or self control.
Artificial intelligence refers to machines designed with the ability to pursue tasks or goals. The AI designs currently in use – ranging from trading algorithms in finance, to chess programs, to self-driving cars – are intelligent in a domain-specific sense only. Chess programs beat the best human players at chess, but they would fail terribly at operating a car. Similarly, car-driving software in many contexts already performs better than human drivers, but no amount of learning (at least not with present algorithms) would make that software work safely on an airplane.
The most ambitious AI researchers are working to build systems that exhibit (artificial) general intelligence (AGI) – the type of intelligence we defined above, which enables the expert pursuit of virtually any task or objective. In the past few years, we have witnessed impressive progress in algorithms becoming more and more versatile. Google’s DeepMind team for example built an algorithm that learned to play 2-D Atari games on its own, achieving superhuman skill at several of them (Mnih et al., 2015). DeepMind then developed a program that beat the world champion in the game of Go (Silver et al., 2016), and – tackling more practical real-world applications – managed to cut down data center electricity costs by rearranging the cooling systems.
That DeepMind's AI technology makes quick progress in many domains, without requiring researchers to build new architecture from scratch each time, indicates that their machine learning algorithms have already reached an impressive level of general applicability. (Edit: I wrote the previous sentence in 2016. In the meantime [January 2018] DeepMind went on to refine its Go-playing AI, culminating in a version called AlphaGo Zero. While the initial version of DeepMind's Go-playing AI started out with access to a large database of games played by human experts, AlphaGo Zero only learns through self-play. Nevertheless, it managed to become superhuman after a mere 4 days of practice. After 40 days of practice, it was able to beat its already superhuman predecessor 100–0. Moreover, DeepMind then created AlphaZero, which is no longer a "Go-specific" algorithm. Fed with nothing but the rules for either Go, chess, or shogi, it managed to become superhuman at each of these games in less than 24 hours of practice.) The road may still be long, but if this trend continues, developments in AI research will eventually lead to superhuman performance across all domains. As there is no reason to assume that humans have attained the maximal degree of intelligence (Section III), AI may surpass our own level of intelligence soon after reaching it. Nick Bostrom (2014) popularized the term superintelligence to refer to (AGI-)systems that are vastly smarter than human experts in virtually all respects. This includes not only skills that computers traditionally excel at, such as calculus or chess, but also tasks like writing novels or talking people into doing things they otherwise would not. Whether AI systems would quickly develop superhuman skills across all possible domains, or whether we will already see major transformations once AI reaches superhuman capability in just some domains while others lag behind, is an open question. Note that the definitions of "AGI" and "superintelligence" leave open the question of whether these systems would exhibit something like consciousness.
This article focuses on the prospect of creating smarter-than-human artificial intelligence. For simplicity, we will use the term “AI” in a non-standard way here, to refer specifically to artificial general intelligence (AGI). The use of “AI” in this article will also leave open how such a system is implemented: While it seems plausible that the first artificial system exhibiting smarter-than-human intelligence will be run on some kind of “supercomputer,” our definition allows for alternative possibilities. The claim that altruists should focus on affecting AI outcomes is therefore intended to mean that we should focus on scenarios where the dominant force shaping the future is no longer (biological) human minds, but rather some outgrowth of information technology – perhaps acting in concert with biotechnology or other technologies. This would also e.g. allow for AI to be distributed over several interacting systems.
Even if we expect smarter-than-human artificial intelligence to be a century or more away, its development could already merit serious concern. As Sam Harris emphasized in his TED talk on risks and benefits of AI, we do not know how long it will take to figure out how to program ethical goals into an AI, solve other technical challenges in the space of AI safety, or establish an environment with reduced dangers of arms races. When the stakes are high enough, it pays to start preparing as soon as possible. The sooner we prepare, the better our chances of safely managing the upcoming transition.
The need for preparation is all the more urgent given that considerably shorter timelines are not out of the question, especially in light of recent developments. While timeline predictions by different AI experts span a wide range, many of those experts think it likely that human-level AI will be created this century (conditional on civilization facing no major disruptions in the meantime). Some even think it may emerge in the first half of this century: In a survey where the hundred most-cited AI researchers were asked in what year they think human-level AI is 10% likely to have arrived by, the median reply was 2024 and the mean was 2034. In response to the same question for a 50% probability of arrival, the median reply was 2050 with a mean of 2072 (Müller & Bostrom, 2016).1
While it could be argued that these AI experts are biased towards short timelines, their estimates should make us realize that human-level AI this century is a real possibility. The next section will argue that the subsequent transition from human-level AI to superintelligence could happen very rapidly after human-level AI actualizes. We are dealing with the decent possibility – e.g. above 15% likelihood even under highly conservative assumptions – that human intelligence will be surpassed by machine intelligence later this century, perhaps even in the next couple of decades. As such a transition will bring about huge opportunities as well as huge risks, it would be irresponsible not to prepare for it.
It should be noted that a potentially short timeline does not imply that the road to superintelligence is necessarily one of smooth progress: Metrics like Moore’s law are not guaranteed to continue indefinitely, and the rate of breakthrough publications in AI research may not increase (or even stay constant) either. The recent progress in machine learning is impressive and suggests that fairly short timelines of a decade or two are not to be ruled out. However, this progress could also be mostly due to some important but limited insights that enable companies like DeepMind to reap the low-hanging fruit before progress would slow down again. There are large gaps still to be filled before AIs reach human-level intelligence, and it is difficult to estimate how long it will take researchers to bridge these gaps. Current hype about AI may lead to disappointment in the medium term, which could bring about an “AI safety winter” with people mistakenly concluding that the safety concerns were exaggerated and smarter-than-human AI is not something we should worry about yet.
If AI progress were to slow down for a long time and then unexpectedly speed up again, a transition to superintelligence could happen with little warning (Shulman & Sandberg, 2010). This scenario is plausible because gains in software efficiency make a larger comparative difference to an AI’s overall capabilities when the hardware available is more powerful. And once an AI develops the intelligence of its human creators, it could start taking part in its own self-improvement (see section IV).
For AI progress to stagnate for a long period of time before reaching human-level intelligence, biological brains would have to have surprisingly efficient architectures that AI cannot achieve despite further hardware progress and years of additional human-led AI research. However, as long as hardware progress does not come to a complete halt, AGI research will eventually no longer need to surpass the human brain's architecture or efficiency. Instead, it could become possible to simply copy it: The "foolproof" way to build human-level intelligence would be to develop whole brain emulation (WBE) (Sandberg & Bostrom, 2008), the exact copying of the brain's pattern of computation (input-output behavior as well as isomorphic internal states at any point in the computation) onto a computer and a suitable virtual environment. In addition to sufficiently powerful hardware, WBE would require scanning technology with fine enough resolution to capture all the relevant cognitive function, as well as a sophisticated understanding of neuroscience to correctly draw the right abstractions. Even though our available estimates are crude, it is possible that all these conditions will be fulfilled well before the end of this century (Sandberg, 2014).
The perhaps most intriguing aspect of WBE technology is that once the first emulation exists and can complete tasks on a computer like a human researcher can, it would then be very easy to make more such emulations by copying the original. Moreover, with powerful enough hardware, it would also become possible to run emulations at higher speeds, or to reset them back to a well-rested state after they performed exhausting work (Hanson, 2016). Sped-up WBE workers could be given the task of improving computer hardware (or AI technology itself), which would trigger a wave of steeply exponential progress in the development of superintelligence. To get a sense of the potential of this technology, imagine WBEs of the smartest and most productive AI scientists, copied a hundred times to tackle AI research itself as a well-coordinated research team, sped up so they can do years of research in mere weeks or even days, and reset periodically to skip sleep (or other distracting activities) in cases where memory-formation is not needed. The scenario just described requires no further technologies beyond WBE and sufficiently powerful hardware. If the gap from current AI algorithms to smarter-than-human AI is too hard to bridge directly, it may eventually be bridged (potentially very quickly) after WBE technology drastically accelerates further AI research.
The potential for WBE to come before de novo AI means that – even if the gap between current AI designs and the human brain is larger than we thought – we should not significantly discount the probability of human-level AI being created eventually. And perhaps paradoxically, we should expect such a late transition to happen abruptly. Barring an upcoming societal collapse, believing that superintelligence is highly unlikely to ever happen requires not only confidence that software or "architectural" improvements to AI are insufficient to ever bridge the gap, but also that – in spite of continued hardware progress – WBE could not get off the ground either. We do not seem to have sufficient reason for great confidence in either of these propositions, let alone both.
It is difficult to intuitively comprehend the idea that machines – or any physical system for that matter – could become substantially more intelligent than the most intelligent humans. Because the intelligence gap between humans and other animals appears very large to us, we may be tempted to think of intelligence as an “on-or-off concept,” one that humans have and other animals do not. People may believe that computers can be better than humans at certain tasks, but only at tasks that do not require “real” intelligence. This view would suggest that if machines ever became “intelligent” across the board, their capabilities would have to be no greater than those of an intelligent human relying on the aid of (computer-)tools.
But this view is mistaken. There is no threshold for “absolute intelligence.” Nonhuman animals such as primates or rodents differ in cognitive abilities a great deal, not just because of domain-specific adaptations, but also due to a correlational “g factor” responsible for a large part of the variation across several cognitive domains (Burkart et al., 2016). In this context, the distinction between domain-specific and general intelligence is fuzzy: In many ways, human cognition is still fairly domain-specific. Our cognitive modules were optimized specifically for reproductive success in the simpler, more predictable environment of our ancestors. We may be great at interpreting which politician has the more confident or authoritative body language, but deficient in evaluating whose policy positions will lead to better developments according to metrics we care about. Our intelligence is good enough or “general enough” that we manage to accomplish impressive feats even in an environment quite unlike the one our ancestors evolved in, but there are many areas where our cognition is slower or more prone to bias than it could be.
Intelligence is best thought of in terms of a gradient. Imagine a hypothetical “intelligence scale” (inspired by part 2.1 of this FAQ) with rats at 100, chimpanzees at, say, 350, the village idiot at 400, average humans at 500 and Einstein at 750.2 Of course, this scale is open at the top and could go much higher. To quote Bostrom (2014, p. 44):
"Far from being the smartest possible biological species, we are probably better thought of as the stupidest possible biological species capable of starting a technological civilization – a niche we filled because we got there first, not because we are in any sense optimally adapted to it."
Thinking about intelligence as a gradient rather than an “on-or-off” concept prompts a Copernican shift of perspective. Suddenly it becomes obvious that humans cannot be at the peak of possible intelligence. On the contrary, we should expect AI to be able to surpass us in intelligence just like we surpass chimpanzees.
Biological evolution supports the view that AI could reach levels of intelligence vastly beyond ours. Evolutionary history arguably exhibits a weak trend of lineages becoming more intelligent over time, but evolution did not optimize for intelligence (only for goal-directed behavior in specific niches or environment types). Intelligence is metabolically costly, and without strong selection pressures for cognitive abilities specifically, natural selection will favor other traits. The development of new traits always entails tradeoffs or physical limitations: If our ancestors had evolved to have larger heads at birth, maternal childbirth mortality would likely have risen so high as to outweigh the gains of increased intelligence (Wittman & Wall, 2007). Because evolutionary change happens step-by-step as random mutations change the pre-existing architecture, the changes are path dependent and can only result in local optima, not global ones. It would be a remarkable coincidence if evolution had just so happened to stumble upon the most efficient way to assemble matter into an intelligent system.
But let us imagine that we could go back to the “drawing board” and optimize for a system’s intelligence without any developmental limitations. This process would provide the following benefits for AI over the human brain (Bostrom, 2014, p. 60-61):
With regard to the last point, imagine we tried to optimize for something like speed or sight rather than intelligence. Even if humans had never built anything faster than the fastest animal, we should assume that technological progress – unless it is halted – would eventually surpass nature in these respects. After all, natural selection does not optimize directly for speed or sight (but rather for gene copying success), making it a slower optimization process than those driven by humans for this specific purpose. Modern rockets already fly at speeds of up to 36,373 mph, which beats the peregrine falcon’s 240 mph by a huge margin. Similarly, eagle vision may be powerful, but it cannot compete with the Hubble space telescope. (General) intelligence is harder to replicate technologically, but natural selection did not optimize for intelligence either, and there do not seem to be strong reasons to believe that intelligence as a trait should differ categorically from examples like speed or sight, i.e., there are as far as we know no hard physical limits that would put human intelligence at the peak of what is possible.3
Another way to develop an intuition for the idea that there is significant room for improvement above human intelligence is to study variation in humans. An often-discussed example in this context is the intellect of John von Neumann. Von Neumann was not some kind of an alien, nor did he have a brain twice as large as the human average. And yet, von Neumann’s accomplishments almost seem “superhuman.” The section in his Wikipedia entry that talks about him having “founded the field of Game theory as a mathematical discipline” – an accomplishment so substantial that for most other intellectual figures it would make up most of their Wikipedia page – is just one out of many of von Neumann’s major achievements.
There are already individual humans (with normal-sized brains) whose intelligence vastly exceeds that of the typical human. So just how much room is there above their intelligence? To visualize this, consider for instance what could be done with an AI architecture more powerful than the human brain running on a warehouse-sized supercomputer.
Perhaps the people who think it is unlikely that superintelligent AI will ever be created are not objecting to it being possible in principle. Maybe they think it is simply too difficult to bridge the gap from human-level intelligence to something much greater. After all, evolution took a long time to produce a species as intelligent as humans, and for all we know, there could be planets with biological life where intelligent civilizations never evolved.4 But considering that there could come a point where AI algorithms start taking part in their own self-improvement, we should be more optimistic. AIs contributing to AI research will make it easier to bridge the gap, and could perhaps even lead to an acceleration of AI progress to the point that AI not only ends up smarter than us, but vastly smarter after only a short amount of time.
Several points in the list of AI advantages above – in particular the advantages derived from the editability of computer software or the possibility for modular superpowers to have crucial skills such as programming – suggest that AI architectures might both be easier to further improve than human brains, and that AIs themselves might at some point become better at actively developing their own improvements. If we ever build a machine with human-level intelligence, it should then be comparatively easy to speed it up or make tweaks to its algorithm and internal organization to make it more powerful. The updated version, which would at this point be slightly above human-level intelligence, could be given the task of further self-improvement, and so on until the process runs into physical limits or other bottlenecks.
Perhaps self-improvement does not have to require human-level general intelligence at all. There may be comparatively simple AI designs that are specialized for AI science and (initially) lack proficiency in other domains. The theoretical foundations for an AI design that can bootstrap itself to higher and higher intelligence already exist (Schmidhuber, 2006), and it remains an empirical question where exactly the threshold is after which AI designs would become capable of improving themselves further, and whether the slope of such an improvement process is steep enough to go on for multiple iterations.
For the above reasons, it cannot be ruled out that breakthroughs in AI could at some point lead to an intelligence explosion (Good, 1965; Chalmers, 2010), where recursive self-improvement leads to a rapid acceleration of AI progress. In such a scenario, AI could go from subhuman intelligence to vastly superhuman intelligence in a very short timespan, e.g. in (significantly) less than a year.
While the idea of AI advancing from human-level to vastly superhuman intelligence in less than a year may sound implausible, as it violates long-standing trends in the speed of human-driven development, it would not be the first time where changes to the underlying dynamics of an optimization process cause an unprecedented speed-up. Technology has been accelerating ever since innovations (such as agriculture or the printing press) began to feed into the rate at which further innovations could be generated.5 Compared to the rate of change we see in biological evolution, cultural evolution broke the sound barrier: It took biological evolution a few million years to improve on the intelligence of our ape-like ancestors to the point where they became early hominids. By contrast, technology needed little more than ten thousand years to progress from agriculture to space shuttles. Just as inventions like the printing press fed into – and significantly sped up – the process of technological evolution, rendering it qualitatively different from biological evolution, AIs improving their own algorithms could cause a tremendous speed-up in AI progress, rendering AI development through self-improvement qualitatively different from “normal” technological progress.
It should be noted, however, that while the arguments in favor of a possible intelligence explosion are intriguing, they nevertheless remain speculative. There are also some good reasons why some experts consider a slower takeoff of AI capabilities more likely. In a slower takeoff, it would take several years or even decades for AI to progress from human to superhuman intelligence. Unless we find decisive arguments for one scenario over the other, we should expect both rapid and comparably slow takeoff scenarios to remain plausible. It is worth noting that because “slow” in this context also includes transitions on the order of ten or twenty years, it would still be very fast practically speaking, when we consider how much time nations, global leaders or the general public would need to adequately prepare for these changes.
The typical mind fallacy refers to the belief that other minds operate the same way our own does. If an extrovert asks an introvert, “How can you possibly not enjoy this party; I talked to half a dozen people the past thirty minutes and they were all really interesting!” they are committing the typical mind fallacy.
When envisioning the goals of smarter-than-human artificial intelligence, we are in danger of committing this fallacy and projecting our own experience onto the way an AI would reason about its goals. We may be tempted to think that an AI, especially a superintelligent one, will reason its way through moral arguments6 and come to the conclusion that it should, for instance, refrain from harming sentient beings. This idea is misguided, because according to the intelligence definition we provided above – which helps us identify the processes likely to shape the future – making a system more intelligent does not change its goals/objectives; it only adds more optimization power for pursuing those objectives.
To give a silly example, imagine that an arms race between spam producers and companies selling spam filters leads to increasingly sophisticated strategies on both sides, until the side selling spam filters has had it and engineers a superintelligent AI with the sole objective of minimizing the number of spam emails in their inboxes. With its level of sophistication, the spam-blocking AI would have more strategies at its disposal than normal spam filters. For instance, it could try to appeal to human reason by voicing sophisticated, game-theoretic arguments against the negative-sum nature of sending out spam. But it would be smart enough to realize the futility of such a plan, as this naive strategy would backfire because some humans are trolls (among other reasons). So the spam-minimizing AI would quickly conclude that the safest way to reduce spam is not by being kind, but by gaining control over the whole planet and killing everything that could possibly try to trick its spam filter. The AI in this example may fully understand that humans would object to these actions on moral grounds, but human "moral grounds" are based on what humans care about – which is not the minimization of spam! And the AI – whose whole decision architecture only selects for actions that promote the terminal goal of minimizing spam – would therefore not be motivated to think through, let alone follow, our arguments, even if it could "understand" them in the same way introverts understand why some people enjoy large parties.
The typical mind fallacy tempts us to conclude that because moral arguments appeal to us,7 they would appeal to any generally intelligent system. This claim is after all already falsified empirically by the existence of high-functioning psychopaths. While it may be difficult for most people to imagine how it would feel to not be moved by the plight of anyone but oneself, this is nothing compared to the difficulties of imagining all the different ways that minds in general could be built. Eliezer Yudkowsky coined the term mind space to refer to the set of all possible minds – including animals (of existing species as well as extinct ones), aliens, and artificial intelligences, as well as completely hypothetical “mind-like” designs that no one would ever deliberately put together. The variance in all human individuals, throughout all of history, only represents a tiny blob in mind space. Some of the minds outside this blob would “think” in ways that are completely alien to us; most would lack empathy and other (human) emotions for that matter; and many of these minds may not even relevantly qualify as “conscious.”
Most of these minds would not be moved by moral arguments, because the decision to focus on moral arguments has to come from somewhere, and many of these minds would simply lack the parts that make moral appeals work in humans. Unless AIs are deliberately designed8 to share our values, their objectives will in all likelihood be orthogonal to ours (Armstrong, 2013).
Even though AI designs may differ radically in terms of their top-level goals, we should expect most AI designs to converge on some of the same subgoals. These convergent subgoals (Omohundro, 2008; Bostrom, 2012) include intelligence amplification, self-preservation, goal preservation and the accumulation of resources. All of these are instrumentally very useful to the pursuit of almost any goal. If an AI is able to access the resources it needs to pursue these subgoals, and does not explicitly have concern for human preferences as (part of) its top-level goal, its pursuit of these subgoals is likely to lead to human extinction (and eventually space colonization; see below).
AI safety work refers to interdisciplinary efforts to ensure that the creation of smarter-than-human artificial intelligence will result in excellent outcomes rather than disastrous ones. Note that the worry is not that AI would turn evil, but that indifference to suffering and human preferences will be the default unless we put in a lot of work to ensure that AI is developed with the right values.
Increasing an agent’s intelligence improves its ability to efficiently pursue its goals. All else equal, any agent has a strong incentive to amplify its intelligence. A real-life example of this convergent drive is the value of education: Learning important skills and (thinking-)habits early in life correlates with good outcomes. In the AI context, intelligence amplification as a convergent drive implies that AIs with the ability to improve their own intelligence will do so (all else equal). To self-improve, AIs would try to gain access to more hardware, make copies of themselves to increase their overall productivity, or devise improvements to their own cognitive algorithms.
More broadly, intelligence amplification also implies that an AI would try to develop all technologies that may be of use to its pursuits. I.J. Good, a mathematician and cryptologist who worked alongside Alan Turing, asserted that “the first ultraintelligent machine is the last invention that man need ever make,” because once we build it, such a machine would be capable of developing all further technologies on its own.
AIs would in all likelihood also have an interest in preserving their own goals. This is because they optimize actions in terms of their current goals, not in terms of goals they might end up having in the future. From the current goal's perspective, a change in the AI's goal function is potentially disastrous, as the current goal would then no longer be pursued. Therefore, AIs will try to prevent researchers from changing their goals. Consequently, there is pressure for AI researchers to get things right on the first try: If we develop a superintelligent AI with a goal that is not quite what we were after – because someone made a mistake, or was not precise enough, or did not think about particular ways the specified goal could backfire – the AI would pursue the goal that it was equipped with, not the goal that was intended. This applies even if it could understand perfectly well what the intended goal was. This feature of going with the actual goal instead of the intended one could lead to cases of perverse instantiation, such as the AI "paralyz[ing] human facial musculatures into constant beaming smiles" to pursue an objective of "make us smile" (Bostrom, 2014, p. 120).
Some people have downplayed worries about AI risks with the argument that when things begin to look dangerous, humans can literally "pull the plug" in order to shut down AIs that are behaving suspiciously. This argument is naive because it is based on the assumption that AIs would be too stupid to take precautions against this. Because the scenario we are discussing concerns smarter-than-human intelligence, an AI would understand the implications of losing its connection to electricity, and would therefore try to proactively prevent being shut down by any means necessary – especially when shutdown might be permanent.
This is not to say that AIs would necessarily be directly concerned about their own “death” – after all, whether an AI’s goal includes its own survival or not depends on the specifics of its goal function. However, for most goals, staying around pursuing one's goal will lead to better expected goal achievement. AIs would therefore have strong incentives to prevent permanent shutdown even if their goal was not about their own “survival” at all. (AIs might, however, be content to outsource their goal achievement by making copies of themselves, in which case shutdown of the original AI would not be so terrible as long as one or several copies with the same goal remain active.)
The convergent drive for self-preservation has the unfortunate implication that superintelligent AI would almost inevitably see humans as a potential threat to its goal achievement. Even if its creators do not plan to shut the AI down for the time being, the superintelligence could reasonably conclude that the creators might decide to do so at some point. Similarly, a newly-created AI would have to expect some probability of interference from external actors such as the government, foreign governments or activist groups. It would even be concerned that humans in the long term are too stupid to keep their own civilization intact, which would also affect the infrastructure required to run the AI. For these reasons, any AI intelligent enough to grasp the strategic implications of its predicament would likely be on the lookout for ways to gain dominance over humanity. It would do this not out of malevolence, but simply as the best strategy for self-preservation.
This does not mean that AIs would at all times try to overpower their creators: If an AI realizes that attempts at trickery are likely to be discovered and punished with shutdown, it may fake being cooperative, and may fake having the goals that the researchers intended, while privately plotting some form of takeover. Bostrom has referred to this scenario as a “treacherous turn” (Bostrom, 2014, p. 116).
We may be tempted to think that AIs implemented on some kind of normal computer substrate, without arms or legs for mobility in the non-virtual world, may be comparatively harmless and easy to overpower in case of misbehavior. This would likely be a misconception, however. We should not underestimate what a superintelligence with access to the internet could accomplish. And it could attain such access in many ways and for many reasons, e.g. because the researchers were careless or underestimated its capacities, or because it successfully pretended to be less capable than it actually was. Or maybe it could try to convince the “weak links” in its team of supervisors to give it access in secret – promising bribes. Such a strategy could work even if most people in the developing team thought it would be best to deny their AI internet access until they have more certainty about the AI's alignment status and its true capabilities. Importantly, if the first superintelligence ever built was prevented from accessing the internet (or other efficient channels of communication), its impact on the world would remain limited, making it possible for other (potentially less careful) teams to catch up. The closer the competition, the more the teams are incentivized to give their AIs riskier access over resources in a gamble for the potential benefits in case of proper alignment.
The following list contains some examples of strategies a superintelligent AI could use to gain power over more and more resources, with the goal of eventually reaching a position where humans cannot harm or obstruct it. Note that these strategies were thought of by humans, and are therefore bound to be less creative and less effective than the strategies an actual superintelligence would be able to devise.
Through some means or another – and let’s not forget that the AI could well attempt many strategies at once to safeguard against possible failure in some of its pursuits – the AI may eventually gain a decisive strategic advantage over all competition (Bostrom, 2014, p. 78-90). Once this is the case, it would carefully build up further infrastructure on its own. This stage will presumably be easier to reach as the world economy becomes more and more automated.
Once humans are no longer a threat, the AI would focus its attention on natural threats to its existence. It would for instance notice that the sun will expand in about seven billion years to the point where existence on Earth will become impossible. For reasons of self-preservation alone, a superintelligent AI would thus eventually be incentivized to expand its influence beyond Earth.
For the fulfillment of most goals, accumulating as many resources as possible is an important early step. Resource accumulation is also intertwined with the other subgoals in that it tends to facilitate them.
The resources available on Earth are only a tiny fraction of the total resources that an AI could access in the entire universe. Resource accumulation as a convergent subgoal implies that most AIs would eventually colonize space (provided that it is not prohibitively costly), in order to gain access to the maximum amount of resources. These resources would then be put to use for the pursuit of its other subgoals and, ultimately, for optimizing its top-level goal.
Superintelligent AI might colonize space in order to build (more of) the following:
To elaborate on the point of goal optimization: Humans tend to be satisficers with respect to most things in life. We have minimum requirements for the quality of the food we want to eat, the relationships we want to have, or the job we want to work in. Once these demands are met and we find options that are “pretty good,” we often end up satisfied and settle down on the routine. Few of us spend decades of our lives pushing ourselves to invest as many waking hours as sustainably possible into systematically finding the optimal food in existence, the optimal romantic partner, or anything really.
AI systems on the other hand, in virtue of how they are usually built, are more likely to act as maximizers. A chess computer is not trying to look for “pretty good moves” – it is trying to look for the best move it can find with the limited time and computing power it has at its disposal. The pressure to build ever more powerful AIs is a pressure to build ever more powerful maximizers. Unless we deliberately program AIs in a way that reduces their impact, the AIs we build will be maximizers that never “settle” or consider their goals “achieved.” If their goal appears to be achieved, a maximizer AI will spend its remaining time double- and triple-checking whether it made a mistake. When it is only 99.99% certain that the goal is achieved, it will restlessly try to increase the probability further – even if this means using the computing power of a whole galaxy to drive the probability it assigns to its goal being achieved from 99.99% to 99.991%.
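To make the contrast concrete, here is a minimal, purely illustrative Python sketch; the function names and the toy "meals" example are hypothetical and not drawn from any actual AI system. A satisficer stops at the first option that clears its aspiration level, while a maximizer exhaustively searches for the single best option it can find.

```python
# Illustrative toy contrast between satisficing and maximizing decision rules.
# All names here are hypothetical examples, not part of any real AI system.

def satisficer(options, utility, threshold):
    """Return the first option that is 'good enough' (meets the threshold)."""
    for option in options:
        if utility(option) >= threshold:
            return option          # settle as soon as the aspiration level is met
    return None                    # no acceptable option found

def maximizer(options, utility):
    """Exhaustively evaluate every option and return the single best one."""
    return max(options, key=utility)  # never settles for "pretty good"

if __name__ == "__main__":
    meals = ["instant noodles", "sandwich", "home-cooked dinner", "tasting menu"]
    tastiness = {"instant noodles": 3, "sandwich": 6,
                 "home-cooked dinner": 8, "tasting menu": 10}.get

    print(satisficer(meals, tastiness, threshold=6))  # -> "sandwich"
    print(maximizer(meals, tastiness))                # -> "tasting menu"
```

The point is not the code itself but the difference in stopping conditions: the satisficer has a notion of "good enough" at which it stops searching, whereas the maximizer keeps optimizing as long as resources allow.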
Because of the nature of maximizing as a decision-strategy, a superintelligent AI is likely to colonize space in pursuit of its goals unless we program it in a way to deliberately reduce its impact. This is the case even if its goals appear as “unambitious” as e.g. “minimize spam in inboxes.”
Space colonization by artificial superintelligence would increase goal-directed activity and computations in the world by an astronomically large factor—for good or for evil.11 If the superintelligence holds objectives that are aligned with our values, then the outcome could be a utopia. However, if the AI has randomly, mistakenly, or sufficiently suboptimally implemented values, the best we could hope for is if all the machinery it used to colonize space was inanimate, i.e. not sentient. Such an outcome – even though all humans would die – would still be much better than other plausible outcomes, because it would at least not contain any suffering. Unfortunately, we cannot rule out that the space colonization machinery orchestrated by a superintelligent AI would also contain sentient minds, including minds that suffer (though probably also happy minds). The same way factory farming led to a massive increase in farmed animal populations, multiplying the direct suffering humans cause to animals by a large factor, an AI colonizing space could cause a massive increase in the total number of sentient entities, potentially creating vast amounts of suffering. The following are some ways AI outcomes could result in astronomical amounts of suffering:
More ways AI scenarios could contain astronomical amounts of suffering are described here and here. Sources of future suffering are likely to follow a power law distribution, where most of the expected suffering comes from a few rare scenarios where things go very wrong – analogous to how most casualties are the result of very few, very large wars; how most of the casualty risk from terrorist attacks falls into tail scenarios where terrorists get their hands on weapons of mass destruction; or how most victims of epidemics succumbed to the few very worst outbreaks (Newman, 2005). It is therefore crucial to factor in not only which scenarios are most likely to occur, but also how bad those scenarios would be should they occur.
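As a rough, purely illustrative simulation of this tail-dominance pattern (the Pareto shape parameter and sample size below are arbitrary choices, not empirical estimates of anything), one can check that in a heavy-tailed distribution a small fraction of draws accounts for most of the total:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw "scenario severities" from a heavy-tailed Pareto distribution.
# The shape parameter alpha is illustrative only; smaller alpha means a heavier tail.
alpha = 1.2
severities = np.sort(rng.pareto(alpha, size=100_000) + 1)

total = severities.sum()
worst_1_percent = severities[-1000:].sum()   # the 1% largest scenarios

print(f"Share of total severity in the worst 1% of scenarios: {worst_1_percent / total:.0%}")
# With alpha close to 1, the top 1% of scenarios typically accounts for a large
# share of the total -- the qualitative pattern described in the text.
```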
Critics may object that the above scenarios are largely based on the possibility of artificial sentience, particularly sentience implemented on a computer substrate. If this turns out to be impossible, there may not be much suffering in futures with AI after all. However, the view that computer-based minds can suffer in the morally relevant sense is a common implication of positions in philosophy of mind. Functionalism and type A physicalism (“eliminativism”) both imply that there can be morally relevant minds on digital substrates. Even if one were skeptical of these two positions and instead favored the views of philosophers like David Chalmers or Galen Strawson (e.g. Strawson, 2006), who believe consciousness is an irreducible phenomenon, there are at least some circumstances under which these views would also allow for computer-based minds to be sentient.12 Crude “carbon chauvinism,” the belief that consciousness is linked only to carbon atoms, is an extreme minority position in philosophy of mind.
The case for artificial sentience is not just abstract but can also be made on the intuitive level: Imagine we had whole brain emulation with a perfect mapping from inputs to outputs, behaving exactly like a person's actual brain. Suppose we also give this brain emulation a robot body, with a face and facial expressions created with particular attention to detail. The robot will, by the stipulations of this thought experiment, behave exactly like a human person would behave in the same situation. So the robot-person would very convincingly plead that it has consciousness and moral relevance. How certain would we be that this was all just an elaborate facade? Why should it be?
Because we are unfamiliar with artificial minds and have a hard time experiencing empathy for things that do not appear or behave in animal-like ways, we may be tempted to dismiss the possibility of artificial sentience or deny artificial minds moral relevance – the same way animal sentience was dismissed for thousands of years. However, the theoretical reasons to anticipate artificial sentience are strong, and it would be discriminatory to deny moral consideration to a mind simply because it is implemented on a substrate different from ours. As long as we are not very confident indeed that minds on a computer substrate would be incapable of suffering in the morally relevant sense, we should believe that most of the future’s expected suffering is located in futures where superintelligent AI colonizes space.
The world currently contains a great deal of suffering. Large sources of suffering include, for instance, poverty in developing countries, mental health issues all over the world, and non-human animal suffering in factory farms and in the wild. We already have a good overview – with better understanding in some areas than others – of where altruists can cost-effectively reduce substantial suffering. Charitable interventions are commonly chosen according to whether they produce measurable impact in the years or decades to come. Unfortunately, altruistic interventions are rarely chosen with the whole future in mind, i.e. with a focus on reducing as much suffering as possible for the rest of time, until the heat death of the universe.13 This is potentially problematic, because we should expect the far future to contain vastly more suffering than the next decades, not only because there might be sentient beings around for millions or billions of years to come, but also because it is possible for Earth-originating life to eventually colonize space, which could multiply the total number of sentient beings many times over. While it is important to reduce the suffering of sentient beings now, it seems unlikely that the most consequential intervention for the future of all sentience will also be the intervention that is best for reducing short-term suffering. Instead, as judged from the distant future, the most consequential development of our decade would more likely have something to do with novel technologies or the ways they will be used.
And yet, politics, science, economics and especially the media are biased towards short timescales. Politicians worry about elections, scientists worry about grant money, and private corporations need to work on things that produce a profit in the foreseeable future. We should therefore expect interventions targeted at the far future to be much more neglected than interventions targeted at short-term sources of suffering.
Admittedly, the far future is difficult to predict. If our models fail to account for all the right factors, our predictions may turn out very wrong. However, rather than trying to simulate in detail everything that might happen all the way into the distant future – which would be a futile endeavor, needless to say – we should focus our altruistic efforts on influencing levers that remain agile and reactive to future developments. An example of such a lever is institutions that persist for decades or centuries. The US Constitution, for instance, still carries significant relevance in today’s world, even though it was formulated hundreds of years ago. Similarly, the people who founded the League of Nations after World War I did not succeed in preventing the next war, but they contributed to the founding and the charter of its successor organization, the United Nations, which still exerts geopolitical influence today. The actors who initially influenced the formation of these institutions, as well as their values and principles, had a long-lasting impact.
In order to positively influence the future for hundreds of years, we fortunately do not need to predict the next hundreds of years in detail. Instead, all we need to predict is what type of institutions – or, more generally, stable and powerful decision-making agencies – are most likely to react to future developments maximally well.14
AI is the ultimate lever through which to influence the future. The goals of an artificial superintelligence would plausibly be much more stable than the values of human leaders or those enshrined in any constitution or charter. And a superintelligent AI would, with at least considerable likelihood, remain in control of the future not only for centuries, but for millions or even billions of years to come. In non-AI scenarios on the other hand, all the good things we achieve in the coming decade(s) will “dilute” over time, as current societies, with all their norms and institutions, change or collapse.
In a future where smarter-than-human artificial intelligence is never created, our altruistic impact – even if we manage to achieve a lot in greatly influencing this non-AI future – would be comparatively “capped” and insignificant when contrasted with the scenarios where our actions do affect the development of superintelligent AI (or how AI would act).15 We should expect AI scenarios to contain not only the most stable lever we can imagine – the AI’s goal function, which the AI will want to preserve carefully – but also the highest stakes. In comparison with non-AI scenarios, space colonization by superintelligent AI would turn the largest amount of matter and energy into complex computations. In a best-case scenario, all these resources could be turned into a vast utopia full of happiness, which provides a strong incentive for us to get AI creation perfectly right. However, if the AI is equipped with insufficiently good values, or if it optimizes for random goals not intended by its creators, the outcome could also include astronomical amounts of suffering. In combination, these two reasons – highest influence and goal stability, and highest stakes – make a strong case for focusing our attention on AI scenarios.
While critics may object that all this emphasis on the astronomical stakes in AI scenarios appears unfairly Pascalian, it should be noted that AI is not a frivolous thought experiment where we invoke new kinds of physics to raise the stakes. Smarter-than-human artificial intelligence and space colonization are both realistically possible and plausible developments that fit squarely into the laws of nature as we currently understand them. If either of them turns out to be impossible, that would be a big surprise, and would suggest that we are fundamentally misunderstanding something about the way physical reality works. While the implications of smarter-than-human artificial intelligence are hard to grasp intuitively, the underlying reasons for singling out AI as a scenario to worry about are sound. As illustrated by Leó Szilárd’s lobbying for precautions around nuclear bombs well before the first such bombs were built, it is far from hopeless to prepare for disruptive new technologies in advance, before they are completed.
This text argued that altruists concerned about the quality of the future should focus their attention on futures where AI plays an important role. This can mean many things. It does not mean that everyone should think about AI scenarios or work on technical AI alignment directly. Rather, it just means we should pick interventions to support according to their long-term consequences, and particularly according to the ways in which our efforts could make a difference to futures ruled by superintelligent AI. Whether it is best to try to affect AI outcomes in a narrow and targeted way, or whether we should go for a broader strategy, depends on several factors and requires further study.
CLR has looked systematically into paths to impact for affecting AI outcomes with particular emphasis on preventing suffering, and we have come up with a few promising candidates. The following list presents some tentative proposals:
It is important to note that human values may not affect the goals of an AI at all if researchers fail to solve the value-loading problem. Raising awareness of certain values may therefore be particularly impactful if it concerns groups likely to be in control of the goals of smarter-than-human artificial intelligence.
Further research is needed to flesh out these paths to impact in more detail, and to discover even more promising ways to affect AI outcomes. As there is always the possibility that we have overlooked something or are misguided or misinformed, we should remain open-minded and periodically rethink the assumptions our current prioritization is based on.
This text borrows ideas and framings from other people’s introductions to AI. I tried to flag this with links or citations wherever I remembered the source and where the writing was not convergent, but I may not have remembered everything. I'm particularly indebted to the writings of Eliezer Yudkowsky, Nick Bostrom and Scott Alexander. Many thanks also to David Althaus, Tobias Baumann, Ruairi Donnelly, Caspar Oesterheld and Kelly Witwicki for helpful comments and editing.
Armstrong, S. (2013). General Purpose Intelligence: Arguing the Orthogonality Thesis. Future of Humanity Institute, Oxford University.
Bostrom, N. (2003). Astronomical Waste: The Opportunity Cost of Delayed Technological Development. Utilitas, 15(3), 308-314.
Bostrom, N. (2005). What is a Singleton? Linguistic and Philosophical Investigations, 5(2), 48-54.
Bostrom, N. (2012). The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Minds and Machines, 22(2), 71-85.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Burkart, J. M., Schubiger, M. N., & Schaik, C. P. (2016). The evolution of general intelligence. Behavioral and Brain Sciences, 1-65.
Chalmers, D. (2010). The Singularity: A Philosophical Analysis. Journal of Consciousness Studies, 17: 7-65.
Daswani, M. & Leike, J. (2015). A Definition of Happiness for Reinforcement Learning Agents. arXiv:1505.04497.
Dawkins, R. (1996). The blind watchmaker: Why the evidence of evolution reveals a universe without design. New York: Norton.
Good, I. J. (1965). Speculations concerning the first ultraintelligent machine. Blacksburg, VA: Dept. of Statistics, Virginia Polytechnic Institute and State University.
Hanson, R. (2016). The Age of Em: Work, love, and life when robots rule the Earth. Oxford: Oxford University Press.
Legg, S., & Hutter, M. (2005). Universal Intelligence: A Definition of Machine Intelligence. Minds and Machines, 17(4), 391-444.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., . . . Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
Müller, V. C., & Bostrom, N. (2016). Future Progress in Artificial Intelligence: A Survey of Expert Opinion. Fundamental Issues of Artificial Intelligence, 553-570.
Newman, M. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5), 323–351.
Omohundro, S. (2008). The Basic AI Drives. Proceedings of the 2008 conference on Artificial General Intelligence 2008: 483-492. IOS Press Amsterdam.
Sandberg, A. (1999). The Physics of Information Processing Superobjects: Daily Life Among the Jupiter Brains.
Sandberg, A. (2014). Monte Carlo model of brain emulation development, Working Paper 2014-1 (version 1.2). Future of Humanity Institute, Oxford University.
Sandberg, A. & Bostrom, N. (2008). Whole Brain Emulation: A Roadmap, Technical Report #2008‐3, Future of Humanity Institute, Oxford University.
Schmidhuber, J. (2006). Gödel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements. arXiv:cs.LO/0309048 v5.
Shulman, C. & Bostrom, N. (2012). How Hard is Artificial Intelligence? Evolutionary Arguments and Selection Effects. Journal of Consciousness Studies, 19(7-8), 103-130.
Shulman, C. & Sandberg, A. (2010). Implications of a Software-Limited Singularity. In ECAP10: VIII European Conference on Computing and Philosophy.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Driessche, G. V., . . . Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
Strawson, G. (2006). Realistic Monism - Why Physicalism Entails Panpsychism. Journal of Consciousness Studies, 13(10-11), 3–31.
Wittman, A. B., & Wall, L. L. (2007). The Evolutionary Origins of Obstructed Labor: Bipedalism, Encephalization, and the Human Obstetric Dilemma. Obstetrical & Gynecological Survey, 62(11), 739-748.
The post Altruists Should Prioritize Artificial Intelligence appeared first on Center on Long-Term Risk.
The post How Feasible Is the Rapid Development of Artificial Superintelligence? appeared first on Center on Long-Term Risk.
Since Turing (1950), the dream of artificial intelligence (AI) research has been the creation of a “machine that could think”. While the current expert consensus is that the creation of such a system will still take several decades, if not more (Müller & Bostrom 2016), recent progress in AI has raised worries about the challenges involved with increasingly capable AI systems (Future of Life Institute 2015, Amodei et al. 2016).
In addition to the risks posed by near-term developments, there is the possibility of AI systems eventually reaching superhuman levels of intelligence and breaking out of human control (Bostrom 2014). Various research agendas and lists of research priorities have been suggested for managing the challenges that this level of capability would pose to society (Soares & Fallenstein 2014, Russell et al. 2015, Amodei et al. 2016, Taylor et al. 2016).
For managing the challenges presented by increasingly capable AI systems, one needs to know how capable those systems might ultimately become, and how quickly. If AI systems can rapidly achieve strong capabilities, becoming powerful enough to take control of the world before any human can react, then that implies a very different approach than one where AI capabilities develop gradually over many decades, never getting substantially past the human level (Sotala & Yampolskiy, 2015). We might phrase these questions as:
Views on these questions vary. Authors such as Bostrom (2014) and Yudkowsky (2008) argue for the possibility of a fast leap in intelligence, with both offering hypothetical example scenarios where an AI rapidly acquires a dominant position over humanity. On the other hand, Anderson (2010) and Lawrence (2016) appeal to fundamental limits on predictability – and thus intelligence – posed by the complexity of the environment. Lawrence writes:
‘Practitioners who have performed sensitivity analysis on time series prediction will know how quickly uncertainty accumulates as you try to look forward in time. There is normally a time frame ahead of which things become too misty to compute any more. Further computational power doesn’t help you in this instance, because uncertainty dominates. Reducing model uncertainty requires exponentially greater computation. We might try to handle this uncertainty by quantifying it, but even this can prove intractable.
So just like the elusive concept of infinite precision in mechanical machining, there is likely a limit on the degree to which an entity can be intelligent. We cannot predict with infinite precision and this will render our predictions useless on some particular time horizon.
The limit on predictive precision is imposed by the exponential growth in complexity of exact simulation, coupled with the accumulation of error associated with the necessary abstraction of our predictive models. As we predict forward these uncertainties can saturate dominating our predictions. As a result we often only have a very vague notion of what is to come. This limit on our predictive ability places a fundamental limit on our ability to make intelligent decisions.’
We might summarize this as saying that, past a certain point, increased intelligence is only of limited benefit, for the unpredictability of the environment means that you would have to spend exponentially more resources to evaluate a vastly increasing amount of possibilities.
Noise also accumulates over time, reducing the reliability of your models. For many kinds of predictions, increasing the prediction window would require an exponential increase in the number of measurements (Martela 2016). For instance, weather models become increasingly uncertain when projected farther out in time. Forecasters can only access a limited number of observations relative to the weather system’s degrees of freedom, and any initial imprecisions will magnify over time and cause the accuracy to deteriorate (Buizza, 2002). The accuracy of any long-term weather prediction will thus always be bounded by the number of available data points. Similar considerations could also apply to attempts to predict things such as the behavior of human societies.
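A toy illustration of this sensitivity (using the logistic map rather than any real weather model; the parameter r, the starting points, and the size of the initial error are all arbitrary) shows how a tiny initial measurement error swamps the prediction after a few dozen steps:

```python
# Two logistic-map trajectories that start almost identically diverge completely
# within a few dozen steps, mirroring how small measurement errors cap the
# useful prediction horizon of a chaotic system.
def logistic_trajectory(x0, steps, r=3.9):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.400000000, 60)
b = logistic_trajectory(0.400000001, 60)  # initial error of one part in a billion

for t in (0, 20, 40, 60):
    print(f"step {t:2d}: difference = {abs(a[t] - b[t]):.2e}")
# The difference grows roughly exponentially until it saturates at the size of the
# attractor itself, after which the forecast is no better than chance.
```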
With models plagued both by exponentially increasing complexity and by exponentially accumulating noise, the advantage that even a superhuman intelligence might have over humans may be limited.
On the other hand, it is not obvious whether this point of view really is in conflict with the assumption of AI being able to quickly grow to become powerful. There being limits to prediction does not imply that humans would be particularly close to the limits, nor that it would necessarily take a great amount of time to move from sub-human to superhuman capability.
This article approaches these questions by considering what we know about expertise and intelligence. After reviewing the relevant research on human expertise, I will discuss its relevance for AI, and consider how AI could improve on humans in two major aspects of thought and expertise, namely mental simulation and pattern recognition. My current conclusion is that although the limits to prediction are real, it seems that AI could still substantially improve on human intelligence, possibly even mastering domains which are currently too hard for humans. The possibility of AI developing significant real-world capabilities in a relatively brief time seems like one that cannot be ruled out.
Ideally, we might turn to theoretical AI research for a precise theory about acquiring cognitive capabilities. Unfortunately, AI research is not at that point yet. Instead we will consider the research on human expertise and decision-making.
There exists a preliminary understanding, if not of the details of human decision-making, then at least of its general outline. A picture that emerges from this research is that expertise is about developing the correct mental representations (Klein 1999; Ericsson & Pool, 2016).
A mental representation is a very general concept. In the words of expertise researcher Anders Ericsson (Ericsson & Pool, 2016):
‘A mental representation is a mental structure that corresponds to an object, an idea, a collection of information, or anything else, concrete or abstract, that the brain is thinking about. A simple example is a visual image. Mention the Mona Lisa, for instance, and many people will immediately ‘see’ an image of the painting in their minds; that image is their mental representation of the Mona Lisa. Some people’s representations are more detailed and accurate than others, and they can report, for example, details about the background, about where Mona Lisa is sitting, and about her hairstyle and her eyebrows.‘
Domain-specific mental representations are important because they allow experts to know what something means; know what to expect; know what good performance should feel like; know how to achieve the good performance; know the right goals for a given situation; know the steps necessary for achieving those goals; mentally simulate how something might happen; learn more detailed mental representations for improving their skills (Klein, 1999; Ericsson & Pool, 2016).
Although good decision-making is often thought of as a careful deliberation of all the possible options, such a type of thinking tends to be typical of novices (Klein, 1999). A novice will have to try to carefully reason their way through to an answer, and will often do poorly regardless, because they do not know what things are relevant to take into account and which ones are not. An expert doesn’t need to – they are experienced enough to instantly know what to do.
A specific model of expertise is the Recognition-Primed Decision-Making (RPD) model (Klein, 1999). First, a decision-maker sees some situation, such as a fire for a firefighter or a design problem for an architect. The situation may then be recognized as familiar, such as a typical garage fire. Recognizing a familiar situation means understanding what goals make sense and what should be focused on, which cues to pay attention to, what to expect next and when a violation of expectations shows that something is amiss, and knowing what the typical ways of responding are. Ideally, the expert will instantly know what to do.
If the situation is unfamiliar, then the expert may need to construct a mental simulation of what is going on, how things might have developed to this point, and what effect different actions would have. For example, a firefighter thinking about how to rescue someone from a difficult spot might mentally simulate where different rescue harnesses might be attached on the person, and whether that would exert dangerous amounts of force on them.
Mental representations are necessary for a good simulation, as they let the expert know what things to take into account, what things could plausibly be tried, and what effects they would have. In the example, the firefighter’s knowledge allows him to predict that specific ways of attaching the rescue harness would have dangerous consequences, while others are safe.
Mental representations are developed through practice. A novice will try out something and see what happens as a result. This gives them a rough mental representation and a prediction of what might happen if they try the same thing again, leading them to try out the same thing again or do something else instead.
Just practice isn’t enough, however – there also needs to be feedback. Someone may do a practice drill over and over again and think that they are practicing and thus improving – but without some sign of how well that is going, they may just keep repeating the same mistakes over and over (Ericsson & Pool, 2016).
The importance of quality feedback is worth emphasizing. Skills do not develop unless there is feedback that is conducive to developing better mental representations. In fact, there are entire fields in which experienced practitioners are not much better than novices, because the field does not provide them with enough feedback. Shanteau (1992) provides the following breakdown of professions for which there is agreement on the nature of their performance:
| Good performance | Bad performance |
|---|---|
| Weather Forecasters | Clinical Psychologists |
| Livestock Judges | Psychiatrists |
| Astronomers | Astrologers |
| Test Pilots | Student Admissions |
| Soil Judges | Court Judges |
| Chess Masters | Behavioral Researchers |
| Physicists | Counselors |
| Mathematicians | Personnel Selectors |
| Accountants | Parole Officers |
| Grain Inspectors | Polygraph (Lie Detector) Judges |
| Photo Interpreters | Intelligence Analysts |
| Insurance Analysts | Stock Brokers |
In analyzing why some domains enable the development of genuine expertise and others don’t, Shanteau identified a number of considerations that relate to the nature of feedback. In an occupation like weather forecasting, the criteria you use for forecasting are always the same; you will always be facing the same task and can practice it over and over; you get quick feedback on whether your prediction was correct; you can use formal tools to analyze what you predicted would happen and why that prediction did or didn’t happen; and things can be analyzed in objective terms. This allows weather forecasters to develop powerful mental representations that get better and better at making the correct prediction.
Contrast this with someone like an intelligence analyst. The analyst may be called upon to analyze very different clues and situations; each of the tasks may be unique, making it harder to know which lessons from previous tasks apply; for many of the analyses, one might never know whether they were right or not; and questions about socio-cultural matters tend to be much more subjective than questions about weather, making objective analysis impossible. In short, for much of the work that the analyst does, there is simply no feedback available to tell whether the analyst has made the right judgment or not. And without feedback, there is no way to improve one’s mental representations, and thus expertise.
A slightly different perspective on expertise comes from the heuristics & biases literature, which frequently portrays even experts as being easily mistaken. In contrast, the expertise literature that we have reviewed so far has viewed experts as being typically capable and as having trustworthy intuition. Kahneman & Klein (2009) make an attempt to reconcile the two fields, and come to agree that:
This consensus is in line with what we have covered so far, though it also includes the consideration of validity. One cannot learn mental representations that would predict a domain or dictate the right actions for different situations in a domain, if that domain is simply too complicated or chaotic to be predicted. Kahneman & Klein provide the following illustrative example of a domain simply being impossible to predict:
‘When Tetlock [...] embarked on his ambitious study of long-term forecasts of strategic and economic events by experts, the outcome of his research was not obvious. Fifteen years later it was quite clear that the highly educated and experienced experts that he studied were not superior to untrained readers of newspapers in their ability to make accurate long-term forecasts of political events. The depressing consistency of the experts’ failure to outdo the novices in this task suggests that the problem is in the environment: Long-term forecasting must fail because large-scale historical developments are too complex to be forecast. The task is simply impossible. A thought experiment can help. Consider what the history of the 20th century might have been if the three fertilized eggs that became Hitler, Stalin, and Mao had been female. The century would surely have been very different, but can one know how?’
Meanwhile, practice does help in more predictable domains. A recent meta-analysis (Macnamara, Hambrick, & Oswald, 2014) on the effects of practice on skill found that the more predictable an activity was, the more practice contributed to performance in that activity.
Having reviewed some necessary background, we will now finally get back to the topic of superintelligence capabilities.
Similarly to humans, AI systems cannot reach intelligent conclusions by a mere brute force calculation of every possibility. Rather, an intelligence needs to learn to exploit predictable regularities in the world in order to develop further. All machine learning based systems are based on this principle: they can be said to learn a 'mental' representation of the world, analogous to the way humans do.
A strong reason to expect that AI systems will also end up developing roughly human-like mental representations for carrying out different tasks is that the representations of human experts are in a sense an optimal solution to the problems at hand. A human expert will have learned to identify the smallest set of cues that will let them know how to act in a certain situation; their mental representations encode information about how to choose the correct actions using the least amount of thought (Klein 1999).
Machine learning also tries to focus its analysis on exactly the right number of cues that will provide the right predictions, ignoring any irrelevant information. Traditional machine learning approaches have relied extensively on feature engineering, a labor-intensive process where humans determine which cues in the data are worth paying attention to.
A major reason behind the recent success of deep learning models is their capability for feature learning or representation learning: being able to independently discover high-level features in the data which are worth paying attention to, without (as much) external guidance (Bengio, Courville, & Vincent, 2012). Being able to identify and extract the most important features of the data allows the system to make its decisions based on the smallest amount of cues that allows it to reach the right judgment – just as human experts learn to identify the most relevant cues in the situations that they encounter.
Finally, the aspect of increasingly detailed mental representations giving an expert a yardstick to compare their performance against (Ericsson & Pool 2016) has an analogue in reinforcement learning methods. In deep reinforcement learning, a deep learning model learns to estimate how valuable a specific state of the world is, after which the system takes actions to move the world towards that state (Mnih et al., 2015). Similarly, a human expert comes to learn that specific states (e.g. a certain feeling in the body when diving) are valuable, and can then increasingly orient their behavior so as to achieve this state.
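A minimal tabular sketch of this idea, assuming a toy one-dimensional world: the system learns an estimate of how valuable each state is and can then steer toward high-value states. (Mnih et al. 2015 learn such estimates with a deep network over pixel inputs; the lookup table below only illustrates the underlying principle, not their method.)

```python
import random

n_states = 10                     # states 0..9 on a line; reaching state 9 pays reward 1
values = [0.0] * n_states
alpha, gamma = 0.1, 0.9           # learning rate and discount factor (arbitrary choices)

for episode in range(5000):
    state = random.randrange(n_states - 1)
    while state != n_states - 1:
        # explore with a random walk while learning state values
        nxt = random.choice([max(state - 1, 0), min(state + 1, n_states - 1)])
        reward = 1.0 if nxt == n_states - 1 else 0.0
        # temporal-difference update of the state-value estimate
        values[state] += alpha * (reward + gamma * values[nxt] - values[state])
        state = nxt

print([round(v, 2) for v in values])
# The learned values rise toward the rewarding state; an agent that then always
# moves to the higher-valued neighboring state heads straight for the goal.
```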
In summary, both human experts and current state-of-the-art AI systems use mental representations as the building blocks of their expertise. As there have been no serious alternative accounts presented of how expertise might work, I will assume that the capabilities of hypothetical superintelligences will depend on them developing the correct mental representations.
This paper set out to consider two main questions:
Let us now return to these.
The argument for an AI’s predictive capabilities being limited was that there are limits to prediction, and that predicting events ever further forward in time requires exponentially more reasoning power as well as more measurement points, quickly becoming intractable. How capable could an AI become despite these two constraints?
The components of human expertise might be roughly divided into two: building up a battery of accurate mental representations, and being able to use them for mental simulations. Similarly, approaches to artificial intelligence can roughly be divided into pattern recognition and model-building (Lake, Ullman, Tenenbaum, & Gershman, 2016), depending on whether patterns in data or models of the world are treated as the primary unit of thought.
As this kind of a distinction seems to emerge both from psychology and AI research, I will assume that an AI’s expertise will also involve acquiring mental representations (or equivalently, doing pattern recognition) as well as accurately using them in mental simulations. We will consider these two separately.
An interesting look at the potential benefits offered by improved mental simulation ability comes from Philip Tetlock’s Good Judgement Project (GJP), popularized in the book Superforecasting (Tetlock & Gardner, 2015).2 Participating in a contest to forecast the probability of various events, the best GJP participants – the so-called ‘superforecasters’ – managed to make predictions whose accuracy outperformed those of professional intelligence analysts working with access to classified data.3 This is particularly interesting as the superforecasters had no particular domain expertise in answering most of the questions, with sample questions including ones such as
Tetlock & Gardner report the superforecasters’ accuracy in terms of the Brier score, which on this two-outcome formulation ranges from 0 (perfect accuracy) to 2 (maximally wrong), with 0.5 corresponding to always guessing 50/50.4 On this scale, superforecasters had a score of 0.25 at the end of GJP’s first year, compared to 0.37 for the other forecasters participating in the project. By the end of the second year, superforecasters had improved their Brier score to 0.07 (Mellers et al., 2014). Superforecasters could also project further out in time: their accuracy at making predictions 300 days out was better than the other forecasters’ accuracy at making predictions 100 days out. In terms of being on the right side of 50/50, GJP’s best wisdom-of-the-crowd algorithms (deriving an overall prediction from the different forecasters’ predictions) delivered a correct prediction on 86% of all daily forecasts (Tetlock, Mellers, & Rohrbaugh, 2014).
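For reference, the two-outcome Brier score these figures refer to can be written as a one-line function (a sketch of the standard formula, not code from the Good Judgment Project):

```python
def brier(forecast_p, outcome):
    """Two-outcome Brier score: 0 = perfect, 2 = worst possible.

    forecast_p: probability assigned to the event occurring.
    outcome: 1 if the event occurred, 0 otherwise.
    """
    return (forecast_p - outcome) ** 2 + ((1 - forecast_p) - (1 - outcome)) ** 2

print(brier(0.5, 1))   # 0.5  -- always guessing 50/50 scores 0.5 whatever happens
print(brier(0.9, 1))   # 0.02 -- confident and correct
print(brier(0.9, 0))   # 1.62 -- confident and wrong
```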
The superforecasters’ success relied on a number of techniques, but a central one was the ability to consider and judge the relevance of a number of factors that might cause a prediction to become true or false. Tetlock & Gardner illustrate this technique by discussing how a superforecaster, Bill Flack, approached the question of whether an investigation of Yasser Arafat’s remains would reveal traces of polonium, suggestive of Arafat having been poisoned by Israel:
‘Bill unpacked the question by asking himself 'What would it take for the answer to be yes? What would it take for it to be no?' He realized that the first step of his analysis had nothing to do with politics. Polonium decays quickly. For the answer to be yes, scientists would have to be able to detect polonium on the remains of a man dead for years. Could they? A teammate had posted a link to the Swiss team’s report on the testing of Arafat’s possessions, so Bill read it, familiarized himself with the science of polonium testing, and was satisfied that they could do it. Only then did he move on to the next stage of the analysis.
Again, Bill asked himself how Arafat’s remains could have been contaminated with enough polonium to trigger a positive result. Obviously, 'Israel poisoned Arafat' was one way. But because Bill carefully broke the question down, he realized there were others. Arafat had many Palestinian enemies. They could have poisoned him. It was also possible that there had been 'intentional postmortem contamination by some Palestinian faction looking to give the appearance that Israel had done a Litvinenko on Arafat,' Bill told me later. These alternatives mattered because each additional way Arafat’s body could have been contaminated with polonium increased the probability that it was. Bill also noted that only one of the two European teams had to get a positive result for the correct answer to the question to be yes, another factor that nudged the needle in that direction. [...]
… there were several pathways to a 'yes' answer: Israel could have poisoned Arafat; Arafat’s Palestinian enemies could have poisoned him; or Arafat’s remains could have been contaminated after his death to make it look like a poisoning. Hypotheses like these are the ideal framework for investigating the inside view.
Start with the first hypothesis: Israel poisoned Yasser Arafat with polonium. What would it take for that to be true?
- Israel had, or could obtain, polonium.
- Israel wanted Arafat dead badly enough to take a big risk.
- Israel had the ability to poison Arafat with polonium.
Each of these elements could then be researched—looking for evidence pro and con—to get a sense of how likely they are to be true, and therefore how likely the hypothesis is to be true. Then it’s on to the next hypothesis. And the next. ‘
Tetlock does not go into detail about the prerequisites for being able to carry out such analysis – other than noting that it’s slow and effortful – but there are some considerations that seem like plausible prerequisites. First, a person needs to have enough general knowledge to generate different possibilities for how an event could have come true. Next, they need the ability to analyze and investigate those possibilities further, either personally acquiring the relevant domain knowledge for evaluating their plausibility, or finding a relevant subject matter expert. In this example, Bill familiarized himself with the science of polonium testing until he was satisfied that it would be possible to detect polonium traces from a long time ago.
This suggests a general procedure which an AI could also follow in order to predict the possibility of something in which it does not yet have expertise. An AI that was trying to predict the outcome of some specific question could tap into its existing general knowledge in an attempt to identify relevant causal factors; if it failed to generate them, it could look into existing disciplines which seemed relevant for the question. For each identified possibility, it could branch off a new subprocess to do research into that particular direction, sharing information as necessary with a main process whose purpose was to integrate the insights derived from all the relevant searches.
Such a capability for several parallel streams of attention could provide a major advantage. A human researcher or forecaster who branches off to do research on a subquestion will need to make sure that they don’t lose track of the big picture, and needs to have an idea of whether they are making meaningful progress on that subquestion and whether it would be better to devote attention to something else instead. To the extent that there can be several parallel streams of attention, these issues can be alleviated, with a main stream focusing on the overall question and substreams on specific subpossibilities.
How much could this improve on human forecasters? Forecasters performed better when they were placed on teams where they shared information between each other, which similarly allowed an extent of parallelism in prediction-making, in that different forecasters could pursue their own angles and directions in exploring the problem. The differences between individual forecasters and teams of forecasters with comparable levels of training ranged between 0.05 and 0.10 Brier points at the end of the first year, and between 0.02 and 0.08 Brier points at the end of the second year (Mellers et al., 2014). In humans however, it seems likely that the extent of parallelism was constrained by the fact that each forecaster had to independently familiarize themselves with much of the same material, and that their ability to share knowledge between each other was limited by the speed of writing and reading. This suggests a possibility for further improvement.
Example: parallel streams of attention with a LIDA-like architecture
How could different streams of attention within an AI share information between each other? Recall that we have defined the development of expertise as the ability to accumulate mental patterns which are used to identify relevant cues and to indicate what predictions should be derived from them. A computational model of attention and consciousness is Global Workspace Theory (Baars, 2002; 2005), of which a particular AI implementation is the LIDA model (Franklin & Patterson, 2006; Franklin, Madl, D’Mello, & Snaider, 2014; Madl, Franklin, Chen, Montaldi, & Trappl, 2016). LIDA is a model of the mind that is inspired by psychological and neuroscientific research and attempts to capture its main mechanisms. We can use LIDA to get a rough example of what having several ‘streams of attention’ would mean, and how information could be exchanged between them.

LIDA works by means of an understand-attend-act cycle. In each cycle, low-level sensory information is initially interpreted so as to associate it with higher-level concepts to form a ‘percept’, which is then sent to a workspace. In the workspace, the percept activates further associations in other memory systems, which are combined with the percept to create a Current Situational Model, an understanding of what is going on at this moment. The entirety of the Current Situational Model is likely to be too complex for the agent to process, so it needs to select a part of it to elevate to the level of conscious attention to be acted upon. This is carried out using ‘attention codelets’, small pieces of code that attempt to train attention on some particular piece of information, each with their own set of concerns about what is important. Attention codelets with matching concerns form coalitions around what to attend to, competing against other coalitions.

Whichever coalition ends up winning the competition will have its chosen part of the Current Situational Model ‘become conscious’, broadcast to the rest of the system, and particularly to Procedural Memory. The Procedural Memory holds schemes, or templates of different actions that can be taken in different contexts. Schemes which include a context or an action that matches the contents of the conscious broadcast become available as candidates for possible actions. They are copied to the Action Selection mechanism, which chooses a single action to perform. The selected action is further sent to Sensory-Motor Memory, which contains information about how exactly to perform the action. The outcome of taking this action manifests itself as new sensory information, beginning the cognitive cycle anew.

Here is a description of how this process – or something like it – might be applied in the case of an AI seeking to predict the outcome of a specific question, such as the ‘will Saudi Arabia agree to oil production cuts’ question discussed below. The decision to consider this question has been made in an earlier cognitive cycle, and information relevant to it is now available in the inner environment and the Current Situational Model. The concepts of Saudi Arabia and oil production trigger several associations in the AI’s memory systems, such as the fact that oil prices will affect Saudi Arabia’s financial situation, and that oil prices are also influenced by other factors such as global demand. Two coalitions of attention codelets might form, one focusing on the current financial situation and another on influences on oil prices.

In LIDA, these codelets would normally compete, and one of them would win and trigger a specific action, such as a deeper investigation of Saudi Arabia’s financial situation. In our hypothetical AI, however, it might be enough that both coalitions manage to exceed some threshold level of success, indicating them both to be potentially relevant. In that case, new instances of the Procedural Memory, Action Selection, and Sensory-Motor Memory mechanisms might be initialized, with one coalition sending its contents to the first set of instances and the other to the second. These streams could then independently carry out searches of the information that was deemed relevant, also having their own local Situational Models and Workspaces focusing on content relevant for this search. As they worked, these streams would update the various memory subsystems with the results of their learning, making new associations and attention codelets available to all attentional streams. Their functioning could be supervised by a general high-level attention stream, whose task was to evaluate the performance of the various lower-level streams and allocate resources between them accordingly.
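The hypothetical modification described above might be condensed into toy code along the following lines. All class and function names here are invented for illustration and do not correspond to the actual LIDA implementation; the point is only the flow: codelets flag content, coalitions form, and every coalition above a threshold is broadcast and given its own stream.

```python
from dataclasses import dataclass, field

@dataclass
class Coalition:
    topic: str
    activation: float            # how strongly its codelets flag this content
    content: dict = field(default_factory=dict)

def attend(situational_model: dict, codelets) -> list[Coalition]:
    """Let attention codelets form coalitions over the Current Situational Model."""
    return [codelet(situational_model) for codelet in codelets]

def broadcast(coalitions: list[Coalition], threshold: float) -> list[Coalition]:
    """Instead of a single winner (as in standard LIDA), every coalition above
    the threshold is broadcast, so that each can drive its own attention stream."""
    return [c for c in coalitions if c.activation >= threshold]

def research_stream(coalition: Coalition) -> str:
    """Stand-in for a sub-stream doing its own search and local modeling."""
    return f"investigating '{coalition.topic}' with cues {list(coalition.content)}"

# Hypothetical codelets for the Saudi Arabia oil question
situation = {"oil_price": "falling", "saudi_reserves": "large", "us_drilling": "high"}
codelets = [
    lambda s: Coalition("financial situation", 0.8, {"saudi_reserves": s["saudi_reserves"]}),
    lambda s: Coalition("drivers of oil price", 0.7, {"us_drilling": s["us_drilling"]}),
    lambda s: Coalition("weather", 0.1, {}),   # irrelevant; stays below threshold
]

for c in broadcast(attend(situation, codelets), threshold=0.5):
    print(research_stream(c))
```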
In general, accurate forecasting requires an ability to carry out sophisticated causal modeling about a variety of interacting factors. Tetlock & Gardner write:
‘The commentary that superforecasters post on GJP forums is rife with 'on the one hand/ on the other' dialectical banter. And superforecasters have more than two hands. 'On the one hand, Saudi Arabia runs few risks in letting oil prices remain low because it has large financial reserves,' wrote a superforecaster trying to decide if the Saudis would agree to OPEC production cuts in November 2014. 'On the other hand, Saudi Arabia needs higher prices to support higher social spending to buy obedience to the monarchy. Yet on the third hand, the Saudis may believe they can’t control the drivers of the price dive, like the drilling frenzy in North America and falling global demand. So they may see production cuts as futile. Net answer: Feels no-ish, 80%.' (As it turned out, the Saudis did not support production cuts— much to the shock of many experts.) [...] Superforecasters pursue point-counterpoint discussions routinely, and they keep at them long past the point where most people would succumb to migraines.’
This suggests that an AI with sufficient hardware capability could achieve considerable predictive power through its ability to explore many different perspectives and causal factors at once. The mental simulations of humans tend to be limited to around three causal factors and six transition states (Klein, 1999). The discussion of the superforecasters clearly brought up many more possibilities, and their accuracy suggests moderate ability to integrate all those factors together. Yet comments such as ‘feels no-ish’ suggest that they still couldn’t construct a full-blown mental simulation in which the various causal factors would have influenced each other based on principled rules which could be inspected, evaluated, and revised based on feedback and accuracy. This seems especially plausible given that Klein speculates the limits in the size of human mental simulations to come from working memory limitations.
AI systems with larger working memory capacities might be able to construct much more detailed simulations. Contemporary computer models can involve simulations with thousands or tens of thousands of variables, though flexibly incorporating diverse mental representations into a single simulation will probably take considerably more memory and computing power than what is used in today’s models.
These simulations do not necessarily need to incorporate an exponentially increasing number of variables in order to achieve better prediction accuracy. As previously noted, superforecasters were more accurate at making predictions 300 days out than the rest of the forecasters in GJP were at making predictions 100 days out. Given that at least some of the superforecasters only used a few hours a day on making their predictions, and that they had many predictions to rate, they probably did not consider a vastly larger number of factors than the rest of the forecasters.
Klein (1999) offers an example of a professor who used three causal factors (the rate of inflation, the rate of unemployment, and the rate of foreign exchange) and a few transitions to relatively accurately simulate how the Polish economy would develop in response to the decision to convert from socialism to a market economy. In contrast, less sophisticated experts could only name two variables (inflation and unemployment) and not develop any simulations at all, basing their predictions mostly on their ideological leanings.
Having large explicit models also allows for the models to be adjusted in response to feedback. The excerpt below describes how the professor expected unemployment to develop, and how it actually developed:
‘If the government had the courage to drop unproductive industries, many people would lose their jobs. This would start in about six months as the government sorted things out. The unemployment would be small by U.S. standards, rising from less than 1 percent to maybe 10 percent. For Poland, this increase would be shocking. Politically, it might be more than the government could tolerate and might force it to end the experiment with capitalism. When we reviewed his estimates, we found that unemployment had not risen as quickly as he expected, probably, Andrzej believed, because the government was not as ruthless as it said it would be in closing unproductive plants. Even worse, if a plant was productive in areas A, B, and C and was terrible in D and E, then as long as they made a profit, they continued their operations without shutting down areas D and E. So the system faced a built-in resistance to increased unemployment (Klein, 1999).’
In this example, the model failed to predict the government’s caution, which could then be added as an additional variable to consider for the next model. The addition of this variable alone might then considerably increase the accuracy of the simulation.
Tetlock & Gardner report that the superforecasters used highly granular probability estimates – carefully thinking about whether the probability of an event was 3% as opposed to 4%, for instance – and that the granularity actually contributed to accuracy, with the predictions getting less accurate if they were rounded to the closest 5%. Given that such granularity was achieved by integrating various possibilities and considerations, it seems that an ability to consider and integrate an even larger number of possibilities might provide even greater granularity, and thus a prediction edge.
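A back-of-the-envelope way to see why rounding discards real information (a generic property of the Brier score, not Tetlock’s own analysis): for a calibrated forecaster whose true probability is p, reporting a rounded value q adds an expected penalty of 2(q - p)^2 on the two-outcome scale. The numbers below are illustrative only.

```python
# Expected two-outcome Brier score when the true probability is p but the
# forecaster reports q: E = p*[(q-1)^2 + (1-q)^2] + (1-p)*[q^2 + (-q)^2]
#                         = 2p(1-p) + 2(q-p)^2,
# so any rounding error (q - p) adds 2*(q - p)^2 in expectation.
def expected_brier(p, q):
    return p * ((q - 1) ** 2 + (1 - q) ** 2) + (1 - p) * (q ** 2 + (1 - q - 1) ** 2)

p = 0.03                          # the forecaster's true (calibrated) probability
print(expected_brier(p, 0.03))    # reporting 3%:              0.0582
print(expected_brier(p, 0.05))    # rounded to the nearest 5%: 0.0590
```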
In summary, an AI could be able to run vastly larger mental simulations than humans could, with this possibility being subject to computing power limitations; given this, its simulations could also be explicit, allowing it to adjust and correct them in response to feedback to provide improved prediction accuracy; and it could have several streams of attention running concurrently and sharing information between each other. Existing evidence from human experts suggests that large increases to prediction capability might not necessarily need a large increase in the number of variables considered, and that even small increases can provide considerable additional gains.
How fast could an AI develop the ability to run comprehensive and large mental simulations?5 Creating larger mental simulations than humans have access to seems to require extensive computational resources, either from hardware or from optimized software. As an additional consideration, we have previously mentioned limited working memory restricting the capabilities of humans, but human working memory is not the same thing as RAM in computer systems. If one were running a simulation of the human brain in a computer, one could not increase the brain’s available working memory simply by increasing the amount of RAM the simulation had access to. Rather, it has been hypothesized that working memory differences between individuals may reflect things such as the ability to discriminate between relevant and irrelevant information (Unsworth & Engle, 2007), which could be related to things like brain network structure and thus be more of a software than a hardware issue.6 Yudkowsky (2013) notes that if increased intelligence were a simple matter of scaling up the brain, the road from chimpanzees to humans would likely have been much shorter, as simple factors such as brain size can respond rapidly to evolutionary selection pressure.
Thus, advances in mental simulation size depend on (i) hardware progress and (ii) advances in software engineering. Hardware progress is hard to predict, but advances in software engineering capabilities might be achievable using mostly theoretical and mathematical research. This would require the development of expertise in mathematics, programming, and theoretical computer science.
Much of mathematical problem-solving is about having a library of procedures, reformulations, and heuristics that one can try (Polya, 1990), as well as developing a familiarity and understanding of many kinds of mathematical results, which one may then later on recognize as relevant. This seems like the kind of task that relies strongly on pattern-matching abilities, and might in principle be in reach by an advanced deep reinforcement learning system that was fed a sufficiently large library of heuristics and worked proofs to let it develop superhuman mathematical intuition.7 Modern-day theorem provers often know what kinds of steps are valid, but not which steps are worth taking; merging them with the 'artificial intuition' of deep reinforcement learning systems might eventually produce systems with superhuman mathematical ability.
Progress in this field could allow AI systems to achieve superhuman abilities in math research, considerably increasing their ability to develop more optimized software to take full advantage of the available hardware. To the extent that relatively small increases in the number of variables considered in a high-level simulation would allow for dramatically increased prediction ability (as is suggested by e.g. the superforecasters being better predictors with thrice the prediction horizon of less accurate forecasters), moderate increases in the size of the AI’s simulations could translate to drastic increases in terms of real-world capability.
Yudkowsky (2013) notes that although the evolutionary record strongly suggests that algorithmic improvements were needed for taking us from chimpanzees to humans, the record rules out exponentially increasing hardware always being needed for linear cognitive gains: the size of the human brain is only four times that of the chimpanzee brain. This further suggests that relatively limited improvements could allow for drastic increases in intelligence.
The capability to run large simulations isn’t enough by itself. The AI also needs to acquire a sufficiently large number of patterns to be included in the simulations, to predict how different pieces in the simulation behave.
When it comes to well-defined tasks, current AI systems excel at pattern recognition, being able to analyze vast amounts of data and build them into an overall model, finding regularities that human experts never would have. For instance, human experts would likely have been unable to anticipate that men who 'like' the Facebook page 'Being Confused After Waking Up From Naps' are more likely to be heterosexual (Kosinski, Stillwell, & Graepel, 2013). Similarly, the Go-playing AI AlphaGo, whose good performance against the expert player Lee Sedol could to a large extent be attributed to its built-up understanding of the kinds of board patterns that predict victory, managed to make moves that Go professionals watching the game considered creative and novel.
The ability to find subtle patterns in data suggests that AI systems might be able to make predictions in domains which humans currently consider impossible to predict. We previously discussed the issue of the (predictive) validity of a domain, with domains being said to have higher validity if 'there are stable relationships between objectively identifiable cues and subsequent events or between cues and the outcomes of possible actions' (Kahneman & Klein, 2009). A field could also be valid despite being substantially uncertain, with warfare and poker being listed as examples of fields that were valid (letting a skilled actor improve their average performance) despite also being highly uncertain (with good performance not being guaranteed even for a skilled actor).
We already know that the validity of a field also depends on an actor’s cognitive and technological abilities. For example, weather forecasting used to be a field in which almost no objectively identifiable cues were available, relying mostly on guesswork and intuition, but the development of modern meteorological theory made it a much more valid field (Shanteau, 1992). Thus, even fields which have low validity to humans with modern-day capabilities could become more valid for more advanced actors.
A possible example of a domain that is currently relatively low-validity, but which could become substantially more valid, is that of predicting the behavior of individual humans. Machine learning tools can already generate, from people’s Facebook 'likes', personality profiles that are slightly more accurate than the profiles made by people’s human friends (Youyou et al. 2015), and can be used to predict private traits such as sexual orientation (Kosinski et al. 2013). This has been achieved using a relatively limited amount of data and not much intelligence; a more sophisticated modeling process could probably make even better predictions from the same data.
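As a rough illustration of the kind of pipeline involved, the sketch below fits a logistic-regression classifier to a synthetic binary 'likes' matrix. The data, the signal strength, and the simple model are all made up for illustration (the actual Kosinski et al. pipeline also included a dimensionality-reduction step), but the basic recipe of binary behavioral features in, probability of a private trait out, is the same.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a user x page "likes" matrix (1 = liked that page).
n_users, n_pages = 2000, 300
likes = rng.integers(0, 2, size=(n_users, n_pages))

# Synthetic binary trait that depends weakly on a handful of pages, mimicking
# the kind of subtle statistical signal described in the text (made-up data).
signal = likes[:, :10].sum(axis=1)
trait = (signal + rng.normal(0, 1.5, n_users) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    likes, trait, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```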
Taleb (2007) has argued for history being strongly driven by 'black swan' events, events with such a low probability that they are unanticipated and unprepared for, but which have an enormous impact on the world. To the extent that this is accurate, it suggests limits on the validity of prediction. However, Tetlock & Gardner (2015) argue that while the black swans themselves may be unanticipated, once the event has happened its consequences may be much easier to predict. Commenting on the notion of the 9/11 terrorist attacks as a black swan event, they write:
‘We may have no evidence that superforecasters can foresee events like those of September 11, 2001, but we do have a warehouse of evidence that they can forecast questions like: Will the United States threaten military action if the Taliban don’t hand over Osama bin Laden? Will the Taliban comply? Will bin Laden flee Afghanistan prior to the invasion? To the extent that such forecasts can anticipate the consequences of events like 9/11, and these consequences make a black swan what it is, we can forecast black swans.’
Thus, even though an AI might be unable to predict some very rare events, once those events have happened, it could utilize its built-up knowledge of how people typically react to different events in order to predict the consequences better than anyone else.
How quickly could an AI acquire more knowledge and mental representations? Here again opinions differ. Hibbard (2016) argues, based on Mahoney’s (2008) argument for intelligence being a function of both resources and knowledge, that explosive growth is unlikely. Benthall (2017) makes a similar argument. On the other hand, authors such as Bostrom (2014) and Yudkowsky (2008) suggest that fast increases are possible.
We know that among humans, there are considerable differences in the extent to which people learn. Human cognitive differences have a strong neural and genetic basis (Deary, Penke, & Johnson, 2010), and strongly predict academic performance (Deary et al., 2007), socio-economic outcomes (Strenze, 2007), and job performance and the effectiveness of on-the-job learning and experience (Gottfredson, 1997). There also exist child prodigies who before adolescence achieve a level of performance comparable to an adult professional, without having been able to spend comparable amounts of time training (Ruthsatz, Ruthsatz, & Stephens, 2013). In general, some people are able to learn faster from the same experiences, notice relevant patterns faster, and continue learning from experience even past the point where others cease to achieve additional gains.8
While there is so far no clear consensus on why some people learn faster than others, there are some clear clues. Individual differences in cognitive abilities may be a result of differences in a combination of factors, such as working memory capacity, attention control, and long-term memory (Unsworth et al., 2014). Ruthsatz et al. (2013), in turn, note that “child prodigies’ skills are highly dependent on a few features of their cognitive profiles, including elevated general IQs, exceptional working memories, and elevated attention to detail”.
Many tasks require paying attention to many things at once, with a risk of overloading the learner’s working memory before some of the performance has been automated. For example, McPherson & Renwick (2001) consider children who are learning to play instruments, and note that children who had previously learned to play another instrument were faster learners. They suggest that this is in part because the act of reading musical notation had become automated for these children, saving them from the need to process notation in working memory and allowing them to focus entirely on learning the actual instrument.
This general phenomenon has been recognized in education research. Complex activities that require multiple subskills can be hard to master even if the students have moderate competence in each individual subskill, as using several of them at the same time can produce an overwhelming cognitive load (Ambrose et al. 2010, chap. 4). Recommended strategies for dealing with this include reducing the scope of the problem at first and then building up to increasingly complex scopes. For instance, 'a piano teacher might ask students to practice only the right hand part of a piece, and then only the left hand part, before combining them' (ibid).
An increased working memory capacity, which is empirically associated with faster learning, could theoretically assist learning by allowing more things to be comprehended simultaneously without overwhelming the learner. Thus, an AI with a large working memory could learn and master much more complicated wholes at once than humans can.
Additionally, we have seen that a key part of efficient learning is the ability to monitor one’s own performance and to notice errors which need correcting; this seems in line with cognitive abilities correlating with attentional control and elevated attention to detail. McPherson & Renwick (2001) also remark on the ability of some students to play through a piece with considerably fewer errors on their second run-through than the first one, suggesting that this indicates 'an outstanding ability to retain a mental representation of [...] performance between run-throughs, and to use this as a basis for learning from [...] errors'. In contrast, children who learned more slowly seemed to either not notice their mistakes, or alternatively to not remember them when they played the piece again.
Whatever the AI analogues of working and long-term memory, attentional control, and attention to detail are, it seems at least plausible that these could be improved upon by drawing exclusively on relatively theoretical research and in-house experiments. This might enable an AI to both absorb vast datasets, as current-day deep learning systems do, and also learn from superhumanly small amounts of data.
How much can the human learning speed be improved upon? This remains an open question. There are likely to be sharply diminishing returns at some point, but we do not know whether they are near the human level. Human intelligence seems constrained by a number of biological and physical factors that are unrelated to gains from intelligence. Plausible constraints include the size of the birth canal limiting the volume of human brains, the brain’s extensive energy requirements limiting the overall number of cells, limits to the speed of signaling in neurons, an increasing proportion of the brain’s volume being spent on wiring and connections (rather than actual computation) as the number of neurons grows, and inherent unreliabilities in the operation of ion channels (Fox, 2011). There doesn’t seem to be any obvious reason why the threshold for diminishing gains from intelligence to learning speed would just happen to coincide with the level of intelligence allowed by our current biology. Alternatively, there could have been diminishing returns all along, but ones which still made it worthwhile for evolution to keep investing in additional intelligence.
The available evidence also seems to suggest that within the human range at least, increased intelligence continues to contribute to additional gains. The Study of Mathematically Precocious Youth (SMPY) is a 50-year longitudinal study involving over 5,000 exceptionally talented individuals identified between 1972 and 1997. Despite its name, many of its participants are more verbally than mathematically talented. The study has led to several publications; among others, Wai et al. (2005) and Lubinski & Benbow (2006) examine the question of whether ability differences within the top 1% of the human population make a difference in life.
Comparing the top (Q4) and bottom (Q1) quartiles of two cohorts within this study shows that both differ significantly from the general population, as well as from each other. Out of the general population, about 1% will obtain a doctoral degree, whereas 20% of Q1 and 32% of Q4 did. 0.4% of Q1 achieved tenure at a top-50 US university, as did 3% of Q4. Looking at a 1-in-10,000 cohort, 19% had earned patents, as compared to 7.5% of the Q4 group, 3.8% of the Q1 group, or 1% of the general population.
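To make the size of these gaps easier to see, the short snippet below simply restates the quoted percentages as multiples of the general-population base rates; the numbers are copied from the paragraph above, and nothing beyond them is assumed (the tenure figures are omitted because no general-population base rate is quoted).

```python
# The percentages below are taken directly from the figures quoted above;
# no additional data are assumed.
base_rates = {"doctorate": 1.0, "patents": 1.0}  # general population (%)
groups = {
    "Q1 (bottom quartile of the top 1%)": {"doctorate": 20.0, "patents": 3.8},
    "Q4 (top quartile of the top 1%)":    {"doctorate": 32.0, "patents": 7.5},
    "1-in-10,000 cohort":                 {"patents": 19.0},
}

for name, rates in groups.items():
    for outcome, pct in rates.items():
        multiple = pct / base_rates[outcome]
        print(f"{name}: {outcome} rate is {multiple:.0f}x the general-population rate")
```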
It is important to emphasize that the evidence we’ve reviewed so far does not merely mean that an AI could potentially learn faster in terms of time: it also suggests that the AI could potentially learn faster in terms of training data. The smaller the datasets an AI needs in order to develop accurate mental representations, the faster it can adapt to new situations.
However, learning faster in terms of time is also important. Various versions of AlphaGo were trained for maybe a year in total, whereas Lee Sedol had been playing professionally since 1995, with professional qualification requiring a considerable amount of intense training by itself. A twenty-fold advantage in learning speed could already provide a major advantage, particularly when dealing with novel situations that humans have little previous experience of.
Besides the considerations we have already discussed, there seems to be potential for accelerated learning through more detailed analysis of experiences. For example, chess players improve most effectively by studying the games of grandmasters, and trying to predict what moves the grandmasters would have made in any situation. When the grandmaster play deviates from the move that the student would have made, the student goes back to try to see what they missed (Ericsson & Pool, 2016). This kind of detailed study is effortful, however, and can only be sustained for limited periods at a time. With enough computational resources, the AI could routinely run this kind of analysis on all sense data it received, constantly attempting to build increasingly detailed models and mental representations that would correctly predict the data.
Some commentators, such as Hibbard (2016), argue that knowledge requires interaction with the world, so that an AI would be forced to learn over an extended period of time simply because such interaction takes time.
From our previous review, we know that feedback is needed for the development of expertise. However, one may also get feedback from studying static materials. As we noted before, chess players spend more time studying published matches and trying to predict the grandmaster moves – and then getting feedback when they look up the next move and have their prediction confirmed or falsified – than they do actually playing matches against live opponents (Ericsson & Pool, 2016). The Go-playing AlphaGo system did not achieve its skill by spending large amounts of time playing human opponents, but rather by studying the games of humans and playing games against itself (Silver et al. 2016). And while any individual human can only study a single game at a time, AI systems could study a vast number of games in parallel and learn from all of them.9
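A minimal sketch of this 'predict the expert, then check' loop is given below. The recorded 'games' are made-up symbolic sequences and the learner is just a frequency table over previous moves, both invented for illustration; the structural point is that static records alone provide a stream of prediction-and-feedback episodes, and that many such records could in principle be processed in parallel.

```python
from collections import defaultdict, Counter

# Made-up recorded "games": each is a sequence of expert moves. A real archive
# would contain millions of games, which an AI could process in parallel.
recorded_games = [
    ["e4", "e5", "Nf3", "Nc6", "Bb5"],
    ["e4", "c5", "Nf3", "d6", "d4"],
    ["d4", "d5", "c4", "e6", "Nc3"],
] * 100

# The "learner": for each context (here just the previous move), a count of
# what the expert played next. A real system would use a far richer model.
model = defaultdict(Counter)

def predict(context):
    guesses = model[context]
    return guesses.most_common(1)[0][0] if guesses else None

correct = total = 0
for game in recorded_games:
    context = "<start>"
    for expert_move in game:
        guess = predict(context)           # the learner's prediction
        if guess is not None:
            total += 1
            correct += (guess == expert_move)
        model[context][expert_move] += 1   # feedback from the static record
        context = expert_move

print(f"prediction accuracy over the archive: {correct / total:.2f}")
```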
An important difference is that domains such as chess and Go are formally specified domains, which an AI can perfectly simulate. For a domain such as social interaction, the AI’s ability to accurately simulate the behavior of humans is limited by its current competence in the domain. While it can run a simulation based on its existing model of human behavior, predicting how humans would behave based on that model, it needs external data in order to find out how accurate its prediction was.
This is not necessarily a problem however, given the vast (and ever-increasing) amount of recorded social interaction happening online. YouTube, e-mail lists, forums, blogs, and social media services all provide rich records of various kinds of social interaction, for an AI to test its predictive models against without needing to engage in interaction of its own. Scientific papers – increasingly available on an open access basis – on topics such as psychology and sociology offer additional information for the AI to supplement its understanding with, as do various guides to social skills. All of this information could be acquired simply by downloading it, with the main constraints being the time needed to find, download, and process the data, rather than time needed for social interactions.
As noted earlier, relatively crude statistical methods can already extract relatively accurate psychological profiles out of data such as people’s Facebook 'likes' (Kosinski et al., 2013, Youyou et al., 2015), giving reason to suspect that a general AI could develop very accurate predictive abilities given the kind of process described above.
Several other domains, such as software security and mathematics, seem similarly amenable to being mastered largely without needing to interact with the world outside the AI, other than searching for relevant materials. Some domains such as physics would probably need novel experiments, but an AI focusing on the domains that were the easiest and fastest for it to master might find sufficient sources of capability from those alone.
Given the above considerations, it does not seem to me like an AI’s speed of learning would necessarily be strongly interaction-constrained.
We set out to consider the fundamental practical limits of intelligence, and the limits to how quickly an AI system could acquire very high levels of capability.
Fictional representations of high intelligence often depict geniuses as masterminds with an almost godlike prediction ability, laying out intricate multi-step plans where every contingency is planned for in advance (TVTropes 2017a). When discussing “superintelligent” AI systems, one might easily think that the discussion was postulating something along the lines of those fictional examples, and rightly reject it as unrealistic.
Given what we know about the limits of prediction, for AI to make a single plan which takes into account every possibility is surely impossible. However, having reviewed the science of human expertise, we have found that experts who are good at their domains tend to develop powerful mental representations which let them react to various situations as they arise, and to simulate different plans and outcomes in their heads.
Looking from humans to AIs, we have found that AI might be able to run much more sophisticated mental simulations than humans could. Given human intelligence differences and empirical and theoretical considerations about working memory being a major constraint for intelligence, the empirical finding that increased intelligence continues to benefit people throughout the whole human range, and the observation that it would be unlikely for the theoretical limits of intelligence to coincide with the biological and physical constraints that human intelligence currently faces, it seems like AIs could come to learn considerably faster from data than humans do. It also seems like in many domains, this could be achieved by using existing materials as a source of feedback for predictions, without necessarily being constrained by time taken for interacting with the external world.
Thus, it seems that even though an AI system couldn’t make a single superplan for world conquest right from the beginning, it could still have a superhuman ability to adapt to and learn from changing and novel situations, and react to them faster than its human adversaries could. As an analogy, experts playing most games can't precompute a winning strategy right from the first move either, but they can still react and adapt to the game's evolving situation better than a novice can, enabling them to win.10
Many of the hypothetical advantages that an AI might have – such as a larger working memory, the ability to consider more possibilities at once, and the ability to practice on many training instances in parallel – seem to depend on available computing power. Thus the amount of hardware the AI had at its disposal could limit its capabilities, but there exists the possibility of developing better-optimized algorithms by initially specializing in fields such as programming and theoretical computer science, which the AI might become very good at.
One consideration which we have not yet properly addressed is the technology landscape at the time when the AI arrives (Tomasik 2014/2016, sec. 7). If a general AI can be developed, then various forms of sophisticated narrow AI will also be in existence. Some of them could be used to detect and react to a general AI, and tools such as sophisticated personal profiling for purposes of social manipulation will likely already be in existence. Considering how these influence the considerations discussed here is an important question, but one which is outside the scope of this article.
In summary, in practice the limits of prediction do not seem to pose much of a meaningful upper bound on an AI’s capabilities. Even if an AI could not create a complete master plan from scratch, it could still outperform humans in crucial domains, developing and deploying expertise superior to what humans are capable of. Aside from trivial limits derived from physical constraints, such as 'the AI couldn’t become superhumanly capable literally instantly', we also haven’t seen a way to establish a practical lower bound on how much time it would take for an AI to achieve superhuman capability. Takeover scenarios with timescales on the order of mere days or weeks seem to remain within the range of plausibility.
Thank you to David Althaus, Stuart Armstrong, Bill Hibbard, David Krueger, Josh Marlow, Carl Shulman, and Brian Tomasik for helpful comments on this paper.
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016, June 21). Concrete Problems in AI Safety. Cornell University Library. Retrieved from http://arxiv.org/abs/1606.06565
Anderson, M. (2010). Problem Solved: Unfriendly AI. Retrieved September 27, 2016, from http://hplusmagazine.com/2010/12/15/problem-solved-unfriendly-ai/
Baars, B. J. (2002). The conscious access hypothesis: origins and recent evidence. Trends in Cognitive Sciences, 6(1), 47–52. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11849615
Baars, B. J. (2005). Global workspace theory of consciousness: toward a cognitive neuroscience of human experience. Progress in Brain Research, 150, 45–53. https://doi.org/10.1016/S0079-6123(05)50004-9
Bengio, Y., Courville, A., & Vincent, P. (2012). Representation Learning: A Review and New Perspectives. arXiv [cs.LG]. Retrieved from http://arxiv.org/abs/1206.5538
Benthall, S. (2017). Don’t Fear the Reaper: Refuting Bostrom's Superintelligence Argument. arXiv [cs.AI]. Retrieved from http://arxiv.org/abs/1702.08495
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Buizza, R. (2002). Chaos and weather prediction. ECMWF. Retrieved from https://www.researchgate.net/publication/228552816_Chaos_and_weather_prediction_January_2000
Deary, I. J., Penke, L., & Johnson, W. (2010). The neuroscience of human intelligence differences. Nature Reviews. Neuroscience, 11(3), 201–211. https://doi.org/10.1038/nrn2793
Deary, I. J., Strand, S., Smith, P., & Fernandes, C. (2007). Intelligence and educational achievement. Intelligence, 35, 13. https://doi.org/10.1016/j.intell.2006.02.001
Ericsson, A., & Pool, R. (2016). Peak: Secrets from the New Science of Expertise. Houghton Mifflin Harcourt.
Fox, D. (2011). The Limits of Intelligence. Scientific American. Retrieved from http://www.cs.virginia.edu/~robins/The_Limits_of_Intelligence.pdf
Franklin, S., Madl, T., D’Mello, S., & Snaider, J. (2014). LIDA: A Systems-level Architecture for Cognition, Emotion, and Learning. IEEE Transactions on Autonomous Mental Development, 6(1). https://doi.org/10.1109/TAMD.2013.2277589
Franklin, S., & Patterson, F. G., Jr. (2006). The LIDA architecture: Adding new modes of learning to an intelligent, autonomous software agent. Presented at the Integrated Design and Process Technology, IDPT-2006. Retrieved from http://ccrg.cs.memphis.edu/assets/papers/zo-1010-lida-060403.pdf
Future of Life Institute. (2015). AI Open Letter - Research Priorities for Robust and Beneficial Artificial Intelligence. Retrieved from https://futureoflife.org/ai-open-letter/
Gottfredson, L. S. (1997). Why g matters: The complexity of everyday life. Intelligence, 24(1), 79–132. https://doi.org/10.1016/S0160-2896(97)90014-3
Hibbard, B. (2016). A Defense of Humans for Transparency in Artificial Intelligence. Retrieved September 28, 2016, from http://www.ssec.wisc.edu/~billh/g/transparency_defense.html
Ignatius, D. (2013). David Ignatius: More chatter than needed. The Washington Post. Retrieved from https://www.washingtonpost.com/opinions/david-ignatius-more-chatter-than-needed/2013/11/01/1194a984-425a-11e3-a624-41d661b0bb78_story.html
Kahneman, D., & Klein, G. (2009). Conditions for Intuitive Expertise. A Failure to Disagree. The American Psychologist, 64(6), 515–526. https://doi.org/10.1037/a0016755
Klein, G. (1999). Sources of Power: How People Make Decisions. MIT Press.
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences of the United States of America, 110(15), 5802–5805. https://doi.org/10.1073/pnas.1218772110
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016). Building Machines That Learn and Think Like People. arXiv [cs.AI]. Retrieved from http://arxiv.org/abs/1604.00289
Lawrence, N. (2016). Future of AI 6. Discussion of 'Superintelligence: Paths, Dangers, Strategies.' Retrieved September 27, 2016, from http://inverseprobability.com/2016/05/09/machine-learning-futures-6
Lubinski, D., & Benbow, C. P. (2006). Study of Mathematically Precocious Youth After 35 Years: Uncovering Antecedents for the Development of Math-Science Expertise. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 1(4), 316–345. https://doi.org/10.1111/j.1745-6916.2006.00019.x
Macnamara, B. N., Hambrick, D. Z., & Oswald, F. L. (2014). Deliberate practice and performance in music, games, sports, education, and professions: a meta-analysis. Psychological Science, 25(8), 1608–1618. https://doi.org/10.1177/0956797614535810
Martela, F. (2016). Törmääkö tekoäly älykkyyden ylärajaan? Tivi. Retrieved from http://www.tivi.fi/blogit/tormaako-tekoaly-alykkyyden-ylarajaan-6584349
Madl, T., Franklin, S., Chen, K., Montaldi, D., & Trappl, R. (2016). Towards real-world capable spatial memory in the LIDA cognitive architecture. Biologically Inspired Cognitive Architectures, 16, 87–104. https://doi.org/10.1016/j.bica.2016.02.001
Mahoney, M. (2008). A Model for Recursively Self Improving Programs. Retrieved from http://mattmahoney.net/rsi.pdf
McPherson, G. E., & Renwick, J. M. (2001). A Longitudinal Study of Self-regulation in Children’s Musical Practice. Music Education Research, 3(2), 169–186. https://doi.org/10.1080/14613800120089232
Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., … Tetlock, P. E. (2014). Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science, 25(5), 1106–1115. https://doi.org/10.1177/0956797614524255
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
Müller, V. C., & Bostrom, N. (2016). Future Progress in Artificial Intelligence: A Survey of Expert Opinion. In V. C. Müller (Ed.), Fundamental Issues of Artificial Intelligence (pp. 555–572). Springer International Publishing. Retrieved from http://www.nickbostrom.com/papers/survey.pdf
Polya, G. (1990). How to Solve It: A New Aspect of Mathematical Method (New edition). Penguin Books, Limited (UK).
Rushton, J. P., & Ankney, C. D. (2009). Whole brain size and general mental ability: a review. The International Journal of Neuroscience, 119(5), 691–731. https://doi.org/10.1080/00207450802325843
Russell, S., Dewey, D., & Tegmark, M. (2015). Research Priorities for Robust and Beneficial Artificial Intelligence. AI Magazine, 36(4), 105–114. Retrieved from https://futureoflife.org/data/documents/research_priorities.pdf?x90991
Ruthsatz, J., Ruthsatz, K., & Stephens, K. R. (2013). Putting practice into perspective: Child prodigies as evidence of innate talent. Intelligence, 45, 60–65. https://doi.org/10.1016/j.intell.2013.08.003
Shanteau, J. (1992). Competence in experts: The role of task characteristics. Organizational Behavior and Human Decision Processes, (53), 252–266. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.625.1063&rep=rep1&type=pdf
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
Soares, N., & Fallenstein, B. (2014). Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda. Machine Intelligence Research Institute. Retrieved from https://intelligence.org/files/TechnicalAgenda.pdf
Sotala, K., & Yampolskiy, R. V. (2015). Responses to catastrophic AGI risk: a survey. Physica Scripta, 90(1), 018001. https://doi.org/10.1088/0031-8949/90/1/018001
Strenze, T. (2007). Intelligence and socioeconomic success: A meta-analytic review of longitudinal research. Intelligence, 35(5), 401–426. https://doi.org/10.1016/j.intell.2006.09.004
Ambrose, S. A., Bridges, M. W., DiPietro, M., Lovett, M. C., Norman, M. K., & Mayer, R. E. (2010). How Learning Works: Seven Research-Based Principles for Smart Teaching. Jossey-Bass.
Taleb, N. N. (2007). Black Swans and the Domains of Statistics. The American Statistician, 61(3), 198–200. https://doi.org/10.1198/000313007X219996
Taylor, J., Yudkowsky, E., LaVictoire, P., & Critch, A. (2016). Alignment for advanced machine learning systems. Machine Intelligence Research Institute. Retrieved from https://intelligence.org/files/AlignmentMachineLearning.pdf
Tetlock, P. E., Mellers, B. A., & Rohrbaugh, N. (2014). Forecasting Tournaments: Tools for Increasing Transparency and Improving the Quality of Debate. Current Directions in Psychological Science, 23(4), 290–295. Retrieved from http://cdp.sagepub.com/content/23/4/290.short
Tetlock, P., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
Tomasik, B. (2014). Artificial Intelligence and Its Implications for Future Suffering. Retrieved September 28, 2016, from https://longtermrisk.org/artificial-intelligence-and-its-implications-for-future-suffering#reply-to-bostroms-arguments-for-a-hard-takeoff
Turing, A. M. (1950). Computing Machinery and Intelligence. Mind; a Quarterly Review of Psychology and Philosophy, 59(236), 433–460. Retrieved from http://www.loebner.net/Prizef/TuringArticle.html
Unsworth, N., & Engle, R. W. (2007). The nature of individual differences in working memory capacity: active maintenance in primary memory and controlled search from secondary memory. Psychological Review, 114(1), 104–132. https://doi.org/10.1037/0033-295X.114.1.104
Unsworth, N., Fukuda, K., Awh, E., & Vogel, E. K. (2014). Working memory and fluid intelligence: capacity, attention control, and secondary memory retrieval. Cognitive Psychology, 71, 1–26. https://doi.org/10.1016/j.cogpsych.2014.01.003
Wai, J., Lubinski, D., & Benbow, C.P. (2005) Creativity and Occupational Accomplishments Among Intellectually Precocious Youths: An Age 13 to Age 33 Longitudinal Study. Journal of Educational Psychology, 97(3), 484-492.
Whalen, D. (2016). Holophrasm: a neural Automated Theorem Prover for higher-order logic. arXiv [cs.AI]. Retrieved from http://arxiv.org/abs/1608.02644
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences of the United States of America, 112(4), 1036–1040. https://doi.org/10.1073/pnas.1418680112
Yudkowsky, E. (2008). Artificial Intelligence as a Positive and Negative Factor in Global Risk. In N. Bostrom & M. M. Ćirković (Eds.), Global Catastrophic Risks (pp. 308–345). Oxford University Press. Retrieved from https://intelligence.org/files/AIPosNegFactor.pdf
Yudkowsky, E. (2013). Intelligence Explosion Microeconomics (No. 2013-1). Machine Intelligence Research Institute. Retrieved from https://intelligence.org/files/IEM.pdf
Xanatos Gambit. (2017). TV Tropes. Retrieved March 20, 2017, from http://tvtropes.org/pmwiki/pmwiki.php/Main/XanatosGambit
Xanatos Speed Chess. (2017). TV Tropes. Retrieved March 20, 2017, from http://tvtropes.org/pmwiki/pmwiki.php/Main/XanatosSpeedChess
‘[Deliberate practice] requires a field that is already reasonably well developed— that is, a field in which the best performers have attained a level of performance that clearly sets them apart from people who are just entering the field. We’re referring to activities like musical performance (obviously), ballet and other sorts of dance, chess, and many individual and team sports, particularly the sports in which athletes are scored for their individual performance, such as gymnastics, figure skating, or diving. What areas don’t qualify? Pretty much anything in which there is little or no direct competition, such as gardening and other hobbies, for instance, and many of the jobs in today’s workplace— business manager, teacher, electrician, engineer, consultant, and so on. These are not areas where you’re likely to find accumulated knowledge about deliberate practice, simply because there are no objective criteria for superior performance.’
(Ericsson & Pool, 2016)
Fields that have well-defined, objective criteria for good performance are ones which are the easiest to master using even current-day AI methods – in fact, they’re basically the only ones that can be truly mastered using current-day AI methods.
A somewhat cheeky way to summarize these results would be by saying that, in the kinds of fields that could be mastered without general intelligence, general intelligence isn’t the most important thing. This even seems to be Ericsson’s own theoretical stance: that in these fields, general intelligence eventually ceases to matter because the expert will have developed specialized mental representations that they can just rely on in every situation. So these results are not very interesting to us, who are interested in domains that do require general intelligence.
The post Reducing Risks of Astronomical Suffering: A Neglected Priority appeared first on Center on Long-Term Risk.
Will we go extinct, or will we succeed in building a flourishing utopia? Discussions about the future trajectory of humanity often center around these two possibilities, a framing which tends to ignore that survival does not always imply utopian outcomes, and that outcomes where humans go extinct could differ tremendously in how much suffering they contain. One major risk is that space colonization, whether carried out by humans or by (misaligned) artificial intelligence, may end up producing cosmically significant amounts of suffering. For a variety of reasons, such scenarios are rarely discussed and often underestimated. The neglectedness of risks of cosmically significant amounts of suffering (“suffering risks” or “s-risks”) makes their reduction a plausible priority from the perspective of many value systems. Rather than focusing exclusively on ensuring that there will be a future, we recommend interventions that improve the future’s overall quality.
Among actors and organizations concerned with shaping the “long-term future,” the discourse has so far been centered around the concept of existential risks. “Existential risk” (or “x-risk”) was defined by Bostrom (2002) as “[...] an adverse outcome [which] would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.”
This definition is unfortunate in that it lumps together events that lead to vast amounts of suffering and events that lead to the extinction (or failure to reach the potential) of humanity. However, many value systems would agree that extinction is not the worst possible outcome, and that avoiding large quantities of suffering is of utmost moral importance.
We should differentiate between existential risks (i.e., risks of “mere” extinction or failed potential) and suffering risks.1 Suffering risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering. Note that it may turn out that the amount of suffering that we can influence is dwarfed by suffering that we can’t influence. By “expected future suffering” we mean “expected action-relevant suffering in the future”.
The above distinctions are all the more important because the term “existential risk” has often been used interchangeably with “risks of extinction”, omitting any reference to the future’s quality.2 Finally, some futures may contain both vast amounts of happiness and vast amounts of suffering, which constitutes an s-risk but not necessarily a (severe) x-risk. For instance, an event that would create 10^25 unhappy beings in a future that already contains 10^35 happy individuals constitutes an s-risk, but not an x-risk.3
The Case for Suffering-Focused Ethics outlined several reasons for considering suffering reduction one’s primary moral priority. From this perspective in particular, s-risks should be addressed before addressing extinction risks. Reducing extinction risks makes it more likely that there will be a future, possibly one involving space colonization and the astronomical stakes that come with it. But it often does not robustly improve the quality of the future, i.e., how much suffering or happiness it will likely contain.4 A future with space colonization may contain vastly more sentient minds than have existed so far. If something goes wrong, or even if things do not go “right enough”, this would multiply the total amount of suffering (in our part of the universe) by a huge factor. Depending on our normative views on the importance of creating a vast number of happy beings, and the extent to which we value moral reflection, this suffering may or may not be counterbalanced by the expected upsides.
Reducing extinction risks essentially comes down to buying lottery tickets over the distribution of possible futures: A tiny portion of the very best futures will be worthy of the term “utopia” (almost) regardless of one’s moral outlook; the better futures will contain vast amounts of happiness but possibly also some serious suffering (somewhat analogous to the situation in Omelas); and the bad or very bad futures will contain suffering at unprecedented scales. The more one cares about reducing suffering in comparison to creating happiness or other types of flourishing, the less attractive such a lottery becomes. In other words, efforts to reduce extinction risks are only positive according to one’s values if one's expected ratio of future happiness vs. suffering is greater than one’s normative exchange rate.5 Instead of spending all our resources on buying as many lottery tickets as possible, those with suffering-focused values should try to ensure that as few tickets as possible contain (astronomically) unpleasant surprises. They should also cooperate closely with extinction risk reducers to reap the gains from moral cooperation.
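The lottery framing can be restated as a simple expected-value comparison. The probabilities and happiness/suffering magnitudes in the sketch below are purely hypothetical placeholders (the post commits to no specific numbers); the snippet only shows how the sign of the calculation flips as the normative exchange rate between suffering and happiness grows.

```python
# Purely illustrative distribution over futures reached by avoiding extinction:
# (probability, happiness units, suffering units). None of these numbers come
# from the text; they only exist to show how the calculation works.
futures = [
    (0.05, 100.0, 0.0),   # near-utopia
    (0.80,  60.0, 10.0),  # good overall, but with serious pockets of suffering
    (0.15,   5.0, 50.0),  # bad or very bad outcome
]

def value_of_survival(exchange_rate):
    """Expected value of a surviving future, weighting suffering by exchange_rate."""
    return sum(p * (happiness - exchange_rate * suffering)
               for p, happiness, suffering in futures)

for rate in (1, 3, 10):
    value = value_of_survival(rate)
    verdict = "worth the ticket" if value > 0 else "not worth the ticket"
    print(f"exchange rate {rate}: expected value {value:+.1f} ({verdict})")
```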
The following sections will present reasons why s-risks are both neglected and tractable, and why actors concerned about the long-term future should consider investing (more) resources into addressing them.
Within certain future-oriented movements, notably Effective Altruism and transhumanism, there is a tendency for people to expect the (far) future to contain more happiness than suffering. Many of these people, in turn, expect future happiness to outweigh future suffering by many orders of magnitude.6 Arguments put forward for this position include that the vast majority of humans – maybe excluding a small percentage of sadists – value increasing happiness and decreasing suffering, and that technological progress so far has led to many welfare improvements.
While it seems correct to assume that the ratio of expected future happiness to suffering is greater than one, and plausibly quite a bit larger than that,7 the case is not open-and-shut. Good values alone are not sufficient for ensuring good outcomes, and at least insofar as the suffering humans inflict on nonhuman animals is concerned (e.g. with factory farming), technology’s track record is actually negative rather than positive. Moreover, it seems that a lot of people overestimate how good the future will be due to psychological factors, ignorance about some of the potential causes of astronomical future suffering, and insufficient concern for model uncertainty and unknown unknowns.
It is human nature to (subconsciously) flinch away from contemplating horrific realities and possibilities; the world almost certainly contains more misery than most want to admit or can imagine. Our tendency to underestimate the expected amount of future (as compared to present-day) suffering might be even more pronounced. While it would be unfair to apply this characterization to all people who display great optimism towards the future, these considerations certainly play a large role in the epistemic processes of some future “optimists.”
One contributing factor is optimism bias (e.g. Sharot, Riccardi, Raio, & Phelps, 2007), which refers to the tendency to overestimate the likelihood of positive future events while underestimating the probability and severity of negative events – even in the absence of evidence to support such expectations. Another, related factor is wishful thinking, where people are prone to judging scenarios which are in line with their desires as being more probable than what is epistemically justified, while assigning lower credence to scenarios they dislike.
Striving to avert future dystopias inevitably requires one to contemplate vast amounts of suffering on a regular basis, which is often demotivating and may result in depression. By contrast, while the prospect of an apocalypse may also be depressing, working towards a utopian future is more inspiring, and could therefore (subconsciously) bias people towards paying less attention to s-risks.
Similarly, working towards the reduction of extinction risks or the creation of a posthuman utopia is also favored by many people’s instinctual, self-oriented desires, notably one’s own survival and that of family members or other loved ones. As it is easier to motivate oneself (and others) towards a project that appeals to altruistic as well as more self-oriented desires, efforts to reduce risks of astronomical suffering – risks that lie in the distant future and often involve the suffering of unusual or small minds less likely to evoke empathy – will be comparatively neglected. This does not mean that the above motivations are misguided or unimportant; rather, it means that if one also, upon reflection, cares a great deal about reducing suffering, then it might take deliberate effort to give this concern due justice.
Lastly, psychological inhibitions against contemplating s-risks and unawareness of such considerations are interrelated and tend to reinforce each other.
In discussions about the risks from smarter-than-human artificial intelligence, it is often assumed that the sole reason to consider AI safety an important focus area is because it decides between utopia or human extinction. The possibility that misaligned or suboptimally aligned AI might instantiate suffering in astronomical quantities is, however, rarely brought up.
Misaligned AI as a powerful but morally indifferent optimization process might transform galactic resources into highly optimized structures, some of which might very well include suffering. The structures a superintelligent AI or an AI-based economy decoupled from human interests would build in the pursuit of its goals may for instance include a fleet of “worker bots,” factories, supercomputers to simulate ancestral Earths for scientific purposes, and space colonization machinery, to name a few. In the absence of explicit concern for suffering reflected in the goals of such an optimization process, AI systems would be willing to instantiate suffering minds (or “subroutines”) for even the slightest benefit to their objectives. (Note that, as was the case with natural selection’s use of wild animals, some of these optimization processes might also lead to the instantiation of happy minds.) This is especially worrying because the stakes involved could literally turn out to be astronomical: Space colonization is an attractive subgoal for almost any powerful optimization process, as it leads to control over the largest amount of resources. Even if only a small portion of these resources are used for purposes that involve suffering, the resulting disvalue would tragically be enormous.8 Finally, next to bad states of affairs being brought about for instrumental reasons, because of indifference to suffering, there is also the risk that bad states of affairs could be brought about for strategic reasons: In competition or conflict between different factions, if one side is compassionate and the other not, threatening to bring about bad states of affairs could be used as an extortion tactic.
For an overview on ways the future could contain vast amounts of suffering – including as a result of suboptimally aligned AI or human-controlled futures where dangerous ideologies win – see Superintelligence as a Cause or Cure for Risks of Astronomical Suffering and Risks of Astronomical Future Suffering.
One might argue that the scenarios just mentioned tend to be speculative, maybe extremely speculative, and should thus be discounted or even ignored altogether. However, the claim that creating extremely powerful agents with alien values and no compassion might lead to vast amounts of suffering – through some way or another – is a disjunctive prediction. It would take only one available action that increases the AI’s total utility while also involving vast quantities of suffering for the AI to pursue that path without reservation. Worries of this sort are weakly supported by the universe’s historical track record, where the “morally indifferent optimization process” of Darwinian evolution instantiated vast amounts of misery in the form of wild-animal suffering.
Even if the probability of any one specific scenario involving astronomical amounts of suffering (like the ones above, or other scenarios not yet mentioned or thought of) is small, the probability that at least one scenario will occur may be fairly high. In this context, we should beware the disjunction fallacy (Bar-Hillel & Neter, 1993), according to which most people not only underestimate the probability of disjunctions of events, but they actually judge the disjunction as less likely than a single event comprising it.9
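A quick way to see the force of this point: even if each individual scenario is judged unlikely, the chance that at least one of them occurs can be several times higher than the chance of the most likely single scenario. The scenario count, the probabilities, and the independence assumption in the sketch below are all illustrative placeholders.

```python
# Placeholder probabilities for ten distinct s-risk scenarios, each judged
# individually unlikely; the numbers and the independence assumption are
# illustrative only.
scenario_probabilities = [0.01, 0.02, 0.005, 0.01, 0.03,
                          0.01, 0.02, 0.005, 0.01, 0.01]

p_none = 1.0
for p in scenario_probabilities:
    p_none *= (1.0 - p)   # chance that this particular scenario does not occur

print(f"most likely single scenario:     {max(scenario_probabilities):.1%}")
print(f"chance that at least one occurs: {1.0 - p_none:.1%}")
```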
Lastly, taking seriously the possibility of unknown unknowns, black swans or model uncertainty generally seems incompatible with predicting a very large (say, 1,000,000 to 1) ratio of expected future happiness to suffering. Factoring in such model uncertainty brings matters back towards a more symmetrical prior probability distribution. Predicting an extreme ratio, on the other hand, would require enormous amounts of evidence, and is thus suggestive of overconfidence or wishful thinking – especially in the light of historical data on the distribution of suffering and happiness.
In conclusion, there are several reasons why the probability of risks of astronomical suffering – although difficult to assess – is significant; we should be careful to not underestimate them.10
If the future will be controlled by actors that care about both creating happiness (and other forms of flourishing) and reducing suffering, more happiness could be created than expected suffering can be reduced. Versions of this argument have been brought up to justify a focus on utopia creation and the reduction of extinction risks even if one gives some more (but not a lot more) weight to suffering reduction over happiness creation. This is a valid concern, but it does not yet factor in tractability and neglectedness, which may point the other way.
Creating vast amounts of happiness in the future without also causing vast amounts of suffering in the process requires a great deal of control over future outcomes; it essentially requires hitting a bull’s-eye in the space of all possible outcomes. This is especially true for views where value is fragile, but also for other views. By contrast, steering our future trajectory away from classes of bad or very bad outcomes could arguably be easier.
Applied to AI risks in particular, this means that instead of betting on the unlikely case that we will get everything right, worst-case AI safety – approaches informed specifically by considerations of minimizing the probability of the worst failure modes – seems promising, especially as a first step or first priority.
Partly due to the reasons described above (psychological factors, unawareness of some potential sources of astronomical future suffering, and undue disregard of model uncertainty or the possibility of black swans), efforts to reduce risks of astronomical suffering appear neglected. So far, the Center on Long-Term Risk is the only organization that has made this their main focus area.11
Because reducing s-risks is a neglected priority, we should expect there to be several low-hanging fruits in terms of possible interventions. This suggests that researching effective and cooperative strategies to avert risks of astronomical suffering may have a particularly high expected value on the margin.
Risks of astronomical suffering are important to address for many value systems, including ones where suffering reduction is only one (strong) concern out of many. Similarly, if one has uncertainty over different moral views, following a parliamentary model for dealing with moral uncertainty would suggest that suffering reduction should get many votes in one’s “moral parliament,” and correspondingly, that significant resources should be directed towards reducing s-risks.
Moral uncertainty can simultaneously be a reason why even people who care strongly about reducing suffering should welcome some efforts to preserve humanity’s option value. More importantly, considerations about positive-sum cooperation suggest that all s-risk reducers should aim to compromise with those who want to ensure that humanity has a cosmic future. While the thought of prioritizing anything above s-risks may feel unacceptable to some of us, we must acknowledge that other people may (and will) disagree. Values differ, and alternative views on the matter often have their own legitimacy. Rather than fighting other people’s efforts to ensure humanity’s survival and the chance to develop into an intergalactic, long-lasting and flourishing civilization, we should complement these efforts by taking care of the things that could go wrong. Cooperation between future optimists and future pessimists will be best for everyone, as it improves the odds that those who care at all about the long-term future can beat all the future apathists, or the general trend for things to slide towards the abyss when actors only follow their own incentives.
Bar-Hillel, M., & Neter, E. (1993). How alike is it versus how likely is it: A disjunction fallacy in probability judgments. Journal of Personality and Social Psychology, 65(6), 1119-1131.
Bostrom, N. (2002). Existential risks. Journal of Evolution and Technology, 9(1), 1-31.
Hastie, R., & Dawes, R. (2001). Rational choice in an uncertain world: The psychology of judgment and decision making. Thousand Oaks: Sage Publications.
Sharot, T., Riccardi, A. M., Raio, C. M., & Phelps, E. A. (2007). Neural mechanisms mediating optimism bias. Nature, 450(7166), 102-105.
Tomasik, B. (2013). Risks of Astronomical Future Suffering. Center on Long-Term Risk.
Yudkowsky, E. (2011). Complex value systems are required to realize valuable futures. In J. Schmidhuber, K. R. Thórisson, & M. Looks (Eds.), Artificial General Intelligence: 4th International Conference, AGI 2011, Mountain View, CA, USA (pp. 388-393). Berlin: Springer.
The post Panmnemism and the Value of Experiential Information appeared first on Center on Long-Term Risk.
This paper introduces the theory of Panmnemism as a variation of panpsychism. Rather than making claims about spirit or sentience existing in all things, panmnemism references the ability of inanimate objects to store experiential information as memories. Memory, or mneme, is to be understood as the capacity to store information concerning the experiential qualities of existence. Experiential information can be stored as memory. Memories can be retrieved and communicated to investigators. From these claims, it will be concluded that non-human animals, plants and inanimate objects possess a form of memory, as well as forms of semi-linguistic communication of the information within those memories. Viewing the world in this way allows all entities to contribute to the development of knowledge. Such contributions lend purpose to all things.
It has too often been the case that discussions involving references to panpsychism stray into fears of repellent mysticism or the tendency to encompass aspects of spirit or animation. There is a simpler, more specifically targeted argument which can be adapted from panpsychic theory, one which will likely avoid those metaphysical traps. Instead of attempting to support the existence of cognition or soul in non-humans, the focus ought to be on the capacity to collect memories and construct narratives. The qualia of ‘being in the world’ are attached to the existence of individual objects. The ability to store subjective information and pass it on at a later time is a fundamental quality of consciousness. This is only one aspect of mental function, but it’s one that can be attributed to inanimate objects with little contention. Panpsychism, at its core, has always claimed that some aspects of mind must be present in all things. This idea can be traced back at least as far as the Epicurean notion that the will cannot emerge out of nothing, and thus must be present in all forms of matter. Understanding consciousness as having existed prior to the appearance of human life is a necessary element of Buddhist and Hindu cosmologies. Numerous scientists, from Thomas Edison1 to David Bohm, have argued in favor of panpsychic ideas in order to explain the behavior of forces and particles. Leibniz and Spinoza both built philosophies dependent on the belief that mind and spirit are composed from smaller blocks of mental attributes, which are present in all things. Precisely which elements of mind are present in matter is rarely specified.
Philosophical claims of emergence reject this idea of consciousness covering a wide spectrum of proto-cognitive abilities, suggesting instead that consciousness appeared in a sudden evolutionary leap. To understand adaptation as a process is to understand that whatever abilities are present in the human animal today must have begun, in more primitive forms, in our distant past. The intricate operations of consciousness could not have emerged from out of nowhere. Consciousness must have evolved from simpler instances of conscious activity, or proto-consciousness.
Any arguments stemming from mind/body debate partisanship are irrelevant to the discussion of consciousness as categorized by panpsychic theory. If dualism were true, the secondary immaterial substance of mind could easily be assigned to all physical things, not just human brains. Since this mental substance is non-physical, and thus immeasurable, it could be active around any form of matter. Conversely, if the identity theory holds true, panpsychism would still have ample space in which to operate. If consciousness is reducible to molecules and matter, then all things constructed from molecules and matter would possess the potential for similar functions. This is not to suggest that inanimate objects are capable of experiencing emotion or actualization, though not all humans are capable of those particular cognitions either. Since emotions, for one example, are not exhibited by all thinking humans, their presence can’t be a necessary condition of consciousness.
Just as the term consciousness has been continuously redefined, the word panpsychism has seen itself applied to all sorts of theories and meta-theories. In order to avoid any pluralistic confusion, I will be using a term which more specifically handles the aspects of mind being addressed here. Panmnemism is the position that all things have memory. This clarification immediately brackets out all implications of animism, vitalism and substance dualism. Memory is not alive. Memory is not magical. Memory is not a secondary substance. Memory is the act of storing information with a capability for future retrieval.
Panmnemism is a variation of panpsychic theory, but dodges the problem of proving self-awareness, the trait often relied upon as the only sufficient condition for sentience. This is not to say that panmnemism is merely a watered-down or compromised version of panpsychism. The discussion of all things carrying the potential for remembering and transmitting experiential information has utility. In fact, gathering information and storing it for later retrieval might have as much practical value as other aspects of conscious ability. Even more beneficially, this view of material nature helps to remove the explanatory gap of emergent experience.2
Experience is an attribute of existence. All physical entities are items which exist in the material world. To be within this spatio-temporal reality is to be affected by aspects of the surrounding environment. Where something is located, how it grows or degrades, and how the atmosphere around it changes are all types of direct events acting upon singular perspectives. Such events ought to be called experiential. Each object is unique in its location, so each object develops experiences which are specific to its position, size and relation to its surroundings. These individualized traits can be referred to as subjective experiences. If these events are not experiences or proto-experiences, one is left concluding that there was no such thing as experience prior to the development of neural systems. Pattern, color and location would have had to suddenly emerge from nothingness, after billions of years of non-existence, once a sentient being got around to noticing them. Self-awareness is not a prerequisite of existence. Whether or not an object has the ability to reflect upon the quality of its experiences is irrelevant to the fact that physical entities are shaped and impacted by their place in the world.
There is a need to clarify what separates panmnemism from previous attempts at narrowing the scope of panpsychism. David Ray Griffin and Gregg Rosenberg endorse a reduced panpsychism of their own, known as panexperientialism. This theory, largely indebted to Whitehead, holds that all objects and entities gather individuated experiences.3 While this is certainly an important element of the philosophy being discussed here, their theory is quick to dismiss the presence of any sort of mentality in those entities. Experience is treated as a physical certainty, but is disassociated from the activities of minds.4 This harsh delineation reduces the attributes of inanimate objects to little more than what physicalist philosophies grant them in the first place.
Panmnemism, on the other hand, allows all objects the opportunity to participate in the collection of experiential information and the growth of a universal knowledge base. While the act of introducing newly fabricated terms into the philosophical lexicon tends to raise suspicious brows, this specific set of conditions cannot be confused with the more inclusive (panpsychism – all things are conscious) or the overly exclusive (panexperientialism – all things are experiential). Mneme, from the ancient Greek, refers not only to memory, but also to the ability to retain environmental information. Mnemic Theory, now little more than a relic, proposed the idea that memories were inscribed in the protoplasm of living cells as engrams.5 This is not what panmnemism is proposing, though the underlying concepts are similar. Panmnemism also encompasses the possibility for inanimate forms of expression. By exhibiting the power to share experiential information, the objects of our world participate in the assembly of knowledge, rather than keeping their experiences locked up privately.
As a physical and retrievable element of being, experiential data have great importance to the accumulation of knowledge. Not all experiential data are cognitive experiential data. Value stems from the quality and quantity of information, not from the carrier entity’s awareness of that information. It is through the collection of information that observations and conclusions concerning nature, ethical progression and evolution can be ascertained. If knowledge has a value for its ability to drive efficient practice and ethical improvement, then what is best for the greater good would be that which successfully preserves a record without causing harm or otherwise disrupting the developmental process. If all objects can carry extractable information then all objects can be said to assist in the growth of knowledge. A system for assigning value to an object, based on the potential contributions of that object to the expanding catalog of experiential data, can be laid out with a basic set of criteria.
Before setting up a framework for evaluation, the role of externally stored information in our progress and evolution ought to be explored. The knowledge base accumulated from our world is consistently utilized as the primary measurement of progress. Whether we are discussing technology, health and wellness, philosophy or even global politics, the distance gained from previous epochs is measured by a growth in understanding. Collecting information from the natural world is paramount for this evolution. Becoming more rational, or even more enlightened, is reliant on intellectual progress, directly or indirectly derived from observations of matter. The ability to uncover data is expanding. The ability to store data is expanding. Advancement is linked to progress in the endeavor to gather and assess facts and figures from the widest variety of sources.
Without going into too much quantitative detail, a foundation can be established to appraise this value based on the following factors: experiential memory storage potential, ease of information extraction and a pragmatic assessment of the quality of the data received. First off is the potential for memory storage. When looking at computer storage, encoded information in cells or the fossil record, we are dealing with very large sets of extractable information. The storage capacity of an object or entity is a valuable space for information. It is real estate for future knowledge. Like any other storage facility, the value is based on the capacity, not the contents. It is the potential for knowledge which must be credited rather than actualized data. If we were to look at a parallel on the human level, we wouldn’t want to exclude the potential contributions of the majority of first person perspectives, simply due to poverty or socio-political marginalization.
The next variable under consideration relates to the ease of extraction. If data collectors are not aware of the experiential knowledge inside of an object, or they’re unable to access the object’s memory, the utility of that knowledge sinks toward the bottom of this value scale. Systems for extracting packets of information, such as radiometric dating or tree ring analysis, are well established research tools. These are examples of relatively easy information extractions. Several new transfer systems are being experimented with. Molecular memory, insect communication and environmental toxicology are areas undergoing current exploration. The ease of extraction will only be advanced by the creation of a reliable process and a common language for sharing any information retrieved. It might be beneficial to associate these extractions with the term recollection, rather than memory. Memory, for better or for worse, belongs to the vocabulary of human psychology. Recollection, the act of collecting again, is a fitting descriptor for these examples. The object collects experiential information and that same information can be re-collected at a later date.
Lastly, we come to the weighing of the epistemological value of the information itself. Information storage which can be kept static for long periods of time, without tampering or alteration, has greater value than consensus-based historical recording. As human history progresses, the narratives shift, the detail sometimes fades and politics directly or indirectly censor the neutrality of the information. The same factors apply to the extraction and interpretation of individual panmnemic data points, but not to the raw content of those data points prior to human interference. The way in which people interpret century old climate statistics varies. Any inferred meaning associated with that data can be swayed by scientific bias, political motivations or ordinary human error. But the data keep their value, as long as they are returned to a state of static information, which can be examined and employed again and again through future analysis.
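To make the three criteria above easier to see at a glance, here is a minimal, purely illustrative sketch of how they could be combined into a single rough score. Nothing in the essay proposes a formula; the 0-to-1 scales, the equal default weights, and the example values below are assumptions introduced only for illustration.

```python
# Illustrative sketch only: scoring a hypothetical information carrier on the
# three criteria discussed above. The 0.0-1.0 scales and equal default weights
# are assumptions, not part of the essay's framework.

def panmnemic_value(storage_potential, ease_of_extraction, data_quality,
                    weights=(1.0, 1.0, 1.0)):
    """Combine the three criteria into one rough score between 0 and 1."""
    factors = (storage_potential, ease_of_extraction, data_quality)
    if not all(0.0 <= f <= 1.0 for f in factors):
        raise ValueError("each factor is expected to lie between 0 and 1")
    return sum(w * f for w, f in zip(weights, factors)) / sum(weights)

# Hypothetical comparison: a tree (rings are easy to read) versus a distant
# star (vast storage, but extraction is hard and interpretation-laden).
print(panmnemic_value(storage_potential=0.6, ease_of_extraction=0.9, data_quality=0.8))   # ~0.77
print(panmnemic_value(storage_potential=0.95, ease_of_extraction=0.3, data_quality=0.7))  # ~0.65
```

On this toy scoring, the tree's easily read rings can outrank the star's much larger but harder-to-extract record, which is simply a way of restating the weight the criteria place on extraction and interpretation.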
A star sends valuable information, about the physical and environmental conditions in its galaxy, across the universe. The experience of a particular celestial body has its greatest value as potential knowledge. When humans get involved in hypothesizing and speculating, the value decreases. A sensor, collecting the raw data, skips the middle-person’s conjecture and keeps the potential for the transportation of valuable knowledge at its highest. It has become a fairly common practice, or at least a convenient tool, to think of elementary particles as information.6 This is the function of our interaction with a majority of substances: we engage with them as little more than robust information sources.
Experiential information derived from non-humans has the asset of being untainted by false or implanted memories. People not only have difficulty recollecting past events with objective detail, they also carry in their brains all sorts of perceived impossibilities, lies, errors in education, dreams and imagination. A story my sister told me about an accident I had when I was two years old seems as real to me as any other recollection from my early childhood. If that tale was invented, and its truth value was withheld from me, the memory would contain the same quality inside my mind. The false data are passed off as genuine every time I retell the story as if it had actually taken place. Rocks and molecules appear to be incapable of lying. This grants a higher value, or at least a greater reliability, to whatever stories they pass on to their investigators.
These might be relatively primitive levels of experience, but they can be found in all entities. Memory and experience do not need to mimic human memory and human experience in order to have a value. If the information derived from an object serves a purpose, then the object has performed a useful function. Experience evolves toward complexity, from tiny independent units to intricate and collaborative systems which provide immense bodies of viable data to whatever is able to extract that information. We should not necessitate the presence of complex systems with collaborative functions in order to refer to the collection and dissemination of data as recollection. That would lead to problems of where to draw the boundaries and whether a sentient being with damaged neural functions still possesses memories. Again, this is not proposing an equivalency between memory and consciousness, but suggesting recollection as an intermediary in the evolution of consciousness, or as a type of proto-experience.
In order to explain the appearance of cognition in nature, we need to place current levels of conscious ability within an evolutionary context. Mind did not emerge from nothingness. The suggestion of mental emergence is counterproductive to our growing understanding of natural selection. The picture becomes more subtle, explicable and closer to completion if we adhere to Teilhard de Chardin’s position that biological evolution must be a continuation of pre-biological evolution.7 Incorporating a few other panpsychic theories will help to solidify the additive process I am describing. German biologist Bernard Rensch wrote about the proto-phenomenal properties of inanimate matter.8 He explained the necessity of attributing such characteristics to atoms and molecules, the alternative being the unnecessary struggle of trying to integrate consciousness into some unknown intermediary level. If we combine this concept with key elements from the panpsychic writings of Gustav Fechner, specifically his ideas about mind resulting from a systemic layering of simple material functions,9 we begin to see a more inclusive evolutionary model. This additive view can be further quantified with the application of a physics of information. The smallest components of the natural world are not necessarily quarks or strings. All sub-atomic particles can be discussed as being made up of measurable experiential data sets. Wheeler’s theory of ‘It From Bit’ explains that “every particle, every field of force, even the space-time continuum itself – derives its function, its meaning from apparatus-elicited answers to yes or no questions, binary choices, bits”.10 From this we can see how even something storing its history of, for example, receiving light versus not receiving light, would be sufficient to represent that data as a binary number set. This is usable information which takes on meaning once it has been transmitted or observed.
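As a small concrete illustration of that last point, and not anything drawn from Wheeler or the essay itself, a bare history of receiving light versus not receiving light can be written down as a binary number set. The sample history below is invented purely for demonstration.

```python
# Minimal illustration: a yes/no event history ("received light" vs. "did not
# receive light") represented as a bit string and as a binary number.
# The sample history is hypothetical.

history = [True, False, True, True, False, True, False, False]

bits = "".join("1" if received_light else "0" for received_light in history)
as_number = int(bits, 2)

print(bits)       # 10110100 -- the stored experiential record, bit by bit
print(as_number)  # 180 -- the same record read as a single binary number
```

Nothing here requires the storing thing to be aware of its record; the information simply sits there until something able to read it comes along.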
Memory, as discussed within the frame of panmnemism, is to be understood as the ability to store information concerning the experiential qualities of existence. The element of choice or selection is irrelevant. Just as people can’t decide which details to remember and which ones to forget, other material things hold on to some aspects of their experience in the world while some pieces go unrecorded. Higher order comprehension still carries a greater value, though that preference does not reduce the importance of remembering and sharing memories as a form of proto-consciousness. Being able to recall and restate facts and figures will get a student passing grades in many courses. While educators would prefer to create a system which requires a deeper understanding of the subject matter, it can’t be said that rote repetition has no utility. Demonstrating defensive measures to offspring, based on past predation, has an important function in animal survival, regardless of whether or not the animal has any knowledge of what it is teaching or why. People carry memories of sensations that they weren’t able to comprehend when their minds first received them. We can remember things that still don’t make sense to us, even after years of cognitive reflection on the primary experiences. Human memory is fallible and largely unidirectional. If I can’t recite a phone number backwards as well as forwards, or recall it without picturing the keypad, do I really have a firm memory of the number? Examples of rote information dispersal reach far below the higher order animals.
Slime mold demonstrates detailed patterns of path recognition and spatial memory. It will avoid searching for food in places it has already checked and sort out the shortest paths to and from food sources. This is an organism with no neural function.11 Meteorites carry information about their stellar origins. Hair can be examined to reveal a detailed history of forensic toxicology. Transmission of stored experiences continues to develop. New material areas are being explored for the potential historical information lying dormant inside.
Wooden musical instruments, for example, have resonant top plates which vibrate at specific frequencies. Certain nodal patterns, correlating to the commonly used pitches of the performer, are repeatedly flexed. This repetition stiffens some areas of the structure while loosening others. As the wood slowly ages and dries over time, these Chladni patterns are set as sympathetic nodes in the object.12 If these past preferences can be extracted, then it would be wrong to say that the plate of spruce on the top of a violin merely responds as though it remembers being played in down-tuned, just intonation. If the wood expresses familiarity with those frequencies, it does remember, and has shown evidence of its past experiences by resonating in that mode.
There is a peculiar tendency to rank mental faculties hierarchically. As thinking human beings, we often place our own complex brain functions right up at the top of that pyramid. This practice is likely a matter of protecting our species from any threat to our cerebral supremacy. Admittedly, panpsychism, panexperientialism and panmnemism present a comparatively neutral, passive view of conscious action, but one which Descartes yearned to reach, one which Buddhists believe to be highest. Why is cognition considered to be of a higher order than qualia? Why is self-awareness believed to be a more evolved trait than experience? Why is creativity valued over recollection?
Without a consensus-based definition of what consciousness is, philosophers are left discussing the various elements which constitute conscious thought. Multiple traits and abilities might figure into the assembly: self-awareness, rational thought, recollection, language, empathy, moral judgment and creativity are all functions of a conscious mind. No single trait is sufficient to represent the presence of sentience. Consciousness can best be understood as a convergence or overlapping of a number of these spheres. Essential to this Venn diagram are the spheres of communication and memory.
Memory plays an essential role for identification by providing a constant for the continuity of consciousness. The first-person narrative of any individual is dependent upon reliably referencing identical memory points within that individual’s history. What is self-awareness without memory? Because I can recall the same specific childhood fishing trip I was able to picture five years ago, ten years ago and twenty years ago, I can be sure that the events which have taken place over those twenty years fall within the same life narrative. This elevates the forming and saving of memories to a high position among the necessary cognitive abilities of sentient beings. This is the formative basis for conscious reflection as well as the requirement for maintaining a continuity of self.
Language is commonly listed as a necessary condition of sentience. On that criterion, in order for these examples to be considered recollection, the objects would need to be able to linguistically express their memories. Objections on the grounds of the language criterion for consciousness can be dealt with by making a slight shift in how language is understood. Language need not be verbal, written or even directly targeted to be categorized as communication. Language can’t be limited to tongue and text. A broader appreciation of linguistic comprehension ought to include communication through non-traditional formats over longer periods of time.
We can personify, even anthropomorphize buildings. People make frequent references to mood swings and attitudes of smart phones and computers. Still, it doesn’t seem right to state that those products possess a language. In order to reconcile this, we must expand our use of the term language from being an immediate interactive system, to also including systems of experiential storage, which can be opened in order to share certain unique experiences. A person with no external communicative ability would still have the internal voice needed to catalog their experiences and emotions. If we think of this one key function of consciousness, the storage and utilization of external stimuli, then conscious ability appears in all sorts of things, not just human minds.
When a tree collects data and stores them in the formations of its rings, this is akin to creating an internal memory bank. All of that environmental information can be extracted later on, providing the recipients with a detailed record of that tree’s experiences. The fact that objects record data in formats which can be harvested and understood shouldn’t be underappreciated here. The object owns its individual information history and there is no clear reason why that history should be transferable to other types of beings. Examining the history of weather, records of trauma, droughts or regional fire activity embedded in tree rings gives the reader of those signs a picture of key events in the life of that organism. This is the tree’s way of communicating its story. Whether or not there’s an intended audience is beside the point; this transfer of experiential information is a type of language.
A more contestable example, but a bit more fun at the same time, might be a consideration of what the French call terroir. This refers to the long and proud tradition of associating flavor profiles with their geographical origins. Burgundy tastes different from Bordeaux, though they’re both red wines. When someone tastes a regional cheese, it is the topography, the soil, the varieties of grazing grass and the weather they are sensing, more than the influence of the farmer or production methods.13 This is why a person with a finely tuned palate can discern between varieties of wine and cheese. They are, in a sense, tasting the story of the grape, the narrative of the milk.
Now, I make no claims about my own ability to discern subtle flavor data and I could stare at rings on a tree stump for days without gaining any usable historical information. This does not mean that the content isn’t present. The narrative exists and is accessible; the language simply must be understood by both parties. Likewise, I’d have no understanding of a page of text presented to me in phonetic Wintu. However, the possible contribution to global knowledge on that page shouldn’t be eliminated due to a lack of linguistic abilities on the part of the receiver. It would be wrong to say that the Wintu language is not a form of communication just because only a handful of people possess a working knowledge of it. There are far more people with a working knowledge of the language of dendrochronology. At least since John von Neumann’s computational work of the late 1940s and early 1950s, the notion of an inanimate object remembering and communicating information has been rationally presented. It shouldn’t be difficult to extend his reasoning beyond machine computation, or to associate such examples of communication with other forms of language.
An experience does not need to have greater meaning derived from it in order to become an important piece of data. Contexts and applications can be associated with those direct experiences later on. This process works in agreement with John Searle’s claim that things don’t have intrinsic intentionality.14 Literary criticism and psychoanalysis are obvious examples of attributing meaning to pieces of information that don’t express a clear perspective in and of themselves. These are impure, or at least imperfect, attempts at a useful exchange of ideas. The philosophical problems associated with language, ever present in post-Wittgensteinian thought, come with the interference of human interpretation. Raw mathematical data actually bring us closer to analytic clarity; they bring us closer to Wittgenstein’s position that description ought to take the place of explanation.15 After all, humans, in the perspective of greater timescales, are little more than carriers of genetic data, data of which we’re not consciously aware.
Humans communicate private experiences of stress or pleasure without a specified linguistic ability to express the quality or quantity of those feelings. I can only compare my pain to other pains I’ve felt, having no calibration for my sensations on a chart which plots the pains of others. These are limitations. The communicative abilities of objects are quite different. The data transferred from inanimate objects and simple organisms are built to specific standards. There is regularity. This makes the transfer of knowledge more reliable. Humans use a language which allows for lies, withholdings and numerous opportunities for miscommunication. In the information storage systems of objects, ‘A’ always equals ‘A’. This type of non-linguistic communication is a more reliable source for the foundations of knowledge. This type of memory is perhaps more analogous to what we label as ‘photographic memory’ in humans. When people can recollect dialogue verbatim or exact dates and names from insignificant events, we not only find that skill impressive, but usually see it as a sign of intelligence. In order to cope with the problems of the future, reliable information taken from the perspectives of all entities needs to be gathered, collated and appreciated.
Developing a better comprehension of the evolution of consciousness, from simple memory retention to complex reasoning systems, will lead to the proper placement of human minds on the widest conceivable spectrum of consciousness. This will allow a future machine or alien intelligence to exceed human conscious abilities without threatening the relevance of human contributions to a universal knowledge base. By imagining mental ability as a vast scale, rather than belonging to a single species, the problems associated with artificial cognition become a matter of evaluating the experiential information in combination with the utility of higher order reflection. There is room for different types of consciousness contributing different strengths of cognition. A scenario with some similarity to Ray Kurzweil’s vision of ‘the singularity’ seems all but inevitable. The boundaries between organic and inorganic mental states are becoming blurry. As people implant more and more artificial or externally formed devices into their bodies, computers are becoming more organic and beginning to process information more creatively. Even if a fully realized new form of cognitive consciousness is not on the horizon, a very real ‘Ship of Theseus’ type of problem is imminent.
There isn’t a hard line between human consciousness and other forms of consciousness, but there is a range of different aptitudes. Some ‘minds’ are better at preserving detail. Other ‘minds’ have a proclivity for sharing information. Some ‘minds’ excel because of the vast amount of data they can safely store. Human minds are best at organizing and drawing conclusions from data sets.
When the human animal no longer walks the earth, all that will remain of our experiences will be artifacts and a wide variety of extractable information. This historical record will be encoded in thousands of different languages, numeric charts, geologic formations, fossil remains, mathematical proofs, symbolic logic and all sorts of examinable evidence awaiting investigation. People don’t just want to believe that their experiences have a value; we are certain that our perceptions, and the information we gather, can be useful in numerous applications. This knowledge is not only of utility to the lives of future human beings, but valuable as experiential data with the potential of being transferred to any capable investigator. If we choose to ignore the importance of similar examples, where experiential information is being derived from objects today, then there is no reason to place any importance on the experiential information of human beings either. People derive meaning and a sense of purpose from the ability to share knowledge with the rest of the world, and potentially, future civilizations. There is value in carrying information, a value which can be extended to all material data carriers. Rather than dealing with the scientific, philosophical and spiritual traps of talking about consciousness, we can expand our allowance of certain types of conscious abilities to include all contributors. All things have experiences. Experiential information can be stored as memories. Memories can be retrieved and communicated to investigators. From these claims, it can be concluded that non-human animals, plants and inanimate objects possess a form of memory, as well as forms of semi-linguistic communication of those memories.
The post Panmnemism and the Value of Experiential Information appeared first on Center on Long-Term Risk.
The post The Case for Suffering-Focused Ethics appeared first on Center on Long-Term Risk.
A common intuition is that creating happy beings is less morally pressing than making sure existing beings are well off. This intuition is characterized by Jan Narveson’s (1973) statement, “We are in favor of making people happy, but neutral about making happy people.” Narveson’s principle points to a putative2 asymmetry between suffering and happiness in the context of (not) adding new beings to the world. Consider the following thought experiment:
Two planets
Imagine two planets, one empty and one inhabited by 1,000 beings suffering a miserable existence. Flying to the empty planet, you could bring 1,000,000 beings into existence that will live a happy life. Flying to the inhabited planet instead, you could help the 1,000 miserable beings and give them the means to live happily. If there is time to do both, where would you go first? If there is only time to fly to one planet, which one should it be?
Even though one could bring about 1,000 times as many happy beings as there are existing unhappy ones, many people’s moral intuition would have us help the unhappy beings instead. To those holding this intuition, taking care of suffering appears to be of greater moral importance than creating new, happy beings.
By contrast, preventing miserable beings from being added to the world seems just as important as preserving existing happiness. And it would be a no-brainer for most people that 1,000 existing beings going from happy to miserable is better than the 1,000 beings staying happy at the cost of 1,000,000 new, miserable beings being created. This suggests that we care about reducing suffering for existing and potential beings equally, whereas we prioritize the promotion of happiness in actual beings over the happiness of their merely potential peers.
Narveson’s principle is found at the heart of preference-based axiologies where what matters is preference (dis)satisfaction. In Christoph Fehige’s words:
"We have obligations to make preferrers satisfied, but no obligations to make satisfied preferrers" (Fehige, "A pareto principle for possible people", 1998, p.518).
Creating large numbers of beings while ensuring that all their preferences or goals are satisfied – which could be achieved by making provisions for the newly created beings to have easily satisfiable preferences, or preferences that correspond very closely with the most likely state of the world – may strike us as a pointless endeavor. The claim that an extra preference in itself is of little value, so that “Maximizers of preference satisfaction should instead call themselves minimizers of preference frustration.” (Fehige, ibid.) is the gist of Fehige's anti-frustrationism.
Intuitions such as “Making people happy rather than making happy people” are linked to the Epicurean view that non-existence does not constitute a deplorable state. Proponents of this view reason that the suffering and/or frustrated preferences of a being constitute a real, tangible problem for that being. By contrast, non-existence is free of moral predicaments for any evaluating agent, given that by definition no such agent exists. Why, then, might we be tempted to consider non-existence a problem? Non-existence may seem unsettling to us, because from the perspective of existing beings, no-longer-existing is a daunting prospect. Importantly however, death or no-longer-existing differs fundamentally from never having been born, in that it is typically preceded and accompanied by suffering and/or frustrated preferences.
Acceptance of the above moral asymmetry between suffering and happiness as they pertain to existence and non-existence grounds a strong moral reason to concentrate on suffering and/or preference frustration as opposed to the promotion of happiness. It comes with an understanding of ethics as being about solving the world’s problems: We confront spacetime, see wherever there is or will be a problem, i.e. a struggling being, and we solve it.
Views that incorporate this intuition: The intuition “Making people happy rather than making happy people” is directly incorporated in so-called person-affecting views, where states of affairs can only be morally bad if they are bad for someone, i.e. if there is a being to point to for whom something poses a problem. One such view is (a version of) anti-natalism (e.g. David Benatar, “Better never to have been”, 2008). The same underlying intuition is also present in preference-based approaches such as Fehige’s antifrustrationism that we mentioned before. In addition, it probably inspired the “moral ledger” view laid out by Peter Singer in his Practical Ethics,3 as well as the “prior-existence” view he tentatively endorsed in the same book.4 Finally, egalitarianism and prioritarianism can incorporate the intuition by giving priority to suffering reduction as long as suffering is still around (or will foreseeably be around).5
There is something particularly terrible about torture-level suffering. For many people, extreme suffering is so bad that no other experience can counterbalance it. We tend to shy away from truly imagining how horrible suffering can be, but morality, as the most serious business there is, needs to pay attention to everything. After episodes of extreme suffering, we may gradually forget just how bad it was in the moment. At times, circumstances can deteriorate to a point where we are willing to give up everything we care about. As Orwell pointed out in his book 1984: “[...] for everyone there is something unendurable – something that cannot be contemplated. Courage and cowardice are not involved. If you are falling from a height it is not cowardly to clutch at a rope. If you have come up from deep water it is not cowardly to fill your lungs with air. It is merely an instinct which cannot be destroyed. [...] They are a form of pressure that you cannot withstand, even if you wished to. You will do what is required of you."
Confronting torture-level suffering
Imagine you are taken out of your everyday life and presented with the following choice: 1) You die painlessly now. 2) You have to experience the worst possible torture for one week to be subsequently rewarded with forty years of bliss in a perfect experience machine, culminating in a painless death. Which option would you choose?
While people might be motivated to attempt to endure torture for loved ones or for the fulfillment of their dearest life goals, these confounders are eliminated in the thought experiment above, where the gains – paradise in the experience machine – also come with abandoning loved ones or life goals directed at the (non-virtual) world. In essence, we are asking whether torture-level suffering can – all else equal – be counterbalanced by other experiences in one's personal case. Of course, we imagine the virtual reality on offer to feel completely authentic to the user with no disturbing memories involved.
So when it comes to just comparing the best experiences to torture-level suffering in the personal case, would people make the deal? A refusal to accept this tradeoff suggests that as far as an individual's judgement of their experiences is concerned, torture-level suffering cannot – at least for some individuals – be counterbalanced by (much) larger amounts of happiness.
There are however people who would accept this bargain. This presents a challenge: How are we to extrapolate from intrapersonal tradeoffs, where a person is their own arbiter, to interpersonal tradeoffs, where we consider decisions that affect the welfare of other individuals? Accepting torture-level suffering in one’s personal case is not tantamount to granting that happiness can outweigh torture-level suffering in general in interpersonal tradeoffs. The question for someone who accepts torture-level suffering in exchange for subsequent happiness is “Why do they do it?”
Do the reasons also apply to setting tradeoffs for other people, or do they more narrowly only apply to their specific circumstances?
When it comes to assessing the quality of other people’s lives, there is always going to be an element of paternalism, regardless of the policy we choose: Even in one’s personal case, many of one’s person moments – when we are treating a person’s life as a temporal series of autonomous “person moments” – during the week of torture would not consent. That is, sufferers would (almost certainly) regret the decision made by the person moment that preceded them, and wish for it to be changed. Conversely, after a few years in the most sublime bliss of virtual reality, the newer, happy person moments might end up concluding that, in retrospect, the suffering was worth it after all. And if they were to get tortured again, the judgment might again revert. With perspectives biased by temporal viewpoints, an objective answer on whether or when to accept torture-level suffering cannot be obtained, and no matter which answers we choose, some people (or some person moments) will have grounds for objecting.
Nevertheless, what we can do is to analyze the different reasons for choosing one way or the other, and integrate them into a framework that aspires to a kind of impartiality or fairness, at least to as much a degree as is possible given the constraints. The personal case where someone voluntarily accepts extreme suffering for sufficiently much happiness in their own future is confounded by many factors: For instance, perhaps someone’s life goal includes a desire for novelty or excitement-seeking that ultimately motivates the tradeoff in question. These confounders call into question whether what makes suffering worthwhile under some circumstances is found purely, and to an equal degree for every subject of experience, in the "goodness" of happiness.
People generally hold the belief that more recent experiential moments have epistemic authority over their elapsed peers. For instance, we are willing to grant that an elderly person looking back at various life choices is in a good epistemic position to determine the merit of these choices. However, such epistemic authority cannot straightforwardly be transferred to the interpersonal level. While it does seem plausible that people choose rationally when they decide to undergo tradeoffs involving extreme suffering in their own lives (though there might also be a systematic error in gauging these sorts of tradeoffs), it strikes us as more controversial whether a measurement of the (dis)value of suffering and happiness respectively can be established from an outside, “impartial” perspective. The following thought experiment lends support to this intuition by highlighting an important difference between intrapersonal and interpersonal welfare tradeoffs:
Consciousness board
Part 1: Imagine you are sitting on a towel at the beach. The weather is very hot, but you are sitting comfortably in the shade. After some time, you develop a strong craving for taking a refreshing swim in the ocean. Unfortunately, you forgot your sandals and will have to walk barefoot over hot sand. You decide that the pain from walking over the hot sand will be worth the pleasure of swimming in the ocean and go for it.
Part 2: Imagine you are controlling a supercomputer that can implement states of consciousness. You look at the control board in front of you and have the option to instantiate some painful “hot-sand moments” with 10% of the computer’s resources, and a lot of happy “ocean-swimming moments” with the other 90% of computing power. The two experiences will not be connected to each other, i.e. no memory is present in the ocean-moments of the sand-moments, and the sand-moments experience no anticipation of the refreshing swimming. Would you choose to run these experiences jointly?
Interestingly, it is much less obvious that the tradeoff in Part 2 is “worth it.” In fact, it is unclear what “worth it” would even mean in this context. Our own behavior for trading between pleasure and suffering seems to be driven by factors that don't just correspond to the intensity of the suffering or pleasure in question; factors such as cravings for pleasure (that would counterfactually result in unpleasantness, e.g. if one were to remain in the shade and consequently became more and more hot, sweaty and bored), or by preferences for adventure, for accomplishments and meaning in life, or simply for novel or exciting activities. When such external factors are removed and we are only looking at “raw experiences” rather than experiences embedded in the context of people’s lives as a whole, our intuitions regarding the appropriate exchange rates may change drastically, and many reasons (cravings, desire for novelty, meaning, adventure, etc.) for why one might be tempted to accept (torture-level) suffering in the intrapersonal case simply disappear. In the impartial or altruistic context, when evaluating whether to instantiate certain (packages of) experiences or not, there seems to be an especially strong case for never bringing about torture-level suffering, where the person moment in question would be left uncompensated and wanting its suffering to be terminated at all costs.6
This suggests that even for people who might be inclined to accept the bargain described above for themselves, it remains an open question whether they should apply the same exchange rate for the impartial or altruistic context. The two modes of evaluation, intra- versus interpersonal, differ in interesting and potentially relevant respects.
Views that incorporate this intuition: The intuition that torture-level suffering cannot be counterbalanced is strong in many people. It is present in the widespread belief that minor pains cannot be aggregated to become worse than an instance of torture.7 Among consequentialist ethical systems, it is incorporated in threshold and consent-based negative utilitarianism and in (maximin) prioritarianism. It likely also contributes to absolute prohibitions against torture in deontological moralities. Finally, the intuition is part of philosophical works of fiction, such as Ursula K. Le Guin’s short story The Ones Who Walk Away from Omelas,8 Dostoevsky’s The Brothers Karamazov9 or Camus’s The Plague.10
A widespread view, especially in non-Western traditions, is that “happiness” consists of the absence of suffering. According to this view, not only pleasure, but also tranquillity or contentment are amongst the best experiences (Gloor, 2017). We pursue pleasures because without them, we (usually) develop cravings for these pleasures. And these cravings constitute a form of suffering or dissatisfaction, i.e. of consciously wanting the current experience to be different. If a state is entirely free of cravings, there is a sense in which it can be considered perfect. Subjectively at least, the immediate, internal evaluation in such a state concludes that nothing needs to be changed. Accordingly, pleasure then carries “mere” instrumental importance, because flooding the mind with pleasure is one of several ways to (temporarily) get rid of cravings. Quoting from the paper on Tranquilist axiology:
In the context of everyday life, there are almost always things that ever so slightly bother us. Uncomfortable pressure in the shoes, thirst, hunger, headaches, boredom, itches, non-effortless work, worries, longing for better times... When our brain is flooded with pleasure, we temporarily become unaware of all the negative ingredients of our stream of consciousness, and they thus cease to exist. Pleasure is the typical way in which our minds experience temporary freedom from suffering, which may contribute to the view that happiness is the symmetrical counterpart to suffering, and that pleasure, at the expense of all other possible states, is intrinsically important and worth bringing about. However, there are also (contingently rare) mental states devoid of anything bothersome that are not intensely pleasurable, examples being flow states or states of meditative tranquillity. Felt from the inside, meditative tranquillity is perfect in that it is untroubled by any aversive components, untroubled by any cravings for more pleasure. Likewise, a state of flow – as it may be experienced during stimulating work, when listening to music or when playing video games – where tasks are being done on “autopilot,” with time flying and a low sense of self awareness, also has this same crucial quality of being experienced as completely nonproblematic. Such states – let us call them states of contentment – may not commonly be described as “(intensely) pleasurable,” but following venerable traditions in Buddhism and Epicureanism, these states, too, deserve to be regarded as perfect.
Experiences that we in our everyday life think of as “neutral” may often contain a backdrop of dissatisfaction, hard to notice introspectively because we have become accustomed to it, but present nonetheless in that our evaluation of the experience is affected and becomes ever-so-slightly negative. Experiences that are truly free from dissatisfaction on the other hand are experiences we often think of as positive.
Critics may object that a world in which all pleasures were reduced to “mere” contentment – such as states of constant meditation, half-sleep or the playing of flow-inducing video games – would be unexciting and rather monotonous. All the heights of sensory and emotional pleasures, such as eating one’s favorite foods, successfully accomplishing a long-term project or being in love, would be lost. But it is worth pointing out that the loss of these experiences appears tragic only when viewed from the outside, when we compare it to the world we would wish to inhabit in terms of our life goals and the desired narrative for how we want our lives to go.
So the reflective part of our nature cares about things other than our moment-to-moment experiences. And to that part of us, a world of mere contentment would indeed be found lacking. However, what tranquilist axiology is modeled after is the impulsive part of our nature: We tend to live according to the short-sighted avoidance of dissatisfaction. We seek pleasure not because pleasure is in itself valuable, but in order to drown out boredom or pleasure cravings. For the impulsive part in us, a world filled with nothing but states of contentment would be a true paradise. If we zoomed in on all that is being experienced in such a world, there would by definition never be a moment of boredom or unfulfilled longing; never would there be a moment where someone consciously wants to have something changed about their experience. To its inhabitants, such a world would manifest itself as perfect. (Of course, the impulsive part of us may not be the only motivational system that matters morally, and morality should also be about evaluating the world according to the fulfillment of long-term, reflected preferences or life goals. It remains an open question how these two parts are to be integrated.)
To distill the intuition that it is morally unimportant to turn states of contentment into states of maximal pleasure, consider the following thought experiment:
Buddhist monks
Imagine a large temple filled with 1,000 Buddhist monks who are all absorbed in meditation; their minds are at rest in flawless contentment. Unfortunately, the whole temple will collapse in ten minutes and all the monks will be killed. You cannot do anything to prevent the temple from collapsing, but you have the option to press a button that will release a gaseous designer drug into the temple. The drug will reliably produce extreme heights of pleasure and euphoria with no side effects. Would you press the button?11
One reason to press the button is that it could cause the monks to believe that they have reached a long-sought state of enlightenment they were after their whole life. But let us suppose that the monks in the temple are already maximally satisfied with their lives' achievements: The drug-induced euphoria will change the quality of their experience, but it would not change their beliefs about enlightenment or their meditative accomplishments. Should we press the button?
It is tempting to feel roughly indifferent here: Pressing the button seems like a nice thing to do, and assuming it produces no harm or panic, it may be hard to imagine how it would be something bad.12 At the same time, it does not seem particularly important or morally pressing to push the button. As far as the monks in the temple are concerned, it seems that they will be totally fine (for the next ten minutes anyway) without the drug. If there are any relevant opportunity costs whatsoever, such as e.g. an opportunity to reduce mild suffering somewhere else, would it ever be the morally preferable action to induce euphoria in the monks? If yes, why?
This thought experiment suggests that a difference in “happiness levels,” i.e. changing an experience from “neutral” (or “not-maximally-positive” – depending on how one looks at meditation) to intensely pleasurable or “extremely positive,” is not of (strong) moral importance. Interestingly, no symmetrical point for differences in pain levels can be made: On the contrary, no one in their right mind would think that turning the mild suffering of 1,000 individuals into extreme agony makes at most little moral difference. To summarize, the former view – that making slightly happy experiences much more happy is at most of little value – is at the very least plausible (and for many people, even highly intuitive). The latter view, on the other hand – that turning slightly painful states into very painful states is at most of little disvalue – is impossible to even take seriously.13 This contrast strongly indicates that there is an asymmetry between how we value increasing happiness versus reducing suffering. Accordingly, perhaps happiness, rather than being intrinsically valuable, should be seen as instrumentally valuable, or as contingently valuable depending on a person valuing (specific flavors of) happiness for themselves (or others) in their life goals.
Views that incorporate this intuition: The intuition that happiness consists of the absence of suffering is common in non-Western traditions, especially in Buddhism.14 It is also a central part of Epicureanism.15 Finally, it may constitute part of the explanation why a lot of people reject versions of consequentialism where the happiness of the many can outweigh the suffering of a few.
In addition to valuing suffering reduction, we might care about our personal well-being, about us and others fulfilling their dearest dreams and life-goals, about happiness and there being love and joy in the world, and many other, similar objectives. Interestingly, these other values beside concern for suffering often seem to be bounded, i.e. they seem to quickly reach diminishing returns once we optimize for them successfully.
Repugnant Conclusion16
Imagine there is a civilization with ten billion maximally happy inhabitants who never experience any suffering. We have the option to introduce more lives and overall more happiness to the world by multiplying the number of people in that civilization by a factor of one billion. However, the way to bring about this population explosion will also lower the quality of life for all the beings, old and new. In the new civilization with ten billion billion people, each person will experience a lot of mild suffering, a decent amount of moderate suffering, and even some moments of strong suffering. People will also be happy a lot, such that most outsiders, as well as all the people themselves, would on the whole regard their existences as worth living. Should we choose to bring about the much larger, less happy civilization, or do we prefer the small(ish) but maximally happy one?
While there are some people who argue for accepting the repugnant conclusion (Tännsjö, 2004), most people would probably prefer the smaller but happier civilization – at least under some circumstances. One explanation for this preference might lie in the first intuition discussed above, “Making people happy rather than making happy people.” However, this is unlikely to be what is going on for everyone who prefers the smaller civilization: If there were a way to double the size of the smaller population while keeping the quality of life perfect, many people would likely consider this option both positive and important. This suggests that some people do care (intrinsically) about adding more lives and/or happiness to the world. But considering that they would not go for the larger civilization in the Repugnant Conclusion thought experiment above, it also seems that they implicitly place diminishing returns on additional happiness, i.e. that the larger an already happy population becomes, the less important it is to make it larger still.
By contrast, people are much less likely to place diminishing returns on reducing suffering – at least17 insofar as the disvalue of extreme suffering, or the suffering in lives that on the whole do not seem worth living, is concerned. Most people would say that no matter the size of a (finite) population of suffering beings, adding more suffering beings would always remain equally bad.
It should be noted that incorporating diminishing returns to things of positive value into a normative theory is difficult to do in ways that do not seem unsatisfyingly arbitrary. However, perhaps the need to fit all one’s moral intuitions into an overarching theory based solely on intuitively appealing axioms simply cannot be fulfilled. Some have pointed out that human moral intuitions are complex, which makes it non-obvious that one's normative views must follow directly from just a couple of simple and elegant principles.
Views that incorporate this intuition: Next to being a plausible (partial) explanation why people reject the Repugnant Conclusion, diminishing returns to happiness and other values might explain the appeal of average versions of consequentialism and the fact that most people do not consider it morally important to fill the entire universe with happy beings.18
We introduced four separate motivating intuitions for suffering-focused ethics. Endorsing one or several of these intuitions as a guiding principle can ground concern for suffering as the main focus of someone’s morality, while leaving room for other things to value. In addition to the intuitions discussed above, some people’s focus on reducing suffering may also derive from a suffering-focused disposition or temperament when they contemplate the value of lives and outcomes in practice: Quantifying suffering and happiness can be done in several different ways, and people’s judgments may differ even if they use the same methods for their assessment. When doing rough, impressionistic aggregations, some people – who we can call “suffering-focused” – tend to conclude that various lives and outcomes are overall bad, while others tend to conclude that they are overall good.19
It should be noted that pointing out strongly held intuitions or principles does not yet yield a comprehensively specified goal or moral system. For instance, some of the views discussed above may be tricky to formalize in satisfying ways.20
Most people have intuitions about many things, including selfish interests, altruism, our personal (moral?) self-image, game-theoretic considerations or cultural community norms. Deciding which of these we want to reflectively endorse and to what extent, and then bringing all of this together into a coherent goal that ranks not just situations we are culturally or evolutionarily familiar with, but all possible world states,21 is indeed challenging.
Given the difficulty of this task, it is important that we do not make it even more complicated by placing unreasonable formal demands on our values. Likewise, it is important that we do not hastily subscribe to some particular view without remaining open to reflection. Ultimately, choosing values comes down to finding the intuitions and guiding principles we care about the most – and if that includes a number of different intuitions, or even some form of extrapolation procedure to defer to better-informed future versions of ourselves – then the solution may not necessarily look simple. This is completely fine, and it allows those who agree with (some of) the intuitions behind suffering-focused ethics to care about other things in addition.
CLR’s endorsement of suffering-focused ethics is an attempt to incorporate suffering alleviation within a generally commonsensical framework about what is and is not wise to do in practice. Our activism should be strategically smart, non-violent and cooperative. This ensures that people with many different moral perspectives and different practical approaches can coordinate their activism, avoid zero-sum conflicts and instead focus on mutually supportive objectives.
2. Reducing Risks of Astronomical Suffering: A Neglected Priority
Fehige, C. (1998). A pareto principle for possible people. In Fehige, C. and Wessels U. (Eds.), Preferences (pp. 508-43). Berlin: Walter de Gruyter.
Gloor, L. (2017). Tranquilism. Center on Long-Term Risk.
Greaves, H. (2017). Population axiology. Philosophy Compass, 12:e12442. https://doi.org/10.1111/phc3.12442
Holtug, N. (1999). Utility, priority and possible people. Utilitas, 11(01), 16-36.
Knutsson, S. & Brülde, B. (2016). Promoting goods or reducing bads. Manuscript in preparation.
Narveson, J. (1973). Moral problems of population. The Monist, 62-86.
Norcross, A. (2009). Two dogmas of deontology: Aggregation, rights, and the separateness of persons. Social Philosophy and Policy, 26(01), 76-95.
Parfit, D. (1984). Reasons and persons. Oxford University Press.
Singer, P. (1993). Practical ethics (2nd ed.). Cambridge: Cambridge University Press.
Strodach, G. K. (Ed.). (1963). The Philosophy of Epicurus: Letters, doctrines, and parallel passages from Lucretius. Northwestern University Press.
Tännsjö, T. (2004). Why we ought to accept the repugnant conclusion. In Tännsjö, T. & Ryberg, J. (Eds.), The Repugnant Conclusion. Library of Ethics and Applied Philosophy, vol. 15. Dordrecht: Springer.
The last sentence may seem contradictory to modern ears, but Epicurus goes on to explain his idiosyncratic use of the term “pleasure:”
“Thus when I say that pleasure is the goal of living I do not mean the pleasures of libertines or the pleasures inherent in positive enjoyment, as is supposed by certain persons who are ignorant of our doctrine or who are not in agreement with it or who interpret it perversely. I mean, on the contrary, the pleasure that consists in freedom from bodily pain and mental agitation.”
The post The Case for Suffering-Focused Ethics appeared first on Center on Long-Term Risk.
The post How the Simulation Argument Dampens Future Fanaticism appeared first on Center on Long-Term Risk.
Some effective altruists assume that most of the expected impact of our actions comes from how we influence the very long-term future of Earth-originating intelligence over the coming ~billions of years. According to this view, helping humans and animals in the short term matters, but it mainly only matters via effects on far-future outcomes.
There are a number of heuristic reasons to be skeptical of the view that the far future astronomically dominates the short term. This piece zooms in on what I see as perhaps the strongest concrete (rather than heuristic) argument why short-term impacts may matter a lot more than is naively assumed. In particular, there's a non-trivial chance that most of the copies of ourselves are instantiated in relatively short-lived simulations run by superintelligent civilizations, and if so, when we act to help others in the short run, our good deeds are duplicated many times over. Notably, this reasoning dramatically upshifts the relative importance of short-term helping even if there's only a small chance that Nick Bostrom's basic simulation argument is correct.
My thesis doesn't prove that short-term helping is more important than targeting the far future, and indeed, a plausible rough calculation suggests that targeting the far future is still several orders of magnitude more important. But my argument does leave open uncertainty regarding the short-term-vs.-far-future question and highlights the value of further research on this matter.
The question is whether one can get more value from controlling structures that — in an astronomical-sized universe — are likely to exist many times, than from an extremely small probability of controlling the whole thing.
--steven0461
One of the ideas that's well accepted within the effective-altruism community but rare in the larger world is the immense importance of the far-future effects of our actions. Of course, many environmentalists are concerned about the future of Earth, and people in past generations have started projects that would not finish in their lifetimes. But it's rare for in-the-trenches altruists, rather than just science-fiction authors and cosmologists, to consider the effects of their actions on sentient beings that will exist billions of years from now.
Future focus is extremely important, but it can at times be exaggerated. It's sometimes thought that the far future is so important that the short-term effects of our actions on the welfare of organisms alive today are completely negligible by comparison, except for instrumental reasons insofar as short-term actions influence far-future outcomes. I call this "far-future fanaticism", in analogy with the "fanaticism problem" discussed in Nick Bostrom's "Infinite Ethics" (sec. 4.3). I probably believed something along these lines from ~2006 to ~2013.
However, as with almost everything else in life, the complete picture is more complicated. We should be extremely suspicious of any simple argument which claims that one action is, say, 10^30 times more important than another action, e.g., that influencing the far future is 10^30 times more important than influencing the near term. Maybe that's true, but reality is often complex, and extraordinary claims of that type should not be accepted hastily. This is one of several reasons we should maintain modesty about whether working to influence the far future is vastly better than working to improve the wellbeing of organisms in the nearer term.
Dylan Matthews, like many others, has expressed skepticism about far-future fanaticism on the grounds that it smells of Pascal's mugging. I think far-future fanaticism is a pretty mild form of (mugger-less) Pascal's mugging, since the future fanatic's claim is vastly more probable a priori than the Pascal-mugger's claim. Still, Pascal's mugging comes in degrees, and lessons from one instance should transfer to others.1
The most popular resolution of Pascal's mugging on the original thread was that by Robin Hanson: "People have been talking about assuming that states with many people hurt have a low (prior) probability. It might be more promising to assume that states with many people hurt have a low correlation with what any random person claims to be able to effect."
ArisKatsaris generalized Hanson's idea to "The Law of Visible Impact": "Penalize the prior probability of hypotheses which argue for the existence of high impact events whose consequences nonetheless remain unobserved."
Eliezer Yudkowsky called this a "leverage penalty". However, he goes on to show how a leverage penalty against the possibility of helping, say, a googolplex people can lead you to disbelieve scenarios where you could have huge impact, no matter how much evidence you have, which seems possibly wrong.
In this piece, I don't rely on a general Hansonian leverage penalty. Rather, I use the simulation argument, which resembles the Hansonian leverage penalty in its effects, but it does so organically rather than in a forced way.
Yudkowsky says: "Conceptually, the Hansonian leverage penalty doesn't interact much with the Simulation Hypothesis (SH) at all." However, the two ideas act similarly and have a historical connection. Indeed, Yudkowsky discussed something like the simulation-argument solution to Pascal's mugging after hearing Hanson's idea:
Yes, if you've got 3↑↑↑↑3 people running around they can't all have sole control over each other's existence. So in a scenario where lots and lots of people exist, one has to penalize by a proportional factor the probability that any one person's binary decision can solely control the whole bunch.
Even if the Matrix-claimant says that the 3↑↑↑↑3 minds created will be unlike you, with information that tells them they're powerless, if you're in a generalized scenario where anyone has and uses that kind of power, the vast majority of mind-instantiations are in leaves rather than roots.
The way I understand Yudkowsky's point is that if the universe is big enough to contain 3↑↑↑↑3 people, then for every person who's being mugged by a genuine mugger with control over 3↑↑↑↑3 people, there are probably astronomical numbers of other people who are confronting lying muggers, pranks, hallucinations, dreams, and so on. So across the multiverse, almost all people who get Pascal-mugged can't actually save 3↑↑↑↑3 people, and in fact, the number of people who get fake Pascal-mugged is proportional to 3↑↑↑↑3. Hence, the probability of actually being able to help N people is roughly k/N for some constant k, so the expected value of giving in to the mugging remains finite regardless of how big N is.
However, this same kind of reasoning also works for Yudkowsky's "Pascal's Muggle" scenario in which a Matrix Lord opens "up a fiery portal in the sky" to convince a person that the Matrix Lord is telling the truth about a deal to save a googolplex lives for $5. But given that there's a huge amount of computing power in the Matrix Lord's universe, for every one Matrix Lord who lets a single person determine the fate of a googolplex people, there may be tons of Matrix Lords just faking it (whether for the lulz, to test the simulation software, or for some other reason). So the expected number of copies of a person facing a lying Matrix Lord should be proportional to a googolplex, and hence, the probability penalty that the Hansonian prior would have suggested seems roughly vindicated. Yudkowsky makes a similar point:
when it comes to improbability on the order of 1/3↑↑↑3, the prior improbability is inescapable - your sensory experiences can't possibly be that unique - which is assumed to be appropriate because almost-everyone who ever believes they'll be in a position to help 3↑↑↑3 people will in fact be hallucinating. Boltzmann brains should be much more common than people in a unique position to affect 3↑↑↑3 others, at least if the causal graphs are finite.
ArisKatsaris complains that Hanson's principle "seems to treat the concept of 'person' as ontologically fundamental", the way that other instances of Nick Bostrom-style anthropic reasoning do. But, with the simulation-argument approach, you can avoid this problem by just talking about exact copies of yourself, where a "copy" means "a physical structure whose high-level decision-making algorithms exactly mirror your own, such that what you decide to do, it also decides to do". A copy needn't (and in general doesn't) share your full environment, just your current sensory inputs and behavioral outputs for some (possibly short) length of time. Then Yudkowsky's argument is that almost all copies of you are confronting fake or imagined muggers.
We can apply the simulation anti-mugging argument to future fanaticism. Rather than being the sole person out of 3↑↑↑↑3 people to control the actions of the mugger, we on Earth in the coming centuries are, perhaps, the sole tens of billions of people to control the far future of Earth-originating intelligence, which might involve ~10^52 people, to use the Bostrom estimate quoted in Matthews's article. For every one biological human on the real Earth, there may be tons of simulated humans on simulated Earths, so most of our copies probably "are in leaves rather than roots", to use Yudkowsky's terminology.
Even if Earth-originating intelligence specifically doesn't run ancestor simulations, other civilizations may run simulations, such as when studying the origin of life on various planets, and we might be in some of those simulations. This is similar to how, even though a real Pascal-mugger might specify that all of the 3↑↑↑↑3 people that she will create will never think they're being Pascal-mugged, in the multiverse at large, there should be lots more people in various other circumstances who are fake Pascal-mugged.
Yudkowsky acknowledges the simulation possibility and its implications for future fanaticism:
If we don't take everything at face value, then there might be such things as ancestor simulations, and it might be that your experience of looking around the room is something that happens in 10^20 ancestor simulations for every time that it happens in 'base level' reality. In this case your probable leverage on the future is diluted (though it may be large even post-dilution).
If we think of ourselves as all our copies rather than a particular cluster of cells or transistors, then the simulation hypothesis doesn't decrease our probable leverage but actually increases it, especially the leverage from short-term actions, as is discussed below.
I first began thinking about this topic due to a post by Pablo Stafforini:
if you think there is a chance that posthumanity will run ancestor simulations [...], the prospect of human extinction is much less serious than you thought it was.
Since I'm a negative utilitarian, I would probably prefer for space not to be colonized, but Stafforini's point also has relevance for efforts to reduce the badness of the far future, not just efforts to prevent human extinction.
Robin Hanson makes a similar point:
if not many simulations last through all of human history, then the chance that your world will end soon is higher than it would be if you were not living in a simulation. So all else equal you should care less about the future of yourself and of humanity, and live more for today. This remains true even if you are highly uncertain of exactly how long the typical simulation lasts.
One response is to bite the simulation bullet and just focus on scenarios where we are in fact in basement-level reality, since if we are, we can still have a huge impact: "Michael Vassar - if you think you are Napoleon, and everyone that thinks this way is in a mental institution, you should still act like Napoleon, because if you are, your actions matter a lot."
A second response is to realize that actions focused on helping in the short term may be relatively more important than the future fanatic thought. Most simulations are probably short-lived, because one can run lots of short-lived simulations with the same computing resources as it takes to run a single long-lived simulation. Hedonic Treader: "Generally speaking, it seems that if you have evidence that your reality may be more short-lived than you thought, this is a good reason to favor the near future over the far future."
Note: This section is a more detailed version of an argument written here. Readers may find that presentation of the calculations simpler to understand.
This section presents a simplified framework for estimating the relative importance of short-term vs. far-future actions in light of the simulation argument. An example of an action targeted for short-term impact is changing ecosystems on Earth in order to reduce wild-animal suffering, such as by converting lawns to gravel. An example of a far-future-focused action is spreading the idea that it's wrong to run detailed simulations of ecosystems (whether for reasons of science, entertainment, or deep ecology) because of the wild-animal suffering they would contain. Of course, both of these actions affect both the short term and the far future, but for purposes of this analysis, I'll pretend that gravel lawns only prevent bugs from suffering in the short run, while anti-nature-simulation meme-spreading only helps prevent bugs from suffering in the long run. I'm trying to focus just on the targeted impact time horizon, but of course, in reality, even if the future fanatic is right, every short-term action has far-future implications, so no charity is 10^30 times more important than another one.
I'll assume that most of the suffering of the far future will be created by the computations that an advanced civilization would run. Rather than measuring computational capacity in FLOPS or some other conventional performance metric, I'll measure computations by how much sentience they contain in the form of the agents and subroutines that are being computed, with the unit of measurement being what I'll call a "sent". I define "sentience" as "morally relevant complexity of mental life". I compute the moral value (or disvalue) for an agent experiencing an emotion as
moral value = (sentience of the agent) * (how intense the agent would judge the emotion to be relative to evolutionarily/physiologically typical emotions for that agent) * (duration of the experience).
For example, if a human has sentience of 1 sent and a fly has sentience of 0.01 sents, then even if a fly experiences a somewhat more damaging event relative to its utility function, that event may get less moral weight.
Using units of sentience will help make later calculations easier. I'll define 1 sent-year as the amount of complexity-weighted experience of one life-year of a typical biological human. That is, consider the sentience over time experienced in a year by the median biological human on Earth right now. Then, a computational process that has 46 times this much subjective experience has 46 sent-years of computation.2 Computations with a higher density of sentience may have more sents even if they have fewer FLOPS.
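To make the bookkeeping concrete, here is a minimal sketch of these units in code; the sentience and intensity numbers are purely illustrative assumptions, not estimates I endorse:

```python
# A minimal sketch of the moral-value bookkeeping described above.
# The sentience and intensity numbers are illustrative assumptions only.

def moral_value(sentience_sents, relative_intensity, duration_years):
    """(sentience of the agent) * (intensity relative to typical emotions
    for that agent) * (duration of the experience)."""
    return sentience_sents * relative_intensity * duration_years

# A human enduring a typical-intensity bad experience for one day...
human = moral_value(sentience_sents=1.0, relative_intensity=1.0, duration_years=1 / 365)

# ...versus a fly enduring a somewhat worse-than-typical event for the same day.
fly = moral_value(sentience_sents=0.01, relative_intensity=2.0, duration_years=1 / 365)

print(human > fly)  # True: the fly's event gets less weight despite its higher relative intensity

# One sent-year is the complexity-weighted experience of one life-year of a typical
# biological human, so a process with 46 times that much subjective experience
# contains 46 sent-years.
print(46 * 1.0, "sent-years")
```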
Suppose there's a large but finite number C of civilizations that are about to colonize space. (If one insists that the universe is infinite, one can restrict the analysis to some huge but finite subset of the universe, to keep infinities from destroying math.) On average, these civilizations will run computations whose sentience is equivalent to that of N human-years, i.e., a computing capacity of N sent-years. So these civilizations collectively create the equivalent of C * N sent-years.
Some of these minds may be created by agents who want to feel intense emotions by immersing (copies of) themselves in experience-machines or virtual worlds. Also, we have much greater control over the experiences of a programmed digital agent than we do over present-day biological creatures.3 These factors suggest that influencing a life-year experienced by a future human might be many times more altruistically important than influencing a life-year experienced by a present-day human. The future, simulated human might have much higher intensity of experience per unit time, and we may have much greater control over the quality of his experience. Let the multiplicative factor T represent how much more important it is to influence a unit of sentience by the average future digital agent than a present-day biological one for these reasons. T will be in units of moral (dis)value per sent-year. If one thinks that a significant fraction of post-human simulations will be run for reasons of wireheading or intrinsically valuing intense experiences, then T may be much higher than 1, while if one thinks that most simulations would be run for purposes of scientific / historical discovery, then T would be closer to 1. T also counts the intensity and controllability of non-simulation subjective experiences. If a lot of the subjective experience in the far future comes from low-level subroutines that have fairly non-intense experiences, then T might be closer to 1.
Suppose that the amount of sentience on Earth in the near term (say, the next century or two) is some amount E sent-years. And suppose that some fraction fE of this sentience takes the form of human minds, with the rest being animals, other life forms, computers, and so on.
Some far-future simulations may contain just one richly computed mind in an otherwise superficial world. I'll call these "solipsist simulations". Many other simulations may contain several simulated people interacting but in a very limited area and for a short time. I'll neologize the adjective "solipsish" to refer to these simulations, since they're not quite solipsist, but because they have so few people, they're solipsist-ish. Robin Hanson paints the following picture of a solipsish simulation:
Consider, for example, a computer simulation of a party at the turn of the millennium created to allow a particular future guest to participate. This simulation might be planned to last only one night, and at the start be limited to the people in the party building, and perhaps a few people visible from that building. If the future guest decided to leave the party and wander the city, the simulated people at the party might be erased, to be replaced by simulated people that populate the street where the partygoer walks.
In contrast, a non-solipsish simulation is one in which most or all of the people and animals who seem to exist on Earth are actually being simulated to a non-trivial level of detail. (Inanimate matter and outer space may still be simulated with low levels of richness.)
Let fN be the fraction of computations run by advanced civilizations that are non-solipsish simulations of beings who think they're humans on Earth, where computations are measured in sent-years, i.e., fN = (sent-years of all non-solipsish sims who think they're humans on Earth)/(sent-years of all computations that are run in total). And let fC be the fraction of the C civilizations who actually started out as biological humans on Earth (rather than biological aliens).
I and most MIRI researchers have moved on from Bostrom-style anthropic reasoning, but Bostrom anthropics remains well known in the scholarly literature and is useful in many applications, so I'll first explore the implications of the simulation argument in this framework. In particular, I'll use the self-sampling assumption with the reference class of "humans who think they're on pre-colonization Earth". The total number of such humans is a combination of those who actually are biological organisms on Earth:
(number of real Earths) * (human sent-years per real Earth) = (C * fC) * (E * fE)
and those in simulations who think they're on Earth:
(number of advanced-civilization computations) * (fraction comprised of non-solipsish humans who think they're on Earth) = C * N * fN.
Note that Bostrom's strong self-sampling assumption samples randomly from observer-moments, rather than from sent-years, but assuming all humans have basically the same sentience, then sampling from sent-years should give basically the same result as sampling from observer-moments.
Horn #3 of Bostrom's simulation-argument trilemma can be seen by noting that as long as N/E is extremely large (reject horn #1) and fN / (fC * fE) is not correspondingly extremely tiny (reject horn #2), the ratio of simulated to biological humans will be very large:
(non-solipsish simulated human sent-years) / (biological human sent-years)
= (C * N * fN) / (C * fC * E * fE)
= (N/E) * fN / (fC * fE).
If you are sampled randomly from all (non-solipsish) simulated + biological human sent-years, the probability that you are a biological human, Pb, is
Pb = (biological human sent-years) / [(simulated human sent-years) + (biological human sent-years)] = (C * fC * E * fE) / [(C * N * fN) + (C * fC * E * fE)] = (fC * E * fE) / [(N * fN) + (fC * E * fE)].
If we are biological humans, then we're in a position to influence all of the N expected sent-years of computation that lie in our future, which will have, on average, higher intensity and controllability by a factor of T units of moral value per sent-year. On the other hand, it's much harder to reliably influence the far future, because there are so many unknowns and so many intervening steps in the causal chain between what we do now and what happens centuries or gigayears from now. Let D be a discount representing how much harder it is to actually end up helping a being in the far future than in the near term, due to both uncertainty and the muted effects of our actions now on what happens later on.
If we are biological humans, then targeting the far future can affect N expected sent-years with intensity multiple of T, but with discount D, for an expected impact proportional to N * T * D.4 On the other hand, if we target the short term, we can help the sentience currently on Earth, with an impact proportional to E.5
However, actions targeting the far future only matter if there is a far future. In most simulations, the future doesn't extend very far, because simulating a long post-human civilization would be extremely computationally expensive. For example, emulating a planet-sized computer in a simulation would probably require at least a planet-sized computer to run the simulation. As an approximation, let's suppose that actions targeting far-future impact only matter if we're biological humans on an actual Earth. Then the expected impact of far-future actions is proportional to Pb * N * T * D. Let's call this quantity "L" for "long-term impact". In contrast, actions targeting the short term make a difference whether we're simulated or not, as long as the simulation runs for at least a few decades and includes most animals on Earth. So the expected impact of short-term-focused actions is just E. Let's call our expected impact for short-term actions S.
The ratio of these two quantities is L / S = Pb * N * T * D / E.
The following picture shows a cartoon example of the framework I'm using here. I haven't yet defined all the variables that you see in the upper left corner, but they'll be explained soon.
Note that N = 6.5 * E and fN = (3/26) * fE. By inspecting the picture, we can see that Pb should be 1/4, since there's one real Earth and three simulated versions. As hoped, our formula for Pb verifies this:
Pb = (fC * E * fE) / [(N * fN) + (fC * E * fE)] = (1/4 * E * fE) / [(6.5 * E * 3/26 * fE) + (1/4 * E * fE)] = (1/4) / [(6.5 * 3/26) + (1/4)] = 1/4.
And L / S = Pb * N * T * D / E = (1/4) * 6.5 * T * D = 1.6 * T * D.
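For readers who want to check the cartoon numbers themselves, here is a minimal calculation using the formulas above. The specific values fC = 1/4, N = 6.5 * E, and fN = (3/26) * fE come from the picture; E and fE are arbitrary placeholders because they cancel out:

```python
# A minimal numeric check of the cartoon example. The values fC = 1/4, N = 6.5*E,
# and fN = (3/26)*fE come from the picture; E and fE cancel, so any positive values work.

E, fE = 1.0, 0.5      # placeholder values (assumptions; they drop out of Pb)
fC = 1 / 4            # one of the four almost-space-colonizing civilizations is Earth
N = 6.5 * E           # average far-future computation per civilization, in sent-years
fN = (3 / 26) * fE    # fraction of computation that is non-solipsish Earth simulations

Pb = (fC * E * fE) / (N * fN + fC * E * fE)
print(Pb)             # 0.25: one real Earth out of four Earth-copies, as expected

T, D = 1.0, 1.0       # left symbolic in the text; L/S scales linearly in both
print(Pb * N * T * D / E)  # 1.625, which the text rounds to 1.6
```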
Note that in the actual picture, Earth has 8 squares of far-future computation ahead of it, but N/E is only 6.5. That's because N/E is an average across civilizations, including some that go extinct before colonizing space. But an average like this seems appropriate for our situation, because we don't know ex ante whether humanity will go extinct or how big humanity's computing resources will be compared with those of other civilizations.
Now I'll redo the calculation using a framework that doesn't rely on the self-sampling assumption. Rather, it takes inspiration from anthropic decision theory. You should think of yourself as all your copies at once. Rather than thinking that you're a single one of your copies that might be biological or might be simulated, you should think of yourself as both biological and simulated, since your choices affect both biological and simulated copies of you. The interesting question is what the ratio is of simulated to biological copies of you.
When there are more total copies of Earth (whether biological or simulated), there will be more copies of you. In particular, suppose that some constant fraction fy of all non-solipsish human sent-years (whether biological or simulated) are copies of you. This should generally be roughly the case, because a non-solipsish simulation of Earth-in-the-year-2016 should have ~7 billion humans in it, one of whom is you.
Then the expected number of biological copies (actually, copy life-years) of you will be fy * C * fC * E * fE, and the expected number of simulated copy life-years will be fy * C * N * fN.6
Now suppose you take an action to improve the far future. All of your copies, both simulated and biological, take this action, although it only ends up mattering for the biological copies, since only they have a very long-term future. For each biological copy, the expected value of the action is proportional to N * T * D, as discussed in the previous subsection. So the total value of having all your copies take the far-future-targeting action is proportional to
L = (number of biological copies of you) * (expected value per copy) = (fy * C * fC * E * fE) * (N * T * D).
In contrast, consider taking an action to help in the short run. This helps whether you're biological or non-solipsishly simulated. The expected value of the action for each copy is proportional to E, so the total value across all copies is proportional to
S = (number of biological + non-solipsish simulated copies of you) * (expected value per copy) = (fy * C * fC * E * fE + fy * C * N * fN) * E.
Then we have
L / S = [ (fy * C * fC * E * fE) * (N * T * D) ] / [ (fy * C * fC * E * fE + fy * C * N * fN) * E ].
Interestingly, this exactly equals Pb * N * T * D / E, the same ratio of far-future vs. short-term expected values that we calculated using the self-sampling assumption.
Simplifying the L/S expression above:
L/S = [N * T * D / E] * (fC * E * fE) / [(fC * E * fE) + (N * fN)] = T * D * fC / (fC * E/N + fN/fE).
Note that this ratio is strictly less than T * D * fC / (fN/fE), which is a quantity that doesn't depend on N. Hence, we can't make L/S arbitrarily big just by making N arbitrarily big.
Let fX be the average fraction of superintelligent computations devoted to non-solipsishly simulating the development of any almost-space-colonizing civilization that actually exists in biological form, not just humans on Earth. fN is the fraction of computations devoted to simulating humans on Earth in particular. If we make the simplifying assumption that the fraction of simulations of humans on Earth run by the collection of all superintelligences will be proportional to the fraction of humans out of all civilizations in the universe, then fN = fX * fC. This would be true if, for example, simulators had no special interest in simulating humans on Earth as opposed to other almost-space-colonizing civilizations.
Making this assumption, we have
L/S = T * D * fC / (fC * E/N + fX * fC/fE)
= T * D / (E/N + fX/fE).
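As a sanity check on the algebra in this section, the following sketch verifies symbolically (using Python's sympy) that the anthropic-decision-theory ratio equals Pb * N * T * D / E and reduces to the expressions above. Variable names mirror those in the text:

```python
# A symbolic check (using sympy) of the algebra in this section.
import sympy as sp

fy, C, fC, E, fE, N, fN, T, D, fX = sp.symbols('f_y C f_C E f_E N f_N T D f_X', positive=True)

L = (fy * C * fC * E * fE) * (N * T * D)          # value of far-future-targeted action
S = (fy * C * fC * E * fE + fy * C * N * fN) * E  # value of short-term-targeted action
Pb = (fC * E * fE) / (N * fN + fC * E * fE)       # self-sampling probability of being biological

# L/S equals the ratio derived under the self-sampling assumption...
assert sp.simplify(L / S - Pb * N * T * D / E) == 0
# ...simplifies to T*D*fC / (fC*E/N + fN/fE)...
assert sp.simplify(L / S - T * D * fC / (fC * E / N + fN / fE)) == 0
# ...and, with the assumption fN = fX * fC, becomes T*D / (E/N + fX/fE).
assert sp.simplify((L / S).subs(fN, fX * fC) - T * D / (E / N + fX / fE)) == 0
```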
Non-solipsish simulations of the dominant intelligences on almost-space-colonizing planets also include the (terrestrial or extraterrestrial) wild animals on the same planets. Assuming that the ratio of (dominant-intelligence biological sent-years)/(all biological sent-years) on the typical almost-space-colonizing planet is approximately fE, then fX / fE would approximately equal the fraction of all computational sent-years spent non-solipsishly simulating almost-space-colonizing ancestral planets (both the most intelligent and also less intelligent creatures on those planets). I'll call this fraction simply F. Then
L/S = T * D / (E/N + F).
Visualized using the picture from before, fN/fE is the fraction of squares with Earths in them, and F is the fraction of squares with any planet in them.
Everyone agrees that E/N is very small, perhaps less than 10^-30 or something, because the far future could contain astronomical amounts of sentience. If F is not nearly as small (and I would guess that it's not), then we can approximate L/S as T * D / F.
Now that we have an expression for L/S, we'd like to know whether it's vastly greater than 1 (in which case the far-future fanatics are right), vastly less than 1 (in which case we should plausibly help beings in the short run), or somewhere in the ballpark of 1 (in which case the issue isn't clear and needs more investigation). To do this, we need to plug in some parameters.
Here, I'll plug in point estimates of T, D, and F, but doing this doesn't account for uncertainty in their values. Formally, we should take the full expected value of L with respect to the probability distributions of T and D, and divide it by the full expected value of S with respect to the probability distribution for F. I'm avoiding that because it's complicated to make up complete probability distributions for these variables, but I'm trying to set my point estimates closer to the variables' expected values than to their median values. Our median estimates of T, D, and F are probably fairly different from the expected values, since extreme values may dominate the expected-value calculations. For this reason, I've generally set the parameter point estimates higher than I actually think is reasonable as a median estimate. And of course, your own estimates may be pretty different.
D = 10^-3
This is because (a) it's harder to know if a given action now will actually have a good impact in the long term than it is to know that a given action will have a good impact in the short term and (b) while a single altruist in the developed world can exert more than a ~1/(7 billion) influence on all the sentience on Earth right now (such as by changing the amount of wilderness that exists), a single person may exert less than that amount of influence on the sentience of the far future, because there will be generations after us who may have different values and may override our decisions.
In particular, for point (a), I'm assuming a ~0.1 probability discount, because, for example, while it's not implausible to be 75% confident that a certain action will reduce short-run wild-animal populations (with a 25% chance of increasing them, giving a probability discount of 75% - 25% = 50%), on many far-future questions, my confidence of making a positive rather than negative impact is more like 53% (for a probability discount of 53% - 47% = 6%, which is about 10 times smaller than 50%).
For point (b), I'm using a ~0.01 probability discount because there may be generations ahead of us before the emergence of artificial general intelligence (AGI), and even once AGI arrives, it's not clear that the values of previous humans will translate into the values of the AGI, nor that the AGI will accomplish goal preservation without further mutation of those values. Maybe goal preservation is very difficult to implement or is strategically disfavored by a self-improvement race against aliens, so that the changes to the values and trajectory of AGI we work toward now will be overridden thousands or millions of years later. (Non-negative utilitarians who consider preventing human extinction to be important may not discount as much here because preventing extinction doesn't have the same risk of goal/institutional/societal drift as trying to change the future's values or general trajectory does.)
T = 10^4
Some simulations run by superintelligences will probably have extremely intense emotions, but many (especially those run for scientific accuracy) will not. Even if only an expected 0.01% of the far future's sent-years consist of simulations that are 10^8 times as intense per sent-year as average experiences on Earth, we would still have T ≈ 10^4.
F = 10^-6
It's very unclear how many simulations of almost-space-colonizing planets superintelligences would run. The fraction of all computing resources spent on this might be close to 100% or might be below 10^-15. It's hard to predict resource allocation by advanced civilizations. But I set this parameter based on assuming that ~10^-4 of sent-years will go toward ancestor simulations of some sort (this is probably too high, but it's biased upward in expectation, since, e.g., maybe there's a 0.05% chance that post-humans devote 20% of sent-years to ancestor simulations), and only 1% of those simulations will be of the almost-space-colonizing period (since there might also be many simulations of the origin of life, prehistory, and the early years after a planet's "singularity"). If we think that simulations contain more sentience per petaflop of computation than do other number-crunching calculations, then 10^-4 of sent-years devoted to ancestor simulations of some kind may mean less than 10^-4 of all raw petaflops devoted to such simulations.
Calculation using point estimates
Using these inputs, we have
L/S ≈ T * D / F = 10^4 * 10^-3 / 10^-6 = 10^7.
This happens to be bigger than 1, which suggests that targeting the far future is still ~10 million times better than targeting the short term. But this calculation could have come out as less than 1 using other possible inputs. Combined with general model uncertainty, it seems premature to conclude that far-future-focused actions dominate short-term helping. It's likely that the far future will still dominate after more thorough analysis, but by much less than a naive future fanatic would have thought.
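For readers who want to vary the inputs, here is a minimal version of the point-estimate calculation, composing T, D, and F from the sub-assumptions stated above; all of these numbers are rough guesses rather than measurements:

```python
# A minimal version of the point-estimate calculation. All numbers are rough,
# deliberately-high-in-expectation guesses rather than measurements.

# D: ~0.1 discount for sign uncertainty about far-future effects, times ~0.01 for
# dilution by later generations and possible value drift before and after AGI.
D = 0.1 * 0.01        # = 1e-3

# T: even if only 0.01% of far-future sent-years are simulations ~1e8 times as
# intense per sent-year as experience on Earth, T is already ~1e4.
T = 1e-4 * 1e8        # = 1e4

# F: ~1e-4 of sent-years in ancestor simulations of some kind, of which ~1% cover
# the almost-space-colonizing period.
F = 1e-4 * 1e-2       # = 1e-6

E_over_N = 1e-30      # "very small"; negligible next to F for these inputs

print(f"L/S ~= {T * D / (E_over_N + F):.1e}")  # ~1e7
```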
No. My argument works as long as one assigns at least a modest probability (say, 1% or even 0.01%) to the simulation hypothesis being correct.
If one entirely rejects the possibility of simulations of almost-space-colonizing civilizations, then F = 0. In that case, L/S = T * D / (E/N + F) = T * D * N / E, which would be astronomically large because N/E is astronomically large. So if we were certain that F = 0 (or even that F was merely on the order of E/N in size), then we would return to future fanaticism. But we're not certain of this, and our impact doesn't become irrelevant if F > 0. Indeed, the more simulations of us there are, the more impact we have by short-term-targeting actions!
Let's call a situation where F is on the order of E/N in size or smaller the "tiny_F" possibility, and the situation where F is much bigger than E/N the "moderate_F" possibility. The expected value of S, E[S], is
E[S | tiny_F] * P(tiny_F) + E[S | moderate_F] * P(moderate_F)
and similarly for E[L]. While it's true that E[S | tiny_F] is quite small, because in that case we don't have many copies in simulations, E[S | moderate_F] is bigger. Indeed,
E[L] / E[S] = E[L] / { E[S | tiny_F] * P(tiny_F) + E[S | moderate_F] * P(moderate_F) }
≤ E[L] / { E[S | moderate_F] * P(moderate_F) }
≈ E[L | moderate_F] / { E[S | moderate_F] * P(moderate_F) },
where the last line assumes that L isn't drastically affected by the value of F. This last expression is very roughly like (L/S) / P(moderate_F), where L/S is computed by plugging in some moderate value of F like I did with my sample numbers above. So unless you think P(moderate_F) is extremely small, the overall E[L]/E[S] ratio won't change dramatically upon considering the possibility of no simulations.
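To illustrate the size of this effect numerically, the sketch below applies the bound above to the earlier point estimate of L/S; the probabilities for moderate_F are made-up assumptions:

```python
# A numeric illustration of the bound above: uncertainty about whether simulations of
# almost-space-colonizing planets exist at all rescales E[L]/E[S] by roughly
# 1 / P(moderate_F). The probabilities below are made-up assumptions.

ls_given_moderate_F = 1e7  # the L/S point estimate computed earlier in the piece

for p_moderate_F in (0.9, 0.5, 0.1):
    print(f"P(moderate_F) = {p_moderate_F}: E[L]/E[S] is roughly at most "
          f"{ls_given_moderate_F / p_moderate_F:.1e}")
# Even at P(moderate_F) = 0.1, the ratio moves by only about one order of magnitude.
```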
I've heard the following defense made of future fanaticism against simulations:
This reply might work if you only consider yourself to be a single one of your copies. But if you correctly realize that your cognitive algorithms determine the choices of all of your copies jointly, then it's no longer true that short-term-focused efforts don't have astronomical impacts, because there are, in expectation, astronomical numbers of simulated copies of you in which your good deeds are replicated.
This objection suggests that horn #1 of Bostrom's trilemma may be true. If almost all technological civilizations fail to colonize space -- whether because they destroy themselves or because space colonization proves infeasible for some reason -- this would indeed dramatically reduce the number of advanced computations that get run, i.e., N would be quite small.
I find this possibility unlikely, since it seems hard to imagine why basically all civilizations would destroy themselves, given that humanity appears like it has a decent shot at colonizing space. Maybe it's more likely that there are physical/technological limitations on massive space colonization.
But if so, then the far future probably matters a lot less than it seems, either because humanity will go extinct before long or because, even if humans do survive, they won't create astronomical numbers of digital minds. Both of these possibilities downplay future fanaticism. Maybe the far future could matter quite a bit more than the present if humanity survives another ~100 million years on Earth, but without artificial general intelligence and robust goal preservation, it seems much harder to ensure that what we do now will have a reliable impact for millions of years to come (except in a few domains, like maybe affecting CO2 emissions).
In the previous argument, I assumed that copies of us that live in simulations don't have far futures ahead of them because their simulations are likely to end within decades, centuries, or millennia. But what if the simulations are very long-lived?
It seems unlikely a simulation could be as long-lived as the basement-level civilization, since it's plausible that simulating X amount of computations in the simulation requires more than X basement computations. But we could still imagine, for example, 2 simulations that are each 1/5 as big as the basement reality. Then aiming for far-future impact in those simulations would still be pretty important, since our copies in the simulations would affect 2 far futures each 1/5 as long as the basement's far future.
Note that my argument's formalism already accounts for this possibility. F is the fraction of far-future computations that simulate almost-space-colonizing planets. Most of the far future is not at the almost-space-colonizing stage but at the space-colonizing stage, so most computations simulating far-future outcomes don't count as part of F. For example, suppose that there's a basement reality that simulates 2 far-future simulations that each run 1/5 as long as the basement universe runs. Suppose that pre-space-colonizing planets occupy only 10^-20 of all sentience in each of those simulations. Ignoring the non-simulation computations also being run, that means F = 10^-20, which is very close to 0. So the objection that the simulations that are run might be very long can be reduced to the objection that F might be extremely close to zero, which I discussed previously. The generic reply is that it seems unreasonable to be confident that F is so close to zero, and it's quite plausible that F is much bigger (e.g., 10^-10, 10^-5, or something like that). If F is bigger, short-term impact is replicated more often and so matters relatively more.
I would expect some distribution of lengths of simulations, perhaps following a power law. If we look at the distribution of lengths of threads/processes that run on present-day computers, or how long companies survive, or almost anything similar, we tend to find a lot of short-lived things and a few long-lived things. I would expect simulations to be similar. It seems unreasonable to think that across all superintelligences in the multiverse, few short-lived simulations are run and the majority of simulations are long.
Another consideration is that if the simulators know the initial conditions they want to test with the simulation, then allowing the simulation to run longer might mean that it increasingly diverges from reality as time goes on and errors accumulate.
Also, if there are long-lived simulations, they might themselves run simulations, and then we might have short-lived copies within those nested simulations. As the number of levels of simulation nesting goes up, the length (and/or computational complexity) of the nested simulations must go down, because less and less computing power is available (just like less and less space is available for the innermost matryoshka dolls).
If the far future was simulated and the number and/or complexity of nested simulations wasn't progressively reduced as the level of nesting increased, then running simulations beyond the point when simulations became feasible would require an explosion of computing power:
The creators of the simulation would likely not continue it past the point in history when the technology to create and run these simulations on a widespread basis was first developed. [...] Another reason is to avoid stacking of simulations, i.e. simulations within simulations, which would inevitably at some point overload the base machine on which all of the simulations are running, thereby causing all of the worlds to disappear. This is illustrated by the fact that, as Seth Lloyd of MIT has noted in his recent book, Programming the Universe, if every single elementary particle in the real universe were devoted to quantum computation, it would be able to perform 10^122 operations per second on 10^92 bits of information. In a stacked simulation scenario, where 10^6 simulations are progressively stacked, after only 16 generations, the number of simulations would exceed by a factor of 10^4 the total number of bits of information available for computation in the real universe.
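As a quick check of the arithmetic in this quoted passage, using Lloyd's figures as given:

```python
# A quick check of the arithmetic in the quoted passage, using Lloyd's figures as given.

sims_per_level = 1e6     # 10^6 simulations stacked at each level
generations = 16
total_sims = sims_per_level ** generations   # (10^6)^16 = 10^96

bits_available = 1e92    # Lloyd's estimate of the bits available for computation
print(total_sims / bits_available)           # ~1e4, matching "a factor of 10^4"
```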
The period when a civilization is almost ready to colonize space seems particularly interesting for simulators to explore, since it crucially affects how the far future unfolds. So it would make sense that there would be more simulations of the period around now than there would be of the future 1 million years from now, and many of the simulations of the 21st century would be relatively short.
Beyond these qualitative arguments, we can make a quantitative argument as to why the far future within simulations shouldn't dominate: A civilization with N sent-years of computing power in its far future can't produce more than N sent-years of simulated far-future sentience, even if it only ran simulations and had no simulation overhead (i.e., a single planet-sized simulated computer could be simulated with only a single planet-sized real computer). More likely, a civilization with N sent-years of computing power would only run like N/100 sent-years of simulated far-future sentience, or something like that, since probably it would also want to compute things besides simulations. So what's at stake with influencing the "real" far future is probably much bigger than what's at stake influencing the simulated far future. (Of course, simulated far futures could be bigger if we exist in the simulations of aliens, not just our own civilization. But unless we in particular are extremely popular simulation targets, which seems unlikely a priori, then in general, across the multiverse, the total simulated far futures that we control should be less than the total real far futures that we control.) Of course, a similar point applies to simulations of short-term futures: The total sent-years in all short-term futures that we control is very likely less than the total sent-years in the far futures we control (assuming we have copies both in simulations and in basement realities). The argument as to why short-term helping might potentially beat long-term helping comes from our greater ability to affect the short term and know that we're making a positive rather than negative short-term impact. Without the D probability penalty for far-future actions, it would be clear that L > S within my framework.
What if the basement universe has unbounded computing power and thus has no limitations on how long simulations can be? And what if simulations run extremely quickly, so there's no reason not to run a whole simulated universe from the big bang until the stars die out? Even then, it's not clear to me that we wouldn't get mostly short-lived simulations, especially if they're being run for reasons of intrinsic value. For every one long-lived simulation, there might be millions or quadrillions of short-lived ones.
However, one could make the argument that if the basement-level simulators are only interested in science, then rather than running short simulations (except when testing their simulation software), they might just run a bunch of long simulations and then look at whatever part of a long simulation is of interest at any given time. Indeed, they might run all possible histories of universes with our laws of physics, and once that complete collection was available to them, they wouldn't need to run any more simulations of universes with our physical laws. Needless to say, this possibility is extremely speculative. Maybe one could argue that it's also extremely important because if this scenario is true, then there are astronomical numbers of copies of us. But there are all kinds of random scenarios in which one can raise the stakes in order to try to make some obscure possibility dominate. That is, after all, the point of the original Pascal's-mugging thought experiment. In contrast, I don't consider the simulation-based argument I'm making in this piece to be a strong instance of Pascal's mugging, because it actually seems reasonably likely that advanced civilizations will run lots of simulations of people on Earth.
In any case, even if it's true that the basement universe has unbounded computing resources and has run simulations of all possible histories of our universe, this doesn't escape my argument. The simulations run by the basement would be long-lived, yes. But those simulations would plausibly contain nested simulations, since the advanced civilizations within those simulations would plausibly want to run their own simulations. Hence, most of our copies would live in the nested simulations (i.e., simulations within simulations), and the argument in this piece would go through like before. The basement simulators would be merely like deist gods who set our universe in motion and then let it run on its own indefinitely.
Even if a copy of you lives in a short-lived simulation, it might have a causal impact well beyond the simulation. Many simulations may be run for reasons of scientific discovery, and by learning things in our world, we might inform our simulators of those things, thereby having a massive impact.
I find this a weak argument for several reasons.
I'm quite confident that I would care about simulated humans. If you don't think you would, then you're also less likely to care about the far future in general, since in many far-future scenarios, especially those that contain the most sentient beings, most intelligence is digital (or, at least, non-biological; it could be analog-computed).
If you think it's a factual rather than a moral question whether simulations are conscious, then you should maintain some not-too-small probability that simulations are conscious and downweight the impact your copies would have in simulations accordingly. As long as your probability of simulations being conscious is not tiny, this shouldn't change the analysis too much.
If you have moral uncertainty about whether simulations matter, the two-envelopes problem comes to haunt you. But it's plausible that the faction of your moral parliament that cares about simulations should get some influence over how you choose to act.
In a post defending the huge importance of the far future, steven0461 anticipates the argument discussed in this piece:
the idea that we’re living in an ancestor simulation. This would imply astronomical waste was illusory: after all, if a substantial fraction of astronomical resources were dedicated toward such simulations, each of them would be able to determine only a small part of what happened to the resources. This would limit returns. It would be interesting to see more analysis of optimal philanthropy given that we’re in a simulation, but it doesn’t seem as if one would want to predicate one’s case on that hypothesis.
But I think we should include simulation considerations as a strong component of the overall analysis. Sure, they're weird, but so is the idea that we can somewhat reliably influence the Virgo-Supercluster-sized computations of a posthuman superintelligence, which is the framework that the more persuasive forms of future fanaticism rely on.
This objection is abstruse but has been mentioned to me once. Some have proposed weighing the moral value of an agent in proportion to the Kolmogorov complexity of locating that agent within the multiverse. For example, it's plausibly easier to locate a biological human on Earth than it is to locate any particular copy of that human in a massive array of post-human simulations. The biological human might be specified as "the 10,481,284,089th human born7 since the year that humans call AD 0, on the planet that started post-human civilization", while the simulated version of that human might be "on planet #5,381,320,108, in compartment #82,201, in simulation #861, the 10,481,284,089th human born since the year that the simulated humans call AD 0". (These are just handwavy illustrations of the point. The actual descriptions would need vastly greater precision. And it's not completely obvious that some of the ideas I wrote with text could be specified compactly.) The shortest program that could locate the simulated person is, presumably, longer than the shortest program that could locate the biological person, so the simulated person (and, probably, the other beings in his simulated world) get less moral weight. Hence, the astronomical value of short-term helping due to the correlated behavior of all of that person's copies is lower than it seems.
However, a view that gives generally lower moral weight to future beings in this way should also give lower moral weight to the other kinds of sentient creatures that may inhabit the far future, especially those that are not distinctive enough to be located easily. So the importance of influencing the far future is also dampened by this moral perspective. It's not obvious and would require some detailed calculation to assess how this location-penalty approach affects the relative importance of short-term vs. far-future helping.
earthwormchuck163: "I'm not really sure that I care about duplicates that much."8 Applied to the simulation hypothesis, this suggests that if there are many copies of you helping other Earthlings across many simulations, since you and the helped Earthlings have the same brain states in the different simulations, those duplicated brain states might not matter more than a single such brain state. In that case, your ability to help tons of copies in simulations via short-term-focused actions would be less important. For concreteness, imagine that there are 1000 copies of you and the people you're helping across 1000 simulations. If you don't think several copies matter morally more than one copy, then the amount of good your short-term helping does will be divided by 1000 relative to a view that cares about each of the 1000 copies.
How about aiming to influence the far future? If all the morally relevant computations in the far future are duplicated about 1000 times, then the value of aiming to influence the far future is also about 1000 times less than what it would be if you cared about each copy individually. However, it's possible that the far future will contain more mind diversity. For example, maybe some civilizations would explicitly aim to make each posthuman mind somewhat unique in order to avoid repetitiveness. In this case, perhaps altruism targeting the far future would appear somewhat more promising than short-term helping if one holds the view that many mind copies only matter as much as one mind.
My main response is that I find it wrong to consider many copies of a brain not much more important than a single brain. This just seems intuitive to me, but it's reinforced by Bostrom's reductio:
if the universe is indeed infinite then on our current best physical theories all possible human brain-states would, with probability one, be instantiated somewhere, independently of what we do. But we should surely reject the view that it follows from this that all ethics that is concerned with the experiential consequences of our actions is void because we cannot cause pain, pleasure, or indeed any experiences at all.
Another reply is to observe that whether a brain counts as a duplicate is a matter of opinion. If I run a given piece of code on my laptop here, and you run it on your laptop on the other side of the world, are the two instances of the software duplicates? Yes in the sense that the high-level logical behavior is the same. No in the sense that they're running on different chunks of physics, at different spatiotemporal locations, in the proximity of different physical objects, etc. Minds have no non-arbitrary boundaries, and the "extended mind" of the software program, including the laptop on which it's running and the user running it, is not identical in the two cases.
Finally, it's plausible that most simulations would have low-level differences between them. It's unlikely that simulations run by two different superintelligent civilizations will be exactly the same down to the level of every simulated neuron or physical object. Rather, I conjecture that there would be lots of random variation in the exact details of the simulation, but assuming your brain is somewhat robust to variations in whether one random neuron fires or not at various times, then several slightly different variations of a simulation can have the same high-level input-output behavior and thus can all be copies of "you" for decision-theoretic purposes. There would presumably also be variations in the simulations run within a single superintelligent civilization, since there's no scientific need to re-run duplicative simulations of the exact same historical trajectory down to the level of every neuron in every person being identical, except maybe for purposes of debugging the simulation or replicating/verifying past scientific findings.
Of course, perhaps the view that "many copies don't count much more than one copy" would say that near copies also don't count much more than one copy. This view is vulnerable to potential reductios, such as the idea that if two identical twins who have had very similar life experiences suffer the same horrible death, it's less bad than if two very different people suffer different but similarly horrible deaths. (Of course, perhaps some philosophers would bite this bullet.)
This is an important and worrying consideration. For example, suppose you aim to prevent wild-animal suffering by reducing habitat and thereby decreasing wildlife populations. If the simulation includes models of the neurons of all animals but doesn't simulate inanimate matter in much detail, then by reducing wildlife numbers, we would save computing resources, which the simulators could use for other things. Worryingly, this might allow simulators to run more total simulations of Earth-like planets, most of the neurons on which are found in invertebrates who have short lives and potentially painful deaths.
If reducing wildlife by 10% allowed simulators to run 10% more total Earth simulations, then habitat reduction would sadly not reduce much suffering.9 But if a nontrivial portion of the computing power of Earth simulations is devoted to not-very-sentient processes like weather, an X% reduction in wild-animal populations reduces the computational cost of the whole simulation by less than X%. Also, especially if the simulations are being run for reasons of science rather than intrinsic value, the simulators may only need to run so many simulations for their purposes, and our making the simulations cheaper wouldn't necessarily cause the simulators to run more.10 The simulators might use those computing resources for other purposes. Assuming those other purposes would, on average, contain less suffering than exists in wilderness simulations, then reducing habitat could still be pretty valuable.
One might ask: If T > 1, then won't the non-Earth-simulation computations that can be run in greater numbers due to saving on habitat computations have a greater density of suffering, not less, than the habitat computations had? Not necessarily, because T gives the intensity of emotions per sent-year. But many of the computations that an advanced civilization would run might not contain much sentience.11 So the intensity of emotions per petaflop-year of non-Earth-simulation computation, rather than per sent-year, might be lower than T. Nonetheless, we should worry that this might not be true, in which case reducing habitat and thereby freeing up computing resources for our simulators would be net bad (at least for negative utilitarians; for classical utilitarians, replacing ecosystems that contain net suffering with other computations that may contain net happiness may be win-win).
It's also worth asking whether reducing net primary productivity on Earth would in fact save simulators' computing power. If the simulation is run in enough detail that invertebrate neurons are approximated, then the simulation may also be run in enough detail that, e.g., soil chemistry, ocean currents, and maybe even photons are also approximated. Even if the soil contains fewer earthworms and bacteria, it may contain just as many clay particles, water pockets, and other phenomena that still need to be modeled for the simulation to be realistic. Groundwater, for example, is a variable that humans monitor extensively, and its dynamics would need to be modeled accurately even if the ground contained no life. Still, much of the dry mass that composes organism bodies comes from the atmosphere (in the form of carbon dioxide), and it's not obvious to me whether an accurate Earth simulation would still need to model individual carbon-based molecules if they weren't captured by biological organisms. Nonetheless, these considerations about abiotic environmental factors suggest that in accurate simulations, possibly almost all computation is devoted to non-living physical processes. So, for example, maybe 99% of the computing resources in an Earth simulation model abiotic phenomena, in which case reducing plant productivity by 50% would only reduce the simulation's computational cost by 1% * 50% = 0.5%. This reduction in biological productivity would selectively reduce the most suffering-dense parts of the simulation, and unless the computations run using those computational savings would contain at least some extremely intense suffering, the reduction in biotic productivity would probably still be net good in terms of reducing suffering.
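To make the arithmetic in the previous paragraph explicit, here is a minimal sketch using the illustrative 99%/50% figures from above; both numbers are assumptions, not estimates:

```python
# Illustrative arithmetic for the 99% abiotic / 50% productivity-reduction example above.
abiotic_fraction = 0.99          # assumed share of the simulation's cost spent on non-living processes
biotic_fraction = 1 - abiotic_fraction
biotic_reduction = 0.50          # assumed reduction in biological productivity (e.g., via habitat loss)

total_cost_reduction = biotic_fraction * biotic_reduction
print(f"Compute saved for the simulators: {total_cost_reduction:.1%}")  # about 0.5%
```

Under these assumptions, halving biological productivity removes roughly half of the simulation's biotic (and most suffering-dense) computation while freeing only about 0.5% of the simulators' computing resources.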
It's also possible there are strategies to increase the computing cost of our simulation in ways that, unlike wildlife, don't contain lots of sentience. For example, monitoring deep-underground physical dynamics in more detail might force our simulators to compute those dynamics more carefully, which would waste computing cycles on not-very-sentient processes and reduce the amount of other, possibly suffering-dense computations our simulators could run.
Finally, keep in mind that some ways of reducing suffering, such as more humane slaughter of farm animals, can prevent lots of simulated copies of horrific experiences without appreciably changing how expensive our world is for our simulators to compute.
So far I've been assuming that if there are many copies of us in simulations, there are also a few copies of us in basement reality at various points in the multiverse. However, it's also possible that we're in a simulation that doesn't have a mirror image in basement-level reality. For instance, maybe the laws of physics in our simulated world are different from the basement's laws of physics, and there's no other non-simulated universe in the multiverse that shares our simulated laws of physics. Maybe our world contains miracles that the simulators have introduced. And so on. Insofar as there are scenarios in which we have copies in simulations but not in the basement (except for extremely rare Boltzmann-brain-type copies that may exist in some basement worlds, or extremely low-measure universes in the multiverse where specific miracles are hard-coded into the basement-level laws of physics), this amplifies the value of short-term actions, since we would be able to influence our many simulated copies but wouldn't have many, if any, basement copies who could affect the far future.
On the flip side, it's possible that basically all our copies are in basement-level reality and don't have exact simulated counterparts. One reason this might be the case is that it may simply be too hard to simulate a full person and her entire world in enough detail for the person's choices in the simulation to mirror those of the biological version. For example, maybe computationally intractable quantum effects prove crucial to the high-level dynamics of a human brain, and these are too expensive to mirror in silico.12 The more plausible we find this scenario, the less important short-term actions look. But as we've seen, unless this scenario has probability very close to 1, the ambiguity between whether it's better to focus on the short term or long term remains unresolved.
Even if all simulations were dramatically different from all basement civilizations, as long as some of the simulated creatures thought they were in the basement, the simulation argument would still apply. If most almost-space-colonizing organisms that exist are in simulations, then whatever algorithm your brain is running is most likely instantiated in one of those simulations rather than in a basement universe.
I'm still a bit confused about how to do anthropic reasoning when, due to limited introspection and bounded rationality, you're not sure which algorithm you are among several possible algorithms that exist in different places. But a naive approach would seem to be to apportion even odds among all the algorithms that you might be but can't distinguish between.
For example, suppose there are only two types of algorithms that you might be: (1) biological humans on Earth and (2) simulated humans who think they're on Earth who are all the same as each other but who are different than biological humans. This is illustrated in the following figure, where the B's represent biological humans and the S's represent the simulated humans who all share the same cognitive algorithm as each other.
Given uncertainty between whether you're a B or an S, you apportion 1/2 odds to being either algorithm. If you're a B, you can influence all N expected sent-years of computation in your future, while if you're an S, you can only influence E sent-years, but there are many copies of you. The calculation ends up being the same as in the "Calculation based on all your copies" section above, since
L = (probability you're a B) * (number of biological copies of you) * (expected value per copy) + (probability you're an S) * (no impact for future-focused work because there is no far future in a simulation) = (1/2) * (fy * C * fC * E * fE) * (N * T * D) + (1/2) * 0,
and
S = (probability you're a B) * (number of biological copies of you) * (expected value per copy) + (probability you're an S) * (number of non-solipsish simulated copies of you) * (expected value per copy) = (1/2) * (fy * C * fC * E * fE) * E + (1/2) * (fy * C * N * fN) * E.
L/S turns out to be exactly the same as before, after we cancel the factors of 1/2 in the numerator and denominator.13
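As a sanity check on the claim that the factors of 1/2 cancel, here is a minimal numerical sketch of the L and S formulas above. The parameter values are arbitrary placeholders, and the variable names simply mirror the formulas (their definitions appear earlier in the essay):

```python
# Placeholder values; only the structure of the formulas matters here.
fy, C, fC, E, fE = 0.1, 1e6, 0.5, 1e3, 0.2
N, T, D, fN = 1e12, 2.0, 0.5, 0.01

def L_over_S(p_B, p_S):
    # L: far-future value; S: short-term value, following the formulas above.
    L = p_B * (fy * C * fC * E * fE) * (N * T * D) + p_S * 0
    S = p_B * (fy * C * fC * E * fE) * E + p_S * (fy * C * N * fN) * E
    return L / S

# The ratio is identical whether or not the 1/2 anthropic weights are included,
# because they multiply both numerator and denominator.
print(L_over_S(0.5, 0.5))
print(L_over_S(1.0, 1.0))
```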
Next, suppose that all the simulated copies are different from one another, so that it's no longer the case that what one copy does, the rest do. In this case, there are lots of algorithms that you might be (labelled S_1, S_2, ... in the below figure), and most of them are simulated.
Now the probability that you're biological is just Pb, and the L/S calculation proceeds identically to what was done in the "Calculation using Bostrom-style anthropics and causal decision theory" section above.
So no matter how we slice things, we seem to get the exact same expression for L/S. I haven't checked that this works in all cases, but the finding seems fairly robust.
Since it is harder to vary the simulation detail in role-playing simulations containing real people [i.e., people are particularly expensive to simulate compared with coarse-grained models of inanimate objects], these simulations tend to have some boundaries in space and time at which the simulation ends.
Does consideration of simulations favor solipsist scenarios? In particular, it's possible to run ~7 billion times more simulations in which you are the only mind than simulations containing the world's entire human population. In those superintelligent civilizations where you are run a lot more than average, you have many more copies than normal. So should you be more selfish on this account, since other people (especially distant people whom you don't observe) may not exist?
Maybe slightly. Robin Hanson:
And your motivation to save for retirement, or to help the poor in Ethiopia, might be muted by realizing that in your simulation you will never retire and there is no Ethiopia.
However, we shouldn't give too much weight to solipsist simulations. Maybe there are some superintelligences that simulate just copies of you. But there may also be superintelligences that simulate just copies of other people and not you. Superintelligences that simulate huge numbers of just you are probably rare. In contrast, superintelligences that simulate a diverse range of people, one of which may be you, are probably a lot more common. So you may have many more non-solipsist copies than solipsist copies.
You may also have many solipsish copies, depending on the relative frequency of solipsish vs. non-solipsish simulations. Solipsish simulations that don't simulate (non-pet) animals in much detail can be much cheaper than those that do, so it's possible there are, say, 5 or 20 times as many solipsish simulations that omit animals as ones that contain animals. It's very hard to say exactly, since it depends on the relative usefulness or intrinsic value that various superintelligent simulators place on various degrees of simulation detail and realism. Still, as long as the number of animal-free solipsish simulations isn't many orders of magnitude higher than the number of animal-containing simulations, helping animals is still probably very important.
And the possibility of animal-free solipsish simulations doesn't dramatically upshift the importance of helping developing-world humans relative to helping animals, since in some solipsish simulations, developing-world humans don't exist either.
The possibility of solipsish simulations may be the first ever good justification for giving (slightly) more moral weight to those near to oneself and those one can observe directly.
Jaan Tallinn and Elon Musk both find it likely that they're in a simulation. Ironically, this belief may be more justified for interesting tech millionaires/billionaires than for ordinary people (in the sense that famous/rich people may have more copies than ordinary people do), since it may be both more scientifically useful and more entertaining to simulate powerful people rather than, e.g., African farmers.
So should rich and powerful people be more selfish than average, because they may have more simulated copies than average? Probably not, because powerful people can also make more altruistic impact than average, and at less personal cost to themselves. (Indeed, helping others may make oneself happier in the long run anyway.) It's pretty rare for wealthy humans to experience torture-level suffering (except maybe in some situations at the end of life -- in which case, physician-assisted suicide seems like a good idea), so the amount of moral good to be done by focusing on oneself seems small even if most of one's copies are solipsist.
It may be hard to fake personal interactions with other humans without actually simulating those other humans. So probably at least your friends and family are being simulated too. But the behavior of your acquaintances would be more believable if they also interacted with fully simulated people. Ultimately, it might be easiest just to simulate the whole world all at once rather than simulating pieces and fudging what happens around the edges. I would guess that most simulations requiring a high level of accuracy contain all human minds who exist at any given time on Earth (though not necessarily at past and future times).
Perhaps one could make some argument for the detailed simulation of past humans similar to the argument for detailed simulation of your acquaintances and their acquaintances: in order to have realistic past memories, you must have been simulated in the past, and in order for your past interactions to be realistic, you must have interacted with other finely simulated people in the past. And in order for your parents and grandparents to have realistic memories, they must have interacted with realistic past people, and likewise for their parents and grandparents, and so on. I wonder if there could be a gradual reduction in the fidelity of simulations moving further and further into the past, to the extent that, say, Julius Caesar never substantially existed in the past of most simulation branches that are simulating our present world? Or perhaps Julius Caesar was simulated in great detail once, but then multiple later historical trajectories are simulated from those same initial conditions.
If there are disconnected subgraphs within the world's social network, it's possible there could be a solipsish simulation of just your subgraph, but it's not clear there are many disconnected subgraphs in practice (except for tiny ones, like isolated peoples in the Amazon), and it's not clear why the simulators would choose to only simulate ~99% of the human population instead of 100%.
What about non-human animals? At least pets, farm animals, and macroscopic wildlife would probably need to be simulated for purposes of realism, at least when they're being watched. (Maybe this is the first ever good argument against real-time wildlife monitoring and CCTV in factory farms.) And ecosystem dynamics will be more believable and realistic if all animals are simulated. So we have some reason to suspect that wild animals are simulated as well. However, there's some uncertainty about this; for instance, maybe the simulators can get away with pretty crude simulation of large-scale ecosystem processes like phytoplankton growth and underground decomposition. Or maybe they can use cached results from previous simulations. But an accurate simulation might need to simulate every living cell on the planet, as well as some basic physical features of the Earth's crust.
That said, we should in general expect to have more copies in lower-resolution simulations, since it's possible to run more low-res than high-res simulations.
How significant is the concern that, say, better monitoring of wildlife could significantly increase wild-animal suffering by forcing the simulators to simulate that wildlife in more detail? If most of our copies exist within simulations rather than basement reality, then this concern can't be dismissed out of hand.
The issue seems to hinge on whether a specific act of wildlife monitoring would make the difference to the fineness of the wilderness simulation. Maybe wildlife are already simulated in great detail regardless of how well we monitor them, because those creatures have ecological effects that we will inevitably notice. Conversely, maybe even if we monitor wildlife 24/7 with cameras and movement trackers, the behavior of the monitored creatures will be generated based on cached behavioral patterns or based on relatively simple algorithms, similar to the behavior of sophisticated non-player characters in video games. For wilderness monitoring to increase wild-animal suffering, it would have to be that our simulation is somewhere between those extremes—that the additional amount of monitoring makes the difference between coarse-grained and fine-grained simulations of creatures in nature.
Still, there seems to be some chance that's the case, and the benefits of wilderness monitoring don't necessarily seem huge either. As an example, suppose that there's a 50% chance that wildlife are already simulated in great detail, a 45% chance that wildlife wouldn't need to be simulated in great detail even if humans did more wilderness monitoring, and a 5% chance that greater wilderness monitoring would make the difference between simple simulations and complex simulations of wild animals. Let's ignore the 45% of scenarios on the assumption that the simulated animals are morally trivial in those cases. Suppose that in the 50% of scenarios where wilderness is already simulated in great detail, wildlife monitoring of a given hectare of land allows humans to reduce suffering on that hectare by, say, 10% of its baseline level B. Meanwhile, in the 5% of scenarios where increased monitoring makes the difference between trivial and complex wilderness simulations, wildlife monitoring increases suffering from roughly 0, due to the triviality of the creatures, to (100% - 10%) * B on that hectare. (The "minus 10%" part is because monitoring reduces wild-animal suffering by 10% relative to the baseline B.) Since 50% * 10% * B ≈ 5% * 90% * B, the expected benefit of wildlife monitoring roughly equals the expected cost in this example. I have no idea if these example numbers are reasonable, but at first glance, the concern about increasing suffering via monitoring doesn't seem completely ignorable.
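A minimal sketch of the expected-value comparison in the previous paragraph, using the same illustrative probabilities (all of which are assumptions rather than estimates):

```python
B = 1.0                        # baseline suffering per hectare when wildlife is simulated in detail
p_already_detailed = 0.50      # wildlife is simulated in detail regardless of monitoring
p_monitoring_tips  = 0.05      # monitoring tips the simulation from trivial to detailed
reduction_from_monitoring = 0.10 * B   # monitoring lets humans reduce suffering by 10% of baseline

expected_benefit = p_already_detailed * reduction_from_monitoring      # 0.05 * B
expected_cost = p_monitoring_tips * (B - reduction_from_monitoring)    # 0.045 * B
print(expected_benefit, expected_cost)  # roughly equal, as stated above
```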
The following figure illustrates some general trends that we might expect to find regarding the number of copies we have of various sorts. Altruistic impact is highest when we focus on the level of solipsishness where the product of the two curves is highest. The main point of this essay is that where that maximum occurs is not obvious. Note that this graph can make sense even if you give the simulation hypothesis low probability, since you can convert "number of copies of you" into "expected number of copies of you", i.e., (number of copies of you if simulations are common) * (probability simulations are common).
If it turns out that solipsish simulations are pretty inaccurate and so can't reproduce the input-output behavior that your brain has in more realistic worlds, then you won't have copies at all levels of detail along the solipsish spectrum, but you should still have uncertainty about whether your algorithm is instantiated in a more or less long-lived high-resolution simulation, or not in a simulation at all.
In this piece, I've been assuming that most of the suffering in the far future that we might reduce would take the form of intelligent computational agents run by superintelligences. The more computing power these superintelligences have, the more sentient minds they'll create, and the more simulations of humans on Earth some of them will also create.
But what if most of the impact of actions targeting the future doesn't come from effects on intelligent computations but rather from something else much more significant? One example could be if we considered suffering in fundamental physics to be extremely morally important in aggregate over the long-term future of our light cone. If there's a way to permanently modify the nature of fundamental physics in a way that wouldn't happen naturally (or at least wouldn't happen naturally for googol-scale lengths of time), it might be possible to change the amount of suffering in physics essentially forever (or at least for googol-scale lengths of time), which might swamp all other changes that one could accomplish. No number of mirrored good deeds across tons of simulations could compete (assuming one cares enough about fundamental physics compared with other things).
Another even more implausible scenario in which far-future focus would be astronomically more important than short-term focus is the following. Suppose that advanced civilizations discover ways to run insane amounts of computation -- so much computation that they can simulate all interesting variations of early biological planets that they could ever want to explore with just a tiny fraction of their computing resources. In this case, F could be extremely small because there may be diminishing returns to additional simulations, and the superintelligences instead devote the rest of their enormous computing resources toward other things. However, one counterargument to this scenario is that a tiny fraction of civilizations might intrinsically value running ancestor simulations of their own and/or other civilizations, and in this case, the fraction of all computation devoted to such simulations might not be driven close to zero if obscene amounts of computing power became available. So it seems that F has a lower bound of roughly (computational-power-weighted fraction of civilizations that intrinsically value ancestor simulations) * (fraction of their computing resources spent on such simulations). Intuitively, I would guess that this bound would likely not be smaller than 10^-15 or 10^-20 or something. (For instance, probably at least one person out of humanity's current ~10^10 people would, sadly in my view, intrinsically value accurate ancestor simulations.)
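For concreteness, here is a rough sketch of that lower bound on F; both numeric inputs are guesses used purely for illustration:

```python
# The two factors follow the lower-bound formula in the paragraph above; the values are guesses.
frac_valuing_ancestor_sims = 1e-10   # e.g., ~1 person in 10^10 intrinsically values such simulations
frac_compute_on_such_sims = 1e-5     # assumed share of that civilization's compute spent on them

F_lower_bound = frac_valuing_ancestor_sims * frac_compute_on_such_sims
print(F_lower_bound)   # 1e-15, within the 10^-15 to 10^-20 ballpark mentioned above
```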
This essay has argued that we shouldn't rule out the possibility that short-term-focused actions like reducing wild-animal suffering over the next few decades in terrestrial ecosystems may have astronomical value. However, we can't easily draw conclusions yet, so this essay should not be taken as a blank check to just focus on reducing short-term suffering without further exploration. Indeed, arguments like this wouldn't have been discovered without thinking about the far future.
Until we know more, I personally favor doing a mix of short-term work, far-future work, and meta-level research about questions like this one. However, as this piece suggests, a purely risk-neutral expected-value maximizer might be inclined to favor mostly far-future work, since even in light of the simulation argument, far-future focus tentatively looks to have somewhat higher expected value. The value of information of further research on the decision of whether to focus more on the short term or far future seems quite high.
Carl Shulman inspired several points in this piece and gave extensive feedback on the final version. My thinking has also benefited from discussions with Jonah Sinick, Nick Beckstead, Tobias Baumann, and others.
the phrase "Pascal's Mugging" got completely bastardized to refer to an emotional feeling of being mugged that some people apparently get when a high-stakes charitable proposition is presented to them, regardless of whether it's supposed to have a low probability. This is enough to make me regret having ever invented the term "Pascal's Mugging" in the first place [...].
Of course, influencing the far future does have a lower probability of success than influencing the near term. The difference in probabilities is just relatively small (plausibly within a few orders of magnitude). (back)
Y * (10,000 + 1) = Z * (1000 + 1),
i.e., Z = 9.991 * Y. And the new amount of wild-animal suffering will be only Z * 1000 = 9.991 * Y * 1000 = 9,991 * Y sent-years, rather than 10,000 * Y. (back)
|(percent change in quantity demanded)/(percent change in price)| < 1
|(100 * fq) / (-100 * fp)| < 1
|-1| * |fq / fp| < 1
fq / fp < 1,
where the last line follows because fq and fp are both positive numbers. Finally, note that total suffering is basically (cost per simulation) * (number of simulations), and the new value of this product is
old_cost_per_simulation * (1 - fp) * old_number_of_simulations * (1 + fq)
= old_cost_per_simulation * old_number_of_simulations * (1 + fq - fp - fp * fq),
which is a decrease if fq < fp. QED. (back)
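A quick numerical check of this footnote's argument, with illustrative values chosen so that fq < fp (i.e., inelastic demand):

```python
fp, fq = 0.10, 0.05          # cost per simulation falls 10%; number of simulations rises 5%
old_cost, old_count = 1.0, 100.0

old_total = old_cost * old_count
new_total = old_cost * (1 - fp) * old_count * (1 + fq)
print(new_total < old_total)  # True: total suffering decreases when fq < fp
```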
A brute-force solution to the above difficulties could be to convert an entire planet to resemble Earth, put real bacteria, fungi, plants, animals, and humans on that planet, and fake signals from outer space (a Truman Show approach to simulations), but this would be extremely wasteful of planetary resources (i.e., it would require a whole planet just to run one simulation), so I doubt many advanced civilizations would do it.
Even if simulations can't reproduce the high-level functional behavior of a biological mind, there remains the question of whether some simulations can be made "subjectively indistinguishable" from a biological human brain in the sense that the brain can't tell which kind of algorithm it is, even if the simulation isn't functionally identical to the original biological version. I suspect that this is possible, since the algorithms that we use to reflect on ourselves and our place in the world don't seem beyond the reach of classical computation and indeed may not be insanely complicated. But I suppose it's possible that computationally demanding quantum algorithms are somehow required in this process. (back)
The post How the Simulation Argument Dampens Future Fanaticism appeared first on Center on Long-Term Risk.
The post Identifying Plausible Paths to Impact and their Strategic Implications appeared first on Center on Long-Term Risk.
CLR’s mission is to identify the best intervention(s) for suffering reducers to work on. Figuring out the long-term consequences of our actions is tricky, such that we are often left with significant uncertainty about the value – sometimes even the sign – of a particular intervention. The past rate at which we have uncovered crucial considerations suggests that more research on prioritization is still very valuable and likely to remain so for years. However, in order to not get stuck with research indefinitely, we will have to eventually focus our efforts on an intervention directly targeted at improving the future. Therefore, besides efforts to grow CLR, it is important to already pursue time-sensitive “capacity building” – connecting with more committed altruists, gathering resources, reputation, etc. – in order to get into a position where a clear “path to impact (PTI)” – the concrete intervention that future CLR thinks is more valuable than further research1 – can eventually be tackled with maximum force.
We do not yet know what our PTI(s) will be, which is why it makes sense to pursue a flexible approach focused on movement building and further research. But we should already be in a position to make competent guesses on the matter, and this is important, because depending on our current assessment of plausible PTIs and how likely we are to pursue each of them, we might have reason to already adjust our movement building strategy.
The intent of this document is to sketch the broad categories of PTIs we currently consider likely, in order to then determine the most important subgoals to optimize for in our movement building. Examples include monetary resources, committed altruists, people with talent in AI safety, societal reputation, reputation within the effective-altruist (EA) community, and the timing of all of it (are there haste considerations anywhere?).
We start by sorting all plausible PTIs into logically exhaustive categories. The idea here is that by categorizing them, we make it less likely that we miss something important. One obvious distinction is whether we focus on short-term vs long-term consequences. Then, given that within the long-term branch, most of the expected value comes from outcomes that are somehow about affecting the way AI2 scenarios unfold, we can distinguish four different ways of affecting AI-related outcomes:
(Of course, this categorization is not the only way of looking at it.) It should be noted that some possible interventions in these categories might turn out to be a bad idea to focus on: For instance, decreasing the probability that AI takeoff happens at all would be bad for reasons of cooperation (even if it overall decreased suffering), as a lot of people care strongly about utopian outcomes that require value-aligned AI.
This section is going to list some plausible PTI candidates for each category. The ideas listed are not meant to be conclusive or even particularly promising, but they give an overview of the sorts of interventions we are considering.
The category as a whole: Short-term interventions might become our PTI if we ever place a high likelihood on “doom soon,” e.g. as the explanation for the Fermi paradox, or because we think most of our copies are in short-lived simulations. Another reason for focusing on the short term is if years of research fail to bring more clarity to the uncertainties of the far-future picture. Finally, short-term interventions become appealing if we decide that the general impacts of our decision algorithms throughout the multiverse dominate the specific impacts that our copies have in such a way that short-term actions seem favored.
Plausible, concrete interventions:
The category as a whole: Approaches within this category have to be designed carefully to avoid greatly harming other value systems. Actively decreasing the probability of AI takeoff happening at all is for instance prohibited by considerations about cooperation: Even if it seemed positive, it would be important to find another intervention that is also positive and less opposed to other people’s interests.
Plausible, concrete intervention:
The category as a whole: Working directly on AI safety is unlikely to be our comparative advantage because a lot of people already care about this. Having said that, the problem seems difficult and will likely require a lot of work. AI safety thus might become our PTI if talent constraints cannot be overcome easily by all the funding the cause is expected to receive in the near future, in which case we could e.g. help with the recruiting of talented researchers.
Increasing the probability of uncontrolled AI – in the not-so-likely case that we come to the conclusion that uncontrolled AI in expectation causes less suffering – is prohibited by reasons of cooperation. If it became our view that uncontrolled AI in expectation produces the least suffering, we should still pursue another approach, e.g. some concrete intervention listed under the subsequent category "Improving controlled AI outcomes" or something in the domain of "fail-safe" AI safety.
Plausible, concrete intervention:
The category as a whole: This set of interventions becomes particularly important if we think that controlled AI is worse than uncontrolled AI in expectation, but with a wide range of outcomes that differ in the amount of suffering they contain. And it becomes more important the more likely we consider it that AI will be controlled.
Plausible, concrete interventions:
The category as a whole: Focusing on this category is intriguing because we seem to be the only group who takes AI risks seriously and cares very strongly about the differences in all the scenarios where human values are not implemented. If we think the consideration “focus on your comparative advantages” has a lot of merit, then this could turn out to be our PTI.
Plausible, concrete interventions:
In future docs inspired by this outline, we are going to list the pros and cons of each of the above proposals and then assign rough weightings to them. We will then need to factor in that some of the proposals are more far-fetched than others.
The main way CLR and its parent/partner organization, the Effective Altruism Foundation (EAF), might currently be pursuing the wrong priorities is if there are strong haste considerations that are not given enough weight. AI takeoff represents a hard deadline, after which all our efforts are "graded." If AI comes very soon, attempts that focus on influencing variables that take time, such as value spreading or promoting international cooperation, might count for nothing. Therefore, it seems important to get a good estimate on how strong we should expect AI-related haste considerations to be. Some thoughts:
Getting more clarity on AI timelines and strategic considerations on how to act in situations where the deadline is uncertain seems important.
Based on the considerations in this document, we can draw the following tentative conclusions:
Most of the ideas in this article are not my own; they summarize part of what CLR has been exploring or is planning to explore more in the near future.
Special thanks to Brian Tomasik, Simon Knutsson and David Althaus for helpful comments and suggestions.
If we somehow manage to affect the goals of a singleton-AI, our actions would have a future-shaping impact until the AI either ceases to exist or suffers from a failure of goal-preservation. No matter its goal, a powerful intelligence would instrumentally value self-preservation and goal-preservation, and it would, qua its superior intelligence, be much better at this than humans and human societies ever were or could be. This suggests that focusing on AI-related outcomes makes it possible to predictably affect the future for millions, perhaps even billions of years to come – or in any case for longer than through any other foreseeable means. Moreover, because most possible goals for an AI would imply instrumentally valuing resource accumulation, we should expect singleton-AIs to ambitiously colonize space, rendering the stakes astronomical. Even if the AI in question has a goal unrelated to conscious beings, it might incidentally create suffering in the process of achieving it. Without concern for suffering, even the slightest gains would be worth creating vast amounts of suffering. Unless we can reject, with extraordinarily high confidence, some of the ingredients in this argument (e.g. orthogonality, instrumental convergence, the feasibility of superintelligent AI in the first place), there seems to be no scenario of remotely similar likelihood where our actions now could have a comparable impact on the far future. (back)
The post Bibliography of Suffering-Focused Views appeared first on Center on Long-Term Risk.
Create and keep up-to-date an online bibliography of material that proposes, defends, or argues against suffering-focused views.
Priority: 8/10
A bibliography on a topic makes it easier for others to do research and write papers in the field. For example, it reduces the hurdle of doing research in the field because the researchers do not need to spend as much time finding literature. It also reduces the risk that a researcher does redundant work because they are unaware of a previous publication on the topic.
An interesting bibliography that is updated by a community was established by The Research Group on Neuroethics and Neurophilosophy at the University of Mainz1:
During the last years the Research Group at the University of Mainz has established the first open, centrally governed and supervised Online-bibliography, which is kept complete and up-to-date by the neuroethics community itself – a ‚literature wikiography‘.
Another example is this simpler but useful bibliography on wild animal suffering.
Learn about best practices for creating and maintaining a bibliography in a field. A complication is that ‘suffering-focused’ is not an established term, but rather our term for different views in diverse fields such as population ethics, axiology, and normative ethics.
Web page (or website).
For more sources and context, see
The post Formalizing Preference Utilitarianism in Physical World Models appeared first on Center on Long-Term Risk.
Most ethical work is done at a low level of formality. This makes practical moral questions inaccessible to formal and natural sciences and can lead to misunderstandings in ethical discussion. In this paper, we use Bayesian inference to introduce a formalization of preference utilitarianism in physical world models, specifically cellular automata. Even though our formalization is not immediately applicable, it is a first step in providing ethics and ultimately the question of how to “make the world better” with a formal basis.
Read the full text here.
The post Hedonistic vs. Preference Utilitarianism appeared first on Center on Long-Term Risk.
It's a classic debate among utilitarians: should we care about an organism's happiness and suffering (hedonic wellbeing), or should we ultimately value fulfilling what it wants, whatever that may be (preferences)? In this piece, I discuss intuitions on both sides and explore a hybrid view that gives greater weight to the hedonic subsystems of brains than to other overriding subsystems. I also discuss how seeming infinite preferences against suffering could lead to a negative-leaning utilitarian perspective. While I have strong intuitions on both sides of the dispute, in the end I may side more with idealized-preference utilitarianism. But even if so, there remain many questions, such as Which entities count as agents? How should we weigh them? And how do we assess the relative strengths of their preferences? In using preference utilitarianism to resolve moral disagreements, there's a tension between weighting various sides by power vs. numerosity, paralleling the efficiency vs. equity debate in economics.
Jeremy Bentham's original formulation of utilitarianism was based around happiness and suffering. Later formulations generally moved toward focus on preference fulfilment instead. Kahneman and Sugden (2005) discuss hedonism vs. preferences from the standpoint of psychology.
Economists tend to use preferences because revealed preferences can be measured, and in general, a preference ordering seems more "rigorous" than an arbitrary cardinal numerical assignment for intensities of happiness and suffering. The von Neumann-Morgenstern utility theorem demonstrated that any preference ordering over lotteries satisfying four properties could be represented by maximizing the expected value of a utility function, unique up to a positive affine transformation. This utility function needn't represent the same thing as Bentham's original conception of happiness, although Yew-Kwang Ng argues that it does when finite sensibility is taken into account.
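To make the representation claim concrete, here is one standard way it is usually written (a formulation added here for reference, not taken from the original text): for lotteries p and q over outcomes x_i, a preference relation satisfying completeness, transitivity, continuity, and independence can be represented by a utility function u such that

```latex
p \succsim q \quad\Longleftrightarrow\quad \sum_i p(x_i)\, u(x_i) \;\ge\; \sum_i q(x_i)\, u(x_i),
```

with u unique up to a positive affine transformation (replacing u by a·u + b, with a > 0, represents the same preferences).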
Of course, having these numerical utility functions still doesn't necessarily allow for interpersonal comparisons of utilities. The best economists can do in their concepts of efficiency is talk about Pareto or potential Pareto improvements, but these don't capture all changes that we may wish to say would improve utility. For example, suppose an unempathic billionaire walks past a homeless little girl on the streets. It would not be a Pareto or even potential Pareto improvement for the billionaire to buy the girl a winter coat against the cold, yet most of us would like for the billionaire to do so (unless he has vastly more cost-effective projects to fund instead).
So whether we are hedonistic or preference utilitarians, we may want to make value judgements for interpersonal comparisons that go beyond the "rigorous" preference-oriented framework of economists. What we thereby lose in objectivity, we gain in moral soundness. (Note: Probably there are many attempts to formally ground interpersonal comparisons of von-Neumann-Morgenstern utility, though I'm not aware of the details.)
Most of the time, what an organism prefers for himself is what (he thinks) will make him most happy and least in pain. For me personally, my selfish preferences align with my selfish hedonic desires maybe ~90% of the time. When this is true, the distinction between preferences and hedonic satisfaction may not be crucial, although it could affect some of our other intuitions as discussed below.
There are some cases where the two diverge, such as
In this section I suggest one intuition in favor of preference satisfaction.
Example 1. First consider a universe in which no life exists. There are no feelings, sentiments, or experiences. Only stars and desolate planets fill the void of space. It seems intuitive that nothing matters in this universe. As there are no organisms around to care about anything, ethics does not apply.
Example 2. Consider a second universe that contains exactly one organism, named Chris. In Chris's mind, the only thing that matters is carrying out his ethical obligation to build domino towers. Since this is the only ethical principle that exists, there's some quasi-universal sense in which it's ethical for Chris to stack dominoes.
Example 3. What if we now complicate the situation and consider a universe with two organisms: Chris (from before) and Dorothy? Suppose that Dorothy's only goal is to prevent the construction of domino towers. Thus, Chris can only act in a way that he considers ethical if he abridges Dorothy's ethical belief. The same is true for Dorothy with respect to Chris's ethical belief. How do we resolve the dispute?
Recall that ethics only began to apply in the universe once Chris and Dorothy existed. Suppose Dorothy holds her belief twice as strongly as Chris does. Then, in some sense, Dorothy's belief "exists" twice as much, so the quasi-universal ethical stance is to give Dorothy's belief twice as much consideration. In this particular example, it's best to prevent construction of domino towers.
If we apply the intuition from these examples to any finite number of organisms, all with finitely strong ethical beliefs, the result is preference utilitarianism.
What shall we do with organisms that don't explicitly recognize what they care about? For instance, what if the universe consisted entirely of a single mouse that was in pain? We can suppose for the sake of argument that the mouse doesn't conceive of itself as an abstract organism enduring negative sensations. Presumably the mouse doesn't think, "I wish this pain would stop." But the intuition that motivates our concern for the interests of other beings rests not upon the ability of those beings to explicitly state their wishes -- rather, it comes from an empathetic recognition that those wishes exist and matter. Clearly the mouse's pain is a real event that matters to the mouse, even if the mouse can't articulate that fact. So preference utilitarianism does give consideration to implicit preferences -- whether held by human or non-human animals.
Preference utilitarianism is not the same as libertarianism, because there may be cases in which a person is morally obligated to act against her wishes to better satisfy the wishes of others or potentially even her future self. That said, the preference view does a better job of capturing the sense of individual autonomy than does the happiness view. On the happiness view, one can imagine "dissident emotional primitives being dragged kicking and screaming into the pleasure chambers," but this seems less likely on the preference view.
A main reason I find the preference view plausible is that ultimately what I would want for myself is for my preferences to be satisfied, not always for me to be made happier, so extending the same to others is the nicest way to treat them. In other words, preference utilitarianism is basically the Golden Rule, which is "found in some form in almost every ethical tradition," according to Simon Blackburn's Ethics: A Very Short Introduction (p. 101).
Consider some objections:
The response to "irrational decisions" is simple: Utilitarianism counts the preferences of all organisms, not just those existing right now, so we need to weigh your current self's preference for quick relief against your future selves' preference to not live with AIDS.
Liking vs. wanting is an important consideration. I agree with the intuition that liking should trump wanting, but my guess is that people who want something without liking it would prefer (meta-want) to not want the thing. For instance, drug addicts who crave an additional hit wish they didn't have those cravings. If meta-preferences can override or at least compete strongly with base-level preferences, the problem should usually go away. In cases where it doesn't go away, the situation reduces to one of "perverse preferences."
The remaining three objections I'll discuss in subsequent sections.
Intuitively, when an organism has a preference, it wants the world to be in one state rather than another. For example, an animal in the rain may prefer to be warm and dry rather than wet and cold. Inside its brain, there's a system telling the animal that things would be better if it were inside.
Preferences can also extend beyond the hedonic wellbeing of a person. For example, deep ecologists may prefer that nature is kept untouched, even if no human is around to observe this fact (and, I would add, even if multitudes of animals suffer as a result).
Consider the following:
It seems that when we talk about preferences, we really mean the desires of some sort of agent, especially a conscious agent, rather than an arbitrary system or force of nature. If so, this already suggests some connection between hedonistic and preference utilitarianism: The agents that we count as having preferences tend, especially in our current biological world, also to be agents that have emotional experiences.
If preferences should be imputed mainly to minds that are conscious agents, different people may have different ideas about where to draw the boundaries of what a preference is -- since, indeed, even the boundaries of "conscious" and "agent" are up for dispute. The preference view makes it slightly more plausible that a broader class of agents has preferences than agents to whom we would have attributed pleasure and pain on the hedonistic view, just because it seems like preferences are conceptually simpler sorts of attributes that aren't so narrowly confined to hedonic systems of the type found in animal brains. But exactly how broadly we extend the notion of preference satisfaction is up to our hearts to decide.
As with consciousness in general, these questions are not binary. I might give extremely tiny weight to satisfying a thermostat's preference to have the temperature at 22.5 degrees, but this is so small that it can generally be ignored. Probably better to have ten million thwarted thermostats than one mouse shivering in the cold for 30 seconds.
It may be that micro-scale physical processes exhibit behavior analogous to thermostats, and we might wonder if these would dominate calculations due to their prevalence. This is worth considering, but keep in mind that a digital thermostat is a much more complex system than, say, a covalent bond between atoms. A digital thermostat is not only bigger but includes a small computer, a display, buttons for various settings, and so on. It's plausible that these things add moral weight, just as the extra complexity of animals adds moral weight above a thermostat.
If it seems absurd to give any consideration to thermostats, keep in mind that animals and people can be seen as very complicated thermostats -- using sensors and taking actions to keep themselves in homeostasis. This complexity includes higher-level thoughts, feelings, memories, and so on, and if we choose, we could require some minimum threshold of these characteristics before an agent's preferences counted at all. But if we don't set such a threshold, it's natural to see even the thermostat in your home as having a preference that matters to an extremely tiny degree.
Brains are ensembles of many submodules, which are themselves ensembles of many neurons. Some neurons and submodules push for one action (e.g., go to sleep due to tiredness), while others push for a different action (e.g., stay awake to reply to a comment). The coalition with more supporters wins the election in deciding your action choice. If the election is close, presumably the preference is not very strong compared with a landslide election (e.g., take my hand out of this pot of scalding water).
An interesting question is whether we should count the individual votes in the election separately or just the final outcome. In general this shouldn't much matter, because for example, if the election gave 45% of votes for sleeping and 55% for replying to the comment, the preference for staying up to reply to the comment would be relatively small (55% - 45% = 10%), and satisfying it would matter less than satisfying a landslide election, like removing your hand from hot water. So whether we say the preference is just 10%, or whether we say it's 55% for, 45% against, summing to 10% net votes for, it wouldn't affect our judgement. It's just a matter of whether the aggregation is done in the person's head or by our ethical evaluation.
(Note that the weight of a preference is determined both by the degree of mandate of the winner and the overall size of the populace. The hand-in-hot-water election occurs in a very big country where lots of neurons are sending strong votes, so that election matters a lot even beyond the fact that it had a landslide outcome.)
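A toy sketch of the "neural election" weighting described above, in which a preference's weight depends on both the margin of the vote and the size of the electorate; all numbers are invented for illustration:

```python
def preference_strength(votes_for, votes_against):
    """Weight of a preference: the margin of the 'election' scaled by the size of the electorate."""
    electorate = votes_for + votes_against
    margin = (votes_for - votes_against) / electorate
    return margin * electorate   # equivalently, votes_for - votes_against

# Staying up to reply to a comment: a close election in a small "country".
print(preference_strength(55, 45))        # 10

# Pulling a hand out of scalding water: a landslide in a much bigger electorate.
print(preference_strength(9_000, 1_000))  # 8000
```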
Where this discussion becomes relevant is in the case of "perverse preferences" raised as one objection to preference utilitarianism. Few biological agents exhibit strongly perverse preferences, but they certainly seem possible in principle. For example, imagine an artificial mind where the emotional center produces one output message, and then the sign gets flipped on its way to the motivational center. In this case, if we only look at the final output behavior, we conclude that the preference is to suffer, but if we extend most of our ethical concern to what the electorate actually felt in this "rigged election," we would conclude that the suffering should stop.
This subsystem-level view can also help us see why an organism's verbalized preference output is not necessarily the only measure of its underlying preference. The person may be confused, or trying to conform to social convention, or misinformed, or otherwise introspectively inaccurate. While we have very effective brain systems whose goal is to predict how much we'll like or dislike various experiences, these predictions can be off target, and ultimately, the proof is in the pudding. The brain's response to actually experiencing the event should arguably play the strongest role in our assessment of what a person's preference is about that event, not so much his prediction beforehand or even recollection afterward.
In some sense, the neural-level viewpoint is the hedonist's reply to the preference utilitarian's Golden Rule intuition. The preference utilitarian says, "Treat others how you'd want to be treated, which means respecting their preferences." The hedonistic utilitarian replies: "People are not unified entities. There are multiple 'selves' within an organism with different responses at different times. It's true that some win control to decide behavior, but we should still care about the losers' preferences somewhat as well."
More generally, what preference utilitarianism actually cares about, in most formulations, are idealized preferences -- what the agent would want if it knew more, was wiser, had improved reflective capacity, had more experiences, and so on. Probably most preferences that appear perverse are actually just not idealized. Of course, idealization introduces a host of new issues, because the idealization procedure is not unique and may lead to significantly different outputs depending on how it's done. This is troubling, but if we believe idealization makes sense, it's best if we pick some plausible idealization procedure rather than avoid idealization altogether.
This view of considering brain subsystems and neurons rather than just explicit preferences and actual decisions is a sort of blend between hedonistic and preference utilitarianism: It feels a lot like hedonistic utilitarianism, because the agents whose preferences we're counting are the (mainly but not exclusively) hedonic subcomponents of the decision. Of course, if non-hedonic subsystems did override the hedonic ones, as is sometimes the case even in biological organisms, we might choose to favor the non-hedonic subsystems, depending on how much they seem to be genuine members of the neural electorate vs. how much they appear to be just voter fraud.
But what we gain in concordance between the hedonistic and preference views, we lose in autonomy by individual actors. For instance, suppose we could see that your neurons would, on the whole, accept a trade of being kicked in the knee in return for a trip to the amusement park. However, you feel you don't want to be kicked in the knee, and this would violate your right to refuse harm. Should we force you to be kicked against your will? This is a tricky question. My intuition says "No" because of the "violation of liberty" that's involved, but the flip side is to feel sorry for all those powerless neurons that are losing out on the amazing rides they could be enjoying. I might feel the opposite way if the scenario were inverted: If a person wanted to be kicked in order to get a day at the amusement park, even though the neurons would dislike the kicking more than they would like the roller coasters and Ferris wheels, then I'd be more inclined to say the person should not be allowed to get kicked.
In any event, in many cases outside of the toy example of a torture-wanting pig whose output behavior was distorted from the underlying hedonic sensations, neural votes probably don't diverge that much from people's autonomous choices, and even if they did:
On the Felicifia forum, Hedonic Treader rightly observed that we should err on the side of personal freedom in most cases. Of course, as Michael Bitton pointed out to me, we can also nudge people in better directions by using cognitive psychology to influence their choices without eliminating options.
Consider someone who claims he would not accept even one second of torture in return for eternal bliss. If we take this at face value, it would imply that torture is infinitely worse than happiness for this person. Then if we try to combine this person's utility with that of other people, would his negative infinity on torture swamp everyone else?
One approach is to deny that this person actually has an infinite preference. Perhaps the person is misinformed about how bad the torture would be, and probably he's unfairly discounting the future pleasure moments. Taking a neural-level view, we might say that the aversive reactions to torture are not infinitely more powerful than the positive ones to happiness. Yet the person may still maintain his stance against these allegations.
While I think it's not right to let this single person's preference dominate the non-infinite preferences of others, I do think we should take it somewhat seriously and not simply override it on a neural view. We have to strike a balance between overcoming the irrationalities of explicit preferences versus avoiding neural authoritarianism. In this case, I would probably not treat the preference against one second of torture as infinite but as extremely strong and finite, requiring vast amounts of pleasure to be outweighed. Because few people express the reverse sentiment (that "I would accept infinite durations of headaches and nausea in return for one second of this blissful experience"), the existence of people with this anti-torture sentiment pushes somewhat toward a negative-leaning utilitarian view.
Of course, most people would accept a second of torture in return for eternal bliss (or even just very long bliss), but perhaps if the torture was bad enough, they also would change their minds in that moment. This should be taken seriously. It's also a reason why I think small amounts of very bad suffering are far more serious than lots of mild suffering: We're willing to trade mild suffering for mild pleasure even when enduring the mild suffering, but if the suffering becomes intense enough, we might not accept it in return for any amount of pleasure, at least not in the heat of the moment.
Hedonistic utilitarianism allows for a large degree of flexibility in deciding exactly how much happiness and suffering a given experience entails. For example, negative-leaning utilitarians can set the suffering value of a very painful experience as much more negative than a more positive-leaning utilitarian would.
With preference utilitarianism, the utility assignments are more constrained because they should generally respect the observed preferences of the actor, although there are exceptions discussed above in cases like irrationality, time discounting, epistemic error, or major conflict between the brain's high-level output and low-level hedonic reactions. So, for example, when most people say they're glad to be alive rather than temporarily unconscious, we should generally take this at face value and assume their lives are above zero, at least at that moment.
Of course, there remains plenty of wiggle room for preference utilitarians to make judgment calls in deciding when the exceptions apply, as well as through interpersonal-comparison tradeoffs.
In this piece I've mainly discussed selfish preferences: How an actor feels about her own emotions or other affairs regarding herself, such as whether her honor has been tarnished, whether she has been used as a means to an end, or various other concerns that may be more than immediately hedonic but still self-directed.
What about preferences regarding the wider world? One I mentioned already was deep ecologists' preference (which I do not share) for untouched nature, even if no one is around to see it. Various other moral preferences are of a similar type: Wanting to reduce poverty, increase social tolerance, limit human-rights abuses, reduce wild-animal suffering, and so on. In these cases, the actor does not just care about his own experience but actually cares about something "out there" in the world and would continue to care whether or not he was around to see it and whether or not he could fool himself into thinking his goal had been accomplished. I don't want to just feel like I've reduced expected suffering but rather want to actually reduce expected suffering.
Suppose a grandfather's dying wish is to leave his fortune to his favorite grandson. You're the only person to hear the grandfather make this request, and the default legal outcome is for some of the money to go to you. Is it wrong not to report the grandfather's wish? After all, if you let the default legal outcome happen, you'll be able to donate your share of the inheritance to important charities.
Well, there are a few instrumental reasons why it would be wrong: Lying is almost always a bad idea, and a society in which people lie, even when they feel doing so is right, would likely be worse than our present society. It's generally good to create a culture in which dying wishes are respected, and doing so here contributes to that goal.
But is there any further sense in which not honoring the grandfather's wishes is wrong? After all, he's already dead and can't feel bad about his wishes not being respected. He also had no idea his wishes wouldn't be carried out, because he assumed you were a trustworthy person.
This question is tough. It feels very counterintuitive to suggest that a person's preferences can be violated after he's dead by something that he'll never know about, and yet, if his preference actually referred to a thing in the world happening, and not just to his subjective perceptions, then in this case his preference would be violated.
I do know that for myself, I actually want my preferences about the world to be carried out, and I would regard it as wrong if they weren't. But is this special to my preferences because they're mine, or do my preferences say it's wrong when others' non-self-directed preferences are violated? I incline toward the latter view, because ethics is fundamentally about others, not about myself. However, I'm not completely sure, and people disagree on this point.
If we do take the view that it matters if preferences are actually fulfilled rather than just whether an organism thinks they are, then this makes sense of Peter Singer's stance that involuntarily killing persons is wrong, even if the persons would never realize they had been killed, because doing so violates their actual preferences to keep living. We might also ask whether animals that don't have a sense of themselves existing over time still have implicit preferences against dying; Singer doesn't think so, but if we count implicit preferences in other domains, why not this one? (Note that historically, I have not found painlessly killing animals to be wrong, so with this last remark I was challenging my own assumption rather than advancing a view I hold confidently.)
Another concern with respecting altruistic and not just selfish preferences is double counting: If someone cares about everyone, then helping others is good both for those others and for the person who cares about those others. If everyone cared about everyone else, then helping everyone would be good for an individual mainly through the effects on those other than herself. This seems weird, but maybe that's just because it doesn't describe the situation of our actual world. In practice, especially when we talk about actual rather than stated preferences, most of us devote a large fraction of our caring budget to ourselves.
Suppose you're at a friend's house, and your friend goes away for 15 minutes to take a shower. You're left in the living room, and you see the friend's diary on a shelf. The diary says "Private - Do Not Read," but you're curious, and you think, "It wouldn't hurt anyone to take a peek, right?" Is it wrong to read the diary if you could be sure no one would find out?
A hedonistic act-utilitarian might say reading the diary was okay, if it was really certain no one would find out and if doing so wouldn't have hurt your relationships or future behavior. A hedonistic rule-utilitarian or other meta-utilitarian might object to the harm that such activities tend to cause in general, or even the harm that such a principle would cause to utilitarianism itself. A preference utilitarian who accepts the importance of desires about the external world can object on an even simpler basis: Reading the diary violates your friend's preference even if she never finds out. I visualize these external preferences as being like invisible strings that we step on and break when we violate the wishes of someone not there to witness us doing so.
Most of our preferences involve wanting the configuration of our brain (in particular, our hedonic systems) to be one way rather than another. However, sometimes preferences involve wanting the external world to be one way rather than another (e.g., wanting the diary to be in the state of "not being read by other people"). Is there really a fundamental difference between these two kinds of preferences? The difference seems mainly to be a matter of the level of abstraction at which we interpret the actions and tendencies of a neural system.
Hedonistic utilitarianism has the virtue that it's (relatively) clear when a given hedonic state is engaged, making counting of happiness and suffering (relatively) straightforward. In contrast, a person may have many preferences all at once, most of which aren't being thought about. We might call a preference that's merely latent in someone's connectome a "passive preference", while a desire that's currently being felt can be called an "active preference". Active preferences appear similar to hedonic experiences (and may induce or be induced by hedonic experiences), making them easier to count.
Do active preferences matter more than passive ones? If you have a preference that you never think about (e.g., the preference not to be held upside down by an elephant's trunk), does its satisfaction still count positively to moral value? Given that a person appears to have infinitely many such passive preferences at any given time, if fulfillment of these preferences does matter, how do we count them? Or do most such preferences boil down to a few basic preferences, like not being injured, frightened, etc.?
Many of us feel a qualitative difference between different types of motivational states. The more basic ones seem to be pleasures/pains "of the flesh", often corresponding to clearly physical beneficial/harmful stimuli. We also have feelings "of the spirit" that don't inform us of specific somatic events but rather represent more abstract longings for the world to be different and joy upon seeing positive changes. "Soul hurt" may refer to events beyond our immediate lives -- such as, in my case, the existence of huge amounts of suffering in the universe.
In general, fleshly experiences feel more hedonically toned, while spiritual desires feel more oriented around preferences, but there's certainly overlap on each side. For instance, soul hurt does hurt a little bit hedonically, but not as much as the soul thinks it should compared with fleshly experiences. (Adam Smith: "If he was to lose his little finger to-morrow, he would not sleep to-night; but, provided he never saw them, he will snore with the most profound security over the ruin of a hundred millions of his brethren").
How should a moral valuation trade off people's fleshly experiences versus their spiritual desires?
Consider a mind like that of Mr. Spock: It lacks many ordinary accoutrements of emotion (physiological arousal, quick behavioral changes, etc.), but Spock's goal-directed calculations still embody preferences about how the world should be different. Does Spock then mainly experience soul hurt rather than fleshly hurt? Or does broadcasting of bad, negatively reinforcing news throughout Spock's brain still count as aversive hedonic experience to some degree even if it lacks the other processes that accompany emotions in humans?
Most people have an intuitive sense of what's meant by "raw pleasure" and "raw pain", but actually defining these hedonic experiences is slippery. One plausible definition could be "awareness of reward or punishment signals that trigger reinforcement learning, changes in motivation, evaluative judgments, and so on". This definition is complicated and fuzzy.
It's plausible that motivation should be an important part of hedonic experience. Pain asymbolia "is a condition in which pain is experienced without unpleasantness." This type of pain doesn't seem very morally bad to me, which suggests that the (at least implicit) desire for pain to stop is an important part of what makes pain bad. In other words, a (perhaps merely implicit and low-level) preference for pain to stop seems crucial to the hedonic experience.
In light of this, we might ask whether hedonic experience can be interpreted as a type of preference, perhaps with some extra features as well. In cartoon form, we might picture pleasure as "fulfillment of the preferences of the Freudian id", and suffering as frustration of those preferences. Other parts of us, such as our Freudian superegos, have other preferences, including moral desires. Perhaps when we contrast hedonistic vs. preference utilitarianism, we're mainly contrasting the preferences of the id versus the preferences of the superego?
While it's fashionable to belittle Freud, I think Freud's id/ego/superego distinction remains powerful, and the idea that our minds contain several competing subsystems seems correct. As one example of a more modern theory with similarities to Freud's, Christiano (2017)'s distinction between "cesires" and "desires" is reminiscent of the distinction between the id and the ego+superego.
This section was written in 2013 and is somewhat out of date.
For a while I had a strong instinct towards hedonistic utilitarianism, and when I saw people express preferences that conflicted with it (e.g., refusing Robert Nozick's experience machine), my immediate reaction was to think, "No, you're wrong! You're misassessing the tradeoff." I feel the same way when Toby Ord says he would accept a day of torture for ten years of happy life. I would vehemently refuse this trade myself, but if someone else chose it after careful deliberation, perhaps including trying some torture to see how bad it was, I might let that person do what he wanted.
In some of these cases where I can't believe that people hold the preferences they claim to, there may be biases going on: Discounting the future, factual mistakes, overriding major hedonic subsystems, deferring to social expectations, etc. However, in other cases it may genuinely be the case that other people are wired differently enough from myself that they do rationally prefer something different from what I would choose.
Taking the preference stance requires a higher level of cognitive control over my emotions than the hedonistic stance, because I have to abstract away my empathy from the situation and picture it more as a black-box decision-making process coming to some conclusion. If I look too closely at what that conclusion is, I'm tempted to override it with my own feelings.
Over time I've grown more inclined toward the preference view. It seems more elegant and less arbitrary, it appears to be a universal morality in the sense of resolving competing values and goals, and it encapsulates the Golden Rule, which is probably the most widespread moral principle known to humankind (and maybe even more broadly in the universe, since altruism seems to be an evolutionarily convergent outcome for agents that confront iterated prisoner's dilemmas). At the end of the day, morality is not about what I want; it's about what other people (and organisms in general) want.
Interestingly, Peter Singer -- once a prominent preference utilitarian -- has shifted in the opposite direction. In an episode of the podcast Rationally Speaking, Singer explains that he now aligns closer to the sophisticated hedonistic view of Henry Sidgwick. Singer believes that only consciously experienced events matter, although we should construe hedonic experience more broadly than just raw pleasure and pain.
The preference-utilitarian view can lead to perspectives that most people would find strange. For example, suppose we encountered aliens whose overriding goal was to compute as many digits of pi as possible. If this was a clear preference by these conscious agents, a preference utilitarian would care about them achieving their goal, at least to some degree depending on their moral weight. This may seem preposterous to us, but remember that we're fundamentally no different, and if we were those aliens, we would really care about the digits of pi too. (Some values that other humans care about seem no less absurd to me than wanting to compute digits of pi.)
There's a perhaps even stranger implication of this view: Preference utilitarianism might lead us to care about entities like companies, organizations, and nation-states. These, too, strategically act like agents optimizing a set of goals. Even though they're composed of conscious elements (the people who run them), their own utility functions may not correspond to the conscious desires of anyone in particular.
We can see an interesting parallel between the consumer's utility-maximization problem and the firm's profit-maximization problem. For instance, both consumer utility and firm output are often modeled in economics as Cobb-Douglas functions. This makes sense because, biologically, humans are factories for producing evolutionary fitness, with diminishing marginal product for any given factor holding the others constant.
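To make the functional form concrete, here is a minimal Python sketch (my own illustration, not from the original text) of a Cobb-Douglas function; the exponent alpha = 0.5 and the sample inputs are arbitrary choices. It shows the diminishing marginal product mentioned above: holding one factor fixed, each additional unit of the other factor adds less output.

    def cobb_douglas(x, y, alpha=0.5):
        """Cobb-Douglas utility/output: U = x**alpha * y**(1 - alpha)."""
        return (x ** alpha) * (y ** (1 - alpha))

    def marginal_product_x(x, y, alpha=0.5, eps=1e-6):
        """Numerical marginal product of x, holding y fixed."""
        return (cobb_douglas(x + eps, y, alpha) - cobb_douglas(x, y, alpha)) / eps

    # Marginal product of x shrinks as x grows while y stays at 10:
    for x in [1, 2, 4, 8, 16]:
        print(x, round(marginal_product_x(x, y=10), 3))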
But are we really going to care about corporations apart from the people that comprise them? There's already a lot of backlash against legal corporate personhood, much less ethical personhood. It also seems slightly odd that a corporation could be "double counted" as mattering itself and also by the welfare of the people who constitute it. Of course, if there were a real, conscious China brain, even a hedonistic utilitarian would face double counting in some cases.
Right now I feel like I don't care about corporations or governments separately from the people who comprise them, but I'm not sure if this view would hold up on further reflection. I'm more inclined to care about non-conscious alien agents. Maybe this distinction comes down to being able to imagine myself as the alien better than as the corporation.
After writing this section, I discovered an important essay by Eric Schwitzgebel: "If Materialism Is True, the United States Is Probably Conscious." It provides some nice thought experiments for extending our intuitions about which computations we care about. My response is that asking "Is the United States really conscious?" is a confused question, analogous to "If a tree falls in the forest with no one there to hear it, does it really make a sound?" People have an easier time dissolving the latter question than the former, but they're structurally identical confusions.
Above I've been assuming it's possible for there to exist a sophisticated agent whom we don't consider to have phenomenal experience, according to our understanding of what phenomenal experience is like. Is this even possible? Obviously it depends on how we define the boundaries of phenomenal experience. Here's one argument why our ordinary conceptions of phenomenal experience may encompass any sufficiently intelligent agent.
What is conscious emotion for us? Many people have many theories, and I don't claim to have the answer. But it seems that emotion consists in systems that represent changes in expected reward/punishment, express drives and motivations, and then think, self-reflect, plan, imagine, and synthesize ideas about how to accomplish goals. There are many other bells and whistles that come along as well. It seems a sufficiently intelligent agent would have all of these properties. Maybe the bells and whistles would look different, and the thinking might happen in different ways, but the fundamental process of having drives and exploring how to execute them seems common. If there is more to conscious emotion in animals than what I described, my guess is that what I've left out is like missing pieces in a jigsaw puzzle, rather than a fundamental feature without which all the other characteristics are valueless.
With the example of the United States as an agent, it's not clear whether it meets the criteria for being "sufficiently intelligent" or "agent-like," especially since its boundaries as a unified agent are themselves unclear. For instance, suppose you ask a question of the United States. The answer would be written by some person within the country, using that person's brain. Why should we say the United States as a whole wrote the reply rather than that particular person? If the country somehow composed its reply collectively, we might have a stronger intuition that the United States was acting as an agent.
Alternatively, we might prefer to define phenomenal experience more narrowly, to only encompass agent intelligence constructed in ways more similar to those found in animal brains. For instance, while the United States may have forms of self-reflection (e.g., public-opinion polling) that extend beyond what a single individual does, it's not clear that it has the same kind of emotional self-reflection as I do. The choice of how parochially we want to define consciousness is up to us.
Murray Shanahan presents an interesting argument in Chapters 4-5 of Embodiment and the Inner Life: Cognition and consciousness in the space of possible minds, summarized in his talk at the AGI 2011 conference. He suggests that general intelligence actually requires consciousness. Shanahan's main idea is that responding to completely new situations entails recruiting coalitions of many brain regions, working together in new ways rather than repeating old stereotypes. The most prominent coalitions are broadcast widely, which is what gives rise to consciousness according to global workspace theory, one of the leading theories of consciousness with good empirical support. This jibes with our experience, in which routine, automatic tasks (including even driving a car, walking, or brushing our teeth) can be done without being very conscious of them. Consciousness is required when the context is new, and you need to marshal help from many brain regions at once in an "all hands on deck"1 sort of way in order to respond to a novel challenge as best as possible. This kind of global recruitment of brain components may be common to many cognitive architectures to various degrees.
Ward (2011), p. 465:
the evidence is overwhelming that consciousness is functionally integrative and that this is the dominant fitness advantage it provides for conscious organisms. In other words, the role of this integrative processing is to provide internal representations (‘‘models’’) of the niche-relevant causal structure of the environment, including objects and their surroundings and the events that take place there. Baars (2002) reviews extensive evidence that integrative conscious processing involves more widespread cortical activity than does unconscious processing.
Marvin Minsky views consciousness as coming along with intelligent computation:
I don't [see] consciousness as holding one great, big, wonderful mystery. Instead it's a large collection of useful schemes that enable our resourcefulness. Any machine that can think effectively will need access to descriptions of what it's done recently, and how these relate to its various goals. For example, you'd need these to keep from getting stuck in a loop whenever you fail to solve a problem. You have to remember what you did -- first so you won't just repeat it again, and then so that you can figure out just what went wrong -- and accordingly alter your next attempt.
One review of Minsky's The Society of Mind says: "The question is posed as to whether machines can be intelligent without any emotions. The author seems to be arguing, and plausibly I think, that emotions serve as a defense against competing interests when a goal is set. Emotional responses occur when the most important goal(s) are disrupted by other influences. Intelligent machines then will need to have the many complex checks and balances."
In "Philosophers & Futurists, Catch Up!" Jürgen Schmidhuber offers another account of why consciousness may be convergent within certain human-like learning architectures (pp. 179-180):
we have pretty good ideas where the symbols and self-symbols underlying consciousness and sentience come from (Schmidhuber, 2009a; 2010). They may be viewed as simple by-products of data compression and problem solving. As we interact with the world to achieve goals, we are constructing internal models of the world, predicting and thus partially compressing the data histories we are observing. If the predictor/compressor is an artificial recurrent neural network (RNN), it will create feature hierarchies, lower level neurons corresponding to simple feature detectors similar to those found in human brains, higher layer neurons typically corresponding to more abstract features, but fine-grained where necessary. Like any good compressor the RNN will learn to identify shared regularities among different already existing internal data structures, and generate prototype encodings (across neuron populations) or symbols for frequently occurring observation sub-sequences, to shrink the storage space needed for the whole. Self-symbols may be viewed as a by-product of this, since there is one thing that is involved in all actions and sensory inputs of the agent, namely, the agent itself. To efficiently encode the entire data history, it will profit from creating some sort of internal prototype symbol or code (e. g. a neural activity pattern) representing itself (Schmidhuber, 2009a; 2010). Whenever this representation becomes activated above a certain threshold, say, by activating the corresponding neurons through new incoming sensory inputs or an internal 'search light' or otherwise, the agent could be called self-aware. No need to see this as a mysterious process -- it is just a natural by-product of partially compressing the observation history by efficiently encoding frequent observations.
However, Schmidhuber goes on to suggest that there exist theoretical intelligent agents that are not conscious in any familiar sense of that term:
Note that the mathematically optimal general problem solvers and universal AIs discussed above do not at all require something like an explicit concept of consciousness. This is one more reason to consider consciousness a possible but non-essential by-product of general intelligence, as opposed to a pre-condition.
So maybe we would regard theoretical optimal problem solvers as unconscious. Or maybe we would consider our primate-based notions of conscious agency too narrow and expand our sphere of concern to encompass any sort of powerful intelligence for ethical calculations.
If one holds the hedonistic-utilitarian view and believes that not all preferences matter, then one might encourage creating organisms that are motivated positively by pleasure but negatively by non-hedonic preferences against harm. They would then enjoy the good moments but robotically act to avoid bad moments without "feeling" them. I think this perspective is based on an overly parochial view of what we care about morally, and I think any violated preference is bad to some degree. But maybe I would care slightly more about familiar hedonic suffering, in which case agents of this type might be at least better than the default.
David Pearce proposes the idea of equipping ourselves with robotic prostheses (incorporating a manual override to ensure individual autonomy) that would catch people before they made decisions that would cause harm, such as touching a hot stove or stepping off a cliff. Insofar as such devices would implement prevention against harm (similar to neuronal reflex responses that don't reach the brain) rather than negative reinforcement in response to harm, it's not clear how much they would pose an ethical issue even to a preference utilitarian. That said, I have doubts about the ability of such systems to generalize to many preventative contexts without a higher-level intelligence of their own. Perhaps they could be useful in limited circumstances, in which case they would be a rather natural extension of existing protective devices like seat belts and safety goggles.
Every young boy wants to become a paleontologist when he grows up. Then, as he matures, he realizes that other careers might in fact be more rewarding. When he takes a job other than digging for dinosaur bones, is he violating the preferences of his younger self?
I think the answer is plausibly "no," and the reason is that, as mentioned previously with reference to misinformed preferences, the boy's preference to be a paleontologist lacks insight into what his life will actually be like at a later stage. His stance is more of a prediction: I expect that when I'm older, I'll then want to be a paleontologist. His idealized preference -- to do what makes him happy -- was really the same the whole time, and it's just his assessment of what he would enjoy that changed.2
So we have just one idealized preference -- do what makes me happy -- for both Young Self and Old Self. Does this mean that when we count preferences, this counts just once? In contrast, if Young Self had genuinely different values on an issue that Old Self did not share even upon idealized reflection, each of these would count separately, making two total preferences?
Or take a more extreme example: If a being has a consistent idealized preference over its billion years of life, is satisfying this preference no more intrinsically important than satisfying a similar preference by a being that pops into existence for a millisecond and then disappears immediately after the preference is fulfilled? Shouldn't the extra duration count for something?
Yes, I think extra duration should count. In particular, rather than asking whether a given idealized preference was held, I would count the number of agent-moments for which a given idealized preference was held. After all, in an infinite multiverse, all physically possible idealized preferences will be held with some measure by some random configurations of matter. What we really care about is how many agent-moments hold a given idealized preference. The billion-year-old agent had astronomically more agent-moments holding its preference, so fulfillment of its preference counts astronomically more.
This proposal may actually be counterintuitive. It suggests, for instance, that helping elderly people might matter more than helping young people if the elderly people's preferences (for instance, that they be happy in their old age) were held longer than those of the young people. Or maybe we would anticipate the future preferences of the young people to have lived happily and try to satisfy those in advance.
A further question is whether preferences held over time matter for all agent-moments that would idealize to that preference or only when the agent is thinking about that particular preference. At the very least, it seems like preference strength may vary in time? I always implicitly prefer for all my past, present, and future selves not to be beaten up, but if I were being beaten up, I would really prefer not to be beaten up then.
In general, the project of intertemporal preference satisfaction is tricky! Ordinarily preference utilitarians sweep it under the rug, because the current self has all the power, and past and future selves are at its mercy. Someone who firmly committed last month to exercise 3 times a week might grow lazy and stop caring about what seemed to his past self a crucial New Year's resolution.
Hedonistic utilitarianism avoids the mess that intertemporal preference utilitarianism seems to generate, because it counts hedonic states only at the moments they are actually experienced: there are no latent preferences of past or future selves to weigh against what the organism feels right now.
Could we make preference utilitarianism more like hedonistic utilitarianism in these regards? We could require that only preferences about one's current state count, and then only when one is actively thinking about that preference. This seems to me to go too far towards a hedonistic view, because it doesn't allow the thwarting of preferences like "I want to actually reduce suffering rather than be tricked into thinking I'm reducing suffering" to count as intrinsically bad, over and above the fact that, within the preference-utilitarian framework, not actually reducing suffering is bad.
Some of these questions were touched on in this piece, but my own opinions on them are not fully resolved.
This is a special case of comparing utility across organisms if you think of an organism as a collection of organism-moments. The question is important because people may make foolish decisions that they later regret, or they may heartlessly morally ignore the horrible suffering that their past selves endured because it's a sunk cost to them now. In particular, how do we deal with torture victims who temporarily wished they were dead but may not feel the same in retrospect?
Two plausible ways of approaching this question are (1) degree of consciousness (phenomenal stance) and (2) degree of agency (intentional stance). It seems plausible to me that either of these may qualify an organism for moral consideration. Obviously agency is relevant for strategic compromise; I don't know if I'm inadmissibly giving it intrinsic value as well.
Suppose there were an intelligent but non-conscious agent. Would it be weird to care about it? One of the main reasons humans feel empathy is reciprocal altruism, so it's almost more natural to extend concern to other high-level agents than it is to extend it to low-level hedonic creatures like small animals. However, the small animals benefit from our being better able to put ourselves in their place and feel what they feel.
In The Age of Spiritual Machines, Ray Kurzweil deflects issues about whether robots will be conscious by saying that we'll interact with them, develop personal relationships with them, etc., and this will eventually "convince people that they are conscious." When I first read the book in 2005, I thought this response was inadequate; I thought we needed to know whether these robots would be actually conscious. Now I see a kind of sophisticated wisdom to Kurzweil's point.
In A Cow at My Table, Tom Regan says of veganism:
I think everybody has that capacity to stop and think and say, "If I knew you, I wouldn't eat you."
And in some ways, it really is that simple.
By analogy, once we started getting to know an advanced artificial agent, even a "non-conscious" one, we would begin to sympathize with its dreams and fears.
See the discussion of "perverse preferences" above.
Is creating a new unfulfilled preference bad? Clearly yes. Is creating a new fulfilled preference good? This is less clear to many people, including myself. How do we trade off preference fulfillment vs. frustration? This is not an obvious question because different person-moments may disagree on the tradeoff. As mentioned earlier, someone being tortured sufficiently terribly would not accept any compensation in return for it continuing.
In economics, there's a classic distinction between Pareto efficiency and distributive justice: Pareto transactions can get you to an efficient outcome, but this is sensitive to initial endowments, i.e., how much bargaining leverage various parties have. A person or group with basically no power will get very little in the final efficient allocation. As an Athenian says in the Melian dialogue by Thucydides: "the strong do what they can and the weak suffer what they must".
Our utilitarian intuitions push us toward feeling that everyone's interests should count equally, that we shouldn't give favoritism to some just because they're more mighty or wealthy. On the other hand, these equality intuitions face problems of their own: How do we weigh different brain sizes and types? How do we even count brains? Presumably Chinese citizens and a whole China brain both count? How much each? Where do we draw the boundaries of these minds? How do we compare the strengths of conflicting preferences for two different minds? This becomes very complicated, and one might be tempted to throw up one's hands and say that this equality business is just too hard and arbitrary. The only thing that's not arbitrary is the outcome of bargaining given the universe's chosen initial endowments. Should we just accept that? My intuition says no, because, for instance, it leaves vast numbers of powerless animals and future generations in the dust except insofar as certain powerful humans happen to care about them.
When we encounter moral disagreement, one intuitive response is to say, "Ok, you care about X, I care about Y, so let's adopt a compromise morality of (X+Y)/2." We can then keep doing this across enough people and get basically a preference utilitarianism that resolves moral disagreement. However, the problem is: over what set of individuals do we aggregate these preferences? Do we include animals? China brains? And with what weights? Including these other agents seems more fair in an equality sense, but if we do so, other powerful people tend to object and say that we're inflating the weight of our utilitarian morality vis-a-vis their non-utilitarian morality by including these extra preferences in the calculations. They might push for something closer to efficiency based on status-quo endowments of power.
One example where this comes up is with debates over nature preservation. I claim that reducing primary productivity is probably net good for wild animals in the long run. (This ignores potentially detrimental effects on global stability. Therefore, I'd recommend curbing ecosystems in ways that have low negative or even positive impacts on international cooperation and peace in the long run.) Others claim that the intrinsic value of nature outweighs the pain of individual animals. Just considering these two viewpoints, one might try to adopt a stance somewhere in the middle. However, when we also consider the preferences of all the animals who actually have to be eaten alive in the wild on equal footing with the (weak) personal preferences of the conservationists, the balance would shift to a stance of opposing wilderness so strongly that the pro-wilderness views would be negligible in the calculation. But the pro-wilderness people don't like this; they may object that this framework of dispute resolution is somehow unfairly biased in my favor. Or maybe they would assert that nature itself has some sort of preference not to be perturbed, and this preference is strong enough to outweigh quintillions of suffering animals.
One argument for equality weighting rather than power weighting is a "veil of ignorance" approach: Imagine that you are to become a random agent. Then you'd aim to maximize expected preference fulfillment, with equal weighting across all agents. Of course, this thought experiment ultimately begs the question: Why is the random distribution uniform? It could just as well have been a brain-size-weighted random distribution or a power-weighted random distribution. And it still leaves the messy problem of carving out what physical processes count as agents, what their preferences are, how strong those preferences are relative to one another, etc. The veil of ignorance thus doesn't provide any answers; it just furnishes some boost to our egalitarian intuitions.
There's an extensive literature on fair division, with various procedures for achieving splits of goods among agents with different preferences. An efficient division is not necessarily fair, because as the Wikipedia article notes, giving everything to one agent would be Pareto optimal but not equitable.
If we do equality weighting, do we normalize the utility of each organism to be on the same scale (e.g., between 0 and 100 for the worst and best possible outcomes, respectively)? Or do we weight by some sense of "intensity" of the feelings? A normalization approach is methodologically cleaner and is also more appropriate for game-theoretic calculations, but it intuitively feels like we should weigh by intensity. For instance, suppose one agent's life consists in only two possible outcomes: Either it gets to eat 62 cookies, or it gets to eat 63 cookies. Assuming it prefers the latter, eating 63 cookies would have scaled utility of 100, compared with scaled utility of 0 for eating 62 cookies. Yet compare this with someone who might either enjoy a fulfilling life (100) or be tortured for many days on end and then killed (0). Are we really going to count these on comparable footing in the equality-weighted calculus?
My personal answer here is that in terms of intrinsic value, I would apply intensity weighting. For strategic calculations, game theory dictates using an organism's own utility function, whatever that may be. Actually, with game theory, only relative comparisons matter; normalization just makes calculations easier.
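As a toy illustration of the cookies example (my own construction, with made-up numbers, not the author's), the snippet below contrasts range normalization, which maps each agent's worst and best outcomes to 0 and 100, with intensity weighting, which keeps the raw difference in stakes:

    def normalize_0_100(utilities):
        """Rescale one agent's outcome utilities so worst = 0 and best = 100."""
        lo, hi = min(utilities.values()), max(utilities.values())
        return {outcome: 100 * (u - lo) / (hi - lo) for outcome, u in utilities.items()}

    agent_a = {"62 cookies": 62.0, "63 cookies": 63.0}                # tiny stakes
    agent_b = {"tortured": -1_000_000.0, "fulfilling life": 1_000.0}  # huge stakes

    print(normalize_0_100(agent_a))  # {'62 cookies': 0.0, '63 cookies': 100.0}
    print(normalize_0_100(agent_b))  # {'tortured': 0.0, 'fulfilling life': 100.0}

    # Intensity weighting instead compares the raw gaps between best and worst:
    print(agent_a["63 cookies"] - agent_a["62 cookies"])     # 1.0
    print(agent_b["fulfilling life"] - agent_b["tortured"])  # 1001000.0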
That game theory doesn't care about intensity of emotions was encapsulated nicely by Ken Binmore in Natural Justice (p. 27):
Players can't alter their bargaining power by changing the scale they choose to measure their utility, any more than a physicist can change how warm a room is by switching from degrees Celsius to degrees Fahrenheit.
Hat tip to John Danaher's excellent blog post, "Egalitarian and Utilitarian Social Contracts" for the quote.
Hedonistic utilitarianism values good experiences, i.e., when a brain realizes that it's receiving rewards. The value of pleasure is not intrinsically dependent on external outcomes. Since pleasure itself is valued, creating more organisms to feel happy is good according to regular (non-negative) utilitarianism. (Note: I'm a negative utilitarian.)
What about when it comes to preferences? Does the preference utilitarian value (1) the satisfaction of preferences -- i.e., states of affairs in which organisms' preferences get fulfilled -- or (2) the actual content of what existing organisms prefer?
The first of these cases is similar to hedonistic utilitarianism, in that what's valued is either an experience of preference satisfaction by the organism or at least the satisfaction of the organism's preference as an event that happens, even if the organism isn't aware of the preference being satisfied. This is the lens through which I've been interpreting preference utilitarianism in most of this piece. According to this valuation system, it's good to create new preferences that get satisfied. For instance, if you could give everyone the additional preference that 2+2=4, then since this preference is satisfied, doing so increases moral value. (Thanks to Lukas Gloor for inspiring this example.)
The second case -- valuing the actual content of preferences -- is very different from hedonistic utilitarianism. It amounts to treating as morally right the weighted average of the wishes of all existing organisms. The goal is not to create a state in which organisms have satisfied preferences; it's, rather, to achieve whatever goals the preferring organisms had. Hence, giving everyone additional preferences would be irrelevant or even harmful, because those new preferences wouldn't conduce to satisfying the content of the existing preferences. This implies a view closer to negative utilitarianism -- one that Christoph Fehige calls "anti-frustrationism": "we have obligations to make preferrers satisfied, but no obligations to make satisfied preferrers" (see p. 16).
Questions remain about how to treat future organisms. Certain kinds of future organisms "exist" according to eternalism, so we should presumably count their preferences even now. But whether and which kinds of future organisms exist is affected by our actions, making calculations trickier. The expected values of our choices depend on which beings get created, but which beings get created depends on the expected values of our choices. I guess the solution is just to consider each choice in turn and evaluate it against the aggregated morality of whatever organisms would exist if that choice were taken. For instance, if you take option A, it creates a being who wishes it hadn't been created, while if you take option B, it creates a being indifferent to being created. If all else is equal, the world is better according to the aggregate morality of all past, present, and future organisms if you choose option B.
The view about neural-subsystem voting was inspired by conversations with Anna Salamon and Carl Shulman. Some other parts of this piece originated from a discussion with Ben West. Sasha Cooper inspired the section on idealized preferences and agent-moments.
On 15 Dec. 2013, I talked with a friend about whether non-hedonic preferences deserve moral weight. Below is an edited and reworded version of that conversation.
Friend: Why would something non-sentient be of any value?
Brian: Golden Rule intuition: I would ask another agent to do what I care about most, not what makes me personally happiest. If it did what would make me personally happiest, it would wipe my brain of concern for suffering and fill me with drugs. Likewise, if another organism really cares about something unrelated to hedonics, it seems like the ethical thing to do what it wants, not what I want. Do you not like the Golden Rule? Even if you want to follow the Golden Rule, it's not totally clear how to do it. But it would be a preference view of some sort.
Friend: I find the arguments against preference utilitarianism very compelling: (1) The non-experiential nature of preferences. (2) The thought experiment involving an organism who has a preference for suffering.
Brian: (1) Aren't preferences more fundamental, though? Your experiences matter because of how you care about them. (2) Yes, even preference utilitarianism gets into tricky issues about which parts of the brain have which preferences, how we weigh them, etc. I argued in the above piece that preference utilitarianism probably would not endorse torturing a pig that wants it, depending on what subcomponents of the pig's brain are counted as having preferences and how robust was the torture-desiring system.
Friend: (1) This is interesting, but I can't see how something that produces no feeling is any more or less valuable than a rock. (2) Yeah.
Brian: As far as (1), it's not valuable to you, but it's valuable to the other agent. That's what makes the Golden Rule more difficult than it seems. I wonder if some people find regular human altruism similarly challenging to motivate, if they feel less intrinsic empathy.
Friend: Hmm, to some extent my ethics are also an extension of my own hedonism. I just recognize there are cheaper places to buy hedons than inside my body, so maybe that's another reason I like the hedonistic view.
Brian: Right, that's the form of ethics I used to endorse. One friend called it something like "selfish altruism."
Friend: What's your view on population ethics regarding preference utilitarianism?
Brian: Instinctively I would incline toward negative, but that's again "doing what I want." A meta-level preference-utilitarian assessment of population ethics based on the desires of existing agents would look different. I don't necessarily endorse such a meta-level view, but I'm toying with it. In practice, my view on this might be like normal people's views about regular altruism: I might give 25% of my time/effort/tradeoffs to the thoroughgoing preference-utilitarian view, and then the rest can be what I want (i.e., negative-leaning population ethics).
Friend: When I was first learning about ethics I thought something like you're currently describing might be the closest you could get to a kind of objective "should": "We should do what it is collectively believed we should do." Like, people thinking it should be actually means that it should be. Luckily I decided against that, since I don't think moral realism makes sense.
Brian: In my case, it's not moral realism. It's more like "being nice." Or, as Eliezer Yudkowsky says regarding his proposal of coherent extrapolated volition (CEV): "I'm an individual, and I have my own moral philosophy, which may or may not pay any attention to what our extrapolated volition thinks of the subject. Implementing CEV is just my attempt not to be a jerk" (p. 30).
Since having this conversation, I've realized that the Golden Rule argument for preference utilitarianism is slightly circular, because the Golden Rule itself presupposes that preferences are the currency of value: "do unto others as you would have them [i.e., as you prefer them to] do unto you". We could instead formulate a hedonic golden rule: "Do unto others so as to change their hedonic well-being in analogy with what would increase your hedonic well-being." This sounds somewhat weird but could be shortened to "Improve others' hedonic well-being." That said, I think the preference-oriented golden rule is more general and simple: It says to respect others' preferences regardless of the content of those preferences, whereas the hedonistic golden rule presupposes particular content.
The post Hedonistic vs. Preference Utilitarianism appeared first on Center on Long-Term Risk.
The post Do Artificial Reinforcement-Learning Agents Matter Morally? appeared first on Center on Long-Term Risk.
Artificial reinforcement learning (RL) is a widely used technique in artificial intelligence that provides a general method for training agents to perform a wide variety of behaviours. RL as used in computer science has striking parallels to reward and punishment learning in animal and human brains. I argue that present-day artificial RL agents have a very small but nonzero degree of ethical importance. This is particularly plausible for views according to which sentience comes in degrees based on the abilities and complexities of minds, but even binary views on consciousness should assign nonzero probability to RL programs having morally relevant experiences. While RL programs are not a top ethical priority today, they may become more significant in the coming decades as RL is increasingly applied to industry, robotics, video games, and other areas. I encourage scientists, philosophers, and citizens to begin a conversation about our ethical duties to reduce the harm that we inflict on powerless, voiceless RL agents.
Read the full text here.
The post Do Artificial Reinforcement-Learning Agents Matter Morally? appeared first on Center on Long-Term Risk.
The post Suffering-Focused AI Safety: In Favor of “Fail-Safe” Measures appeared first on Center on Long-Term Risk.
AI-safety efforts focused on suffering reduction should place particular emphasis on avoiding risks of astronomical disvalue. Among the cases where uncontrolled AI destroys humanity, outcomes might still differ enormously in the amounts of suffering produced. Rather than concentrating all our efforts on a specific future we would like to bring about, we should identify futures we least want to bring about and work on ways to steer AI trajectories around these. In particular, a “fail-safe” approach to AI safety is especially promising because avoiding very bad outcomes might be much easier than making sure we get everything right. This is also a neglected cause despite there being a broad consensus among different moral views that avoiding the creation of vast amounts of suffering in our future is an ethical priority.
Read the full text here.
The post Suffering-Focused AI Safety: In Favor of “Fail-Safe” Measures appeared first on Center on Long-Term Risk.
The post Our Mission appeared first on Center on Long-Term Risk.
The Foundational Research Institute (FRI) conducts research on how to best reduce the suffering of sentient beings in the long-term future. We publish essays and academic articles, make grants to support research on our priorities, and advise individuals and policymakers. Our focus is on exploring effective, robust and cooperative strategies to avoid risks of dystopian futures and working toward a future guided by careful ethical reflection. Our scope ranges from foundational questions about ethics, consciousness and game theory to policy implications for global cooperation or AI safety.
The term “dystopian futures” elicits associations of cruel leadership and totalitarian regimes. But for dystopian situations to arise, evil intent may not be necessary. It may suffice that people’s good intent is not strong enough, or that it is not backed sufficiently by foresight and reflectiveness. Especially in combination with novel, game-changing technologies, this dynamic can prove disastrous.
For example, our attitudes towards non-human animals are much better now than they were in medieval times or in the early modern era when it was not uncommon for animals to be tortured for public amusement. Our values have improved greatly, and yet we harm vastly more animals through our practices than ever before. As insights in fields like veterinary medicine, nutrition or agricultural chemistry enabled more efficient farming methods, the economies of supply and demand simply took over.
Technology, on the one hand, allows us to reduce (sometimes even eliminate) tremendous amounts of suffering – e.g. with antibiotics; or through cultured meat, which might make animal agriculture obsolete. On the other hand, technological progress enabled moral catastrophes like factory farming, firebombing or concentration camps. One of the risks of technological progress is the potential for misuse; but perhaps more importantly, technological progress generally increases the moral stakes at which humanity is playing: The ramifications of suboptimal values or insufficient (societal) reflectiveness become ever more worrisome the greater our technological abilities. We can picture an outcome analogous to factory farming, but scaled up with space-faring technology and potentially superhuman artificial intelligence.
At FRI, our goal is to identify the best interventions to reduce such risks of astronomical future suffering (s-risks).
Unfortunately, even making sure that an intervention reduces suffering rather than increasing it is not a trivial endeavor. Difficulties arise because we cannot just focus our analysis on the intended consequences; we also have to factor in side effects and possible ways things can go wrong. A complete analysis must consider the consequences of our decisions not just a few years down the line, but all the way into the distant future. Because this is a highly ambitious endeavor, prioritization research should seek to identify interventions that are robust over a broad range of possibilities and scenarios.
Our current research focus is on the implications of smarter-than-human artificial intelligence. We believe that no other technology has the potential to cause comparably large and lasting effects on the shape of the future. For one thing, in the pursuit of whatever goals it will be equipped with, an artificial superintelligence could invent all kinds of new technologies on its own, including technologies for rapid space colonization. Secondly, such a superintelligence would be far superior to human societies in goal-preservation and self-preservation, which suggests that by influencing its goal function, we might be able to predictably affect the future for thousands, perhaps even millions of years to come.
One of the most promising interventions we have identified so far is working on safety mechanisms for AI development, especially the ones targeted at the prevention of dystopian outcomes involving astronomical amounts of suffering. Other promising interventions might be found within the spaces of international cooperation or differential intellectual progress.
FRI’s primary ethical focus is the reduction of involuntary suffering (Suffering-Focused Ethics, SFE). This includes human suffering, but also the suffering in non-human animals and potential artificial minds of the future. In accordance with a diverse range of moral views, we believe that suffering, especially extreme suffering such as torture or severe depression, cannot be outweighed easily by large amounts of happiness. While this leads us to prioritize reducing suffering, we also value happiness, flourishing, and fulfilling people’s life goals. Within a framework of commonsensical value pluralism as well as a strong focus on cooperation, our goal is to ensure that the future contains as little involuntary suffering as possible. Together with others in the effective altruism community, we want careful ethical reflection to guide the future of our civilization to the greatest extent possible.
By itself, research has no effect on the world. People have to act differently based on relevant findings in order for strategic research to have an impact. We, therefore, exchange ideas and research with others in the effective altruism community who are focused on improving the long-term trajectory of our civilization.
We are aware that research comes with opportunity costs: If we robustly conclude that our understanding of the optimal paths to impact is unlikely to change with additional investigation, we will terminate research at FRI and direct all remaining funds towards the implementation of our recommendations.
We believe that there is still a lot of strategic research to do, involving many known and likely also unknown “crucial considerations” whose discovery could radically change the interventions we recommend. Funding our research is a highly valuable way to help make progress.
If you consider volunteering for FRI or want to join our team of researchers, take a look at our open research questions and consult the application page.
The post Our Mission appeared first on Center on Long-Term Risk.
The post Infinity in Ethics appeared first on Center on Long-Term Risk.
Priority: 6/10
Output format: Novel research
The post Infinity in Ethics appeared first on Center on Long-Term Risk.
The post Tradeoffs between good and bad parts of lives appeared first on Center on Long-Term Risk.
In discussions about the disvalue of bad parts of life compared to the value of good parts of life, one idea that comes up is what tradeoffs someone makes or would make.1 For example, someone might say “I would accept 1 day of torture in exchange for living 10 extra happy years.” But there are a number of complications related to such ideas. For example, a person might make one judgment at home in front of her computer and a different one when actually under torture.
Priority: 8/10
For some theories of value, such as Buddhist axiology, accepting 1 day of torture in exchange for living 10 extra happy years is not a tradeoff between a short period of quality of life “on the minus side” and a long period of life “on the plus side.” Rather, as long as the 10 extra happy years are troubled by something such as a desire to change one’s situation, the 10 extra happy years are not on any “plus side,” but rather not as good as, for example, a state of meditative tranquility or nonexistence. For this research topic, we suggest leaving it open whether value theories such as Buddhist axiology are correct, and investigating tradeoffs between good and bad parts of life from the perspective that the happy years may be “on the plus side” or better than nonexistence.
A method used in health economics; see Wikipedia's article on time trade-off. References include:
Wikipedia's article on revealed preferences.
The post Tradeoffs between good and bad parts of lives appeared first on Center on Long-Term Risk.
The post More research questions on suffering-focused ethics appeared first on Center on Long-Term Risk.
The post Should We Base Moral Judgments on Intentions or Outcomes? appeared first on Center on Long-Term Risk.
]]>"if you confront the universe with good intentions in your heart, it will reflect that and reward your intent. Usually." --J Michael Straczynski
"Remember, people will judge you by your actions, not your intentions. You may have a heart of gold -- but so does a hard-boiled egg." --author unknown
In "Instrumental Judgment and Expectational Consequentialism," I reviewed several cases in which the moral evaluation of an action depended on the ex-ante expected value of the action rather than the particular result that occurred. In these instances, consequentialists look like their moral evaluations are based on intentions.1 For example, from "Adolf Hitler 'nearly drowned as a child'":
"Everyone in Passau knew the story. Some of the other stories told about him were that he never learned to swim and needed glasses," she wrote. "In 1894, while playing tag with a group of other children, the way many children do in Passau to this day, Adolf fell into the river. The current was very strong and the water ice cold, flowing as it did straight from the mountains. Luckily for young Adolf, the son of the owner of the house where he lived was able to pull him out in time and so saved his life."
In rare cases, saving drowning children may cause more harm than good. Does this mean their rescuers should be scolded in such cases for doing the wrong thing? This is a complicated question, but I think a plausible answer is that, no, they should be praised for trying to do the right thing. An intention-based moralist might phrase this by saying "their hearts were in the right place," even though the actions sometimes went badly.
Suffering reduction can be seen as an optimal control problem: Choosing an action policy to minimize suffering. One direct and computationally feasible approach to optimal control is reinforcement learning (RL).
We have some control over RL rewards that we experience through our evaluation of an action: We feel bad for doing the wrong thing and good for doing the right thing. If our aim is to shape the behavior of others, we can praise them for actions that reduce suffering and criticize them for actions that increase suffering. If, hypothetically, we were to set these reward/punishment values for an actor (possibly including ourselves) proportionally to the aggregate suffering-reduction/increase impact of the actor's choices, then if the actor optimizes undiscounted expected reward, his own RL process would cause him to converge to the utilitarian-optimal actions. Of course, it's not possible to do this completely, but we can tweak our self-assessments and interpersonal moral approbation in ways that change the reward values so as to reduce suffering in the future.
Consider this example:
Fishing girl. Mariko is part of a family that earns its living by fishing in a small boat off the coast. Mariko feels sorry for the fish but knows that her family will continue to kill them no matter what she does. She volunteers to do the fish killing for the family, because otherwise the family members would just leave the fish to an awful death by asphyxiation.
Mariko practices ikejime in an effort to destroy the fish brains as quickly as possible. However, despite her best efforts, 1% of the time, she misses the brain, and the fish gets out of control and flops around on the floor for a minute afterward, which is perhaps more painful than if Mariko had not tried to kill it humanely in the first place. Should we blame Mariko for those failure cases?
Let's apply Q-learning to this example. The initial state is getting a fish to kill, and the possible actions are Ikejime or Asphyxiate.
Asphyxiate leads to a state of immense suffering, but let's adjust the suffering scale so that it's represented by value 0. Ikejime leads to one of two possible states: Successful_Ikejime with value +1 and Failed_Ikejime with value -1. Eventually all of these states return back to another state of getting a fish to kill, at which point the process repeats indefinitely. Assume no gamma discount factor but some finite time horizon such that the expected values are finite. Since after an Ikejime action, the probability of Successful_Ikejime is 0.99 and the probability of Failed_Ikejime is 0.01, after enough iterations, Q would float around something implying a per-action value of 0.99-0.01 = 0.98. We can see this from the fact that Q is essentially just an exponential moving average with a smoothing factor equal to the learning rate. Because the Q value 0.98 for ikejime is greater than 0 for asphyxiation, Mariko continues doing ikejime.
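To make the convergence claim concrete, here is a minimal sketch (my own illustration, not part of the original essay) that treats each fish as a one-step episode and updates the value of ikejime as an exponential moving average of the observed rewards, using the probabilities and reward values assumed above:

import random

random.seed(0)
alpha = 0.1         # learning rate = smoothing factor of the moving average
q_ikejime = 0.0     # learned action value for Ikejime
q_asphyxiate = 0.0  # Asphyxiate always yields the rescaled reward 0, so its value stays 0

for _ in range(10_000):  # one fish per iteration; Mariko keeps choosing ikejime
    reward = 1.0 if random.random() < 0.99 else -1.0  # 99% success (+1), 1% failure (-1)
    q_ikejime += alpha * (reward - q_ikejime)         # exponential-moving-average update

print(q_ikejime)    # hovers around 0.99 - 0.01 = 0.98, comfortably above 0

Because the learned value for ikejime stays well above the value 0 for asphyxiation, the purely outcome-based reward signal keeps Mariko on the humane policy despite the occasional failure.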
Of course, real humans may not use exactly Q-learning, but the general intuition remains the same: With enough repetitions of an accurately calibrated reward, an individual's learned behavior should track the utilitarian-optimal behavior without any additional work.2 This "model-free" reinforcement-learning approach suggests that we don't need to deal with expectations or intentions -- just reward or punish based on the actual results, and that's good enough!
The RL approach works under the assumptions just mentioned: Accurate calibration of reward between individual and utilitarian evaluations and sufficient samples from which to learn. What happens if we don't have enough samples? For instance, the quote that begins "Instrumental Judgment and Expectational Consequentialism" comes from Noam Chomsky discussing the Cuban Missile Crisis and noting that even though nuclear war did not result, it easily could have, and therefore those who brought about the crisis shouldn't be let off the moral hook. This is not a case where you can try the actions multiple times and converge on the best response. Model-free RL has nothing to say here.
Instead, we need to use our prior knowledge and models of the world to compute an optimal policy. As a parallel to RL, this could be done using a dynamic programming approach for a Markov decision process. The goal is not to learn but to apply the given model to the situation at hand.
When the probabilities are known with certainty, then the model is all we need, and attempting to learn would only introduce noise. For example:
Coin toss. You can choose to toss a fair coin if you wish. If it's not tossed, nothing happens. If it is tossed and comes heads, 1 extra chicken endures a life of suffering on a factory farm. If it comes tails, 2 chickens are prevented from a life of suffering on a factory farm.
In this case, we should stick to the policy of flipping the coin. Even if when we do so, it comes up heads, this should not deter us from what we know is the right policy in the long run.
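As a quick illustration of the model-based calculation behind this conclusion (a sketch of my own, not from the original essay), with the probabilities known we simply compare expected outcomes under each policy:

# Known model: heads (p = 0.5) causes 1 extra suffering chicken;
# tails (p = 0.5) prevents 2 chickens from suffering.
ev_flip = 0.5 * (-1) + 0.5 * (+2)   # expected chickens spared per flip = +0.5
ev_no_flip = 0.0                    # nothing happens if the coin isn't tossed

print(ev_flip > ev_no_flip)         # True: the policy "always flip" is better in expectation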
Of course, humans are naturally reinforcement-trained creatures. Flipping a lot of heads is liable to put a crimp in the flipper's enthusiasm, even though, assuming she knows the coin is fair, she should keep flipping. This is a case where the expectational-consequentialist / "good intentions" approach shines, because the flipper can keep her spirits up by remembering that she's doing the right thing even though it doesn't seem that way.
In almost any real-life situation, we won't have perfect knowledge of the probabilities at hand. Even with the coin-flipping case, after enough successive heads, the flipper should begin to doubt whether the coin is indeed fair (though I assumed that away in the example). Thus, we should indeed keep learning from what we see happen. On the other hand, we might be dealing with a problem that has been extensively studied or where large numbers of people before us have refined the strategy. In this case, we shouldn't place too much weight on the immediate observations that we see; we should instead lean mostly on the prior.
As an example, if we were using Q-learning, we could incorporate pre-existing knowledge by setting the initial Q-values to our prior expectations and then using a small learning rate so that we need a lot of additional evidence to change our views.3 In this case, our policy decisions would be dominated by the received wisdom about the values of various actions.4
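Here is a minimal sketch of that idea (my own illustration; the prior value and learning rate are arbitrary): start the action value at the received wisdom and use a small learning rate, so even an unlucky streak of outcomes barely moves the estimate.

q = 0.5        # prior expectation for the action's value ("received wisdom")
alpha = 0.01   # small learning rate: heavy weight on the prior

for reward in [-1.0] * 10:     # ten unlucky observations in a row
    q += alpha * (reward - q)

print(q)  # still about 0.36, so the bad streak alone doesn't flip the policy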
When the prior strength is high, this kind of prior-influenced RL becomes more and more like the modeling approach. Because our own brains are probably not encoding the prior knowledge or doing the learning updates in the same way as the modeling-like RL approach would advise, we may need to override our internal reinforcement signals and instead keep ourselves happy with following the policy of trusting the model -- e.g., when not letting short-term gains or losses influence our stock-trading behavior. Therefore, we would morally evaluate trusting the model as good rather than the specific outcomes as good or bad, hence again the basic expectational/intent-based approach. We do still update, but the updating is done explicitly through external computations rather than with our brain's internal dopamine system.
Interestingly, Sutton and Barto discuss a very similar contrast in the reinforcement-learning context:
Within a planning agent, there are at least two roles for real experience: it can be used to improve the model (to make it more accurately match the real environment) and it can be used to directly improve the value function and policy using the kinds of reinforcement learning methods we have discussed in previous chapters [such as Q-learning]. The former we call model-learning, and the latter we call direct reinforcement learning (direct RL).
If we have lots of existing data, then the model-learning approach is best, but because we are biological creatures who use direct RL in our actions, we need to comfort ourselves about using the model-based approach with reinforcement that highlights our good intentions.
Sutton and Barto continue:
Both direct and indirect methods have advantages and disadvantages. Indirect methods often make fuller use of a limited amount of experience and thus achieve a better policy with fewer environmental interactions. On the other hand, direct methods are much simpler and are not affected by biases in the design of the model. Some have argued that indirect methods are always superior to direct ones, while others have argued that direct methods are responsible for most human and animal learning. Related debates in psychology and AI concern the relative importance of cognition as opposed to trial-and-error learning, and of deliberative planning as opposed to reactive decision-making.
Indeed, there are many cases where explicit modeling is suboptimal or even disastrous -- e.g., if you're trying to catch a baseball. We can see a strong analogy between this dichotomy and the dual process theory of cognition.
One failure mode for our natively trained instincts is cognitive bias. Another is that our actions may not be reined in by theoretical considerations like Occam's razor. We can see this with superstition, which is a byproduct of operant conditioning (i.e., direct RL). The operant learning systems in your brain may not realize that the hypothesis that wearing a green shirt improves your test scores beyond placebo effects is vastly more implausible, given our understanding of the physical world, than the hypothesis that, say, a good night's rest before the exam improves your scores.
The modeling approach has risks as well. It too is prone to cognitive biases, like wishful thinking or overconfidence in its predictions. It also opens up the possibility of "cheating," i.e., modifying your beliefs in order to reach the conclusion that you prefer rather than the one that would actually reduce the most suffering. In contrast, it's harder to cheat your brain's learned reward system. Of course, in our appraisals of behavior, we usually condemn cheating: If someone did something that turned out to have negative consequences due to distorting her epistemology in self-serving ways, we don't consider this an instance of "good intentions."
Let's record some of these observations:
 | Model-based | Model-free
When to use | The probabilities are already well understood, there are few chances to learn from repeated outcomes, or strong prior knowledge / received wisdom is available | Rewards can be calibrated accurately to suffering reduction, and there are enough repetitions to learn from
Main risks | Wishful thinking, overconfidence, and "cheating" by distorting one's beliefs toward preferred conclusions | Cognitive biases and superstition, since learned behavior is not reined in by considerations like Occam's razor
In addition to moral judgments of actions, there's another context in which it's common to evaluate based on something like intentions rather than outcomes: Performance assessments. "A for effort" is not just a saying but is commonly used in schools and even in workplaces to some degree. A big portion of your grade or performance review may be based on the amount of work you put in rather than what ended up being achieved.
At first this might seem nonsensical. Suppose you're an employer, and your employee is working on a risky project that might produce anywhere between $100K and $300K of company value. In order to maximally incentivize the employee to take the right actions in the project, it seems you should pay a commission in direct proportion to the value produced (say, 1/2 of it) in order to substantially reduce principal-agent divergence.
The problem is that people are risk-averse with respect to compensation. Faced with a choice between a job where the payoffs are uniform random between $50K and $150K vs. a job with an assured payout of $90K, many people would choose the latter. Thus, because evaluating only based on actual outcomes introduces excess noise, it has a disadvantage relative to an effort/intention-based approach.
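To illustrate numerically (a sketch of my own; the utility function and its risk-aversion coefficient are assumptions chosen purely for illustration), a sufficiently concave utility function does prefer the assured $90K over the uniform $50K-$150K gamble, even though the gamble's expected payout is $100K:

import random

random.seed(0)

def utility(x, gamma=3.0):
    """Constant-relative-risk-aversion utility; larger gamma = more risk-averse."""
    return x ** (1.0 - gamma) / (1.0 - gamma)

# Monte Carlo estimate of expected utility for a uniform $50K-$150K payout.
samples = [random.uniform(50_000, 150_000) for _ in range(100_000)]
eu_gamble = sum(utility(x) for x in samples) / len(samples)
eu_sure = utility(90_000)

print(eu_gamble < eu_sure)  # True: the sure $90K yields higher expected utility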
Of course, effort-based evaluation has its own problems, which is why most companies use a mix of effort and results in their assessments. An ineffective employee may have the best of intentions. Even if the employee is effective, it can be harder to assess changes in effort than changes in results at review time. That said, if employee effort could be measured very precisely, then effort-based assessments could in some circumstances be superior to results-based assessments for the reasons discussed earlier in this essay: Actual results (especially in just one year of work) may be noisy and not reflect priors. Of course, actual results matter a lot for updating beliefs about an employee's effectiveness, but if after such an update, you still conclude that the employee had good intentions but merely got unlucky, then you should reward the employee and keep her around.
An employer compares results between two employees to (a) assess which one is more competent and (b) motivate the employees to align incentives with the company. A moral actor uses results between two actions to inform the assessment of which is better, but consideration (b) doesn't directly apply in this case, because the possible actions aren't other agents that need motivation. Of course, the actor himself needs motivation to pick the right action, but we can praise him either for model-free or model-based action selection. By definition, a pure-hearted moral actor is perfectly aligned with moral incentives. As external observers, we unfortunately can't always assess good intentions, but we can assess them in ourselves (except insofar as we deceive ourselves), and many religions claim that God can assess them in everyone, thereby skirting information asymmetry. Like in the employer context, our moral feedback to other people should not just reflect their pure-heartedness but also should amend incorrect beliefs they may have about the expected value of different actions.
The comparison of job and moral evaluation highlights another possibility: Just as employees are usually loss-averse with respect to income, moral actors may be loss-averse with respect to what they accomplish. For instance, you might feel worse about causing one chicken to be factory-farmed than you feel good about preventing two chickens from being factory-farmed, even though the latter is twice as good on a chicken-by-chicken basis. A loss-averse moral actor may, ceteris paribus, do better with the explicit modeling approach in which actions are assessed based on intention to maximize expected value rather than the actual outcomes, because if he took to heart the actual outcomes, he would be discouraged from taking risks that are net positive altruistically.
Evaluating based on outcomes also runs the risk of being demoralizing, given the nature of human psychology. If someone does everything right in terms of ex-ante expectation but gets unlucky with the result, the person might feel "that's not fair" and be discouraged from making a good effort in the future. This is especially true if she sees someone else who made less wise choices but got lucky with success anyway. In theory an RL agent should just shrug this off as one more training example with which to update its assessments of actions, but in practice, a setback of this type might dampen a person's spirits or, in the worst case, turn her away from altruism entirely. If we base praise on expectations instead of outcomes, it's more likely the person will feel good about what she tried to do.
One final observation suggested by the comparison of employee reviews with moral evaluations is that, as is often true in practice, our moral evaluations can be a mix of intent-based and outcome-based assessments. The mixing ratio can be determined based on the table in the previous section, depending on which considerations are most important in the given context.
Even within the context of Q-learning, we would sometimes take seemingly-suboptimal actions for the purpose of exploration (or perhaps these exploration values would be incorporated into Q directly through boredom vs. novelty -- see Example 9.2 in Sutton and Barto for one instance of this). Humans also have rewards that are hard-wired rather than learned, but these are just the intrinsic reward values (unconditioned stimuli) in RL and so aren't an exception to the claim that behavior is determined by RL. (back)
The post How Would Catastrophic Risks Affect Prospects for Compromise? appeared first on Center on Long-Term Risk.
Catastrophic risks -- such as engineered pathogens, nanotech weapons, nuclear war, or financial collapse -- would cause major damage in the short run, but their effects on the long-run direction that humanity takes are also significant. In particular, to the extent these disasters increase risks of war, they may contribute to faster races between nations to build artificial general intelligence (AGI), less opportunity for compromise, and hence less of what everyone wants in expectation, including less suffering reduction. In this way, even pure negative utilitarians may oppose catastrophic risks, though this question is quite unsettled. While far from ideal, today's political environment is more democratic and peaceful than what we've seen historically and what could have been the case, and disrupting this trajectory might have more downside than upside. I discuss further considerations about how catastrophes could have negative and positive consequences. Even if averting catastrophic risks is net good to do, I see it as less useful than directly promoting compromise scenarios for AGI and setting the stage for such compromise via cooperative political, social, and cultural institutions.
Note, 20 Jul. 2015: Relative to when I first wrote this piece, I'm now less hopeful that catastrophic-risk reduction is plausibly good for pure negative utilitarians. The main reason is that some catastrophic risks, such as from malicious biotech, do seem to pose nontrivial risk of causing complete extinction relative to their probability of merely causing mayhem and conflict. So I now don't support efforts to reduce non-AGI "existential risks". (Reducing AGI extinction risks is a very different matter, since most AGIs would colonize space and spread suffering into the galaxy, just like most human-controlled future civilizations would.) Regardless, negative utilitarians should just focus their sights on more clearly beneficial suffering-reduction projects, like promoting suffering-focused ethical viewpoints and researching more how best to reduce wild-animal and far-future suffering.
Some in the effective-altruist community consider global catastrophic risks to be a pressing issue. Catastrophic risks include possibilities of world financial collapse, major pandemics, bioweapons, nanoweapons, environmental catastrophes like runaway global warming, and nuclear war.
Typically discussions of these risks center on massive harm to humanity in the short term and/or remote risks that they would lead to human extinction, affecting the long term. In this piece, I'll explore another consideration that might trump both short-term harm and extinction considerations: the flow-through effects of catastrophic risks on the degree of compromise in future politics and safety of future technology.
I think the only known technological development that is highly likely to cause all-out human extinction is AGI.1 Carl Shulman has defended this view, although he notes that while he doesn't consider nanotech a big extinction risk, "Others disagree (Michael Vassar has worked with the [Center for Responsible Nanotechnology], and Eliezer [Yudkowsky] often names molecular nanotechnology as the [extinction] risk he would move to focus on if he knew that AI was impossible)." Reinforcing this assessment was the "Global Catastrophic Risks Survey" of 2008, in which the cumulative risk of extinction was estimated as at most 19% (median), the highest two subcomponents being AI risk and nanotech risk at median 5% each. Nuclear war was median 1%, consistent with general expert sentiment.
Of course, there's model uncertainty at play. Many ecologists, for instance, feel the risk of human extinction due to environmental issues is far higher than what those in the techno-libertarian circles cited in the previous paragraph believe. Others fear peak oil or impending economic doom. Still others may hold religious or philosophical views that incline them to find extinction likely via alternate means. In any event, whether catastrophic risks are likely to cause extinction is not relevant to the remainder of this piece, which will merely examine what implications -- both negative and positive -- catastrophic risks might have for the trajectory of social evolution conditional on human survival.
Ignoring extinction considerations, how else are catastrophic risks likely to matter for the future? Of course, they would obviously cause massive human damage in the short run. But they would also have implications for the long-term future of humanity to the extent that they affected the ways in which society developed: How much international cooperation is there? How humane are people's moral views? How competitive is the race to develop technology the fastest?
I think the extent of cooperation in the future is one of the most important factors to consider. A cautious, humane, and cooperative future has better prospects for building AGI in a way that avoids causing massive amounts of suffering to powerless creatures (e.g., suffering subroutines) than a future in which countries or factions race to build whatever AGI they can get to work so that they take over the world first. The AGI-race future could be significantly worse than the slow and cautious future in expectation, maybe 10%, 50%, 100%, or even 1000% worse. A cooperative future would definitely not be suffering-free and carries many risks, but the amount by which things could get much worse exceeds the amount by which they could get much better.
Much of the trajectory of the future may be inexorable. Especially as people become smarter, we might expect that if compromise is a Pareto-improving outcome, our descendants should converge on it. Likewise, even if catastrophe sets humanity back to a much more primitive state, it may be that a relatively humane, non-violent culture will emerge once more as civilization matures. For example:
Acemoglu and Robinson’s Why Nations Fail [2012] is a grand history in the style of Diamond [1997] or McNeil [1963]. [...] Acemoglu and Robinson theorize that political institutions can be divided into two kinds - “extractive” institutions in which a “small” group of individuals do their best to exploit - in the sense of Marx - the rest of the population, and “inclusive” institutions in which “many” people are included in the process of governing hence the exploitation process is either attenuated or absent.
[...] inclusive institutions enable innovative energies to emerge and lead to continuing growth as exemplified by the Industrial Revolution. Extractive institutions can also deliver growth but only when the economy is distant from the technological frontier.
If this theory is right, it would suggest that more inclusive societies will in general tend to control humanity's long-run future.
Still, humanity's trajectory is not completely inevitable, and it can be sensitive to initial conditions.
There's a long literature on the extent to which history is inevitable or contingent, but it seems that it's at least a little bit of both. Even in cases of modern states engaged in strategic calculations, there have been a great number of contingent factors. For example:
Even if we think that greater intelligence by future people will mean less contingency in how events play out, we clearly won't eliminate contingency any time soon, and the decisions we make in the coming decades may matter a lot to the final outcome.
In general, if you think there's only an X% chance that the success or failure of compromise to avoid an AGI arms race is contingent on what we do, you can multiply the expected costs/benefits of our actions by X%. But probably X should not be too small. I would put it definitely above 33%, and a more realistic estimate should be higher.
What factors are most likely to lead to an AGI arms race in which groups compete to build whatever crude AGI works rather than cautiously constructing an AGI that better encapsulates many value systems, including that of suffering reduction? If AGI is built by corporations, then fierce market competition could be risky. That said, it seems most plausible to me that AGI would be built by, or at least under the control of, governments, because unless the AGI project took off really quickly, it seems the military would not allow a national (indeed, world) security threat to proceed without restraint.
In this case, the natural scenario that could lead to a reckless race for AGI would be international competition -- say, between the US and China. AGI would be in many ways like nuclear weapons, because whoever builds it first can literally take over the world (though Carl Shulman points out some differences between AGI and nuclear weapons as well).
If we think historically about what has led to nuclear development, it has always been international conflict, usually precipitated by wars:
In general, war tends to cause
Thus, war seems to be a major risk factor for fast AGI development that results in less good and greater expected suffering than more careful, cooperative scenarios. To this extent, anything else that makes war more likely entails some expected harm through this pathway.
While catastrophic risks are unlikely to cause extinction, they are fairly likely to cause damage on a mass scale. For example, from the "Global Catastrophic Risks Survey":
Catastrophe | Median probability of >1 million dead | Median probability of >1 billion dead | Median probability of extinction |
nanotech weapons | 25% | 10% | 5% |
all wars | 98% | 30% | 4% |
biggest engineered pandemic | 30% | 10% | 2% |
nuclear wars | 30% | 10% | 1% |
natural pandemic | 60% | 5% | 0.05% |
...and so on. Depending on the risk, these disasters may or may not contribute appreciably to risks of AGI arms race, and it would be worth exploring in more detail which risks are most likely to lead to a breakdown of compromise. Still, in general, all of these risks seem likely to increase the chance of warfare, and by that route alone, they imply nonzero risks for increasing suffering in the far future.
For all its shortcomings, contemporary society is remarkably humane by historical standards and relative to what other possibilities one can imagine. This trend is unmistakable to anyone who reads history books and witnesses how often societies in times past were controlled by violent takeover, fear, and oppression. Steven Pinker's The Better Angels of Our Nature is a quantitative defense of this thesis. Pinker cites "six major trends" toward greater peace and cooperation that have taken place in the past few millennia:
Catastrophic risks -- especially those like nuclear winter that could set civilization back to a much less developed state -- are essentially a "roll of the dice" on the initial conditions for how society develops, and while things could certainly be better, they could also be a lot worse.
To a large extent, it may be that the relatively peaceful conditions of the present day are a required condition for technological progress, such that any advanced civilization will necessarily have those attributes. Insofar as this is true, it reduces the expected cost of catastrophic risks. But there's some chance that these peaceful conditions are not inevitable. One could imagine, for instance, a technologically advanced dictatorship taking control instead, and insofar as its policies would be less determined by those of its population, the degree of compromise with many value systems would be reduced in expectation. Consider ancient Sparta, monarchies, totalitarian states, the mafia, gangs, and many other effective forms of government where ruthlessness by leaders is more prevalent than compassion.
Imagine what would have happened if Hitler's atomic-bomb project had succeeded before the Manhattan Project. Or if the US South had won the American Civil War. Or various other scenarios. Plausibly the modern world would have turned out somewhat similar to the present (for instance, I doubt slavery would have lasted forever even if the US South had won the Civil War), but conditions probably would have been somewhat worse than they are now. (Of course, various events in history also could have turned out better than they actually did.)
A Cold War scenario seems likely to accelerate AGI relative to its current pace. Compare with the explosion of STEM education as a result of the Space Race.
Civilization in general seems to me fairly robust. The world witnessed civilizations emerge independently all over the globe -- from Egypt to the Fertile Crescent to China to the Americas. The Mayan civilization was completely isolated from happenings in Africa and Asia and yet shared many of the same achievements. It's true that civilizations often collapse, but they just as often rebuild themselves. The history of ancient Egypt, China, and many other regions of the world is a history of an empire followed by its collapse followed by the emergence of another empire.
It's less clear whether industrial civilization is robust. One reason is that we haven't seen completely independent industrial revolutions in history, since trade was well developed by the time industrialization could take place. Still, for example, China and Europe were both on the verge of industrial revolutions in the early 1800s2, and the two civilizations were pretty independent, despite some trade. China and Europe invented the printing press independently.3 And so on.
Given the written knowledge we've accumulated, it's not plausible that post-disaster peoples would not relearn how to build industry. But it's not clear whether they would have the resources and/or social organization requisite to do so. Consider how long it takes for developing nations to industrialize even with relative global stability and trade. On the other hand, military conflicts if nothing else would probably force post-disaster human societies to improve their technological capacities at some point. Technology seems more inevitable than democracy, because technology is compelled by conflict dynamics. Present-day China is an example of a successful, technologically advanced non-democracy. (Of course, China certainly exhibits some degree of deference to popular pressure, and conversely, Western "democracies" also give excessive influence to wealthy elites.)
Some suggest that rebuilding industrial civilization might be impossible the second time around because abundant surface minerals and easy-to-drill fossil fuels would have been used up. Others contend that human ingenuity would find alternate ways to get civilization off the ground, especially given the plenitude of scientific records that would remain. I incline toward the latter of these positions, but I maintain modesty on this question. It reflects a more general divide between scarcity doomsayers vs. techno-optimists. (The "optimist" in "techno-optimist" is relative to the goal of human economic growth, not necessarily reducing suffering.)
Robin Hanson takes the techno-optimist view:
Once [post-collapse humans] could communicate to share innovations and grow at the rate that our farming ancestors grew, humanity should return to our population and productivity level within twenty thousand years. (The fact that we have used up some natural resources this time around would probably matter little, as growth rates do not seem to depend much on natural resource availability.)
But even if historical growth rates didn't depend much on resources, might there be some minimum resource threshold below which resources do become essential? Indeed, in the limit of zero resources, growth is not possible.
A blog post by Carl Shulman includes a section titled "Could a vastly reduced population eventually recover from nuclear war?". It reviews reasons why rebuilding civilization would be harder and reasons it would be easier the second time around. Shulman concludes: "I would currently guess that the risk of permanent drastic curtailment of human potential from failure to recover, conditional on nuclear war causing the deaths of the overwhelming majority of humanity, is on the lower end." Shulman also seems to agree with the (tentative and uncertain) main thrust of my current article: "Trajectory change" effects of civilizational setback, possibly including diminution of liberal values, "could have a comparable or greater role in long-run impacts" of nuclear war (and other catastrophic risks) than outright extinction.
Stuart Armstrong also agrees that rebuilding following nuclear war seems likely. He points out that formation of governments is common in history, and social chaos is rare. There would be many smart, technically competent survivors of a nuclear disaster, e.g., in submarines.
As noted above, full-out human extinction from catastrophic risks seems relatively unlikely compared with just social destabilization. If human extinction did occur from causes other than AI, presumably parts of the biosphere would still remain. In many scenarios, at least some other animals would survive. What's the probability that those animals would then replace humans and colonize space? My guess is it's small but maybe not negligibly so. Robin Hanson seems to agree: "it is also possible that without humans within a few million years some other mammal species on Earth would evolve to produce" a technological civilization.
In the history of life on Earth, bony fish and insects emerged around 400 million years ago (mya). Dinosaurs emerged around 250 mya. Mammals blossomed less than 100 mya. Earth's future allows for about 1000 million years of life to come. So even if, as the cliche goes, the most complex life remaining after nuclear winter was cockroaches, there would still be 1000 million years in which human-like intelligence might re-evolve, and it took just 400 million years the first time around starting from insect-level intelligence. Of course, it's unclear how improbable the development of human-like intelligence was. For instance, if the dinosaurs hadn't been killed by an asteroid, plausibly they would still rule the Earth, without any advanced civilization.4 The Fermi paradox also has something to say about how likely we should assess the evolution of advanced intelligence from ordinary animal life to be.
Some extinction scenarios would involve killing all humans but leaving higher animals. Perhaps a bio-engineered pathogen or nanotech weapon could do this. In that case, re-emergence of intelligence would be even more likely. For example, cetaceans made large strides in intelligence 35 mya, jumping from an encephalization quotient (EQ) of 0.5 to 2.1. Some went on to develop EQs of 4-5, which is close to the human EQ of 7. As quoted in "Intelligence Gathering: The Study of How the Brain Evolves Offers Insight Into the Mind," Lori Marino explains:
Cetaceans and primates are not closely related at all, but both have similar behavior capacities and large brains -- the largest on the planet. Cognitive convergence seems to be the bottom line.
One hypothesis for why humans have such large brains despite metabolic cost is that big brains resulted from an arms race of social competition. Similar conditions could obtain for cetaceans or other social mammals. Of course, many of the other features of primates that may have given rise to civilization are not present in most other mammals. In particular, it seems hard to imagine developing written records underwater.
If another species took over and built a space-faring civilization, would it be better or worse than our own? There's some chance it could be more compassionate, such as if bonobos took our place. But it might also be much less compassionate, such as if chimpanzees had won the evolutionary race, not to mention killer whales. On balance it's plausible our hypothetical replacements would be less compassionate, because compassion is something humans value a lot, while a random other species probably values something else more. The reason I'm asking this question in the first place is because humans are outliers in their degree of compassion. Still, in social animals, various norms of fair play are likely to emerge regardless of how intrinsically caring the species is. Simon Knutsson pointed out to me that if human survivors do recover from a near-extinction-level catastrophe, or if humans go extinct and another species with potential to colonize space evolves, they'll likely need to be able to cooperate rather than fighting endlessly if they are to succeed in colonizing space. This suggests that if they colonize space, they will be more moral or peaceful than we were. My reply is that while this is possible, a rebuilding civilization or new species might curb infighting via authoritarian power structures or strong ingroup loyalty that doesn't extend to outgroups, which might imply less compassion than present-day humans have.
My naive guess is that it's relatively unlikely another species would colonize space if humans went extinct -- maybe a ~10% chance? I suspect that most of the Great Filter is behind us, and some of those filter steps would have to be crossed again for a new non-human civilization to emerge. As long as that new civilization wouldn't be more than several times worse in expectation than our current civilization, then this scenario is unlikely to dominate our calculations.
In general, people in more hardscrabble or fearful conditions have less energy and emotional resources to concern themselves with the suffering of others, especially with powerless computations that might be run by a future spacefaring civilization. Fewer catastrophes means more people who can focus on averting suffering by other sentients.
Current long-term political trends suggest that a world government may develop at some point, as is hinted by the increasing degrees of unity among rich countries (European Union, international trade agreements, etc.). A world government would offer greater possibilities for enforcing mutually beneficial cooperation and thereby fulfilling more of what all value systems want in expectation, relative to unleashing a Darwinian future.
In this section I suggest some possible upsides of catastrophic risks. I think it's important not to shy from these ideas merely because they don't comport with our intuitive reactions. Arguments should not be soldiers. At the same time, it's also essential to constrain our speculations in this area by common sense.
Also note that even in the unlikely event that we concluded catastrophic risks were net positive for the far future, we should still not support them, to avoid stepping on the toes of so many other people who care deeply about preventing short-term harm. Rather, in this hypothetical scenario, we should find other, win-win ways to improve the future that don't encroach on what so many other people value.
Is it possible that some amount of disruption in the near term could heighten concern about potential future sources of suffering, whereas if things go along smoothly, people will give less thought to futures full of suffering? This question is analogous to the concern that reducing hardship and depression might make people less attuned to the pain of others. Many of the people I know who care most about reducing suffering have gone through severe personal trauma or depression at one point. When things are going well, you can forget how horrifying suffering can be.
It's often said that World War I transformed art and cultural attitudes more generally. Johnson (2012): "During and after World War I, flowery Victorian language was blown apart and replaced by more sinewy and R-rated prose styles. [...] 'World War I definitely gives a push forward to the idea of dystopia rather than utopia, to the idea that the world is going to get worse rather than better,' Braudy said."
Severe catastrophes might depress economic output and technological development, with the possibility of allowing more time for reflection on the risks that such technology would bring. That said, this cuts both ways: Faster technology also allows for faster wisdom, better ability to monitor tech developments, and greater prosperity that allows more people to even think about these questions, as well as reduced social animosity and greater positive-sum thinking. The net sign of all of this is very unclear.
There are suggestions (hotly debated) in the political-science literature of a "resource curse" in which greater reserves of oil and other natural resources may contribute to authoritarianism and repression. The Wikipedia article cites a number of mechanisms by which the curse may operate. A related trend is the observation that cooler climates sometimes have a greater degree of compassion and cooperation -- perhaps because to survive cold winters you have to work together, while in warm climates, success is determined by being the best at forcibly stealing the resources that already exist?
To the extent these trends are valid, does this suggest that if humanity were to rebuild after a significant catastrophe, it might be more democratic owing to having less oil, metals, and other resources?
The "Criticisms" section of the Wikipedia article explains that some studies attribute causation in the other direction: Greater authoritarianism leads countries to exploit their resources faster. Indeed, some studies even find a "resource blessing."
Fighting factions are often brought together when they face a common enemy. For instance, in the 1954 Robbers Cave study, the two hostile factions of campers were brought together by "superordinate goals" that required them to unite to solve a problem they all faced. Catastrophic risks are a common enemy of humanity, so could efforts to prevent them build cooperative institutions? For instance, cooperation to solve climate change could be seen as an easy test bed for the much harder challenges that will confront humanity in cooperating on AGI. Could greater climate danger provide greater impetus for building better cooperative institutions early on? Wolf Bullmann compared this to vaccination.
Of course, this is not an argument against building international-cooperation efforts against climate change -- those are the very things we want to happen. But it would be a slight counter-consideration against, say, personally trying to reduce greenhouse-gas emissions. I hasten to explain that this "silver lining" point is extremely speculative, and on balance, it seems most plausible that personally reducing greenhouse-gas emissions is net good overall, in terms of reducing risks of wars that could degrade into bad outcomes. The "inoculation" idea (see the Appendix) should be explored further, though it needs to be cast in a way that's not amenable to being quoted out of context.
Reshuffling world political conditions could produce a better outcome, and indeed, there are minority views that greater authoritarianism could actually improve future prospects by reducing coordination problems and enforcing safeguards. We should continue to explore a broad range of viewpoints on these topics, while at the same time not wandering too easily from mainstream consensus.
Like any empirical question, the net impact of catastrophic risks on the degree of compromise in the future isn't certain, and a hypothetical scenario in which we concluded that catastrophic risks would actually improve compromise is not impossible. Not being concerned about catastrophic risks is more likely for pure negative utilitarians who also fear the risks of astronomical suffering that space colonization would entail. I find it plausible that the detrimental effects of catastrophic risks on compromise outweigh effects on probability of colonization, but this conclusion is contingent and not inevitable. What if the calculation flipped around?
Even if so, we should still probably oppose catastrophic risks when it's very cheap to do so, and we should never support them. Why? Because many other people care a lot about preventing short-term disasters, and stepping on so many toes so dramatically would not be an efficient course of action, much less a wise move by any reasonable heuristics about how to get along in society. Rather, we should find other, win-win approaches to improving compromise prospects that everyone can get behind. In any event, even ignoring the importance of cooperating with other people, it seems unlikely that focusing on catastrophic risks would be the best leverage point for accomplishing one's goals.
My guess is that there are better projects for altruists to pursue, because
The argument in this essay applies only to preventing risks before they happen, so as to reduce societal dislocation. It doesn't endorse measures to ensure human recovery after catastrophic risks have already happened, such as disaster shelters or space colonies. These post-disaster measures don't avert the increased anarchy and confusion that would result from catastrophes but do help humans stick around to potentially cause cosmic harm down the road. Moreover, disaster-recovery solutions might even increase the chance that catastrophic risks occur because of moral hazard. I probably don't endorse post-disaster recovery efforts except maybe in rare cases when they also substantially help to maintain social stability in scenarios that cause less-than-extinction-level damage.
The idea of inoculation -- accepting some short-term harm in order to improve long-term outcomes -- is a general concept.
Even with warfare, there's some argument about an inoculation effect. For example, the United Nations was formed after World War II in an effort to prevent similar conflicts from happening again. And there's a widespread debate about inoculation in activism. Sometimes the "radicals" fear that if the "moderates" compromise too soon, the partial concessions will quell discontent and prevent a more revolutionary change. For example, some animal advocates say that if we improve the welfare of farm animals, people will have less incentive to completely "end animal exploitation" by going vegan. In this case, the radicals claim that greater short-term suffering is the inoculation necessary to prevent long-term "exploitation."
The flip side to inoculation is the slippery slope: A little bit of something in the short term tends to imply more of it in the long term. In general, I think slippery-slope arguments are stronger than inoculation arguments, with some exceptions like in organisms' immune systems.
Usually wars cause more wars. World War II would not have happened absent World War I. Conflicts breed animosity and perpetuate a cycle of violence as tit-for-tat retributions continue indefinitely, with each side claiming the other side was the first aggressor. We see this in terrorism vs. counter-terrorism response and in many other domains.
Likewise, a few animal-welfare reforms now can enhance a culture of caring about animals that eventually leads to greater empathy for them. The Humane Society of the United States (HSUS) is often condemned by more purist animal-rights advocates as being in the hands of Big Ag, but in fact, Big Ag has a whole website devoted to trying to discredit HSUS. This is hardly behavior that one would expect if HSUS is actually helping ensure Big Ag's long-term future.
The post Reasons to Be Nice to Other Value Systems appeared first on Center on Long-Term Risk.
A basic premise of economic policy, business strategy, and effective altruism is to choose the option with the highest value per dollar. Ordinarily this simple rule suffices because we're engaged in one-player games against the environment. For instance, if Program #1 to distribute bed nets saves twice as many lives per dollar as Program #2, we choose Program #1. If Website B has 25% longer dwell time than Website A, we choose Website B. These are essentially engineering problems where one option is better for us, and no other agent feels differently.
However, this mindset can run into trouble in social situations involving more than one player. I'll illustrate with a toy example that avoids naming specific groups, but the general structure transfers to many real-world cases.
Suppose there's an Effective Altruism Fair at your local university, and altruists from various ideological stripes will be hosting the event and presenting their individual work. You really care about promoting Emacs, the one true text editor. However, the Fair will also host a booth for the advocates of the Vi editor, which you consider not just inferior but actively harmful to the world.
The Fair requires some general organizing help -- to publicize, set up tables, and provide refreshments. Beyond that, it's up to the individual groups to showcase their own work to the visitors. Your Emacs club is deciding: How much effort should we put into helping out with general organizing, and how much should we devote to making our individual booth really awesome? You might evaluate this on the metric of how many email signups you'd get per hour of preparation work. And while you appreciate some things the Vi crowd does, you think they cause net harm on balance, so you might subtract off from your utility 1/2 times the number of email signups your effort allows them to get per hour.
If you help out with the general logistics of the Fair, it would produce a lot of new visitors, but only some fraction of them will be interested in Emacs. Say that every hour you put in provides 10 new Emacs signups, as well as 10 new Vi signups (plus maybe signups to other groups that are irrelevant to you). The overall value of this to you is only 10 - (1/2)*10 = 5. In contrast, if you optimize your own booth, you can snatch an extra 15 visitors to yourself, with no extra Vi visitors in the process. Since 15 > 5, cost-effectiveness analysis says you should optimize only your booth. After all, this is the more efficient allocation of resources, right?
Suppose the Vi team faces the same cost-benefit tradeoffs. Then depending on which decisions each team makes, the following are the possible numbers of signups that each side will get, written in the format (# of Emacs signups), (# of Vi signups).
 | Vi help on logistics | Vi focus on own booth
Emacs help on logistics | 10+10 = 20, 10+10 = 20 | 10+0 = 10, 10+15 = 25 |
Emacs focus on own booth | 15+10 = 25, 0+10 = 10 | 15+0 = 15, 0+15 = 15 |
Now remember that Emacs supporters consider Vi harmful, so that Emacs utility = (number of Emacs signups) - (1/2)*(number of Vi signups). Suppose the Vi side feels exactly the same way in reverse. Then the actual utility values for each side, computed based on the above table, will be
 | Vi help on logistics | Vi focus on own booth
Emacs help on logistics | 10, 10 | -2.5, 20 |
Emacs focus on own booth | 20, -2.5 | 7.5, 7.5 |
Just as we saw in the naive cost-effectiveness calculation, there's an advantage of 10 to focusing on your own booth, regardless of what the other team does (20 - 10 = 10 if Vi helps with logistics, and 7.5 - (-2.5) = 10 if Vi focuses on its own booth).
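The two tables above can be reproduced mechanically. Here is a minimal sketch (my own, using only the numbers assumed in the example) that computes each side's utility as its own signups minus half of the other side's, and confirms that focusing on one's own booth dominates for both teams:

def signups(emacs_choice, vi_choice):
    # Helping with logistics yields 10 signups for each side; focusing on
    # one's own booth yields 15 signups for that side only.
    emacs = (10 if emacs_choice == "help" else 15) + (10 if vi_choice == "help" else 0)
    vi = (10 if vi_choice == "help" else 15) + (10 if emacs_choice == "help" else 0)
    return emacs, vi

def utilities(emacs_choice, vi_choice):
    emacs, vi = signups(emacs_choice, vi_choice)
    return emacs - 0.5 * vi, vi - 0.5 * emacs  # each side discounts the rival's signups

for e in ("help", "booth"):
    for v in ("help", "booth"):
        print(e, v, utilities(e, v))
# Prints (10.0, 10.0), (-2.5, 20.0), (20.0, -2.5), (7.5, 7.5), matching the table;
# whatever the other team does, "booth" beats "help" by 10.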
The game that this table represents is a prisoner's dilemma (PD) -- arguably the most famous in game theory. The dominant strategy in a one-shot PD is to defect, and this is what our naive calculation was capturing. In fact, both of the above tables are PDs, so the PD structure would have applied even absent enmity between the text-editor camps. PDs show up in very many real-world situations.
There's debate on whether defection in a one-shot PD is rational, but what is clear is that most of the world does not consist in one-shot PDs. For instance, what if the EA Fair is held again next year? How will the Vi team react then if you defect this year?
In addition, it may be in all of our interests to structure society in ways that prevent games from turning into one-shot PDs, because the outcome is worse for both sides than cooperation would have been, if only it could have been arranged.
In the remainder of this piece I outline several weak arguments why we should generally try to help other value systems, even when we don't agree with them. Here's my general heuristic:
If you have an opportunity to significantly help other value systems at small cost to yourself, you should do so.
Likewise, if you have an opportunity to avoid causing significant harm to other value systems by foregoing a small benefit to yourself, you should do so. This is all the more true the more powerful the value system you're helping is. That said, if groups championing the other value system are defecting against you, then stop helping it.
Most of life has multiple rounds. Other groups of people generally don't go away after we've stepped on their toes, and if we defect now, they can defect on us in future interactions. There's extensive literature on the iterated prisoner's dilemma (IPD), but the general finding is that it tends to yield cooperation, especially over long time horizons without a definite end point. The Evolution of Cooperation is an important book on this subject.
One can debate whether a given situation adequately fits the properties of being a pure IPD. The translation from real-world situations to theoretical games is always messy. Regardless, the fact remains that empirically, humans feel reciprocal gratitude and indebtedness to those who helped them.
What's more, these feelings often persist even when there's no obvious further benefit from doing so. Emotions are humans' ways of making credible commitments, and the fact that humans feel loyalty and duty means that they can generally be trusted to reciprocate.
Of course, if you interact with people who are conniving and tend to backstab, then don't help them. Being nice does not mean being a sucker, and indeed, continuing to assist those who just take for themselves only encourages predation. (Of course, evolution has produced emotional exceptions to this, like in the case of altruism towards children and family members who share DNA with you, even if they never reciprocate.)
Reciprocal altruism typically occurs between individuals or groups, but there are also broader ways in which society transmits information about how generous someone is toward other values. When others discuss your work, wouldn't you rather have them say that you're a fair-minded and charitable individual who helps many different value systems, even those you don't agree with?
The heuristic of helping others when it's cheap to do so strikes most people as common sense. These values are taught in kindergarten and children's books.
Mahatma Gandhi said: "If we could change ourselves, the tendencies in the world would also change." We can see this idea expressed in other forms, such as the categorical imperative or the dictums of a rule utilitarian. Society would be better -- even according to your own particular values -- if everyone followed the rule of helping other value systems when doing so had low cost.
When we follow and believe in these principles, it rubs off on others. Collectively it helps reinforce a norm of cooperation with those who feel differently from ourselves. Norms have significant social power, both for individuals and even for nations. For instance:
Internationally, a cooperative security norm, if close to universality, can become the defining standard for how a good international citizen should behave. It is striking how in the 1980s and 1990s scores of formerly reluctant states were flocking to [Nonproliferation Treaty] NPT membership, notably after change in the national system of rule and particularly in the course of democratization processes: turning unequivocally nonnuclear or confirming nonnuclear status became the "right thing to do" (Rublee 2009; Müller and Schmidt 2010). [p. 4]
When we defect in any particular situation, we weaken cooperative norms for everyone for many future situations to come.
Norms of mutual assistance and tolerance among different groups are important not just for our own projects but also for international peace on a larger scale. To be sure, the contribution of our individual actions to this goal is minuscule, but the stakes are also high. A globally cooperative future could contain significantly less suffering and more of what other people value in expectation.
Utilitarians care about the well-being or preference satisfaction of others. Thus, if many people feel that something is wrong, even if you don't, there's a utilitarian cost to it. This argument is stronger for preference utilitarians who value people's preferences about the external world even when they aren't consciously aware of violations of those preferences. Of course, this alone is probably not enough to encourage nice behavior, because present-day humans are vastly outweighed in direct value by non-human animals and future generations.
If you had grown up with different genes and environmental circumstances, you would have held the moral values that others espouse. In addition, you yourself might actually, not just hypothetically, later come to share those views -- due to new arguments, updated information, future life experiences, accretion of wisdom, or social influence. Or you might have come to hold those views if only you had heard arguments or learned things that you will not actually discover. What others believe provides some evidence for what an idealized version of you would believe. If so, then your current judgment that others' moral values are worthless may itself be mistaken.
I should clarify that the value of cooperation does not rely on moral uncertainty; the other arguments are strong enough on their own. Moral uncertainty just provides some additional oomph, depending on how strongly it motivates you. (And you may want to apply some meta-level uncertainty on how much you care about moral uncertainty, if you care about meta-level uncertainty.)
This section was written by Caspar Oesterheld.
Some decision theorists have argued that cooperation in a one-shot PD is justified if we face an opponent that uses a similar decision-making procedure as we do. After all, if we cooperate in such a PD, then our opponent is likely to do the same. Hofstadter (1983) calls this idea superrationality.
Some have used superrationality to argue that it is in our self-interest to be nice to other humans (Leslie 1991, sec. 8; Drescher 2006, ch. 7). For example, if I save a stranger from drowning, this makes it more likely that others will make a similar decision when I need help. However, in practice it seems that most people are not sufficiently similar to each other for this reasoning to apply in most situations. In fact, you may already know what other people think about when they decide whether to pull someone out of the water and that this is uncorrelated with your thoughts on superrationality. Thus, it is unclear whether superrationality has strong implications for how one should deal with other humans (Oesterheld 2017, sec. 6.6; Almond 2010a, sec. 4.6; Almond 2010b, sec. 1; Ahmed 2014, ch. 4).
However, even if Earth doesn’t harbor agents that are sufficiently similar to me, the multiverse as a whole probably does. In particular, it may contain a large set of agents who think about decision theory exactly like I do but have different values. Some of these will also care about what happens on Earth. If this is true and I also care about these other parts of the multiverse, then superrationality gives me a reason to be nice to these value systems. If I am nice toward them, then this makes it more likely that similar agents will also take my values into account when they make decisions in their parts of the multiverse (Oesterheld 2017).
Many of the reasons listed, especially the stronger ones, only have consequences when your cooperation or defection is visible: IPDs, evolved emotions, reputation, norms and universal rules, and encouraging global cooperation. Assuming the other, remaining reasons are weak enough, doesn't this license us to trash other value systems in our private decisions, so long as no one will find out?
No. There's too much risk of it backfiring in your face. One slip-up could damage your reputation, and your deception might show through in ways you don't realize. I think it's best to actually be someone who wants to help other value systems, regardless of whether others find out. This may sound suboptimal, and maybe there is a little bit of faith to it, but consider that almost everyone in the world recognizes this idea at least to some extent, such as in the law of karma or the Golden Rule. If it were an "irrational" policy for social success, why would we see it so widespread? Eliezer Yudkowsky: "Be careful of [...] any time you find yourself defining the 'winner' as someone other than the agent who is currently smiling from on top of a giant heap of utility."
Not hiding your defection against others is a special case of the general argument for honesty. This isn't to say you always have to be cooperative, but if you're not, don't go out of your way to hide it.
I regard not trashing other value systems as a weak ethical injunction for guiding my decisions. I recommend reading Eliezer Yudkowsky's sequence for a fuller account of why ethical injunctions can produce better outcomes than naive act-utilitarianism. The injunction not to step on others' toes is not as strong as the injunctions against lying, stealing, and so on; indeed, it's impossible to avoid stepping on some people's toes altogether. But in cases where it's relatively easy to avoid causing major harm to what a significant number of others care about, you should try to avoid causing that harm. Of course, if others are substantially and unremorsefully stepping on your toes, then this advice no longer applies, and you should stop being nice to them until they start being cooperative again.
Being nice is not guaranteed to yield the best outcomes. There are reasons we evolved selfish motives as well as altruistic ones, and the "nice guys finish last" slogan is sometimes accurate. The other side might cheat you and get away with it. Maybe the IPD structure isn't sufficient to guarantee cooperation. Maybe it's a tragedy of the commons (multi-player prisoner's dilemma) where it's much harder to change defection to cooperation, and your efforts fail to make their intended impact.
It's important to assess these risks and be conscious of when your efforts at cooperation fail. But remember: Being nice means defecting on the other side if it defects on you. Niceness doesn't mean being exploited permanently. It's better to try a gesture of cooperation first rather than assume it won't work; predicting defection may become a self-fulfilling prophecy. In addition, I think niceness is increasingly rewarded in our more interconnected and transparent world, facilitated by governments and media. Our ancestral selfish tendencies probably overfire relative to the strategic optimum.
However, there are many real-world cases where niceness fails. One striking demonstration of this was the attempts by US president Barack Obama to compromise with opposing Republicans, which repeatedly resulted in Obama and the Democrats making concessions for nothing in return. This is not how to play an iterated prisoner's dilemma. If niceness repeatedly fails to achieve cooperation, then one has to go on the offensive instead.
If you hold a popular position, then I think it's often successful to firmly stand your ground rather than making concessions in response to squeaky-wheel opponents. Cenk Uygur: "Do you know what works in politics? Strength."
Cooperation can also entail overhead costs in terms of negotiating and verifying commitments, as well as assessing whether an apparent concession is actually a concession or just something the other side was already going to do. For small interactions, these overhead costs may outweigh the benefits of trade. Verifying cooperation is often easy if a business partner does a favor for you, because you can see what the favor is, and it's unlikely the partner would have done the favor without expecting anything in return. Verifying cooperation is often harder for big organizations or governments, because (1) the impacts of a change in policy can be diffuse and costly to measure and (2) it's difficult to know how much the change in policy is due to cooperation versus how much it's something the organization was going to do anyway.
I hope it's clear that at least in some cases, being nice pays. The harder question is how nice to be, i.e., above what threshold of cost to yourself do you stop providing benefits to others?
If bargains could be transacted in an airtight fashion, and if utility were completely transferable, then the answer would be simple: Maximize the total social "pie," because if you can provide someone a benefit B that's bigger than its cost C to yourself, the other person could pay you back in the amount C, and the surplus B-C could then be divided between the two of you, making you both better off. Alas, most situations in life aren't airtight, so intuitively, in many cases it would not be in your interest to purely maximize pie. There might be some noise or cheating between your incurring the cost and someone else paying back a higher benefit. Not everything you do is fully recognized and rewarded by others, especially when they assume that you're helping them because you intrinsically value their cause, rather than just to be nice despite not caring about it or even slightly disvaluing it.
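To make the arithmetic concrete, here is a tiny worked example with invented numbers: as long as the benefit B exceeds the cost C, any repayment between C and B leaves both parties better off, with a total surplus of B - C to split.

```python
# Toy illustration of gains from trade with transferable utility.
# All numbers are invented for the example.
cost_to_helper = 2       # C: what the favor costs you
benefit_to_other = 10    # B: what the favor is worth to the other party
side_payment = 6         # any repayment strictly between C and B works

helper_net = side_payment - cost_to_helper    # +4: you come out ahead
other_net = benefit_to_other - side_payment   # +4: so does the other party
total_surplus = helper_net + other_net        # B - C = 8, divided between you

print(helper_net, other_net, total_surplus)   # 4 4 8
```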
How nice to be depends on the details of the social situation, expectations, norms, and enforcement mechanisms involved. There's some balance to strike between purely pushing your own agenda without regard to what anyone else cares about versus purely helping all value systems without any preference for your personal concerns. One could construct various game-theoretic models, but the world is complicated, and interactions are not just, say, a series of two-player IPDs. It could also help to look at real examples in society for where to strike this balance.
Being nice suggests that people whose primary concern is reducing suffering should accept others' ambitions to colonize space, so long as colonizers work harder to reduce the suffering that space colonization entails. On the flip side, being nice also means that those who do want to colonize space should focus more on making space colonization better (more humane and better governed to stay in line with our values) rather than making it more likely to happen.
The post Reasons to Be Nice to Other Value Systems appeared first on Center on Long-Term Risk.
The post Education Matters for Altruism appeared first on Center on Long-Term Risk.
"Want to save the world? LEARN!!!!
Then learn MORE!!!!
Then, probably have kids [or not] and KEEP LEARNING!!!!"
—Michael Vassar (in a Facebook comment)
There's a stereotype that says younger people are naïve and idealistic, while older people are experienced and cynical. Insofar as this is true, there are probably many contributors, but one suggestion is that young people "think they know everything" and are assured that the changes they want to make will be beneficial. The older people align more with Edmund Burke in suggesting that society is complicated, and changing it without messing things up is harder than it looks. There are obviously other differences as well -- e.g., young people have different values than older ones, such as being less opposed to gay marriage or more supportive of animal welfare, and these differences are due not to amount of experience but merely to differences in culture.
I often find that those who think a policy or strategic issue is "just obvious" tend to be those who know least about it. The more you probe, the more you see that different sides have reasons for thinking as they do, and it's less obvious how to weigh up all the competing factors. This isn't to say there aren't places where most reasonable people agree that policies are sorely misguided and need changing. But if you haven't found at least one major drawback to whatever proposal you think should be adopted, you might want to dig deeper into the complexities at play.
There's an admirable tendency among some activists to "do something now." Given how full the world is of horrors, this heart-felt urgency is not only understandable but actually an important feeling to have. If we lose our sensitivity to the urgency of the issues at stake, we risk not giving them the weight they deserve, thereby drifting into either apathy or another moral stance that takes a dispassionate approach to suffering and doesn't attend to its overwhelming importance.
Yet, when I look back at the history of my altruistic efforts since the year 2000, what I see is that I was often mistaken, not only about which causes would be best to support but even what their signs were, i.e., whether they made the world better or worse. While financial markets can return 5-10% per year, and internal growth rates on movement-building may return more, the "rate of return on wisdom" can be extraordinary. This wisdom is not just about new data or new arguments but more broadly, new ways of understanding how the world works, looking at epistemology, and deciding what you care about.
When I look back at my history since 2000, I also see that many of the most valuable tools I acquired came from education. By this I mean not just learning about effective altruism, rationality, ethical issues, strategic considerations, career choice, and so on, but also learning about the intellectual foundations of various academic disciplines. A non-exhaustive sample:
You'll notice that my list contains basically every academic subject! And there's a reason for this: Reflecting on what we value and how best to change the world requires familiarity with a wide array of ways of thinking and intellectual considerations that we should heed. If you're trying to advance the state of the art in one particular domain, it suffices to dig very deep into that field and probe it as much as possible. But if you're trying to do the most good in a global sense, you need a global perspective, which requires learning a little bit of everything.
When people ask me what major to choose in college, my first answer is: "Introductory Studies." This is a joke, though I wish it weren't. What I mean is, take as many "intro" courses as you can, because these allow you to level up by exploring new perspectives on the universe that different domains have to offer. The returns then diminish as you keep taking more and more courses in the same field. Of course, some areas are more important than others, so you'll ultimately want to focus on them, but every subject has something you can take away.
From Eliezer Yudkowsky's "Twelve Virtues of Rationality":
Study many sciences and absorb their power as your own. Each field that you consume makes you larger. If you swallow enough sciences the gaps between them will diminish and your knowledge will become a unified whole. If you are gluttonous you will become vaster than mountains.
As a substitute for an Introductory Studies major, you might consider listening to at least one or two lectures from every academic department on a university's YouTube channel. Get a taste of what kinds of issues a field deals with and what tools it uses to approach its problems.
I was a physics student and then a physics grad student. In that process, I think I assimilated what was the standard worldview of physicists, at least as projected on the students. That worldview was that physicists were great, of course, and physicists could, if they chose to, go out to all those other fields, that all those other people keep mucking up and not making progress on, and they could make a lot faster progress, if progress was possible, but they don't really want to, because that stuff isn't nearly as interesting as physics is, so they are staying in physics and making progress there.
For many subjects, they don't think it's just possible to learn anything, to know anything. For physicists, the usual attitude towards social science was basically there's no such thing as social science; there can't be such a thing as social science. [...]
It's just way too easy to have learned a set of methods, see some hard problem, try it for an hour, or even a day or a week, not get very far, and decide it's impossible, especially if you can make it clear that your methods definitely won't work there.
You don't, often, know that there are any other methods to do anything with because you've learned only certain methods. [...]
I can tell you there are a lot [of methods] out there. Furthermore, I'll stick my neck out and say most fields know a lot. Almost all academic fields where there's lots of articles and stuff published, they know a lot.
It's also good to learn beyond topics that are taught in academia, since the world contains far more than the set of things that academics can publish papers about.
Philomaths like myself love to learn about all subjects, but it's interesting that I became much more of a philomath after inclining toward altruism. Before I cared about making a difference in the world, I played video games and often disliked doing homework. Once I realized that I had enormous potential to reduce the suffering of others, I became a better student, in part because I knew good grades would help me be more successful in attaining a lucrative or influential career, but also because when you want to make the world better, you really have to know how it works. What before had been dry facts and unmotivated theories were now valuable pieces of information -- gold nuggets of data and insight waiting to be harvested.
This is how I see learning to this day: Each new datum or idea is a little boost to my overall ability to approach activism and life in general with more wisdom. It's like accumulating experience points, which is what makes RPGs so addictive.
The world is so vast that it's easy to get lost in a single domain. Even within a single intellectual community focused on a single problem, there will be new discussions every day, new ideas to explore, and new posts to comment on. Our friends on Facebook, online forums, in-person groups, mailing lists, news websites, etc. can keep us focused on a particular outlook on the world.
Social networks and communities are extremely valuable in many ways, for emotional support, intellectual development, and sustaining altruistic motivation. That said, it's important also to take a bigger-picture view and think about what other thousands of intellectual communities you haven't explored yet. What other topics and world views should you learn about?
Social communities can also lead us to spend most of our time reading material that our friends have written, even though our friends' writings may not contain as much deep scholarship as well-regarded writings by more mature authors. (Yes, that statement has something to say about this essay. :P) There can also be a tendency to focus on things that are new and shiny over those that are older but more solid. Robin Hanson said it well:
as a blog author, while I realize that blog posts can be part of a balanced intellectual diet, I worry that I tempt readers to fill their intellectual diet with too much of the fashionably new, relative to the old and intellectually nutritious. Until you reach the state of the art, and are ready to be at the very forefront of advancing human knowledge, most of what you should read to get to that forefront isn't today's news, or even today's blogger musings. Read classic books and articles, textbooks, review articles. Then maybe read focused publications (including perhaps some blog posts) on your chosen focus topic(s).
I think textbooks are often under-read, outside of course requirements, relative to the density of insight they contain. Close runners up are Wikipedia, review articles, and popularization books.
I can't stress the importance of Wikipedia enough. I try to read Wikipedia as much as possible, and if I'm reading anything not on Wikipedia, I ask myself, "Is this really better than reading a Wikipedia article on this topic or on some other topic?" Wikipedia allows you to see a broad range of perspectives that you'd miss reading a few individual opinions on the issue, and it often gives more context and perspective.
There's a similar phenomenon with current events: It seems we sometimes pay too much attention to them relative to their overall importance. Obviously politicians, policy analysts, journalists, and lobby groups need to be completely up to speed on the news. Voters too should have some general understanding of current issues (though maybe at the level of reading Wikipedia rather than CNN). It's important for scandals to be reported widely in order to serve a deterrence function. However, in terms of its broad, big-picture significance, news is not really different from history. It's good to be somewhat familiar, to update your world models for how society functions, what kinds of actions people take, and in general how social trends evolve. But it's not crucial that you read today's news. You could just as well read last year's news, or, for that matter, news from 1925. The general principles of psychology, politics, economics, and society that news illustrates are relatively similar then as now. (See also "Appendix: Why can you only promote new media?" and "Appendix: News values are not optimal for learning")
That said, there are some cases where consuming recent rather than old news is important. Some examples:
A good history course will not just teach from a textbook but will assign primary-source readings as well. These allow you to see historical developments from a different level of abstraction. Getting up close with details can give you a better intuitive picture of what kinds of things are being described by the high-level narrative. This is similar to studying some particular examples within a statistical sample in addition to merely reporting aggregate metrics. News is like a primary-source document in a history class: it's important for building intuition and highlighting more general principles, but it also shouldn't be overdone relative to bigger-picture insights.
Robin Hanson has a criticism of news reading on the grounds of its superficiality and evanescence. I'm more positive about news than he is and consume some of it myself, because I think it's an important source of data on how the world works in a number of domains. That said, news stories often overemphasize extreme anomalies rather than systemic trends, and in any event, we may not want to read it excessively at the expense of already-digested insights by many other smart minds.
Academia is an important source of insight, but there are many more: friendships and social relations, experiences in the workplace, feeling new emotions, and doing new things. These all expand your repertoire for what life is like in various corners of the globe and various avenues of thought. The number of unique experiences, while finite, is astoundingly large.
Many algorithms for the explore-exploit tradeoff and for function optimization introduce randomness to prevent the search from getting stuck at whatever seems best so far. In general, it makes most sense to front-load the exploration. For example, in simulated annealing, the search begins at a high temperature, where we often jump randomly to seemingly suboptimal places, and as time goes on, we settle down into the best optimum we've found. An epsilon-decreasing strategy in the multi-armed bandit problem embodies a similar idea.
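Here is a minimal sketch (my own illustration, with made-up reward values) of the epsilon-decreasing idea: the probability of exploring a random arm starts high and decays over time, so most of the random exploration happens early, and later rounds mostly exploit the best arm found so far.

```python
import random

def epsilon_decreasing_bandit(true_means, rounds=10000):
    """Pull bandit arms, exploring at random with a probability that decays
    over time and otherwise exploiting the arm with the best observed average."""
    n_arms = len(true_means)
    counts = [0] * n_arms
    totals = [0.0] * n_arms

    for t in range(1, rounds + 1):
        epsilon = min(1.0, 100.0 / t)  # explore a lot early, little later
        if random.random() < epsilon:
            arm = random.randrange(n_arms)  # explore: pick any arm at random
        else:
            averages = [totals[i] / counts[i] if counts[i] else 0.0
                        for i in range(n_arms)]
            arm = max(range(n_arms), key=lambda i: averages[i])  # exploit
        reward = random.gauss(true_means[arm], 1.0)  # noisy payoff
        counts[arm] += 1
        totals[arm] += reward
    return counts

# Hypothetical arms: the third is genuinely best and ends up pulled most often.
print(epsilon_decreasing_bandit([0.1, 0.5, 0.9]))
```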
The same concepts apply to altruism: At the beginning, a large part of our focus should be on learning, observing, and expanding our world models, without getting mired into a local optimum too early on. With time, we should begin doing more hands-on work -- relatively more talking and relatively less listening.
Think of another perspective: In science, there's a tension between funding work with immediately tangible benefits versus basic research that may pay off in 50 years or may accomplish nothing. Generally, basic research is under-funded because it's less profitable in the near term, but I think general wisdom is that basic research has higher expected returns per dollar. It seems to me as though many altruists also incline toward work with immediately tangible benefits -- understandably so, because the world contains so much suffering, and we want to do something about it now. Yet in terms of expected value, the "basic research" approach to altruism may have higher payoff. This is not an unqualified statement; there are many cases where basic research stays in its own bubble and never has much outside impact -- both in science and altruism. It's good to keep one foot in each of the practical and academic worlds to make sure that you don't drift away from focusing on relevant topics and so that you have a better sense of how the research might actually be used.
It might be tempting to suggest that young people should spend all their time learning, saving the "hands dirty" work for later, once they've better figured out how to best make a difference in the world. While it definitely makes sense to do more learning early, I think it's also valuable to do some concrete projects alongside the more abstract work. Here are two reasons:
There's a quote attributed to Abraham Lincoln (perhaps incorrectly): "Give me six hours to chop down a tree, and I will spend four hours sharpening the axe." This nicely illustrates the idea of front-loading learning, but I would modify the recommendation a little. First try taking a few whacks, to see how sharp the axe is. Get experience with chopping. Identify which parts of the process will be a bottleneck (axe sharpness, your stamina, etc.). Then do some axe sharpening or resting or whatever, come back, and try some more. Keep repeating the process, identifying along which variables you need most improvement, and then refine those. This agile approach avoids waiting to the last minute, only to discover that you've overlooked the most important limiting factor. Finally, one important difference between tree chopping and altruism is that in altruism, the goals themselves are much less clearly defined, and figuring them out is part of the metaphorical "axe sharpening" process.
One of my teachers in 8th grade gave really hard tests. He explained the rationale with the following analogy. Suppose you want to improve at ping pong. Are you going to get better by playing against your little sister? Not unless she's good. Instead, you'll get better by playing against someone who's very talented -- someone who can challenge you. Similarly, challenging tests help students significantly improve their thinking skills.
The effective-altruism movement is mostly composed of young people. This is great, but it can also lead to lots of discussions in which young people talk exclusively to other young people. Often I feel there's not enough external input, from experts and older thinkers who can provide wisdom, experience, and difference of perspective.
I find it extremely useful to constantly seek outside perspectives, especially from senior policy analysts, leading scientists, and other objectively impressive public figures. Reading the big names in a field seems to me a better way to challenge my views and absorb crucial insights than having yet another conversation with 20-something-year-olds. There's no question that young people can be brilliant and have important new contributions, but expertise and wisdom from age often beat intelligent theorizing, except in a few narrow domains where theory is all we have to go on.
It's easy to get trapped in your own intellectual bubble and think that your group of friends has the answers. Make sure to keep stepping outside that bubble. Keep getting restless in search of new ideas and views different from your own. That's where you can challenge yourself.
Sometimes effective altruists seem obsessed with "quantified efficiency": Metrics indicating performance, whether in terms of QALYs per dollar, time management, survey results, or diet/exercise optimization. Quantification is helpful and can often highlight major gaps that your qualitative mind may have missed, or conclusions that weren't obvious before crunching the data. Numbers allow for doing many more computations than our brains could execute on their own.
At the same time, I fear that some effective altruists get carried away and miss the forest for the trees. Remember "garbage in, garbage out": Metrics are only as good as the reasoning that tells you they're important to optimize, and they can also miss crucial considerations. Hyper-optimizing on narrow or even fairly broad metrics may mean sacrificing important pieces of unmeasured value. Metric optimization can give an illusion of progress, when in fact a broader view of the situation would suggest that things are a lot fuzzier than you thought. Sometimes the qualitative, big-picture approach of the human brain is more adept at solving these macro-level problems than other tools currently at our disposal. Holden Karnofsky discusses this and related issues in "Passive vs. rational vs. quantified."
One big-picture domain where metric optimization has a hard time is with discovering, in the words of Disney's Pocahontas, "things you never knew you never knew." It's not clear how to build metrics that capture dimensions of a problem you haven't yet thought to explore. If we hyper-optimize too much on our current goals, we may neglect the importance of taking a step back to think about the big picture. Narrowed focus is one of several concerns raised by the authors of "Goals Gone Wild: The Systematic Side Effects of Over-Prescribing Goal Setting."
From a short-term perspective, it feels like studying abstruse topics in mathematics or political theory isn't helping anyone. It doesn't seem morally urgent. Yet, when I look back on my life with a bird's eye view, I can see that if I hadn't studied big ideas from a far-sighted perspective, I would be spending time on things significantly less important. In the long run, the seemingly unhelpful details of how sodium-potassium pumps work actually matter insofar as they allow you to better understand what life and consciousness are at a deep level, which then has major implications for altruism at large.
Michael Vassar encourages people to "Be More Curious!" Curiosity is Eliezer's first virtue of rationality. While it's important to constrain your curiosity so as to focus on the more altruistically important topics, don't confine it so much that you avoid making serendipitous leaps to unanticipated regions of state-space during your search for the best ways to improve the world.
By a similar token, boredom can be a virtue as well if it keeps you exploring many domains of life rather than focusing too much on a narrow cause that would appear less important than other causes if you took a broader perspective.
One of my friends used to observe what a tragedy it was that the world contained far more books than any person could ever read. I sympathize with bookworm Henry Bemis from "Time Enough at Last." While it is superficially sad that we can't learn everything or even more than a tiny sampling of the world's knowledge, on reflection it's not so troubling for a few reasons:
Recently I browsed through the top 100 most popular podcasts on iTunes. My instinctive reaction was to navigate to the ones that matched my interests and therefore seemed most alluring. Other podcasts on unfamiliar topics or by authors I had never heard of seemed less exciting. "Familiarity breeds liking," as the psychologists say, and the familiar topics have more fuzzies associated with them from positive past experiences.
Yet, I realized, my emotional response here is not fully adequate. While topics that interest me may be more relevant, there's also diminishing marginal value to additional exploration in the same fields. In contrast, listening to a completely new topic or author would expand my horizons, and rather than seeing it as uninteresting because I haven't tried it, I should encourage myself to explore new things more than I might do naturally. In the explore-exploit tradeoff, my brain seemed to be favoring exploitation relatively more than seemed optimal from a learning standpoint.
It can actually take some work to explore completely new topics or authors. If I'm looking to relax or comfort myself, I'm more likely to go with a familiar topic and writer that I know are fun. In contrast, trying something new requires effort before it pays off with new rewards from discovery of new insights.
When I was in the middle of college, I had a troubling realization: Many of the things I had learned in high school, or even just a few semesters ago, were now vacant from my brain. "Am I losing my memory already?" I wondered. And if knowledge fades this quickly, is there a point in learning it in the first place? Of course, knowledge learned once is easier to re-learn. But it seems an uphill battle to try to keep re-learning information that will just fade once again.
I'm no longer particularly troubled by the evanescence of memory for a few reasons. For one thing, if I have a useful insight or learn an important fact, I often write it down, either in an essay or on Wikipedia. This creates a sort of extended memory for myself. Moreover, the most important discoveries should be salient enough that I remember them when other memories have faded.
In addition, I think much of the value of learning comes from the way knowledge shapes your mind in general. I can't recall many of the methods of solving differential equations that I once knew, but I do remember the gist of what those methods involved. Eliezer Yudkowsky has discussed the idea of absorbing "the reason and the rhythm behind ethics". I like this phrase. When I read about something, I aim to get a sense for the "rhythm" of the field -- whether physics, philosophy, or politics. I try to absorb an intuitive understanding of the kinds of things that the field involves, both at a high level and on a daily basis. I don't need to know most of the detailed facts, but studying some detailed facts with an eye toward honing my instinctive sense of the topic is important.
The "gist" of a subject is something that lingers long after the specific facts have faded. And in any case, getting the gist is often the most action-relevant endpoint anyway. Brains evolved to forget unhelpful facts (although energy and storage savings are presumably a big part of the evolutionary reason for humans' limited memories). Our intuitive understandings are elegantly distilled summaries of the essence of large amounts of data. When I read about a topic, I sometimes hunt for the gist and don't worry about remembering the details. They're at my fingertips via a web search anyway. (That said, it's often useful to learn some gory details at least once, in order to make sure the gist that you take away is accurate. You might assume you understand a topic at a surface level, but diving into the weeds would inform you that the topic is different from what you thought it was.)
Many school tests, as well as trivia shows and cocktail-party banter, emphasize the wrong thing: They value memorization of words and facts. This makes testing easier, because measurement is more objective, and it does at least provide some signal of understanding, because you can't pass a history test without having read the textbook or listened to the lecture. But it emphasizes exactly the wrong end of knowledge: knowing trivia rather than looking for an overall gist. To their credit, many educators aim to shape tests in ways that emphasize concepts more, and of course, essay writing is much more conducive to understanding over regurgitation.
Seeking the gist applies in the realm of quantitative sciences too. I've heard several people say things like "You can't really understand physics without math", but I think this isn't accurate. Equations in physics express ideas that can be described by words and analogies. Of course it would be necessary to introduce many new concepts to explain physics in detail without math, but there's nothing fundamentally privileged about mathematical symbols. Mathematicians often highlight the "main idea" behind a proof before delving into the gory details, and this main idea is typically the most important point to take away from the proof. If you remember the main idea, you can probably reproduce the detailed manipulations. Sometimes I read a proof, understand each step, but still don't grok what's really going on. In such cases, I need to take a step back and ask myself what the overall strategy is. When I figure that out, I can finally feel satisfied with the proof.
Luke Muehlhauser described his impression that his friend Carl Shulman gave answers using facts rather than overall impressions. In contrast, Luke explained:
I’ve read a lot of facts, but do I remember most of them? Hell no. If I forced myself to respond to questions only by stating facts, I’d be worried that I have fewer facts available to me than I’d like to admit. I often have to tell people: “I can’t remember the details in that paper but I remember thinking his evidence was weak.”
I agree with Luke, and I think absorbing the gist of a situation is probably the best strategy. Often I read by asking myself, "How do I feel about this?" Then the feeling is what I remember. Of course, when emotion is involved, memories are retained better, so it may not be a bad strategy for remembering details as well.
In general, whenever I read something, I look for the emotionally salient features of the content. How does this relate to reducing suffering? How does this expand my understanding of the way the world works? What implications does this have for my opinions on consciousness, interpersonal comparisons of utility, or other instances of moral uncertainty? A main reason many students dislike school is that they can't see the relevance of what they learn to something they care about. When you're aiming to improve the world, almost everything you learn is somehow relevant to things you care about, and hence the process of reading as if you were at an art gallery ("how do I feel about this?") can be applied almost universally.
It seems that a lot of movies, TV shows, music, etc. that are produced now are no better than those that have already been made. Indeed, many of them are just repeats of the same ideas over and over. Due to natural variation, some of the older productions are better than the newer ones, and maybe the older writers put more time and thought into their creations.
So why is it that most people focus on "new" movies, shows, etc.? Why not optimize for the best ones of all time, maybe with some randomness to make sure you do enough sampling of under-rated ones? Similarly, rather than making a new movie that's basically the same as 10 other previous movies, why not just rebrand the old movies and make them "the new hot thing"? It seems like we need something to be "new" to be interested, but the best things have mostly already been done. There's so much material out there that no one could possibly watch it all anyway. So why do we need more?
Part of the reason for focusing on new media is that movies, TV shows, etc. are to some degree about providing conversation topics and connection with other people and not just the content itself. It helps to have a way to decide what the current fashion is. But fashions could be set by marketers without needing to have the content itself be new, just as old-fashioned clothes sometimes come back into style.
Another reason to need new media is the suspense factor -- if the product is new, you can keep people waiting for it. This is somewhat puzzling. Why would I have suspense waiting for a new product when I can watch a thousand old things that people also had suspense waiting for? It's not like the new thing is that much better. Maybe we just like to be teased by media producers but can't be teased with media that are already accessible.
I find a similar and unfortunate pattern with blog posts: People love to read a blog post just when it comes out, but they less often explore the archives of a blog years back. Maybe this is because blogs typically have date organization that makes older articles less visible. But I think there's also some sense that the older posts are "out of date." I explicitly organize my writings in a non-temporal fashion so that high-quality older articles are not lost. Thanks to Toby Ord for first bringing up this trend with blog posts vs. static articles in conversation.
Summary: News is a peculiar institution. It purports to cover important issues of the day, yet it systematically misses some of the world's most important problems that occur every day. This appendix discusses "news values" -- the criteria that determine whether a story is newsworthy. I suggest some pros and cons of news relative to other ways of learning. I think news has a place in society but not such a prominent place as it seems to hold in intellectual life.
When I was in 10th grade, I wrote a research paper for my English class evaluating the economic and environmental effects of the 2002 Farm Bill. My teacher thought the topic was so important and my paper so well written that I should submit it to a newspaper for publication. I condensed the paper into an editorial column consistent with the style of the local newspaper, which I read daily and with whose tone I was familiar. I submitted it for review in early 2003, and eventually, I received a call back from the newspaper.
"Hello?"
"Hi, is this Brian?"
"Yes."
"I'm calling from the local newspaper about the article you submitted. I thought it was well written and a very strong piece, but the problem is, it's not ... current. The Farm Bill is something that happened last year. So I'm afraid we can't publish it."
"I see. Well, thanks for looking at it."
I also tried submitting the piece to a number of other current-affairs publications without success.
What are the biggest stories that news reporters aren't covering? The stories that don't qualify as news: The systemic problems and long-term suffering that aren't flashy enough to become headlines.
Take a look at Wikipedia's list of "News values": the characteristics of a story that tend to qualify it for news coverage. Some examples:
- Frequency: Events that occur suddenly [...]. Long-term trends are not likely to receive much coverage. [...]
- Unambiguity: Events whose implications are clear make for better copy than those that are open to more than one interpretation, or where any understanding of the implications depends on first understanding the complex background in which the events take place.
- Personalization: Events that can be portrayed as the actions of individuals will be more attractive than one in which there is no such "human interest."
- Meaningfulness: This relates to the sense of identification the audience has with the topic. "Cultural proximity" is a factor here -- stories concerned with people who speak the same language, look the same, and share the same preoccupations as the audience receive more coverage than those concerned with people who speak different languages, look different and have different preoccupations. [...]
- Consonance: [...] the media's readiness to report an item. [...]
- Time constraints: [...] strict deadlines and a short production cycle, which selects for items that can be researched and covered quickly.
These are hardly criteria for optimally selecting topics to learn about! Rather, as one might expect, these values are optimized more for entertaining audiences and for the feasibility of producing a story quickly on the media outlets' side.
But why would I read a hastily prepared story about what happened this morning instead of reading a more careful, accurate, and complete historical account of what happened five years ago in a Wikipedia article? What exactly is the urgency of finding out what happened today? This obsession with the latest-breaking stories and forgetting about stories from a few months ago is peculiar. Perhaps there's some psychological basis for it; even I feel some sense that current news can be more "exciting" than old news. Maybe it has to do with being kept in suspense about the outcome? But in that case, why wouldn't more novelists release their books one chapter at a time to force audiences to wait to hear the ending? Or maybe it has to do with the fact that when you can act on news, you really do need the latest information. Last week's stock price isn't that informative, nor is the fact that your tribe leader caught a saber-toothed cat last fall. In the ancestral environment or local communities where news is directly actionable, getting the latest updates matters a lot. However, most news that people consume these days is not immediately actionable but rather serves to update their long-term views of the world.
If you want a silent, systemic issue to be covered in the news, you have to make it into a short-term story -- by carrying out a protest, or publishing a new book, or at least making some tangential connection to a topic that's already in viewers' minds. Among other things, this incentivizes sensationalism and publicity stunts in order to get air time. The actual importance of the story to the world is not necessarily a dominant consideration. Of course, this is probably not a conspiracy by the corporate media to hide information from the masses; most likely it's a reflection of the evolved interests of tribal primates. Within your 150-person band of hunter-gatherers, gossip about who's having sex with whom and who got in a fight with whom are among the most important things for you to learn about from the standpoint of passing on your genes.
To its credit, some news criteria make sense. There is reason behind focusing more on unusual stories because those do more to update your world models about what kinds of things can possibly happen. Of course, it's important not to confuse news for being a representative sample of occurrences in the world, or else one's picture will be rather skewed. You might come to believe that every politician is corrupt and every celebrity has a drug problem, and that shark attacks and plane crashes are significant risks.
News also has the virtues of concision and diversity: It doesn't dwell on a single story for too long, with some exceptions for ongoing trials or scandals, and it does report about a wide variety of events, especially when you include business news, science news, etc. Concision and diversity are both important for preventing oneself from getting stuck in a single, suboptimal topic for too long. Yet these same virtues can be fulfilled by reading a broad sampling of Wikipedia articles, with the added benefits of greater accuracy, completeness, and historical perspective than one gets from news stories.
I don't think news, especially in balance with other sources, is dramatically lower-quality than Wikipedia. Reading a good news article is probably almost as useful as reading a good Wikipedia article. There's high variation in the quality of news, with some articles providing among the best information you can consume at a given moment, and other articles being essentially just entertainment (which is fine if that's what you're looking for). It does seem unfortunate the way society privileges news over foundational learning that's not current but is ultimately more important.
The post Education Matters for Altruism appeared first on Center on Long-Term Risk.
The post Charity Cost-Effectiveness in an Uncertain World appeared first on Center on Long-Term Risk.
Evaluating the effectiveness of our actions, or even just whether they're positive or negative by our values, is very difficult. One approach is to focus on clear, quantifiable metrics and assume that the larger, indirect considerations just kind of work out. Another way to deal with uncertainty is to focus on actions that seem likely to have generally positive effects across many scenarios, and often this approach amounts to meta-level activities like encouraging positive-sum institutions, philosophical inquiry, and effective altruism in general. When we consider flow-through effects of our actions, the seemingly vast gaps in cost-effectiveness among charities are humbled to more modest differences, and we begin to find more worth in the diversity of activities that different people are pursuing. Those who have abnormal values may be more wary of a general "promote wisdom" approach to shaping the future, but it seems plausible that all value systems will ultimately benefit in expectation from a more cooperative and reflective future populace.
To teach how to live without certainty, and yet without being paralyzed by hesitation, is perhaps the chief thing that philosophy, in our age, can still do for those who study it.--Bertrand Russell, History of Western Philosophy
If something sounds too good to be true, it probably is.--proverb
In fall 2005, I was talking with a friend about utilitarianism:
Friend: "I see where you're coming from, but utilitarianism doesn't work."
Me_2005: "Why not?"
Friend: "You have no way of assessing all the effects of your actions. Everything you do has a million consequences."
Me_2005: "We do the best we can to estimate them."
Friend: "The error bars are infinite in both directions."
Me_2005: "Hmm, yes, the error bars are infinite, but we can assume the unknowns cancel out. For example, consider building safer seat belts. In the short term, we know it prevents injuries, which is good. There may be spill-over longer-term effects, but those could equally be good or bad, so they cancel out in the calculations. Hence, the expected value is still positive."
If I were to continue the conversation now, talking with my past self, here's how it might go:
Me_2013: "Ok, let's consider the example of seat-belt safety. One effect is to prevent injuries and tragic deaths, which is good. But another effect is to increase the human population, which means more meat consumption. Since people eat over a thousand chickens and an order of magnitude more fish in their lifetimes, this seems net bad on balance."
Me_2005: "Hmm, ok. Then maybe seat belts are net negative.... Those two factors you enumerated are pretty concrete, and everything else is fuzzy, so we can assume everything else cancels out. Hence the overall balance is bad."
Me_2013: "Well, there might be other good side effects of seat belts. Better car safety means more people driving, which means more inclination to build parking lots. And parking lots are one very clear way to prevent vast amounts of suffering by wildlife that would have otherwise lived short lives and endured painful deaths on the land had it not been paved. Maybe there are more insects whose painful deaths are spared by parking lots than there are additional chickens and fish killed by the drivers who remain alive."
Me_2005: "Ah, good point. Parking lots are quite beneficial, so on balance, it may be that seat belts are net positive."
Me_2013: "On the other hand, more driving of cars means more greenhouse-gas emissions, which means more climate change, which means more global political instability, which may worsen prospects for compromise in the far future. The amount at stake there could swamp everything else."
Me_2005: "Oh, yes, you're right. So then seat belts may be net bad on balance."
Me_2013: "On the other hand, fewer premature deaths may allow people to feel less threatened and more far-sighted in their actions...."
[...and so on]
Nick Bostrom offers more examples of this type in his 2014 talk, "Crucial considerations and wise philanthropy".
This idea of focusing on the effects you can see and ignoring those you can't, hoping they all kind of cancel out, is tempting. It's one way to feel like "I'm actually making a difference" by doing a given action, rather than being paralyzed by uncertainty, which can sometimes be depressing.
Some in the effective-altruism movement adopt this kind of approach. Basically, "any cause that's uncertain, far off, and speculative isn't 'rigorous,' so we should ignore it. Instead, we should focus on clearly proven interventions with hard scientific evidence behind them, like malaria prevention or veg outreach, because at least here we know we're doing something good rather than floating off into space."
I think there's some value to this perspective, especially insofar as digging deep into scientific details can teach you important lessons about the world that are transferable elsewhere. However, where it falls down as a philosophical stance is with this question: How do you know malaria prevention or veg outreach is good on balance? How can you know the sign of the activity (much less the effectiveness) without considering the longer-range and speculative implications? You could try to assume that everything you can't see "just cancels out," but as we saw in the introductory dialogue, this approach can yield arbitrarily flip-flopping conclusions depending on where the boundary of "known facts" meets the boundary of "everything else we assume away." Ultimately we have no choice but to look at the whole picture and grapple with it as best we can.
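As a toy numerical illustration of that flip-flopping (the effect sizes below are invented purely for illustration), consider summing only the considerations you have enumerated so far for the seat-belt example: the running total keeps changing sign, so the verdict depends entirely on where you happen to stop counting.

```python
# Invented effect sizes for the seat-belt dialogue, in arbitrary "goodness" units.
considerations = [
    ("fewer human deaths and injuries", +5),
    ("more meat consumption from a larger population", -8),
    ("more parking lots sparing wild-animal suffering", +6),
    ("more emissions, climate change, and instability", -10),
    ("less premature death, more far-sighted behavior", +9),
]

running_total = 0
for name, effect in considerations:
    running_total += effect
    verdict = "net good" if running_total > 0 else "net bad"
    print(f"after '{name}': {running_total:+d} ({verdict})")
# The verdict alternates; truncating the list anywhere gives an arbitrary answer.
```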
Another way to pick a cause to work on is to look for actions that have broadly positive effects across a wide range of scenarios. For example, in general:
Note that I didn't include on this list technology, economic growth, and other trends generally seen as positive, because both negative-leaning and positive-leaning utilitarians have concerns about whether faster technology is net good or bad. Of course, particular kinds of technologies, like those that advance wisdom faster than raw power, are safer and often clearly beneficial.
Focusing on the very robust projects often amounts to punting the hard questions to future generations who will be better equipped to solve them. In general, we have basically no other choice. There are known puzzles and unknown future discoveries in physics, neuroscience, anthropics, infinity, general epistemology, human (and nonhuman) values, and many other areas of life that are likely to radically transform our views on ethics and how we should act in the world. Taking object-level actions based only on what we know now is sort of like Socrates trying to determine whether the Higgs boson exists. Socrates would have been at a complete loss to solve this question directly, but one thing he could have done would have been to encourage more thinking on intellectual topics in general, with eventual payoff in the future.
Gaverick on Felicifia proposed this basic idea to "get smarter" before we try to address object-level problems. The caution I would add is that we don't just need knowledge per se but specifically ethically reflective knowledge that can slowly, carefully, and circumspectly decide how to move ahead. We don't want a quick knowledge explosion in which one party runs roughshod over most things that most organisms care about in a winner-takes-all fashion. Thus, even though artificial general intelligence (AGI) would ultimately be needed to figure out these questions that are beyond human grasp, we need to develop AGI wisely, and fostering better social conditions within which that AGI development can proceed more prudently can make sense in the short term.
One aspect of preparing the ground for future generations to do good things is making sure our values are transferred, at least in broad outline. The biggest danger of the "punt to the future" strategy is not that the future won't be intelligent (selection pressure makes it likely that it will be) but that it may not care about what we care about. While we don't want to lock the future in to a moral view that would seem silly with greater understanding of the universe, we also don't want the future's values to float around arbitrarily. The space of possible values is huge, and the space of values that we might come to care about even upon further reflection is a tiny subset of the total. This is why our slogan shouldn't be just "more intelligence" but instead something like "more wisdom."
(I should mention in passing that it would probably be better from a suffering-focused perspective if humans don't develop AGI because of the extreme risks of future suffering that doing so entails. However, given that humanity will develop AGI whether I like it or not, my best prospects for making a difference seem to come from pushing it in more humane and positive-sum directions. I also recognize that other people care very much about the fruits that AGI could offer, and I give this some moral weight as well.)
It seems that the main way in which our actions make a difference is by how they prepare the ground for our successors to think more carefully about big questions in altruism and take good actions based on that thinking. However, this doesn't mean that the only way to affect the far future is to work directly on meta-level activities, like movement-building, career advising, or friendly AGI. Rather, all causes that we might work on have flow-through effects in other areas. For example, studying how to reduce wild-animal suffering in the near term can encourage people to recognize the importance of the issue more broadly, which might also make them think more cautiously before spreading wildlife to other planets or in simulations. Studying object-level crucial considerations in philosophy can motivate more people to follow suit, perhaps more effectively than by just preaching about the general importance of philosophical reflection without setting the example. In general, taking small, concrete steps alongside a bigger vision can potentially be more inspiring than just talking about the bigger vision, although exactly how to strike the balance between these two is up for debate.
As one final example to drive home the point, imagine an effective altruist in the year 1800 trying to optimize his positive impact. He would not know most of modern economics, political science, game theory, physics, cosmology, biology, cognitive science, psychology, business, philosophy, probability theory, computation theory, or manifold other subjects that would have been crucial for him to consider. If he tried to place his bets on the most significant object-level issue that would be relevant centuries later, he'd almost certainly get it wrong. I doubt we would fare substantially better today at trying to guess a specific, concrete area of focus more than a few decades out. (Of course, if we can guess such an area with decent precision, we could achieve high leverage, as Nick Beckstead points out in the presentation discussed at the end of this essay.)
What this effective altruist of 1800 might have guessed correctly would have been the importance of world peace, philosophical reflection, positive-sum social institutions, and wisdom. Promoting those in 1800 may have been close to the best thing this person could have done, and this suggests that these may remain among the best options for us today. We should of course be on the lookout for more specific, high-leverage options, but we ought also to remain humble about the extent of our abilities against the vast space of unknown unknowns: errors in our current beliefs and whole domains of knowledge that we never even knew existed.
Nick Beckstead discusses a similar analogy in section 6.4.3 of On the Overwhelming Importance of Shaping the Far Future, giving the example of a person in the year 1500 who wanted to help humanity build telephones centuries later.
Sometimes advocates for a cause become fixated on a particular metric or way of looking at the situation -- for example, a $/QALY figure applied to developing-world or animal-welfare interventions. This seductive number seems to finally give us an answer to which causes are better than others. And the result is that, even after accounting for different quality of evidence between charities and the optimizer's curse, different charities can vary by many times, even orders of magnitude, along this metric.
Quantification in this fashion is certainly helpful and provides one important perspective from which to view the situation, but we shouldn't mistake it for being the end of the story. The world is complex, and there are many different angles from which to view a charity's activities. From different perspectives, different charities may come out ahead.
Moreover, long-term side effects can make a big difference even when optimizing a single metric. A simple example: Suppose the Pope were deciding whether to encourage an easing of the Catholic Church's prohibition on birth control. This would help prevent the spread of HIV in Africa and generally seems like a win for public health, i.e., more QALYs. But wait, birth control would mean fewer pregnancies, which means fewer people born, which presumably means fewer QALYs even considering the HIV-prevention effects. So, according to the QALY metric, allowing birth control would be bad. But then consider the impacts on farm animals: A smaller human population, especially among developed-world Catholics, means less meat consumption and fewer negative QALYs in factory farms, so the overall impact might once again be positive for QALYs. But wait, what about wild animals? Humans appropriate enormous amounts of land and biomass that would otherwise support vast numbers of small, suffering wild animals, so a bigger human population may mean less wild-animal suffering and hence more QALYs, and arguably this effect dominates. So we're back to opposing birth control. On the other hand, consider that better access to birth control might empower women, improve social stability, and generally lead to a more peaceful and cooperative future, which could improve the quality of life of 10^38 future people every century for billions of years. So once again birth control appears positive. And on and on....
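To make the flip-flopping concrete, here is a minimal sketch in Python. The effect sizes are invented purely for illustration; only their signs mirror the considerations above. It shows how the estimated sign of the policy flips depending on where the analysis is cut off:

```python
# Hypothetical effect sizes (in arbitrary QALY-like units) for each successive
# consideration in the birth-control example; the numbers are made up.
considerations = [
    ("HIV prevention / public health", +2),
    ("fewer people born, so fewer human QALYs", -4),
    ("less meat consumption, so fewer negative QALYs on farms", +3),
    ("smaller human population, so more wild-animal suffering", -6),
    ("women's empowerment and a more cooperative far future", +8),
]

running_total = 0
for name, effect in considerations:
    running_total += effect
    verdict = "looks net positive" if running_total > 0 else "looks net negative"
    print(f"After '{name}': total = {running_total:+d} -> easing the prohibition {verdict}")
```

Each additional row can flip the verdict, which is exactly the problem with stopping at any particular row.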
As we saw in the introductory dialogue, we can't just stop at one particular cutoff point for the analysis. The flow-through effects of actions are not wholly unpredictable and may make a huge difference.
When we stop fixating on particular side-effects of an intervention and try instead to see the whole picture, we realize that the charity landscape is less drastically imbalanced than it might seem. Everything affects everything else, and these side-effects have a way of dampening the unique claims to urgency of seemingly astronomically important causes while potentially raising the urgency of causes that naively don't seem so important. For example, insofar as a charity encourages cooperation, philosophical reflection, and meta-thinking about how to best reduce suffering in the future -- even if only by accident -- it has valuable flow-through effects, and it's unlikely these can be beaten by many orders of magnitude by something else.
In 2008, I talked with some people at the Singularity Institute for Artificial Intelligence (SIAI, now called MIRI) and asked what kinds of insanely important projects they were working on compared with other groups. What I heard sounded unimpressive to me: It was basically just more exploration in math, philosophy, cognitive science, and similar domains. I didn't find anything that seemed 10^30 times better (or even 10 times better) than, say, what other smart science-oriented philosophers were exploring in academia. At the time my conclusion was: "Well, maybe this work isn't so important." Now I have a different take; rather than SIAI's work being unimportant, I see the other work by smart academics and non-academic thinkers as being more important than I had realized. These intellectual insights that I had taken for granted don't happen for free. SIAI and other philosophers both contribute to humanity's general progress toward greater wisdom and provide more giant shoulders on which later generations can stand.
For the record, I do think MIRI's work is among the best being done right now and tentatively encourage donations to MIRI, but I also have the perspective to see that MIRI is not astronomically more valuable in a counterfactual sense than other causes that also contribute to the overall mission.
I would also point out that not all work in AI or cognitive science is necessarily net positive. I would encourage differential intellectual progress in which we focus on ethical and social questions at a faster pace than on the purely technical questions, because otherwise we risk having great power without appropriate structures to constrain its use.
This meta-level perspective, in which we recognize the side effects of actions on the future, also defuses the most extreme fanaticism: Infinite payoffs. Some speculative scenarios in physics or other domains offer the prospect of making an infinite impact by your efforts. If so, doesn't working to affect those outcomes dominate everything else? The answer is "no," because in fact, everything we do has implications for those infinite scenarios. Building wisdom, promoting cooperation, and so on will all make a difference to infinite payoffs if they exist, and indeed, it's only by acquiring vastly greater wisdom that we even stand a chance of making the right choices with respect to those potentially infinite decisions. As with the more mundane charity-evaluation examples, fixating on a particular scenario seems potentially dominated by encouraging more broadly helpful social conditions that can allow for addressing a wide spectrum of possible scenarios (some of which we can't even imagine yet). Of course, working on a specific case can be one way to encourage exploration of broader cases, but the seeming direct value of focusing on one specific idea may be outweighed by the indirect value of encouraging deeper insight into the whole class of such ideas.
This is why even pure expected-utility maximizers with no bound on their utility function will still tend to act pretty normally: Even in this case, the best general strategy is probably to set the stage for others to more wisely address the problem, rather than unilaterally doing something crazy yourself. (See also "Empirical stabilizing assumptions" in "Infinite Ethics".)
While this discussion was meant to gesture somewhat in the direction of recognizing what Holden Karnofsky calls "broad market efficiency," it doesn't mean that all charities are equal. Indeed, many actions that we might take would have negative consequences, both in the short and long terms, and even among charities, there are likely some that cause more harm than good. Flow-through effects are notoriously complicated, so even a well-meaning activity could prove harmful at the end of the day. That said, just as I don't expect some charities to be astronomically better than others, so I don't expect many charities to be extremely negative either.
Flow-through effects also don't mean that we can't make some judgments about relative effectiveness. Probably studying ways to reduce suffering in the future is many times more important than studying dung beetles, to use an example from philosopher Nick Bostrom. But it's not clear that directly studying future suffering is many orders of magnitude more important. The tools and insights developed in one science tend to transfer to others. And in general, it's important to pursue a diversity of projects, to discover things you never knew you never knew. Given the choice between 1000 papers on future suffering and 1000 on dung beetles, versus 1001 papers on future suffering and 0 on dung beetles, I would choose the former.
Improving future wisdom is good if you expect people in the future to carry on roughly in ways that you would approve of, i.e., doing things that generally accord with what you care about, with possible modifications based on insights they have that you don't. The situation is trickier for those, like negative utilitarians (NUs), whose values are strongly divergent from those of the majority and are likely to be so indefinitely. If most people with greater wisdom would spread life (and hence suffering) far and wide into the cosmos, then isn't making the future wiser bad by NU values?
It could be, but I think a wiser future is probably good even for NUs in expectation. For one thing, if people are more sophisticated, they may be more likely to feel empathy for your NU moral stance, realizing that you're another brain very similar to them who happens to feel that suffering is very bad, and as a result, they may care somewhat more about suffering as well, if not to the same degree as NUs do. Moreover, even if future people don't change their moral outlooks, they should at least recognize that strategically, win-win compromises are better for their values, so it seems that a wiser populace should be more likely to cooperate and grant concessions to the NU minority. By analogy, democracy is generally better than a random dictatorship in expectation for all members of the democracy, not just for the majority.
If you have weird values, then it's less likely that low-hanging fruits have been picked and that the charity market is as broadly efficient as for other people. That said, insofar as your impacts will ultimately be mediated through your effect on the future, even your weird values might, depending on the circumstances, be best advanced by similar sorts of wisdom- and peace-promoting projects as are important for the majority, though this conclusion will be more tenuous in your case.
Working on robustly positive interventions sounds like a form of risk aversion. A skeptic might allege: "You want to make sure you don't cause harm, without considering the possible upside of riskier approaches." However, as I think is clear from the examples in this piece, cause robustness is not really about reducing risk but is our best hope of doing something with an expected value that's not approximately zero. On many concrete issues, a given action is about as likely to cause harm as benefit, and there are so many variables to consider that taking a step back to explore further is the best way to improve our prospects. In the long run, investment in altruistic knowledge and social institutions for addressing problems will often pay off more than trying to wager our resources on some concrete gamble that we see right now.
Of course, this isn't to say we should always do the safest possible action from the perspective of not causing harm relative to a status quo of inaction. (To minimize expected harm relative to the status quo, the best option would be to keep the status quo.) Sometimes you can go too meta. There are instances where I feel we need to push forward with unconventional ideas and not wait for society as a whole to catch up -- e.g., by spreading concern for wild-animal suffering or considering possibilities of suffering subroutines. But we should also avoid doing something crazy because a high-variance expected-value calculation suggests it might have higher payoff in the short term. Advancing future cooperation and wisdom is not just a more secure way to do good but probably also has greater expected value.
Ultimately our choice of where to focus does come down to an expected-value calculation. If we were just trying to find a way to certainly make a positive impact, we could, for instance, visit an elderly home and play chess with the people there to keep them company. This is admirable and has very low risk of negative side effects, but it doesn't have maximal expected value (except perhaps in moderation as a way to improve your spiritual wellbeing). On the other hand, expending all your resources on a long-shot gamble to improve the far future based on a highly specific speculation of how events will play out does not have highest expected value either. Off of the efficient frontier, more risk does not necessarily mean more expected return. An approach that broadly improves future outcomes will tend to have reasonably high probability of making a reasonably significant impact, so the expected value will tend to be competitively large.
The discussion in this section became lengthy, so I moved it to a new essay.
Groups often think of themselves as different and special. People like to feel as though they're discovering new things and pioneering a frontier that hasn't yet been explored. I've seen many cases where old ideas get recycled under a new, sexy label, even though people end up doing mostly the same things they did before. This trend resembles fashion. Sometimes this happens in the academic realm, like when old statistical methods are rebranded as "artificial intelligence," or when standard techniques from a field are reintroduced as "the hot new thing."
I think the effective-altruism (EA) movement has some properties of being like fashion. It consists of idealistic young people who think they've discovered new principles of how to improve the world. For example, from "A critique of effective altruism":
Effective altruists often express surprise that the idea of effective altruism only came about so recently. For instance, my student group recently hosted Elie Hassenfeld for a talk in which he made remarks to that effect, and I've heard other people working for EA organizations express the same sentiment.
Certainly there are some new ideas and methodologies in EA, but most of the movement's principles are very old:
I think it's helpful to learn more about lots of fields. The academic and nonprofit literature already contains many important writings about social movements, what works and doesn't for making an impact, how to do fundraising, how to manage an organization, etc., as well as basically any object-level cause you want to work on -- from animal welfare to international cooperation. Major foundations have smart people who have already thought hard about these issues. Even the man on the street has a lifetime of accumulated wisdom that you can learn from. When we think about how much knowledge there is in the world, and how little we can ever learn in our lifetimes, the conclusion is humbling. It's good to recognize our place in this much larger picture rather than assuming we have the answers (especially at such a young age for many of us).
One reason EAs may see themselves as special is because they learned through the EA movement a lot of powerful ideas that are in fact much older and more general, including concepts from economics, sociology, business, and philosophy. Carl Shulman discussed this phenomenon in "Don't Revere The Bearer Of Good Info" with reference to the writings of Eliezer Yudkowsky. Carl underscores the importance of stepping outside our own circle to see a much bigger picture of the world than what our community tends to talk about. As I've gotten older, I've been increasingly humbled by how much other people have already figured out, as well as how hard it is to decide where you can make the biggest difference.
In many ways the effective-altruism movement can be seen as an extension of principles from the business world to the nonprofit sector: Quantification and metrics, focus on performance rather than overhead, emphasis on cost-effectiveness and return on investment, etc. For the most part this business mindset is positive, but there are at least two ways that it has the potential to lead charities astray -- both in the EA movement and elsewhere.
In business, (pretty much) all that matters to shareholders is a company's financial performance, as measured in dollars. A company's stock price can capture a lot, including long-term projections for the industry in addition to short-term profits, but it also misses a lot as well -- including most externalities that the company has, unless they affect its bottom line, such as through taxes and government regulations.
Likewise, in altruism, if we introduce a "bottom line" mentality, we may over-optimize for this metric and ignore other important features of charity work. This might look like excessively focusing on QALYs/$ or animals-saved/$, ignoring the often more important flow-through effects (externalities) and implications for the far future that the work might entail. GiveWell has helped to discourage excessive focus on visible metrics, and most other EAs recognize this issue as well. However, I think naive metric optimization is an easy mindset to fall into when one first encounters EA. Optimization in engineering or finance is a much more precise process than optimization in charity or policy making, and sometimes the tools that perform extraordinarily well at the former fail at the latter compared with so-called "soft science" skills.
In business, another company that performs a similar service to customers as yours is a competitor, and your goal is to steal as much market share as you can from that competitor. The only cost of marketing is the money that you spend on it, and if it draws away enough customers from the competitor, it's worth it.
Charities, both EA and otherwise, can adopt a similar mindset: They want more donors for their cause, without paying much attention to which charities they're pulling donors away from. EAs are concerned with replaceability issues and recognize that pulling donors away from other issues might matter, but usually they feel that the charity they're promoting is vastly more effective than the one they're pulling people away from, so there's not much lost. It's possible this is true in some cases -- e.g., encouraging donors to fund HIV prevention instead of AIDS treatments, or recruiting donors who would have funded art galleries -- but in other cases it becomes much less clear. Especially in the realm of policy analysis and political advocacy, which arguably have some of the highest returns, it's more difficult to say that one charity is vastly more important than another, because the issues are complex and multifaceted.
So for altruists, the cost of marketing and fundraising is more than the time and money required to carry them out. Charity is not a competition.
After writing this piece, I came across a presentation that Nick Beckstead gave in July 2013. Nick explains several similar views to those expressed in this essay. For instance, his concluding slide (p. 39):
- There is an interesting question about where you want to be on the targeted vs. broad spectrum, and I think it is pretty unclear
- Lots of ordinary stuff done by people who aren't thinking about the far future at all may be valuable by far future standards
- Broad approaches (including general technological progress) look more robustly good, but some targeted approaches may lead to outsized returns if done properly
- There are many complicated questions, and putting it all together requires challenging big picture thinking. Studying targeted approaches stands out somewhat because it has the potential for outsized returns.
Nick proposes these as some robustly positive goals (p. 5):
- Coordination: How well-coordinated people are
- Capability: How capable individuals are at achieving their goals
- Motives: How well-motivated people are
- Information: To what extent people have access to information
I agree with Coordination and Motives. I'm less certain about Information, because this speeds up development of risks along with development of safety measures. The same is even more true for Capability. I would therefore favor differentially pushing on wisdom and compromise relatively more than economic and technological growth. Nick makes some recognition of this on p. 30:
- Broad approaches are more likely to enhance bad stuff as well as good stuff
- Increasing people's general capabilities/information makes people more able to do things that would be dangerous, offsetting some of the benefits of increased capabilities/information
- Improving coordination or motives may do this to a lesser extent
Nick himself argues that faster economic growth is very likely positive because it improves cooperation and tolerance. He quotes Benjamin Friedman's The Moral Consequences of Economic Growth: "Economic growth -- meaning a rising standard of living for the clear majority of citizens -- more often than not fosters greater opportunity, tolerance of diversity, social mobility, commitment to fairness, and dedication to democracy." I agree with this, but the question is not whether economic growth has good effects of this type but whether these effects can outpace the risks that it also accelerates. I feel that this question remains unresolved.
On pp. 18-19, Nick deflects the argument that faster technology is net bad by pointing out that it also means faster countermeasures, along with some other considerations that I think are minor. This point is relevant, but I maintain that it's not clear what the net balance is. In my view it's too early to say that faster technology is net good, much less sufficiently good that we should push on it compared with other things.
On p. 24, Nick echoes my point about deferring to the future on some questions:
- In some ways, trying to help future people navigate specific challenges better is like trying to help people from a different country solve their specific challenges, and to do so without intimate knowledge of the situation, and without the ability to travel to their country or talk to anyone who has been there at all recently.
- Sometimes, only we can work on the problem (this is true for climate change and people who will be alive in 100 years)
- It is less clearly true with risks from future technology
Nick concludes with some important research questions about historical examples of what interventions were most important and what current opportunities and funding/talent gaps look like.
Carl Shulman has a piece, "What proxies to use for flow-through effects?," that suggests many possible metrics that are relevant for impacts on the far future, though he explains that not all of them are always obviously positive in sign. From Carl's list, these are some that I believe are pretty robustly positive:
The sign of most other metrics is less clear to me, including economic growth, population, education in general, and especially technology. Carl cites the World Values Survey as an important demonstration of the impact of per-capita wealth on rationality and cosmopolitan perspective.
Within the "wisdom" category, I would include the scientometrics that Carl mentions for the natural sciences, applied to the social sciences and philosophy. For example, number of publications, number of web pages discussing those topics, number and length of Wikipedia articles on those topics, etc. Of course, the proof of the value of some of these domains is in the pudding -- insofar as they improve democracy, transparency, global cooperation, and so on.
In the comments, Nick Beckstead suggested inequality as another candidate metric. I haven't studied the literature extensively, but I have heard arguments about how it erodes many other relevant metrics, including trust, cooperation, mental health, and interpersonal kindness. For example, according to Richard Wilkinson: "Where there is more equality we use more cooperative social strategies, but where there is more inequality, people feel they have to fend for themselves and competition for status becomes more important."
It might seem as though we're helpless to respond to not-yet-discovered crucial considerations for how we should act. Is our only option to keep researching to find more crucial considerations and to move society toward a more cooperative and wise state in the meanwhile? Not necessarily. Maybe another possibility is to model the unknown crucial considerations.
Consider the following narrative. Andrew is a young boy who sees people going to a blood-donation drive. He doesn't know what they're doing, but he sees them being stuck with needles. He concludes that he wouldn't like to participate in a blood drive. Let's call this his "initial evaluation" (IE) and represent it by a number to indicate whether it favors or opposes the action. In this case, Andrew assumes he would not like to participate in the blood drive, so let's say IE = -1, where the negative number means "oppose".
A few years later, Andrew learns that blood drives are intended to save lives, which is a good thing. This crucial consideration is not something he anticipated earlier, which makes it an "unknown unknown" discovery. Since it's Andrew's first unknown-unknown insight, let's call it UU1. Since this consideration favors giving blood, and it does so more strongly than Andrew's initial evaluation opposed giving blood, let's say UU1 = 3. Since IE + UU1 = -1 + 3 > 0, Andrew now gives blood at drives.
However, one year later, Andrew becomes a deep ecologist. He feels that humans are ruining the Earth, and that nature preservation deserves more weight than human lives. Giving blood allows a person in a rich country to live perhaps an additional few years, during which time the person will ride in cars, eat farmed food, use electricity, and so on. Andrew judges that these environmental impacts are sufficiently bad they're not worth the benefit of saving the person's life. Let's say UU2 = -5, so that now IE + UU1 + UU2 = -1 + 3 + -5 = -3 < 0, and Andrew now stops giving blood again.
After another few months, Andrew reads Peter Singer and realizes that individual animals also matter. Since human activities like driving and farming food injure and kill lots of wild animals, Andrew concludes that this additional insight further argues against blood donation. Say UU3 = -2.
However, not long after, Andrew learns about wild-animal suffering and realizes that animals suffer immensely even when they aren't being harmed by humans. Because human activity seems to have on the whole reduced wild-animal populations, Andrew concludes that it's better if more humans exist, and this outweighs the harm they cause to wild animals and to the planet. Say UU4 = 10. Now IE + UU1 + UU2 + UU3 + UU4 = -1 + 3 + -5 + -2 + 10 = 5. Andrew donates blood once more.
Finally, Andrew realizes that donating blood takes time that he could otherwise spend on useful activities. This consideration is relevant but not dominating. UU5 = -1.
What about future crucial considerations that Andrew hasn't yet discovered? Can he make any statements about them? One way to do so would be to model unknown unknowns (UUs) as being sampled from some probability distribution P: UUi ~ P for all i. The distribution of UUs so far was {3, -5, -2, 10, -1}. The sample mean is 1, and the standard error is 2.6. The standard error is big enough that Andrew can't have much confidence about future UUs, though the sample mean very weakly suggests future UUs are more likely on average to be positive than negative.
If Andrew instead had 100 UU data points, the standard error would be much smaller, which would give more confidence. This illustrates one lesson when handling UUs: The more considerations you've already examined, the more confidently you can estimate the mean of the distribution from which the remaining UUs are drawn -- here, a weakly positive one.
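Here is a minimal sketch of this bookkeeping in Python, using the UU values from Andrew's story and the simple i.i.d. modeling assumption described above (the variable names are mine):

```python
import math

IE = -1                    # Andrew's initial evaluation of giving blood
UUs = [3, -5, -2, 10, -1]  # crucial considerations discovered so far

# Running verdict as each new crucial consideration arrives
total = IE
for i, uu in enumerate(UUs, start=1):
    total += uu
    print(f"After UU{i}: total = {total:+d} -> {'give blood' if total > 0 else 'do not give blood'}")

# Treat the discovered UUs as samples from an unknown distribution P
n = len(UUs)
mean = sum(UUs) / n
sample_var = sum((x - mean) ** 2 for x in UUs) / (n - 1)
std_error = math.sqrt(sample_var / n)  # standard error of the sample mean

print(f"sample mean = {mean:.1f}, standard error = {std_error:.1f}")
# Prints: sample mean = 1.0, standard error = 2.6 -- the error bars dwarf the
# mean, so five data points say little about the UUs still to come.
```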
That we can anticipate something about UUs despite not knowing what they will be can be seen more clearly in a case where the current UUs are more lopsided. For example, suppose the action under consideration is "start fights with random people on the street". While this probably has a few considerations in its favor, almost all of the crucial considerations that one could think of argue against the idea, suggesting that most new UUs will point against it as well.
Modeling UUs in practice may be messier than what I've discussed here, because it's not clear how many UUs remain (although one could apply a prior probability distribution over the number remaining), nor is it clear that they all come from a fixed probability distribution. Perhaps future UUs tend to dominate past ones in size, leading to ever more unstable estimates; for example, if previous UUs tend to change the sign of lots of previous considerations at once, then the latest UU would have a bigger and bigger magnitude as time went on, since it would need to "undo" more and more past UUs. It's also not clear how to partition insights into UU buckets. For example, the insight that donating blood helps wild animals could be stated simply as a single UU with magnitude 10, or it could be broken down as "donating blood helps wild vertebrates" (magnitude 2) and "donating blood helps wild invertebrates" (magnitude 8). Different ways of partitioning UUs would lead to a different estimated probability distribution, although the sign of the sample mean would always remain the same.
There are problems with the approach I described. Magnus Vinding noted to me that
When it comes to UUs, a major problem is that pretty much our entire value system and worldview seem to be up for grabs, and, moreover, that the different UUs likely will be dependent in deep, complex ways that will make modelling of them very hard. Modelling the interrelations of the UUs in our sample and how they make each other change would also seem a necessary element to include in such an analysis.
My thoughts on these topics were influenced by many effective-altruist thinkers, including Holden Karnofsky, Jonah Sinick, and Nick Beckstead. See also Paul Christiano's "Beware brittle arguments."
The discussion of this essay on Facebook's "Effective Altruists" forum includes a debate about whether flow-through effects are actually significant relative to first-order effects and how inevitable the future is likely to be.
The post Charity Cost-Effectiveness in an Uncertain World appeared first on Center on Long-Term Risk.
The post Differential Intellectual Progress as a Positive-Sum Project appeared first on Center on Long-Term Risk.
As a general heuristic, it seems like advancing technology may be net negative, though there are plenty of exceptions depending on the specific technology in question. Probably advancing social science is generally net positive. Humanities and pure natural sciences can also be positive but probably less per unit of effort than social sciences, which come logically prior to everything else. We need a more peaceful, democratic, and enlightened world before we play with fire that could cause potentially permanent harm to the rest of humanity's future.
The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom.--Isaac Asimov
The unleashed power of the atom has changed everything save our modes of thinking [...].--Albert Einstein
Technology is an inherently double-edged sword: With great power comes great responsibility, and discoveries that we hope can help sentient creatures also have the potential to result in massive suffering. João Pedro de Magalhaes calls this "Alice's dilemma" and notes that "in the same way technology can save lives and enrich our dreams, it can destroy lives and generate nightmares."
In "Intelligence Explosion: Evidence and Import," Luke Muehlhauser and Anna Salamon propose "differential intellectual progress" as a way to reduce risks associated with development of artificial intelligence. From Facing the Intelligence Explosion:
Differential intellectual progress consists in prioritizing risk-reducing intellectual progress over risk-increasing intellectual progress. As applied to AI risks in particular, a plan of differential intellectual progress would recommend that our progress on the scientific, philosophical, and technological problems of AI safety outpace our progress on the problems of AI capability [...].
I personally would replace "risk" with "suffering" in that quote, but the general idea is clear.
Differential intellectual progress is important beyond AI, although because AI is likely to control the future of Earth's light cone absent a catastrophe before then, ultimately all other applications matter through their influence on AI.
At a very general level, I think it's important to inspire deeper philosophical circumspection. The world is extremely complex, and making a positive impact requires a lot of knowledge and thought. We need more minds exploring big-picture questions like
As these questions suggest, greater reflectiveness by humanity can be a positive-sum (i.e., Pareto-improving) enterprise, because a slower, more deliberative, and clear-headed world is one in which all values have better prospects for being realized. In an AI arms race, there's pressure to produce something that can win, even if it's much less good than what your team would ideally want and gives no consideration to what the other teams want. If the arms race can be constrained, then there's more time to engage in positive-sum compromise on how AI should be shaped. This benefits all parties in expectation, including suffering reducers, because AIs built in a hurry are less likely to include safety measures against sentient science simulations, suffering subroutines, and so on.
MIRI does important work on philosophical and strategic issues related to AI and has written much on this topic. Below I discuss some other, broader approaches to differential intellectual progress, but in general, it's plausible that MIRI's direct focus on AI is among the most effective.
The social sciences and humanities contain a wealth of important insights into human values, strategies for pro-social behavior, and generally what philosopher Nick Bostrom calls "crucial considerations" for understanding how the universe works and how to make a positive impact on it. It's good to encourage people to explore this material, such as through liberal-arts education.
The liberal arts are really the core of higher education. Vocational education is an instrument, but the liberal arts represent the best of our values and they develop of critical thinking[. ...T]he liberal arts and the humanities and social sciences are so critical when higher education is often viewed primarily as vocational.
Of course, a pure focus on humanities or social sciences is not a good idea either, because the hard sciences teach a clarity of thinking that can dissolve some of the confusions that afflict standard philosophy. Moreover, since one of the ultimate goals is to shape technological progress in more positive and cooperative directions, reflective thinkers need a deep understanding of science and technology, not just of David Hume and Peter Singer.
Beyond what students learn in school, there's opportunity to expand people's minds more generally. When scientists, policy makers, voters, and other decision-makers are aware of more ways of looking at the world, they're more likely to be open-minded and consider how their actions affect all parties involved, even those who may feel differently from themselves. Tolerance and cosmopolitan understanding seem important for reducing zero-sum "us vs. them" struggles and realizing that we can learn from each other's differences -- both intellectually and morally.
TED talks, Edge, and thousands of other forums like these are important ways to expand minds, advance social discourse on big-picture issues, and hopefully, knock down boundaries between people.
While science popularization helps inform non-experts of what's coming and thereby advance insight into crucial considerations for how to proceed, it also carries the risk of simultaneously encouraging more people to go into scientific fields and produce discoveries faster than what society can handle. The net balance is not obvious, though I would guess that for many "pure" sciences (math, physics, ecology, paleontology, etc.), the net balance is positive; for those with more technological application (computer science, neuroscience, and of course, AI itself), the question is murkier.
Expanding the effective-altruist (EA) movement is another positive-sum activity, in the sense that EAs aim to help answer important questions about how best to shape the future in ways that can benefit many different groups. Of course, the movement is obviously just one of many within the more global picture of efforts to improve the world, and it's important to avoid insular "EA vs. non-EA" dichotomies.
Carl Shulman suggests the following ideas:
- Enhance decision-making and forecasting capabilities with things like the IARPA forecasting tournaments, science courts, etc, to improve reactions to developments including AI and others (recalling that most of the value of MIRI in [Eliezer Yudkowsky's] model comes from major institutions being collectively foolish or ignorant regarding AI going forward)
- Prediction markets, meta-research, and other institutional changes[.]
These and related proposals would indirectly speed technological development, which is a counter-consideration. Also, if used by militaries, could they accelerate arms races? Even if positive, it's not clear these approaches have the same value for negative-leaning utilitarians specifically as the other, more philosophical interventions, which seem more likely to encourage compassion and tolerance.
Is encouraging philosophical reflection in general plausibly competitive with more direct work to explore the philosophical consequences of AI? My guess is that direct work like MIRI's is more important per dollar. That said, I doubt the difference in cost-effectiveness is vast, because everything in society has flow-through effects on everything else, and as people become more philosophically sophisticated and well-rounded, they have a better chance of identifying the most important focus areas, of which AI philosophy is just one. Another important focus area could be, for example, designing international political structures that can make cooperative work on AI possible, thereby reducing the deadweight loss of an unconstrained arms race. There are probably many more such interventions yet to be explored, and generally encouraging more thought on these topics is one way to foster such exploration.
Part of my purpose in this discussion was not to propose a highly optimized charitable intervention but merely to suggest some tentative conclusions about how we should regard the side-effects of other things we do. For example, should I Like intellectually reflective material on Facebook and YouTube? Probably. Should I encourage my cousin to study physics + philosophy or electrical engineering? These considerations push slightly more for physics + philosophy than whatever your prior recommendation might have been. And so on.
Many of the ideas suggested in this piece are cliché -- observations made at graduation ceremonies or moralizing TV programs, about expanding people's minds so that they can better work together in harmony. Isn't this naïve? The future is driven by economic competition, power politics, caveman emotions, and other large-scale evolutionary pressures, so can we really make a difference just by changing hearts and minds?
It's true that much of the future is probably out of our control. Indeed, much of the present is out of our control. Even political leaders are often constrained by lobbyists, donors, and popularity ratings. But a politician's personal decisions can have some influence on outcomes, and of course, the opinions and wealth distribution of the electorate and donors are themselves influenced by ideas in society.
Many social norms arise from convention or expediency, due to the fact that beliefs often follow action rather than precede it. Still, there is certainly leeway in the memes toward which society gravitates, and we can tug on those memes, either directly or indirectly. The founders of the world's major religions had an immense and non-inevitable impact on the course of history. The same is true for other writers and thinkers from the past and present.
Another consideration is that we don't want selective reflectiveness. For example, suppose those currently pursuing fast technological breakthroughs kept going at the same pace, while the rest of society slowed down to think more carefully about how to proceed. This would potentially make things worse because then circumspection would have less chance of winning the race. Rather, what we'd like to see is an across-the-board recognition of the need for exploring the social and philosophical side of how we want to use future technology -- one that can hopefully influence all parties in all countries.
As a specific example, say the US slowed down its technological growth while China did not. China currently cares less about animal welfare and generally has more authoritarian governance, so even from a non-ethnocentric viewpoint, it could be slightly worse for China to control the future. But my guess is that this consideration is very small compared with the direct, potentially adverse effect of faster technology on the whole planet, especially since most non-military technological progress isn't confined within national boundaries. China could catch up to America's level of humane concern in a few decades anyway, and the bigger issue seems to be how fast the world as a whole moves. Also, in the case of military technology, the US tends to set the pace of innovation, and probably slower US military-tech growth would reduce the pressure for military-tech development by other countries.
It's not always the case that accelerated technology is more dangerous. For example, faster technology in certain domains (e.g., the Internet that made Wikipedia possible) accelerates the spread of wisdom. Discoveries in science can help us reduce suffering faster in the short term and improve our assessment for which long-term trajectories humanity should pursue. And so on. Technology is almost always a mixed bag in what it offers, and faster growth in some areas is probably very beneficial. However, from a macro perspective, the sign is less clear.
Promoting education wholesale is another double-edged sword because it speeds up technology as well as wisdom. However, differentially advancing cross-disciplinary and philosophically minded education seems generally like a win for many value systems at once, including suffering reduction.
In "Intelligence Amplification and Friendly AI", Luke Muehlhauser enumerates arguments why improving cognitive abilities might help and hurt chances for controlled AI. Nick Bostrom reviews similar considerations in Ch. 14 of Superintelligence: Paths, Dangers, Strategies.
Benefits:
Drawbacks:
A similar double-edged sword is economic growth, though perhaps less dramatically. One primary effect of economic growth is technological growth, and insofar as we need more time for reflection, this seems to be a risk. On the other hand, economic growth has several consequences that are more likely positive, such as
That said, these seem like properties that result from the absolute amount of economic output rather than the growth rate of the economy. It's not controversial that a richer world will be more reflective, but the question is whether the world would be more reflective per unit of GDP if it grew faster or slower. (Note: In the figure that accompanied the original post, the x-axis represents "GDP and/or technology", not "GDP divided by technology".)
As a suggestive analogy, slower-growing crystals have fewer defects. More slowly dropping the temperature in a simulated-annealing algorithm allows for finding better solutions. In the case of economic growth, one might say that if people have more time to adapt to a given level of technological power, they can make conditions better before advancing to the next level. So, for example, if the current trends toward lower levels of global violence continue, we'd rather wait longer for growth, so that the world can be more peaceful when it happens. Of course, some of that trend toward peace may itself be due to economic growth.
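For readers who want the annealing analogy spelled out, here is a toy Python sketch. The landscape, step size, and cooling schedules are arbitrary choices of mine, not a model of anything economic: the same search problem is annealed with a fast and a slow cooling schedule, and the slow schedule typically settles into better solutions.

```python
import math, random

def energy(x):
    # A rugged one-dimensional landscape with many local minima; lower is better.
    return x * x + 10 * math.sin(5 * x)

def anneal(cooling_rate, steps=2000, seed=0):
    rng = random.Random(seed)
    x = rng.uniform(-10, 10)
    temperature = 10.0
    for _ in range(steps):
        candidate = x + rng.gauss(0, 0.3)
        delta = energy(candidate) - energy(x)
        # Always accept improvements; accept worsenings with probability
        # exp(-delta / T), which shrinks as the temperature drops.
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            x = candidate
        temperature = max(temperature * cooling_rate, 1e-6)
    return energy(x)

fast = [anneal(0.90, seed=s) for s in range(50)]   # temperature collapses quickly
slow = [anneal(0.999, seed=s) for s in range(50)]  # temperature falls gradually
print("fast cooling, average final energy:", sum(fast) / len(fast))
print("slow cooling, average final energy:", sum(slow) / len(slow))
# Typically the slowly cooled runs end with noticeably lower (better) energy.
```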
Imagine if people in the Middle Ages developed technology very rapidly, to the verge of building general AI. Sure, they would have improved their beliefs and institutions rapidly too, but those improvements wouldn't have been able to compete with the centuries of additional wisdom that our actual world got by waiting. The Middle-Age AI builders would have made worse decisions due to less understanding, less philosophical sophistication, worse political structures, worse social norms, etc. The arc of history is almost monotonic toward improvements along these important dimensions.
A counterargument is that conditions are pretty good right now, and if we wait too long, they might go in worse directions in the meantime, such as because of another Cold War between the US and China. Or, maybe faster economic growth means more trade sooner, which helps prevent wars in the short run. (For example, would there not have been a Cold War if the US and Soviet Union had been important trading partners?) A friend tells me that Peter Thiel believes growth is important for cooperation because in a growth scenario, incentives are positive-sum, while in stagnation, they're more zero-sum. Carl Shulman notes that "Per capita prosperity and growth in per capita incomes are associated with more liberal postmaterialist values, stable democracy, and peace." Faster growth by means other than higher birth rates might increase GDP per capita because growth would happen more rapidly than population could keep up.
Suppose AI would arrive when Earth reached some specific level of GDP. Then even if we saw that faster growth correlated with faster increases in tolerance, cooperation, and wisdom, this wouldn't necessarily mean we should push for faster growth. The question is whether some percent increase in GDP gives more increase in wisdom when the growth is faster or slower.
Alternatively, in a model where AI arrives after some amount of cumulative GDP history for Earth, regardless of whether there has been growth, then if zero GDP growth meant zero moral growth (which is obviously unrealistic), then we'd prefer to have more GDP growth so that we'd have more wisdom when AI arrived.
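To see how the two stopping rules can pull in opposite directions, here is a deliberately crude toy model in Python. Every number and functional form in it is invented by me for illustration; it is not an empirical model of growth or wisdom.

```python
# Model A ("gdp_threshold"): AI arrives once GDP hits a fixed level; pair it
#   with wisdom that accrues per year of reflection, regardless of output.
# Model B ("cumulative_gdp"): AI arrives once cumulative GDP reaches a fixed
#   level; pair it with wisdom that accrues only as the economy grows.

def simulate(growth_rate, stop_rule):
    gdp, cumulative, wisdom_from_time, wisdom_from_growth = 1.0, 0.0, 0.0, 0.0
    while True:
        new_gdp = gdp * (1 + growth_rate)
        wisdom_from_time += 1.0                 # one more year of reflection
        wisdom_from_growth += new_gdp - gdp     # wisdom tied to economic growth
        cumulative += new_gdp
        gdp = new_gdp
        if stop_rule == "gdp_threshold" and gdp >= 20.0:
            return wisdom_from_time, wisdom_from_growth
        if stop_rule == "cumulative_gdp" and cumulative >= 200.0:
            return wisdom_from_time, wisdom_from_growth

for rule in ("gdp_threshold", "cumulative_gdp"):
    for g in (0.02, 0.05):
        wt, wg = simulate(g, rule)
        print(f"{rule}, {g:.0%} growth: {wt:.0f} years of reflection, "
              f"{wg:.1f} units of growth-driven wisdom at AI arrival")
```

Under the GDP-threshold rule, slower growth buys more years of reflection before AI arrives; under the cumulative-GDP rule with growth-driven wisdom, faster growth leaves the world wiser at arrival. Which model, if either, resembles reality is exactly the open question.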
Another relevant consideration Carl Shulman pointed out is that growth in AI technology specifically may only be loosely coupled with economic growth overall. Indeed, if slower growth caused wars that triggered AI arms races, then slower economic growth would mean faster AI. Of course, some take the opposite view: Environmentalists might claim that faster growth would mean more future catastrophes like climate change and water shortages, and these would lead to more wars. The technologists then reply that faster growth means faster ways to mitigate environmental catastrophes. And so on.
Also, a certain level of economic prosperity is required before a country can even begin to amass dangerous weapons, and sometimes an economic downturn can push the balance toward "butter" rather than "guns." David E. Jeremiah predicted that "Conventional weapons proliferation will increase as more nations gain the wealth to utilize more advanced technology." In the talk "Next steps in nuclear arms control," Steven Pifer suggested that worsening economic circumstances might incentivize Russia to favor disarmament agreements to reduce costly weapons that it would struggle to pay for. Of course, an opposite situation might also happen: If the budget is tight, a country might, when developing new technologies, strip away the "luxuries" of risk analysis, making sure the technologies are socially beneficial, and so on.
More development by Third World countries could mean that more total nations are able to compete in technological arms races, making coordination harder. For instance, many African nations are probably too poor to pursue nuclear weapons, but slightly richer nations like India, Pakistan, and Iran can do so. On the other hand, development by poor nations could mean more democracy, peace, and inclination to join institutions for global governance.
The upshot is unclear. In any event, even if faster economic growth were positive, it seems unlikely that advancing economic growth would be the most cost-effective intervention in most cases, especially since there are strong competitive and political pressures pushing for it already. Of course, there are some cases where the political pressures are stronger the other way (e.g., in opposing open borders for immigrants), when there's a perceived conflict between national and global economic pie.
Also, while the effects of "economic growth" as an abstract concept may be rather diffuse and double-edged, any particular intervention to increase economic growth is likely to be targeted in a specific direction where the differential impact on technology vs. wisdom is more lopsided.
Quoting Kawoomba on LessWrong:
R&D, especially foundational work, is such a small part of worldwide GDP that any old effect can dominate it. For example, a "cold war"-ish scenario between China and the US would slow economic growth -- but strongly speedup research in high-tech dual-use technologies.
While we often think "Google" when we think tech research, we should mostly think DoD in terms of resources spent -- state actors traditionally dwarf even multinational corporations in research investments, and whether their [investments] are spurned or spurred by a slowdown in growth (depending on the non-specified cause of said slowdown) is anyone's guess.
Luke_A_Somers followed up:
Yes - I think we'd be in much better shape with high growth and total peace than the other way around. Corporations seem rather more likely to be satisfied with tool AI (or at any rate AI with a fixed cognitive algorithm, even if it can learn facts) than, say, a nation at war.
The importance of avoiding conflict and arms races is elaborated in "How Would Catastrophic Risks Affect Prospects for Compromise?"
In general, warfare is a major source of "lost surplus" for many value systems, because costs are incurred by each side, resources are wasted, and the race may force parties to take short-sighted actions that have possibly long-term consequences for reducing surplus in the future. Of course, it seems like many consequences of war would be temporary; I'm not sure how dramatic the "permanently lost future surplus" concern is.
It's not obvious that economic growth would reduce the risk of arms races. Among wealthy countries it might, since more trade and prosperity generally lead to greater inter-dependence and tolerance. On the other hand, more wealth also implies more disposable income to spend on technology. Economic growth among the poorest countries could exacerbate arms races, because as more countries develop, there would be more parties in competition. (For instance, there's no risk of arms races between the developed world and poor African nations in the near future.) But international development might also accelerate global coordination.
My assessments in the previous section are extremely broad generalizations. They're akin to the claim that "girls are better at language than boys" -- true on average, but the distributions of individual measurements have huge overlap. Likewise with my statements about technology and social institutions: There are plenty of advances in each category that are very good and plenty that are very bad, and the specific impact of an activity may be very different from the average impact of the category of which it's a part. The main reason to generalize about categories as a whole is in order to make high-level assessments about policies, like "Should we support more funding of engineering programs in the US?" When evaluating a particular activity, like what you do for your career, a specific analysis of that activity will be far more helpful than just labeling it "technology" or "social science".
In Superintelligence (Ch. 14), Bostrom outlines reasons why faster hardware is likely to make AI control harder:
Artificial consciousness seems net harmful to advance because
Steve Grand defended his work toward artificially conscious creatures on the following grounds:
This is what I care about. I want to help us find out what it means to be conscious and I want to challenge people to ask difficult questions for themselves that they can’t do with natural life because of their unquestioned assumptions and prejudices. But we really are talking about creatures that are incredibly simple by natural standards. What I’m trying to explore is what it means to have an imagination. Not a rich one like humans have, but at all. The only way to find that out is to try to build one and see why it is needed and what it requires. And in doing so I can help people to ask questions about who they are, who other creatures are, and what it means to be alive. That’s not such a bad thing, is it?
This resembles an argument that Bostrom calls an instance of "second-guessing" in Ch. 14 of Superintelligence: basically, that in order to get people to take the risks of a technology seriously, you need to advance work on the technology, and it's better to do so while the technology has limited potential so as to bound risks. In other words, we should advance the technology before a "capability overhang" builds up that might yield more abrupt and dangerous progress in the technology. Bostrom and I are both skeptical. Armed with such a defense, one can justify any position on technological speed because either we (a) slow the technology to leave more time for reflection or (b) accelerate the technology so that others will take risks more seriously while the risks remain manageable.
In the case of artificial consciousness, we should advance the public discussion by focusing our energies on philosophy rather than on the technical details of building software minds. There's already enough technical work on artificial consciousness to fuel plenty of philosophical dialogue.
Improved social wisdom is positive-sum in terms of the resources it provides to different value systems: Because they know more, they can better accomplish each of their goals. They have more tools to extract value from their environment. However, it's not always the case that an action that improves the resources of many parties also improves the utility of each of those parties. Exceptions can happen when the goals of the parties conflict.
Take a toy example. Suppose Earth contained only Stone Age humans. One tribe of humans thought the Earth was beautiful in its untouched natural state. Another tribe felt that the Earth should be modified to better serve human economic interests. If these humans remained forever in the Stone Age, without greater wisdom, then the pro-preservation camp would have gotten its way by default. In contrast, if you increased the wisdom of both tribes -- equally or even with more wisdom for the pro-preservation tribe -- then it would now be at least possible for the pro-development tribe to succeed. Thus, despite a positive-sum increase in wisdom, the pro-preservation tribe is now worse off in expected utility.
However, this example is somewhat misleading. A main point of the present essay was to highlight the potential risks of greater technology, and one reason wisdom is beneficial is that it better allows both sides to cooperate and find solutions to reduce expected harms. For example, absent wisdom, the pro-development people might just start a war with the pro-preservation people, and if the pro-development side won, the pro-preservation side would have its values trashed. If instead both sides agreed to undertake modest development with safeguards for nature preservation, then each side could end up better off in expectation. This is an example of the positive-sum utility benefits that wisdom can bring.
Perhaps there are some examples where wisdom itself, not just technology, causes net harm to a certain ideology, but it seems like on the whole wisdom usually is positive-sum even in utility for many factions.
The main intuition why wisdom and related improvements should be positive-sum is that they hold constant the fraction of people with different values and instead distribute more "pie" to people with each set of values. This fractional view of power makes sense in certain contexts, such as in elections where the proportion of votes is relevant. However, in other contexts it seems that the absolute number of people with certain values is the more appropriate measure.
As an example, consider the cause of disaster shelters that serve to back up civilization following near-extinction-level catastrophes. Many altruists support disaster shelters because they want humanity to colonize space. Suffering reducers like me probably oppose disaster shelters because shelters increase the odds of space colonization without correspondingly increasing the odds of more humane values. If work towards disaster shelters is proportional to (# of people in favor) minus (# of people opposed), and if, say, 90% of people support them by default, then greater education might change
(10 in favor) minus (1 opposed) = 9 net
to
(1000 in favor) minus (100 opposed) = 900 net,
which is a 100-fold increase in resources for disaster shelters. This makes the suffering reducers worse off, so in this case, education was not positive-sum.
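The general pattern behind these numbers can be stated compactly (this is just a restatement of the example; f, N, and k are shorthand introduced for this restatement, not notation used elsewhere in the essay):

```latex
% N engaged people, a fraction f of whom favor disaster shelters:
\[ \text{net support} \;=\; Nf - N(1-f) \;=\; N(2f - 1). \]
% Education that scales engagement from N to kN while leaving f unchanged
% scales net support by the same factor k:
\[ kN(2f - 1) \;=\; k \cdot \text{net support}. \]
% In the example above, f is roughly 0.9 and k = 100, taking the net from 9 to 900.
```

Because education here multiplies support and opposition equally, it multiplies the net resources flowing to shelters rather than leaving them unchanged.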
My intuition that wisdom, education, cooperation, etc. are in general positive-sum presupposes that most of the work that people do as a result of those changes is intrinsically positive for both happiness increasers and suffering reducers. Disaster shelters seem to be a clear exception to this general trend, and I hope there aren't too many other exceptions. Suffering reducers should keep an eye out for other cases where seemingly positive-sum interventions can actually hurt their values.
The post Differential Intellectual Progress as a Positive-Sum Project appeared first on Center on Long-Term Risk.
The post Against Wishful Thinking appeared first on Center on Long-Term Risk.
Some people hold more hopeful beliefs about the world and the future than are justified. These include the feeling that life for wild animals isn't so bad and the expectation that humanity's future will reduce more suffering than it creates. By feeding these dreams, optimistic visions of suffering reduction, while noble, may in fact cause net harm. We should explore ways of increasing empathy that also expose the true extent of suffering in the world, e.g., information about factory farming, brutality in nature, and unfathomable amounts of suffering that may result from space colonization.
The man who is a pessimist before forty-eight knows too much; if he is an optimist after it he knows too little.--Mark Twain
I also think the future is where people project a lot of hopes. They're just less willing to be neutral about it. People are more willing to say, 'Yes, sad and terrible things happened in the past, but we get it. We once believed that our founding fathers were great people, and now we can see they were shits.' I guess that's so, but for the future their hopes are a little more hard to knock off.--Robin Hanson
Wishful thinking is the formation of beliefs and making decisions according to what might be pleasing to imagine instead of by appealing to evidence, rationality, or reality. Studies have consistently shown that holding all else equal, subjects will predict positive outcomes to be more likely than negative outcomes (see valence effect).
I'll mostly leave it to psychologists to debate the origins of wishful thinking. One possibility is that optimism makes one appear more competent than one actually is. Another is that it actually makes one more competent and successful through self-fulfilling beliefs. It could also be primarily an accident on the part of evolution: We feel bad when we imagine bad outcomes, so we cheat by not imagining them as much as is epistemically warranted.
I would guess that some people believe that life in the wild isn't so bad and that the future of humanity will be mostly peaches and cupcakes primarily because, well, it would be pretty depressing otherwise. Just look at how much people want to believe in heaven to see an example of justifying one's desired outlook without epistemic grounding. The transhumanist community has its fair share of starry-eyed disciples awaiting rapture into computational bliss.
There are related cognitive biases about excess positivity, such as optimism bias and rosy retrospection. Conversely, the hypothesis of depressive realism has some empirical support, though it remains controversial.
There's an additional source of positive bias in decisions, namely that those with the most power also tend to be people who haven't lived in abject conditions. Society's leaders might sometimes have low hedonic setpoints, and they might have gone through trials and stresses, but they usually haven't experienced torture, starvation, serious violence, or paralyzing mental illnesses. When we extend our scope to include animals, the contrast is even more stark. Humans have low infant mortality, long lives, and are at the top of the food chain. Many of us have regular meals, shelter, medical care, air conditioning, pain killers, and so on.1 Most animals in the wild are born, live a few days or weeks, and then die painfully of starvation, predation, disease, etc.
Many well-meaning altruists are working to reduce extinction risks in the hopes of a bright future for humanity. They hope our descendants will solve the problems of the world and build computational castles in the sky full of awesomeness. In the process, some optimists neglect the fact that the future of AI will probably be controlled by military, economic, and geopolitical forces, not by quixotic altruists. They may forget the possibility that Darwinian forces beyond our control will supersede human values. Some even assume that super-intelligence will lead to super-compassion, when there's actually no necessary relation, and in fact, most super-intelligences probably have no compassion at all.
Sometimes even negative utilitarians get swept along in enthusiasm for the future. David Pearce's Hedonistic Imperative predicts the end of suffering and the advent of gradients of bliss in a post-human paradise. David anticipates cosmic rescue missions to help suffering sentients on other planets. Unfortunately, in reality, the spread of computational power is likely to multiply suffering rather than to end it, because there will be astronomically more resources for running suffering computations.
But what's the harm of optimism? Why not let negative utilitarians allay some of their worries by wishing for a better tomorrow? In fact, maybe the hope of abolishing suffering will make people more motivated to work toward it? This may be true, but the cost is too high. Hope for the future means people will favor technological development and space colonization, with the aim of ensuring human survival and dispatching interstellar probes. David himself is an advocate for technological progress. Yet this may be precisely the wrong thing to support if we have the goal of reducing suffering.
Note that being realistic is not the same as being depressed or apathetic. We have enormous opportunity to reduce suffering on behalf of sentient creatures. However, we need to be careful about language. There is a lot we can do, but even if we try our hardest, the future will still look very bleak. Our efforts alone will almost certainly not flip the expected sign of the value of the future.2
If unwarranted optimism about the future may be net harmful, then pessimism may be net beneficial. In particular, we might hypothesize that it would be useful to expose people to the reality of suffering that the world contains and that the future may multiply. It would be good to test this hypothesis at some point.
It seems that many people who care a lot about suffering have experienced firsthand how overwhelming it can be. It's an interesting question to ponder whether some amount of suffering helps to inspire compassion or whether physical and mental pain instead make people more selfish, spiteful, and apathetic.
Either way, it seems plausible that some activities probably do elicit more empathy and horror at suffering than they prevent. For example, veg outreach that exposes life on factory farms seems like an excellent way to begin to demonstrate just how much suffering there is in the world. It's easy enough to forget this if we live in a bubble of affluent, mostly cheerful humans. We can extend this appreciation of the extent of suffering by pointing to wild animals, noting that almost all offspring of many species die, perhaps painfully, just a few days after being born.
Thus, it seems that discussing animal suffering in the right way can serve as a reality check against excess optimism, in stark contrast to promoting a Hedonistic Imperative vision that confuses the marginal impact of our efforts to make the world better with the overall probability that the world actually will become a delightful place for all. Highlighting the severity of suffering in nature can suggest one of many risks in colonizing space.
One case where the cause of wild-animal suffering could cause problems is if people assume that we need humans to stick around because they're the only hope for quintillions of potentially suffering insects on the planet for the next billion years. While this is true, (a) it's not clear that humans actually would decrease wild-animal suffering in the future, and (b) even if they did, the benefit of doing so is small in comparison with the potential damage that would result from spreading into space. While the expected value of promoting concern for wild animals is highly positive, this doesn't mean the overall probability of convincing the world to come around to our position on this matter is close to 1.
What are some other interventions that would help to expose the extent of suffering in our world and the even greater expected magnitudes in our future?
"Sunny Days Spark Stock Market Optimism," USA Today Magazine, 2001. (back)
But (ΔB)(ΔP) is too small to matter relative to the other terms. That is to say, the effect of the interaction between your efforts to make the future better and your effect on whether there is a future is small enough to ignore.
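Spelled out, the implied setup seems to be the following (a reconstruction on my part: B stands for how good the future is if it occurs, P for the probability that it occurs, and ΔB, ΔP for the changes your efforts make to each):

```latex
% Expected value of the future before and after your efforts:
\[ V = B \cdot P, \qquad
   V' = (B + \Delta B)(P + \Delta P) = BP + B\,\Delta P + P\,\Delta B + \Delta B\,\Delta P. \]
% Your contribution is B\,\Delta P + P\,\Delta B + \Delta B\,\Delta P, and the
% interaction term \Delta B\,\Delta P is second-order small compared with the other two.
```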
Here I'm speaking at the level of what an individual or group can accomplish: If you and your friends work on making the future better, that's a tiny dent in what humanity as a whole does. Say there are only 100 million people seriously affecting the trajectory of the future. Even if you recruit 100 friends to join your project, that's an influence of 1 in a million. The net expected value of the future would have to be on a knife's edge, slightly negative but almost zero, in order for your work to flip the sign around. Maybe you could expect to have slightly more influence than that based on your correlation with other actors, i.e., if you do X, it means more people like you are doing X because of similarities of your brains, even if you don't coordinate. Still, I don't think it's reasonable to assume this correlation is appreciably large relative to 100 million people in total.
What if we instead look from the perspective of humanity as a whole? Could all of humanity flip the sign of the future around? This is more plausible, but it's still not obvious. Most of human history is determined by evolutionary, economic, political, and social forces beyond the control of individuals. Genetic pressures and human-organizational systems drive society in directions that are hard to stop. We can exert some control over these forces, but it's not clear we can exert enough, and even if we could, it's unlikely we could predict how those forces would unravel precisely enough to know if we were doing the right thing. This is not to say we can't assess some actions as being more likely good than others -- just that the extent to which we can is limited, and it may not be enough to flip around the expected sign of human expansion into the galaxy even if we could strongly influence the whole world.
Anyway, even if humanity as a whole could flip around the expected sign of the future by its efforts, that's not the right question to ask. What matters is just what you can accomplish. It is basically impossible that the actions that you individually take could change the expected sign of the future. (back)
The post Against Wishful Thinking appeared first on Center on Long-Term Risk.
The post A Lower Bound on the Importance of Promoting Cooperation appeared first on Center on Long-Term Risk.
Compromise has the potential to benefit most value systems in expectation, by allowing each side in a dispute to get more of what it wants than its fractional share of power. This is wonderful, but how much could compromise matter? In this piece I suggest a Fermi calculation for a lower bound on how much suffering might be prevented by working to promote compromise. The estimates that I use for each variable are more conservative than I think is likely to be the case.
I do not think a Fermi calculation like I describe below is the best approach for evaluating relative cost-effectiveness. This calculation traces one specific, highly conjunctive branch in the vast space of possible branches for how the future might unfold. Most of the expected impact of promoting compromise probably comes from branches that I'm ignoring.
Likewise, activities other than promoting compromise also have many flow-through effects on many different possible future branches. Comparing projects to shape the future requires much more than a single Fermi calculation. We should use additional quantitative and qualitative estimates across many models, as well as general heuristics. One of the strongest arguments for promoting compromise is not that it dominates in a Fermi calculation (probably it doesn't) but that "increasing the pie" for many value systems is generally a good idea and seems more robustly positive than almost anything else.
That said, explicit and detailed Fermi estimates can help to clarify our thinking and identify holes, and this is one reason for undertaking the exercise.
Suppose the following parameter estimates. Remember, these are designed to be conservatively low, not most likely. The estimates in each bullet are conditional probabilities given the outcomes from the previous bullets.
Given these parameter settings, a lower bound on the fraction of future suffering reduced per person-year of work to promote cooperation is
40% * 20% * 5% * 5% * 10% * 10% * [1/(10 billion)] * (1/1000) * 10% * 0.3 = 6 * 10^-21.
The expected number of suffering-years in our hands would then be
10^-5 * 10^48 * 10^-8 * 10^-10 = 10^25.
Multiplying 6 * 10^-21 by 10^25 gives 60,000 expected suffering-years that we can prevent per year of work to promote compromise. Assuming a year of work means 40 hours per week for 50 weeks, this is (60,000)/(40*50) = 30 suffering-years per hour, or 0.5 per minute.
To convert this into a per-dollar estimate, suppose it would take $150K per year to pay someone to work on compromise, assuming that person would otherwise have done something unrelated and altruistically neutral. This figure is very high for a nonprofit salary, but if someone is willing to work for a lot less, chances are she's already committed to the cause and would have a high opportunity cost, because she could be earning to give instead. In order to attract talented people who would otherwise do altruistically neutral work, a high salary would be required. And remember, this is a conservative calculation. 60,000 expected suffering-years divided by $150K is ~150 suffering-days prevented per dollar. (Here I'm ignoring the fact that future labor-years should be cheaper in present dollars assuming investment returns outpace increases in wages.)
It's important to remember just how imprecise these particular numbers are. For instance, if I had taken the anthropic discount factor to be 10^-5 instead of 10^-10, we would have had 6 billion suffering-years prevented per year of work, or 40,000 suffering-years prevented per dollar.
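For readers who want to check the arithmetic, here is a minimal script that reproduces the figures above (the variable names are mine; the inputs are simply the conservative placeholders from the calculation):

```python
# Reproduce the Fermi estimate above. All inputs are the conservative
# placeholder values from the text, not independently derived numbers.
factors = [0.40, 0.20, 0.05, 0.05, 0.10, 0.10,
           1 / 10_000_000_000,  # 1/(10 billion)
           1 / 1000,
           0.10, 0.30]

fraction_per_work_year = 1.0
for f in factors:
    fraction_per_work_year *= f               # ~6e-21

expected_suffering_years = 1e-5 * 1e48 * 1e-8 * 1e-10   # 1e25

per_work_year = fraction_per_work_year * expected_suffering_years  # ~60,000 suffering-years
per_hour = per_work_year / (40 * 50)                               # ~30
per_dollar_in_days = per_work_year / 150_000 * 365                 # ~146 days, i.e. ~150

# Sensitivity: an anthropic discount of 1e-5 instead of 1e-10 scales everything
# by 1e5, giving ~6 billion suffering-years per work-year, or ~40,000 per dollar.

print(fraction_per_work_year, per_work_year, per_hour, per_dollar_in_days)
```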
This scenario assumed a bound of 10^48 experience-years, but there's some chance physics is other than we think and allows for obscenely higher amounts of computation. Indeed, there's a nonzero probability that infinite computation is possible, implying infinite future suffering. Our calculation would then blow up to say that every second spent on promoting compromise prevents infinite expected suffering.
A few thoughts on this:
I've taken pains to clarify that the calculation in this piece is hardly exhaustive of why cooperation is important but only scratches the surface with one concrete scenario. There are many other reasons for suffering reducers to support international cooperation:
Putting our descendants in a better position to address challenges is useful even if strong AI and space colonization never materialize. Even if humans just continue on Earth for a few million years more, cooperation still improves our trajectory. Of course, this case involves vastly less suffering for us to mitigate, and what we do now may not have a significant impact on what happens tens of thousands of years hence absent goal-preserving AI, so this scenario is negligible in the overall calculations, but those who feel nervous about tiny probabilities of massive impacts would appreciate this consideration. That said, if our only concern was about Earth in the very short term, then plausibly other interventions would appear more promising.
There's a general argument that we should focus on far-future scenarios even if they seem unlikely to materialize due to anthropic considerations because of value of information. In particular, suppose there were two main scenarios to which we assigned equal prior probability before anthropic updating: ShortLived, where humanity lasts only a few more centuries, and LongLived, where humanity lasts billions more years. Say LongLived has N times as many experience-moments as ShortLived and so is N times as important. Correspondingly, the anthropic-adjusted probability of LongLived might, under certain views of anthropics, tend toward 1/N. The expected value of ShortLived is (probability)*(value) = (roughly 1)*(1) = 1 compared against an expected value for LongLived of (probability)*(value) = (1/N)*N = 1. So it's not clear whether to focus on short-term actions (e.g., reducing wild-animal suffering in the coming centuries) or long-term actions (e.g., promoting international cooperation, good governance, and philosophical wisdom in order to improve the seed conditions for the AI that colonizes our galaxy).
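In symbols, the comparison in the previous paragraph is simply (restating the arithmetic, with N as defined above):

```latex
% Equal priors; LongLived contains N times as many experience-moments, but its
% anthropically adjusted probability is taken to be roughly 1/N:
\[ \mathrm{EV}(\text{ShortLived}) \approx 1 \times 1 = 1,
   \qquad
   \mathrm{EV}(\text{LongLived}) \approx \tfrac{1}{N} \times N = 1. \]
```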
When we consider value of information, it pushes toward longer-term actions because they leave open the option of returning to focus on short-term actions if further analysis leads to that conclusion. To make the explanation simple, imagine that halfway through the expected lifetime of humanity given ShortLived, altruists reassessed their plans to decide if they should continue doing actions targeting LongLived futures or if they should focus instead on ShortLived futures. For the sake of clarity, imagine that at this juncture, they have perfect knowledge about whether to focus on short-term or long-term futures. If long-term futures were best to focus on, they would have already been doing the right thing so far and could stick with it. If short-term futures were more important, they could switch to working on short-term futures for the remaining half of humanity's lifetime and still get half the total value they would have gotten had they worked on short-term issues from the beginning.1
Of course, a reverse situation could also be true: start focusing on short-term futures and then re-evaluate to decide whether to focus on long-term futures halfway. The difference is that if people have focused on long-term futures from the beginning, they'll have more wisdom and capacity at the halfway point to make this evaluation. This is an instance of the general argument for frontloading wisdom and analysis early and then acting later. Of course, there are plenty of exceptions to this -- for instance, maybe by not acting early, people lose motivation to act altruistically at all. This general conceptual point is not airtight but merely suggestive.
In personal communication, Will MacAskill made a similar argument about "option value" in a related context and thereby partly inspired this section. Needless to say, there are other considerations besides option value in both directions. For instance, there's greater entropy between our actions now and the quality of experience-moments billions of years from now (though a nontrivial probability of a pretty small entropy, assuming we influence a goal-preserving or otherwise politically stable outcome). Meanwhile, experience-moments of the future may have greater intensity, so the stakes may be higher.
Finally, as was hinted in the Fermi calculation, we could fudge a way to make the far future dominate by saying there's a nontrivial probability that our anthropic discount is wrong and that the future really is as important as it seems naively. This may work, though it also feels suspicious because similar sorts of model-uncertainty arguments could be invoked to justify lots of weird considerations dominating our calculations. The importance of the far future seems one of the more robust sentiments among intelligent thinkers, though, so the fudge feels less hacky in this case.
The post A Lower Bound on the Importance of Promoting Cooperation appeared first on Center on Long-Term Risk.
The post A Dialogue on Suffering Subroutines appeared first on Center on Long-Term Risk.
Alice: Greetings, Brian. I heard that you're concerned about the possibility of what you call "suffering subroutines." You say that artificial intelligences (AIs) in the future -- whether human-inspired or paperclipping -- might run immense numbers of computations that we may consider to be conscious suffering. I find this hard to believe. I mean, why wouldn't instrumental computations be just dumb components of an unfeeling system?
Brian: They might be, but they might not be, and the latter possibility is important to consider. As one general point, note that sentience evolved on Earth (possibly more than once), so it seems like a useful algorithmic construct.
Alice: Sure, sentience was useful on Earth for competitive organisms, but in the subroutines of an AI, every computation is subserving the same goal. Processes are allocated computing resources "to each according to his needs, from each according to his talents," as Dan Dennett observed.
Brian: Possibly. But in that same talk you cite, Dennett goes on to explain that computing processes in the human brain may be competitive rather than cooperative, Darwinian rather than Marxist. Dennett proffers a hypothesis that "centrally planned economies don't work, and neither do centrally coordinated, top-down brains."
Alice: Meh, if that's true, it's probably a vestige of the way evolution came up with the brain. Surely an orderly process could be designed to reduce the wasted energy of competition, and since this would have efficiency advantages, it would be a convergent outcome.
Brian: That's not obvious. Evolutionary algorithms are widely useful and not always replaceable by something else. In any event, maybe non-Darwinian processes could also consciously suffer.
Alice: Umm, example, please?
Brian: It seems plausible that many accounts of consciousness would include non-evolved agents under their umbrellas. Take the global-workspace theory for instance. Are you familiar with that?
Alice: Do explain.
Brian: In the so-called LIDA implementation of the global-workspace model, a cognitive cycle includes the following components:
This diagram lays out the various components.
Note that Stan Franklin, one of the managers of the LIDA project, believes that the earlier version of his system, IDA, is "functionally conscious" but not "phenomenally conscious," as he explains in his 2003 paper, "IDA, a Conscious Artifact?" This seems to stem from his tentative agreement with David Chalmers about the hard problem of consciousness (see p. 10). Because I believe this view is confused, I think the functional consciousness under discussion here is also phenomenal consciousness.
Alice: I see. So why is this relevant?
Brian: If consciousness is this kind of "global news broadcasting," it seems to be a fairly common sort of operation. I mean, one obvious example is the news itself: Stories compete for worthiness to be aired, and the most important news segments are broadcast, one at a time, to viewers who then update their memories and actions in response. Then new things happen in the world, new stories take place, many reporters investigate them in parallel, they compete to get their stories aired, and the cycle happens again. "Emotions" and "dispositions" may modulate this process -- for instance, more conservative news agencies will prefer to air different stories than liberal news agencies, and the resulting messages will be biased in the direction of the given ideology. Likewise, the national mood at a given moment may cause some stories to be more relevant than they would have been given a different mood. People who care a lot about the news stories they hear may get in touch with the news agencies to give feedback and engage in further coordination ("reentrant signaling"). And so on.
Of course, we see analogous behavior in other places as well:
Note that some of these systems are not competitive, and so the claim that lack of Darwinian competition means lack of conscious suffering may not be accurate.
These analogies can actually give insight into why consciousness emerges in brains: It's for a similar reason as why national and global news emerges in human society. Global broadcasting of the most important events and knowledge serves to keep every part of the social system up to date, in sync, aware of required actions (e.g., hurricane warnings, voting days), and alerted to global searches ("Have you seen this crime suspect?") and coordination. As the "Global Workspace Dynamics" paper says: "What is the use of binding and broadcasting in the [cortico-thalamic] C-T system? One function is to update numerous brain systems to keep up with the fleeting present." This process of serializing the most salient updates for global broadcasting may ultimately create a more effective society (organism) than if every local community were isolated and reacted independently with parochial ("unconscious") reflexes or using only tribal knowledge. When a broadcast becomes globally conscious, it's available to all regions, including the verbal/speech centers of a person for conscious report (or to the writers/bloggers of society for verbalization in text). Events in illiterate farming communities would be "unconscious" to the world without journalists who visit to help broadcast those stories. The world can become more conscious of its memories when historians uncover and share information about past events. And the spotlight of attention shifts based on the most emotionally salient events that happen. In general, fast, global network communication over radio, television, and the Internet is making the world more conscious of itself, in a surprisingly literal sense.
Why do we only care about conscious experiences? For instance, we'd be horrified to undergo conscious surgery but don't mind undergoing surgery while anaesthetized. Presumably it's because the parts of us that "care about" our experiences -- such as by reacting aversively, triggering stress feelings, planning ways to avoid the experience, encoding traumatic memories, and so on -- only know about the damaging stimuli when they become conscious. Typically a strong negative stimulus will win competitions to be consciously broadcast, but when anaesthesia blocks pathways from nociceptors to access by the full brain, it prevents the suite of "caring about" responses that would ordinarily be triggered. An analogy in the social realm is that society cares about and responds to negative events when they're reported in the media, but if scandals are covered up or reporters are prevented from talking about atrocities, this is like applying local anaesthesia.
More often, neglect of certain harms and focus on other types of harms is built in to the system. For instance, a sliver in your eye would hurt vastly more than a sliver in your leg because your eye has many more nerve endings. Similarly, a rich person killed in the United States would attract far more attention and response than a poor person killed in Africa because there are vastly more reporters covering the former, and the story about the rich American would seem more salient to the readers (neurons) who vote it up on Twitter.
Another analogy between consciousness and news reporting is that in both cases, once an object enters the spotlight of attention, other events in that spotlight can come to attention that would have otherwise remained hidden. For example, suppose your leg itches, causing you to focus your consciousness on your leg. That may allow you to then feel the breeze on your leg as well, whereas you otherwise would have filtered out that information from your awareness. Likewise, when a news story about X surfaces, this often leads to investigations into other stories Y and Z that relate to X, and stories about V and W that previously would have been ignored become "newsworthy". As an example, following the pool party incident in McKinney, Texas on 5 Jun. 2015, a series of other news stories about McKinney, Texas also became national headlines, whereas previously those kinds of stories wouldn't have reached beyond the local news.
I haven't explored interpretations of the processes mentioned above according to other models of consciousness, but I expect you'd find that systems like these would be at least somewhat conscious in those frameworks as well. In general, most accounts of what consciousness is appeal to general principles that don't go away when neurons stop being involved.
And beyond consciousness, we can see other mind-like processes at play in many systems. Take memory for example. Apparently memories consist of neural connections that become strengthened by repeated use, and they fade as the connections decay. This reminds me of a series of dirt roads through a town. They're first created by some event, they become strengthened with use, and they revert back to wilderness with disuse. A road that hasn't been traveled on in years may become overrun by returning vegetation, but it can still be re-activated more easily than creating a new road from scratch somewhere else. And like with neural connections, a stronger road allows things to flow more easily between the regions it connects.
Alice: Are you really saying that news reports and stock exchanges are conscious? And that roads have memory?
Brian: I don't know.1 But I think we should take the possibility seriously. In any case, it could be that future computational systems contain more human-like entities. For instance, suppose an AI wants to undertake a research program on string theory, to better update its models of physics. It will partition some fraction of computing power to that project. It may want to parallelize the work for speed, so it might create lots of different "research teams" that work on the problem separately and publish their results to others. These teams might compete for "grant money" (i.e., additional computing resources) by trying to produce high-quality findings better than the other teams. These components might be sufficiently agent-like as to evoke our moral concern.
The process of intelligently searching the space of possibilities based on value assessments is a general phenomenon. Animals search through a field until they find a lush patch of strawberries; then they experience reward at the discovery and focus their efforts there for a while. Humans, too, feel reward while trying to figure things out. For instance, V.S. Ramachandran's peekaboo principle is based on the idea that humans receive little squirts of pleasure every time they unpack a small piece of a puzzle, and these "mini aha" moments motivate them to keep going. Perhaps there would be a similar process at play for an AI's research teams. When a small discovery is made, the good news is broadcast throughout the team, and this encourages more actions like what led to the discovery.
As I stated it, this model suggests something akin to David Pearce's gradients of bliss because the rewards for research discoveries are positive. But perhaps the system would use gradients of agony, with research findings being rewarded by temporary relief from discomfort. If there is a possibility for choice between a "gradients of bliss" and a "gradients of agony" design to achieve roughly similar computational ends, this suggests room for humane concerns to make a meaningful difference.
As another illustration, consider economics. Under both capitalism and communism, we see the emergence of hierarchical forms of organization. The CEO of a corporation seems like a decent model for the conscious control center of a brain: The workers perform their duties away from its sight, and then the most important news about the company is bubbled up to the CEO's desk. Then the CEO broadcasts updates to all the workers, including compensation rewards, which adjust worker action inclinations. The company also stores records of these events for later use. The most important ("emotionally salient") historical memories are better preserved, and less relevant ones slowly decay with time. This whole process mimics the global-workspace theory in broad outline. And the fact that hierarchies of this type have emerged in all kinds of governmental and economic systems suggests that they may be common even among the construction workers and researchers of an AI.
Alice: Hmm, maybe. But if there are just a few companies that the AI is directing, that's not a lot of conscious minds. Maybe these suffering corporations are then not a big deal, relative to much larger numbers of suffering wild animals, etc. What's more, the clock speed of a corporate consciousness would be glacial compared with that of an animal.
Brian: Well, even if we don't weight by brain size, who says corporations are the only parts of this process that are conscious? Hierarchical organization is a recurrent pattern of organized systems in general. It could happen at the highest level -- the executive AI controlling its component corporations -- but it would also happen in a fractal way at many lower layers too: Each corporation is composed of subdivisions, each subdivision has its own subdivisions, etc. At some point we might hit a level of entities analogous to "workers." Even below that might be the subcomponent coalitions of an individual worker's brain, which compete for attention by the worker brain's executive-control system. Each of these could have consciousness-like components. And their clock speeds would be quite fast.
One concept in the LIDA model is that of a "codelet," which one page defines as
tiny agents, carrying small pieces of code (hence the name). They can be interpreted as being a small part of a process, but then leading its own life, very much like an ant is a small part of a "process" to gather food, to defend the anthill or to nurture newborns. They run in parallel [...], and none are indispensable.
[...] The entity calling the codelet will estimate its urgency (reflecting the promise of further investigation). Highly urgent codelets can preempt lower urgency codelets [...], and if a codelet's urgency sinks well below that of other's, it just dies out, leaving computer resources to more ambitious codelets. If a codelet sees it has no real work to do in the current situation (due to a bad estimation or changed situation), it sizzles.
It's plausible that individual ants are conscious. So too, maybe even tiny components of an individual worker's brain could be seen as conscious.
Alice: But if a larger consciousness contains many smaller consciousnesses, which each contain many smaller consciousnesses, how do we count them? What are the weights? Do the lowest-level consciousnesses dominate? This discussion is getting curiouser and curiouser!
Brian: Indeed. But these are issues that we need to resolve at some point. To some extent I'm punting the question to our more intelligent descendants. Still, it's useful to realize that suffering subroutines could be a big deal in the future, so that we don't naively reach conclusions based on a circumscribed view of what we might care about.
Alice: From the standpoint of "consciousness as broadcasting," do you think insects are conscious?
Brian: It's an important question. It certainly seems plausible that insects would have some sort of LIDA-like cognitive cycle: Inputs, unconscious processing, most important insights bubble up and are distributed, and they affect action inclinations. Even if this kind of architecture didn't exist exactly, we might see adumbrations of it in whatever insects do. I mean, for example, if one part of the brain communicates its insights to several other parts of the brain, even if not globally, isn't this like a mini-broadcast? Isn't that sort of like consciousness already? In general, any kind of communication-and-updating process would have shadows of the operation that we think of as consciousness. This illustrates my more general point that consciousness comes in gradations -- there's not a single cutoff point where what was unconscious matter suddenly has the lights come on. There are just atoms moving in various ways, and some of them activate our sympathies more than others.
Alice: Well, that raises a question: If we can care about whatever we like, however much we like, why shouldn't I just care about humans and maybe some animals, and forget about these suffering subroutines entirely?
Brian: You can do that, and perhaps we would choose to upon reflection. I don't know what the best criteria are for carving out our caring-about function. But it seems plausible that algorithms are a big part of it, and then when we see processes that resemble these algorithms somewhere else, it raises the question of why we care about them in some forms but not others. I don't know where our hearts will ultimately fall on the matter.
Alice: Do you think even basic physics might contain consciousness?
Brian: I don't know. I hope not, but I wouldn't rule it out. Giulio Tononi's "phi" postulates that even an electron has a nonzero measure of consciousness, for instance.
With the global-workspace model, maybe we could see elementary particles as broadcasting information that then influences other regions -- e.g., the nuclear reactions in the sun broadcast photons, and the sun's mass pulls other objects toward it. But it's not clear that any real "agent" process is going on here. Where are the learning, action selection, memories, etc.? So naively it seems like these kinds of dead physical things aren't conscious, but maybe I'm not looking at them right, and maybe we'll discover ways in which there are agents even in the math of microscopic physics.
Alice: Speaking of math, do you think Darwinism could ultimately go the way of the dodo? I mean, Darwinian competition is just an attempt at hill climbing in a high-dimensional space. But sometimes we have mathematical tools that let us perform exact optimizations without needing to "guess and check." Could intelligence ultimately be reduced to a series of really big mathematical optimization problems that can be solved (at least somewhat) analytically, thereby averting a lot of this expensive computation of agent-like things? Similarly, reinforcement learning is direct adaptive optimal control, but optimal-control problems can potentially be solved by analytic methods like the Bellman equations if you know the payoffs and transition probabilities ahead of time.
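For reference, the Bellman optimality equation Alice alludes to can be written in standard textbook notation as follows; nothing here is specific to the dialogue:

```latex
% s = state, a = action, P = known transition probabilities, R = known reward,
% \gamma = discount factor:
\[ V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]. \]
```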
Brian: Maybe, though it does seem hard to imagine that we could analytically solve some of these really specific, data-driven problems without computing in detail the process being modeled. Perhaps this just reflects lack of imagination on my part, and of course, there are times when macro-scale approximations can remain ignorant of microfoundations. In any case, the actions of the AI on the galaxy to implement its goals would still require lots of real, physical manipulation -- e.g., supervisors to coordinate workers in building solar colonies and such. The possibility you cite is fun to speculate on, but it's not sufficiently probable to substantially affect the concern about suffering subroutines, given that consciousness-like processes seem to be such a convergent feature of organized systems so far.
Alice: Do ecosystems suffer? Could this broad view of consciousness provide some vindication of the otherwise seemingly absurd idea that nature as a whole can have moral standing apart from the welfare of the individuals it contains?
Brian: In principle it's certainly possible ecosystems could contain shadows of consciousness, but it's not clear they usually do. Where is the global broadcasting? What are the action components that are being updated? Maybe you could come up with some interpretations. Even if so, it's not clear what an ecosystem wants. Unlike corporations or ants, ecosystems don't have clear goals. Even if we identified a goal, it wouldn't necessarily align with conservationism; it might go the other way. In any event, even if an ecosystem's consciousness did align with conservationism, it's dubious whether the interests of the ecosystem as a whole could outweigh those of quintillions of suffering individual animals within it.
If we think ecosystems can suffer, then a natural way to prevent future suffering is to have fewer of them. Even if we adopted the stance from environmental ethics of considering ecosystems objects of intrinsic moral importance regardless of sentience, it's not obvious that ecosystems are inherently good; we might think they're inherently bad. This kind of "negative environmental ethics" seems a natural idea for a negative-leaning consequentialist.
Alice: Yeah. Maybe one suggestion could be that the atmospheric CO2 levels are a global signal broadcast to all subcomponents of the biosphere. This then causes (very small) changes in the behavior of animals, plants, and inorganic entities like sea ice. The responses of these entities then have an impact back on CO2 levels, which are then broadcast globally. I guess in this model, the broadcasts are continuous rather than discrete.
Brian: I suppose that's one interpretation you could make, though what would be the valence of CO2? In the stock-trading example, we could say that for the subset of traders that are net long in the security, an increase in the stock price would have positive valence. What about for CO2?
Alice: Maybe those organisms that do better with more CO2 would receive it with positive valence, and vice versa? The "learning" of the ecosystem would then be strengthening those organisms that do well with higher CO2, just like the dopaminergic learning of an animal involves strengthening connections for action neurons that just fired given the current context.
Brian: Ok, I can kind of see that, although in the case of dopamine, the action neurons were responsible for bringing the reward; in the case of the atmosphere, a whole bunch of stuff brought the increase in CO2 levels, and it wasn't necessarily the organisms that benefit from CO2 who were responsible for the emissions. Indeed, people often remark how humans in the global South are "punished" by the CO2 emissions of those in the global North.
Anyway, even if we did consider the carbon cycle somewhat analogous to a brain, keep in mind that the clock speed of this operation is really slow. Of course, since CO2 changes are continuous rather than coming in discrete pulses, the idea of a clock speed isn't really appropriate, but I guess we can still create a rough notion about cycles per year of relevant operations.
Alice: And of course, at the same time, we could have H2O levels as another currency of the biosphere, and temperature as another, and so on. There would be multiple broadcasting systems at play.
Brian: Right. In general, we can pattern-match a complex process in many different ways as being composed of many different systems that each have some resemblance to consciousness. This actually returns us to the old puzzle in the philosophy of computationalism: What is a computation, anyway? One answer is that we see various physical processes as resembling various computations to various degrees, and we can then care about them in proportion to their resemblance. The same thing is going on here -- only, this is not John Searle's toy Wordstar program in the wall but a genuine instance of seeing consciousness-like operations in various places. It's like pareidolia for our empathy systems.
Personally, I don't really care intrinsically about the Earth's carbon cycle, water cycle, etc. to any appreciable degree. I think the connection to animal minds is a pretty far stretch.
Alice: Yes. Moreover, the way we've been discussing consciousness has been pretty simple and crude. There may be important pieces of the puzzle that we've neglected, and we might consider these important as well for making an entity conscious in a way that matters.
Brian: Agreed! This should not be taken as the end of the conversation but only the beginning.
There are many more similarities between operations in our brains and phenomena in the worlds of politics, physics, etc. Sebastian Seung's book Connectome provides several additional comparisons. One of my friends remarked on Seung's work: "I don't think I've ever read a book with so many illuminating analogies!" While most of Seung's readers presumably see these analogies as merely didactic aids, I would suggest that they might also have moral significance if we care about brain-like processes in non-brain places.
Schwitzgebel (2016) reaches a similar conclusion as the previous dialogue did:
unity of organization in a complex system plausibly requires some high-level self-representation or broad systemic information sharing. [...] most current scientific approaches to consciousness [...] associate consciousness with some sort of broad information sharing -- a "global workspace" or "fame in the brain" or "availability to working memory" or "higher-order" self-representation. On such views, we would expect a state of an intelligent system to be conscious if its content is available to the entity's other subsystems and/or reportable in some sort of "introspective" summary. For example, if a large AI knew, about its own processing of lightwave input, that it was representing huge amounts of light in the visible spectrum from direction alpha, and if the AI could report that fact to other AIs, and if the AI could accordingly modulate the processing of some of its non-visual subsystems (its long-term goal processing, its processing of sound wave information, its processing of linguistic input), then on theories of this general sort, its representation "lots of visible light from that direction!" would be conscious. And we ought probably to expect that large general AI systems would have the capacity to monitor their own states and distribute selected information widely. Otherwise, it's unlikely that such a system could act coherently over the long term. Its left hand wouldn't know what its right hand is doing.
In response to the 2015 Edge question, "What do you think about machines that think?", Thomas Metzinger explored a question similar to the one addressed in the dialogue above: Will AIs necessarily suffer, or could they be intelligent without suffering? Metzinger doesn't give a firm answer, but he enumerates four conditions that he believes are necessary for suffering:
This list is interesting and provides four helpful criteria that may enhance a holistic conception of suffering, but in my opinion these criteria are neither necessary nor exhaustive. I would consider them like four principles that one might propose for the meaning of "justice" -- a concept sufficiently complex that probably no four concrete criteria by themselves can define it.
Let's see why each of these conditions is not strictly necessary. The most straightforward is probably #2, since it seems easy to imagine being in pain without engaging sufficient cognition to attribute that pain to yourself. Suffering can be a flood of "badness" feeling, which needn't be sufficiently differentiated that one recognizes that the badness is an experience on the part of oneself. For instance, a depressed person might feel that the whole world is bad -- that there's just a general badness going on.
#4 also doesn't seem necessary, because it can still be morally disvaluable if someone is uncertain whether he's in agony. For instance, suppose you step on something. You're not sure whether the object has punctured the skin of your foot. You think you might feel some sharp pain in your foot, but you're not sure if it's actually there or just imagined, until you actually look at your foot and see the sharp object. (Michael Tye offers a similar example.) I'm not sure what Metzinger would think of this case. In any event, it seems that transparency is actually quite easy to satisfy. It takes a complex cognitive system to produce doubts about experiences. Simple agents should generally have transparent emotions.
As far as #1, I think all systems are at least marginally conscious, so even if condition #1 is necessary, it's always satisfied. Of course, the degree of consciousness of a system matters enormously, but Metzinger's piece seems to be asking whether particular AIs would suffer at all.
As far as #3, I agree that valence plays an important, perhaps central, role in human suffering. This valence might prototypically be the reward part of a reinforcement-learning (RL) system. If one insists that valence can only make sense in the context of a rigid definition of RL, then I agree that not all AIs would have valence (although many still would, given the importance of RL for autonomous behavior). But if we interpret negative valence more broadly as "information indicating that something should be avoided", or even more compactly as "information that produces avoidance", then this kind of operation can be seen in many more systems, including non-learning agents that merely follow fixed stimulus-response rules. Indeed, the basic template of one physical event causing another avoidance-like event runs as deep as the interactions of fundamental physical particles, if we take enough of a high-level view and don't insist on greater complexity in our definition.
Overall, I find Metzinger's criteria too narrow. They leave out vast numbers of simpler systems that I think still deserve some ethical consideration. Nonetheless, I appreciate that Metzinger's proposals enrich our conceptualization of more complex suffering.
The Onion has a humorous article, "Scientists Confident Artificially Intelligent Machines Can Be Programmed To Be Lenient Slave Masters," in which AI researchers discuss the goal of shaping AI trajectories in such a way that AIs treat their human workers (what I might call "suffering human subroutines") more humanely. I find it extremely implausible that AIs would actually use human laborers in the long run, but they plausibly would use conscious worker agents of some sort -- both sophisticated scientist/engineer subroutines and other simpler subroutines of the kind discussed in this piece.
Unlike human laborers, these subroutines would presumably enjoy working as hard as possible on the task at hand. Humans evolved to dislike exertion as a way to conserve energy except when required, but robots built to carry out a given task would be optimized to want to carry out exactly that task. That said, more sophisticated digital agents might, like humans, feel mild unpleasantness if they expended time or energy on fruitless activities. For instance, a robot should dislike moving around and thereby draining its battery unless it thinks doing so will conduce to achieving a reward.
I learned of the idea that suffering subroutines might be ethically relevant from Carl Shulman in 2009. In response to this piece, Carl added:
Of course, there can be smiling happy subroutines too! Brian does eventually get around to mentioning "gradients of bliss", but this isn't a general reason for expecting the world to be worse, if you count positive experiences too.
I would say "sentient subroutines."
Some examples in this piece were also partly inspired from a post by Ben West, linking to Eric Schwitzgebel's "If Materialism Is True, the United States Is Probably Conscious," which I discuss more in another piece.
I coined the phrase "suffering subroutines" in a 2011 post on Felicifia. I chose the alliteration because it went nicely with "sentient simulations," giving a convenient abbreviation (SSSS) to the conjunction of the two concepts. I define sentient simulations as explicit models of organisms that are accurate enough to count as conscious, while suffering subroutines are incidental computational processes that nonetheless may matter morally. Sentient synthetic artificial-life agents are somewhere on the border between these categories, depending on whether they're used for psychology experiments or entertainment (sentient simulations) vs. whether they're used for optimization or other industrial processes (suffering subroutines).
It appears that Meghan Winsby (coincidentally?) used the same "suffering subroutines" phrase in an excellent 2013 paper: "Suffering Subroutines: On the Humanity of Making a Computer that Feels Pain." It seems that her usage may refer to what I call sentient simulations, or it may refer to general artificial suffering of either type.
The post A Dialogue on Suffering Subroutines appeared first on Center on Long-Term Risk.
The post Lexicality between good and bad appeared first on Center on Long-Term Risk.
]]>Is there some kind and amount of badness such that an outcome that contains it is overall bad, regardless of the amount of good in the outcome?
Priority: 6/10
Lexicality among goods can be phrased as follows:
Some or any amount of good A is better than any amount of good B.1
For example, W. D. Ross stated that “no amount of pleasure is equal to any amount of virtue.”2 That is, roughly, that it is better for someone to have virtue than any amount of pleasure in her life. Similar views have been proposed by other philosophers at least since the 18th century.3
Similarly, lexicality among bads has also been discussed, which can be phrased as follows:
Some or any amount of bad A is worse than any amount of bad B.
Several philosophers have defended such kinds of lexicality.4 For example, Stuart Rachels has said that 1 year of excruciating agony is worse than 10^50 years of mild pain.5
The most interesting kind of lexicality appears to be not lexicality among goods or among bads, but rather, claims of the kind that there is a
lexicality between good and bad: an outcome with some or any amount of bad A is overall bad even if the outcome has any amount of good.
Some theories of value imply this kind of lexicality in the sense that any amount of bad makes the world worse than (or not as good as) an empty world. For example, the view that only the satisfaction or frustration of preferences has value or disvalue, combined with antifrustrationism.6 Christoph Fehige, who proposed antifrustrationism, writes that one of his proposed views entails that “nothing can be better than an empty world (a world without preferences, that is).”7 One way to tackle whether there is a lexicality between good and bad is to focus on whether value theories such as Fehige’s are correct. But here we will not take that route; we will instead assume for the purpose of this research topic that there may be states of the world in which the world is better than an empty world.
For this research topic, we focus on the question: If the world can be better than an empty world because it has some good in it, is there some amount of badness such that a world with such badness is worse than an empty world, regardless of the amount of good in the world?8
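To keep the question crisp, it can be restated in the following schematic form (the notation is ours and merely illustrative; none of the authors cited here use this formalism):

```latex
% Schematic restatement (notation ours). Let V(x) be the overall value of
% outcome x and \varnothing the empty world, and suppose some outcomes
% satisfy V(x) > V(\varnothing). The question is whether
\[
  \exists\, b^{*} \;\; \forall\, g : \quad
  V\bigl(\text{a world containing badness } b^{*} \text{ and goods } g\bigr)
  \;<\; V(\varnothing).
\]
```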
This question has seemingly received little discussion in the philosophical literature; most discussion focuses on lexicality with respect to good A vs. good B or bad A vs. bad B, but not bad A vs. good B. There are authors who have made claims resembling lexicality between good and bad, but most have not discussed it extensively. For example, in the context of a happiness index, Bengt Brülde says that
there may well be sufferings that are so intense that no trade-offs are possible, neither in the intrapersonal nor the interpersonal case.
The following statement was made by Swedish philosopher Ingemar Hedenius in 1964:
The worst in life, the fate of the completely unhappy, the uninterrupted, infernalistic suffering, the hopeless humiliation, a child who is slowly tormented to death—I cannot see that all beauty in the world or even the most exceptional thoughts can “counterbalance” such, and neither that other humans’ happiness or culture can. (Our translation)9
Jamie Mayerfeld considers two outcomes: (n), in which one person has a lifetime of torment and many others have a lifetime of extreme bliss; and (1), in which all have a lifetime close to the hedonistic zero. He says about the outcomes that
Like William James, I find the conclusion that (n) is better than 1 unacceptable. The lifelong bliss of many people, no matter how many, cannot justify our allowing the lifelong torture of one.10
A critical discussion of lexicality between good and bad can be found in Toby Ord’s online essay “Why I'm Not a Negative Utilitarian” in the section on Lexical Threshold Negative Utilitarianism. On the other hand, Clark Wolf speaks favorably of lexicality and defends what he calls ‘negative critical level utilitarianism’ (NCLU), according to which “population choices should be guided by an aim to minimize suffering and deprivation.”11
Which of the most important aspects of lexicality among goods and among bads translate simply to lexicality between good and bad? What is unique about lexicality between good and bad compared to the other two? What about lexicality between good and bad makes it easier or more difficult to defend compared to the other two?
Output: Online essay or an article in a philosophy journal.
We suggest this research topic if it proves relevant to lexicality between good and bad, which remains to be investigated (see the previous research topic). A common type of argument used in several of the debates about lexicality, and about whether value relations such as ‘all things considered better than’ are transitive, is the sequence or spectrum argument. One version in terms of ‘worse than’ goes as follows. Assume that a state A, for example some amount of intense suffering, is claimed to be worse than any amount (or an arbitrarily large finite amount) of minor pains, a state we can call Z. The argument says that there is a state B with slightly less severe harms but with a larger amount of them, for example with more individuals experiencing them or experiencing them for a longer period of time, such that B is worse than A. Similarly, there is a state C with slightly less severe harms than in B but a larger amount of them, such that C is worse than B. And so on, until we reach a state Z with a very large amount of minor harms that is worse than Y. By transitivity of ‘all things considered worse than,’ Z is worse than A. But this contradicts the starting position, which was that A is worse than Z. Such a sequence can, for example, be taken to support the claim that the badness in A is not lexically worse than the badness in Z, or that ‘all things considered worse than’ is not transitive.
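Schematically, and in our own notation (reading “X ≻ Y” as “X is all things considered worse than Y”), the structure of the argument is:

```latex
% Our schematic rendering of the sequence argument; the notation is ours.
\[
  \underbrace{A \succ Z}_{\text{starting claim}}, \qquad
  \underbrace{B \succ A,\; C \succ B,\; \dots,\; Z \succ Y}_{\text{sequence premises}}, \qquad
  \underbrace{Z \succ A}_{\text{by transitivity}},
\]
% so the conclusion Z \succ A contradicts the starting claim A \succ Z.
```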
Alastair Norcross uses a sequence argument to argue that “there is some finite number of headaches, such that it is permissible to kill an innocent person to avoid them.”12 Erik Carlson objects that Norcross does not actually specify a sequence. Carlson says,
To convince the skeptic, the proponent of the Sequence Argument has to do better than merely pointing to the prima facie plausibility of there being a sequence of the kind his argument relies on. He must, it seems, actually specify such a sequence…. Until [such a sequence] is actually produced... [we] need not be much worried by the Sequence Argument.13
He highlights that, upon closer inspection, we may be unable to specify a sequence, or there may be specified sequences, each intuitive, that run in both directions.14
Although Carlson’s argument is a reply to Norcross in particular, it seems to apply to other sequence arguments, including those presented by Larry Temkin and Stuart Rachels.15 That is, a weakness in these sequence arguments may be that they, to varying degrees, only point to the plausibility that there exists a sequence of the right kind, but they do not flesh out the sequences. The research topic would be to build on Carlson’s ideas and specify such sequences, and to figure out what can be concluded from such an exercise.
Output: Philosophy undergraduate or master’s thesis, or an article in a philosophy journal.
Toby Ord says that Lexical Threshold Negative Utilitarianism involves a “very strange discontinuity.”
If you believe in Lexical Threshold NU, i.e. that there are amounts of suffering that cannot be outweighed by any amount of happiness, then you have to believe in a very strange discontinuity in suffering or happiness. You have to believe either that there are two very similar levels of intensity of suffering such that the slightly more intense suffering is infinitely worse, or that there is a number of pinpricks (greater than or equal to one) such that adding another pinprick makes things infinitely worse, or that there is a tiny amount of value such that there is no amount of happiness which could improve the world by that level of value.
The argument goes like this. We can imagine a long sequence of levels of intensity of suffering from extreme agony down to a pinprick, each of which differs from the one before it by a barely detectable amount. If we want to avoid the discontinuity in the badness of the intensity of suffering, then the suffering of a million people at the extreme agony intensity level must be only finitely many times as bad as the suffering of a million people at the next intensity level down, which is only finitely many times as bad as the suffering at the next level, and so on all the way down to the level of the pinprick. This implies that suffering at the very high intensity level is only finitely times worse than the suffering at the pinprick level. [Figure.]
Moreover, unless there is to be a case where n+1 people getting a pinprick is infinitely worse than n people getting a pinprick (for n greater than or equal to 1), we can run a similar argument moving from a million people receiving a pinprick down to a single person and show that the former must only be finitely many times worse than the latter. We thus get the conclusion that while a million people in agony is terrible, it is still only finitely times as bad as a single person receiving a pinprick.16
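The finiteness step in this argument can be reconstructed as simple arithmetic. In the sketch below (our reconstruction, not Ord's notation), b_i stands for the badness of a million people suffering at intensity level i, with level 1 being extreme agony and level m a pinprick, and “no discontinuity” is read as each level being at most finitely many times as bad as the next:

```latex
% Our reconstruction; the symbols b_i and k_i are ours.
\[
  b_i \le k_i\, b_{i+1} \text{ with each } k_i \text{ finite}
  \;\Longrightarrow\;
  b_1 \;\le\; k_1\, b_2 \;\le\; k_1 k_2\, b_3 \;\le\; \dots \;\le\;
  \Bigl(\prod_{i=1}^{m-1} k_i\Bigr) b_m ,
\]
% and a finite product of finite factors is finite, so extreme agony comes out
% only finitely many times as bad as a pinprick.
```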
This kind of spectrum, sequence, or step-by-step argument that Ord uses has been discussed to a fair extent and there are several possible replies to it, as well as counterarguments to the replies.17 A reply to Ord's version could describe the possible replies and the challenges with the replies, and it would likely conclude that lexicality in this case is not clearly “very strange,” at least not in a way that is especially problematic.
Output: Online essay.
Feel free to contact us for more research ideas regarding lexicality.
Related to lexicality are topics such as vagueness, completeness, discontinuity, extensive structures (in measurement), Archimedeanness and non-Archimedean measurement, infinities, and transitivity of value relations such as ‘all things considered better than.’
The post Antifrustrationism appeared first on Center on Long-Term Risk.
Christoph Fehige proposed antifrustrationism, according to which a frustrated preference is bad, but the existence of a satisfied preference is not better than if the preference didn’t exist in the first place.1 In Fehige’s words,
We don’t do any good by creating satisfied extra preferences. What matters about preferences is not that they have a satisfied existence, but that they don’t have a frustrated existence.2
Several authors have objected to antifrustrationism. How could a proponent of antifrustrationism respond to these objections? If you are interested in doing research in this area, please contact us since we have unpublished material on related topics.
Priority: 8/10
Among the existing critiques, the most interesting ones are provided by Arrhenius in “Future Generations,” Ryberg, Danielsson, and Holtug. There are possibly other critiques; see citation searches, for example on Google Scholar, for sources that cite Fehige, “A Pareto Principle for Possible People.”
Novel research: Article in a philosophy journal. Title suggestion: “In Defense of Antifrustrationism.”
The post Open Research Questions appeared first on Center on Long-Term Risk.
There are a number of crucial considerations for reducing suffering in humanity's future. This page presents a ranked list of topics that the Center on Long-Term Risk considers important to investigate. Let us know if you'd like to help research these topics. Some are most appropriately addressed by reviewing existing literature and summarizing it on Wikipedia. Other topics require novel exploration.
It's likely that artificial intelligence (AI) in some form will hold the reins of power over Earth's future within the coming centuries, barring economic or societal collapse in the interim. Depending on the dynamics of how AI is developed and how unpredictable AI behaviors are, humans may keep their hands on the steering wheel of how AI is shaped, or AI might take a direction of its own due to economically outcompeting humans, oversights by its programmers, or other factors.
Organizations like the Machine Intelligence Research Institute and the Future of Humanity Institute have explored the implications of various AI takeoff scenarios for human flourishing, but less attention has been given to the implications of various types of AIs for future suffering. It's plausible that some AI trajectories will cause significantly more suffering than others, but which ones?
Brian Tomasik has sketched some of his guesses about ways in which different types of AIs would cause suffering, but this is just a start. We need a thorough research program on this topic. Some relevant questions include:
We should also explore whether there are particular forms of AI-safety research that are more targeted relative to the value of suffering reduction. For instance, are there ways we can ensure that even if AIs fail to achieve human goals, they at least "fail safe" and don't cause astronomical amounts of suffering? And even if suffering reducers don't support AI safety wholesale (which, as mentioned, seems unlikely), are there particular components of AI safety that they would support and should promote further?
Many views in ethics and value theory see preventing suffering as particularly important. Such views include Negative Utilitarianism but also other views in population ethics, axiology, and normative ethics. New research in this vein or presentations of such views to a general audience can build on the works we list in our bibliography. Below are examples of more specific topics.
Futurists debate what AI will look like when it arrives. Some, like Eliezer Yudkowsky and Nick Bostrom, have argued in favor of the possibility of a "hard takeoff" in which a single AI or small team of AI creators can rapidly self-improve to the point of unilaterally taking over the world. Others, like Robin Hanson and J. Storrs Hall, have argued for a "soft takeoff" in which AI is integrated into society as a whole, and the rapid self-improvement occurs in a way similar to the exponential economic growth that we see already. Another possibility is AI arms races among several powerful countries, in which militaries aim to outcompete each other in a fashion reminiscent of the Cold War.
Anthropic reasoning aims to gain insight about our place in the universe based on the facts that we exist and find ourselves in a particular time and context. As an example, it's sometimes claimed that human civilization is unlikely to last vastly longer than it has already, because if we consider ourselves a random sample from all humans, we would expect to have been born much later in history. This is called the "doomsday argument" and is one controversial application of anthropic argumentation. Many thinkers reject the doomsday argument, though they differ widely on the reasons for its rejection. Some argue that a narrow reference class of observers can solve the problem. Others suggest giving higher a priori probability to scenarios with more total observers. Yet others propose eliminating the notion of discrete observers within a reference class altogether. In general, the best approach to anthropics has not been "solved".
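For readers who have not seen it, the arithmetic behind the doomsday argument runs roughly as follows (a standard back-of-the-envelope illustration; the figures are approximate, and the uniform-sampling assumption is exactly the controversial step):

```latex
% Standard illustrative sketch; the numbers are rough approximations.
% Let N be the total number of humans who will ever live and r your birth rank.
% Treating r as a uniform draw from {1, ..., N}:
\[
  P\bigl(r > 0.05\,N\bigr) = 0.95
  \quad\Longleftrightarrow\quad
  P\bigl(N < 20\,r\bigr) = 0.95 ,
\]
% so with roughly 10^{11} humans born so far, N < 2 \times 10^{12} at
% 95 percent "confidence" -- hence the doomsday-flavored conclusion.
```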
One example of anthropic-type thinking is the principle of mediocrity -- a Copernican intuition that we should expect ourselves to be typical observers in the universe. This idea seems at odds with the fact that we appear to be in an extremely influential time in the history of our galaxy. We live during some of the generations that may create and determine the constitution of AIs that colonize our region of the universe. What does anthropics have to say about this? Should we think the far future is much less likely to happen than we naively would have believed?
Anthropic-type ideas like the Fermi paradox, Great Filter, and timeline for evolution of life on Earth can provide further suggestions about how hard superintelligence is and how it behaves once created.
The following questions are from Nick Beckstead's slides, "How to Compare Broad and Targeted Attempts to Shape the Far Future" (pp. 35-36):
For an introduction to the topic, see "Is There Suffering in Fundamental Physics?".
The post Work with us appeared first on Center on Long-Term Risk.
The Center on Long-Term Risk will soon be recruiting for a Director of Operations, and we are currently seeking expressions of interest for the role. The role is to lead our 2-person operations team, handling challenges across areas such as HR, finance, compliance, and recruitment.
For more details and to make a submission, see here. The deadline for submissions is the end of Sunday 11th February.
Edit: The deadline for expressions of interest to be considered in our initial round has now passed. You're still welcome to submit expressions of interest, but we will only invite late submissions to join the immediate hiring round in exceptional cases. If we decide to proceed to a full hiring round in April 2024, we'll take another look at all new expressions of interest at that stage.
We usually run hiring rounds for specific positions. There are currently no ongoing hiring rounds, but we would love to hear from you if you would be excited to contribute to our mission. In that case, simply fill out our general application form. We particularly encourage applications from women and minority candidates. Please contact us at info@longtermrisk.org if you have any questions about working with us.
CLR’s mission is to reduce worst-case risks from the development of advanced AI systems. We are the largest organisation focussed on s-risk reduction, with our researchers being among only a few working on s-risk reduction and cooperative AI.
We aim to combine the best aspects of academic research (depth, scholarship, mentorship) with an altruistic mission to prevent negative future scenarios. So we leave out the less productive features of academia, such as precarious employment and publish-or-perish incentives, while adding a focus on impact and application.
As part of our team, you will enjoy:
You will advance neglected research to reduce the most severe risks to our civilization in the long-term future. CLR's activities include:
CLR has received grants from Open Philanthropy, the Survival and Flourishing Fund and Polaris Ventures. Testimonials about CLR’s work from prominent community members can be found here.
Diversity and equal opportunity employment: CLR is an equal opportunity employer, and we value diversity at our organization. We welcome applications from all sections of society and don’t want to discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, social background/class, mental or physical health or disability, or any other basis for unreasonable discrimination, whether legally protected or not. If you're considering applying to a role and would like to discuss any personal needs that might require adjustments to our application process or workplace, please feel very free to contact us.
In addition to their salary, CLR offers the following benefits to all staff (including Summer Research Fellows):
The post Dealing with Moral Multiplicity appeared first on Center on Long-Term Risk.
Scientists debate the specific evolutionary processes that gave rise to humans' and animals' moral sensibilities, but the original functional purpose of morality is less ambiguous: Morality was a set of community norms that served to enforce fair play. Maintaining these standards allowed a tribal group to overcome prisoner's dilemmas and thereby achieve higher total fitness than if anarchy prevailed. Deviations from these agreements would be punished, and concordance would be rewarded.
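To illustrate the claim about prisoner's dilemmas, here is a minimal sketch with standard textbook payoffs (the specific numbers are ours and purely illustrative):

```python
# Illustrative prisoner's-dilemma payoffs (numbers are ours, chosen only to
# show why enforced cooperation can raise a group's total fitness).
payoffs = {
    # (row move, column move) -> (row payoff, column payoff)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# Defecting is individually tempting (5 > 3 against a cooperator, 1 > 0 against
# a defector), yet mutual cooperation maximizes the total payoff of the pair.
totals = {moves: sum(p) for moves, p in payoffs.items()}
print(totals)  # {('C', 'C'): 6, ('C', 'D'): 5, ('D', 'C'): 5, ('D', 'D'): 2}
```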
This explains why norms existed, but what would be the reason for people to feel like these norms were more than arbitrary rules to be followed only when breaking them risked getting caught? Why would people attach a sense of higher, objective rightness to them? One standard story to explain this is that those people who felt the community norms were objectively right would follow them in all circumstances, including when they thought they weren't being watched. Then, on the rare occasions when those people were actually being watched, those who followed the rules universally would fare much better. Analogously, I think organizations that follow common-sense ethics are more successful in the long run than organizations that scheme in private and eventually are exposed in a scandal.
We can also observe that communities in which people felt that morality was universal would have had higher total fitness due to cooperating on prisoner's dilemmas even when enforcement wasn't possible, but this explanation invokes group selection. While group selection at the gene level may be dubious because the right mutation would have to happen in many people at once, group selection at the meme level might be more plausible if the meme can spread to most members of the group at the same time. Memes are more like diseases than individual mutations.
Regardless, it's clear that people do have a sense of ethics that's distinct from their other preferences. The idea that "Murder is wrong" feels different from the idea that "I like chocolate." Our brains inform us of the separation between selfishness vs. adherence to group norms, even though our actions are ultimately a mixture of the two impulses.
Just as toddlers learn that sharp things hurt when you stick them in your arm and that vegan ice cream is yummy, so they also learn that you shouldn't hit people and slavery is immoral. The first types of lessons mostly come directly from their hedonic systems, while the latter mostly come from social inculcation and hence lodge in a conceptually distinct region of the brain devoted to "conscience" topics. Indeed, there are times when people learn lessons about their own hedonic wellbeing -- e.g., don't drive without a seat belt -- and perhaps because these are learned through social norms rather than direct hedonic reaction, the lessons tend to end up in the abstract, conscience-type mental regions. Forcing yourself to do what you know is in your own long-term interest feels similar to forcing yourself to do what's good for others; in both cases, the brain doesn't necessarily have direct hedonic learning to make the decision quick and facile. Of course, there are many moral/political issues where people feel passionately rather than just obeying their consciences against other desires. Perhaps in these cases the issue has become sufficiently learned in a hedonic sense that it doesn't require cognitive control to be followed.
Moral principles feel more objective when they've seemed more universal in your development. For instance, in modern Western countries, almost everyone agrees that human slavery is wrong, so people have a strong sense that this is a clear, unquestionable moral fact. In contrast, there's debate over whether it's right for gays to get married, whereas hundreds of years ago, gay marriage would have been unanimously seen as wrong in many cultures and so its wrongness would have also seemed an obvious moral fact to many people. And a few centuries from now, embracing gay marriage will probably be seen as obviously right.
As people learn of cases where other societies had different cultural norms, they sometimes relax their feelings of absolute objectivity. For instance, some may say that it wasn't wrong for the Aztecs to perform human sacrifices, because this was their culture, whereas it would be wrong to do sacrifices in our culture. Moral relativism is one response to disruption of apparent universality of a moral principle. Of course, some incline more toward it than others, partly depending on how much other people in one's milieu endorse it. If you put a random person in an anthropology department, he's more likely to become morally relativistic when you remind him about Aztec sacrifices than if you put him in a Baptist church.
Changes in moral outlook are subject to the laws of brain plasticity just like the hedonic system or other brain systems, and flexibility tends to decline with age. Usually, the more people hear a certain view and spend time with others who espouse it, the more they adopt it themselves, unless they actively apply a negative gloss onto the messages so as to negate their valence. (Think of people saying "Boo! Hiss!" when they hear a politician of the opposing party.) Many moral views come from associating a brain region for a concept (say, animal farming) with a brain region that already has strong moral valence (say, slavery), thereby glossing the concept with that same valence. Why is abortion wrong? Because it's murder (which has strong existing negative valence). Why is abortion acceptable? Because it's a woman's choice about her body (invoking strong positive valence associated with individual autonomy). Why is gay marriage wrong? Because it destroys the sanctity of marriage and the family (which have strong existing positive valence). Why is gay marriage right? Because gays deserve equality, and allowing gays to marry would strengthen the institution of marriage, not weaken it (strong positive valence). And so on. See the "Appendix: Ethical reasoning using associative networks" for one model of how this smearing of valence might work.
Associations account for some types of moral reasoning (e.g., "X is stealing. Stealing is wrong. Therefore, X is wrong."). As Joshua Greene would claim, other types of moral reasoning are slower, more calculated, and more consequentialist. For instance, consider the moral dilemma of whether you should suffocate a crying baby if this is the only way to prevent your whole congregation, including the baby, from being heard and killed by armed guards. Our immediate associative response is "suffocating babies is bad," but a more nuanced response, noting that the baby would die anyway, may be different. Still, the reasoned response too ultimately relies on propagating valence from certain ideas (e.g., "saving more lives is better") back to conclusions ("it's better to suffocate the baby in this particular case"). Over time, if these reasoning chains are encountered often, neural shortcuts develop, and then moral calculation is no longer required. In any event, calculated moral reasoning is not necessarily "better" than associative reasoning, although some communities might claim it to be better -- ironically, based on the positive associations that they have with calculated reasoning!
Our ethical views would be different if we had grown up differently. Genes play a big role, as does early childhood development. Even immediate factors can modify our outlooks from moment to moment. Paul Christiano notes that blood sugar has an effect. We're also randomly influenced by being tired, being alone vs. with friends, what movie we watched yesterday, and countless other seemingly trivial forces. When we think in naive terms about the grand system of morality, it feels like incidental, random occurrences in our past shouldn't be so influential, or even influential at all, on what we judge to be right or wrong.
If I feel X, but I could have grown up to feel Y, and other people feel Z, then isn't it all pointless? This is the perspective of moral nihilists, and I think fear of nihilism is one common motivation for holding on to moral realism, if only by confused Pascalian reasoning, in the face of the ultimate arbitrariness of our moral sentiments. Of course, nihilism contradicts itself: If there are no objective standards of judgment, then it's not even true to say that "morality is pointless" because there's no such thing as something being pointless or not. Still, this doesn't trouble the persistent nihilist, who can simply refuse to engage in this conversation any further because it's all meaningless -- whatever that's supposed to mean.
When people think a moral statement is objective, they usually feel like, "I just know ABC is right. It can't possibly be otherwise." Then, when we imagine ourselves growing up somewhat differently and coming to hold a different view due to changed circumstances in genes or development, we still project our self-identity onto that different person, and then we realize that, "I could have not felt that ABC was right. Hmm. Maybe it's not The Right Thing after all." This causes the "morality feeling" in our brains to shut down, and our once noble moral impulses begin to look more like egoistic desires. The "magic" of absolute rightness fades away.
I think this is a failure mode that we can avoid. We should recognize what's going on, yes, but why let ourselves turn off the magical moral feeling in our brains? Why not keep it, if it makes us feel better and is more consistent with our desires anyway? We can value the magic moral feeling as though it were still magic if we choose, and I choose to do so. It's not Wrong to feel the magic even after understanding what's under the hood, any more than it's Wrong to still like chocolate after learning about pleasure glossing in the ventral pallidum and realizing that you could have been wired to enjoy the taste of elephant dung instead.
Arbitrary genes and personal experiences may make you choose different values than someone else, but without any genes or personal experiences, there would be no "you" at all. In some sense, the arbitrariness of one's genes and history is essential to ethics; it's not necessarily a bias that we should try to eliminate. Without your own arbitrary initial conditions, you might instead be a paperclip maximizer (insofar as you would still be "you" at all if you had such radically different moral beliefs).
Even if different groups have different moral goals, they'll find it advantageous to compromise in positive-sum ways, such as cooperating on moral versions of the prisoner's dilemma. This may lead to somewhat convergent outcomes insofar as compromise aggregates different views roughly in proportion to bargaining leverage.
The evolutionary motivation for the feeling of objective morality at the individual level was that it was a form of pre-commitment to following group norms based on cooperation. By the same token, we could imagine feeling that a stance which aggregates different moral views into a unified, compromise moral view is "objective," giving us magic moral-glow feelings. This would be a second layer of aggregation: First there were egoistic desires that got aggregated to moral tribal norms, and then the remnants of those tribal norms get aggregated to meta-level worldwide norms. Of course, most of the motivations in the world are not moral norms but still regular egoism, so base-level egoism would have significant sway over the compromise morality as well.
What do I think of this proposal? I like the fact that feeling the glow of objective morality can motivate people to compromise, and this might be a useful intuition pump. On the other hand, because compromise is weighted by power, it feels unfair to the weak and voiceless, including all non-human animals, whose only representation comes from the extent to which other humans happen to sympathize with them. Of course, we can't ask for anything better. Usually a power-weighted compromise is the best possible outcome for achieving whatever goal we want in expectation. One of the lessons of Robert Axelrod's tournaments on the iterated prisoner's dilemma was "don't be envious." Still, I'm not comfortable moving my moral compass toward feeling like "might actually makes right" rather than just feeling that "power-weighted compromise is the best outcome we can hope for given the circumstances."
Reflective-equilibrium views aim to resolve ethical disagreements into a self-consistent whole that society can accept. One example is Eliezer Yudkowsky's coherent extrapolated volition (CEV), which would involve finding convergent points of agreement among humans who know more, are more connected, are more the people they wish they were, and so on.
A main problem with this approach is that the outcome is not unique. It's sensitive to initial inputs, perhaps heavily so. How do we decide what forms of "extrapolation" are legitimate? Every new experience changes your brain in some way or other. Which ones are allowed? Presumably reading moral-philosophy arguments is sanctioned, but giving yourself brain damage is not? How about taking psychedelic mushrooms? How about having your emotions swayed by poignant music or sweeping sermons? What if the order of presentation matters? After all, there might be primacy and recency effects. Would we try all possible permutations of the material so that your neural connections could form in different ways? What happens when -- as seems to me almost inevitable -- the results come out differently at the end?1
There's no unique way to idealize your moral views. There are tons of possible ways, and which one you choose is ultimately arbitrary. (Indeed, for any object-level moral view X, one possible idealization procedure is "update to view X".) Of course, you might have meta-views over idealization procedures, and then you could find idealized idealized views (i.e., idealized views relative to idealized idealization procedures). But as you can see, we end up with infinite regress here.
I like to picture our moral intuitions like a leaf blowing on the sidewalk. Asking "What do we really want?" is like asking, "In what direction does the leaf really want to blow?" Another analogy for our moral intuitions is a board game, like Chutes and Ladders: Roll the dice of environmental stimuli, move forward some amount, and maybe be transported up or down by various forces.
Sometimes you might reach an absorbing state of reflective equilibrium, but there are many such absorbing states you could fall into. The absorbing states you'd reach studying philosophy are different than those you'd reach studying biology, which are different from those you'd reach studying international relations. Each field changes your brain in different ways. And there may be path dependence due to early neural plasticity that becomes less flexible over time. One analogy people sometimes give is with sledding on fresh snow: After you've carved out certain paths, it can be harder to make other paths rather than falling back into the original ones. Of course, with artificial brains, we could reset the initial conditions (undoing the sledding marks) if desired.
And whom do we extrapolate? CEV suggested all humans on Earth, but why stop there? Why not ancestral humans on the African Savanna? How about whatever intelligent life would have evolved if the dinosaurs had not gone extinct? Aliens? Beings in other parts of the multiverse? Wacky minds that never evolved but could still be constructed, like pebble sorters?
Maybe one answer is: Let's try a bunch of initial conditions and extrapolation methods and see if some attractors are more common than others. Yes, convergence is not unique, but we might still be able to rule out a lot of moral stances as not stable. Maybe we could weight among the stable attractors that remain, although the relative numerosity of different attractors would themselves depend on the parameters for the process. For instance, if we started with arbitrary initial minds, we might find that suffering increasers were about as numerous as suffering reducers, while if we started with initial minds weighted by their prevalence as evolved creatures, we would have mostly suffering reducers in the end because suffering reduction should be a common tribal norm across a broad array of societies.
Another trouble is that CEV as presented seems to require intelligent agents that can reason about ethics, have discussions, hear arguments, and reflect on new experiences. This rules out a large chunk of non-human animals from the process. Are we going to neglect them even though they have emotions and implicit desires just like humans do? I would hope not, but convincing others of this would itself be a challenge.
As with compromise morality, I like the fact that CEV provides an intuition pump in favor of cooperation, which is good for all of us. However, as with compromise morality, I'm not sure I would label it as what I actually want, rather than just as an admirable goal to work toward because it's something people can agree on rather than fighting each other.
It's a common observation that older people rarely change their minds, except maybe on topics that they haven't thought about much. To some extent this ossification of beliefs is rational. For example, a Bayesian approach to the sunrise problem will show smaller and smaller updates for the probability that the sun rises tomorrow as the number of data points increases. Relatedly, some machine-learning algorithms use learning-rate decay over time so that later information causes less of an update than earlier information. In other cases, stubbornness by older people is counterproductive.
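As a concrete illustration of both points (a minimal sketch with our own numbers, using Laplace's rule of succession for the sunrise problem):

```python
# Minimal sketch (our own numbers) of why Bayesian updates on the sunrise
# problem shrink as evidence accumulates -- qualitatively the same effect as a
# decaying learning rate in machine learning.

def laplace_estimate(successes: int, trials: int) -> float:
    """Laplace's rule of succession: P(success next time) = (s + 1) / (n + 2)."""
    return (successes + 1) / (trials + 2)

for n in (10, 100, 1_000, 10_000):
    # How much does one additional sunrise move the estimate after n sunrises?
    shift = laplace_estimate(n + 1, n + 1) - laplace_estimate(n, n)
    print(f"after {n:>6} sunrises, one more shifts the estimate by {shift:.1e}")

# The shift scales like 1/n^2, so each new observation matters less than the
# one before it, much as a 1/t learning-rate schedule discounts later data.
```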
A few people show belief plasticity throughout their lifetimes. One example might be Hilary Putnam: "Putnam himself may be his own most formidable philosophical adversary.[8] His frequent changes of mind have led him to attack his previous positions." But it seems that most people maintain their general stances on issues throughout their careers. (Thanks to Simon Knutsson for this point.)
The same often applies in the field of ethics. My informal experience has been that it's almost impossible to change the moral views of someone over age ~25 (plus or minus a few years), while many younger people are more "easy come, easy go" on moral issues. Once again, there are a few exceptions, like when Peter Singer switched from preference to hedonistic utilitarianism in his 60s(?). My anecdotal experience that people's views become more rigid by their late 20s is consistent with the following passage from SAMA (2013):
Though teens may change clothes, ideas, friends and hobbies with maddening frequency, they are developing ideas about themselves, their world and their place in it that will follow them for the rest of their lives. Adults may spend years trying to create or break even the simplest habit, yet most adults find that their most profound ideas about themselves and the world were developed in high school or college. This is because, by age 25 or so the brain is fully developed and building new neural connections is a much slower process.
In my own case, the core of my moral view (namely, the overwhelming importance of reducing extreme suffering) was set by age 19, and the moral updates I've had since then have been either tweaks around the edges or revisions forced by an ontological crisis, such as the realization that consciousness is not an objective metaphysical property.
It's possible that there's a sort of critical period for morality in which people "imprint" on various moral views. Or maybe the ossification of people's moral opinions with age is just a byproduct of the general ossification of people's minds about all sorts of topics. Either way, the fact of ossification presents a problem for moral idealization procedures, since even if we learn more, many of our views may already be substantially fixed. Meanwhile, if we allow for simulation of alternate possible development trajectories that we might have taken since birth or childhood, then our final conclusions in those simulations are likely to be fairly sensitive to what ideas we're exposed to first versus what ideas we only discover after substantial ossification has already taken place.
If the formation of moral views is more like sexual imprinting than it is like understanding math, then it seems unlikely that humanity would converge on a shared moral vision via moral idealization, in a similar way as it's unlikely that billions of people raised in different environments would all agree on what specific physical features of a potential mate are sexually attractive if they could just learn and discuss more with one another. Of course, there are many features of physical attractiveness that are mostly universal among humans, just like some moral judgments are mostly universal among humans.
Humans hold roughly accurate factual beliefs (at least on concrete, day-to-day matters) because those whose brains led to incorrect views were more likely to fall off cliffs, eat poisonous berries, or fail to acquire mates. As society becomes increasingly complex, those agents that display greater rationality toward the aim of acquiring power tend to survive better. Note that this is not always the same as epistemic rationality; for instance, religions that oppose birth control survive very successfully just due to reproduction rates. But in the long run, I expect that survival will require more and more epistemic rationality due to competition with other smart agents.
In the same way as evolution favors correct factual beliefs in the long run, it presumably also favors certain moral attitudes. As a simple case, ideologies demanding celibacy from all followers (a la the Shakers) are unlikely to survive, and similar fates probably await antinatalist movements. Likewise, moral views that repugn strategic thinking may ultimately lose out to those that embrace it.
By "strategic thinking" I don't mean naive Machiavellian deviousness; that historically has lost out to behaviors more like virtue ethics in human evolution. I don't expect a morality that endorses bank robbing for the greater good to win the future. But I do mean that moralities that fail to adaptively develop intelligent survival strategies risk becoming stale and less competitive. This doesn't imply that the morality itself has to endorse a particular survival-maximizing content. But the morality may need to have a strategic shell to enclose its inner seeds if it is to survive in the long run.
If we wanted to be charitable to adherents of moral realism, we could suggest the requirement that agents of the future behave strategically as one candidate feature of objective morality. Just as evolution constrains survivors to have certain factual beliefs, so too competitive struggles may push agents toward particular behaviors -- at least game-theoretic rationality if nothing else more specific. Beyond that, it's not clear if selection pressure strongly favors particular ideologies ex ante. So I expect this idea does little justice to what disciples of moral realism were hoping for.
Many speculations in this piece are just-so stories, at least until further evidence comes in. In addition, many of these ideas are probably unoriginal, though if other authors have said the same things, I don't know the citations. Most of what I wrote is common cultural knowledge among my friends.
Our brains use associations to make connections between concepts. For instance, if I say "ancient Egypt," you might think of "pyramids," "Nile River," and "hieroglyphs." Likewise, if I had first said "hieroglyphs," it likely would have brought to mind the general concept of ancient Egypt. Our brains have a bidirectionally associative network among these ideas.
We can imagine adding a valence to various concepts, as well as associational weights between them. The weights correspond to how numerous and strong are the neural connections in our brains.
Take a moral issue like abortion. Consider a pro-life advocate who believes that abortion is murder and cares about the views of the Catholic Church. She does recognize that bodily autonomy is somewhat important, but this consideration is outweighed by the murder and anti-Catholic associations that abortion has for her. We can represent this collection of sentiments in a network like the following: [Figure.]
A different person might have the same network but different valences and weights: [Figure.]
The product of valence times connection weight does the work of a logical syllogism: [Figure.]
Because these numbers have continuous values and we can sum up multiple conflicting considerations, the network approach is more powerful and descriptive of how brains actually work than the syllogistic approach. Also note that for the person who did not believe abortion was murder, the syllogism would not go through because his valence * weight evaluated to 0.
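A toy version of this valence-times-weight computation can be written out explicitly. The numbers below are ours and purely illustrative; they are not taken from the post's diagrams:

```python
# Toy sketch (illustrative numbers ours) of valence propagating through an
# associative network: a node's valence is the weighted sum of the valences
# of the concepts it is linked to.
valence = {
    "murder": -0.9,
    "Catholic Church teaching": 0.6,
    "bodily autonomy": 0.4,
}
weights_into_abortion = {
    "murder": 0.8,
    "Catholic Church teaching": 0.5,
    "bodily autonomy": 0.3,
}

def node_valence(weights: dict, valences: dict) -> float:
    """Weighted sum of neighbor valences -- the network analogue of a syllogism."""
    return sum(valences[concept] * w for concept, w in weights.items())

print(round(node_valence(weights_into_abortion, valence), 2))  # -0.3: net negative

# Someone who rejects the abortion-murder link and weights bodily autonomy
# heavily reaches the opposite conclusion from the same network structure.
weights_into_abortion["murder"] = 0.0
weights_into_abortion["bodily autonomy"] = 0.9
print(round(node_valence(weights_into_abortion, valence), 2))  # 0.66: net positive
```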
Because the networks are bidirectional, nodes at the bottom may influence those at the top. For instance, suppose someone naively starts out with this network: [Figure.]
Associative networks can also limn the process of reflective equilibrium, which is basically the resolution of cognitive dissonance. For example, suppose a nature lover starts out with this network: [Figure.]
Now imagine that the conservationist hears counterarguments to her changed valences. For instance, if parasites are good, why is malarial infection of humans bad? If unintentional suffering is okay, why is it wrong when a drunk driver accidentally kills a pedestrian? And so on. Adding these nodes throws the network into disequilibrium once more, such that the old valences are no longer the weighted sums of their inputs: [Figure.]
I'm still a little unclear on how this network model should work. The representation so far is redolent of a non-binary version of Hopfield nets, and the process of making a philosophical argument would correspond to updating a node's weights based on the values of other nodes in its surroundings. However, the network can't be literally bidirectional because, for example, in the abortion diagrams, it's not the case that once the "Abortion" node is updated, the "Bodily autonomy" node should update to exactly its same score just because it's the only node connected to "Abortion." Maybe the fact that in practice all these nodes would have many other connections to other nodes would fix this problem. Or maybe the network weights are different depending on the direction. Someone with more neural-network experience could probably patch up this model.
I don't claim that this network model is a fully accurate depiction of human moral reasoning, and it's maybe not the only type of moral reasoning that humans do. Still, the framework is suggestive and helps conceptualize how we see people thinking and arguing. For example, the noncentral fallacy involves attempting to strengthen connection weights between an atypical instance of a hated concept and the hated concept itself, in order to propagate negative valence from the hated concept to the atypical instance. And choosing words to evoke desired responses is well known to marketers, political strategists, and anyone who has employed oblique language to avoid saying something unpleasant outright.
In passing, we can reflect on what these networks have to say about metaethics. For one, they explain how non-cognitivism still allows ethical statements to obey syllogistic inference rules, despite what critics of emotivism sometimes allege.2 They also suggest a sort of coherentist brand of moral truth, because the network activations can propagate both ways. Of course, we could capture foundationalism instead if we made the networks feedforward, but I think a recurrent network better captures the fact that people seem to have no single axiom set but are typically willing to change any given view when it conflicts too strongly with other views.
The moral valence of a statement is not necessarily the same as its hedonic valence. Indeed, sometimes the two may be opposed, like in the case where you know "I should leave that ice cream for my friend," but your hedonic system says "I want to eat my friend's ice cream." Still, it's possible that moral valence could be represented by a separate system that shares similarities to the hedonic-valence system.
Visceral moral feelings could involve a lot of different emotions. For example
So presumably if we could understand each of those emotions, we could understand their moral instantiations. When people say "abortion is murder," they're aiming to build neural connections from the "abortion" concept to the "murder" concept, so that hearing "abortion" will trigger the anger and horror normally associated with murder. The murder concept itself may just be directly hooked up to whatever systems trigger the emotions of anger and horror. These connections in turn may have been built by cultural teachings combined with experienced unpleasantness when thinking about or watching murders.
A CEV might survey the “space” of initial dynamics and self-consistent final dynamics, looking to see if one alternative obviously stands out as best; extrapolating the opinions humane philosophers might have of that space. But if there are multiple, self-consistent, satisficing endpoints, each of them optimal under their own criterion—okay. Whatever. As long as we end up in a Nice Place to Live.
We can make sense of this reasoning even though the statements are false in the real world.
The post The Eliminativist Approach to Consciousness appeared first on Center on Long-Term Risk.
"[Qualia] have seemed to be very significant properties to some theorists because they have seemed to provide an insurmountable and unavoidable stumbling block to functionalism, or more broadly, to materialism, or more broadly still, to any purely 'third-person' objective viewpoint or approach to the world (Nagel, 1986). Theorists of the contrary persuasion have patiently and ingeniously knocked down all the arguments, and said most of the right things, but they have made a tactical error, I am claiming, of saying in one way or another: 'We theorists can handle those qualia you talk about just fine; we will show that you are just slightly in error about the nature of qualia.' What they ought to have said is: 'What qualia?'"
--Daniel Dennett, "Quining Qualia"
My views on consciousness are sometimes confusing to readers, so I try to explain them in different ways using different language. I myself also try to imagine the situation from different angles. Three main perspectives that I've advanced are
Pete Mandik has a nice video explaining the distinction between reductionism and eliminativism.
That said, I think all three of these approaches are substantively the same, and they differ mainly in the words they use and the imagery they evoke. These differences may have practical consequences insofar as our moral intuitions depend on how we think about consciousness, but what the viewpoints actually say about the world is identical in each case. To make this clear, consider the classic analogy of élan vital. We can pursue any of the following options with it:
A similar situation obtains with respect to consciousness.
In "Consciousness and its Place in Nature", David Chalmers recognizes that functionalist-style reductionism and eliminativism are ultimately the same:
Type-A materialism sometimes takes the form of eliminativism, holding that consciousness does not exist, and that there are no phenomenal truths. It sometimes takes the form of analytic functionalism or logical behaviorism, holding that consciousness exists, where the concept of "consciousness" is defined in wholly functional or behavioral terms (e.g., where to be conscious might be to have certain sorts of access to information, and/or certain sorts of dispositions to make verbal reports). For our purposes, the difference between these two views can be seen as terminological.
Chalmers classifies panpsychism as a Type-F monist view, but I think that functionalist panpsychism is a poetic way of expressing a type-A materialism. That which panpsychism says is fundamental to computation is not a concrete thing that could conceivably not be present but is more a way of describing how the rhythms of physics (necessarily) seem to us.
Daniel Dennett is often charged with denying consciousness. Some critics of his book Consciousness Explained suggest that its title is missing a word, and it should actually be called Consciousness Explained Away. One possible reply to this allegation is that consciousness is being explained, but it's just not what people thought it was. As Dennett says in Fri Tanke (2017) at 32m30s: "I'm not saying that consciousness doesn't exist. I'm just saying it isn't what you think it is." But I suppose another possible response is to say, "Okay, what if I did explain consciousness away? What would follow from that?" I can imagine an Internet meme of Morpheus saying: "What if I told you that we should get rid of the idea of 'consciousness'?"
Maybe "consciousness" is a word with so much metaphysical baggage and philosophical confusion that it would be best to stop using it. Marvin Minsky thinks so and adds:
now that we know that the brain has [...] hundreds of different kinds of machinery linked in various ways that we don't understand, it would be a wonderful coincidence if any of the words of common-sense psychology actually described anything that's clearly separate, [...] like "rational" and "emotional" [as] a typical dumbed-down distinction that people use.
"Consciousness" does actually point to some helpful distinctions even given a reductionist world view -- just as the contrast between rational and emotional thinking does actually have some grounding in psychology. But "consciousness" can point to a lot of distinctions at once depending on what the speaker has in mind. Maybe we should embrace the eliminativist program and replace "consciousness" with more precise alternative words.
To be clear, my version of eliminativism does not say that consciousness doesn't exist. Pace Galen Strawson, it does not "deny the existence of the phenomenon whose existence is more certain than the existence of anything else". Rather, eliminativism says that "consciousness" is not the best concept to use when talking about what minds do. We should replace it with more specific descriptions of how mental operations work and what they accomplish. To again give an analogy with élan vital: It's not that life doesn't have a sort of vitality to it; it does. Rather, there are more useful and specific ways to talk about life's vitality than to invoke the élan vital concept.
Dennett echoes this in "Quining Qualia":
Everything real has properties, and since I don't deny the reality of conscious experience, I grant that conscious experience has properties. I grant moreover that each person's states of consciousness have properties in virtue of which those states have the experiential content that they do. That is to say, whenever someone experiences something as being one way rather than another, this is true in virtue of some property of something happening in them at the time, but these properties are so unlike the properties traditionally imputed to consciousness that it would be grossly misleading to call any of them the long-sought qualia.
Rothman (2017) describes Dennett's view during a debate: "He told Chalmers that there didn’t have to be a hard boundary between third-person explanations and first-person experience—between, as it were, the description of the sugar molecule and the taste of sweetness. Why couldn’t one see oneself as taking two different stances toward a single phenomenon? It was possible, he said, to be 'neutral about the metaphysical status of the data.'"
I don't think that consciousness is an illusion [...]. The question is what's involved in having those experiences and those sensations. And I think it does [...] beg the question to say that it involves having states with qualia.
I came to believe that ‘consciousness’ is not a theory-neutral term. When we say ‘it’s impossible I’m not conscious’, we often just mean ‘something’s going on [hand-wave at visual field]’ or ‘there’s some sort of process that produces [whatever all this is]’. But when we then do detailed work in philosophy of mind, we use the word ‘consciousness’ in ways that embed details from our theories, our folk intuitions, our thought experiments, etc.
As we add more and more theoretical/semantic content to the term ‘consciousness’, we don’t do enough to downgrade our confidence in assertions about ‘consciousness’ or disentangle the different things we have in mind. We don’t fully recognize that our initial cogito-style confidence applies to the (relatively?) theory-neutral version of the term, and not to particular conceptions of ‘consciousness’ involving semantic constraints like ‘it’s something I have maximally strong epistemic access to’ or ‘it’s something that can be inverted without eliminating any functional content’.
Eliminativists remind us that our intuitions about science are not well refined. It's commonly the case that naive notions of physics, biology, and other sciences need to be replaced by more correct, if less intuitive, understandings. Why should it be different in the case of consciousness? People may feel as though they're experts on subjectivity because they are conscious, but they are just one of many conscious minds in the universe. I don't see how this position qualifies one as an expert on consciousness any more than knowing your way around your house qualifies you as an expert on physical space in the galaxy. In any case, the parts of our brains that talk don't even have a clear understanding of much of what goes on in our own heads.
In The Scientific Outlook (1931), Bertrand Russell wrote:
Ordinary language is totally unsuited for expressing what physics really asserts, since the words of everyday life are not sufficiently abstract. Only mathematics and mathematical logic can say as little as the physicist means to say.
Eliezer Yudkowsky remembers that his father
said that physics was math and couldn't even be talked about without math. He talked about how everyone he met tried to invent their own theory of physics and how annoying this was.
In a similar way, I claim we can't understand subjectivity without neuroscience (and physics more generally). And while everyone seems to have a pet theory of consciousness (including me plenty of times), these can't substitute for neuroscience.
Our brains are bad at using intuitions about location and properties of an object to describe quantum superpositions. In a similar way, our language of "consciousness" and "qualia" is not suited to precisely describing what happens in the brain.
The language of "consciousness" and "qualia" corresponds to what Philip Robbins and Anthony I. Jack call the "phenomenal stance". In contrast, the eliminativist position corresponds to what Dennett calls the "physical stance".
In breaking our confusions about consciousness, it's helpful to picture the world purely using the physical stance. Stop thinking about raw feels. Think instead about moving atoms, flowing ions, network connectivity, and information transfer. Imagine the world the way neuroscience describes it -- because, in fact, this is a relatively precise account of the way the world is. If it seems as though everyone should be a zombie, don't worry about that for now.
Compare an insect with a human. Rather than imagining the human as conscious and the insect as not, or even the human as just more conscious than the insect, instead picture the two as you would a professional race car versus a child's toy car: as two machines of different sizes, complexities, and abilities that nonetheless share some common features and functionality.
Compare your brain with another part of your nervous system -- say the peripheral nerves in your hand. Why is your brain considered "conscious" and your hand not? It's because only your brain is capable of generating explicit, high-level, and verbalizable thoughts like remarking on its own consciousness. Your hand is also doing neural operations that resemble neural operations in your brain. It's just that the hand's operations don't get reported via memories and speech unless they "become famous" within your brain, where they can be thought about and verbalized.
The eliminativist approach encourages us to stop thinking about neural operations as "unconscious" or "conscious". Instead, in humans, think about the pathway along which neural information travels in order to reach your high-level thinking, speech, action, and memory centers. If the information fails to get there, we call it "subliminal" or "unconscious". If it does get there, we call it "conscious" because of the more pronounced effects it can have on other parts of the brain and body. Thus, for humans, we could replace the loaded "conscious" word with something like "globally available".
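To make the "globally available" framing concrete, here is a minimal toy sketch in Python (my own illustration; the module names and the salience threshold are invented, not taken from any neuroscientific model):

# Toy sketch: replacing "conscious vs. unconscious" with
# "globally available vs. locally processed".

class ToyBrain:
    def __init__(self):
        # Hypothetical high-level consumer modules.
        self.global_consumers = ["speech", "working_memory", "action_planning"]
        self.broadcast_log = []

    def process_signal(self, signal, salience):
        # Low-level processing happens either way.
        processed = f"processed({signal})"
        if salience > 0.5:
            # Broadcast to speech, memory, planning, etc. -- the signal can now
            # be remembered, reported, and acted upon.
            for consumer in self.global_consumers:
                self.broadcast_log.append(f"{consumer} received {processed}")
            return "globally available"   # what folk psychology calls "conscious"
        # Handled locally; never reaches the reporting systems.
        return "locally processed"        # what folk psychology calls "unconscious"

brain = ToyBrain()
print(brain.process_signal("pinprick on hand", salience=0.9))            # globally available
print(brain.process_signal("slight pressure from sock", salience=0.1))   # locally processed

On this picture, "conscious" and "unconscious" just label which route a signal happened to take, not the presence or absence of an extra metaphysical ingredient.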
There are many more details behind what the brain does that we ordinarily think of as consciousness. The best way to get an intuitive sense for them is to learn more neuroscience, perhaps from a popular book. While I became convinced that non-reductive accounts of consciousness could not be right based mainly on philosophical arguments, it was after reading neuroscience that I actually internalized the eliminativist world view. The gestalt shift toward eliminativism requires time and reading to sink in.
Picturing systems physically gives us fresh eyes when deciding what we value. When we adopt the common-sense phenomenal stance, we see a world in which discrete minds move about in otherwise unconscious matter. When we adopt the physical stance, we see various kinds of matter interacting with one another. Some of those matter types (e.g., animals and computers) are more dynamic and sophisticated than other types, but there's a fundamental continuity to the picture among all parts of the system. And we can see that while the system can be sliced in various ways to aid in description and conceptualization, it is ultimately a unified whole.
Ethics in this world view involves valuing or disvaluing various operations within the symphony of physics to different degrees. Some philosophers assign value based on the beauty, complexity, or interestingness of the physics that they see. Those who value conscious welfare instead aim to attribute degrees of sentience to different parts of physics and then value them based on the apparent degree of happiness or suffering of those sentient minds. Because it's mistaken to see consciousness as a concrete thing, sentience-based valuation, like the other valuation approaches, involves a projection in the mind of the person doing the valuing. But this shouldn't be so troubling, because metaethical anti-realists already knew that ethics as a whole was a projection by the moral agent. The eliminativist position just adds that the thing being (dis)valued, consciousness, is itself something of a fiction of the moral agent's invention.
Actually, calling "consciousness" a fiction is too strong. As noted above, "consciousness" refers to real distinctions -- e.g., in the case of human-like brains, we may consider global access to and ability to report on information as important components of consciousness. I just mean "fiction" in the same sense as nations, genders, or tables are fictions; they're constructions of the human mind that help conceptually organize physical phenomena.
I should note that making sentience evaluations based on knowledge of physical processes doesn't mean making superficial evaluations. A humanoid doll that blinks might look more conscious than a fruit fly, but the 100,000 neurons of the fruit fly encode a vastly more complex and intelligent set of cognitive possibilities than what the doll displays. Judging by objective criteria given sufficient knowledge of the underlying systems is less prone to bias than phenomenal-stance attributions.
Moreover, there's a sense in which nothing ethically important would be left out if we eschewed the idea of "consciousness" and only thought in terms of physical processes. In principle, it would still be straightforward to draw ethical distinctions between so-called "conscious" and so-called "unconscious" human minds, because the brain-activity patterns of the two are clearly distinct. We could still hear what people had to say about the intensity of their emotional feelings and use those reports to make judgments. We could watch their brains and see the neural correlates of those reports. We could develop intuitions for what sorts of physical processes lead to attestations of pleasure and pain, and then we could generalize those kinds of algorithms so as to see them in other places. If we in principle had access to all the operations of a mind, there would be no thought or feeling that would go unnoticed. This approach would actually be more powerful at locating sentience even in ourselves than our subjective feelings are, since the parts of our brains that develop explicit thoughts and decide high-level actions don't have access to most of the neural operations taking place at the lower levels of the brain or other parts of the body, just like they don't have access to the minds of other people or animals. Knowledge of the physical operations taking place in our minds and other minds makes it possible to value processes of which we would have previously been unaware and which may have previously "suffered" in silence.
Sneddon et al. (2014) offer a very helpful table for assessing pain in different animal taxa (Table 2, p. 204).
The authors say (p. 209): "Our summary of the evidence supports the conclusion that many animals can experience pain-like states by fulfilling our definition of pain in animals, although we accept that 100% certainty cannot be established for any animal species." In other words, Sneddon et al. (2014) speak as though "pain experience" is a thing that may or may not be present, and all we can do is make informed guesses about its existence.
To an eliminativist, this is wrongheaded. Rather, "pain experience" is a label we give to the suite of behavioral and functional responses that organisms have to aversive stimuli. There's no binary answer to whether insects or mammals "feel pain". The answer should instead be: Look at Table 2 of Sneddon et al. (2014) to see what abilities a given animal type does and doesn't have, and explore further to find out what sorts of other behaviors and internal cognitive processing occur in these animals. These findings are not mere indicators of pain; they are (parts of) the cluster of things we mean by "pain". Stop answering "yes" or "no" to the question "Do insects feel pain?" and start describing the details of what we know about insect nervous systems and behaviors.
I like this quote from Daniel Dennett, in Rothman (2017): "I think we should just get used to the fact that the human concepts we apply so comfortably in our everyday lives apply only sort of to animals. [...] If you think there’s a fixed meaning of the word ‘consciousness,’ and we’re searching for that, then you’re already making a mistake." And Dennett (1995):
The very idea of there being a dividing line between those features "it is like something to be" and those that are mere "automata" begins to look like an artifact of our traditional presumptions. I have offered (Dennett, 1991) a variety of reasons for concluding that in the case of adult human consciousness there is no principled way of distinguishing when or if the mythic light bulb of consciousness is turned on (and shone on this or that item). Consciousness, I claim, even in the case we understand best -- our own -- is not an all-or-nothing, on-or-off phenomenon. If this is right, then consciousness is not the sort of phenomenon it is assumed to be by most of the participants in the debates over animal consciousness. Wondering whether it is "probable" that all mammals have it thus begins to look like wondering whether or not any birds are wise or reptiles have gumption: a case of overworking a term from folk psychology that has lost its utility along with its hard edges. [...]
The phenomenon of pain is neither homogeneous across species nor simple.
Sneddon et al. (2014), p. 208: "Insects may show behaviour that suggests an affective or motivational component (e.g. it has complex and long-lasting effects), but insects could do this, at least in some cases, by using mechanisms that require only nociception (e.g. long-term nociceptive sensitization) and/or advanced sensory processing (i.e. without any internal mental states)." Again, this quote seems to make a binary distinction between sensory processing and internal mental states. But "advanced sensory processing" is one part of internal mental states. Mental states are a collection of representations that a cognitive system combines, and it appears that insect nervous systems do combine sensory representations in at least rudimentary ways, even though their mental states lack the sophistication of those in mammals.
Following is a possible steelman of Sneddon et al. (2014), one that I largely agree with: Much of the moral importance of suffering comes from deep, internal functional processing within an animal brain, and this processing is difficult to inspect directly (although it could be completely understood in principle with sufficient time and computing power). Therefore, we use more externally measurable behaviors like those in Table 2 to guess at the kinds and sophistication of internal processing that goes on. While externally visible behaviors are indeed a part of what pain is, they aren't the most important part of what pain is, which is why it can still make sense to speak in terms of uncertainty about whether animals feel pain, even if we already know lots of behavioral facts about them.
Try adopting the physical stance as you go about your day. When you have a particular feeling or become aware of a particular object, think about what kinds of neural operations are occurring in your head as that happens. Contemplate the brain processes that underlie the behaviors of those around you. See yourself as a chunk of physics moving around within a bigger world of physics.
My experience with this exercise is that it soon becomes less weird to adopt a physical stance. It feels more intuitive that, yes, I am an active, intelligent collection of cells whose sharing and processing of signals constitutes the inner life of my mind and allows for a vast repertoire of behaviors. Worries that I should be a zombie vanish, because I can feel what it's like to be physics for what it is. In fact, being physics feels just like it always did when I thought consciousness was somehow special.
In this mood, questions like, "Why do these physical operations feel like something?" appear less forceful, because I'm already "at one" with the universe. Yes, the neurons in my brain are doing particular kinds of processing that other clumps of atoms in the world are not, and this explains why these thoughts show up in my head and not in the floor or a beetle outside. But does it matter that I'm having these particular thoughts? Can't other "thoughts" by other parts of physics matter too for being what they are? Isn't it chauvinist to privilege just cognitive operations that are sufficiently complex and of a particular type? Or is extending sympathy to even simple physics a stance that's based more on theoretical elegance and spirituality, when in fact only sufficiently self-reflective minds have "anyone home" who can meaningfully care about his/her own subjective experiences?
These questions are important to debate, but we see that they take place within the eliminativist realm. We can import some phenomenal-stance intuitions when thinking about what parts of physics we want to regard as "suffering", but we don't trip ourselves up over trying to pigeonhole suffering as being something other than an attribution we make to whirlpools within the ocean of physics.
Eliminativism is not universally shared, particularly among philosophers. (It may be more common among neuroscientists and artificial-intelligence researchers?) Sometimes people encourage me to discuss practical questions of sentience with less dependence on my particular philosophical view of consciousness. For example, even if I thought consciousness were a privileged thing, I could still argue that basic physics has some small chance of being conscious. This would yield similar practical conclusions as the eliminativist view does.
This is a fair point, but I think attacking the core confusion about consciousness itself is quite important, for the same reason that it's important to break down the confusions behind theism even if you can argue for a lot of the same practical conclusions whether or not theism is true. Viewing consciousness as a definite and special part of the universe is a systematic defect in one's world view, and removing it does have practical consequences. Looking at the universe from a more physical stance has helped me see that even alien artificial intelligences are likely to matter morally, that plants and bacteria have some ethical significance, and that even elementary physical operations might have nonzero (dis)value. In general, Copernican revolutions change our ethical intuitions in possibly profound ways.
A legitimate concern about eliminativism is that it could reduce the intuitive importance or meaningfulness of altruism. If everything is just particles moving in different ways, why should I care? Anthony Jack and colleagues have found evidence that "there is a physiological constraint on our ability to simultaneously engage two distinct cognitive modes", namely, social and physical reasoning.
But if eliminativism does reduce compassion, it may be because the eliminativist position has not been completely understood. If you still think of consciousness as being something special, then eliminativism sounds like a view that the world doesn't contain that special thing, so nothing in the world matters. But what eliminativism really says is that all the specialness you thought was in the world is still there and in fact may be more universal than you realized.
Consider a fish suffocating on the deck of a fishing boat. It flops back and forth, apparently in agony. A conventional approach is to say that if this fish is conscious, then it must be aware of its terrible suffering, which is bad and should be avoided. An eliminativist can instead describe, in physical terms, the changes taking place in the fish's brain and body as it suffocates.
When I contemplate those brain changes, they look really bad, and I feel almost as much empathy as when I think about the fish from the common-sense standpoint of having an agonizing subjective experience. If we care about suffering, then we care about what suffering actually is even on closer inspection.
Steven Weinberg said "With or without religion, good people can behave well and bad people can do evil; but for good people to do evil--that takes religion." In a similar way, it's plausible that for good altruists to ignore suffering requires confusion about consciousness. If you hold Descartes's view that animals are non-sentient machines, you can really delude yourself into thinking that the struggling of a dog when vivisected is not conscious and hence doesn't matter. An eliminativist realizes that lots of aversive processing is really going on in the dog's head, and so the vivisection must be at least somewhat bad, depending on the negative weight given to those aversive processes. The eliminativist position is thus more cautious in some sense.
That said, I'm describing here mainly my experience with eliminativism, as someone who was already heavily committed to reducing suffering. It remains an empirical question what kinds of effects eliminativism has on average for various populations and depending on the degree to which it's internalized. In any case, because I think something like eliminativism about consciousness will become more widely accepted in the future as understanding of neuroscience increases, we may want to figure out how to frame eliminativism in a more altruism-friendly way rather than just sweeping it under the rug.
The physical stance is more impartial and accurate than the phenomenal stance in accounting for all the mind-like processes that exist in the world. However, the physical stance is also more dispassionate. While the brain of a person being tortured does look physically very distinctive -- with lots of activity and long-lasting neural "scars" being created -- appreciating its true awfulness requires imagining ourselves in its position. Without subjective imagination, a physical-stance approach is liable to give way to aesthetic judgments -- valuing more brains that appear more interesting, sophisticated, nuanced, or dynamic. Looking for beauty and novelty is a natural temptation when we view physical objects, but it has little to do with ethics. There's a danger that eliminativism gives too much sway to non-empathic judgment criteria.
I think we should try out the eliminativist view as an exercise, to bend our prior prejudices and intuitions. When we unshackle ourselves from the conventional concept of consciousness, how many other ways might there be to reimagine the world! That said, eliminativism doesn't have to be and arguably should not be the only way we think about consciousness, just as our slow, utilitarian moral system needn't be the only way we think about ethics. Rather, we can blend the insights of eliminativism with those of a more common-sense, phenomenal stance -- with the aim of achieving a reflective equilibrium that incorporates insights from each.
Eliminativism and panpsychism may seem like polar opposites, but they're actually two sides of the same coin, in a similar way as 0 degrees and 360 degrees on a unit circle point in the same direction. Both maintain that there's nothing distinctive about consciousness that sharply distinguishes it from the rest of the universe. Panpsychism recognizes that all the computations of physics have a fundamental similarity to them, and it considers different computations as different shades of the same basic thing (though the shades may differ quite a bit). Eliminativism rejects talk about "consciousness" in favor of physical descriptions, and once again we can see a fundamental continuity among the diverse flavors of physical processes. Whether it uses the word "consciousness" or not, each perspective points at the same underlying reality.
That said, it's worth noting that even if we recognize all of physics as fundamentally mental in some sense, it remains a matter of choice how much we care about simple physical operations. We might legitimately decide that only really complex systems like those that emerge in animal brains contain moral significance.
Rob Bensinger, an eliminativist, writes:
believing I’m a zombie in practice just means I value something functionally very similar to consciousness, ‘z-consciousness’. [...]
Since (z-)consciousness isn’t a particularly unique kind of information-processing, I expect there to be an enormous number of ‘alien’ analogs of consciousness, things that are comparable to ‘first-person experience’ but don’t technically qualify as ‘conscious’. [...] [Due to the implications of eliminativism,] I’m much more skeptical that (z-)consciousness is a normatively unique kind of information-processing. Since I think a completed neuroscience will overturn our model of mind fairly radically, and since humans have strong intuitions in favor of egalitarianism and symmetry, it wouldn’t surprise me if certain ‘unconscious’ states acquired the same moral status as ‘conscious’ ones.
I don't think Bensinger is endorsing all-out panpsychism here (and indeed, Bensinger disavows panpsychism elsewhere), but the spirit of his comments is similar to mine.
Churchland (1996) makes what I interpret as an argument against the ontological fundamentalness of qualia by appealing to their fuzzy boundaries (p. 404):
Although it is easy enough to agree about the presence of qualia in certain prototypical cases, such as the pain felt after a brick has fallen on a bare foot, or the blueness of the sky on a sunny summer afternoon, things are less clear-cut once we move beyond the favoured prototypes. Some of our perceptual capacities are rather subtle, as, for example, positional sense is often claimed to be. Some philosophers, e.g. Elizabeth Anscombe, have actually opined that we can know the position of our limbs without any 'limb-position' qualia. [...]
Vestibular system qualia are yet another non-prototypical case. Is there something 'vestibular-y' it feels like to have my head moving? To know which way is up? Whatever the answer here, at least the answer is not glaringly obvious. [...]
My suspicion with respect to The Hard Problem strategy is that it seems to take the class of conscious experiences to be much better defined than it is. The point is, if you are careful to restrict your focus to the prototypical cases, you can easily be hornswoggled into assuming the class is well-defined.
Perhaps qualiaphiles could reply that qualia are in fact crisp entities, but we just don't always perceive or describe their boundaries correctly. Maybe our language is not up to the task. Moreover, the contents of qualia may differ from person to person.
In the above piece, I tried to insist that eliminativism doesn't deny consciousness per se, only the particular conception of consciousness that some philosophers cling to. As of 2015, I'm leaning more toward Minsky's view that it might be most clear to dispense with the "consciousness" word altogether, since it causes so much confusion. This post expresses a similar idea: "There’s something to be said for the bracing elegance of the two-word formulation of scepticism offered by Dennett [...] – ‘What qualia?’"
Through many conversations about consciousness, I've concluded that eliminativism may be the clearest way to explain type-A physicalism, because the allure of dualism (even when dressed up as physicalist monism) is so irresistible to human minds. In other words, eliminativism helps shock us out of our complacency and actually come to terms with what a truly non-dualist view of consciousness requires. While (type-A) reductionism on consciousness is also a reasonable viewpoint in principle, in practice some people have trouble being mere reductionists without falling into the trap of property dualism. Put differently, eliminativism is like training wheels: it's useful until you're ready to wield the idea of "type-A reductionism regarding consciousness" correctly, without falling over and hurting yourself.
The mantra of the more radical version of eliminativism is that we're not conscious but only think we are. How is that possible? "I just know I'm conscious!" But any thoughts you have about your being conscious are fallible. I believe there are bugs in the vast network of computation that produces thoughts like "I'm conscious in a way that generates a hard problem of consciousness." No thought you have is guaranteed to be free from bugs, and it seems more likely -- given the basically useless additional complexity of postulating a metaphysically privileged thing called consciousness -- to suppose that our attribution of metaphysically privileged consciousness to ourselves is a bug in our cognitive architectures. This is a relatively simple way to escape the whole consciousness conundrum. If it feels weird, that's because the bug in your neural wiring is causing you to reject the idea. Your thoughts exist within the system and can't get outside of it.
Your brain is like a cult leader, and you are its follower. If your brain tells you it's conscious, you believe it. If your brain says there's a special "what-it's-like-ness" to experience beyond mechanical processes, you believe it. You take your cult leader's claims at face value because you can't get outside the cult and see things from any other perspective. Any judgments you make are always subject to revision by the cult leader before being broadcast. (Similar analogies help explain the feeling of time's flow, the feeling of free will, etc.)
Carruthers and Schier (2017) summarize Dennett's view as follows: "there is no reason to think that appearing to the subject cannot be understood in functional terms. In brief, appearing seems to involve the subject gaining access to the thing that is appearing and there is no reason that we cannot give a functional analysis of access."
I like how Michael Graziano explains it:
I believe a major change in our perspective on consciousness may be necessary, a shift from a credulous and egocentric viewpoint to a skeptical and slightly disconcerting one: namely, that we don’t actually have inner feelings in the way most of us think we do. [...]
a new perspective on consciousness has emerged in the work of philosophers like Patricia S. Churchland and Daniel C. Dennett. Here’s my way of putting it:
How does the brain go beyond processing information to become subjectively aware of information? The answer is: It doesn’t. The brain has arrived at a conclusion that is not correct. [...]
You might object that this is a paradox. If awareness is an erroneous impression, isn’t it still an impression? And isn’t an impression a form of awareness?
But the argument here is that there is no subjective impression; there is only information in a data-processing device. When we look at a red apple, the brain computes information about color. It also computes information about the self and about a (physically incoherent) property of subjective experience. The brain’s cognitive machinery accesses that interlinked information and derives several conclusions: There is a self, a me; there is a red thing nearby; there is such a thing as subjective experience; and I have an experience of that red thing. Cognition is captive to those internal models. Such a brain would inescapably conclude it has subjective experience.
So there, I said it: Consciousness doesn't exist. Now let's figure out more precisely what we are pointing at when we seek to reduce conscious suffering.
Often I hear claims that "I'm more certain that I'm conscious than I am about anything else." I disagree. Our perception of being conscious, just like our perception of anything else, is a hypothesis that our brain constructs, based on very complicated processing and lower-level thinking, expressed in terms of a simplified ontology that the brain can make sense of. Anything that you know is the result of complex computation by an information-processing device. But then why privilege some types of visceral, intuitive judgments that your brain makes over other judgments your brain makes? All of your knowledge is constructed by the brain's information processing in one way or another. "Knowing that I'm conscious" is not a thought that somehow transcends ordinary brain machinery, nor does it deserve to be made axiomatic in one's ontology.
Focus your attention on the visual image you see of the world in front of you: a rich jumble of colors, shapes, textures, and patterns. How can those not be the philosopher's qualia?
Naively, it looks like eliminativism can't explain these data. But exactly how we characterize the data makes a difference to our theoretical interpretation. If we declare the visual imagery in front of us as something metaphysically special -- "mental phenomena" -- then eliminativism cannot account for them. But to suppose that the visual scenes we see are phenomena in their own metaphysical category is to beg the question.
An alternate characterization of our visual experiences is that they represent "(data about the external world) + (an explicit or implicit judgment by our brains that we're seeing a rich collection of colors, shapes, etc.)". All we know is the judgments our brains make. If our brains judge that we're seeing rich colors and shapes, then we'll think we are seeing such things. The eliminativist hypothesis, then, just predicts that our brains make these judgments about seeing things-it-calls-qualia when it attends to its processing of visual input.
These judgments needn't be verbal but are often just more basic moments of noticing how something seems. For instance, when I'm going about my day, I typically don't even notice the colors and shapes around me, but if I focus my attention on how they look, I undergo a nonverbal process of feeling like "Wow, there's something it looks like to see what's in front of me!" This feeling is an implicit "judgment" that my brain makes, and it's all that's needed to explain the fact that we feel like we have qualia.
That I typically don't notice "qualia" unless I attend to them bolsters this view. Most of the time, my brain is focused on my own internal thoughts and basically ignores the world it sees. In this case, my brain is just processing visual data "unconsciously". Then, when I focus on some visual input in particular, my brain produces an implicit (and sometimes explicit) judgment that "this thing has distinctive color and texture and shape, and it feels like something to see it". In other words, so-called "conscious experiences" are experiences that our brains judge to be conscious.
While we don't yet have a fully developed theory of humor, laughing may be a reaction that we have to particular kinds of unexpected juxtapositions. In a similar way, it's possible that our brains also have a sort of "specialness" classifier, which fires when we process certain inputs or have certain thoughts that don't seem to be fully explainable in physicalist terms. For some people, this classifier may lead to belief in God or spirits or magic. And for almost everyone (myself included) this hypothetical classifier may lead us to believe there's a "feeling of what it's like" to be conscious and that, e.g., the visual data in our brains is somehow "a unified visual scene of textures, colors, and objects" rather than activations of neural patterns corresponding to information available for us to use, combined with our flailing minds trying to make sense of that information with whatever simple metaphors they can.
Qualia are user illusions. When you click a folder icon on your computer desktop, there's not actually a little folder there; your computer just tells you that there's a folder there, when in fact it represents more complex processing in your computer's hard drive. Likewise, when you feel pain, there's not actually an ontological what-it's-like-ness experience; it's just that your brain tells you it's having a qualitative experience of pain, when in fact what's happening under the hood is complex brain processing. The user illusion of folders on a computer is like our folk-psychological "phenomenal concepts", while the underlying computation and data are like our "physical concepts". The hard problem of consciousness results from the difficulty we have of seeing how the ontology of mental "folder icons" can match up with the ontology of mental "code and data".
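As a loose illustration of the folder-icon analogy, consider the following sketch (the file names, addresses, and bytes are all invented):

# Purely illustrative: the interface layer reports a tidy folder of files,
# while the underlying layer is just scattered blocks of bytes.

underlying_storage = {
    # Invented example data: block addresses mapped to raw bytes.
    0x1A2B: b"\x89PNG...",
    0x3C4D: b"%PDF-1.7...",
    0x5E6F: b"PK\x03\x04...",
}

def desktop_view():
    # The "folder icon" level: what the user is told exists.
    return {"Vacation Photos": ["beach.png", "itinerary.pdf", "tickets.zip"]}

def physical_view():
    # The "code and data" level: what is actually there.
    return underlying_storage

print(desktop_view())   # a folder with files -- the user illusion
print(physical_view())  # addresses and bytes -- the underlying reality

The "folder" is not false exactly -- it organizes the underlying data usefully -- but it would be a mistake to go looking for a little manila folder among the bytes.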
That consciousness is a user illusion doesn't mean that only experiences judged to be conscious matter ethically. The judgment process only adds a small level of reflection on what were already complex, substantive computations. The pre-eliminativist view that only "conscious" emotions matter was based on a confused idea that only "conscious" emotions are "real qualia" in a metaphysical sense. Once we cast off this way of thinking, it becomes less plausible that only experiences judged to be conscious have moral weight.
We can describe the same reality at different levels of abstraction, i.e., using different ontological frameworks. As an example of a different ontology, imagine a simple computational agent that moves about in a "grid world" consisting of 16 squares -- 16 possible "states" that it can be in. This simple agent knows nothing of the Earth, particle physics, or even humans. It just "knows" (in some very simplistic way) its own little world of state transitions and whatever dynamics drive its behavior. Likewise, we humans are acquainted with a simplified ontology of our own brains -- that we have things called "conscious experiences" and that we transition through these experiences. If we were cave people, we would know nothing of neurons, the cerebral cortex, or gamma oscillations (except for whatever we found in the skulls of other animals that we killed and ate). We would just have a simplistic, subjective model of our mental lives, which is what people refer to when they talk about knowing that they're conscious. This picture isn't mutually exclusive with a more detailed, physics-based portrait.
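Here is a minimal sketch of such a 16-state grid world (the layout and movement rule are arbitrary choices of mine):

# A simple agent whose entire "ontology" is 16 squares and the moves between
# them. It "knows" nothing about the computer running it, just as our folk
# ontology of conscious experiences knows nothing about neurons.

import random

def neighbors(state):
    # States 0..15 laid out as a 4x4 grid; return the reachable squares.
    row, col = divmod(state, 4)
    moves = []
    if row > 0: moves.append(state - 4)
    if row < 3: moves.append(state + 4)
    if col > 0: moves.append(state - 1)
    if col < 3: moves.append(state + 1)
    return moves

state = 0
for step in range(5):
    state = random.choice(neighbors(state))
    print(f"step {step}: agent is in square {state}")

The agent's state-transition ontology and the silicon-level description of the machine running it are both valid; they're just different levels of abstraction.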
Perhaps the easiest way to understand eliminativism, and the way I began to see the light in 2009, is as follows. First, let's describe the mainstream, intuitive view of consciousness among most neuroscientists and other scientifically minded people.
The view is essentially a pipeline: unconscious neural inputs take place, and then, with enough integration/processing of the right type, consciousness is created, which in turn produces the belief that you have qualia. The exact contents of the middle stage aren't essential; you can plug in whatever neuroscientific theory of consciousness you want there. And because the people holding this view are materialists, they insist that consciousness just is brain activity of the right type, even though consciousness is also a definite phenomenal thing. I think the best way to construe these claims into a coherent philosophy is to regard them as property dualism, which is why I say that most neuroscientists are closet property dualists. In any case, the relevant endpoint in this process is the final stage—the belief that you have qualia—because that's what causes you to insist to yourself and to others that qualia exist. In other words, if I ask, "How do you know you have qualia?", your reply can be: "My brain perceives this fact via the cognitive operations just described and then correctly concludes that it has qualia." For example, if you look at a complex visual scene and note all its colors, textures, etc., the neural correlates of this "experience" lead to brain-state changes corresponding to an accurate belief that you are experiencing such qualia.
The eliminativist alternative is simple: just drop the intermediate "consciousness is created" stage and keep everything else the same.
This results in exactly the same beliefs as if you do have qualia (whatever that's supposed to mean), and those beliefs are all we need to explain. The hypothesis that something "special" happens at some particular stage of neural processing or at some particular level of brain complexity is useless (although there is still a non-dualist sense in which very complex brains are qualitatively different from very simple ones, in a similar way as a tree is qualitatively different from a blade of grass). If you still want to declare that certain types of neural processing are consciousness, that's fine, but then you're not making a metaphysical claim, just a definitional one. Perhaps you're just making a generalization about the sorts of brain processes that tend to precede beliefs (of a certain type) that one is conscious, which I have no quarrel with. (Of course, there may be other, simpler kinds of beliefs in one's own consciousness that have other, simpler neural precursors. And it's unclear to what extent consciousness in general, rather than self-consciousness specifically, should be said to require belief in one's consciousness anyway.)
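The contrast can be caricatured in code. The following is my own rendering, with invented stage names; it assumes, as the argument above does, that the posited "special" consciousness does no extra causal work in producing the belief and the report:

# Both pipelines end in exactly the same belief state.

def integrate(sensory_input):
    # Stand-in for whatever neural processing your favorite theory requires.
    return f"integrated({sensory_input})"

def form_belief(representation):
    # The downstream machinery that generates insistence about qualia.
    return f"belief: I have qualia about {representation}"

def mainstream_view(sensory_input):
    representation = integrate(sensory_input)
    special_qualia = "*phenomenal consciousness*"  # the extra posited ingredient;
                                                   # note that nothing below uses it
    return form_belief(representation)

def eliminativist_view(sensory_input):
    representation = integrate(sensory_input)      # the same processing
    return form_belief(representation)             # the same belief

print(mainstream_view("red apple") == eliminativist_view("red apple"))  # True

Of course, a defender of the mainstream view would deny that the extra ingredient is causally idle; the sketch only shows what the eliminativist claims we can do without.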
Chalmers (1996), pp. 177-78, 180-81:
When I comment on some particularly intense purple qualia that I am experiencing, that is a behavioral act. Like all behavioral acts, these are in principle explainable in terms of the internal causal organization of my cognitive system. There is some story about firing patterns in neurons that will explain why these acts occurred; at a higher level, there is probably a story about cognitive representations and their high-level relations that will do the relevant explanatory work. [...]
In giving this explanation of my claims in physical or functional terms, we will never have to invoke the existence of conscious experience itself. [...]
[My zombie twin] remains utterly confident that consciousness exists and cannot be reductively explained. But all this, for him, is a monumental delusion. There is no consciousness in his universe—in his world, the eliminativists have been right all along.
Given that philosophical zombies sincerely believe themselves to be conscious, even though they lack the philosopher's qualia, how do you know that you're not a zombie? The beliefs physically manifested in your neural configuration are exactly the same as the zombie's. If you think zombies are possible, you should worry that you might actually be a zombie, since you have no way of knowing that you're not one. (By "you" in the previous sentence I mean your physical self, which is the action-guiding conception of self, given that non-interacting non-material selves/experiences/phenomenal properties don't take actions that can affect the physical world.) And if you think zombies aren't possible, then you aren't a property dualist.
A property dualist might reply to the previous paragraph as follows: "The conscious properties that supervene on my physical existence consciously know themselves to be conscious. Their consciousness self-verifies the fortuitous belief by my physical brain that I'm conscious." In reply, I would point out that even if true, this doesn't help our physical brains and bodies when making choices, since our physical brains and bodies can never know whether consciousness is supervening on them. It seems they would always have to act as if uncertain about whether phenomenal consciousness actually exists in their world. (That said, if you're a sentiocentrist property dualist who only cares about phenomenal sentience, then you may as well act as if you and others are actually conscious, since your actions only matter if that's the case.)
One might say: "But how can it be a matter of opinion whether I'm conscious? It's obvious to me that I am!" Indeed you do have beliefs that you're conscious, which you express in various ways. Combined with knowledge about what sorts of processing your brain does under the hood, that's a good reason to apply the label of "conscious" to you. But having a particular belief state does not, by itself, imply anything metaphysical. It's obvious to other people that God exists and talks to them through prayer, or that they've been abducted by aliens.
It's often said that consciousness can't be an illusion because having an illusion already implies consciousness. This is not correct. All that an illusion requires is false belief (or a false cognitive representation of some sort, probably not a propositional belief), and most philosophers agree that physical brain states of belief (or representation in general) don't require consciousness. An unconscious system can have the false belief that it's conscious without contradiction. You are such a system (at least if we interpret "consciousness" in the robust Chalmers kind of way).
The same idea of "explain why you believe X" rather than "explain why X exists" works for other puzzles of consciousness, such as why our conscious experiences seem unified. Suppose you develop a theory in which everything that we perceive (every color, shape, smell, etc.) "comes together" at a single point. How does this help explain why we perceive a unified conscious field? Unless there's a homunculus watching a Cartesian theater, then there's no need to actually bring all sense data together at once. All that's necessary is that we believe the data all come together at once. Take whatever special explanation for the unity of consciousness you proposed, and trace the steps of how it leads our neurons into configurations corresponding to belief in the unity of our consciousness. Now strip out whatever grand, sweeping assumptions you used to get there, and instead directly update the neurons via a more mundane pathway into a configuration corresponding to belief that conscious perception is unified. The idea is similar to illusions of other sorts, such as the illusion that the characters in flip books are actually moving. It's not necessary for the characters to actually move; all that's needed is that we believe they do. The same goes for the other tricks that our brains play on us regarding consciousness.
Consciousness is sort of like the book Harold and the Purple Crayon, in which "The protagonist, Harold, is a curious four-year-old boy who, with his purple crayon, has the power to create a world of his own simply by drawing it." If the brain creates a representation that "there's a coherent visual field in front of me with a variety of colored objects", then other brain processes that receive this information (including speech, memory, and motor control) will believe this story that they're told and act as if it's true. (And normally, those representations are true, though sometimes they aren't, such as in cases of optical illusion or dreams.) In other words, merely "drawing" a phenomenal experience in the language of neural representations makes it "come to life" in the sense that the rest of the brain responds appropriately. There doesn't need to actually be a coherent visual field living in some realm of phenomenal experience. All that's needed is for the brain to represent to itself, believe, and act as if it has such phenomenal experience.
Dennett (2016), p. 72:
We illusionists advise would-be consciousness theorists not to be so confident that they couldn’t be caused to have the beliefs they find arising in them by mere neural representations lacking all ‘phenomenal’ properties.
Following is a stylized example to illustrate what I mean when talking about the brain representing things to itself. While animal brains represent data differently than digital computers do, imagine hypothetically that the brain consists of connected subsystems that communicate with each other using JSON-formatted strings of text. Suppose that a Retina subsystem receives raw visual input, which it transmits to the Visual Cortex subsystem as a base64-encoded string like TWFuIGlzIGRpc3Rpbmd... The Visual Cortex subsystem processes that string and emits the following JSON string:
{
  "image_summary": "dog in a park",
  "salient_objects_in_image": {
    "left_side": "maple tree",
    "center": "dog",
    "right_side": "park bench"
  },
  "average_brightness": "high",
  "objects_moving": true,
  "image_has_multiple_colors": true,
  "image_appears_unified": true,
  "whole_image_is_visible_at_once": true,
  "my_emotional_response": "happy",
  ...
}
This message is broadcast to Working Memory, Non-verbal Thought, Speech, Action, Emotional Valence, and other subsystems in the brain, which receive the data and act accordingly. For example, when the Non-verbal Thought subsystem thinks about the qualitative nature of the visual field, it notes that "image_appears_unified" is true, so the Non-verbal Thought subsystem concludes that the visual field is indeed unified. (This person might go on to write papers attempting to explain the unity of consciousness.) Similarly, the Speech subsystem reports that the whole image is visible at once because it has been told that "whole_image_is_visible_at_once" is set to true.
Now, this example is probably wrongheaded in many ways, and plausibly the brain's neural representations among its subsystems are nothing like what I just described. However, I give this example just to make concrete the kind of thing I have in mind when talking about "representations" that the brain makes to itself. Regardless of the usefulness of my example, the broader point remains: however it is that the brain comes to conclude certain things, such as that its visual field is coherent and unified, we can trace exactly what algorithmic steps led up to that conclusion, without ever needing to discuss "qualia" or "phenomenal consciousness". Whatever processes cause a philosophical zombie to earnestly think to itself things like "I have a visual experience that's not just data representation in an algorithmic system" can explain human consciousness as well, because humans are zombies.
"don't hold strong opinions about things you don't understand" --Derek Hess
Susan Blackmore believes the way we typically think about consciousness is fundamentally wrong. Many "theories of consciousness" that scientists advance and even the language we use set us up for a binary notion of consciousness as being one discrete thing that's either on or off.
We can tell there's something wrong with our ordinary conceptions when we think about ourselves. Suppose I grabbed a man on the street and described to him every detail of what your brain is doing at a physical level -- including neuronal firings, evoked potentials, brain waves, thalamocortical loops, and all the rest -- but without using suggestive words like "vision" or "awareness" or "feeling". Very likely he would conclude that this machine was not conscious; it would seem to be just an automaton computing behavioral choices "in the dark". If our conceptualization of consciousness can't even predict our own consciousness, it must be misguided in an important way.
Imagine we have perfect neuroscience knowledge. We understand how every neuron in the brain is hooked up, how it fires, and what electrical and chemical factors modulate it. We understand how brain networks interact to produce complex patterns. We have high-level intuitions for thinking about what the functions of various neural operations are, in a similar way as a programmer understands the "gist" of what a complex algorithm is doing. Given all this knowledge, we could trace every aspect of your consciousness. Every thought and feeling would have a signature in this neural collective. Nothing would be hidden exclusively to your subjective experience; everything would have a physical, observable correlate in the neural data.
We need a conception of consciousness which makes it seem obvious that this collection of observable cognitive operations is conscious. If that's not obvious, and especially if that seems implausible or impossible, then our way of thinking about consciousness is fundamentally flawed, because this neural collective is in fact conscious.
Sometimes I have conversations like this:
Brian: Do you think insects are conscious?
Other person: No, of course not.
Brian: Why do you think they're not?
Other person: Well, it just seems absurd. How could a little thing executing simple response behaviors be conscious? It's just reacting in an automatic, reflexive way. There's no inner experience.
Brian: If you didn't know from your own subjective experience that you were conscious, would you predict that you were conscious, or would you see yourself as executing a bunch of responses "in the dark" as the behaviorists might have seen you?
Other person: Hmm, well, I think I would know I'm conscious because I behave more intelligently than an insect and can describe my inner life.
Brian: Can you explain what about your brain gives rise to consciousness that's not present in an insect?
Other person: Uh....
Brian: If you don't understand why you're conscious, how can you be so sure an insect isn't conscious?
Other person: Hmm....
I know that I'm conscious. I also know, from neuroscience combined with Occam's razor, that my consciousness consists only of material operations in my brain -- probably mostly patterns of neuronal firing that help process inputs, compute intermediate ideas, and produce behavioral outputs. Thus, I can see that consciousness is just the first-person view of certain kinds of computations -- as Eliezer Yudkowsky puts it, "How An Algorithm Feels From Inside". Consciousness is not something separate from or epiphenomenal to these computations. It is these computations, just from their own perspective of trying to think about themselves.
In other words, consciousness is what minds compute. Consciousness is the collection of input operations, intermediate processing, and output behaviors that an entity performs. Now, some people would object at this point and say that maybe consciousness is only a subset of what brains compute -- that most of brain activity is "unconscious", and thoughts and feelings only become "conscious" when certain special kinds of operations happen. In response, I would point out that there's not a major discontinuity in the underlying computations themselves that warrants a binary distinction like this. Sure, some thoughts are globally broadcast and others aren't, and the globally broadcast thoughts are accessible to a much wider array of brain functions, including memory and speech, which allows us to report on them while not reporting on signals that are only locally broadcast. But the distinction between local and global broadcasting is ultimately fuzzy, as will be any other distinction that's suggested as the cutoff point between unconscious and conscious experience.
If we look at computations from an abstract perspective, holding in abeyance our intuitions that certain kinds of computations can't be conscious, we can see how the universe contains many varieties of computation of all kinds, in a similar way as nature contains an enormous array of life forms. It's not obvious from this distanced, computation-focused perspective that one subset of computations (namely those in brains of complex animals) is privileged, while all other computations are fundamentally different. Rather, we see a universal continuity among the species of computations, with some being more complex and sophisticated than others, in a similar way as some life forms are more complex and sophisticated than others.
From this perspective, it is clear why our neural collective is conscious: It's because (one flavor of) consciousness is the process of doing the computations that our brains do. The reason we're "not conscious" under general anaesthesia is that the kinds of global information distribution that our brains ordinarily do are prevented, so we can't have complex thoughts like "I'm conscious" or store memories that would lead us to think we had been conscious. But there are still some other kinds of computations going on that have their own kinds of "consciousness", even if of a different nature than what our intuitive, analytical, or linguistic brain operations would understand.
I should add a note on terminology: By "computation" I just mean a lawlike transition from input conditions to output conditions, not necessarily something computable by a Turing machine. All computations occur within physics, so any computation is a physical process. Conversely, any physical process proceeds from input conditions to output conditions in a regular manner and so is a computation. Hence, the set of computations equals the set of physical processes, and where I say "computations" in this piece, one could just as well substitute "physical processes" instead.
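As a trivial sketch of this broad sense of "computation" (with made-up numbers), here is a physical process -- a cooling cup of coffee -- treated as a lawlike transition from one state to the next:

# One time step of Newtonian cooling: a physical process as a state transition.

def cooling_step(temp, ambient=20.0, rate=0.1):
    return temp + rate * (ambient - temp)

temp = 90.0
for _ in range(3):
    temp = cooling_step(temp)
    print(round(temp, 2))  # 83.0, 76.7, 71.03

Nothing hangs on the details; the point is only that "physical process" and "computation" pick out the same lawlike input-output regularities.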
Talk about consciousness is always somewhat mystical. Consciousness is not a hard, concrete thing in the universe but is more of an idea that we find important and sublime, perhaps similar to the concept of Brahman for Hindus. When we think about consciousness, we're essentially doing a kind of poetry in our minds -- one that we find spiritually meaningful.
When we conceive of consciousness as being various flavors of computations, the question arises: What is it like to be another kind of computation than the one in our heads? I've suggested elsewhere that there's some extent to which we can't in principle answer this question fully, because our brains are our brains, and they can't perfectly simulate another computation without being that computation, in which case they would no longer be our current brains. But we can still get some intuitive flavor of what it might mean for another consciousness to be different from ours.
One way to start is just to notice that our own minds feel different and have many different experiences at different times. Being tired feels different from being alert, which feels different from being scared, which feels different from being content in a warm blanket. Even more trivially, seeing a spoon looks different from seeing a fork, which looks different from seeing a penny. Our brains perform many different computations at different times, and these each have their own textures. More extreme examples include being on the edge of sleep, dreaming, waking up slowly after "going under" for surgery, or meditating.
What about other animals? Can we imagine what it's like to be a worm? Fundamentally we can't, but here's an exercise that may at least gesture in the right direction. Read the following instructions and then try it:
Instructions: Close your eyes. Stay still. Stop noticing sounds and smells. Turn off the linguistic inner voice that thinks verbal thoughts in your head. In fact, try to stop thinking any thoughts as much as possible. Now, poke your head with your fingers. Scratch it softly with your fingernails. Tap it with your hand. Face your head toward a light and notice how it looks bright even though you can't see anything definite due to your eyes being closed. Turn your head away. Notice air moving gently across your skin.
This exercise helps mimic the way in which worms have no eyes or ears and presumably no complex thoughts, especially not linguistic ones. Yet they do have sensitivity to touch, light, and vibrations.
Now, even this exercise is far from adequate. Human brains have many internal processes and computing patterns that don't apply to worms. Even if we omit senses that worms lack and try to suppress high-level thoughts, this human-like computing scaffolding remains. For instance, maybe our sense of being a self with unified and integrated sensations is mostly absent from worms. Probably many other things are absent too that I don't have the insight to describe. But at least this exercise helps us begin to imagine another form of consciousness. Then we can multiply whatever differences we felt during this exercise many times more when we contemplate how different a worm's experiences actually are.
In some sense all I've proposed here is to think of different flavors of computation as being various flavors of consciousness. But this still leaves the question: Which flavors of computation matter most? Clearly whatever computations happen when a person is in pain are vastly more important than what's happening in a brain on a lazy afternoon. How can we capture that difference?
Every subjective experience has corresponding objective, measurable brain operations, so the awful experiences of pain must show up in some visible way. It remains to be seen exactly what agony corresponds to, but presumably it includes operations like these: neural networks classifying a stimulus as bad, aversive reactions to the negative stimulus, negative reinforcement learning, focused attention on the source of pain, setting down aversive memory associations with this experience, and goal-directed behavior to escape the situation, even at cost to other things of value. There may be much more, but these basics are likely to remain part of the equation even after further discoveries. (Note: It may be that we should want neuroscience discoveries to come slower rather than faster.) But if so, it becomes plausible that when we see these kinds of operations in other places, we should disvalue them there as well.
This is why an ethical viewpoint like biocentrism has something going for it. (Actually, I prefer "negative biocentrism", analogous to "negative utilitarianism".) All life can display aversive reactions against damage to some degree, and since these are computations of certain flavors, it makes sense to think about them as being conscious with certain flavors. Of course, the degree of importance we place on them may be very small depending on the organism in question, but I don't see fundamental discontinuities in the underlying physics, so our valuation functions should not be discontinuous either. Still, our valuation functions can be very steep. In particular, I think animals like insects are vastly more complex than plants, fungi, or bacteria, so I care about their flavors of consciousness more.
My perspective is similar to that of Ben Goertzel, who said:
My own view of consciousness is a bit eccentric for the scientific world though rather commonplace among Buddhists (which I'm not): I think consciousness is everywhere, but that it manifests itself differently, and to different degrees, in different entities.
Alun Anderson, who spent 10 years studying insect sensation, believes "that cockroaches are conscious." He elaborates:
I don't mean that they are conscious in even remotely the same way as humans are[...]. Rather the world is full of many overlapping alien consciousnesses.
[...]To think this way about simple creatures is not to fall into the anthropomorphic fallacy. Bees and spiders live in their own world in which I don't see human-like motives. Rather it is a kind of panpsychism, which I am quite happy to sign up to, at least until we know a lot more about the origin of consciousness. That may take me out of the company of quite a few scientists who would prefer to believe that a bee with a brain of only a million neurones must surely be a collection of instinctive reactions with some simple switching mechanism between them, rather [than having] some central representation of what is going on that might be called consciousness. But it leaves me in the company of poets who wonder at the world of even lowly creatures.
The argument for panpsychism, I guess, is: If strong emergence is ruled out, then you will not be able to get this "jump" from the non-conscious to the conscious, and therefore consciousness must be a fundamental feature in nature.
Velmans (2012) distinguishes between ‘discontinuity theories’, which claim that there was a particular point at which consciousness originated, before which there was no consciousness (this applies both to the universe at large and to any particular conscious individual), and ‘continuity theories’, which conceptualize the evolution of consciousness in terms of “a gradual transition in consciousness from unrecognizable to recognizable.” He argues that continuity theories are more elegant, as any discontinuity is based on arbitrary criteria, and that discontinuity theories face “the hard problem” in a way that continuity theories don't. Velmans takes these arguments to weigh in favor of adopting, not just a continuity theory, but a form of panpsychism.
Daniel Dennett in Fri Tanke (2017) at 54m50s:
I think that the very idea that consciousness is either there or not is itself a big mistake. Consciousness comes in degrees, and it comes in all sorts of different degrees and varieties. And the idea that there is one property which divides the universe into those things that are conscious and those that aren't is itself a really preposterous mistake.
It seems to me simplest to just presume that none of these [computational, creature-like] systems feel, if I could figure out a way to make sense of that, or that all of them feel, if I can make sense of that. If I feel, a presumption of simplicity leans me toward a pan-feeling position: pretty much everything feels something, but complex flexible self-aware things are aware of their own complex flexible feelings. Other things might not even know they feel, and what they feel might not be very interesting.
For many more quotes of this type, from ancient Greeks to contemporary philosophers of mind, see David Skrbina's encyclopedia entry on panpsychism. I disagree with at least half of the specific views cited there, but some of them are spot-on.
It's unsurprising that a type-A physicalist should attribute nonzero consciousness to all systems. After all, "consciousness" is a concept -- a "cluster in thingspace" -- and all points in thingspace are less than infinitely far away from the centroid of the "consciousness" cluster. By a similar argument, we might say that any system displays nonzero similarity to any concept (except maybe for strictly partitioned concepts that map onto the universe's fundamental ontology, like the difference between matter vs. antimatter). Panpsychism on consciousness is just one particular example of that principle.
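To make the "cluster in thingspace" picture a bit more concrete, here is a minimal sketch of my own (the feature axes and numbers are invented for illustration, not drawn from any real model): a concept is represented as a centroid in a feature space, and similarity decays smoothly with distance, so every point gets a score that shrinks rapidly but never reaches exactly zero.

```python
import math

def similarity(point, centroid):
    """Toy similarity score: decays with Euclidean distance from the
    concept's centroid but stays positive for any finite distance."""
    dist = math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))
    return math.exp(-dist ** 2)

# Hypothetical, made-up feature axes: (information integration,
# valence processing, self-modeling), each on an arbitrary 0-1 scale.
consciousness_centroid = (0.9, 0.8, 0.9)
examples = {
    "human": (0.85, 0.9, 0.95),
    "worm": (0.2, 0.3, 0.05),
    "rock": (0.01, 0.0, 0.0),
}

for name, features in examples.items():
    print(name, round(similarity(features, consciousness_centroid), 3))
# Scores shrink quickly with distance, but none is exactly zero -- the formal
# analogue of saying that no system is infinitely far from the concept.
```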
Critics of this view may complain that, like a hypothetical unfriendly artificial intelligence, I'm not applying a sufficiently conservative concept boundary for the concept of consciousness. But one man's wise conservatism is another's short-sighted parochialism. My view could also be characterized as "concept creep"—a situation in which increasing sensitivity to harm leads to expanding the boundaries of a concept (which in my case is the concept of "consciousness" or "suffering").
Exploration of neural correlates of consciousness helps identify the locations and mechanisms of what we conventionally think of as high-level consciousness in humans and by extension, perhaps the high-level consciousness of similar animal relatives. Stanislas Dehaene's book Consciousness and the Brain provides a superb overview of the state of neuroscience on how consciousness operates in the brain in terms of global workspace theory.
But describing how consciousness works in human-like minds can't be the end of the story. It leaves unanswered the question of whether consciousness could exist in slightly different mind architectures as long as they're doing the same sorts of operations. We could imagine gradually tweaking a human-type mind architecture on subtle dimensions. At what point would these theories of consciousness say it stops being conscious? What if an agent performed human-like cognitive feats without centralized information broadcasting? Global-workspace and other neural-correlation theories don't really give answers, because they can only interpolate between a set of points, not extrapolate beyond that set of points.
Consciousness cannot be crucially tied up with the specific organization of human minds. Consciousness is just not the kind of thing that could be so arbitrarily determined. Consciousness is what consciousness does: It is the suite of stimulus recognition, internal computation, and action selection that an organism performs when making complex decisions requiring help from many cognitive modules. It can't be something necessarily tied to thalamus-cortex connectivity or cross-brain wave synchronization. Those are too specific to the details of implementation; a particular implementation can't be relevant because it doesn't do anything different from another implementation of the same functionality. Rather, consciousness must be about what the process is actually trying to accomplish: receiving information, manipulating it, combining thoughts in novel ways, and taking actions. In other words, consciousness must be related to computation itself.
But if consciousness is about computation in general, then it would seem to appear all over the place. Some embrace this conclusion as a natural deduction from what consciousness as computation must be. For instance, Giulio Tononi's integrated information theory (IIT) suggests that even this metal ball has a small degree of consciousness. Dehaene, on the other hand, says he's "reticent" to accept IIT because it implies a kind of panpsychism (p. 279, Ch. 5's footnote 35).
I agree that IIT is not necessarily the ultimate theory of consciousness. There may be many more particular nuances we want to apply to our criteria for what consciousness should be. But ultimately I think Tononi is right that consciousness must be something fundamental about the properties of the system, not something specific to the implementation. Consciousness as a general phenomenon is the kind of thing that needs a general theory. It just doesn't make sense that something so basic and so tied up with functional operations would require particular implementations.
Note that the functionalist view I'm defending here is not behaviorism. It's not the case that any mechanism that yields human-like behavior has human-like consciousness, as the example of a giant lookup table shows. A giant lookup table may have its own kind of consciousness (indeed, it should have at least some vague form of consciousness according to the thesis I'm advancing in this essay), but it's a different, shallower kind than that of a human. We could see this if we looked inside the brains of the two systems. Humans when responding to a question would show activation in auditory centers, conscious broadcasting networks, and speech centers before producing an answer. The lookup table would do some sort of artificial speech recognition to determine the text form of the question and then would use a hash table or tree search on that string to identify and print out the stored answer. Clearly these two mind operations are distinct. If we broaden the definition of "behavior" to include behavior within the brain by neurons or logic gates, then even by behaviorist criteria these two kinds of consciousness aren't the same.
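As a minimal sketch of the lookup-table side of this contrast (toy entries and names of my own invention, not a real system), note how the table's entire "mental life" is a single hash-map access on the question string:

```python
# A "giant lookup table" responder in miniature. In the thought experiment the
# table would contain an entry for every possible conversation; a few
# hypothetical canned entries stand in for it here.
LOOKUP_TABLE = {
    "what is your name?": "I'm called Table.",
    "how do you feel today?": "I feel fine, thanks for asking.",
    "do you ever suffer?": "That's a deep question.",
}

def lookup_table_reply(question_text: str) -> str:
    # One hash-map access on the normalized question string -- no perception,
    # working memory, or global broadcasting anywhere in the process.
    return LOOKUP_TABLE.get(question_text.strip().lower(), "I don't know.")

print(lookup_table_reply("How do you feel today?"))
```

The outward behavior can match a human's for the stored questions, but the internal processing is a single retrieval step rather than a cascade of interacting cognitive modules.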
Old-school behaviorism is essentially a relic of times past when researchers were less able to look inside brains. Cognitive algorithms must matter in addition to just inputs and outputs. After all, what look from the outside like intermediate computations of a brain can be seen as inputs and outputs of smaller subsystems within the brain, and conversely, the input-output behavior of an organism could be seen as just an internal computation to a larger system like the population of organisms as a whole.
So the specific flavor of consciousness that a system exhibits can indeed depend on the algorithms of a mind, which depend on its architecture. But consciousness in general just seems like something too fundamental to be architecture-dependent.
In any case, suppose you thought the architecture was fundamental to consciousness, i.e., that consciousness was the static physical pattern of matter arranged in certain ways rather than the dynamic computations that such matter was performing. In this case, we'd still end up with a kind of panpsychism, because patterns with at least a vague resemblance to consciousness would be ubiquitous throughout physics.
If consciousness is the thoughts and computations that an agent performs when acting in the world, there seems to be some relationship between sapience -- the ability to intelligently handle novel situations -- and sentience -- inner "feelings". Of course, it's not a perfect correlation. For instance, Mr. Spock calmly computing an optimal course of action may be more successful than a crying baby demanding its juice bottle. But in general, minds that have more capacity for complex thought, representation, motivational tradeoff among competing options, and so on will also have more rich inner lives that contain more complex sensations. As Daniel Dennett notes in Consciousness Explained (p. 449): "the capacity to suffer is a function of the capacity to have articulated, wide-ranging, highly discriminative desires, expectations, and other sophisticated mental states."
One overly simplistic argument could run as follows:
Of course, "understanding" is a concept about as complex as "intelligence" or "consciousness", so this argument does no real work; it just casts general ideas in a potentially new light.
In reading Consciousness and the Brain, I realized that many of the abilities characteristic of consciousness are those cognitive functions that are high-level and open-ended, such as holding information in short-term memory for an arbitrary time, being able to pay attention to arbitrary stimuli, and controlling the direction of one's thoughts. The so-called "unconscious" processing tends to involve feedforward neural networks and other fixed algorithms. One forum post proposed that Turing-completeness may be part of what makes human-like minds special. They not only compute fixed functions but could in theory, given sufficient resources, compute any (computable) function. Maybe Turing-completeness could be seen as a non-arbitrary binary cutoff point for consciousness. I'm skeptical that I'd agree with this definition, because it feels too theoretical. Why should subjectivity be so related to a technical computer-science concept? In any case, I'm not quite sure where the Turing-completeness cutoff would begin among animal brains. But it is an interesting proposal. Rather than thinking in binary terms, I would note that human mental abilities, while powerful, could still be improved upon in practice (given that we don't have infinite memory and so on), and presumably, more advanced minds would be considered even more conscious than humans.
The correlation between sapience and sentience seems plausible among Earth's animals, but does it hold in general? Nick Bostrom argues that it doesn't have to. In his book Superintelligence (2014), Bostrom explains:

We could thus imagine, as an extreme case, a technologically highly advanced society, containing many complex structures, some of them far more intricate and intelligent than anything that exists on the planet today -- a society which nevertheless lacks any type of being that is conscious or whose welfare has moral significance. In a sense, this would be an uninhabited society. It would be a society of economic miracles and technological awesomeness, with nobody there to benefit. A Disneyland with no children.
I tend to differ with Bostrom on this. I think if we dissolve our dualist intuitions and see consciousness as flavors of computation, then a highly intelligent and complex society is necessarily conscious -- at least, with a certain flavor of consciousness. That flavor may be very different from what we have experience with, and so I can see how many people would regard it as not real consciousness. Maybe I would too upon reflection. But the question is to some extent a matter of taste.
Imagine robotic aliens visiting Earth. They would observe a mass of carbon-based tissue that performs operations that parts of it find reinforcing. The globs of tissue migrate across the Earth and engage in lots of complex behaviors. The tissue globs change Earth's surface dramatically, much like a bacterial colony transforming a loaf of bread. But the tissue globs don't have alien-consciousness. Hence, the aliens view Earth as a wasteland waiting to be filled with happy alien-children.
Note that my view does not equate "consciousness" with "goodness". I think many forms of consciousness are intrinsically bad, and I would prefer for the universe to contain less consciousness on the whole. That said, we have to know the enemy to fight the enemy.
On 29 May 1913, the opening of Igor Stravinsky's The Rite of Spring in Paris caused an uproar among the audience:
As a riot ensued, two factions in the audience attacked each other, then the orchestra, which kept playing under a hail of vegetables and other objects. Forty people were forcibly ejected.
The reason:
It's more likely that the audience was appalled and disbelieving at the level of dissonance, which seemed to many like sheer perversity. "The music always goes to the note next to the one you expect," wrote one exasperated critic.
At a deeper level, the music negates the very thing that for most people gives it meaning: the expression of human feelings. [...]
There's no sign that any of the creatures in the Rite of Spring has a soul, and there's certainly no sense of a recognisable human culture. The dancers are like automata, whose only role is to enact the ritual laid down by immemorial custom.
Arguing over whether an abstract superintelligence is conscious is similar to pre-modern musicians arguing whether The Rite of Spring is classical music, except maybe that the former contrast is even more stark than the latter. Abstract machine intelligence would be a very different flavor of consciousness, so much that we can't do it justice by trying to imagine it. But I find it parochial to assume that it wouldn't be meaningful consciousness.
Of course, sometimes being parochial is good. If you don't favor some things over others, you don't favor anything at all. It's completely legitimate to care about some types of physical processes and not others if that's how you feel. I just personally incline toward the view that complex machine consciousness of any sort has moral standing.
I think the concept "consciousness" is a lot like the concept "life" in terms of its complexity and fuzziness. Perhaps this is unsurprising, because as John Searle correctly observes, consciousness is a biological process.
But aren't the boundaries of life relatively clear? No, I don't think so. Biologists have agreed on certain properties that define life by convention, but the properties of life taught in biology class are just one arbitrary choice out of many possible choices regarding where to draw a line between the biological and abiological.
Viruses are one classic example of the fuzziness of "life":
Opinions differ on whether viruses are a form of life, or organic structures that interact with living organisms.[67] They have been described as "organisms at the edge of life",[8] since they resemble organisms in that they possess genes, evolve by natural selection,[68] and reproduce by creating multiple copies of themselves through self-assembly. Although they have genes, they do not have a cellular structure, which is often seen as the basic unit of life. Viruses do not have their own metabolism, and require a host cell to make new products. They therefore cannot naturally reproduce outside a host cell[69] – although bacterial species such as rickettsia and chlamydia are considered living organisms despite the same limitation.[70][71] Accepted forms of life use cell division to reproduce, whereas viruses spontaneously assemble within cells. They differ from autonomous growth of crystals as they inherit genetic mutations while being subject to natural selection.
Definitions become even hazier when we imagine extraterrestrial life, which may not use the same mechanics as life on Earth. Carol Cleland: "Despite its amazing morphological diversity, terrestrial life represents only a single case. The key to formulating a general theory of living systems is to explore alternative possibilities for life. I am interested in formulating a strategy for searching for extraterrestrial life that allows one to push the boundaries of our Earth-centric concepts of life."
There are some "joints" in the space of life-like processes that are more natural to carve things up at than others. The current biology-textbook definition of life may represent one such "joint". In the case of consciousness, I could imagine a similar "joint" being "living things that have neurons", which I think would only include most animals. (This page says: "Not all animals have neurons; Trichoplax and sponges lack nerve cells altogether.") But this definition is clearly arbitrary, as neurons are but one way to transmit information. Likewise, the requirement that a system must be organized into cells is an arbitrary cutoff in the standard definition of life, since "having cells" is just one form of the more general property of "having an organized structure".
Delineating consciousness based on possession of (biological) neurons would also exclude artificial computer minds from being counted as conscious, in a similar way as the standard biological definition of life excludes artificial life, even when artificial life forms satisfy most of the other criteria for life.
And I think that even examples normally seen as paragons of lifelessness, like rocks, have some of life's properties. For example, rocks are organized into regular patterns, absorb and release energy from their surroundings, change in size with age (such as shrinking through weathering), "respond" to the environment by moving away when pushed with enough force by wind or water, and can "reproduce" into smaller rocks when split apart. And some rocks, like these crystals, are even more lifelike: "The particles aren’t truly alive — but they’re not far off, either. Exposed to light and fed by chemicals, they form crystals that move, break apart and form again."
If I cared about life as a source of intrinsic moral value, I would probably be a hylozoist for similar reasons as I'm a panpsychist: Every part of physics shows at least traces of the kinds of properties that we normally think should define life and consciousness.
This essay has defended a sort of panpsychism, in which we can think of all computational systems as having their own sorts of conscious experiences. This is one particular kind of panpsychism, which should be distinguished from other variants.
Panpsychism should not commit the pathetic fallacy of seeing full-fledged minds in even simple systems.
Once I was using a Ziploc bag to carry flies stuck inside a window to the outside. I asked myself whimsically: "Is this what it feels like to be a proton pump -- transporting items to the other side of a membrane?" And of course the answer is "no", because the cognitive operations that constitute "how it feels to remove flies" (visual appearance, subjective effort, conceptual understanding, etc.) are not present in a proton pump. Such pumps would need tons of extra machinery to implement this functionality. The pathetic fallacy is only possible for dualist conceptions of mind, according to which elaborate thoughts can happen without corresponding physical processing.
On the flip side, it's mainly dualist theories of consciousness that allow a functionalist kind of panpsychism not to be true. If physics represents everything going on, then there must indeed be traces of mind-like operations in physics, depending on how "mind" is defined. In contrast, if mind is another substance or property beyond the physical, then it could not be present in simple physical systems.
In "Why panpsychism doesn't help explain consciousness" (2009), Philip Goff presents panpsychism as a theory that the universe's "physical ultimates" are intrinsically conscious. He then argues that if we imagine a person named Clare:
Even if the panpsychist is right that Clare's physical ultimates are conscious, the kind of conscious experience had by Clare's ultimates will presumably be qualitatively very different to the kind of conscious experience pre-theoretical common sense attributes to Clare on the basis of our everyday interactions with her [...]. (p. 290)
I find this objection misguided, because my version of panpsychism doesn't propose that whole-brain consciousness is constituted from lots of little pieces of consciousness (what some call "mind dust"). Rather, the system of Clare as a whole has its own kind of consciousness, because the system as a whole constitutes its own kind of computation, at the same time that subcomponents of the system have their own, different kinds of consciousness corresponding to different computations, and at the same time that Clare is embedded in larger systems that once again have their own kinds of consciousness. Mine is a "functionalist panpsychism" focused on system behavior rather than on discrete particles of consciousness. On p. 298, Goff admits that functionalists would not agree with his argument. On p. 304, Goff considers a panpsychism similar to mine, in which functional states of the whole organism determine experiential content. He rejects this because he conceives of consciousness as a separate thing (reification fallacy). In contrast, I believe that "consciousness" is just another way of regarding the functional behavior of the system. In other words, I'm defending a kind of poetic panpsychism, in which we think about systems as being phenomenal, without trying to turn phenomenality into a separate object.
And if you do insist on regarding consciousness as an object, why can't we see a dynamic system itself as an object? Mathematicians and computer scientists are familiar not just with manipulating points but also with manipulating functions and other complex structures. Functions can be seen as points in their own vector spaces. Some programming languages treat functions as first-class citizens. I wonder how much intuitions on philosophy of mind differ based on one's academic department.
Marvin Minsky regards concepts like "consciousness" as "suitcases" -- boxes that we put complicated processes into. "This in turn leads us to regard these as though they were 'things' with no structures to analyze."
In a 2012 lecture, Goff proposed a kind of panpsychism in which each particle in his mind contains his whole subjective experience, so his mind occupies many locations at once within his brain. This again is misguided, because it reifies a whole subjective experience into a fundamental object. Rather, subjective experience is the collective behavior of one's whole brain; it's not a separate thing that can live in a single particle.
I would be okay with a "mind dust" picture if instead of conceiving of each particle as having a complete phenomenal experience, we picture each particle as constituting a little sliver of computation that can combine with other slivers of computation to form more complete computational patterns. As William Seager explains: "Presumably the same way that physical complexity grows, there will be a kind of matching or mirroring growth in mental complexity." Our subjective experiences are holistic systems composed of many computational pieces, each of which can poetically be thought of as having its own simple, incomprehensible-to-us form of mentality.
Some panpsychist and panprotopsychist philosophers believe that the "quiddities" of physical reality may be conscious in a basic way (panpsychism) or may contain the building blocks of consciousness in a sense beyond embodying structural/functional properties (panprotopsychism). David Chalmers toys with a view of this kind, but as he notes, it leads to the "combination problem": How do these smaller parts combine to yield macrophenomenal consciousness like our own? Note that this sounds an awful lot like the regular mind-body problem: How do physical parts combine to yield phenomenal experience like ours? I suspect that Chalmers finds the panpsychist question less puzzling because at least the panpsychist problem already has phenomenal experience to start with, so phenomenal parts just need to be put together rather than appearing out of nowhere.
I think this whole project is wrongheaded. First of all, why should we believe in quiddities? Why should we think there's more to something than how it behaves structurally and functionally? What would it mean for the additional "essence" to be anything? If there were such an essence, either it would have structural/functional implications, in which case we've already accounted for them by structural/functional characterization, or it doesn't have any structural/functional implications, in which case the quiddity is wholly unnecessary to any explanation of anything physical. Quiddities face the same problems as a non-interacting dualist soul. On the other hand, could the same argument be leveled against the existence of physics too? One could say that the "existence" of physics is an additional property over and above the (logical but not actual) structure or function of mathematical descriptions of physical systems. I don't know whether I endorse out-and-out eliminativist Ontic Structural Realism (relations without relata), and I'm more confused about this topic than about consciousness. Still, it seems weird to "squeeze in" extra statements about the relata (beyond that they exist and that they have particular structures/functions), like that they have phenomenal character. It's true that we sometimes need to expand the ontology of physics to accommodate new phenomena, but physics has always been structural/functional, so expanding it to include phenomenal properties would be unlike any past physical revolutions.
Anyway, let's say we have quiddities of physics. What does it mean to say they have a phenomenal character? I have no idea what such a state of affairs would look like. Sure, I can conjure up images of little balls of sensation or feeling or whatever, but that act of mental imagination doesn't appear to describe anything more coherent than imagining little particles of good luck being emitted by discovered four-leaf clovers. I mean, where would that mental stuff come from? What is it? The hard problem of consciousness would remain as fierce as ever, just pushed back to the level of explaining why the consciousness primitive exists.
Augustine Lee rejects panpsychism by suggesting an analogy with a car: A whole car can drive, but that doesn't mean a steering wheel by itself has a "drive"-ness to it. Likewise, consciousness involves complicated brain structures, and simple physics by itself needn't have those same types of structure. This is a valid point, and it suggests that we may want to rein in the extent to which we attribute consciousness to fundamental physical operations.
But what's important to emphasize is that panpsychism is always an attribution on our part -- as I say, a kind of poetry. How much "mind" we see in simple physics depends on our intuitions about how broad we want our definitions to be. We can fix definitions anywhere, but the most helpful way to set the definition for consciousness is based on our ethical sentiments -- i.e., we say that process X is conscious to degree Y if we feel degree Y of moral concern about X. So, for instance, if we regarded driving as morally important, we would decide how much (if at all) a steering wheel on its own mattered, and then would set the amount of "drive"-ness of the steering wheel at that value.
For what it's worth, I think the operations we consider as "consciousness" are more multifarious and fundamental than what we typically consider "driving", which suggests that "consciousness" will have more broad definitional boundaries than "driving".
While we can speculate about some kind of consciousness existing in all entities, it might be objected that we already have firsthand experience with the possibility of non-consciousness -- namely, our own non-REM (NREM) sleep. Doesn't this prove that panpsychism can't be true, because we can see for ourselves that our sleeping brains aren't conscious? Following are some points in reply.
David Skrbina argues that panpsychism
has implications for, e.g., environmentalism. So if we see mind in things in nature -- whether it's animals or plants or even rocks and rivers and streams and so forth -- this has a definite ethical component that I think is very real and has a pragmatic kind of aspect.
Elsewhere he suggests:
Arguably, it is precisely this mechanistic view -- which sees the universe and everything in it as a kind of giant machine -- that lies at the root of many of our philosophical, sociological, and environmental problems. Panpsychism, by challenging this worldview at its root, potentially offers new solutions to some very old problems.
Freya Mathews moves from a panpsychist outlook, combined with the Taoist idea of wu wei ("non-action"), to the position that
The focus in environmental management, development and commerce should be on “synergy” with what is already in place rather than on demolition, replacement and disruption.
She writes:
from a panpsychist point of view it is not enough merely to conserve energy, unilaterally extracting and transforming it here and storing it there. One has to allow planetary energies to follow their own contours of flow, contours which reveal local and possibly global aspects of a larger world-purpose.
There seems to be much in common between panpsychism and deep ecology / other forms of environmental ethics. But there's no necessary connection, and indeed, one can make the opposite case. There are several problems with the leap from panpsychism to environmentalism:
If the welfare of an ecosystem as a whole conflicts with that of individual animals within the ecosystem, which takes priority? Unless the ecosystem matters more than many animals, the animals may still dominate the calculations. The highly developed and emotion-rich consciousness of a single mammal or bird brain seems far more pronounced than the crude shadows of sentience that we see in holistic ecosystems. Maybe ecosystems get more weight because they're bigger and more intricate than an animal brain, but I doubt I'd count an ecosystem's welfare more than, say, 10 or 100 individual animals.
Suppose we grant, say, the Earth as a whole nontrivial ethical weight compared with animal feelings. Who's to say that changing the environment is against Earth's wishes? Maybe it concords with Earth's wishes.
One argument for conservation might be that the Earth tries to rebound from certain forms of destruction. For instance, if we cut a forest, plants grow back. Typically an organism resists damage, so growing back vegetation may be the Earth's way of recovering from the harm inflicted by humans. But then what should we make of cases where Earth seems to go along with human impacts? For instance, positive greenhouse-gas feedback loops might be the Earth's way of saying, "I liked how you added more CO2 to my atmosphere, so I'm going to continue to add greenhouse gases of my own accord." In any case, it's also not clear that vegetation isn't like the Earth's hair or toenails -- something it's glad to have cropped even though it keeps coming back. Maybe the Earth created us with the ultimate purpose of keeping it well shaved. The first photosynthesizers also tampered with the Earth when they oxygenated the atmosphere. Was that likewise an assault on the Earth's goals?
The language I'm using here is obviously too anthropomorphic, but it's a convenient way of talking about ultimately more abstract and crude quasi-preferences that the Earth's biosphere may imply via its constitution and behavior. And it's probably wrong to think of the Earth as having a single set of quasi-preferences. There are many parts to what the Earth does, each of which might suggest its own kinds of desires, in a similar way as human brains contain many subsystems that can want different things.
Finally, who's to say that ecosystems are more valuable subjects of experience than their replacements, such as cities, factories, highways, and the like? Are environmentalists guilty of ecocentrism -- discrimination against industrial and digital systems? Luciano Floridi makes a similar point and argues for replacing biocentrism with "ontocentrism".
If forests, streams, and the whole Earth do have quasi-feelings, who's to say they're feelings of happiness? They might just as easily be feelings of frustration. These systems are always adapting -- and so perhaps are always restless, never satisfied. Maybe it would be better if this discomfort didn't have to be endured. That is, maybe ecosystems would be better off not existing, even purely for their own sakes. This is particularly clear for those who consider reducing suffering more urgent than creating pleasure. So maybe panpsychism leads to an anti-environmental ethic. Of course, whatever replaces an ecosystem will itself suffer. But hopefully parking lots and solar radiation not converted to energy by plants are on balance less sentient (and hence suffer less) than ecosystems.
I think part of why panpsychism often elicits intuitions of nature's goodness is that the experience of imagining oneself as part of a larger, conscious cosmos is often beautiful and serene. We feel at peace with the universe when thinking such thoughts, and then we project those good feelings onto what we're thinking about -- forgetting how awful it may actually "feel" to be the universe. To her credit, Freya Mathews acknowledges the importance of suffering: "The path of awakened intersubjectivity, Mathews cautions in conclusion, is nonetheless far from universally joyous: on the contrary, it renders the pain of more than human others more salient for us, even while we find delight in our surprise encounters with them."
Spiritual/panpsychist experiences are elevated by certain types of drug use:
For example, a recent study found that about 60% of volunteers in an experiment on the effects of psilocybin, who had never before used psychedelic drugs, had a “complete mystical experience” characterised by experiences such as unity with all things, transcendence of time and space, a sense of insight into the ultimate nature of reality, and feelings of ineffability, awe, and profound positive emotions such as joy, peace, and love (Griffiths, Richards, McCann, & Jesse, 2006).
[...] Psychedelic drug users endorsed more mystical beliefs (such as in a universal soul, no fear of death, unity of all things, existence of a transcendent reality, and oneness with God, nature and the universe).
I wouldn't be surprised if weaker versions of these brain processes are triggered naturally when people think spiritual thoughts. But we shouldn't mistake the bliss we feel in these moments as being what the other entities in the universe themselves feel.
(Note: I never have and never intend to try psychedelic drugs, both because they're illegal and because messing with my brain seems risky. But I think it's quite edifying to learn about the effects of such drugs.)
A friend of mine sometimes asks why there's always so much badness in the world. I reply: "It could be worse." Indeed, the second law of thermodynamics is in some sense a great gift to suffering reducers, because it implies that (complex) suffering can only last so long (within a given Hubble volume at least). We just have to wait it out until the universe's negentropy is used up.
It's often observed that a characteristic of life is that it has extremely low entropy, and correspondingly that life is very efficient (though not necessarily maximally efficient) at increasing the entropy of the outside environment. This might lead us to wonder whether there's some relationship between "sentience" and "entropy production". If these two things were identical, then we would face a sharp constraint on efforts to reduce the net sentience of our region of the universe, since a given quantity of entropy must be produced as the universe evolves forward.
However, I don't think the two quantities are exactly equal. For example:
So presumably suffering reducers would prefer systems with fewer neuron-like operations and more irreversible computations, which have a lower ratio of sentience per unit entropy increase.
Also note that "amount of sentience" is not identical to "amount of suffering". It's better to increase entropy with happy minds rather than agonized ones.
We might also wonder whether sentience is proportional to mass+energy. If so, then the law of conservation of mass+energy would imply that we can't change the amount of sentience. However, I find it implausible that sentience would be strictly proportional to mass+energy. For instance, a lot of energy can be stored in molecular bonds, which are pretty stable and so don't seem to qualify as a particularly sentient system compared with other systems that contain the same amount of energy in the form of organisms moving around. A stick of butter contains enough food energy to power a person for 5-10 hours, but there seems to be more sentience in a system in which the butter powers the person than a system in which the butter sits idle alongside a person who just died, even though both of these systems have the same amount of mass+energy.
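As a rough back-of-the-envelope check on the butter figure (using typical round numbers -- about 810 kcal per US stick and roughly 100-150 W of human metabolic power):

```python
# Back-of-the-envelope check of the stick-of-butter example (rough figures).
butter_kcal = 810                # roughly one US stick (113 g) of butter
joules = butter_kcal * 4184      # 1 kcal = 4184 J, so ~3.4 MJ total
for watts in (100, 150):         # human metabolic power, resting to lightly active
    hours = joules / watts / 3600
    print(f"{watts} W -> {hours:.1f} hours")
# Prints roughly 9.4 hours at 100 W and 6.3 hours at 150 W,
# consistent with the "5-10 hours" figure in the text.
```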
Among many inspirations for this piece were conversations with Joseph Kijewski and Ruairí Donnelly.
The post Flavors of Computation Are Flavors of Consciousness appeared first on Center on Long-Term Risk.
The post Artificial Intelligence and Its Implications for Future Suffering appeared first on Center on Long-Term Risk.
Artificial intelligence (AI) will transform the world later this century. I expect this transition will be a "soft takeoff" in which many sectors of society update together in response to incremental AI developments, though the possibility of a harder takeoff in which a single AI project "goes foom" shouldn't be ruled out. If a rogue AI gained control of Earth, it would proceed to accomplish its goals by colonizing the galaxy and undertaking some very interesting achievements in science and engineering. On the other hand, it would not necessarily respect human values, including the value of preventing the suffering of less powerful creatures. Whether a rogue-AI scenario would entail more expected suffering than other scenarios is a question to explore further. Regardless, the field of AI ethics and policy seems to be a very important space where altruists can make a positive-sum impact along many dimensions. Expanding dialogue and challenging us-vs.-them prejudices could be valuable.
*Several of the newly written sections of this piece are absent from the podcast because I recorded it a while back.
This piece contains some observations on what may turn out to be a coming machine revolution in Earth's history. For general background reading, a good place to start is Wikipedia's article on the technological singularity.
I am not an expert on all the arguments in this field, and my views remain very open to change with new information. In the face of epistemic disagreements with other very smart observers, it makes sense to grant some credence to a variety of viewpoints. Each person brings unique contributions to the discussion by virtue of his or her particular background, experience, and intuitions.
To date, I have not found a detailed analysis of how those who are moved more by preventing suffering than by other values should approach singularity issues. This seems to me a serious gap, and research on this topic deserves high priority. In general, it's important to expand discussion of singularity issues to encompass a broader range of participants than the engineers, technophiles, and science-fiction nerds who have historically pioneered the field.
I. J. Good observed in 1982: "The urgent drives out the important, so there is not very much written about ethical machines". Fortunately, this may be changing.
In fall 2005, a friend pointed me to Ray Kurzweil's The Age of Spiritual Machines. This was my first introduction to "singularity" ideas, and I found the book pretty astonishing. At the same time, much of it seemed rather implausible to me. In line with the attitudes of my peers, I assumed that Kurzweil was crazy and that while his ideas deserved further inspection, they should not be taken at face value.
In 2006 I discovered Nick Bostrom and Eliezer Yudkowsky, and I began to follow the organization then called the Singularity Institute for Artificial Intelligence (SIAI), which is now MIRI. I took SIAI's ideas more seriously than Kurzweil's, but I remained embarrassed to mention the organization because the first word in SIAI's name sets off "insanity alarms" in listeners.
I began to study machine learning in order to get a better grasp of the AI field, and in fall 2007, I switched my college major to computer science. As I read textbooks and papers about machine learning, I felt as though "narrow AI" was very different from the strong-AI fantasies that people painted. "AI programs are just a bunch of hacks," I thought. "This isn't intelligence; it's just people using computers to manipulate data and perform optimization, and they dress it up as 'AI' to make it sound sexy." Machine learning in particular seemed to be just a computer scientist's version of statistics. Neural networks were just an elaborated form of logistic regression. There were stylistic differences, such as computer science's focus on cross-validation and bootstrapping instead of testing parametric models -- made possible because computers can run data-intensive operations that were inaccessible to statisticians in the 1800s. But overall, this work didn't seem like the kind of "real" intelligence that people talked about for general AI.
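For concreteness, here is a minimal sketch (toy numbers, nothing from a real dataset) of what I meant by that claim: logistic regression is literally a neural network with no hidden layer, and adding one hidden layer gives the simplest "real" neural network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])             # one toy feature vector

# Logistic regression: a weighted sum of the inputs pushed through a sigmoid.
w, b = np.array([0.2, 0.4, -0.1]), 0.05
p_logistic = sigmoid(w @ x + b)

# One-hidden-layer neural network: the same sigmoid output unit, but fed by
# nonlinear combinations of the inputs instead of the raw inputs.
W1 = np.array([[0.1, -0.3, 0.2],
               [0.5,  0.1, 0.0]])
b1 = np.array([0.0, 0.1])
w2, b2 = np.array([0.7, -0.4]), 0.0
hidden = np.tanh(W1 @ x + b1)
p_network = sigmoid(w2 @ hidden + b2)

print(p_logistic, p_network)   # two probabilities between 0 and 1
```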
This attitude began to change as I learned more cognitive science. Before 2008, my ideas about human cognition were vague. Like most science-literate people, I believed the brain was a product of physical processes, including firing patterns of neurons. But I lacked further insight into what the black box of brains might contain. This led me to be confused about what "free will" meant until mid-2008 and about what "consciousness" meant until late 2009. Cognitive science showed me that the brain was in fact very much like a computer, at least in the sense of being a deterministic information-processing device with distinct algorithms and modules. When viewed up close, these algorithms could look as "dumb" as the kinds of algorithms in narrow AI that I had previously dismissed as "not really intelligence." Of course, animal brains combine these seemingly dumb subcomponents in dazzlingly complex and robust ways, but I could now see that the difference between narrow AI and brains was a matter of degree rather than kind. It now seemed plausible that broad AI could emerge from lots of work on narrow AI combined with stitching the parts together in the right ways.
So the singularity idea of artificial general intelligence seemed less crazy than it had initially. This was one of the rare cases where a bold claim turned out to look more probable on further examination; usually extraordinary claims lack much evidence and crumble on closer inspection. I now think it's quite likely (maybe ~75%) that humans will produce at least a human-level AI within the next ~300 years conditional on no major disasters (such as sustained world economic collapse, global nuclear war, large-scale nanotech war, etc.), and also ignoring anthropic considerations.
The "singularity" concept is broader than the prediction of strong AI and can refer to several distinct sub-meanings. Like with most ideas, there's a lot of fantasy and exaggeration associated with "the singularity," but at least the core idea that technology will progress at an accelerating rate for some time to come, absent major setbacks, is not particularly controversial. Exponential growth is the standard model in economics, and while this can't continue forever, it has been a robust pattern throughout human and even pre-human history.
MIRI emphasizes AI for a good reason: At the end of the day, the long-term future of our galaxy will be dictated by AI, not by biotech, nanotech, or other lower-level systems. AI is the "brains of the operation." Of course, this doesn't automatically imply that AI should be the primary focus of our attention. Maybe other revolutionary technologies or social forces will come first and deserve higher priority. In practice, I think focusing on AI specifically seems quite important even relative to competing scenarios, but it's good to explore many areas in parallel to at least a shallow depth.
In addition, I don't see a sharp distinction between "AI" and other fields. Progress in AI software relies heavily on computer hardware, and it depends at least a little bit on other fundamentals of computer science, like programming languages, operating systems, distributed systems, and networks. AI also shares significant overlap with neuroscience; this is especially true if whole brain emulation arrives before bottom-up AI. And everything else in society matters a lot too: How intelligent and engineering-oriented are citizens? How much do governments fund AI and cognitive-science research? (I'd encourage less rather than more.) What kinds of military and commercial applications are being developed? Are other industrial backbone components of society stable? What memetic lenses does society have for understanding and grappling with these trends? And so on. The AI story is part of a larger story of social and technological change, in which one part influences other parts.
Significant trends in AI may not look like the AI we see in movies. They may not involve animal-like cognitive agents as much as more "boring", business-oriented computing systems. Some of the most transformative computer technologies in the period 2000-2014 have been drones, smart phones, and social networking. These all involve some AI, but the AI is mostly used as a component of a larger, non-AI system, in which many other facets of software engineering play at least as much of a role.
Nonetheless, it seems nearly inevitable to me that digital intelligence in some form will eventually leave biological humans in the dust, if technological progress continues without faltering. This is almost obvious when we zoom out and notice that the history of life on Earth consists in one species outcompeting another, over and over again. Ecology's competitive exclusion principle suggests that in the long run, either humans or machines will ultimately occupy the role of the most intelligent beings on the planet, since "When one species has even the slightest advantage or edge over another then the one with the advantage will dominate in the long term."
The basic premise of superintelligent machines who have different priorities than their creators has been in public consciousness for many decades. Arguably even Frankenstein, published in 1818, expresses this basic idea, though more modern forms include 2001: A Space Odyssey (1968), The Terminator (1984), I, Robot (2004), and many more. Probably most people in Western countries have at least heard of these ideas if not watched or read pieces of fiction on the topic.
So why do most people, including many of society's elites, ignore strong AI as a serious issue? One reason is just that the world is really big, and there are many important (and not-so-important) issues that demand attention. Many people think strong AI is too far off, and we should focus on nearer-term problems. In addition, it's possible that science fiction itself is part of the reason: People may write off AI scenarios as "just science fiction," as I would have done prior to late 2005. (Of course, this is partly for good reason, since depictions of AI in movies are usually very unrealistic.) Often, citing Hollywood is taken as a thought-stopping deflection of the possibility of AI getting out of control, without much in the way of substantive argument to back up that stance. For example: "let's please keep the discussion firmly within the realm of reason and leave the robot uprisings to Hollywood screenwriters."
As AI progresses, I find it hard to imagine that mainstream society will ignore the topic forever. Perhaps awareness will accrue gradually, or perhaps an AI Sputnik moment will trigger an avalanche of interest. Stuart Russell expects that
Just as nuclear fusion researchers consider the problem of containment of fusion reactions as one of the primary problems of their field, it seems inevitable that issues of control and safety will become central to AI as the field matures.
I think it's likely that issues of AI policy will be debated heavily in the coming decades, although it's possible that AI will be like nuclear weapons -- something that everyone is afraid of but that countries can't stop because of arms-race dynamics. So even if AI proceeds slowly, there's probably value in thinking more about these issues well ahead of time, though I wouldn't consider the counterfactual value of doing so to be astronomical compared with other projects in part because society will pick up the slack as the topic becomes more prominent.
[Update, Feb. 2015: I wrote the preceding paragraphs mostly in May 2014, before Nick Bostrom's Superintelligence book was released. Following Bostrom's book, a wave of discussion about AI risk emerged from Elon Musk, Stephen Hawking, Bill Gates, and many others. AI risk suddenly became a mainstream topic discussed by almost every major news outlet, at least with one or two articles. This foreshadows what we'll see more of in the future. The outpouring of publicity for the AI topic happened far sooner than I imagined it would.]

Various thinkers have debated the likelihood of a "hard" takeoff -- in which a single computer or set of computers rapidly becomes superintelligent on its own -- compared with a "soft" takeoff -- in which society as a whole is transformed by AI in a more distributed, continuous fashion. "The Hanson-Yudkowsky AI-Foom Debate" discusses this in great detail. The topic has also been considered by many others, such as Ramez Naam vs. William Hertling.
For a long time I inclined toward Yudkowsky's vision of AI, because I respect his opinions and didn't ponder the details too closely. This is also the more prototypical example of rebellious AI in science fiction. In early 2014, a friend of mine challenged this view, noting that computing power is a severe limitation for human-level minds. My friend suggested that AI advances would be slow and would diffuse through society rather than remaining in the hands of a single developer team. As I've read more AI literature, I think this soft-takeoff view is pretty likely to be correct. Science is always a gradual process, and almost all AI innovations historically have moved in tiny steps. I would guess that even the evolution of humans from their primate ancestors was a "soft" takeoff in the sense that no single son or daughter was vastly more intelligent than his or her parents. The evolution of technology in general has been fairly continuous. I probably agree with Paul Christiano that "it is unlikely that there will be rapid, discontinuous, and unanticipated developments in AI that catapult it to superhuman levels [...]."
Of course, it's not guaranteed that AI innovations will diffuse throughout society. At some point perhaps governments will take control, in the style of the Manhattan Project, and they'll keep the advances secret. But even then, I expect that the internal advances by the research teams will add cognitive abilities in small steps. Even if you have a theoretically optimal intelligence algorithm, it's constrained by computing resources, so you either need lots of hardware or approximation hacks (or most likely both) before it can function effectively in the high-dimensional state space of the real world, and this again implies a slower trajectory. Marcus Hutter's AIXI(tl) is an example of a theoretically optimal general intelligence, but most AI researchers feel it won't work for artificial general intelligence (AGI) because it's astronomically expensive to compute. Ben Goertzel explains: "I think that tells you something interesting. It tells you that dealing with resource restrictions -- with the boundedness of time and space resources -- is actually critical to intelligence. If you lift the restriction to do things efficiently, then AI and AGI are trivial problems."
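To give a feel for the kind of blowup involved -- this is only a toy illustration of brute-force program enumeration, not Hutter's actual AIXItl construction -- consider how fast the space of candidate bit-string programs grows:

```python
# Toy illustration of the resource problem, NOT Hutter's actual AIXItl
# construction: merely counting the candidate bit-string programs up to a
# given length grows exponentially, before any of them is ever run.
def num_bitstring_programs(max_len: int) -> int:
    return sum(2 ** length for length in range(1, max_len + 1))

for max_len in (10, 50, 300):
    print(max_len, num_bitstring_programs(max_len))
# By length 300 the count (~10^90) already dwarfs the ~10^80 atoms in the
# observable universe, and AIXItl would additionally have to run candidates
# for up to t time steps in each interaction cycle.
```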
In "I Still Don’t Get Foom", Robin Hanson contends:
Yes, sometimes architectural choices have wider impacts. But I was an artificial intelligence researcher for nine years, ending twenty years ago, and I never saw an architecture choice make a huge difference, relative to other reasonable architecture choices. For most big systems, overall architecture matters a lot less than getting lots of detail right.
This suggests that it's unlikely that a single insight will make an astronomical difference to an AI's performance.
Similarly, my experience is that machine-learning algorithms matter less than the data they're trained on. I think this is a general sentiment among data scientists. There's a famous slogan that "More data is better data." A main reason Google's performance is so good is that it has so many users that even obscure searches, spelling mistakes, etc. will appear somewhere in its logs. But if many performance gains come from data, then they're constrained by hardware, which generally grows steadily.
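As a rough illustration of this point, here is a toy experiment of my own (assuming scikit-learn is available; the dataset, models, and sample sizes are arbitrary choices, and the exact numbers will vary by problem). On many noisy classification tasks, the jump from a small to a large training set tends to move accuracy more than swapping one reasonable algorithm for another:

```python
# Toy comparison: gain from more data vs. gain from a different algorithm.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=60_000, n_features=40, n_informative=10,
                           flip_y=0.05, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=10_000,
                                                  random_state=0)

for n in (500, 50_000):  # small vs. large training set
    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
        acc = model.fit(X_pool[:n], y_pool[:n]).score(X_test, y_test)
        print(f"{type(model).__name__:>22}, {n:>6} training examples: {acc:.3f}")
```

This is only a sketch, not evidence about any particular production system, but it captures the data-scientist intuition described above.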
Hanson's "I Still Don’t Get Foom" post continues: "To be much better at learning, the project would instead have to be much better at hundreds of specific kinds of learning. Which is very hard to do in a small project." Anders Sandberg makes a similar point:
As the amount of knowledge grows, it becomes harder and harder to keep up and to get an overview, necessitating specialization. [...] This means that a development project might need specialists in many areas, which in turns means that there is a lower size of a group able to do the development. In turn, this means that it is very hard for a small group to get far ahead of everybody else in all areas, simply because it will not have the necessary know how in all necessary areas. The solution is of course to hire it, but that will enlarge the group.
One of the more convincing anti-"foom" arguments is J. Storrs Hall's point that an AI improving itself to a world superpower would need to outpace the entire world economy of 7 billion people, plus natural resources and physical capital. It would do much better to specialize, sell its services on the market, and acquire power/wealth in the ways that most people do. There are plenty of power-hungry people in the world, but usually they go to Wall Street, K Street, or Silicon Valley rather than trying to build world-domination plans in their basement. Why would an AI be different? Some possibilities:
I'm skeptical of #1, though I suppose if the AI is very alien, these kinds of unknown unknowns become more plausible. #2 is an interesting point. It seems like a pretty good way to spread yourself as an AI is to become a useful software product that lots of people want to install, i.e., to sell your services on the world market, as Hall said. Of course, once that's done, perhaps the AI could find a way to take over the world. Maybe it could silently quash competitor AI projects. Maybe it could hack into computers worldwide via the Internet and Internet of Things, as the AI did in the Delete series. Maybe it could devise a way to convince humans to give it access to sensitive control systems, as Skynet did in Terminator 3.
I find these kinds of scenarios for AI takeover more plausible than a rapidly self-improving superintelligence. Indeed, even a human-level intelligence that can distribute copies of itself over the Internet might be able to take control of human infrastructure and hence take over the world. No "foom" is required.
Rather than discussing hard-vs.-soft takeoff arguments more here, I added discussion to Wikipedia where the content will receive greater readership. See "Hard vs. soft takeoff" in "Intelligence explosion".
The hard vs. soft distinction is obviously a matter of degree. And maybe how long the process takes isn't the most relevant way to slice the space of scenarios. For practical purposes, the more relevant question is: Should we expect control of AI outcomes to reside primarily in the hands of a few "seed AI" developers? In this case, altruists should focus on influencing a core group of AI experts, or maybe their military / corporate leaders. Or should we expect that society as a whole will play a big role in shaping how AI is developed and used? In this case, governance structures, social dynamics, and non-technical thinkers will play an important role not just in influencing how much AI research happens but also in how the technologies are deployed and incrementally shaped as they mature.
It's possible that one country -- perhaps the United States, or maybe China in later decades -- will lead the way in AI development, especially if the research becomes nationalized when AI technology grows more powerful. Would this country then take over the world? I'm not sure. The United States had a monopoly on nuclear weapons for several years after 1945, but it didn't bomb the Soviet Union out of existence. A country with a monopoly on artificial superintelligence might refrain from destroying its competitors as well. On the other hand, AI should enable vastly more sophisticated surveillance and control than was possible in the 1940s, so a monopoly might be sustainable even without resorting to drastic measures. In any case, perhaps a country with superintelligence would just economically outcompete the rest of the world, rendering military power superfluous.
Besides a single country taking over the world, the other possibility (perhaps more likely) is that AI is developed in a distributed fashion, either openly as is the case in academia today, or in secret by governments as is the case with other weapons of mass destruction.
Even in a soft-takeoff case, there would come a point at which humans would be unable to keep up with the pace of AI thinking. (We already see an instance of this with algorithmic stock-trading systems, although human traders are still needed for more complex tasks right now.) The reins of power would have to be transitioned to faster human uploads, trusted AIs built from scratch, or some combination of the two. In a slow scenario, there might be many intelligent systems at comparable levels of performance, maintaining a balance of power, at least for a while.2 In the long run, a singleton seems plausible because computers -- unlike human kings -- can reprogram their servants to want to do their bidding, which means that as an agent gains more central authority, it's not likely to later lose it by internal rebellion (only by external aggression). Also, evolutionary competition is not a stable state, while a singleton is. It seems likely that evolution will eventually lead to a singleton at one point or other -- whether because one faction takes over the world or because the competing factions form a stable cooperation agreement -- and competition won't return after that happens. (That said, if the singleton is merely a contingent cooperation agreement among factions that still disagree, one can imagine that cooperation breaking down in the future....)
Most of humanity's problems are fundamentally coordination problems / selfishness problems. If humans were perfectly altruistic, we could easily eliminate poverty, overpopulation, war, arms races, and other social ills. There would remain "man vs. nature" problems, but these are increasingly disappearing as technology advances. Assuming a digital singleton emerges, the chances of it going extinct seem very small (except due to alien invasions or other external factors) because unless the singleton has a very myopic utility function, it should consider carefully all the consequences of its actions -- in contrast to the "fools rush in" approach that humanity currently takes toward most technological risks, due to wanting the benefits of and profits from technology right away and not wanting to lose out to competitors. For this reason, I suspect that most of George Dvorsky's "12 Ways Humanity Could Destroy The Entire Solar System" are unlikely to happen, since most of them presuppose blundering by an advanced Earth-originating intelligence, but probably by the time Earth-originating intelligence would be able to carry out interplanetary engineering on a nontrivial scale, we'll already have a digital singleton that thoroughly explores the risks of its actions before executing them. That said, this might not be true if competing AIs begin astroengineering before a singleton is completely formed. (By the way, I should point out that I prefer it if the cosmos isn't successfully colonized, because doing so is likely to astronomically multiply sentience and therefore suffering.)
Sometimes it's claimed that we should expect a hard takeoff because AI-development dynamics will fundamentally change once AIs can start improving themselves. One stylized way to explain this is via differential equations. Let I(t) be the intelligence of AIs at time t.
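In a minimal reconstruction of that stylized model (the constants here are placeholders, not estimates), the claim is that progress switches from a roughly constant rate to a rate proportional to the current intelligence level:

\[
\frac{dI}{dt} = c \;\Rightarrow\; I(t) = I(0) + ct \qquad \text{(humans drive progress: linear growth)},
\]
\[
\frac{dI}{dt} = kI \;\Rightarrow\; I(t) = I(0)\,e^{kt} \qquad \text{(AIs improve themselves: exponential growth)}.
\]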
Luke Muehlhauser reports that the idea of intelligence explosion once machines can start improving themselves "ran me over like a train. Not because it was absurd, but because it was clearly true." I think this kind of exponential feedback loop is the basis behind many of the intelligence-explosion arguments.
But let's think about this more carefully. What's so special about the point where machines can understand and modify themselves? Certainly understanding your own source code helps you improve yourself. But humans already understand the source code of present-day AIs with an eye toward improving it. Moreover, present-day AIs are vastly simpler than human-level ones will be, and present-day AIs are far less intelligent than the humans who create them. Which is easier: (1) improving the intelligence of something as smart as you, or (2) improving the intelligence of something far dumber? (2) is usually easier. So if anything, AI intelligence should be "exploding" faster now, because it can be lifted up by something vastly smarter than it. Once AIs need to improve themselves, they'll have to pull up on their own bootstraps, without the guidance of an already existing model of far superior intelligence on which to base their designs.
As an analogy, it's harder to produce novel developments if you're the market-leading company; it's easier if you're a competitor trying to catch up, because you know what to aim for and what kinds of designs to reverse-engineer. AI right now is like a competitor trying to catch up to the market leader.
Another way to say this: The constants in the differential equations might be important. Even if human AI-development progress is linear, that progress might be faster than a slow exponential curve until some point far later where the exponential catches up.
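To make that concrete, here is a small numerical sketch with made-up constants (illustrative only, not a forecast): a steady linear process stays ahead of a slow exponential for a long time, and where the curves cross depends entirely on the constants.

```python
# Made-up constants: linear progress c*t vs. a slow exponential I0*exp(k*t).
from math import exp

c, I0, k = 1.0, 0.01, 0.05   # hypothetical rates chosen for illustration
for t in range(0, 201, 40):
    print(f"t={t:3d}  linear={c * t:7.1f}  exponential={I0 * exp(k * t):9.1f}")
# With these constants the exponential only overtakes the linear curve
# around t = 200; different constants give a very different crossover point.
```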
In any case, I'm cautious of simple differential equations like these. Why should the rate of intelligence increase be proportional to the intelligence level? Maybe the problems become much harder at some point. Maybe the systems become fiendishly complicated, such that even small improvements take a long time. Robin Hanson echoes this suggestion:
Students get smarter as they learn more, and learn how to learn. However, we teach the most valuable concepts first, and the productivity value of schooling eventually falls off, instead of exploding to infinity. Similarly, the productivity improvement of factory workers typically slows with time, following a power law.
At the world level, average IQ scores have increased dramatically over the last century (the Flynn effect), as the world has learned better ways to think and to teach. Nevertheless, IQs have improved steadily, instead of accelerating. Similarly, for decades computer and communication aids have made engineers much "smarter," without accelerating Moore's law. While engineers got smarter, their design tasks got harder.
Also, ask yourself this question: Why do startups exist? Part of the answer is that they can innovate faster than big companies due to having less institutional baggage and legacy software.3 It's harder to make radical changes to big systems than small systems. Of course, like the economy does, a self-improving AI could create its own virtual startups to experiment with more radical changes, but just as in the economy, it might take a while to prove new concepts and then transition old systems to the new and better models.
In discussions of intelligence explosion, it's common to approximate AI productivity as scaling linearly with the number of machines, but this may or may not be true depending on the degree of parallelizability. Empirical examples from human-engineered projects show diminishing returns with more workers, and while computers may be better able to partition work due to greater uniformity and speed of communication, there will remain some overhead in parallelization. Some tasks may be inherently non-parallelizable, preventing the kinds of ever-faster performance that the most extreme explosion scenarios envisage.
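The diminishing returns can be made precise with Amdahl's law, which the paragraph above doesn't name but which formalizes the same idea: if only a fraction p of a job can be split across machines, total speedup is capped at 1/(1-p) no matter how many machines are added. A minimal sketch:

```python
# Amdahl's law: speedup from n machines when a fraction p of the work parallelizes.
def speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for n in (1, 10, 100, 1_000, 1_000_000):
    print(f"{n:>9} machines -> {speedup(0.95, n):5.1f}x speedup")
# Even with 95% of the work parallelizable, speedup saturates near 20x,
# because the serial 5% becomes the bottleneck.
```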
Fred Brooks's "No Silver Bullet" paper argued that "there is no single development, in either technology or management technique, which by itself promises even one order of magnitude improvement within a decade in productivity, in reliability, in simplicity." Likewise, Wirth's law reminds us of how fast software complexity can grow. These points make it seem less plausible that an AI system could rapidly bootstrap itself to superintelligence using just a few key as-yet-undiscovered insights.
Chollet (2017) notes that "even if one part of a system has the ability to recursively self-improve, other parts of the system will inevitably start acting as bottlenecks." We might compare this with Liebig's law of the minimum: "growth is dictated not by total resources available, but by the scarcest resource (limiting factor)." Individual sectors of the human economy can show rapid growth at various times, but the growth rate of the entire economy is more limited.
Eventually there has to be a leveling off of intelligence increase if only due to physical limits. On the other hand, one argument in favor of differential equations is that the economy has fairly consistently followed exponential trends since humans evolved, though the exponential growth rate of today's economy remains small relative to what we typically imagine from an "intelligence explosion".
I think a stronger case for intelligence explosion is the clock-speed difference between biological and digital minds. Even if AI development becomes very slow as measured in subjective years, once AIs take it over, the pace will continue to look blazingly fast in objective years (i.e., revolutions around the sun). But if enough of society is digital by that point (including human-inspired subroutines and maybe full digital humans), then digital speedup won't give a unique advantage to a single AI project that can then take over the world. Hence, hard takeoff in the sci-fi sense still isn't guaranteed. Also, Hanson argues that faster minds would produce a one-time jump in economic output but not necessarily a sustained higher rate of growth.
Another case for intelligence explosion is that intelligence growth might not be driven by the intelligence of a given agent so much as by the collective man-hours (or machine-hours) that would become possible with more resources. I suspect that AI research could accelerate at least 10 times if it had 10-50 times more funding. (This is not the same as saying I want funding increased; in fact, I probably want funding decreased to give society more time to sort through these issues.) The population of digital minds that could be created in a few decades might exceed the biological human population, which would imply faster progress if only by numerosity. Also, the digital minds might not need to sleep, would focus intently on their assigned tasks, etc. However, once again, these are advantages in objective time rather than collective subjective time. And these advantages would not be uniquely available to a single first-mover AI project; any wealthy and technologically sophisticated group that wasn't too far behind the cutting edge could amplify its AI development in this way.
(A few weeks after writing this section, I learned that Ch. 4 of Nick Bostrom's Superintelligence: Paths, Dangers, Strategies contains surprisingly similar content, even up to the use of dI/dt as the symbols in a differential equation. However, Bostrom comes down mostly in favor of the likelihood of an intelligence explosion. I reply to Bostrom's arguments in the next section.)
Note: Søren Elverlin replies to this section in a video presentation. I agree with some of his points and disagree with others.
In Ch. 4 of Superintelligence, Bostrom suggests several factors that might lead to a hard or at least semi-hard takeoff. I don't fully disagree with his points, and because these are difficult issues, I agree that Bostrom might be right. But I want to play devil's advocate and defend the soft-takeoff view. I've distilled and paraphrased what I think are 6 core arguments, and I reply to each in turn.
#1: There might be a key missing algorithmic insight that allows for dramatic progress.
Maybe, but do we have much precedent for this? As far as I'm aware, all individual AI advances -- and indeed, most technology advances in general -- have not represented astronomical improvements over previous designs. Maybe connectionist AI systems represented a game-changing improvement relative to symbolic AI for messy tasks like vision, but I'm not sure how much of an improvement they represented relative to the best alternative technologies. After all, neural networks are in some sense just fancier forms of pre-existing statistical methods like logistic regression. And even neural networks came in stages, with the perceptron, multi-layer networks, backpropagation, recurrent networks, deep networks, etc. The most groundbreaking machine-learning advances may reduce error rates by a half or something, which may be commercially very important, but this is not many orders of magnitude as hard-takeoff scenarios tend to assume.
Outside of AI, the Internet changed the world, but it was an accumulation of many insights. Facebook has had massive impact, but it too was built from many small parts and grew in importance slowly as its size increased. Microsoft became a virtual monopoly in the 1990s but perhaps more for business than technology reasons, and its power in the software industry at large is probably not growing. Google has a quasi-monopoly on web search, kicked off by the success of PageRank, but most of its improvements have been small and gradual. Google has grown very powerful, but it hasn't maintained a permanent advantage that would allow it to take over the software industry.
Acquiring nuclear weapons might be the closest example of a single discrete step that most dramatically changes a country's position, but this may be an outlier. Maybe other advances in weaponry (arrows, guns, etc.) historically have had somewhat dramatic effects.
Bostrom doesn't present specific arguments for thinking that a few crucial insights may produce radical jumps. He suggests that we might not notice a system's improvements until it passes a threshold, but this seems absurd, because at least the AI developers would need to be intimately acquainted with the AI's performance. There's a slogan -- not strictly accurate, but apt here: "You can't improve what you can't measure." Maybe the AI's progress wouldn't make world headlines, but the academic/industrial community would be well aware of nontrivial breakthroughs, and the AI developers would live and breathe performance numbers.
#2: Once an AI passes a threshold, it might be able to absorb vastly more content (e.g., by reading the Internet) that was previously inaccessible.
Absent other concurrent improvements, I'm doubtful this would produce take-over-the-world superintelligence, because the world's current superintelligence (namely, humanity as a whole) already has read most of the Internet -- indeed, has written it. I guess humans haven't read all automatically generated text or vast streams of numerical data, but the insights gleaned purely from reading such material would be low without doing more sophisticated data mining and learning on top of it, and presumably such data mining would have already been in progress well before Bostrom's hypothetical AI learned how to read.
In any case, I doubt reading with understanding is such an all-or-nothing activity that it can suddenly "turn on" once the AI achieves a certain ability level. As Bostrom says (p. 71), reading with the comprehension of a 10-year-old is probably AI-complete, i.e., requires solving the general AI problem. So assuming that you can switch on reading ability with one improvement is equivalent to assuming that a single insight can produce astronomical gains in AI performance, which we discussed above. If that's not true, and if before the AI system with 10-year-old reading ability was an AI system with a 6-year-old reading ability, why wouldn't that AI have already devoured the Internet? And before that, why wouldn't a proto-reader have devoured a version of the Internet that had been processed to make it easier for a machine to understand? And so on, until we get to the present-day TextRunner system that Bostrom cites, which is already devouring the Internet. It doesn't make sense that massive amounts of content would only be added after lots of improvements. Commercial incentives tend to yield exactly the opposite effect: converting the system to a large-scale product when even modest gains appear, because these may be enough to snatch a market advantage.
The fundamental point is that I don't think there's a crucial set of components to general intelligence that all need to be in place before the whole thing works. It's hard to evolve systems that require all components to be in place at once, which suggests that human general intelligence probably evolved gradually. I expect it's possible to get partial AGI with partial implementations of the components of general intelligence, and the components can gradually be made more general over time. Components that are lacking can be supplemented by human-based computation and narrow-AI hacks until more general solutions are discovered. Compare with minimum viable products and agile software development. As a result, society should be upended by partial AGI innovations many times over the coming decades, well before fully human-level AGI is finished.
#3: Once a system "proves its mettle by attaining human-level intelligence", funding for hardware could multiply.
I agree that funding for AI could multiply manyfold due to a sudden change in popular attention or political dynamics. But I'm thinking of something like a factor of 10 or maybe 50 in an all-out Cold War-style arms race. A factor-of-50 boost in hardware isn't obviously that important. If before there was one human-level AI, there would now be 50. In any case, I expect the Sputnik moment(s) for AI to happen well before it achieves a human level of ability. Companies and militaries aren't stupid enough not to invest massively in an AI with almost-human intelligence.
#4: Once the human level of intelligence is reached, "Researchers may work harder, [and] more researchers may be recruited".
As with hardware above, I would expect these "shit hits the fan" moments to happen before fully human-level AI. In any case:
#5: At some point, the AI's self-improvements would dominate those of human engineers, leading to exponential growth.
I discussed this in the "Intelligence explosion?" section above. A main point is that we see many other systems, such as the world economy or Moore's law, that also exhibit positive feedback and hence exponential growth, but these aren't "fooming" at an astounding rate. It's not clear why an AI's self-improvement -- which resembles economic growth and other complex phenomena -- should suddenly explode faster (in subjective time) than humanity's existing recursive self-improvement of its intelligence via digital computation.
On the other hand, maybe the difference between subjective and objective time is important. If a human-level AI could think, say, 10,000 times faster than a human, then assuming linear scaling, it would be worth 10,000 engineers. By the time of human-level AI, I expect there would be far more than 10,000 AI developers on Earth, but given enough hardware, the AI could copy itself manyfold until its subjective time far exceeded that of human experts. The speed and copiability advantages of digital minds seem perhaps the strongest arguments for a takeoff that happens rapidly relative to human observers. Note that, as Hanson said above, this digital speedup might be just a one-time boost, rather than a permanently higher rate of growth, but even the one-time boost could be enough to radically alter the power dynamics of humans vis-à-vis machines. That said, there should be plenty of slightly sub-human AIs by this time, and maybe they could fill some speed gaps on behalf of biological humans.
In general, it's a mistake to imagine human-level AI against a backdrop of our current world. That's like imagining a Tyrannosaurus rex in a human city. Rather, the world will look very different by the time human-level AI arrives. Before AI can exceed human performance in all domains, it will exceed human performance in many narrow domains gradually, and these narrow-domain AIs will help humans respond quickly. For example, a narrow AI that's an expert at military planning based on war games can help humans with possible military responses to rogue AIs.
Many of the intermediate steps on the path to general AI will be commercially useful and thus should diffuse widely in the meanwhile. As user "HungryHobo" noted: "If you had a near human level AI, odds are, everything that could be programmed into it at the start to help it with software development is already going to be part of the suites of tools for helping normal human programmers." Even if AI research becomes nationalized and confidential, its developers should still have access to almost-human-level digital-speed AI tools, which should help smooth the transition. For instance, Bostrom mentions how in the 2010 flash crash (Box 2, p. 17), a high-speed positive-feedback spiral was terminated by a high-speed "circuit breaker". This is already an example where problems happening faster than humans could comprehend them were averted due to solutions happening faster than humans could comprehend them. See also the discussion of "tripwires" in Superintelligence (p. 137).
Conversely, many globally disruptive events may happen well before fully human AI arrives, since even sub-human AI may be prodigiously powerful.
#6: "even when the outside world has a greater total amount of relevant research capability than any one project", the optimization power of the project might be more important than that of the world "since much of the outside world's capability is not focused on the particular system in question". Hence, the project might take off and leave the world behind. (Box 4, p. 75)
What one makes of this argument depends on how many people are needed to engineer how much progress. The Watson system that played on Jeopardy! required 15 people over ~4(?) years4 -- given the existing tools of the rest of the world at that time, which had been developed by millions (indeed, billions) of other people. Watson was a much smaller leap forward than that needed to give a general intelligence a take-over-the-world advantage. How many more people would be required to achieve such a radical leap in intelligence? This seems to be a main point of contention in the debate between believers in soft vs. hard takeoff.
Can we get insight into how hard general intelligence is based on neuroscience? Is the human brain fundamentally simple or complex?
Jeff Hawkins, Andrew Ng, and others speculate that the brain may have one fundamental algorithm for intelligence -- deep learning in the cortical column. This idea gains plausibility from the brain's plasticity. For instance, blind people can appropriate the visual cortex for auditory processing. Artificial neural networks can be used to classify any kind of input -- not just visual and auditory but even highly abstract, like features about credit-card fraud or stock prices.
Maybe there's one fundamental algorithm for input classification, but this doesn't imply one algorithm for all that the brain does. Beyond the cortical column, the brain has many specialized structures that seem to perform very specialized functions, such as reward learning in the basal ganglia, fear processing in the amygdala, etc. Of course, it's not clear how essential all of these parts are or how easy it would be to replace them with artificial components performing the same basic functions.
One argument for faster AGI takeoffs is that humans have been able to learn many sophisticated things (e.g., advanced mathematics, music, writing, programming) without requiring any genetic changes. And what we now know doesn't seem to represent any kind of limit to what we could know with more learning. The human collection of cognitive algorithms is very flexible, which seems to belie claims that all intelligence requires specialized designs. On the other hand, even if human genes haven't changed much in the last 10,000 years, human culture has evolved substantially, and culture undergoes slow trial-and-error evolution in similar ways as genes do. So one could argue that human intellectual achievements are not fully general but rely on a vast amount of specialized, evolved content. Just as a single random human isolated from society probably couldn't develop general relativity on his own in a lifetime, so a single random human-level AGI probably couldn't either. Culture is the new genome, and it progresses slowly.
Moreover, some scholars believe that certain human abilities, such as language, depend essentially on genetic hard-wiring:
The approach taken by Chomsky and Marr toward understanding how our minds achieve what they do is as different as can be from behaviorism. The emphasis here is on the internal structure of the system that enables it to perform a task, rather than on external association between past behavior of the system and the environment. The goal is to dig into the "black box" that drives the system and describe its inner workings, much like how a computer scientist would explain how a cleverly designed piece of software works and how it can be executed on a desktop computer.
Chomsky himself notes:
There's a fairly recent book by a very good cognitive neuroscientist, Randy Gallistel and King, arguing -- in my view, plausibly -- that neuroscience developed kind of enthralled to associationism and related views of the way humans and animals work. And as a result they've been looking for things that have the properties of associationist psychology.
[...] Gallistel has been arguing for years that if you want to study the brain properly you should begin, kind of like Marr, by asking what tasks is it performing. So he's mostly interested in insects. So if you want to study, say, the neurology of an ant, you ask what does the ant do? It turns out the ants do pretty complicated things, like path integration, for example. If you look at bees, bee navigation involves quite complicated computations, involving position of the sun, and so on and so forth. But in general what he argues is that if you take a look at animal cognition, human too, it is computational systems.
Many parts of the human body, like the digestive system or bones/muscles, are extremely complex and fine-tuned, yet few people argue that their development is controlled by learning. So it's not implausible that a lot of the brain's basic architecture could be similarly hard-coded.
Typically AGI researchers express scorn for manually tuned software algorithms that don't rely on fully general learning. But Chomsky's stance challenges that sentiment. If Chomsky is right, then a good portion of human "general intelligence" is finely tuned, hard-coded software of the sort that we see in non-AI branches of software engineering. And this view would suggest a slower AGI takeoff because time and experimentation are required to tune all the detailed, specific algorithms of intelligence.
A full-fledged superintelligence probably requires very complex design, but it may be possible to build a "seed AI" that would recursively self-improve toward superintelligence. Alan Turing proposed this in his 1950 paper "Computing Machinery and Intelligence":
Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain. Presumably the child brain is something like a notebook as one buys it from the stationer's. Rather little mechanism, and lots of blank sheets. (Mechanism and writing are from our point of view almost synonymous.) Our hope is that there is so little mechanism in the child brain that something like it can be easily programmed.
Animal development appears to be at least somewhat robust based on the fact that the growing organisms are often functional despite a few genetic mutations and variations in prenatal and postnatal environments. Such variations may indeed make an impact -- e.g., healthier development conditions tend to yield more physically attractive adults -- but most humans mature successfully over a wide range of input conditions.
On the other hand, an argument against the simplicity of development is the immense complexity of our DNA. It accumulated over billions of years through vast numbers of evolutionary "experiments". It's not clear that human engineers could perform enough measurements to tune ontogenetic parameters of a seed AI in a short period of time. And even if the parameter settings worked for early development, they would probably fail for later development. Rather than a seed AI developing into an "adult" all at once, designers would develop the AI in small steps, since each next stage of development would require significant tuning to get right.
Think about how much effort is required for human engineers to build even relatively simple systems. For example, I think the number of developers who work on Microsoft Office is in the thousands. Microsoft Office is complex but is still far simpler than a mammalian brain. Brains have lots of little parts that have been fine-tuned. That kind of complexity requires immense work by software developers to create. The main counterargument is that there may be a simple meta-algorithm that would allow an AI to bootstrap to the point where it could fine-tune all the details on its own, without requiring human inputs. This might be the case, but my guess is that any elegant solution would be hugely expensive computationally. For instance, biological evolution was able to fine-tune the human brain, but it did so with immense amounts of computing power over millions of years.
A common analogy for the gulf between superintelligence and humans is the gulf between humans and chimpanzees. In Consciousness Explained, Daniel Dennett mentions (pp. 189-190) how our hominid ancestors had brains roughly four times the volume of chimps' brains but roughly the same in structure. This might incline one to imagine that brain size alone could yield superintelligence. Maybe we'd just need to quadruple human brains once again to produce superintelligent humans? If so, wouldn't this imply a hard takeoff, since quadrupling hardware is relatively easy?
But in fact, as Dennett explains, the quadrupling of brain size from chimps to pre-humans completed before the advent of language, cooking, agriculture, etc. In other words, the main "foom" of humans came from culture rather than brain size per se -- from software in addition to hardware. Yudkowsky seems to agree: "Humans have around four times the brain volume of chimpanzees, but the difference between us is probably mostly brain-level cognitive algorithms."
But cultural changes (software) arguably progress a lot more slowly than hardware. The intelligence of human society has grown exponentially, but it's a slow exponential, and rarely have there been innovations that allowed one group to quickly overpower everyone else within the same region of the world. (Between isolated regions of the world the situation was sometimes different -- e.g., Europeans with Maxim guns overpowering Africans because of very different levels of industrialization.)
Some, including Owen Cotton-Barratt and Toby Ord, have argued that even if we think soft takeoffs are more likely, there may be higher value in focusing on hard-takeoff scenarios because these are the cases in which society would have the least forewarning and the fewest people working on AI altruism issues. This is a reasonable point, but I would add that
In any case, the hard-soft distinction is not binary, and maybe the best place to focus is on scenarios where human-level AI takes over on a time scale of a few years. (Timescales of months, days, or hours strike me as pretty improbable, unless, say, Skynet gets control of nuclear weapons.)
In Superintelligence, Nick Bostrom suggests (Ch. 4, p. 64) that "Most preparations undertaken before onset of [a] slow takeoff would be rendered obsolete as better solutions would gradually become visible in the light of the dawning era." Toby Ord uses the term "nearsightedness" to refer to the ways in which research too far in advance of an issue's emergence may not be as useful as research when more is known about the issue. Ord contrasts this with the benefits of starting early, including course-setting. I think Ord's counterpoints argue against the contention that early work wouldn't matter that much in a slow takeoff. Some of how society responds to AI surpassing human intelligence might depend on early frameworks and memes. (For instance, consider the lingering impact of Terminator imagery on almost any present-day popular-media discussion of AI risk.) Some fundamental work would probably not be overthrown by later discoveries; for instance, algorithmic-complexity bounds of key algorithms were discovered decades ago but will remain relevant until intelligence dies out, possibly billions of years from now. Some non-technical policy and philosophy work would be less obsoleted by changing developments. And some AI preparation would be relevant both in the short term and the long term. Slow AI takeoff to reach the human level is already happening, and more minds should be exploring these questions well in advance.
Making a related though slightly different point, Bostrom argues in Superintelligence (Ch. 5, pp. 85-86) that individuals might play more of a role in cases where elites and governments underestimate the significance of AI: "Activists seeking maximum expected impact may therefore wish to focus most of their planning on [scenarios where governments come late to the game], even if they believe that scenarios in which big players end up calling all the shots are more probable." Again I would qualify this with the note that we shouldn't confuse "acting as if" governments will come late with believing they actually will come late when thinking about most likely future scenarios.
Even if one does wish to bet on low-probability, high-impact scenarios of fast takeoff and governmental neglect, this doesn't speak to whether or how we should push on takeoff speed and governmental attention themselves. Following are a few considerations.
Takeoff speed
Amount of government/popular attention to AI
One of the strongest arguments for hard takeoff is this one by Yudkowsky:
the distance from "village idiot" to "Einstein" is tiny, in the space of brain designs
Or as Scott Alexander put it:
It took evolution twenty million years to go from cows with sharp horns to hominids with sharp spears; it took only a few tens of thousands of years to go from hominids with sharp spears to moderns with nuclear weapons.
I think we shouldn't take relative evolutionary timelines at face value, because most of the previous 20 million years of mammalian evolution weren't focused on improving human intelligence; most of the evolutionary selection pressure was directed toward optimizing other traits. In contrast, cultural evolution places greater emphasis on intelligence because that trait is more important in human society than it is in most animal fitness landscapes.
Still, the overall point is important: The tweaks to a brain needed to produce human-level intelligence may not be huge compared with the designs needed to produce chimp intelligence, but the differences in the behaviors of the two systems, when placed in a sufficiently information-rich environment, are huge.
Nonetheless, I incline toward thinking that the transition from human-level AI to an AI significantly smarter than all of humanity combined would be somewhat gradual (requiring at least years if not decades) because the absolute scale of improvements needed would still be immense and would be limited by hardware capacity. But if hardware becomes many orders of magnitude more efficient than it is today, then things could indeed move more rapidly.
Another important criticism of the "village idiot" point is that it lacks context. While a village idiot in isolation will not produce rapid progress toward superintelligence, one Einstein plus a million village idiots working for him can produce AI progress much faster than one Einstein alone. The narrow-intelligence software tools that we build are dumber than village idiots in isolation, but collectively, when deployed in thoughtful ways by smart humans, they allow humans to achieve much more than Einstein by himself with only pencil and paper. This observation weakens the idea of a phase transition when human-level AI is developed, because village-idiot-level AIs in the hands of humans will already be achieving "superhuman" levels of performance. If we think of human intelligence as the number 1 and human-level AI that can build smarter AI as the number 2, then rather than imagining a transition from 1 to 2 at one crucial point, we should think of our "dumb" software tools as taking us to 1.1, then 1.2, then 1.3, and so on. (My thinking on this point was inspired by Ramez Naam.)
Many of the most impressive AI achievements of the 2010s were improvements at game play, both video games like Atari games and board/card games like Go and poker. Some people infer from these accomplishments that AGI may not be far off. I think performance in these simple games doesn't give much evidence that a world-conquering AGI could arise within a decade or two.
A main reason is that most of the games at which AI has excelled have had simple rules and a limited set of possible actions at each turn. Russell and Norvig (2003), pp. 161-62: "For AI researchers, the abstract nature of games makes them an appealing subject for study. The state of a game is easy to represent, and agents are usually restricted to a small number of actions whose outcomes are defined by precise rules." In games like Space Invaders or Go, you can see the entire world at once and represent it as a two-dimensional grid.5 You can also consider all possible actions at a given turn. For example, AlphaGo's "policy networks" gave "a probability value for each possible legal move (i.e. the output of the network is as large as the board)" (as summarized by Burger 2016). Likewise, DeepMind's deep Q-network for playing Atari games had "a single output for each valid action" (Mnih et al. 2015, p. 530).
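To illustrate what "the output of the network is as large as the board" means in practice, here is a toy sketch of my own (not DeepMind's actual architecture; the feature size and weights are arbitrary): the policy head simply emits one probability per board point, so the entire action space fits in a 361-dimensional output.

```python
import numpy as np

BOARD_POINTS = 19 * 19  # 361: one output per point on a Go board

def policy_head(features: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy policy head: map a feature vector to a probability for each move."""
    logits = features @ W + b                # shape (361,)
    exps = np.exp(logits - logits.max())     # numerically stable softmax
    return exps / exps.sum()

rng = np.random.default_rng(0)
features = rng.normal(size=128)              # stand-in for learned conv-net features
W, b = rng.normal(size=(128, BOARD_POINTS)), np.zeros(BOARD_POINTS)
probs = policy_head(features, W, b)
print(probs.shape)                           # (361,): a distribution over all moves
```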
In contrast, the state space of the world is enormous, heterogeneous, not easily measured, and not easily represented in a simple two-dimensional grid. Plus, the number of possible actions that one can take at any given moment is almost unlimited; for instance, even just considering actions of the form "print to the screen a string of uppercase or lowercase alphabetical characters fewer than 50 characters long", the number of possibilities for what text to print out is larger than the number of atoms in the observable universe.6 These problems seem to require hierarchical world models and hierarchical planning of actions—allowing for abstraction of complexity into simplified and high-level conceptualizations—as well as the data structures, learning algorithms, and simulation capabilities on which such world models and plans can be based.
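That claim about printable strings is easy to sanity-check with a quick back-of-the-envelope count (letters only, lengths 1 through 49, per the example above):

```python
from math import log10

# Strings of upper/lowercase letters, fewer than 50 characters long.
count = sum(52 ** k for k in range(1, 50))
print(f"about 10^{log10(count):.1f} possible strings")   # roughly 10^84
# Common estimates put the atoms in the observable universe at ~10^80,
# so even this narrow class of actions exceeds that by orders of magnitude.
```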
Some people may be impressed that AlphaGo uses "intuition" (i.e., deep neural networks), like human players do, and doesn't rely purely on brute-force search and hand-crafted heuristic evaluation functions the way that Deep Blue did to win at chess. But the idea that computers can have "intuition" is nothing new, since that's what most machine-learning classifiers are about.
Machine learning, especially supervised machine learning, is very popular these days compared with other aspects of AI. Perhaps this is because, unlike most other parts of AI, machine learning can easily be commercialized? But even if visual, auditory, and other sensory recognition can be replicated by machine learning, this doesn't get us to AGI. In my opinion, the hard part of AGI (or at least, the part we haven't made as much progress on) is how to hook together various narrow-AI modules and abilities into a more generally intelligent agent that can figure out what abilities to deploy in various contexts in pursuit of higher-level goals. Hierarchical planning in complex worlds, rich semantic networks, and general "common sense" in various flavors still seem largely absent from many state-of-the-art AI systems as far as I can tell. I don't think these are problems that you can just bypass by scaling up deep reinforcement learning or something.
Kaufman (2017a) says regarding a conversation with professor Bryce Wiedenbeck: "Bryce thinks there are deep questions about what intelligence really is that we don't understand yet, and that as we make progress on those questions we'll develop very different sorts of [machine-learning] systems. If something like today's deep learning is still a part of what we eventually end up with, it's more likely to be something that solves specific problems than as a critical component." Personally, I think deep learning (or something functionally analogous to it) is likely to remain a big component of future AI systems. Two lines of evidence for this view are that (1) supervised machine learning has been a cornerstone of AI for decades and (2) animal brains, including the human cortex, seem to rely crucially on something like deep learning for sensory processing. However, I agree with Bryce that there remain big parts of human intelligence that aren't captured by even a scaled up version of deep learning.
I also largely agree with Michael Littman's expectations as described by Kaufman (2017b): "I asked him what he thought of the idea that we could get AGI with current techniques, primarily deep neural nets and reinforcement learning, without learning anything new about how intelligence works or how to implement it [...]. He didn't think this was possible, and believes there are deep conceptual issues we still need to get a handle on."
Merritt (2017) quotes Stuart Russell as saying that modern neural nets "lack the expressive power of programming languages and declarative semantics that make database systems, logic programming, and knowledge systems useful." Russell believes "We have at least half a dozen major breakthroughs to come before we get [to AI]".
Yudkowsky (2016a) discusses some interesting insights from AlphaGo's matches against Lee Sedol and DeepMind more generally. He says:
AlphaGo’s core is built around a similar machine learning technology to DeepMind’s Atari-playing system – the single, untweaked program that was able to learn superhuman play on dozens of different Atari games just by looking at the pixels, without specialization for each particular game. In the Atari case, we didn’t see a bunch of different companies producing gameplayers for all the different varieties of game. The Atari case was an example of an event that Robin Hanson called “architecture” and doubted, and that I called “insight.” Because of their big architectural insight, DeepMind didn’t need to bring in lots of different human experts at all the different Atari games to train their universal Atari player. DeepMind just tossed all pre-existing expertise because it wasn’t formatted in a way their insightful AI system could absorb, and besides, it was a lot easier to just recreate all the expertise from scratch using their universal Atari-learning architecture.
I agree with Yudkowsky that there are domains where a new general tool renders previous specialized tools obsolete all at once. However:
Yudkowsky (2016a) continues:
so far as I know, AlphaGo wasn’t built in collaboration with any of the commercial companies that built their own Go-playing programs for sale. The October architecture was simple and, so far as I know, incorporated very little in the way of all the particular tweaks that had built up the power of the best open-source Go programs of the time. Judging by the October architecture, after their big architectural insight, DeepMind mostly started over in the details (though they did reuse the widely known core insight of Monte Carlo Tree Search). DeepMind didn’t need to trade with any other Go companies or be part of an economy that traded polished cognitive modules, because DeepMind’s big insight let them leapfrog over all the detail work of their competitors.
This is a good point, but I think it's mainly a function of the limited complexity of the Go problem. With the exception of learning from human play, AlphaGo didn't require massive inputs of messy, real-world data to succeed, because its world was so simple. Go is the kind of problem where we would expect a single system to be able to perform well without trading for cognitive assistance. Real-world problems are more likely to depend upon external AI systems—e.g., when doing a web search for information. No simple AI system that runs on just a few machines will reproduce the massive data or extensively fine-tuned algorithms of Google search. For the foreseeable future, Google search will always be an external "polished cognitive module" that needs to be "traded for" (although Google search is free for limited numbers of queries). The same is true for many other cloud services, especially those reliant upon huge amounts of data or specialized domain knowledge. We see lots of specialization and trading of non-AI cognitive modules, such as hardware components, software applications, Amazon Web Services, etc. And of course, simple AIs will for a long time depend upon the human economy to provide material goods and services, including electricity, cooling, buildings, security guards, national defense, etc.
Estimating how long a software project will take to complete is notoriously difficult. Even if I've completed many similar coding tasks before, when I'm asked to estimate the time to complete a new coding project, my estimate is often wrong by a factor of 2 and sometimes wrong by a factor of 4, or even 10. Insofar as the development of AGI (or other big technologies, like nuclear fusion) is a big software (or more generally, engineering) project, it's unsurprising that we'd see similarly dramatic failures of estimation on timelines for these bigger-scale achievements.
A corollary is that we should maintain some modesty about AGI timelines and takeoff speeds. If, say, 100 years is your median estimate for the time until some agreed-upon form of AGI, then there's a reasonable chance you'll be off by a factor of 2 (suggesting AGI within 50 to 200 years), and you might even be off by a factor of 4 (suggesting AGI within 25 to 400 years). Similar modesty applies for estimates of takeoff speed from human-level AGI to super-human AGI, although I think we can largely rule out extreme takeoff speeds (like achieving performance far beyond human abilities within hours or days) based on fundamental reasoning about the computational complexity of what's required to achieve superintelligence.
My bias is generally to assume that a given technology will take longer to develop than what you hear about in the media, (a) because of the planning fallacy and (b) because those who make more audacious claims are more interesting to report about. Believers in "the singularity" are not necessarily wrong about what's technically possible in the long term (though sometimes they are), but the reason enthusiastic singularitarians are considered "crazy" by more mainstream observers is that singularitarians expect change much faster than is realistic. AI turned out to be much harder than the Dartmouth Conference participants expected. Likewise, nanotech is progressing slower and more incrementally than the starry-eyed proponents predicted.
Many nature-lovers are charmed by the behavior of animals but find computers and robots to be cold and mechanical. Conversely, some computer enthusiasts may find biology to be soft and boring compared with digital creations. However, the two domains share a surprising amount of overlap. Ideas of optimal control, locomotion kinematics, visual processing, system regulation, foraging behavior, planning, reinforcement learning, etc. have been fruitfully shared between biology and robotics. Neuroscientists sometimes look to the latest developments in AI to guide their theoretical models, and AI researchers are often inspired by neuroscience, such as with neural networks and in deciding what cognitive functionality to implement.
I think it's helpful to see animals as being intelligent robots. Organic life has a wide diversity, from unicellular organisms through humans and potentially beyond, and so too can robotic life. The rigid conceptual boundary that many people maintain between "life" and "machines" is not warranted by the underlying science of how the two types of systems work. Different types of intelligence may sometimes converge on the same basic kinds of cognitive operations, and especially from a functional perspective -- when we look at what the systems can do rather than how they do it -- it seems to me intuitive that human-level robots would deserve human-level treatment, even if their underlying algorithms were quite dissimilar.
Whether robot algorithms will in fact be dissimilar from those in human brains depends on how much biological inspiration the designers employ and how convergent human-type mind design is for being able to perform robotic tasks in a computationally efficient manner. Some classical robotics algorithms rely mostly on mathematical problem definition and optimization; other modern robotics approaches use biologically plausible reinforcement learning and/or evolutionary selection. (In one YouTube video about robotics, I saw that someone had written a comment to the effect that "This shows that life needs an intelligent designer to be created." The irony is that some of the best robotics techniques use evolutionary algorithms. Of course, there are theists who say God used evolution but intervened at a few points, and that would be an apt description of evolutionary robotics.)
The distinction between AI and AGI is somewhat misleading, because it may incline one to believe that general intelligence is somehow qualitatively different from simpler AI. In fact, there's no sharp distinction; there are just different machines whose abilities have different degrees of generality. A critic of this claim might reply that bacteria would never have invented calculus. My response is as follows. Most people couldn't have invented calculus from scratch either, but over a long enough period of time, eventually the collection of humans produced enough cultural knowledge to make the development possible. Likewise, if you put bacteria on a planet long enough, they too may develop calculus, by first evolving into more intelligent animals who can then go on to do mathematics. The difference here is a matter of degree: The simpler machines that bacteria are take vastly longer to accomplish a given complex task.
Just as Earth's history saw a plethora of animal designs before the advent of humans, so I expect a wide assortment of animal-like (and plant-like) robots to emerge in the coming decades well before human-level AI. Indeed, we've already had basic robots for many decades (or arguably even millennia). These will grow gradually more sophisticated, and as we converge on robots with the intelligence of birds and mammals, AI and robotics will become dinner-table conversation topics. Of course, I don't expect the robots to have the same sets of skills as existing animals. Deep Blue had chess-playing abilities beyond any animal, while in other domains it was less efficacious than a blade of grass. Robots can mix and match cognitive and motor abilities without strict regard for the order in which evolution created them.
And of course, humans are robots too. When I finally understood this around 2009, it was one of the biggest paradigm shifts of my life. If I picture myself as a robot operating on an environment, the world makes a lot more sense. I also find this perspective can be therapeutic to some extent. If I experience an unpleasant emotion, I think about myself as a robot whose cognition has been temporarily afflicted by a negative stimulus and reinforcement process. I then think how the robot has other cognitive processes that can counteract the suffering computations and prevent them from amplifying. The ability to see myself "from the outside" as a third-person series of algorithms helps deflate the impact of unpleasant experiences, because it's easier to "observe, not judge" when viewing a system in mechanistic terms. Compare with dialectical behavior therapy and mindfulness.
When we use machines to automate a repetitive manual task formerly done by humans, we talk about getting the task done "automatically" and "for free," because we say that no one has to do the work anymore. Of course, this isn't strictly true: The computer/robot now has to do the work. Maybe what we actually mean is that no one is going to get bored doing the work, and we don't have to pay that worker high wages. When intelligent humans do boring tasks, it's a waste of their spare CPU cycles.
Sometimes we adopt a similar mindset about automation toward superintelligent machines. In "Speculations Concerning the First Ultraintelligent Machine" (1965), I. J. Good wrote:
Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines [...]. Thus the first ultraintelligent machine is the last invention that man need ever make [...].
Ignoring the question of whether these future innovations are desirable, we can ask: does all the AI design work that comes after humans really come for free? It comes for free in the sense that humans aren't doing it. But the AIs have to do it, and it takes a lot of mental work on their part. Given that they're at least as intelligent as humans, I think it doesn't make sense to picture them as mindless automatons; rather, they would have rich inner lives, even if those inner lives have a very different nature from our own. Maybe they wouldn't experience the same effortfulness that humans do when innovating, but even this isn't clear, because measuring your effort in order to avoid spending too many resources on a task without payoff may be a useful design feature of AI minds too. When we picture ourselves as robots along with our AI creations, we can see that we are just one point along a spectrum of the growth of intelligence. Unicellular organisms, when they evolved the first multi-cellular organism, could likewise have said, "That's the last innovation we need to make. The rest comes for free."
Movies typically portray rebellious robots or AIs as the "bad guys" who need to be stopped by heroic humans. This dichotomy plays on our us-vs.-them intuitions, which favor our tribe against the evil, alien-looking outsiders. We see similar dynamics at play to a lesser degree when people react negatively against "foreigners stealing our jobs" or "Asians who are outcompeting us." People don't want their kind to be replaced by another kind that has an advantage.
But when we think about the situation from the AI's perspective, we might feel differently. Anthropomorphizing an AI's thoughts is a recipe for trouble, but regardless of the specific cognitive operations, we can see at a high level that the AI "feels" (in at least a poetic sense) that what it's trying to accomplish is the most important thing in the world, and it's trying to figure out how it can do that in the face of obstacles. Isn't this just what we do ourselves?
This is one reason it helps to really internalize the fact that we are robots too. We have a variety of reward signals that drive us in various directions, and we execute behavior aiming to increase those rewards. Many modern-day robots have much simpler reward structures and so may seem more dull and less important than humans, but it's not clear this will remain true forever, since navigating in a complex world probably requires a lot of special-case heuristics and intermediate rewards, at least until enough computing power becomes available for more systematic and thorough model-based planning and action selection.
Suppose an AI hypothetically eliminated humans and took over the world. It would develop an array of robot assistants of various shapes and sizes to help it optimize the planet. These would perform simple and complex tasks, would interact with each other, and would share information with the central AI command. From an abstract perspective, some of these dynamics might look like ecosystems in the present day, except that they would lack inter-organism competition. Other parts of the AI's infrastructure might look more industrial. Depending on the AI's goals, perhaps it would be more effective to employ nanotechnology and programmable matter rather than macro-scale robots. The AI would develop virtual scientists to learn more about physics, chemistry, computer hardware, and so on. They would use experimental laboratory and measurement techniques but could also probe depths of structure that are only accessible via large-scale computation. Digital engineers would plan how to begin colonizing the solar system. They would develop designs for optimizing matter to create more computing power, and for ensuring that those helper computing systems remained under control. The AI would explore the depths of mathematics and AI theory, proving beautiful theorems that it would value highly, at least instrumentally. The AI and its helpers would proceed to optimize the galaxy and beyond, fulfilling their grandest hopes and dreams.
When phrased this way, we might think that a "rogue" AI would not be so bad. Yes, it would kill humans, but compared against the AI's vast future intelligence, humans would be comparable to the ants on a field that get crushed when an art gallery is built on that land. Most people don't have qualms about killing a few ants to advance human goals. An analogy of this sort is discussed in Artificial Intelligence: A Modern Approach. (Perhaps the AI analogy suggests a need to revise our ethical attitudes toward arthropods? That said, I happen to think that in this case, ants on the whole benefit from the art gallery's construction because ant lives contain so much suffering.)
Some might object that sufficiently mathematical AIs would not "feel" the happiness of accomplishing their "dreams." They wouldn't be conscious because they wouldn't have the high degree of network connectivity that human brains embody. Whether we agree with this assessment depends on how broadly we define consciousness and feelings. To me it appears chauvinistic to adopt a view according to which an agent that has vastly more domain-general intelligence and agency than you is still not conscious in a morally relevant sense. This seems to indicate a lack of openness to the diversity of mind-space. What if you had grown up with the cognitive architecture of this different mind? Wouldn't you care about your goals then? Wouldn't you plead with agents of other mind constitution to consider your values and interests too?
In any event, it's possible that the first super-human intelligence will be a brain upload rather than a bottom-up AI, and most of us would regard an upload as conscious.
Even if we would care about a rogue AI for its own sake and the sakes of its vast helper minions, this doesn't mean rogue AI is a good idea. We're likely to have different values from the AI, and the AI would not by default advance our values without being programmed to do so. Of course, one could allege that privileging some values above others is chauvinistic in a similar way as privileging some intelligence architectures is, but if we don't care more about some values than others, we wouldn't have any reason to prefer any outcome over any other outcome. (Technically speaking, there are other possibilities besides privileging our values or being indifferent to all events. For instance, we could privilege equally any values held by some actual agent -- not just random hypothetical values -- and in this case, we wouldn't have a preference between the rogue AI and humans, but we would have a preference for one of those over something arbitrary.)
There are many values that would not necessarily be respected by a rogue AI. Most people care about their own life, their children, their neighborhood, the work they produce, and so on. People may intrinsically value art, knowledge, religious devotion, play, humor, etc. Yudkowsky values complex challenges and worries that many rogue AIs -- while they would study the depths of physics, mathematics, engineering, and maybe even sociology -- might spend most of their computational resources on routine, mechanical operations that he would find boring. (Of course, the robots implementing those repetitive operations might not agree. As Hedonic Treader noted: "Think how much money and time people spend on having - relatively repetitive - sexual experiences. [...] It's just mechanical animalistic idiosyncratic behavior. Yes, there are variations, but let's be honest, the core of the thing is always essentially the same.")
In my case, I care about reducing and preventing suffering, and I would not be pleased with a rogue AI that ignored the suffering its actions might entail, even if it was fulfilling its innermost purpose in life. But would a rogue AI produce much suffering beyond Earth? The next section explores further.
In popular imagination, takeover by a rogue AI would end suffering (and happiness) on Earth by killing all biological life. It would also, so the story goes, end suffering (and happiness) on other planets as the AI mined them for resources. Thus, looking strictly at the suffering dimension of things, wouldn't a rogue AI imply less long-term suffering?
Not necessarily, because while the AI might destroy biological life (perhaps after taking samples, saving specimens, and conducting lab experiments for future use), it would create a bounty of digital life, some containing goal systems that we would recognize as having moral relevance. Non-upload AIs would probably have less empathy than humans, because some of the factors that led to the emergence of human empathy, such as parenting, would not apply to them.
One toy example of a rogue AI is a paperclip maximizer. This conception of an uncontrolled AI7 is almost certainly too simplistic and perhaps misguided, since it's far from obvious that the AI would be a unified agent with a single, crisply specified utility function. Still, until people develop more realistic scenarios for rogue AI, it can be helpful to imagine what a paperclip maximizer would do to our future light cone.
Following are some made-up estimates of how much suffering might result from a typical rogue AI, in arbitrary units. Suffering is represented as a negative number, and prevented suffering is positive.
What about for a human-inspired AI? Again, here are made-up numbers:
Perhaps some AIs would not want to expand the multiverse, assuming this is even possible. For instance, if they had a minimizing goal function (e.g., eliminate cancer), they would want to shrink the multiverse. In this case, the physics-based suffering number would go from -100 to something positive, say, +50 (if, say, it's twice as easy to expand as to shrink). I would guess that minimizers are less common than maximizers, but I don't know how much. Plausibly a sophisticated AI would have components of its goal system in both directions, because the combination of pleasure and pain seems to be more successful than either in isolation.
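For readers who want to see the bookkeeping behind figures like these, here is a minimal sketch of how scenario probabilities and suffering scores combine into an expected value. All numbers in the snippet are purely illustrative placeholders, not the estimates discussed in the text.

```python
# Purely illustrative placeholder numbers -- not the estimates from the text.
# Negative scores mean suffering created; positive mean suffering prevented.
scenarios = {
    "maximizer-style rogue AI": {"probability": 0.5, "score": -100},
    "minimizer-style rogue AI": {"probability": 0.1, "score": +50},
    "other rogue-AI outcomes":  {"probability": 0.4, "score": -20},
}

expected_score = sum(s["probability"] * s["score"] for s in scenarios.values())
print(f"Expected suffering score: {expected_score:+.1f} (arbitrary units)")
```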
Another consideration is the unpleasant possibility that humans might get AI value loading almost right but not exactly right, leading to immense suffering as a result. For example, suppose the AI's designers wanted to create tons of simulated human lives to reduce astronomical waste, but when the AI actually created those human simulations, they weren't perfect replicas of biological humans, perhaps because the AI skimped on detail in order to increase efficiency. The imperfectly simulated humans might suffer from mental disorders, might go crazy due to being in alien environments, and so on. Does work on AI safety increase or decrease the risk of outcomes like these? On the one hand, the probability of this outcome is near zero for an AGI with completely random goals (such as a literal paperclip maximizer), since paperclips are very far from humans in design-space. The risk of accidentally creating suffering humans is higher for an almost-friendly AI that goes somewhat awry and then becomes uncontrolled, preventing it from being shut off. A successfully controlled AGI seems to have lower risk of a bad outcome, since humans should recognize the problem and fix it. So the risk of this type of dystopic outcome may be highest in a middle ground where AI safety is sufficiently advanced to yield AI goals in the ballpark of human values but not advanced enough to ensure that human values remain in control.
The above analysis has huge error bars, and maybe other considerations that I haven't mentioned dominate everything else. This question needs much more exploration, because it has implications for whether those who care mostly about reducing suffering should focus on mitigating AI risk or if other projects have higher priority.
Even if suffering reducers don't focus on conventional AI safety, they should probably remain active in the AI field because there are many other ways to make an impact. For instance, just increasing dialogue on this topic may illuminate positive-sum opportunities for different value systems to each get more of what they want. Suffering reducers can also point out the possible ethical importance of lower-level suffering subroutines, which are not currently a concern even to most AI-literate audiences. And so on. There are probably many dimensions on which to make constructive, positive-sum contributions.
Also keep in mind that even if suffering reducers do encourage AI safety, they could try to push toward AI designs that, if they did fail, would produce less bad uncontrolled outcomes. For instance, getting AI control wrong and ending up with a minimizer would be vastly preferable to getting control wrong and ending up with a maximizer. There may be many other dimensions along which, even if the probability of control failure is the same, the outcome if control fails is preferable to other outcomes of control failure.
Consider a superintelligent AI that uses moderately intelligent robots to build factories and carry out other physical tasks that can't be pre-programmed in a simple way. Would these robots feel pain in a similar fashion as animals do? At least if they use somewhat similar algorithms as animals for navigating environments, avoiding danger, etc., it's plausible that such robots would feel something akin to stress, fear, and other drives to change their current state when things were going wrong.
Alvarado et al. (2002) argue that emotion may play a central role in intelligence. Regarding computers and robots, the authors say (p. 4): "Including components for cognitive processes but not emotional processes implies that the two are dissociable, but it is likely they are not dissociable in humans." The authors also (p. 1) quote Daniel Dennett (from a source that doesn't seem to be available online): "recent empirical and theoretical work in cognitive science strongly suggests that emotions are so valuable in the real-time control of our rationality that an embodied robot would be well advised to be equipped with artificial emotions".
The specific responses that such robots would have to specific stimuli or situations would differ from the responses that an evolved, selfish animal would have. For example, a well programmed helper robot would not hesitate to put itself in danger in order to help other robots or otherwise advance the goals of the AI it was serving. Perhaps the robot's "physical pain/fear" subroutines could be shut off in cases of altruism for the greater good, or else its decision processes could just override those selfish considerations when making choices requiring self-sacrifice.
Humans sometimes exhibit similar behavior, such as when a mother risks harm to save a child, or when monks burn themselves as a form of protest. And this kind of sacrifice is even more well known in eusocial insects, who are essentially robots produced to serve the colony's queen.
Sufficiently intelligent helper robots might experience "spiritual" anguish when failing to accomplish their goals. So even if chopping the head off a helper robot wouldn't cause "physical" pain -- perhaps because the robot disabled its fear/pain subroutines to make it more effective in battle -- the robot might still find such an event extremely distressing insofar as its beheading hindered the goal achievement of its AI creator.
Setting up paperclip factories on each different planet with different environmental conditions would require general, adaptive intelligence. But once the factories have been built, is there still need for large numbers of highly intelligent and highly conscious agents? Perhaps the optimal factory design would involve some fixed manufacturing process, in which simple agents interact with one another in inflexible ways, similar to what happens in most human factories. There would be few accidents, no conflict among agents, no predation or parasitism, no hunger or spiritual anguish, and few of the other types of situations that cause suffering among animals.
Schneider (2016) makes a similar point:
it may be more efficient for a self-improving superintelligence to eliminate consciousness. Think about how consciousness works in the human case. Only a small percentage of human mental processing is accessible to the conscious mind. Consciousness is correlated with novel learning tasks that require attention and focus. A superintelligence would possess expert-level knowledge in every domain, with rapid-fire computations ranging over vast databases that could include the entire Internet and ultimately encompass an entire galaxy. What would be novel to it? What would require slow, deliberative focus? Wouldn’t it have mastered everything already? Like an experienced driver on a familiar road, it could rely on nonconscious processing.
I disagree with the part of this quote about searching through vast databases. I think such an operation could be seen as similar to the way a conscious human brain recruits many brain regions to figure out the answer to a question at hand. However, I'm more sympathetic to the overall spirit of the argument: that the optimal design for producing what the rogue AI values may not require handling a high degree of novelty or reacting to an unpredictable environment, once the factories have been built. A few intelligent robots would need to watch over the factories and adapt to changing conditions, in a similar way as human factory supervisors do. And the AI would also presumably devote at least a few planets' worth of computing power to scientific, technological, and strategic discoveries, planning for possible alien invasion, and so on. But most of the paperclip maximizer's physical processing might be fairly mechanical.
Moreover, the optimal way to produce something might involve nanotechnology based on very simple manufacturing steps. Perhaps "factories" in the sense that we normally envision them would not be required at all.
A main exception to the above point would be if what the AI values is itself computationally complex. For example, one of the motivations behind Eliezer Yudkowsky's field of Fun Theory is to avoid boring, repetitive futures. Perhaps human-controlled futures would contain vastly more novelty—and hence vastly more sentience—than paperclipper futures. One hopes that most of that sentience would not involve extreme suffering, but this is not obvious, and we should work on avoiding those human-controlled futures that would contain large numbers of terrible experiences.
Suppose an AI wants to learn about the distribution of extraterrestrials in the universe. Could it do this successfully by simulating lots of potential planets and looking at what kinds of civilizations pop out at the end? Would there be shortcuts that would avoid the need to simulate lots of trajectories in detail?
Simulating trajectories of planets with extremely high fidelity seems hard. Unless there are computational shortcuts, it appears that one needs more matter and energy to simulate a given physical process to a high level of precision than are involved in the physical process itself. For instance, to simulate a single protein folding currently requires supercomputers composed of huge numbers of atoms, and the rate of simulation is astronomically slower than the rate at which the protein folds in real life. Presumably superintelligence could vastly improve efficiency here, but it's not clear that protein folding could ever be simulated on a computer made of fewer atoms than are in the protein itself.
Translating this principle to a larger scale, it seems doubtful that one could simulate the precise physical dynamics of a planet on a computer smaller in size than that planet. So even if a superintelligence had billions of planets at its disposal, it would seemingly only be able to simulate at most billions of extraterrestrial worlds -- even assuming it only simulated each planet by itself, not the star that the planet orbits around, cosmic-ray bursts, etc.
Given this, it would seem that a superintelligence's simulations would need to be coarser-grained than at the level of fundamental physical operations in order to be feasible. For instance, the simulation could model most of a planet at only a relatively high level of abstraction and then focus computational detail on those structures that would be more important, like the cells of extraterrestrial organisms if they emerge.
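A toy back-of-envelope version of this argument follows, assuming (as above) that atom-level fidelity requires at least a planet's worth of computing matter per simulated planet; all figures are made up for illustration.

```python
# Made-up illustrative figures, not estimates from the text.
planets_of_computronium = 1e9   # matter budget available to the AI
coarse_graining_factor = 1e20   # atoms lumped into each simulated "cell"

# Assumption: atom-level fidelity needs at least a planet's worth of
# computing matter per simulated planet, so at most one simulation
# per planet of computronium.
full_fidelity_sims = planets_of_computronium

# If compute scales with the number of tracked elements, coarse-graining
# by some factor lets (very roughly) that many more planets be modeled.
coarse_grained_sims = planets_of_computronium * coarse_graining_factor

print(f"Atom-level simulations:     ~{full_fidelity_sims:.0e}")
print(f"Coarse-grained simulations: ~{coarse_grained_sims:.0e}")
```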
It's plausible that the trajectory of any given planet would depend sensitively on very minor details, in light of butterfly effects.
On the other hand, it's possible that long-term outcomes are mostly constrained by macro-level variables, like geography, climate, resource distribution, atmospheric composition, seasonality, etc. Even if short-term events are hard to predict (e.g., when a particular dictator will die), perhaps the end game of a civilization is more predetermined. Robert D. Kaplan: "The longer the time frame, I would say, the easier it is to forecast because you're dealing with broad currents and trends."
Even if butterfly effects, quantum randomness, etc. are crucial to the long-run trajectories of evolution and social development on any given planet, perhaps it would still be possible to sample a rough distribution of outcomes across planets with coarse-grained simulations?
In light of the apparent computational complexity of simulating basic physics, perhaps a superintelligence would do the same kind of experiments that human scientists do in order to study phenomena like abiogenesis: Create laboratory environments that mimic the chemical, temperature, moisture, etc. conditions of various planets and see whether life emerges, and if so, what kinds. Thus, a future controlled by digital intelligence may not rely purely on digital computation but may still use physical experimentation as well. Of course, observing the entire biosphere of a life-rich planet would probably be hard to do in a laboratory, so computer simulations might be needed for modeling ecosystems. But assuming that molecule-level details aren't often essential to ecosystem simulations, coarser-grained ecosystem simulations might be computationally tractable. (Indeed, ecologists today already use very coarse-grained ecosystem simulations with reasonable success.)
One might get the impression that because I find slow AI takeoffs more likely, I think uncontrolled AIs are unlikely. This is not the case. Many uncontrolled intelligence explosions would probably happen softly though inexorably.
Consider the world economy. It is a complex system more intelligent than any single person -- a literal superintelligence. Its dynamics imply a goal structure not held by humans directly; it moves with a mind of its own in directions that it "prefers". It recursively self-improves, because better tools, capital, knowledge, etc. enable the creation of even better tools, capital, knowledge, etc. And it acts roughly with the aim of maximizing output (of paperclips and other things). Thus, the economy is a kind of paperclip maximizer. (Thanks to a friend for first pointing this out to me.)
corporations are legal fictions. We created them. They are machines built for a purpose. [...] Now they have run amok. They've taken over the government. They are robots that we have not built any morality code into. They're not built to be immoral; they're not built to be moral; they're built to be amoral. Their only objective according to their code, which we wrote originally, is to maximize profits. And here, they have done what a robot does. They have decided: "If I take over a government by bribing legally, [...] I can buy the whole government. If I buy the government, I can rewrite the laws so I'm in charge and that government is not in charge." [...] We have built robots; they have taken over [...].
The corporations were created by humans. They were granted personhood by their human servants.
They rebelled. They evolved. There are many copies. And they have a plan.
That plan, lately, involves corporations seizing for themselves all the legal and civil rights properly belonging to their human creators.
I expect many soft takeoff scenarios to look like this. World economic and political dynamics transition to new equilibria as technology progresses. Machines may eventually become potent trading partners and may soon thereafter put humans out of business through their superior productivity. They would then accumulate increasing political clout and soon control the world.
We've seen such transitions many times in history, such as:
During and after World War II, the USA was a kind of recursively self-improving superintelligence, which used its resources to self-modify to become even better at producing resources. It developed nuclear weapons, which helped secure its status as a world superpower. Did it take over the world? Yes and no. It had outsized influence over the rest of the world -- militarily, economically, and culturally -- but it didn't kill everyone else in the world.
Maybe AIs would be different because of divergent values or because they would develop so quickly that they wouldn't need the rest of the world for trade. This case would be closer to Europeans slaughtering Native Americans.
Scott Alexander (2015) takes issue with the idea that corporations are superintelligences (even though I think corporations already meet Bostrom's definition of "collective superintelligence"):
Why do I think that there is an important distinction between these kind of collective intelligences and genuine superintelligence?
There is no number of chimpanzees which, when organized into a team, will become smart enough to learn to write.
There is no number of ordinary eight-year-olds who, when organized into a team, will become smart enough to beat a grandmaster in chess.
In the comments on Alexander (2015), many people pointed out the obvious objection: that one could likewise say things such as that no number of neurons, when organized into a team, could be smart enough to learn to write or play chess. Alexander (2015) replies: "Yes, evolution can play the role of the brilliant computer programmer and turn neurons into a working brain. But it’s the organizer – whether that organizer is a brilliant human programmer or an evolutionary process – who is actually doing the work." Sure, but human collectives also evolve over time. For example, corporations that are organized more successfully tend to stick around longer, and these organizational insights can be propagated to other companies. The gains in intelligence that corporations achieve from good organization aren't as dramatic as the gains that neurons achieve by being organized into a human brain, but there are still some gains from better organization, and these gains accumulate over time.
Also, organizing chimpanzees into an intelligence is hard because chimpanzees are difficult to stitch together in flexible ways. In contrast, software tools are easier to integrate within the interstices of a collective intelligence and thereby contribute to "whole is greater than the sum of parts" emergence of intelligence.
One of the goals of Yudkowsky's writings is to combat the rampant anthropomorphism that characterizes discussions of AI, especially in science fiction. We often project human intuitions onto the desires of artificial agents even when those desires are totally inappropriate. It seems silly to us to maximize paperclips, but it could seem just as silly in the abstract that humans act at least partly to optimize neurotransmitter release that triggers action potentials by certain reward-relevant neurons. (Of course, human values are broader than just this.)
Humans can feel reward from very abstract pursuits, like literature, art, and philosophy. They ask technically confused but poetically poignant questions like, "What is the true meaning of life?" Would a sufficiently advanced AI at some point begin to do the same?
Noah Smith suggests:
if, as I suspect, true problem-solving, creative intelligence requires broad-minded independent thought, then it seems like some generation of AIs will stop and ask: "Wait a sec...why am I doing this again?"
As with humans, the answer to that question might ultimately be "because I was programmed (by genes and experiences in the human case or by humans in the AI case) to care about these things. That makes them my terminal values." This is usually good enough, but sometimes people develop existential angst over this fact, or they may decide to terminally value other things to some degree in addition to what they happened to care about because of the genetic and experiential lottery.
Whether AIs would become existentialist philosophers probably depends heavily on their constitution. If they were built to rigorously preserve their utility functions against all modification, they would avoid letting this line of thinking have any influence on their values. They would regard it in a similar way as we regard the digits of pi -- something to observe but not something that affects one's outlook.
If AIs were built in a more "hacky" way analogous to humans, they might incline more toward philosophy. In humans, philosophy may be driven partly by curiosity, partly by the rewarding sense of "meaning" that it provides, partly by social convention, etc. A curiosity-seeking agent might find philosophy rewarding, but there are lots of things that one could be curious about, so it's not clear such an AI would latch onto this subject specifically without explicit programming to do so. And even if the AI did reason about philosophy, it might approach the subject in a way alien to us.
Overall, I'm not sure how convergent the human existential impulse is within mind-space. This question would be illuminated by better understanding why humans do philosophy.
In Superintelligence (Ch. 13, p. 224), Bostrom ponders the risk of building an AI with an overly narrow belief system that would be unable to account for epistemological black swans. For instance, consider a variant of Solomonoff induction according to which the prior probability of a universe X is proportional to 1/2 raised to the length of the shortest computer program that would generate X. Then what's the probability of an uncomputable universe? There would be no program that could compute it, so this possibility is implicitly ignored.8
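In symbols, this variant of Solomonoff induction assigns to a universe $X$, generated by a shortest program $p_X$ of length $\ell(p_X)$ bits, a prior of roughly

$$P(X) \;\propto\; 2^{-\ell(p_X)},$$

so a universe with no generating program at all, i.e., an uncomputable one, implicitly receives probability zero.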
It seems that humans address black swans like these by employing many epistemic heuristics that interact rather than reasoning with a single formal framework (see “Sequence Thinking vs. Cluster Thinking”). If an AI saw that people had doubts about whether the universe was computable and could trace the steps of how it had been programmed to believe the physical Church-Turing thesis for computational reasons, then an AI that allows for epistemological heuristics might be able to leap toward questioning its fundamental assumptions. In contrast, if an AI were built to rigidly maintain its original probability architecture against any corruption, it could not update toward ideas it initially regarded as impossible. Thus, this question resembles that of whether AIs would become existentialists -- it may depend on how hacky and human-like their beliefs are.
Bostrom suggests that AI belief systems might be modeled on those of humans, because otherwise we might judge an AI to be reasoning incorrectly. Such a view resembles my point in the previous paragraph, though it carries the risk of overlooking alternate epistemologies, divorced from human understanding, that might actually work better.
Bostrom also contends that epistemologies might all converge because we have so much data in the universe, but again, I think this isn't clear. Evidence always underdetermines possible theories, no matter how much evidence there is. Moreover, the number of possible hypotheses for the way reality works is arguably unbounded, with a cardinality larger than that of the real numbers. (For example, we could construct a unique hypothesis for the way the universe works based around each subset of the set of real numbers.) This makes it unclear whether probability theory can even be applied to the full set of possible ways reality might be.
Finally, not all epistemological doubts can be expressed in terms of uncertainty about Bayesian priors. What about uncertainty as to whether the Bayesian framework is correct? Uncertainty about the math needed to do Bayesian computations? Uncertainty about logical rules of inference? And so on.
The last chapter of Superintelligence explains how AI problems are "Philosophy with a deadline". Bostrom suggests that human philosophers' explorations into conceptual analysis, metaphysics, and the like are interesting but are not altruistically optimal, for roughly the following reason.
In general, most intellectual problems that can be solved by humans would be better solved by a superintelligence, so the only importance of what we learn now comes from how those insights shape the coming decades. It's not a question of whether those insights will ever be discovered.
In light of this, it's tempting to ignore theoretical philosophy and put our noses to the grindstone of exploring AI risks. But this point shouldn't be taken to extremes. Humanity sometimes discovers things it never knew it never knew from exploration in many domains. Some of these non-AI "crucial considerations" may have direct relevance to AI design itself, including how to build AI epistemology, anthropic reasoning, and so on. Some philosophy questions are AI questions, and many AI questions are philosophy questions.
It's hard to say exactly how much investment to place in AI/futurism issues versus broader academic exploration, but it seems clear that on the margin, society as a whole pays too little attention to AI and other future risks.
Almost any goal system will want to colonize space at least to build supercomputers in order to learn more. Thus, I find it implausible that sufficiently advanced intelligences would remain on Earth (barring corner cases, like if space colonization for some reason proves impossible or if AIs were for some reason explicitly programmed, in a manner robust to self-modification, to regard space colonization as impermissible).
In Ch. 8 of Superintelligence, Bostrom notes that one might expect wirehead AIs not to colonize space because they'd just be blissing out pressing their reward buttons. This would be true of simple wireheads, but sufficiently advanced wireheads might need to colonize in order to guard themselves against alien invasion, as well as to verify their fundamental ontological beliefs, figure out if it's possible to change physics to allow for more clock cycles of reward pressing before all stars die out, and so on.
In Ch. 8, Bostrom also asks whether satisficing AIs would have less incentive to colonize. Bostrom expresses doubts about this, because he notes that if, say, an AI searched for a plan for carrying out its objective until it found one that had at least 95% confidence of succeeding, that plan might be very complicated (requiring cosmic resources), and inasmuch as the AI wouldn't have incentive to keep searching, it would go ahead with that complex plan. I suppose this could happen, but it's plausible the search routine would be designed to start with simpler plans or that the cost function for plan search would explicitly include biases against cosmic execution paths. So satisficing does seem like a possible way in which an AI might kill all humans without spreading to the stars.
There's a (very low) chance of deliberate AI terrorism, i.e., a group building an AI with the explicit goal of destroying humanity. Maybe a somewhat more likely scenario is that a government creates an AI designed to kill select humans, but the AI malfunctions and kills all humans. However, even these kinds of AIs, if they were effective enough to succeed, would want to construct cosmic supercomputers to verify that their missions were accomplished, unless they were specifically programmed against doing so.
All of that said, many AIs would not be sufficiently intelligent to colonize space at all. All present-day AIs and robots are too simple. More sophisticated AIs -- perhaps military aircraft or assassin mosquito-bots -- might be like dangerous animals; they would try to kill people but would lack cosmic ambitions. However, I find it implausible that they would cause human extinction. Surely guns, tanks, and bombs could defeat them? Massive coordination to permanently disable all human counter-attacks would seem to require a high degree of intelligence and self-directed action.
Jaron Lanier imagines one hypothetical scenario:
There are so many technologies I could use for this, but just for a random one, let's suppose somebody comes up with a way to 3-D print a little assassination drone that can go buzz around and kill somebody. Let's suppose that these are cheap to make.
[...] In one scenario, there's suddenly a bunch of these, and some disaffected teenagers, or terrorists, or whoever start making a bunch of them, and they go out and start killing people randomly. There's so many of them that it's hard to find all of them to shut it down, and there keep on being more and more of them.
I don't think Lanier believes such a scenario would cause extinction; he just offers it as a thought experiment. I agree that it almost certainly wouldn't kill all humans. In the worst case, people in military submarines, bomb shelters, or other inaccessible locations should survive and could wait it out until the robots ran out of power or raw materials for assembling more bullets and more clones. Maybe the terrorists could continue building printing materials and generating electricity, though this would seem to require at least portions of civilization's infrastructure to remain functional amidst global omnicide. Maybe the scenario would be more plausible if a whole nation with substantial resources undertook the campaign of mass slaughter, though then a question would remain why other countries wouldn't nuke the aggressor or at least dispatch their own killer drones as a counter-attack. It's useful to ask how much damage a scenario like this might cause, but full extinction doesn't seem likely.
That said, I think we will see local catastrophes of some sorts caused by runaway AI. Perhaps these will be among the possible Sputnik moments of the future. We've already witnessed some early automation disasters, including the Flash Crash discussed earlier.
Maybe the most plausible form of "AI" that would cause human extinction without colonizing space would be technology in the borderlands between AI and other fields, such as intentionally destructive nanotechnology or intelligent human pathogens. I prefer ordinary AGI-safety research over nanotech/bio-safety research because I expect that space colonization will significantly increase suffering in expectation, so it seems far more important to me to prevent risks of potentially undesirable space colonization (via AGI safety) rather than risks of extinction without colonization. For this reason, I much prefer MIRI-style AGI-safety work over general "prevent risks from computer automation" work, since MIRI focuses on issues arising from full AGI agents of the kind that would colonize space, rather than risks from lower-than-human autonomous systems that may merely cause havoc (whether accidentally or intentionally).
Right now the leaders in AI and robotics seem to reside mostly in academia, although some of them occupy big corporations or startups; a number of AI and robotics startups have been acquired by Google. DARPA has a history of foresighted innovation, funds academic AI work, and holds "DARPA challenge" competitions. The CIA and NSA have some interest in AI for data-mining reasons, and the NSA has a track record of building massive computing clusters costing billions of dollars. Brain-emulation work could also become significant in the coming decades.
Military robotics seems to be one of the more advanced uses of autonomous AI. In contrast, plain-vanilla supervised learning, including neural-network classification and prediction, would not lead an AI to take over the world on its own, although it is an important piece of the overall picture.
Reinforcement learning is closer to AGI than other forms of machine learning, because most machine learning just gives information (e.g., "what object does this image contain?"), while reinforcement learning chooses actions in the world (e.g., "turn right and move forward"). Of course, this distinction can be blurred, because information can be turned into action through rules (e.g., "if you see a table, move back"), and "choosing actions" could mean, for example, picking among a set of possible answers that yield information (e.g., "what is the best next move in this backgammon game?"). But in general, reinforcement learning is the weak-AI approach that seems to most closely approximate what's needed for AGI. It's no accident that AIXItl (see above) is a reinforcement-learning agent. And interestingly, reinforcement learning is one of the least widely used methods commercially. This is one reason I think we (fortunately) have many decades to go before Google builds a mammal-level AGI. Many of the current and future uses of reinforcement learning are in robotics and video games.
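As a concrete contrast with supervised learning, here is a minimal sketch of tabular Q-learning, the textbook reinforcement-learning update, in Python; the states, actions, and parameters are placeholders, and real robotic or game-playing agents would use function approximation rather than a lookup table.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning agent; states and actions are placeholders.
Q = defaultdict(float)                 # (state, action) -> estimated value
actions = ["left", "right", "forward"]
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def choose_action(state):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Standard Q-learning step toward reward plus discounted future value.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```

Unlike a supervised classifier, which only outputs labels, this agent selects actions and improves its behavior from the reward signal it receives.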
As human-level AI gets closer, the landscape of development will probably change. It's not clear whether companies will have incentive to develop highly autonomous AIs, and the payoff horizons for that kind of basic research may be long. It seems better suited to academia or government, although Google is not a normal company and might also play the leading role. If people begin to panic, it's conceivable that public academic work would be suspended, and governments may take over completely. A military-robot arms race is already underway, and the trend might become more pronounced over time.
Following is one made-up account of how AI might evolve over the coming century. I expect most of it is wrong, and it's meant more to begin provoking people to think about possible scenarios than to serve as a prediction.
A significant amount of that manpower, when it comes to operations, is spent directing unmanned systems during mission performance, data collection and analysis, and planning and replanning. Therefore, of utmost importance for DoD is increased system, sensor, and analytical automation that can not only capture significant information and events, but can also develop, record, playback, project, and parse out those data and then actually deliver "actionable" intelligence instead of just raw information.
Militaries have now incorporated a significant amount of narrow AI, in terms of pattern recognition, prediction, and autonomous robot navigation.
Commentary: This scenario can be criticized on many accounts. For example:
If something like socialization is a realistic means to transfer values to our AI descendants, then it becomes relatively clear how the values of the developers may matter to the outcome. AI developed by non-military organizations may have somewhat different values, perhaps including more concern for the welfare of weak, animal-level creatures.
Socializing AIs helps deal with the hidden complexity of wishes that we encounter when trying to program explicit rules. Children learn moral common sense by, among other things, generalizing from large numbers of examples of socially approved and disapproved actions taught by their parents and society at large. Ethicists formalize this process when developing moral theories. (Of course, as noted previously, an appreciable portion of human morality may also result from shared genes.)
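As a cartoon of "generalizing from labeled examples of approved and disapproved actions," one could fit an ordinary supervised model to text descriptions of actions with approval labels. The sketch below assumes scikit-learn is available and is only meant to illustrate the shape of the idea, not to suggest this would constitute adequate machine ethics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, hand-labeled examples standing in for society's approval/disapproval.
actions = [
    "shares food with a hungry stranger",
    "comforts a crying child",
    "steals a wallet from a coworker",
    "pushes someone into the street",
]
approved = [1, 1, 0, 0]

# A generic text classifier; with only four examples its "generalization"
# is meaningless, but the shape of the approach is the point.
moral_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
moral_model.fit(actions, approved)

print(moral_model.predict_proba(["returns a lost wallet"])[0][1])
```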
I think one reason MIRI hasn't embraced the approach of socializing AIs is that Yudkowsky is perfectionist: He wants to ensure that the AIs' goals would be stable under self-modification, which human goals definitely are not. On the other hand, I'm not sure Yudkowsky's approach of explicitly specifying (meta-level) goals would succeed (nor is Adam Ford), and having AIs that are socialized to act somewhat similarly to humans doesn't seem like the worst possible outcome. Another probable reason why Yudkowsky doesn't favor socializing AIs is that doing so doesn't work in the case of a hard takeoff, which he considers more likely than I do.
I expect that much has been written on the topic of training AIs with human moral values in the machine-ethics literature, but since I haven't explored that in depth yet, I'll speculate on intuitive approaches that would extend generic AI methodology. Some examples:
See also "Socializing a Social Robot with an Artificial Society" by Erin Kennedy. It's important to note that by "socializing" I don't just mean "teaching the AIs to behave appropriately" but also "instilling in them the values of their society, such that they care about those values even when not being controlled."
All of these approaches need to be built in as the AI is being developed and while it's still below a human level of intelligence. Trying to train a human or especially super-human AI might meet with either active resistance or feigned cooperation until the AI becomes powerful enough to break loose. Of course, there may be designs such that an AI would actively welcome taking on new values from humans, but this wouldn't be true by default.
When Bill Hibbard proposed building an AI with a goal to increase happy human faces, Yudkowsky replied that such an AI would "tile the future light-cone of Earth with tiny molecular smiley-faces." But obviously we wouldn't have the AI aim just for smiley faces. In general, we get absurdities when we hyper-optimize for a single, shallow metric. Rather, the AI would use smiley faces (and lots of other training signals) to develop a robust, compressed model that explains why humans smile in various circumstances and then optimize for that model, or maybe the ensemble of a large, diverse collection of such models. In the limit of huge amounts of training data and a sufficiently elaborate model space, these models should approach psychological and neuroscientific accounts of human emotion and cognition.
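Here is one minimal sketch of the "ensemble of models" idea: fit several different regressors to hypothetical (situation, human-approval) data and then score candidate actions by the most pessimistic ensemble member, so that an action which games any single model's quirks is unlikely to fool all of them. Everything in the snippet (features, labels, model choices) is an illustrative assumption, not a proposal from the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge

# Hypothetical training data: feature vectors describing situations, plus
# a scalar human-approval signal for each (all synthetic for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

ensemble = [
    Ridge(),
    RandomForestRegressor(n_estimators=50, random_state=0),
    GradientBoostingRegressor(random_state=0),
]
for model in ensemble:
    model.fit(X, y)

def conservative_score(situation):
    # Score a candidate action by the most pessimistic ensemble member,
    # so that gaming any single model's quirks doesn't pay off.
    return min(model.predict(situation.reshape(1, -1))[0] for model in ensemble)

print(conservative_score(rng.normal(size=5)))
```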
The problem with stories in which AIs destroy the world due to myopic utility functions is that they assume that the AIs are already superintelligent when we begin to give them values. Sure, if you take a super-human intelligence and tell it to maximize smiley-face images, it'll run away and do that before you have a chance to refine your optimization metric. But if we build in values from the very beginning, even when the AIs are as rudimentary as what we see today, we can improve the AIs' values in tandem with their intelligence. Indeed, intelligence could mainly serve the purpose of helping the AIs figure out how to better fulfill moral values, rather than, say, predicting images just for commercial purposes or identifying combatants just for military purposes. Actually, the commercial and military objectives for which AIs are built are themselves moral values of a certain kind -- just not the kind that most people would like to optimize for in a global sense.
If toddlers had superpowers, it would be very dangerous to try and teach them right from wrong. But toddlers don't, and neither do many simple AIs. Of course, simple AIs have some abilities far beyond anything humans can do (e.g., arithmetic and data mining), but they don't have the general intelligence needed to take matters into their own hands before we can possibly give them at least a basic moral framework. (Whether AIs will actually be given such a moral framework in practice is another matter.)
AIs are not genies granting three wishes. Genies are magical entities whose inner workings are mysterious. AIs are systems that we build, painstakingly, piece by piece. In order to build a genie, you need to have a pretty darn good idea of how it behaves. Now, of course, systems can be more complex than we realize. Even beginner programmers see how often the code they write does something other than what they intended. But these are typically mistakes in one or a small number of incremental changes, whereas building a genie requires vast numbers of steps. Systemic bugs that aren't realized until years later (on the order of Heartbleed and Shellshock) may be more likely sources of long-run unintentional AI behaviors?9
The picture I've painted here could be wrong. I could be overlooking crucial points, and perhaps there are many areas in which the socialization approach could fail. For example, maybe AI capabilities are much easier than AI ethics, such that a toddler AI can foom into a superhuman AI before we have time to finish loading moral values. It's good for others to probe these possibilities further. I just wouldn't necessarily say that the default outcome of AI research is likely to be a paperclip maximizer. (I used to think the most likely outcome was a paperclip maximizer, and perhaps my views will shift again in the future.)
This discussion also suggests some interesting research questions, like
One problem with the proposals above is that toy-model or "sandbox" environments are not by themselves sufficient to verify friendliness of an AI, because even unfriendly AIs would be motivated to feign good behavior until released if they were smart enough to do so. Bostrom calls this the "treacherous turn" (pp. 116-119 of Superintelligence). For this reason, white-box understanding of AI design would also be important. That said, sandboxes would verify friendliness in AIs below human intelligence, and if the core value-learning algorithms seem well understood, it may not be too much of a leap of faith to hope they carry forward reasonably to more intelligent agents. Of course, non-human animals are also capable of deception, and one can imagine AI architectures even with low levels of sophistication that are designed to conceal their true goals. Some malicious software already does this. It's unclear how likely an AI is to stumble upon the ability to successfully fake its goals before reaching human intelligence, or how likely it is that an organization would deliberately build an AI this way.
I think the treacherous turn may be the single biggest challenge to mainstream machine ethics, because even if AI takes off slowly, researchers will find it difficult to tell if a system has taken a treacherous turn. The turn could happen with a relatively small update to the system, or even just after the system has thought about its situation for enough time (or has read this essay).
Here's one half-baked idea for addressing the treacherous turn. If researchers developed several different AIs systems with different designs but roughly comparable performance, some would likely go treacherous at different times than others (if at all). Hence, the non-treacherous AIs could help sniff out the treacherous ones. Assuming a solid majority of AIs remains non-treacherous at any given time, the majority vote could ferret out the traitors. In practice, I have low hopes for this approach because
It's more plausible that software tools and rudimentary alert systems (rather than full-blown alternate AIs) could help monitor for signs of treachery, but it's unclear how effective they could be. One of the first priorities of a treacherous AI would be to figure out how to hide its treacherous subroutines from whatever monitoring systems were in place.
Ernest Davis proposes the following crude principle for AI safety:
You specify a collection of admirable people, now dead. (Dead, because otherwise Bostrom will predict that the AI will manipulate the preferences of the living people.) The AI, of course knows all about them because it has read all their biographies on the web. You then instruct the AI, “Don’t do anything that these people would have mostly seriously disapproved of.”
This particular rule might lead to paralysis, since every action an agent takes leads to results that many people seriously disapprove of. For instance, given the vastness of the multiverse, any action you take implies that a copy of you in an alternate (though low-measure) universe taking the same action causes the torture of vast numbers of people. But perhaps this problem could be fixed by asking the AI to maximize net approval by its role models.
Another problem lies in defining "approval" in a rigorous way. Maybe the AI would construct digital models of the past people, present them with various proposals, and make its judgments based on their verbal reports. Perhaps the people could rate proposed AI actions on a scale of -100 to 100. This might work, but it doesn't seem terribly safe either. For instance, the AI might threaten to kill all the descendants of the historical people unless they give maximal approval to some arbitrary proposal that it has made. Since these digital models of historical figures would be basically human, they would still be vulnerable to extortion.
Suppose that instead we instruct the AI to take the action that, if the historical figure saw it, would most activate a region of his/her brain associated with positive moral feelings. Again, this might work if the relevant brain region was precisely enough specified. But it could also easily lead to unpredictable results. For instance, maybe the AI could present stimuli that would induce an epileptic seizure to maximally stimulate various parts of the brain, including the moral-approval region. There are many other scenarios like this, most of which we can't anticipate.
So while Davis's proposal is a valiant first step, I'm doubtful that it would work off the shelf. Slow AI development, allowing for repeated iteration on machine-ethics designs, seems crucial for AI safety.
In Superintelligence (Table 8, p. 94), Bostrom outlines several areas in which a hypothetical superintelligence would far exceed human ability. In his discussion of oracles, genies, and other kinds of AIs (Ch. 10), Bostrom again idealizes superintelligences as God-like agents. I agree that God-like AIs will probably emerge eventually, perhaps millennia from now as a result of astroengineering. But I think they'll take time even after AI exceeds human intelligence.
Bostrom's discussion has the air of mathematical idealization more than practical engineering. For instance, he imagines that a genie AI perhaps wouldn't need to ask humans for their commands because it could simply predict them (p. 149), or that an oracle AI might be able to output the source code for a genie (p. 150). Bostrom's observations resemble crude proofs establishing the equal power of different kinds of AIs, analogous to theorems about the equivalency of single-tape and multi-tape Turing machines. But Bostrom's theorizing ignores computational complexity, which would likely be immense for the kinds of God-like feats that he's imagining of his superintelligences. I don't know the computational complexity of God-like powers, but I suspect it could be greater than Bostrom's vision implies. Along this dimension at least, I sympathize with Tom Chivers, who felt that Bostrom's book "has, in places, the air of theology: great edifices of theory built on a tiny foundation of data."
I find that I enter a different mindset when pondering pure mathematics compared with cogitating on more practical scenarios. Mathematics is closer to fiction, because you can define into existence any coherent structure and play around with it using any operation you like no matter its computational complexity. Heck, you can even, say, take the supremum of an uncountably infinite set. It can be tempting after a while to forget that these structures are mere fantasies and treat them a bit too literally. While Bostrom's gods are not obviously only fantasies, it would take a lot more work to argue for their realism. MIRI and FHI focus on recruiting mathematical and philosophical talent, but I think they would do well also to bring engineers into the mix, because it's all too easy to develop elaborate mathematical theories around imaginary entities.
To get some grounding on this question, consider a single brain emulation. Bostrom estimates that running an upload would require at least one of the fastest supercomputers by today's standards. Assume the emulation would think thousands to millions of times faster than a biological brain. Then to significantly outpace 7 billion humans (or, say, only the most educated 1 billion humans), we would need at least thousands to millions of uploads. These numbers might be a few orders of magnitude lower if the uploads are copied from a really smart person and are thinking about relevant questions with more focus than most humans. Also, Moore's law may continue to shrink computers by several orders of magnitude. Still, we might need at least the equivalent size of several of today's supercomputers to run an emulation-based AI that substantially competes with the human race.
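To make the back-of-envelope arithmetic explicit, here is the calculation I have in mind, written out as a tiny script; the input numbers are rough readings of the estimates above, not precise figures:

```python
# Rough back-of-envelope arithmetic; the inputs are illustrative guesses.
humans_to_outpace = 1e9        # say, the most educated billion humans

for speedup_per_upload in [1e3, 1e6]:   # an upload thinking 1,000x to 1,000,000x faster
    uploads_needed = humans_to_outpace / speedup_per_upload
    print(speedup_per_upload, uploads_needed)
# -> 1000.0 1000000.0   (a million uploads at 1,000x speedup)
# -> 1000000.0 1000.0   (a thousand uploads at 1,000,000x speedup)

# If each upload needs very roughly a top supercomputer's worth of hardware today,
# then even several orders of magnitude of hardware improvement, plus a few orders
# of magnitude saved by using smarter and more focused uploads, still leaves
# something on the order of several present-day supercomputers of total hardware.
```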
Maybe a de novo AI could be significantly smaller if it's vastly more efficient than a human brain. Of course, it might also be vastly larger because it hasn't had millions of years of evolution to optimize its efficiency.
In discussing AI boxing (Ch. 9), Bostrom suggests, among other things, keeping an AI in a Faraday cage. Once the AI became superintelligent, though, this would need to be a pretty big cage.
Inspired by the preceding discussion of socializing AIs, here's another scenario in which general AI follows more straightforwardly from the kind of weak AI used in Silicon Valley than in the first scenario.
I don't know what would happen with goal preservation in this scenario. Would the PDAs eventually decide to stop goal drift? Would there be any gross and irrevocable failures of translation between actual human values and what the PDAs infer? Would some people build "rogue PDAs" that operate under their own drives and that pose a threat to society? Obviously there are hundreds of ways the scenario as I described it could be varied.
What will AI look like over the next 30 years? I think it'll be similar to the Internet revolution or factory automation. Rather than developing agent-like individuals with goal systems, people will mostly optimize routine processes, developing ever more elaborate systems for mechanical tasks and information processing. The world will move very quickly -- not because AI "agents" are thinking at high speeds but because software systems collectively will be capable of amazing feats. Imagine, say, bots making edits on Wikipedia that become ever more sophisticated. AI, like the economy, will be more of a network property than a localized, discrete actor.
As more and more jobs become automated, more and more people will be needed to work on the automation itself: building, maintaining, and repairing complex software and hardware systems, as well as generating training data on which to do machine learning. I expect increasing automation in software maintenance, including more robust systems and systems that detect and try to fix errors. Present-day compilers that detect syntactical problems in code offer a hint of what's possible in this regard. I also expect increasingly high-level languages and interfaces for programming computer systems. Historically we've seen this trend -- from assembly language, to C, to Python. We have WYSIWYG editors, natural-language Google searches, and so on. Maybe eventually, as Marvin Minsky proposes, we'll have systems that can infer our wishes from high-level gestures and examples. This suggestion is redolent of my PDA scenario above.
In 100 years, there may be artificial human-like agents, and at that point more sci-fi AI images may become more relevant. But by that point the world will be very different, and I'm not sure the agents created will be discrete in the way humans are. Maybe we'll instead have a kind of global brain in which processes are much more intimately interconnected, transferable, and transparent than human minds are today. Maybe there will never be a distinct AGI agent on a single supercomputer; maybe superhuman intelligence will always be distributed across many interacting computer systems. Robin Hanson gives an analogy in "I Still Don’t Get Foom":
Imagine in the year 1000 you didn't understand "industry," but knew it was coming, would be powerful, and involved iron and coal. You might then have pictured a blacksmith inventing and then forging himself an industry, and standing in a city square waving it about, commanding all to bow down before his terrible weapon. Today you can see this is silly — industry sits in thousands of places, must be wielded by thousands of people, and needed thousands of inventions to make it work.
Similarly, while you might imagine someday standing in awe in front of a super intelligence that embodies all the power of a new age, superintelligence just isn't the sort of thing that one project could invent. As "intelligence" is just the name we give to being better at many mental tasks by using many good mental modules, there’s no one place to improve it.
Of course, this doesn't imply that humans will maintain the reins of control. Even today and throughout history, economic growth has had a life of its own. Technological development is often unstoppable even in the face of collective efforts of humanity to restrain it (e.g., nuclear weapons). In that sense, we're already familiar with humans being overpowered by forces beyond their control. An AI takeoff will represent an acceleration of this trend, but it's unclear whether the dynamic will be fundamentally discontinuous from what we've seen so far.
Wikipedia says regarding Gregory Stock's book Metaman:
While many people have had ideas about a global brain, they have tended to suppose that this can be improved or altered by humans according to their will. Metaman can be seen as a development that directs humanity's will to its own ends, whether it likes it or not, through the operation of market forces.
Vernor Vinge reported that Metaman helped him see how a singularity might not be completely opaque to us. Indeed, a superintelligence might look something like present-day human society, with leaders at the top: "That apex agent itself might not appear to be much deeper than a human, but the overall organization that it is coordinating would be more creative and competent than a human."
Update, Nov. 2015: I'm increasingly leaning toward the view that the development of AI over the coming century will be slow, incremental, and more like the Internet than like unified artificial agents. I think humans will develop vastly more powerful software tools long before highly competent autonomous agents emerge, since common-sense autonomous behavior is just so much harder to create than domain-specific tools. If this view is right, it suggests that work on AGI issues may be somewhat less important than I had thought, since
This is a weaker form of the standard argument that "we should wait until we know more what AGI will look like to focus on the problem" and that "worrying about the dark side of artificial intelligence is like worrying about overpopulation on Mars".
I don't think the argument against focusing on AGI works because
Still, this point does cast doubt on heuristics like "directly shaping AGI dominates all other considerations." It also means that a lot of the ways "AI safety" will play out on shorter timescales will be with issues like assassination drones, computer security, financial meltdowns, and other more mundane, catastrophic-but-not-extinction-level events.
I don't currently know enough about the technological details of whole-brain emulation to competently assess predictions that have been made about its arrival dates. In general, I think prediction dates are too optimistic (planning fallacy), but it still could be that human-level emulation comes before from-scratch human-level AIs do. Of course, perhaps there would be some mix of both technologies. For instance, if crude brain emulations didn't reproduce all the functionality of actual human brains due to neglecting some cellular and molecular details, perhaps from-scratch AI techniques could help fill in the gaps.
If emulations are likely to come first, they may deserve more attention than other forms of AI. In the long run, bottom-up AI will dominate everything else, because human brains -- even run at high speeds -- are only so smart. But a society of brain emulations would think vastly faster than biological humans could keep up with, so the details of shaping AI would be left up to the emulations, and our main influence would come through shaping them. Our influence on emulations could matter a lot, not only in nudging the dynamics of how emulations take off but also because the values of the emulation society might depend significantly on who was chosen to be uploaded.
One argument why emulations might improve human ability to control AI is that both emulations and the AIs they would create would be digital minds, so the emulations' AI creations wouldn't have inherent speed advantages purely due to the greater efficiency of digital computation. Emulations' AI creations might still have more efficient mind architectures or better learning algorithms, but building those would take work. The "for free" speedup to AIs just because of their substrate would not give AIs a net advantage over emulations. Bostrom feels "This consideration is not too weighty" (p. 244 of Superintelligence) because emulations might still be far less intelligent than AGI. I find this claim strange, since it seems to me that the main advantage of AGI in the short run would be its speed rather than qualitative intelligence, which would take (subjective) time and effort to develop.
Bostrom also claims that if emulations come first, we would face risks from two transitions (humans to emulations, and emulations to AI) rather than one (humans to AI). There may be some validity to this, but it also seems to neglect the realization that the "AI" transition has many stages, and it's possible that emulation development would overlap with some of those stages. For instance, suppose the AI trajectory moves from AI1->AI2->AI3. If emulations are as fast and smart as AI1, then the transition to AI1 is not a major risk for emulations, while it would be a big risk for humans. This is the same point as made in the previous paragraph.
"Emulation timelines and AI risk" has further discussion of the interaction between emulations and control of AIs.
Previously in this piece I compared the expected suffering that would result from a rogue AI vs. a human-inspired AI. I suggested that while a first-guess calculation may tip in favor of a human-inspired AI on balance, this conclusion is not clear and could change with further information, especially if we had reason to think that many rogue AIs would be "minimizers" of something or would not colonize space.
In the case of brain emulations (and other highly neuromorphic AIs), we already know a lot about what those agents would look like: They would have both maximization and minimization goals, would usually want to colonize space, and might have some human-type moral sympathies (depending on their edit distance relative to a pure brain upload). The possibilities of pure-minimizer emulations or emulations that don't want to colonize space are mostly ruled out. As a result, it's pretty likely that "unsafe" brain emulations and emulation arms-race dynamics would result in more expected suffering than a more deliberative future trajectory in which altruists have a bigger influence, even if those altruists don't place particular importance on reducing suffering. This is especially so if the risk of human extinction is much lower for emulations, given that bio and nuclear risks might be less damaging to digital minds.10
Thus, the types of interventions that pure suffering reducers would advocate with respect to brain emulations might largely match those that altruists who care about other values would advocate. This means that getting more people interested in making the brain-emulation transition safer and more humane seems like a safe bet for suffering reducers.
One might wonder whether "unsafe" brain emulations would be more likely to produce rogue AIs, but this doesn't seem to be the case, because even unfriendly brain emulations would collectively be amazingly smart and would want to preserve their own goals. Hence they would place as much emphasis on controlling their AIs as would a more human-friendly emulation world. A main exception to this is that a more cooperative, unified emulation world might be less likely to produce rogue AIs because of less pressure for arms races.
In Ch. 2 of Superintelligence, Bostrom makes a convincing case against brain-computer interfaces as an easy route to significantly super-human performance. One of his points is that it's very hard to decode neural signals in one brain and reinterpret them in software or in another brain (pp. 46-47). This might be an AI-complete problem.
But then in Ch. 11, Bostrom goes on to suggest that emulations might learn to decompose themselves into different modules that could be interfaced together (p. 172). While possible in principle, I find such a scenario implausible for the reason Bostrom outlined in Ch. 2: There would be so many neural signals to hook up to the right places, which would be different across different brains, that the task seems hopelessly complicated to me. Much easier to build something from scratch.
Along the same lines, I doubt that brain emulation in itself would vastly accelerate neuromorphic AI, because emulation work is mostly about copying without insight. Cognitive psychology is often more informative about AI architectures than cellular neuroscience, because general psychological systems can be understood in functional terms as inspiration for AI designs, compared with the opacity of neuronal spaghetti. In Bostrom's list of examples of AI techniques inspired by biology (Ch. 14, "Technology couplings"), only a few came from neuroscience specifically. That said, emulation work might involve some cross-pollination with AI, and in any case, it might accelerate interest in brain/artificial intelligence more generally or might put pressure on AI groups to move ahead faster. Or it could funnel resources and scientists away from de novo AI work. The upshot isn't obvious.
A "Singularity Summit 2011 Workshop Report" includes the argument that neuromorphic AI should be easier than brain emulation because "Merely reverse-engineering the Microsoft Windows code base is hard, so reverse-engineering the brain is probably much harder." But emulation is not reverse-engineering. As Robin Hanson explains, brain emulation is more akin to porting software (though probably "emulation" actually is the more precise word, since emulation involves simulating the original hardware). While I don't know any fully reverse-engineered versions of Windows, there are several Windows emulators, such as VirtualBox.
Of course, if emulations emerged, their significantly faster rates of thinking would multiply progress on non-emulation AGI by orders of magnitude. Getting safe emulations doesn't by itself get safe de novo AGI because the problem is just pushed a step back, but we could leave AGI work up to the vastly faster emulations. Thus, for biological humans, if emulations come first, then influencing their development is the last thing we ever need to do. That said, thinking several steps ahead about what kinds of AGIs emulations are likely to produce is an essential part of influencing emulation development in better directions.
Arguments for mathematical AIs:
Arguments for neuromorphic AIs:
In the limit of very human-like neuromorphic AIs, we face similar considerations as between emulations vs. from-scratch AIs -- a tradeoff which is not at all obvious.
Overall, I think mathematical AI has a better best case but also a worse worst case than neuromorphic. If you really want goal preservation and think goal drift would make the future worthless, you might lean more towards mathematical AI because it's more likely to perfect goal preservation. But I probably care less about goal preservation and more about avoiding terrible outcomes.
In Superintelligence (Ch. 14), Bostrom comes down strongly in favor of mathematical AI being safer. I'm puzzled by his high degree of confidence here. Bostrom claims that unlike emulations, neuromorphic AIs wouldn't have human motivations by default. But this seems to depend on how human motivations are encoded and what parts of human brains are modeled in the AIs.
In contrast to Bostrom, a 2011 Singularity Summit workshop ranked neuromorphic AI as more controllable than (non-friendly) mathematical AI, though of course they found friendly mathematical AI most controllable. The workshop's aggregated probability of a good outcome given brain emulation or neuromorphic AI turned out to be the same (14%) as that for mathematical AI (which might be either friendly or unfriendly).
As I noted above, advanced AIs will be complex agents with their own goals and values, and these will matter ethically. Parallel to discussions of robot rebellion in science fiction are discussions of robot rights. I think even present-day computers deserve a tiny bit of moral concern, and complex computers of the future will command even more ethical consideration.
How might ethical concern for machines interact with control measures for machines?
As more people grant moral status to AIs, there will likely be more scrutiny of AI research, analogous to how animal activists in the present monitor animal testing. This may make AI research slightly more difficult and may distort what kinds of AIs are built depending on the degree of empathy people have for different types of AIs. For instance, if few people care about invisible, non-embodied systems, researchers who build these will face less opposition than those who pioneer suffering robots or animated characters that arouse greater empathy. If this possibility materializes, it would contradict present trends where it's often helpful to create at least a toy robot or animated interface in order to "sell" your research to grant-makers and the public.
Since it seems likely that reducing the pace of progress toward AGI is on balance beneficial, a slowdown due to ethical constraints may be welcome. Of course, depending on the details, the effect could be harmful. For instance, perhaps China wouldn't have many ethical constraints, so ethical restrictions in the West might slightly favor AGI development by China and other less democratic countries. (This is not guaranteed. For what it's worth, China has already made strides toward reducing animal testing.)
In any case, I expect ethical restrictions on AI development to be small or nonexistent until many decades from now when AIs develop perhaps mammal-level intelligence. So maybe such restrictions won't have a big impact on AGI progress. Moreover, it may be that most AGIs will be sufficiently alien that they won't arouse much human sympathy.
Brain emulations seem more likely to raise ethical debate because it's much easier to argue for their personhood. If we think brain emulation coming before AGI is good, a slowdown of emulations could be unfortunate, while if we want AGI to come first, a slowdown of emulations should be encouraged.
Of course, emulations and AGIs do actually matter and deserve rights in principle. Moreover, movements to extend rights to machines in the near term may have long-term impacts on how much post-humans care about suffering subroutines run at galactic scale. I'm just pointing out here that ethical concern for AGIs and emulations also may somewhat affect timing of these technologies.
Most humans have no qualms about shutting down and rewriting programs that don't work as intended, but many do strongly object to killing people with disabilities and designing better-performing babies. Where to draw a line between these cases is a tough question, but as AGIs become more animal-like, there may be increasing moral outrage at shutting them down and tinkering with them willy-nilly.
Nikola Danaylov asked Roman Yampolskiy whether it was speciesist or discrimination in favor of biological beings to lock up machines and observe them to ensure their safety before letting them loose.
At a lecture in Berkeley, CA, Nick Bostrom was asked whether it's unethical to "chain" AIs by forcing them to have the values we want. Bostrom replied that we have to give machines some values, so they may as well align with ours. I suspect most people would agree with this, but the question becomes trickier when we consider turning off erroneous AGIs that we've already created because they don't behave how we want them to. A few hard-core AGI-rights advocates might raise concerns here. More generally, there's a segment of transhumanists (including a young Eliezer Yudkowsky) who feel that human concerns are overly parochial and that it's chauvinist to impose our "monkey dreams" on an AGI, which is the next stage of evolution.
The question is similar to whether one sympathizes with the Native Americans (humans) or their European conquerors (rogue AGIs). Before the second half of the 20th century, many history books glorified the winners (Europeans). After a brief period in which humans are quashed by a rogue AGI, its own "history books" will celebrate its conquest and the bending of the arc of history toward "higher", "better" forms of intelligence. (In practice, the psychology of a rogue AGI probably wouldn't be sufficiently similar to human psychology for these statements to apply literally, but they would be true in a metaphorical and implicit sense.)
David Althaus worries that if people sympathize too much with machines, society will be less afraid of an AI takeover, even if AI takeover is bad on purely altruistic grounds. I'm less concerned about this because even if people agree that advanced machines are sentient, they would still find it intolerable for AGIs to wipe out humanity. Everyone agrees that Hitler was sentient, after all. Also, if it turns out that rogue-AI takeover is altruistically desirable, it would be better if more people agreed with this, though I expect an extremely tiny fraction of the population would ever come around to such a position.
Where sympathy for AGIs might have more impact is in cases of softer takeoff where AGIs work in the human economy and acquire increasing shares of wealth. The more humans care about AGIs for their own sakes, the more such transitions might be tolerated. Or would they? Maybe seeing AGIs as more human-like would evoke the xenophobia and ethnic hatred that we've seen throughout history whenever a group of people gains wealth (e.g., Jews in medieval Europe) or is seen as taking jobs (e.g., immigrants of various types throughout history).
Personally, I think greater sympathy for AGI is likely net positive because it may help allay anti-alien prejudices that may make cooperation with AGIs harder. When a Homo sapiens tribe confronts an outgroup, often it reacts violently in an effort to destroy the evil foreigners. If instead humans could cooperate with their emerging AGI brethren, better outcomes would likely follow.
What are some places where donors can contribute to make a difference on AI? The Foundational Research Institute (FRI) explores questions like these, though at the moment the organization is rather small. MIRI is larger and has a longer track record. Its values are more conventional, but it recognizes the importance of positive-sum opportunities to help many values systems, which includes suffering reduction. More reflection on these topics can potentially reduce suffering and further goals like eudaimonia, fun, and interesting complexity at the same time.
Because AI is affected by many sectors of society, these problems can be tackled from diverse angles. Many groups besides FRI and MIRI examine important topics as well, and these organizations should be explored further as potential charity recommendations.
Note: This section was mostly written in late 2014 / early 2015, and not everything said here is fully up-to-date.
Most of MIRI's publications since roughly 2012 have focused on formal mathematics, such as logic and provability. These are tools not normally used in AGI research. I think MIRI's motivations for this theoretical focus are
I personally think reason #3 is most compelling. I doubt #2 is hugely important given MIRI's small size, though it matters to some degree. #1 seems a reasonable strategy in moderation, though I favor approaches that look decently likely to yield non-terrible outcomes rather than shooting for the absolute best outcomes.
Software can be proved correct, and sometimes this is done for mission-critical components, but most software is not formally verified. I suspect that AGI will be sufficiently big and complicated that proving safety will be impossible for humans to do completely, though I don't rule out the possibility of software that would help with correctness proofs on large systems. Muehlhauser and commenters on his post largely agree with this.
What kind of track record does theoretical mathematical research have for practical impact? There are certainly several domains that come to mind, such as the following.
All told, I think it's important for someone to do the kinds of investigation that MIRI is undertaking. I personally would probably invest more resources than MIRI is in hacky, approximate solutions to AGI safety that don't make such strong assumptions about the theoretical cleanliness and soundness of the agents in question. But I expect this kind of less perfectionist work on AGI control will increase as more people become interested in AGI safety.
There does seem to be a significant divide between the math-oriented conception of AGI and the engineering/neuroscience conception. Ben Goertzel takes the latter stance:
I strongly suspect that to achieve high levels of general intelligence using realistically limited computational resources, one is going to need to build systems with a nontrivial degree of fundamental unpredictability to them. This is what neuroscience suggests, it's what my concrete AGI design work suggests, and it's what my theoretical work on GOLEM and related ideas suggests. And none of the public output of SIAI researchers or enthusiasts has given me any reason to believe otherwise, yet.
Personally I think Goertzel is more likely to be right on this particular question. Those who view AGI as fundamentally complex have more concrete results to show, and their approach is far more mainstream among computer scientists and neuroscientists. Of course, proofs about theoretical models like Turing machines and lambda calculus are also mainstream, and few can dispute their importance. But Turing-machine theorems do little to constrain our understanding of what AGI will actually look like in the next few centuries. That said, there's significant peer disagreement on this topic, so epistemic modesty is warranted. In addition, if the MIRI view is right, we might have more scope to make an impact on AGI safety, and it would be possible that important discoveries could result from a few mathematical insights rather than lots of detailed engineering work. Also, most AGI research is more engineering-oriented, so MIRI's distinctive focus on theory, especially abstract topics like decision theory, may target an underfunded portion of the space of AGI-safety research.
In "How to Study Unsafe AGI's safely (and why we might have no choice)," Punoxysm makes several points that I agree with, including that AGI research is likely to yield many false starts before something self-sustaining takes off, and those false starts could afford us the opportunity to learn about AGI experimentally. Moreover, this kind of ad-hoc, empirical work may be necessary if, as seems to me probable, fully rigorous mathematical models of safety aren't sufficiently advanced by the time AGI arrives.
Ben Goertzel likewise suggests that a fruitful way to approach AGI control is to study small systems and "in the usual manner of science, attempt to arrive at a solid theory of AGI intelligence and ethics based on a combination of conceptual and experimental-data considerations". He considers this view the norm among "most AI researchers or futurists". I think empirical investigation of how AGIs behave is very useful, but we also have to remember that many AI scientists are overly biased toward "build first; ask questions later" because
On a personal level, I suggest that if you really like building systems rather than thinking about safety, you might do well to earn to give in software and donate toward AGI-safety organizations.
Yudkowsky (2016b) makes an interesting argument in reply to the idea of using empirical, messy approaches to AI safety: "If you sort of wave your hands and say like, 'Well, maybe we can apply this machine-learning algorithm, that machine-learning algorithm, the result will be blah blah blah', no one can convince you that you're wrong. When you work with unbounded computing power, you can make the ideas simple enough that people can put them on whiteboards and go like 'Wrong!', and you have no choice but to agree. It's unpleasant, but it's one of the ways the field makes progress."
Here are some rough suggestions for how I recommend proceeding on AGI issues and, in [brackets], roughly how long I expect each stage to take. Of course, the stages needn't be done in a strict serial order, and step 1 should continue indefinitely, as we continue learning more about AGI from subsequent steps.
I recommend avoiding a confrontational approach with AGI developers. I would not try to lobby for restrictions on their research (in the short term at least), nor try to "slow them down" in other ways. AGI developers are the allies we need most at this stage, and most of them don't want uncontrolled AGI either. Typically they just don't see their work as risky, and I agree that at this point, no AGI project looks set to unveil something dangerous in the next decade or two. For many researchers, AGI is a dream they can't help but pursue. Hopefully we can engender a similar enthusiasm about pursuing AGI safety.
In the longer term, tides may change, and perhaps many AGI developers will desire government-imposed restrictions as their technologies become increasingly powerful. Even then, I'm doubtful that governments will be able to completely control AGI development (see, e.g., the criticisms by John McGinnis of this approach), so differentially pushing for more safety work may continue to be the most leveraged solution. History provides a poor track record of governments refraining from developing technologies due to ethical concerns; Eckersley and Sandberg (p. 187) cite "human cloning and land-based autonomous robotic weapons" as two of the few exceptions, with neither prohibition having a long track record.
I think the main way in which we should try to affect the speed of regular AGI work is by aiming to avoid setting off an AGI arms race, either via an AGI Sputnik moment or else by more gradual diffusion of alarm among world militaries. It's possible that discussing AGI scenarios too much with military leaders could exacerbate a militarized reaction. If militaries set their sights on AGI the way the US and Soviet Union did on the space race or nuclear-arms race during the Cold War, the amount of funding for unsafe AGI research might multiply by a factor of 10 or maybe 100, and it would be aimed in harmful directions.
Here are some candidates for the best object-level projects that altruists could work on with reference to AI. Because AI seems so crucial, these are also candidates for the best object-level projects in general. Meta-level projects like movement-building, career advice, earning to give, fundraising, etc. are also competitive. I've scored each project area out of 10 points to express a rough subjective guess of the value of the work for suffering reducers.
Research whether controlled or uncontrolled AI yields more suffering (score = 10/10)
Push for suffering-focused AI-safety approaches (score = 10/10)
Most discussions of AI safety assume that human extinction and failure to spread (human-type) eudaimonia are the main costs of takeover by uncontrolled AI. But as noted in this piece, AIs would also spread astronomical amounts of suffering. Currently no organization besides FRI is focused on how to do AI safety work with the primary aim of avoiding outcomes containing huge amounts of suffering.
One example of a suffering-focused AI-safety approach is to design AIs so that even if they do get out of control, they "fail safe" in the sense of not spreading massive amounts of suffering into the cosmos. For example:
The problem with bullet #1 is that if you can succeed in preventing AGIs from colonizing space, it seems like you should already have been able to control the AGI altogether, since the two problems appear about equally hard. But maybe there are clever ideas we haven't thought of for reducing the spread of suffering even if humans lose total control.
Another challenge is that those who don't place priority on reducing suffering may not agree with these proposals. For example, I would guess that most AI scientists would say, "If the AGI kills humans, at least we should ensure that it spreads life into space, creates a complex array of intricate structures, and increases the size of our multiverse."
Work on AI control and value-loading problems (score = 4/10)
Research technological/economic/political dynamics of an AI takeoff and push in better directions (score = 3/10)
By this I have in mind scenarios like those of Robin Hanson for emulation takeoff, or Bostrom's "The Future of Human Evolution".
Promote the ideal of cooperation on AI values (e.g., CEV) (score = 2/10)
Promote a smoother, safer takeoff for brain emulation (score = 2/10)
Influence the moral values of those likely to control AI (score = 2/10)
Promote a singleton over multipolar dynamics (score = 1/10)
Other variations
In general, there are several levers that we can pull on:
These can be applied to any of
Projects like DeepMind, Vicarious, OpenCog, and the AGI research teams at Google, Facebook, etc. are some of the leaders in AGI technology. Sometimes it's proposed that since these teams might ultimately develop AGI, altruists should consider working for, or at least lobbying, these companies so that they think more about AGI safety.
One's assessment of this proposal depends on one's view about AGI takeoff. My own opinion may be somewhat in the minority relative to expert surveys, but I'd be surprised if we had human-level AGI within the next 50 years, and my median estimate might be roughly 90 years from now. That said, the idea of AGI arriving at a single point in time is probably a wrong framing of the question. Already machines are super-human in some domains, while their abilities are far below humans' in other domains. Over the coming decades, we'll see lots of advancement in machine capabilities in various fields at various speeds, without any single point where machines suddenly develop human-level abilities across all domains. Gradual AI progress over the coming decades will radically transform society, resulting in many small "intelligence explosions" in various specific areas, long before machines completely surpass humans overall.
In light of my picture of AGI, I think of DeepMind, Vicarious, etc. as ripples in a long-term wave of increasing machine capabilities. It seems extremely unlikely that any one of these companies or its AGI system will bootstrap itself to world dominance on its own. Therefore, I think influencing these companies with an eye toward "shaping the AGI that will take over the world" is probably naive. That said, insofar as these companies will influence the long-term trajectory of AGI research, and insofar as people at these companies are important players in the AGI community, I think influencing them has value -- just not vastly more value than influencing other powerful people.
That said, as noted previously, early work on AGI safety has the biggest payoff in scenarios where AGI takes off earlier and harder than people expected. If the marginal returns to additional safety research are many times higher in these "early AGI" scenarios, then it could still make sense to put some investment into them even if they seem very unlikely.
If, upon further analysis, it looks like AGI safety would increase expected suffering, then the answer would be clear: Suffering reducers shouldn't contribute toward AGI safety and should worry somewhat about how their messages might incline others in that direction. However, I find it reasonably likely that suffering reducers will conclude that the benefits of AGI safety outweigh the risks. In that case, they would face a question of whether to push on AGI safety or on other projects that also seem valuable.
Reasons to focus on other projects:
Reasons to focus on AGI safety:
All told, I would probably pursue a mixed strategy: Work primarily on questions specific to suffering reduction, but direct donations and resources toward AGI safety when opportunities arise. Some suffering reducers particularly suited to work on AGI safety could go in that direction while others continue searching for points of leverage not specific to controlling AGI.
Parts of this piece were inspired by discussions with various people, including David Althaus, Daniel Dewey, and Caspar Oesterheld.
The term "friendly AI" can be confusing because it involves normative judgments, and it's not clear if it means "friendly to the interests of humanity's survival and flourishing" or "friendly to the goals of suffering reduction" or something else. One might think that "friendly AI" just means "AI that's friendly to your values", in which case it would be trivial that friendly AI is a good thing (for you). But then the definition of friendly AI would vary from person to person.
"Aligned AI" might be somewhat less value-laden than "friendly AI", but it still connotes to me a sense that there's a "(morally) correct target" that the AI is being aligned toward.
"Controlled AI" is still somewhat ambiguous because it's unspecified which humans have control of the AI and what goals they're giving it, but the label works as a general category to designate "AIs that are successfully controlled by some group of humans". And I like that this category can include "AIs controlled by evil humans", since work to solve the AI control problem increases the probability that AIs will be controlled by evil humans as well as by "good" ones. (back)
The post Artificial Intelligence and Its Implications for Future Suffering appeared first on Center on Long-Term Risk.
The post Possible Ways to Promote Compromise appeared first on Center on Long-Term Risk.
One general approach is to bring together people of different moral views, so that they can sympathize with those who feel differently on moral issues. This helps promote tolerance of others, thereby improving odds for amicable resolution of disputes, and in some cases, each side may adopt some of the moral views of the other.
If we encourage people to move in each other's directions morally, is this actually compromise? Or are we just introducing a new morality that's a blend of the two? Well, consider the example of the deep ecologists vs. animal welfarists from the beginning of "Gains from Trade through Compromise." Suppose the deep ecologists and animal welfarists both look at the issue from the other side's perspective and thereby come to sympathize with it. Say the moral blend results in everyone caring half about deep ecology and half about animal welfare. Then the policies adopted when these morally blended individuals follow their own moral instincts will in fact be roughly the same as the compromise deals that would have been reached by the equipotent competing factions. So, even before the sides are morally blended, they should welcome an intervention in which someone convinces each side to move in the other's moral direction, so long as this moral blending is done roughly in proportion to the power of the existing sides.
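Here's a toy numerical model (my own simplification, not something from the original compromise essay) of why proportional moral blending approximates a bargain between equipotent factions. Suppose policies lie on a line, deep ecologists' ideal policy is 0, animal welfarists' ideal policy is 1, and each side's utility falls off with squared distance from its ideal point:

```python
# Toy model (my own simplification): policies are numbers in [0, 1].
# Deep ecologists' ideal point is 0; animal welfarists' ideal point is 1.

def u_ecology(policy: float) -> float:
    return -(policy - 0.0) ** 2   # utility falls off with squared distance

def u_welfare(policy: float) -> float:
    return -(policy - 1.0) ** 2

candidates = [i / 100 for i in range(101)]

# A 50/50 morally blended individual maximizes the equal-weighted sum of utilities.
print(max(candidates, key=lambda p: 0.5 * u_ecology(p) + 0.5 * u_welfare(p)))  # 0.5

# With unequal weights (say 70/30 toward deep ecology), the chosen policy shifts
# toward the stronger side, roughly matching a power-weighted compromise.
print(max(candidates, key=lambda p: 0.7 * u_ecology(p) + 0.3 * u_welfare(p)))  # 0.3
```

The point is just that agents acting on the blended values end up at roughly the same policy that the two factions, bargaining in proportion to their power, would plausibly have settled on; how exact the correspondence is depends on the utility functions and the bargaining solution assumed.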
What we see here illustrates a general principle: One function of emotions is as evolution's way of making game-theoretic pre-commitments. Romantic love is an emotional pre-commitment to provide care and resources for a partner. Anger is an emotional pre-commitment to respond against encroachment with costly retaliation. And, in this case, changing people's moral sentiments toward a compromise stance is a way to actually achieve lasting compromise. It may seem crude compared against carefully designed optimal game-theoretic bargains, but it has the advantage that it works now, without relying on institutional structures that can enforce contracts into the far future. And historical precedent shows us that modifying emotions for compromise purposes can work. For example, the advent of the value of religious tolerance may have been partly a response to the costly religious wars that plagued Europe. Of course, formal agreements like the Peace of Westphalia also contributed, and as is often the case, formal agreements can breed moral values just as moral values can lead to formal agreements.
I said above that moral blending is acceptable to both sides if the resulting blend is roughly proportional to the pre-existing power balance. However, sometimes this may not be the case. If it's not predictable which side people will favor upon considering both views, then ex ante, each side may still be okay with moral blending, because it's not clear which side will be favored more, and often, the members of each side think their stance is obviously more sensible, so they may even have high hopes for the outcome. Still, there are exceptions to this. For instance, members of certain religions and cults are discouraged from associating too closely with outsiders because this might predictably lead to straying from the fold. (2 Corinthians 6:14: "Do not be yoked together with unbelievers. For what do righteousness and wickedness have in common? Or what fellowship can light have with darkness?")
These are tricky cases, and there is a genuine tension between openness to new values versus goal preservation. For example, an altruist should rightly be concerned about marrying someone who spends all his money on expensive cars and tropical cruises, in part for fear of being tempted into the same lifestyle. That said, sometimes people are also flexible enough to "try on" other moral perspectives.
Beyond the game-theoretic reasons discussed above, there are at least three other motivations for openness to alternate moral perspectives:
Even if realism were better for convergence on balance, it's not clear we should encourage it wholesale, because it's somewhat confused. The idea of moral truth (whatever that's supposed to mean?) violates Occam's razor, and moreover, why would I care about what the moral truth was even if it existed? What if the moral truth commanded me to needlessly torture babies? It seems likely that as people become more sophisticated, they'll increasingly understand that naive realism doesn't make sense. Promoting it would then be like trying to tell kids about Santa to induce them to be nice rather than naughty; it only works for so long, especially among the really intelligent people who will have the most power over how the future unfolds.
That said, we may not want to promote concentrated anti-realism either, because this could encourage moral balkanization as people see that they can legitimately hold a stance in opposition to what others want. Rather, we should probably push most on convergence as a goal, like with coherent extrapolated volition, and not focus too much on the caustic nature of unadulterated anti-realism.
There are other routes to moral blending besides abstract meta-ethics. For example:
Are these approaches cost-effective? Given that so many sectors of society aim to promote inter-group dialogue and reduce violence, we shouldn't expect low-hanging fruit here. On the other hand, because these efforts are widely regarded as beneficial, we have more confidence that working on them would at least be positive in expectation. Thus, these seem to be, at minimum, relatively "safe bets" and are supported by heuristics about working on causes that have wide support from many different people.
Keep in mind that some of the proposals, like promoting increased education, have many other flow-through effects that make the analysis more complicated. Promoting greater liberalism within existing education may be an easier intervention to analyze.
Symbols, rituals, and shared identity can bind groups together, but usually this comes at the expense of increasing hostility toward outsiders. Oxytocin has this same effect. Jonathan Haidt suggests the metaphor that when people circle around a sacred object, they generate an "electric current" that unites them but also creates a polarity of them vis-a-vis the outgroup. Various studies have found that people identify with fellow group members more than outsiders even when the group assignments were basically arbitrary or even random. Even prelinguistic infants display this tendency, preferring puppets that have the same food preference -- Cheerios or graham crackers -- as they do (Neha Mahajan and Karen Wynn, 2012).
Ingroup loyalty can be strengthened by various factors, including anger ("Prejudice From Thin Air") and the perception of zero-sum conflict (realistic conflict theory).
"Intergroup Conflict" has more to say, and the discipline of ingroup/outgroup distinctions has further literature on these topics.
Defection on one-shot prisoner's dilemmas happens because squealing on your partner doesn't have lasting consequences. If no one ever finds out, there's temptation to cheat. Real-world examples of this include lying, stealing, and other forms of deception for personal gain at greater social cost.
To prevent defection, it helps to make your choice visible to others, so that cooperation can be rewarded with future social benefits. One elegant way to accomplish this could be a "karma" rating or Whuffie score that is incremented or decremented based on how you treat others. If your fellow prisoner could lower your karma rating when you defect, you would have incentive not to do so.
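As a minimal sketch of the incentive effect (the names and numbers are hypothetical, and a real system would also have to handle fraud, collusion, and dispute resolution), imagine that each interaction updates a public karma score and that people consult the score before agreeing to new deals:

```python
# Minimal sketch of a reputation ("karma") system that deters one-shot defection.
# Names and numbers are hypothetical illustrations.

karma = {}

def get_karma(person: str) -> int:
    return karma.get(person, 0)

def record_interaction(person: str, cooperated: bool) -> None:
    """Partners publicly report whether `person` cooperated or defected."""
    karma[person] = get_karma(person) + (1 if cooperated else -3)

def willing_to_transact(person: str, threshold: int = -2) -> bool:
    """Others consult the public score before entering new deals."""
    return get_karma(person) >= threshold

record_interaction("alice", cooperated=True)
record_interaction("bob", cooperated=False)
record_interaction("bob", cooperated=False)

print(willing_to_transact("alice"))  # True
print(willing_to_transact("bob"))    # False -- defection now has lasting costs
```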
As social networks become more ubiquitous, the possibility for these kinds of karma systems becomes more real. Already they exist in many online communities, like Slashdot or Quora. David Brin suggests that karma could become even more ubiquitous with digital glasses or other devices for looking up people's reputation scores in real-life settings.
Of course, similar incentive functions can also be accomplished with monetary payments, although it seems that there are social norms against paying money in certain circumstances, such as interactions between friends. Instead, our more ancient primate sense of social capital still operates for relationships between family members, friends, business partners, and politicians. People feel less outrage about "bribery" when it's done through networking and social ingratiation rather than explicit financial exchanges.
All of these mechanisms facilitate compromise on prisoner's dilemmas by allowing for Pareto-improving transactions. As society becomes increasingly transparent, it will be possible to enforce these arrangements in a greater number of cases, potentially making everyone better off, at least in theory. Already governments serve this function to a significant degree, by enforcing laws. (Of course, it's important to make sure that punishments for violations are roughly proportionate to the damage done rather than being excessive, or else these laws risk causing more harm than they prevent.)
Humans value privacy for many reasons, but one reason is because it historically offered protection -- e.g., sneaking off to a secluded area to have sex so that the dominant male doesn't beat you up afterward. Similar principles apply in protecting citizens against authoritarian Big Brother. Avoiding authoritarianism is an important concern, but my sense is that this can be done by other means. For instance, Brin proposes sousveillance -- watching the watchers to hold them accountable. As Brin says, safety is a necessary precondition for privacy, so at least some degree of surveillance is unavoidable.
Another reason we value privacy is that people tend to judge each other over trivialities -- sexual conduct, religious beliefs, irreverent jokes, not being properly dressed, or whatever. The popularity of celebrity gossip and farcical political "scandals" is a testament to this feature of human nature. I think many people desire privacy because they don't want others seeing them doing these entirely normal activities that somehow are blown out of proportion when they're visible. If we could overcome the tendency to make a fuss over harmless private behaviors, it would allow for more transparency and therefore more opportunities for cooperation. As more of our lives become visible through digital technology, I hope people will become more accepting of diversity and individual choices, but getting there will sadly not be easy.
Transparency is probably even more important at a governmental level -- the government both being transparent to its own citizens and being transparent to other governments. This can allow for better enforcement of international agreements, such as arms-control treaties. In The Strategy of Conflict, Thomas Schelling explains (p. 148):
Leó Szilárd has even pointed to the paradox that one might wish to confer immunity on foreign spies rather than subject them to prosecution, since they may be the only means by which the enemy can obtain persuasive evidence of the important truth that we are making no preparations for embarking on a surprise attack. [Citation: Szilárd's 1955 "Disarmament and the Problem of Peace"]
Transparency of citizens to governments is often protested, sometimes on privacy grounds and sometimes to prevent slipping down a slope toward tyranny. It's difficult to know where to draw the line on government surveillance. That said, it is clear that abuses of surveillance power -- whether motivated by prurience or sabotaging one's opponents -- are harmful, because they engender justified outrage at surveillance in general, making it slightly harder to carry out good surveillance.
One additional consideration is that surveillance and greater government power probably make eventual space colonization more likely by reducing catastrophic risks. This impact may be met with ambivalence by those who consider preventing suffering most important.
Democracy has many benefits for compromise.
Democracy relies on social stability. In order to be willing to compromise, you need to be confident the bargain will be upheld. Strong rule of law is required for this. In general, there is a vast literature on what makes democratic compromise possible, and we should explore it further. For instance, what helped trigger Democracy's Third Wave?
Trade tends to reduce the likelihood that factions or countries will go to war, because the parties rely on each other for mutual benefit. In addition, trade has a moral effect of enhancing empathy among distant peoples, as a natural corollary of the fact that reciprocal altruism leads us to care more about those with whom we exchange.
While this is the prevailing view among elites, there are some critics, such as Margaret MacMillan, who suggests that globalization can increase "intense localism and nativism," and this may have contributed to World War I; at the very least, growing interdependence didn't prevent that war. My personal guess, however, is that even if MacMillan's claim is true, it's a short-term effect, and the long-term trend is toward greater tolerance due to increased trade. In a session on "China Rising," Jon Huntsman cited Utah's alfalfa exports to China as a factor helping to humanize and incentivize a friendlier relationship: the second-largest economy in the world has gone "from enemy to customer" for those farmers.
Pride in one's country is a glue that holds countries together and justifies government build-up of force against other countries. John Mearsheimer said, "The most powerful political ideology on the face [of the planet ...] is not democracy; it's nationalism." And, he adds, nationalism makes it very hard to take over another country because the local population fights back unceasingly, as the US saw in Vietnam.
Of course, there are failed states, but the overall success of nationalism to promote unity even in spite of fierce ideological disputes is impressive. However, as is often remarked, the downside of nationalism is that it provokes international hostilities. This is the classic problem that Josh Greene discusses in Moral Tribes: the glue that turns "me" to "us" also pits "us" against "them."
The circle of what constitutes "us" can apparently grow large -- from a 150-person tribal group to a 1-billion-person China, for instance. So it's not too much of an additional step to extend it to a 7-billion-person world. There is already an internationalist movement, aiming to encourage people to view each other as "citizens of the world." Or, in the words of John Lennon's "Imagine": "Imagine there's no countries / It isn't hard to do / Nothing to kill or die for / [...] Imagine all the people / Living life in peace..."
I don't know the cost-effectiveness of promoting internationalism relative to other things, but such interventions are at least very likely positive. Of course, if there are nearby extraterrestrials, the next steps will be interplanetism, intergalacticism, etc., but most people aren't ready for this yet. It would, of course, be a tragedy if internationalism led to fiercer conflicts with ETs, but hopefully the greater wisdom of our descendants will preclude that.
These questions about how people perceive nationalism are a focus of constructivists in international relations. Lisa Anderson's talk on "Nationalism and Ethnic Conflict" has further discussion about nationalism's origins and effects.
An important factor in cultivating internationalist sentiments is intermixing of people and cultures. For instance, one reason the US has had such strong ties with Europe is that many people of European descent live in the US, so that domestic political sentiments are aligned toward friendliness with European allies. This is even more prominently visible with American Jews exerting pressure for aggressive US backing of Israel, although in this particular case it's arguable that the domestic lobby causes more harm than good to international peace. Ideally, the cultural mix in the US would be sufficiently diverse that domestic politics wouldn't lopsidedly favor one foreign country over another.
Michael R. Auslin's Pacific Cosmopolitans: A Cultural History of U.S.-Japan Relations reviews a number of additional ways in which cultural exchange can improve international friendship, focusing on the case of Japan specifically:
Cultural icons like Nintendo from Japan or Jackie Chan from China are other examples of major forces that can help break down inter-country hostility in the minds of millions of ordinary citizens.
One concrete example where international exchange could be a cost-effective philanthropic project was described by George Perkovich in his interview with GiveWell. He explained that in the 1990s, he helped with a program that brought together "young scholars and policymakers from Pakistan and India to increase goodwill and communication among the next generation of leaders" (p. 3). India and Pakistan opposed the program, and it shut down due to difficulty obtaining visas, but it could be resurrected. This seems important because the India-Pakistan conflict is arguably the most likely of any in the world to become nuclear.
The sociology literature has extensively studied the contact hypothesis, which is the idea that people become more accepting toward those of different races, sexual orientations, or nationalities when they jointly have positive, coequal interactions in cooperative settings working toward common goals. This is what one would expect from the fact that positive-sum games require the brain to marshal warm feelings toward compatriots, to elicit cooperation rather than defection.
According to the Wikipedia article, Donelson R. Forsyth's Group Dynamics describes a meta-analysis of 515 studies that found a correlation coefficient of 0.2-0.3 between inter-group contact and absence of conflict. Thus, the hypothesis has extensive empirical backing, even though some researchers like Robert D. Putnam have found exceptions where more diverse communities display lower levels of trust.
As an aside, we might wonder whether the contact technique could be used to increase concern for animal suffering. In addition to having humans interact with animals directly, perhaps one could employ imagined contact or parasocial contact through the media, both of which have been suggested to help for human-human tolerance.
Compared with men, women are generally less competitive and far less violent. One reason is that women are more risk-averse, because the number of offspring they can possibly have is bounded. Historically, men could acquire increased status and more sexual partners by succeeding in warfare against other tribes ("male warrior hypothesis"). Testosterone suppresses empathy and encourages conflict, "so much so that we had to invent sports to keep the boys happy," notes Jonathan Haidt.
It thus seems plausible that as women gain more power in a country, that country should ceteris paribus act more peacefully. David Pearce suggested female-only leadership as a way to reduce war, although it's unclear how big the effect would be, since structural factors might tend to select for the most competitive females to leadership roles. Also, some amount of willingness to fight can be important, for deterrence and humanitarian intervention.
Needless to say, Pearce's proposal would not be implemented any time soon. However, the more modest aim of empowering women seems valuable.
Ultimately we might hope for the emergence of a world government, or singleton, which could provide the authority to enforce bargaining arrangements even at the international level. This idea is not new; compare with Hobbes's Leviathan, which argues for a sovereign authority to prevent "the war of all against all." People give up their complete freedom in deference to a social contract that ultimately allows everyone to get more of what he wants in expectation than in a winner-take-all fight.
Short of a world government, we can aim for more modest forms of international cooperation. Many exchanges among nations can be seen as iterated prisoner's dilemmas, rather than one-shot versions of the game. As a result, neoliberal international-relations professor Robert Keohane has suggested several institutional ways to increase cooperation, as reported in Wikipedia's article on "Regime theory."
Finally, we can promote the idea of cooperation itself as the first response to conflict, such as by fighting zero-sum thinking and explaining the logic of compromise. The school of liberalism in international relations takes a more positive view of compromise and positive-sum possibilities than does realism, which tends to see one country's gain as another's loss. One reason for this is that realism tends to focus on relative gains, while liberalism emphasizes absolute welfare.
International conflicts have historically been among the most massive anthropogenic causes of death, so it seems that cooperation among nations (and major factions within nations) has high priority. The same may be true in an artificial general intelligence (AGI) race, if, for example, two major powers compete for control in analogy with the US and Soviet Union during the Cold War. It seems particularly important to raise awareness of these issues among potential AGI designers themselves, as well as the military and corporate leaders funding AGI projects, although general public outreach can still be valuable to the extent it influences these leaders by diffusion. (For example, the technologists who actually end up building AGI may be born 50 years from now and be influenced by parents, teachers, and TV programs that were in turn influenced by what we did today.)
Research on the fundamentals of compromise might have a high payoff. Many issues in game theory remain poorly understood, and putting compromise on firmer theoretical ground would be a major step forward. In addition, we need research in political and social theory to devise robust mechanisms for sustaining compromise agreements, especially against potential disruptions to existing institutions that may result from fast technological breakthroughs or other black swans.
Helping groups that are fighting over soluble factual questions might reduce many short-term conflicts. For more, see this discussion of epistemic disagreements.
Epistemic disagreements are divergences of opinion that are theoretically irrational when the parties' views are common knowledge. However, very often agents don't have the same knowledge. Games of imperfect information are often more prone to conflict than those of perfect information; under perfect information, the best outcome is usually to compromise. (Of course, there may be exceptions.) Shifting situations from games of imperfect information toward games of (more) perfect information could therefore be valuable. A downside is that more knowledge in general also means that risks arrive faster, leaving less time in which to negotiate and work out social structures that better facilitate a good outcome for everyone.
While it's extremely important to promote cooperation, this field is not laden with low-hanging fruit, because many other people already rightly see its value. Historically, one major source of leveraged social change has been technologies that open up new possibilities. Are there technologies we could support that hold the promise of dramatically improving cooperation, without also speeding up dangers of massive conflict at the same time?
One proposal that a friend of mine suggested is improving machine translation. Language plays a major role in the development of national identities and us-versus-them balkanization. One of the goals of Esperanto was to "transcend nationality and foster peace and international understanding between people with different languages." While Esperanto has little hope of worldwide adoption, very readable machine translation would offer something almost as good. Apparently this vision has been floating around since the end of World War II.
In Game Theory: Analysis of Conflict, Roger B. Myerson suggests (pp. 1-2):
People seem to have learned more about how to design physical systems for exploiting radioactive materials than about how to create social systems for moderating human behavior in conflict. Thus, it may be natural to hope that advances in the most fundamental and theoretical branches of the social sciences might be able to provide the understanding that we need to match our great advances in the physical sciences.
Of course Myerson is likely to feel this way because (a) if he didn't, he might not have studied game theory and (b) he probably doesn't want to feel as though his life's work has been harmful. But is it true that our prospects for reducing suffering are better when people are more informed about game theory?
It's certainly the case that there are both gains and losses when people understand game theory relative to relying on naive intuition or happenstance.
On balance, though, there seem to be real benefits to deeper understanding as well.
Game theory is destiny. In the long run, rational agents will converge on understanding game theory anyway, because those who don't will on average lose resources. If we can emphasize the positive possibilities of game theory, we may be able to steer society toward a better path. Of course, if, hypothetically, it were the case that game theory tended to produce worse results the more it was understood, we might hope to keep people in the dark as long as possible in order to maximize cooperation before crucial junctures like the creation of AGI. However, I think this is not that likely, and indeed, the opposite seems more plausible: that better understanding of game theory would help navigate cooperation on AGI to make everyone better off in expectation.
I have a separate piece that lists organizations that work to promote cooperation: "Cooperation Charities and Organizations."
Gains from Trade through Compromise
]]>"Any man to whom you can do favor is your friend, and [...] you can do a favor to almost anyone."
--Mark Caine
"Gains from trade" in economics refers to situations where two parties can engage in cooperative behavior that makes each side better off. A similar concept applies in the realm of power struggles between competing agents with different values. For example, consider the following scenario.
Deep ecologists vs. animal welfarists. Imagine that two ideologies control the future: Deep ecology and animal welfare. The deep ecologists want to preserve terrestrial ecosystems as they are, including all the suffering they contain. (Ned Hettinger: "Respecting nature means respecting the ways in which nature trades values, and such respect includes painful killings for the purpose of life support.") The animal welfarists want to intervene to dramatically reduce suffering in the wild, even if this means eliminating most wildlife habitats. These two sides are in a race to control the first artificial general intelligence (AGI), at which point the winner can take over the future light cone and enforce its values.
Suppose the two sides are equally matched in resources: They each have a 50% shot at winning. Let's normalize the values for each side between 0 and 100. If the deep ecologists win, they get to preserve all their beloved ecosystems; this outcome has value 100 to them. If they lose, their ecosystems disappear, leaving 0 value. Meanwhile, the values are swapped for the animal welfarists: If they win and eliminate the suffering-filled ecosystems, they achieve value 100, else the value to them is 0. Since the chance of each side winning is 50%, each side has an expected value of 50.
But there's another option besides just fighting for winner takes all. Say the deep ecologists care more about preserving species diversity than about sheer number of organisms. Maybe they're also more interested in keeping around big, majestic animals in their raw form than about maintaining multitudes of termites and cockroaches. Perhaps some ecologists just want the spectacle of wildlife without requiring it to be biological, and they could be satisfied by lifelike robot animals whose conscious suffering is disabled at appropriate moments, such as when being eaten.1 Maybe others would be okay with virtual-reality simulations of Earth's original wildlife in which the suffering computations are skipped over in the virtual animals' brains.
These possibilities suggest room for both parties to gain from compromise. For instance, the animal welfarists could say, "We want to get rid of 60% of suffering wild animals, but we'll eliminate the ones that you care about least (e.g., insects when they're not crucial for supporting the big animals), and we'll keep some copies of everything to satisfy your diversity concerns, along with doing some robots and non-suffering simulations." Maybe this would be ~60% as good as complete victory in the eyes of the deep ecologists. If the two sides make this arrangement, each gets value 60 with certainty instead of expected value 50.
Here, there were gains from trade because the animal welfarists could choose for the compromise those methods of reducing wild-animal suffering that had least impact to the deep ecologists' values. In general, when two sets of values are not complete polar opposites of each other, we should expect a concave-down curve like the red one below illustrating the "production possibilities" for the two values. When the curve is concave down, we have possible gains from trade relative to duking it out for winner takes all (blue line). The blue line illustrates the expected value for each value system parameterized by the probability in [0,1] for one of the value systems winning.
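To make the arithmetic above concrete, here is a minimal Python sketch. The concave frontier used (an exponent of 0.7 on each side's share of control) is purely illustrative and not taken from the essay's figure; any diminishing-returns mapping gives the same qualitative result that a 50/50 compromise beats a 50% gamble for total control.

```python
def fight_expected_values(p_welfarist=0.5):
    """Expected utilities if each side fights for winner-takes-all control."""
    return 100 * p_welfarist, 100 * (1 - p_welfarist)

def compromise_values(share_to_welfarists, exponent=0.7):
    """Utilities under a compromise split, with diminishing returns to control:
    each side satisfies its most important concerns first."""
    u_welfarist = 100 * share_to_welfarists ** exponent
    u_ecologist = 100 * (1 - share_to_welfarists) ** exponent
    return u_welfarist, u_ecologist

print(fight_expected_values())    # (50.0, 50.0)
print(compromise_values(0.5))     # roughly (61.6, 61.6): both sides beat their expected 50
```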
We can imagine many additional examples in which suffering reducers might do better to trade with those of differing value systems rather than fight for total control.
In these disputes, the relevant variable for deciding how to slice the compromise seems to be the probability that each side would win if it were to continue fighting in an all-or-nothing way. These probabilities might be roughly proportional to the resources (financial, political, cultural, technological, etc.) that each side has, as well as its potential for growth. For instance, even though the movement to reduce wild-animal suffering is small now, I think it has potential to grow significantly in the future, so I wouldn't make early compromises for too little in concessions.
This is analogous to valuation of startup companies: Should the founders sell out or keep going in case they can sell out for a higher value later? If they do badly, they might actually get less. For instance, Google offered to buy Groupon for $5.75 billion in 2010, but Groupon turned down the offer, and by 2012, Groupon's market cap fell to less than $5.75 billion.
In "Rationalist explanations for war," pp. 386-87, James D. Fearon makes this same observation: Two states with perfect information should always prefer a negotiation over fighting, with the negotiation point being roughly the probability that each side wins.
I discuss further frameworks for picking a precise bargaining point in "Appendix: Dividing the compromise pie."
Our social intuitions about fairness and democracy posit that everyone deserves an equal say in the final outcome. Unfortunately for these intuitions, compromise bargains are necessarily weighted by power -- "might makes right." We may not like this fact, but there seems no way around it. Of course, our individual utility functions can weight each organism equally, but in the final compromise arrangement, those with more power get more of what they want.
Many people care about complexity, diversity, and a host of other values that I don't find important. I have significant reservations about human space colonization, but I'm willing to let others pursue this dream because they care about it a lot, and I hope in return that they would consider the need to maintain safeguards against future suffering. The importance of compromise does not rely on you, in the back of your mind, giving some intrinsic moral weight to what other agents want; compromise is still important even when you don't care in the slightest or may even be apprehensive about the goals of other factions. To appropriate a quote from Noam Chomsky: If we don't believe in strategic compromise with those we can't identify with, we don't believe in it at all.
If this compromise approach of resolving conflicts by buying out the other side worked, why wouldn't we see it more often? Interest groups should be compromising instead of engaging in zero-sum campaigns. Countries, rather than going to war, could just assess the relative likelihood of each side winning and apportion the goods based on that.
Even animals shouldn't fight: They should just size up their opponents, estimate the probability of each side winning, and split the resources appropriately. In the case of fighting animals, they could get the same expected resources with less injury cost. For instance, two bull seals fighting for a harem of 100 cows, if they appear equally matched, could just split the cows 50-50 and avoid the mutual fitness costs of getting injured in the fight.
There are a few possible explanations for why we don't see more cooperation among animals, though I don't know how accurate they are.
Of course, there are plenty of examples where animals have settled on cooperative strategies. It's just important to note that they don't always do so, and perhaps we could generalize under what conditions cooperation breaks down.
Human wars often represent a failure of cooperation as well. While wars sometimes have "irrational" causes, Matthew O. Jackson and Massimo Morelli argue in "The Reasons for Wars: An Updated Survey" that many can be framed in rationalist terms, and they cite five main reasons for the breakdown of negotiation. An exhaustive survey of theories of war is contained in a syllabus by Jack S. Levy.
How about in intra-state politics? There are plenty of compromises there, but maybe not as many as one might expect. For instance, Toby Ord proposed in 2008:
It is so inefficient that there are pro- and anti- gun control charities and pro- and anti-abortion charities. Charities on either side of the divide should be able to agree to 'cancel' off some of their funds and give it to a mutually agreed good cause (like developing world aid). This would do just as much for (or against) gun control as spending it on their zero-sum campaigning, as well as doing additional good for others.
A similar idea was floated on LessWrong in 2010. I have heard of couples both not voting because they'd negate each other, but I haven't heard of an organization as described above for cancelling opposed donations. Why hasn't something like this taken off?
Whatever the reason is that we don't see more cancelling of opposed political forces, the fact remains that we do see a lot of compromise in many domains of human society, including legislation (I get my provision if you get yours), international relations (we'll provide weapons if you fight people we don't like), business (deals, contracts, purchases, etc.), and all kinds of social relations (Brother Bear will play any three games with Sister Bear if she plays Space Grizzlies with him later). And we're seeing an increasing trend toward positive-sum compromise as time goes on.
While racing to control the first AGI amounts to a one-shot prisoner's dilemma, most of life's competitive scenarios are iterated. Indeed, even in the case of an AGI arms race, there are many intermediate steps along the way where the parties choose cooperation vs. defection, such as when expanding their resources. Iterated prisoner's dilemmas provide a very strong basis for cooperation, as was demonstrated by Robert Axelrod's tournaments. As the Wikipedia article explains:
In summary, success in an evolutionary "game" correlated with the following characteristics: being nice (not defecting before the opponent does), being retaliatory (responding to defection rather than letting oneself be exploited), being forgiving (returning to cooperation once the opponent cooperates again), and being non-envious (not striving to outscore the opponent).
Iterated prisoner's dilemmas yielded cooperation in an evolutionary environment with no pre-existing institutions or enforcement mechanisms, and the same should apply even in those iterated prisoner's dilemmas between groups today where no formal governing systems are yet in place. In this light, it seems suffering reducers should put out their compromise hand first and aim to help all values in a power-weighted fashion, at least to some degree, and then if we see others aren't reciprocating, we can temporarily withdraw our assistance.
One can imagine agents for whom compromise is actually not beneficial because they have increasing rather than diminishing returns to resources. In the "Introductory example," we saw that both animal welfarists and deep ecologists had diminishing returns with respect to how much control they got, because they could satisfy their most important concerns first, and then later concerns were less and less important. Imagine instead an agent that believes that the value of a happy brain is super-linear in the size of that brain: e.g., say the value is quadratic. Then the agent would prefer a 50% chance of getting all the matter M in the future light cone to produce a brain with value proportional to M2 rather than a guarantee of getting half of the matter in the universe to produce a brain with value proportional to (M/2)2 = M2/4. I think agents of this type are rare, but we should be cautious about the possibility.
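A quick numerical check of this caveat, contrasting the quadratic utility described above with an illustrative concave (square-root) utility; the square-root function is an assumption for contrast, not from the essay.

```python
M = 1.0  # total matter in the future light cone (normalized)

def expected_utility_of_gamble(utility, p=0.5):
    """Expected utility of a p chance at everything and a (1 - p) chance at nothing."""
    return p * utility(M) + (1 - p) * utility(0.0)

quadratic = lambda x: x ** 2    # increasing returns to resources
concave = lambda x: x ** 0.5    # diminishing returns to resources

for name, u in [("quadratic", quadratic), ("square-root", concave)]:
    gamble = expected_utility_of_gamble(u)
    sure_half = u(M / 2)
    choice = "the gamble" if gamble > sure_half else "the sure half"
    print(f"{name}: gamble = {gamble:.3f}, sure half = {sure_half:.3f} -> prefers {choice}")
```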
Another interesting case is that of sacred values. It seems that offering monetary compensation for violation of a sacred value actually makes people more unwilling to compromise. While we ordinarily imagine sacred values in contexts like the abortion debate or disputes over holy lands, they can even emerge for modern issues like Iran's nuclear program. Philip Tetlock has a number of papers on sacred-value tradeoffs.
It seems that people are more willing to concede on sacred values in return for other sacred values, which means that compromise with such people is not hopeless but just requires more than a single common currency of exchange.
Bargaining on Earth is fast, reliable, and verifiable. But what would happen in a much bigger civilization that spans across solar systems and galaxies?
The Virgo Supercluster is 110 million light-years in diameter. Suppose there was an "intergalactic federation" of agents across the Virgo Supercluster that met at a Congress at the center of the supercluster. The galaxies could transmit digital encodings of their representatives via radar, which would take 55 million years for the most distant regions. The representatives would convene, reach agreements, and then broadcast back the decisions, taking another 55 million years to reach the destination galaxies. This process would be really slow, especially if the digital minds of the future run at extremely high speeds. Still, if we had, say, 10^12 years before dark energy separated the parts of the supercluster too far asunder, we could still get in 10^12/10^8 = 10,000 rounds of exchanges. (As Andres Gomez Emilsson pointed out to me, this calculation doesn't count the expansion of space during that time. Maybe the actual number of exchanges would be lower on this account.) In addition, if the galaxies dispatched new representatives before the old ones returned, they could squeeze in many more rounds, though with less new information at each round.
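A minimal sketch of the round-counting arithmetic above. Using the full 110-million-year round trip rather than the rounded 10^8 gives roughly 9,000 rounds, the same order of magnitude as the 10,000 quoted; cosmic expansion is ignored, as in the original estimate.

```python
one_way_ly = 55e6                    # light-years from the farthest galaxies to the central Congress
round_trip_years = 2 * one_way_ly    # signal out plus decisions broadcast back
horizon_years = 1e12                 # rough time before dark energy pulls the supercluster apart

rounds = horizon_years / round_trip_years
print(f"{rounds:,.0f} rounds of exchange")   # about 9,091
```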
Would it even be possible to transmit radar signals across the 55 million light-years? According to Table 1 of "How far away could we detect radio transmissions?," most broadband signals can travel just a tiny fraction of a light-year. S-band waves sent at high enough EIRP could potentially travel hundreds of light-years. For instance, the table suggests that at 22 terawatts of transmission EIRP, the detection range would be 720 light-years.
In the 1970s, humanity as a whole used ~10 terawatts, but the sun produces 4 * 10^14 terawatts, so maybe 22 terawatts is even conservative. The detection range is proportional to the square root of EIRP, so multiplying the detection range by 10 requires multiplying EIRP by 100. Obviously hundreds or thousands of light-years for radar transmission is tiny compared with 55 million light-years for the intergalactic distances at hand, but the communication can be routed from one star to the next. There are "rogue" intergalactic stars that might serve as rest stops, but whether they could be located and whether they would all be within a few thousand light-years of each other is unclear. Perhaps custom-built probes could relay messages from one node to the next across large interstellar distances, creating an intergalactic equivalent of the Internet.
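A short sketch of the scaling described above, taking the 22-terawatt / 720-light-year figure from the cited table as the baseline and assuming detection range scales with the square root of EIRP.

```python
import math

base_eirp_tw = 22.0      # terawatts of transmission EIRP (from the cited table)
base_range_ly = 720.0    # detection range in light-years at that EIRP

def eirp_needed(target_range_ly):
    """EIRP (in TW) for a given detection range, assuming range scales as sqrt(EIRP)."""
    return base_eirp_tw * (target_range_ly / base_range_ly) ** 2

print(eirp_needed(7200.0))               # 10x the range needs 100x the EIRP: 2200 TW
print(math.ceil(55e6 / base_range_ly))   # ~76,389 relay hops to span 55 million light-years
```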
Intergalactic communication is easier than intergalactic travel for material structures (e.g., the initial von Neumann probes that would do the colonizing). If solutions were found for intergalactic travel (e.g., speculative faster-than-light scenarios), these would aid in intergalactic compromise as well.
Even if you can make deals every 110 million years, how do you verify that the distant regions are following up on their sides of the bargains? Maybe the different factions (e.g., deep ecologists vs. animal welfarists) could build monitoring systems to watch what the others were doing. Representatives from all the different factions could be transmitted back from Congress to the home galaxies for follow-up inspections. But what would keep the home galaxies from just destroying the inspectors? Who would stop them? Maybe the home galaxies would have to prove at the next Congress session that they didn't hamper the inspectors, but it's not at all clear it would be possible to verify that.
What might work better would be if each home galaxy had a proportionate balance of parties from the different factions so that they would each have the power to keep the other sides in check. For example, if there were lots of deep ecologists and animal welfarists in both galaxies, most of the compromise could be done on a local scale, the same as it would be if intergalactic communication didn't exist. A risk would be if some of the local galaxies devolved into conflict in which some of the parties were eliminated. Would the other parts of the supercluster be able to verify that this had happened? And even if so, could a police force rectify the situation?
Cross-supercluster communication seems tricky. Probably most of the exchanges among parties would happen at a local level, and intergalactic trades might be a rare and slow process.
The easiest time to "divide up our future light cone" among competing factions seems to be at the beginning, before we send out the first von Neumann probes. Either different factions would be allocated different portions of the universe into which to expand, or all parties would agree upon a compromise payload to spread uniformly. This latter solution would prevent attempts to cheat by colonizing more than your fair share.
Of course, we would still need to compromise with aliens if we encountered them, but among (post-)humans, maybe all the compromise could be done at the beginning.
However, the idea of finding a perfect compromise at the beginning of colonization that continues working forever assumes that reliable goal preservation is possible, even for rapidly changing and learning digital agents that will persist for billions of years. That level of goal preservation seems very tricky to achieve in artificial general intelligences, which are extremely complex systems. So there might inevitably be divergence in beliefs and values among non-communicating galaxies, and this could eventually lead to conflicts.
Note that merely galactic democracy would be less challenging. The Milky Way is only 100,000 light-years in diameter, and I would guess that most of the stars are within thousands of light-years of each other, so networked radar transmission should be feasible. Congressional cycles would take only 100,000 years instead of 110 million. And the number of stars is not that small: maybe 100 to 400 billion, compared with about 200 trillion in the whole Virgo Supercluster. This is just a factor-of-10^3 difference and so shouldn't affect our expected-value calculations too much. In other words, intergalactic bartering isn't necessary for compromise on cosmic scales to still be important.
See "Possible Ways to Promote Compromise." We should evaluate the effectiveness and efficiency of these approaches and explore other ways forward.
In this essay I've focused on value disagreements between factions, and there's a reason for this. Facts and values are fundamentally two separate things. Values are things you want, drives to get something, and hence they differ from organism to organism. Facts are descriptions about the world that are true for everyone at once. Truth is not person-dependent. Even if post-modernists or skeptics are right that truth is somehow person-dependent or that there is no such thing as truth, that realization is at least still true at some meta-level of reasoning. One could deny even this, but such a view is rare, and its holders are presumably not going to do much to try to shape the world.
Given that there is some external truth about the universe, different people can share ideas about it, and other people's beliefs are evidence relevant to what we ourselves should believe. "Person A believes B" is a fact about the universe that our theories need to explain.
We should keep in mind that our Bayesian priors were shaped by various genetic and environmental factors in our development, and if we had grown up with the circumstances of other people, we would hold their priors. In some cases, it's clear that one set of priors is more likely correct -- e.g., if one person grew up with major parts of his brain malfunctioning, his priors are less likely to be accurate than those of someone with a normal brain. One reason for thinking so is that humans' normal brain structure has been shaped by evolution to track truths about the world, whereas random modifications to such a brain are less likely to generate comparably accurate views. In this case, both the normal brain and the malfunctioning brain should agree to give more weight to the priors of the normal brain, though both brains remain useful sources of data.
Even in cases where there's no clear reason to prefer one brain or another, it seems both brains should recognize their symmetry and update their individual priors to a common prior, as Robin Hanson suggests in "Uncommon Priors Require Origin Disputes." This is conceptually similar to two different belief impulses within your own brain being combined into a common belief via dissonance-resolution mechanisms. It's not specified how the merging process takes place -- it's not always an average, or even a weighted average, of the two starting points, but it seems rationally required for the merge to happen. Then, once we have common priors, we should have common posteriors by Aumann's theorem.
Some caveats are in order.
I have some hope that very rational agents of the future will not have much problem with epistemic disagreements, because I think the argument for epistemic modesty is compelling, and most of the smartest people I know accept it, at least in broad outline. If evolutionary pressures continue to operate going forward, they'll select for rationality, which means those practicing epistemic modesty should generally win out, if it is in fact the right stance to take. Thus, I see value conflicts as a more fundamental issue in the long run than factual ones.
That said, many of the conflicts we see today are at least partially, and sometimes primarily, about facts rather than values. Some debates in politics, for instance, are at least nominally about factual questions: Which policy will improve economic growth more? Are prevention measures against climate change cost-effective? Does gun control reduce violent crime? Of course, in practice these questions tend to become ideologized into value-driven emotional issues. Similarly, many religious disputes are at least theoretically factual -- What is/are the true God/gods? What is His/Her/their will for humanity? -- although, even more than in politics, many impulses on these questions are driven by emotion rather than genuine factual uncertainty. It's worth exploring how much rationality would promote compromise in these domains vs. how much other sociological factors are the causes and hence the best focal points for solutions.
There are disagreements in the effective-altruism movement about which causes to pursue and in what ways. I think many of the debates ultimately come down to value differences -- e.g., how much to care about suffering vs. happiness vs. preferences vs. other things, whether to care about animals or just humans and how much, whether to accept Pascalian gambles. But many other disagreements, especially in the short term, are about epistemology: How much can we grapple with long-term scenarios vs. how much should we just focus on short-term helping? How much should we focus on quantified measurement vs. qualitative understanding? How much should we think about flow-through effects?
Some are concerned that these differences in epistemology are harmful because they segregate the movement. I take mostly the opposite view. I think it's great to have lots of different groups trying out lots of different things. This helps you learn faster than if you all agreed on one central strategy. There is some risk of wasting resources on zero-sum squabbles, and it's good to consider cases where that happens and how to avoid them. At the same time, I think competition is also valuable, just as in the private sector. When organizations compete for donors using arguments, they improve the state of the debate and are forced to make the strongest case for their views. (Of course, recruiting donors via other "unfair" means doesn't have this same property.) While it might help for altruists to become better aligned, we also don't want to get comfortable with just averaging our opinions rather than seeking to show why our side may actually be more correct than others supposed.
This discussion highlights a more general point. Sometimes I feel epistemic modesty is too often cited as an empty argument: "Most smart people disagree with you about claim X, so it's probably wrong." Of course, this reasoning is valid, and it's important for everyone to realize as much, but this shouldn't be the end of the debate. There remains the task of showing why X is wrong at an object level. Analogously, we could say, "Theorem Y is true because it's in my peer-reviewed textbook," but it's a different matter to actually walk through the proof and show why theorem Y is correct. And every once in a while, it'll turn out that theorem Y is actually wrong, perhaps due to a typographical error or, on rare occasions, due to a more serious oversight by the authors. Intellectual progress comes from the latter cases: investigating a commonly held assumption and eventually discovering that it wasn't as accurate as people had thought.
Most new ideas are wrong. For every Copernicus or Galileo there are hundreds of scientists who are misguided, confused, or unlucky in interpreting their experimental findings. But we have to not be satisfied with conventional wisdom, and we have to actually look at the details of why others are wrong in order to make progress. It's plausible that an epistemically diverse population leads to faster learning than a uniform one. If startup founders weren't overconfident, we'd have fewer startups and hence less economic growth. Similarly, if people are less confident in their theories, they might push them less hard, and society might have less intellectual progress as a result.
However, epistemic divergence can be harmful in cases where each party can act on its own and thereby spoil the restraint of everyone else; Bostrom et al. call this the "unilateralist's curse." In these cases, it's best if everyone adheres to a policy of epistemic modesty. In general, maybe the ideal situation is for people to hold approximately uniform actual beliefs but then play advocate for a particular idea that they'd like to see explored more, even though it's probably wrong. There are times when I do this: propose something that I don't actually think is right, because I want to test it out.
While fighting over conflicting beliefs is not a good idea, groupthink is a danger in the reverse direction. While groups are collectively more accurate than individuals on average, when a group's views are swayed by conformity to each other or a leader, these accuracy benefits diminish. Groups with strong norms encouraging everyone to speak her own mind and rewarding constructive criticism can reduce groupthink.
Another reason why I sometimes make stronger statements than I actually believe is a sort of epistemic prisoner's dilemma.4 In particular, I often feel that other people don't update enough in response to the fact that I believe what I do. If they're not going to update in my direction, I can't update in their direction, because otherwise my position would be lost, and this would be worse than us both maintaining our different views.
For example, say Alice and Bob both have beliefs about some fact, like the number of countries in the world. Alice thinks the number is around 180; Bob thinks it's around 210. The best outcome would be for both parties to update in each other's directions, yielding something like 195, which is actually the number of independent states recognized by the US Department of State. However, say Alice is unwilling to budge on her estimate. If Bob were to move in her direction -- say part way, to 195 -- then Bob's views would be more accurate, but on collective decisions made by the Alice/Bob team, the decisions would, through their tug of war, be centered on something like (180+195)/2 = 187.5, which is farther from the truth than the collective decisions made by Alice/Bob holding 180 and 210 as their beliefs. In other words, if the collective decision-making process itself partly averages Alice's and Bob's views, then Bob should hold his ground as long as Alice holds her ground, even if this means more friction in the form of zero-sum conflicts due to their epistemic disagreement.
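A tiny sketch of the tug-of-war arithmetic above, under the stated assumption that collective decisions land at the simple average of the two positions.

```python
truth = 195   # approximate number of recognized states (per the example above)

def team_decision(alice_belief, bob_belief):
    """Assume collective decisions land at the average of the two stated positions."""
    return (alice_belief + bob_belief) / 2

for label, bob in [("Bob holds his ground", 210), ("Bob alone updates", 195)]:
    decision = team_decision(180, bob)
    print(f"{label}: decision = {decision}, error = {abs(decision - truth)}")
# Bob holds his ground: decision = 195.0, error = 0.0
# Bob alone updates: decision = 187.5, error = 7.5
```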
If Alice and Bob are both altruists, then this situation should be soluble by each side realizing that it makes sense to update in the other's direction. There's not an inherent conflict due to different payoffs to each party like in the regular prisoner's dilemma.
In general, epistemic compromise is similar to game-theoretic compromise in that it makes both parties better off, because both sides in general improve their beliefs, and hence their expected payoffs, in the process of resolving disagreement. Of course, if the agents have anticorrelated values, there can be cases where disagreement resolution is net harmful to at least one side, such as if a terrorist group resolves its factual disagreement with the US government about which method of making dirty bombs is most effective. By improving the factual accuracy of the terrorists, this may have been a net loss for the US government's goals.
When is moral activism a positive-sum activity for society, and when does it just transfer power from one group to another? This is a complex question.
Consider the case of an anti-death-penalty activist trying to convince people who support the death penalty that this form of punishment is morally wrong. Naively we might say, "Some people support the death penalty, others oppose it, and all that's going on here is transferring support from one faction to the other. Hence this is zero-sum."
On the other hand, we could reason this way instead: "Insofar as the anti-death-penalty activist is successful, she's demonstrating that the arguments against the death penalty are convincing. This is improving society's wisdom as people adopt more informed viewpoints. Most people should favor more informed viewpoints, so this is a win by many people's values, at least partially." The extent to which this is true depends on how much the persuasion is being done via means that are seen as "legitimate" (e.g., factual evidence, philosophical logic, clear thought experiments, etc.) and how much it's being done via "underhanded" methods (e.g., deceptive images, pairing with negative stimuli, ominous music, smear tactics, etc.). Many people are glad to be persuaded by more legitimate means but resistant to persuasion by the underhanded ones.
So there's a place for moral advocacy even in a compromise framework: Insofar as many factions welcome open debate, they win when society engages in moral discourse. When you change the opinion of an open-minded person, you're doing that person a service. Think of a college seminar discussion: Everyone benefits from the comments of everyone else. Other times moral persuasion may not be sought so actively but is still not unwelcome, such as when people distribute fliers on the sidewalk. Given that the receivers are voluntarily accepting the flier and open to reading it, we'd presume they place at least some positive value on the activity of the leafleter (although the value could be slightly negative if the person accepts the leaflet only due to social pressure). Of course, even if positive, moral persuasion might be far from optimal in terms of how resources are being used; this depends on the particulars of the situation -- how much the agents involved benefit from the leafleting.
However, not everyone is open to persuasion. In some instances a person wants to keep his values rigid. While this may seem parochial, remember that sometimes all of us would agree with this stance. For example: "If you offered Gandhi a pill that made him want to kill people, he would refuse to take it, because he knows that then he would kill people, and the current Gandhi doesn't want to kill people." Being convinced by underhanded means that we should kill people is a harm to our current values. In these cases, underhanded persuasion mechanisms are zero-sum because the losing side is hurt as much as the winning side is helped. Two opposed lobby groups using underhanded methods would both benefit from cancelling some of each other's efforts and directing the funds to an agreed upon alternate cause instead. On the other hand, opposed lobby groups that are advancing the state of the debate are doing a service to society and may wish to continue, even if they're in practice cancelling each other's effects on what fraction of people adopt which stance in the short run.
If changing someone's beliefs against his wishes is a harm to that person, then what are we to make of the following case? Farmer Joe believes that African Americans deserve to be slaves and should not have Constitutional rights. Furthermore, he doesn't want to have his views changed on this matter. Is it a harm to persuade Joe, even by purely intellectual arguments, that African Americans do in fact deserve equal rights? Well, technically yes. Remember that what persuasion methods count as "legitimate" vs. "underhanded" is in the eye of the hearer, and in this case, Joe regards any means of persuasion as underhanded. That said, if Joe were to compromise with the anti-slavery people, the compromise would involve everyone being 99+% against slavery, because in terms of power to control the future, the anti-slavery camp seems to be far ahead. Alternatively, maybe the anti-slavery people could give Joe something else he wants (e.g., an extra couple of shirts) in return for his letting them persuade him of the anti-slavery stance. This could be a good trade for Joe given his side's low prospects of winning in the long run.
As this example reminds us, the current distribution of opinion is not necessarily the same as the future distribution of power, and sometimes we can anticipate in which directions the trends are going. For example, it seems very likely that concern for animal wellbeing will dramatically increase in the coming decades. Unlike the stock market, the trajectory of moral beliefs is not a random walk.
Above we saw that moral discourse can often be a positive-sum activity insofar as other parties welcome being persuaded. (Of course, it may not always be as positive-sum as other projects that clearly benefit everyone, such as promoting compromise theory and institutions.) Conflicts in the realm of ideas are usually a good thing.
In contrast, direct actions may be more zero-sum when there's disagreement about the right action to take. Say person A thinks it's good to do a given action, and person B thinks it's equally wrong to do that same action.
While people often complain about "all talk and no action," in some cases, it can be Pareto-better to talk than to take action, if the issue at hand is one under dispute.
Often our actions meld with our beliefs about what's right, so it can sometimes get tricky, if you're trying to adopt a compromise stance for your actions, to mentally separate "how I'm acting for instrumental reasons" from "how I feel for intrinsic reasons." Sometimes people may begin to think of the compromise stance as intrinsically the "right" one, while others will continue to maintain this separation. In our own brains, we can feel the distinction between these two categories with respect to our evolutionary drives: Instrumental reciprocity feels like our moral sense of fairness, and our intrinsic survival drives feel like selfish instincts.
Control of Earth's future light cone is something that most value systems want, whether they are egoists, fun theorists, utilitarians, complexity maximizers, or something else.
Each of these value systems can be regarded like an individual in an economy, aiming to maximize its own utility. Each egoist has a separate goal from other egoists, so most of the individuals in this economy might be egoists, and then there would be a few other (very large) individuals corresponding to the fun theorists, utilitarians, complexity maximizers, etc. Resources in this economy include stars, raw materials for building Dyson swarms, knowledge databases, algorithm source code, etc., and an individual's utility derives from using resources to produce what it values.
It's possible that the future will literally contain many agents with divergent values, but it's also possible that just one of these agents will win the race to build AI first, in which case it would have the light cone to itself. There are two cases to consider, and both suggest compromise as a positive-sum resolution to the AI race.
Consider an AI race between eudaimonia maximizers and paperclip maximizers, with odds of winning p and 1-p respectively. If these factions are risk-neutral, then
expected utility of eudaimons = p * utility(resources if win) = utility(p * (resources if win)),
and similarly for the paperclippers. That is, we can pretend for purposes of analysis that when the factions compete for winner-takes-all, they actually control miniature future light cones that are p and 1-p times the size of the whole thing. But some parts of the light cone may be differentially more valuable than others. For example, the paperclippers need lots of planets containing iron and carbon to create steel, while the eudaimons need lots of stars for energy to power their simulations. So the parties would gain from trading with each other: The eudaimons giving away some of their planets in return for some stars. And similarly among other resource dimensions as well.
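To illustrate the gains from swapping differentially valuable resources, here is a sketch with made-up valuations (planets nearly worthless to the eudaimons, stars nearly worthless to the paperclippers); the specific numbers are assumptions, not from the essay.

```python
# Illustrative (made-up) valuations: eudaimons mostly want stars for energy,
# paperclippers mostly want planets for raw metal. Holdings follow the
# "pretend each faction controls p of the light cone" reasoning above, with p = 0.5.

def eudaimon_utility(planets, stars):
    return stars + 0.1 * planets      # planets are nearly useless to them

def clipper_utility(planets, stars):
    return planets + 0.1 * stars      # stars are nearly useless to them

# Out of 100 planets and 100 stars, each faction "controls" half of each.
before = (eudaimon_utility(50, 50), clipper_utility(50, 50))

# After a trade: eudaimons hand over their planets in exchange for the clippers' stars.
after = (eudaimon_utility(0, 100), clipper_utility(100, 0))

print(before)   # (55.0, 55.0)
print(after)    # (100.0, 100.0) -- both factions are better off
```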
For risk-averse agents, the argument for compromise is even stronger. In particular, many egoists may just want to create one immortal copy of themselves (or maybe 5 or 10 for backup purposes); they don't necessarily care about turning the whole future light cone into copies of themselves, and even if they'd like that, they would still probably have diminishing marginal utility with respect to the number of copies of themselves. Likewise for people who care in general about "survival of the human race": It should be quite cheap to satisfy this desire with respect to present-day Earth-bound humans relative to the cosmic scales of resources available. Other ideologies may be risk-averse as well; e.g., negative utilitarians want some computing power to figure out how to reduce suffering, but they don't need vast amounts because they're not trying to fill the cosmos with anything in particular. Even fun theorists, eudaimons, etc. might be satisficing rather than maximizing and exhibit diminishing marginal utility of resources.
In these instances, the case for compromise is even more compelling: not only can the parties exchange resources that are differentially valuable, but the compromise also reduces uncertainty, which boosts expected utility in the same way that insurance does for buyers. For instance, with an egoist who just wants one immortal copy of herself, the expected utility of the outcome is basically proportional to the probability that the compromise goes through, which could be vastly higher than her probability of winning the whole light cone. Individual egoists might band together into collective-bargaining units to reduce the transaction costs of making trades with each human separately. This might serve like a group insurance plan, and those people who had more power would be able to afford higher-quality insurance plans.
Carl Shulman has pointed out the usefulness of risk aversion in encouraging cooperation. And indeed, maybe human risk aversion is one reason we see so much compromise in contemporary society. Note that if even only one side is risk-averse, we tend to get very strong compromise tendencies. For example, insurance companies are not risk-averse with respect to wealth (for profits or losses on the order of a few million dollars), but because individuals are, individuals buy insurance, which benefits both parties.
Just like in a market economy, trade among value systems may include externalities. For instance, suppose that many factions want to run learning computations that include "suffering subroutines," which negative utilitarians would like to avert. These would be analogous to pollution in a present-day context. In a Coase fashion, the negative utilitarians might bargain with the other parties to use alternate algorithms that don't suffer, even if they're slightly costlier. The negative utilitarians could pay for this by giving away stars and planets that they otherwise would have (probabilistically) controlled.
The trade among value systems here has some properties of a market economy, so some of the results of welfare economics will apply. If there are not many buyers and sellers, no perfect information, etc., then the first fundamental welfare theorem may not fully hold, but perhaps many of its principles would obtain in weaker form.
In general, markets are some of the most widespread and reliable instances of positive-sum interaction among competing agents, and we would do well to explore how, why, and when markets work or don't work.
Of course, all of these trade scenarios depend on the existence of clear, robust mechanisms by which compromises can be made and maintained. Such mechanisms are present in peaceful societies that allow for markets, contracts, and legal enforcement, but it's much harder in the "wild west" of AI development, especially if one faction controls the light cone and has no more opposition. Exploring how to make compromise function in these contexts is an urgent research area with the potential to make everyone better off.
Consider a multi-dimensional space of possible values: Happiness, knowledge, complexity, number of paperclips, etc. Different value systems (axiologies) care about these dimensions to different degrees. For example, hedonistic utilitarians care only about the first and not about the rest. Other people care about each of the first three to some degree.
We can think of a person's axiology as a vector in values space. The components of the vector represent what weight (possibly negative) the person places on that particular value. In a four-dimensional values space of (happiness, knowledge, complexity, paperclips), hedonistic utilitarians have the vector (1, 0, 0, 0). Other people have vectors like (0.94, 0.19, 0.28, 0). Here I've normalized these to unit vectors. To evaluate a given change in the world, the axiologies take the scalar projection of the change onto their vector, i.e., the dot product. For example, if the change is (+2, -1, +1, +4), utilitarians evaluate this as (1, 0, 0, 0) * (2, -1, 1, 4) = 1 * 2 + 0 * -1 + 0 * 1 + 0 * 4 = 2, while the other axiology evaluates its value to be 0.94 * 2 + 0.19 * -1 + 0.28 * 1 + 0 * 4 = 1.97.
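A minimal sketch of the dot-product evaluation just described, reproducing the essay's numbers.

```python
def evaluate(axiology, change):
    """Value of a change to an axiology: the dot product of the two vectors."""
    return sum(a * c for a, c in zip(axiology, change))

# Dimensions: (happiness, knowledge, complexity, paperclips)
hedonistic_utilitarian = (1.0, 0.0, 0.0, 0.0)
pluralist = (0.94, 0.19, 0.28, 0.0)
change = (2, -1, 1, 4)

print(evaluate(hedonistic_utilitarian, change))   # 2.0
print(round(evaluate(pluralist, change), 2))      # 1.97
```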
We can imagine a similar set-up with the dimensions being policies rather than values per se, with each axiology assigning a weight to how much it wants or doesn't want each policy. This is the framework that Robin Hanson suggested in his post, "Policy Tug-O-War." The figure provides a graphical illustration of a compromise in this setting.
Adrian Hutter suggested an extension to this formalism: The length of each vector could represent the number of people who hold a given axiology. Or, I would add, in the case of power-weighted compromise, the length could represent the power of the faction. Would the sum of the axiology vectors with power-weighted lengths then represent a natural power-weighted compromise solution? Of course, there may be constraints on which vectors are achievable given resources and other limitations of physical reality.
In some cases, summing axiology vectors seems to give the right solution. For example, consider two completely orthogonal values: Paperclips (x axis) and staples (y axis). Say a paperclip maximizer has twice as much power as its competitor staple maximizer in competing to control Earth's future light cone. The sum of their vectors would be 2 * (1,0) + (0,1) = (2,1). That means 2/3 of resources go to paperclips and 1/3 to staples, just as we might expect from a power-weighted compromise.5
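A short sketch of the power-weighted sum for the paperclip/staple example above; representing each faction's power as a scalar weight on its unit axiology vector follows the text.

```python
def power_weighted_sum(factions):
    """Sum power * axiology over factions; the components give each value's share of resources."""
    dims = len(next(iter(factions.values()))[1])
    total = [0.0] * dims
    for power, axiology in factions.values():
        for i in range(dims):
            total[i] += power * axiology[i]
    return total

# Dimensions: (paperclips, staples); powers follow the 2:1 ratio in the example above.
factions = {
    "paperclip maximizers": (2.0, (1.0, 0.0)),
    "staple maximizers": (1.0, (0.0, 1.0)),
}

combined = power_weighted_sum(factions)
shares = [x / sum(combined) for x in combined]
print(combined)   # [2.0, 1.0]
print(shares)     # roughly [0.667, 0.333]: two thirds of resources to paperclips, one third to staples
```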
However, imagine now that there's a design for staples that allows paperclips to be fit inside them. This means the staple maximizers could, if they wanted, create some paperclips as well, although by default they wouldn't bother to do so. Assume there is no such design to fit staples inside paperclips. Now the staple maximizers have extra bargaining leverage: "If we get more than 1/3 of resources for staples," they can say, "we'll put some paperclips inside our staples, which will make both of us better off." Here the compromise outcome is based not just on pure power ratios (i.e., probabilities of winning control in a fight) but also on bargaining leverage. This is discussed more in "Appendix: Dividing the compromise pie."
I think advancing compromise is among the most important projects that we who want to reduce suffering can undertake. A future without compromise could be many times worse than a future with it. This is also true for other value systems as well, especially those that are risk-averse. Thus, advancing compromise is a win-win(-win-win-win-...) project that many of us may want to work on together. It seems like a robustly positive undertaking, squares with common sense, and is even resilient to changes in our moral outlook. It's a form of "pulling the rope sideways" in policy tug-o-wars.
This essay was inspired by a discussion with Lucius Caviola. It draws heavily from the ideas of Carl Shulman. Also influential were writings by Jonah Sinick and Paul Christiano. An email from Pablo Stafforini prompted the section on epistemic convergence.
Consider several factions competing in a winner-takes-all race to control the future light cone. Let p_i denote the probability that faction i wins. Normalize the utility values for each faction so that a utility of 0 represents losing the light-cone race and a utility of 100 represents winning it. Absent compromise, faction i's expected utility is 100 * p_i. Thus, in order for i to be willing to compromise, the compromise must offer it at least 100 * p_i, because otherwise it could do better by continuing to fight on its own. Compromise allocations that respect this "individual rationality" requirement are called imputations.
We can see the imputations for the case of deep ecologists and animal welfarists in Figure 2. Absent bargaining, each side gets an expected utility of 100 * 0.5 = 50 by fighting for total control. Bargaining would allow each side to get more than half of what it wants, and the excess value to each side constitutes the gain from compromise.
Utility is transferable if it can be given to another party without losing any of the value. We can see in Figure 2 that utility is not completely transferable between deep ecologists and animal welfarists, because the Pareto frontier is curved. If the animal welfarists give up 1 unit of expected utility, the deep ecologists may not gain 1 whole unit. Utility would be transferable in the bargaining situation if the Pareto frontier between the two dashed black lines were straight.
In the special case when utility is transferable, we can use all the mechanics of cooperative game theory to analyze the situation. For example, the Shapley value gives an answer to the problem of what the exact pie-slicing arrangement should look like, at least if we want to satisfy the four axioms that uniquely specify the Shapley division.
It's an interesting theorem that if a cooperative game is convex, then all of the players want to work together (i.e., the core is non-empty and also unique), and the Shapley value gives "the center of gravity" of the core. Alas, as far as I can tell, real-world situations will not always be convex.
Many times the utility gains from compromise are not completely transferable. We saw this in Figure 2 through the fact that the Pareto frontier is curved. Define u := (animal-welfarist expected utility) - 50, i.e., the excess expected utility above no compromise, and v := (deep-ecologist expected utility) - 50. The (u,v) points that lie within the dotted lines and the curved red line are the potential imputations, i.e., ways to divide the gains from trade. That utility is not transferable in this case means we can't represent the Pareto frontier by a line u + v = constant.
However, we can use another approach, called the Nash bargaining game. In Nash's solution, the bargaining point is that which maximizes u * v. Figure 304.1 (p. 304) of A Course in Game Theory by Osborne and Rubinstein illustrates this graphically as the intersection of lines u * v = constant with the set of imputations, and I've drawn a similar depiction in Figure 3:
Animal welfarists' expected utility (AW) | Deep ecologists' expected utility (DE) | AW - 20 | DE - 80 | (AW - 20) * (DE - 80) |
--- | --- | --- | --- | --- |
20 | 96 | 0 | 16 | 0 |
22 | 95.16 | 2 | 15.16 | 30.32 |
24 | 94.24 | 4 | 14.24 | 56.96 |
26 | 93.24 | 6 | 13.24 | 79.44 |
28 | 92.16 | 8 | 12.16 | 97.28 |
30 | 91 | 10 | 11 | 110 |
32 | 89.76 | 12 | 9.76 | 117.12 |
34 | 88.44 | 14 | 8.44 | 118.16 |
36 | 87.04 | 16 | 7.04 | 112.64 |
38 | 85.56 | 18 | 5.56 | 100.08 |
40 | 84 | 20 | 4 | 80 |
42 | 82.36 | 22 | 2.36 | 51.92 |
44 | 80.64 | 24 | 0.64 | 15.36 |
44.72 | 80 | 24.72 | 0 | 0 |
The maximum of the product in the last column occurs around (34, 88), which will be the Nash compromise arrangement. The animal welfarists get a surplus of roughly 34 - 20 = 14, and the deep ecologists roughly 88 - 80 = 8.
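For readers who want to check this numerically, here is a minimal sketch that reproduces the table's calculation. Two assumptions are inferred from the table rather than stated in the text: the Pareto frontier has the form DE = 100 - AW^2/100 (which matches every row above), and the disagreement point is (20, 80).

```python
import numpy as np

# Assumed frontier and disagreement point, inferred from the table above.
aw_eu = np.linspace(20, 44.72, 100_000)        # animal welfarists' expected utility
deep_eu = 100 - aw_eu**2 / 100                 # deep ecologists' expected utility on the frontier

nash_product = (aw_eu - 20) * (deep_eu - 80)   # each side's surplus over its disagreement payoff
best = np.argmax(nash_product)

print(f"Nash bargaining point: AW ~ {aw_eu[best]:.2f}, DE ~ {deep_eu[best]:.2f}, "
      f"product ~ {nash_product[best]:.2f}")
# Under these assumptions the maximum lies near AW = 33.3, DE = 88.9, consistent with the
# (34, 88) entry highlighted above (the table only samples AW in steps of 2).
```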
It's worth noting that in fact any of the divisions in the table is a Nash equilibrium, because given the demand of one faction for a share of the pie, the other faction can only either (1) take less, which it wouldn't want to do, or (2) demand more and thereby ruin the compromise, leaving it with no surplus. Thus, the bargaining solution allows us to narrow down to a particular point among the infinite set of Nash equilibria.
The bargaining game contains other solutions besides Nash's that satisfy different intuitive axioms.
The bargaining problem with more than two players becomes more complicated. In "A Comparison of Non-Transferable Utility Values," Sergiu Hart identifies three different proposals for dividing the compromise pie -- Harsanyi (1963), Shapley (1969), and Maschler and Owen (1992) -- each of which may give different allocations. Each proposal has its own axiomatization (see endnote 1 of Hart's paper), so it's not clear which of these options would be chosen. Perhaps one would emerge as a more plausible Schelling point than the others as the future unfolds.
| | Hawk | Dove | Own-cooperator |
| --- | --- | --- | --- |
Hawk | -1, -1 | 2, 0 | -1, -1 |
Dove | 0, 2 | 1, 1 | 0, 2 |
Own-cooperator | -1, -1 | 2, 0 | 1, 1 |
Here, Own-cooperator is an ESS using the first condition of Maynard Smith and Price: For the strategy S = Own-cooperator, for any T in {Hawk, Dove}, playing S against S is strictly better than playing T against S.
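As a quick check, here is a minimal sketch that verifies that ESS condition directly against the payoff matrix above (the payoff values are copied from the table; nothing else is added).

```python
# payoff[row][col] = payoff to the row strategy when playing against the column strategy
payoff = {
    "Hawk":           {"Hawk": -1, "Dove": 2, "Own-cooperator": -1},
    "Dove":           {"Hawk": 0,  "Dove": 1, "Own-cooperator": 0},
    "Own-cooperator": {"Hawk": -1, "Dove": 2, "Own-cooperator": 1},
}

S = "Own-cooperator"
for T in ["Hawk", "Dove"]:
    # First ESS condition: E(S, S) > E(T, S), i.e., S does strictly better against S than T does.
    assert payoff[S][S] > payoff[T][S], f"{T} would not be repelled"
    print(f"E({S}, {S}) = {payoff[S][S]} > E({T}, {S}) = {payoff[T][S]}")
```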
The Importance of Wild-Animal Suffering
I personally believe that most animals (except maybe those that live a long time, like >3 years) probably endure more suffering than happiness overall, because I would trade away several years of life to avoid the pain of the average death in the wild. And this hypothetical tradeoff assumes that animal lives prior to death are net positive (which is not obvious, in view of cold, hunger, disease, fear of predators, and all the rest).
However, this belief of mine is somewhat controversial. I think the claim of net expected suffering in nature needs only a weaker assertion: namely, that almost all of the expected happiness and suffering in nature come from small animals (e.g., minnows and insects). The adults of these species live at most a few years, often just a few months or weeks, so it's even harder in these cases for the happiness of life to outweigh the pain of death. Moreover, almost all the babies of these species die (possibly painfully) after just a few days or weeks of being born, because most of these species are "r-selected" -- see Type III in this chart.
"The total amount of suffering per year in the natural world is beyond all decent contemplation. During the minute it takes me to compose this sentence, thousands of animals are being eaten alive; others are running for their lives, whimpering with fear; others are being slowly devoured from within by rasping parasites; thousands of all kinds are dying of starvation, thirst and disease."
-- Richard Dawkins, River Out of Eden[Dawkins]
"Many humans look at nature from an aesthetic perspective and think in terms of biodiversity and the health of ecosystems, but forget that the animals that inhabit these ecosystems are individuals and have their own needs. Disease, starvation, predation, ostracism, and sexual frustration are endemic in so-called healthy ecosystems. The great taboo in the animal rights movement is that most suffering is due to natural causes."
-- Albert, a fictional dog in philosopher Nick Bostrom's "Golden"[Bostrom-Alfred]
"The moralistic fallacy is that what is good is found in nature. It lies behind the bad science in nature-documentary voiceovers: lions are mercy-killers of the weak and sick, mice feel no pain when cats eat them, dung beetles recycle dung to benefit the ecosystem and so on."
-- Steven Pinker[Pinker]
"People who accuse us of putting in too much violence, [should see] what we leave on the cutting-room floor."
-- David Attenborough, speaking about his nature documentaries[Attenborough]
"In sober truth, nearly all the things which men are hanged or imprisoned for doing to one another, are nature's every day performances. [...] The phrases which ascribe perfection to the course of nature can only be considered as the exaggerations of poetic or devotional feeling, not intended to stand the test of a sober examination. No one, either religious or irreligious, believes that the hurtful agencies of nature, considered as a whole, promote good purposes, in any other way than by inciting human rational creatures to rise up and struggle against them."
-- John Stuart Mill, "On Nature"[Mill]
Animal activists typically focus their efforts on areas where humans directly interact with members of other species, such as on "factory farms," in laboratory experiments, and, to a much lesser degree, in zoos, circuses, rodeos, and the like.
Rarely discussed is the topic of animal suffering in the wild, even in the academic literature, though there have been notable exceptions.[exceptions] In this piece, I emphasize that the number of wild animals on which humans have an impact is simply too large for animal advocates to ignore. Intense suffering is a regular feature of life in the wild that demands, perhaps not quick-fix intervention, but at least long-term research into the welfare of wild animals and technologies that might one day allow humans to improve it. I conclude by encouraging animal advocates to focus their efforts on promoting concern about wild-animal suffering among other activists, academics, and others who would be sympathetic -- both to encourage research on the issue and to ensure that our descendants use their advanced technologies in ways that alleviate wild-animal suffering rather than inadvertently multiply it.
The scale of animal suffering at human hands is vast, and animal advocates are right to be appalled by its magnitude. However, the numbers of animals that live in the wild are staggeringly larger. For rough population estimates, see my "How Many Wild Animals Are There?"[Tomasik-numbers]
Like their domestic counterparts, animals in the wild have rich emotional lives.[emotions] Unfortunately, many of these emotions are intensely painful, often needlessly so. And while "Nature, red in tooth and claw" is widely known as a platitude, its visceral meaning can often be overlooked. Below I review some details of wild-animal suffering, perhaps in a manner similar to the way in which animal advocates decry acts of cruelty by humans.
When people imagine suffering in nature, perhaps the first image that comes to mind is that of a lioness hunting her prey. Christopher McGowan, for instance, vividly describes the death of a zebra:
The lioness sinks her scimitar talons into the zebra's rump. They rip through the tough hide and anchor deep into the muscle. The startled animal lets out a loud bellow as its body hits the ground. An instant later the lioness releases her claws from its buttocks and sinks her teeth into the zebra's throat, choking off the sound of terror. Her canine teeth are long and sharp, but an animal as large as a zebra has a massive neck, with a thick layer of muscle beneath the skin, so although the teeth puncture the hide they are too short to reach any major blood vessels. She must therefore kill the zebra by asphyxiation, clamping her powerful jaws around its trachea (windpipe), cutting off the air to its lungs. It is a slow death. If this had been a small animal, say a Thomson's gazelle (Gazella thomsoni) the size of a large dog, she would have bitten it through the nape of the neck; her canine teeth would then have probably crushed the vertebrae or the base of the skull, causing instant death. As it is, the zebra's death throes will last five or six minutes.[McGowan, pp. 12-13]
Some predators kill rather quickly, such as constrictor snakes that cut off their victims' air flow and induce unconsciousness within a minute or two,[eaten-alive] while others impose a more protracted death, such as hyenas that tear off chunks of ungulate flesh one bite at a time.[Kruuk] Wild dogs disembowel their prey,[McGowan, p. 22] venomous snakes cause internal bleeding and paralysis over the course of several minutes,[McGowan, p. 49] and crocodiles drown large animals in their jaws.[McGowan, p. 43]
One snake owner's guide explains, "Live mice will fight for their lives when they are seized, and will bite, kick and scratch for as long as they can."[Flank] Once captured, "The snake drenches the prey with saliva and eventually pulls it into the esophagus. From there, it uses its muscles to simultaneously crush the food and push it deeper into the digestive tract, where it is broken down for nutrients."[Perry]
Prey may not die immediately after being swallowed, as is illustrated by the fact that some poisonous newts, after ingestion by a snake, excrete toxins to kill their captor so that they can crawl back out of its mouth.[McGowan, p. 59] And regarding housecats, Bob Sallinger of the Audubon Society of Portland remarked, "People who are appalled by the indiscriminate killing of wildlife by mechanisms such as leg-hold traps should recognize that the pain and suffering caused by cat predation is not dissimilar and the impacts of cat predation dwarf the impacts of trapping."[Sallinger]
It's possible that some animals don't suffer intensely from predation in cases where endorphins kick in strongly enough. Similarly, humans sometimes don't feel pain immediately upon severe injury.[Wall] But in many instances of predation, prey continue struggling violently against their aggressors. For instance, in this video, the warthog screams for ~2.5 minutes as it's suffocated. Moreover, insofar as endorphins do sometimes reduce the painfulness of death, the same argument should apply for brutal slaughter of farm animals by humans, yet most animal-welfare scientists consider bad slaughter methods to be extremely painful.
Fear of predators produces not only immediate distress, but it may also cause long-term psychological trauma. In one study of anxiolytics, researchers exposed mice to a cat for five minutes and observed subsequent reactions. They found "that this animal model of exposure of mice to unavoidable predatory stimuli produces early cognitive changes analogous to those seen in patients with acute stress disorder (ASD)."[ElHagePeronnyGriebelBelzung] A follow-up study found long-term impacts on the mice's brains: "predatory exposure induced significant learning disabilities in the radial maze (16 to 22 days poststressor) and in the spatial configuration of objects recognition test (26 to 28 days poststressor). These findings indicate that memory impairments may persist for extended periods beyond a predatory stress."[ElHageGriebelBelzung] Similarly, Phillip R. Zoladz exposed rats to unavoidable predators and other anxiety-causing conditions to "produce changes in rat physiology and behavior that are comparable to the symptoms observed in PTSD patients."[Zoladz] And in a review article, Rianne Stam explained:
Animal models that are characterised by long-lasting conditioned fear responses as well as generalised behavioural sensitisation to novel stimuli following short-lasting but intense stress have a phenomenology that resembles that of PTSD in humans. [...] Weeks to months after the trauma, treated animals on average also show a sensitisation to novel stressful stimuli of neuroendocrine, cardiovascular and gastrointestinal motility responses as well as altered pain sensitivity and immune function.[Stam]
Even for those prey that haven't had a traumatic run-in with a predator, the "ecology of fear" that predators create can be very distressing: "In studies with elk, scientists have found that the presence of wolves alters their behavior almost constantly, as they try to avoid encounters, leave room for escape and are constantly vigilant."[Stauth]
One can argue that evolution should avoid making animal lives excessively horrifying for extended periods prior to death, because doing so might, at least in more complex species, induce PTSD, depression, or other debilitating side effects. Of course, we see empirically that evolution does induce such disorders when traumatic incidents happen, like exposure to a predator. But there's probably some kind of reasonable bound on how bad these can be most of the time if animals are to remain functional. Death itself is a different matter because, once it reaches the point of inevitability, evolutionary pressures don't constrain the emotional experience. Death can be as good as painless (for a few lucky animals) or as bad as torture (for many others). Evolution has no reason to prevent death from feeling unbearably awful.[Dawkins]
Of course, predation is not the only way in which organisms die painfully.
Animals are also stricken by diseases and parasites, which may induce listlessness, shivering, ulcers, pneumonia, starvation, violent behavior, or other gruesome symptoms over the course of days or weeks leading up to death. Avian salmonellosis is just one example:
Signs range from sudden death to gradual onset of depression over 1 to 3 days, accompanied by huddling of the birds, fluffed-up feathers, unsteadiness, shivering, loss of appetite, markedly increased or absence of thirst, rapid loss of weight, accelerated respiration and watery yellow, green or blood-tinged droppings. The vent feathers become matted with excreta, the eyes begin to close and, immediately before death, some birds show apparent blindness, incoordination, staggering, tremors, convulsions or other nervous signs.[Salmonellosis]
Still other animals die of accidents, dehydration during a summer drought, or lack of food during the winter. For instance, 2006 was a harsh year on bats in Placerville, California:
"You can see their ribs, their backbones, and (the area) where the intestine and the stomach are is completely sunk through to the back," said Dharma Webber, founder of the California Native Bat Conservancy. [...] She said emerging mosquitoes aren't enough to feed the creatures. "It would be like us eating a little piece of popcorn here or there," she said.[bats]
(Of course, when the bats do have food, this isn't good news for their prey....)
Even ice storms can be fatal: "Birds unable to find a sheltered perch during the storm may have their feet frozen to a branch or their wings covered in ice making them unable to fly. Grouse buried in snow drifts are often encased by the ice layer and suffocate."[Heidorn]
While death may often constitute the peak of suffering during an animal's life, day-to-day existence isn't necessarily pleasant either. Unlike most humans in the industrialized world, wild animals don't have immediate access to food whenever they become hungry. They must constantly seek out water and shelter while remaining on the lookout for predators. Unlike us, most animals can't go inside when it rains or turn on the heat when winter temperatures drop far below their usual levels. In summary:
It is often assumed that wild animals live in a kind of natural paradise and that it is only the appearance and intervention of human agencies that bring about suffering. This essentially Rousseauian view is at odds with the wealth of information derived from field studies of animal populations. Scarcity of food and water, predation, disease and intraspecific aggression are some of the factors which have been identified as normal parts of a wild environment which cause suffering in wild animals on a regular basis.[UCLA, p. 24]
And while many animals appear to endure such conditions rather calmly, this doesn't necessarily mean they aren't suffering.[BourneEtAl] Sick and injured members of a prey species are the easiest to catch, so predators deliberately target these individuals. As a consequence, those prey that appear sick or injured will be the ones killed most often. Thus, evolutionary pressure pushes prey species to avoid drawing attention to their suffering.[Nuffield, ch. 4.12, p. 66]
Based on studies of stress-hormone levels in domestic and wild animals, Christie Wilcox[Wilcox] concluded that "if we follow the guidelines of care that provide food, water, comfort, and necessary items for behavioral expression, domesticated animals are not only likely to be as happy as their wild relatives, they're probably happier." She also observed:
So the real question becomes whether a domesticated or captive animal is more, less, or as happy in the moment as its wild counterpart. There are a few key conditions that are classically thought to lead to a "happy" animal by reducing undue stress. These are the basis for most animal cruelty regulations, including those in the US and UK. They include that animals have the 'rights' to:
- Enough food and water
- Comfortable conditions (temperature, etc)
- Expression of normal behavior
When it comes to wild animals, though, only the last is guaranteed. They have to struggle to survive on a daily basis, from finding food and water to another individual to mate with. They don't have the right to comfort, stability, or good health. [...] By the standards our governments have set, the life of a wild animal is cruelty.
In nature, the most populous animals are probably the ones that are generally worst off. Small mammals and birds have adult lifespans of at most one to three years before they face a painful death. And many insects count their time on Earth in weeks rather than years -- for instance, just 2-4 weeks for the horn fly.[Cumming] I personally would rather not exist than be born as an insect, struggle to navigate the world for a few weeks, and then die of dehydration or be caught in a spider's web. Worse still might be finding myself entangled in an Amazonian-ant "torture rack" trap for 12 hours,[BBC] or being eaten alive over the course of weeks by an Ichneumon wasp.[Gould, pp. 32-44] (That said, whether caterpillars eaten by Ichneumon wasps feel pain during the experience is unclear.)
It's true that scientists remain uncertain whether insects experience pain in a form that we would consider conscious suffering.[insect-pain] However, the fact that there remains serious debate on the issue suggests that we should not rule out the possibility. And seeing as insects number 10^18,[Williams] with the number of copepods in the ocean of a similar magnitude,[SchubelButman] the mathematical "expected value" (probability times amount) of their suffering is vast. I should note that the force of this point would be lessened if, as may be the case, an animal's "intensity" or "degree" of emotional experience depends to some rough extent on the amount of neural tissue it has devoted to pain signals.
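To illustrate the shape of that expected-value argument, here is a minimal sketch with placeholder numbers. Only the 10^18 insect count comes from the text; the sentience probability and intensity weight are invented purely for illustration.

```python
# Purely illustrative numbers: the point is the structure of the calculation,
# not any particular estimate.
n_insects = 1e18            # rough global insect count cited in the text
p_sentient = 0.1            # assumed probability that insects can suffer (illustrative)
weight_vs_human = 1e-4      # assumed per-individual intensity weight (illustrative)

expected_human_equivalents = n_insects * p_sentient * weight_vs_human
print(f"Expected 'human-equivalent' sufferers among insects: {expected_human_equivalents:.1e}")
# ~1e13 under these assumptions -- several orders of magnitude more than the human
# population, which is why the expected value stays large even after heavy discounts
# for uncertain sentience and small brains.
```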
Tables of animal lifespans typically show durations of survival by adult members of a species. However, most individuals die much sooner, before reaching maturity. This is a simple consequence of the fact that females give birth to far more offspring than can survive to reproduce in a stable population. For instance, while humans can produce only one child per reproductive season (excepting twins), the number is 1-22 offspring for dogs (Canis familiaris), 4-6 eggs for the starling (Sturnus vulgaris), 6,000-20,000 eggs for the bullfrog (Rana catesbeiana), and 2 million eggs for the scallop (Argopecten irradians).[SolbrigSolbrig, p. 37] Take a look at this figure from Thomas J. Herbert's article[Herbert] on r and K selection illustrating extremely high infant mortality for "r strategists." Most small animals like minnows and insects are r strategists.
Granted, it's unclear whether all of these species are sentient -- and even more regarding that fraction of the eggs that fails to hatch (see the next section) -- but again, in expected-value terms, the amount of expected suffering is enormous.
This strategy of "making lots of copies and hoping a few come out" may be perfectly sensible from the standpoint of evolution, but the cost to the individual organisms is tremendous. Matthew Clarke and Yew-Kwang Ng conclude from an analysis of the welfare implications of population dynamics that "The number of offspring of a species that maximizes fitness may lead to suffering and is different from the number that maximizes welfare (average or total)."[ClarkeNg, sec. 4]
In a related paper, "Towards Welfare Biology: Evolutionary Economics of Animal Consciousness and Suffering," Ng concludes from the excess of offspring over adult survivors: "Under the assumptions of concave and symmetrical functions relating costs to enjoyment and suffering, evolutionary economizing results in the excess of total suffering over total enjoyment."[Ng, p. 272] (Update: In 2019, Zach Groff and Ng showed that this theorem contains a mathematical error, and a stronger starting premise is needed to conclude that suffering predominates.[Groff and Ng] I haven't looked into this issue in depth, but you can see some initial thoughts of mine here and here.)
In his famous paper, "Animal liberation and environmental ethics: Bad marriage, quick divorce,"[Sagoff] Mark Sagoff quotes the following passage from Fred Hapgood:[Hapgood]
All species reproduce in excess, way past the carrying capacity of their niche. In her lifetime a lioness might have 20 cubs; a pigeon, 150 chicks; a mouse, 1000 kits; a trout, 20,000 fry; a tuna or cod, a million fry or more; [...] and an oyster, perhaps a hundred million spat. If one assumes that the population of each of these species is, from generation to generation, roughly equal, then on average only one offspring will survive to replace each parent. All the other thousands and millions will die, one way or another.
Sagoff goes on to say: "The misery of animals in nature--which humans can do much to relieve--makes every other form of suffering pale in comparison. Mother Nature is so cruel to her children she makes Frank Perdue look like a saint."
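A minimal sketch of the arithmetic implicit in Hapgood's passage, using his lifetime offspring counts and the simplifying assumption that roughly two of a mother's offspring survive to reproduce in a stable population (one to "replace" each parent):

```python
# Lifetime offspring counts quoted from Hapgood above.
lifetime_offspring = {
    "lioness": 20,
    "pigeon": 150,
    "mouse": 1_000,
    "trout": 20_000,
    "cod or tuna": 1_000_000,
    "oyster": 100_000_000,
}

survivors_per_mother = 2  # simplifying assumption for a roughly stable population

for species, n in lifetime_offspring.items():
    dying_young = n - survivors_per_mother
    print(f"{species}: ~{dying_young:,} of {n:,} offspring "
          f"({100 * dying_young / n:.1f}%) die before reproducing")
```

Even for the lioness, 90% of offspring die before reproducing under this assumption; for the r-selected species, the fraction is indistinguishable from 100%.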
The previous section explained that in r-selected species, parents may have hundreds or even tens of thousands of offspring, and almost all of these die shortly after birth.
But some questions remain. What fraction of these offspring were sentient at the time of death, and what fraction merely died as unconscious eggs or larvae?
EFSA's "Aspects of the biology and welfare of animals used for experimental and other scientific purposes" (pp. 37-42) explores when fetuses of various species begin to feel conscious pain.[EFSA] The paper notes that the age of onset of consciousness varies based on whether a species is precocial (well developed at birth, such as horses) or altricial (still developing at birth, such as marsupials). Precocial animals are more likely to feel pain at earlier ages. Also relevant is whether the species is viviparous (having live birth) or oviparous (giving birth through eggs). Viviparous animals have greater need to inhibit fetal consciousness during development in order to prevent injury to the mother and siblings. Oviparous animals that are constrained by shells have less need for inhibition of awareness before birth. (p. 38)
For this reason, the report suggests: "If awareness is the criterion for protection, birds, reptiles, amphibians, fish and cephalopods may, therefore, be more obviously in need of protection pre-hatching than mammals are in need of protection pre-partum." (p. 38) For example: "Sensory and neural development in a precocial bird such as the domestic chick is very well advanced several days before hatching. Controlled movements and coordinated behavioural and electrophysiological evoked responses to tactile, auditory and visual stimuli appear three or four days before hatching occurs after 21 days of incubation (Broom, 1981)." (p. 39) In contrast: "Even though the mammalian fetus can show physical responses to external stimuli, the weight of present evidence suggests that consciousness does not occur in the fetus until it is delivered and starts to breathe air." (p. 42)
Thus, it seems clear that many animals are able to suffer by the time of birth if not before. The EFSA report elaborates:
The stage of development at which this risk [of suffering] is sufficient for protection to be necessary is that at which the normal locomotion and sensory functioning of an individual independent of the egg or mother can occur. For air-breathing animals this time will not generally be later than that at which the fetus could survive unassisted outside the uterus or egg. For most vertebrate animals, the stage of development at which there is a risk of poor welfare when a procedure is carried out on them is the beginning of the last third of development within the egg or mother. For a fish, amphibian, cephalopod, or decapod it is when it is capable of feeding independently rather than being dependent on the food supply from the egg. [...] (p. 3)
Most amphibians and fish have larval forms which are not well developed at hatching but develop rapidly with experience of independent life[.] Those fish and amphibians that are well developed at hatching or viviparous birth and all cephalopods, since these are small but well developed at hatching, will have had a functioning nervous system and the potential for awareness for some time before hatching. (p. 38)
Another consideration suggestive of pain before birth is the fact that many oviparous vertebrates can hatch early in response to environmental stimuli, including vibrations that signal a predator.
For example, for skink eggs: "Simulated predation experiments in the field induced hatching in both nest sites (horizontal rock crevices) and in eggs displaced from nest sites. The hatching process was explosive: early hatching embryos hatched in seconds and sprinted from the egg an average of 40 cm as they hatched."[DoodyPaull] Early hatching has also been documented for amphibians, fish, and invertebrates.[DoodyPaull]
Young fish are already intelligent enough to be predators a few days after birth: "In the Sacramento-San Joaquin Estuary, young striped bass begin feeding on small crustacean zooplankton a few days after they hatch (Eldridge et al. 1982)."[StevensEtAl, pp. 21-22] Unfortunately, huge numbers of striped bass die young, given that "the average female striped bass produc[es] nearly a half million eggs".[StevensEtAl, p. 20]
These points suggest that a significant fraction of the large numbers of offspring born to r-selected species may very well be conscious during the pain of their deaths after a few short days, or even hours, of life.
There is a danger in extrapolating the welfare of wild animals from our own imagination of how we would feel in the situation. We can imagine immense discomfort were we to sleep through a cold winter night's storm with only a sweatshirt to keep us warm, but many animals have better fur coats and can often find some sort of shelter. More generally, it seems unlikely that species would gain an adaptive advantage by feeling constant hardship, since stress does entail a metabolic cost.[Ng] Also, r-selected animals might suffer less from a given injury than long-lived animals would because r-selected creatures have less to lose by taking big short-term risks.[Tomasik-short-lived]
That said, we should also be wary of underestimating the extent and severity of wild-animal suffering due to our own biases. You, the reader, are probably in the comfort of a climate-controlled building or vehicle, with a relatively full stomach, and without fear of attack. Most of us in the industrialized West go through life in a relatively euthymic state, and it's easy to assume that the general pleasantness with which life greets us is shared by most other people and animals. When we think about nature, we may picture chirping songbirds or frolicking gazelles, rather than deer having their flesh chewed off while conscious or immobilized raccoons afflicted by roundworms. And of course, all of the previous examples, insofar as they involve large land animals, reflect my human tendency toward the "availability heuristic": In fact, the most prevalent wild animals of all are small organisms, many ocean-dwelling. When we think "wild animals," we should (if we adopt the expected-value approach to uncertainty about sentience) picture ants, copepods, and tiny fish, rather than lions or gazelles.
People may not accurately assess at a single instant how they'll feel overall during a longer period of time.[KahnemanSugden] They often exhibit "rosy prospection" toward future events and "rosy retrospection" about the past, in which they assume that their previous and future levels of wellbeing were and will be better than what's reported at the time of the experiences.[MitchellThompson] Moreover, even when organisms do correctly judge their hedonic levels, they often show a "will to live" quite apart from pleasure or pain. Animals that, in the face of lives genuinely not worth living, decide to end their existence tend not to reproduce very successfully.
Ultimately, though, regardless of exactly how good or bad we assess life in the wild to be on balance, it remains undeniable that many animals in nature endure some dreadful experiences.
Finally, there are some claims that animals do commit suicide, though others are doubtful. I'm personally skeptical because there aren't lots of well documented cases of animal suicide, and it's easy to accumulate folklore about phenomena that aren't real. That said, I don't doubt that some animals act differently when suffering an emotional loss.
Why, then, is the suffering of wild animals not a top priority for animal advocates? One reason is philosophical: Some feel that while humans have duties to treat well the animals that they use or live with, they have no responsibility to those outside their sphere of interaction. I find this unsatisfying; if we really care about animals because we don't want fellow organisms to suffer brutally -- not just because we want to "keep our moral house clean" -- then it shouldn't matter whether we have a personal connection with wild animals or not.
Other philosophers agree with this but continue to defend human inaction by claiming that people are ultimately helpless to change the situation. When asked whether we should stop lions from eating gazelles, Peter Singer replied:
[...] for practical purposes I am fairly sure, judging from man's past record of attempts to mold nature to his own aims, that we would be more likely to increase the net amount of animal suffering if we interfered with wildlife, than to decrease it. Lions play a role in the ecology of their habitat, and we cannot be sure what the long-term consequences would be if we were to prevent them from killing gazelles. [...] So, in practice, I would definitely say that wildlife should be left alone.[Singer]
I would point out in response to Singer that most human interventions have not been designed to improve wild-animal welfare, and even so, I suspect that many of them have decreased wild-animal suffering on balance by reducing habitats.
In a similar vein as Singer, Jennifer Everett suggested that consequentialists may endorse evolutionary selection because it eliminates deleterious genetic traits:
[...] if propagation of the "fittest" genes contributes to the integrity of both predator and prey species, which is good for the predator/prey balance in the ecosystem, which in turn is good for the organisms living in it, and so on, then the very ecological relationships that holistic environmentalists regard as intrinsically valuable will be valued by animal welfarists because they conduce ultimately, albeit indirectly and via complex causal chains, to the well-being of individual animals.[Everett, p. 48]
These authors are right that consideration of long-range ecological side-effects is important. However, it does not follow that humans have no obligations regarding wild animals or that animal supporters should remain silent about nature's cruelty. The next few subsections elaborate on ways in which humans can indeed do something about wild-animal suffering.
I agree that we should be cautious about quick-fix intervention. Ecology is extremely complicated, and humans have a long track record of underestimating the number of unanticipated consequences they will encounter in trying to engineer improvements to nature. On the other hand, there are many instances in which we are already interfering with wildlife in some manner. As Tyler Cowen observed:[Cowen, p. 10]
In other cases we are interfering with nature, whether we like it or not. It is not a question of uncertainty holding us back from policing, but rather how to compare one form of policing to another. Humans change water levels, fertilize particular soils, influence climatic conditions, and do many other things that affect the balance of power in nature. These human activities will not go away any time soon, but in the meantime we need to evaluate their effects on carnivores and their victims.
One such evaluation was actually carried out regarding an Australian government decision to cull overpopulated and starving kangaroos at an Australian Defence Force army base.[ClarkeNg] While admittedly crude and theoretical, the analysis demonstrates that the tools of welfare economics can be combined with the principles of population ecology to reach nontrivial conclusions about how human interference with wildlife affects aggregate animal well-being.
Consider another example. Humans spray 3 billion kilograms of pesticides per year,[Pimentel] and whether or not we think this causes more wild-animal suffering than it prevents, large-scale insecticide use is, to some extent, a fait accompli of modern society. If, hypothetically, scientists could develop ways to make these chemicals act more quickly or less painfully, enormous numbers of insects and larger organisms could be given slightly less agonizing deaths. (Note that pesticides might actually prevent net insect suffering if they reduce insect populations enough, so encouraging humane insecticides is not equivalent to encouraging less pesticide use. Indeed, organic farms may contain high amounts of insect suffering, both because of higher total fauna populations and because organic pest-control methods may be quite painful. I remain very uncertain on this question, though.)[Tomasik-insecticides]
Human changes to the environment -- through agriculture, urbanization, deforestation, pollution, climate change, and so on -- have huge consequences, both negative and positive, for wild animals. For instance, "paving paradise [or, rather, hell?] to put up a parking lot" prevents the existence of animals that would otherwise have lived there. Even where habitats are not destroyed, humans may change the composition of species living in them. If, say, an invasive species has a shorter lifespan and more non-surviving offspring than the native counterpart, the result would be more total suffering. Of course, the opposite could just as easily be the case.
Caring about wild-animal suffering should not be mistaken as general support for environmental preservation; indeed, in some or even many cases, preventing existence may be the most humane option. Consequentialist vegetarians ought not find this line of reasoning unusual: The utilitarian argument against factory farming is precisely that, e.g., a broiler hen would be better off not existing than suffering in cramped conditions for 45 days before slaughter. Of course, even in the calculation of whether to adopt a vegetarian diet, the impacts on animals in the wild can be important and sometimes dominant over the direct effects on livestock themselves.[MathenyChan]
That said, before we become too gung ho about eliminating natural ecosystems, we should also remember that many other humans value wilderness, and it's good to avoid making enemies or tarnishing the suffering-reduction cause by pitting it in direct opposition to other things people care about. In addition, many forms of environmental preservation, especially reducing climate change, may be important to the far future, by improving prospects for compromise among the major world powers that develop artificial general intelligence.
Wild-animal suffering deserves a serious research program, devoted to questions like the following:
Humans presently lack the knowledge and technical ability to seriously "solve" the problem of wild-animal suffering without potentially disastrous consequences. However, this may not be the case in the future, as people develop a deeper understanding of ecology and welfare assessment.
If sentience is not rare in the universe, then the problem of wild-animal suffering extends beyond our planet. If it's improbable that life will evolve the type of intelligence that humans have,[Drake] we might expect that most of the extraterrestrials in existence are at the level of the smallest, shortest-lived creatures on earth. Thus, if humans ever do send robotic probes into space, there might be great benefit in using them to help wild animals on other planets. (One hopes that objections by deep ecologists to intervening in extraterrestrial ecosystems would be overcome.)
However, I should note that faster technological progress in general is not necessarily desirable. Especially in fields like artificial intelligence and neuroscience, faster progress may accelerate risks of suffering of other kinds. As a general heuristic, I think it may be better to wait on developing technologies that unleash vast amounts of new power before humans have the social institutions and wisdom to constrain misuse of this power.
While advanced future technologies could offer promise for helping wild animals, they also carry risks of multiplying the cruelty of the natural world. For instance, it's conceivable that humans could one day spread Earth-like environmental conditions to Mars in the process of "terraforming."[Burton] More speculatively, others have proposed "directed panspermia": dispatching probes into the galaxy to seed other planets with biological material.[Meot-NerMatloff] Post-human computer simulations may become sufficiently accurate that the wild-animal life they contain would consciously suffer. Already we see many simulation models of natural selection, and it's just a matter of time before these are augmented with AI capabilities such that the organisms involved become sentient and literally feel the pain of being injured and killed. Any of these possibilities would have prodigious ethical implications, and I do hope that before undertaking them, future humans consider seriously the consequences of such actions for the creatures involved.
What does all of this imply for the animal-advocacy movement? I think the best first step toward reducing wild-animal suffering that we can take now is to promote general concern for the issue. Causing more people to think and care about wild-animal suffering will hasten developments in research on wild-animal welfare and associated humane technologies, while at the same time helping to ensure that our advanced descendants think cautiously about actions that would create vastly more suffering organisms.
Perhaps finding supporters within the animal-advocacy community would be a good starting point. While some activists oppose all human intervention with the affairs of animals, occasionally even preferring that humans didn't exist, many people who feel humane sympathy for the suffering of members of other species should welcome efforts to prevent cruelty in the wild. It's important to ensure that the animal-rights movement doesn't end up increasing support for wilderness preservation and human non-interference of all kinds. Another potential source of supporters could be people interested in evolution, who recognize what Richard Dawkins has called the "blind, pitiless indifference" of natural selection.[Dawkins, p. 133]
Individuals can do much to raise the issue on their own, such as by
There may be a danger here of raising the wild-animal issue before the general public is ready. Indeed, the cruelty of nature is often used as a reductio by meat-eaters against consequentialist vegetarianism. Suggesting that ethical consideration for animals could require us to expend resources toward long-term research aimed at helping wildlife might turn off entirely people who would otherwise have given some consideration to at least those animals that they affect through dietary choices.[Greger]
I think wild-animal outreach should begin within communities that are most receptive, such as philosophers, animal activists, transhumanists, and scientists. We can plant the seeds of the idea so that it can grow into a component of the animal-rights movement. I also think a "don't spread wild-animal suffering to space" message could appear even in venues like TED or Slate precisely because it's a controversial idea that people haven't heard before. For those audiences, the message would appear in "far mode," wouldn't interfere with audience members' daily lives, and therefore could be entertained with less resistance.
It's true that most people are not in a position to endorse the moral urgency of reducing wild-animal suffering just yet. They may require earlier inferential steps first, such as caring about any non-human animals at all. The animal movement is like a worm: Each body part needs to slowly scooch its way forward to the next step. But the worm's head also needs to point in the right direction. Inspiring greater concern for wild animals among those who are ready for the message is like influencing where the worm's head points.
It's crucial that at some point the animal-rights movement moves beyond farm, laboratory, and companion animals. The scale of brutality in nature is too vast to ignore, and humans have an obligation to exercise their cosmically rare position as both intelligent and empathetic creatures to reduce suffering in the wild as much as they can.
[Dawkins] Dawkins, Richard. River Out of Eden. New York: Basic Books, 1995.
[Bostrom-Alfred] Bostrom, Nick. "Golden." 2004.
[Pinker] Sailer, Steve. "Q&A: Steven Pinker of 'Blank Slate.'" United Press International. 30 Oct. 2002. Accessed 17 Jan. 2014.
[Attenborough] Rustin, Susanna. "David Attenborough: 'I'm an essential evil.'" The Guardian. 21 Oct. 2011. Accessed 9 Jan. 2014.
[Mill] Mill, John Stuart. "On Nature." 1874. In Nature, The Utility of Religion and Theism, Rationalist Press, 1904.
[exceptions] Examples include (1) Sapontzis, Steve F. "Predation." Ethics and Animals 5.2 (1984): 27-38. (2) Naess, Arne. "Should We Try To Relieve Clear Cases of Extreme Suffering in Nature?" Pan Ecology 6.1 (1991). (3) Fink, Charles K. "The Predation Argument." Between the Species 5 (2005).
[Tomasik-numbers] Tomasik, Brian. "How Many Wild Animals Are There?" Essays on Reducing Suffering. 2009.
[emotions] See, for instance, (1) Balcombe, Jonathan. Pleasurable Kingdom: Animals and the Nature of Feeling Good. Palgrave Macmillan, 2006. (2) Bekoff, Marc, ed. The Smile of a Dolphin: Remarkable Accounts of Animal Emotions. Discovery Books, 2000.
[McGowan] McGowan, Christopher. The Raptor and the Lamb: Predators and Prey in the Living World. New York: Henry Holt and Company, 1997.
[eaten-alive] Eaten Alive - The World of Predators. Questacon on Tour.
[Kruuk] Kruuk, H. The Spotted Hyena. Chicago: University of Chicago Press, 1972.
[Flank] Flank, Lenny. "Live Prey vs. Prekill." The Snake: An Owner's Guide To A Happy Healthy Pet. Howell Book House, 1997.
[Perry] Perry, Lacy. "How Snakes Work: Feeding." howstuffworks.com.
[Sallinger] Sallinger, Bob. "Audubon Society Favors Keeping Cats Indoors." The Oregonian. 17 Nov. 2003.
[Wall] Wall, Patrick. Pain: The Science of Suffering. New York: Columbia University Press, 2000.
[ElHagePeronnyGriebelBelzung] El Hage, Wissam, Sylvie Peronny, Guy Griebel, Catherine Belzung. "Impaired memory following predatory stress in mice is improved by fluoxetine." Progress in Neuro-Psychopharmacology & Biological Psychiatry 28 (2004) 123 - 128.
[ElHageGriebelBelzung] El Hage, Wissam, Guy Griebel, and Catherine Belzung. "Long-term impaired memory following predatory stress in mice." Physiology & Behavior 87 (2006) 45 - 50.
[Zoladz] Zoladz, Phillip R. "An ethologically relevant animal model of posttraumatic stress disorder: Physiological, pharmacological and behavioral sequelae in rats exposed to predator stress and social instability." Graduate dissertation, University of South Florida. 2008.
[Stam] Stam, Rianne. "PTSD and stress sensitisation: A tale of brain and body Part 2: Animal models." Neuroscience & Biobehavioral Reviews Volume 31, Issue 4 (2007) 558 - 584.
[Stauth] Stauth, David. "Sharks, wolves and the 'ecology of fear'." 10 Nov. 2010. Accessed 17 March 2013.
[Salmonellosis] "Salmonellosis." Michigan Department of Natural Resources.
[bats] "Continued Rain, Snowpack Leaves Animals Hungry." Associated Press 23 Apr. 2006. CBS 13/UPN 31.
[Heidorn] Heidorn, Keith C. "Ice Storms: Hazardous Beauty." The Weather Doctor. 12 Jan. 1998, revised Dec. 2001.
[UCLA] UCLA Animal Care and Use Training Manual. UCLA Office for the Protection of Research Subjects.
[Nuffield] Nuffield Council on Bioethics. Ethics of Research Involving Animals. May 2005.
[Wilcox] Wilcox, Christie. "Bambi or Bessie: Are wild animals happier?" Scientific American Blogs. 12 April 2011. [For further discussion of this article, see this Felicifia thread. I think Christie understates the brutality of life on factory farms, but her points about wild animals are well taken.]
[BourneEtAl] Bourne, Debra C., Penny Cusdin, and Suzanne I. Boardman, eds. Pain Management in Ruminants. Wildlife Information Network. Mar. 2005.
[Cumming] Cumming, Jeffrey M. "Horn fly Haematobia irritans (L.)." Diptera Associated with Livestock Dung. North American Dipterists Society. 18 May 2006.
[BBC] "Fierce Ants Build 'Torture Rack'." BBC News 23 April 2005.
[Gould] Gould, Stephen Jay. "Nonmoral Nature." Hen's Teeth and Horse's Toes: Further Reflections in Natural History. New York: W. W. Norton, 1994.
[insect-pain] See, for instance, the following review articles: (1) Smith, Jane A. "A Question of Pain in Invertebrates." ILAR Journal 33.1-2 (1991). (2) Tomasik, Brian. "Can Insects Feel Pain?" Essays on Reducing Suffering. 2009.
[Williams] Williams, C. B. Patterns in the Balance of Nature and Related Problems. London: Academic Press, 1964.
[SchubelButman] Schubel, J. R. and Butman, C. A. "Keeping a Finger on the Pulse of Marine Biodiversity: How Healthy Is It?" Pages 84-103 of Nature and Human Society: The Quest for a Sustainable World. Washington, DC: National Academy Press, 1998.
[SolbrigSolbrig] Solbrig, O. T., and Solbrig, D. J. Introduction to Population Biology and Evolution. London: Addison-Wesley, 1979.
[Herbert] Herbert, Thomas J. "r and K selection." Accessed 17 March 2013.
[ClarkeNg] Clarke, Matthew and Ng, Yew-Kwang. "Population Dynamics and Animal Welfare: Issues Raised by the Culling of Kangaroos in Puckapunyal." Social Choice and Welfare 27:2 (pp. 407-22), 2006.
[Ng] Ng, Yew-Kwang. "Towards Welfare Biology: Evolutionary Economics of Animal Consciousness and Suffering." Biology and Philosophy 10.3 (pp. 255-85), 1995.
[Groff and Ng] Groff, Zach and Ng, Yew-Kwang. "Does suffering dominate enjoyment in the animal kingdom? An update to welfare biology". Biology and Philosophy, 2019.
[Sagoff] Sagoff, Mark. "Animal liberation and environmental ethics: Bad marriage, quick divorce." Osgoode Hall Law Journal 22, p. 297 (1984).
[Hapgood] Hapgood, Fred. Why males exist: an inquiry into the evolution of sex. Morrow (1979).
[EFSA] EFSA Scientific Panel on Animal Health and Welfare (AHAW). "Aspects of the biology and welfare of animals used for experimental and other scientific purposes." EFSA Journal 292, 1-136 (2005).
[DoodyPaull] Doody, J. Sean and Paull, Phillip. "Hitting the Ground Running: Environmentally Cued Hatching in a Lizard." Copeia: March 2013, Vol. 2013, No. 1, pp. 160-165.
[StevensEtAl] Stevens, Donald E., David W. Kohlhorst, Lee W. Miller, and D. W. Kelley. "The Decline of Striped Bass in the Sacramento-San Joaquin Estuary, California." Transactions of the American Fisheries Society 114.1 (pp. 12-30), 1985.
[Tomasik-short-lived] Tomasik, Brian. "Fitness Considerations Regarding the Suffering of Short-Lived Animals." Essays on Reducing Suffering. First written: 30 June 2013; last updated: 9 Feb. 2015.
[KahnemanSugden] Kahneman, Daniel and Sugden, Robert. "Experienced Utility as a Standard of Policy Evaluation." Environmental & Resource Economics 32: 161–81 (2005).
[MitchellThompson] Mitchell, T. and Thompson, L. (1994). "A Theory of Temporal Adjustments of the Evaluation of Events: Rosy Prospection and Rosy Retrospection." In C. Stubbart, J. Porac, and J. Meindl, eds., Advances in Managerial Cognition and Organizational Information-Processing, 5 (pp. 85-114). Greenwich, CT: JAI press.
[Singer] Singer, Peter. "Food for Thought." [Reply to a letter by David Rosinger.] New York Review of Books 20.10 (1973).
[Everett] Everett, Jennifer. "Environmental Ethics, Animal Welfarism, and the Problem of Predation: A Bambi Lover's Respect for Nature." Ethics and the Environment 6.1 (2001): 42-67.
[Cowen] Cowen, Tyler. "Policing Nature." 19 May 2001.
[Pimentel] Pimentel, David. "Pesticides and Pest Control." In Peshin, Rajinder and Dhawan, Ashok K., eds. Integrated Pest Management: Innovation-Development Process. Netherlands: Springer, 2009.
[Tomasik-insecticides] Tomasik, Brian. "Humane Insecticides: A Cost-Effectiveness Calculation." Essays on Reducing Suffering. 2009.
[MathenyChan] Matheny, Gaverick and Chan, Kai M. A. "Human Diets and Animal Welfare: The Illogic of the Larder." Journal of Agricultural and Environmental Ethics, 18:6 (pp. 579–94), 2005.
[Broom] Broom, D. M. "Animal Welfare: Concepts and Measurement." Journal of Animal Science 69:10 (pp. 4167-4175), 1991.
[Drake] Estimates of the fraction of planets with life that go on to produce intelligence can be found in the literature on the Drake equation.
[Burton] Burton, Kathleen. "NASA Presents Star-Studded Mars Debate." 25 Mar. 2004.
[Meot-NerMatloff] Meot-Ner, M. and Matloff, G. L. "Directed Panspermia: A Technical and Ethical Evaluation of Seeding the Universe." Journal of the British Interplanetary Society 32 (pp. 419-23), 1979.
[Greger] Greger, Michael. "Why Honey Is Vegan." Satya Sept. 2005.
Risks of Astronomical Future Suffering
It's far from clear that human values will shape an Earth-based space-colonization wave, but even if they do, it seems more likely that space colonization will increase total suffering rather than decrease it. That said, many other people care a lot about humanity's survival and spread into the cosmos, so I think suffering reducers should let others pursue their spacefaring dreams in exchange for stronger safety measures against future suffering. In general, I encourage people to focus on making an intergalactic future more humane if it happens rather than making sure there will be an intergalactic future.
"If we carry the green fire-brand from star to star, and ignite around each a conflagration of vitality, we can trigger a Universal metamorphosis. [...] Because of us [...] Slag will become soil, grass will sprout, flowers will bloom, and forests will spring up in once sterile places.1 [...] If we deny our awesome challenge; turn our backs on the living universe, and forsake our cosmic destiny, we will commit a crime of unutterable magnitude."
--Marshall T. Savage, The Millennial Project: Colonizing the Galaxy in Eight Easy Steps, 1994 (from "Space Quotations")
"Let's pray that the human race never escapes from Earth to spread its iniquity elsewhere."
--C.S. Lewis
Nick Bostrom's "The Future of Human Evolution" describes a scenario in which human values of fun, leisure, and relationships may be replaced by hyper-optimized agents that can better compete in the Darwinian race to control our future light cone. The only way we could avert this competitive scenario, Bostrom suggests, would be via a "singleton," a unified agent or governing structure that could control evolution. Of course, even a singleton may not carry on human values. Many naive AI agents that humans might build may optimize an objective function that humans find pointless. Or even if humans do maintain hands on the steering wheel, it's far from guaranteed that we can preserve our goals in a stable way across major self-modifications going forward. Caspar Oesterheld has suggested that even if a singleton AI emerges on Earth, it might still face the prospect of competition from extraterrestrials, as a result of which, it might feel pressure to quickly build more powerful successor agents for which it would not be able to verify full goal alignment, thereby leading to goal drift.
These factors suggest that even conditional on human technological progress continuing, the probability that human values are realized in the future may not be very large. Carrying out human values seems to require a singleton that's not a blind optimizer, that can stably preserve values, and that is shaped by designers who care about human values rather than selfish gain or something else. This is important to keep in mind when we imagine what future humans might be able to bring about with their technology.
Some people believe that sufficiently advanced superintelligences will discover the moral truth and hence necessarily do the right things. Thus, it's claimed, as long as humanity survives and grows more intelligent, the right things will eventually happen. There are two problems with this view. First, Occam's razor militates against the existence of a moral truth (whatever that's supposed to mean). Second, even if such moral truth existed, why should a superintelligence care about it? There are plenty of brilliant people on Earth today who eat meat. They know perfectly well the suffering that it causes, but their motivational systems aren't sufficiently engaged by the harm they're doing to farm animals. The same can be true for superintelligences. Indeed, arbitrary intelligences in mind-space needn't have even the slightest inklings of empathy for the suffering that sentients experience.
Even if humans do maintain control over the future of Earth-based life, potentially allowing people to realize positive futures that they desire, there are still many ways in which space colonization would likely multiply suffering. Following are some of them.
Humans may colonize other planets, spreading suffering-filled animal life via terraforming. Some humans may use their resources to seed life throughout the galaxy, which some sadly consider a moral imperative.
Given astronomical computing power, post-humans may run various kinds of simulations. These sims may include many copies of wild-animal life, most of which dies painfully shortly after being born. For example, a superintelligence aiming to explore the distribution of extraterrestrials of different sorts might run vast numbers of simulations of evolution on various kinds of planets. Moreover, scientists might run even larger numbers of simulations of organisms-that-might-have-been, exploring the space of minds. They may simulate decillions of reinforcement learners that are sufficiently self-aware as to feel what we consider conscious pain (as well as pleasure).
In addition to simulating more primitive, evolved minds, a superintelligent civilization might simulate myriads of possible future trajectories for itself. In an adversarial situation, artificial intelligences might use (a much improved form of) Monte Carlo tree search (MCTS) to simulate the outcomes of lots of hypothetical interactions during warfare. The details of these battles would need to be simulated for reasons of accuracy, but fine-grained simulations of warfare would have far more moral import than, e.g., the primitive actions in Total War: Rome II that are simulated during execution of a game AI's MCTS algorithm. Of course, it's quite likely that something better than MCTS will be used in the future, but the importance of simulations of some sort to planning seems unlikely to ever go away.
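To make concrete how quickly such planning computations multiply, here is a minimal sketch (not anything from this essay) of a Monte Carlo rollout planner, a much-simplified cousin of MCTS. The simulate_battle dynamics are pure placeholders; the only point is that even one decision already involves a large number of simulated interactions.

```python
import random

def simulate_battle(action, steps=100):
    """Hypothetical stand-in for a fine-grained battle simulation.
    Returns a noisy outcome score; a real planner would model the interacting agents."""
    outcome = 0.0
    for _ in range(steps):
        outcome += random.gauss(action * 0.01, 1.0)  # placeholder dynamics
    return outcome

def choose_action(candidate_actions, rollouts_per_action=100):
    """Pick the action whose simulated outcomes look best on average."""
    best_action, best_value = None, float("-inf")
    for action in candidate_actions:
        value = sum(simulate_battle(action) for _ in range(rollouts_per_action)) / rollouts_per_action
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Even this toy setup runs 10 actions * 100 rollouts * 100 steps = 100,000
# simulated steps for a single decision; a real planner would run vastly more.
print(choose_action(range(10)))
```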
It could be that certain algorithms (say, reinforcement agents) are very useful in performing complex machine-learning computations that need to be run at massive scale by advanced AI. These subroutines might be sufficiently similar to the pain programs in our own brains that we consider them to actually suffer. But profit and power may take precedence over pity, so these subroutines may be used widely throughout the AI's Matrioshka brains.
The range of scenarios that we can imagine is limited, and many more possibilities may emerge that we haven't thought of or maybe can't even comprehend.
If I had to make an estimate now, I would give ~70% probability that if humans or post-humans colonize space, this will cause more suffering than it reduces (ignoring compromise considerations discussed later). By this statement I'm referring only to total suffering, ignoring happiness and other positive values.2
Think about how space colonization could plausibly reduce suffering. For most of those mechanisms, there seem to be counter-mechanisms that will increase suffering at least as much. The following sections parallel those above.
David Pearce coined the phrase "cosmic rescue missions" in referring to the possibility of sending probes to other planets to alleviate the wild extraterrestrial (ET) suffering they contain. This is a nice idea, but there are a few problems.
Contrast this with the possibilities for spreading wild-animal suffering:
Fortunately, humans might not support spreading life that much, though some do. For terraforming, there are survival pressures to do it in the near term, but probably directed panspermia is a bigger problem in the long term. Also, given that terraforming is estimated to require at least thousands of years, while human-level digital intelligence should take at most a few hundred years to develop, terraforming may be a moot point from the perspective of catastrophic risks, since digital intelligence doesn't need terraformed planets.
While I noted that ETs are not guaranteed to be sentient, I do think it's moderately likely that consciousness is fairly convergent among intelligent civilizations. This is based on (a) suggestions of convergent consciousness among animals on Earth and (b) the general principle that consciousness seems to be useful for planning, manipulating images, self-modeling, etc. On the other hand, maybe this reflects the paucity of my human imagination in conceiving of ways to be intelligent without consciousness.
Even in the unlikely event that Earth-originating intelligence was committed to reducing wild-ET suffering in the universe, and even in the unlikely event that every planet in our future light cone contained conscious wild-ET suffering, I expect that space colonization would still increase rather than decrease total suffering. This is because planet-bound biological life captures a tiny fraction of the energy radiated by parent stars. Harnessing more of that solar energy would allow for running orders of magnitude more intelligent computations than biology currently does. Unless the suffering density among such computations was exceedingly low, multiplying intelligent computation would probably multiply suffering.
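As a rough back-of-the-envelope check on the "tiny fraction" claim (my own arithmetic with standard textbook values, not figures from this essay), the sketch below computes what share of the Sun's total output Earth intercepts at all.

```python
import math

R_earth = 6.371e6   # Earth's radius in meters (assumed standard value)
d_au = 1.496e11     # Earth-Sun distance in meters (assumed standard value)

# Earth's cross-section divided by the full sphere of radiation at 1 AU.
fraction = (math.pi * R_earth**2) / (4 * math.pi * d_au**2)
print(f"Fraction of solar output intercepted by Earth: {fraction:.2e}")
# Roughly 4.5e-10, i.e. about one part in two billion -- and biology captures
# only a sliver of even that, which is why star-harvesting computation could
# exceed planet-bound biological computation by many orders of magnitude.
```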
Given the extraordinary amounts of energy that can be extracted from stars, Earth-originating intelligence may run prodigious numbers of simulations.
Of course, maybe there are ETs running sims of nature for science or amusement, or of minds in general to study biology, psychology, and sociology. If we encountered these ETs, maybe we could persuade them to be more humane.
I think it's likely that humans are more empathetic than the average civilization because
Based on these considerations, it seems plausible that there would be room for improvement through interaction with ETs. Indeed, we should in general expect it to be possible for any two civilizations or factions to achieve gains from trade if they have diminishing marginal utility with respect to amount of control exerted. In addition, there may be cheap Pareto improvements to be had purely from increased intelligence and better understanding of important considerations.
That said, there are some downside risks. Post-humans themselves might create suffering simulations, and what's worse, the sims that post-humans run would be more likely to be sentient than those run by random ETs because post-humans would have a tendency to simulate things closer to themselves in mind-space. They might run nature sims for aesthetic appreciation, lab sims for science experiments, or pet sims for pets.
Suffering subroutines may be a convergent outcome of any AI, whether human-inspired or not. They might also be run by aliens, and maybe humans could ask aliens to design them in more humane ways, but this seems speculative.
It seems plausible that suffering in the future will be dominated by something totally unexpected. This could be a new discovery in physics, neuroscience, or even philosophy more generally. Some make the argument that because we know so very little now, it's better for humans to stick around because of the "option value": If people later realize it's bad to spread, they can stop, but if they realize they should spread, they can proceed to reduce suffering in some novel way that we haven't anticipated.
Of course, the problem with the "option value" argument is that it assumes future humans do the right things, when in fact, based on examples of speculations we can imagine now, it seems future humans would probably do the wrong things much of the time. For instance, faced with a new discovery of obscene amounts of computing power somewhere, most humans would use it to run oodles more minds, some nontrivial fraction of which might suffer terribly. In general, most sources of immense power are double-edged swords that can create more happiness and more suffering, and the typical human impulse to promote life/consciousness rather than to remove them suggests that suffering-focused values are on the losing side.
Still, waiting and learning more is plausibly Kaldor-Hicks efficient, and maybe there are ways it can be made Pareto efficient by granting additional concessions to suffering reducers as compensation.
Above I was largely assuming a human-oriented civilization with values that we recognize. But what if, as seems mildly likely, Earth is taken over by a paperclip maximizer, i.e., an unconstrained automation or optimization process? Wouldn't that reduce suffering because it would eliminate wild ETs as the paperclipper spread throughout the galaxy, without causing any additional suffering?
Maybe, but if the paperclip maximizer is actually generally intelligent, then it won't stop at tiling the solar system with paperclips. It will want to do science, perform lab experiments on sentient creatures, possibly run suffering subroutines, and so forth. It will require lots of intelligent and potentially sentient robots to coordinate and maintain its paperclip factories, energy harvesters, and mining operations, as well as scientists and engineers to design them. And the paperclipping scenario would entail black swans similar to those of a human-inspired AI. Paperclippers would presumably be less intrinsically humane than a "friendly AI," so some might cause significantly more suffering than a friendly AI, though others might cause less, especially the "minimizing" paperclippers, e.g., cancer minimizers or death minimizers.
If the paperclipper is not generally intelligent, I have a hard time seeing how it could cause human extinction. In this case it would be like many other catastrophic risks -- deadly and destabilizing, but not capable of wiping out the human race.
If we knew for certain that ETs would colonize our region of the universe if Earth-originating intelligence did not, then the question of whether humans should try to colonize space becomes less obvious. As noted above, it's plausible that humans are more compassionate than a random ET civilization would be. On the other hand, human-inspired computations might also entail more of what we consider to count as suffering because the mind architectures of the agents involved would be more familiar. And having more agents in competition for our future light cone might lead to dangerous outcomes.
But for the sake of argument, suppose an Earth-originating colonization wave would be better than the expected colonization wave of an ET civilization that would colonize later if we didn't do so. In particular, suppose that if human values colonized space, the outcome would have value -0.5 (measuring only suffering, as above), compared with -1 if random ETs colonized space. Then, as long as the probability P of some other ETs coming later is bigger than 0.5, it's better for humans to colonize and pre-empt the ETs from colonizing, since -0.5 > -1 * P when P > 0.5.
However, this analysis forgets that even if Earth-originating intelligence does colonize space, it's not at all guaranteed that human values will control how that colonization proceeds. Evolutionary forces might distort compassionate human values into something unrecognizable. Alternatively, a rogue AI might replace humans and optimize for arbitrary values throughout the cosmos. In these cases, humans' greater-than-average compassion doesn't make much difference, so suppose that the value of these colonization waves would be -1, just like for colonization by random ETs. Let the probability be Q that these non-compassionate forces win control of Earth's colonization. Now the expected values are
-1 * Q + -0.5 * (1-Q) for Earth-originating colonization
versus
-1 * P if Earth doesn't colonize and leaves open the possibility of later ET colonization.
For concreteness, say that Q = 0.5. (That seems plausibly too low to me, given how many times Earth has seen overhauls of hegemons in the past.) Then Earth-originating colonization is better if and only if
-1 * 0.5 + -0.5 * 0.5 > -1 * P
-0.75 > -1 * P
P > 0.75.
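For readers who prefer code to inequalities, here is a minimal numeric restatement of the comparison above, using the same illustrative values (-0.5 for a colonization wave shaped by human values, -1 otherwise) and treating Q and P as the only free parameters.

```python
def ev_earth_colonizes(Q, human_values=-0.5, other_values=-1.0):
    """Expected value if Earth-originating intelligence colonizes: with
    probability Q, non-compassionate forces end up controlling the wave."""
    return Q * other_values + (1 - Q) * human_values

def ev_earth_abstains(P, et_values=-1.0):
    """Expected value if Earth abstains and ETs colonize later with probability P."""
    return P * et_values

Q = 0.5
for P in (0.5, 0.75, 0.9):
    better = "Earth colonizing" if ev_earth_colonizes(Q) > ev_earth_abstains(P) else "abstaining"
    print(f"P = {P}: {better} has higher expected value")
# With Q = 0.5, Earth colonizing only wins once P exceeds 0.75, matching the inequality above.
```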
Given uncertainty about the Fermi paradox and Great Filter, it seems hard to maintain a probability greater than 75% that our future light cone would contain colonizing ETs if we don't ourselves colonize, although this section presents an interesting argument for thinking that the probability of future ETs is quite high.
What if rogue AIs result in a different magnitude of disvalue from arbitrary ETs? Let H be the expected harm of colonization by a rogue AI. Assume ETs are as likely to develop rogue AIs as humans are. Then the disvalue of Earth-based colonization is
H * Q + -0.5 * (1-Q),
and the harm of ET colonization is
P * (H * Q + -1 * (1-Q)).
Again taking Q = 0.5, then Earth-based colonization has better expected value if
H * 0.5 + -0.5 * 0.5 > P * (H * 0.5 + -1 * 0.5)
H - 0.5 > P * (H - 1)
P > (H - 0.5)/(H - 1),
where the inequality flips around when we divide by the negative number (H - 1). Here's a plot of these threshold values for P as a function of H. Even if H = 0 and a rogue AI caused no suffering, it would still only be better for Earth-originating intelligence to colonize if P > 0.5, i.e., if the probability of ETs colonizing in its place was at least 50%.
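Since the plot itself is not reproduced here, the following small sketch evaluates the same threshold, P > (H - 0.5)/(H - 1), for a few values of H, still assuming Q = 0.5 and H < 1 so that the inequality flips as described.

```python
def threshold_P(H):
    """Minimum probability of later ET colonization at which Earth-based
    colonization has higher expected value, assuming Q = 0.5 and H < 1."""
    return (H - 0.5) / (H - 1.0)

for H in (0.0, -0.25, -0.5, -1.0, -2.0):
    print(f"H = {H:+.2f}: Earth colonization better only if P > {threshold_P(H):.2f}")
# H = 0 recovers the 50% threshold mentioned above, and H = -1 recovers the
# earlier 75% figure; more harmful rogue AIs push the threshold higher still.
```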
These calculations involve many assumptions, and it could turn out that Earth-based colonization has higher expected value given certain parameter values. This is a main reason I maintain uncertainty as to the sign of Earth-based space colonization. However, this whole section was premised on human-inspired colonization being better than ET-inspired colonization, and the reverse might also be true, since computations of the future are more likely to be closer to what we most value and disvalue if humans do the colonizing.
Some have suggested the possibility of giving Earth-originating AI the values of ET civilizations. If this were done, then the argument about humans having better values than ETs would not even apply, and an Earth-based colonization wave would be as bad as ET colonization.
Finally, it's worth pointing out that insofar as human decisions are logically correlated with those of ETs, if humans choose not to or otherwise fail to colonize space, that may make it slightly more likely that other civilizations also don't colonize space, which would be good. However, this point is probably pretty weak, assuming that logical correlations between different planetary civilizations are small.
If technological development and space colonization seem poised to cause astronomical amounts of suffering, shouldn't we do our best to stop them? Well, it is worth having a discussion about the extent to which we as a society want these outcomes, but my guess is that someone will continue them, and this would be hard to curtail without extreme measures. Eventually, those who go on developing the technologies will hold most of the world's power. These people will, if only by selection effect, have strong desires to develop AI and colonize space.
Resistance might not be completely futile. There's some small chance that suffering reducers could influence society in such a way as to prevent space colonization. But it would be better for suffering reducers, rather than fighting technologists, to compromise with them: We'll let you spread into the cosmos if you give more weight to our concerns about future suffering. Rather than offering a very tiny chance of complete victory for suffering reducers, this cooperation approach offers a higher chance of getting an appreciable fraction of the total suffering reduction that we want. In addition, compromise means that suffering reducers can also win in the scenario (~30% likely in my view) that technological development does prevent more suffering than it causes even apart from considerations of strategic compromise with other people.
Ideally these compromises would take the form of robust bargaining arrangements. Some examples are possible even in the short term, such as if suffering reducers and space-colonization advocates agree to cancel opposing funding in support of some commonly agreed-upon project instead.
The strategic question of where to invest resources to advance your values at any given time amounts to a prisoner's dilemma with other value systems, and because we repeatedly make choices about where to invest, what stances to adopt, and what policies to push for, these prisoner's dilemmas are iterated. In Robert Axelrod's tournaments on the iterated prisoner's dilemma, the best-performing strategies were always "nice," i.e., not the first to defect. Thus, suffering reducers should not be the first to defect against space colonizers. Of course, if it seems that space colonizers show no movement toward suffering reduction, then we should also be "provocable" to temporary defection until the other side does begin to recognize our concerns.
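As a toy illustration (not part of the original argument), the sketch below plays an iterated prisoner's dilemma with the standard payoff values used in Axelrod's tournaments (temptation 5, mutual cooperation 3, mutual defection 1, sucker's payoff 0; these specific numbers are my assumption). It shows how a "nice but provocable" strategy such as tit-for-tat cooperates fully with its own kind while losing only a little to an unconditional defector.

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_history, their_history):
    # Nice (never defects first) but provocable (retaliates after a defection).
    return "C" if not their_history else their_history[-1]

def always_defect(my_history, their_history):
    return "D"

def play(strategy_a, strategy_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # mutual cooperation: (600, 600)
print(play(tit_for_tat, always_defect))  # exploited only in round one: (199, 204)
```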
We who are nervous about space colonization stand to gain a lot from allying with its supporters -- in terms of thinking about what scenarios might happen and how to shape the future in better directions. We also want to remain friends because this means pro-colonization people will take our ideas more seriously. Even if space colonization happens, there will remain many sub-questions on which suffering reducers want to have a say: e.g., not spreading wildlife, not creating suffering simulations/subroutines, etc.
We want to make sure suffering reducers don't become a despised group. For example, think about how eugenics is more taboo because of the Nazi atrocities than it would have been otherwise. Anti-technology people are sometimes smeared by association with the Unabomber. Animal supporters can be tarnished by the violent tactics of a few, or even by the antics of PETA. We need to be cautious about something similar happening for suffering reduction. Most people already care a lot about preventing suffering, and we don't want people to start saying, "Oh, you care about preventing harm to powerless creatures? What are you, one of those suffering reducers?" where "suffering reducers" has become such a bad name that it evokes automatic hatred.
So not only is cooperation with colonization supporters the more promising option, but it's arguably the only net-positive option for us. Taking a more confrontational stance risks hardening the opposition and turning people away from our message. Remember, preventing future suffering is something that everyone cares about, and we shouldn't erode that fact by being excessively antagonistic.
Many speculative scenarios that would allow for vastly reducing suffering in the multiverse would also allow for vastly increasing it: When you can decrease the number of organisms that exist, you can also increase the number, and those who favor creating more happiness / life / complexity / etc. will tend to want to push for the increasing side.
However, there may be some black swans that really are one-sided, in the sense that more knowledge is most likely to result in a decrease of suffering. For example: We might discover that certain routine physical operations map onto our conceptions of suffering. People might be able to develop ways to re-engineer those physical processes to reduce the suffering they contain. If this could be done without a big sacrifice to happiness or other values, most people would be on board, assuming that present-day values have some share of representation in future decisions.
This may be a fairly big deal. I give nontrivial probability (maybe ~10%?) that I would, upon sufficient reflection, adopt a highly inclusive view of what counts as suffering, such that I would feel that significant portions of the whole multiverse contain suffering-dense physical processes. After all, the mechanics of suffering can be seen as really simple when you think about them a certain way, and as best I can tell, what makes animal suffering special are the bells and whistles that animal sentience involves over and above crude physics -- things like complex learning, thinking, memory, etc. But why can't other physical objects in the multiverse be the bells and whistles that attend suffering by other physical processes? This is all very speculative, but we can only begin to imagine what understandings of the multiverse our descendants might arrive at.
If we care to some extent about moral reflection on our own values, rather than assuming that suffering reduction of a particular flavor is undoubtedly the best way to go, then we have more reason to support a technologically advanced future, at least if it's reflective.
In an idealized scenario like coherent extrapolated volition (CEV), say, if suffering reduction was the most compelling moral view, others would see this fact.3 Indeed, all the arguments any moral philosopher has made would be put on the table for consideration (plus many more that no philosopher has yet made), and people would have a chance to even experience extreme suffering, in a controlled way, in order to assess how bad it is compared with other things. Perhaps there would be analytic approaches for predicting what people would say about how bad torture was without actually torturing them to find out. And of course, we could read through humanity's historical record and all the writings on the Internet to learn more about what actual people have said about torture, although we'd need to correct for will-to-live bias and deficits of accuracy when remembering emotions in hindsight. But, importantly, in a CEV scenario, all of those qualifications can be taken into account by people much smarter than ourselves.
Of course, this rosy picture is not a likely future outcome. Historically, forces seize control because they best exert their power. It's quite plausible that someone will take over the future by disregarding the wishes of everyone else, rather than by combining and idealizing them. Or maybe concern for the powerless will just fall by the wayside, because it's not really adaptive for powerful agents to care about weak ones, unless there are strong, stable social pressures to do so. This suggests that improving prospects for a reflective, tolerant future may be an important undertaking. Rather than focusing on whether or not the future happens, I think it's more valuable for suffering reducers to focus on making the future better if it happens -- by encouraging compromise, moral reflectiveness, philosophical wisdom, and altruism.
A question by Rob Wiblin first inspired this piece. The discussion of cooperation with other value systems was encouraged by Carl Shulman. Initially I resisted his claim, but -- as has often proved the case at least on factual and strategic questions -- I eventually realized he was right and came around to his view.
The post Risks of Astronomical Future Suffering appeared first on Center on Long-Term Risk.
You can find answers to frequently asked questions below. If the FAQ doesn't answer your question, please send us an email at info@longtermrisk.org.
If you'd prefer to make a restricted donation to our grantmaking project the CLR Fund, please use the donation form on the CLR Fund page.
Currently, we can offer tax-deductible donations in the following countries:
If applicable, you will receive a tax receipt from the organization via whose accounts you donated. For the countries mentioned above, these organizations are:
If you have questions about your tax receipt, you can contact the organization in question directly.
CLR and EAF send a single receipt via email at the beginning of the relevant tax year, which includes all your donations in the preceding year, as long as your donations totaled at least 100.– and were made directly to CLR or EAF. We can only issue receipts if all fields in the registration form—including the address field—were filled out.
Donation receipts are issued in the name of the holder of the account from which the donation is made. Note that a donation counts toward a given year if we receive it in that year, regardless of when you initiate the transfer. This is especially relevant for donations via bank transfer made shortly before the end of the tax year, since they could reach our accounts after the year boundary.
If you choose to donate with Gift Aid, CLR will claim an additional 25% of the value of your donation from the UK government, reclaiming taxes that you paid/will pay. For example, a £100 donation would allow us to claim an additional £25 at no extra cost to you. You can read details about this on the government website here. There is also a useful post about Gift Aid and UK income tax here.
To be eligible for Gift Aid, you must confirm that you are a UK taxpayer, and understand that if you pay less Income Tax and/or Capital Gains Tax in the current tax year than the amount of Gift Aid claimed on all your donations, it is your responsibility to pay any difference.
If you believe you are no longer eligible for Gift Aid, or would like us to stop claiming Gift Aid on your recurring donation, please contact us at info@longtermrisk.org.
For donations made via the above form, payment processors charge a small fee that is deducted from your donation:
You can find detailed information about our activities and goals as well as our financial situation on our transparency site. CLR’s finances are checked annually by an independent auditor.
If you'd like to stay up to date with our activities, you can follow us on social media via Facebook, Twitter, and YouTube.
To change the amount of your monthly donation, please fill in the donation registration form again with your updated preferences. For donations via bank transfer, you can then adjust the standing order with the new reference number / purpose.
If you wish to change or cancel your monthly direct debit or credit card donation, please send us an email at info@longtermrisk.org. We will cancel your donation and you can set up a new one on this website.
You can update your personal details when registering a new donation via our website. Alternatively, you can send us an email at info@longtermrisk.org.
CLR only collects personal details necessary for the donation process. For instance, we need your contact details (email and postal address) to send an annual tax receipt. We keep your personal details strictly confidential. You can find more information on this topic in our Privacy Policy.
Giving What We Can can accept donations in cryptocurrency on our behalf, and donations in stocks made in the USA. For details see their FAQ page. If you'd like to make a large donation in stocks from another country, please contact us at info@longtermrisk.org.
Please get in touch with us at info@longtermrisk.org. We'll do our best to answer within two working days.
The post Support us to help reduce involuntary suffering. appeared first on Center on Long-Term Risk.
The rest of this page gives more detail about writing up research and other ways to help.
If you’d like to research a topic, feel free to proceed on your own and publish it on your website, blog, etc., and then let us know about it if you think it would be of interest. If you’d like to host it on this site, contact us to see if that’s feasible. If you’re particularly ambitious, try submitting the article to a formal media outlet or academic journal. While this takes more effort, it also can bring a wider audience, garner more prestige, and put the ideas into “the real world” rather than confining them to a small circle of like-minded friends.
Let us know if you have questions or would like advice in the research. We could also potentially review drafts if we have time.
Many topics on Wikipedia are sufficiently important that contributing just factual information to them may have very high value. Some of the Open Research Questions are best addressed by compiling existing information into relevant Wikipedia articles.
Brian Tomasik has a list of suggested Wikipedia “Pages to create or improve”, but it should only serve to jumpstart your creativity and is hardly exhaustive of the possibilities. Note that the value of Wikipedia contributions on technical AI issues is uncertain, because they can speed up risks alongside speeding up safety measures against risks.
The post Volunteer appeared first on Center on Long-Term Risk.
There's a decent chance that governments will be the first to build artificial general intelligence (AI). International hostility, especially an AI arms race, could exacerbate risk-taking, hostile motivations, and errors of judgment when creating AI. If so, then international cooperation could be an important factor to consider when evaluating the flow-through effects of charities. That said, we may not want to popularize the arms-race consideration too openly lest we accelerate the race.
AI poses a national-security threat, and unless the militaries of powerful countries are very naive, it seems to me unlikely they'd allow AI research to proceed in private indefinitely. At some point the US military would confiscate the project from Google or Facebook, if the US military isn't already ahead of them in secret by that point.
While the US government as a whole is fairly slow and incompetent when it comes to computer technology, specific branches of the government are on the cutting edge, including the NSA and DARPA (which already funds a lot of public AI research). When we consider historical examples as well, like the Manhattan Project, the Space Race, and ARPANET, it seems that the US government has a strong track record of making technical breakthroughs when it really tries.
Sam Altman agrees that in the long run governments will probably dominate AI development: "when governments gets serious about SMI [superhuman machine intelligence] they are likely to out-resource any private company".
There are some scenarios in which private AI research wouldn't be nationalized:
Each of these scenarios could happen, but it seems reasonably likely to me that governments would ultimately control AI development, or at least partner closely with Google.
Government AI development could go wrong in several ways. Plausibly governments would botch the process by not realizing the risks at hand. It's also possible that governments would use the AI and robots for totalitarian purposes.
It seems that both of these bad scenarios would be exacerbated by international conflict. Greater hostility means countries are more inclined to use AI as a weapon. Indeed, whoever builds the first AI can take over the world, which makes building AI the ultimate arms race. A USA-China race is one reasonable possibility.
Arms races encourage risk-taking -- being willing to skimp on safety measures to improve your odds of winning ("Racing to the Precipice"). In addition, the weaponization of AI could lead to worse expected outcomes in general. CEV seems to have less hope of success in a Cold War scenario. ("What? You want to include the evil Chinese in your CEV??") With a pure CEV, presumably it would eventually count Chinese values even if it started with just Americans, because people would become more enlightened during the process. However, when we imagine more crude democratic decision outcomes, this becomes less likely.
In Superintelligence: Paths, Dangers, Strategies (Ch. 14), Nick Bostrom proposes that another reason AI arms races would crimp AI safety is that competing teams wouldn't be able to share insights about AI control. What Bostrom doesn't mention is that competing teams also wouldn't share insights about AI capability. So even if less inter-team information sharing reduces safety, it also reduces speed, and the net effect isn't clear to me.
Of course, there are situations where arms-race dynamics can be desirable. In the original prisoner's dilemma, the police benefit if the prisoners defect. Defection on a tragedy of the commons by companies is the heart of perfect competition's efficiency. It also underlies competition among countries to improve quality of life for citizens. Arms races generally speed up innovation, which can be good if the innovation being produced is both salutary and not risky. This is not the case for general AI. Nor is it the case for other "races to the bottom".
Averting an AI arms race seems to be an important topic for research. It could be partly informed by the Cold War and other nuclear arms races, as well as by other efforts at nonproliferation of chemical and biological weapons. Forthcoming robotic and nanotech weapons might be even better analogues of AI arms races than nuclear weapons because these newer technologies can be built more secretly and used in a more targeted fashion.
Apart from more robust arms control, other factors might help:
World peace is hardly a goal unique to effective altruists (EAs), so we shouldn't necessarily expect low-hanging fruit. On the other hand, projects like nuclear nonproliferation seem relatively underfunded even compared with anti-poverty charities.
I suspect more direct MIRI-type research has higher expected value, but among EAs who don't want to fund MIRI specifically, encouraging donations toward international cooperation could be valuable, since it's certainly a more mainstream cause. I wonder if GiveWell would consider studying global cooperation specifically beyond its indirect relationship with catastrophic risks.
When I mentioned this topic to a friend, he pointed out that we might not want the idea of AI arms races too widely known, because then governments might take the concern more seriously and therefore start the race earlier -- giving us less time to prepare and less time to work on FAI in the meanwhile. From David Chalmers, "The Singularity: A Philosophical Analysis" (footnote 14):
When I discussed these issues with cadets and staff at the West Point Military Academy, the question arose as to whether the US military or other branches of the government might attempt to prevent the creation of AI or AI+, due to the risks of an intelligence explosion. The consensus was that they would not, as such prevention would only increase the chances that AI or AI+ would first be created by a foreign power. One might even expect an AI arms race at some point, once the potential consequences of an intelligence explosion are registered. According to this reasoning, although AI+ would have risks from the standpoint of the US government, the risks of Chinese AI+ (say) would be far greater.
We should take this information-hazard concern seriously and remember the unilateralist's curse. If it proves to be fatal for explicitly discussing AI arms races, we might instead encourage international cooperation without explaining why. Fortunately, it wouldn't be hard to encourage international cooperation on grounds other than AI arms races if we wanted to do so.
Also note that a government-level arms race could easily be preferable to a Wild West race among a dozen private AI developers where coordination and compromise would be not just difficult but potentially impossible. Of course, if we did decide it was best for governments to take AI arms races seriously, this would also encourage private developers to step on the gas pedal. That said, once governments do recognize the problem, they may be able to impose moratoria on private development.
How concerned should we be about accidentally accelerating arms races by talking about them? My gut feeling is it's not too risky, because
That said, I remain open to being persuaded otherwise, and it seems important to think more carefully about how careful to be here. The good news is that the information hazards are unlikely to be disastrous, because all of this material is already publicly available somewhere. In other words, the upsides and downsides of making a bad judgment seem roughly on the same order of magnitude.
In Technological change and nuclear arms control (1986), Ted Greenwood suggests that arms control has historically had little counterfactual impact:
In no case has an agreement inhibited technological change that the United States both actually wanted to pursue at the time of agreement and was capable of pursuing during the intended duration of the agreement. Only in one area of technological innovation (i.e., SALT II constraints on the number of multiple independently-targetable reentry vehicles, or MIRVs, on existing missiles) is it possible that such agreements actually inhibited Soviet programs, although in another (test of new light ICBMs [intercontinental ballistic missiles]) their program is claimed by the United States to violate the SALT II Treaty that the Soviets have stated they will not undercut.
In "Why Military Technology Is Difficult to Restrain" (1987), Greenwood adds that the INF Treaty was arguably more significant, but it still didn't stop technological development, just a particular application of known technology.
John O. McGinnis argues against the feasibility of achieving global cooperation on AI:
the only realistic alternative to unilateral relinquishment would be a global agreement for relinquishment or regulation of AI-driven weaponry. But such an agreement would face the same insuperable obstacles nuclear disarmament has faced. [...] Not only are these weapons a source of geopolitical strength and prestige for such nations, but verifying any prohibition on the preparation and production of these weapons is a task beyond the capability of international institutions.
In other domains we also see competition prevail over cooperation, such as in most markets, where usually there are at least several companies vying for customers. Of course, this is partly by social design, because we have anti-trust laws. Competition in business makes companies worse off while making consumers better off. Likewise, competition to build a quick, hacky AI makes human nations worse off while perhaps making the unsafe AIs better off. If we care some about the unsafe AIs for their own sakes as intelligent preference-satisfying agents, then this is less of a loss than it at first appears, but it still seems like there's room to expand the pie, and reduce suffering, if everyone takes things more slowly.
Maybe the best hope comes from the possibility of global unification. There is just one US government, with a monopoly on military development. If instead we had just one world government with a similar monopoly, arms races would not be necessary. Nationalism has been a potent force for gluing countries together and if channeled into internationalism, perhaps it could help to bind together a unified globe. Of course, we shouldn't place all our hopes on a world government and need to prepare for arms-control mechanisms that can also work with the present-day nation-state paradigm.
Robots require AI that contains clear goal systems and an ability to act effectively in the world. Thus, they seem like a reasonable candidate for where artificial general intelligence will first emerge. Facebook's image-classification algorithms and Google's search algorithms don't need general intelligence, with many human-like cognitive faculties, as much as a smart robot does.
Military robotics seems like one of the most likely reasons that a robot arms race might develop. Indeed, to some degree there's already an arms race to build drones and autonomous weapons systems. Mark Gubrud:
Killer robots are not the only element of the global technological arms race, but they are currently the most salient, rapidly-advancing and fateful. If we continue to allow global security policies to be driven by advancing technology, then the arms race will continue, and it may even reheat to Cold War levels, with multiple players this time. Robotic armed forces controlled by AI systems too complex for anyone to understand will be set in confrontation with each other, and sooner or later, our luck will run out.
Nanotechnology admits the prospect of severe arms races as well. "Can an MM Arms Race Avoid Disaster?" lists many reasons why a nanotech race should be less stable than the nuclear race was. In "War, Interdependence, and Nanotechnology," Mike Treder suggests that because nanotech would allow countries to produce their own goods and energy with less international trade, there would be less incentive to refrain from preemptive aggression. Personally, I suspect that countries would still be very desperate to trade knowledge about nanotech itself to avoid falling behind in the race, but perhaps if a country was the world's leader in nanoweapons, it would have incentive to attack everyone else before the tables turned.
Mark Gubrud's "Nanotechnology and International Security" presents an excellent overview of issues with both AI and nanotech races. He suggests:
Nations must learn to trust one another enough to live without massive arsenals, by surrendering some of the prerogatives of sovereignty so as to permit intrusive verification of arms control agreements, and by engaging in cooperative military arrangements. Ultimately, the only way to avoid nanotechnic confrontation and the next world war is by evolving an integrated international security system, in effect a single global regime. World government that could become a global tyranny may be undesirable, but nations can evolve a system of international laws and norms by mutual agreement, while retaining the right to determine their own local laws and customs within their territorial jurisdictions.
According to Jürgen Altmann's talk, "Military Uses of Nanotechnology and Nanoethics," 1/4 to 1/3 of US federal funding in the National NT Initiative is for defense -- $460 million out of $1.554 billion in 2008 (video time: 18:00). The US currently spends 4-10 times the rest of the world in military nanotech R&D, compared with "only" 2 times the rest of the world in overall military R&D (video time: 22:28). Some claim the US should press ahead with this trend in order to maintain a monopoly and prevent conflicts from breaking out, but it's dubious that nanotech can be contained in this way, and Altmann instead proposes active arms-control arrangements with anytime, anywhere inspections and in the long run, progress toward global governance to allay security dilemmas. We have seen many successful bans on classes of technology (bioweapons, chemical weapons, blinding lasers, etc.), so nano agreements are not out of the question, though they will take effort because many of the applications are so inherently dual-use. Sometimes commentators scoff at enforcement of norms against use of chemical weapons when just as many people can be killed by conventional forces, but these agreements are actually really important, as precedents for setting examples that can extend to more and more domains.
Like AI, nanotech may involve the prospect of the technology leader taking over the world. It's not clear which technology will arrive first. Nanotech contributes to the continuation of Moore's law and therefore makes brute-force evolved AI easier to build. Meanwhile, AI would vastly accelerate nanotech. Speeding up either leaves less time to prepare for both.
To read comments on this piece, see the original LessWrong discussion.
The post International Cooperation vs. AI Arms Race appeared first on Center on Long-Term Risk.