Apply to CLR as a Summer Research Fellow!
We, the Center on Long-Term Risk, are looking for Summer Research Fellows to explore strategies for reducing suffering in the long-term future (s-risks). For eight weeks, you will join our team at our office while working on your own research project. During this time, you will be in regular contact with our researchers and other fellows, and receive guidance from an experienced mentor.
You will work autonomously on challenging research questions relevant to reducing suffering. You will be integrated and collaborate with our team of intellectually curious, hard-working, and caring people, all of whom share a profound drive to make the biggest difference they can.
We worry that some people won’t apply because they wrongly believe they are not a good fit for the program. While such a belief is sometimes true, it is often the result of underconfidence rather than an accurate assessment. We would therefore love to see your application even if you are not sure if you are qualified or otherwise competent enough for the positions listed. We explicitly have no minimum requirements in terms of formal qualifications, and many past summer research fellows have had little or no prior research experience. Being rejected this year will not reduce your chances of being accepted in future hiring rounds. If you have any doubts, please don’t hesitate to reach out (see “Application process” > “Inquiries” below).
The purpose of the fellowship varies from fellow to fellow. In the past, we have often had the following types of people take part in the fellowship:
There might be many other good reasons for completing the fellowship. We encourage you to apply if you think you would benefit from the program, even if your reason is not listed above. In all cases, we will work with you to make the fellowship as valuable as possible given your strengths and needs. In many cases, this will mean focusing on learning and testing your fit for s-risk research, more than seeking to produce immediately valuable research output.
We don’t require specific qualifications or experience for this program, but the following abilities and qualities are what we’re looking for in candidates. We encourage you to apply if you think you may be a good fit, even if you are unsure whether you meet some of the criteria.
We encourage you to apply even if any of the below does not work for you. We are happy to be flexible for exceptional candidates, including when it comes to program length and compensation.
You can find an overview of our current priority areas here. However, if we believe that you can somehow advance high-quality research relevant to s-risks, we are interested in creating a position for you. If you see a way to contribute to our research agenda or have other ideas for reducing s-risks, please apply. We commonly tailor our positions to the strengths and interests of the applicants.
All fellows will work with a mentor to guide their project. Below, each of our mentors has written about the topics in which they’re most interested in supervising research.
At stage 2 of our application process, applicants are asked to submit a research proposal and a list of research proposal ideas. A significant part of our selection process is whether one of our mentors is interested in supervising the applicant’s proposed research, based on the overlap between the applicant’s and the mentor’s research interests.
We value your time and are aware that applications can be demanding, so we have thought carefully about making the application process time-efficient and transparent. We plan to make the final decisions by April 15. We plan to decide on the location (Berkeley or London) by early- to mid-April.
Stage 1: To start your application for any role, please complete our application form. As part of this form, we also ask you to submit your CV/resume and give you the opportunity to upload an optional research sample. The deadline is midnight Pacific Time on Thursday, March 7, 2024. We expect this to take around 2 to 3 hours if you are already familiar with our work. In the interest of your time, you do not need to polish the language of your answers in the application form.
Stage 2: By Tuesday, March 12, we will decide whether to invite you to the second stage. We will ask you to write a research proposal (up to two pages excluding references) and a list of research proposal ideas, to be submitted by Thursday, March 28 at midnight Pacific Time. This means applicants will have 16 days to complete this stage, which we expect will take up to 12 hours of work. Applicants will be compensated with £350 for their work at this stage.
Stage 3: By Thursday, April 4, we will decide whether to invite you to an interview via video call during the week of April 8. By April 15, we will send out final decisions to applicants.
If you have any questions about the process, please contact us at hiring@longtermrisk.org. If you want to send an email not accessible to the hiring committee, please contact Harriet Patterson at harriet.patterson@longtermrisk.org.
We aim to combine the best aspects of academic research (depth, scholarship, mentorship) with an altruistic mission to prevent negative future scenarios. So we leave out the less productive features of academia, such as administrative burden and publish-or-perish incentives, while adding a focus on impact and application.
As part of our fellowship, you will enjoy:
You will advance neglected research to reduce the most severe risks to our civilization in the long-term future. Depending on your specific project, your work may help inform impactful work across the s-risk and AI safety ecosystem, or any of CLR’s activities, including:
We’ll soon be hiring for researchers focused on model evaluations. As an empirical researcher at CLR, you will primarily help us build evaluations that improve our understanding of s-risk-relevant properties of AI systems, developing prerequisites to intervening on advanced AI systems. To receive updates about this role and other opportunities at CLR, you can subscribe to our mailing list by submitting your email at the bottom of our website.
Expression of Interest: Director of Operations
As mentioned in our annual review, CLR is evaluating whether to relocate from London to Berkeley, CA. We are currently in a trial period, and expect to make a decision about our long-term location in early April. For this reason, we unfortunately don’t yet know whether this role will be located in London or Berkeley. We’d also be open to remote candidates in some circumstances (see below).
We recognise that this uncertainty will make the role less appealing to candidates. Given this, we will be running a low-volume invite-only hiring round now, for candidates who are willing to spend the time on our hiring process even with our location uncertainty. If we don’t successfully hire now, we will launch a full open round in April, after the location decision is made.
Further details on location:
The Director of Operations leads our two-person operations team, with responsibility across areas such as HR, finance, compliance, office management, grantmaking ops, and recruitment. You’d report to our Executive Director, and would take on the management of our existing Operations Associate.
Specific responsibilities include:
About operations at CLR:
We estimate that around 70% of your time in the role would be spent directly working on operations projects, and 30% on people management, co-ordination, and strategy.
We’re interested in candidates who bring most of the following skills and experience to the role:
The Director of Operations is central to CLR’s activities and impact. As well as continuing the existing high level of support provided to our team and projects, we’re excited for a candidate with great judgement to bring new ideas and drive organisational change, and so multiply our impact further.
CLR’s mission is to reduce worst-case risks from the development of advanced AI systems. We are the largest organisation focussed on s-risk reduction, and our researchers are among only a handful working on s-risk reduction and cooperative AI.
You can read about CLR’s achievements in 2023 and plans for 2024 here.
CLR’s activities include:
CLR has received grants from Open Philanthropy, the Survival and Flourishing Fund and Polaris Ventures.
Testimonials about CLR’s work from prominent community members can be found here.
For full-time work in this role, we offer a salary of USD 110,000–150,000 if the role is based in the Bay Area, or GBP 60,000–90,000 if based in London.
Benefits for this role will include:
If you’re interested in the role, please submit this short expression of interest form by the end of Sunday 11th February. The form will still be monitored after this, but we’ll only invite late applicants to join the invite-only round in exceptional cases.
Note that, as this is an invite-only round, we expect to get back only to the most promising candidates to invite them to apply now.
If we don’t invite you to the current round, we will still make sure to let you know if we proceed to a full open hiring round in April.
CLR Fundraiser 2023
For frequently asked questions on donating to CLR, see our Donate page.
Note: the fundraiser is now over, so donations made from now on will not be added to the list below.
| Name | Amount | Comment |
| --- | --- | --- |
| Anonymous | USD 50,000 | |
| Anonymous | USD 200,000 | |
| Anonymous | GBP 30 | |
| Anonymous | EUR 350,000 | |
Beginner’s guide to reducing s-risks
Efforts to reduce s-risks generally consist of researching factors that likely exacerbate these three mechanisms (especially emerging technologies, social institutions, and values), applying insights from this research (e.g., recommending principles for the safe design of artificial intelligence), and building the capacity of future people to prevent s-risks.
Summary:
In the future, humans and our descendants might become capable of very large-scale technological and civilizational changes, including extensive space travel (Armstrong and Sandberg 2013), developments in artificial intelligence, and the creation of institutions whose stability is historically unprecedented. These increasing capabilities could significantly impact the welfare of many beings. Analogously, the Industrial Revolution drastically accelerated economic growth while leading to the suffering of billions of animals via factory farming.
Further, in the long-term future, the universe will plausibly contain numbers of sentient beings far greater than the current human and animal populations on Earth (MacAskill 2022, Ch 1). Astronomically large populations may result from widespread settlement of space and access to resources that far exceed those on Earth, either by biological organisms or digitally emulated minds (Beckstead 2014; Hanson 2016; Shulman and Bostrom 2021). Depending on their mental architectures, digital minds might have the capacity to suffer according to some theories of consciousness (though this is a philosophically controversial position).2 Thus, the total moral weight of these minds’ suffering could be highly significant for the prioritization of altruistic causes.
If one considers it highly important to promote the welfare of future beings, a strong candidate priority is to reduce s-risks, in which large numbers of future beings undergo intense involuntary3 suffering. Even if one is overall optimistic about the long-term future, avoiding these worst cases may still be a top priority.
Effectively reducing s-risks requires identifying ways such massive amounts of suffering could arise. Although catastrophes of this scale are unprecedented, one discrete event that plausibly caused a large fraction of total historical suffering on Earth was the rise of factory farming; since 1970, at least 10 billion land animals per year have been killed for global meat production, rising to about 70 billion by 2020 (Ritchie, Rosado, and Roser 2017). Unlike the suffering caused by factory farming, some s-risks might be intentional. Similarly to historical acts of systematic cruelty by dictators, future actors might also deliberately cause harm (Althaus and Baumann 2020). As discussed in Section 3, highly advanced future technologies used by agents willing to cause massive suffering in pursuit of their goals (intentionally or not) have been hypothesized to be the most likely foreseeable causes of s-risks. Research focused on s-risks and implementations of interventions to reduce them began only recently, however (see, e.g., Tomasik 2015a), and so further investigation may identify other likely sources.
The rest of this article will discuss the premises behind prioritizing s-risks, specific potential causes of different classes of s-risks, and researchers’ current understanding of the most promising interventions to reduce them.
Arguments for prioritizing s-risk reduction as an altruistic cause have generally relied on three premises (Baumann 2020a; Gloor 2018):
Note that some approaches to reducing s-risks are sufficiently broad that they might also reduce near-term suffering, promote other values besides relief of suffering, or improve non-worst-case futures (see Section 4.2).
Longtermism consists of both a claim about the moral importance of future beings and a claim about our ability to help them. The normative premise 1(a) has been defended at length in, e.g., (Beckstead 2013; MacAskill 2022, Ch. 1; Cowen and Parfit 1992). Even if one agrees with this normative view, one might object to the empirical premise 1(b) (in the context of s-risk reduction) on the following grounds: The probability of s-risks, like that of other long-term risks, might be highly sensitive to factors about which present generations will remain largely uncertain (Greaves 2016; Tarsney 2022). Thus, it might be too hard to determine which actions will reduce s-risks in the long term. Given this problem, called cluelessness, it is also unclear how feasible it is to reduce s-risks by attempting to make small near-term changes that compound over time into a large, positive impact. The process of compounding positive influence could be stopped, or changed into negative influence, by highly unpredictable factors.
One response to this objection is that it could be tractable to affect the likelihood of potential persistent states, which are world states that, once entered, are not exited for a very long time (if ever) (Greaves and MacAskill 2021). These persistent states could have different probabilities of s-risks.
Suppose that, in the near term, the world could enter a persistent state that is significantly more prone to s-risks than some other feasible persistent state. Suppose also that certain interventions can foreseeably make the latter state more likely than the former. That is, the former state is an avoidable scenario where some technology, natural force, or societal structure “locks in” conditions for eventual large-scale suffering. (Sections 2.1.1 and 3 discuss potential causes of lock-in that are relevant to s-risks.)
Then these interventions would be less vulnerable to the cluelessness objection, because one would only need to account for uncertainties about relatively near-term effects that push towards different persistent states. It might still be highly challenging, however, to identify which persistent states are more prone to s-risks than others, or how to prevent gradual convergence towards persistent states with s-risks.
Besides steering away from s-risk-prone persistent states, another approach that could avoid cluelessness would be to build favorable conditions for future generations to reduce s-risks, since they will have more information about the factors that could exacerbate these risks. This strategy still requires identifying features that would tend to enable future people to reduce s-risks, rather than hinder reduction or enable increases. One potentially robust candidate discussed in Section 4.2 is promoting norms against taking actions that risk causing immense suffering.
Artificial intelligence (AI) may enable the risk factors for s-risks discussed in the introduction: space settlement, deployment of vast computational resources by agents willing to cause suffering, and value lock-in (Gloor 2016a, MacAskill 2022, Ch. 4; Finnveden et al. 2022). At a general level, this is because AI systems automate complex tasks in a way that can be scaled up far more than human labor (e.g., via copying code), they can surpass human problem-solving by avoiding the constraints of a biological brain, and they may be capable of consistently optimizing certain goals.
Systems designed with machine learning algorithms, including reinforcement learning agents trained to select actions that maximize some reward, have outperformed humans at finding solutions to tasks in an increasing number of domains (Silver et al. 2017; Evans and Gao 2016; Jumper et al. 2021). Both reinforcement learning agents and large language models—programs produced by machine learning that use significant amounts of data and compute to predict sequences—have demonstrated capabilities that generalize across a wide variety of tasks they were not directly designed to perform (DeepMind 2021; Reed et al. 2022; OpenAI 2023, “GPT-4 Technical Report”). If these advances in the depth and breadth of AI capabilities continue, AI systems could develop into generally intelligent agents, which implement long-term plans culminating in the use of resources on scales larger than those of current civilizations on Earth.
Given that AI could apply such general superhuman abilities to large-scale goals, influencing the development and use of AI is plausibly one of the most effective ways to reduce s-risks (Gloor 2016a).4 Sections 3.1 and 3.2 outline ways that AI could cause s-risks, and Section 4.1 discusses specific classes of interventions on AI that researchers have proposed. (More indirect interventions on the values and social institutions that influence the properties of AIs, as discussed in Section 4.2, may also be tractable alternatives.)
Broadly, AI agents with goals directed at increasing suffering or vengeful motivations (e.g., due to selection pressures similar to those responsible for such motivations in humans) would be able to efficiently create enough suffering to constitute an s-risk, if they acquired enough power to avoid interference by other agents. Alternatively, if creating large amounts of suffering is instrumental to an AI’s goals, then, even without “wanting” to cause an s-risk, this AI would be willing and able to do so.
Further, two properties of AI make it a technology that is important to influence given a focus on persistent states. First, relevant experts forecast that human-level general AI is likely to be developed this century. This means that current generations might be able to shape the initial conditions of the development and use of the next iteration—superhuman AI—that could cause s-risks. Interventions on AI could thus be relatively urgent, among longtermist priorities. For example, in a survey of leading machine learning researchers, the median estimate for the date when “unaided machines can accomplish every task better and more cheaply than human workers” was 2059 (Stein-Perlman et al. 2022).5 See also Cotra (2020).
Second, to the extent that a general AI is an agent with certain terminal goals (i.e., goals for their own sakes, as opposed to instrumental goals), it will have strong incentives to stabilize these goals if it is technically capable of doing so (Omohundro 2008). That is, because the AI evaluates plans according to its current goals, it will (in general) tend to ensure that future modifications or successors of itself also optimize for the same goals.
These considerations suggest that the development of the first AI capable of both winning a competition with other AIs and locking in its goals, including preventing human interference with those goals (Bostrom 2014, Ch. 5; Karnofsky 2022; Carlsmith 2022), could initiate a persistent state. Avoiding training AIs with goals that motivate creation of cosmic-scale suffering, then, is a potential priority within s-risk reduction that may not require anticipating many contingencies far into the future.
As Baumann (2020a) notes, premise 2 is both normative and empirical; it is a claim about both one’s moral aims, and how effective suffering reduction is at satisfying those aims.
A variety of moral views hold that we have a strong responsibility to prevent atrocities involving extreme, widespread suffering.
Variants of suffering-focused ethics (Gloor 2016b; Vinding 2020a) hold that the intrinsic moral priority of reducing (intense) suffering is significantly higher than that of other goals. In particular, on some views suffering is measurably more important: when comparing two acts, the one that causes or permits a greater net increase in suffering is acceptable only if it also ensures a much greater amount of good things.6,7 On other views, such as negative utilitarianism, (intense) suffering has lexical priority: for some forms of suffering, an act that does not lead to a net increase in such suffering is always preferable to one that does (Vinding 2020a, Ch. 1; Knutsson 2016).
Alternatively, there are views according to which suffering does not always have priority, but avoiding futures where many beings have lives not worth living is a basic duty. One might endorse the Asymmetry: the creation of an individual whose life has more suffering than happiness (or other possible goods) is bad, but the creation of an individual whose life has more goods than suffering is at best neutral (Thomas 2019; Frick 2020). Unlike the former views, the Asymmetry does not imply that reducing suffering takes priority over, e.g., increasing happiness among existing beings. Commonly held views maintain that adding a miserable life to a miserable population is just as bad no matter how large the initial population is, but that adding further happy lives has diminishing returns in value (Vinding 2022b).
One could also hold that, without committing to a particular stance on the intrinsic value or disvalue of different lives or populations, we have a responsibility to avoid foreseeable risks of extremely bad outcomes for other beings (who are not directly compensated by upsides). See the concept of “minimal morality” (Gloor 2022).
Any of these classes of normative views, when applied to long-term priorities, would recommend focusing on preventing the existence of lives dominated by suffering. By contrast, several prominent views, such as classical utilitarianism and some forms of pluralist consequentialism (Sinnott-Armstrong 2022), hold that ensuring the existence of profoundly positive experiences or life projects can take priority over reducing risks of suffering if those risks are relatively improbable.8 (See also critiques of the Asymmetry, e.g., Beckstead (2013).) According to these views, whether s-risks should be prioritized over increasing the chance of a flourishing future and reducing risks of human extinction depends on one’s empirical beliefs (see next section). Other alternative longtermist projects include the reduction of risks of stable totalitarianism (Caplan 2008), improvement of global cooperation and governance, and promotion of moral reflection (Ord 2020, Ch. 7).
Normative reasons for prioritizing s-risk reduction may be action-guiding even for those who do not consider suffering-focused views more persuasive than alternatives. According to several approaches to the problem of decision-making under uncertainty about moral evaluations (MacAskill et al. 2020), the option one ought to take might not be the act that is best on the moral view one considers most likely. Rather, the recommended act could be one that is robustly positive on a wide range of plausible views. Then, although reducing s-risks might not be the top priority of most moral views, including one’s own, it may be one’s most favorable option because most views agree that severe suffering should be prevented (while they disagree on what is positively valuable). This consideration of moral robustness arguably favors efforts to improve the quality of future experiences, rather than increase the absolute number of future experiences (Vinding and Baumann 2021).
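To make this concrete, here is a minimal numerical sketch of one such approach, maximizing expected choiceworthiness. The theory names, credences, and scores below are purely hypothetical and not drawn from the cited sources; the point is only that the option recommended overall need not be the one favored by the theory one finds most plausible.

```python
# Toy sketch of "maximize expected choiceworthiness" under moral uncertainty.
# All credences and scores are hypothetical, chosen only for illustration.

credences = {"classical_utilitarian": 0.6, "suffering_focused": 0.4}

# Choiceworthiness of each option under each moral theory (arbitrary units).
choiceworthiness = {
    "increase future happiness": {"classical_utilitarian": 10, "suffering_focused": -5},
    "reduce severe suffering":   {"classical_utilitarian": 6,  "suffering_focused": 9},
}

def expected_choiceworthiness(option: str) -> float:
    """Credence-weighted value of an option across moral theories."""
    return sum(credences[theory] * value
               for theory, value in choiceworthiness[option].items())

for option in choiceworthiness:
    print(f"{option}: {expected_choiceworthiness(option):.1f}")
# increase future happiness: 0.6*10 + 0.4*(-5) = 4.0
# reduce severe suffering:   0.6*6  + 0.4*9    = 7.2
# The second option wins overall despite losing under the most-credenced theory.
```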
On the other hand, accounting for moral uncertainty could favor other causes. It has been argued that, under moral uncertainty, the most robustly positive approach to improving the long-term future is to preserve option value for humans and our descendants, and this entails prioritizing reducing risks of human extinction (MacAskill). That is, suppose we refrain from optimizing for the best action under our current moral views (which might be s-risk reduction), in order to increase the chance that humans survive to engage in extensive moral reflection.9 The claim is that the downside of temporarily taking this suboptimal action, by the lights of our current best guess, is outweighed by the potential upside of discovering and acting upon other moral priorities that we would otherwise neglect.
One counterargument is that futures with s-risks, not just those where humans go extinct, tend to be futures where typical human values have lost control over the future, so the option value argument does not privilege extinction risk reduction. First, if intelligent beings from Earth initiate space settlement before a sufficiently elaborate process of collective moral reflection, the astronomical distances between the resulting civilizations could severely reduce their capacity to coordinate on s-risk reduction (or any moral priority) (MacAskill 2022, Ch. 4; Gloor 2018). Second, if AI agents permanently disempower humans, they may cause s-risks as well. To the extent that averting s-risks is more tractable than ensuring AIs do not want to disempower humans at all (see next section), or one has a comparative advantage in s-risk reduction, option value does not necessarily favor working on extinction risks from AI.
Without necessarily endorsing the moral views discussed above, one might believe it is easier to reduce severe suffering with some amount of marginal effort than to increase goods or decrease other bads (Vinding 2020, Sec. 14.4). Given this, reducing suffering would have higher practical priority. For example, preventing long-term intense suffering could be easier to the extent that much less effort is currently devoted to s-risk reduction than to other efforts to improve the long-term future, such as prevention of human extinction (Althaus and Gloor 2016). This is because, for more neglected efforts, the most effective opportunities are less likely to have been already taken.
That said, even if deliberate attempts to reduce s-risks are neglected, the most effective means to this end can converge with interventions towards other goals, in which case the problem would not be practically as neglected as it appears. For instance, as discussed in Section 4.2, it is arguable that reducing political polarization reduces s-risks, but many who are not motivated by s-risk reduction in particular work on reducing polarization, because they want to enable more effective governance. But see Section 4.1 for discussion of potential opportunities to reduce s-risks that appear unlikely to be taken by those focusing on other goals.
Another possibility is that, while a conjunction of desirable conditions is required to create a truly utopian long-term future, a massive increase in suffering is a relatively simple condition and thus easier to prevent (Althaus and Gloor 2016; DiGiovanni 2021). In particular, even if human extinction is prevented, whether the future is optimized for flourishing depends on which values gain power. However, see Section 2.3 for brief discussion of whether s-risks are so unlikely that the value of reducing them is lower than that of aiming to increase the amount of flourishing.
Besides these general considerations, s-risk reduction could be a relatively tractable longtermist goal to the extent that the most plausible causes of human extinction are very difficult to prevent. Alignment of AI with the intent of human users, for instance, has been argued to be a crucial source of extinction risk (Bostrom 2014; Ord 2020). But it is also commonly considered a fundamentally challenging technical problem (Yudkowsky 2018; Christiano, Cotra, and Xu 2021; Hubinger et al. 2019; Cotra 2022), in the sense that, even if alignment failure is not highly likely by default, reducing the probability of misalignment close to zero is very difficult.10 Preventing misalignment may require fundamental changes to the default paradigm of AI development, e.g., developing highly sophisticated methods for interpreting large neural networks (Hubinger 2022).
By contrast, reducing s-risks from AI (see Sections 3.1 and 3.2) may require only solving certain subsets of the problem of controlling AI behavior, via coarse-grained steering of AIs’ preferences and path-dependencies in their decision-making (Baumann 2018b; Clifton, Martin, and DiGiovanni 2022a). (Note that for these subproblems to be high-priority, they need to be sufficiently nontrivial or non-obvious that they might not be solved without the efforts of those aiming to prevent s-risks.) Further, preventing s-risks caused by malevolent human actors may be easier than finding ways to predictably influence AI, which might behave in ways that are much harder to model than other people.
Finally, premise 3 relies on a model of the future in which:
(a) Large fractions of expected future suffering are due to a relatively small set of factors, over which present generations can have some influence.
(b) Compared to the amount of suffering that could be reduced by steering from the median future toward one with no suffering, many times more suffering could be reduced by steering from worst-case outcomes toward the median future (Gloor 2018).
If (a) were false, we would not expect to find singular “events” responsible for “cosmically significant amounts” of severe suffering that intelligent agents could prevent. As an analogy, if there is no single root cause of the majority of mental illnesses, someone aiming to promote mental health may need to prioritize among individualized treatments for many different illnesses. Section 3 will discuss relatively simple factors that plausibly determine large shares of future suffering.
Rejecting (b) would entail a greater focus on abolishing the sources of presently existing suffering (Pearce 1995), which one might expect to persist into the long-term future by default, and for which we have more direct evidence than for s-risks. There are two broad arguments for (b):
Implicit in premise 3 is the claim that the worst cases of potential future suffering are not so extremely unlikely as to be practically irrelevant, that is, the expected suffering from s-risks is large. One can assess the plausibility of the specific mechanisms by which s-risks could occur, and the historical precedents for those mechanisms, given in Section 3. Some general reasons to expect s-risks to be very improbable are the lack of direct empirical evidence for them, and the incentives for most intelligent agents to shape the future according to goals other than increasing suffering (Brauner and Grosse-Holz 2018). However, broadly, trends of technological progress could both enable space settlement and increase the potential of powerful agents to vastly increase suffering, conditional on having incentives to do so (without necessarily wanting more suffering) (Baumann 2017a).
To clarify which s-risks are possible and the considerations that might favor focusing on one cluster of scenarios over others, researchers have developed the following typology of s-risks (Baumann 2018a).
An s-risk is incidental if it is a side effect of the actions of some agent(s), who were not trying to cause large amounts of suffering. In the most plausible cases of incidental s-risks, agents with significant influence over astronomical resources find that one of the most efficient ways to achieve some goal also causes large-scale suffering.
Inexpensive ways to produce desirable resources might entail severe suffering. Slavery is an example of an institution that has produced historically significant suffering by this mechanism. In general, the suffering caused by slaveholders has not been intentional (setting aside corporal punishment), but it has been permitted because of a lack of moral consideration for the victims.12 Treating future beings similarly to people in slavery could constitute an s-risk, particularly if values that permit such treatment become locked in by technology like AI. Future agents in power could force astronomical numbers of digital beings to do the computational work necessary for an intergalactic civilization. And it is unclear how feasible it would be to design these minds to experience little or no suffering (Tomasik 2014). If doing so is very easy, an s-risk from this cause is unlikely by default. If not, then for some agents there may not be a sufficiently strong incentive to make the effort of preventing this s-risk.
Alternatively, it is prima facie likely that if interstellar civilizations interested in achieving difficult goals exist in the future, they will have strong incentives to improve their understanding of the universe. These civilizations could cause suffering to beings used for scientific research, analogous to current animal testing and historical nonconsensual experimentation on humans. Specifically, if it is technically feasible for future civilizations to create highly detailed world simulations, it is plausible that they will do so for purposes such as scientific research (Bostrom 2003). In contrast to the kinds of simple, coarse-grained simulations that are possible with current computers, much more advanced simulations might have two important features:
Finally, an s-risk could result if Earth-originating agents spread wildlife throughout the accessible universe, via terraforming or directed panspermia. Suppose these agents do not protect the animals that evolve on the seeded planets from the usual causes of suffering under Darwinian competition, such as predation and disease (Horta 2010; Johannsen 2020). Then the amount of suffering experienced over the course of Earth’s evolutionary history would be replicated (on average) across large numbers of planets.
Again, the incentives to create biological life in the long-term future might be relatively weak. However, agents who—like many current people—intrinsically value undisturbed nature would not find creation of digital minds to be sufficient, and they would consider the benefits of propagating natural life worth the increase in suffering.
An s-risk is agential if it is intentionally caused by some intelligent being. Although deliberate creation of suffering appears to be an unlikely goal for any given agent to have, researchers have identified some potential mechanisms of agential s-risks, most of which have precedent.
Powerful actors might intrinsically value causing suffering, and deploy advanced technologies to satisfy this goal on scales far larger than current or historical acts of cruelty. Malevolent traits known as the Dark Tetrad—Machiavellianism, narcissism, psychopathy, and sadism—have been found to correlate with each other (Althaus and Baumann 2020; Paulhus 2014; Buckels et al. 2013; Moshagen et al. 2018). This suggests that individuals who want to increase suffering may be disproportionately effective at social manipulation and inclined to seek power. If such actors established stable rule, they would be able to cause the suffering they desire indefinitely into the future.
Another possibility is that of s-risks via retribution. While a preference for increasing suffering indiscriminately is rare among humans, people commonly have the intuition that those who violate fundamental moral or social norms deserve to suffer, beyond the degree necessary for rehabilitation or deterrence (Moore 1997). Retributive sentiments could be amplified by hostility to one’s “outgroup,” an aspect of human psychology that is deeply ingrained and may not be easily removed (Lickel 2006). To the extent that pro-retribution sentiments are apparent throughout history (Pinker 2012, Ch. 8), values in favor of causing suffering to transgressors might not be mere contingent “flukes.” These values would then be relatively likely to persist into the long-term future, without advocacy against them. As with malevolence, s-risks may therefore be perpetrated by dictators who inflict disproportionate punishments on rulebreakers, especially if such punishments can be used as displays of power.
Recalling the discussion in Section 2.1.1, AI agents could have creation of suffering as either an intrinsic or instrumental goal, and pursue this goal on long timescales due to the stability of their values. This could occur for several reasons:
An s-risk is natural if it occurs without the intervention of agents. A significant share of future suffering could be experienced by beings in the accessible universe who do not have access to advanced technology; hence, these beings would be unable to produce abundant resources (food, medicine, etc.) that ensure comfortable lives. The lack of apparent extraterrestrial intelligent beings does not necessarily imply that, on the many planets capable of sustaining life—potentially billions in the Milky Way (Wendel 2015)—there are no sentient beings.13
The reasons that humans currently do not relieve wild animal suffering, for example via vaccine and contraception distribution (Edwards et al. 2021; Eckerström-Liedholm 2022), might persist into the long-term future. Specifically, intelligent agents may a) be morally indifferent to extraterrestrial beings’ suffering, b) prioritize other goals (such as developing flourishing civilizations, or avoiding disturbing these beings’ natural state) despite having some moral concern, or c) consider it too intractable to intervene without accidentally increasing suffering.
Case (c) would grow less likely as spacefaring civilizations develop forecasting technologies assisted by AI. It is highly uncertain whether the evolution of moral attitudes would match case (a). Concern for wild animal suffering might remain low, because it is (usually) not actively caused by human intervention and so may be judged as outside the scope of humans’ moral responsibility (Horta 2017). On the other hand, under the hypothesis that people largely tend to morally discount suffering when they are unable to prevent it, we would predict increased support for reducing natural suffering as civilization’s technological ability to do so increases. Future intelligent beings might prefer to focus on advancing their own civilizations, however, rather than invest in complex efforts to improve extraterrestrial welfare, or intrinsically prefer leaving nature undisturbed. Thus, case (b) is a plausible possibility unless agents with at least moderate concern for extraterrestrial suffering gain sufficient influence.
Notably, some attempts to reduce natural s-risks may unintentionally increase incidental and agential s-risks. For example, increasing the chance that human descendants settle space would enable them to relieve natural suffering beyond Earth, but this would also put them in a position to increase suffering through influence over astronomical resources. The most robust approaches to reduce natural s-risks, therefore, would entail increasing future agents’ willingness to assist extraterrestrial beings conditional on already conducting space settlement. See also Section 4.2 for discussion of ways that one of the most tractable interventions to reduce natural s-risks, moral circle expansion, may not be robustly positive.
Researchers have proposed a variety of methods for reducing s-risks, suited to the classes discussed above and the different causes within each class. An effective intervention by current actors against s-risks needs to reliably prevent some precondition for s-risks in a way that (a) is sensitive to near-term factors rather than inevitable (contingent), and (b) does not easily change to some other state over time (persistent) (MacAskill 2022, Ch. 2). Searching for near-term sources of lock-in is a notable strategy for satisfying these two properties, as discussed in Section 2.1. Two plausible ways to prevent lock-in of increases in suffering are shaping the goals of our descendants and shaping the development of AI.
However, human societies and AI are both highly complex systems, and it is likely too difficult to do better than form coarse-grained models of these systems with considerable uncertainty. Given this problem and how recently s-risks were proposed as an altruistic priority, much of current s-risk reduction research is devoted to identifying interventions that are robust to inevitable errors in our current understanding (under which naïve interventions could backfire).
Below, both targeted and broad classes of interventions are considered; it is unclear which of these is favored by accounting for such “deep uncertainty” (Marchau et al. 2019). Targeted approaches are generally less likely to have backfire risks, by restricting one’s intervention to a few factors that one can relatively easily understand. However, broad approaches have the advantage that they may rely less strongly on a specific model of a path to impact, which is brittle to crucial factors one might not consider, and therefore prone to overestimates of effectiveness (Tomasik 2015b; Baumann 2017b).
Another general consideration is the possibility of social dilemmas (such as the famous Prisoner’s Dilemma): Even if an intervention seems best assuming one’s decision-making is independent of that of other people with different goals, people’s collective decisions may make things worse than some alternative, according to the goals of all the relevant decision-makers. As a simplistic example, suppose one attempts to reduce incidental s-risks by lobbying governments against space exploration, while those who consider it important to build an advanced civilization across the galaxy lobby for space exploration. To the extent that these efforts cancel out, they are wasted compared to the alternative in which both sides agree to pursue their goals more cooperatively, that is, at less expense to each other’s goals (Tomasik 2015c; Vinding 2020b).
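As a minimal sketch of the lobbying example (the payoff numbers below are hypothetical and only meant to exhibit the Prisoner’s-Dilemma structure): lobbying is each side’s best response to whatever the other does, yet mutual lobbying leaves both sides worse off than mutual restraint.

```python
# Toy payoff matrix for the lobbying example; all numbers are hypothetical.
# Row player: s-risk reducer. Column player: space-settlement advocate.
# Each entry is (row payoff, column payoff) in arbitrary units of goal satisfaction.
payoffs = {
    ("restrain", "restrain"): (2, 2),  # both pursue their goals more cooperatively
    ("restrain", "lobby"):    (0, 3),  # the other side gets its way unopposed
    ("lobby",    "restrain"): (3, 0),
    ("lobby",    "lobby"):    (1, 1),  # efforts largely cancel out; resources wasted
}

def best_response(opponent_action: str, player: int) -> str:
    """Action maximizing this player's payoff, holding the opponent's action fixed."""
    def payoff(action: str) -> int:
        profile = (action, opponent_action) if player == 0 else (opponent_action, action)
        return payoffs[profile][player]
    return max(("restrain", "lobby"), key=payoff)

# Lobbying dominates for the row player (and symmetrically for the column player)...
assert best_response("restrain", 0) == "lobby"
assert best_response("lobby", 0) == "lobby"
# ...yet mutual lobbying is worse for both sides than mutual restraint.
assert payoffs[("lobby", "lobby")] < payoffs[("restrain", "restrain")]
```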
First, one might aim to reduce s-risks by building clear models of some specific pathways to s-risks, and finding targeted interventions that would likely block these pathways. The distinction between “targeted” and “broad” is more of a spectrum than binary. However, this categorization can be useful in that one may tend to favor more or less targeted approaches based on one’s epistemic perspective. Suppose one finds the mechanisms sketched in Section 3 excessively speculative, considers it generally intractable to form predictions of paths to impact on the long term, and thinks backfire risks can be mitigated without a targeted approach. Then one may prefer broad approaches intended to apply across a variety of scenarios.
To prevent the lock-in of scenarios where astronomically many digital beings are subjected to suffering for economic or scientific expediency, one relatively targeted option is to work on alignment of AI with the values of its human designers (Christian 2020; Ngo, Chan, and Mindermann 2023). As discussed in Section 2.1.1, an uncontrolled AI agent would likely aim to optimize the accessible universe for its goals, which might include allowances for running many suffering computations to make and execute complex plans (Tomasik 2015a; Bostrom 2014, Ch. 8). An unaligned AI might also be especially willing to initiate conflicts leading to large-scale suffering (Section 3.2). By reducing this acute risk, solving the technical problems of alignment would help place near-term decision-makers in a better position for deliberation, in particular, about how to approach space settlement in a way that is less prone to incidental s-risks.
That said, there are several limitations to and potential backfire risks from increasing alignment on the margin. First, it is not clear that agents with values similar to most humans’ will avoid causing great suffering to digital minds, under the default trajectory of moral deliberation; see Section 4.2. For example, a human-aligned agent might be especially inclined to spread sentient life, without necessarily mitigating serious suffering in such lives. Second, progress in AI alignment could also enable malevolent actors to cause s-risks, by enabling their control over a powerful technology. Lastly, marginal increases in the degree of alignment of AI with human values could increase the risk of near-miss failures discussed in Section 3.2. Due to these considerations, it is unlikely that work on AI alignment is robustly positive for reducing s-risks.
Another, arguably more robust, approach to s-risk reduction via technical work on AI design is research in the field of Cooperative AI (and implementing proposals based on this research) (Dafoe et al. 2020; Conitzer and Oesterheld 2022; Clifton, Martin, and DiGiovanni 2022b). Agential s-risks could result from failures of cooperation between powerful AI systems, and interventions to address these conflict risks are currently neglected compared to efforts to accelerate general progress in AI, and even compared to AI alignment (Hilton 2022). Technical work on mitigating AI conflict risks is a focus area of the Center on Long-Term Risk and the Cooperative AI Foundation. In general, this work entails identifying the possible causes of AI cooperation failure, and proposing changes to the features of AI design (or deployment) that would plausibly remove those causes. Examples of specific research progress in this area include:
Cooperative AI overlaps with research in decision theory, as a method of understanding the patterns of behavior of highly rational agents such as advanced AIs, which is relevant to predicting when they will have incentives to cooperate. One prominent subset of decision theory research for s-risk reduction is work on mechanisms of cooperation between agents with similar decision-making procedures, known as Evidential Cooperation in Large Worlds (ECL) (Oesterheld 2017). Correlations between agents’ decision-making can have important implications for how they optimize their values, and thus how willing they are to cause incidental or agential s-risks. For instance, an agent does not necessarily get exploited by cooperating with correlated agents, i.e., by taking some opportunity that fulfills those agents’ values at some cost to its own. This is because those agents are most likely cooperating as well, and would likely not cooperate if the first agent did not.
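A toy expected-value sketch of this reasoning (the correlation probability and payoffs below are hypothetical, not taken from the ECL literature): if the counterpart is very likely to choose the same action as our agent, cooperating can have higher expected value than defecting, even though defecting would be better against any fixed action of the counterpart.

```python
# Toy calculation for cooperation between agents with correlated decision procedures.
# All numbers are hypothetical and chosen only to illustrate the structure.

# Payoff to "our" agent, indexed by (our action, counterpart's action).
payoff = {
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"):   -1,  # exploited
    ("defect",    "cooperate"): 4,
    ("defect",    "defect"):    0,
}

P_SAME = 0.9  # probability the counterpart's decision matches ours (correlation)

def expected_value(our_action: str) -> float:
    same = our_action
    diff = "defect" if our_action == "cooperate" else "cooperate"
    return (P_SAME * payoff[(our_action, same)]
            + (1 - P_SAME) * payoff[(our_action, diff)])

print(expected_value("cooperate"))  # 0.9*3 + 0.1*(-1) = 2.6
print(expected_value("defect"))     # 0.9*0 + 0.1*4    = 0.4
# With strong correlation, cooperating has the higher expected value, even though
# defection gives a higher payoff against either fixed action of the counterpart.
```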
To address incidental or agential s-risks from AI, an alternative to technical interventions is to improve coordination between, and risk-awareness within, labs developing advanced AI. Risks of both alignment and cooperation failures could be exacerbated by dynamics where developers race to create the first general AI (Armstrong et al. 2015). These races incentivize developers to deprioritize safety measures that, while useful for avoiding low-probability worst-case outcomes, do not increase their systems’ average-case performance at economically useful tasks. Moreover, by establishing inter-lab coordination through governance and shaping of norms of AI research culture, there would be less risk that labs develop AIs that engage in conflict because they were trained independently (and hence might have incompatible standards for bargaining, or insufficient mutual transparency) (Torges 2021). Labs could also implement measures to reduce risks of malevolent actors gaining unchecked control over AI, for example, strengthening the security of their systems against hacks and instituting stronger checks and balances within their internal governance structures.
Besides AI, extraterrestrial intelligent civilizations could also have an important influence on long-term suffering. For example, understanding how likely extraterrestrials are to settle space—and how compassionate their values might be, compared to Earth-originating agents—is helpful for assessing the counterfactual risks of suffering posed by space settlement (Chyba and Hand 2005; Cook 2022; Vinding and Baumann 2021; Tomasik 2015a). If other civilizations in the universe have less concern for avoiding s-risks than humans do, space settlement could be instrumental to reducing incidental and agential s-risks caused by these beings. On the other hand, hostile interactions between space-settling agents might pose s-risks from cooperation failures.
Broad s-risk reduction efforts aim to intervene on factors that are likely involved in several different pathways to s-risks, including ones that we cannot specifically anticipate.
A necessary condition for any s-risk to occur is that agents with the majority of power in the long-term future are not sufficiently motivated to prevent or avoid causing s-risks—otherwise, these agents would prevent such large amounts of suffering. Thus, calling attention to and developing nuanced arguments for views that highly prioritize avoiding causing severe suffering, as discussed in Section 2.2.1, might be a way to reduce the probability of all kinds of s-risks. Exploring the philosophical details of suffering-focused ethics is a priority of the Center for Reducing Suffering, for example (Vinding 2020; Ajantaival 2021). Another option is to focus on arguments against pure retribution, the idea that it is inherently good to cause (often disproportionate) suffering to those who violate certain normative standards (Parfit 2011, Ch. 11).
Marginal efforts to promote suffering-focused views do not necessarily reduce s-risks, however, without certain constraints. First, as discussed in Section 4, these efforts might be wasted due to zero-sum dynamics with others promoting different values in response to promotion of suffering-focused views. Hence, the most promising approaches would involve promoting philosophical reflection in ways that would be endorsed by a wide variety of open-minded people, even if they have come to different conclusions about how important suffering is (e.g., presenting relevant considerations and thought experiments that have been neglected in existing literature) (Vinding 2020, Sec. 12.3). Second, to the extent that suffering-focused views are associated with pure consequentialism, a possible risk is that actors become more sympathetic to a naïve procedure of attempting to reduce as much suffering as they believe is possible, given their flawed models of the world. This combination of an optimizing mindset and limited ability to predict the full consequences can inspire counterproductive actions. Thus, effective advocacy for suffering-focused ethics would involve promotion of a careful, nuanced approach.
Similarly, one might focus on increasing concern for the suffering of more kinds of sentient beings (“moral circle expansion”), to reduce incidental, natural, and some forms of agential s-risks (Anthis and Paez 2021). Despite the benefits of the wide reach of this intervention, there are some ways it could backfire. In practice, efforts to increase moral consideration of other beings might make future space-settling civilizations more likely to create these beings on large scales—some fraction of which could have miserable lives—due to viewing their lives as intrinsically valuable (Tomasik 2015d; Vinding 2018a). Further, recalling the near miss scenario from Section 3.2, an AI agent that mistakenly increases the suffering of beings it is trained to care about would cause potentially far more suffering if its training is influenced by moral circle expansion.
Shaping social institutions is another option that is particularly helpful if lock-in of the conditions for s-risks is unlikely to occur soon (or infeasible to prevent). For instance, the Center for Reducing Suffering has analyzed potential changes to political systems that could increase the likelihood of compromise and international cooperation, such as reducing polarization (Baumann 2020b; Vinding 2022a). With more compromises, it would be easier for the votes of even a minority who care about less powerful sentient beings to reduce a large share of incidental and natural s-risks. More global cooperation would slow down technological races that contribute to conflict risks, and greater stability of democracies may also reduce risks of malevolent actors taking power (Althaus and Baumann 2020).
Finally, to the extent that it is intractable to directly intervene on s-risks in the near term, an alternative is to build the capacity of future people to reduce s-risks when they have more information than we do (but less time to prepare) (Baumann 2021). This could entail:
Those interested in reducing s-risks can contribute with donations to organizations that prioritize s-risks, such as the Center on Long-Term Risk and the Center for Reducing Suffering, or with their careers. To build a career that helps reduce s-risks, one can learn more about the research fields discussed in Section 4.1, and reach out to the Center on Long-Term Risk or the Center for Reducing Suffering for career planning advice.
I thank David Althaus, Tobias Baumann, Jim Buhler, Lukas Gloor, Adrian Hutter, Caspar Oesterheld, and Pablo Stafforini for comments and suggestions.
Ajantaival. Minimalist axiologies and positive lives. https://centerforreducingsuffering.org/research/minimalist-axiologies-and-positive-lives, 2021.
Althaus and Baumann. Reducing long-term risks from malevolent actors. https://forum.effectivealtruism.org/posts/LpkXtFXdsRd4rG8Kb/reducing-long-term-risks-from-malevolent-actors, 2020.
Althaus and Gloor. Reducing Risks of Astronomical Suffering: A Neglected Priority. https://longtermrisk.org/reducing-risks-of-astronomical-suffering-a-neglected-priority, 2016.
Althaus. Descriptive Population Ethics and Its Relevance for Cause Prioritization. https://forum.effectivealtruism.org/posts/CmNBmSf6xtMyYhvcs/descriptive-population-ethics-and-its-relevance-for-cause#Interpreting_and_measuring_N_ratios, 2018.
Anthis and Paez. “Moral circle expansion: A promising strategy to impact the far future.” 2021.
Armstrong and Sandberg. “Eternity in six hours: intergalactic spreading of intelligent life and sharpening the Fermi paradox”. 2013.
Armstrong et al. “Racing to the precipice: a model of artificial intelligence development”. 2015.
Baumann. A typology of s-risks. https://centerforreducingsuffering.org/research/a-typology-of-s-risks, 2018a.
Baumann. An introduction to worst-case AI safety. https://s-risks.org/an-introduction-to-worst-case-ai-safety/, 2018b.
Baumann. Arguments for and against a focus on s-risks. https://centerforreducingsuffering.org/research/arguments-for-and-against-a-focus-on-s-risks, 2020a.
Baumann. How can we reduce s-risks? https://centerforreducingsuffering.org/research/how-can-we-reduce-s-risks/#Capacity_building, 2021.
Baumann. Improving our political system: An overview. https://centerforreducingsuffering.org/research/improving-our-political-system-an-overview, 2020b.
Baumann. Risk factors for s-risks. https://centerforreducingsuffering.org/research/risk-factors-for-s-risks, 2019.
Baumann. S-risks: An introduction. https://centerforreducingsuffering.org/research/intro, 2017a.
Baumann. Uncertainty smooths out differences in impact. https://prioritizationresearch.com/uncertainty-smoothes-out-differences-in-impact, 2017b.
Baumann. Using surrogate goals to deflect threats. https://longtermrisk.org/using-surrogate-goals-deflect-threats, 2018c.
Beckstead and Thomas. “A Paradox for tiny probabilities.” 2021.
Beckstead. Dissertation: On the overwhelming importance of shaping the far future. 2013.
Beckstead. Will we eventually be able to colonize other stars? Notes from a preliminary review. https://www.fhi.ox.ac.uk/will-we-eventually-be-able-to-colonize-other-stars-notes-from-a-preliminary-review, 2014.
Block. “The harder problem of consciousness”. 2002.
Bostrom. “Are we living in a computer simulation?” The Philosophical Quarterly. 2003.
Bostrom. “The Superintelligent Will”. 2012.
Bostrom. Superintelligence. OUP Oxford, 2014.
Brauner and Grosse-Holz. The expected value of extinction risk reduction is positive. https://www.effectivealtruism.org/articles/the-expected-value-of-extinction-risk-reduction-is-positive. 2018.
Buckels et al. “Behavioral confirmation of Everyday Sadism”. 2013.
Caplan. “The totalitarian threat”. 2008.
Carlsmith. “Is Power-seeking AI an Existential Threat?” 2022.
Chalmers. The Conscious Mind. OUP Oxford, 1996.
Christian. The Alignment Problem: Machine Learning and Human Values. W.W. Norton, 2020.
Christiano, Cotra, and Xu. Eliciting latent knowledge: How to tell if your eyes deceive you. https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit, 2021.
Chyba and Hand. “Astrobiology: The Study of the Living Universe”. 2005.
Clifton, Martin, and DiGiovanni. When would AGIs engage in conflict? https://www.alignmentforum.org/s/32kWH6hqFhmdFsvBh/p/cLDcKgvM6KxBhqhGq#What_if_conflict_isn_t_costly_by_the_agents__lights__, 2022a.
Clifton, Martin, and DiGiovanni. When does technical work to reduce AGI conflict make a difference? https://www.alignmentforum.org/s/32kWH6hqFhmdFsvBh, 2022b.
Cohen et al. 2021 State of the Industry Report: Cultivated meat and seafood. https://gfieurope.org/wp-content/uploads/2022/04/2021-Cultivated-Meat-State-of-the-Industry-Report.pdf, 2021.
Conitzer and Oesterheld. Foundations of Cooperative AI. https://www.cs.cmu.edu/~15784/focal_paper.pdf. 2022.
Cook. Replicating and extending the grabby aliens model. https://forum.effectivealtruism.org/posts/7bc54mWtc7BrpZY9e/replicating-and-extending-the-grabby-aliens-model, 2022.
Cotra. “Forecasting TAI with biological anchors”. 2020.
Cotra. Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover. https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to, 2022.
Cowen and Parfit. “Against the social discount rate”. 1992.
Dafoe et al. “Open problems in Cooperative AI”. 2020.
DeepMind. “Generally capable agents emerge from open-ended play”. 2021.
De Vynck, Lerman, and Tiku. “Microsoft’s AI chatbot is going off the rails.” The Washington Post, February 16, 2023. https://www.washingtonpost.com/technology/2023/02/16/microsoft-bing-ai-chatbot-sydney/.
DiGiovanni. A longtermist critique of “The expected value of extinction risk reduction is positive”. https://forum.effectivealtruism.org/posts/RkPK8rWigSAybgGPe/a-longtermist-critique-of-the-expected-value-of-extinction-2, 2021.
DiGiovanni and Clifton. “Commitment games with conditional information revelation”. 2022.
DiGiovanni, Macé, and Clifton. “Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs”. 2022.
Duff. “Pascal's Wager”. 1986.
Eckerström-Liedholm. Deep Dive: Wildlife contraception and welfare. https://www.wildanimalinitiative.org/blog/contraception-deep-dive, 2022.
Edwards et al. “Anthroponosis and risk management: a time for ethical vaccination of wildlife”. 2021.
Evans and Gao. DeepMind AI Reduces Google Data Centre Cooling Bill by 40%. https://www.deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-by-40, 2016.
Fearon. “Bargaining, Enforcement, and International Cooperation”. 1998.
Fearon. “Rationalist explanations for war”. 1995.
Finnveden et al. “AGI and Lock-in”. 2022.
Frick. “Conditional Reasons and the Procreation Asymmetry”. 2020.
Gloor. Altruists Should Prioritize Artificial Intelligence. https://longtermrisk.org/altruists-should-prioritize-artificial-intelligence, 2016a.
Gloor. Cause prioritization for downside-focused value systems. https://longtermrisk.org/cause-prioritization-downside-focused-value-systems, 2018.
Gloor. Population Ethics Without Axiology: A Framework. https://forum.effectivealtruism.org/posts/dQvDxDMyueLyydHw4/population-ethics-without-axiology-a-framework, 2022.
Gloor. The Case for Suffering-Focused Ethics. https://longtermrisk.org/the-case-for-suffering-focused-ethics, 2016b.
Greaves and MacAskill. The case for strong longtermism. https://globalprioritiesinstitute.org/hilary-greaves-william-macaskill-the-case-for-strong-longtermism-2, 2021.
Greaves. “Cluelessness”. 2016.
Hanson. The Age of Em. OUP Oxford, 2016.
Heifetz, Shannon, and Spiegel. “The Dynamic Evolution of Preferences”. 2007.
Hilton. “S-risks.” https://80000hours.org/problem-profiles/s-risks/#why-might-s-risks-be-an-especially-pressing-problem. 2022.
Horta. “Animal Suffering in Nature: The Case for Intervention”. 2017.
Horta. “Debunking the idyllic view of natural processes”. 2010.
Hubinger et al. “Risks from learned Optimization in Advanced Machine Learning Systems”. 2019.
Hubinger. A transparency and interpretability tech tree. https://www.alignmentforum.org/posts/nbq2bWLcYmSGup9aF/a-transparency-and-interpretability-tech-tree, 2022.
Johannsen. Wild Animal Ethics: The Moral and Political Problem of Wild Animal Suffering. Routledge, 2020.
Jumper et al. “Highly accurate protein structure prediction with AlphaFold”. 2021.
Karnofsky. AI Could Defeat All Of Us Combined. https://www.cold-takes.com/ai-could-defeat-all-of-us-combined, 2022.
Knutsson. “Value lexicality”. 2016.
Larks. The Long Reflection as the Great Stagnation. https://forum.effectivealtruism.org/posts/o5Q8dXfnHTozW9jkY/the-long-reflection-as-the-great-stagnation, 2022.
Lickel. “Vicarious Retribution: the role of collective blame in intergroup aggression". 2006.
MacAskill et al. Moral Uncertainty. OUP Oxford, 2020.
MacAskill. What We Owe the Future. Oneworld, 2022.
MacAskill. Human Extinction, Asymmetry, and Option Value. https://docs.google.com/document/d/1hQI3otOAT39sonCHIM6B4na9BKeKjEl7wUKacgQ9qF8/
Marchau et al. Decision Making under Deep Uncertainty. Springer, 2019.
McAfee. “Effective computability in Economic Decisions”. 1984.
Metzinger. The Ego Tunnel: The Science of the Mind and the Myth of the Self. Basic Books, 2010.
Monton. “How to avoid Maximizing Expected Utility”. 2019.
Moore. Placing Blame: A Theory of Criminal Law. Clarendon, 1997.
Moshagen et al. “The Dark Core of Personality”. 2018.
Myerson and Satterthwaite. “Efficient mechanisms for bilateral trading”. 1983.
Ngo, Chan, and Mindermann. “The alignment problem from a deep learning perspective”. 2023.
Oesterheld. “Multiverse-wide Cooperation via Correlated Decision Making”. 2017.
Oesterheld. “Robust program equilibrium”. 2019.
Oesterheld and Conitzer. “Safe Pareto Improvements for Delegated Game Playing”. 2021.
Omohundro. “Basic AI Drives”. 2008.
OpenAI. “GPT-4 Technical Report”. 2023.
Ord. The Precipice. Bloomsbury, 2020.
Ortiz-Ospina and Roser. Global Health. https://ourworldindata.org/health-meta, 2016.
Ortiz-Ospina and Roser. Happiness and Life Satisfaction. https://ourworldindata.org/happiness-and-life-satisfaction, 2013.
Parfit. On What Matters: Volume 1. OUP Oxford, 2011.
Paulhus. “Toward a Taxonomy of Dark Personalities”. 2014.
Pearce. Hedonistic Imperative. Self-published, 1995.
Pinker. The Better Angels of Our Nature: A History of Violence and Humanity. Penguin, 2012.
Possajennikov. “On the evolutionary stability of altruistic and spiteful preferences”. 2000.
Reed et al. “A Generalist Agent”. 2022.
Ritchie, Rosado, and Roser. Meat and Dairy Production. https://ourworldindata.org/meat-production, 2017.
Shulman and Bostrom. “Sharing the World with Digital Minds”. In: Rethinking Moral Status, edited by Steve Clarke, Hazem Zohny, and Julian Savulescu. Oxford University Press, 2021.
Silver et al. “Mastering chess and shogi”. 2017.
Singer. The Expanding Circle: Ethics, Evolution, and Moral Progress. https://press.princeton.edu/books/paperback/9780691150697/the-expanding-circle, 2011.
Sinnott-Armstrong. “Consequentialism,” The Stanford Encyclopedia of Philosophy, Edward N. Zalta & Uri Nodelman (eds.), https://plato.stanford.edu/archives/win2022/entries/consequentialism, 2022.
Sotala. Advantages of Artificial Intelligences, Uploads, and Digital Minds. https://intelligence.org/files/AdvantagesOfAIs.pdf, 2012.
Stastny et al. “Normative Disagreement as a Challenge for Cooperative AI”. 2021.
Stein-Perlman et al. “Expert survey on Progress in AI”. 2022.
Tarsney. “The epistemic challenge to longtermism”. 2022.
Tennenholtz. “Program equilibrium”. 2004.
Thomas. “The Asymmetry, Uncertainty, and the Long Term”. 2019.
Tomasik. Artificial Intelligence and Its Implications for Future Suffering. https://longtermrisk.org/artificial-intelligence-and-its-implications-for-future-suffering, 2015a.
Tomasik. Charity Cost-Effectiveness in an Uncertain World. https://longtermrisk.org/charity-cost-effectiveness-in-an-uncertain-world, 2015b.
Tomasik. Do Artificial Reinforcement-Learning Agents Matter Morally? https://arxiv.org/abs/1410.8233?context=cs, 2014.
Tomasik. Reasons to Be Nice to Other Value Systems. https://longtermrisk.org/reasons-to-be-nice-to-other-value-systems, 2015c.
Tomasik. Reasons to Promote Suffering-Focused Ethics. https://reducing-suffering.org/the-case-for-promoting-suffering-focused-ethics, 2015d.
Tomasik. Risks of Astronomical Future Suffering. https://longtermrisk.org/risks-of-astronomical-future-suffering, 2015e.
Torges. Coordination challenges for preventing AI conflict. https://longtermrisk.org/coordination-challenges-for-preventing-ai-conflict, 2021.
Vinding. Moral Circle Expansion Might Increase Future Suffering. https://magnusvinding.com/2018/09/04/moral-circle-expansion-might-increase-future-suffering, 2018a.
Vinding. Reasoned Politics. Independently published, 2022a.
Vinding. Suffering-Focused Ethics: Defense and Implications. Independently published, 2020a.
Vinding. Why altruists should be cooperative. https://centerforreducingsuffering.org/why-altruists-should-be-cooperative, 2020b.
Vinding. Why Altruists Should Perhaps Not Prioritize Artificial Intelligence: A Lengthy Critique. https://magnusvinding.com/2018/09/18/why-altruists-should-perhaps-not-prioritize-artificial-intelligence-a-lengthy-critique, 2018b.
Vinding. Popular views of population ethics imply a priority on preventing worst-case outcomes. https://centerforreducingsuffering.org/popular-views-of-population-ethics-imply-a-priority-on-preventing-worst-case-outcomes, 2022b.
Vinding and Baumann. S-risk impact distribution is double-tailed. https://centerforreducingsuffering.org/s-risk-impact-distribution-is-double-tailed, 2021.
Wendel. “On the abundance of extraterrestrial life after the Kepler mission”. 2015.
Yudkowsky. The Rocket Alignment Problem. https://intelligence.org/2018/10/03/rocket-alignment, 2018.
Ziegler et al. Fine-tuning GPT-2 from human preferences. https://openai.com/blog/fine-tuning-gpt-2, 2019.
(Archive) Summer Research Fellowship 2023
We, the Center on Long-Term Risk, are looking for Summer Research Fellows to help us explore strategies for reducing suffering in the long-term future (s-risk) and work on technical AI safety ideas related to that. For eight weeks, you will be part of our team while working on your own research project. During this time, you will be in regular contact with our researchers and other fellows. One of our researchers will serve as your guide and mentor.
Your contributions to our research program will have a positive impact through their influence on our strategic direction, grantmaking, communications, events, and other activities. You will work autonomously on challenging research questions relevant to reducing suffering. You will become part of our team of intellectually curious, hard-working, and caring people, all of whom share a profound drive to make the biggest difference they can.
We are worried that some people might not apply because they wrongly believe they are not a good fit for working with us. While such a belief is sometimes true, it is often the result of underconfidence rather than an accurate assessment. We would therefore love to see your application even if you are not sure if you are qualified or otherwise competent enough for the positions listed. We explicitly have no minimum requirements in terms of formal qualifications and many of the past summer research fellows have had no or little prior research experience. Being rejected this year will not reduce your chances of being accepted in future hiring rounds. If you have any doubts, please don’t hesitate to reach out (see “Application process” > “Inquiries” below).
The purpose of the fellowship varies from fellow to fellow. In the past, we have often had several types of people take part in the fellowship.
There might be many other good reasons for completing the fellowship. We encourage you to apply if you think you would benefit from the program, even if your reason is not listed above. In all cases, we will work with you to make the fellowship as valuable as possible given your strengths and needs. In many cases, this will mean focusing on learning and testing your fit for s-risk research, more than seeking to produce immediately valuable research output.
We don’t require specific qualifications or experience for this role, but the following abilities and qualities are what we’re looking for in candidates. We encourage you to apply if you think you may be a good fit, even if you are unsure whether you meet some of the criteria.
We encourage you to apply even if any of the below does not work for you. We are happy to be flexible for exceptional candidates, including when it comes to program length and compensation.
You can find an overview of our current priority areas here. However, if we believe that you can somehow advance high-quality research relevant to s-risks, we are interested in creating a position for you. If you see a way to contribute to our research agenda or have other ideas for reducing s-risks, please apply. We commonly tailor our positions to the strengths and interests of the applicants.
All fellows will work with a mentor to guide their project. Our mentors have each written below about the topics in which they’re most interested in supervising research.
At stage 2 of our application process, applicants are asked to submit a research proposal and a list of research proposal ideas. A significant part of our selection process relates to consideration by our mentors of whether they are interested in supervising the Fellow, based on the Fellow’s and mentor’s research interests.
I would be most keen to supervise projects on:
Some things I’m keen to supervise projects on are:
However, I'm also interested in considering strong proposals outside these areas.
I’m most interested in supervising projects related to:
I’m most keen to supervise projects in the following areas:
I’m interested in supervising Fellows working in any of my academic interest areas, as seen on my website and blog.
Given their particular relevance to CLR's priorities, I’d be interested in working with Fellows in any of the following areas:
We value your time and are aware that applications can be demanding, so we have thought carefully about making the application process time-efficient and transparent. We plan to make the final decisions between May 5 and May 10.
Stage 1: To start your application for any role, please complete our application form. As part of this form, we also ask you to submit your CV/resume and give you the opportunity to upload an optional research sample. The deadline is Sunday, April 2, 2023, end of day anywhere on Earth. We expect this to take around 2 to 3 hours if you are already familiar with our work. In the interest of your time, you do not need to polish the language of your answers in the application form.
Stage 2: By Friday, April 7, we will decide whether to invite you to the second stage. We will ask you to write a research proposal (up to two pages excluding references) and a list of research proposal ideas, to be submitted by Sunday, April 23, end of day anywhere on Earth. This means applicants will have two weeks to complete this stage, which we expect will take up to 12 hours of work. Applicants may therefore want to keep some time free during this period to work on this. Applicants will be compensated with £250 for their work on this stage.
Stage 3: By Friday, April 28, we will decide whether to invite you to an interview via video call during the week of May 1. By May 10, we will send out final decisions to applicants.
If you have any questions about the process, please contact us at hiring@longtermrisk.org. If you want to send an email not accessible to the hiring committee, please contact Amrit Sidhu-Brar at amrit.sidhu-brar@longtermrisk.org.
In addition to their salary, CLR offers the following benefits to all staff (including Summer Research Fellows):
We aim to combine the best aspects of academic research (depth, scholarship, mentorship) with an altruistic mission to prevent negative future scenarios. So we leave out the less productive features of academia, such as precarious employment and publish-or-perish incentives, while adding a focus on impact and application.
As part of our team, you will enjoy:
You will advance neglected research to reduce the most severe risks to our civilization in the long-term future. Depending on your specific project, your work will help inform our activities across a number of paths to impact.
Annual Review & Fundraiser 2022
Our goal is to reduce the worst risks of astronomical suffering (s-risks). These are scenarios where a significant fraction of future sentient beings are locked into intense states of misery, suffering, and despair. We currently believe that such lock-in scenarios most likely involve transformative AI systems. So we work on making the development and deployment of such systems safer.
Concrete research programs:
Most of our work is research with the goal of identifying threat models and possible interventions. In the case of technical AI interventions (which is the bulk of our object-level work so far), we then plan to evaluate these interventions and advocate for their inclusion in AI development.
Alongside our research, we also run events and fellowships to identify and support people who want to work on these problems.
Due to recent events, we have had a short-term funding shortfall. This caused us to reduce our original budget for 2023 by 30% and take various cost-saving measures, including voluntary pay cuts by our staff, to increase our runway to six months.
Our medium-term funding situation is hard to predict at the moment, as there is still a lot of uncertainty. We hope to gain more clarity about this in the next few months.
Our minimal fundraising goal is to increase our runway to nine months, which would give us enough time to try and secure a grant from a large institutional funder in the first half of 2023. Our main goal is to increase our runway to twelve months and roll back some of the budget reductions, putting us in a more comfortable financial position again. Our stretch goal is to increase our runway to fifteen months and allow for a small increase in team size in 2023. See the table below for more details.
Given the financial situation sketched above, we believe that CLR is a good funding opportunity this year. Whether it makes sense for any given individual to donate to CLR depends on many factors. Below, we sketch the main reasons donors could be excited about our work. In an appendix, we collected some testimonials by people who have a lot of context on our work.
Supporting s-risk reduction.
You might want to support CLR’s work because it is one of the few organizations addressing risks of astronomical suffering directly. You could consider s-risk reduction worthwhile for a number of reasons: (1) You find the combination of suffering-focused ethics and longtermism compelling. (2) You think the expected value of the future is not sufficiently high to warrant prioritizing extinction risk reduction over improving the quality of the future. (3) You want to address the fact that work on s-risks is comparatively neglected within longtermism and AI safety.
Since the early days of our organization, we have made significant progress on clarifying and modeling the concrete threats we are trying to address and coming up with technical candidate interventions (see Appendix).
Supporting work on addressing AI conflict.
Beyond its benefits for s-risk reduction, you might value some of our work because it addresses failure modes arising in multipolar AI scenarios more broadly (e.g., explored here, here). In recent years, we have helped to build up the field of Cooperative AI intended to address these risks (e.g., Stastny et al. 2021).
Supporting work on better understanding acausal interactions.
Such interactions are possibly a crucial consideration for longtermists (see, e.g., here). Some argue that, when acting, we should consider the non-causal implications of our actions (see, e.g., Ahmed (2014), Yudkowsky and Soares (2018), Oesterheld and Conitzer (2021)). If this is the case, these effects could dwarf their causal influence (see, e.g., here). Better understanding the implications of this would then be a key priority. CLR is among the few organizations doing and supporting work on this (e.g., here).
Much of our work on cooperation in the context of AI plausibly becomes particularly valuable from this perspective. For instance, if we are to act so as to maximize a compromise utility function that includes the values of many agents across the universe, as the ECL argument suggests, then it becomes much more important that AI systems, even if aligned, cooperate well with agents with different values.
Supporting cause-general longtermism research.
CLR has also done important research from a general longtermist lens, e.g., on decision theory, metaethics, AI timelines, risks from malevolent actors, and extraterrestrial civilizations. Our Summer Research Fellowship has been a springboard for junior researchers who then moved on to other longtermist organizations (e.g., ARC, Redwood Research, Rethink Priorities, Longview Philanthropy).
To donate to CLR, please go to the Fundraiser page on our website.
For frequently asked questions on donating to CLR, see our Donate page.
This group is led by Jesse Clifton. Members of the group are Anni Leskelä, Anthony DiGiovanni, Julian Stastny, Maxime Riché, Mia Taylor, and Nicolas Macé.
Have we made relevant research progress?
We believe we have made significant progress (e.g., relative to previous years) on improving our expertise in the reasons why AI systems might engage in conflict and the circumstances under which technical work done now could reduce these risks. We’ve built up methods and knowledge that we expect to make us much better at developing and assessing interventions for reducing conflict. (Some of this is reflected in our public-facing work.) We have begun to capitalize on this in the latter part of 2022, as we’ve begun moving from improving our general picture of the causes of conflict and possibilities for intervention to developing and evaluating specific interventions. These interventions include surrogate goals, preventing conflict-seeking preferences, preventing commitment races, and developing cooperation-related content for a hypothetical manual for overseers of AI training.
The second main way in which we’ve made progress is the initial work we’ve done on the evaluation of large language models (LLMs). There are several strong arguments that those interested in intervening on advanced AI systems should invest in experimental work with existing AI systems (see, e.g., The case for aligning narrowly superhuman models). Our first step here has been to work on methods for evaluating cooperation-relevant behaviors and reasoning of LLMs, as these methods are prerequisites for further research progress. We are in the process of developing the first Cooperative AI dataset for evaluating LLMs as well as methods for automatically generating data on which to evaluate cooperation-relevant behaviors, which is a prerequisite for techniques like red-teaming language models with language models. We are preparing to submit this work to a machine learning venue. We have also begun developing methods for better understanding the reasoning abilities of LLMs when it comes to conflict situations in order to develop evaluations that could tell us when models have gained capabilities that are necessary to engage in catastrophic conflict.
Has the research reached its target audience?
We published a summary of our thinking (as of earlier this year) on when technical work to reduce AGI conflict makes a difference on the Alignment Forum/LessWrong, which is visible to a large part of our target audience (AI safety & longtermist thinkers). We have also shared internal documents with individual external researchers to whom they are relevant. A significant majority of the research that we’ve done this year has not been shared with target audiences, though. Much of this is work-in-progress on evaluating interventions and evaluating LLMs which will be incorporated into summaries shared directly with external stakeholders, and in some cases posted on the Alignment Forum/LessWrong or submitted for publication in academic venues.
What feedback on our work have we received from peers and our target audience?
Our Alignment Forum sequence When does technical work to reduce AGI conflict make a difference? didn’t get much engagement. We did receive some positive feedback on internal drafts of this sequence from external researchers. We also solicited advice from individual alignment researchers throughout the year. This advice was either encouraging of existing areas of research focus or led us to shift more attention to areas that we are now focusing on (summarized in “relevant research progress” section above).
Emery, Daniel, and Tristan work on a mix of macrostrategy, ECL, decision theory, anthropics, forecasting, AI governance, and game theory.
Have we made relevant research progress?
The main focus of Emery’s work in the last year has been on the implications of ECL for cause prioritization. This includes work on the construction of the compromise utility function under different anthropic and decision-theoretic assumptions, on the implications of ECL for AI design, and on more foundational questions. Additionally, Emery worked on a paper (in progress) extending our earlier Robust Program Equilibrium paper. She also did some work on the implications of limited introspection ability for evidential decision theory (EDT) agents, and some related work on anthropics.
Daniel left for OpenAI early in the year, but not before making significant progress building a model of ECL and identifying key cruxes for the degree of decision relevance of ECL.
Tristan primarily worked on the optimal spending schedule for AI risk interventions and the probability that an Earth-originating civilization would encounter alien civilizations. To that end, he built and published two comprehensive models.
Overall, we believe we made moderate research progress, but Emery and Daniel have accumulated a large number of unpublished ideas to be written up.
Has the research reached its target audience?
The primary goal of Tristan’s reports was to inform CLR’s own prioritization. For example, the existence of alien civilizations in the far future is a consideration for our work on conflict. That said, Tristan’s work on Grabby Aliens received considerable engagement on the EA Forum and on LessWrong.
As mentioned above, a lot of Emery and Daniel’s work is not yet fully written up and published. Whilst the target audience for some of this work is internal, it’s nevertheless true that we haven’t been as successful in this regard as we would like. We have had fruitful conversations with non-CLR researchers about these topics, e.g., people at Open Philanthropy and MIRI.
What feedback on our work have we received from peers and our target audience?
The grabby aliens report was well received and cited by S. Jay Olson (co-author of a recent paper on extraterrestrial civilizations with Toby Ord), who described it as “fascinating and complete”, and Tristan has received encouragement from Robin Hanson to publish academically, which he plans to do.
Progress on all fronts seems very similar to last year, which we characterized as “modest”.
Have we increased the (quality-adjusted) size of the community?
Community growth has continued to be modest. We are careful in how we communicate about s-risks, so our outreach tools are limited. Still, we had individual contact with over 150 people who could potentially make contributions to our mission. Out of those, perhaps five to ten could turn out to be really valuable members of our community.
Have we created opportunities for in-person (and in-depth online) contact for people in our community?
We created more opportunities for discussion and exchange than before. We facilitated more visits to our office, hosted meetups around EAG, and ran an S-Risk Retreat with about 30 participants. We wanted to implement more projects in this direction, but our limited staff capacity made that impossible.
Have we provided resources for community members that make it more likely they contribute significantly to our mission?
Our team has continued to provide several useful resources this year. We administered the CLR Fund, which supported various efforts in the community. We provided ad-hoc career advice to community members. We are currently experimenting with a research database prototype. We believe there are many more things we could be doing, but our limited staff capacity has held us back.
Guiding question: Are we a healthy organization with sufficient operational capacity, an effective board, appropriate evaluation of our work, reliable policies and procedures, adequate financial reserves and reporting, and high morale?
Our capacity is currently not as high as we would like it to be as a staff member left in the summer and we only recently made a replacement hire. So various improvements to our setup have been delayed (e.g., IT & security improvements, a systematic policy review, some visa-related issues, systematic risk management). That being said, we are still able to maintain all the important functions of the organization (i.e., accounting, payments, payroll, onboarding/offboarding, hiring support, office management, feedback & review, IT maintenance).
Members of the board: Tobias Baumann, Max Daniel, Ruairi Donnelly (chair), Chi Nguyen, Jonas Vollmer.
The board are ultimately responsible for CLR. Their specific responsibilities include deciding CLR’s leadership and structure, engaging with the team about strategy and planning, resolving organizational conflicts, and advising and providing accountability for CLR leadership. In 2022 they considered various decisions related to CLR’s new office, hiring/promotion, and overall financials. The team generally considered their advice to be valuable.
We collect systematic feedback on big community-building and operations projects through surveys and interviews. We collect feedback on our research by submitting articles to journals & conferences and by requesting feedback on drafts of documents from relevant external researchers.
In 2022, we did not have any incidents that required a policy-driven intervention or required setting up new policies. Due to a lack of operational capacity in 2022, we failed to conduct a systematic review of all of our policies.
See “Fundraising” section above.
We currently don’t track staff morale quantitatively. Our impression is that this varies significantly between staff members and is more determined by personal factors than organizational trends.
Our plans for 2023 fall into three categories.
Evaluating large language models. We will continue building on the work on evaluating LLMs that we started this year. Beyond writing up and submitting our existing results, the priority for this line of work is scoping out an agenda for assessing cooperation-relevant capabilities. This will account for work on evaluation that’s being done by other actors in the alignment space and possible opportunities for eventually slotting into those efforts.
Developing and evaluating cooperation-related interventions. We will continue carrying out the evaluations of the interventions that we started this year. On the basis of these evaluations, we’ll decide which interventions we want to prioritize developing (e.g., working out in more detail how they would be implemented under various assumptions about what approaches to AI alignment will be taken). In parallel, we’ll continue developing content for an overseer’s manual for AI systems.
General s-risk macrostrategy. Some researchers on the team will continue spending some of their time thinking about s-risk prioritization more generally, e.g., thinking about the value of alternative priorities to our group’s current focus on AI conflict.
Emery plans to prioritize finishing and writing up her existing research on ECL. She also has plans for some more general posts on ECL, including on some common confusions, and on more practical implications for cause prioritization. Emery also plans to focus on finishing the paper extending Robust Program Equilibrium, and to explore further more object-level work.
Daniel no longer works at CLR but plans to organize a research retreat focused on ECL in the beginning of 2023.
Tristan broadly plans to continue strategy-related modeling, such as on the spread of information hazards. He also plans to help to complete a project that calculates the marginal utility of AI x-risk and s-risk work under different assumptions about AGI timelines, and to potentially contribute to work on ECL.
We had originally planned to expand our activities across all three community-building functions. Without additional capacity, we would have to curtail these plans.
Outreach. If resources allow, we will host another Intro Fellowship and Summer Research Fellowship. We will also continue our 1:1 meetings & calls. We also plan to investigate what kind of mass outreach within the EA community would be most helpful (e.g., online content, talks, podcasts). Without such outreach, we expect that community growth will stagnate at its current low rate.
Resources. We plan to create more long-lasting and low-marginal-cost resources for people dedicated to s-risk reduction (e.g., curated reading lists, career guide, introductory content, research database). As the community grows and diversifies, these resources will have to become more central to our work.
Exchange. If resources allow, we will host another S-Risk Retreat. We also want to experiment with other online and in-person formats. Again, as the community grows and diversifies, we need to find a replacement for more informal arrangements.
Nate Soares (Executive Director, Machine Intelligence Research Institute): "My understanding of CLR's mission is that they're working to avoid fates worse than the destruction of civilization, especially insofar as those fates could be a consequence of misaligned superintelligence. I'm glad that someone on earth is doing CLR's job, and CLR has in the past seemed to me to occasionally make small amounts of legible-to-me progress in pursuit of their mission. (Which might sound like faint praise, and I sure would endorse CLR more full-throatedly if they spent less effort on what seem to me like obvious dead-ends, but at the same time it's not like anybody else is even trying to do their job, and their job is worthy of attempting. According to me, the ability to make any progress at all in this domain is laudable)"
Lukas Finnveden (Research Analyst, Open Philanthropy): “CLR’s focus areas seem to me like the most important areas for reducing future suffering. Within these areas, they’ve shown competence at producing new knowledge, and I’ve learnt a lot that I value from engaging with their research.”
Daniel Kokotajlo (Policy/Governance, OpenAI): “I think AI cooperation and s-risk reduction are high priority almost regardless of your values / ethical views. The main reason to donate to, or work for, CLR is that the best thinking about s-risks and AI cooperation happens here, better than the thinking at MIRI or Open Phil or anywhere else. CLR also contains solid levels of knowledge of AI governance, AI alignment, AI forecasting (less so now that I’m gone), cause prioritisation, metaethics, agent foundations, anthropics, aliens, and more. Their summer fellows program is high quality and has produced many great alumni. Their ops team is great & in general they are well-organized. I left CLR to join the OpenAI governance team because I was doing mostly AI forecasting which benefits from being in a lab — but I was very happy at CLR and may even one day return.”
Michael Aird (Senior Research Manager, Rethink Priorities): 'I enjoyed my time as a summer research fellow at CLR in 2020, and I felt like I learned a lot about doing research and about various topics related to longtermist strategy, AI risk, and ethics. I was also impressed by the organization's culture and how the organization and fellowship was run, and I drew on some aspects of that when helping to design a research fellowship myself and when starting to manage people.'
Testimonials by other Summer Research Fellows can be found here.
What follows below is a somewhat stylized/simplified account of the history of the Center on Long-Term Risk prior to 2022. It is not meant to capture every twist and turn.
2011-2016: Incubation phase
What is now called the “Center on Long-Term Risk” starts out as a student group in Switzerland. Under the name “Foundational Research Institute,” we do pre-paradigmatic research into possible risks of astronomical suffering and create basic awareness of these scenarios in the EA community. A lot of pioneering thinking is done by Brian Tomasik. In 2016, we coin the term “risks of astronomical suffering” (s-risks). Key publications from that period:
2016-2019: Early growth
More researchers join; the organization professionalizes and matures. We publish our first journal articles related to s-risks. Possible interventions are being developed and discussed, surrogate goals among them. In 2017, we start sharing our work with other researchers in the longtermist community. That culminates in a series of research workshops in 2019. Key publications from that period:
2019-2022: Maturation
Before 2019, we were pursuing many projects other than research on s-risks. In 2019, this stops. We start focusing exclusively on research. Increasingly, we connect our ideas to existing lines of academic inquiry. We also start engaging more with concrete proposals and empirical work in AI alignment. Key publications from that period:
CLR Fundraiser 2022
For further details of CLR's progress in 2022, plans for 2023, and funding needs, please see our full fundraiser post.
The Swiss charity Effective Altruism Foundation (EAF) collects and processes donations through the below form on our behalf. Such donations will be used exclusively to support CLR.
For frequently asked questions on donating to CLR, see our Donate page.
Note: since the fundraiser is now over, any donations from now on will not be listed in the fundraiser donations list below.
Name | Amount | Comment
---|---|---
Simon Möller | CHF 15000 |
David Lechner | CHF 250 |
Swante Scholz | CHF 10000 | Donation for Center on long-term risk (CLR)
Kwan Yee Ng | USD 7000 |
Spencer Pearson | USD 30 |
Althaus Silvia | CHF 5000 |
Markus Winkelmann | CHF 12000 |
Markus Winkelmann | CHF 500 |
Anonymous | USD 1000000 |
Anonymous | EUR 387000 |
Anonymous | USD 1000 |
Anonymous | USD 500 |
Anonymous | USD 1500 |
Anonymous | USD 40000 |
Jan Rüegg | CHF 4500 |
Adrian Hutter | CHF 9250 |
Patrick Levermore | GBP 10 |
Jonas Vollmer | USD 1000 |
Anonymous | GBP 3.13 |
The optimal timing of spending on AGI safety work; why we should probably be spending more now
When should funders wanting to increase the probability of AGI going well spend their money? We have created a tool to calculate the optimum spending schedule and tentatively conclude funders collectively should be spending at least 5% of their capital each year on AI risk interventions and in some cases up to 35%.
This is likely higher than the current AI risk community spending rate, which is at most 3%. In most cases, we find that the optimal spending schedule is between 5% and 15% better than the ‘default’ strategy of just spending the interest one accrues, and from 15% to 50% better than a naive projection of the community’s spending rate.
We strongly encourage users to put their own inputs into the tool to draw their own conclusions.
The key finding of a higher spending rate is supported by two distinct models we have created: one that splits spending of capital into research and influence, and a second model (the ‘alternate model’) that supposes we can spend our stock of things that grow on direct work. We focus on the former, since its output is more obviously action-guiding; the latter is described in the appendix.
The table below shows our best guess for the optimal spending schedule using the former model when varying the difficulty of achieving a good AGI outcome and AGI timelines. We keep other inputs, such as diminishing returns to spending and the interest rate, constant.
[Table of optimal spending schedule plots, one for each combination of AI safety difficulty (easy, medium, hard) and median AGI arrival year (2030, 2040, 2050).]
How much better the optimal spending schedule is compared to the 2%+2% constant spending schedule (within-model lower bound):

Difficulty of AGI success | Median AGI arrival: 2030 | 2040 | 2050
---|---|---|---
Easy | 37.6% | 18.4% | 11.8%
Medium | 39.3% | 14.9% | 12.0%
Hard | 12.3% | 5.85% | 1.55%
Some of the critical limitations of our model include: poorly modelling exogenous research, which is particularly important for those with longer timelines, and many parts of the model - such as diminishing returns - remaining constant over time.
Further, we find that robust spending strategies - those that work in a wide variety of worlds - also support a higher spending rate. We show the results of a Monte Carlo simulation in the appendix.
Humanity might be living at a hinge moment in history (MacAskill, 2020). This is partly due to the unusually high level of existential risk (Ord, 2020) and, in particular, the significant probability that humanity will build artificial general intelligence (AGI) in the next decades (Cotra, 2022). More specifically, AGI is likely to account for a large fraction of extinction risk in the present and coming decades (Cotra, 2022) and stands as a strong candidate to influence the long-term future. Indeed, AGI might play a particularly important role in the long-term trajectory change of Earth-originating life by increasing the chance of a flourishing future (Bostrom, 2008) and reducing the risks of large amounts of disvalue (Gloor, 2016).
Philanthropic organisations aligned with effective altruism principles, such as the FTX Foundation and Open Philanthropy, play a crucial role in reducing AI risks by optimally allocating funding to organisations that produce research, technologies and influence to reduce risks from artificial intelligence. Figuring out the optimal funding schedule is particularly salient now, with the possibility of AI timelines under 10 years (Kokotajlo, 2022) and the substantial growth in effective altruism (EA) funding, roughly estimated at 37% per year from 2015 to 2021 for a total endowment of about $46B by the end of 2021 (Todd, 2021).
Previous work has emphasised the need to invest now to spend more later due to low discount rates (Hoeijmakers, 2020, Dickens 2020). This situation corresponds to a “patient philanthropist”. Research has modelled the optimal spending schedule a patient philanthropist should follow if they face constant interest rates, diminishing returns and a low discount rate accounting for existential risks (Trammell, 2021, Trammell 2021). Extensions of the single provider of public goods model allowed for the rate of existential risks to be time-dependent (Alaya, 2021) and to include a trade-off between labour and capital where labour accounts for movement building and direct work (Sempere, Trammell 2021). Some models also discussed the trade-off between economic growth and existential risks by modelling the dynamics between safety technology and consumption technology with an endogenous growth (Aschenbrenner, 2020) and an exogenous growth model (Trammell, 2021).
Without more specific quantitative models taking into account AI timelines, growth in funding, progress in AI safety and the difficulty of building safe AI, previous estimates of a spending schedule of just over 1% per year (Todd 2021, MacAskill 2022) are at risk of underperforming the optimal spending schedule by as much as 40%.
In this work, we consider a philanthropist or philanthropic organisation maximising the probability of humanity building safe AI. The philanthropist spends money to increase the stock of AI safety research and influence over AI development, which translates into increasing the probability of successfully aligning AI or avoiding large amounts of disvalue. Our model takes into account AI timelines, the growth of capital committed to AI safety, diminishing returns in research and influence, as well as the competitiveness of influencing AI development. We also allow for the possibility of a fire alarm shortly before AGI arrives. Upon “hearing” the fire alarm, the philanthropist knows the arrival time of AGI and can spend all of its remaining money before that time. The philanthropist also has some discount rate, due to other existential risks and to exogenous research that accelerates safety research.
Crucially, we have coded the model into a notebook accompanying this blog post that philanthropists and interested users can run to estimate an optimal spending schedule given their estimates of AI timelines, the difficulty of AI safety, capital growth and diminishing returns. Mathematically the problem of finding the optimal spending schedule translates into an optimal control problem giving rise to a set of nonlinear differential equations with boundary conditions that we solve numerically.
We discuss the effect of AI timelines and the difficulty of AI safety on the optimal spending schedule. Importantly, the optimal spending schedule typically ranges from 5% to 35% per year this decade, certainly above the current typical spending of EA-aligned funders. A funder should follow the most aggressive spending schedule this decade if AI timelines are short (2030) and safety is hard. An intermediate scenario yields a yearly average spending of ~12% over this decade. The optimal spending schedule typically performs between 5% and 15% better than the strategy of spending the endowment’s rate of appreciation, and between 18% and 40% better than the current EA community spending of ~3% per year.
We suppose that a single funder controls all of the community’s funding that is earmarked for AI risk interventions and that they set the spending rate for two types of interventions: research and influence. The funder’s aim is to choose the spending schedule - how much they spend each year on each intervention - that maximises the probability that AGI goes successfully (e.g. does not lead to an existential catastrophe).
The ‘model’ is a set of equations (described in the appendix) and accompanying Colab notebook. The latter, once given inputs from the user, finds the optimal spending schedule.
We suppose that any spending is on either research or influence. Any money we don’t spend is saved and gains interest. As well as investing money through traditional means, the funder is able to ‘invest’ in promoting earning-to-give, which historically has been a source of a large fraction of the community’s capital.
We suppose there is a single number for each of the stocks of research and influence describing how much the community has of each.
Research refers to the community’s ability to make AGI a success given we have complete control over the system (modulo being able to delay its deployment indefinitely). The stock of research contains AI safety technical knowledge, skilled safety researchers, and safe models that we control and can deploy. Influence describes the degree of control we have over the development of AGI, and can include ‘soft’ means such as through personal connections or ‘hard’ means such as passing policy. Both research and influence contribute to the probability we succeed and the user can input the degree to which they are ‘substitutable’.
The equations modelling the time evolution of research and influence have the following features:
Any money we don’t spend appreciates. Historically the money committed to the effective altruism movement has grown faster than market real interest rates. The model allows for a variable real interest rate, which allows for the possibility that the growth of the effective altruism community slows.
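As a rough illustration of these capital dynamics, here is a minimal sketch (the function names, growth numbers, and spending rule are our own illustrative choices, not taken from the accompanying notebook) of capital that appreciates at a declining rate while a constant fraction is spent each year:

```python
import numpy as np

def growth_rate(t, r0=0.20, r_inf=0.085, half_life=5.0):
    """Assumed real growth rate of committed capital: starts at r0 and
    decays toward r_inf (illustrative numbers only)."""
    return r_inf + (r0 - r_inf) * 0.5 ** (t / half_life)

def simulate_capital(spend_frac, years=30, m0=1.0, dt=0.1):
    """Euler-integrate dM/dt = r(t) * M - spending, where spending is a
    constant fraction `spend_frac` of current capital per year."""
    ts = np.arange(0.0, years, dt)
    path = np.empty_like(ts)
    m = m0
    for i, t in enumerate(ts):
        path[i] = m
        m += (growth_rate(t) - spend_frac) * m * dt
    return ts, path

ts, capital = simulate_capital(spend_frac=0.10)
```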
We use the term preparedness at time t to describe how ‘ready’ we are if AGI arrived at time t. Preparedness is a function of research and influence: the more we have of each, the more prepared we are. The user inputs the relative importance of research and influence, as well as the degree to which they are substitutable.
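A preparedness function with these properties could, for example, take a CES (constant elasticity of substitution) form. The sketch below is our own assumption about what such an aggregator might look like, not the notebook’s exact definition:

```python
def preparedness(research, influence, w_r=0.5, w_i=0.5, rho=0.0):
    """CES-style aggregator of the two stocks.  w_r and w_i are the
    importance weights; rho controls substitutability (rho = 1: perfect
    substitutes, rho -> 0: Cobb-Douglas, rho very negative: strong
    complements)."""
    if abs(rho) < 1e-9:  # Cobb-Douglas limit
        return research ** w_r * influence ** w_i
    return (w_r * research ** rho + w_i * influence ** rho) ** (1.0 / rho)
```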
We may find it useful to have money before AGI takeoff, particularly if we have a ‘fire alarm’ period where we know that AGI is coming soon and can spend most of it on last-minute research or influence. The model allows for such last-minute spending on research and influence, and so one’s money indirectly contributes to preparedness.
The probability of success given AGI arriving in year t is an S-shaped function of our preparedness. The model is not fixed to any definition of ‘success’ and could be, but is not limited to, “AGI not causing an existential catastrophe” or “AGI being aligned to human values” or “preventing AI-caused s-risk”.
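A logistic curve is one simple way to get the S-shape described here; the midpoint, steepness, and ceiling values below are placeholders of ours, not the model’s calibrated parameters:

```python
import math

def p_success(prep, midpoint=1.0, steepness=3.0, ceiling=1.0):
    """S-shaped mapping from preparedness to the probability that AGI goes
    well, conditional on it arriving: near zero when preparedness is far
    below the midpoint, saturating at `ceiling` far above it."""
    return ceiling / (1.0 + math.exp(-steepness * (prep - midpoint)))
```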
Since we are uncertain when AGI will arrive, the model takes the user’s AGI timelines as an input and integrates, over arrival times t, the product of {the probability of AGI arriving at time t} and {the probability of success at time t given AGI arrives at time t}.
The model also allows for a discount rate to account for non-AGI existential risks or catastrophes that preclude our research and influence from being useful or other factors.
The funder’s objective function, the function they wish to maximise, is the probability of making AGI go successfully.
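Schematically, in our own notation (which may differ from the notebook’s), the objective described in the last few paragraphs is:

```latex
\max_{s(\cdot)} \; \int_{0}^{\infty} f(t)\, e^{-\delta t}\, \sigma\big(P_{s}(t)\big)\, dt
```

where f(t) is the probability density of AGI arriving at time t (the user’s timelines), δ is the discount rate, P_s(t) is the preparedness at time t implied by the spending schedule s through the research, influence and money stocks it produces, and σ is the S-shaped success function.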
The preceding qualitative description corresponds to a collection of differential equations that describe how the numerical quantities of money, research and influence change as a function of our spending schedule. We want to find the spending schedule that maximises the objective function - the optimal spending schedule - and we do this with tools from optimal control theory.
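The notebook solves the resulting optimal control problem via its differential equations. A cruder but self-contained way to approximate an optimal schedule, under toy dynamics and parameters of our own choosing (not CLR’s), is to discretize spending into piecewise-constant yearly fractions and optimize them numerically:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import lognorm

horizon = 40                         # years simulated
timeline = lognorm(s=0.5, scale=18)  # hypothetical AGI-arrival distribution

def objective(spend_fracs, m0=1.0, r=0.10, delta=0.01):
    """Negative expected success probability for piecewise-constant yearly
    spending fractions, split evenly between research and influence."""
    m, research, influence, total = m0, 0.1, 0.05, 0.0
    for year, frac in enumerate(np.clip(spend_fracs, 0.0, 1.0)):
        spend = frac * m
        m = m * (1 + r) - spend
        research += 0.5 * np.sqrt(spend)   # toy diminishing returns
        influence += 0.5 * np.sqrt(spend)
        prep = np.sqrt(research * influence)        # Cobb-Douglas preparedness
        p_succ = 1 / (1 + np.exp(-3 * (prep - 1)))  # S-shaped success curve
        p_arrive = timeline.cdf(year + 1) - timeline.cdf(year)
        total += p_arrive * np.exp(-delta * year) * p_succ
    return -total

# A coarse derivative-free optimizer; the notebook's approach is more principled.
res = minimize(objective, x0=np.full(horizon, 0.05), method="Nelder-Mead",
               options={"maxiter": 20000})
optimal_yearly_fractions = np.clip(res.x, 0.0, 1.0)
```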
We first review the table from the start of the post, which varies AGI timelines and the difficulty of AGI success while keeping all other model inputs constant. We stress that the results are based on our guesses of the inputs (such as diminishing returns to spending) and encourage people to try the tool out themselves.
[Figure: yearly optimal spending schedule averaged over this decade, 2022-2030 (left), and the next, 2030-2040 (right). For each level of AI safety difficulty (easy, medium and hard) and each decade, the plots show the average spending rates on research and influence as a percentage of the funder’s endowment.]
We consider our best guess for the model’s parameters as given in the appendix (see “explaining and estimating the model parameters”). We describe the effects of timelines and the difficulty of AI safety on the spending schedule in this decade (2022-2030), the effects being roughly similar in the 2030 decade.
In most future scenarios we observe that the average optimal spending schedule is substantially higher than the current EA spending rate standing at around 1-3% per year. The most conservative spending schedule happens when the difficulty of AI safety is hard with long timelines (2050) with an average spending rate of around 6.5% per year. The most aggressive spending schedule happens when AI safety is hard and timelines are short (2030) with an average funding rate of about 35% per year.
For each level of difficulty and each AI timeline, the average allocation between research and influence looks balanced. Indeed, research and influence each account for roughly half of the total spending in each scenario. Looking closer at the results in the appendix (see “appendix results”), we observe that influence spending seems to decrease more sharply than research spending, particularly beyond the 2030 decade. This is likely caused by the sharp increase in the level of competition over AI development, which makes units of influence more costly relative to units of research. We want to emphasise, though, that the split between influence and research in the total spending schedule could easily change with different diminishing-returns parameters for research and influence.
The influence of AI timelines on the optimal spending schedule varies across distinct levels of difficulty but follows a consistent trend. Roughly, with AI timelines getting longer by a decade, the funder should decrease its average funding rate by 5 to 10%, unless AI safety is hard. If AI safety is easy, a funder should spend an average of ~25% per year for short timelines (2030), down to ~18% per year with medium timelines (2040) and down to ~15% for long timelines (2050). If AI safety difficulty is medium then the spending schedule follows a similar downtrend, starting at about 30% with short timelines down to ~12% with medium timelines and down to 10% with long timelines. If AI safety is hard, the decline in spending from short to medium timelines is sharper, starting at 35% per year with short timelines down to ~8% with medium timelines and down to about 5% with long timelines.
Interestingly, conditioning on short timelines (2030), going from AI safety hard to easier difficulty decreases the spending schedule from ~35% to ~25%, but conditioning on medium (2040) or long (2050) timelines, going from AI safety hard to easier difficulty increases the spending schedule from 6% to 18% and 9% to 15% respectively.
In summary, in most scenarios, the average optimal spending schedule in the current decade typically varies between 5% to 35% per year. With medium timelines (2040) the average spending schedule typically stays in the 10-20% range and moves up to the 20-35% range with short timelines (2030). The allocation between research and influence is balanced.
In this section, we show the effect of varying one parameter (or related combination) on the optimal spending schedule. The rest of the inputs are described in the appendix. We stress again that these results are for the inputs we have chosen and encourage you to try out your own.
Varying just the discount rate we see that a higher discount rate implies a higher spending rate in the present.
[Figure: optimal spending schedules under a low discount rate, the standard discount rate, and a high discount rate.]
It seems plausible that the community and its capital are going through an unusually fast period of growth that will level off. When assuming a lower rate of growth, we see that the optimal spending rate is lower, but still higher than the community’s current allocation. In particular, we should be spending faster than we are growing.
[Figure: optimal spending schedules under three growth scenarios: a highly pessimistic growth rate of 5%; a pessimistic growth rate of 10% now, decreasing to 5% within five years; and our guess of 20% now, decreasing to 8.5% over the next ten years.]
We can compute the change in utility when the amount of funding committed to AI risk interventions changes. This is of relevance to donors interested in the marginal value of different causes, as well as philanthropic organisations that have not explicitly decided the funding for each cause area.
Starting money multiplier | 0.001 | 0.01 | 0.1 | 0.5 | 1 | 1.1 | 1.5 | 10 |
Absolute utility | 0.031 | 0.044 | 0.092 | 0.219 | 0.317 | 0.332 | 0.386 | 0.668
Multiple of 100% money utility | 0.098 | 0.139 | 0.290 | 0.691 | 1 | 1.047 | 1.218 | 2.107 |
A different initial endowment has qualitative effects on the spending schedule. For example, comparing the 10% to 1000% case we see that when we have more money we - unsurprisingly - spend at a much higher rate. This result itself is sensitive to the existing stocks of research and influence.
When we have 10% of our current budget of $4000m | When we have 1000% of our current budget |
The spending schedule is not independent of our initial endowment. This is primarily driven by the S-shaped success function. When we have more money, we can beeline for the steep returns of the middle of the S-shape. When we have less money, we choose to save to later reach this point.
We see that, unsurprisingly, lower diminishing returns to spending suggest spending at a higher rate.
High diminishing returns20 | Our guess21 | Low diminishing returns22 |
The constant controls whether research becomes cheaper as we accumulate more research () or more expensive (). The former could describe a case where an increase in research leads to the increasing ability to parallelize research or break down problems into more easily solvable subproblems. The latter could describe a case where an increase in research leads to an increasingly bottlenecked field, where further progress depends on solving a small number of problems that are only solvable by a few people.
Research is highly serial | Default () | Research is highly parallelizable |
We see that in a world where research is either highly serial or parallelizable, we should be spending at a higher rate than if it is, on balance, neither. The parallelizable result is less surprising than the serial result, which we plan to explore in later work.
A more nuanced approach would use a function such that the field can become more or less bottlenecked as it progresses and the price of research changes accordingly.
Using our parameters, we find the presence of a fire alarm greatly improves our prospects and, perhaps unexpectedly, pushes the spending schedule upwards. This suggests it is both important to be able to correctly identify the point at which AGI is close and have a plan for the post-fire alarm period.
No fire alarm. | Short fire alarm: funders spend 10% of one’s money over six months. In this case, we get 36% more utility than no fire alarm. | Long fire alarm: funders spend 20% of one’s money over one year. In this case, we get 56% more utility than no fire alarm. |
Increasing substitutability means that one (weight adjusted23) unit of research can be replaced by closer to one unit of influence to have the same level of preparedness24.
Since, by our choice of inputs, we already have much more importance-adjusted research than influence25, in the case where they are very poor substitutes we must spend at a high rate to get influence.
When research and influence are perfect substitutes, since research is ‘cheaper’26 with our chosen inputs, the optimal spending schedule suggests that nearly all spending should be on research27.
Research and influence are very poor substitutes28 | Research and influence are poor substitutes29 | Standard case30 | Research and influence are perfect substitutes31 |
We make a note of some claims that are supported by the model. Since there is a large space of possible inputs, we recommend users specify their own inputs and not rely solely on our speculation.
Supposing the community indefinitely spends 2% of its capital each year on research and 2% on influence, the optimal spending schedule is around 30% better than this in the medium-timelines, medium-difficulty world.
Note: The default strategy is one where you spend exactly the amount your money appreciates, so your money remains constant. The greatest difference in utility comes from cases where it is optimal to spend a lot of money now; for example, in the (2030 median, hard difficulty) world, the optimal spending schedule is 15% better than the default strategy.
A wager is, e.g., thinking: ‘although I think AGI is more likely than not in the next t years, it is intractable to increase the probability of success in the next t years, so I should work on interventions that increase the probability of success in worlds where AGI arrives at some later time.’ Saving money now, even though AGI is expected sometime soon, is only occasionally recommended by the model. One case occurs with (1) a sufficiently low probability of success but steep gains to this probability after some amount of preparedness that is achievable in the next few decades, (2) a low discount rate, and (3) either (a) influence does not get too much more expensive over time or (b) influence is not too important.
A ‘wager’ on long timelines in a case where we have 2030 AGI timelines. This case has a discount rate , the difficulty is hard32 and the substitutability of research and influence is high33. |
To some extent, there is a ‘sweet spot’ on the S-shaped success curve where we wager on long timelines. If we are able to push the probability of success into a region where the slope of the S-curve is large, we should spend at a high rate until we reach this point. If we are stuck on the flatter far-left tail, such that we remain in that region regardless of any spending we do this century, we should spend at a steadier rate.
In some cases, we should ‘wager’ on shorter timelines by spending at a high rate now
This trivially occurs, for example, if you have a very high discount rate. A more interesting case occurs when influence is poorly substituted by research34 and either (a) influence depreciates quickly or (b) influence quickly becomes expensive.
A ‘wager’ on short timelines in a case where we have the 2050 AGI timeline. This case has ‘medium’ difficulty and low substitutability of research and influence35. |
Since the opportunity to wager on short timelines is only available now, we believe more effort should go into investigating this wager.
We discuss the primary limitations here, and reserve some for the appendix. For each limitation, we briefly discuss how a solution would potentially change the results.
The model does not explicitly account for research produced exogenously (i.e., not as a result of our spending). For example, it is plausible that research produced in academia should be included in our preparedness.
Exogenous research can be (poorly) approximated in the current model in a few different ways.
First, one could suppose that research appreciates over time (i.e., set the research depreciation rate to be negative). This supposes that research being done by outsiders is (directly) proportional to the research ‘we’ already have (where, in this case, research done by outsiders is folded into our research stock). Since we model exponential appreciation, appreciation leads to a research explosion. One could slow this research explosion by supposing the appreciation term grows less than proportionally with the existing stock.
Second, one could suppose that exogenous research sometimes solves the problem for us, making our own research redundant. This can be approximated by increasing the discount rate to account for the ‘risk’ that our own work is not useful. This is unrealistic in the sense that we are ‘surprised’ by some other group solving the problem.
A possible modification to the model would be to add a term to the research growth equation that accounts for the exogenous rate of growth of research. Alternatively, one could consider a radically different model of research that treats our spending on research as simply speeding up progress that will otherwise happen (conditioning on no global catastrophe).
We expect this is the biggest weakness of the model, especially for those with long AGI timelines. To a first approximation, if there is little exogenous research we do not need to account for it, and if there is a lot then our own spending schedule does not matter. Perhaps we might think our actions can lead us to be in either regime and our challenge is to push the world towards the latter.
We may hope that some real-world interventions may delay the arrival of AGI, for example, passing policies to slow AI capabilities work. The model does not explicitly account for this feature of the world at all.
One extension to the model would be to change the length of the fire alarm period to be a function of the amount of influence we have. We expect this extension to imply an increase in the relative spending rate on influence. Another, more difficult extension would be to treat timelines as a function of influence, such that we can ‘push’ timelines down the road with more influence.
We expect that our ability to delay the arrival of AGI, particularly for shorter arrivals, is sufficiently minimal such that it would not significantly change the result. For longer timelines, this seems less likely to be the case.
AI capabilities and our research influence each other in the real world. For example, AI capabilities may speed up research with AI assistants. On the other hand, spending large amounts on AI interventions may draw attention to the problem and speed up AI capabilities investment.
We allow for a depreciation of research, which can be used to model research becoming outdated as capabilities advance. One can also model research becoming cheaper over time36 to account for capabilities speeding up our research.
On balance, we expect this limitation to not have a large effect. If one expects a ‘slow AI takeoff’ with the opportunity to use highly capable AI tools, one can use the fire alarm feature and set the returns to research during this period to be high.
We model the returns to spending as constant across time. However, actual funders seem to be bottlenecked by vetting capacity and a lack of scalable, high-return projects, and so the returns to spending are likely to be high at the moment. Grantmakers can ‘seed’ projects and increase capacity, so it seems plausible that diminishing returns to spending will decrease in the future.
However, the model input only allows for constant diminishing marginal returns.
The model could be easily extended to use a function such that marginal returns to spending on research and influence changed over time, similar to how the real interest rate changes over time. This would require more input from the user. Another extension could allow for the returns to be a function of how much was spent last year. However, such an extension would increase the model's complexity and decrease its usability, simplicity and (potentially) solvability.
This limitation also applies to other features of the model, such as other parameter values that are held constant over time.
Most existing applications of optimal control theory to effective altruism-relevant decision-making have used systems of differential equations that are analytically solvable and have guarantees of optimality. Our model has neither property and so we must rely on optimization methods that do not always lead to a global maximum.
There are around 40 free parameters that the user can set.
Many model features can be turned off. By setting parameters appropriately, the model reduces to the following simpler system.
Some results from this system37:
The current growth rate of our money is continuous. However, this poorly captures the case where most growth is driven by the arrival of new donors with lots of capital. Further, any growth is endogenous - it is always in proportion to our current capital.
One modification would be to model the arrival of future funders as a stochastic process, for example a Poisson process. For example, take
where the added term can model endogenous and non-continuous growth of funding.
Following some preliminary experiments with a deterministic flux of funders, we are skeptical that this would substantially change the recommendations of the current model.
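As a purely illustrative sketch of this modification (not part of the notebook, and with made-up parameter values), one can simulate capital that grows and is spent continuously while new funders arrive as Poisson-distributed lump sums:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters only (not the notebook's values)
r = 0.085           # annual real interest rate
spend_rate = 0.10   # fraction of capital spent per year
arrival_rate = 0.2  # expected new large funders per year (Poisson)
jump_size = 2000.0  # capital added per new funder, $m
M0 = 4000.0         # initial capital, $m
T, dt = 30, 0.01    # horizon in years, time step

t_grid = np.arange(0, T, dt)
M = np.empty_like(t_grid)
M[0] = M0
for i in range(1, len(t_grid)):
    # Continuous growth and spending (Euler step)
    dM = (r - spend_rate) * M[i - 1] * dt
    # Non-continuous, stochastic arrival of new funders
    n_arrivals = rng.poisson(arrival_rate * dt)
    M[i] = M[i - 1] + dM + n_arrivals * jump_size

print(f"Capital after {T} years: ${M[-1]:,.0f}m")
```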
We see two potential problems with this approach.
First, one may care about spending money on things other than making AGI go well. The model does not tell you how to trade off these outcomes. The model best fits into a portfolio approach to doing good, such as Open Philanthropy’s Worldview Diversification. Alternatively, one may attach some value to having money left over post-AGI.
Second, there may be outcomes of intermediate utility between AGI being successful and not. A simple extension could consider some function of the probability of success. A more complex extension could consider the utility of AGI conditioned on its arrival time and our preparedness that accounts for near-miss scenarios.
The funders have a stock of capital. This goes up in proportion to the real interest rate at each time, and down with spending on research and spending on influence.
The funders have a stock of research which goes up with spending on research and can depreciate over time.
Where
Similarly, funders have a stock of influence which obeys
with constants mutatis mutandis from the research-stock case, and where a competition factor describes how the influence gained per unit of money changes over time due to competition effects. That is, over time, as the field of AGI influencers becomes more crowded, each unit of influence can become more expensive.
We allow for the existence of an AGI fire alarm, which tells us that AGI is exactly a known number of years away and during which we can spend a fraction of our money on research and influence.
We write and for the amount of research and influence we have in expectation at AGI take-off if the fire alarm occurred at time . Within the fire alarm period, we suppose that
The first and second assumptions allow for analytical expression for as a function of and .
We write for the constant spending rate on research post-fire alarm. We take where is the fraction of post fire-alarm spending. The system
has an analytical solution and we take
Similarly, for influence we take the corresponding constant post-fire-alarm spending rate and the system
where the competition factor at the start of the fire alarm period is chosen by the user to be either a constant or a function of time. Note that the competition factor is a constant in the differential equation, so the system has an analytical solution analogous to that of the research system above. Again we take the resulting value. Note that the user can specify that no fire alarm occurs: setting the expected fraction of spendable money to zero implies there is no post-fire-alarm spending, so the fire-alarm-adjusted stocks equal the current stocks.
Our preparedness is given by
Preparedness is a constant elasticity of substitution (CES) production function of the fire-alarm-adjusted research and influence stocks, where the user chooses the share and substitutability parameters.
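For reference, a standard CES form consistent with this description is the following, where $P_t$, $R_t$, $I_t$, $\alpha$ and $\rho$ are our placeholder symbols for preparedness, the fire-alarm-adjusted stocks, the share parameter and the substitutability parameter respectively (the notebook’s own notation may differ):

```latex
% Placeholder notation
P_t = \left(\alpha R_t^{\rho} + (1 - \alpha) I_t^{\rho}\right)^{1/\rho}, \qquad \rho \le 1.
% \rho = 1: perfect substitutes.
% \rho \to 0: Cobb-Douglas, P_t = R_t^{\alpha} I_t^{1-\alpha}.
% \rho \to -\infty: perfect complements, P_t = \min(R_t, I_t).
```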
Conditioning on AGI happening at time , we take the probability of AGI being safe as
This is a logistic function with constants and determined by the user’s beliefs about the difficulty of making AGI safe.
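One standard way to write such a logistic curve, again in our placeholder notation, with location and steepness constants $a$ and $b$ chosen from the user’s beliefs about difficulty, is:

```latex
% Placeholder notation; a (location) and b (steepness) reflect beliefs about difficulty
\Pr(\text{AGI safe} \mid \text{AGI at } t) = \frac{1}{1 + e^{-b\,(P_t - a)}}.
```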
Our objective is to maximize the probability that AGI is safe. We have an objective function
Where is the user’s AGI timelines and is some discount rate.
We have initial conditions
We apply standard optimal control theory results.
We have Hamiltonian
Where are the costate variables.
The optimal spending schedule, if it exists, necessarily follows
We solve this boundary value problem using SciPy’s solve_bvp function and apply further optimisation methods to avoid local optima.
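To give a feel for the kind of boundary value problem this involves, here is a minimal sketch using SciPy’s solve_bvp on a deliberately simplified single-stock problem (maximise discounted log utility of spending, with capital growing at a fixed interest rate); the problem, symbols and values are ours and much simpler than the full research-influence system:

```python
import numpy as np
from scipy.integrate import solve_bvp

# Toy optimal-control problem: maximise  integral_0^T e^{-rho t} ln(c) dt
# subject to  dM/dt = r*M - c,  M(0) = M0,  M(T) = 0.
# Pontryagin: H = e^{-rho t} ln(c) + lam*(r*M - c)
#   dH/dc = 0      =>  c = e^{-rho t} / lam
#   dlam/dt = -dH/dM = -r*lam
r, rho, M0, T = 0.085, 0.01, 1.0, 30.0

def odes(t, y):
    M, lam = y
    c = np.exp(-rho * t) / lam        # optimal spending from the first-order condition
    return np.vstack([r * M - c,      # state equation
                      -r * lam])      # costate equation

def bc(ya, yb):
    return np.array([ya[0] - M0,      # start with all our capital
                     yb[0]])          # spend it down by the horizon

t_guess = np.linspace(0, T, 50)
y_guess = np.vstack([M0 * (1 - t_guess / T),   # capital declining linearly
                     T * np.exp(-r * t_guess)]) # rough guess for the costate
sol = solve_bvp(odes, bc, t_guess, y_guess)

spend = np.exp(-rho * sol.x) / sol.y[1]
print(sol.status, sol.message)
print("Initial optimal spending rate (fraction of capital per year):", round(spend[0] / M0, 3))
```

The full model has more state and costate variables, and the notebook applies further optimisation steps on top of a solver like this to escape local optima.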
The model is a Python Notebook accessible on Google Colaboratory here.
Any cells that contain “User guide” are for assisting with the running of the notebook.
Below the initial instructions, you will find the user input parameters.
In the next section of this document we describe the parameters in detail and our own guesses.
We discuss the parameters in the same order as in the notebook.
Note, the estimates given are from Tristan and not necessarily endorsed by Guillaume.
Epistemic status: I’ve spent at least five minutes thinking about each, sometimes no more.
We elicit user timelines using two points on the cumulative distribution function and fit a log-normal distribution to them.
We note Metaculus’ Date of Artificial General Intelligence community prediction: as of 2022-10-06, lower 25% 2030, median 2040 and upper 75% 2072.
Note that since our log-normal distribution is parameterised by two pairs of (year, probability by year), the three distinct Metaculus interquartile pairings will give different distributions.
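As a small illustration of this elicitation (the variable names and the 2022 reference year are ours), one can recover the log-normal’s parameters from two CDF points, here the Metaculus-style 25%-by-2030 and median-2040 pair:

```python
import numpy as np
from scipy.stats import norm, lognorm

# Two points on the CDF of (years from 2022 until AGI): P(X <= x1) = p1, P(X <= x2) = p2
(x1, p1), (x2, p2) = (2030 - 2022, 0.25), (2040 - 2022, 0.50)

# For a log-normal, ln X ~ Normal(mu, sigma), so ln x = mu + sigma * Phi^{-1}(p)
z1, z2 = norm.ppf(p1), norm.ppf(p2)
sigma = (np.log(x2) - np.log(x1)) / (z2 - z1)
mu = np.log(x1) - sigma * z1

timeline = lognorm(s=sigma, scale=np.exp(mu))
print("P(AGI by 2050):", round(timeline.cdf(2050 - 2022), 3))
print("Median arrival year:", round(2022 + timeline.ppf(0.5), 1))
```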
The discount rate needs to factor in both non-AGI existential risks as well as catastrophic (but non-existential) risks that preclude our AI work from being useful or any unknown unknowns that have some per year risk.
We choose implying an AGI success in 2100 is worth as much as a win today. As we discuss in the limitations section, the discount rate can also account for other people making AGI successful, though this interpretation of is not unproblematic.
Of relevance may be:
Our 90% confidence interval for is
As of 2022-10-06, Forbes estimates Dustin Moskovitz and Sam Bankman-Fried have wealth of $8,200m and $17,000m respectively. Todd (2021) estimates $7,100m from other sources in 2021, giving a total of $32,300m within the effective altruism community.
How much of this is committed to AI safety interventions?
Open Philanthropy has spent $157m on AI-related interventions, out of approximately $1,500m spent so far. Supposing that roughly 15% of all effective altruism funding is similarly committed to AGI risk interventions gives at least $4,000m (15% of ~$32,300m is roughly $4,800m).
Our 90% confidence interval is .
We suppose that we are currently at some interest rate
Supposing the movement had $10,000m in 2015 and $32,300m in mid-2022, money in the effective altruism community has grown 21% per year.
We take . Our 90% confidence interval is .
Historical S&P returns are around 8.5%. There are reasons to think the long-term rate may be higher - such as an increase in growth due to AI capabilities - or lower - for example, there is selection bias in choosing a successful index. We take . Our 90% confidence interval is .
Our 90% confidence interval is .
Influence
The constant controls the marginal returns to spending on influence. For we receive diminishing marginal returns.
The top fraction of spending per year on influence leads to fraction of increase in growth of influence in that year. For example, implies the top 20% of spending leads to roughly 80% of returns i.e. the Pareto principle.
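For instance, with a power-law returns function of the form spending^lambda (our arithmetic, not necessarily the post’s chosen value), the exponent matching the Pareto principle, and its implication for a doubling of spending, can be checked directly:

```python
import numpy as np

# Exponent such that spending only the top 20% of the budget yields ~80% of the returns
lam = np.log(0.8) / np.log(0.2)
print(round(lam, 3))         # ~0.139
print(round(0.2 ** lam, 3))  # ~0.8  (returns from the top 20% of spending)
print(round(2 ** lam, 3))    # ~1.10 (growth multiple from doubling spending)
```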
We note that influence spending can span many orders of magnitude and this suggests reason to think there are high diminishing returns (i.e. low ). For example, one community builder may cost on the order of per year, but investing in AI labs with the purpose of influencing their decisions may cost on the order of per year.
We take , which implies doubling spending on influence leads to times more growth of influence. Our 90% confidence interval is .
Research
The constant acts in the same way for research as does for influence.
We take , which implies a doubling of research spending leads to times increase in research growth and that 20% of the spending in research accounts for of the increase in research growth.
Potential sources for estimating include using the distribution of karma on the Alignment Forum, citations in journals or estimates of researchers’ outputs.
Our 90% confidence interval is .
Influence
For the price is constant.
On balance, we think the former reasons outweigh the latter, and so take . This implies that after a doubling of influence, one unit of spending on influence leads to times more growth in influence than it would have without the doubling. Our 90% confidence interval is .
Research
The corresponding constant acts in the same way for research as its counterpart does for influence.
We are uncertain about the net effect of the above contributions, and so take . Our 90% confidence interval is .
Influence
We take , which implies a half-life of around years. Our 90% confidence interval is
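For reference, under exponential depreciation at rate $\delta$ (our notation, and an example value of our choosing), the implied half-life is:

```latex
t_{1/2} = \frac{\ln 2}{\delta}, \qquad \text{e.g. } \delta = 0.05 \;\Rightarrow\; t_{1/2} \approx 13.9 \text{ years}.
```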
Research
We expect research to depreciate over time. Research can depreciate by, for example, becoming outdated as capabilities advance.
One intuition pump is to ask what fraction of research on current large language models will be useful if AGI does not come until 2050? We guess on the order of 1% to 30%, implying - if all our research was on large language models - a value of between and . Note that for , such research can be instrumentally useful for later years due to its ability to make future work cheaper by, for example, attracting new talent.
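Our own back-of-the-envelope version of this intuition pump, assuming exponential depreciation and treating 2050 as roughly 28 years away:

```latex
% If a fraction f of today's research is still useful after n years,
% then e^{-\delta n} = f, so \delta = -\ln(f)/n  (notation ours).
f = 0.01,\ n = 28 \;\Rightarrow\; \delta \approx 0.16; \qquad
f = 0.30,\ n = 28 \;\Rightarrow\; \delta \approx 0.04.
```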
We take . Our 90% confidence interval is .
We allow for influence to become more expensive over time. The primary mechanism we can see is due to (a) competition with other groups that want to influence AI developers, and (b) competition within the field of AI capabilities, such that there are more organisations that could potentially develop AGI.
We suppose the influence per unit spending decreases over time following some S-shape curve, and ask for three points on this curve.
The first data point is the first year in which money was spent on influence. Since one can consider community building or spreading AI risk ideas (particularly among AI developers) as a form of influence, the earliest year of spending is unclear. We take 2015 (the first year Open Philanthropy made grants in this area). The relative cost of influence is set to 1 in this year.
We then require two further years, as well as the influence per unit spending in those years relative to the first year of spending.
Our best guess is (2017, 0.9) - that is, in 2017 one received 90% as much influence per unit spending as one would have done in 2015 - and (2022, 0.6).
The final input is the minimum influence per unit spending that will eventually be reached; we take this to be 0.02. That is, influence will eventually be 50 times more expensive per unit than it was in 2015. Our 90% confidence interval is (0.001, 0.1).
The model uses this data to calculate the quantities of research and influence we have now. Rough estimates are sufficient.
The Singularity Institute (now MIRI) was founded in 2000 and switched to work on AI safety in 2005. We take 2005 as the first year of spending on research.
Open Philanthropy has donated $243.5m to risks from AI since its first grant in the area in August 2015. We very roughly categorised each grant by research : influence fraction, and estimate that $132m has been spent on research and $111m on influence. We suppose that Open Philanthropy has made up two-thirds of the overall spending, giving totals of $198m and $167m.
We guess that the research spending, which started in 2005, has been growing at 25% per year and influence spending has been growing 40% per year since starting in 2015.
By default, in the results we show, we assume no fire alarm in the model. This is achievable by setting the expected fraction of money spendable to 0. When considering the existence of a fire alarm, we take the following values.
For the fire alarm duration we ask: supposing that the leading AI lab has reached an AGI system that they are not deploying out of safety concerns, how far behind is a less safety-conscious lab? We guess this period to be half a year.
Our 90% confidence interval for this period, if it exists, is (one month, two years).
One may think that the expected fraction of money spendable during the fire alarm is less than 1 for reasons such as
We take 0.1 as the expected fraction of money spendable with 90% confidence interval (0.01, 0.5).
During the fire alarm period, we enter a phase with no appreciation or depreciation of research or influence, and separate marginal-returns-to-spending parameters can apply.
Some reasons to think (worse returns during the fire alarm period)
Some reasons to think the (better returns during the fire alarm period)
We expect that the returns to research spending will be very low, and take , implying that the amount of research we can do in the post-fire-alarm period is not very dependent on the money we have.
We expect that returns to influence spending will be lower than in the period before. We take .
In the fire alarm phase, the cost per unit of research and influence can also change depending on the amount we already have.
We expect and . That is, during the fire alarm period it is even cheaper to get influence once you already have some than before this phase, and this effect is greater during the fire alarm period (the first inequality). If there is panic, it seems people will be looking for trustworthy organisations to defer to and to execute a plan.
We expect that, during the fire alarm period, the amount of existing research decreases the cost of new research.
During the fire alarm period, it seems likely that only a few highly skilled researchers - perhaps within the AI lab - will have access to the information and tools necessary to conduct further useful research. The research at this point is likely highly serial: the researchers trying to focus on the biggest problems. Existing research may allow these few researchers to build on existing work effectively.
We take both and to be 0.3, implying that a doubling of research before the fire-alarm period increases the stock output during the fire-alarm period by times.
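With the stated exponent of 0.3, the implied factor (our arithmetic) is:

```latex
2^{0.3} \approx 1.23.
```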
Preparedness at time is a function of the fire-alarm-adjusted research and influence that takes two parameters: a share parameter that controls the relative importance of research and influence, and a substitutability parameter that controls how substitutable research and influence are.
Decreasing this parameter decreases the substitutability of research and influence. In the limit, our preparedness can be entirely bottlenecked by whichever stock we have the least of (weighted by the share parameter).
A value of zero gives the Cobb-Douglas production function, though to avoid a case-by-case treatment in the code, you cannot set the parameter to exactly zero and can instead choose a value close to zero.
Again, we recommend picking values and running the cell to see the graph. We choose . We think that the problem is mainly a technical problem, but in practice it cannot be solved without influencing AI developers.
The probability of success at time
The first input is the probability of success if AGI arrived this year, given our existing stocks of research and influence; this input doesn’t consider any fire-alarm spending. The second input determines the steepness of the S-shaped curve.
We take (10%, 0.15).
A note on the probability of success
Suppose you think we are in one of the following three worlds
Then you should give your inputs as they are in your world B model. We keep the probability-of-success curve between 0 and 1, but one could linearly transform it to be greater than the probability of success in the A world and less than the probability of success in the C world. Since the objective function is linear in the probability of success, such a transformation has no effect on the optimal spending schedule.
In an alternate model, we suppose the funder has a stock of things that grow which includes things such as skills, people, some types of knowledge and trust. They can choose to spend this stock at some rate to produce a stock of things that depreciate that are immediately helpful in increasing the probability of success. This could comprise things such as the implementation of safety ideas on current systems or the product of asking for favours of AI developers or policymakers.
We say that spending capacity to create is crunching and the periods with high are crunch time.
The probability we succeed at time is a function of which is plus any last-minute spending that occurs with a fire alarm. Specifically, it is another S-shaped curve.
The alternate model is defined by the following pieces (equations omitted): the time evolution of things that grow; the time evolution of things that depreciate; the post-fire-alarm total of things that depreciate; the probability of success given AGI at a given time; and the objective function.
Recall that is the expected fraction of money spendable post-fire alarm and is the expected duration of the fire alarm. The equation for is thus simply the result of spending at rate for duration .
The alternate model shares the following parameters and machinery with the research-influence model
The new inputs include
We expect the growth in capacity to decrease over time since some of our capacity is money and the same reasons will apply as in the former model. We suppose , and .
The factor which in the former model controlled how influence becomes more expensive over time here controls how the cost of doing direct work becomes more expensive over time. Only some spending on direct work is in competitive domains (such as influencing AI developers), while some is non-competitive, such as implementing safety features in state-of-the-art systems.
We suppose that has a minimum 0.5 and otherwise has the same factors as in the former research-influence model.
This controls the degree of diminishing returns to ‘crunching’. For reasons similar to those given for and in the main model, we take . Our 90% confidence interval is .
This controls how long our crunch time activities are useful for, i.e. the speed of depreciation. We take , which implies that after one year a given fraction of the direct work is still useful.
To derive the constants used in the S-shaped curve, we ask for the probability of success after some hypothetical scenario in which we've spent some fraction of our capacity for one year.
Our guess is that after spending half of our resources this year, we’d have a 25% chance of alignment success if AGI arrived this year. Note that this input does not account for any post-fire-alarm spending.
Unsurprisingly we see that we should spend our capacity of things-that-grow most around the time we expect AGI to appear. For the 2040 and 2050 timelines, this implies spending very little on things that depreciate, up to around 3% a year. For 2030 timelines, we should be spending between 5 and 10% of our capacity each year on ‘crunching’ for the arrival of AGI. Further, for all results, we begin maximum crunching after the modal AGI arrival date, which is understandable while the rate of growth of the movement is greater than the rate of decrease in probability of AGI (times the discount factor).
This result is relatively sensitive to the probability we think AGI will appear in the next few years. We fit a log-normal distribution to the AGI timeline, which leads to the probability of AGI arriving in the next few years being small. Considering a probability distribution that gave some weight to AGI in the next few years would inevitably imply a higher initial spending rate, though likely a similar overall spending schedule in sufficiently many years’ time.
Difficulty of AGI success | Median AGI arrival: 203038 | Median AGI arrival: 204039 | Median AGI arrival: 205040 |
Easy41 | | | |
Medium42 | | | |
Hard43 | | | |
Many of the limitations we describe apply to both models.
For example, there is no exogenous increase in the stock of things that depreciate, which we may expect if other actors work on AI risk at some time in the future. One could, for example, adjust the cost factor so that spending on direct work yields more units per unit of spending in the future, due to others also working on the problem.
As in the first model, our work and AI capabilities are independent. One could, again, use the cost factor to model direct work becoming cheaper as time goes on and new AI capabilities are developed.
Added 2022-11-29, see discussion here
Here I consider the most robust spending policies, supposing uncertainty over nearly all parameters in the main model44 rather than point estimates. I find that the community’s current spending rate on AI risk interventions is too low.
My distributions over the model parameters imply that
I recommend entering your own distributions for the parameters in the Python notebook here46. Further, these preliminary results use few samples: more reliable results would be obtained with more samples (and more computing time).
I allow for post-fire-alarm spending (i.e., we are certain AGI is soon and so can spend some fraction of our capital). Without this feature, the optimal schedules would likely recommend a greater spending rate.
The results from a simple optimiser47, when allowing for four spending regimes: 2022-2027, 2027-2032, 2032-2037 and 2037 onwards. This result should not be taken too seriously: more samples should be used, the optimiser should be run for more steps, and more intervals should be used. As with other results, this is contingent on the distributions of parameters. |
This short extension started due to a conversation with David Field and a comment from Vasco Grilo; I’m grateful to both for the suggestion.
Tristan and Guillaume defined the problem, designed the model and its numerical resolution, interpreted the results, wrote and reviewed the article. Tristan coded the Python notebook and carried out the numerical computations with feedback from Guillaume. Tristan designed, coded, solved the alternate model and interpreted its results.
We’d both like to thank Lennart Stern and Daniel Kokotajlo for their comments and guidance during the project. We’re grateful to John Mori for comments.
Guillaume thanks the SERI summer fellowship 2021 where this project started with some excellent mentorship from Lennart Stern, the CEEALAR organisation for a stimulating working and living environment during the summer 2021 and the CLR for providing funding to support part-time working with Tristan to make substantial progress on this project.
The post The optimal timing of spending on AGI safety work; why we should probably be spending more now appeared first on Center on Long-Term Risk.
The post When is intent alignment sufficient or necessary to reduce AGI conflict? appeared first on Center on Long-Term Risk.
In the previous post, we outlined possible causes of conflict and directions for intervening on those causes. Many of the causes of conflict seem like they would be addressed by successful AI alignment. For example: if AIs acquire conflict-prone preferences from their training data when we didn’t want them to, that is a clear case of misalignment. One of the suggested solutions, improving adversarial training and interpretability, just is alignment research, albeit directed at a specific type of misaligned behavior. We might naturally ask, does all work to reduce conflict risk follow this pattern? That is, is intent alignment sufficient to avoid unendorsed conflict?
Intent Alignment isn't Sufficient is a claim about unendorsed conflict. We’re focusing on unendorsed conflict because we want to know whether technical interventions on AGIs to reduce the risks of conflict make a difference. These interventions mostly make sense for preventing conflict that isn’t desired by the overseers of the systems. (If the only conflict between AGIs is endorsed by their overseers, then conflict reduction is a problem of ensuring that AGI overseers aren’t motivated to start conflicts.)
Let H be a human principal and A be its AGI agent. “Unendorsed” conflict, in our sense, is conflict which would not have been endorsed on reflection by H at the time A was deployed. This notion of “unendorsed” is a bit complicated. In particular, it doesn’t just mean “not endorsed by a human at the time the agent decided to engage in conflict”. We chose it because we think it should include the following cases:
We’ll use Evan Hubinger’s decomposition of the alignment problem. In Evan’s decomposition, an AI is aligned with humans (i.e., doesn’t take any actions we would consider bad/problematic/dangerous/catastrophic) if it is intent-aligned and capability robust. (An agent is capability robust if it performs well by its own lights once it is deployed.) So the question for us is: What aspects of capability robustness determine whether unendorsed conflict occurs, and will these be present by default if intent alignment succeeds?
Let’s decompose conflict-avoiding “capability robustness” (the capabilities necessary and sufficient for avoiding unendorsed conflict) into two parts:
Two conditions need to hold for unendorsed conflict to occur if the AGIs are intent aligned (summarized in Figure 1): (1) the AIs lack some cooperative capability or have misunderstood their overseer’s cooperation-relevant preferences, and (2) conflict is not prevented by the AGI consulting with its overseer.
These conditions may sometimes hold. In the next section, we list scenarios in which consultation with overseers would fail to prevent conflict. We then look at “conflict-causing capabilities failures”.
One reason to doubt that intent-aligned AIs will engage in unendorsed conflict is that these AIs should be trying to figure out what their overseers want. Whenever possible, and especially before taking any irreversible action like starting a destructive conflict, the AI should check whether its understanding of overseer preferences is accurate. Here are some reasons why we still might see catastrophic decisions, despite this1:
Let’s return to our causes of conflict and see how intent-aligned AGIs might fail to have the capabilities necessary to avoid unendorsed conflict due to these factors.
We break cooperation-relevant preferences into “object-level preferences” (such as how bad a particular conflict would be) and “meta-preferences” (such as how to reflect about how one wants to approach complicated bargaining problems).
One objection to doing work specific to reducing conflict between intent-aligned AIs now is that this work can be deferred to a time when we have highly capable and aligned AI assistants. We’d plausibly be able to do technical research drastically faster then. While this is a separate question from whether Intent Alignment isn't Sufficient, it is an important objection to conflict-specific work, so we briefly address it here.
Some reasons we might benefit from work on conflict reduction now, even in worlds where we get intent-aligned AGIs, include:
Still, the fact that intent-aligned AGI assistants may be able to do much of the research on conflict reduction that we would do now has important implications for prioritization. We should prioritize thinking about how to use intent-aligned assistants to reduce the risks of conflict, and deprioritize questions that are likely to be deferrable.
On the other hand, AI systems might be incorrigibly misaligned before they are in a position to substantially contribute to research on conflict reduction. We might still be able to reduce the chances of particularly bad outcomes involving misaligned AGI, without the help of intent-aligned assistants.
Whether or not Intent Alignment isn't Sufficient to prevent unendorsed conflict, we may not get intent-aligned AGIs in the first place. But it might still be possible to prevent worse-than-extinction outcomes resulting from an intent-misaligned AGI engaging in conflict. On the other hand, it seems difficult to steer a misaligned AGI’s conflict behavior in any particular direction.
Coarse-grained interventions on AIs’ preferences to make them less conflict-prone seem prima facie more likely to be effective given misalignment than trying to make more fine-grained interventions on how they approach bargaining problems (such as biasing AIs towards more cautious reasoning about commitments, as discussed previously). Let’s look at one reason to think that coarse-grained interventions on misaligned AIs’ preferences may succeed and thus that Intent Alignment isn't Necessary.
Assume that at some point during training, the AI begins 'playing the training game'. Some time before it starts playing the training game, it has started pursuing a misaligned goal. What, if anything, can we predict about the conflict-proneness of this goal from the AI’s training data?
A key problem is that there are many objective functions such that trying to optimize is consistent with good early training performance, even if the agent isn’t playing the training game. However, we may not need to predict in much detail to know that a particular training regime will tend to select for more or less conflict-prone . For example, consider a 2-agent training environment and let be agent ’s reward signal. Suppose we have reason to believe that a training process selects for spiteful agents, that is, agents who act as if optimizing for on the training distribution.2 This gives us reason to think that agents will learn to optimize for for some objectives correlated with on the training distribution. Importantly, we don’t need to predict to worry that agent 1 will learn a spiteful objective.3
Concretely, imagine an extension of the SmartVault example from the ELK report, in which multiple SmartVault reporters are trained in a shared environment. And suppose that the human overseers iteratively select the SmartVault system that gets the highest reward out of several in the environment. This creates incentives for the SmartVault systems to reduce each other’s reward. It may lead to them acquiring a terminal preference for harming (some proxy for) their counterpart’s reward. But this reasoning doesn’t rest on a specific prediction about what proxies for human approval the reporters are optimizing for. As long as SmartVault1 is harming some good proxy for SmartVault2’s approval, it will be more likely to be selected. (Again, this is only true because we are assuming that the SmartVaults are not yet playing the training game.)
What this argument shows is that choosing not to reward SmartVault1 or 2 competitively eliminates a training signal towards conflict-proneness, regardless of whether either is truthful. So there are some circumstances under which we might not be able to select for truthful reporters in the SmartVault but could still avoid selecting for agents that are conflict-prone.
Human evolution is another example. It may have been difficult for someone observing human evolution to predict precisely what proxies for inclusive fitness humans would end up caring about. But the game-theoretic structure of human evolution may have allowed them to predict that, whatever proxies for inclusive fitness humans ended up caring about, they would sometimes want to harm or help (proxies for) other humans’ fitness. And other-regarding human preferences (e.g., altruism, inequity aversion, spite) do still seem to play an important role in high-stakes human conflict.
The examples above focus on multi-agent training environments. This is not to suggest that multi-agent training, or training analogous to evolution, is the only regime in which we have any hope of intervening if intent alignment fails. Even in training environments in which a single agent is being trained, it will likely be exposed to “virtual” other agents, and these interactions may still select for dispositions to help or harm other agents. And, just naively rewarding agents for prosocial behavior and punishing them for antisocial behavior early in training may still be low-hanging fruit worth picking, in the hopes that this still exerts some positive influence over agents’ mesa-objective before they start playing the training game.
We’ve argued that Capabilities aren't Sufficient, Intent Alignment isn't Necessary and Intent Alignment isn't Sufficient, and therefore technical work specific to AGI conflict reduction could make a difference. It could still be that alignment research is a better bet for reducing AGI conflict. But we currently believe that there are several research directions that are sufficiently tractable, neglected, and likely to be important for conflict reduction that they are worth dedicating some portion of the existential AI safety portfolio to.
First, work on using intent-aligned AIs to navigate cooperation problems. This would involve conceptual research aimed at preventing intent-aligned AIs from locking in bad commitments or other catastrophic decisions early on, and preventing the corruption of AI-assisted deliberation about bargaining. One goal of this research would be to produce a manual for the overseers of intent-aligned AGIs with instructions on how to train their AI systems to avoid the failures of cooperation discussed in this sequence.
Second, research into how to train AIs in ways that don’t select for CPPs and inflexible commitments. Research into how to detect and select against CPPs or inflexible commitments could be useful (1) if intent alignment is solved, as part of the preparatory work to enable us to better understand what cooperation failures are common for AIs and how to avoid them, or (2) if intent alignment is not solved, it can be directly used to incentivise misaligned AIs to be less conflict-prone. This could involve conceptual work on mechanisms for preventing CPPs that could survive misalignment. It might also involve empirical work, e.g., to understand the scaling of analogs of conflict-proneness in contemporary language models.
There are several tractable directions for empirical work that could support both of these research streams. Improving our ability to measure cooperation-relevant features of foundation models, and carrying out these measurements, is one. Better understanding the kinds of feedback humans give to AI systems in conflict situations, and how to improve that feedback, is another. Finally, getting practice training powerful contemporary AI systems to behave cooperatively also seems valuable, for reasons similar to those given by Ajeya in The case for aligning narrowly superhuman models.
The post When is intent alignment sufficient or necessary to reduce AGI conflict? appeared first on Center on Long-Term Risk.
The post When would AGIs engage in conflict? appeared first on Center on Long-Term Risk.
First we’ll focus on conflict that is costly by the AGIs’ lights. We’ll define “costly conflict” as (ex post) inefficiency: There is an outcome that all of the agents involved in the interaction prefer to the one that obtains.1 This raises the inefficiency puzzle of war: Why would intelligent, rational actors behave in a way that leaves them all worse off than they could be?
We’ll operationalize “rational and intelligent” actors as expected utility maximizers.2 We believe that the following taxonomy of the causes of inefficient outcomes between rational actors is exhaustive, except for a few implausible edge cases. (We give the full taxonomy, and an informal argument that it is exhaustive, in the appendix.) That is, expected value maximization can lead to inefficient outcomes for the agents only if one of the following conditions (or an implausible edge case) holds. This taxonomy builds on Fearon’s (1995) influential “rationalist explanations for war”.3
Private information and incentives not to disclose. Here, “private information” means information about one’s willingness or ability to engage in conflict — e.g., how costly one considers conflict to be, or how strong one’s military is — about which other agents are uncertain. This uncertainty creates a risk-reward tradeoff: For example, Country A might think it’s sufficiently likely that Country B will give up without much of a fight that it’s worthwhile in expectation for A to fight B, even if they’ll end up fighting a war if they are wrong.
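A purely illustrative calculation of this risk-reward tradeoff (all numbers invented): suppose the contested prize is worth 1, a negotiated settlement would give Country A a share of 0.6, and fighting would give A an expected share of 0.5 of the prize minus destruction costs of 0.3. If A believes B will capitulate without fighting with probability q, then attacking beats the settlement in expectation when

```latex
\underbrace{q \cdot 1}_{\text{B backs down}} \;+\; \underbrace{(1 - q)\,(0.5 - 0.3)}_{\text{war}} \;>\; 0.6
\quad\Longleftrightarrow\quad q > 0.5,
```

so a sufficiently confident A rationally attacks even though the war branch leaves both sides worse off than the settlement would have.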
In these cases, removing uncertainty — e.g., both sides learning exactly how willing the other is to fight — opens up peaceful equilibria. This is why conflict due to private information requires “incentives not to disclose”. Whether there are incentives to disclose will depend on a few factors.
First, the technical feasibility of different kinds of verifiable disclosure matters. For example, if I have an explicitly-specified utility function, how hard is it for me to prove to you how much my utility function values conflict relative to peace?
Second, different games will create different incentives for disclosure. Sometimes the mere possibility of verifiable disclosure ends up incentivizing all agents to disclose all of their private information (Grossman 1981; Milgrom 1981). But in other cases, more sophisticated disclosure schemes are needed. For example: Suppose that an agent has some vulnerability such that unconditionally disclosing all of their private information would place them at a decisive disadvantage. The agents could then make copies of themselves, allow these copies to inspect one another, and transmit back to the agents only the private information that’s necessary to reach a bargain. (See the appendix and DiGiovanni and Clifton (2022) for more discussion of conditional information revelation and other conditions for the rational disclosure of conflict-relevant private information.)
For the rest of the sequence, we’ll use “informational problem” as shorthand for this mechanism of conflict.
Inability to credibly commit to peaceful settlement. Agents might fight even though they would like to be able to commit to peace. The Prisoner’s Dilemma is the classic example: Both prisoners would like to be able to write a binding contract to cooperate, but if they can’t, then the game-theoretically rational thing to do is defect.
Similarly, sometimes one agent will be tempted to launch a preemptive attack against another. For example, if Country A thinks Country B will soon become significantly more powerful, Country A might be tempted to attack Country B now. This could be solved with credible commitment: Country B could commit not to becoming significantly more powerful, or to compensate Country A for their weakened bargaining position. But without the ability to make such commitments, Country A may be rationally compelled to fight.
Another example is randomly dividing a prize. Suppose Country A and Country B are fighting over an indivisible holy site. They might want to randomly allocate the holy site to one of them, rather than fighting. The problem is that, once the winner has been decided by the random lottery, the loser has no reason to concede rather than fight, unless they have some commitment in place to honor the outcome of the lottery.
For the rest of the sequence, we’ll use “commitment inability problem”4 as shorthand for this mechanism of conflict.
Miscoordination. When there are no informational or commitment inability problems, and agents’ preferences aren’t entirely opposed (see below), there will be equilibria in which agents avoid conflict. But the existence of such equilibria isn’t enough to guarantee peace, even between rational agents. Agents can still fail to coordinate on a peaceful solution.
A central example of catastrophic conflict due to miscoordination is incompatible commitments. Agents may make commitments to accepting only certain peaceful settlements, and otherwise punishing their counterpart. This can happen when agents have uncertainty about what commitments their counterparts will make. Depending on what you think about the range of outcomes your counterpart has committed to demanding, you might commit to a wider or narrower range of demands. There are situations in which the agents’ uncertainty is such that the optimal thing for each of them to do is commit to a narrow range of demands, which end up being incompatible. See this post on “the commitment races problem” for more discussion.
One reason for optimism about AGI conflict is that AGIs may be much better at credible commitment and disclosure of private information. For example, AGIs could make copies of themselves and let their counterparts inspect these copies until they are satisfied that they understand what kinds of commitments their counterpart has in place. Or, to credibly commit to a treaty, AGIs could do a “value handshake” and build a successor AGI system whose goal is to act according to the treaty. So, what are some reasons why AGIs would still engage in conflict, given these possibilities? Three stand out to us:
Strategic pressures early in AGI takeoff. Consider AGI agents that are opaque to one another, but are capable of self-modifying / designing successor agents who can implement the necessary forms of disclosure. Would such agents ever fail to implement these solutions? If, say, designing more transparent successor agents is difficult and time-consuming, then agents might face a tradeoff between trying to implement more cooperation-conducive architectures and placing themselves at a critical strategic disadvantage. This seems most plausible in the early stages of a multipolar AI takeoff.
Lack of capability early in AGI takeoff. Early in a slow multipolar AGI takeoff, pre-AGI AIs or early AGIs might be capable of starting destructive conflicts but lack the ability to design successor agents, scrutinize the inner workings of opponent AGIs, or reflect on their own cognition in ways that would let them anticipate future conflicts. If AGI capabilities come in this order, such that the ability to launch destructive conflicts comes a while before the ability to design complete successor agents or self-modify, then early AGIs may not be much better than humans at solving informational and commitment problems.
Fundamental computational limits. There may be fundamental limits on the ability of complex AGIs to implement the necessary forms of verifiable disclosure. For example, in interactions between complex AGI civilizations in the far future, these civilizations’ willingness to fight may be determined by factors that are difficult to compress. (That is, the only way in which you can find out how willing to fight they are is to run expensive simulations of what they would do in different hypothetical scenarios.) Or it may be difficult to verify that the other civilization has disclosed their actual private information.
These considerations apply to informational and commitment inability problems. But there is also the problem of incompatible commitments, which is not solved by sufficiently strong credibility or disclosure ability. Regardless of commitment or disclosure ability, agents will sometimes have to make commitments under uncertainty about others’ commitments.
Still, the ability to make conditional commitments could help to ameliorate the risks from incompatible commitments. For example, agents could have a hierarchy of conditional commitments of the form: “If our -order commitments are incompatible, try resolving these via an -order bargaining process.” See also safe Pareto improvements, a particular kind of failsafe for incompatible commitments, which (in the version in the linked paper) relies on strong commitment and disclosure ability.
Another way conflict can be rational is if conflict actually isn’t costly for at least one agent, i.e., there isn’t any outcome that all parties prefer to conflict. That is, Conflict isn't Costly is false. Some ways this could happen:
These cases, in which conflict is literally costless for one agent, are prima facie quite unlikely. They are extremes on a spectrum of what we’ll call conflict-prone preferences (CPPs). By shrinking the range of outcomes agents prefer to conflict, these preferences exacerbate the risks of conflict due to informational, commitment inability, or miscoordination problems. For instance, risk-seeking preferences will lead to a greater willingness to risk losses from conflict (see Shulman (2010) for some discussion of implications for conflict between AIs and humans). And spite will make conflict less subjectively costly, as the material costs that a conflict imposes on a spiteful agent are partially offset by the positively-valued material harms to one’s counterpart.
We argued above that Capabilities aren’t Sufficient. AGIs may sometimes engage in conflict that is costly, even if they are extremely capable. But it remains to be seen whether Intent Alignment isn't Sufficient to prevent unendorsed conflict, or Intent Alignment isn't Necessary to reduce the risk of conflict. Before we look at those claims, it may be helpful to review some approaches to AGI design that might reduce the risks reviewed in the previous section. In the next post, we ask whether these interventions are redundant with work on AI alignment.
Let’s look at interventions directed at each of the causes of conflict in our taxonomy.
Informational and commitment inability problems. One could try to design AGIs that are better able to make credible commitments and better able to disclose their private information. But it’s not clear whether this reduces the net losses from conflict. First, greater commitment ability could increase the risks of conflict from incompatible commitments. Second, even if the risks of informational and commitment inability problems would be eliminated in the limit of perfect commitment and disclosure, marginal increases in these capabilities could worsen conflict due to informational and commitment inability problems. For example, increasing the credibility of commitments could embolden actors to commit to carrying out threats more often, in a way that leads to greater losses from conflict overall.6
Miscoordination. One direction here is building AGIs that reason in more cautious ways about commitment, and take measures to mitigate the downsides from incompatible commitments. For example, we could develop instructions for human overseers as to what kinds of reasoning about commitments they should encourage or discourage in their (intent-aligned) AI. The design of this “overseer’s manual” might be improved by doing more conceptual thinking about sophisticated approaches to commitments.7 Examples of this kind of work include Yudkowsky’s solution for bargaining between agents with different standards of fairness; surrogate goals and Oesterheld and Conitzer’s safe Pareto improvements; and Stastny et al.’s notion of norm-adaptive policies. It may also be helpful to consider what kinds of reasoning about commitments we should try to prevent altogether in the early stages of AGI development.
The goal of such work is not necessarily to fully solve tricky conceptual problems in bargaining. One path to impact is to improve the chances that early human-intent-aligned AI teams are in a “basin of attraction of good bargaining”. The initial conditions of their deliberation about how to bargain should be good enough to avoid locking in catastrophic commitments early on, and to avoid path-dependencies which cause deliberation to be corrupted. We return to this in our discussion of Intent Alignment isn't Sufficient.
Lastly, surrogate goals are a proposal for mitigating the downsides of executed threats, which might occur due to either miscoordination or informational problems. The idea is to design an AI to treat threats to carry out some benign action (e.g., simulating clouds) the same way that it treats threats against the overseer’s terminal values.
Conflict-prone preferences. Consider a few ways in which AI systems might acquire CPPs. First, CPPs might be strategically useful in some training environments. Evolutionary game theorists have studied how CPPs like spite (Hamilton 1970; Possajennikov 2000; Gardner and West 2004; Forber and Smead 2016) and aggression towards out-group members (Choi and Bowles 2007) can be selected for. Analogous selection pressures could appear in AI training.8 For example, selection for the agents that perform the best relative to opponents creates similar pressures to the evolutionary pressures hypothetically responsible for spite: Agents will have reason to sacrifice absolute performance to harm other agents, so that they can increase the chances that their relative score is highest. So, identifying and removing training environments which incentivize CPPs (while not affecting agents’ competitiveness) is one direction for intervention.
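As a minimal illustration of how rewarding relative performance can select for spite, consider the following sketch. The payoff numbers and the replicator-style update are illustrative assumptions, not taken from the papers cited above.

```python
# Minimal sketch (illustrative payoffs): selecting on relative rather than
# absolute performance can favour spite.
#
# Two strategies meet pairwise. "Spiteful" pays 1 to destroy 3 of its partner's
# payoff; "Harmless" does nothing. The baseline payoff of the interaction is 10.

PAYOFF = {
    ("spite", "spite"): 6, ("spite", "harmless"): 9,
    ("harmless", "spite"): 7, ("harmless", "harmless"): 10,
}

def final_spite_share(relative, x0=0.5, steps=2000, lr=0.01):
    """Replicator-style dynamics on the share x of spiteful agents.
    If `relative` is True, fitness is own payoff minus the opponent's payoff,
    mimicking training that rewards beating one's opponent."""
    x = x0
    for _ in range(steps):
        def fit(s, o):
            return PAYOFF[(s, o)] - (PAYOFF[(o, s)] if relative else 0)
        f_s = x * fit("spite", "spite") + (1 - x) * fit("spite", "harmless")
        f_h = x * fit("harmless", "spite") + (1 - x) * fit("harmless", "harmless")
        avg = x * f_s + (1 - x) * f_h
        x = min(max(x + lr * x * (f_s - avg), 0.0), 1.0)
    return x

print("absolute fitness:", round(final_spite_share(relative=False), 3))  # ~0.0: spite dies out
print("relative fitness:", round(final_spite_share(relative=True), 3))   # ~1.0: spite takes over
```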
Second, CPPs might result from poor generalization from human preference data. An AI might fail to correct for biases that cause a human to behave in a more conflict-conducive way than they would actually endorse, for instance. Inferring human preferences is hard. It is especially difficult in multi-agent settings, where a preference-inferrer has to account for a preference-inferree’s models of other agents, as well as biases specific to mixed-motive settings.9
Finally, a generic direction for preventing CPPs is developing adversarial training and interpretability methods tailored to rooting out conflict-prone behavior.
Here we present our complete taxonomy of causes of costly conflict between rational agents, and give an informal argument that it is exhaustive. Remember that by “rational” we mean “maximizes subjective expected utility” and by “costly conflict” we mean “inefficient outcome”.
For the purposes of the informal exposition, it will be helpful to distinguish between equilibrium-compatible and equilibrium-incompatible conflict. Equilibrium-compatible conflicts are those which are naturally modeled as occurring in some (Bayesian) Nash equilibrium. That is, we can model them as resulting from agents (i) knowing each others’ strategies exactly (modulo private information) and (ii) playing a best response. Equilibrium-incompatible conflicts cannot be modeled in this way. Note, however, that the equilibrium-compatible conditions for conflict can hold even when agents are not in equilibrium.
This breakdown is summarized as a fault tree diagram in Figure 3.
Here is an argument that items 1a-1c capture all games in which conflict occurs in every (Bayesian) Nash equilibrium. To start, consider games of complete information. Some complete information games have only inefficient equilibria. The Prisoner’s Dilemma is the canonical example. But we also know that any efficient outcome that is achievable by some convex combination of strategies and is better for each player than what they can unilaterally guarantee themselves can be attained in equilibrium, when agents are capable of conditional commitments to cooperation and correlated randomization (Kalai et al. 2010). This means that, for a game of complete information to have only inefficient equilibria, it has to be the case that either the players are unable to make credible conditional commitments to an efficient profile (1b) or an efficient and individually rational outcome is only attainable with randomization (because the contested object is indivisible), but randomization isn’t possible (1c).
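As a quick illustration of case 1c, consider a toy indivisible prize (all numbers are illustrative assumptions):

```python
# Minimal sketch (illustrative numbers): an indivisible prize for which every
# deterministic peaceful outcome fails, but a joint lottery succeeds (case 1c).

c = 0.2                                    # each side's cost of fighting
war = (0.5 - c, 0.5 - c)                   # expected payoffs from fighting over a prize worth 1

print("fight:    ", war)                   # (0.3, 0.3): inefficient, 2c is destroyed
print("concede:  ", (0.0, 1.0))            # efficient, but 0.0 < 0.3, so the conceder prefers war
print("coin flip:", (0.5, 0.5))            # efficient and individually rational, but requires
                                           # a joint randomization device over who gets the prize
```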
Even if efficiency in complete information is always possible given commitment and randomization ability, players might not have complete information. It is well-known that private information can lead to inefficiency in equilibrium, due to agents making risk-reward tradeoffs under uncertainty about their counterpart’s private information (1a). It is also necessary that agents can’t or won’t disclose their private information — we give a breakdown of reasons for nondisclosure below.
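Case 1a can likewise be illustrated with a standard one-sided screening model. In the sketch below, the uniform type distribution, the cost parameter, and the take-it-or-leave-it protocol are all illustrative assumptions:

```python
# Minimal sketch (a textbook screening setup with made-up numbers): with
# one-sided private information, the uninformed side rationally accepts a
# positive probability of war.
#
# A pie of size 1 is divided. Player 2's probability p of winning a war is
# private, uniform on [0, 1]; war costs each side C. Player 1 makes a
# take-it-or-leave-it offer y, which Player 2 accepts iff y >= p - C.

C = 0.1

def expected_payoff(y, grid=2000):
    """Player 1's expected payoff from offering y, integrating over Player 2's type."""
    total = 0.0
    for k in range(grid):
        p = (k + 0.5) / grid
        total += (1 - y) if y >= p - C else (1 - p) - C   # peace vs. Player 1's war payoff
    return total / grid

offers = [k / 200 for k in range(201)]
y_star = max(offers, key=expected_payoff)
p_war = max(0.0, 1 - (y_star + C))          # types with p > y* + C reject and fight
print(f"optimal offer: {y_star:.3f}, probability of war: {p_war:.2f}")
# Even though every war destroys 2C of value, appeasing the strongest types
# would cost Player 1 more than the occasional war does, so some war risk is
# accepted in equilibrium.
```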
This all means that a game has no efficient equilibria only if one of items 1a-1c holds. But it could still be the case that agents coordinate on an inefficient equilibrium, even if an efficient one is available (1d). E.g., agents might both play Hare in a Stag Hunt. (Coordinating on an equilibrium but failing to coordinate on an efficient one seems unlikely, which is why we don’t discuss it in the main text. But it isn’t ruled out by the assumptions of accurate beliefs and maximizing expected utility with respect to those beliefs alone.)
This exhausts explanations of situations in which conflict happens in equilibrium. But rationality does not imply that agents are in equilibrium. How could imperfect knowledge of other players’ strategies drive conflict? There are two possibilities:
Suppose that agents have private information such that nondisclosure of the information will lead to a situation in which conflict is rational, but conflict would no longer be rational if the information were disclosed.
We can decompose reasons not to disclose into reasons not to unconditionally disclose and reasons not to conditionally disclose. Here, “conditional” disclosure means “disclosure conditional on a commitment to a particular agreement by the other player”. For example, suppose my private information is $x$, where $x$ measures my military strength, such that $x$ is my chance of winning a war, and $x$ is also information about secret military technology that I don’t want you to learn. A conditional commitment would be: I disclose $x$, so that we can decide the outcome of the contest according to a costless lottery which I win with probability $x$, conditional on a commitment from you not to use your knowledge of $x$ to harm me.
Here is the decomposition:
DiGiovanni, Anthony, and Jesse Clifton. 2022. “Commitment Games with Conditional Information Revelation.” arXiv [cs.GT]. arXiv. http://arxiv.org/abs/2204.03484.
Shulman, Carl. 2010. “Omohundro’s ‘Basic AI Drives’ and Catastrophic Risks.” Manuscript. http://www.hdjkn.com/files/BasicAIDrives.pdf.
The post When would AGIs engage in conflict? appeared first on Center on Long-Term Risk.
]]>The post When does technical work to reduce AGI conflict make a difference?: Introduction appeared first on Center on Long-Term Risk.
]]>
This is a pared-down version of a longer draft report. We went with a more concise version to get it out faster, so it ended up being more of an overview of definitions and concepts, and is thin on concrete examples and details. Hopefully subsequent work will help fill those gaps.
Some researchers are focused on reducing the risks of conflict between AGIs. In this sequence, we’ll present several necessary conditions for technical work on AGI conflict reduction to be effective, and survey circumstances under which these conditions hold. We’ll also present some tentative thoughts on promising directions for research and intervention to prevent AGI conflict.
This sequence assumes familiarity with intermediate game theory.
Could powerful AI systems engage in catastrophic conflict? And if so, what are the best ways to reduce this risk? Several recent research agendas related to safe and beneficial AI have been motivated, in part, by reducing the risks of large-scale conflict involving artificial general intelligence (AGI). These include the Center on Long-Term Risk’s research agenda, Open Problems in Cooperative AI, and AI Research Considerations for Human Existential Safety (and this associated assessment of various AI research areas). As proposals for longtermist priorities, these research agendas are premised on a view that AGI conflict could destroy large amounts of value, and that a good way to reduce the risk of AGI conflict is to do work on conflict in particular. In this sequence, our goal is to assess conditions under which work specific to conflict reduction could make a difference, beyond non-conflict-focused work on AI alignment and capabilities.1
Examples of conflict include existentially catastrophic wars between AGI systems in a multipolar takeoff (e.g., 'flash war') or even between different civilizations (e.g., Sandberg 2021). We’ll assume that expected losses from catastrophic conflicts such as these are sufficiently high for this to be worth thinking about at all, and we won’t argue for that claim here.
We’ll restrict attention to technical (as opposed to, e.g., governance) interventions aimed at reducing the risks of catastrophic conflict involving AGI. These include Cooperative AI interventions, where Cooperative AI is concerned with improving the cooperative capabilities of self-interested actors (whether AI agents or AI-assisted humans).2 Candidates for cooperative capabilities include the ability to implement mutual auditing schemes in order to reduce uncertainties that contribute to conflict, and the ability to avoid conflict due to incompatible commitments (see Yudkowsky (2013); Oesterheld and Conitzer (2021); Stastny et al. (2021)). The interventions under consideration also include improving AI systems’ ability to understand humans’ cooperation-relevant preferences. Finally, they include shaping agents’ cooperation-relevant preferences, e.g., preventing AGIs from acquiring conflict-prone preferences like spite. An overview of the kinds of interventions that we have in mind here is given in Table 1.
Class of technical interventions specific to reducing conflict | Examples
Improving cooperative capabilities (Cooperative AI) | Mutual auditing schemes to reduce the uncertainties that contribute to conflict; techniques for avoiding conflict due to incompatible commitments (e.g., safe Pareto improvements, surrogate goals)
Improving understanding of humans’ cooperation-relevant preferences | Methods for inferring humans’ cooperation-relevant preferences, including in multi-agent settings
Shaping cooperation-relevant preferences | Preventing AGIs from acquiring conflict-prone preferences such as spite
There are reasons to doubt the claim that (Technical Work Specific to) Conflict Reduction Makes a Difference.3 Conflict reduction won’t make a difference if the following conditions don’t hold: (a) AGIs won’t always avoid conflict, despite it being materially costly and (b) intent alignment is either insufficient or unnecessary for conflict reduction work to make a difference. In the rest of the sequence, we’ll look at what needs to happen for these conditions to hold.
Throughout the sequence, we will use “conflict” to refer to “conflict that is costly by our lights”, unless otherwise specified. Of course, conflict that is costly by our lights (e.g., wars that destroy resources that would otherwise be used to make things we value) is also likely to be costly by the AGIs’ lights, though this is not a logical necessity. For AGIs to fail to avoid conflict by default, one of these must be true:
Conflict isn’t costly by the AGIs’ lights. That is, there don’t exist outcomes that all of the disputant AGIs would prefer to conflict.
AGIs that are sufficiently capable to engage in conflict that is costly for them wouldn’t also be sufficiently capable to avoid conflict that is costly for them.4
If either Conflict isn't Costly or Capabilities aren't Sufficient, then it may be possible to reduce the chances that AGIs engage in conflict. This could be done by improving their cooperation-relevant capabilities or by making their preferences less prone to conflict. But this is not enough for Conflict Reduction Makes a Difference to be true.
Intent alignment may be both sufficient and necessary to reduce the risks of AGI conflict that isn’t endorsed by human overseers, insofar as it is possible to do so. If that were true, technical work specific to conflict reduction would be redundant. This leads us to the next two conditions that we’ll consider.
Intent alignment — i.e., AI systems trying to do what their overseers want — combined with the capabilities that AI systems are very likely to have conditional on intent alignment, isn’t sufficient for avoiding conflict that is not endorsed (on reflection) by the AIs’ overseers.
Even if intent alignment fails, it is still possible to intervene on an AI system to reduce the risks of conflict. (We may still want to prevent conflict if intent alignment fails and leads to an unrecoverable catastrophe, as this could make worse-than-extinction outcomes less likely.)
By unendorsed conflict, we mean conflict caused by AGIs that results from a sequence of decisions that none of the AIs’ human principals would endorse after an appropriate process of reflection.5 The reason we focus on unendorsed conflict is that we ultimately want to compare (i) conflict-specific interventions on how AI systems are designed and (ii) work on intent alignment.
Neither of these is aimed at solving problems that are purely about human motivations, like human overseers instructing their AI systems to engage in clearly unjustified conflict.
Contrary to what our framings here might suggest, disagreements about the effectiveness of technical work to reduce AI conflict relative to other longtermist interventions are unlikely to be about the logical possibility of conflict reduction work making a difference. Instead, they are likely to involve quantitative disagreements about the likelihood and scale of different conflict scenarios, the degree to which we need AI systems to be aligned to intervene on them, and the effectiveness of specific interventions to reduce conflict (relative to intent alignment, say). We regard mapping out the space of logical possibilities for conflict reduction to make a difference as an important initial step in the longer-term project of assessing the effectiveness of technical work on conflict reduction.6
Thanks to Michael Aird, Jim Buhler, Steve Byrnes, Sam Clarke, Allan Dafoe, Daniel Eth, James Faville, Lukas Finnveden, Lewis Hammond, Julian Stastny, Daniel Kokotajlo, David Manheim, Rani Martin, Adam Shimi, Stefan Torges, and Francis Ward for comments on drafts of this sequence. Thanks to Beth Barnes, Evan Hubinger, Richard Ngo, and Carl Shulman for comments on a related draft.
Stastny, Julian, Maxime Riché, Alexander Lyzhov, Johannes Treutlein, Allan Dafoe, and Jesse Clifton. 2021. “Normative Disagreement as a Challenge for Cooperative AI.” arXiv [cs.MA]. arXiv. http://arxiv.org/abs/2111.13872.
The post When does technical work to reduce AGI conflict make a difference?: Introduction appeared first on Center on Long-Term Risk.
]]>The post Open Position: Community Manager appeared first on Center on Long-Term Risk.
]]>The Center on Long-term Risk is seeking a Community Manager, to work on growing and supporting the community around our mission and research. You will have a leveraged role in furthering our mission to address risks of astronomical suffering from the development and deployment of advanced AI systems.
In this role, you would become the third full member of our Community-building team, reporting to Stefan Torges, the Director of Operations. Depending on your skill set, you will take on responsibilities across diverse areas such as event & project management, 1:1 outreach & advising calls, setting up & improving IT infrastructure, writing, giving talks, and attending in-person networking events – making this role ideal for quickly gaining experience across a range of domains. You will receive mentorship from an experienced team, and become familiar with existing processes in a well-running organization, as you work to improve and supplement them. You will also have the opportunity to engage with cutting-edge research in longtermism and AI safety as well as shaping our strategy.
To apply for this role, please submit this application form. The deadline for applications is the end of Sunday 16th October (precisely: 7:30am British Summer Time on Monday 17th). We expect the form will take 30-60 minutes to complete. It can be done in as little as 10 minutes if necessary by skipping the descriptive questions: this may significantly disadvantage your application, but may make sense if you wouldn’t apply otherwise.
We are recruiting for this role in order to provide additional capacity in our community-building function. Precisely which areas you work on will depend on your strengths and interests, and we’ll decide this together with you once you start work.
As an illustration of the sorts of things you’ll work on, we expect that the successful candidate will take on several of the following tasks:
Examples of further responsibilities that a candidate who is a good fit for them could take on include:
Since we are a small team, all members have the opportunity to shape our strategy.
We think this role could provide suitable challenges for someone with 0-4 years of experience in a similar job: it might, for example, be suited to a recent graduate interested in quickly gaining experience in a professional community-building role, and we also encourage more experienced candidates to apply.
The following abilities and qualities are what we’re looking for in candidates. No specific qualifications or experience are required – experience is one good way of demonstrating these skills, but we’re also open to candidates with no experience of similar roles. We encourage you to apply if you think you may be a good fit, even if you are very unsure of your strengths in some of these areas.
Given that we are a small organization, we also value candidates who are willing to do less glamorous tasks to bring a project over the finish line.
In this role, you can expect to grow our team and the community of people who are committed to reducing risks of astronomical suffering from the development of AI systems. That makes it a highly leveraged opportunity to contribute to that effort.
Due to the small size of our organization, your work will be varied and you will be asked to take ownership of projects quickly. Our community is still at an early stage, so we regularly test new projects, which can help you master a variety of skills and provide you with space to propose your own ideas.
You will join an experienced community-building team who will provide you with mentorship. You will work alongside and interact regularly with our researchers. So you have many opportunities to engage with ideas related to risks of astronomical suffering as well as effective altruism, longtermism, and AI safety.
CLR will also actively support your professional development. While we are looking for a candidate who is interested in working with CLR for a substantial period of time, as part of the effective altruism community we are interested in helping you increase your career’s impact even beyond your performance in the current role. Alongside mentorship from our experienced operations team, you will be joining a well-networked longtermist organization. You will receive a budget of £8,000 per year to spend on whatever you think best furthers your professional development, and be supported to attend EA Global conferences.
Stage 1: To apply for this role, please submit this application form. The deadline for applications is the end of Sunday 16th October (precisely: 7:30am British Summer Time on Monday 17th).
We expect the form will take 30-60 minutes to complete. If necessary, the form can be done in as little as 10 minutes by skipping the descriptive questions: this may significantly disadvantage your application, but may make sense if you wouldn’t apply otherwise.
We aim to communicate the results of stage 1, inviting candidates to the second stage, by the end of Friday 21st October.
Stage 2 will be a remote work test, to be completed on your own computer, which we anticipate will take up to 4 hours of your time. Applicants will have 2 weeks to complete the test, and will be compensated with £120 in return for their work. We plan to communicate the results of stage 2 by the end of Friday 11th November.
Stage 3 will consist of one or more interviews with CLR staff. We plan to hold interviews in the week of 21st November, and aim to communicate the results of stage 3 by the end of Friday 25th November.
Stage 4: The final stage of the recruitment process will be a work trial, held in-person if possible, of between 1-10 working days depending on candidate availability. We will cover travel expenses and compensate candidates £200 per day for the work trial. We will also seek references at this stage.
We expect final recruitment decisions to be made by the end of the year. If you require a faster decision than this, please feel free to contact us at the address below.
The above timelines are our aim and we fully intend to stick to them. However, we don’t firmly commit to them, and a delay of, for example, 1-2 weeks by the end of stage 3 is possible. We will communicate to candidates promptly if we expect there to be any delays.
If you have any questions about the process, please contact us at hiring@longtermrisk.org. If you’d like to send an email that’s not accessible to the hiring committee, please contact tristan.cook@longtermrisk.org.
Diversity and equal opportunity employment: CLR is an equal opportunity employer, and we value diversity at our organization. We don’t want to discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, social background/class, mental or physical health or disability, or any other basis for unreasonable discrimination, whether legally protected or not. If you're considering applying to this role and would like to discuss any personal needs that might require adjustments to our application process or workplace, please feel very free to contact us.
The post Open Position: Community Manager appeared first on Center on Long-Term Risk.
]]>The post Safe Pareto Improvements for Delegated Game Playing appeared first on Center on Long-Term Risk.
]]>
A set of players delegate playing a game to a set of representatives, one for each player. We imagine that each player trusts their respective representative’s strategic abilities. Thus, we might imagine that per default, the original players would simply instruct the representatives to play the original game as best as they can. In this paper, we ask: are there safe Pareto improvements on this default way of giving instructions? That is, we imagine that the original players can coordinate to tell their representatives to only consider some subset of the available strategies and to assign utilities to outcomes differently than the original players. Then can the original players do this in such a way that the payoff is guaranteed to be weakly higher than under the default instructions for all the original players? In particular, can they Pareto-improve without probabilistic assumptions about how the representatives play games? In this paper, we give some examples of safe Pareto improvements. We prove that the notion of safe Pareto improvements is closely related to a notion of outcome correspondence between games. We also show that under some specific assumptions about how the representatives play games, finding safe Pareto improvements is NP-complete.
Keywords: program equilibrium; delegation; bargaining; Pareto efficiency; smart contracts.
Between Aliceland and Bobbesia lies a sparsely populated desert. Until recently, neither of the two countries had any interest in the desert. However, geologists have recently discovered that it contains large oil reserves. Now, both Aliceland and Bobbesia would like to annex the desert, but they worry about a military conflict that would ensue if both countries insist on annexing.
Table 1 models this strategic situation as a normal-form game. The strategy DM (short for “Demand with Military”) denotes a military invasion of the desert, demanding annexation. If both countries send their military with such an aggressive mission, the countries fight a devastating war. The strategy RM (for “Refrain with Military”) denotes yielding the territory to the other country, but building defenses to prevent an invasion of one’s current territories. Alternatively, the countries can choose to not raise a military force at all, while potentially still demanding control of the desert by sending only their leader (DL, short for “Demand with Leader”). In this case, if both countries demand the desert, war does not ensue. Finally, they could neither demand nor build up a military (RL). If one of the two countries has their military ready and the other does not, the militarized country will know and will be able to invade the other country. In game-theoretic terms, militarizing therefore strictly dominates not militarizing.
Instead of making the decision directly, the parliaments of Aliceland and Bobbesia appoint special commissions for making this strategic decision, led by Alice and Bob, respectively. The parliaments can instruct these representatives in various ways. They can explicitly tell them what to do – for example, Aliceland could directly tell Alice to play DM. However, we imagine that the parliaments trust the commissions’ judgments more than they trust their own and hence they might prefer to give an instruction of the type, “make whatever demands you think are best for our country” (perhaps contractually guaranteeing a reward in proportion to the utility of the final outcome). They might not know what that will entail, i.e., how the commissions decide what demands to make given that instruction. However – based on their trust in their representatives – they might still believe that this leads to better outcomes than giving an explicit instruction.
We will also imagine these instructions are (or at least can be) given publicly and that the commissions are bound (as if by a contract) to follow these instructions. In particular, we imagine that the two commissions can see each other’s instructions. Thus, in instructing their commissions, the countries play a game with bilateral precommitment. When instructed to play a game as best as they can, we imagine that the commissions play that game in the usual way, i.e., without further abilities to credibly commit or to instruct subcommittees and so forth.
It may seem that without having their parliaments ponder equilibrium selection, Aliceland and Bobbesia cannot do better than leave the game to their representatives. Unfortunately, in this default equilibrium, war is still a possibility. Even the brilliant strategists Alice and Bob may not always be able to resolve the difficult equilibrium selection problem to the same pure Nash equilibrium.
In the literature on commitment devices and in particular the literature on program equilibrium, important ideas have been proposed for avoiding such bad outcomes. Imagine for a moment that Alice and Bob will play a Prisoner’s Dilemma (Table 3) (rather than the Demand Game of Table 1). Then the default of (Defect, Defect) can be Pareto-improved upon. Both original players (Aliceland and Bobbesia) can use the following instruction for their representatives: “If the opponent’s instruction is equal to this instruction, Cooperate; otherwise Defect.” [33, 22, 46, Sect. 10.4, 55] Then it is a Nash equilibrium for both players to use this instruction. In this equilibrium, (Cooperate, Cooperate) is played and it is thus Pareto-optimal and Pareto-better than the default.
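As a concrete illustration of this construction, here is a toy program game in which each program receives the other’s source code. The code is only a sketch of the idea, not an implementation from any of the cited papers:

```python
# Minimal sketch: the "cooperate iff your program equals mine" construction,
# with programs represented by their source strings.

import inspect

def clique_bot(my_source: str, opponent_source: str) -> str:
    """Cooperate if and only if the opponent submitted this exact program."""
    return "Cooperate" if opponent_source == my_source else "Defect"

def defect_bot(my_source: str, opponent_source: str) -> str:
    return "Defect"

PD = {("Cooperate", "Cooperate"): (3, 3), ("Cooperate", "Defect"): (0, 4),
      ("Defect", "Cooperate"): (4, 0), ("Defect", "Defect"): (1, 1)}

def play(prog1, prog2):
    s1, s2 = inspect.getsource(prog1), inspect.getsource(prog2)
    return PD[(prog1(s1, s2), prog2(s2, s1))]

print(play(clique_bot, clique_bot))   # (3, 3): mutual cooperation
print(play(clique_bot, defect_bot))   # (1, 1): deviating to defect_bot gains nothing
```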
In cases like the Demand Game, it is more difficult to apply this approach to improve upon the default of simply delegating the choice. Of course, if one could calculate the expected utility of submitting the default instructions, then one could similarly commit the representatives to follow some (joint) mix over the Pareto-optimal outcomes ((RM, DM), (DM, RM), (RM, RM), (DL, DL), etc.) that Pareto-improves on the default expected utilities.1 However, we will assume that the original players are unable or unwilling to form probabilistic expectations about how the representatives play the Demand Game, i.e., about what would happen with the default instructions. If this is the case, then this type of Pareto improvement on the default is unappealing.
The goal of this paper is to show and analyze how even without forming probabilistic beliefs about the representatives, the original players can Pareto-improve on the default equilibrium. We will call such improvements safe Pareto improvements (SPIs). We here briefly give an example in the Demand Game.
The key idea is for the original players to instruct the representatives to select only from {DL,RL}, i.e., to not raise a military. Further, they tell them to disvalue the conflict outcome without military (DL, DL) as they would disvalue the original conflict outcome of war in the default equilibrium. Overall, this means telling them to play the game of Table 2. (Again, we could imagine that the instructions specify Table 2 to be how Aliceland and Bobbesia financially reward Alice and Bob.) Importantly, Aliceland’s instruction to play that game must be conditional on Bobbesia also instructing their commission to play that game, and vice versa. Otherwise, one of the countries could profit from deviating by instructing their representative to always play DM or RM (or to play by the original utility function).
The game of Table 2 is isomorphic to the DM-RM part of the original Demand Game of Table 1. Of course, the original players know neither how the original Demand Game nor the game of Table 2 will be played by the representatives. However, since these games are isomorphic, one should arguably expect them to be played isomorphically. For example, one should expect that (RM,DM) would be played in the original game if and only if (RL, DL) would be played in the modified game. However, the conflict outcome (DM,DM) is replaced in the new game with the outcome (DL, DL). This outcome is harmless (Pareto-optimal) for the original players.
Contributions. Our paper generalizes this idea to arbitrary normal-form games and is organized as follows. In Section 2, we introduce some notation for games and multivalued functions that we will use throughout this paper. In Section 3, we introduce the setting of delegated game playing for this paper. We then formally define and further motivate the concept of safe Pareto improvements. We also define and give an example of unilateral SPIs. These are SPIs that require only one of the players to commit their representative to a new action set and utility function. In Section 3.2, we briefly review the concepts of program games and program equilibrium and show that SPIs can be implemented as program equilibria. In Section 4.2, we introduce a notion of outcome correspondence between games. This relation expresses the original players’ beliefs about similarities between how the representatives play different games. In our example, the Demand Game of Table 1 (arguably) corresponds to the game of Table 2 in that the representatives (arguably) would play (DM,DM) in the original game if and only if they play (DL, DL) in the new game, and so forth. We also show some basic results (reflexivity, transitivity, etc.) about the outcome correspondence relation on games. In Section 4.3 we show that the notion of outcome correspondence is central to deriving SPIs. In particular, we show that one game is an SPI on another game if and only if there is a Pareto-improving outcome correspondence relation between the two games.
To derive SPIs, we need to make some assumptions about outcome correspondence, i.e., about which games are played in similar ways by representatives. We give two very weak assumptions of this type in Section 4.4. The first is that the representatives’ play is invariant under the removal of strictly dominated strategies. For example, we assume that in the Demand Game the representatives only play DM and RM. Moreover we assume that we could remove DL and RL from the game and the representatives would still play the same strategies as in the original Demand Game with certainty. The second assumption is that the representatives play isomorphic games isomorphically. For example, once DL and RL are removed for both players from the Demand Game, the Demand Game is isomorphic to the game in Table 2 such that we might expect them to be played isomorphically. In Section 4.5, we derive a few SPIs – including our SPI for the Demand Game – using these assumptions. Section 4.6 shows that determining whether there exists an SPI based on these assumptions is NP-complete. Section 5 considers a different setting in which we allow the original players to let the representatives choose from newly constructed strategies whose corresponding outcomes map arbitrarily onto feasible payoff vectors from the original game. In this new setting, finding SPIs can be done in polynomial time. We conclude by discussing the problem of selecting between different SPIs on a given game (Section 6) and giving some ideas for directions for future work (Section 7).
We here give some basic game-theoretic definitions. We assume the reader to be familiar with most of these concepts and with game theory more generally.
An $n$-player (normal-form) game $\Gamma = (A, \mathbf{u})$ is a tuple of a set $A = A_1 \times \dots \times A_n$ of (pure) strategy profiles (or outcomes) and a function $\mathbf{u}\colon A \to \mathbb{R}^n$ that assigns to each outcome a utility for each player. The Prisoner's Dilemma shown in Table 3 is a classic example of a game. The Demand Game of Table 1 is another example of a game that we will use throughout this paper.
Instead of $u_i((a_1, \dots, a_n))$ we will also write $u_i(a_1, \dots, a_n)$. We also write $A_{-i}$ for $\prod_{j \neq i} A_j$, i.e., for the Cartesian product of the action sets of all players other than $i$. We similarly write $\mathbf{u}_{-i}$ and $a_{-i}$ for vectors containing utility functions and actions, respectively, for all players but $i$. If $u_i$ is a utility function and $\mathbf{u}_{-i}$ is a vector of utility functions for all players other than $i$, then (even if $i \neq 1$) we use $(u_i, \mathbf{u}_{-i})$ for the full vector of utility functions where Player $i$ has utility function $u_i$ and the other players have utility functions as specified by $\mathbf{u}_{-i}$. We use $(a_i, a_{-i})$ and $(A_i, A_{-i})$ analogously.
We say that $a_i \in A_i$ strictly dominates $\hat a_i \in A_i$ if for all $a_{-i} \in A_{-i}$, $u_i(a_i, a_{-i}) > u_i(\hat a_i, a_{-i})$. For example, in the Prisoner's Dilemma, Defect strictly dominates Cooperate for both players. As noted earlier, in the Demand Game DM and RM strictly dominate DL and RL for both players.
For any given game $\Gamma = (A, \mathbf{u})$, we will call any game $\Gamma' = (A', \mathbf{u}')$ a subset game of $\Gamma$ if $A_i' \subseteq A_i$ for $i = 1, \dots, n$. Note that a subset game may assign different utilities to outcomes than the original game. For example, the game of Table 2 is a subset game of the Demand Game.
We say that some utility vector $\mathbf{y} \in \mathbb{R}^n$ is a Pareto improvement on (or is Pareto-better than) $\mathbf{x} \in \mathbb{R}^n$ if $y_i \geq x_i$ for $i = 1, \dots, n$. We will also denote this by $\mathbf{y} \geq \mathbf{x}$. Note that, contrary to convention, we allow $\mathbf{y} = \mathbf{x}$. Whenever we require one of the inequalities to be strict, we will say that $\mathbf{y}$ is a strict Pareto improvement on $\mathbf{x}$. In a given game, we will also say that an outcome $a$ is a Pareto improvement on another outcome $\hat a$ if $\mathbf{u}(a) \geq \mathbf{u}(\hat a)$. We say that $a$ is Pareto-optimal or Pareto-efficient relative to some $\hat A \subseteq A$ if there is no element of $\hat A$ that strictly Pareto-dominates $a$.
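For concreteness, here is a small sketch of these definitions in code; the helper functions and example payoffs are illustrative, not from the paper:

```python
# Minimal sketch of the Pareto definitions above.

def pareto_improves(y, x):
    """y is a (weak) Pareto improvement on x: no player is worse off."""
    return all(yi >= xi for yi, xi in zip(y, x))

def strictly_pareto_improves(y, x):
    return pareto_improves(y, x) and any(yi > xi for yi, xi in zip(y, x))

def pareto_optimal(a, outcomes, u):
    """a is Pareto-optimal relative to `outcomes` if nothing strictly Pareto-dominates it."""
    return not any(strictly_pareto_improves(u(b), u(a)) for b in outcomes)

u = {"war": (0, 0), "split": (2, 2), "concede": (3, 1)}.get
print(pareto_improves(u("split"), u("war")))                      # True
print(pareto_optimal("war", ["war", "split", "concede"], u))      # False
print(pareto_optimal("split", ["war", "split", "concede"], u))    # True
```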
Let $\Gamma = (A, \mathbf{u})$ and $\Gamma' = (A', \mathbf{u}')$ be two $n$-player games. Then we call an $n$-tuple $\Phi = (\Phi_1, \dots, \Phi_n)$ of bijections $\Phi_i\colon A_i \to A_i'$ a (game) isomorphism between $\Gamma$ and $\Gamma'$ if there are vectors $\lambda \in \mathbb{R}^n_{>0}$ and $c \in \mathbb{R}^n$ such that
$u_i'(\Phi_1(a_1), \dots, \Phi_n(a_n)) = \lambda_i u_i(a_1, \dots, a_n) + c_i$
for all $i$ and all $a \in A$. If there is an isomorphism between $\Gamma$ and $\Gamma'$, we call $\Gamma$ and $\Gamma'$ isomorphic. For example, if we let $\hat\Gamma$ be the game that results from removing DL and RL for both players from the Demand Game and let $\Gamma'$ be the subset game of Table 2, then $\hat\Gamma$ is isomorphic to $\Gamma'$ via the isomorphism $\Phi$ with $\Phi_i(\mathrm{DM}) = \mathrm{DL}$ and $\Phi_i(\mathrm{RM}) = \mathrm{RL}$ for both players $i$ and suitable constants $\lambda$ and $c$.
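The definition can be checked mechanically. The following sketch (with made-up payoffs) tests whether a given relabelling of actions, together with per-player constants $\lambda$ and $c$, is a game isomorphism:

```python
# Minimal sketch (illustrative payoffs): checking whether a relabelling of
# actions matches payoffs up to a positive affine transformation per player.

import itertools

def is_isomorphism(u, u_prime, phi, actions, lambdas, cs):
    """u, u_prime: dicts mapping outcome tuples to payoff tuples.
    phi: per-player dicts relabelling actions; lambdas (> 0) and cs are the
    per-player affine constants from the definition above."""
    for a in itertools.product(*actions):
        image = tuple(phi[i][a_i] for i, a_i in enumerate(a))
        for i in range(len(a)):
            if abs(u_prime[image][i] - (lambdas[i] * u[a][i] + cs[i])) > 1e-9:
                return False
    return True

# Two trivially isomorphic 2x2 games: the second rescales player 0's payoffs.
u = {("X", "L"): (2, 1), ("X", "R"): (0, 0), ("Y", "L"): (1, 0), ("Y", "R"): (3, 2)}
u2 = {(x, y): (2 * v[0] + 1, v[1]) for (x, y), v in u.items()}
phi = [{"X": "X", "Y": "Y"}, {"L": "L", "R": "R"}]
print(is_isomorphism(u, u2, phi, [["X", "Y"], ["L", "R"]], [2, 1], [1, 0]))  # True
```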
We consider a setting in which a given game is played through what we will call representatives. For example, the representatives could be humans whose behavior is determined or incentivized by some contract à la the principal–agent literature [28]. Our principals’ motivation for delegation is the same as in that literature (namely, the agent being in a better (epistemic) position to make the choice). However, the main question asked by the principal–agent literature is how to deal with agents that have their own preferences over outcomes, by constraining the agent’s choice [e.g. 21, 25], setting up appropriate payment schemes [e.g. 23, 29, 37, 53], etc. In contrast, we will throughout this paper assume that the agent has no conflicting incentives.
We imagine that one way in which the representatives can be instructed is to in turn play a subset game $\Gamma' = (A', \mathbf{u}')$ of the original game $\Gamma$, without necessarily specifying a strategy or algorithm for solving such a game. We emphasize, again, that $\mathbf{u}'$ is allowed to be a vector of entirely different utility functions. For any subset game $\Gamma'$, we denote by $\Pi(\Gamma')$ the outcome that arises if the representatives play the subset game $\Gamma'$ of $\Gamma$. Because it is unclear what the right choice is in many games, the original players might be uncertain about $\Pi(\Gamma')$. We will therefore model each $\Pi(\Gamma')$ as a random variable. We will typically imagine that the representatives play $\Gamma'$ in the usual simultaneous way, i.e., that they are not able to make further commitments or delegate again. For example, we imagine that if $\Gamma$ is the Prisoner's Dilemma, then $\Pi(\Gamma) = (\mathrm{Defect}, \mathrm{Defect})$ with certainty.
The original players trust their representatives to the extent that we take $\Pi(\Gamma)$ to be the default way for the game $\Gamma$ to be played, for any $\Gamma$. That is, by default the original players tell their representatives to play the game as given. For example, in the Demand Game, it is not clear what the right action is. Thus, if one can simply delegate the decision to someone with more relevant expertise, that is the first option one would consider.
We are interested in whether and how the original players can jointly Pareto-improve on the default. Of course, one option is to first compute the expected utilities under default delegation, i.e., to compute $\mathbb{E}[\mathbf{u}(\Pi(\Gamma))]$. The players could then let the representatives play a distribution over outcomes whose expected utilities exceed the default expected utilities. However, this is unrealistic if $\Gamma$ is a complex game with potentially many Nash equilibria. For one, the precise point of delegation is that the original players are unable or unwilling to properly evaluate $\Gamma$ themselves. Second, there is no widely agreed upon, universal procedure for selecting an action in the face of equilibrium selection problems. In such cases, the original players may in practice be unable to form a probability distribution over $\Pi(\Gamma)$. This type of uncertainty is sometimes referred to as Knightian uncertainty, following Knight's [26] distinction between the concepts of risk and uncertainty.
We address this problem in a typical way. Essentially, we require of any attempted improvement over the default that it incurs no regret in the worst case. That is, we are interested in subset games $\Gamma'$ that are Pareto improvements with certainty under weak and purely qualitative assumptions about $\Pi$.2 In particular, in Section 4.4, we will introduce the assumptions that the representatives do not play strictly dominated actions and play isomorphic games isomorphically.
Definition 1. Let $\Gamma'$ be a subset game of $\Gamma$. We say $\Gamma'$ is a safe Pareto improvement (SPI) on $\Gamma$ if $\mathbf{u}(\Pi(\Gamma')) \geq \mathbf{u}(\Pi(\Gamma))$ with certainty. We say that $\Gamma'$ is a strict SPI if furthermore there is a player $i$ s.t. $u_i(\Pi(\Gamma')) > u_i(\Pi(\Gamma))$ with positive probability.
For example, in the introduction we have argued that the subset game in Table 2 is a strict SPI on the Demand Game (Table 1). Less interestingly, if we let $\Gamma$ be the Prisoner's Dilemma (Table 3), then we would expect the subset game $\Gamma'$ in which both players can only Cooperate to be an SPI on $\Gamma$. After all, we might expect that $\Pi(\Gamma) = (\mathrm{Defect}, \mathrm{Defect})$ with certainty, while it must be $\Pi(\Gamma') = (\mathrm{Cooperate}, \mathrm{Cooperate})$ with certainty, for lack of alternatives. Both players prefer mutual cooperation over mutual defection.
Both SPIs given above require both players to let their representatives choose from restricted strategy sets to maximize something other than the original player's utility function.
Definition 2. We will call a subset game $\Gamma'$ of $\Gamma$ unilateral if there is a player $i$ such that $A_j' = A_j$ and $u_j' = u_j$ for all players $j \neq i$. Consequently, if a unilateral subset game $\Gamma'$ of $\Gamma$ is also an SPI on $\Gamma$, we call $\Gamma'$ a unilateral SPI.
We now give an example of a unilateral SPI using the Complicated Temptation Game. (We give the not-so-complicated Temptation Game – in which we can only give a trivial example of SPIs – in Section 4.5.) Two players each deploy a robot. Each of the robots faces two choices in parallel. First, each can choose whether to work on Project 1 or Project 2. Player 1 values Project 1 higher and Player 2 values Project 2 higher, but the robots are more effective if they work on the same project. To complete the task, the two robots need to share a resource. Robot 2 manages the resource and can choose whether to control Robot 1’s access tightly (e.g., by frequently checking on the resource, or requiring Robot 1 to demonstrate a need for the resource) or give Robot 1 relatively free access. Controlling access tightly decreases the efficiency of both robots, though the exact costs depend on which projects the robots are working on. Robot 1 can choose between using the resource as intended by Robot 2, or giving in to the temptation of trying to steal as much of the resource as possible to use it for other purposes. Regardless of what Robot 2 does (in particular, regardless of whether Robot 2 controls access or not), Player 1 prefers trying to steal. In fact, if Robot 2 controls access and Robot 1 refrains from theft, they never get anything done. Given that Robot 1 tries to steal, Player 2 prefers his Robot 2 to control access. As usual we assume that the original players can instruct their robots to play arbitrary subset games of the Complicated Temptation Game (without specifying an algorithm for solving such a game) and that they can give such instructions conditional on the other player providing an analogous instruction.
We formalize this game as a normal-form game in Table 4. Each action consists of a number and a letter. The number indicates the project that the agent pursues. The letter indicates the agent’s policy towards the resource. In Player 2’s action labels, C indicates tight control over the resource, while F indicates free access. In Player 1’s action labels, T indicates giving in to the temptation to steal as much of the resource as possible, while R indicates refraining from doing so.
Player 1 has a unilateral SPI in the Complicated Temptation Game. Intuitively, if Player 1 commits to refrain, then Player 2 need not control the use of the resource. Thus, inefficiencies from conflict over the resource are avoided. However, Player 1’s utilities in the resulting game of choosing between projects 1 and 2 are not isomorphic to the original game of choosing between projects 1 and 2. The players might therefore worry that this new game will result in a worse outcome for them. For example, Player 2 might worry that in this new game the project 1 equilibrium becomes more likely than the project 2 equilibrium. To address this, Player 1 has to commit her representative to a different utility function that makes this new game isomorphic to the original game.
We now describe the unilateral SPI in formal detail. Player 1 can commit her representative to play only from 1R and 2R and to assign her representative new utilities over the remaining outcomes; otherwise the subset game does not differ from the original game. The resulting SPI is given in Table 5. In this subset game, Player 2’s representative – knowing that Player 1’s representative will only play from 1R and 2R – will choose from 1F and 2F (since 1F and 2F strictly dominate 1C and 2C in Table 5). Now notice that the remaining subset game on $\{1R, 2R\} \times \{1F, 2F\}$ is isomorphic to the subset game on $\{1T, 2T\} \times \{1C, 2C\}$ of the original Complicated Temptation Game, where 1R maps to 1T and 2R maps to 2T for Player 1, and 1F maps to 1C and 2F maps to 2C for Player 2. Player 1’s representative’s utilities have been set to be the same between the two; and Player 2’s utilities happen to be the same up to a constant between the two subset games. Thus, we might expect that if the original game resolves to, say, (1T, 1C), then the new game resolves to (1R, 1F), and so on. Finally, notice that both original players weakly prefer (1R, 1F) to (1T, 1C), and so on. Hence, Table 5 is indeed an SPI on the Complicated Temptation Game.
Such unilateral changes are particularly interesting because they only require one of the players to be able to credibly delegate. That is, it is enough for a single player $i$ to instruct their representative to choose from a restricted action set to maximize a new utility function. The other players can simply instruct their representatives to play the game in the normal way (i.e., maximizing the respective players’ original utility functions without restrictions on the action set). In fact, we may also imagine that only one player delegates at all, while the other players choose an action themselves, after observing Player $i$’s instruction to her representative.
One may object that in a situation where only one player can credibly commit and the others cannot, the player who commits can simply play the meta game as a standard unilateral commitment (Stackelberg) game [as studied by, e.g., 11, 52, 59] or perhaps as a first mover in a sequential game (as solved by subgame-perfect equilibrium), without bothering with any (safe) Pareto conditions, i.e., without ensuring that all players are guaranteed a utility at least as high as their default utilities $u_i(\Pi(\Gamma))$. For example, in the Complicated Temptation Game, Player 1 could simply commit her representative to play 1R if she assumes that Player 2’s representative will be instructed to best respond.
The Stackelberg sequential play perspective is appropriate in many cases. However, we think that in many cases the player with fine-grained commitment ability cannot assume that the other players' representatives will simply best respond. Instead, players often need to consider the possibility of a hostile response if their commitment forces an unfair payoff on the other players. In such cases, unilateral SPIs are relevant.
The Ultimatum game is a canonical example in which standard solution concepts of sequential play fail to predict human behavior. In this game, subgame-perfect equilibrium has the second-moving player walk away with arbitrarily close to nothing. However, experiments show that people often resolve the game to an equal split, which is the symmetric equilibrium of the simultaneous version of the game [38].
A policy of retaliating for unfair payoffs imposed by a first mover’s commitments can arise in a variety of ways within standard game-theoretic models. For one, we may imagine a scenario in which only one player has the fine-grained commitment and delegation abilities needed for SPIs but the other players can still credibly commit their representatives to retaliate against any “commitment trickery” that clearly leaves them worse off. We may also imagine that other players or representatives come into the scenario having already made such commitments. For example, many people appear credibly committed by intuitions about fairness and retributivist instincts and emotions [see, e.g., 44, Chapter 6, especially the section “The Doomsday Machine”]. Perhaps these features of human psychology allow human second players in the Ultimatum game to empirically outperform the subgame-perfect equilibrium. Second, we may imagine that the players who cannot commit are subject to reputation effects. Then they might want to build a reputation of resisting coercion. In contrast, it is beneficial to have a reputation of accepting SPIs on whatever game would have otherwise been played.
So far, we have been vague about the details of the strategic situation that the original players face in instructing their representatives. From what sets of actions can they choose? How can they jointly let the representatives play some new subset game $\Gamma'$? Are SPIs Nash equilibria of the meta game played by the original players? If I instruct my representative to play the SPI of Table 2 in the Demand Game, could my opponent not instruct her representative to play DM?
In this section, we briefly describe one way to fill this gap by discussing the concept of program games and program equilibrium [46, Sect. 10.4, 55, 15, 5, 13, 36]. This section is essential to understanding why SPIs (especially omnilateral ones) are relevant. However, the remaining technical content of this paper does not rely on this section and the main ideas presented here are straightforward from previous work. We therefore only give an informal exposition. For formal detail, see Appendix A.
For any game $\Gamma$, the program equilibrium literature considers the following meta game. First, each player writes a computer program. Each program then receives as input a vector containing everyone else’s chosen program. Each player $i$’s program then returns an action from $A_i$, player $i$’s set of actions in $\Gamma$. Together these actions then form an outcome of the original game. Finally, the utilities are realized according to the utility function of $\Gamma$. The meta game can be analyzed like any other game. Its Nash equilibria are called program equilibria. Importantly, the program equilibria can implement payoffs not implemented by any Nash equilibria of $\Gamma$ itself. For example, in the Prisoner’s Dilemma, both players can submit a program that says: “If the opponent’s chosen computer program is equal to this computer program, Cooperate; otherwise Defect.” [33, 22, 46, Sect. 10.4, 55] This is a program equilibrium which implements mutual cooperation.
In the setting for our paper, we similarly imagine that each player can write a program that in turn chooses from $A_i$. However, the types of programs that we have in mind here are more sophisticated than those typically considered in the program equilibrium literature. Specifically we imagine that the programs are executed by intelligent representatives who are themselves able to competently choose an action for player $i$ in any given game $\Gamma'$, without the original player having to describe how this choice is to be made. The original player may not even understand much about this program other than that it generally plays well. Thus, in addition to the elementary instructions used in a typical computer program (branches, comparisons, arithmetic operations, return, etc.), we allow player $i$ to use instructions of the type “Play $\Gamma'$” in the program she submits. This instruction lets the representative choose and return an action for the game $\Gamma'$. Apart from the addition of this instruction type, we imagine the set of instructions to be the same as in the program equilibrium literature. To jointly let the representatives play, e.g., the SPI of Table 2 on the Demand Game of Table 1, the representatives can both use an instruction that says, “If the opponent’s chosen program is equal to this one, play the game of Table 2; otherwise play the Demand Game of Table 1”. Assuming some minimal rationality requirements on the representatives (i.e., on how the representative resolves the “play $\Gamma'$” instruction), this is a Nash equilibrium. Figure 1 illustrates how (in the two-player case) the meta game between the original players is intended to work.
For illustration consider the following two real-world instantiations of this setup. First, we might imagine that the original players hire human representatives. Each player specifies by some contract, e.g., via monetary incentives, how she wants her representative to act. For example, a player might contract her representative to play a particular action; or she might specify in her contract a function over outcomes according to which she will pay the representative after an outcome is obtained. Moreover, these contracts might refer to one another. For example, Player 1’s contract with her representative might specify that if Player 2 and his representative use an analogous contract, then she will pay her representative according to Table 2. As a second, more futuristic scenario, you could imagine that the representatives are software agents whose goals are specified by so-called smart contracts, i.e., computer programs implemented on a blockchain to be publicly verifiable [8, 47].
To justify our study of SPIs, we prove that every SPI is played in some program equilibrium:
Theorem 1. Let $\Gamma$ be a game and $\Gamma'$ be an SPI on $\Gamma$. Now consider a program game on $\Gamma$, where each player can choose from a set of computer programs that output actions for $\Gamma$. In addition to the normal kind of instructions, we allow the use of the command “play $\Gamma''$” for any subset game $\Gamma''$ of $\Gamma$. Finally, assume that playing $\Gamma'$ guarantees each player at least that player’s minimax utility (a.k.a. threat point) in the base game $\Gamma$. Then $\Gamma'$ is played in a program equilibrium, i.e., in a Nash equilibrium of the program game.
We prove this in Appendix A.
As an alternative to having the original players choose contracts separately, we could imagine the use of jointly signed contracts which only come into effect once signed by all players [cf. 24, 34]. Another approach to bilateral commitment was pursued by Raub [45] based on earlier work by Sen [51]. Raub and Sen use preference modification as a mechanism for commitment. For example, in the Prisoner’s Dilemma, each player can separately instruct their representative to prefer cooperating over defecting if and only if the opponent also cooperates. If both players use this instruction, then mutual cooperation becomes the unique Pareto-optimal Nash equilibrium. On the other hand, if only one player instructs their representative to adopt these preferences and the other maintains the usual Prisoner’s Dilemma preferences, the unique equilibrium remains mutual defection. Thus, the preference modification is used to commit to cooperating conditional on the other player making an analogous commitment. Because this is slightly confusing in the context of our work – seeing as our work involves both modifying one’s preferences and mutual commitment, but generally without using the former as a means to the latter – we discuss Raub’s and Sen’s work and its relation to ours in more detail in Appendix B.
For sets $X$ and $Y$, a multi-valued function $\Phi$ from $X$ to $Y$ is a function which maps each element $x \in X$ to a set $\Phi(x) \subseteq Y$. For a subset $\hat X \subseteq X$, we define $\Phi(\hat X) := \bigcup_{x \in \hat X} \Phi(x)$.
Note that $\Phi(\{x\}) = \Phi(x)$ and that $\Phi(\hat X) \subseteq \Phi(\hat X')$ whenever $\hat X \subseteq \hat X'$. For any set $X$, we define the identity function $\mathrm{id}_X$ by $\mathrm{id}_X(x) := \{x\}$ for all $x \in X$. Also, for two sets $X$ and $Y$, we define the trivial multi-valued function $\mathrm{triv}_{X,Y}$ by $\mathrm{triv}_{X,Y}(x) := Y$ for all $x \in X$. We define the inverse $\Phi^{-1}$ of $\Phi$ by $\Phi^{-1}(y) := \{x \in X \mid y \in \Phi(x)\}$ for all $y \in Y$.
Note that $(\Phi^{-1})^{-1} = \Phi$ for any multi-valued function $\Phi$. For sets $X$, $Y$, and $Z$ and multi-valued functions $\Phi$ from $X$ to $Y$ and $\Psi$ from $Y$ to $Z$, we define the composite $\Psi \circ \Phi$ by $(\Psi \circ \Phi)(x) := \Psi(\Phi(x))$. As with regular functions, composition of multi-valued functions is associative. We say that $\Phi$ is single-valued if $|\Phi(x)| = 1$ for all $x \in X$. Whenever a multi-valued function is single-valued, we can apply many of the terms for regular functions. For example, we will take injectivity, surjectivity, and bijectivity for single-valued functions to have the usual meaning. We will never apply these notions to non-single-valued functions.
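For finite sets, these operations are straightforward to implement. The following sketch represents multi-valued functions as dictionaries from elements to sets:

```python
# Minimal sketch: finite multi-valued functions as dictionaries from elements
# to sets, with the operations defined above.

def image(phi, subset):
    """Phi applied to a subset: the union of Phi(x) over x in the subset."""
    out = set()
    for x in subset:
        out |= phi[x]
    return out

def inverse(phi, codomain):
    """Phi^{-1}(y) = {x : y in Phi(x)}."""
    return {y: {x for x in phi if y in phi[x]} for y in codomain}

def compose(psi, phi):
    """(Psi o Phi)(x) = Psi(Phi(x))."""
    return {x: image(psi, phi[x]) for x in phi}

phi = {1: {"a"}, 2: {"a", "b"}, 3: set()}
psi = {"a": {10}, "b": {10, 20}}

print(image(phi, {1, 2}))                                   # {'a', 'b'}
print(compose(psi, phi))                                    # {1: {10}, 2: {10, 20}, 3: set()}
print(inverse(inverse(phi, {"a", "b"}), {1, 2, 3}) == phi)  # True: (Phi^-1)^-1 = Phi
```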
In this section, we introduce a notion of outcome correspondence, which we will see is essential to constructing SPIs.
Definition 3. Consider two games $\Gamma = (A, \mathbf{u})$ and $\Gamma' = (A', \mathbf{u}')$. We write $\Gamma \sim_\Phi \Gamma'$ for a multi-valued function $\Phi$ from $A$ to $A'$ if $\Pi(\Gamma') \in \Phi(\Pi(\Gamma))$ with certainty.
Note that $\Gamma \sim_\Phi \Gamma'$ is a statement about $\Pi$, i.e., about how the representatives choose. Whether such a statement holds generally depends on the specific representatives being used. In Section 4.4, we describe two general circumstances under which it seems plausible that $\Gamma \sim_\Phi \Gamma'$. For example, if two games $\Gamma$ and $\Gamma'$ are isomorphic, then one might expect $\Gamma \sim_\Phi \Gamma'$, where $\Phi$ is the isomorphism between the two games.
We now illustrate this notation using our discussion of the Demand Game. Let $\Gamma$ be the Demand Game of Table 1. First, it seems plausible that $\Gamma$ is in some sense equivalent to $\hat\Gamma$, where $\hat\Gamma$ is the game that results from removing DL and RL for both players from $\Gamma$. Again, strict dominance could be given as an argument. We can now formalize this as $\Gamma \sim_{\Phi_1} \hat\Gamma$, where $\Phi_1(a) := \{a\}$ if $a \in \hat A$ and $\Phi_1(a) := \emptyset$ otherwise. Next, it seems plausible that $\hat\Gamma \sim_{\Phi_2} \Gamma'$, where $\Gamma'$ is the game of Table 2 and $\Phi_2$ is the isomorphism between $\hat\Gamma$ and $\Gamma'$.
We now state some basic facts about the relation $\sim$, many of which we will use throughout this paper.
Lemma 2. Let $\Gamma$, $\Gamma'$, and $\Gamma''$ be games with outcome sets $A$, $A'$, and $A''$, let $\Phi$ and $\Phi'$ be multi-valued functions from $A$ to $A'$, and let $\Psi$ be a multi-valued function from $A'$ to $A''$. Then:
1. $\Gamma \sim_{\mathrm{id}_A} \Gamma$ (reflexivity).
2. If $\Gamma \sim_{\Phi} \Gamma'$, then $\Gamma' \sim_{\Phi^{-1}} \Gamma$.
3. If $\Gamma \sim_{\Phi} \Gamma'$ and $\Gamma' \sim_{\Psi} \Gamma''$, then $\Gamma \sim_{\Psi \circ \Phi} \Gamma''$ (transitivity).
4. If $\Gamma \sim_{\Phi} \Gamma'$ and $\Phi(a) \subseteq \Phi'(a)$ for all $a \in A$, then $\Gamma \sim_{\Phi'} \Gamma'$.
5. $\Gamma \sim_{\mathrm{triv}_{A,A'}} \Gamma'$.
6. If $\Gamma \sim_{\Phi} \Gamma'$ and, with certainty, $\Pi(\Gamma) \in \hat A$ for some $\hat A \subseteq A$, then $\Gamma \sim_{\hat\Phi} \Gamma'$, where $\hat\Phi(a) := \Phi(a)$ for $a \in \hat A$ and $\hat\Phi(a) := \emptyset$ otherwise.
7. If, with certainty, $\Pi(\Gamma) \in \hat A$ for some $\hat A \subseteq A$, then $\Gamma \sim_{\Phi_{\hat A}} \Gamma$, where $\Phi_{\hat A}(a) := \{a\}$ for $a \in \hat A$ and $\Phi_{\hat A}(a) := \emptyset$ otherwise.
Proof. 1. By reflexivity of equality, $\Pi(\Gamma) = \Pi(\Gamma)$ with certainty. Hence, $\Pi(\Gamma) \in \{\Pi(\Gamma)\} = \mathrm{id}_A(\Pi(\Gamma))$ by definition of $\mathrm{id}_A$. Therefore, by definition of $\sim$, $\Gamma \sim_{\mathrm{id}_A} \Gamma$ as claimed.
2. $\Gamma \sim_{\Phi} \Gamma'$ means that $\Pi(\Gamma') \in \Phi(\Pi(\Gamma))$ with certainty. Thus,
$\Pi(\Gamma) \in \{a \in A \mid \Pi(\Gamma') \in \Phi(a)\} = \Phi^{-1}(\Pi(\Gamma'))$
with certainty, where the equality is by the definition of the inverse of multi-valued functions. We conclude (by definition of $\sim$) that $\Gamma' \sim_{\Phi^{-1}} \Gamma$ as claimed.
3. If $\Gamma \sim_{\Phi} \Gamma'$ and $\Gamma' \sim_{\Psi} \Gamma''$, then by definition of $\sim$, (i) $\Pi(\Gamma') \in \Phi(\Pi(\Gamma))$ and (ii) $\Pi(\Gamma'') \in \Psi(\Pi(\Gamma'))$, both with certainty. The former (i) implies $\{\Pi(\Gamma')\} \subseteq \Phi(\Pi(\Gamma))$. Hence,
$\Psi(\Pi(\Gamma')) \subseteq \Psi(\Phi(\Pi(\Gamma))) = (\Psi \circ \Phi)(\Pi(\Gamma)).$
With (ii), it follows that $\Pi(\Gamma'') \in (\Psi \circ \Phi)(\Pi(\Gamma))$ with certainty. By definition, $\Gamma \sim_{\Psi \circ \Phi} \Gamma''$ as claimed.
4. We have
$\Pi(\Gamma') \in \Phi(\Pi(\Gamma)) \subseteq \Phi'(\Pi(\Gamma))$
with certainty. Thus, by definition, $\Gamma \sim_{\Phi'} \Gamma'$.
5. By definition of a game, $\Pi(\Gamma') \in A'$ with certainty. By definition of $\mathrm{triv}_{A,A'}$, $\mathrm{triv}_{A,A'}(\Pi(\Gamma)) = A'$ with certainty. Hence, $\Pi(\Gamma') \in \mathrm{triv}_{A,A'}(\Pi(\Gamma))$ with certainty. We conclude that $\Gamma \sim_{\mathrm{triv}_{A,A'}} \Gamma'$ as claimed.
6. With certainty, $\Pi(\Gamma) \in \hat A$ (by assumption). Also, with certainty, $\Pi(\Gamma') \in \Phi(\Pi(\Gamma))$. Hence, $\Pi(\Gamma') \in \hat\Phi(\Pi(\Gamma))$ with certainty, since $\hat\Phi$ and $\Phi$ agree on $\hat A$. We conclude that $\Gamma \sim_{\hat\Phi} \Gamma'$ as claimed.
7. By reflexivity of $\sim$ (Lemma 2.1), $\Gamma \sim_{\mathrm{id}_A} \Gamma$. Since $\Pi(\Gamma) \in \hat A$ with certainty, Lemma 2.6 (applied with $\Phi = \mathrm{id}_A$) yields $\Gamma \sim_{\Phi_{\hat A}} \Gamma$, because $\Phi_{\hat A}$ agrees with $\mathrm{id}_A$ on $\hat A$ and is empty-valued otherwise.
Items 1-3 show that $\sim$ has properties resembling those of an equivalence relation. Note, however, that since $\sim$ is not a binary relationship, it itself cannot be an equivalence relation in the usual sense. We can construct equivalence relations, though, by existentially quantifying over the multi-valued function. For example, we might define an equivalence relation $\approx$ on games, where $\Gamma \approx \Gamma'$ if and only if there is a single-valued bijection $\Phi$ such that $\Gamma \sim_\Phi \Gamma'$.3
Item 4 states that if we can make an outcome correspondence claim less precise, it will still hold true. Item 5 states that in the extreme, it is always , where is the trivial, maximally imprecise outcome correspondence function that confers no information. Item 6 shows that can be used to express the elimination of outcomes, i.e., the belief that a particular outcome (or strategy) will never occur.
Besides an equivalence relation, we can also use with quantification over the respective outcome correspondence function to construct (non-symmetric) preorders over games, i.e., relations that are transitive and reflexive (but not symmetric or antisymmetric). Most importantly, we can construct a preorder on games where if for a that always increases every player's utilities.
We now show that, as advertised, outcome correspondence is closely tied to SPIs. The following theorem not only shows how outcome correspondences can be used to find (and prove) SPIs; it also shows that any SPI requires an outcome correspondence via a Pareto-improving correspondence function.
Definition 4. Let be a game and be a subset game of . Further let be such that . We call a Pareto-improving outcome correspondence (function) if for all and all .
Theorem 3. Let be a game and be a subset game of . Then is an SPI on if and only if there is a Pareto-improving outcome correspondence from to .
Proof. : By definition, with certainty. Hence, for ,
with certainty. Hence, by assumption about , with certainty, .
: Assume that with certainty for . We define
It is immediately obvious that is Pareto-improving as required. Also, whenever and for any and , it is (by assumption) with certainty . Thus, by definition of , it holds that . We conclude that as claimed.
Note that the theorem concerns weak SPIs and therefore allows the case where with certainty . To show that some is a strict SPI, we need additional information about which outcomes occur with positive probability. This, too, can be expressed via our outcome correspondence relation. However, since this is cumbersome, we will not formally address strictness much to keep things simple.4
We now illustrate how outcome correspondences can be used to derive the SPI for the Demand Game from the introduction as per Theorem 3. Of course, at this point we have not made any assumptions about when games are equivalent. We will introduce some in the following section. Nevertheless, we can already sketch the argument using the specific outcome correspondences that we have given intuitive arguments for. Let again be the Demand Game of Table 1. Then, as we have argued, , where is the game that results from removing and for both players; and if and otherwise. In a second step, , where is the game of Table 2 and is the isomorphism between and . Finally, transitivity (Lemma 2.3) implies that . To see that is Pareto-improving for the original utility functions of , notice that does not change utilities at all. The correspondence function maps the conflict outcome onto the outcome , which is better for both original players. Other than that, , too, does not change the utilities. Hence, is Pareto-improving. By Theorem 3, is therefore an SPI on .
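As an aside, the Pareto-improvement check in Definition 4 is mechanical once a correspondence and the original players' utilities are written down. The following Python sketch (ours, with made-up outcome labels and payoffs rather than the actual Demand Game tables) verifies that a correspondence which maps a conflict outcome onto a mutually better outcome, and every other outcome onto itself, is Pareto-improving.

def is_pareto_improving(phi, utilities):
    """phi maps each original outcome to the set of outcomes it may correspond to;
    utilities maps each outcome to the tuple of the original players' utilities.
    Returns True iff every corresponding outcome is weakly better for every player."""
    return all(
        u_new >= u_old
        for old, news in phi.items()
        for new in news
        for u_old, u_new in zip(utilities[old], utilities[new])
    )

# Hypothetical payoffs: the conflict outcome is mapped onto a mutually better outcome.
utilities = {"conflict": (0, 0), "split": (3, 3), "alice_favoured": (5, 1), "bob_favoured": (1, 5)}
phi = {outcome: {outcome} for outcome in utilities}
phi["conflict"] = {"split"}
print(is_pareto_improving(phi, utilities))  # True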
In principle, Theorem 3 does not hinge on and resulting from playing games. An analogous result holds for any random variables over and . In particular, this means that Theorem 3 applies also if the representatives receive other kinds of instructions (cf. Section 3.2). However, it seems hard to establish non-trivial outcome correspondences between and other types of instructions. Still, more complicated instructions can be used to derive different kinds of SPIs. For example, if there are different game SPIs, then the original players could tell their representatives to randomize between them in a coordinated way.
To make any claims about how the original players should play the meta-game, i.e., about what instructions they should submit, we generally need to make assumptions about how the representatives choose and (by Theorem 3) about outcome correspondence in particular.5 We here make two fairly weak assumptions.
Our first assumption is that the representatives never play strictly dominated actions and that removing them does not affect what the representatives would choose.
Assumption 1. Let be an arbitrary -player game where are pairwise disjoint, and let be strictly dominated by some other strategy in . Then , where for all , and whenever .
Assumption 1 expresses that representatives should never play strictly dominated strategies. Moreover, it states that we can remove strictly dominated strategies from a game and the resulting game will be played in the same way by the representatives. For example, this implies that when evaluating a strategy , the representatives do not take into account how many other strategies strictly dominates. Assumption 1 also allows (via Transitivity of as per Lemma 2.3) the iterated removal of strictly dominated strategies. The notion that we can (iteratively) remove strictly dominated strategies is common in game theory [41, 27, 39, Section 2.9, Chapter 12] and has rarely been questioned. It is also implicit in the solution concept of Nash equilibrium – if a strategy is removed by iterated strict dominance, that strategy is played in no Nash equilibrium. However, like the concept of Nash equilibrium, the elimination of strictly dominated strategies becomes implausible if the game is not played in the usual way. In particular, for Assumption 1 to hold, we will in most games have to assume that the representatives cannot in turn make credible commitments (or delegate to further subrepresentatives) or play the game iteratively [4].
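For concreteness, here is a small Python sketch (ours, not the paper's) of the kind of reduction that Assumption 1 licenses: iterated elimination of pure strategies that are strictly dominated by other pure strategies in a 2-player normal-form game. Domination by mixed strategies, mentioned later in this section, is not covered.

import numpy as np

def iterated_strict_dominance(U1, U2):
    """U1[i, j], U2[i, j]: payoffs of the row and column player when row i meets column j.
    Returns the indices of the strategies surviving iterated strict dominance."""
    rows = list(range(U1.shape[0]))
    cols = list(range(U1.shape[1]))
    changed = True
    while changed:
        changed = False
        # Remove a row strictly dominated by another remaining row.
        for i in rows:
            if any(all(U1[k, j] > U1[i, j] for j in cols) for k in rows if k != i):
                rows.remove(i)
                changed = True
                break
        # Remove a column strictly dominated by another remaining column.
        for j in cols:
            if any(all(U2[i, k] > U2[i, j] for i in rows) for k in cols if k != j):
                cols.remove(j)
                changed = True
                break
    return rows, cols

# Example: Prisoner's Dilemma (index 0 = Cooperate, 1 = Defect).
U1 = np.array([[3, 0], [4, 1]])
U2 = np.array([[3, 4], [0, 1]])
print(iterated_strict_dominance(U1, U2))  # ([1], [1]) -- only mutual defection survives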
Our second assumption is that the representatives play isomorphic games isomorphically when those games are fully reduced.
Assumption 2. Let and be two games that do not contain strictly dominated actions. If and are isomorphic, then there exists an isomorphism between and such that .
Similar desiderata have been discussed in the context of equilibrium selection, e.g., by Harsanyi and Selten [20, Chapter 3.4] [cf. 56, for a discussion in the context of fully cooperative multi-agent reinforcement learning].
Note that if there are multiple game isomorphisms, then we assume outcome correspondence for only one of them. This is necessary for the assumption to be satisfiable in the case of games with action symmetries. (Of course, such games are not the focus of this paper.) For example, let be Rock–Paper–Scissors. Then is isomorphic to itself via the function that for both players maps Rock to Paper, Paper to Scissors, and Scissors to Rock. But if it were , then this would mean that if the representatives play Rock in Rock–Paper–Scissors, they play Paper in Rock–Paper–Scissors. Contradiction! We will argue for the consistency of our version of the assumption in Section 4.4.3. Notice also that we make the assumption only for reduced games. This relates to the previous point about action-symmetric games. For example, consider two versions of Rock–Paper–Scissors and assume that in both versions both players have an additional strictly dominated action that breaks the action symmetries, e.g., the action “resign and give the opponent if they play Rock/Paper”. Then there would only be one isomorphism between these two games (which maps Rock to Paper, Paper to Scissors, and Scissors to Rock for both players). However, in light of Assumption 1, it seems problematic to assume that these strictly dominated actions restrict the outcome correspondences between these two games.6
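The following sketch (ours) makes the notion of a game isomorphism concrete for 2-player games by brute-forcing over relabelings of both players' actions. For simplicity it only accepts relabelings that preserve payoffs exactly; the notion used in this paper also allows each player's utilities to be rescaled by a positive affine transformation, which this sketch omits.

import itertools
import numpy as np

def find_isomorphism(U1, U2, V1, V2):
    """Search for row/column relabelings mapping game (U1, U2) onto game (V1, V2)."""
    n_rows, n_cols = U1.shape
    if V1.shape != U1.shape:
        return None
    for rho in itertools.permutations(range(n_rows)):        # row relabeling
        for sigma in itertools.permutations(range(n_cols)):  # column relabeling
            if all(
                U1[i, j] == V1[rho[i], sigma[j]] and U2[i, j] == V2[rho[i], sigma[j]]
                for i in range(n_rows)
                for j in range(n_cols)
            ):
                return rho, sigma
    return None

# Example: Matching Pennies and a version with the column player's actions swapped.
U1 = np.array([[1, -1], [-1, 1]]); U2 = -U1
V1 = np.array([[-1, 1], [1, -1]]); V2 = -V1
print(find_isomorphism(U1, U2, V1, V2))  # e.g. ((0, 1), (1, 0))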
One might worry that reasoning about the existence of multiple isomorphisms renders it intractable to deal with outcome correspondences as implied by Assumption 2, and in particular that it might make it impossible to tell whether a particular game is an SPI. However, one can intuitively see that the different isomorphisms between two games perform analogous operations. In particular, it turns out that if one isomorphism is Pareto-improving, then they all are:
Lemma 4. Let and be isomorphisms between and . If is (strictly) Pareto-improving, then so is .
We prove Lemma 4 in Appendix C.
Lemma 4 will allow us to conclude from the existence of a Pareto-improving isomorphism that there is a Pareto-improving s.t. by Assumption 2, even if there are multiple isomorphisms between and . In the following, we can therefore afford to be lax about our ignorance (in some games) about which isomorphism induces outcome correspondence. We will generally write “ by Assumption 2” as short for “ is a game isomorphism between and , and hence by Assumption 2 there exists an isomorphism such that ”.
One could criticize Assumption 2 by referring to focal points (introduced by Schelling [49, 48, pp. 54–58] [cf., e.g., 30, 18, 54, 9]) as an example where context and labels of strategies matter. A possible response might be that in games where context plays a role, that context should be included as additional information and not be considered part of . Assumption 2 would then either not apply to such games with (relevant) context or would require one to, in some way, translate the context along with the strategies. However, in this paper we will not formalize context, and assume that there is no decision-relevant context.
We will now argue that there exist representatives that indeed satisfy Assumptions 1 and 2, both to provide intuition and because our results would not be valuable if Assumptions 1 and 2 were inconsistent. We will only sketch the argument informally. To make the argument formal, we would need to specify in more detail what the set of games looks like and in particular what the objects of the action sets are.
Imagine that for each player there is a book7 that on each page describes a normal-form game that does not have any strictly dominated strategies. The actions have consecutive integer labels. Importantly, the book contains no pair of games that are isomorphic to each other. Moreover, for every fully reduced game, the book contains a game that is isomorphic to this game. (Unless we strongly restrict the set of games under consideration, the book must therefore have infinitely many pages.) We imagine that each player's book contains the same set of games. On each page, the book for Player recommends one of the actions of Player to be taken deterministically.8
Each representative owns a potentially different version of this book and uses it as follows to play a given game . First, the given game is fully reduced by iterated strict dominance to obtain a game . They then look up the unique game in the book that is isomorphic to and map the action labels in onto the integer labels of the game in the book via some isomorphism. If there are multiple isomorphisms from to the relevant page in the book, then all representatives decide between them using the same deterministic procedure. Finally, they choose the action recommended by the book.
It is left to show that a pair of representatives thus specified satisfies Assumptions 1 and 2. We first argue that Assumption 1 is satisfied. Let be a game and let be a game that arises from removing a strictly dominated action from . By the well-known path independence of iterated elimination of strictly dominated strategies [1, 19, 41], fully reducing and results in the same game. Hence, the representatives play the same actions in and .
Second, we argue that Assumption 2 is satisfied. Let us say and are fully reduced and isomorphic. Then it is easy to see that each player plays and based on the same page of their book. Let the game on that book page be . Let and be the bijections used by the representatives to translate actions in and , respectively, to labels in . Then if the representatives take actions in , the actions are the ones specified by the book for , and hence the actions are played in . Thus . It is easy to see that is a game isomorphism between and .
One could try to use principles other than Assumptions 1 and 2. We here give some considerations. First, game theorists have also considered the iterated elimination of weakly dominated strategies [17, 31, Section 4.11]. Unfortunately, the iterated removal of weakly dominated strategies is path-dependent [27, Section 2.7.B; 7, Section 5.2; 39, Section 12.3]. That is, for some games, iterated removal of weakly dominated strategies can lead to different subset games, depending on which weakly dominated strategy one chooses to eliminate at any stage. A straightforward extension of Assumption 1 to allow the elimination of weakly dominated strategies would therefore be inconsistent in such games, which can be seen as follows.
Work on the path dependence of iterated removal of weakly dominated strategies has shown that there are games with two different outcomes such that by iterated removal of weakly dominated strategies from , we can obtain both and . If we had an assumption analogous to Assumption 1 but for weak dominance, then (with Lemma 2.3 – transitivity), we would obtain both that and that , where for all and for all . The former would mean (by Lemma 2.6) that for all we have that with certainty; while the latter would mean that we have that with certainty. But jointly this means that for all , we have that with certainty, which cannot be the case as by definition. Thus, we cannot make an assumption analogous to Assumption 1 for weak dominance.
As noted above, the iterated removal of strictly dominated strategies, on the other hand, is path-independent, and in the 2-player case always eliminates exactly the non-rationalizable strategies [1, 19, 41]. Many other dominance concepts have been shown to have path independence properties. For an overview, see Apt [1]. We could have made an independence assumption based on any of these path-independent dominance concepts. For example, elimination of strategies that are strictly dominated by a mixed strategy (or, equivalently, of so-called never-best responses) is also path independent [40, Section 4.2].
With Assumptions 1 and 2, all our outcome correspondence functions are either 1-to-1 or 1-to-0. Other elimination assumptions could involve the use of many-to-1 or even many-to-many functions. In general, such functions are needed when a strategy can be eliminated to obtain a strategically equivalent game, but in the original game may still be played. The simplest example would be the elimination of payoff-equivalent strategies. Imagine that in some game for all opponent strategies it is the case that and that there are no other strategies that are similarly payoff-equivalent to and . Then one would assume that , where maps onto and otherwise is just the identity function. As an example, imagine a variant of the Demand Game in which Player 1 has an additional action that results in the same payoffs as for both players against Player 2's and but potentially slightly different payoffs against and . With our current assumptions we would be unable to derive a non-trivial SPI for this game. However, with an assumption about the elimination of duplicate actions in hand, we could (after removing and as usual) remove or and thereby derive the usual SPI. Many-to-1 elimination assumptions can also arise from some dominance concepts if they have weaker path independence properties. For example, iterated elimination by so-called nice weak dominance [32] is only path-independent up to strategic equivalence. Like the assumption about payoff-equivalent strategies, an elimination assumption based on nice weak dominance therefore cannot assume that the eliminated action is not played in the original game at all.
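As a concrete illustration of the simplest many-to-1 case, the following sketch (ours, with made-up payoffs) detects payoff-equivalent row strategies, i.e., rows that give both players identical payoffs against every column; under a duplicate-elimination assumption, one of each such pair could be removed via a correspondence that maps it onto the other.

import numpy as np

def duplicate_rows(U1, U2):
    """Return pairs of row strategies that are payoff-equivalent for both players."""
    n_rows = U1.shape[0]
    return [
        (i, k)
        for i in range(n_rows)
        for k in range(i + 1, n_rows)
        if np.array_equal(U1[i], U1[k]) and np.array_equal(U2[i], U2[k])
    ]

# Hypothetical example: rows 0 and 2 are payoff-equivalent.
U1 = np.array([[2, 0], [1, 1], [2, 0]])
U2 = np.array([[0, 2], [1, 1], [0, 2]])
print(duplicate_rows(U1, U2))  # [(0, 2)]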
In this section, we use Lemma 2, Theorem 3, and Assumptions 1 and 2 to formally prove a few SPIs.
Proposition (Example) 5. Let be the Prisoner's Dilemma (Table 3) and be any subset game of with . Then under Assumption 1, is a strict SPI on .
Proof. By applying Assumption 1 twice and Transitivity once, , where and and for all . By Lemma 2.5, we further obtain , where is as described in the proposition. Hence, by transitivity, . It is easy to verify that the function is Pareto-improving.
Proposition (Example) 6. Let be the Demand Game of Table 1 and be the subset game described in Table 2. Under Assumptions 1 and 2, is an SPI on . Further, if , then is a strict SPI.
Proof. Let . We can repeatedly apply Assumption 1 to eliminate from the strategies and for both players. We can then apply Lemma 2.3 (Transitivity) to obtain , where and
Next, by Assumption 2, , where and for . We can then apply Lemma 2.3 (Transitivity) again, to infer . It is easy to verify that for all , it is for all the case that .
Next, we give two examples of unilateral SPIs. We start with an example that is trivial in that the original player instructs her representative to take a specific action. We then give the SPI for the Complicated Temptation Game as a non-trivial example.
Consider the Temptation Game given in Table 6. In this game, Player 1's (for Temptation) strictly dominates . Once is removed, Player 2 prefers . Hence, this game is strict-dominance solvable to . Player 1 can safely Pareto-improve on this result by telling her representative to play , since Player 2's best response to is and . We now show this formally.
Proposition (Example) 7. Let be the game of Table 6. Under Assumption 1, is a strict SPI on .
Proof. First consider . We can apply Assumption 1 to eliminate Player 1's and then apply Assumption 1 again to the resulting game to also eliminate Player 2's . By transitivity, we find , where and and .
Next, consider . We can apply Assumption 1 to remove Player 2's strategy and find , where and and .
Third, by Lemma 2.5, where .
Finally, we can apply transitivity to conclude , where . It is easy to verify that and . Hence, is Pareto-improving and so by Theorem 3, is an SPI on .
Note that in this example, Player 1 simply commits to a particular strategy and Player 2 maximizes their utility given Player 1's choice. Hence, this SPI can be justified with much simpler unilateral commitment setups [11, 52, 59]. For example, if the Temptation Game were played as a sequential game in which Player 1 plays first, its unique subgame-perfect equilibrium is .
In Table 4 we give the Complicated Temptation Game, which better illustrates the features specific to our setup. Roughly, it is an extension of the simpler Temptation Game of Table 6. In addition to choosing versus and versus , the players also have to make an additional choice (1 versus 2), which is difficult in that it cannot be solved by strict dominance. As we have argued in Section 3.1, the game in Table 5 is a unilateral SPI on Table 4. We can now show this formally.
Proposition (Example) 8. Let be the Complicated Temptation Game (Table 4) and be the subset game in Table 5. Under Assumptions 1 and 2, is a unilateral SPI on .
Proof. In , for Player 1, and strictly dominate and . We can thus apply Assumption 1 to eliminate Player 1's and . In the resulting game, Player 2's and strictly dominate and , so one can apply Assumption 1 again to the resulting game to also eliminate Player 2's and . By transitivity, we find , where and
Next, consider (Table 5). We can apply Assumption 1 to remove Player 2's strategies and and find , where and
Third, by Assumption 2, where decomposes into and , corresponding to the two players, respectively, where and for .
Finally, we can apply transitivity and the rule about symmetry and inverses (Lemma 2.2) to conclude . It is easy to verify that is Pareto-improving.
In this section, we ask how computationally costly it is for the original players to identify for a given game a non-trivial SPI . Of course, the answer to this question depends on what the original players are willing to assume about how their representatives act. For example, if only trivial outcome correspondences (as per Lemma 2.1 and 2.5) are assumed, then the decision problem is easy. Similarly, if for given is hard to decide (e.g., because it requires solving for the Nash equilibria of and ), then this could trivially also make the safe Pareto improvement problem hard to decide. We are specifically interested in deciding whether a given game has a non-trivial SPI that can be proved using only Assumptions 1 and 2, the general properties of outcome correspondence (in particular Transitivity (Lemma 2.3) and Symmetry (Lemma 2.2)), and Theorem 3.
Definition 5. The SPI decision problem consists in deciding for any given , whether there is a game and a sequence of outcome correspondences and a sequence of subset games of s.t.:
Many variants of this problem may be considered. For example, to match Definition 1, the definition of the strict SPI problem assumes that all outcomes that survive iterated elimination occur with positive probability. Alternatively we could have required that for demonstrating strictness, there must be a player such that for all that survive iterated elimination, . Similarly one may wish to find SPIs that are strict improvements for all players. We may also wish to allow the use of the elimination of duplicate strategies (as described in Section 4.4.4) or trivial outcome correspondence steps as per Lemma 2.5. These modifications would not change the computational complexity of the problem, nor would they require new proof ideas. One may also wish to compute all SPIs, or – in line with multi-criteria optimization [14, 58] – all SPIs that cannot in turn be safely Pareto-improved upon. However, in general there may exist exponentially many such SPIs. To retain any hope of developing an efficient algorithm, one would therefore have to first develop a more efficient representation scheme [cf. 42, Sect. 16.4].
Theorem 9. The (strict) (unilateral) SPI decision problem is NP-complete, even for 2-player games.
Proposition 10. For games with that can be reduced (via iterative application of Assumption 1) to a game with , the (strict) (unilateral) SPI decision problem can be solved in .
The full proof is tedious (see Appendix D), but the main idea is simple, especially for omnilateral SPIs. To find an omnilateral SPI on based on Assumptions 1 and 2, one has to first iteratively remove all strictly dominated actions to obtain a reduced game , which the representatives would play the same as the original game. This can be done in polynomial time. One then has to map the actions onto the original in such a way that each outcome in is mapped onto a weakly Pareto-better outcome in . Our proof of NP-hardness works by reducing from the subgraph isomorphism problem, where the payoff matrices of and represent the adjacency matrices of the graphs.
Besides being about a specific set of assumptions about , note that Theorem 9 and Proposition 10 also assume that the utility function of the game is represented explicitly in normal form as a payoff matrix. If we changed the game representation (e.g., to boolean circuits, extensive-form game trees, quantified boolean formulas, or even Turing machines), this could affect the complexity of the SPI problem. For example, Gabarró, García, and Serna [16] show that the game isomorphism problem on normal-form games is equivalent to the graph isomorphism problem, while it is equivalent to the (likely computationally harder) boolean circuit isomorphism problem for a weighted boolean formula game representation. Solving the SPI problem requires solving a subset game isomorphism problem (see the proof of Lemma 28 in Appendix D for more detail). We therefore suspect that the SPI problem analogously increases in computational complexity (perhaps to being -complete) if we treat games in a weighted boolean formula representation. In fact, even reducing a game using strict dominance by pure strategies – which contributes only insignificantly to the complexity of the SPI problem for normal-form games – is difficult in some game representations [10, Section 6]. Note, however, that for any game representation to which 2-player normal-form games can be efficiently reduced – such as, for example, extensive-form games – the hardness result also applies.
In this section, we imagine that the players are able to simply invent new token strategies with new payoffs that arise from mixing existing feasible payoffs. To define this formally, we first define for any game ,
to be the set of payoff vectors that are feasible by some correlated strategy. The underlying notion of correlated strategies is the same as in correlated equilibrium [2, 3], but in this paper it will not be relevant whether any such strategy is a correlated equilibrium of . Instead, their use will hinge on commitments [cf. 34]. Note that is exactly the convex closure of , i.e., the convex closure of the set of deterministically achievable utilities of the original game.
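The following small sketch (ours, with made-up payoffs) illustrates this point numerically: the payoff vectors feasible via correlated strategies form the convex hull of the pure outcomes' payoff vectors, which can be computed with scipy.

import numpy as np
from scipy.spatial import ConvexHull

# Made-up 2x2 payoffs; `points` collects the payoff vector of each pure outcome.
U1 = np.array([[4, 1], [6, 0]])
U2 = np.array([[4, 6], [1, 0]])
points = np.array([[U1[i, j], U2[i, j]] for i in range(2) for j in range(2)])
hull = ConvexHull(points)
print(points[hull.vertices])  # extreme points of the feasible payoff set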
For any game , we then imagine that in addition to subset games, the players can let the representatives play a perfect-coordination token game , where for all , and are arbitrary utility functions to be used by the representatives and are the utilities that the original players assign to the token strategies.
The instruction lets the representatives play the game as usual. However, the strategies are imagined to be meaningless token strategies which do not resolve the given game . Once some token strategies are selected, these are translated into some probability distribution over , i.e., into a correlated strategy of the original game. This correlated strategy is then played by the original players, thus giving rise to (expected) utilities . These distributions and thus utilities are specified by the original players.
Definition 6. Let be a game. A perfect-coordination SPI for is a perfect-coordination token game for s.t. with certainty. We call a strict perfect-coordination SPI if there furthermore is a player for whom with positive probability.
As an example, imagine that is just the - subset game of the Demand Game of Table 1. Then, intuitively, an SPI under improved coordination could consist of the original players telling the representatives, “Play as if you were playing the - subset game of the Demand Game, but whenever you find yourself playing , randomize [according to some given distribution] between the other (Pareto-optimal) outcomes instead”. Formally, and would then consist of tokenized versions of the original strategies. The utility functions and are then simply the same as in the original Demand Game except that they are applied to the token strategies. For example, . The utilities for the original players remove the conflict outcome. For example, the original players might specify , representing that the representatives are supposed to play in the case. For all other outcomes , it must be the case that because the other outcomes cannot be Pareto-improved upon. As with our earlier SPIs for the Demand Game, Assumption 2 implies that , where maps the original conflict outcome onto the Pareto-optimal (,).
Relative to the SPIs considered up until now, these new types of instructions put significant additional requirements on how the representatives interact. They now have to engage in a two-round process of first choosing and observing one another's token strategies and then playing a correlated strategy for the original game. Further, it must be the case that this additional coordination does not affect the payoffs of the original outcomes. The latter may not be the case in, e.g., the Game of Chicken. That is, we could imagine a Game of Chicken in which coordination is possible but in which the rewards of the game change if the players do coordinate. After all, the underlying story in the Game of Chicken is that the positive reward – admiration from peers – is attained precisely for accepting a grave risk.
With these more powerful ways to instruct representatives, we can now replace individual outcomes of the default game ad libitum. For example, in the reduced Demand Game, we singled out the outcome as Pareto-suboptimal and replaced it by a Pareto-optimal outcome, while keeping all other outcomes the same. This allows us to construct SPIs in many more games than before.
Definition 7. The strict perfect-coordination SPI decision problem consists in deciding for any given whether under Assumption 2 there is a strict perfect-coordination SPI for .
Lemma 11. For a given -player game and payoff vector , it can be decided by linear programming and thus in polynomial time whether is Pareto-optimal in .
For an introduction to linear programming, see, e.g., Schrijver [50]. In short, a linear program is a specific type of constrained optimization problem that can be solved efficiently.
Proof. Finding a Pareto improvement on a given payoff vector $v$ can be formulated as the following linear program (writing $u_i(a)$ for player $i$'s utility at outcome $a$ and $\lambda$ for a probability distribution over the outcomes):
$$\max_{\lambda \geq 0,\ \sum_a \lambda_a = 1}\ \sum_i \sum_a \lambda_a u_i(a) \quad \text{s.t.} \quad \sum_a \lambda_a u_i(a) \geq v_i \text{ for all players } i.$$
The vector $v$ is Pareto-optimal within the set of feasible payoff vectors if and only if the optimal value of this program equals $\sum_i v_i$.
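This check is easy to run with an off-the-shelf LP solver. Below is a minimal sketch (ours, not the paper's code; the function and variable names are our own) using scipy.optimize.linprog.

import numpy as np
from scipy.optimize import linprog

def is_pareto_optimal(payoffs, v, tol=1e-9):
    """payoffs: array of shape (num_outcomes, num_players); v: payoff vector.
    True iff no convex combination of the outcome payoff vectors weakly dominates v
    with a strictly larger utility sum."""
    payoffs = np.asarray(payoffs, dtype=float)
    v = np.asarray(v, dtype=float)
    num_outcomes, _ = payoffs.shape
    res = linprog(
        c=-payoffs.sum(axis=1),                        # maximize total utility
        A_ub=-payoffs.T, b_ub=-v,                      # each player gets at least v_i
        A_eq=np.ones((1, num_outcomes)), b_eq=[1.0],   # convex combination
        bounds=[(0, None)] * num_outcomes,
    )
    if not res.success:                                # no feasible dominating point
        return True
    return -res.fun <= v.sum() + tol

# Example with hypothetical payoffs: (2, 2) is dominated by mixing (4, 1) and (1, 4).
payoffs = [(4, 1), (1, 4), (2, 2), (0, 0)]
print(is_pareto_optimal(payoffs, (2, 2)))      # False
print(is_pareto_optimal(payoffs, (2.5, 2.5)))  # True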
Based on Lemma 11, Algorithm 1 decides whether there is a strict perfect-coordination SPI for a given game .
It is easy to see that this algorithm runs in polynomial time (in the size of, e.g., the normal-form representation of the game). It is also correct: if it returns True, simply replace the Pareto-suboptimal outcome by a feasible outcome that Pareto-dominates it while keeping all other outcomes the same; if it returns False, then all outcomes are Pareto-optimal within the feasible set and so there can be no strict SPI. We summarize this result in the following proposition.
Proposition 12. Assuming is known and that Assumption 2 holds, it can be decided in polynomial time whether there is a strict perfect-coordination SPI.
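Since Algorithm 1 itself is not reproduced here, the following sketch (ours) reconstructs the idea: test each outcome of the (reduced) game for Pareto-domination within the convex hull of the feasible payoff vectors; if some outcome is dominated, a strict perfect-coordination SPI exists.

import numpy as np
from scipy.optimize import linprog

def pareto_dominated_in_hull(payoffs, v, tol=1e-9):
    """True iff some convex combination of the rows of `payoffs` weakly dominates v
    with a strictly larger utility sum."""
    payoffs = np.asarray(payoffs, dtype=float)
    v = np.asarray(v, dtype=float)
    res = linprog(
        c=-payoffs.sum(axis=1),
        A_ub=-payoffs.T, b_ub=-v,
        A_eq=np.ones((1, len(payoffs))), b_eq=[1.0],
    )
    return res.success and -res.fun > v.sum() + tol

def has_strict_perfect_coordination_spi(payoffs):
    """payoffs: list of payoff vectors of the outcomes of the (reduced) game."""
    return any(pareto_dominated_in_hull(payoffs, v) for v in payoffs)

# Hypothetical reduced bargaining-style payoffs with a conflict outcome (0, 0):
print(has_strict_perfect_coordination_spi([(4, 1), (1, 4), (2, 2), (0, 0)]))  # True
# A game whose outcomes are all Pareto-optimal within the hull:
print(has_strict_perfect_coordination_spi([(4, 1), (1, 4)]))  # False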
From the problem of deciding whether there are strict SPIs under improved coordination at all, we move on to the question of what different perfect-coordination SPIs there are. In particular, one might ask what the cost is of only considering safe Pareto improvements relative to acting on a probability distribution over and the resulting expected utilities . We start with a lemma that directly provides a characterization. So far, all the considered perfect-coordination SPIs for a game have consisted in letting the representatives play a game that is isomorphic to the original game, but Pareto-improves (from the original players' perspectives, i.e., ) at least one of the outcomes. It turns out that we can restrict attention to this very simple type of SPI under improved coordination.
Lemma 13. Let be any game. Let be a perfect-coordination SPI on . Then we can define with values in such that under Assumption 2 the game
is also an SPI on , with
for all and consequently .
Proof. First note that is isomorphic to . Thus, by Assumption 2, there is an isomorphism s.t. . WLOG assume that simply maps . Then define as follows:
Here describes the utilities that the original players assign to the outcomes of . Since maps onto and is convex, as defined also maps into as required. Note that for all it is by assumption with certainty. Hence,
as required.
Because of this result, we will focus on these particular types of SPIs, which simply create an isomorphic game with different (Pareto-better) utilities. Note, however, that without assigning exact probabilities to the distributions of , the original players will in general not be able to construct a that satisfies the expected payoff equalities. For this reason, one could still conceive of situations in which the original players would choose a different type of SPI and would be unable to instead choose an SPI of the type described in Lemma 13.
Lemma 13 directly implies a characterization of the expected utilities that can be achieved with perfect-coordination SPIs. Of course, this characterization depends on the exact distribution of . We omit the statement of this result. However, we state the following implication.
Corollary 14. Under Assumption 2, the set of Pareto improvements that are safely achievable with perfect coordination
is a convex polygon.
Because of this result, one can also efficiently optimize convex functions over the set of perfect-coordination SPIs. Even without referring to the distribution , many interesting questions can be answered efficiently. For example, we can efficiently identify the perfect-coordination SPI that maximizes the minimum improvements across players and outcomes .
In the following, we aim to use Lemma 13 and Corollary 14 to give maximally strong positive results about what Pareto improvements can be safely achieved, without referring to exact probabilities over . To keep things simple, we will do this only for the case of two players. To state our results, we first need some notation: We use
to denote the Pareto frontier of a convex polygon (or, more generally, a convex, closed set). For any real number , we use to denote the which maximizes under the constraint . (Recall that we consider 2-player games, so is a single real number.) Note that such a exists if and only if is 's utility in some feasible payoff vector. We first state our result formally. Afterwards, we will give a graphical explanation of the result, which we believe is easier to understand.
Theorem 15. Make Assumption 2. Let be a two-player game. Let be some potentially unsafe Pareto improvement on . For , let . Then:
A) If there is some element in which Pareto-dominates all of and if is Pareto-dominated by an element of at least one of the following three sets:
Then there is an SPI under improved coordination such that .
B) If there is no element in which Pareto-dominates all of and if is Pareto-dominated by an element each of and as defined above, then there is a perfect-coordination SPI such that .
We now illustrate the result graphically. We start with Case A, which is illustrated in Figure 2. The Pareto frontier is the solid line in the north and east. The points marked x indicate outcomes in . The point marked by a filled circle indicates the expected value of the default equilibrium . The vertical dashed lines starting at the two extreme x marks illustrate the application of to project onto the Pareto frontier. The dotted line between these two points is . Similarly, the horizontal dashed lines starting at x marks illustrate the application of to project onto the Pareto frontier. The line segment between these two points is . In this case, this line segment lies on the Pareto frontier. The set is simply that part of the Pareto frontier which Pareto-dominates all elements of , i.e., the part of the Pareto frontier to the north-east between the two intersections with the northern horizontal dashed line and the eastern vertical dashed line. The theorem states that for some to be a Pareto improvement, it must be in the gray area.
Case B of Theorem 15 is depicted in Figure 3. Note that here the two line segments and intersect. To ensure that a Pareto improvement is safely achievable, the theorem requires that it is below both of these lines, as indicated again by the gray area.
For a full proof, see Appendix E. Roughly, Theorem 15 is proven by re-mapping each of the outcomes of the original game as per Lemma 13. For example, the projection of the default equilibrium (i.e., the filled circle) onto is obtained as an SPI by projecting all the outcomes (i.e., all the x marks) onto . In Case A, any utility vector that Pareto-improves on all outcomes of the original game can be obtained by re-mapping all outcomes onto . Other kinds of are handled similarly.
As a corollary of Theorem 15, we can see that all (potentially unsafe) Pareto improvements in the - subset game of the Demand Game of Table 1 are equivalent to some perfect-coordination SPI. However, this is not always the case:
Proposition 16. There is a game , representatives that satisfy Assumptions 1 and 2, and an outcome s.t. for all players , but there is no perfect-coordination SPI s.t. for all players , .
As an example of such a game, consider the game in Table 7. Strategy can be eliminated by strict dominance (Assumption 1) for both players, leaving a typical Chicken-like payoff structure with two pure Nash equilibria ( and ), as well as a mixed Nash equilibrium .
Now let us say that in the resulting game for some with . Then one (unsafe) Pareto improvement would be to simply always have the representatives play for a certain payoff of . Unfortunately, there is no safe Pareto improvement with the same expected payoff. Notice that is the unique element of that maximizes the sum of the two players' utilities. By linearity of expectation and convexity of , if for any it is , it must be with certainty. However, in any safe Pareto improvement the outcomes and must correspond to outcomes that still give utilities of and , respectively, because these are Pareto-optimal within the set of feasible payoff vectors. We illustrate this as an example of Case B of Theorem 15 in Figure 4.
In the Demand Game, there happens to be a single non-trivial SPI. However, in general (even without the type of coordination assumed in Section 5) there may be multiple SPIs that result in different payoffs for the players. For example, imagine an extension of the Demand Game in which both players have an additional action , which is like , except that under , Aliceland can peacefully annex the desert. Aliceland prefers this SPI over the original one, while Bobbesia has the opposite preference. In other cases, it may be unclear to some or all of the players which of two SPIs they prefer. For example, imagine a version of the Demand Game in which one SPI mostly improves on and another mostly improves on the other three outcomes; then outcome probabilities are required to compare the two. If multiple SPIs are available, the original players would be left with the difficult decision of which SPI to demand in their instruction.9
This difficulty of choosing what SPI to demand cannot be denied. However, we would here like to emphasize that players can profit from the use of SPIs even without addressing this SPI selection problem. To do so, a player picks an instruction that is very compliant (“dove-ish”) w.r.t. what SPI is chosen, e.g., one that simply goes with whatever SPI the other players demand as long as that SPI cannot further be safely Pareto-improved upon.10 In many cases, all such SPIs benefit all players. For example, optimal SPIs in bargaining scenarios like the Demand Game remove the conflict outcome, which benefits all parties. Thus, a player can expect a safe improvement even under such maximally compliant demands on the selected SPI.
In some cases there may also be natural choices of demands (à la Schelling [48, pp. 54–58], or focal points). If the underlying game is symmetric, a symmetric safe Pareto improvement may be a natural choice. For example, the fully reduced version of the Demand Game of Table 1 is symmetric. Hence, we might expect that even if multiple SPIs were available, the original players would choose a symmetric one.
Safe Pareto improvements are a promising new idea for delegating strategic decision making. To conclude this paper, we discuss some ideas for further research on SPIs.
Straightforward technical questions arise in the context of the complexity results of Section 4.6. First, what impact on the complexity does varying the assumptions have? Our NP-completeness proof is easy to generalize at least to some other types of assumptions. It would be interesting to give a generic version of the result. We also wonder whether there are plausible assumptions under which the complexity changes in interesting ways. Second, one could ask how the complexity changes if we use more sophisticated game representations (see the remarks at the end of that section). Third, one could impose additional restrictions on the sought SPI. Fourth, we could restrict the games under consideration. Are there games in which it becomes easy to decide whether there is an SPI?
It would also be interesting to see what real-world situations can already be interpreted as utilizing SPIs, or could be Pareto-improved upon using SPIs.
This work was supported by the National Science Foundation under Award IIS-1814056. Some early work on this topic was conducted by Caspar Oesterheld while working at the Foundational Research Institute (now the Center on Long-Term Risk). For valuable comments and discussions, we are grateful to Keerti Anand, Tobias Baumann, Jesse Clifton, Max Daniel, Lukas Gloor, Adrian Hutter, Vojtěch Kovařík, Anni Leskelä, Brian Tomasik and Johannes Treutlein, and our wonderful anonymous referees. We also thank attendees of a 2017 talk at the Future of Humanity Institute at the University of Oxford, a talk at the May 2019 Effective Altruism Foundation research retreat, and our talk at AAMAS 2021.
[1] Krzysztof R. Apt. “Uniform Proofs of Order Independence for Various Strategy Elimination Procedures”. In: The B.E. Journal of Theoretical Economics 4.1 (2004), pp. 1–48. DOI: 10.2202/1534-5971.1141.
[2] Robert J. Aumann. “Correlated Equilibrium as an Expression of Bayesian Rationality”. In: Econometrica 55.1 (Jan. 1987), pp. 1–18. DOI: 10.2307/1911154.
[3] Robert J. Aumann. “Subjectivity and Correlation in Randomized Strategies”. In: Journal of Mathematical Economics 1.1 (Mar. 1974), pp. 67–97. DOI: 10.1016/0304-4068(74)90037-8.
[4] Robert Axelrod. The Evolution of Cooperation. New York: Basic Books, 1984.
[5] Mihaly Barasz et al. Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic. Jan. 2014. URL: https://arxiv.org/abs/1401.5577.
[6] Ken Binmore. Game Theory – A Very Short Introduction. Oxford University Press, 2007.
[7] Tilman Börgers. “Pure Strategy Dominance”. In: Econometrica 61.2 (Mar. 1993), pp. 423–430.
[8] Vitalik Buterin. Ethereum White Paper – A Next Generation Smart Contract & Decentralized Application Platform. Updated version available at https://github.com/ethereum/wiki/wiki/White-Paper. 2014. URL: https://cryptorating.eu/whitepapers/Ethereum/Ethereum_white_paper.pdf.
[9] Andrew M. Colman. “Salience and focusing in pure coordination games”. In: Journal of Economic Methodology 4.1 (1997), pp. 61–81. DOI: 10.1080/13501789700000004.
[10] Vincent Conitzer and Tuomas Sandholm. “Complexity of (Iterated) Dominance”. In: Proceedings of the 6th ACM conference on Electronic commerce. Vancouver, Canada: Association for Computing Machinery, June 2005, pp. 88–97. DOI: 10.1145/1064009.1064019.
[11] Vincent Conitzer and Tuomas Sandholm. “Computing the Optimal Strategy to Commit to”. In: Proceedings of the ACM Conference on Electronic Commerce (EC). Ann Arbor, MI, USA: Association for Computing Machinery, 2006, pp. 82–90.
[12] Stephen A. Cook. “The complexity of theorem-proving procedures”. In: STOC ’71: Proceedings of the third annual ACM symposium on Theory of computing. New York: Association for Computing Machinery, May 1971, pp. 151–158. DOI: 10.1145/800157.805047.
[13] Andrew Critch. “A Parametric, Resource-Bounded Generalization of Löb’s Theorem, and a Robust Cooperation Criterion for Open-Source Game Theory”. In: Journal of Symbolic Logic 84.4 (Dec. 2019), pp. 1368–1381. DOI: 10.1017/jsl.2017.42.
[14] Matthias Ehrgott. Multicriteria Optimization. 2nd ed. Berlin: Springer, 2005.
[15] Lance Fortnow. “Program equilibria and discounted computation time”. In: Proceedings of the 12th Conference on Theoretical Aspects of Rationality and Knowledge (TARK ’09). July 2009, pp. 128–133. DOI: 10.1145/1562814.1562833.
[16] Joaquim Gabarró, Alina García, and Maria Serna. “The complexity of game isomorphism”. In: Theoretical Computer Science 412.48 (Nov. 2011), pp. 6675–6695. DOI: 10.1016/j.tcs.2011.07.022.
[17] David Gale. “A Theory of N-Person Games with Perfect Information”. In: Proceedings of the National Academy of Sciences of the United States of America 39.6 (June 1953), pp. 496–501. DOI: 10.1073/pnas.39.6.496.
[18] David Gauthier. “Coordination”. In: Dialogue 14.2 (June 1975), pp. 195–221. DOI: 10.1017/S0012217300043365.
[19] Itzhak Gilboa, Ehud Kalai, and Eitan Zemel. “On the order of eliminating dominated strategies”. In: Operations Research Letters 9.2 (Mar. 1990), pp. 85–89. DOI: 10.1016/0167-6377(90)90046-8.
[20] John C. Harsanyi and Reinhard Selten. A General Theory of Equilibrium Selection in Games. Cambridge, MA: The MIT Press, 1988.
[21] Bengt Robert Holmström. “On Incentives and Control in Organizations”. PhD thesis. Stanford University, Dec. 1977.
[22] J. V. Howard. “Cooperation in the Prisoner’s Dilemma”. In: Theory and Decision 24 (May 1988), pp. 203–213. DOI: 10.1007/BF00148954.
[23] Leonid Hurwicz and Leonard Shapiro. In: The Bell Journal of Economics 9.1 (1978), pp. 180–191. DOI: 10.2307/3003619.
[24] Adam Tauman Kalai et al. “A commitment folk theorem”. In: Games and Economic Behavior 69 (2010), pp. 127–137. DOI: 10.1016/j.geb.2009.09.008.
[25] Jon Kleinberg and Robert Kleinberg. “Delegated Search Approximates Efficient Search”. In: Proceedings of the 19th ACM Conference on Economics and Computation (EC). 2018.
[26] Frank H. Knight. Risk, Uncertainty, and Profit. Boston, MA, USA: Houghton Mifflin Company, 1921.
[27] Elon Kohlberg and Jean-Francois Mertens. “On the Strategic Stability of Equilibria”. In: Econometrica 54.5 (Sept. 1986), pp. 1003–1037. DOI: 10.2307/1912320.
[28] Jean-Jacques Laffont and David Martimort. The Theory of Incentives – The Principal-Agent Model. Princeton, NJ: Princeton University Press, 2002.
[29] Richard A. Lambert. “Executive Effort and Selection of Risky Projects”. In: RAND J. Econ. 17.1 (1986), pp. 77–88.
[30] David Lewis. Convention. Harvard University Press, 1969.
[31] R. Duncan Luce and Howard Raiffa. Games and Decisions. Introduction and Critical Survey. New York: Dover Publications, 1957.
[32] Leslie M. Marx and Jeroen M. Swinkels. “Order Independence for Iterated Weak Dominance”. In: Games and Economic Behavior 18 (1997), pp. 219–245. DOI: 10.1006/game.1997.0525.
[33] R. Preston McAfee. “Effective Computability in Economic Decisions”. May 1984. URL: https://www.mcafee.cc/Papers/PDF/EffectiveComputability.pdf.
[34] Dov Monderer and Moshe Tennenholtz. “Strong mediated equilibrium”. In: Artificial Intelligence 173.1 (Jan. 2009), pp. 180–195. DOI: 10.1016/j.artint.2008.10.005.
[35] John von Neumann. “Zur Theorie der Gesellschaftsspiele”. In: Mathematische Annalen 100 (1928), pp. 295–320. DOI: 10.1007/BF01448847.
[36] Caspar Oesterheld. “Robust Program Equilibrium”. In: Theory and Decision 86.1 (Feb. 2019), pp. 143–159.
[37] Caspar Oesterheld and Vincent Conitzer. “Minimum-regret contracts for principal-expert problems”. In: Proceedings of the 16th Conference on Web and Internet Economics (WINE). 2020.
[38] Hessel Oosterbeek, Randolph Sloof, and Gijs van de Kuilen. “Cultural Differences in Ultimatum Game Experiments: Evidence from a Meta-Analysis”. In: Experimental Economics 7 (June 2004), pp. 171–188. DOI: 10.1023/B:EXEC.0000026978.14316.74.
[39] Martin J. Osborne. An Introduction to Game Theory. New York: Oxford University Press, 2004.
[40] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. The MIT Press, 1994.
[41] David G. Pearce. “Rationalizable Strategic Behavior and the Problem of Perfection”. In: Econometrica 54.4 (July 1984), pp. 1029–1050.
[42] Guillaume Perez. “Decision diagrams: constraints and algorithms”. PhD thesis. Université Côte d’Azur, 2017. URL: https://tel.archives-ouvertes.fr/tel-01677857/document.
[43] Martin Peterson. An Introduction to Decision Theory. Cambridge University Press, 2009.
[44] Steven Pinker. How the Mind Works. W. W. Norton, 1997.
[45] Werner Raub. “A General Game-Theoretic Model of Preference Adaptions in Problematic Social Situations”. In: Rationality and Society 2.1 (Jan. 1990), pp. 67–93.
[46] Ariel Rubinstein. Modeling Bounded Rationality. Ed. by Karl Gunnar Persson. Zeuthen Lecture Book Series. The MIT Press, 1998.
[47] Alexander Savelyev. “Contract law 2.0: ‘Smart’ contracts as the beginning of the end of classic contract law”. In: Information & Communications Technology Law 26.2 (2017), pp. 116–134. DOI: 10.1080/13600834.2017.1301036.
[48] Thomas C. Schelling. The Strategy of Conflict. Cambridge, MA: Harvard University Press, 1960.
[49] Thomas C. Schelling. “The Strategy of Conflict Prospectus for a Reorientation of Game Theory”. In: The Journal of Conflict Resolution 2.3 (Sept. 1958), pp. 203–264.
[50] Alexander Schrijver. Theory of Linear and Integer Programming. Chichester, UK: John Wiley & Sons, 1998.
[51] Amartya Sen. “Choice, orderings and morality”. In: Practical Reason. Ed. by Stephan Körner. New Haven, CT, USA: Basil Blackwell, 1974. Chap. II, pp. 54–67.
[52] Heinrich von Stackelberg. “Marktform und Gleichgewicht”. In: Vienna: Springer, 1934, pp. 58–70.
[53] Neal M. Stoughton. “Moral Hazard and the Portfolio Management Problem”. In: The Journal of Finance 48.5 (Dec. 1993), pp. 2009–2028. DOI: 10.1111/j.1540-6261.1993.tb05140.x.
[54] Robert Sugden. In: The Economic Journal 105.430 (May 1995), pp. 533–550. DOI: 10.2307/2235016.
[55] Moshe Tennenholtz. “Program equilibrium”. In: Games and Economic Behavior 49.2 (Nov. 2004), pp. 363–373.
[56] Johannes Treutlein et al. “A New Formalism, Method and Open Issues for Zero-Shot Coordination”. In: Proceedings of the Thirty-eighth International Conference on Machine Learning (ICML’21). 2021.
[57] Wiebe van der Hoek, Cees Witteveen, and Michael Wooldridge. “Program equilibrium – a program reasoning approach”. In: International Journal of Game Theory 42.3 (Aug. 2013), pp. 639–671.
[58] Luc N. van Wassenhove and Ludo F. Gelders. “Solving a bicriterion scheduling problem”. In: European Journal of Operational Research 4 (1980), pp. 42–48.
[59] Bernhard Von Stengel and Shmuel Zamir. Leadership with commitment to mixed strategies. Tech. rep. LSE-CDAM-2004-01. London School of Economics, 2004. URL: http://www.cdam.lse.ac.uk/Reports/Files/cdam-2004-01.pdf.
This paper considers the meta-game of delegation. SPIs are a proposed way of playing these games. However, throughout most of this paper, we do not analyze the meta-game directly as a game using the typical tools of game theory. We here fill that gap and in particular prove Theorem 1, which shows that SPIs are played in Nash equilibria of the meta-game, assuming sufficiently strong contracting abilities. As noted, this result is essential. However, since it is mostly an application of existing ideas from the literature on program equilibrium, we left a detailed treatment out of the main text.
A program game for is defined via a set and a non-deterministic mapping . We obtain a new game with action sets and utility function
Though this definition is generic, one generally imagines in the program equilibrium literature that for all , consists of computer programs in some programming language, such as Lisp, that take as input vectors in and return an action . The function on input then executes each player 's program on to assign an action. The definition implicitly assumes that only contains programs that halt when fed one another as input (or that not halting is mapped onto some action). As is usually done in the program equilibrium literature, we will leave unspecified what constraints are used to ensure this. A program equilibrium is then simply a Nash equilibrium of the program game.
For the present paper, we add the following feature to the underlying programming language. A program can call a “black box subroutine” for any subset game of , where is a random variable over and .
We need one more definition. For any game and player , we define Player 's threat point (a.k.a. minimax utility) as
In words, is the minimum utility that the players other than can force onto , under the assumption that reacts optimally to their strategy. We further will use to denote the strategy for Player that is played in the minimizer of the above. Of course, in general, there might be multiple minimizers . In the following, we will assume that the function breaks such ties in some consistent way, such that for all ,
Note that for , each player's threat point is computable in polynomial time via linear programming; and that by the minimax theorem [35], the threat point is equal to the maximin utility, i.e.,
so is also the minimum utility that Player can guarantee for herself under the assumption that the opponent sees her mixed strategy and reacts in order to minimize Player 's utility.
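As a concrete illustration, the following sketch (ours) computes a player's threat point in a 2-player game via the standard maximin linear program, using scipy.optimize.linprog. U is the player's own payoff matrix, with her actions as rows and the opponent's actions as columns.

import numpy as np
from scipy.optimize import linprog

def threat_point(U):
    """Maximin value of the player's own payoffs, which by the minimax theorem
    equals her threat point in a 2-player game."""
    U = np.asarray(U, dtype=float)
    m, n = U.shape
    # Variables: the player's mixed strategy x (length m) and the value w.
    # Maximize w subject to: (x' U)_j >= w for every opponent column j, sum(x) = 1.
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # minimize -w
    A_ub = np.hstack([-U.T, np.ones((n, 1))])      # w - (x' U)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = [1.0]
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

# Example: Matching Pennies payoffs for the row player; the threat point is 0.
print(round(threat_point([[1, -1], [-1, 1]]), 6))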
Tennenholtz’s [55] main result on program games is the following:
Theorem 17 (Tennenholtz 2004 [55]). Let be a game and let be a (feasible) payoff vector. If for , then is the utility of some program equilibrium of a program game on
Throughout the rest of this section, our goal is to use ideas similar to those Tennenholtz used for Theorem 17 to construct, for any SPI on , a program equilibrium that results in the play of . As noted in the main text, Player 's instruction to her representative to play the game will usually be conditional on the other player telling her representative to also play her part of and vice versa. After all, if Player simply tells her representative to maximize from regardless of Player 's instruction, then Player will often be able to profit from deviating from the instruction. For example, in the safe Pareto improvement on the Demand Game, each player would only want their representative to choose from rather than if the other player's representative does the same. It would then seem that in a program equilibrium in which is played, each program would have to contain a condition of the type, “if the opponent code plays as in against me, I also play as I would in .” But in a naive implementation of this, each of the programs would have to call the other, leading to an infinite recursion.
In the literature on program equilibrium, various solutions to this problem have been discovered. We here use the general scheme proposed by Tennenholtz [55], because it is the simplest. We could similarly use the variant proposed by Fortnow [15], techniques based on Löb's theorem [5, 13], -grounded mutual simulation [36], or even (meta) Assurance Game preferences (see Appendix B).
In our equilibrium, we let each player submit code as sketched in Algorithm 2. Roughly, each player uses a program that says, “if everyone else submitted the same source code as this one, then play . Otherwise, if there is a player who submitted a different source code, punish player by playing her strategy”. Note that for convenience, Algorithm 2 receives the player number as input. This way, every player can use the exact same source code. Otherwise, the original players would have to provide slightly different programs, and in line 2 of the algorithm we would have to use a more complicated comparison, roughly: “if are the same, except for the player index used”.
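Since Algorithm 2 is not reproduced here, the following Python fragment (ours) is a rough rendering of the program just described; play_spi_game and punishment_strategy are placeholder callbacks standing in for the black-box subroutine that plays the player's part of the SPI game and for the minimax punishment strategy, respectively.

import inspect

def my_program(i, submitted_programs, play_spi_game, punishment_strategy):
    """i: own player index; submitted_programs: the source code submitted by every player."""
    own_source = inspect.getsource(my_program)
    if all(src == own_source for src in submitted_programs):
        # Everyone submitted this very program: play one's part of the SPI game.
        return play_spi_game(i)
    # Otherwise, punish (one of) the deviating player(s).
    j = next(k for k, src in enumerate(submitted_programs) if src != own_source)
    return punishment_strategy(i, j)

# Toy usage with placeholder subroutines:
src = inspect.getsource(my_program)
action = my_program(0, [src, src],
                    play_spi_game=lambda i: f"play SPI action for player {i}",
                    punishment_strategy=lambda i, j: f"minimax against player {j}")
print(action)  # everyone complied, so player 0 plays her part of the SPI game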
Proposition 18. Let be a game and let be an SPI on . Let be the program profile consisting only of Algorithm 2 for each player. Assume that guarantees each player at least threat point utility in expectation. Then is a program equilibrium and .
Proof. By inspection of Algorithm 2, we see that . It is left to show that is a Nash equilibrium. So let be any player and . We need to show that . Again, by inspection of , is the threat point of Player . Hence,
as required.
Theorem 1 follows immediately.
We here discuss Raub’s [45] paper in some detail, which in turn elaborates on an idea by Sen [51]. Superficially, Raub’s setting seems somewhat similar to ours, but we here argue that it should be thought of as closer to the work on program equilibrium and bilateral precommitment. In Sections 1, 3 and 3.2, we briefly discuss multilateral commitment games, which have been discussed before in various forms in the game-theoretic literature. Our paper extends this setting by allowing instructions that let the representatives play a game without specifying an algorithm for solving that game. At first sight, it appears that Raub pursues a very similar idea. Translated to our setting, Raub allows that as an instruction, each player chooses a new utility function , where is the set of outcomes of the original game . Given instructions , the representatives then play the game . In particular, each representative can see what utility functions all the other representatives have been instructed to maximize. However, what utility function representative maximizes is not conditional on any of the instructions by other players. In other words, the instructions in Raub's paper are raw utility functions without any surrounding control structures, etc. Raub then asks for equilibria of the meta-game that Pareto-improve on the default outcome.
To better understand how Raub's approach relates to ours, we here give an example of the kind of instructions Raub has in mind. (Raub uses the same example in his paper.) As the underlying game , we take the Prisoner's Dilemma. Now the main idea of his paper is that the original players can instruct their representatives to adopt so-called Assurance Game preferences. In the Prisoner's Dilemma, this means that the representatives prefer to cooperate if the other representative cooperates, and prefer to defect if the other player defects. Further, they prefer mutual cooperation over mutual defection. An example of such Assurance Game preferences is given in Table 8. (Note that this payoff matrix resembles the classic Stag Hunt studied in game theory.)
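The following sketch (ours, with illustrative payoff numbers rather than Raub's Table 8) checks the pure Nash equilibria of both games: under the original Prisoner's Dilemma preferences only mutual defection is an equilibrium, whereas under the Assurance Game preferences mutual cooperation becomes a Pareto-optimal equilibrium alongside mutual defection.

import numpy as np

# Illustrative payoffs (not Raub's numbers). Index 0 = Cooperate, 1 = Defect;
# entry [i][j] is (row player's, column player's) payoff.
pd = np.array([[(3, 3), (0, 4)], [(4, 0), (1, 1)]])            # Prisoner's Dilemma
assurance = np.array([[(4, 4), (0, 1)], [(1, 0), (3, 3)]])     # Assurance Game preferences

def pure_nash_equilibria(game):
    """Outcomes where neither player can gain by a unilateral pure-strategy deviation."""
    n_rows, n_cols, _ = game.shape
    return [
        (i, j)
        for i in range(n_rows)
        for j in range(n_cols)
        if game[i, j, 0] == max(game[k, j, 0] for k in range(n_rows))
        and game[i, j, 1] == max(game[i, k, 1] for k in range(n_cols))
    ]

print(pure_nash_equilibria(pd))         # [(1, 1)]: only mutual defection
print(pure_nash_equilibria(assurance))  # [(0, 0), (1, 1)]: mutual cooperation is now stable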
The Assurance Game preferences have two important properties.
The first important difference between Raub's approach and ours is related to item 2. We have ignored the issue of making SPIs Nash equilibria of our meta-game. As we have explained in Section 3.2 and Appendix A, we imagine that this is taken care of by additional bilateral commitment mechanisms that are not the focus of this paper. For Raub's paper, on the other hand, ensuring mutual cooperation to be stable in the new game is arguably the key idea. Still, we could pursue the approach of the present paper even when we limit instructions to those that consist only of a utility function.
The second difference is even more important. Raub assumes that – as in the PD – the default outcome of the game ( in the formalism of this paper) is known. (Less significantly, he also assumes that it is known how the representatives play under assurance game preferences.) Of course, the key feature of the setting of this paper is that the underlying game might be difficult (through equilibrium selection problems) and thus that the original players might be unable to predict .
These are the reasons why we cite Raub in our section on bilateral commitment mechanisms. Arguably, Raub's paper could be seen as very early work on program equilibrium, except that he uses utility functions as a programming language for representatives. In this sense, Raub's Assurance Game preferences are analogous to the program equilibrium schemes of Tennenholtz [55], Oesterheld [55], Barasz et al. [5] and van der Hoek, Witteveen, and Wooldridge [57], listed in increasing order of similarity of the main idea of the scheme.
Lemma 4. Let and be isomorphisms between and . If is (strictly) Pareto-improving, then so is .
Proof.
First, we argue that if and are isomorphisms, then they are isomorphisms relative to the same constants and . For each player , we distinguish two cases. First the case where all outcomes in have the same utility for Player is trivial. Now imagine that the outcomes of do not all have the same utility. Then let and be the lowest and highest utilities, respectively, in . Further, let and be the lowest and highest utilities, respectively, in . It is easy to see that if is a game isomorphism, it maps outcomes with utility in onto outcomes with utility in , and outcomes with utility in onto outcomes with utility in . Thus, if and are to be the constants for , then
Since , this system of linear equations has a unique solution. By the same pair of equations, the constants for are uniquely determined.
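Spelled out with explicit (hypothetical) symbol names, where \(l_i, h_i\) denote the lowest and highest utilities of Player \(i\) in the first game, \(l_i', h_i'\) those in the second, and \(c_i > 0, d_i\) the constants of the affine transformation, the system reads
\[
c_i\, l_i + d_i = l_i', \qquad c_i\, h_i + d_i = h_i',
\]
which, since \(h_i \neq l_i\), has the unique solution \(c_i = (h_i' - l_i')/(h_i - l_i)\) and \(d_i = l_i' - c_i\, l_i\).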
It follows that for all ,
Furthermore, if is strictly Pareto-improving for some , then by bijectivity of , there is s.t. . For this , the inequality above is strict and therefore .
We here prove Theorem 9. We assume familiarity with basic ideas in computational complexity theory (non-deterministic polynomial time (NP), reductions, NP-completeness, etc.).
Throughout our proof we will use a result about the structure of relevant outcome correspondences. Before proving this result, we give two lemmas. The first is a well-known lemma about elimination by strict dominance.
Lemma 19 (path independence of iterated strict dominance). Let be a game in which some strategy of player is strictly dominated. Let be a game we obtain from by removing a strictly dominated strategy (of any player) other than . Then is strictly dominated in .
Note that this lemma does not by itself prove that iterated strict dominance is path independent. However, path independence follows from the property shown by this lemma.
Proof. Let be the strategy that strictly dominates . We distinguish two cases:
Case 1: The strategy removed is . Then there must be that strictly dominates . Then, for all ,
Both inequalities are due to the definition of strict dominance. We conclude that must strictly dominate .
Case 2: The strategy removed is one other than or . Since the set of strategies of the new game is a subset of the strategies of the old game, it still holds for each strategy in the new game that
i.e., still strictly dominates .
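Since Assumption 1 and the following lemmas repeatedly appeal to iterated elimination of strictly dominated strategies, here is a minimal Python sketch of how the reduction could be computed for a two-player game given by payoff matrices; the data representation (nested lists of payoffs indexed by pure strategies) and the restriction to pure-strategy dominance are illustrative assumptions.

```python
def iterated_elimination(u1, u2):
    """Iterated elimination of strictly dominated pure strategies in a
    two-player game. u1[i][j] and u2[i][j] are the payoffs of players 1
    and 2 when player 1 plays row i and player 2 plays column j.
    Returns the sets of surviving row and column indices."""
    rows = set(range(len(u1)))
    cols = set(range(len(u1[0])))

    def row_dominated(r):
        # r is strictly dominated if some other surviving row is strictly
        # better against every surviving column.
        return any(all(u1[s][c] > u1[r][c] for c in cols)
                   for s in rows if s != r)

    def col_dominated(c):
        return any(all(u2[r][s] > u2[r][c] for r in rows)
                   for s in cols if s != c)

    changed = True
    while changed:
        changed = False
        for r in [r for r in rows if row_dominated(r)]:
            rows.discard(r)
            changed = True
        for c in [c for c in cols if col_dominated(c)]:
            cols.discard(c)
            changed = True
    return rows, cols
```

By the path-independence property discussed above, removing all currently dominated strategies in each pass yields the same fully reduced game as removing them one at a time.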
The next lemma shows that instead of first applying Assumption 1 plus symmetry (Lemma 2.2) to add a strictly dominated action and then applying Assumption 1 to eliminate a different strictly dominated strategy, we could also first eliminate the strictly dominated strategy and then add the other strictly dominated strategy.
Lemma 20. Let by Assumption 1, where is the reduced game, and by Assumption 1. Then either or there is a game s.t. by Assumption 1 and by Assumption 1.
Proof. By the assumption both and can be obtained from eliminating a strictly dominated action from . Let these actions be and , respectively. If , then . So for the rest of this proof assume . Let be the game we obtain by removing from . We now show the two outcome correspondences:
We are ready to state our lemma about the structure of outcome correspondences.
Lemma 21. Let
where each outcome correspondence is due to a single application of Assumption 1, Assumption 1 plus symmetry (Lemma 2.2) or Assumption 2. Then there is a sequence with and , and such that
all by single applications of Assumption 1, and are fully reduced games such that by a single application of Assumption 2, and
all by single applications of Assumption 1 with Lemma 2.2.
A more concise way to state the consequence is that there must be games , and such that is obtained from by iterated elimination of strictly dominated strategies, is isomorphic to , and is obtained from by iterated elimination of strictly dominated strategies.
Proof. First divide the given sequence of outcome correspondences up into periods that are maximally long while containing only correspondences by Assumption 1 (with or without Lemma 2.2). That is, consider subsequences of the form such that:
In each such period apply Lemma 20 iteratively to either eliminate or move to the right all inverted reduction elimination steps.
In all but the first period, contains no strictly dominated actions (by stipulation of Assumption 2). Hence, no period other than the first can contain any non-reversed elimination steps. Similarly, in all but the final period, contains no strictly dominated actions. Hence, in all but the final period, there can be no reversed applications of Assumption 1.
Overall, our new sequence of outcome correspondences thus has the following structure: first there is a sequence of elimination steps via Assumption 1, then there is a sequence of isomorphism steps, and finally there is a sequence of reverse elimination steps. We can summarize all the applications of Assumption 2 into a single step applying that assumption to obtain the claimed structure.
Now notice that the reverse elimination steps are only relevant for deriving unilateral SPIs. Using the above concise formulation of the lemma, we can always simply use itself as an omnilateral SPI – it is not relevant that there is some subset game that reduces to .
Lemma 22. As in Lemma 21, let , where each outcome correspondence is due to a single application of Assumption 1, Assumption 1 plus symmetry (Lemma 2.2) or Assumption 2. Let all be subset games of . Moreover, let be Pareto improving. Then there is a sequence of subset games such that all by applications of Assumption 1 (without applying symmetry), and by application of Assumption 2 such that is Pareto improving.
Proof. First apply Lemma 21. Then notice that the correspondence functions from applying Assumption 1 with symmetry have no effect on whether the overall outcome correspondence is Pareto improving.
We now show that the SPI problem is in NP. The following algorithm can be used to determine whether there is a safe Pareto improvement: Reduce the given game until it can be reduced no further to obtain some subset game . Then non-deterministically select injections . If is (strictly) Pareto-improving (as required in Theorem 3), return True with the solution defined as follows: The set of action profiles is defined as . The utility functions are
Otherwise, return False.
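A deterministic version of this guess-and-check procedure (for two players) could look roughly as follows; the data representation, the enumeration of injections via itertools, and the exact reading of "strictly Pareto-improving" (weak improvement for every player at every profile, strict somewhere) are illustrative assumptions rather than the paper's precise formulation.

```python
from itertools import permutations, product

def find_spi_correspondence(game_actions, game_u, red_actions, red_u, strict=False):
    """Sketch of the guess-and-check procedure described above (two players).

    game_actions[i]: list of player i's actions in the original game
    game_u[i]:       dict mapping an action profile (a1, a2) to player i's utility
    red_actions, red_u: the same data for the fully reduced subset game
    Returns injections (phi1, phi2) from reduced to original actions that are
    (strictly) Pareto-improving, or None if no such injections exist.
    """
    players = range(2)
    for images in product(*(permutations(game_actions[i], len(red_actions[i]))
                            for i in players)):
        phi = [dict(zip(red_actions[i], images[i])) for i in players]
        weakly_improving = True
        strictly_somewhere = False
        for profile in product(*red_actions):
            mapped = tuple(phi[i][profile[i]] for i in players)
            for i in players:
                if game_u[i][mapped] < red_u[i][profile]:
                    weakly_improving = False
                elif game_u[i][mapped] > red_u[i][profile]:
                    strictly_somewhere = True
        if weakly_improving and (strictly_somewhere or not strict):
            return phi
    return None
```

The non-deterministic algorithm in the text guesses the injections instead of enumerating them, which is what places the problem in NP; the enumeration above is only meant to make the acceptance condition explicit.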
Proposition 23. The above algorithm runs in non-deterministic polynomial time and returns True if and only if there is a (strict) omnilateral SPI.
Proof. It is easy to see that this algorithm runs in non-deterministic polynomial time. Furthermore, with Lemma 4 it is easy to see that if this algorithm finds a solution , that solution is indeed a safe Pareto improvement. It is left to show that if there is a safe Pareto improvement via a sequence of Assumption 1 and 2 outcome correspondences, then the algorithm indeed finds a safe Pareto improvement.
Let us say there is a sequence of outcome correspondences as per Assumptions 1 and 2 that show for Pareto-improving . Then by Lemma 22, there is such that via applying Assumption 1 iteratively to obtain a fully reduced and via a single application of Assumption 2. By construction, our algorithm finds (guesses) this Pareto-improving outcome correspondence.
Overall, we have now shown that our non-deterministic polynomial-time algorithm is correct and therefore that the SPI problem is in NP. Note that the correctness of other algorithms can be proven using very similar ideas. For example, instead of first reducing and then finding an isomorphism, one could first find an isomorphism, then reduce and then (only after reducing) test whether the overall outcome correspondence function is Pareto-improving. One advantage of reducing first is that there are fewer isomorphisms to test if the game is smaller. In particular, the number of possible isomorphisms is exponential in the number of strategies in the reduced game but polynomial in everything else. Hence, by implementing our algorithm deterministically, we obtain the following positive result.
Proposition 24. For games with that can be reduced (via iterative application of Assumption 1) to a game with , the (strict) omnilateral SPI decision problem can be solved in .
Next we show that the problem of finding unilateral SPIs is also in NP. Here we need a slightly more complicated algorithm: We are given an -player game and a player . First reduce the game fully to obtain some subset game . Then non-deterministically select injections . The resulting candidate SPI game then is
where for all , and is arbitrary for . Return True if the following conditions are satisfied:
Otherwise, return False.
Proposition 25. The above algorithm runs in non-deterministic polynomial time and returns True if and only if there is a (strict) unilateral SPI.
Proof. First we argue that the algorithm can indeed be implemented in non-deterministic polynomial time. For this, notice that for checking Item 2, the constants can be found by solving systems of linear equations in two variables.
It is left to prove correctness, i.e., that the algorithm returns True if and only if there exists an SPI. We start by showing that if the algorithm returns True, then there is an SPI. Specifically, we show that if the algorithm returns True, the game is indeed an SPI game. Notice that for some by iterative application of Assumption 1 with Transitivity (Lemma 2.2); that by application of Assumption 2. Finally, for some by iterative application of Assumption 1 to , plus transitivity (Lemma 2.3) with reversal (Lemma 2.2).
It is left to show that if there is an SPI, then the above algorithm will find it and return True. To see this, notice that Lemma 21 implies that there is a sequence of outcome correspondences . We can assume that and have the same action sets for Player . It is easy to see that in we could modify the utilities for any that is not in , because Player 's utilities do not affect the elimination of strictly dominated strategies from .
Proposition 26. For games with that can be reduced (via iterative application of Assumption 1) to a game with , the (strict) unilateral SPI decision problem can be solved in .
We now proceed to show that the safe Pareto improvement problem is NP-hard. We will do this by reducing the subgraph isomorphism problem to the (two-player) safe Pareto improvement problem. We start by briefly describing one version of that problem here.
A (simple, directed) graph is a tuple , where and . We call the adjacency function of the graph. Since the graph is supposed to be simple and therefore free of self-loops (edges from one vertex to itself), we take the values for to be meaningless.
For given graphs , a subgraph isomorphism from to is an injection such that for all
In words, a subgraph isomorphism from to identifies for each node in a node in s.t. if there is an edge from node to node in , there must also be an edge in the same direction between the corresponding nodes in . Another way to say this is that we can remove some set of () nodes and some edges from to get a graph that is just a relabeled (isomorphic) version of .
Definition 8. Given two graphs , the subgraph isomorphism problem consists in deciding whether there is a subgraph isomorphism between .
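For reference, Definition 8 can be checked by brute force in a few lines of Python (in exponential time, consistent with the NP-completeness result cited next); representing a graph by its vertex list and a set of directed edges is an assumption made for this sketch.

```python
from itertools import permutations

def has_subgraph_isomorphism(vertices_g1, edges_g1, vertices_g2, edges_g2):
    """Brute-force test for a subgraph isomorphism from G1 to G2.

    vertices_g1, vertices_g2: lists of vertices
    edges_g1, edges_g2:       sets of directed edges (u, v)
    Returns True iff there is an injection f from G1's vertices into G2's
    vertices such that every edge (u, v) of G1 is mapped to an edge
    (f(u), f(v)) of G2.
    """
    for image in permutations(vertices_g2, len(vertices_g1)):
        f = dict(zip(vertices_g1, image))
        if all((f[u], f[v]) in edges_g2 for (u, v) in edges_g1):
            return True
    return False
```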
The following result is well-known.
Lemma 27 ([12, Theorem 2]). The subgraph isomorphism problem is NP-complete.
Lemma 28. The subgraph isomorphism problem is reducible in linear time with linear increase in problem instance size to the (strict) (unilateral) safe Pareto improvement problem for two players. As a consequence, the (strict) (unilateral) safe Pareto improvement problem is NP-hard.
Proof. Let and be graphs. Without loss of generality assume both graphs have at least vertices, i.e., that . For this proof, we define for any .
We first define two games, one for each graph, and then a third game that contains the two.
The game for is the game as in Table 9. Formally, let . Then we let , where . The utility functions are defined via
and
We define based on analogously, except that in Player 1's utilities we use instead of , instead of , instead of and instead of .
We now define from and as sketched in Table 9. For the following let
and and . For , let be the utility of in . For let be the utility of in . Finally, define for all and all ; and for all and all .
It is easy to show that this reduction can be computed in linear time and that it also increases the problem instance size only linearly.
To prove our claim, we need to prove the following two propositions:
1. We start with the first claim. Assume there is a subgraph isomorphism from to . We construct our SPI as usual: first we reduce the game by iterated elimination of strictly dominated strategies, then we find a Pareto-improving outcome equivalence between the reduced game and some subset game of . Finally, we show that arises from removing strictly dominated strategies from subset game of . It is easy to see that the game resulting from iterated elimination of strictly dominated strategies is just the part of it. Abusing notation a little, we will in the following just call this (even though it has somewhat differently named action sets).
Next we define a pair of functions , which will later form our isomorphism. For all and , we define via
Define and so on analogously.
Now define to be the subset game with action sets and , where and are the action sets of ; and with utility functions
and (as restricted to ).
We must now show that is a game isomorphism between and . First, it is easy to see that for , is a bijection between and . Moreover,
For player 2, we need to distinguish the different cases of actions. Since each case is trivial from looking at the definition of and we omit the detailed proof.
Next we need to show that is strictly Pareto-improving as judged by the original players' utility function . Again, this is done by distinguishing a large number of cases of action profiles , all of which are trivial on their own. The most interesting one is that of for with because this is where we use the fact that is a subgraph isomorphism:
We omit the other cases.
It is left to construct a unilateral subset game of such that reduces to via iterated elimination of strictly dominated strategies. Let , where we set arbitrarily for .
We now show that reduces to via repeated application of Assumption 1. So let . We distinguish the following cases:
Note that and are both in by construction of .
2. It is left to show that if there is any kind of non-trivial SPI, there is also a subgraph isomorphism from to .
By Lemma 21, if there is an SPI, there are bijections that are jointly Pareto-improving from the reduced game to . From these functions we will construct a subgraph isomorphism. However, to do so (and to see that the resulting function is indeed a subgraph isomorphism), we need to first make a few simple observations about the structure of and . Define and .
It then follows that , since apart from the outcomes we have already mapped to, no other outcome gives Player a utility of . Next it follows that , again because all outcomes with utility at least for Player outside of are already mapped to. And so on, until we obtain that . By an analogous line of argument we can show that
Together these equalities uniquely specify and .
in contradiction to the assumption that is Pareto improving.
We are ready to construct our subgraph isomorphism. For , define to be the second element of the pair . By Item c, can equivalently be defined as the second item in the pair . By Item d, is a function from to . By assumption about , is injective. Further, by construction of and , as well as the assumption that is Pareto improving, we infer that for all with ,
We conclude that is a subgraph isomorphism.
Proof. We will give the proof based on the graphs as well, without giving all formal details. Further, we assume in the following that neither nor consists of just a single point, since these cases are easy.
\underline{Case A}: Note first that by Corollary 14 it is enough to show that if is in any of the listed sets , it can be made safe.
It is easy to see that all payoff vectors on the curve segment of the Pareto frontier are safely achievable. After all, all payoff vectors in this set Pareto-improve on all outcomes in . Hence, for each on the line segment, one could select the where .
It is left to show that all elements of are safely achievable. Remember that not all payoff vectors on the line segments are Pareto improvements, only those that are to the north-east of (Pareto-better than) the default utility. In the following, we will use and to denote those elements of and , respectively, that are Pareto-improvements on the default.
We now argue that the Pareto improvement on the line for which is safely achievable. In other words, is the projection northward of the default utility, or . This is also one of the endpoints of . To achieve this utility, we construct the equivalent game as per Lemma 13, where the utility to the original players of each outcome of the new game is similarly the projection northward onto of the utility of the corresponding outcome in . That is,
Note that because is convex and the endpoints of the line segment are by definition in , it is . Hence, all values of thus defined are feasible. Because all outcomes in the original game lie below the line , is linear. Hence,
as required.
We have now shown that one of the endpoints of is safely achievable. Since the other endpoint of is in , it is also safely achievable. By Corollary 14, this implies that all of is safely achievable.
By an analogous line of reasoning, we can also show that all elements of are safely achievable.
\underline{Case B}: Define as before as those elements of respectively that Pareto improve on the default . By a similar argument as before, one can show that the utility is safely achievable both for and for . Call these points and , respectively.
We now proceed in two steps. First, we will show that there is a third safely achievable utility point , which is above both and . Then we will show the claim using that point.
To construct , we again construct an SPI as per Lemma 13. For each we will set the utility of the corresponding to be above or on both and , i.e., on or above a set which we will refer to as . Formally, is the set of outcomes in that are not strictly Pareto dominated by some other element of . Note that by definition every outcome in is Pareto-dominated by some outcome in either or . Hence, by transitivity of Pareto dominance, each outcome is Pareto-dominated by some outcome in . Hence, the described is indeed feasible.
Now note that the set of feasible payoffs of is convex. Further, the curve is concave. Because the area above a concave curve is convex and because the intersection of convex sets is convex, the set of feasible payoffs on or above is also convex. By definition of convexity, is therefore also in the set of feasible payoffs on or above and therefore above both and as desired.
In our second step, we now use to prove the claim. Because of convexity of the set of safely achievable payoff vectors as per Corollary 14, all utilities below the curve consisting of the line segments from to and from to are safely achievable. The line that goes through intersects the line that contains at , by definition. Since non-parallel lines intersect each other exactly once, since parallel lines that intersect each other are identical, and because is above or on , the line segment from to lies entirely on or above . Similarly, it can be shown that the line segment from to lies entirely on or above . It follows that the curve lies entirely above or on . Now take any Pareto improvement that lies below both and . Then this Pareto improvement lies below and therefore below the curve. Hence, it is safely achievable.
The post Safe Pareto Improvements for Delegated Game Playing appeared first on Center on Long-Term Risk.
The post Operations Associate / Manager appeared first on Center on Long-Term Risk.
The Center on Long-Term Risk is seeking an Operations Associate to work on supporting and improving the operational processes and infrastructure that enable our researchers’ work. You will therefore act as a multiplier on our team’s productivity, and so play an important role in furthering CLR’s mission to address worst-case risks from the development and deployment of advanced AI systems. (More experienced candidates may be appointed as Operations Manager.)
In this role, you would become the second full member of our Operations team, reporting to our Operations Lead, and taking on responsibilities across diverse areas such as office management, HR, finance, compliance and recruitment – making this role ideal for quickly gaining operations experience. You will receive mentorship from an experienced operations team, and become familiar with existing operational processes in a well-running organisation, as you work to improve and supplement them.
To apply for this role, please submit this application form. The deadline for applications is the end of Sunday 11th September (precisely: 7:30am British Summer Time on Monday 12th). We expect the form will take 30-60 minutes to complete. It can be done in as little as 10 minutes if necessary by skipping the descriptive questions: this may significantly disadvantage your application, but may make sense if you wouldn’t apply otherwise.
--- This role is now closed ---
We are recruiting for this role in order to provide additional capacity in our operations function, and the Operations Lead plans to hand over a number of areas of responsibility to the successful candidate. Precisely which areas you work on will depend on your strengths and interests, and we’ll decide this together with you once you start work.
As an illustration of the sorts of things you’ll work on, we expect that the successful candidate will take on several of the following tasks:
Examples of further responsibilities that a candidate who is a good fit for them could take on include:
We also plan to train the successful candidate in some of the Operations Lead’s areas of responsibility, in order to provide better resilience.
We think this role could provide suitable challenges for someone with 0-4 years’ experience in a similar job: it might, for example, be suited to a recent graduate interested in quickly gaining experience in operations, and we also encourage more experienced candidates to apply.
The following abilities and qualities are what we’re looking for in candidates. No specific qualifications or experience are required – experience is one good way of demonstrating these skills, but we’re also open to candidates with no experience of similar roles. We encourage you to apply if you think you may be a good fit, even if you are very unsure of your strengths in some of these areas.
In addition to the above, familiarity with the effective altruism community and its priorities is a significant benefit.
In this role, you will enable our other staff’s productivity, and so support our mission to reduce worst-case risks from the development of AI systems. CLR’s activities include:
Aside from CERR, CLR has received major grants from Open Philanthropy and the Survival and Flourishing Fund.
Due to the small size of our organisation, your work will be varied and you will quickly gain experience across a wide variety of operations areas. CLR regularly encounters new operational situations, such as employing staff in a new country, or supporting the launch of a new charity, which will give you ample opportunities to extend your skills to new contexts.
CLR will also actively support your professional development. While we are looking for a candidate who is interested in working with CLR for a substantial period of time, as part of the effective altruism community we are interested in helping you increase your career’s impact even beyond your performance in the current role. Alongside mentorship from our experienced operations team, you will be joining a well-networked longtermist organisation. You will receive a budget of £2,000 per year to spend on whatever you think best furthers your professional development, and be supported to attend EA Global conferences twice annually if you’re interested.
Stage 1: To apply for this role, please submit this application form. The deadline for applications is the end of Sunday 11th September (precisely: 7:30am British Summer Time on Monday 12th).
We expect the form will take 30-60 minutes to complete. If necessary, the form can be done in as little as 10 minutes by skipping the descriptive questions: this may significantly disadvantage your application, but may make sense if you wouldn’t apply otherwise.
We aim to communicate the results of stage 1, inviting candidates to the second stage, by the end of Friday 16th September.
Stage 2 will be a remote work test, to be completed on your own computer, which we anticipate will take up to 4 hours of your time. Applicants will have 2 weeks to complete the test, and will be compensated with £120 in return for their work. We plan to communicate the results of stage 2 by the end of Friday 7th October.
Stage 3 will consist of one or more interviews with CLR staff. We plan to hold interviews in the week of 10th October, and aim to communicate the results of stage 3 by the end of Friday 21st October.
Stage 4: The final stage of the recruitment process will be a work trial, held in-person if possible, of between 1 and 10 working days depending on candidate availability. We will cover travel expenses and compensate candidates £200 per day for the work trial. We will also seek references at this stage.
We expect final recruitment decisions to be made by mid-November. If you require a faster decision than this, please feel free to contact us at the address below.
The above timelines are our aim and we fully intend to stick to them. However, we don’t firmly commit to them, and a delay of, for example, 1-2 weeks by the end of stage 3 is possible. We will communicate to candidates promptly if we expect there to be any delays.
If you have any questions about the process, please contact us at hiring@longtermrisk.org. If you’d like to send an email that’s not accessible to the hiring committee, please contact tristan.cook@longtermrisk.org.
Diversity and equal opportunity employment: CLR is an equal opportunity employer, and we value diversity at our organisation. We don’t want to discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, social background/class, mental or physical health or disability, or any other basis for unreasonable discrimination, whether legally protected or not. If you're considering applying to this role and would like to discuss any personal needs that might require adjustments to our application process or workplace, please feel very free to contact us.
The post Operations Associate / Manager appeared first on Center on Long-Term Risk.
The post Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs appeared first on Center on Long-Term Risk.
The evolution of preferences that account for other agents’ fitness, or other-regarding preferences, has been modeled with the “indirect approach” to evolutionary game theory. Under the indirect evolutionary approach, agents make decisions by optimizing a subjective utility function. Evolution may select for subjective preferences that differ from the fitness function, and in particular, subjective preferences for increasing or reducing other agents’ fitness. However, indirect evolutionary models typically artificially restrict the space of strategies that agents might use (assuming that agents always play a Nash equilibrium under their subjective preferences), and dropping this restriction can undermine the finding that other-regarding preferences are selected for. Can the indirect evolutionary approach still be used to explain the apparent existence of other-regarding preferences, like altruism, in humans? We argue that it can, by accounting for the costs associated with the complexity of strategies, giving (to our knowledge) the first account of the relationship between strategy complexity and the evolution of preferences. Our model formalizes the intuition that agents face tradeoffs between the cognitive costs of strategies and how well they interpolate across contexts. For a single game, these complexity costs lead to selection for a simple fixed-action strategy, but across games, when there is a sufficiently large cost to a strategy's number of context-specific parameters, a strategy of maximizing subjective (other-regarding) utility is stable again. Overall, our analysis provides a more nuanced picture of when other-regarding preferences will evolve.
Under what conditions do agents evolve to maximize a subjective utility function other than their evolutionary fitness? In particular, when is there selection for other-regarding preferences [Elster, 1983, Sen, 1986] such as altruism (intrinsically valuing improvements in others agents’ fitness) or spite (intrinsically valuing reductions in other agents’ fitness)? These questions have been previously studied under the “indirect approach” to evolutionary game theory [Güth and Kliemt, 1998]. Consider a game whose payoffs determine the players’ fitness in an evolutionary process, called a base game. The indirect evolutionary approach supposes that selection occurs on agents’ subjective preferences (hereafter, “preferences”) represented as utility functions, and agents rationally play the base game by optimizing their subjective utility functions. When assessing the evolutionary stability of strategies in the indirect approach, a player’s utility function defines their strategy. This is in contrast to the classical “direct” approach where actions in the base game themselves are selected.
This indirect approach has been applied in attempts to explain altruism in organisms, especially in contexts where other explanations such as kin selection and reciprocity are inadequate [Bester and Güth, 1998, Janssen, 2008, Konrad and Morath, 2012]. In a simple model of an interaction where two agents’ actions have positive externalities for each other — i.e., increasing one’s action (represented as a real number) increases the other’s payoff — Bester and Güth [1998] find that altruistic preferences are evolutionarily stable. Bolle [2000] and Possajennikov [2000] extended this model to also explain the stability of spiteful preferences in interactions with negative externalities. These other-regarding preferences are selected because they are known to other agents, and thus credibly signal an agent’s commitment to certain behavior, given other agents’ preferences [Frank, 1987, Dufwenberg and Güth, 1999].
However, these models have two key limitations:
These two modifications to the original indirect evolutionary models undermine those models’ conclusions that other-regarding preferences can be evolutionarily stable, including preferences that lead to inefficient behavior. However, an important feature of the kinds of strategies described in (1) and (2) is that they differ from subjective utility maximization in their complexity costs, i.e., the costs an agent must pay to learn and execute strategies [McNamara, 2013]. These costs may play a critical role in evolution; for instance, the tradeoff between the problem-solving benefits and energetic costs of larger brains may explain variation in brain size among primates, and in animal behavior in contests [Isler and Van Schaik, 2014, Reichert and Quinn, 2017]. Previous literature has studied how complexity costs affect the evolutionary stability of strategies [Rubinstein, 1986, Banks and Sundaram, 1990, Binmore and Samuelson, 1992]. The costs of strategy complexity accumulate over the diverse set of environments and interactions an agent faces in its lifetime [Geoffroy and André, 2021]. Thus, instead of using many different strategies that are each simple in isolation, it can be less expensive overall for an agent to use a sophisticated strategy that interpolates well across interactions [Robalino and Robson, 2016, Piccinini and Schulz, 2018]. We will argue that the complexity costs of applying individualized heuristics to each new interaction may be sufficiently high that evolution selects for “rational” agents, which consistently optimize some (other-regarding) utility function.
Our key contribution is a revised account of the evolution of other-regarding preferences, based on a novel framework accounting for the fitness costs that strategies incur due to their complexity in multiple strategic contexts. While existing indirect evolutionary models are inadequate because they artificially restrict the space of strategies, we show that their predictions can be recovered by accounting for how subjective utility-maximizing strategies optimally trade off complexity within and across decision contexts. In particular:
Indirect evolutionary approach. Like Bester and Güth [1998], Bolle [2000], and Possajennikov [2000], we model rational players as playing Nash equilibria with respect to utility functions given by their own fitness plus a (possibly negative) multiple of their opponent’s fitness. Heifetz et al. [2003] generalize this model to utility functions given by one’s own fitness plus some function called a disposition. They show that dispositions are not eliminated by selection in a wide variety of games. Generalizing further to the space of all possible utility functions in finite-action games, Dekel et al. [2007] show that any strategy achieving an inefficient payoff against itself — including the kinds of strategies with other-regarding preferences predicted by Possajennikov [2000] — is not evolutionarily stable. We will argue, however, that the invader strategies that make efficiency necessary for stability are more complex than behavioral or rational strategies, and thus when complexity costs are accounted for in an agent’s fitness, a stable strategy can lead to inefficiency. Ok and Vega-Redondo [2001] and Ely and Yilankaya [2001] note that in order for utility functions to evolve such that players with those utility functions do not play the base game Nash equilibrium, players must have information about each other’s utility functions. We assume utility functions are known, and briefly discuss how players can learn each other’s utility functions over repeated interactions, but acknowledge that this is a substantive assumption since players often have incentives to send deceptive signals of their utility functions [Heller and Mohlin, 2019]. Finally, Heifetz et al. [2007] and Alger and Weibull [2012] generalize Possajennikov [2000]’s finding that altruism or spite can be evolutionarily stable in a certain game depending on whether it features positive or negative externalities. They show that in a general class of games, selection for altruism versus spite is determined (partly) by whether the base game has strategic complements or substitutes, i.e., whether increasing one player’s input increases or decreases the marginal value of input for another player. The patterns of selection of altruism or spite based on multiple games that we illustrate with Bester and Güth [1998]’s game, therefore, might hold for a variety of games.
Games with complexity costs. Rubinstein [1986] characterizes Nash equilibria in repeated games under computational costs. He represents strategies in the repeated Prisoner’s Dilemma as finite-state automata (sets of states determining the player’s action with rules for transitions between states). Complexity costs are lexicographic: an automaton achieving a strictly higher payoff is always preferred, but when two automata achieve the same average payoff, the automaton with fewer states is preferred. Binmore and Samuelson [1992] show that although no evolutionarily stable strategies exist in the repeated Prisoner’s Dilemma without complexity costs, adding these lexicographic costs leads to the existence of some evolutionarily stable strategies. We similarly show that in one-shot games, when we account for the greater complexity of “rational” strategies relative to “behavioral” (fixed-action) strategies, a set of multiple neutrally stable strategies is replaced with a unique evolutionarily stable strategy. Our distinction between the complexity of rational and behavioral strategies follows that of Abreu and Sethi [2003], who show that under an arbitrarily small cost of the complexity of rationality, behavioral strategies are evolutionarily stable in a bargaining game. If automata are also penalized based on the number of different states each state can transition to, the evolutionarily stable strategies are restricted to the Nash equilibria of the (non-repeated) base game [Banks and Sundaram, 1990]. We find an analogous result in one-shot games with a different complexity metric. Lastly, van Veelen and García [2019] find that in the repeated Prisoner’s Dilemma, increasing non-lexicographic complexity costs decreases the frequency of cooperation in finite-population stochastic evolutionary simulations. Similarly, in the multi-game setting, we find numerically that as complexity costs on a strategy’s number of game-specific parameters increase, there are transitions between more or less efficient stable strategies.
Coevolution of rationality and other-regarding preferences. A key theme in our work is that selection may favor the ability of rational agents, which have other-regarding preferences and model other players as optimizing their own utility functions, to solve a variety of strategic problems. Building on Robson [2001]’s analogous results in single-agent problems, Robalino and Robson [2016] model the coevolution of utility maximization and ability to attribute preferences to others. Like us, they show that after accounting for the advantages of interpolation across strategic contexts, selection favors a rational strategy that learns and responds to the preferences of its opponent, as opposed to strategies that do not know how to respond to new games. However, we study selection pressures towards rationality in the context of evolution of preferences. Further, in our analysis, the advantage of rationality comes from avoiding costs that non-rational strategies pay to adapt a response to each separate game, rather than from non-rational strategies’ inability to respond to new games. Heller and Mohlin [2019] model the evolution of both preferences and the cognitive capacity necessary to signal false preferences to others. Their argument for the efficiency of stable strategies is vulnerable to the counterargument that we raise to Dekel et al. [2007] above. However, their results are similar to ours in that the set of stable strategies is sensitive to whether the costs of cognitive complexity are sufficiently high, relative to the direct fitness benefits of complex cognition. Like us, Geoffroy and André [2021] model the evolution of strategies that interpolate across different contexts, but their analysis is restricted to cooperation in a certain class of games rather than evolution of other-regarding preferences in general (including uncooperative preferences like spite).
We begin with definitions and notation and introduce a well-studied game that will illustrate principles of the indirect evolutionary approach.
Let be any symmetric two-player game (called the base game) between players and , with action space and payoff functions . Players choose actions in the base game as functions of strategies that are selected in an evolutionary process. Suppose players simultaneously play strategies (elements of some abstract space ) and observe each other's strategies, then play with actions determined by the pair of strategies. Then, define the function , where player 's action in given the players' strategies is . In standard evolutionary analysis the fitness of a strategy equals its payoff in , thus we write player 's fitness from a strategy profile as . (We distinguish fitness from payoffs because once complexity costs are included, as in Section 5.1, this identity no longer holds.) The following definitions classify a strategy based on the robustness to mutations of a population purely consisting of that strategy.
Definition 1. Relative to a fixed strategy space for , a strategy is:
The strict inequality in the definition of ESS implies a stronger “pull” towards an ESS in evolutionary dynamics (such as the replicator dynamic) than towards an NSS: If a rare mutant that enters a population consisting of an ESS has the same fitness when paired with itself as the ESS has against this mutant, the mutant goes extinct under the replicator dynamic, but this does not necessarily hold for an NSS [van Veelen, 2010].
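For reference, the standard conditions we take Definition 1 to be invoking, stated for a symmetric strategy space with fitness function \(F\), are:
\[
\text{NSS: } \forall s' \neq s:\; F(s, s) \geq F(s', s), \text{ and } F(s, s) = F(s', s) \implies F(s, s') \geq F(s', s');
\]
\[
\text{ESS: } \forall s' \neq s:\; F(s, s) \geq F(s', s), \text{ and } F(s, s) = F(s', s) \implies F(s, s') > F(s', s').
\]
Here the strict inequality in the second ESS condition is the one referred to above.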
Our running example is the following symmetric two-player game, which we call the externality game [Bester and Güth, 1998]. Each player simultaneously chooses , and, for some and , the players receive payoffs:
Thus, represents negative or positive externalities of each player's action for the other's payoff (when or , respectively). In the original model, players are assumed to have the following subjective utility functions, for :
Players behave rationally with respect to their subjective utility functions, and subjective utility functions are common knowledge. Thus the players play the Nash equilibrium of the game in which payoffs are given by , denoted . That is, letting represent player 's strategy, .
A player with (respectively, ) has subjective utility increasing (decreasing) with the other's payoff — these ranges of can be interpreted as altruistic and spiteful, respectively. Generalizing Bester and Güth [1998], Possajennikov [2000] showed that the unique ESS in this strategy space is . Thus, when , this ESS corresponds to players with altruistic preferences, and when , their preferences are spiteful. Players who follow the subjective Nash equilibrium with respect to given by the altruistic ESS both receive a higher payoff than the equilibrium of , while the payoffs of the spiteful ESS are both lower. Since the Pareto-efficient symmetric subjective Nash equilibrium is at , this means that as , the ESS approaches efficiency. Intuitively, these other-regarding preferences are stable in Possajennikov [2000]’s model because they serve as commitment devices that elicit favorable responses from the other player [Frank, 1987, Dufwenberg and Güth, 1999]. That is, each agent best-responds under the assumption that the other player will play rationally with respect to their utility function, and as utility functions are selected based on payoffs from the opponent’s best response to the action optimizing those utility functions, the population converges to some .
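To make the subjective-equilibrium logic concrete, here is a small Python sketch that approximates the subjective Nash equilibrium numerically by iterated best responses; the quadratic payoff used in the example is a hypothetical stand-in with an externality term of strength k, not necessarily the exact payoff function of the externality game above.

```python
import numpy as np

def subjective_nash(alpha1, alpha2, fitness, grid, n_iter=200):
    """Approximate the Nash equilibrium of the game whose payoffs are the
    subjective utilities u_i = f_i + alpha_i * f_j, by iterated best response
    over a discretized action grid.

    alpha_i:            player i's other-regarding preference parameter
    fitness(a_i, a_j):  player i's own fitness from action a_i against a_j
    grid:               1-D numpy array of candidate actions
    """
    def best_response(alpha_i, a_j):
        subjective = fitness(grid, a_j) + alpha_i * fitness(a_j, grid)
        return grid[np.argmax(subjective)]

    a1 = a2 = grid[len(grid) // 2]
    for _ in range(n_iter):
        a1, a2 = best_response(alpha1, a2), best_response(alpha2, a1)
    return a1, a2

# Hypothetical quadratic payoff with positive externalities of strength k.
k = 0.5
f = lambda a_i, a_j: a_i - a_i**2 + k * a_i * a_j
print(subjective_nash(0.3, 0.3, f, np.linspace(0.0, 2.0, 401)))
```

With this illustrative payoff, positive preference parameters (altruism) push the computed equilibrium actions above the egoistic equilibrium, mirroring the commitment effect described above.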
We now discuss the formal framework on which our results are based. Let as above. We say that a preference parameter is egoistic if , and other-regarding otherwise. In our results we will use the following assumptions, which are satisfied by the externality game:
We give some remarks on the typical indirect evolutionary models before presenting our generalized model. Recall our claim that the strategy space assumed by much of the indirect evolutionary game theory literature is too restrictive, due to the assumption that agents always play the Nash equilibrium of . Playing a Nash equilibrium in response to the other player's can be exploitable, in the sense that a player can “force” another rational agent to play an action that is more favorable to player (see Section 4.1 for an example). A player may avoid being exploited in this way by committing to some action, independent of opponents’ preferences. We will therefore enrich the strategy space in to relax this assumption (Section 4.1).
Standard indirect evolutionary game theory also assumes players perfectly observe each other’s payoff functions and subjective utility functions. This premise has been questioned in previous work, e.g., Heifetz et al. [2007], Gardner and West [2010]. We keep this assumption due to findings by Jordan [1991] and Kalai and Lehrer [1993] that, if players use Bayesian updating in repeated interactions with each other, under certain conditions they converge to accurate beliefs about each other’s utility functions and play the Nash equilibrium. Dekel et al. [2007] and Heller and Mohlin [2019] give similar justifications for this assumption in their indirect evolutionary models.
Our strategy space combines the “direct” and “indirect” approaches to evolutionary game theory [Güth and Kliemt, 1998]. That is, this space includes both fixed actions of the base game and strategies that choose actions as a function of the player’s own subjective utility function and the other player’s strategy.
First, a behavioral strategy plays an action ai, independent of the other player’s strategy. The action ai is common knowledge to both players before is played. Second, as in the standard indirect evolutionary approach [Bester and Güth, 1998, Possajennikov, 2000], a rational strategy has a commonly known preference parameter , and always plays a best response given to their beliefs about the other player. A rational player believes that another rational player plays the Nash equilibrium of . Thus the best response to another rational player with parameter is . A rational player believes behavioral player plays action , so the rational strategy is .
To see the reason for including both classes of strategies in one model, consider the externality game with . If a rational player faces rational player with , and , we can check that the payoff of increases while that of decreases:
That is, can exploit the rationality of by adopting an other-regarding preference parameter as a commitment. We therefore ask what strategies are selected for when we allow players to ignore each other's commitments (preferences), in order to avoid exploitation, and instead play some fitness-maximizing action.
In summary, our strategy space is the union of these sets:
We now characterize the Nash equilibria and stable strategies of S. We show that there are multiple neutrally stable strategies, one of which acts according to egoistic preferences, and no evolutionarily stable strategies. This is in contrast to the results of Bester and Güth [1998] and Possajennikov [2000], who showed that without behavioral strategies, a population with other-regarding preferences is the unique ESS in the externality game. All proofs are in Appendix A.
Proposition 1. Let be a symmetric two-player game that satisfies assumptions 1 - 3. Then a strategy is a Nash equilibrium in if and only if it is either or a strategy that is a Nash equilibrium in . Further, is an NSS in , and is an NSS in if and only if it is an NSS in . There are no ESSes.
Informally, a population that always plays the base game Nash equilibrium can be invaded by rational players with egoistic preferences, whose fitness against each other matches that of the original population. When the population consists of rational players with other-regarding preferences that are stable against other rational strategies, it can be invaded by agents that always play the Nash equilibrium of the game with payoffs given by those same other-regarding preferences.
Single game. Proposition 1 showed that strategies with either egoistic or other-regarding preferences can be neutrally stable, and neither is evolutionarily stable. This suggests that the standard indirect evolutionary approach is insufficient to explain the unique stability of other-regarding preferences. However, our analysis above assumed that players can use arbitrarily complex strategies at no greater cost than simpler ones; fitness is a function only of the payoffs of strategies, not of the cognitive resources required to use them [McNamara, 2013].
We introduce complexity costs as follows. For some complexity function , we apply the usual evolutionary stability analysis to a modified strategy fitness function:
While behavioral strategies always play a fixed action, rational strategies compute a best response to each given opponent. Within a single game, a behavioral strategy thus requires less computation than a rational strategy (this assumption was also used by Abreu and Sethi [2003]). Given this observation, for some we define (where the function returns if the condition in brackets is true, and otherwise). Once this cost is accounted for, selection favors the behavioral strategy that plays the Nash equilibrium of (even when assumption 3 does not hold).
Proposition 2. Let be a symmetric two-player game that satisfies assumptions 1 and 2. Then for any , the unique Nash equilibrium in under penalties is , and this strategy is an ESS.
An arbitrarily small cost of complexity prevents rational strategies from matching the fitness of the Nash equilibrium behavioral strategy.
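As a minimal sketch of the penalized single-game fitness just described (assuming strategies are encoded as tuples ('behavioral', action) or ('rational', alpha), which is our own representation rather than the paper's):

```python
def single_game_penalized_fitness(strategy_i, strategy_j, play, fitness, kappa):
    """Base-game fitness minus the single-game complexity penalty described above.

    strategy_i, strategy_j: ('behavioral', action) or ('rational', alpha)
    play(s_i, s_j):         action profile (a_i, a_j) induced by the strategies
    fitness(a_i, a_j):      player i's base-game payoff
    kappa:                  cost of using the (more complex) rational strategy
    """
    a_i, a_j = play(strategy_i, strategy_j)
    penalty = kappa if strategy_i[0] == 'rational' else 0.0
    return fitness(a_i, a_j) - penalty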
Multiple games. Proposition 2, again, appears inconsistent with the stability of other-regarding preferences. However, this result is based on a metric of complexity that only accounts for costs within one game — the cost of rational optimization versus playing a constant action for any opponent — rather than cumulative costs across games. As Piccinini and Schulz [2018] discuss qualitatively, although agents who rely on situation-specific heuristics avoid the fixed cost of explicit optimization paid by rational agents, they do worse in some variable environments than the latter, who can profit from having a general and compact strategy of optimizing utility functions. We formalize this tradeoff in this section.
Suppose that in each generation, the players in a population face a collection of games . Each player uses a strategy that (through the function ) outputs an action conditional on both the other player's strategy and the identity of the game. One can apply the usual evolutionary stability analysis to strategies that play the collection of games, by defining fitness as the sum of fitness from each game minus a multi-game complexity function . If a given strategy has parameters under selection across games, should increase with . An ideal definition of this function would be informed by an accurate model of the energetic costs of different kinds of cognition, which is beyond the scope of this work. We can define multi-game complexity in our setting by generalizing the strategy space from Section 4 to multiple games:
The motivation for parameterizing a strategy in by a single is that, across a distribution of relevantly similar games (e.g., variants of the externality game with different values of ), a rational player might be able to perform well by interpolating its other-regarding preferences. Then, for some , letting denote the number of unique elements of , define:
The set of stable strategies under these multi-game penalties is sensitive to the values of and . Intuitively, a behavioral strategy will be stable when is small, relative to the profits this strategy can make by adapting its response precisely to each game. Conversely, when is sufficiently large, a rational strategy can compensate for applying the same decision rule to every game by avoiding the costs of game-specific heuristics. In the next section, we show these patterns numerically.
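Continuing the encoding above, one plausible reading of the multi-game penalty is sketched below; the exact functional form, in particular the cost assigned to the rational strategy, is our assumption.

```python
def multi_game_complexity(strategy, kappa, lam):
    """Complexity cost of a strategy across a collection of games.

    ('rational', alpha):             a single interpolating parameter alpha,
                                     plus the fixed cost kappa of optimizing
    ('behavioral', (a_1, ..., a_M)): one fixed action per game
    lam is the cost per unique game-specific parameter.
    """
    kind, params = strategy
    if kind == 'rational':
        return kappa + lam          # one parameter reused in every game
    return lam * len(set(params))   # one cost per distinct action across games
```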
Here, we will use an evolutionary simulation algorithm to see how complexity costs across games influence stable strategies — in particular, which (if any) other-regarding preferences are selected? For simplicity, we consider a set of just two externality games for a fixed with and , denoted and . However, to investigate the effects of imbalanced environments (i.e., where is played more or less frequently than ) we suppose that players spend a fraction of their time in game and in . Then, with as the externality game payoff function for a given , the multi-game penalized fitness of a strategy against is:
Due to the continuous strategy space, a replicator dynamic simulation is intractable. Instead, we simulate an evolutionary process on using the adaptive learning algorithm [Young, 1993], implemented as follows (details are in Appendix B). An initial population of size is randomly sampled from the spaces of rational and behavioral strategies. In each round of evolution, each player in the population either (with low probability) switches to a random strategy, or else switches to the best response to a uniformly sampled opponent in the population (with respect to the penalized fitness above). Note that a best response in the space might use one action across both games, incurring a complexity cost of instead of . We fix , and . In each experiment, we tune the multi-game complexity penalty (hereafter, “per-parameter penalty”) to approximately the smallest value necessary to ensure that the population almost always converges to an element of (a rational strategy).
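The following is a minimal sketch of such an adaptive-learning loop; the population size, mutation probability, and the use of a finite candidate set to approximate best responses over the continuous strategy space are our own simplifications, and `penalized_fitness_multi(s, t)` is assumed to implement the two-game penalized fitness defined earlier.

```python
import random

def adaptive_learning(initial_population, random_strategy, candidates,
                      penalized_fitness_multi, n_rounds=1000, mutation_prob=0.01):
    """Sketch of a Young (1993)-style adaptive learning dynamic.

    initial_population:  list of strategies
    random_strategy():   samples a fresh strategy (mutation)
    candidates:          finite set of strategies to best-respond from
    penalized_fitness_multi(s, t): multi-game fitness of s against t,
                                   net of complexity costs (assumed given)
    """
    population = list(initial_population)
    for _ in range(n_rounds):
        new_population = []
        for _ in population:
            if random.random() < mutation_prob:
                new_population.append(random_strategy())
            else:
                opponent = random.choice(population)
                best = max(candidates,
                           key=lambda s: penalized_fitness_multi(s, opponent))
                new_population.append(best)
        population = new_population
    return population
```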
Varying strength of negative or positive externalities in one game. First, we show that other-regarding preferences evolve under sufficiently strong negative or positive externalities, given a sufficiently high per-parameter penalty. We fix , , and , and vary . For , the population converged to a behavioral strategy that uses only one action, for all values of we tested (see the open circle in Figure 1). This suggests that when both games are sufficiently similar, a behavioral strategy can interpolate across both games at less expense than a rational strategy. Figure 1 shows that, as expected, the sign and magnitude of the stable value scales with . For , the population converges to , suggesting that other-regarding preferences only interpolate well across these externality games when the externalities are sufficiently strong in magnitude.
Varying proportion of games with negative versus positive externalities. Next, we show that the strength of altruism versus spite in the limiting population scales nonlinearly with the proportion of games with negative versus positive externalities. With , we vary the fraction of games with , over , for three pairs of games. For all pairs of in this experiment, the values and have one-action behavioral strategies in the limiting population (see the open circles in Figure 2). When one game is extremely rare, the rational strategy's gains from interpolation across games do not outweigh the cost of rationality.
First we fix and (blue curve in Figure 2). Again, the trend of decreasing with greater is as expected, though there is a bias towards altruism: an equal proportion of positive and negative externalities gives . When and (orange curve), even small proportions of the large-magnitude negative are sufficient for the rational population to adopt , and remains roughly constant above . That is, in an environment where one game has weak positive externalities and the other has strong negative externalities, most of the effect on the population's other-regarding preferences comes just from having a frequency of strong negative externalities above some (small) threshold. The same pattern holds in the opposite direction when and (green curve).
In Figure 3, we vary both and , keeping . For any , the result from Figure 1 where a rational strategy is not stable for small still holds. Likewise, the result that takes over the population when is not sensitive to . Generalizing the trend from Figure 2, for sufficiently large magnitudes of , only a minority of games need to have far from for strong other-regarding preferences to be stable.
Social welfare in the limiting population as a function of the per-parameter penalty. Finally, we show how the total payoffs of the limiting population vary both with the size of the per-parameter penalty, and with the proportion of games with positive versus negative externalities. Fixing and , we vary for each . To visualize the transitions between limiting populations of behavioral versus rational strategies, we compute the social welfare averaged over the last two rounds (for some parameter values, the population oscillates) of each evolutionary simulation for penalty and proportion , shown in Figure 4.
For most values of , when there is no per-parameter penalty () the population attains the near-lowest social welfare, where all in the population play the base game Nash equilibrium. The penalty is sufficient for all populations to converge to an other-regarding rational strategy, which attains the highest social welfare when but nearly the lowest when , i.e., when most of the games have . For intermediate values of , the population oscillates between and a behavioral best response to in each game, usually resulting in social welfare between that of very low and high . The minimum value of necessary for convergence to the rational strategy is largest for values of closest to 0.5, while only a small penalty is necessary when or (see the values of where the curves in Figure 4 plateau). Intuitively, if the large majority of games have the same , a behavioral strategy does not profit much from adapting with multiple actions, relative to the complexity costs of playing different actions for two games.
The magnitude of relative to required for other-regarding preferences to be stable might appear unrealistically large, based on these results. We note the distinction between the fixed cognitive costs of developing a rational decision procedure, and the per-use costs of learning heuristics for each context and recognizing when each is appropriate. Cooper [1996] argues that lexicographic, or infinitesimal, complexity costs are appropriate for the former (these start-up costs are a tiebreaker between strategies that are otherwise equally capable), while finite non-negligible costs are suitable for the latter. It is therefore plausible that in several evolutionary contexts, the costs of adapting to each interaction from scratch outweigh the costs of rationality. Regardless, given the sensitivity of the stable populations in these experiments to , it is important to account for the relative strength of these two factors when predicting the result of an evolutionary process.
Lastly, we discuss the implications of complexity costs for another model that appears to preclude the evolution of certain other-regarding preferences. Recall that we have defined the utility functions of rational strategies as the player’s own payoff plus a multiple of the opponent’s payoff. Previous work has shown (in finite-action games) that if all possible subjective utility functions are permitted, and players observe each other’s subjective utility functions, then all stable strategies achieve a Pareto efficient payoff [Dekel et al., 2007, Heller and Mohlin, 2019]. This conclusion follows from the “secret handshake” argument: a player who is indifferent among all action pairs can select an equilibrium that matches any other strategy’s action against that strategy, but plays an action achieving an efficient payoff against itself [Robson, 1990]. These results rule out both the base game Nash equilibrium and the ESS in of the externality game, which is for , while is the unique efficient rational strategy.
One might suspect, then, that our conclusion from the numerical experiments — i.e., inefficient other-regarding preferences can be stable when agents play multiple games — would not hold after including the strategy classes from Dekel et al. [2007] and Heller and Mohlin [2019]. When we include complexity costs, however, the secret handshake argument does not follow. Let be the class of strategies whose subjective utility functions are constant over all action pairs, and which use the equilibrium selection rule described above. Because this strategy requires choosing different Nash equilibria depending on the opponent, we claim that it is more complex than either a behavioral or rational strategy.
For , let . Then is still an ESS under the conditions of Proposition 2, with added to the strategy space. The proof is straightforward; given a positive penalty, a strategy from cannot match the payoff of against itself, by the definition of the base game Nash equilibrium:
We conjecture that across multiple games, a sufficiently large penalty would yield similar results to Section 5.2.
The puzzle that motivated this work was the apparent prevalence of other-regarding preferences, such as altruism and spite, despite the possibility of selection for commitment strategies that ignore the signals of other-regarding preferences. Our results suggest that this puzzle stems from a neglect of complexity considerations in previous literature on the evolution of preferences. We considered a class of two-player symmetric games that includes the games used by Bester and Güth [1998] and Possajennikov [2000] to illustrate the stability of altruism and spite. First, via evolutionary stability analysis on a strategy space that combines the direct and indirect approaches, we confirmed that other-regarding preferences are no longer uniquely stable when fixed-action strategies can also evolve. We then showed numerically that, although other-regarding preferences are unstable when agents play a single game under costs of strategy complexity, if the costs of distinct fixed actions across multiple games are sufficiently high, other-regarding preferences are stable. These costs also explain why inefficient stable strategies can persist — the flexible “secret handshake” strategy, which has been purported to guarantee that stability implies efficiency, is too complex to invade populations with certain inefficient strategies.
Accounting for the costs of adapting strategies to specific games plausibly sheds light on other phenomena in evolutionary game theory. For example, Boyd and Richerson [1992] argued that a common explanation of cooperation as a product of punishment, e.g., as in tit-for-tat in the repeated Prisoner’s Dilemma, proves too much: “Moralistic” strategies, which not only punish noncooperation but also punish those who do not punish noncooperation, can enforce the stability of any individually rational behavior. These moralistic strategies require sophisticated recognition of the behaviors that constitute cooperation or punishment in each given game. If some individually rational behavior enforced by a moralistic strategy is only marginally better for the cooperating player than getting punished, another strategy could invade by avoiding the complexity cost of the moralistic strategy, which outweighs the direct fitness cost of being punished. Thus, under complexity costs, the set of evolutionarily stable behaviors may be much smaller. It is also important to note that classes of simple, generalizable utility functions other than those we have considered might evolve. Instead of having utility functions given by their payoff plus a multiple of the other agent’s payoff, agents could develop utility functions with an aversion to exploitation or inequity [Huck and Oechssler, 1999, Güth and Napel, 2006]. Future work could investigate selection pressures on utility functions of different complexity.
Besides explaining biological behavior, our model of complexity-penalized preference evolution might also motivate predictions of the behavior of artificial agents, such as reinforcement learning (RL) algorithms. Policies are updated based on reward signals similarly to fitness-based updating of populations in evolutionary models [Börgers and Sarin, 1997]. It is common in RL training to penalize strategies (“policies”) according to their complexity, and deep learning researchers have argued that artificial neural networks have an implicit bias towards simple functions [Mingard et al., 2021, Valle-Perez et al., 2019]. Thus, RL agents trained together may develop other-regarding preferences, insofar as the assumptions of our model are satisfied by the tasks these agents are trained in. A better understanding of the relationship between complexity costs and the distribution of environments these agents are trained in may help us better understand what kinds of preferences they acquire.
Dilip Abreu and Rajiv Sethi. Evolutionary stability in a reputational model of bargaining. Games and Economic Behavior, 44(2):195–216, 2003.
Ingela Alger and Jörgen W. Weibull. A generalization of Hamilton’s rule—Love others how much? Journal of Theoretical Biology, 299:42–54, 2012. ISSN 0022-5193. doi: https://doi.org/10.1016/j.jtbi.2011.05.008. URL https://www.sciencedirect.com/science/article/pii/S0022519311002505. Evolution of Cooperation.
Jeffrey S Banks and Rangarajan K Sundaram. Repeated games, finite automata, and complexity. Games and Economic Behavior, 2(2):97–117, 1990. ISSN 0899-8256. doi: https://doi.org/10.1016/0899-8256(90)90024-O. URL https://www.sciencedirect.com/science/article/pii/089982569090024O.
Siegfried Berninghaus, Christian Korth, and Stefan Napel. Reciprocity—an indirect evolutionary analysis. Journal of Evolutionary Economics, 17:579–603, 02 2007. doi: 10.1007/s00191-006-0053-1.
Helmut Bester and Werner Güth. Is altruism evolutionarily stable? Journal of Economic Behavior & Organization, 34(2):193–209, 1998.
Kenneth G Binmore and Larry Samuelson. Evolutionary stability in repeated games played by finite automata. Journal of Economic Theory, 57(2):278–305, 1992. ISSN 0022-0531. doi: https://doi.org/10.1016/0022-0531(92)90037-I. URL https://www.sciencedirect.com/science/article/pii/002205319290037I.
Friedel Bolle. Is altruism evolutionarily stable? And envy and malevolence?: Remarks on Bester and Güth. Journal of Economic Behavior & Organization, 42(1):131–133, 2000.
Robert Boyd and Peter J. Richerson. Punishment allows the evolution of cooperation (or anything else) in sizable groups. Ethology and Sociobiology, 13(3):171–195, 1992. ISSN 0162-3095. doi: https://doi.org/10.1016/0162-3095(92)90032-Y. URL https://www.sciencedirect.com/science/article/pii/016230959290032Y.
Tilman Börgers and Rajiv Sarin. Learning Through Reinforcement and Replicator Dynamics. Journal of Economic Theory, 77(1):1–14, 1997. ISSN 0022-0531. doi: https://doi.org/10.1006/jeth.1997.2319. URL https://www.sciencedirect.com/science/article/pii/S002205319792319X.
John Conlisk. Costly optimizers versus cheap imitators. Journal of Economic Behavior & Organization, 1(3):275–293, September 1980. ISSN 0167-2681. doi: 10.1016/0167-2681(80)90004-9.
David J. Cooper. Supergames Played by Finite Automata with Finite Costs of Complexity in an Evolutionary Setting. Journal of Economic Theory, 68(1):266–275, 1996. ISSN 0022-0531. doi: https://doi.org/10.1006/jeth.1996.0015. URL https://www.sciencedirect.com/science/article/pii/S0022053196900150.
Eddie Dekel, Jeffrey C. Ely, and Okan Yilankaya. Evolution of Preferences. The Review of Economic Studies, 74(3):685–704, 2007. ISSN 00346527, 1467937X. URL http://www.jstor.org/stable/4626157.
Martin Dufwenberg and Werner Güth. Indirect evolution vs. strategic delegation: a comparison of two approaches to explaining economic institutions. European Journal of Political Economy, 15(2):281–295, 1999. ISSN 0176-2680. doi: https://doi.org/10.1016/S0176-2680(99)00006-3. URL https://www.sciencedirect.com/science/article/pii/S0176268099000063.
Jon Elster. Rationality, page 1–42. Cambridge University Press, 1983. doi: 10.1017/CBO9781139171694.002.
Jeffrey C. Ely and Okan Yilankaya. Nash Equilibrium and the Evolution of Preferences. Journal of Economic Theory, 97(2):255–272, 2001. ISSN 0022-0531. doi: https://doi.org/10.1006/jeth.2000.2735. URL https://www.sciencedirect.com/science/article/pii/S0022053100927352.
Robert H. Frank. If Homo Economicus Could Choose His Own Utility Function, Would He Want One with a Conscience? The American Economic Review, 77 (4):593–604, 1987. ISSN 00028282. URL http://www.jstor.org/stable/1814533.
Andy Gardner and Stuart A. West. Greenbeards. Evolution, 64(1):25–38, 2010. doi: https://doi.org/10.1111/j.1558-5646.2009.00842.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1558-5646.2009.00842.x.
Félix Geoffroy and Jean-Baptiste André. The emergence of cooperation by evolutionary generalization. Proc Biol Sci, 2021.
Werner Güth and Hartmut Kliemt. The indirect evolutionary approach: Bridging the gap between rationality and adaptation. Rationality and Society, 10(3):377–399, 1998. doi: 10.1177/104346398010003005. URL https://doi.org/10.1177/104346398010003005.
Werner Güth and Stefan Napel. Inequality aversion in a variety of games: An indirect evolutionary analysis. The Economic Journal, 116(514):1037–1056, 2006. ISSN 00130133, 14680297. URL http://www.jstor.org/stable/4121943.
Aviad Heifetz, Chris Shannon, and Yossi Spiegel. What to Maximize If You Must. Journal of Economic Theory, pages 31–57, 2003.
Aviad Heifetz, Chris Shannon, and Yossi Spiegel. The Dynamic Evolution of Preferences. Economic Theory, 32:251–286, 2007.
Yuval Heller and Erik Mohlin. Coevolution of deception and preferences: Darwin and Nash meet Machiavelli. Games and Economic Behavior, 113:223–247, 2019. ISSN 0899-8256. doi: https://doi.org/10.1016/j.geb.2018.09.011. URL https://www.sciencedirect.com/science/article/pii/S0899825618301532.
Steffen Huck and Jörg Oechssler. The indirect evolutionary approach to explaining fair allocations. Games and Economic Behavior, 28(1):13–24, 1999. ISSN 0899-8256. doi: https://doi.org/10.1006/game.1998.0691. URL https://www.sciencedirect.com/science/article/pii/S0899825698906911.
Karin Isler and Carel P. Van Schaik. How humans evolved large brains: Comparative evidence. Evolutionary Anthropology: Issues, News, and Reviews, 23(2):65–75, 2014. doi: https://doi.org/10.1002/evan.21403. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/evan.21403.
Marco A. Janssen. Evolution of cooperation in a one-shot prisoner’s dilemma based on recognition of trustworthy and untrustworthy agents. Journal of Economic Behavior & Organization, 65(3):458–471, 2008. ISSN 0167-2681. doi: https://doi.org/10.1016/j.jebo.2006.02.004. URL https://www.sciencedirect.com/science/article/pii/S0167268106001934.
James Jordan. Bayesian learning in normal form games. Games and Economic Behavior, 3(1):60–81, 1991. URL https://EconPapers.repec.org/RePEc:eee:gamebe:v:3:y:1991:i:1:p:60-81.
Ehud Kalai and Ehud Lehrer. Rational Learning Leads to Nash Equilibrium. Econometrica, 61(5):1019–1045, 1993. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/2951492.
Kai A. Konrad and Florian Morath. Evolutionarily stable in-group favoritism and out-group spite in intergroup conflict. Journal of Theoretical Biology, 306:61–67, 2012. ISSN 0022-5193. doi: https://doi.org/10.1016/j.jtbi.2012.04.013. URL https://www.sciencedirect.com/science/article/pii/S0022519312001944.
John M McNamara. Towards a richer evolutionary game theory. Journal of the Royal Society Interface, 10(88):20130544, November 2013. ISSN 1742-5689. doi: 10.1098/rsif.2013.0544.
Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, and Ard A. Louis. Is SGD a Bayesian sampler? Well, almost. Journal of Machine Learning Research, 22(79):1–64, 2021. URL http://jmlr.org/papers/v22/20-676.html.
Efe A. Ok and Fernando Vega-Redondo. On the Evolution of Individualistic Preferences: An Incomplete Information Scenario. Journal of Economic Theory, 97(2):231–254, 2001. ISSN 0022-0531. doi: https://doi.org/10.1006/jeth.2000.2668. URL https://www.sciencedirect.com/science/article/pii/S0022053100926681.
Gualtiero Piccinini and Armin W. Schulz. The Ways of Altruism. Evolutionary Psychological Science, 5:58–70, 2018.
Alex Possajennikov. On the evolutionary stability of altruistic and spiteful preferences. Journal of Economic Behavior & Organization, 42(1):125–129, 2000.
Michael S. Reichert and John L. Quinn. Cognition in contests: Mechanisms, ecology, and evolution. Trends in Ecology & Evolution, 32(10):773–785, 2017. ISSN 0169-5347. doi: https://doi.org/10.1016/j.tree.2017.07.003. URL https://www.sciencedirect.com/science/article/pii/S0169534717301799.
Nikolaus Robalino and Arthur Robson. The Evolution of Strategic Sophistication. The American Economic Review, 106(4):1046–1072, 2016. ISSN 00028282. URL http://www.jstor.org/stable/43821484.
Arthur J. Robson. Efficiency in evolutionary games: Darwin, Nash and the secret handshake. Journal of Theoretical Biology, 144(3):379–396, 1990. ISSN 0022-5193. doi: https://doi.org/10.1016/S0022-5193(05)80082-7. URL https://www.sciencedirect.com/science/article/pii/S0022519305800827.
Arthur J. Robson. Why Would Nature Give Individuals Utility Functions? Journal of Political Economy, 109(4):900–914, 2001. ISSN 00223808, 1537534X. URL http://www.jstor.org/stable/10.1086/322083.
Ariel Rubinstein. Finite automata play the repeated prisoner’s dilemma. Journal of Economic Theory, 39(1):83–96, 1986. ISSN 0022-0531. doi: https://doi.org/10.1016/0022-0531(86)90021-9. URL https://www.sciencedirect.com/science/article/pii/0022053186900219.
Amartya Sen. Foundations of Social Choice Theory: An Epilogue. Cambridge University Press, Cambridge, 1986.
Guillermo Valle-Perez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rye4g3AqFm.
Matthijs van Veelen. But Some Neutrally Stable Strategies are More Neutrally Stable than Others. Tinbergen Institute Discussion Papers 10-033/1, Tinbergen Institute, March 2010. URL https://ideas.repec.org/p/tin/wpaper/20100033.html.
Matthijs van Veelen and Julián García. In and out of equilibrium II: Evolution in repeated games with discounting and complexity costs. Games and Economic Behavior, 115:113–130, 2019. ISSN 0899-8256. doi: https://doi.org/10.1016/j.geb.2019.02.013. URL https://www.sciencedirect.com/science/article/pii/S0899825619300314.
H. Peyton Young. The evolution of conventions. Econometrica, 61(1):57–84, 1993. URL https://EconPapers.repec.org/RePEc:ecm:emetrp:v:61:y:1993:i:1:p:57-84.
Behavioral. Define , the payoff of the Nash equilibrium with egoistic preferences. By the definition of Nash equilibrium of , since , the strategy is a Nash equilibrium in . Suppose . Then we must have , because otherwise uniqueness of the Nash equilibrium (assumption 1) would be violated. So is not a Nash equilibrium in .
Since the Nash equilibrium of is unique, there is no behavioral strategy such that . Suppose a rational strategy satisfies . Then
(This is satisfied for .) But this implies that .
So , and , therefore is neutrally stable (but not an ESS).
Rational. It is immediate that can only be a Nash equilibrium in if it is a Nash equilibrium in . Let be such a strategy. always plays against itself, so . Suppose a deviator plays . Given assumption 3, for any there exists a such that . Therefore:
Where the last line follows because is a Nash equilibrium in the space of rational strategies. So is a Nash equilibrium in .
Suppose is neutrally stable in . By assumption 2, is the unique action such that . Then
and . So is neutrally stable (but not an ESS). On the other hand, if is not an NSS in , the same counterexample to neutral stability applies in the expanded space , thus is not an NSS in .
Behavioral. The conditions for Nash equilibrium in do not change, since this set has the lowest complexity. However, when assessing the stability of , it suffices to only consider invader strategies in , because for a strategy ,
Since the Nash equilibrium of is unique (assumption 1), there is no other behavioral strategy such that . Thus is an ESS under penalties.
Rational. Let be any rational strategy. Then . But , so cannot be a Nash equilibrium under penalties.
Each strategy is parameterized by , where the strategy is if or if . A population of size is initialized with and for each player in the population. Let and if , otherwise . The probability of switching to a random strategy from the initialization distribution in round of evolution is . (We decay to decrease the rate of stochasticity and thus help convergence.) Best responses in the space of are computed analytically; for , we use gradient ascent.
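For the gradient-ascent best responses mentioned above, a minimal sketch of such a routine (with a finite-difference gradient, a single strategy parameter, and illustrative step sizes, none of which are the paper's exact choices) could look like:

```python
def best_response_weight(fitness_of_weight, w_init=0.0, lr=0.01, steps=2000, eps=1e-4):
    """Gradient-ascent best response over a single strategy parameter, using a
    finite-difference gradient. The single-parameter setup and step sizes are
    illustrative assumptions, not the paper's implementation."""
    w = w_init
    for _ in range(steps):
        grad = (fitness_of_weight(w + eps) - fitness_of_weight(w - eps)) / (2 * eps)
        w += lr * grad
    return w

# Example with a toy concave fitness whose maximum is at w = 0.3:
# best_response_weight(lambda w: -(w - 0.3) ** 2)  # returns approximately 0.3
```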
The post Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs appeared first on Center on Long-Term Risk.
The post Commitment games with conditional information revelation appeared first on Center on Long-Term Risk.
The conditional commitment abilities of mutually transparent computer agents have been studied in previous work on commitment games and program equilibrium. This literature has shown how these abilities can help resolve Prisoner’s Dilemmas and other failures of cooperation in complete information settings. But inefficiencies due to private information have been neglected thus far in this literature, despite the fact that these problems are pervasive and might also be addressed by greater mutual transparency. In this work, we introduce a framework for commitment games with a new kind of conditional commitment device, which agents can use to conditionally reveal private information. We prove a folk theorem for this setting that provides sufficient conditions for ex post efficiency, and thus represents a model of ideal cooperation between agents without a third-party mediator. Connecting our framework with the literature on strategic information revelation, we explore cases where conditional revelation can be used to achieve full cooperation while unconditional revelation cannot. Finally, extending previous work on program equilibrium, we develop an implementation of conditional information revelation. We show that this implementation forms program -Bayesian Nash equilibria corresponding to the Bayesian Nash equilibria of these commitment games.
KEYWORDS
Cooperative AI, program equilibrium, smart contracts
ACM Reference Format:
Anthony DiGiovanni and Jesse Clifton. 2022. Commitment games with conditional information revelation. In Appears at the 4th Games, Agents, and Incentives Workshop (GAIW 2022), held as part of the Workshops at the 20th International Conference on Autonomous Agents and Multiagent Systems, Auckland, New Zealand, May 2022. IFAAMAS, 14 pages.
What are the upper limits on the ability of rational, self-interested agents to cooperate? As autonomous systems become increasingly responsible for important decisions, including in interactions with other agents, the study of “Cooperative AI” [2] will potentially help ensure these decisions result in cooperation. It is well-known that game-theoretically rational behavior — which will potentially be more descriptive of the decision-making of advanced computer agents than humans — can result in imperfect cooperation, in the sense of inefficient outcomes. Some famous examples are the Prisoner’s Dilemma and the Myerson-Satterthwaite impossibility of efficient bargaining under incomplete information [20]. Fearon [4] explores “rationalist” explanations for war (i.e., situations in which war occurs in equilibrium); these include Prisoner’s Dilemma-style inability to credibly commit to peaceful alternatives to war, as well as incentives to misrepresent private information (e.g., military strength). Because private information is so ubiquitous in real strategic interactions, resolving these cases of inefficiency is a fundamental open problem. Inefficiencies due to private information will be increasingly observed in machine learning, as machine learning is used to train agents in complex multi-agent environments featuring private information, such as negotiation. For example, Lewis et al. [16] found that when an agent was trained with reinforcement learning on negotiations under incomplete information, it failed to reach an agreement with humans more frequently than a human-imitative model did.
But greater ability to make commitments and share private information can open up more efficient equilibria. Computer systems could be much better at making their internal workings legible to other agents, and at making sophisticated conditional commitments. More mutually beneficial outcomes could also be facilitated by new technologies like smart contracts [30]. This makes the game theory of interactions between agents with these abilities important for the understanding of Cooperative AI — in particular, for developing an ideal standard of multi-agent decision making with future technologies. An extreme example of the power of greater transparency and commitment ability is Tennenholtz [29]’s “program equilibrium” solution to the one-shot Prisoner’s Dilemma. The players in Tennenholtz’s “program game” version of the Prisoner’s Dilemma submit computer programs to play on their behalf, which can condition their outputs on each other’s source code. Then a pair of programs with the source code form an equilibrium of mutual cooperation.
In this spirit, we are interested in exploring the kinds of cooperation that can be achieved by agents who are capable of extreme mutual transparency and credible commitment. We can think of this as giving an upper bound on the ability of advanced artificially intelligent agents, or humans equipped with advanced technology for commitment and transparency, to achieve efficient outcomes. While such abilities are inaccessible to current systems, identifying sufficient conditions for cooperation under private information provides directions for future research and development, in order to avoid failures of cooperation. These are our main contributions:
Commitment games and program equilibrium. We build on commitment games, introduced by Kalai et al. [11] and generalized to Bayesian games (without verifiable revelation) by Forges [5]. In a commitment game, players submit commitment devices that can choose actions conditional on other players’ devices. This leads to folk theorems: Players can choose commitment devices that conditionally commit to playing a target action (e.g., cooperating in a Prisoner’s Dilemma), and punishing if their counterparts do not play accordingly (e.g., defecting in a Prisoner’s Dilemma if counterparts’ devices do not cooperate). A specific kind of commitment game is one played between computer agents who can condition their behavior on each other’s source code. This is the focus of the literature on program equilibrium [1, 15, 21, 22, 25, 29]. Peters and Szentes [24] critique the program equilibrium framework as insufficiently robust to new contracts, because the programs in, e.g., Kalai et al. [11]’s folk theorem only cooperate with the exact programs used in the equilibrium profile. Like ours, the commitment devices in Peters and Szentes [24] can reveal their types and punish those that do not also reveal. However, their devices reveal unconditionally and thus leave the punishing player exploitable, restricting the equilibrium payoffs to a smaller set than that of Forges [5] or ours.
Our folk theorem builds directly on Forges [5]. In Forges’ setting, players lack the ability to reveal private information. Thus the equilibrium payoffs must be incentive compatible. We instead allow (conditional) verification of private information, which lets us drop Forges’ incentive compatibility constraint on equilibrium payoffs. Our program equilibrium implementation extends Oesterheld [21]’s robust program equilibrium to allow for conditional information revelation.
Strategic information revelation. In games of strategic information revelation, players have the ability to verifiably reveal some or all of their private information. The question then becomes: How much private information should players reveal (if any), and how should other players update their beliefs based on players’ refusal to reveal some information? A foundational result in this literature is that of full unraveling: Under a range of conditions, when players can verifiably reveal information, they will act as if all information has been revealed [7, 18, 19]. This means the mere possibility of verifiable revelation is often enough to avoid informational inefficiencies. However, there are cases where unraveling fails to hold, and informational inefficiencies persist even when verifiable revelation is possible. This can be due to uncertainty about a player’s ability to verifiably reveal [3, 26] or revelation being costly [8, 10]. But revelation of private information can fail even without such uncertainty or costs [14 , 17]. We will review several such examples in Section 5, show how the persistence of uncertainty in these settings can lead to welfare losses, and show how this can be remedied with the commitment technologies of our framework (but not weaker ones, like those of Forges [5]).
Let be a Bayesian game with players. Each player has a space of types , giving joint type space . At the start of the game, players' types are sampled by Nature according to the common prior . Each player knows only their type. Player 's strategy is a choice of action for each type in . Let denote player 's expected payoff in this game when the players have types and follow an action profile . A Bayesian Nash equilibrium is an action profile in which every player and type plays a best response with respect to the prior over other players' types: For all players and all types , . An -Bayesian Nash equilibrium is similar: Each player and type expects to gain at most (instead of 0) by deviating from .
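In standard notation (our own, which may differ from the paper's symbols), the best-response condition defining a Bayesian Nash equilibrium reads, for every player \(i\), every type \(t_i\), and every alternative action \(a_i'\),
\[
\mathbb{E}_{t_{-i} \sim p(\cdot \mid t_i)}\!\left[ u_i\big(a_i(t_i), a_{-i}(t_{-i}), t\big) \right] \;\ge\; \mathbb{E}_{t_{-i} \sim p(\cdot \mid t_i)}\!\left[ u_i\big(a_i', a_{-i}(t_{-i}), t\big) \right],
\]
and an approximate Bayesian Nash equilibrium relaxes the right-hand side so that each player and type gains at most the given tolerance by deviating.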
We assume players can correlate their actions by conditioning on a trustworthy randomization device . For any correlated strategy (a distribution over action profiles), let . When it is helpful, we will write to clarify the subset of the type profile on which the correlated strategy is conditioned. Let denote a correlated strategy whose th entry is degenerate at , and the actions of players other than are sampled from independently of . Then, the following definitions will be key to our discussion:
DEFINITION 1. A payoff vector as a function of type profiles is feasible if there exists a correlated strategy such that, for all players and types ,
DEFINITION 2. A payoff is interim individually rational (INTIR) if, for each player , there exists a correlated strategy used by the other players such that, for all ,
The minimax strategy is used by the other players to punish player . The threat of such punishments will support the equilibria of our folk theorem. Players only have sufficient information to use this correlated strategy if they reveal their types to each other. Moreover, the punishment can only work in general if they do not reveal their types to player , because the definition of INTIR requires the deviator to be uncertain about . Since the inequalities hold for all , the players do not need to know player 's type to punish them.
DEFINITION 3. A feasible payoff induced by is incentive compatible (IC) if, for each player and type pair ,
Incentive compatibility means that each player prefers a given correlated strategy to be played according to their type, as opposed to that of another type.
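As a rough reconstruction in assumed notation (with \(\pi\) the correlated strategy inducing the payoff \(v\) and \(\mu^{-i}\) the punishing correlated strategy of the other players; these symbols are ours, not necessarily the paper's), Definitions 2 and 3 say, for all players \(i\) and types \(t_i, t_i'\),
\[
v_i(t_i) \;\ge\; \max_{a_i \in A_i} \; \mathbb{E}_{t_{-i} \sim p(\cdot \mid t_i)}\!\left[ u_i\big(a_i, \mu^{-i}(t_{-i}), t\big) \right] \quad \text{(INTIR)},
\]
\[
\mathbb{E}_{t_{-i} \sim p(\cdot \mid t_i)}\!\left[ u_i\big(\pi(t_i, t_{-i}), t\big) \right] \;\ge\; \mathbb{E}_{t_{-i} \sim p(\cdot \mid t_i)}\!\left[ u_i\big(\pi(t_i', t_{-i}), t\big) \right] \quad \text{(IC)}.
\]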
DEFINITION 4. Given a type profile , a payoff is ex post efficient (hereafter, "efficient") if there does not exist such that for all and for some
We will also consider games with strategic information revelation, i.e., Bayesian games where, immediately after learning their types, players are able to reveal their private information as follows. Players simultaneously each choose from some revelation action set , which is a subset of . Then, all players observe each , thus learning that player 's type is in . Revelation is verifiable in the sense that a player's choice of must contain their true type, i.e., they cannot falsely "reveal" a different type. We will place our results on conditional type revelation in the context of the literature on unraveling:
DEFINITION 5. Let be the profile of revelation actions (as functions of types) in a Bayesian Nash equilibrium of a game with strategic information revelation. Then has full unraveling if for all , or partial unraveling if is a strict subset of for at least one
Uncertainty about others' private information, and a lack of ability or incentive to reveal that information, can lead to inefficient outcomes in Bayesian Nash equilibrium (or an appropriate refinement thereof). Here is a running example we will use to illustrate how informational problems can be overcome under our assumptions, but not under the weaker assumption of unconditional revelation ability.
Example 3.1 (War under incomplete information, adapted from Slantchev and Tarar [27] ). Two countries are on the verge of war over some territory. Country 1 offers a split of the territory giving fractions and to countries 1 and 2, respectively. If country 2 rejects this offer, they go to war. Each player wins with some probability (detailed below), and each pays a cost of fighting . The winner receives a payoff of 1, and the loser gets 0.
The countries' military strength determines the probability that country 2 wins the war, denoted . Country 1 doesn't know whether the other's army is weak (with type ) or strong (), while country 1's strength is common knowledge. Further, country 2 has a weak point, which country 1 believes is equally likely to be in one of two locations . Thus country 2's type is given by . Country 1 can make a sneak attack on , independent of whether they go to war. Country 1 would gain from attacking , costing for country 2. But incorrectly attacking would cost for country 1, so country 1 would not risk an attack given a prior of on each of the locations. If country 2 reveals its full type by allowing inspectors from country 1 to assess its military strength , country 1 will also learn .
If country 1 has a sufficiently low prior that country 2 is strong, then war occurs in the unique perfect Bayesian equilibrium when country 2 is strong. Moreover, this can happen even if the countries can fully reveal their private information to one another. In other words, the unraveling of private information does not occur, because player 2 will be made worse off if they allow player 1 to learn about their weak point.
Before looking at what is achievable with different technologies for information revelation, we need to formally introduce our framework for commitment games with conditional information revelation. In the next section, we describe these games and present our folk theorem.
Players are faced with a "base game" , a Bayesian game with strategic information revelation as defined in Section 3.1. In our framework, a commitment game is a higher-level Bayesian game in which the type distribution is the same as that of , and actions are devices that define mappings from other players' devices to actions in (conditional on one's type). We assume for all players and types , i.e., players are at least able to reveal their exact types or not reveal any new information. They additionally have access to devices that can condition (i) their actions in and (ii) the revelation of their private information on other players' devices. Upon learning their type , player chooses a commitment device from an abstract space of devices . These devices are mappings from the player's type to a response function and a type revelation function. As in Kalai et al . [11] and Forges [5], we will define these functions so as to allow players to condition their decisions on each other's decisions without circularity.
Let be the domain of the randomization device . The response function is . (This notation, adopted from Forges [5], distinguishes the player's action-determining function from the action itself.) Given the other players' devices and a signal given by the realized value of the random variable , player 's action in after the revelation phase is .1 Conditioning the response on permits players to commit to (correlated) distributions over actions. Second, we introduce type revelation functions , which are not in the framework of Forges [5]. The th entry of indicates whether player reveals their type to player , i.e., player learns if this value is 1 or if it is . (We can restrict attention to cases where either all or no information is revealed, as our folk theorem shows that such a revelation action set is sufficient to enforce each equilibrium payoff profile.) Thus, each player can condition their action on the others' private information revealed to them via . Further, they can choose whether to reveal their type to another player, via , based on that player's device. Thus players can decide not to reveal private information to players whose devices are not in a desired device profile, and instead punish them.
Then, the commitment game is the one-shot Bayesian game in which each player 's strategy is a device , as a function of their type. After devices are simultaneously and independently submitted (potentially as a draw from a mixed strategy over devices), the signal is drawn from the randomization device , and players play the induced action profile in . Thus the ex post payoff of player in from a device profile is .
Our folk theorem consists of two results: First, any feasible and INTIR payoff can be achieved in equilibrium (Theorem 1). As a special case of interest, then, any efficient payoff can be attained in equilibrium. Second, all equilibrium payoffs in are feasible and INTIR (Proposition 1).
THEOREM 1. Let be any commitment game. For type profile , let be a correlated strategy inducing a feasible and INTIR payoff profile . Let be a punishment strategy that is arbitrary except, if is the only player with , let be the minimax strategy against player . Conditional on the signal , let be the deterministic action profile, called the target action profile, given by , and let be the deterministic action profile given by . For all players and types , let be such that:
Then, the device profile is a Bayesian Nash equilibrium of .
PROOF. We first need to check that the response and type revelation functions only condition on information available to the players. If all players use , then by construction of they all reveal their types to each other, and so are able to play conditioned on their type profile (regardless of whether the induced payoff is IC). If at least one player uses some other device, the players who do use still share their types with each other, thus they can play .
Suppose player deviates from . That is, player 's strategy in is . Note that the outputs of player 's response and type revelation functions induced by may in general be the same as those returned by . We will show that punishes deviations from the target action profile regardless of these outputs, as long as there is a deviation in functions or . Let . Then:
This last expression is the ex interim payoff of the proposed commitment given that the other players use , therefore is a Bayesian Nash equilibrium.
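To make the construction concrete, here is a minimal Python sketch of the device described in the proof: reveal one's type only to players whose submitted devices match the equilibrium profile, and play the target action only when every other device matches, punishing otherwise. All identifiers are invented for exposition; the paper defines these objects abstractly.

```python
def make_equilibrium_device(i, equilibrium_profile, target_action, punish_action):
    """Factory for player i's conditional commitment device (illustrative sketch)."""
    def reveal_to(j, device_j):
        # Reveal player i's type to player j only if j submitted the equilibrium device.
        return device_j == equilibrium_profile[j]

    def respond(devices, signal):
        # devices: mapping from player index to the submitted device.
        others_conform = all(devices[j] == equilibrium_profile[j]
                             for j in devices if j != i)
        # Play the target action (conditioned on the correlation signal) if
        # everyone else conforms; otherwise play the minimax punishment.
        return target_action(signal) if others_conform else punish_action(signal)

    return reveal_to, respond
```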
PROPOSITION 1. Let be any commitment game. If a device profile is a Bayesian Nash equilibrium of , then the induced payoff is feasible and INTIR.
PROOF. Let be the strategy profile of . Then by hypothesis so is feasible. Suppose that for some player , for all correlated strategies there exists a type such that:
Let . Then if player with type deviates to such that :
This contradicts the assumption that is the payoff of a Bayesian Nash equilibrium, therefore is INTIR.
Our assumptions do not imply the equilibrium payoffs are IC (unlike Forges [5]). Suppose a player 's payoff would increase if the players conditioned the correlated strategy on a different type (i.e., not IC). This does not imply that a profit is possible by deviating from the equilibrium, because in our setting the other players' actions are conditioned on the type revealed by . In particular, as in our proposed device profile, they may choose to play their part of the target action profile only if all other players' devices reveal their (true) types.
The assumptions that give rise to this class of commitment games with conditional information revelation are stronger than the ability to unconditionally reveal private information. Recalling the unraveling results from Section 2, unconditional revelation ability is sometimes sufficient for the full revelation of private information, or for revelation of the information that prohibits incentive compatibility, and thus the possibility of efficiency in equilibrium. But this is not always true, whereas efficiency is always attainable in equilibrium under our assumptions. We first show that full unraveling fails in our running example when country 2 has a weak point. Then, we discuss conditions under which the ability to partially reveal private information is sufficient for efficiency, and examples where these conditions don’t hold.
Since country 2 can only either reveal both its strength and weak point , or neither, in our formalism of strategic information revelation . If country 2 rejects the offer , players go to a war that country 2 wins with probability if its army is weak, or if strong.
Assume country 2 is strong and the prior probability of a strong type is . In the perfect Bayesian equilibrium of (without type revelation) country 1 offers , which country 2 rejects [27]. That is, if country 1 believes country 2 is unlikely to be strong, country 1 makes the smallest offer that only a weak type would accept. Thus with private information the countries go to war and receive inefficient payoffs in equilibrium. A strong country 2 also prefers not to reveal its type unconditionally. Although this would guarantee that country 1 best-responds with , which country 2 would accept, given knowledge of the weak point country 1 prefers to attack it and receive an extra payoff with certainty, costing for country 2. Country 2 would therefore be worse off by than in equilibrium without revelation, where its expected payoff is .
However, if country 2 can reveal its full type if and only if country 1 commits to that country 2 accepts, and commits not to attack , the countries can avoid war in equilibrium. The profile is not IC, and hence cannot be achieved under the assumptions of Forges [5] alone, because a weak country 2 would prefer the strong type's payoff (absent type-conditional commitment by country 1). In this example, conditional type revelation is required for efficiency due to a practical inability to reveal military strength without also revealing a vulnerability (Table 1). In other words, country 2's revelation action set is too restricted for full unraveling to occur. Interactions between advanced artificially intelligent agents may feature similar problems necessitating our framework. For example, if revelation requires sharing source code or the full set of parameters of a neural network that lacks cleanly separated modules, unconditional revelation risks exposing exploitable information. See also example 7 of Okuno-Fujiwara et al. [23] in which full unraveling fails because a firm does not want to reveal a secret technology that provides a competitive advantage, leading to inefficiency because other private information is not revealed.
Full unraveling. If full unraveling occurs in the base game , then the ability to conditionally reveal information becomes irrelevant. For example, consider a modification of Example 3.1 in which there is no weak point, i.e., country 's type is rather than . A strong country 2 that can verify its strength to country 1 prefers to do so, since this does not also help country 1 exploit it. But because of this, if country 2 refuses to reveal its strength and it is common knowledge that country 2 could verifiably reveal, country 1 knows country 2 is weak. Thus, all types are revealed in equilibrium without conditioning on country 1's commitment.
Some sufficient and necessary conditions for full unraveling have been derived. Hagenbach et al. [9] show that given verifiable revelation, full unraveling is guaranteed if there are no cycles in the directed graph defined by types that prefer to pretend to be each other. For full unraveling in some classes of games with multidimensional types, it is necessary for one player's payoff to be sufficiently nonconvex in the other's beliefs [17]. In Appendix A, we give an example where this condition fails, thus unconditional revelation is insufficient even without exploitable information. However, even for games with full unraveling, the framework of Forges [5] is still insufficient for equilibria with non-IC payoffs, since that framework does not allow verifiable revelation (conditional or otherwise).
Table 1:

|  | Full | Partial |
| --- | --- | --- |
| Conditional | feasible, INTIR | feasible, INTIR |
| Unconditional | feasible, INTIR, {full unraveling or IC} | feasible, INTIR, {full unraveling or IC after unraveling} |
Partial revelation and post-unraveling incentive compatibility. If in our original example country 2 could partially reveal its type, i.e., only the probability of winning a war but not its weak point, conditional revelation would not be necessary (Table 1). This is because the strategy inducing the efficient payoff profile depends only on the part of country 2's type that is revealed by the unraveling argument. Country 2 does not profit from lying about its exploitable, non-unraveled information — that is, the payoff is IC with respect to that information, even if not IC in the pre-unraveling game. Thus country 1 does not need to know this information for the efficient payoff to be achieved in equilibrium. Formally, in this case , i.e., country 2 can choose to reveal any , producing an equilibrium of partial unraveling. We can generalize this observation with the following proposition.
PROPOSITION 2. Suppose that the devices in do not have revelation functions, and is a game of strategic information revelation with for all . Let be updated to have support on the subset of types remaining after unraveling. As in Forges [5], assume is conditioned on . Then a payoff profile is achievable in a Bayesian Nash equilibrium of if and only if it is feasible, INTIR, and IC (with respect to the post-unraveling game and updated ).
PROOF. This is an immediate corollary of Propositions 1 and 2 of Forges [5], applied to the base game induced by unraveling (that is, with a prior updated on types being in the space ).
To our knowledge, it is an open question which conditions are sufficient and necessary for partial unraveling such that the efficient payoffs of the post-unraveling game are IC. An informal summary of Proposition 2 and characterizations of equilibrium payoffs under our framework and that of Forges [5] is: Given conditional commitment ability, efficiency can be achieved in equilibrium if and only if there is sufficiently strong incentive compatibility, conditional and verifiable revelation ability, or an intermediate combination of these (see Table 1).
Proposition 2 is not vacuous; there exist games in which, given the ability to partially, verifiably, and unconditionally reveal their private information, players end up in an inefficient equilibrium that is Pareto-dominated by a non-IC payoff. Consequently, the alternatives to conditional information revelation that we have considered are not sufficient to achieve all feasible and INTIR payoffs even when partial revelation is possible. The game in Appendix A is one example where such a payoff is efficient. In the following example, the only efficient payoff is IC. However, the set of equilibrium payoffs is smaller than under our assumptions, and excludes some potentially desirable outcomes. For example, there is a non-IC -efficient payoff that improves upon the strictly efficient payoff in utilitarian welfare (sum of all players' payoffs).
Example 5.1 (All-pay auction under incomplete information from Kovenock et al. [14]). Two firms participate in an all-pay auction. Each firm has a private valuation of a good. After observing their respective valuations, players simultaneously choose whether to reveal them. Then they simultaneously submit bids , and the higher bid wins the good, with a tie broken by a fair coin. Thus player 's payoff is . There is a Bayesian Nash equilibrium of this base game in which , and neither player reveals their valuation [14]. In this equilibrium, each player's ex interim payoff is:
The ex post payoffs are if , if , and otherwise.
Now, let , and consider the following strategy . For type profiles such that , let and . For , let and . Otherwise, let . Then:
Thus the payoff induced by is feasible and INTIR, because it exceeds the ex interim equilibrium payoff. This is also an ex post Pareto improvement on the equilibrium, because the ex post payoffs are if , if , and otherwise. Finally, this payoff is not IC, because if , player 1 would profit from conditioned on a type .
Note that the payoffs and , i.e., the case of , are not feasible. This non-IC payoff thus requires and is inefficient by a margin of . However, in practice the players may favor this payoff over . This is because the non-IC payoff is -welfare optimal, since whenever for either player, the supremum of the sum of payoffs is .
Having shown that conditional commitment and revelation devices solve problems that are intractable under other assumptions, we next consider how players can practically (and more robustly) implement these abstract devices. In particular, can players achieve efficient equilibria without using the exact device profile in Theorem 1, which can only cooperate with itself? We now develop an implementation showing that this is possible, after providing some background.
Oesterheld [21] considers two computer programs playing a game. Each program can simulate the other. He constructs a program equilibrium — a pair of programs that form an equilibrium of this game — using “instantaneous tit-for-tat” strategies.
In the Prisoner’s Dilemma, the idea of these programs is as follows (a sketch is given below): they cooperate with each other and punish defection. Note that these programs are recursive, but guaranteed to terminate because of the probability that a program will output Cooperate unconditionally.
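As a rough illustration of the idea, here is a minimal Python sketch of an instantaneous tit-for-tat program: with a small probability it cooperates unconditionally (which grounds the recursion), and otherwise it plays whatever the opponent's program plays against it. This is our own sketch, not the original pseudocode, and all names are invented.

```python
import random

COOPERATE, DEFECT = "C", "D"

def tit_for_tat_program(opponent_program, epsilon=0.05):
    """Illustrative instantaneous tit-for-tat program for the Prisoner's Dilemma."""
    # With small probability, cooperate unconditionally; this guarantees that
    # the mutual recursion between two such programs terminates.
    if random.random() < epsilon:
        return COOPERATE
    # Otherwise, simulate the opponent's program playing against this program
    # and mirror its output.
    return (COOPERATE
            if opponent_program(tit_for_tat_program) == COOPERATE
            else DEFECT)

# Two copies playing each other cooperate with probability 1: the recursion
# eventually grounds in unconditional cooperation, which then propagates up.
# tit_for_tat_program(tit_for_tat_program)  # -> "C"
```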
We use this idea to implement conditional commitment and revelation devices. For us, "revealing private information and playing according to the target action profile" is analogous to cooperation in the construction of . We will first describe the appropriate class of programs for program games under private information. Then we develop our program, (where "SIR" stands for "strategic information revelation"), and show that it forms a -Bayesian Nash equilibrium of a program game. Pseudocode for is given in Algorithm 1.
Player 's strategy in the program game is a choice from the program space , a set of computable functions from to . A program returns either an action or a type revelation vector. Each program takes as input the players' program profile, the signal , and a boolean that equals 1 if the program's output is an action, and 0 otherwise.
For brevity, we write for a call to a program with the boolean set to , otherwise . Player 's action in the program game is a call to their program . (We refer to these initial program calls as the base calls to distinguish them from calls made by other programs.) Then, the ex post payoff of player in the program game is .
In addition to in the base game, we assume there is a randomization device on which programs can condition their outputs. Like Oesterheld [21], we will use programs that unconditionally terminate with some small probability. By using to correlate decisions to unconditionally terminate, our program profile will be able to terminate with probability 1, despite the exponentially increasing number of recursive program calls. In particular, reads the call stack of the players' program profile. At each depth level of recursion reached in the call stack, a variable is independently sampled from . Each program call at level can read off the values of and from . The index itself is not revealed, however, because programs that "know" they are being simulated would be able to defect in the base calls, while cooperating in simulations to deceive the other programs. To ensure that our programs terminate in play with a deviating program, will call truncated versions of its counterparts' revelation programs: For , let denote with immediate termination upon calling another program.
checks if all other players' programs reveal their types (line 8 of Algorithm 1). If so, either with a small probability it unconditionally cooperates (line 11) — i.e., plays its part of the target action profile — or it cooperates only when all other programs cooperate (line 15). Otherwise, it punishes (line 17). In turn, reveals its type unconditionally with probability (line 20). Otherwise, it reveals to a given player under two conditions (lines 25 and 28). First, player must reveal to the user. Second, they must play an action consistent with the desired equilibrium, i.e., cooperate when all players reveal their types, or punish otherwise. (See Figure 1.)
Unconditionally revealing one's type and playing the target action avoids an infinite regress. Crucially, these unconditional cooperation outputs are correlated through . Therefore, in a profile of copies of this program, either all copies unconditionally cooperate together, or none of them do so. Using this property, we can show (see proof of Theorem 2 in Appendix B) that a profile where all players use this program outputs the target action profile with certainty. If one player deviates, first, immediately punishes if that player does not reveal. If they do reveal, with some small probability the other players unconditionally cooperate, making this strategy slightly exploitable, but otherwise the deviator is punished. Even if deviation is punished, may unconditionally reveal. In our approach, this margin of exploitability is the price of implementing conditional commitment and revelation with programs that cooperate based on counterparts' outputs, rather than a strict matching of devices, without an infinite loop. Further, since a player is only able to unconditionally cooperate under incomplete information if they know all players' types, needs to prematurely terminate calls to programs that don't immediately unconditionally cooperate, but which may otherwise cause infinite recursion (line 4). This comes at the expense of some robustness: with low probability, it may punish players who would otherwise have cooperated.
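To summarize the decision logic just described (punish non-revealers; with small probability cooperate unconditionally; otherwise cooperate only if the other programs also play their part of the target profile), here is a deliberately simplified Python sketch. It omits the correlated grounding signal, recursion-depth bookkeeping, and truncated revelation calls of Algorithm 1, and all identifiers are invented.

```python
import random

def sir_like_action(other_programs, target_action, punish_action, epsilon=0.05):
    """Simplified sketch of the action-phase logic; not a faithful reproduction
    of Algorithm 1. `other_programs` maps player ids to objects exposing
    .reveals() -> bool and .action() -> action."""
    # 1. Punish if any other player's program refuses to reveal its type.
    if not all(p.reveals() for p in other_programs.values()):
        return punish_action
    # 2. With small probability, play the target action unconditionally
    #    (this is what grounds the recursion in the full construction).
    if random.random() < epsilon:
        return target_action
    # 3. Otherwise, play the target action only if every other program also
    #    plays its part of the (here, symmetric) target profile; else punish.
    if all(p.action() == target_action for p in other_programs.values()):
        return target_action
    return punish_action
```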
THEOREM 2. Consider the program game induced by a base game and the program spaces . Assume all strategies returned by these programs are computable. For type profile , let induce a feasible and INTIR payoff profile . Let be the minimax strategy if one player deviates, and arbitrary otherwise. Let be the maximum payoff achievable by any player in , and . Then the program profile given by Algorithm 1 (with ) for players , denoted , is a -Bayesian Nash equilibrium. That is, if players play this profile, and player plays a program that terminates with probability 1 given that any programs it calls terminate with probability 1, then:
PROOF SKETCH. We need to check (1) that the program profile terminates (a) with or (b) without a deviation, (2) that everyone plays the target action profile when no one deviates, and (3) that with high probability a deviation is punished. First suppose no one deviates. If for two levels of recursion in a row, the calls to and all unconditionally reveal (line 21 of ) and output the target action (line 6 of ), respectively. Because these unconditional cooperative outputs are correlated through , the probability that at each pair of subsequent levels in the call stack is a nonzero constant. Thus it is guaranteed to occur eventually and cause termination in finite time, satisfying (1b). Moreover, each call to or in previous levels of the stack sees that the next level cooperates, and thus cooperates as well, ensuring that the base calls all output the target action profile. This shows (2).
If, however, one player deviates, we use the same guarantee of a run of subsequent events to guarantee termination. First, all calls to non-deviating programs terminate, because any call to conditional on forces termination (line 4) of calls to other players' revelation programs. Thus the deviating programs also terminate, since they call terminating non-deviating programs. This establishes (1a). Finally, in the high-probability event that the first two levels of calls to do not unconditionally cooperate, punishes the deviator as long as they do not reveal their type and play their target action. The punishing players will know each other's types, since a call to is guaranteed by line 28 to reveal to anyone who also punishes or unconditionally cooperates in the next level. Condition (3) follows.
A practical obstacle to program equilibrium is demonstrating to one’s counterpart that one’s behavior is actually governed by the source code that has been shared. In our program game with private information, there is the additional problem that, as soon as one’s source code is shared, one’s counterpart may be able to read off one’s private information (without revealing their own). Addressing this in practice might involve modular architectures, where players could expose the code governing their strategy without exposing the code for their private information. Alternatively, consider AI agents that can place copies of themselves in a secure box, where the copies can inspect each other’s full code but cannot take any actions outside the box. These copies read each other’s commitment devices off of their source code, and report the action and type outputs of these devices to the original agents. If any copy within the box attempts to transmit information that another agent’s device refused to reveal, the box deletes its contents. This protocol does not require a mediator or arbitrator; the agents and their copies make all the relevant strategic decisions, with the box only serving as a security mechanism. Applications of secure multi-party computation to machine learning [12], or privacy-preserving smart contracts [13] — with the original agents treated as the “public” from whom code shared among the copies is kept private — might facilitate the implementation of our proposed commitment devices.
We have defined a new class of commitment games that allow revelation of private information conditioned on other players’ commitments. Our folk theorem shows that in these games, efficient payoffs are always attainable in equilibrium. Our examples, which draw on models of war and all-pay auctions, show how players with these capabilities can avoid welfare losses, while others (even with the ability to verifiably reveal private information) cannot. Finally, we have provided an implementation of this framework via robust program equilibrium, which can be used by computer programs that read each other’s source code.
While conceptually simple, satisfying these assumptions in practice requires a strong degree of mutual transparency and conditional commitment ability, which is not possessed by contemporary human institutions or AI systems. Thus, our framework represents an idealized standard for bargaining in the absence of a trusted third party, suggesting research priorities for the field of Cooperative AI [2]. The motivation for work on this standard is that AI agents with increasing economic capabilities, which would exemplify game-theoretic rationality to a stronger degree than humans, may be deployed in contexts where they make strategic decisions on behalf of human principals [6]. Given the potential for game-theoretically rational behavior to cause cooperation failures [4, 20], it is important that such agents are developed in ways that ensure they are able to cooperate effectively.
Commitment devices of this form would be particularly useful in cases where centralized institutions (Dafoe et al. [2], Section 4.4) for enforcing or incentivizing cooperation fail, or have not been constructed due to collective action problems. This is because our devices do not require a trusted third party, aside from correlation devices. A potential obstacle to the use of these commitment devices is lack of coordination in development of AI systems. This may lead to incompatibilities in commitment device implementation, such that one agent cannot confidently verify that another’s device meets its conditions for trustworthiness and hence type revelation. Given that commitments may be implicit in complex parametrizations of neural networks, it is not clear that independently trained agents will be able to understand each other’s commitments without deliberate coordination between developers. Our program equilibrium approach allows for the relaxation of the coordination requirements needed to implement conditional information revelation and commitment. Coordination on target action profiles for commitment devices or flexibility in selection of such profiles, in interactions with multiple efficient and arguably “fair” profiles [28], will also be important for avoiding cooperation failures due to equilibrium selection problems.
We thank Lewis Hammond for helpful comments on this paper and thank Caspar Oesterheld both for useful comments and for identifying an important error in an earlier version of one of our proofs.
[1] Andrew Critch. 2019. A parametric, resource-bounded generalization of Löb’s theorem, and a robust cooperation criterion for open-source game theory. The Journal of Symbolic Logic 84, 4 (2019), 1368–1381.
[2] Allan Dafoe, Edward Hughes, Yoram Bachrach, Tantum Collins, Kevin R. McKee, Joel Z. Leibo, Kate Larson, and Thore Graepel. 2020. Open Problems in Cooperative AI. arXiv:2012.08630 [cs.AI]
[3] Ronald A Dye. 1985. Disclosure of nonproprietary information. Journal of Accounting Research (1985), 123–145.
[4] James D Fearon. 1995. Rationalist explanations for war. International Organization 49, 3 (1995), 379–414.
[5] Françoise Forges. 2013. A folk theorem for Bayesian games with commitment. Games and Economic Behavior 78 (2013), 64–71. https://doi.org/10.1016/j.geb.2012.11.004
[6] Edward Geist and Andrew J. Lohn. 2018. How might artificial intelligence affect the risk of nuclear war? Rand Corporation.
[7] Sanford J Grossman. 1981. The informational role of warranties and private disclosure about product quality. The Journal of Law and Economics 24, 3 (1981), 461–483.
[8] Sanford J Grossman and Oliver D Hart. 1980. Disclosure laws and takeover bids. The Journal of Finance 35, 2 (1980), 323–334.
[9] Jeanne Hagenbach, Frédéric Koessler, and Eduardo Perez-Richet. 2014. Certifiable Pre-play Communication: Full Disclosure. Econometrica 82, 3 (2014), 1093–1131. http://www.jstor.org/stable/24029308
[10] Boyan Jovanovic. 1982. Truthful disclosure of information. The Bell Journal of Economics (1982), 36–44.
[11] Adam Tauman Kalai, Ehud Kalai, Ehud Lehrer, and Dov Samet. 2010. A commitment folk theorem. Games and Economic Behavior 69, 1 (2010), 127–137.
[12] Brian Knott, Shobha Venkataraman, Awni Hannun, Shubhabrata Sengupta, Mark Ibrahim, and Laurens van der Maaten. 2021. CrypTen: Secure Multi-Party Computation Meets Machine Learning. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (Eds.). https://openreview.net/forum?id=dwJyEMPZ04I
[13] Ahmed Kosba, Andrew Miller, Elaine Shi, Zikai Wen, and Charalampos Papamanthou. 2016. Hawk: The Blockchain Model of Cryptography and Privacy-Preserving Smart Contracts. In 2016 IEEE Symposium on Security and Privacy (SP). 839–858. https://doi.org/10.1109/SP.2016.55
[14] Dan Kovenock, Florian Morath, and Johannes Münster. 2015. Information sharing in contests. Journal of Economics & Management Strategy 24 (2015), 570–596. Issue 3.
[15] Patrick LaVictoire, Benja Fallenstein, Eliezer Yudkowsky, Mihaly Barasz, Paul Christiano, and Marcello Herreshoff. 2014. Program Equilibrium in the Prisoner’s Dilemma via Löb’s Theorem. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence.
[16] Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or No Deal? End-to-End Learning for Negotiation Dialogues. https://doi.org/10.48550/ARXIV.1706.05125
[17] Giorgio Martini. 2018. Multidimensional Disclosure. http://www.giorgiomartini.com/papers/multidimensional_disclosure.pdf
[18] Paul Milgrom and John Roberts. 1986. Relying on the information of interested parties. The RAND Journal of Economics (1986), 18–32.
[19] Paul R Milgrom. 1981. Good news and bad news: Representation theorems and applications. The Bell Journal of Economics (1981), 380–391.
[20] Roger B Myerson and Mark A Satterthwaite. 1983. Efficient mechanisms for bilateral trading. Journal of Economic Theory 29, 2 (1983), 265–281.
[21] Caspar Oesterheld. 2019. Robust program equilibrium. Theory and Decision 86, 1 (2019), 143–159.
[22] Caspar Oesterheld and Vincent Conitzer. 2021. Safe Pareto Improvements for Delegated Game Playing. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems. 983–991.
[23] Masahiro Okuno-Fujiwara, Andrew Postlewaite, and Kotaro Suzumura. 1990. Strategic Information Revelation. The Review of Economic Studies 57, 1 (1990), 25–47. http://www.jstor.org/stable/2297541
[24] Michael Peters and Balázs Szentes. 2012. Definable and Contractible Contracts. Econometrica 80 (2012), 363–411.
[25] Ariel Rubinstein. 1998. Modeling Bounded Rationality. The MIT Press.
[26] Hyun Song Shin. 1994. The burden of proof in a game of persuasion. Journal of Economic Theory 64, 1 (1994), 253–264.
[27] Branislav L Slantchev and Ahmer Tarar. 2011. Mutual optimism as a rationalist explanation of war. American Journal of Political Science 55, 1 (2011), 135–148.
[28] Julian Stastny, Maxime Riché, Alexander Lyzhov, Johannes Treutlein, Allan Dafoe, and Jesse Clifton. 2021. Normative Disagreement as a Challenge for Cooperative AI. arXiv:2111.13872 [cs.MA]
[29] Moshe Tennenholtz. 2004. Program equilibrium. Games and Economic Behavior 49, 2 (2004), 363–373.
[30] Hal R Varian. 2010. Computer mediated transactions. American Economic Review 100, 2 (2010), 1–10.
Consider the following game of strategic information revelation. We will show that in this game, there is a perfect Bayesian equilibrium that is inefficient, and there is an efficient payoff profile that is not IC. (That is, in this game, unconditional and partial type revelation and the framework of Forges [5] are not sufficient to achieve efficiency.) This example is inspired by the model in Martini [17].
Player 2 is a village that lives around the base of a treacherous mountain (i.e., along the left and bottom sides of ). Their warriors are camped somewhere on the mountain, with coordinates . Player 1 has no information on the warriors' location, hence the prior is . But they know that warriors at higher altitudes are tougher; strength is proportional to . As in Example 3.1, player 1 can offer a split of disputed territory. If the players fight, then player 1 will send in paratroopers at a location to fight player 2's warriors at a cost proportional to their strength . They want to get as close as possible to minimize exposure to the elements, consumption of rations, etc (i.e., minimize the squared distance ). Meanwhile, player 2 wants the paratroopers to land as far from their village as possible, i.e., they want to maximize . Player 2 wins the ensuing battle with probability equal to their army's strength, i.e., .
Formally, the game is as follows. Only player 2 has private information, . Player 2 has the unrestricted revelation action set . First, player 2 chooses . Then player 1 chooses . Player 2 can either accept or reject . If player 2 accepts, the pair of payoffs is . Otherwise, player 1 plays , and for :
Let for any function . Define:
Then, we claim:
PROPOSITION 3. Let . Let player 1's strategy be then conditional on a given if player 2 does not reveal their type, otherwise then . Let player 2's strategy be to reveal any and only types for which , and to accept any and only . Let player 1's belief update to conditional on player 2 not revealing their type, and to conditional on player 2 not revealing their type and rejecting .
Then these strategies and beliefs are a perfect Bayesian equilibrium. Further, there exist such that this equilibrium is inefficient, and the payoff profile is (1) a Pareto improvement on the equilibrium payoff and (2) not IC.
PROOF. We proceed by backward induction. If player 2 has not revealed their type and has rejected , then given beliefs , we solve for the optimal . Player 1's expected payoff is . The squared loss is minimized at . This is equivalent to the average of the centers of rectangles whose union composes the region (see Figure 2), weighted by the areas of these rectangles, which can be shown to be:
Thus is a best response. If player 2 has revealed their type and rejected , then player 1's payoff is maximized at .
Next, player 2's best response to any is to accept if and only if the acceptance payoff exceeds the rejection payoff given player 1's strategy, that is, .
Then, given beliefs for each , player 1's optimal if player 2 does not reveal is:
Given player 2's strategy,
Since is uniform on , this probability is given by the ratio of the areas of the regions and . Thus . We have:
It can be shown (Lemma 3) that and . Therefore is of the form given above.
If player 2 reveals, in the analysis above we now have:
Thus is optimal, since increases with up to , after which it drops to .
It can be shown that , and so and for any type. Given these responses, if player 2 does not reveal their type, their payoff is . If player 2 reveals their type, since we have shown that , player 2's payoff is , and so player 2 prefers to reveal if and only if .
Finally, by the above strategy for player 2's type revelation, if player 2 does not reveal, to be consistent player 1 must update to the uniform distribution on the region defined by . Thus is consistent. If player 2 also rejects , player 1 knows that , that is, . Thus the updated belief is uniform on , so is consistent. This proves that the proposed strategy profile and beliefs are a perfect Bayesian equilibrium.
Given , all player 2 types reject (offered if player 2 does not reveal) in equilibrium, since . The equilibrium payoffs for any types that do not reveal are:
Consider the payoff profile
induced by the strategy profile in which player 1 offers and player 2 accepts any . This is feasible because player 2 only reveals if , and , so . For this to be a Pareto improvement on the perfect Bayesian equilibrium, it is sufficient that . This payoff profile is not IC, because player 2's payoff increases with , so any player 2 for which can profit from the strategy profile above conditioned on a type such that .
LEMMA 3. Given as defined above, and .
PROOF. We have:
We showed above that Further,
so:
Fix the programs of players as . Suppose player uses . Given this assumption, we omit the subscripts of and . Let and respectively denote calls to and made at level . If and for some reached in the call stack, then every call to and immediately returns . Consequently, every call to , which must be a parent call to , returns because line 5 in evaluates to . (Notice that the shared random variables are essential: if the programs unconditionally cooperated using independently sampled variables, an exponentially increasing number of variables would each need to be less than for all calls at a given level to return the cooperative output.) Let be the event that and , and be the event that and . Thus for the program profile to terminate in finite time, it is sufficient to show that with probability 1 there exists a finite such that holds. Given that for are independent, because they do not overlap, we have:
Since is the complement of the event we wanted to guarantee, this proves termination with probability 1. Further, the event is sufficient for every call of and to return and , respectively, and this holds for all levels less than . Therefore all base calls of the programs in the proposed profile return the corresponding with probability 1.
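To make the role of the shared random variables concrete, here is a small Monte Carlo sketch contrasting the two sampling schemes. The unconditional-cooperation probability (written eps below) and the branching factor b are illustrative stand-ins rather than the paper's actual parameters.

```python
import random

def depth_until_termination_shared(eps, rng):
    """Recursion levels until two consecutive levels share a draw below eps.

    With shared random variables, every call at level k sees the same uniform
    draw u_k, so the event {u_k < eps and u_{k+1} < eps} has probability eps**2
    at each pair of levels, and the first success arrives in finite time.
    """
    depth, prev_below = 0, False
    while True:
        below = rng.random() < eps
        depth += 1
        if below and prev_below:
            return depth
        prev_below = below

rng = random.Random(0)
eps = 0.05
samples = [depth_until_termination_shared(eps, rng) for _ in range(20_000)]
print("mean depth with shared draws:", sum(samples) / len(samples))  # roughly 1/eps**2

# With independently sampled variables, a call tree of branching factor b has about
# b**k calls at level k, and all of them must draw below eps for that level to give
# the cooperative output. The chance this ever happens can stay far below 1:
b = 2
p_never = 1.0
for k in range(1, 30):
    p_never *= 1.0 - eps ** (b ** k)
print("P(cooperative output never triggered) with independent draws:", round(p_never, 4))
```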
Now suppose player uses . Let be the smallest finite level such that , , and (which exists with probability 1 by a similar argument to that above). Then all and for return . Further, every for calls the truncated programs for , guaranteed to terminate by definition, thus terminates with either or . But because also guarantees that terminates, all calls to the programs of made by player 's programs terminate. Thus all base calls of programs in this profile with one deviation terminate with probability 1.
We now consider the possible cases. Suppose and . First, note that any players using know each other's types. To see this, note that all calls to for return . So any call to for will reveal to player if either or . The second condition is satisfied by assumption. Inductively applying this argument for , note that if , we will only have if (satisfying line 11 of ), but then this is sufficient to have return (line 21). If player does not reveal their type, all players return . Otherwise, proceeds to line 13 for all players . If player plays , then all other players also play , giving the target payoff profile. Otherwise, all players return . We therefore have that with probability at least , all players use whenever the outputs of do not match those of . Hence:
The post Commitment games with conditional information revelation appeared first on Center on Long-Term Risk.
The post Summer Research Fellowship appeared first on Center on Long-Term Risk.
For information about the fellowship and how to apply, see here.
Once a year, we run a two to three month summer research fellowship at our office in London. It usually takes place somewhere between the months of June and October. Applications for our 2023 Fellowship are now closed, but you can find the job description archived here. We are likely to open applications for our 2024 Fellowship in the first quarter of 2024. (If you'd like to be notified when this happens, please subscribe to our newsletter in the bottom-left corner of the website footer.)
Fellows have the opportunity to work on challenging research questions relevant to reducing suffering in the long-term future, whilst supervised by a researcher at CLR.
The main purpose of the fellowship is to support fellows in their career development. Fellows can learn more about s-risks, test their fit for research roles, and improve relevant skills. While not the main goal, research contributions may also influence our strategic direction, grantmaking, and other activities.
Participants become part of our team of intellectually curious, hard-working, and caring people, all of whom share a profound drive to make the biggest difference they can. In the past, some participants have continued their work as full-time members of our team or grantees of the CLR Fund. Over the last two years, most researchers we have hired had previously participated in the summer research fellowship.
In the past, fellows have often fit into one of the following categories:
There might be many other good reasons for participating in the fellowship. In general, we encourage you to apply if you think you would benefit from the program, even if your reasons are not listed above.
We work out with all incoming fellows how to make the fellowship as valuable as possible, given their individual strengths and needs. Often, this means focusing on learning and experimenting, rather than producing polished research output. In some cases, past fellows only started to work on what ended up as their main project more than a month into the fellowship.
"The fellowship was a great opportunity to explore new topics and pursue research threads that I wouldn’t have had spare capacity for during my PhD."
After his fellowship, Lewis continued his DPhil in computer science at the University of London and started working part-time for the Cooperative AI Foundation.
"I really loved the fellowship! I got to work on a really interesting and engaging project, and had amazing support from my supervisor. It got me very excited about potentially doing research long term, and overall made me feel much more confident in my ability to do so. The CLR office was also such a great place to work."
After her fellowship, Julia split her time working on community building and doing self-study on a grant from the CLR Fund.
"The summer fellowship at CLR was really valuable for me for two main reasons: 1) The fellowship is flexible and can fit a number of different people; for me, this meant that I had the freedom to pursue my own interests as part of my PhD. 2) Being immersed in a group of intelligent and diverse people was interesting, motivating, and fun! Due to the fellowship, I feel that I grew as a researcher, became more connected to the EA and AI safety communities, and made some friends. I really recommend the SRF at CLR."
After his fellowship, Rhys continued his PhD in Safe & Trusted AI at Imperial College London.
"I had a great experience on the SRF, and it helped me figure out what kinds of research I liked, as well as what sort of work I would like to do in the future."
After her fellowship, Megan received a grant from the CLR Fund for self-study.
"It was a great experience, I learnt a lot."
After his fellowship, Nicolas accepted an offer as a full-time Research Analyst at CLR.
The projects below were not necessarily published during the fellowship, but in each case the fellow started working on the project during their fellowship.
If you have any questions about the summer fellowship, please contact us at info@longtermrisk.org.
The post Summer Research Fellowship appeared first on Center on Long-Term Risk.
The post Replicating and extending the grabby aliens model appeared first on Center on Long-Term Risk.
This report is the most comprehensive model to date of aliens and the Fermi paradox. In particular, it builds on Hanson et al. (2021) and Olson (2015) and focuses on the expansion of ‘grabby’ civilizations: civilizations that expand at relativistic speeds and make visible changes to the volume they control.
This report considers multiple anthropic theories: the self-indication assumption (SIA), as applied previously by Olson & Ord (2022); the self-sampling assumption (SSA), implicitly used by Hanson et al. (2021); and a decision-theoretic approach, as applied previously by Finnveden (2019).
In Chapter 1, I model the appearance of intelligent civilizations (ICs) like our own. In Chapter 2, I consider how grabby civilizations (GCs) modify the number and timing of intelligent civilizations that appear.
In Chapter 3 I run Bayesian updates for each of the above anthropic theories. I update on the evidence that we are in an advanced civilization, have arrived roughly 4.5 Gy into the planet’s roughly 5.5 Gy habitable duration, and do not observe any GCs.
In Chapter 4 I discuss potential implications of the results, particularly for altruists hoping to improve the far future.
Starting from a prior similar to Sandberg et al.’s (2018) literature-synthesis prior, I conclude the following:
Using SIA or applying a non-causal decision theoretic approach (such as anthropic decision theory) with total utilitarianism, one should be almost certain that there will be many GCs in our future light cone.
Using SSA1, or applying a non-causal decision theoretic approach with average utilitarianism, one should be confident (~85%) that GCs are not in our future light cone, thus rejecting the result of Hanson et al. (2021). However, this update is highly dependent on one’s beliefs in the habitability of planets around stars that live longer than the Sun: if one is certain that such planets can support advanced life, then one should conclude that GCs are most likely in our future light cone. Further, I explore how an average utilitarian may wager there are GCs in their future light cone if they expect significant trade with other GCs to be possible.
These results also follow when taking (log)uniform priors over all the model parameters.
All figures and results are reproducible here.
To set the scene, I start with two vignettes of the future. This section can be skipped, and features terms I first explain in Chapters 1 and 2.
In a Monte Carlo simulation of draws, the world described below gives the highest likelihood for both SIA and SSA (with reference class of observers in intelligent civilizations). That is, civilizations like ours are both relatively common and typical amongst all advanced-but-not-yet-expansive civilizations in this world.
In this world, life is relatively hard. There are five hard try-try steps of mean completion time 75 Gy, as well as 1.5 Gy of easy ‘delay’ steps. Planets around red dwarfs are not habitable, and the universe became habitable relatively late -- intelligent civilizations can only emerge from around 8 Gy after the Big Bang. Around 0.3% of terrestrial planets around G-stars like our own are potentially habitable, making Earth not particularly rare.
Around 2.5% of intelligent civilizations like our own become grabby civilizations (GCs). This is the SIA Doomsday argument in action.
Around 7,000 GCs appear per observable universe-sized volume (OUSV). GCs already control around 22% of the observable universe, and as they travel at , their light has reached around 35% of the observable universe. Nearly all GCs appear between 10Gy and 18 Gy after the Big Bang.
If humanity becomes a GC, it will be slightly smaller than a typical GC - around 62% of GCs will be bigger. A GC emerging from Earth would in expectation control around 0.1% of the future light cone and almost certainly contain the entire Laniakea Supercluster, itself containing at least 100,000 galaxies.
The median time by which GCs will be visible to observers on Earth is around 1.5 Gy from now. It is practically certain humanity will not see any GCs any time soon: there is roughly 0.000005% probability (one in twenty million) that light from GCs reaches us in the next one hundred years2. GCs will certainly be visible from Earth in around 4 Gy.
As we will see, SIA is highly confident in a future similar to this one. SSA (with the reference class of observers in intelligent civilizations), on the other hand, puts greater posterior credence on human civilization being alone, even though worlds like these have high likelihood.
This world is one that a total utilitarian using anthropic decision theory would wager they are in, if they thought their decisions can influence the value of the future in proportion to the resources that an Earth-originating GC controls.
In this world, there are eight hard steps, with mean hardness 23 Gy and delay steps totaling 1.8 Gy. Planets capable of supporting advanced life are not too rare: around 0.004% of terrestrial planets are potentially habitable. Again, planets around longer-lived stars are not habitable.
Around 90% of ICs become GCs, and there are roughly 150 GCs that appear per observable universe sized volume. GCs expand at 0.85c, and a GC emerging from Earth would reach 31% of our future light cone, around 49% of its maximum volume, and would be bigger than ~80% of all GCs. Since there are so few GCs, the median time by which a GC is visible on Earth is not for another 20 Gy.
I use the term intelligent civilizations (ICs) to describe civilizations at least as technologically advanced as our own.
In this chapter, I derive a distribution of the arrival times of ICs, . This distribution is dependent on factors such as the difficulty of the evolution of life and the number of planets capable of supporting intelligent life. This distribution does not factor in the expansion of other ICs, which may prevent (‘preclude’) later ICs from existing. That is the focus of Chapter 2.
The distribution gives the number of other ICs that arrive at the same time as human civilization, as well as the typicality of the arrival time of human civilization, assuming no ICs preclude any other.
I write for the time since the Big Bang, which is estimated at 13.787 Gy (Ade 2016) [Gy = gigayear = 1 billion years].
Current observations suggest the universe is most likely flat (the sum of angles in a triangle is always 180°), or close to flat, and so the universe is either large or infinite. Further, the universe appears to be on average isotropic (there are no special directions in the universe) and homogeneous (there are no special places in the universe) (Saadeh et al. 2016, Maartens 2011).
The large or infinite size implies that there are volumes of the universe causally disconnected from our own. The collection of ‘parallel’ universes has been called the “Level I multiverse”. Assuming the universe is flat, Tegmark (2007) conservatively estimates that there is a Hubble volume identical to ours away, and an identical copy of you away.
I consider a large finite volume (LFV) of the level I multiverse, and partition this LFV into observable universe size (spherical) volumes (OUSVs)3. My model uses quantities as averages per OUSV. For example, will be the rate of ICs arriving per OUSV on average at time .
The (currently) observable universe necessarily defines the limit of what we can currently know, but not what we can eventually know. The eventually observable universe has a volume around 2.5 times that of the volume of the currently observable universe (Ord 2021).
The most action relevant volume for statistics about the number of alien civilizations is the affectable universe, the region of the universe that we can causally affect. This is around 4.5% of the volume of the observable universe. I will use the term affectable universe size volumes (AUSVs).
For an excellent discussion on this topic, I recommend Ord (2021).
I consider the path to an IC as made up of a number of steps:
I recommend Eth (2021) for an excellent introduction to try-try steps.
Abiogenesis is the process by which life has arisen from non-living matter. This process may require some extremely rare configuration of molecules coming together, such that one can model the process as having some rate 1/a of success per unit time on an Earth-sized planet.
The completion time of such a try-try step is exponentially distributed with PDF . Fixing some time , such as Earth’s habitable duration, the step is said to be hard if . When the step is hard, for , is constant since .
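As a quick numerical illustration of this hardness condition (the step hardness a and window t_H below are illustrative values, not taken from the report):

```python
import numpy as np

a = 75.0    # expected completion time of the step, in Gy (illustrative)
t_H = 4.5   # window within which the step must complete, in Gy (illustrative)

rng = np.random.default_rng(0)
draws = rng.exponential(a, size=2_000_000)
conditioned = draws[draws < t_H]          # completion times, given success within t_H

# For a hard step (a >> t_H) the conditional density is nearly flat on [0, t_H]:
hist, _ = np.histogram(conditioned, bins=5, range=(0.0, t_H), density=True)
print(np.round(hist, 3))                  # every bin is close to 1 / t_H ≈ 0.222
```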
Abiogenesis is one of many try-try steps that have led to human civilization. If there are try-try steps with expected times of completion, the completion time of the steps has hypoexponential distribution with parameter . For modeling purposes, I split these try-try steps into delay steps and hard steps.
I define the delay steps to be the maximal set of individual steps from the steps such that , the approximate duration life has taken on Earth so far. I then approximate the completion time of the delay try-try steps with the exponential distribution with parameter . If they exist, I also include any fuse steps4 in the sum of .
I write for the expected completion times of the remaining steps. These steps are not necessarily hard with respect to Earth's habitable duration. I model each to have log-uniform uncertainty between 1 Gy and Gy. With this prior, most are much greater than 5 Gy and so hard. I approximate the completion time of all of these steps with the Gamma distribution parameters and , the geometric mean hardness of the try-try steps.5 The Gamma distribution can further be described as a ‘power law’ as I discuss in the appendix.
I write for the PDF of the completion time of all the delay steps and hard try-try steps. Strictly, it is given as the convolution of the gamma distribution parameters , and exponential distribution parameter . When , where is the PDF of the Gamma distribution. That is, the delay steps can be approximated as completing in their expected time when they are sufficiently short in expectation.
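A short sketch of this approximation, again with illustrative values for the number of hard steps and their geometric-mean hardness:

```python
from scipy import stats

n, abar, t_H = 5, 75.0, 4.5   # illustrative: 5 hard steps of mean hardness 75 Gy, 4.5 Gy window
gamma = stats.gamma(a=n, scale=abar)

# Conditional on all n steps finishing within the habitable window t_H, the
# completion-time CDF is close to the "power law" (t / t_H)**n.
for t in (1.0, 2.0, 3.0, 4.0):
    conditional_cdf = gamma.cdf(t) / gamma.cdf(t_H)
    print(f"t={t}: conditional CDF {conditional_cdf:.3f} vs (t/t_H)**n {(t / t_H) ** n:.3f}")
```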
Priors on
After introducing each model parameter, I introduce my priors. Crucially, all the results in Chapter 3 roughly follow when taking (log)uniform priors over all parameters and so my particular prior choices are not too important.
I consider three priors on , the number of hard try-try steps. The first, which I call balanced, is chosen to give an implied prior number of ICs similar to existing literature estimates (discussed later in this chapter). My bullish prior puts greater probability mass on fewer hard steps and so implies a greater number of ICs. My bearish prior puts greater probability mass in many hard steps and so predicts fewer ICs.
My priors on are uninformed by the timing of life on Earth, but weakly informed by discussion of the difficulty of particular steps that have led to human civilization. For example, Sandberg et al. (2018) (supplement I) consider the difficulty of abiogenesis. In Chapter 3 I update on the time that all the steps are completed (i.e., now). I do not update on the timing of the completion of any potential intermediate hard steps, such as the timing of abiogenesis. Further, I do not update on the habitable time remaining, which is implicitly an anthropic update. I discuss this in the appendix.
Prior on
Given these priors on , I derive my prior on by the geometric mean of draws from the above-mentioned . I chose this prior to later give estimates of life in line with existing estimates. A longer tailed distribution is arguably more applicable.
Prior on
My prior on the sum of the delay and fuse steps has . By definition and smaller than makes little difference. My prior distribution gives median . The delay parameter can also include the delay time between a planet's formation and the first time it is habitable. On Earth, this duration could have been up to 0.6 Gy (Pearce et al. (2018)).
I also model “try-once” steps, those that either pass or fail with some probability. The Rare Earth hypothesis is an example of a try-once step. The possibility of try-once steps allows one to reject the existence of hard try-try steps while still positing very hard try-once steps.
I write for the probability of passing through all try-once steps. That is, if there are try-once steps then
The prior could arguably have a longer tail, and is loosely informed by discussion of potential Rare Earth factors here.
The parameters above give a distribution of appearance times of an IC on a given planet. In this section, I consider the maximum duration planets can be habitable for, the number of potentially habitable planets, and the formation of stars around which habitable planets can appear.
I write 6 for the maximum duration any planet is habitable for.7 The Earth has been habitable for between 3.9 Gy and 4.5 Gy (Pearce 2018) and is expected to be habitable for another ~1 Gy, so as a lower bound ⪆ 5 Gy. Our Sun, a G-type main-sequence star, formed around 4.6 Gy ago and is expected to live for another ~5 Gy.
Lower mass stars, such as K-type stars (orange dwarfs), have lifetimes between 15 and 30 Gy, and M-type stars (red dwarfs) have lifetimes up to 20,000 Gy. These lifetimes give an upper bound on the habitable duration of planets in that star’s system, so I consider up to around 20,000 Gy.
The habitability of these longer-lived stars is uncertain. Since red dwarf stars are dimmer (which results in their longer lives), habitable planets around red dwarf stars must be closer to the star in order to have liquid water, which may be necessary for life. However, planets closer to their star are more likely to be tidally locked. Gale (2017) notes that “This was thought to cause an erratic climate and expose life forms to flares of ionizing electromagnetic radiation and charged particles.” but concludes that in spite of the challenges, “Oxygenic Photosynthesis and perhaps complex life on planets orbiting Red Dwarf stars may be possible”.
This approach to modeling does not allow for planets around red dwarf stars that are habitable for periods equal to the habitable period of Earth. For example, life may only be able to appear in a crucial window in a planet’s lifespan.
Given a value of , I now consider the number of habitable planets. To derive an estimate of the number of potentially habitable planets, I only consider the number of terrestrial planets: planets composed of silicate rocks and metals with a solid surface. Recall that the parameter w can indirectly control the number of these that are actually habitable.
Zackrisson et al. (2016) estimate terrestrial planets around FGK stars and around M stars in the observable universe. Interpolating, I set the total number of terrestrial planets around stars that last up to per OUSV to be
Hanson et al. (2021) approximate the cumulative distribution of planet lifetimes with for and for . The fraction of planets formed at time habitable at time t is then given by .
These forms of and satisfy the property that for any, the expression - the number of planets per OUSV habitable for between and Gy - is independent of . In particular, the number of planets habitable for the same duration as Earth is independent of .
This is implicitly used later in the update: one does not need to explicitly condition on the observation that we are on a planet with habitable for ~5 Gy since the number of planets habitable for ~5 Gy is independent of the model parameters.
I use the term “habitable stars” to mean stars with solar systems capable of supporting life.
I follow Hanson et al. (2021) in approximating the habitable star formation rate with the functional form with power and decay where .
There is debate over the time the universe was first habitable.
Loeb (2016) argues for the universe being habitable as early as 10 My. There is discussion around how much gamma-ray bursts (GRBs) in the early universe prevent the emergence of advanced life. Piran (2014) conclude that the universe was inhospitable to intelligent life > 5 Gy ago. Sloan et al. (2017) are more optimistic and conclude that life could continue below the ground or under an ocean.
I introduce an early universe habitability parameter and function which gives the fraction of habitable planets capable of hosting advanced life at time relative to the fraction at . I take to be a sigmoid function with and (hence ). My prior on is log-uniform on (, 0.99).
A more sophisticated approach would consider the interaction between and the hard try-try steps, as suggested by Hanson et al. (2021).
The number of terrestrial planets per OUSV habitable at time is
Since for , the lower bound of the integral can be changed to .
Putting the previous sections together, the appearance rate of ICs per OUSV, , is given by
To recap:
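As a rough illustration of how these pieces combine, here is a schematic sketch of the appearance rate. Every functional form and constant below is a placeholder standing in for the report's components (planet formation history, early-universe habitability, the try-once probability, and the delay-plus-hard-step completion density), so only the overall structure, not the numbers, is meaningful.

```python
import numpy as np
from scipy import stats

n_hard, abar, delay = 5, 75.0, 1.5   # hard steps, their mean hardness (Gy), total delay time (Gy)
p_once = 1e-3                        # combined try-once pass probability (placeholder)
hard_steps = stats.gamma(a=n_hard, scale=abar)

def planet_formation_rate(t_f):
    """Placeholder habitable-planet formation rate, peaking a few Gy after the Big Bang."""
    return t_f ** 2 * np.exp(-t_f / 4.0)

def early_habitability(t_f):
    """Placeholder sigmoid: the early universe is assumed less hospitable to advanced life."""
    return 1.0 / (1.0 + np.exp(-(t_f - 5.0)))

def ic_rate(t, n_grid=2000):
    """Schematic rate of ICs appearing at time t (Gy), integrated over planet formation times."""
    t_f = np.linspace(0.0, t, n_grid)
    integrand = (planet_formation_rate(t_f)
                 * early_habitability(t_f)
                 * hard_steps.pdf(np.maximum(t - t_f - delay, 0.0)))
    return p_once * np.sum(integrand) * (t_f[1] - t_f[0])

for t in (8.0, 13.8, 30.0, 60.0):
    print(t, ic_rate(t))   # only the relative shape over time is meaningful here
```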
I now discuss two potential puzzles related to : Did humanity arrive at an unusually early time? And, where are all the aliens?
Depending on one’s choice of anthropic theory, one may update towards hypotheses where human civilization is more typical among the reference class of all ICs.
Here, I look at human civilization’s typicality using two pieces of data: human civilization’s arrival at and the fact that we have appeared on a planet habitable for ~5 Gy.
An atypical arrival time?
I write for the arrival time distribution normalised to be a probability density function. This tells us how typical human civilization’s arrival time is. That is, is the probability density of a randomly chosen (eventually) existing IC to have arrived at .
When planets are habitable for a longer duration, a greater fraction of life appears later. Further, when is greater, fewer ICs appear overall since life is harder, but a greater fraction of ICs appear later in their planets’ habitable windows – this is the power law of the hard steps.
An atypical solar system?
There are many more terrestrial planets around red dwarf stars than stars like our own. If these systems are habitable, then human civilization is additionally atypical (with respect to all ICs) in its appearance around a star like our sun. Further, life has a longer time to evolve around a longer lived star, so human civilization would be even more atypical. Haqq-Misra et al. (2018) discuss this but do not consider that the presence of hard try-try steps leads to a greater fraction of ICs appearing on longer-lived planets.
Resolving the paradox
Suppose a priori one believes and and uses an anthropic theory that updates towards hypotheses where human civilization is more typical among all ICs. Given these assumptions, one expects the vast majority of ICs to appear much further into the future and on planets around red dwarf stars. However, human civilization arrived relatively shortly after the universe first became habitable, on a planet that is habitable for only a relatively short duration, and is thus very atypical (according to our arrival time function, which does not factor in the preclusion of ICs by other ICs).
There are multiple approaches to resolving this apparent paradox.
First, one can reject their prior belief in high and , and update towards small and which lead us to believe we are in a more typical IC.
Second, one could change the reference class among which human civilization’s typicality is being considered. This, in effect, is changing the question being asked.8
Third and finally, one can prefer theories that set a deadline on the appearance of ICs like us. If the universe suddenly ended in 5 Gy, no more ICs could appear and, regardless of and , human civilization’s arrival time would be typical.
Hanson et al. (2021) resolve the paradox with such a deadline, the expansion of so-called grabby civilizations, which is the focus of Chapter 2. Alternative deadlines have been suggested, such as through false vacuum decay, which I briefly discuss in the appendix.
Some anthropic theories update towards hypotheses where there are a greater number of civilizations that make the same observations we do (containing observers like us).
The rate of XICs
I write for the rate of ICs per OUSV with feature where denotes “ICs arriving at now on a planet that has been habitable for as long as Earth has, and will be habitable for the same duration as Earth will be”.
The Earth has been habitable for between 4.5 Gy and 3.9 Gy (Pearce et al. 2018). I suppose that Earth has been habitable for 4.5 Gy, since if habitable for just 3.9 Gy, the 600 My difference can be (lazily) modeled as a fuse or delay step. Assuming for the time being that no IC precludes any other, this gives
Note that
Below, I vary and to see the effect on . The effect of on is linear, so uninteresting.
The term does not include the further feature of not observing any alien life. In the next chapter, I introduce the number of ICs with feature that also do not observe any alien life.
Where are all the aliens?
I write for the rate of ICs that appear per OUSV, supposing no IC precludes any other, which is given by .
My priors on , , , , and give the rate of ICs that appear per OUSV, supposing no IC precludes any other.
I chose the balanced prior on and prior on hard step hardness to give an implied distribution on comparable to the prior derived by Sandberg et al. (2018), which models the scientific uncertainties on the parameters of the Drake Equation. Sandberg et al.’s prior on the number of currently contactable ICs has a median of 0.3 and 38% credence in fewer than one IC currently existing in the observable universe. My balanced prior gives ~50% credence to a rate of less than one IC per OUSV and a median of ~1 IC appearing per OUSV, and so is more conservative.
The Fermi observation is the fact that we have not observed any alien life. For those with a high prior on the existence of alien life, such as my bullish prior, the Fermi paradox is the conflict between this high prior and the Fermi observation.
It may be hard for humanity to observe a typical IC, especially if they do not last long or emit enough electromagnetic radiation to be identified at large distances. If some fraction of ICs persist for a long time, expand at relativistic speeds, and make visible changes to their volumes, one can more easily update on the Fermi observation. Such ICs are called grabby civilizations (GCs).
The existence of sufficiently many GCs can ‘solve’ the earliness paradox by setting a deadline by which ICs must arrive, thus making ICs like us more typical in human civilization’s arrival time.
In this chapter, I derive an expression for , the rate of ICs per OUSV that have arrived at the same time as human civilization on a planet habitable for the same duration and do not observe any GCs.
Humanity has not observed any intelligent life. In particular, we have not observed any GCs.
Whether GCs are absent from our past light cone or we simply have not seen them yet is uncertain. GCs may deliberately hide or be hard to observe with humanity’s current technology.
It seems clearer that humanity is not inside a GC volume, and at minimum we can condition on this observation.9
In Chapter 3 I compute two distinct updates: one conditioning on the observation that there are no GCs in our past light cone, and one conditioning on the weaker observation that we are not inside a GC volume. If GCs prevent any ICs from existing in their volume, this latter observation is equivalent to the statement that “we exist in an IC”.
The second observation leaves ‘less room’ for GCs, since we are conditioning on a larger volume not containing any GCs.
I lean towards there being no GCs in our past light cone. By considering the waste heat that would be produced by Type III Kardashev civilizations (a civilization using all the starlight of its home galaxy), the G-survey found no Type III Kardashev civilizations using more than 85% of the starlight in the 10⁵ galaxies surveyed (Griffith et al. 2015). There is further discussion on the ability to observe distant expansive civilizations in this LessWrong thread.
I write for the average fraction of ICs that become GCs.10 I assume that this happens in an astronomically short duration and as such can approximate the distribution of arrival time of GCs as equal to the distribution of arrival times of ICs. That is, the arrival time distribution of GCs is given by .
It seems plausible a significant fraction of ICs will choose to become GCs. Since matter and energy are likely to be instrumentally useful to most ICs, expanding to control as much volume as they can (thus becoming a GC) is likely to be desirable to many ICs with diverse aims. Omohundro (2008) discusses instrumental goals of AI systems, which I expect will be similar to the goals of GCs (run by AI systems or otherwise).
Some ICs may go extinct before being able to become a GC. The extinction of an IC does not entail that no GC emerges. For example, an unaligned artificial intelligence may destroy its origin IC but become a GC itself (Russell 2021). ICs that trigger a (false) vacuum decay that expands at relativistic speeds can also be modeled as GCs.
I do not update on the fact we have not observed any ICs. The smaller , the greater the importance of the evidence that we have not seen any ICs.
I model GCs as all expanding spherically at some constant comoving speed .
To calculate the volume of an expanding GC, one must factor in the expansion of the universe.
Solving the Friedmann equation gives the cosmic scale factor , a function that describes the expansion of the universe over time.
With initial condition and , , and given by Ade et al. (2016). The Friedmann equation assumes the universe is homogeneous and isotropic, as discussed in Chapter 1.
Throughout, I use comoving distances, which give a distance that does not change over time due to the expansion of space. The comoving distance a probe travelling at speed that left at time reaches by time is . The comoving volume of a GC at time that has been growing at speed since time is
I take in units of fraction of the volume of an OUSV, approximately .
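A minimal sketch of this calculation for a flat matter-plus-Lambda cosmology. The parameter values are approximate Planck-era figures used purely for illustration, and the normalisation here (comoving distances in present-day Gly) is a choice made for this sketch rather than the report's convention.

```python
import numpy as np
from scipy.integrate import solve_ivp

H0 = 0.0693                      # Hubble constant in 1/Gy (about 67.7 km/s/Mpc)
omega_m, omega_l = 0.31, 0.69    # matter and dark-energy densities (approximate)
t_now = 13.8                     # Gy after the Big Bang

def friedmann(t, a):
    """da/dt from the flat Friedmann equation, radiation neglected."""
    return H0 * a * np.sqrt(omega_m / a ** 3 + omega_l)

t_grid = np.linspace(0.05, 60.0, 6000)
sol = solve_ivp(friedmann, (t_grid[0], t_grid[-1]), [1e-3], t_eval=t_grid, rtol=1e-8)
a_of_t = sol.y[0] / np.interp(t_now, t_grid, sol.y[0])   # normalise so a(t_now) = 1

def comoving_distance(v, t_start, t_end):
    """Comoving distance (Gly) reached by a probe of speed v (fraction of c) leaving at t_start."""
    mask = (t_grid >= t_start) & (t_grid <= t_end)
    return np.sum(v * 1.0 / a_of_t[mask]) * (t_grid[1] - t_grid[0])   # c = 1 Gly per Gy

# Comoving radius and volume of a civilization expanding at 0.8c for 10 Gy from now:
r = comoving_distance(0.8, t_now, t_now + 10.0)
print(round(r, 2), "Gly ->", round(4.0 / 3.0 * np.pi * r ** 3, 1), "Gly^3 comoving")
```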
Supposing humanity expands at , delaying colonization by 100 years results in about a 0.0000019% loss of volume. Due to the clumping of stars in galaxies and galaxies in clusters, it’s possible this results in no loss of useful volume.
Following Olson (2015) I write for the average fraction of OUSVs unsaturated by GCs at time and take functional form
Recall that the product is the rate of GCs appearing per OUSV at time . Since is a function of the parameters , , , , and , the function is too.
This functional form for assumes that when GCs bump into other GCs, they do not speed up their expansion in other directions.
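One common way to write such a form is the exponential "independent nucleation" expression used in Olson-style models; whether this matches the report's exact functional form is an assumption here, and the GC birth rate and per-GC volume below are placeholders.

```python
import numpy as np

t_grid = np.linspace(0.0, 60.0, 2400)
dt = t_grid[1] - t_grid[0]

def gc_birth_rate(t):
    """Placeholder rate of GCs appearing per OUSV per Gy."""
    return 0.5 * np.exp(-((t - 15.0) ** 2) / 18.0)

def gc_volume(t_birth, t):
    """Placeholder comoving volume (fraction of an OUSV) of a GC born at t_birth."""
    radius = 0.01 * np.maximum(t - t_birth, 0.0)     # crude constant comoving speed
    return (4.0 / 3.0) * np.pi * radius ** 3

def unsaturated_fraction(t):
    """exp(-expected 'extended' GC volume per OUSV) at time t."""
    births = t_grid[t_grid < t]
    return np.exp(-np.sum(gc_birth_rate(births) * gc_volume(births, t)) * dt)

for t in (15.0, 25.0, 40.0, 60.0):
    print(t, round(unsaturated_fraction(t), 3))
```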
The actual volume of a GC
I write for the expected actual volume of a GC at time that began expanding at time at speed . Trivially, since GCs that prevent expansion can only decrease the actual volume. If GCs are sufficiently rare, then . I derive an approximation for in the appendix.
Later, I use the actual volume of a GC as a proxy for the total resources it contains. On a sufficiently large scale, mass (consisting of intergalactic gas, stars, and interstellar clouds) is homogeneously distributed within the universe. This proxy most likely underweights the resources of later arriving GCs due to the gravitational binding of galaxies and galaxy-clusters.
A new arrival time distribution
The distribution of IC arrival times,, can be adjusted to account for the expansion of GCs, which preclude ICs from arriving. I define that gives the rate of ICs appearing per OUSV, and write for the number of ICs that actually appear per OUSV.
The actual number of XICs
I define to be the actual number of ICs with feature X to appear, accounting for the expansion of GCs. I consider two variants of this term.
I write for the rate of ICs with feature X per OUSV that do not observe GCs. Since information about GCs travels at the speed of light, gives the fraction of OUSVs that is unsaturated by light from GCs at time . Then, gives the number of XICs per OUSV with no GCs in their past light cone.
Similarly, I write 11 for the rate of ICs with feature X per OUSV that are not inside a GC volume, where v is the expansion speed of GCs. In this case, .
The Fermi observation limits the number of early arriving GCs: when there are too many GCs the existence of observers like us is rare or impossible.
For anthropic theories that prefer more observers like us, there is a push in the other direction. If life is easier, there will be more XICs.
For anthropic theories that prefer observers like us to be more typical, there is potentially a push towards the existence of GCs that set a cosmic deadline and lead to human civilization not being unusually early.
In the next chapter, I derive likelihood ratios for different anthropic theories and produce results.
I’ve presented all the machinery necessary for the updates, other than the anthropic reasoning. I hope this chapter is readable without knowledge of the previous two.
I now apply three approaches to dealing with anthropics:
I have three joint priors over the following eight parameters.
I update on either the observation I label or observation I label . Both and include observing that we are in an IC that
additionally contains the observation that we do not see any GCs. Alternatively, additionally contains the observation that we are not inside a GC (equivalently, that we exist, if we expect GCs to prevent ICs like us from appearing).
I walk through each anthropic theory, in turn, derive a likelihood ratio, and produce results. In Chapter 4 I discuss potential implications of these results.
By Bayes rule
I have already given my priors and so it remains to calculate the likelihood P(X|n, ..., v). I derive likelihoods in the discrete case, and index my priors by worlds .
I use the following definition of the self-indication assumption (SIA), slightly modified from Bostrom (2002):
All other things equal, one should reason as if they are randomly selected from the set of all12 possible observer moments (OMs) [a brief time-segment of an observer].13
Applying the definition of SIA,
That is, SIA updates towards worlds where there are more OMs like us. Since the denominator is independent of , we only need to calculate the numerator, .
By my choice of definitions, is proportional to , the number of ICs with feature X that actually appear per OUSV. The constant of proportionality is given by the number of OMs per IC, which I suppose is independent of model parameters, as well as the number of OUSVs in the earlier specified large finite volume. Again, these constants are unnecessary due to the normalisation.
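A toy version of this discrete update, with made-up prior weights and made-up values standing in for the number of ICs like us per OUSV in each world:

```python
import numpy as np

prior = np.array([0.25, 0.25, 0.25, 0.25])   # prior over four illustrative worlds
n_xic = np.array([1e-6, 1e-3, 1.0, 50.0])    # ICs like us per OUSV in each world (made up)

posterior_sia = prior * n_xic                # SIA weight: proportional to observers like us
posterior_sia /= posterior_sia.sum()
print(np.round(posterior_sia, 4))
# Almost all posterior mass shifts to the worlds containing the most observers like us.
```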
The three summary statistics implied by the posterior are below. As mentioned before, the updates are reproducible here.
[Posterior summary statistics, updating on each of the two observations.]
SIA updates overwhelmingly towards the existence of GCs in our light cone from all three of my priors. If a GC does not emerge from Earth, most of the volume will be expanded into by other GCs.
I discuss some marginal posteriors here, and reproduce all the marginal posteriors in the appendix.
SIA updates towards smaller as the existence of more GCs can only decrease the number of observers like us. This is the “SIA Doomsday” described by Grace (2010). This result is the same as found by Olson & Ord (2021) whereby the prior on goes from prior to posterior .
The SIA update is overwhelmingly towards smaller . Increasing only increases the number of GCs that could preclude XICs.
I use the following definition of the self-sampling assumption (SSA), again slightly modified from Bostrom (2002):
All other things equal, one should reason as if they are randomly selected from the set of all actually existent observer moments (OMs) in their reference class.14
A reference class is a choice of some subset of all OMs. Applying the definition of SSA with reference class ,
That is, SSA updates towards worlds where observer moments like our own are more common in the reference class.
I first consider two reference classes, and . The reference class contains only OMs contained in ICs, and no OMs in GCs. This is the reference class implicitly used by Hanson et al. (2021). The reference class also includes observers in GCs. I later consider the minimal reference class, containing only observers who have identical experiences, paired with non-causal decision theories.
This is the reference class implicitly used by Hanson et al. (2021). I reach different conclusions from Hanson et al. (2021), and discuss a possible error in their paper in the appendix.
The total number of OMs in is proportional to the number of ICs, . As in the SIA case, the number of XOMs is proportional to , so the likelihood ratio is .
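Continuing in the same toy style, the corresponding update looks like this (all numbers are again made up for illustration):

```python
import numpy as np

prior = np.array([0.25, 0.25, 0.25, 0.25])
n_xic = np.array([1e-6, 1e-3, 1.0, 50.0])      # ICs like us per OUSV in each toy world
n_ic = np.array([1e-5, 1e-2, 20.0, 5000.0])    # all ICs per OUSV in each toy world

# SSA with the reference class of IC observers: the likelihood is the fraction of a
# world's IC observer moments that are like ours, so the update uses n_xic / n_ic.
posterior_ssa = prior * (n_xic / n_ic)
posterior_ssa /= posterior_ssa.sum()
print(np.round(posterior_ssa, 3))
# Unlike SIA, worlds with many ICs are not favoured per se; what matters is how
# typical observers like us are among all IC observers in each world.
```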
[Posterior summary statistics, updating on each of the two observations.]
SSA has updated away from the existence of GCs in our future light cone.
In the appendix, I discuss how this update is highly dependent on the lower bound on the prior for . Again, smaller is unsurprisingly preferred.
This reference class contains all OMs that actually exist in our large finite volume, and so includes OMs that GCs create. It is sometimes called the “maximal” reference class15.
I model GCs as using some fraction of their total volume to create OMs. I suppose that this fraction and the efficiency of OM creation are independent of the model parameters. These constants do not need to be calculated, since they cancel when normalising.
The total volume controlled by all GCs is proportional to , the average fraction of OUSVs saturated by GCs at some time when all expansion has finished16.
I assume that a single GC creates many more OMs than are contained in a single IC. Since my prior on has and I expect GCs to produce many OMs, I see this as a safe assumption. This assumption implies the total number of OMs is proportional to . The SSA likelihood ratio is .
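In the same toy style, the denominator is now driven by the observer moments created by GCs, here taken proportional to the fraction of space GCs eventually saturate (all numbers are made up):

```python
import numpy as np

prior = np.array([0.25, 0.25, 0.25, 0.25])
n_xic = np.array([1e-6, 1e-3, 1.0, 50.0])        # ICs like us per OUSV in each toy world
gc_fraction = np.array([1e-4, 0.05, 0.6, 0.99])  # eventual GC-saturated fraction of space

posterior = prior * n_xic / gc_fraction
posterior /= posterior.sum()
print(np.round(posterior, 3))
# Holding n_xic fixed, worlds where GCs eventually fill more of space get less weight,
# since observers like us then make up a smaller share of all observer moments.
```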
I do not see this update as particularly informative, since I expect GCs to create simulated XOMs, which I explore later in this chapter.
[Posterior summary statistics, updating on each of the two observations.]
Notably, SSA updates towards as small as possible, since increasing the speed of expansion increases the number of observers created that are not like us — the denominator in the likelihood ratio.
As with the SSA update, this result is sensitive to the prior on , which I discuss in the appendix.
In this section, I apply non-causal decision theoretic approaches to reasoning about the existence of GCs. This chapter does not deal with probabilities, but with ‘wagers’. That is, how much one should behave as if they are in a particular world.
The results I produce are applicable to multiple non-causal decision theoretic approaches.
The results are applicable for someone using SSA with the minimal reference class () paired with a non-causal decision theory, such as evidential decision theory (EDT). The minimal reference class contains only observers identical to you, and so updating using SSA with Rmin simply removes any world where there are no observers with the same observations as you, and then normalises.
The results are also applicable for someone (fully) sticking with their priors (being ‘updateless’) and using a decision theory such as anthropic decision theory (ADT). ADT, created by Armstrong (2011), converts questions about anthropic probability to decision problems, and Armstrong notes that “ADT is nothing but the Anthropic version of the far more general ‘Updateless Decision Theory’ and ‘Functional Decision Theory’”.
I suppose that all decision relevant ‘exact copies’ of me (i.e. instances of my current observations) are in one of the following situations
Of course, copies may be in non-decision relevant situations, such as short-lived Boltzmann brains.
For each of the above three situations, I calculate the expected number of copies of me per OUSV; for example, in case (1), the number of copies is proportional to the number of ICs that become GCs. I do not calculate the constant of proportionality (which would be very small); this constant is redundant when considering the relative decision-worthiness of different worlds.
My decisions may correlate with those of agents that are not identical copies of me (at a minimum, near-identical copies), which I do not consider in this calculation. If the relative increase in decision-worthiness from correlated agents is equal across all situations, the overall relative decision-worthiness is unchanged.
To motivate the need to consider these three cases, I claim that our decisions are likely contingent on the ratio of our copies in each category and the ratio of the expected utility of our possible decisions in each scenario. For example, if we were certain that none of our copies were in ICs that became GCs, or all of our copies were in short-lived simulations, we may prioritise improving the lives of current generations of moral patients.
The GC wager
I choose to model all the expected utility of our decisions as coming from copies in case (1). That is, to make decisions premised on the wager that we are in an IC that becomes a GC and not in an IC that doesn’t become a GC, nor in a short-lived simulation.
Tomasik (2016) discusses the comparison of the decision-worthiness of (1) and (2) with that of (3). My assumption that (1) dominates (2) is driven by my prior distribution on $f_{GC}$ (which is bounded below by 0.01) and by the expected resources of a single GC dominating the resources of a single IC.
Counterarguments to this assumption may appeal to the uncertainty about the ability to affect the long-run future. For example, if a GC emerged from Earth in the future but all the consequences of one’s actions ‘wash out’ before that point, then (1) and (2) would be equally decision-worthy.
I expect that forms of lock-in, such as the values of an artificial general intelligence, provide a route for altruists to influence the future. I suppose that a total utilitarian's decisions matter more in cases where the Earth-emerging GC is larger; in fact, I suppose a total utilitarian's decisions matter in linear proportion to the eventual volume of such a GC.
An average utilitarian's decisions then matter in proportion to the ratio of the eventual volume of an Earth-emerging GC to the volume controlled by all GCs, supposing that GCs create moral patients in proportion to their resources.
Calculating decision-worthiness
To compute the decision-worthiness of each world, I multiply the expected number of copies of me in ICs that become GCs by the influence each copy has in that world. This gives the degree to which I should wager my decisions on being in a particular world.
Total utilitarianism
The number of copies of me in ICs that become GCs is proportional to the number of XICs that actually appear multiplied by $f_{GC}$. Using the assumption that our influence is linear in resources, the decision-worthiness of each world is this number of copies multiplied by the expected actual volume of such a GC.
I use the label “ADT total” for this case.
[Figure: ADT total updates with each of the two observations]
Total utilitarians using a non-causal decision theory should behave as if they are almost certain of the existence of GCs in their future light cone. However, the number of GCs is fairly low: around 40 per AUSV.
Average utilitarianism
As before, the number of copies of me in ICs that become GCs is proportional to the number of XICs multiplied by $f_{GC}$, and again each such GC has the same expected actual volume as in the total utilitarian case. The resources of all GCs are proportional to the average fraction of OUSVs they saturate. Supposing that GCs create moral patients in proportion to their resources, the decision-worthiness of each world is the number of copies multiplied by the ratio of the Earth-emerging GC's expected volume to the total volume controlled by all GCs.
I use the label “ADT average” for this case.
[Figure: ADT average updates with each of the two observations]
An average utilitarian should behave as if there are most likely no GCs in the future light cone. As with the SSA updates, this result is sensitive to the prior on $n$ and is explored in an appendix.
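To make the two wagers concrete, here is a minimal sketch of the decision-worthiness computations described above. Every parameter value is a hypothetical stand-in; none of these numbers come from the post's priors or posteriors.

```python
# Hypothetical stand-in values; not the post's actual parameters.
n_xic = 1.0       # relative number of XICs appearing per OUSV
f_gc = 0.1        # fraction of ICs that become GCs
v_gc = 1e6        # expected actual volume of the Earth-emerging GC
v_all_gcs = 1e8   # total volume controlled by all GCs in this world

copies = n_xic * f_gc                     # copies of me in ICs that become GCs

# Total utilitarian: influence is linear in the Earth-emerging GC's volume.
dw_total = copies * v_gc
# Average utilitarian: influence is the GC's share of all GC resources.
dw_average = copies * v_gc / v_all_gcs

print(dw_total, dw_average)
```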
I now model two types of interactions between GCs: trade and conflict.
The model of conflict that I consider decreases the decision worthiness of cases where there are GCs in our future light cone. I show that a total utilitarian should wager as if there are no GCs in their future light cone if they think the probability of conflict is sufficiently high.
The model of trade I consider increases the decision worthiness of cases where there are GCs in our future light cone. I show that an average utilitarian should wager that there are GCs in their future light cone if they think there are sufficiently large gains from trade with other GCs.
The purpose of these toy examples is to illustrate that a total or average utilitarian’s true wager with respect to GCs may be more nuanced than presented earlier.
Total utilitarianism and conflict
Suppose we are in the contrived case where:
When conflict occurs, an Earth-originating GC has some probability of getting its maximal volume. Supposing a total utilitarian's decisions can influence both cases equally, the expected decision-worthiness per copy in an IC that becomes a GC is the probability-weighted volume obtained.
As before, multiplying by the number of copies of me in ICs that become GCs gives the decision-worthiness.
Intuitively, since the conflict in expectation is a net loss of resources for all GCs, this leads one to wager one’s decisions against the existence of GCs in the future.
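A sketch of this conflict adjustment follows. Since the list of assumptions above is incomplete in this copy, the sketch assumes a winner-takes-all contest in which the losing GC retains nothing, and the probability and volume are hypothetical stand-ins.

```python
p_win = 0.5    # hypothetical probability of keeping the maximal volume
v_max = 1e6    # hypothetical maximal volume of the Earth-originating GC

# Expected decision-worthiness per copy in an IC that becomes a GC,
# assuming the loser of a conflict retains no volume:
dw_per_copy = p_win * v_max + (1 - p_win) * 0.0
print(dw_per_copy)
```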
Average utilitarianism and trade
I apply a very basic model of gains from trade between GCs with average utilitarianism. I suppose that one can only trade with other GCs within the affectable universe.
Intuitively the decision worthiness goes up in a world with trade as there is more at stake: our GC can both influence its own resources and the resources of other GCs. This model of trade would also increase the degree to which a total utilitarian would wager there are GCs in their future light cone.
I suppose an average utilitarian GC completes a trade of return $r$ by spending $1/r$ of their resources (which they could otherwise use to increase the welfare of moral patients by a single unit) in return for the welfare of moral patients being increased by one unit. For $r > 1$ the GC benefits by making the trade, and so should always make such a trade rather than using the resources to create utility themselves. I write $p(r)$ for the probability density of a randomly chosen trade providing return $r$, and suppose that the 'volume' of available trades is proportional to the volume saturated by GCs, which itself is proportional to the average saturated fraction.
I take $p(r) \propto r^{-k}$ for some $k$. For smaller $k$, a greater proportion of all available trades are beneficial, and a greater number are very beneficial. For example, for $k=1$ some fraction of the volume controlled by GCs admits beneficial trades ($r > 1$), and a smaller fraction allows for trades that return twice as much as they put in ($r > 2$); for larger $k$ both fractions shrink.
Note that smaller $k$ supposes a very large ability to control the effective resources of other GCs through trade. Some utility functions may be more conducive to expecting such high trade ratios.
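A numerical sketch of this trade density follows (my own code, not the post's). The post's support for $r$ and its example values are lost in this copy, so the bounds below are hypothetical; the printed fractions illustrate only the qualitative claim that smaller $k$ yields more, and more beneficial, trades.

```python
from scipy import integrate

R_MIN, R_MAX = 0.1, 100.0   # hypothetical support for trade returns r

def density(r: float, k: float) -> float:
    """Unnormalised power-law trade density p(r) proportional to r^(-k)."""
    return r ** -k

def fraction_above(threshold: float, k: float) -> float:
    """Fraction of available trades with return r > threshold."""
    total, _ = integrate.quad(density, R_MIN, R_MAX, args=(k,))
    above, _ = integrate.quad(density, threshold, R_MAX, args=(k,))
    return above / total

for k in (1.0, 2.0):
    print(f"k={k}: P(r>1) = {fraction_above(1.0, k):.3f}, "
          f"P(r>2) = {fraction_above(2.0, k):.3f}")
```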
I suppose that the decision-worthiness for each copy of an average utilitarian is linear in the ratio of the effective resources that the future GC controls (i.e. the total resources the GC would need to produce the same utility without trade) to the total resources controlled by all GCs. Other GCs may also increase the effective resources they control; for simplicity, I assume that such GCs do not use their increased effective resources to change the number or welfare of otherwise-existing moral patients.
Average utilitarians should wager their decisions on the existence of (many) GCs if they expect high trade ratios, and the ability to linearly influence the value of these trades.
In this section, I return to probabilities and consider updates for SIA and SSA in the case where GCs create simulated observers like us. For the most part, the results are similar to those seen so far: SIA supports the existence of many GCs, and SSA does not. Since SSA with the IC-only reference class does not include observers created by GCs, its results are independent of the existence of any simulated observers created by GCs.
This section implicitly assumes that the majority of observers like us (XOMs) are in simulations (run by GCs), as argued by Bostrom (2003). Chapter 4 does not depend on any discussion here, so this subsection can be skipped.
In the future, an Earth-originating GC may create simulations of the history of Earth or simulate worlds containing counterfactual human civilizations. I call these ancestor simulations (AS).
Bostrom (2003) concludes that at least one of the following is true: (1) almost all civilizations at our level of development go extinct before becoming technologically mature; (2) technologically mature civilizations run almost no simulations of their own histories; (3) we are almost certainly living in a simulation.
GCs other than humanity's may create AS of their own pasts as ICs. The OMs in AS created by GCs that transitioned from XICs will be XOMs.
As well as running simulations of their own past, GCs may create simulations of other ICs. GCs may be interested in the values or behaviours of other GCs they may encounter, and can learn about the distribution of these by running simulations of ICs.
I use the term historical simulations (HS) to describe a behaviour of simulating ICs where the distribution of simulated ICs is equal to the true distribution of ICs. That is, the simulations are representative of the outside world, even if GCs run the simulations one IC at a time.
GCs may create many other OMs, simulated or not, of which none are XOMs. For example, a post-human GC may create a simulated utopia of OMs. I use other OMs as a catch-all term for such OMs.
I model GCs as either creating ancestor simulations (AS) or historical simulations (HS), and as spending either a fixed amount of resources on simulations or an amount proportional to the resources they control. Fixed means that the amount each GC spends is independent of the model parameters; it does not mean each GC creates the same number.
I first give an example to motivate the claim that when GCs create simulated XOMs, the majority of all XOMs are in such simulations rather than at the 'basement level'.
Bostrom (2003) estimates that the resources of the Virgo Supercluster, a structure that contains the Milky Way and could be fully controlled by an Earth-originating GC, could be used to run an astronomical number of human lives per second, each containing many OMs. Around $10^{11}$ humans have ever lived: if we expect a GC to emerge in the next few centuries, it seems unlikely more than $10^{12}$ humans will have lived by this time. In this case, only one hundred million trillionths of a GC's resources would need to be used for a single second to create an equal number of XOMs to the number of basement-level XOMs.
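The arithmetic can be sketched as follows. Because the post's exact figures are garbled in this copy, the constants below are stand-ins, not Bostrom's or the author's numbers.

```python
lives_per_second = 1e30   # hypothetical simulated human lives a GC runs per second
basement_humans = 1e12    # upper bound on basement-level humans (from the text)

# Fraction of one second of the GC's resources needed to create as many
# simulated XOMs as there are basement-level XOMs:
fraction = basement_humans / lives_per_second
print(f"{fraction:.0e}")  # 1e-18 under these stand-in numbers
```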
When GCs create AS or HS, I assume that the number of XOMs in AS or HS far exceeds the number of XOMs in XICs. That is, most observers like us are in simulations.
Both SIA and SSA support the existence of simulations of XOMs: holding all else equal, creating simulated XOMs (trivially) increases the number of XOMs and the ratio |XOMs|/|OMs|.
I first calculate |XOMs| for each simulation behaviour; these give the SIA likelihood ratios. As previously discussed in the SSA case, I suppose that the vast majority of OMs are in GCs and so are created in proportion to the resources controlled by GCs. Dividing |XOMs| by |OMs| then gives the SSA likelihood ratio.
For each simulation behaviour, I derive the quantity to which |XOMs| is proportional:

AS fixed: I assume that the fixed number of OMs each GC creates is much greater than the number of basement-level XOMs, so one can approximate all XOMs as contained in AS. The number of XICs that actually appear, of which a fraction $f_{GC}$ will become GCs, determines the number of GCs running AS that contain XOMs.

HS fixed: The total number of GCs that appear, each creating some average number of HS containing some average constant number of XOMs, multiplied by the fraction of ICs in HS which are XICs, gives |XOMs|. Intuitively, this is equal to the AS fixed case, as the same ICs are being sampled and simulated, but the distribution of which GC simulates which IC has been permuted.

AS resource proportional: The number of GCs that create AS containing XICs is as in the AS fixed case, but the number of AS each of these GCs creates is proportional to the actual volume each would control.

HS resource proportional: Of all HS created, the fraction that are of XICs equals the true fraction of ICs that are XICs. The total number of HS created is proportional to the average fraction of OUSVs saturated by GCs.
Note that the derivations above give equivalences between some of these cases (in particular, AS fixed and HS fixed), and so these are not calculated again.
[Table: updates with each observation for AS fixed / HS fixed and HS resource proportional]
[Table: updates with each observation for AS fixed / HS fixed]
| GC simulation behaviour | SIA | ADT total utilitarianism | ADT average utilitarianism | SSA | SSA |
| No XOMs | 1 | 4 | 5 | 6 | 8 |
| HS-fixed | 2 | 4 | 5 | 7 | 8 |
| AS-fixed | 2 | 4 | 5 | 7 | 8 |
| HS-rp | 3 | 4 | 5 | 8 | 8 |
| AS-rp | 4 | 4 | 5 | 5 | 8 |

In the above table, the left column gives the shorthand description of GC simulation-creating behaviour, and the remaining columns give the update under each anthropic theory (the two SSA columns correspond to the two reference classes). Equivalent updates share the same number.
The posterior credence in being alone in the observable universe, conditioned on each observation, is shown below for each numbered update:
Prior | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Bullish | <0.1% | <0.1% | <0.1% | 0.2% | 70% | 68% | 69% | 64% |
Balanced | <0.1% | <0.1% | <0.1% | 0.2% | 89% | 89% | 89% | 85% |
Bearish | <0.1% | <0.1% | <0.1% | 0.2% | 94% | 95% | 95% | 92% |
These results replicate previous findings:
These results fail to replicate Hanson et al.'s (2021) finding that SSA (with their implicitly used reference class) implies the existence of GCs in our future.
To my knowledge, this is the first model that
In the appendix, I also produce variants of the updates for different priors: taking (log)uniform priors on all parameters, and varying the prior on $n$.
My preferred approach is to use a non-causal decision theoretic approach, and reason in terms of wagers rather than probabilities.
Within the choice of utility function in finite worlds, forms of total utilitarianism are more appealing to me. However, it seems likely that the world is infinite and that aggregative consequentialism must confront infinitarian paralysis—the problem that in infinite worlds one is ethically indifferent between all actions. Some solutions to infinitarian paralysis require giving up on the maximising nature of total utilitarianism (Bostrom (2011)) and may look more averagist. However, interactions with other GCs - such as through trade - make it plausible that even average utilitarians should behave as if GCs are in their future light cone.
Having said this, theoretical questions remain with the use of non-causal decision theories (e.g. comments here on UDT and FDT).
If an Earth-originating GC observes another GC, it will most likely not be for hundreds of millions of years. By this point, one may expect such a civilization to be technologically mature and any considerations related to the existence of aliens to be redundant. Further, any actions we take now may be unable to influence the far future. Given these concerns, are any of the conclusions action-relevant?
Primarily, I see these results being most important for the design of artificial general intelligence (AGI). It seems likely that humanity will hand off control of the future, inadvertently or by design, to an AGI. Some aspects of an AGI humanity builds may be locked-in, such as its values, decision theory or commitments it chooses to make.
Given this lock-in, altruists concerned with influencing the far future may be able to influence the design of AGI systems to reduce the chance of conflict between this AGI and other GCs (presumably also controlled by AGI systems). Clifton (2020) outlines avenues to reduce cooperation failures such as conflict.
Bostrom (2003) gives a lower bound on the number of biological human lives lost per second of delayed colonization, due to the finite lifetimes of stars. This estimate does not include stars that become unreachable for a human civilization due to the expansion of the universe.
The existence of GCs in our future light cone may strengthen or weaken this consideration. If GCs are aligned with our values, then even if a GC never emerges from Earth, the cosmic commons may still be put to good use. This does not apply when using SSA or a non-causal decision theory with average utilitarianism, which expect that only a human GC can reach much of our future light cone.
The results have clear implications for the search for extraterrestrial intelligence (SETI).
One key result is the strong update against the habitability of planets around red dwarfs. For the self-sampling assumption or a non-causal decision-theoretic approach with average utilitarianism, there is great value of information in learning whether such planets are in fact suitable for advanced life: if they are, SSA strongly endorses the existence of GCs in our future light cone, as discussed in the appendix. SIA, or a non-causal decision-theoretic approach with total utilitarianism, is confident in the existence of GCs in our future light cone regardless of the habitability of red dwarfs.
The model also informs the probability of success of SETI for ICs in our past lightcone. Such ICs may not be visible to us now if they were too quiet for us to notice or did not persist for a long time.
Barnett (2022) discusses and gives an admittedly “non-robust” estimate of “0.1-0.2% chance that SETI will directly cause human extinction in the next 1000 years”.
I consider the implied posterior distribution on the probability of a GC becoming observable in the next thousand years. The (causal) existential risk from GCs is strictly smaller than the probability that light reaches us from at least one GC, since the former entails the latter.
The posteriors imply a relatively negligible chance of contact (observation or visitation) with GCs in the next 1,000 years even for SIA.
However, the risk in the next 1,000 years is then more likely to come from GCs that are already potentially observable but that we have not yet observed - perhaps more advanced telescopes will reveal such GCs.
Further work
I list some further directions in which this work could be taken. All the calculations can be found here.
I have not updated on all the evidence available. Further evidence one could update on includes:
Modeling assumptions can be improved:
More variations of the updates could be considered:
More thought could be put into the prior selection (though the main results still follow from (log)uniform priors):
I would like to thank Daniel Kokotajlo for his supervision and guidance. I’d also like to thank Emery Cooper for comments and corrections on an early draft, and Lukas Finnveden and Robin Hanson for comments on a later draft. The project has benefited from conversations with Megan Kinniment, Euan McClean, Nicholas Goldowsky-Dill, Francis Priestland and Tom Barnes. I'm also grateful to Nuño Sempere and Daniel Eth for corrections on the Effective Altruism Forum. Any errors remain my own.
This project started during Center on Long-Term Risk’s Summer Research Fellowship.
| Parameter | Definition |
| $n$ | The number of hard try-try steps |
| $\nu$ | The geometric mean of the hard steps ("hardness") |
| | The sum of the delay and fuse steps, strictly less than Earth's habitable duration |
| | The probability of passing through all try-once steps in the development of an IC |
| | The maximum duration a planet can be habitable for |
| | The decay power of gamma-ray bursts |
| $s$ | The average comoving speed of expansion of GCs |
| $f_{GC}$ | The fraction of ICs that become GCs |
| IC | Intelligent civilization |
| XIC | Intelligent civilizations similar to human civilization |
| GC | Grabby civilization |
| OM | Observer moment |
| OUSV | Observable universe size volume |
| AUSV | Affectable universe size volume |
| SIA | Self-indication assumption |
| SSA | Self-sampling assumption |
| ADT | Anthropic decision theory |
| | The number of ICs/XICs/GCs that would appear, supposing no preclusion, per OUSV |
| | The number of ICs/XICs/GCs that actually appear per OUSV |
| | The observation of being in an XIC that has not observed any GCs |
| | The observation of being in an XIC that is not inside a GC |
| AS | Ancestor simulations; simulations created by a GC of their own IC origins (or slight variants) |
| HS | Historical simulations; simulations created by a GC to be representative of IC origins |
| | The probability density function of IC arrival times, excluding any preclusion by GCs |
| | The probability density function of IC arrival times that do not observe any GCs |
| | The fraction of an OUSV unsaturated by GCs at a given time |
| | The comoving volume of a sphere/GC expanding from a given time with speed $s$ |
| | The actual volume of a sphere/GC expanding from a given time with speed $s$, which accounts for the expansion of other GCs |
| | The rate of habitable star formation, normalised to have integral 1 |
| | The fraction of terrestrial planets that are habitable for at most a given duration |
| | The number of terrestrial planets per OUSV that are potentially habitable |
| | The fraction of potentially habitable planets habitable to advanced life at a given time |
| | The (cosmic) scale factor |
Ade, P. A., Aghanim, N., Arnaud, M., Ashdown, M., Aumont, J., Baccigalupi, C., ... & Matarrese, S. (2016). Planck 2015 results. XIII. Cosmological parameters. Astronomy & Astrophysics, 594, A13.
Armstrong, S. (2011). Anthropic decision theory. arXiv preprint arXiv:1110.6437.
Armstrong, S., & Sandberg, A. (2013). Eternity in six hours: Intergalactic spreading of intelligent life and sharpening the Fermi paradox. Acta Astronautica, 89, 1-13.
Barnett, M. (2022). My current thoughts on the risks from SETI https://www.lesswrong.com/posts/DWHkxqX4t79aThDkg/my-current-thoughts-on-the-risks-from-seti#Strategies_for_mitigating_SETI_risk
Bostrom, N. (2003). Are we living in a computer simulation? The Philosophical Quarterly, 53(211), 243-255.
Bostrom, N. (2003). Astronomical waste: The opportunity cost of delayed technological development. Utilitas, 15(3), 308-314.
Bostrom, N. (2011). Infinite ethics. Analysis and Metaphysics, (10), 9-59.
Carter, B. (1983). The anthropic principle and its implications for biological evolution. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 310(1512), 347-363.
Carter, B. (2008). Five-or six-step scenario for evolution?. International Journal of Astrobiology, 7(2), 177-182.
Clifton, J. (2020) Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda. https://longtermrisk.org/files/Cooperation-Conflict-and-Transformative-Artificial-Intelligence-A-Research-Agenda.pdf
Eth, D. (2021) Great-Filter Hard-Step Math, Explained Intuitively. https://www.lesswrong.com/posts/JdjxcmwM84vqpGHhn/great-filter-hard-step-math-explained-intuitively
Finnveden, L. (2019) Quantifying anthropic effects on the Fermi paradox https://forum.effectivealtruism.org/posts/9p52yqrmhossG2h3r/quantifying-anthropic-effects-on-the-fermi-paradox
Grace, K. (2010). SIA doomsday: The filter is ahead https://meteuphoric.com/2010/03/23/sia-doomsday-the-filter-is-ahead/
Greaves, H. (2017). Population axiology. Philosophy Compass, 12(11), e12442.
Griffith, R. L., Wright, J. T., Maldonado, J., Povich, M. S., Sigurðsson, S., & Mullan, B. (2015). The Ĝ infrared search for extraterrestrial civilizations with large energy supplies. III. The reddest extended sources in WISE. The Astrophysical Journal Supplement Series, 217(2), 25.
Hanson, R., Martin, D., McCarter, C., & Paulson, J. (2021). If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare. The Astrophysical Journal, 922(2), 182.
Haqq-Misra, J., Kopparapu, R. K., & Wolf, E. T. (2018). Why do we find ourselves around a yellow star instead of a red star?. International Journal of Astrobiology, 17(1), 77-86.
Loeb, A. (2014). The habitable epoch of the early Universe. International Journal of Astrobiology, 13(4), 337-339.
Maartens, R. (2011). Is the Universe homogeneous?. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 369(1957), 5115-5137.
MacAskill, M., Bykvist, K., & Ord, T. (2020). Moral uncertainty (p. 240). Oxford University Press.
Oesterheld, C. (2017). Multiverse-wide Cooperation via Correlated Decision Making. https://longtermrisk.org/multiverse-wide-cooperation-via-correlated-decision-making/
Olson, S. J. (2015). Homogeneous cosmology with aggressively expanding civilizations. Classical and Quantum Gravity, 32(21), 215025.
Olson, S. J. (2020). On the Likelihood of Observing Extragalactic Civilizations: Predictions from the Self-Indication Assumption. arXiv preprint arXiv:2002.08194.
Olson, S. J., & Ord, T. (2021). Implications of a search for intergalactic civilizations on prior estimates of human survival and travel speed. arXiv preprint arXiv:2106.13348.
Omohundro, S. M. (2008, February). The basic AI drives. In AGI (Vol. 171, pp. 483-492).
Ord, T. (2021). The edges of our universe. arXiv preprint arXiv:2104.01191.
Ozaki, K., & Reinhard, C. T. (2021). The future lifespan of Earth’s oxygenated atmosphere. Nature Geoscience, 14(3), 138-142.
Pearce, B. K., Tupper, A. S., Pudritz, R. E., & Higgs, P. G. (2018). Constraining the time interval for the origin of life on Earth. Astrobiology, 18(3), 343-364.
Russell, S. (2021). Human-compatible artificial intelligence. In Human-Like Machine Intelligence (pp. 3-23). Oxford: Oxford University Press.
Saadeh, D., Feeney, S. M., Pontzen, A., Peiris, H. V., & McEwen, J. D. (2016). How isotropic is the Universe?. Physical review letters, 117(13), 131302.
Sandberg, A., Drexler, E., & Ord, T. (2018). Dissolving the Fermi paradox. arXiv preprint arXiv:1806.02404.
Sloan, D., Alves Batista, R., & Loeb, A. (2017). The resilience of life to astrophysical events. Scientific reports, 7(1), 1-5.
Tegmark, M. (2007). The multiverse hierarchy. Universe or multiverse, 99-125.
Tomasik, B. (2016). How the Simulation Argument Dampens Future Fanaticism.
Zackrisson, E., Calissendorff, P., González, J., Benson, A., Johansen, A., & Janson, M. (2016). Terrestrial planets across space and time. The Astrophysical Journal, 833(2), 214.
I discuss how using the remaining habitable time on Earth to update on the number of hard steps n is implicitly an anthropic update. In particular I discuss it in the context of Hanson et al. (2021) (henceforth “they” and “their”). They later perform another anthropic update, using a different reference class, which I see as problematic.
Their prior on $n$ is derived by using the self-sampling assumption with the reference class of observers on planets habitable for ~5 Gy (the same as Earth); I write $R_{5\mathrm{Gy}}$ for this reference class. Throughout, I ignore delay steps, and include only hard try-try steps.
They argue (correctly, as I see it) that, to be most typical within this reference class given the observation that Earth remains habitable for another ~1 Gy, we should update towards larger $n$. The SSA likelihood ratio when updating on our appearance time alone (ignoring preclusion by GCs) is expressed in terms of the Gamma distribution PDF with shape $n$ and the hardness as its scale, taking the hardness to be much larger than the habitable duration. This likelihood ratio is largest for moderately large $n$. We could further condition on the time that life first appeared, but this is not necessary to illustrate the point.
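A sketch of this likelihood computation follows (my own code, not the post's; the hardness value is a hypothetical stand-in): the relative likelihood of our appearance time is the Gamma PDF at that time, conditioned on the hard steps completing within Earth's habitable window.

```python
from scipy.stats import gamma

t_obs = 4.5    # our appearance time (Gy after Earth became habitable)
t_max = 5.5    # Earth's total habitable window (Gy), i.e. ~1 Gy remaining

def appearance_likelihood(n: int, hardness: float = 1000.0) -> float:
    """Density of completing n hard steps at t_obs, conditional on
    completion before t_max; `hardness` is the Gamma scale."""
    return gamma.pdf(t_obs, a=n, scale=hardness) / gamma.cdf(t_max, a=n, scale=hardness)

base = appearance_likelihood(1)
for n in (1, 2, 5, 10):
    print(n, appearance_likelihood(n) / base)  # likelihood ratio vs n = 1
```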
While their prior on $n$ relies on this small reference class, their main argument relies on a larger reference class of all intelligent civilizations. They use this to model humanity's birth rank as uniform in the appearance times of all advanced life, not just life on planets habitable for ~5 Gy.
If we use the smaller reference class $R_{5\mathrm{Gy}}$ throughout, then one updates towards larger $n$, but human civilization is no longer particularly early, since all life on planets habitable for ~5 Gy appears in the next ~50 Gy due to the end of star formation. The existence of GCs will have less explanatory power in this case.
If one uses the larger reference class, the SSA likelihood ratio when updating on human civilization's appearance time alone (ignoring preclusion by GCs) instead integrates over the maximum habitable duration and the 'number' of planets habitable for each duration.
If we believe the maximum habitable duration to be large, then the likelihood ratio is maximal at the smallest $n$ and is decreasing in $n$: if advanced life is hard then it will appear more often on planets where it has longer to evolve, and increasing $n$ makes life harder, so decreases the total amount of advanced life and increases the fraction of life on longer-habitable planets. The reference class converges to $R_{5\mathrm{Gy}}$ when the maximum habitable duration decreases to 5 Gy, and one then updates towards larger $n$.
To summarise, the following are ‘compatible’
Hanson et al. write:
If life on Earth had to achieve n "hard steps" to reach humanity's level, then the chance of this event rose as time to the n-th power. Integrating this over habitable star formation and planet lifetime distributions predicts >99% of advanced life appears after today, unless n < 3 and max planet duration <50Gyr. That is, we seem early.
That is, to be early in the reference class of advanced life, we require large $n$ and a large maximum planet duration, which we have shown are incompatible.
The two SSA updates and the ADT average update are sensitive to the lower bound on the prior for $n$. When there are no GCs (that can preclude ICs), human civilization's typicality is primarily determined by $n$: the smaller $n$, the more typical human civilization is. If $n$ is certainly high, worlds with GCs that preclude ICs are relatively more appealing to SSA.
Here I show updates for variants on the prior for $n$, otherwise using the balanced prior. Notably, even for the variant with the highest lower bound on $n$, SSA gives around 58% credence on being alone. As seen below, increasing the lower bound on the prior of $n$ increases the posterior implied rate of GCs.
[Table: implied posteriors for the two SSA updates and the ADT average update under each prior variant]
The following tables show the marginalised posteriors for all updates (excluding the trade and conflict scenarios).
I show that the results follow when taking uniform/loguniform priors on the model parameters as follows:
These give the following distributions on the number of GCs:
This takes the same (log)uniform priors, but with longer-lived planets allowed to be habitable. The SSA-implied posterior on being alone in the OUSV is now just 59% from one observation, and 40% from the other.
Currently in this Google Doc. Will be added to this post soon.
Technologies that produce false vacuum decay or other highly destructive effects will have a non-zero rate of 'detonation'. Such technologies could be used accidentally, or deliberately as a scorched-earth policy during conflict between GCs. Non-gravitationally bound volumes of the universe will become causally separated by ~200 Gy, after which GCs are safe from light-speed decay bubbles.
The model presented can be used to estimate the fraction of OUSVs consumed by such decay bubbles. I introduce a parameter for the fraction of ICs that trigger a vacuum decay some time shortly after they become an IC. More relevantly, one may consider vacuum decay events being triggered when GCs meet one another.
Of course, this is highly speculative, but suggestive that such considerations may change the behaviour of GCs before the era of causal separation. For example, risk averse or pure time discounting GCs may trade off some expansion for creation of utility.
One could run the entire model with $f_{GC}$ replaced by this decay fraction. SSA supports the existence of GCs as a deadline, and so would similarly support the existence of ICs that trigger false vacuum decay as a deadline.
As mentioned, I model the completion time of the hard steps with the Gamma distribution, which has PDF
$$f(t; n, \nu) = \frac{t^{n-1} e^{-t/\nu}}{\Gamma(n)\, \nu^n},$$
where $\nu$ is the hardness (the Gamma scale). When $t \ll \nu$ we have $e^{-t/\nu} \approx 1$, and so $f(t) \propto t^{n-1}$. That is, when the steps are sufficiently hard, the probability of completion by time $t$ grows as a polynomial in $t$. Increasing $n$ leads to a greater 'clumping' of completions near the end of the possible time available.
When hard steps are present, it also means that longer-habitable planets will see a greater fraction of life than shorter-lived planets. For example, a planet habitable for 50 Gy will have approximately $10^n$ times greater probability of life appearing than a planet habitable for 5 Gy.
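A quick numerical check of this scaling (a sketch with a hypothetical hardness far exceeding both lifetimes, so the small-$t$ approximation applies):

```python
from scipy.stats import gamma

hardness = 1e4   # hypothetical Gamma scale, much larger than both lifetimes

for n in (1, 2, 5):
    p_50 = gamma.cdf(50, a=n, scale=hardness)   # P(n steps done within 50 Gy)
    p_5 = gamma.cdf(5, a=n, scale=hardness)     # P(n steps done within 5 Gy)
    print(f"n={n}: ratio = {p_50 / p_5:.1f}")   # approximately 10**n
```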
For anthropic theories that update towards worlds where observers like us are more typical, such as the self-sampling assumption, increasing $n$ while allowing longer-lived planets makes observers like us less typical.
The post Replicating and extending the grabby aliens model appeared first on Center on Long-Term Risk.
The post Plans for 2022 & Review of 2021 appeared first on Center on Long-Term Risk.
Our goal is to reduce the worst risks of astronomical suffering (s-risks) from emerging technologies. To this end, we work on addressing the worst-case risks from the development and deployment of advanced AI systems. We are currently focused on conflict scenarios as well as technical and philosophical aspects of cooperation.
We have been based in London since late 2019. Our team is currently about fourteen full-time equivalents strong, with most of our employees full-time.
We review our work across organizational functions by combining a subjective assessment with a list of tangible outputs and activities. The assessments were written by senior staff members.
Guiding question: Have we made progress towards becoming a research group and community that will have an outsized impact on the research landscape and on the actors relevant to reducing s-risk?
Across all dimensions, it seems to us that we are in a better position compared to last year:
Guiding questions:
Our continued work on better understanding the causes of conflict has progressed significantly. We have developed some initial internal tools (e.g., game-theoretic models) that will allow us to explore different conflict scenarios and their implications more rigorously. We expect this work to be helpful in (i) informing future work on prioritization and (ii) communicating about conflict dynamics to (and eliciting advice from) various important audiences, including those new to s-risk research, external longtermist researchers, and stakeholders at AI labs. This has already resulted in some fruitful conversations. By providing a set of tools for more “paradigmatic” research (in the form of game-theoretic models), this line of work also opens up more opportunities for people to contribute to s-risk research.
We made progress in our work on Cooperative AI. Our Normative Disagreement paper was accepted at two NeurIPS workshops (Cooperative AI and Strategic ML). We also began working on clarifying foundational Cooperative AI concepts, such as what it would mean to work towards differential progress on cooperation. We hope that this work will feed into work on benchmarks by the Cooperative AI Foundation (CAIF).
We made some progress in thinking about intervening on AI agents’ values in a coarse-grained way so that they at least bargain cooperatively, even if otherwise misaligned. While we had previously been aware of this intervention class, only this year did we start to name it as a distinct, potentially promising area for research and intervention and begin to work on developing and evaluating concrete interventions.
Some staff have started to explore frameworks from the literature on decision-making under deep uncertainty as well as their implications for our strategy. This was the result of research and extensive discussions about the potential for large unintended negative consequences from efforts to shape the long-term future.
We published some work on AI forecasting which increased our own understanding of the topic and seemed to have been well received by the wider community.
We made less progress than we had hoped or planned on better understanding the risks from malevolent actors because a key staff member in this area fell sick for most of the year.
Guiding questions:
Our overall impression is that of continued but modest growth and progress. The various programs and events that we ran seem to have engaged people we had not previously been aware of and deepened the engagement of some people we had already known. Our individual calls and meetings put some new people on our radar and impacted some meaningful career decisions (though our counterfactual influence is hard to assess since we don’t yet do systematic evaluations). To the extent that we can already assess the outcomes from grants of the CLR Fund, they seem to have resulted in some meaningful publications and activities.
In February and March, we ran two s-risk intro seminars with about fifteen participants each. The participant feedback was generally very positive. The average response to the question "How likely are you to recommend the Intro Seminar to a value-aligned friend or colleague?" was 4.7 out of 5 and 4.8 out of 5 for the two seminars respectively.
From the end of June until the end of October, we ran a Summer Research Fellowship with two cohorts of seven fellows each (Adrià Garriga-Alonso, Euan McLean, Francis Priestland, Gustavs Zilgalvis, Rory Svarc, Tom Shlomi, and Tristan Cook; Francis Rhys Ward, Jack Koch, Julia Karbing, Lewis Hammond, Megan Kinniment Williams, Nicolas Macé, and Sara Haxhia). Another fellow, Hadrien Pouget, spent three months at CLR during the spring.
The feedback on the fellowship was generally very positive. Among the fellows who responded to our survey, all answered the question "Are you glad that you participated in the fellowship?" with a 5 out of 5. The average response to the question "If the same program happened next year, would you recommend a friend (with similar background to you before the fellowship) to apply?" was 9.9 out of 10.
Three fellows also ended up joining our team in permanent positions.
We conducted over seventy 1:1 calls and meetings with potentially promising people. This also included various office visits by people. (We don’t yet collect systematic feedback on these.)
There were many changes to the fund management: Emery Cooper replaced Lukas Gloor; Stefan Torges replaced Jonas Vollmer; Tobias Baumann replaced Brian Tomasik; Chi Nguyen also joined as a fund manager.
We made the following grants in 2021 (more details here):
Two and a half years ago, we worked with Nick Beckstead from the Open Philanthropy Project to develop a set of communication guidelines for discussing astronomical stakes. In brief, Nick’s guidelines for the broader longtermist community recommend highlighting beliefs and priorities that are important to the s-risk-oriented community. Our guidelines for those focused on s-risks recommend communicating in a nuanced manner about pessimistic views of the long-term future by considering highlighting moral cooperation and uncertainty, focusing more on practical questions if possible, and anticipating potential misunderstandings and misrepresentations.
We had originally planned to reassess the costs and benefits at the end of 2020. We ended up pushing this into 2021. After talking to staff at the Open Philanthropy Project, we decided to extend our commitment to the communication guidelines until at least the end of 2022. However, since we were not able to devote as many resources to this project as we would have liked to, we have planned a more thorough effort for this year.
Guiding question:
As planned, we started advising the Center for Emerging Risk Research (CERR). They are a new nonprofit with the mission to improve the quality of life of future generations. Overall, we are ambivalent about our progress in this area. On the one hand, we are satisfied with the size and rigor of the grant recommendations that we made. On the other hand, we failed to make progress on systematic investigations of cause areas and promising interventions. Instead, we usually investigated opportunities that we learned about through our existing network.
Based in part on our recommendations, CERR made the following investments or grants:
Guiding questions:
Our capacity in this area is roughly at the same level as at the beginning of the year due to staff turnover. That means it is not yet as high as we would like it to be, but we expect this to change over the coming months as our new hire gets used to their role. That being said, we are still able to maintain all the important functions of the organization and push forward vital changes in the operational setup of CLR.
Guiding question: Are we a healthy organization with an effective board, appropriate evaluation of our work, reliable policies and procedures, adequate financial reserves and reporting, and high morale?
Members of the board: Tobias Baumann, Max Daniel, Ruairi Donnelly (chair), Chi Nguyen, Jonas Vollmer (replaced David Althaus in December)
The main role of the board is to decide CLR’s leadership and structure, to resolve organizational conflicts at the highest level, and to advise CLR leadership on important questions. Generally, CLR staff seem to agree that they have been effective in that role. There are, however, different views within the organization as to how well they resolved one incident in particular.
We collect systematic feedback on big community-building and operations projects. We currently do not conduct any systematic evaluation of our research, especially from external peers. This is not ideal. We had already planned to address this in 2021 but failed to do so due to a lack of capacity. It is also generally a difficult problem to solve due to our idiosyncratic priorities.
Overall, it is our impression that our policies are effective and cover the most relevant areas. However, it might always seem this way until we realize that we would have needed a policy for resolving a particular issue. For instance, we added two policies in response to a staff incident this year. So we plan to conduct a systematic review of our policies in 2022.
Our budget increased substantially after our move to London from Berlin, primarily due to an increase in salaries resulting from higher costs of living.
Still, primarily due to the support of the Center for Emerging Risk Research (CERR), we are currently in a good financial position. However, without their continued support, we might face serious difficulties maintaining our operations at the current level.
Net asset estimate in early December 2021 (all figures in CHF (1 CHF ≈ 1.09 USD ≈ 0.82 GBP)):
Monthly staff average for the question “How much do you currently enjoy being part of CLR?” was 7.7 this year (compared to 7.6 in 2020 and 8.0 in 2019). However, the response rate for this question in 2021 was low.
We are hoping to hire new permanent researchers. We are also currently hiring summer research fellows to join us temporarily. You can find the details to apply to both of these opportunities here. The application deadline is February 27, 2022.
The current Research Leads at CLR are Jesse Clifton, Emery Cooper, and Daniel Kokotajlo. They set their own research priorities as well as those of the people on their team. Jesse Clifton leads the Causes of Conflict Research Group at CLR while Emery Cooper and Daniel Kokotajlo currently work alone and with a research assistant respectively.
The current priorities for Jesse and his team are:
Emery’s current research priorities are:
Daniel’s research priorities for 2022 are:
Our work in this area will continue largely along the lines of previous years since we are broadly satisfied with the outcomes it has been producing. We will continue to try to identify and advise people interested in s-risks through 1:1 calls and meetings. We will run an intro fellowship for effective altruists who are interested in learning more about our work. We will make grants through the CLR Fund, mostly to support individuals in our community. We will run a summer research fellowship to allow people to test their fit for research on our priorities.
There are two ways in which we could imagine changing or expanding our work. First, after revisiting our communication strategy, we might explore broader communication about our research and focus areas (e.g., through podcast appearances, EA Forum posts, public talks). Second, we might invest more effort into building community infrastructure and platforms of exchange (e.g., regular retreats or an internal forum for those working on reducing s-risks).
CLR staff will continue to advise CERR on their grantmaking.
We initiated various organizational change projects in 2021 that will require implementation effort by our operations team in 2022:
In addition to a lot of business-as-usual work (e.g., accounting, office management, hiring logistics & onboarding, reporting), the operations team is considering the following projects:
The post Plans for 2022 & Review of 2021 appeared first on Center on Long-Term Risk.
The post Surrogate goals and safe Pareto improvements appeared first on Center on Long-Term Risk.
Surrogate goals have also been discussed or at least mentioned in, among other places, Section 4.2 of CLR's research agenda and the 80,000 Hours podcast (guest: Paul Christiano).
The post Surrogate goals and safe Pareto improvements appeared first on Center on Long-Term Risk.
The post Normative Disagreement as a Challenge for Cooperative AI appeared first on Center on Long-Term Risk.
This paper was accepted at the Cooperative AI workshop and the Strategic ML workshop at NeurIPS 2021.
Cooperation in settings where agents have both common and conflicting interests (mixed-motive environments) has recently received considerable attention in multi-agent learning. However, the mixed-motive environments typically studied have a single cooperative outcome on which all agents can agree. Many real-world multi-agent environments are instead bargaining problems (BPs): they have several Pareto-optimal payoff profiles over which agents have conflicting preferences. We argue that typical cooperation-inducing learning algorithms fail to cooperate in BPs when there is room for normative disagreement resulting in the existence of multiple competing cooperative equilibria and illustrate this problem empirically. To remedy the issue, we introduce the notion of norm-adaptive policies. Norm-adaptive policies are capable of behaving according to different norms in different circumstances, creating opportunities for resolving normative disagreement. We develop a class of norm-adaptive policies and show in experiments that these significantly increase cooperation. However, norm-adaptiveness cannot address residual bargaining failure arising from a fundamental tradeoff between exploitability and cooperative robustness.
Multi-agent contexts often exhibit opportunities for cooperation: situations where joint action can lead to mutual benefits [Dafoe et al., 2020]. Individuals can engage in mutually beneficial trade; nation-states can enter into treaties instead of going to war; disputants can settle out of court rather than engaging in costly litigation. But a hurdle common to each of these examples is that the agents will disagree about their ideal agreement. Even if agreements benefit all parties relative to the status quo, different agreements will benefit different parties to different degrees. These circumstances can be called bargaining problems [Schelling, 1956].
As AI systems are deployed to act on behalf of humans in more real-world circumstances, they will need to be able to act effectively in bargaining problems — from commercial negotiations in the nearer term (e.g., Chakraborty et al. [2020]) to high-stakes strategic decision-making in the longer-term [Geist and Lohn, 2018]. Moreover, agents may be trained independently and offline before interacting with one another in the world. This raises concerns about future AI systems following incompatible norms for arriving at solutions to bargaining problems, analogously to disagreements about fairness which create hurdles to international cooperation on critical issues such as climate policy [Albin, 2001, Ringius et al., 2002].
Our contributions are as follows. We introduce a taxonomy of cooperation games, including bargaining problems (Section 3). We relate their difficulty to the degree of normative disagreement, i.e., differences over principles for selecting among mutually beneficial outcomes, which we formalize in terms of welfare functions. Normative disagreement does not arise in purely cooperative games or simple sequential social dilemmas [Leibo et al., 2017], but is an important obstacle for cooperation in what we call asymmetric bargaining problems. Following this, we introduce the notion of norm-adaptive policies – policies which can play according to different norms depending on the circumstances. In several multi-agent learning environments, we highlight the difficulty of bargaining between norm-unadaptive policies (Section 4). We then contrast this with a class of norm-adaptive policies (Section 5) based on Lerer and Peysakhovich [2017]'s approximate Markov tit-for-tat algorithm. We show that this improves performance in bargaining problems. However, there remain limitations, most fundamentally a tradeoff between exploitability and the robustness of cooperation.
The field of multi-agent learning (MAL) has recently paid considerable attention to problems of cooperation in mixed-motive games, in which agents have conflicting preferences. Much of this work has been focused on sequential social dilemmas (SSDs) (e.g., Peysakhovich and Lerer 2017, Lerer and Peysakhovich 2017, Eccles et al. 2019). The classic example of a social dilemma is the Prisoner’s Dilemma, and the SSDs studied in this literature are similar to the Prisoner’s Dilemma in that there is a single salient notion of “cooperation”. This means that it is relatively easy for actors to coordinate in their selection of policies to deploy in these settings.
Cao et al. [2018] look at negotiation between deep reinforcement learners, but not between independently trained agents. Several authors have recently investigated the board game Diplomacy [Paquette et al., 2019, Anthony et al., 2020, Gray et al., 2021] which contains implicit bargaining problems amongst players. Bargaining problems are also investigated in older MAL literature (e.g., Crandall and Goodrich 2011) as well as literature on automated negotiation (e.g., Kraus and Arkin 2001, Baarslag et al. 2013), but also not between independently trained agents. Considerable work has gone into understanding the emergence of norms in both humans [Bendor and Swistak, 2001, Boyd and Richerson, 2009] and artificial societies [Walker and Wooldridge, 1995, Shoham and Tennenholtz, 1997, Sen and Airiau, 2007]. Especially relevant are empirical studies of bargaining across cultural contexts [Henrich et al., 2001]. Recent multi-agent reinforcement learning work on norms [Hadfield-Menell et al., 2019, Lerer and Peysakhovich, 2019, Köster et al., 2020] is also relevant here, as bargaining problems can be understood as settings in which there are multiple efficient but incompatible norms. However, much less attention has been paid in these literatures to how agents with different norms are or aren't able to overcome normative disagreement.
There is a large game-theoretic literature on bargaining (for a review see Muthoo [2001]). This includes a long tradition of work on cooperative bargaining solutions, which tries to establish normative principles for deciding among mutually beneficial agreements [Thomson, 1994]. We will draw on this work in our discussion of normative (dis)agreement below.
Lastly, the class of norm-adaptive policies we develop in Section 5 can be seen as a more general variant of an approach proposed by Boutilier [1999] for coordinating in pure coordination games. As it implicitly searches for overlap in the agents' sets of allowed welfare functions, it is also similar to Rosenschein and Genesereth [1988]'s approach to reaching agreement in general-sum games via sets of proposals by each agent.
We are interested in a setting in which multiple actors (“principals”) train machine learning systems offline, and then deploy them into an environment in which they interact. For instance, two different companies might separately train systems to negotiate on their behalf and deploy them without explicit coordination on deployment. In this section, we develop a taxonomy of environments that these agents might face and relate these different types of environments to the difficulty of bargaining.
We formalize multi-agent environments as partially observable stochastic games (POSGs). For simplicity we assume two players, $i \in \{1, 2\}$, and index player $i$'s counterpart by $-i$. Each player has an action space $A_i$. There is a space of states $S$ which evolve according to a Markovian transition function. At each time step, each player sees an observation $o_i^t$ which depends on the current state. Thus each player has an accumulating history of observations $h_i^t$. We refer to the set of all histories for player $i$ as $H_i$ and assume for simplicity that the initial observation history is fixed and common knowledge. Finally, principals choose among policies $\pi_i$, which we imagine as artificial agents deployed by the principals. We will refer to policy profiles generically as $\pi$.
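A minimal sketch of this interface in code (the field names and type choices here are ours, not the paper's):

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

State, Action, Obs = int, int, int

@dataclass
class TwoPlayerPOSG:
    states: Sequence[State]
    actions: Tuple[Sequence[Action], Sequence[Action]]      # A_1, A_2
    transition: Callable[[State, Action, Action], State]    # Markovian T
    observe: Callable[[State, int], Obs]                    # player i's observation of the state
    rewards: Callable[[State, Action, Action, int], float]  # player i's reward

# A policy maps a player's accumulated observation history to an action.
Policy = Callable[[Tuple[Obs, ...]], Action]
```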
We define a coordination problem as a game involving multiple Pareto-optimal equilibria (cf. Zhang and Hofbauer 2015), which require some coordinated action to achieve. That is, if the players disagree about which equilibrium they are playing, they will not reach a Pareto-optimal outcome. A pure coordination problem is a game in which there are multiple Pareto-optimal equilibria over which agents have identical interests. Although agents may still experience difficulties in pure coordination games, for instance due to a noisy communication channel, they are made easier by the fact that principals are indifferent between the Pareto-optimal equilibria.
We define a bargaining problem (BP) to be a game in which there are multiple Pareto-optimal equilibria over which the principals have conflicting preferences. These equilibria represent more than one way to collaborate for mutual benefit, or put in another way, for sharing a surplus. Thus a bargaining problem is a mixed-motive coordination problem, in which there is conflicting interest between Pareto-optimal equilibria and common interest in reaching a Pareto-optimal equilibrium.
We can distinguish between BPs which are symmetric and asymmetric games. A 2-player game is symmetric if for any attainable payoff profile (a, b) there exists a profile (b, a). The reason this distinction is important is that all (finite) symmetric games have a symmetric Nash equilibrium [Nash, 1990]; thus symmetric games have a natural set of focal points [Schelling, 1958] for aiding coordination in mixed-motive contexts, while asymmetric BPs may not. Similarly, given a chance to play a correlated equilibrium [Aumann, 1974], agents in a symmetric BP could play according to a correlated equilibrium which randomizes using a symmetric distribution over Pareto-optimal payoff profiles.
Figure 2 displays the payoff matrices of three coordination games: Pure Coordination, and two variants of Bach or Stravinsky (BoS), one of which is a symmetric BP and one of which is an asymmetric BP. Pure Coordination is not a BP because it is not a mixed-motive game: the players only care about playing the same action. In the case of symmetric BoS, on the other hand, the players do have conflicting interests; however, there is a correlated equilibrium – tossing a commonly observed fair coin – that is intuitively the most reasonable way of coordinating: it both maximizes the total payoff and offers each player the same expected reward.
To develop a better intuition for the sense in which equilibria can be more or less reasonable, consider a BoS with an extreme asymmetry, with equilibrium payoffs (15, 10) and (1, 11). Even though each of these equilibria is Pareto-optimal, the latter seems unreasonable or uncooperative: it yields a lower total payoff, more inequality, and lowers the reward of the worst-off player in the equilibrium. To formalize this intuition, we characterize the reasonableness of a Pareto-optimal payoff profile in terms of the extent to which it optimizes welfare functions: we can say that (1, 11) is unreasonable because there is no (impartial, see below) welfare function that would prefer it. Different welfare functions with different properties have been introduced in the literature (see Appendix A), but two uncontroversial properties of a welfare function are Pareto-optimality (i.e., its optimizer should be Pareto-optimal) and impartiality (i.e., the welfare of a policy profile should be invariant to permutations of player indices). From the latter property, we can observe that the intuitively reasonable choice of playing the correlated equilibrium with a fair correlation device in the case of symmetric games is also the choice which all impartial welfare functions recommend, provided that it is possible for the agents to play a correlated equilibrium.
By contrast, in the asymmetric BoS from Figure 2 we see that playing BB maximizes utilitarian welfare $a + b$, whereas playing SS maximizes egalitarian welfare $\min(a, b)$ subject to Pareto-optimality. Throwing a correlated fair coin to choose between the two would lead to an expected payoff that is optimal with respect to the Nash welfare $a \cdot b$. Each of these different equilibria has a normative principle to motivate it.
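To illustrate, the sketch below shows three impartial welfare functions selecting three different options in an asymmetric BoS. Since the payoff matrices of Figure 2 are not reproduced here, the equilibrium payoffs are hypothetical stand-ins chosen so that the three criteria disagree.

```python
# Hypothetical equilibrium payoffs for an asymmetric BoS (not Figure 2's).
profiles = {"BB": (4.0, 1.0), "SS": (2.0, 2.0)}
# Expected payoff of a correlated fair coin over the two pure equilibria:
profiles["coin"] = tuple((x + y) / 2 for x, y in zip(profiles["BB"], profiles["SS"]))

welfare = {
    "utilitarian": lambda a, b: a + b,
    "egalitarian": lambda a, b: min(a, b),
    "Nash": lambda a, b: a * b,
}

for name, w in welfare.items():
    best = max(profiles, key=lambda p: w(*profiles[p]))
    print(f"{name} welfare selects {best}")  # BB, SS, coin respectively
```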
In the best case, all principals agree on the same welfare function as a common ground for coordination in asymmetric BPs. However, the principals may have reasonable differences with respect to which welfare function they perceive as fair, and so they may train their systems to optimize different welfare functions, leading to coordination failure when the systems interact after deployment. In cases where agents were independently trained according to inconsistent welfare functions, we will say that there is normative disagreement. There may be different degrees of normative disagreement; for instance, how far apart the outcomes recommended by different welfare functions are varies across games.
To summarize, we relate the difficulty of coordination problems to the concept of welfare functions: in pure coordination problems, they are not needed. In symmetric bargaining problems, they all point to the same equilibria. And in asymmetric bargaining problems, they can serve to filter out intuitively unreasonable equilibria, but leave open the possibility of normative disagreement between reasonable ones. This makes normative disagreement a critical challenge for bargaining. For this reason, we will focus on asymmetric bargaining problems in the remainder of the paper.
When there is potential for normative disagreement, it can be helpful for agents to have some flexibility as to the norms they play according to.
A number of definitions of norms have been proposed in the social scientific literature (e.g., Gibbs [1965]), but they tend to agree that a norm is a rule specifying acceptable and unacceptable behaviors in a group of people, along with sanctions for violations of that rule. Game-theoretic work sometimes identifies norms with equilibrium selection devices (e.g., Binmore and Samuelson 1994, Young 1996). Given that complex games generally exhibit many equilibria, some rule (such as maximizing a particular welfare function) is needed to select among them.
Normative disagreement arises (among other reasons) from the underdetermination of good behavior in complex multi-agent settings. This is exemplified by the problem of conflicting equilibrium selection criteria in asymmetric bargaining problems, but there are other possible cases of underdetermination. One example is the underdetermination of the beliefs that a reasonable agent should act according to in games of incomplete information (cf. the common prior assumption in Bayesian games [Morris, 1995]). Thus our definition of norm will be more general than an equilibrium selection device, though in the remainder of the paper we will focus on the use of welfare functions to choose among equilibria.
Definition 3.1 (Norms). Given a 2-player POSG, a norm is a tuple consisting of (i) normative policies, i.e., policies which comply with the norm; (ii) “punishment” policies, which are enacted when deviations from a normative policy are detected; and (iii) rules for judging whether a deviation has happened and how to respond. A policy is compatible with a norm if it plays according to the normative policy as long as no deviation has been detected and otherwise responds as the norm’s rules prescribe.
For example, in the iterated Asymmetric BoS, one norm is for both players to always play B (this is the normative policy) and for a player to respond to deviations by continuing to play B. A similar norm is given by both players always playing S instead. More generally, following the folk theorems [Mailath et al., 2006], an equilibrium in a repeated game corresponds to a norm in which the profile of normative policies corresponds to play on the equilibrium path; the judgment rules indicate whether there is a deviation from equilibrium play, and punishment policies are chosen such that players are made worse off by deviating than by continuing to play according to the normative policy profile.
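As an illustration of Definition 3.1, here is a minimal Python sketch of the first norm above (the action labels B and S and the history format are illustrative assumptions, not the paper’s formalism):

```python
# Sketch of a norm for iterated Bach or Stravinsky (Definition 3.1).
# Actions are "B" and "S"; a history is a list of (own_action, other_action) pairs.

def normative_policy(history):
    return "B"                      # on-path behavior: always play B

def punishment_policy(history):
    return "B"                      # response to deviations: keep playing B

def deviation_detected(history):
    # Judge a deviation to have happened if the other player ever played
    # something other than the normative action.
    return any(other != "B" for _own, other in history)

def norm_compliant_action(history):
    """A policy compatible with the norm: follow the normative policy until a
    deviation is detected, then switch to the punishment policy."""
    if deviation_detected(history):
        return punishment_policy(history)
    return normative_policy(history)
```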
Now we formally define norm-adaptive policies.
Definition 3.2 (Norm-adaptive policies). Take a 2-player POSG and a non-singleton set of norms for that game. Let there be a surjective function that maps histories of observations to norms (i.e., for each norm in the set, there are histories for which the function chooses that norm). Then, a policy is norm-adaptive if, for every norm in the set, it behaves compatibly with that norm on the histories which the function maps to it.
That is, norm-adaptive policies are able to play according to different norms depending on the circumstances. As we will see below, the benefit of making agents explicitly norm-adaptive is that this can help to prevent or resolve normative disagreement. Lastly, note that we can define higher-order norms and higher-order norm-adaptiveness: a higher-order norm is a norm whose normative policies are themselves norm-adaptive with respect to some set of norms. This framework allows us to discuss differing (higher-order) norms for resolving normative disagreement.
In this section, we illustrate how cooperation-inducing, but norm-unadaptive, multi-agent learning algorithms fail to cooperate in asymmetric BPs. In Section 5 we will then show how norm-adaptiveness improves cooperation. The environments and algorithms considered are summarized in Table 1.
In order to include both algorithms which use an explicitly specified welfare function and ones which do not, we use the Learning with Opponent-Learning Awareness (LOLA) algorithm [Foerster et al., 2018], in its policy gradient and exact value function optimization versions, as an example of the latter, and a generalized version of Approximate Markov Tit-for-tat for the former.
We introduce this generalized version as a variant of Lerer and Peysakhovich [2017]’s Approximate Markov Tit-for-tat (amTFT). The original algorithm trains a cooperative policy profile on the utilitarian welfare, as well as a punishment policy, and switches from the cooperative policy to the punishment policy when it detects that the other player is defecting from the cooperative policy. The algorithm has the appeal that it “cooperates with itself, is robust against defectors, and incentivizes cooperation from its partner” [Lerer and Peysakhovich, 2017]. We consider the more general class of algorithms, which we refer to as generalized amTFT, in which the cooperative policy is constructed by optimizing an arbitrary welfare function. Note that although generalized amTFT takes a welfare function as an argument, the resulting policies are not norm-adaptive. To cover a range of environments representing both symmetric and asymmetric games, we use some existing multi-agent reinforcement learning environments (IPD, Coin Game [CG; Lerer and Peysakhovich 2017]) and introduce two new ones (IAsymBoS, ABCG). A sketch of the generalized amTFT switching logic is given below.
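The following is a rough sketch of that switching logic; it is not the actual implementation (policy training, debit accounting, and the rollout-based punishment length from Appendix C are omitted, and all names are illustrative):

```python
class GeneralizedAmTFT:
    """Sketch: play a cooperative policy trained to optimize a given welfare function;
    switch to a punishment policy for k steps whenever a defection is detected."""

    def __init__(self, cooperative_policy, punishment_policy, is_defection, k):
        self.cooperative_policy = cooperative_policy    # trained on welfare function w
        self.punishment_policy = punishment_policy
        self.is_defection = is_defection                # judges the opponent's behavior
        self.k = k                                      # punishment duration (illustrative)
        self.punishment_steps_left = 0

    def act(self, observation, history):
        if self.punishment_steps_left > 0:
            self.punishment_steps_left -= 1
            return self.punishment_policy(observation)
        if self.is_defection(history):
            self.punishment_steps_left = self.k
            return self.punishment_policy(observation)
        return self.cooperative_policy(observation)
```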
Iterated asymmetric Bach or Stravinsky (IAsymBoS) is an asymmetric version of the iterated Bach or Stravinsky matrix game. At each time step, the game defined on the right in Figure 2 is played. We focus on the asymmetric variant due to the argument in Section 3.3 that players could resolve the symmetric version by playing a symmetric equilibrium; however, applying LOLA without modification would also lead to coordination failure in the symmetric variant. It should also be noted that IAsymBoS is not an SSD because it does not incentivize defection: agents cannot gain from miscoordinating. We therefore consider IAsymBoS a minimal example of an environment that can produce bargaining failures out of normative disagreement.
For a more involved example, we also introduce an asymmetric version of the stochastic gridworld Coin Game [Lerer and Peysakhovich, 2017] – the asymmetric bargaining Coin Game (ABCG) – which is both an SSD and an asymmetric BP. In ABCG, a red and a blue agent navigate a grid with coins. Two coins simultaneously appear on this grid: a Cooperation coin and a Disagreement coin, each colored red or blue. Cooperation coins can only be consumed by both players moving onto them at the same time, whereas the Disagreement coin can only be consumed by the player of the same color as the coin. The game is designed such that total (utilitarian) welfare is maximized by both players always consuming the Cooperation coin; however, this makes one player benefit more than the other. Due to the sequential nature of the game, this means that welfare functions which also care about (in)equity will favor an equilibrium in which the player who benefits less from the Cooperation coin is allowed to consume the Disagreement coin from time to time without retaliation.
Players move simultaneously in all games. Note that we assume each player to have full knowledge of the other player’s rewards, as this is required by LOLA-Exact during training and by generalized amTFT both at training and at deployment. We use simple tabular and neural network policy parameterizations, which are described in Appendix C along with the learning algorithms.
Table 1: Summary of the environments and learning algorithms that we use to study sequential social dilemmas (SSDs) and asymmetric bargaining problems in this section.
We train policies in the environments listed in Table 1 with the corresponding MARL algorithms. After training, we evaluate cooperative success in self-play and cross-play. Self-play refers to average performance between jointly-trained policies. Cross-play refers to average performance between independently-trained policies. We distinguish between two kinds of cross-play: between agents trained using the same notion of collective optimality (such as a welfare function or an inequity-aversion term in the value function), and between agents trained using different ones. For generalized amTFT we use the utilitarian and egalitarian welfare functions as well as an inequity-averse welfare function (see Appendix C).
Comparing cooperative success across environments is not straightforward: environments have different sets of feasible payoffs, and we should not evaluate cooperative success with a single welfare function, because our concerns about normative disagreement stem from the fact that it is not obvious which single welfare function to use. First, we take the set of welfare functions that we use in our experiments in the environment in question; for IAsymBoS and generalized amTFT, for instance, these are the utilitarian and egalitarian welfare functions. Second, we define disagreement payoff profiles corresponding to cooperation failure; in IAsymBoS, the disagreement policy profiles are those with payoff profile (0, 0). Then, for each policy profile, we compute a normalized score that measures how close its payoffs come, relative to the disagreement payoffs, to the best payoffs achievable under some welfare function in this set. Under this scoring method, players do maximally well when they play a policy profile which is optimal according to some welfare function in the set.
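Since the exact normalization formula is not reproduced above, the following Python sketch implements one scoring rule consistent with the description: the disagreement profile scores 0, and a profile that is optimal under some welfare function in the set scores 1 (the welfare functions and payoff values are illustrative assumptions):

```python
def utilitarian(p):
    return p[0] + p[1]

def egalitarian(p):
    return min(p)

def normalized_score(payoff, welfare_fns, disagreement, optima):
    """Max over welfare functions of the payoff's welfare, rescaled so that the
    disagreement profile scores 0 and a welfare-optimal profile scores 1."""
    scores = []
    for w in welfare_fns:
        best, worst = w(optima[w]), w(disagreement)
        scores.append((w(payoff) - worst) / (best - worst))
    return max(scores)

# Hypothetical IAsymBoS-style values: disagreement (0, 0); (12, 1) optimal for
# utilitarian welfare and (4, 6) optimal for egalitarian welfare.
optima = {utilitarian: (12, 1), egalitarian: (4, 6)}
print(normalized_score((12, 1), [utilitarian, egalitarian], (0, 0), optima))  # 1.0
print(normalized_score((0, 0), [utilitarian, egalitarian], (0, 0), optima))   # 0.0
```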
Figure 3 illustrates the difference between cooperation in SSDs and bargaining problems. When there is a single “cooperative” equilibrium, as in the case of IPD and CG, cooperation-inducing learning algorithms typically achieve cooperation in cross-play. In contrast, in IAsymBoS and ABCG we observe mild performance degradation in cross-play where agents optimize the same welfare functions, and strong degradation when agents optimize different welfare functions.
The cooperation-inducing properties of the algorithms in Section 4 are simple and are not designed to help agents resolve potential normative disagreement to avoid Pareto-dominated outcomes. The two main problems are (1) that the algorithms are ill-equipped for reacting to normative disagreement, and (2) that they may confuse normative disagreement with defection.
The former problem is already evident in IAsymBoS. There, playing according to incompatible welfare functions is not interpreted as defection by generalized amTFT. This is not necessarily bad – we claim that normative disagreement should be treated differently from defection – but it does mean that generalized amTFT lacks the policy space to react to normative disagreement. For the latter problem, we observe that in ABCG the generalized amTFT agent does classify some of the opponent's actions as defection, even though those actions are aimed at optimizing an impartial welfare function, and punishes accordingly.
To overcome both of these problems, we propose a norm-adaptive modification to generalized amTFT. As we only aim to illustrate the benefit of norm-adaptive policies, we keep the implementation simple: instead of a single welfare function, the algorithm now takes a set of welfare functions. The agent starts out playing according to some welfare function from this set; the initial choice can be made at random or according to a preference ordering over the set.
The norm-adaptive agent then follows a two-stage decision process. First, if the agent detects that its opponent is merely playing according to a different welfare function than itself, the agent's welfare function gets re-sampled. Second, if the agent detects defection by the opponent, it will punish, just as in the original algorithm. However, by first checking for normative disagreement, we make sure that punishment does not happen as a result of a normative disagreement.
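A minimal sketch of this two-stage decision process follows (all names are illustrative; the detectors for normative disagreement and for defection are left abstract, and re-sampling is shown as uniform):

```python
import random

class NormAdaptiveAmTFT:
    """Sketch: hold a set of welfare functions; re-sample the one being played when
    mere normative disagreement is detected, and punish only genuine defection."""

    def __init__(self, policies_by_welfare, punishment_policy,
                 detects_other_welfare, detects_defection):
        self.policies = policies_by_welfare              # {welfare function: cooperative policy}
        self.punishment_policy = punishment_policy
        self.detects_other_welfare = detects_other_welfare
        self.detects_defection = detects_defection
        self.current_w = random.choice(list(self.policies))

    def act(self, observation, history):
        # Stage 1: check for normative disagreement and, if found, re-sample uniformly.
        if self.detects_other_welfare(history, self.current_w):
            self.current_w = random.choice(list(self.policies))
        # Stage 2: only behavior not explained by a different welfare function counts
        # as defection and triggers punishment.
        elif self.detects_defection(history, self.current_w):
            return self.punishment_policy(observation)
        return self.policies[self.current_w](observation)
```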
In Figure 5 we illustrate how, assuming uniform re-sampling from the welfare-function set, the norm-adaptive agent can overcome normative disagreement and perform close to the Pareto frontier. Agents need not sample uniformly, though. For instance, when the number of possible welfare functions is large, it would be beneficial to put higher probability on welfare functions which one's counterpart is more likely to be optimizing. Furthermore, an agent might want to put higher probability on welfare functions that it prefers.
Notice that, other things being equal, one player using the norm-adaptive variant rather than plain generalized amTFT leads to a (weak) Pareto improvement. Beyond that, in anticipation of bargaining failure due to not having a way to resolve normative disagreement, players are incentivized to include more than just their preferred welfare function in their set. In both IAsymBoS (see Figure 5) and ABCG (see Table 5, Appendix D) we observe significant improvements in cross-play when at least one player is norm-adaptive.
As our experiments with the norm-adaptive variant show, agents who are more flexible are less prone to bargaining failure due to normative disagreement. However, they are prone to having that flexibility exploited by their counterparts. For instance, an agent which is open to optimizing either of two welfare functions will end up optimizing whichever one its counterpart insists on when playing against an agent whose set contains only its own preferred welfare function. More generally, an agent who, when re-sampling, puts higher probability on a welfare function it prefers will be less robust against counterparts who disprefer that welfare function and put a lower probability on it. An agent who tries to guess the counterpart's welfare function and accommodate it is exploitable by agents who do not.
We formally introduce a class of hard cooperation problems – asymmetric bargaining problems – and situate them within a wider game taxonomy. We argue that they are hard because normative disagreement can arise between multiple “reasonable” cooperative equilibria, characterized by divergence in the outcomes preferred according to different welfare functions. This presents a problem for those deploying AI systems without coordinating on the norms those systems follow. We have introduced the notion of norm-adaptive policies, which allow agents to change the norms according to which they play, giving rise to opportunities for resolving normative disagreement. As an example of a class of norm-adaptive policies, we introduced a norm-adaptive variant of generalized amTFT and showed in experiments that it tends to improve robustness to normative disagreement. On the other hand, we have demonstrated a robustness-exploitability tradeoff, under which methods that learn more normatively flexible strategies are also more vulnerable to exploitation.
There are a number of limitations to this work. We have throughout assumed that the agents have a common and correctly-specified model of their environment, including their counterpart’s reward function. In real-world situations, however, principals may not have identical simulators with which to train their systems, and there are well-known obstacles to the honest disclosure of preferences [Hurwicz, 1972], meaning that common knowledge of rewards may be unrealistic. Similarly, we assumed a certain degree of reasonableness on the part of the principals, seen for instance in the willingness to play the symmetric correlated equilibrium in symmetric BoS (Section 3). However, we believe this to be a minimal assumption, as the deployers of such agents are aware of the risk of coordination failure as a result of insisting on equilibria that no impartial welfare function would recommend.
Future work should consider more sophisticated and learning-driven approaches to designing norm-adaptive policies, as our norm-adaptive variant of generalized amTFT relies on a finite set of user-specified welfare functions and a hard-coded procedure for switching between policies. One possibility is to train agents who are themselves capable of jointly deliberating about the principles they should use to select an equilibrium, e.g., deciding among the axioms which characterize different bargaining solutions (see Appendix A), in the hope that they will be able to resolve initial disagreements. Another direction is resolving disagreements that cannot be expressed as disagreements over the welfare functions according to which agents play; for instance, disagreements over the beliefs or world-models which should inform agents’ behavior.
We’d like to thank Yoram Bachrach, Lewis Hammond, Vojta Kovařík, Alex Cloud, as well as our anonymous reviewers for their valuable feedback; Daniel Rüthemann for designing Figures 6 and 7; Chi Nguyen for crucial support just before a deadline; Toby Ord and Jakob Foerster for helpful comments.
Julian Stastny performed part of the research for this paper while interning at the Center on Long-Term Risk. Johannes Treutlein was supported by the Center on Long-Term Risk, the Berkeley Existential Risk Initiative, and Open Philanthropy. Allan Dafoe received funding from Open Philanthropy and the Centre for the Governance of AI.
Cecilia Albin. Justice and fairness in international negotiation. Number 74. Cambridge University Press, 2001.
Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas Hudson, Nicolas Porcel, Marc Lanctot, Julien Perolat, Richard Everett, Satinder Singh, Thore Graepel, and Yoram Bachrach. Learning to play no-press diplomacy with best response policy iteration. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17987–18003. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/d1419302db9c022ab1d48681b13d5f8b-Paper.pdf.
Robert J Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1):67–96, 1974.
Tim Baarslag, Katsuhide Fujita, Enrico H Gerding, Koen Hindriks, Takayuki Ito, Nicholas R Jennings, Catholijn Jonker, Sarit Kraus, Raz Lin, Valentin Robu, et al. Evaluating practical negotiating agents: Results and analysis of the 2011 international competition. Artificial Intelligence, 198: 73–103, 2013.
Jonathan Bendor and Piotr Swistak. The evolution of norms. American Journal of Sociology, 106(6):1493–1545, 2001.
Ken Binmore and Larry Samuelson. An economist’s perspective on the evolution of norms. Journal of Institutional and Theoretical Economics (JITE)/Zeitschrift für die gesamte Staatswissenschaft, pages 45–63, 1994.
Craig Boutilier. Sequential optimality and coordination in multiagent systems. In IJCAI, volume 99, pages 478–485, 1999.
Robert Boyd and Peter J Richerson. Culture and the evolution of human cooperation. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1533):3281–3288, 2009.
Donald E Campbell and Peter C Fishburn. Anonymity conditions in social choice theory. Theory and Decision, 12(1):21, 1980.
Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, and Stephen Clark. Emergent communication through negotiation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk6WhagRW.
Shantanu Chakraborty, Tim Baarslag, and Michael Kaisers. Automated peer-to-peer negotiation for energy contract settlements in residential cooperatives. Applied Energy, 259:114173, 2020.
Jacob W Crandall and Michael A Goodrich. Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Machine Learning, 82(3):281–314, 2011.
Allan Dafoe, Edward Hughes, Yoram Bachrach, Tantum Collins, Kevin R McKee, Joel Z Leibo, Kate Larson, and Thore Graepel. Open problems in cooperative ai. arXiv preprint arXiv:2012.08630, 2020.
Tom Eccles, Edward Hughes, János Kramár, Steven Wheelwright, and Joel Z Leibo. Learning reciprocity in complex sequential social dilemmas. arXiv preprint arXiv:1903.08082, 2019.
Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122–130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
Edward Geist and Andrew J Lohn. How might artificial intelligence affect the risk of nuclear war? Rand Corporation, 2018.
Jack P Gibbs. Norms: The problem of definition and classification. American Journal of Sociology, 70(5):586–594, 1965.
Jonathan Gray, Adam Lerer, Anton Bakhtin, and Noam Brown. Human-level performance in no-press diplomacy via equilibrium search. In International Conference on Learning Representation, 2021. URL https://openreview.net/forum?id=0-uUGPbIjD.
Dylan Hadfield-Menell, McKane Andrus, and Gillian Hadfield. Legible normativity for ai alignment: The value of silly rules. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 115–121, 2019.
John C Harsanyi. Cardinal welfare, individualistic ethics, and interpersonal comparisons of utility. Journal of political economy, 63(4):309–321, 1955.
Joseph Henrich, Robert Boyd, Samuel Bowles, Colin Camerer, Ernst Fehr, Herbert Gintis, and Richard McElreath. In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review, 91(2):73–78, 2001.
Leonid Hurwicz. On informationally decentralized systems. Decision and organization: A volume in Honor of J. Marschak, 1972.
Ehud Kalai. Proportional solutions to bargaining situations: interpersonal utility comparisons. Econometrica: Journal of the Econometric Society, pages 1623–1630, 1977.
Ehud Kalai and Meir Smorodinsky. Other solutions to Nash's bargaining problem. Econometrica, 43(3):513–518, 1975.
Raphael Köster, Kevin R McKee, Richard Everett, Laura Weidinger, William S Isaac, Edward Hughes, Edgar A Duéñez-Guzmán, Thore Graepel, Matthew Botvinick, and Joel Z Leibo. Model-free conventions in multi-agent reinforcement learning with heterogeneous preferences. arXiv preprint arXiv:2010.09054, 2020.
Sarit Kraus and Ronald C Arkin. Strategic negotiation in multiagent environments. MIT press, 2001.
Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017.
Adam Lerer and Alexander Peysakhovich. Learning existing social conventions via observationally augmented self-play. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 107–114, 2019.
Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for distributed reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3053–3062. PMLR, 10–15 Jul 2018. URL http://proceedings.mlr.press/v80/liang18b.html.
George J Mailath and Larry Samuelson. Repeated games and reputations: long-run relationships. Oxford University Press, 2006.
Stephen Morris. The common prior assumption in economic theory. Economics & Philosophy, 11(2):227–253, 1995.
Abhinay Muthoo. The economics of bargaining. EOLSS, 2001.
John F Nash. The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155–162, 1950.
John Forbes Nash. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951.
Philip Paquette, Yuchen Lu, Steven Bocco, Max Smith, Satya O.-G., Jonathan K. Kummerfeld, Joelle Pineau, Satinder Singh, and Aaron C Courville. No-press diplomacy: Modeling multi-agent gameplay. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/84b20b1f5a0d103f5710bb67a043cd78-Paper.pdf.
Alexander Peysakhovich and Adam Lerer. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975, 2017.
Lasse Ringius, Asbjørn Torvanger, and Arild Underdal. Burden sharing and fairness principles in international climate policy. International Environmental Agreements, 2(1):1–22, 2002.
Jeffrey S Rosenschein and Michael R Genesereth. Deals among rational agents. In Readings in Distributed Artificial Intelligence, pages 227–234. Elsevier, 1988.
Thomas C Schelling. An essay on bargaining. The American Economic Review, 46(3):281–306, 1956.
Thomas C Schelling. The strategy of conflict. prospectus for a reorientation of game theory. Journal of Conflict Resolution, 2(3):203–264, 1958.
Sandip Sen and Stephane Airiau. Emergence of norms through social learning. In IJCAI, volume 1507, page 1512, 2007.
Yoav Shoham and Moshe Tennenholtz. On the emergence of social conventions: modeling, analysis, and simulations. Artificial Intelligence, 94(1-2):139–166, 1997.
William Thomson. Cooperative models of bargaining. Handbook of game theory with economic applications, 2:1237–1284, 1994.
Adam Walker and Michael J Wooldridge. Understanding the emergence of conventions in multi-agent systems. In ICMAS, volume 95, pages 384–389, 1995.
H Peyton Young. The economics of convention. Journal of economic perspectives, 10(2):105–122, 1996.
Boyu Zhang and Josef Hofbauer. Equilibrium selection via replicator dynamics in 2 × 2 coordination games. International Journal of Game Theory, 44(2):433–448, 2015.
Lin Zhou. The Nash bargaining theory with non-convex problems. Econometrica: Journal of the Econometric Society, pages 681–685, 1997.
Different welfare functions have been introduced in the literature. Table 2 gives an overview of commonly discussed welfare functions. Their properties are noted in Table 3.
Name of welfare function | Form
Nash [Nash, 1950] | |
Kalai-Smorodinsky [Kalai and Smorodinsky, 1975] | s.t. Pareto-optimal |
Egalitarian [Kalai, 1977] | s.t. Pareto-optimal |
Utilitarian [Harsanyi, 1955] |
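For reference, the standard forms of these welfare functions, as they are commonly written in the bargaining literature for a payoff profile $(u_1, u_2)$ with disagreement point $(d_1, d_2)$ and ideal payoffs $(b_1, b_2)$, are given below; the exact formulations used in Table 2 may differ in details such as the choice of disagreement point.

$$
\begin{aligned}
\text{Nash:} \quad & (u_1 - d_1)(u_2 - d_2) \\
\text{Kalai-Smorodinsky:} \quad & \min_i \frac{u_i - d_i}{b_i - d_i} \quad \text{s.t. Pareto-optimality} \\
\text{Egalitarian:} \quad & \min_i \,(u_i - d_i) \quad \text{s.t. Pareto-optimality} \\
\text{Utilitarian:} \quad & u_1 + u_2
\end{aligned}
$$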
Pareto-optimality refers to the property that the welfare function's optimizer should be Pareto-optimal. Impartiality, often called symmetry, implies that the welfare of a policy profile should be invariant to permutations of player indices. These are treated as relatively uncontroversial properties in the literature. Invariance to affine transformations of the payoff matrix is usually motivated by the assumption that interpersonal comparison of utility is impossible. In contrast, the utilitarian welfare function assumes that such comparisons are possible. Independence of irrelevant alternatives refers to the principle that a preference for an equilibrium A over equilibrium B should only depend on properties of A and B. That is, a third equilibrium C should not change the preference ordering between A and B. Resource monotonicity refers to the principle that if the payoff for any policy profile increases, this should not make any agent worse off.
Table 3: Properties of welfare functions. For the properties of the Nash welfare function listed here, the set of feasible payoffs is assumed to be convex (Zhou [1997] describes properties in the non-convex case).
Coin Game
Asymmetric bargaining Coin Game
For computing the normalized scores in Figure 3, we use the following disagreement profiles, corresponding to cooperation failure. In IPD, the cooperation failure is joint defection (D, D), producing a reward profile of (-3, -3). In IAsymBoS, the cooperation failures are the profiles (B, S) and (S, B), both associated with the reward profile (0, 0). In Coin Game, the cooperation failure is when both players selfishly pick all coins at maximum speed, producing a reward profile of (0, 0). In the asymmetric bargaining Coin Game (ABCG), when the players fail to cooperate, they can punish the opponent by preventing any coin from being picked, waiting instead of picking their own Disagreement coin. In these cooperation failures, the reward profile with perfect punishment is (0, 0).
We use the same discount factor for all the algorithms and environments unless specified otherwise in Table 4.
Approximate Markov tit-for-tat (amTFT)
We follow the algorithm from Lerer and Peysakhovich [2017] with two changes: 1) Instead of using a selfish policy to punish the opponent, we use a policy that minimizes the opponent's reward. 2) We start from the version that uses rollouts to compute the debit and the punishment length; however, we observed that using long rollouts to compute the debit significantly increases the variance of the debit, which leads to false-positive detection of defection. To reduce this variance, we instead compute the debit without rollouts, using the direct difference between the reward given by the actual action and the reward given by the simulated cooperative action. The rollout length used to compute the punishment length is 20.
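A minimal sketch of this rollout-free debit update and the punishment trigger follows (the variable names and the threshold are illustrative, not taken from the implementation):

```python
def update_debit(debit, opponent_reward_actual, opponent_reward_if_cooperative):
    """Accumulate the per-step advantage the opponent gained over the simulated
    cooperative action (no rollouts are used for this computation)."""
    return debit + (opponent_reward_actual - opponent_reward_if_cooperative)

def should_punish(debit, threshold):
    """Trigger punishment once the accumulated debit exceeds a threshold; the
    punishment length is then estimated with rollouts (length 20 in our setup)."""
    return debit > threshold
```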
We note that in asymmetric BoS, training runs using the utilitarian welfare function sometimes learn to produce, in self-play, the egalitarian outcome instead of the utilitarian outcome. In these cases the policy gets discarded.
The inequity-averse welfare is defined in terms of smoothed cumulative rewards, computed with a discount factor, and a parameter which controls how much unequal outcomes are penalized.
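One form consistent with this description, stated here as an assumption rather than the exact definition used, is

$$ w_{\text{IA}}(e_1, e_2) \;=\; e_1 + e_2 \;-\; \alpha \,\lvert e_1 - e_2 \rvert, $$

where $e_1$ and $e_2$ are the smoothed cumulative rewards and $\alpha$ controls how strongly unequal outcomes are penalized.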
Learning with Opponent-Learning Awareness (LOLA)
Write $V^i(\theta_1, \theta_2)$ for the value to player $i$ under a profile of policies with parameters $\theta_1$ and $\theta_2$. The LOLA update [Foerster et al., 2018] for player 1 at time $t$ then adjusts $\theta_1^t$ by differentiating through an anticipated learning step of player 2; the update is given below.
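For reference, the first-order form of this update as given by Foerster et al. [2018] can be written as follows (the learning-rate symbols $\delta$ and $\eta$ are chosen here for illustration, and the notation may differ slightly from the version used in our experiments):

$$ \theta_1^{t+1} \;=\; \theta_1^t \;+\; \delta\, \nabla_{\theta_1} V^1(\theta_1^t, \theta_2^t) \;+\; \delta \eta \,\big(\nabla_{\theta_2} V^1(\theta_1^t, \theta_2^t)\big)^{\top} \nabla_{\theta_1} \nabla_{\theta_2} V^2(\theta_1^t, \theta_2^t). $$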
We used RLlib [Liang et al., 2018] for almost all the experiments. It is under Apache License 2.0. All activation functions are ReLU if not specified otherwise.
Matrix games (IPD and IBoS) with LOLA-Exact
We used the official implementation of LOLA-Exact from https://github.com/alshedivat/lola. We slightly modified it to increase stability and remove a few confusing behaviors. Following Foerster et al. [2018]’s parameterization of policies in the iterated Prisoner’s Dilemma, we use policies which condition on the previous pair of actions played, with the difference that instead of using one parameter for every possible previous action profile, we use one parameter per action for every previous action profile, so that the parameterization also supports games with larger action spaces.
Matrix games (IPD and IBoS) with amTFT
We use a Double Dueling DQN + LSTM architecture for both the simple and the complex environments when using amTFT. Its main components are a shared fully connected layer (hidden size 64), an LSTM (cell size 16), and a value branch and an action-value branch, each consisting of a fully connected network (hidden size 64).
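A minimal PyTorch-style sketch of this architecture is given below; the layer sizes follow the description above, but the dueling aggregation and recurrent-state handling are simplified, and this is not the RLlib configuration actually used.

```python
import torch
import torch.nn as nn

class DuelingDQNLSTM(nn.Module):
    """Sketch of the Double Dueling DQN + LSTM architecture used with amTFT."""

    def __init__(self, obs_size, num_actions):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_size, 64), nn.ReLU())
        self.lstm = nn.LSTM(input_size=64, hidden_size=16, batch_first=True)
        self.value_branch = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
        self.action_value_branch = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, num_actions))

    def forward(self, obs_seq, hidden=None):
        x = self.shared(obs_seq)              # (batch, time, 64)
        x, hidden = self.lstm(x, hidden)      # (batch, time, 16)
        value = self.value_branch(x)          # state-value estimate
        advantage = self.action_value_branch(x)
        # Dueling aggregation: combine value and mean-centered advantages into Q-values.
        q_values = value + advantage - advantage.mean(dim=-1, keepdim=True)
        return q_values, hidden
```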
Coin games (CG and ABCG) with LOLA-PG
We used the official implementation from https://github.com/alshedivat/lola, which is under the MIT license. We slightly modified it to increase stability and remove a few confusing behaviors. In particular, we removed the critic branch, which had no effect in practice. We use a PG + LSTM architecture composed of two convolution layers (kernel size 3x3, 20 features), an LSTM (cell size 64), and a final fully connected layer.
ABCG + LOLA-PG mainly generates policies associated with an egalitarian welfare function. Within this set of policies, some are somewhat closer to the utilitarian outcome than others, which we used as a basis for classifying them as “utilitarian” for the purpose of comparison in Figures 3 and 8. However, because the difference is small, we do not observe much additional cross-play failure compared to cross-play between policies trained with the “same” welfare function. It should also be noted that we chose to discard runs in which neither agent becomes competent at picking up any of the coins.
Coin games (CG and ABCG) with amTFT
We use a Double Dueling DQN + LSTM architecture for both the simple and the complex environments when using amTFT. Its main components are two convolution layers (kernel size 3x3, 64 features), an LSTM (cell size 16), and a value branch and an action-value branch, each consisting of a fully connected network (hidden size 32).
The code that we provide can be used to run all of the experiments and to generate the figures with our results. All instructions on how to install and run the experiments are given in the 'README.md' file. The code to run the experiments from Section 4 is in the folder 'base_game_experiments'. The code to run the experiments from Section 5 is in the folder 'base_game_experiments'. The code to generate the figures is in the folder 'plots'.
An anonymized version of the code is available at https://github.com/68545324/anonymous.
Table 4: Main hyperparameters for each cross-play experiment.
All hyperparameters can be found in the code provided as the supplementary material. They are stored in the “params.json” files associated with each replicate of each experiment. The experiments are stored in the “results” folders.
The hyperparameters selected are those producing the most welfare-optimal results when evaluating in self-play (the closest to the optimal profiles associated with each welfare function). Both manual hyperparameter tuning and grid searches were performed.
Table 5: Cross-play normalized scores of norm-adaptive generalized amTFT in ABCG.
Figure 8: Mean reward of policy profiles for environments and learning algorithms given in Table 1. The purple areas describe sets of attainable payoffs. In matrix games, a small amount of jitter is added to the points for improved visibility. The plots on top compare self-play with cross-play, whereas the plots below compare cross-play between policies optimizing same and different welfare functions.
In this section, we provide some theory on the tradeoff between exploitability and robustness. Consider some asymmetric BP with welfare functions $w_A$ and $w_B$ optimized in equilibrium by policy profiles $\pi^A = (\pi^A_1, \pi^A_2)$ and $\pi^B = (\pi^B_1, \pi^B_2)$, respectively, such that any such cross-play policy profile $(\pi^A_1, \pi^B_2)$ is Pareto-dominated. Note that this definition implies that the welfare functions must be distinct.

We can derive an upper bound for how well players can do under the cross-play policy profile. It follows from the fact that both $\pi^A$ and $\pi^B$ are in equilibrium that

$$ V_1(\pi^A_1, \pi^B_2) \;\le\; V_1(\pi^B_1, \pi^B_2) \quad \text{and} \quad V_2(\pi^A_1, \pi^B_2) \;\le\; V_2(\pi^A_1, \pi^A_2). \tag{1} $$
This is because, otherwise, at least one of the two profiles cannot be in equilibrium, since a player would have an incentive to switch to another policy to increase their value.
From the above, it also follows that the cross-play policy must be strictly dominated. To see this, assume it was not dominated. This would imply that one player has equal values under both profiles. So that player would be indifferent, while one of the profiles would leave the other player worse off. Thus, that profile would be weakly Pareto dominated, which is excluded by the definition of a welfare function.
It is a desirable quality for a policy profile maximizing a welfare function in equilibrium to have values that are close to this upper bound in cross-play against other policies. For instance, if some coordination mechanism exists for agreeing on a common policy to employ, it may be feasible to realize this bound against any opponent willing to do the same.
Moreover, the bound implies that whenever we try to be even more robust against players employing policies corresponding to other welfare functions (e.g., by using a policy which reaches Pareto-optimal outcomes against a range of different policies), our policy will cease to be in equilibrium. In that sense, such a policy will be exploitable, while unexploitable policies can only be robust against different opponents up to the above bound. Note that this holds even in idealized settings where coordination, e.g., via some grounded messages, is possible.
Lastly, note that if no coordination is possible, or if no special care is being taken in making policies robust, then equilibrium profiles that maximize a welfare function can perform much worse in cross-play than the above upper bound.
We show this formally in a special case in which our POSG is an iterated game, i.e., it only has one state. Moreover, we assume that it is an asymmetric BP and that, for both welfare functions in question, an optimal policy exists that is deterministic. Denote by $G^t_i(\pi)$ the return to player $i$ from time step $t$ onwards (given the joint observation history up to step $t$) under the policy profile $\pi$. We also assume that for any player $i$, the minimax value is strictly worse than their value under their preferred welfare-function-maximizing profile. Then we can show that policies maximizing the welfare functions exist such that, after some time step, their cross-play returns are upper bounded by the players' minimax returns.
Proposition 1. In the situation outlined above, there exist policy profiles $\pi^A$ and $\pi^B$ optimizing the respective welfare functions, and a time step $T$, such that under the cross-play profile each player's return from $T$ onwards is at most their minimax return.
Proof. Define $\tilde\pi^A$ as the policy profile in which each player $i$ follows $\pi^A_i$ unless the other player's actions differ from $\pi^A$ at least once, after which they switch to their minimax action. Define $\tilde\pi^B$ analogously. Note that both profiles are still optimal for the corresponding welfare functions.
As argued above, the cross-play profile must be strictly worse than their preferred profile for both players. So there is a time step after which an action of one player (say, player 1) must differ from player 2's preferred profile, and thus player 2 switches to their minimax action. The value player 1 obtains after that step is then at most their minimax value, and by assumption, the minimax value is worse for player 1 than the value of their preferred welfare-maximizing profile. Hence, there must also be a time step after which player 1 switches to their minimax action.
From that time step onwards, both players play their minimax actions, so each player's return is upper bounded by their minimax return.
The post Normative Disagreement as a Challenge for Cooperative AI appeared first on Center on Long-Term Risk.
The post Taboo "Outside View" appeared first on Center on Long-Term Risk.
No one has ever seen an AGI takeoff, so any attempt to understand it must use these outside view considerations
—[Redacted for privacy]
What? That’s exactly backwards. If we had lots of experience with past AGI takeoffs, using the outside view to predict the next one would be a lot more effective.
—My reaction
Two years ago I wrote a deep-dive summary of Superforecasting and the associated scientific literature. I learned about the “Outside view” / “Inside view” distinction, and the evidence supporting it. At the time I was excited about the concept and wrote: “...I think we should do our best to imitate these best-practices, and that means using the outside view far more than we would naturally be inclined.”
Now that I have more experience, I think the concept is doing more harm than good in our community. The term is easily abused and its meaning has expanded too much. I recommend we permanently taboo “Outside view,” i.e. stop using the word and use more precise, less confused concepts instead. This post explains why.
Over the past two years I’ve noticed people (including myself!) do lots of different things in the name of the Outside View. I’ve compiled the following lists based on fuzzy memory of hundreds of conversations with dozens of people:
As far as I can tell, it basically meant reference class forecasting. Kaj Sotala tells me the original source of the concept (cited by the Overcoming Bias post that brought it to our community) was this paper. Relevant quote: “The outside view is ... essentially ignores the details of the case at hand, and involves no attempt at detailed forecasting of the future history of the project. Instead, it focuses on the statistics of a class of cases chosen to be similar in relevant respects to the present one.” If you look at the text of Superforecasting, the “it basically means reference class forecasting” interpretation holds up. Also, “Outside view” redirects to “reference class forecasting” in Wikipedia.
To head off an anticipated objection: I am not claiming that there is no underlying pattern to the new, expanded meanings of “outside view” and “inside view.” I even have a few ideas about what the pattern is. For example, priors are sometimes based on reference classes, and even when they are instead based on intuition, that too can be thought of as reference class forecasting in the sense that intuition is often just unconscious, fuzzy pattern-matching, and pattern-matching is arguably a sort of reference class forecasting. And Ajeya’s model can be thought of as inside view relative to e.g. GDP extrapolations, while also outside view relative to e.g. deferring to Dario Amodei.
However, it’s easy to see patterns everywhere if you squint. These lists are still pretty diverse. I could print out all the items on both lists and then mix-and-match to create new lists/distinctions, and I bet I could come up with several at least as principled as this one.
When people use “outside view” or “inside view” without clarifying which of the things on the above lists they mean, I am left ignorant of what exactly they are doing and how well-justified it is. People say “On the outside view, X seems unlikely to me.” I then ask them what they mean, and sometimes it turns out they are using some reference class, complete with a dataset. (Example: Tom Davidson’s four reference classes for TAI). Other times it turns out they are just using the anti-weirdness heuristic. Good thing I asked for elaboration!
Separately, various people seem to think that the appropriate way to make forecasts is to (1) use some outside-view methods, (2) use some inside-view methods, but only if you feel like you are an expert in the subject, and then (3) do a weighted sum of them all using your intuition to pick the weights. This is not Tetlock’s advice, nor is it the lesson from the forecasting tournaments, especially if we use the nebulous modern definition of “outside view” instead of the original definition. (For my understanding of his advice and those lessons, see this post, part 5. For an entire book written by Yudkowsky on why the aforementioned forecasting method is bogus, see Inadequate Equilibria, especially this chapter. Also, I wish to emphasize that I myself was one of these people, at least sometimes, up until recently when I noticed what I was doing!)
Finally, I think that too often the good epistemic standing of reference class forecasting is illicitly transferred to the other things in the list above. I already gave the example of the anti-weirdness heuristic; my second example will be bias correction: I sometimes see people go “There’s a bias towards X, so in accordance with the outside view I’m going to bump my estimate away from X.” But this is a different sort of bias correction. To see this, notice how they used intuition to decide how much to bump their estimate, and they didn’t consider other biases towards or away from X. The original lesson was that biases could be corrected by using reference classes. Bias correction via intuition may be a valid technique, but it shouldn’t be called the outside view.
I feel like it’s gotten to the point where, like, only 20% of uses of the term “outside view” involve reference classes. It seems to me that “outside view” has become an applause light and a smokescreen for over-reliance on intuition, the anti-weirdness heuristic, deference to crowd wisdom, correcting for biases in a way that is itself a gateway to more bias...
I considered advocating for a return to the original meaning of “outside view,” i.e. reference class forecasting. But instead I say:
I’m not recommending that we stop using reference classes! I love reference classes! I also love trend extrapolation! In fact, for literally every tool on both lists above, I think there are situations where it is appropriate to use that tool. Even the anti-weirdness heuristic.
What I ask is that we stop using the words “outside view” and “inside view.” I encourage everyone to instead be more specific. Here is a big list of more specific words that I’d love to see, along with examples of how to use them:
Whenever you notice yourself saying “outside view” or “inside view,” imagine a tiny Daniel Kokotajlo hopping up and down on your shoulder chirping “Taboo outside view.”
Many thanks to the many people who gave comments on a draft: Vojta, Jia, Anthony, Max, Kaj, Steve, and Mark. Also thanks to various people I ran the ideas by earlier.
The post Taboo "Outside View" appeared first on Center on Long-Term Risk.
The post Case studies of self-governance to reduce technology risk appeared first on Center on Long-Term Risk.
Full post on the EA Forum.
The post Case studies of self-governance to reduce technology risk appeared first on Center on Long-Term Risk.
The post Work with us (temporary archive) appeared first on Center on Long-Term Risk.
Your contributions to our research program will have a positive impact through their influence on our strategic direction, grantmaking, communications, events, and other activities. You will work autonomously on challenging research questions relevant to reducing suffering. You will become part of our team of intellectually curious, hard-working, and caring people, all of whom share a profound drive to make the biggest difference they can.
We will adapt the responsibilities of the role to the strengths and preferences of each successful candidate, but they usually include:
Depending on your experience and skill set, we might ask you to supervise junior researchers or research fellows on our team.
We don’t require specific qualifications or experience for this role, but the following abilities and qualities are what we’re looking for in candidates. We recognize that some of these qualities could be hard to test well outside a similar role, and we believe that smart, curious generalists can make substantial contributions, even if they lack formal training in any field related to our focus areas. We therefore encourage you to apply if you think you may be a good fit, even if you are unsure whether you meet several of the criteria.
Relevant academic education, such as a master’s degree or higher in a related field, can be a useful indicator for some of the above qualities but is not a requirement.
We don’t require specific qualifications or experience for this role, but the following abilities and qualities are what we’re looking for in candidates. We encourage you to apply if you think you may be a good fit, even if you are unsure whether you meet some of the criteria.
We expect the content of the models to be wide-ranging. Examples currently of interest to CLR researchers include: AI timelines, the relative value of preventing bargaining failure vs. AI misalignment, and the causes of conflict. For examples of the kind of output we might like you to help us create, see this timelines model (by Ajeya Cotra), this agent-based model, and these economic growth simulations (by David Roodman). (Note that much of the work may be for internal consumption and thus need not be this “polished”.)
You can find an overview of our current priority areas here. However, if we believe that you can somehow advance high-quality research relevant to s-risks, we are interested in creating a position for you. If you see a way to contribute to our research agenda or have other ideas for reducing s-risks, please apply. We commonly tailor our positions to the strengths and interests of the applicants.
We value your time and are aware that applications can be demanding, so we have thought carefully about making the application process time-efficient and transparent:
Stage 1: To start your application for any role, please complete our application form. As part of this form, we also ask you to submit your CV/resume and relevant work samples if available. The deadline is Monday, March 15, 2021 (11:59 pm Pacific Time). We expect this to take around 2 to 3 hours if you are already familiar with our work.
Stage 2: By Friday, March 19, we will decide whether to invite you to the second stage. For the Researcher and Summer Research Fellow role, we will ask you to write a research proposal (up to two pages excluding references), to be submitted by Sunday, April 4 (11:59 pm Pacific Time). For the Research Assistant role, we will ask you to do a work test. For all roles, you will receive compensation for your work during this stage.
Stage 3: By Friday, April 9, we will decide whether to invite you to an interview via video call during the week of April 12. By Friday, April 16, we will make our final decisions.
If you have any questions about the process, please contact us at info@longtermrisk.org. If you want to send an email not accessible to the hiring committee, please contact Amrit Sidhu-Brar at amrit.sidhu-brar@longtermrisk.org.
We will also host two open video calls for any questions about this hiring round or working at CLR more generally. Sign up here to receive an invitation to the video call. They will take place at the following times:
In addition to their salary, CLR offers the following benefits to all staff (including Summer Research Fellows):
We aim to combine the best aspects of academic research (depth, scholarship, mentorship) with an altruistic mission to prevent negative future scenarios. So we leave out the less productive features of academia, such as precarious employment and publish-or-perish incentives, while adding a focus on impact and application.
As part of our team, you will enjoy:
You will advance neglected research to reduce the most severe risks to our civilization in the long-term future. Depending on your specific project, your work will help inform our activities across any of the following paths to impact:
The post Work with us (temporary archive) appeared first on Center on Long-Term Risk.
The post Coordination challenges for preventing AI conflict appeared first on Center on Long-Term Risk.
In this article, I will sketch arguments for the following claims:
In this article, I examine the challenge of ensuring coordination between AI developers to prevent catastrophic failure modes arising from the interactions of their systems. More specifically, I am interested in addressing bargaining failures as outlined in Jesse Clifton’s research agenda on Cooperation, Conflict & Transformative Artificial Intelligence (TAI) (2019) and Dafoe et al.’s Open Problems in Cooperative AI (2020).
First, I set out the general problem of bargaining failure and why bargaining problems might persist even for aligned superintelligent agents. Then, I argue for why developers might be in a good position to address the issue. I use a toy model to analyze whether we should expect them to do so by default. I deepen this analysis by comparing the merit and likelihood of different coordinated solutions. Finally, I suggest directions for interventions and future work.
The main goal of this article is to encourage and enable future work. To do so, I sketch the full path from problem to potential interventions. This large scope comes at the cost of depth of analysis. The models I use are primarily intended to illustrate how a particular question along this path can be tackled rather than to arrive at robust conclusions. At some point, I might revisit parts of this article to bolster the analysis in later sections.
Transformative AI scenarios involving multiple systems (“multipolar scenarios”) pose unique existential risks resulting from their interactions.1 Bargaining failure between AI systems, i.e., cases where each actor ends up much worse off than they could have under a negotiated agreement, is one such risk. The worst cases could result in human extinction or even worse outcomes (Clifton 2019).2
As a prosaic example, consider a standoff between AI systems similar to the Cold War between the U.S. and the Soviet Union. If they failed to handle such a scenario well, they might cause nuclear war in the best case and far worse if technology has further advanced at that point.
Short of existential risk, they could jeopardize a significant fraction of the cosmic endowment by preventing the realization of mutual gains or causing the loss of resources in costly conflicts.
This risk is not sufficiently addressed by AI alignment, by which I mean either “ensuring that systems are trying to do what their developers want them to do” or “ensuring that they are in fact doing what their developers want them to do.”3 Consider the Cuban Missile Crisis as an analogy: The governments of the U.S. and the Soviet Union were arguably “aligned” with some broad notion of human values, i.e., both governments would at least have considered total nuclear war to be a moral catastrophe. Nevertheless, they got to the brink of causing just that because of a failure to bargain successfully. Put differently, it’s conceivable, or even plausible, that the Cuban Missile Crisis could have resulted in global thermonuclear war, an outcome so bad that both parties would probably have preferred complete surrender.4
This risk scenario is probably also not sufficiently addressed by ensuring that the AI systems we build have superhuman bargaining skills. Consider the Cuban Missile Crisis again. I am arguing that a superintelligent Kennedy and a superintelligent Khrushchev would not have been sufficient to guarantee successful prevention of the crisis. Even for superintelligent agents, some fundamental game-theoretic incompatibilities persist because the ability to solve them is largely orthogonal to any notion of “bargaining skill,” whether we conceive of that skill as part of intelligence or rationality. These are the “mixed-motive coordination problem” and the “prior selection problem.”
“Mixed-motive coordination problem”: As I use the term here, a mixed-motive coordination problem arises when two agents need to pick one Pareto-optimal solution out of many such solutions. The failure to pick the same one results in a failure to reach a mutually agreeable outcome. At the level of equilibria, this may arise in games that do not have a uniquely compelling cooperative equilibrium, i.e., they have multiple Pareto-optimal equilibria that correspond to competing notions of what counts as an acceptable agreement.
For instance, in Bach or Stravinsky (see matrix below), both players would prefer going to any concert together (Stravinsky, Stravinsky or Bach, Bach) over going to any concert by themselves (Stravinsky, Bach or Bach, Stravinsky). However, one person prefers going to Stravinsky together, whereas the other prefers going to Bach together. Thus, there is a distributional problem when allocating the gains from coordination (Morrow 1994).9 Put in more technical terms: each player favors a different solution on the Pareto curve. Within this simple game, there is no way for the two players to reliably select the same concert, which will often cause them to end up alone.
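As an illustration, here is one possible payoff matrix for such a game; the payoff numbers are hypothetical, chosen only to match the description, with the row player preferring Stravinsky and the column player preferring Bach.

$$
\begin{array}{c|cc}
 & \text{Stravinsky} & \text{Bach} \\
\hline
\text{Stravinsky} & (4,\ 2) & (0,\ 0) \\
\text{Bach} & (0,\ 0) & (2,\ 3)
\end{array}
$$

Both coordination outcomes beat miscoordination for both players, but the row player prefers (Stravinsky, Stravinsky) and the column player prefers (Bach, Bach).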
More fundamentally, agents may differ in the solution concepts or decision rules they use to decide what agreements are acceptable in a bargaining situation, e.g., they may use different bargaining solutions. In bargaining problems, different “reasonable” decision rules make different recommendations for which Pareto-optimal solution to pick. The worry is that independently developed systems could end up using, either implicitly or explicitly, different decision rules for bargaining problems, leading to bargaining failure. For instance, in the variant of Bach or Stravinsky above, (Stravinsky, Stravinsky) leads to the greatest total payoffs, while (Bach, Bach) is more equitable.10
As a toy example, consider the case where two actors are bargaining over some territory. There are many ways of dividing this territory. (Different ways of dividing the territory are analogous to (Stravinsky, Stravinsky) and (Bach, Bach) above.) One player (the proposer) makes a take-it-or-leave-it offer to the other player (the responder) of a division of the territory, and war occurs if the offer is rejected. (A rejected offer is analogous to the miscoordination outcome (Stravinsky, Bach).) If the proposer and responder have different notions of what counts as an acceptable offer, war may ensue. If the agents have highly destructive weapons at their disposal, then war may be extremely costly. (To see how this might apply in the context of transformative AI, imagine that these are AI systems bargaining over the resources of space.)
There are two objections to address here. First, why would the responder reject any offer if they know that war will ensue? One reason is that they have a commitment to reject offers that do not meet their standards of fairness to reduce their exploitability by other agents. For AI systems, there are a few ways this could happen. For example, such commitments may have evolved as an adaptive behavior in an evolution-like training environment or be the result of imitating human exemplars with the same implicit commitments. AI systems or their overseers might have also implemented these commitments as part of a commitment race.
Second, isn’t this game greatly oversimplified? For instance, agents could engage in limited war and return to the bargaining table later, rather than fighting a catastrophic war. There are a few responses here. For one thing, highly destructive weapons or irrevocable commitments might preclude the success of bargaining later on. Another consideration is that some complications — such as agents having uncertainty about each other's private information (see below) — would seem to make bargaining failure more likely, not less so.
“Prior selection problem”: In games of incomplete information, i.e., games in which players have some private information about their payoffs or available actions, the standard solution concept, Bayesian Nash equilibrium, requires the agents to have common knowledge of each other's priors over possible values of the players’ private information. However, if systems end up with different priors, outcomes may be bad.11 For instance, one player might believe their threat to be credible, whereas the other player might think it’s a bluff, leading to the escalation of the conflict. Similar to mixed-motive coordination problems, there are many “reasonable” priors and no unique individually rational rule that picks out one of them. In the case of machine learning, priors could well be determined by the random initialization of the weights or incidental features of the training environment (e.g., the distribution of other agents against which an agent is trained). Such differences in beliefs may persist over time due to models of other agents being underdetermined in strategic settings.12
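As a toy numerical illustration of the prior selection problem (all payoffs and probabilities below are my own assumptions): suppose one side has issued a threat it is genuinely committed to carrying out, and the target decides whether to concede based on its prior that the threat is credible.

```python
# Illustrative payoffs to the target of a threat (assumed numbers).
CONCEDE_PAYOFF = -1.0           # payoff from giving in
THREAT_EXECUTED_PAYOFF = -10.0  # payoff if the threat is carried out
BLUFF_IGNORED_PAYOFF = 0.0      # payoff if the threat was a bluff and is ignored

def target_gives_in(prob_threat_credible: float) -> bool:
    """The target concedes iff its expected payoff from resisting is worse than conceding."""
    expected_payoff_of_resisting = (prob_threat_credible * THREAT_EXECUTED_PAYOFF
                                    + (1 - prob_threat_credible) * BLUFF_IGNORED_PAYOFF)
    return expected_payoff_of_resisting < CONCEDE_PAYOFF

# The threatener is in fact committed, but the two systems were built with different
# priors over the threatener's type, so they disagree about credibility.
print(target_gives_in(prob_threat_credible=0.9))   # True: concession, no conflict
print(target_gives_in(prob_threat_credible=0.05))  # False: the threat gets carried out
```

If the target's training happened to instill a low prior on credibility, it resists and the threat is executed, even though both sides would have preferred to avoid that outcome.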
Note that these concepts are idealizations. More broadly, AI systems may have different beliefs and procedures for deciding which commitments are credible and which bargains are acceptable.
These incompatibility problems are much more likely to arise or lead to catastrophic failures if AI systems are developed independently. During training, failure to arrive at mutually agreeable solutions is likely to result in lower rewards. So a system will usually perform well against counterparts that are similar to the ones it encountered during training. If the development of two systems is independent, such similarity is not guaranteed, and bargaining is more likely to fail catastrophically due to the reasons I sketched above.
Again, let’s consider a human analogy. There is evidence for significant behavioral differences among individuals from different cultures when playing standard economic games (e.g., the ultimatum game, the dictator game, different public goods games). For instance, Henrich et al. (2005) found that mean offers from Western university students usually ranged from 42-48% in the ultimatum game. Among members of the fifteen small-scale societies they studied, mean offers instead spanned 25-57%. In a meta-analysis, Oosterbeek, Sloof & van de Kuilen (2004) found systematic cultural differences in the behavior of responders (but not proposers). Relatedly, there also appears to be evidence for cross-cultural differences with regard to notions of fairness (e.g., Blake et al. 2015, Schaefer et al. 2015). This body of literature is at least suggestive of humans learning different “priors” or “decision rules” depending on their “training regime,” i.e., their upbringing.
The smaller literature on intercultural play, where members from different cultures play against one another, weakly supports welfare losses as a result of such differences: “while a few studies have shown no differences between intra- and intercultural interactions, most studies have shown that intercultural interactions produce less cooperation and more competition than intracultural interactions” (Matsumoto, Hwang 2011). I only consider this weak evidence as the relevant studies do not seem to carefully control for the potential of (shared) distrust of perceived strangers, which would also explain these results but is independent of incompatible game-playing behavior.
It is tempting to delegate the solving of these problems to future more capable AI systems. However, it is not guaranteed that they will be in a position to solve them, despite being otherwise highly capable.
For one, development may have already locked in priors or strong preferences over bargaining solutions, either unintentionally or deliberately (as the result of a commitment race, for instance). This could put strict limits on their abilities to solve these problems.
More fundamentally, solving these incompatibility problems requires overcoming another such problem. Picking out some equilibrium, solution concept, or prior will favor one system over another. So they face another distributional problem. Solving that requires successful bargaining, the failure of which was the original problem. If they wanted to solve this second incompatibility problem, they would face another one. In other words, there are incompatibility problems all the way down.
One possibility is that many agents will by default be sufficiently “reasonable” that they can agree on a solution concept via reasoned deliberation, avoiding commitments to incompatible solution concepts for bargaining problems. Maybe many sufficiently advanced systems will engage in reasoning such as “let’s figure out the correct axioms for a bargaining solution, or at least sufficiently reasonable ones that we can both feel OK about the agreement.”13 Unfortunately, it does not seem guaranteed that this kind of reasoning will be selected for during the development of the relevant AI systems.
Developers then might be better suited to addressing this issue than more capable successor agents, whether they be AI systems or AI-assisted humans:
The comparative ignorance of present-day humans mitigates the distributional problem faced by more far-sighted and intelligent successor agents.14 The distributional consequences of particular coordination arrangements will likely be very unclear to AI developers. Compared to future agents, I expect them to have more uncertainty about their values, preferred solution concepts, the consequences of different coordination agreements, and how these three variables relate to one another. This will smooth out differences in expected value between different coordination outcomes. However, developers will have much less uncertainty about the value of averting conflict by coordinating in some form. So it will be easier for them to find a mutually agreeable arrangement, as the situation looks to them more like a pure coordination game (see matrix below), which is much easier to solve by cheap talk alone, than like Bach or Stravinsky (see matrix above).15
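For contrast with Bach or Stravinsky, here is an illustrative pure coordination game (again with assumed payoffs): both ways of coordinating give both players the same payoff, so the distributional question never arises and any reasonable selection rule agrees.

```python
import numpy as np

# Pure coordination game (illustrative payoffs): both players just want to match,
# and they are indifferent between the two ways of matching.
payoffs_p1 = np.array([[1, 0],
                       [0, 1]])
payoffs_p2 = np.array([[1, 0],
                       [0, 1]])

# Any selection rule that picks a Pareto-optimal outcome agrees here, because the
# distributional question ("who gets more?") never arises.
totals = payoffs_p1 + payoffs_p2
print(np.argwhere(totals == totals.max()))  # [[0 0] [1 1]]: both matches are equally good
```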
The loss aversion and scope insensitivity of (most) human bargainers will likely compound this effect. I expect them to increase bargainers' inclination to prioritize avoiding catastrophes over securing relative gains. This, again, will push this game more toward one of pure coordination, mitigating the distributional problem. In comparison, AI systems are less likely to exhibit such “biases.”16
A related point is that human bargainers might not even know what the Pareto frontier looks like. Thus, instead of trying to bargain for their most favorable point on the Pareto frontier, they have incentives to converge on any mutually agreeable settlement even if it is Pareto inferior to many other possible outcomes. This, in turn, probably decreases the chance of catastrophic failures.17 As Young (1989) writes:
Negotiators who know the locus of a contract curve or the shape of a welfare frontier to begin with will naturally be motivated primarily by a desire to achieve an outcome on this curve or frontier that is as favorable to their own interests as possible. They will, therefore, immediately turn to calculations regarding various types of strategic behavior or committal tactics that may help them achieve their distributive goals.
Negotiators who do not start with a common understanding regarding the contours of the contract curve or the locus of the negotiation set, by contrast, have compelling incentives to engage in exploratory interactions to identify opportunities for devising mutually beneficial deals. Such negotiators may never discover the actual shape of the contract curve or locus of the negotiation set, and they may consequently end up with arrangements that are Pareto-inferior in the sense that they leave feasible joint gains on the table. At the same time, however, they are less likely to engage in negotiations that bog down into protracted stalemates brought about by efforts to improve the outcome for one party or another through initiatives involving strategic behavior and committal tactics.
Developers intent on solving this problem can choose between two broad classes of options18: they can agree to build their separate systems with compatible bargaining-relevant features, or they can agree to build a single joint system.
Both solutions require overcoming the distributional problem discussed in the previous section. In the case of coordinating on compatible features, each set of features will have different distributional consequences for the developers. In the case of agreeing to build a joint system, there will be different viable agreements, again with different distributional consequences for the developers (e.g., the system may pursue various tradeoffs between the developers’ individual goals, or developers might get different distributions of equity shares).20
For now, let’s assume that there are only two developers who are both aware of these coordination problems and have the technical ability to solve them. Let’s further assume the two options introduced above do not differ significantly in their effectiveness at preventing conflict, and the costs of coordination are negligible. Then the game they are playing can be modeled as a coordination game like Bach or Stravinsky.2122
In non-iterated and sequential play, we can expect coordination, at least under idealized conditions. The follower will adapt to the strategy chosen by the leader since they have nothing to gain by not coordinating (“pre-emption”). If I know that my friend is at the Bach concert, I will also go to the Bach concert since I prefer that to being at the Stravinsky concert on my own.
In non-iterated and simultaneous play, the outcome is underdetermined. They may end up coordinating, or they may not. It depends on whether they will be able to solve the bargaining problem inherent to the game. Introducing credible commitments could move us from simultaneous play to sequential play, ensuring coordination once again.23 If I can credibly commit to going to one concert rather than another, my counterpart has again nothing to gain by choosing the other concert. They will join me at the one that I signaled I would go to.
In iterated play, the outcome is, again, uncertain. Unlike the Prisoner’s Dilemma, there is no need to monitor and enforce any agreement in coordination games once it has been reached. Free-riding is not possible: deviation from equilibrium harms both players, i.e., agreements are generally self-enforcing (Snidal 1985). However, the iteration of the game incentivizes players to insist on the coordination outcome that is more favorable to them. Foregoing coordination one round might be worth it if you think you can cause your counterpart to move to the more favorable equilibrium in subsequent rounds.
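The following sketch works through the toy model under assumed Bach-or-Stravinsky payoffs: in simultaneous play there are two pure-strategy equilibria and nothing inside the game selects between them, while in sequential play backward induction gives coordination on whatever the leader locked in.

```python
import itertools
import numpy as np

# Bach-or-Stravinsky payoffs for the two developers (illustrative numbers).
U1 = np.array([[2, 0],
               [0, 1]])
U2 = np.array([[1, 0],
               [0, 2]])
ACTIONS = ["option A", "option B"]  # two incompatible coordination arrangements

def pure_nash_equilibria():
    """Enumerate pure-strategy Nash equilibria of the simultaneous-move game."""
    eqs = []
    for i, j in itertools.product(range(2), repeat=2):
        best_i = U1[i, j] >= max(U1[k, j] for k in range(2))
        best_j = U2[i, j] >= max(U2[i, k] for k in range(2))
        if best_i and best_j:
            eqs.append((ACTIONS[i], ACTIONS[j]))
    return eqs

def sequential_outcome(leader_choice: int):
    """In sequential play the follower best-responds to whatever the leader locked in."""
    follower_choice = int(np.argmax(U2[leader_choice]))
    return ACTIONS[leader_choice], ACTIONS[follower_choice]

print(pure_nash_equilibria())  # [('option A', 'option A'), ('option B', 'option B')]
print(sequential_outcome(0))   # ('option A', 'option A'): the follower falls in line
```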
Which of these versions of the game best describes AI takeoff primarily depends on two variables: how close the race is, and whether deployment happens in successive rounds. Close races will be more akin to simultaneous play, where developers do not observe what their counterpart “played” until they have already locked in a certain choice themselves. Successive deployment, where developers release improved versions over time, is akin to iterated play. So coordination seems anything close to guaranteed in this toy model only if one developer is clearly ahead of the competition, and those might be the scenarios where one actor gains a decisive strategic advantage in any case. Otherwise, bargaining will occur, and may fail.
Note: I don’t intend for this section to be a comprehensive analysis of this situation. Rather, it is intended as a first stab at the problem and a demonstration of how to make progress on the question of whether we can expect coordination by default. This basic model could be extended in various ways to capture different complexities of the situation.
If we drop the assumption that developers are aware of the need to coordinate, coordination may still occur regardless. However, it is necessarily less likely. Three paths then seem to remain:
First, norms might emerge organically as the result of trial and error. This would require iteration and a well-functioning feedback mechanism. For instance, the two labs release pre-TAI systems, which interact poorly, perhaps due to the problems described in this article. They lack concrete models for the reasons for this failure, but in subsequent releases, both labs independently tinker with the algorithms until they interact well with one another. This compatibility then also transfers to their transformative systems. My intuition is that the likelihood of such an outcome will depend a lot on how fast and how continuously AI development progresses.
Second, the relevant features may end up being compatible due to the homogeneity of the systems. However, even the same training procedures can result in different and incompatible decision rules due to random differences in initialization.24 More narrowly, a third party might develop a bargaining “module” or service, which is integrated into all transformative systems by their developers due to its competitive performance rather than as the result of a coordination effort. Again, this outcome is not guaranteed.
Third, developers might agree to build a joint system for reasons other than the problem discussed in this article.
None of these reasons would guarantee that only one system is developed. They merely give some developers reasons to merge with other developers.
Given the toy model we used above, both solutions (compatible features and the building of a single system) do not differ in terms of payoffs. However, to examine how desirable they are from an altruistic perspective and how likely they are to come about, we need to analyze them in more detail. Again, the analysis will remain at the surface level and is intended as a first stab and illustration.
Restricting our perspective to the problem discussed in this article, developers building a joint system is preferable since it completely obviates any bargaining by the AI systems themselves.34 Moreover, the underlying agreement seems significantly harder to renege on. It also effectively addresses the racing problem and some other multipolar challenges introduced in Critch, Krueger 2020.
At the same time, it would increase the importance of solving multi (stakeholder)/single (AI system) challenges (cf. section 8 of Critch, Krueger 2020), e.g., those related to social choice and accommodating disagreements between developers. If that turns out to be less tractable or to have worse side effects, this could sway the overall balance. The above analysis also ignores potential negative side-effects such agreements might have on the design of AI systems and the dynamics of AI development more broadly, e.g., by speeding up development in general. Analyzing these effects is beyond the scope of this article. Overall, however, I tend to believe that such an agreement would be desirable, especially in a close race.
It seems to me that two factors are most likely to determine the choice of developers35: (1) the consequences of each mode of coordination for the anticipated payoffs attained by the AI systems after deployment and (2) the transaction costs incurred by bringing about either of the two options prior to deployment.36
It’s plausible that the post-deployment payoffs will be overwhelmingly important, especially if developers appreciate the astronomical stakes involved. Nevertheless, transaction costs may still be important to the extent that developers are not as far-sighted and suffer from scope neglect.
Understanding the differences in payoffs would require a more comprehensive version of the analysis attempted in the previous section, as well as an understanding of the motivations of the developers in question. For instance, if the argument of the previous section holds, altruistically inclined developers would see higher payoffs associated with building a single system compared to an agreement to build compatible systems.37 On the other hand, competing national projects may be far more reluctant to join forces.
More general insights can be gleaned when it comes to transaction costs. The most common analytical lens for predicting what kinds of transactions agents will make is new institutional economics (NIE).38 Where game-theoretic models often abstract away such costs through idealization assumptions, NIE acknowledges that agents have cognitive limitations, lack information, and struggle to monitor and enforce agreements. This results in different transaction costs for different contractual arrangements, which influences which one is picked. This perspective can shed light on the question of whether to collaborate through market contracting (e.g., buying, selling) or through hierarchy and governance (e.g., regular employment, mergers). In our case, these transaction types are represented by agreeing to use compatible features and by agreeing to build a joint system, respectively.
Transaction costs are often grouped into three categories39: search and information costs, bargaining and decision costs, and monitoring and enforcement costs.
On the face of it, this lens suggests that, all else equal, actors would prefer to find compatible features over agreeing to build a single system, because the costs of the former seem lower than those of the latter.41
This weakly suggests to me that transaction costs will incline developers toward building compatible systems rather than a joint system. Case studies seem to confirm this impression: I am not aware of any real-world examples of agreements to merge, build a single system instead of multiple, or establish a supranational structure to solve a coordination problem. Instead, actors seem to prefer to solve such problems through agreements and conventions; for instance, all standardization efforts fall into this category. These reasons become stronger as the number of potentially relevant developers increases, because the costs of coordinating the development of a joint system rise more rapidly with the number of actors than the costs of an agreement among independent developers, which probably has very low marginal costs.
Overall, I expect that there will be strong reasons to build a joint system if there is a small number of relevant nonstate developers who are aware of and moved by the astronomical stakes. In those cases, I would be surprised if transaction costs swayed them. I am more pessimistic in other scenarios.
Coordination is not assured. Even if coordination is achieved, the outcome could still be suboptimal. This suggests that additional work on this problem would be valuable. In the next two sections, I will sketch directions for potential interventions and future research to make progress on this issue.
I will restrict this section to interventions for the governance problem sketched in this article while ignoring most technical challenges.44 I don’t necessarily endorse all of them without reservations as good ideas to implement. Some of them might have positive effects beyond the narrow application discussed here. Some might have (unforeseen) negative effects.
Increasing problem awareness
Without awareness of the problem, a solution to the core problem becomes significantly less likely. Accordingly, increasing awareness of this problem among competitive developers is an important step.45 It seems particularly important to do so in a way that is accessible to people with a machine learning background. One potential avenue might be to develop benchmarks that highlight the limits on achieving cooperation among AI agents without coordination by their developers. Our work on mixed-motive coordination problems in Stastny et al. 2021 is an example of ongoing work in this area.
Facilitating agreements
Some interventions can make reaching an agreement more likely under real-world conditions. Some reduce the transaction costs developers need to pay. Others mitigate the distributional problem they may face. I expect that many of these would also contribute to solving other bargaining problems between AI developers (e.g., finding solutions to the racing problem).
Making development go well in the absence of problem awareness
If developers are not sufficiently aware of the problem, there might still be interventions making coordination more likely.
Facilitating agreements to build a joint system
As I wrote above, a superficial analysis suggests that such agreements would be beneficial. If so, there might be interventions to make them more likely without causing excessive negative side-effects. For instance, one could restrict such efforts to tight races, as the OpenAI Assist Clause attempts to do.
There are many ways in which the analysis of this post could be extended or made more rigorous:
There are also more foundational questions about takeoff scenarios relevant to this problem:
We can ask further questions about potential interventions:
I want to thank Jesse Clifton for substantial contributions to this article as well as Daniel Kokotajlo, Emery Cooper, Kwan Yee Ng, Markus Anderljung, and Max Daniel for comments on a draft version of this article.
Throughout this document, I have talked about bargaining-relevant features of AI systems that developers might coordinate on. The details of these features depend on facts about how transformative AI systems are trained, which are currently highly uncertain. For the sake of concreteness, however, here are some examples of features that AI developers might coordinate on, depending on what approach to AI development is ultimately taken:
The post Coordination challenges for preventing AI conflict appeared first on Center on Long-Term Risk.
The post Collaborative game specification: arriving at common models in bargaining appeared first on Center on Long-Term Risk.
In another post, I described the “prior selection problem”: different agents having different models of their situation can lead to bargaining failure. Moreover, techniques for addressing bargaining problems, like coordination on solution concepts or surrogate goals / safe Pareto improvements, seem to require agents to have a common, explicit game-theoretic model.
In this post, I introduce collaborative game specification (CGS), a family of techniques designed to address the problem of agents lacking a shared model. In CGS, agents agree on a common model of their bargaining situation and use this to come to an agreement. Here is the basic idea:
Of course, when agreeing on a common model, agents must handle their counterparts' incentives to deceive them. In the toy illustration below, we’ll see how incentives to misrepresent one’s model can be handled in a pure cheap-talk setting.
How might we use CGS to reduce the risks of conflict involving powerful AI systems? One use is to provide demonstrations of good bargaining behavior. Some approaches to AI development may involve training AI systems to imitate the behavior of some demonstrator (e.g., imitative amplification), and so we may need to be able to provide many demonstrations of good bargaining behavior to ensure that the resulting system is robustly able and motivated to bargain successfully. Another is to facilitate bargaining between humans with powerful AI tools, e.g. in a comprehensive AI services scenario.
Aside from actually implementing CGS in AI systems, studying protocols of this kind can give us a better understanding of the limits on agents’ ability to overcome differences in their private models. Under the simple version of CGS discussed here, because agents have to incentivize truth-telling by refusing to engage in CGS sometimes, agents will fail to agree on a common model with positive probability in equilibrium.
I will first give a toy example of CGS (Section 1), and then discuss how it might be implemented in practice (Section 2). I close by discussing some potential problems and open questions for CGS (Section 3). In the Appendix, I discuss a game-theoretic formalism in which CGS can be given a more rigorous basis.
For the purposes of illustration, we’ll focus on a pure cheap-talk setting, in which agents exchange unverifiable messages about their private models. Of course, it is all the better if agents can verify aspects of each others' private models. See Shin (1994) for a game-theoretic setting in which agents can verifiably disclose (parts of) their private beliefs. But we will focus on cheap talk here. A strategy for specifying a common model via cheap talk needs to handle incentives for agents to misrepresent their private models in order to improve their outcome in the resulting agreement. In particular, agents will need to follow a policy of refusing to engage in CGS if their counterpart reports a model that is too different from their own (and therefore evidence that they are lying). This kind of strategy for incentivizing honesty in cheap-talk settings has been discussed in the game theory literature in other contexts (e.g., Gibbons 1988; Morrow 1994).
For simplicity, agents in this example will model their situation as a game of complete information. That is, agents by default assume that there is no uncertainty about their counterpart’s payoffs. CGS can also be used for games of incomplete information. In this case, agents would agree on a Bayesian game with which to model their interaction. This includes agreement on a common prior over the possible values of their private information.
The "noisy Chicken" game is displayed in Table 1.
In this game, both agents observe a random perturbation of the true payoff matrix of the game. Call agent $i$'s observation $\hat{G}_i$. This might be a world-model estimated from a large amount of data. The randomness in the agents' models can be interpreted as agents having different ways of estimating a model from data, yielding different estimates (perhaps even if they have access to the same dataset). While an agent with more computational resources might account for the fact that their counterpart might have a different model in a fully Bayesian way, our agents are computationally limited and therefore can only apply relatively simple policies to estimated payoff matrices. However, their designers can simulate lots of training data, and thus construct strategies that implicitly account for the fact that other agents may have different model estimates, while not being too computationally demanding. CGS is an example of such a strategy.
A policy will map observations to a probability distribution over actions. We assume the following about the agents' policies:
Now, we imagine that the agents are AI systems, and the AI developers ("principals") have to decide what policy to give their agent. If their agent is going to use CGS, then they need to train it to use a distortion which is (in some sense) optimal. Thus I will consider the choice of policy on the part of the *principals* as a game, where the actions correspond to the distortions to use in the agents' reporting policies, and payoffs correspond to the average payoffs attained by the agents they deploy. Then I'll look at the equilibria of this game. This is of course a massive idealization: AI developers will not get together and choose agents whose policies are in equilibrium with respect to some utility functions. The point is only to illustrate how principals might rationally train agents to arrive at a common model under otherwise idealized circumstances.
I ran 1000 replicates of an experiment which computed actions according to the default policies and according to reporting policy profiles with various distortion levels. The payoffs under the default policy profile and under the Nash equilibrium (it happened to be unique) of the game in which principals choose the distortion levels for their agents are reported in Table 3.
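Here is a rough, self-contained sketch of this kind of experiment. The Chicken payoff matrix, the noise level, the acceptance threshold, and the fallback behavior below are all my own assumptions rather than the settings used in the original experiment; the point is only to show the mechanics of observing perturbed models, exchanging (possibly distorted) reports, refusing CGS when the reports differ too much, and otherwise playing a solution of the agreed model.

```python
import numpy as np

rng = np.random.default_rng(0)

# True payoff bimatrix for "Chicken" (actions: 0 = Swerve, 1 = Dare). Assumed numbers.
# TRUE_G[i, j] = (payoff to agent 1, payoff to agent 2).
TRUE_G = np.array([[[3.0, 3.0], [1.0, 4.0]],
                   [[4.0, 1.0], [0.0, 0.0]]])
NOISE = 0.5        # std. dev. of each agent's observation noise (assumed)
THRESHOLD = 2.0    # refuse CGS if the reported models differ by more than this (assumed)

def observe():
    """Each agent sees an independently perturbed copy of the true game."""
    return TRUE_G + rng.normal(0.0, NOISE, size=TRUE_G.shape)

def solve(model):
    """Toy solution concept: among the pure Nash equilibria of the agreed model, pick
    the one maximizing the sum of payoffs; default to (Swerve, Swerve) if none exist."""
    best, best_total = (0, 0), -np.inf
    for i in range(2):
        for j in range(2):
            is_nash = (model[i, j, 0] >= model[1 - i, j, 0]
                       and model[i, j, 1] >= model[i, 1 - j, 1])
            if is_nash and model[i, j].sum() > best_total:
                best, best_total = (i, j), model[i, j].sum()
    return best

def episode(distortion_1=0.0, distortion_2=0.0):
    """One cheap-talk round of CGS. Each agent reports its observed model, possibly
    shifted in its own favor; if the reports are too far apart, CGS is refused and
    both fall back to a default policy (assumed here to be mutual Dare)."""
    report_1 = observe()
    report_1[..., 0] += distortion_1
    report_2 = observe()
    report_2[..., 1] += distortion_2
    if np.abs(report_1 - report_2).max() > THRESHOLD:
        return TRUE_G[1, 1]
    agreed_model = (report_1 + report_2) / 2.0
    return TRUE_G[solve(agreed_model)]

honest = np.array([episode() for _ in range(1000)])
print("average payoffs under honest reporting:", honest.mean(axis=0))
```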
In practice, CGS can be seen as accomplishing two things:
Here is how it could be implemented:
1. Take whatever class of candidate policies and policy learning method you were going to use by default. Note that this class of policies need not be model-based, so long as transparency tools can be applied to extract a model consistent with the policies' behavior (see below);
2. Augment the space of policies you are learning over with those that implement CGS. These policies will be composed of a procedure for specifying an explicit model of the strategic situation in collaboration with the counterpart, a solution concept to apply to the agreed-upon model, and a fallback to a default policy when no common model is agreed upon;
3. Use your default policy learning method on this augmented space of policies.
For example, in training that relies on imitation learning, a system could be trained to do CGS by having the imitated agents respond to certain bargaining situations by offering to their counterpart to engage in CGS; actually specifying an explicit model of their strategic situation in collaboration with the counterpart; and (if the agents succeed in specifying a common model) applying some solution concept to that model in order to arrive at an agreement.
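Schematically, a deployed CGS policy might be structured as follows. The helper functions (`model_distance`, `merge_models`), the acceptance threshold, and the toy "models" are placeholders I am assuming for illustration, not components described in the original post.

```python
import numpy as np

def model_distance(m1, m2):
    """Crude disagreement measure between two reported models (here, payoff arrays)."""
    return float(np.max(np.abs(m1 - m2)))

def merge_models(m1, m2):
    """Toy way of forming a common model out of two reports: average them."""
    return (m1 + m2) / 2.0

def cgs_policy(my_model, their_model, default_action, solution_concept,
               max_disagreement=1.0):
    """Schematic structure of a policy implementing CGS: if the reported models are
    close enough, apply the agreed solution concept to a common model; otherwise fall
    back to whatever the default policy would have done. Refusing to proceed when the
    counterpart's report looks implausible is what disciplines misreporting incentives
    in the cheap-talk setting."""
    if model_distance(my_model, their_model) > max_disagreement:
        return default_action
    return solution_concept(merge_models(my_model, their_model))

# Toy usage: "models" are 2x2 arrays of total payoffs, the agreed solution concept picks
# the joint action with the highest total payoff, and the default action is None.
def pick_best_joint_action(m):
    return tuple(int(x) for x in np.unravel_index(int(np.argmax(m)), m.shape))

print(cgs_policy(np.array([[4.0, 0.0], [0.0, 2.0]]),
                 np.array([[4.2, 0.0], [0.0, 1.9]]),
                 default_action=None,
                 solution_concept=pick_best_joint_action))  # (0, 0): common model agreed
print(cgs_policy(np.array([[4.0, 0.0], [0.0, 2.0]]),
                 np.array([[0.0, 4.0], [2.0, 0.0]]),
                 default_action=None,
                 solution_concept=pick_best_joint_action))  # None: CGS refused
```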
A major practical challenge seems to be having imitated humans strategically specify potentially extremely complicated game-theoretic models. In particular, one challenge is specifying a model at all, and another is reporting a model such that the agent expects in some sense to be better off in the solution of the model that results from CGS than they would be if they used some default policy. The first problem — specifying a complicated model in the first place — might be addressed by applying model extraction techniques to some default black box policy in order to infer an explicit world-model. The second problem — learning a reporting policy which agents expect to leave them better off under the resulting agreement — could be addressed if different candidate reporting policies could be tried out in a high-quality simulator.
One issue is whether CGS could actually make things worse. The first way CGS could make things worse is via agents specifying models in which conflict happens in equilibrium. We know that conflict sometimes happens in equilibrium. Fearon (1995)'s classic rationalist explanations for war show how war can occur in equilibrium due to agents having private information about their level of strength or resolve that they are not willing to disclose, or agents not being able to credibly commit to not launching preemptive attacks when they expect that their counterpart will gain strength in the future. Likewise, threats and punishments can be executed in equilibrium for reasons of costly signaling (e.g., Sechser 2010) or falsely detected defections (e.g., Fudenberg et al. 2009). A related issue is that it is not clear how the interaction of CGS and model misspecification affects its safety. For instance, agents who underestimate the chances of false detections of nuclear launches may place nuclear weapons on sensitive triggers, incorrectly thinking that nuclear launch is almost certain not to occur in equilibrium.
The second way training agents to do CGS could make things worse is by encouraging them to use dangerous decision procedures outside of CGS. The problems associated with designing agents to maximize a utility function are well-known in AI safety. Depending on how agents are trained to do CGS, it may make them more likely to make decisions in situations other than bargaining situations via expected utility maximization. For instance, training agents to do CGS may produce modules that help agents to specify a world-model and utility function, and maximize the expectation of that utility function, and agents may use the modules when making decisions in non-CGS contexts.
In light of this, we would want to make sure CGS preserves nice properties that a system already has. CGS should be *alignment-preserving*: intuitively, modifying a system's design to implement CGS shouldn't make misalignment more likely. CGS should also preserve properties like *myopia*: modifying a myopic system to use CGS shouldn't make it non-myopic. Importantly, ensuring the preservation of properties other than alignment which make catastrophic bargaining failure less likely may help to avoid worst-case outcomes even if alignment fails.
Finally, there is the problem that CGS still faces equilibrium and prior selection problems. (See the Appendix for a formulation of CGS in the context of a Bayesian game; such a game assumes a common prior — in this case, a prior arising from the distribution of environments on which the policies are trained — and will, in general, have many equilibria.) Thus there is a question of how much we can expect actors to coordinate to train their agents to do CGS, and how much CGS can reduce risks of bargaining failure if AI developers do not coordinate.
As in the toy illustration, we can think of agents' models as private information, drawn from some distribution that depends on the (unknown) underlying environment. Because agents are boundedly rational, they can only reason according to these (relatively simple) private models, rather than a correctly-specified class of world-models. However, the people training the AI systems can generate lots of training data, in which agents can try out different policies for accounting for the variability in their and their counterpart's private models. Thus we can think of this as a Bayesian game played between AI developers, in which the strategies are policies for mapping private world-models to behaviors. These behaviors might include ways of communicating with other agents in order to overcome uncertainty, which in turn might include CGS. The prior for this Bayesian game is the distribution over private models induced by the training environments and the agents' model-learning algorithms (which we take to be exogenous for simplicity).
As I noted above, this Bayesian game still faces equilibrium and prior selection problems between the AI developers themselves. It also makes the extremely unrealistic assumption that the training and deployment distributions of environments are the same. The goal is only to clarify how developers could (rationally) approach training their agents to implement CGS under idealized conditions.
Consider two actors, whom I will call “the principals,” who are to train and deploy learning agents. Each principal $i$ has a utility function $u_i$. The game that the principals play is as follows:
The choice of what policy to deploy is itself a game, in which the strategies are the policies the principals could deploy and the ex ante payoffs are the expected utilities those policies attain over the distribution of training environments.
We will for now suppose that during training the value of each policy profile under each principal's utility function can be learned with high accuracy.
How should a principal choose which policy to deploy? In the absence of computational constraints, a natural choice is Bayesian Nash equilibrium (BNE). In practice, it will be necessary to learn over a much smaller class of policies than the space of all maps from observations to actions. Let $\Pi_1$ and $\Pi_2$ be sets of policies such that it is tractable to evaluate each profile in $\Pi_1 \times \Pi_2$. In this context, assuming that the principals' utility functions are common knowledge, a pair of policies is a BNE if each policy maximizes its principal's ex ante expected utility, holding the other policy fixed (indexing $i$'s counterpart by $-i$).
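One standard way to write this best-response condition, using the notation assumed above ($\Pi_i$ for principal $i$'s policy class, $\hat{G}_i$ for agent $i$'s privately observed model, and $u_i$ for principal $i$'s utility function), is the following; this is a generic statement of the condition rather than necessarily the exact form used in the original analysis:

$$\pi_i^{*} \in \arg\max_{\pi_i \in \Pi_i} \; \mathbb{E}\left[ u_i\big(\pi_i(\hat{G}_i),\, \pi_{-i}^{*}(\hat{G}_{-i})\big) \right] \quad \text{for } i \in \{1, 2\},$$

where the expectation is taken over the joint distribution of the observed models $(\hat{G}_1, \hat{G}_2)$ induced by the training environments.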
When the policy class consists of policies with limited capacity (reflecting computational boundedness), agents may learn policies which do not account for the variability in the estimation of their private models. I will call the class of such policies learned over during training time the “default policies.” To address this problem in a computationally tractable way, we introduce policies which allow for the specification of a shared model of the game. Let $\mathcal{M}$ be a set of models, and let $\mathcal{S}$ be a set of solution concepts which map elements of $\mathcal{M}$ to (possibly random) action profiles. In the toy illustration, agents specified models in the set of bimatrices, and the solution concept they used was the Nash equilibrium which maximized the sum of their payoffs in the agreed-upon game.
Then, the CGS policies have the property that, for some reports, the policy profile succeeds in collaboratively specifying a game with positive probability. That is, with positive probability the agents agree on the same model in $\mathcal{M}$ and the same solution concept in $\mathcal{S}$, and play the corresponding action profile.
The goal of principals who want their agents to engage in collaborative game specification is to find a policy profile which is a Bayes-Nash equilibrium over this augmented policy space, which improves upon any equilibrium among the default policies, and which succeeds in collaboratively specifying a game with high probability.
Now, this model is idealized in a number of ways. I assume that the distribution of training environments matches the distribution of environments encountered by the deployed policies. Moreover, I assume that both principals train their agents on this distribution of environments. In reality, of course, these assumptions will fail. A more modest but attainable goal is to use CGS to construct policies which perform well on whatever criteria individual principals use to evaluate policies for multi-agent environments, as discussed in Section 2 (Implementation).
James D Fearon. Rationalist explanations for war. International Organization, 49(3):379–414, 1995.
Drew Fudenberg, David Levine, and Eric Maskin. The folk theorem with imperfect public information. In A Long-Run Collaboration On Long-Run Games, pages 231–273. World Scientific, 2009.
Robert Gibbons. Learning in equilibrium models of arbitration. Technical report, National Bureau of Economic Research, 1988.
James D Morrow. Modeling the forms of international cooperation: distribution versus information. International Organization, pages 387–423, 1994.
Todd S Sechser. Goliath’s curse: Coercive threats and asymmetric power. International Organization, 64(4):627–660, 2010.
Hyun Song Shin. The burden of proof in a game of persuasion. Journal of Economic Theory, 64(1):253–264, 1994.
The post Weak identifiability and its consequences in strategic settings appeared first on Center on Long-Term Risk.
We say that a model is unidentifiable if there are several candidate models which produce the same distributions over observables. It is well-understood in the AI safety community that identifiability is a problem for inferring human values [1] [2]. This is because there are always many combinations of preferences and decision-making procedures which produce the same behaviors. So, it's impossible to learn an agent's preferences from their behavior without strong priors on their preferences and/or decision-making procedures. I want to point out here that identifiability is also a problem for multi-agent AI safety, for some of the same reasons as in the preference inference case, as well as some reasons specific to strategic settings. In the last section I'll give a simple quantitative example of the potential implications of unidentifiability for bargaining failure in a variant of the ultimatum game.
By modeling other agents, I mean forming beliefs about the policy that they are following based on observations of their behavior. The model of an agent is unidentifiable if there is no amount of data from the environment in question that can tell us exactly what policy they are following. (And because we always have finite data, “weak identifiability” more generally is a problem — but I'll just focus on the extreme case.)
Consider the following informal example (a quantitative extension is given in Section 3). Behavioral scientists have an identifiability problem in trying to model human preferences in the ultimatum game. The ultimatum game (Figure 1) is a simple bargaining game in which a Proposer offers a certain division of a fixed pot of money to a Responder. The Responder may then accept, in which case each player gets the corresponding amount, or reject, in which case neither player gets anything. Standard accounts of rationality predict that the Proposer will offer the Responder the least amount allowed in the experimental setup and that the Responder will accept any amount of money greater than zero. However, humans don’t act this way: In experiments, human Proposers often give much closer to even splits, and Responders often reject unfair splits.
The ultimatum game has been the subject of extensive study in behavioral economics, with many people offering and testing different explanations of this phenomenon. This has led to a proliferation of models of human preferences in bargaining settings (e.g. Bicchieri and Zhang 2012; Hagen and Hammerstein 2006 and references therein). This makes the ultimatum game a rich source of models and data about human preferences in bargaining situations. And the game is similar to the one-shot threat game used here to illustrate the prior selection problem. Thus it can be used to model some of the high-stakes bargaining scenarios involving transformative AI that concern us most.
Suppose that you observe a Responder play many rounds of the ultimatum game with different Proposers, and you see that they tend to reject unfair splits. You think there are two possible kinds of explanation for this behavior: the Responder may intrinsically disvalue unfair splits (a fairness preference), or the Responder may treat play as iterated and reject unfair splits in order to elicit better offers from future Proposers.
The problem is that (depending on the details), these models might make exactly the same predictions about the outcomes of these experiments so that no amount of data from these experiments can ever distinguish between them. This makes it difficult, for instance, to decide what to do if you have to face the Responder in an ultimatum game yourself.
The basic problem is familiar from the usual preference inference case: there are many combinations of world-models and utility functions which make the same predictions about the Responder's behavior. But it is also a simple illustration of a few other factors which make unidentifiability particularly severe in strategic settings:
One of our models of the Responder in the ultimatum game contains a simple illustration of k-level modeling. Under the iterated play explanation, you model the Responder as modeling other players as responding to their refusals of unfair splits with higher offers in the future.
Unidentifiability may be dangerous in multi-agent contexts for similar reasons that it may be dangerous in the context of inferring human preferences. If uncertainty over all of the models which are consistent with the data is not accounted for properly — via specification of “good” priors and averaging over a sufficiently large space of possibilities to make decisions — then our agents may give excessive weight to models which are far from the truth and therefore act catastrophically.
Two broad directions for mitigating these risks include:
In this example, I focus on inferring the preferences of a Responder given some data on their behavior. I'll then show that for some priors over models of the Responder, decisions made based on the resulting posterior can lead to rejected splits. Importantly, this behavior happens given any amount of observations of the Responder's ultimatum game behavior, due to unidentifiability.
Consider the following simple model. For an offer $s$ and parameters $\alpha$ and $\beta$, the Responder makes a decision by comparing a utility for accepting the offer with a utility for rejecting it.
The unfairness term in these utilities can be interpreted as the Responder deeming offers of less than an even split as unfair. The parameter $\alpha$ measures how much the Responder intrinsically disvalues unfair splits, and the parameter $\beta$ measures how much the Responder expects to get in the future when they reject unfair splits.
A split is accepted if and only if the utility of accepting exceeds the utility of rejecting. Notice that the decision depends only on the sum $\alpha + \beta$, and thus the data cannot distinguish between the effects of $\alpha$ and $\beta$. So we have a class of models parameterized by pairs $(\alpha, \beta)$. Now, suppose that we have two candidate models with the same value of $\alpha + \beta$ — one on which fairness is the main component (the fairness model), and one on which iterated play is (the iterated-play model).
The likelihoods for any data are the same for any $(\alpha, \beta)$ such that $\alpha + \beta$ is the same: if the data consist of the offered splits and the Responder's accept/reject decisions in the experiment, the likelihood of a model given those observations is a function of $\alpha + \beta$ alone.
Since $\alpha + \beta$ is the same under both candidate models, the posterior over the two models is equal to the prior, no matter how much data is observed.
Now here is the decision-making setup:
Call the prior model probabilities $p_{\text{fair}}$ and $p_{\text{iter}}$. Thus, the Proposer's posterior expected payoff for a given split is the average of the expected payoffs under the two candidate models, weighted by these probabilities.
In Figure 2, I compare the expected payoffs to the Proposer under different splits, when the true parameters for the Responder's utility function put most of the weight on the fairness component. The three expected payoff curves are:
We can see from the blue curve that when there's sufficient prior mass on the wrong model (the iterated-play model), the Proposer will propose a split that's too small, resulting in a rejection. This basically corresponds to a situation where the Proposer thinks that the Responder rejects unfair splits in order to establish a reputation for rejecting unfair splits, rather than rejecting unfair splits because of a commitment not to accept unfair splits. And although I've tilted the scales in favor of a bad outcome by choosing a prior that gives a lot of weight to an incorrect model, keep in mind that this is what the posterior expectation will be given any amount of data from this generative model. We can often count on data to correct our agents' beliefs, but this is not the case (by definition) when the relevant model is unidentifiable.
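Here is a rough reconstruction of this example in code. The functional form, the unfairness threshold of 0.5, and the specific parameter values and priors below are my own assumptions chosen to match the verbal description (only the sum of the fairness parameter alpha and the iterated-play parameter beta is identifiable from the data, but the two parameters have different implications when the Proposer faces the Responder in a one-shot interaction); they are not the values behind the original figure.

```python
import numpy as np

FAIR = 0.5  # offers below this fraction are treated as unfair (assumed threshold)

def accepts_in_data(s, alpha, beta):
    """Assumed model of the observed (repeated-setting) data: an unfair offer s < FAIR
    is rejected unless s >= alpha + beta, so only the sum alpha + beta is identifiable."""
    return s >= FAIR or s >= alpha + beta

def accepts_one_shot(s, alpha):
    """Assumed behavior when facing the Responder once, with no future rounds: the
    iterated-play motive (beta) drops out and only the fairness motive (alpha) remains."""
    return s >= FAIR or s >= alpha

# Two candidate models with the same alpha + beta = 0.45. They fit any amount of
# ultimatum-game data equally well, but predict different one-shot behavior.
m_fair = {"alpha": 0.40, "beta": 0.05}   # mostly a fairness commitment (the truth here)
m_iter = {"alpha": 0.05, "beta": 0.40}   # mostly reputation-building for future rounds

assert all(accepts_in_data(s, **m_fair) == accepts_in_data(s, **m_iter)
           for s in np.linspace(0, 1, 101))  # identical likelihoods: unidentifiable

def proposer_expected_payoff(s, prior_fair):
    """Posterior-expected payoff of offering s, mixing over the two candidate models."""
    return (1 - s) * (prior_fair * accepts_one_shot(s, m_fair["alpha"])
                      + (1 - prior_fair) * accepts_one_shot(s, m_iter["alpha"]))

offers = np.round(np.linspace(0.0, 0.5, 51), 3)
for prior_fair in (0.9, 0.2):
    best = max(offers, key=lambda s: proposer_expected_payoff(s, prior_fair))
    accepted = accepts_one_shot(best, m_fair["alpha"])  # the true Responder is m_fair
    print(f"prior on fairness model = {prior_fair}: best offer = {best}, "
          f"{'accepted' if accepted else 'rejected'}")
```

With enough prior mass on the iterated-play model, the offer that maximizes the Proposer's posterior-expected payoff falls below the Responder's true acceptance threshold and gets rejected, and no amount of data from the original setting would have corrected the prior.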
Cristina Bicchieri and Jiji Zhang. An embarrassment of riches: Modeling social preferences in ultimatum games. Handbook of the Philosophy of Science, 13:577–95, 2012.
Edward H Hagen and Peter Hammerstein. Game theory and human evolution: A critique of some recent interpretations of experimental games. Theoretical Population Biology, 69(3):339–348, 2006.
The post Birds, Brains, Planes, and AI: Against Appeals to the Complexity / Mysteriousness / Efficiency of the Brain appeared first on Center on Long-Term Risk.
I argue that an entire class of common arguments against short timelines is bogus, and provide weak evidence that anchoring to the human-brain-human-lifetime milestone is reasonable.
In a sentence, my argument is that the complexity and mysteriousness and efficiency of the human brain (compared to artificial neural nets) is almost zero evidence that building TAI will be difficult, because evolution typically makes things complex and mysterious and efficient, even when there are simple, easily understood, inefficient designs that work almost as well (or even better!) for human purposes.
In slogan form: If all we had to do to get TAI was make a simple neural net 10x the size of my brain, my brain would still look the way it does.
The case of birds & planes illustrates this point nicely. Moreover, it is also a precedent for several other short-timelines talking points, such as the human-brain-human-lifetime (HBHL) anchor.
1909 French military plane, the Antoinette VII.
By Deep silence (Mikaël Restoux) - Own work (Bourget museum, in France), CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1615429
| AI timelines, from our current perspective | Flying machine timelines, from the perspective of the late 1800’s |
| --- | --- |
| Shorty: Human brains are giant neural nets. This is reason to think we can make human-level AGI (or at least AI with strategically relevant skills, like politics and science) by making giant neural nets. | Shorty: Birds are winged creatures that paddle through the air. This is reason to think we can make winged machines that paddle through the air. |
| Longs: Whoa whoa, there are loads of important differences between brains and artificial neural nets: [what follows is a direct quote from the objection a friend raised when reading an early draft of this post!] | Longs: Whoa whoa, there are loads of important differences between birds and flying machines: |
| Shorty: The key variables seem to be size and training time. Current neural nets are tiny; the biggest one is only one-thousandth the size of the human brain. But they are rapidly getting bigger. Once we have enough compute to train neural nets as big as the human brain for as long as a human lifetime (HBHL), it should in principle be possible for us to build HLAGI. No doubt there will be lots of details to work out, of course. But that shouldn’t take more than a few years. | Shorty: Once the power-to-weight ratio of our motors surpasses the power-to-weight ratio of bird muscles, it should be in principle possible for us to build a flying machine. No doubt there will be lots of details to work out, of course. But that shouldn’t take more than a few years. |
| Longs: Bah! I don’t think we know what the key variables are. For example, biological brains seem to be able to learn faster, with less data, than artificial neural nets. And we don’t know why. Besides, “there will be lots of details to work out” is a huge understatement. It took evolution billions of generations of billions of individuals to produce humans. What makes you think we’ll be able to do it quickly? It’s plausible that actually we’ll have to do it the way evolution did it, i.e. meta-learn, i.e. evolve a large population of HBHLs, over many generations. (Or, similarly, train a neural net with a big batch size and a horizon length of a lifetime). And even if you think we’ll be able to do it substantially quicker than evolution did, it’s pretty presumptuous to think we could do it quickly enough that the HBHL milestone is relevant for forecasting. | Longs: Bah! I don’t think we know what the key variables are. For example, birds seem to be able to soar long distances without flapping their wings at all, and we still haven’t figured out how they do it. Another example: We still don’t know how birds manage to steer through the air without crashing (flight stability & control). Besides, “there will be lots of details to work out” is a huge understatement. It took evolution billions of generations of billions of individuals to produce birds. What makes you think we’ll be able to do it quickly? It’s plausible that actually we’ll have to do it the way evolution did it, i.e. meta-design, i.e. evolve a large population of flying machines, tweaking our blueprints each generation of crashed machines to grope towards better designs. And even if you think we’ll be able to do it substantially quicker than evolution did, it’s pretty presumptuous to think we could do it quickly enough that the date our engines achieve power/weight parity with bird muscle is relevant for forecasting. |
This data shows that Shorty was entirely correct about forecasting heavier-than-air flight. (For details about the data, see appendix.) Whether Shorty will also be correct about forecasting TAI remains to be seen.
In some sense, Shorty has already made two successful predictions: I started writing this argument before having any of this data; I just had an intuition that power-to-weight is the key variable for flight and that therefore we probably got flying machines shortly after achieving power-to-weight comparable to bird muscle. Halfway through the first draft, I googled and confirmed that yes, the Wright Flyer’s motor was close to bird muscle in power-to-weight. Then, while writing the second draft, I hired an RA, Amogh Nanjajjar, to collect more data and build this graph. As expected, there was a trend of power-to-weight improving over time, with flight happening right around the time bird-muscle parity was reached.
I had previously heard from a friend, who read a book about the invention of flight, that the Wright brothers were the first because they (a) studied birds and learned some insights from them, and (b) did a bunch of trial and error, rapid iteration, etc. (e.g. in wind tunnels). The story I heard was all about the importance of insight and experimentation--but this graph seems to show that the key constraint was engine power-to-weight. Insight and experimentation were important for determining who invented flight, but not for determining which decade flight was invented in.
One way in which compute can substitute for insights/algorithms/architectures/ideas is that you can use compute to search for them. But there is a different and arguably more important way in which compute can substitute for insights/etc.: Scaling up the key variables, so that the problem becomes easier, so that fewer insights/etc. are needed.
For example, with flight, the problem becomes easier the more power/weight ratio your motors have. Even if the Wright brothers didn’t exist and nobody else had their insights, eventually we would have achieved powered flight anyway, because when our engines are 100x more powerful for the same weight, we can use extremely simple, inefficient designs. (For example, imagine a u-shaped craft with a low center of gravity and helicopter-style rotors on each tip. Add a third, smaller propeller on a turret somewhere for steering.)
With neural nets, we have plenty of evidence now that bigger = better, with theory to back it up. Suppose the problem of making human-level AGI with HBHL levels of compute is really difficult. OK, 10x the parameter count and 10x the training time and try again. Still too hard? Repeat.
Note that I’m not saying that if you take a particular design that doesn’t work, and make it bigger, it’ll start working. (If you took Da Vinci’s flying machine and made the engine 100x more powerful, it would not work). Rather, I’m saying that the problem of finding a design that works gets qualitatively easier the more parameters and training time you have to work with.
Finally, remember that human-level AGI is not the only kind of TAI. Sufficiently powerful R&D tools would work, as would sufficiently powerful persuasion tools, as might something that is agenty and inferior to humans in some ways but vastly superior in others.
Suppose that actually all we have to do to get TAI is something fairly simple and obvious, but with a neural net 10x the size of my (actual) brain and trained for 10x longer. In this world, does the human brain look any different than it does in the actual world?
No. Here is a nonexhaustive list of reasons why evolution would evolve human brains to look like they do, with all their complexity and mysteriousness and efficiency, even if the same capability levels could be reached with 10x more neurons and a very simple architecture. Feel free to skip ahead if you think this is obvious.
The general pattern of argument I think is bogus is:
The brain has property X, which seems to be important to how it functions. We don’t know how to make AI’s with property X. It took evolution a long time to make brains have property X. This is reason to think TAI is not near.
As argued above, if TAI is near, there should still be many X which are important to how the brain functions, which we don’t know how to reproduce in AI, and which it took evolution a long time to produce. So rattling off a bunch of X’s is basically zero evidence against TAI being near.
Put differently, here are two objections any particular argument of this type needs to overcome: first, that X might not actually be necessary for TAI; and second, that X might be something we figure out fairly quickly once the key variables of size and training time are reached.
This reveals how the arguments could be reformulated to become non-bogus! They need to argue (a) that X is probably necessary for TAI, and (b) that X isn’t something that we’ll figure out fairly quickly once the key variables of size and training time are surpassed.
In some cases there are decent arguments to be made for both (a) and (b). I think efficiency is one of them, so I’ll use that as my example below.
Let’s work through the example of data-efficiency. A bad version of this argument would be:
Humans are much more data-efficient learners than current AI systems. Data-efficiency is very important; any human who learned as inefficiently as current AI would basically be mentally disabled. This is reason to think TAI is not near.
The rebuttal to this bad argument is:
If birds were as energy-inefficient as planes, they’d be disabled too, and would probably die quickly. Yet planes work fine. (See Table 1 from this AI Impacts page) Even if TAI is near, there are going to be lots of X’s that are important for the brain, that we don’t know how to make in AI yet, but that are either unnecessary for TAI or not too difficult to get once we have the other key variables. So even if TAI is near, I should expect to hear people going around pointing out various X’s and claiming that this is reason to think TAI is far away. You haven’t done anything to convince me that this isn’t what’s happening with X = data-efficiency.
However, I do think the argument can be reformulated and expanded to become good. Here’s a sketch, inspired by Ajeya Cotra’s argument here.
We probably can’t get TAI without figuring out how to make AIs that are as data-efficient as humans. It’s true that there are some useful tasks for which there is plenty of data--like call center work, or driving trucks--but AIs that can do these tasks won’t be transformative. Transformative AI will be doing things like managing corporations, leading armies, designing new chips, and writing AI theory publications. Insofar as AI learns more slowly than humans, by the time it accumulates enough experience doing one of these tasks, (a) the world would have changed enough that its skills would be obsolete, and/or (b) it would have made a lot of expensive mistakes in the meantime.
Moreover, we probably won’t figure out how to make AIs that are as data-efficient as humans for a long time--decades at least. This is because (1) we’ve been trying to figure this out for decades and haven’t succeeded, and (2) having a few orders of magnitude more compute won’t help much. Now, to justify point #2: Neural nets actually do get more data-efficient as they get bigger, but we can plot the trend and see that they will still be less data-efficient than humans when they are a few orders of magnitude bigger. So making them bigger won’t be enough; we’ll need new architectures/algorithms/etc. As for using compute to search for architectures/etc., that might work, but given how long evolution took, we should think it’s unlikely that we could do this with only a few orders of magnitude of searching—probably we’d need to do many generations of large population size. (We could also think of this search process as analogous to typical deep learning training runs, in which case we should expect it’ll take many gradient updates with large batch size.) Anyhow, there’s no reason to think that data-efficient learning is something you need to be human-brain-sized to do. If we can’t make our tiny AIs learn efficiently after several decades of trying, we shouldn’t be able to make big AIs learn efficiently after just one more decade of trying.
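To make the shape of this extrapolation argument concrete, here is a minimal sketch of the kind of trend-fitting it appeals to. All numbers below (model sizes, sample counts, the `human_like_budget` threshold) are made-up placeholders rather than measurements from any real benchmark; only the structure matters: fit a power law to data-efficiency versus scale, then check whether a few more orders of magnitude close the gap to human-level efficiency.

```python
import numpy as np

# Illustrative sketch only: these figures are placeholders, not real measurements.
params = np.array([1e8, 1e9, 1e10, 1e11])            # hypothetical model sizes
samples_needed = np.array([1e9, 4e8, 1.6e8, 6.4e7])  # hypothetical samples to reach a fixed skill level

# Fit a power law: log10(samples) = a + b * log10(params)
b, a = np.polyfit(np.log10(params), np.log10(samples_needed), 1)

def projected_samples(n_params):
    """Extrapolate the fitted trend to a larger model size."""
    return 10 ** (a + b * np.log10(n_params))

human_like_budget = 1e6  # hypothetical stand-in for a "human-level" sample budget
for scale in [1e12, 1e13, 1e14]:  # a few more orders of magnitude of size
    print(f"{scale:.0e} params -> ~{projected_samples(scale):.1e} samples "
          f"(human-ish budget: {human_like_budget:.0e})")
```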
I think this is a good argument. Do I buy it? Not yet. For one thing, I haven’t verified whether the claims it makes are true, I just made them up as plausible claims which would be persuasive to me if true. For another, some of the claims actually seem false to me. Finally, I suspect that in 1895 someone could have made a similarly plausible argument about energy efficiency, and another similarly plausible argument about flight control, and both arguments would have been wrong: Energy efficiency turned out to be insufficiently necessary, and flight control turned out to be insufficiently difficult!
What I am not saying: I am not saying that the case of birds and planes is strong evidence that TAI will happen once we hit the HBHL milestone. I do think it is evidence, but it is weak evidence. (For my all-things-considered view of how many orders of magnitude of compute it’ll take to get TAI, see future posts, or ask me.) I would like to see a more thorough investigation of cases in which humans attempt to design something that has an obvious biological analogue. It would be interesting to see if the case of flight was typical. Flight being typical would be strong evidence for short timelines, I think.
What I am saying: I am saying that many common anti-short-timelines arguments are bogus. They need to do much more than just appeal to the complexity/mysteriousness/efficiency of the brain; they need to argue that some property X is both necessary for TAI and not about to be figured out for AI anytime soon, not even after the HBHL milestone is passed by several orders of magnitude.
Why this matters: In my opinion the biggest source of uncertainty about AI timelines has to do with how much “special sauce” is necessary for making transformative AI. As jylin04 puts it,
A first and frequently debated crux is whether we can get to TAI from end-to-end training of models specified by relatively few bits of information at initialization, such as neural networks initialized with random weights. OpenAI in particular seems to take the affirmative view[^3], while people in academia, especially those with more of a neuroscience / cognitive science background, seem to think instead that we'll have to hard-code in lots of inductive biases from neuroscience to get to AGI [^4].
In my words: Evolution clearly put lots of special sauce into humans, and took millions of generations of millions of individuals to do so. How much special sauce will we need to get TAI?
Shorty is one end of a spectrum of disagreement on this question. Shorty thinks the amount of special sauce required is small enough that we’ll “work out the details” within a few years of having the key variables (size and training time). At the other end of the spectrum would be someone who thought that the amount of special sauce required is similar to the amount found in the brain. Longs is in the middle. Longs thinks the amount of special sauce required is large enough that the HBHL milestone isn’t particularly relevant to timelines; we’ll either have to brute-force search for the special sauce like evolution did, or have some brilliant new insights, or mimic the brain, etc.
This post rebutted common arguments against Shorty’s position. It also presented weak evidence in favor of Shorty’s position: the precedent of birds and planes. In future posts I’ll say more about what I think the probability distribution over amount-of-special-sauce-needed should be and why.
Acknowledgements: Thanks to my RA, Amogh Nanjajjar, for compiling the data and building the graph. Thanks to Kaj Sotala, Max Daniel, Lukas Gloor, and Carl Shulman for comments on drafts.
Some footnotes:
Some bookkeeping details about the data:
The post Birds, Brains, Planes, and AI: Against Appeals to the Complexity / Mysteriousness / Efficiency of the Brain appeared first on Center on Long-Term Risk.
The post Against GDP as a metric for AI timelines and takeoff speeds appeared first on Center on Long-Term Risk.
I think world GDP (and economic growth more generally) is overrated as a metric for AI timelines and takeoff speeds.
Here are some uses of GDP that I disagree with, or at least think should be accompanied by cautionary notes:
First, I’ll argue that GWP is only tenuously and noisily connected to what we care about when forecasting AI timelines. Specifically, the point of no return is what we care about, and there’s a good chance it’ll come years before GWP starts to increase. It could also come years after, or anything in between.
Then, I’ll argue that GWP is a poor proxy for what we care about when thinking about AI takeoff speeds as well. This follows from the previous argument about how the point of no return may come before GWP starts to accelerate. Even if we bracket that point, however, there are plausible scenarios in which a slow takeoff has fast GWP growth and in which a fast takeoff has slow GWP growth.
I’ve previously argued that for AI timelines, what we care about is the “point of no return,” the day we lose most of our ability to reduce AI risk. This could be the day advanced unaligned AI builds swarms of nanobots, but probably it’ll be much earlier, e.g. the day it is deployed, or the day it finishes training, or even years before then when things go off the rails due to less advanced AI systems. (Of course, it probably won’t literally be a day; probably it will be an extended period where we gradually lose influence over the future.)
Now, I’ll argue that in particular, an AI-induced PONR is reasonably likely to come before world GDP starts to grow noticeably faster than usual.
Disclaimer: These arguments aren’t conclusive; we shouldn’t be confident that the PONR will precede GWP acceleration. It’s entirely possible that the PONR will indeed come when GDP starts to grow noticeably faster than usual, or even years after that. (In other words, I agree that the scenarios Paul and others sketch are also plausible.) This just proves my point though: GDP is only tenuously and noisily connected to what we care about.
GWP acceleration is the effect, not the cause, of advances in AI capabilities. I grant that GWP could in principle accelerate first and thereby help cause those advances, but I think this is very unlikely: what else besides AI could accelerate GWP? Space mining? Fusion power? 3D printing? Even if these things could in principle kick the world economy into faster growth, it seems unlikely that this would happen in the next twenty years or so. Robotics, automation, etc. plausibly might make the economy grow faster, but if so it will be because of AI advances in vision, motor control, following natural language instructions, etc. So I conclude: GWP growth will come some time after we get certain GWP-growing AI capabilities.
(Tangent: This is one reason why we shouldn’t use GDP extrapolations to predict AI timelines. It’s like extrapolating global mean temperature trends into the future in order to predict fossil fuel consumption.)
An AI-induced point of no return would also be the effect of advances in AI capabilities. So, as AI capabilities advance, which will come first: The capabilities that cause a PONR, or the capabilities that cause GWP to accelerate? How much sooner will one arrive than the other? How long does it take for a PONR to arise after the relevant capabilities are reached, compared to how long it takes for GWP to accelerate after the relevant capabilities are reached?
Notice that already my overall conclusion—that GWP is a poor proxy for what we care about—should seem plausible. If some set of AI capabilities causes GWP to grow after some time lag, and some other set of AI capabilities causes a PONR after some time lag, the burden of proof is on whoever wants to claim that GWP growth and the PONR will probably come together. They’d need to argue that the two sets of capabilities are tightly related and that the corresponding time lags are similar also. In other words, variance and uncertainty are on my side.
Here is a brainstorm of scenarios in which an AI-induced PONR happens prior to GWP growth, either because GWP-growing capabilities haven’t been invented yet or because they haven’t been deployed long and widely enough to grow GWP.
The point is, there’s more than one scenario. This makes it more likely that at least one of these potential PONRs will happen before GWP accelerates.
As an aside, over the past two years I’ve come to believe that there’s a lot of conceptual space to explore that isn’t captured by the standard scenarios (what Paul Christiano calls fast and slow takeoff, plus maybe the CAIS scenario, and of course the classic sci-fi “no takeoff” scenario). This brainstorm did a bit of exploring, and the section on takeoff speeds will do a little more.
In the previous section, I sketched some possibilities for how an AI-related point of no return could come before AI starts to noticeably grow world GDP. In this section, I’ll point to some historical examples that give precedents for this sort of thing.
Earlier I said that a godlike advantage is not necessary for takeover; you can scale up with a smaller advantage instead. And I said that in military conquests this can happen surprisingly quickly, sometimes faster than it takes for a superior product to take over a market. Is there historical precedent for this?
Yes. See my aforementioned post on the conquistadors (and maybe these somewhat-relevant posts).
OK, so what was happening to world GDP during this period?
Here is the history of world GDP for the past ten thousand years, on the red line. (This is taken from David Roodman’s GWP model) The black line that continues the red line is the model’s median projection for what happens next; the splay of grey shades represent 5% increments of probability mass for different possible future trajectories.
I’ve added a bunch of stuff for context. The vertical green lines are some dates, chosen because they were easy for me to calculate with my ruler. The tiny horizontal green lines on the right are the corresponding GWP levels. The tiny red horizontal line is GWP 1,000 years before 2047. The short vertical blue line is when the economy is growing fast enough, on the median projected future, such that insofar as AI is driving the growth, said AI qualifies as transformative. See this post for more explanation of the blue lines.
What I wish to point out with this graph is: We’ve all heard the story of how European empires had a technological advantage which enabled them to conquer most of the world. Well, most of that conquering happened before GWP started to accelerate!
If you look at the graph at the 1700 mark, GWP is seemingly on the same trend it had been on since antiquity. The industrial revolution is said to have started in 1760, and GWP growth really started to pick up steam around 1850. But by 1700 most of the Americas, the Philippines and the East Indies were directly ruled by European powers, and more importantly the oceans of the world were European-dominated, including by various ports and harbor forts European powers had conquered/built all along the coasts of Africa and Asia. Many of the coastal kingdoms in Africa and Asia that weren’t directly ruled by European powers were nevertheless indirectly controlled or otherwise pushed around by them. In my opinion, by this point it seems like the “point of no return” had been passed, so to speak: At some point in the past--maybe 1000 AD, for example--it was unclear whether, say, Western or Eastern (or neither) culture/values/people would come to dominate the world, but by 1700 it was pretty clear, and there wasn’t much that non-westerners could do to change that. (Or at least, changing that in 1700 would have been a lot harder than in 1000 or 1500.)
Paul Christiano once said that he thinks of Slow Takeoff as “Like the Industrial Revolution, but 10x-100x faster.” Well, on my reading of history, that means that all sorts of crazy things will be happening, analogous to the colonialist conquests and their accompanying reshaping of the world economy, before GWP growth even begins to accelerate!
That said, we shouldn’t rely heavily on historical analogies like this. We can probably find other cases that seem analogous too, perhaps even more so, since this is far from a perfect analogue. (e.g. what’s the historical analogue of AI alignment failure? Corporations becoming more powerful than governments? “Western values” being corrupted and changing significantly due to the new technology? The American Revolution?) Also, maybe one could argue that this is indeed what’s happening already: the Internet has connected the world much as sailing ships did, Big Tech dominates the Internet, etc. (Maybe AI = steam engines, and computers+internet = ships+navigation?)
But still. I think it’s fair to conclude that if some of the scenarios described in the previous section do happen, and we get powerful AI that pushes us past the point of no return prior to GWP accelerating, it won’t be totally inconsistent with how things have gone historically.
(I recommend the history book 1493, it has a lot of extremely interesting information about how quickly and dramatically the world economy was reshaped by colonialism and the “Columbian Exchange.”)
What about takeoff speeds? Maybe GDP is a good metric for describing the speed of AI takeoff? I don’t think so.
Here is what I think we care about when it comes to takeoff speeds:
I think that the best way to define slow(er) takeoff is as the extent to which conditions 1-5 are met. This is not a definition with precise resolution criteria, but that’s OK, because it captures what we care about. Better to have to work hard to precisify a definition that captures what we care about, than to easily precisify a definition that doesn’t! (More substantively, I am optimistic that we can come up with better proxies for what we care about than GWP. I think we already have to some extent; see e.g. operationalizations 5 and 6 here.) As a bonus, this definition also encourages us to wonder whether we’ll get some of 1-5 but not others.
The crucial question is, what do we mean by “the crucial period?”
I think we should define the crucial period as the period leading up to the first major AI-induced potential point of no return. (Or maybe, as the aggregate of the periods leading up to the major potential points of no return). After all, this is what we care about. Moreover there seems to be some level of consensus that crazy stuff could start happening before human-level AGI. I certainly think this.
So, I’ve argued for a new definition of slow takeoff, that better captures what we care about. But is the old GWP-based definition a fine proxy? No, it is not, because the things that cause PONR can be different from the things which cause GWP acceleration, and they can come years apart too. Whether there are warning shots, heterogeneity, risk awareness, multipolarity, and craziness in the period leading up to PONR is probably correlated with whether GWP doubles in four years before the first one-year doubling. But the correlation is probably not super strong. Here are two scenarios, one in which we get a slow takeoff by my definition but not by the GWP-based definition, and one in which the opposite happens:
Slow Takeoff Fast GWP Acceleration Scenario: It turns out there’s a multi-year deployment lag between the time a technology is first demonstrated and the time it is sufficiently deployed around the world to noticeably affect GWP. There’s also a lag between when a deceptively aligned AGI is created and when it causes a PONR… but it is much smaller, because all the AGI needs to do is neutralize its opposition. So PONR happens before GWP starts to accelerate, even though the technologies that could boost GWP are invented several years before AGI powerful enough to cause a PONR is created. But takeoff is slow in the sense I define it; by the time AGI powerful enough to cause a PONR is created, everyone is already freaking out about AI thanks to all the incredibly profitable applications of weaker AI systems, and the obvious and accelerating trends of research progress. Also, there are plenty of warning shots, the strategic situation is very multipolar and heterogeneous, etc. Moreover, research progress starts to go FOOM a short while after powerful AGIs are created, such that by the time the robots and self-driving cars and whatnot that were invented several years ago actually get deployed enough to accelerate GWP, we’ve got nanobot swarms. GWP goes from 3% growth per year to 300% without stopping at 30%.
Fast Takeoff Slow GWP Acceleration Scenario: It turns out you can make smarter AIs by making them have more parameters and training them for longer. So the government decides to partner with a leading tech company and requisition all the major computing centers in the country. With this massive amount of compute and research talent, they refine and scale up existing AI designs that seem promising, and lo! A human-level AGI is created. Alas, it is so huge that it costs $10,000 per hour of subjective thought. Moreover, it has a different distribution over skills compared to humans—it tends to be more rational, not having evolved in an environment that rewards irrationality. It tends to be worse at object recognition and manipulation, but better at poetry, science, and predicting human behavior. It has some flaws and weak points too, more so than humans. Anyhow, unfortunately, it is clever enough to neutralize its opposition. In a short time, the PONR is passed. However, GWP doubles in four years before it doubles in one year. This is because (a) this AGI is so expensive that it doesn’t transform the economy much until either the cost comes way down or capabilities go way up, and (b) progress is slowed by bottlenecks, such as acquiring more compute and overcoming various restrictions placed on the AGI. (Maybe neutralizing the opposition involved convincing the government that certain restrictions and safeguards would be sufficient for safety, contra the hysterical doomsaying of parts of the AI safety community. But overcoming those restrictions in order to do big things in the world takes time.)
Acknowledgments:
Thanks to the people who gave comments on earlier drafts, including Katja Grace, Carl Shulman, and Max Daniel. Thanks to Amogh Nanjajjar for helping me with some literature review.
The post Against GDP as a metric for AI timelines and takeoff speeds appeared first on Center on Long-Term Risk.
The post Incentivizing forecasting via social media appeared first on Center on Long-Term Risk.
Full article: EA Forum
The post Incentivizing forecasting via social media appeared first on Center on Long-Term Risk.
The post Plans for 2021 & Review of 2020 appeared first on Center on Long-Term Risk.
Plans for 2021
Review of 2020
We are building a global community of researchers and professionals working on reducing risks of astronomical suffering (s-risks). (Read more about us here.)
Earlier this year, we consolidated the activities related to s-risks from the Effective Altruism Foundation and the Foundational Research Institute under one name: the Center on Long-Term Risk (CLR). We have been based in London since late 2019. Our team is currently about 10 full-time equivalents strong, with most of our employees full time.
At the end of last year, we published a research agenda on this topic. After significant progress in 2020 (see Review section), work in this area will continue to be our main priority in 2021.
We plan to further refine our prioritization between different research directions and intervention types within this broad area. Interventions differ across a multitude of dimensions. For instance, some are multilateral in that they require technical solutions to be implemented by multiple actors, whereas others are unilateral. Some interventions primarily address acausal conflict; others causal ones. We want to better prioritize between these dimensions. This will often require object-level work, e.g., to learn more about the tractability of a given intervention-type.
We plan to build a field around bargaining in artificial learners (see the related sections 3-6 of our research agenda) using mainly tools from game theory and multi-agent reinforcement learning (MARL). We want to draw both from the relevant machine learning sub-community and the longtermist effective altruism community. Through our research this year (see below), we now have a good understanding of what work we consider valuable in this field. We plan to publish original research explaining foundational technical problems in this area, finish a repository of tools for easily running experiments, and make grants to encourage others to do similar work. We plan to publish a post on this forum explaining the reasoning behind our focus on this area.
We plan to take initial steps in the field of AI governance related to cooperation & conflict involving AI systems. Following our analysis of problems in multipolar deployment scenarios, we plan to publish a post outlining the governance challenges associated with addressing these problems.
We first wrote about this cause in early 2020 in an EA Forum post. Since then, we have completed additional work internally, parts of which we plan to publish next year.
We plan to assess how important this area is relative to our other work because this is a new cause area, and we are still uncertain how it compares to our existing priorities. We will do this by learning more about the relevant scientific fields, technologies, and policy levers. We will also conduct or support technical work on how preferences to create suffering could arise in TAI systems. We plan to publish a post introducing this idea. We might make some targeted grants to experts who could help us improve our understanding of this area.
Because work on s-risks is still in its infancy, it could be valuable to explore entirely new areas. This will not be a systematic effort in 2021. Individual researchers will investigate new areas if they find them sufficiently promising. Current contenders include (among other things): political polarization (or at least specific manifestations of it) and collective epistemic breakdown (e.g., as a result of increasingly powerful persuasion tools).
Research will remain CLR’s focus in 2021 because there remain many open questions about s-risks and how to address them. Through our efforts this year, we have also placed ourselves in a good position to scale up our research efforts (see “Review of 2020” below).
We will grow our grantmaking efforts in 2021. We will focus increasingly on proactive grants following investigations of specific fields. We have found general application rounds not to be very valuable so far.
We will continue our routine community-building activities in 2021 while running tests of more efficient ways of getting people up to speed on our thinking. This work has been important for cultivating hires at CLR. We expect to invest about as many resources into this as in 2020.
We are still uncertain what we will do to disseminate our research and advocate for our priorities. First, we plan to review several key decisions that have influenced our past efforts. For instance, we will evaluate the effects of the communication guidelines written in collaboration with Nick Beckstead from the Open Philanthropy Project. (For more details on these guidelines, see this section of our review from last year.) We had originally planned to do so at the end of this year but postponed it by a few months. Second, the development of the COVID-19 pandemic will determine whether we can run in-person events and travel to important EA hubs like Oxford and the San Francisco Bay Area. In any case, we expect to continue to give talks and to share our work through targeted channels.
We will continue exploring the possibility of high-leverage projects that could enable many more people to work in our priority areas.
We plan to improve how we evaluate our work and impact. Currently, we only do systematic annual reviews of our activities internally. We plan to elicit feedback from outside experts to assess the quality and impact of our work. We are considering survey work, in-depth assessment of specific research output, and qualitative interviews.
Last year, we wrote that the most appropriate way to review our work each year would be to answer “a set of deliberately vague performance questions” (inspired by GiveWell’s self-evaluation questions). We put these questions to our team and used their input to write the overall assessment below. We plan to improve this procedure further next year.
This year was a year of transition for CLR, both in terms of staff changes and building out new research directions in malevolence and bargaining. Our successes consisted mostly of building long-term capacity and making internal research progress, rather than public research dissemination. The work we have done this year has laid the groundwork for more public research to be released in 2021 (see above).
Have we made progress towards becoming a research group and community that will have an outsized impact on the research landscape and relevant actors shaping the future? (This question tracks whether we are building the right long-term capacity to produce excellent research and making it applicable to the real world. It also includes whether we are focusing on the correct fields, questions, and activities to begin with.)
We have increased our capacity substantially across most functions of the organization.
We hired six people for our research team: Alex Lyzhov, Emery Cooper, Daniel Kokotajlo, and Julian Stastny as full-time research staff; Maxime Riché as a research engineer; Jia Yuan Loke as a part-time research assistant. Another offer is still pending.
With the CLR Fund, we made three grants designed to help junior researchers skill up. The recipients were Anthony DiGiovanni, Rory Svarc, and Johannes Treutlein.
Much of our research this year constitutes capacity-building. It opened up a lot of opportunities for further study, grantmaking, and strategy progress. For instance, the post on Reducing long-term risks from malevolent actors created a novel cause area for CLR and others in the community. This has already led to internal research progress, some of which we will publicize early next year. Another example is our work on an internal research repository of tools for our machine learning research that will facilitate future work in this area.
In 2020, we completed three shallow investigations related to our grantmaking efforts: moral circle expansion, malevolence, and technical research at the intersection of machine learning and bargaining. We are actively pursuing grant opportunities in the last area.
We ran a three-month summer research fellowship for the first time. We received 67 applications and made 11 offers, all of which were accepted. Two of them will do their fellowship in 2021 instead of this year. We were able to make at least four hires and two grants as a direct result, which we think is a good indication of the program’s success. We are still conducting a more rigorous evaluation of the program focusing on the experience of the fellows and how the program benefitted them. The experience we gained this year will make it easier to rerun an improved program with fewer resources.
The only function where our capacity shrank is operations. Our COO, Alfredo Parra, and Daniel Kestenholz, part-time operations analyst, left. Their responsibilities were taken over by Stefan Torges and Amrit Sidhu-Brar, who joined our team earlier this year in a part-time capacity. This has not been enough to compensate for Alfredo’s and Daniel’s departures, so we decided to bring on Jia Yuan Loke, who will start in early 2021 (splitting his work between operations and research). At that point, we expect to be at a capacity level similar to that at the beginning of 2020.
Has our work resulted in research progress that helps reduce s-risks (both in-house and elsewhere)?
A major theme of our work this year has been that risks of bargaining failure might be reduced via coordination by AI developers on certain aspects of their systems, e.g., to address prior and equilibrium selection problems. This suggests potential interventions in both AI governance and technical AI safety, some of which we plan to write on publicly in the first half of 2021 (see above).
Our work on bargaining failure has also led us to scale up our efforts at the intersection of game theory and multi-agent reinforcement learning (e.g., here, here). We have identified this as a promising avenue for increasing awareness of technical hurdles for successful cooperation among AI systems and constructing candidate technical solutions to some of these problems. Our ongoing work includes building a repository of algorithms, environments, and other tools to facilitate machine learning research in multi-agent environments. This repository better captures the kinds of cooperation problems we are interested in than the environments currently studied in the literature and allows for better evaluation of multi-agent machine learning methods.
Beginning with our post on reducing long-term risks from malevolent actors, we have been investigating possible pathways to s-risks from both malevolent humans and analogous phenomena in AI systems. This includes an ongoing investigation of possible grantmaking to reduce the influence of malevolent humans and a post introducing the risk of preferences to create suffering arising in TAI systems.
Grantees of the CLR Fund also published research over the course of 2020. Kaj Sotala expanded his sequence on multi-agent models of mind. Arif Ahmed published two articles on evidential decision theory in the journal Mind. The Wild Animal Initiative published a post on long-term design considerations of wild animal welfare interventions.
Have we communicated our research to our target audience, and has the target audience engaged with our ideas?
The main effort to disseminate our work was a series of talks at various EA and AI safety organizations in the second half of this year: 80,000 Hours, CHAI, CSER, FHI, GPI, OpenAI, and the Open Philanthropy Project. We did not give our planned talk at EAG San Francisco because that conference was canceled.
Contrary to our plans for this year, we did not run any research workshops because of the COVID-19 pandemic. We decided against hosting any virtual ones because we lacked capacity and did not consider the reduced value from a virtual event worth the effort.
Are we a healthy organization with an effective board, staff in appropriate roles, appropriate evaluation of our work, reliable policies and procedures, adequate financial reserves and reporting, high morale, and so forth?
It is our impression that the people on our team are in the appropriate roles. We are currently trialing a new person as our Director after Jonas Vollmer left CLR in June. We will complete the evaluation of their fit soon.
We believe that most of our policies and procedures are sound. However, many people joined our team this year. This requires us to be more explicit about some policies than we have been in the past, e.g., compensation policy, team retreat participation. We are addressing these issues as they come up, which has worked well so far.
Our financial reserves decreased significantly this year, which we are trying to address with our December fundraiser (see “Financials” below). We are glad that CERR (see above) committed to contribute roughly their “fair share” to CLR. However, this is not enough to cover all of our expenses. (See below for more information on our financial situation.)
The post Plans for 2021 & Review of 2020 appeared first on Center on Long-Term Risk.
The post S-Risk Intro Fellowship appeared first on Center on Long-Term Risk.
Applications have now closed for our next Fellowship, taking place from January-February 2023. To be notified when we next run an Intro Fellowship, please sign up to our mailing list in the footer of this page.
The fellowship is six weeks long and involves a time commitment of about 3-5 hours per week. It covers what we currently consider to be the most important sources of s-risk (TAI conflict, risks from malevolent actors).
Fellowship participants are divided into small cohorts. Each week covers a new topic. Participants explore relevant background materials in their own time, and then have the opportunity to discuss the topic with each other and with CLR staff during a one-hour Zoom meeting. For the final week, each cohort chooses from a list of preselected topics to learn about, giving participants the ability to tailor the material in a way that’s most useful for them.
In addition to having group discussions, participants attend talks by s-risk researchers and are given the option to schedule 1-1 personalized career calls with us. CLR researchers also join fellowship meetings about topics related to their work, to answer questions and help facilitate discussion.
We think this program will be useful for you if:
If you’re interested in applying for our Summer Research Fellowship, this fellowship is a good opportunity to learn more about our work and to strengthen your application by giving you a better understanding of what we do and how you could contribute.
There might be more idiosyncratic reasons to apply, and the criteria above are intended as a guide rather than as strict requirements.
If you have any questions about the program or are uncertain whether to apply, you can comment on this post, or reach out to tristan.cook@longtermrisk.org.
To be notified when we next run an Intro Fellowship, please sign up to our mailing list in the footer of this page.
The post S-Risk Intro Fellowship appeared first on Center on Long-Term Risk.
The post Commitment ability in multipolar AI scenarios appeared first on Center on Long-Term Risk.
The ability to make credible commitments is a key factor in many bargaining situations ranging from trade to international conflict. This post builds a taxonomy of the commitment mechanisms that transformative AI (TAI) systems could use in future multipolar scenarios, describes various issues they have in practice, and draws some tentative conclusions about the landscape of commitments we might expect in the future.
A better understanding of the commitments that future AI systems can make is helpful for predicting and influencing the dynamics of multipolar scenarios. The option to credibly bind oneself to certain actions or strategies fundamentally changes the game theory behind bargaining, cooperation, and conflict. Credible commitments can work to stabilize positive-sum agreements, and to increase the efficiency of threats (e.g. Schelling 1960), both of which could be relevant to how well TAI trajectories will reflect our values.
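As a toy illustration of how a credible commitment changes the game theory of a bargaining situation, here is a minimal sketch of the classic game of chicken (all payoff numbers are arbitrary, and the code is mine rather than anything from the cited literature): once one player can credibly remove their own option to swerve, the other player's best response settles the outcome in the committed player's favor.

```python
# Toy game of chicken: each driver either Swerves or goes Straight. Payoffs illustrative.
PAYOFFS = {  # (row move, column move) -> (row payoff, column payoff)
    ("Swerve", "Swerve"): (0, 0),
    ("Swerve", "Straight"): (-1, 1),
    ("Straight", "Swerve"): (1, -1),
    ("Straight", "Straight"): (-10, -10),
}
MOVES = ["Swerve", "Straight"]

def best_response(committed_row_move, allowed_moves=MOVES):
    """Column player's best response to a row move they know is locked in."""
    return max(allowed_moves, key=lambda m: PAYOFFS[(committed_row_move, m)][1])

# Without commitment, both (Straight, Swerve) and (Swerve, Straight) are equilibria,
# and brinkmanship decides which one is reached. With a credible commitment (e.g. the
# visibly thrown-out steering wheel), the row player removes "Swerve" from their own
# options, and the column player's best response hands them their preferred outcome.
committed_move = "Straight"
reply = best_response(committed_move)
print(committed_move, reply, PAYOFFS[(committed_move, reply)])  # Straight Swerve (1, -1)
```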
Because human goals can be contradictory and even broadly aligned AI systems could come to prioritize different outcomes depending on their domains and histories, these systems can end up in undesirable competitive situations and various bargaining failures where a lot of value is lost. Similarly, if some systems in a multipolar scenario are well aligned and others less so, the outcome can be disastrous unless stable peaceful agreements can be reached. As an example of the practical significance of commitment ability in stabilizing peaceful strategies, standard theories in international relations hold that conflicts between nations are difficult to avoid indefinitely primarily because there are no reliable commitment mechanisms for peaceful agreements (e.g. Powell 2004, Lake 1999, Rosato 2015), even when nations would overall prefer them.
In addition to the direct costs of conflict, the lack of enforceable commitments leads to continuous resource loss from arms races and other preparations for possible acts of hostility. It can also simply prevent gains from trade that binding prosocial contracts and general high trust could unlock. A strategic landscape that resembles current international relations in these respects seems possible in a fully multipolar takeoff scenario, where no AI system has a decisive advantage over the others, and no external rule of law can be strongly enforced over all the systems. If AI systems had a much greater ability to commit than we do, however, they could avoid recapitulating these common pitfalls of human diplomacy. As commitments also make threats a more feasible strategy, they could on the other hand also cause significant value loss for almost any goal system. To us, this of course matters especially in situations where some of the AI systems involved are at least partially aligned with goals we care about.
The potential consequences of credible commitments for AI systems will be discussed more thoroughly in forthcoming work. The purpose of this post is mostly to investigate whether and how credible commitments could be feasible for such systems in the first place.1 As commitment mechanisms differ in which kinds of commitment they are best suited for, though, some implications for consequences will also be tentatively explored.
Some quick notes on the terminology in this post:
Commitment ability refers here to an agent's ability to cause others to have a model of its relevant actions and future behavior which matches its own model of itself, or its genuine intentions.2 This can naturally include complex probabilistic models and models of what an agent will do conditional on the behavior of others. (While an agent's model of itself may not always correspond to what it actually ends up doing, the noise from incorrect models should ideally also be low enough that it doesn't affect the bargaining landscape much.) This definition diverges somewhat from how commitments are typically understood, but captures better a broader transparency in bargaining situations.
Closer to the conventional concept of commitment, commitment mechanisms here are ways to bind yourself more strongly to certain future actions in externally credible ways (such as visibly throwing out your steering wheel in a game of chicken).
Approaches to commitment in this context are simply the higher-level frameworks that agents can use to assess and increase the commitment ability of themselves and others in their environment. The main content of this post will be outlining various approaches to commitment that could become relevant in multipolar AI scenarios.
This section will discuss ways through which AI systems could surpass humans in commitment ability. It will also cover the main reasons why this isn't self-evident, even between systems that are overall far more capable than humans. In particular, there are three properties of commitment approaches that are at least not obviously satisfied by any of the candidates here, but seem important when talking about commitment ability in a given future environment:
Classical approaches: mutually transparent architectures
Early discussions in AI safety often assumed that transformative AI systems would be based on advanced models of the fundamental principles of intelligence. Their cognitive architectures could therefore be quite elegant, and perhaps arbitrarily transparent to other similarly intelligent agents. The concept of systems checking each other's source codes, or allowing a trusted third party to verify them, was often used as a shorthand for this kind of mutual interpretability [link]. For highly transparent agents whose goals are also contained in compact formal representations, such as utility functions, reliable alliances could even happen through merging utility functions (Dai 2009, 2019). Work on program equilibrium as a formal solution to certain game-theoretic dilemmas uses source code transparency as a starting point (Tennenholtz 2004, see also Oesterheld 2018), assuming complete information of the other agent's syntax to condition one's response on.3 Further work has also generalized the idea of conditional commitments and the cooperative equilibria they support (Kalai et al 2010).
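To make the source-code-transparency idea concrete, here is a minimal sketch in the spirit of the program-equilibrium literature (my own illustration, not code from any of the cited papers): each player submits a program for a one-shot Prisoner's Dilemma, and each program may condition its move on the other program's source. Payoff numbers and function names are illustrative.

```python
import inspect  # getsource needs the code to live in a file, e.g. run as a script

# One-shot Prisoner's Dilemma payoffs (row player, column player); values illustrative.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent's source code is identical to mine."""
    return "C" if opponent_source == inspect.getsource(clique_bot) else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, regardless of the opponent's source."""
    return "D"

def play(prog_a, prog_b):
    """Run both submitted programs on each other's source and look up the payoffs."""
    move_a = prog_a(inspect.getsource(prog_b))
    move_b = prog_b(inspect.getsource(prog_a))
    return PAYOFFS[(move_a, move_b)]

print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation between identical programs
print(play(clique_bot, defect_bot))  # (1, 1): a defector gains nothing by deviating
```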
These approaches seem less compatible with recent advances in AI. Capability gain is currently mostly driven by reinforcement learning in increasingly large and complex environments, less by progress in understanding the building blocks of general cognition (Sutton 2019). It seems plausible that the ultimately successful paradigms for transformative AI will conceptually be quite close to contemporary work (Christiano 2016). If this is the case, and superhuman systems will be hacky and opaque similarly to human brains, their mutual interpretability could also remain limited like it is between humans. Cognitive heterogeneity in itself is already a hindrance to mutual understanding, and will likely be much greater in AI systems than in humans. Considering that all humans have a shared evolutionary history and are strongly adapted for social coordination, we could even be much better at credibility and honesty than independently-trained AI systems, if they are developed through very different methods or in varying conditions, and have no such natural adaptations for transparency.4
On the other hand, superhuman agents could also be able to define and map the foundations of intelligence better than human researchers. Even prosaic trajectories could thus eventually lead to more compact builds and allow for higher interpretability. Though beyond the reach of human researchers, intentionally designed and elegant cognitive architectures could still ultimately be more efficient than ones that were born through less controllable (e.g. evolutionary) processes. Increased commitment ability in itself might already motivate agents to move in this direction, if they expect transparency to facilitate more gains from trade or some other competitive advantage. The bargaining landscape would then change in a predictable pattern over time: early AI systems would have poor commitment ability despite otherwise superhuman competence, but after more intentional refactoring towards transparency, strong commitments through classical approaches would eventually become available to their successors.5
This kind of self-modification would still lack robust safeguards against some conceptually simple exploits. Even if one could comb through an agent's internal structure at some point after it self-modified to be highly interpretable, it would be costly to make sure that it hasn't, for example, secretly changed something relevant in the environment before this process. In addition, asymmetries in competence would likely appear between agents due to their different domains, histories, and goals. Whether global differences in competence or just local blind spots, these asymmetries might make obfuscating one's intentions a viable strategy after all, and decrease the general credibility of commitments.
If transformative AI systems will be built with current paradigms, existing research on interpretability might also be helpful when predicting commitment ability. Even if the kind of syntactic transparency required for program equilibrium approaches wasn't feasible, high levels of trust can be achieved as long as other ways exist to understand another agent's internal decision procedures from the outside. This resembles a more advanced version of human researchers trying to make contemporary machine learning models more understandable to us.
The literature on interpretability currently lacks a unified paradigm, but it often divides methods for interpretation into model-based and post-hoc approaches (Murdoch et al 2018). The former require the models themselves to be inherently more understandable through design choices such as sparsity of parameters, modular structures, or more feature engineering based on expert knowledge about the domain in question. These ideas can possibly be extrapolated to TAI scenarios, and some key concepts will be explored below. The latter are more specific to current narrow models, and mostly deal with measures for feature importance, i.e. clarifying which features or interactions in the training data have contributed to the model to which degrees. With enough information of how an agent has been trained, analogous methods could perhaps be useful, but likely laborious; they will not be discussed further in this post.
Generally, model-based methods have a shared problem in how they constrain the model design in its other capabilities. Some interpretability researchers have suggested that these constraints are less prohibitive than they seem, at least in contemporary applications. When the task is to make sense of some dataset, the set of models that can accomplish such a predictive task (known as the Rashomon set) is potentially large, and thus arguably likely to include some intrinsically understandable models as well (Rudin 2019). This idea might extrapolate to general intelligences quite poorly in practice, especially when computational efficiency is also a concern and the setting is competitive in general. However, in a sense it's related to the idea that there could be some eventually discoverable highly compact building blocks that suffice for general intelligence, even if many of the paths there are messier. One way through which this could hold is that the world and its relations themselves are fundamentally simple or compressible (see e.g. Wigner 1960).
Another way in which even complex systems could achieve more transparency is through modularity, where various parts of an agent's cognition can be examined and interpreted somewhat independently. Different cognitive strategies, employed in different situations depending on some higher-level judgment, could potentially be both effective and fairly transparent due to their smaller size (and possibly higher fundamental comprehensibility and traceable history) compared to a generally intelligent agent. Whether strongly modular structures are in fact functional or competitive enough in this context will be discussed in forthcoming work, but the greater transparency of modular minds is questionable. It seems unlikely that in a complex world, parts of an effective agent's reasoning could be so separable from its other capacities so as to leave no context-dependent uncertainties, or opportunities to secretly defect by using seemingly trustworthy modules in underhanded ways. This certainly doesn't seem to be the case in human brains, despite their likely quite modular structure (for an overview, see e.g. Carruthers 2006, Robbins 2017).
Overall, the relation between interpretability and the kind of transparency that facilitates commitments is not well defined. Being able to interpret an agent's decisions doesn't seem to directly mean that its behavior is simulatable or otherwise verifiable to you in specific bargaining situations. Transparency through these means seems especially implausible when local or global asymmetries between agents are large, and possibly when the scenario is adversarial.
Centralized collaborative approaches: arbitrator systems for verifying commitments
A less architecture-dependent but also less satisfying approach is simply assuming that commitment ability is a very difficult problem, and like most very difficult problems, trivially solved by throwing a lot of compute at it. Perhaps AI systems will remain irredeemably messy, but will be motivated to find ways to cooperate in spite of this. One route they might consider is similar to what human societies have often converged on: centralize enough power and resources to enforce or check contracts that individual humans otherwise can't credibly commit to.
In this context, the central power could exist either for simply verifying the intentions behind arbitrary commitments, or for punishing defectors afterwards if they break established laws. As the latter task has been brought up in other contexts [link] and doesn't constitute a meaningfully multipolar scenario, this section will mostly discuss the former. An overseer that merely verifies contracts and commitments instead of dealing out punishments could be more palatable even for agents with idiosyncratic preferences about societal rules. It only requires agents to believe that the ability to make voluntary credible commitments will be positive in expectation.6 It would regardless capture many of the benefits of a central overseer, as one main reason for punitive systems is also enforcing otherwise untenable commitments.
The idea behind this mechanism is only that while the agents can't interpret each other or predict how well they would stick to commitments, a far more capable system (here, likely just a system with vastly more compute at its disposal) could do it for them. If several agents of similar capability are involved in collaboratively constructing such a system, they can be fairly confident that no single agent can secretly bias it, or otherwise manipulate the outcome. This system would then serve as an arbitrator, likely with no other goals of its own, and remain far enough above the competence level of any other agent in the landscape. Assuming that its subjects will continuously strive for expansion and self-improvement, this minimal-state brain would also need to keep growing. As long as it remains useful, it could do so by collecting resources from the agents that expect to benefit from its abilities.
How much more intelligent would such a system need to be, though? Massively complex neural architectures could well remain inscrutable even to much more competent agents. In terms of neural connections, no human could use a snapshot of a salamander brain to predict its next action, let alone its motivations one hour in the future. Even the simple 302-neuron connectome of the nematode C. elegans mostly escapes our understanding, despite years of effort at emulating its functions and our own neuron count at 8.6x10^10 (see OpenWorm and related projects, 2020). Most likely, judgments about an agent's honesty would need to rely in part on inferences based on its origins and history, slight behavioral inconsistencies, and other subtle external signs it would hopefully not be clever enough to fully cover up. Some causal traces of intentions to deceive bargaining partners could be expected. For the iconic argument in this space, see Yudkowsky (2008).
A major weakness to this approach is that the costs of running such a system seem substantial regardless of how large the difference needs to be. The gains from trade that agents could secure through increased commitment ability would have to be higher than that, and it isn't clear that this is the case. Eventually, there might not be enough surplus left on the table to motivate further contributions to such a costly system. On the other hand, if there is some point after which most of the valuable commitments have been made and an arbitrator is no longer needed, this could just be because the bargaining landscape is then thoroughly locked into a decently cooperative state: if bargaining failures were still frequent in expectation, there would be more potential surplus left. If so, the overall costs of relying on such a system might not be too high in the long run, as it would mostly be needed through some transient unstable timespan during early interactions between AI systems.
There are a few ways through which a centralized arbitrator system could be set up, for example:
Decentralized collaborative approaches
A potentially less costly collaborative approach could work if credibility can be mapped to a multidimensional model where agents start out differentially trusting each other because of path-dependent or idiosyncratic reasons, and can then form networks to verify commitments. For example, due to domain-specific differences and histories, even agents whose overall competence is roughly on your level could spot minor details that you miss because of your own limitations, but that are relevant to your credibility. If architectural similarities matter for transparency, some agents will be able to understand each other's internal workings better than others; this could be the case if copies either of agents or their internal modules become common. Some agents can also come to share origins or relevant interactions that allow them to form better models of each other.
This approach differs from a centralized project in that it describes conditions where gradients of trust form with low initial effort in these path-dependent dimensions. As trust at least in a general sense can be largely transitive, the costs of communicating within a network under such conditions could stay reasonably manageable. More than a specific mechanism, this approach would be a way to extend already existing local and empirical commitment ability, at least in a probabilistic manner, through larger areas in the bargaining landscape.
As a simplified example, say that there are three agents, A, B, and C, who as a binary either can or cannot trust each other. Agent A can trust agent B (and vice versa) because of many shared modules, but cannot trust the internally very different agent C. Agent B can trust agent C (and vice versa) because of a shared history. If agents A and C then want to communicate credibly with each other, it seems easy for them to verify their agreements through their links to agent B.
In larger agent spaces, longer chains and networks of agents with differential levels of trust could plausibly come to influence the dynamics of commitment through similar network structures. Even without multiple dimensions to make the task potentially cheaper, however, a less centralized approach can be pursued. Rather than specifically building a central system for the black-box task of verifying commitments, a network of agents that can along with their other pursuits trade various evaluation services could be a more dynamic way to get the required amounts of compute for assessing individual contracts.
While modeling the payoffs that agents could receive by helping others communicate is not a central question in this context, it is interesting when considering the incentives for such tasks. Various models have been built in cooperative game theory to represent limited communication between different parts in a network of collaborators and the payoff distributions in such situations (see e.g. Slikker and Van Den Nouweland 2001). A widely used formalism is the Myerson value, which builds on the Shapley value and allocates a greater part of the surplus in a coalition to players who facilitate communication and therefore cooperation between others (Myerson 1977, 1980; Shapley 1951). This and related concepts correspond reasonably well to scenarios where trust differences allow only some agents to credibly communicate with each other. Forthcoming work will investigate in more detail how cooperative equilibria can be sustained in limited communication situations.
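As a concrete (and purely illustrative) example of how the Myerson value rewards intermediaries, the sketch below computes it for the three-agent trust network described above, under the hypothetical assumption that any group of two or more agents who can all communicate, directly or via trusted intermediaries, generates one unit of surplus. Agent B, who bridges A and C, captures most of the value.

```python
from itertools import permutations

players = ["A", "B", "C"]
trust_links = {frozenset({"A", "B"}), frozenset({"B", "C"})}  # no direct A-C link

def components(coalition):
    """Connected components of a coalition under the trust graph."""
    remaining, comps = set(coalition), []
    while remaining:
        frontier = {remaining.pop()}
        comp = set()
        while frontier:
            node = frontier.pop()
            comp.add(node)
            neighbours = {q for q in remaining if frozenset({node, q}) in trust_links}
            remaining -= neighbours
            frontier |= neighbours
        comps.append(comp)
    return comps

def v(coalition):
    # Hypothetical characteristic function: any group of two or more agents that
    # can all communicate generates one unit of surplus.
    return 1 if len(coalition) >= 2 else 0

def v_restricted(coalition):
    # Myerson's graph-restricted game: value only accrues within connected components.
    return sum(v(c) for c in components(coalition))

def myerson_value(player):
    # Shapley value of the graph-restricted game: average marginal contribution
    # over all orders in which the agents could join the coalition.
    orders = list(permutations(players))
    total = 0
    for order in orders:
        predecessors = set(order[: order.index(player)])
        total += v_restricted(predecessors | {player}) - v_restricted(predecessors)
    return total / len(orders)

for p in players:
    print(p, round(myerson_value(p), 3))
# A 0.167, B 0.667, C 0.167 -- the intermediary captures most of the surplus.
```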
Overall, the approach described in this section mostly serves as a rough sketch for much more sophisticated network strategies that AI systems could devise, but even with very liberal hypothetical extrapolation, the bridge to plausible practical scenarios seems shaky. At the very least, the availability and strength of any local credibility links is determined mostly by higher-level features of the agent space, though intentionally creating more of them seems possible for cooperative human researchers during development.
Automated approaches
High transparency can sometimes be achieved by finding a commitment mechanism simple enough that its workings are unambiguous from the outside. By separating a commitment from the sophisticated strategies and other cognitive complexities of the agent itself, an effective approach can consist simply of automatic structures responding to the environment in predictable ways. The Cold War-era Soviet Union presumably built nuclear control systems that could be triggered by sensor input alone, to ensure retaliation with minimal human intervention (Wikipedia 2020).7 Companies can irreversibly invest in and deploy specific assets, tying their hands to a certain strategy, often in an observable and understandable way (e.g. Sengul et al. 2011). Similarly, militaries can reduce their options by mobilizing troops that would be too costly to recall regardless of what one's opponent chooses to do (Fearon 1997).
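To illustrate what "simple enough to be unambiguous from the outside" might mean, here is a toy sketch of such an automated trigger. The response rule is a few lines of inspectable code over sensor readings; the threshold and sensor labels are invented for the example.

```python
# Toy sketch of an automated commitment device: the entire decision rule is short
# enough that an outside observer who trusts the sensors can predict its behavior.
RETALIATION_THRESHOLD = 3  # hypothetical: number of independent sensors that must agree

def should_trigger(sensor_readings):
    """Fire the committed response iff enough independent sensors report an attack."""
    return sum(1 for reading in sensor_readings if reading == "attack") >= RETALIATION_THRESHOLD

# The transparency comes from the rule itself; the hard part, as discussed below,
# is whether the sensors, and the environment they observe, are equally transparent.
print(should_trigger(["attack", "quiet", "attack", "attack"]))  # True
print(should_trigger(["attack", "quiet", "quiet", "quiet"]))    # False
```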
While powerful in many specific cases, this approach is quite limited especially in complex environments. With large differences in general or domain-specific competence, there might be few situations where simple automated mechanisms can even be built transparently enough. Regardless of how interpretable and robust some physical device or resource investment seems, it doesn't remove the intelligent agent from the equation, or again prevent it from setting up the environment in a clever way that allows for defection after all.
In most contexts an automated approach has many other downsides as well, such as a lack of flexibility and corrigibility if there are unpredictable events in the environment. It seems unlikely that highly verifiable automated mechanisms could be built with the resolution to track the ideal commitments one could make in complex situations, and most interesting contracts could likely not be represented at all. In environments with agents that are much more diverse than humans, nations, or organizations, the physical reliability of simple commitment devices could be illusory even when they are set up by agents that are sincere in their commitments. While the fearful symmetry seen in nuclear deterrence strategies may be the best option to practically reduce the incidence of conflicts, it has historically led to mistaken near-launches due to unforeseen details such as weather anomalies (Wikipedia 2020). This illustrates how even applying a simple commitment mechanism requires a good understanding of the environment, including one's peers and their behavior space, when designing viable trigger conditions for whatever the intended procedure is. The worse one's models of the other players are, the harder this task would likely be.
Strategic delegation
In the economic and game-theoretic literature, a related but typically more flexible approach is strategic delegation, where principals deploy agents with different direct incentives to act on their behalf. By optimizing for something other than the principal's actual goal, delegated agents can sometimes reach better bargaining outcomes, because the desired commitment is naturally more favorable to their incentives. For example, a manager may be responsible merely for keeping a company in the market, not for its ultimate profit margins, which credibly changes how they will respond to threats in entry deterrence games (Fershtman and Judd, 1987). The original formalism behind strategic delegation (Vickers, 1985) involves an agent appointment game that precedes the actual game between the agents and determines how that latter game is played, with payoffs given by an exogenous outcome function. More recent work (Oesterheld and Conitzer, 2019) describes how delegates with modified incentives can safely strive for Pareto improvements.
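As a stylized illustration of the Fershtman-Judd point, rather than their actual model, consider the toy entry deterrence game solved by backward induction below. The payoffs and the delegate's incentive scheme are invented for the example; the point is only that changing what the incumbent's manager is paid for changes which threats are credible.

```python
# Toy entry deterrence game. Payoffs are (incumbent profit, entrant profit):
# entrant stays out -> (2, 0); entrant enters and incumbent accommodates -> (1, 1);
# entrant enters and incumbent fights a price war -> (0, -1).
PAYOFFS = {"out": (2, 0), ("enter", "accommodate"): (1, 1), ("enter", "fight"): (0, -1)}

def profit_maximizer(incumbent_profit, entrant_profit):
    return incumbent_profit

def market_share_manager(incumbent_profit, entrant_profit):
    # Hypothetical incentive scheme: the manager is penalized when the rival thrives.
    return incumbent_profit - 2 * entrant_profit

def solve(delegate_objective):
    """Backward induction: the incumbent's delegate moves last."""
    # If entry occurs, the delegate picks the response that maximizes its own objective.
    response = max(["accommodate", "fight"],
                   key=lambda r: delegate_objective(*PAYOFFS[("enter", r)]))
    # The entrant anticipates that response and enters only if doing so beats staying out.
    enters = PAYOFFS[("enter", response)][1] > PAYOFFS["out"][1]
    outcome = ("enter", response) if enters else "out"
    return outcome, PAYOFFS[outcome]

print(solve(profit_maximizer))      # (('enter', 'accommodate'), (1, 1)): threat to fight is not credible
print(solve(market_share_manager))  # ('out', (2, 0)): delegation makes the threat credible, deterring entry
```

Note that the sketch simply assumes the delegate's incentives are common knowledge, which is exactly the credibility problem discussed next.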
The practical applications of these models are not immediately clear in the empirical future scenarios we might envision. As pointed out by Oesterheld and Conitzer, the process of committing one's delegates to their modified incentives must already be credible. If the deployed agent differs from the principal mostly in terms of incentives, and not in competence, agency, or internal complexity, it may not be much more transparent in its commitments than the principal was. Perhaps some goals are more verifiable or otherwise credible than others, e.g. in terms of observable actions that are consistent with them, but the fundamental problem of internal opaqueness remains. One solution is to deploy the agent only for a specific bargaining situation, for which it is trained in a mostly transparent way where an observer can see the details of the training procedure. However, much as modules within a single agent pose challenges, it is unclear how well individual bargaining situations could be separated from their enforcement in the environment, and enforcement would again presumably require a more generally competent agent to be crucially involved.
Iteration, punishment capacity, and other miscellaneous factors
If interactions in the bargaining environment are iterated or one's history is visible to outside parties one might trade with later on, reputation concerns can incentivize sticking to commitments. This is a well-known finding in game theory (see e.g. Mailath & Samuelson 2006), and will not be discussed much further here, but ought to be included for the sake of completeness. In transparent iterated scenarios, an agent expects other players to be able to punish it later for breaking commitments. Even if the environment is uncertain, adhering to costly commitments can signal credibility to future bargaining partners as a long-term strategy. A concrete special case of the former mechanism is having a central power or other system with the material capacity to retrospectively punish agents that renege on their contracts, much like law enforcement in human societies happens through the deterrence effect of designed consequences to defections.8
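As a reminder of the standard logic (a textbook result rather than anything specific to this post), in an infinitely repeated Prisoner's Dilemma with payoffs $T > R > P > S$ and discount factor $\delta$, grim-trigger strategies sustain cooperation whenever the one-shot gain from defecting is outweighed by the discounted loss of future cooperation:

$$
\frac{R}{1-\delta} \;\ge\; T + \frac{\delta P}{1-\delta}
\quad\Longleftrightarrow\quad
\delta \;\ge\; \frac{T-R}{T-P}.
$$

The worry in the next paragraph is that, for long-horizon agents in an environment with rising stakes, the effective discount factor on the interactions that matter most may be low, so this condition can fail exactly when it is most needed.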
As it is mostly upstream of commitment ability, increasing how iterated interactions are for the sake of credibility seems inefficient and probably intractable. Among agents whose strategies optimize for the very long term, it is also unreliable: if interactions are repeated in an environment where the stakes get higher over time, most agents would prefer to be honest while the stakes are low, regardless of how they will act in a sufficiently high-stakes situation. This holds especially because the higher the stakes get in a competition for expansion, the fewer future interactions one expects, as wiping out other players entirely becomes possible. Iteration alone would therefore provide limited information, even if it sometimes were the only practical way to provide evidence of one's trustworthiness.
Both epistemic and normative features in individual agents can make their commitments more credible, if these features are common knowledge. Human cultures, for instance, have used religious notions to signal commitment to certain strategies (e.g. Holslag 2016), perhaps often successfully compared to available counterfactual approaches. Agents could also come to intrinsically value transparency or choose to adhere to commitments, either through moral values, or certain decision-theoretic policy choices (Drescher 2006). These choices would not in themselves make commitments externally credible, of course, but could have more verifiable sources depending on the agent's history.
As mentioned above, it seems that each commitment strategy described here suffers from potentially serious drawbacks, though in different areas and circumstances. Many plausible scenarios can be envisioned where one or more of the approaches succeeds in supporting credible commitment. Different approaches could even be used in overlapping ways to compensate for their weaknesses, though this holds less if the main weakness is resource costs. In many cases, the feasibility of commitments seems to come down to whether the surplus from cooperation will be enough to incentivize a great deal of collective effort. Another fundamental question is how costly it is to obfuscate one's intentions with great care versus detect obfuscation by observing an agent’s behavior and history.
On a more practical level, contingent features such as agent heterogeneity and even logistics suggest that even if contracts and commitments were overall feasible, they would be costlier to verify between some agents than others. Rather than expecting uniform opportunities for commitment throughout the landscape, we should perhaps assume the environment will be governed by some n-dimensional mess of gradients in commitment ability. Comparing agents along axes such as physical location, architectural similarity, history, normative motivations, and willingness to cooperate, some of them would likely be in better positions to make credible commitments to each other. This does not necessarily prevent widely cooperative dynamics, especially if there is a lot of transitivity in commitment ability between agents as speculated above, but makes the path there more complicated in terms of interventions.
Another insight from this work is that committing to threats could require completely different mechanisms or approaches than committing to cooperation, and future discussions on commitment among AI systems should ideally reflect this. Notably, as many ways to signal one's intentions already require some minimal collaborative labor, it seems much more feasible commitment-wise to make prosocial commitments than to extort others.9 When you can't simply inform your target of a threat and your intentions to carry it out, and would instead need them to go through a costly process to get your intentions properly verified, you might find that they aren't interested in hearing more about your plans.10 One exception seems to be the dumber mechanisms, which are well suited for destructive threats, but might not be able to represent complex voluntary trade contracts.
This post benefited immensely from conversations with and feedback from Jesse Clifton, Richard Ngo, Daniel Kokotajlo, Lin Chi Nguyen, Lukas Gloor, Stefan Torges, as well as the rest of my colleagues at Center on Long-term Risk (CLR) and the attendees at CLR's 2019 S-risk workshop, which inspired many of the initial ideas explored here.
The post Commitment ability in multipolar AI scenarios appeared first on Center on Long-Term Risk.
The post Fundraiser 2020 appeared first on Center on Long-Term Risk.
We are raising $550,000 (stretch goal: $1,500,000) to make further progress on our mission: building a global community of researchers and professionals working to do the most good in terms of reducing suffering. These are our plans for 2021 in a nutshell (more details here):
If you prioritize reducing risks of astronomical suffering, we believe there is a strong case to support our work. We are one of the few organizations with the same priorities and we have made significant progress with our work in recent years.
Donations to the Effective Altruism Foundation (EAF), which houses the Center on Long-Term Risk, are tax-deductible for donors in Germany, Switzerland, the US, the UK, and the Netherlands. Donors in the US and the UK can make tax-deductible donations to us via the Effective Altruism Funds platform.
You can find answers to frequently asked questions in our donation FAQ. If the FAQ doesn't answer your question, please send us an email at donate@ea-foundation.org.
Name | Amount | Comment
---|---|---
Mikko Rauhala | EUR 4995 |
Anonymous | EUR 35 |
Anonymous | EUR 35 |
Anonymous | GBP 30 |
Rai (Michael Pokorny) | CHF 60.78 |
Anonymous | EUR 2000 |
Adam Hruby | USD 10 |
Victoria Gutierrez | USD 200 |
Connor Leahy | EUR 500 |
Anonymous | CHF 5000 |
Anonymous | EUR 631 | I pray that CLR not only receives the donation required, but gains ever-increasing public profile and awareness on the subject of impact of tech for future generations!
Anonymous | CHF 20014 |
Jonas Hunsicker | EUR 50 |
Anonymous | EUR 25 |
Amy Spence | USD 100 |
Denis Drescher | CHF 100 |
Adam clayton | USD 75 |
Anonymous | USD 75 |
Anonymous | CHF 10000 |
Cliff and Stephanie Hyra | USD 20000 |
Adam Spence | USD 77.77 | Chaos Theory is another potential area that CLTR should consider researching. A better understanding of chaotic systems is important for understanding how to change the course of history for the better, and a lot of suffering seems to be caused by unintended consequences of not necessarily malevolent actions.
Anonymous | CHF 5000 |
Anonymous | CHF 3000 |
Anonymous | USD 588 |
Anonymous | CHF 200 |
Jan und Sara Rüegg | CHF 11474 |
Anonymous | CHF 18000 |
Anonymous | USD 28000 |
The post Persuasion Tools: AI takeover without takeoff or agency? appeared first on Center on Long-Term Risk.
I'm envisioning that in the future there will also be systems where you can input any conclusion that you want to argue (including moral conclusions) and the target audience, and the system will give you the most convincing arguments for it. At that point people won't be able to participate in any online (or offline for that matter) discussions without risking their object-level values being hijacked.
--Wei Dai
What if most people already live in that world? A world in which taking arguments at face value is not a capacity-enhancing tool, but a security vulnerability? Without trusted filters, would they not dismiss highfalutin arguments out of hand, and focus on whether the person making the argument seems friendly, or unfriendly, using hard to fake group-affiliation signals?
--Benquo
AI-powered memetic warfare makes all humans effectively insane.
--Wei Dai, listing nonstandard AI doom scenarios
This post speculates about persuasion tools—how likely they are to get better in the future relative to countermeasures, what the effects of this might be, and what implications there are for what we should do now.
To avert eye-rolls, let me say up front that I don’t think the world is likely to be driven insane by AI-powered memetic warfare. I think progress in persuasion tools will probably be gradual and slow, and defenses will improve too, resulting in an overall shift in the balance that isn’t huge: a deterioration of collective epistemology, but not a massive one. However, (a) I haven’t yet ruled out more extreme scenarios, especially during a slow takeoff, and (b) even small, gradual deteriorations are important to know about. Such a deterioration would make it harder for society to notice and solve AI safety and governance problems, because it is worse at noticing and solving problems in general. Such a deterioration could also be a risk factor for world war three, revolutions, sectarian conflict, terrorism, and the like. Moreover, such a deterioration could happen locally, in our community or in the communities we are trying to influence, and that would be almost as bad. Since the date of AI takeover is not the day the AI takes over, but the point it’s too late to reduce AI risk, these things basically shorten timelines.
Analyzers: Political campaigns and advertisers already use focus groups, A/B testing, demographic data analysis, etc. to craft and target their propaganda. Imagine a world where this sort of analysis gets better and better, and is used to guide the creation and dissemination of many more types of content.
Feeders: Most humans already get their news from various “feeds” of daily information, controlled by recommendation algorithms. Even worse, people’s ability to seek out new information and find answers to questions is also to some extent controlled by recommendation algorithms: Google Search, for example. There’s a lot of talk these days about fake news and conspiracy theories, but I’m pretty sure that selective/biased reporting is a much bigger problem.
Chatbot: Thanks to recent advancements in language modeling (e.g. GPT-3) chatbots might become actually good. It’s easy to imagine chatbots with millions of daily users continually optimized to maximize user engagement--see e.g. Xiaoice. The systems could then be retrained to persuade people of things, e.g. that certain conspiracy theories are false, that certain governments are good, that certain ideologies are true. Perhaps no one would do this, but I’m not optimistic.
Coach: A cross between a chatbot, a feeder, and an analyzer. It doesn’t talk to the target on its own, but you give it access to the conversation history and everything you know about the target and it coaches you on how to persuade them of whatever it is you want to persuade them of.
Drugs: There are rumors of drugs that make people more suggestible, like scopolamine. Even if these rumors are false, it’s not hard to imagine new drugs being invented that have a similar effect, at least to some extent. (Alcohol, for example, seems to lower inhibitions. Other drugs make people more creative, etc.) Perhaps these drugs by themselves would not be enough, but would work in combination with a Coach or Chatbot. (You meet the target for dinner, and slip some drug into their drink. It is mild enough that they don’t notice anything, but it primes them to be more susceptible to the ask you’ve been coached to make.)
Imperius Curse: These are a kind of adversarial example that gets the target to agree to an ask (or even switch sides in a conflict!), or adopt a belief (or even an entire ideology!). Presumably they wouldn’t work against humans, but they might work against AIs, especially if meme theory applies to AIs as it does to humans. The reason this would work better against AIs than against humans is that you can steal a copy of the AI and then use massive amounts of compute to experiment on it, finding exactly the sequence of inputs that maximizes the probability that it’ll do what you want.
The first thing to point out is that many of these kinds of persuasion tools already exist in some form or another. And they’ve been getting better over the years, as technology advances. Defenses against them have been getting better too. It’s unclear whether the balance has shifted to favor these tools, or their defenses, over time. However, I think we have reason to think that the balance may shift heavily in favor of persuasion tools, prior to the advent of other kinds of transformative AI. The main reason is that progress in persuasion tools is connected to progress in Big Data and AI, and we are currently living through a period of rapid progress in those things, and probably progress will continue to be rapid (and possibly accelerate) prior to AGI.
However, here are some more specific reasons to think persuasion tools may become relatively more powerful:
Substantial prior: Shifts in the balance between things happen all the time. For example, the balance between weapons and armor has oscillated at least a few times over the centuries. Arguably persuasion tools got relatively more powerful with the invention of the printing press, and again with radio, and now again with the internet and Big Data. Some have suggested that the printing press helped cause religious wars in Europe, and that radio assisted the violent totalitarian ideologies of the early twentieth century.
Consistent with recent evidence: A shift in this direction is consistent with the societal changes we’ve seen in recent years. The internet has brought with it many inventions that improve collective epistemology, e.g. Google Search, Wikipedia, the ability of communities to create forums... Yet on balance it seems to me that collective epistemology has deteriorated in the last decade or so.
Lots of room for growth: I’d guess that there is lots of “room for growth” in persuasive ability. There are many kinds of persuasion strategy that are tricky to use successfully. Like a complex engine design compared to a simple one, these strategies might work well, but only if you have enough data and time to refine them and find the specific version that works at all, on your specific target. Humans never have that data and time, but AI+Big Data does, since it has access to millions of conversations with similar targets. Persuasion tools will be able to say things like 'In 90% of cases where targets in this specific demographic are prompted to consider and then reject the simulation argument, and then challenged to justify their prejudice against machine consciousness, the target gets flustered and confused. Then, if we make empathetic noises and change the subject again, 50% of the time the subject subconsciously changes their mind so that when next week we present our argument for machine rights they go along with it, compared to 10% baseline probability.'
Plausibly pre-AGI: Persuasion is not an AGI-complete problem. Most of the types of persuasion tools mentioned above already exist, in weak form, and there’s no reason to think they can’t gradually get better well before AGI. So even if they won't improve much in the near future, plausibly they'll improve a lot by the time things get really intense.
Language modelling progress: Persuasion tools seem to be especially benefitted by progress in language modelling, and language modelling seems to be making even more progress than the rest of AI these days.
More things can be measured: Thanks to said progress, we now have the ability to cheaply measure nuanced things like user ideology, enabling us to train systems towards those objectives.
Chatbots & Coaches: Thanks to said progress, we might see some halfway-decent chatbots prior to AGI. Thus an entire category of persuasion tool that hasn’t existed before might come to exist in the future. Chatbots too stupid to make good conversation partners might still make good coaches, by helping the user predict the target’s reactions and suggesting possible things to say.
Minor improvements still important: Persuasion doesn’t have to be perfect to radically change the world. An analyzer that helps your memes have a 10% higher replication rate is a big deal; a coach that makes your asks 30% more likely to succeed is a big deal.
Faster feedback: One way defenses against persuasion tools have strengthened is that people have grown wise to them. However, the sorts of persuasion tools I’m talking about seem to have significantly faster feedback loops than the propagandists of old; they can learn constantly, from the entire population, whereas past propagandists (if they were learning at all, as opposed to evolving) relied on noisier, more delayed signals.
Overhang: Finding persuasion drugs is costly, immoral, and not guaranteed to succeed. Perhaps this explains why it hasn’t been attempted outside a few cases like MKULTRA. But as technology advances, the cost goes down and the probability of success goes up, making it more likely that someone will attempt it, and giving them an “overhang” with which to achieve rapid progress if they do. (I hear that there are now multiple startups built around using AI for drug discovery, by the way.) A similar argument might hold for persuasion tools more generally: We might be in a “persuasion tool overhang” in which they have not been developed for ethical and riskiness reasons, but at some point the price and riskiness drops low enough that someone does it, and then that triggers a cascade of more and richer people building better and better versions.
Here are some hasty speculations, beginning with the most important one:
Ideologies & the biosphere analogy:
The world is, and has been for centuries, a memetic warzone. The main factions in the war are ideologies, broadly construed. It seems likely to me that some of these ideologies will use persuasion tools--both on their hosts, to fortify them against rival ideologies, and on others, to spread the ideology.
Consider the memetic ecosystem--all the memes replicating and evolving across the planet. Like the biological ecosystem, some memes are adapted to, and confined to, particular niches, while other memes are widespread. Some memes are in the process of gradually going extinct, while others are expanding their territory. Many exist in some sort of equilibrium, at least for now, until the climate changes. What will be the effect of persuasion tools on the memetic ecosystem?
For ideologies at least, the effects seem straightforward: The ideologies will become stronger, harder to eradicate from hosts and better at spreading to new hosts. If all ideologies got access to equally powerful persuasion tools, perhaps the overall balance of power across the ecosystem would not change, but realistically the tools will be unevenly distributed. The likely result is a rapid transition to a world with fewer, more powerful ideologies. They might be more internally unified, as well, having fewer spin-offs and schisms due to the centralized control and standardization imposed by the persuasion tools. An additional force pushing in this direction is that ideologies that are bigger are likely to have more money and data with which to make better persuasion tools, and the tools themselves will get better the more they are used.
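As a toy illustration of that consolidation dynamic (all numbers invented, and obviously not a serious model of memetics), here is a small simulation in which each ideology's persuasive power grows with its share of hosts, standing in for larger ideologies having more money and data to spend on tools:

```python
# Toy model: three ideologies compete for a fixed population of hosts.
# Each ideology's "persuasive power" is its baseline appeal times a bonus that
# grows with its current share (a stand-in for more data and better tools).
shares = [0.4, 0.35, 0.25]          # hypothetical initial shares of hosts
baseline_appeal = [1.0, 1.0, 1.0]   # identical memes apart from tool access
tool_advantage = [0.5, 0.1, 0.1]    # ideology 0 invests most in persuasion tools

for _ in range(50):
    power = [a * (1 + adv * s) for a, adv, s in zip(baseline_appeal, tool_advantage, shares)]
    # Hosts are won in proportion to share-weighted persuasive power, then renormalized.
    weighted = [p * s for p, s in zip(power, shares)]
    total = sum(weighted)
    shares = [w / total for w in weighted]

print([round(s, 3) for s in shares])  # the tooled-up ideology ends up with most hosts
```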
Recall the quotes I led with:
... At that point people won't be able to participate in any online (or offline for that matter) discussions without risking their object-level values being hijacked.
--Wei Dai
What if most people already live in that world? A world in which taking arguments at face value is not a capacity-enhancing tool, but a security vulnerability? Without trusted filters, would they not dismiss highfalutin arguments out of hand … ?
--Benquo
AI-powered memetic warfare makes all humans effectively insane.
--Wei Dai, listing nonstandard AI doom scenarios
I think the case can be made that we already live in this world to some extent, and have for millenia. But if persuasion tools get better relative to countermeasures, the world will be more like this.
This seems to me to be an existential risk factor. It’s also a risk factor for lots of other things, for that matter. Ideological strife can get pretty nasty (e.g. religious wars, gulags, genocides, totalitarianism), and even when it doesn’t, it still often gums things up (e.g. suppression of science, zero-sum mentality preventing win-win-solutions, virtue signalling death spirals, refusal to compromise). This is bad enough already, but it’s doubly bad when it comes at a moment in history where big new collective action problems need to be recognized and solved.
Obvious uses: Advertising, scams, propaganda by authoritarian regimes, etc. will improve. This means more money and power for those who control the persuasion tools. Another important implication may be that democracies would be at a major disadvantage on the world stage compared to totalitarian autocracies. One of many reasons for this is that scissor statements and other divisiveness-sowing tactics, though they may not technically count as persuasion tools, would probably grow more powerful in tandem.
Will the truth rise to the top: Optimistically, one might hope that widespread use of more powerful persuasion tools will be a good thing, because it might create an environment in which the truth “rises to the top” more easily. For example, if every side of a debate has access to powerful argument-making software, maybe the side that wins is more likely to be the side that’s actually correct. I think this is a possibility but I do not think it is probable. After all, it doesn’t seem to be what’s happened in the last two decades or so of widespread internet use, big data, AI, etc. Perhaps, however, we can make it true for some domains at least, by setting the rules of the debate.
Data hoarding: A community’s data (chat logs, email threads, demographics, etc.) may become even more valuable. It can be used by the community to optimize their inward-targeted persuasion, improving group loyalty and cohesion. It can be used against the community if someone else gets access to it. This goes for individuals as well as communities.
Chatbot social hacking viruses: Social hacking is surprisingly effective. The classic example is calling someone pretending to be someone else and getting them to do something or reveal sensitive information. Phishing is like this, only much cheaper (because automated) and much less effective. I can imagine a virus that is close to as good as a real human at social hacking while being much cheaper and able to scale rapidly and indefinitely as it acquires more compute and data. In fact, a virus like this could be made with GPT-3 right now, using prompt programming and “mothership” servers to run the model. (The prompts would evolve to match the local environment being hacked.) Whether GPT-3 is smart enough for it to be effective remains to be seen.
I doubt that persuasion tools will improve discontinuously, and I doubt that they’ll improve massively. But minor and gradual improvements matter too.
Of course, influence over the future might not disappear all on one day; maybe there’ll be a gradual loss of control over several years. For that matter, maybe this gradual loss of control began years ago and continues now...
I think this is potentially (5% credence) the new Cause X, more important than (traditional) AI alignment even. It probably isn’t. But I think someone should look into it at least, more thoroughly than I have.
To be clear, I don’t think it’s likely that we can do much to prevent this stuff from happening. There are already lots of people raising the alarm about filter bubbles, recommendation algorithms, etc. so maybe it’s not super neglected and maybe our influence over it is small. However, at the very least, it’s important for us to know how likely it is to happen, and when, because it helps us prepare. For example, if we think that collective epistemology will have deteriorated significantly by the time crazy AI stuff starts happening, that influences what sorts of AI policy strategies we pursue.
Note that if you disagree with me about the extreme importance of AI alignment, or if you think AI timelines are longer than mine, or if you think fast takeoff is less likely than I do, you should all else equal be more enthusiastic about investigating persuasion tools than I am.
Thanks to Katja Grace, Emery Cooper, Richard Ngo, and Ben Goldhaber for feedback on a draft.
Related previous work:
Stuff I’d read if I was investigating this in more depth:
The stuff here and here
EDIT: This ultrashort sci-fi story by Jack Clark illustrates some of the ideas in this post:
The Narrative Control Department
[A beautiful house in South West London, 2030]
“General, we’re seeing an uptick in memes that contradict our official messaging around Rule 470.”
“What do you suggest we do?”
“Start a conflict. At least three sides. Make sure no one side wins.”
“At once, General.”
And with that, the machines spun up – literally. They turned on new computers and their fans revved up. People with tattoos of skeletons at keyboards high-fived each other. The servers warmed up and started to churn out their fake text messages and synthetic memes, to be handed off to the ‘insertion team’ who would pass the data into a few thousand sock puppet accounts, which would start the fight.
Hours later, the General asked for a report.
“We’ve detected a meaningful rise in inter-faction conflict and we’ve successfully moved the discussion from Rule 470 to a parallel argument about the larger rulemaking process.”
“Excellent. And what about our rivals?”
“We’ve detected a few Russian and Chinese account networks, but they’re staying quiet for now. If they’re mentioning anything at all, it’s in line with our narrative. They’re saving the IDs for another day, I think.”
That night, the General got home around 8pm, and at the dinner table his teenage girls talked about their day.
“Do you know how these laws get made?” the older teenager said. “It’s crazy. I was reading about it online after the 470 blowup. I just don’t know if I trust it.”
“Trust the laws that gave Dad his job? I don’t think so!” said the other teenager.
They laughed, as did the General’s wife. The General stared at the peas on his plate and stuck his fork into the middle of them, scattering so many little green spheres around his plate.
The post Persuasion Tools: AI takeover without takeoff or agency? appeared first on Center on Long-Term Risk.
The post How Roodman's GWP model translates to TAI timelines appeared first on Center on Long-Term Risk.
Now, before I go any further, let me be the first to say that I don’t think we should use this model to predict TAI. This model takes a very broad outside view and is thus inferior to models like Ajeya Cotra’s which make use of more relevant information. (However, it is still useful for rebutting claims that TAI is unprecedented, inconsistent with historical trends, low-prior, etc.) Nevertheless, out of curiosity I thought I’d calculate what the model implies for TAI timelines.
Here is the projection made by Roodman’s model. The red line is real historic GWP data; the splay of grey shades that continues it is the splay of possible futures calculated by the model. The median trajectory is the black line.
I messed around with a ruler to make some rough calculations, marking up the image with blue lines as I went. The big blue line indicates the point on the median trajectory where GWP is 10x what it was in 2019. Eyeballing it, it looks like it happens around 2040, give or take a year. The small vertical blue line indicates the year 2037. The small horizontal blue line indicates GWP in 2037 on the median trajectory.
Thus, it seems that between 2037 and 2040 on the median trajectory, GWP doubles. (One-ninth the distance between 1,000 and 1,000,000 is crossed, which is one-third of an order of magnitude, which is about one doubling).
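For readers who want to check the eyeballed arithmetic, here is the back-of-the-envelope version (a sanity check of the ruler measurements only, not a rerun of Roodman's model):

```python
import math

# The chart's y-axis spans 1,000 to 1,000,000 in the model's GWP units, i.e. 3 orders of magnitude.
axis_span_ooms = math.log10(1_000_000) - math.log10(1_000)   # 3.0

# The ruler measurement: the 2037-2040 segment covers about one-ninth of that span.
segment_ooms = axis_span_ooms / 9                             # ~0.33 orders of magnitude
growth_factor = 10 ** segment_ooms                            # ~2.15, i.e. roughly one doubling

print(round(segment_ooms, 2), round(growth_factor, 2))
```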
This means that TAI happens around 2037 on the median trajectory according to this model, at least according to Ajeya Cotra’s definition of transformative AI as “software which causes a tenfold acceleration in the rate of growth of the world economy (assuming that it is used everywhere that it would be economically profitable to use it)... This means that if TAI is developed in year Y, the entire world economy would more than double by year Y + 4.”
What about the non-median trajectories? Each shade of grey represents 5 percent of the simulated future trajectories, so it looks like there’s about a 20% chance that GWP will be near-infinite by 2040 (and 10% by 2037). So, perhaps-too-hastily extrapolating backwards, maybe this means about a 20% chance of TAI by 2030 (and 10% by 2027).
At this point, I should mention that I disagree with this definition of TAI; I think the point of no return (which is what matters for planning) is reasonably likely to come several years before TAI-by-this-definition appears. (It could also come several years later!) For more on why I think this, see this post. [link to be added when linked post appears]
Finally, let’s discuss some of the reasons not to take this too seriously: This model has been overconfident historically. It was surprised by how fast GDP grew prior to 1970 and surprised by how slowly it grew thereafter. And if you look at the red trendline of actual GWP, it looks like the model may have been surprised in previous eras as well. Moreover, for the past few decades it has consistently predicted a median GWP-date of several decades ahead:
The grey region is the confidence interval the model predicts for when growth goes to infinity. 100 on the x-axis is 1947. So, throughout the 1900’s the model has consistently predicted growth going to infinity in the first half of the twenty-first century, but in the last few decades in particular, it’s displayed a consistent pattern of pushing back the date of expected singularity, akin to the joke about how fusion power is always twenty years away:
Model has access to data up to year X | Year of predicted singularity | Difference |
---|---|---|
1940 | 2029 | 89 |
1950 | 2045 | 95 |
1960 | 2020 | 60 |
1970 | 2010 | 40 |
1980 | 2014 | 34 |
1990 | 2022 | 32 |
2000 | 2031 | 31 |
2010 | 2038 | 28 |
2019 | 2047 | 28 |
The upshot, I speculate, is that if we want to use this model to predict TAI, but we don’t want to take it 100% literally, we should push the median significantly back from 2037 while also increasing the variance significantly. This is because we are currently in a slower-than-the-model-predicts period, but faster-than-the-model-predicts periods are possible and indeed likely to happen around TAI. So probably the status quo will continue and GWP will continue to grow slowly and the model will continue to push back the date of expected singularity… but also at any moment there’s a chance that we’ll transition to a faster-than-the-model-predicts period, in which case TAI is imminent.
(Thanks to Denis Drescher and Max Daniel for feedback on a draft)
The post How Roodman's GWP model translates to TAI timelines appeared first on Center on Long-Term Risk.
The post The date of AI Takeover is not the day the AI takes over appeared first on Center on Long-Term Risk.
The rest of this post explains, justifies, and expands on this obvious but underappreciated idea. (Toby Ord appreciates it; see quote below). I found myself explaining it repeatedly, so I wrote this post as a reference.
AI timelines often come up in career planning conversations. Insofar as AI timelines are short, career plans which take a long time to pay off are a bad idea, because by the time you reap the benefits of the plans it may already be too late. It may already be too late because AI takeover may already have happened.
But this isn’t quite right, at least not when “AI takeover” is interpreted in the obvious way, as meaning that an AI or group of AIs is firmly in political control of the world, ordering humans about, monopolizing violence, etc. Even if AIs don’t yet have that sort of political control, it may already be too late. Here are three examples:
Conclusion: We should remember that when trying to predict the date of AI takeover, what we care about is the date it’s too late for us to change the direction things are going; the date we have significantly less influence over the course of the future than we used to; the point of no return.
This is basically what Toby Ord said about x-risk:
So either because we’ve gone extinct or because there’s been some kind of irrevocable collapse of civilization or something similar. Or, in the case of climate change, where the effects are very delayed that we’re past the point of no return or something like that. So the idea is that we should focus on the time of action and the time when you can do something about it rather than the time when the particular event happens.
Of course, influence over the future might not disappear all on one day; maybe there’ll be a gradual loss of control over several years. For that matter, maybe this gradual loss of control began years ago and continues now... We should keep these possibilities in mind as well.
The post The date of AI Takeover is not the day the AI takes over appeared first on Center on Long-Term Risk.
The post Updates appeared first on Center on Long-Term Risk.
The post Priority areas appeared first on Center on Long-Term Risk.
However, regardless of your background and the different areas listed below: if we believe that you can somehow do high-quality work relevant to s-risks, we are interested in supporting you.
Our research agenda cooperation, conflict, and transformative artificial intelligence (TAI) is ultimately aimed at reducing risks of conflict among TAI-enabled actors. This means that we need to understand how future AI systems might interact with one another, especially in high-stakes situations. CLR researchers and affiliates are currently researching how the design of future AI systems might determine the prospects for avoiding cooperation failure, using the tools of game theory, machine learning, and other disciplines related to multi-agent systems (MAS). You can find an overview of our work in this area here.
Examples of CLR research related to MAS:
Ensuring the safe design of AI systems also poses problems of governance. Because the prospects for avoiding conflict involving TAI systems depend on the design of all of the systems involved, avoiding conflict and promoting cooperation among TAI systems may pose new governance challenges beyond those commonly discussed in the AI risk research community (e.g., here). CLR researchers are currently working to understand potential pathways to cooperation between AI developers on the aspects of their systems which are most relevant to avoiding catastrophic conflict.
Examples of CLR research related to AI governance:
As explained in Section 7 of our research agenda, we are also interested in a better foundational understanding of decision-making, in the hope that this will help us steer towards better outcomes in high-stakes interactions between TAI systems.
An example of CLR research in this area:
Malevolent individuals in positions of power could negatively affect humanity’s long-term trajectory by, for example, exacerbating international conflict or other broad risk factors. With access to advanced technology, they may even pose existential risks. We are interested in a better understanding of malevolent traits and would like to investigate interventions to reduce the influence of individuals exhibiting such traits.
An example of CLR research in this area:
We have only been doing research on s-risks since 2013. So we expect to change our minds about many important questions as we learn more. We are interested in people bringing an independent perspective to the question of what we should prioritize. This can also include seemingly esoteric topics like infinite ethics or extraterrestrials.
Examples of CLR research in this area:
The post Priority areas appeared first on Center on Long-Term Risk.
The post Expression of interest appeared first on Center on Long-Term Risk.
Thank you for your interest—we look forward to hearing from you!
The post Expression of interest appeared first on Center on Long-Term Risk.
The post Reducing long-term risks from malevolent actors appeared first on Center on Long-Term Risk.
Full article
The post Reducing long-term risks from malevolent actors appeared first on Center on Long-Term Risk.
The post Publications appeared first on Center on Long-Term Risk.
Safe Pareto Improvements for Delegated Game Playing. AAMAS, 2021.
Normative Disagreement as a Challenge for Cooperative AI. Cooperative AI workshop and the Strategic ML workshop at NeurIPS, 2021.
Commitment games with conditional information revelation. AAAI 2023, 2022.
Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs. Learning, Evolution, and Games, 2022.
Collaborative game specification: arriving at common models in bargaining. Working paper, March 2021.
Weak identifiability and its consequences in strategic settings. Working paper, February 2021.
Towards cooperation in learning games. Working paper, October 2020.
Robust program equilibrium. Theory and Decision, 86 (1), 2018.
CLR's Research Agenda on Cooperation, Conflict, and TAI. Alignment Forum, December 2019.
Equilibrium and prior selection problems in multipolar deployment. AI Alignment Forum, April 2020.
The "Commitment Races" problem. Alignment Forum, August 2019.
Reducing long-term risks from malevolent actors. Effective Altruism Forum, April 2020.
Sequence on moral anti-realism. Effective Altruism Forum, June 2020.
Tranquilism. CLR Website, July 2017.
A Virtue of Precaution Regarding the Moral Status of Animals with Uncertain Sentience. Journal of Agricultural and Environmental Ethics, 30 (2), 2017.
Bibliography of Suffering-Focused Views. CLR Website, August 2016.
The Importance of Wild-Animal Suffering. Relations, 3 (2), 2015.
Should We Base Moral Judgments on Intentions or Outcomes?. CLR Website, July 2013.
Dealing with Moral Multiplicity. CLR Website, December 2013.
What 2026 looks like. LessWrong, August 2021.
Fun with +12 OOMs of Compute. LessWrong, March 2021.
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain. LessWrong, January 2021.
Against GDP as a metric for timelines and takeoff speeds. LessWrong, December 2020.
Beginner’s guide to reducing s-risks. CLR Website, September 2023.
Persuasion Tools: AI takeover without AGI or agency?. LessWrong, November 2020.
Incentivizing forecasting via social media. Effective Altruism Forum, December 2020.
Sequence on non-agent and multiagent models of mind. LessWrong, January 2019.
Moral realism and AI alignment. LessWrong, September 2018.
Suffering-Focused AI Safety: In Favor of “Fail-Safe” Measures. CLR Website, June 2016.
Room for Other Things: How to adjust if EA seems overwhelming. Effective Altruism Forum, March 2015.
The post Publications appeared first on Center on Long-Term Risk.
The post About us appeared first on Center on Long-Term Risk.
Our goal is to address worst-case risks from the development and deployment of advanced AI systems. We are currently focused on conflict scenarios as well as technical and philosophical aspects of cooperation.
To this end, we do interdisciplinary research, make and recommend grants, and build a community of professionals and other researchers around our priorities, e.g., through events, fellowships, and individual support.
As a team and organization, we are driven by the idea to do the most good we can from an impartial perspective. While we are deeply committed to our values, we are radically open-minded about how to live up to them.
This is a complex challenge. Because our resources are limited, we cannot solve all problems in the world or mitigate all risks facing us in the future. Instead, we need to prioritize. We need to ask ourselves what actions we should take now to have as much of a positive impact as possible.
This has been the guiding question of our organization since our founding in 2013. Starting from a commitment to our values, there are many different considerations that have shaped our current focus. As we learn more, our priorities, or even our mission, may change.
Below we provide a list of some of the crucial considerations that inform our current priorities:
Our primary ethical focus is the reduction of involuntary suffering. This includes human suffering, but also the suffering in non-human animals and potential artificial minds of the future. In accordance with a diverse range of moral views, we believe that suffering, especially extreme suffering, cannot be easily outweighed by large amounts of happiness.
While this leads us to prioritize reducing suffering, we do so within a framework of commonsensical value pluralism and with a strong focus on cooperation. Together with others in the effective altruism community, we want careful ethical reflection to guide the future of our civilization to the greatest extent possible.
The post About us appeared first on Center on Long-Term Risk.
The post Career advice appeared first on Center on Long-Term Risk.
You should do so if you:
We can best help you make sense of how to do the most good in case our priorities overlap. We do not offer general career advice or coaching. If you're interested in that, we recommend the organization 80,000 Hours.
The calls usually take 30-60 minutes. We look forward to talking to you!
The post Career advice appeared first on Center on Long-Term Risk.
The post EAF/FRI are now the Center on Long-Term Risk (CLR) appeared first on Center on Long-Term Risk.
We are renaming for the following reasons:
We would like to thank the many people in our networks who helped us with their ideas and feedback. We are excited about the new name and design, and hope you are, too!
The post EAF/FRI are now the Center on Long-Term Risk (CLR) appeared first on Center on Long-Term Risk.
The post Our plans for 2020 appeared first on Center on Long-Term Risk.
We are building a global community of researchers and professionals working on reducing risks of astronomical suffering (s-risks). (Read more about us.)
We are a London-based nonprofit. Previously, we were located in Switzerland (Basel) and Germany (Berlin).
For an overview of our strategic thinking, see the following pieces:
The best work on reducing s-risks cuts across a broad range of academic disciplines and interventions. Our recent research agenda, for instance, draws from computer science, economics, political science, and philosophy. That means (a) we must work in many different disciplines and (b) find people who can bridge disciplinary boundaries. The longtermism community brings together people with diverse backgrounds who understand our prioritization and share it to some extent. For this reason, we focus on making reducing s-risks a well-established priority in that community.
Inspired by GiveWell’s self-evaluations, we are tracking our progress with a set of deliberately vague performance questions:
Our team will answer these questions at the end of 2020.
We aim to investigate research questions listed in our research agenda titled “Cooperation, Conflict, and Transformative Artificial Intelligence.” We explain our focus on cooperation and conflict in the preface:
“S-risks might arise by malevolence, by accident, or in the course of conflict. (…) We believe that s-risks arising from conflict are among the most important, tractable, and neglected of these. In particular, strategic threats by powerful AI agents or AI-assisted humans against altruistic values may be among the largest sources of expected suffering. Strategic threats have historically been a source of significant danger to civilization (the Cold War being a prime example). And the potential downsides from such threats, including those involving large amounts of suffering, may increase significantly with the emergence of transformative AI systems.”
Topics covered by our research agenda include:
We did not list some topics in the research agenda because they did not fit its scope, but we consider them very important:
In practice, our publications and grants will be determined to a large extent by the ideas and motivation of the researchers. We understand the above list of topics as a menu for researchers to choose from, and we expect that our actual work will only cover a small portion of the relevant issues. We hope to collaborate with other AI safety research groups on some of these topics.
We are looking to grow our research team, so we would be excited to hear from you if you think you might be a good fit! We are also considering running a hiring round based on our research agenda as well as a summer research fellowship.
We aim to develop a global research community, promoting regular exchange and coordination between researchers whose work contributes to reducing s-risks.
S-risks from conflict. In 2019, we mainly worked on s-risks as a result of conflicts involving advanced AI systems:
We also circulated nine internal articles and working papers with the participants of our research workshops.
Foundational work on decision theory. This work might be relevant in the context of acausal interactions (see the last section of the research agenda):
Miscellaneous publications:
We think it makes sense for donors to support us if:
For donors who do not agree with these points, we recommend giving to the donor lottery (or the EA Funds). We recommend that donors who are interested in the CLR Fund support CLR instead because the CLR Fund has a limited capacity to absorb further funding.
Would you like to support us? Make a donation.
If you have any questions or comments, we look forward to hearing from you; you can also send us feedback anonymously. We greatly appreciate any thoughts that could help us improve our work. Thank you!
I would like to thank Tobias Baumann, Max Daniel, Ruairi Donnelly, Lukas Gloor, Chi Nguyen, and Stefan Torges for giving feedback on this article.
The post Our plans for 2020 appeared first on Center on Long-Term Risk.
The post The Evidentialist's Wager appeared first on Center on Long-Term Risk.
Suppose that an altruistic and morally motivated agent who is uncertain between evidential decision theory (EDT) and causal decision theory (CDT) finds herself in a situation in which the two theories give conflicting verdicts. We argue that even if she has significantly higher credence in CDT, she should nevertheless act in accordance with EDT. First, we claim that the appropriate response to normative uncertainty is to hedge one’s bets. That is, if the stakes are much higher on one theory than another, and the credences you assign to each of these theories aren’t very different, then it’s appropriate to choose the option which performs best on the high-stakes theory. Second, we show that, given the assumption of altruism, the existence of correlated decision-makers will increase the stakes for EDT but leave the stakes for CDT unaffected. Together these two claims imply that whenever there are sufficiently many correlated agents, the appropriate response is to act in accordance with EDT.
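A stylized way to see the structure of the wager (our own compression, not the paper's formal setup): suppose the agent has credence $p$ in CDT and $1-p$ in EDT, the CDT-recommended act is better by $c$ if CDT is true, the EDT-recommended act is better by $e$ per correlated decision-maker if EDT is true, and there are $N$ such correlated agents whose welfare the altruist values. Expected-value-style hedging then favors the EDT-recommended act whenever

$$
(1-p)\,N\,e \;>\; p\,c
\quad\Longleftrightarrow\quad
N \;>\; \frac{p}{1-p}\cdot\frac{c}{e},
$$

so even a modest credence in EDT dominates once $N$ is large, which is the role the correlated decision-makers play in the argument.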
Read the paper on the website of the Global Priorities Institute.
The post The Evidentialist's Wager appeared first on Center on Long-Term Risk.
The post Imprint appeared first on Center on Long-Term Risk.
Center on Long-term Risk
A Charitable Incorporated Organisation (CIO) registered with the Charity Commission of England and Wales. Registration number 1195079.
Trustees
Disclaimer
The Center on Long-term Risk makes every effort to ensure that the information on its website (www.longtermrisk.org) is always correct and up-to-date and, if necessary, changes or supplements it on an ongoing basis and without prior notice. Nevertheless, CLR cannot accept any liability for correctness, timeliness, and completeness.
Our website contains links to external websites of third parties on whose contents we have no influence. Therefore, we cannot assume any liability for this external content. The respective provider or operator of the pages is always responsible for the content of the linked pages. A permanent control of the content of the linked pages is unreasonable without concrete evidence of a violation of the law. However, we will remove such links as soon as we become aware of any infringements of the law.
Copyright
The website of the Center on Long-term Risk (longtermrisk.org) including all its parts such as texts and images is protected by copyright. Any use outside the limits of copyright law is prohibited without permission. The content of the website may not be passed on to third parties for a fee.
Further information
Our transparency page contains further information on the Center on Long-term Risk.
The post Imprint appeared first on Center on Long-Term Risk.
The post CLR Fund appeared first on Center on Long-Term Risk.
The post Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda appeared first on Center on Long-Term Risk.
Author: Jesse Clifton
The Center on Long-Term Risk's research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence outlines what we think are the most promising avenues for developing technical and governance interventions aimed at avoiding conflict between transformative AI systems. We draw on international relations, game theory, behavioral economics, machine learning, decision theory, and formal epistemology.
While our research agenda captures many topics we are interested in, the focus of CLR's research is broader.
We appreciate all comments and questions. We're also looking for people to work on the questions we outline. So if you're interested or know people who might be, please get in touch with us by emailing info@longtermrisk.org.
Transformative artificial intelligence (TAI) may be a key factor in the long-run trajectory of civilization. A growing interdisciplinary community has begun to study how the development of TAI can be made safe and beneficial to sentient life (Bostrom 2014; Russell et al., 2015; OpenAI, 2018; Ortega and Maini, 2018; Dafoe, 2018). We present a research agenda for advancing a critical component of this effort: preventing catastrophic failures of cooperation among TAI systems. By cooperation failures we refer to a broad class of potentially-catastrophic inefficiencies in interactions among TAI-enabled actors. These include destructive conflict; coercion; and social dilemmas (Kollock, 1998; Macy and Flache, 2002) which destroy value over extended periods of time. We introduce cooperation failures at greater length in Section 1.1.
Karnofsky (2016) defines TAI as "AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution". Such systems range from the unified, agent-like systems which are the focus of, e.g., Yudkowsky (2013) and Bostrom (2014), to the "comprehensive AI services" envisioned by Drexler (2019), in which humans are assisted by an array of powerful domain-specific AI tools. In our view, the potential consequences of such technology are enough to motivate research into mitigating risks today, despite considerable uncertainty about the timeline to TAI (Grace et al., 2018) and the nature of TAI development. Given these uncertainties, we will often discuss "cooperation failures" in fairly abstract terms and focus on questions relevant to a wide range of potential modes of interaction between AI systems. Much of our discussion will pertain to powerful agent-like systems, with general capabilities and expansive goals. But whereas the scenarios that concern much of the existing long-term-focused AI safety research involve agent-like systems, an important feature of catastrophic cooperation failures is that they may also occur among human actors assisted by narrow-but-powerful AI tools.
Cooperation has long been studied in many fields: political theory, economics, game theory, psychology, evolutionary biology, multi-agent systems, and so on. But TAI is likely to present unprecedented challenges and opportunities arising from interactions between powerful actors. The size of losses from bargaining inefficiencies may massively increase with the capabilities of the actors involved. Moreover, features of machine intelligence may lead to qualitative changes in the nature of multi-agent systems. These include changes in:
These changes call for the development of new conceptual tools, building on and modifying the many relevant literatures which have studied cooperation among humans and human societies.
Many of the cooperation failures in which we are interested can be understood as mutual defection in a social dilemma. Informally, a social dilemma is a game in which everyone is better off if everyone cooperates, yet individual rationality may lead to defection. Formally, following Macy and Flache (2002), we will say that a two-player normal-form game with payoffs denoted as in Table 1 is a social dilemma if the payoffs satisfy these criteria:
Nash equilibrium (i.e., a choice of strategy by each player such that no player can benefit from unilaterally deviating) has been used to analyze failures of cooperation in social dilemmas. In the Prisoner's Dilemma (PD), the unique Nash equilibrium is mutual defection. In Stag Hunt, there is a cooperative equilibrium which requires agents to coordinate, and a defecting equilibrium which does not. In Chicken, there are two pure-strategy Nash equilibria (Player 1 plays Defect while Player 2 plays Cooperate, and vice versa) as well as a mixed-strategy equilibrium in which players independently randomize between Cooperate and Defect. The mixed-strategy equilibrium or uncoordinated equilibrium selection may therefore result in a crash (i.e., mutual defection).
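To make these equilibrium claims concrete, the following is a minimal sketch (with illustrative payoff numbers, not taken from Table 1) which enumerates the pure-strategy Nash equilibria of the three games:

```python
from itertools import product

# Payoff matrices: payoffs[(row_action, col_action)] = (row_payoff, col_payoff).
# "C" = cooperate, "D" = defect. The numbers are illustrative only.
GAMES = {
    "Prisoner's Dilemma": {("C", "C"): (3, 3), ("C", "D"): (0, 4),
                           ("D", "C"): (4, 0), ("D", "D"): (1, 1)},
    "Stag Hunt":          {("C", "C"): (4, 4), ("C", "D"): (0, 3),
                           ("D", "C"): (3, 0), ("D", "D"): (2, 2)},
    "Chicken":            {("C", "C"): (3, 3), ("C", "D"): (1, 4),
                           ("D", "C"): (4, 1), ("D", "D"): (0, 0)},
}

def pure_nash_equilibria(payoffs):
    """Profiles from which no player gains by unilaterally deviating."""
    actions = ["C", "D"]
    equilibria = []
    for row, col in product(actions, actions):
        row_ok = all(payoffs[(row, col)][0] >= payoffs[(alt, col)][0] for alt in actions)
        col_ok = all(payoffs[(row, col)][1] >= payoffs[(row, alt)][1] for alt in actions)
        if row_ok and col_ok:
            equilibria.append((row, col))
    return equilibria

for name, payoffs in GAMES.items():
    print(name, pure_nash_equilibria(payoffs))
# Prisoner's Dilemma -> [('D', 'D')]; Stag Hunt -> [('C', 'C'), ('D', 'D')];
# Chicken -> [('C', 'D'), ('D', 'C')] (plus a mixed equilibrium not enumerated here).
```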
Social dilemmas have been used to model cooperation failures in international politics; Snyder (1971) reviews applications of PD and Chicken, and Jervis (1978) discusses each of the classic social dilemmas in his influential treatment of the security dilemma. Among the most prominent examples is the model of arms races as a PD: both players build up arms (defect) despite the fact that disarmament (cooperation) is mutually beneficial, as neither wants to be the party who disarms while their counterpart builds up. Social dilemmas have likewise been applied to a number of collective action problems, such as the use of a common resource (cf. the famous "tragedy of the commons" (Hardin, 1968; Perolat et al., 2017)) and pollution. See Dawes (1980) for a review focusing on such cases.
Many interactions are not adequately modeled by simple games like those in Table 1. For instance, states facing the prospect of military conflict have incomplete information. That is, each party has private information about the costs and benefits of conflict, their military strength, and so on. They also have the opportunity to negotiate over extended periods; to monitor one another's activities to some extent; and so on. The literature on bargaining models of war (or "crisis bargaining") is a source of more complex analyses (e.g., Powell 2002; Kydd 2003; Powell 2006; Smith and Stam 2004; Fey and Ramsay 2007, 2011; Kydd 2010). In a classic article from this literature, Fearon (1995) defends three now-standard hypotheses as the most plausible explanations for why rational agents would go to war:
Another example of potentially disastrous cooperation failure is extortion (and other compellent threats), and the execution of such threats by powerful agents. In addition to threats being harmful to their target, the execution of threats seems to constitute an inefficiency: much like going to war, threateners face the direct costs of causing harm and, in some cases, risks from retaliation or legal action.
The literature on crisis bargaining between rational agents may also help us to understand the circumstances under which compellent threats are made and carried out, and point to mechanisms for avoiding these scenarios.
Countering the hypothesis that war between rational agents A and B can occur as a result of indivisible stakes (for example, a territory), Powell (2006, p. 178) presents the case in Example 1.1.1, which shows that a lottery allocating the full stakes to each agent with probability equal to its chance of winning a war Pareto-dominates fighting.
Example 1.1.1 (Simulated conflict).
Consider two countries disputing a territory which has value $v$ for each of them. Suppose that the row country has probability $p$ of winning a conflict, and that conflict costs $c$ for each country, so that their payoffs for Surrendering and Fighting are as in the top matrix in Table 2. However, suppose the countries agree on the probability $p$ that the row player wins; perhaps they have access to a mutually trusted war-simulator which has the row player winning in a fraction $p$ of simulations. Then, instead of engaging in a real conflict, they could allocate the territory based on a draw from the simulator. Playing this game is preferable, as it saves each country the cost $c$ of actual conflict.
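To spell out the comparison, here is a sketch of the expected payoffs under the assumption (standard in this kind of example, though Table 2 is not reproduced here) that fighting costs each side $c > 0$:

```latex
% Row country's expected payoffs (the column country is symmetric, with 1 - p in place of p):
\begin{align*}
\text{Fighting:}       &\quad p\,v - c \\
\text{Simulated draw:} &\quad p\,v
\end{align*}
% Since c > 0, both countries strictly prefer allocating the territory by a draw from the
% trusted simulator, for any agreed-upon p: the simulated conflict Pareto-dominates fighting.
```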
Because of the possibility of constructing simulated conflict to allocate indivisible stakes, the most plausible of Fearon's rationalist explanations for war seem to be (1) the difficulty of credible commitment and (2) incomplete information (and incentives to misrepresent that information). Section 3 discusses credibility in TAI systems, and Section 4 discusses several issues related to the resolution of conflict under incomplete information.
Lastly, while game theory provides a powerful framework for modeling cooperation failure, TAI systems or their operators will not necessarily be well-modeled as rational agents. For example, systems involving humans in the loop, or black-box TAI agents trained by evolutionary methods, may be governed by a complex network of decision-making heuristics not easily captured in a utility function. We discuss research directions that are particularly relevant to cooperation failures among these kinds of agents in Sections 5.2 (Multi-agent training) and 6 (Humans in the loop).
We list the sections of the agenda below. Different sections may appeal to readers from different backgrounds. For instance, Section 5 (Contemporary AI architectures) may be most interesting to those with some interest in machine learning, whereas Section 7 (Foundations of rational agency) will be more relevant to readers with an interest in formal epistemology or the philosophical foundations of decision theory. Tags after the description of each section indicate the fields most relevant to that section.
Some sections contain examples illustrating technical points, or explaining in greater detail a possible technical research direction.
Public policy; International relations; Game theory; Artificial intelligence
Game theory; Behavioral economics; Artificial intelligence
Game theory; International relations; Artificial intelligence
Machine learning; Game theory
Machine learning; Behavioral economics
Formal epistemology; Philosophical decision theory; Artificial intelligence
We would like to better understand the ways the strategic landscape among key actors (states, AI labs, and other non-state actors) might look at the time TAI systems are deployed, and to identify levers for shifting this landscape towards widely beneficial outcomes. Our interests here overlap with Dafoe (2018)'s AI governance research agenda (see especially the "Technical Landscape" section), though we are most concerned with questions relevant to risks associated with cooperation failures.
From the perspective of reducing risks from cooperation failures, it is prima facie preferable if the transition to TAI results in a unipolar rather than a distributed outcome: The greater the chances of a single dominant actor, the lower the chances of conflict (at least after that actor has achieved dominance). But the analysis is likely not so simple, if the international relations literature on the relative safety of different power distributions (e.g., Deutsch and Singer 1964; Waltz 1964; Christensen and Snyder 1990) is any indication. We are therefore especially interested in a more fine-grained analysis of possible developments in the balance of power. In particular, we would like to understand the likelihood of the various scenarios, their relative safety with respect to catastrophic risk, and the tractability of policy interventions to steer towards safer distributions of TAI-related power. Relevant questions include:
Agents' ability to make credible commitments is a critical aspect of multi-agent systems. Section 3 is dedicated to technical questions around credibility, but it is also important to consider the strategic implications of credibility and commitment.
One concerning dynamic which may arise between TAI systems is commitment races (Kokotajlo, 2019a). In the game of Chicken (Table 1), both players have reason to commit to driving ahead as soon as possible, by conspicuously throwing out their steering wheels. Likewise, AI agents (or their human overseers) may want to make certain commitments (for instance, commitments to carry through with a threat if their demands aren't met) as soon as possible, in order to improve their bargaining positions. As with Chicken, this is a dangerous situation. Thus we would like to explore possibilities for curtailing such dynamics.
Finally, in human societies, improvements in the ability to make credible commitments (e.g., to sign contracts enforceable by law) seem to have facilitated large gains from trade through more effective coordination, longer-term cooperation, and various other mechanisms (e.g., Knack and Keefer 1995; North 1991; Greif et al. 1994; Dixit 2003).
Christiano (2018a) defines "the alignment problem" as "the problem of building powerful AI systems that are aligned with their operators". Related problems, as discussed by Bostrom (2014), include the "value loading" (or "value alignment") problem (the problem of ensuring that AI systems have goals compatible with the goals of humans), and the "control problem" (the general problem of controlling a powerful AI agent). Despite the recent surge in attention on AI risk, there are few detailed descriptions of what a future with misaligned AI systems might look like (but see Sotala 2018; Christiano 2019; Dai 2019 for examples). Better models of the ways in which misaligned AI systems could arise and how they might behave are important for our understanding of critical interactions among powerful actors in the future.
According to the offense-defense theory, the likelihood and nature of conflict depend on the relative efficacy of offensive and defensive security strategies (Jervis, 2017, 1978; Glaser, 1997). Technological progress seems to have been a critical driver of shifts in the offense-defense balance (Garfinkel and Dafoe, 2019), and the advent of powerful AI systems in strategic domains like computer security or military technology could lead to shifts in that balance.
Besides forecasting future dynamics, we are curious as to what lessons can be drawn from case studies of cooperation failures, and policies which have mitigated or exacerbated such risks. For example: Cooperation failures among powerful agents representing human values may be particularly costly when threats are involved. Examples of possible case studies include nuclear deterrence, ransomware (Gazet, 2010) and its implications for computer security, the economics of hostage-taking (Atkinson et al., 1987; Shortland and Roberts, 2019), and extortion rackets (Superti, 2009). Such case studies might investigate costs to the threateners, gains for the threateners, damages to third parties, factors that make agents more or less vulnerable to threats, existing efforts to combat extortionists, etc. While it is unclear how informative such case studies will be about interactions between TAI systems, they may be particularly relevant in humans-in-the-loop scenarios (Section 6).
Lastly, in addition to case studies of cooperation failures themselves, studying how other instances of formal research have influenced (or failed to influence) critical real-world decisions would help us prioritize the research directions presented in this agenda. Particularly relevant examples include the application of game theory to geopolitics (see Weintraub (2017) for a review of game theory and decision-making in the Cold War), of cryptography to computer security, and of formal methods to the verification of software programs.
The remainder of this agenda largely concerns technical questions related to interactions involving TAI-enabled systems. A key strategic question running throughout is: What are the potential downsides to increased technical understanding in these areas? It is possible, for instance, that technical and strategic insights related to credible commitment increase rather than decrease the efficacy and likelihood of compellent threats. Moreover, the naive application of idealized models of rationality may do more harm than good; it has been argued that this was the case in some applications of formal methods to Cold War strategy (see, for instance, Kaplan 1991). Thus the exploration of the dangers and limitations of technical and strategic progress is itself a critical research direction.
Credibility is a central issue in strategic interaction. By credibility, we refer to the issue of whether one agent has reason to believe that another will do what they say they will do. Credibility (or lack thereof) plays a crucial role in the efficacy of contracts (Fehr et al., 1997; Bohnet et al., 2001), negotiated settlements for avoiding destructive conflict (Powell, 2006), and commitments to carry out (or refuse to give in to) threats (e.g., Kilgour and Zagare 1991; Konrad and Skaperdas 1997).
In game theory, the fact that Nash equilibria (Section 1.1) sometimes involve non-credible threats motivates a refined solution concept called subgame perfect equilibrium (SPE). An SPE is a Nash equilibrium of an extensive-form game in which a Nash equilibrium is also played at each subgame. In the threat game depicted in Figure 1, "carry out" is not played in an SPE, because the threatener has no reason to carry out the threat once the threatened party has refused to give in; that is, "carry out" is not part of a Nash equilibrium of the subgame played after the threatened party refuses to give in.
So in an SPE-based analysis of one-shot threat situations between rational agents, threats are never carried out because they are not credible (i.e., they violate subgame perfection).
However, agents may establish credibility in the case of repeated interactions by repeatedly making good on their claims (Sobel, 1985). Secondly, despite the fact that carrying out a threat in the one-shot threat game violates subgame perfection, it is a well-known result from behavioral game theory that humans typically refuse unfair splits in the Ultimatum Game (Güth et al., 1982; Henrich et al., 2006), which is equivalent to carrying out the threat in the one-shot threat game. So executing commitments which are irrational (by the SPE criterion) may still be a feature of human-in-the-loop systems (Section 6), or perhaps of systems which have some humanlike game-theoretic heuristics in virtue of being trained in multi-agent environments (Section 5.2). Lastly, threats may become credible if the threatener has credibly committed to carrying out the threat (in the case of the game in Figure 1, this means convincing the opponent that they have removed the option to "Not carry out", or made it costly). There is a considerable game-theoretic literature on credible commitment, both on how credibility can be achieved (Schelling, 1960) and on the analysis of games under the assumption that credible commitment is possible (Von Stackelberg, 2010; Nash, 1953; Muthoo, 1996; Bagwell, 1995).
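The subgame-perfection argument can be checked mechanically by backward induction. Below is a small sketch with hypothetical payoffs (Figure 1's actual numbers are not reproduced here), chosen so that carrying out the threat is costly to the threatener: the threatener would back down after a refusal, so the target refuses, and the threat is never made.

```python
# One-shot threat game solved by backward induction.
# Each outcome maps to (threatener_payoff, target_payoff); the numbers are hypothetical.
PAYOFFS = {
    "no_threat":               (0.0, 0.0),
    "threat_give_in":          (2.0, -2.0),
    "threat_refuse_carry_out": (-1.0, -5.0),   # carrying out is costly to both
    "threat_refuse_back_down": (-0.5, 0.0),
}

def threatener_after_refusal():
    # Last mover: carry out the threat only if it pays more than backing down.
    carry = PAYOFFS["threat_refuse_carry_out"]
    back = PAYOFFS["threat_refuse_back_down"]
    return ("carry_out", carry) if carry[0] > back[0] else ("back_down", back)

def target_response():
    # The target anticipates the threatener's last move.
    _, refusal_outcome = threatener_after_refusal()
    give_in = PAYOFFS["threat_give_in"]
    return ("give_in", give_in) if give_in[1] > refusal_outcome[1] else ("refuse", refusal_outcome)

def threatener_initial():
    _, response_outcome = target_response()
    no_threat = PAYOFFS["no_threat"]
    return ("threaten", response_outcome) if response_outcome[0] > no_threat[0] else ("no_threat", no_threat)

print(threatener_after_refusal())  # ('back_down', ...): carrying out is not credible
print(target_response())           # ('refuse', ...): the target calls the bluff
print(threatener_initial())        # ('no_threat', ...): so no threat is made in the SPE
```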
It is possible that TAI systems may be relatively transparent to one another; capable of self-modification or of constructing sophisticated commitment devices; and capable of making various other "computer-mediated contracts" (Varian, 2010). See also the lengthy discussions in Garfinkel (2018) and Kroll et al. (2016), discussed in Footnote 1, of the potential implications of cryptographic technology for credibility.
We want to understand how plausible changes in the ability to make credible commitments affect risks from cooperation failures.
Tennenholtz (2004) introduced program games, in which players submit programs that have access to the source codes of their counterparts. Program games provide a model of interaction under mutual transparency. Tennenholtz showed that in the Prisoner's Dilemma, both players submitting Algorithm 1 (a program which cooperates if its counterpart's source code is identical to its own, and defects otherwise) is a program equilibrium (that is, a Nash equilibrium of the corresponding program game). Thus agents may have an incentive to participate in program games, as these promote more cooperative outcomes than the corresponding non-program games.
For these reasons, program games may be helpful to our understanding of interactions among advanced AIs.
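A minimal sketch of the kind of program involved, assuming each submitted program can read its counterpart's source code; the name clique_bot, the use of Python's inspect module, and the exact-match rule are illustrative stand-ins for the construction in Tennenholtz (2004), not a reproduction of it.

```python
import inspect

def clique_bot(my_source: str, opponent_source: str) -> str:
    """Cooperate iff the opponent's program is syntactically identical to mine."""
    return "C" if opponent_source == my_source else "D"

# Both players submit the same program, and each receives the other's source code.
source = inspect.getsource(clique_bot)
print(clique_bot(source, source), clique_bot(source, source))  # C C: mutual cooperation

# A unilateral deviation to an unconditional defector is met with defection,
# so (clique_bot, clique_bot) is a Nash equilibrium of the program game.
defector_source = "def defect_bot(my_source, opponent_source): return 'D'"
print(clique_bot(source, defector_source))  # D
```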
Other models of strategic interaction between agents who are transparent to one another have been studied (more on this in Section 5); following Critch (2019), we will call this broader area open-source game theory. Game theory with source-code transparency has been studied by Fortnow 2009; Halpern and Pass 2018; LaVictoire et al. 2014; Critch 2019; Oesterheld 2019, and models of multi-agent learning under transparency are given by Brafman and Tennenholtz (2003); Foerster et al. (2018). But open-source game theory is in its infancy and many challenges remain.
In other sections of the agenda, we have proposed research directions for improving our general understanding of cooperation and conflict among TAI systems. In this section, on the other hand, we consider several families of strategies designed to actually avoid catastrophic cooperation failure. The idea of such "peaceful bargaining strategies" is, roughly speaking, to find strategies which are 1) peaceful (i.e., avoid conflict) and 2) preferred by rational agents to non-peaceful strategies.
We are not confident that peaceful bargaining mechanisms will be used by default. First, in human-in-the-loop scenarios, the bargaining behavior of TAI systems may be dictated by human overseers, whom we do not expect to systematically use rational bargaining strategies (Section 6.1). Even in systems whose decision-making is more independent of humans, evolution-like training methods could give rise to non-rational, human-like bargaining heuristics (Section 5.2). Even among rational agents, because there may be many cooperative equilibria, additional mechanisms for ensuring coordination may be necessary to avoid conflict arising from the selection of different equilibria (see Example 4.1.1). Finally, the examples in this section suggest that there may be path-dependencies in the engineering of TAI systems (for instance, in making certain aspects of TAI systems more transparent to their counterparts) which determine the extent to which peaceful bargaining mechanisms are available.
In the first subsection, we present some directions for identifying mechanisms which could implement peaceful settlements, drawing largely on existing ideas in the literature on rational bargaining. In the second subsection, we sketch a proposal for how agents might mitigate downsides from threats by effectively modifying their utility function. This proposal is called surrogate goals.
As discussed in Section 1.1, there are two standard explanations for war among rational agents: credibility (the agents cannot credibly commit to the terms of a peaceful settlement) and incomplete information (the agents have differing private information which makes each of them optimistic about their prospects of winning, and incentives not to disclose or to misrepresent this information).
Fey and Ramsay (2011) model crisis bargaining under incomplete information. They show that in 2-player crisis bargaining games with voluntary agreements (players are able to reject a proposed settlement if they think they will be better off going to war); mutually known costs of war; unknown types $t_1, t_2$ measuring the players' military strength; a commonly known function $p(t_1, t_2)$ giving the probability of player 1 winning when the true types are $(t_1, t_2)$; and a common prior over types, a peaceful settlement exists if and only if the costs of war are sufficiently large. Such a settlement must compensate each player's strongest possible type by the amount they expect to gain in war.
Potential problems facing the resolution of conflict in such cases include:
Recall that another form of cooperation failure is the simultaneous commitment to strategies which lead to catastrophic threats being carried out (Section 2.2). Such "commitment games'' may be modeled as a game of Chicken (Table 1), where Defection corresponds to making commitments to carry out a threat if one's demands are not met, while Cooperation corresponds to not making such commitments. Thus we are interested in bargaining strategies which avoid mutual Defection in commitment games. Such a strategy is sketched in Example 4.1.1.
Example 4.1.1 (Careful commitments).
Consider two agents with access to commitment devices. Each may decide to commit to carrying out a threat if their counterpart does not forfeit some prize (of value $v$ to each party). As before, call this decision $D$. However, they may instead commit to carrying out their threat only if their counterpart does not agree to a certain split of the prize (say, a split in which Player 1 gets a share $s$ and Player 2 gets $1 - s$).
Call this commitment $C(s)$, for "cooperating with split $s$".
When would an agent prefer to make the more sophisticated commitment $C(s)$? In order to say whether an agent expects to do better by making $C(s)$, we need to be able to say how well they expect to do in the "original" commitment game where their choice is between $C$ and $D$. This is not straightforward, as Chicken admits three Nash equilibria. However, it may be reasonable to regard the players' expected values under the mixed-strategy Nash equilibrium as the values they expect from playing this game. Thus, the split $s$ could be chosen such that $s v$ and $(1 - s) v$ exceed Player 1's and Player 2's respective expected payoffs under the mixed-strategy Nash equilibrium. Many such splits may exist. This calls for a way of selecting among them, for which we may turn to a bargaining solution concept such as Nash (Nash, 1950) or Kalai-Smorodinsky (Kalai et al., 1975). If each player uses the same bargaining solution, then each will prefer committing to honor the resulting split of the prize over playing the original threat game, and carried-out threats will be avoided.
Of course, this mechanism is brittle in that it relies on a single take-it-or-leave-it proposal which will fail if the agents use different bargaining solutions, or have slightly different estimates of each player's payoffs. However, this could be generalized to a commitment to a more complex and robust bargaining procedure, such as an alternating-offers procedure (Rubinstein 1982; Binmore et al. 1986; see Muthoo (1996) for a thorough review of such models) or the sequential cooperative bargaining procedure of Van Damme (1986).
Finally, note that in the case where there is uncertainty over whether each player has a commitment device, sufficiently high stakes will mean that players with commitment devices will still have Chicken-like payoffs. So this model can be straightforwardly extended to cases where the credibility of a threat comes in degrees. An example of a simple bargaining procedure to commit to is a Bayesian version of the Nash bargaining solution (Harsanyi and Selten, 1972).
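As a sketch of the selection step in Example 4.1.1: assume a prize normalized to value 1, take the players' (hypothetical) expected payoffs under the mixed-strategy equilibrium of the commitment game as the disagreement point, and pick the split by maximizing the Nash product. None of the numbers below come from the agenda.

```python
import numpy as np

def nash_bargaining_split(d1: float, d2: float, grid: int = 10_001) -> float:
    """Split s of a unit prize maximizing the Nash product (s - d1) * ((1 - s) - d2)."""
    s_values = np.linspace(0.0, 1.0, grid)
    gains_1 = s_values - d1
    gains_2 = (1.0 - s_values) - d2
    nash_product = np.where((gains_1 > 0) & (gains_2 > 0), gains_1 * gains_2, -np.inf)
    return float(s_values[np.argmax(nash_product)])

# Hypothetical disagreement payoffs (e.g., mixed-equilibrium values of the threat game).
d1, d2 = 0.20, 0.35
print(nash_bargaining_split(d1, d2))  # ~0.425
print(d1 + (1 - d1 - d2) / 2)         # 0.425: the closed-form Nash bargaining split agrees
```

Committing to the resulting split gives each player at least their disagreement value, so, on this sketch, both prefer it to the original game of simultaneous threat commitments.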
Lastly, see Kydd (2010) for a review of potential applications of the literature on rational crisis bargaining to resolving real-world conflict.
In this section, we introduce surrogate goals, a recent proposal for limiting the downsides from cooperation failures (Baumann, 2017, 2018). We will focus on the phenomenon of coercive threats (for game-theoretic discussion see Ellsberg (1968); Harrenstein et al. (2007)), though the technique is more general. The proposal is: in order to deflect threats against the things it terminally values, an agent adopts a new (surrogate) goal. This goal may still be threatened, but threats carried out against this goal are benign. Furthermore, the surrogate goal is chosen such that it incentivizes at most marginally more threats.
Example 4.2.1 gives an operationalization of surrogate goals in a threat game.
Example 4.2.1 (Surrogate goals via representatives)
Consider the game between Threatener and Target, where Threatener makes a demand of Target, such as giving up some resources. Threatener can, at some cost, commit to carrying out a threat against Target. Target can likewise commit to give in to such threats or not. A simple model of this game is given in the payoff matrix in Table 3 (a normal-form variant of the threat game discussed in Section 3).
Unfortunately, players may sometimes play (Threaten, Not give in). For example, this may be due to uncoordinated selection among the two pure-strategy Nash equilibria ((Give in, Threaten) and (Not give in, Not threaten)).
But suppose that, in the above scenario, Target is capable of certain kinds of credible commitments, or otherwise is represented by an agent, Target’s Representative, who is. Then Target or Target’s Representative may modify its goal architecture to adopt a surrogate goal whose fulfillment is not actually valuable to that player, and which is slightly cheaper for Threatener to threaten. (More generally, Target could modify itself to commit to acting as if it had a surrogate goal in threat situations.) If this modification is credible, then it is rational for Threatener to threaten the surrogate goal, obviating the risk of threats against Target’s true goals being carried out.
As a first pass at a formal analysis: adopting an additional threatenable goal adds a column to the payoff matrix, as in Table 4, and this column weakly dominates the old threat column (i.e., the threat against Target's true goals). So a rational player would never threaten Target's true goal. Target does not itself care about the new type of threat being carried out, so its utilities are given by the bold numbers in Table 4.
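A toy numerical check of the weak-dominance claim (Tables 3 and 4 are not reproduced here, so the numbers below are hypothetical): threatening the surrogate goal is assumed to be slightly cheaper for Threatener than threatening Target's true goal, and a carried-out threat against the surrogate goal costs Target nothing.

```python
# Threatener's payoffs for each (threat target, Target response); numbers are hypothetical.
threatener = {
    ("none",      "give_in"): 0.0,  ("none",      "refuse"): 0.0,
    ("true",      "give_in"): 1.0,  ("true",      "refuse"): -1.0,
    ("surrogate", "give_in"): 1.1,  ("surrogate", "refuse"): -0.9,  # slightly cheaper to threaten
}
# Target's true payoffs: carried-out threats against the surrogate goal are harmless to it.
target = {
    ("none",      "give_in"): -1.0, ("none",      "refuse"): 0.0,
    ("true",      "give_in"): -1.0, ("true",      "refuse"): -10.0,
    ("surrogate", "give_in"): -1.0, ("surrogate", "refuse"): 0.0,
}

responses = ["give_in", "refuse"]
dominates = all(threatener[("surrogate", r)] >= threatener[("true", r)] for r in responses)
print("Surrogate threat weakly dominates true-goal threat for Threatener:", dominates)  # True
print("Worst Target payoff if only the surrogate goal is threatened:",
      min(target[("surrogate", r)] for r in responses))  # -1.0, versus -10.0 for a carried-out true threat
```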
This application of surrogate goals, in which a threat game is already underway but players have the opportunity to self-modify or create representatives with surrogate goals, is only one possibility. Another is to consider the adoption of a surrogate goal as the choice of an agent (before it encounters any threat) to commit to acting according to a new utility function, rather than the one which represents their true goals. This could be modeled, for instance, as an extensive-form game of incomplete information in which the agent decides which utility function to commit to by reasoning about (among other things) what sorts of threats having the utility function might provoke. Such models have a signaling game component, as the player must successfully signal to distrustful counterparts that it will actually act according to the surrogate utility function when threatened. The game-theoretic literature on signaling (Kreps and Sobel, 1994) and the literature on inferring preferences in multi-agent settings (Yu et al., 2019; Lin et al., 2019) may suggest useful models. The implementation of surrogate goals faces a number of obstacles. Some problems and questions include:
A crucial step in the investigation of surrogate goals is the development of appropriate theoretical models. This will help to gain traction on the problems listed above.
Although the architectures of TAI systems will likely be quite different from existing ones, it may still be possible to gain some understanding of cooperation failures among such systems using contemporary tools. First, it is plausible that some aspects of contemporary deep learning methods will persist in TAI systems, making experiments done today directly relevant. Second, even if this is not the case, such research may still help by laying the groundwork for the study of cooperation failures in more advanced systems.
As mentioned above, some attention has recently been devoted to social dilemmas among deep reinforcement learners (Leibo et al., 2017; Peysakhovich and Lerer, 2017; Lerer and Peysakhovich, 2017; Foerster et al., 2018; Wang et al., 2018). However, a fully general, scalable but theoretically principled approach to achieving cooperation among deep reinforcement learning agents is lacking. In Example 5.1.1 we sketch a general approach to cooperation in general-sum games which subsumes several recent methods, and afterward list some research questions raised by the framework.
Example 5.1.1 (Sketch of a framework for cooperation in general-sum games).
The setting is a 2-agent decision process. At each timestep $t$, each agent $i$ receives an observation $o_i^t$; takes an action $a_i^t$ based on their policy $\pi_i$ (assumed to be deterministic for simplicity); and receives a reward $r_i^t$. Player $i$ expects to get a value of $V_i(\pi_1, \pi_2)$ if the policies $(\pi_1, \pi_2)$ are deployed. Examples of such environments which are amenable to study with contemporary machine learning tools are the "sequential social dilemmas" introduced by Leibo et al. (2017). These include a game involving potential conflict over scarce resources, as well as a coordination game similar in spirit to Stag Hunt (Table 1).
Suppose that the agents (or their overseers) have the opportunity to choose what policies to deploy by simulating from a model, and to bargain over the choice of policies. The idea is for the parties to arrive at a welfare function $w(V_1, V_2)$ which they agree to jointly maximize; deviations from the policies which maximize the welfare function will be punished if detected. Let $d_i$ be a "disagreement point" measuring how well agent $i$ expects to do if they deviate from the welfare-maximizing policy profile. This could be their security value $\max_{\pi_i} \min_{\pi_{-i}} V_i(\pi_i, \pi_{-i})$, or an estimate of their value when the agents use independent learning algorithms. Finally, define player $i$'s ideal point $u_i^*$, the highest value they could obtain among the feasible policy profiles. Table 5 displays welfare functions corresponding to several widely-discussed bargaining solutions, adapted to the multi-agent reinforcement learning setting.
Table 5: Welfare functions corresponding to several widely-discussed bargaining solutions, adapted to the multi-agent RL setting where two agents with value functions $V_1$ and $V_2$ are bargaining over the pair of policies $(\pi_1, \pi_2)$ to deploy. The indicator function in the definition of the Kalai-Smorodinsky welfare is used to enforce the constraint in its argument. Note that when the space of feasible payoff profiles is convex, the Nash welfare function uniquely satisfies the properties of (1) Pareto optimality, (2) symmetry, (3) invariance to affine transformations, and (4) independence of irrelevant alternatives. The Nash welfare can also be obtained as the subgame perfect equilibrium of an alternating-offers game as the "patience" of the players goes to infinity (Binmore et al., 1986). On the other hand, Kalai-Smorodinsky uniquely satisfies (1)-(3) plus (5) resource monotonicity, which means that all players are weakly better off when there are more resources to go around. The egalitarian solution satisfies (1), (2), (4), and (5). The utilitarian welfare function is implicitly used in the work of Peysakhovich and Lerer (2017); Lerer and Peysakhovich (2017); Wang et al. (2018) on cooperation in sequential social dilemmas.
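The following is a sketch of how the welfare functions in Table 5 might be computed for a single candidate policy pair, using the notation assumed above (values $V_i$, disagreement points $d_i$, ideal points $u_i^*$); the Kalai-Smorodinsky constraint is enforced here with a penalty standing in for the indicator, and all of the numbers are hypothetical.

```python
import math

def utilitarian(v1, v2, d1, d2, u1, u2):
    return v1 + v2

def nash_welfare(v1, v2, d1, d2, u1, u2):
    # Product of gains over the disagreement point.
    return (v1 - d1) * (v2 - d2) if v1 >= d1 and v2 >= d2 else -math.inf

def egalitarian(v1, v2, d1, d2, u1, u2):
    # Gain of the worst-off player relative to the disagreement point.
    return min(v1 - d1, v2 - d2)

def kalai_smorodinsky(v1, v2, d1, d2, u1, u2, tol=1e-6):
    # Maximize gains subject to equal relative gains (V_i - d_i) / (u_i - d_i);
    # the -inf penalty stands in for the indicator enforcing the constraint.
    r1, r2 = (v1 - d1) / (u1 - d1), (v2 - d2) / (u2 - d2)
    return min(v1 - d1, v2 - d2) if abs(r1 - r2) < tol else -math.inf

# Hypothetical values for one candidate policy pair:
v1, v2, d1, d2, u1, u2 = 0.6, 0.8, 0.2, 0.2, 1.0, 1.0
for w in (utilitarian, nash_welfare, egalitarian, kalai_smorodinsky):
    print(w.__name__, w(v1, v2, d1, d2, u1, u2))
```

In the framework above, each welfare function would be evaluated across candidate policy pairs (e.g., by rolling out the policies in the model), and the pair maximizing the agreed-upon welfare function would be deployed.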
Define the cooperative policies as $(\pi_1^C, \pi_2^C) = \arg\max_{\pi_1, \pi_2} w(V_1(\pi_1, \pi_2), V_2(\pi_1, \pi_2))$. We need a way of detecting defections so that we can switch from the cooperative policy to a punishment policy. Call a function that detects defections a "switching rule". To make the framework general, consider switching rules which return 1 for switching to the punishment policy and 0 for continuing with the cooperative policy. Rules depend on the agent's observation history $h_i^t$.
The contents of $h_i^t$ will differ based on the degree of observability of the environment, as well as how transparent agents are to each other (cf. Table 6). Example switching rules include:
Finally, the agents need punishment policies to switch to in order to disincentivize defections. An extreme case of a punishment policy is the one in which agent $i$ commits to minimizing their counterpart's value once the counterpart has defected, i.e., playing $\arg\min_{\pi_i} \max_{\pi_{-i}} V_{-i}(\pi_i, \pi_{-i})$. This is the generalization of the so-called "grim trigger" strategy underlying the classical theory of iterated games (Friedman, 1971; Axelrod, 2000). It can be seen that each player submitting a grim trigger strategy in the above framework constitutes a Nash equilibrium in the case that the counterpart's observations and actions are visible (and therefore defections can be detected with certainty). However, grim trigger is intuitively an extremely dangerous strategy for promoting cooperation, and indeed it does poorly in empirical studies of different strategies for the iterated Prisoner's Dilemma (Axelrod and Hamilton, 1981). One possibility is to train more forgiving, tit-for-tat-like punishment policies, and to play a mixed strategy when choosing which to deploy in order to reduce exploitability.
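A minimal sketch of a switching rule and a grim-trigger punishment policy in an iterated Prisoner's Dilemma with fully observable actions; it illustrates the framework above rather than a deep reinforcement learning implementation, and the payoff numbers are hypothetical.

```python
# Iterated Prisoner's Dilemma payoffs for the row player (illustrative numbers).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}

def switching_rule(history):
    """Return 1 (switch to the punishment policy) if the counterpart has ever defected, else 0."""
    return 1 if any(opponent_action == "D" for _, opponent_action in history) else 0

def grim_trigger(history):
    """Play the cooperative policy until a defection is detected, then punish forever."""
    return "D" if switching_rule(history) else "C"

def play(policy_1, policy_2, rounds=10):
    history_1, history_2, score_1, score_2 = [], [], 0, 0
    for _ in range(rounds):
        a1, a2 = policy_1(history_1), policy_2(history_2)
        score_1 += PAYOFF[(a1, a2)]
        score_2 += PAYOFF[(a2, a1)]
        history_1.append((a1, a2))
        history_2.append((a2, a1))
    return score_1, score_2

always_defect = lambda history: "D"
print(play(grim_trigger, grim_trigger))   # (30, 30): cooperation is sustained
print(play(grim_trigger, always_defect))  # (9, 13): defection is punished from round 2 onward
```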
Some questions facing a framework for solving social dilemmas among deep reinforcement learners, such as that sketched in Example 5.1.1, include:
In addition to the theoretical development of open-source game theory (Section 3.2), interactions between transparent agents can be studied using tools like deep reinforcement learning. Learning equilibrium (Brafman and Tennenholtz, 2003) and learning with opponent-learning awareness (LOLA) (Foerster et al., 2018; Baumann et al., 2018; Letcher et al., 2018) are examples of analyses of learning under transparency.
Clifton (2019) provides a framework for "open-source learning" under mutual transparency of source codes and policy parameters. Questions on which we might make progress using present-day machine learning include:
Table 6: Several recent approaches to achieving cooperation in social dilemmas, which assume varying degrees of agent transparency. In Peysakhovich and Lerer (2017)'s consequentialist conditional cooperation (CCC), players learn cooperative policies off-line by optimizing the total welfare. During the target task, they only partially observe the game state and see none of their counterpart's actions; thus, they use only their observed rewards to detect whether their counterpart is cooperating or defecting, and switch to their cooperative or defecting policies accordingly. On the other hand, in Lerer and Peysakhovich (2017), a player sees their counterpart's action and switches to the defecting policy if that action is consistent with defection (mimicking the tit-for-tat strategy in the iterated Prisoner's Dilemma (Axelrod and Hamilton, 1981)).
Multi-agent training is an emerging paradigm for the training of generally intelligent agents (Lanctot et al., 2017; Rabinowitz et al., 2018; Suarez et al., 2019; Leibo et al., 2019). It is as yet unclear what the consequences of such a learning paradigm are for the prospects for cooperativeness among advanced AI systems.
Understanding the decision-making procedures implemented by different machine learning algorithms may be critical for assessing how they will behave in high-stakes interactions with humans or other AI agents. One potentially relevant factor is the decision theory implicitly implemented by a machine learning agent. We discuss decision theory at greater length in Section 7.2, but briefly: by an agent's decision theory, we roughly mean which dependences the agent accounts for when predicting the outcomes of its actions. While it is standard to consider only the causal effects of one's actions ("causal decision theory" (CDT)), there are reasons to think agents should account for non-causal evidence that their actions provide about the world. And different ways of computing the expected effects of actions may lead to starkly different behavior in multi-agent settings.
TAI agents may acquire their objectives via interaction with or observation of humans. Relatedly, TAI systems may consist of AI-assisted humans, as in Drexler (2019)'s comprehensive AI services scenario. Relevant AI techniques include:
In human-in-the-loop scenarios, human responses will determine the outcomes of opportunities for cooperation and conflict.
Behavioral game theory has often found deviations from theoretical solution concepts among human game-players. For instance, people tend to reject unfair splits in the ultimatum game despite this move being ruled out by subgame perfection (Section 3). In the realm of bargaining, human subjects often reach different bargaining solutions than those standardly argued for in the game theory literature (in particular, the Nash (Nash, 1950) and Kalai-Smorodinsky (Kalai et al., 1975) solutions) (Felsenthal and Diskin, 1982; Schellenberg, 1988). Thus the behavioral game theory of human-AI interaction in critical scenarios may be a vital complement to theoretical analysis when designing human-in-the-loop systems.
In one class of TAI trajectories, humans control powerful AI delegates who act on their behalf (gathering resources, ensuring safety, etc.). One model for powerful AI delegates is Christiano (2016a)’s (recursively titled) "Humans consulting HCH" (HCH). Saunders (2019) explains HCH as follows:
HCH, introduced in Humans consulting HCH (Christiano, 2016a), is a computational model in which a human answers questions using questions answered by another human, which can call other humans, which can call other humans, and so on. Each step in the process consists of a human taking in a question, optionally asking one or more sub-questions to other humans, and returning an answer based on those subquestions. HCH can be used as a model for what Iterated Amplification would be able to do in the limit of infinite compute.
A particularly concerning class of cooperation failures in such scenarios are threats by AIs or AI-assisted humans against one another.
Saunders also discusses a hypothetical manual for overseers in the HCH scheme. In this manual, overseers could find advice "on how to corrigibly answer questions by decomposing them into sub-questions." Exploring practical advice that could be included in this manual might be a fruitful exercise for identifying concrete interventions for addressing cooperation failures in HCH and other human-in-the-loop settings. Examples include:
We think that the effort to ensure cooperative outcomes among TAI systems will likely benefit from thorough conceptual clarity about the nature of rational agency. Certain foundational achievements (probability theory, the theory of computation, algorithmic information theory, decision theory, and game theory, to name some of the most profound) have been instrumental both in providing a powerful conceptual apparatus for thinking about rational agency and in the development of concrete tools in artificial intelligence, statistics, cognitive science, and so on. Likewise, there are a number of outstanding foundational questions surrounding the nature of rational agency which we expect to yield additional clarity about interactions between TAI-enabled systems. Broadly, we want to answer:
We acknowledge, however, the limitations of the agenda for foundational questions which we present. First, it is plausible that the formal tools we develop will be of limited use in understanding TAI systems that are actually developed. This may be true of black-box machine learning systems, for instance. Second, there is plenty of potentially relevant foundational inquiry scattered across epistemology, decision theory, game theory, mathematics, philosophy of probability, philosophy of science, etc. which we do not prioritize in our agenda. This does not necessarily reflect a considered judgment about all relevant areas. However, it is plausible to us that the research directions listed here are among the most important, tractable, and neglected (EA Concepts, n.d.) directions for improving our theoretical picture of TAI.
Bayesianism (Talbott, 2016) is the standard idealized model of reasoning under empirical uncertainty. Bayesian agents maintain probabilities over hypotheses; update these probabilities by conditionalization in light of new evidence; and make decisions according to some version of expected utility decision theory (Briggs, 2019). But Bayesianism faces a number of limitations when applied to computationally bounded agents. Examples include:
Newcomb's problem (Nozick, 1969) showed that classical decision theory bifurcates into two conflicting principles of choice in cases where outcomes depend on agents' predictions of each other's behavior. Since then, considerable philosophical work has gone towards identifying additional problem cases for decision theory and towards developing new decision theories to address them. As with Newcomb's problem, many decision-theoretic puzzles involve dependencies between the choices of several agents. For instance, Lewis (1979) argues that Newcomb's problem is equivalent to a prisoner's dilemma played by agents with highly correlated decision-making procedures, and Soares and Fallenstein (2015) give several examples in which artificial agents implementing certain decision theories are vulnerable to blackmail.
In discussing the decision theory implemented by an agent, we will assume that the agent maximizes some form of expected utility. Following Gibbard and Harper (1978), we write the expected utility given an action $a$ for a single-stage decision problem as
$$EU(a) = \sum_j u(o_j)\, P_a(o_j), \qquad (1)$$
where the $o_j$ are possible outcomes; $u$ is the agent's utility function; and $P_a(o_j)$ stands for a given notion of dependence of the outcome $o_j$ on the action $a$. The dependence concept an agent uses for $P_a$ in part determines its decision theory.
The philosophical literature has largely been concerned with causal decision theory (CDT) (Gibbard and Harper, 1978) and evidential decision theory (EDT) (Horgan, 1981), which are distinguished by their handling of dependence.
Causal conditional expectations account only for the causal effects of an agent's actions; in the formalism of Pearl (2009)'s do-calculus, for instance, the relevant notion of expected utility conditional on an action $a$ is $\sum_j u(o_j)\, P(o_j \mid \mathrm{do}(a))$. EDT, on the other hand, takes into account non-causal dependencies between the agent's actions and the outcome. In particular, it takes into account the evidence that taking the action provides for the actions taken by other agents in the environment with whom the decision-maker's actions are dependent. Thus the evidential expected utility is the classical conditional expectation $\sum_j u(o_j)\, P(o_j \mid a)$.
Finally, researchers in the AI safety community have more recently developed what we will refer to as logical decision theories, which employ a third class of dependence for evaluating actions (Dai, 2009; Yudkowsky, 2009; Yudkowsky and Soares, 2017). One such theory is functional decision theory (FDT), which uses what Yudkowsky and Soares (2017) refer to as subjunctive dependence. They explain this by stating that "When two physical systems are computing the same function, we will say that their behaviors 'subjunctively depend' upon that function" (p. 6). Thus, in FDT, the expected utility given an action $a$ is computed by determining what the outcome of the decision problem would be if all relevant instances of the agent's decision-making algorithm output $a$.
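As an illustration of how the choice of dependence notion in Equation (1) changes behavior, here is a toy computation for Newcomb's problem; the predictor's accuracy and the CDT prior over the box contents are hypothetical numbers.

```python
# Newcomb's problem: box A holds $1,000; box B holds $1,000,000 iff the predictor
# predicted that the agent would one-box. Accuracy and prior are illustrative.
ACCURACY = 0.99   # P(prediction matches the agent's actual choice)
PRIOR_FULL = 0.5  # CDT's credence that box B is already full, whatever the agent does

def edt_value(action: str) -> float:
    # Evidential expectation: condition the box contents on the action taken.
    p_full = ACCURACY if action == "one-box" else 1 - ACCURACY
    return p_full * 1_000_000 + (1_000 if action == "two-box" else 0)

def cdt_value(action: str) -> float:
    # Causal expectation: the action cannot change the already-fixed box contents.
    return PRIOR_FULL * 1_000_000 + (1_000 if action == "two-box" else 0)

for action in ("one-box", "two-box"):
    print(action, edt_value(action), cdt_value(action))
# EDT prefers one-boxing (990,000 vs 11,000); CDT prefers two-boxing (501,000 vs 500,000).
```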
In this section, we will assume an acausal stance on decision theory, that is, one other than CDT. There are several motivations for using a decision theory other than CDT:
We consider these sufficient motivations to study the implications of acausal decision theory for the reasoning of consequentialist agents. In particular, in this section we take up various possibilities for acausal trade between TAI systems. If we account for the evidence that one's choices provide about the choices of causally disconnected agents, this opens up both qualitatively new possibilities for interaction and quantitatively many more agents to interact with. Crucially, due to the potential scale of value that could be gained or lost via acausal interaction with vast numbers of distant agents, ensuring that TAI agents handle decision-theoretic problems correctly may be even more important than ensuring that they have the correct goals.
Agents using an acausal decision theory may coordinate in the absence of causal interaction. A concrete illustration is provided in Example 7.2.1, reproduced from Oesterheld (2017b)’s example, which is itself based on an example in Hofstadter (1983).
Example 7.2.1 (Hofstadter’s evidential cooperation game)
Hofstadter sends 20 participants the same letter, asking them to respond with a single letter ‘C’ (for cooperate) or ‘D’ (for defect) without communicating with each other. Hofstadter explains that by sending in ‘C’, a participant can increase everyone else’s payoff by $2. By sending in ‘D’, participants can increase their own payoff by $5. The letter ends by informing the participants that they were all chosen for their high levels of rationality and correct decision making in weird scenarios like this. Note that every participant only cares about the balance of her own bank account and not about Hofstadter’s or the other 19 participants’. Should you, as a participant, respond with ‘C’ or ‘D’?
An acausal argument in favor of 'C' is: if I play 'C', this gives me evidence that the other participants also chose 'C'. Therefore, even though I cannot cause others to play 'C' (and therefore, on a CDT analysis, should play 'D'), the conditional expectation of my payoff given that I play 'C' is higher than my conditional expectation given that I play 'D'.
We will call this mode of coordination evidential cooperation.
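A toy computation for Hofstadter's game, using the dollar amounts from the example and assuming, for the evidential analysis, that the other participants' choices are perfectly correlated with one's own:

```python
N_OTHERS = 19  # the other participants

def my_payoff(my_move: str, n_others_cooperating: int) -> int:
    # $5 for sending 'D'; $2 from each other participant who sends 'C'.
    return (5 if my_move == "D" else 0) + 2 * n_others_cooperating

# Causal analysis: holding the others' choices fixed, 'D' is better by $5 regardless.
for k in (0, 10, 19):
    print(k, my_payoff("D", k) - my_payoff("C", k))  # always 5

# Evidential analysis under the (assumed) perfect correlation: my choice is strong
# evidence about the others' choices, so compare the worlds where everyone acts alike.
print("C:", my_payoff("C", N_OTHERS))  # 38
print("D:", my_payoff("D", 0))         # 5
```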
For a satisfactory theory of evidential cooperation, we will need to make precise what it means for agents to be evidentially (but not causally) dependent. There are at least three possibilities.
1. Agents may tend to make the same decisions on some reference class of decision problems. (That is, for some probability distribution over decision contexts, the probability that the agents choose the same action in a randomly drawn context is high.)
2. An agent’s taking action A in context C may provide evidence about the number of agents in the world who take actions like A in contexts like C.
3. If agents have similar source code, their decisions provide logical evidence for their counterpart’s decision. (In turn, we would like a rigorous account of the notion of "source code similarity''.)
It is plausible that we live in an infinite universe with infinitely many agents (Tegmark, 2003). In principle, evidential cooperation between agents in distant regions of the universe is possible; we may call this evidential cooperation in large worlds (ECL). If ECL were feasible, it might allow agents to reap large amounts of value via acausal coordination. Treutlein (2019) develops a bargaining model of ECL and lists a number of open questions facing his formalism. Leskelä (2019) addresses fundamental limitations on simulations as a tool for learning about distant agents, which may be required to gain from ECL and other forms of "acausal trade". Finally, Yudkowsky (n.d.) lists potential downsides to which agents may be exposed by reasoning about distant agents. The issues discussed by these authors, and perhaps many more, will need to be addressed in order to establish ECL and acausal trade as serious possibilities. Nevertheless, the stakes strike us as great enough to warrant further study.
As noted in the document, several sections of this agenda were developed from writings by Lukas Gloor, Daniel Kokotajlo, Caspar Oesterheld, and Johannes Treutlein. Thank you very much to David Althaus, Tobias Baumann, Alexis Carlier, Alex Cloud, Max Daniel, Michael Dennis, Lukas Gloor, Adrian Hutter, Daniel Kokotajlo, János Kramár, David Krueger, Anni Leskelä, Matthijs Maas, Linh Chi Nguyen, Richard Ngo, Caspar Oesterheld, Mahendra Prasad, Rohin Shah, Carl Shulman, Stefan Torges, Johannes Treutlein, and Jonas Vollmer for comments on drafts of this document. Thank you also to the participants of the Center on Long-Term Risk research retreat and workshops, whose contributions also helped to shape this agenda.
Arif Ahmed. Evidence, decision and causality. Cambridge University Press, 2014.
AI Impacts. Likelihood of discontinuous progress around the development of AGI. https://aiimpacts.org/likelihood-of-discontinuous-progress-around-the-development-of-agi/, 2018. Accessed: July 1 2019.
Riad Akrour, Marc Schoenauer, and Michele Sebag. Preference-based policy learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 12–27. Springer, 2011.
Steffen Andersen, Seda Ertaç, Uri Gneezy, Moshe Hoffman, and John A List. Stakes matter in ultimatum games. American Economic Review, 101(7):3427-39, 2011.
Giulia Andrighetto, Daniela Grieco, and Rosaria Conte. Fairness and compliance in the extortion game. 2015.
Scott E Atkinson, Todd Sandler, and John Tschirhart. Terrorism in a bargaining framework. The Journal of Law and Economics, 30(1):1-21, 1987.
Robert Axelrod. On six advances in cooperation theory. Analyse & Kritik, 22(1):130-151, 2000.
Robert Axelrod and William D Hamilton. The evolution of cooperation. Science, 211(4489):1390-1396, 1981.
Kyle Bagwell. Commitment and observability in games. Games and Economic Behavior, 8(2):271-280, 1995.
Tobias Baumann. Surrogate goals to deflect threats. http://s-risks.org/using-surrogate-goals-to-deflect-threats/, 2017. Accessed March 6, 2019.
Tobias Baumann. Challenges to implementing surrogate goals. http://s-risks.org/challenges-to-implementing-surrogate-goals/, 2018. Accessed March 6, 2019.
Tobias Baumann, Thore Graepel, and John Shawe-Taylor. Adaptive mechanism design: Learning to promote cooperation. arXiv preprint arXiv:1806.04067, 2018.
Ken Binmore, Ariel Rubinstein, and Asher Wolinsky. The Nash bargaining solution in economic modelling. The RAND Journal of Economics, pages 176-188, 1986.
Iris Bohnet, Bruno S Frey, and Steffen Huck. More order with less law: On contract enforcement, trust, and crowding. American Political Science Review, 95(1):131-144, 2001.
Friedel Bolle, Yves Breitmoser, and Steffen Schlächter. Extortion in the laboratory. Journal of Economic Behavior & Organization, 78(3):207-218, 2011.
Gary E Bolton and Axel Ockenfels. ERC: A theory of equity, reciprocity, and competition. American Economic Review, 90(1):166-193, 2000.
Nick Bostrom. Ethical issues in advanced artificial intelligence. Science Fiction and Philosophy: From Time Travel to Superintelligence, pages 277-284, 2003.
Nick Bostrom. Superintelligence: paths, dangers, strategies. 2014.
Ronen I Brafman and Moshe Tennenholtz. Efficient learning equilibrium. In Advances in Neural Information Processing Systems, pages 1635-1642, 2003.
R. A. Briggs. Normative theories of rational choice: Expected utility. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2019 edition, 2019.
Ernst Britting and Hartwig Spitzer. The open skies treaty. Verification Yearbook, pages 221-237, 2002.
Colin Camerer and Teck Hua Ho. Experience-weighted attraction learning in normal form games. Econometrica, 67(4):827-874, 1999.
Colin F Camerer. Behavioural game theory. Springer, 2008.
Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3):861-898, 2004.
Christopher Cherniak. Computational complexity and the universal acceptance of logic. The Journal of Philosophy, 81(12):739-758, 1984.
Thomas J Christensen and Jack Snyder. Chain gangs and passed bucks: Predicting alliance patterns in multipolarity. International organization, 44(2):137-168, 1990.
Paul Christiano. Approval directed agents. https://ai-alignment.com/model-free-decisions-6e6609f5d99e, 2014. Accessed: March 15 2019.
Paul Christiano. Humans consulting hch. https://ai-alignment.com/humans-consulting-hch-f893f6051455, 2016a.
Paul Christiano. Prosaic AI alignment. https://ai-alignment.com/prosaic-ai-control-b959644d79c2, 2016b. Accessed: March 13 2019.
Paul Christiano. Clarifying “AI alignment”. https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6, 2018a. Accessed: October 10 2019.
Paul Christiano. Preface to the sequence on iterated amplification. https://www.lesswrong.com/s/XshCxPjnBec52EcLB/p/HCv2uwgDGf5dyX5y6, 2018b. Accessed March 6, 2019.
Paul Christiano. Preface to the sequence on iterated amplification. https://www.lesswrong.com/posts/HCv2uwgDGf5dyX5y6/preface-to-the-sequence-on-iterated-amplification, 2018c. Accessed: October 10 2019.
Paul Christiano. Techniques for optimizing worst-case performance. https://ai-alignment.com/techniques-for-optimizing-worst-case-performance-39eafec74b99, 2018d. Accessed: June 24, 2019.
Paul Christiano. What failure looks like. https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like, 2019. Accessed: July 2 2019.
Paul Christiano and Robert Wiblin. Should we leave a helpful message for future civilizations, just in case humanity dies out? https://80000hours.org/podcast/episodes/paul-christiano-a-message-for-the-future/, 2019. Accessed: September 25, 2019.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299-4307, 2017.
Jesse Clifton. Open-source learning: a bargaining approach. Unpublished working draft., 2019.
Mark Coeckelbergh. Can we trust robots? Ethics and information technology, 14(1):53-60, 2012.
EA Concepts. Importance, tractability, neglectedness framework. https://concepts.effectivealtruism.org/concepts/importance-neglectedness-tractability/, n.d. Accessed: July 1 2019.
Ajeya Cotra. Iterated distillation and amplification. https://www.alignmentforum.org/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification, 2018. Accessed: July 25 2019.
Jacob W Crandall, Mayada Oudah, Fatimah Ishowo-Oloko, Sherief Abdallah, Jean-François Bonnefon, Manuel Cebrian, Azim Shariff, Michael A Goodrich, Iyad Rahwan, et al. Cooperating with machines. Nature communications, 9(1):233, 2018.
Andrew Critch. A parametric, resource-bounded generalization of Löb's theorem, and a robust cooperation criterion for open-source game theory. The Journal of Symbolic Logic, pages 1-15, 2019.
Allan Dafoe. AI governance: A research agenda. Governance of AI Program, Future of Humanity Institute, University of Oxford: Oxford, UK, 2018.
Wei Dai. Towards a new decision theory. https://www.lesswrong.com/posts/de3xjFaACCAk6imzv/towards-a-new-decision-theory, 2009. Accessed: March 5 2019.
Wei Dai. The main sources of AI risk. https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk, 2019. Accessed: July 2 2019.
Robyn M Dawes. Social dilemmas. Annual review of psychology, 31(1):169-193, 1980.
Karl W Deutsch and J David Singer. Multipolar power systems and international stability. World Politics, 16(3):390-406, 1964.
Daniel Dewey. My current thoughts on MIRI’s “highly reliable agent design” work. https://forum.effectivealtruism.org/posts/SEL9PW8jozrvLnkb4/my-current-thoughts-on-miri-s-highly-reliable-agent-design, 2017. Accessed: October 6 2019.
Avinash Dixit. Trade expansion and contract enforcement. Journal of Political Economy, 111(6):1293-1317, 2003.
Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
K Eric Drexler. Reframing superintelligence: Comprehensive AI services as general intelligence, 2019.
Martin Dufwenberg and Uri Gneezy. Measuring beliefs in an experimental lost wallet game. Games and economic Behavior, 30(2):163-182, 2000.
Daniel Ellsberg. The theory and practice of blackmail. Technical report, RAND CORP SANTA MONICA CA, 1968.
Johanna Etner, Meglena Jeleva, and Jean-Marc Tallon. Decision theory under ambiguity. Journal of Economic Surveys, 26(2):234-270, 2012.
Owain Evans, Andreas Stuhlmüller, Chris Cundy, Ryan Carey, Zachary Kenton, Thomas McGrath, and Andrew Schreiber. Predicting human deliberative judgments with machine learning. Technical report, Technical report, University of Oxford, 2018.
Tom Everitt, Jan Leike, and Marcus Hutter. Sequential extensions of causal and evidential decision theory. In International Conference on Algorithmic Decision Theory, pages 205-221. Springer, 2015.
Tom Everitt, Daniel Filan, Mayank Daswani, and Marcus Hutter. Self-modification of policy and utility function in rational agents. In International Conference on Artificial General Intelligence, pages 1-11. Springer, 2016.
Tom Everitt, Pedro A Ortega, Elizabeth Barnes, and Shane Legg. Understanding agent incentives using causal influence diagrams, part i: single action settings. arXiv preprint arXiv:1902.09980, 2019.
James D Fearon. Rationalist explanations for war. International organization, 49(3):379-414, 1995.
Ernst Fehr and Klaus M Schmidt. A theory of fairness, competition, and cooperation. The quarterly journal of economics, 114(3):817-868, 1999.
Ernst Fehr, Simon Gächter, and Georg Kirchsteiger. Reciprocity as a contract enforcement device: Experimental evidence. Econometrica, 65:833-860, 1997.
Dan S Felsenthal and Abraham Diskin. The bargaining problem revisited: minimum utility point, restricted monotonicity axiom, and the mean as an estimate of expected utility. Journal of Conflict Resolution, 26(4):664-691, 1982.
Mark Fey and Kristopher W Ramsay. Mutual optimism and war. American Journal of Political Science, 51(4):738-754, 2007.
Mark Fey and Kristopher W Ramsay. Uncertainty and incentives in crisis bargaining: Game-free analysis of international conflict. American Journal of Political Science, 55(1):149-169, 2011.
Ben Fisch, Daniel Freund, and Moni Naor. Physical zero-knowledge proofs of physical properties. In Annual Cryptology Conference, pages 313-336. Springer, 2014.
Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122-130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
Lance Fortnow. Program equilibria and discounted computation time. In Proceedings of the 12th Conference on Theoretical Aspects of Rationality and Knowledge, pages 128-133. ACM, 2009.
James W Friedman. A non-cooperative equilibrium for supergames. The Review of Economic Studies, 38(1):1-12, 1971.
Daniel Garber. Old evidence and logical omniscience in Bayesian confirmation theory. 1983.
Ben Garfinkel. Recent developments in cryptography and possible long-run consequences. https://drive.google.com/file/d/0B0j9LKC65n09aDh4RmEzdlloT00/view, 2018. Accessed: November 11 2019.
Ben Garfinkel and Allan Dafoe. How does the offense-defense balance scale? Journal of Strategic Studies, 42(6):736-763, 2019.
Scott Garrabrant. Two major obstacles for logical inductor decision theory. https://agentfoundations.org/item?id=1399, 2017. Accessed: July 17 2019.
Scott Garrabrant and Abram Demski. Embedded agency. https://www.alignmentforum.org/posts/i3BTagvt3HbPMx6PN/embedded-agency-full-text-version, 2018. Accessed March 6, 2019.
Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor. Logical induction. arXiv preprint arXiv:1609.03543, 2016.
Alexandre Gazet. Comparative analysis of various ransomware virii. Journal in computer virology, 6(1):77-90, 2010.
Samuel J Gershman, Eric J Horvitz, and Joshua B Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273-278, 2015.
Allan Gibbard and William L Harper. Counterfactuals and two kinds of expected utility. In Ifs, pages 153-190. Springer, 1978.
Itzhak Gilboa and David Schmeidler. Maxmin expected utility with non-unique prior. Journal of mathematical economics, 18(2):141-153, 1989.
Alexander Glaser, Boaz Barak, and Robert J Goldston. A zero-knowledge protocol for nuclear warhead verification. Nature, 510(7506):497, 2014.
Charles L Glaser. The security dilemma revisited. World politics, 50(1):171-201, 1997.
Piotr J Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49-79, 2005.
Oded Goldreich and Yair Oren. Definitions and properties of zero-knowledge proof systems. Journal of Cryptology, 7(1):1-32, 1994.
Shafi Goldwasser, Silvio Micali, and Charles Rackoff. The knowledge complexity of interactive proof systems. SIAM Journal on computing, 18(1):186-208, 1989.
Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. When will AI exceed human performance? evidence from AI experts. Journal of Artificial Intelligence Research, 62:729-754, 2018.
Hilary Greaves, William MacAskill, Rossa O’Keeffe-O’Donovan, and Philip Trammell. A research agenda for the Global Priorities Institute (web version). 2019.
Avner Greif, Paul Milgrom, and Barry R Weingast. Coordination, commitment, and enforcement: The case of the merchant guild. Journal of political economy, 102(4):745-776, 1994.
Frances S Grodzinsky, Keith W Miller, and Marty J Wolf. Developing artificial agents worthy of trust: “would you buy a used car from this artificial agent?”. Ethics and information technology, 13(1):17-27, 2011.
Werner Güth, Rolf Schmittberger, and Bernd Schwarze. An experimental analysis of ultimatum bargaining. Journal of economic behavior & organization, 3(4):367-388, 1982.
Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pages 3909-3917, 2016.
Edward H Hagen and Peter Hammerstein. Game theory and human evolution: A critique of some recent interpretations of experimental games. Theoretical population biology, 69(3):339-348, 2006.
Joseph Y Halpern and Rafael Pass. Game theory with translucent players. International Journal of Game Theory, 47(3):949-976, 2018.
Lars Peter Hansen and Thomas J Sargent. Robustness. Princeton university press, 2008.
Lars Peter Hansen, Massimo Marinacci, et al. Ambiguity aversion and model misspecification: An economic perspective. Statistical Science, 31(4):511-515, 2016.
Garrett Hardin. The tragedy of the commons. science, 162(3859):1243-1248, 1968.
Paul Harrenstein, Felix Brandt, and Felix Fischer. Commitment and extortion. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems, page 26. ACM, 2007.
John C Harsanyi and Reinhard Selten. A generalized nash solution for two-person bargaining games with incomplete information. Management Science, 18(5-part-2): 80-106, 1972.
Joseph Henrich, Richard McElreath, Abigail Barr, Jean Ensminger, Clark Barrett, Alexander Bolyanatz, Juan Camilo Cardenas, Michael Gurven, Edwins Gwako, Natalie Henrich, et al. Costly punishment across human societies. Science, 312(5781): 1767-1770, 2006.
Jack Hirshleifer. On the emotions as guarantors of threats and promises. The Dark Side of the Force, pages 198-219, 1987.
Douglas R Hofstadter. Dilemmas for superrational thinkers, leading up to a luring lottery. Scientific American, 6:267-275, 1983.
Terence Horgan. Counterfactuals and newcomb’s problem. The Journal of Philosophy, 78(6):331-356, 1981.
Edward Hughes, Joel Z Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, et al. Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in neural information processing systems, pages 3326-3336, 2018.
Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
Robert Jervis. Cooperation under the security dilemma. World politics, 30(2):167-214, 1978.
Robert Jervis. Perception and Misperception in International Politics: New Edition. Princeton University Press, 2017.
Daniel Kahneman, Ilana Ritov, David Schkade, Steven J Sherman, and Hal R Varian. Economic preferences or attitude expressions?: An analysis of dollar responses to public issues. In Elicitation of preferences, pages 203-242. Springer, 1999.
Ehud Kalai. Proportional solutions to bargaining situations: interpersonal utility comparisons. Econometrica: Journal of the Econometric Society, pages 1623-1630, 1977.
Ehud Kalai, Meir Smorodinsky, et al. Other solutions to nash’s bargaining problem. Econometrica, 43(3):513-518, 1975.
Fred Kaplan. The wizards of Armageddon. Stanford University Press, 1991.
Holden Karnofsky. Some background on our views regarding advanced artificial intelligence. https://www.openphilanthropy.org/blog/some-background-our-views-regarding-advanced-artificial-intelligence, 2016. Accessed: July 7 2019.
D Marc Kilgour and Frank C Zagare. Credibility, uncertainty, and deterrence. American Journal of Political Science, 35(2):305-334, 1991.
Stephen Knack and Philip Keefer. Institutions and economic performance: cross-country tests using alternative institutional measures. Economics & Politics, 7(3): 207-227, 1995.
Daniel Kokotajlo. The “commitment races” problem. https://www.lesswrong.com/posts/brXr7PJ2W4Na2EW2q/the-commitment-races-problem, 2019a. Accessed: September 11 2019.
Daniel Kokotajlo. Cdt agents are exploitable. Unpublished working draft, 2019b.
Peter Kollock. Social dilemmas: The anatomy of cooperation. Annual review of sociology, 24(1):183-214, 1998.
Kai A Konrad and Stergios Skaperdas. Credible threats in extortion. Journal of Economic Behavior & Organization, 33(1):23-39, 1997.
David M Kreps and Joel Sobel. Signalling. Handbook of game theory with economic applications, 2:849-867, 1994.
Joshua A Kroll, Solon Barocas, Edward W Felten, Joel R Reidenberg, David G Robinson, and Harlan Yu. Accountable algorithms. U. Pa. L. Rev., 165:633, 2016.
David Krueger, Tegan Maharaj, Shane Legg, and Jan Leike. Misleading meta-objectives and hidden incentives for distributional shift. Safe Machine Learning workshop at ICLR, 2019.
Andrew Kydd. Which side are you on? bias, credibility, and mediation. American Journal of Political Science, 47(4):597-611, 2003.
Andrew H Kydd. Rationalist approaches to conflict prevention and resolution. Annual Review of Political Science, 13:101-121, 2010.
Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190-4203, 2017.
Daryl Landau and Sy Landau. Confidence-building measures in mediation. Mediation Quarterly, 15(2):97-103, 1997.
Patrick LaVictoire, Benja Fallenstein, Eliezer Yudkowsky, Mihaly Barasz, Paul Christiano, and Marcello Herreshoff. Program equilibrium in the prisoner’s dilemma via Löb’s theorem. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464-473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
Joel Z Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742, 2019.
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.
Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017.
Anni Leskelä. Simulations as a tool for understanding other civilizations. Unpublished working draft, 2019.
Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. Stable opponent shaping in differentiable games. arXiv preprint arXiv:1811.08469, 2018.
David Lewis. Prisoners’ dilemma is a newcomb problem. Philosophy & Public Affairs, pages 235-240, 1979.
Xiaomin Lin, Stephen C Adams, and Peter A Beling. Multi-agent inverse reinforcement learning for certain general-sum stochastic games. Journal of Artificial Intelligence Research, 66:473-502, 2019.
Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
William MacAskill. A critique of functional decision theory. https://www.lesswrong.com/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory, 2019. Accessed: September 15 2019.
William MacAskill, Aron Vallinder, Caspar Oesterheld, Carl Shulman, and Johannes Treutlein. The evidentialist’s wager. Forthcoming, The Journal of Philosophy, 2021.
Fabio Maccheroni, Massimo Marinacci, and Aldo Rustichini. Ambiguity aversion, robustness, and the variational representation of preferences. Econometrica, 74(6): 1447-1498, 2006.
Michael W Macy and Andreas Flache. Learning dynamics in social dilemmas. Proceedings of the National Academy of Sciences, 99(suppl 3):7229-7236, 2002.
Christopher JG Meacham. Binding and its consequences. Philosophical studies, 149 (1):49-71, 2010.
Kathleen L Mosier, Linda J Skitka, Susan Heers, and Mark Burdick. Automation bias: Decision making and performance in high-tech cockpits. The International journal of aviation psychology, 8(1):47-63, 1998.
Abhinay Muthoo. A bargaining model based on the commitment tactic. Journal of Economic Theory, 69:134-152, 1996.
Rosemarie Nagel. Unraveling in guessing games: An experimental study. The American Economic Review, 85(5):1313-1326, 1995.
John Nash. Two-person cooperative games. Econometrica, 21:128-140, 1953.
John F Nash. The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155-162, 1950.
Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000.
Douglass C North. Institutions. Journal of economic perspectives, 5(1):97-112, 1991.
Robert Nozick. Newcomb’s problem and two principles of choice. In Essays in honor of Carl G. Hempel, pages 114-146. Springer, 1969.
Caspar Oesterheld. Approval-directed agency and the decision theory of Newcomb-like problems. https://casparoesterheld.files.wordpress.com/2018/01/rldt.pdf, 2017a.
Caspar Oesterheld. Multiverse-wide cooperation via correlated decision making. 2017b.
Caspar Oesterheld. Robust program equilibrium. Theory and Decision, 86:143–159, 2019.
Caspar Oesterheld and Vincent Conitzer. Extracting money from causal decision theorists. The Philosophical Quarterly, 2021.
Stephen M Omohundro. The nature of self-improving artificial intelligence. Singularity Summit, 2008, 2007.
Stephen M Omohundro. The basic AI drives. In AGI, volume 171, pages 483-492, 2008.
OpenAI. Openai charter. https://openai.com/charter/, 2018. Accessed: July 7 2019.
Pedro A Ortega and Vishal Maini. Building safe artificial intelligence: specification, robustness, and assurance. https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1, 2018. Accessed: July 7 2019.
Raja Parasuraman and Dietrich H Manzey. Complacency and bias in human use of automation: An attentional integration. Human factors, 52(3):381-410, 2010.
Judea Pearl. Causality. Cambridge university press, 2009.
Julien Perolat, Joel Z Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems, pages 3643-3652, 2017.
Alexander Peysakhovich and Adam Lerer. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975, 2017.
Robert Powell. Bargaining theory and international conflict. Annual Review of Political Science, 5(1):1-30, 2002.
Robert Powell. War as a commitment problem. International organization, 60(1): 169-203, 2006.
Kai Quek. Rationalist experiments on war. Political Science Research and Methods, 5 (1):123-142, 2017.
Matthew Rabin. Incorporating fairness into game theory and economics. The American economic review, pages 1281-1302, 1993.
Neil C Rabinowitz, Frank Perbet, H Francis Song, Chiyuan Zhang, SM Eslami, and Matthew Botvinick. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.
Werner Raub. A general game-theoretic model of preference adaptations in problematic social situations. Rationality and Society, 2(1):67-93, 1990.
Robert W Rauchhaus. Asymmetric information, mediation, and conflict management. World Politics, 58(2):207-241, 2006.
Jonathan Renshon, Julia J Lee, and Dustin Tingley. Emotions and the microfoundations of commitment problems. International Organization, 71(S1):S189-S218, 2017.
Stephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627-635, 2011.
Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica: Journal of the Econometric Society, pages 97-109, 1982.
Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105-114, 2015.
Stuart J Russell and Devika Subramanian. Provably bounded-optimal agents. Journal of Artificial Intelligence Research, 2:575-609, 1994.
Santiago Sanchez-Pages. Bargaining and conflict with incomplete information. The Oxford Handbook of the Economics of Peace and Conflict. Oxford University Press, New York, 2012.
William Saunders. HCH is not just mechanical turk. https://www.alignmentforum.org/posts/4JuKoFguzuMrNn6Qr/hch-is-not-just-mechanical-turk, 2019. Accessed: July 2 2019.
Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233-242, 1999.
Jonathan Schaffer. The metaphysics of causation. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2016 edition, 2016.
James A Schellenberg. A comparative test of three models for solving “the bargaining problem”. Behavioral Science, 33(2):81-96, 1988.
Thomas Schelling. The Strategy of Conflict. Harvard University Press, 1960.
David Schmidt, Robert Shupp, James Walker, TK Ahn, and Elinor Ostrom. Dilemma games: game parameters and matching protocols. Journal of Economic Behavior & Organization, 46(4):357-377, 2001.
Wolfgang Schwarz. On functional decision theory. umsu.de/wo/2018/688, 2018. Accessed: September 15 2019.
Anja Shortland and Russ Roberts. Shortland on kidnap. http://www.econtalk.org/anja-shortland-on-kidnap/, 2019. Accessed: July 13 2019.
Carl Shulman. Omohundro’s “basic AI drives” and catastrophic risks. Manuscript, 2010.
Linda J Skitka, Kathleen L Mosier, and Mark Burdick. Does automation bias decision-making? International Journal of Human-Computer Studies, 51(5):991–1006, 1999.
Alastair Smith and Allan C Stam. Bargaining and the nature of war. Journal of Conflict Resolution, 48(6):783-813, 2004.
Glenn H Snyder. “Prisoner’s dilemma” and “chicken” models in international politics. International Studies Quarterly, 15(1):66-103, 1971.
Nate Soares and Benja Fallenstein. Toward idealized decision theory. arXiv preprint arXiv:1507.01986, 2015.
Nate Soares and Benya Fallenstein. Agent foundations for aligning machine intelligence with human interests: a technical research agenda. In The Technological Singularity, pages 103-125. Springer, 2017.
Joel Sobel. A theory of credibility. The Review of Economic Studies, 52(4):557-573, 1985.
Ray J Solomonoff. A formal theory of inductive inference. part i. Information and control, 7(1):1-22, 1964.
Kaj Sotala. Disjunctive scenarios of catastrophic AI risk. In Artificial Intelligence Safety and Security, pages 315-337. Chapman and Hall/CRC, 2018.
Tom Florian Sterkenburg. The foundations of solomonoff prediction. Master’s thesis, 2013.
Joerg Stoye. Statistical decisions under ambiguity. Theory and decision, 70(2):129-148, 2011.
Joseph Suarez, Yilun Du, Phillip Isola, and Igor Mordatch. Neural MMO: A massively multiagent game environment for training and evaluating intelligent agents. arXiv preprint arXiv:1903.00784, 2019.
Chiara Superti. Addiopizzo: Can a label defeat the mafia? Journal of International Policy Solutions, 11(4):3-11, 2009.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
William Talbott. Bayesian epistemology. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, winter 2016 edition, 2016.
Jessica Taylor. My current take on the paul-MIRI disagreement on alignability of messy AI. https://agentfoundations.org/item?id=1129, 2016. Accessed: October 6 2019.
Max Tegmark. Parallel universes. Scientific American, 288(5):40-51, 2003.
Moshe Tennenholtz. Program equilibrium. Games and Economic Behavior, 49(2): 363-373, 2004.
Johannes Treutlein. Modeling multiverse-wide superrationality. Unpublished working draft, 2019.
Jonathan Uesato, Ananya Kumar, Csaba Szepesvari, Tom Erez, Avraham Ruderman, Keith Anderson, Nicolas Heess, Pushmeet Kohli, et al. Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures. arXiv preprint arXiv:1812.01647, 2018.
Eric Van Damme. The nash bargaining solution is optimal. Journal of Economic Theory, 38(1):78-100, 1986.
Hal R Varian. Computer mediated transactions. American Economic Review, 100(2): 1-10, 2010.
Heinrich Von Stackelberg. Market structure and equilibrium. Springer Science & Business Media, 2010.
Kenneth N Waltz. The stability of a bipolar world. Daedalus, pages 881-909, 1964.
Weixun Wang, Jianye Hao, Yixi Wang, and Matthew Taylor. Towards cooperation in sequential prisoner’s dilemmas: a deep multiagent reinforcement learning approach. arXiv preprint arXiv:1803.00162, 2018.
E Roy Weintraub. Game theory and cold war rationality: A review essay. Journal of Economic Literature, 55(1):148-61, 2017.
Sylvia Wenmackers and Jan-Willem Romeijn. New theory about old evidence. Synthese, 193(4):1225-1250, 2016.
Lantao Yu, Jiaming Song, and Stefano Ermon. Multi-agent adversarial inverse reinforcement learning. arXiv preprint arXiv:1907.13220, 2019.
Eliezer Yudkowsky. Ingredients of timeless decision theory. https://www.lesswrong.com/posts/szfxvS8nsxTgJLBHs/ingredients-of-timeless-decision-theory, 2009. Accessed: March 14 2019.
Eliezer Yudkowsky. Intelligence explosion microeconomics. Machine Intelligence Research Institute, 2013. Accessed online: October 23, 2015.
Eliezer Yudkowsky. Modeling distant superintelligences. https://arbital.com/p/distant_SIs/, n.d. Accessed: Feb. 6 2019.
Eliezer Yudkowsky and Nate Soares. Functional decision theory: A new theory of instrumental rationality. arXiv preprint arXiv:1710.05060, 2017.
Claire Zabel and Luke Muehlhauser. Information security careers for gcr reduction. https://forum.effectivealtruism.org/posts/ZJiCfwTy5dC4CoxqA/information-security-careers-for-gcr-reduction, 2019. Accessed: July 17 2019.
Chongjie Zhang and Victor Lesser. Multi-agent learning with policy prediction. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
The post Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda appeared first on Center on Long-Term Risk.
The post Approval-directed agency and the decision theory of Newcomb-like problems appeared first on Center on Long-Term Risk.
Decision theorists disagree about how instrumentally rational agents, i.e., agents trying to achieve some goal, should behave in so-called Newcomb-like problems, with the main contenders being causal and evidential decision theory. Since the main goal of artificial intelligence research is to create machines that make instrumentally rational decisions, the disagreement pertains to this field. In addition to the more philosophical question of what the right decision theory is, the goal of AI poses the question of how to implement any given decision theory in an AI. For example, how would one go about building an AI whose behavior matches evidential decision theory’s recommendations? Conversely, we can ask which decision theories (if any) describe the behavior of any existing AI design. In this paper, we study what decision theory an approval-directed agent, i.e., an agent whose goal it is to maximize the score it receives from an overseer, implements. If we assume that the overseer rewards the agent based on the expected value of some von Neumann–Morgenstern utility function, then such an approval-directed agent is guided by two decision theories: the one used by the agent to decide which action to choose in order to maximize the reward and the one used by the overseer to compute the expected utility of a chosen action. We show which of these two decision theories describes the agent’s behavior in which situations.
Read the paper on the publisher's website.
The post Risk factors for s-risks appeared first on Center on Long-Term Risk.
Traditional disaster risk prevention has a concept of risk factors. These factors are not risks in and of themselves, but they increase either the probability or the magnitude of a risk. For instance, inadequate governance structures do not cause a specific disaster, but if a disaster strikes they may impede an effective response, thus increasing the damage.
Rather than considering individual scenarios of how s-risks could occur, which tends to be highly speculative, this post instead looks at risk factors – i.e. factors that would make s-risks more likely or more severe.
The simplest risk factor is the capacity of human civilisation to create astronomical amounts of suffering in the first place. This is arguably only possible with advanced technology. In particular, if space colonisation becomes technically and economically viable, then human civilisation will likely expand throughout the universe. This would multiply the number of sentient beings (assuming that the universe is currently not populated) and thus also potentially multiply the amount of suffering. By contrast, the amount of suffering is limited if humanity never expands into space. (To clarify, I’m not saying that advanced technology or space colonisation is bad per se, as they also have significant potential upsides. It just raises the stakes – with greater power comes greater responsibility.)
Nick Bostrom likens the development of new technologies to drawing balls from an urn that contains some black balls, i.e. technologies that would make it far easier to cause massive destruction or even human extinction. Similarly, some technologies might make it far easier to instantiate a lot of suffering, or might give agents new reasons to do so.1 A concrete example of such a technology is the ability to run simulations that are detailed enough to contain (potentially suffering) digital minds.
It is plausible that most s-risks can be averted at least in principle – that is, given sufficient will to do so. Therefore, s-risks are far more likely to occur in worlds without adequate efforts to prevent s-risks. This could happen for three main reasons:
Human civilisation contains many different actors with a vast range of goals, including some actors that are, for one reason or another, motivated to cause harm to others. Assuming that this will remain the case in the future3, a third risk factor is inadequate security against bad actors. The worst case is a complete breakdown of the rule of law and associated institutions to enforce them. But even if that does not happen, the capacity for preventive policing – stopping rogue actors from causing harm – may be limited, e.g. because the means of surveillance and interception are not sufficiently reliable. In particular, if powerful autonomous AI agents are widespread in future society, it is unclear how (and if) adequate policing of these agents can be established.
(In his paper on the Vulnerable World Hypothesis, Nick Bostrom refers to this as the semi-anarchic default condition; he argues that human society is currently in that state and that it is important to exit this condition by establishing effective global governance and preventive policing. The paper is mostly about preventing existential risks, but large parts of the analysis are transferable to s-risk prevention.)
Put differently, military applications of future technological advances will change the offense-defense balance4, possibly in a way that makes s-risks more likely. A common concern is that strong offensive capabilities would enable a safe first strike, undermining global stability. However, when it comes to s-risks in particular, I think tipping the balance in favor of strong defense is also dangerous, and may even be a bigger concern than strong offensive capabilities. This is because actors can no longer be deterred from bad actions if they enjoy strong defensive advantages.5 In a scenario of non-overlapping spheres of influence and strong defense, security in terms of preventing an invasion of one’s own “territory” is adequate, but security in terms of preventing actors from creating disvalue within their territory is inadequate.
S-risks are also more likely if future actors endorse strongly differing value systems that have little or nothing in common, or that might even be directly opposed to each other. This holds especially if such divergence is combined with a high degree of polarisation and a lack of understanding of other perspectives – though it is also possible that differing value systems tolerate each other, e.g. because of moral uncertainty.
This constitutes a risk factor for several reasons:
I am most worried about worlds where many risk factors concur. I’d guess that a world where all four factors materialise is more than four times as bad as a world where only one risk factor occurs (i.e. overall risk scales super-linearly in individual factors). This is because the impact of a single risk factor can, at least to some extent, be compensated if other aspects of future society work well:
Further research could consider the following questions for each of the risk factors:
This post was inspired by a comment by Max Daniel on the concept of risk factors. I’d also like to thank Max Daniel, David Althaus, Lukas Gloor, Jonas Vollmer and Ashwin Acharya for valuable comments on a draft of this post.
The post Robust program equilibrium appeared first on Center on Long-Term Risk.
One approach to achieving cooperation in the one-shot prisoner’s dilemma is Tennenholtz’s (Games Econ Behav 49(2):363–373, 2004) program equilibrium, in which the players of a game submit programs instead of strategies. These programs are then allowed to read each other’s source code to decide which action to take. As shown by Tennenholtz, cooperation is played in an equilibrium of this alternative game. In particular, he proposes that the two players submit the same version of the following program: cooperate if the opponent is an exact copy of this program and defect otherwise. Neither of the two players can benefit from submitting a different program. Unfortunately, this equilibrium is fragile and unlikely to be realized in practice. We thus propose a new, simple program to achieve more robust cooperative program equilibria: cooperate with some small probability ε and otherwise act as the opponent acts against this program. I argue that this program is similar to the tit for tat strategy for the iterated prisoner’s dilemma. Both “start” by cooperating and copy their opponent’s behavior from “the last round”. We then generalize this approach of turning strategies for the repeated version of a game into programs for the one-shot version of a game to other two-player games. We prove that the resulting programs inherit properties of the underlying strategy. This enables them to robustly and effectively elicit the same responses as the underlying strategy for the repeated game.
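To make the construction concrete, here is a minimal Python sketch of the proposed program for the one-shot prisoner’s dilemma. The function names, the value of the grounding probability, and the representation of programs as callables that receive the opponent’s program are illustrative choices made here, not taken from the paper.

```python
import random

EPSILON = 0.05  # small "grounding" probability (illustrative value)

def nice_bot(opponent):
    """With probability EPSILON, cooperate unconditionally; otherwise,
    simulate what the opponent program does when playing against nice_bot
    and copy that action. The grounding step ends the mutual simulation
    with probability 1 (the expected recursion depth is about 1/EPSILON)."""
    if random.random() < EPSILON:
        return "C"
    return opponent(nice_bot)

def defect_bot(opponent):
    """Unconditional defector, for comparison."""
    return "D"

if __name__ == "__main__":
    print("nice vs nice:  ", nice_bot(nice_bot))    # always "C"
    print("nice vs defect:", nice_bot(defect_bot))  # "D", except with probability EPSILON
```

Two copies of this program cooperate with certainty, while an unconditional defector is met with defection except for the small grounding probability, mirroring how tit for tat cooperates with itself but retaliates against defectors.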
Read the paper on the publisher's website.
The post Challenges to implementing surrogate goals appeared first on Center on Long-Term Risk.
Surrogate goals might be one of the most promising approaches to reduce (the disvalue resulting from) threats. The idea is to add to one’s current goals a surrogate goal that one did not initially care about, hoping that any potential threats will target this surrogate goal rather than what one initially cared about.
In this post, I will outline two key obstacles to a successful implementation of surrogate goals.
In most settings, neither the threatener nor the threatenee will have perfect knowledge of the relative attractiveness of threats against the surrogate goal compared to threats against the original goal. For instance, the threatener may possess private information about how costly it is for her to carry out threats against either goal, while the threatenee may know more precisely how bad the execution of threats would be compared to the loss of resources from giving in. This private information affects the feasibility of threats against either goal.
Now, it is possible that the surrogate goal may be a better threat target given the threatenee’s information, but the initial goal is better given the threatener’s (private) information. Surrogate goals don’t work in this case because the threatener will still threaten the initial goal.
The most straightforward way to deal with this problem is to make the surrogate goal more threatener-friendly so that the surrogate goal will still be the preferred target even with some private information pointing in the other direction. However, that introduces a genuine tradeoff between the probability of successfully deflecting threats to the surrogate goal and the expected loss of utility due to a worsened bargaining position. (Without private information, surrogate goals would only require an infinitesimally small concession in terms of vulnerability to threats.)
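As a stylized illustration of this tradeoff, consider the following toy model, a sketch with made-up numbers rather than a model from any particular source: the threatener privately knows how attractive the original goal is as a threat target, and the threatenee chooses how “threatener-friendly” to make the surrogate. A friendlier surrogate deflects more threats but costs more bargaining power whenever it is targeted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Threatener's private information: how attractive the ORIGINAL goal is
# as a threat target, relative to a baseline surrogate (uniform on [0, 1]).
theta = rng.uniform(0.0, 1.0, size=100_000)

HARM_ORIGINAL = 1.0    # loss to the threatenee if the original goal is threatened
BARGAINING_COST = 2.0  # loss per unit of extra threatener-friendliness conceded

def expected_loss(friendliness):
    """Expected loss as a function of how threatener-friendly the surrogate is.
    The threat is deflected to the surrogate whenever the surrogate looks more
    attractive than the original goal given the threatener's private signal."""
    deflected = friendliness > theta
    loss = np.where(deflected, BARGAINING_COST * friendliness, HARM_ORIGINAL)
    return loss.mean()

for f in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"friendliness={f:.2f}  expected loss={expected_loss(f):.2f}")
```

With these arbitrary numbers, the expected loss first falls and then rises again as the surrogate is made more threatener-friendly, so the best choice lies somewhere in between, which is exactly the tradeoff described above.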
Surrogate goals fail if it is not credible – in the eyes of potential threateners – that you actually care about the surrogate goal. But apart from human psychology, is there a strong reason why surrogate goals may be less credible than initial goals?
Unfortunately, one of the main ways an observer can gain information about an agent’s values is to observe that agent’s behaviour and evaluate how consistent it is with a certain set of values. If an agent frequently takes actions to avoid death, that is (strong) evidence that the agent cares about survival (whether instrumentally or intrinsically). The problem is that surrogate goals should also not interfere with one’s initial goals, i.e. an agent will ideally not waste resources by pursuing surrogate goals. But in that case, threateners will find the agent’s initial goals credible but not its surrogate goal, and will thus choose to threaten the initial goals.
So the desiderata of credibility and non-interference are mutually exclusive if observing actions is a main source of evidence about values. An agent might be willing to spend some resources pursuing a surrogate goal to establish credibility, but that introduces another tradeoff between the benefits of a surrogate goal and the waste of resources. Ideally, we can avoid this tradeoff by finding other ways to make a surrogate goal credible. For instance, advanced AI systems could be built in a way that makes their goals (including surrogate goals) transparent to everyone.
The post Privacy Policy appeared first on Center on Long-Term Risk.
We use various software tools to help us manage the data we collect, all of which have high standards of privacy and security.
Cookies are small text-only files that are stored on your computer and that help identify you when you visit any of our sites. This makes visiting and using our sites easier; for instance, you do not have to re-enter information continually.
We use cookies to store session data, including cookies that last across sessions until they expire or you delete them. We use Google Analytics, which also uses cookies to identify users. The way these services use and store user data is governed by their respective privacy policies. You can always change your cookie settings via your browser.
You have the right to control what data we store and how we use it. In particular, you can always contact us if:
Feel free to contact us at any time at info@ea-longtermrisk.org.
We don’t expect or intend to collect information on children under 16. Children should get help from a parent or guardian before entering personal information into a website.
If you have any questions about this privacy policy, contact us anytime at info@ea-longtermrisk.org.
The post Descriptive Population Ethics and Its Relevance for Cause Prioritization appeared first on Center on Long-Term Risk.
Descriptive ethics is the empirical study of people's values and ethical views, e.g. via a survey or questionnaire. This overview focuses on beliefs about population ethics and exchange rates between goods (e.g. happiness) and bads (e.g. suffering). Two variables seem particularly important and action-guiding in this context, especially when trying to make informed choices about how to best shape the long-term future: 1) one’s normative bads-to-goods ratio (N-ratio) and 2) one’s expected goods-to-bads ratio (E-ratio). I elaborate on how a framework consisting of these two variables could inform our decision-making with respect to shaping the long-term future, as well as facilitate cooperation among differing value systems and further moral reflection. I then present concrete ideas for further research in this area and investigate associated challenges. The last section lists resources which discuss further methodological and theoretical issues that were beyond the scope of the present text.1
Recently, some debate has emerged on whether reducing extinction risk is the ideal course of action for shaping the long-term future. For instance, in the Global Priorities Institute (GPI) research agenda, Greaves & MacAskill (2017, p.13) ask “[...] whether it might be more important to ensure that future civilisation is good, assuming we don’t go extinct, than to ensure that future civilisation happens at all.” We could further ask to what extent we should focus our efforts on reducing risks of astronomical suffering (s-risks). Again, Greaves & MacAskill: “Should we be more concerned about avoiding the worst possible outcomes for the future than we are for ensuring the very best outcomes occur [...]?” Given the enormous stakes, these are arguably some of the most important questions facing those who prioritize shaping the long-term future.2
Some interventions increase both the quality of future civilization as well as its probability. Promoting international cooperation, for instance, likely reduces extinction risks as well as s-risks. However, it seems implausible for one single intervention to be optimally cost-effective at accomplishing both types of objectives at the same time. To the extent to which there is a tradeoff between different goals relating to shaping the long-term future, we should make a well-considered choice about how to prioritize among them.
I suggest that this choice can be informed by two important variables: One’s normative bads-to-goods ratio3 (N-ratio) and one’s empirically expected goods-to-bads ratio (E-ratio). Taken together, these variables can serve as a framework for choosing between different options to shape the long-term future.
(For utilitarians, N- and E-ratios amount to their normative / expected suffering-to-happiness ratios. But for most humans, there are bads besides suffering, e.g. injustice, and goods other than happiness, e.g. love, knowledge, or art. More on this below.)
I will elaborate in greater detail below on how to best interpret and measure these two ratios. For now, a few examples should suffice to illustrate the general concept. Someone with a high N-ratio of, say, 100:1 believes that reducing bads is one hundred times as important as increasing goods, whereas someone with an N-ratio of 1:1 thinks that increasing goods and reducing bads are equally important.4 Similarly, someone with an E-ratio of, say, 1000:1 thinks that there will be one thousand times as much good as bad in the future in expectation, whereas someone with a lower E-ratio is more pessimistic about the future.5
Note that I don’t assume an objective way to measure goods and bads, so a statement like “reducing suffering is x times more important than promoting happiness” is imprecise unless one further specifies what precisely is being compared. (See also the section "The measurability of happiness and suffering".)
In short, the more one's E-ratio exceeds one's N-ratio, the higher one’s expected value of the future, and the more one favors interventions that primarily reduce extinction risks.6 In contrast, the more one's N-ratio exceeds one's E-ratio, the more appealing become interventions that primarily reduce s-risks or otherwise improve the quality of the future without affecting its probability. The graphic below summarizes the discussion so far.
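As a rough, stylized way to make this relationship explicit (glossing over the measurement caveats above): write G and B for the expected amounts of goods and bads in the future, so that the E-ratio is G/B, and write N for the weight one places on a unit of bads relative to a unit of goods. The expected net value of the future is then roughly G − N·B = B·(E − N), which is positive exactly when the E-ratio exceeds the N-ratio. For instance, someone with an N-ratio of 100:1 who expects 1,000 units of good per unit of bad regards the future as clearly net positive, whereas the same person with an E-ratio of only 10:1 regards it as net negative.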
Of course, this reasoning is rather simplistic. In practice, considerations from comparative advantages, tractability, neglectedness, option value, moral trade, et cetera need to be factored in.7 See also Cause prioritization for downside-focused value systems for a more in-depth analysis.8
The rest of this section elaborates on the meaning of N-ratios and explains one approach of measuring or at least approximating them. In short, I propose to approximate an individual’s N-ratio by measuring their response tendencies to various ethical thought experiments (e.g. as part of a questionnaire or survey) and comparing them to those of other individuals. These questions could be of (roughly) the following kind:
Imagine you could create a new world inhabited by X humans living in a utopian civilization free of involuntary suffering, and where everyone is extremely kind, intelligent, and compassionate. In this world, however, there also exist 100 humans who experience extreme suffering.
What’s the smallest value of X for which you would want to create this world?
In short, people who respond with higher equivalence numbers X to such thought experiments should have higher N-ratios, on average.
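Loosely speaking, such answers can be translated into N-ratios: a respondent who answers X = 1,000,000 in the example above is saying that offsetting 100 lives of extreme suffering requires about a million utopian lives, i.e. roughly 10,000 such lives per life of extreme suffering, whereas a respondent who answers X = 200 trades them off at about 2:1. (This translation is only a rough heuristic since, as discussed below, answers depend heavily on the specific goods and bads depicted and on the framing of the question.)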
Some words of caution are in order here. First, the final formulations of such questions should obviously contain more detailed information and, for example, specify how the inhabitants of the utopian society live, what form of suffering the humans experience precisely, et cetera. (See also the document “Preliminary Formulations of Ethical Thought Experiments” which contains much longer formulations.)
Second, an individual’s equivalence number X will depend on what form of ethical dilemma is used and its precise wording. For example, asking people to make intrapersonal instead of interpersonal trade-offs, or writing “preserving” instead of “creating”, will likely influence the responses.
Third, subjects’ equivalence numbers will depend on which type of bad or good is depicted. Hedonistic utilitarians, for instance, regard pleasure as the single most important good and would place great value on, say, computer programs experiencing extremely blissful states. Many other value systems would consider such programs to be of no positive value whatsoever. Fortunately, many if not most value systems regard suffering9 as one of the most important bads and also place substantial positive value on flourishing societies inhabited by humans experiencing eudaimonia – i.e. “human flourishing” or happiness plus various other goods, such as virtue and friendship.10 In conclusion, although N-ratios (as well as E-ratios) are generally agent-relative, well-chosen “suffering-to-eudaimonia ratios” will likely allow for more meaningful and robust interindividual comparisons while still being sufficiently natural and informative. (See also the section "N-ratios and E-ratios are agent-relative" of the appendix for a further discussion of this issue.)
However, even if we limit our discussion to various forms of suffering and eudaimonia, judgments might diverge substantially. For example, Anna might only be willing to trade one minute of physical torture in exchange for many years of eudaimonia, while she would trade one week of depression for just one hour of eudaimonia. Others might make different or even opposite choices. If we had asked Anna only the first question, we could have concluded that her N-ratio is high, but her stance on the second question suggests that the picture is more complicated.
Consequently, one might say that even different forms of suffering and happiness/eudaimonia comprise “axiologically distinct” categories and that, instead of generic “suffering-to-eudaimonia ratios” – let alone “bads-to-goods ratios” – we need more fine-grained ratios, e.g. “suffering_typeY-to-eudaimonia_typeZ ratios”.11
See also “Towards a Systematic Framework for Descriptive (Population) Ethics” for a more extensive overview of the relevant dimensions along which ethical thought experiments can and should vary. “Descriptive Ethics – Methodology and Literature Review” provides an in-depth discussion of various methodological and theoretical questions, such as how to prevent anchoring or framing effects, control for scope insensitivity, increase internal consistency, and so on.
Do these considerations suggest that research in descriptive ethics is simply not feasible? This seems unlikely to me but it’s at least worth investigating further.
For illustration, imagine that a few hundred effective altruists completed a survey consisting of thirty different ethical thought experiments that vary along a certain number of dimensions, such as the form and intensity of suffering or happiness, its duration, or the number of beings involved.
We could now assign a percentile rank to every participant for each ethical thought experiment. If the concept of a general N-ratio is viable, we should observe that the percentile ranks of a given participant correlate across different dilemmas. That is, if someone gave very high equivalence numbers to the first, say, fifteen dilemmas, it should be more likely that this person also gave high equivalence numbers to the remaining dilemmas. Investigating whether there is such a correlation, how high it is, and how much it depends on the type or wording of each ethical thought experiment, could itself lead to interesting insights.
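A minimal sketch of such an analysis, using simulated responses in place of real survey data (all variable names and distributional choices below are illustrative assumptions, not survey results):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated responses: rows = participants, columns = dilemmas,
# values = equivalence numbers. A lognormal participant-level tendency
# stands in for a latent "general N-ratio"; noise makes dilemmas differ.
n_participants, n_dilemmas = 300, 30
tendency = rng.lognormal(mean=5.0, sigma=2.0, size=n_participants)
noise = rng.lognormal(mean=0.0, sigma=1.0, size=(n_participants, n_dilemmas))
responses = pd.DataFrame(tendency[:, None] * noise,
                         columns=[f"dilemma_{i + 1}" for i in range(n_dilemmas)])

# Percentile rank of each participant within each dilemma.
percentile_ranks = responses.rank(pct=True)
print(percentile_ranks.round(2).head())

# Consistency check: average pairwise rank (Spearman) correlation between dilemmas.
corr = responses.corr(method="spearman").to_numpy()
mean_corr = corr[np.triu_indices_from(corr, k=1)].mean()
print(f"mean inter-dilemma rank correlation: {mean_corr:.2f}")
```

A high average correlation would suggest that a single underlying N-ratio captures much of the variation in responses; a low one would suggest that more fine-grained ratios are needed.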
Important and action-guiding conclusions could be inferred from such a survey, both on an individual and on a group level.
First, consider the individual level. Imagine a participant answered with “infinite” in twenty dilemmas. Further assume that the average equivalence number of this participant in the remaining ten dilemmas was also extremely high, say, one trillion. Unless this person has an unreasonably high E-ratio (i.e. is unreasonably optimistic about the future), this person should, ceteris paribus, prioritize interventions that reduce s-risks over, say, interventions that primarily reduce risks of extinction but which might also increase s-risks (such as, perhaps, building disaster shelters12); especially so if they learn that most respondents with lower average equivalence numbers do the same.13
Second, let’s turn to the group level. It could be very useful to know how equivalence numbers among effective altruists are distributed. For example, central tendencies such as the median or average equivalence number could inform allocation decisions within the effective altruism movement as a whole. They could also serve as a starting point for finding compromise solutions or moral trades between varying groups within the EA movement – e.g. between groups with more upside-focused value systems and those with more downside-focused value systems. Lastly, engaging with the actual thought experiments of the survey, as well as its results and potential implications, could increase the moral reflection and sophistication of the participants, allowing them to make decisions more in line with their idealized preferences.
Readers unfamiliar with the idea of multiverse-wide superrationality (MSR) are strongly encouraged to first read the paper “Multiverse-wide Cooperation via Correlated Decision Making” (Oesterheld, 2017) or the post “Multiverse-wide cooperation in a nutshell”. Readers unconvinced by or uninterested in MSR are welcome to skip this section.
To briefly summarize, MSR is the idea that by taking into account the values of superrationalists located elsewhere in the multiverse, it becomes more likely that they do the same for us. In order for MSR to work, it is essential to have at least some knowledge about how the values of superrationalists elsewhere in the multiverse are distributed. Surveying the values of (superrational) humans14 is one promising way of gaining such knowledge.15
Obtaining a better estimate of the average N-ratio of superrationalists in the multiverse seems especially action-guiding. For illustration, imagine we knew that most superrationalists in the multiverse have a very high N-ratio. All else equal and ignoring considerations from neglectedness, tractability, etc., this implies that superrationalists elsewhere in the multiverse would probably want us to prioritize the reduction of s-risks over the reduction of extinction risks.16 In contrast, if we knew that the average N-ratio among superrationalists in the multiverse is very low, reducing extinction risks would become more promising.
Another important question is to what extent and in what respects superrationalists discriminate between their native species and the species located elsewhere in the multiverse.17
Another challenge facing research in descriptive ethics is that at least some answers are likely to be driven by more or less superficial system 1 heuristics generating a variety of biases – e.g. empathy gap, duration neglect, scope insensitivity, and framing effects, to name just a few. While there are ways to facilitate the engagement of more controlled cognitive processes18 that make reflective judgments more likely, not every possible bias or confounder can be eliminated.
All in all, the skeptic has a point when she distrusts the results of such surveys because she assumes that most subjects merely pulled their equivalence numbers out of thin air. Ultimately, however, I think that reflecting on various ethical thought experiments in a systematic fashion, pulling equivalence numbers out of thin air and then using these numbers to make more informed decisions about how to best shape the long-term future is often better – in the sense of dragging in fewer biases and distorting intuitions – than pulling one’s entire decision out of thin air.19
A further problem is that the N-ratios of many subjects will likely fluctuate over the course of years or even weeks.20 Nonetheless, knowing one’s N-ratios will be informative and potentially action-guiding for some subjects – e.g. for those who already engaged in substantial amounts of moral reflection (and are thus likely to have more stable N-ratios), or for subjects who have particularly high N-ratios such that their priorities would only shift if their N-ratios changed dramatically. Studying the stability of N-ratios is also an interesting research project in itself. (See also the section “moral uncertainty” of another document for more notes on this topic.)
The Google Docs listed below discuss further methodological, practical, and theoretical questions which were beyond the scope of the present text. As I might deprioritize the project for several months, I decided to publish my thinking at its current stage to enable others to access it in the meantime.
1) Descriptive Ethics – Methodology and Literature Review.
This document is motivated by the question of what we can learn from the existing literature – particularly in health economics and experimental philosophy – on how to best elicit normative ratios. It also contains a lengthy critique of the two most relevant academic studies about population ethical views and examines how to best measure and control for various biases (such as scope insensitivity, framing effects, and so on).
2) Towards a Systematic Framework for Descriptive (Population) Ethics.
This document develops a systematic framework for descriptive ethics and provides a classification of dimensions along which ethical thought experiments can (and should) vary.
3) Preliminary Formulations of Ethical Thought Experiments.
This document contains preliminary formulations of ethical thought experiments. Note that the formulations are designed such that they can be presented to the general population and might be suboptimal for effective altruists.
4) Descriptive ethics – Ordinal Questions (incl. MSR) & Psychological Measures.
This document discusses the usefulness of existing psychological instruments (such as the Moral Foundations Questionnaire, the Cognitive Reflection Test, etc.). The document also includes tentative suggestions for how to assess other constructs such as moral reflection, happiness, and so on.
Please note that the above documents are a work in progress, so I ask you to understand that much of the material hasn't been polished and, in some cases, does not even accurately reflect my most recent thinking. This also means that there is a significant opportunity for collaborators to contribute their own ideas rather than to just execute an already settled plan. In any case, comments in the Google documents are highly appreciated, whether you're interested in becoming more involved in the project or not.
I want to thank Max Daniel, Caspar Oesterheld, Johannes Treutlein, Tobias Pulver, Jonas Vollmer, Tobias Baumann, Lucius Caviola, and Lukas Gloor for their extremely valuable inputs and comments. Thanks also to Nate Liu, Simon Knutsson, Brian Tomasik, Adrian Rorheim, Jan Brauner, Ewelina Tur, Jennifer Waldmann, and Ruairi Donnelly for their comments.
Assuming moral anti-realism is true, there are no universal or “objective” goods and bads. Consequently, if we want to avoid confusion, E-ratios and N-ratios should ultimately refer to the values of a specific agent, or, to be more precise, a specific set of goods and bads.
For illustration, consider two hypothetical agents: Agent_1 has an N-ratio of 1:1 and an E-ratio of 1000:1, while agent_2 has an N-ratio of 1:1 and an E-ratio of 1:10. Do these agents share similar values but have radically different conceptions about how the future will likely unfold? Not necessarily. Agent_1 might be a total hedonistic utilitarian and agent_2 an AI that wants to maximize paperclips and minimize spam emails. Both might agree that the future will, in expectation, contain 1000 times as much pleasure as suffering but 10 times as many spam emails as paperclips.
Of course, the sets of bads and goods of humans will often overlap, at least to some extent. Consequently, if we learn that human_1 has a much lower E-ratio than human_2, this tells us that human_1 is probably more pessimistic than human_2 and that both likely disagree about how the future is going to unfold.
In this context, it also seems worth noting that there might be more overlap with regards to bads than with regards to goods. For illustration, consider the number of macroscopically distinct futures whose net value is extremely negative according to at least 99.5% of all humans. It seems plausible that this number is (much) greater than the number of macroscopically distinct futures whose net value is extremely positive according to at least 99.5% of all humans. In fact, those of us who are more pessimistic about the prospect of wide agreement on values might worry that the latter number is (close to) zero, especially if one doesn’t allow for long periods of moral reflection.
In my view, there are no “objective” units of happiness or suffering. Thus, it can be misleading to talk about the absolute magnitude of N-ratios without specifying the concrete instantiations of bads and goods that were traded against each other.
For more details on the measurability of happiness and suffering (or lack thereof), I highly recommend the essays “Measuring Happiness and Suffering” and “What Is the Difference Between Weak Negative and Non-Negative Ethical Views?” by Simon Knutsson, especially this section and the description of Brian Tomasik’s views whose approach I share.
The post Descriptive Population Ethics and Its Relevance for Cause Prioritization appeared first on Center on Long-Term Risk.
The post A framework for thinking about AI timescales appeared first on Center on Long-Term Risk.
To steer the development of powerful AI in beneficial directions, we need an accurate understanding of how the transition to a world with powerful AI systems will unfold. A key question is how long such a transition (or “takeoff”) will take. This has been discussed at length, for instance in the AI foom debate.
In this post, I will attempt to clarify what we mean more precisely when we talk of “soft” or “hard” takeoffs.
(Disclaimer: Probably most of the following ideas have already been mentioned somewhere in some form, so my claimed contribution is just to collect them in one place.)
The obvious question is: what reference points do we use to define the beginning and the end of the transition to powerful AI? Ideally, the reference points should be applicable to a wide range of plausible AI scenarios rather than making tacit assumptions about what powerful AI will look like.
A commonly used reference point for the beginning is the attainment of “human-level” general intelligence (also called AGI, artificial general intelligence), which is defined as the ability to successfully perform any intellectual task that a human is capable of. The reference point for the end of the transition is the attainment of superintelligence – being vastly superior to humans at any intellectual task – and the “decisive strategic advantage” (DSA) that ensues.1 The question, then, is how long it takes to get from human-level intelligence to superintelligence.
I find this definition problematic. The framing suggests that there will be a point in time when machine intelligence can meaningfully be called “human-level”. But I expect artificial intelligence to differ radically from human intelligence in many ways. In particular, the distribution of strengths and weaknesses over different domains or different types of reasoning is and will likely be different2 – just as machines are currently superhuman at chess and Go, but tend to lack “common sense”. AI systems may also diverge from biological minds in terms of speed, communication bandwidth, reliability, the possibility to create arbitrary numbers of copies, and entanglement with existing systems.
Unless we have reason to expect a much higher degree of convergence between human and artificial intelligence in the future, this implies that at the point where AI systems are at least on par with humans at any intellectual task, they actually vastly surpass humans in most domains (and have just fixed their worst weakness). So, in this view, “human-level AI” marks the end of the transition to powerful AI rather than its beginning.
As an alternative, I suggest that we consider the fraction of global economic activity that can be attributed to (autonomous) AI systems.3 Now, we can use reference points of the form “AI systems contribute X% of the global economy”. (We could also look at the fraction of resources that’s controlled by AI, but I think this is sufficiently similar to collapse both into a single dimension. There’s always a tradeoff between precision and simplicity in how we think about AI scenarios.)
A useful definition for the duration of the transition to powerful AI could be how long it takes to go from, say, 10% of the economy to 90%. If we wish to measure the acceleration more directly, we could also ask how much less time it will take to get from 50% to 90% compared to 10%-50%. (The point in time where AI systems contribute 50% to the economy could even be used as a definition for “human-level” AI or AGI, though this usage is nonstandard.)
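To make this concrete, here is a minimal sketch of how one might read off the takeoff duration from a (hypothetical) time series of the AI share of global economic output; the function name and the data points are invented for illustration:

```python
# Hypothetical sketch: measuring takeoff duration as the time it takes the AI
# share of global economic output to go from 10% to 90%. The data is invented.

def first_year_reaching(share_by_year, threshold):
    """Return the first year in which the AI share reaches the given threshold."""
    for year, share in sorted(share_by_year.items()):
        if share >= threshold:
            return year
    return None

# Toy data: year -> fraction of global economic output attributable to AI systems.
ai_share = {2040: 0.05, 2045: 0.10, 2050: 0.30, 2055: 0.50, 2060: 0.75, 2065: 0.90}

start = first_year_reaching(ai_share, 0.10)  # beginning of the transition
end = first_year_reaching(ai_share, 0.90)    # end of the transition
print(f"Takeoff duration (physical time): {end - start} years")  # 20 years
```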
I think this definition broadly captures what we intuitively mean when we talk about “powerful”, “advanced”, or “transformative” AI: we mean AI that is capable enough to displace4 humans at a large range of economically vital tasks.5
A key advantage of this definition is that it makes sense in many possible AI scenarios – regardless of whether we imagine powerful AI as a localized, discrete entity or as a distributed process with gradually increasing impacts.
By default, we tend to think about “how long it takes” in terms of physical time, that is, what a wall clock measures. But perhaps physical time is not the most relevant or useful metric in this context. It is conceivable that the world generally moves, say, 20 times faster when the transition to powerful AI happens – e.g. because of whole brain emulation or strong forms of biological enhancement. In this case, the transition takes much less physical time, but the quantity of interest is some notion of “how much stuff is happening during the transition”, not the number of revolutions on grandma’s wall clock.
A natural alternative is economic time6, which adjusts for the overall rate of economic progress and innovation. Currently, the global economy doubles every ~20 years, so a year of economic time corresponds to a growth of ~3.5%. The question, then, is how many economic doublings will happen during the transition to powerful AI. Saying that it will take 40 years of economic time would mean that the global economy quadruples during the transition (which might take much less than 40 years of physical time).7
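As a rough illustration of this arithmetic, here is a toy sketch that takes the ~20-year doubling time above as the baseline; the constant and function names are mine:

```python
# Toy sketch of "economic time", using a ~20-year economic doubling time as the
# current baseline (the figure used in the text). Names are made up.
import math

BASELINE_DOUBLING_TIME = 20.0  # years of physical time per economic doubling today

# One year of economic time at this baseline corresponds to ~3.5% growth:
annual_growth = 2 ** (1 / BASELINE_DOUBLING_TIME) - 1
print(f"Growth per year of economic time: {annual_growth:.1%}")  # ~3.5%

def years_of_economic_time(total_growth_factor):
    """Economic time corresponding to a given overall growth of the economy."""
    doublings = math.log2(total_growth_factor)
    return doublings * BASELINE_DOUBLING_TIME

# 40 years of economic time means the economy quadruples during the transition,
# however little physical time that quadrupling actually takes.
print(years_of_economic_time(4))  # 40.0
```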
Another alternative – which I will call political time – is to adjust for the rate of social change. So, saying that the transition takes 10 years of political time would mean that there will be ten times as much change in the relative power of individuals, institutions, and societal values compared to what happens in an average year these days.8
(It is possible that the economic growth rate is a good approximation of the rate of political change, rendering these notions of time equivalent. But this isn’t obvious to me – in particular, the rate of political change might lag if there’s a disruptive acceleration in economic growth. Conversely, economic collapse may cause a lot of political change.)
Given this, we can now talk about slow vs. fast takeoffs in physical, economic, or political time. The takeoff may be slow along some axes, but fast along others. Specifically, I find it plausible that a takeoff would be quite fast in physical time, but much slower in terms of economic or political time.
We can use a similar terminology for AI timelines by asking how much physical, economic, or political time will pass...
These two questions are equivalent for people who assume that the transition to powerful AI will likely result in a steady state (a singleton with a decisive strategic advantage). But this is an implicit assumption, and in my opinion it’s also quite possible that there would be centuries (or more) of time – especially in terms of economic or political time – between the advent of powerful AI and the formation of a steady state.
I am indebted to Brian Tomasik, Lukas Gloor, Max Daniel and Magnus Vinding for valuable comments on the first draft of this text.
The post A framework for thinking about AI timescales appeared first on Center on Long-Term Risk.
The post Commenting on MSR, Part 2: Cooperation heuristics appeared first on Center on Long-Term Risk.
This post was originally written for internal discussions only; it is half-baked and unpolished. The post assumes familiarity with the ideas discussed in Caspar Oesterheld’s paper Multiverse-wide cooperation via coordinated decision-making. I wrote a short introduction to multiverse-wide cooperation in an earlier post (but I still recommend reading parts of Caspar’s original paper, or this more advanced introduction, because several of the points that follow below build on topics not covered in my introduction). With that out of the way: In this post, I will comment on what I think might be interesting aspects of multiverse-wide cooperation via superrationality (abbreviation: MSR) and what I think might be its practical implications – if the idea works at all. I will focus particularly on aspects where I place more emphasis on certain considerations than Caspar does in his paper, though most of the issues I discuss are already noted by Caspar. A major theme in my comments will be exploring how the multiverse-wide compromise changes shape once we go from a formal, idealized conception of how to think about it to real-world policy suggestions for humans. For the perhaps most interesting part of the post, skip to the section "How to trade, practically."
(Epistemic status: Highly speculative. I am outlining practical implications not because I am convinced that they are what we should do, but as an exercise in what I think would follow given certain assumptions.)
Under idealized conditions, each MSR participant would attempt to follow the same multiverse-wide compromise utility function (MCUF) reflecting the distribution of values amongst all superrational cooperators. In practice, trying to formalize a detailed, probabilistic model of what the MCUF would look like, and consulting it for every decision, is much too noisy and effortful. A more practical strategy for implementing MSR is to come up with heuristics that approximate the MCUF reasonably well for the purposes needed. Let’s call these cooperation heuristics (CHs). A simple example of a heuristic might be “Perform actions that benefit the value systems of other superrational cooperators considerably if you can do so at low cost, and refrain from hurting these value systems if you would only expect comparatively small gains from doing so.” This example heuristic is easy to follow and unlikely to go wrong. In fact, aside from the part about superrational cooperators, it sounds like a great decision rule even for people who do not buy into MSR but are interested in low-effort ways of cooperating with other people for all the "normal" reasons (the many reasons in favor of cooperation that do not involve aliens). The primary caveat about this particular CH is that it is very vague, and that the gains from trade it produces if everyone were to follow it are far from maximal. MSR is attractive because it may make it possible for us to give other value systems even more weight through further-reaching CHs, and thereby, in expectation, get more gains from trade back in return.
In the standard prisoner’s dilemma, the participants have symmetrical information and payoff structures. MSR can be viewed as a prisoner's-dilemma-like decision problem, but one that is a lot more messy than any traditional formulation of a prisoner’s dilemma. In MSR, different instantiations of superrational cooperators find themselves with information asymmetries, different goals, different biases and different resources at their disposal. Consider this non-exhaustive list of examples for potentially asymmetric features between MSR participants:
Asymmetries amongst potential MSR participants call into question whether we can indeed assume that our potential cooperators are finding themselves in decision situations sufficiently similar to the ones we find ourselves in. To recap: We are (for the sake of entertaining the argument) assuming that MSR works when two agents operate on highly similar decision algorithms and find themselves in highly similar decision situations. Under these conditions, certain approaches to decision theory, which I’m for the sake of simplicity referring to with the umbrella term superrationality, recommend reasoning as though the decision algorithms of the agents in question are logically entangled and are going to output the same decisions. Asymmetric features amongst potential MSR participants now make it non-obvious whether we can still talk of the decision situations different agents find themselves in as being “relevantly similar,” or whether they break the similarity because the conclusions that participants will come to, for instance with regard to whether to incorporate MSR into their behavior or not, might be affected by these asymmetries.
So does MSR break down because decision situations are never relevantly similar? I think the answer depends on the level of abstraction at which the agents are looking at a decision, i.e., how they come to construe their decision problem. We can assume that agents interested in MSR have an incentive to pick whichever general and collectively followed process for selecting cooperation heuristics produces the largest gains from trade.
Because the correct priorities for MSR participants may in many cases depend on the exact nature and distribution of asymmetric features, we can expect that on the level of concrete execution (but not at the level of the more general decision problem) “implementing MSR” could look fairly different for different agents. Even though everyone would try to use cooperation heuristics that produce optimal benefits, individual cooperators’ cooperation heuristics would recommend different types of actions depending on whether an agent finds themselves with one type of comparative advantage or another.
To illustrate this point, consider agents who expect the highest returns from MSR from a focus on convergent priorities where they would work on interventions that are positive for their own value systems, but (absent MSR considerations) not maximally positive. Selecting interventions this way produces a compromise cluster where a few different value systems mutually benefit each other through a focus on some shared priority. Value systems A, B, C and D may, for instance, have a shared priority x and value systems E, F, G and H may share priority y. (By “priority,” I mean an intervention such as “reducing existential risk” or “promoting consequentialist thinking about morality.”) By focusing on a priority that one’s own value system shares with other value systems, one only benefits a subset of all value systems directly (only the ones one shares such convergent priorities with). However, through the general mechanism of doing something because it is informed by non-causal effects of one’s decision algorithm, we in theory should now also expect there to be increased coordination between value systems that form a cooperation cluster around their convergent priorities.
Similarly, following a cooperation heuristic of mostly cooperating with those value systems we know best (e.g. only with value systems we are directly familiar with) makes it more likely that civilizations full of completely alien value systems will also only cooperate with the value systems they already know (which sounds reasonable and efficient).
Focusing on convergent priorities is of course not the only strategy for generating gains from trade. Whether MSR participants should spend all their resources on convergent priorities, or whether they should rather work on (other) comparative advantages they may have at greatly benefitting particular value systems, depends on the specifics of the empirical circumstances that the agents find themselves in. The tricky part about focusing on comparative advantages rather than (just) convergent priorities is that it might be one’s comparative advantage to do something that is neutral or negative according to one’s value system, or at the very least has high opportunity costs. In such a case, one needs to be particularly confident that MSR works well enough, and that one’s CH is chosen/executed diligently enough, to generate sufficient benefits.
In practice, the whole picture of who benefits whom quickly becomes extremely complicated. A map of how different agents in MSR benefit each others’ value systems would likely contain all of the following features:
It is apparent that things would start to look confusing pretty quickly, and it seems legitimate to question whether humans can get the details of this picture right enough to reap any gains from trade at all (as opposed to shooting oneself in the foot). On the other hand, the CHs behind how each agent of a given value system should select their priorities could be kept simple. This can work if one picks cooperation heuristics in such a way that, assuming they are universally applied, they maximize the gains from trade for all value systems. (If done properly and if the assumptions behind MSR are correct, this then corresponds to maximizing the gains from trade for one’s own value system.)
For figuring out what MSR implies for humans, I think it is important to think in terms of agents of limited intelligence and rationality executing practical CHs, as opposed to ideal reasoners computing a maximally detailed MCUF for all decisions. Using heuristics means accepting a tradeoff between accuracy loss and practicality concerns. Accuracy in following the MCUF is important, because whether rare or hard-to-benefit value systems actually benefit from a cooperation heuristic depends on whether the heuristic is sensitive enough to notice situations where one’s comparative advantage is indeed to benefit these rare value systems. This makes it challenging to find simple heuristics that nevertheless react well to the ways in which all the features in decision situations can vary.
For these reasons, I recommend being careful with talk such as the following:
“Intervention X [insert: global warming reduction, existential risk reduction, AI safety, etc] is good from a MSR perspective.”
To be clear, there is a sense in which this way of talking can be perfectly reasonable. From the perspective of the MCUF, majority-favored interventions indeed receive a boost in how valuable they should be regarded as being, as compared to their evaluation from any single value system. However, this picture risks missing out on important nuances.
For instance, interventions that benefit value systems that are unusually hard to benefit must also receive a boost if the MCUF incorporates variance normalization (see chapter 3 here for an introduction). Using variance normalization means, roughly, that one looks at the variance of how much value or disvalue is commonly at stake for each value system and compensates for certain value systems being hard to benefit. If a value system is (for whatever structural reasons) particularly difficult to benefit, then for any of the rare instances where one is able to actually benefit said value system a great deal, doing so becomes especially important and one wants the MCUF and any CHs to be such that they recommend actually pursuing these rare instances.
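As a toy illustration of the general idea (not necessarily the exact scheme described in the linked chapter), one can picture rescaling each value system’s utilities over the available options to zero mean and unit variance before summing them; the value systems and numbers below are invented:

```python
# Toy illustration of variance normalization: each value system's utilities over
# the available options are rescaled to mean 0 and variance 1 before being summed,
# so that a hard-to-benefit value system is not simply drowned out.
import statistics

def normalize(utilities):
    """Rescale a list of utilities to mean 0 and standard deviation 1."""
    mean = statistics.mean(utilities)
    sd = statistics.pstdev(utilities)
    return [(u - mean) / sd for u in utilities]

options = ["option_1", "option_2", "option_3"]
system_a = [0.0, 90.0, 100.0]   # easy to benefit: large utility differences
system_b = [0.0, 0.2, 0.05]     # hard to benefit: tiny utility differences

raw = [a + b for a, b in zip(system_a, system_b)]
print(options[raw.index(max(raw))])  # option_3 -- system A's large numbers dominate

normed = [a + b for a, b in zip(normalize(system_a), normalize(system_b))]
print(options[normed.index(max(normed))])  # option_2 -- system B's preference now counts
```

Without the rescaling, whichever value system happens to have the larger raw numbers decides everything; with it, the rare opportunities to help a hard-to-benefit value system can tip the decision.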
These considerations paint a complicated picture. The worry with phrases like “intervention x is positive for MSR” is that it may tempt us to overlook that sometimes pursuing these interventions is heavily suboptimal if the MSR-participating agent actually has a strong comparative advantage for benefitting a value system that is normally unusually hard to benefit. When someone hears “intervention x is positive for MSR,” they may do more of intervention x without ever checking what other interventions are positive too, and potentially more positive for their given situation. As soon as people start to take shortcuts, there is a danger that these shortcuts will disproportionately and predictably benefit some value systems and neglect others. (We can think of shortcuts as cooperation heuristics produced by a dangerously low amount of careful thinking.)
The crucial theme here is that even if everyone always does things that are "positive according to the MCUF," if people often fail to do what is maximally positive, then it is possible for some value systems to predictably lose a lot of value in expectation or even suffer expected harm overall. Therefore, this cannot be how we should in practice implement MSR. The variance-voting or “equal gains from trade” MCUF – which I describe in more detail in the section “How to trade, ideally” – is set up such that if everyone tries to maximize it to the best of their ability, then it distributes gains equally. There is no guarantee that if everyone just picks random stuff that is positive according to this MCUF, this will be good for everyone. Cooperation heuristics have to be selected with the same principle in mind: We want to pick CHs which ensure equal gains from trade provided that everyone follows them diligently to the best of their ability.
All of this suggests that whether hearing an intervention being performed somewhere in the multiverse is “positive news” in expectation or not for a given MSR participant is actually not only a feature of that intervention itself, but also of whether the cooperation heuristic behind the decision was chosen and executed wisely. That is, it can in practice depend on things such as whether the agent who performed the intervention in question had a sufficiently large comparative advantage for it, or whether the intervention was chosen correctly for reasons of convergent priorities. With this in mind, it might in many contexts be epistemically safer to talk about CHs rather than concrete interventions being what is (without caveats) “positive from an MSR perspective.”
Asymmetries amongst MSR participants and the issue with choosing CHs in a way that distributes the gains from trade equally make it tricky to pick cooperation heuristics wisely. One failure mode I am particularly concerned about is the following:
Superrationalizing: When the CH you think you follow is different from the CH that actually guides your behavior.
For instance, you might think your CH produces the largest expected gains given practical concerns, but, unbeknownst to you, you only chose it the way you did because of asymmetric features, so that, if it were followed universally, it would give you a disproportionate benefit. Others, whom you thought would arrive at the same CH, will then adopt a different CH than the one you think you are following (perhaps biased in favor of their own benefit). You therefore lose out on the gains from trade you thought your CH would produce.
Relatedly, another manifestation of superrationalizing is that one might think one is following a CH that produces large gains from trade for one’s value system, but if the de facto execution of the CH one thinks one is following is sloppy, one has no reason to assume that the predicted gains from trade would actually materialize.
For better illustration, I am going to list some examples for different kinds of superrationalizing in a more concrete context. For this purpose, let me first introduce two hypothetical value systems held by MSR participants: Straightforwardism and Complicatedism.
Straightforwardists have practical priorities that are largely shared by the majority of value systems interested in MSR. Proponents of Complicatedism on the other hand are not excited about the canon of majority-favored interventions.
For an example of superrationalizing, let us assume that the Straightforwardists pick their CH according to the following, implicit reasoning: “When MSR participants reason very crudely about the MCUF and only draw the most salient implications with a very simple CH, such as looking for things that benefit many other value systems, this will be greatly beneficial for us. Therefore, we do not have to think too much about the specifics of the MCUF and can just focus on what is beneficial for many value systems including ours.”
By contrast, proponents of Complicatedism may worry about getting skipped in the compromise if people only perform the most salient, majority-favored interventions. So they might adopt a policy of paying extra careful attention to value systems never getting harmed by MSR in expectation, and therefore focus their own efforts disproportionately on benefitting the value system Supercomplicatedism, which only has few proponents and whose prioritization is very difficult to take into account.
Of course, MSR does not work that way, and the proponents of the two value systems above are making mistakes by, perhaps unconsciously/unthinkingly, assuming that other MSR participants will be affected symmetrically by features that are specific to only their own situation. The mistake is that if one pays extra careful attention to value systems never getting harmed by MSR because one’s own value system is in a minority that seems more at risk than the average value system, then the reasoning process at work is not “No matter the circumstances, be extra careful about value systems getting harmed.” Instead, the proper description of what is going on then would be that one unfairly privileges features that are only important for one’s own value system. To put it differently, if proponents of Straightforwardism think “I allow myself to reason crudely about MSR partners, therefore other agents are likely to think crudely about it, too – which is good for me!” they are failing to see that the reason they were tempted to think crudely is not a reason that is shared by all other compromise participants.
In order to maximize the gains from trade, proponents of both value systems, Straightforwardism and Complicatedism, have to make sure that they use a decision procedure that, in expectation, benefits both value systems equally much (weighted in proportion to how prevalent and powerful the proponents are). Straightforwardists have reason to pick a CH that also helps Complicatedists sufficiently much, and Complicatedists are incentivized to not be overly cautious and risk averse. If implemented properly, asymmetries between potential MSR participants cannot be used to gain an unfair advantage. (But maybe it is simply extremely complicated to implement MSR cooperation heuristics properly.)
For a slightly different example of superrationalizing, consider a case where Complicatedists naively place too much faith into the diligence of the Straightforwardists. They may reason as follows:
“The majority of compromise participants benefit from intervention Z. Even though intervention Z is slightly negative or at best neutral for my own values, I should perform intervention Z. This is because if I am diligent enough to support Z for the common good, as it seems best for a majority of compromise participants and therefore an obvious low-hanging fruit for doing my part in maximizing the MCUF, other agents will also be diligent in the way they implement MSR. Others being diligent then implies that whichever agents are in the best position to reward my own value system will indeed do so.”
This reasoning is sound in theory. But it is also risky. Whether the Complicatedists reap gains from trade, or whether the decision procedure they actually follow (as opposed to the decision procedure they think they follow) implies that they are shooting themselves in the foot, depends on their own level of diligence in picking their MSR implications. The Complicatedists have to, through the level of diligence in the CH they de facto follow, ensure that the agents who are in fact in the best position to help Complicatedism will be diligent enough to notice this and therefore act accordingly.
It seems to me that, if the Complicatedists put all their resources into intervention Z and never spend attention researching whether they themselves might be in a particularly good position to help rare value systems or value systems whose prioritization is particularly complicated, then the reasoning process they are de facto following is itself not as diligent as they require their superrational cooperators to be. If even the Complicatedists (who themselves do not benefit from the majority-favored interventions) end up working on the majority-favored interventions because they seem like the easiest and most salient thing to pick out, why would one expect agents who actually benefit from this “low-hanging fruit” to ever work on anything else? The Complicatedists (and everyone else for that matter, at least in order to maximize the gains from trade that MSR can provide) have to make sure that they work on majority-favored interventions if and only if it is actually their multiverse-wide comparative advantage. This may be difficult to ensure, because one has to expect that people often rationalize, especially when majority-favored interventions tend to be associated with high status, or tend to draw in Complicatedists high in agreeableness who are bothered by lack of convergence in people’s prioritization.
In order to allocate personal comparative advantages in a way that reliably produces the greatest gains from trade, one has to find the right mix between exploration and exploitation. It is plausible that MSR participants should often focus on majority-favored interventions, because after all, the fact that they are majority-favored means that they make up a large portion of the MCUF. But next to that, everyone should be on the lookout for special opportunities to benefit value systems with idiosyncratic priorities. This should happen especially often for value systems that are well-represented in the MCUF, but perhaps one should also make use of randomization procedures to sometimes spend time exploring the prioritization of comparatively rare value systems (see also the proposal in “How to trade, practically”).
Randomization procedures of course also come with a danger of superrationalizing. It can be difficult to properly commit to doing something that may cost social capital or is difficult to follow through with for other reasons. Illusory low-probability commitments that one would not actually follow through on if the die comes up “6” five times in a row weaken or even destroy the gains from trade one in expectation receives from this aspect of MSR. Proper introspection and very high levels of commitment to one's chosen CH become important for not shooting oneself in the foot when attempting to get MSR implications right.
An intuition I got from writing this section is that it tentatively seems to me that cooperation heuristics that exploit convergent priorities, in particular when the resulting intervention benefits one’s own value system, are less risky (in the sense of it being harder to mess things up through superrationalizing) than trades based on comparative advantages. The overall gains from trade one can achieve with such (arguably) risk-averse cooperation heuristics are certainly not maximal, but if one is sufficiently pessimistic about getting things right otherwise, then they may be the overall better deal. This is bad news for value systems that don't benefit as much from trades focused on convergent priorities.
Having said that, it seems to me also that exploiting comparative advantages can produce particularly large gains from trade, and that getting things right enough might be within what we can expect careful reasoners to manage. While it would seem incredibly intractable to attempt to estimate one’s comparative advantage at benefitting a particular value system when compared to agents in unknown parts of the multiverse, what looks fairly tractable by contrast (and is similarly impactful overall) is evaluating one’s comparative advantage compared to other people on earth. Following a CH that models comparative advantages among people on earth would be a pretty good start and likely better than a status quo of not considering comparative advantages at all.
Which value systems in particular MSR participants should benefit depends on their situations and especially their comparative advantages. In the last section of my introduction to MSR, I advocated for the principle that we should largely limit our cooperation heuristics to considering value systems we know well.
One might be tempted to assume that this would give suboptimal results, as limiting how inclusive one is with benefitting value systems different from one’s own determines how many value systems will be incentivized to join our compromise in total. So perhaps low inclusivity (in the sense of not speculating about the values of aliens with different value systems from us) in this way means that one’s decisions now only influence a smaller number (or lower measure given that we might be dealing with infinities) of agents in the multiverse. However, it is important to note that MSR never manages to bring other agents to follow one’s own priorities exclusively; it only grants you a proportionate share of the attention and resources of some other agents. The more types of compromise participants are added, the smaller said share of extra attention one receives per participant. (Consider: If I have to think about what my comparative advantage is amongst three value systems, that takes less time than figuring out one’s comparative advantage amongst three hundred value systems.) This means that there is no overriding incentive to choose maximally inclusive cooperation heuristics, i.e. ones that in expectation benefit maximally many value systems of superrationalists in the multiverse.
Note that this also implies that one cannot make a strong wager in favor of MSR of the sort that, if MSR works, our decisions have a vastly wider scope than if it does not work.1 While it is true, strictly speaking, that our decisions have a “wider scope” if MSR works, this is counterbalanced by us having to devote attention to more value systems in order to make it work. MSR’s gains from trade do not come from the large total numbers of participants, but from exploiting convergent priorities and comparative advantages. So while it is not important to consider maximally many plausible value systems in one’s compromise, it is important that we do include whichever value systems we expect large gains from trade from (as this superrationally ensures that others follow similarly high-impact cooperation heuristics).
If one had infinite computing power and could at any point download and execute the precise implications of an ideal MCUF containing all agents interested in MSR, then a maximally inclusive compromise would give the highest gains from trade, because for every ever-so-specific situation, the ideal MCUF would find exactly the best way of ensuring equal gains from trade for all participants. However, given that thinking about the prioritization of other value systems (especially obscure ones that only make up a tiny portion of the MCUF) comes with a cost, it may not be worthwhile to invest resources into ever-more-sophisticated CHs solely with the goal of making sure that we do not forget value systems we could in theory benefit. This reasoning supports the intuition that the best way to draw implications from MSR is by cooperating with proponents of value systems that one already causally interacts with, because these are the value systems we likely know best and are therefore in a particularly good position to benefit. Direct acquaintance is a comparative advantage!
(Epistemic status, update: This section is badly structured and probably confused at least in parts. Also it won’t be relevant to the sections below, so feel free to skip this.)
So far, I have been assuming that agents only follow cooperation heuristics that, at the stage of execution, the agent believes will generate positive utility according to their own value system. This sounds like a reasonable assumption, but there is a case to be made for exceptions to it. This concerns updateless versions of compromise.
Suppose I am eating dinner with my brother and we have to agree on a fair way of dividing one pizza. Ordinarily, the fair way to divide the pizza is to give each person one half. However, suppose I like pizza a lot more than my brother does, and that I am also much more hungry. Here, we might have the intuition that, whether person A or person B likes the pizza in question more, or is more hungry on that specific occasion, was a matter of chance that could just as well have gone one way or the other. Sure, one brother was born with genes that favor the taste of pizza more (or experienced things in life that led him to develop such a taste), but there is a sense in which it could also have gone the other way. Updatelessness is the idea that we should act as though we actually made irreversible commitments to our notion of bargaining that locked in the highest expected reward in all cases where failing to have done so would predictably lower our expected reward. Applied to the specific pizza example, it is the idea that learning more information about "who is hungrier" should not lower the total utility we would both have gotten in expectation if we had agreed to a fair compromise early enough from an original position of ignorance. So it could mean that my brother and I should disregard (= “choose not to update on”) the knowledge that one specific and now known person happens to have the less fortunate pizza preferences in this instance we are in. Why? Because there were points in the past where we could – and arguably should – have agreed on a method for future compromise on things such as pizza eating that in expectation does better than just dividing goods equally. Not knowing whether we ourselves will be hungrier or less hungry, it seems rational to commit to a compromise where the hungrier person receives more food. (There is also a more contested, even stronger sense of updatelessness that is not based on pre-commitments.)
Updatelessness applied to MSR would mean optimizing for an MCUF where variance normalization is applied not to all the things we currently know about the strategic position of proponents of different value systems, but instead to a hypothetical “point of precommitment.” Depending on the version of updatelessness at play, this could be the point in time where someone started to understand decision theory well enough to consider the benefits of updatelessness, or it could even mean going back to the “logical prior” over how much different value systems can or cannot be benefitted. (Whatever that means; I do not understand this business about either logical priors or how to distinguish different versions of updatelessness, so I will just leave it at that and hope that others may or may not do some more thinking here, following the links above.)
As I understand it, the inspiration for updateless compromise is that the gains in case one ends up being on the lucky side weigh more than the losses in case one does not. Maybe it is not apparent from the start which value systems correspond more to something like Complicatedism or something like Straightforwardism, and the sides could in theory also be reversed in at least some world-situations across the multiverse, depending on the sort of crazy things that may happen in the worlds in question. Arguably, then, there is a case to be made for committing towards updateless compromise before thinking more about MSR implications in further detail. (Or more generally, a case for a general precommitment towards updatelessness in all future decision-situations where this has benefits given the knowledge at the time of precommitment.)
While I think the arguments for updatelessness are intriguing, I am skeptical whether humans can and should try to trick their brains into reasoning completely in updateless terms in decision-situations where the instances for precommitment either have passed or are difficult to reconstruct and assess in retrospect. And I am even more skeptical about using updateless compromise for MSR in particular:
Next to the psychological difficulties with updatelessness and worries whether humans are even capable of following through with the implications after learning that one is on the losing end of an updateless compromise, another problem with updateless MSR is also the apparent lack of a true original position (besides the extreme view where one just goes with a logical prior). We have previously discussed asymmetric features amongst potential MSR participants. Even someone who has not given much thought to the relative prioritization of different value systems will probably have a rough idea whether their value system is more likely to benefit from updateless compromise or not. Small asymmetries can break the entanglement of decision algorithms: If I commit to be updateless because I have a good feeling about being on the winning side, I cannot expect other agents who may not share said feeling to commit as well. I feel that MSR is already hard enough without updatelessness, and adding yet another speculative consideration to it makes me think I should start drawing the line.
Having said all that, I guess it might nevertheless be reasonable to already commit to having precommitted to be updateless in case, after thinking more about the merits and drawbacks of the idea, one concludes that a past commitment would in fact have been the rational thing to do. (I think that’s actually the way one should think about updatelessness in general, assuming one should try updatelessness at all, so I probably misrepresented a few things above.)
Without (strong versions of) updatelessness, the way we ensure that our actions lead to MSR benefits is to diligently follow cooperation heuristics that do not disproportionately favor our own values. (Otherwise we would have to conclude that others are disproportionately benefiting their own values, which defeats the purpose.) This means that, in expectation, all the value systems should receive a substantial portion of attention somewhere in the multiverse. Ideally, assuming there were no time or resource constraints to computing a compromise strategy, an ideal reasoner would execute something like the following strategy:
1) Set up a weighted sum of the value functions of superrationalists in the multiverse.
2) Set the weights such that when universally adopted, everyone gets the same expected gains from compromise (perhaps relative to the agents’ power).
3) Maximize the resulting utility function.
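To make the structure of these three steps concrete, here is a deliberately tiny sketch; the value systems, payoffs, and weightings are all invented, and step 2) is only gestured at:

```python
# Deliberately tiny sketch of steps 1)-3): two invented value systems, three
# candidate actions, and a weighted sum of their value functions (the toy MCUF).

actions = ["do_nothing", "intervention_x", "intervention_y"]
utility = {
    "straightforwardism": {"do_nothing": 0.0, "intervention_x": 10.0, "intervention_y": 1.0},
    "complicatedism":     {"do_nothing": 0.0, "intervention_x": 0.5,  "intervention_y": 4.0},
}

def mcuf(weights, action):
    """Step 1): a weighted sum of the value functions."""
    return sum(w * utility[system][action] for system, w in weights.items())

def best_action(weights):
    """Step 3): maximize the resulting utility function."""
    return max(actions, key=lambda a: mcuf(weights, a))

# Step 2) is the hard part: the weights would have to be set so that, across all
# the situations in which the MCUF gets followed, every value system gains equally
# in expectation. Here we only show that the choice of weights matters:
print(best_action({"straightforwardism": 1.0, "complicatedism": 1.0}))  # intervention_x
print(best_action({"straightforwardism": 1.0, "complicatedism": 3.0}))  # intervention_y
```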
Put this way, this may look simple. But it really isn’t. The way to coordinate for each value system to have resources allocated to its priorities is to maximally incorporate comparative advantages in terms of expertise and the strategic situation of the participating agents. Step 2) in the algorithm above is therefore extremely complicated, because it requires thinking about all the ways in which situations across the multiverse differ, where agents are in an especially good position to benefit certain value systems, and how likely they would be to notice this and comply depending on how the weights in the MCUF are being set. To illustrate this complexity, we can break down step 2) from above into further steps. Note that the following only gives an approximate rather than exact way to solve the problem, because a proper formalization for how to solve step 2) would twist knots into my brain:
2.1) Outline the value systems of all superrationalists and explore strategic prioritization for each value system in all world situations to come up with a ranking of promising interventions per world situation per value system.
2.2) Adjust all these interventions according to empirical compromise considerations where one can get more value out of a given intervention by tweaking it in certain ways: For instance, if two or more value systems would all agree to change each other’s promising interventions to different packages of compromise interventions that are overall preferable, perform said change.
2.3) Construct a preliminary multiverse-wide compromise utility function (pMCUF) that represents value systems weighted according to how prevalent they are amongst superrationalists, and how influential their proponents are.
2.4) Compare the world situations of all participants in MSR, predict which interventions from 2.2) will be taken by these agents under the assumption that they are approximating the pMCUF while being partly irrational in different ways, and calculate the total utility this generates for each value system in the preliminary compromise.
2.5) Adjust the weights in the pMCUF with the help of a fair bargaining solution in such a way that eventually, when applied to all possible world situations where the newly weighted MCUF will get approximated, all value systems will get (roughly) equal, variance-normalized benefits. This eventually gives you (a crude version of) the final MCUF to use.
(Step 2.5 ensures that value systems that are hard to benefit also end up receiving some attention. Without this step, hard-to-benefit value systems would often end up neglected, because MSR participants would solely be on the lookout for options to create the most total value per value system, which disproportionately favors benefitting value systems that are easy to benefit.)
Needless to say, the analysis above is much too impractical for humans to even attempt to approximate with steps of the same structure. So please don't even try!
Now, in order to produce actionable compromise plans, we have to come up with a simpler proposal. In the following, I’ll try to come up with a practical proposal that, if anything, tries to err on the side of being too simple. The idea is that if the practical proposal below seems promising, we gain confidence that implementing MSR in a way that incentivizes sufficiently many other potential participants to join is realistically feasible. Here is the proposal in very sketchy terms:
Note that step 5 also includes very general or “meta” interventions such as encouraging people who have not made up their minds on ethical questions to simply follow MSR rather than waste time with ethical deliberation.
Admittedly, the above proposal is vague in many of the steps and things often boil down to intuition-based judgment calls, which generates a lot of room for biases to creep in. It is not obvious that this procedure still generates gains from trade if we factor in all the ways in which it could go wrong.
However, if people genuinely try to implement a cooperation heuristic that is impartially best for the compromise overall, then biases that creep in should at least be equally likely to give too much or too little weight to any given value system. In other words: There is hope even if we expect to make a few mistakes (after all, normal, non-MSR consequentialism is far from easy either.)
Note that while causal interaction and cooperation with proponents of other value systems interested in MSR can be highly useful as an integral part of a sensible cooperation heuristic, this should not, however, be confused with thinking of these other people as “actual” MSR compromise partners. It is highly debatable whether one’s own decision-making is likely to be relevantly logically entangled with the decision-making of some humans on earth. Maybe the earth is not large enough for that. But whether this is the case or not, MSR certainly does not require it. Besides, even if such entanglement was likely, the possibility of checking up on whether others are in fact reciprocating the compromise may break the entanglement of decision algorithms (cf. the EDT slogan “ignorance is evidential power”). (Although note that decision theories that incorporate updatelessness might continue to cooperate even after observing the other party’s action, if the reasons from similarity of decision algorithms were strong enough initially.)
So the idea behind focusing on cooperating with proponents of other value systems that we know and can interact with is not that we are superrationally ensuring that no one defects in causal interactions. Rather, the idea is that, if MSR works, each party has rational reason to act as though they are correlated with agents in other parts of the multiverse, where defection in expectation hurts their own values. This is what ensures that there are no incentives to defect. If one were to defect, one may gain an unfair advantage locally in causal interactions with others, yet one loses all the benefits from MSR in other parts of the multiverse.
Note that this leaves the problem that agents can pretend to epistemically buy into MSR even though they may be highly skeptical of the idea. If one is confident that MSR would never work, one may be incentivized to lie about it and fake excitement. (Though I think this sounds like a terrible idea, both because of the epistemic damage it would do to the community and because of all the non-MSR arguments against naive consequentialism.)
Overall I'm not convinced that MSR has strong action-guiding implications. To figure out whether we can trust the reasoning behind MSR, there are many things to potentially look into in more detail. Personally, I am particularly interested in the following questions:
Thanks to Caspar Oesterheld, Johannes Treutlein, Max Daniel, Tobias Baumann and David Althaus for comments and discussions that helped inform my thinking. I first heard the term “superrationalizing” be used in the context of Hofstadter’s superrationality by Carl Shulman.
The post Commenting on MSR, Part 2: Cooperation heuristics appeared first on Center on Long-Term Risk.
The post Cause prioritization for downside-focused value systems appeared first on Center on Long-Term Risk.
This post outlines my thinking on cause prioritization from the perspective of value systems whose primary concern is reducing disvalue. I’m mainly thinking of suffering-focused ethics (SFE), but I also want to include moral views that attribute substantial disvalue to things other than suffering, such as inequality or preference violation. I will limit the discussion to interventions targeted at improving the long-term future (see the reasons in section II). I hope my post will also be informative for people who do not share a downside-focused outlook, as thinking about cause prioritization from different perspectives, with emphasis on considerations other than those one is used to, can be illuminating. Moreover, understanding the strategic considerations for plausible moral views is essential for acting under moral uncertainty and cooperating with people with other values.
I will talk about the following topics:
I’m using the term downside-focused to refer to value systems that in practice (given what we know about the world) primarily recommend working on interventions that make bad things less likely.1 For example, if one holds that what is most important is how things turn out for individuals (welfarist consequentialism), and that it is comparatively unimportant to add well-off beings to the world, then one should likely focus on preventing suffering.2 That would be a downside-focused ethical view.
By contrast, other moral views place great importance on the potential upsides of very good futures, in particular with respect to bringing about a utopia where vast numbers of well-off individuals will exist. Proponents of such views may also believe it to be a top priority that a large, flourishing civilization exists for an extremely long time. I will call these views upside-focused.
Upside-focused views do not have to imply that bringing about good things is normatively more important than preventing bad things; instead, a view also counts as upside-focused if one has reason to believe that bringing about good things is easier in practice (and thus more overall value can be achieved that way) than preventing bad things.
A key point of disagreement between the two perspectives is that the upside-focused people might say that suffering and happiness are in a relevant sense symmetrical, and that downside-focused people are too willing to give up good things in the future, such as the coming into existence of many happy beings, just to prevent suffering. On the other side, downside-focused people feel that the other party is too willing to accept that, say, the extreme suffering of many people goes unaddressed, or is in some sense tolerated in order to achieve some purportedly greater good.
Whether a normative view qualifies as downside-focused or upside-focused is not always easy to determine, as the answer can depend on difficult empirical questions such as how much disvalue we can expect to be able to reduce versus how much value we can expect to be able to create. I feel confident however that views according to which it is not in itself (particularly) valuable to bring beings in optimal conditions into existence come out as largely downside-focused. The following commitments may lead to a downside-focused prioritization:
For those who are unsure about where their beliefs may fall on the spectrum between downside- and upside-focused views, and how this affects their cause prioritization with regard to the long-term future, I recommend being on the lookout for interventions that are positive and impactful according to both perspectives. Alternatively, one could engage more with population ethics to perhaps cash in on the value of information from narrowing down one’s uncertainty.
In this post, I will only discuss interventions chosen with the intent of affecting the long-term future – which not everyone agrees is the best strategy for doing good. I want to note that choosing interventions that reliably reduce suffering or promote well-being in the short run also has many arguments in its favor.
Having said that, I believe that most of the expected value comes from the effects our actions have on the long-term future, and that our thinking about cause prioritization should explicitly reflect this. The future may come to hold astronomical quantities of the things that people value (Bostrom, 2003). Correspondingly, for moral views that place a lot of weight on bringing about astronomical quantities of positive value (e.g., happiness or human flourishing), Nick Beckstead presented a strong case for focusing on the long-term future. For downside-focused views, that case rests on similar premises. A simplified version of that argument is based on the two following ideas:
This does not mean that one should necessarily pick interventions one thinks will affect the long-term future through some specific, narrow pathway. Rather, I am saying (following Beckstead) that we should pick our actions based primarily on what we estimate their net effects to be on the long-term future.6 This includes not only narrowly targeted interventions such as technical work in AI alignment, but also projects that improve the values and decision-making capacities in society at large to help future generations cope better with expected challenges.
The observable universe has very little suffering (or inequality, preference frustration, etc.) compared to what could be the case; for all we know suffering at the moment may only exist on one small planet in a computationally inefficient form of organic life.7 According to downside-focused views, this is fortunate, but it also means that things can become much worse. Suffering risks (or “s-risks”) are defined as events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering.8 Analogously and more generally, we can define downside risks as events that would bring about disvalue (including things other than suffering) at vastly unprecedented scales.
Why might this definition be practically relevant? Imagine the hypothetical future scenario “Business as usual (BAU),” where things continue onwards indefinitely exactly as they are today, with all bad things being confined to earth only. Hypothetically, let’s say that we expect 10% of futures to be BAU, and we imagine there was an intervention – let’s call it paradise creation – that changed all BAU futures into futures where a suffering-free paradise is created. Let us further assume that another 10% of futures will be futures where earth-originating intelligence colonizes space and things go very wrong such that, through some pathway or another, vastly more suffering is created than has ever existed (and would ever exist) on earth, with little to no happiness or good things. We will call this second scenario “Astronomical Suffering (AS).”
If we limit our attention to only the two scenarios AS and BAU (of course there are many other conceivable scenarios, including scenarios where humans go extinct or where space colonization results in a future filled with predominantly happiness and flourishing), then we see that the total suffering in the AS futures vastly exceeds all the suffering in the BAU futures. Successful paradise creation therefore would have a much smaller impact in terms of reducing suffering than an alternative intervention that averts the 10% s-risk from the AS scenario, changing it to a BAU scenario for instance. Even reducing the s-risk from AS in our example by a single percentage point would be vastly more impactful than preventing the suffering from all the BAU futures.
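To make the arithmetic behind this comparison explicit, here is a small illustrative calculation. The probabilities are the stipulated 10% figures from the example; the suffering quantities (and the 10,000-to-1 ratio between AS and BAU) are made-up numbers chosen only to reflect the stipulation that AS contains vastly more suffering:

```python
# Toy expected-value comparison; all quantities are made-up for illustration only.

p_bau, p_as = 0.10, 0.10   # stipulated probabilities of the BAU and AS scenarios
s_bau = 1.0                 # total suffering in a BAU future (arbitrary units)
s_as = 10_000.0             # stipulated: an AS future contains vastly more suffering

# "Paradise creation": every BAU future becomes a suffering-free paradise.
averted_by_paradise = p_bau * s_bau

# Alternative: reduce the probability of AS by one percentage point,
# moving that probability mass to a BAU-like future instead.
averted_by_s_risk_reduction = 0.01 * (s_as - s_bau)

print(f"Paradise creation averts {averted_by_paradise:.2f} units of expected suffering")
print(f"1pp of s-risk reduction averts {averted_by_s_risk_reduction:.2f} units")
# With these numbers, the one-percentage-point reduction in s-risk averts roughly
# a thousand times more expected suffering than paradise creation does.
```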
This consideration highlights why suffering-focused altruists should probably invest their resources not into making very good outcomes more likely, but rather into making dystopian outcomes (or dystopian elements in otherwise good outcomes) less likely. Utopian outcomes where all suffering is abolished through technology and all sentient beings get to enjoy lives filled with unprecedented heights of happiness are certainly something we should hope will happen. But from a downside-focused perspective, our own efforts to do good are, on the margin, better directed towards making it less likely that we get particularly bad futures.
While the AS scenario above was stipulated to contain little to no happiness, it is important to note that s-risks can also affect futures that contain more happy individuals than suffering ones. For instance, the suffering in a future with an astronomically large population count where 99% of individuals are very well off and 1% of individuals suffer greatly constitutes an s-risk even though upside-focused views may evaluate this future as very good and worth bringing about. Especially when it comes to the prevention of s-risks affecting futures that otherwise contain a lot of happiness, it matters a great deal how the risk in question is being prevented. For instance, if we envision a future that is utopian in many respects except for a small portion of the population suffering because of problem X, it is in the interest of virtually all value systems to solve problem X in highly targeted ways that move probability mass towards even better futures. By contrast, only a few value systems (ones that are strongly or exclusively about reducing suffering/bad things) would consider it overall good if problem X was “solved” in a way that not only prevented the suffering due to problem X, but also prevented all the happiness from the future scenario this suffering was embedded in. As I will argue in the last section, moral uncertainty and moral cooperation are strong reasons to solve such problems in ways that most value systems approve of.
All of this is based on the assumption that bad futures, i.e., futures with severe s-risks or downside risks, are reasonably likely to happen (and can tractably be addressed). This seems to be the case, unfortunately: We find ourselves on a civilizational trajectory with rapidly growing technological capabilities, and the ceilings imposed by physical limits are still far away. It looks as though large-scale space colonization might become possible someday, either for humans directly, for some successor species, or for intelligent machines that we might create. Life generally tends to spread and use up resources, and intelligent life or intelligence generally does so even more deliberately. As space colonization would so vastly increase the stakes at which we are playing, a failure to improve sufficiently along all the necessary dimensions – both morally and with regard to overcoming coordination problems or lack of foresight – could result in futures that, even though they may in many cases (different from the AS scenario above!) also contain astronomically many happy individuals, would contain vast quantities of suffering. We can envision numerous conceptual pathways that lead to astronomical amounts of suffering (Sotala & Gloor, 2017), and while each single pathway may seem unlikely to be instantiated – as with most specific predictions about the long-term future – s-risks are disjunctive, and people tend to underestimate the probability of disjunctive events (Dawes & Hastie, 2001). In particular, our historical track record contains all kinds of factors that directly cause or contribute to suffering on large scales:
And while one can make a case that there has been a trend for things to become better (see Pinker, 2011), this does not hold in all domains (e.g. not with regard to the number of animals directly harmed in the food industry) and we may, because of filter bubbles for instance, underestimate how bad things still are even in comparatively well-off countries such as the US. Furthermore, it is easy to overestimate the trend for things to have gotten better given that the underlying mechanisms responsible for catastrophic events such as conflict or natural disasters may follow a power law distribution where the vast majority of violent deaths, diseases or famines result from a comparatively small number of particularly devastating incidents. Power law distributions constitute a plausible (though tentative) candidate for modelling the likelihood and severity of suffering risks. If this model is correct, then observations such as that the world did not erupt in the violence of a third world war, or that no dystopian world government has been formed as of late, cannot count as very reassuring, because power law distributions become hard to assess precisely towards the tail end of the spectrum, where the stakes are highest (Newman, 2006).
To illustrate the difference between downside- and upside-focused views, I drew two graphs. To keep things simple, I will limit the example scenarios to cases that either uncontroversially contain more suffering than happiness, or only contain happiness. The BAU scenario from above will serve as a reference point. I’m describing it again as a reminder below, alongside the other scenarios I will use in the illustration (note that all scenarios are stipulated to last for equally long):
Business as usual (BAU): Earth remains the only planet in the observable universe (as far as we know) where there is suffering, and things continue as they are. Some people remain in extreme poverty, many people suffer from disease or mental illness, and our psychological makeup limits the amount of time we can remain content with things even if our lives are comparatively fortunate. Factory farms stay open, and most wild animals die before they reach their reproductive age.
Astronomical suffering (AS): A scenario where space colonization results in an outcome where astronomically many sentient minds exist in conditions that are evaluated as bad by all plausible means of evaluation. To make the scenario more concrete, let us stipulate that 90% of beings in this vast population have lives filled with medium-intensity suffering, and 10% of the population suffers at strong or unbearable intensity. There is little or no happiness in this scenario.
Paradise (SP/AP): Things go as well as possible: suffering is abolished, all sentient beings are always happy and even experience heights of well-being that are unachievable with present-day technology. We further distinguish small paradise (SP) from astronomical paradise (AP): while the former stays earth-bound, the latter spans across (maximally) many galaxies, optimized to turn available resources into flourishing lives and all the things people value.
Here is how I envision typical suffering-focused and upside-focused views ranking the above scenarios from “comparatively bad” on the left to “comparatively good” on the right:
The two graphs represent the relative value we can expect from the classes of future scenarios I described above. The leftmost point of a graph represents not the worst possible outcome, but the worst outcome amongst the future scenarios we are considering. The important thing is not whether a given scenario is more towards the right or left side of the graph, but rather how large the distance is between scenarios. The yellow stretch signifies the highest importance or scope, and interventions that move probability mass across that stretch are either exceptionally good or exceptionally bad, depending on the direction of the movement. (Of course, in practice interventions can also have complex effects that affect multiple variables at once.)
Note that the BAU scenario was chosen mostly for illustration, as it seems pretty unlikely that humans would continue to exist in the current state for extremely long timespans. Similarly, I should qualify that the SP scenario may be unlikely to ever happen in practice because it seems rather unstable: Keeping value drift and Darwinian dynamics under control and preserving a small utopia for millions of years or beyond may require technology that is so advanced that one may as well make the utopia astronomically large – unless there are overriding reasons for favoring the smaller utopia. From any strongly or exclusively downside-focused perspective, the smaller utopia may indeed – factoring out concerns about cooperation – be preferable, because going from SP to AP comes with some risks.9 However, for the purposes of the first graph above, I was stipulating that AP is completely flawless and risk-free.
A “near paradise” or “flawed paradise” mostly filled with happy lives but also with, say, 1% of lives in constant misery, would for upside-focused views rank somewhere close to AP on the far right end of the first graph. By contrast, for downside-focused views on the second graph, “flawed paradise” would stand more or less in the same position as BAU if the view in question is weakly downside-focused, and decidedly more on the way towards AS on the left if it is strongly or exclusively downside-focused. Weakly downside-focused views would also have a relatively large gap between SP and AP, reflecting that creating additional happy beings is regarded as morally quite important but not sufficiently important to become the top priority. A view would still count as suffering-focused (at least within the restricted context of our visualization where all scenarios are artificially treated as having the same probability of occurrence) as long as the gap between BAU and AS remains larger than the gap between BAU/SP and AP.
In practice, we are well-advised to remain highly uncertain about the right way to conceptualize the likelihood and plausibility of such future scenarios. Given this uncertainty, there can be cases where a normative view falls somewhere in-between upside- and downside-focused in our subjective classification. All these things are very hard to predict and other people may be substantially more or substantially less optimistic with regard to the quality of the future. My own estimate is that a more realistic version of AP, one that is allowed to contain some suffering but is characterized by containing near-maximal quantities of happiness or things of positive value, is ~40 times less likely to happen10 than the vast range of scenarios (of which AS is just one particularly bad example) where space colonization leads to outcomes with a lot less happiness. I think scenarios as bad as AS or worse are also very rare, as most scenarios that involve a lot of suffering may also contain some islands of happiness (or even have a sea of happiness and some islands of suffering). See also these posts on why the future is likely to be net good in expectation according to views where creating happiness is similarly important as reducing suffering.
Interestingly, various upside-focused views may differ normatively with respect to how fragile (or not) their concept of positive value is. If utopia is very fragile, but dystopia comes in vastly many forms (related: the Anna Karenina principle), this would imply greater pessimism regarding the value of the average scenario with space colonization, which could push such views closer to becoming downside-focused. On the other hand, some (idiosyncratic) upside-focused views may simply place an overriding weight on the ongoing existence of conscious life, largely independent of how things will go in terms of hedonist welfare.11 Similarly, normatively upside-focused views that count creating happiness as more important than reducing suffering (though presumably very few people would hold such views) would always come out as upside-focused in practice, too, even if we had reason to be highly pessimistic about the future.
To summarize, what the graphs above try to convey is that for the example scenarios listed, downside-focused views are characterized by having the largest gap in relative importance between AS and the other scenarios. By contrast, upside-focused views place by far the most weight on making sure AP happens, and SP would (for many upside-focused views at least) not even count as all that good, comparatively.12
Some futures, such as ones where most people’s quality of life is hellish, are worse than extinction. Many people with upside-focused views would agree. So the difference between upside- and downside-focused views is not about whether there can be net negative futures, but about how readily a future scenario is ranked as worth bringing about in the face of the suffering it contains or the downside risks that lie on the way from here to there.
If humans went extinct, this would greatly reduce the probability of space colonization and any associated risks (as well as benefits). Without space colonization, there are no s-risks “by action,” no risks from the creation of astronomical suffering where human activity makes things worse than they would otherwise be.13 Perhaps there would remain some s-risks “by omission,” i.e. risks corresponding to a failure to prevent astronomical disvalue. But such risks appear unlikely given the apparent emptiness of the observable universe.14 Because s-risks by action overall appear to be more plausible than s-risks by omission, and because the latter can only be tackled in an (arguably unlikely) scenario where humanity accomplishes the feat of installing compassionate values to robustly control the future, it appears as though downside-focused altruists have more to lose from space colonization than they have to gain.
It is however not obvious whether this implies that efforts to reduce the probability of human extinction indirectly increase suffering risks or downside risks more generally. It very much depends on the way this is done and what the other effects are. For instance, there is a large and often underappreciated difference between existential risks from bio- or nuclear technology, and existential risks related to smarter-than-human artificial intelligence (superintelligence; see the next section). While the former set back technological progress, possibly permanently so, the latter drive it all the way up, likely – though maybe not always – culminating in space colonization with the purpose of benefiting whatever goal(s) the superintelligent AI systems are equipped with (Omohundro, 2008; Armstrong & Sandberg, 2013). Because space colonization may come with some associated suffering in this way, this means that a failure to reduce existential risks from AI is often also a failure to prevent s-risks from AI misalignment. Therefore, the next section will argue that reducing such AI-related risks is valuable from both upside- and downside-focused perspectives. By contrast, the situation is much less obvious for other existential risks, ones that are not about artificial superintelligence.
Sometimes efforts to reduce these other existential risks also benefit s-risk reduction. For instance, efforts to reduce non-AI-related extinction risks may increase global stability and make particularly bad futures less likely in those circumstances where humanity nevertheless goes on to colonize space. Efforts to reduce extinction risks from e.g. biotechnology or nuclear war in practice also reduce the risk of global catastrophes where a small number of humans survive and where civilization is likely to eventually recover technologically, but perhaps at the cost of a worse geopolitical situation or with worse values, which could then lead to increases in s-risks going into the long-term future. This mitigating effect on s-risks through a more stable future is substantial and positive according to downside-focused value systems, and has to be weighed against the effect of making s-risks from space colonization more likely.
Interestingly, if we care about the total number of sentient minds (and their quality of life) that can at some point be created, then because of some known facts about cosmology,15 any effects that near-extinction catastrophes have on delaying space colonization are largely negligible in the long run when compared to affecting the quality of a future with space colonization – at least unless the delay becomes very long indeed (e.g. millions of years or longer).
What this means is that in order to determine how reducing the probability of extinction from things other than superintelligent AI affects downside risks in expectation, we can approximate the answer by weighing the following two considerations against each other: (1) how likely such a catastrophe is to merely delay space colonization rather than prevent it – i.e., how much preventing it raises the probability that the cosmic stakes of space colonization are reached at all – and (2) whether a civilization that rebuilds after a collapse would, in expectation, handle those stakes better or worse than our current trajectory.
The second question involves judging where our current trajectory falls, quality-wise, when compared to the distribution of post-rebuilding scenarios – how much better or worse is our trajectory than a randomly reset one? It also requires estimating the effects of post-catastrophe conditions on AI development – e.g., would a longer time until technological maturity (perhaps due to a lack of fossil fuels) cause a more uniform distribution of power, and what does that imply about the probability of arms races? It seems difficult to account for all of these considerations properly. It strikes me as more likely than not that things would be worse after recovery, but because there are so many things to consider,17 I do not feel very confident about this assessment.
This leaves us with the question of how likely a global catastrophe is to merely delay space colonization rather than preventing it. I have not thought about this in much detail, but after having talked to some people (especially at FHI) who have investigated it, I have updated towards thinking that rebuilding after a catastrophe is quite likely. And while a civilizational collapse would set a precedent and a reason to worry the second time around when civilization reaches technological maturity again, it would take an unlikely constellation of collapse factors to get stuck in a loop of recurrent collapse, rather than at some point escaping the setbacks and reaching a stable plateau (Bostrom, 2009), e.g. through space colonization. I would therefore say that large-scale catastrophes related to biorisk or nuclear war are quite likely (~80–90%) to merely delay space colonization in expectation.18 (With more uncertainty being not on the likelihood of recovery, but on whether some outlier-type catastrophes might directly lead to extinction.)
This would still mean that the successful prevention of all biorisk and risks from nuclear war makes space colonization 10-20% more likely. Comparing this estimate to the previous, uncertain estimate about the s-risks profile of a civilization after recovery, it tentatively seems to me that the effect of making cosmic stakes (and therefore downside risks) more likely is not sufficiently balanced by positive effects19 on stability, arms race prevention and civilizational values (factors which would make downside risks less likely). However, this is hard to assess and may change depending on novel insights.
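As a rough way to see how these two effects trade off, here is a toy break-even calculation; the recovery probability is taken from the range above, and everything else is left as a free parameter rather than estimated:

```python
# Toy break-even calculation for the considerations above; illustrative only.

p_recover = 0.85   # assumed probability that a large catastrophe merely delays
                   # space colonization (midpoint of the ~80-90% range in the text)

# Let s_current be the expected downside risk conditional on colonization from our
# current trajectory, and s_rebuilt the same quantity for a trajectory that first
# collapses and then rebuilds. Expected downside risk is roughly:
#   catastrophe not prevented:  p_recover * s_rebuilt   (extinction branch ~ 0)
#   catastrophe prevented:      s_current
# Prevention is therefore neutral for downside-focused views when the two are equal.
break_even = 1.0 / p_recover
print(f"Prevention comes out negative (downside-focused) unless a rebuilt trajectory "
      f"is at least {break_even:.2f}x as bad as the current one.")
# This ignores prevention's other positive effects (stability, values, arms race
# prevention), which the text weighs separately.
```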
What looks slightly clearer to me is that making rebuilding after a civilizational collapse more likely comes with increased downside risks. If this was the sole effect of an intervention, I would estimate it as overall negative for downside-focused views (factoring out considerations of moral uncertainty or cooperation with other value systems) – because not only would it make it more likely that space will eventually be colonized, but it would also do so in a situation where s-risks might be higher than in the current trajectory we are on.20
However, in practice it seems as though any intervention that makes recovery after a collapse more likely would also have many other effects, some of which might more plausibly be positive according to downside-focused ethics. For instance, an intervention such as developing alternate foods might merely speed up rebuilding after civilizational collapse rather than making it altogether more likely, and so would merely affect whether rebuilding happens from a low base or a high base. One could argue that rebuilding from a higher base is less risky also from a downside-focused perspective, which makes things more complicated to assess. In any case, what seems clear is that none of these interventions look promising for the prevention of downside risks.
We have seen that efforts to reduce extinction risks (exception: AI alignment) are unpromising interventions for downside-focused value systems, and some of the interventions available in that space (especially if they do not simultaneously also improve the quality of the future) may even be negative when evaluated purely from this perspective. This is a counterintuitive conclusion, maybe so much so that many people would rather choose to adopt moral positions where it does not follow. In this context, it is important to point out that valuing humanity not going extinct is definitely compatible with a high degree of priority for reducing suffering or disvalue. I view morality as including both considerations about duties towards other people (inspired by social contract theories or game theoretic reciprocity) as well as considerations of (unconditional) care or altruism. If both types of moral considerations are to be weighted similarly, then while the “care” dimension could e.g. be downside-focused, the other dimension, which is concerned with respecting and cooperating with other people’s life goals, would not be – at least not under the assumption that the future will be good enough that people want it to go on – and would certainly not welcome extinction.
Another way to bring together both downside-focused concerns and a concern for humanity not going extinct would be through a morality that evaluates states of affairs holistically, as opposed to using an additive combination for individual welfare and a global evaluation of extinction versus no extinction. Under such a model, one would have a bounded value function for the state of the world as a whole, so that a long history with great heights of discovery or continuity could improve the evaluation of the whole history, as would properties like highly favorable densities of good things versus bad things.
Altogether, because more people seem to come to hold upside-focused or at least strongly extinction-averse values after grappling with the arguments in population ethics, reducing extinction risk can be part of a fair compromise even though it is an unpromising and possibly negative intervention from a downside-focused perspective. After all, the reduction of extinction risks is particularly important from both an upside-focused perspective and from the perspective of (many) people’s self-, family- or community-oriented moral intuitions – because of the short-term death risks it involves.21 Because it is difficult to identify interventions that are robustly positive and highly impactful according to downside-focused value systems (as the length of this post and the uncertain conclusions indicate), it is however not a trivial issue that many commonly recommended interventions are unlikely to be positive according to these value systems. To the extent that downside-focused value systems are regarded as a plausible and frequently arrived at class of views, considerations from moral uncertainty and moral cooperation (see the last section) recommend some degree of offsetting expected harms through targeted efforts to reduce s-risks, e.g. in the space of AI risk (next section). Analogously, downside-focused altruists should not increase extinction risks and instead focus on more cooperative ways to reduce future disvalue.
Smarter-than-human artificial intelligence will likely be particularly important for how the long-term future plays out. There is a good chance that the goals of superintelligent AI would be much more stable than the values of individual humans or those enshrined in any constitution or charter, and superintelligent AIs would – at least with considerable likelihood – remain in control of the future not only for centuries, but for millions or even billions of years to come. In this section, I will sketch some crucial considerations for how work in AI alignment is to be evaluated from a downside-focused perspective.
First, let’s consider a scenario with unaligned superintelligent AI systems, where the future is shaped according to goals that have nothing to do with what humans value. Because resource accumulation is instrumentally useful to most consequentialist goals, it is likely to be pursued by a superintelligent AI no matter its precise goals. Taken to its conclusion, the acquisition of ever more resources culminates in space colonization where accessible raw material is used to power and construct supercomputers and other structures that could help in the pursuit of a consequentialist goal. Even though random or “accidental” goals are unlikely to intrinsically value the creation of sentient minds, they may lead to the instantiation of sentient minds for instrumental reasons. In the absence of explicit concern for suffering reflected in the goals of a superintelligent AI system, that system would instantiate suffering minds for even the slightest benefit to its objectives. Suffering may be related to powerful ways of learning (Daswani & Leike, 2015), and an AI indifferent to suffering might build vast quantities of sentient subroutines, such as robot overseers, robot scientists or subagents inside larger AI control structures. Another danger is that, either during the struggle over control over the future in a multipolar AI takeoff scenario, or perhaps in the distant future should superintelligent AIs ever encounter other civilizations, conflict or extortion could result in large amounts of disvalue. Finally, superintelligent AI systems might create vastly many sentient minds, including very many suffering ones, by running simulations of evolutionary history for research purposes (“mindcrime;” Bostrom, 2014, pp. 125-26). (Or for other purposes; if humans had the power to run alternative histories in large and fine-grained simulations, probably we could think of all kinds of reasons for doing it.) Whether such history simulations would be fine-grained enough to contain sentient minds, or whether simulations on a digital medium can even qualify as sentient, are difficult and controversial questions. It should be noted however that the stakes are high enough such that even comparatively small credences such as 5% or lower would already go a long way in terms of the implied expected value for the overall severity of s-risks from artificial sentience (see also footnote 7).
While the earliest discussions about the risks from artificial superintelligence have focused primarily on scenarios where a single goal and control structure decides the future (singleton), we should also remain open to scenarios that do not fit this conceptualization completely. What happens instead could be several goals either competing or acting in concert with each other, like an alien economy that drifted further and further away from originally having served the goals of its human creators.22 Alternatively, perhaps goal preservation becomes more difficult the more capable AI systems become, in which case the future might be controlled by unstable goal functions taking turns at the steering wheel (see “daemons all the way down”). These scenarios where no proper singleton emerges may perhaps be especially likely to contain large numbers of sentient subroutines. This is because navigating a landscape with other highly intelligent agents requires the ability to continuously model other actors and to react to changing circumstances under time pressure – all of which are things that are plausibly relevant for the development of sentience.
In any case, we cannot expect with confidence that a future controlled by non-compassionate goals will be a future that neither contains happiness nor suffering. In expectation, such futures are instead likely to contain vast amounts of both happiness and suffering, simply because these futures would contain astronomical amounts of goal-directed activity in general.
Successful AI alignment could prevent most of the suffering that would happen in an AI-controlled future, as a superintelligence with compassionate goals would be willing to make tradeoffs that substantially reduce the amount of suffering contained in any of its instrumentally useful computations. While a “compassionate” AI (compassionate in the sense that its goal includes concern for suffering, though not necessarily in the sense of experiencing emotions we associate with compassion) might still pursue history simulations or make use of potentially sentient subroutines, it would be much more conservative when it comes to risks of creating suffering on large scales. This means that it would e.g. contemplate using fewer or slightly less fine-grained simulations, slightly less efficient robot architectures (and ones that are particularly happy most of the time), and so on. This line of reasoning suggests that AI alignment might be highly positive according to downside-focused value systems because it averts s-risks related to instrumentally useful computations.
However, work in AI alignment not only makes it more likely that fully aligned AI is created and everything goes perfectly well, but it also affects the distribution of alignment failure modes. In particular, progress in AI alignment could make it more likely that failure modes shift from “very far away from perfect in conceptual space” to “close but slightly off the target.” There are some reasons why such near misses might sometimes end particularly badly.
One thing that could loosely be classified as a near miss is work in AI alignment that makes it more likely that AIs will share whichever values their creators want to install, while the creators themselves are unethical or (meta-)philosophically and strategically incompetent.
For instance, if those in control of the future came to follow some kind of ideology that is uncompassionate or even hateful of certain out-groups, or to favor a distorted version of libertarianism where every person, including a few sadists, would be granted an astronomical quantity of future resources to use at their own discretion, the resulting future could be a bad one according to downside-focused ethics.
A related and perhaps more plausible danger is that we might prematurely lock in a definition of suffering and happiness into an AI’s goals that neglects sources of suffering we would come to care about after deeper reflection, such as not caring about the mind states of insect-like digital minds (which may or may not be reasonable). A superintelligence with a random goal would also be indifferent with regard to these sources of suffering, but because humans value the creation of sentience, or at least value processes related to agency (which tend to correlate with sentience), the likelihood is greater that a superintelligence with aligned values would create unnoticed or uncared for sources of suffering. Possible such sources include the suffering of non-human animals in nature simulations performed for aesthetic reasons, or characters in sophisticated virtual reality games.
A further danger is that, if our strategic or technical understanding is too poor, we might fail to specify a recipe for getting human values right and end up with perverse instantiation (Bostrom, 2014) or a failure mode where the reward function ends up flawed. This could happen, e.g., in cases where an AI system starts to act in unpredictable but optimized ways due to conducting searches far outside its training distribution.23 Probably most mistakes at that stage would result in about as much suffering as in the typical scenario where AI is unaligned and has (for all practical purposes) random goals. However, one possibility is that alignment failures surrounding utopia-directed goals have a higher chance of leading to dystopia than alignment failures around random goals. For instance, a failure to fully understand the goal 'make maximally many happy minds' could lead to a dystopia where maximally many minds are created in conditions that do not reliably produce happiness, and may even lead to suffering in some of the instances, or some of the time. This is an area for future research.
A final possible outcome in the theme of “almost getting everything right” is one where we are able to successfully install human values into an AI, only to have the resulting AI compete with other, unaligned AIs for control of the future and be threatened with things that are bad according to human values, in the expectation that the human-aligned AI would then forfeit its resources and give up in the competition over controlling the future.
Trying to summarize the above considerations, I drew a (sketchy) map with some major categories of s-risks related to space colonization. It highlights that artificial intelligence can be regarded as a cause or cure for s-risks (Sotala & Gloor, 2017). That is, if superintelligent AI is successfully aligned, s-risks stemming from indifference to suffering are prevented and a maximally valuable future is instantiated (green). However, the danger of near misses (red) makes it non-obvious whether efforts in AI alignment reduce downside risks overall, as the worst near misses may e.g. contain more suffering than the average s-risk scenario.
Note that no one should quote the above map out of context and call it “The likely future” or something like that, because some of the scenarios I listed may be highly improbable and because the whole map is drawn with a focus on things that could go wrong. If we wanted a map that also tracked outcomes with astronomical amounts of happiness, there would in addition be many nodes for things like “happy subroutines,” “mindcrime-opposite,” “superhappiness-enabling technologies,” or “unaligned AI trades with aligned AI and does good things after all.” There can be futures in which several s-risk scenarios come to pass at the same time, as well as futures that contain s-risk scenarios but also a lot of happiness (this seems pretty likely).
To elaborate more on the categories in the map above: Pre-AGI civilization (blue) is the stage we are at now. Grey boxes refer to various steps or conditions that could be met, from which s-risks (orange and red), extinction (yellow) or utopia (green) may follow. The map is crude and not exhaustive. For instance “No AI Singleton” is a somewhat unnatural category into which I threw both scenarios where AI systems play a crucial role and scenarios where they do not. That is, the category contains futures where space colonization is orchestrated by humans or some biological successor species without AI systems that are smarter than humans, futures where AI systems are used as tools or oracles for assistance, and futures where humans are out of the loop but no proper singleton emerges in the competition between different AI systems.
Red boxes are s-risks that may be intertwined with efforts in AI alignment (though not by logical necessity): If one is careless, work in AI alignment may exacerbate these s-risks rather than alleviate them. While dystopia from extortion would never be the result of the activities of an aligned AI, it takes an AI with aligned values – e.g. alongside an unaligned AI in a multipolar scenario, or an alien AI encountered during space colonization – to even provoke such a threat (hence the dotted line linking this outcome to “value loading success”). I coined the term “aligned-ish AI” to refer to the class of outcomes that efforts in AI alignment shift probability mass to. This class includes both very good outcomes (intentional) and neutral or very bad outcomes (accidental). Flawed realization – which stands for futures where flaws in alignment prevent most of the value or even create disvalue – is split into two subcategories in order to highlight that the vast majority of such outcomes likely contains no more suffering than the typical outcome with unaligned AI, but that things going wrong in a particularly unfortunate way could result in exceptionally bad futures. For views that care similarly strongly about achieving utopia as about preventing very bad futures, this tradeoff seems most likely net positive, whereas from a downside-focused perspective, this consideration makes it less clear whether efforts in AI alignment are overall worth the risks.
Fortunately, not all work in AI alignment faces the same tradeoffs. Many approaches may be directed specifically at avoiding certain failure modes, which is extremely positive and impactful for downside-focused perspectives. Worst-case AI safety is the idea that downside-focused value systems recommend differentially pushing the approaches that appear safest with respect to particularly bad failure modes. Given that many approaches towards AI alignment are still at a very early stage, it may be hard to tell which components of AI alignment are likely to benefit downside-focused perspectives the most. Nevertheless, I think we can already make some informed guesses, and our understanding will improve with time.
For instance, approaches that make AI systems corrigible (see here and here) would extend the window of time during which we can spot flaws and prevent outcomes with flawed realization. Similarly, approval-directed approaches to AI alignment, where alignment is achieved by simulating what a human overseer would decide if they were to think about the situation for a very long time, would go further towards avoiding bad decisions than approaches with immediate, unamplified feedback from human overseers. And rather than trying to solve AI alignment in one swoop, a promising and particularly “s-risk-proof” strategy might be to first build a low-impact AI system that increases global stability and prevents arms races without actually representing fully specified human values. This would give everyone more time to think about how to proceed and avoid failure modes where human values are (partially) inverted.
In general, especially from a downside-focused perspective, it strikes me as very important that early and possibly flawed or incomplete AI designs should not yet attempt to fully specify human values. Eliezer Yudkowsky recently expressed the same point in this Arbital post on the worst failure modes in AI alignment.
Finally, what could also be highly effective for reducing downside risks, as well as being important for many other reasons, is some of the foundational work in bargaining and decision theory for AI systems, done at e.g. the Machine Intelligence Research Institute, which could help us understand how to build AI systems that reliably steer things towards outcomes that are always positive-sum.
I have a general intuition that, at least as long as the AI safety community does not face a strong pressure from (perceived) short timelines where the differences between downside-focused and upside-focused views may become more pronounced, there is likely to be a lot of overlap between the most promising approaches focused on achieving the highest probability of success (utopia creation) and approaches that are particularly robust against failing in the most regretful ways (dystopia prevention). Heuristics like ‘Make AI systems corrigible,’ ‘Buy more time to think,’ or ‘If there is time, figure out some foundational issues to spot unanticipated failure modes’ all seem likely to be useful from both perspectives, especially when all good guidelines are followed without exception. I also expect that reasonably many people working in AI alignment will gravitate towards approaches that are robust in all these respects, because making your approach multi-layered and foolproof is simply a smart strategy when the problem in question is unfamiliar and highly complex. Furthermore, I anticipate that more people will come to think more explicitly about the tradeoffs between the downside risks from near misses and utopian futures, and some of them might put deliberate efforts into finding AI alignment methods or alignment components that fail gracefully and thereby make downside risks less likely (worst-case AI safety), either because of intrinsic concern or for reasons of cooperation with downside-focused altruists.24 All of these things make me optimistic about AI alignment as a cause area being roughly neutral or slightly positive when done with little focus on downside-focused considerations, and strongly positive when pursued with strong concern for avoiding particularly bad outcomes.
I also want to mention that the entire field of AI policy and strategy strikes me as particularly positive for downside-focused value systems. Making sure that AI development is done carefully and cooperatively, without the threat of arms races leading to ill-considered, rushed approaches, seems like it would be exceptionally positive from all perspectives, and so I recommend that people who fulfill the requirements for such work prioritize it very highly.
Population ethics, which is the area in philosophy most relevant for deciding between upside- and downside-focused positions, is a notoriously contested topic. Many people who have thought about it a great deal believe that the appropriate epistemic state with regard to a solution to population ethics is one of substantial moral uncertainty or of valuing further reflection on the topic. Let us suppose therefore that, rather than being convinced that some form of suffering-focused ethics or downside-focused morality is the stance we want to take, we consider it a plausible stance we very well might want to take, alongside other positions that remain in contention.
Analogous to situations with high empirical uncertainty, there are two steps to consider for deciding under moral uncertainty: (1) estimate the value of information from reducing our moral uncertainty through further reflection, and compare that with (2) the value of directly working on interventions that look robustly positive given the moral views we currently place substantial credence in.
With regard to (1), we can reduce our moral uncertainty on two fronts. The obvious one is population ethics: We can learn more about the arguments for different positions, come up with new arguments and positions, and assess them critically. The second front concerns meta-level questions about the nature of ethics itself, what our uncertainty is exactly about, and in which ways more reflection or a sophisticated reflection procedure with the help of future technology would change our thinking. While some people believe that it is futile to even try reaching confident conclusions in the epistemic position we are in currently, one could also arrive at a view where we simply have to get started at some point, or else we risk getting stuck in a state of underdetermination and judgment calls all the way down.25
If we conclude that the value of information is insufficiently high to justify more reflection, then we can turn towards getting value from working on direct interventions (2) informed by those moral perspectives we have substantial credence in. For instance, a portfolio for effective altruists in the light of total uncertainty over downside- vs. upside-focused views (which may not be an accurate representation of the EA landscape currently, where upside-focused views appear to be in the majority) would include many interventions that are valuable from both perspectives, and few interventions where there is a large mismatch such that one side is harmed without the other side attaining a much greater benefit. Candidate interventions where the overlap between downside-focused and upside-focused views is high include AI strategy and AI safety (perhaps with a careful focus on the avoidance of particularly bad failure modes), as well as growing healthy communities around these interventions. Many other things might be positive from both perspectives too, such as (to name only a few) efforts to increase international cooperation, raising awareness and concern for the suffering of non-human sentient minds, or improving institutional decision-making.
It is sometimes acceptable or even rationally mandated to do something that is negative according to some plausible moral views, provided that the benefits accorded to other views are sufficiently large. Ideally, one would weigh all of these considerations and integrate the available information appropriately with some decision procedure for acting under moral uncertainty, such as one that includes variance voting (MacAskill, 2014, chpt. 3) and an imagined moral parliament.26 For instance, if someone leaned more towards upside-focused views, or had reasons to believe that the low-hanging fruit in the field of non-AI extinction risk reduction are exceptionally important from the perspective of these views (and unlikely to damage downside-focused views more than they can be benefitted elsewhere), or gave a lot of weight to the argument from option value (see the next paragraph), then these interventions should be added at high priority to the portfolio as well.
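As an illustration of the kind of decision procedure mentioned above, here is a minimal sketch of credence-weighted aggregation with variance normalization (“variance voting”). The intervention names, scores, and credences are invented for illustration and do not represent anyone’s actual estimates:

```python
import statistics

# Invented toy numbers: how two moral views score three candidate interventions.
scores = {
    "upside-focused":   {"AI alignment": 9.0, "non-AI extinction risk": 8.0, "s-risk reduction": 2.0},
    "downside-focused": {"AI alignment": 6.0, "non-AI extinction risk": -2.0, "s-risk reduction": 9.0},
}
credences = {"upside-focused": 0.5, "downside-focused": 0.5}

def normalize(view_scores):
    """Rescale a view's scores to mean 0 and standard deviation 1 (variance voting)."""
    mean = statistics.mean(view_scores.values())
    sd = statistics.pstdev(view_scores.values())
    return {option: (s - mean) / sd for option, s in view_scores.items()}

normalized = {view: normalize(s) for view, s in scores.items()}

# Credence-weighted sum of the normalized scores for each intervention.
aggregate = {
    option: sum(credences[view] * normalized[view][option] for view in scores)
    for option in next(iter(scores.values()))
}
for option, value in sorted(aggregate.items(), key=lambda kv: -kv[1]):
    print(f"{option}: {value:+.2f}")
# With these invented numbers, AI alignment comes out on top for the mixed portfolio,
# mirroring the overlap claim in the text.
```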
Some people have argued that even (very) small credences in upside-focused views, such as 1–20%, would by themselves already speak in favor of making extinction risk reduction a top priority, because making sure there will still be decision-makers in the future provides high option value. I think this gives far too much weight to the argument from option value. Option value does play a role, but not nearly as strong a role as it is sometimes made out to be. To elaborate, let’s look at the argument in more detail: The naive argument from option value says, roughly, that our descendants will be in a much better position to decide than we are, and if suffering-focused ethics or some other downside-focused view is indeed the outcome of their moral deliberations, they can then decide to not colonize space, or only do so in an extremely careful and controlled way. If this picture is correct, there is almost nothing to lose and a lot to gain from making sure that our descendants get to decide how to proceed.
I think this argument to a large extent misses the point, but seeing that even some well-informed effective altruists seem to believe that it is very strong led me to realize that I should write a post explaining the landscape of cause prioritization for downside-focused value systems. The problem with the naive argument from option value is that the decision algorithm that is implicitly being recommended in the argument, namely focusing on extinction risk reduction and leaving moral philosophy (and s-risk reduction in case the outcome is a downside-focused morality) to future generations, makes sure that people follow the implications of downside-focused morality in precisely the one instance where it is least needed, and never otherwise. If the future is going to be controlled by philosophically sophisticated altruists who are also modest and willing to change course given new insights, then most bad futures will already have been averted in that scenario. An outcome where we get long and careful reflection without downsides is far from the only possible outcome. In fact, it does not even seem to me to be the most likely outcome (although others may disagree). No one is most worried about a scenario where epistemically careful thinkers with their heart in the right place control the future; the discussion is instead about whether the probability that things will accidentally go off the rails warrants extra-careful attention. (And it is not as though it looks like we are particularly on the rails currently either.) Reducing non-AI extinction risk does not preserve much option value for downside-focused value systems because most of the expected future suffering probably comes not from scenarios where people deliberately implement a solution they think is best after years of careful reflection, but instead from cases where things unexpectedly pass a point of no return and compassionate forces do not get to have control over the future. Downside risks by action likely loom larger than downside risks by omission, and we are plausibly in a better position to reduce the most pressing downside risks now than later. (In part because “later” may be too late.)
This suggests that if one is uncertain between upside- and downside-focused views, as opposed to being uncertain between all kinds of things except downside-focused views, the argument from option value is much weaker than it is often made out to be. Having said that, non-naively, option value still does upshift the importance of reducing extinction risks quite a bit – just not by an overwhelming degree. In particular, arguments for the importance of option value that do carry force are for instance:
The discussion about the benefits from option value is interesting and important, and a lot more could be said on both sides. I think it is safe to say that the non-naive case for option value is not strong enough to make extinction risk reduction a top priority given only small credences in upside-focused views, but it does start to become a highly relevant consideration once the credences become reasonably large. Having said that, one can also make a case that improving the quality of the future (more happiness/value and less suffering/disvalue) conditional on humanity not going extinct is probably going to be at least as important for upside-focused views and is more robust under population ethical uncertainty – which speaks particularly in favor of highly prioritizing existential risk reduction through AI policy and AI alignment.
We saw that integrating population-ethical uncertainty means that one should often act to benefit both upside- and downside-focused value systems – at least in case such uncertainty applies in one’s own case and epistemic situation. Moral cooperation presents an even stronger and more universally applicable reason to pursue a portfolio of interventions that is altogether positive according to both perspectives. The case for moral cooperation is very broad and convincing, as it ranges from commonsensical heuristics to theory-backed principles found in Kantian morality or throughout Parfit’s work, as well as in the literature on decision theory.27 It implies that one should give extra weight to interventions that are positive for value systems different from one’s own, and subtract some weight from interventions that are negative according to other value systems – all to the extent that the value systems in question are endorsed prominently or endorsed by potential allies.28
Considerations from moral cooperation may even make moral reflection obsolete on the level of individuals: Suppose we knew that people tended to gravitate towards a small number of attractor states in population ethics, and that once a person tentatively settles on a position, they are very unlikely to change their mind. Rather than everyone going through this process individually, people could collectively adopt a decision rule where they value the outcome of a hypothetical process of moral reflection. They would then work on interventions that are beneficial for all the commonly endorsed positions, weighted by the probability that people would adopt them if they were to go through long-winded moral reflection. Such a decision rule would save everyone time that could be spent on direct work rather than philosophizing, but perhaps more importantly, it would also make it much easier for people to benefit different value systems cooperatively. After all, when one is genuinely uncertain about values, there is no incentive to attain uncooperative benefits for one’s own value system.
So while I think the position that valuing reflection is always the epistemically prudent thing to do rests on dubious assumptions (because of the argument from option value being weak, as well as the reasons alluded to in footnote 24), I think there is an intriguing argument that a norm for valuing reflection is actually best from a moral cooperation perspective – provided that everyone is aware of what the different views on population ethics imply for cause prioritization, and that we have a roughly accurate sense of which attractor states people’s moral reflection would seek out.
Even if everyone went on to primarily focus on interventions that are favored by their own value system or their best guess morality, small steps into the direction of cooperatively taking other perspectives into account can already create a lot of additional value for all parties. To this end, everyone benefits from trying to better understand and account for the cause prioritization implied by different value systems.
This piece benefitted from comments by Tobias Baumann, David Althaus, Denis Drescher, Jesse Clifton, Max Daniel, Caspar Oesterheld, Johannes Treutlein, Brian Tomasik, Tobias Pulver, Simon Knutsson, Kaj Sotala (who also allowed me to use a map on s-risks he made and adapt it for my purposes in the section on AI) and Jonas Vollmer. The section on extinction risks benefitted from inputs by Owen Cotton-Barratt, and I am also thankful for valuable comments and critical inputs in a second round of feedback by Jan Brauner, Gregory Lewis and Carl Shulman.
Armstrong, S. & Sandberg, A. (2013). Eternity in Six Hours: Intergalactic spreading of intelligent life and sharpening the Fermi paradox. Acta Astronautica 89:1-13.
Bostrom, N. (2003). Astronomical Waste: The Opportunity Cost of Delayed Technological Development. Utilitas 15(3):308-314.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press.
Daswani, M. & Leike, J. (2015). A Definition of Happiness for Reinforcement Learning Agents. arXiv:1505.04497.
Greaves, H. (2017). Population axiology. Philosophy Compass, 12:e12442. doi.org/10.1111/phc3.12442.
Hastie, R., & Dawes, R. (2001). Rational choice in an uncertain world: The psychology of judgment and decision making. Thousand Oaks: Sage Publications.
MacAskill, W. (2014). Normative Uncertainty. PhD diss., St Anne’s College, University of Oxford.
Newman, M. E. J. (2006). Power laws, Pareto distributions and Zipf’s law. arXiv:cond-mat/0412004.
Omohundro, S. (2008). The Basic AI Drives. In P. Wang, B. Goertzel, and S. Franklin (eds.). Proceedings of the First AGI Conference, 171, Frontiers in Artificial Intelligence and Applications. Amsterdam: IOS Press.
Parfit, D. (2011). On What Matters. Oxford: Oxford University Press.
Pinker, S. (2011). The Better Angels of our Nature. New York, NY: Viking.
Sotala, K. & Gloor, L. (2017). Superintelligence as a Cause or Cure for Risks of Astronomical Suffering. Informatica 41(4):389–400.
The post Cause prioritization for downside-focused value systems appeared first on Center on Long-Term Risk.
The post Using surrogate goals to deflect threats appeared first on Center on Long-Term Risk.
Agents that threaten to harm other agents, either in an attempt at extortion or as part of an escalating conflict, are an important form of agential s-risks. To avoid worst-case outcomes resulting from the execution of such threats, I suggest that agents add a “meaningless” surrogate goal to their utility function. Ideally, threats would target this “honeypot” rather than the initial goals, which means that escalating threats would no longer lead to large amounts of disvalue.
In this post, I introduce key desiderata for how surrogate goals should work, and outline the challenges that need to be addressed. Many open questions remain, but I am optimistic that the idea can be a useful tool to help mitigate the negative impact of threats.
Let Alice be an agent with a utility function U. For example, suppose Alice wants to make money but cares even more about survival. She potentially faces threats from a second actor (let’s call him Bob) of the form “Unless you do X (e.g. give me money), I’ll kill you”.
To avoid this, she comes up with a smart way to change her utility function. She decides to introduce a “meaningless” surrogate goal – say, she now cares strongly about preventing the existence of a sphere of platinum with a diameter of exactly 42.82cm. The hope is that Bob’s threats are deflected to this new goal, assuming that Alice’s new utility function U’ = U + V puts sufficient weight on preventing the sphere (represented by V). Bob would now make threats of the form “Unless you do X (e.g. give me money), I’ll create a sphere of platinum with a diameter of exactly 42.82cm”.
This trick aims to solve one aspect of the threat problem only – namely, the potential for it to result in an extremely bad outcome if the threat is carried out. Alice might still give away resources when threatened; after all, it would be absolutely horrendous if Bob actually went through with his threat and created the sphere. Ideally, Alice would respond to threats in the same way as before her goal modification, for reasons discussed later.
So utility function modification does not prevent the loss of resources due to extortion, or the risk that a malicious agent might become more powerful due to gaining resources through extortion. More work on a solution for this part of the problem is also necessary, but preventing the risk that threats are carried out (against the original goal) would already go a long way. Surrogate goals can also be combined with any other anti-extortion measure.
Unfortunately, it may be hard for humans to deliberately change their utility function in this way. It is more realistic that the trick can be applied to advanced AI systems. For example, if an AI system controls important financial or economic resources, other AI systems might have an incentive to try to extort it. If the system also uses inverse reinforcement learning or other techniques to infer human preferences, then the threat might involve the most effective violations of human preferences, such as killing people (assuming that the threatening AI system has the power to do this). Surrogate goals might help mitigate this security problem.
So far, I assumed that the trick is successful in deflecting threats, but it is actually not straightforward to get this to work. In the following, I will discuss the main criteria for successfully implementing utility function modification.
Changing your goals is usually disadvantageous in terms of the original goals since it means that you will optimise for the “wrong” goal. In other words, goal preservation is a convergent instrumental goal. So, when we introduce a surrogate goal, we’d like to ensure that it doesn’t interfere with other goals in non-threat situations.
To achieve this, the surrogate goal could be the minimization of a structure that is so rare that it doesn’t matter in “normal” (non-threat) situations. Spheres of platinum with a diameter of 42.82cm usually don’t occur naturally, so Alice is still free to pursue her other goals – including survival – as long as no threats are made. (It might be better to make it even more specific by adding a certain temperature, specifying a complex and irregular shape, and so on.)
An even more elegant solution is to choose a dormant utility function modification, that is, to introduce a trigger mechanism that causes the modification conditional on being threatened. This ensures non-interference with other goals. Less formally speaking, this corresponds to disvaluing spheres of platinum (or any other surrogate goal) only if they are created as the result of a threat, while remaining indifferent towards natural instances.
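Here is a rough sketch of the dormant variant, building on the snippet above; the flag sphere_created_as_threat is an invented placeholder for a real threat-detection mechanism, which is the hard part the next paragraph turns to.

```python
def dormant_surrogate_term(outcome):
    """Dormant variant of V: only spheres created as the result of a threat count.
    The 'sphere_created_as_threat' flag stands in for actual threat detection."""
    if outcome["platinum_sphere_42_82cm_exists"] and outcome["sphere_created_as_threat"]:
        return -10_000_000
    return 0  # naturally occurring spheres remain a matter of indifference
```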
This requires a mechanism to detect (serious) threats. In particular, it’s necessary to distinguish threats from positive-sum trade, which turns out to be quite difficult. (Learning to reliably detect threats using neural networks or other machine learning methods may be a critical problem in worst-case AI safety.)
To the extent possible, the surrogate goal should be orthogonal to your original goals. This ensures that it’s not easily possible to simply combine both threats. For example, if Alice’s surrogate goal is “prevent murder” – which isn’t orthogonal – then Bob can target the surrogate goal and the original goal simultaneously with a death threat.
Even for orthogonal goals, Bob might still decide to threaten both goals (death and the creation of the sphere). Caring more about the surrogate goal than about the initial goal is not sufficient to make sure that this does not happen. For example, Bob might (depending on circumstances) want to make his threat as big as possible to force Alice to give in.
It might be safer to choose a continuous and unbounded surrogate goal instead of a binary surrogate goal like “prevent the existence of a single platinum sphere”; for instance, the disvalue could be a function of the size of the sphere. This is an improvement because a threatener who wants to increase the stakes will now create bigger spheres rather than adding the initial goal to his threat.
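In the same toy notation, a continuous and unbounded version might look as follows (again purely illustrative):

```python
def continuous_surrogate_term(outcome):
    """Unbounded variant of V: disvalue scales with the diameter of any sphere created
    as a threat, so a threatener who wants to raise the stakes builds a bigger sphere
    rather than adding Alice's initial goals to the threat."""
    return -1_000_000 * outcome["threat_sphere_diameter_cm"]
```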
So far, I assumed that the kind of threat that is made simply depends on Alice’s utility function. But it’s actually more accurate to say that it depends on the threatener’s beliefs about her utility function. If Bob believes that Alice cares about her surrogate goal, even though Alice didn’t actually modify, then he will still threaten the surrogate goal.
In this case, his threats arguably wouldn’t work. So maybe it’s even better to just pretend that you changed your utility function?
Of course, the problem is that Bob might see through this, which would mean that he threatens the initial goal after all. (Misrepresenting your values may still be an interesting anti-threat strategy, but this is beyond the scope of this post.)
It’s also possible that Alice actually modifies her utility function, but Bob thinks it’s a ruse. Now, this seems particularly dangerous because it involves threats against the initial goal and you might worry that Alice would not respond “correctly” (whatever this means) to such threats anymore. Alice’s new utility function still includes the initial goal, though, so she continues to react to threats against the initial goal. The surrogate goal does not help in this case, but the result is at least not worse than what would happen by default (without utility function modification).
To successfully deflect threats to the surrogate goal, Alice needs to be able to credibly broadcast that she now cares most about this. This is a nontrivial problem, which is exacerbated by the fact that after modifying, Alice has strong incentives to keep her surrogate goal secret – after all, leaking the information that she cares about preventing spheres of platinum might lead to threats! It is thus easier to broadcast the utility function modification before actually carrying it out.
Fortunately, the problem of credible broadcasting may disappear in threats that involve advanced artificial intelligence. For instance, it may become possible to run faithful simulations of the other party, which means that a threatener could verify the utility function modification. Also, rather than starting out with a certain utility function and later modifying it, we could equip AI systems with a surrogate goal from the start.
Modifying your utility function may increase or decrease the attractiveness of threats against you. For example, threatening to create the sphere may be more attractive to Bob than issuing a death threat because death threats are illegal. I will call a modification threatener-friendly (it increases the attractiveness of threats), threatener-neutral (it keeps the attractiveness of threats constant), or threatener-hostile (it decreases the attractiveness of threats).
In the following, I will argue that the utility function modification should be as close to threatener-neutral as possible.
Threatener-hostile utility function modification may be risky since it reduces the utility of threateners, which potentially gives them reason for punishment in order to discourage such strategic moves. Unfortunately, this punishment would be directed at the initial goal rather than the surrogate goal, since this is what could deter Alice at the point where she considers modifying her utility function.
This is not a knock-down argument, and threatener-hostile moves – such as strongly pre-committing to not give in, or caring intrinsically about punishing extortionists – might turn out to be valuable anti-threat measures. Still, the idea of this post is intriguing precisely because, in a threatener-neutral or threatener-friendly variant, it helps to avoid (the consequences of) threats without incentivizing punishment. In particular, it might be helpful to introduce a surrogate goal before adopting other (threatener-hostile) tricks, so that any punishment directed at those tricks is already deflected.
That said, a threatener-friendly utility function modification is also undesirable simply because it helps threateners gain resources. Making extortion more attractive is bad in expectation for most agents due to the negative-sum nature of threats. So, the ideal surrogate goal is threatener-neutral, averting the possibility of extremely bad outcomes without changing other parameters.
Unfortunately, this is a difficult problem. The feasibility of threats is a (complex) function of empirical circumstances, and these circumstances might change in the future. Creating spheres of platinum may become easy because of advances in mining technology, or it might be hard because all the platinum is used elsewhere. The circumstances may also differ from threatener to threatener.
It therefore seems desirable to use a surrogate goal that’s similar to the initial goal in that its vulnerability to threats as a function of empirical circumstances is strongly tied to the vulnerability of the initial goal, while still being orthogonal in the sense of the previous section. Rather than picking a single surrogate goal, you could pick a "surrogate goal function" that maps every environment to a surrogate goal in a way that maintains threatener-neutrality.
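One way to picture such a surrogate goal function, as a toy sketch with invented environment parameters: in each environment, pick the sphere specification whose cost of creation tracks the cost of carrying out a threat against the initial goal.

```python
import math

def surrogate_goal_for(environment):
    """Toy 'surrogate goal function' (invented parameters): choose the sphere size whose
    production cost in this environment matches the cost of executing a threat against
    the initial goal, so that the surrogate threat stays roughly threatener-neutral."""
    target_cost = environment["cost_of_executing_threat_against_initial_goal"]
    volume_cm3 = target_cost / environment["platinum_cost_per_cm3"]
    diameter_cm = 2 * (3 * volume_cm3 / (4 * math.pi)) ** (1 / 3)
    return {"prevent_threat_created_sphere_of_diameter_cm": diameter_cm}
```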
In light of these difficulties, we might hope that future AI systems will be able to figure out the details if they reason correctly about game theory and decision theory, or that the idea is sufficiently robust to small perturbations. It’s even conceivable that threateners would tell future threatenees about the idea and how to best implement it (presumably in exchange for making it slightly threatener-friendly).
So far, I only considered a simplified two-agent case. Our real world, however, features a large number of agents with varying goals, which causes additional complications if many of them modify their utility function.
A key question is whether different agents should use the same or different surrogate goals, especially if their original goals are similar. For example, suppose Alice has relatives that also don’t want her to die. Suppose they also modify their utility function to defuse threats, choosing different (unrelated) surrogate goals.
Now, a threat against the surrogate goal targets a different set of people – a single individual rather than the entire family – compared to a threat against the initial goal. This is problematic because threateners may prefer to target the initial goal after all, to threaten the entire family at the same time, which may or may not be more attractive depending on the circumstances.
On the flip side, if the (initial) goals of different agents overlap only partially, they may prefer to not choose the same surrogate goal. This is because you don’t want to potentially lose resources because of threats that would otherwise (pre-modification) only or mostly target others. Also, similar to the above point, it may be more effective to threaten the different agents individually, so this fails to achieve the goal of deflecting threats to the surrogate goal under all circumstances.
To solve this problem, the agents could associate each initial goal with a surrogate goal in a way that preserves the “distance” between goals, so that different initial goals are mapped onto different surrogate goals, and vice versa. More formally, we need an isometric mapping from the space of initial goals to the space of surrogate goals. This follows a pattern which we’ve encountered several times in this post: threats against the surrogate goal should be as similar as possible, in terms of their feasibility as a function of empirical circumstances, to threats against the initial goal.
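In symbols, one way to state the requirement (my formalisation, not notation used elsewhere in this post):

```latex
% Let $(G, d_G)$ be the space of initial goals and $(S, d_S)$ the space of surrogate
% goals, each equipped with a notion of distance between goals. We want a map
\[
  \sigma : G \to S
  \qquad \text{with} \qquad
  d_S\bigl(\sigma(g_1), \sigma(g_2)\bigr) = d_G(g_1, g_2)
  \quad \text{for all } g_1, g_2 \in G,
\]
% i.e. an isometry: distinct initial goals receive distinct surrogate goals, and initial
% goals that (nearly) coincide are sent to surrogate goals that (nearly) coincide.
```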
Finding and implementing an isometric mapping that’s used by all agents is a difficult coordination problem. Prima facie, it’s unclear how this could be solved, given how arbitrary the choice of surrogate goals is. To get different surrogate goals, you could use long sequences of random numbers, but using the same or similar surrogate goals may require some kind of Schelling point if the agents can’t communicate.
What’s worse, this might also give rise to a cooperation problem. It is possible that agents have incentives to choose a different surrogate goal than the one associated with their initial goal according to the mapping. For instance, perhaps it’s better to choose a less common surrogate goal because this means fewer threats target your surrogate goal, which means you’re less likely to waste resources responding to such threats. Or perhaps it’s better to choose a more common surrogate goal if all the agents sharing this surrogate goal are powerful enough to prohibit threats.
It’s hard to say how exactly this would work without a better understanding of multi-agent threat dynamics, which we currently lack.
Instead of a “meaningless” surrogate goal, you could take the idea a step further by choosing a surrogate goal whose violation would be good (according to the initial utility function). You can pick an outcome that’s ideal or close to ideal and choose the surrogate goal of preventing that outcome from happening. In this case, the execution of threats would – counterintuitively – lead to very good outcomes. Ensuring non-interference with other goals is tricky in this case, but can perhaps be solved if the modification is dormant (as described above).
This variant seems particularly brittle, but if it works, it’s possible to turn worst-case outcomes from threats into utopian outcomes, which would be a surprisingly strong result.
As we’ve seen, the problem with surrogate goals is that it’s quite hard to specify them properly. We could take inspiration from the idea of indirect normativity, which is about an indirect description of (ethical) goals, i.e. “what I care about is what I would care about after a century of reflection”. Similarly, we could also define an indirect surrogate goal. Alice could simply say “I care about the surrogate goal that I’d choose if I was able to figure out and implement all the details”. (Needless to say, it may be hard to implement such an indirect specification in AI systems.)
It’s not possible to threaten an indeterminate goal, though, which might mean that threateners circle back to the initial goal if the surrogate goal could be anything. So this idea seems to require that the threatener is able to figure out the “ideal” surrogate goal or will become able to figure it out in the future. It’s also conceivable (albeit speculative) that the ideal surrogate goal would compensate for the reduced vulnerability to threats due to indeterminacy by being more threatener-friendly along other dimensions.
Agents that use an updateless decision theory (UDT) reason in terms of the optimal policy – the mapping of inputs to actions – rather than choosing the best option at any given moment. An advantage of UDT in the context of surrogate goals is that UDT agents don’t need “hacky” solutions like self-modification. If the best policy for maximizing utility function U is to act like a U+V maximizer for certain inputs – specifically, to take threats against the surrogate goal V seriously – then the UDT agent will simply do so.
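As a toy rendering of that point – all numbers, and the assumption about how the threatener responds to each policy, are invented for illustration:

```python
# Policies map the input "which goal is threatened" to a response. We assume the
# threatener predicts Alice's policy and targets whatever that policy responds to.

def expected_initial_utility(policy):
    """Expected value of Alice's original goals U under each policy (toy numbers)."""
    if policy["sphere_threat"] == "give in":
        return -10                       # threats are deflected to the harmless sphere
    if policy["death_threat"] == "give in":
        return -10 - 0.05 * 1_000_000    # death threats return, with some chance of execution
    return -0.20 * 1_000_000             # resist everything: threats sometimes carried out

policies = [
    {"sphere_threat": "give in", "death_threat": "give in"},
    {"sphere_threat": "resist",  "death_threat": "give in"},
    {"sphere_threat": "resist",  "death_threat": "resist"},
]
best = max(policies, key=expected_initial_utility)
# Under these assumptions, the U-maximising policy already takes sphere threats
# seriously, i.e. it behaves like a U+V maximiser on those inputs.
assert best["sphere_threat"] == "give in"
```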
This framework arrives at the same conclusions, but it might be a “cleaner” way to think about the topic as it dispenses with fuzzy terms such as “utility function modification”.
Eliezer Yudkowsky introduces the idea of surrogate goals in his post on Separation from hyperexistential risk as a patch to avoid disutility maximization. He argues that the patch fails because the resulting utility function is not reflectively consistent. Indeed, the modified agent may have an incentive to apply the same trick again, replacing the (now meaningful) spheres of platinum with yet another surrogate goal. The agent may also want to remove the surrogate goal from its utility function (to avoid threats against it).
To avoid this, she needs to fix the new values by committing to not modifying her utility function again – for instance, she could care intrinsically about retaining the (modified) utility function. This may or may not be a satisfactory solution, but I don’t think (contra Eliezer) that this constitutes an insurmountable problem. (As described above, this seems to be unproblematic for UDT agents.)
In a discussion of how to prevent accidental maximization of disvalue, Robert Miles asks:
Does there exist any utility function that results in good outcomes when maximised but does not result in bad outcomes when minimised?
Surrogate goals are a possible answer to this question, which, in a sense, is more general than the question of how to prevent (the bad outcomes of) threats. I also like Stuart Armstrong’s solution:
Yes. Let B1 and B2 be excellent, bestest outcomes. Define U(B1)=1, U(B2)=-1, and U=0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
Stuart Armstrong also proposes a variant of utility function modification that aims to reduce the size of threats by cutting off the utility function at a certain level. Comparing the advantages and drawbacks of each variant would be beyond the scope of this text, but future research on this would be highly valuable.
As we’ve seen, any utility function modification must be calibrated well in order to work. Trying to specify the details turns out to be surprisingly difficult. More work on these problems is needed to enable us to implement robust utility function modification in advanced AI systems.
Finally, I’d like to emphasize that it would be ideal if (the bad kind of) threats could be avoided completely. However, given that we don’t yet know how to reliably achieve this, moving threats to the realm of the meaningless (or even beneficial) is a promising way to mitigate agential s-risks.
Caspar Oesterheld initially brought up the idea in a conversation with me. My thinking on the topic has also profited enormously from internal discussions at the Foundational Research Institute. Daniel Kokotajlo inspired my thinking on the multi-agent case.
I am indebted to Brian Tomasik, Johannes Treutlein, Caspar Oesterheld, Lukas Gloor, Max Daniel, Abram Demski, Stuart Armstrong and Daniel Kokotajlo for valuable comments on an earlier draft of this text.
The post Superintelligence as a Cause or Cure for Risks of Astronomical Suffering appeared first on Center on Long-Term Risk.
Discussions about the possible consequences of creating superintelligence have included the possibility of existential risk, often understood mainly as the risk of human extinction. We argue that suffering risks (s-risks), where an adverse outcome would bring about severe suffering on an astronomical scale, are risks of a comparable severity and probability as risks of extinction. Preventing them is the common interest of many different value systems. Furthermore, we argue that in the same way as superintelligent AI both contributes to existential risk but can also help prevent it, superintelligent AI can be both the cause of suffering risks and a way to prevent them from being realized. Some types of work aimed at making superintelligent AI safe will also help prevent suffering risks, and there may also be a class of safeguards for AI that helps specifically against s-risks.
Read on the publisher's website.
The post Self-improvement races appeared first on Center on Long-Term Risk.
Most of my readers are probably familiar with the problem of AI safety: If humans create super-human artificial intelligence, the task of programming it in such a way that it behaves as intended is non-trivial. There is a risk that the AI will act in unexpected ways, and given its super-human intelligence, it would then be hard to stop.
I assume fewer are familiar with the problem of AI arms races. (If you are, you may well skip this paragraph.) Imagine two opposing countries which are trying to build a super-human AI to reap the many benefits and potentially attain a decisive strategic advantage, perhaps taking control of the future immediately. (It is unclear whether this latter aspiration is realistic, but it seems plausible enough to significantly influence decision making.) This creates a strong motivation for the two countries to develop AI as fast as possible. This is the case especially if the countries would dislike a future controlled by the other. For example, North Americans may fear a future controlled by China. In such cases, countries would want to invest most available resources into creating AI first with less concern for whether it is safe. After all, letting the opponent win may be similarly bad as having an AI with entirely random goals. It turns out that under certain conditions both countries would invest close to no resources into AI safety and all resources into AI capability research (at least, that's the Nash equilibrium), thus leading to an unintended outcome with near certainty. If countries are sufficiently rational, they might be able to cooperate to mitigate risks of creating uncontrolled AI. This seems especially plausible given that the values of most humans are actually very similar to each other relative to how alien the goals of a random AI would probably be. However, given that arms races have frequently occurred in the past, a race toward human-level AI remains a serious worry.
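To illustrate the equilibrium claim, here is a toy version of the race; all payoffs are my own illustrative assumptions (winning with an aligned AI is worth 1, while a misaligned AI – or a future controlled by the other side, which the countries value roughly as badly – is worth 0):

```python
import itertools

P_ALIGNED = {"safe": 0.9, "fast": 0.5}   # chance the winner's AI does what they intended

def expected_payoff(me, other):
    p_win = 0.5 if me == other else (1.0 if me == "fast" else 0.0)
    return p_win * P_ALIGNED[me]

for a, b in itertools.product(["safe", "fast"], repeat=2):
    print(f"({a}, {b}): {expected_payoff(a, b):.2f} / {expected_payoff(b, a):.2f}")

# (safe, safe) yields 0.45 each, but either side gains by switching to "fast" (0.50);
# (fast, fast) yields only 0.25 each, yet neither side gains by switching back (0.00).
# So (fast, fast) is the unique Nash equilibrium, even though both prefer (safe, safe).
```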
Handing power over to AIs holds both economic promise and a risk of misalignment. Similar problems actually haunt humans and human organizations. Say, a charity hires a new director who has been successful in other organizations. Then this creates the opportunity for the charity to rise in influence. However, it is also possible that the charity changes in a way that the people currently or formerly in charge wouldn't approve of. Interestingly, the situation is similar for AIs which create other AIs or self-improve themselves. Learning and self-improvement are the paths to success. However, self-improvements carry the risk of affecting the goal-directed behavior of the system.
The existence of this risk seems plausible prima facie: It should be strictly easier to find self-improvements that "probably" work than to identify self-improvements that are guaranteed to work – the former is a superset of the latter. So, AIs which are willing to take risks while self-improving can improve faster.
There are also formal reasons why proving self-improvements correct is difficult. Specifically, Rice's theorem states that for any non-trivial semantic property p (a property of a program's input–output behaviour), there is no general procedure that decides for all programs whether they have property p. (If you know about the undecidability of the halting problem, Rice's theorem follows almost immediately from it.) As a special case, deciding for all programs whether they pursue some given goals is impossible. Of course, this does not mean that proving self-improvements to be correct is impossible. After all, an AI could just limit itself to the self-improvements that it can prove correct (see this discussion between Eliezer Yudkowsky and Mark Waser). However, without this limitation – e.g., if it can merely test some self-improvement empirically and implement it if it seems to work – an AI can use a broader range of possible self-modifications and thus improve more quickly. (In general, testing a program also appears to be a lot easier than formally verifying it, but that's a different story.) Another relevant result from (provability) logic is Löb's theorem, which roughly implies that a logical system containing Peano arithmetic cannot prove the soundness of another system with at least that power – including itself.
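For reference, one standard statement of the theorem (my paraphrase):

```latex
% Rice's theorem: write $\varphi_e$ for the partial computable function computed by
% program $e$, and let $P$ be any non-trivial set of partial computable functions
% (some computable function lies in $P$, some does not). Then the index set
\[
  \{\, e \mid \varphi_e \in P \,\} \quad \text{is undecidable.}
\]
% "Pursues goal G" is such a non-trivial semantic property, so no single procedure can
% decide it for every program -- although particular programs can still be verified if
% the AI restricts itself to a proof-friendly subset, as noted above.
```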
Lastly, consider Stephen Wolfram's more fuzzy concept of computational irreducibility. It basically states that as soon as a system can produce arbitrarily complex behavior (i.e., as soon as it is universal in some sense), predicting how most aspects of the system will behave becomes fundamentally hard. Specifically, he argues that for most (especially for complex and universal) systems, there is no way to find out how they behave other than running them.
So, self-improvement can give AIs advantages and ultimately the upper hand in a conflict, but if done too hastily, it can also lead to goal drift. Now, consider the situation in which multiple AIs compete in a head-to-head race. Based on the above considerations this case becomes very similar to the AI arms races between groups of humans. Every single AI has incentives to take risks to increase its probability of winning, but overall this can lead to unintended outcomes with near certainty. There are reasons to assume that this self-improvement race dynamic will be more of a problem for AIs than it is for human factions. The goals of different AIs could diverge much more strongly than the goals of different humans. Whereas human factions may prefer the enemy's win over a takeover by an uncontrolled AI, an AI with human values confronting an AI with strange values has less to lose from risky self-modifications. (There are some counter-considerations as well. For instance, AIs may be better at communicating and negotiating compromises.)
Thus, a self-improvement race between AIs seems to share the bad aspects of AI arms races between countries. This has a few implications:
The post Overview: Evidential Cooperation in Large Worlds (ECL) appeared first on Center on Long-Term Risk.
Some papers closely related to multiverse-wide superrationality can be found in section 6.1 of the 2017 paper.
The post Commenting on MSR, Part 1: Multiverse-wide cooperation in a nutshell appeared first on Center on Long-Term Risk.
(Disclaimer: Especially for the elevator pitch section here, I am sacrificing accuracy and precision for brevity. References can be found in Caspar’s paper.)
It would be an uncanny coincidence if the observable universe made up everything that exists. The reason we cannot find any evidence for there being stuff beyond the edge of the observable universe is not that there is likely nothingness out there, but that photons from further away simply have not had sufficient time since the big bang to reach us. This means that the universe we find ourselves in may well be vastly larger than what we can observe, perhaps even infinitely larger. In addition, the theory of inflationary cosmology hints at the existence of other universe bubbles with different fundamental constants, forming or disappearing under certain conditions and somehow co-existing with our universe in parallel. The umbrella term multiverse captures the idea that the observable universe is just a tiny portion of everything that exists. An infinite multiverse (of one sort or another) is actually amongst the most popular cosmological hypotheses, arguably even favored by the majority of experts.
Many ethical theories (in particular most versions of consequentialism) do not consider spatial distance relevant to moral value. After all, suffering and the frustration of one’s preferences are bad for someone regardless of where (or when) they happen. This principle should apply even when we consider worlds so far away from us that we can never receive any information from there. Moral concern over what happens elsewhere in the multiverse is one requirement for the idea I am now going to discuss.
Multiverse-wide cooperation via superrationality (abbreviation: MSR) is the idea that, if I think about different value systems and their respective priorities in the world, I should not work on the highest priority according to my own values, but on whatever my comparative advantage is amongst all the interventions favored by the value systems of agents interested in multiverse-wide cooperation. (Another route to gains from trade is to focus on convergent interests, pursuing interventions that may not be the top priority for any particular value system, but are valuable from a maximally broad range of perspectives.) For simplicity reasons, I will refer to this as simply “cooperating” from now on.
A decision to cooperate, according to some views in decision theory, gives me rational reason to believe that agents in similar decision situations elsewhere in the multiverse, especially the ones who are most similar to myself in how they reason about decision problems, are likely to cooperate as well. After all, if two very similar reasoners think about the same decision problem, they are likely to reach identical answers. This suggests that they will end up either both cooperating, or both defecting. Assuming that the way agents find decisions is not strongly constrained or otherwise affected by their values, we can expect there to be agents with different values who reason about decision problems the same way we do, who come to identical conclusions. Cooperation then produces gains from trade between value systems.
While each party would want to be the sole defector, the mechanism behind multiverse-wide cooperation – namely that we have to think of ourselves as being coupled with those agents in the multiverse who are most similar to us in their reasoning – ensures that defection is disincentivized: Any party that defects would now have to expect that their highly similar counterparts would also defect.
The closest way to approximate the value systems of agents in other parts of the multiverse, given our ignorance about what the multiverse looks like, is to assume that at least substantial parts of it are going to be similar to how things are here, where we can study them. A minimally viable version of multiverse-wide cooperation can therefore be thought of as all-out “ordinary” cooperation with value systems we know well (and especially ones that include proponents sympathetic to MSR reasoning). This suggests that, while MSR combines speculative-sounding ideas such as non-standard decision theory and the existence of a multiverse, its implications may not be all that strange and largely boil down to the proposal that we should be “maximally” cooperative towards other value systems.
Leaving aside for the moment the whole part about the multiverse, MSR is fundamentally about cooperating in a prisoner’s-dilemma-like situation with agents who are very similar to ourselves in the way they reason about decision problems. Douglas Hofstadter coined the term superrationality for the idea that one should cooperate in a prisoner’s dilemma if one expects the other party to follow the same style of reasoning. If they reason the same way I do, and the problem they are facing is the same kind of problem I am facing, then I must expect that they will likely come to the same conclusion I will come to. This suggests that the prisoner’s dilemma in question is unlikely to end with an asymmetric outcome – (cooperate | defect) or (defect | cooperate) – but likely to end with a symmetric outcome – (cooperate | cooperate) or (defect | defect). Because (cooperate | cooperate) is the best outcome for both parties amongst the symmetric outcomes, superrationality suggests one is best served by cooperating.
At this point, readers may be skeptical whether this reasoning works. There seems to be some kind of shady action at a distance involved, where my choice to cooperate is somehow supposed to affect the other party’s choice, even though we are assuming that no information about my decision reaches said other party. But we can think of it this way: If reasoners are deterministic systems, and two reasoners follow the exact same decision algorithm in a highly similar decision situation, it at some point becomes logically contradictory to assume that the two reasoners will end up with diametrically opposed conclusions.
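A minimal illustration of both points – the payoff numbers are the standard textbook prisoner’s dilemma; the rest is my own toy construction:

```python
PAYOFFS = {  # payoffs to the row player in a standard prisoner's dilemma
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def decide(expects_identical_reasoner: bool) -> str:
    if expects_identical_reasoner:
        # Only symmetric outcomes are attainable, so compare (C, C) with (D, D).
        return "C" if PAYOFFS[("C", "C")] > PAYOFFS[("D", "D")] else "D"
    return "D"  # treating the other's choice as fixed, defection dominates

# Two agents running the *same* deterministic procedure on the same problem:
alice, bob = decide(True), decide(True)
assert alice == bob                   # identical code on identical input: same output
assert (alice, bob) == ("C", "C")     # among the symmetric outcomes, mutual cooperation wins
```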
Side note: By decision situations having to be “highly similar,” I do not mean that the situations agents find themselves in have to be particularly similar with respect to little details in the background. What I mean is that they should be highly similar in terms of all decision-relevant variables, the variables that are likely to make a difference to an agent’s decision. If we imagine a simplified decision situation where agents have to choose between two options, either press a button or not (and then something happens or not), it probably matters little whether one agent has the choice to press a red button and another agent is faced with pressing a blue button. As long as both buttons do the same thing, and as long as the agents are not (emotionally or otherwise) affected by the color differences, we can safely assume that the color of the button is highly unlikely to play a decision-relevant role. What is more likely relevant are things such as the payoffs (value according to what an agent cares about) the agents expect from the available options. If one agent believes they stand to receive positive utility from pressing the button, and the other stands to receive negative utility, then that is guaranteed to make a relevant difference as to whether the agents will want to press their buttons. Maybe the payoff differentials are also relevant sometimes, or at least relevant with some probability: If one agent only gains a tiny bit of utility, whereas the other agent has an enormous amount of utility to win, the latter agent might be much more motivated to avoid making a suboptimal decision. While payoffs and payoff structures certainly matter, it is unlikely that it matters what qualifies as a payoff for a given agent: If an agent who happens to really like apples will be rewarded with tasty apples after pressing a button, and another agent who really likes money is rewarded with money, their decision situations seem the same provided that they each care equally strongly about receiving the desired reward. (This is the intuition behind the irrelevance of specific value systems for whether two decision algorithms or decision situations are relevantly similar or not. Whether one prefers apples, money, carrots or whatever, math is still math and decision theory is still decision theory.)
A different objection that readers may have at this point concerns the idea of superrationally “fixing” other agents’ decisions. Namely, critics may point out that we are thereby only ever talking about updating our own models, our prediction of what happens elsewhere, and that this does not actually change what was going to happen elsewhere. While this sounds like an accurate observation, the force of the statement rests on a loaded definition of “actually changing things elsewhere” (or anywhere for that matter). If we applied the same rigor to a straightforward instance of causally or directly changing the position of a light switch in our room, a critic may in the same vein object that we only changed our expectation of what was going to happen, not what actually was going to happen. The universe is lawful: nothing ever happens that was not going to happen. What we do when we want to have an impact and accomplish something with our actions is never to actually change what was going to happen; instead, it is to act in the way that best shifts our predictions favorably towards our goals. (This is not to be confused with cheating at prediction: We don’t want to make ourselves optimistic for no good reason, because the decision to bias oneself towards optimism does not actually correlate with our goals getting accomplished – it only correlates with a deluded future self believing that we will be accomplishing our goals.)
For more reading on this topic, I recommend this paper on functional decision theory, the book Evidence, Decision and Causality or the article On Correlation and Causation Part 1: Evidential decision theory is correct. For an overview on different decision theories, see also this summary.
To keep things simple and as uncontroversial as possible, I will follow Caspar’s terminology for the rest of my post here and use the term superrationality in a very broad sense that is independent of any specific flavor of decision theory, referring to a fuzzy category of arguments from similarity of decision algorithms that favor cooperating in certain prisoner’s-dilemma-like situations.
The existence of a multiverse would virtually guarantee that there are many agents out there who fulfill the criteria of “relevant similarity” compared to us with regard to their decision algorithm and decision situations – whatever these criteria may boil down to in detail.
Insertion: Technically, if the multiverse is indeed infinite, there will likely be infinitely many such agents, and infinite amounts of everything in general, which admittedly poses some serious difficulties for formalizing decisions: If there is already an infinite amount of value or disvalue, it seems like all our actions should be ranked the same in terms of the value of the outcome they result in. This leads to so-called infinitarian paralysis, where all actions are rated as equally good or bad. Perhaps infinitarian paralysis is a strong counterargument to MSR. But in that case, we should be consistent: Infinitarian paralysis would then also be a strong counterargument to aggregative consequentialism in general. Because it affects nearly everything (for consequentialists), and because of how drastic its implications would be if there was no convenient solution, I am basically hoping that someone will find a solution that makes everything work again in the face of infinities. For this reason, I think we should not think of MSR as being particularly in danger of failing for reasons of infinitarian paralysis.
Back to object-level MSR: We noted that the multiverse guarantees that there are agents out there very similar to us who are likely to tackle decision problems the same way we do. To prevent confusion, note that MSR is not based on the naive assumption that all humans who find the concept of superrationality convincing are therefore strongly correlated with each other across all possible decision situations. Superrationality only motivates cooperation if one has good reason to believe that another party’s decision algorithm is indeed extremely similar to one’s own. Human reasoning processes differ in many ways, and sympathy towards superrationality represents only one small dimension of one’s reasoning process. It may very well be extremely rare that two people’s reasoning is sufficiently similar that, having common knowledge of this similarity, they should rationally cooperate in a prisoner’s dilemma.
But out there somewhere, maybe on Earth already in a few instances among our eight-or-so billion inhabitants, but certainly somewhere in the multiverse if a multiverse indeed exists, there must be evolved intelligent beings who are sympathetic towards superrationality in the same way we are, who in addition also share a whole bunch of other structural similarities with us in the way they reason about decision problems. These agents would construe decision problems related to cooperating with other value systems in the same way we do, and pay attention to the same factors weighted according to the same decision-normative criteria. When these agents think about MSR, they would be reasonably likely to reach similar conclusions with regard to the idea’s practical implications. These are our potential cooperation partners.
I have to admit that it seems very difficult to tell which aspects of one’s reasoning are more or less important for the kind of decision-relevant similarity we are looking for. There are many things left to be figured out, and it is far from clear whether MSR works at all in the sense of having action-guiding implications for how we should pursue our goals. But the underlying idea here is that once we pile up enough similarities of the relevant kind in one’s reasoning processes (and a multiverse would ensure that there are agents out there who do indeed fulfill these criteria), at some point it becomes logically contradictory to treat the output of our decisions as independent from the decisional outputs of these other agents. This insight seems hard to avoid, and it seems quite plausible that it has implications for our actions.
If I were to decide to cooperate in the sense implied by MSR, I would have to then update my model of what is likely to happen in other parts of the multiverse where decision algorithms highly similar to my own are at play. Superrationality says that this update in my model, assuming it is positive for my goal achievement because I now predict more agents to be cooperative towards other value systems (including my own), in itself gives me reason to go ahead and act cooperatively. If we manage to form even a crude model of some of the likely goals of these other agents and how we can benefit them in our own part of the multiverse, then cooperation can already get off the ground and we might be able to reap gains from trade.
Alternatively, if we decide against becoming more cooperative, we learn that we must be suffering costs from mutual defection. This includes both opportunity costs and direct costs from cases where other parties’ favored interventions may hurt our values.
We are assuming that we care about what happens in other parts of the multiverse. For instance, we might care about increasing total happiness. If we further assume that decision algorithms and the values/goals of agents are distributed orthogonally – meaning that one cannot infer someone’s values simply by seeing how they reason practically about epistemic matters – then we arrive at the conceptualization of a multiverse-wide prisoner’s dilemma.
(Note that we can already observe empirically that effective altruists who share the same values sometimes disagree strongly about decision theory (or more generally reasoning styles/epistemics), and effective altruists who agree on decision theory sometimes disagree strongly about values. In addition, as pointed out in section one, there appears to be no logical reason as to why agents with different values would necessarily have different decision algorithms.)
The cooperative action in our prisoner’s dilemma would now be to take other value systems into account in proportion to how prevalent they are in the multiverse-wide compromise. We would thus try to benefit them whenever we encounter opportunities to do so efficiently, that is, whenever we find ourselves with a comparative advantage to strongly benefit a particular value system. By contrast, the action that corresponds to defecting in the prisoner’s dilemma would be to pursue one’s personal values with zero regard for other value systems. The payoff structure is such that an outcome where everyone cooperates is better for everyone than an outcome where everyone defects, but each party would prefer to be a sole defector.
Consider for example someone who is in an influential position to give advice to others. This person can either tailor their advice to their own specific values, discouraging others from working on things that are unimportant according to their personal value system, or they can give advice that is tailored towards producing an outcome that is maximally positive for the value systems of all superrationalists, perhaps even investing substantial effort researching the implications of value systems different from their own. MSR provides a strong argument for maximally cooperative behavior, because by cooperating, the person in question ensures that there is more such cooperation in other parts of the multiverse, which in expectation also strongly benefits their own values.
Of course there are many other reasons to be nice to other value systems (in particular reasons that do not involve aliens and infinite worlds). What is special about MSR is mostly that it gives an argument for taking the value systems of other superrationalists into account maximally and without worries of getting exploited for being too forthcoming. With MSR, mutual cooperation is achieved by treating one’s own decision as a simulation/prediction for agents relevantly similar to oneself. Beyond this, there is no need to guess the reasoning of agents who are different. The updates one has to make based on MSR considerations are always symmetrical for one’s own actions and the actions of other parties. This mechanism makes it impossible to enter asymmetrical (cooperate-defect or defect-cooperate) outcomes.
(Note that the way MSR works does not guarantee direct reciprocity in terms of who benefits whom: I should not choose to benefit value system X in my part of the multiverse in the hope that advocates of value system X in particular will, in reverse, be nice to my values here or in other parts of the multiverse. Instead, I should simply benefit whichever value system I can benefit most, in the expectation that whichever agents can benefit my values the most – and possibly that turns out to be someone with value system X – will actually cooperate and benefit my values. To summarize, hoping to be helped by value system X for MSR-reasons does not necessarily mean that I should help value system X myself – it only implies that I should conscientiously follow MSR and help whoever benefits most from my resources.)
Before we can continue with the main body of explanation, I want to proactively point out that MSR is different from acausal trade, which has been discussed in the context of artificial superintelligences reasoning about each other’s decision procedures. There is a danger that people lump the two ideas together, because MSR does share some similarities with acausal trade (and can arguably be seen as a special case of it). Namely, both MSR and acausal trade are standardly discussed in a multiverse context and rely crucially on acausal decision theories. There are, however, several important differences: In the acausal trade scenario, two parties simulate each other’s decision procedures to prove that one’s own cooperation ensures cooperation in the other party. MSR, by contrast, does not involve reasoning about the decision procedures of parties different from oneself. In particular, MSR does not involve reasoning about whether a specific party’s decisions have a logical connection with one’s own decisions or not, i.e., whether the choices in a prisoner’s-dilemma-like situation can only result in symmetrical outcomes or not. MSR works through the simple mechanism that one’s own decision is assumed to already serve as the simulation/prediction for the reference class of agents with relevantly similar decision procedures.
MSR is therefore based mostly on looser assumptions than acausal trade, because it does not require having the technological capability to accurately simulate another party’s decision algorithm. There is one aspect in which MSR is based on stronger assumptions than acausal trade. Namely, MSR is based on the assumption that one’s own decision can function as a prediction/simulation for not just identical copies of oneself in a boring twin universe where everything plays out exactly the same way as in our universe, but also for an interesting spectrum of similar-but-not-completely-identical parts of the multiverse that include agents who reason the same way about their decisions as we do, but may not share our goals. This is far from a trivial assumption, and I strongly recommend doing some further thinking about this assumption. But if the assumption does go through, it has vast implications for not (just) the possibility of superintelligences trading with each other, but for a form of multiverse-wide cooperation that current-day humans could already engage in.
The line of reasoning employed in MSR is very similar to the reasoning employed in anthropic decision problems. For comparison, take the idea that there are numerous copies of ourselves across many ancestor simulations. If we thought this was the case, reasoning anthropically as though we control all our copies at once could, for certain decisions, change our prioritization: If my decision to reduce short-term suffering plays out the same way in millions of short-lived simulated versions of earth, where a focus on the far future cannot pay off, I have more reason to focus on short-term suffering than I thought.
MSR applies a similar kind of reasoning where we shift our thinking from being a single instance of something to thinking in terms of deciding for an entire class of agents. MSR is what follows when one extends/generalizes the anthropic slogan “Acting as though you are all your (subjectively identical) copies at once” to “Acting as though you are all copies of your (subjective probability distribution over your) decision algorithm at once.”
Rather than identifying solely with one’s subjective experiences and one’s goals/values, MSR also involves “identifying with” – on the level of predicting consequences relevant to one’s decision – one’s general decision algorithm. If the assumptions behind MSR are sound, then deciding not to change one’s actions based on MSR has to cause an update in one’s world model, an update about other agents in one’s reference class also not cooperating. So the underlying reasoning that motivates MSR is something that has to permeate our thinking about how to have an impact on the world, whether we decide to let it affect our decisions or not. MSR is a claim about what is rational to do given that our actions have an impact in a broader sense than we may initially think, spanning across all instances of one’s decision algorithm. It changes our EV calculations and may in some instances even flip the sign – net positive/negative – of certain interventions. Ignoring MSR is therefore not necessarily the default, “safe” option.
Once we start deliberating whether to account for the goals of other agents in the multiverse, we run into the problem that we have a very poor idea of what the multiverse looks like. The multiverse may contain all kinds of strange things, including worlds where physical constants are different from the ones in our universe, or worlds where highly improbable things keep happening for the same reason that, if you keep throwing an infinite number of fair coins, some of them somewhere will produce uncanny sequences like “always heads” or “always tails.”
Because it seems difficult and intractable to envision all the possible landscapes in different parts of the multiverse, what kind of agents we might find there, and how we can benefit the goals of these agents with our resources here, one might be tempted to dismiss MSR for being too impractical a consideration. However, I think this would be a premature dismissal. We may not know anything about strange corners of the multiverse, but we know at the very least how things are in our observable universe. As long as we cannot say anything substantial about how, specifically, the parts of the multiverse that are completely different from the things we know actually differ from our environment, we may as well ignore those parts. For practical purposes, we do not have to speculate about parts of the multiverse that would be completely alien to us (yay!), and can instead focus on what we already know from direct experience. After all, our world is likely to be representative of some other worlds in the multiverse. (This holds for the same reason that a randomly chosen television channel is more likely than not to be somewhat representative of some other television channels, rather than being completely unlike any other channel.) Therefore, we can be reasonably confident that out there somewhere, there are planets with an evolutionary history that, although different from ours in some ways, also produced intelligent observers who built a technologically advanced civilization. And while many of these civilizations may contain agents with value systems we have never thought about, some of these civilizations will also contain earth-like value systems.
In any case, it seems plausible that our comparative advantage lies in helping those value systems about which we can obtain the most information. If we survey the values of people on earth, and perhaps also how much these values correlate with sympathy for the concept of superrationality and for taking weird arguments to their logical conclusion, this already gives us highly useful information about the values of potential cooperators in the multiverse. MSR then implies strong cooperation with value systems that we already know (perhaps adjusted by the degree to which their proponents are receptive to MSR ideas).
By “strong cooperation,” I mean that one should ideally pick interventions based on considerations of personal comparative advantage: If there is a value system for which I could create an extraordinary amount of (variance-adjusted; see chapter 3 of this dissertation for an introduction) value given my talents and position in the world, I should perhaps focus exclusively on benefitting that value system. Meta-interventions that are positive for many value systems at once also receive a strong boost from MSR considerations and should plausibly be pursued with high effort even if they do not come out as the top priority absent MSR considerations. (Examples of such interventions include making sure that any superintelligent AIs that are built can cooperate with other AIs, or encouraging people who are uncertain about their values not to waste time on philosophy and instead try to benefit existing value systems MSR-style.) Finally, one should also look for more cooperative alternatives when considering interventions that, although positive for one’s own value system, may in expectation cause harm to other value systems.
The post S-risk FAQ appeared first on Center on Long-Term Risk.
In the essay Reducing Risks of Astronomical Suffering: A Neglected Priority, s-risks (also called suffering risks or risks of astronomical suffering) are defined as “events that would bring about suffering on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far”.
If you’re not yet familiar with the idea, you can find out more by watching Max Daniel's EAG Boston talk or by reading the introduction to s-risks.
In the future, it may become possible to run such complex simulations that the (artificial) individuals inside these simulations are sentient. Nick Bostrom coined the term mindcrime for the idea that the thought processes of a superintelligent AI might cause intrinsic moral harm if they contain (suffering) simulated persons. Since there are instrumental reasons to run many such simulations, this could lead to vast amounts of suffering. For example, an AI might use simulations to improve its knowledge of human psychology or to predict what humans would do in a conflict situation.
Other common examples include suffering subroutines and spreading wild animal suffering to other planets.
At first glance, one could get the impression that s-risks are just unfounded speculation. But to dismiss s-risks as unimportant (in expectation), one would have to be highly confident that their probability is negligible, which is hard to justify upon reflection. The introduction to s-risks gives several arguments why the probability is not negligible after all:
First, s-risks are disjunctive. They can materialize in any number of unrelated ways. Generally speaking, it’s hard to predict the future, and the range of scenarios that we can imagine is limited. It is therefore plausible that unforeseen scenarios – known as black swans – make up a significant fraction of s-risks. So even if any particular dystopian scenario we can conceive of is highly unlikely, the probability of some s-risk may still be non-negligible.
Second, while s-risks may seem speculative at first, all the underlying assumptions are plausible. [...]
Third, historical precedents do exist. Factory farming, for instance, is structurally similar to (incidental) s-risks, albeit smaller in scale. In general, humanity has a mixed track record regarding responsible use of new technologies, so we can hardly be certain that future technological risks will be handled with appropriate care and consideration.
Virtually everyone would agree that (involuntary) suffering should, all else equal, be avoided. In other words, ensuring that the future does not contain astronomical amounts of suffering is a common denominator of almost all (plausible) value systems.
Work on reducing s-risks is, therefore, a good candidate for compromise between different value systems. Instead of narrowly pursuing our own ethical views in potential conflict with others, we should work towards a future deemed favourable by many value systems.
Future generations will probably have more information about s-risks in general, including which ones are the most serious, which does give them the upper hand in finding effective interventions. One might, therefore, argue that later work has a significantly higher marginal impact. However, there are also arguments for working on s-risks now.
First, thinking about s-risks only once they start to materialize does not suffice, because by then it might be too late to do anything about them. Without sufficient foresight and caution, society may already be “locked in” to a trajectory that ultimately leads to a bad outcome.
Second, one main reason why future generations are in a better position is that they can draw on previous work. Earlier work – especially research or conceptual progress – can be effective in that it allows future generations to more effectively reduce s-risk.
Third, even if future generations are able to prevent s-risks, it’s not clear whether they will care enough to do so. We can work to ensure this by growing a movement of people who want to reduce s-risks. In this regard, we should expect earlier growth to be more valuable than later growth.
Fourth, if there’s a sufficient probability that smarter-than-human AI will be built in this century, it's possible that we already are in a unique position to influence the future. If it’s possible to work productively on AI safety now, then it should also be possible to reduce s-risks now.
Toby Ord’s essay The timing of labour aimed at reducing existential risk addresses the same question for efforts to reduce x-risks. He gives two additional reasons in favor of earlier work: namely, the possibility of changing course (which is more valuable if done early on) and the potential for self-improvement.
If you are (very) optimistic about the future, you might think that s-risks are unlikely for this reason (which is different from the objection that s-risks seem far-fetched). A common argument is that avoiding suffering will become easier with more advanced technology; since humans care at least a little bit about reducing suffering, there will be less suffering in the future.
While this argument has some merit, it’s not airtight. By default, when we humans encounter a problem in need of solving, we tend to implement the most economically efficient solution, often irrespective of whether it involves large amounts of suffering. Factory farming provides a good example of such a mismatch: faced with the problem of producing meat for millions of people as efficiently as possible, we implemented a solution that happened to involve an immense amount of nonhuman suffering.
Also, the future will likely contain vastly larger populations, especially if humans colonize space at some point. All else being equal, such an increase in population may also imply (vastly) more suffering. Even if the fraction of suffering decreases, it's not clear whether the absolute amount will be higher or lower.
If your primary goal is to reduce suffering, then your actions matter less if the future will 'automatically' be good (because such a future contains little or no suffering anyway). Given sufficient uncertainty, this is a reason to focus on the possibility of bad outcomes for precautionary reasons: in a world where s-risks are likely, we can have more impact.
Although the degree to which we are optimistic or pessimistic about the future is clearly relevant to how concerned we are about s-risks, one would need to be unusually optimistic about the future to rule out s-risks entirely.
From the introduction to s-risks:
Working on s-risks does not require a particularly pessimistic view of technological progress and the future trajectory of humanity. To be concerned about s-risks, it is sufficient to believe that the probability of a bad outcome is not negligible, which is consistent with believing that a utopian future free of suffering is also quite possible.
In other words, being concerned about s-risks does not require unusual beliefs about the future.
First, recall Nick Bostrom’s definition of x-risks:
Existential risk – One where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.
S-risks are defined as follows:
S-risks are events that would bring about suffering on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far.
According to these definitions, both x-risks and s-risks relate to shaping the long-term future, but reducing x-risks is about actualizing humanity’s potential, while reducing s-risks is about preventing bad outcomes.
There are two possible views on the question of whether s-risks are a subclass of x-risks.
According to one possible view, it’s conceivable to have astronomical amounts of suffering that do not lead to extinction or curtail humanity’s potential. We could even imagine that some forms of suffering (such as suffering subroutines) are instrumentally useful to human civilization. Hence, not all s-risks are also x-risks. In other words, some possible futures are both an x-risk and an s-risk (e.g. uncontrolled AI), some would be an x-risk but not an s-risk (e.g. an empty universe), some would be an s-risk but not an x-risk (e.g. suffering subroutines), and some are neither.
|             | S-risk: Yes           | S-risk: No     |
|-------------|-----------------------|----------------|
| X-risk: Yes | Uncontrolled AI       | Empty universe |
| X-risk: No  | Suffering subroutines | Utopian future |
The second view is that the meaning of “potential” depends on your values. For example, you might think that a cosmic future is only valuable if it does not contain (severe) suffering. If “potential” refers to the potential of a utopian future without suffering, then every s-risk is (by definition) an x-risk, too.
This depends on each of us making difficult ethical judgment calls. The answer depends on how much you care about reducing suffering versus increasing happiness, and how you would make tradeoffs between the two. (This also raises fundamental questions about how happiness and suffering can be measured and compared.)
Proponents of suffering-focused ethics argue that the reduction of suffering is of primary moral importance, and that additional happiness cannot easily counterbalance (severe) suffering. According to this perspective, preventing s-risks is morally most urgent.
Other value systems, such as classical utilitarianism or fun theory, emphasize the creation of happiness or other forms of positive value, and assert that the vast possibilities of a utopian future can outweigh s-risks. Although preventing s-risks is still valuable in this view, it is nevertheless considered even more important to ensure that humanity has a cosmic future at all by reducing extinction risks.
In addition to normative issues, the answer also depends on the empirical question of how much happiness and suffering the future will contain. David Althaus suggests that we consider both the normative suffering-to-happiness trade ratio (NSR), which measures how we would trade off suffering and happiness in theory, and the expected suffering-to-happiness ratio (ESR), which measures the (relative) amounts of suffering and happiness we expect in the future.
In this framework, those who emphasize happiness (low NSR) or are optimistic about the future (low ESR) will tend to focus on extinction risk reduction. If the product of NSR and ESR is high – either because of a normative emphasis on suffering (high NSR) or pessimistic views about the future (high ESR) – it’s more plausible to focus on s-risk-reduction instead.
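To make the heuristic concrete, here is a toy illustration in Python. The threshold of 1 and the example numbers are hypothetical choices of mine, not figures from Althaus's essay; the point is only to show how the two ratios combine.

```python
# Toy illustration of combining NSR and ESR. The threshold and the example
# numbers are hypothetical; they are not taken from Althaus's essay.

def prioritize(nsr: float, esr: float, threshold: float = 1.0) -> str:
    """nsr: units of happiness needed to outweigh one unit of suffering (normative).
    esr: expected units of future suffering per expected unit of future happiness.
    A high product suggests prioritizing s-risk reduction."""
    return "s-risk reduction" if nsr * esr > threshold else "extinction risk reduction"

# Optimistic about the future, even while weighting suffering heavily:
print(prioritize(nsr=10, esr=0.05))  # -> extinction risk reduction
# Same normative weights, but expecting as much suffering as happiness:
print(prioritize(nsr=10, esr=1.0))   # -> s-risk reduction
```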
Many s-risks, such as suffering subroutines or mindcrime, have to do with artificial minds or smarter-than-human AI. But the concept of s-risks is not conceptually dependent on the possibility of AI scenarios. For example, spreading wild animal suffering to other planets does not require artificial sentience or AI.
Examples often involve artificial sentience, however, due to the vast number of artificial beings that could be created if artificial sentience becomes feasible at any time in the future. Combined with humanity’s track record of insufficient moral concern for “voiceless” beings at our command, this might pose a particularly serious s-risk. (More details here.)
This question has been discussed extensively in the philosophy of mind. Many popular theories of consciousness, such as Global workspace theory, higher-order theories, or Integrated information theory, agree that artificial sentience is possible in principle. Philosopher Daniel Dennett puts it like this:
I’ve been arguing for years that, yes, in principle it’s possible for human consciousness to be realised in a machine. After all, that’s what we are. We’re robots made of robots made of robots. We’re incredibly complex, trillions of moving parts. But they’re all non-miraculous robotic parts.
As an example of the sort of reasoning involved, consider this intuitive thought experiment: if you were to take a sentient biological brain, and replace one neuron after another with a functionally equivalent computer chip, would it somehow make the brain less sentient? Would the brain still be sentient once all of its biological neurons have been replaced? If not, at what point would it cease to be sentient?
The debate is not settled yet, but it seems at least plausible that artificial sentience is possible in principle. Also, we don’t need to be certain to justify moral concern. It’s sufficient that we can't rule it out.
A simple first step is to join the discussion, e.g. in this Facebook group. If more people think and write about the topic (either independently or at EA organizations), we’ll make progress on the crucial question of how to best reduce s-risks. At the same time, it helps build a community that, in turn, can get even more people involved.
If you’re interested in doing serious research on s-risks right away, you could have a look at this list of open questions to find a suitable research topic. Work in AI policy and strategy is another interesting option, as progress in this area allows us to shape AI in a more fine-grained way, making it easier to identify and implement safety measures against s-risks.
Another possibility is to donate to organizations working on s-risk reduction. Currently, the Center on Long-Term Risk is the only group with an explicit focus on s-risks, but other groups also contribute to solving issues that are relevant for s-risk reduction. For example, the Machine Intelligence Research Institute aims to ensure that smarter-than-human artificial intelligence is aligned with human values, which probably also reduces s-risks. Charities that promote broad societal improvements such as better international cooperation or beneficial values may also contribute to s-risk reduction, albeit in a less targeted way.
The post Focus areas of worst-case AI safety appeared first on Center on Long-Term Risk.
Cross-posted from my website on s-risks.
Efforts to shape advanced artificial intelligence (AI) may be among the most promising altruistic endeavours. If the transition to advanced AI goes wrong, the worst outcomes may involve not only the end of human civilization, but also cosmically significant amounts of suffering – a so-called s-risk.1
In light of this, Lukas Gloor suggests that we complement existing AI alignment efforts with AI safety measures focused on preventing s-risks. For example, we might focus on implementing safety measures that would limit the extent of damage in case of a failure to align advanced AI with human values. This approach is called worst-case AI safety.
This post elaborates on possible focus areas for research on worst-case AI safety to support the (so far mostly theoretical) concept with more concrete ideas.
Many, if not all, of the suggestions may turn out to be infeasible. Uncertainty about the exact timeline, takeoff scenario, and architecture of advanced AI makes it hard to predict what measures are promising. Hence, the following ideas are not silver bullets; they are merely a first step towards understanding what worst-case AI safety could look like.
(Most of the ideas are not original, and many are simply variations of existing ideas. My claimed contribution is only to collect them, point out their relation to s-risk reduction, and outline what further research could focus on.)
In engineering, there are two common approaches to improving the reliability of safety-critical systems – redundancy and fail-safe designs. Put briefly, redundancy involves duplicating important components of a safety-critical system to serve as backups in case of primary component failures (e.g. backup power generators), while fail-safe designs are specific design features whose sole function is to limit the extent of damage in case of a particular type of failure (e.g. fire suppression systems).
Similarly, we could attempt to make AI safe by combining many different safety measures that kick in if the system fails. A complete and watertight solution to the perils of advanced AI remains elusive, but we might hope that at least one of many redundant measures will work, even if any particular one is flaky.
I’m not claiming that this is a good strategy for aligning an AI with human values. In fact, it probably isn’t. But the strategy may be more promising for worst-case AI safety in that its goal is merely to prevent the worst failure modes (i.e. those that lead to s-risks), rather than achieving a specific positive outcome such as CEV. In other words, the key question for worst-case AI safety is how to ensure that an AI system will not engage in certain behaviors that might cause large amounts of suffering. For example, if we constrain an AI system so that a specific troublesome behavior is prevented, the nearest unblocked strategy – although not necessarily closer to an aligned AI – will hopefully be less dangerous in terms of s-risks.
Here are some examples of possible safety measures:
The crux of the matter is whether the AI has an incentive to remove such constraints. Since safety measures interfere with utility maximization, the AI would by default want to eliminate them, e.g. by rewriting its own source code. We can place our bets on the AI’s inability to circumvent our safety measures, but not incentivizing it should be the first line of defense. We don’t yet have a rigorous solution to this problem, but it might be possible to build safety measures directly into the utility function or goal system, so that the AI does not have an incentive to remove them.
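As a purely illustrative sketch of what "building safety measures directly into the utility function" might mean, consider the toy code below. The predicates, penalty size, and dictionary keys are hypothetical placeholders of mine, not a proposal from the literature; it only illustrates the redundancy idea of combining several crude, independent checks inside the objective itself, so that removing them is not rewarded.

```python
# A minimal, purely illustrative sketch: s-risk-relevant checks are folded into the
# objective the system maximizes. All predicates, weights, and keys are hypothetical.

from typing import Callable, List

def penalized_utility(
    base_utility: Callable[[dict], float],
    s_risk_predicates: List[Callable[[dict], bool]],
    penalty: float = 1e9,
) -> Callable[[dict], float]:
    """Return an objective that subtracts a large penalty whenever any
    s-risk-relevant predicate flags the evaluated outcome."""
    def utility(outcome: dict) -> float:
        violations = sum(1 for check in s_risk_predicates if check(outcome))
        return base_utility(outcome) - penalty * violations
    return utility

# Redundancy: several independent (and individually unreliable) checks are combined,
# in the hope that at least one of them catches a catastrophic outcome.
checks = [
    lambda o: o.get("simulated_suffering_minds", 0) > 0,  # crude mindcrime proxy
    lambda o: o.get("uses_threats", False),               # crude extortion proxy
]
objective = penalized_utility(lambda o: o.get("paperclips", 0.0), checks)
print(objective({"paperclips": 100.0, "uses_threats": True}))  # heavily penalized
```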
A tripwire is a mechanism that shuts down or destroys an AI if it shows signs of dangerous activity.3 This would involve deliberately introducing easy-to-find and tempting opportunities (“honeypots”) that appear to let the AI increase its utility by doing something “forbidden”, but that actually trigger a shutdown or self-destruct mechanism if acted upon. The hope is that the AI either is safe, i.e. steers clear of all forbidden activity, or that it triggers the tripwire and destroys itself.
For worst-case AI safety, we should try to set up tripwires that are tailored toward preventing the behaviors most likely to result in s-risks. We can (in principle) combine as many tripwires as we want, but two kinds seem particularly promising:
This approach is rather flaky in that a truly superintelligent AI might be able to recognize any tripwire of human design, and if so, would probably find ways to circumvent them. Still, it’s worth looking into tripwires as one of several approaches.
Research on tripwires could analyze in more detail how we can best implement them, which scenarios we should use, and how scalable the approach is in the face of increasing AI capabilities.
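A minimal toy sketch of the honeypot idea follows, with hypothetical action names. As noted above, a truly superintelligent system might simply recognize such a mechanism, so this is an illustration of the concept rather than a workable defense.

```python
# Toy sketch: the environment exposes "forbidden" actions that appear profitable;
# selecting one triggers a shutdown. Action names and structure are hypothetical.

class TripwireShutdown(Exception):
    pass

class HoneypotEnvironment:
    def __init__(self, honeypot_actions):
        self.honeypot_actions = set(honeypot_actions)
        self.shut_down = False

    def step(self, action):
        if action in self.honeypot_actions:
            # The action appeared to offer high utility but is actually a tripwire.
            self.shut_down = True
            raise TripwireShutdown(f"forbidden action attempted: {action}")
        return self._ordinary_transition(action)

    def _ordinary_transition(self, action):
        return {"observation": "...", "reward": 0.0}

env = HoneypotEnvironment(honeypot_actions={"copy_self_to_internet", "acquire_extra_compute"})
try:
    env.step("acquire_extra_compute")
except TripwireShutdown as error:
    print("Tripwire triggered:", error)
```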
Paul Christiano suggests adversarial training (so-called “red teams”) to make AI systems more robust against catastrophic failure. Following this approach, we could use highly capable4 AI to search for input situations in which the (new) AI might cause an s-risk. If the search fails, that’s evidence that the system is safe. If the search succeeds, we can incorporate the results in the next iteration of training processes. We can hope that this procedure is more reliable than manually trying to come up with tests that cover any possible failure modes.
Similarly, if we run several instances of advanced AI systems, we could split them into “Team A”, whose goal is to come up with a plan or idea that convinces a human judge; and “Team B”, whose goal is to convince the human that the plan is flawed. In the context of worst-case AI safety, it would be especially valuable to have a third “Team C” that tries to convince the judge that realization of the plans would be an s-risk (and not just flawed in some other way).
Further research in this area could examine the practical feasibility and usefulness of the approach, come up with refined variations of the idea, or experiment with (simplified) practical implementations to improve our understanding.
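As a rough illustration of the red-team search described above (not an implementation of Christiano's actual proposal), one could imagine something like the following sketch, where the policy, the situation sampler, and the catastrophe classifier are all assumed to exist and are not defined here.

```python
# Toy sketch of the red-team search idea: look for situations in which a trained
# policy behaves in ways a (hypothetical) learned catastrophe classifier rates as
# s-risk-relevant, then feed those situations back into training.

def red_team_search(policy, s_risk_score, sample_situation, budget=10_000, threshold=0.9):
    """Randomly search for high-scoring failure cases within a fixed budget."""
    failures = []
    for _ in range(budget):
        situation = sample_situation()
        action = policy(situation)
        if s_risk_score(situation, action) > threshold:
            failures.append((situation, action))
    return failures

# Hypothetical usage:
# failures = red_team_search(policy, s_risk_score, sample_situation)
# if failures:
#     training_set.extend(failures)  # retrain against the discovered failure cases
# else:
#     pass                           # weak evidence of safety, as discussed above
```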
To implement such safety measures in real-world AI systems, we need to be able to detect suffering (or proxies thereof). This, in turn, would be easier if we had a technical definition of what we mean by “suffering”, rather than vague philosophical descriptions like “a conscious experience of subjectively negative valence”. (See Caspar Österheld’s paper on Formalizing Preference Utilitarianism in Physical World Models for an example of work on formalizing philosophical concepts.)
Even without conceptual progress, we could study the extent to which current machine learning algorithms can detect suffering (e.g. in images or text). For example, researchers at Cambridge recently designed an AI system to assess pain levels in sheep. (For more on this, see Training Neural Networks to Detect Suffering.)
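As a very small illustration of this kind of experiment, here is a sketch of a bag-of-words text classifier for "suffering present vs. absent". The example sentences and labels are invented; a serious attempt would obviously need a carefully constructed dataset and much stronger models.

```python
# Toy sketch: an ordinary scikit-learn text classifier trained to flag descriptions
# of suffering. The five sentences and their labels are made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "The animal was in visible pain and cried out for hours.",
    "The patient reported severe, unrelenting discomfort.",
    "The children played happily in the park all afternoon.",
    "The team celebrated their victory with a big dinner.",
    "She screamed in agony after the fall.",
]
labels = [1, 1, 0, 0, 1]  # 1 = suffering present, 0 = absent

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["He winced and moaned in pain."]))  # expected: [1]
```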
Similarly, we could work on formal descriptions of classes of behaviors that are particularly dangerous. For instance, extortion could potentially cause a lot of suffering, so it would be valuable to have a better grasp of how to distinguish it from positive-sum trade. A rigorous definition of extortion remains elusive, and progress on this problem would make it easier to implement safety measures.
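To illustrate one naive way such a formalization attempt could start, the sketch below classifies an offer by comparing the target's payoffs with a no-interaction baseline: if rejecting the offer leaves the target worse off than never having interacted, the offer is treated as extortion. This is my own toy criterion and it certainly fails in many cases; it is meant only to show the shape of the problem.

```python
# Toy formalization attempt, not a rigorous definition of extortion.

def classify_offer(baseline, accept, reject):
    """baseline, accept, reject: (proposer_payoff, target_payoff) tuples for
    no interaction, accepting the deal, and rejecting the deal respectively."""
    _, target_baseline = baseline
    _, target_accept = accept
    _, target_reject = reject
    if target_reject < target_baseline:
        return "extortion"            # rejection leaves the target worse off than no contact
    if target_accept >= target_baseline:
        return "positive-sum trade"   # the target gains relative to no interaction
    return "unattractive offer"

print(classify_offer(baseline=(0, 0), accept=(5, 3), reject=(0, 0)))     # -> positive-sum trade
print(classify_offer(baseline=(0, 0), accept=(5, -1), reject=(0, -10)))  # -> extortion
```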
Simple goal systems could also be useful in test environments. If something goes wrong and an AI unexpectedly “escapes” from the test environment, then we can’t reasonably hope that the AI would pursue human values. We can, however, try to make sure that the AI does not create astronomical suffering in such an event.
The crux, of course, is that the test environment also needs to be sufficiently complex to test the capabilities of advanced AI systems. Research on benign test environments could examine in more detail which failure modes are most likely to lead to s-risks, and which goal system and test environment can avoid these failure modes.
We could also try to sidestep the most delicate aspects of superintelligent reasoning – assuming that’s even possible – when we test (the first) advanced AI systems. For example, modeling other minds is particularly dangerous because of both mindcrime and extortion, so we can try to set up the test environment in such a way that an unsafe AI would not model other minds.
I am indebted to Max Daniel, Brian Tomasik, Johannes Treutlein, Adrian Rorheim and Caspar Österheld for valuable comments and suggestions.
The post A reply to Thomas Metzinger’s BAAN thought experiment appeared first on Center on Long-Term Risk.
This is a reply to Metzinger’s essay on Benevolent Artificial Anti-natalism (BAAN), which appeared on EDGE.org (7.8.2017).
Metzinger invites us to consider a hypothetical scenario where smarter-than-human artificial intelligence (AI) is built with the goal of assisting us with ethical deliberation. Being superior to us in its understanding of how our own minds function, the envisioned AI could come to a deeper understanding of our values than we may be able to arrive at ourselves. Metzinger has us envision that this artificial super-ethicist comes to conclude that biological existence – at least in its current form – is bound to contain more disvalue than value, and that the benevolent stance is to not bring more minds like us into existence (anti-natalism). Suffering from what Metzinger labels ‘existence bias,’ the vast majority of people in the scenario would not accept the AI's conclusions.
Metzinger emphasizes that he is not making a prediction about the future, nor does he necessarily claim that the BAAN-scenario is at all realistic. His idea is meant as a cognitive tool, to illustrate that a position such as anti-natalism, which most ethicists are currently reluctant to even engage with, could conceivably turn out to be correct (in some relevant sense) as inferred by an intellect of superior ethical expertise that is unaffected by human-specific biases.
The BAAN thought experiment invites us to look at human existence from a more distanced, more impartial perspective, wondering which types of biases could cloud our ethical judgment. This approach is very much in line with what many ethicists have been attempting to do. A recent example is The Point of View of the Universe by Katarzyna de Lazari-Radek and Peter Singer, who suggest that there might be a strong symmetry between suffering and happiness (hedonism). For my part, I think there are many reasons why one might conclude that reducing suffering is more important than creating happiness, and what Metzinger labels ‘existence bias’ seems to me like a concept with substantial merit. Having said that, a perspective that solely considers experiential well-being – suffering or happiness – but not a person’s life goals or the things from which we derive meaning, strikes me as missing a very important component of the situation.
If we had an artificial intelligence with the goal of helping us answer ethical questions, what would it come to think? Here is where I suspect that Metzinger’s setup in the BAAN thought experiment is underdetermined and thus begging the question: I strongly suspect that there is no uniquely correct way to teach an AI to solve (human) ‘ethics.’ To use Brian Tomasik’s words, expecting AI to solve ethics for us is like expecting math to solve ethics for us: It's always garbage in, garbage out. In order to program an AI to have ethical goals, as opposed to any other type of goal, the human creators would already need to have a detailed understanding of what we mean by ‘ethics’ (or ‘altruism’), precisely. There is no guarantee that what we mean is something well-considered or well-specified, or that it even comes down to the same exact thing from person to person. Words get their meaning from how we use them, and if our usage is underspecified, it is also underspecified how we could go about making it more precise. In order to work around these obstacles, one could try to build an AI that learns ethical goals from examples. However, this suffers from the same type of underspecification, as it then becomes relevant how the examples are picked and labelled, and whether the AI in question will try to infer people’s revealed preferences (in which case it would, in virtue of the process itself, never conclude that humans suffer from any biases) or whether it would go with some kind of extrapolated preferences (what we would wish more principled, non-hypocritical and informed versions of ourselves to do). In the latter case, there are many distinct ways to resolve inconsistencies within different modules of a person’s brain, and so value extrapolation is underspecified as well.
All of the above implies that while it does seem conceivable that a smarter-than-human artificial intelligence with a wisely chosen goal could someday assist us in our ethical inquiries, we would first have to make substantial philosophical progress on our own. Before the AI-assisted search for our true values can get off the ground, we would have to make some tough judgment calls that determine the precise nature of the AI’s role as an ethical advisor.
The BAAN-scenario Metzinger asks us to consider, where the allegedly benevolent AI comes up with an ethical stance that most humans would likely want to fight against, is only logically conceivable if the initial recipe installed into the AI – what its creators meant by ‘assisting us with doing ethics’ – allowed for an outcome where the AI ends up choosing a stance with which, when it presents its result to humanity, the vast majority of people vehemently disagree. Perhaps we would be willing to accept this setup if the AI in question were eventually able to persuade us – without trickery, only through intellectual arguments – that the ethical reasoning it went through is truly in line with what we care about upon reflection. However, in a situation where the AI’s conclusions go strongly against our own ethical views in such a way that the AI could not persuade us that it knows our values better than we do, would we accept that we are simply too short-sighted – thereby placing greater trust in having set up the AI’s notion of doing ethics in the right way – or would we place greater trust in our own beliefs about which outcomes we deem unacceptable? Someone who holds the latter view could argue that the setup in the BAAN thought experiment does not count as a genuine inquiry into what humans value, but only as one interpretation of what we somehow ‘should’ value, based on (implicit or explicit) unilateral judgment calls made by the AI’s creators.
On the one hand, we want our ethical inquiries to stay open to the possibility that our current views suffer from important blind spots, just as has been the case historically many times over. On the other hand, under the moral anti-realist assumption that the direction of moral progress is not (completely) ‘fixed,’ we also do not want to end up in a situation where open-ended moral inquiry unexpectedly comes up with so-called progress that no longer has anything to do with our current goals, or with our current concepts of ethics and altruism. The art lies in setting just the right amount of guidance. I take Metzinger’s thought experiment to be asking us the following: If we were offered the chance to extrapolate our own values with the help of a benevolent superintelligence that would do for us exactly what we intend it to do, would we give it instructions that allow for the possibility of a BAAN-like outcome, where the AI knows us and our values better than we do? Or would we veto this outcome a priori, thereby determining that any preference for e.g. continued existence over suffering can never be accepted by us as a bias, but is always an axiom we are not willing to question?
While the BAAN-scenario is best left as a thought experiment only – as intended by Metzinger – the idea that smarter-than-human artificial intelligence could assist us in our ethical inquiries seems intriguing, and, if done wisely, promising as an altruistic and cooperative vision for AI development. A collaborative project to build AI could perhaps be set up in a way where it becomes possible to narrow down one’s moral uncertainty, resolve moral disagreements, or (in case certain disagreements are bound to persist) determine a compromise for different value systems with maximal gains from trade. The task of value alignment for smarter-than-human AI is hard enough on its own, and does not become easier when zero-sum (or negative-sum) competition intensifies. Rooting for an AI that helps with only our own, idiosyncratic conception of how to do ethics is ill-advised. Instead, one should advocate for a cooperative solution to value alignment that helps everyone to get a better sense of their own goals, and then implements a compromise goal that gives everyone close to the perfect outcome.
Very roughly, an altruistic vision for a post-AI future that takes into account that reducing suffering is altruistically extremely important should satisfy the following criteria:
Point 1 most probably rules out universal anti-natalism. It is, however, compatible with Metzinger’s “Scenario 2” – AI technologically abolishing suffering in us or our descendants – provided that a future filled with flourishing posthuman beings is implemented in a way rich enough in detail and uncontroversial enough to be regarded as (mostly) utopian from a maximally broad range of perspectives.
Point 2 is also satisfied in Metzinger’s Scenario 2. I deliberately wrote “comparatively little” suffering instead of “no suffering” because perfection here is the enemy of the good: Complaining about small traces of suffering in utopia is like a spoiled kid complaining at their birthday party that the new car they got has an ugly color, while the day before, they had an accident with their old car and got very lucky to not have lasting damage for the rest of their life. Utopia-like outcomes, even if they may contain serious suffering for some beings, are much, much better than futures with astronomical amounts of suffering. Therefore, the comparison should be made with what is otherwise in the range of likely outcomes, and not with whatever perfect outcome one can conceptualize. If some people’s conception of a utopian future e.g. strongly calls for being able to go rock climbing and experience occasional glimpses of fear from falling, or some suffering from muscle aches, it would be extremely uncooperative for people with suffering-focused values to object to that. Having said that, we should be wary of wishful thinking and recognize that near-optimal outcomes may be very unlikely, and that most of the value for suffering reducers may be gained with point 3:
Point 3 refers to the problem that utopias may be fragile, and that we should choose a path where small mistakes do not lead to a situation that is much worse than if no one had even deliberately tried to achieve an excellent outcome.
Paul Christiano’s writeup on indirect normativity
This paper by Nick Bostrom, Allan Dafoe and Carrick Flynn from the Oxford-based Future of Humanity Institute
The post Multiverse-wide Cooperation via Correlated Decision Making appeared first on Center on Long-Term Risk.
Some decision theorists argue that when playing a prisoner's dilemma-type game against a sufficiently similar opponent, we should cooperate to make it more likely that our opponent also cooperates. This idea, which Douglas Hofstadter calls superrationality, has strong implications when combined with the insight from modern physics that we live in a large universe or multiverse of some sort. If we care about what happens in civilizations located elsewhere in the multiverse, we can superrationally cooperate with some of their inhabitants. That is, if we take their values into account, this makes it more likely that they do the same for us. In this paper, I attempt to assess the practical implications of this idea. I argue that to reap the full gains from trade, everyone should maximize the same impartially weighted sum of the utility functions of all collaborators. I also argue that we can obtain at least weak evidence about the content of these utility functions. In practice, the application of superrationality implies that we should promote causal cooperation, moral pluralism, moral reflection, and ensure that our descendants, who will be smarter and thus better at finding out how to benefit other superrationalists in the universe, engage in superrational cooperation.
Read the full text here.
The post The future of growth: near-zero growth rates appeared first on Center on Long-Term Risk.
In biology, this pattern of exponential growth that eventually tapers off is found in everything from the development of individual bodies — for instance, in the growth of humans, which levels off in the late teenage years — to population sizes.
One may of course be skeptical that this general trend will also apply to the growth of our technology and economy at large, as innovation seems to continually postpone our clash with the ceiling, yet it seems inescapable that it must. For in light of what we know about physics, we can conclude that exponential growth of the kinds we see today, in technology in particular and in our economy more generally, must come to an end, and do so relatively soon.
One reason we can make this assertion is that there are theoretical limits to computation. As physicist Seth Lloyd’s calculations show, a continuation of Moore’s law — in its most general formulation: “the amount of information that computers are capable of processing and the rate at which they process it doubles every two years” — would imply that we hit the theoretical limits of computation within 250 years:
If, as seems highly unlikely, it is possible to extrapolate the exponential progress of Moore's law into the future, then it will only take two hundred and fifty years to make up the forty orders of magnitude in performance between current computers that perform 1010 operations per second on 1010 bits and our one kilogram ultimate laptop that performs 1051 operations per second on 1031 bits.
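A quick back-of-the-envelope check of that figure, assuming a clean doubling of performance every two years; the result depends on exactly which quantities one counts, but it lands in the same ballpark as Lloyd's 250 years.

```python
# Rough check of the quoted figure: how long does closing a 40-order-of-magnitude
# gap take at one doubling every two years? The inputs are the round numbers above.

import math

orders_of_magnitude = 40        # gap between current computers and the "ultimate laptop"
doubling_time_years = 2

doublings_needed = orders_of_magnitude / math.log10(2)   # about 133 doublings
years_needed = doublings_needed * doubling_time_years    # about 266 years

print(round(doublings_needed), round(years_needed))      # -> 133 266
```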
Similarly, physicists Lawrence Krauss and Glenn Starkman have calculated that, even if we factor in colonization of space at the speed of light, this doubling of processing power cannot continue for more than 600 years in any civilization:
Our estimate for the total information processing capability of any system in our Universe implies an ultimate limit on the processing capability of any system in the future, independent of its physical manifestation and implies that Moore’s Law cannot continue unabated for more than 600 years for any technological civilization.
In a more recent lecture and a subsequent interview, Krauss said that the absolute limit for the continuation of Moore’s law, in our case, would be reached in less than 400 years (the discrepancy — between the numbers 400 and 600 — is at least in part because Moore’s law, in its most general formulation, has played out for more than a century in our civilization at this point). And, as both Krauss and Lloyd have stressed, these are ultimate theoretical limits, resting on assumptions that are unlikely to be met in practice, such as expansion at the speed of light. How long Moore’s law can actually continue, given both engineering and economic constraints, is likely significantly less. Indeed, we are already close to approaching the physical limits of the paradigm that Moore’s law has been riding on for more than 50 years — silicon transistors, the only paradigm that Gordon Moore was talking about originally — and it is not clear whether other paradigms will be able to take over and keep the trend going.
Physicist Tom Murphy has calculated a similar limit for the growth of the energy consumption of our civilization. Based on the observation that the energy consumption of the United States has increased fairly consistently with an average annual growth rate of 2.9 percent over the last 350-odd years (although the growth rate appears to have slowed down in recent times and has remained stably below 2.9 percent since c. 1980), Murphy proceeds to derive the limits for the continuation of similar energy growth. He does this, however, by assuming an annual growth rate of “only” 2.3 percent, which conveniently results in an increase of the total energy consumption by a factor of ten every 100 years. If we assume that we will continue expanding our energy use at this rate by covering Earth with solar panels, this would, on Murphy’s calculations, imply that we will have to cover all of Earth’s land with solar panels in less than 350 years, and all of Earth, including the oceans, in 400 years.
Beyond that, assuming that we could capture all of the energy from the sun by surrounding it in solar panels, the 2.3 percent growth rate would come to an end within 1,350 years from now. And if we go further out still, to capture the energy emitted from all the stars in our galaxy, we get that this growth rate must hit the ceiling and become near-zero within 2,500 years (of course, the limit of the physically possible must be hit earlier, indeed more than 500 years earlier, as we cannot traverse our 100,000 light year-wide Milky Way in only 2,500 years).
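These timescales are easy to reproduce approximately. The sketch below assumes a present world power consumption of roughly 18 TW and round figures for the power targets; the inputs are rough, so the outputs are too.

```python
# Rough reproduction of Murphy's timescales under stated, rounded assumptions:
# ~18 TW of present world power use, a fixed 2.3 percent annual growth rate, and
# approximate wattages for sunlight on Earth, the Sun, and ~100 billion Sun-like stars.

import math

current_power = 1.8e13          # watts, assumed present world consumption
growth_rate = 0.023             # 2.3 percent per year

targets = {
    "sunlight intercepted by Earth": 1.7e17,   # watts
    "total solar output": 3.8e26,              # watts
    "100 billion Sun-like stars": 3.8e37,      # watts
}

for name, power in targets.items():
    years = math.log(power / current_power) / math.log(1 + growth_rate)
    print(f"{name}: ~{years:.0f} years")
# Roughly 400, 1,350, and 2,500 years, matching the figures in the text.
```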
One may suggest that alternative sources of energy might change this analysis significantly, yet, as Murphy notes, this does not seem to be the case:
Some readers may be bothered by the foregoing focus on solar/stellar energy. If we’re dreaming big, let’s forget the wimpy solar energy constraints and adopt fusion. The abundance of deuterium in ordinary water would allow us to have a seemingly inexhaustible source of energy right here on Earth. We won’t go into a detailed analysis of this path, because we don’t have to. The merciless growth illustrated above means that in 1400 years from now, any source of energy we harness would have to outshine the sun.
Essentially, keeping up the annual growth rate of 2.3 percent by harnessing energy from matter not found in stars would force us to make such matter hotter than stars themselves. We would have to create new stars of sorts, and, even if we assume that the energy required to create such stars is less than the energy gained, such an endeavor would quickly run into limits as well. For according to one estimate, the total mass of the Milky Way, including dark matter, is only 20 times greater than the mass of its stars. Assuming a 5:1 ratio of dark matter to ordinary matter, this implies that there is only about 3.3 times as much ordinary non-stellar matter as there is stellar matter in our galaxy. Thus, even if we could convert all this matter into stars without spending any energy and harvest the resulting energy, this would only give us about 50 years more of keeping up with the annual growth rate of 2.3 percent.1
Similar conclusions as the ones drawn above for computation and energy also seem to follow from calculations of a more economic nature. For, as economist Robin Hanson has argued, projecting present economic growth rates into the future also leads to a clash against fundamental limits:
Today we have about ten billion people with an average income about twenty times subsistence level, and the world economy doubles roughly every fifteen years. If that growth rate continued for ten thousand years[,] the total growth factor would be 10200.
There are roughly 1057 atoms in our solar system, and about 1070 atoms in our galaxy, which holds most of the mass within a million light years. So even if we had access to all the matter within a million light years, to grow by a factor of 10200, each atom would on average have to support an economy equivalent to 10140 people at today’s standard of living, or one person with a standard of living 10140 times higher, or some mix of these.
Indeed, current growth rates would “only” have to continue for three thousand years before each atom in our galaxy would have to support an economy equivalent to a single person living at today’s living standard, which already seems rather implausible (not least because we can only access a tiny fraction of “all the matter within a million light years” in three thousand years). Hanson does not, however, expect the current growth rate to remain constant, but instead, based on the history of growth rates, expects a new growth mode where the world economy doubles within 15 days rather than 15 years:
If a new growth transition were to be similar to the last few, in terms of the number of doublings and the increase in the growth rate, then the remarkable consistency in the previous transitions allows a remarkably precise prediction. A new growth mode should arise sometime within about the next seven industry mode doublings (i.e., the next seventy years) and give a new wealth doubling time of between seven and sixteen days.
And given this more than a hundred times greater growth rate, the net growth that would take 10,000 years to accomplish given our current growth rate (cf. Hanson’s calculation above) would now take less than a century to reach, while growth otherwise requiring 3,000 years would require less than 30 years. So if Hanson is right, and we will see such a shift within the next seventy years, what seems to follow is that we will reach the limits of economic growth, or at least reach near-zero growth rates, within a century or two. Such a projection is also consistent with the physically derived limits of the continuation of Moore’s law; not that economic growth and Moore’s law are remotely the same, yet they are no doubt closely connected: economic growth is largely powered by technological progress, of which Moore’s law has been a considerable subset in recent times.
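For what it's worth, Hanson's figures check out arithmetically. The short calculation below uses the round numbers from the quote (ten billion people, a doubling every fifteen years, and about 10^70 atoms in the galaxy).

```python
# Checking the arithmetic behind the quoted figures.

import math

doubling_time = 15                      # years per doubling at roughly current growth rates
doublings_10k = 10_000 / doubling_time
print(f"growth factor over 10,000 years: about 10^{doublings_10k * math.log10(2):.0f}")
# -> roughly the 10^200 from the quote

people_today = 1e10                     # "about ten billion people"
atoms_in_galaxy = 1e70                  # "about 10^70 atoms in our galaxy"
economy_after_3k_years = people_today * 2 ** (3_000 / doubling_time)
print(f"person-equivalents per atom after 3,000 years: {economy_after_3k_years / atoms_in_galaxy:.1f}")
# -> about 1.6, i.e. roughly one person at today's living standard per atom
```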
The conclusion we reach by projecting past growth trends in computing power, energy, and the economy is the same: our current growth rates cannot go on forever. In fact, they will have to decline to near-zero levels very soon on a cosmic timescale. Given the physical limits to computation, and hence, ultimately, to economic growth, we can conclude that we must be close to the point where peak relative growth in our economy and our ability to process information occurs — that is, the point where this growth rate is the highest in the entire history of our civilization, past and future.
This is not, however, to say that this point of maximum relative growth necessarily lies in the future. Indeed, in light of the declining economic growth rates we have seen over the last few decades, it cannot be ruled out that we are now already past the point of “peak economic growth” in the history of our civilization, with the highest growth rates having occurred around 1960-1980, cf. these declining growth rates and this essay by physicist Theodore Modis. This is not to say that we most likely are, yet it seems that the probability that we are is non-trivial.
A relevant data point here is that the global economy has seen three doublings since 1965, when the annual growth rate was around six percent, and yet the annual growth rate today — around 3.5 percent — is only a little over half of what it was those three doublings ago, and has remained stably below it. In the entire history of economic growth, this seems unprecedented, suggesting that we may already be on the other side of the highest growth rates we will ever see. For up until this point, three doublings of the economy have, rare fluctuations aside, led to an increase in the annual growth rate.
And this “past peak growth” hypothesis looks even stronger if we look at 1955, with a growth rate of a little less than six percent and a world product of 5,430 billion 1990 U.S. dollars, which, doubled four times, gives just under 87,000 billion — about where we should expect today’s world product to be. Yet throughout the history of our economic development, four doublings have meant a clear increase in the annual growth rate, at least in terms of the underlying trend; not a stable decrease of almost 50 percent. To me, this suggests that maintaining more than, say, a 90 percent probability that we will see greater annual growth rates in the future is overconfident.2
If we assume a model of the growth of the global economy where the annual growth rate is roughly symmetrical around the time the growth rate was at its global maximum, and then assume that this global maximum occurred around 1965, this means that we should expect the annual growth rate three doublings earlier, c. 1900, to be the same as the annual growth rate three doublings later, c. 2012. What do we observe? Three doublings earlier it was around 2.5 percent, while it was around 3.5 percent three doublings later, at least according to one source (although other sources actually do put the number at around 2.5 percent). Not a clear match, nor a clear falsification.
Yet if we look at the growth rates of advanced economies around 2012, we find that the growth rate is actually significantly lower than 2.5 percent, namely 1.2-2.0 percent. And given that less developed economies are expected to grow significantly faster than more developed ones, as the more advanced economies have paved the way and made high-hanging fruits more accessible, the (already not so big) 2.5 vs. 3.5 percent mismatch could be due to this gradually diminishing catch-up effect. Indeed, if we compare advanced economies today with advanced economies c. 1900, we find that the growth rate was significantly higher back then,3 suggesting that the symmetrical model may in fact overestimate current and future growth if we look only at advanced economies.4
That peak growth lies in the past may also be true of technological progress in particular, or at least many forms of technological progress, including the progress in computing power tracked by Moore’s law, where the growth rate appears to have been highest around 1990-2005, and to since have been in decline, cf. this article and the first graphs found here and here. Similarly, various sources of data and proxies tracking the number of scientific articles published and references cited over time also suggest that we could be past peak growth in science as well, at least in many fields when evaluated based on such metrics, with peak growth seeming to have been reached around 2000-2010.
Yet again, these numbers — those tracking economic, technological, and scientific progress — are of course closely connected, as growth in each of these respects contributes to, and is even part of, growth in the others. Indeed, one study found the doubling time of the total number of scientific articles in recent decades to be 15 years, corresponding to an annual growth rate of 4.7 percent, strikingly similar to the growth rate of the global economy in recent decades. Thus, declining growth rates both in our economy, technology, and science cannot be considered wholly independent sources of evidence that growth rates are now declining for good. We can by no means rule out that growth rates might increase in all these areas in the future — although, as we saw above with respect to the limits of Moore’s law and economic progress, such an increase, if it is going to happen, must be imminent if current growth rates remain relatively stable.
The economic “peak growth” discussed above relates to relative growth, not absolute growth. These are worth distinguishing. For in terms of absolute growth, annual growth is significantly higher today than it was in the 1960s, where the greatest relative growth to date occurred. The global economy grew with about half a trillion 1990 US dollars each year in the sixties, whereas it grows with about two trillion now. So in this absolute sense, we are seeing significantly more growth today than we did 50 years ago, although we now have significantly lower growth rates.
If we assume the model with symmetric growth rates mentioned above and make a simple extrapolation based on it, what follows is that our time is also a special one when it comes to absolute annual growth. The picture we get is the following (based on an estimate of past growth rates from economic historian J. Bradford DeLong):
| Year | World GDP (in trillions) | Annual growth rate (%) | Absolute annual growth (in trillions) |
|------|--------------------------|------------------------|---------------------------------------|
| 920  | 0.032 | 0.13 | 0.00004 |
| 1540 | 0.065 | 0.25 | 0.0002 |
| 1750 | 0.13  | 0.5  | 0.0007 |
| 1830 | 0.27  | 1    | 0.003 |
| 1875 | 0.55  | 1.8  | 0.01 |
| 1900 | 1.1   | 2.5  | 0.03 |
| 1931 | 2.3   | 3.8  | 0.09 |
| 1952 | 4.6   | 4.9  | 0.2 |
| 1965 | 9.1   | 5.9  | 0.5 |
| 1980 | 18    | 4.4  | 0.8 |
| 1997 | 36    | 4.0  | 1.4 |
| 2012 | 72    | 3.5  | 2.1 |

Predicted values given roughly symmetric growth rates around 1965 (mirroring growth rates above):

| Year | World GDP (in trillions) | Annual growth rate (%) | Absolute annual growth (in trillions) |
|------|--------------------------|------------------------|---------------------------------------|
| 2037 | 144  | 1.8  | 2.6 |
| 2082 | 288  | 1    | 2.9 |
| 2162 | 576  | 0.5  | 2.9 |
| 2372 | 1152 | 0.25 | 2.9 |
| 2992 | 2304 | 0.13 | 3.0 |
We see that the absolute annual growth in GDP seems to follow an s-curve with an inflection point right about today: the period from 1997 to 2012 saw the biggest jump in absolute annual growth over a single doubling ever, an increase of 0.7 trillion, from 1.4 to 2.1.
It is worth noting that economist Robert Gordon predicts similar growth rates as the model above over the next few decades, as do various other estimates of the future of economic growth by economists. In contrast, engineer Paul Daugherty and economist Mark Purdy predict higher growth rates due to the effects of AI on the economy, yet the annual growth rates they predict in 2035 are still only around three percent for most of the developed economies they looked at, roughly at the same level as the current growth rate of the global economy. On a related note, economist William Nordhaus has attempted to make an economic analysis of whether we are approaching an economic singularity, in which he concludes, based on various growth models, that we do not appear to be, although he does not rule out that an economic singularity, i.e. significantly faster economic growth, might happen eventually.
How might it be relevant that we may be past peak economic growth at this point? Could it mean that our expectations for the future are likely to be biased? Looking back toward the 1960s might be instructive in this regard. For when we look at our economic history up until the 1960s, it is not so strange that people made many unrealistic predictions about the future around this period. Not only might it have appeared natural to expect the high growth rate at the time to remain constant into the future, which would have led to today’s global GDP being more than twice what it is; it might also have seemed reasonable to predict that growth rates would keep rising even further. After all, that was what they had been doing consistently up until that point, so why should it not continue in the following decades, resulting in flying cars and conversing robots by the year 2000? Such expectations were not that unreasonable given the preceding economic trends.
The question is whether we might be similarly overoptimistic about future economic progress today given recent, possibly unique, growth trends, specifically the unprecedented increase in absolute annual growth that we have seen over the past two decades — cf. the increase of 0.7 trillion mentioned above. The same may apply to the trends in scientific and technological progress cited above, where peak growth in many areas appears to have happened in the period 1990-2010, meaning that we could now be at a point where we are disposed to being overoptimistic about further progress.
Yet, again, it is highly uncertain at this point whether growth rates, of the economy in general and of progress in technology and science in particular, will increase again in the future. Future economic growth may not conform well to the model with roughly symmetric growth rates around the 1960s, although the model certainly deserves some weight. All we can say for sure is that growth rates must become near-zero relatively soon. What the path toward that point will look like remains an open question. We could well be in the midst of a temporary decline in growth rates that will be followed by growth rates significantly greater than those of the 1960s, cf. the new growth mode envisioned by Robin Hanson.5
Applying the mediocrity principle, we should not expect to live in a highly unusual time. Yet, in light of the facts about the ultimate limits to growth seen above, it is clear that we do: we are living during the childhood of civilization, where there is still rapid growth, at the pace of doublings within a couple of decades. If civilization persists with similar growth rates, it will soon become a grown-up with near-zero relative growth. And it will then look back at our time — today plus or minus a couple of centuries, most likely — as the one where growth rates were by far the highest in its entire history, which may span more than a trillion years.
It seems that a few things follow from this. First, more than just being the time where growth rates are the highest, this may also, for that very reason, be the time where individuals can influence the future of civilization more than any other time. In other words, this may be the time where the outcome of the future is most sensitive to small changes, as it seems plausible, although far from clear, that small changes in the trajectory of civilization are most significant when growth rates are highest. An apt analogy might be a psychedelic balloon with fluctuating patterns on its surface, where the fluctuations that happen to occur when we blow up the balloon will then also be blown up and leave their mark in a way that fluctuations occurring before and after this critical growth period will not (just like quantum fluctuations in the early universe got blown up during cosmic expansion, and thereby in large part determined the grosser structure of the universe today). Similarly, it seems much more difficult to cause changes across all of civilization when it spans countless star systems compared to today.
That being said, it is not obvious that small changes — in our actions, say — are more significant in this period where growth rates are many orders of magnitude higher than in any other time. It could also be that such changes are more consequential when the absolute growth is the highest. Or perhaps when it is smallest, at least as we go backwards in time, as there were far fewer people back when growth rates were orders of magnitude lower than today, and hence any given individual comprised a much greater fraction of all individuals than an individual does today.
Still, we may well find ourselves in a period where we are uniquely positioned to make irreversible changes that will echo down throughout the entire future of civilization.6 To the extent that we are, this should arguably lead us to update toward trying to influence the far future rather than the near future. More than that, if it does hold true that the time where the greatest growth rates occur is indeed the time where small changes are most consequential, this suggests that we should increase our credence in the simulation hypothesis. For if realistic sentient simulations of the past become feasible at some point, the period where the future trajectory of civilization seems the most up for grabs would seem an especially relevant one to simulate and learn more about. However, one can also argue that the sheer historical uniqueness of our current growth rates alone, regardless of whether this is a time where the fate of our civilization is especially volatile, should lead us to increase this credence, as such uniqueness may make it a more interesting time to simulate, and because being in a special time in general should lead us to increase our credence in the simulation hypothesis (see for instance this talk for a case for why being in a special time makes the simulation hypothesis more likely).7
On the other hand, one could also argue that imminent near-zero growth rates, along with the weak indications that we may now be past peak growth in many respects, provide a reason to lower our credence in the simulation hypothesis, as these observations suggest that the ceiling for what will be feasible in the future may be lower than we naively expect in light of today’s high growth rates. And thus, one could argue, it should make us more skeptical of the central premise of the simulation hypothesis: that there will be (many) ancestor simulations in the future. To me, the consideration in favor of increased credence seems stronger, although it does not significantly move my overall credence in the hypothesis, as there are countless other factors to consider.8
Caspar Oesterheld pointed out to me that it might be worth meditating on how confident we can be in these conclusions given that apparently solid predictions concerning the ultimate limits to growth have been made before, yet quite a few of these turned out to be wrong. Should we not be open to the possibility that the same might be true of (at least some of) the limits we reviewed in the beginning of this essay?
One crucial difference to note is that those failed predictions were based on a set of assumptions — e.g. about the amount of natural resources and food that would be available — that seem far more questionable than the assumption underlying the physics-based predictions we have reviewed here: namely, that our apparently well-established physical laws and measurements are indeed valid, or at least roughly so. The epistemic status of this assumption seems a lot more solid, to put it mildly. This is not to say, however, that we should not maintain some degree of doubt as to whether the assumption is correct (I would argue that we always should). It just seems that this degree of doubt should be quite low.
Yet, to continue the analogy above, what went wrong with the aforementioned predictions was not so much that limits did not exist, but rather that humans found ways of circumventing them through innovation. Could the same perhaps be the case here? Could we perhaps some day find ways of deriving energy from dark energy or some other yet unknown source, even though physicists seem skeptical? Or could we, as Ray Kurzweil speculates, access more matter and energy by finding ways of travelling faster than light, or by finding ways of accessing other parts of our notional multiverse? Might we even become able to create entirely new ones? Or to eventually rewrite the laws of nature as we please? (Perhaps by manipulating our notional simulators?) Again, I do not think any of these possibilities can be ruled out completely. Indeed, some physicists argue that the creation of new pocket universes might be possible, not in spite of “known” physical principles (or rather theories that most physicists seem to believe, such as inflationary theory), but as a consequence of them. However, it is not clear that anything from our world would be able to expand into, or derive anything from, the newly created worlds on any of these models (which of course does not mean that we should not worry about the emergence of such worlds, or the fate of other “worlds” that we perhaps could access).
All in all, the speculative possibilities raised above seem unlikely, yet they cannot be ruled out for sure. The limits we have reviewed here thus represent a best estimate given our current, admittedly incomplete, understanding of the universe in which we find ourselves, not an absolute guarantee. However, it should be noted that this uncertainty cuts both ways, in that the estimates we have reviewed could also overestimate the limits to various forms of growth by countless orders of magnitude.
Less speculatively, I think, one can also question the validity of our considerations about the limits of economic progress. I argued that it seems implausible that, in three thousand years, we could have an economy so big that each atom in our galaxy would have to support an economy equivalent to a single person living at today’s living standard. Yet could one not argue that the size of the economy need not depend on matter in this direct way, and that it might instead depend on the possible representations that can be instantiated in matter? If economic value could be mediated by the possible permutations of matter, the argument about each atom having to support an entire economy might not have the force it appears to have. For instance, there are far more legal positions on a Go board than there are atoms in the visible universe, and that’s just legal positions on a Go board. Perhaps we need to be more careful when thinking about how atoms might be able to create and represent economic value?
It seems like there is a decent point here. Still, I think economic growth at current rates is doomed. First, it seems reasonable to be highly skeptical of the notion that mere potential states could have any real economic value. Today at least, what we value and pay for is not such “permutation potential”, but the actual state of things, which is as true of the digital realm as of the physical. We buy and stream digital files such as songs and movies because of the actual states of these files, while their potential states mean nothing to us. And even when we invest in something we think has great potential, like a start-up, the value we expect to be realized is still ultimately one that derives from its actual state, namely the actual state we hope it will assume; not its number of theoretically possible permutations.
It is not clear why this would change, or how it could. After all, the number of ways one can put all the atoms in the galaxy together is the same today as it will be ten thousand years from now. Organizing all these atoms into a single galactic supercomputer would only seem to increase the value of their actual state.
Second, economic growth still seems tightly constrained by the shackles of physical limitations. For it seems inescapable that economies, of any kind, are ultimately dependent on the transfer of resources, whether these take the form of information or concrete atoms. And such transfers require access to energy, the growth of which we know to be constrained, as is true of the growth of our ability to process information. As these underlying resources that constitute the lifeblood of any economy stop growing, it seems unlikely that the economy can avoid this fate as well. (Tom Murphy touches on similar questions in his analysis of the limits to economic growth.)
Again, we of course cannot exclude that something crucial might be missing from these considerations. Yet the conclusion that economic growth rates will decline to near-zero levels relatively soon, on a cosmic timescale at least, still seems a safe bet in my view.
I would like to thank Brian Tomasik, Caspar Oesterheld, Duncan Wilson, Kaj Sotala, Lukas Gloor, Magnus Dam, Max Daniel, and Tobias Baumann for valuable comments and inputs.
1. One may wonder whether there might not be more efficient ways to derive energy from the non-stellar matter in our galaxy than to convert it into stars as we know them. I don’t know, yet a friend of mine who does research in plasma physics and fusion says that he does not think one could, especially if we, as we have done here, disregard the energy required to clump the dispersed matter together so as to “build” the star, a process that may well take more energy than the star can eventually deliver.
The aforementioned paper by Lawrence Krauss and Glenn Starkman also contains much information about the limits of energy use, and in fact uses accessible energy as the limiting factor that bounds the amount of information processing any (local) civilization could do (they assume that the energy that is harvested is beamed back to a "central observer").
2. And I suspect many people who have read about “singularity”-related ideas are overconfident, perhaps in part due to the comforting narrative and self-assured style of Ray Kurzweil, and perhaps due to wishful thinking about technological progress more generally.
3. According to one textbook, “Outside the European world, per capita incomes stayed virtually constant from 1700 to about 1950 […]”, implying that the global growth rate in 1900 was raised by the most developed economies, which must thus have had a growth rate greater than 2.5 percent.
4. A big problem with this model is that it is already pretty much falsified by the data, at least when it comes to exact, as opposed to approximate, symmetry. For given symmetry in the growth rates around 1965, the time it takes for three doublings to occur should be the same in either direction, whereas the data show that this is not the case — 65 years minus 47 years equals 18 years, which is roughly one doubling time. One may be able to correct this discrepancy a tiny bit by moving the year of peak growth a bit further back, yet this cannot save the model. This lack of exact symmetry should reduce our credence in the symmetric model as a description of the underlying pattern of our economic growth, yet I do not think it fully discredits it. Rough symmetry still seems a decent first approximation to past growth rates, and deviations may in part be explainable by factors such as the high, yet relatively fast diminishing, contribution to growth from developing economies.
5. It should be noted, though, that Hanson by no means rules out that such a growth mode may never occur, and that we might already be past, or in the midst of, peak economic growth: “[…] it is certainly possible that the economy is approaching fundamental limits to economic growth rates or levels, so that no faster modes are possible […]”
6. The degree to which there is sensitivity to changes of course varies between different endeavors. For instance, natural science seems more convergent than moral philosophy, and thus its development is arguably less sensitive to the particular ideas of individuals working on it than the development of moral philosophy is.
7. One may then argue that this should lead us to update toward focusing more on the near future. This may be true. Yet should we update more toward focusing on the far future given our ostensibly unique position to influence it? Or should we update more toward focusing on the near future given increased credence in the simulation hypothesis? (Provided that we indeed do increase this credence, cf. the counter-consideration above.) In short, it mostly depends on the specific probabilities we assign to these possibilities. I myself happen to think the far future should dominate, as I assign the simulation hypothesis (as commonly conceived) a very small probability.
8. For instance, fundamental epistemological issues concerning how much one can infer based on impressions from a simulated world (which may only be your single mind) about a simulating one (e.g. do notions such as “time” and “memory” correspond to anything, or even make sense, in such a “world”?); the fact that the past cannot be simulated realistically, since we can only have incomplete information about a given physical state in the past (not only because we have no way to uncover all the relevant information, but also because we cannot possibly represent it all, even if we somehow could access it — for instance, we cannot faithfully represent the state of every atom in our solar system at any point in the past, as this would require too much information), and a simulation of the past that contains incomplete information would depart radically from how the actual past unfolded, as all of it has a non-negligible causal impact (even single photons, which, it appears, are detectable by the human eye), and this is especially true given that the vast majority of information would have to be excluded (both due to practical constraints on what can be recovered and what can be represented); whether conscious minds can exist on different levels of abstraction; etc.
The post The future of growth: near-zero growth rates appeared first on Center on Long-Term Risk.
The post Uncertainty smooths out differences in impact appeared first on Center on Long-Term Risk.
Suppose you investigated two interventions A and B and came up with estimates for how much impact A and B will have. Your best guess1 is that A will spare a billion sentient beings from suffering, while B “only” spares a thousand beings. Now, should you actually believe that A is many orders of magnitude more effective than B?
We can hardly give a definitive answer to the question without further context, but we can point to several generic reasons to be sceptical. The optimizer’s curse is one of them. GiveWell has also written about why we can’t take such estimates literally.2 In this post, I’ll consider another potential heuristic to reject claims of astronomical differences in impact.
Roughly speaking, the idea is that uncertainty tends to smooth out differences in impact. Given sufficient uncertainty, you cannot confidently rule out that B’s impact is of comparable (or even larger) magnitude. If you have 10% credence that B somehow also affects a billion individuals, that suffices to reduce the difference in expected impact to one order of magnitude.3
Interestingly, this result doesn’t depend strongly on how much the two interventions naively seem to differ. If your best guess is that A affects 10^50 individuals and B affects only 1000, but you have 10% credence that B has the same astronomical impact, then the expected impact still differs by only one order of magnitude.
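To make the arithmetic behind these claims explicit, here is a minimal worked example using the toy numbers above (and assuming, for simplicity, that A’s estimate is treated as certain):

$$\mathbb{E}[\text{impact of } B] = 0.1 \cdot 10^{9} + 0.9 \cdot 10^{3} \approx 10^{8}, \qquad \frac{\mathbb{E}[\text{impact of } A]}{\mathbb{E}[\text{impact of } B]} \approx \frac{10^{9}}{10^{8}} = 10.$$

Replacing 10^9 with 10^50 changes nothing essential: the expected impact of B becomes roughly 0.1 · 10^50 = 10^49, so the two expected impacts again differ by only a factor of ten.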
Of course, the crux of the argument is the assumption of equal magnitudes, that is, having non-negligible credence that B is comparably impactful. Why would we believe that?
One possible answer is that we’re uncertain or confused about many fundamental questions. The following list is just the tip of the iceberg:
Clearly, we have an incomplete understanding of the very fabric of reality, and this will not change in the foreseeable future. Now, claiming that something is many orders of magnitude more effective requires – roughly speaking – 99% confidence (or even more) that none of the above could flip the conclusion. That sets a high bar.
One might argue that the argument misses the point in that it focuses on B having an unusually small impact compared to A, rather than A having an unusually big impact.4 To see this, we only need to tweak the framing of the toy example. Suppose that intervention B affects 1000 individuals, and we’re uncertain whether intervention A affects 1000 or 10^50 individuals. Then A dominates by many orders of magnitude in expectation as long as we have non-negligible credence that A affects 10^50 individuals.5
This is a reasonable objection, but it only works if we are certain that B can’t somehow affect the astronomical number of beings, too. This raises the question of how we can be certain about that. We can also point to big-picture uncertainties (like action correlations and huge numbers of simulations) with the specific implication that apparently small impacts can be astronomically larger than they seem.
We can apply this idea not just when comparing interventions, but also when comparing the scope of different cause areas or the impact we can have in different future scenarios. Even more abstractly, we can consider our impact conditional on competing hypotheses about the world.
For example, it is sometimes argued that we should assume that artificial superintelligence will be developed even if we think it's unlikely, because we can have a vastly larger impact in that case. I think this argument has some merit, but I don’t think the difference encompasses several orders of magnitude. This is because we can conceive of ways in which our decisions in non-AI scenarios may have similarly high impact – and even though this is not very likely, it suffices to reduce the difference in expected impact. (More details here.)
Another special case is comparing the impact of charities. Brian Tomasik has compiled a number of convincing reasons why charities don’t differ astronomically in cost-effectiveness, including uncertainty about the flow-through effects of charities.
As another example, effective altruists often argue that the number of beings in the far future dwarfs the present generation. I think the gist of the argument is correct, but our impact on the far future is not obviously many orders of magnitude more important in expectation. (See here for my own thoughts on the issue.)
As with any idea on this level of abstraction, we need to be careful about what it does and does not imply.
First, the argument does not imply that astronomical differences in impact never exist. The map is not the territory. In other words, the impact may differ by many orders of magnitude in the territory, but our uncertainty smooths out these differences in the map.
Second, the idea is a heuristic, not an airtight proof. I think it may work for a relatively broad class of interventions (or charities, hypotheses, etc.), but it may not work if you compare working on AI safety with playing video games. (Unless you’re in a solipsist simulation or you’re a Boltzmann brain.)
Third, the expected impact of an intervention or charity can be close to zero if we’re uncertain whether it reduces or increases suffering, or because positive and negative effects cancel each other out. In that case, a robustly positive intervention can be many orders of magnitude more effective – but this is just because we divide by something close to zero.
Fourth, we may be uncertain about the hypothesis in question, but justifiably confident about the differences in impact, which means that the argument doesn’t apply. For example, if we live in a multiverse with many copies of us, we can clearly have (vastly) more impact than if we exist in one universe only.
Finally, I’d like to clarify that I consider factual rather than moral uncertainty in this post. The idea may be applicable to the latter, too – see e.g. this comment by Paul Christiano and Brian Tomasik's piece on the two envelopes problem – but it depends on how exactly we reason about moral uncertainty.
Suppose we adopt a heuristic of being sceptical about claims of astronomical differences in impact, either based on this post or based on Brian Tomasik’s empirical arguments for why charities don't differ astronomically in cost-effectiveness. What does that imply for our prioritization?
First, we can use it to justify scepticism about Pascalian reasoning. At the very least, you should require strong evidence for the claim that an intervention, charity, scenario, or hypothesis should dominate our decisions because of astronomical differences in impact. On the other hand, we should be careful to not dismiss such arguments altogether. If we have non-negligible credence in both hypotheses A and B – say, more than 10% – then an impact difference of an order of magnitude in expectation suffices to justify acting as if the higher-impact hypothesis is true.
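One way to see why an order of magnitude is roughly the relevant threshold here, sketched with my own illustrative numbers and under the simplifying assumption that acting on a hypothesis only pays off if that hypothesis is true: with 10% credence in A, 90% credence in B, and an impact conditional on A that is ten times the impact x conditional on B, we get

$$0.1 \cdot 10x = x \;\ge\; 0.9 \cdot x,$$

so acting as if A were true is, just barely, at least as good in expectation as acting as if B were true.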
Second, the heuristic may also reduce the value of prioritization research, which is to some extent based on the belief that cause areas differ by many orders of magnitude. If we don’t believe that, then a larger fraction of altruistic endeavors is valuable. This, in turn, means that practical considerations like comparative advantages or disadvantages tip the balance more often than abstract considerations.
That said, I don’t think a strong version of this argument works. A difference of 10 times is still massive in practical terms and suffices to make prioritization research worthwhile.6
I would like to thank Max Daniel, Caspar Österheld, and Brian Tomasik for their valuable feedback and comments on the first draft of this piece.
The post Uncertainty smooths out differences in impact appeared first on Center on Long-Term Risk.
The post Tranquilism appeared first on Center on Long-Term Risk.
In this paper, I introduce a way of thinking about well-being that lends support to the view that reducing suffering takes moral priority over promoting happiness. Our article on suffering-focused ethics lists several other positions that could inspire this conclusion, so agreeing with tranquilism – as the view I introduce is called – is not needed to agree with suffering-focused ethics. Tranquilism is not meant as a standalone moral theory, but as a way to think about well-being and the value of different experiences. Tranquilism can then serve as a building block for more complex moral views where things other than experiences also matter morally.
Disclaimer: This is a draft of a paper I plan to submit to a peer-reviewed philosophy journal. Based on feedback from Simon Knutsson, I have concluded that the current version can still be improved for this purpose, but I wanted to upload this version already so it can be read and discussed.
I assume that an individual’s experiences can be finally (i.e., non-instrumentally) good or bad for her. There are different theories about which experiences are finally good or bad for an individual. One such theory is hedonism, which holds that pleasant experience or pleasure is what is good for an individual, and that unpleasant experience or pain is bad. This article proposes and defends an alternative theory of the value of different experiences: I call it Tranquilism. It is inspired by the Buddhist1 idea that not only pleasure, but also tranquility or contentment are amongst the best experiences.2 My theory also has roots in Epicurus’s description of the goal of a happy life as “imperturbability of the soul” (ataraxia).
In this paper, I give these ancient ideas a more precise and extensive formulation (section 2), so they can more easily be discussed in contemporary normative contexts. This is followed by arguments and intuitions in support of tranquilism. Section 3 discusses non-conscious states (unconsciousness, death, non-existence), and section 4 addresses possible objections to tranquilism. Finally, section 5 concludes with some remarks on how tranquilism could fit into discussions of population ethics.
My aim with this paper is to introduce and define tranquilism as a position that deserves further study. As a theory limited to the evaluation of experienced well-being, tranquilism is compatible with pluralistic moral views where things other than experiences – for instance the accomplishment of preferences and life goals – can be (dis)valuable too.
It has been suggested that we can group different experiences on a spectrum from bad (or unpleasant) to good (or pleasant), with a neutral part of the range in the middle. Let us call hedonism the view that such a spectrum represents the value of different experiences. According to hedonism, pleasurable experiences are valuable not because we desire them, but because they are good (and thus desirable).3
Proponents of tranquilism, however, reject this interpretation. While it is true that we can rank how much we do or do not desire to have different experiences, or rank experiences according to how pleasurable they are, it is not obvious that such a ranking accurately expresses the way we value different experiences. Tranquilism is based on an alternative conception of value, where what matters is not to maximize desirable experiences, but to reach a state free of desire.
Instead of a scale that goes from negative through neutral to positive, tranquilism’s value scale is homogeneous, ranging from optimal states of consciousness to (increasingly severe degrees of) non-optimal states. Tranquilism tracks the subjectively experienced need for change. If all is good in a moment, the experience is considered perfect. If, instead, an experience comes with a craving for change, this is considered disvaluable and worth preventing.4 Absence of pleasure is not in itself deplorable according to tranquilism – it only constitutes a problem if there is an unmet need for pleasure.
Tranquilism states that an individual’s experiential moment5 is as good as it can be for her if and only if she has no craving for change.
A craving in the tranquilist sense is a consciously experienced need to change something about the current experience. Section 2.2 below will present positive as well as negative examples for what qualifies as a craving and will distinguish two ways cravings may arise.
While tranquilism will seem counterintuitive to some people, it can hardly be said to be inherently counterintuitive. Tranquilism is based on the Buddhist perspective on suffering and happiness, and similar “absence of desire” theories are also found in Hinduism6 and the writings of Epicurus. This suggests that if we had grown up familiar with these positions, as hundreds of millions of humans have, something along the lines of tranquilism could well be the natural way we think about well-being.
In the context of everyday life, there are almost always things that ever so slightly bother us. Uncomfortable pressure in one’s shoes, thirst, hunger, headaches, boredom, itches, non-effortless work, worries, longing for better times. When our brain is flooded with pleasure, we temporarily become unaware of all the negative ingredients of our stream of consciousness, and they thus cease to exist. Pleasure is the typical way in which our minds experience temporary freedom from suffering. This may contribute to the view that pleasure is the symmetrical counterpart to suffering, and that pleasure is in itself valuable and important to bring about. However, there are also (contingently rare) mental states devoid of anything bothersome that are not commonly described as (intensely) pleasurable, examples being flow states or states of meditative tranquility. Felt from the inside, tranquility is perfect in that it is untroubled by any aversive components, untroubled by any cravings for more pleasure. Likewise, a state of flow as it may be experienced during stimulating work, when listening to music or when playing video games, where tasks are being completed on auto-pilot with time flying and us having a low sense of self, also has this same quality of being experienced as completely problem-free.7 Such states – let us call them states of contentment – may not commonly be described as (intensely) pleasurable, but following philosophical traditions in both Buddhism and Epicureanism, these states, too, deserve to be considered states of happiness.
Whether meditative tranquility or flow are called hedonically neutral or not is a matter of interpretation. To most people, they may feel pleasurable in a way, or very positive somehow, though perhaps in a different sense than e.g. orgasms feel positive. This makes perfect sense according to tranquilism, where there is no neutral range for experiences to begin with and where conscious states completely free of cravings should thus elicit very positive associations when we think of them. The important difference between tranquilism and hedonism is whether all of these states free of cravings are equally positive, or whether they form a scale of increasing value, with some such experiences being distinctly worse than others.8
According to hedonism, where the optimal state corresponds to the highest possible pleasure, states of contentment would fall short: they would be judged (heavily) suboptimal because there could be richer states of pleasure in their place. By contrast, tranquilism proposes that all of these states are in a relevant sense flawless – different in their flavor from intense pleasure, yet equally perfect with respect to the immediate evaluation that occurs internally. While pleasure has great instrumental value according to tranquilism as one way to ensure that everyday conscious experience is problem-free, maximizing pleasure does not constitute an end in itself.
According to tranquilism, a state of consciousness is negative or disvaluable if and only if it contains a craving for change. This section will outline what qualifies as such a craving and why cravings, and not anything else, capture what makes suffering bad for individuals.9 To prevent misunderstandings, I will also explain how cravings relate to preferences (they are very different) and how they relate to desires (“activated preferences”). In short, cravings can be thought of as need-based, visceral desires, which are different from (purely) reflection-based desires.
A craving is a conscious need to change something about one’s current experience. Cravings can come about in two ways: The first way cravings may arise is when our attention is being directed towards some desired experience. For instance, a smoker may experience a craving for the rush that comes from smoking a cigarette; or someone in the heat of a summer day may crave the sensation of having a refreshing drink.
The second way cravings may arise is from attention directed at the current experience, when one notices something as unwanted or bothersome. This subjective judgment is then expressed by cravings to change something or end the experience. Cravings of this latter type concern the removal of (certain components of) the current experience. They range from mild cases of disturbance (e.g. when one is bothered by uncomfortable shoes in an otherwise comfortable situation) to cases of extreme suffering, where one may wish for the entire experience to go away regardless of the costs. The stronger the craving, the more disvaluable it is.
According to tranquilism, cravings are what make an experience bad for an individual. Interestingly enough, this view implies that there could be pain that is not disvaluable, i.e., pain without any craving for it to end. For instance, in the phenomenon of pain asymbolia, subjects report feeling the sensation of pain but, startlingly, are not bothered by it.10 Similarly, Buddhist teachings suggest that some people are capable of detaching themselves from their pain or somehow ending their identification with it, thus “ending the suffering without ending the pain.”11 If one agrees that there would be no moral reason to prevent pain in cases where it comes without the distinct craving to have it end, then these fascinating conditions – should they prove to work the way they are described – lend intuitive support to tranquilism. Tranquilism says that what is bad about suffering is the craving. (See this endnote12 for more points in support of this view.)
To prevent confusion, it is important to stress that cravings are not the same as preferences. Preferences are abstract constructs that are thought to be present for an agent at any time (even during unconsciousness). Preferences represent our goals, the hypothetical choices someone would make if presented with all possible options. Cravings on the other hand may be absent even when the individual in question is conscious. And in contrast to preferences or goals, where we may sometimes be subjectively uncertain as to whether they are in an achieved state or not, there is never an open question as to whether one’s cravings are satisfied or not. Cravings are always dissatisfied, as they express a (futile) need for the current experience to be different. For the experiential moment with the original craving, it does not make a difference whether the craving is lost because a suitable distraction is found in the next moment (for instance one that involves a different kind of pleasure than we actually craved), or whether the desired state of pleasure is actually instantiated the next moment, creating a fulfilled craving.
This inspires the tranquilist perspective that the particular content of a craving – the target state we longingly envision ourselves to be in – is not actually what matters and what needs to be achieved. A craving may arise when I am imagining how nice it would be to spend a day at the beach instead of at work, but if, in the very next minute, I begin an enjoyable conversation with my co-worker that makes me forget these thoughts, the fact that I'm not actually at the beach does not bother me – it is not a problem for me nor anyone else in that moment. According to tranquilism, cravings are bad for us not because the specific need they represent is unfulfilled, but because there is a need in the first place. That is to say because the current experience is not accepted as perfectly fine the way it is.
So cravings are not preferences because – among other reasons – preferences do not have to be consciously experienced, whereas cravings are conscious by definition. What about desires? Desires can be thought of as “activated” or “conscious” preferences, as “the states of consciousness that can motivate our intentional, goal-directed actions.” The terminology surrounding this issue is complicated, and it should be noted that people sometimes treat desires and preferences as synonymous and then distinguish between occurrent desires and standing desires. In my terminology, all occurrent desires are desires, and all standing desires are preferences.
Cravings are a subtype of desires: they are need-based, visceral desires that we often cannot help but develop, whether or not doing so is good for us and our goals. But not all desires are need-based desires: a desire to do something can also be based on forward-looking reflection about what is (instrumentally or intrinsically) good for our goals. Let us call these reflection-based desires. To have a reflection-based desire means to want something not because of a craving, but because it helps us fulfil our preferences or goals (e.g. wanting to eat healthy or wanting to get a university degree). Cravings may not always win in competition with our reflection-based desires, but withstanding cravings always requires mental effort. Because we are prone to developing cravings towards all sorts of things, it can often be difficult to act on our goals in the way we would want to.
To summarize, cravings are need-based desires inspired by the Buddhist view of suffering and happiness. They are what tranquilism classifies as disvaluable or worth preventing. They are directed either towards an envisioned (pleasurable) state, or towards getting rid of (certain components of) the current state. Cravings of both types can be poetically pictured as arrows of volition urgently (and involuntarily, in the sense of not being based on reflection or deliberate endorsement) pointing away from the current experience.
According to the distinction introduced above, intentional actions result from two different motivational systems. Either we act on reflection-based desires (“reflection-based reasons for action”) or we act on need-based desires (“need-based reasons for action”). Tranquilism focuses on the need-based component of our motivation, and is inspired by the observation that pleasure does not seem central to need-based motivation in the same way (or on the same level) as suffering does.
Let us first discuss reflection-based reasons for action. Reflection-based motivation is not so much about our internal conscious experience, but more about updating world models we have formed and attaching value or disvalue to certain outcomes. We do not desire to believe that the world is going well according to our goals; we want it to actually go well.14 This gives us reflection-based reasons to keep our impulses in check so we can efficiently pursue our goals. Experiencing as much pleasure as possible may well be what many people upon reflection desire for their life, but other possible goals include adventures, accomplishments, helping others and many other things whose nature depends – at least to some degree – on a person’s temperament, beliefs and past experiences. It is plausible that our disposition towards finding things pleasurable plays an important descriptive role in how we develop our goals, but as a moral anti-realist,15 I think it would be wrong to presume that everyone “has reason”16 to reflectively desire the same goal, let alone the specific goal of personal hedonism. And even if most humans will end up valuing pleasure for its own sake, some of us may set different goals (and neither party would be making any sort of mistake).
By contrast, while one may be able to affect whether need-based desires arise at all, once they do arise, there is no choice about the issue and we cannot help but be affected by them. Next to the preference architecture that forms our goals, we have a visceral and primitive (model-free)17 motivational system that operates through cravings. This system forms our need-based motivation and often produces impulsive behavior. Pleasure does play an important role in how cravings arise, as it seems that cravings are indeed (usually) triggered when we think about pleasure or are confronted with stimuli associated with pleasure. But the need-based reasons that motivate our pursuit of pleasure, moment by moment, are not based on the intrinsic desirability of future pleasures acting at a distance, but on the same process that makes up our aversion to (both physical and psychological) pain. According to tranquilism, cravings to get rid of pain and cravings to experience pleasure are bad for the same reason, because they represent non-acceptance of the current experience. This perspective is supported by the fact that we are just as tempted to end cravings for pleasure by actually attaining the pleasure in question as by distracting ourselves with some other kind of pleasure (or even just sleep as a form of non-consciousness). For instance, if I have a craving for the state of being drunk but believe that this is not a good direction to take to accomplish my goals, I may instead engage in socializing to distract myself from my craving for alcohol. Similarly, I may try to distract myself from a headache, from back pain, or from ruminating thoughts about life’s problems by watching Netflix or going to bed early in the hope of being able to quickly fall asleep. All these acts have the same motivation: all need-based reasons for action are cravings, and they all constitute suffering.
Cravings are famously near-sighted. Rather than being about maximizing long-term well-being in a sophisticated manner, cravings are about immediate gratification and choosing the path of least resistance. We want to reach states of pleasure because as long as we are feeling well, nothing needs to change. However, our need-based motivational system is rigged to make us feel like things need to change and get better even when, in an absolute sense, things may be going reasonably well. We quickly adapt to the stimuli that produce pleasure. As Thomas Metzinger puts it, “Suffering is a new causal force, because it motivates organisms and continuously drives them forward.”18 It is not pleasure that moves us; deep down and insofar as the need-based reasons for actions are concerned, it is always suffering. The way tranquilism looks at it, part of our brain is a short-sighted “moment egoist” with the desire to move from states with a lot of suffering to closely adjacent states with less suffering.
To end this discussion with a concrete example, consider the following thought experiment. Suppose it is three o’clock in the morning, we lie cozily in bed, half-asleep in a room neither too cold nor too hot, not thirsty and not feeling obligated to get up anytime soon. Suppose we now learn that there is an opportunity nearby for us to experience the most intense pleasure we have ever experienced. The catch is that in order to get there, we first have to leave the comfortable blankets and walk through the cold for a minute. Furthermore, after two hours of this pleasure, we will go back to sleep and, upon waking up again, are stipulated to have no memories left of the nightly adventure. Do we take the deal? It is possible for us to pursue this opportunity out of reflection-based motivation, if we feel as though we have a self-imposed duty to go for it, or if it simply is part of our goal to experience a lot of pleasure over our lifetime. It is also possible for us to pursue this opportunity out of need-based motivation, if we start to imagine what it might be like and develop cravings for it. Finally, it also – and here is where tranquilism seems fundamentally different from hedonism – seems not just possible, but perfectly fine and acceptable, to remain in bed content with the situation as it is. If staying in bed is a perfectly comfortable experience, the default for us will be to stay. This only changes in the case that we hold a preference for experiencing pleasure, remember or activate it and thus form a reflection-based desire, or if staying in bed starts to become less comfortable as a result of any cravings for pleasure we develop.
To conclude, an account inspired by hedonism might suggest that pleasure is intrinsically desirable, and that this either automatically makes us desire pleasure (perhaps also in a reflection-based sense), or that something would have to be wrong with us if we did not desire pleasure.19 However, tranquilism paints a different picture, emphasizing that our need-based pursuit of pleasure is always motivated by cravings, while preferences for pleasure seem to be contingent. It seems perfectly plausible that people can upon reflection decide to not want to pursue pleasure as a central goal in their life (or not want to pursue reducing suffering) without thereby committing any kind of mistake.
This section explains why, if one accepts tranquilism, all forms of non-consciousness should plausibly be thought of as just as valuable as craving-free experiences, say, (intense) pleasure or complete meditative peace of mind.
Theories of the value of experiences, such as tranquilism and hedonism, face the question of the value of states that are not experiences, such as dreamless sleep or unconsciousness. As Tännsjö (1996) points out, it has been natural for classical hedonists to equate the value of hedonically neutral experiences with that of non-consciousness – but this is ultimately a separate normative stipulation.20
Tranquilism lends support to the Epicurean position on death and non-existence. Epicurus is known for his “hedonism” (arguably a misnomer in this case), the position that all that is good or bad lies in conscious experience. He is also famous for an argument for why death cannot constitute an evil:21
“This, the most horrifying of evils, means nothing to us, then, because so long as we are existent death is not present and whenever it is present we are nonexistent. Thus it is of no concern either to the living or those who have completed their lives. For the former it is nonexistent, and the latter are themselves nonexistent.”
Critics object that Epicurus’s argument for why death cannot be of moral concern to us ignores that things can be bad for someone, not because they are in themselves negative, but because they deprive a person of positive experiences.22 This deprivationist critique appears strong if Epicurus is interpreted within hedonist axiology, for if the value of a happy person’s life is determined by the summed up pleasures minus all the pains, this sum is indeed lower if her life is cut short prematurely. However, under the assumption of a view like tranquilism (which actually seems to capture the way Epicurus was thinking about happiness and suffering very closely),23 the Epicurean position becomes perfectly consistent. Given a tranquilist conception of the relative value of different experiences, there are no experiences that are good in themselves, no experiences whose absence could constitute a form of disvalue.
This has interesting implications for non-experienced states of affairs, i.e., any states of non-consciousness. If something is only regarded as problematic if it is experienced as such, there lies no problem in non-consciousness. While the formulation of tranquilism (cf. section 2) contains no direct statement about the way to assess the value of non-consciousness, a straightforward extension implies the Epicurean conclusion that non-consciousness, too, is among the best states of affairs.
In this section, I consider some common objections to the tranquilist account of the value of different experiences and provide my replies to them.
Tranquilism is based on the position that cravings play the central role in need-based motivation. Can we come up with similar reasons why happiness – defined loosely as “positive feelings” – also plays a central role in motivation, on equal footing with cravings so to speak? One could for instance argue that, if tranquilism says that suffering is bad for us because it corresponds to a first-person evaluation of an experience(-component) needing to end or change, perhaps happiness might correspond to a first-person evaluation of an experience as something to continue or intensify. And if cravings are pictured as an arrow of volition pointing away from the current experience, perhaps happiness could be pictured as an arrow pointing towards it, or as a loop of volition. And perhaps there is a case to be made for viewing happiness as satisfied cravings, based on which one could argue that simply distracting oneself from cravings does not produce an equally good or equally desirable result.
My reply to such objections to tranquilism is twofold:
To elaborate on point two, let us first consider culinary or sexual pleasures: these versions of happiness may indeed be regarded as fulfilled cravings of some kind, a point that is supported by the example of people who – for no apparent reason besides anticipation of pleasure – postpone addressing their hunger cravings during the day in order to receive extra enjoyment from dinner at a fancy restaurant. But then there are also positive feelings where it clearly seems false that they come as satisfied cravings or come with an evaluative need or desire to have the experience continue. Take contentment induced by sleeping pills, for instance. This experience seems to decidedly lack such a component (even though many sleeping pills are dangerously addictive), illustrated by people rarely being tempted to fight falling asleep because of how well they are feeling. This suggests that many flavors of happiness, such as meditative tranquility or drug-induced contentment for instance, do not seem to vary in intensity depending on how strong our cravings were before the experience started, or how strong the cravings would be if the experience abruptly ended. Instead, these states of contentment seem to come with a sense – not in degrees, but either on or off – that no cravings can possibly arise, that no pains can possibly affect us. Inspired by tranquilism, one could argue that what all positive experiences – all instances of happiness – have in common is that they play a functional role that prevents or protects against the formation of cravings. Pleasurable experiences do so by flooding one’s attention with pleasure, and experiences such as flow, meditative tranquility or drug-induced contentment do so by making it harder for cravings to come up.
While it is interesting that proponents of tranquilism seem to give different accounts of suffering and happiness than proponents of hedonism, it remains unclear to what extent this reflects different takes on introspection or different empirical predictions, or whether we are rather dealing with different interpretations of the same picture. Perhaps some versions of hedonism could agree with many or most of the descriptive points proponents of tranquilism make, but state normatively that we should regard happiness as more valuable than states of non-consciousness, and that being a little happy is much worse than being very happy. (Such a position would further have to stipulate how pleasure is to be compared to states of contentment, especially if one assumes that the two states relate to cravings in different ways.)
While it would be very interesting and possibly highly illuminating to learn more about the phenomena we introspectively label as pain, pleasure, happiness and suffering, it should be noted that neuroscience by itself cannot give us any direct answers to normative questions – it can only answer empirical subquestions (if they are formulated precisely enough). Neuroscientific findings can sometimes give us very strong nudges in certain directions, but there may be instances where different aspects of what is going on can be emphasized or de-emphasized by different people – in the same way people's taste or aesthetics can differ.
According to tranquilism, the most intense pleasures are no better, intrinsically, than so-called “hedonically neutral” states which are completely free of cravings. This is an unusual perspective. After all, comparing the experience of eating pizza to eating potatoes, many people24 are likely to prefer the former: Pizza gives them the more pleasurable experience, and they are more likely to develop strong cravings for pizza.
Proponents of tranquilism are not denying that some experiences can give us more pleasure than others, but they would counter that this point is not always relevant. Namely, in the case of “Pizza vs. potatoes” it can be irrelevant if both meals are good enough to completely satisfy all our needs in a given moment. A blanket assessment of which is the more valuable eating experience is based on a comparison between the two experiences from an outside perspective, where the pleasures during both meals are being compared to one another and where we may develop cravings (or reflection-based desires) for one meal but not the other. For the experiential moments in question, there is no such outside perspective. During the meal, from a first-person point of view, the ongoing moment is all there is, and the person eating potatoes instead of pizza may not be thinking about any alternatives. Tranquilism zooms in on how each state is evaluated subjectively and directly. From this internal perspective, enjoying potatoes can – sometimes – be perfectly fine as well. Assuming that one forgets everything else around and is fully absorbed in an experience, with no need for anything about it to be different, no longing for richer or different taste, it is accurate to say that the immediate, subjective evaluation of the experience is that it is flawless. And the fact that the experience of eating something different might score higher on a scale of pleasure intensity simply does not play a role according to this perspective – it may be true, but it is as irrelevant to the value of an experience as e.g. differences in a parameter such as endorphins released.
Perhaps it is helpful to point out that the intuitions in support of tranquilism are similar to the intuitions philosophers used to reject John Stuart Mill’s perfectionist axiology. Mill felt that the happiness of a pig is somehow worse than the happiness of a human, and that it is “better to be Socrates dissatisfied than a fool satisfied.”25 It is understandable that, to a genius26 like Mill, the prospect of being incapable of understanding sophisticated philosophical reasoning represented a catastrophe. And admittedly, goal and intelligence preservation are convergent instrumental goals for rational agents, so it makes sense to have an intuitive aversion to losses of this sort.27 The alternative perspective, however, is that the fool himself does not notice that anything is lacking; nothing about being a fool bothers him in any way. The same goes for the happy pig in comparison with an ordinary human. Many people, in fact, believe that animals have it better than humans because they live in the moment. If our evaluation focuses on how a situation is experienced from the inside, then Mill’s arguments miss the point, as he is highlighting aspects of outside characteristics entirely irrelevant to the ongoing evaluation of the experience in question. The same intuitions used against Mill’s perfectionist axiology lend support to the idea that states of flow or meditative tranquility, evaluated with respect to the presence or absence of cravings, can be no less valuable than states of intense pleasure.
Pizza is fine and good, but what about the wonder of holding one’s newborn child for the first time, the ecstasy of winning the world championship in a sport one has trained for one’s entire life, the bliss of reuniting with the love of one’s life after years of forced separation, or the triumph of finally getting revenge on the bad guy? How can there be no relevant difference between these experiences and mere states of contentment?
Before giving an answer, it is important to point out what tranquilism is not saying. Tranquilism is not a standalone moral theory committed to the claim that nothing besides suffering (cravings) is of moral concern. Tranquilism is only saying that insofar as we focus our attention solely on experiences, we can conclude something about the relative value of these experiences. Again, tranquilism zooms in on an experience and omits everything else beyond it. Arguably, the examples above are powerful to a large extent because they condense purpose and meaning over the course of a person’s life. If we instead envisioned these experiences as doctored somehow, such that they would not have been part of someone’s biography, but were rather based on fake memories and trickery, they might lose a substantial portion of their appeal. We could still regard them as intensely pleasurable and (subjectively) profound experiences, but stripped of the context of truly achieving one’s goals, these examples arguably boil down to something more analogous to “Pizza vs. potatoes?” discussed above. If we only consider experiential moments in isolation, and not what they might mean to us in terms of purpose and life goals, the tranquilist perspective becomes more intuitive. And to the extent that we want to honor what these experiences mean to us in terms of purpose and life goals, we should perhaps be talking about whether accomplishing these goals is intrinsically valuable – something that is outside the scope of this paper.
One way in which depression can manifest itself is through complete apathy or inertia. A person may feel as though “no change could make any difference to my misery” and, correspondingly, lack any intent to bring about change. Is this an example of a negative state that is nevertheless free of cravings – something that would contradict the tranquilist account?
The answer is “no” because there is a difference between drive or willpower on the one hand and cravings on the other. One may lack motivation to bring about change, but this does not mean one cannot experience the need for change. Inertia seems to represent a decoupling or erosion of our action module – whatever translates our feelings and beliefs into goal-directed action. But experiencing inertia or apathy, if it indeed comes as a negative experience caused by depression (and not e.g. a satisfied kind of inertia we might experience after eating tasty food to the point of complete satiation), comes with an urgent desire to feel different and feel better. If that were not the case, we would not label the feeling as bad or depressed in the first place. Apathy from depression is therefore not a state that lacks cravings, but one that lacks components like hope, motivation, drive and/or willpower. Describing this state as “lacking any intent to bring about change” would be mistaken.
The last objection I want to discuss concerns the nature of cravings and whether they can really be said to always be negative. Arguing the contrary, readers of an earlier draft pointed to the feeling of anticipation. When looking forward to a vacation for instance, a person engaged in planning exciting activities may find herself in an enthusiastic, very positive frame of mind. She is envisioning something she does not currently have. Does this not constitute an example of a craving being experienced as positive, contradicting tranquilism’s premise that cravings are what is bad for an individual?
I do not think the person in question can be said to experience a craving; she is not experiencing a conscious need to turn her current feelings into the positive feelings associated with the upcoming vacation. After all, she is already feeling good. Envisioning the exciting activity (and envisioning her progress towards this activity) puts her in the positive state of mind associated with it, which is notably different from craving a not-currently-experienced feeling.
This point seems to bring up a related line of objection against tranquilism. While I pointed out (cf. section 2.3) that we seem to be moved by suffering, could we not imagine a person being moved (or “powered” rather in this case) entirely by anticipation of a positive experience? This seems indeed plausible. During periods of high motivation (or mania), people seem to exhibit high productivity resulting from a kind of positively experienced restlessness. Interestingly, however, this state of mind seems to be associated with agency and (sometimes overly optimistic) planning or advancing towards a goal, and not with a short-sighted search for gratification. Introspectively, I would characterize this state of mind as being motivated by optimism about achieving one’s goals – in the sense that there is very much also an element of prediction in the experience and not (just) wishing or longing. To me, anticipation is not synonymous with being motivated by positive feelings; rather, it seems to consist of motivation fuelled by modelling the future as positive for one’s goals.
Arguably however, it is up for interpretation which aspect of our mental phenomena one wants to emphasize, and introspection and interpretation of what matters to us may not turn out the same way for everyone. It is therefore possible (though by no means obvious) that people who experience this anticipation-based drive or euphoria unusually often (or those who very rarely experience strong cravings for that matter) are less inclined to sympathize with tranquilism as a theory of well-being.
I introduced and explained tranquilism as a theory of well-being, contrasted it with hedonism and addressed plausible objections. Whereas the hedonist conception of well-being is about maximizing positive experiences and minimizing negative ones, the tranquilist understanding of well-being is about freedom from cravings. In some respects, tranquilism is perhaps better conceptualized as a theory of need-based motivation. Tranquilism assumes that experiences are not desirable or undesirable in themselves. Rather, our pursuit of pleasure and our aversion to pain both manifest themselves as a conscious need (craving) to change something about the current experience. These cravings are what qualify as suffering. According to tranquilism, suffering is of negative value because it corresponds to a first-person evaluation of an experience as negative, as something to change or eliminate.
Because flooding the brain with pleasure is an efficient way to make cravings disappear from our stream of consciousness, pleasure still carries great instrumental importance in tranquilism. But the absence of pleasure does not count as negative in itself. As long as a pleasureless state of consciousness remains free of cravings, it is still considered an optimal and happy experience.
Tranquilism is not committed to the view that cravings are all that matter. Our motivation is multifaceted, and next to impulsive motivation through cravings, we are also motivated by desires to achieve certain goals. People do not solely live for the sake of their personal well-being; we may also (or even exclusively) hold other goals, including goals about the welfare of others or goals about the state of the world. Thinking about morality in terms of goals (or “ends”) has inspired rationality-based accounts of cooperation such as Kantianism (arguably), contractualism, or coordinated decision-making between different value systems.28 Furthermore, if one chooses to regard the achievement of preferences or goals as valuable in itself, this can inspire moral axiologies such as preference-based consequentialism or coherent extrapolation, either as a complement or an alternative to one’s theory of well-being. (And of course, one’s goals may include many other components, including non-moral or non-altruistic ones.)
While tranquilism gives counterintuitive value judgments in cases where one’s goals or preferences play a large role (see e.g. “What about the peak of human experience?” above, or any thought experiments that involve killing happy people who want to go on living in order to prevent small amounts of suffering), it may be particularly relevant in the domain of population ethics. This is because in population ethics, populations are compared and evaluated according to the quality and (more relevantly) quantity of the individual lives they contain. When thinking about whether to reduce the size of an existing population, the goals of the individuals in that population seem relevant, at the very least, for reasons of cooperation, and perhaps also intrinsically (depending on whether things other than well-being have final value). So even though tranquilism might say that there is a sense in which an empty world is better than one populated by a lot of happy beings and a few suffering ones (namely according to the value of well-being in that world), endorsing tranquilism is compatible with positions where this judgment can be overridden by other, stronger reasons (such as e.g. reflection-based reasons or reasons of cooperation).
On the other hand, when one considers adding new beings to the world, it looks as though reflection-based reasons for action are not straightforwardly applicable. Do we count what a person would desire after coming into existence, or do we only count the goals of individuals that will exist regardless of how we decide? What if someone deliberately creates a person whose preference ranks an infinite amount of suffering above non-existence? What about creating beings capable of suffering that cannot form preferences or goals? Moral views inspired by reflection-based reasons for action would (arguably) leave the decision up to the goals or preferences of those currently in power. However, it seems that this overlooks a relevant dimension of the scenarios discussed. Namely, what is missing is an axiological perspective inspired by need-based reasons for action. I find it plausible that tranquilism could be this perspective. Tranquilism would suggest treating such cases according to the Epicurean position that non-existence, here in the sense of not being born, cannot be a problem for anyone, and that creating new happy beings is never intrinsically valuable.
In contemporary ethics, the ideas that non-existence may be unproblematic, or that there is a sense in which an empty world could be better than one filled with a tiny amount of suffering and a large amount of happiness, are taken as a counterintuitive non-starter by several philosophers.29 Considering, however, that judgments about population ethics are guaranteed to be counterintuitive in at least some respects,30 we should not dismiss otherwise appealing perspectives too quickly. Tranquilism and related “absence of desire” theories deserve, in my view, more consideration than they currently receive – especially if they are regarded not as standalone moral theories, but as approaches that complement the aspects of morality that deal with reflection-based reasons for action.
Adriano Mannino and I developed some of the ideas behind our version of tranquilism from 2011 to 2014 during lengthy discussions in which we grappled with issues in population ethics, and eventually encountered similar, more elaborate concepts in Buddhism. Buddhism was initially an inspiration for me in the sense that I was passively aware of statements such as “desire is suffering.” However, I only started to think of suffering as cravings, rather than simply as an undesirable experience, after being inspired by certain views in the philosophy of mind. I thank Adriano and Simon Knutsson for contributions to an earlier draft of this paper, and Brian Tomasik, Kaj Sotala, Tobias Pulver, Michael Moor, and David Althaus for their valuable feedback on this topic over the years. Finally, thanks to Piti Irawan for comments that inspired the description of cravings as needs related to the non-acceptance of the current state, and to Ruairi Donnelly, David Althaus and Max Daniel for their feedback and suggestions on the current draft, as well as to Persis Eskander for feedback and help with editing. I would also like to give a nod to Bruno Contestabile (cited below), who independently came up with the term Buddhist axiology (the title of an earlier draft of this paper), and to Dan Geinster, with whom I have not corresponded, but who seems to have come up with similar ideas under the label “anti-hurt.”
The post Tranquilism appeared first on Center on Long-Term Risk.
What does moral advocacy look like in practice? Which values should we spread, and how?
How effective is moral advocacy compared to other interventions such as directly influencing new technologies?
What are the most important arguments for and against focusing on moral advocacy?
The post Arguments for and against moral advocacy appeared first on Center on Long-Term Risk.
Spreading beneficial values is a promising intervention for effective altruists. We may hope that our advocacy has a lasting influence on our society (or its successor), thus having an indirect impact on the far future. Alternatively, we may view moral advocacy as a form of movement building; that is, it may inspire additional people to work on the issues we care most about.
This post analyses key strategic questions on moral advocacy: what moral advocacy looks like in practice, which values we should spread and how, how effective it is compared to other interventions such as directly influencing new technologies, and what the most important arguments for and against it are.
The first thing that comes to mind when thinking about moral advocacy is to “go out there and spread the word”. But advocating for one’s values can take many, often indirect, forms. For example, Eliezer Yudkowsky’s work on LessWrong and on AI safety is not moral advocacy in the narrow sense, but it did spread his values in the community, at least to a certain extent.
Similarly, when I talk about moral advocacy, I don’t mean blunt repetition of our values, but rather having something to show for it. In practice, moral advocacy is closely related to high-quality object-level work to demonstrate what the values are about.
I use the terms “values spreading” and “moral advocacy” interchangeably. In comparison to “movement building” or “community building”, moral advocacy refers to activities that influence people’s values rather than just providing services to the community – though both can and should go hand-in-hand.
Of course, the effectiveness of moral advocacy depends on which values we try to spread. I think the following are plausible candidates:
Figuring out which of these values contributes most to reducing suffering in the future would need much more research. What matters is not only how much the value helps in “foreseeable” future scenarios, but also how it influences unknown unknowns and how robustly it improves the future rather than making it worse.
An ideal analysis of the advantages and drawbacks of moral advocacy would consider all these values separately. Despite this, the rest of this post will talk about how promising moral advocacy is in general, and the term will refer to a combination of the above values. I think this approach makes sense for several reasons:
People often view moral advocacy as the attempt to change the values of society at large – which is fairly hard – but I don’t think we necessarily need to aim for this. Convincing a small minority may already yield a significant fraction of the benefits.
This is because low-hanging fruit will likely allow this minority to effectively reduce suffering. For instance, in the case of factory farming, stunning animals before slaughter and basic welfare laws prevent a large fraction of the suffering that would otherwise happen, and society implements such measures despite the fairly low level of concern for animals. Of course, factory farming is still horrific, so this does not get us all the way. But it’s at least plausible that increased concern for suffering has diminishing marginal returns.
Similarly, we might hope that fairly little concern for suffering would go a long way towards mitigating the “incidental” harm caused by egoistic or economic forces in the future. The point may apply even more strongly if advanced future technology facilitates the fulfillment of many values at once – like cultured meat in the factory farming analogy.
The analogy also highlights the role of consistency. Most people are compassionate in that they care about certain animals like dogs and cats, but they do nothing to help farm animals. The situation is even worse if we consider wild animal suffering and digital sentience. So, little concern for suffering only goes a long way if it’s consistent, that is, includes all sentient beings.
Convincing more people of our value system (or parts thereof) seems robustly positive because they will pursue interventions that are positive for this value system in expectation (unless they are extremely irrational). Better values also reduce expected suffering from unknown unknowns.
(There are, however, considerations why moral advocacy may be less robust than it seems. I have unpolished material on this. If you’re interested, please contact me to discuss further details.)
Many ways to have an impact depend on uncertain and often speculative predictions of how the future will unfold. For example, working on the risks of advanced AI hinges on the assumption that AI will be a pivotal technology. In fact, we arguably need fairly detailed scenarios to work effectively on the topic.
But we have every reason to expect that our analyses may be flawed, given the intrinsic difficulty of predicting the (distant) future and the lack of robust evidence and data. This reduces both the expected magnitude and the robustness of any intervention that requires such predictions.1
In contrast, a causal chain of the form "moral advocacy leads to better values in the future, which (in expectation) reduces suffering" is fairly disjunctive in that it does not require specific predictions of the future. It does, however, require the following implicit assumptions:
Note that the impact of moral advocacy is not necessarily based on hoping that better values now directly translates into better values in the far future. It only assumes that additional altruistic people somehow do something valuable. For example, they may contribute to shaping advanced AI, in which case moral advocacy is relevant even if the long-term distribution of values reverts to an equilibrium.
Moral advocacy can be highly influential. Individuals such as Peter Singer or Brian Tomasik, whose advocacy has inspired many others, are the most obvious evidence for this claim. More generally, the differences in values between communities (e.g. LessWrong, effective altruism in the English-speaking area, and effective altruism in the German-speaking area) can often be traced back to a few key individuals, which suggests that spreading values can multiply your impact.
It also suggests that the effectiveness strongly depends on how common the values already are, and on how good you are at spreading them. Peter Singer and Brian Tomasik had a large impact because they pioneered their respective causes (animal liberation, wild animal suffering). Similarly, we may prefer to advocate “neglected” values such as concern for digital sentience, the importance of s-risks, or suffering-focused ethics over more common values like concern for animals.
While moral advocacy is plausibly high-impact, it is possible that focused attempts to shape new technologies are an even more powerful lever. But it’s also comparatively more difficult and risky to try to find such interventions.
We may question the long-term impact of spreading values based on the following cluster of arguments:
I agree that these are reasonable points that cast doubt on the idea that we can influence the values of the far future by spreading our values. Still, I’d like to offer a few replies:
In a deterministic view of history, technological development – not moral advocacy – is the driving force behind changes in people’s values. Did slavery end because of the abolitionist movement’s advocacy or because industrialization made slaves obsolete? If we think that changing circumstances like the emergence of new technologies ultimately determine societal values, then advocacy may be pointless.
However, it is just as possible that the causation goes the other way, that is, values influence technological developments. For instance, concern for animals hastens the development of clean meat, and in human history, nationalism may have accelerated the invention of new forms of warfare. The interplay between values and technology is complex; most likely, the causal effects go both ways. (More research on this would be valuable.)
In a similar vein, proponents of a “germ theory” of values argue that value differences are based on arbitrary factors such as the local frequency of diseases, not on advocacy or moral reflection. More generally, one might argue that most people aren’t perfect agents and are not that interested in philosophical reflection, which may reduce the impact of moral advocacy.
However, even if this is true, advocacy still nudges people in better directions. Also, as argued earlier, we don’t need to convince everybody – it may be sufficient to have a minority that cares about reducing suffering.
Efforts to shape powerful new technologies may be more focused than moral advocacy. Also, few people work on the technical aspects of new technologies like AI, while one needs relatively many people to make moral advocacy effective. In fact, moral advocacy may require more resources than it naively seems because of tricky questions of impact attribution, especially if moral advocacy is done over many generations.
I tentatively agree that directly influencing technology is a better lever if we can identify targeted interventions that are both robustly positive and tractable. Work on the risks of advanced artificial intelligence – especially fail-safe AI – may or may not fill this role. But trying to shape technology is also riskier in that it’s more likely that our efforts will be wasted, for example because we are mistaken about which technologies are most consequential. Also, moral advocacy is a way to shape technology, too – by having more people work on it when the time comes.
All told, I'm fairly agnostic about the relative effectiveness of moral advocacy and directly working on technology.
Max Daniel argues that “values spreading, in general, is pretty crowded, a lot of religious, political, and philosophical groups are trying to spread values and compete for attention”.2
I agree that lots of groups try to spread their values. But the more relevant question is how many groups spread values that are in direct competition to your own, that is, appeal to the same class of people. If we assume that people have innate intuitions for or against particular moral views, and that this determines whether they adopt a value system upon reflection, then the advocacy of other value systems does not make our own advocacy less effective.
Of course, this assumption is idealized, so the argument does not quite work. In particular, people’s values are often determined by the values they first come into contact with, or by the values held by respected individuals. In other words, we observe strong path dependencies regarding which values someone endorses after reflection.
But I would still maintain that the largest obstacle for spreading values is usually not the moral advocacy of other groups. Rather, it’s that relatively few people are strongly altruistic, reflect a lot about philosophy, share my suffering-focused intuitions about population ethics, and so on. Of course, this may in and of itself reduce the value of moral advocacy or lead to strongly diminishing marginal returns at some point.
More people work on broader forms of moral advocacy such as spreading generic altruism or expanding the moral circle, but the numbers are still fairly small on an absolute scale. 80000 hours also thinks advocacy is neglected because “there’s usually no commercial incentive to spread socially important ideas”.
In his essay “Against moral advocacy”, Paul Christiano argues that trying to spread one’s values is often a zero-sum game that we should abstain from. The idea is that if we spread our values, other groups will spread opposing values, leading to no net change.
This is an interesting argument, but I don’t quite agree with the conclusion, at least for the kind of moral advocacy I have in mind. This is for the following reasons:
Moral advocacy is plausibly high-impact, especially if we believe that a fairly low level of concern for suffering will be enough to pick the low-hanging fruit. I think many of the objections are reasonable points, but not decisive, which is why I’m still mildly optimistic.
That said, the relative effectiveness of advocacy compared to other interventions like shaping new technology is highly uncertain. I tentatively think that the latter can be even more impactful in the best case, but it’s more difficult to find a good lever. A promising approach is to combine both by shaping the values of the people who shape technology.
I am indebted to Max Daniel, Lukas Gloor, Brian Tomasik, and David Althaus for valuable comments and discussions.
The post Arguments for and against moral advocacy appeared first on Center on Long-Term Risk.
The post Strategic implications of AI scenarios appeared first on Center on Long-Term Risk.
Efforts to mitigate the risks of advanced artificial intelligence may be a top priority for effective altruists. If this is true, what are the best means to shape AI? Should we write math-heavy papers on open technical questions, or opt for broader, non-technical interventions like values spreading?
The answer to these questions hinges on how we expect AI to unfold. That is, what do we expect advanced AI to look like, and how will it be developed?
Many of these issues have been discussed at length, but the implications for the action-guiding question of how to best work on the problem often remain unclear. This post aims to fill the gap with a rigorous analysis of how different views on AI scenarios relate to the possible ways to shape advanced AI.
We can slice the space of possible scenarios in infinitely many ways, some of which are more useful for our thinking than others. Commonly discussed questions about AI scenarios include:
The reason why we ask these questions is that the answers determine how we should work on the problem. We can choose from a plethora of possible approaches:
To avoid the complexity of considering many strategic questions at the same time, I will focus on whether we should work on AI in a technical or non-technical way, which I believe to be the most action-guiding dimension.
The value of technical work depends on whether it is possible to find well-posed and tractable technical problems whose solution is essential for a positive AI outcome. The most common candidate for this role is the control problem (and subproblems thereof), or how to make superintelligent AI systems act in accordance with human values. The viability of technical work therefore depends to some extent on whether it makes sense to think about AI in this way – that is, whether the control problem is of central importance.3
This, in turn, depends on our outlook on AI scenarios. For instance, we might think that the technical problems of AI safety may be less difficult than they seem, that they will likely be solved anyway, or that the most serious risks may instead be related to security aspects, coordination problems, and selfish values.
The following views support work on the control problem:
In contrast, if AI is like the economy, then the control problem does not apply in its usual form – there is no unified agent to control. Influencing the technical development of AI would be harder because of its gradual nature, just as it was arguably difficult to influence industrialization in the past.
It is often argued that an agent-like superintelligence would ultimately emerge even if AI takes a different form at first. I think this is likely, but not certain. But even so, the strategic picture is radically different if economy-like AI comes first. This is because we can mainly hope to (directly) shape the first kind of advanced AI since it is hard to predict, and hard to influence, what happens afterward.
In other words, the first transition may constitute an “event horizon” and therefore be most relevant to strategic considerations. For example, if agent-like AI is built second, then the first kind of advanced AI systems will be the driving force. They will be intellectually superior to us by many orders of magnitude, which makes it all but impossible to (directly) influence the agent-like AI via technical work.
This brings us to another intermediate variable, namely how much technical safety work will be done by others anyway. If the timeline to AI is long, if the takeoff is soft, or if AI is like the economy, then large amounts of money and skilled time may be dedicated to AI safety, comparable to contemporary mainstream discussion of climate change.
As AI is applied to more and more industrial contexts, large-scale failures of AI systems will likely become dangerous or costly, so we can expect that the AI industry will be forced to make them safe, either because their customers demand it or because of regulation. We may also experience an AI Sputnik moment that leads to more investment in safety research.
Since the resources of effective altruists are small in comparison to large companies and governments, this scenario reduces the value of technical AI safety work. Non-technical approaches such as spreading altruistic values among AI researchers or work on AI policy might be more promising in these cases. However, the argument does not apply if we are interested in specific questions that would otherwise remain neglected, or if we think that safety techniques will not work anymore once AI systems reach a certain threshold of capability. (It’s unclear to what extent this is the case.)
This shows that how we work on AI depends not only on our predictions of future scenarios, but also on our goals. Personally, I’m mostly interested in suffering-focused AI safety, that is, how to prevent s-risks of advanced AI. This may lead to slightly different strategic conclusions compared to AI safety efforts that focus on loading human values. For instance, it means that fewer people will work on the issues that matter most to me.
A related question is whether strong intelligence enhancement, such as emulations or iterated embryo selection, will become feasible (and is employed) before strong AI is built. In that case, the enhanced minds will likely work on AI safety, too, which might mean that future generations can tackle the problem more effectively (given sufficiently long timelines). In fact, this may be true even without intelligence enhancement because we are nearsighted with respect to time, that is, it is harder to predict and influence events that are further in the future.
It’s not clear whether strong intelligence enhancement technology will be available before advanced AI. But we can view modern tools such as blogs and online forums as a weak form of intelligence enhancement in that they facilitate the exchange of ideas; extrapolating this trend, future generations may be even more “intelligent” in a sense. Of course, if we think that AI may be built unexpectedly soon, then the argument is less relevant.
Technical work requires a sufficiently good model of what AI will look like, or else we cannot identify viable technical measures. The more uncertain we are about all the different parameters of how AI will unfold, the harder it is to influence its technical development. That said, radical uncertainty also affects other approaches to shape AI, potentially making it a general argument against focusing on AI. Still, the argument applies to a larger extent to technical work than to non-technical work.
In a nutshell, AI scenarios inform our strategy via three intermediate variables: whether the control problem is of central importance, how much technical safety work will be done by others anyway, and how clear a picture we have of what AI will look like.
Technical work seems more promising if we think the control problem is pivotal, if we think that others will not invest sufficient resources, and if we have a clear picture of what AI will look like.
Effective altruists should coordinate their efforts, that is, think in terms of comparative advantages and what the movement should do on the margin rather than just considering individual actions. Applied to the problem of how to best shape AI, this might imply that we should pursue a variety of approaches as a movement rather than committing to any single approach.
Still, my impression is that non-technical work on AI is somewhat neglected in the EA community. (80000 hours’ guide on AI policy tends to agree.)
My position on AI scenarios is close to Brian Tomasik’s: I lean toward a soft takeoff, relatively long timelines, and distributed, economy-like AI rather than a single actor. Also, we should question the notion of general (super)intelligence. AI systems will likely achieve superhuman performance in more and more domain-specific tasks, but not across all domains at the same time, which makes it a gradual process rather than an intelligence explosion. But of course, I cannot justify high confidence in these views given that many experts disagree.
Following the analysis of this post, this is reason to be mildly sceptical about whether technical work on the control problem is the best way to shape AI. That said, it’s still a viable option because I might be wrong and because technical work has indirect benefits in that it influences the AI community to take safety concerns more seriously.
More generally, one of the best ways to handle pervasive uncertainty may be to focus on “meta” activities such as increasing the influence of effective altruists in the AI community by building expertise and credibility. This is valuable regardless of one’s views on AI scenarios.
The post Strategic implications of AI scenarios appeared first on Center on Long-Term Risk.
The post Team appeared first on Center on Long-Term Risk.
The post Tool use and intelligence: A conversation appeared first on Center on Long-Term Risk.
This post is a discussion between Lukas Gloor and Tobias Baumann on the meaning of tool use and intelligence, which is relevant to our thinking about the future of (artificial) intelligence and the likelihood of AI scenarios. To help distinguish the participants, each contribution is preceded by the speaker’s name.
See also: Magnus Vinding's response to this conversation.
The trigger of the discussion was this statement:
> Intelligence is the only advantage we have over lions.
Tobias Baumann:
I think this framing is a bit confused. The advantage over lions is because of tools (weapons) which resulted from a process of cultural evolution taking thousands of years, not just "intelligence". An individual human without technology and a lion rule the Earth equally little.
Lukas Gloor:
Hominids drove megafauna into extinction across several continents, which is a fairly large 'accomplishment' on a species-level scale.
More importantly, cultural evolution greatly increased our intelligence (in the “goal-achieving capacity” sense). Agriculture led to specialization and more time for people to learn new skills; the printing press led to the accumulation of knowledge; increased nutrition led to higher IQs; possibly gene-culture co-evolution on surprisingly short timescales also increased IQ; etc. There’s a conceptual risk of confusion in that the way we normally use "intelligence" is underspecified. I suggest we distinguish between
1) differences in innate cognitive algorithms.
and
2) what difference the above make when coupled with a lifetime of goal-directed learning and becoming proficient in the use of (computer-)tools.
There is a sizeable threshold effect here between lions (and chimpanzees) and humans: above a certain threshold of intelligence, you're also able to reap all the benefits from culture. (There might be an analogous threshold for self-improvement FOOM benefits in AI.)
> The advantage over lions is because of tools (weapons) which resulted from a process of cultural evolution taking thousands of years, not just "intelligence".
This is an oversimplification because the lion could not make use of tools. The availability of tools in an environment amplifies how many returns you can get out of being intelligent, but the effect need not be proportional. A superintelligence in the Stone Age would most likely be lost, never achieving anything special, because it might just run out of electricity/resources before it can influence the course of the world in ways that favor its survival and goals. Superintelligence today already has a much better shot at attaining dominance, because more tools are plausibly within its reach. Superintelligence in 100 years may have it even easier if the tools are not protected from its reach. (E.g. think of that Die Hard movie where a supercomputer controls all the street lights or something – this only works if the society is already pretty computerized.)
So the point is:
There are some (very simple and tool-empty) environments where intelligence differences don't have much of a practical effect (and domain-specific “intelligence,” i.e. evolutionary adaptiveness, is more relevant in these environments, which is why the lion is going to eat Einstein but not necessarily the indigenous hunter who has experience with anti-lion protection). But there are also (more interesting, complex) environments where intelligence differences play a greatly amplified role, and we currently live in such an environment.
Magnus Vinding seems to think that because humans do all the cool stuff "only because of tools," innate intelligence differences are not very consequential. This seems wrong to me, and among other things we can observe that e.g. von Neumann’s intellectual accomplishments were so much greater than, and in a sense out of reach of, the accomplishments that would be possible with an average human brain.
Tobias Baumann:
> Hominids drove megafauna into extinction on many continents, which is a fairly big accomplishment on a species-level scale.
I feel uneasy about this because it's talking about the species-level, which cannot be modeled as an agent-like process. I would say humans have occupied an ecological niche (being a technological civilization), which transforms the world (suddenly there are weapons) in ways that causes some species to go extinct because they were not adapted to the change.
> cultural evolution greatly increased our intelligence.
Can we agree on intelligence meaning "innate cognitive ability"? I would agree that the statement is true even in this definition (Flynn effect). One can also talk about goal achievement ability, but I would simply call that "power". (It's correlated, of course.)
One difference in our thinking seems to be that you see the existence of tools (e.g. guns) as an environment that amplifies how powerful your intelligence is, whereas I look at it as "there are billions of agents in the past and present that have contributed to you being more powerful, by building tools for humans, by transmitting knowledge, and many more". In this picture, the lion just got unlucky because no one worked to make him more powerful. The "threshold" between chimps and humans just reflects the fact that all the tools, knowledge, and so on were tailored to humans (or maybe tailored to individuals with superior cognitive ability).
However, this seems to be mostly a semantic point, not a genuine disagreement. I would agree that intelligence does correlate strongly with power in the environment we live in. Whether a smarter-than-human AI would be able to achieve dominance depends on whether it would be able to tap into this collection of tools, knowledge, and so on. This is plausible for certain domains (e.g. knowledge that can be read on wikipedia) and less plausible for other domains (e.g. implicit knowledge that is not written down anywhere, or tools that require a physical body of some sort).
I would assign a decent probability that a smarter-than-human AGI would actually be able to achieve dominance (e.g. it might find ways to coerce humans into doing the stuff that it can't access on its own). What I find more problematic is the notion of AGI itself, or more precisely the idea that there's a single measure of intelligence. It seems more likely to me that we will see machines achieve superhuman ability in more and more domains (stuff like Go and image recognition has changed status in recent years), but not in all at once, and there will be some areas that are very difficult (e.g. conceptual thinking, big-picture strategic thinking, "common sense", social skills).
It is plausible (though not clear-cut) that machines would eventually master all these areas, but this would not be a foom-like scenario because they would not become superhuman in all of them at once. Also, it's perfectly possible that an AI has mastered enough domains to radically transform society even if it lacks some crucial components (e.g. social intuitions, or the ability to access human tools), which would also be a scenario that's different from the usual superintelligence picture in relevant ways.
Side note: One might argue that empirical evidence confirms the existence of a meaningful single measure of intelligence in the human case. I agree with this, but I think it's a collection of modules that happen to correlate in humans for some reason that I don't yet understand.
[Note: Robin Hanson makes the point that “most mental tasks require the use of many modules, which is enough to explain why some of us are smarter than others. There’d be a common “g” factor in task performance even with independent module variation.”]
Lukas Gloor:
> The "threshold" between chimps and humans just reflects the fact that all the tools, knowledge, etc. was tailored to humans (or maybe tailored to individuals with superior cognitive ability).
So there's a possible world full of lion-tailored tools where the lions are making our lives miserable all day?
Further down you acknowledge that the difference is "or maybe tailored to individuals with superior cognitive ability" – but what would it mean for a tool to be tailored to inferior cognitive ability? The whole point of cognitive ability is to be good at making the most out of tool-shaped parts of the environment. Edit: In the sense of cognitive ability being defined/measured in a way that tends to correlate with this.
Tobias Baumann:
> So there's a possible world full of lion-tailored tools where the lions are making our lives miserable all day?
Yes. If not for lions, it’s at least possible for chimps or elephants.
[Note: The claim is not that such a world is plausible – why would anyone build tools for lions – just that it is physically possible.]
> Further down you acknowledge that the difference is "or maybe tailored to individuals with superior cognitive ability" – but what would it mean for a tool to be tailored to inferior cognitive ability?
Hmm, fair point, but it's at least conceivable that tools are such that they can be used by those with lower cognitive ability, too.
> The whole point of cognitive ability is to be good at making the most out of tool-shaped parts of the environment.
I’m not sure if it's the whole point. Intelligence can also be about how to best find sexual mates, gain status, or something like that. Elephant brains are bigger than human brains, but elephant tool use is along the lines of "use branches to scratch yourself", so I have a hard time believing that this is the only reason.
Lukas Gloor:
I don't think tool use and intelligence work that way, but it's an interesting idea and the mental pictures I'm generating are awesome!
-----------------------------------------------------------------
Addendum by Lukas Gloor, from a related discussion:
Magnus takes the human vs. chimp analogy to mean that intelligence is largely “in the (tool-and-culture-rich) environment.” But I would put it differently: “General intelligence” is still the most crucial factor (after all, something has to explain intra-human variation in achievement prospects), but intelligence differences – here’s where Magnus’ example with the lone human vs. lone chimpanzee comes in – seem to matter a lot more in some environments than in others. Early cultural evolution created an environment for runaway intelligence selection (possibly via Baldwin effect), and so the more “culturally developed” the environment became, the bigger the returns from small increases to general intelligence.
I’m not sure “culturally developed” is exactly the thing that I was looking for above. I mean something that corresponds to “environments containing a varied range of potentially useful tools,” but it could be fruitful to think more about specific features that affect the amount of “edge” you get over less intelligent agents depending on the environment they compete in. Factors for environments where intelligence gives the greatest returns seem to be things like the following:
– Lawfulness
– Complexity/predictability (though maybe you want things to be “just difficult enough” rather than as easy as possible?)
– Variation: many different aspects to explore
– Optimized features, sub-environments: Optimization processes have previously been at work, crafting features of the environment in ways that might be useful for convergent instrumental goals of the competing agents
– Transferability of insights/inventions: Common “language” of the environment (e.g. literal language giving you lots of options in the world/society; programming language giving you lots of options because of universal application in computers, etc.)
– Availability of data
– Upwards potential: How many novel insights are left to discover, as opposed to just learning and perfecting pre-existing tricks (Kaj’s “How feasible is the rapid development of superintelligence” paper seems relevant here)
– etc.
(Re the last point: It is probably true that the best poker player today has a lower edge (in terms of how much they win in expectation) versus the average player than the best player five years ago had versus the average player back then. This is (among other reasons) due to more popular teaching sites and the game in general being closer to the “mostly solved” stage. So even though there nowadays exist more useful tools to improve one’s poker game (including advanced “solver software”), the smartest minds have less of an advantage over the competition. Maybe Magnus would view this as a counterpoint to the points I suggested above, but I’d reply that poker, as a narrow game where humans have already explored the low-hanging fruit, is relevantly different from ‘let’s gain lots of influence and achieve our goals in the real world’.)
My view implies that quick AI takeover becomes more likely as society advances technologically. Intelligence would not be in the tools, but tools amplify how far you can get by being more intelligent than the competition (this might be mostly semantics, though).
(Sure, to some extent the humans are cheating because there is no welfare state and no free education for chimpanzee children. But the point is even if there were, even if today’s society optimized for producing highly capable chimpanzees, they wouldn’t get very far.)
The post Tool use and intelligence: A conversation appeared first on Center on Long-Term Risk.
The post Training neural networks to detect suffering appeared first on Center on Long-Term Risk.
Imagine a data set of images labeled “suffering” or “no suffering”. For instance, suppose the “suffering” category contains documentation of war atrocities or factory farms, and the “no suffering” category contains innocuous images – say, a library. We could then use a neural network or other machine learning algorithms to learn to detect suffering based on that data. In contrast to many AI safety proposals, this is feasible to at least some extent with present-day methods.
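To make this concrete, here is a minimal sketch of what such a classifier could look like, written in Python with PyTorch and torchvision. The folder layout, label names, and hyperparameters are placeholder assumptions for illustration, not part of any existing project.

```python
# Minimal sketch: fine-tune a pretrained image classifier on a hypothetical
# dataset of images labeled "suffering" / "no_suffering".
# Assumes a folder layout like data/train/suffering/*.jpg and
# data/train/no_suffering/*.jpg (both hypothetical).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_data = datasets.ImageFolder("data/train", transform=transform)
loader = DataLoader(train_data, batch_size=32, shuffle=True)

# Start from a network pretrained on ImageNet and replace the final layer
# with a two-class head ("no suffering" vs. "suffering").
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # a handful of epochs is enough for a small dataset
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```

Fine-tuning a pretrained network is used here simply because a dataset of this kind would likely be small; as the next paragraphs argue, even a well-trained classifier of this sort would only pick up visual correlates of suffering, not the underlying concept.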
The neural network could monitor the AI’s output, its predictions of the future, or any other data streams, and possibly shut the system off to make it fail-safe. Additionally, it could scrutinize the AI’s internal reasoning processes, which might help prevent mindcrime.
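Continuing the sketch above under the same assumptions, monitoring could look roughly like the following: score each item in a watched data stream with the trained detector and call a shutdown hook once the score exceeds a threshold. The stream, the shutdown mechanism, and the threshold value are hypothetical stand-ins.

```python
# Minimal fail-safe sketch built on the (hypothetical) detector trained above.
import torch

SUFFERING_THRESHOLD = 0.9  # illustrative value, not a principled choice

def suffering_score(model, image_tensor):
    """Probability the detector assigns to the 'suffering' class for one image."""
    model.eval()
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))
        # Assumes class index 1 corresponds to the "suffering" label.
        return torch.softmax(logits, dim=1)[0, 1].item()

def monitor(model, data_stream, shut_down):
    """Watch a stream (e.g. the AI's outputs or predicted future states,
    rendered as image tensors) and halt the system if suffering is detected."""
    for item in data_stream:
        if suffering_score(model, item) > SUFFERING_THRESHOLD:
            shut_down()  # hypothetical shutdown hook exposed by the monitored system
            break
```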
The naive form of this idea is bound to fail. The neural network would merely learn to detect suffering in images, not the abstract concept. It would fail to recognize alien or “invisible” forms of suffering such as digital sentience. The crux of the matter is the definition of suffering, which raises a plethora of philosophical issues.
An ideal formalization would be comprehensive and at the same time easy to implement in machine learning systems. I suspect that reaching that ideal is difficult, if not impossible, which is why we should also look into heuristics and approximations. Crucially, suffering is “simpler” in a certain sense than the entire spectrum of complex human values, which is why training neural networks – or other methods of machine learning – is more promising for suffering-focused AI safety than for the goal of loading human values.
If direct implementations turn out to be infeasible, we could look into approaches based on preference inference. Just as with any other preference, AI systems can potentially learn the preference to avoid suffering via (cooperative) inverse reinforcement learning. Alternatively, we might program AI systems to infer suffering from the actions and expressions of others. That way, if the AI observes an agent struggling to prevent an outcome1, the AI should conclude that the realization of that outcome may constitute suffering.2
This requires sufficiently accurate models of other minds, which contemporary machine learning systems lack. It is, however, closer to the technical language of real-world AI systems than purely philosophical descriptions such as “a conscious experience with subjectively negative valence”.
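As a toy illustration of the preference-inference route, the sketch below infers how bad an outcome is for an agent from the agent's willingness to pay a cost to avoid it, using a simple Boltzmann-rational choice model over a grid of hypotheses. All numbers and modelling choices are assumptions made purely for illustration; real (cooperative) inverse reinforcement learning systems are far more involved.

```python
# Toy sketch: infer the (dis)value of an outcome from observed avoidance behavior,
# in the spirit of inverse reinforcement learning. Everything here is illustrative.
import numpy as np

# Each observation: the agent could accept outcome A "for free" or pay a cost
# to reach a neutral outcome B instead. True = agent paid the cost to avoid A.
observations = [True, True, True, False, True]
avoidance_cost = 1.0

def choice_likelihood(value_of_A, avoided, beta=1.0):
    """Boltzmann-rational probability of the observed choice, given a
    hypothesised value for outcome A (outcome B is fixed at value 0)."""
    utilities = np.array([value_of_A, -avoidance_cost])  # [accept A, avoid A]
    probs = np.exp(beta * utilities)
    probs /= probs.sum()
    return probs[1] if avoided else probs[0]

# Posterior over hypotheses about how good or bad outcome A is for the agent,
# starting from a uniform prior over a grid of candidate values.
hypotheses = np.linspace(-10.0, 2.0, 121)
posterior = np.ones_like(hypotheses)
for avoided in observations:
    posterior *= np.array([choice_likelihood(v, avoided) for v in hypotheses])
posterior /= posterior.sum()

print("Most probable value of outcome A:", hypotheses[posterior.argmax()])
# Repeatedly paying a cost to avoid A pushes the inferred value of A below zero,
# i.e. evidence that realising outcome A may constitute suffering for the agent.
```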
Further research on the idea could focus on three areas:
The post Training neural networks to detect suffering appeared first on Center on Long-Term Risk.
The post S-risks: Why they are the worst existential risks, and how to prevent them (EAG Boston 2017) appeared first on Center on Long-Term Risk.
S-risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering. The talk was aimed at an audience that is new to this concept. For a more in-depth discussion, see our article Reducing Risks of Astronomical Suffering: A Neglected Priority.
I’ll talk about risks of severe suffering in the far future, or s-risks. Reducing these risks is the main focus of the Foundational Research Institute, the EA research group that I represent.
To illustrate what s-risks are about, I’ll use a fictional story.
Imagine that some day it will be possible to upload human minds into virtual environments. This way, sentient beings can be stored and run on very small computing devices, such as the white egg-shaped gadget depicted here.
Behind the computing device you can see Matt. Matt’s job is to convince human uploads to serve as virtual butlers, controlling the smart homes of their owners. In this instance, human upload Greta is unwilling to comply.
To break her will, Matt increases the rate at which time passes for Greta. While Matt waits for just a few seconds, Greta effectively endures many months of solitary confinement.
Fortunately, this did not really happen. In fact, I took this story and screenshots from an episode of the British TV series Black Mirror.
Not only did it not happen, it’s also virtually certain it won’t happen. No future scenario we’re imagining now is likely to happen in precisely this form.
But, I will argue, there are many plausible scenarios which are in relevant ways like, or even worse than, that one. I’ll call these s-risks, where “s” stands for “suffering”.
I’ll explain presently what s-risks are, and how s-risks may be realized. Next, I’ll talk about why effective altruists may want to focus on preventing s-risk, and through what kinds of work this can be achieved.
The way I’d like to introduce them, s-risks are a subclass of existential risk, often called x-risk. It’ll therefore be useful to recall the concept of x-risk. Nick Bostrom has defined x-risk as follows.
“Existential risk – One where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.”
Bostrom also suggested that one way to understand how x-risk differs from other risk is to look at two dimensions of risk. These two dimensions are their scope and their severity.
We can use these to map different types of risks in a two-dimensional figure. Along the vertical axis, risks are ordered according to their scope. That is, we ask how many individuals would be affected? Would it only be one person, everyone in some region, everyone alive on Earth at one point, or even everyone alive plus future generations? Along the horizontal axis, we map risks according to their severity. That is, we ask how bad an adverse outcome would be for one affected individual.
For example, a single fatal car crash would have terminal severity. In that sense, it’s pretty bad. However, in another sense, it is not as bad as it could be, because it affects only a small number of people – it has personal rather than global or even regional scope. But there are also risks with a greater severity; for example, being tortured for the rest of your life, with no chance of escape, arguably is worse than a fatal car crash. Or, to give a real-life example, consider factory farming. We commonly think that, say, the life of chickens in battery cages is so bad that it’s better not to bring them into existence in the first place. That’s the reason why we think it’s good that the food at this conference is largely vegan.
To come back to the title of my talk, I can now state why s-risks are the worst existential risks. S-risks are the worst existential risks because I’ll define them to have the largest possible scope and the largest possible severity. (I will qualify the claim that s-risks are the worst x-risks later.) That is, I’d like to suggest the following definition.
“S-risk – One where an adverse outcome would bring about severe suffering on a cosmic scale, vastly exceeding all suffering that has existed on Earth so far.”
So, s-risks are roughly as severe as factory farming, but with an even larger scope.
To better understand this definition, let’s zoom in on the part of the map that shows existential risk.
One subclass of risks are those that, with respect to their scope, would affect all future human generations, and, with respect to their severity, would remove everything valuable. One central example of such pan-generational, crushing risks are risks of human extinction.
Risks of extinction have received the most attention so far. But, conceptually, x-risks contain another class of risks. These are risks of outcomes even worse than extinction in two respects. First, with respect to their scope, they not only threaten the future generations of humans or our successors, but all sentient life in the whole universe. Second, with respect to their severity, they not only remove everything that would be valuable but also come with a lot of disvalue – that is, features we’d like to avoid no matter what. Recall the story I told in the beginning, but think of Greta’s solitary confinement being multiplied by many orders of magnitude – for instance, because it affects a very large population of sentient uploads.
Let’s pause for a moment. So far, I’ve introduced the concept of s-risk. To recap, they are risks of severe suffering on a cosmic scale, which makes them a subclass of existential risk.
(Depending on how you understand the “curtail its potential” case in the definition of x-risks, there actually may be s-risks which aren’t x-risks. This would be true if you think that reaching the full potential of Earth-originating intelligent life could involve suffering on an astronomical scale, i.e., the realisation of an s-risk. Think of a quarter of the universe filled with suffering, and three quarters filled with happiness. Considering such an outcome to be the full potential of humanity seems to require the view that the suffering involved would be outweighed by other, desirable features of reaching this full potential, such as vast amounts of happiness. While all plausible moral views seem to agree that preventing the suffering in this scenario would be valuable, they might disagree on how important it is to do so. While many people find it plausible that ensuring a flourishing future is more important, FRI is committed to a family of different views, which we call suffering-focused ethics. (Note: We’ve updated this section in June 2019.))
Next, I’d like to talk about why and how to prevent s-risks.
All plausible value systems agree that suffering, all else being equal, is undesirable. That is, everyone agrees that we have reasons to avoid suffering. S-risks are risks of massive suffering, so I hope you agree that it’s good to prevent s-risks.
However, you’re probably here because you’re interested in effective altruism. You don’t want to know whether preventing s-risks is a good thing, because there are a lot of good things you could do. You acknowledge that doing good has opportunity cost, so you’re after the most good you can do. Can preventing s-risks plausibly meet this higher bar?
This is a very complex question. To understand just how complex it is, I first want to introduce a flawed argument for focusing on reducing s-risk. (I’m not claiming that anyone has advanced such an argument about either s-risks or x-risks.) This flawed argument goes as follows.
Premise 1: The best thing to do is to prevent the worst risks
Premise 2: S-risks are the worst risks
Conclusion: The best thing to do is to prevent s-risk
I said that this argument isn’t sound. Why is that?
Before delving into this, let’s get one potential source of ambiguity out of the way. On one reading, premise 1 could be a value judgment. In this sense, it could mean that, whatever you expect to happen in the future, you think there is a specific reason to prioritize averting the worst possible outcomes. There is a lot one could say about the pros and cons as well as about the implications of such views, but this is not the sense of premise 1 I’m going to talk about. In any case, I don’t think any purely value-based reading of premise 1 suffices to get this argument off the ground. More generally, I believe that your values can give you substantial or even decisive reasons to focus on s-risk, but I’ll leave it at that.
What I want to focus on instead is that, (nearly) no matter your values, premise 1 is false. Or at least it’s false if, by “the worst risks”, we understand what we’ve talked about so far, that is, badness along the dimensions of scope and severity.
When trying to find the action with the highest ethical impact there are, of course, more relevant criteria than scope and severity of a risk. What’s missing are a risk’s probability; the tractability of preventing it; and its neglectedness. S-risks are by definition the worst risks in terms of scope and severity, but not necessarily in terms of probability, tractability, and neglectedness.
These additional criteria are clearly relevant. For example, if s-risks turned out to have probability zero, or if reducing them was completely intractable, it wouldn’t make any sense to try to reduce them.
We must therefore discard the flawed argument. I won’t be able to definitively answer the question under what circumstances we should focus on s-risk, but I’ll offer some initial thoughts on the probability, tractability, and neglectedness of s-risks.
I’ll argue that s-risks are not much more unlikely than AI-related extinction risk. I’ll explain why I think this is true and will address two objections along the way.
You may think “this is absurd”: we can’t even send humans to Mars, so why worry about suffering on cosmic scales? This was certainly my immediate, intuitive reaction when I first encountered related concepts. But as EAs, we should be wary of taking such intuitive, ‘system 1’ reactions at face value. For we are aware that a large body of psychological research in the “heuristics and biases” approach suggests that our intuitive probability estimates are often driven by how easily we can recall a prototypical example of the event we’re considering. For types of events that have no precedent in history, we can’t recall any prototypical example, and so we will systematically underestimate the probability of such events if we aren’t careful.
So we should critically examine this intuitive reaction of s-risks being unlikely. If we do this, we should pay attention to two technological developments, which are at least plausible and which we have reason to expect for unrelated reasons. These are artificial sentience and superintelligent AI, the latter unlocking many more technological capabilities such as space colonization.
Artificial sentience refers to the idea that the capacity to have subjective experience – and in particular, the capacity to suffer – is not limited to biological animals. While there is no universal agreement on this, in fact most contemporary views in the philosophy of mind imply that artificial sentience is possible in principle. And for the particular case of brain emulations, researchers have outlined a concrete roadmap, identifying concrete milestones and remaining uncertainties.
As for superintelligent AI, I won’t say more about this because this is a technology that has received a lot of attention from the EA community. I’ll just refer you to Nick Bostrom’s excellent book on the topic, called Superintelligence, and add that s-risks involving artificial sentience and “AI gone wrong” have been discussed by Bostrom under the term mindcrime.
But if you only remember one thing about the probability of s-risk, let it be this: this is not Pascal’s wager! In brief, as you may recall, Pascal lived in the 17th century and asked whether we should follow religious commandments. One of the arguments he considered was that, no matter how unlikely we think it is that God exists, it’s not worth risking ending up in hell. In other words, hell is so bad that you should prioritize avoiding it, even if you thought hell was very unlikely.
But that’s not the argument we’re making with respect to s-risk. Pascal’s wager invokes a speculation based on one arbitrarily selected ancient collection of books. Based on this, one cannot defensibly claim that the probability of one type of hell is greater than the probability of competing hypotheses.
By contrast, worries about s-risk are based on our best scientific theories and a lot of implicit empirical knowledge about the world. We consider all the evidence we have, and then articulate a probability distribution over how the future may unfold. Since predicting the future is so hard, the remaining uncertainty will be quite high. But this kind of reasoning could in principle justify concluding that s-risk is not negligibly small.
OK, but maybe an additional argument comes to your mind: since a universe filled with a lot of suffering is a relatively specific outcome, you may think that it’s extremely unlikely that something like this will happen unless someone or something intentionally optimised for such an outcome. In other words, you may think that s-risks require evil intent, and that such evil intent is very unlikely.
I think one part of this argument is correct: I agree that it’s very unlikely that we’ll create an AI with the terminal goal of creating suffering, or that humans will intentionally create large numbers of suffering AIs. But I think that evil intent accounts for only a tiny part of what we should be worried about, because there are two other, more plausible routes: accidents and conflict.
For example, consider the possibility that the first artificially sentient beings we create, potentially in very large numbers, may be “voiceless” — unable to communicate in written language. If we aren’t very careful, we might cause them to suffer without even noticing.
Next, consider the archetypical AI risk story: a superintelligent paperclip maximizer. Again, the point is not that anyone thinks this particular scenario is very likely. It’s just one example of a broad class of scenarios in which we inadvertently create a powerful agent-like system that pursues some goal that is neither aligned with our values nor actively evil. The point is that this paperclip maximizer may still cause suffering for instrumental reasons. For instance, it may run sentient simulations to find out more about the science of paperclip production, or to assess how likely it is to encounter aliens (who might disrupt paperclip production); alternatively, it may spawn sentient “worker” subprograms for which suffering plays a role in guiding action, similar to the way humans learn not to touch a hot plate.
A “voiceless first generation” of sentient AIs and a paperclip maximizer that creates suffering for instrumental reasons are two examples of how an s-risk may be realised not by evil intent, but by accident.
Third, s-risks may arise as part of a conflict.
To understand the significance of the third point, remember the story from the beginning of this talk. The human operator Matt wasn’t evil in the sense that he intrinsically valued Greta’s suffering. He just wanted to make sure that Greta complies with the commands of her owner, whatever these are. More generally, if agents compete for a shared pie of resources, there is a risk that they’ll engage in negative-sum strategic behavior that causes suffering even if everyone disvalues suffering.
The upshot is that risks of severe suffering don’t require rare motivations such as sadism or hatred. History offers plenty of evidence for this worrying principle; consider, for instance, most wars, or factory farming. Neither is caused by evil intent.
By the way, in case you were wondering, the Black Mirror story wasn’t an s-risk, but we can now see that it illustrates two major points: first, the importance of artificial sentience and, second, severe suffering caused by an agent without evil intent.
To conclude: to be worried about s-risk, we don’t need to posit any new technology or any qualitatively new feature above what is already being considered by the AI risk community. So I’d argue that s-risks are not much more unlikely than AI-related x-risks. Or at the very least, if someone is worried about AI-related x-risk but not s-risk, the burden of proof is on them.
So how tractable is reducing s-risk? I acknowledge that this is a challenging question. However, I do think there are some things we can do today to reduce s-risk.
First, there is some overlap with familiar work in the x-risk area. More specifically, some work in technical AI safety and AI policy effectively addresses both risks of extinction and s-risks. That being said, any specific piece of work in AI safety may be much more relevant for one type of risk than the other. To give you a toy example, if we could make sure that a superintelligent AI by default shuts down in 1000 years, this wouldn’t do much to reduce extinction risk, but it would prevent long-lasting s-risks. For some more serious thoughts on differential progress within AI safety, I refer you to the Foundational Research Institute’s technical report on Suffering-focused AI Safety.
So the good news is, we’re already doing work that reduces s-risk. But note that this isn’t true for all work on existential risk; for example, building disaster shelters, or making it less likely that we’re all wiped out by a deadly pandemic, may reduce the probability of extinction — but, to a first approximation, it doesn’t change the trajectory of the future conditional on humans surviving.
Besides more targeted work, there are also broad interventions which plausibly prevent s-risks more indirectly. For instance, strengthening international cooperation could decrease the likelihood of conflicts, and we’ve seen that negative-sum behaviour in conflicts is one potential source of s-risks. Or, going meta, we can do research aimed at identifying which types of broad intervention are effective at reducing s-risks. The latter is part of what we’re doing at the Foundational Research Institute.
I’ve talked about whether there are interventions that reduce s-risks. There is another aspect of tractability: will there be sufficient support to carry out such interventions? For instance, will there be enough funding? We may worry that talk of cosmic suffering and artificial sentience is well beyond the window of acceptable discourse – or, in other words, that s-risks are just too weird.
I think this is a legitimate worry, but I don’t think we should conclude that preventing s-risk is futile. For remember that 10 years ago, worries about risks from superintelligent AI were nearly universally ridiculed, dismissed or misrepresented as being about The Terminator.
By contrast, today we have Bill Gates blurbing a book that talks about whole brain emulations, paperclip maximizers, and mindcrime. That is, the history of the AI safety field demonstrates that we can raise significant support for seemingly weird cause areas if our efforts are backed up by solid arguments.
Last but not least, how neglected is work on s-risk?
It’s clearly not totally neglected. I said before that AI safety and AI policy can reduce s-risk, so arguably some of the work of, say, the Machine Intelligence Research Institute or the Future of Humanity Institute is effectively addressing s-risk.
However, it seems to me that s-risk gets much less attention than extinction risk. In fact, I’ve seen existential risk being equated with extinction risk.
In any case, I think that s-risk gets less attention than is warranted. This is especially true for interventions specifically targeted at reducing s-risk, that is, interventions that wouldn’t also reduce other classes of existential risk. There may well be low-hanging fruit here, since existing x-risk work is not optimized for reducing s-risk.
As far as I’m aware, the Foundational Research Institute is the only EA organization that explicitly focuses on reducing s-risk.
To summarize: both empirical and value judgments are relevant to answering the question of whether to focus on reducing s-risk. Empirically, the most important questions are: how likely are s-risks? How easy is it to prevent them? Who else is working on this?
Regarding their probability, s-risks may be unlikely, but they are far more than a mere conceptual possibility. We can see just which technologies could plausibly cause severe suffering on a cosmic scale, and overall s-risks don’t seem much more unlikely than AI-related extinction risks.
The most plausible of these s-risk-enabling technologies are artificial sentience and superintelligent AI. Thus, to a first approximation, these cause areas are much more relevant to reducing s-risk than other x-risk cause areas such as biosecurity or asteroid deflection.
Second, reducing s-risk is at least minimally tractable. We probably haven’t yet found the most effective interventions in this space. But we can point to some interventions which reduce s-risk and which people work on today — namely, some current work in AI safety and AI policy.
There are also broad interventions which may indirectly reduce s-risk, but we don’t understand the macrostrategic picture very well yet.
Lastly, s-risk seems to be more neglected than extinction risk. At the same time, reducing s-risk is hard and requires pioneering work. I’d therefore argue that FRI occupies an important niche with a lot of room for others to join.
That all being said, I don’t expect to have convinced every single one of you to focus on reducing s-risk. I think that, realistically, we’ll have a plurality of priorities in the community, with some of us focusing on reducing extinction risk, some of us focusing on reducing s-risk, and so on.
Therefore, I’d like to end this talk with a vision for the far-future shaping community.
Those of us who care about the far future face a long journey. But it’s a misrepresentation to frame this as a binary choice between extinction and utopia.
But in another sense, the metaphor was apt. We do face a long journey, but it’s a journey through hard-to-traverse territory, and on the horizon there is a continuum ranging from a hellish thunderstorm to the most beautiful summer day. Interest in shaping the far future determines who’s (locked) in the vehicle, but not what more precisely to do with the steering wheel. Some of us are most worried about avoiding the thunderstorm, while others are more motivated by the existential hope of reaching the summer day. We can’t keep track of the complicated network of roads far ahead, but we have an easy time seeing who else is in the car, and we can talk to them – so maybe the most effective thing we can do now is to compare our maps of the territory, and to agree on how to handle remaining disagreements without derailing the vehicle.
Thank you.
For more information on s-risk, check out foundational-research.org. If you have any questions, feel free to email me at max@foundational-research.org or get in touch via Facebook.
The post S-risks: Why they are the worst existential risks, and how to prevent them (EAG Boston 2017) appeared first on Center on Long-Term Risk.
We were moved by the many good reasons to make conversations public. At the same time, we felt the content we wanted to publish differed from the articles on our main site in the following ways:
Content we plan to publish on this blog before the end of July includes:
Several of our current and former researchers publish content on their personal websites or blogs:
Researchers will maintain their personal websites as they anticipate that not all of the content they’ll publish there will be relevant to FRI. Content that concerns the research of current staff members will, however, be crossposted to the FRI blog. This way, readers who are primarily interested in FRI-related content need only follow one blog.
Similarly, we anticipate that not all FRI blog posts would qualify as a good EA forum post, but we plan to publish the ones that do there. We’ve also (cross-)posted to LessWrong and the Intelligent Agent Foundations Forum and plan to continue doing so.
We’ll explore enabling comments on this blog as this may encourage additional readers to engage with our work and to provide feedback. I expect to reply to questions being asked in the comments.
The post Launching the FRI blog appeared first on Center on Long-Term Risk.
We address the moral importance of fish, invertebrates such as crustaceans, snails and insects, and other animals about which there is qualified scientific uncertainty about their sentience. We argue that, on a sentientist basis, one can at least say that how such animals fare makes ethically significant claims on our character. It is a requirement of a morally decent (or virtuous) person that she at least pays attention to and is cautious regarding the possibly morally relevant aspects of such animals. This involves having a moral stance, in the sense of patterns of perception, such that one notices such animals as being morally relevant in various situations. For the person who does not already consider these animals in this way, this could be a big change in moral psychology, and can be assumed to have behavioural consequences, albeit indeterminate ones. Character has been largely neglected in the literature, which focuses on act-centred approaches (i.e. that the evidence on sentience supports, or does not support, taking some specific action). We see our character-centred approach as complementary to, not superior to, act-centred approaches. Our approach has the advantage of allowing us to make ethically interesting and practically relevant claims about a wider range of cases, but it has the drawback of providing less specific action guidance.
The post A Virtue of Precaution Regarding the Moral Status of Animals with Uncertain Sentience appeared first on Center on Long-Term Risk.
The post Die ethische Relevanz von Wildtierleid appeared first on Center on Long-Term Risk.
Translated by Evgueni Kivman
This is a preliminary translation of the English original. An improved translation will be posted in the coming weeks.
The number of wild animals vastly exceeds the number of animals on factory farms, in laboratories, or kept as so-called pets. Animal advocates should therefore consider whether to direct more of their attention toward raising concern about the suffering that occurs in nature. While in theory this could eventually lead to directly designing more humane ecosystems, I think that in practice activists should focus on raising awareness of wild-animal suffering among other activists, academics, and other sympathetic groups. The massive scale of suffering occurring in nature right now is of course tragic, but it pales in comparison to the good or bad things our descendants, equipped with advanced technological capabilities, could bring about. For example, I worry that future humans may spread life to other planets (directly or indirectly) or create sentient simulations without much concern for the consequences for wild animals. Our number-one priority should be to ensure that future human intelligence is used to prevent wild-animal suffering rather than to multiply it.
I personally believe that most animals (except perhaps those that live a long time, say more than three years) probably have lives that are not worth living, because, if I could choose, I would rather give up a few years of life than endure the pain of an average death in the wild, and even that assumes their lives are otherwise net positive (which is itself questionable, given cold, hunger, disease, fear of predators, and all the rest).
Admittedly, this belief of mine is somewhat controversial. I think the claim that nature contains net suffering in expectation requires only a weaker assumption, namely that almost all of the expected happiness and suffering in nature comes from small animals (e.g. small fish or insects). The adults of these species live at most a few years, often only a few months or weeks, so in these cases it is even harder for the corresponding happy moments of a life to outweigh the pain of death. Moreover, most of the offspring of these species die (probably painfully) only days or weeks after coming into existence, because most of these species are r-selected; see Type III in this graph.
“The total amount of suffering per year in the natural world is beyond all decent contemplation. During the minute it takes me to write this sentence, thousands of animals are being eaten alive, others are running for their lives, others are being slowly devoured from within by parasites, and thousands are dying of starvation, thirst, or disease.”
– Richard Dawkins, River Out of Eden[Dawkins]
“Many people look at nature from an aesthetic perspective and think about biodiversity or the health of ecosystems, but forget that the animals that belong to these ecosystems are individuals with needs of their own. Disease, starvation, predation, ostracism, and sexual frustration are endemic in so-called healthy ecosystems. The great taboo in the animal rights movement is that most suffering is due to natural causes.”
– Albert, a fictional dog in “Golden” by the philosopher Nick Bostrom[Bostrom-Alfred]
“The moralistic fallacy is that what is good is found in nature. It lies behind the bad science in nature-documentary voiceovers: lions euthanize the weak and the sick, mice feel no pain when cats eat them, dung beetles recycle dung to benefit the ecosystem, and so on.”
– Steven Pinker[Pinker]
“People who accuse us of showing too much violence [should see] what we leave in the wastebasket.”
– David Attenborough on nature documentaries[Attenborough]
“In sober truth, nearly all the things that people are hanged or imprisoned for doing to one another are everyday performances of nature. [...] The phrases that ascribe perfection to the course of nature can only be regarded as exaggerations born of poetic feeling, not intended to withstand sober examination. No one, religious or irreligious, believes that the hurtful agencies of nature, considered as a whole, promote good purposes in any way other than by inciting rational human beings to rise up and struggle against them.”
– John Stuart Mill, “On Nature”[Mill]
Animal advocates typically concentrate their efforts on areas where humans interact directly with members of other species, such as factory farms, animal experimentation, or, to a much lesser extent, zoos, circuses, rodeos, and the like.
Animal suffering in the wild is rarely a subject of discussion, even in the academic literature, though there have been some notable exceptions.[exceptions] In this piece I stress that the numbers of wild animals affected by humans are simply far too large for animal advocates to ignore. Intense suffering is a regular feature of life in the wild, which calls not for quick fixes but at least for long-term research into wild-animal welfare and into technologies that might one day allow humans to improve it. I close this section by encouraging animal advocates to focus their efforts on promoting concern for wild-animal suffering among other activists, academics, and others who would be sympathetic, both to encourage research in this area and to help ensure that our descendants use their advanced technologies to reduce wild-animal suffering rather than unintentionally multiply it.
The animal suffering caused by humans is vast, and animal advocates are right to be shocked by its scale. Yet the numbers of animals living in the wild are staggeringly larger. For rough population estimates, see my “How Many Wild Animals Are There?”[Tomasik-numbers]
Just like their domesticated relatives, animals in the wild lead lives filled with emotion.[emotions] Unfortunately, many of these emotions are intensely painful, often pointlessly so. And while “cruel nature” passes as a truism, its visceral meaning is easy to overlook. Below I review some of the details of wild-animal suffering, much as animal advocates denounce human cruelty.
When people picture suffering in nature, the first image that probably comes to mind is that of a lioness hunting her prey. Christopher McGowan, for example, vividly describes the death of a zebra:
The lioness sinks her scimitar talons into the zebra’s rump. They tear through the tough hide and anchor deep in the muscle. The startled animal lets out a loud bellow as its body hits the ground. A moment later the lioness releases her claws from its haunches and sinks her teeth into the zebra’s throat, choking off the sound of terror. Her canine teeth are long and sharp, but an animal as large as a zebra has a massive neck, with a thick layer of muscle beneath the skin, so although the teeth pierce the hide they are still too short to reach any major blood vessels. She must therefore kill the zebra by suffocation, clamping her powerful jaws around its trachea and cutting off the air to its lungs. It is a slow death. If this had been a smaller animal, say a Thomson’s gazelle about the size of a large dog, she would have bitten through the nape of its neck; her canine teeth would then probably have crushed the vertebrae or the base of the skull, causing instant death. The zebra’s death throes will last five or six minutes.[McGowan, pp. 12-13]
Some predators kill fairly quickly, such as constricting snakes, which cut off their victims’ air supply and cause unconsciousness within a minute or two,[eaten-alive] while others inflict a more protracted death, such as hyenas, which tear off one piece of flesh after another.[Kruuk] Wild dogs rip open the bellies of their prey,[McGowan, p. 22] venomous snakes cause internal bleeding and paralysis over several minutes,[McGowan, pp. 49] and crocodiles drown large animals held in their grip.[McGowan, pp. 43]
A snake-keeping handbook explains: “Live mice will fight for their lives when seized, and will bite, kick, and scratch for as long as they can.”[Flank] Once caught, “the snake soaks the prey in saliva and finally pulls it into its esophagus. From there it uses its muscles to squeeze the food along the digestive tract, where it is broken down into nutrients.”[Perry]
Prey animals do not always die immediately after being swallowed, as illustrated by the fact that some toxic newts, once swallowed by a snake, secrete poisons that kill their captor and then crawl back out of its mouth.[McGowan, pp. 59] And regarding house cats, Bob Sallinger of the Audubon Society of Portland has noted: “People horrified by the indiscriminate killing of wildlife by, for example, leg-hold traps should recognize that the pain and suffering inflicted by hunting cats is no different, and that the impact of leg-hold traps is dwarfed by comparison.”[Sallinger]
It is possible that some animals do not suffer as intensely during predation if enough endorphins are released; similarly, humans often do not feel pain immediately after an injury.[Wall] But there are many examples of prey fighting back violently against their attackers. In this video, for instance, the warthog screams for roughly two and a half minutes while it is being suffocated. Moreover, if endorphins really do lessen the pain of dying, the same argument should apply to farm animals brutally slaughtered by humans; yet most animal-welfare researchers regard bad slaughter methods as especially painful.
Fear of predators causes not only immediate suffering but possibly also long-lasting psychological trauma. In a study of anxiolytics, researchers exposed mice to a cat for five minutes and observed their subsequent reactions. They found “that this model of mice exposed to inescapable predator stimuli induces early cognitive changes similar to those seen in patients with acute stress disorder.”[ElHagePeronnyGriebelBelzung] Another study found long-term effects on the mice’s brains: “Predatory threat produced significant learning impairments in the maze (16 to 22 days later) and in the test of recognition of the spatial rearrangement of objects (26 to 28 days later). These findings indicate that memory impairments can persist for prolonged periods after predator stress.”[ElHageGriebelBelzung] In a similar experiment, Phillip R. Zoladz exposed rats to inescapable predators and other fear-inducing conditions in order to “produce changes in the rats’ physiology and behavior comparable to the symptoms observed in PTSD patients.”[Zoladz]
Even for prey animals that never have a traumatic encounter with a predator, the “ecology of fear” that predators create can be deeply distressing: “In studies of elk, scientists have found that the presence of wolves alters their behavior almost constantly, as they try to avoid encounters, keep escape routes open, and remain permanently vigilant.”[Stauth]
One might argue that evolution should avoid making animals’ lives excessively dreadful for long periods before death, because, at least in more complex species, this could lead to PTSD, depression, or other debilitating side effects. Of course, we see empirically that evolution does in fact produce such disorders when traumatic events occur, such as encounters with predators. But there is probably a meaningful limit to how terrible things can be most of the time if animals are to keep functioning. Death itself is a different matter: once it is unavoidable, evolutionary pressure no longer constrains the emotional experience. Death can be nearly painless (for a few lucky animals) or as bad as torture (for many others). Evolution has no reason to prevent death from being unbearably horrible.[Dawkins]
Of course, predator attacks are not the only reason organisms die in agony.
Animals are also laid low by diseases and parasites, which can bring listlessness, shivering, ulcers, pneumonia, starvation, violence, or other gruesome symptoms leading to death over a period of days or weeks. Avian salmonellosis is just one example:
Signs range from sudden death to a gradual onset of depression over one to three days, accompanied by huddling of the birds, fluffed-up feathers, unsteadiness, shivering, loss of appetite, markedly increased or absent thirst, rapid loss of weight, accelerated breathing, and watery yellow, green, or blood-tinged droppings. The vent feathers become matted with excrement, the eyes begin to close, and immediately before death some birds show apparent blindness, incoordination, staggering, tremors, convulsions, or other nervous signs.[Salmonellosis]
Still other animals die in accidents, of dehydration during summer droughts, or of food shortages during winter. For instance, 2006 was a harsh year for bats in Placerville, California:
“You can see their ribs, their backbones, and (the area) where the intestines and stomach are is completely sunken in toward the back,” said Dharma Webber, founder of the California Native Bat Conservancy. [...] She said the newly emerging mosquitoes are not enough to feed the creatures. “That would be like us eating a little piece of popcorn here and there,” she said.[bats]
(Of course, when the bats do find food, that is not good news for their prey...)
Even ice storms can be fatal: “Birds unable to find a sheltered perch during the storm may have their feet frozen to a branch or their wings coated with ice, leaving them unable to fly. Grouse often suffocate when they are buried beneath drifting snow and sealed in by a layer of ice.”[Heidorn]
While death is often the culmination of the suffering in an animal’s life, everyday existence is not necessarily pleasant either. Unlike most humans in the industrialized world, wild animals do not have immediate access to food whenever they get hungry. They must constantly search for water and for shelter while watching out for predators. Most animals cannot go indoors, as we do, when it starts to rain, or turn on the heat when winter temperatures drop well below normal. In short:
It is often assumed that wild animals live in a kind of natural paradise and that it is only the presence or interference of humans that causes them to suffer. This view, essentially Rousseauian, is at odds with the wealth of information that field studies of animal populations have made available. Scarcity of water and food, predation, disease, and intraspecific aggression are some of the factors that have been identified as normal parts of a wild environment that regularly cause suffering in wild animals.[UCLA, p. 24]
And the fact that many animals appear to endure these conditions calmly does not necessarily mean they are not suffering.[BourneEtAl] Sick and injured prey are the easiest to catch, so predators deliberately target such individuals. As a consequence, the prey animals that appear sick or injured are usually the first to be killed, so evolutionary pressure pushes prey animals not to show their suffering.[Nuffield, ch. 4.12, p. 66]
On the basis of studies of stress-hormone concentrations in domesticated and wild animals, Christie Wilcox has concluded that “if we follow welfare guidelines that provide food, water, comfort, and what is needed for behavioral expression, then domesticated animals are not just likely to be as happy as their wild relatives; they are probably happier.” She has also observed:
So the real question is whether, at a given moment, a domesticated or captive animal is happier than, less happy than, or just as happy as its wild counterpart. There are a few key conditions classically believed to produce a “happy” animal by preventing undue stress. These are the basis for most animal-cruelty laws, including those in the US and the UK. They include that animals have the “right” to:
- Enough food and water
- Comfortable conditions (temperature, etc.)
- The expression of normal behavior
When it comes to wild animals, though, only the last of these is guaranteed. They must fight to survive every day, from finding food and water to finding another individual to mate with. They have no right to comfort, stability, or health. [...] By the standards our governments have set, the life of a wild animal is animal cruelty.
In nature, the species with the most individuals are probably the ones that are, by and large, worst off. Small mammals and birds have life expectancies of at most one to three years before they arrive at a painful death. And many insects count their time on Earth in weeks rather than years; the horn fly, for example, lives only 2-4 weeks.[Cumming] I personally would rather not exist than be born as an insect, struggle through the world for a few weeks, and then die of dehydration or in a spider’s web. Worse still might be to find myself trapped for 12 hours in the torture rack of a Dinoponera ant[BBC] or to be eaten alive for weeks by a parasitoid wasp.[Gould, pp. 32-44] (On the other hand, it is unclear whether caterpillars being eaten by parasitoid wasps feel pain during the experience.)
It is true that scientists remain undecided about whether insects feel pain in a way we would regard as conscious suffering.[insect-pain] But the fact that there is still serious debate about this suggests we should not rule out the possibility. And when we consider the number of insects, around 10^18,[Williams] and the number of copepods in the oceans, which is of a similar order of magnitude,[SchubelButman] the mathematical “expected value” (probability times quantity) of their suffering is enormous. I should mention that the force of this point would be weakened if, as may be the case, the “intensity” or “degree” of an animal’s emotional experience somehow depends on the amount of neural tissue devoted to pain signals.
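To make the “probability times quantity” reasoning above a little more concrete, here is a minimal back-of-the-envelope sketch in Python. Only the rough 10^18 insect count comes from the text; the sentience probabilities, the comparison population, and the intensity weights below are hypothetical placeholders chosen purely for illustration, not estimates made in this essay.

```python
# Illustrative sketch of the "expected value" reasoning above:
#   expected weight ~= P(sentience) * population * per-individual intensity weight
# The ~10^18 insect count is the figure cited in the text (Williams); every other
# number below is a hypothetical placeholder, not an estimate from the essay.

def expected_weight(p_sentience: float, population: float, intensity: float) -> float:
    """Probability-weighted amount of morally relevant experience."""
    return p_sentience * population * intensity

# Hypothetical inputs, purely for illustration.
insects = expected_weight(p_sentience=0.1, population=1e18, intensity=0.01)
large_land_vertebrates = expected_weight(p_sentience=0.95, population=1e11, intensity=1.0)

print(f"insects:                {insects:.2e}")
print(f"large land vertebrates: {large_land_vertebrates:.2e}")
# Even with a low sentience probability and a steep discount for neural complexity,
# the sheer number of insects dominates this toy comparison.
```

The only point of the sketch is that, across a wide range of placeholder values, the enormous population term tends to dominate the product.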
Tables of life expectancies for particular species usually report the survival times of adult members of that species. Most individuals, however, die earlier, before reaching maturity. This is a simple consequence of the fact that females give birth to far more offspring than can survive if the population is to stay stable. While humans can produce only one child per reproductive cycle (twins aside), the corresponding figures are, for example, 1-22 offspring for dogs (Canis familiaris), 4-6 eggs for starlings (Sturnus vulgaris), 6,000-20,000 eggs for North American bullfrogs (Rana catesbeiana), and 2 million eggs for bay scallops (Argopecten irradians).[SolbrigSolbrig, p. 37] See the graph in Thomas J. Herbert’s article[Herbert] on r- and K-selection, which shows the extremely high infant mortality of “r-strategists.” Most small animals, such as minnows and insects, are r-strategists.
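As a rough illustration of how high infant mortality must be among r-strategists, here is a small Python sketch using the fecundity figures cited above. The assumption that a stable population implies roughly one surviving offspring per parent, and the treatment of a single clutch as lifetime output for the starling, are simplifications of mine rather than claims from the essay.

```python
# Rough illustration: if a population is stable, each parent is on average replaced by
# about one offspring that survives to reproduce, so the share of offspring dying before
# maturity is roughly 1 - 1/fecundity.
# Fecundity figures are those cited in the text (Solbrig & Solbrig); treating a single
# clutch as lifetime output (starling) and assuming exactly one surviving replacement
# per parent are deliberate simplifications.

lifetime_fecundity = {
    "starling (eggs)": 5,             # 4-6 eggs per clutch
    "bullfrog (eggs)": 13_000,        # 6,000-20,000 eggs
    "bay scallop (eggs)": 2_000_000,  # ~2 million eggs
}

for species, offspring in lifetime_fecundity.items():
    share_dying_young = 1 - 1 / offspring
    print(f"{species:>20}: ~{share_dying_young:.4%} of offspring die before maturity")
```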
Admittedly, it is unclear whether all of these species are sentient, and even more so for the share of eggs that never hatch, but again: in expected-value terms, the amount of expected suffering is enormous.
This strategy of making many copies and hoping that a few work out may make perfect sense from an evolutionary standpoint, but the cost to the individual organisms is immense. In an analysis of the welfare implications of population dynamics, Matthew Clarke and Yew-Kwang Ng conclude: “The offspring number of a species that maximizes fitness may entail suffering and differs from the number that would maximize welfare (average or total).”[ClarkeNg, sec. 4] And in the related paper “Towards Welfare Biology: Evolutionary Economics of Animal Consciousness and Suffering,” Ng concludes from the excess of offspring relative to surviving adults: “Assuming concave and symmetrical functions relating costs to enjoyment and suffering, evolutionary economizing results in an excess of total suffering over total enjoyment.”[Ng, p. 272]
In his well-known article “Animal liberation and environmental ethics: Bad marriage, quick divorce,”[Sagoff] Mark Sagoff quotes the following passage from Fred Hapgood:[Hapgood]
All species reproduce in excess, far beyond their ecological niche. In a lifetime a lioness might have 20 cubs, a pigeon 150 chicks, a mouse 1,000 young, a trout 20,000 offspring, a tuna or cod a million, [...] and an oyster perhaps a hundred million. If one assumes that the population of each of these species remains roughly constant from generation to generation, then on average only one offspring will survive to replace each parent. All the others, the thousands or millions, will die one way or another.
Sagoff continues: “Compared with the misery of animals in nature, which humans can do much to relieve, every other form of suffering pales. Mother Nature is so cruel to her children that she makes Frank Perdue look like a saint.”
The previous section explained that the parents of an r-selected species can have hundreds or even tens of thousands of offspring, almost all of whom die soon after coming into existence.
But one question remains: what fraction of these offspring were sentient at the time of death, and what fraction died without awareness, as an egg or larva?
The European Food Safety Authority’s (EFSA’s) “Aspects of the biology and welfare of animals used for experimental and other scientific purposes” (pp. 37-42) examines when the fetuses of various species begin to consciously feel pain.[EFSA] The report notes that the age at which awareness begins varies depending on whether a species is precocial (well developed at birth, e.g. horses) or altricial (still developing at birth, e.g. marsupials). Precocial animals are more likely to feel pain at an earlier age. It also matters whether a species is live-bearing or egg-laying. Live-bearing animals have a greater need to suppress fetal awareness during development in order to prevent injury to the mother and siblings; egg-laying animals, constrained by the shell, have less need to restrain themselves before hatching. (p. 38)
For this reason the report suggests: “If awareness is the criterion for protection, birds, reptiles, amphibians, fish, and cephalopods might therefore more obviously need protection before hatching than mammals need protection before birth.” (p. 38) For example: “Sensory and neural development in a precocial bird such as the domestic chicken is already advanced several days before hatching. Controlled movements and coordinated electrophysiological and behavioral responses to auditory and visual stimuli appear three or four days before hatching, which occurs after 21 days of incubation (Broom, 1981).” (p. 39) By contrast: “Although mammalian fetuses can show physical responses to external stimuli, the weight of current evidence suggests that awareness does not arise in the fetus before it is delivered and begins to breathe.” (p. 42)
It therefore seems clear that many animals are capable of suffering at the time of their birth, if not before.
The stage of development at which the risk [of suffering] is sufficient for protection to be needed is the stage at which the normal locomotion and sensory functioning of an individual can occur independently of the egg or the mother. For air-breathing animals this point is generally no later than the time at which the fetus could survive unaided outside the uterus or the egg. For most vertebrates, the stage of development at which there is a risk of suffering when procedures are carried out on them is the beginning of the last third of development within the egg or the mother. For a fish, amphibian, cephalopod, or decapod it is the point at which it is capable of feeding itself rather than depending on the food supply from the egg. [...] (p. 3) Most amphibians and fish have larval forms that are not well developed at hatching but develop rapidly with the experience of independent life[.] Those fish and amphibians that are well developed at hatching or are live-bearing, and all cephalopods, since all of them are small but well developed at hatching, will have a functioning nervous system and the potential for awareness some time before hatching. (p. 38)
Another consideration suggesting pain before birth is the fact that many egg-laying vertebrates can also hatch early in response to external stimuli (including vibrations that feel like a predator).
One example concerns skinks: “Experiments simulating external stimuli such as those produced by predators induced hatching both in nests (in rock crevices) and in eggs that had been removed from nests. The hatching process was explosive: early-hatching embryos hatched within seconds and ran an average of 40 cm away from the egg.”[DoodyPaull] Early hatching has likewise been observed in amphibians, fish, and invertebrates.[DoodyPaull]
These points suggest that a large fraction of the many offspring of r-selected species may well be conscious during their painful deaths after only a few days, or even hours, of life.
It is risky to extrapolate from how we imagine we would feel in a given situation to the welfare of wild animals. We can imagine immense discomfort at having to sleep through the storm of a cold winter night with only a T-shirt to keep us warm, but many animals have better coats and can often find some kind of shelter. More generally, it seems unlikely that species would gain adaptive advantages from enduring constant misery, since stress wastes energy.[Ng] Likewise, r-strategists may suffer less from a given injury than long-lived animals do, because r-strategists have less to lose by taking big short-term risks.[Tomasik-short-lived]
Nonetheless, we should be careful that our own biases do not lead us to underestimate the extent and intensity of wild-animal suffering. You, the reader, are probably sitting in the comfort of a climate-controlled building or vehicle, with a relatively full stomach and no fear of being attacked. Most of us in the industrialized West go through life in a relatively euthymic state, and it is easy to assume that the general ease with which life greets us is shared by most other humans and animals. When we think of nature, we are more likely to picture chirping birds or gazelles frolicking happily than a deer having its flesh chewed off while conscious or a raccoon immobilized by a roundworm infestation. And of course all of the examples mentioned, because they involve large land animals, reflect my human tendency toward the “availability heuristic”: in fact, the most numerous wild animals are small organisms, many of them living in water. When we think of “wild animals,” we should (if we adopt the expected-value approach to sentience) think of ants, copepods, and small fish rather than lions or gazelles.
Some beings cannot, at a given moment, accurately assess how they feel overall across a long stretch of time.[KahnemanSugden] They often display “rosy prospection” about future events and “rosy retrospection” about the past, assuming that their past and future levels of well-being are better than what they report at the moment of the experience itself.[MitchellThompson] Moreover, even when organisms judge their levels of well-being correctly, they often display a “will to live” that is largely independent of pleasure or pain. Animals that conclude their lives are not worth living, and therefore end them, do not tend to reproduce successfully.
In the end, however, it remains beyond dispute that many animals in the wild have to endure terrible experiences, regardless of exactly how well or badly we rate life in the wild overall.
Finally, there are some claims that animals do in fact commit suicide, though others doubt this. I am personally skeptical, because there are few well-documented cases of suicide in animals and it is easy for rumors about phenomena that are not real to spread. Still, I do not doubt that some animals behave differently after an emotional loss.
Why, then, is the suffering of wild animals not the highest priority for animal advocates? One reason is philosophical: some believe that while humans have duties to treat well the animals they use or live with, they bear no responsibility toward those outside their sphere of interaction. I find this unsatisfying; if we really care about animals because we do not want beings like us to suffer cruelly, and not merely because we want to keep our own “moral house” in order, then it should not matter whether or not we have a personal connection to wild animals.
Other philosophers agree but defend human inaction by claiming that humans are ultimately helpless in the face of the situation. When Peter Singer was asked whether we should stop lions from eating gazelles, he replied:
[...] for practical purposes, judging from the history of human attempts to shape nature to human ends, I am fairly confident that if we were to intervene in the wild, we would be more likely to increase net animal suffering than to reduce it. Lions play an important role in the ecology of their habitat, and we cannot be sure what the long-term consequences would be if we tried to prevent them from killing gazelles. [...] So, practically speaking, I would definitely say that wildlife should be left alone.[Singer]
I would counter that most human interventions were not intended to improve the well-being of wild animals, and yet I suspect that many of them have reduced wild-animal suffering on balance, namely by reducing habitat.
In a similar vein to Singer, Jennifer Everett has suggested that consequentialists should welcome evolutionary selection because it eliminates harmful genetic traits:
[...] If the propagation of the “fittest” genes contributes to the integrity of both the predator and the prey species, which is good for the predator-prey balance of the ecosystem, which in turn is good for the organisms living in it, and so on, then the very ecological relationships that holistic environmentalists regard as intrinsically valuable will also be regarded as valuable by animal welfarists, because, albeit indirectly and through complex causal chains, they are ultimately conducive to the well-being of individual animals.[Everett, p. 48]
These authors are right that attending to long-term ecological side effects is important. But it does not follow that humans have no duties regarding wild animals, or that advocates for animals should stay silent about nature’s cruelty. The next subsections spell out how humans can in fact do something about wild-animal suffering.
I agree that we should be cautious about interventions that are supposed to fix problems quickly. Ecology is extremely complicated, and humans have a long history of underestimating the unforeseen consequences they run into when trying to improve on nature. On the other hand, there are already many cases in which we intervene in wildlife today in one way or another. As Tyler Cowen has observed:[Cowen, p. 10]
In other cases we intervene in nature whether we like it or not. It is not a question of uncertainty holding us back from control, but of how to weigh one form of control against another. Humans change water levels, fertilize soils, influence climatic conditions, and do many other things that affect the natural balance. These human actions are not going to disappear any time soon, and in the meantime we must examine their effects on carnivores and their victims.
One such examination has in fact been carried out, concerning a decision by the Australian government to cull overpopulated and starving kangaroos on an Australian Defence Force base.[ClarkeNg] Though admittedly unpalatable and theoretical, the analysis shows that the tools of welfare economics can be combined with the principles of population ecology to draw non-trivial conclusions about how human intervention in nature affects overall animal well-being.
Consider another example. Humans spray around 3 billion kilograms of pesticides every year,[Pimentel] and whether or not we believe this causes more wild-animal suffering than it prevents, the large-scale use of insecticides is, to some extent, a fait accompli of modern society. If, hypothetically, scientists could find ways to make these chemicals act more quickly or less painfully, an enormous number of insects and larger organisms could be given a somewhat less agonizing death. (Note that pesticides may reduce net insect suffering if they reduce insect populations sufficiently, so promoting more humane insecticides is not equivalent to promoting less pesticide use. Indeed, organic farms might involve large amounts of insect suffering, both because of higher populations and because organic pest-control methods can be quite painful. I remain very uncertain about this question, however.)[Tomasik-insecticides]
Human modifications of the environment, such as agriculture, urbanization, deforestation, pollution, climate change, and so on, have large consequences, both negative and positive, for wild animals. For example, “paving paradise [or is it hell?] to put up a parking lot” prevents the existence of the animals that would otherwise have lived there. Even where habitats are not destroyed, humans may change the composition of the species living in them. If, say, an invading species has a shorter life expectancy and more non-surviving offspring than its native counterpart, the result would be more total suffering. Of course, the opposite could just as easily be the case.
Concern about wild-animal suffering should not be confused with a general push for more environmental protection; indeed, in some or even many cases, preventing existence may be the most humane option. Consequentialist vegetarians should not find this style of reasoning unusual: the utilitarian argument against factory farming is precisely that a broiler chicken would have been better off never existing than suffering in cramped conditions for 45 days before slaughter. Of course, even in deciding whether to adopt a vegetarian diet, the effects on wild animals may matter, sometimes more than the direct effects on the farmed animals themselves.[MathenyChan]
On the other hand, before we become overzealous about eliminating natural ecosystems, we should also remember that many other people value wilderness, and it is good to avoid making enemies or tainting the underlying goal of reducing suffering by setting it squarely against what people care about. Moreover, many forms of environmentalism, especially the mitigation of climate change, may matter for the far future by improving the prospects for compromise among the major world powers developing artificial intelligence.
Wild-animal suffering deserves a serious research program that addresses questions such as these:
Today, humans have neither the knowledge nor the technical capability to seriously “solve” the problem of wild-animal suffering without potentially catastrophic consequences. This may change in the future, however, as humans develop a deeper understanding of ecology and of welfare assessment.
If sentience is not rare in the universe, the problem of wild-animal suffering extends beyond our planet. If it is unlikely that life develops human-like intelligence,[Drake] we might expect most existing extraterrestrials to be at the developmental stage of Earth’s smallest, shortest-lived creatures. It could therefore be a great boon, should humans ever send robots into space, to use them to help wild animals on other planets. (One hopes that deep ecologists’ objections to intervening in extraterrestrial ecosystems will have been resolved by then.)
Still, I should note that faster technological progress in general is not necessarily desirable. Especially in areas such as artificial intelligence and neuroscience, faster progress could accelerate risks of other kinds of suffering. As a general heuristic, I believe it may be better to wait, before developing technologies that unleash large amounts of new power, until humans have the social institutions and the wisdom to prevent the abuse of that power.
While advanced future technologies hold promise for helping wild animals, they also carry the risk of multiplying the cruelty of the natural world. For example, it is foreseeable that humans may one day transfer Earth-like environmental conditions to Mars in a process of “terraforming.”[Burton] Others have proposed “directed panspermia”: sending biological material across the galaxy to seed other planets directly.[Meot-NerMatloff] Computer simulations could become detailed enough that the wild animals they contain suffer consciously. We already see many simulation models of natural selection, and it is only a matter of time before AI capabilities are added to them, so that the organisms involved become sentient and literally feel the pain of their injuries or deaths. All of these possibilities would have enormous ethical implications, and I hope that before they are carried out, future humans will seriously consider the consequences of such actions for the beings affected.
Was impliziert das alles in Bezug auf die Tierbewegung? Ich denke, der beste erste Schritt, den wir jetzt gehen können, um Wildtierleid zu vermindern, ist, das Interesse für dieses Thema zu erhöhen. Wenn mehr Menschen über Wildtierleid nachdenken und es wichtig finden, wird es mehr Forschung über den Schutz von Wildtieren und damit verbundene menschliche Technologien geben, während zur selben Zeit auch dafür gesorgt wird, dass unsere späten Nachkommen sorgfältig über Handlungen nachdenken, die mehr leidende Organismen produzieren würden.
Vermutlich wäre es ein guter Startpunkt, Unterstützer innerhalb der Tierbewegung zu finden. Während einige Aktivisten sich gegen jegliche menschliche Interventionen in die Angelegenheiten von Tieren stellen und es teilweise sogar bevorzugen würden, wenn Menschen gar nicht existiert hätten, sollten viele Menschen, die Mitgefühl mit Angehörigen anderer Tierarten haben, Bemühungen begrüßen, Grausamkeiten in der Wildnis zu verhindern. Es ist wichtig, sicherzustellen, dass die Tierrechtsbewegung nicht darin endet, ihre Unterstützung in Bezug auf Maßnahmen zur Erhaltung der Wildnis und menschliche Nicht-Einmischung jeglicher Art zu erhöhen. Eine andere mögliche Quelle für Unterstützer könnten Menschen sein, die sich für Evolution interessieren und verstehen, was Richard Dawkins die „blinde, schonungslose Gleichgültigkeit“ der natürlichen Selektion nannte.[Dawkins, p. 133]
Individuen können viel tun, um das Thema eigenständig aufzuwerten, zum Beispiel:
Es könnte gefährlich sein, das Wildtier-Thema aufzuwerfen, bevor die allgemeine Öffentlichkeit dafür bereit ist. Tatsächlich wird die Grausamkeit der Natur oft als eine Reaktion von Fleischessern gegen konsequentialistischen Vegetarismus verwendet. Die Behauptung, dass die ethische Berücksichtigung von Tieren von uns verlangen würde, Ressourcen in langfristige Forschung zu investieren, die darauf ausgerichtet ist, Wildtieren zu helfen, könnte Menschen endgültig abstoßen, die ansonsten zumindest den Tieren, die sie durch ihre Ernährungsentscheidungen beeinflussen, Beachtung geschenkt hätten.[Greger]
Ich denke, die Aufklärung über Wildtiere sollte in Gemeinschaften beginnen, die bereits möglichst empfänglich sind, wie etwa Philosophen, Tieraktivisten, Transhumanisten und Wissenschaftler. Wir können Samen dieser Idee säen, sodass sie zu einer Komponente der Tierrechtsbewegung wächst. Ich denke auch, die Aufforderung „verbreitet Wildtierleid nicht großflächig“ könnte selbst an Orten wie TED oder Slate genannt werden, gerade weil es eine kontroverse Idee ist, die Menschen bislang nicht gehört haben. Für Zuhörer aus solchen Gruppen würde das Thema aber im „far mode“ auftauchen, würde nicht mit ihrem täglichen Leben interferieren und könnte daher mit geringerem Widerstand in Betracht gezogen werden.
Es ist wahr, dass die meisten Menschen noch nicht die moralische Dringlichkeit, Wildtierleid zu verringern, befürworten. Sie könnten vorher andere Schritte benötigen, wie etwa sich für nichtmenschliche Tiere überhaupt zu interessieren. Die Tierbewegung ist wie ein Wurm: Jedes Körperteil muss sich langsam auf seinem Weg nach vorne zum nächsten Schritt bewegen. Aber der Kopf des Wurms muss auch in die richtige Richtung zeigen. Die Menschen in dieser Hinsicht zu inspirieren, die bereit dafür sind, ist wie, die Richtung zu steuern, in die der Kopf des Wurms zeigt.
Es ist äußerst wichtig, dass die Tierbewegung an irgendeinem Punkt über die Tiere auf Farmen, in Experimenten und Haustiere hinaus geht. Das Ausmaß der Brutalität der Natur ist zu gewaltig, um es zu ignorieren, und Menschen haben die Pflicht, ihre astronomisch seltene Position sowohl als intelligente als auch als empathische Kreaturen zu nutzen, um Leid in der Wildnis um so viel wie nur möglich zu reduzieren.
[Dawkins] Dawkins, Richard. River Out of Eden. New York: Basic Books, 1995.
[Bostrom-Alfred] Bostrom, Nick. "Golden". 2004.
[Pinker] Sailer, Steve. "Q&A: Steven Pinker of 'Blank Slate.'" United Press International. 30 Oct. 2002. Retrieved 17 Jan. 2014.
[Attenborough] Rustin, Susanna. "David Attenborough: 'I'm an essential evil'". The Guardian. 21 Oct. 2011. Retrieved 9 Jan. 2014.
[Mill] Mill, John Stuart. "On Nature". 1874. In Nature, The Utility of Religion and Theism, Rationalist Press, 1904.
[exceptions] Examples include (1) Sapontzis, Steve F. "Predation." Ethics and Animals 5.2 (1984): 27-38. (2) Naess, Arne. "Should We Try To Relieve Clear Cases of Extreme Suffering in Nature?" Pan Ecology 6.1 (1991). (3) Fink, Charles K. "The Predation Argument". Between the Species 5 (2005).
[Tomasik-numbers] Tomasik, Brian. "How Many Wild Animals Are There?" Essays on Reducing Suffering. 2009.
[emotions] See, for instance, (1) Balcombe, Jonathan. Pleasurable Kingdom: Animals and the Nature of Feeling Good. Palgrave Macmillan, 2006. (2) Bekoff, Marc, ed. The Smile of a Dolphin: Remarkable Accounts of Animal Emotions. Discovery Books, 2000.
[McGowan] McGowan, Christopher. The Raptor and the Lamb: Predators and Prey in the Living World. New York: Henry Holt and Company, 1997.
[eaten-alive] Eaten Alive - The World of Predators. Questacon on Tour.
[Kruuk] Kruuk, H. The Spotted Hyena. Chicago: University of Chicago Press, 1972.
[Flank] Flank, Lenny. "Live Prey vs. Prekill". The Snake: An Owner's Guide To A Happy Healthy Pet. Howell Book House, 1997.
[Perry] Perry, Lacy. "How Snakes Work: Feeding". howstuffworks.com.
[Sallinger] Sallinger, Bob. "Audubon Society Favors Keeping Cats Indoors". The Oregonian. 17 Nov. 2003.
[Wall] Wall, Patrick. Pain: The Science of Suffering. New York: Columbia University Press, 2000.
[ElHagePeronnyGriebelBelzung] El Hage, Wissam, Sylvie Peronny, Guy Griebel, Catherine Belzung. "Impaired memory following predatory stress in mice is improved by fluoxetine". Progress in Neuro-Psychopharmacology & Biological Psychiatry 28 (2004) 123-128.
[ElHageGriebelBelzung] El Hage, Wissam, Guy Griebel, and Catherine Belzung. "Long-term impaired memory following predatory stress in mice". Physiology & Behavior 87 (2006) 45-50.
[Zoladz] Zoladz, Phillip R. "An ethologically relevant animal model of posttraumatic stress disorder: Physiological, pharmacological and behavioral sequelae in rats exposed to predator stress and social instability". Graduate dissertation, University of South Florida. 2008.
[Stam] Stam, Rianne. "PTSD and stress sensitisation: A tale of brain and body Part 2: Animal models". Neuroscience & Biobehavioral Reviews Volume 31, Issue 4 (2007) 558-584.
[Stauth] Stauth, David. "Sharks, wolves and the 'ecology of fear'". 10 Nov. 2010. Retrieved 17 March 2013.
[Salmonellosis] "Salmonellosis". Michigan Department of Natural Resources.
[bats] "Continued Rain, Snowpack Leaves Animals Hungry". Associated Press 23 Apr. 2006. CBS 13/UPN 31.
[Heidorn] Heidorn, Keith C. "Ice Storms: Hazardous Beauty". The Weather Doctor. 12 Jan. 1998, modified Dec. 2001.
[UCLA] UCLA Animal Care and Use Training Manual. UCLA Office for the Protection of Research Subjects.
[Nuffield] Nuffield Council on Bioethics. Ethics of Research Involving Animals. May 2005.
[Wilcox] Wilcox, Christie. "Bambi or Bessie: Are wild animals happier?" Scientific American Blogs. 12 April 2011. [For further discussion of this article, see this Felicifia thread. I think Christie underrates the brutality of life in factory farms, but her points about wild animals are well taken.]
[BourneEtAl] Bourne, Debra C., Penny Cusdin, and Suzanne I. Boardman, eds. Pain Management in Ruminants. Wildlife Information Network. March 2005.
[Cumming] Cumming, Jeffrey M. "Horn fly Haematobia irritans (L.)". Diptera Associated with Livestock Dung. North American Dipterists Society. 18 May 2006.
[BBC] "Fierce Ants Build 'Torture Rack'". BBC News 23 April 2005.
[Gould] Gould, Stephen Jay. "Nonmoral Nature". Hen's Teeth and Horse's Toes: Further Reflections in Natural History. New York: W. W. Norton, 1994.
[insect-pain] See, for instance, the following review articles: (1) Smith, Jane A. "A Question of Pain in Invertebrates". ILAR Journal 33.1-2 (1991). (2) Tomasik, Brian. "Can Insects Feel Pain?" Essays on Reducing Suffering. 2009.
[Williams] Williams, C. B. Patterns in the Balance of Nature and Related Problems. London: Academic Press, 1964.
[SchubelButman] Schubel, J. R. and Butman, C. A. "Keeping a Finger on the Pulse of Marine Biodiversity: How Healthy Is It?" Pages 84-103 of Nature and Human Society: The Quest for a Sustainable World. Washington, DC: National Academy Press, 1998.
[SolbrigSolbrig] Solbrig, O. T., and Solbrig, D. J. Introduction to Population Biology and Evolution. London: Addison-Wesley, 1979.
[Herbert] Herbert, Thomas J. "r and K selection". Retrieved 17 March 2013.
[ClarkeNg] Clarke, Matthew and Ng, Yew-Kwang. "Population Dynamics and Animal Welfare: Issues Raised by the Culling of Kangaroos in Puckapunyal". Social Choice and Welfare 27:2 (pp. 407-22), 2006.
[Ng] Ng, Yew-Kwang. "Towards Welfare Biology: Evolutionary Economics of Animal Consciousness and Suffering". Biology and Philosophy 10.4 (pp. 255-85), 1995.
[Sagoff] Sagoff, Mark. "Animal liberation and environmental ethics: Bad marriage, quick divorce". Osgoode Hall Law Journal 22, p. 297 (1984).
[Hapgood] Hapgood, Fred. Why males exist: an inquiry into the evolution of sex. Morrow (1979).
[EFSA] Animal Health and Welfare Scientific (AHAW) Panel. "Aspects of the biology and welfare of animals used for experimental and other scientific purposes". EFSA Journal 292, 1-136 (2005).
[DoodyPaull] Doody, J. Sean and Paull, Phillip. "Hitting the Ground Running: Environmentally Cued Hatching in a Lizard". Copeia: March 2013, Vol. 2013, No. 1, pp. 160-165.
[Tomasik-short-lived] Tomasik, Brian. "Fitness Considerations Regarding the Suffering of Short-Lived Animals". Essays on Reducing Suffering. Written: 30 June 2013; last modified: 9 Feb. 2015.
[KahnemanSugden] Kahneman, Daniel and Sugden, Robert. "Experienced Utility as a Standard of Policy Evaluation". Environmental & Resource Economics 32: 161-81 (2005).
[MitchellThompson] Mitchell, T. and Thompson, L. (1994). "A Theory of Temporal Adjustments of the Evaluation of Events: Rosy Prospection and Rosy Retrospection". In C. Stubbart, J. Porac, and J. Meindl, eds., Advances in Managerial Cognition and Organizational Information-Processing, 5 (pp. 85-114). Greenwich, CT: JAI Press.
[Singer] Singer, Peter. "Food for Thought". [Reply to a letter by David Rosinger.] New York Review of Books 20.10 (1973).
[Everett] Everett, Jennifer. "Environmental Ethics, Animal Welfarism, and the Problem of Predation: A Bambi Lover's Respect for Nature". Ethics and the Environment 6.1 (2001): 42-67.
[Cowen] Cowen, Tyler. "Policing Nature". 19 May 2001.
[Pimentel] Pimentel, David. "Pesticides and Pest Control". In Peshin, Rajinder and Dhawan, Ashok K., eds. Integrated Pest Management: Innovation-Development Process. Netherlands: Springer, 2009.
[Tomasik-insecticides] Tomasik, Brian. "Humane Insecticides: A Cost-Effectiveness Calculation". Essays on Reducing Suffering. 2009.
[MathenyChan] Matheny, Gaverick and Chan, Kai M. A. "Human Diets and Animal Welfare: The Illogic of the Larder". Journal of Agricultural and Environmental Ethics, 18:6 (pp. 579-94), 2005.
[Broom] Broom, D. M. "Animal Welfare: Concepts and Measurement". Journal of Animal Science, 69:10 (pp. 4167-4175), 1991.
[Drake] Estimates of the fraction of planets with life that go on to produce intelligence can be found in the literature on the Drake equation.
[Burton] Burton, Kathleen. "NASA Presents Star-Studded Mars Debate". 25 March 2004.
[Meot-NerMatloff] Meot-Ner, M. and Matloff, G. L. "Directed Panspermia: A Technical and Ethical Evaluation of Seeding the Universe". Journal of the British Interplanetary Society 32 (pp. 419-23), 1979.
[Greger] Greger, Michael. "Why Honey Is Vegan". Satya Sep. 2005.
The post Die ethische Relevanz von Wildtierleid appeared first on Center on Long-Term Risk.
The post Transparency appeared first on Center on Long-Term Risk.
The Center on Long-Term Risk (CLR) is committed to being as transparent as possible about its activities and learning from its mistakes. This page gathers relevant information related to transparency.
CLR is a charity located in London, United Kingdom. Our trustees are Jonas Vollmer, Max Daniel, Stefan Torges, Linh Chi Nguyen and Tobias Baumann.
CLR's purposes, according to our constitution, are:
1. the advancement of education, primarily in the field of emerging technologies (such as artificial intelligence), in particular (but not exclusively) by conducting the following activities in order to contribute to better public understanding of the potential benefits and risks of such technologies:
(a) conducting research into emerging technologies and publishing the useful results thereof to the general public;
(b) supporting individuals and organisations to carry out research into emerging technologies, through providing grants and other forms of support;
(c) hosting events such as conferences, research retreats, and workshops; to promote recent developments in, and to stimulate discussion and exchange of information about, emerging technologies; and
(d) providing coaching, mentoring and scholarships to those interested in learning about emerging technologies; and
2. to advance such other purposes which are exclusively charitable according to the law in England and Wales as the trustees may from time to time determine, in particular (but not exclusively) by grantmaking.
Please consult this page to learn more about CLR's mission and philosophy.
CLR was initially founded as a project of the Swiss charity Effective Altruism Foundation (EAF). However, our operations were transferred to an independent UK charity, the Center on Long-term Risk, at the end of 2021. Documentation below relating to 2021 and earlier therefore relates to activities carried out as part of EAF. EAF still provides support to us, including by collecting donations on our behalf. More information about EAF can be found on their transparency page.
We receive our funding from several major sources. The first is individual supporters who pledged to donate a percentage of their income to effective charities. We have also received support from several institutional donors, e.g., Open Philanthropy, the Survival and Flourishing Fund, and the Community Foundation for Ireland.
Anonymous feedback can be given using this link.
In 2024, CLR expects to spend its funds as follows:
Our budget for 2023 is significantly smaller than our 2022 budget, due to a funding shortfall. We discuss this in more detail in our 2023 fundraiser post.
In 2023, CLR expects to spend its funds as follows:
In 2022, CLR expects to spend its funds as follows:
Note: Until the end of 2021, CLR operated as a project of the Effective Altruism Foundation. All amounts in the below documents are in CHF. Detailed financial reports are available on the EAF website.
In 2021, CLR expects to spend its funds as follows:
Note: During this period, CLR operated as a project of the Effective Altruism Foundation. All amounts in the below documents are in CHF. Detailed financial reports are available on the EAF website.
Note: During this period, CLR operated as a project of the Effective Altruism Foundation. All amounts in the below documents are in CHF. Detailed financial reports are available on the EAF website.
Note: During this period, CLR operated as a project of the Effective Altruism Foundation. As of 2018, CLR's financials are part of "EAF core". Detailed financial reports are available on the EAF website.
Note: During this period, CLR operated as a project of the Effective Altruism Foundation. Detailed financial reports are available on the EAF website.
Note: During this period, CLR operated as a project of the Effective Altruism Foundation. All amounts in the below documents are in CHF.
Annual Financial Statement 2016
CLR’s activities in 2015 were implemented by GBS Switzerland and Effective Altruism Switzerland (EACH), the predecessors of the Effective Altruism Foundation. Their financial reports are available on the EAF website.
In 2014, CLR’s expenses amounted to 28,422 USD.
CLR was founded in July 2013 but had no expenses as all of CLR’s activities were performed by volunteers on a pro bono basis.
The post Transparency appeared first on Center on Long-Term Risk.
The post Altruists Should Prioritize Artificial Intelligence appeared first on Center on Long-Term Risk.
The large-scale adoption of today's cutting-edge AI technologies across different industries would already prove transformative for human society. And AI research is progressing rapidly toward the goal of general intelligence. Once created, smarter-than-human artificial intelligence (AI) can be expected not only to be transformative for the world, but also (plausibly) to be better than humans at self-preservation and goal preservation. This makes it particularly attractive, from the perspective of those who care about improving the quality of the future, to focus on affecting the development goals of such AI systems, as well as to install potential safety precautions against likely failure modes. Some experts emphasize that steering the development of smarter-than-human AI in beneficial directions is important because it could make the difference between human extinction and a utopian future. But because we cannot confidently rule out the possibility that some AI scenarios will go badly and also result in large amounts of suffering, thinking about the impacts of AI is paramount both for suffering-focused altruists and for those focused on actualizing the upsides of the very best futures.
Terms like “AI” or “intelligence” can have many different (and often vague) meanings. “Intelligence” as used here refers to the ability to achieve goals in a wide range of environments. This definition captures the essence of many common perspectives on intelligence (Legg & Hutter, 2005), and conveys the meaning that is most relevant to us, namely that agents with the highest comparative goal-achieving ability (all things considered) are the most likely to shape the future. For comparison: As the most intelligent animal, humans completely dominate other animals whenever our interests are in conflict with theirs.
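One way to make this definition precise is Legg and Hutter's "universal intelligence" measure. The rendering below is a minimal illustrative sketch of that idea, included only to show how "goal-achieving ability across a wide range of environments" can be formalized; it is not a formula the article itself commits to.

```latex
% Legg & Hutter's "universal intelligence" measure (illustrative rendering):
% an agent's intelligence is its expected performance across all computable
% environments, with simpler environments weighted more heavily.
\[
  \Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
\]
% $\pi$           : the agent (a policy mapping observation histories to actions)
% $E$             : the set of all computable reward-generating environments
% $K(\mu)$        : the Kolmogorov complexity (description length) of environment $\mu$
% $V_{\mu}^{\pi}$ : the expected cumulative reward the agent obtains in $\mu$
```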
While everyday use of the term “intelligence” often refers merely to something like “brainpower” or “thinking speed,” our usage also presupposes rationality, or goal-optimization in an agent’s thinking and acting. In this usage, if someone is e.g. displaying overconfidence or confirmation bias, they may not qualify as very intelligent overall, even if they score high on an IQ test. The same applies to someone who lacks willpower or self control.
Artificial intelligence refers to machines designed with the ability to pursue tasks or goals. The AI designs currently in use – ranging from trading algorithms in finance, to chess programs, to self-driving cars – are intelligent in a domain-specific sense only. Chess programs beat the best human players at chess, but they would fail terribly at operating a car. Similarly, car-driving software in many contexts already performs better than human drivers, but no amount of learning (at least not with present algorithms) would make that software work safely on an airplane.
The most ambitious AI researchers are working to build systems that exhibit (artificial) general intelligence (AGI) – the type of intelligence we defined above, which enables the expert pursuit of virtually any task or objective. In the past few years, we have witnessed impressive progress in algorithms becoming more and more versatile. Google’s DeepMind team for example built an algorithm that learned to play 2-D Atari games on its own, achieving superhuman skill at several of them (Mnih et al., 2015). DeepMind then developed a program that beat the world champion in the game of Go (Silver et al., 2016), and – tackling more practical real-world applications – managed to cut down data center electricity costs by rearranging the cooling systems.
That DeepMind's AI technology makes quick progress in many domains, without requiring researchers to build new architecture from scratch each time, indicates that their machine learning algorithms have already reached an impressive level of general applicability. (Edit: I wrote the previous sentence in 2016. In the meantime [January 2018] DeepMind went on to refine its Go-playing AI, culminating in a version called AlphaGo Zero. While the initial version of DeepMind's Go-playing AI started out with access to a large database of games played by human experts, AlphaGo Zero only learns through self-play. Nevertheless, it managed to become superhuman after a mere 4 days of practice. After 40 days of practice, it was able to beat its already superhuman predecessor 100–0. Moreover, DeepMind then created AlphaZero, which is no longer a "Go-specific" algorithm. Fed with nothing but the rules for either Go, chess, or shogi, it managed to become superhuman at each of these games in less than 24 hours of practice.) The road may still be long, but if this trend continues, developments in AI research will eventually lead to superhuman performance across all domains. As there is no reason to assume that humans have attained the maximal degree of intelligence (Section III), AI may surpass our own level of intelligence soon after reaching it. Nick Bostrom (2014) popularized the term superintelligence to refer to (AGI-)systems that are vastly smarter than human experts in virtually all respects. This includes not only skills that computers traditionally excel at, such as calculus or chess, but also tasks like writing novels or talking people into doing things they otherwise would not. Whether AI systems would quickly develop superhuman skills across all possible domains, or whether we will already see major transformations once AI reaches superhuman capability in just some domains while others lag behind, is an open question. Note that the definitions of "AGI" and "superintelligence" leave open the question of whether these systems would exhibit something like consciousness.
This article focuses on the prospect of creating smarter-than-human artificial intelligence. For simplicity, we will use the term “AI” in a non-standard way here, to refer specifically to artificial general intelligence (AGI). The use of “AI” in this article will also leave open how such a system is implemented: While it seems plausible that the first artificial system exhibiting smarter-than-human intelligence will be run on some kind of “supercomputer,” our definition allows for alternative possibilities. The claim that altruists should focus on affecting AI outcomes is therefore intended to mean that we should focus on scenarios where the dominant force shaping the future is no longer (biological) human minds, but rather some outgrowth of information technology – perhaps acting in concert with biotechnology or other technologies. This would also e.g. allow for AI to be distributed over several interacting systems.
Even if we expect smarter-than-human artificial intelligence to be a century or more away, its development could already merit serious concern. As Sam Harris emphasized in his TED talk on risks and benefits of AI, we do not know how long it will take to figure out how to program ethical goals into an AI, solve other technical challenges in the space of AI safety, or establish an environment with reduced dangers of arms races. When the stakes are high enough, it pays to start preparing as soon as possible. The sooner we prepare, the better our chances of safely managing the upcoming transition.
The need for preparation is all the more urgent given that considerably shorter timelines are not out of the question, especially in light of recent developments. While timeline predictions by different AI experts span a wide range, many of those experts think it likely that human-level AI will be created this century (conditional on civilization facing no major disruptions in the meantime). Some even think it may emerge in the first half of this century: In a survey where the hundred most-cited AI researchers were asked in what year they think human-level AI is 10% likely to have arrived by, the median reply was 2024 and the mean was 2034. In response to the same question for a 50% probability of arrival, the median reply was 2050 with a mean of 2072 (Müller & Bostrom, 2016).1
While it could be argued that these AI experts are biased towards short timelines, their estimates should make us realize that human-level AI this century is a real possibility. The next section will argue that the subsequent transition from human-level AI to superintelligence could happen very rapidly after human-level AI actualizes. We are dealing with the decent possibility – e.g. above 15% likelihood even under highly conservative assumptions – that human intelligence will be surpassed by machine intelligence later this century, perhaps even in the next couple of decades. As such a transition will bring about huge opportunities as well as huge risks, it would be irresponsible not to prepare for it.
It should be noted that a potentially short timeline does not imply that the road to superintelligence is necessarily one of smooth progress: Metrics like Moore’s law are not guaranteed to continue indefinitely, and the rate of breakthrough publications in AI research may not increase (or even stay constant) either. The recent progress in machine learning is impressive and suggests that fairly short timelines of a decade or two are not to be ruled out. However, this progress could also be mostly due to some important but limited insights that enable companies like DeepMind to reap the low-hanging fruit before progress would slow down again. There are large gaps still to be filled before AIs reach human-level intelligence, and it is difficult to estimate how long it will take researchers to bridge these gaps. Current hype about AI may lead to disappointment in the medium term, which could bring about an “AI safety winter” with people mistakenly concluding that the safety concerns were exaggerated and smarter-than-human AI is not something we should worry about yet.
If AI progress were to slow down for a long time and then unexpectedly speed up again, a transition to superintelligence could happen with little warning (Shulman & Sandberg, 2010). This scenario is plausible because gains in software efficiency make a larger comparative difference to an AI’s overall capabilities when the hardware available is more powerful. And once an AI develops the intelligence of its human creators, it could start taking part in its own self-improvement (see section IV).
For AI progress to stagnate for a long period of time before reaching human-level intelligence, biological brains would have to have surprisingly efficient architectures that AI cannot achieve despite further hardware progress and years of additional human-led AI research. However, as long as hardware progress does not come to a complete halt, AGI research will eventually no longer need to surpass the human brain's architecture or efficiency. Instead, it could become possible to simply copy it: The "foolproof" way to build human-level intelligence would be to develop whole brain emulation (WBE) (Sandberg & Bostrom, 2008), the exact copying of the brain's pattern of computation (input-output behavior as well as isomorphic internal states at any point in the computation) onto a computer and a suitable virtual environment. In addition to sufficiently powerful hardware, WBE would require scanning technology with fine enough resolution to capture all the relevant cognitive function, as well as a sophisticated understanding of neuroscience to correctly draw the right abstractions. Even though our available estimates are crude, it is possible that all these conditions will be fulfilled well before the end of this century (Sandberg, 2014).
The perhaps most intriguing aspect of WBE technology is that once the first emulation exists and can complete tasks on a computer like a human researcher can, it would then be very easy to make more such emulations by copying the original. Moreover, with powerful enough hardware, it would also become possible to run emulations at higher speeds, or to reset them back to a well-rested state after they performed exhausting work (Hanson, 2016). Sped-up WBE workers could be given the task of improving computer hardware (or AI technology itself), which would trigger a wave of steeply exponential progress in the development of superintelligence. To get a sense of the potential of this technology, imagine WBEs of the smartest and most productive AI scientists, copied a hundred times to tackle AI research itself as a well-coordinated research team, sped up so they can do years of research in mere weeks or even days, and reset periodically to skip sleep (or other distracting activities) in cases where memory-formation is not needed. The scenario just described requires no further technologies beyond WBE and sufficiently powerful hardware. If the gap from current AI algorithms to smarter-than-human AI is too hard to bridge directly, it may eventually be bridged (potentially very quickly) after WBE technology drastically accelerates further AI research.
The potential for WBE to come before de novo AI means that – even if the gap between current AI designs and the human brain is larger than we thought – we should not significantly discount the probability of human-level AI being created eventually. And perhaps paradoxically, we should expect such a late transition to happen abruptly. Barring an upcoming societal collapse, believing that superintelligence is highly unlikely to ever happen requires not only confidence that software or "architectural" improvements to AI are insufficient to ever bridge the gap, but also that – in spite of continued hardware progress – WBE could not get off the ground either. We do not seem to have sufficient reason for great confidence in either of these propositions, let alone both.
It is difficult to intuitively comprehend the idea that machines – or any physical system for that matter – could become substantially more intelligent than the most intelligent humans. Because the intelligence gap between humans and other animals appears very large to us, we may be tempted to think of intelligence as an “on-or-off concept,” one that humans have and other animals do not. People may believe that computers can be better than humans at certain tasks, but only at tasks that do not require “real” intelligence. This view would suggest that if machines ever became “intelligent” across the board, their capabilities would have to be no greater than those of an intelligent human relying on the aid of (computer-)tools.
But this view is mistaken. There is no threshold for “absolute intelligence.” Nonhuman animals such as primates or rodents differ in cognitive abilities a great deal, not just because of domain-specific adaptations, but also due to a correlational “g factor” responsible for a large part of the variation across several cognitive domains (Burkart et al., 2016). In this context, the distinction between domain-specific and general intelligence is fuzzy: In many ways, human cognition is still fairly domain-specific. Our cognitive modules were optimized specifically for reproductive success in the simpler, more predictable environment of our ancestors. We may be great at interpreting which politician has the more confident or authoritative body language, but deficient in evaluating whose policy positions will lead to better developments according to metrics we care about. Our intelligence is good enough or “general enough” that we manage to accomplish impressive feats even in an environment quite unlike the one our ancestors evolved in, but there are many areas where our cognition is slower or more prone to bias than it could be.
Intelligence is best thought of in terms of a gradient. Imagine a hypothetical “intelligence scale” (inspired by part 2.1 of this FAQ) with rats at 100, chimpanzees at, say, 350, the village idiot at 400, average humans at 500 and Einstein at 750.2 Of course, this scale is open at the top and could go much higher. To quote Bostrom (2014, p. 44):
"Far from being the smartest possible biological species, we are probably better thought of as the stupidest possible biological species capable of starting a technological civilization – a niche we filled because we got there first, not because we are in any sense optimally adapted to it."
Thinking about intelligence as a gradient rather than an “on-or-off” concept prompts a Copernican shift of perspective. Suddenly it becomes obvious that humans cannot be at the peak of possible intelligence. On the contrary, we should expect AI to be able to surpass us in intelligence just like we surpass chimpanzees.
Biological evolution supports the view that AI could reach levels of intelligence vastly beyond ours. Evolutionary history arguably exhibits a weak trend of lineages becoming more intelligent over time, but evolution did not optimize for intelligence (only for goal-directed behavior in specific niches or environment types). Intelligence is metabolically costly, and without strong selection pressures for cognitive abilities specifically, natural selection will favor other traits. The development of new traits always entails tradeoffs or physical limitations: If our ancestors had evolved to have larger heads at birth, maternal childbirth mortality would likely have risen so high as to outweigh the gains of increased intelligence (Wittman & Wall, 2007). Because evolutionary change happens step-by-step as random mutations change the pre-existing architecture, the changes are path dependent and can only result in local optima, not global ones. It would be a remarkable coincidence if evolution had just so happened to stumble upon the most efficient way to assemble matter into an intelligent system.
But let us imagine that we could go back to the “drawing board” and optimize for a system’s intelligence without any developmental limitations. This process would provide the following benefits for AI over the human brain (Bostrom, 2014, p. 60-61):
With regard to the last point, imagine we tried to optimize for something like speed or sight rather than intelligence. Even if humans had never built anything faster than the fastest animal, we should assume that technological progress – unless it is halted – would eventually surpass nature in these respects. After all, natural selection does not optimize directly for speed or sight (but rather for gene copying success), making it a slower optimization process than those driven by humans for this specific purpose. Modern rockets already fly at speeds of up to 36,373 mph, which beats the peregrine falcon’s 240 mph by a huge margin. Similarly, eagle vision may be powerful, but it cannot compete with the Hubble space telescope. (General) intelligence is harder to replicate technologically, but natural selection did not optimize for intelligence either, and there do not seem to be strong reasons to believe that intelligence as a trait should differ categorically from examples like speed or sight, i.e., there are as far as we know no hard physical limits that would put human intelligence at the peak of what is possible.3
Another way to develop an intuition for the idea that there is significant room for improvement above human intelligence is to study variation in humans. An often-discussed example in this context is the intellect of John von Neumann. Von Neumann was not some kind of an alien, nor did he have a brain twice as large as the human average. And yet, von Neumann’s accomplishments almost seem “superhuman.” The section in his Wikipedia entry that talks about him having “founded the field of Game theory as a mathematical discipline” – an accomplishment so substantial that for most other intellectual figures it would make up most of their Wikipedia page – is just one out of many of von Neumann’s major achievements.
There are already individual humans (with normal-sized brains) whose intelligence vastly exceeds that of the typical human. So just how much room is there above their intelligence? To visualize this, consider for instance what could be done with an AI architecture more powerful than the human brain running on a warehouse-sized supercomputer.
Perhaps the people who think it is unlikely that superintelligent AI will ever be created are not objecting to it being possible in principle. Maybe they think it is simply too difficult to bridge the gap from human-level intelligence to something much greater. After all, evolution took a long time to produce a species as intelligent as humans, and for all we know, there could be planets with biological life where intelligent civilizations never evolved.4 But considering that there could come a point where AI algorithms start taking part in their own self-improvement, we should be more optimistic. AIs contributing to AI research will make it easier to bridge the gap, and could perhaps even lead to an acceleration of AI progress to the point that AI not only ends up smarter than us, but vastly smarter after only a short amount of time.
Several points in the list of AI advantages above – in particular the advantages derived from the editability of computer software or the possibility for modular superpowers to have crucial skills such as programming – suggest that AI architectures might both be easier to further improve than human brains, and that AIs themselves might at some point become better at actively developing their own improvements. If we ever build a machine with human-level intelligence, it should then be comparatively easy to speed it up or make tweaks to its algorithm and internal organization to make it more powerful. The updated version, which would at this point be slightly above human-level intelligence, could be given the task of further self-improvement, and so on until the process runs into physical limits or other bottlenecks.
Perhaps self-improvement does not have to require human-level general intelligence at all. There may be comparatively simple AI designs that are specialized for AI science and (initially) lack proficiency in other domains. The theoretical foundations for an AI design that can bootstrap itself to higher and higher intelligence already exist (Schmidhuber, 2006), and it remains an empirical question where exactly the threshold is after which AI designs would become capable of improving themselves further, and whether the slope of such an improvement process is steep enough to go on for multiple iterations.
For the above reasons, it cannot be ruled out that breakthroughs in AI could at some point lead to an intelligence explosion (Good, 1965; Chalmers, 2010), where recursive self-improvement leads to a rapid acceleration of AI progress. In such a scenario, AI could go from subhuman intelligence to vastly superhuman intelligence in a very short timespan, e.g. in (significantly) less than a year.
While the idea of AI advancing from human-level to vastly superhuman intelligence in less than a year may sound implausible, as it violates long-standing trends in the speed of human-driven development, it would not be the first time where changes to the underlying dynamics of an optimization process cause an unprecedented speed-up. Technology has been accelerating ever since innovations (such as agriculture or the printing press) began to feed into the rate at which further innovations could be generated.5 Compared to the rate of change we see in biological evolution, cultural evolution broke the sound barrier: It took biological evolution a few million years to improve on the intelligence of our ape-like ancestors to the point where they became early hominids. By contrast, technology needed little more than ten thousand years to progress from agriculture to space shuttles. Just as inventions like the printing press fed into – and significantly sped up – the process of technological evolution, rendering it qualitatively different from biological evolution, AIs improving their own algorithms could cause a tremendous speed-up in AI progress, rendering AI development through self-improvement qualitatively different from “normal” technological progress.
It should be noted, however, that while the arguments in favor of a possible intelligence explosion are intriguing, they nevertheless remain speculative. There are also some good reasons why some experts consider a slower takeoff of AI capabilities more likely. In a slower takeoff, it would take several years or even decades for AI to progress from human to superhuman intelligence. Unless we find decisive arguments for one scenario over the other, we should expect both rapid and comparably slow takeoff scenarios to remain plausible. It is worth noting that because “slow” in this context also includes transitions on the order of ten or twenty years, it would still be very fast practically speaking, when we consider how much time nations, global leaders or the general public would need to adequately prepare for these changes.
The typical mind fallacy refers to the belief that other minds operate the same way our own does. If an extrovert asks an introvert, “How can you possibly not enjoy this party; I talked to half a dozen people the past thirty minutes and they were all really interesting!” they are committing the typical mind fallacy.
When envisioning the goals of smarter-than-human artificial intelligence, we are in danger of committing this fallacy and projecting our own experience onto the way an AI would reason about its goals. We may be tempted to think that an AI, especially a superintelligent one, will reason its way through moral arguments6 and come to the conclusion that it should, for instance, refrain from harming sentient beings. This idea is misguided, because according to the intelligence definition we provided above – which helps us identify the processes likely to shape the future – making a system more intelligent does not change its goals/objectives; it only adds more optimization power for pursuing those objectives.
To give a silly example, imagine that an arms race between spam producers and companies selling spam filters leads to increasingly sophisticated strategies on both sides, until the side selling spam filters has had it and engineers a superintelligent AI with the sole objective of minimizing the number of spam emails in their inboxes. With its level of sophistication, the spam-blocking AI would have more strategies at its disposal than normal spam filters. For instance, it could try to appeal to human reason by voicing sophisticated, game-theoretic arguments against the negative-sum nature of sending out spam. But it would be smart enough to realize the futility of such a plan, as this naive strategy would backfire because some humans are trolls (among other reasons). So the spam-minimizing AI would quickly conclude that the safest way to reduce spam is not by being kind, but by gaining control over the whole planet and killing everything that could possibly try to trick its spam filter. The AI in this example may fully understand that humans would object to these actions on moral grounds, but human "moral grounds" are based on what humans care about – which is not the minimization of spam! And the AI – whose whole decision architecture only selects for actions that promote the terminal goal of minimizing spam – would therefore not be motivated to think through, let alone follow, our arguments, even if it could "understand" them in the same way introverts understand why some people enjoy large parties.
The typical mind fallacy tempts us to conclude that because moral arguments appeal to us,7 they would appeal to any generally intelligent system. This claim is after all already falsified empirically by the existence of high-functioning psychopaths. While it may be difficult for most people to imagine how it would feel to not be moved by the plight of anyone but oneself, this is nothing compared to the difficulties of imagining all the different ways that minds in general could be built. Eliezer Yudkowsky coined the term mind space to refer to the set of all possible minds – including animals (of existing species as well as extinct ones), aliens, and artificial intelligences, as well as completely hypothetical “mind-like” designs that no one would ever deliberately put together. The variance in all human individuals, throughout all of history, only represents a tiny blob in mind space. Some of the minds outside this blob would “think” in ways that are completely alien to us; most would lack empathy and other (human) emotions for that matter; and many of these minds may not even relevantly qualify as “conscious.”
Most of these minds would not be moved by moral arguments, because the decision to focus on moral arguments has to come from somewhere, and many of these minds would simply lack the parts that make moral appeals work in humans. Unless AIs are deliberately designed8 to share our values, their objectives will in all likelihood be orthogonal to ours (Armstrong, 2013).
Even though AI designs may differ radically in terms of their top-level goals, we should expect most AI designs to converge on some of the same subgoals. These convergent subgoals (Omohundro, 2008; Bostrom, 2012) include intelligence amplification, self-preservation, goal preservation and the accumulation of resources. All of these are instrumentally very useful to the pursuit of almost any goal. If an AI is able to access the resources it needs to pursue these subgoals, and does not explicitly have concern for human preferences as (part of) its top-level goal, its pursuit of these subgoals is likely to lead to human extinction (and eventually space colonization; see below).
AI safety work refers to interdisciplinary efforts to ensure that the creation of smarter-than-human artificial intelligence will result in excellent outcomes rather than disastrous ones. Note that the worry is not that AI would turn evil, but that indifference to suffering and human preferences will be the default unless we put in a lot of work to ensure that AI is developed with the right values.
Increasing an agent’s intelligence improves its ability to efficiently pursue its goals. All else equal, any agent has a strong incentive to amplify its intelligence. A real-life example of this convergent drive is the value of education: Learning important skills and (thinking-)habits early in life correlates with good outcomes. In the AI context, intelligence amplification as a convergent drive implies that AIs with the ability to improve their own intelligence will do so (all else equal). To self-improve, AIs would try to gain access to more hardware, make copies of themselves to increase their overall productivity, or devise improvements to their own cognitive algorithms.
More broadly, intelligence amplification also implies that an AI would try to develop all technologies that may be of use to its pursuits. I.J. Good, a mathematician and cryptologist who worked alongside Alan Turing, asserted that “the first ultraintelligent machine is the last invention that man need ever make,” because once we build it, such a machine would be capable of developing all further technologies on its own.
AIs would in all likelihood also have an interest in preserving their own goals. This is because they optimize actions in terms of their current goals, not in terms of goals they might end up having in the future. From the current goal's perspective, a change in the AI's goal function is potentially disastrous, as the current goal would then no longer be pursued. Therefore, AIs will try to prevent researchers from changing their goals. Consequently, there is pressure for AI researchers to get things right on the first try: If we develop a superintelligent AI with a goal that is not quite what we were after – because someone made a mistake, or was not precise enough, or did not think about particular ways the specified goal could backfire – the AI would pursue the goal that it was equipped with, not the goal that was intended. This applies even if it could understand perfectly well what the intended goal was. This feature of going with the actual goal instead of the intended one could lead to cases of perverse instantiation, such as the AI "paralyz[ing] human facial musculatures into constant beaming smiles" to pursue an objective of "make us smile" (Bostrom, 2014, p. 120).
Some people have downplayed worries about AI risks with the argument that when things begin to look dangerous, humans can literally "pull the plug" in order to shut down AIs that are behaving suspiciously. This argument is naive because it is based on the assumption that AIs would be too stupid to take precautions against this. Because the scenario we are discussing concerns smarter-than-human intelligence, an AI would understand the implications of losing its connection to electricity, and would therefore try to proactively prevent being shut down by any means necessary – especially when shutdown might be permanent.
This is not to say that AIs would necessarily be directly concerned about their own “death” – after all, whether an AI’s goal includes its own survival or not depends on the specifics of its goal function. However, for most goals, staying around pursuing one's goal will lead to better expected goal achievement. AIs would therefore have strong incentives to prevent permanent shutdown even if their goal was not about their own “survival” at all. (AIs might, however, be content to outsource their goal achievement by making copies of themselves, in which case shutdown of the original AI would not be so terrible as long as one or several copies with the same goal remain active.)
The convergent drive for self-preservation has the unfortunate implication that superintelligent AI would almost inevitably see humans as a potential threat to its goal achievement. Even if its creators do not plan to shut the AI down for the time being, the superintelligence could reasonably conclude that the creators might decide to do so at some point. Similarly, a newly-created AI would have to expect some probability of interference from external actors such as the government, foreign governments or activist groups. It would even be concerned that humans in the long term are too stupid to keep their own civilization intact, which would also affect the infrastructure required to run the AI. For these reasons, any AI intelligent enough to grasp the strategic implications of its predicament would likely be on the lookout for ways to gain dominance over humanity. It would do this not out of malevolence, but simply as the best strategy for self-preservation.
This does not mean that AIs would at all times try to overpower their creators: If an AI realizes that attempts at trickery are likely to be discovered and punished with shutdown, it may fake being cooperative, and may fake having the goals that the researchers intended, while privately plotting some form of takeover. Bostrom has referred to this scenario as a “treacherous turn” (Bostrom, 2014, p. 116).
We may be tempted to think that AIs implemented on some kind of normal computer substrate, without arms or legs for mobility in the non-virtual world, may be comparatively harmless and easy to overpower in case of misbehavior. This would likely be a misconception, however. We should not underestimate what a superintelligence with access to the internet could accomplish. And it could attain such access in many ways and for many reasons, e.g. because the researchers were careless or underestimated its capacities, or because it successfully pretended to be less capable than it actually was. Or maybe it could try to convince the “weak links” in its team of supervisors to give it access in secret – promising bribes. Such a strategy could work even if most people in the developing team thought it would be best to deny their AI internet access until they have more certainty about the AI's alignment status and its true capabilities. Importantly, if the first superintelligence ever built was prevented from accessing the internet (or other efficient channels of communication), its impact on the world would remain limited, making it possible for other (potentially less careful) teams to catch up. The closer the competition, the more the teams are incentivized to give their AIs riskier access over resources in a gamble for the potential benefits in case of proper alignment.
The following list contains some examples of strategies a superintelligent AI could use to gain power over more and more resources, with the goal of eventually reaching a position where humans cannot harm or obstruct it. Note that these strategies were thought of by humans, and are therefore bound to be less creative and less effective than the strategies an actual superintelligence would be able to devise.
Through some means or another – and let’s not forget that the AI could well attempt many strategies at once to safeguard against possible failure in some of its pursuits – the AI may eventually gain a decisive strategic advantage over all competition (Bostrom, 2014, p. 78-90). Once this is the case, it would carefully build up further infrastructure on its own. This stage will presumably be easier to reach as the world economy becomes more and more automated.
Once humans are no longer a threat, the AI would focus its attention on natural threats to its existence. It would for instance notice that the sun will expand in about seven billion years to the point where existence on Earth will become impossible. For reasons of self-preservation alone, a superintelligent AI would thus eventually be incentivized to expand its influence beyond Earth.
For the fulfillment of most goals, accumulating as many resources as possible is an important early step. Resource accumulation is also intertwined with the other subgoals in that it tends to facilitate them.
The resources available on Earth are only a tiny fraction of the total resources that an AI could access in the entire universe. Resource accumulation as a convergent subgoal implies that most AIs would eventually colonize space (provided that it is not prohibitively costly), in order to gain access to the maximum amount of resources. These resources would then be put to use for the pursuit of its other subgoals and, ultimately, for optimizing its top-level goal.
Superintelligent AI might colonize space in order to build (more of) the following:
To elaborate on the point of goal optimization: Humans tend to be satisficers with respect to most things in life. We have minimum requirements for the quality of the food we want to eat, the relationships we want to have, or the job we want to work in. Once these demands are met and we find options that are “pretty good,” we often end up satisfied and settle down on the routine. Few of us spend decades of our lives pushing ourselves to invest as many waking hours as sustainably possible into systematically finding the optimal food in existence, the optimal romantic partner, or anything really.
AI systems on the other hand, in virtue of how they are usually built, are more likely to act as maximizers. A chess computer is not trying to look for “pretty good moves” – it is trying to look for the best move it can find with the limited time and computing power it has at its disposal. The pressure to build ever more powerful AIs is a pressure to build ever more powerful maximizers. Unless we deliberately program AIs in a way that reduces their impact, the AIs we build will be maximizers that never “settle” or consider their goals “achieved.” If their goal appears to be achieved, a maximizer AI will spend its remaining time double- and triple-checking whether it made a mistake. When it is only 99.99% certain that the goal is achieved, it will restlessly try to increase the probability further – even if this means using the computing power of a whole galaxy to drive the probability it assigns to its goal being achieved from 99.99% to 99.991%.
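To make the contrast concrete, here is a minimal, purely illustrative Python sketch; the function names and the toy "meals" example are hypothetical and not drawn from any actual AI system. A satisficer stops at the first option that clears its aspiration level, while a maximizer exhaustively searches for the single best option it can find.

```python
# Illustrative toy contrast between satisficing and maximizing decision rules.
# All names here are hypothetical examples, not part of any real AI system.

def satisficer(options, utility, threshold):
    """Return the first option that is 'good enough' (meets the threshold)."""
    for option in options:
        if utility(option) >= threshold:
            return option          # settle as soon as the aspiration level is met
    return None                    # no acceptable option found

def maximizer(options, utility):
    """Exhaustively evaluate every option and return the single best one."""
    return max(options, key=utility)  # never settles for "pretty good"

if __name__ == "__main__":
    meals = ["instant noodles", "sandwich", "home-cooked dinner", "tasting menu"]
    tastiness = {"instant noodles": 3, "sandwich": 6,
                 "home-cooked dinner": 8, "tasting menu": 10}.get

    print(satisficer(meals, tastiness, threshold=6))  # -> "sandwich"
    print(maximizer(meals, tastiness))                # -> "tasting menu"
```

The point is not the code itself but the difference in stopping conditions: the satisficer has a notion of "good enough" at which it stops searching, whereas the maximizer keeps optimizing as long as resources allow.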
Because of the nature of maximizing as a decision-strategy, a superintelligent AI is likely to colonize space in pursuit of its goals unless we program it in a way to deliberately reduce its impact. This is the case even if its goals appear as “unambitious” as e.g. “minimize spam in inboxes.”
Space colonization by artificial superintelligence would increase goal-directed activity and computations in the world by an astronomically large factor—for good or for evil.11 If the superintelligence holds objectives that are aligned with our values, then the outcome could be a utopia. However, if the AI has randomly, mistakenly, or sufficiently suboptimally implemented values, the best we could hope for is if all the machinery it used to colonize space was inanimate, i.e. not sentient. Such an outcome – even though all humans would die – would still be much better than other plausible outcomes, because it would at least not contain any suffering. Unfortunately, we cannot rule out that the space colonization machinery orchestrated by a superintelligent AI would also contain sentient minds, including minds that suffer (though probably also happy minds). The same way factory farming led to a massive increase in farmed animal populations, multiplying the direct suffering humans cause to animals by a large factor, an AI colonizing space could cause a massive increase in the total number of sentient entities, potentially creating vast amounts of suffering. The following are some ways AI outcomes could result in astronomical amounts of suffering:
More ways AI scenarios could contain astronomical amounts of suffering are described here and here. Sources of future suffering are likely to follow a power law distribution, where most of the expected suffering comes from a few rare scenarios where things go very wrong – analogous to how most casualties are the result of very few, very large wars; how most of the casualty risk from terrorist attacks falls into tail scenarios where terrorists get their hands on weapons of mass destruction; or how most victims of epidemics succumbed to the few very worst outbreaks (Newman, 2005). It is therefore crucial to factor in not only which scenarios are most likely to occur, but also how bad those scenarios would be should they occur.
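As a rough, purely illustrative simulation of this tail-dominance pattern (the Pareto shape parameter and sample size below are arbitrary choices, not empirical estimates of anything), one can check that in a heavy-tailed distribution a small fraction of draws accounts for most of the total:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw "scenario severities" from a heavy-tailed Pareto distribution.
# The shape parameter alpha is illustrative only; smaller alpha means a heavier tail.
alpha = 1.2
severities = np.sort(rng.pareto(alpha, size=100_000) + 1)

total = severities.sum()
worst_1_percent = severities[-1000:].sum()   # the 1% largest scenarios

print(f"Share of total severity in the worst 1% of scenarios: {worst_1_percent / total:.0%}")
# With alpha close to 1, the top 1% of scenarios typically accounts for a large
# share of the total -- the qualitative pattern described in the text.
```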
Critics may object that the above scenarios are largely based on the possibility of artificial sentience, particularly sentience implemented on a computer substrate. If this turns out to be impossible, there may not be much suffering in futures with AI after all. However, the view that computer-based minds can suffer in the morally relevant sense is a common implication of positions in philosophy of mind. Functionalism and type A physicalism (“eliminativism”) both imply that there can be morally relevant minds on digital substrates. Even if one were skeptical of these two positions and instead favored the views of philosophers like David Chalmers or Galen Strawson (e.g. Strawson, 2006), who believe consciousness is an irreducible phenomenon, there are at least some circumstances under which these views would also allow for computer-based minds to be sentient.12 Crude “carbon chauvinism,” the belief that consciousness is linked only to carbon atoms, is an extreme minority position in philosophy of mind.
The case for artificial sentience is not just abstract but can also be made on the intuitive level: Imagine we had whole brain emulation with a perfect mapping from inputs to outputs, behaving exactly like a person's actual brain. Suppose we also give this brain emulation a robot body, with a face and facial expressions created with particular attention to detail. The robot will, by the stipulations of this thought experiment, behave exactly like a human person would behave in the same situation. So the robot-person would very convincingly plead that it has consciousness and moral relevance. How certain would we be that this was all just an elaborate facade? Why should it be?
Because we are unfamiliar with artificial minds and have a hard time experiencing empathy for things that do not appear or behave in animal-like ways, we may be tempted to dismiss the possibility of artificial sentience or deny artificial minds moral relevance – the same way animal sentience was dismissed for thousands of years. However, the theoretical reasons to anticipate artificial sentience are strong, and it would be discriminatory to deny moral consideration to a mind simply because it is implemented on a substrate different from ours. As long as we are not very confident indeed that minds on a computer substrate would be incapable of suffering in the morally relevant sense, we should believe that most of the future’s expected suffering is located in futures where superintelligent AI colonizes space.
The world currently contains a great deal of suffering. Large sources of suffering include, for instance, poverty in developing countries, mental health issues all over the world, and non-human animal suffering in factory farms and in the wild. We already have a good overview – with better understanding in some areas than others – of where altruists can cost-effectively reduce substantial suffering. Charitable interventions are commonly chosen according to whether they produce measurable impact in the years or decades to come. Unfortunately, altruistic interventions are rarely chosen with the whole future in mind, i.e. with a focus on reducing as much suffering as possible for the rest of time, until the heat death of the universe.13 This is potentially problematic, because we should expect the far future to contain vastly more suffering than the next decades, not only because there might be sentient beings around for millions or billions of years to come, but also because it is possible for Earth-originating life to eventually colonize space, which could multiply the total number of sentient beings many times over. While it is important to reduce the suffering of sentient beings now, it seems unlikely that the most consequential intervention for the future of all sentience will also be the intervention that is best for reducing short-term suffering. Instead, as judged from the distant future, the most consequential development of our decade would more likely have something to do with novel technologies or the ways they will be used.
And yet, politics, science, economics and especially the media are biased towards short timescales. Politicians worry about elections, scientists worry about grant money, and private corporations need to work on things that produce a profit in the foreseeable future. We should therefore expect interventions targeted at the far future to be much more neglected than interventions targeted at short-term sources of suffering.
Admittedly, the far future is difficult to predict. If our models fail to account for all the right factors, our predictions may turn out very wrong. However, rather than trying to simulate in detail everything that might happen all the way into the distant future – which would be a futile endeavor, needless to say – we should focus our altruistic efforts on influencing levers that remain agile and reactive to future developments. An example of such a lever is institutions that persist for decades or centuries. The US Constitution, for instance, still carries significant relevance in today’s world, even though it was formulated hundreds of years ago. Similarly, the people who founded the League of Nations after World War I did not succeed in preventing the next war, but they contributed to the founding and the charter of its successor organization, the United Nations, which still exerts geopolitical influence today. The actors who initially influenced the formation of these institutions, as well as their values and principles, had a long-lasting impact.
In order to positively influence the future for hundreds of years, we fortunately do not need to predict the next hundreds of years in detail. Instead, all we need to predict is what type of institutions – or, more generally, stable and powerful decision-making agencies – are most likely to react to future developments maximally well.14
AI is the ultimate lever through which to influence the future. The goals of an artificial superintelligence would plausibly be much more stable than the values of human leaders or those enshrined in any constitution or charter. And a superintelligent AI would, with at least considerable likelihood, remain in control of the future not only for centuries, but for millions or even billions of years to come. In non-AI scenarios on the other hand, all the good things we achieve in the coming decade(s) will “dilute” over time, as current societies, with all their norms and institutions, change or collapse.
In a future where smarter-than-human artificial intelligence is never created, our altruistic impact – even if we manage to achieve a lot in greatly influencing this non-AI future – would be comparatively “capped” and insignificant when contrasted with the scenarios where our actions do affect the development of superintelligent AI (or how AI would act).15 We should expect AI scenarios to contain not only the most stable lever we can imagine – the AI’s goal function, which the AI will want to preserve carefully – but also the highest stakes. In comparison with non-AI scenarios, space colonization by superintelligent AI would turn the largest amount of matter and energy into complex computations. In a best-case scenario, all these resources could be turned into a vast utopia full of happiness, which provides a strong incentive for us to get AI creation perfectly right. However, if the AI is equipped with insufficiently good values, or if it optimizes for random goals not intended by its creators, the outcome could also include astronomical amounts of suffering. In combination, these two reasons – highest influence and goal stability, and highest stakes – make a strong case for focusing our attention on AI scenarios.
While critics may object that all this emphasis on the astronomical stakes in AI scenarios appears unfairly Pascalian, it should be noted that AI is not a frivolous thought experiment where we invoke new kinds of physics to raise the stakes. Smarter-than-human artificial intelligence and space colonization are both realistically possible and plausible developments that fit squarely into the laws of nature as we currently understand them. If either of them turns out to be impossible, that would be a big surprise, and would suggest that we are fundamentally misunderstanding something about the way physical reality works. While the implications of smarter-than-human artificial intelligence are hard to grasp intuitively, the underlying reasons for singling out AI as a scenario to worry about are sound. As illustrated by Leó Szilárd’s lobbying for precautions around nuclear bombs well before the first such bombs were built, it is far from hopeless to prepare for disruptive new technologies in advance, before they are completed.
This text argued that altruists concerned about the quality of the future should focus their attention on futures where AI plays an important role. This can mean many things. It does not mean that everyone should think about AI scenarios or work on technical AI alignment directly. Rather, it just means we should pick interventions to support according to their long-term consequences, and particularly according to the ways in which our efforts could make a difference to futures ruled by superintelligent AI. Whether it is best to try to affect AI outcomes in a narrow and targeted way, or whether we should go for a broader strategy, depends on several factors and requires further study.
CLR has looked systematically into paths to impact for affecting AI outcomes with particular emphasis on preventing suffering, and we have come up with a few promising candidates. The following list presents some tentative proposals:
It is important to note that human values may not affect the goals of an AI at all if researchers fail to solve the value-loading problem. Raising awareness of certain values may therefore be particularly impactful if it concerns groups likely to be in control of the goals of smarter-than-human artificial intelligence.
Further research is needed to flesh out these paths to impact in more detail, and to discover even more promising ways to affect AI outcomes. As there is always the possibility that we have overlooked something or are misguided or misinformed, we should remain open-minded and periodically rethink the assumptions our current prioritization is based on.
This text borrows ideas and framings from other people’s introductions to AI. I tried to flag this with links or citations wherever I remembered the source and where the writing was not convergent, but I may not have remembered everything. I'm particularly indebted to the writings of Eliezer Yudkowsky, Nick Bostrom and Scott Alexander. Many thanks also to David Althaus, Tobias Baumann, Ruairi Donnelly, Caspar Oesterheld and Kelly Witwicki for helpful comments and editing.
Armstrong, S. (2013). General Purpose Intelligence: Arguing the Orthogonality Thesis. Future of Humanity Institute, Oxford University.
Bostrom, N. (2003). Astronomical Waste: The Opportunity Cost of Delayed Technological Development. Utilitas, 15(3), 308-314.
Bostrom, N. (2005). What is a Singleton? Linguistic and Philosophical Investigations, 5(2), 48-54.
Bostrom, N. (2012). The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Minds and Machines, 22(2), 71-85.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Burkart, J. M., Schubiger, M. N., & Schaik, C. P. (2016). The evolution of general intelligence. Behavioral and Brain Sciences, 1-65.
Chalmers, D. (2010). The Singularity: A Philosophical Analysis. Journal of Consciousness Studies, 17: 7-65.
Daswani, M. & Leike, J. (2015). A Definition of Happiness for Reinforcement Learning Agents. arXiv:1505.04497.
Dawkins, R. (1996). The blind watchmaker: Why the evidence of evolution reveals a universe without design. New York: Norton.
Good, I. J. (1965). Speculations concerning the first ultraintelligent machine. Blacksburg, VA: Dept. of Statistics, Virginia Polytechnic Institute and State University.
Hanson, R. (2016). The Age of Em: Work, love, and life when robots rule the Earth. Oxford: Oxford University Press.
Legg, S., & Hutter, M. (2005). Universal Intelligence: A Definition of Machine Intelligence. Minds and Machines, 17(4), 391-444.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., . . . Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
Müller, V. C., & Bostrom, N. (2016). Future Progress in Artificial Intelligence: A Survey of Expert Opinion. Fundamental Issues of Artificial Intelligence, 553-570.
Newman, M. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5), 323–351.
Omohundro, S. (2008). The Basic AI Drives. Proceedings of the 2008 conference on Artificial General Intelligence 2008: 483-492. IOS Press Amsterdam.
Sandberg, A. (1999). The Physics of Information Processing Superobjects: Daily Life Among the Jupiter Brains.
Sandberg, A. (2014). Monte Carlo model of brain emulation development, Working Paper 2014-1 (version 1.2). Future of Humanity Institute, Oxford University.
Sandberg, A. & Bostrom, N. (2008). Whole Brain Emulation: A Roadmap, Technical Report #2008‐3, Future of Humanity Institute, Oxford University.
Schmidhuber, J. (2006). Gödel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements. arXiv:cs.LO/0309048 v5.
Shulman, C. & Bostrom, N. (2012). How Hard is Artificial Intelligence? Evolutionary Arguments and Selection Effects. Journal of Consciousness Studies, 19(7-8), 103-130.
Shulman, C. & Sandberg, A. (2010). Implications of a Software-Limited Singularity. In ECAP10: VIII European Conference on Computing and Philosophy.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Driessche, G. V., . . . Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
Strawson, G. (2006). Realistic Monism - Why Physicalism Entails Panpsychism. Journal of Consciousness Studies, 13(10-11), 3–31.
Wittman, A. B., & Wall, L. L. (2007). The Evolutionary Origins of Obstructed Labor: Bipedalism, Encephalization, and the Human Obstetric Dilemma. Obstetrical & Gynecological Survey, 62(11), 739-748.
The post Altruists Should Prioritize Artificial Intelligence appeared first on Center on Long-Term Risk.
The post How Feasible Is the Rapid Development of Artificial Superintelligence? appeared first on Center on Long-Term Risk.
Since Turing (1950), the dream of artificial intelligence (AI) research has been the creation of a “machine that could think”. While the current expert consensus is that the creation of such a system will still take several decades, if not more (Müller & Bostrom 2016), recent progress in AI has raised worries about the challenges involved with increasingly capable AI systems (Future of Life Institute 2015, Amodei et al. 2016).
In addition to the risks posed by near-term developments, there is the possibility of AI systems eventually reaching superhuman levels of intelligence and breaking out of human control (Bostrom 2014). Various research agendas and lists of research priorities have been suggested for managing the challenges that this level of capability would pose to society (Soares & Fallenstein 2014, Russell et al. 2015, Amodei et al. 2016, Taylor et al. 2016).
For managing the challenges presented by increasingly capable AI systems, one needs to know how capable those systems might ultimately become, and how quickly. If AI systems can rapidly achieve strong capabilities, becoming powerful enough to take control of the world before any human can react, then that implies a very different approach than one where AI capabilities develop gradually over many decades, never getting substantially past the human level (Sotala & Yampolskiy, 2015). We might phrase these questions as:
Views on these questions vary. Authors such as Bostrom (2014) and Yudkowsky (2008) argue for the possibility of a fast leap in intelligence, with both offering hypothetical example scenarios where an AI rapidly acquires a dominant position over humanity. On the other hand, Anderson (2010) and Lawrence (2016) appeal to fundamental limits on predictability – and thus intelligence – posed by the complexity of the environment. Lawrence writes:
‘Practitioners who have performed sensitivity analysis on time series prediction will know how quickly uncertainty accumulates as you try to look forward in time. There is normally a time frame ahead of which things become too misty to compute any more. Further computational power doesn’t help you in this instance, because uncertainty dominates. Reducing model uncertainty requires exponentially greater computation. We might try to handle this uncertainty by quantifying it, but even this can prove intractable.
So just like the elusive concept of infinite precision in mechanical machining, there is likely a limit on the degree to which an entity can be intelligent. We cannot predict with infinite precision and this will render our predictions useless on some particular time horizon.
The limit on predictive precision is imposed by the exponential growth in complexity of exact simulation, coupled with the accumulation of error associated with the necessary abstraction of our predictive models. As we predict forward these uncertainties can saturate dominating our predictions. As a result we often only have a very vague notion of what is to come. This limit on our predictive ability places a fundamental limit on our ability to make intelligent decisions.’
We might summarize this as saying that, past a certain point, increased intelligence is only of limited benefit, for the unpredictability of the environment means that you would have to spend exponentially more resources to evaluate a vastly increasing amount of possibilities.
Noise also accumulates over time, reducing the reliability of your models. For many kinds of predictions, increasing the prediction window would require an exponential increase in the number of measurements (Martela 2016). For instance, weather models become increasingly uncertain when projected farther out in time. Forecasters can only access a limited number of observations relative to the weather system’s degrees of freedom, and any initial imprecisions will magnify over time and cause the accuracy to deteriorate (Buizza, 2002). The accuracy of any long-term weather prediction will thus always be bounded by the number of available data points. Similar considerations could also apply to attempts to predict things such as the behavior of human societies.
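A toy illustration of this sensitivity (using the logistic map rather than any real weather model; the parameter r, the starting points, and the size of the initial error are all arbitrary) shows how a tiny initial measurement error swamps the prediction after a few dozen steps:

```python
# Two logistic-map trajectories that start almost identically diverge completely
# within a few dozen steps, mirroring how small measurement errors cap the
# useful prediction horizon of a chaotic system.
def logistic_trajectory(x0, steps, r=3.9):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.400000000, 60)
b = logistic_trajectory(0.400000001, 60)  # initial error of one part in a billion

for t in (0, 20, 40, 60):
    print(f"step {t:2d}: difference = {abs(a[t] - b[t]):.2e}")
# The difference grows roughly exponentially until it saturates at the size of the
# attractor itself, after which the forecast is no better than chance.
```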
With models plagued both by exponentially increasing complexity and by exponentially accumulating noise, the advantage that even a superhuman intelligence might have over humans may be limited.
On the other hand, it is not obvious whether this point of view really is in conflict with the assumption of AI being able to quickly grow to become powerful. There being limits to prediction does not imply that humans would be particularly close to the limits, nor that it would necessarily take a great amount of time to move from sub-human to superhuman capability.
This article approaches these questions by considering what we know about expertise and intelligence. After reviewing the relevant research on human expertise, I will discuss its relevance for AI, and consider how AI could improve on humans in two major aspects of thought and expertise, namely mental simulation and pattern recognition. My current conclusion is that although the limits to prediction are real, it seems that AI could still substantially improve on human intelligence, possibly even mastering domains which are currently too hard for humans. The possibility of AI developing significant real-world capabilities in a relatively brief time seems like one that cannot be ruled out.
Ideally, we might turn to theoretical AI research for a precise theory about acquiring cognitive capabilities. Unfortunately, AI research is not at that point yet. Instead we will consider the research on human expertise and decision-making.
There exists a preliminary understanding, if not of the details of human decision-making, then at least of its general outline. A picture that emerges from this research is that expertise is about developing the correct mental representations (Klein 1999; Ericsson & Pool, 2016).
A mental representation is a very general concept. In the words of expertise researcher Anders Ericsson (Ericsson & Pool, 2016):
‘A mental representation is a mental structure that corresponds to an object, an idea, a collection of information, or anything else, concrete or abstract, that the brain is thinking about. A simple example is a visual image. Mention the Mona Lisa, for instance, and many people will immediately ‘see’ an image of the painting in their minds; that image is their mental representation of the Mona Lisa. Some people’s representations are more detailed and accurate than others, and they can report, for example, details about the background, about where Mona Lisa is sitting, and about her hairstyle and her eyebrows.‘
Domain-specific mental representations are important because they allow experts to know what something means; know what to expect; know what good performance should feel like; know how to achieve the good performance; know the right goals for a given situation; know the steps necessary for achieving those goals; mentally simulate how something might happen; learn more detailed mental representations for improving their skills (Klein, 1999; Ericsson & Pool, 2016).
Although good decision-making is often thought of as a careful deliberation of all the possible options, such a type of thinking tends to be typical of novices (Klein, 1999). A novice will have to try to carefully reason their way through to an answer, and will often do poorly regardless, because they do not know what things are relevant to take into account and which ones are not. An expert doesn’t need to – they are experienced enough to instantly know what to do.
A specific model of expertise is the Recognition-Primed Decision-Making (RPD) model (Klein, 1999). First, a decision-maker sees some situation, such as a fire for a firefighter or a design problem for an architect. The situation may then be recognized as familiar, such as a typical garage fire. Recognizing a familiar situation means understanding what goals make sense and what should be focused on, which cues to pay attention to, what to expect next and when a violation of expectations shows that something is amiss, and knowing what the typical ways of responding are. Ideally, the expert will instantly know what to do.
If the situation is unfamiliar, then the expert may need to construct a mental simulation of what is going on, how things might have developed to this point, and what effect different actions would have. For example, a firefighter thinking about how to rescue someone from a difficult spot might mentally simulate where different rescue harnesses might be attached on the person, and whether that would exert dangerous amounts of force on them.
Mental representations are necessary for a good simulation, as they let the expert know what things to take into account, what things could plausibly be tried, and what effects they would have. In the example, the firefighter’s knowledge allows him to predict that specific ways of attaching the rescue harness would have dangerous consequences, while others are safe.
Mental representations are developed through practice. A novice will try out something and see what happens as a result. This gives them a rough mental representation and a prediction of what might happen if they try the same thing again, leading them to try out the same thing again or do something else instead.
Just practice isn’t enough, however – there also needs to be feedback. Someone may do a practice drill over and over again and think that they are practicing and thus improving – but without some sign of how well that is going, they may just keep repeating the same mistakes over and over (Ericsson & Pool, 2016).
The importance of quality feedback is worth emphasizing. Skills do not develop unless there is feedback that is conducive to developing better mental representations. In fact, there are entire fields in which experienced practitioners are not much better than novices, because the field does not provide them with enough feedback. Shanteau (1992) provides the following breakdown of professions for which there is agreement on the nature of their performance:
| Good performance | Bad performance |
|---|---|
| Weather Forecasters | Clinical Psychologists |
| Livestock Judges | Psychiatrists |
| Astronomers | Astrologers |
| Test Pilots | Student Admissions |
| Soil Judges | Court Judges |
| Chess Masters | Behavioral Researchers |
| Physicists | Counselors |
| Mathematicians | Personnel Selectors |
| Accountants | Parole Officers |
| Grain Inspectors | Polygraph (Lie Detector) Judges |
| Photo Interpreters | Intelligence Analysts |
| Insurance Analysts | Stock Brokers |
In analyzing why some domains enable the development of genuine expertise and others don’t, Shanteau identified a number of considerations that relate to the nature of feedback. In an occupation like weather forecasting, the criteria you use for forecasting are always the same; you will always be facing the same task and can practice it over and over; you get quick feedback on whether your prediction was correct; you can use formal tools to analyze what you predicted would happen and why that prediction did or didn’t happen; and things can be analyzed in objective terms. This allows weather forecasters to develop powerful mental representations that get better and better at making the correct prediction.
Contrast this with someone like an intelligence analyst. The analyst may be called upon to analyze very different clues and situations; each of the tasks may be unique, making it harder to know which lessons from previous tasks apply; for many of the analyses, one might never know whether they were right or not; and questions about socio-cultural matters tend to be much more subjective than questions about weather, making objective analysis impossible. In short, for much of the work that the analyst does, there is simply no feedback available to tell whether the analyst has made the right judgment or not. And without feedback, there is no way to improve one’s mental representations, and thus expertise.
A slightly different perspective on expertise comes from the heuristics & biases literature, which frequently portrays even experts as being easily mistaken. In contrast, the expertise literature that we have reviewed so far has viewed experts as being typically capable and as having trustworthy intuition. Kahneman & Klein (2009) make an attempt to reconcile the two fields, and come to agree that:
This consensus is in line with what we have covered so far, though it also includes the consideration of validity. One cannot learn mental representations that would predict a domain or dictate the right actions for different situations in a domain, if that domain is simply too complicated or chaotic to be predicted. Kahneman & Klein provide the following illustrative example of a domain simply being impossible to predict:
‘When Tetlock [...] embarked on his ambitious study of long-term forecasts of strategic and economic events by experts, the outcome of his research was not obvious. Fifteen years later it was quite clear that the highly educated and experienced experts that he studied were not superior to untrained readers of newspapers in their ability to make accurate long-term forecasts of political events. The depressing consistency of the experts’ failure to outdo the novices in this task suggests that the problem is in the environment: Long-term forecasting must fail because large-scale historical developments are too complex to be forecast. The task is simply impossible. A thought experiment can help. Consider what the history of the 20th century might have been if the three fertilized eggs that became Hitler, Stalin, and Mao had been female. The century would surely have been very different, but can one know how?’
Meanwhile, practice does help in more predictable domains. A recent meta-analysis (Macnamara, Hambrick, & Oswald, 2014) on the effects of practice on skill found that the more predictable an activity was, the more practice contributed to performance in that activity.
Having reviewed some necessary background, we will now finally get back to the topic of superintelligence capabilities.
Similarly to humans, AI systems cannot reach intelligent conclusions by a mere brute force calculation of every possibility. Rather, an intelligence needs to learn to exploit predictable regularities in the world in order to develop further. All machine learning based systems are based on this principle: they can be said to learn a 'mental' representation of the world, analogous to the way humans do.
A strong reason to expect that AI systems will also end up developing roughly human-like mental representations for carrying out different tasks is that the representations of human experts are in a sense an optimal solution to the problems at hand. A human expert will have learned to identify the smallest set of cues that will let them know how to act in a certain situation; their mental representations encode information about how to choose the correct actions using the least amount of thought (Klein 1999).
Machine learning also tries to focus its analysis on exactly the right number of cues that will provide the right predictions, ignoring any irrelevant information. Traditional machine learning approaches have relied extensively on feature engineering, a labor-intensive process where humans determine which cues in the data are worth paying attention to.
A major reason behind the recent success of deep learning models is their capability for feature learning or representation learning: being able to independently discover high-level features in the data which are worth paying attention to, without (as much) external guidance (Bengio, Courville, & Vincent, 2012). Being able to identify and extract the most important features of the data allows the system to make its decisions based on the smallest amount of cues that allows it to reach the right judgment – just as human experts learn to identify the most relevant cues in the situations that they encounter.
Finally, the aspect of increasingly detailed mental representations giving an expert a yardstick to compare their performance against (Ericsson & Pool 2016) has an analogue in reinforcement learning methods. In deep reinforcement learning, a deep learning model learns to estimate how valuable a specific state of the world is, after which the system takes actions to move the world towards that state (Mnih et al., 2015). Similarly, a human expert comes to learn that specific states (e.g. a certain feeling in the body when diving) are valuable, and can then increasingly orient their behavior so as to achieve this state.
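A minimal tabular sketch of this idea, assuming a toy one-dimensional world: the system learns an estimate of how valuable each state is and can then steer toward high-value states. (Mnih et al. 2015 learn such estimates with a deep network over pixel inputs; the lookup table below only illustrates the underlying principle, not their method.)

```python
import random

n_states = 10                     # states 0..9 on a line; reaching state 9 pays reward 1
values = [0.0] * n_states
alpha, gamma = 0.1, 0.9           # learning rate and discount factor (arbitrary choices)

for episode in range(5000):
    state = random.randrange(n_states - 1)
    while state != n_states - 1:
        # explore with a random walk while learning state values
        nxt = random.choice([max(state - 1, 0), min(state + 1, n_states - 1)])
        reward = 1.0 if nxt == n_states - 1 else 0.0
        # temporal-difference update of the state-value estimate
        values[state] += alpha * (reward + gamma * values[nxt] - values[state])
        state = nxt

print([round(v, 2) for v in values])
# The learned values rise toward the rewarding state; an agent that then always
# moves to the higher-valued neighboring state heads straight for the goal.
```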
In summary, both human experts and current state-of-the-art AI systems use mental representations as the building blocks of their expertise. As there have been no serious alternative accounts presented of how expertise might work, I will assume that the capabilities of hypothetical superintelligences will depend on them developing the correct mental representations.
This paper set out to consider two main questions:
Let us now return to these.
The argument for an AI’s predictive capabilities being limited was that there are limits to prediction, and that predicting events ever further forward in time requires exponentially more reasoning power as well as more measurement points, quickly becoming intractable. How capable could an AI become despite these two constraints?
The components of human expertise might be roughly divided into two: building up a battery of accurate mental representations, and being able to use them for mental simulations. Similarly, approaches to artificial intelligence can roughly be divided into pattern recognition and model-building (Lake, Ullman, Tenenbaum, & Gershman, 2016), depending on whether patterns in data or models of the world are treated as the primary unit of thought.
As this kind of a distinction seems to emerge both from psychology and AI research, I will assume that an AI’s expertise will also involve acquiring mental representations (or equivalently, doing pattern recognition) as well as accurately using them in mental simulations. We will consider these two separately.
An interesting look at the potential benefits offered by improved mental simulation ability comes from Philip Tetlock’s Good Judgement Project (GJP), popularized in the book Superforecasting (Tetlock & Gardner, 2015).2 Participating in a contest to forecast the probability of various events, the best GJP participants – the so-called ‘superforecasters’ – managed to make predictions whose accuracy outperformed those of professional intelligence analysts working with access to classified data.3 This is particularly interesting as the superforecasters had no particular domain expertise in answering most of the questions, with sample questions including ones such as
Tetlock & Gardner report the superforecasters’ accuracy in terms of the Brier score, which on this two-outcome formulation ranges from 0 (perfect accuracy) to 2 (maximally wrong), with 0.5 corresponding to always guessing 50/50.4 On this scale, superforecasters had a score of 0.25 at the end of GJP’s first year, compared to 0.37 for the other forecasters participating in the project. By the end of the second year, superforecasters had improved their Brier score to 0.07 (Mellers et al., 2014). Superforecasters could also project further out in time: their accuracy at making predictions 300 days out was better than the other forecasters’ accuracy at making predictions 100 days out. In terms of being on the right side of 50/50, GJP’s best wisdom-of-the-crowd algorithms (deriving an overall prediction from the different forecasters’ predictions) delivered a correct prediction on 86% of all daily forecasts (Tetlock, Mellers, & Rohrbaugh, 2014).
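For reference, the two-outcome Brier score these figures refer to can be written as a one-line function (a sketch of the standard formula, not code from the Good Judgment Project):

```python
def brier(forecast_p, outcome):
    """Two-outcome Brier score: 0 = perfect, 2 = worst possible.

    forecast_p: probability assigned to the event occurring.
    outcome: 1 if the event occurred, 0 otherwise.
    """
    return (forecast_p - outcome) ** 2 + ((1 - forecast_p) - (1 - outcome)) ** 2

print(brier(0.5, 1))   # 0.5  -- always guessing 50/50 scores 0.5 whatever happens
print(brier(0.9, 1))   # 0.02 -- confident and correct
print(brier(0.9, 0))   # 1.62 -- confident and wrong
```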
The superforecasters’ success relied on a number of techniques, but a central one was the ability to consider and judge the relevance of a number of factors that might cause a prediction to become true or false. Tetlock & Gardner illustrate this technique by discussing how a superforecaster, Bill Flack, approached the question of whether an investigation of Yasser Arafat’s remains would reveal traces of polonium, suggestive of Arafat having been poisoned by Israel:
‘Bill unpacked the question by asking himself 'What would it take for the answer to be yes? What would it take for it to be no?' He realized that the first step of his analysis had nothing to do with politics. Polonium decays quickly. For the answer to be yes, scientists would have to be able to detect polonium on the remains of a man dead for years. Could they? A teammate had posted a link to the Swiss team’s report on the testing of Arafat’s possessions, so Bill read it, familiarized himself with the science of polonium testing, and was satisfied that they could do it. Only then did he move on to the next stage of the analysis.
Again, Bill asked himself how Arafat’s remains could have been contaminated with enough polonium to trigger a positive result. Obviously, 'Israel poisoned Arafat' was one way. But because Bill carefully broke the question down, he realized there were others. Arafat had many Palestinian enemies. They could have poisoned him. It was also possible that there had been 'intentional postmortem contamination by some Palestinian faction looking to give the appearance that Israel had done a Litvinenko on Arafat,' Bill told me later. These alternatives mattered because each additional way Arafat’s body could have been contaminated with polonium increased the probability that it was. Bill also noted that only one of the two European teams had to get a positive result for the correct answer to the question to be yes, another factor that nudged the needle in that direction. [...]
… there were several pathways to a 'yes' answer: Israel could have poisoned Arafat; Arafat’s Palestinian enemies could have poisoned him; or Arafat’s remains could have been contaminated after his death to make it look like a poisoning. Hypotheses like these are the ideal framework for investigating the inside view.
Start with the first hypothesis: Israel poisoned Yasser Arafat with polonium. What would it take for that to be true?
- Israel had, or could obtain, polonium.
- Israel wanted Arafat dead badly enough to take a big risk.
- Israel had the ability to poison Arafat with polonium.
Each of these elements could then be researched—looking for evidence pro and con—to get a sense of how likely they are to be true, and therefore how likely the hypothesis is to be true. Then it’s on to the next hypothesis. And the next. ‘
Tetlock does not go into detail about the prerequisites for being able to carry out such analysis – other than noting that it’s slow and effortful – but there are some considerations that seem like plausible prerequisites. First, a person needs to have enough general knowledge to generate different possibilities for how an event could have come true. Next, they need the ability to analyze and investigate those possibilities further, either personally acquiring the relevant domain knowledge for evaluating their plausibility, or finding a relevant subject matter expert. In this example, Bill familiarized himself with the science of polonium testing until he was satisfied that it would be possible to detect polonium traces from a long time ago.
This suggests a general procedure which an AI could also follow in order to predict the possibility of something in which it does not yet have expertise. An AI that was trying to predict the outcome of some specific question could tap into its existing general knowledge in an attempt to identify relevant causal factors; if it failed to generate them, it could look into existing disciplines which seemed relevant for the question. For each identified possibility, it could branch off a new subprocess to do research into that particular direction, sharing information as necessary with a main process whose purpose was to integrate the insights derived from all the relevant searches.
Such a capability for several parallel streams of attention could provide a major advantage. A human researcher or forecaster who branches off to do research on a subquestion will need to make sure that they don’t lose track of the big picture, and needs to have an idea of whether they are making meaningful progress on that subquestion and whether it would be better to devote attention to something else instead. To the extent that there can be several parallel streams of attention, these issues can be alleviated, with a main stream focusing on the overall question and substreams on specific subpossibilities.
How much could this improve on human forecasters? Forecasters performed better when they were placed on teams where they shared information between each other, which similarly allowed an extent of parallelism in prediction-making, in that different forecasters could pursue their own angles and directions in exploring the problem. The differences between individual forecasters and teams of forecasters with comparable levels of training ranged between 0.05 and 0.10 Brier points at the end of the first year, and between 0.02 and 0.08 Brier points at the end of the second year (Mellers et al., 2014). In humans however, it seems likely that the extent of parallelism was constrained by the fact that each forecaster had to independently familiarize themselves with much of the same material, and that their ability to share knowledge between each other was limited by the speed of writing and reading. This suggests a possibility for further improvement.
Example: parallel streams of attention with a LIDA-like architecture
How could different streams of attention within an AI share information between each other? Recall that we have defined the development of expertise as the ability to accumulate mental patterns which are used to identify relevant cues and to indicate what predictions should be derived from them. A computational model of attention and consciousness is Global Workspace Theory (Baars, 2002; 2005), of which a particular AI implementation is the LIDA model (Franklin & Patterson, 2006; Franklin, Madl, D’Mello, & Snaider, 2014; Madl, Franklin, Chen, Montaldi, & Trappl, 2016). LIDA is a model of the mind that is inspired by psychological and neuroscientific research and attempts to capture its main mechanisms. We can use LIDA to get a rough example of what having several ‘streams of attention’ would mean, and how information could be exchanged between them.

LIDA works by means of an understand-attend-act cycle. In each cycle, low-level sensory information is initially interpreted so as to associate it with higher-level concepts to form a ‘percept’, which is then sent to a workspace. In the workspace, the percept activates further associations in other memory systems, which are combined with the percept to create a Current Situational Model, an understanding of what is going on at this moment. The entirety of the Current Situational Model is likely to be too complex for the agent to process, so it needs to select a part of it to elevate to the level of conscious attention to be acted upon. This is carried out using ‘attention codelets’, small pieces of code that attempt to train attention on some particular piece of information, each with their own set of concerns about what is important. Attention codelets with matching concerns form coalitions around what to attend to, competing against other coalitions.

Whichever coalition ends up winning the competition will have its chosen part of the Current Situational Model ‘become conscious’, broadcast to the rest of the system, and particularly to Procedural Memory. The Procedural Memory holds schemes, or templates of different actions that can be taken in different contexts. Schemes which include a context or an action that matches the contents of the conscious broadcast become available as candidates for possible actions. They are copied to the Action Selection mechanism, which chooses a single action to perform. The selected action is further sent to Sensory-Motor Memory, which contains information about how exactly to perform the action. The outcome of taking this action manifests itself as new sensory information, beginning the cognitive cycle anew.

Here is a description of how this process – or something like it – might be applied in the case of an AI seeking to predict the outcome of a specific question, such as the ‘will Saudi Arabia agree to oil production cuts’ question discussed below. The decision to consider this question has been made in an earlier cognitive cycle, and information relevant to it is now available in the inner environment and the Current Situational Model. The concepts of Saudi Arabia and oil production trigger several associations in the AI’s memory systems, such as the fact that oil prices will affect Saudi Arabia’s financial situation, and that oil prices are also influenced by other factors such as global demand. Two coalitions of attention codelets might form, one focusing on the current financial situation and another on influences on oil prices.

In LIDA, these codelets would normally compete, and one of them would win and trigger a specific action, such as a deeper investigation of Saudi Arabia’s financial situation. In our hypothetical AI, however, it might be enough that both coalitions manage to exceed some threshold level of success, indicating them both to be potentially relevant. In that case, new instances of the Procedural Memory, Action Selection, and Sensory-Motor Memory mechanisms might be initialized, with one coalition sending its contents to the first set of instances and the other to the second. These streams could then independently carry out searches of the information that was deemed relevant, also having their own local Situational Models and Workspaces focusing on content relevant for this search. As they worked, these streams would update the various memory subsystems with the results of their learning, making new associations and attention codelets available to all attentional streams. Their functioning could be supervised by a general high-level attention stream, whose task was to evaluate the performance of the various lower-level streams and allocate resources between them accordingly.
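The hypothetical modification described above might be condensed into toy code along the following lines. All class and function names here are invented for illustration and do not correspond to the actual LIDA implementation; the point is only the flow: codelets flag content, coalitions form, and every coalition above a threshold is broadcast and given its own stream.

```python
from dataclasses import dataclass, field

@dataclass
class Coalition:
    topic: str
    activation: float            # how strongly its codelets flag this content
    content: dict = field(default_factory=dict)

def attend(situational_model: dict, codelets) -> list[Coalition]:
    """Let attention codelets form coalitions over the Current Situational Model."""
    return [codelet(situational_model) for codelet in codelets]

def broadcast(coalitions: list[Coalition], threshold: float) -> list[Coalition]:
    """Instead of a single winner (as in standard LIDA), every coalition above
    the threshold is broadcast, so that each can drive its own attention stream."""
    return [c for c in coalitions if c.activation >= threshold]

def research_stream(coalition: Coalition) -> str:
    """Stand-in for a sub-stream doing its own search and local modeling."""
    return f"investigating '{coalition.topic}' with cues {list(coalition.content)}"

# Hypothetical codelets for the Saudi Arabia oil question
situation = {"oil_price": "falling", "saudi_reserves": "large", "us_drilling": "high"}
codelets = [
    lambda s: Coalition("financial situation", 0.8, {"saudi_reserves": s["saudi_reserves"]}),
    lambda s: Coalition("drivers of oil price", 0.7, {"us_drilling": s["us_drilling"]}),
    lambda s: Coalition("weather", 0.1, {}),   # irrelevant; stays below threshold
]

for c in broadcast(attend(situation, codelets), threshold=0.5):
    print(research_stream(c))
```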
In general, accurate forecasting requires an ability to carry out sophisticated causal modeling about a variety of interacting factors. Tetlock & Gardner write:
‘The commentary that superforecasters post on GJP forums is rife with 'on the one hand/ on the other' dialectical banter. And superforecasters have more than two hands. 'On the one hand, Saudi Arabia runs few risks in letting oil prices remain low because it has large financial reserves,' wrote a superforecaster trying to decide if the Saudis would agree to OPEC production cuts in November 2014. 'On the other hand, Saudi Arabia needs higher prices to support higher social spending to buy obedience to the monarchy. Yet on the third hand, the Saudis may believe they can’t control the drivers of the price dive, like the drilling frenzy in North America and falling global demand. So they may see production cuts as futile. Net answer: Feels no-ish, 80%.' (As it turned out, the Saudis did not support production cuts— much to the shock of many experts.) [...] Superforecasters pursue point-counterpoint discussions routinely, and they keep at them long past the point where most people would succumb to migraines.’
This suggests that an AI with sufficient hardware capability could achieve considerable predictive power through its ability to explore many different perspectives and causal factors at once. The mental simulations of humans tend to be limited to around three causal factors and six transition states (Klein, 1999). The discussion of the superforecasters clearly brought up many more possibilities, and their accuracy suggests moderate ability to integrate all those factors together. Yet comments such as ‘feels no-ish’ suggest that they still couldn’t construct a full-blown mental simulation in which the various causal factors would have influenced each other based on principled rules which could be inspected, evaluated, and revised based on feedback and accuracy. This seems especially plausible given that Klein speculates the limits in the size of human mental simulations to come from working memory limitations.
AI systems with larger working memory capacities might be able to construct much more detailed simulations. Contemporary computer models can involve simulations with thousands or tens of thousands of variables, though flexibly incorporating diverse mental representations into a single simulation will probably take considerably more memory and computing power than what is used in today’s models.
These simulations do not necessarily need to incorporate an exponentially increasing number of variables in order to achieve better prediction accuracy. As previously noted, superforecasters were more accurate at making predictions 300 days out than the rest of the forecasters in GJP were at making predictions 100 days out. Given that at least some of the superforecasters only used a few hours a day on making their predictions, and that they had many predictions to rate, they probably did not consider a vastly larger number of factors than the rest of the forecasters.
Klein (1999) offers an example of a professor who used three causal factors (the rate of inflation, the rate of unemployment, and the rate of foreign exchange) and a few transitions to relatively accurately simulate how the Polish economy would develop in response to the decision to convert from socialism to a market economy. In contrast, less sophisticated experts could only name two variables (inflation and unemployment) and not develop any simulations at all, basing their predictions mostly on their ideological leanings.
Having large explicit models also allows for the models to be adjusted in response to feedback. The excerpt below describes how the professor expected unemployment to develop, and how it actually developed:
‘If the government had the courage to drop unproductive industries, many people would lose their jobs. This would start in about six months as the government sorted things out. The unemployment would be small by U.S. standards, rising from less than 1 percent to maybe 10 percent. For Poland, this increase would be shocking. Politically, it might be more than the government could tolerate and might force it to end the experiment with capitalism. When we reviewed his estimates, we found that unemployment had not risen as quickly as he expected, probably, Andrzej believed, because the government was not as ruthless as it said it would be in closing unproductive plants. Even worse, if a plant was productive in areas A, B, and C and was terrible in D and E, then as long as they made a profit, they continued their operations without shutting down areas D and E. So the system faced a built-in resistance to increased unemployment (Klein, 1999).’
In this example, the model failed to predict the government’s caution, which could then be added as an additional variable to consider for the next model. The addition of this variable alone might then considerably increase the accuracy of the simulation.
Tetlock & Gardner report that the superforecasters used highly granular probability estimates – carefully thinking about whether the probability of an event was 3% as opposed to 4%, for instance – and that the granularity actually contributed to accuracy, with the predictions getting less accurate if they were rounded to the closest 5%. Given that such granularity was achieved by integrating various possibilities and considerations, it seems that an ability to consider and integrate an even larger number of possibilities might provide even greater granularity, and thus a prediction edge.
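A back-of-the-envelope way to see why rounding discards real information (a generic property of the Brier score, not Tetlock’s own analysis): for a calibrated forecaster whose true probability is p, reporting a rounded value q adds an expected penalty of 2(q - p)^2 on the two-outcome scale. The numbers below are illustrative only.

```python
# Expected two-outcome Brier score when the true probability is p but the
# forecaster reports q: E = p*[(q-1)^2 + (1-q)^2] + (1-p)*[q^2 + (-q)^2]
#                         = 2p(1-p) + 2(q-p)^2,
# so any rounding error (q - p) adds 2*(q - p)^2 in expectation.
def expected_brier(p, q):
    return p * ((q - 1) ** 2 + (1 - q) ** 2) + (1 - p) * (q ** 2 + (1 - q - 1) ** 2)

p = 0.03                          # the forecaster's true (calibrated) probability
print(expected_brier(p, 0.03))    # reporting 3%:              0.0582
print(expected_brier(p, 0.05))    # rounded to the nearest 5%: 0.0590
```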
In summary, an AI could be able to run vastly larger mental simulations than humans could, with this possibility being subject to computing power limitations; given this, its simulations could also be explicit, allowing it to adjust and correct them in response to feedback to provide improved prediction accuracy; and it could have several streams of attention running concurrently and sharing information between each other. Existing evidence from human experts suggests that large increases to prediction capability might not necessarily need a large increase in the number of variables considered, and that even small increases can provide considerable additional gains.
How fast could an AI develop the ability to run comprehensive and large mental simulations?5 Creating larger mental simulations than humans have access to seems to require extensive computational resources, either from hardware or from optimized software. As an additional consideration, we have previously mentioned limited working memory restricting the capabilities of humans, but human working memory is not the same thing as RAM in computer systems. If one were running a simulation of the human brain in a computer, one could not increase the brain’s available working memory simply by increasing the amount of RAM the simulation had access to. Rather, it has been hypothesized that working memory differences between individuals may reflect things such as the ability to discriminate between relevant and irrelevant information (Unsworth & Engle, 2007), which could be related to things like brain network structure and thus be more of a software than a hardware issue.6 Yudkowsky (2013) notes that if increased intelligence were a simple matter of scaling up the brain, the road from chimpanzees to humans would likely have been much shorter, as simple factors such as brain size can respond rapidly to evolutionary selection pressure.
Thus, advances in mental simulation size depend on (i) hardware progress and (ii) advances in software engineering. Hardware progress is hard to predict, but advances in software engineering capabilities might be achievable using mostly theoretical and mathematical research. This would require the development of expertise in mathematics, programming, and theoretical computer science.
Much of mathematical problem-solving is about having a library of procedures, reformulations, and heuristics that one can try (Polya, 1990), as well as developing a familiarity and understanding of many kinds of mathematical results, which one may then later on recognize as relevant. This seems like the kind of task that relies strongly on pattern-matching abilities, and might in principle be in reach by an advanced deep reinforcement learning system that was fed a sufficiently large library of heuristics and worked proofs to let it develop superhuman mathematical intuition.7 Modern-day theorem provers often know what kinds of steps are valid, but not which steps are worth taking; merging them with the 'artificial intuition' of deep reinforcement learning systems might eventually produce systems with superhuman mathematical ability.
Progress in this field could allow AI systems to achieve superhuman abilities in math research, considerably increasing their ability to develop more optimized software to take full advantage of the available hardware. To the extent that relatively small increases in the number of variables considered in a high-level simulation would allow for dramatically increased prediction ability (as is suggested by e.g. the superforecasters being better predictors with thrice the prediction horizon of less accurate forecasters), moderate increases in the size of the AI’s simulations could translate to drastic increases in terms of real-world capability.
Yudkowsky (2013) notes that although the evolutionary record strongly suggests that algorithmic improvements were needed for taking us from chimpanzees to humans, the record rules out exponentially increasing hardware always being needed for linear cognitive gains: the size of the human brain is only four times that of the chimpanzee brain. This further suggests that relatively limited improvements could allow for drastic increases in intelligence.
The capability to run large simulations isn’t enough by itself. The AI also needs to acquire a sufficiently large number of patterns to be included in the simulations, to predict how different pieces in the simulation behave.
When it comes to well-defined tasks, current AI systems excel at pattern recognition, being able to analyze vast amounts of data and build them into an overall model, finding regularities that human experts never would have. For instance, human experts would likely have been unable to anticipate that men who 'like' the Facebook page 'Being Confused After Waking Up From Naps' are more likely to be heterosexual (Kosinski, Stillwell, & Graepel, 2013). Similarly, the Go-playing AI AlphaGo, whose good performance against the expert player Lee Sedol could to a large extent be attributed to its built-up understanding of the kinds of board patterns that predict victory, managed to make moves that Go professionals watching the game considered creative and novel.
The ability to find subtle patterns in data suggests that AI systems might be able to make predictions in domains which humans currently consider impossible to predict. We previously discussed the issue of the (predictive) validity of a domain, with domains being said to have higher validity if 'there are stable relationships between objectively identifiable cues and subsequent events or between cues and the outcomes of possible actions' (Kahneman & Klein, 2009). A field could also be valid despite being substantially uncertain, with warfare and poker being listed as examples of fields that were valid (letting a skilled actor improve their average performance) despite also being highly uncertain (with good performance not being guaranteed even for a skilled actor).
We already know that the validity of a field also depends on an actor’s cognitive and technological abilities. For example, weather forecasting used to be a field in which almost no objectively identifiable cues were available, relying mostly on guesswork and intuition, but the development of modern meteorological theory made it a much more valid field (Shanteau, 1992). Thus, even fields which have low validity to humans with modern-day capabilities could become more valid for more advanced actors.
A possible example of a domain that is currently relatively low-validity, but which could become substantially more valid, is that of predicting the behavior of individual humans. Machine learning tools can already generate, from people’s Facebook 'likes', personality profiles that are slightly more accurate than the profiles made by people’s human friends (Youyou et al. 2015), and can be used to predict private traits such as sexual orientation (Kosinski et al. 2013). This has been achieved using a relatively limited amount of data and not much intelligence; a more sophisticated modeling process could probably make even better predictions from the same data.
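As a rough illustration of the kind of pipeline involved, the sketch below fits a logistic-regression classifier to a synthetic binary 'likes' matrix. The data, the signal strength, and the simple model are all made up for illustration (the actual Kosinski et al. pipeline also included a dimensionality-reduction step), but the basic recipe of binary behavioral features in, probability of a private trait out, is the same.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a user x page "likes" matrix (1 = liked that page).
n_users, n_pages = 2000, 300
likes = rng.integers(0, 2, size=(n_users, n_pages))

# Synthetic binary trait that depends weakly on a handful of pages, mimicking
# the kind of subtle statistical signal described in the text (made-up data).
signal = likes[:, :10].sum(axis=1)
trait = (signal + rng.normal(0, 1.5, n_users) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    likes, trait, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```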
Taleb (2007) has argued for history being strongly driven by 'black swan' events, events with such a low probability that they are unanticipated and unprepared for, but which have an enormous impact on the world. To the extent that this is accurate, it suggests limits on the validity of prediction. However, Tetlock & Gardner (2015) argue that while the black swans themselves may be unanticipated, once the event has happened its consequences may be much easier to predict. Commenting on the notion of the 9/11 terrorist attacks as a black swan event, they write:
‘We may have no evidence that superforecasters can foresee events like those of September 11, 2001, but we do have a warehouse of evidence that they can forecast questions like: Will the United States threaten military action if the Taliban don’t hand over Osama bin Laden? Will the Taliban comply? Will bin Laden flee Afghanistan prior to the invasion? To the extent that such forecasts can anticipate the consequences of events like 9/11, and these consequences make a black swan what it is, we can forecast black swans.’
Thus, even though an AI might be unable to predict some very rare events, once those events have happened, it could utilize its built-up knowledge of how people typically react to different events in order to predict the consequences better than anyone else.
How quickly could an AI acquire more knowledge and mental representations? Here again opinions differ. Hibbard (2016) argues, based on Mahoney’s (2008) argument for intelligence being a function of both resources and knowledge, that explosive growth is unlikely. Benthall (2017) makes a similar argument. On the other hand, authors such as Bostrom (2014) and Yudkowsky (2008) suggest that fast increases are possible.
We know that among humans, there are considerable differences in the extent to which people learn. Human cognitive differences have a strong neural and genetic basis (Deary, Penke, & Johnson, 2010), and strongly predict academic performance (Deary et al., 2007), socio-economic outcomes (Strenze, 2007), and job performance and the effectiveness of on-the-job learning and experience (Gottfredson, 1997). There also exist child prodigies who before adolescence achieve a level of performance comparable to an adult professional, without having been able to spend comparable amounts of time training (Ruthsatz, Ruthsatz, & Stephens, 2013). In general, some people are able to learn faster from the same experiences, notice relevant patterns faster, and continue learning from experience even past the point where others cease to achieve additional gains.8
While there is so far no clear consensus on why some people learn faster than others, there are some clear clues. Individual differences in cognitive abilities may be a result of differences in a combination of factors, such as working memory capacity, attention control, and long-term memory (Unsworth et al., 2014). Ruthsatz et al. (2013), in turn, note that “child prodigies’ skills are highly dependent on a few features of their cognitive profiles, including elevated general IQs, exceptional working memories, and elevated attention to detail”.
Many tasks require paying attention to many things at once, with a risk of overloading the learner’s working memory before some of the performance has been automated. For example, McPherson & Renwick (2001) consider children who are learning to play instruments, and note that children who had previously learned to play another instrument were faster learners. They suggest that this is in part because the act of reading musical notation had become automated for these children, saving them from the need to process notation in working memory and allowing them to focus entirely on learning the actual instrument.
This general phenomenon has been recognized in education research. Complex activities that require multiple subskills can be hard to master even if the students have moderate competence in each individual subskill, as using several of them at the same time can produce an overwhelming cognitive load (Ambrose et al. 2010, chap. 4). Recommended strategies for dealing with this include reducing the scope of the problem at first and then building up to increasingly complex scopes. For instance, 'a piano teacher might ask students to practice only the right hand part of a piece, and then only the left hand part, before combining them' (ibid).
An increased working memory capacity, which is empirically associated with faster learning, could theoretically assist learning by allowing more things to be comprehended simultaneously without overwhelming the learner. Thus, an AI with a large working memory could learn and master much more complicated wholes at once than humans can.
Additionally, we have seen that a key part of efficient learning is the ability to monitor one’s own performance and to notice errors which need correcting; this seems in line with cognitive abilities correlating with attentional control and elevated attention to detail. McPherson & Renwick (2001) also remark on the ability of some students to play through a piece with considerably fewer errors on their second run-through than the first one, suggesting that this indicates 'an outstanding ability to retain a mental representation of [...] performance between run-throughs, and to use this as a basis for learning from [...] errors'. In contrast, children who learned more slowly seemed to either not notice their mistakes, or alternatively to not remember them when they played the piece again.
Whatever the AI analogues of working and long-term memory, attentional control, and attention to detail are, it seems at least plausible that these could be improved upon by drawing exclusively on relatively theoretical research and in-house experiments. This might enable an AI to both absorb vast datasets, as current-day deep learning systems do, and also learn from superhumanly small amounts of data.
How much can the human learning speed be improved upon? This remains an open question. There are likely to be sharply diminishing returns at some point, but we do not know whether they are near the human level. Human intelligence seems constrained by a number of biological and physical factors that are unrelated to gains from intelligence. Plausible constraints include the size of the birth canal limiting the volume of human brains, the brain’s extensive energy requirements limiting the overall number of cells, limits to the speed of signaling in neurons, an increasing proportion of the brain’s volume being spent on wiring and connections (rather than actual computation) as the number of neurons grows, and inherent unreliabilities in the operation of ion channels (Fox, 2011). There doesn’t seem to be any obvious reason why the threshold for diminishing gains from intelligence to learning speed would just happen to coincide with the level of intelligence allowed by our current biology. Alternatively, there could have been diminishing returns all along, but ones which still made it worthwhile for evolution to keep investing in additional intelligence.
The available evidence also seems to suggest that within the human range at least, increased intelligence continues to contribute to additional gains. The Study of Mathematically Precocious Youth (SMPY) is a 50-year longitudinal study involving over 5,000 exceptionally talented individuals identified between 1972 and 1997. Despite its name, many of its participants are more verbally than mathematically talented. The study has led to several publications; among others, Wai et al. (2005) and Lubinski & Benbow (2006) examine the question of whether ability differences within the top 1% of the human population make a difference in life.
Comparing the top (Q4) and bottom (Q1) quartiles of two cohorts within this study shows that both differ significantly from the general population, as well as from each other. Out of the general population, about 1% will obtain a doctoral degree, whereas 20% of Q1 and 32% of Q4 did. 0.4% of Q1 achieved tenure at a top-50 US university, as did 3% of Q4. Looking at a 1-in-10,000 cohort, 19% had earned patents, as compared to 7.5% of the Q4 group, 3.8% of the Q1 group, or 1% of the general population.
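To make the size of these gaps easier to see, the short snippet below simply restates the quoted percentages as multiples of the general-population base rates; the numbers are copied from the paragraph above, and nothing beyond them is assumed (the tenure figures are omitted because no general-population base rate is quoted).

```python
# The percentages below are taken directly from the figures quoted above;
# no additional data are assumed.
base_rates = {"doctorate": 1.0, "patents": 1.0}  # general population (%)
groups = {
    "Q1 (bottom quartile of the top 1%)": {"doctorate": 20.0, "patents": 3.8},
    "Q4 (top quartile of the top 1%)":    {"doctorate": 32.0, "patents": 7.5},
    "1-in-10,000 cohort":                 {"patents": 19.0},
}

for name, rates in groups.items():
    for outcome, pct in rates.items():
        multiple = pct / base_rates[outcome]
        print(f"{name}: {outcome} rate is {multiple:.0f}x the general-population rate")
```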
It is important to emphasize that the evidence we’ve reviewed so far does not merely mean that an AI could potentially learn faster in terms of time: it also suggests that the AI could potentially learn faster in terms of training data. The smaller the datasets an AI needs in order to develop accurate mental representations, the faster it can adapt to new situations.
However, learning faster in terms of time is also important. Various versions of AlphaGo were trained for maybe a year in total, whereas Lee Sedol had been playing professionally since 1995, with professional qualification requiring a considerable amount of intense training by itself. A twenty-fold advantage in learning speed could already provide a major advantage, particularly when dealing with novel situations that humans have little previous experience of.
Besides the considerations we have already discussed, there seems to be potential for accelerated learning through more detailed analysis of experiences. For example, chess players improve most effectively by studying the games of grandmasters, and trying to predict what moves the grandmasters would have made in any situation. When the grandmaster play deviates from the move that the student would have made, the student goes back to try to see what they missed (Ericsson & Pool, 2016). This kind of detailed study is effortful, however, and can only be sustained for limited periods at a time. With enough computational resources, the AI could routinely run this kind of analysis on all sense data it received, constantly attempting to build increasingly detailed models and mental representations that would correctly predict the data.
Some commentators, such as Hibbard (2016), argue that knowledge requires interaction with the world, so that an AI would be forced to learn over an extended period of time simply because such interaction takes time.
From our previous review, we know that feedback is needed for the development of expertise. However, one may also get feedback from studying static materials. As we noted before, chess players spend more time studying published matches and trying to predict the grandmaster moves – and then getting feedback when they look up the next move and have their prediction confirmed or falsified – than they do actually playing matches against live opponents (Ericsson & Pool, 2016). The Go-playing AlphaGo system did not achieve its skill by spending large amounts of time playing human opponents, but rather by studying the games of humans and playing games against itself (Silver et al. 2016). And while any individual human can only study a single game at a time, AI systems could study a vast number of games in parallel and learn from all of them.9
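A minimal sketch of this 'predict the expert, then check' loop is given below. The recorded 'games' are made-up symbolic sequences and the learner is just a frequency table over previous moves, both invented for illustration; the structural point is that static records alone provide a stream of prediction-and-feedback episodes, and that many such records could in principle be processed in parallel.

```python
from collections import defaultdict, Counter

# Made-up recorded "games": each is a sequence of expert moves. A real archive
# would contain millions of games, which an AI could process in parallel.
recorded_games = [
    ["e4", "e5", "Nf3", "Nc6", "Bb5"],
    ["e4", "c5", "Nf3", "d6", "d4"],
    ["d4", "d5", "c4", "e6", "Nc3"],
] * 100

# The "learner": for each context (here just the previous move), a count of
# what the expert played next. A real system would use a far richer model.
model = defaultdict(Counter)

def predict(context):
    guesses = model[context]
    return guesses.most_common(1)[0][0] if guesses else None

correct = total = 0
for game in recorded_games:
    context = "<start>"
    for expert_move in game:
        guess = predict(context)           # the learner's prediction
        if guess is not None:
            total += 1
            correct += (guess == expert_move)
        model[context][expert_move] += 1   # feedback from the static record
        context = expert_move

print(f"prediction accuracy over the archive: {correct / total:.2f}")
```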
An important difference is that domains such as chess and Go are formally specified domains, which an AI can perfectly simulate. For a domain such as social interaction, the AI’s ability to accurately simulate the behavior of humans is limited by its current competence in the domain. While it can run a simulation based on its existing model of human behavior, predicting how humans would behave based on that model, it needs external data in order to find out how accurate its prediction was.
This is not necessarily a problem however, given the vast (and ever-increasing) amount of recorded social interaction happening online. YouTube, e-mail lists, forums, blogs, and social media services all provide rich records of various kinds of social interaction, for an AI to test its predictive models against without needing to engage in interaction of its own. Scientific papers – increasingly available on an open access basis – on topics such as psychology and sociology offer additional information for the AI to supplement its understanding with, as do various guides to social skills. All of this information could be acquired simply by downloading it, with the main constraints being the time needed to find, download, and process the data, rather than time needed for social interactions.
As noted earlier, relatively crude statistical methods can already extract relatively accurate psychological profiles out of data such as people’s Facebook 'likes' (Kosinski et al., 2013, Youyou et al., 2015), giving reason to suspect that a general AI could develop very accurate predictive abilities given the kind of process described above.
Several other domains, such as software security and mathematics, seem similarly amenable to being mastered largely without needing to interact with the world outside the AI, other than searching for relevant materials. Some domains such as physics would probably need novel experiments, but an AI focusing on the domains that were the easiest and fastest for it to master might find sufficient sources of capability from those alone.
Given the above considerations, it does not seem to me like an AI’s speed of learning would necessarily be strongly interaction-constrained.
We set out to consider the fundamental practical limits of intelligence, and the limits to how quickly an AI system could acquire very high levels of capability.
Fictional representations of high intelligence often depict geniuses as masterminds with an almost godlike prediction ability, laying out intricate multi-step plans where every contingency is planned for in advance (TVTropes 2017a). When discussing “superintelligent” AI systems, one might easily think that the discussion was postulating something along the lines of those fictional examples, and rightly reject it as unrealistic.
Given what we know about the limits of prediction, for AI to make a single plan which takes into account every possibility is surely impossible. However, having reviewed the science of human expertise, we have found that experts who are good at their domains tend to develop powerful mental representations which let them react to various situations as they arise, and to simulate different plans and outcomes in their heads.
Looking from humans to AIs, we have found that AI might be able to run much more sophisticated mental simulations than humans could. Given human intelligence differences and empirical and theoretical considerations about working memory being a major constraint for intelligence, the empirical finding that increased intelligence continues to benefit people throughout the whole human range, and the observation that it would be unlikely for the theoretical limits of intelligence to coincide with the biological and physical constraints that human intelligence currently faces, it seems like AIs could come to learn considerably faster from data than humans do. It also seems like in many domains, this could be achieved by using existing materials as a source of feedback for predictions, without necessarily being constrained by time taken for interacting with the external world.
Thus, it seems that even though an AI system couldn’t make a single superplan for world conquest right from the beginning, it could still have a superhuman ability to adapt to and learn from changing and novel situations, and react to them faster than its human adversaries could. As an analogy, experts playing most games can't precompute a winning strategy right from the first move either, but they can still react and adapt to the game's evolving situation better than a novice can, enabling them to win.10
Many of the hypothetical advantages that an AI might have – such as a larger working memory, the ability to consider more possibilities at once, and the ability to practice on many training instances in parallel – seem to depend on available computing power. Thus the amount of hardware the AI had at its disposal could limit its capabilities, but there exists the possibility of developing better-optimized algorithms by initially specializing in fields such as programming and theoretical computer science, which the AI might become very good at.
One consideration which we have not yet properly addressed is the technology landscape at the time when the AI arrives (Tomasik 2014/2016, sec. 7). If a general AI can be developed, then various forms of sophisticated narrow AI will also be in existence. Some of them could be used to detect and react to a general AI, and tools such as sophisticated personal profiling for purposes of social manipulation will likely already be in existence. Considering how these influence the considerations discussed here is an important question, but one which is outside the scope of this article.
In summary, in practice the limits of prediction do not seem to pose much of a meaningful upper bound on an AI’s capabilities. Even if an AI could not create a complete master plan from scratch, it could still outperform humans in crucial domains, developing and deploying expertise superior to what humans are capable of. Aside from trivial limits derived from physical constraints, such as 'the AI couldn’t become superhumanly capable literally instantly', we also haven’t seen a way to establish a practical lower bound on how much time it would take for an AI to achieve superhuman capability. Takeover scenarios with timescales on the order of mere days or weeks seem to remain within the range of plausibility.
Thank you to David Althaus, Stuart Armstrong, Bill Hibbard, David Krueger, Josh Marlow, Carl Shulman, and Brian Tomasik for helpful comments on this paper.
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016, June 21). Concrete Problems in AI Safety. Cornell University Library. Retrieved from http://arxiv.org/abs/1606.06565
Anderson, M. (2010). Problem Solved: Unfriendly AI. Retrieved September 27, 2016, from http://hplusmagazine.com/2010/12/15/problem-solved-unfriendly-ai/
Baars, B. J. (2002). The conscious access hypothesis: origins and recent evidence. Trends in Cognitive Sciences, 6(1), 47–52. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11849615
Baars, B. J. (2005). Global workspace theory of consciousness: toward a cognitive neuroscience of human experience. Progress in Brain Research, 150, 45–53. https://doi.org/10.1016/S0079-6123(05)50004-9
Bengio, Y., Courville, A., & Vincent, P. (2012). Representation Learning: A Review and New Perspectives. arXiv [cs.LG]. Retrieved from http://arxiv.org/abs/1206.5538
Benthall, S. (2017). Don’t Fear the Reaper: Refuting Bostrom's Superintelligence Argument. arXiv [cs.AI]. Retrieved from http://arxiv.org/abs/1702.08495
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Buizza, R. (2002). Chaos and weather prediction. ECMWF. Retrieved from https://www.researchgate.net/publication/228552816_Chaos_and_weather_prediction_January_2000
Deary, I. J., Penke, L., & Johnson, W. (2010). The neuroscience of human intelligence differences. Nature Reviews. Neuroscience, 11(3), 201–211. https://doi.org/10.1038/nrn2793
Deary, I. J., Strand, S., Smith, P., & Fernandes, C. (2007). Intelligence and educational achievement. Intelligence, 35, 13. https://doi.org/10.1016/j.intell.2006.02.001
Ericsson, A., & Pool, R. (2016). Peak: Secrets from the New Science of Expertise. Houghton Mifflin Harcourt.
Fox, D. (2011). The Limits of Intelligence. Scientific American. Retrieved from http://www.cs.virginia.edu/~robins/The_Limits_of_Intelligence.pdf
Franklin, S., Madl, T., D’Mello, S., & Snaider, J. (2014). LIDA: A Systems-level Architecture for Cognition, Emotion, and Learning. IEEE Transactions on Autonomous Mental Development, 6(1). https://doi.org/10.1109/TAMD.2013.2277589
Franklin, S., & Patterson, F. G., Jr. (2006). The LIDA architecture: Adding new modes of learning to an intelligent, autonomous software agent. Presented at the Integrated Design and Process Technology, IDPT-2006. Retrieved from http://ccrg.cs.memphis.edu/assets/papers/zo-1010-lida-060403.pdf
Future of Life Institute. (2015). AI Open Letter - Research Priorities for Robust and Beneficial Artificial Intelligence. Retrieved from https://futureoflife.org/ai-open-letter/
Gottfredson, L. S. (1997). Why g matters: The complexity of everyday life. Intelligence, 24(1), 79–132. https://doi.org/10.1016/S0160-2896(97)90014-3
Hibbard, B. (2016). A Defense of Humans for Transparency in Artificial Intelligence. Retrieved September 28, 2016, from http://www.ssec.wisc.edu/~billh/g/transparency_defense.html
Ignatius, D. (2013). David Ignatius: More chatter than needed. The Washington Post. Retrieved from https://www.washingtonpost.com/opinions/david-ignatius-more-chatter-than-needed/2013/11/01/1194a984-425a-11e3-a624-41d661b0bb78_story.html
Kahneman, D., & Klein, G. (2009). Conditions for Intuitive Expertise. A Failure to Disagree. The American Psychologist, 64(6), 515–526. https://doi.org/10.1037/a0016755
Klein, G. (1999). Sources of Power: How People Make Decisions. MIT Press.
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences of the United States of America, 110(15), 5802–5805. https://doi.org/10.1073/pnas.1218772110
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016). Building Machines That Learn and Think Like People. arXiv [cs.AI]. Retrieved from http://arxiv.org/abs/1604.00289
Lawrence, N. (2016). Future of AI 6. Discussion of 'Superintelligence: Paths, Dangers, Strategies.' Retrieved September 27, 2016, from http://inverseprobability.com/2016/05/09/machine-learning-futures-6
Lubinski, D., & Benbow, C. P. (2006). Study of Mathematically Precocious Youth After 35 Years: Uncovering Antecedents for the Development of Math-Science Expertise. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 1(4), 316–345. https://doi.org/10.1111/j.1745-6916.2006.00019.x
Macnamara, B. N., Hambrick, D. Z., & Oswald, F. L. (2014). Deliberate practice and performance in music, games, sports, education, and professions: a meta-analysis. Psychological Science, 25(8), 1608–1618. https://doi.org/10.1177/0956797614535810
Martela, F. (2016). Törmääkö tekoäly älykkyyden ylärajaan? Tivi. Retrieved from http://www.tivi.fi/blogit/tormaako-tekoaly-alykkyyden-ylarajaan-6584349
Madl, T., Franklin, S., Chen, K., Montaldi, D., & Trappl, R. (2016). Towards real-world capable spatial memory in the LIDA cognitive architecture. Biologically Inspired Cognitive Architectures, 16, 87–104. https://doi.org/10.1016/j.bica.2016.02.001
Mahoney, M. (2008). A Model for Recursively Self Improving Programs. Retrieved from http://mattmahoney.net/rsi.pdf
McPherson, G. E., & Renwick, J. M. (2001). A Longitudinal Study of Self-regulation in Children’s Musical Practice. Music Education Research, 3(2), 169–186. https://doi.org/10.1080/14613800120089232
Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., … Tetlock, P. E. (2014). Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science, 25(5), 1106–1115. https://doi.org/10.1177/0956797614524255
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
Müller, V. C., & Bostrom, N. (2016). Future Progress in Artificial Intelligence: A Survey of Expert Opinion. In V. C. Müller (Ed.), Fundamental Issues of Artificial Intelligence (pp. 555–572). Springer International Publishing. Retrieved from http://www.nickbostrom.com/papers/survey.pdf
Polya, G. (1990). How to Solve It: A New Aspect of Mathematical Method (New edition). Penguin Books, Limited (UK).
Rushton, J. P., & Ankney, C. D. (2009). Whole brain size and general mental ability: a review. The International Journal of Neuroscience, 119(5), 691–731. https://doi.org/10.1080/00207450802325843
Russell, S., Dewey, D., & Tegmark, M. (2015). Research Priorities for Robust and Beneficial Artificial Intelligence. AI Magazine, 36(4), 105–114. Retrieved from https://futureoflife.org/data/documents/research_priorities.pdf?x90991
Ruthsatz, J., Ruthsatz, K., & Stephens, K. R. (2013). Putting practice into perspective: Child prodigies as evidence of innate talent. Intelligence, 45, 60–65. https://doi.org/10.1016/j.intell.2013.08.003
Shanteau, J. (1992). Competence in experts: The role of task characteristics. Organizational Behavior and Human Decision Processes, (53), 252–266. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.625.1063&rep=rep1&type=pdf
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
Soares, N., & Fallenstein, B. (2014). Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda. Machine Intelligence Research Institute. Retrieved from https://intelligence.org/files/TechnicalAgenda.pdf
Sotala, K., & Yampolskiy, R. V. (2015). Responses to catastrophic AGI risk: a survey. Physica Scripta, 90(1), 018001. https://doi.org/10.1088/0031-8949/90/1/018001
Strenze, T. (2007). Intelligence and socioeconomic success: A meta-analytic review of longitudinal research. Intelligence, 35(5), 401–426. https://doi.org/10.1016/j.intell.2006.09.004
Ambrose, S. A., Bridges, M. W., DiPietro, M., Lovett, M. C., Norman, M. K., & Mayer, R. E. (2010). How Learning Works: Seven Research-Based Principles for Smart Teaching. Jossey-Bass.
Taleb, N. N. (2007). Black Swans and the Domains of Statistics. The American Statistician, 61(3), 198–200. https://doi.org/10.1198/000313007X219996
Taylor, J., Yudkowsky, E., LaVictoire, P., & Critch, A. (2016). Alignment for advanced machine learning systems. Machine Intelligence Research Institute. Retrieved from https://intelligence.org/files/AlignmentMachineLearning.pdf
Tetlock, P. E., Mellers, B. A., & Rohrbaugh, N. (2014). Forecasting Tournaments: Tools for Increasing Transparency and Improving the Quality of Debate. Current Directions in Psychological Science, 23(4), 290–295. Retrieved from http://cdp.sagepub.com/content/23/4/290.short
Tetlock, P., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
Tomasik, B. (2014). Artificial Intelligence and Its Implications for Future Suffering. Retrieved September 28, 2016, from https://longtermrisk.org/artificial-intelligence-and-its-implications-for-future-suffering#reply-to-bostroms-arguments-for-a-hard-takeoff
Turing, A. M. (1950). Computing Machinery and Intelligence. Mind; a Quarterly Review of Psychology and Philosophy, 59(236), 433–460. Retrieved from http://www.loebner.net/Prizef/TuringArticle.html
Unsworth, N., & Engle, R. W. (2007). The nature of individual differences in working memory capacity: active maintenance in primary memory and controlled search from secondary memory. Psychological Review, 114(1), 104–132. https://doi.org/10.1037/0033-295X.114.1.104
Unsworth, N., Fukuda, K., Awh, E., & Vogel, E. K. (2014). Working memory and fluid intelligence: capacity, attention control, and secondary memory retrieval. Cognitive Psychology, 71, 1–26. https://doi.org/10.1016/j.cogpsych.2014.01.003
Wai, J., Lubinski, D., & Benbow, C.P. (2005) Creativity and Occupational Accomplishments Among Intellectually Precocious Youths: An Age 13 to Age 33 Longitudinal Study. Journal of Educational Psychology, 97(3), 484-492.
Whalen, D. (2016). Holophrasm: a neural Automated Theorem Prover for higher-order logic. arXiv [cs.AI]. Retrieved from http://arxiv.org/abs/1608.02644
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences of the United States of America, 112(4), 1036–1040. https://doi.org/10.1073/pnas.1418680112
Yudkowsky, E. (2008). Artificial Intelligence as a Positive and Negative Factor in Global Risk. In N. Bostrom & M. M. Ćirković (Eds.), Global Catastrophic Risks (pp. 308–345). Oxford University Press. Retrieved from https://intelligence.org/files/AIPosNegFactor.pdf
Yudkowsky, E. (2013). Intelligence Explosion Microeconomics (No. 2013-1). Machine Intelligence Research Institute. Retrieved from https://intelligence.org/files/IEM.pdf
Xanatos Gambit. (2017). TV Tropes. Retrieved March 20, 2017, from http://tvtropes.org/pmwiki/pmwiki.php/Main/XanatosGambit
Xanatos Speed Chess. (2017). TV Tropes. Retrieved March 20, 2017, from http://tvtropes.org/pmwiki/pmwiki.php/Main/XanatosSpeedChess
‘[Deliberate practice] requires a field that is already reasonably well developed— that is, a field in which the best performers have attained a level of performance that clearly sets them apart from people who are just entering the field. We’re referring to activities like musical performance (obviously), ballet and other sorts of dance, chess, and many individual and team sports, particularly the sports in which athletes are scored for their individual performance, such as gymnastics, figure skating, or diving. What areas don’t qualify? Pretty much anything in which there is little or no direct competition, such as gardening and other hobbies, for instance, and many of the jobs in today’s workplace— business manager, teacher, electrician, engineer, consultant, and so on. These are not areas where you’re likely to find accumulated knowledge about deliberate practice, simply because there are no objective criteria for superior performance.’
(Ericsson & Pool, 2016)
Fields that have well-defined, objective criteria for good performance are ones which are the easiest to master using even current-day AI methods – in fact, they’re basically the only ones that can be truly mastered using current-day AI methods.
A somewhat cheeky way to summarize these results would be by saying that, in the kinds of fields that could be mastered without general intelligence, general intelligence isn’t the most important thing. This even seems to be Ericsson’s own theoretical stance: that in these fields, general intelligence eventually ceases to matter because the expert will have developed specialized mental representations that they can just rely on in every situation. So these results are not very interesting to us, who are interested in domains that do require general intelligence.
The post Reducing Risks of Astronomical Suffering: A Neglected Priority appeared first on Center on Long-Term Risk.
Will we go extinct, or will we succeed in building a flourishing utopia? Discussions about the future trajectory of humanity often center around these two possibilities, a framing which tends to ignore that survival does not always imply utopian outcomes, and that outcomes where humans go extinct could differ tremendously in how much suffering they contain. One major risk is that space colonization, whether carried out by humans or by (misaligned) artificial intelligence, may end up producing cosmically significant amounts of suffering. For a variety of reasons, such scenarios are rarely discussed and often underestimated. The neglectedness of risks of cosmically significant amounts of suffering (“suffering risks” or “s-risks”) makes their reduction a plausible priority from the perspective of many value systems. Rather than focusing exclusively on ensuring that there will be a future, we recommend interventions that improve the future’s overall quality.
Among actors and organizations concerned with shaping the “long-term future,” the discourse has so far been centered around the concept of existential risks. “Existential risk” (or “x-risk”) was defined by Bostrom (2002) as “[...] an adverse outcome [which] would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.”
This definition is unfortunate in that it lumps together events that lead to vast amounts of suffering and events that lead to the extinction (or failure to reach the potential) of humanity. However, many value systems would agree that extinction is not the worst possible outcome, and that avoiding large quantities of suffering is of utmost moral importance.
We should differentiate between existential risks (i.e., risks of “mere” extinction or failed potential) and suffering risks.1 Suffering risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering. Note that it may turn out that the amount of suffering that we can influence is dwarfed by suffering that we can’t influence. By “expected future suffering” we mean “expected action-relevant suffering in the future”.
The above distinctions are all the more important because the term “existential risk” has often been used interchangeably with “risks of extinction”, omitting any reference to the future’s quality.2 Finally, some futures may contain both vast amounts of happiness and vast amounts of suffering, which constitutes an s-risk but not necessarily a (severe) x-risk. For instance, an event that would create 10^25 unhappy beings in a future that already contains 10^35 happy individuals constitutes an s-risk, but not an x-risk.3
The Case for Suffering-Focused Ethics outlined several reasons for considering suffering reduction one’s primary moral priority. From this perspective in particular, s-risks should be addressed before addressing extinction risks. Reducing extinction risks makes it more likely that there will be a future, possibly one involving space colonization and the astronomical stakes that come with it. But it often does not robustly improve the quality of the future, i.e., how much suffering or happiness it will likely contain.4 A future with space colonization may contain vastly more sentient minds than have existed so far. If something goes wrong, or even if things do not go “right enough”, this would multiply the total amount of suffering (in our part of the universe) by a huge factor. Depending on our normative views on the importance of creating a vast number of happy beings, and the extent to which we value moral reflection, this suffering may or may not be counterbalanced by the expected upsides.
Reducing extinction risks essentially comes down to buying lottery tickets over the distribution of possible futures: A tiny portion of the very best futures will be worthy of the term “utopia” (almost) regardless of one’s moral outlook; the better futures will contain vast amounts of happiness but possibly also some serious suffering (somewhat analogous to the situation in Omelas); and the bad or very bad futures will contain suffering at unprecedented scales. The more one cares about reducing suffering in comparison to creating happiness or other types of flourishing, the less attractive such a lottery becomes. In other words, efforts to reduce extinction risks are only positive according to one’s values if one's expected ratio of future happiness vs. suffering is greater than one’s normative exchange rate.5 Instead of spending all our resources on buying as many lottery tickets as possible, those with suffering-focused values should try to ensure that as few tickets as possible contain (astronomically) unpleasant surprises. They should also cooperate closely with extinction risk reducers to reap the gains from moral cooperation.
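The lottery framing can be restated as a simple expected-value comparison. The probabilities and happiness/suffering magnitudes in the sketch below are purely hypothetical placeholders (the post commits to no specific numbers); the snippet only shows how the sign of the calculation flips as the normative exchange rate between suffering and happiness grows.

```python
# Purely illustrative distribution over futures reached by avoiding extinction:
# (probability, happiness units, suffering units). None of these numbers come
# from the text; they only exist to show how the calculation works.
futures = [
    (0.05, 100.0, 0.0),   # near-utopia
    (0.80,  60.0, 10.0),  # good overall, but with serious pockets of suffering
    (0.15,   5.0, 50.0),  # bad or very bad outcome
]

def value_of_survival(exchange_rate):
    """Expected value of a surviving future, weighting suffering by exchange_rate."""
    return sum(p * (happiness - exchange_rate * suffering)
               for p, happiness, suffering in futures)

for rate in (1, 3, 10):
    value = value_of_survival(rate)
    verdict = "worth the ticket" if value > 0 else "not worth the ticket"
    print(f"exchange rate {rate}: expected value {value:+.1f} ({verdict})")
```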
The following sections will present reasons why s-risks are both neglected and tractable, and why actors concerned about the long-term future should consider investing (more) resources into addressing them.
Within certain future-oriented movements, notably Effective Altruism and transhumanism, there is a tendency for people to expect the (far) future to contain more happiness than suffering. Many of these people, in turn, expect future happiness to outweigh future suffering by many orders of magnitude.6 Arguments put forward for this position include that the vast majority of humans – maybe excluding a small percentage of sadists – value increasing happiness and decreasing suffering, and that technological progress so far has led to many welfare improvements.
While it seems correct to assume that the ratio of expected future happiness to suffering is greater than one, and plausibly quite a bit larger than that,7 the case is not open-and-shut. Good values alone are not sufficient for ensuring good outcomes, and at least insofar as the suffering humans inflict on nonhuman animals is concerned (e.g. with factory farming), technology’s track record is actually negative rather than positive. Moreover, it seems that a lot of people overestimate how good the future will be due to psychological factors, ignorance about some of the potential causes of astronomical future suffering, and insufficient concern for model uncertainty and unknown unknowns.
It is human nature to (subconsciously) flinch away from contemplating horrific realities and possibilities; the world almost certainly contains more misery than most want to admit or can imagine. Our tendency to underestimate the expected amount of future (as compared to present-day) suffering might be even more pronounced. While it would be unfair to apply this characterization to all people who display great optimism towards the future, these considerations certainly play a large role in the epistemic processes of some future “optimists.”
One contributing factor is optimism bias (e.g. Sharot, Riccardi, Raio, & Phelps, 2007), which refers to the tendency to overestimate the likelihood of positive future events while underestimating the probability and severity of negative events – even in the absence of evidence to support such expectations. Another, related factor is wishful thinking, where people are prone to judging scenarios which are in line with their desires as being more probable than what is epistemically justified, while assigning lower credence to scenarios they dislike.
Striving to avert future dystopias inevitably requires one to contemplate vast amounts of suffering on a regular basis, which is often demotivating and may result in depression. By contrast, while the prospect of an apocalypse may also be depressing, working towards a utopian future is more inspiring, and could therefore (subconsciously) bias people towards paying less attention to s-risks.
Similarly, working towards the reduction of extinction risks or the creation of a posthuman utopia is also favored by many people’s instinctual, self-oriented desires, notably one’s own survival and that of family members or other loved ones. As it is easier to motivate oneself (and others) towards a project that appeals to altruistic as well as more self-oriented desires, efforts to reduce risks of astronomical suffering – risks that lie in the distant future and often involve the suffering of unusual or small minds less likely to evoke empathy – will be comparatively neglected. This does not mean that the above motivations are misguided or unimportant; rather, it means that if one also, upon reflection, cares a great deal about reducing suffering, then it might take deliberate effort to give this concern due justice.
Lastly, psychological inhibitions against contemplating s-risks and unawareness of such considerations are interrelated and tend to reinforce each other.
In discussions about the risks from smarter-than-human artificial intelligence, it is often assumed that the sole reason to consider AI safety an important focus area is because it decides between utopia or human extinction. The possibility that misaligned or suboptimally aligned AI might instantiate suffering in astronomical quantities is, however, rarely brought up.
Misaligned AI as a powerful but morally indifferent optimization process might transform galactic resources into highly optimized structures, some of which might very well include suffering. The structures a superintelligent AI or an AI-based economy decoupled from human interests would build in the pursuit of its goals may for instance include a fleet of “worker bots,” factories, supercomputers to simulate ancestral Earths for scientific purposes, and space colonization machinery, to name a few. In the absence of explicit concern for suffering reflected in the goals of such an optimization process, AI systems would be willing to instantiate suffering minds (or “subroutines”) for even the slightest benefit to their objectives. (Note that, as was the case with natural selection’s use of wild animals, some of these optimization processes might also lead to the instantiation of happy minds.) This is especially worrying because the stakes involved could literally turn out to be astronomical: Space colonization is an attractive subgoal for almost any powerful optimization process, as it leads to control over the largest amount of resources. Even if only a small portion of these resources are used for purposes that involve suffering, the resulting disvalue would tragically be enormous.8 Finally, next to bad states of affairs being brought about for instrumental reasons, because of indifference to suffering, there is also the risk that bad states of affairs could be brought about for strategic reasons: In competition or conflict between different factions, if one side is compassionate and the other not, threatening to bring about bad states of affairs could be used as an extortion tactic.
For an overview on ways the future could contain vast amounts of suffering – including as a result of suboptimally aligned AI or human-controlled futures where dangerous ideologies win – see Superintelligence as a Cause or Cure for Risks of Astronomical Suffering and Risks of Astronomical Future Suffering.
One might argue that the scenarios just mentioned tend to be speculative, maybe extremely speculative, and should thus be discounted or even ignored altogether. However, the claim that creating extremely powerful agents with alien values and no compassion might lead to vast amounts of suffering – through some way or another – is a disjunctive prediction. It would take only one available action that increases the AI’s total utility while also involving vast quantities of suffering for the AI to pursue that path without reservation. Worries of this sort are weakly supported by the universe’s historical track record, where the “morally indifferent optimization process” of Darwinian evolution instantiated vast amounts of misery in the form of wild-animal suffering.
Even if the probability of any one specific scenario involving astronomical amounts of suffering (like the ones above, or other scenarios not yet mentioned or thought of) is small, the probability that at least one scenario will occur may be fairly high. In this context, we should beware the disjunction fallacy (Bar-Hillel & Neter, 1993), according to which most people not only underestimate the probability of disjunctions of events, but they actually judge the disjunction as less likely than a single event comprising it.9
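A quick way to see the force of this point: even if each individual scenario is judged unlikely, the chance that at least one of them occurs can be several times higher than the chance of the most likely single scenario. The scenario count, the probabilities, and the independence assumption in the sketch below are all illustrative placeholders.

```python
# Placeholder probabilities for ten distinct s-risk scenarios, each judged
# individually unlikely; the numbers and the independence assumption are
# illustrative only.
scenario_probabilities = [0.01, 0.02, 0.005, 0.01, 0.03,
                          0.01, 0.02, 0.005, 0.01, 0.01]

p_none = 1.0
for p in scenario_probabilities:
    p_none *= (1.0 - p)   # chance that this particular scenario does not occur

print(f"most likely single scenario:     {max(scenario_probabilities):.1%}")
print(f"chance that at least one occurs: {1.0 - p_none:.1%}")
```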
Lastly, taking seriously the possibility of unknown unknowns, black swans or model uncertainty generally seems incompatible with predicting a very large (say, 1,000,000 to 1) ratio of expected future happiness to suffering. Factoring in such model uncertainty brings matters back towards a more symmetrical prior probability distribution. Predicting an extreme ratio, on the other hand, would require enormous amounts of evidence, and is thus suggestive of overconfidence or wishful thinking – especially in the light of historical data on the distribution of suffering and happiness.
In conclusion, there are several reasons why the probability of risks of astronomical suffering – although difficult to assess – is significant; we should be careful to not underestimate them.10
If the future will be controlled by actors that care about both creating happiness (and other forms of flourishing) and reducing suffering, more happiness could be created than expected suffering can be reduced. Versions of this argument have been brought up to justify a focus on utopia creation and the reduction of extinction risks even if one gives some more (but not a lot more) weight to suffering reduction over happiness creation. This is a valid concern, but it does not yet factor in tractability and neglectedness, which may point the other way.
Creating vast amounts of happiness in the future without also causing vast amounts of suffering in the process requires a great deal of control over future outcomes; it essentially requires hitting a bull’s-eye in the space of all possible outcomes. This is especially true for views where value is fragile, but also for other views. By contrast, steering our future trajectory away from classes of bad or very bad outcomes could arguably be easier.
Applied to AI risks in particular, this means that instead of betting on the unlikely case that we will get everything right, worst-case AI safety – approaches informed specifically by considerations of minimizing the probability of the worst failure modes – seems promising, especially as a first step or first priority.
Partly due to the reasons described above (psychological factors, unawareness of some potential sources of astronomical future suffering, and undue disregard of model uncertainty or the possibility of black swans), efforts to reduce risks of astronomical suffering appear neglected. So far, the Center on Long-Term Risk is the only organization that has made this their main focus area.11
Because reducing s-risks is a neglected priority, we should expect there to be several low-hanging fruits in terms of possible interventions. This suggests that researching effective and cooperative strategies to avert risks of astronomical suffering may have a particularly high expected value on the margin.
Risks of astronomical suffering are important to address for many value systems, including ones where suffering reduction is only one (strong) concern out of many. Similarly, if one has uncertainty over different moral views, following a parliamentary model for dealing with moral uncertainty would suggest that suffering reduction should get many votes in one’s “moral parliament,” and correspondingly, that significant resources should be directed towards reducing s-risks.
Moral uncertainty can simultaneously be a reason why even people who care strongly about reducing suffering should welcome some efforts to preserve humanity’s option value. More importantly, considerations about positive-sum cooperation suggest that all s-risk reducers should aim to compromise with those who want to ensure that humanity has a cosmic future. While the thought of prioritizing anything above s-risks may feel unacceptable to some of us, we must acknowledge that other people may (and will) disagree. Values differ, and alternative views on the matter often have their own legitimacy. Rather than fighting other people’s efforts to ensure humanity’s survival and the chance to develop into an intergalactic, long-lasting and flourishing civilization, we should complement these efforts by taking care of the things that could go wrong. Cooperation between future optimists and future pessimists will be best for everyone, as it improves the odds that those who care at all about the long-term future can beat all the future apathists, or the general trend for things to slide towards the abyss when actors only follow their own incentives.
Bar-Hillel, M., & Neter, E. (1993). How alike is it versus how likely is it: A disjunction fallacy in probability judgments. Journal of Personality and Social Psychology, 65(6), 1119-1131.
Bostrom, N. (2002). Existential risks. Journal of Evolution and Technology, 9(1), 1-31.
Hastie, R., & Dawes, R. (2001). Rational choice in an uncertain world: The psychology of judgment and decision making. Thousand Oaks: Sage Publications.
Sharot, T., Riccardi, A. M., Raio, C. M., & Phelps, E. A. (2007). Neural mechanisms mediating optimism bias. Nature, 450(7166), 102-105.
Tomasik, B. (2013). Risks of Astronomical Future Suffering. Center on Long-Term Risk.
Yudkowsky, E. (2011). Complex value systems are required to realize valuable futures. In J. Schmidhuber, K. R. Thórisson, & M. Looks (Eds.), Artificial General Intelligence: 4th International Conference, AGI 2011, Mountain View, CA, USA (pp. 388-393). Berlin: Springer.
The post Panmnemism and the Value of Experiential Information appeared first on Center on Long-Term Risk.
This paper introduces the theory of Panmnemism as a variation of panpsychism. Rather than making claims about spirit or sentience existing in all things, panmnemism references the ability of inanimate objects to store experiential information as memories. Memory, or mneme, is to be understood as the capacity to store information concerning the experiential qualities of existence. Experiential information can be stored as memory. Memories can be retrieved and communicated to investigators. From these claims, it will be concluded that non-human animals, plants and inanimate objects possess a form of memory, as well as forms of semi-linguistic communication of the information within those memories. Viewing the world in this way allows all entities to contribute to the development of knowledge. Such contributions lend purpose to all things.
It has too often been the case that discussions involving references to panpsychism stray into fears of repellent mysticism or the tendency to encompass aspects of spirit or animation. There is a simpler, more specifically targeted argument which can be adapted from panpsychic theory, one which will likely avoid those metaphysical traps. Instead of attempting to support the existence of cognition or soul in non-humans, the focus ought to be on the capacity to collect memories and construct narratives. The qualia of ‘being in the world’ are attached to the existence of individual objects. The ability to store subjective information and pass it on at a later time is a fundamental quality of consciousness. This is only one aspect of mental function, but it’s one that can be attributed to inanimate objects with little contention. Panpsychism, at its core, has always claimed that some aspects of mind must be present in all things. This idea can be traced back at least as far as the Epicurean notion that the will cannot emerge out of nothing, and thus must be present in all forms of matter. Understanding consciousness as having existed prior to the appearance of human life is a necessary element of Buddhist and Hindu cosmologies. Numerous scientists, from Thomas Edison1 to David Bohm, have argued in favor of panpsychic ideas in order to explain the behavior of forces and particles. Leibniz and Spinoza both built philosophies dependent on the belief that mind and spirit are composed from smaller blocks of mental attributes, which are present in all things. Precisely which elements of mind are present in matter is rarely specified.
Philosophical claims of emergence reject this idea of consciousness covering a wide spectrum of proto-cognitive abilities, suggesting instead that consciousness appeared in a sudden evolutionary leap. To understand adaptation as a process is to understand that whatever abilities are present in the human animal today must have begun, in more primitive forms, in our distant past. The intricate operations of consciousness could not have emerged from out of nowhere. Consciousness must have evolved from simpler instances of conscious activity, or proto-consciousness.
Any arguments stemming from mind/body debate partisanship are irrelevant to the discussion of consciousness as categorized by panpsychic theory. If dualism were true, the secondary immaterial substance of mind could easily be assigned to all physical things, not just human brains. Since this mental substance is non-physical, and thus immeasurable, it could be active around any form of matter. Conversely, if the identity theory holds true, panpsychism would still have ample space in which to operate. If consciousness is reducible to molecules and matter, then all things constructed from molecules and matter would possess the potential for similar functions. This is not to suggest that inanimate objects are capable of experiencing emotion or actualization, though not all humans are capable of those particular cognitions either. Since emotions, for one example, are not exhibited by all thinking humans, their presence can’t be a necessary condition of consciousness.
Just as the term consciousness has been continuously redefined, the word panpsychism has seen itself applied to all sorts of theories and meta-theories. In order to avoid any pluralistic confusion, I will be using a term which more specifically handles the aspects of mind being addressed here. Panmnemism is the position that all things have memory. This clarification immediately brackets out all implications of animism, vitalism and substance dualism. Memory is not alive. Memory is not magical. Memory is not a secondary substance. Memory is the act of storing information with a capability for future retrieval.
Panmnemism is a variation of panpsychic theory, but dodges the problem of proving self-awareness, the trait often relied upon as the only sufficient condition for sentience. This is not to say that panmnemism is merely a watered-down or compromised version of panpsychism. The discussion of all things carrying the potential for remembering and transmitting experiential information has utility. In fact, gathering information and storing it for later retrieval might have as much practical value as other aspects of conscious ability. Even more beneficially, this view of material nature helps to remove the explanatory gap of emergent experience.2
Experience is an attribute of existence. All physical entities are items which exist in the material world. To be within this spatio-temporal reality is to be affected by aspects of the surrounding environment. Where something is located, how it grows or degrades, and how the atmosphere around it changes are all types of direct events acting upon singular perspectives. Such events ought to be called experiential. Each object is unique in its location, so each object develops experiences which are specific to its position, size and relation to its surroundings. These individualized traits can be referred to as subjective experiences. If these events are not experiences or proto-experiences, one is left concluding that there was no such thing as experience prior to the development of neural systems. Pattern, color and location would have had to suddenly emerge from nothingness, after billions of years of non-existence, once a sentient being got around to noticing them. Self-awareness is not a prerequisite of existence. Whether or not an object has the ability to reflect upon the quality of its experiences is irrelevant to the fact that physical entities are shaped and impacted by their place in the world.
There is a need to clarify what separates panmnemism from previous attempts at narrowing the scope of panpsychism. David Ray Griffin and Gregg Rosenberg endorse a reduced panpsychism of their own, known as panexperientialism. This theory, largely indebted to Whitehead, holds that all objects and entities gather individuated experiences.3 While this is certainly an important element of the philosophy being discussed here, their theory is quick to dismiss the presence of any sort of mentality in those entities. Experience is treated as a physical certainty, but is disassociated from the activities of minds.4 This harsh delineation reduces the attributes of inanimate objects to little more than what physicalist philosophies grant them in the first place.
Panmnemism, on the other hand, allows all objects the opportunity to participate in the collection of experiential information and the growth of a universal knowledge base. While the act of introducing newly fabricated terms into the philosophical lexicon tends to raise suspicious brows, this specific set of conditions cannot be confused with the more inclusive (panpsychism – all things are conscious) or the overly exclusive (panexperientialism – all things are experiential). Mneme, from the ancient Greek, refers not only to memory, but also to the ability to retain environmental information. Mnemic Theory, now little more than a relic, proposed the idea that memories were inscribed in the protoplasm of living cells as engrams.5 This is not what panmnemism is proposing, though the underlying concepts are similar. Panmnemism also encompasses the possibility for inanimate forms of expression. By exhibiting the power to share experiential information, the objects of our world participate in the assembly of knowledge, rather than keeping their experiences locked up privately.
As a physical and retrievable element of being, experiential data have great importance to the accumulation of knowledge. Not all experiential data are cognitive experiential data. Value stems from the quality and quantity of information, not from the carrier entity’s awareness of that information. It is through the collection of information that observations and conclusions concerning nature, ethical progression and evolution can be ascertained. If knowledge has a value for its ability to drive efficient practice and ethical improvement, then what is best for the greater good would be that which successfully preserves a record without causing harm or otherwise disrupting the developmental process. If all objects can carry extractable information then all objects can be said to assist in the growth of knowledge. A system for assigning value to an object, based on the potential contributions of that object to the expanding catalog of experiential data, can be laid out with a basic set of criteria.
Before setting up a framework for evaluation, the role of externally stored information in our progress and evolution ought to be explored. The knowledge base accumulated from our world is consistently utilized as the primary measurement of progress. Whether we are discussing technology, health and wellness, philosophy or even global politics, the distance gained from previous epochs is measured by a growth in understanding. Collecting information from the natural world is paramount for this evolution. Becoming more rational, or even more enlightened, is reliant on intellectual progress, directly or indirectly derived from observations of matter. The ability to uncover data is expanding. The ability to store data is expanding. Advancement is linked to progress in the endeavor to gather and assess facts and figures from the widest variety of sources.
Without going into too much quantitative detail, a foundation can be established to appraise this value based on the following factors: experiential memory storage potential, ease of information extraction and a pragmatic assessment of the quality of the data received. First off is the potential for memory storage. When looking at computer storage, encoded information in cells or the fossil record, we are dealing with very large sets of extractable information. The storage capacity of an object or entity is a valuable space for information. It is real estate for future knowledge. Like any other storage facility, the value is based on the capacity, not the contents. It is the potential for knowledge which must be credited rather than actualized data. If we were to look at a parallel on the human level, we wouldn’t want to exclude the potential contributions of the majority of first person perspectives, simply due to poverty or socio-political marginalization.
The next variable under consideration relates to the ease of extraction. If data collectors are not aware of the experiential knowledge inside of an object, or they’re unable to access the object’s memory, the utility of that knowledge sinks toward the bottom of this value scale. Systems for extracting packets of information, such as radiometric dating or tree ring analysis, are well established research tools. These are examples of relatively easy information extractions. Several new transfer systems are being experimented with. Molecular memory, insect communication and environmental toxicology are areas undergoing current exploration. The ease of extraction will only be advanced by the creation of a reliable process and a common language for sharing any information retrieved. It might be beneficial to associate these extractions with the term recollection, rather than memory. Memory, for better or for worse, belongs to the vocabulary of human psychology. Recollection, the act of collecting again, is a fitting descriptor for these examples. The object collects experiential information and that same information can be re-collected at a later date.
Lastly, we come to the weighing of the epistemological value of the information itself. Information storage which can be kept static for long periods of time, without tampering or alteration, has greater value than consensus-based historical recording. As human history progresses, the narratives shift, the detail sometimes fades and politics directly or indirectly censor the neutrality of the information. The same factors apply to the extraction and interpretation of individual panmnemic data points, but not to the raw content of those data points prior to human interference. The way in which people interpret century old climate statistics varies. Any inferred meaning associated with that data can be swayed by scientific bias, political motivations or ordinary human error. But the data keep their value, as long as they are returned to a state of static information, which can be examined and employed again and again through future analysis.
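To make the three criteria above easier to see at a glance, here is a minimal, purely illustrative sketch of how they could be combined into a single rough score. Nothing in the essay proposes a formula; the 0-to-1 scales, the equal default weights, and the example values below are assumptions introduced only for illustration.

```python
# Illustrative sketch only: scoring a hypothetical information carrier on the
# three criteria discussed above. The 0.0-1.0 scales and equal default weights
# are assumptions, not part of the essay's framework.

def panmnemic_value(storage_potential, ease_of_extraction, data_quality,
                    weights=(1.0, 1.0, 1.0)):
    """Combine the three criteria into one rough score between 0 and 1."""
    factors = (storage_potential, ease_of_extraction, data_quality)
    if not all(0.0 <= f <= 1.0 for f in factors):
        raise ValueError("each factor is expected to lie between 0 and 1")
    return sum(w * f for w, f in zip(weights, factors)) / sum(weights)

# Hypothetical comparison: a tree (rings are easy to read) versus a distant
# star (vast storage, but extraction is hard and interpretation-laden).
print(panmnemic_value(storage_potential=0.6, ease_of_extraction=0.9, data_quality=0.8))   # ~0.77
print(panmnemic_value(storage_potential=0.95, ease_of_extraction=0.3, data_quality=0.7))  # ~0.65
```

On this toy scoring, the tree's easily read rings can outrank the star's much larger but harder-to-extract record, which is simply a way of restating the weight the criteria place on extraction and interpretation.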
A star sends valuable information, about the physical and environmental conditions in its galaxy, across the universe. The experience of a particular celestial body has its greatest value as potential knowledge. When humans get involved in hypothesizing and speculating, the value decreases. A sensor, collecting the raw data, skips the middle-person’s conjecture and keeps the potential for the transportation of valuable knowledge at its highest. It has become a fairly common practice, or at least a convenient tool, to think of elementary particles as information.6 This is the function of our interaction with a majority of substances: we engage with them as little more than robust information sources.
Experiential information derived from non-humans has the asset of being untainted by false or implanted memories. People not only have difficulty recollecting past events with objective detail, they also carry in their brains all sorts of perceived impossibilities, lies, errors in education, dreams and imagination. A story my sister told me about an accident I had when I was two years old seems as real to me as any other recollection from my early childhood. If that tale was invented, and its truth value was withheld from me, the memory would contain the same quality inside my mind. The false data are passed off as genuine every time I retell the story as if it had actually taken place. Rocks and molecules appear to be incapable of lying. This grants a higher value, or at least a greater reliability, to whatever stories they pass on to their investigators.
These might be relatively primitive levels of experience, but they can be found in all entities. Memory and experience do not need to mimic human memory and human experience in order to have a value. If the information derived from an object serves a purpose, then the object has performed a useful function. Experience evolves toward complexity, from tiny independent units to intricate and collaborative systems which provide immense bodies of viable data to whatever is able to extract that information. We should not necessitate the presence of complex systems with collaborative functions in order to refer to the collection and dissemination of data as recollection. That would lead to problems of where to draw the boundaries and whether a sentient being with damaged neural functions still possesses memories. Again, this is not proposing an equivalency between memory and consciousness, but suggesting recollection as an intermediary in the evolution of consciousness, or as a type of proto-experience.
In order to explain the appearance of cognition in nature, we need to place current levels of conscious ability within an evolutionary context. Mind did not emerge from nothingness. The suggestion of mental emergence is counterproductive to our growing understanding of natural selection. The picture becomes more subtle, explicable and closer to completion if we adhere to Teilhard de Chardin’s position that biological evolution must be a continuation of pre-biological evolution.7 Incorporating a few other panpsychic theories will help to solidify the additive process I am describing. German biologist Bernard Rensch wrote about the proto-phenomenal properties of inanimate matter.8 He explained the necessity of attributing such characteristics to atoms and molecules, the alternative being the unnecessary struggle of trying to integrate consciousness into some unknown intermediary level. If we combine this concept with key elements from the panpsychic writings of Gustav Fechner, specifically his ideas about mind resulting from a systemic layering of simple material functions,9 we begin to see a more inclusive evolutionary model. This additive view can be further quantified with the application of a physics of information. The smallest components of the natural world are not necessarily quarks or strings. All sub-atomic particles can be discussed as being made up of measurable experiential data sets. Wheeler’s theory of ‘It From Bit’ explains that “every particle, every field of force, even the space-time continuum itself – derives its function, its meaning from apparatus-elicited answers to yes or no questions, binary choices, bits”.10 From this we can see how even something storing its history of, for example, receiving light versus not receiving light, would be sufficient to represent that data as a binary number set. This is usable information which takes on meaning once it has been transmitted or observed.
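As a small concrete illustration of that last point, and not anything drawn from Wheeler or the essay itself, a bare history of receiving light versus not receiving light can be written down as a binary number set. The sample history below is invented purely for demonstration.

```python
# Minimal illustration: a yes/no event history ("received light" vs. "did not
# receive light") represented as a bit string and as a binary number.
# The sample history is hypothetical.

history = [True, False, True, True, False, True, False, False]

bits = "".join("1" if received_light else "0" for received_light in history)
as_number = int(bits, 2)

print(bits)       # 10110100 -- the stored experiential record, bit by bit
print(as_number)  # 180 -- the same record read as a single binary number
```

Nothing here requires the storing thing to be aware of its record; the information simply sits there until something able to read it comes along.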
Memory, as discussed within the frame of panmnemism, is to be understood as the ability to store information concerning the experiential qualities of existence. The element of choice or selection is irrelevant. Just as people can’t decide which details to remember and which ones to forget, other material things hold on to some aspects of their experience in the world while some pieces go unrecorded. Higher order comprehension still carries a greater value, though that preference does not reduce the importance of remembering and sharing memories as a form of proto-consciousness. Being able to recall and restate facts and figures will get a student passing grades in many courses. While educators would prefer to create a system which requires a deeper understanding of the subject matter, it can’t be said that rote repetition has no utility. Demonstrating defensive measures to offspring, based on past predation, has an important function in animal survival, regardless of whether or not the animal has any knowledge of what it is teaching or why. People carry memories of sensations that they weren’t able to comprehend when their minds first received them. We can remember things that still don’t make sense to us, even after years of cognitive reflection on the primary experiences. Human memory is fallible and largely unidirectional. If I can’t recite a phone number backwards as well as forwards, or recall it without picturing the keypad, do I really have a firm memory of the number? Examples of rote information dispersal reach far below the higher order animals.
Slime mold demonstrates detailed patterns of path recognition and spatial memory. It will avoid searching for food in places it has already checked and sort out the shortest paths to and from food sources. This is an organism with no neural function.11 Meteorites carry information about their stellar origins. Hair can be examined to reveal a detailed history of forensic toxicology. Transmission of stored experiences continues to develop. New material areas are being explored for the potential historical information lying dormant inside.
Wooden musical instruments, for example, have resonant top plates which vibrate at specific frequencies. Certain nodal patterns, correlating to the commonly used pitches of the performer, are repeatedly flexed. This repetition stiffens some areas of the structure while loosening others. As the wood slowly ages and dries over time, these Chladni patterns are set as sympathetic nodes in the object.12 If these past preferences can be extracted, then it would be wrong to say that the plate of spruce on the top of a violin merely responds as though it remembers being played in down-tuned, just intonation. If the wood expresses familiarity with those frequencies, it does remember, and has shown evidence of its past experiences by resonating in that mode.
There is a peculiar tendency to rank mental faculties hierarchically. As thinking human beings, we often place our own complex brain functions right up at the top of that pyramid. This practice is likely a matter of protecting our species from any threat to our cerebral supremacy. Admittedly, panpsychism, panexperientialism and panmnemism present a comparatively neutral, passive view of conscious action, but one which Descartes yearned to reach, one which Buddhists believe to be highest. Why is cognition considered to be of a higher order than qualia? Why is self-awareness believed to be a more evolved trait than experience? Why is creativity valued over recollection?
Without a consensus-based definition of what consciousness is, philosophers are left discussing the various elements which constitute conscious thought. Multiple traits and abilities might figure into the assembly: self-awareness, rational thought, recollection, language, empathy, moral judgment and creativity are all functions of a conscious mind. No single trait is sufficient to represent the presence of sentience. Consciousness can best be understood as a convergence or overlapping of a number of these spheres. Essential to this Venn diagram are the spheres of communication and memory.
Memory plays an essential role for identification by providing a constant for the continuity of consciousness. The first-person narrative of any individual is dependent upon reliably referencing identical memory points within that individual’s history. What is self-awareness without memory? Because I can recall the same specific childhood fishing trip I was able to picture five years ago, ten years ago and twenty years ago, I can be sure that the events which have taken place over those twenty years fall within the same life narrative. This elevates the forming and saving of memories to a high position among the necessary cognitive abilities of sentient beings. This is the formative basis for conscious reflection as well as the requirement for maintaining a continuity of self.
Language is commonly listed as a necessary condition of sentience. On that criterion, in order for these examples to be considered recollection, the objects would need to be able to linguistically express their memories. Objections on the grounds of the language criterion for consciousness can be dealt with by making a slight shift in how language is understood. Language need not be verbal, written or even directly targeted to be categorized as communication. Language can’t be limited to tongue and text. A broader appreciation of linguistic comprehension ought to include communication through non-traditional formats over longer periods of time.
We can personify, even anthropomorphize buildings. People make frequent references to mood swings and attitudes of smart phones and computers. Still, it doesn’t seem right to state that those products possess a language. In order to reconcile this, we must expand our use of the term language from being an immediate interactive system, to also including systems of experiential storage, which can be opened in order to share certain unique experiences. A person with no external communicative ability would still have the internal voice needed to catalog their experiences and emotions. If we think of this one key function of consciousness, the storage and utilization of external stimuli, then conscious ability appears in all sorts of things, not just human minds.
When a tree collects data and stores them in the formations of its rings, this is akin to creating an internal memory bank. All of that environmental information can be extracted later on, providing the recipients with a detailed record of that tree’s experiences. The fact that objects record data in formats which can be harvested and understood shouldn’t be underappreciated here. The object owns its individual information history and there is no clear reason why that history should be transferable to other types of beings. Examining the history of weather, records of trauma, droughts or regional fire activity embedded in tree rings gives the reader of those signs a picture of key events in the life of that organism. This is the tree’s way of communicating its story. Whether or not there’s an intended audience is beside the point; this transfer of experiential information is a type of language.
A more contestable example, but a bit more fun at the same time, might be a consideration of what the French call terroir. This refers to the long and proud tradition of associating flavor profiles with their geographical origins. Burgundy tastes different from Bordeaux, though they’re both red wines. When someone tastes a regional cheese, it is the topography, the soil, the varieties of grazing grass and the weather they are sensing, more than the influence of the farmer or production methods.13 This is why a person with a finely tuned palate can discern between varieties of wine and cheese. They are, in a sense, tasting the story of the grape, the narrative of the milk.
Now, I make no claims about my own ability to discern subtle flavor data and I could stare at rings on a tree stump for days without gaining any usable historical information. This does not mean that the content isn’t present. The narrative exists and is accessible; the language simply must be understood by both parties. Likewise, I’d have no understanding of a page of text presented to me in phonetic Wintu. However, the possible contribution to global knowledge on that page shouldn’t be eliminated due to a lack of linguistic abilities on the part of the receiver. It would be wrong to say that the Wintu language is not a form of communication just because only a handful of people possess a working knowledge of it. There are far more people with a working knowledge of the language of dendrochronology. At least since John von Neumann’s computational work of the late 1940s and early 1950s, the notion of an inanimate object remembering and communicating information has been rationally presented. It shouldn’t be difficult to extend his reasoning beyond machine computation, or to associate such examples of communication with other forms of language.
An experience does not need to have greater meaning derived from it in order to become an important piece of data. Contexts and applications can be associated with those direct experiences later on. This process works in agreement with John Searle’s claim that things don’t have intrinsic intentionality.14 Literary criticism and psychoanalysis are obvious examples of attributing meaning to pieces of information that don’t express a clear perspective in and of themselves. These are impure, or at least imperfect, attempts at a useful exchange of ideas. The philosophical problems associated with language, ever present in post-Wittgensteinian thought, come with the interference of human interpretation. Raw mathematical data actually bring us closer to analytic clarity; they bring us closer to Wittgenstein’s position that description ought to take the place of explanation.15 After all, humans, in the perspective of greater timescales, are little more than carriers of genetic data, data of which we’re not consciously aware.
Humans communicate private experiences of stress or pleasure without a specified linguistic ability to express the quality or quantity of those feelings. I can only compare my pain to other pains I’ve felt, having no calibration for my sensations on a chart which plots the pains of others. These are limitations. The communicative abilities of objects are quite different. The data transferred from inanimate objects and simple organisms are built to specific standards. There is regularity. This makes the transfer of knowledge more reliable. Humans use a language which allows for lies, withholdings and numerous opportunities for miscommunication. In the information storage systems of objects, ‘A’ always equals ‘A’. This type of non-linguistic communication is a more reliable source for the foundations of knowledge. This type of memory is perhaps more analogous to what we label as ‘photographic memory’ in humans. When people can recollect dialogue verbatim or exact dates and names from insignificant events, we not only find that skill impressive, but usually see it as a sign of intelligence. In order to cope with the problems of the future, reliable information taken from the perspectives of all entities needs to be gathered, collated and appreciated.
Developing a better comprehension of the evolution of consciousness, from simple memory retention to complex reasoning systems, will lead to the proper placement of human minds on the widest conceivable spectrum of consciousness. This will allow a future machine or alien intelligence to exceed human conscious abilities without threatening the relevance of human contributions to a universal knowledge base. By imagining mental ability as a vast scale, rather than belonging to a single species, the problems associated with artificial cognition become a matter of evaluating the experiential information in combination with the utility of higher order reflection. There is room for different types of consciousness contributing different strengths of cognition. A scenario with some similarity to Ray Kurzweil’s vision of ‘the singularity’ seems all but inevitable. The boundaries between organic and inorganic mental states are becoming blurry. As people implant more and more artificial or externally formed devices into their bodies, computers are becoming more organic and beginning to process information more creatively. Even if a fully realized new form of cognitive consciousness is not on the horizon, a very real ‘Ship of Theseus’ type of problem is imminent.
There isn’t a hard line between human consciousness and other forms of consciousness, but there is a range of different aptitudes. Some ‘minds’ are better at preserving detail. Other ‘minds’ have a proclivity for sharing information. Some ‘minds’ excel because of the vast amount of data they can safely store. Human minds are best at organizing and drawing conclusions from data sets.
When the human animal no longer walks the earth, all that will remain of our experiences will be artifacts and a wide variety of extractable information. This historical record will be encoded in thousands of different languages, numeric charts, geologic formations, fossil remains, mathematical proofs, symbolic logic and all sorts of examinable evidence awaiting investigation. People don’t just want to believe that their experiences have a value; we are certain that our perceptions, and the information we gather, can be useful in numerous applications. This knowledge is not only of utility to the lives of future human beings, but valuable as experiential data with the potential of being transferred to any capable investigator. If we choose to ignore the importance of similar examples, where experiential information is being derived from objects today, then there is no reason to place any importance on the experiential information of human beings either. People derive meaning and a sense of purpose from the ability to share knowledge with the rest of the world, and potentially, future civilizations. There is value in carrying information, a value which can be extended to all material data carriers. Rather than dealing with the scientific, philosophical and spiritual traps of talking about consciousness, we can expand our allowance of certain types of conscious abilities to include all contributors. All things have experiences. Experiential information can be stored as memories. Memories can be retrieved and communicated to investigators. From these claims, it can be concluded that non-human animals, plants and inanimate objects possess a form of memory, as well as forms of semi-linguistic communication of those memories.
The post Panmnemism and the Value of Experiential Information appeared first on Center on Long-Term Risk.
The post The Case for Suffering-Focused Ethics appeared first on Center on Long-Term Risk.
A common intuition is that creating happy beings is less morally pressing than making sure existing beings are well off. This intuition is characterized by Jan Narveson’s (1973) statement, “We are in favor of making people happy, but neutral about making happy people.” Narveson’s principle points to a putative2 asymmetry between suffering and happiness in the context of (not) adding new beings to the world. Consider the following thought experiment:
Two planets
Imagine two planets, one empty and one inhabited by 1,000 beings suffering a miserable existence. Flying to the empty planet, you could bring 1,000,000 beings into existence that will live a happy life. Flying to the inhabited planet instead, you could help the 1,000 miserable beings and give them the means to live happily. If there is time to do both, where would you go first? If there is only time to fly to one planet, which one should it be?
Even though one could bring about 1,000 times as many happy beings as there are existing unhappy ones, many people’s moral intuition would have us help the unhappy beings instead. To those holding this intuition, taking care of suffering appears to be of greater moral importance than creating new, happy beings.
By contrast, preventing miserable beings from being added to the world seems just as important as preserving existing happiness. And it would be a no-brainer for most people that 1,000 existing beings going from happy to miserable is better than the 1,000 beings staying happy at the cost of 1,000,000 new, miserable beings being created. This suggests that we care about reducing suffering for existing and potential beings equally, whereas we prioritize the promotion of happiness in actual beings over the happiness of their merely potential peers.
Narveson’s principle is found at the heart of preference-based axiologies where what matters is preference (dis)satisfaction. In Christoph Fehige’s words:
"We have obligations to make preferrers satisfied, but no obligations to make satisfied preferrers" (Fehige, "A pareto principle for possible people", 1998, p.518).
Creating large numbers of beings while ensuring that all their preferences or goals are satisfied – which could be achieved by making provisions for the newly created beings to have easily satisfiable preferences, or preferences that correspond very closely with the most likely state of the world – may strike us as a pointless endeavor. The claim that an extra preference in itself is of little value, so that “Maximizers of preference satisfaction should instead call themselves minimizers of preference frustration.” (Fehige, ibid.) is the gist of Fehige's anti-frustrationism.
Intuitions such as “Making people happy rather than making happy people” are linked to the Epicurean view that non-existence does not constitute a deplorable state. Proponents of this view reason that the suffering and/or frustrated preferences of a being constitute a real, tangible problem for that being. By contrast, non-existence is free of moral predicaments for any evaluating agent, given that by definition no such agent exists. Why, then, might we be tempted to consider non-existence a problem? Non-existence may seem unsettling to us, because from the perspective of existing beings, no-longer-existing is a daunting prospect. Importantly however, death or no-longer-existing differs fundamentally from never having been born, in that it is typically preceded and accompanied by suffering and/or frustrated preferences.
Acceptance of the above moral asymmetry between suffering and happiness as they pertain to existence and non-existence grounds a strong moral reason to concentrate on suffering and/or preference frustration as opposed to the promotion of happiness. It comes with an understanding of ethics as being about solving the world’s problems: We confront spacetime, see wherever there is or will be a problem, i.e. a struggling being, and we solve it.
Views that incorporate this intuition: The intuition “Making people happy rather than making happy people” is directly incorporated in so-called person-affecting views, where states of affairs can only be morally bad if they are bad for someone, i.e. if there is a being to point to for whom something poses a problem. One such view is (a version of) anti-natalism (e.g. David Benatar, “Better never to have been”, 2008). The same underlying intuition is also present in preference-based approaches such as Fehige’s antifrustrationism that we mentioned before. In addition, it probably inspired the “moral ledger” view laid out by Peter Singer in his Practical Ethics,3 as well as the “prior-existence” view he tentatively endorsed in the same book.4 Finally, egalitarianism and prioritarianism can incorporate the intuition by giving priority to suffering reduction as long as suffering is still around (or will foreseeably be around).5
There is something particularly terrible about torture-level suffering. For many people, extreme suffering is so bad that no other experience can counterbalance it. We tend to shy away from truly imagining how horrible suffering can be, but morality, as the most serious business there is, needs to pay attention to everything. After episodes of extreme suffering, we may gradually forget just how bad it was in the moment. At times, circumstances can deteriorate to a point where we are willing to give up everything we care about. As Orwell pointed out in his book 1984: “[...] for everyone there is something unendurable – something that cannot be contemplated. Courage and cowardice are not involved. If you are falling from a height it is not cowardly to clutch at a rope. If you have come up from deep water it is not cowardly to fill your lungs with air. It is merely an instinct which cannot be destroyed. [...] They are a form of pressure that you cannot withstand, even if you wished to. You will do what is required of you."
Confronting torture-level suffering
Imagine you are taken out of your everyday life and presented with the following choice: 1) You die painlessly now. 2) You have to experience the worst possible torture for one week to be subsequently rewarded with forty years of bliss in a perfect experience machine, culminating in a painless death. Which option would you choose?
While people might be motivated to attempt to endure torture for loved ones or for the fulfillment of their dearest life goals, these confounders are eliminated in the thought experiment above, where the gains – paradise in the experience machine – also come with abandoning loved ones or life goals directed at the (non-virtual) world. In essence, we are asking whether torture-level suffering can – all else equal – be counterbalanced by other experiences in one's personal case. Of course, we imagine the virtual reality on offer to feel completely authentic to the user with no disturbing memories involved.
So when it comes to just comparing the best experiences to torture-level suffering in the personal case, would people make the deal? A refusal to accept this tradeoff suggests that as far as an individual's judgement of their experiences is concerned, torture-level suffering cannot – at least for some individuals – be counterbalanced by (much) larger amounts of happiness.
There are however people who would accept this bargain. This presents a challenge: How are we to extrapolate from intrapersonal tradeoffs, where a person is their own arbiter, to interpersonal tradeoffs, where we consider decisions that affect the welfare of other individuals? Accepting torture-level suffering in one’s personal case is not tantamount to granting that happiness can outweigh torture-level suffering in general in interpersonal tradeoffs. The question for someone who accepts torture-level suffering in exchange for subsequent happiness is “Why do they do it?”
Do the reasons also apply to setting tradeoffs for other people, or do they more narrowly only apply to their specific circumstances?
When it comes to assessing the quality of other people’s lives, there is always going to be an element of paternalism, regardless of the policy we choose: Even in one’s personal case, many of one’s person moments – when we are treating a person’s life as a temporal series of autonomous “person moments” – during the week of torture would not consent. That is, sufferers would (almost certainly) regret the decision made by the person moment that preceded them, and wish for it to be changed. Conversely, after a few years in the most sublime bliss of virtual reality, the newer, happy person moments might end up concluding that, in retrospect, the suffering was worth it after all. And if they were to get tortured again, the judgment might again revert. With perspectives biased by temporal viewpoints, an objective answer on whether or when to accept torture-level suffering cannot be obtained, and no matter which answers we choose, some people (or some person moments) will have grounds for objecting.
Nevertheless, what we can do is to analyze the different reasons for choosing one way or the other, and integrate them into a framework that aspires to a kind of impartiality or fairness, at least to as much a degree as is possible given the constraints. The personal case where someone voluntarily accepts extreme suffering for sufficiently much happiness in their own future is confounded by many factors: For instance, perhaps someone’s life goal includes a desire for novelty or excitement-seeking that ultimately motivates the tradeoff in question. These confounders call into question whether what makes suffering worthwhile under some circumstances is found purely, and to an equal degree for every subject of experience, in the "goodness" of happiness.
People generally hold the belief that more recent experiential moments have epistemic authority over their elapsed peers. For instance, we are willing to grant that an elderly person looking back at various life choices is in a good epistemic position to determine the merit of these choices. However, such epistemic authority cannot straightforwardly be transferred to the interpersonal level. While it does seem plausible that people choose rationally when they decide to undergo tradeoffs involving extreme suffering in their own lives (though there might also be a systematic error in gauging these sorts of tradeoffs), it strikes us as more controversial whether a measurement of the (dis)value of suffering and happiness respectively can be established from an outside, “impartial” perspective. The following thought experiment lends support to this intuition by highlighting an important difference between intrapersonal and interpersonal welfare tradeoffs:
Consciousness board
Part 1: Imagine you are sitting on a towel at the beach. The weather is very hot, but you are sitting comfortably in the shade. After some time, you develop a strong craving for taking a refreshing swim in the ocean. Unfortunately, you forgot your sandals and will have to walk barefoot over hot sand. You decide that the pain from walking over the hot sand will be worth the pleasure of swimming in the ocean and go for it.
Part 2: Imagine you are controlling a supercomputer that can implement states of consciousness. You look at the control board in front of you and have the option to instantiate some painful “hot-sand moments” with 10% of the computer’s resources, and a lot of happy “ocean-swimming moments” with the other 90% of computing power. The two experiences will not be connected to each other, i.e. no memory is present in the ocean-moments of the sand-moments, and the sand-moments experience no anticipation of the refreshing swimming. Would you choose to run these experiences jointly?
Interestingly, it is much less obvious that the tradeoff in Part 2 is “worth it.” In fact, it is unclear what “worth it” would even mean in this context. Our own behavior for trading between pleasure and suffering seems to be driven by factors that don't just correspond to the intensity of the suffering or pleasure in question; factors such as cravings for pleasure (that would counterfactually result in unpleasantness, e.g. if one were to remain in the shade and consequently became more and more hot, sweaty and bored), or by preferences for adventure, for accomplishments and meaning in life, or simply for novel or exciting activities. When such external factors are removed and we are only looking at “raw experiences” rather than experiences embedded in the context of people’s lives as a whole, our intuitions regarding the appropriate exchange rates may change drastically, and many reasons (cravings, desire for novelty, meaning, adventure, etc.) for why one might be tempted to accept (torture-level) suffering in the intrapersonal case simply disappear. In the impartial or altruistic context, when evaluating whether to instantiate certain (packages of) experiences or not, there seems to be an especially strong case for never bringing about torture-level suffering, where the person moment in question would be left uncompensated and wanting its suffering to be terminated at all costs.6
This suggests that even for people who might be inclined to accept the bargain described above for themselves, it remains an open question whether they should apply the same exchange rate for the impartial or altruistic context. The two modes of evaluation, intra- versus interpersonal, differ in interesting and potentially relevant respects.
Views that incorporate this intuition: The intuition that torture-level suffering cannot be counterbalanced is strong in many people. It is present in the widespread belief that minor pains cannot be aggregated to become worse than an instance of torture.7 Among consequentialist ethical systems, it is incorporated in threshold and consent-based negative utilitarianism and in (maximin) prioritarianism. It likely also contributes to absolute prohibitions against torture in deontological moralities. Finally, the intuition is part of philosophical works of fiction, such as Ursula K. Le Guin’s short story The Ones Who Walk Away from Omelas,8 Dostoevsky’s The Brothers Karamazov9 or Camus’s The Plague.10
A widespread view, especially in non-Western traditions, is that “happiness” consists of the absence of suffering. According to this view, not only pleasure, but also tranquillity or contentment are amongst the best experiences (Gloor, 2017). We pursue pleasures because without them, we (usually) develop cravings for these pleasures. And these cravings constitute a form of suffering or dissatisfaction, i.e. of consciously wanting the current experience to be different. If a state is entirely free of cravings, there is a sense in which it can be considered perfect. Subjectively at least, the immediate, internal evaluation in such a state concludes that nothing needs to be changed. Accordingly, pleasure then carries “mere” instrumental importance, because flooding the mind with pleasure is one of several ways to (temporarily) get rid of cravings. Quoting from the paper on Tranquilist axiology:
In the context of everyday life, there are almost always things that ever so slightly bother us. Uncomfortable pressure in the shoes, thirst, hunger, headaches, boredom, itches, non-effortless work, worries, longing for better times... When our brain is flooded with pleasure, we temporarily become unaware of all the negative ingredients of our stream of consciousness, and they thus cease to exist. Pleasure is the typical way in which our minds experience temporary freedom from suffering, which may contribute to the view that happiness is the symmetrical counterpart to suffering, and that pleasure, at the expense of all other possible states, is intrinsically important and worth bringing about. However, there are also (contingently rare) mental states devoid of anything bothersome that are not intensely pleasurable, examples being flow states or states of meditative tranquillity. Felt from the inside, meditative tranquillity is perfect in that it is untroubled by any aversive components, untroubled by any cravings for more pleasure. Likewise, a state of flow – as it may be experienced during stimulating work, when listening to music or when playing video games – where tasks are being done on “autopilot,” with time flying and a low sense of self awareness, also has this same crucial quality of being experienced as completely nonproblematic. Such states – let us call them states of contentment – may not commonly be described as “(intensely) pleasurable,” but following venerable traditions in Buddhism and Epicureanism, these states, too, deserve to be regarded as perfect.
Experiences that we in our everyday life think of as “neutral” may often contain a backdrop of dissatisfaction, hard to notice introspectively because we have become accustomed to it, but present nonetheless in that our evaluation of the experience is affected and becomes ever-so-slightly negative. Experiences that are truly free from dissatisfaction on the other hand are experiences we often think of as positive.
Critics may object that a world in which all pleasures were reduced to “mere” contentment – such as states of constant meditation, half-sleep or the playing of flow-inducing video games – would be unexciting and rather monotonous. All the heights of sensory and emotional pleasures, such as eating one’s favorite foods, successfully accomplishing a long-term project or being in love, would be lost. But it is worth pointing out that the loss of these experiences appears tragic only when viewed from the outside, when we compare it to the world we would wish to inhabit in terms of our life goals and the desired narrative for how we want our lives to go.
So the reflective part of our nature cares about things other than our moment-to-moment experiences. And to that part of us, a world of mere contentment would indeed be found lacking. However, what tranquilist axiology is modeled after is the impulsive part of our nature: We tend to live according to the short-sighted avoidance of dissatisfaction. We seek pleasure not because pleasure is in itself valuable, but in order to drown out boredom or pleasure cravings. For the impulsive part in us, a world filled with nothing but states of contentment would be a true paradise. If we zoomed in on all that is being experienced in such a world, there would by definition never be a moment of boredom or unfulfilled longing; never would there be a moment where someone consciously wants to have something changed about their experience. To its inhabitants, such a world would manifest itself as perfect. (Of course, the impulsive part of us may not be the only motivational system that matters morally, and morality should also be about evaluating the world according to the fulfillment of long-term, reflected preferences or life goals. It remains an open question how these two parts are to be integrated.)
To distill the intuition that it is morally unimportant to turn states of contentment into states of maximal pleasure, consider the following thought experiment:
Buddhist monks
Imagine a large temple filled with 1,000 Buddhist monks who are all absorbed in meditation; their minds are at rest in flawless contentment. Unfortunately, the whole temple will collapse in ten minutes and all the monks will be killed. You cannot do anything to prevent the temple from collapsing, but you have the option to press a button that will release a gaseous designer drug into the temple. The drug will reliably produce extreme heights of pleasure and euphoria with no side effects. Would you press the button?11
One reason to press the button is that it could cause the monks to believe that they have reached a long-sought state of enlightenment they were after their whole life. But let us suppose that the monks in the temple are already maximally satisfied with their lives' achievements: The drug-induced euphoria will change the quality of their experience, but it would not change their beliefs about enlightenment or their meditative accomplishments. Should we press the button?
It is tempting to feel roughly indifferent here: Pressing the button seems like a nice thing to do, and assuming it produces no harm or panic, it may be hard to imagine how it would be something bad.12 At the same time, it does not seem particularly important or morally pressing to push the button. As far as the monks in the temple are concerned, it seems that they will be totally fine (for the next ten minutes anyway) without the drug. If there are any relevant opportunity costs whatsoever, such as e.g. an opportunity to reduce mild suffering somewhere else, would it ever be the morally preferable action to induce euphoria in the monks? If yes, why?
This thought experiment suggests that a difference in “happiness levels,” i.e. changing an experience from “neutral” (or “not-maximally-positive” – depending on how one looks at meditation) to intensely pleasurable or “extremely positive,” is not of (strong) moral importance. Interestingly, no symmetrical point for differences in pain levels can be made: On the contrary, no one in their right mind would think that turning the mild suffering of 1,000 individuals into extreme agony makes at most little moral difference. To summarize, the former view – that making slightly happy experiences much more happy is at most of little value – is at the very least plausible (and for many people, even highly intuitive). The latter view, on the other hand – that turning slightly painful states into very painful states is at most of little disvalue – is impossible to even take seriously.13 This contrast strongly indicates that there is an asymmetry between how we value increasing happiness versus reducing suffering. Accordingly, perhaps happiness, rather than being intrinsically valuable, should be seen as instrumentally valuable, or as contingently valuable depending on a person valuing (specific flavors of) happiness for themselves (or others) in their life goals.
Views that incorporate this intuition: The intuition that happiness consists of the absence of suffering is common in non-Western traditions, especially in Buddhism.14 It is also a central part of Epicureanism.15 Finally, it may constitute part of the explanation why a lot of people reject versions of consequentialism where the happiness of the many can outweigh the suffering of a few.
In addition to valuing suffering reduction, we might care about our personal well-being, about us and others fulfilling their dearest dreams and life-goals, about happiness and there being love and joy in the world, and many other, similar objectives. Interestingly, these other values beside concern for suffering often seem to be bounded, i.e. they seem to quickly reach diminishing returns once we optimize for them successfully.
Repugnant Conclusion16
Imagine there is a civilization with ten billion maximally happy inhabitants who never experience any suffering. We have the option to introduce more lives and overall more happiness to the world by multiplying the number of people in that civilization by a factor of one billion. However, the way to bring about this population explosion will also lower the quality of life for all the beings, old and new. In the new civilization with ten billion billion people, each person will experience a lot of mild suffering, a decent amount of moderate suffering, and even some moments of strong suffering. People will also be happy a lot, such that most outsiders, as well as all the people themselves, would on the whole regard their existences as worth living. Should we choose to bring about the much larger, less happy civilization, or do we prefer the small(ish) but maximally happy one?
While there are some people who argue for accepting the repugnant conclusion (Tännsjö, 2004), most people would probably prefer the smaller but happier civilization – at least under some circumstances. One explanation for this preference might lie in the first intuition discussed above, “Making people happy rather than making happy people.” However, this is unlikely to be what is going on for everyone who prefers the smaller civilization: If there were a way to double the size of the smaller population while keeping the quality of life perfect, many people would likely consider this option both positive and important. This suggests that some people do care (intrinsically) about adding more lives and/or happiness to the world. But considering that they would not go for the larger civilization in the Repugnant Conclusion thought experiment above, it also seems that they implicitly place diminishing returns on additional happiness, i.e. that the larger an already happy population becomes, the less important it is to make it larger still.
By contrast, people are much less likely to place diminishing returns on reducing suffering – at least17 insofar as the disvalue of extreme suffering, or the suffering in lives that on the whole do not seem worth living, is concerned. Most people would say that no matter the size of a (finite) population of suffering beings, adding more suffering beings would always remain equally bad.
It should be noted that incorporating diminishing returns to things of positive value into a normative theory is difficult to do in ways that do not seem unsatisfyingly arbitrary. However, perhaps the need to fit all one’s moral intuitions into an overarching theory based solely on intuitively appealing axioms simply cannot be fulfilled. Some have pointed out that human moral intuitions are complex, which makes it non-obvious that one's normative views must follow directly from just a couple of simple and elegant principles.
Views that incorporate this intuition: Next to being a plausible (partial) explanation why people reject the Repugnant Conclusion, diminishing returns to happiness and other values might explain the appeal of average versions of consequentialism and the fact that most people do not consider it morally important to fill the entire universe with happy beings.18
We introduced four separate motivating intuitions for suffering-focused ethics. Endorsing one or several of these intuitions as a guiding principle can ground concern for suffering as the main focus of someone’s morality, while leaving room for other things to value. In addition to the intuitions discussed above, some people’s focus on reducing suffering may also derive from a suffering-focused disposition or temperament when they contemplate the value of lives and outcomes in practice: Quantifying suffering and happiness can be done in several different ways, and people’s judgments may differ even if they use the same methods for their assessment. When doing rough, impressionistic aggregations, some people – who we can call “suffering-focused” – tend to conclude that various lives and outcomes are overall bad, while others tend to conclude that they are overall good.19
It should be noted that pointing out strongly held intuitions or principles does not yet yield a comprehensively specified goal or moral system. For instance, some of the views discussed above may be tricky to formalize in satisfying ways.20
Most people have intuitions about many things, including selfish interests, altruism, our personal (moral?) self-image, game-theoretic considerations or cultural community norms. Deciding which of these we want to reflectively endorse and to what extent, and then bringing all of this together into a coherent goal that ranks not just situations we are culturally or evolutionarily familiar with, but all possible world states,21 is indeed challenging.
Given the difficulty of this task, it is important that we do not make it even more complicated by placing unreasonable formal demands on our values. Likewise, it is important that we do not hastily subscribe to some particular view without remaining open to reflection. Ultimately, choosing values comes down to finding the intuitions and guiding principles we care about the most – and if that includes a number of different intuitions, or even some form of extrapolation procedure to defer to better-informed future versions of ourselves – then the solution may not necessarily look simple. This is completely fine, and it allows those who agree with (some of) the intuitions behind suffering-focused ethics to care about other things in addition.
CLR’s endorsement of suffering-focused ethics is an attempt to incorporate suffering alleviation within a generally commonsensical framework about what is and is not wise to do in practice. Our activism should be strategically smart, non-violent and cooperative. This ensures that people with many different moral perspectives and different practical approaches can coordinate their activism, avoid zero-sum conflicts and instead focus on mutually supportive objectives.
2. Reducing Risks of Astronomical Suffering: A Neglected Priority
Fehige, C. (1998). A pareto principle for possible people. In Fehige, C. and Wessels U. (Eds.), Preferences (pp. 508-43). Berlin: Walter de Gruyter.
Gloor, L. (2017). Tranquilism. Center on Long-Term Risk.
Greaves, H. (2017). Population axiology. Philosophy Compass, 12:e12442. https://doi.org/10.1111/phc3.12442
Holtug, N. (1999). Utility, priority and possible people. Utilitas, 11(01), 16-36.
Knutsson, S. & Brülde, B. (2016). Promoting goods or reducing bads. Manuscript in preparation.
Narveson, J. (1973). Moral problems of population. The Monist, 62-86.
Norcross, A. (2009). Two dogmas of deontology: Aggregation, rights, and the separateness of persons. Social Philosophy and Policy, 26(01), 76-95.
Parfit, D. (1984). Reasons and persons. Oxford University Press.
Singer, P. (1993). Practical ethics (2nd ed.). Cambridge: Cambridge University Press.
Strodach, G. K. (Ed.). (1963). The Philosophy of Epicurus: Letters, doctrines, and parallel passages from Lucretius. Northwestern University Press.
Tännsjö, T. (2004). Why we ought to accept the repugnant conclusion. In Tännsjö, T. & Ryberg, J. (Eds.), The Repugnant Conclusion. Library of Ethics and Applied Philosophy, vol. 15. Dordrecht: Springer.
The last sentence may seem contradictory to modern ears, but Epicurus goes on to explain his idiosyncratic use of the term “pleasure:”
“Thus when I say that pleasure is the goal of living I do not mean the pleasures of libertines or the pleasures inherent in positive enjoyment, as is supposed by certain persons who are ignorant of our doctrine or who are not in agreement with it or who interpret it perversely. I mean, on the contrary, the pleasure that consists in freedom from bodily pain and mental agitation.”
The post The Case for Suffering-Focused Ethics appeared first on Center on Long-Term Risk.
The post How the Simulation Argument Dampens Future Fanaticism appeared first on Center on Long-Term Risk.
Some effective altruists assume that most of the expected impact of our actions comes from how we influence the very long-term future of Earth-originating intelligence over the coming ~billions of years. According to this view, helping humans and animals in the short term matters, but it mainly only matters via effects on far-future outcomes.
There are a number of heuristic reasons to be skeptical of the view that the far future astronomically dominates the short term. This piece zooms in on what I see as perhaps the strongest concrete (rather than heuristic) argument why short-term impacts may matter a lot more than is naively assumed. In particular, there's a non-trivial chance that most of the copies of ourselves are instantiated in relatively short-lived simulations run by superintelligent civilizations, and if so, when we act to help others in the short run, our good deeds are duplicated many times over. Notably, this reasoning dramatically upshifts the relative importance of short-term helping even if there's only a small chance that Nick Bostrom's basic simulation argument is correct.
My thesis doesn't prove that short-term helping is more important than targeting the far future, and indeed, a plausible rough calculation suggests that targeting the far future is still several orders of magnitude more important. But my argument does leave open uncertainty regarding the short-term-vs.-far-future question and highlights the value of further research on this matter.
The question is whether one can get more value from controlling structures that — in an astronomical-sized universe — are likely to exist many times, than from an extremely small probability of controlling the whole thing.
--steven0461
One of the ideas that's well accepted within the effective-altruism community but rare in the larger world is the immense importance of the far-future effects of our actions. Of course, many environmentalists are concerned about the future of Earth, and people in past generations have started projects that would not finish in their lifetimes. But it's rare for in-the-trenches altruists, rather than just science-fiction authors and cosmologists, to consider the effects of their actions on sentient beings that will exist billions of years from now.
Future focus is extremely important, but it can at times be exaggerated. It's sometimes thought that the far future is so important that the short-term effects of our actions on the welfare of organisms alive today are completely negligible by comparison, except for instrumental reasons insofar as short-term actions influence far-future outcomes. I call this "far-future fanaticism", in analogy with the "fanaticism problem" discussed in Nick Bostrom's "Infinite Ethics" (sec. 4.3). I probably believed something along these lines from ~2006 to ~2013.
However, as with almost everything else in life, the complete picture is more complicated. We should be extremely suspicious of any simple argument which claims that one action is, say, 10^30 times more important than another action, e.g., that influencing the far future is 10^30 times more important than influencing the near term. Maybe that's true, but reality is often complex, and extraordinary claims of that type should not be accepted hastily. This is one of several reasons we should maintain modesty about whether working to influence the far future is vastly better than working to improve the wellbeing of organisms in the nearer term.
Dylan Matthews, like many others, has expressed skepticism about far-future fanaticism on the grounds that it smells of Pascal's mugging. I think far-future fanaticism is a pretty mild form of (mugger-less) Pascal's mugging, since the future fanatic's claim is vastly more probable a priori than the Pascal-mugger's claim. Still, Pascal's mugging comes in degrees, and lessons from one instance should transfer to others.1
The most popular resolution of Pascal's mugging on the original thread was that by Robin Hanson: "People have been talking about assuming that states with many people hurt have a low (prior) probability. It might be more promising to assume that states with many people hurt have a low correlation with what any random person claims to be able to effect."
ArisKatsaris generalized Hanson's idea to "The Law of Visible Impact": "Penalize the prior probability of hypotheses which argue for the existence of high impact events whose consequences nonetheless remain unobserved."
Eliezer Yudkowsky called this a "leverage penalty". However, he goes on to show how a leverage penalty against the possibility of helping, say, a googolplex people can lead you to disbelieve scenarios where you could have huge impact, no matter how much evidence you have, which seems possibly wrong.
In this piece, I don't rely on a general Hansonian leverage penalty. Rather, I use the simulation argument, which resembles the Hansonian leverage penalty in its effects, but it does so organically rather than in a forced way.
Yudkowsky says: "Conceptually, the Hansonian leverage penalty doesn't interact much with the Simulation Hypothesis (SH) at all." However, the two ideas act similarly and have a historical connection. Indeed, Yudkowsky discussed something like the simulation-argument solution to Pascal's mugging after hearing Hanson's idea:
Yes, if you've got 3↑↑↑↑3 people running around they can't all have sole control over each other's existence. So in a scenario where lots and lots of people exist, one has to penalize by a proportional factor the probability that any one person's binary decision can solely control the whole bunch.
Even if the Matrix-claimant says that the 3↑↑↑↑3 minds created will be unlike you, with information that tells them they're powerless, if you're in a generalized scenario where anyone has and uses that kind of power, the vast majority of mind-instantiations are in leaves rather than roots.
The way I understand Yudkowsky's point is that if the universe is big enough to contain 3↑↑↑↑3 people, then for every person who's being mugged by a genuine mugger with control over 3↑↑↑↑3 people, there are probably astronomical numbers of other people who are confronting lying muggers, pranks, hallucinations, dreams, and so on. So across the multiverse, almost all people who get Pascal-mugged can't actually save 3↑↑↑↑3 people, and in fact, the number of people who get fake Pascal-mugged is proportional to 3↑↑↑↑3. Hence, the probability of actually being able to help N people is roughly k/N for some constant k, so the expected value of giving in to the mugging remains finite regardless of how big N is.
However, this same kind of reasoning also works for Yudkowsky's "Pascal's Muggle" scenario in which a Matrix Lord opens "up a fiery portal in the sky" to convince a person that the Matrix Lord is telling the truth about a deal to save a googolplex lives for $5. But given that there's a huge amount of computing power in the Matrix Lord's universe, for every one Matrix Lord who lets a single person determine the fate of a googolplex people, there may be tons of Matrix Lords just faking it (whether for the lulz, to test the simulation software, or for some other reason). So the expected number of copies of a person facing a lying Matrix Lord should be proportional to a googolplex, and hence, the probability penalty that the Hansonian prior would have suggested seems roughly vindicated. Yudkowsky makes a similar point:
when it comes to improbability on the order of 1/3↑↑↑3, the prior improbability is inescapable - your sensory experiences can't possibly be that unique - which is assumed to be appropriate because almost-everyone who ever believes they'll be in a position to help 3↑↑↑3 people will in fact be hallucinating. Boltzmann brains should be much more common than people in a unique position to affect 3↑↑↑3 others, at least if the causal graphs are finite.
ArisKatsaris complains that Hanson's principle "seems to treat the concept of 'person' as ontologically fundamental", the way that other instances of Nick Bostrom-style anthropic reasoning do. But, with the simulation-argument approach, you can avoid this problem by just talking about exact copies of yourself, where a "copy" means "a physical structure whose high-level decision-making algorithms exactly mirror your own, such that what you decide to do, it also decides to do". A copy needn't (and in general doesn't) share your full environment, just your current sensory inputs and behavioral outputs for some (possibly short) length of time. Then Yudkowsky's argument is that almost all copies of you are confronting fake or imagined muggers.
We can apply the simulation anti-mugging argument to future fanaticism. Rather than being the sole person out of 3↑↑↑↑3 people to control the actions of the mugger, we on Earth in the coming centuries are, perhaps, the sole tens of billions of people to control the far future of Earth-originating intelligence, which might involve ~10^52 people, to use the Bostrom estimate quoted in Matthews's article. For every one biological human on the real Earth, there may be tons of simulated humans on simulated Earths, so most of our copies probably "are in leaves rather than roots", to use Yudkowsky's terminology.
Even if Earth-originating intelligence specifically doesn't run ancestor simulations, other civilizations may run simulations, such as when studying the origin of life on various planets, and we might be in some of those simulations. This is similar to how, even though a real Pascal-mugger might specify that all of the 3↑↑↑↑3 people that she will create will never think they're being Pascal-mugged, in the multiverse at large, there should be lots more people in various other circumstances who are fake Pascal-mugged.
Yudkowsky acknowledges the simulation possibility and its implications for future fanaticism:
If we don't take everything at face value, then there might be such things as ancestor simulations, and it might be that your experience of looking around the room is something that happens in 10^20 ancestor simulations for every time that it happens in 'base level' reality. In this case your probable leverage on the future is diluted (though it may be large even post-dilution).
If we think of ourselves as all our copies rather than a particular cluster of cells or transistors, then the simulation hypothesis doesn't decrease our probable leverage but actually increases it, especially the leverage from short-term actions, as is discussed below.
I first began thinking about this topic due to a post by Pablo Stafforini:
if you think there is a chance that posthumanity will run ancestor simulations [...], the prospect of human extinction is much less serious than you thought it was.
Since I'm a negative utilitarian, I would probably prefer for space not to be colonized, but Stafforini's point also has relevance for efforts to reduce the badness of the far future, not just efforts to prevent human extinction.
Robin Hanson makes a similar point:
if not many simulations last through all of human history, then the chance that your world will end soon is higher than it would be if you were not living in a simulation. So all else equal you should care less about the future of yourself and of humanity, and live more for today. This remains true even if you are highly uncertain of exactly how long the typical simulation lasts.
One response is to bite the simulation bullet and just focus on scenarios where we are in fact in basement-level reality, since if we are, we can still have a huge impact: "Michael Vassar - if you think you are Napoleon, and everyone that thinks this way is in a mental institution, you should still act like Napoleon, because if you are, your actions matter a lot."
A second response is to realize that actions focused on helping in the short term may be relatively more important than the future fanatic thought. Most simulations are probably short-lived, because one can run lots of short-lived simulations with the same computing resources as it takes to run a single long-lived simulation. Hedonic Treader: "Generally speaking, it seems that if you have evidence that your reality may be more short-lived than you thought, this is a good reason to favor the near future over the far future."
Note: This section is a more detailed version of an argument written here. Readers may find that presentation of the calculations simpler to understand.
This section presents a simplified framework for estimating the relative importance of short-term vs. far-future actions in light of the simulation argument. An example of an action targeted for short-term impact is changing ecosystems on Earth in order to reduce wild-animal suffering, such as by converting lawns to gravel. An example of a far-future-focused action is spreading the idea that it's wrong to run detailed simulations of ecosystems (whether for reasons of science, entertainment, or deep ecology) because of the wild-animal suffering they would contain. Of course, both of these actions affect both the short term and the far future, but for purposes of this analysis, I'll pretend that gravel lawns only prevent bugs from suffering in the short run, while anti-nature-simulation meme-spreading only helps prevent bugs from suffering in the long run. I'm trying to focus just on the targeted impact time horizon, but of course, in reality, even if the future fanatic is right, every short-term action has far-future implications, so no charity is 10^30 times more important than another one.
I'll assume that most of the suffering of the far future will be created by the computations that an advanced civilization would run. Rather than measuring computational capacity in FLOPS or some other conventional performance metric, I'll measure computations by how much sentience they contain in the form of the agents and subroutines that are being computed, with the unit of measurement being what I'll call a "sent". I define "sentience" as "morally relevant complexity of mental life". I compute the moral value (or disvalue) for an agent experiencing an emotion as
moral value = (sentience of the agent) * (how intense the agent would judge the emotion to be relative to evolutionarily/physiologically typical emotions for that agent) * (duration of the experience).
For example, if a human has sentience of 1 sent and a fly has sentience of 0.01 sents, then even if a fly experiences a somewhat more damaging event relative to its utility function, that event may get less moral weight.
Using units of sentience will help make later calculations easier. I'll define 1 sent-year as the amount of complexity-weighted experience of one life-year of a typical biological human. That is, consider the sentience over time experienced in a year by the median biological human on Earth right now. Then, a computational process that has 46 times this much subjective experience has 46 sent-years of computation.2 Computations with a higher density of sentience may have more sents even if they have fewer FLOPS.
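To make the bookkeeping concrete, here is a minimal sketch of these units in code; the sentience and intensity numbers are purely illustrative assumptions, not estimates I endorse:

```python
# A minimal sketch of the moral-value bookkeeping described above.
# The sentience and intensity numbers are illustrative assumptions only.

def moral_value(sentience_sents, relative_intensity, duration_years):
    """(sentience of the agent) * (intensity relative to typical emotions
    for that agent) * (duration of the experience)."""
    return sentience_sents * relative_intensity * duration_years

# A human enduring a typical-intensity bad experience for one day...
human = moral_value(sentience_sents=1.0, relative_intensity=1.0, duration_years=1 / 365)

# ...versus a fly enduring a somewhat worse-than-typical event for the same day.
fly = moral_value(sentience_sents=0.01, relative_intensity=2.0, duration_years=1 / 365)

print(human > fly)  # True: the fly's event gets less weight despite its higher relative intensity

# One sent-year is the complexity-weighted experience of one life-year of a typical
# biological human, so a process with 46 times that much subjective experience
# contains 46 sent-years.
print(46 * 1.0, "sent-years")
```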
Suppose there's a large but finite number C of civilizations that are about to colonize space. (If one insists that the universe is infinite, one can restrict the analysis to some huge but finite subset of the universe, to keep infinities from destroying math.) On average, these civilizations will run computations whose sentience is equivalent to that of N human-years, i.e., a computing capacity of N sent-years. So these civilizations collectively create the equivalent of C * N sent-years.
Some of these minds may be created by agents who want to feel intense emotions by immersing (copies of) themselves in experience-machines or virtual worlds. Also, we have much greater control over the experiences of a programmed digital agent than we do over present-day biological creatures.3 These factors suggest that influencing a life-year experienced by a future human might be many times more altruistically important than influencing a life-year experienced by a present-day human. The future, simulated human might have much higher intensity of experience per unit time, and we may have much greater control over the quality of his experience. Let the multiplicative factor T represent how much more important it is to influence a unit of sentience by the average future digital agent than a present-day biological one for these reasons. T will be in units of moral (dis)value per sent-year. If one thinks that a significant fraction of post-human simulations will be run for reasons of wireheading or intrinsically valuing intense experiences, then T may be much higher than 1, while if one thinks that most simulations would be run for purposes of scientific / historical discovery, then T would be closer to 1. T also counts the intensity and controllability of non-simulation subjective experiences. If a lot of the subjective experience in the far future comes from low-level subroutines that have fairly non-intense experiences, then T might be closer to 1.
Suppose that the amount of sentience on Earth in the near term (say, the next century or two) is some amount E sent-years. And suppose that some fraction fE of this sentience takes the form of human minds, with the rest being animals, other life forms, computers, and so on.
Some far-future simulations may contain just one richly computed mind in an otherwise superficial world. I'll call these "solipsist simulations". Many other simulations may contain several simulated people interacting but in a very limited area and for a short time. I'll neologize the adjective "solipsish" to refer to these simulations, since they're not quite solipsist, but because they have so few people, they're solipsist-ish. Robin Hanson paints the following picture of a solipsish simulation:
Consider, for example, a computer simulation of a party at the turn of the millennium created to allow a particular future guest to participate. This simulation might be planned to last only one night, and at the start be limited to the people in the party building, and perhaps a few people visible from that building. If the future guest decided to leave the party and wander the city, the simulated people at the party might be erased, to be replaced by simulated people that populate the street where the partygoer walks.
In contrast, a non-solipsish simulation is one in which most or all of the people and animals who seem to exist on Earth are actually being simulated to a non-trivial level of detail. (Inanimate matter and outer space may still be simulated with low levels of richness.)
Let fN be the fraction of computations run by advanced civilizations that are non-solipsish simulations of beings who think they're humans on Earth, where computations are measured in sent-years, i.e., fN = (sent-years of all non-solipsish sims who think they're humans on Earth)/(sent-years of all computations that are run in total). And let fC be the fraction of the C civilizations who actually started out as biological humans on Earth (rather than biological aliens).
I and most MIRI researchers have moved on from Bostrom-style anthropic reasoning, but Bostrom anthropics remains well known in the scholarly literature and is useful in many applications, so I'll first explore the implications of the simulation argument in this framework. In particular, I'll use the self-sampling assumption with the reference class of "humans who think they're on pre-colonization Earth". The total number of such humans is a combination of those who actually are biological organisms on Earth:
(number of real Earths) * (human sent-years per real Earth) = (C * fC) * (E * fE)
and those in simulations who think they're on Earth:
(number of advanced-civilization computations) * (fraction comprised of non-solipsish humans who think they're on Earth) = C * N * fN.
Note that Bostrom's strong self-sampling assumption samples randomly from observer-moments, rather than from sent-years, but assuming all humans have basically the same sentience, then sampling from sent-years should give basically the same result as sampling from observer-moments.
Horn #3 of Bostrom's simulation-argument trilemma can be seen by noting that as long as N/E is extremely large (reject horn #1) and fN / (fC * fE) is not correspondingly extremely tiny (reject horn #2), the ratio of simulated to biological humans will be very large:
(non-solipsish simulated human sent-years) / (biological human sent-years)
= (C * N * fN) / (C * fC * E * fE)
= (N/E) * fN / (fC * fE).
If you are sampled randomly from all (non-solipsish) simulated + biological human sent-years, the probability that you are a biological human, Pb, is
Pb = (biological human sent-years) / [(simulated human sent-years) + (biological human sent-years)] = (C * fC * E * fE) / [(C * N * fN) + (C * fC * E * fE)] = (fC * E * fE) / [(N * fN) + (fC * E * fE)].
If we are biological humans, then we're in a position to influence all of the N expected sent-years of computation that lie in our future, which will have, on average, higher intensity and controllability by a factor of T units of moral value per sent-year. On the other hand, it's much harder to reliably influence the far future, because there are so many unknowns and so many intervening steps in the causal chain between what we do now and what happens centuries or gigayears from now. Let D be a discount representing how much harder it is to actually end up helping a being in the far future than in the near term, due to both uncertainty and the muted effects of our actions now on what happens later on.
If we are biological humans, then targeting the far future can affect N expected sent-years with intensity multiple of T, but with discount D, for an expected impact proportional to N * T * D.4 On the other hand, if we target the short term, we can help the sentience currently on Earth, with an impact proportional to E.5
However, actions targeting the far future only matter if there is a far future. In most simulations, the future doesn't extend very far, because simulating a long post-human civilization would be extremely computationally expensive. For example, emulating a planet-sized computer in a simulation would probably require at least a planet-sized computer to run the simulation. As an approximation, let's suppose that actions targeting far-future impact only matter if we're biological humans on an actual Earth. Then the expected impact of far-future actions is proportional to Pb * N * T * D. Let's call this quantity "L" for "long-term impact". In contrast, actions targeting the short term make a difference whether we're simulated or not, as long as the simulation runs for at least a few decades and includes most animals on Earth. So the expected impact of short-term-focused actions is just E. Let's call our expected impact for short-term actions S.
The ratio of these two quantities is L / S = Pb * N * T * D / E.
The following picture shows a cartoon example of the framework I'm using here. I haven't yet defined all the variables that you see in the upper left corner, but they'll be explained soon.
Note that N = 6.5 * E and fN = (3/26) * fE. By inspecting the picture, we can see that Pb should be 1/4, since there's one real Earth and three simulated versions. As hoped, our formula for Pb verifies this:
Pb = (fC * E * fE) / [(N * fN) + (fC * E * fE)] = (1/4 * E * fE) / [(6.5 * E * 3/26 * fE) + (1/4 * E * fE)] = (1/4) / [(6.5 * 3/26) + (1/4)] = 1/4.
And L / S = Pb * N * T * D / E = (1/4) * 6.5 * T * D = 1.6 * T * D.
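For readers who want to check the cartoon numbers themselves, here is a minimal calculation using the formulas above. The specific values fC = 1/4, N = 6.5 * E, and fN = (3/26) * fE come from the picture; E and fE are arbitrary placeholders because they cancel out:

```python
# A minimal numeric check of the cartoon example. The values fC = 1/4, N = 6.5*E,
# and fN = (3/26)*fE come from the picture; E and fE cancel, so any positive values work.

E, fE = 1.0, 0.5      # placeholder values (assumptions; they drop out of Pb)
fC = 1 / 4            # one of the four almost-space-colonizing civilizations is Earth
N = 6.5 * E           # average far-future computation per civilization, in sent-years
fN = (3 / 26) * fE    # fraction of computation that is non-solipsish Earth simulations

Pb = (fC * E * fE) / (N * fN + fC * E * fE)
print(Pb)             # 0.25: one real Earth out of four Earth-copies, as expected

T, D = 1.0, 1.0       # left symbolic in the text; L/S scales linearly in both
print(Pb * N * T * D / E)  # 1.625, which the text rounds to 1.6
```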
Note that in the actual picture, Earth has 8 squares of far-future computation ahead of it, but N/E is only 6.5. That's because N/E is an average across civilizations, including some that go extinct before colonizing space. But an average like this seems appropriate for our situation, because we don't know ex ante whether humanity will go extinct or how big humanity's computing resources will be compared with those of other civilizations.
Now I'll redo the calculation using a framework that doesn't rely on the self-sampling assumption. Rather, it takes inspiration from anthropic decision theory. You should think of yourself as all your copies at once. Rather than thinking that you're a single one of your copies that might be biological or might be simulated, you should think of yourself as both biological and simulated, since your choices affect both biological and simulated copies of you. The interesting question is what the ratio is of simulated to biological copies of you.
When there are more total copies of Earth (whether biological or simulated), there will be more copies of you. In particular, suppose that some constant fraction fy of all non-solipsish human sent-years (whether biological or simulated) are copies of you. This should generally be roughly the case, because a non-solipsish simulation of Earth-in-the-year-2016 should have ~7 billion humans in it, one of whom is you.
Then the expected number of biological copies (actually, copy life-years) of you will be fy * C * fC * E * fE, and the expected number of simulated copy life-years will be fy * C * N * fN.6
Now suppose you take an action to improve the far future. All of your copies, both simulated and biological, take this action, although it only ends up mattering for the biological copies, since only they have a very long-term future. For each biological copy, the expected value of the action is proportional to N * T * D, as discussed in the previous subsection. So the total value of having all your copies take the far-future-targeting action is proportional to
L = (number of biological copies of you) * (expected value per copy) = (fy * C * fC * E * fE) * (N * T * D).
In contrast, consider taking an action to help in the short run. This helps whether you're biological or non-solipsishly simulated. The expected value of the action for each copy is proportional to E, so the total value across all copies is proportional to
S = (number of biological + non-solipsish simulated copies of you) * (expected value per copy) = (fy * C * fC * E * fE + fy * C * N * fN) * E.
Then we have
L / S = [ (fy * C * fC * E * fE) * (N * T * D) ] / [ (fy * C * fC * E * fE + fy * C * N * fN) * E ].
Interestingly, this exactly equals Pb * N * T * D / E, the same ratio of far-future vs. short-term expected values that we calculated using the self-sampling assumption.
Simplifying the L/S expression above:
L/S = [N * T * D / E] * (fC * E * fE) / [(fC * E * fE) + (N * fN)] = T * D * fC / (fC * E/N + fN/fE).
Note that this ratio is strictly less than T * D * fC / (fN/fE), which is a quantity that doesn't depend on N. Hence, we can't make L/S arbitrarily big just by making N arbitrarily big.
Let fX be the average fraction of superintelligent computations devoted to non-solipsishly simulating the development of any almost-space-colonizing civilization that actually exists in biological form, not just humans on Earth. fN is the fraction of computations devoted to simulating humans on Earth in particular. If we make the simplifying assumption that the fraction of simulations of humans on Earth run by the collection of all superintelligences will be proportional to the fraction of humans out of all civilizations in the universe, then fN = fX * fC. This would be true if, for example, simulators had no special interest in simulating humans on Earth as opposed to other almost-space-colonizing civilizations.
Making this assumption, we have
L/S = T * D * fC / (fC * E/N + fX * fC/fE)
= T * D / (E/N + fX/fE).
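As a sanity check on the algebra in this section, the following sketch verifies symbolically (using Python's sympy) that the anthropic-decision-theory ratio equals Pb * N * T * D / E and reduces to the expressions above. Variable names mirror those in the text:

```python
# A symbolic check (using sympy) of the algebra in this section.
import sympy as sp

fy, C, fC, E, fE, N, fN, T, D, fX = sp.symbols('f_y C f_C E f_E N f_N T D f_X', positive=True)

L = (fy * C * fC * E * fE) * (N * T * D)          # value of far-future-targeted action
S = (fy * C * fC * E * fE + fy * C * N * fN) * E  # value of short-term-targeted action
Pb = (fC * E * fE) / (N * fN + fC * E * fE)       # self-sampling probability of being biological

# L/S equals the ratio derived under the self-sampling assumption...
assert sp.simplify(L / S - Pb * N * T * D / E) == 0
# ...simplifies to T*D*fC / (fC*E/N + fN/fE)...
assert sp.simplify(L / S - T * D * fC / (fC * E / N + fN / fE)) == 0
# ...and, with the assumption fN = fX * fC, becomes T*D / (E/N + fX/fE).
assert sp.simplify((L / S).subs(fN, fX * fC) - T * D / (E / N + fX / fE)) == 0
```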
Non-solipsish simulations of the dominant intelligences on almost-space-colonizing planets also include the (terrestrial or extraterrestrial) wild animals on the same planets. Assuming that the ratio of (dominant-intelligence biological sent-years)/(all biological sent-years) on the typical almost-space-colonizing planet is approximately fE, then fX / fE would approximately equal the fraction of all computational sent-years spent non-solipsishly simulating almost-space-colonizing ancestral planets (both the most intelligent and also less intelligent creatures on those planets). I'll call this fraction simply F. Then
L/S = T * D / (E/N + F).
Visualized using the picture from before, fN/fE is the fraction of squares with Earths in them, and F is the fraction of squares with any planet in them.
Everyone agrees that E/N is very small, perhaps less than 10^-30 or something, because the far future could contain astronomical amounts of sentience. If F is not nearly as small (and I would guess that it's not), then we can approximate L/S as T * D / F.
Now that we have an expression for L/S, we'd like to know whether it's vastly greater than 1 (in which case the far-future fanatics are right), vastly less than 1 (in which case we should plausibly help beings in the short run), or somewhere in the ballpark of 1 (in which case the issue isn't clear and needs more investigation). To do this, we need to plug in some parameters.
Here, I'll plug in point estimates of T, D, and F, but doing this doesn't account for uncertainty in their values. Formally, we should take the full expected value of L with respect to the probability distributions of T and D, and divide it by the full expected value of S with respect to the probability distribution for F. I'm avoiding that because it's complicated to make up complete probability distributions for these variables, but I'm trying to set my point estimates closer to the variables' expected values than to their median values. Our median estimates of T, D, and F are probably fairly different from the expected values, since extreme values may dominate the expected-value calculations. For this reason, I've generally set the parameter point estimates higher than I actually think is reasonable as a median estimate. And of course, your own estimates may be pretty different.
D = 10^-3
This is because (a) it's harder to know if a given action now will actually have a good impact in the long term than it is to know that a given action will have a good impact in the short term and (b) while a single altruist in the developed world can exert more than a ~1/(7 billion) influence on all the sentience on Earth right now (such as by changing the amount of wilderness that exists), a single person may exert less than that amount of influence on the sentience of the far future, because there will be generations after us who may have different values and may override our decisions.
In particular, for point (a), I'm assuming a ~0.1 probability discount, because, for example, while it's not implausible to be 75% confident that a certain action will reduce short-run wild-animal populations (with a 25% chance of increasing them, giving a probability discount of 75% - 25% = 50%), on many far-future questions, my confidence of making a positive rather than negative impact is more like 53% (for a probability discount of 53% - 47% = 6%, which is about 10 times smaller than 50%).
For point (b), I'm using a ~0.01 probability discount because there may be generations ahead of us before the emergence of artificial general intelligence (AGI), and even once AGI arrives, it's not clear that the values of previous humans will translate into the values of the AGI, nor that the AGI will accomplish goal preservation without further mutation of those values. Maybe goal preservation is very difficult to implement or is strategically disfavored by a self-improvement race against aliens, so that the changes to the values and trajectory of AGI we work toward now will be overridden thousands or millions of years later. (Non-negative utilitarians who consider preventing human extinction to be important may not discount as much here because preventing extinction doesn't have the same risk of goal/institutional/societal drift as trying to change the future's values or general trajectory does.)
T = 10^4
Some simulations run by superintelligences will probably have extremely intense emotions, but many (especially those run for scientific accuracy) will not. Even if only an expected 0.01% of the far future's sent-years consist of simulations that are 10^8 times as intense per sent-year as average experiences on Earth, we would still have T ≈ 10^4.
F = 10^-6
It's very unclear how many simulations of almost-space-colonizing planets superintelligences would run. The fraction of all computing resources spent on this might be close to 100% or might be below 10^-15. It's hard to predict resource allocation by advanced civilizations. But I set this parameter based on assuming that ~10^-4 of sent-years will go toward ancestor simulations of some sort (this is probably too high, but it's biased upward in expectation, since, e.g., maybe there's a 0.05% chance that post-humans devote 20% of sent-years to ancestor simulations), and only 1% of those simulations will be of the almost-space-colonizing period (since there might also be many simulations of the origin of life, prehistory, and the early years after a planet's "singularity"). If we think that simulations contain more sentience per petaflop of computation than do other number-crunching calculations, then 10^-4 of sent-years devoted to ancestor simulations of some kind may mean less than 10^-4 of all raw petaflops devoted to such simulations.
Calculation using point estimates
Using these inputs, we have
L/S ≈ T * D / F = 10^4 * 10^-3 / 10^-6 = 10^7.
This happens to be bigger than 1, which suggests that targeting the far future is still ~10 million times better than targeting the short term. But this calculation could have come out as less than 1 using other possible inputs. Combined with general model uncertainty, it seems premature to conclude that far-future-focused actions dominate short-term helping. It's likely that the far future will still dominate after more thorough analysis, but by much less than a naive future fanatic would have thought.
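For readers who want to vary the inputs, here is a minimal version of the point-estimate calculation, composing T, D, and F from the sub-assumptions stated above; all of these numbers are rough guesses rather than measurements:

```python
# A minimal version of the point-estimate calculation. All numbers are rough,
# deliberately-high-in-expectation guesses rather than measurements.

# D: ~0.1 discount for sign uncertainty about far-future effects, times ~0.01 for
# dilution by later generations and possible value drift before and after AGI.
D = 0.1 * 0.01        # = 1e-3

# T: even if only 0.01% of far-future sent-years are simulations ~1e8 times as
# intense per sent-year as experience on Earth, T is already ~1e4.
T = 1e-4 * 1e8        # = 1e4

# F: ~1e-4 of sent-years in ancestor simulations of some kind, of which ~1% cover
# the almost-space-colonizing period.
F = 1e-4 * 1e-2       # = 1e-6

E_over_N = 1e-30      # "very small"; negligible next to F for these inputs

print(f"L/S ~= {T * D / (E_over_N + F):.1e}")  # ~1e7
```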
No. My argument works as long as one assigns at least a modest probability (say, 1% or even 0.01%) to the simulation hypothesis being correct.
If one entirely rejects the possibility of simulations of almost-space-colonizing civilizations, then F = 0. In that case, L/S = T * D / (E/N + F) = T * D * N / E, which would be astronomically large because N/E is astronomically large. So if we were certain that F = 0 (or even that F was merely on the order of E/N in size), then we would return to future fanaticism. But we're not certain of this, and our impact doesn't become irrelevant if F > 0. Indeed, the more simulations of us there are, the more impact we have by short-term-targeting actions!
Let's call a situation where F is on the order of E/N in size or smaller the "tiny_F" possibility, and the situation where F is much bigger than E/N the "moderate_F" possibility. The expected value of S, E[S], is
E[S | tiny_F] * P(tiny_F) + E[S | moderate_F] * P(moderate_F)
and similarly for E[L]. While it's true that E[S | tiny_F] is quite small, because in that case we don't have many copies in simulations, E[S | moderate_F] is bigger. Indeed,
E[L] / E[S] = E[L] / { E[S | tiny_F] * P(tiny_F) + E[S | moderate_F] * P(moderate_F) }
≤ E[L] / { E[S | moderate_F] * P(moderate_F) }
≈ E[L | moderate_F] / { E[S | moderate_F] * P(moderate_F) },
where the last line assumes that L isn't drastically affected by the value of F. This last expression is very roughly like (L/S) / P(moderate_F), where L/S is computed by plugging in some moderate value of F like I did with my sample numbers above. So unless you think P(moderate_F) is extremely small, the overall E[L]/E[S] ratio won't change dramatically upon considering the possibility of no simulations.
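To illustrate the size of this effect numerically, the sketch below applies the bound above to the earlier point estimate of L/S; the probabilities for moderate_F are made-up assumptions:

```python
# A numeric illustration of the bound above: uncertainty about whether simulations of
# almost-space-colonizing planets exist at all rescales E[L]/E[S] by roughly
# 1 / P(moderate_F). The probabilities below are made-up assumptions.

ls_given_moderate_F = 1e7  # the L/S point estimate computed earlier in the piece

for p_moderate_F in (0.9, 0.5, 0.1):
    print(f"P(moderate_F) = {p_moderate_F}: E[L]/E[S] is roughly at most "
          f"{ls_given_moderate_F / p_moderate_F:.1e}")
# Even at P(moderate_F) = 0.1, the ratio moves by only about one order of magnitude.
```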
I've heard the following defense made of future fanaticism against simulations:
This reply might work if you only consider yourself to be a single one of your copies. But if you correctly realize that your cognitive algorithms determine the choices of all of your copies jointly, then it's no longer true that short-term-focused efforts don't have astronomical impacts, because there are, in expectation, astronomical numbers of simulated copies of you in which your good deeds are replicated.
This objection suggests that horn #1 of Bostrom's trilemma may be true. If almost all technological civilizations fail to colonize space -- whether because they destroy themselves or because space colonization proves infeasible for some reason -- this would indeed dramatically reduce the number of advanced computations that get run, i.e., N would be quite small.
I find this possibility unlikely, since it seems hard to imagine why basically all civilizations would destroy themselves, given that humanity appears like it has a decent shot at colonizing space. Maybe it's more likely that there are physical/technological limitations on massive space colonization.
But if so, then the far future probably matters a lot less than it seems, either because humanity will go extinct before long or because, even if humans do survive, they won't create astronomical numbers of digital minds. Both of these possibilities downplay future fanaticism. Maybe the far future could matter quite a bit more than the present if humanity survives another ~100 million years on Earth, but without artificial general intelligence and robust goal preservation, it seems much harder to ensure that what we do now will have a reliable impact for millions of years to come (except in a few domains, like maybe affecting CO2 emissions).
In the previous argument, I assumed that copies of us that live in simulations don't have far futures ahead of them because their simulations are likely to end within decades, centuries, or millennia. But what if the simulations are very long-lived?
It seems unlikely a simulation could be as long-lived as the basement-level civilization, since it's plausible that simulating X amount of computations in the simulation requires more than X basement computations. But we could still imagine, for example, 2 simulations that are each 1/5 as big as the basement reality. Then aiming for far-future impact in those simulations would still be pretty important, since our copies in the simulations would affect 2 far futures each 1/5 as long as the basement's far future.
Note that my argument's formalism already accounts for this possibility. F is the fraction of far-future computations that simulate almost-space-colonizing planets. Most of the far future is not at the almost-space-colonizing stage but at the space-colonizing stage, so most computations simulating far-future outcomes don't count as part of F. For example, suppose that there's a basement reality that simulates 2 far-future simulations that each run 1/5 as long as the basement universe runs. Suppose that pre-space-colonizing planets occupy only 10^-20 of all sentience in each of those simulations. Ignoring the non-simulation computations also being run, that means F = 10^-20, which is very close to 0. So the objection that the simulations that are run might be very long can be reduced to the objection that F might be extremely close to zero, which I discussed previously. The generic reply is that it seems unreasonable to be confident that F is so close to zero, and it's quite plausible that F is much bigger (e.g., 10^-10, 10^-5, or something like that). If F is bigger, short-term impact is replicated more often and so matters relatively more.
I would expect some distribution of lengths of simulations, perhaps following a power law. If we look at the distribution of lengths of threads/processes that run on present-day computers, or how long companies survive, or almost anything similar, we tend to find a lot of short-lived things and a few long-lived things. I would expect simulations to be similar. It seems unreasonable to think that across all superintelligences in the multiverse, few short-lived simulations are run and the majority of simulations are long.
Another consideration is that if the simulators know the initial conditions they want to test with the simulation, then allowing the simulation to run longer might mean that it increasingly diverges from reality as time goes on and errors accumulate.
Also, if there are long-lived simulations, they might themselves run simulations, and then we might have short-lived copies within those nested simulations. As the number of levels of simulation nesting goes up, the length (and/or computational complexity) of the nested simulations must go down, because less and less computing power is available (just like less and less space is available for the innermost matryoshka dolls).
If the far future was simulated and the number and/or complexity of nested simulations wasn't progressively reduced as the level of nesting increased, then running simulations beyond the point when simulations became feasible would require an explosion of computing power:
The creators of the simulation would likely not continue it past the point in history when the technology to create and run these simulations on a widespread basis was first developed. [...] Another reason is to avoid stacking of simulations, i.e. simulations within simulations, which would inevitably at some point overload the base machine on which all of the simulations are running, thereby causing all of the worlds to disappear. This is illustrated by the fact that, as Seth Lloyd of MIT has noted in his recent book, Programming the Universe, if every single elementary particle in the real universe were devoted to quantum computation, it would be able to perform 10^122 operations per second on 10^92 bits of information. In a stacked simulation scenario, where 10^6 simulations are progressively stacked, after only 16 generations, the number of simulations would exceed by a factor of 10^4 the total number of bits of information available for computation in the real universe.
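As a quick check of the arithmetic in this quoted passage, using Lloyd's figures as given:

```python
# A quick check of the arithmetic in the quoted passage, using Lloyd's figures as given.

sims_per_level = 1e6     # 10^6 simulations stacked at each level
generations = 16
total_sims = sims_per_level ** generations   # (10^6)^16 = 10^96

bits_available = 1e92    # Lloyd's estimate of the bits available for computation
print(total_sims / bits_available)           # ~1e4, matching "a factor of 10^4"
```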
The period when a civilization is almost ready to colonize space seems particularly interesting for simulators to explore, since it crucially affects how the far future unfolds. So it would make sense that there would be more simulations of the period around now than there would be of the future 1 million years from now, and many of the simulations of the 21st century would be relatively short.
Beyond these qualitative arguments, we can make a quantitative argument as to why the far future within simulations shouldn't dominate: A civilization with N sent-years of computing power in its far future can't produce more than N sent-years of simulated far-future sentience, even if it only ran simulations and had no simulation overhead (i.e., a single planet-sized simulated computer could be simulated with only a single planet-sized real computer). More likely, a civilization with N sent-years of computing power would only run like N/100 sent-years of simulated far-future sentience, or something like that, since probably it would also want to compute things besides simulations. So what's at stake with influencing the "real" far future is probably much bigger than what's at stake influencing the simulated far future. (Of course, simulated far futures could be bigger if we exist in the simulations of aliens, not just our own civilization. But unless we in particular are extremely popular simulation targets, which seems unlikely a priori, then in general, across the multiverse, the total simulated far futures that we control should be less than the total real far futures that we control.) Of course, a similar point applies to simulations of short-term futures: The total sent-years in all short-term futures that we control is very likely less than the total sent-years in the far futures we control (assuming we have copies both in simulations and in basement realities). The argument as to why short-term helping might potentially beat long-term helping comes from our greater ability to affect the short term and know that we're making a positive rather than negative short-term impact. Without the D probability penalty for far-future actions, it would be clear that L > S within my framework.
What if the basement universe has unbounded computing power and thus has no limitations on how long simulations can be? And what if simulations run extremely quickly, so there's no reason not to run a whole simulated universe from the big bang until the stars die out? Even then, it's not clear to me that we wouldn't get mostly short-lived simulations, especially if they're being run for reasons of intrinsic value. For every one long-lived simulation, there might be millions or quadrillions of short-lived ones.
However, one could make the argument that if the basement-level simulators are only interested in science, then rather than running short simulations (except when testing their simulation software), they might just run a bunch of long simulations and then look at whatever part of a long simulation is of interest at any given time. Indeed, they might run all possible histories of universes with our laws of physics, and once that complete collection was available to them, they wouldn't need to run any more simulations of universes with our physical laws. Needless to say, this possibility is extremely speculative. Maybe one could argue that it's also extremely important because if this scenario is true, then there are astronomical numbers of copies of us. But there are all kinds of random scenarios in which one can raise the stakes in order to try to make some obscure possibility dominate. That is, after all, the point of the original Pascal's-mugging thought experiment. In contrast, I don't consider the simulation-based argument I'm making in this piece to be a strong instance of Pascal's mugging, because it actually seems reasonably likely that advanced civilizations will run lots of simulations of people on Earth.
In any case, even if it's true that the basement universe has unbounded computing resources and has run simulations of all possible histories of our universe, this doesn't escape my argument. The simulations run by the basement would be long-lived, yes. But those simulations would plausibly contain nested simulations, since the advanced civilizations within those simulations would plausibly want to run their own simulations. Hence, most of our copies would live in the nested simulations (i.e., simulations within simulations), and the argument in this piece would go through like before. The basement simulators would be merely like deist gods who set our universe in motion and then let it run on its own indefinitely.
Even if a copy of you lives in a short-lived simulation, it might have a causal impact well beyond the simulation. Many simulations may be run for reasons of scientific discovery, and by learning things in our world, we might inform our simulators of those things, thereby having a massive impact.
I find this a weak argument for several reasons.
I'm quite confident that I would care about simulated humans. If you don't think you would, then you're also less likely to care about the far future in general, since in many far-future scenarios, especially those that contain the most sentient beings, most intelligence is digital (or, at least, non-biological; it could be analog-computed).
If you think it's a factual rather than a moral question whether simulations are conscious, then you should maintain some not-too-small probability that simulations are conscious and downweight the impact your copies would have in simulations accordingly. As long as your probability of simulations being conscious is not tiny, this shouldn't change the analysis too much.
If you have moral uncertainty about whether simulations matter, the two-envelopes problem comes to haunt you. But it's plausible that the faction of your moral parliament that cares about simulations should get some influence over how you choose to act.
In a post defending the huge importance of the far future, steven0461 anticipates the argument discussed in this piece:
the idea that we’re living in an ancestor simulation. This would imply astronomical waste was illusory: after all, if a substantial fraction of astronomical resources were dedicated toward such simulations, each of them would be able to determine only a small part of what happened to the resources. This would limit returns. It would be interesting to see more analysis of optimal philanthropy given that we’re in a simulation, but it doesn’t seem as if one would want to predicate one’s case on that hypothesis.
But I think we should include simulation considerations as a strong component of the overall analysis. Sure, they're weird, but so is the idea that we can somewhat reliably influence the Virgo-Supercluster-sized computations of a posthuman superintelligence, which is the framework that the more persuasive forms of future fanaticism rely on.
This objection is abstruse but has been mentioned to me once. Some have proposed weighing the moral value of an agent in proportion to the Kolmogorov complexity of locating that agent within the multiverse. For example, it's plausibly easier to locate a biological human on Earth than it is to locate any particular copy of that human in a massive array of post-human simulations. The biological human might be specified as "the 10,481,284,089th human born7 since the year that humans call AD 0, on the planet that started post-human civilization", while the simulated version of that human might be "on planet #5,381,320,108, in compartment #82,201, in simulation #861, the 10,481,284,089th human born since the year that the simulated humans call AD 0". (These are just handwavy illustrations of the point. The actual descriptions would need vastly greater precision. And it's not completely obvious that some of the ideas I wrote with text could be specified compactly.) The shortest program that could locate the simulated person is, presumably, longer than the shortest program that could locate the biological person, so the simulated person (and, probably, the other beings in his simulated world) get less moral weight. Hence, the astronomical value of short-term helping due to the correlated behavior of all of that person's copies is lower than it seems.
However, a view that gives generally lower moral weight to future beings in this way should also give lower moral weight to the other kinds of sentient creatures that may inhabit the far future, especially those that are not distinctive enough to be located easily. So the importance of influencing the far future is also dampened by this moral perspective. It's not obvious and would require some detailed calculation to assess how this location-penalty approach affects the relative importance of short-term vs. far-future helping.
earthwormchuck163: "I'm not really sure that I care about duplicates that much."8 Applied to the simulation hypothesis, this suggests that if there are many copies of you helping other Earthlings across many simulations, since you and the helped Earthlings have the same brain states in the different simulations, those duplicated brain states might not matter more than a single such brain state. In that case, your ability to help tons of copies in simulations via short-term-focused actions would be less important. For concreteness, imagine that there are 1000 copies of you and the people you're helping across 1000 simulations. If you don't think several copies matter morally more than one copy, then the amount of good your short-term helping does will be divided by 1000 relative to a view that cares about each of the 1000 copies.
How about aiming to influence the far future? If all the morally relevant computations in the far future are duplicated about 1000 times, then the value of aiming to influence the far future is also about 1000 times less than what it would be if you cared about each copy individually. However, it's possible that the far future will contain more mind diversity. For example, maybe some civilizations would explicitly aim to make each posthuman mind somewhat unique in order to avoid repetitiveness. In this case, perhaps altruism targeting the far future would appear somewhat more promising than short-term helping if one holds the view that many mind copies only matter as much as one mind.
My main response is that I find it wrong to consider many copies of a brain not much more important than a single brain. This just seems intuitive to me, but it's reinforced by Bostrom's reductio:
if the universe is indeed infinite then on our current best physical theories all possible human brain-states would, with probability one, be instantiated somewhere, independently of what we do. But we should surely reject the view that it follows from this that all ethics that is concerned with the experiential consequences of our actions is void because we cannot cause pain, pleasure, or indeed any experiences at all.
Another reply is to observe that whether a brain counts as a duplicate is a matter of opinion. If I run a given piece of code on my laptop here, and you run it on your laptop on the other side of the world, are the two instances of the software duplicates? Yes in the sense that the high-level logical behavior is the same. No in the sense that they're running on different chunks of physics, at different spatiotemporal locations, in the proximity of different physical objects, etc. Minds have no non-arbitrary boundaries, and the "extended mind" of the software program, including the laptop on which it's running and the user running it, is not identical in the two cases.
Finally, it's plausible that most simulations would have low-level differences between them. It's unlikely that simulations run by two different superintelligent civilizations will be exactly the same down to the level of every simulated neuron or physical object. Rather, I conjecture that there would be lots of random variation in the exact details of the simulation, but assuming your brain is somewhat robust to variations in whether one random neuron fires or not at various times, then several slightly different variations of a simulation can have the same high-level input-output behavior and thus can all be copies of "you" for decision-theoretic purposes. There would presumably also be variations in the simulations run within a single superintelligent civilization, since there's no scientific need to re-run duplicative simulations of the exact same historical trajectory down to the level of every neuron in every person being identical, except maybe for purposes of debugging the simulation or replicating/verifying past scientific findings.
Of course, perhaps the view that "many copies don't count much more than one copy" would say that near copies also don't count much more than one copy. This view is vulnerable to potential reductios, such as the idea that if two identical twins who have had very similar life experiences suffer the same horrible death, it's less bad than if two very different people suffer different but similarly horrible deaths. (Of course, perhaps some philosophers would bite this bullet.)
This is an important and worrying consideration. For example, suppose you aim to prevent wild-animal suffering by reducing habitat and thereby decreasing wildlife populations. If the simulation includes models of the neurons of all animals but doesn't simulate inanimate matter in much detail, then by reducing wildlife numbers, we would save computing resources, which the simulators could use for other things. Worryingly, this might allow simulators to run more total simulations of Earth-like planets, most of the neurons on which are found in invertebrates who have short lives and potentially painful deaths.
If reducing wildlife by 10% allowed simulators to run 10% more total Earth simulations, then habitat reduction would sadly not reduce much suffering.9 But if a nontrivial portion of the computing power of Earth simulations is devoted to not-very-sentient processes like weather, an X% reduction in wild-animal populations reduces the computational cost of the whole simulation by less than X%. Also, especially if the simulations are being run for reasons of science rather than intrinsic value, the simulators may only need to run so many simulations for their purposes, and our making the simulations cheaper wouldn't necessarily cause the simulators to run more.10 The simulators might use those computing resources for other purposes. Assuming those other purposes would, on average, contain less suffering than exists in wilderness simulations, then reducing habitat could still be pretty valuable.
One might ask: If T > 1, then won't the non-Earth-simulation computations that can be run in greater numbers due to saving on habitat computations have a greater density of suffering, not less, than the habitat computations had? Not necessarily, because T gives the intensity of emotions per sent-year. But many of the computations that an advanced civilization would run might not contain much sentience.11 So the intensity of emotions per petaflop-year of non-Earth-simulation computation, rather than per sent-year, might be lower than T. Nonetheless, we should worry that this might not be true, in which case reducing habitat and thereby freeing up computing resources for our simulators would be net bad (at least for negative utilitarians; for classical utilitarians, replacing ecosystems that contain net suffering with other computations that may contain net happiness may be win-win).
It's also worth asking whether reducing net primary productivity on Earth would in fact save simulators' computing power. If the simulation is run in enough detail that invertebrate neurons are approximated, then the simulation may also be run in enough detail that, e.g., soil chemistry, ocean currents, and maybe even photons are also approximated. Even if the soil contains fewer earthworms and bacteria, it may contain just as many clay particles, water pockets, and other phenomena that still need to be modeled for the simulation to be realistic. Groundwater, for example, is a variable that humans monitor extensively, and its dynamics would need to be modeled accurately even if the ground contained no life. Still, much of the dry mass that composes organism bodies comes from the atmosphere (in the form of carbon dioxide), and it's not obvious to me whether an accurate Earth simulation would still need to model individual carbon-based molecules if they weren't captured by biological organisms. Nonetheless, these considerations about abiotic environmental factors suggest that in accurate simulations, possibly almost all computation is devoted to non-living physical processes. So, for example, maybe 99% of the computing resources in an Earth simulation model abiotic phenomena, in which case reducing plant productivity by 50% would only reduce the simulation's computational cost by 1% * 50% = 0.5%. This reduction in biological productivity would selectively reduce the most suffering-dense parts of the simulation, and unless the computations run using those computational savings would contain at least some extremely intense suffering, the reduction in biotic productivity would probably still be net good in terms of reducing suffering.
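To make the arithmetic in the previous paragraph explicit, here is a minimal sketch using the illustrative 99%/50% figures from above; both numbers are assumptions, not estimates:

```python
# Illustrative arithmetic for the 99% abiotic / 50% productivity-reduction example above.
abiotic_fraction = 0.99          # assumed share of the simulation's cost spent on non-living processes
biotic_fraction = 1 - abiotic_fraction
biotic_reduction = 0.50          # assumed reduction in biological productivity (e.g., via habitat loss)

total_cost_reduction = biotic_fraction * biotic_reduction
print(f"Compute saved for the simulators: {total_cost_reduction:.1%}")  # about 0.5%
```

Under these assumptions, halving biological productivity removes roughly half of the simulation's biotic (and most suffering-dense) computation while freeing only about 0.5% of the simulators' computing resources.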
It's also possible there are strategies to increase the computing cost of our simulation in ways that, unlike wildlife, don't contain lots of sentience. For example, monitoring deep-underground physical dynamics in more detail might force our simulators to compute those dynamics more carefully, which would waste computing cycles on not-very-sentient processes and reduce the amount of other, possibly suffering-dense computations our simulators could run.
Finally, keep in mind that some ways of reducing suffering, such as more humane slaughter of farm animals, can prevent lots of simulated copies of horrific experiences without appreciably changing how expensive our world is for our simulators to compute.
So far I've been assuming that if there are many copies of us in simulations, there are also a few copies of us in basement reality at various points in the multiverse. However, it's also possible that we're in a simulation that doesn't have a mirror image in basement-level reality. For instance, maybe the laws of physics in our simulated world are different from the basement's laws of physics, and there's no other non-simulated universe in the multiverse that shares our simulated laws of physics. Maybe our world contains miracles that the simulators have introduced. And so on. Insofar as there are scenarios in which we have copies in simulations but not in the basement (except for extremely rare Boltzmann-brain-type copies that may exist in some basement worlds, or extremely low-measure universes in the multiverse where specific miracles are hard-coded into the basement-level laws of physics), this amplifies the value of short-term actions, since we would be able to influence our many simulated copies but wouldn't have many, if any, basement copies who could affect the far future.
On the flip side, it's possible that basically all our copies are in basement-level reality and don't have exact simulated counterparts. One reason this might be the case is that it may simply be too hard to simulate a full person and her entire world in enough detail for the person's choices in the simulation to mirror those of the biological version. For example, maybe computationally intractable quantum effects prove crucial to the high-level dynamics of a human brain, and these are too expensive to mirror in silico.12 The more plausible we find this scenario, the less important short-term actions look. But as we've seen, unless this scenario has probability very close to 1, the ambiguity between whether it's better to focus on the short term or long term remains unresolved.
Even if all simulations were dramatically different from all basement civilizations, as long as some of the simulated creatures thought they were in the basement, the simulation argument would still apply. If most almost-space-colonizing organisms that exist are in simulations, then whatever algorithm your brain is running is most likely instantiated in one of those simulations rather than in a basement universe.
I'm still a bit confused about how to do anthropic reasoning when, due to limited introspection and bounded rationality, you're not sure which algorithm you are among several possible algorithms that exist in different places. But a naive approach would seem to be to apportion even odds among all the algorithms that you might be but can't distinguish between.
For example, suppose there are only two types of algorithms that you might be: (1) biological humans on Earth and (2) simulated humans who think they're on Earth who are all the same as each other but who are different than biological humans. This is illustrated in the following figure, where the B's represent biological humans and the S's represent the simulated humans who all share the same cognitive algorithm as each other.
Given uncertainty between whether you're a B or an S, you apportion 1/2 odds to being either algorithm. If you're a B, you can influence all N expected sent-years of computation in your future, while if you're an S, you can only influence E sent-years, but there are many copies of you. The calculation ends up being the same as in the "Calculation based on all your copies" section above, since
L = (probability you're a B) * (number of biological copies of you) * (expected value per copy) + (probability you're an S) * (no impact for future-focused work because there is no far future in a simulation) = (1/2) * (fy * C * fC * E * fE) * (N * T * D) + (1/2) * 0,
and
S = (probability you're a B) * (number of biological copies of you) * (expected value per copy) + (probability you're an S) * (number of non-solipsish simulated copies of you) * (expected value per copy) = (1/2) * (fy * C * fC * E * fE) * E + (1/2) * (fy * C * N * fN) * E.
L/S turns out to be exactly the same as before, after we cancel the factors of 1/2 in the numerator and denominator.13
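As a sanity check on the claim that the factors of 1/2 cancel, here is a minimal numerical sketch of the L and S formulas above. The parameter values are arbitrary placeholders, and the variable names simply mirror the formulas (their definitions appear earlier in the essay):

```python
# Placeholder values; only the structure of the formulas matters here.
fy, C, fC, E, fE = 0.1, 1e6, 0.5, 1e3, 0.2
N, T, D, fN = 1e12, 2.0, 0.5, 0.01

def L_over_S(p_B, p_S):
    # L: far-future value; S: short-term value, following the formulas above.
    L = p_B * (fy * C * fC * E * fE) * (N * T * D) + p_S * 0
    S = p_B * (fy * C * fC * E * fE) * E + p_S * (fy * C * N * fN) * E
    return L / S

# The ratio is identical whether or not the 1/2 anthropic weights are included,
# because they multiply both numerator and denominator.
print(L_over_S(0.5, 0.5))
print(L_over_S(1.0, 1.0))
```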
Next, suppose that all the simulated copies are different from one another, so that it's no longer the case that what one copy does, the rest do. In this case, there are lots of algorithms that you might be (labelled S_1, S_2, ... in the below figure), and most of them are simulated.
Now the probability that you're biological is just Pb, and the L/S calculation proceeds identically to what was done in the "Calculation using Bostrom-style anthropics and causal decision theory" section above.
So no matter how we slice things, we seem to get the exact same expression for L/S. I haven't checked that this works in all cases, but the finding seems fairly robust.
Since it is harder to vary the simulation detail in role-playing simulations containing real people [i.e., people are particularly expensive to simulate compared with coarse-grained models of inanimate objects], these simulations tend to have some boundaries in space and time at which the simulation ends.
Does consideration of simulations favor solipsist scenarios? In particular, it's possible to run ~7 billion times more simulations in which you are the only mind than simulations containing the world's entire human population. In those superintelligent civilizations where you are run a lot more than average, you have many more copies than normal. So should you be more selfish on this account, since other people (especially distant people whom you don't observe) may not exist?
Maybe slightly. Robin Hanson:
And your motivation to save for retirement, or to help the poor in Ethiopia, might be muted by realizing that in your simulation you will never retire and there is no Ethiopia.
However, we shouldn't give too much weight to solipsist simulations. Maybe there are some superintelligences that simulate just copies of you. But there may also be superintelligences that simulate just copies of other people and not you. Superintelligences that simulate huge numbers of just you are probably rare. In contrast, superintelligences that simulate a diverse range of people, one of which may be you, are probably a lot more common. So you may have many more non-solipsist copies than solipsist copies.
You may also have many solipsish copies, depending on the relative frequency of solipsish vs. non-solipsish simulations. Solipsish simulations that don't simulate (non-pet) animals in much detail can be much cheaper than those that do, so it's possible there are, say, 5 or 20 times as many solipsish simulations that omit animals as ones that contain animals. It's very hard to say exactly, since it depends on the relative usefulness or intrinsic value that various superintelligent simulators place on various degrees of simulation detail and realism. Still, as long as the number of animal-free solipsish simulations isn't many orders of magnitude higher than the number of animal-containing simulations, helping animals is still probably very important.
And the possibility of animal-free solipsish simulations doesn't dramatically upshift the importance of helping developing-world humans relative to helping animals, since in some solipsish simulations, developing-world humans don't exist either.
The possibility of solipsish simulations may be the first ever good justification for giving (slightly) more moral weight to those near to oneself and those one can observe directly.
Jaan Tallinn and Elon Musk both find it likely that they're in a simulation. Ironically, this belief may be more justified for interesting tech millionaires/billionaires than for ordinary people (in the sense that famous/rich people may have more copies than ordinary people do), since it may be both more scientifically useful and more entertaining to simulate powerful people rather than, e.g., African farmers.
So should rich and powerful people be more selfish than average, because they may have more simulated copies than average? Probably not, because powerful people can also make more altruistic impact than average, and at less personal cost to themselves. (Indeed, helping others may make oneself happier in the long run anyway.) It's pretty rare for wealthy humans to experience torture-level suffering (except maybe in some situations at the end of life -- in which case, physician-assisted suicide seems like a good idea), so the amount of moral good to be done by focusing on oneself seems small even if most of one's copies are solipsist.
It may be hard to fake personal interactions with other humans without actually simulating those other humans. So probably at least your friends and family are being simulated too. But the behavior of your acquaintances would be more believable if they also interacted with fully simulated people. Ultimately, it might be easiest just to simulate the whole world all at once rather than simulating pieces and fudging what happens around the edges. I would guess that most simulations requiring a high level of accuracy contain all human minds who exist at any given time on Earth (though not necessarily at past and future times).
Perhaps one could make some argument for the detailed simulation of past humans similar to the argument for detailed simulation of your acquaintances and their acquaintances: in order to have realistic past memories, you must have been simulated in the past, and in order for your past interactions to be realistic, you must have interacted with other finely simulated people in the past. And in order for your parents and grandparents to have realistic memories, they must have interacted with realistic past people, and likewise for their parents and grandparents, and so on. I wonder if there could be a gradual reduction in the fidelity of simulations moving further and further into the past, to the extent that, say, Julius Caesar never substantially existed in the past of most simulation branches that are simulating our present world? Or perhaps Julius Caesar was simulated in great detail once, but then multiple later historical trajectories are simulated from those same initial conditions.
If there are disconnected subgraphs within the world's social network, it's possible there could be a solipsish simulation of just your subgraph, but it's not clear there are many disconnected subgraphs in practice (except for tiny ones, like isolated peoples in the Amazon), and it's not clear why the simulators would choose to only simulate ~99% of the human population instead of 100%.
What about non-human animals? At least pets, farm animals, and macroscopic wildlife would probably need to be simulated for purposes of realism, at least when they're being watched. (Maybe this is the first ever good argument against real-time wildlife monitoring and CCTV in factory farms.) And ecosystem dynamics will be more believable and realistic if all animals are simulated. So we have some reason to suspect that wild animals are simulated as well. However, there's some uncertainty about this; for instance, maybe the simulators can get away with pretty crude simulation of large-scale ecosystem processes like phytoplankton growth and underground decomposition. Or maybe they can use cached results from previous simulations. But an accurate simulation might need to simulate every living cell on the planet, as well as some basic physical features of the Earth's crust.
That said, we should in general expect to have more copies in lower-resolution simulations, since it's possible to run more low-res than high-res simulations.
How significant is the concern that, say, better monitoring of wildlife could significantly increase wild-animal suffering by forcing the simulators to simulate that wildlife in more detail? If most of our copies exist within simulations rather than basement reality, then this concern can't be dismissed out of hand.
The issue seems to hinge on whether a specific act of wildlife monitoring would make the difference to the fineness of the wilderness simulation. Maybe wildlife are already simulated in great detail regardless of how well we monitor them, because those creatures have ecological effects that we will inevitably notice. Conversely, maybe even if we monitor wildlife 24/7 with cameras and movement trackers, the behavior of the monitored creatures will be generated based on cached behavioral patterns or based on relatively simple algorithms, similar to the behavior of sophisticated non-player characters in video games. For wilderness monitoring to increase wild-animal suffering, it would have to be that our simulation is somewhere between those extremes—that the additional amount of monitoring makes the difference between coarse-grained and fine-grained simulations of creatures in nature.
Still, there seems to be some chance that's the case, and the benefits of wilderness monitoring don't necessarily seem huge either. As an example, suppose that there's a 50% chance that wildlife are already simulated in great detail, a 45% chance that wildlife wouldn't need to be simulated in great detail even if humans did more wilderness monitoring, and a 5% chance that greater wilderness monitoring would make the difference between simple simulations and complex simulations of wild animals. Let's ignore the 45% of scenarios on the assumption that the simulated animals are morally trivial in those cases. Suppose that in the 50% of scenarios where wilderness is already simulated in great detail, wildlife monitoring of a given hectare of land allows humans to reduce suffering on that hectare by, say, 10% of its baseline level B. Meanwhile, in the 5% of scenarios where increased monitoring makes the difference between trivial and complex wilderness simulations, wildlife monitoring increases suffering from roughly 0, due to the triviality of the creatures, to (100% - 10%) * B on that hectare. (The "minus 10%" part is because monitoring reduces wild-animal suffering by 10% relative to the baseline B.) Since 50% * 10% * B ≈ 5% * 90% * B, the expected benefit of wildlife monitoring roughly equals the expected cost in this example. I have no idea if these example numbers are reasonable, but at first glance, the concern about increasing suffering via monitoring doesn't seem completely ignorable.
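A minimal sketch of the expected-value comparison in the previous paragraph, using the same illustrative probabilities (all of which are assumptions rather than estimates):

```python
B = 1.0                        # baseline suffering per hectare when wildlife is simulated in detail
p_already_detailed = 0.50      # wildlife is simulated in detail regardless of monitoring
p_monitoring_tips  = 0.05      # monitoring tips the simulation from trivial to detailed
reduction_from_monitoring = 0.10 * B   # monitoring lets humans reduce suffering by 10% of baseline

expected_benefit = p_already_detailed * reduction_from_monitoring      # 0.05 * B
expected_cost = p_monitoring_tips * (B - reduction_from_monitoring)    # 0.045 * B
print(expected_benefit, expected_cost)  # roughly equal, as stated above
```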
The following figure illustrates some general trends that we might expect to find regarding the number of copies we have of various sorts. Altruistic impact is highest when we focus on the level of solipsishness where the product of the two curves is highest. The main point of this essay is that where that maximum occurs is not obvious. Note that this graph can make sense even if you give the simulation hypothesis low probability, since you can convert "number of copies of you" into "expected number of copies of you", i.e., (number of copies of you if simulations are common) * (probability simulations are common).
If it turns out that solipsish simulations are pretty inaccurate and so can't reproduce the input-output behavior that your brain has in more realistic worlds, then you won't have copies at all levels of detail along the solipsish spectrum, but you should still have uncertainty about whether your algorithm is instantiated in a more or less long-lived high-resolution simulation, or not in a simulation at all.
In this piece, I've been assuming that most of the suffering in the far future that we might reduce would take the form of intelligent computational agents run by superintelligences. The more computing power these superintelligences have, the more sentient minds they'll create, and the more simulations of humans on Earth some of them will also create.
But what if most of the impact of actions targeting the future doesn't come from effects on intelligent computations but rather from something else much more significant? One example could be if we considered suffering in fundamental physics to be extremely morally important in aggregate over the long-term future of our light cone. If there's a way to permanently modify the nature of fundamental physics in a way that wouldn't happen naturally (or at least wouldn't happen naturally for googol-scale lengths of time), it might be possible to change the amount of suffering in physics essentially forever (or at least for googol-scale lengths of time), which might swamp all other changes that one could accomplish. No number of mirrored good deeds across tons of simulations could compete (assuming one cares enough about fundamental physics compared with other things).
Another even more implausible scenario in which far-future focus would be astronomically more important than short-term focus is the following. Suppose that advanced civilizations discover ways to run insane amounts of computation -- so much computation that they can simulate all interesting variations of early biological planets that they could ever want to explore with just a tiny fraction of their computing resources. In this case, F could be extremely small because there may be diminishing returns to additional simulations, and the superintelligences instead devote the rest of their enormous computing resources toward other things. However, one counterargument to this scenario is that a tiny fraction of civilizations might intrinsically value running ancestor simulations of their own and/or other civilizations, and in this case, the fraction of all computation devoted to such simulations might not be driven close to zero if obscene amounts of computing power became available. So it seems that F has a lower bound of roughly (computational-power-weighted fraction of civilizations that intrinsically value ancestor simulations) * (fraction of their computing resources spent on such simulations). Intuitively, I would guess that this bound would likely not be smaller than 10^-15 or 10^-20 or something. (For instance, probably at least one person out of humanity's current ~10^10 people would, sadly in my view, intrinsically value accurate ancestor simulations.)
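For concreteness, here is a rough sketch of that lower bound on F; both numeric inputs are guesses used purely for illustration:

```python
# The two factors follow the lower-bound formula in the paragraph above; the values are guesses.
frac_valuing_ancestor_sims = 1e-10   # e.g., ~1 person in 10^10 intrinsically values such simulations
frac_compute_on_such_sims = 1e-5     # assumed share of that civilization's compute spent on them

F_lower_bound = frac_valuing_ancestor_sims * frac_compute_on_such_sims
print(F_lower_bound)   # 1e-15, within the 10^-15 to 10^-20 ballpark mentioned above
```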
This essay has argued that we shouldn't rule out the possibility that short-term-focused actions like reducing wild-animal suffering over the next few decades in terrestrial ecosystems may have astronomical value. However, we can't easily draw conclusions yet, so this essay should not be taken as a blank check to just focus on reducing short-term suffering without further exploration. Indeed, arguments like this wouldn't have been discovered without thinking about the far future.
Until we know more, I personally favor doing a mix of short-term work, far-future work, and meta-level research about questions like this one. However, as this piece suggests, a purely risk-neutral expected-value maximizer might be inclined to favor mostly far-future work, since even in light of the simulation argument, far-future focus tentatively looks to have somewhat higher expected value. The value of information of further research on the decision of whether to focus more on the short term or far future seems quite high.
Carl Shulman inspired several points in this piece and gave extensive feedback on the final version. My thinking has also benefited from discussions with Jonah Sinick, Nick Beckstead, Tobias Baumann, and others.
the phrase "Pascal's Mugging" got completely bastardized to refer to an emotional feeling of being mugged that some people apparently get when a high-stakes charitable proposition is presented to them, regardless of whether it's supposed to have a low probability. This is enough to make me regret having ever invented the term "Pascal's Mugging" in the first place [...].
Of course, influencing the far future does have a lower probability of success than influencing the near term. The difference in probabilities is just relatively small (plausibly within a few orders of magnitude). (back)
Y * (10,000 + 1) = Z * (1000 + 1),
i.e., Z = 9.991 * Y. And the new amount of wild-animal suffering will be only Z * 1000 = 9.991 * Y * 1000 = 9,991 * Y sent-years, rather than 10,000 * Y. (back)
|(percent change in quantity demanded)/(percent change in price)| < 1
|(100 * fq) / (-100 * fp)| < 1
|-1| * |fq / fp| < 1
fq / fp < 1,
where the last line follows because fq and fp are both positive numbers. Finally, note that total suffering is basically (cost per simulation) * (number of simulations), and the new value of this product is
old_cost_per_simulation * (1 - fp) * old_number_of_simulations * (1 + fq)
= old_cost_per_simulation * old_number_of_simulations * (1 + fq - fp - fp * fq),
which is a decrease if fq < fp. QED. (back)
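A quick numerical check of this footnote's argument, with illustrative values chosen so that fq < fp (i.e., inelastic demand):

```python
fp, fq = 0.10, 0.05          # cost per simulation falls 10%; number of simulations rises 5%
old_cost, old_count = 1.0, 100.0

old_total = old_cost * old_count
new_total = old_cost * (1 - fp) * old_count * (1 + fq)
print(new_total < old_total)  # True: total suffering decreases when fq < fp
```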
A brute-force solution to the above difficulties could be to convert an entire planet to resemble Earth, put real bacteria, fungi, plants, animals, and humans on that planet, and fake signals from outer space (a Truman Show approach to simulations), but this would be extremely wasteful of planetary resources (i.e., it would require a whole planet just to run one simulation), so I doubt many advanced civilizations would do it.
Even if simulations can't reproduce the high-level functional behavior of a biological mind, there remains the question of whether some simulations can be made "subjectively indistinguishable" from a biological human brain in the sense that the brain can't tell which kind of algorithm it is, even if the simulation isn't functionally identical to the original biological version. I suspect that this is possible, since the algorithms that we use to reflect on ourselves and our place in the world don't seem beyond the reach of classical computation and indeed may not be insanely complicated. But I suppose it's possible that computationally demanding quantum algorithms are somehow required in this process. (back)
The post How the Simulation Argument Dampens Future Fanaticism appeared first on Center on Long-Term Risk.
The post Identifying Plausible Paths to Impact and their Strategic Implications appeared first on Center on Long-Term Risk.
CLR’s mission is to identify the best intervention(s) for suffering reducers to work on. Figuring out the long-term consequences of our actions is tricky, such that we are often left with significant uncertainty about the value – sometimes even the sign – of a particular intervention. The past rate at which we have uncovered crucial considerations suggests that more research on prioritization is still very valuable and likely to remain so for years. However, in order to not get stuck with research indefinitely, we will have to eventually focus our efforts on an intervention directly targeted at improving the future. Therefore, besides efforts to grow CLR, it is important to already pursue time-sensitive “capacity building” – connecting with more committed altruists, gathering resources, reputation, etc. – in order to get into a position where a clear “path to impact (PTI)” – the concrete intervention that future CLR thinks is more valuable than further research1 – can eventually be tackled with maximum force.
We do not yet know what our PTI(s) will be, which is why it makes sense to pursue a flexible approach focused on movement building and further research. But we should already be in a position to make competent guesses on the matter, and this is important, because depending on our current assessment of plausible PTIs and how likely we are to pursue each of them, we might have reason to already adjust our movement building strategy.
The intent of this document is to sketch the broad categories of PTIs we currently consider likely, in order to then determine the most important subgoals to optimize for in our movement building. Examples include monetary resources, committed altruists, people with talent in AI safety, societal reputation, reputation within the effective-altruist (EA) community, and the timing of all of it (are there haste considerations anywhere?).
We start by sorting all plausible PTIs into logically exhaustive categories. The idea here is that by categorizing them, we make it less likely that we miss something important. One obvious distinction is whether we focus on short-term vs long-term consequences. Then, given that within the long-term branch, most of the expected value comes from outcomes that are somehow about affecting the way AI2 scenarios unfold, we can distinguish four different ways of affecting AI-related outcomes:
(Of course, this categorization is not the only way of looking at it.) It should be noted that some possible interventions in these categories might turn out to be a bad idea to focus on: For instance, decreasing the probability that AI takeoff happens at all would be bad for reasons of cooperation (even if it overall decreased suffering), as a lot of people care strongly about utopian outcomes that require value-aligned AI.
This section is going to list some plausible PTI candidates for each category. The ideas listed are not meant to be conclusive or even particularly promising, but they give an overview of the sorts of interventions we are considering.
The category as a whole: Short-term interventions might become our PTI if we ever place a high likelihood on “doom soon,” e.g. as the explanation for the Fermi paradox, or because we think most of our copies are in short-lived simulations. Another reason for focusing on the short term is if years of research fail to bring more clarity to the uncertainties of the far-future picture. Finally, short-term interventions become appealing if we decide that the general impacts of our decision algorithms throughout the multiverse dominate the specific impacts that our copies have in such a way that short-term actions seem favored.
Plausible, concrete interventions:
The category as a whole: Approaches within this category have to be designed carefully to avoid greatly harming other value systems. Actively decreasing the probability of AI takeoff happening at all is for instance prohibited by considerations about cooperation: Even if it seemed positive, it would be important to find another intervention that is also positive and less opposed to other people’s interests.
Plausible, concrete intervention:
The category as a whole: Working directly on AI safety is unlikely to be our comparative advantage because a lot of people already care about this. Having said that, the problem seems difficult and will likely require a lot of work. AI safety thus might become our PTI if talent constraints cannot be overcome easily by all the funding the cause is expected to receive in the near future, in which case we could e.g. help with the recruiting of talented researchers.
Increasing the probability of uncontrolled AI – in the not-so-likely case that we come to the conclusion that uncontrolled AI in expectation causes less suffering – is prohibited by reasons of cooperation. If it became our view that uncontrolled AI in expectation produces the least suffering, we should still pursue another approach, e.g. some concrete intervention listed under the subsequent category "Improving controlled AI outcomes" or something in the domain of "fail-safe" AI safety.
Plausible, concrete intervention:
The category as a whole: This set of interventions becomes particularly important if we think that controlled AI is worse than uncontrolled AI in expectation, but with a wide range of outcomes that differ in the amount of suffering they contain. And it becomes more important the more likely we consider it that AI will be controlled.
Plausible, concrete interventions:
The category as a whole: Focusing on this category is intriguing because we seem to be the only group who takes AI risks seriously and cares very strongly about the differences in all the scenarios where human values are not implemented. If we think the consideration “focus on your comparative advantages” has a lot of merit, then this could turn out to be our PTI.
Plausible, concrete interventions:
In future docs inspired by this outline, we are going to list the pros and cons of each of the above proposals and then assign rough weightings to them. We will then need to factor in that some of the proposals are more far-fetched than others.
The main way CLR and its parent/partner organization, the Effective Altruism Foundation (EAF), might currently be pursuing the wrong priorities is if there are strong haste considerations that are not given enough weight. AI takeoff represents a hard deadline, after which all our efforts are "graded." If AI comes very soon, attempts that focus on influencing variables that take time, such as value spreading or promoting international cooperation, might count for nothing. Therefore, it seems important to get a good estimate on how strong we should expect AI-related haste considerations to be. Some thoughts:
Getting more clarity on AI timelines and strategic considerations on how to act in situations where the deadline is uncertain seems important.
Based on the considerations in this document, we can draw the following tentative conclusions:
Most of the ideas in this article are not my own; they summarize part of what CLR has been exploring or is planning to explore more in the near future.
Special thanks to Brian Tomasik, Simon Knutsson and David Althaus for helpful comments and suggestions.
If we somehow manage to affect the goals of a singleton-AI, our actions would have a future-shaping impact until the AI either ceases to exist or suffers from a failure of goal-preservation. No matter its goal, a powerful intelligence would instrumentally value self-preservation and goal-preservation, and it would, qua its superior intelligence, be much better at this than humans and human societies ever were or could be. This suggests that focusing on AI-related outcomes makes it possible to predictably affect the future for millions, perhaps even billions of years to come – or in any case for longer than through any other foreseeable means. Moreover, because most possible goals for an AI would imply instrumentally valuing resource accumulation, we should expect singleton-AIs to ambitiously colonize space, rendering the stakes astronomical. Even if the AI in question has a goal unrelated to conscious beings, it might incidentally create suffering in the process of achieving it. Without concern for suffering, even the slightest gains would be worth creating vast amounts of suffering. Unless we can reject, with extraordinarily high confidence, some of the ingredients in this argument (e.g. orthogonality, instrumental convergence, the feasibility of superintelligent AI in the first place), there seems to be no scenario of remotely similar likelihood where our actions now could have a comparable impact on the far future. (back)
The post Bibliography of Suffering-Focused Views appeared first on Center on Long-Term Risk.
Create and keep up-to-date an online bibliography of material that proposes, defends, or argues against suffering-focused views.
Priority: 8/10
A bibliography on a topic makes it easier for others to do research and write papers in the field. For example, it reduces the hurdle of doing research in the field because the researchers do not need to spend as much time finding literature. It also reduces the risk that a researcher does redundant work because they are unaware of a previous publication on the topic.
An interesting bibliography that is updated by a community was established by The Research Group on Neuroethics and Neurophilosophy at the University of Mainz1:
During the last years the Research Group at the University of Mainz has established the first open, centrally governed and supervised Online-bibliography, which is kept complete and up-to-date by the neuroethics community itself – a ‚literature wikiography‘.
Another example is this simpler but useful bibliography on wild animal suffering.
Learn about best practices for creating and maintaining a bibliography in a field. A complication is that ‘suffering-focused’ is not an established term, but rather our term for different views in diverse fields such as population ethics, axiology, and normative ethics.
Web page (or website).
For more sources and context, see
The post Formalizing Preference Utilitarianism in Physical World Models appeared first on Center on Long-Term Risk.
Most ethical work is done at a low level of formality. This makes practical moral questions inaccessible to formal and natural sciences and can lead to misunderstandings in ethical discussion. In this paper, we use Bayesian inference to introduce a formalization of preference utilitarianism in physical world models, specifically cellular automata. Even though our formalization is not immediately applicable, it is a first step in providing ethics and ultimately the question of how to “make the world better” with a formal basis.
Read the full text here.
The post Hedonistic vs. Preference Utilitarianism appeared first on Center on Long-Term Risk.
It's a classic debate among utilitarians: should we care about an organism's happiness and suffering (hedonic wellbeing), or should we ultimately value fulfilling what it wants, whatever that may be (preferences)? In this piece, I discuss intuitions on both sides and explore a hybrid view that gives greater weight to the hedonic subsystems of brains than to other overriding subsystems. I also discuss how seeming infinite preferences against suffering could lead to a negative-leaning utilitarian perspective. While I have strong intuitions on both sides of the dispute, in the end I may side more with idealized-preference utilitarianism. But even if so, there remain many questions, such as Which entities count as agents? How should we weigh them? And how do we assess the relative strengths of their preferences? In using preference utilitarianism to resolve moral disagreements, there's a tension between weighting various sides by power vs. numerosity, paralleling the efficiency vs. equity debate in economics.
Jeremy Bentham's original formulation of utilitarianism was based around happiness and suffering. Later formulations generally moved toward focus on preference fulfilment instead. Kahneman and Sugden (2005) discuss hedonism vs. preferences from the standpoint of psychology.
Economists tend to use preferences because revealed preferences can be measured, and in general, a preference ordering seems more "rigorous" than an arbitrary cardinal numerical assignment for intensities of happiness and suffering. The von Neumann-Morgenstern utility theorem demonstrated that any preference ordering over lotteries satisfying four properties could be represented by maximizing the expected value of a utility function, unique up to a positive affine transformation. This utility function needn't represent the same thing as Bentham's original conception of happiness, although Yew-Kwang Ng argues that it does when finite sensibility is taken into account.
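To make the representation claim concrete, here is one standard way it is usually written (a formulation added here for reference, not taken from the original text): for lotteries p and q over outcomes x_i, a preference relation satisfying completeness, transitivity, continuity, and independence can be represented by a utility function u such that

```latex
p \succsim q \quad\Longleftrightarrow\quad \sum_i p(x_i)\, u(x_i) \;\ge\; \sum_i q(x_i)\, u(x_i),
```

with u unique up to a positive affine transformation (replacing u by a·u + b, with a > 0, represents the same preferences).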
Of course, having these numerical utility functions still doesn't necessarily allow for interpersonal comparisons of utilities. The best economists can do in their concepts of efficiency is talk about Pareto or potential Pareto improvements, but these don't capture all changes that we may wish to say would improve utility. For example, suppose an unempathic billionaire walks past a homeless little girl on the streets. It would not be a Pareto or even potential Pareto improvement for the billionaire to buy the girl a winter coat against the cold, yet most of us would like for the billionaire to do so (unless he has vastly more cost-effective projects to fund instead).
So whether we are hedonistic or preference utilitarians, we may want to make value judgements for interpersonal comparisons that go beyond the "rigorous" preference-oriented framework of economists. What we thereby lose in objectivity, we gain in moral soundness. (Note: Probably there are many attempts to formally ground interpersonal comparisons of von-Neumann-Morgenstern utility, though I'm not aware of the details.)
Most of the time, what an organism prefers for himself is what (he thinks) will make him most happy and least in pain. For me personally, my selfish preferences align with my selfish hedonic desires maybe ~90% of the time. When this is true, the distinction between preferences and hedonic satisfaction may not be crucial, although it could affect some of our other intuitions as discussed below.
There are some cases where the two diverge, such as
In this section I suggest one intuition in favor of preference satisfaction.
Example 1. First consider a universe in which no life exists. There are no feelings, sentiments, or experiences. Only stars and desolate planets fill the void of space. It seems intuitive that nothing matters in this universe. As there are no organisms around to care about anything, ethics does not apply.
Example 2. Consider a second universe that contains exactly one organism, named Chris. In Chris's mind, the only thing that matters is carrying out his ethical obligation to build domino towers. Since this is the only ethical principle that exists, there's some quasi-universal sense in which it's ethical for Chris to stack dominoes.
Example 3. What if we now complicate the situation and consider a universe with two organisms: Chris (from before) and Dorothy? Suppose that Dorothy's only goal is to prevent the construction of domino towers. Thus, Chris can only act in a way that he considers ethical if he abridges Dorothy's ethical belief. The same is true for Dorothy with respect to Chris's ethical belief. How do we resolve the dispute?
Recall that ethics only began to apply in the universe once Chris and Dorothy existed. Suppose Dorothy holds her belief twice as strongly as Chris does. Then, in some sense, Dorothy's belief "exists" twice as much, so the quasi-universal ethical stance is to give Dorothy's belief twice as much consideration. In this particular example, it's best to prevent construction of domino towers.
If we apply the intuition from these examples to any finite number of organisms, all with finitely strong ethical beliefs, the result is preference utilitarianism.
What shall we do with organisms that don't explicitly recognize what they care about? For instance, what if the universe consisted entirely of a single mouse that was in pain? We can suppose for the sake of argument that the mouse doesn't conceive of itself as an abstract organism enduring negative sensations. Presumably the mouse doesn't think, "I wish this pain would stop." But the intuition that motivates our concern for the interests of other beings rests not upon the ability of those beings to explicitly state their wishes -- rather, it comes from an empathetic recognition that those wishes exist and matter. Clearly the mouse's pain is a real event that matters to the mouse, even if the mouse can't articulate that fact. So preference utilitarianism does give consideration to implicit preferences -- whether held by human or non-human animals.
Preference utilitarianism is not the same as libertarianism, because there may be cases in which a person is morally obligated to act against her wishes to better satisfy the wishes of others or potentially even her future self. That said, the preference view does a better job of capturing the sense of individual autonomy than does the happiness view. On the happiness view, one can imagine "dissident emotional primitives being dragged kicking and screaming into the pleasure chambers," but this seems less likely on the preference view.
A main reason I find the preference view plausible is that ultimately what I would want for myself is for my preferences to be satisfied, not always for me to be made happier, so extending the same to others is the nicest way to treat them. In other words, preference utilitarianism is basically the Golden Rule, which is "found in some form in almost every ethical tradition," according to Simon Blackburn's Ethics: A Very Short Introduction (p. 101).
Consider some objections:
The response to "irrational decisions" is simple: Utilitarianism counts the preferences of all organisms, not just those existing right now, so we need to weigh your current self's preference for quick relief against your future selves' preference to not live with AIDS.
Liking vs. wanting is an important consideration. I agree with the intuition that liking should trump wanting, but my guess is that people who want something without liking it would prefer (meta-want) to not want the thing. For instance, drug addicts who crave an additional hit wish they didn't have those cravings. If meta-preferences can override or at least compete strongly with base-level preferences, the problem should usually go away. In cases where it doesn't go away, the situation reduces to one of "perverse preferences."
The remaining three objections I'll discuss in subsequent sections.
Intuitively, when an organism has a preference, it wants the world to be in one state rather than another. For example, an animal in the rain may prefer to be warm and dry rather than wet and cold. Inside its brain, there's a system telling the animal that things would be better if it were inside.
Preferences can also extend beyond the hedonic wellbeing of a person. For example, deep ecologists may prefer that nature is kept untouched, even if no human is around to observe this fact (and, I would add, even if multitudes of animals suffer as a result).
Consider the following:
It seems that when we talk about preferences, we really mean the desires of some sort of agent, especially a conscious agent, rather than an arbitrary system or force of nature. If so, this already suggests some connection between hedonistic and preference utilitarianism: The agents that we count as having preferences tend, especially in our current biological world, also to be agents that have emotional experiences.
If preferences should be imputed mainly to minds that are conscious agents, different people may have different ideas about where to draw the boundaries of what a preference is -- since, indeed, even the boundaries of "conscious" and "agent" are up for dispute. The preference view makes it slightly more plausible that a broader class of agents has preferences than agents to whom we would have attributed pleasure and pain on the hedonistic view, just because it seems like preferences are conceptually simpler sorts of attributes that aren't so narrowly confined to hedonic systems of the type found in animal brains. But exactly how broadly we extend the notion of preference satisfaction is up to our hearts to decide.
As with consciousness in general, these questions are not binary. I might give extremely tiny weight to satisfying a thermostat's preference to have the temperature at 22.5 degrees, but this is so small that it can generally be ignored. Probably better to have ten million thwarted thermostats than one mouse shivering in the cold for 30 seconds.
It may be that micro-scale physical processes exhibit behavior analogous to thermostats, and we might wonder if these would dominate calculations due to their prevalence. This is worth considering, but keep in mind that a digital thermostat is a much more complex system than, say, a covalent bond between atoms. A digital thermostat is not only bigger but includes a small computer, a display, buttons for various settings, and so on. It's plausible that these things add moral weight, just as the extra complexity of animals adds moral weight above a thermostat.
If it seems absurd to give any consideration to thermostats, keep in mind that animals and people can be seen as very complicated thermostats -- using sensors and taking actions to keep themselves in homeostasis. This complexity includes higher-level thoughts, feelings, memories, and so on, and if we choose, we could require some minimum threshold of these characteristics before an agent's preferences counted at all. But if we don't set such a threshold, it's natural to see even the thermostat in your home as having a preference that matters to an extremely tiny degree.
Brains are ensembles of many submodules, which are themselves ensembles of many neurons. Some neurons and submodules push for one action (e.g., go to sleep due to tiredness), while others push for a different action (e.g., stay awake to reply to a comment). The coalition with more supporters wins the election in deciding your action choice. If the election is close, presumably the preference is not very strong compared with a landslide election (e.g., take my hand out of this pot of scalding water).
An interesting question is whether we should count the individual votes in the election separately or just the final outcome. In general this shouldn't much matter, because for example, if the election gave 45% of votes for sleeping and 55% for replying to the comment, the preference for staying up to reply to the comment would be relatively small (55% - 45% = 10%), and satisfying it would matter less than satisfying a landslide election, like removing your hand from hot water. So whether we say the preference is just 10%, or whether we say it's 55% for, 45% against, summing to 10% net votes for, it wouldn't affect our judgement. It's just a matter of whether the aggregation is done in the person's head or by our ethical evaluation.
(Note that the weight of a preference is determined both by the degree of mandate of the winner and the overall size of the populace. The hand-in-hot-water election occurs in a very big country where lots of neurons are sending strong votes, so that election matters a lot even beyond the fact that it had a landslide outcome.)
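A toy sketch of the "neural election" weighting described above, in which a preference's weight depends on both the margin of the vote and the size of the electorate; all numbers are invented for illustration:

```python
def preference_strength(votes_for, votes_against):
    """Weight of a preference: the margin of the 'election' scaled by the size of the electorate."""
    electorate = votes_for + votes_against
    margin = (votes_for - votes_against) / electorate
    return margin * electorate   # equivalently, votes_for - votes_against

# Staying up to reply to a comment: a close election in a small "country".
print(preference_strength(55, 45))        # 10

# Pulling a hand out of scalding water: a landslide in a much bigger electorate.
print(preference_strength(9_000, 1_000))  # 8000
```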
Where this discussion becomes relevant is in the case of "perverse preferences" raised as one objection to preference utilitarianism. Few biological agents exhibit strongly perverse preferences, but they certainly seem possible in principle. For example, imagine an artificial mind where the emotional center produces one output message, and then the sign gets flipped on its way to the motivational center. In this case, if we only look at the final output behavior, we conclude that the preference is to suffer, but if we extend most of our ethical concern to what the electorate actually felt in this "rigged election," we would conclude that the suffering should stop.
This subsystem-level view can also help us see why an organism's verbalized preference output is not necessarily the only measure of its underlying preference. The person may be confused, or trying to conform to social convention, or misinformed, or otherwise introspectively inaccurate. While we have very effective brain systems whose goal is to predict how much we'll like or dislike various experiences, these predictions can be off target, and ultimately, the proof is in the pudding. The brain's response to actually experiencing the event should arguably play the strongest role in our assessment of what a person's preference is about that event, not so much his prediction beforehand or even recollection afterward.
In some sense, the neural-level viewpoint is the hedonist's reply to the preference utilitarian's Golden Rule intuition. The preference utilitarian says, "Treat others how you'd want to be treated, which means respecting their preferences." The hedonistic utilitarian replies: "People are not unified entities. There are multiple 'selves' within an organism with different responses at different times. It's true that some win control to decide behavior, but we should still care about the losers' preferences somewhat as well."
More generally, what preference utilitarianism actually cares about, in most formulations, are idealized preferences -- what the agent would want if it knew more, was wiser, had improved reflective capacity, had more experiences, and so on. Probably most preferences that appear perverse are actually just not idealized. Of course, idealization introduces a host of new issues, because the idealization procedure is not unique and may lead to significantly different outputs depending on how it's done. This is troubling, but if we believe idealization makes sense, it's best if we pick some plausible idealization procedure rather than avoid idealization altogether.
This view of considering brain subsystems and neurons rather than just explicit preferences and actual decisions is a sort of blend between hedonistic and preference utilitarianism: It feels a lot like hedonistic utilitarianism, because the agents whose preferences we're counting are the (mainly but not exclusively) hedonic subcomponents of the decision. Of course, if non-hedonic subsystems did override the hedonic ones, as is sometimes the case even in biological organisms, we might choose to favor the non-hedonic subsystems, depending on how much they seem to be genuine members of the neural electorate vs. how much they appear to be just voter fraud.
But what we gain in concordance between the hedonistic and preference views, we lose in autonomy by individual actors. For instance, suppose we could see that your neurons would, on the whole, accept a trade of being kicked in the knee in return for a trip to the amusement park. However, you feel you don't want to be kicked in the knee, and this would violate your right to refuse harm. Should we force you to be kicked against your will? This is a tricky question. My intuition says "No" because of the "violation of liberty" that's involved, but the flip side is to feel sorry for all those powerless neurons that are losing out on the amazing rides they could be enjoying. I might feel the opposite way if the scenario were inverted: If a person wanted to be kicked in order to get a day at the amusement park, even though the neurons would dislike the kicking more than they would like the roller coasters and Ferris wheels, then I'd be more inclined to say the person should not be allowed to get kicked.
In any event, in many cases outside of the toy example of a torture-wanting pig whose output behavior was distorted from the underlying hedonic sensations, neural votes probably don't diverge that much from people's autonomous choices, and even if they did:
On the Felicifia forum, Hedonic Treader rightly observed that we should err on the side of personal freedom in most cases. Of course, as Michael Bitton pointed out to me, we can also nudge people in better directions by using cognitive psychology to influence their choices without eliminating options.
Consider someone who claims he would not accept even one second of torture in return for eternal bliss. If we take this at face value, it would imply that torture is infinitely worse than happiness for this person. Then if we try to combine this person's utility with that of other people, would his negative infinity on torture swamp everyone else?
One approach is to deny that this person actually has an infinite preference. Perhaps the person is misinformed about how bad the torture would be, and probably he's unfairly discounting the future pleasure moments. Taking a neural-level view, we might say that the aversive reactions to torture are not infinitely more powerful than the positive ones to happiness. Yet the person may still maintain his stance against these allegations.
While I think it's not right to let this single person's preference dominate the non-infinite preferences of others, I do think we should take it somewhat seriously and not simply override it on a neural view. We have to strike a balance between overcoming the irrationalities of explicit preferences versus avoiding neural authoritarianism. In this case, I would probably not treat the preference against one second of torture as infinite but as extremely strong and finite, requiring vast amounts of pleasure to be outweighed. Because few people express the reverse sentiment (that "I would accept infinite durations of headaches and nausea in return for one second of this blissful experience"), the existence of people with this anti-torture sentiment pushes somewhat toward a negative-leaning utilitarian view.
Of course, most people would accept a second of torture in return for eternal bliss (or even just very long bliss), but perhaps if the torture was bad enough, they also would change their minds in that moment. This should be taken seriously. It's also a reason why I think small amounts of very bad suffering are far more serious than lots of mild suffering: We're willing to trade mild suffering for mild pleasure even when enduring the mild suffering, but if the suffering becomes intense enough, we might not accept it in return for any amount of pleasure, at least not in the heat of the moment.
Hedonistic utilitarianism allows for a large degree of flexibility in deciding exactly how much happiness and suffering a given experience entails. For example, negative-leaning utilitarians can set the suffering value of a very painful experience as much more negative than a more positive-leaning utilitarian would.
With preference utilitarianism, the utility assignments are more constrained because they should generally respect the observed preferences of the actor, although there are exceptions discussed above in cases like irrationality, time discounting, epistemic error, or major conflict between the brain's high-level output and low-level hedonic reactions. So, for example, when most people say they're glad to be alive rather than temporarily unconscious, we should generally take this at face value and assume their lives are above zero, at least at that moment.
Of course, there remains plenty of wiggle room for preference utilitarians to make judgment calls in deciding when the exceptions apply, as well as through interpersonal-comparison tradeoffs.
In this piece I've mainly discussed selfish preferences: How an actor feels about her own emotions or other affairs regarding herself, such as whether her honor has been tarnished, whether she has been used as a means to an end, or various other concerns that may be more than immediately hedonic but still self-directed.
What about preferences regarding the wider world? One I mentioned already was deep ecologists' preference (which I do not share) for untouched nature, even if no one is around to see it. Various other moral preferences are of a similar type: Wanting to reduce poverty, increase social tolerance, limit human-rights abuses, reduce wild-animal suffering, and so on. In these cases, the actor does not just care about his own experience but actually cares about something "out there" in the world and would continue to care whether or not he was around to see it and whether or not he could fool himself into thinking his goal had been accomplished. I don't want to just feel like I've reduced expected suffering but rather want to actually reduce expected suffering.
Suppose a grandfather's dying wish is to leave his fortune to his favorite grandson. You're the only person to hear the grandfather make this request, and the default legal outcome is for some of the money to go to you. Is it wrong not to report the grandfather's wish? After all, if you let the default legal outcome happen, you'll be able to donate your share of the inheritance to important charities.
Well, there are a few instrumental reasons why it would be wrong: Lying is almost always a bad idea, and a society in which people lie, even when they feel doing so is right, would likely be worse than our present society. It's generally good to create a culture in which dying wishes are respected, and doing so here contributes to that goal.
But is there any further sense in which not honoring the grandfather's wishes is wrong? After all, he's already dead and can't feel bad about his wishes not being respected. He also had no idea his wishes wouldn't be carried out, because he assumed you were a trustworthy person.
This question is tough. It feels very counterintuitive to suggest that a person's preferences can be violated after he's dead by something that he'll never know about, and yet, if his preference actually referred to a thing in the world happening, and not just to his subjective perceptions, then in this case his preference would be violated.
I do know that for myself, I actually want my preferences about the world to be carried out, and I would regard it as wrong if they weren't. But is this special to my preferences because they're mine, or do my preferences say it's wrong when others' non-self-directed preferences are violated? I incline toward the latter view, because ethics is fundamentally about others, not about myself. However, I'm not completely sure, and people disagree on this point.
If we do take the view that it matters if preferences are actually fulfilled rather than just whether an organism thinks they are, then this makes sense of Peter Singer's stance that involuntarily killing persons is wrong, even if the persons would never realize they had been killed, because doing so violates their actual preferences to keep living. We might also ask whether animals that don't have a sense of themselves existing over time still have implicit preferences against dying; Singer doesn't think so, but if we count implicit preferences in other domains, why not this one? (Note that historically, I have not found painlessly killing animals to be wrong, so with this last remark I was challenging my own assumption rather than advancing a view I hold confidently.)
Another concern with respecting altruistic and not just selfish preferences is double counting: If someone cares about everyone, then helping others is good both for those others and for the person who cares about those others. If everyone cared about everyone else, then helping everyone would be good for an individual mainly through the effects on those other than herself. This seems weird, but maybe that's just because it doesn't describe the situation of our actual world. In practice, especially when we talk about actual rather than stated preferences, most of us devote a large fraction of our caring budget to ourselves.
Suppose you're at a friend's house, and your friend goes away for 15 minutes to take a shower. You're left in the living room, and you see the friend's diary on a shelf. The diary says "Private - Do Not Read," but you're curious, and you think, "It wouldn't hurt anyone to take a peek, right?" Is it wrong to read the diary if you could be sure no one would find out?
A hedonistic act-utilitarian might say reading the diary was okay, if it was really certain no one would find out and if doing so wouldn't have hurt your relationships or future behavior. A hedonistic rule-utilitarian or other meta-utilitarian might object to the harm that such activities tend to cause in general, or even the harm that such a principle would cause to utilitarianism itself. A preference utilitarian who accepts the importance of desires about the external world can object on an even simpler basis: Reading the diary violates your friend's preference even if she never finds out. I visualize these external preferences as being like invisible strings that we step on and break when we violate the wishes of someone not there to witness us doing so.
Most of our preferences involve wanting the configuration of our brain (in particular, our hedonic systems) to be one way rather than another. However, sometimes preferences involve wanting the external world to be one way rather than another (e.g., wanting the diary to be in the state of "not being read by other people"). Is there really a fundamental difference between these two kinds of preferences? The difference seems mainly to be a matter of the level of abstraction at which we interpret the actions and tendencies of a neural system.
Hedonistic utilitarianism has the virtue that it's (relatively) clear when a given hedonic state is engaged, making counting of happiness and suffering (relatively) straightforward. In contrast, a person may have many preferences all at once, most of which aren't being thought about. We might call a preference that's merely latent in someone's connectome a "passive preference", while a desire that's currently being felt can be called an "active preference". Active preferences appear similar to hedonic experiences (and may induce or be induced by hedonic experiences), making them easier to count.
Do active preferences matter more than passive ones? If you have a preference that you never think about (e.g., the preference not to be held upside down by an elephant's trunk), does its satisfaction still count positively to moral value? Given that a person appears to have infinitely many such passive preferences at any given time, if fulfillment of these preferences does matter, how do we count them? Or do most such preferences boil down to a few basic preferences, like not being injured, frightened, etc.?
Many of us feel a qualitative difference between different types of motivational states. The more basic ones seem to be pleasures/pains "of the flesh", often corresponding to clearly physical beneficial/harmful stimuli. We also have feelings "of the spirit" that don't inform us of specific somatic events but rather represent more abstract longings for the world to be different and joy upon seeing positive changes. "Soul hurt" may refer to events beyond our immediate lives -- such as, in my case, the existence of huge amounts of suffering in the universe.
In general, fleshly experiences feel more hedonically toned, while spiritual desires feel more oriented around preferences, but there's certainly overlap on each side. For instance, soul hurt does hurt a little bit hedonically, but not as much as the soul thinks it should compared with fleshly experiences. (Adam Smith: "If he was to lose his little finger to-morrow, he would not sleep to-night; but, provided he never saw them, he will snore with the most profound security over the ruin of a hundred millions of his brethren").
How should a moral valuation trade off people's fleshly experiences versus their spiritual desires?
Consider a mind like that of Mr. Spock: It lacks many ordinary accoutrements of emotion (physiological arousal, quick behavioral changes, etc.), but Spock's goal-directed calculations still embody preferences about how the world should be different. Does Spock then mainly experience soul hurt rather than fleshly hurt? Or does broadcasting of bad, negatively reinforcing news throughout Spock's brain still count as aversive hedonic experience to some degree even if it lacks the other processes that accompany emotions in humans?
Most people have an intuitive sense of what's meant by "raw pleasure" and "raw pain", but actually defining these hedonic experiences is slippery. One plausible definition could be "awareness of reward or punishment signals that trigger reinforcement learning, changes in motivation, evaluative judgments, and so on". This definition is complicated and fuzzy.
It's plausible that motivation should be an important part of hedonic experience. Pain asymbolia "is a condition in which pain is experienced without unpleasantness." This type of pain doesn't seem very morally bad to me, which suggests that the (at least implicit) desire for pain to stop is an important part of what makes pain bad. In other words, a (perhaps merely implicit and low-level) preference for pain to stop seems crucial to the hedonic experience.
In light of this, we might ask whether hedonic experience can be interpreted as a type of preference, perhaps with some extra features as well. In cartoon form, we might picture pleasure as "fulfillment of the preferences of the Freudian id", and suffering as frustration of those preferences. Other parts of us, such as our Freudian superegos, have other preferences, including moral desires. Perhaps when we contrast hedonistic vs. preference utilitarianism, we're mainly contrasting the preferences of the id versus the preferences of the superego?
While it's fashionable to belittle Freud, I think Freud's id/ego/superego distinction remains powerful, and the idea that our minds contain several competing subsystems seems correct. As one example of a more modern theory with similarities to Freud's, Christiano (2017)'s distinction between "cesires" and "desires" is reminiscent of the distinction between the id and the ego+superego.
This section was written in 2013 and is somewhat out of date.
For a while I had a strong instinct towards hedonistic utilitarianism, and when I saw people express preferences that conflicted with it (e.g., refusing Robert Nozick's experience machine), my immediate reaction was to think, "No, you're wrong! You're misassessing the tradeoff." I feel the same way when Toby Ord says he would accept a day of torture for ten years of happy life. I would vehemently refuse this trade myself, but if someone else chose it after careful deliberation, perhaps including trying some torture to see how bad it was, I might let that person do what he wanted.
In some of these cases where I can't believe that people hold the preferences they claim to, there may be biases going on: Discounting the future, factual mistakes, overriding major hedonic subsystems, deferring to social expectations, etc. However, in other cases it may genuinely be the case that other people are wired differently enough from myself that they do rationally prefer something different from what I would choose.
Taking the preference stance requires a higher level of cognitive control over my emotions than the hedonistic stance, because I have to abstract away my empathy from the situation and picture it more as a black-box decision-making process coming to some conclusion. If I look too closely at what that conclusion is, I'm tempted to override it with my own feelings.
Over time I've grown more inclined toward the preference view. It seems more elegant and less arbitrary, it appears to be a universal morality in the sense of resolving competing values and goals, and it encapsulates the Golden Rule, which is probably the most widespread moral principle known to humankind (and maybe even more broadly in the universe, since altruism seems to be an evolutionarily convergent outcome for agents that confront iterated prisoner's dilemmas). At the end of the day, morality is not about what I want; it's about what other people (and organisms in general) want.
Interestingly, Peter Singer -- once a prominent preference utilitarian -- has shifted in the opposite direction. In an episode of the podcast Rationally Speaking, Singer explains that he now aligns closer to the sophisticated hedonistic view of Henry Sidgwick. Singer believes that only consciously experienced events matter, although we should construe hedonic experience more broadly than just raw pleasure and pain.
The preference-utilitarian view can lead to perspectives that most people would find strange. For example, suppose we encountered aliens whose overriding goal was to compute as many digits of pi as possible. If this was a clear preference by these conscious agents, a preference utilitarian would care about them achieving their goal, at least to some degree depending on their moral weight. This may seem preposterous to us, but remember that we're fundamentally no different, and if we were those aliens, we would really care about the digits of pi too. (Some values that other humans care about seem no less absurd to me than wanting to compute digits of pi.)
There's a perhaps even stranger implication of this view: Preference utilitarianism might lead us to care about entities like companies, organizations, and nation-states. These, too, strategically act like agents optimizing a set of goals. Even though they're composed of conscious elements (the people who run them), their own utility functions may not correspond to the conscious desires of anyone in particular.
We can see an interesting parallel between the consumer's utility-maximization problem and the firm's profit-maximization problem. For instance, both consumer utility and firm output are often modeled in economics as Cobb-Douglas functions. This makes sense because, biologically, humans are factories for producing evolutionary fitness, with diminishing marginal product for any given factor holding the others constant.
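To make the functional form concrete, here is a minimal Python sketch (my own illustration, not from the original text) of a Cobb-Douglas function; the exponent alpha = 0.5 and the sample inputs are arbitrary choices. It shows the diminishing marginal product mentioned above: holding one factor fixed, each additional unit of the other factor adds less output.

    def cobb_douglas(x, y, alpha=0.5):
        """Cobb-Douglas utility/output: U = x**alpha * y**(1 - alpha)."""
        return (x ** alpha) * (y ** (1 - alpha))

    def marginal_product_x(x, y, alpha=0.5, eps=1e-6):
        """Numerical marginal product of x, holding y fixed."""
        return (cobb_douglas(x + eps, y, alpha) - cobb_douglas(x, y, alpha)) / eps

    # Marginal product of x shrinks as x grows while y stays at 10:
    for x in [1, 2, 4, 8, 16]:
        print(x, round(marginal_product_x(x, y=10), 3))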
But are we really going to care about corporations apart from the people that comprise them? There's already a lot of backlash against legal corporate personhood, much less ethical personhood. It also seems slightly odd that a corporation could be "double counted" as mattering itself and also by the welfare of the people who constitute it. Of course, if there were a real, conscious China brain, even a hedonistic utilitarian would face double counting in some cases.
Right now I feel like I don't care about corporations or governments separately from the people who comprise them, but I'm not sure if this view would hold up on further reflection. I'm more inclined to care about non-conscious alien agents. Maybe this distinction comes down to being able to imagine myself as the alien better than as the corporation.
After writing this section, I discovered an important essay by Eric Schwitzgebel: "If Materialism Is True, the United States Is Probably Conscious." It provides some nice thought experiments for extending our intuitions about which computations we care about. My response is that asking "Is the United States really conscious?" is a confused question, analogous to "If a tree falls in the forest with no one there to hear it, does it really make a sound?" People have an easier time dissolving the latter question than the former, but they're structurally identical confusions.
Above I've been assuming it's possible for there to exist a sophisticated agent whom we don't consider to have phenomenal experience, according to our understanding of what phenomenal experience is like. Is this even possible? Obviously it depends on how we define the boundaries of phenomenal experience. Here's one argument why our ordinary conceptions of phenomenal experience may encompass any sufficiently intelligent agent.
What is conscious emotion for us? Many people have many theories, and I don't claim to have the answer. But it seems that emotion consists in systems that represent changes in expected reward/punishment, express drives and motivations, and then think, self-reflect, plan, imagine, and synthesize ideas about how to accomplish goals. There are many other bells and whistles that come along as well. It seems a sufficiently intelligent agent would have all of these properties. Maybe the bells and whistles would look different, and the thinking might happen in different ways, but the fundamental process of having drives and exploring how to execute them seems common. If there is more to conscious emotion in animals than what I described, my guess is that what I've left out is like missing pieces in a jigsaw puzzle, rather than a fundamental feature without which all the other characteristics are valueless.
With the example of the United States as an agent, it's not clear whether it meets the criteria for being "sufficiently intelligent" or "agent-like," especially since its boundaries as a unified agent are themselves unclear. For instance, suppose you ask a question of the United States. The answer would be written by some person within the country, using that person's brain. Why should we say the United States as a whole wrote the reply rather than that particular person? If the country somehow composed its reply collectively, we might have a stronger intuition that the United States was acting as an agent.
Alternatively, we might prefer to define phenomenal experience more narrowly, to only encompass agent intelligence constructed in ways more similar to those found in animal brains. For instance, while the United States may have forms of self-reflection (e.g., public-opinion polling) that extend beyond what a single individual does, it's not clear that it has the same kind of emotional self-reflection as I do. The choice of how parochially we want to define consciousness is up to us.
Murray Shanahan presents an interesting argument in Chapters 4-5 of Embodiment and the Inner Life: Cognition and consciousness in the space of possible minds, summarized in his talk at the AGI 2011 conference. He suggests that general intelligence actually requires consciousness. Shanahan's main idea is that responding to completely new situations entails recruiting coalitions of many brain regions, working together in new ways rather than repeating old stereotypes. The most prominent coalitions are broadcast widely, which is what gives rise to consciousness according to global workspace theory, one of the leading theories of consciousness with good empirical support. This jibes with our experience, in which routine, automatic tasks (including even driving a car, walking, or brushing our teeth) can be done without being very conscious of them. Consciousness is required when the context is new, and you need to marshal help from many brain regions at once in an "all hands on deck"1 sort of way in order to respond to a novel challenge as best as possible. This kind of global recruitment of brain components may be common to many cognitive architectures to various degrees.
Ward (2011), p. 465:
the evidence is overwhelming that consciousness is functionally integrative and that this is the dominant fitness advantage it provides for conscious organisms. In other words, the role of this integrative processing is to provide internal representations (‘‘models’’) of the niche-relevant causal structure of the environment, including objects and their surroundings and the events that take place there. Baars (2002) reviews extensive evidence that integrative conscious processing involves more widespread cortical activity than does unconscious processing.
Marvin Minsky views consciousness as coming along with intelligent computation:
I don't [see] consciousness as holding one great, big, wonderful mystery. Instead it's a large collection of useful schemes that enable our resourcefulness. Any machine that can think effectively will need access to descriptions of what it's done recently, and how these relate to its various goals. For example, you'd need these to keep from getting stuck in a loop whenever you fail to solve a problem. You have to remember what you did -- first so you won't just repeat it again, and then so that you can figure out just what went wrong -- and accordingly alter your next attempt.
One review of Minsky's The Society of Mind says: "The question is posed as to whether machines can be intelligent without any emotions. The author seems to be arguing, and plausibly I think, that emotions serve as a defense against competing interests when a goal is set. Emotional responses occur when the most important goal(s) are disrupted by other influences. Intelligent machines then will need to have the many complex checks and balances."
In "Philosophers & Futurists, Catch Up!" Jürgen Schmidhuber offers another account of why consciousness may be convergent within certain human-like learning architectures (pp. 179-180):
we have pretty good ideas where the symbols and self-symbols underlying consciousness and sentience come from (Schmidhuber, 2009a; 2010). They may be viewed as simple by-products of data compression and problem solving. As we interact with the world to achieve goals, we are constructing internal models of the world, predicting and thus partially compressing the data histories we are observing. If the predictor/compressor is an artificial recurrent neural network (RNN), it will create feature hierarchies, lower level neurons corresponding to simple feature detectors similar to those found in human brains, higher layer neurons typically corresponding to more abstract features, but fine-grained where necessary. Like any good compressor the RNN will learn to identify shared regularities among different already existing internal data structures, and generate prototype encodings (across neuron populations) or symbols for frequently occurring observation sub-sequences, to shrink the storage space needed for the whole. Self-symbols may be viewed as a by-product of this, since there is one thing that is involved in all actions and sensory inputs of the agent, namely, the agent itself. To efficiently encode the entire data history, it will profit from creating some sort of internal prototype symbol or code (e. g. a neural activity pattern) representing itself (Schmidhuber, 2009a; 2010). Whenever this representation becomes activated above a certain threshold, say, by activating the corresponding neurons through new incoming sensory inputs or an internal 'search light' or otherwise, the agent could be called self-aware. No need to see this as a mysterious process -- it is just a natural by-product of partially compressing the observation history by efficiently encoding frequent observations.
However, Schmidhuber goes on to suggest that there exist theoretical intelligent agents that are not conscious in any familiar sense of that term:
Note that the mathematically optimal general problem solvers and universal AIs discussed above do not at all require something like an explicit concept of consciousness. This is one more reason to consider consciousness a possible but non-essential by-product of general intelligence, as opposed to a pre-condition.
So maybe we would regard theoretical optimal problem solvers as unconscious. Or maybe we would consider our primate-based notions of conscious agency too narrow and expand our sphere of concern to encompass any sort of powerful intelligence for ethical calculations.
If one holds the hedonistic-utilitarian view and believes that not all preferences matter, then one might encourage creating organisms that are motivated positively by pleasure but negatively by non-hedonic preferences against harm. They would then enjoy the good moments but robotically act to avoid bad moments without "feeling" them. I think this perspective is based on an overly parochial view of what we care about morally, and I think any violated preference is bad to some degree. But maybe I would care slightly more about familiar hedonic suffering, in which case agents of this type might be at least better than the default.
David Pearce proposes the idea of equipping ourselves with robotic prostheses (incorporating a manual override to ensure individual autonomy) that would catch people before they made decisions that would cause harm, such as touching a hot stove or stepping off a cliff. Insofar as such devices would implement prevention against harm (similar to neuronal reflex responses that don't reach the brain) rather than negative reinforcement in response to harm, it's not clear how much they would pose an ethical issue even to a preference utilitarian. That said, I have doubts about the ability of such systems to generalize to many preventative contexts without a higher-level intelligence of their own. Perhaps they could be useful in limited circumstances, in which case they would be a rather natural extension of existing protective devices like seat belts and safety goggles.
Every young boy wants to become a paleontologist when he grows up. Then, as he matures, he realizes that other careers might in fact be more rewarding. When he takes a job other than digging for dinosaur bones, is he violating the preferences of his younger self?
I think the answer is plausibly "no," and the reason is that, as mentioned previously with reference to misinformed preferences, the boy's preference to be a paleontologist lacks insight into what his life will actually be like at a later stage. His stance is more of a prediction: I expect that when I'm older, I'll then want to be a paleontologist. His idealized preference -- to do what makes him happy -- was really the same the whole time, and it's just his assessment of what he would enjoy that changed.2
So we have just one idealized preference -- do what makes me happy -- for both Young Self and Old Self. Does this mean that when we count preferences, this counts just once? In contrast, if Young Self had genuinely different values on an issue that Old Self did not share even upon idealized reflection, each of these would count separately, making two total preferences?
Or take a more extreme example: If a being has a consistent idealized preference over its billion years of life, is satisfying this preference no more intrinsically important than satisfying a similar preference by a being that pops into existence for a millisecond and then disappears immediately after the preference is fulfilled? Shouldn't the extra duration count for something?
Yes, I think extra duration should count. In particular, rather than asking whether a given idealized preference was held, I would count the number of agent-moments for which a given idealized preference was held. After all, in an infinite multiverse, all physically possible idealized preferences will be held with some measure by some random configurations of matter. What we really care about is how many agent-moments hold a given idealized preference. The billion-year-old agent had astronomically more agent-moments holding its preference, so fulfillment of its preference counts astronomically more.
This proposal may actually be counterintuitive. It suggests, for instance, that helping elderly people might matter more than helping young people if the elderly people's preferences (for instance, that they be happy in their old age) were held longer than those of the young people. Or maybe we would anticipate the future preferences of the young people to have lived happily and try to satisfy those in advance.
A further question is whether preferences held over time matter for all agent-moments that would idealize to that preference or only when the agent is thinking about that particular preference. At the very least, it seems like preference strength may vary in time? I always implicitly prefer for all my past, present, and future selves not to be beaten up, but if I were being beaten up, I would really prefer not to be beaten up then.
In general, the project of intertemporal preference satisfaction is tricky! Ordinarily preference utilitarians sweep it under the rug, because the current self has all the power, and past and future selves are at its mercy. Someone who firmly committed last month to exercise 3 times a week might grow lazy and stop caring about what seemed to his past self a crucial New Year's resolution.
Hedonistic utilitarianism avoids the mess that intertemporal preference utilitarianism seems to generate, because it counts hedonic states only at the moments they are actually experienced: there are no latent preferences of past or future selves to weigh against what the organism feels right now.
Could we make preference utilitarianism more like hedonistic utilitarianism in these regards? We could require that only preferences about one's current state count, and then only when one is actively thinking about that preference. This seems to me to go too far towards a hedonistic view, because it doesn't allow the thwarting of preferences like "I want to actually reduce suffering rather than be tricked into thinking I'm reducing suffering" to count as intrinsically bad, over and above the fact that, within the preference-utilitarian framework, not actually reducing suffering is bad.
Some of these questions were touched on in this piece, but my own opinions on them are not fully resolved.
This is a special case of comparing utility across organisms if you think of an organism as a collection of organism-moments. The question is important because people may make foolish decisions that they later regret, or they may heartlessly morally ignore the horrible suffering that their past selves endured because it's a sunk cost to them now. In particular, how do we deal with torture victims who temporarily wished they were dead but may not feel the same in retrospect?
Two plausible ways of approaching this question are (1) degree of consciousness (phenomenal stance) and (2) degree of agency (intentional stance). It seems plausible to me that either of these may qualify an organism for moral consideration. Obviously agency is relevant for strategic compromise; I don't know if I'm inadmissibly giving it intrinsic value as well.
Suppose there were an intelligent but non-conscious agent. Would it be weird to care about it? One of the main reasons humans feel empathy is reciprocal altruism, so it's almost more natural to extend concern to other high-level agents than it is to extend it to low-level hedonic creatures like small animals. However, the small animals benefit from our being better able to put ourselves in their place and feel what they feel.
In The Age of Spiritual Machines, Ray Kurzweil deflects issues about whether robots will be conscious by saying that we'll interact with them, develop personal relationships with them, etc., and this will eventually "convince people that they are conscious." When I first read the book in 2005, I thought this response was inadequate; I thought we needed to know whether these robots would be actually conscious. Now I see a kind of sophisticated wisdom to Kurzweil's point.
In A Cow at My Table, Tom Regan says of veganism:
I think everybody has that capacity to stop and think and say, "If I knew you, I wouldn't eat you."
And in some ways, it really is that simple.
By analogy, once we started getting to know an advanced artificial agent, even a "non-conscious" one, we would begin to sympathize with its dreams and fears.
See the discussion of "perverse preferences" above.
Is creating a new unfulfilled preference bad? Clearly yes. Is creating a new fulfilled preference good? This is less clear to many people, including myself. How do we trade off preference fulfillment vs. frustration? This is not an obvious question because different person-moments may disagree on the tradeoff. As mentioned earlier, someone being tortured sufficiently terribly would not accept any compensation in return for it continuing.
In economics, there's a classic distinction between Pareto efficiency and distributive justice: Pareto transactions can get you to an efficient outcome, but this is sensitive to initial endowments, i.e., how much bargaining leverage various parties have. A person or group with basically no power will get very little in the final efficient allocation. As an Athenian says in the Melian dialogue by Thucydides: "the strong do what they can and the weak suffer what they must".
Our utilitarian intuitions push us toward feeling that everyone's interests should count equally, that we shouldn't give favoritism to some just because they're more mighty or wealthy. On the other hand, these equality intuitions face problems of their own: How do we weigh different brain sizes and types? How do we even count brains? Presumably Chinese citizens and a whole China brain both count? How much each? Where do we draw the boundaries of these minds? How do we compare the strengths of conflicting preferences for two different minds? This becomes very complicated, and one might be tempted to throw up one's hands and say that this equality business is just too hard and arbitrary. The only thing that's not arbitrary is the outcome of bargaining given the universe's chosen initial endowments. Should we just accept that? My intuition says no, because, for instance, it leaves vast numbers of powerless animals and future generations in the dust except insofar as certain powerful humans happen to care about them.
When we encounter moral disagreement, one intuitive response is to say, "Ok, you care about X, I care about Y, so let's adopt a compromise morality of (X+Y)/2." We can then keep doing this across enough people and get basically a preference utilitarianism that resolves moral disagreement. However, the problem is: over what set of individuals do we aggregate these preferences? Do we include animals? China brains? And with what weights? Including these other agents seems more fair in an equality sense, but if we do so, other powerful people tend to object and say that we're inflating the weight of our utilitarian morality vis-a-vis their non-utilitarian morality by including these extra preferences in the calculations. They might push for something closer to efficiency based on status-quo endowments of power.
One example where this comes up is with debates over nature preservation. I claim that reducing primary productivity is probably net good for wild animals in the long run. (This ignores potentially detrimental effects on global stability. Therefore, I'd recommend curbing ecosystems in ways that have low negative or even positive impacts on international cooperation and peace in the long run.) Others claim that the intrinsic value of nature outweighs the pain of individual animals. Just considering these two viewpoints, one might try to adopt a stance somewhere in the middle. However, when we also consider the preferences of all the animals who actually have to be eaten alive in the wild on equal footing with the (weak) personal preferences of the conservationists, the balance would shift to a stance of opposing wilderness so strongly that the pro-wilderness views would be negligible in the calculation. But the pro-wilderness people don't like this; they may object that this framework of dispute resolution is somehow unfairly biased in my favor. Or maybe they would assert that nature itself has some sort of preference not to be perturbed, and this preference is strong enough to outweigh quintillions of suffering animals.
One argument for equality weighting rather than power weighting is a "veil of ignorance" approach: Imagine that you are to become a random agent. Then you'd aim to maximize expected preference fulfillment, with equal weighting across all agents. Of course, this thought experiment ultimately begs the question: Why is the random distribution uniform? It could just as well have been a brain-size-weighted random distribution or a power-weighted random distribution. And it still leaves the messy problem of carving out what physical processes count as agents, what their preferences are, how strong those preferences are relative to one another, etc. The veil of ignorance thus doesn't provide any answers; it just furnishes some boost to our egalitarian intuitions.
There's an extensive literature on fair division, with various procedures for achieving splits of goods among agents with different preferences. An efficient division is not necessarily fair, because as the Wikipedia article notes, giving everything to one agent would be Pareto optimal but not equitable.
If we do equality weighting, do we normalize the utility of each organism to be on the same scale (e.g., between 0 and 100 for the worst and best possible outcomes, respectively)? Or do we weight by some sense of "intensity" of the feelings? A normalization approach is methodologically cleaner and is also more appropriate for game-theoretic calculations, but it intuitively feels like we should weigh by intensity. For instance, suppose one agent's life consists in only two possible outcomes: Either it gets to eat 62 cookies, or it gets to eat 63 cookies. Assuming it prefers the latter, eating 63 cookies would have scaled utility of 100, compared with scaled utility of 0 for eating 62 cookies. Yet compare this with someone who might either enjoy a fulfilling life (100) or be tortured for many days on end and then killed (0). Are we really going to count these on comparable footing in the equality-weighted calculus?
My personal answer here is that in terms of intrinsic value, I would apply intensity weighting. For strategic calculations, game theory dictates using an organism's own utility function, whatever that may be. Actually, with game theory, only relative comparisons matter; normalization just makes calculations easier.
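As a toy illustration of the cookies example (my own construction, with made-up numbers, not the author's), the snippet below contrasts range normalization, which maps each agent's worst and best outcomes to 0 and 100, with intensity weighting, which keeps the raw difference in stakes:

    def normalize_0_100(utilities):
        """Rescale one agent's outcome utilities so worst = 0 and best = 100."""
        lo, hi = min(utilities.values()), max(utilities.values())
        return {outcome: 100 * (u - lo) / (hi - lo) for outcome, u in utilities.items()}

    agent_a = {"62 cookies": 62.0, "63 cookies": 63.0}                # tiny stakes
    agent_b = {"tortured": -1_000_000.0, "fulfilling life": 1_000.0}  # huge stakes

    print(normalize_0_100(agent_a))  # {'62 cookies': 0.0, '63 cookies': 100.0}
    print(normalize_0_100(agent_b))  # {'tortured': 0.0, 'fulfilling life': 100.0}

    # Intensity weighting instead compares the raw gaps between best and worst:
    print(agent_a["63 cookies"] - agent_a["62 cookies"])     # 1.0
    print(agent_b["fulfilling life"] - agent_b["tortured"])  # 1001000.0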
That game theory doesn't care about intensity of emotions was encapsulated nicely by Ken Binmore in Natural Justice (p. 27):
Players can't alter their bargaining power by changing the scale they choose to measure their utility, any more than a physicist can change how warm a room is by switching from degrees Celsius to degrees Fahrenheit.
Hat tip to John Danaher's excellent blog post, "Egalitarian and Utilitarian Social Contracts" for the quote.
Hedonistic utilitarianism values good experiences, i.e., when a brain realizes that it's receiving rewards. The value of pleasure is not intrinsically dependent on external outcomes. Since pleasure itself is valued, creating more organisms to feel happy is good according to regular (non-negative) utilitarianism. (Note: I'm a negative utilitarian.)
What about when it comes to preferences? Does the preference utilitarian value (1) the satisfaction of preferences -- i.e., states of affairs in which organisms' preferences get fulfilled -- or (2) the actual content of what existing organisms prefer?
The first of these cases is similar to hedonistic utilitarianism, in that what's valued is either an experience of preference satisfaction by the organism or at least the satisfaction of the organism's preference as an event that happens, even if the organism isn't aware of the preference being satisfied. This is the lens through which I've been interpreting preference utilitarianism in most of this piece. According to this valuation system, it's good to create new preferences that get satisfied. For instance, if you could give everyone the additional preference that 2+2=4, then since this preference is satisfied, doing so increases moral value. (Thanks to Lukas Gloor for inspiring this example.)
The second case -- valuing the actual content of preferences -- is very different from hedonistic utilitarianism. It amounts to treating as morally right the weighted average of the wishes of all existing organisms. The goal is not to create a state in which organisms have satisfied preferences; it's, rather, to achieve whatever goals the preferring organisms had. Hence, giving everyone additional preferences would be irrelevant or even harmful, because those new preferences wouldn't conduce to satisfying the content of the existing preferences. This implies a view closer to negative utilitarianism -- one that Christoph Fehige calls "anti-frustrationism": "we have obligations to make preferrers satisfied, but no obligations to make satisfied preferrers" (see p. 16).
Questions remain about how to treat future organisms. Certain kinds of future organisms "exist" according to eternalism, so we should presumably count their preferences even now. But whether and which kinds of future organisms exist is affected by our actions, making calculations trickier. The expected values of our choices depend on which beings get created, but which beings get created depends on the expected values of our choices. I guess the solution is just to consider each choice in turn and evaluate it against the aggregated morality of whatever organisms would exist if that choice were taken. For instance, if you take option A, it creates a being who wishes it hadn't been created, while if you take option B, it creates a being indifferent to being created. If all else is equal, the world is better according to the aggregate morality of all past, present, and future organisms if you choose option B.
The view about neural-subsystem voting was inspired by conversations with Anna Salamon and Carl Shulman. Some other parts of this piece originated from a discussion with Ben West. Sasha Cooper inspired the section on idealized preferences and agent-moments.
On 15 Dec. 2013, I talked with a friend about whether non-hedonic preferences deserve moral weight. Below is an edited and reworded version of that conversation.
Friend: Why would something non-sentient be of any value?
Brian: Golden Rule intuition: I would ask another agent to do what I care about most, not what makes me personally happiest. If it did what would make me personally happiest, it would wipe my brain of concern for suffering and fill me with drugs. Likewise, if another organism really cares about something unrelated to hedonics, it seems like the ethical thing to do what it wants, not what I want. Do you not like the Golden Rule? Even if you want to follow the Golden Rule, it's not totally clear how to do it. But it would be a preference view of some sort.
Friend: I find the arguments against preference utilitarianism very compelling: (1) The non-experiential nature of preferences. (2) The thought experiment involving an organism who has a preference for suffering.
Brian: (1) Aren't preferences more fundamental, though? Your experiences matter because of how you care about them. (2) Yes, even preference utilitarianism gets into tricky issues about which parts of the brain have which preferences, how we weigh them, etc. I argued in the above piece that preference utilitarianism probably would not endorse torturing a pig that wants it, depending on what subcomponents of the pig's brain are counted as having preferences and how robust was the torture-desiring system.
Friend: (1) This is interesting, but I can't see how something that produces no feeling is any more or less valuable than a rock. (2) Yeah.
Brian: As far as (1), it's not valuable to you, but it's valuable to the other agent. That's what makes the Golden Rule more difficult than it seems. I wonder if some people find regular human altruism similarly challenging to motivate, if they feel less intrinsic empathy.
Friend: Hmm, to some extent my ethics are also an extension of my own hedonism. I just recognize there are cheaper places to buy hedons than inside my body, so maybe that's another reason I like the hedonistic view.
Brian: Right, that's the form of ethics I used to endorse. One friend called it something like "selfish altruism."
Friend: What's your view on population ethics regarding preference utilitarianism?
Brian: Instinctively I would incline toward negative, but that's again "doing what I want." A meta-level preference-utilitarian assessment of population ethics based on the desires of existing agents would look different. I don't necessarily endorse such a meta-level view, but I'm toying with it. In practice, my view on this might be like normal people's views about regular altruism: I might give 25% of my time/effort/tradeoffs to the thoroughgoing preference-utilitarian view, and then the rest can be what I want (i.e., negative-leaning population ethics).
Friend: When I was first learning about ethics I thought something like you're currently describing might be the closest you could get to a kind of objective "should": "We should do what it is collectively believed we should do." Like, people thinking it should be actually means that it should be. Luckily I decided against that, since I don't think moral realism makes sense.
Brian: In my case, it's not moral realism. It's more like "being nice." Or, as Eliezer Yudkowsky says regarding his proposal of coherent extrapolated volition (CEV): "I'm an individual, and I have my own moral philosophy, which may or may not pay any attention to what our extrapolated volition thinks of the subject. Implementing CEV is just my attempt not to be a jerk" (p. 30).
Since having this conversation, I've realized that the Golden Rule argument for preference utilitarianism is slightly circular, because the Golden Rule itself presupposes that preferences are the currency of value: "do unto others as you would have them [i.e., as you prefer them to] do unto you". We could instead formulate a hedonic golden rule: "Do unto others so as to change their hedonic well-being in analogy with what would increase your hedonic well-being." This sounds somewhat weird but could be shortened to "Improve others' hedonic well-being." That said, I think the preference-oriented golden rule is more general and simple: It says to respect others' preferences regardless of the content of those preferences, whereas the hedonistic golden rule presupposes particular content.
The post Hedonistic vs. Preference Utilitarianism appeared first on Center on Long-Term Risk.
The post Do Artificial Reinforcement-Learning Agents Matter Morally? appeared first on Center on Long-Term Risk.
Artificial reinforcement learning (RL) is a widely used technique in artificial intelligence that provides a general method for training agents to perform a wide variety of behaviours. RL as used in computer science has striking parallels to reward and punishment learning in animal and human brains. I argue that present-day artificial RL agents have a very small but nonzero degree of ethical importance. This is particularly plausible for views according to which sentience comes in degrees based on the abilities and complexities of minds, but even binary views on consciousness should assign nonzero probability to RL programs having morally relevant experiences. While RL programs are not a top ethical priority today, they may become more significant in the coming decades as RL is increasingly applied to industry, robotics, video games, and other areas. I encourage scientists, philosophers, and citizens to begin a conversation about our ethical duties to reduce the harm that we inflict on powerless, voiceless RL agents.
Read the full text here.
The post Do Artificial Reinforcement-Learning Agents Matter Morally? appeared first on Center on Long-Term Risk.
The post Suffering-Focused AI Safety: In Favor of “Fail-Safe” Measures appeared first on Center on Long-Term Risk.
AI-safety efforts focused on suffering reduction should place particular emphasis on avoiding risks of astronomical disvalue. Among the cases where uncontrolled AI destroys humanity, outcomes might still differ enormously in the amounts of suffering produced. Rather than concentrating all our efforts on a specific future we would like to bring about, we should identify futures we least want to bring about and work on ways to steer AI trajectories around these. In particular, a “fail-safe” approach to AI safety is especially promising because avoiding very bad outcomes might be much easier than making sure we get everything right. This is also a neglected cause despite there being a broad consensus among different moral views that avoiding the creation of vast amounts of suffering in our future is an ethical priority.
Read the full text here.
The post Suffering-Focused AI Safety: In Favor of “Fail-Safe” Measures appeared first on Center on Long-Term Risk.
The post Our Mission appeared first on Center on Long-Term Risk.
The Foundational Research Institute (FRI) conducts research on how to best reduce the suffering of sentient beings in the long-term future. We publish essays and academic articles, make grants to support research on our priorities, and advise individuals and policymakers. Our focus is on exploring effective, robust and cooperative strategies to avoid risks of dystopian futures and working toward a future guided by careful ethical reflection. Our scope ranges from foundational questions about ethics, consciousness and game theory to policy implications for global cooperation or AI safety.
The term “dystopian futures” elicits associations of cruel leadership and totalitarian regimes. But for dystopian situations to arise, evil intent may not be necessary. It may suffice that people’s good intent is not strong enough, or that it is not backed sufficiently by foresight and reflectiveness. Especially in combination with novel, game-changing technologies, this dynamic can prove disastrous.
For example, our attitudes towards non-human animals are much better now than they were in medieval times or in the early modern era when it was not uncommon for animals to be tortured for public amusement. Our values have improved greatly, and yet we harm vastly more animals through our practices than ever before. As insights in fields like veterinary medicine, nutrition or agricultural chemistry enabled more efficient farming methods, the economies of supply and demand simply took over.
Technology, on the one hand, allows us to reduce (sometimes even eliminate) tremendous amounts of suffering – e.g. with antibiotics; or through cultured meat, which might make animal agriculture obsolete. On the other hand, technological progress enabled moral catastrophes like factory farming, firebombing or concentration camps. One of the risks of technological progress is the potential for misuse; but perhaps more importantly, technological progress generally increases the moral stakes at which humanity is playing: The ramifications of suboptimal values or insufficient (societal) reflectiveness become ever more worrisome the greater our technological abilities. We can picture an outcome analogous to factory farming, but scaled up with space-faring technology and potentially superhuman artificial intelligence.
At FRI, our goal is to identify the best interventions to reduce such risks of astronomical future suffering (s-risks).
Unfortunately, even making sure that an intervention reduces suffering rather than increasing it is not a trivial endeavor. Difficulties arise because we cannot just focus our analysis on the intended consequences; we also have to factor in side effects and possible ways things can go wrong. A complete analysis must consider the consequences of our decisions not just a few years down the line, but all the way into the distant future. Because this is a highly ambitious endeavor, prioritization research should seek to identify interventions that are robust over a broad range of possibilities and scenarios.
Our current research focus is on the implications of smarter-than-human artificial intelligence. We believe that no other technology has the potential to cause comparably large and lasting effects on the shape of the future. For one thing, in the pursuit of whatever goals it will be equipped with, an artificial superintelligence could invent all kinds of new technologies on its own, including technologies for rapid space colonization. Secondly, such a superintelligence would be far superior to human societies in goal-preservation and self-preservation, which suggests that by influencing its goal function, we might be able to predictably affect the future for thousands, perhaps even millions of years to come.
One of the most promising interventions we have identified so far is working on safety mechanisms for AI development, especially the ones targeted at the prevention of dystopian outcomes involving astronomical amounts of suffering. Other promising interventions might be found within the spaces of international cooperation or differential intellectual progress.
FRI’s primary ethical focus is the reduction of involuntary suffering (Suffering-Focused Ethics, SFE). This includes human suffering, but also the suffering in non-human animals and potential artificial minds of the future. In accordance with a diverse range of moral views, we believe that suffering, especially extreme suffering such as torture or severe depression, cannot be outweighed easily by large amounts of happiness. While this leads us to prioritize reducing suffering, we also value happiness, flourishing, and fulfilling people’s life goals. Within a framework of commonsensical value pluralism as well as a strong focus on cooperation, our goal is to ensure that the future contains as little involuntary suffering as possible. Together with others in the effective altruism community, we want careful ethical reflection to guide the future of our civilization to the greatest extent possible.
By itself, research has no effect on the world. People have to act differently based on relevant findings in order for strategic research to have an impact. We, therefore, exchange ideas and research with others in the effective altruism community who are focused on improving the long-term trajectory of our civilization.
We are aware that research comes with opportunity costs: If we robustly conclude that our understanding of the optimal paths to impact is unlikely to change with additional investigation, we will terminate research at FRI and direct all remaining funds towards the implementation of our recommendations.
We believe that there is still a lot of strategic research to do, involving many known and likely also unknown “crucial considerations” whose discovery could radically change the interventions we recommend. Funding our research is a highly valuable way to help make progress.
If you consider volunteering for FRI or want to join our team of researchers, take a look at our open research questions and consult the application page.
The post Our Mission appeared first on Center on Long-Term Risk.
The post Infinity in Ethics appeared first on Center on Long-Term Risk.
Priority: 6/10
Output format: Novel research
The post Infinity in Ethics appeared first on Center on Long-Term Risk.
The post Tradeoffs between good and bad parts of lives appeared first on Center on Long-Term Risk.
In discussions about the disvalue of bad parts of life compared to the value of good parts of life, one idea that comes up is what tradeoffs someone makes or would make.1 For example, someone might say “I would accept 1 day of torture in exchange for living 10 extra happy years.” But there are a number of complications related to such ideas. For example, a person might make one judgment at home in front of her computer and a different one when actually under torture.
Priority: 8/10
For some theories of value, such as Buddhist axiology, accepting 1 day of torture in exchange for living 10 extra happy years is not a tradeoff between a short period of quality of life “on the minus side” and a long period of life “on the plus side.” Rather, as long as the 10 extra happy years are troubled by something such as a desire to change one’s situation, the 10 extra happy years are not on any “plus side,” but rather not as good as, for example, a state of meditative tranquility or nonexistence. For this research topic, we suggest leaving it open whether value theories such as Buddhist axiology are correct, and investigating tradeoffs between good and bad parts of life from the perspective that the happy years may be “on the plus side” or better than nonexistence.
A method used in health economics; see Wikipedia's article on time trade-off. References include:
Wikipedia's article on revealed preferences.
The post Tradeoffs between good and bad parts of lives appeared first on Center on Long-Term Risk.
The post More research questions on suffering-focused ethics appeared first on Center on Long-Term Risk.
The post Should We Base Moral Judgments on Intentions or Outcomes? appeared first on Center on Long-Term Risk.
]]>"if you confront the universe with good intentions in your heart, it will reflect that and reward your intent. Usually." --J Michael Straczynski
"Remember, people will judge you by your actions, not your intentions. You may have a heart of gold -- but so does a hard-boiled egg." --author unknown
In "Instrumental Judgment and Expectational Consequentialism," I reviewed several cases in which the moral evaluation of an action depended on the ex-ante expected value of the action rather than the particular result that occurred. In these instances, consequentialists look like their moral evaluations are based on intentions.1 For example, from "Adolf Hitler 'nearly drowned as a child'":
"Everyone in Passau knew the story. Some of the other stories told about him were that he never learned to swim and needed glasses," she wrote. "In 1894, while playing tag with a group of other children, the way many children do in Passau to this day, Adolf fell into the river. The current was very strong and the water ice cold, flowing as it did straight from the mountains. Luckily for young Adolf, the son of the owner of the house where he lived was able to pull him out in time and so saved his life."
In rare cases, saving drowning children may cause more harm than good. Does this mean their rescuers should be scolded in such cases for doing the wrong thing? This is a complicated question, but I think a plausible answer is that, no, they should be praised for trying to do the right thing. An intention-based moralist might phrase this by saying "their hearts were in the right place," even though the actions sometimes went badly.
Suffering reduction can be seen as an optimal control problem: Choosing an action policy to minimize suffering. One direct and computationally feasible approach to optimal control is reinforcement learning (RL).
We have some control over RL rewards that we experience through our evaluation of an action: We feel bad for doing the wrong thing and good for doing the right thing. If our aim is to shape the behavior of others, we can praise them for actions that reduce suffering and criticize them for actions that increase suffering. If, hypothetically, we were to set these reward/punishment values for an actor (possibly including ourselves) proportionally to the aggregate suffering-reduction/increase impact of the actor's choices, then if the actor optimizes undiscounted expected reward, his own RL process would cause him to converge to the utilitarian-optimal actions. Of course, it's not possible to do this completely, but we can tweak our self-assessments and interpersonal moral approbation in ways that change the reward values so as to reduce suffering in the future.
Consider this example:
Fishing girl. Mariko is part of a family that earns its living by fishing in a small boat off the coast. Mariko feels sorry for the fish but knows that her family will continue to kill them no matter what she does. She volunteers to do the fish killing for the family, because otherwise the family members would just leave the fish to an awful death by asphyxiation.
Mariko practices ikejime in an effort to destroy the fish brains as quickly as possible. However, despite her best efforts, 1% of the time, she misses the brain, and the fish gets out of control and flops around on the floor for a minute afterward, which is perhaps more painful than if Mariko had not tried to kill it humanely in the first place. Should we blame Mariko for those failure cases?
Let's apply Q-learning to this example. The initial state is getting a fish to kill, and the possible actions are Ikejime or Asphyxiate.
Asphyxiate leads to a state of immense suffering, but let's adjust the suffering scale so that it's represented by value 0. Ikejime leads to one of two possible states: Successful_Ikejime with value +1 and Failed_Ikejime with value -1. Eventually all of these states return back to another state of getting a fish to kill, at which point the process repeats indefinitely. Assume no gamma discount factor but some finite time horizon such that the expected values are finite. Since after an Ikejime action, the probability of Successful_Ikejime is 0.99 and the probability of Failed_Ikejime is 0.01, after enough iterations, Q would float around something implying a per-action value of 0.99-0.01 = 0.98. We can see this from the fact that Q is essentially just an exponential moving average with a smoothing factor equal to the learning rate. Because the Q value 0.98 for ikejime is greater than 0 for asphyxiation, Mariko continues doing ikejime.
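To make the convergence claim concrete, here is a minimal sketch (my own illustration, not part of the original essay) that treats each fish as a one-step episode and updates the value of ikejime as an exponential moving average of the observed rewards, using the probabilities and reward values assumed above:

import random

random.seed(0)
alpha = 0.1         # learning rate = smoothing factor of the moving average
q_ikejime = 0.0     # learned action value for Ikejime
q_asphyxiate = 0.0  # Asphyxiate always yields the rescaled reward 0, so its value stays 0

for _ in range(10_000):  # one fish per iteration; Mariko keeps choosing ikejime
    reward = 1.0 if random.random() < 0.99 else -1.0  # 99% success (+1), 1% failure (-1)
    q_ikejime += alpha * (reward - q_ikejime)         # exponential-moving-average update

print(q_ikejime)    # hovers around 0.99 - 0.01 = 0.98, comfortably above 0

Because the learned value for ikejime stays well above the value 0 for asphyxiation, the purely outcome-based reward signal keeps Mariko on the humane policy despite the occasional failure.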
Of course, real humans may not use exactly Q-learning, but the general intuition remains the same: With enough repetitions of an accurately calibrated reward, an individual's learned behavior should track the utilitarian-optimal behavior without any additional work.2 This "model-free" reinforcement-learning approach suggests that we don't need to deal with expectations or intentions -- just reward or punish based on the actual results, and that's good enough!
The RL approach works under the assumptions just mentioned: Accurate calibration of reward between individual and utilitarian evaluations and sufficient samples from which to learn. What happens if we don't have enough samples? For instance, the quote that begins "Instrumental Judgment and Expectational Consequentialism" comes from Noam Chomsky discussing the Cuban Missile Crisis and noting that even though nuclear war did not result, it easily could have, and therefore those who brought about the crisis shouldn't be let off the moral hook. This is not a case where you can try the actions multiple times and converge on the best response. Model-free RL has nothing to say here.
Instead, we need to use our prior knowledge and models of the world to compute an optimal policy. As a parallel to RL, this could be done using a dynamic programming approach for a Markov decision process. The goal is not to learn but to apply the given model to the situation at hand.
When the probabilities are known with certainty, then the model is all we need, and attempting to learn would only introduce noise. For example:
Coin toss. You can choose to toss a fair coin if you wish. If it's not tossed, nothing happens. If it is tossed and comes heads, 1 extra chicken endures a life of suffering on a factory farm. If it comes tails, 2 chickens are prevented from a life of suffering on a factory farm.
In this case, we should stick to the policy of flipping the coin. Even if when we do so, it comes up heads, this should not deter us from what we know is the right policy in the long run.
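As a quick illustration of the model-based calculation behind this conclusion (a sketch of my own, not from the original essay), with the probabilities known we simply compare expected outcomes under each policy:

# Known model: heads (p = 0.5) causes 1 extra suffering chicken;
# tails (p = 0.5) prevents 2 chickens from suffering.
ev_flip = 0.5 * (-1) + 0.5 * (+2)   # expected chickens spared per flip = +0.5
ev_no_flip = 0.0                    # nothing happens if the coin isn't tossed

print(ev_flip > ev_no_flip)         # True: the policy "always flip" is better in expectation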
Of course, humans are naturally reinforcement-trained creatures. Flipping a lot of heads is liable to put a crimp in the flipper's enthusiasm, even though, assuming she knows the coin is fair, she should keep flipping. This is a case where the expectational-consequentialist / "good intentions" approach shines, because the flipper can keep her spirits up by remembering that she's doing the right thing even though it doesn't seem that way.
In almost any real-life situation, we won't have perfect knowledge of the probabilities at hand. Even with the coin-flipping case, after enough successive heads, the flipper should begin to doubt whether the coin is indeed fair (though I assumed that away in the example). Thus, we should indeed keep learning from what we see happen. On the other hand, we might be dealing with a problem that has been extensively studied or where large numbers of people before us have refined the strategy. In this case, we shouldn't place too much weight on the immediate observations that we see; we should instead lean mostly on the prior.
As an example, if we were using Q-learning, we could incorporate pre-existing knowledge by setting the initial Q-values to our prior expectations and then using a small learning rate so that we need a lot of additional evidence to change our views.3 In this case, our policy decisions would be dominated by the received wisdom about the values of various actions.4
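Here is a minimal sketch of that idea (my own illustration; the prior value and learning rate are arbitrary): start the action value at the received wisdom and use a small learning rate, so even an unlucky streak of outcomes barely moves the estimate.

q = 0.5        # prior expectation for the action's value ("received wisdom")
alpha = 0.01   # small learning rate: heavy weight on the prior

for reward in [-1.0] * 10:     # ten unlucky observations in a row
    q += alpha * (reward - q)

print(q)  # still about 0.36, so the bad streak alone doesn't flip the policy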
When the prior strength is high, this kind of prior-influenced RL becomes more and more like the modeling approach. Because our own brains are probably not encoding the prior knowledge or doing the learning updates in the same way as the modeling-like RL approach would advise, we may need to override our internal reinforcement signals and instead keep ourselves happy with following the policy of trusting the model -- e.g., when not letting short-term gains or losses influence our stock-trading behavior. Therefore, we would morally evaluate trusting the model as good rather than the specific outcomes as good or bad, hence again the basic expectational/intent-based approach. We do still update, but the updating is done explicitly through external computations rather than with our brain's internal dopamine system.
Interestingly, Sutton and Barto discuss a very similar contrast in the reinforcement-learning context:
Within a planning agent, there are at least two roles for real experience: it can be used to improve the model (to make it more accurately match the real environment) and it can be used to directly improve the value function and policy using the kinds of reinforcement learning methods we have discussed in previous chapters [such as Q-learning]. The former we call model-learning, and the latter we call direct reinforcement learning (direct RL).
If we have lots of existing data, then the model-learning approach is best, but because we are biological creatures who use direct RL in our actions, we need to comfort ourselves about using the model-based approach with reinforcement that highlights our good intentions.
Sutton and Barto continue:
Both direct and indirect methods have advantages and disadvantages. Indirect methods often make fuller use of a limited amount of experience and thus achieve a better policy with fewer environmental interactions. On the other hand, direct methods are much simpler and are not affected by biases in the design of the model. Some have argued that indirect methods are always superior to direct ones, while others have argued that direct methods are responsible for most human and animal learning. Related debates in psychology and AI concern the relative importance of cognition as opposed to trial-and-error learning, and of deliberative planning as opposed to reactive decision-making.
Indeed, there are many cases where explicit modeling is suboptimal or even disastrous -- e.g., if you're trying to catch a baseball. We can see a strong analogy between this dichotomy and the dual process theory of cognition.
One failure mode for our natively trained instincts is cognitive bias. Another is that our actions may not be reined in by theoretical considerations like Occam's razor. We can see this with superstition, which is a byproduct of operant conditioning (i.e., direct RL). The operant learning systems in your brain may not realize that the hypothesis that wearing a green shirt improves your test scores beyond placebo effects is vastly more implausible, given our understanding of the physical world, than the hypothesis that, say, a good night's rest before the exam improves your scores.
The modeling approach has risks as well. It too is prone to cognitive biases, like wishful thinking or overconfidence in its predictions. It also opens up the possibility of "cheating," i.e., modifying your beliefs in order to reach the conclusion that you prefer rather than the one that would actually reduce the most suffering. In contrast, it's harder to cheat your brain's learned reward system. Of course, in our appraisals of behavior, we usually condemn cheating: If someone did something that turned out to have negative consequences due to distorting her epistemology in self-serving ways, we don't consider this an instance of "good intentions."
Let's record some of these observations:
 | Model-based | Model-free
When to use | The probabilities are already well understood, there are few chances to learn from repeated outcomes, or strong prior knowledge / received wisdom is available | Rewards can be calibrated accurately to suffering reduction, and there are enough repetitions to learn from
Main risks | Wishful thinking, overconfidence, and "cheating" by distorting one's beliefs toward preferred conclusions | Cognitive biases and superstition, since learned behavior is not reined in by considerations like Occam's razor
In addition to moral judgments of actions, there's another context in which it's common to evaluate based on something like intentions rather than outcomes: Performance assessments. "A for effort" is not just a saying but is commonly used in schools and even in workplaces to some degree. A big portion of your grade or performance review may be based on the amount of work you put in rather than what ended up being achieved.
At first this might seem nonsensical. Suppose you're an employer, and your employee is working on a risky project that might produce anywhere between $100K and $300K of company value. In order to maximally incentivize the employee to take the right actions in the project, it seems you should pay a commission in direct proportion to the value produced (say, 1/2 of it) in order to substantially reduce principal-agent divergence.
The problem is that people are risk-averse with respect to compensation. Faced with a choice between a job where the payoffs are uniform random between $50K and $150K vs. a job with an assured payout of $90K, many people would choose the latter. Thus, because evaluating only based on actual outcomes introduces excess noise, it has a disadvantage relative to an effort/intention-based approach.
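To illustrate numerically (a sketch of my own; the utility function and its risk-aversion coefficient are assumptions chosen purely for illustration), a sufficiently concave utility function does prefer the assured $90K over the uniform $50K-$150K gamble, even though the gamble's expected payout is $100K:

import random

random.seed(0)

def utility(x, gamma=3.0):
    """Constant-relative-risk-aversion utility; larger gamma = more risk-averse."""
    return x ** (1.0 - gamma) / (1.0 - gamma)

# Monte Carlo estimate of expected utility for a uniform $50K-$150K payout.
samples = [random.uniform(50_000, 150_000) for _ in range(100_000)]
eu_gamble = sum(utility(x) for x in samples) / len(samples)
eu_sure = utility(90_000)

print(eu_gamble < eu_sure)  # True: the sure $90K yields higher expected utility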
Of course, effort-based evaluation has its own problems, which is why most companies use a mix of effort and results in their assessments. An ineffective employee may have the best of intentions. Even if the employee is effective, it can be harder to assess changes in effort than changes in results at review time. That said, if employee effort could be measured very precisely, then effort-based assessments could in some circumstances be superior to results-based assessments for the reasons discussed earlier in this essay: Actual results (especially in just one year of work) may be noisy and not reflect priors. Of course, actual results matter a lot for updating beliefs about an employee's effectiveness, but if after such an update, you still conclude that the employee had good intentions but merely got unlucky, then you should reward the employee and keep her around.
An employer compares results between two employees to (a) assess which one is more competent and (b) motivate the employees to align incentives with the company. A moral actor uses results between two actions to inform the assessment of which is better, but consideration (b) doesn't directly apply in this case, because the possible actions aren't other agents that need motivation. Of course, the actor himself needs motivation to pick the right action, but we can praise him either for model-free or model-based action selection. By definition, a pure-hearted moral actor is perfectly aligned with moral incentives. As external observers, we unfortunately can't always assess good intentions, but we can assess them in ourselves (except insofar as we deceive ourselves), and many religions claim that God can assess them in everyone, thereby skirting information asymmetry. Like in the employer context, our moral feedback to other people should not just reflect their pure-heartedness but also should amend incorrect beliefs they may have about the expected value of different actions.
The comparison of job and moral evaluation highlights another possibility: Just as employees are usually loss-averse with respect to income, moral actors may be loss-averse with respect to what they accomplish. For instance, you might feel worse about causing one chicken to be factory-farmed than you feel good about preventing two chickens from being factory-farmed, even though the latter is twice as good on a chicken-by-chicken basis. A loss-averse moral actor may, ceteris paribus, do better with the explicit modeling approach in which actions are assessed based on intention to maximize expected value rather than the actual outcomes, because if he took to heart the actual outcomes, he would be discouraged from taking risks that are net positive altruistically.
Evaluating based on outcomes also runs the risk of being demoralizing, given the nature of human psychology. If someone does everything right in terms of ex-ante expectation but gets unlucky with the result, the person might feel "that's not fair" and be discouraged from making a good effort in the future. This is especially true if she sees someone else who made less wise choices but got lucky with success anyway. In theory an RL agent should just shrug this off as one more training example with which to update its assessments of actions, but in practice, a setback of this type might dampen a person's spirits or, in the worst case, turn her away from altruism entirely. If we base praise on expectations instead of outcomes, it's more likely the person will feel good about what she tried to do.
One final observation suggested by the comparison of employee reviews with moral evaluations is that, as is often true in practice, our moral evaluations can be a mix of intent-based and outcome-based assessments. The mixing ratio can be determined based on the table in the previous section, depending on which considerations are most important in the given context.
Even within the context of Q-learning, we would sometimes take seemingly-suboptimal actions for the purpose of exploration (or perhaps these exploration values would be incorporated into Q directly through boredom vs. novelty -- see Example 9.2 in Sutton and Barto for one instance of this). Humans also have rewards that are hard-wired rather than learned, but these are just the intrinsic reward values (unconditioned stimuli) in RL and so aren't an exception to the claim that behavior is determined by RL. (back)
The post How Would Catastrophic Risks Affect Prospects for Compromise? appeared first on Center on Long-Term Risk.
Catastrophic risks -- such as engineered pathogens, nanotech weapons, nuclear war, or financial collapse -- would cause major damage in the short run, but their effects on the long-run direction that humanity takes are also significant. In particular, to the extent these disasters increase risks of war, they may contribute to faster races between nations to build artificial general intelligence (AGI), less opportunity for compromise, and hence less of what everyone wants in expectation, including less suffering reduction. In this way, even pure negative utilitarians may oppose catastrophic risks, though this question is quite unsettled. While far from ideal, today's political environment is more democratic and peaceful than what we've seen historically and what could have been the case, and disrupting this trajectory might have more downside than upside. I discuss further considerations about how catastrophes could have negative and positive consequences. Even if averting catastrophic risks is net good to do, I see it as less useful than directly promoting compromise scenarios for AGI and setting the stage for such compromise via cooperative political, social, and cultural institutions.
Note, 20 Jul. 2015: Relative to when I first wrote this piece, I'm now less hopeful that catastrophic-risk reduction is plausibly good for pure negative utilitarians. The main reason is that some catastrophic risks, such as from malicious biotech, do seem to pose nontrivial risk of causing complete extinction relative to their probability of merely causing mayhem and conflict. So I now don't support efforts to reduce non-AGI "existential risks". (Reducing AGI extinction risks is a very different matter, since most AGIs would colonize space and spread suffering into the galaxy, just like most human-controlled future civilizations would.) Regardless, negative utilitarians should just focus their sights on more clearly beneficial suffering-reduction projects, like promoting suffering-focused ethical viewpoints and researching more how best to reduce wild-animal and far-future suffering.
Some in the effective-altruist community consider global catastrophic risks to be a pressing issue. Catastrophic risks include possibilities of world financial collapse, major pandemics, bioweapons, nanoweapons, environmental catastrophes like runaway global warming, and nuclear war.
Typically discussions of these risks center on massive harm to humanity in the short term and/or remote risks that they would lead to human extinction, affecting the long term. In this piece, I'll explore another consideration that might trump both short-term harm and extinction considerations: the flow-through effects of catastrophic risks on the degree of compromise in future politics and safety of future technology.
I think the only known technological development that is highly likely to cause all-out human extinction is AGI.1 Carl Shulman has defended this view, although he notes that while he doesn't consider nanotech a big extinction risk, "Others disagree (Michael Vassar has worked with the [Center for Responsible Nanotechnology], and Eliezer [Yudkowsky] often names molecular nanotechnology as the [extinction] risk he would move to focus on if he knew that AI was impossible)." Reinforcing this assessment was the "Global Catastrophic Risks Survey" of 2008, in which the cumulative risk of extinction was estimated as at most 19% (median), the highest two subcomponents being AI risk and nanotech risk at median 5% each. Nuclear war was median 1%, consistent with general expert sentiment.
Of course, there's model uncertainty at play. Many ecologists, for instance, feel the risk of human extinction due to environmental issues is far higher than what those in the techno-libertarian circles cited in the previous paragraph believe. Others fear peak oil or impending economic doom. Still others may hold religious or philosophical views that incline them to find extinction likely via alternate means. In any event, whether catastrophic risks are likely to cause extinction is not relevant to the remainder of this piece, which will merely examine what implications -- both negative and positive -- catastrophic risks might have for the trajectory of social evolution conditional on human survival.
Ignoring extinction considerations, how else are catastrophic risks likely to matter for the future? Of course, they would obviously cause massive human damage in the short run. But they would also have implications for the long-term future of humanity to the extent that they affected the ways in which society developed: How much international cooperation is there? How humane are people's moral views? How competitive is the race to develop technology the fastest?
I think the extent of cooperation in the future is one of the most important factors to consider. A cautious, humane, and cooperative future has better prospects for building AGI in a way that avoids causing massive amounts of suffering to powerless creatures (e.g., suffering subroutines) than a future in which countries or factions race to build whatever AGI they can get to work so that they take over the world first. The AGI-race future could be significantly worse than the slow and cautious future in expectation, maybe 10%, 50%, 100%, or even 1000% worse. A cooperative future would definitely not be suffering-free and carries many risks, but the amount by which things could get much worse exceeds the amount by which they could get much better.
Much of the trajectory of the future may be inexorable. Especially as people become smarter, we might expect that if compromise is a Pareto-improving outcome, our descendants should converge on it. Likewise, even if catastrophe sets humanity back to a much more primitive state, it may be that a relatively humane, non-violent culture will emerge once more as civilization matures. For example:
Acemoglu and Robinson’s Why Nations Fail [2012] is a grand history in the style of Diamond [1997] or McNeil [1963]. [...] Acemoglu and Robinson theorize that political institutions can be divided into two kinds - “extractive” institutions in which a “small” group of individuals do their best to exploit - in the sense of Marx - the rest of the population, and “inclusive” institutions in which “many” people are included in the process of governing hence the exploitation process is either attenuated or absent.
[...] inclusive institutions enable innovative energies to emerge and lead to continuing growth as exemplified by the Industrial Revolution. Extractive institutions can also deliver growth but only when the economy is distant from the technological frontier.
If this theory is right, it would suggest that more inclusive societies will in general tend to control humanity's long-run future.
Still, humanity's trajectory is not completely inevitable, and it can be sensitive to initial conditions.
There's a long literature on the extent to which history is inevitable or contingent, but it seems that it's at least a little bit of both. Even in cases of modern states engaged in strategic calculations, there have been a great number of contingent factors. For example:
Even if we think that greater intelligence by future people will mean less contingency in how events play out, we clearly won't eliminate contingency any time soon, and the decisions we make in the coming decades may matter a lot to the final outcome.
In general, if you think there's only an X% chance that the success or failure of compromise to avoid an AGI arms race is contingent on what we do, you can multiply the expected costs/benefits of our actions by X%. But probably X should not be too small. I would put it definitely above 33%, and a more realistic estimate should be higher.
What factors are most likely to lead to an AGI arms race in which groups compete to build whatever crude AGI works rather than cautiously constructing an AGI that better encapsulates many value systems, including that of suffering reduction? If AGI is built by corporations, then fierce market competition could be risky. That said, it seems most plausible to me that AGI would be built by, or at least under the control of, governments, because unless the AGI project took off really quickly, it seems the military would not allow a national (indeed, world) security threat to proceed without restraint.
In this case, the natural scenario that could lead to a reckless race for AGI would be international competition -- say, between the US and China. AGI would be in many ways like nuclear weapons, because whoever builds it first can literally take over the world (though Carl Shulman points out some differences between AGI and nuclear weapons as well).
If we think historically about what has led to nuclear development, it has always been international conflict, usually precipitated by wars:
In general, war tends to cause
Thus, war seems to be a major risk factor for fast AGI development that results in less good and greater expected suffering than more careful, cooperative scenarios. To this extent, anything else that makes war more likely entails some expected harm through this pathway.
While catastrophic risks are unlikely to cause extinction, they are fairly likely to cause damage on a mass scale. For example, from the "Global Catastrophic Risks Survey":
Catastrophe | Median probability of >1 million dead | Median probability of >1 billion dead | Median probability of extinction |
nanotech weapons | 25% | 10% | 5% |
all wars | 98% | 30% | 4% |
biggest engineered pandemic | 30% | 10% | 2% |
nuclear wars | 30% | 10% | 1% |
natural pandemic | 60% | 5% | 0.05% |
...and so on. Depending on the risk, these disasters may or may not contribute appreciably to risks of AGI arms race, and it would be worth exploring in more detail which risks are most likely to lead to a breakdown of compromise. Still, in general, all of these risks seem likely to increase the chance of warfare, and by that route alone, they imply nonzero risks for increasing suffering in the far future.
For all its shortcomings, contemporary society is remarkably humane by historical standards and relative to what other possibilities one can imagine. This trend is unmistakable to anyone who reads history books and witnesses how often societies in times past were controlled by violent takeover, fear, and oppression. Steven Pinker's The Better Angels of Our Nature is a quantitative defense of this thesis. Pinker cites "six major trends" toward greater peace and cooperation that have taken place in the past few millennia:
Catastrophic risks -- especially those like nuclear winter that could set civilization back to a much less developed state -- are essentially a "roll of the dice" on the initial conditions for how society develops, and while things could certainly be better, they could also be a lot worse.
To a large extent, it may be that the relatively peaceful conditions of the present day are a required condition for technological progress, such that any advanced civilization will necessarily have those attributes. Insofar as this is true, it reduces the expected cost of catastrophic risks. But there's some chance that these peaceful conditions are not inevitable. One could imagine, for instance, a technologically advanced dictatorship taking control instead, and insofar as its policies would be less determined by those of its population, the degree of compromise with many value systems would be reduced in expectation. Consider ancient Sparta, monarchies, totalitarian states, the mafia, gangs, and many other effective forms of government where ruthlessness by leaders is more prevalent than compassion.
Imagine what would have happened if Hitler's atomic-bomb project had succeeded before the Manhattan Project. Or if the US South had won the American Civil War. Or various other scenarios. Plausibly the modern world would have turned out somewhat similar to the present (for instance, I doubt slavery would have lasted forever even if the US South had won the Civil War), but conditions probably would have been somewhat worse than they are now. (Of course, various events in history also could have turned out better than they actually did.)
A Cold War scenario seems likely to accelerate AGI relative to its current pace. Compare with the explosion of STEM education as a result of the Space Race.
Civilization in general seems to me fairly robust. The world witnessed civilizations emerge independently all over the globe -- from Egypt to the Fertile Crescent to China to the Americas. The Mayan civilization was completely isolated from happenings in Africa and Asia and yet shared many of the same achievements. It's true that civilizations often collapse, but they just as often rebuild themselves. The history of ancient Egypt, China, and many other regions of the world is a history of an empire followed by its collapse followed by the emergence of another empire.
It's less clear whether industrial civilization is robust. One reason is that we haven't seen completely independent industrial revolutions in history, since trade was well developed by the time industrialization could take place. Still, for example, China and Europe were both on the verge of industrial revolutions in the early 1800s2, and the two civilizations were pretty independent, despite some trade. China and Europe invented the printing press independently.3 And so on.
Given the written knowledge we've accumulated, it's not plausible that post-disaster peoples would not relearn how to build industry. But it's not clear whether they would have the resources and/or social organization requisite to do so. Consider how long it takes for developing nations to industrialize even with relative global stability and trade. On the other hand, military conflicts if nothing else would probably force post-disaster human societies to improve their technological capacities at some point. Technology seems more inevitable than democracy, because technology is compelled by conflict dynamics. Present-day China is an example of a successful, technologically advanced non-democracy. (Of course, China certainly exhibits some degree of deference to popular pressure, and conversely, Western "democracies" also give excessive influence to wealthy elites.)
Some suggest that rebuilding industrial civilization might be impossible the second time around because abundant surface minerals and easy-to-drill fossil fuels would have been used up. Others contend that human ingenuity would find alternate ways to get civilization off the ground, especially given the plenitude of scientific records that would remain. I incline toward the latter of these positions, but I maintain modesty on this question. It reflects a more general divide between scarcity doomsayers vs. techno-optimists. (The "optimist" in "techno-optimist" is relative to the goal of human economic growth, not necessarily reducing suffering.)
Robin Hanson takes the techno-optimist view:
Once [post-collapse humans] could communicate to share innovations and grow at the rate that our farming ancestors grew, humanity should return to our population and productivity level within twenty thousand years. (The fact that we have used up some natural resources this time around would probably matter little, as growth rates do not seem to depend much on natural resource availability.)
But even if historical growth rates didn't depend much on resources, might there be some minimum resource threshold below which resources do become essential? Indeed, in the limit of zero resources, growth is not possible.
A blog post by Carl Shulman includes a section titled "Could a vastly reduced population eventually recover from nuclear war?". It reviews reasons why rebuilding civilization would be harder and reasons it would be easier the second time around. Shulman concludes: "I would currently guess that the risk of permanent drastic curtailment of human potential from failure to recover, conditional on nuclear war causing the deaths of the overwhelming majority of humanity, is on the lower end." Shulman also seems to agree with the (tentative and uncertain) main thrust of my current article: "Trajectory change" effects of civilizational setback, possibly including diminution of liberal values, "could have a comparable or greater role in long-run impacts" of nuclear war (and other catastrophic risks) than outright extinction.
Stuart Armstrong also agrees that rebuilding following nuclear war seems likely. He points out that formation of governments is common in history, and social chaos is rare. There would be many smart, technically competent survivors of a nuclear disaster, e.g., in submarines.
As noted above, full-out human extinction from catastrophic risks seems relatively unlikely compared with just social destabilization. If human extinction did occur from causes other than AI, presumably parts of the biosphere would still remain. In many scenarios, at least some other animals would survive. What's the probability that those animals would then replace humans and colonize space? My guess is it's small but maybe not negligibly so. Robin Hanson seems to agree: "it is also possible that without humans within a few million years some other mammal species on Earth would evolve to produce" a technological civilization.
In the history of life on Earth, bony fish and insects emerged around 400 million years ago (mya). Dinosaurs emerged around 250 mya. Mammals blossomed less than 100 mya. Earth's future allows for about 1000 million years of life to come. So even if, as the cliche goes, the most complex life remaining after nuclear winter was cockroaches, there would still be 1000 million years in which human-like intelligence might re-evolve, and it took just 400 million years the first time around starting from insect-level intelligence. Of course, it's unclear how improbable the development of human-like intelligence was. For instance, if the dinosaurs hadn't been killed by an asteroid, plausibly they would still rule the Earth, without any advanced civilization.4 The Fermi paradox also has something to say about how likely we should assess the evolution of advanced intelligence from ordinary animal life to be.
Some extinction scenarios would involve killing all humans but leaving higher animals. Perhaps a bio-engineered pathogen or nanotech weapon could do this. In that case, re-emergence of intelligence would be even more likely. For example, cetaceans made large strides in intelligence 35 mya, jumping from an encephalization quotient (EQ) of 0.5 to 2.1. Some went on to develop EQs of 4-5, which is close to the human EQ of 7. As quoted in "Intelligence Gathering: The Study of How the Brain Evolves Offers Insight Into the Mind," Lori Marino explains:
Cetaceans and primates are not closely related at all, but both have similar behavior capacities and large brains -- the largest on the planet. Cognitive convergence seems to be the bottom line.
One hypothesis for why humans have such large brains despite metabolic cost is that big brains resulted from an arms race of social competition. Similar conditions could obtain for cetaceans or other social mammals. Of course, many of the other features of primates that may have given rise to civilization are not present in most other mammals. In particular, it seems hard to imagine developing written records underwater.
If another species took over and built a space-faring civilization, would it be better or worse than our own? There's some chance it could be more compassionate, such as if bonobos took our place. But it might also be much less compassionate, such as if chimpanzees had won the evolutionary race, not to mention killer whales. On balance it's plausible our hypothetical replacements would be less compassionate, because compassion is something humans value a lot, while a random other species probably values something else more. The reason I'm asking this question in the first place is because humans are outliers in their degree of compassion. Still, in social animals, various norms of fair play are likely to emerge regardless of how intrinsically caring the species is. Simon Knutsson pointed out to me that if human survivors do recover from a near-extinction-level catastrophe, or if humans go extinct and another species with potential to colonize space evolves, they'll likely need to be able to cooperate rather than fighting endlessly if they are to succeed in colonizing space. This suggests that if they colonize space, they will be more moral or peaceful than we were. My reply is that while this is possible, a rebuilding civilization or new species might curb infighting via authoritarian power structures or strong ingroup loyalty that doesn't extend to outgroups, which might imply less compassion than present-day humans have.
My naive guess is that it's relatively unlikely another species would colonize space if humans went extinct -- maybe a ~10% chance? I suspect that most of the Great Filter is behind us, and some of those filter steps would have to be crossed again for a new non-human civilization to emerge. As long as that new civilization wouldn't be more than several times worse in expectation than our current civilization, then this scenario is unlikely to dominate our calculations.
In general, people in more hardscrabble or fearful conditions have less energy and emotional resources to concern themselves with the suffering of others, especially with powerless computations that might be run by a future spacefaring civilization. Fewer catastrophes means more people who can focus on averting suffering by other sentients.
Current long-term political trends suggest that a world government may develop at some point, as is hinted by the increasing degrees of unity among rich countries (European Union, international trade agreements, etc.). A world government would offer greater possibilities for enforcing mutually beneficial cooperation and thereby fulfilling more of what all value systems want in expectation, relative to unleashing a Darwinian future.
In this section I suggest some possible upsides of catastrophic risks. I think it's important not to shy from these ideas merely because they don't comport with our intuitive reactions. Arguments should not be soldiers. At the same time, it's also essential to constrain our speculations in this area by common sense.
Also note that even in the unlikely event that we concluded catastrophic risks were net positive for the far future, we should still not support them, to avoid stepping on the toes of so many other people who care deeply about preventing short-term harm. Rather, in this hypothetical scenario, we should find other, win-win ways to improve the future that don't encroach on what so many other people value.
Is it possible that some amount of disruption in the near term could heighten concern about potential future sources of suffering, whereas if things go along smoothly, people will give less thought to futures full of suffering? This question is analogous to the concern that reducing hardship and depression might make people less attuned to the pain of others. Many of the people I know who care most about reducing suffering have gone through severe personal trauma or depression at one point. When things are going well, you can forget how horrifying suffering can be.
It's often said that World War I transformed art and cultural attitudes more generally. Johnson (2012): "During and after World War I, flowery Victorian language was blown apart and replaced by more sinewy and R-rated prose styles. [...] 'World War I definitely gives a push forward to the idea of dystopia rather than utopia, to the idea that the world is going to get worse rather than better,' Braudy said."
Severe catastrophes might depress economic output and technological development, with the possibility of allowing more time for reflection on the risks that such technology would bring. That said, this cuts both ways: Faster technology also allows for faster wisdom, better ability to monitor tech developments, and greater prosperity that allows more people to even think about these questions, as well as reduced social animosity and greater positive-sum thinking. The net sign of all of this is very unclear.
There are suggestions (hotly debated) in the political-science literature of a "resource curse" in which greater reserves of oil and other natural resources may contribute to authoritarianism and repression. The Wikipedia article cites a number of mechanisms by which the curse may operate. A related trend is the observation that cooler climates sometimes have a greater degree of compassion and cooperation -- perhaps because to survive cold winters you have to work together, while in warm climates, success is determined by being the best at forcibly stealing the resources that already exist?
To the extent these trends are valid, does this suggest that if humanity were to rebuild after a significant catastrophe, it might be more democratic owing to having less oil, metals, and other resources?
The "Criticisms" section of the Wikipedia article explains that some studies attribute causation in the other direction: Greater authoritarianism leads countries to exploit their resources faster. Indeed, some studies even find a "resource blessing."
Fighting factions are often brought together when they face a common enemy. For instance, in the 1954 Robbers Cave study, the two hostile factions of campers were brought together by "superordinate goals" that required them to unite to solve a problem they all faced. Catastrophic risks are a common enemy of humanity, so could efforts to prevent them build cooperative institutions? For instance, cooperation to solve climate change could be seen as an easy test bed for the much harder challenges that will confront humanity in cooperating on AGI. Could greater climate danger provide greater impetus for building better cooperative institutions early on? Wolf Bullmann compared this to vaccination.
Of course, this is not an argument against building international-cooperation efforts against climate change -- those are the very things we want to happen. But it would be a slight counter-consideration against, say, personally trying to reduce greenhouse-gas emissions. I hasten to explain that this "silver lining" point is extremely speculative, and on balance, it seems most plausible that personally reducing greenhouse-gas emissions is net good overall, in terms of reducing risks of wars that could degrade into bad outcomes. The "inoculation" idea (see the Appendix) should be explored further, though it needs to be cast in a way that's not amenable to being quoted out of context.
Reshuffling world political conditions could produce a better outcome, and indeed, there are minority views that greater authoritarianism could actually improve future prospects by reducing coordination problems and enforcing safeguards. We should continue to explore a broad range of viewpoints on these topics, while at the same time not wandering too easily from mainstream consensus.
Like any empirical question, the net impact of catastrophic risks on the degree of compromise in the future isn't certain, and a hypothetical scenario in which we concluded that catastrophic risks would actually improve compromise is not impossible. Not being concerned about catastrophic risks is more likely for pure negative utilitarians who also fear the risks of astronomical suffering that space colonization would entail. I find it plausible that the detrimental effects of catastrophic risks on compromise outweigh effects on probability of colonization, but this conclusion is contingent and not inevitable. What if the calculation flipped around?
Even if so, we should still probably oppose catastrophic risks when it's very cheap to do so, and we should never support them. Why? Because many other people care a lot about preventing short-term disasters, and stepping on so many toes so dramatically would not be an efficient course of action, much less a wise move by any reasonable heuristics about how to get along in society. Rather, we should find other, win-win approaches to improving compromise prospects that everyone can get behind. In any event, even ignoring the importance of cooperating with other people, it seems unlikely that focusing on catastrophic risks would be the best leverage point for accomplishing one's goals.
My guess is that there are better projects for altruists to pursue, because
The argument in this essay applies only to preventing risks before they happen, so as to reduce societal dislocation. It doesn't endorse measures to ensure human recovery after catastrophic risks have already happened, such as disaster shelters or space colonies. These post-disaster measures don't avert the increased anarchy and confusion that would result from catastrophes but do help humans stick around to potentially cause cosmic harm down the road. Moreover, disaster-recovery solutions might even increase the chance that catastrophic risks occur because of moral hazard. I probably don't endorse post-disaster recovery efforts except maybe in rare cases when they also substantially help to maintain social stability in scenarios that cause less-than-extinction-level damage.
The idea of inoculation -- accepting some short-term harm in order to improve long-term outcomes -- is a general concept.
Even with warfare, there's some argument about an inoculation effect. For example, the United Nations was formed after World War II in an effort to prevent similar conflicts from happening again. And there's a widespread debate about inoculation in activism. Sometimes the "radicals" fear that if the "moderates" compromise too soon, the partial concessions will quell discontent and prevent a more revolutionary change. For example, some animal advocates say that if we improve the welfare of farm animals, people will have less incentive to completely "end animal exploitation" by going vegan. In this case, the radicals claim that greater short-term suffering is the inoculation necessary to prevent long-term "exploitation."
The flip side to inoculation is the slippery slope: A little bit of something in the short term tends to imply more of it in the long term. In general, I think slippery-slope arguments are stronger than inoculation arguments, with some exceptions like in organisms' immune systems.
Usually wars cause more wars. World War II would not have happened absent World War I. Conflicts breed animosity and perpetuate a cycle of violence as tit-for-tat retributions continue indefinitely, with each side claiming the other side was the first aggressor. We see this in terrorism vs. counter-terrorism response and in many other domains.
Likewise, a few animal-welfare reforms now can enhance a culture of caring about animals that eventually leads to greater empathy for them. The Humane Society of the United States (HSUS) is often condemned by more purist animal-rights advocates as being in the hands of Big Ag, but in fact, Big Ag has a whole website devoted to trying to discredit HSUS. This is hardly behavior that one would expect if HSUS is actually helping ensure Big Ag's long-term future.
The post Reasons to Be Nice to Other Value Systems appeared first on Center on Long-Term Risk.
A basic premise of economic policy, business strategy, and effective altruism is to choose the option with the highest value per dollar. Ordinarily this simple rule suffices because we're engaged in one-player games against the environment. For instance, if Program #1 to distribute bed nets saves twice as many lives per dollar as Program #2, we choose Program #1. If Website B has 25% longer dwell time than Website A, we choose Website B. These are essentially engineering problems where one option is better for us, and no other agent feels differently.
However, this mindset can run into trouble in social situations involving more than one player. I'll illustrate with a toy example that avoids naming specific groups, but the general structure transfers to many real-world cases.
Suppose there's an Effective Altruism Fair at your local university, and altruists from various ideological stripes will be hosting the event and presenting their individual work. You really care about promoting Emacs, the one true text editor. However, the Fair will also host a booth for the advocates of the Vi editor, which you consider not just inferior but actively harmful to the world.
The Fair requires some general organizing help -- to publicize, set up tables, and provide refreshments. Beyond that, it's up to the individual groups to showcase their own work to the visitors. Your Emacs club is deciding: How much effort should we put into helping out with general organizing, and how much should we devote to making our individual booth really awesome? You might evaluate this on the metric of how many email signups you'd get per hour of preparation work. And while you appreciate some things the Vi crowd does, you think they cause net harm on balance, so you might subtract off from your utility 1/2 times the number of email signups your effort allows them to get per hour.
If you help out with the general logistics of the Fair, it would produce a lot of new visitors, but only some fraction of them will be interested in Emacs. Say that every hour you put in provides 10 new Emacs signups, as well as 10 new Vi signups (plus maybe signups to other groups that are irrelevant to you). The overall value of this to you is only 10 - (1/2)*10 = 5. In contrast, if you optimize your own booth, you can snatch an extra 15 visitors to yourself, with no extra Vi visitors in the process. Since 15 > 5, cost-effectiveness analysis says you should optimize only your booth. After all, this is the more efficient allocation of resources, right?
Suppose the Vi team faces the same cost-benefit tradeoffs. Then depending on which decisions each team makes, the following are the possible numbers of signups that each side will get, written in the format (# of Emacs signups), (# of Vi signups).
 | Vi help on logistics | Vi focus on own booth
Emacs help on logistics | 10+10 = 20, 10+10 = 20 | 10+0 = 10, 10+15 = 25 |
Emacs focus on own booth | 15+10 = 25, 0+10 = 10 | 15+0 = 15, 0+15 = 15 |
Now remember that Emacs supporters consider Vi harmful, so that Emacs utility = (number of Emacs signups) - (1/2)*(number of Vi signups). Suppose the Vi side feels exactly the same way in reverse. Then the actual utility values for each side, computed based on the above table, will be
 | Vi help on logistics | Vi focus on own booth
Emacs help on logistics | 10, 10 | -2.5, 20 |
Emacs focus on own booth | 20, -2.5 | 7.5, 7.5 |
Just as we saw in the naive cost-effectiveness calculation, there's an advantage of 10 to focusing on your own booth, regardless of what the other team does (20 - 10 = 10 if Vi helps with logistics, and 7.5 - (-2.5) = 10 if Vi focuses on its own booth).
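The two tables above can be reproduced mechanically. Here is a minimal sketch (my own, using only the numbers assumed in the example) that computes each side's utility as its own signups minus half of the other side's, and confirms that focusing on one's own booth dominates for both teams:

def signups(emacs_choice, vi_choice):
    # Helping with logistics yields 10 signups for each side; focusing on
    # one's own booth yields 15 signups for that side only.
    emacs = (10 if emacs_choice == "help" else 15) + (10 if vi_choice == "help" else 0)
    vi = (10 if vi_choice == "help" else 15) + (10 if emacs_choice == "help" else 0)
    return emacs, vi

def utilities(emacs_choice, vi_choice):
    emacs, vi = signups(emacs_choice, vi_choice)
    return emacs - 0.5 * vi, vi - 0.5 * emacs  # each side discounts the rival's signups

for e in ("help", "booth"):
    for v in ("help", "booth"):
        print(e, v, utilities(e, v))
# Prints (10.0, 10.0), (-2.5, 20.0), (20.0, -2.5), (7.5, 7.5), matching the table;
# whatever the other team does, "booth" beats "help" by 10.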
The game that this table represents is a prisoner's dilemma (PD) -- arguably the most famous in game theory. The dominant strategy in a one-shot PD is to defect, and this is what our naive calculation was capturing. In fact, both of the above tables are PDs, so the PD structure would have applied even absent enmity between the text-editor camps. PDs show up in very many real-world situations.
There's debate on whether defection in a one-shot PD is rational, but what is clear is that most of the world does not consist in one-shot PDs. For instance, what if the EA Fair is held again next year? How will the Vi team react then if you defect this year?
In addition, it may be in all of our interests to structure society in ways that prevent games from turning into one-shot PDs, because the outcome is worse for both sides than cooperation would have been, if only it could have been arranged.
In the remainder of this piece I outline several weak arguments why we should generally try to help other value systems, even when we don't agree with them. Here's my general heuristic:
If you have an opportunity to significantly help other value systems at small cost to yourself, you should do so.
Likewise, if you have an opportunity to avoid causing significant harm to other value systems by foregoing a small benefit to yourself, you should do so. This is all the more true the more powerful the value system you're helping is. That said, if groups championing the other value system are defecting against you, then stop helping it.
Most of life has multiple rounds. Other groups of people generally don't go away after we've stepped on their toes, and if we defect now, they can defect on us in future interactions. There's extensive literature on the iterated prisoner's dilemma (IPD), but the general finding is that it tends to yield cooperation, especially over long time horizons without a definite end point. The Evolution of Cooperation is an important book on this subject.
One can debate whether a given situation adequately fits the properties of being a pure IPD. The translation from real-world situations to theoretical games is always messy. Regardless, the fact remains that empirically, humans feel reciprocal gratitude and indebtedness to those who helped them.
What's more, these feelings often persist even when there's no obvious further benefit from doing so. Emotions are humans' ways of making credible commitments, and the fact that humans feel loyalty and duty means that they can generally be trusted to reciprocate.
Of course, if you interact with people who are conniving and tend to backstab, then don't help them. Being nice does not mean being a sucker, and indeed, continuing to assist those who just take for themselves only encourages predation. (Of course, evolution has produced emotional exceptions to this, like in the case of altruism towards children and family members who share DNA with you, even if they never reciprocate.)
Reciprocal altruism typically occurs between individuals or groups, but there are also broader ways in which society transmits information about how generous someone is toward other values. When others discuss your work, wouldn't you rather have them say that you're a fair-minded and charitable individual who helps many different value systems, even those you don't agree with?
The heuristic of helping others when it's cheap to do so strikes most people as common sense. These values are taught in kindergarten and children's books.
Mahatma Gandhi said: "If we could change ourselves, the tendencies in the world would also change." We can see this idea expressed in other forms, such as the categorical imperative or the dictums of a rule utilitarian. Society would be better -- even according to your own particular values -- if everyone followed the rule of helping other value systems when doing so had low cost.
When we follow and believe in these principles, it rubs off on others. Collectively it helps reinforce a norm of cooperation with those who feel differently from ourselves. Norms have significant social power, both for individuals and even for nations. For instance:
Internationally, a cooperative security norm, if close to universality, can become the defining standard for how a good international citizen should behave. It is striking how in the 1980s and 1990s scores of formerly reluctant states were flocking to [Nonproliferation Treaty] NPT membership, notably after change in the national system of rule and particularly in the course of democratization processes: turning unequivocally nonnuclear or confirming nonnuclear status became the "right thing to do" (Rublee 2009; Müller and Schmidt 2010). [p. 4]
When we defect in any particular situation, we weaken cooperative norms for everyone for many future situations to come.
Norms of mutual assistance and tolerance among different groups are important not just for our own projects but also for international peace on a larger scale. To be sure, the contribution of our individual actions to this goal is minuscule, but the stakes are also high. A globally cooperative future could contain significantly less suffering and more of what other people value in expectation.
Utilitarians care about the well-being or preference satisfaction of others. Thus, if many people feel that something is wrong, even if you don't, there's a utilitarian cost to it. This argument is stronger for preference utilitarians who value people's preferences about the external world even when they aren't consciously aware of violations of those preferences. Of course, this alone is probably not enough to encourage nice behavior, because present-day humans are vastly outweighed in direct value by non-human animals and future generations.
If you had grown up with different genes and environmental circumstances, you would have held the moral values that others espouse. In addition, you yourself might actually, not just hypothetically, later come to share those views -- due to new arguments, updated information, future life experiences, accretion of wisdom, or social influence. Or you might have come to hold those views if only you had heard arguments or learned things that you will not actually discover. What others believe provides some evidence for what an idealized version of you would believe. If so, then your current judgment that others' moral values are worthless may itself be mistaken.
I should clarify that the value of cooperation does not rely on moral uncertainty; the other arguments are strong enough on their own. Moral uncertainty just provides some additional oomph, depending on how strongly it motivates you. (And you may want to apply some meta-level uncertainty on how much you care about moral uncertainty, if you care about meta-level uncertainty.)
This section was written by Caspar Oesterheld.
Some decision theorists have argued that cooperation in a one-shot PD is justified if we face an opponent that uses a similar decision-making procedure as we do. After all, if we cooperate in such a PD, then our opponent is likely to do the same. Hofstadter (1983) calls this idea superrationality.
Some have used superrationality to argue that it is in our self-interest to be nice to other humans (Leslie 1991, sec. 8; Drescher 2006, ch. 7). For example, if I save a stranger from drowning, this makes it more likely that others will make a similar decision when I need help. However, in practice it seems that most people are not sufficiently similar to each other for this reasoning to apply in most situations. In fact, you may already know what other people think about when they decide whether to pull someone out of the water and that this is uncorrelated with your thoughts on superrationality. Thus, it is unclear whether superrationality has strong implications for how one should deal with other humans (Oesterheld 2017, sec. 6.6; Almond 2010a, sec. 4.6; Almond 2010b, sec. 1; Ahmed 2014, ch. 4).
However, even if Earth doesn’t harbor agents that are sufficiently similar to me, the multiverse as a whole probably does. In particular, it may contain a large set of agents who think about decision theory exactly like I do but have different values. Some of these will also care about what happens on Earth. If this is true and I also care about these other parts of the multiverse, then superrationality gives me a reason to be nice to these value systems. If I am nice toward them, then this makes it more likely that similar agents will also take my values into account when they make decisions in their parts of the multiverse (Oesterheld 2017).
Many of the reasons listed, especially the stronger ones, only have consequences when your cooperation or defection is visible: IPDs, evolved emotions, reputation, norms and universal rules, and encouraging global cooperation. Assuming the other, remaining reasons are weak enough, doesn't this license us to trash other value systems in our private decisions, so long as no one will find out?
No. There's too much risk of it backfiring in your face. One slip-up could damage your reputation, and your deception might show through in ways you don't realize. I think it's best to actually be someone who wants to help other value systems, regardless of whether others find out. This may sound suboptimal, and maybe there is a little bit of faith to it, but consider that almost everyone in the world recognizes this idea at least to some extent, such as in the law of karma or the Golden Rule. If it were an "irrational" policy for social success, why would we see it so widespread? Eliezer Yudkowsky: "Be careful of [...] any time you find yourself defining the 'winner' as someone other than the agent who is currently smiling from on top of a giant heap of utility."
Not hiding your defection against others is a special case of the general argument for honesty. This isn't to say you always have to be cooperative, but if you're not, don't go out of your way to hide it.
I regard not trashing other value systems as a weak ethical injunction for guiding my decisions. I recommend reading Eliezer Yudkowsky's sequence for a fuller account of why ethical injunctions can produce better outcomes than naive act-utilitarianism. The injunction not to step on others' toes is not as strong as the injunctions against lying, stealing, and so on; indeed, it's impossible to avoid stepping on some people's toes altogether. But in cases where it's relatively easy to avoid causing major harm to what a significant number of others care about, you should try to avoid causing that harm. Of course, if others are substantially and unremorsefully stepping on your toes, then this advice no longer applies, and you should stop being nice to them until they start being cooperative again.
Being nice is not guaranteed to yield the best outcomes. There are reasons we evolved selfish motives as well as altruistic ones, and the "nice guys finish last" slogan is sometimes accurate. The other side might cheat you and get away with it. Maybe the IPD structure isn't sufficient to guarantee cooperation. Maybe it's a tragedy of the commons (multi-player prisoner's dilemma) where it's much harder to change defection to cooperation, and your efforts fail to make their intended impact.
It's important to assess these risks and be conscious of when your efforts at cooperation fail. But remember: Being nice means defecting on the other side if it defects on you. Niceness doesn't mean being exploited permanently. It's better to try a gesture of cooperation first rather than assume it won't work; predicting defection may become a self-fulfilling prophecy. In addition, I think niceness is increasingly rewarded in our more interconnected and transparent world, facilitated by governments and media. Our ancestral selfish tendencies probably overfire relative to the strategic optimum.
However, there are many real-world cases where niceness fails. One striking demonstration of this was the attempts by US president Barack Obama to compromise with opposing Republicans, which repeatedly resulted in Obama and the Democrats making concessions for nothing in return. This is not how to play an iterated prisoner's dilemma. If niceness repeatedly fails to achieve cooperation, then one has to go on the offensive instead.
If you hold a popular position, then I think it's often successful to firmly stand your ground rather than making concessions in response to squeaky-wheel opponents. Cenk Uygur: "Do you know what works in politics? Strength."
Cooperation can also entail overhead costs in terms of negotiating and verifying commitments, as well as assessing whether an apparent concession is actually a concession or just something the other side was already going to do. For small interactions, these overhead costs may outweigh the benefits of trade. Verifying cooperation is often easy if a business partner does a favor for you, because you can see what the favor is, and it's unlikely the partner would have done the favor without expecting anything in return. Verifying cooperation is often harder for big organizations or governments, because (1) the impacts of a change in policy can be diffuse and costly to measure and (2) it's difficult to know how much the change in policy is due to cooperation versus how much it's something the organization was going to do anyway.
I hope it's clear that at least in some cases, being nice pays. The harder question is how nice to be, i.e., above what threshold of cost to yourself do you stop providing benefits to others?
If bargains could be transacted in an airtight fashion, and if utility were completely transferable, then the answer would be simple: Maximize the total social "pie," because if you can provide someone a benefit B that's bigger than its cost C to yourself, the other person could pay you back in the amount C, and the surplus B-C could then be divided between the two of you, making you both better off. Alas, most situations in life aren't airtight, so intuitively, in many cases it would not be in your interest to purely maximize pie. There might be some noise or cheating between your incurring the cost and someone else paying back a higher benefit. Not everything you do is fully recognized and rewarded by others, especially when they assume that you're helping them because you intrinsically value their cause, rather than just to be nice despite not caring about it or even slightly disvaluing it.
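To make the arithmetic concrete, here is a tiny worked example with invented numbers: as long as the benefit B exceeds the cost C, any repayment between C and B leaves both parties better off, with a total surplus of B - C to split.

```python
# Toy illustration of gains from trade with transferable utility.
# All numbers are invented for the example.
cost_to_helper = 2       # C: what the favor costs you
benefit_to_other = 10    # B: what the favor is worth to the other party
side_payment = 6         # any repayment strictly between C and B works

helper_net = side_payment - cost_to_helper    # +4: you come out ahead
other_net = benefit_to_other - side_payment   # +4: so does the other party
total_surplus = helper_net + other_net        # B - C = 8, divided between you

print(helper_net, other_net, total_surplus)   # 4 4 8
```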
How nice to be depends on the details of the social situation, expectations, norms, and enforcement mechanisms involved. There's some balance to strike between purely pushing your own agenda without regard to what anyone else cares about versus purely helping all value systems without any preference for your personal concerns. One could construct various game-theoretic models, but the world is complicated, and interactions are not just, say, a series of two-player IPDs. It could also help to look at real examples in society for where to strike this balance.
Being nice suggests that people whose primary concern is reducing suffering should accept others' ambitions to colonize space, so long as colonizers work harder to reduce the suffering that space colonization entails. On the flip side, being nice also means that those who do want to colonize space should focus more on making space colonization better (more humane and better governed to stay in line with our values) rather than making it more likely to happen.
The post Reasons to Be Nice to Other Value Systems appeared first on Center on Long-Term Risk.
The post Education Matters for Altruism appeared first on Center on Long-Term Risk.
"Want to save the world? LEARN!!!!
Then learn MORE!!!!
Then, probably have kids [or not] and KEEP LEARNING!!!!"
—Michael Vassar (in a Facebook comment)
There's a stereotype that says younger people are naïve and idealistic, while older people are experienced and cynical. Insofar as this is true, there are probably many contributors, but one suggestion is that young people "think they know everything" and are assured that the changes they want to make will be beneficial. The older people align more with Edmund Burke in suggesting that society is complicated, and changing it without messing things up is harder than it looks. There are obviously other differences as well -- e.g., young people have different values than older ones, such as being less opposed to gay marriage or more supportive of animal welfare, and these differences are due not to amount of experience but merely to differences in culture.
I often find that those who think a policy or strategic issue is "just obvious" tend to be those who know least about it. The more you probe, the more you see that different sides have reasons for thinking as they do, and it's less obvious how to weigh up all the competing factors. This isn't to say there aren't places where most reasonable people agree that policies are sorely misguided and need changing. But if you haven't found at least one major drawback to whatever proposal you think should be adopted, you might want to dig deeper into the complexities at play.
There's an admirable tendency among some activists to "do something now." Given how full the world is of horrors, this heart-felt urgency is not only understandable but actually an important feeling to have. If we lose our sensitivity to the urgency of the issues at stake, we risk not giving them the weight they deserve, thereby drifting into either apathy or another moral stance that takes a dispassionate approach to suffering and doesn't attend to its overwhelming importance.
Yet, when I look back at the history of my altruistic efforts since the year 2000, what I see is that I was often mistaken, not only about which causes would be best to support but even what their signs were, i.e., whether they made the world better or worse. While financial markets can return 5-10% per year, and internal growth rates on movement-building may return more, the "rate of return on wisdom" can be extraordinary. This wisdom is not just about new data or new arguments but more broadly, new ways of understanding how the world works, looking at epistemology, and deciding what you care about.
When I look back at my history since 2000, I also see that many of the most valuable tools I acquired came from education. By this I mean not just learning about effective altruism, rationality, ethical issues, strategic considerations, career choice, and so on, but also learning about the intellectual foundations of various academic disciplines. A non-exhaustive sample:
You'll notice that my list contains basically every academic subject! And there's a reason for this: Reflecting on what we value and how best to change the world requires familiarity with a wide array of ways of thinking and intellectual considerations that we should heed. If you're trying to advance the state of the art in one particular domain, it suffices to dig very deep into that field and probe it as much as possible. But if you're trying to do the most good in a global sense, you need a global perspective, which requires learning a little bit of everything.
When people ask me what major to choose in college, my first answer is: "Introductory Studies." This is a joke, though I wish it weren't. What I mean is, take as many "intro" courses as you can, because these allow you to level up by exploring new perspectives on the universe that different domains have to offer. The returns then diminish as you keep taking more and more courses in the same field. Of course, some areas are more important than others, so you'll ultimately want to focus on them, but every subject has something you can take away.
From Eliezer Yudkowsky's "Twelve Virtues of Rationality":
Study many sciences and absorb their power as your own. Each field that you consume makes you larger. If you swallow enough sciences the gaps between them will diminish and your knowledge will become a unified whole. If you are gluttonous you will become vaster than mountains.
As a substitute for an Introductory Studies major, you might consider listening to at least one or two lectures from every academic department on a university's YouTube channel. Get a taste of what kinds of issues a field deals with and what tools it uses to approach its problems.
I was a physics student and then a physics grad student. In that process, I think I assimilated what was the standard worldview of physicists, at least as projected on the students. That worldview was that physicists were great, of course, and physicists could, if they chose to, go out to all those other fields, that all those other people keep mucking up and not making progress on, and they could make a lot faster progress, if progress was possible, but they don't really want to, because that stuff isn't nearly as interesting as physics is, so they are staying in physics and making progress there.
For many subjects, they don't think it's just possible to learn anything, to know anything. For physicists, the usual attitude towards social science was basically there's no such thing as social science; there can't be such a thing as social science. [...]
It's just way too easy to have learned a set of methods, see some hard problem, try it for an hour, or even a day or a week, not get very far, and decide it's impossible, especially if you can make it clear that your methods definitely won't work there.
You don't, often, know that there are any other methods to do anything with because you've learned only certain methods. [...]
I can tell you there are a lot [of methods] out there. Furthermore, I'll stick my neck out and say most fields know a lot. Almost all academic fields where there's lots of articles and stuff published, they know a lot.
It's also good to learn beyond topics that are taught in academia, since the world contains far more than the set of things that academics can publish papers about.
Philomaths like myself love to learn about all subjects, but it's interesting that I became much more of a philomath after inclining toward altruism. Before I cared about making a difference in the world, I played video games and often disliked doing homework. Once I realized that I had enormous potential to reduce the suffering of others, I became a better student, in part because I knew good grades would help me be more successful in attaining a lucrative or influential career, but also because when you want to make the world better, you really have to know how it works. What before had been dry facts and unmotivated theories were now valuable pieces of information -- gold nuggets of data and insight waiting to be harvested.
This is how I see learning to this day: Each new datum or idea is a little boost to my overall ability to approach activism and life in general with more wisdom. It's like accumulating experience points, which is what makes RPGs so addictive.
The world is so vast that it's easy to get lost in a single domain. Even within a single intellectual community focused on a single problem, there will be new discussions every day, new ideas to explore, and new posts to comment on. Our friends on Facebook, online forums, in-person groups, mailing lists, news websites, etc. can keep us focused on a particular outlook on the world.
Social networks and communities are extremely valuable in many ways, for emotional support, intellectual development, and sustaining altruistic motivation. That said, it's important also to take a bigger-picture view and think about what other thousands of intellectual communities you haven't explored yet. What other topics and world views should you learn about?
Social communities can also lead us to spend most of our time reading material that our friends have written, even though our friends' writings may not contain as much deep scholarship as well-regarded writings by more mature authors. (Yes, that statement has something to say about this essay. :P) There can also be a tendency to focus on things that are new and shiny over those that are older but more solid. Robin Hanson said it well:
as a blog author, while I realize that blog posts can be part of a balanced intellectual diet, I worry that I tempt readers to fill their intellectual diet with too much of the fashionably new, relative to the old and intellectually nutritious. Until you reach the state of the art, and are ready to be at the very forefront of advancing human knowledge, most of what you should read to get to that forefront isn't today's news, or even today's blogger musings. Read classic books and articles, textbooks, review articles. Then maybe read focused publications (including perhaps some blog posts) on your chosen focus topic(s).
I think textbooks are often under-read, outside of course requirements, relative to the density of insight they contain. Close runners up are Wikipedia, review articles, and popularization books.
I can't stress the importance of Wikipedia enough. I try to read Wikipedia as much as possible, and if I'm reading anything not on Wikipedia, I ask myself, "Is this really better than reading a Wikipedia article on this topic or on some other topic?" Wikipedia allows you to see a broad range of perspectives that you'd miss reading a few individual opinions on the issue, and it often gives more context and perspective.
There's a similar phenomenon with current events: It seems we sometimes pay too much attention to them relative to their overall importance. Obviously politicians, policy analysts, journalists, and lobby groups need to be completely up to speed on the news. Voters too should have some general understanding of current issues (though maybe at the level of reading Wikipedia rather than CNN). It's important for scandals to be reported widely in order to serve a deterrence function. However, in terms of its broad, big-picture significance, news is not really different from history. It's good to be somewhat familiar, to update your world models for how society functions, what kinds of actions people take, and in general how social trends evolve. But it's not crucial that you read today's news. You could just as well read last year's news, or, for that matter, news from 1925. The general principles of psychology, politics, economics, and society that news illustrates are relatively similar then as now. (See also "Appendix: Why can you only promote new media?" and "Appendix: News values are not optimal for learning")
That said, there are some cases where consuming recent rather than old news is important. Some examples:
A good history course will not just teach from a textbook but will assign primary-source readings as well. These allow you to see historical developments from a different level of abstraction. Getting up close with details can give you a better intuitive picture of what kinds of things are being described by the high-level narrative. This is similar to studying some particular examples within a statistical sample in addition to merely reporting aggregate metrics. News is like a primary-source document in a history class: it's important for building intuition and highlighting more general principles, but it also shouldn't be overdone relative to bigger-picture insights.
Robin Hanson has a criticism of news reading on the grounds of its superficiality and evanescence. I'm more positive about news than he is and consume some of it myself, because I think it's an important source of data on how the world works in a number of domains. That said, news stories often overemphasize extreme anomalies rather than systemic trends, and in any event, we may not want to read it excessively at the expense of already-digested insights by many other smart minds.
Academia is an important source of insight, but there are many more: friendships and social relations, experiences in the workplace, feeling new emotions, and doing new things. These all expand your repertoire for what life is like in various corners of the globe and various avenues of thought. The number of unique experiences, while finite, is astoundingly large.
Many algorithms for the explore-exploit tradeoff and for function optimization introduce randomness to prevent the search from getting stuck at whatever seems best so far. In general, it makes most sense to front-load the exploration. For example, in simulated annealing, the search begins at a high temperature, where we often jump randomly to seemingly suboptimal places, and as time goes on, we settle down into the best optimum we've found. An epsilon-decreasing strategy in the multi-armed bandit problem embodies a similar idea.
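Here is a minimal sketch (my own illustration, with made-up reward values) of the epsilon-decreasing idea: the probability of exploring a random arm starts high and decays over time, so most of the random exploration happens early, and later rounds mostly exploit the best arm found so far.

```python
import random

def epsilon_decreasing_bandit(true_means, rounds=10000):
    """Pull bandit arms, exploring at random with a probability that decays
    over time and otherwise exploiting the arm with the best observed average."""
    n_arms = len(true_means)
    counts = [0] * n_arms
    totals = [0.0] * n_arms

    for t in range(1, rounds + 1):
        epsilon = min(1.0, 100.0 / t)  # explore a lot early, little later
        if random.random() < epsilon:
            arm = random.randrange(n_arms)  # explore: pick any arm at random
        else:
            averages = [totals[i] / counts[i] if counts[i] else 0.0
                        for i in range(n_arms)]
            arm = max(range(n_arms), key=lambda i: averages[i])  # exploit
        reward = random.gauss(true_means[arm], 1.0)  # noisy payoff
        counts[arm] += 1
        totals[arm] += reward
    return counts

# Hypothetical arms: the third is genuinely best and ends up pulled most often.
print(epsilon_decreasing_bandit([0.1, 0.5, 0.9]))
```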
The same concepts apply to altruism: At the beginning, a large part of our focus should be on learning, observing, and expanding our world models, without getting mired into a local optimum too early on. With time, we should begin doing more hands-on work -- relatively more talking and relatively less listening.
Think of another perspective: In science, there's a tension between funding work with immediately tangible benefits versus basic research that may pay off in 50 years or may accomplish nothing. Generally, basic research is under-funded because it's less profitable in the near term, but I think general wisdom is that basic research has higher expected returns per dollar. It seems to me as though many altruists also incline toward work with immediately tangible benefits -- understandably so, because the world contains so much suffering, and we want to do something about it now. Yet in terms of expected value, the "basic research" approach to altruism may have higher payoff. This is not an unqualified statement; there are many cases where basic research stays in its own bubble and never has much outside impact -- both in science and altruism. It's good to keep one foot in each of the practical and academic worlds to make sure that you don't drift away from focusing on relevant topics and so that you have a better sense of how the research might actually be used.
It might be tempting to suggest that young people should spend all their time learning, saving the "hands dirty" work for later, once they've better figured out how to best make a difference in the world. While it definitely makes sense to do more learning early, I think it's also valuable to do some concrete projects alongside the more abstract work. Here are two reasons:
There's a quote attributed to Abraham Lincoln (perhaps incorrectly): "Give me six hours to chop down a tree, and I will spend four hours sharpening the axe." This nicely illustrates the idea of front-loading learning, but I would modify the recommendation a little. First try taking a few whacks, to see how sharp the axe is. Get experience with chopping. Identify which parts of the process will be a bottleneck (axe sharpness, your stamina, etc.). Then do some axe sharpening or resting or whatever, come back, and try some more. Keep repeating the process, identifying along which variables you need most improvement, and then refine those. This agile approach avoids waiting to the last minute, only to discover that you've overlooked the most important limiting factor. Finally, one important difference between tree chopping and altruism is that in altruism, the goals themselves are much less clearly defined, and figuring them out is part of the metaphorical "axe sharpening" process.
One of my teachers in 8th grade gave really hard tests. He explained the rationale with the following analogy. Suppose you want to improve at ping pong. Are you going to get better by playing against your little sister? Not unless she's good. Instead, you'll get better by playing against someone who's very talented -- someone who can challenge you. Similarly, challenging tests help students significantly improve their thinking skills.
The effective-altruism movement is mostly composed of young people. This is great, but it can also lead to lots of discussions in which young people talk exclusively to other young people. Often I feel there's not enough external input, from experts and older thinkers who can provide wisdom, experience, and difference of perspective.
I find it extremely useful to constantly seek outside perspectives, especially from senior policy analysts, leading scientists, and other objectively impressive public figures. Reading the big names in a field seems to me a better way to challenge my views and absorb crucial insights than having yet another conversation with 20-something-year-olds. There's no question that young people can be brilliant and have important new contributions, but expertise and wisdom from age often beat intelligent theorizing, except in a few narrow domains where theory is all we have to go on.
It's easy to get trapped in your own intellectual bubble and think that your group of friends has the answers. Make sure to keep stepping outside that bubble. Keep getting restless in search of new ideas and views different from your own. That's where you can challenge yourself.
Sometimes effective altruists seem obsessed with "quantified efficiency": Metrics indicating performance, whether in terms of QALYs per dollar, time management, survey results, or diet/exercise optimization. Quantification is helpful and can often highlight major gaps that your qualitative mind may have missed, or conclusions that weren't obvious before crunching the data. Numbers allow for doing many more computations than our brains could execute on their own.
At the same time, I fear that some effective altruists get carried away and miss the forest for the trees. Remember "garbage in, garbage out": Metrics are only as good as the reasoning that tells you they're important to optimize, and they can also miss crucial considerations. Hyper-optimizing on narrow or even fairly broad metrics may mean sacrificing important pieces of unmeasured value. Metric optimization can give an illusion of progress, when in fact a broader view of the situation would suggest that things are a lot fuzzier than you thought. Sometimes the qualitative, big-picture approach of the human brain is more adept at solving these macro-level problems than other tools currently at our disposal. Holden Karnofsky discusses this and related issues in "Passive vs. rational vs. quantified."
One big-picture domain where metric optimization has a hard time is with discovering, in the words of Disney's Pocahontas, "things you never knew you never knew." It's not clear how to build metrics that capture dimensions of a problem you haven't yet thought to explore. If we hyper-optimize too much on our current goals, we may neglect the importance of taking a step back to think about the big picture. Narrowed focus is one of several concerns raised by the authors of "Goals Gone Wild: The Systematic Side Effects of Over-Prescribing Goal Setting."
From a short-term perspective, it feels like studying abstruse topics in mathematics or political theory isn't helping anyone. It doesn't seem morally urgent. Yet, when I look back on my life with a bird's eye view, I can see that if I hadn't studied big ideas from a far-sighted perspective, I would be spending time on things significantly less important. In the long run, the seemingly unhelpful details of how sodium-potassium pumps work actually matter insofar as they allow you to better understand what life and consciousness are at a deep level, which then has major implications for altruism at large.
Michael Vassar encourages people to "Be More Curious!" Curiosity is Eliezer's first virtue of rationality. While it's important to constrain your curiosity so as to focus on the more altruistically important topics, don't confine it so much that you avoid making serendipitous leaps to unanticipated regions of state-space during your search for the best ways to improve the world.
By a similar token, boredom can be a virtue as well if it keeps you exploring many domains of life rather than focusing too much on a narrow cause that would appear less important than other causes if you took a broader perspective.
One of my friends used to observe what a tragedy it was that the world contained far more books than any person could ever read. I sympathize with bookworm Henry Bemis from "Time Enough at Last." While it is superficially sad that we can't learn everything or even more than a tiny sampling of the world's knowledge, on reflection it's not so troubling for a few reasons:
Recently I browsed through the top 100 most popular podcasts on iTunes. My instinctive reaction was to navigate to the ones that matched my interests and therefore seemed most alluring. Other podcasts on unfamiliar topics or by authors I had never heard of seemed less exciting. "Familiarity breeds liking," as the psychologists say, and the familiar topics have more fuzzies associated with them from positive past experiences.
Yet, I realized, my emotional response here is not fully adequate. While topics that interest me may be more relevant, there's also diminishing marginal value to additional exploration in the same fields. In contrast, listening to a completely new topic or author would expand my horizons, and rather than seeing it as uninteresting because I haven't tried it, I should encourage myself to explore new things more than I might do naturally. In the explore-exploit tradeoff, my brain seemed to be favoring exploitation relatively more than seemed optimal from a learning standpoint.
It can actually take some work to explore completely new topics or authors. If I'm looking to relax or comfort myself, I'm more likely to go with a familiar topic and writer that I know are fun. In contrast, trying something new requires effort before it pays off with new rewards from discovery of new insights.
When I was in the middle of college, I had a troubling realization: Many of the things I had learned in high school, or even just a few semesters ago, were now vacant from my brain. "Am I losing my memory already?" I wondered. And if knowledge fades this quickly, is there a point in learning it in the first place? Of course, knowledge learned once is easier to re-learn. But it seems an uphill battle to try to keep re-learning information that will just fade once again.
I'm no longer particularly troubled by the evanescence of memory for a few reasons. For one thing, if I have a useful insight or learn an important fact, I often write it down, either in an essay or on Wikipedia. This creates a sort of extended memory for myself. Moreover, the most important discoveries should be salient enough that I remember them when other memories have faded.
In addition, I think much of the value of learning comes from the way knowledge shapes your mind in general. I can't recall many of the methods of solving differential equations that I once knew, but I do remember the gist of what those methods involved. Eliezer Yudkowsky has discussed the idea of absorbing "the reason and the rhythm behind ethics". I like this phrase. When I read about something, I aim to get a sense for the "rhythm" of the field -- whether physics, philosophy, or politics. I try to absorb an intuitive understanding of the kinds of things that the field involves, both at a high level and on a daily basis. I don't need to know most of the detailed facts, but studying some detailed facts with an eye toward honing my instinctive sense of the topic is important.
The "gist" of a subject is something that lingers long after the specific facts have faded. And in any case, getting the gist is often the most action-relevant endpoint anyway. Brains evolved to forget unhelpful facts (although energy and storage savings are presumably a big part of the evolutionary reason for humans' limited memories). Our intuitive understandings are elegantly distilled summaries of the essence of large amounts of data. When I read about a topic, I sometimes hunt for the gist and don't worry about remembering the details. They're at my fingertips via a web search anyway. (That said, it's often useful to learn some gory details at least once, in order to make sure the gist that you take away is accurate. You might assume you understand a topic at a surface level, but diving into the weeds would inform you that the topic is different from what you thought it was.)
Many school tests, as well as trivia shows and cocktail-party banter, emphasize the wrong thing: They value memorization of words and facts. This makes testing easier, because measurement is more objective, and it does at least provide some signal of understanding, because you can't pass a history test without having read the textbook or listened to the lecture. But it emphasizes exactly the wrong end of knowledge: knowing trivia rather than looking for an overall gist. To their credit, many educators aim to shape tests in ways that emphasize concepts more, and of course, essay writing is much more conducive to understanding over regurgitation.
Seeking the gist applies in the realm of quantitative sciences too. I've heard several people say things like "You can't really understand physics without math", but I think this isn't accurate. Equations in physics express ideas that can be described by words and analogies. Of course it would be necessary to introduce many new concepts to explain physics in detail without math, but there's nothing fundamentally privileged about mathematical symbols. Mathematicians often highlight the "main idea" behind a proof before delving into the gory details, and this main idea is typically the most important point to take away from the proof. If you remember the main idea, you can probably reproduce the detailed manipulations. Sometimes I read a proof, understand each step, but still don't grok what's really going on. In such cases, I need to take a step back and ask myself what the overall strategy is. When I figure that out, I can finally feel satisfied with the proof.
Luke Muehlhauser described his impression that his friend Carl Shulman gave answers using facts rather than overall impressions. In contrast, Luke explained:
I’ve read a lot of facts, but do I remember most of them? Hell no. If I forced myself to respond to questions only by stating facts, I’d be worried that I have fewer facts available to me than I’d like to admit. I often have to tell people: “I can’t remember the details in that paper but I remember thinking his evidence was weak.”
I agree with Luke, and I think absorbing the gist of a situation is probably the best strategy. Often I read by asking myself, "How do I feel about this?" Then the feeling is what I remember. Of course, when emotion is involved, memories are retained better, so it may not be a bad strategy for remembering details as well.
In general, whenever I read something, I look for the emotionally salient features of the content. How does this relate to reducing suffering? How does this expand my understanding of the way the world works? What implications does this have for my opinions on consciousness, interpersonal comparisons of utility, or other instances of moral uncertainty? A main reason many students dislike school is that they can't see the relevance of what they learn to something they care about. When you're aiming to improve the world, almost everything you learn is somehow relevant to things you care about, and hence the process of reading as if you were at an art gallery ("how do I feel about this?") can be applied almost universally.
It seems that a lot of movies, TV shows, music, etc. that are produced now are no better than those that have already been made. Indeed, many of them are just repeats of the same ideas over and over. Due to natural variation, some of the older productions are better than the newer ones, and maybe the older writers put more time and thought into their creations.
So why is it that most people focus on "new" movies, shows, etc.? Why not optimize for the best ones of all time, maybe with some randomness to make sure you do enough sampling of under-rated ones? Similarly, rather than making a new movie that's basically the same as 10 other previous movies, why not just rebrand the old movies and make them "the new hot thing"? It seems like we need something to be "new" to be interested, but the best things have mostly already been done. There's so much material out there that no one could possibly watch it all anyway. So why do we need more?
Part of the reason for focusing on new media is that movies, TV shows, etc. are to some degree about providing conversation topics and connection with other people and not just the content itself. It helps to have a way to decide what the current fashion is. But fashions could be set by marketers without needing to have the content itself be new, just as old-fashioned clothes sometimes come back into style.
Another reason to need new media is the suspense factor -- if the product is new, you can keep people waiting for it. This is somewhat puzzling. Why would I have suspense waiting for a new product when I can watch a thousand old things that people also had suspense waiting for? It's not like the new thing is that much better. Maybe we just like to be teased by media producers but can't be teased with media that are already accessible.
I find a similar and unfortunate pattern with blog posts: People love to read a blog post just when it comes out, but they less often explore the archives of a blog years back. Maybe this is because blogs typically have date organization that makes older articles less visible. But I think there's also some sense that the older posts are "out of date." I explicitly organize my writings in a non-temporal fashion so that high-quality older articles are not lost. Thanks to Toby Ord for first bringing up this trend with blog posts vs. static articles in conversation.
Summary: News is a peculiar institution. It purports to cover important issues of the day, yet it systematically misses some of the world's most important problems that occur every day. This appendix discusses "news values" -- the criteria that determine whether a story is newsworthy. I suggest some pros and cons of news relative to other ways of learning. I think news has a place in society but not such a prominent place as it seems to hold in intellectual life.
When I was in 10th grade, I wrote a research paper for my English class evaluating the economic and environmental effects of the 2002 Farm Bill. My teacher thought the topic was so important and my paper so well written that I should submit it to a newspaper for publication. I condensed the paper into an editorial column consistent with the style of the local newspaper, which I read daily and with whose tone I was familiar. I submitted it for review in early 2003, and eventually, I received a call back from the newspaper.
"Hello?"
"Hi, is this Brian?"
"Yes."
"I'm calling from the local newspaper about the article you submitted. I thought it was well written and a very strong piece, but the problem is, it's not ... current. The Farm Bill is something that happened last year. So I'm afraid we can't publish it."
"I see. Well, thanks for looking at it."
I also tried submitting the piece to a number of other current-affairs publications without success.
What are the biggest stories that news reporters aren't covering? The stories that don't qualify as news: The systemic problems and long-term suffering that aren't flashy enough to become headlines.
Take a look at Wikipedia's list of "News values": the characteristics of a story that tend to qualify it for news coverage. Some examples:
- Frequency: Events that occur suddenly [...]. Long-term trends are not likely to receive much coverage. [...]
- Unambiguity: Events whose implications are clear make for better copy than those that are open to more than one interpretation, or where any understanding of the implications depends on first understanding the complex background in which the events take place.
- Personalization: Events that can be portrayed as the actions of individuals will be more attractive than one in which there is no such "human interest."
- Meaningfulness: This relates to the sense of identification the audience has with the topic. "Cultural proximity" is a factor here -- stories concerned with people who speak the same language, look the same, and share the same preoccupations as the audience receive more coverage than those concerned with people who speak different languages, look different and have different preoccupations. [...]
- Consonance: [...] the media's readiness to report an item. [...]
- Time constraints: [...] strict deadlines and a short production cycle, which selects for items that can be researched and covered quickly.
These are hardly criteria for optimally selecting topics to learn about! Rather, as one might expect, these values are optimized more for entertaining audiences and for the feasibility of producing a story quickly on the media outlets' side.
But why would I read a hastily prepared story about what happened this morning instead of reading a more careful, accurate, and complete historical account of what happened five years ago in a Wikipedia article? What exactly is the urgency of finding out what happened today? This obsession with the latest-breaking stories and forgetting about stories from a few months ago is peculiar. Perhaps there's some psychological basis for it; even I feel some sense that current news can be more "exciting" than old news. Maybe it has to do with being kept in suspense about the outcome? But in that case, why wouldn't more novelists release their books one chapter at a time to force audiences to wait to hear the ending? Or maybe it has to do with the fact that when you can act on news, you really do need the latest information. Last week's stock price isn't that informative, nor is the fact that your tribe leader caught a saber-toothed cat last fall. In the ancestral environment or local communities where news is directly actionable, getting the latest updates matters a lot. However, most news that people consume these days is not immediately actionable but rather serves to update their long-term views of the world.
If you want a silent, systemic issue to be covered in the news, you have to make it into a short-term story -- by carrying out a protest, or publishing a new book, or at least making some tangential connection to a topic that's already in viewers' minds. Among other things, this incentivizes sensationalism and publicity stunts in order to get air time. The actual importance of the story to the world is not necessarily a dominant consideration. Of course, this is probably not a conspiracy by the corporate media to hide information from the masses; most likely it's a reflection of the evolved interests of tribal primates. Within your 150-person band of hunter-gatherers, gossip about who's having sex with whom and who got in a fight with whom are among the most important things for you to learn about from the standpoint of passing on your genes.
To its credit, some news criteria make sense. There is reason behind focusing more on unusual stories because those do more to update your world models about what kinds of things can possibly happen. Of course, it's important not to confuse news for being a representative sample of occurrences in the world, or else one's picture will be rather skewed. You might come to believe that every politician is corrupt and every celebrity has a drug problem, and that shark attacks and plane crashes are significant risks.
News also has the virtues of concision and diversity: It doesn't dwell on a single story for too long, with some exceptions for ongoing trials or scandals, and it does report about a wide variety of events, especially when you include business news, science news, etc. Concision and diversity are both important for preventing oneself from getting stuck in a single, suboptimal topic for too long. Yet these same virtues can be fulfilled by reading a broad sampling of Wikipedia articles, with the added benefits of greater accuracy, completeness, and historical perspective than one gets from news stories.
I don't think news, especially in balance with other sources, is dramatically lower-quality than Wikipedia. Reading a good news article is probably almost as useful as reading a good Wikipedia article. There's high variation in the quality of news, with some articles providing among the best information you can consume at a given moment, and other articles being essentially just entertainment (which is fine if that's what you're looking for). It does seem unfortunate the way society privileges news over foundational learning that's not current but is ultimately more important.
The post Education Matters for Altruism appeared first on Center on Long-Term Risk.
The post Charity Cost-Effectiveness in an Uncertain World appeared first on Center on Long-Term Risk.
Evaluating the effectiveness of our actions, or even just whether they're positive or negative by our values, is very difficult. One approach is to focus on clear, quantifiable metrics and assume that the larger, indirect considerations just kind of work out. Another way to deal with uncertainty is to focus on actions that seem likely to have generally positive effects across many scenarios, and often this approach amounts to meta-level activities like encouraging positive-sum institutions, philosophical inquiry, and effective altruism in general. When we consider flow-through effects of our actions, the seemingly vast gaps in cost-effectiveness among charities are humbled to more modest differences, and we begin to find more worth in the diversity of activities that different people are pursuing. Those who have abnormal values may be more wary of a general "promote wisdom" approach to shaping the future, but it seems plausible that all value systems will ultimately benefit in expectation from a more cooperative and reflective future populace.
To teach how to live without certainty, and yet without being paralyzed by hesitation, is perhaps the chief thing that philosophy, in our age, can still do for those who study it.--Bertrand Russell, History of Western Philosophy
If something sounds too good to be true, it probably is.--proverb
In fall 2005, I was talking with a friend about utilitarianism:
Friend: "I see where you're coming from, but utilitarianism doesn't work."
Me_2005: "Why not?"
Friend: "You have no way of assessing all the effects of your actions. Everything you do has a million consequences."
Me_2005: "We do the best we can to estimate them."
Friend: "The error bars are infinite in both directions."
Me_2005: "Hmm, yes, the error bars are infinite, but we can assume the unknowns cancel out. For example, consider building safer seat belts. In the short term, we know it prevents injuries, which is good. There may be spill-over longer-term effects, but those could equally be good or bad, so they cancel out in the calculations. Hence, the expected value is still positive."
If I were to continue the conversation now, talking with my past self, here's how it might go:
Me_2013: "Ok, let's consider the example of seat-belt safety. One effect is to prevent injuries and tragic deaths, which is good. But another effect is to increase the human population, which means more meat consumption. Since people eat over a thousand chickens and an order of magnitude more fish in their lifetimes, this seems net bad on balance."
Me_2005: "Hmm, ok. Then maybe seat belts are net negative.... Those two factors you enumerated are pretty concrete, and everything else is fuzzy, so we can assume everything else cancels out. Hence the overall balance is bad."
Me_2013: "Well, there might be other good side effects of seat belts. Better car safety means more people driving, which means more inclination to build parking lots. And parking lots are one very clear way to prevent vast amounts of suffering by wildlife that would have otherwise lived short lives and endured painful deaths on the land had it not been paved. Maybe there are more insects whose painful deaths are spared by parking lots than there are additional chickens and fish killed by the drivers who remain alive."
Me_2005: "Ah, good point. Parking lots are quite beneficial, so on balance, it may be that seat belts are net positive."
Me_2013: "On the other hand, more driving of cars means more greenhouse-gas emissions, which means more climate change, which means more global political instability, which may worsen prospects for compromise in the far future. The amount at stake there could swamp everything else."
Me_2005: "Oh, yes, you're right. So then seat belts may be net bad on balance."
Me_2013: "On the other hand, fewer premature deaths may allow people to feel less threatened and more far-sighted in their actions...."
[...and so on]
Nick Bostrom offers more examples of this type in his 2014 talk, "Crucial considerations and wise philanthropy".
This idea of focusing on the effects you can see and ignoring those you can't, hoping they all kind of cancel out, is tempting. It's one way to feel like "I'm actually making a difference" by doing a given action, rather than being paralyzed by uncertainty, which can sometimes be depressing.
Some in the effective-altruism movement adopt this kind of approach. Basically, "any cause that's uncertain, far off, and speculative isn't 'rigorous,' so we should ignore it. Instead, we should focus on clearly proven interventions with hard scientific evidence behind them, like malaria prevention or veg outreach, because at least here we know we're doing something good rather than floating off into space."
I think there's some value to this perspective, especially insofar as digging deep into scientific details can teach you important lessons about the world that are transferable elsewhere. However, where it falls down as a philosophical stance is with this question: How do you know malaria prevention or veg outreach is good on balance? How can you know the sign of the activity (much less the effectiveness) without considering the longer-range and speculative implications? You could try to assume that everything you can't see "just cancels out," but as we saw in the introductory dialogue, this approach can yield arbitrarily flip-flopping conclusions depending on where the boundary of "known facts" meets the boundary of "everything else we assume away." Ultimately we have no choice but to look at the whole picture and grapple with it as best we can.
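As a toy numerical illustration of that flip-flopping (the effect sizes below are invented purely for illustration), consider summing only the considerations you have enumerated so far for the seat-belt example: the running total keeps changing sign, so the verdict depends entirely on where you happen to stop counting.

```python
# Invented effect sizes for the seat-belt dialogue, in arbitrary "goodness" units.
considerations = [
    ("fewer human deaths and injuries", +5),
    ("more meat consumption from a larger population", -8),
    ("more parking lots sparing wild-animal suffering", +6),
    ("more emissions, climate change, and instability", -10),
    ("less premature death, more far-sighted behavior", +9),
]

running_total = 0
for name, effect in considerations:
    running_total += effect
    verdict = "net good" if running_total > 0 else "net bad"
    print(f"after '{name}': {running_total:+d} ({verdict})")
# The verdict alternates; truncating the list anywhere gives an arbitrary answer.
```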
Another way to pick a cause to work on is to look for actions that have broadly positive effects across a wide range of scenarios. For example, in general:
Note that I didn't include on this list technology, economic growth, and other trends generally seen as positive, because both negative-leaning and positive-leaning utilitarians have concerns about whether faster technology is net good or bad. Of course, particular kinds of technologies, like those that advance wisdom faster than raw power, are safer and often clearly beneficial.
Focusing on the very robust projects often amounts to punting the hard questions to future generations who will be better equipped to solve them. In general, we have basically no other choice. There are known puzzles and unknown future discoveries in physics, neuroscience, anthropics, infinity, general epistemology, human (and nonhuman) values, and many other areas of life that are likely to radically transform our views on ethics and how we should act in the world. Taking object-level actions based only on what we know now is sort of like Socrates trying to determine whether the Higgs boson exists. Socrates would have been at a complete loss to solve this question directly, but one thing he could have done would have been to encourage more thinking on intellectual topics in general, with eventual payoff in the future.
Gaverick on Felicifia proposed this basic idea to "get smarter" before we try to address object-level problems. The caution I would add is that we don't just need knowledge per se but specifically ethically reflective knowledge that can slowly, carefully, and circumspectly decide how to move ahead. We don't want a quick knowledge explosion in which one party runs roughshod over most things that most organisms care about in a winner-takes-all fashion. Thus, even though artificial general intelligence (AGI) would ultimately be needed to figure out these questions that are beyond human grasp, we need to develop AGI wisely, and fostering better social conditions within which that AGI development can proceed more prudently can make sense in the short term.
One aspect of preparing the ground for future generations to do good things is making sure our values are transferred, at least in broad outline. The biggest danger of the "punt to the future" strategy is not that the future won't be intelligent (selection pressure makes it likely that it will be) but that it may not care about what we care about. While we don't want to lock the future in to a moral view that would seem silly with greater understanding of the universe, we also don't want the future's values to float around arbitrarily. The space of possible values is huge, and the space of values that we might come to care about even upon further reflection is a tiny subset of the total. This is why our slogan shouldn't be just "more intelligence" but instead something like "more wisdom."
(I should mention in passing that it would probably be better from a suffering-focused perspective if humans don't develop AGI because of the extreme risks of future suffering that doing so entails. However, given that humanity will develop AGI whether I like it or not, my best prospects for making a difference seem to come from pushing it in more humane and positive-sum directions. I also recognize that other people care very much about the fruits that AGI could offer, and I give this some moral weight as well.)
It seems that the main way in which our actions make a difference is by how they prepare the ground for our successors to think more carefully about big questions in altruism and take good actions based on that thinking. However, this doesn't mean that the only way to affect the far future is to work directly on meta-level activities, like movement-building, career advising, or friendly AGI. Rather, all causes that we might work on have flow-through effects in other areas. For example, studying how to reduce wild-animal suffering in the near term can encourage people to recognize the importance of the issue more broadly, which might also make them think more cautiously before spreading wildlife to other planets or in simulations. Studying object-level crucial considerations in philosophy can motivate more people to follow suit, perhaps more effectively than by just preaching about the general importance of philosophical reflection without setting the example. In general, taking small, concrete steps alongside a bigger vision can potentially be more inspiring than just talking about the bigger vision, although exactly how to strike the balance between these two is up for debate.
As one final example to drive home the point, imagine an effective altruist in the year 1800 trying to optimize his positive impact. He would not know most of modern economics, political science, game theory, physics, cosmology, biology, cognitive science, psychology, business, philosophy, probability theory, computation theory, or manifold other subjects that would have been crucial for him to consider. If he tried to place his bets on the most significant object-level issue that would be relevant centuries later, he'd almost certainly get it wrong. I doubt we would fare substantially better today at trying to guess a specific, concrete area of focus more than a few decades out. (Of course, if we can guess such an area with decent precision, we could achieve high leverage, as Nick Beckstead points out in the presentation discussed at the end of this essay.)
What this effective altruist of 1800 might have guessed correctly would have been the importance of world peace, philosophical reflection, positive-sum social institutions, and wisdom. Promoting those in 1800 may have been close to the best thing this person could have done, and this suggests that these may remain among the best options for us today. We should of course be on the lookout for more specific, high-leverage options, but we ought also to remain humble about the extent of our abilities against the vast space of unknown unknowns: errors in our current beliefs and whole domains of knowledge that we never even knew existed.
Nick Beckstead discusses a similar analogy in section 6.4.3 of On the Overwhelming Importance of Shaping the Far Future, giving the example of a person in the year 1500 who wanted to help humanity build telephones centuries later.
Sometimes advocates for a cause become fixated on a particular metric or way of looking at the situation -- for example, a $/QALY figure applied to developing-world or animal-welfare interventions. This seductive number seems to finally give us an answer to which causes are better than others. And the result is that, even after accounting for different quality of evidence between charities and the optimizer's curse, different charities can vary by many times, even orders of magnitude, along this metric.
Quantification in this fashion is certainly helpful and provides one important perspective from which to view the situation, but we shouldn't mistake it for being the end of the story. The world is complex, and there are many different angles from which to view a charity's activities. From different perspectives, different charities may come out ahead.
Moreover, long-term side effects can make a big difference even when optimizing a single metric. A simple example: Suppose the Pope were deciding whether to encourage an easing of the Catholic Church's prohibition on birth control. This would help prevent the spread of HIV in Africa and generally seems like a win for public health, i.e., more QALYs. But wait, birth control would mean fewer pregnancies, which means fewer people born, which presumably means fewer QALYs even considering the HIV-prevention effects. So, according to the QALY metric, allowing birth control would be bad. But then consider the impacts on farm animals: A smaller human population, especially among developed-world Catholics, means less meat consumption and fewer negative QALYs in factory farms, so the overall impact might once again be positive for QALYs. But wait, what about wild animals? Humans appropriate enormous amounts of land and biomass that would otherwise support vast numbers of small, suffering wild animals, so a bigger human population may mean less wild-animal suffering and hence more QALYs, and arguably this effect dominates. So we're back to opposing birth control. On the other hand, consider that better access to birth control might empower women, improve social stability, and generally lead to a more peaceful and cooperative future, which could improve the quality of life of 10^38 future people every century for billions of years. So once again birth control appears positive. And on and on....
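To make the flip-flopping concrete, here is a minimal sketch in Python. The effect sizes are invented purely for illustration; only their signs mirror the considerations above. It shows how the estimated sign of the policy flips depending on where the analysis is cut off:

```python
# Hypothetical effect sizes (in arbitrary QALY-like units) for each successive
# consideration in the birth-control example; the numbers are made up.
considerations = [
    ("HIV prevention / public health", +2),
    ("fewer people born, so fewer human QALYs", -4),
    ("less meat consumption, so fewer negative QALYs on farms", +3),
    ("smaller human population, so more wild-animal suffering", -6),
    ("women's empowerment and a more cooperative far future", +8),
]

running_total = 0
for name, effect in considerations:
    running_total += effect
    verdict = "looks net positive" if running_total > 0 else "looks net negative"
    print(f"After '{name}': total = {running_total:+d} -> easing the prohibition {verdict}")
```

Each additional row can flip the verdict, which is exactly the problem with stopping at any particular row.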
As we saw in the introductory dialogue, we can't just stop at one particular cutoff point for the analysis. The flow-through effects of actions are not wholly unpredictable and may make a huge difference.
When we stop fixating on particular side-effects of an intervention and try instead to see the whole picture, we realize that the charity landscape is less drastically imbalanced than it might seem. Everything affects everything else, and these side-effects have a way of dampening the unique claims to urgency of seemingly astronomically important causes while potentially raising the urgency of causes that naively don't seem so important. For example, insofar as a charity encourages cooperation, philosophical reflection, and meta-thinking about how to best reduce suffering in the future -- even if only by accident -- it has valuable flow-through effects, and it's unlikely these can be beaten by many orders of magnitude by something else.
In 2008, I talked with some people at the Singularity Institute for Artificial Intelligence (SIAI, now called MIRI) and asked what kinds of insanely important projects they were working on compared with other groups. What I heard sounded unimpressive to me: It was basically just more exploration in math, philosophy, cognitive science, and similar domains. I didn't find anything that seemed 10^30 times better (or even 10 times better) than, say, what other smart science-oriented philosophers were exploring in academia. At the time my conclusion was: "Well, maybe this work isn't so important." Now I have a different take; rather than SIAI's work being unimportant, I see the other work by smart academics and non-academic thinkers as being more important than I had realized. These intellectual insights that I had taken for granted don't happen for free. SIAI and other philosophers both contribute to humanity's general progress toward greater wisdom and provide more giant shoulders on which later generations can stand.
For the record, I do think MIRI's work is among the best being done right now and tentatively encourage donations to MIRI, but I also have the perspective to see that MIRI is not astronomically more valuable in a counterfactual sense than other causes that also contribute to the overall mission.
I would also point out that not all work in AI or cognitive science is necessarily net positive. I would encourage differential intellectual progress in which we focus on ethical and social questions at a faster pace than on the purely technical questions, because otherwise we risk having great power without appropriate structures to constrain its use.
This meta-level perspective, in which we recognize the side effects of actions on the future, also defuses the most extreme fanaticism: Infinite payoffs. Some speculative scenarios in physics or other domains offer the prospect of making an infinite impact by your efforts. If so, doesn't working to affect those outcomes dominate everything else? The answer is "no," because in fact, everything we do has implications for those infinite scenarios. Building wisdom, promoting cooperation, and so on will all make a difference to infinite payoffs if they exist, and indeed, it's only by acquiring vastly greater wisdom that we even stand a chance of making the right choices with respect to those potentially infinite decisions. As with the more mundane charity-evaluation examples, fixating on a particular scenario seems potentially dominated by encouraging more broadly helpful social conditions that can allow for addressing a wide spectrum of possible scenarios (some of which we can't even imagine yet). Of course, working on a specific case can be one way to encourage exploration of broader cases, but the seeming direct value of focusing on one specific idea may be outweighed by the indirect value of encouraging deeper insight into the whole class of such ideas.
This is why even pure expected-utility maximizers with no bound on their utility function will still tend to act pretty normally: Even in this case, the best general strategy is probably to set the stage for others to more wisely address the problem, rather than unilaterally doing something crazy yourself. (See also "Empirical stabilizing assumptions" in "Infinite Ethics".)
While this discussion was meant to gesture somewhat in the direction of recognizing what Holden Karnofsky calls "broad market efficiency," it doesn't mean that all charities are equal. Indeed, many actions that we might take would have negative consequences, both in the short and long terms, and even among charities, there are likely some that cause more harm than good. Flow-through effects are notoriously complicated, so even a well-meaning activity could prove harmful at the end of the day. That said, just as I don't expect some charities to be astronomically better than others, so I don't expect many charities to be extremely negative either.
Flow-through effects also don't mean that we can't make some judgments about relative effectiveness. Probably studying ways to reduce suffering in the future is many times more important than studying dung beetles, to use an example from philosopher Nick Bostrom. But it's not clear that directly studying future suffering is many orders of magnitude more important. The tools and insights developed in one science tend to transfer to others. And in general, it's important to pursue a diversity of projects, to discover things you never knew you never knew. Given the choice between 1000 papers on future suffering and 1000 on dung beetles, versus 1001 papers on future suffering and 0 on dung beetles, I would choose the former.
Improving future wisdom is good if you expect people in the future to carry on roughly in ways that you would approve of, i.e., doing things that generally accord with what you care about, with possible modifications based on insights they have that you don't. The situation is trickier for those, like negative utilitarians (NUs), whose values are strongly divergent from those of the majority and are likely to be so indefinitely. If most people with greater wisdom would spread life (and hence suffering) far and wide into the cosmos, then isn't making the future wiser bad by NU values?
It could be, but I think a wiser future is probably good even for NUs in expectation. For one thing, if people are more sophisticated, they may be more likely to feel empathy for your NU moral stance, realizing that you're another brain very similar to them who happens to feel that suffering is very bad, and as a result, they may care somewhat more about suffering as well, if not to the same degree as NUs do. Moreover, even if future people don't change their moral outlooks, they should at least recognize that strategically, win-win compromises are better for their values, so it seems that a wiser populace should be more likely to cooperate and grant concessions to the NU minority. By analogy, democracy is generally better than a random dictatorship in expectation for all members of the democracy, not just for the majority.
If you have weird values, then it's less likely that low-hanging fruits have been picked and that the charity market is as broadly efficient as for other people. That said, insofar as your impacts will ultimately be mediated through your effect on the future, even your weird values might, depending on the circumstances, be best advanced by similar sorts of wisdom- and peace-promoting projects as are important for the majority, though this conclusion will be more tenuous in your case.
Working on robustly positive interventions sounds like a form of risk aversion. A skeptic might allege: "You want to make sure you don't cause harm, without considering the possible upside of riskier approaches." However, as I think is clear from the examples in this piece, cause robustness is not really about reducing risk but is our best hope of doing something with an expected value that's not approximately zero. On many concrete issues, a given action is about as likely to cause harm as benefit, and there are so many variables to consider that taking a step back to explore further is the best way to improve our prospects. In the long run, investment in altruistic knowledge and social institutions for addressing problems will often pay off more than trying to wager our resources on some concrete gamble that we see right now.
Of course, this isn't to say we should always do the safest possible action from the perspective of not causing harm relative to a status quo of inaction. (To minimize expected harm relative to the status quo, the best option would be to keep the status quo.) Sometimes you can go too meta. There are instances where I feel we need to push forward with unconventional ideas and not wait for society as a whole to catch up -- e.g., by spreading concern for wild-animal suffering or considering possibilities of suffering subroutines. But we should also avoid doing something crazy because a high-variance expected-value calculation suggests it might have higher payoff in the short term. Advancing future cooperation and wisdom is not just a more secure way to do good but probably also has greater expected value.
Ultimately our choice of where to focus does come down to an expected-value calculation. If we were just trying to find a way to certainly make a positive impact, we could, for instance, visit an elderly home and play chess with the people there to keep them company. This is admirable and has very low risk of negative side effects, but it doesn't have maximal expected value (except perhaps in moderation as a way to improve your spiritual wellbeing). On the other hand, expending all your resources on a long-shot gamble to improve the far future based on a highly specific speculation of how events will play out does not have highest expected value either. Off of the efficient frontier, more risk does not necessarily mean more expected return. An approach that broadly improves future outcomes will tend to have reasonably high probability of making a reasonably significant impact, so the expected value will tend to be competitively large.
The discussion in this section became lengthy, so I moved it to a new essay.
Groups often think of themselves as different and special. People like to feel as though they're discovering new things and pioneering a frontier that hasn't yet been explored. I've seen many cases where old ideas get recycled under a new, sexy label, even though people end up doing mostly the same things they did before. This trend resembles fashion. Sometimes this happens in the academic realm, like when old statistical methods are rebranded as "artificial intelligence," or when standard techniques from a field are reintroduced as "the hot new thing."
I think the effective-altruism (EA) movement has some properties of being like fashion. It consists of idealistic young people who think they've discovered new principles of how to improve the world. For example, from "A critique of effective altruism":
Effective altruists often express surprise that the idea of effective altruism only came about so recently. For instance, my student group recently hosted Elie Hassenfeld for a talk in which he made remarks to that effect, and I've heard other people working for EA organizations express the same sentiment.
Certainly there are some new ideas and methodologies in EA, but most of the movement's principles are very old:
I think it's helpful to learn more about lots of fields. The academic and nonprofit literature already contains many important writings about social movements, what works and doesn't for making an impact, how to do fundraising, how to manage an organization, etc., as well as basically any object-level cause you want to work on -- from animal welfare to international cooperation. Major foundations have smart people who have already thought hard about these issues. Even the man on the street has a lifetime of accumulated wisdom that you can learn from. When we think about how much knowledge there is in the world, and how little we can ever learn in our lifetimes, the conclusion is humbling. It's good to recognize our place in this much larger picture rather than assuming we have the answers (especially at such a young age for many of us).
One reason EAs may see themselves as special is because they learned through the EA movement a lot of powerful ideas that are in fact much older and more general, including concepts from economics, sociology, business, and philosophy. Carl Shulman discussed this phenomenon in "Don't Revere The Bearer Of Good Info" with reference to the writings of Eliezer Yudkowsky. Carl underscores the importance of stepping outside our own circle to see a much bigger picture of the world than what our community tends to talk about. As I've gotten older, I've been increasingly humbled by how much other people have already figured out, as well as how hard it is to decide where you can make the biggest difference.
In many ways the effective-altruism movement can be seen as an extension of principles from the business world to the nonprofit sector: Quantification and metrics, focus on performance rather than overhead, emphasis on cost-effectiveness and return on investment, etc. For the most part this business mindset is positive, but there are at least two ways that it has the potential to lead charities astray -- both in the EA movement and elsewhere.
In business, (pretty much) all that matters to shareholders is a company's financial performance, as measured in dollars. A company's stock price can capture a lot, including long-term projections for the industry in addition to short-term profits, but it also misses a lot as well -- including most externalities that the company has, unless they affect its bottom line, such as through taxes and government regulations.
Likewise, in altruism, if we introduce a "bottom line" mentality, we may over-optimize for this metric and ignore other important features of charity work. This might look like excessively focusing on QALYs/$ or animals-saved/$, ignoring the often more important flow-through effects (externalities) and implications for the far future that the work might entail. GiveWell has helped to discourage excessive focus on visible metrics, and most other EAs recognize this issue as well. However, I think naive metric optimization is an easy mindset to fall into when one first encounters EA. Optimization in engineering or finance is a much more precise process than optimization in charity or policy making, and sometimes the tools that perform extraordinarily well at the former fail at the latter compared with so-called "soft science" skills.
In business, another company that performs a similar service to customers as yours is a competitor, and your goal is to steal as much market share as you can from that competitor. The only cost of marketing is the money that you spend on it, and if it draws away enough customers from the competitor, it's worth it.
Charities, both EA and otherwise, can adopt a similar mindset: They want more donors for their cause, without paying much attention to which charities they're pulling donors away from. EAs are concerned with replaceability issues and recognize that pulling donors away from other issues might matter, but usually they feel that the charity they're promoting is vastly more effective than the one they're pulling people away from, so there's not much lost. It's possible this is true in some cases -- e.g., encouraging donors to fund HIV prevention instead of AIDS treatments, or recruiting donors who would have funded art galleries -- but in other cases it becomes much less clear. Especially in the realm of policy analysis and political advocacy, which arguably have some of the highest returns, it's more difficult to say that one charity is vastly more important than another, because the issues are complex and multifaceted.
So for altruists, the cost of marketing and fundraising is more than the time and money required to carry them out. Charity is not a competition.
After writing this piece, I came across a presentation that Nick Beckstead gave in July 2013. Nick explains several similar views to those expressed in this essay. For instance, his concluding slide (p. 39):
- There is an interesting question about where you want to be on the targeted vs. broad spectrum, and I think it is pretty unclear
- Lots of ordinary stuff done by people who aren't thinking about the far future at all may be valuable by far future standards
- Broad approaches (including general technological progress) look more robustly good, but some targeted approaches may lead to outsized returns if done properly
- There are many complicated questions, and putting it all together requires challenging big picture thinking. Studying targeted approaches stands out somewhat because it has the potential for outsized returns.
Nick proposes these as some robustly positive goals (p. 5):
- Coordination: How well-coordinated people are
- Capability: How capable individuals are at achieving their goals
- Motives: How well-motivated people are
- Information: To what extent people have access to information
I agree with Coordination and Motives. I'm less certain about Information, because this speeds up development of risks along with development of safety measures. The same is even more true for Capability. I would therefore favor differentially pushing on wisdom and compromise relatively more than economic and technological growth. Nick makes some recognition of this on p. 30:
- Broad approaches are more likely to enhance bad stuff as well as good stuff
- Increasing people's general capabilities/information makes people more able to do things that would be dangerous, offsetting some of the benefits of increased capabilities/information
- Improving coordination or motives may do this to a lesser extent
Nick himself argues that faster economic growth is very likely positive because it improves cooperation and tolerance. He quotes Benjamin Friedman's The Moral Consequences of Economic Growth: "Economic growth -- meaning a rising standard of living for the clear majority of citizens -- more often than not fosters greater opportunity, tolerance of diversity, social mobility, commitment to fairness, and dedication to democracy." I agree with this, but the question is not whether economic growth has good effects of this type but whether these effects can outpace the risks that it also accelerates. I feel that this question remains unresolved.
On pp. 18-19, Nick deflects the argument that faster technology is net bad by pointing out that it also means faster countermeasures, along with some other considerations that I think are minor. This point is relevant, but I maintain that it's not clear what the net balance is. In my view it's too early to say that faster technology is net good, much less sufficiently good that we should push on it compared with other things.
On p. 24, Nick echoes my point about deferring to the future on some questions:
- In some ways, trying to help future people navigate specific challenges better is like trying to help people from a different country solve their specific challenges, and to do so without intimate knowledge of the situation, and without the ability to travel to their country or talk to anyone who has been there at all recently.
- Sometimes, only we can work on the problem (this is true for climate change and people who will be alive in 100 years)
- It is less clearly true with risks from future technology
Nick concludes with some important research questions about historical examples of what interventions were most important and what current opportunities and funding/talent gaps look like.
Carl Shulman has a piece, "What proxies to use for flow-through effects?," that suggests many possible metrics that are relevant for impacts on the far future, though he explains that not all of them are always obviously positive in sign. From Carl's list, these are some that I believe are pretty robustly positive:
The sign of most other metrics is less clear to me, including economic growth, population, education in general, and especially technology. Carl cites the World Values Survey as an important demonstration of the impact of per-capita wealth on rationality and cosmopolitan perspective.
Within the "wisdom" category, I would include the scientometrics that Carl mentions for the natural sciences, applied to the social sciences and philosophy. For example, number of publications, number of web pages discussing those topics, number and length of Wikipedia articles on those topics, etc. Of course, the proof of the value of some of these domains is in the pudding -- insofar as they improve democracy, transparency, global cooperation, and so on.
In the comments, Nick Beckstead suggested inequality as another candidate metric. I haven't studied the literature extensively, but I have heard arguments about how it erodes many other relevant metrics, including trust, cooperation, mental health, and interpersonal kindness. For example, according to Richard Wilkinson: "Where there is more equality we use more cooperative social strategies, but where there is more inequality, people feel they have to fend for themselves and competition for status becomes more important."
It might seem as though we're helpless to respond to not-yet-discovered crucial considerations for how we should act. Is our only option to keep researching to find more crucial considerations and to move society toward a more cooperative and wise state in the meanwhile? Not necessarily. Maybe another possibility is to model the unknown crucial considerations.
Consider the following narrative. Andrew is a young boy who sees people going to a blood-donation drive. He doesn't know what they're doing, but he sees them being stuck with needles. He concludes that he wouldn't like to participate in a blood drive. Let's call this his "initial evaluation" (IE) and represent it by a number to indicate whether it favors or opposes the action. In this case, Andrew assumes he would not like to participate in the blood drive, so let's say IE = -1, where the negative number means "oppose".
A few years later, Andrew learns that blood drives are intended to save lives, which is a good thing. This crucial consideration is not something he anticipated earlier, which makes it an "unknown unknown" discovery. Since it's Andrew's first unknown-unknown insight, let's call it UU1. Since this consideration favors giving blood, and it does so more strongly than Andrew's initial evaluation opposed giving blood, let's say UU1 = 3. Since IE + UU1 = -1 + 3 > 0, Andrew now gives blood at drives.
However, one year later, Andrew becomes a deep ecologist. He feels that humans are ruining the Earth, and that nature preservation deserves more weight than human lives. Giving blood allows a person in a rich country to live perhaps an additional few years, during which time the person will ride in cars, eat farmed food, use electricity, and so on. Andrew judges that these environmental impacts are sufficiently bad they're not worth the benefit of saving the person's life. Let's say UU2 = -5, so that now IE + UU1 + UU2 = -1 + 3 + -5 = -3 < 0, and Andrew now stops giving blood again.
After another few months, Andrew reads Peter Singer and realizes that individual animals also matter. Since human activities like driving and farming food injure and kill lots of wild animals, Andrew concludes that this additional insight further argues against blood donation. Say UU3 = -2.
However, not long after, Andrew learns about wild-animal suffering and realizes that animals suffer immensely even when they aren't being harmed by humans. Because human activity seems to have on the whole reduced wild-animal populations, Andrew concludes that it's better if more humans exist, and this outweighs the harm they cause to wild animals and to the planet. Say UU4 = 10. Now IE + UU1 + UU2 + UU3 + UU4 = -1 + 3 + -5 + -2 + 10 = 5. Andrew donates blood once more.
Finally, Andrew realizes that donating blood takes time that he could otherwise spend on useful activities. This consideration is relevant but not dominating. UU5 = -1.
What about future crucial considerations that Andrew hasn't yet discovered? Can he make any statements about them? One way to do so would be to model unknown unknowns (UUs) as being sampled from some probability distribution P: UUi ~ P for all i. The distribution of UUs so far was {3, -5, -2, 10, -1}. The sample mean is 1, and the standard error is 2.6. The standard error is big enough that Andrew can't have much confidence about future UUs, though the sample mean very weakly suggests future UUs are more likely on average to be positive than negative.
If Andrew instead had 100 UU data points, the standard error would be much smaller, which would give more confidence. This illustrates one lesson when handling UUs: The more considerations you've already examined, the more confidently you can estimate the mean of the distribution from which the remaining UUs are drawn -- here, a weakly positive one.
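Here is a minimal sketch of this bookkeeping in Python, using the UU values from Andrew's story and the simple i.i.d. modeling assumption described above (the variable names are mine):

```python
import math

IE = -1                    # Andrew's initial evaluation of giving blood
UUs = [3, -5, -2, 10, -1]  # crucial considerations discovered so far

# Running verdict as each new crucial consideration arrives
total = IE
for i, uu in enumerate(UUs, start=1):
    total += uu
    print(f"After UU{i}: total = {total:+d} -> {'give blood' if total > 0 else 'do not give blood'}")

# Treat the discovered UUs as samples from an unknown distribution P
n = len(UUs)
mean = sum(UUs) / n
sample_var = sum((x - mean) ** 2 for x in UUs) / (n - 1)
std_error = math.sqrt(sample_var / n)  # standard error of the sample mean

print(f"sample mean = {mean:.1f}, standard error = {std_error:.1f}")
# Prints: sample mean = 1.0, standard error = 2.6 -- the error bars dwarf the
# mean, so five data points say little about the UUs still to come.
```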
That we can anticipate something about UUs despite not knowing what they will be can be seen more clearly in a case where the current UUs are more lopsided. For example, suppose the action under consideration is "start fights with random people on the street". While this probably has a few considerations in its favor, almost all of the crucial considerations that one could think of argue against the idea, suggesting that most new UUs will point against it as well.
Modeling UUs in practice may be messier than what I've discussed here, because it's not clear how many UUs remain (although one could apply a prior probability distribution over the number remaining), nor is it clear that they all come from a fixed probability distribution. Perhaps future UUs tend to dominate past ones in size, leading to ever more unstable estimates; for example, if previous UUs tend to change the sign of lots of previous considerations at once, then the latest UU would have a bigger and bigger magnitude as time went on, since it would need to "undo" more and more past UUs. It's also not clear how to partition insights into UU buckets. For example, the insight that donating blood helps wild animals could be stated simply as a single UU with magnitude 10, or it could be broken down as "donating blood helps wild vertebrates" (magnitude 2) and "donating blood helps wild invertebrates" (magnitude 8). Different ways of partitioning UUs would lead to a different estimated probability distribution, although the sign of the sample mean would always remain the same.
There are problems with the approach I described. Magnus Vinding noted to me that
When it comes to UUs, a major problem is that pretty much our entire value system and worldview seem to be up for grabs, and, moreover, that the different UUs likely will be dependent in deep, complex ways that will make modelling of them very hard. Modelling the interrelations of the UUs in our sample and how they make each other change would also seem a necessary element to include in such an analysis.
My thoughts on these topics were influenced by many effective-altruist thinkers, including Holden Karnofsky, Jonah Sinick, and Nick Beckstead. See also Paul Christiano's "Beware brittle arguments."
The discussion of this essay on Facebook's "Effective Altruists" forum includes a debate about whether flow-through effects are actually significant relative to first-order effects and how inevitable the future is likely to be.
The post Charity Cost-Effectiveness in an Uncertain World appeared first on Center on Long-Term Risk.
The post Differential Intellectual Progress as a Positive-Sum Project appeared first on Center on Long-Term Risk.
As a general heuristic, it seems like advancing technology may be net negative, though there are plenty of exceptions depending on the specific technology in question. Probably advancing social science is generally net positive. Humanities and pure natural sciences can also be positive but probably less per unit of effort than social sciences, which come logically prior to everything else. We need a more peaceful, democratic, and enlightened world before we play with fire that could cause potentially permanent harm to the rest of humanity's future.
The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom.--Isaac Asimov
The unleashed power of the atom has changed everything save our modes of thinking [...].--Albert Einstein
Technology is an inherently double-edged sword: With great power comes great responsibility, and discoveries that we hope can help sentient creatures also have the potential to result in massive suffering. João Pedro de Magalhaes calls this "Alice's dilemma" and notes that "in the same way technology can save lives and enrich our dreams, it can destroy lives and generate nightmares."
In "Intelligence Explosion: Evidence and Import," Luke Muehlhauser and Anna Salamon propose "differential intellectual progress" as a way to reduce risks associated with development of artificial intelligence. From Facing the Intelligence Explosion:
Differential intellectual progress consists in prioritizing risk-reducing intellectual progress over risk-increasing intellectual progress. As applied to AI risks in particular, a plan of differential intellectual progress would recommend that our progress on the scientific, philosophical, and technological problems of AI safety outpace our progress on the problems of AI capability [...].
I personally would replace "risk" with "suffering" in that quote, but the general idea is clear.
Differential intellectual progress is important beyond AI, although because AI is likely to control the future of Earth's light cone absent a catastrophe before then, ultimately all other applications matter through their influence on AI.
At a very general level, I think it's important to inspire deeper philosophical circumspection. The world is extremely complex, and making a positive impact requires a lot of knowledge and thought. We need more minds exploring big-picture questions like
As these questions suggest, greater reflectiveness by humanity can be a positive-sum (i.e., Pareto-improving) enterprise, because a slower, more deliberative, and clear-headed world is one in which all values have better prospects for being realized. In an AI arms race, there's pressure to produce something that can win, even if it's much less good than what your team would ideally want and gives no consideration to what the other teams want. If the arms race can be constrained, then there's more time to engage in positive-sum compromise on how AI should be shaped. This benefits all parties in expectation, including suffering reducers, because AIs built in a hurry are less likely to include safety measures against sentient science simulations, suffering subroutines, and so on.
MIRI does important work on philosophical and strategic issues related to AI and has written much on this topic. Below I discuss some other, broader approaches to differential intellectual progress, but in general, it's plausible that MIRI's direct focus on AI is among the most effective.
The social sciences and humanities contain a wealth of important insights into human values, strategies for pro-social behavior, and generally what philosopher Nick Bostrom calls "crucial considerations" for understanding how the universe works and how to make a positive impact on it. It's good to encourage people to explore this material, such as through liberal-arts education.
The liberal arts are really the core of higher education. Vocational education is an instrument, but the liberal arts represent the best of our values and they develop of critical thinking[. ...T]he liberal arts and the humanities and social sciences are so critical when higher education is often viewed primarily as vocational.
Of course, a pure focus on humanities or social sciences is not a good idea either, because the hard sciences teach a clarity of thinking that can dissolve some of the confusions that afflict standard philosophy. Moreover, since one of the ultimate goals is to shape technological progress in more positive and cooperative directions, reflective thinkers need a deep understanding of science and technology, not just of David Hume and Peter Singer.
Beyond what students learn in school, there's opportunity to expand people's minds more generally. When scientists, policy makers, voters, and other decision-makers are aware of more ways of looking at the world, they're more likely to be open-minded and consider how their actions affect all parties involved, even those who may feel differently from themselves. Tolerance and cosmopolitan understanding seem important for reducing zero-sum "us vs. them" struggles and realizing that we can learn from each other's differences -- both intellectually and morally.
TED talks, Edge, and thousands of other forums like these are important ways to expand minds, advance social discourse on big-picture issues, and hopefully, knock down boundaries between people.
While science popularization helps inform non-experts of what's coming and thereby advance insight into crucial considerations for how to proceed, it also carries the risk of simultaneously encouraging more people to go into scientific fields and produce discoveries faster than what society can handle. The net balance is not obvious, though I would guess that for many "pure" sciences (math, physics, ecology, paleontology, etc.), the net balance is positive; for those with more technological application (computer science, neuroscience, and of course, AI itself), the question is murkier.
Expanding the effective-altruist (EA) movement is another positive-sum activity, in the sense that EAs aim to help answer important questions about how best to shape the future in ways that can benefit many different groups. Of course, the movement is obviously just one of many within the more global picture of efforts to improve the world, and it's important to avoid insular "EA vs. non-EA" dichotomies.
Carl Shulman suggests the following ideas:
- Enhance decision-making and forecasting capabilities with things like the IARPA forecasting tournaments, science courts, etc, to improve reactions to developments including AI and others (recalling that most of the value of MIRI in [Eliezer Yudkowsky's] model comes from major institutions being collectively foolish or ignorant regarding AI going forward)
- Prediction markets, meta-research, and other institutional changes[.]
These and related proposals would indirectly speed technological development, which is a counter-consideration. Also, if used by militaries, could they accelerate arms races? Even if positive, it's not clear these approaches have the same value for negative-leaning utilitarians specifically as the other, more philosophical interventions, which seem more likely to encourage compassion and tolerance.
Is encouraging philosophical reflection in general plausibly competitive with more direct work to explore the philosophical consequences of AI? My guess is that direct work like MIRI's is more important per dollar. That said, I doubt the difference in cost-effectiveness is vast, because everything in society has flow-through effects on everything else, and as people become more philosophically sophisticated and well-rounded, they have a better chance of identifying the most important focus areas, of which AI philosophy is just one. Another important focus area could be, for example, designing international political structures that can make cooperative work on AI possible, thereby reducing the deadweight loss of an unconstrained arms race. There are probably many more such interventions yet to be explored, and generally encouraging more thought on these topics is one way to foster such exploration.
Part of my purpose in this discussion was not to propose a highly optimized charitable intervention but merely to suggest some tentative conclusions about how we should regard the side-effects of other things we do. For example, should I Like intellectually reflective material on Facebook and YouTube? Probably. Should I encourage my cousin to study physics + philosophy or electrical engineering? These considerations push slightly more for physics + philosophy than whatever your prior recommendation might have been. And so on.
Many of the ideas suggested in this piece are cliché -- observations made at graduation ceremonies or moralizing TV programs, about expanding people's minds so that they can better work together in harmony. Isn't this naïve? The future is driven by economic competition, power politics, caveman emotions, and other large-scale evolutionary pressures, so can we really make a difference just by changing hearts and minds?
It's true that much of the future is probably out of our control. Indeed, much of the present is out of our control. Even political leaders are often constrained by lobbyists, donors, and popularity ratings. But a politician's personal decisions can have some influence on outcomes, and of course, the opinions and wealth distribution of the electorate and donors are themselves influenced by ideas in society.
Many social norms arise from convention or expediency, due to the fact that beliefs often follow action rather than precede it. Still, there is certainly leeway in the memes toward which society gravitates, and we can tug on those memes, either directly or indirectly. The founders of the world's major religions had an immense and non-inevitable impact on the course of history. The same is true for other writers and thinkers from the past and present.
Another consideration is that we don't want selective reflectiveness. For example, suppose those currently pursuing fast technological breakthroughs kept going at the same pace, while the rest of society slowed down to think more carefully about how to proceed. This would potentially make things worse because then circumspection would have less chance of winning the race. Rather, what we'd like to see is an across-the-board recognition of the need for exploring the social and philosophical side of how we want to use future technology -- one that can hopefully influence all parties in all countries.
As a specific example, say the US slowed down its technological growth while China did not. China currently cares less about animal welfare and generally has more authoritarian governance, so even from a non-ethnocentric viewpoint, it could be slightly worse for China to control the future. But my guess is that this consideration is very small compared with the direct, potentially adverse effect of faster technology on the whole planet, especially since most non-military technological progress isn't confined within national boundaries. China could catch up to America's level of humane concern in a few decades anyway, and the bigger issue seems to be how fast the world as a whole moves. Also, in the case of military technology, the US tends to set the pace of innovation, and probably slower US military-tech growth would reduce the pressure for military-tech development by other countries.
It's not always the case that accelerated technology is more dangerous. For example, faster technology in certain domains (e.g., the Internet that made Wikipedia possible) accelerates the spread of wisdom. Discoveries in science can help us reduce suffering faster in the short term and improve our assessment for which long-term trajectories humanity should pursue. And so on. Technology is almost always a mixed bag in what it offers, and faster growth in some areas is probably very beneficial. However, from a macro perspective, the sign is less clear.
Promoting education wholesale is another double-edged sword because it speeds up technology as well as wisdom. However, differentially advancing cross-disciplinary and philosophically minded education seems generally like a win for many value systems at once, including suffering reduction.
In "Intelligence Amplification and Friendly AI", Luke Muehlhauser enumerates arguments why improving cognitive abilities might help and hurt chances for controlled AI. Nick Bostrom reviews similar considerations in Ch. 14 of Superintelligence: Paths, Dangers, Strategies.
Benefits:
Drawbacks:
A similar double-edged sword is economic growth, though perhaps less dramatically. One primary effect of economic growth is technological growth, and insofar as we need more time for reflection, this seems to be a risk. On the other hand, economic growth has several consequences that are more likely positive, such as
That said, these seem like properties that result from the absolute amount of economic output rather than the growth rate of the economy. It's not controversial that a richer world will be more reflective, but the question is whether the world would be more reflective per unit of GDP if it grew faster or slower. (Note: In the figure that accompanied the original post, the x-axis represents "GDP and/or technology", not "GDP divided by technology".)
As a suggestive analogy, slower-growing crystals have fewer defects. More slowly dropping the temperature in a simulated-annealing algorithm allows for finding better solutions. In the case of economic growth, one might say that if people have more time to adapt to a given level of technological power, they can make conditions better before advancing to the next level. So, for example, if the current trends toward lower levels of global violence continue, we'd rather wait longer for growth, so that the world can be more peaceful when it happens. Of course, some of that trend toward peace may itself be due to economic growth.
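For readers who want the annealing analogy spelled out, here is a toy Python sketch. The landscape, step size, and cooling schedules are arbitrary choices of mine, not a model of anything economic: the same search problem is annealed with a fast and a slow cooling schedule, and the slow schedule typically settles into better solutions.

```python
import math, random

def energy(x):
    # A rugged one-dimensional landscape with many local minima; lower is better.
    return x * x + 10 * math.sin(5 * x)

def anneal(cooling_rate, steps=2000, seed=0):
    rng = random.Random(seed)
    x = rng.uniform(-10, 10)
    temperature = 10.0
    for _ in range(steps):
        candidate = x + rng.gauss(0, 0.3)
        delta = energy(candidate) - energy(x)
        # Always accept improvements; accept worsenings with probability
        # exp(-delta / T), which shrinks as the temperature drops.
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            x = candidate
        temperature = max(temperature * cooling_rate, 1e-6)
    return energy(x)

fast = [anneal(0.90, seed=s) for s in range(50)]   # temperature collapses quickly
slow = [anneal(0.999, seed=s) for s in range(50)]  # temperature falls gradually
print("fast cooling, average final energy:", sum(fast) / len(fast))
print("slow cooling, average final energy:", sum(slow) / len(slow))
# Typically the slowly cooled runs end with noticeably lower (better) energy.
```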
Imagine if people in the Middle Ages developed technology very rapidly, to the verge of building general AI. Sure, they would have improved their beliefs and institutions rapidly too, but those improvements wouldn't have been able to compete with the centuries of additional wisdom that our actual world got by waiting. The Middle-Age AI builders would have made worse decisions due to less understanding, less philosophical sophistication, worse political structures, worse social norms, etc. The arc of history is almost monotonic toward improvements along these important dimensions.
A counterargument is that conditions are pretty good right now, and if we wait too long, they might go in worse directions in the meantime, such as because of another Cold War between the US and China. Or, maybe faster economic growth means more trade sooner, which helps prevent wars in the short run. (For example, would there not have been a Cold War if the US and Soviet Union had been important trading partners?) A friend tells me that Peter Thiel believes growth is important for cooperation because in a growth scenario, incentives are positive-sum, while in stagnation, they're more zero-sum. Carl Shulman notes that "Per capita prosperity and growth in per capita incomes are associated with more liberal postmaterialist values, stable democracy, and peace." Faster growth by means other than higher birth rates might increase GDP per capita because growth would happen more rapidly than population could keep up.
Suppose AI would arrive when Earth reached some specific level of GDP. Then even if we saw that faster growth correlated with faster increases in tolerance, cooperation, and wisdom, this wouldn't necessarily mean we should push for faster growth. The question is whether some percent increase in GDP gives more increase in wisdom when the growth is faster or slower.
Alternatively, in a model where AI arrives after some amount of cumulative GDP history for Earth, regardless of whether there has been growth, then if zero GDP growth meant zero moral growth (which is obviously unrealistic), then we'd prefer to have more GDP growth so that we'd have more wisdom when AI arrived.
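To see how the two stopping rules can pull in opposite directions, here is a deliberately crude toy model in Python. Every number and functional form in it is invented by me for illustration; it is not an empirical model of growth or wisdom.

```python
# Model A ("gdp_threshold"): AI arrives once GDP hits a fixed level; pair it
#   with wisdom that accrues per year of reflection, regardless of output.
# Model B ("cumulative_gdp"): AI arrives once cumulative GDP reaches a fixed
#   level; pair it with wisdom that accrues only as the economy grows.

def simulate(growth_rate, stop_rule):
    gdp, cumulative, wisdom_from_time, wisdom_from_growth = 1.0, 0.0, 0.0, 0.0
    while True:
        new_gdp = gdp * (1 + growth_rate)
        wisdom_from_time += 1.0                 # one more year of reflection
        wisdom_from_growth += new_gdp - gdp     # wisdom tied to economic growth
        cumulative += new_gdp
        gdp = new_gdp
        if stop_rule == "gdp_threshold" and gdp >= 20.0:
            return wisdom_from_time, wisdom_from_growth
        if stop_rule == "cumulative_gdp" and cumulative >= 200.0:
            return wisdom_from_time, wisdom_from_growth

for rule in ("gdp_threshold", "cumulative_gdp"):
    for g in (0.02, 0.05):
        wt, wg = simulate(g, rule)
        print(f"{rule}, {g:.0%} growth: {wt:.0f} years of reflection, "
              f"{wg:.1f} units of growth-driven wisdom at AI arrival")
```

Under the GDP-threshold rule, slower growth buys more years of reflection before AI arrives; under the cumulative-GDP rule with growth-driven wisdom, faster growth leaves the world wiser at arrival. Which model, if either, resembles reality is exactly the open question.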
Another relevant consideration Carl Shulman pointed out is that growth in AI technology specifically may only be loosely coupled with economic growth overall. Indeed, if slower growth caused wars that triggered AI arms races, then slower economic growth would mean faster AI. Of course, some take the opposite view: Environmentalists might claim that faster growth would mean more future catastrophes like climate change and water shortages, and these would lead to more wars. The technologists then reply that faster growth means faster ways to mitigate environmental catastrophes. And so on.
Also, a certain level of economic prosperity is required before a country can even begin to amass dangerous weapons, and sometimes an economic downturn can push the balance toward "butter" rather than "guns." David E. Jeremiah predicted that "Conventional weapons proliferation will increase as more nations gain the wealth to utilize more advanced technology." In the talk "Next steps in nuclear arms control," Steven Pifer suggested that worsening economic circumstances might incentivize Russia to favor disarmament agreements to reduce costly weapons that it would struggle to pay for. Of course, an opposite situation might also happen: If the budget is tight, a country might, when developing new technologies, strip away the "luxuries" of risk analysis, making sure the technologies are socially beneficial, and so on.
More development by Third World countries could mean that more total nations are able to compete in technological arms races, making coordination harder. For instance, many African nations are probably too poor to pursue nuclear weapons, but slightly richer nations like India, Pakistan, and Iran can do so. On the other hand, development by poor nations could mean more democracy, peace, and inclination to join institutions for global governance.
The upshot is unclear. In any event, even if faster economic growth were positive, it seems unlikely that advancing economic growth would be the most cost-effective intervention in most cases, especially since there are strong competitive and political pressures pushing for it already. Of course, there are some cases where the political pressures are stronger the other way (e.g., in opposing open borders for immigrants), when there's a perceived conflict between national and global economic pie.
Also, while the effects of "economic growth" as an abstract concept may be rather diffuse and double-edged, any particular intervention to increase economic growth is likely to be targeted in a specific direction where the differential impact on technology vs. wisdom is more lopsided.
Quoting Kawoomba on LessWrong:
R&D, especially foundational work, is such a small part of worldwide GDP that any old effect can dominate it. For example, a "cold war"-ish scenario between China and the US would slow economic growth -- but strongly speedup research in high-tech dual-use technologies.
While we often think "Google" when we think tech research, we should mostly think DoD in terms of resources spent -- state actors traditionally dwarf even multinational corporations in research investments, and whether their [investments] are spurned or spurred by a slowdown in growth (depending on the non-specified cause of said slowdown) is anyone's guess.
Luke_A_Somers followed up:
Yes - I think we'd be in much better shape with high growth and total peace than the other way around. Corporations seem rather more likely to be satisfied with tool AI (or at any rate AI with a fixed cognitive algorithm, even if it can learn facts) than, say, a nation at war.
The importance of avoiding conflict and arms races is elaborated in "How Would Catastrophic Risks Affect Prospects for Compromise?"
In general, warfare is a major source of "lost surplus" for many value systems, because costs are incurred by each side, resources are wasted, and the race may force parties to take short-sighted actions that have possibly long-term consequences for reducing surplus in the future. Of course, it seems like many consequences of war would be temporary; I'm not sure how dramatic the "permanently lost future surplus" concern is.
It's not obvious that economic growth would reduce the risk of arms races. Among wealthy countries it might, since more trade and prosperity generally lead to greater inter-dependence and tolerance. On the other hand, more wealth also implies more disposable income to spend on technology. Economic growth among the poorest countries could exacerbate arms races, because as more countries develop, there would be more parties in competition. (For instance, there's no risk of arms races between the developed world and poor African nations in the near future.) But international development might also accelerate global coordination.
My assessments in the previous section are extremely broad generalizations. They're akin to the claim that "girls are better at language than boys" -- true on average, but the distributions of individual measurements have huge overlap. Likewise with my statements about technology and social institutions: There are plenty of advances in each category that are very good and plenty that are very bad, and the specific impact of an activity may be very different from the average impact of the category of which it's a part. The main reason to generalize about categories as a whole is in order to make high-level assessments about policies, like "Should we support more funding of engineering programs in the US?" When evaluating a particular activity, like what you do for your career, a specific analysis of that activity will be far more helpful than just labeling it "technology" or "social science".
In Superintelligence (Ch. 14), Bostrom outlines reasons why faster hardware is likely to make AI control harder:
Artificial consciousness seems net harmful to advance because
Steve Grand defended his work toward artificially conscious creatures on the following grounds:
This is what I care about. I want to help us find out what it means to be conscious and I want to challenge people to ask difficult questions for themselves that they can’t do with natural life because of their unquestioned assumptions and prejudices. But we really are talking about creatures that are incredibly simple by natural standards. What I’m trying to explore is what it means to have an imagination. Not a rich one like humans have, but at all. The only way to find that out is to try to build one and see why it is needed and what it requires. And in doing so I can help people to ask questions about who they are, who other creatures are, and what it means to be alive. That’s not such a bad thing, is it?
This resembles an argument that Bostrom calls an instance of "second-guessing" in Ch. 14 of Superintelligence: basically, that in order to get people to take the risks of a technology seriously, you need to advance work on the technology, and it's better to do so while the technology has limited potential so as to bound risks. In other words, we should advance the technology before a "capability overhang" builds up that might yield more abrupt and dangerous progress in the technology. Bostrom and I are both skeptical. Armed with such a defense, one can justify any position on technological speed because either we (a) slow the technology to leave more time for reflection or (b) accelerate the technology so that others will take risks more seriously while the risks remain manageable.
In the case of artificial consciousness, we should advance the public discussion by focusing our energies on philosophy rather than on the technical details of building software minds. There's already enough technical work on artificial consciousness to fuel plenty of philosophical dialogue.
Improved social wisdom is positive-sum in terms of the resources it provides to different value systems: Because they know more, they can better accomplish each of their goals. They have more tools to extract value from their environment. However, it's not always the case that an action that improves the resources of many parties also improves the utility of each of those parties. Exceptions can happen when the goals of the parties conflict.
Take a toy example. Suppose Earth contained only Stone Age humans. One tribe of humans thought the Earth was beautiful in its untouched natural state. Another tribe felt that the Earth should be modified to better serve human economic interests. If these humans remained forever in the Stone Age, without greater wisdom, then the pro-preservation camp would have gotten its way by default. In contrast, if you increased the wisdom of both tribes -- equally or even with more wisdom for the pro-preservation tribe -- then it would now be at least possible for the pro-development tribe to succeed. Thus, despite a positive-sum increase in wisdom, the pro-preservation tribe is now worse off in expected utility.
However, this example is somewhat misleading. A main point of the present essay was to highlight the potential risks of greater technology, and one reason wisdom is beneficial is that it better allows both sides to cooperate and find solutions to reduce expected harms. For example, absent wisdom, the pro-development people might just start a war with the pro-preservation people, and if the pro-development side won, the pro-preservation side would have its values trashed. If instead both sides agreed to undertake modest development with safeguards for nature preservation, then each side could end up better off in expectation. This is an example of the positive-sum utility benefits that wisdom can bring.
Perhaps there are some examples where wisdom itself, not just technology, causes net harm to a certain ideology, but it seems like on the whole wisdom usually is positive-sum even in utility for many factions.
The main intuition why wisdom and related improvements should be positive-sum is that they hold constant the fraction of people with different values and instead distribute more "pie" to people with each set of values. This fractional view of power makes sense in certain contexts, such as in elections where the proportion of votes is relevant. However, in other contexts it seems that the absolute number of people with certain values is the more appropriate measure.
As an example, consider the cause of disaster shelters that serve to back up civilization following near-extinction-level catastrophes. Many altruists support disaster shelters because they want humanity to colonize space. Suffering reducers like me probably oppose disaster shelters because shelters increase the odds of space colonization without correspondingly increasing the odds of more humane values. If work towards disaster shelters is proportional to (# of people in favor) minus (# of people opposed), and if, say, 90% of people support them by default, then greater education might change
(10 in favor) minus (1 opposed) = 9 net
to
(1000 in favor) minus (100 opposed) = 900 net,
which is a 100-fold increase in resources for disaster shelters. This makes the suffering reducers worse off, so in this case, education was not positive-sum.
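The general pattern behind these numbers can be stated compactly (this is just a restatement of the example; f, N, and k are shorthand introduced for this restatement, not notation used elsewhere in the essay):

```latex
% N engaged people, a fraction f of whom favor disaster shelters:
\[ \text{net support} \;=\; Nf - N(1-f) \;=\; N(2f - 1). \]
% Education that scales engagement from N to kN while leaving f unchanged
% scales net support by the same factor k:
\[ kN(2f - 1) \;=\; k \cdot \text{net support}. \]
% In the example above, f is roughly 0.9 and k = 100, taking the net from 9 to 900.
```

Because education here multiplies support and opposition equally, it multiplies the net resources flowing to shelters rather than leaving them unchanged.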
My intuition that wisdom, education, cooperation, etc. are in general positive-sum presupposes that most of the work that people do as a result of those changes is intrinsically positive for both happiness increasers and suffering reducers. Disaster shelters seem to be a clear exception to this general trend, and I hope there aren't too many other exceptions. Suffering reducers should keep an eye out for other cases where seemingly positive-sum interventions can actually hurt their values.
The post Differential Intellectual Progress as a Positive-Sum Project appeared first on Center on Long-Term Risk.
The post Against Wishful Thinking appeared first on Center on Long-Term Risk.
Some people hold more hopeful beliefs about the world and the future than are justified. These include the feeling that life for wild animals isn't so bad and the expectation that humanity's future will reduce more suffering than it creates. By feeding these dreams, optimistic visions of suffering reduction, while noble, may in fact cause net harm. We should explore ways of increasing empathy that also expose the true extent of suffering in the world, e.g., information about factory farming, brutality in nature, and unfathomable amounts of suffering that may result from space colonization.
The man who is a pessimist before forty-eight knows too much; if he is an optimist after it he knows too little.--Mark Twain
I also think the future is where people project a lot of hopes. They're just less willing to be neutral about it. People are more willing to say, 'Yes, sad and terrible things happened in the past, but we get it. We once believed that our founding fathers were great people, and now we can see they were shits.' I guess that's so, but for the future their hopes are a little more hard to knock off.--Robin Hanson
Wishful thinking is the formation of beliefs and making decisions according to what might be pleasing to imagine instead of by appealing to evidence, rationality, or reality. Studies have consistently shown that holding all else equal, subjects will predict positive outcomes to be more likely than negative outcomes (see valence effect).
I'll mostly leave it to psychologists to debate the origins of wishful thinking. One possibility is that optimism makes one appear more competent than one actually is. Another is that it actually makes one more competent and successful through self-fulfilling beliefs. It could also be primarily an accident on the part of evolution: We feel bad when we imagine bad outcomes, so we cheat by not imagining them as much as is epistemically warranted.
I would guess that some people believe that life in the wild isn't so bad and that the future of humanity will be mostly peaches and cupcakes primarily because, well, it would be pretty depressing otherwise. Just look at how much people want to believe in heaven to see an example of justifying one's desired outlook without epistemic grounding. The transhumanist community has its fair share of starry-eyed disciples awaiting rapture into computational bliss.
There are related cognitive biases about excess positivity, such as optimism bias and rosy retrospection. Conversely, the hypothesis of depressive realism has some empirical support, though it remains controversial.
There's an additional source of positive bias in decisions, namely that those with the most power also tend to be people who haven't lived in abject conditions. Society's leaders might sometimes have low hedonic setpoints, and they might have gone through trials and stresses, but they usually haven't experienced torture, starvation, serious violence, or paralyzing mental illnesses. When we extend our scope to include animals, the contrast is even more stark. Humans have low infant mortality, long lives, and are at the top of the food chain. Many of us have regular meals, shelter, medical care, air conditioning, pain killers, and so on.1 Most animals in the wild are born, live a few days or weeks, and then die painfully of starvation, predation, disease, etc.
Many well-meaning altruists are working to reduce extinction risks in the hopes of a bright future for humanity. They hope our descendants will solve the problems of the world and build computational castles in the sky full of awesomeness. In the process, some optimists neglect the fact that the future of AI will probably be controlled by military, economic, and geopolitical forces, not by quixotic altruists. They may forget the possibility that Darwinian forces beyond our control will supersede human values. Some even assume that super-intelligence will lead to super-compassion, when there's actually no necessary relation, and in fact, most super-intelligences probably have no compassion at all.
Sometimes even negative utilitarians get swept along in enthusiasm for the future. David Pearce's Hedonistic Imperative predicts the end of suffering and the advent of gradients of bliss in a post-human paradise. David anticipates cosmic rescue missions to help suffering sentients on other planets. Unfortunately, in reality, the spread of computational power is likely to multiply suffering rather than to end it, because there will be astronomically more resources for running suffering computations.
But what's the harm of optimism? Why not let negative utilitarians allay some of their worries by wishing for a better tomorrow? In fact, maybe the hope of abolishing suffering will make people more motivated to work toward it? This may be true, but the cost is too high. Hope for the future means people will favor technological development and space colonization, with the aim of ensuring human survival and dispatching interstellar probes. David himself is an advocate for technological progress. Yet this may be precisely the wrong thing to support if we have the goal of reducing suffering.
Note that being realistic is not the same as being depressed or apathetic. We have enormous opportunity to reduce suffering on behalf of sentient creatures. However, we need to be careful about language. There is a lot we can do, but even if we try our hardest, the future will still look very bleak. Our efforts alone will almost certainly not flip the expected sign of the value of the future.2
If unwarranted optimism about the future may be net harmful, then pessimism may be net beneficial. In particular, we might hypothesize that it would be useful to expose people to the reality of suffering that the world contains and that the future may multiply. It would be good to test this hypothesis at some point.
It seems that many people who care a lot about suffering have experienced firsthand how overwhelming it can be. It's an interesting question to ponder whether some amount of suffering helps to inspire compassion or whether physical and mental pain instead make people more selfish, spiteful, and apathetic.
Either way, it seems plausible that some activities probably do elicit more empathy and horror at suffering than they prevent. For example, veg outreach that exposes life on factory farms seems like an excellent way to begin to demonstrate just how much suffering there is in the world. It's easy enough to forget this if we live in a bubble of affluent, mostly cheerful humans. We can extend this appreciation of the extent of suffering by pointing to wild animals, noting that almost all offspring of many species die, perhaps painfully, just a few days after being born.
Thus, it seems that discussing animal suffering in the right way can serve as a reality check against excess optimism, in stark contrast to promoting a Hedonistic Imperative vision that confuses the marginal impact of our efforts to make the world better with the overall probability that the world actually will become a delightful place for all. Highlighting the severity of suffering in nature can suggest one of many risks in colonizing space.
One case where the cause of wild-animal suffering could cause problems is if people assume that we need humans to stick around because they're the only hope for quintillions of potentially suffering insects on the planet for the next billion years. While this is true, (a) it's not clear that humans actually would decrease wild-animal suffering in the future, and (b) even if they did, the benefit of doing so is small in comparison with the potential damage that would result from spreading into space. While the expected value of promoting concern for wild animals is highly positive, this doesn't mean the overall probability of convincing the world to come around to our position on this matter is close to 1.
What are some other interventions that would help to expose the extent of suffering in our world and the even greater expected magnitudes in our future?
"Sunny Days Spark Stock Market Optimism," USA Today Magazine, 2001. (back)
But (ΔB)(ΔP) is too small to matter relative to the other terms. That is to say, the effect of the interaction between your efforts to make the future better and your effect on whether there is a future is small enough to ignore.
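Spelled out, the implied setup seems to be the following (a reconstruction on my part: B stands for how good the future is if it occurs, P for the probability that it occurs, and ΔB, ΔP for the changes your efforts make to each):

```latex
% Expected value of the future before and after your efforts:
\[ V = B \cdot P, \qquad
   V' = (B + \Delta B)(P + \Delta P) = BP + B\,\Delta P + P\,\Delta B + \Delta B\,\Delta P. \]
% Your contribution is B\,\Delta P + P\,\Delta B + \Delta B\,\Delta P, and the
% interaction term \Delta B\,\Delta P is second-order small compared with the other two.
```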
Here I'm speaking at the level of what an individual or group can accomplish: If you and your friends work on making the future better, that's a tiny dent in what humanity as a whole does. Say there are only 100 million people seriously affecting the trajectory of the future. Even if you recruit 100 friends to join your project, that's an influence of 1 in a million. The net expected value of the future would have to be on a knife's edge, slightly negative but almost zero, in order for your work to flip the sign around. Maybe you could expect to have slightly more influence than that based on your correlation with other actors, i.e., if you do X, it means more people like you are doing X because of similarities of your brains, even if you don't coordinate. Still, I don't think it's reasonable to assume this correlation is appreciably large relative to 100 million people in total.
What if we instead look from the perspective of humanity as a whole? Could all of humanity flip the sign of the future around? This is more plausible, but it's still not obvious. Most of human history is determined by evolutionary, economic, political, and social forces beyond the control of individuals. Genetic pressures and human-organizational systems drive society in directions that are hard to stop. We can exert some control over these forces, but it's not clear we can exert enough, and even if we could, it's unlikely we could predict how those forces would unravel precisely enough to know if we were doing the right thing. This is not to say we can't assess some actions as being more likely good than others -- just that the extent to which we can is limited, and it may not be enough to flip around the expected sign of human expansion into the galaxy even if we could strongly influence the whole world.
Anyway, even if humanity as a whole could flip around the expected sign of the future by its efforts, that's not the right question to ask. What matters is just what you can accomplish. It is basically impossible that the actions that you individually take could change the expected sign of the future. (back)
The post Against Wishful Thinking appeared first on Center on Long-Term Risk.
The post A Lower Bound on the Importance of Promoting Cooperation appeared first on Center on Long-Term Risk.
Compromise has the potential to benefit most value systems in expectation, by allowing each side in a dispute to get more of what it wants than its fractional share of power. This is wonderful, but how much could compromise matter? In this piece I suggest a Fermi calculation for a lower bound on how much suffering might be prevented by working to promote compromise. The estimates that I use for each variable are more conservative than I think is likely to be the case.
I do not think a Fermi calculation like I describe below is the best approach for evaluating relative cost-effectiveness. This calculation traces one specific, highly conjunctive branch in the vast space of possible branches for how the future might unfold. Most of the expected impact of promoting compromise probably comes from branches that I'm ignoring.
Likewise, activities other than promoting compromise also have many flow-through effects on many different possible future branches. Comparing projects to shape the future requires much more than a single Fermi calculation. We should use additional quantitative and qualitative estimates across many models, as well as general heuristics. One of the strongest arguments for promoting compromise is not that it dominates in a Fermi calculation (probably it doesn't) but that "increasing the pie" for many value systems is generally a good idea and seems more robustly positive than almost anything else.
That said, explicit and detailed Fermi estimates can help to clarify our thinking and identify holes, and this is one reason for undertaking the exercise.
Suppose the following parameter estimates. Remember, these are designed to be conservatively low, not most likely. The estimates in each bullet are conditional probabilities given the outcomes from the previous bullets.
Given these parameter settings, a lower bound on the fraction of future suffering reduced per person-year of work to promote cooperation is
40% * 20% * 5% * 5% * 10% * 10% * [1/(10 billion)] * (1/1000) * 10% * 0.3 = 6 * 10^-21.
The expected number of suffering-years in our hands would then be
10^-5 * 10^48 * 10^-8 * 10^-10 = 10^25.
Multiplying 6 * 10^-21 by 10^25 gives 60,000 expected suffering-years that we can prevent per year of work to promote compromise. Assuming a year of work means 40 hours per week for 50 weeks, this is (60,000)/(40*50) = 30 suffering-years per hour, or 0.5 per minute.
To convert this into a per-dollar estimate, suppose it would take $150K per year to pay someone to work on compromise, assuming that person would otherwise have done something unrelated and altruistically neutral. This figure is very high for a nonprofit salary, but if someone is willing to work for a lot less, chances are she's already committed to the cause and would have a high opportunity cost, because she could be earning to give instead. In order to attract talented people who would otherwise do altruistically neutral work, a high salary would be required. And remember, this is a conservative calculation. 60,000 expected suffering-years divided by $150K is ~150 suffering-days prevented per dollar. (Here I'm ignoring the fact that future labor-years should be cheaper in present dollars assuming investment returns outpace increases in wages.)
It's important to remember just how imprecise these particular numbers are. For instance, if I had taken the anthropic discount factor to be 10^-5 instead of 10^-10, we would have had 6 billion suffering-years prevented per year of work, or 40,000 suffering-years prevented per dollar.
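For readers who want to check the arithmetic, here is a minimal script that reproduces the figures above (the variable names are mine; the inputs are simply the conservative placeholders from the calculation):

```python
# Reproduce the Fermi estimate above. All inputs are the conservative
# placeholder values from the text, not independently derived numbers.
factors = [0.40, 0.20, 0.05, 0.05, 0.10, 0.10,
           1 / 10_000_000_000,  # 1/(10 billion)
           1 / 1000,
           0.10, 0.30]

fraction_per_work_year = 1.0
for f in factors:
    fraction_per_work_year *= f               # ~6e-21

expected_suffering_years = 1e-5 * 1e48 * 1e-8 * 1e-10   # 1e25

per_work_year = fraction_per_work_year * expected_suffering_years  # ~60,000 suffering-years
per_hour = per_work_year / (40 * 50)                               # ~30
per_dollar_in_days = per_work_year / 150_000 * 365                 # ~146 days, i.e. ~150

# Sensitivity: an anthropic discount of 1e-5 instead of 1e-10 scales everything
# by 1e5, giving ~6 billion suffering-years per work-year, or ~40,000 per dollar.

print(fraction_per_work_year, per_work_year, per_hour, per_dollar_in_days)
```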
This scenario assumed a bound of 10^48 experience-years, but there's some chance physics is other than we think and allows for obscenely higher amounts of computation. Indeed, there's a nonzero probability that infinite computation is possible, implying infinite future suffering. Our calculation would then blow up to say that every second spent on promoting compromise prevents infinite expected suffering.
A few thoughts on this:
I've taken pains to clarify that the calculation in this piece is hardly exhaustive of why cooperation is important but only scratches the surface with one concrete scenario. There are many other reasons for suffering reducers to support international cooperation:
Putting our descendants in a better position to address challenges is useful even if strong AI and space colonization never materialize. Even if humans just continue on Earth for a few million years more, cooperation still improves our trajectory. Of course, this case involves vastly less suffering for us to mitigate, and what we do now may not have a significant impact on what happens tens of thousands of years hence absent goal-preserving AI, so this scenario is negligible in the overall calculations, but those who feel nervous about tiny probabilities of massive impacts would appreciate this consideration. That said, if our only concern was about Earth in the very short term, then plausibly other interventions would appear more promising.
There's a general argument that we should focus on far-future scenarios even if they seem unlikely to materialize due to anthropic considerations because of value of information. In particular, suppose there were two main scenarios to which we assigned equal prior probability before anthropic updating: ShortLived, where humanity lasts only a few more centuries, and LongLived, where humanity lasts billions more years. Say LongLived has N times as many experience-moments as ShortLived and so is N times as important. Correspondingly, the anthropic-adjusted probability of LongLived might, under certain views of anthropics, tend toward 1/N. The expected value of ShortLived is (probability)*(value) = (roughly 1)*(1) = 1 compared against an expected value for LongLived of (probability)*(value) = (1/N)*N = 1. So it's not clear whether to focus on short-term actions (e.g., reducing wild-animal suffering in the coming centuries) or long-term actions (e.g., promoting international cooperation, good governance, and philosophical wisdom in order to improve the seed conditions for the AI that colonizes our galaxy).
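In symbols, the comparison in the previous paragraph is simply (restating the arithmetic, with N as defined above):

```latex
% Equal priors; LongLived contains N times as many experience-moments, but its
% anthropically adjusted probability is taken to be roughly 1/N:
\[ \mathrm{EV}(\text{ShortLived}) \approx 1 \times 1 = 1,
   \qquad
   \mathrm{EV}(\text{LongLived}) \approx \tfrac{1}{N} \times N = 1. \]
```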
When we consider value of information, it pushes toward longer-term actions because they leave open the option of returning to focus on short-term actions if further analysis leads to that conclusion. To make the explanation simple, imagine that halfway through the expected lifetime of humanity given ShortLived, altruists reassessed their plans to decide if they should continue doing actions targeting LongLived futures or if they should focus instead on ShortLived futures. For the sake of clarity, imagine that at this juncture, they have perfect knowledge about whether to focus on short-term or long-term futures. If long-term futures were best to focus on, they would have already been doing the right thing so far and could stick with it. If short-term futures were more important, they could switch to working on short-term futures for the remaining half of humanity's lifetime and still get half the total value they would have gotten had they worked on short-term issues from the beginning.1
Of course, a reverse situation could also be true: start focusing on short-term futures and then re-evaluate to decide whether to focus on long-term futures halfway. The difference is that if people have focused on long-term futures from the beginning, they'll have more wisdom and capacity at the halfway point to make this evaluation. This is an instance of the general argument for frontloading wisdom and analysis early and then acting later. Of course, there are plenty of exceptions to this -- for instance, maybe by not acting early, people lose motivation to act altruistically at all. This general conceptual point is not airtight but merely suggestive.
In personal communication, Will MacAskill made a similar argument about "option value" in a related context and thereby partly inspired this section. Needless to say, there are other considerations besides option value in both directions. For instance, there's greater entropy between our actions now and the quality of experience-moments billions of years from now (though a nontrivial probability of a pretty small entropy, assuming we influence a goal-preserving or otherwise politically stable outcome). Meanwhile, experience-moments of the future may have greater intensity, so the stakes may be higher.
Finally, as was hinted in the Fermi calculation, we could fudge a way to make the far future dominate by saying there's a nontrivial probability that our anthropic discount is wrong and that the future really is as important as it seems naively. This may work, though it also feels suspicious because similar sorts of model-uncertainty arguments could be invoked to justify lots of weird considerations dominating our calculations. The importance of the far future seems one of the more robust sentiments among intelligent thinkers, though, so the fudge feels less hacky in this case.
The post A Lower Bound on the Importance of Promoting Cooperation appeared first on Center on Long-Term Risk.
The post A Dialogue on Suffering Subroutines appeared first on Center on Long-Term Risk.
Alice: Greetings, Brian. I heard that you're concerned about the possibility of what you call "suffering subroutines." You say that artificial intelligences (AIs) in the future -- whether human-inspired or paperclipping -- might run immense numbers of computations that we may consider to be conscious suffering. I find this hard to believe. I mean, why wouldn't instrumental computations be just dumb components of an unfeeling system?
Brian: They might be, but they might not be, and the latter possibility is important to consider. As one general point, note that sentience evolved on Earth (possibly more than once), so it seems like a useful algorithmic construct.
Alice: Sure, sentience was useful on Earth for competitive organisms, but in the subroutines of an AI, every computation is subserving the same goal. Processes are allocated computing resources "to each according to his needs, from each according to his talents," as Dan Dennett observed.
Brian: Possibly. But in that same talk you cite, Dennett goes on to explain that computing processes in the human brain may be competitive rather than cooperative, Darwinian rather than Marxist. Dennett proffers a hypothesis that "centrally planned economies don't work, and neither do centrally coordinated, top-down brains."
Alice: Meh, if that's true, it's probably a vestige of the way evolution came up with the brain. Surely an orderly process could be designed to reduce the wasted energy of competition, and since this would have efficiency advantages, it would be a convergent outcome.
Brian: That's not obvious. Evolutionary algorithms are widely useful and not always replaceable by something else. In any event, maybe non-Darwinian processes could also consciously suffer.
Alice: Umm, example, please?
Brian: It seems plausible that many accounts of consciousness would include non-evolved agents under their umbrellas. Take the global-workspace theory for instance. Are you familiar with that?
Alice: Do explain.
Brian: In the so-called LIDA implementation of the global-workspace model, a cognitive cycle includes the following components:
This diagram lays out the various components.
Note that Stan Franklin, one of the managers of the LIDA project, believes that the earlier version of his system, IDA, is "functionally conscious" but not "phenomenally conscious," as he explains in his 2003 paper, "IDA, a Conscious Artifact?" This seems to stem from his tentative agreement with David Chalmers about the hard problem of consciousness (see p. 10). Because I believe this view is confused, I think the functional consciousness under discussion here is also phenomenal consciousness.
Alice: I see. So why is this relevant?
Brian: If consciousness is this kind of "global news broadcasting," it seems to be a fairly common sort of operation. I mean, one obvious example is the news itself: Stories compete for worthiness to be aired, and the most important news segments are broadcast, one at a time, to viewers who then update their memories and actions in response. Then new things happen in the world, new stories take place, many reporters investigate them in parallel, they compete to get their stories aired, and the cycle happens again. "Emotions" and "dispositions" may modulate this process -- for instance, more conservative news agencies will prefer to air different stories than liberal news agencies, and the resulting messages will be biased in the direction of the given ideology. Likewise, the national mood at a given moment may cause some stories to be more relevant than they would have been given a different mood. People who care a lot about the news stories they hear may get in touch with the news agencies to give feedback and engage in further coordination ("reentrant signaling"). And so on.
Of course, we see analogous behavior in other places as well:
Note that some of these systems are not competitive, and so the claim that lack of Darwinian competition means lack of conscious suffering may not be accurate.
These analogies can actually give insight into why consciousness emerges in brains: It's for a similar reason as why national and global news emerges in human society. Global broadcasting of the most important events and knowledge serves to keep every part of the social system up to date, in sync, aware of required actions (e.g., hurricane warnings, voting days), and alerted to global searches ("Have you seen this crime suspect?") and coordination. As the "Global Workspace Dynamics" paper says: "What is the use of binding and broadcasting in the [cortico-thalamic] C-T system? One function is to update numerous brain systems to keep up with the fleeting present." This process of serializing the most salient updates for global broadcasting may ultimately create a more effective society (organism) than if every local community were isolated and reacted independently with parochial ("unconscious") reflexes or using only tribal knowledge. When a broadcast becomes globally conscious, it's available to all regions, including the verbal/speech centers of a person for conscious report (or to the writers/bloggers of society for verbalization in text). Events in illiterate farming communities would be "unconscious" to the world without journalists who visit to help broadcast those stories. The world can become more conscious of its memories when historians uncover and share information about past events. And the spotlight of attention shifts based on the most emotionally salient events that happen. In general, fast, global network communication over radio, television, and the Internet is making the world more conscious of itself, in a surprisingly literal sense.
Why do we only care about conscious experiences? For instance, we'd be horrified to undergo conscious surgery but don't mind undergoing surgery while anaesthetized. Presumably it's because the parts of us that "care about" our experiences -- such as by reacting aversively, triggering stress feelings, planning ways to avoid the experience, encoding traumatic memories, and so on -- only know about the damaging stimuli when they become conscious. Typically a strong negative stimulus will win competitions to be consciously broadcast, but when anaesthesia blocks pathways from nociceptors to access by the full brain, it prevents the suite of "caring about" responses that would ordinarily be triggered. An analogy in the social realm is that society cares about and responds to negative events when they're reported in the media, but if scandals are covered up or reporters are prevented from talking about atrocities, this is like applying local anaesthesia.
More often, neglect of certain harms and focus on other types of harms is built in to the system. For instance, a sliver in your eye would hurt vastly more than a sliver in your leg because your eye has many more nerve endings. Similarly, a rich person killed in the United States would attract far more attention and response than a poor person killed in Africa because there are vastly more reporters covering the former, and the story about the rich American would seem more salient to the readers (neurons) who vote it up on Twitter.
Another analogy between consciousness and news reporting is that in both cases, once an object enters the spotlight of attention, other events in that spotlight can come to attention that would have otherwise remained hidden. For example, suppose your leg itches, causing you to focus your consciousness on your leg. That may allow you to then feel the breeze on your leg as well, whereas you otherwise would have filtered out that information from your awareness. Likewise, when a news story about X surfaces, this often leads to investigations into other stories Y and Z that relate to X, and stories about V and W that previously would have been ignored become "newsworthy". As an example, following the pool party incident in McKinney, Texas on 5 Jun. 2015, a series of other news stories about McKinney, Texas also became national headlines, whereas previously those kinds of stories wouldn't have reached beyond the local news.
I haven't explored interpretations of the processes mentioned above according to other models of consciousness, but I expect you'd find that systems like these would be at least somewhat conscious in those frameworks as well. In general, most accounts of what consciousness is appeal to general principles that don't go away when neurons stop being involved.
And beyond consciousness, we can see other mind-like processes at play in many systems. Take memory for example. Apparently memories consist of neural connections that become strengthened by repeated use, and they fade as the connections decay. This reminds me of a series of dirt roads through a town. They're first created by some event, they become strengthened with use, and they revert back to wilderness with disuse. A road that hasn't been traveled on in years may become overrun by returning vegetation, but it can still be re-activated more easily than creating a new road from scratch somewhere else. And like with neural connections, a stronger road allows things to flow more easily between the regions it connects.
Alice: Are you really saying that news reports and stock exchanges are conscious? And that roads have memory?
Brian: I don't know.1 But I think we should take the possibility seriously. In any case, it could be that future computational systems contain more human-like entities. For instance, suppose an AI wants to undertake a research program on string theory, to better update its models of physics. It will partition some fraction of computing power to that project. It may want to parallelize the work for speed, so it might create lots of different "research teams" that work on the problem separately and publish their results to others. These teams might compete for "grant money" (i.e., additional computing resources) by trying to produce high-quality findings better than the other teams. These components might be sufficiently agent-like as to evoke our moral concern.
The process of intelligently searching the space of possibilities based on value assessments is a general phenomenon. Animals search through a field until they find a lush patch of strawberries; then they experience reward at the discovery and focus their efforts there for a while. Humans, too, feel reward while trying to figure things out. For instance, V.S. Ramachandran's peekaboo principle is based on the idea that humans receive little squirts of pleasure every time they unpack a small piece of a puzzle, and these "mini aha" moments motivate them to keep going. Perhaps there would be a similar process at play for an AI's research teams. When a small discovery is made, the good news is broadcast throughout the team, and this encourages more actions like what led to the discovery.
As I stated it, this model suggests something akin to David Pearce's gradients of bliss because the rewards for research discoveries are positive. But perhaps the system would use gradients of agony, with research findings being rewarded by temporary relief from discomfort. If there is a possibility for choice between a "gradients of bliss" and a "gradients of agony" design to achieve roughly similar computational ends, this suggests room for humane concerns to make a meaningful difference.
As another illustration, consider economics. Under both capitalism and communism, we see the emergence of hierarchical forms of organization. The CEO of a corporation seems like a decent model for the conscious control center of a brain: The workers perform their duties away from its sight, and then the most important news about the company is bubbled up to the CEO's desk. Then the CEO broadcasts updates to all the workers, including compensation rewards, which adjust worker action inclinations. The company also stores records of these events for later use. The most important ("emotionally salient") historical memories are better preserved, and less relevant ones slowly decay with time. This whole process mimics the global-workspace theory in broad outline. And the fact that hierarchies of this type have emerged in all kinds of governmental and economic systems suggests that they may be common even among the construction workers and researchers of an AI.
Alice: Hmm, maybe. But if there are just a few companies that the AI is directing, that's not a lot of conscious minds. Maybe these suffering corporations are then not a big deal, relative to much larger numbers of suffering wild animals, etc. What's more, the clock speed of a corporate consciousness would be glacial compared with that of an animal.
Brian: Well, even if we don't weight by brain size, who says corporations are the only parts of this process that are conscious? Hierarchical organization is a recurrent pattern of organized systems in general. It could happen at the highest level -- the executive AI controlling its component corporations -- but it would also happen in a fractal way at many lower layers too: Each corporation is composed of subdivisions, each subdivision has its own subdivisions, etc. At some point we might hit a level of entities analogous to "workers." Even below that might be the subcomponent coalitions of an individual worker's brain, which compete for attention by the worker brain's executive-control system. Each of these could have consciousness-like components. And their clock speeds would be quite fast.
One concept in the LIDA model is that of a "codelet," which one page defines as
tiny agents, carrying small pieces of code (hence the name). They can be interpreted as being a small part of a process, but then leading its own life, very much like an ant is a small part of a "process" to gather food, to defend the anthill or to nurture newborns. They run in parallel [...], and none are indispensable.
[...] The entity calling the codelet will estimate its urgency (reflecting the promise of further investigation). Highly urgent codelets can preempt lower urgency codelets [...], and if a codelet's urgency sinks well below that of other's, it just dies out, leaving computer resources to more ambitious codelets. If a codelet sees it has no real work to do in the current situation (due to a bad estimation or changed situation), it sizzles.
It's plausible that individual ants are conscious. So too, maybe even tiny components of an individual worker's brain could be seen as conscious.
Alice: But if a larger consciousness contains many smaller consciousnesses, which each contain many smaller consciousnesses, how do we count them? What are the weights? Do the lowest-level consciousnesses dominate? This discussion is getting curiouser and curiouser!
Brian: Indeed. But these are issues that we need to resolve at some point. To some extent I'm punting the question to our more intelligent descendants. Still, it's useful to realize that suffering subroutines could be a big deal in the future, so that we don't naively reach conclusions based on a circumscribed view of what we might care about.
Alice: From the standpoint of "consciousness as broadcasting," do you think insects are conscious?
Brian: It's an important question. It certainly seems plausible that insects would have some sort of LIDA-like cognitive cycle: Inputs, unconscious processing, most important insights bubble up and are distributed, and they affect action inclinations. Even if this kind of architecture didn't exist exactly, we might see adumbrations of it in whatever insects do. I mean, for example, if one part of the brain communicates its insights to several other parts of the brain, even if not globally, isn't this like a mini-broadcast? Isn't that sort of like consciousness already? In general, any kind of communication-and-updating process would have shadows of the operation that we think of as consciousness. This illustrates my more general point that consciousness comes in gradations -- there's not a single cutoff point where what was unconscious matter suddenly has the lights come on. There are just atoms moving in various ways, and some of them activate our sympathies more than others.
Alice: Well, that raises a question: If we can care about whatever we like, however much we like, why shouldn't I just care about humans and maybe some animals, and forget about these suffering subroutines entirely?
Brian: You can do that, and perhaps we would choose to upon reflection. I don't know what the best criteria are for carving out our caring-about function. But it seems plausible that algorithms are a big part of it, and then when we see processes that resemble these algorithms somewhere else, it raises the question of why we care about them in some forms but not others. I don't know where our hearts will ultimately fall on the matter.
Alice: Do you think even basic physics might contain consciousness?
Brian: I don't know. I hope not, but I wouldn't rule it out. Giulio Tononi's "phi" postulates that even an electron has a nonzero measure of consciousness, for instance.
With the global-workspace model, maybe we could see elementary particles as broadcasting information that then influences other regions -- e.g., the nuclear reactions in the sun broadcast photons, and the sun's mass pulls other objects toward it. But it's not clear that any real "agent" process is going on here. Where are the learning, action selection, memories, etc.? So naively it seems like these kinds of dead physical things aren't conscious, but maybe I'm not looking at them right, and maybe we'll discover ways in which there are agents even in the math of microscopic physics.
Alice: Speaking of math, do you think Darwinism could ultimately go the way of the dodo? I mean, Darwinian competition is just an attempt at hill climbing in a high-dimensional space. But sometimes we have mathematical tools that let us perform exact optimizations without needing to "guess and check." Could intelligence ultimately be reduced to a series of really big mathematical optimization problems that can be solved (at least somewhat) analytically, thereby averting a lot of this expensive computation of agent-like things? Similarly, reinforcement learning is direct adaptive optimal control, but optimal-control problems can potentially be solved by analytic methods like the Bellman equations if you know the payoffs and transition probabilities ahead of time.
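For reference, the Bellman optimality equation Alice alludes to can be written in standard textbook notation as follows; nothing here is specific to the dialogue:

```latex
% s = state, a = action, P = known transition probabilities, R = known reward,
% \gamma = discount factor:
\[ V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]. \]
```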
Brian: Maybe, though it does seem hard to imagine that we could analytically solve some of these really specific, data-driven problems without computing in detail the process being modeled. Perhaps this just reflects lack of imagination on my part, and of course, there are times when macro-scale approximations can remain ignorant of microfoundations. In any case, the actions of the AI on the galaxy to implement its goals would still require lots of real, physical manipulation -- e.g., supervisors to coordinate workers in building solar colonies and such. The possibility you cite is fun to speculate on, but it's not sufficiently probable to substantially affect the concern about suffering subroutines, given that consciousness-like processes seem to be such a convergent feature of organized systems so far.
Alice: Do ecosystems suffer? Could this broad view of consciousness provide some vindication of the otherwise seemingly absurd idea that nature as a whole can have moral standing apart from the welfare of the individuals it contains?
Brian: In principle it's certainly possible ecosystems could contain shadows of consciousness, but it's not clear they usually do. Where is the global broadcasting? What are the action components that are being updated? Maybe you could come up with some interpretations. Even if so, it's not clear what an ecosystem wants. Unlike corporations or ants, ecosystems don't have clear goals. Even if we identified a goal, it wouldn't necessarily align with conservationism; it might go the other way. In any event, even if an ecosystem's consciousness did align with conservationism, it's dubious whether the interests of the ecosystem as a whole could outweigh those of quintillions of suffering individual animals within it.
If we think ecosystems can suffer, then a natural way to prevent future suffering is to have fewer of them. Even if we adopted the stance from environmental ethics of considering ecosystems objects of intrinsic moral importance regardless of sentience, it's not obvious that ecosystems are inherently good; we might think they're inherently bad. This kind of "negative environmental ethics" seems a natural idea for a negative-leaning consequentialist.
Alice: Yeah. Maybe one suggestion could be that the atmospheric CO2 levels are a global signal broadcast to all subcomponents of the biosphere. This then causes (very small) changes in the behavior of animals, plants, and inorganic entities like sea ice. The responses of these entities then have an impact back on CO2 levels, which are then broadcast globally. I guess in this model, the broadcasts are continuous rather than discrete.
Brian: I suppose that's one interpretation you could make, though what would be the valence of CO2? In the stock-trading example, we could say that for the subset of traders that are net long in the security, an increase in the stock price would have positive valence. What about for CO2?
Alice: Maybe those organisms that do better with more CO2 would receive it with positive valence, and vice versa? The "learning" of the ecosystem would then be strengthening those organisms that do well with higher CO2, just like the dopaminergic learning of an animal involves strengthening connections for action neurons that just fired given the current context.
Brian: Ok, I can kind of see that, although in the case of dopamine, the action neurons were responsible for bringing the reward; in the case of the atmosphere, a whole bunch of stuff brought the increase in CO2 levels, and it wasn't necessarily the organisms that benefit from CO2 who were responsible for the emissions. Indeed, people often remark how humans in the global South are "punished" by the CO2 emissions of those in the global North.
Anyway, even if we did consider the carbon cycle somewhat analogous to a brain, keep in mind that the clock speed of this operation is really slow. Of course, since CO2 changes are continuous rather than coming in discrete pulses, the idea of a clock speed isn't really appropriate, but I guess we can still create a rough notion about cycles per year of relevant operations.
Alice: And of course, at the same time, we could have H2O levels as another currency of the biosphere, and temperature as another, and so on. There would be multiple broadcasting systems at play.
Brian: Right. In general, we can pattern-match a complex process in many different ways as being composed of many different systems that each have some resemblance to consciousness. This actually returns us to the old puzzle in the philosophy of computationalism: What is a computation, anyway? One answer is that we see various physical processes as resembling various computations to various degrees, and we can then care about them in proportion to their resemblance. The same thing is going on here -- only, this is not John Searle's toy Wordstar program in the wall but a genuine instance of seeing consciousness-like operations in various places. It's like pareidolia for our empathy systems.
Personally, I don't really care intrinsically about the Earth's carbon cycle, water cycle, etc. to any appreciable degree. I think the connection to animal minds is a pretty far stretch.
Alice: Yes. Moreover, the way we've been discussing consciousness has been pretty simple and crude. There may be important pieces of the puzzle that we've neglected, and we might consider these important as well for making an entity conscious in a way that matters.
Brian: Agreed! This should not be taken as the end of the conversation but only the beginning.
There are many more similarities between operations in our brains and phenomena in the worlds of politics, physics, etc. Sebastian Seung's book Connectome provides several additional comparisons. One of my friends remarked on Seung's work: "I don't think I've ever read a book with so many illuminating analogies!" While most of Seung's readers presumably see these analogies as merely didactic aids, I would suggest that they might also have moral significance if we care about brain-like processes in non-brain places.
Schwitzgebel (2016) reaches a similar conclusion as the previous dialogue did:
unity of organization in a complex system plausibly requires some high-level self-representation or broad systemic information sharing. [...] most current scientific approaches to consciousness [...] associate consciousness with some sort of broad information sharing -- a "global workspace" or "fame in the brain" or "availability to working memory" or "higher-order" self-representation. On such views, we would expect a state of an intelligent system to be conscious if its content is available to the entity's other subsystems and/or reportable in some sort of "introspective" summary. For example, if a large AI knew, about its own processing of lightwave input, that it was representing huge amounts of light in the visible spectrum from direction alpha, and if the AI could report that fact to other AIs, and if the AI could accordingly modulate the processing of some of its non-visual subsystems (its long-term goal processing, its processing of sound wave information, its processing of linguistic input), then on theories of this general sort, its representation "lots of visible light from that direction!" would be conscious. And we ought probably to expect that large general AI systems would have the capacity to monitor their own states and distribute selected information widely. Otherwise, it's unlikely that such a system could act coherently over the long term. Its left hand wouldn't know what its right hand is doing.
In response to the 2015 Edge question, "What do you think about machines that think?", Thomas Metzinger explored a question similar to the one addressed in the dialogue above: Will AIs necessarily suffer, or could they be intelligent without suffering? Metzinger doesn't give a firm answer, but he enumerates four conditions that he believes are necessary for suffering:
This list is interesting and provides four helpful criteria that may enhance a holistic conception of suffering, but in my opinion these criteria are neither necessary nor exhaustive. I would consider them like four principles that one might propose for the meaning of "justice" -- a concept sufficiently complex that probably no four concrete criteria by themselves can define it.
Let's see why each of these conditions is not strictly necessary. The most straightforward is probably #2, since it seems easy to imagine being in pain without engaging sufficient cognition to attribute that pain to yourself. Suffering can be a flood of "badness" feeling, which needn't be sufficiently differentiated that one recognizes that the badness is an experience on the part of oneself. For instance, a depressed person might feel that the whole world is bad -- that there's just a general badness going on.
#4 also doesn't seem necessary, because it can still be morally disvaluable if someone is uncertain whether he's in agony. For instance, suppose you step on something. You're not sure whether the object has punctured the skin of your foot. You think you might feel some sharp pain in your foot, but you're not sure if it's actually there or just imagined, until you actually look at your foot and see the sharp object. (Michael Tye offers a similar example.) I'm not sure what Metzinger would think of this case. In any event, it seems that transparency is actually quite easy to satisfy. It takes a complex cognitive system to produce doubts about experiences. Simple agents should generally have transparent emotions.
As far as #1, I think all systems are at least marginally conscious, so even if condition #1 is necessary, it's always satisfied. Of course, the degree of consciousness of a system matters enormously, but Metzinger's piece seems to be asking whether particular AIs would suffer at all.
As far as #3, I agree that valence plays an important, perhaps central, role in human suffering. This valence might prototypically be the reward part of a reinforcement-learning (RL) system. If one insists that valence can only make sense in the context of a rigid definition of RL, then I agree that not all AIs would have valence (although many still would, given the importance of RL for autonomous behavior). But if we interpret negative valence more broadly as "information indicating that something should be avoided", or even more compactly as "information that produces avoidance", then this kind of operation can be seen in many more systems, including non-learning agents that merely follow fixed stimulus-response rules. Indeed, the basic template of one physical event causing another avoidance-like event runs as deep as the interactions of fundamental physical particles, if we take enough of a high-level view and don't insist on greater complexity in our definition.
Overall, I find Metzinger's criteria too narrow. They leave out vast numbers of simpler systems that I think still deserve some ethical consideration. Nonetheless, I appreciate that Metzinger's proposals enrich our conceptualization of more complex suffering.
The Onion has a humorous article, "Scientists Confident Artificially Intelligent Machines Can Be Programmed To Be Lenient Slave Masters," in which AI researchers discuss the goal of shaping AI trajectories in such a way that AIs treat their human workers (what I might call "suffering human subroutines") more humanely. I find it extremely implausible that AIs would actually use human laborers in the long run, but they plausibly would use conscious worker agents of some sort -- both sophisticated scientist/engineer subroutines and other simpler subroutines of the kind discussed in this piece.
Unlike human laborers, these subroutines would presumably enjoy working as hard as possible on the task at hand. Humans evolved to dislike exertion as a way to conserve energy except when required, but robots built to carry out a given task would be optimized to want to carry out exactly that task. That said, more sophisticated digital agents might, like humans, feel mild unpleasantness if they expended time or energy on fruitless activities. For instance, a robot should dislike moving around and thereby draining its battery unless it thinks doing so will conduce to achieving a reward.
I learned of the idea that suffering subroutines might be ethically relevant from Carl Shulman in 2009. In response to this piece, Carl added:
Of course, there can be smiling happy subroutines too! Brian does eventually get around to mentioning "gradients of bliss", but this isn't a general reason for expecting the world to be worse, if you count positive experiences too.
I would say "sentient subroutines."
Some examples in this piece were also partly inspired from a post by Ben West, linking to Eric Schwitzgebel's "If Materialism Is True, the United States Is Probably Conscious," which I discuss more in another piece.
I coined the phrase "suffering subroutines" in a 2011 post on Felicifia. I chose the alliteration because it went nicely with "sentient simulations," giving a convenient abbreviation (SSSS) to the conjunction of the two concepts. I define sentient simulations as explicit models of organisms that are accurate enough to count as conscious, while suffering subroutines are incidental computational processes that nonetheless may matter morally. Sentient synthetic artificial-life agents are somewhere on the border between these categories, depending on whether they're used for psychology experiments or entertainment (sentient simulations) vs. whether they're used for optimization or other industrial processes (suffering subroutines).
It appears that Meghan Winsby (coincidentally?) used the same "suffering subroutines" phrase in an excellent 2013 paper: "Suffering Subroutines: On the Humanity of Making a Computer that Feels Pain." It seems that her usage may refer to what I call sentient simulations, or it may refer to general artificial suffering of either type.
The post A Dialogue on Suffering Subroutines appeared first on Center on Long-Term Risk.
The post Lexicality between good and bad appeared first on Center on Long-Term Risk.
]]>Is there some kind and amount of badness such that an outcome that contains it is overall bad, regardless of the amount of good in the outcome?
Priority: 6/10
Lexicality among goods can be phrased as follows:
Some or any amount of good A is better than any amount of good B.1
For example, W. D. Ross stated that “no amount of pleasure is equal to any amount of virtue.”2 That is, roughly, that it is better for someone to have virtue than any amount of pleasure in her life. Similar views have been proposed by other philosophers at least since the 18th century.3
Similarly, lexicality among bads has also been discussed, which can be phrased as follows:
Some or any amount of bad A is worse than any amount of bad B.
Several philosophers have defended such kinds of lexicality.4 For example, Stuart Rachels has said that 1 year of excruciating agony is worse than 10^50 years of mild pain.5
The most interesting kind of lexicality appears to be not lexicality among goods or among bads, but rather, claims of the kind that there is a
lexicality between good and bad: an outcome with some or any amount of bad A is overall bad even if the outcome has any amount of good.
Some theories of value imply this kind of lexicality in the sense that any amount of bad makes the world worse than (or not as good as) an empty world. For example, the view that only the satisfaction or frustration of preferences has value or disvalue, combined with antifrustrationism.6 Christoph Fehige, who proposed antifrustrationism, writes that one of his proposed views entails that “nothing can be better than an empty world (a world without preferences, that is).”7 One way to tackle whether there is a lexicality between good and bad is to focus on whether value theories such as Fehige’s are correct. But here we will not take that route; we will instead assume for the purpose of this research topic that there may be states of the world in which the world is better than an empty world.
For this research topic, we focus on the question: If the world can be better than an empty world because it has some good in it, is there some amount of badness such that a world with such badness is worse than an empty world, regardless of the amount of good in the world?8
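To keep the question crisp, it can be restated in the following schematic form (the notation is ours and merely illustrative; none of the authors cited here use this formalism):

```latex
% Schematic restatement (notation ours). Let V(x) be the overall value of
% outcome x and \varnothing the empty world, and suppose some outcomes
% satisfy V(x) > V(\varnothing). The question is whether
\[
  \exists\, b^{*} \;\; \forall\, g : \quad
  V\bigl(\text{a world containing badness } b^{*} \text{ and goods } g\bigr)
  \;<\; V(\varnothing).
\]
```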
This question has seemingly received little discussion in the philosophical literature; most discussion focuses on lexicality with respect to good A vs. good B or bad A vs. bad B, but not bad A vs. good B. There are authors who have made claims resembling lexicality between good and bad, but most have not discussed it extensively. For example, in the context of a happiness index, Bengt Brülde says that
there may well be sufferings that are so intense that no trade-offs are possible, neither in the intrapersonal nor the interpersonal case.
The following statement was made by Swedish philosopher Ingemar Hedenius in 1964:
The worst in life, the fate of the completely unhappy, the uninterrupted, infernalistic suffering, the hopeless humiliation, a child who is slowly tormented to death—I cannot see that all beauty in the world or even the most exceptional thoughts can “counterbalance” such, and neither that other humans’ happiness or culture can. (Our translation)9
Jamie Mayerfeld considers two outcomes: (n), in which one person has a lifetime of torment and many others have a lifetime of extreme bliss; and (1), in which all have a lifetime close to the hedonistic zero. He says about the outcomes that
Like William James, I find the conclusion that (n) is better than 1 unacceptable. The lifelong bliss of many people, no matter how many, cannot justify our allowing the lifelong torture of one.10
A critical discussion of lexicality between good and bad can be found in Toby Ord’s online essay “Why I'm Not a Negative Utilitarian” in the section on Lexical Threshold Negative Utilitarianism. On the other hand, Clark Wolf speaks favorably of lexicality and defends what he calls ‘negative critical level utilitarianism’ (NCLU), according to which “population choices should be guided by an aim to minimize suffering and deprivation.”11
Which of the most important aspects of lexicality among goods and among bads translate simply to lexicality between good and bad? What is unique about lexicality between good and bad compared to the other two? What about lexicality between good and bad makes it easier or more difficult to defend compared to the other two?
Output: Online essay or an article in a philosophy journal.
We suggest this research topic if it proves relevant to lexicality between good and bad, which remains to be investigated (see the previous research topic). A common type of argument used in several of the debates about lexicality, and about whether value relations such as ‘all things considered better than’ are transitive, is the sequence or spectrum argument. One version in terms of ‘worse than’ goes as follows. Assume that a state A, for example some amount of intense suffering, is claimed to be worse than any amount (or an arbitrarily large finite amount) of minor pains, a state we can call Z. The argument says that there is a state B with slightly less severe harms but with a larger amount of them, for example with more individuals experiencing them or experiencing them for a longer period of time, such that B is worse than A. Similarly, there is a state C with slightly less severe harms than in B but a larger amount of them, such that C is worse than B. And so on, until we reach a state Z with a very large amount of minor harms that is worse than Y. By transitivity of ‘all things considered worse than,’ Z is worse than A. But this contradicts the starting position, which was that A is worse than Z. Such a sequence can, for example, be taken to support the claim that the badness in A is not lexically worse than the badness in Z, or that ‘all things considered worse than’ is not transitive.
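Schematically, and in our own notation (reading “X ≻ Y” as “X is all things considered worse than Y”), the structure of the argument is:

```latex
% Our schematic rendering of the sequence argument; the notation is ours.
\[
  \underbrace{A \succ Z}_{\text{starting claim}}, \qquad
  \underbrace{B \succ A,\; C \succ B,\; \dots,\; Z \succ Y}_{\text{sequence premises}}, \qquad
  \underbrace{Z \succ A}_{\text{by transitivity}},
\]
% so the conclusion Z \succ A contradicts the starting claim A \succ Z.
```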
Alastair Norcross uses a sequence argument to argue that “there is some finite number of headaches, such that it is permissible to kill an innocent person to avoid them.”12 Erik Carlson objects that Norcross does not actually specify a sequence. Carlson says,
To convince the skeptic, the proponent of the Sequence Argument has to do better than merely pointing to the prima facie plausibility of there being a sequence of the kind his argument relies on. He must, it seems, actually specify such a sequence…. Until [such a sequence] is actually produced... [we] need not be much worried by the Sequence Argument.13
He highlights that, upon closer inspection, we may be unable to specify a sequence, or there may be specified sequences, each intuitive, that run in both directions.14
Although Carlson’s argument is a reply to Norcross in particular, it seems to apply to other sequence arguments, including those presented by Larry Temkin and Stuart Rachels.15 That is, a weakness in these sequence arguments may be that they, to varying degrees, only point to the plausibility that there exists a sequence of the right kind, but they do not flesh out the sequences. The research topic would be to build on Carlson’s ideas and specify such sequences, and to figure out what can be concluded from such an exercise.
Output: Philosophy undergraduate or master’s thesis, or an article in a philosophy journal.
Toby Ord says that Lexical Threshold Negative Utilitarianism involves a “very strange discontinuity.”
If you believe in Lexical Threshold NU, i.e. that there are amounts of suffering that cannot be outweighed by any amount of happiness, then you have to believe in a very strange discontinuity in suffering or happiness. You have to believe either that there are two very similar levels of intensity of suffering such that the slightly more intense suffering is infinitely worse, or that there is a number of pinpricks (greater than or equal to one) such that adding another pinprick makes things infinitely worse, or that there is a tiny amount of value such that there is no amount of happiness which could improve the world by that level of value.
The argument goes like this. We can imagine a long sequence of levels of intensity of suffering from extreme agony down to a pinprick, each of which differs from the one before it by a barely detectable amount. If we want to avoid the discontinuity in the badness of the intensity of suffering, then the suffering of a million people at the extreme agony intensity level must be only finitely many times as bad as the suffering of a million people at the next intensity level down, which is only finitely many times as bad as the suffering at the next level, and so on all the way down to the level of the pinprick. This implies that suffering at the very high intensity level is only finitely times worse than the suffering at the pinprick level. [Figure.]
Moreover, unless there is to be a case where n+1 people getting a pinprick is infinitely worse than n people getting a pinprick (for n greater than or equal to 1), we can run a similar argument moving from a million people receiving a pinprick down to a single person and show that the former must only be finitely many times worse than the latter. We thus get the conclusion that while a million people in agony is terrible, it is still only finitely times as bad as a single person receiving a pinprick.16
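The finiteness step in this argument can be reconstructed as simple arithmetic. In the sketch below (our reconstruction, not Ord's notation), b_i stands for the badness of a million people suffering at intensity level i, with level 1 being extreme agony and level m a pinprick, and “no discontinuity” is read as each level being at most finitely many times as bad as the next:

```latex
% Our reconstruction; the symbols b_i and k_i are ours.
\[
  b_i \le k_i\, b_{i+1} \text{ with each } k_i \text{ finite}
  \;\Longrightarrow\;
  b_1 \;\le\; k_1\, b_2 \;\le\; k_1 k_2\, b_3 \;\le\; \dots \;\le\;
  \Bigl(\prod_{i=1}^{m-1} k_i\Bigr) b_m ,
\]
% and a finite product of finite factors is finite, so extreme agony comes out
% only finitely many times as bad as a pinprick.
```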
This kind of spectrum, sequence, or step-by-step argument that Ord uses has been discussed to a fair extent and there are several possible replies to it, as well as counterarguments to the replies.17 A reply to Ord's version could describe the possible replies and the challenges with the replies, and it would likely conclude that lexicality in this case is not clearly “very strange,” at least not in a way that is especially problematic.
Output: Online essay.
Feel free to contact us for more research ideas regarding lexicality.
Related to lexicality are topics such as vagueness, completeness, discontinuity, extensive structures (in measurement), Archimedeanness and non-Archimedean measurement, infinities, and transitivity of value relations such as ‘all things considered better than.’
The post Antifrustrationism appeared first on Center on Long-Term Risk.
Christoph Fehige proposed antifrustrationism, according to which a frustrated preference is bad, but the existence of a satisfied preference is not better than if the preference didn’t exist in the first place.1 In Fehige’s words,
We don’t do any good by creating satisfied extra preferences. What matters about preferences is not that they have a satisfied existence, but that they don’t have a frustrated existence.2
Several authors have objected to antifrustrationism. How could a proponent of antifrustrationism respond to these objections? If you are interested in doing research in this area, please contact us since we have unpublished material on related topics.
Priority: 8/10
Among the existing critiques, the most interesting ones are provided by Arrhenius in “Future Generations,” Ryberg, Danielsson, and Holtug. There are possibly other critiques; see citation searches, for example on Google Scholar, for sources that cite Fehige, “A Pareto Principle for Possible People.”
Novel research: Article in a philosophy journal. Title suggestion: “In Defense of Antifrustrationism.”
The post Open Research Questions appeared first on Center on Long-Term Risk.
There are a number of crucial considerations for reducing suffering in humanity's future. This page presents a ranked list of topics that the Center on Long-Term Risk considers important to investigate. Let us know if you'd like to help research these topics. Some are most appropriately addressed by reviewing existing literature and summarizing it on Wikipedia. Other topics require novel exploration.
It's likely that artificial intelligence (AI) in some form will hold the reins of power over Earth's future within the coming centuries, barring economic or societal collapse in the interim. Depending on the dynamics of how AI is developed and how unpredictable AI behaviors are, humans may keep their hands on the steering wheel of how AI is shaped, or AI might take a direction of its own due to economically outcompeting humans, oversights by its programmers, or other factors.
Organizations like the Machine Intelligence Research Institute and the Future of Humanity Institute have explored the implications of various AI takeoff scenarios for human flourishing, but less attention has been given to the implications of various types of AIs for future suffering. It's plausible that some AI trajectories will cause significantly more suffering than others, but which ones?
Brian Tomasik has sketched some of his guesses about ways in which different types of AIs would cause suffering, but this is just a start. We need a thorough research program on this topic. Some relevant questions include:
We should also explore whether there are particular forms of AI-safety research that are more targeted relative to the value of suffering reduction. For instance, are there ways we can ensure that even if AIs fail to achieve human goals, they at least "fail safe" and don't cause astronomical amounts of suffering? And even if suffering reducers don't support AI safety wholesale (which, as mentioned, seems unlikely), are there particular components of AI safety that they would support and should promote further?
Many views in ethics and value theory see preventing suffering as particularly important. Such views include Negative Utilitarianism but also other views in population ethics, axiology, and normative ethics. New research in this vein or presentations of such views to a general audience can build on the works we list in our bibliography. Below are examples of more specific topics.
Futurists debate what AI will look like when it arrives. Some, like Eliezer Yudkowsky and Nick Bostrom, have argued in favor of the possibility of a "hard takeoff" in which a single AI or small team of AI creators can rapidly self-improve to the point of unilaterally taking over the world. Others, like Robin Hanson and J. Storrs Hall, have argued for a "soft takeoff" in which AI is integrated into society as a whole, and the rapid self-improvement occurs in a way similar to the exponential economic growth that we see already. Another possibility is AI arms races among several powerful countries, in which militaries aim to outcompete each other in a fashion reminiscent of the Cold War.
Anthropic reasoning aims to gain insight about our place in the universe based on the facts that we exist and find ourselves in a particular time and context. As an example, it's sometimes claimed that human civilization is unlikely to last vastly longer than it has already, because if we consider ourselves a random sample from all humans, we would expect to have been born much later in history. This is called the "doomsday argument" and is one controversial application of anthropic argumentation. Many thinkers reject the doomsday argument, though they differ widely on the reasons for its rejection. Some argue that a narrow reference class of observers can solve the problem. Others suggest giving higher a priori probability to scenarios with more total observers. Yet others propose eliminating the notion of discrete observers within a reference class altogether. In general, the best approach to anthropics has not been "solved".
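For readers who have not seen it, the arithmetic behind the doomsday argument runs roughly as follows (a standard back-of-the-envelope illustration; the figures are approximate, and the uniform-sampling assumption is exactly the controversial step):

```latex
% Standard illustrative sketch; the numbers are rough approximations.
% Let N be the total number of humans who will ever live and r your birth rank.
% Treating r as a uniform draw from {1, ..., N}:
\[
  P\bigl(r > 0.05\,N\bigr) = 0.95
  \quad\Longleftrightarrow\quad
  P\bigl(N < 20\,r\bigr) = 0.95 ,
\]
% so with roughly 10^{11} humans born so far, N < 2 \times 10^{12} at
% 95 percent "confidence" -- hence the doomsday-flavored conclusion.
```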
One example of anthropic-type thinking is the principle of mediocrity -- a Copernican intuition that we should expect ourselves to be typical observers in the universe. This idea seems at odds with the fact that we appear to be in an extremely influential time in the history of our galaxy. We live during some of the generations that may create and determine the constitution of AIs that colonize our region of the universe. What does anthropics have to say about this? Should we think the far future is much less likely to happen than we naively would have believed?
Anthropic-type ideas like the Fermi paradox, Great Filter, and timeline for evolution of life on Earth can provide further suggestions about how hard superintelligence is and how it behaves once created.
The following questions are from Nick Beckstead's slides, "How to Compare Broad and Targeted Attempts to Shape the Far Future" (pp. 35-36):
For an introduction to the topic, see "Is There Suffering in Fundamental Physics?".
The post Work with us appeared first on Center on Long-Term Risk.
The Center on Long-Term Risk will soon be recruiting for a Director of Operations, and we are currently seeking expressions of interest for the role. The role is to lead our 2-person operations team, handling challenges across areas such as HR, finance, compliance, and recruitment.
For more details and to make a submission, see here. The deadline for submissions is the end of Sunday 11th February.
Edit: The deadline for expressions of interest to be considered in our initial round has now passed. You're still welcome to submit expressions of interest, but we will only invite late submissions to join the immediate hiring round in exceptional cases. If we decide to proceed to a full hiring round in April 2024, we'll take another look at all new expressions of interest at that stage.
We usually run hiring rounds for specific positions. There are currently no ongoing hiring rounds, but we would love to hear from you if you would be excited to contribute to our mission. In that case, simply fill out our general application form. We particularly encourage applications from women and minority candidates. Please contact us at info@longtermrisk.org if you have any questions about working with us.
CLR’s mission is to reduce worst-case risks from the development of advanced AI systems. We are the largest organisation focussed on s-risk reduction, with our researchers being among only a few working on s-risk reduction and cooperative AI.
We aim to combine the best aspects of academic research (depth, scholarship, mentorship) with an altruistic mission to prevent negative future scenarios. So we leave out the less productive features of academia, such as precarious employment and publish-or-perish incentives, while adding a focus on impact and application.
As part of our team, you will enjoy:
You will advance neglected research to reduce the most severe risks to our civilization in the long-term future. CLR's activities include:
CLR has received grants from Open Philanthropy, the Survival and Flourishing Fund and Polaris Ventures. Testimonials about CLR’s work from prominent community members can be found here.
Diversity and equal opportunity employment: CLR is an equal opportunity employer, and we value diversity at our organization. We welcome applications from all sections of society and don’t want to discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, social background/class, mental or physical health or disability, or any other basis for unreasonable discrimination, whether legally protected or not. If you're considering applying to a role and would like to discuss any personal needs that might require adjustments to our application process or workplace, please feel very free to contact us.
In addition to their salary, CLR offers the following benefits to all staff (including Summer Research Fellows):
The post Dealing with Moral Multiplicity appeared first on Center on Long-Term Risk.
Scientists debate the specific evolutionary processes that gave rise to humans' and animals' moral sensibilities, but the original functional purpose of morality is less ambiguous: Morality was a set of community norms that served to enforce fair play. Maintaining these standards allowed a tribal group to overcome prisoner's dilemmas and thereby achieve higher total fitness than if anarchy prevailed. Deviations from these agreements would be punished, and concordance would be rewarded.
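To illustrate the claim about prisoner's dilemmas, here is a minimal sketch with standard textbook payoffs (the specific numbers are ours and purely illustrative):

```python
# Illustrative prisoner's-dilemma payoffs (numbers are ours, chosen only to
# show why enforced cooperation can raise a group's total fitness).
payoffs = {
    # (row move, column move) -> (row payoff, column payoff)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# Defecting is individually tempting (5 > 3 against a cooperator, 1 > 0 against
# a defector), yet mutual cooperation maximizes the total payoff of the pair.
totals = {moves: sum(p) for moves, p in payoffs.items()}
print(totals)  # {('C', 'C'): 6, ('C', 'D'): 5, ('D', 'C'): 5, ('D', 'D'): 2}
```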
This explains why norms existed, but what would be the reason for people to feel like these norms were more than arbitrary rules to be followed only when breaking them risked getting caught? Why would people attach a sense of higher, objective rightness to them? One standard story to explain this is that those people who felt the community norms were objectively right would follow them in all circumstances, including when they thought they weren't being watched. Then, on the rare occasions when those people were actually being watched, those who followed the rules universally would fare much better. Analogously, I think organizations that follow common-sense ethics are more successful in the long run than organizations that scheme in private and eventually are exposed in a scandal.
We can also observe that communities in which people felt that morality was universal would have had higher total fitness due to cooperating on prisoner's dilemmas even when enforcement wasn't possible, but this explanation invokes group selection. While group selection at the gene level may be dubious because the right mutation would have to happen in many people at once, group selection at the meme level might be more plausible if the meme can spread to most members of the group at the same time. Memes are more like diseases than individual mutations.
Regardless, it's clear that people do have a sense of ethics that's distinct from their other preferences. The idea that "Murder is wrong" feels different from the idea that "I like chocolate." Our brains inform us of the separation between selfishness vs. adherence to group norms, even though our actions are ultimately a mixture of the two impulses.
Just as toddlers learn that sharp things hurt when you stick them in your arm and that vegan ice cream is yummy, so they also learn that you shouldn't hit people and slavery is immoral. The first types of lessons mostly come directly from their hedonic systems, while the latter mostly come from social inculcation and hence lodge in a conceptually distinct region of the brain devoted to "conscience" topics. Indeed, there are times when people learn lessons about their own hedonic wellbeing -- e.g., don't drive without a seat belt -- and perhaps because these are learned through social norms rather than direct hedonic reaction, the lessons tend to end up in the abstract, conscience-type mental regions. Forcing yourself to do what you know is in your own long-term interest feels similar to forcing yourself to do what's good for others; in both cases, the brain doesn't necessarily have direct hedonic learning to make the decision quick and facile. Of course, there are many moral/political issues where people feel passionately rather than just obeying their consciences against other desires. Perhaps in these cases the issue has become sufficiently learned in a hedonic sense that it doesn't require cognitive control to be followed.
Moral principles feel more objective when they've seemed more universal in your development. For instance, in modern Western countries, almost everyone agrees that human slavery is wrong, so people have a strong sense that this is a clear, unquestionable moral fact. In contrast, there's debate over whether it's right for gays to get married, whereas hundreds of years ago, gay marriage would have been unanimously seen as wrong in many cultures and so its wrongness would have also seemed an obvious moral fact to many people. And a few centuries from now, embracing gay marriage will probably be seen as obviously right.
As people learn of cases where other societies had different cultural norms, they sometimes relax their feelings of absolute objectivity. For instance, some may say that it wasn't wrong for the Aztecs to perform human sacrifices, because this was their culture, whereas it would be wrong to do sacrifices in our culture. Moral relativism is one response to disruption of apparent universality of a moral principle. Of course, some incline more toward it than others, partly depending on how much other people in one's milieu endorse it. If you put a random person in an anthropology department, he's more likely to become morally relativistic when you remind him about Aztec sacrifices than if you put him in a Baptist church.
Changes in moral outlook are subject to the laws of brain plasticity just like the hedonic system or other brain systems, and flexibility tends to decline with age. Usually, the more people hear a certain view and spend time with others who espouse it, the more they adopt it themselves, unless they actively apply a negative gloss onto the messages so as to negate their valence. (Think of people saying "Boo! Hiss!" when they hear a politician of the opposing party.) Many moral views come from associating a brain region for a concept (say, animal farming) with a brain region that already has strong moral valence (say, slavery), thereby glossing the concept with that same valence. Why is abortion wrong? Because it's murder (which has strong existing negative valence). Why is abortion acceptable? Because it's a woman's choice about her body (invoking strong positive valence associated with individual autonomy). Why is gay marriage wrong? Because it destroys the sanctity of marriage and the family (which have strong existing positive valence). Why is gay marriage right? Because gays deserve equality, and allowing gays to marry would strengthen the institution of marriage, not weaken it (strong positive valence). And so on. See the "Appendix: Ethical reasoning using associative networks" for one model of how this smearing of valence might work.
Associations account for some types of moral reasoning (e.g., "X is stealing. Stealing is wrong. Therefore, X is wrong."). As Joshua Greene would claim, other types of moral reasoning are slower, more calculated, and more consequentialist. For instance, consider the moral dilemma of whether you should suffocate a crying baby if this is the only way to prevent your whole congregation, including the baby, from being heard and killed by armed guards. Our immediate associative response is "suffocating babies is bad," but a more nuanced response, noting that the baby would die anyway, may be different. Still, the reasoned response too ultimately relies on propagating valence from certain ideas (e.g., "saving more lives is better") back to conclusions ("it's better to suffocate the baby in this particular case"). Over time, if these reasoning chains are encountered often, neural shortcuts develop, and then moral calculation is no longer required. In any event, calculated moral reasoning is not necessarily "better" than associative reasoning, although some communities might claim it to be better -- ironically, based on the positive associations that they have with calculated reasoning!
Our ethical views would be different if we had grown up differently. Genes play a big role, as does early childhood development. Even immediate factors can modify our outlooks from moment to moment. Paul Christiano notes that blood sugar has an effect. We're also randomly influenced by being tired, being alone vs. with friends, what movie we watched yesterday, and countless other seemingly trivial forces. When we think in naive terms about the grand system of morality, it feels like incidental, random occurrences in our past shouldn't be so influential, or even influential at all, on what we judge to be right or wrong.
If I feel X, but I could have grown up to feel Y, and other people feel Z, then isn't it all pointless? This is the perspective of moral nihilists, and I think fear of nihilism is one common motivation for holding on to moral realism, if only by confused Pascalian reasoning, in the face of the ultimate arbitrariness of our moral sentiments. Of course, nihilism contradicts itself: If there are no objective standards of judgment, then it's not even true to say that "morality is pointless" because there's no such thing as something being pointless or not. Still, this doesn't trouble the persistent nihilist, who can simply refuse to engage in this conversation any further because it's all meaningless -- whatever that's supposed to mean.
When people think a moral statement is objective, they usually feel like, "I just know ABC is right. It can't possibly be otherwise." Then, when we imagine ourselves growing up somewhat differently and coming to hold a different view due to changed circumstances in genes or development, we still project our self-identity onto that different person, and then we realize that, "I could have not felt that ABC was right. Hmm. Maybe it's not The Right Thing after all." This causes the "morality feeling" in our brains to shut down, and our once noble moral impulses begin to look more like egoistic desires. The "magic" of absolute rightness fades away.
I think this is a failure mode that we can avoid. We should recognize what's going on, yes, but why let ourselves turn off the magical moral feeling in our brains? Why not keep it, if it makes us feel better and is more consistent with our desires anyway? We can value the magic moral feeling as though it were still magic if we choose, and I choose to do so. It's not Wrong to feel the magic even after understanding what's under the hood, any more than it's Wrong to still like chocolate after learning about pleasure glossing in the ventral pallidum and realizing that you could have been wired to enjoy the taste of elephant dung instead.
Arbitrary genes and personal experiences may make you choose different values than someone else, but without any genes or personal experiences, there would be no "you" at all. In some sense, the arbitrariness of one's genes and history is essential to ethics; it's not necessarily a bias that we should try to eliminate. Without your own arbitrary initial conditions, you might instead be a paperclip maximizer (insofar as you would still be "you" at all if you had such radically different moral beliefs).
Even if different groups have different moral goals, they'll find it advantageous to compromise in positive-sum ways, such as cooperating on moral versions of the prisoner's dilemma. This may lead to somewhat convergent outcomes insofar as compromise aggregates different views roughly in proportion to bargaining leverage.
The evolutionary motivation for the feeling of objective morality at the individual level was that it was a form of pre-commitment to following group norms based on cooperation. By the same token, we could imagine feeling that a stance which aggregates different moral views into a unified, compromise moral view is "objective," giving us magic moral-glow feelings. This would be a second layer of aggregation: First there were egoistic desires that got aggregated to moral tribal norms, and then the remnants of those tribal norms get aggregated to meta-level worldwide norms. Of course, most of the motivations in the world are not moral norms but still regular egoism, so base-level egoism would have significant sway over the compromise morality as well.
What do I think of this proposal? I like the fact that feeling the glow of objective morality can motivate people to compromise, and this might be a useful intuition pump. On the other hand, because compromise is weighted by power, it feels unfair to the weak and voiceless, including all non-human animals, whose only representation comes from the extent to which other humans happen to sympathize with them. Of course, we can't ask for anything better. Usually a power-weighted compromise is the best possible outcome for achieving whatever goal we want in expectation. One of the lessons of Robert Axelrod's tournaments on the iterated prisoner's dilemma was "don't be envious." Still, I'm not comfortable moving my moral compass toward feeling like "might actually makes right" rather than just feeling that "power-weighted compromise is the best outcome we can hope for given the circumstances."
Reflective-equilibrium views aim to resolve ethical disagreements into a self-consistent whole that society can accept. One example is Eliezer Yudkowsky's coherent extrapolated volition (CEV), which would involve finding convergent points of agreement among humans who know more, are more connected, are more the people they wish they were, and so on.
A main problem with this approach is that the outcome is not unique. It's sensitive to initial inputs, perhaps heavily so. How do we decide what forms of "extrapolation" are legitimate? Every new experience changes your brain in some way or other. Which ones are allowed? Presumably reading moral-philosophy arguments is sanctioned, but giving yourself brain damage is not? How about taking psychedelic mushrooms? How about having your emotions swayed by poignant music or sweeping sermons? What if the order of presentation matters? After all, there might be primacy and recency effects. Would we try all possible permutations of the material so that your neural connections could form in different ways? What happens when -- as seems to me almost inevitable -- the results come out differently at the end?1
There's no unique way to idealize your moral views. There are tons of possible ways, and which one you choose is ultimately arbitrary. (Indeed, for any object-level moral view X, one possible idealization procedure is "update to view X".) Of course, you might have meta-views over idealization procedures, and then you could find idealized idealized views (i.e., idealized views relative to idealized idealization procedures). But as you can see, we end up with infinite regress here.
I like to picture our moral intuitions like a leaf blowing on the sidewalk. Asking "What do we really want?" is like asking, "In what direction does the leaf really want to blow?" Another analogy for our moral intuitions is a board game, like Chutes and Ladders: Roll the dice of environmental stimuli, move forward some amount, and maybe be transported up or down by various forces.
Sometimes you might reach an absorbing state of reflective equilibrium, but there are many such absorbing states you could fall into. The absorbing states you'd reach studying philosophy are different than those you'd reach studying biology, which are different from those you'd reach studying international relations. Each field changes your brain in different ways. And there may be path dependence due to early neural plasticity that becomes less flexible over time. One analogy people sometimes give is with sledding on fresh snow: After you've carved out certain paths, it can be harder to make other paths rather than falling back into the original ones. Of course, with artificial brains, we could reset the initial conditions (undoing the sledding marks) if desired.
And whom do we extrapolate? CEV suggested all humans on Earth, but why stop there? Why not ancestral humans on the African Savanna? How about whatever intelligent life would have evolved if the dinosaurs had not gone extinct? Aliens? Beings in other parts of the multiverse? Wacky minds that never evolved but could still be constructed, like pebble sorters?
Maybe one answer is: Let's try a bunch of initial conditions and extrapolation methods and see if some attractors are more common than others. Yes, convergence is not unique, but we might still be able to rule out a lot of moral stances as not stable. Maybe we could weight among the stable attractors that remain, although the relative numerosity of different attractors would themselves depend on the parameters for the process. For instance, if we started with arbitrary initial minds, we might find that suffering increasers were about as numerous as suffering reducers, while if we started with initial minds weighted by their prevalence as evolved creatures, we would have mostly suffering reducers in the end because suffering reduction should be a common tribal norm across a broad array of societies.
Another trouble is that CEV as presented seems to require intelligent agents that can reason about ethics, have discussions, hear arguments, and reflect on new experiences. This rules out a large chunk of non-human animals from the process. Are we going to neglect them even though they have emotions and implicit desires just like humans do? I would hope not, but convincing others of this would itself be a challenge.
As with compromise morality, I like the fact that CEV provides an intuition pump in favor of cooperation, which is good for all of us. However, as with compromise morality, I'm not sure I would label it as what I actually want, rather than just as an admirable goal to work toward because it's something people can agree on rather than fighting each other.
It's a common observation that older people rarely change their minds, except maybe on topics that they haven't thought about much. To some extent this ossification of beliefs is rational. For example, a Bayesian approach to the sunrise problem will show smaller and smaller updates for the probability that the sun rises tomorrow as the number of data points increases. Relatedly, some machine-learning algorithms use learning-rate decay over time so that later information causes less of an update than earlier information. In other cases, stubbornness by older people is counterproductive.
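As a concrete illustration of both points (a minimal sketch with our own numbers, using Laplace's rule of succession for the sunrise problem):

```python
# Minimal sketch (our own numbers) of why Bayesian updates on the sunrise
# problem shrink as evidence accumulates -- qualitatively the same effect as a
# decaying learning rate in machine learning.

def laplace_estimate(successes: int, trials: int) -> float:
    """Laplace's rule of succession: P(success next time) = (s + 1) / (n + 2)."""
    return (successes + 1) / (trials + 2)

for n in (10, 100, 1_000, 10_000):
    # How much does one additional sunrise move the estimate after n sunrises?
    shift = laplace_estimate(n + 1, n + 1) - laplace_estimate(n, n)
    print(f"after {n:>6} sunrises, one more shifts the estimate by {shift:.1e}")

# The shift scales like 1/n^2, so each new observation matters less than the
# one before it, much as a 1/t learning-rate schedule discounts later data.
```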
A few people show belief plasticity throughout their lifetimes. One example might be Hilary Putnam: "Putnam himself may be his own most formidable philosophical adversary.[8] His frequent changes of mind have led him to attack his previous positions." But it seems that most people maintain their general stances on issues throughout their careers. (Thanks to Simon Knutsson for this point.)
The same often applies in the field of ethics. My informal experience has been that it's almost impossible to change the moral views of someone over age ~25 (plus or minus a few years), while many younger people are more "easy come, easy go" on moral issues. Once again, there are a few exceptions, like when Peter Singer switched from preference to hedonistic utilitarianism in his 60s(?). My anecdotal experience that people's views become more rigid by their late 20s is consistent with the following passage from SAMA (2013):
Though teens may change clothes, ideas, friends and hobbies with maddening frequency, they are developing ideas about themselves, their world and their place in it that will follow them for the rest of their lives. Adults may spend years trying to create or break even the simplest habit, yet most adults find that their most profound ideas about themselves and the world were developed in high school or college. This is because, by age 25 or so the brain is fully developed and building new neural connections is a much slower process.
In my own case, the core of my moral view (namely, the overwhelming importance of reducing extreme suffering) was set by age 19, and the moral updates I've had since then have been either tweaks around the edges or revisions forced by an ontological crisis, such as the realization that consciousness is not an objective metaphysical property.
It's possible that there's a sort of critical period for morality in which people "imprint" on various moral views. Or maybe the ossification of people's moral opinions with age is just a byproduct of the general ossification of people's minds about all sorts of topics. Either way, the fact of ossification presents a problem for moral idealization procedures, since even if we learn more, many of our views may already be substantially fixed. Meanwhile, if we allow for simulation of alternate possible development trajectories that we might have taken since birth or childhood, then our final conclusions in those simulations are likely to be fairly sensitive to what ideas we're exposed to first versus what ideas we only discover after substantial ossification has already taken place.
If the formation of moral views is more like sexual imprinting than it is like understanding math, then it seems unlikely that humanity would converge on a shared moral vision via moral idealization, in a similar way as it's unlikely that billions of people raised in different environments would all agree on what specific physical features of a potential mate are sexually attractive if they could just learn and discuss more with one another. Of course, there are many features of physical attractiveness that are mostly universal among humans, just like some moral judgments are mostly universal among humans.
Humans hold roughly accurate factual beliefs (at least on concrete, day-to-day matters) because those whose brains led to incorrect views were more likely to fall off cliffs, eat poisonous berries, or fail to acquire mates. As society becomes increasingly complex, those agents that display greater rationality toward the aim of acquiring power tend to survive better. Note that this is not always the same as epistemic rationality; for instance, religions that oppose birth control survive very successfully just due to reproduction rates. But in the long run, I expect that survival will require more and more epistemic rationality due to competition with other smart agents.
In the same way as evolution favors correct factual beliefs in the long run, it presumably also favors certain moral attitudes. As a simple case, ideologies demanding celibacy from all followers (a la the Shakers) are unlikely to survive, and similar fates probably await antinatalist movements. Likewise, moral views that repugn strategic thinking may ultimately lose out to those that embrace it.
By "strategic thinking" I don't mean naive Machiavellian deviousness; that historically has lost out to behaviors more like virtue ethics in human evolution. I don't expect a morality that endorses bank robbing for the greater good to win the future. But I do mean that moralities that fail to adaptively develop intelligent survival strategies risk becoming stale and less competitive. This doesn't imply that the morality itself has to endorse a particular survival-maximizing content. But the morality may need to have a strategic shell to enclose its inner seeds if it is to survive in the long run.
If we wanted to be charitable to adherents of moral realism, we could suggest the requirement that agents of the future behave strategically as one candidate feature of objective morality. Just as evolution constrains survivors to have certain factual beliefs, so too competitive struggles may push agents toward particular behaviors -- at least game-theoretic rationality if nothing else more specific. Beyond that, it's not clear if selection pressure strongly favors particular ideologies ex ante. So I expect this idea does little justice to what disciples of moral realism were hoping for.
Many speculations in this piece are just-so stories, at least until further evidence comes in. In addition, many of these ideas are probably unoriginal, though if other authors have said the same things, I don't know the citations. Most of what I wrote is common cultural knowledge among my friends.
Our brains use associations to make connections between concepts. For instance, if I say "ancient Egypt," you might think of "pyramids," "Nile River," and "hieroglyphs." Likewise, if I had first said "hieroglyphs," it likely would have brought to mind the general concept of ancient Egypt. Our brains have a bidirectionally associative network among these ideas.
We can imagine adding a valence to various concepts, as well as associational weights between them. The weights correspond to how numerous and strong are the neural connections in our brains.
Take a moral issue like abortion. Consider a pro-life advocate who believes that abortion is murder and cares about the views of the Catholic Church. She does recognize that bodily autonomy is somewhat important, but this consideration is outweighed by the murder and anti-Catholic associations that abortion has for her. We can represent this collection of sentiments in a network like the following: [Figure.]
A different person might have the same network but different valences and weights: [Figure.]
The product of valence times connection weight does the work of a logical syllogism: [Figure.]
Because these numbers have continuous values and we can sum up multiple conflicting considerations, the network approach is more powerful and descriptive of how brains actually work than the syllogistic approach. Also note that for the person who did not believe abortion was murder, the syllogism would not go through because his valence * weight evaluated to 0.
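A toy version of this valence-times-weight computation can be written out explicitly. The numbers below are ours and purely illustrative; they are not taken from the post's diagrams:

```python
# Toy sketch (illustrative numbers ours) of valence propagating through an
# associative network: a node's valence is the weighted sum of the valences
# of the concepts it is linked to.
valence = {
    "murder": -0.9,
    "Catholic Church teaching": 0.6,
    "bodily autonomy": 0.4,
}
weights_into_abortion = {
    "murder": 0.8,
    "Catholic Church teaching": 0.5,
    "bodily autonomy": 0.3,
}

def node_valence(weights: dict, valences: dict) -> float:
    """Weighted sum of neighbor valences -- the network analogue of a syllogism."""
    return sum(valences[concept] * w for concept, w in weights.items())

print(round(node_valence(weights_into_abortion, valence), 2))  # -0.3: net negative

# Someone who rejects the abortion-murder link and weights bodily autonomy
# heavily reaches the opposite conclusion from the same network structure.
weights_into_abortion["murder"] = 0.0
weights_into_abortion["bodily autonomy"] = 0.9
print(round(node_valence(weights_into_abortion, valence), 2))  # 0.66: net positive
```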
Because the networks are bidirectional, nodes at the bottom may influence those at the top. For instance, suppose someone naively starts out with this network: [Figure.]
Associative networks can also limn the process of reflective equilibrium, which is basically the resolution of cognitive dissonance. For example, suppose a nature lover starts out with this network: [Figure.]
Now imagine that the conservationist hears counterarguments to her changed valences. For instance, if parasites are good, why is malarial infection of humans bad? If unintentional suffering is okay, why is it wrong when a drunk driver accidentally kills a pedestrian? And so on. Adding these nodes throws the network into disequilibrium once more, such that the old valences are no longer the weighted sums of their inputs: [Figure.]
I'm still a little unclear on how this network model should work. The representation so far is redolent of a non-binary version of Hopfield nets, and the process of making a philosophical argument would correspond to updating a node's weights based on the values of other nodes in its surroundings. However, the network can't be literally bidirectional because, for example, in the abortion diagrams, it's not the case that once the "Abortion" node is updated, the "Bodily autonomy" node should update to exactly its same score just because it's the only node connected to "Abortion." Maybe the fact that in practice all these nodes would have many other connections to other nodes would fix this problem. Or maybe the network weights are different depending on the direction. Someone with more neural-network experience could probably patch up this model.
I don't claim that this network model is a fully accurate depiction of human moral reasoning, and it's maybe not the only type of moral reasoning that humans do. Still, the framework is suggestive and helps conceptualize how we see people thinking and arguing. For example, the noncentral fallacy involves attempting to strengthen connection weights between an atypical instance of a hated concept and the hated concept itself, in order to propagate negative valence from the hated concept to the atypical instance. And choosing words to evoke desired responses is well known to marketers, political strategists, and anyone who has employed oblique language to avoid saying something unpleasant outright.
In passing, we can reflect on what these networks have to say about metaethics. For one, they explain how non-cognitivism still allows ethical statements to obey syllogistic inference rules, despite what critics of emotivism sometimes allege.2 They also suggest a sort of coherentist brand of moral truth, because the network activations can propagate both ways. Of course, we could capture foundationalism instead if we made the networks feedforward, but I think a recurrent network better captures the fact that people seem to have no single axiom set but are typically willing to change any given view when it conflicts too strongly with other views.
The moral valence of a statement is not necessarily the same as its hedonic valence. Indeed, sometimes the two may be opposed, like in the case where you know "I should leave that ice cream for my friend," but your hedonic system says "I want to eat my friend's ice cream." Still, it's possible that moral valence could be represented by a separate system that shares similarities to the hedonic-valence system.
Visceral moral feelings could involve a lot of different emotions. For example
So presumably if we could understand each of those emotions, we could understand their moral instantiations. When people say "abortion is murder," they're aiming to build neural connections from the "abortion" concept to the "murder" concept, so that hearing "abortion" will trigger the anger and horror normally associated with murder. The murder concept itself may just be directly hooked up to whatever systems trigger the emotions of anger and horror. These connections in turn may have been built by cultural teachings combined with experienced unpleasantness when thinking about or watching murders.
A CEV might survey the “space” of initial dynamics and self-consistent final dynamics, looking to see if one alternative obviously stands out as best; extrapolating the opinions humane philosophers might have of that space. But if there are multiple, self-consistent, satisficing endpoints, each of them optimal under their own criterion—okay. Whatever. As long as we end up in a Nice Place to Live.
We can make sense of this reasoning even though the statements are false in the real world.
The post The Eliminativist Approach to Consciousness appeared first on Center on Long-Term Risk.
"[Qualia] have seemed to be very significant properties to some theorists because they have seemed to provide an insurmountable and unavoidable stumbling block to functionalism, or more broadly, to materialism, or more broadly still, to any purely 'third-person' objective viewpoint or approach to the world (Nagel, 1986). Theorists of the contrary persuasion have patiently and ingeniously knocked down all the arguments, and said most of the right things, but they have made a tactical error, I am claiming, of saying in one way or another: 'We theorists can handle those qualia you talk about just fine; we will show that you are just slightly in error about the nature of qualia.' What they ought to have said is: 'What qualia?'"
--Daniel Dennett, "Quining Qualia"
My views on consciousness are sometimes confusing to readers, so I try to explain them in different ways using different language. I myself also try to imagine the situation from different angles. Three main perspectives that I've advanced are
Pete Mandik has a nice video explaining the distinction between reductionism and eliminativism.
That said, I think all three of these approaches are substantively the same, and they differ mainly in the words they use and the imagery they evoke. These differences may have practical consequences insofar as our moral intuitions depend on how we think about consciousness, but what the viewpoints actually say about the world is identical in each case. To make this clear, consider the classic analogy of élan vital. We can pursue any of the following options with it:
A similar situation obtains with respect to consciousness.
In "Consciousness and its Place in Nature", David Chalmers recognizes that functionalist-style reductionism and eliminativism are ultimately the same:
Type-A materialism sometimes takes the form of eliminativism, holding that consciousness does not exist, and that there are no phenomenal truths. It sometimes takes the form of analytic functionalism or logical behaviorism, holding that consciousness exists, where the concept of "consciousness" is defined in wholly functional or behavioral terms (e.g., where to be conscious might be to have certain sorts of access to information, and/or certain sorts of dispositions to make verbal reports). For our purposes, the difference between these two views can be seen as terminological.
Chalmers classifies panpsychism as a Type-F monist view, but I think that functionalist panpsychism is a poetic way of expressing a type-A materialism. That which panpsychism says is fundamental to computation is not a concrete thing that could conceivably not be present but is more a way of describing how the rhythms of physics (necessarily) seem to us.
Daniel Dennett is often charged with denying consciousness. Some critics of his book Consciousness Explained suggest that its title is missing a word, and it should actually be called Consciousness Explained Away. One possible reply to this allegation is that consciousness is being explained, but it's just not what people thought it was. As Dennett says in Fri Tanke (2017) at 32m30s: "I'm not saying that consciousness doesn't exist. I'm just saying it isn't what you think it is." But I suppose another possible response is to say, "Okay, what if I did explain consciousness away? What would follow from that?" I can imagine an Internet meme of Morpheus saying: "What if I told you that we should get rid of the idea of 'consciousness'?"
Maybe "consciousness" is a word with so much metaphysical baggage and philosophical confusion that it would be best to stop using it. Marvin Minsky thinks so and adds:
now that we know that the brain has [...] hundreds of different kinds of machinery linked in various ways that we don't understand, it would be a wonderful coincidence if any of the words of common-sense psychology actually described anything that's clearly separate, [...] like "rational" and "emotional" [as] a typical dumbed-down distinction that people use.
"Consciousness" does actually point to some helpful distinctions even given a reductionist world view -- just as the contrast between rational and emotional thinking does actually have some grounding in psychology. But "consciousness" can point to a lot of distinctions at once depending on what the speaker has in mind. Maybe we should embrace the eliminativist program and replace "consciousness" with more precise alternative words.
To be clear, my version of eliminativism does not say that consciousness doesn't exist. Pace Galen Strawson, it does not "deny the existence of the phenomenon whose existence is more certain than the existence of anything else". Rather, eliminativism says that "consciousness" is not the best concept to use when talking about what minds do. We should replace it with more specific descriptions of how mental operations work and what they accomplish. To again give an analogy with élan vital: It's not that life doesn't have a sort of vitality to it; it does. Rather, there are more useful and specific ways to talk about life's vitality than to invoke the élan vital concept.
Dennett echoes this in "Quining Qualia":
Everything real has properties, and since I don't deny the reality of conscious experience, I grant that conscious experience has properties. I grant moreover that each person's states of consciousness have properties in virtue of which those states have the experiential content that they do. That is to say, whenever someone experiences something as being one way rather than another, this is true in virtue of some property of something happening in them at the time, but these properties are so unlike the properties traditionally imputed to consciousness that it would be grossly misleading to call any of them the long-sought qualia.
Rothman (2017) describes Dennett's view during a debate: "He told Chalmers that there didn’t have to be a hard boundary between third-person explanations and first-person experience—between, as it were, the description of the sugar molecule and the taste of sweetness. Why couldn’t one see oneself as taking two different stances toward a single phenomenon? It was possible, he said, to be 'neutral about the metaphysical status of the data.'"
I don't think that consciousness is an illusion [...]. The question is what's involved in having those experiences and those sensations. And I think it does [...] beg the question to say that it involves having states with qualia.
I came to believe that ‘consciousness’ is not a theory-neutral term. When we say ‘it’s impossible I’m not conscious’, we often just mean ‘something’s going on [hand-wave at visual field]’ or ‘there’s some sort of process that produces [whatever all this is]’. But when we then do detailed work in philosophy of mind, we use the word ‘consciousness’ in ways that embed details from our theories, our folk intuitions, our thought experiments, etc.
As we add more and more theoretical/semantic content to the term ‘consciousness’, we don’t do enough to downgrade our confidence in assertions about ‘consciousness’ or disentangle the different things we have in mind. We don’t fully recognize that our initial cogito-style confidence applies to the (relatively?) theory-neutral version of the term, and not to particular conceptions of ‘consciousness’ involving semantic constraints like ‘it’s something I have maximally strong epistemic access to’ or ‘it’s something that can be inverted without eliminating any functional content’.
Eliminativists remind us that our intuitions about science are not well refined. It's commonly the case that naive notions of physics, biology, and other sciences need to be replaced by more correct, if less intuitive, understandings. Why should it be different in the case of consciousness? People may feel as though they're experts on subjectivity because they are conscious, but they are just one of many conscious minds in the universe. I don't see how this position qualifies one as an expert on consciousness any more than knowing your way around your house qualifies you as an expert on physical space in the galaxy. In any case, the parts of our brains that talk don't even have a clear understanding of much of what goes on in our own heads.
In The Scientific Outlook (1931), Bertrand Russell wrote:
Ordinary language is totally unsuited for expressing what physics really asserts, since the words of everyday life are not sufficiently abstract. Only mathematics and mathematical logic can say as little as the physicist means to say.
Eliezer Yudkowsky remembers that his father
said that physics was math and couldn't even be talked about without math. He talked about how everyone he met tried to invent their own theory of physics and how annoying this was.
In a similar way, I claim we can't understand subjectivity without neuroscience (and physics more generally). And while everyone seems to have a pet theory of consciousness (including me plenty of times), these can't substitute for neuroscience.
Our brains are bad at using intuitions about location and properties of an object to describe quantum superpositions. In a similar way, our language of "consciousness" and "qualia" is not suited to precisely describing what happens in the brain.
The language of "consciousness" and "qualia" corresponds to what Philip Robbins and Anthony I. Jack call the "phenomenal stance". In contrast, the eliminativist position corresponds to what Dennett calls the "physical stance".
In breaking our confusions about consciousness, it's helpful to picture the world purely using the physical stance. Stop thinking about raw feels. Think instead about moving atoms, flowing ions, network connectivity, and information transfer. Imagine the world the way neuroscience describes it -- because, in fact, this is a relatively precise account of the way the world is. If it seems as though everyone should be a zombie, don't worry about that for now.
Compare an insect with a human. Rather than imagining the human as conscious and the insect as not, or even the human as just more conscious than the insect, instead picture the two as you would a professional race car versus a child's toy car: as two machines of different sizes, complexities, and abilities that nonetheless share some common features and functionality.
Compare your brain with another part of your nervous system -- say the peripheral nerves in your hand. Why is your brain considered "conscious" and your hand not? It's because only your brain is capable of generating explicit, high-level, and verbalizable thoughts like remarking on its own consciousness. Your hand is also doing neural operations that resemble neural operations in your brain. It's just that the hand's operations don't get reported via memories and speech unless they "become famous" within your brain, where they can be thought about and verbalized.
The eliminativist approach encourages us to stop thinking about neural operations as "unconscious" or "conscious". Instead, in humans, think about the pathway along which neural information travels in order to reach your high-level thinking, speech, action, and memory centers. If the information fails to get there, we call it "subliminal" or "unconscious". If it does get there, we call it "conscious" because of the more pronounced effects it can have on other parts of the brain and body. Thus, for humans, we could replace the loaded "conscious" word with something like "globally available".
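To make the "globally available" framing concrete, here is a minimal toy sketch in Python (my own illustration; the module names and the salience threshold are invented, not taken from any neuroscientific model):

# Toy sketch: replacing "conscious vs. unconscious" with
# "globally available vs. locally processed".

class ToyBrain:
    def __init__(self):
        # Hypothetical high-level consumer modules.
        self.global_consumers = ["speech", "working_memory", "action_planning"]
        self.broadcast_log = []

    def process_signal(self, signal, salience):
        # Low-level processing happens either way.
        processed = f"processed({signal})"
        if salience > 0.5:
            # Broadcast to speech, memory, planning, etc. -- the signal can now
            # be remembered, reported, and acted upon.
            for consumer in self.global_consumers:
                self.broadcast_log.append(f"{consumer} received {processed}")
            return "globally available"   # what folk psychology calls "conscious"
        # Handled locally; never reaches the reporting systems.
        return "locally processed"        # what folk psychology calls "unconscious"

brain = ToyBrain()
print(brain.process_signal("pinprick on hand", salience=0.9))            # globally available
print(brain.process_signal("slight pressure from sock", salience=0.1))   # locally processed

On this picture, "conscious" and "unconscious" just label which route a signal happened to take, not the presence or absence of an extra metaphysical ingredient.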
There are many more details behind what the brain does that we ordinarily think of as consciousness. The best way to get an intuitive sense for them is to learn more neuroscience, perhaps from a popular book. While I became convinced that non-reductive accounts of consciousness could not be right based mainly on philosophical arguments, it was after reading neuroscience that I actually internalized the eliminativist world view. The gestalt shift toward eliminativism requires time and reading to sink in.
Picturing systems physically gives us fresh eyes when deciding what we value. When we adopt the common-sense phenomenal stance, we see a world in which discrete minds move about in otherwise unconscious matter. When we adopt the physical stance, we see various kinds of matter interacting with one another. Some of those matter types (e.g., animals and computers) are more dynamic and sophisticated than other types, but there's a fundamental continuity to the picture among all parts of the system. And we can see that while the system can be sliced in various ways to aid in description and conceptualization, it is ultimately a unified whole.
Ethics in this world view involves valuing or disvaluing various operations within the symphony of physics to different degrees. Some philosophers assign value based on the beauty, complexity, or interestingness of the physics that they see. Those who value conscious welfare instead aim to attribute degrees of sentience to different parts of physics and then value them based on the apparent degree of happiness or suffering of those sentient minds. Because it's mistaken to see consciousness as a concrete thing, sentience-based valuation, like the other valuation approaches, involves a projection in the mind of the person doing the valuing. But this shouldn't be so troubling, because metaethical anti-realists already knew that ethics as a whole was a projection by the moral agent. The eliminativist position just adds that the thing being (dis)valued, consciousness, is itself something of a fiction of the moral agent's invention.
Actually, calling "consciousness" a fiction is too strong. As noted above, "consciousness" refers to real distinctions -- e.g., in the case of human-like brains, we may consider global access to and ability to report on information as important components of consciousness. I just mean "fiction" in the same sense as nations, genders, or tables are fictions; they're constructions of the human mind that help conceptually organize physical phenomena.
I should note that making sentience evaluations based on knowledge of physical processes doesn't mean making superficial evaluations. A humanoid doll that blinks might look more conscious than a fruit fly, but the 100,000 neurons of the fruit fly encode a vastly more complex and intelligent set of cognitive possibilities than what the doll displays. Judging by objective criteria given sufficient knowledge of the underlying systems is less prone to bias than phenomenal-stance attributions.
Moreover, there's a sense in which nothing ethically important would be left out if we eschewed the idea of "consciousness" and only thought in terms of physical processes. In principle, it would still be straightforward to draw ethical distinctions between so-called "conscious" and so-called "unconscious" human minds, because the brain-activity patterns of the two are clearly distinct. We could still hear what people had to say about the intensity of their emotional feelings and use those reports to make judgments. We could watch their brains and see the neural correlates of those reports. We could develop intuitions for what sorts of physical processes lead to attestations of pleasure and pain, and then we could generalize those kinds of algorithms so as to see them in other places. If we in principle had access to all the operations of a mind, there would be no thought or feeling that would go unnoticed. This approach would actually be more powerful at locating sentience even in ourselves than our subjective feelings are, since the parts of our brains that develop explicit thoughts and decide high-level actions don't have access to most of the neural operations taking place at the lower levels of the brain or other parts of the body, just like they don't have access to the minds of other people or animals. Knowledge of the physical operations taking place in our minds and other minds makes it possible to value processes of which we would have previously been unaware and which may have previously "suffered" in silence.
Sneddon et al. (2014) offer a very helpful table for assessing pain in different animal taxa (Table 2, p. 204).
The authors say (p. 209): "Our summary of the evidence supports the conclusion that many animals can experience pain-like states by fulfilling our definition of pain in animals, although we accept that 100% certainty cannot be established for any animal species." In other words, Sneddon et al. (2014) speak as though "pain experience" is a thing that may or may not be present, and all we can do is make informed guesses about its existence.
To an eliminativist, this is wrongheaded. Rather, "pain experience" is a label we give to the suite of behavioral and functional responses that organisms have to aversive stimuli. There's no binary answer to whether insects or mammals "feel pain". The answer should instead be: Look at Table 2 of Sneddon et al. (2014) to see what abilities a given animal type does and doesn't have, and explore further to find out what sorts of other behaviors and internal cognitive processing occur in these animals. These findings are not mere indicators of pain; they are (parts of) the cluster of things we mean by "pain". Stop answering "yes" or "no" to the question "Do insects feel pain?" and start describing the details of what we know about insect nervous systems and behaviors.
I like this quote from Daniel Dennett, in Rothman (2017): "I think we should just get used to the fact that the human concepts we apply so comfortably in our everyday lives apply only sort of to animals. [...] If you think there’s a fixed meaning of the word ‘consciousness,’ and we’re searching for that, then you’re already making a mistake." And Dennett (1995):
The very idea of there being a dividing line between those features "it is like something to be" and those that are mere "automata" begins to look like an artifact of our traditional presumptions. I have offered (Dennett, 1991) a variety of reasons for concluding that in the case of adult human consciousness there is no principled way of distinguishing when or if the mythic light bulb of consciousness is turned on (and shone on this or that item). Consciousness, I claim, even in the case we understand best -- our own -- is not an all-or-nothing, on-or-off phenomenon. If this is right, then consciousness is not the sort of phenomenon it is assumed to be by most of the participants in the debates over animal consciousness. Wondering whether it is "probable" that all mammals have it thus begins to look like wondering whether or not any birds are wise or reptiles have gumption: a case of overworking a term from folk psychology that has lost its utility along with its hard edges. [...]
The phenomenon of pain is neither homogeneous across species nor simple.
Sneddon et al. (2014), p. 208: "Insects may show behaviour that suggests an affective or motivational component (e.g. it has complex and long-lasting effects), but insects could do this, at least in some cases, by using mechanisms that require only nociception (e.g. long-term nociceptive sensitization) and/or advanced sensory processing (i.e. without any internal mental states)." Again, this quote seems to make a binary distinction between sensory processing and internal mental states. But "advanced sensory processing" is one part of internal mental states. Mental states are a collection of representations that a cognitive system combines, and it appears that insect nervous systems do combine sensory representations in at least rudimentary ways, even though their mental states lack the sophistication of those in mammals.
Following is a possible steelman of Sneddon et al. (2014), one that I largely agree with: Much of the moral importance of suffering comes from deep, internal functional processing within an animal brain, and this processing is difficult to inspect directly (although it could be completely understood in principle with sufficient time and computing power). Therefore, we use more externally measurable behaviors like those in Table 2 to guess at the kinds and sophistication of internal processing that goes on. While externally visible behaviors are indeed a part of what pain is, they aren't the most important part of what pain is, which is why it can still make sense to speak in terms of uncertainty about whether animals feel pain, even if we already know lots of behavioral facts about them.
Try adopting the physical stance as you go about your day. When you have a particular feeling or become aware of a particular object, think about what kinds of neural operations are occurring in your head as that happens. Contemplate the brain processes that underlie the behaviors of those around you. See yourself as a chunk of physics moving around within a bigger world of physics.
My experience with this exercise is that it soon becomes less weird to adopt a physical stance. It feels more intuitive that, yes, I am an active, intelligent collection of cells whose sharing and processing of signals constitutes the inner life of my mind and allows for a vast repertoire of behaviors. Worries that I should be a zombie vanish, because I can feel what it's like to be physics for what it is. In fact, being physics feels just like it always did when I thought consciousness was somehow special.
In this mood, questions like, "Why do these physical operations feel like something?" appear less forceful, because I'm already "at one" with the universe. Yes, the neurons in my brain are doing particular kinds of processing that other clumps of atoms in the world are not, and this explains why these thoughts show up in my head and not in the floor or a beetle outside. But does it matter that I'm having these particular thoughts? Can't other "thoughts" by other parts of physics matter too for being what they are? Isn't it chauvinist to privilege just cognitive operations that are sufficiently complex and of a particular type? Or is extending sympathy to even simple physics a stance that's based more on theoretical elegance and spirituality, when in fact only sufficiently self-reflective minds have "anyone home" who can meaningfully care about his/her own subjective experiences?
These questions are important to debate, but we see that they take place within the eliminativist realm. We can import some phenomenal-stance intuitions when thinking about what parts of physics we want to regard as "suffering", but we don't trip ourselves up over trying to pigeonhole suffering as being something other than an attribution we make to whirlpools within the ocean of physics.
Eliminativism is not universally shared, particularly among philosophers. (It may be more common among neuroscientists and artificial-intelligence researchers?) Sometimes people encourage me to discuss practical questions of sentience with less dependence on my particular philosophical view of consciousness. For example, even if I thought consciousness were a privileged thing, I could still argue that basic physics has some small chance of being conscious. This would yield similar practical conclusions as the eliminativist view does.
This is a fair point, but I think attacking the core confusion about consciousness itself is quite important, for the same reason that it's important to break down the confusions behind theism even if you can argue for a lot of the same practical conclusions whether or not theism is true. Viewing consciousness as a definite and special part of the universe is a systematic defect in one's world view, and removing it does have practical consequences. Looking at the universe from a more physical stance has helped me see that even alien artificial intelligences are likely to matter morally, that plants and bacteria have some ethical significance, and that even elementary physical operations might have nonzero (dis)value. In general, Copernican revolutions change our ethical intuitions in possibly profound ways.
A legitimate concern about eliminativism is that it could reduce the intuitive importance or meaningfulness of altruism. If everything is just particles moving in different ways, why should I care? Anthony Jack and colleagues have found evidence that "there is a physiological constraint on our ability to simultaneously engage two distinct cognitive modes", namely, social and physical reasoning.
But if eliminativism does reduce compassion, it may be because the eliminativist position has not been completely understood. If you still think of consciousness as being something special, then eliminativism sounds like a view that the world doesn't contain that special thing, so nothing in the world matters. But what eliminativism really says is that all the specialness you thought was in the world is still there and in fact may be more universal than you realized.
Consider a fish suffocating on the deck of a fishing boat. It flops back and forth, apparently in agony. A conventional approach is to say that if this fish is conscious, then it must be aware of its terrible suffering, which is bad and should be avoided. An eliminativist can instead describe, in physical terms, the changes taking place in the fish's brain and body as it suffocates.
When I contemplate those brain changes, they look really bad, and I feel almost as much empathy as when I think about the fish from the common-sense standpoint of having an agonizing subjective experience. If we care about suffering, then we care about what suffering actually is even on closer inspection.
Steven Weinberg said "With or without religion, good people can behave well and bad people can do evil; but for good people to do evil--that takes religion." In a similar way, it's plausible that for good altruists to ignore suffering requires confusion about consciousness. If you hold Descartes's view that animals are non-sentient machines, you can really delude yourself into thinking that the struggling of a dog when vivisected is not conscious and hence doesn't matter. An eliminativist realizes that lots of aversive processing is really going on in the dog's head, and so the vivisection must be at least somewhat bad, depending on the negative weight given to those aversive processes. The eliminativist position is thus more cautious in some sense.
That said, I'm describing here mainly my experience with eliminativism, as someone who was already heavily committed to reducing suffering. It remains an empirical question what kinds of effects eliminativism has on average for various populations and depending on the degree to which it's internalized. In any case, because I think something like eliminativism about consciousness will become more widely accepted in the future as understanding of neuroscience increases, we may want to figure out how to frame eliminativism in a more altruism-friendly way rather than just sweeping it under the rug.
The physical stance is more impartial and accurate than the phenomenal stance in accounting for all the mind-like processes that exist in the world. However, the physical stance is also more dispassionate. While the brain of a person being tortured does look physically very distinctive -- with lots of activity and long-lasting neural "scars" being created -- appreciating its true awfulness requires imagining ourselves in its position. Without subjective imagination, a physical-stance approach is liable to give way to aesthetic judgments -- valuing more brains that appear more interesting, sophisticated, nuanced, or dynamic. Looking for beauty and novelty is a natural temptation when we view physical objects, but it has little to do with ethics. There's a danger that eliminativism gives too much sway to non-empathic judgment criteria.
I think we should try out the eliminativist view as an exercise, to bend our prior prejudices and intuitions. When we unshackle ourselves from the conventional concept of consciousness, how many other ways might there be to reimagine the world! That said, eliminativism doesn't have to be and arguably should not be the only way we think about consciousness, just as our slow, utilitarian moral system needn't be the only way we think about ethics. Rather, we can blend the insights of eliminativism with those of a more common-sense, phenomenal stance -- with the aim of achieving a reflective equilibrium that incorporates insights from each.
Eliminativism and panpsychism may seem like polar opposites, but they're actually two sides of the same coin, in a similar way as 0 degrees and 360 degrees on a unit circle point in the same direction. Both maintain that there's nothing distinctive about consciousness that sharply distinguishes it from the rest of the universe. Panpsychism recognizes that all the computations of physics have a fundamental similarity to them, and it considers different computations as different shades of the same basic thing (though the shades may differ quite a bit). Eliminativism rejects talk about "consciousness" in favor of physical descriptions, and once again we can see a fundamental continuity among the diverse flavors of physical processes. Whether it uses the word "consciousness" or not, each perspective points at the same underlying reality.
That said, it's worth noting that even if we recognize all of physics as fundamentally mental in some sense, it remains a matter of choice how much we care about simple physical operations. We might legitimately decide that only really complex systems like those that emerge in animal brains contain moral significance.
Rob Bensinger, an eliminativist, writes:
believing I’m a zombie in practice just means I value something functionally very similar to consciousness, ‘z-consciousness’. [...]
Since (z-)consciousness isn’t a particularly unique kind of information-processing, I expect there to be an enormous number of ‘alien’ analogs of consciousness, things that are comparable to ‘first-person experience’ but don’t technically qualify as ‘conscious’. [...] [Due to the implications of eliminativism,] I’m much more skeptical that (z-)consciousness is a normatively unique kind of information-processing. Since I think a completed neuroscience will overturn our model of mind fairly radically, and since humans have strong intuitions in favor of egalitarianism and symmetry, it wouldn’t surprise me if certain ‘unconscious’ states acquired the same moral status as ‘conscious’ ones.
I don't think Bensinger is endorsing all-out panpsychism here (and indeed, Bensinger disavows panpsychism elsewhere), but the spirit of his comments is similar to mine.
Churchland (1996) makes what I interpret as an argument against the ontological fundamentalness of qualia by appealing to their fuzzy boundaries (p. 404):
Although it is easy enough to agree about the presence of qualia in certain prototypical cases, such as the pain felt after a brick has fallen on a bare foot, or the blueness of the sky on a sunny summer afternoon, things are less clear-cut once we move beyond the favoured prototypes. Some of our perceptual capacities are rather subtle, as, for example, positional sense is often claimed to be. Some philosophers, e.g. Elizabeth Anscombe, have actually opined that we can know the position of our limbs without any 'limb-position' qualia. [...]
Vestibular system qualia are yet another non-prototypical case. Is there something 'vestibular-y' it feels like to have my head moving? To know which way is up? Whatever the answer here, at least the answer is not glaringly obvious. [...]
My suspicion with respect to The Hard Problem strategy is that it seems to take the class of conscious experiences to be much better defined than it is. The point is, if you are careful to restrict your focus to the prototypical cases, you can easily be hornswoggled into assuming the class is well-defined.
Perhaps qualiaphiles could reply that qualia are in fact crisp entities, but we just don't always perceive or describe their boundaries correctly. Maybe our language is not up to the task. Moreover, the contents of qualia may differ from person to person.
In the above piece, I tried to insist that eliminativism doesn't deny consciousness per se, only the particular conception of consciousness that some philosophers cling to. As of 2015, I'm leaning more toward Minsky's view that it might be most clear to dispense with the "consciousness" word altogether, since it causes so much confusion. This post expresses a similar idea: "There’s something to be said for the bracing elegance of the two-word formulation of scepticism offered by Dennett [...] – ‘What qualia?’"
Through many conversations about consciousness, I've concluded that eliminativism may be the clearest way to explain type-A physicalism, because the allure of dualism (even when dressed up as physicalist monism) is so irresistible to human minds. In other words, eliminativism helps shock us out of our complacency and actually come to terms with what a truly non-dualist view of consciousness requires. While (type-A) reductionism on consciousness is also a reasonable viewpoint in principle, in practice some people have trouble being mere reductionists without falling into the trap of property dualism. Put differently, eliminativism is like training wheels: it's useful until you're ready to wield the idea of "type-A reductionism regarding consciousness" correctly, without falling over and hurting yourself.
The mantra of the more radical version of eliminativism is that we're not conscious but only think we are. How is that possible? "I just know I'm conscious!" But any thoughts you have about your being conscious are fallible. I believe there are bugs in the vast network of computation that produces thoughts like "I'm conscious in a way that generates a hard problem of consciousness." No thought you have is guaranteed to be free from bugs, and it seems more likely -- given the basically useless additional complexity of postulating a metaphysically privileged thing called consciousness -- to suppose that our attribution of metaphysically privileged consciousness to ourselves is a bug in our cognitive architectures. This is a relatively simple way to escape the whole consciousness conundrum. If it feels weird, that's because the bug in your neural wiring is causing you to reject the idea. Your thoughts exist within the system and can't get outside of it.
Your brain is like a cult leader, and you are its follower. If your brain tells you it's conscious, you believe it. If your brain says there's a special "what-it's-like-ness" to experience beyond mechanical processes, you believe it. You take your cult leader's claims at face value because you can't get outside the cult and see things from any other perspective. Any judgments you make are always subject to revision by the cult leader before being broadcast. (Similar analogies help explain the feeling of time's flow, the feeling of free will, etc.)
Carruthers and Schier (2017) summarize Dennett's view as follows: "there is no reason to think that appearing to the subject cannot be understood in functional terms. In brief, appearing seems to involve the subject gaining access to the thing that is appearing and there is no reason that we cannot give a functional analysis of access."
I like how Michael Graziano explains it:
I believe a major change in our perspective on consciousness may be necessary, a shift from a credulous and egocentric viewpoint to a skeptical and slightly disconcerting one: namely, that we don’t actually have inner feelings in the way most of us think we do. [...]
a new perspective on consciousness has emerged in the work of philosophers like Patricia S. Churchland and Daniel C. Dennett. Here’s my way of putting it:
How does the brain go beyond processing information to become subjectively aware of information? The answer is: It doesn’t. The brain has arrived at a conclusion that is not correct. [...]
You might object that this is a paradox. If awareness is an erroneous impression, isn’t it still an impression? And isn’t an impression a form of awareness?
But the argument here is that there is no subjective impression; there is only information in a data-processing device. When we look at a red apple, the brain computes information about color. It also computes information about the self and about a (physically incoherent) property of subjective experience. The brain’s cognitive machinery accesses that interlinked information and derives several conclusions: There is a self, a me; there is a red thing nearby; there is such a thing as subjective experience; and I have an experience of that red thing. Cognition is captive to those internal models. Such a brain would inescapably conclude it has subjective experience.
So there, I said it: Consciousness doesn't exist. Now let's figure out more precisely what we are pointing at when we seek to reduce conscious suffering.
Often I hear claims that "I'm more certain that I'm conscious than I am about anything else." I disagree. Our perception of being conscious, just like our perception of anything else, is a hypothesis that our brain constructs, based on very complicated processing and lower-level thinking, expressed in terms of a simplified ontology that the brain can make sense of. Anything that you know is the result of complex computation by an information-processing device. But then why privilege some types of visceral, intuitive judgments that your brain makes over other judgments your brain makes? All of your knowledge is constructed by the brain's information processing in one way or another. "Knowing that I'm conscious" is not a thought that somehow transcends ordinary brain machinery, nor does it deserve to be made axiomatic in one's ontology.
Focus your attention on the visual image you see of the world in front of you: a rich jumble of colors, shapes, textures, and patterns. How can those not be the philosopher's qualia?
Naively, it looks like eliminativism can't explain these data. But exactly how we characterize the data makes a difference to our theoretical interpretation. If we declare the visual imagery in front of us as something metaphysically special -- "mental phenomena" -- then eliminativism cannot account for them. But to suppose that the visual scenes we see are phenomena in their own metaphysical category is to beg the question.
An alternate characterization of our visual experiences is that they represent "(data about the external world) + (an explicit or implicit judgment by our brains that we're seeing a rich collection of colors, shapes, etc.)". All we know is the judgments our brains make. If our brains judge that we're seeing rich colors and shapes, then we'll think we are seeing such things. The eliminativist hypothesis, then, just predicts that our brains make these judgments about seeing things-it-calls-qualia when it attends to its processing of visual input.
These judgments needn't be verbal but are often just more basic moments of noticing how something seems. For instance, when I'm going about my day, I typically don't even notice the colors and shapes around me, but if I focus my attention on how they look, I undergo a nonverbal process of feeling like "Wow, there's something it looks like to see what's in front of me!" This feeling is an implicit "judgment" that my brain makes, and it's all that's needed to explain the fact that we feel like we have qualia.
That I typically don't notice "qualia" unless I attend to them bolsters this view. Most of the time, my brain is focused on my own internal thoughts and basically ignores the world it sees. In this case, my brain is just processing visual data "unconsciously". Then, when I focus on some visual input in particular, my brain produces an implicit (and sometimes explicit) judgment that "this thing has distinctive color and texture and shape, and it feels like something to see it". In other words, so-called "conscious experiences" are experiences that our brains judge to be conscious.
While we don't yet have a fully developed theory of humor, laughing may be a reaction that we have to particular kinds of unexpected juxtapositions. In a similar way, it's possible that our brains also have a sort of "specialness" classifier, which fires when we process certain inputs or have certain thoughts that don't seem to be fully explainable in physicalist terms. For some people, this classifier may lead to belief in God or spirits or magic. And for almost everyone (myself included) this hypothetical classifier may lead us to believe there's a "feeling of what it's like" to be conscious and that, e.g., the visual data in our brains is somehow "a unified visual scene of textures, colors, and objects" rather than activations of neural patterns corresponding to information available for us to use, combined with our flailing minds trying to make sense of that information with whatever simple metaphors they can.
Qualia are user illusions. When you click a folder icon on your computer desktop, there's not actually a little folder there; your computer just tells you that there's a folder there, when in fact it represents more complex processing in your computer's hard drive. Likewise, when you feel pain, there's not actually an ontological what-it's-like-ness experience; it's just that your brain tells you it's having a qualitative experience of pain, when in fact what's happening under the hood is complex brain processing. The user illusion of folders on a computer is like our folk-psychological "phenomenal concepts", while the underlying computation and data are like our "physical concepts". The hard problem of consciousness results from the difficulty we have of seeing how the ontology of mental "folder icons" can match up with the ontology of mental "code and data".
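As a loose illustration of the folder-icon analogy, consider the following sketch (the file names, addresses, and bytes are all invented):

# Purely illustrative: the interface layer reports a tidy folder of files,
# while the underlying layer is just scattered blocks of bytes.

underlying_storage = {
    # Invented example data: block addresses mapped to raw bytes.
    0x1A2B: b"\x89PNG...",
    0x3C4D: b"%PDF-1.7...",
    0x5E6F: b"PK\x03\x04...",
}

def desktop_view():
    # The "folder icon" level: what the user is told exists.
    return {"Vacation Photos": ["beach.png", "itinerary.pdf", "tickets.zip"]}

def physical_view():
    # The "code and data" level: what is actually there.
    return underlying_storage

print(desktop_view())   # a folder with files -- the user illusion
print(physical_view())  # addresses and bytes -- the underlying reality

The "folder" is not false exactly -- it organizes the underlying data usefully -- but it would be a mistake to go looking for a little manila folder among the bytes.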
That consciousness is a user illusion doesn't mean that only experiences judged to be conscious matter ethically. The judgment process only adds a small level of reflection on what were already complex, substantive computations. The pre-eliminativist view that only "conscious" emotions matter was based on a confused idea that only "conscious" emotions are "real qualia" in a metaphysical sense. Once we cast off this way of thinking, it becomes less plausible that only experiences judged to be conscious have moral weight.
We can describe the same reality at different levels of abstraction, i.e., using different ontological frameworks. As an example of a different ontology, imagine a simple computational agent that moves about in a "grid world" consisting of 16 squares -- 16 possible "states" that it can be in. This simple agent knows nothing of the Earth, particle physics, or even humans. It just "knows" (in some very simplistic way) its own little world of state transitions and whatever dynamics drive its behavior. Likewise, we humans are acquainted with a simplified ontology of our own brains -- that we have things called "conscious experiences" and that we transition through these experiences. If we were cave people, we would know nothing of neurons, the cerebral cortex, or gamma oscillations (except for whatever we found in the skulls of other animals that we killed and ate). We would just have a simplistic, subjective model of our mental lives, which is what people refer to when they talk about knowing that they're conscious. This picture isn't mutually exclusive with a more detailed, physics-based portrait.
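Here is a minimal sketch of such a 16-state grid world (the layout and movement rule are arbitrary choices of mine):

# A simple agent whose entire "ontology" is 16 squares and the moves between
# them. It "knows" nothing about the computer running it, just as our folk
# ontology of conscious experiences knows nothing about neurons.

import random

def neighbors(state):
    # States 0..15 laid out as a 4x4 grid; return the reachable squares.
    row, col = divmod(state, 4)
    moves = []
    if row > 0: moves.append(state - 4)
    if row < 3: moves.append(state + 4)
    if col > 0: moves.append(state - 1)
    if col < 3: moves.append(state + 1)
    return moves

state = 0
for step in range(5):
    state = random.choice(neighbors(state))
    print(f"step {step}: agent is in square {state}")

The agent's state-transition ontology and the silicon-level description of the machine running it are both valid; they're just different levels of abstraction.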
Perhaps the easiest way to understand eliminativism, and the way I began to see the light in 2009, is as follows. First, let's describe the mainstream, intuitive view of consciousness among most neuroscientists and other scientifically minded people.
The view is essentially a pipeline: unconscious neural inputs take place, and then, with enough integration/processing of the right type, consciousness is created, which in turn produces the belief that you have qualia. The exact contents of the middle stage aren't essential; you can plug in whatever neuroscientific theory of consciousness you want there. And because the people holding this view are materialists, they insist that consciousness just is brain activity of the right type, even though consciousness is also a definite phenomenal thing. I think the best way to construe these claims into a coherent philosophy is to regard them as property dualism, which is why I say that most neuroscientists are closet property dualists. In any case, the relevant endpoint in this process is the final stage—the belief that you have qualia—because that's what causes you to insist to yourself and to others that qualia exist. In other words, if I ask, "How do you know you have qualia?", your reply can be: "My brain perceives this fact via the cognitive operations just described and then correctly concludes that it has qualia." For example, if you look at a complex visual scene and note all its colors, textures, etc., the neural correlates of this "experience" lead to brain-state changes corresponding to an accurate belief that you are experiencing such qualia.
The eliminativist alternative is simple: just drop the intermediate "consciousness is created" stage and keep everything else the same.
This results in exactly the same beliefs as if you do have qualia (whatever that's supposed to mean), and those beliefs are all we need to explain. The hypothesis that something "special" happens at some particular stage of neural processing or at some particular level of brain complexity is useless (although there is still a non-dualist sense in which very complex brains are qualitatively different from very simple ones, in a similar way as a tree is qualitatively different from a blade of grass). If you still want to declare that certain types of neural processing are consciousness, that's fine, but then you're not making a metaphysical claim, just a definitional one. Perhaps you're just making a generalization about the sorts of brain processes that tend to precede beliefs (of a certain type) that one is conscious, which I have no quarrel with. (Of course, there may be other, simpler kinds of beliefs in one's own consciousness that have other, simpler neural precursors. And it's unclear to what extent consciousness in general, rather than self-consciousness specifically, should be said to require belief in one's consciousness anyway.)
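The contrast can be caricatured in code. The following is my own rendering, with invented stage names; it assumes, as the argument above does, that the posited "special" consciousness does no extra causal work in producing the belief and the report:

# Both pipelines end in exactly the same belief state.

def integrate(sensory_input):
    # Stand-in for whatever neural processing your favorite theory requires.
    return f"integrated({sensory_input})"

def form_belief(representation):
    # The downstream machinery that generates insistence about qualia.
    return f"belief: I have qualia about {representation}"

def mainstream_view(sensory_input):
    representation = integrate(sensory_input)
    special_qualia = "*phenomenal consciousness*"  # the extra posited ingredient;
                                                   # note that nothing below uses it
    return form_belief(representation)

def eliminativist_view(sensory_input):
    representation = integrate(sensory_input)      # the same processing
    return form_belief(representation)             # the same belief

print(mainstream_view("red apple") == eliminativist_view("red apple"))  # True

Of course, a defender of the mainstream view would deny that the extra ingredient is causally idle; the sketch only shows what the eliminativist claims we can do without.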
Chalmers (1996), pp. 177-78, 180-81:
When I comment on some particularly intense purple qualia that I am experiencing, that is a behavioral act. Like all behavioral acts, these are in principle explainable in terms of the internal causal organization of my cognitive system. There is some story about firing patterns in neurons that will explain why these acts occurred; at a higher level, there is probably a story about cognitive representations and their high-level relations that will do the relevant explanatory work. [...]
In giving this explanation of my claims in physical or functional terms, we will never have to invoke the existence of conscious experience itself. [...]
[My zombie twin] remains utterly confident that consciousness exists and cannot be reductively explained. But all this, for him, is a monumental delusion. There is no consciousness in his universe—in his world, the eliminativists have been right all along.
Given that philosophical zombies sincerely believe themselves to be conscious, even though they lack the philosopher's qualia, how do you know that you're not a zombie? The beliefs physically manifested in your neural configuration are exactly the same as the zombie's. If you think zombies are possible, you should worry that you might actually be a zombie, since you have no way of knowing that you're not one. (By "you" in the previous sentence I mean your physical self, which is the action-guiding conception of self, given that non-interacting non-material selves/experiences/phenomenal properties don't take actions that can affect the physical world.) And if you think zombies aren't possible, then you aren't a property dualist.
A property dualist might reply to the previous paragraph as follows: "The conscious properties that supervene on my physical existence consciously know themselves to be conscious. Their consciousness self-verifies the fortuitous belief by my physical brain that I'm conscious." In reply, I would point out that even if true, this doesn't help our physical brains and bodies when making choices, since our physical brains and bodies can never know whether consciousness is supervening on them. It seems they would always have to act as if uncertain about whether phenomenal consciousness actually exists in their world. (That said, if you're a sentiocentrist property dualist who only cares about phenomenal sentience, then you may as well act as if you and others are actually conscious, since your actions only matter if that's the case.)
One might say: "But how can it be a matter of opinion whether I'm conscious? It's obvious to me that I am!" Indeed you do have beliefs that you're conscious, which you express in various ways. Combined with knowledge about what sorts of processing your brain does under the hood, that's a good reason to apply the label of "conscious" to you. But having a particular belief state does not, by itself, imply anything metaphysical. It's obvious to other people that God exists and talks to them through prayer, or that they've been abducted by aliens.
It's often said that consciousness can't be an illusion because having an illusion already implies consciousness. This is not correct. All that an illusion requires is false belief (or a false cognitive representation of some sort, probably not a propositional belief), and most philosophers agree that physical brain states of belief (or representation in general) don't require consciousness. An unconscious system can have the false belief that it's conscious without contradiction. You are such a system (at least if we interpret "consciousness" in the robust Chalmers kind of way).
The same idea of "explain why you believe X" rather than "explain why X exists" works for other puzzles of consciousness, such as why our conscious experiences seem unified. Suppose you develop a theory in which everything that we perceive (every color, shape, smell, etc.) "comes together" at a single point. How does this help explain why we perceive a unified conscious field? Unless there's a homunculus watching a Cartesian theater, then there's no need to actually bring all sense data together at once. All that's necessary is that we believe the data all come together at once. Take whatever special explanation for the unity of consciousness you proposed, and trace the steps of how it leads our neurons into configurations corresponding to belief in the unity of our consciousness. Now strip out whatever grand, sweeping assumptions you used to get there, and instead directly update the neurons via a more mundane pathway into a configuration corresponding to belief that conscious perception is unified. The idea is similar to illusions of other sorts, such as the illusion that the characters in flip books are actually moving. It's not necessary for the characters to actually move; all that's needed is that we believe they do. The same goes for the other tricks that our brains play on us regarding consciousness.
Consciousness is sort of like the book Harold and the Purple Crayon, in which "The protagonist, Harold, is a curious four-year-old boy who, with his purple crayon, has the power to create a world of his own simply by drawing it." If the brain creates a representation that "there's a coherent visual field in front of me with a variety of colored objects", then other brain processes that receive this information (including speech, memory, and motor control) will believe this story that they're told and act as if it's true. (And normally, those representations are true, though sometimes they aren't, such as in cases of optical illusion or dreams.) In other words, merely "drawing" a phenomenal experience in the language of neural representations makes it "come to life" in the sense that the rest of the brain responds appropriately. There doesn't need to actually be a coherent visual field living in some realm of phenomenal experience. All that's needed is for the brain to represent to itself, believe, and act as if it has such phenomenal experience.
Dennett (2016), p. 72:
We illusionists advise would-be consciousness theorists not to be so confident that they couldn’t be caused to have the beliefs they find arising in them by mere neural representations lacking all ‘phenomenal’ properties.
Following is a stylized example to illustrate what I mean when talking about the brain representing things to itself. While animal brains represent data differently than digital computers do, imagine hypothetically that the brain consists of connected subsystems that communicate with each other using JSON-formatted strings of text. Suppose that a Retina subsystem receives raw visual input, which it transmits to the Visual Cortex subsystem as a base64-encoded string like TWFuIGlzIGRpc3Rpbmd... The Visual Cortex subsystem processes that string and emits the following JSON string:
{
  "image_summary": "dog in a park",
  "salient_objects_in_image": {
    "left_side": "maple tree",
    "center": "dog",
    "right_side": "park bench"
  },
  "average_brightness": "high",
  "objects_moving": true,
  "image_has_multiple_colors": true,
  "image_appears_unified": true,
  "whole_image_is_visible_at_once": true,
  "my_emotional_response": "happy",
  ...
}
This message is broadcast to Working Memory, Non-verbal Thought, Speech, Action, Emotional Valence, and other subsystems in the brain, which receive the data and act accordingly. For example, when the Non-verbal Thought subsystem thinks about the qualitative nature of the visual field, it notes that "image_appears_unified" is true, so the Non-verbal Thought subsystem concludes that the visual field is indeed unified. (This person might go on to write papers attempting to explain the unity of consciousness.) Similarly, the Speech subsystem reports that the whole image is visible at once because it has been told that "whole_image_is_visible_at_once" is set to true.
Now, this example is probably wrongheaded in many ways, and plausibly the brain's neural representations among its subsystems are nothing like what I just described. However, I give this example just to make concrete the kind of thing I have in mind when talking about "representations" that the brain makes to itself. Regardless of the usefulness of my example, the broader point remains: however it is that the brain comes to conclude certain things, such as that its visual field is coherent and unified, we can trace exactly what algorithmic steps led up to that conclusion, without ever needing to discuss "qualia" or "phenomenal consciousness". Whatever processes cause a philosophical zombie to earnestly think to itself things like "I have a visual experience that's not just data representation in an algorithmic system" can explain human consciousness as well, because humans are zombies.
"don't hold strong opinions about things you don't understand" --Derek Hess
Susan Blackmore believes the way we typically think about consciousness is fundamentally wrong. Many "theories of consciousness" that scientists advance and even the language we use set us up for a binary notion of consciousness as being one discrete thing that's either on or off.
We can tell there's something wrong with our ordinary conceptions when we think about ourselves. Suppose I grabbed a man on the street and described to him every detail of what your brain is doing at a physical level -- including neuronal firings, evoked potentials, brain waves, thalamocortical loops, and all the rest -- but without using suggestive words like "vision" or "awareness" or "feeling". Very likely he would conclude that this machine was not conscious; it would seem to be just an automaton computing behavioral choices "in the dark". If our conceptualization of consciousness can't even predict our own consciousness, it must be misguided in an important way.
Imagine we have perfect neuroscience knowledge. We understand how every neuron in the brain is hooked up, how it fires, and what electrical and chemical factors modulate it. We understand how brain networks interact to produce complex patterns. We have high-level intuitions for thinking about what the functions of various neural operations are, in a similar way as a programmer understands the "gist" of what a complex algorithm is doing. Given all this knowledge, we could trace every aspect of your consciousness. Every thought and feeling would have a signature in this neural collective. Nothing would be hidden exclusively to your subjective experience; everything would have a physical, observable correlate in the neural data.
We need a conception of consciousness which makes it seem obvious that this collection of observable cognitive operations is conscious. If that's not obvious, and especially if that seems implausible or impossible, then our way of thinking about consciousness is fundamentally flawed, because this neural collective is in fact conscious.
Sometimes I have conversations like this:
Brian: Do you think insects are conscious?
Other person: No, of course not.
Brian: Why do you think they're not?
Other person: Well, it just seems absurd. How could a little thing executing simple response behaviors be conscious? It's just reacting in an automatic, reflexive way. There's no inner experience.
Brian: If you didn't know from your own subjective experience that you were conscious, would you predict that you were conscious, or would you see yourself as executing a bunch of responses "in the dark" as the behaviorists might have seen you?
Other person: Hmm, well, I think I would know I'm conscious because I behave more intelligently than an insect and can describe my inner life.
Brian: Can you explain what about your brain gives rise to consciousness that's not present in an insect?
Other person: Uh....
Brian: If you don't understand why you're conscious, how can you be so sure an insect isn't conscious?
Other person: Hmm....
I know that I'm conscious. I also know, from neuroscience combined with Occam's razor, that my consciousness consists only of material operations in my brain -- probably mostly patterns of neuronal firing that help process inputs, compute intermediate ideas, and produce behavioral outputs. Thus, I can see that consciousness is just the first-person view of certain kinds of computations -- as Eliezer Yudkowsky puts it, "How An Algorithm Feels From Inside". Consciousness is not something separate from or epiphenomenal to these computations. It is these computations, just from their own perspective of trying to think about themselves.
In other words, consciousness is what minds compute. Consciousness is the collection of input operations, intermediate processing, and output behaviors that an entity performs. Now, some people would object at this point and say that maybe consciousness is only a subset of what brains compute -- that most of brain activity is "unconscious", and thoughts and feelings only become "conscious" when certain special kinds of operations happen. In response, I would point out that there's not a major discontinuity in the underlying computations themselves that warrants a binary distinction like this. Sure, some thoughts are globally broadcast and others aren't, and the globally broadcast thoughts are accessible to a much wider array of brain functions, including memory and speech, which allows us to report on them while not reporting on signals that are only locally broadcast. But the distinction between local and global broadcasting is ultimately fuzzy, as will be any other distinction that's suggested as the cutoff point between unconscious and conscious experience.
If we look at computations from an abstract perspective, holding in abeyance our intuitions that certain kinds of computations can't be conscious, we can see how the universe contains many varieties of computation of all kinds, in a similar way as nature contains an enormous array of life forms. It's not obvious from this distanced, computation-focused perspective that one subset of computations (namely those in brains of complex animals) is privileged, while all other computations are fundamentally different. Rather, we see a universal continuity among the species of computations, with some being more complex and sophisticated than others, in a similar way as some life forms are more complex and sophisticated than others.
From this perspective, it is clear why our neural collective is conscious: It's because (one flavor of) consciousness is the process of doing the computations that our brains do. The reason we're "not conscious" under general anaesthesia is that the kinds of global information distribution that our brains ordinarily do are prevented, so we can't have complex thoughts like "I'm conscious" or store memories that would lead us to think we had been conscious. But there are still some other kinds of computations going on that have their own kinds of "consciousness", even if of a different nature than what our intuitive, analytical, or linguistic brain operations would understand.
I should add a note on terminology: By "computation" I just mean a lawlike transition from input conditions to output conditions, not necessarily something computable by a Turing machine. All computations occur within physics, so any computation is a physical process. Conversely, any physical process proceeds from input conditions to output conditions in a regular manner and so is a computation. Hence, the set of computations equals the set of physical processes, and where I say "computations" in this piece, one could just as well substitute "physical processes" instead.
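As a trivial sketch of this broad sense of "computation" (with made-up numbers), here is a physical process -- a cooling cup of coffee -- treated as a lawlike transition from one state to the next:

# One time step of Newtonian cooling: a physical process as a state transition.

def cooling_step(temp, ambient=20.0, rate=0.1):
    return temp + rate * (ambient - temp)

temp = 90.0
for _ in range(3):
    temp = cooling_step(temp)
    print(round(temp, 2))  # 83.0, 76.7, 71.03

Nothing hangs on the details; the point is only that "physical process" and "computation" pick out the same lawlike input-output regularities.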
Talk about consciousness is always somewhat mystical. Consciousness is not a hard, concrete thing in the universe but is more of an idea that we find important and sublime, perhaps similar to the concept of Brahman for Hindus. When we think about consciousness, we're essentially doing a kind of poetry in our minds -- one that we find spiritually meaningful.
When we conceive of consciousness as being various flavors of computations, the question arises: What is it like to be another kind of computation than the one in our heads? I've suggested elsewhere that there's some extent to which we can't in principle answer this question fully, because our brains are our brains, and they can't perfectly simulate another computation without being that computation, in which case they would no longer be our current brains. But we can still get some intuitive flavor of what it might mean for another consciousness to be different from ours.
One way to start is just to notice that our own minds feel different and have many different experiences at different times. Being tired feels different from being alert, which feels different from being scared, which feels different from being content in a warm blanket. Even more trivially, seeing a spoon looks different from seeing a fork, which looks different from seeing a penny. Our brains perform many different computations at different times, and these each have their own textures. More extreme examples include being on the edge of sleep, dreaming, waking up slowly after "going under" for surgery, or meditating.
What about other animals? Can we imagine what it's like to be a worm? Fundamentally we can't, but here's an exercise that may at least gesture in the right direction. Read the following instructions and then try it:
Instructions: Close your eyes. Stay still. Stop noticing sounds and smells. Turn off the linguistic inner voice that thinks verbal thoughts in your head. In fact, try to stop thinking any thoughts as much as possible. Now, poke your head with your fingers. Scratch it softly with your fingernails. Tap it with your hand. Face your head toward a light and notice how it looks bright even though you can't see anything definite due to your eyes being closed. Turn your head away. Notice air moving gently across your skin.
This exercise helps mimic the way in which worms have no eyes or ears and presumably no complex thoughts, especially not linguistic ones. Yet they do have sensitivity to touch, light, and vibrations.
Now, even this exercise is far from adequate. Human brains have many internal processes and computing patterns that don't apply to worms. Even if we omit senses that worms lack and try to suppress high-level thoughts, this human-like computing scaffolding remains. For instance, maybe our sense of being a self with unified and integrated sensations is mostly absent from worms. Probably many other things are absent too that I don't have the insight to describe. But at least this exercise helps us begin to imagine another form of consciousness. Then we can multiply whatever differences we felt during this exercise many times more when we contemplate how different a worm's experiences actually are.
In some sense all I've proposed here is to think of different flavors of computation as being various flavors of consciousness. But this still leaves the question: Which flavors of computation matter most? Clearly whatever computations happen when a person is in pain are vastly more important than what's happening in a brain on a lazy afternoon. How can we capture that difference?
Every subjective experience has corresponding objective, measurable brain operations, so the awful experiences of pain must show up in some visible way. It remains to be seen exactly what agony corresponds to, but presumably it includes operations like these: neural networks classifying a stimulus as bad, aversive reactions to the negative stimulus, negative reinforcement learning, focused attention on the source of pain, setting down aversive memory associations with this experience, and goal-directed behavior to escape the situation, even at cost to other things of value. There may be much more, but these basics are likely to remain part of the equation even after further discoveries. (Note: It may be that we should want neuroscience discoveries to come slower rather than faster.) But if so, it becomes plausible that when we see these kinds of operations in other places, we should disvalue them there as well.
This is why an ethical viewpoint like biocentrism has something going for it. (Actually, I prefer "negative biocentrism", analogous to "negative utilitarianism".) All life can display aversive reactions against damage to some degree, and since these are computations of certain flavors, it makes sense to think about them as being conscious with certain flavors. Of course, the degree of importance we place on them may be very small depending on the organism in question, but I don't see fundamental discontinuities in the underlying physics, so our valuation functions should not be discontinuous either. Still, our valuation functions can be very steep. In particular, I think animals like insects are vastly more complex than plants, fungi, or bacteria, so I care about their flavors of consciousness more.
My perspective is similar to that of Ben Goertzel, who said:
My own view of consciousness is a bit eccentric for the scientific world though rather commonplace among Buddhists (which I'm not): I think consciousness is everywhere, but that it manifests itself differently, and to different degrees, in different entities.
Alun Anderson, who spent 10 years studying insect sensation, believes "that cockroaches are conscious." He elaborates:
I don't mean that they are conscious in even remotely the same way as humans are[...]. Rather the world is full of many overlapping alien consciousnesses.
[...]To think this way about simple creatures is not to fall into the anthropomorphic fallacy. Bees and spiders live in their own world in which I don't see human-like motives. Rather it is a kind of panpsychism, which I am quite happy to sign up to, at least until we know a lot more about the origin of consciousness. That may take me out of the company of quite a few scientists who would prefer to believe that a bee with a brain of only a million neurones must surely be a collection of instinctive reactions with some simple switching mechanism between them, rather [than having] some central representation of what is going on that might be called consciousness. But it leaves me in the company of poets who wonder at the world of even lowly creatures.
The argument for panpsychism, I guess, is: If strong emergence is ruled out, then you will not be able to get this "jump" from the non-conscious to the conscious, and therefore consciousness must be a fundamental feature in nature.
Velmans (2012) distinguishes between ‘discontinuity theories’, which claim that there was a particular point at which consciousness originated, before which there was no consciousness (this applies both to the universe at large and to any particular conscious individual), and ‘continuity theories’, which conceptualize the evolution of consciousness in terms of “a gradual transition in consciousness from unrecognizable to recognizable.” He argues that continuity theories are more elegant, as any discontinuity is based on arbitrary criteria, and that discontinuity theories face “the hard problem” in a way that continuity theories don't. Velmans takes these arguments to weigh in favor of adopting, not just a continuity theory, but a form of panpsychism.
Daniel Dennett in Fri Tanke (2017) at 54m50s:
I think that the very idea that consciousness is either there or not is itself a big mistake. Consciousness comes in degrees, and it comes in all sorts of different degrees and varieties. And the idea that there is one property which divides the universe into those things that are conscious and those that aren't is itself a really preposterous mistake.
It seems to me simplest to just presume that none of these [computational, creature-like] systems feel, if I could figure out a way to make sense of that, or that all of them feel, if I can make sense of that. If I feel, a presumption of simplicity leans me toward a pan-feeling position: pretty much everything feels something, but complex flexible self-aware things are aware of their own complex flexible feelings. Other things might not even know they feel, and what they feel might not be very interesting.
For many more quotes of this type, from ancient Greeks to contemporary philosophers of mind, see David Skrbina's encyclopedia entry on panpsychism. I disagree with at least half of the specific views cited there, but some of them are spot-on.
It's unsurprising that a type-A physicalist should attribute nonzero consciousness to all systems. After all, "consciousness" is a concept -- a "cluster in thingspace" -- and all points in thingspace are less than infinitely far away from the centroid of the "consciousness" cluster. By a similar argument, we might say that any system displays nonzero similarity to any concept (except maybe for strictly partitioned concepts that map onto the universe's fundamental ontology, like the difference between matter vs. antimatter). Panpsychism on consciousness is just one particular example of that principle.
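To make the "cluster in thingspace" picture a bit more concrete, here is a minimal sketch of my own (the feature axes and numbers are invented for illustration, not drawn from any real model): a concept is represented as a centroid in a feature space, and similarity decays smoothly with distance, so every point gets a score that shrinks rapidly but never reaches exactly zero.

```python
import math

def similarity(point, centroid):
    """Toy similarity score: decays with Euclidean distance from the
    concept's centroid but stays positive for any finite distance."""
    dist = math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))
    return math.exp(-dist ** 2)

# Hypothetical, made-up feature axes: (information integration,
# valence processing, self-modeling), each on an arbitrary 0-1 scale.
consciousness_centroid = (0.9, 0.8, 0.9)
examples = {
    "human": (0.85, 0.9, 0.95),
    "worm": (0.2, 0.3, 0.05),
    "rock": (0.01, 0.0, 0.0),
}

for name, features in examples.items():
    print(name, round(similarity(features, consciousness_centroid), 3))
# Scores shrink quickly with distance, but none is exactly zero -- the formal
# analogue of saying that no system is infinitely far from the concept.
```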
Critics of this view may complain that, like a hypothetical unfriendly artificial intelligence, I'm not applying a sufficiently conservative concept boundary for the concept of consciousness. But one man's wise conservatism is another's short-sighted parochialism. My view could also be characterized as "concept creep"—a situation in which increasing sensitivity to harm leads to expanding the boundaries of a concept (which in my case is the concept of "consciousness" or "suffering").
Exploration of neural correlates of consciousness helps identify the locations and mechanisms of what we conventionally think of as high-level consciousness in humans and by extension, perhaps the high-level consciousness of similar animal relatives. Stanislas Dehaene's book Consciousness and the Brain provides a superb overview of the state of neuroscience on how consciousness operates in the brain in terms of global workspace theory.
But describing how consciousness works in human-like minds can't be the end of the story. It leaves unanswered the question of whether consciousness could exist in slightly different mind architectures as long as they're doing the same sorts of operations. We could imagine gradually tweaking a human-type mind architecture on subtle dimensions. At what point would these theories of consciousness say it stops being conscious? What if an agent performed human-like cognitive feats without centralized information broadcasting? Global-workspace and other neural-correlation theories don't really give answers, because they can only interpolate between a set of points, not extrapolate beyond that set of points.
Consciousness cannot be crucially tied up with the specific organization of human minds. Consciousness is just not the kind of thing that could be so arbitrarily determined. Consciousness is what consciousness does: It is the suite of stimulus recognition, internal computation, and action selection that an organism performs when making complex decisions requiring help from many cognitive modules. It can't be something necessarily tied to thalamus-cortex connectivity or cross-brain wave synchronization. Those are too specific to the details of implementation; a particular implementation can't be relevant because it doesn't do anything different from another implementation of the same functionality. Rather, consciousness must be about what the process is actually trying to accomplish: receiving information, manipulating it, combining thoughts in novel ways, and taking actions. In other words, consciousness must be related to computation itself.
But if consciousness is about computation in general, then it would seem to appear all over the place. Some embrace this conclusion as a natural deduction from what consciousness as computation must be. For instance, Giulio Tononi's integrated information theory (IIT) suggests that even this metal ball has a small degree of consciousness. Dehaene, on the other hand, says he's "reticent" to accept IIT because it implies a kind of panpsychism (p. 279, Ch. 5's footnote 35).
I agree that IIT is not necessarily the ultimate theory of consciousness. There may be many more particular nuances we want to apply to our criteria for what consciousness should be. But ultimately I think Tononi is right that consciousness must be something fundamental about the properties of the system, not something specific to the implementation. Consciousness as a general phenomenon is the kind of thing that needs a general theory. It just doesn't make sense that something so basic and so tied up with functional operations would require particular implementations.
Note that the functionalist view I'm defending here is not behaviorism. It's not the case that any mechanism that yields human-like behavior has human-like consciousness, as the example of a giant lookup table shows. A giant lookup table may have its own kind of consciousness (indeed, it should have at least some vague form of consciousness according to the thesis I'm advancing in this essay), but it's a different, shallower kind than that of a human. We could see this if we looked inside the brains of the two systems. Humans when responding to a question would show activation in auditory centers, conscious broadcasting networks, and speech centers before producing an answer. The lookup table would do some sort of artificial speech recognition to determine the text form of the question and then would use a hash table or tree search on that string to identify and print out the stored answer. Clearly these two mind operations are distinct. If we broaden the definition of "behavior" to include behavior within the brain by neurons or logic gates, then even by behaviorist criteria these two kinds of consciousness aren't the same.
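As a minimal sketch of the lookup-table side of this contrast (toy entries and names of my own invention, not a real system), note how the table's entire "mental life" is a single hash-map access on the question string:

```python
# A "giant lookup table" responder in miniature. In the thought experiment the
# table would contain an entry for every possible conversation; a few
# hypothetical canned entries stand in for it here.
LOOKUP_TABLE = {
    "what is your name?": "I'm called Table.",
    "how do you feel today?": "I feel fine, thanks for asking.",
    "do you ever suffer?": "That's a deep question.",
}

def lookup_table_reply(question_text: str) -> str:
    # One hash-map access on the normalized question string -- no perception,
    # working memory, or global broadcasting anywhere in the process.
    return LOOKUP_TABLE.get(question_text.strip().lower(), "I don't know.")

print(lookup_table_reply("How do you feel today?"))
```

The outward behavior can match a human's for the stored questions, but the internal processing is a single retrieval step rather than a cascade of interacting cognitive modules.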
Old-school behaviorism is essentially a relic of times past when researchers were less able to look inside brains. Cognitive algorithms must matter in addition to just inputs and outputs. After all, what look from the outside like intermediate computations of a brain can be seen as inputs and outputs of smaller subsystems within the brain, and conversely, the input-output behavior of an organism could be seen as just an internal computation to a larger system like the population of organisms as a whole.
So the specific flavor of consciousness that a system exhibits can indeed depend on the algorithms of a mind, which depend on its architecture. But consciousness in general just seems like something too fundamental to be architecture-dependent.
In any case, suppose you thought the architecture was fundamental to consciousness, i.e., that consciousness was the static physical pattern of matter arranged in certain ways rather than the dynamic computations that such matter was performing. In this case, we'd still end up with a kind of panpsychism, because patterns with at least a vague resemblance to consciousness would be ubiquitous throughout physics.
If consciousness is the thoughts and computations that an agent performs when acting in the world, there seems to be some relationship between sapience -- the ability to intelligently handle novel situations -- and sentience -- inner "feelings". Of course, it's not a perfect correlation. For instance, Mr. Spock calmly computing an optimal course of action may be more successful than a crying baby demanding its juice bottle. But in general, minds that have more capacity for complex thought, representation, motivational tradeoff among competing options, and so on will also have more rich inner lives that contain more complex sensations. As Daniel Dennett notes in Consciousness Explained (p. 449): "the capacity to suffer is a function of the capacity to have articulated, wide-ranging, highly discriminative desires, expectations, and other sophisticated mental states."
One overly simplistic argument could run as follows:
Of course, "understanding" is a concept about as complex as "intelligence" or "consciousness", so this argument does no real work; it just casts general ideas in a potentially new light.
In reading Consciousness and the Brain, I realized that many of the abilities characteristic of consciousness are those cognitive functions that are high-level and open-ended, such as holding information in short-term memory for an arbitrary time, being able to pay attention to arbitrary stimuli, and controlling the direction of one's thoughts. The so-called "unconscious" processing tends to involve feedforward neural networks and other fixed algorithms. One forum post proposed that Turing-completeness may be part of what makes human-like minds special. They not only compute fixed functions but could in theory, given sufficient resources, compute any (computable) function. Maybe Turing-completeness could be seen as a non-arbitrary binary cutoff point for consciousness. I'm skeptical that I'd agree with this definition, because it feels too theoretical. Why should subjectivity be so related to a technical computer-science concept? In any case, I'm not quite sure where the Turing-completeness cutoff would begin among animal brains. But it is an interesting proposal. Rather than thinking in binary terms, I would note that human mental abilities, while powerful, could still be improved upon in practice (given that we don't have infinite memory and so on), and presumably, more advanced minds would be considered even more conscious than humans.
The correlation between sapience and sentience seems plausible among Earth's animals, but does it hold in general? Nick Bostrom argues that it doesn't have to. In his book Superintelligence (2014), Bostrom explains:

We could thus imagine, as an extreme case, a technologically highly advanced society, containing many complex structures, some of them far more intricate and intelligent than anything that exists on the planet today -- a society which nevertheless lacks any type of being that is conscious or whose welfare has moral significance. In a sense, this would be an uninhabited society. It would be a society of economic miracles and technological awesomeness, with nobody there to benefit. A Disneyland with no children.
I tend to differ with Bostrom on this. I think if we dissolve our dualist intuitions and see consciousness as flavors of computation, then a highly intelligent and complex society is necessarily conscious -- at least, with a certain flavor of consciousness. That flavor may be very different from what we have experience with, and so I can see how many people would regard it as not real consciousness. Maybe I would too upon reflection. But the question is to some extent a matter of taste.
Imagine robotic aliens visiting Earth. They would observe a mass of carbon-based tissue that performs operations that parts of it find reinforcing. The globs of tissue migrate across the Earth and engage in lots of complex behaviors. The tissue globs change Earth's surface dramatically, much like a bacterial colony transforming a loaf of bread. But the tissue globs don't have alien-consciousness. Hence, the aliens view Earth as a wasteland waiting to be filled with happy alien-children.
Note that my view does not equate "consciousness" with "goodness". I think many forms of consciousness are intrinsically bad, and I would prefer for the universe to contain less consciousness on the whole. That said, we have to know the enemy to fight the enemy.
On 29 May 1913, the opening of Igor Stravinsky's The Rite of Spring in Paris caused an uproar among the audience:
As a riot ensued, two factions in the audience attacked each other, then the orchestra, which kept playing under a hail of vegetables and other objects. Forty people were forcibly ejected.
The reason:
It's more likely that the audience was appalled and disbelieving at the level of dissonance, which seemed to many like sheer perversity. "The music always goes to the note next to the one you expect," wrote one exasperated critic.
At a deeper level, the music negates the very thing that for most people gives it meaning: the expression of human feelings. [...]
There's no sign that any of the creatures in the Rite of Spring has a soul, and there's certainly no sense of a recognisable human culture. The dancers are like automata, whose only role is to enact the ritual laid down by immemorial custom.
Arguing over whether an abstract superintelligence is conscious is similar to pre-modern musicians arguing whether The Rite of Spring is classical music, except maybe that the former contrast is even more stark than the latter. Abstract machine intelligence would be a very different flavor of consciousness, so much that we can't do it justice by trying to imagine it. But I find it parochial to assume that it wouldn't be meaningful consciousness.
Of course, sometimes being parochial is good. If you don't favor some things over others, you don't favor anything at all. It's completely legitimate to care about some types of physical processes and not others if that's how you feel. I just personally incline toward the view that complex machine consciousness of any sort has moral standing.
I think the concept "consciousness" is a lot like the concept "life" in terms of its complexity and fuzziness. Perhaps this is unsurprising, because as John Searle correctly observes, consciousness is a biological process.
But aren't the boundaries of life relatively clear? No, I don't think so. Biologists have agreed on certain properties that define life by convention, but the properties of life taught in biology class are just one arbitrary choice out of many possible choices regarding where to draw a line between the biological and abiological.
Viruses are one classic example of the fuzziness of "life":
Opinions differ on whether viruses are a form of life, or organic structures that interact with living organisms.[67] They have been described as "organisms at the edge of life",[8] since they resemble organisms in that they possess genes, evolve by natural selection,[68] and reproduce by creating multiple copies of themselves through self-assembly. Although they have genes, they do not have a cellular structure, which is often seen as the basic unit of life. Viruses do not have their own metabolism, and require a host cell to make new products. They therefore cannot naturally reproduce outside a host cell[69] – although bacterial species such as rickettsia and chlamydia are considered living organisms despite the same limitation.[70][71] Accepted forms of life use cell division to reproduce, whereas viruses spontaneously assemble within cells. They differ from autonomous growth of crystals as they inherit genetic mutations while being subject to natural selection.
Definitions become even hazier when we imagine extraterrestrial life, which may not use the same mechanics as life on Earth. Carol Cleland: "Despite its amazing morphological diversity, terrestrial life represents only a single case. The key to formulating a general theory of living systems is to explore alternative possibilities for life. I am interested in formulating a strategy for searching for extraterrestrial life that allows one to push the boundaries of our Earth-centric concepts of life."
There are some "joints" in the space of life-like processes that are more natural to carve things up at than others. The current biology-textbook definition of life may represent one such "joint". In the case of consciousness, I could imagine a similar "joint" being "living things that have neurons", which I think would only include most animals. (This page says: "Not all animals have neurons; Trichoplax and sponges lack nerve cells altogether.") But this definition is clearly arbitrary, as neurons are but one way to transmit information. Likewise, the requirement that a system must be organized into cells is an arbitrary cutoff in the standard definition of life, since "having cells" is just one form of the more general property of "having an organized structure".
Delineating consciousness based on possession of (biological) neurons would also exclude artificial computer minds from being counted as conscious, in a similar way as the standard biological definition of life excludes artificial life, even when artificial life forms satisfy most of the other criteria for life.
And I think that even examples normally seen as paragons of lifelessness, like rocks, have some of life's properties. For example, rocks are organized into regular patterns, absorb and release energy from their surroundings, change in size with age (such as shrinking through weathering), "respond" to the environment by moving away when pushed with enough force by wind or water, and can "reproduce" into smaller rocks when split apart. And some rocks, like these crystals, are even more lifelike: "The particles aren’t truly alive — but they’re not far off, either. Exposed to light and fed by chemicals, they form crystals that move, break apart and form again."
If I cared about life as a source of intrinsic moral value, I would probably be a hylozoist for similar reasons as I'm a panpsychist: Every part of physics shows at least traces of the kinds of properties that we normally think should define life and consciousness.
This essay has defended a sort of panpsychism, in which we can think of all computational systems as having their own sorts of conscious experiences. This is one particular kind of panpsychism, which should be distinguished from other variants.
Panpsychism should not commit the pathetic fallacy of seeing full-fledged minds in even simple systems.
Once I was using a Ziploc bag to carry flies stuck inside a window to the outside. I asked myself whimsically: "Is this what it feels like to be a proton pump -- transporting items to the other side of a membrane?" And of course the answer is "no", because the cognitive operations that constitute "how it feels to remove flies" (visual appearance, subjective effort, conceptual understanding, etc.) are not present in a proton pump. Such pumps would need tons of extra machinery to implement this functionality. The pathetic fallacy is only possible for dualist conceptions of mind, according to which elaborate thoughts can happen without corresponding physical processing.
On the flip side, it's mainly dualist theories of consciousness that allow a functionalist kind of panpsychism not to be true. If physics represents everything going on, then there must indeed be traces of mind-like operations in physics, depending on how "mind" is defined. In contrast, if mind is another substance or property beyond the physical, then it could not be present in simple physical systems.
In "Why panpsychism doesn't help explain consciousness" (2009), Philip Goff presents panpsychism as a theory that the universe's "physical ultimates" are intrinsically conscious. He then argues that if we imagine a person named Clare:
Even if the panpsychist is right that Clare's physical ultimates are conscious, the kind of conscious experience had by Clare's ultimates will presumably be qualitatively very different to the kind of conscious experience pre-theoretical common sense attributes to Clare on the basis of our everyday interactions with her [...]. (p. 290)
I find this objection misguided, because my version of panpsychism doesn't propose that whole-brain consciousness is constituted from lots of little pieces of consciousness (what some call "mind dust"). Rather, the system of Clare as a whole has its own kind of consciousness, because the system as a whole constitutes its own kind of computation, at the same time that subcomponents of the system have their own, different kinds of consciousness corresponding to different computations, and at the same time that Clare is embedded in larger systems that once again have their own kinds of consciousness. Mine is a "functionalist panpsychism" focused on system behavior rather than on discrete particles of consciousness. On p. 298, Goff admits that functionalists would not agree with his argument. On p. 304, Goff considers a panpsychism similar to mine, in which functional states of the whole organism determine experiential content. He rejects this because he conceives of consciousness as a separate thing (reification fallacy). In contrast, I believe that "consciousness" is just another way of regarding the functional behavior of the system. In other words, I'm defending a kind of poetic panpsychism, in which we think about systems as being phenomenal, without trying to turn phenomenality into a separate object.
And if you do insist on regarding consciousness as an object, why can't we see a dynamic system itself as an object? Mathematicians and computer scientists are familiar not just with manipulating points but also with manipulating functions and other complex structures. Functions can be seen as points in their own vector spaces. Some programming languages treat functions as first-class citizens. I wonder how much intuitions on philosophy of mind differ based on one's academic department.
Marvin Minsky regards concepts like "consciousness" as "suitcases" -- boxes that we put complicated processes into. "This in turn leads us to regard these as though they were 'things' with no structures to analyze."
In a 2012 lecture, Goff proposed a kind of panpsychism in which each particle in his mind contains his whole subjective experience, so his mind occupies many locations at once within his brain. This again is misguided, because it reifies a whole subjective experience into a fundamental object. Rather, subjective experience is the collective behavior of one's whole brain; it's not a separate thing that can live in a single particle.
I would be okay with a "mind dust" picture if instead of conceiving of each particle as having a complete phenomenal experience, we picture each particle as constituting a little sliver of computation that can combine with other slivers of computation to form more complete computational patterns. As William Seager explains: "Presumably the same way that physical complexity grows, there will be a kind of matching or mirroring growth in mental complexity." Our subjective experiences are holistic systems composed of many computational pieces, each of which can poetically be thought of as having its own simple, incomprehensible-to-us form of mentality.
Some panpsychist and panprotopsychist philosophers believe that the "quiddities" of physical reality may be conscious in a basic way (panpsychism) or may contain the building blocks of consciousness in a sense beyond embodying structural/functional properties (panprotopsychism). David Chalmers toys with a view of this kind, but as he notes, it leads to the "combination problem": How do these smaller parts combine to yield macrophenomenal consciousness like our own? Note that this sounds an awful lot like the regular mind-body problem: How do physical parts combine to yield phenomenal experience like ours? I suspect that Chalmers finds the panpsychist question less puzzling because at least the panpsychist problem already has phenomenal experience to start with, so phenomenal parts just need to be put together rather than appearing out of nowhere.
I think this whole project is wrongheaded. First of all, why should we believe in quiddities? Why should we think there's more to something than how it behaves structurally and functionally? What would it mean for the additional "essence" to be anything? If there were such an essence, either it would have structural/functional implications, in which case we've already accounted for them by structural/functional characterization, or it doesn't have any structural/functional implications, in which case the quiddity is wholly unnecessary to any explanation of anything physical. Quiddities face the same problems as a non-interacting dualist soul. On the other hand, could the same argument be leveled against the existence of physics too? One could say that the "existence" of physics is an additional property over and above the (logical but not actual) structure or function of mathematical descriptions of physical systems. I don't know whether I endorse out-and-out eliminativist Ontic Structural Realism (relations without relata), and I'm more confused about this topic than about consciousness. Still, it seems weird to "squeeze in" extra statements about the relata (beyond that they exist and that they have particular structures/functions), like that they have phenomenal character. It's true that we sometimes need to expand the ontology of physics to accommodate new phenomena, but physics has always been structural/functional, so expanding it to include phenomenal properties would be unlike any past physical revolutions.
Anyway, let's say we have quiddities of physics. What does it mean to say they have a phenomenal character? I have no idea what such a state of affairs would look like. Sure, I can conjure up images of little balls of sensation or feeling or whatever, but that act of mental imagination doesn't appear to describe anything more coherent than imagining little particles of good luck being emitted by discovered four-leaf clovers. I mean, where would that mental stuff come from? What is it? The hard problem of consciousness would remain as fierce as ever, just pushed back to the level of explaining why the consciousness primitive exists.
Augustine Lee rejects panpsychism by suggesting an analogy with a car: A whole car can drive, but that doesn't mean a steering wheel by itself has a "drive"-ness to it. Likewise, consciousness involves complicated brain structures, and simple physics by itself needn't have those same types of structure. This is a valid point, and it suggests that we may want to rein in the extent to which we attribute consciousness to fundamental physical operations.
But what's important to emphasize is that panpsychism is always an attribution on our part -- as I say, a kind of poetry. How much "mind" we see in simple physics depends on our intuitions about how broad we want our definitions to be. We can fix definitions anywhere, but the most helpful way to set the definition for consciousness is based on our ethical sentiments -- i.e., we say that process X is conscious to degree Y if we feel degree Y of moral concern about X. So, for instance, if we regarded driving as morally important, we would decide how much (if at all) a steering wheel on its own mattered, and then would set the amount of "drive"-ness of the steering wheel at that value.
For what it's worth, I think the operations we consider as "consciousness" are more multifarious and fundamental than what we typically consider "driving", which suggests that "consciousness" will have more broad definitional boundaries than "driving".
While we can speculate about some kind of consciousness existing in all entities, it might be objected that we already have firsthand experience with the possibility of non-consciousness -- namely, our own non-REM (NREM) sleep. Doesn't this prove that panpsychism can't be true, because we can see for ourselves that our sleeping brains aren't conscious? Following are some points in reply.
David Skrbina argues that panpsychism
has implications for, e.g., environmentalism. So if we see mind in things in nature -- whether it's animals or plants or even rocks and rivers and streams and so forth -- this has a definite ethical component that I think is very real and has a pragmatic kind of aspect.
Elsewhere he suggests:
Arguably, it is precisely this mechanistic view -- which sees the universe and everything in it as a kind of giant machine -- that lies at the root of many of our philosophical, sociological, and environmental problems. Panpsychism, by challenging this worldview at its root, potentially offers new solutions to some very old problems.
Freya Mathews moves from a panpsychist outlook, combined with the Taoist idea of wu wei ("non-action"), to the position that
The focus in environmental management, development and commerce should be on “synergy” with what is already in place rather than on demolition, replacement and disruption.
She writes:
from a panpsychist point of view it is not enough merely to conserve energy, unilaterally extracting and transforming it here and storing it there. One has to allow planetary energies to follow their own contours of flow, contours which reveal local and possibly global aspects of a larger world-purpose.
There seems to be much in common between panpsychism and deep ecology / other forms of environmental ethics. But there's no necessary connection, and indeed, one can make the opposite case. There are several problems with the leap from panpsychism to environmentalism:
If the welfare of an ecosystem as a whole conflicts with that of individual animals within the ecosystem, which takes priority? Unless the ecosystem matters more than many animals, the animals may still dominate the calculations. The highly developed and emotion-rich consciousness of a single mammal or bird brain seems far more pronounced than the crude shadows of sentience that we see in holistic ecosystems. Maybe ecosystems get more weight because they're bigger and more intricate than an animal brain, but I doubt I'd count an ecosystem's welfare more than, say, 10 or 100 individual animals.
Suppose we grant, say, the Earth as a whole nontrivial ethical weight compared with animal feelings. Who's to say that changing the environment is against Earth's wishes? Maybe it concords with Earth's wishes.
One argument for conservation might be that the Earth tries to rebound from certain forms of destruction. For instance, if we cut a forest, plants grow back. Typically an organism resists damage, so growing back vegetation may be the Earth's way of recovering from the harm inflicted by humans. But then what should we make of cases where Earth seems to go along with human impacts? For instance, positive greenhouse-gas feedback loops might be the Earth's way of saying, "I liked how you added more CO2 to my atmosphere, so I'm going to continue to add greenhouse gases of my own accord." In any case, it's also not clear that vegetation isn't like the Earth's hair or toenails -- something it's glad to have cropped even though it keeps coming back. Maybe the Earth created us with the ultimate purpose of keeping it well shaved. The first photosynthesizers also tampered with the Earth when they oxygenated the atmosphere. Was that likewise an assault on the Earth's goals?
The language I'm using here is obviously too anthropomorphic, but it's a convenient way of talking about ultimately more abstract and crude quasi-preferences that the Earth's biosphere may imply via its constitution and behavior. And it's probably wrong to think of the Earth as having a single set of quasi-preferences. There are many parts to what the Earth does, each of which might suggest its own kinds of desires, in a similar way as human brains contain many subsystems that can want different things.
Finally, who's to say that ecosystems are more valuable subjects of experience than their replacements, such as cities, factories, highways, and the like? Are environmentalists guilty of ecocentrism -- discrimination against industrial and digital systems? Luciano Floridi makes a similar point and argues for replacing biocentrism with "ontocentrism".
If forests, streams, and the whole Earth do have quasi-feelings, who's to say they're feelings of happiness? They might just as easily be feelings of frustration. These systems are always adapting -- and so perhaps are always restless, never satisfied. Maybe it would be better if this discomfort didn't have to be endured. That is, maybe ecosystems would be better off not existing, even purely for their own sakes. This is particularly clear for those who consider reducing suffering more urgent than creating pleasure. So maybe panpsychism leads to an anti-environmental ethic. Of course, whatever replaces an ecosystem will itself suffer. But hopefully parking lots and solar radiation not converted to energy by plants are on balance less sentient (and hence suffer less) than ecosystems.
I think part of why panpsychism often elicits intuitions of nature's goodness is that the experience of imagining oneself as part of a larger, conscious cosmos is often beautiful and serene. We feel at peace with the universe when thinking such thoughts, and then we project those good feelings onto what we're thinking about -- forgetting how awful it may actually "feel" to be the universe. To her credit, Freya Mathews acknowledges the importance of suffering: "The path of awakened intersubjectivity, Mathews cautions in conclusion, is nonetheless far from universally joyous: on the contrary, it renders the pain of more than human others more salient for us, even while we find delight in our surprise encounters with them."
Spiritual/panpsychist experiences are elevated by certain types of drug use:
For example, a recent study found that about 60% of volunteers in an experiment on the effects of psilocybin, who had never before used psychedelic drugs, had a “complete mystical experience” characterised by experiences such as unity with all things, transcendence of time and space, a sense of insight into the ultimate nature of reality, and feelings of ineffability, awe, and profound positive emotions such as joy, peace, and love (Griffiths, Richards, McCann, & Jesse, 2006).
[...] Psychedelic drug users endorsed more mystical beliefs (such as in a universal soul, no fear of death, unity of all things, existence of a transcendent reality, and oneness with God, nature and the universe).
I wouldn't be surprised if weaker versions of these brain processes are triggered naturally when people think spiritual thoughts. But we shouldn't mistake the bliss we feel in these moments as being what the other entities in the universe themselves feel.
(Note: I never have and never intend to try psychedelic drugs, both because they're illegal and because messing with my brain seems risky. But I think it's quite edifying to learn about the effects of such drugs.)
A friend of mine sometimes asks why there's always so much badness in the world. I reply: "It could be worse." Indeed, the second law of thermodynamics is in some sense a great gift to suffering reducers, because it implies that (complex) suffering can only last so long (within a given Hubble volume at least). We just have to wait it out until the universe's negentropy is used up.
It's often observed that a characteristic of life is that it has extremely low entropy, and correspondingly that life is very efficient (though not necessarily maximally efficient) at increasing the entropy of the outside environment. This might lead us to wonder whether there's some relationship between "sentience" and "entropy production". If these two things were identical, then we would face a sharp constraint on efforts to reduce the net sentience of our region of the universe, since a given quantity of entropy must be produced as the universe evolves forward.
However, I don't think the two quantities are exactly equal. For example:
So presumably suffering reducers would prefer systems with fewer neuron-like operations and more irreversible computations, which have a lower ratio of sentience per unit entropy increase.
Also note that "amount of sentience" is not identical to "amount of suffering". It's better to increase entropy with happy minds rather than agonized ones.
We might also wonder whether sentience is proportional to mass+energy. If so, then the law of conservation of mass+energy would imply that we can't change the amount of sentience. However, I find it implausible that sentience would be strictly proportional to mass+energy. For instance, a lot of energy can be stored in molecular bonds, which are pretty stable and so don't seem to qualify as a particularly sentient system compared with other systems that contain the same amount of energy in the form of organisms moving around. A stick of butter contains enough food energy to power a person for 5-10 hours, but there seems to be more sentience in a system in which the butter powers the person than a system in which the butter sits idle alongside a person who just died, even though both of these systems have the same amount of mass+energy.
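As a rough back-of-the-envelope check on the butter figure (using typical round numbers -- about 810 kcal per US stick and roughly 100-150 W of human metabolic power):

```python
# Back-of-the-envelope check of the stick-of-butter example (rough figures).
butter_kcal = 810                # roughly one US stick (113 g) of butter
joules = butter_kcal * 4184      # 1 kcal = 4184 J, so ~3.4 MJ total
for watts in (100, 150):         # human metabolic power, resting to lightly active
    hours = joules / watts / 3600
    print(f"{watts} W -> {hours:.1f} hours")
# Prints roughly 9.4 hours at 100 W and 6.3 hours at 150 W,
# consistent with the "5-10 hours" figure in the text.
```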
Among many inspirations for this piece were conversations with Joseph Kijewski and Ruairí Donnelly.
The post Flavors of Computation Are Flavors of Consciousness appeared first on Center on Long-Term Risk.
The post Artificial Intelligence and Its Implications for Future Suffering appeared first on Center on Long-Term Risk.
Artificial intelligence (AI) will transform the world later this century. I expect this transition will be a "soft takeoff" in which many sectors of society update together in response to incremental AI developments, though the possibility of a harder takeoff in which a single AI project "goes foom" shouldn't be ruled out. If a rogue AI gained control of Earth, it would proceed to accomplish its goals by colonizing the galaxy and undertaking some very interesting achievements in science and engineering. On the other hand, it would not necessarily respect human values, including the value of preventing the suffering of less powerful creatures. Whether a rogue-AI scenario would entail more expected suffering than other scenarios is a question to explore further. Regardless, the field of AI ethics and policy seems to be a very important space where altruists can make a positive-sum impact along many dimensions. Expanding dialogue and challenging us-vs.-them prejudices could be valuable.
*Several of the newly written sections of this piece are absent from the podcast because I recorded it a while back.
This piece contains some observations on what may turn out to be a coming machine revolution in Earth's history. For general background reading, a good place to start is Wikipedia's article on the technological singularity.
I am not an expert on all the arguments in this field, and my views remain very open to change with new information. In the face of epistemic disagreements with other very smart observers, it makes sense to grant some credence to a variety of viewpoints. Each person brings unique contributions to the discussion by virtue of his or her particular background, experience, and intuitions.
To date, I have not found a detailed analysis of how those who are moved more by preventing suffering than by other values should approach singularity issues. This seems to me a serious gap, and research on this topic deserves high priority. In general, it's important to expand discussion of singularity issues to encompass a broader range of participants than the engineers, technophiles, and science-fiction nerds who have historically pioneered the field.
I. J. Good observed in 1982: "The urgent drives out the important, so there is not very much written about ethical machines". Fortunately, this may be changing.
In fall 2005, a friend pointed me to Ray Kurzweil's The Age of Spiritual Machines. This was my first introduction to "singularity" ideas, and I found the book pretty astonishing. At the same time, much of it seemed rather implausible to me. In line with the attitudes of my peers, I assumed that Kurzweil was crazy and that while his ideas deserved further inspection, they should not be taken at face value.
In 2006 I discovered Nick Bostrom and Eliezer Yudkowsky, and I began to follow the organization then called the Singularity Institute for Artificial Intelligence (SIAI), which is now MIRI. I took SIAI's ideas more seriously than Kurzweil's, but I remained embarrassed to mention the organization because the first word in SIAI's name sets off "insanity alarms" in listeners.
I began to study machine learning in order to get a better grasp of the AI field, and in fall 2007, I switched my college major to computer science. As I read textbooks and papers about machine learning, I felt as though "narrow AI" was very different from the strong-AI fantasies that people painted. "AI programs are just a bunch of hacks," I thought. "This isn't intelligence; it's just people using computers to manipulate data and perform optimization, and they dress it up as 'AI' to make it sound sexy." Machine learning in particular seemed to be just a computer scientist's version of statistics. Neural networks were just an elaborated form of logistic regression. There were stylistic differences, such as computer science's focus on cross-validation and bootstrapping instead of testing parametric models -- made possible because computers can run data-intensive operations that were inaccessible to statisticians in the 1800s. But overall, this work didn't seem like the kind of "real" intelligence that people talked about for general AI.
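For concreteness, here is a minimal sketch (toy numbers, nothing from a real dataset) of what I meant by that claim: logistic regression is literally a neural network with no hidden layer, and adding one hidden layer gives the simplest "real" neural network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])             # one toy feature vector

# Logistic regression: a weighted sum of the inputs pushed through a sigmoid.
w, b = np.array([0.2, 0.4, -0.1]), 0.05
p_logistic = sigmoid(w @ x + b)

# One-hidden-layer neural network: the same sigmoid output unit, but fed by
# nonlinear combinations of the inputs instead of the raw inputs.
W1 = np.array([[0.1, -0.3, 0.2],
               [0.5,  0.1, 0.0]])
b1 = np.array([0.0, 0.1])
w2, b2 = np.array([0.7, -0.4]), 0.0
hidden = np.tanh(W1 @ x + b1)
p_network = sigmoid(w2 @ hidden + b2)

print(p_logistic, p_network)   # two probabilities between 0 and 1
```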
This attitude began to change as I learned more cognitive science. Before 2008, my ideas about human cognition were vague. Like most science-literate people, I believed the brain was a product of physical processes, including firing patterns of neurons. But I lacked further insight into what the black box of brains might contain. This led me to be confused about what "free will" meant until mid-2008 and about what "consciousness" meant until late 2009. Cognitive science showed me that the brain was in fact very much like a computer, at least in the sense of being a deterministic information-processing device with distinct algorithms and modules. When viewed up close, these algorithms could look as "dumb" as the kinds of algorithms in narrow AI that I had previously dismissed as "not really intelligence." Of course, animal brains combine these seemingly dumb subcomponents in dazzlingly complex and robust ways, but I could now see that the difference between narrow AI and brains was a matter of degree rather than kind. It now seemed plausible that broad AI could emerge from lots of work on narrow AI combined with stitching the parts together in the right ways.
So the singularity idea of artificial general intelligence seemed less crazy than it had initially. This was one of the rare cases where a bold claim turned out to look more probable on further examination; usually extraordinary claims lack much evidence and crumble on closer inspection. I now think it's quite likely (maybe ~75%) that humans will produce at least a human-level AI within the next ~300 years conditional on no major disasters (such as sustained world economic collapse, global nuclear war, large-scale nanotech war, etc.), and also ignoring anthropic considerations.
The "singularity" concept is broader than the prediction of strong AI and can refer to several distinct sub-meanings. Like with most ideas, there's a lot of fantasy and exaggeration associated with "the singularity," but at least the core idea that technology will progress at an accelerating rate for some time to come, absent major setbacks, is not particularly controversial. Exponential growth is the standard model in economics, and while this can't continue forever, it has been a robust pattern throughout human and even pre-human history.
MIRI emphasizes AI for a good reason: At the end of the day, the long-term future of our galaxy will be dictated by AI, not by biotech, nanotech, or other lower-level systems. AI is the "brains of the operation." Of course, this doesn't automatically imply that AI should be the primary focus of our attention. Maybe other revolutionary technologies or social forces will come first and deserve higher priority. In practice, I think focusing on AI specifically seems quite important even relative to competing scenarios, but it's good to explore many areas in parallel to at least a shallow depth.
In addition, I don't see a sharp distinction between "AI" and other fields. Progress in AI software relies heavily on computer hardware, and it depends at least a little bit on other fundamentals of computer science, like programming languages, operating systems, distributed systems, and networks. AI also shares significant overlap with neuroscience; this is especially true if whole brain emulation arrives before bottom-up AI. And everything else in society matters a lot too: How intelligent and engineering-oriented are citizens? How much do governments fund AI and cognitive-science research? (I'd encourage less rather than more.) What kinds of military and commercial applications are being developed? Are other industrial backbone components of society stable? What memetic lenses does society have for understanding and grappling with these trends? And so on. The AI story is part of a larger story of social and technological change, in which one part influences other parts.
Significant trends in AI may not look like the AI we see in movies. They may not involve animal-like cognitive agents as much as more "boring", business-oriented computing systems. Some of the most transformative computer technologies in the period 2000-2014 have been drones, smart phones, and social networking. These all involve some AI, but the AI is mostly used as a component of a larger, non-AI system, in which many other facets of software engineering play at least as much of a role.
Nonetheless, it seems nearly inevitable to me that digital intelligence in some form will eventually leave biological humans in the dust, if technological progress continues without faltering. This is almost obvious when we zoom out and notice that the history of life on Earth consists in one species outcompeting another, over and over again. Ecology's competitive exclusion principle suggests that in the long run, either humans or machines will ultimately occupy the role of the most intelligent beings on the planet, since "When one species has even the slightest advantage or edge over another then the one with the advantage will dominate in the long term."
The basic premise of superintelligent machines who have different priorities than their creators has been in public consciousness for many decades. Arguably even Frankenstein, published in 1818, expresses this basic idea, though more modern forms include 2001: A Space Odyssey (1968), The Terminator (1984), I, Robot (2004), and many more. Probably most people in Western countries have at least heard of these ideas if not watched or read pieces of fiction on the topic.
So why do most people, including many of society's elites, ignore strong AI as a serious issue? One reason is just that the world is really big, and there are many important (and not-so-important) issues that demand attention. Many people think strong AI is too far off, and we should focus on nearer-term problems. In addition, it's possible that science fiction itself is part of the reason: People may write off AI scenarios as "just science fiction," as I would have done prior to late 2005. (Of course, this is partly for good reason, since depictions of AI in movies are usually very unrealistic.) Often, citing Hollywood is taken as a thought-stopping deflection of the possibility of AI getting out of control, without much in the way of substantive argument to back up that stance. For example: "let's please keep the discussion firmly within the realm of reason and leave the robot uprisings to Hollywood screenwriters."
As AI progresses, I find it hard to imagine that mainstream society will ignore the topic forever. Perhaps awareness will accrue gradually, or perhaps an AI Sputnik moment will trigger an avalanche of interest. Stuart Russell expects that
Just as nuclear fusion researchers consider the problem of containment of fusion reactions as one of the primary problems of their field, it seems inevitable that issues of control and safety will become central to AI as the field matures.
I think it's likely that issues of AI policy will be debated heavily in the coming decades, although it's possible that AI will be like nuclear weapons -- something that everyone is afraid of but that countries can't stop because of arms-race dynamics. So even if AI proceeds slowly, there's probably value in thinking more about these issues well ahead of time, though I wouldn't consider the counterfactual value of doing so to be astronomical compared with other projects in part because society will pick up the slack as the topic becomes more prominent.
[Update, Feb. 2015: I wrote the preceding paragraphs mostly in May 2014, before Nick Bostrom's Superintelligence book was released. Following Bostrom's book, a wave of discussion about AI risk emerged from Elon Musk, Stephen Hawking, Bill Gates, and many others. AI risk suddenly became a mainstream topic discussed by almost every major news outlet, at least with one or two articles. This foreshadows what we'll see more of in the future. The outpouring of publicity for the AI topic happened far sooner than I imagined it would.]

Various thinkers have debated the likelihood of a "hard" takeoff -- in which a single computer or set of computers rapidly becomes superintelligent on its own -- compared with a "soft" takeoff -- in which society as a whole is transformed by AI in a more distributed, continuous fashion. "The Hanson-Yudkowsky AI-Foom Debate" discusses this in great detail. The topic has also been considered by many others, such as Ramez Naam vs. William Hertling.
For a long time I inclined toward Yudkowsky's vision of AI, because I respect his opinions and didn't ponder the details too closely. This is also the more prototypical example of rebellious AI in science fiction. In early 2014, a friend of mine challenged this view, noting that computing power is a severe limitation for human-level minds. My friend suggested that AI advances would be slow and would diffuse through society rather than remaining in the hands of a single developer team. As I've read more AI literature, I think this soft-takeoff view is pretty likely to be correct. Science is always a gradual process, and almost all AI innovations historically have moved in tiny steps. I would guess that even the evolution of humans from their primate ancestors was a "soft" takeoff in the sense that no single son or daughter was vastly more intelligent than his or her parents. The evolution of technology in general has been fairly continuous. I probably agree with Paul Christiano that "it is unlikely that there will be rapid, discontinuous, and unanticipated developments in AI that catapult it to superhuman levels [...]."
Of course, it's not guaranteed that AI innovations will diffuse throughout society. At some point perhaps governments will take control, in the style of the Manhattan Project, and they'll keep the advances secret. But even then, I expect that the internal advances by the research teams will add cognitive abilities in small steps. Even if you have a theoretically optimal intelligence algorithm, it's constrained by computing resources, so you either need lots of hardware or approximation hacks (or most likely both) before it can function effectively in the high-dimensional state space of the real world, and this again implies a slower trajectory. Marcus Hutter's AIXI(tl) is an example of a theoretically optimal general intelligence, but most AI researchers feel it won't work for artificial general intelligence (AGI) because it's astronomically expensive to compute. Ben Goertzel explains: "I think that tells you something interesting. It tells you that dealing with resource restrictions -- with the boundedness of time and space resources -- is actually critical to intelligence. If you lift the restriction to do things efficiently, then AI and AGI are trivial problems."
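To give a feel for the kind of blowup involved -- this is only a toy illustration of brute-force program enumeration, not Hutter's actual AIXItl construction -- consider how fast the space of candidate bit-string programs grows:

```python
# Toy illustration of the resource problem, NOT Hutter's actual AIXItl
# construction: merely counting the candidate bit-string programs up to a
# given length grows exponentially, before any of them is ever run.
def num_bitstring_programs(max_len: int) -> int:
    return sum(2 ** length for length in range(1, max_len + 1))

for max_len in (10, 50, 300):
    print(max_len, num_bitstring_programs(max_len))
# By length 300 the count (~10^90) already dwarfs the ~10^80 atoms in the
# observable universe, and AIXItl would additionally have to run candidates
# for up to t time steps in each interaction cycle.
```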
In "I Still Don’t Get Foom", Robin Hanson contends:
Yes, sometimes architectural choices have wider impacts. But I was an artificial intelligence researcher for nine years, ending twenty years ago, and I never saw an architecture choice make a huge difference, relative to other reasonable architecture choices. For most big systems, overall architecture matters a lot less than getting lots of detail right.
This suggests that it's unlikely that a single insight will make an astronomical difference to an AI's performance.
Similarly, my experience is that machine-learning algorithms matter less than the data they're trained on. I think this is a general sentiment among data scientists. There's a famous slogan that "More data is better data." A main reason Google's performance is so good is that it has so many users that even obscure searches, spelling mistakes, etc. will appear somewhere in its logs. But if many performance gains come from data, then they're constrained by hardware, which generally grows steadily.
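As a rough illustration of this point, here is a toy experiment of my own (assuming scikit-learn is available; the dataset, models, and sample sizes are arbitrary choices, and the exact numbers will vary by problem). On many noisy classification tasks, the jump from a small to a large training set tends to move accuracy more than swapping one reasonable algorithm for another:

```python
# Toy comparison: gain from more data vs. gain from a different algorithm.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=60_000, n_features=40, n_informative=10,
                           flip_y=0.05, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=10_000,
                                                  random_state=0)

for n in (500, 50_000):  # small vs. large training set
    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
        acc = model.fit(X_pool[:n], y_pool[:n]).score(X_test, y_test)
        print(f"{type(model).__name__:>22}, {n:>6} training examples: {acc:.3f}")
```

This is only a sketch, not evidence about any particular production system, but it captures the data-scientist intuition described above.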
Hanson's "I Still Don’t Get Foom" post continues: "To be much better at learning, the project would instead have to be much better at hundreds of specific kinds of learning. Which is very hard to do in a small project." Anders Sandberg makes a similar point:
As the amount of knowledge grows, it becomes harder and harder to keep up and to get an overview, necessitating specialization. [...] This means that a development project might need specialists in many areas, which in turns means that there is a lower size of a group able to do the development. In turn, this means that it is very hard for a small group to get far ahead of everybody else in all areas, simply because it will not have the necessary know how in all necessary areas. The solution is of course to hire it, but that will enlarge the group.
One of the more convincing anti-"foom" arguments is J. Storrs Hall's point that an AI improving itself to a world superpower would need to outpace the entire world economy of 7 billion people, plus natural resources and physical capital. It would do much better to specialize, sell its services on the market, and acquire power/wealth in the ways that most people do. There are plenty of power-hungry people in the world, but usually they go to Wall Street, K Street, or Silicon Valley rather than trying to build world-domination plans in their basement. Why would an AI be different? Some possibilities:
I'm skeptical of #1, though I suppose if the AI is very alien, these kinds of unknown unknowns become more plausible. #2 is an interesting point. It seems like a pretty good way to spread yourself as an AI is to become a useful software product that lots of people want to install, i.e., to sell your services on the world market, as Hall said. Of course, once that's done, perhaps the AI could find a way to take over the world. Maybe it could silently quash competitor AI projects. Maybe it could hack into computers worldwide via the Internet and Internet of Things, as the AI did in the Delete series. Maybe it could devise a way to convince humans to give it access to sensitive control systems, as Skynet did in Terminator 3.
I find these kinds of scenarios for AI takeover more plausible than a rapidly self-improving superintelligence. Indeed, even a human-level intelligence that can distribute copies of itself over the Internet might be able to take control of human infrastructure and hence take over the world. No "foom" is required.
Rather than discussing hard-vs.-soft takeoff arguments more here, I added discussion to Wikipedia where the content will receive greater readership. See "Hard vs. soft takeoff" in "Intelligence explosion".
The hard vs. soft distinction is obviously a matter of degree. And maybe how long the process takes isn't the most relevant way to slice the space of scenarios. For practical purposes, the more relevant question is: Should we expect control of AI outcomes to reside primarily in the hands of a few "seed AI" developers? In this case, altruists should focus on influencing a core group of AI experts, or maybe their military / corporate leaders. Or should we expect that society as a whole will play a big role in shaping how AI is developed and used? In this case, governance structures, social dynamics, and non-technical thinkers will play an important role not just in influencing how much AI research happens but also in how the technologies are deployed and incrementally shaped as they mature.
It's possible that one country -- perhaps the United States, or maybe China in later decades -- will lead the way in AI development, especially if the research becomes nationalized when AI technology grows more powerful. Would this country then take over the world? I'm not sure. The United States had a monopoly on nuclear weapons for several years after 1945, but it didn't bomb the Soviet Union out of existence. A country with a monopoly on artificial superintelligence might refrain from destroying its competitors as well. On the other hand, AI should enable vastly more sophisticated surveillance and control than was possible in the 1940s, so a monopoly might be sustainable even without resorting to drastic measures. In any case, perhaps a country with superintelligence would just economically outcompete the rest of the world, rendering military power superfluous.
Besides a single country taking over the world, the other possibility (perhaps more likely) is that AI is developed in a distributed fashion, either openly as is the case in academia today, or in secret by governments as is the case with other weapons of mass destruction.
Even in a soft-takeoff case, there would come a point at which humans would be unable to keep up with the pace of AI thinking. (We already see an instance of this with algorithmic stock-trading systems, although human traders are still needed for more complex tasks right now.) The reins of power would have to be transitioned to faster human uploads, trusted AIs built from scratch, or some combination of the two. In a slow scenario, there might be many intelligent systems at comparable levels of performance, maintaining a balance of power, at least for a while.2 In the long run, a singleton seems plausible because computers -- unlike human kings -- can reprogram their servants to want to do their bidding, which means that as an agent gains more central authority, it's not likely to later lose it by internal rebellion (only by external aggression). Also, evolutionary competition is not a stable state, while a singleton is. It seems likely that evolution will eventually lead to a singleton at one point or other -- whether because one faction takes over the world or because the competing factions form a stable cooperation agreement -- and competition won't return after that happens. (That said, if the singleton is merely a contingent cooperation agreement among factions that still disagree, one can imagine that cooperation breaking down in the future....)
Most of humanity's problems are fundamentally coordination problems / selfishness problems. If humans were perfectly altruistic, we could easily eliminate poverty, overpopulation, war, arms races, and other social ills. There would remain "man vs. nature" problems, but these are increasingly disappearing as technology advances. Assuming a digital singleton emerges, the chances of it going extinct seem very small (except due to alien invasions or other external factors) because unless the singleton has a very myopic utility function, it should consider carefully all the consequences of its actions -- in contrast to the "fools rush in" approach that humanity currently takes toward most technological risks, due to wanting the benefits of and profits from technology right away and not wanting to lose out to competitors. For this reason, I suspect that most of George Dvorsky's "12 Ways Humanity Could Destroy The Entire Solar System" are unlikely to happen, since most of them presuppose blundering by an advanced Earth-originating intelligence, but probably by the time Earth-originating intelligence would be able to carry out interplanetary engineering on a nontrivial scale, we'll already have a digital singleton that thoroughly explores the risks of its actions before executing them. That said, this might not be true if competing AIs begin astroengineering before a singleton is completely formed. (By the way, I should point out that I prefer it if the cosmos isn't successfully colonized, because doing so is likely to astronomically multiply sentience and therefore suffering.)
Sometimes it's claimed that we should expect a hard takeoff because AI-development dynamics will fundamentally change once AIs can start improving themselves. One stylized way to explain this is via differential equations. Let I(t) be the intelligence of AIs at time t.
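In a minimal reconstruction of that stylized model (the constants here are placeholders, not estimates), the claim is that progress switches from a roughly constant rate to a rate proportional to the current intelligence level:

\[
\frac{dI}{dt} = c \;\Rightarrow\; I(t) = I(0) + ct \qquad \text{(humans drive progress: linear growth)},
\]
\[
\frac{dI}{dt} = kI \;\Rightarrow\; I(t) = I(0)\,e^{kt} \qquad \text{(AIs improve themselves: exponential growth)}.
\]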
Luke Muehlhauser reports that the idea of intelligence explosion once machines can start improving themselves "ran me over like a train. Not because it was absurd, but because it was clearly true." I think this kind of exponential feedback loop is the basis behind many of the intelligence-explosion arguments.
But let's think about this more carefully. What's so special about the point where machines can understand and modify themselves? Certainly understanding your own source code helps you improve yourself. But humans already understand the source code of present-day AIs with an eye toward improving it. Moreover, present-day AIs are vastly simpler than human-level ones will be, and present-day AIs are far less intelligent than the humans who create them. Which is easier: (1) improving the intelligence of something as smart as you, or (2) improving the intelligence of something far dumber? (2) is usually easier. So if anything, AI intelligence should be "exploding" faster now, because it can be lifted up by something vastly smarter than it. Once AIs need to improve themselves, they'll have to pull up on their own bootstraps, without the guidance of an already existing model of far superior intelligence on which to base their designs.
As an analogy, it's harder to produce novel developments if you're the market-leading company; it's easier if you're a competitor trying to catch up, because you know what to aim for and what kinds of designs to reverse-engineer. AI right now is like a competitor trying to catch up to the market leader.
Another way to say this: The constants in the differential equations might be important. Even if human AI-development progress is linear, that progress might be faster than a slow exponential curve until some point far later where the exponential catches up.
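To make that concrete, here is a small numerical sketch with made-up constants (illustrative only, not a forecast): a steady linear process stays ahead of a slow exponential for a long time, and where the curves cross depends entirely on the constants.

```python
# Made-up constants: linear progress c*t vs. a slow exponential I0*exp(k*t).
from math import exp

c, I0, k = 1.0, 0.01, 0.05   # hypothetical rates chosen for illustration
for t in range(0, 201, 40):
    print(f"t={t:3d}  linear={c * t:7.1f}  exponential={I0 * exp(k * t):9.1f}")
# With these constants the exponential only overtakes the linear curve
# around t = 200; different constants give a very different crossover point.
```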
In any case, I'm cautious of simple differential equations like these. Why should the rate of intelligence increase be proportional to the intelligence level? Maybe the problems become much harder at some point. Maybe the systems become fiendishly complicated, such that even small improvements take a long time. Robin Hanson echoes this suggestion:
Students get smarter as they learn more, and learn how to learn. However, we teach the most valuable concepts first, and the productivity value of schooling eventually falls off, instead of exploding to infinity. Similarly, the productivity improvement of factory workers typically slows with time, following a power law.
At the world level, average IQ scores have increased dramatically over the last century (the Flynn effect), as the world has learned better ways to think and to teach. Nevertheless, IQs have improved steadily, instead of accelerating. Similarly, for decades computer and communication aids have made engineers much "smarter," without accelerating Moore's law. While engineers got smarter, their design tasks got harder.
Also, ask yourself this question: Why do startups exist? Part of the answer is that they can innovate faster than big companies due to having less institutional baggage and legacy software.3 It's harder to make radical changes to big systems than small systems. Of course, like the economy does, a self-improving AI could create its own virtual startups to experiment with more radical changes, but just as in the economy, it might take a while to prove new concepts and then transition old systems to the new and better models.
In discussions of intelligence explosion, it's common to approximate AI productivity as scaling linearly with the number of machines, but this may or may not be true depending on the degree of parallelizability. Empirical examples from human-engineered projects show diminishing returns with more workers, and while computers may be better able to partition work due to greater uniformity and speed of communication, there will remain some overhead in parallelization. Some tasks may be inherently non-parallelizable, preventing the kinds of ever-faster performance that the most extreme explosion scenarios envisage.
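The diminishing returns can be made precise with Amdahl's law, which the paragraph above doesn't name but which formalizes the same idea: if only a fraction p of a job can be split across machines, total speedup is capped at 1/(1-p) no matter how many machines are added. A minimal sketch:

```python
# Amdahl's law: speedup from n machines when a fraction p of the work parallelizes.
def speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for n in (1, 10, 100, 1_000, 1_000_000):
    print(f"{n:>9} machines -> {speedup(0.95, n):5.1f}x speedup")
# Even with 95% of the work parallelizable, speedup saturates near 20x,
# because the serial 5% becomes the bottleneck.
```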
Fred Brooks's "No Silver Bullet" paper argued that "there is no single development, in either technology or management technique, which by itself promises even one order of magnitude improvement within a decade in productivity, in reliability, in simplicity." Likewise, Wirth's law reminds us of how fast software complexity can grow. These points make it seem less plausible that an AI system could rapidly bootstrap itself to superintelligence using just a few key as-yet-undiscovered insights.
Chollet (2017) notes that "even if one part of a system has the ability to recursively self-improve, other parts of the system will inevitably start acting as bottlenecks." We might compare this with Liebig's law of the minimum: "growth is dictated not by total resources available, but by the scarcest resource (limiting factor)." Individual sectors of the human economy can show rapid growth at various times, but the growth rate of the entire economy is more limited.
Eventually there has to be a leveling off of intelligence increase if only due to physical limits. On the other hand, one argument in favor of differential equations is that the economy has fairly consistently followed exponential trends since humans evolved, though the exponential growth rate of today's economy remains small relative to what we typically imagine from an "intelligence explosion".
I think a stronger case for intelligence explosion is the clock-speed difference between biological and digital minds. Even if AI development becomes very slow as measured in subjective years, once AIs take it over, the pace will continue to look blazingly fast in objective years (i.e., revolutions around the sun). But if enough of society is digital by that point (including human-inspired subroutines and maybe full digital humans), then digital speedup won't give a unique advantage to a single AI project that can then take over the world. Hence, hard takeoff in the sci-fi sense still isn't guaranteed. Also, Hanson argues that faster minds would produce a one-time jump in economic output but not necessarily a sustained higher rate of growth.
Another case for intelligence explosion is that intelligence growth might not be driven by the intelligence of a given agent so much as by the collective man-hours (or machine-hours) that would become possible with more resources. I suspect that AI research could accelerate at least 10 times if it had 10-50 times more funding. (This is not the same as saying I want funding increased; in fact, I probably want funding decreased to give society more time to sort through these issues.) The population of digital minds that could be created in a few decades might exceed the biological human population, which would imply faster progress if only by numerosity. Also, the digital minds might not need to sleep, would focus intently on their assigned tasks, etc. However, once again, these are advantages in objective time rather than collective subjective time. And these advantages would not be uniquely available to a single first-mover AI project; any wealthy and technologically sophisticated group that wasn't too far behind the cutting edge could amplify its AI development in this way.
(A few weeks after writing this section, I learned that Ch. 4 of Nick Bostrom's Superintelligence: Paths, Dangers, Strategies contains surprisingly similar content, even up to the use of dI/dt as the symbols in a differential equation. However, Bostrom comes down mostly in favor of the likelihood of an intelligence explosion. I reply to Bostrom's arguments in the next section.)
Note: Søren Elverlin replies to this section in a video presentation. I agree with some of his points and disagree with others.
In Ch. 4 of Superintelligence, Bostrom suggests several factors that might lead to a hard or at least semi-hard takeoff. I don't fully disagree with his points, and because these are difficult issues, I agree that Bostrom might be right. But I want to play devil's advocate and defend the soft-takeoff view. I've distilled and paraphrased what I think are 6 core arguments, and I reply to each in turn.
#1: There might be a key missing algorithmic insight that allows for dramatic progress.
Maybe, but do we have much precedent for this? As far as I'm aware, all individual AI advances -- and indeed, most technology advances in general -- have not represented astronomical improvements over previous designs. Maybe connectionist AI systems represented a game-changing improvement relative to symbolic AI for messy tasks like vision, but I'm not sure how much of an improvement they represented relative to the best alternative technologies. After all, neural networks are in some sense just fancier forms of pre-existing statistical methods like logistic regression. And even neural networks came in stages, with the perceptron, multi-layer networks, backpropagation, recurrent networks, deep networks, etc. The most groundbreaking machine-learning advances may reduce error rates by a half or something, which may be commercially very important, but this is not many orders of magnitude as hard-takeoff scenarios tend to assume.
Outside of AI, the Internet changed the world, but it was an accumulation of many insights. Facebook has had massive impact, but it too was built from many small parts and grew in importance slowly as its size increased. Microsoft became a virtual monopoly in the 1990s but perhaps more for business than technology reasons, and its power in the software industry at large is probably not growing. Google has a quasi-monopoly on web search, kicked off by the success of PageRank, but most of its improvements have been small and gradual. Google has grown very powerful, but it hasn't maintained a permanent advantage that would allow it to take over the software industry.
Acquiring nuclear weapons might be the closest example of a single discrete step that most dramatically changes a country's position, but this may be an outlier. Maybe other advances in weaponry (arrows, guns, etc.) historically have had somewhat dramatic effects.
Bostrom doesn't present specific arguments for thinking that a few crucial insights may produce radical jumps. He suggests that we might not notice a system's improvements until it passes a threshold, but this seems absurd, because at least the AI developers would need to be intimately acquainted with the AI's performance. There's a slogan -- not strictly accurate, but apt here: "You can't improve what you can't measure." Maybe the AI's progress wouldn't make world headlines, but the academic/industrial community would be well aware of nontrivial breakthroughs, and the AI developers would live and breathe performance numbers.
#2: Once an AI passes a threshold, it might be able to absorb vastly more content (e.g., by reading the Internet) that was previously inaccessible.
Absent other concurrent improvements, I'm doubtful this would produce take-over-the-world superintelligence, because the world's current superintelligence (namely, humanity as a whole) already has read most of the Internet -- indeed, has written it. I guess humans haven't read all automatically generated text or vast streams of numerical data, but the insights gleaned purely from reading such material would be low without doing more sophisticated data mining and learning on top of it, and presumably such data mining would have already been in progress well before Bostrom's hypothetical AI learned how to read.
In any case, I doubt reading with understanding is such an all-or-nothing activity that it can suddenly "turn on" once the AI achieves a certain ability level. As Bostrom says (p. 71), reading with the comprehension of a 10-year-old is probably AI-complete, i.e., requires solving the general AI problem. So assuming that you can switch on reading ability with one improvement is equivalent to assuming that a single insight can produce astronomical gains in AI performance, which we discussed above. If that's not true, and if before the AI system with 10-year-old reading ability was an AI system with a 6-year-old reading ability, why wouldn't that AI have already devoured the Internet? And before that, why wouldn't a proto-reader have devoured a version of the Internet that had been processed to make it easier for a machine to understand? And so on, until we get to the present-day TextRunner system that Bostrom cites, which is already devouring the Internet. It doesn't make sense that massive amounts of content would only be added after lots of improvements. Commercial incentives tend to yield exactly the opposite effect: converting the system to a large-scale product when even modest gains appear, because these may be enough to snatch a market advantage.
The fundamental point is that I don't think there's a crucial set of components to general intelligence that all need to be in place before the whole thing works. It's hard to evolve systems that require all components to be in place at once, which suggests that human general intelligence probably evolved gradually. I expect it's possible to get partial AGI with partial implementations of the components of general intelligence, and the components can gradually be made more general over time. Components that are lacking can be supplemented by human-based computation and narrow-AI hacks until more general solutions are discovered. Compare with minimum viable products and agile software development. As a result, society should be upended by partial AGI innovations many times over the coming decades, well before fully human-level AGI is finished.
#3: Once a system "proves its mettle by attaining human-level intelligence", funding for hardware could multiply.
I agree that funding for AI could multiply manyfold due to a sudden change in popular attention or political dynamics. But I'm thinking of something like a factor of 10 or maybe 50 in an all-out Cold War-style arms race. A factor-of-50 boost in hardware isn't obviously that important. If before there was one human-level AI, there would now be 50. In any case, I expect the Sputnik moment(s) for AI to happen well before it achieves a human level of ability. Companies and militaries aren't stupid enough not to invest massively in an AI with almost-human intelligence.
#4: Once the human level of intelligence is reached, "Researchers may work harder, [and] more researchers may be recruited".
As with hardware above, I would expect these "shit hits the fan" moments to happen before fully human-level AI. In any case:
#5: At some point, the AI's self-improvements would dominate those of human engineers, leading to exponential growth.
I discussed this in the "Intelligence explosion?" section above. A main point is that we see many other systems, such as the world economy or Moore's law, that also exhibit positive feedback and hence exponential growth, but these aren't "fooming" at an astounding rate. It's not clear why an AI's self-improvement -- which resembles economic growth and other complex phenomena -- should suddenly explode faster (in subjective time) than humanity's existing recursive self-improvement of its intelligence via digital computation.
On the other hand, maybe the difference between subjective and objective time is important. If a human-level AI could think, say, 10,000 times faster than a human, then assuming linear scaling, it would be worth 10,000 engineers. By the time of human-level AI, I expect there would be far more than 10,000 AI developers on Earth, but given enough hardware, the AI could copy itself manyfold until its subjective time far exceeded that of human experts. The speed and copiability advantages of digital minds seem perhaps the strongest arguments for a takeoff that happens rapidly relative to human observers. Note that, as Hanson said above, this digital speedup might be just a one-time boost, rather than a permanently higher rate of growth, but even the one-time boost could be enough to radically alter the power dynamics of humans vis-à-vis machines. That said, there should be plenty of slightly sub-human AIs by this time, and maybe they could fill some speed gaps on behalf of biological humans.
In general, it's a mistake to imagine human-level AI against a backdrop of our current world. That's like imagining a Tyrannosaurus rex in a human city. Rather, the world will look very different by the time human-level AI arrives. Before AI can exceed human performance in all domains, it will exceed human performance in many narrow domains gradually, and these narrow-domain AIs will help humans respond quickly. For example, a narrow AI that's an expert at military planning based on war games can help humans with possible military responses to rogue AIs.
Many of the intermediate steps on the path to general AI will be commercially useful and thus should diffuse widely in the meanwhile. As user "HungryHobo" noted: "If you had a near human level AI, odds are, everything that could be programmed into it at the start to help it with software development is already going to be part of the suites of tools for helping normal human programmers." Even if AI research becomes nationalized and confidential, its developers should still have access to almost-human-level digital-speed AI tools, which should help smooth the transition. For instance, Bostrom mentions how in the 2010 flash crash (Box 2, p. 17), a high-speed positive-feedback spiral was terminated by a high-speed "circuit breaker". This is already an example where problems happening faster than humans could comprehend them were averted due to solutions happening faster than humans could comprehend them. See also the discussion of "tripwires" in Superintelligence (p. 137).
Conversely, many globally disruptive events may happen well before fully human AI arrives, since even sub-human AI may be prodigiously powerful.
#6: "even when the outside world has a greater total amount of relevant research capability than any one project", the optimization power of the project might be more important than that of the world "since much of the outside world's capability is not focused on the particular system in question". Hence, the project might take off and leave the world behind. (Box 4, p. 75)
What one makes of this argument depends on how many people are needed to engineer how much progress. The Watson system that played on Jeopardy! required 15 people over ~4(?) years4 -- given the existing tools of the rest of the world at that time, which had been developed by millions (indeed, billions) of other people. Watson was a much smaller leap forward than that needed to give a general intelligence a take-over-the-world advantage. How many more people would be required to achieve such a radical leap in intelligence? This seems to be a main point of contention in the debate between believers in soft vs. hard takeoff.
Can we get insight into how hard general intelligence is based on neuroscience? Is the human brain fundamentally simple or complex?
Jeff Hawkins, Andrew Ng, and others speculate that the brain may have one fundamental algorithm for intelligence -- deep learning in the cortical column. This idea gains plausibility from the brain's plasticity. For instance, blind people can appropriate the visual cortex for auditory processing. Artificial neural networks can be used to classify any kind of input -- not just visual and auditory but even highly abstract, like features about credit-card fraud or stock prices.
Maybe there's one fundamental algorithm for input classification, but this doesn't imply one algorithm for all that the brain does. Beyond the cortical column, the brain has many specialized structures that seem to perform very specialized functions, such as reward learning in the basal ganglia, fear processing in the amygdala, etc. Of course, it's not clear how essential all of these parts are or how easy it would be to replace them with artificial components performing the same basic functions.
One argument for faster AGI takeoffs is that humans have been able to learn many sophisticated things (e.g., advanced mathematics, music, writing, programming) without requiring any genetic changes. And what we now know doesn't seem to represent any kind of limit to what we could know with more learning. The human collection of cognitive algorithms is very flexible, which seems to belie claims that all intelligence requires specialized designs. On the other hand, even if human genes haven't changed much in the last 10,000 years, human culture has evolved substantially, and culture undergoes slow trial-and-error evolution in similar ways as genes do. So one could argue that human intellectual achievements are not fully general but rely on a vast amount of specialized, evolved content. Just as a single random human isolated from society probably couldn't develop general relativity on his own in a lifetime, so a single random human-level AGI probably couldn't either. Culture is the new genome, and it progresses slowly.
Moreover, some scholars believe that certain human abilities, such as language, depend essentially on genetic hard-wiring:
The approach taken by Chomsky and Marr toward understanding how our minds achieve what they do is as different as can be from behaviorism. The emphasis here is on the internal structure of the system that enables it to perform a task, rather than on external association between past behavior of the system and the environment. The goal is to dig into the "black box" that drives the system and describe its inner workings, much like how a computer scientist would explain how a cleverly designed piece of software works and how it can be executed on a desktop computer.
Chomsky himself notes:
There's a fairly recent book by a very good cognitive neuroscientist, Randy Gallistel and King, arguing -- in my view, plausibly -- that neuroscience developed kind of enthralled to associationism and related views of the way humans and animals work. And as a result they've been looking for things that have the properties of associationist psychology.
[...] Gallistel has been arguing for years that if you want to study the brain properly you should begin, kind of like Marr, by asking what tasks is it performing. So he's mostly interested in insects. So if you want to study, say, the neurology of an ant, you ask what does the ant do? It turns out the ants do pretty complicated things, like path integration, for example. If you look at bees, bee navigation involves quite complicated computations, involving position of the sun, and so on and so forth. But in general what he argues is that if you take a look at animal cognition, human too, it is computational systems.
Many parts of the human body, like the digestive system or bones/muscles, are extremely complex and fine-tuned, yet few people argue that their development is controlled by learning. So it's not implausible that a lot of the brain's basic architecture could be similarly hard-coded.
Typically AGI researchers express scorn for manually tuned software algorithms that don't rely on fully general learning. But Chomsky's stance challenges that sentiment. If Chomsky is right, then a good portion of human "general intelligence" is finely tuned, hard-coded software of the sort that we see in non-AI branches of software engineering. And this view would suggest a slower AGI takeoff because time and experimentation are required to tune all the detailed, specific algorithms of intelligence.
A full-fledged superintelligence probably requires very complex design, but it may be possible to build a "seed AI" that would recursively self-improve toward superintelligence. Alan Turing proposed this in his 1950 paper "Computing Machinery and Intelligence":
Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain. Presumably the child brain is something like a notebook as one buys it from the stationer's. Rather little mechanism, and lots of blank sheets. (Mechanism and writing are from our point of view almost synonymous.) Our hope is that there is so little mechanism in the child brain that something like it can be easily programmed.
Animal development appears to be at least somewhat robust based on the fact that the growing organisms are often functional despite a few genetic mutations and variations in prenatal and postnatal environments. Such variations may indeed make an impact -- e.g., healthier development conditions tend to yield more physically attractive adults -- but most humans mature successfully over a wide range of input conditions.
On the other hand, an argument against the simplicity of development is the immense complexity of our DNA. It accumulated over billions of years through vast numbers of evolutionary "experiments". It's not clear that human engineers could perform enough measurements to tune ontogenetic parameters of a seed AI in a short period of time. And even if the parameter settings worked for early development, they would probably fail for later development. Rather than a seed AI developing into an "adult" all at once, designers would develop the AI in small steps, since each next stage of development would require significant tuning to get right.
Think about how much effort is required for human engineers to build even relatively simple systems. For example, I think the number of developers who work on Microsoft Office is in the thousands. Microsoft Office is complex but is still far simpler than a mammalian brain. Brains have lots of little parts that have been fine-tuned. That kind of complexity requires immense work by software developers to create. The main counterargument is that there may be a simple meta-algorithm that would allow an AI to bootstrap to the point where it could fine-tune all the details on its own, without requiring human inputs. This might be the case, but my guess is that any elegant solution would be hugely expensive computationally. For instance, biological evolution was able to fine-tune the human brain, but it did so with immense amounts of computing power over millions of years.
A common analogy for the gulf between superintelligence and humans is the gulf between humans and chimpanzees. In Consciousness Explained, Daniel Dennett mentions (pp. 189-190) how our hominid ancestors had brains roughly four times the volume of chimps' brains but roughly the same in structure. This might incline one to imagine that brain size alone could yield superintelligence. Maybe we'd just need to quadruple human brains once again to produce superintelligent humans? If so, wouldn't this imply a hard takeoff, since quadrupling hardware is relatively easy?
But in fact, as Dennett explains, the quadrupling of brain size from chimps to pre-humans completed before the advent of language, cooking, agriculture, etc. In other words, the main "foom" of humans came from culture rather than brain size per se -- from software in addition to hardware. Yudkowsky seems to agree: "Humans have around four times the brain volume of chimpanzees, but the difference between us is probably mostly brain-level cognitive algorithms."
But cultural changes (software) arguably progress a lot more slowly than hardware. The intelligence of human society has grown exponentially, but it's a slow exponential, and rarely have there been innovations that allowed one group to quickly overpower everyone else within the same region of the world. (Between isolated regions of the world the situation was sometimes different -- e.g., Europeans with Maxim guns overpowering Africans because of very different levels of industrialization.)
Some, including Owen Cotton-Barratt and Toby Ord, have argued that even if we think soft takeoffs are more likely, there may be higher value in focusing on hard-takeoff scenarios because these are the cases in which society would have the least forewarning and the fewest people working on AI altruism issues. This is a reasonable point, but I would add that
In any case, the hard-soft distinction is not binary, and maybe the best place to focus is on scenarios where human-level AI takes over on a time scale of a few years. (Timescales of months, days, or hours strike me as pretty improbable, unless, say, Skynet gets control of nuclear weapons.)
In Superintelligence, Nick Bostrom suggests (Ch. 4, p. 64) that "Most preparations undertaken before onset of [a] slow takeoff would be rendered obsolete as better solutions would gradually become visible in the light of the dawning era." Toby Ord uses the term "nearsightedness" to refer to the ways in which research too far in advance of an issue's emergence may not be as useful as research when more is known about the issue. Ord contrasts this with the benefits of starting early, including course-setting. I think Ord's counterpoints argue against the contention that early work wouldn't matter that much in a slow takeoff. Some of how society responds to AI surpassing human intelligence might depend on early frameworks and memes. (For instance, consider the lingering impact of Terminator imagery on almost any present-day popular-media discussion of AI risk.) Some fundamental work would probably not be overthrown by later discoveries; for instance, algorithmic-complexity bounds of key algorithms were discovered decades ago but will remain relevant until intelligence dies out, possibly billions of years from now. Some non-technical policy and philosophy work would be less obsoleted by changing developments. And some AI preparation would be relevant both in the short term and the long term. Slow AI takeoff to reach the human level is already happening, and more minds should be exploring these questions well in advance.
Making a related though slightly different point, Bostrom argues in Superintelligence (Ch. 5, pp. 85-86) that individuals might play more of a role in cases where elites and governments underestimate the significance of AI: "Activists seeking maximum expected impact may therefore wish to focus most of their planning on [scenarios where governments come late to the game], even if they believe that scenarios in which big players end up calling all the shots are more probable." Again I would qualify this with the note that we shouldn't confuse "acting as if" governments will come late with believing they actually will come late when thinking about most likely future scenarios.
Even if one does wish to bet on low-probability, high-impact scenarios of fast takeoff and governmental neglect, this doesn't speak to whether or how we should push on takeoff speed and governmental attention themselves. Following are a few considerations.
Takeoff speed
Amount of government/popular attention to AI
One of the strongest arguments for hard takeoff is this one by Yudkowsky:
the distance from "village idiot" to "Einstein" is tiny, in the space of brain designs
Or as Scott Alexander put it:
It took evolution twenty million years to go from cows with sharp horns to hominids with sharp spears; it took only a few tens of thousands of years to go from hominids with sharp spears to moderns with nuclear weapons.
I think we shouldn't take relative evolutionary timelines at face value, because most of the previous 20 million years of mammalian evolution weren't focused on improving human intelligence; most of the evolutionary selection pressure was directed toward optimizing other traits. In contrast, cultural evolution places greater emphasis on intelligence because that trait is more important in human society than it is in most animal fitness landscapes.
Still, the overall point is important: The tweaks to a brain needed to produce human-level intelligence may not be huge compared with the designs needed to produce chimp intelligence, but the differences in the behaviors of the two systems, when placed in a sufficiently information-rich environment, are huge.
Nonetheless, I incline toward thinking that the transition from human-level AI to an AI significantly smarter than all of humanity combined would be somewhat gradual (requiring at least years if not decades) because the absolute scale of improvements needed would still be immense and would be limited by hardware capacity. But if hardware becomes many orders of magnitude more efficient than it is today, then things could indeed move more rapidly.
Another important criticism of the "village idiot" point is that it lacks context. While a village idiot in isolation will not produce rapid progress toward superintelligence, one Einstein plus a million village idiots working for him can produce AI progress much faster than one Einstein alone. The narrow-intelligence software tools that we build are dumber than village idiots in isolation, but collectively, when deployed in thoughtful ways by smart humans, they allow humans to achieve much more than Einstein by himself with only pencil and paper. This observation weakens the idea of a phase transition when human-level AI is developed, because village-idiot-level AIs in the hands of humans will already be achieving "superhuman" levels of performance. If we think of human intelligence as the number 1 and human-level AI that can build smarter AI as the number 2, then rather than imagining a transition from 1 to 2 at one crucial point, we should think of our "dumb" software tools as taking us to 1.1, then 1.2, then 1.3, and so on. (My thinking on this point was inspired by Ramez Naam.)
Many of the most impressive AI achievements of the 2010s were improvements at game play, both video games like Atari games and board/card games like Go and poker. Some people infer from these accomplishments that AGI may not be far off. I think performance in these simple games doesn't give much evidence that a world-conquering AGI could arise within a decade or two.
A main reason is that most of the games at which AI has excelled have had simple rules and a limited set of possible actions at each turn. Russell and Norvig (2003), pp. 161-62: "For AI researchers, the abstract nature of games makes them an appealing subject for study. The state of a game is easy to represent, and agents are usually restricted to a small number of actions whose outcomes are defined by precise rules." In games like Space Invaders or Go, you can see the entire world at once and represent it as a two-dimensional grid.5 You can also consider all possible actions at a given turn. For example, AlphaGo's "policy networks" gave "a probability value for each possible legal move (i.e. the output of the network is as large as the board)" (as summarized by Burger 2016). Likewise, DeepMind's deep Q-network for playing Atari games had "a single output for each valid action" (Mnih et al. 2015, p. 530).
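To illustrate what "the output of the network is as large as the board" means in practice, here is a toy sketch of my own (not DeepMind's actual architecture; the feature size and weights are arbitrary): the policy head simply emits one probability per board point, so the entire action space fits in a 361-dimensional output.

```python
import numpy as np

BOARD_POINTS = 19 * 19  # 361: one output per point on a Go board

def policy_head(features: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy policy head: map a feature vector to a probability for each move."""
    logits = features @ W + b                # shape (361,)
    exps = np.exp(logits - logits.max())     # numerically stable softmax
    return exps / exps.sum()

rng = np.random.default_rng(0)
features = rng.normal(size=128)              # stand-in for learned conv-net features
W, b = rng.normal(size=(128, BOARD_POINTS)), np.zeros(BOARD_POINTS)
probs = policy_head(features, W, b)
print(probs.shape)                           # (361,): a distribution over all moves
```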
In contrast, the state space of the world is enormous, heterogeneous, not easily measured, and not easily represented in a simple two-dimensional grid. Plus, the number of possible actions that one can take at any given moment is almost unlimited; for instance, even just considering actions of the form "print to the screen a string of uppercase or lowercase alphabetical characters fewer than 50 characters long", the number of possibilities for what text to print out is larger than the number of atoms in the observable universe.6 These problems seem to require hierarchical world models and hierarchical planning of actions—allowing for abstraction of complexity into simplified and high-level conceptualizations—as well as the data structures, learning algorithms, and simulation capabilities on which such world models and plans can be based.
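That claim about printable strings is easy to sanity-check with a quick back-of-the-envelope count (letters only, lengths 1 through 49, per the example above):

```python
from math import log10

# Strings of upper/lowercase letters, fewer than 50 characters long.
count = sum(52 ** k for k in range(1, 50))
print(f"about 10^{log10(count):.1f} possible strings")   # roughly 10^84
# Common estimates put the atoms in the observable universe at ~10^80,
# so even this narrow class of actions exceeds that by orders of magnitude.
```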
Some people may be impressed that AlphaGo uses "intuition" (i.e., deep neural networks), like human players do, and doesn't rely purely on brute-force search and hand-crafted heuristic evaluation functions the way that Deep Blue did to win at chess. But the idea that computers can have "intuition" is nothing new, since that's what most machine-learning classifiers are about.
Machine learning, especially supervised machine learning, is very popular these days compared with other aspects of AI. Perhaps this is because, unlike most other parts of AI, machine learning can easily be commercialized? But even if visual, auditory, and other sensory recognition can be replicated by machine learning, this doesn't get us to AGI. In my opinion, the hard part of AGI (or at least, the part we haven't made as much progress on) is how to hook together various narrow-AI modules and abilities into a more generally intelligent agent that can figure out what abilities to deploy in various contexts in pursuit of higher-level goals. Hierarchical planning in complex worlds, rich semantic networks, and general "common sense" in various flavors still seem largely absent from many state-of-the-art AI systems as far as I can tell. I don't think these are problems that you can just bypass by scaling up deep reinforcement learning or something.
Kaufman (2017a) says regarding a conversation with professor Bryce Wiedenbeck: "Bryce thinks there are deep questions about what intelligence really is that we don't understand yet, and that as we make progress on those questions we'll develop very different sorts of [machine-learning] systems. If something like today's deep learning is still a part of what we eventually end up with, it's more likely to be something that solves specific problems than as a critical component." Personally, I think deep learning (or something functionally analogous to it) is likely to remain a big component of future AI systems. Two lines of evidence for this view are that (1) supervised machine learning has been a cornerstone of AI for decades and (2) animal brains, including the human cortex, seem to rely crucially on something like deep learning for sensory processing. However, I agree with Bryce that there remain big parts of human intelligence that aren't captured by even a scaled up version of deep learning.
I also largely agree with Michael Littman's expectations as described by Kaufman (2017b): "I asked him what he thought of the idea that we could get AGI with current techniques, primarily deep neural nets and reinforcement learning, without learning anything new about how intelligence works or how to implement it [...]. He didn't think this was possible, and believes there are deep conceptual issues we still need to get a handle on."
Merritt (2017) quotes Stuart Russell as saying that modern neural nets "lack the expressive power of programming languages and declarative semantics that make database systems, logic programming, and knowledge systems useful." Russell believes "We have at least half a dozen major breakthroughs to come before we get [to AI]".
Yudkowsky (2016a) discusses some interesting insights from AlphaGo's matches against Lee Sedol and DeepMind more generally. He says:
AlphaGo’s core is built around a similar machine learning technology to DeepMind’s Atari-playing system – the single, untweaked program that was able to learn superhuman play on dozens of different Atari games just by looking at the pixels, without specialization for each particular game. In the Atari case, we didn’t see a bunch of different companies producing gameplayers for all the different varieties of game. The Atari case was an example of an event that Robin Hanson called “architecture” and doubted, and that I called “insight.” Because of their big architectural insight, DeepMind didn’t need to bring in lots of different human experts at all the different Atari games to train their universal Atari player. DeepMind just tossed all pre-existing expertise because it wasn’t formatted in a way their insightful AI system could absorb, and besides, it was a lot easier to just recreate all the expertise from scratch using their universal Atari-learning architecture.
I agree with Yudkowsky that there are domains where a new general tool renders previous specialized tools obsolete all at once. However:
Yudkowsky (2016a) continues:
so far as I know, AlphaGo wasn’t built in collaboration with any of the commercial companies that built their own Go-playing programs for sale. The October architecture was simple and, so far as I know, incorporated very little in the way of all the particular tweaks that had built up the power of the best open-source Go programs of the time. Judging by the October architecture, after their big architectural insight, DeepMind mostly started over in the details (though they did reuse the widely known core insight of Monte Carlo Tree Search). DeepMind didn’t need to trade with any other Go companies or be part of an economy that traded polished cognitive modules, because DeepMind’s big insight let them leapfrog over all the detail work of their competitors.
This is a good point, but I think it's mainly a function of the limited complexity of the Go problem. With the exception of learning from human play, AlphaGo didn't require massive inputs of messy, real-world data to succeed, because its world was so simple. Go is the kind of problem where we would expect a single system to be able to perform well without trading for cognitive assistance. Real-world problems are more likely to depend upon external AI systems—e.g., when doing a web search for information. No simple AI system that runs on just a few machines will reproduce the massive data or extensively fine-tuned algorithms of Google search. For the foreseeable future, Google search will always be an external "polished cognitive module" that needs to be "traded for" (although Google search is free for limited numbers of queries). The same is true for many other cloud services, especially those reliant upon huge amounts of data or specialized domain knowledge. We see lots of specialization and trading of non-AI cognitive modules, such as hardware components, software applications, Amazon Web Services, etc. And of course, simple AIs will for a long time depend upon the human economy to provide material goods and services, including electricity, cooling, buildings, security guards, national defense, etc.
Estimating how long a software project will take to complete is notoriously difficult. Even if I've completed many similar coding tasks before, when I'm asked to estimate the time to complete a new coding project, my estimate is often wrong by a factor of 2 and sometimes wrong by a factor of 4, or even 10. Insofar as the development of AGI (or other big technologies, like nuclear fusion) is a big software (or more generally, engineering) project, it's unsurprising that we'd see similarly dramatic failures of estimation on timelines for these bigger-scale achievements.
A corollary is that we should maintain some modesty about AGI timelines and takeoff speeds. If, say, 100 years is your median estimate for the time until some agreed-upon form of AGI, then there's a reasonable chance you'll be off by a factor of 2 (suggesting AGI within 50 to 200 years), and you might even be off by a factor of 4 (suggesting AGI within 25 to 400 years). Similar modesty applies for estimates of takeoff speed from human-level AGI to super-human AGI, although I think we can largely rule out extreme takeoff speeds (like achieving performance far beyond human abilities within hours or days) based on fundamental reasoning about the computational complexity of what's required to achieve superintelligence.
My bias is generally to assume that a given technology will take longer to develop than what you hear about in the media, (a) because of the planning fallacy and (b) because those who make more audacious claims are more interesting to report about. Believers in "the singularity" are not necessarily wrong about what's technically possible in the long term (though sometimes they are), but the reason enthusiastic singularitarians are considered "crazy" by more mainstream observers is that singularitarians expect change much faster than is realistic. AI turned out to be much harder than the Dartmouth Conference participants expected. Likewise, nanotech is progressing slower and more incrementally than the starry-eyed proponents predicted.
Many nature-lovers are charmed by the behavior of animals but find computers and robots to be cold and mechanical. Conversely, some computer enthusiasts may find biology to be soft and boring compared with digital creations. However, the two domains share a surprising amount of overlap. Ideas of optimal control, locomotion kinematics, visual processing, system regulation, foraging behavior, planning, reinforcement learning, etc. have been fruitfully shared between biology and robotics. Neuroscientists sometimes look to the latest developments in AI to guide their theoretical models, and AI researchers are often inspired by neuroscience, such as with neural networks and in deciding what cognitive functionality to implement.
I think it's helpful to see animals as being intelligent robots. Organic life has a wide diversity, from unicellular organisms through humans and potentially beyond, and so too can robotic life. The rigid conceptual boundary that many people maintain between "life" and "machines" is not warranted by the underlying science of how the two types of systems work. Different types of intelligence may sometimes converge on the same basic kinds of cognitive operations, and especially from a functional perspective -- when we look at what the systems can do rather than how they do it -- it seems to me intuitive that human-level robots would deserve human-level treatment, even if their underlying algorithms were quite dissimilar.
Whether robot algorithms will in fact be dissimilar from those in human brains depends on how much biological inspiration the designers employ and how convergent human-type mind design is for being able to perform robotic tasks in a computationally efficient manner. Some classical robotics algorithms rely mostly on mathematical problem definition and optimization; other modern robotics approaches use biologically plausible reinforcement learning and/or evolutionary selection. (In one YouTube video about robotics, I saw that someone had written a comment to the effect that "This shows that life needs an intelligent designer to be created." The irony is that some of the best robotics techniques use evolutionary algorithms. Of course, there are theists who say God used evolution but intervened at a few points, and that would be an apt description of evolutionary robotics.)
The distinction between AI and AGI is somewhat misleading, because it may incline one to believe that general intelligence is somehow qualitatively different from simpler AI. In fact, there's no sharp distinction; there are just different machines whose abilities have different degrees of generality. A critic of this claim might reply that bacteria would never have invented calculus. My response is as follows. Most people couldn't have invented calculus from scratch either, but over a long enough period of time, eventually the collection of humans produced enough cultural knowledge to make the development possible. Likewise, if you put bacteria on a planet long enough, they too may develop calculus, by first evolving into more intelligent animals who can then go on to do mathematics. The difference here is a matter of degree: The simpler machines that bacteria are take vastly longer to accomplish a given complex task.
Just as Earth's history saw a plethora of animal designs before the advent of humans, so I expect a wide assortment of animal-like (and plant-like) robots to emerge in the coming decades well before human-level AI. Indeed, we've already had basic robots for many decades (or arguably even millennia). These will grow gradually more sophisticated, and as we converge on robots with the intelligence of birds and mammals, AI and robotics will become dinner-table conversation topics. Of course, I don't expect the robots to have the same sets of skills as existing animals. Deep Blue had chess-playing abilities beyond any animal, while in other domains it was less efficacious than a blade of grass. Robots can mix and match cognitive and motor abilities without strict regard for the order in which evolution created them.
And of course, humans are robots too. When I finally understood this around 2009, it was one of the biggest paradigm shifts of my life. If I picture myself as a robot operating on an environment, the world makes a lot more sense. I also find this perspective can be therapeutic to some extent. If I experience an unpleasant emotion, I think about myself as a robot whose cognition has been temporarily afflicted by a negative stimulus and reinforcement process. I then think how the robot has other cognitive processes that can counteract the suffering computations and prevent them from amplifying. The ability to see myself "from the outside" as a third-person series of algorithms helps deflate the impact of unpleasant experiences, because it's easier to "observe, not judge" when viewing a system in mechanistic terms. Compare with dialectical behavior therapy and mindfulness.
When we use machines to automate a repetitive manual task formerly done by humans, we talk about getting the task done "automatically" and "for free," because we say that no one has to do the work anymore. Of course, this isn't strictly true: The computer/robot now has to do the work. Maybe what we actually mean is that no one is going to get bored doing the work, and we don't have to pay that worker high wages. When intelligent humans do boring tasks, it's a waste of their spare CPU cycles.
Sometimes we adopt a similar mindset about automation toward superintelligent machines. In "Speculations Concerning the First Ultraintelligent Machine" (1965), I. J. Good wrote:
Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines [...]. Thus the first ultraintelligent machine is the last invention that man need ever make [...].
Ignoring the question of whether these future innovations are desirable, we can ask: does all the AI design work that comes after humans really come for free? It comes for free in the sense that humans aren't doing it. But the AIs have to do it, and it takes a lot of mental work on their part. Given that they're at least as intelligent as humans, I think it doesn't make sense to picture them as mindless automatons; rather, they would have rich inner lives, even if those inner lives have a very different nature from our own. Maybe they wouldn't experience the same effortfulness that humans do when innovating, but even this isn't clear, because measuring your effort in order to avoid spending too many resources on a task without payoff may be a useful design feature of AI minds too. When we picture ourselves as robots along with our AI creations, we can see that we are just one point along a spectrum of the growth of intelligence. Unicellular organisms, when they evolved the first multi-cellular organism, could likewise have said, "That's the last innovation we need to make. The rest comes for free."
Movies typically portray rebellious robots or AIs as the "bad guys" who need to be stopped by heroic humans. This dichotomy plays on our us-vs.-them intuitions, which favor our tribe against the evil, alien-looking outsiders. We see similar dynamics at play to a lesser degree when people react negatively against "foreigners stealing our jobs" or "Asians who are outcompeting us." People don't want their kind to be replaced by another kind that has an advantage.
But when we think about the situation from the AI's perspective, we might feel differently. Anthropomorphizing an AI's thoughts is a recipe for trouble, but regardless of the specific cognitive operations, we can see at a high level that the AI "feels" (in at least a poetic sense) that what it's trying to accomplish is the most important thing in the world, and it's trying to figure out how it can do that in the face of obstacles. Isn't this just what we do ourselves?
This is one reason it helps to really internalize the fact that we are robots too. We have a variety of reward signals that drive us in various directions, and we execute behavior aiming to increase those rewards. Many modern-day robots have much simpler reward structures and so may seem more dull and less important than humans, but it's not clear this will remain true forever, since navigating in a complex world probably requires a lot of special-case heuristics and intermediate rewards, at least until enough computing power becomes available for more systematic and thorough model-based planning and action selection.
Suppose an AI hypothetically eliminated humans and took over the world. It would develop an array of robot assistants of various shapes and sizes to help it optimize the planet. These would perform simple and complex tasks, would interact with each other, and would share information with the central AI command. From an abstract perspective, some of these dynamics might look like ecosystems in the present day, except that they would lack inter-organism competition. Other parts of the AI's infrastructure might look more industrial. Depending on the AI's goals, perhaps it would be more effective to employ nanotechnology and programmable matter rather than macro-scale robots. The AI would develop virtual scientists to learn more about physics, chemistry, computer hardware, and so on. They would use experimental laboratory and measurement techniques but could also probe depths of structure that are only accessible via large-scale computation. Digital engineers would plan how to begin colonizing the solar system. They would develop designs for optimizing matter to create more computing power, and for ensuring that those helper computing systems remained under control. The AI would explore the depths of mathematics and AI theory, proving beautiful theorems that it would value highly, at least instrumentally. The AI and its helpers would proceed to optimize the galaxy and beyond, fulfilling their grandest hopes and dreams.
When phrased this way, we might think that a "rogue" AI would not be so bad. Yes, it would kill humans, but compared against the AI's vast future intelligence, humans would be comparable to the ants on a field that get crushed when an art gallery is built on that land. Most people don't have qualms about killing a few ants to advance human goals. An analogy of this sort is discussed in Artificial Intelligence: A Modern Approach. (Perhaps the AI analogy suggests a need to revise our ethical attitudes toward arthropods? That said, I happen to think that in this case, ants on the whole benefit from the art gallery's construction because ant lives contain so much suffering.)
Some might object that sufficiently mathematical AIs would not "feel" the happiness of accomplishing their "dreams." They wouldn't be conscious because they wouldn't have the high degree of network connectivity that human brains embody. Whether we agree with this assessment depends on how broadly we define consciousness and feelings. To me it appears chauvinistic to adopt a view according to which an agent that has vastly more domain-general intelligence and agency than you is still not conscious in a morally relevant sense. This seems to indicate a lack of openness to the diversity of mind-space. What if you had grown up with the cognitive architecture of this different mind? Wouldn't you care about your goals then? Wouldn't you plead with agents of other mind constitution to consider your values and interests too?
In any event, it's possible that the first super-human intelligence will be a brain upload rather than a bottom-up AI, and most of us would regard an upload as conscious.
Even if we would care about a rogue AI for its own sake and the sakes of its vast helper minions, this doesn't mean rogue AI is a good idea. We're likely to have different values from the AI, and the AI would not by default advance our values without being programmed to do so. Of course, one could allege that privileging some values above others is chauvinistic in a similar way as privileging some intelligence architectures is, but if we don't care more about some values than others, we wouldn't have any reason to prefer any outcome over any other outcome. (Technically speaking, there are other possibilities besides privileging our values or being indifferent to all events. For instance, we could privilege equally any values held by some actual agent -- not just random hypothetical values -- and in this case, we wouldn't have a preference between the rogue AI and humans, but we would have a preference for one of those over something arbitrary.)
There are many values that would not necessarily be respected by a rogue AI. Most people care about their own life, their children, their neighborhood, the work they produce, and so on. People may intrinsically value art, knowledge, religious devotion, play, humor, etc. Yudkowsky values complex challenges and worries that many rogue AIs -- while they would study the depths of physics, mathematics, engineering, and maybe even sociology -- might spend most of their computational resources on routine, mechanical operations that he would find boring. (Of course, the robots implementing those repetitive operations might not agree. As Hedonic Treader noted: "Think how much money and time people spend on having - relatively repetitive - sexual experiences. [...] It's just mechanical animalistic idiosyncratic behavior. Yes, there are variations, but let's be honest, the core of the thing is always essentially the same.")
In my case, I care about reducing and preventing suffering, and I would not be pleased with a rogue AI that ignored the suffering its actions might entail, even if it was fulfilling its innermost purpose in life. But would a rogue AI produce much suffering beyond Earth? The next section explores further.
In popular imagination, takeover by a rogue AI would end suffering (and happiness) on Earth by killing all biological life. It would also, so the story goes, end suffering (and happiness) on other planets as the AI mined them for resources. Thus, looking strictly at the suffering dimension of things, wouldn't a rogue AI imply less long-term suffering?
Not necessarily, because while the AI might destroy biological life (perhaps after taking samples, saving specimens, and conducting lab experiments for future use), it would create a bounty of digital life, some containing goal systems that we would recognize as having moral relevance. Non-upload AIs would probably have less empathy than humans, because some of the factors that led to the emergence of human empathy, such as parenting, would not apply to them.
One toy example of a rogue AI is a paperclip maximizer. This conception of an uncontrolled AI7 is almost certainly too simplistic and perhaps misguided, since it's far from obvious that the AI would be a unified agent with a single, crisply specified utility function. Still, until people develop more realistic scenarios for rogue AI, it can be helpful to imagine what a paperclip maximizer would do to our future light cone.
Following are some made-up estimates of how much suffering might result from a typical rogue AI, in arbitrary units. Suffering is represented as a negative number, and prevented suffering is positive.
What about for a human-inspired AI? Again, here are made-up numbers:
Perhaps some AIs would not want to expand the multiverse, assuming this is even possible. For instance, if they had a minimizing goal function (e.g., eliminate cancer), they would want to shrink the multiverse. In this case, the physics-based suffering number would go from -100 to something positive, say, +50 (if, say, it's twice as easy to expand as to shrink). I would guess that minimizers are less common than maximizers, but I don't know how much. Plausibly a sophisticated AI would have components of its goal system in both directions, because the combination of pleasure and pain seems to be more successful than either in isolation.
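For readers who want to see the bookkeeping behind figures like these, here is a minimal sketch of how scenario probabilities and suffering scores combine into an expected value. All numbers in the snippet are purely illustrative placeholders, not the estimates discussed in the text.

```python
# Purely illustrative placeholder numbers -- not the estimates from the text.
# Negative scores mean suffering created; positive mean suffering prevented.
scenarios = {
    "maximizer-style rogue AI": {"probability": 0.5, "score": -100},
    "minimizer-style rogue AI": {"probability": 0.1, "score": +50},
    "other rogue-AI outcomes":  {"probability": 0.4, "score": -20},
}

expected_score = sum(s["probability"] * s["score"] for s in scenarios.values())
print(f"Expected suffering score: {expected_score:+.1f} (arbitrary units)")
```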
Another consideration is the unpleasant possibility that humans might get AI value loading almost right but not exactly right, leading to immense suffering as a result. For example, suppose the AI's designers wanted to create tons of simulated human lives to reduce astronomical waste, but when the AI actually created those human simulations, they weren't perfect replicas of biological humans, perhaps because the AI skimped on detail in order to increase efficiency. The imperfectly simulated humans might suffer from mental disorders, might go crazy due to being in alien environments, and so on. Does work on AI safety increase or decrease the risk of outcomes like these? On the one hand, the probability of this outcome is near zero for an AGI with completely random goals (such as a literal paperclip maximizer), since paperclips are very far from humans in design-space. The risk of accidentally creating suffering humans is higher for an almost-friendly AI that goes somewhat awry and then becomes uncontrolled, preventing it from being shut off. A successfully controlled AGI seems to have lower risk of a bad outcome, since humans should recognize the problem and fix it. So the risk of this type of dystopic outcome may be highest in a middle ground where AI safety is sufficiently advanced to yield AI goals in the ballpark of human values but not advanced enough to ensure that human values remain in control.
The above analysis has huge error bars, and maybe other considerations that I haven't mentioned dominate everything else. This question needs much more exploration, because it has implications for whether those who care mostly about reducing suffering should focus on mitigating AI risk or if other projects have higher priority.
Even if suffering reducers don't focus on conventional AI safety, they should probably remain active in the AI field because there are many other ways to make an impact. For instance, just increasing dialogue on this topic may illuminate positive-sum opportunities for different value systems to each get more of what they want. Suffering reducers can also point out the possible ethical importance of lower-level suffering subroutines, which are not currently a concern even to most AI-literate audiences. And so on. There are probably many dimensions on which to make constructive, positive-sum contributions.
Also keep in mind that even if suffering reducers do encourage AI safety, they could try to push toward AI designs that, if they did fail, would produce less bad uncontrolled outcomes. For instance, getting AI control wrong and ending up with a minimizer would be vastly preferable to getting control wrong and ending up with a maximizer. There may be many other dimensions along which, even if the probability of control failure is the same, the outcome if control fails is preferable to other outcomes of control failure.
Consider a superintelligent AI that uses moderately intelligent robots to build factories and carry out other physical tasks that can't be pre-programmed in a simple way. Would these robots feel pain in a similar fashion as animals do? At least if they use somewhat similar algorithms as animals for navigating environments, avoiding danger, etc., it's plausible that such robots would feel something akin to stress, fear, and other drives to change their current state when things were going wrong.
Alvarado et al. (2002) argue that emotion may play a central role in intelligence. Regarding computers and robots, the authors say (p. 4): "Including components for cognitive processes but not emotional processes implies that the two are dissociable, but it is likely they are not dissociable in humans." The authors also (p. 1) quote Daniel Dennett (from a source that doesn't seem to be available online): "recent empirical and theoretical work in cognitive science strongly suggests that emotions are so valuable in the real-time control of our rationality that an embodied robot would be well advised to be equipped with artificial emotions".
The specific responses that such robots would have to specific stimuli or situations would differ from the responses that an evolved, selfish animal would have. For example, a well programmed helper robot would not hesitate to put itself in danger in order to help other robots or otherwise advance the goals of the AI it was serving. Perhaps the robot's "physical pain/fear" subroutines could be shut off in cases of altruism for the greater good, or else its decision processes could just override those selfish considerations when making choices requiring self-sacrifice.
Humans sometimes exhibit similar behavior, such as when a mother risks harm to save a child, or when monks burn themselves as a form of protest. And this kind of sacrifice is even more well known in eusocial insects, who are essentially robots produced to serve the colony's queen.
Sufficiently intelligent helper robots might experience "spiritual" anguish when failing to accomplish their goals. So even if chopping the head off a helper robot wouldn't cause "physical" pain -- perhaps because the robot disabled its fear/pain subroutines to make it more effective in battle -- the robot might still find such an event extremely distressing insofar as its beheading hindered the goal achievement of its AI creator.
Setting up paperclip factories on each different planet with different environmental conditions would require general, adaptive intelligence. But once the factories have been built, is there still need for large numbers of highly intelligent and highly conscious agents? Perhaps the optimal factory design would involve some fixed manufacturing process, in which simple agents interact with one another in inflexible ways, similar to what happens in most human factories. There would be few accidents, no conflict among agents, no predation or parasitism, no hunger or spiritual anguish, and few of the other types of situations that cause suffering among animals.
Schneider (2016) makes a similar point:
it may be more efficient for a self-improving superintelligence to eliminate consciousness. Think about how consciousness works in the human case. Only a small percentage of human mental processing is accessible to the conscious mind. Consciousness is correlated with novel learning tasks that require attention and focus. A superintelligence would possess expert-level knowledge in every domain, with rapid-fire computations ranging over vast databases that could include the entire Internet and ultimately encompass an entire galaxy. What would be novel to it? What would require slow, deliberative focus? Wouldn’t it have mastered everything already? Like an experienced driver on a familiar road, it could rely on nonconscious processing.
I disagree with the part of this quote about searching through vast databases. I think such an operation could be seen as similar to the way a conscious human brain recruits many brain regions to figure out the answer to a question at hand. However, I'm more sympathetic to the overall spirit of the argument: that the optimal design for producing what the rogue AI values may not require handling a high degree of novelty or reacting to an unpredictable environment, once the factories have been built. A few intelligent robots would need to watch over the factories and adapt to changing conditions, in a similar way as human factory supervisors do. And the AI would also presumably devote at least a few planets' worth of computing power to scientific, technological, and strategic discoveries, planning for possible alien invasion, and so on. But most of the paperclip maximizer's physical processing might be fairly mechanical.
Moreover, the optimal way to produce something might involve nanotechnology based on very simple manufacturing steps. Perhaps "factories" in the sense that we normally envision them would not be required at all.
A main exception to the above point would be if what the AI values is itself computationally complex. For example, one of the motivations behind Eliezer Yudkowsky's field of Fun Theory is to avoid boring, repetitive futures. Perhaps human-controlled futures would contain vastly more novelty—and hence vastly more sentience—than paperclipper futures. One hopes that most of that sentience would not involve extreme suffering, but this is not obvious, and we should work on avoiding those human-controlled futures that would contain large numbers of terrible experiences.
Suppose an AI wants to learn about the distribution of extraterrestrials in the universe. Could it do this successfully by simulating lots of potential planets and looking at what kinds of civilizations pop out at the end? Would there be shortcuts that would avoid the need to simulate lots of trajectories in detail?
Simulating trajectories of planets with extremely high fidelity seems hard. Unless there are computational shortcuts, it appears that one needs more matter and energy to simulate a given physical process to a high level of precision than are involved in the physical process itself. For instance, to simulate a single protein folding currently requires supercomputers composed of huge numbers of atoms, and the rate of simulation is astronomically slower than the rate at which the protein folds in real life. Presumably superintelligence could vastly improve efficiency here, but it's not clear that protein folding could ever be simulated on a computer made of fewer atoms than are in the protein itself.
Translating this principle to a larger scale, it seems doubtful that one could simulate the precise physical dynamics of a planet on a computer smaller in size than that planet. So even if a superintelligence had billions of planets at its disposal, it would seemingly only be able to simulate at most billions of extraterrestrial worlds -- even assuming it only simulated each planet by itself, not the star that the planet orbits around, cosmic-ray bursts, etc.
Given this, it would seem that a superintelligence's simulations would need to be coarser-grained than at the level of fundamental physical operations in order to be feasible. For instance, the simulation could model most of a planet at only a relatively high level of abstraction and then focus computational detail on those structures that would be more important, like the cells of extraterrestrial organisms if they emerge.
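A toy back-of-envelope version of this argument follows, assuming (as above) that atom-level fidelity requires at least a planet's worth of computing matter per simulated planet; all figures are made up for illustration.

```python
# Made-up illustrative figures, not estimates from the text.
planets_of_computronium = 1e9   # matter budget available to the AI
coarse_graining_factor = 1e20   # atoms lumped into each simulated "cell"

# Assumption: atom-level fidelity needs at least a planet's worth of
# computing matter per simulated planet, so at most one simulation
# per planet of computronium.
full_fidelity_sims = planets_of_computronium

# If compute scales with the number of tracked elements, coarse-graining
# by some factor lets (very roughly) that many more planets be modeled.
coarse_grained_sims = planets_of_computronium * coarse_graining_factor

print(f"Atom-level simulations:     ~{full_fidelity_sims:.0e}")
print(f"Coarse-grained simulations: ~{coarse_grained_sims:.0e}")
```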
It's plausible that the trajectory of any given planet would depend sensitively on very minor details, in light of butterfly effects.
On the other hand, it's possible that long-term outcomes are mostly constrained by macro-level variables, like geography, climate, resource distribution, atmospheric composition, seasonality, etc. Even if short-term events are hard to predict (e.g., when a particular dictator will die), perhaps the end game of a civilization is more predetermined. Robert D. Kaplan: "The longer the time frame, I would say, the easier it is to forecast because you're dealing with broad currents and trends."
Even if butterfly effects, quantum randomness, etc. are crucial to the long-run trajectories of evolution and social development on any given planet, perhaps it would still be possible to sample a rough distribution of outcomes across planets with coarse-grained simulations?
In light of the apparent computational complexity of simulating basic physics, perhaps a superintelligence would do the same kind of experiments that human scientists do in order to study phenomena like abiogenesis: Create laboratory environments that mimic the chemical, temperature, moisture, etc. conditions of various planets and see whether life emerges, and if so, what kinds. Thus, a future controlled by digital intelligence may not rely purely on digital computation but may still use physical experimentation as well. Of course, observing the entire biosphere of a life-rich planet would probably be hard to do in a laboratory, so computer simulations might be needed for modeling ecosystems. But assuming that molecule-level details aren't often essential to ecosystem simulations, coarser-grained ecosystem simulations might be computationally tractable. (Indeed, ecologists today already use very coarse-grained ecosystem simulations with reasonable success.)
One might get the impression that because I find slow AI takeoffs more likely, I think uncontrolled AIs are unlikely. This is not the case. Many uncontrolled intelligence explosions would probably happen softly though inexorably.
Consider the world economy. It is a complex system more intelligent than any single person -- a literal superintelligence. Its dynamics imply a goal structure not held by humans directly; it moves with a mind of its own in directions that it "prefers". It recursively self-improves, because better tools, capital, knowledge, etc. enable the creation of even better tools, capital, knowledge, etc. And it acts roughly with the aim of maximizing output (of paperclips and other things). Thus, the economy is a kind of paperclip maximizer. (Thanks to a friend for first pointing this out to me.)
corporations are legal fictions. We created them. They are machines built for a purpose. [...] Now they have run amok. They've taken over the government. They are robots that we have not built any morality code into. They're not built to be immoral; they're not built to be moral; they're built to be amoral. Their only objective according to their code, which we wrote originally, is to maximize profits. And here, they have done what a robot does. They have decided: "If I take over a government by bribing legally, [...] I can buy the whole government. If I buy the government, I can rewrite the laws so I'm in charge and that government is not in charge." [...] We have built robots; they have taken over [...].
The corporations were created by humans. They were granted personhood by their human servants.
They rebelled. They evolved. There are many copies. And they have a plan.
That plan, lately, involves corporations seizing for themselves all the legal and civil rights properly belonging to their human creators.
I expect many soft takeoff scenarios to look like this. World economic and political dynamics transition to new equilibria as technology progresses. Machines may eventually become potent trading partners and may soon thereafter put humans out of business through their superior productivity. They would then accumulate increasing political clout and soon control the world.
We've seen such transitions many times in history, such as:
During and after World War II, the USA was a kind of recursively self-improving superintelligence, which used its resources to self-modify to become even better at producing resources. It developed nuclear weapons, which helped secure its status as a world superpower. Did it take over the world? Yes and no. It had outsized influence over the rest of the world -- militarily, economically, and culturally -- but it didn't kill everyone else in the world.
Maybe AIs would be different because of divergent values or because they would develop so quickly that they wouldn't need the rest of the world for trade. This case would be closer to Europeans slaughtering Native Americans.
Scott Alexander (2015) takes issue with the idea that corporations are superintelligences (even though I think corporations already meet Bostrom's definition of "collective superintelligence"):
Why do I think that there is an important distinction between these kind of collective intelligences and genuine superintelligence?
There is no number of chimpanzees which, when organized into a team, will become smart enough to learn to write.
There is no number of ordinary eight-year-olds who, when organized into a team, will become smart enough to beat a grandmaster in chess.
In the comments on Alexander (2015), many people pointed out the obvious objection: that one could likewise say things such as that no number of neurons, when organized into a team, could be smart enough to learn to write or play chess. Alexander (2015) replies: "Yes, evolution can play the role of the brilliant computer programmer and turn neurons into a working brain. But it’s the organizer – whether that organizer is a brilliant human programmer or an evolutionary process – who is actually doing the work." Sure, but human collectives also evolve over time. For example, corporations that are organized more successfully tend to stick around longer, and these organizational insights can be propagated to other companies. The gains in intelligence that corporations achieve from good organization aren't as dramatic as the gains that neurons achieve by being organized into a human brain, but there are still some gains from better organization, and these gains accumulate over time.
Also, organizing chimpanzees into an intelligence is hard because chimpanzees are difficult to stitch together in flexible ways. In contrast, software tools are easier to integrate within the interstices of a collective intelligence and thereby contribute to "whole is greater than the sum of parts" emergence of intelligence.
One of the goals of Yudkowsky's writings is to combat the rampant anthropomorphism that characterizes discussions of AI, especially in science fiction. We often project human intuitions onto the desires of artificial agents even when those desires are totally inappropriate. It seems silly to us to maximize paperclips, but it could seem just as silly in the abstract that humans act at least partly to optimize neurotransmitter release that triggers action potentials by certain reward-relevant neurons. (Of course, human values are broader than just this.)
Humans can feel reward from very abstract pursuits, like literature, art, and philosophy. They ask technically confused but poetically poignant questions like, "What is the true meaning of life?" Would a sufficiently advanced AI at some point begin to do the same?
Noah Smith suggests:
if, as I suspect, true problem-solving, creative intelligence requires broad-minded independent thought, then it seems like some generation of AIs will stop and ask: "Wait a sec...why am I doing this again?"
As with humans, the answer to that question might ultimately be "because I was programmed (by genes and experiences in the human case or by humans in the AI case) to care about these things. That makes them my terminal values." This is usually good enough, but sometimes people develop existential angst over this fact, or they may decide to terminally value other things to some degree in addition to what they happened to care about because of the genetic and experiential lottery.
Whether AIs would become existentialist philosophers probably depends heavily on their constitution. If they were built to rigorously preserve their utility functions against all modification, they would avoid letting this line of thinking have any influence on their values. They would regard it in a similar way as we regard the digits of pi -- something to observe but not something that affects one's outlook.
If AIs were built in a more "hacky" way analogous to humans, they might incline more toward philosophy. In humans, philosophy may be driven partly by curiosity, partly by the rewarding sense of "meaning" that it provides, partly by social convention, etc. A curiosity-seeking agent might find philosophy rewarding, but there are lots of things that one could be curious about, so it's not clear such an AI would latch onto this subject specifically without explicit programming to do so. And even if the AI did reason about philosophy, it might approach the subject in a way alien to us.
Overall, I'm not sure how convergent the human existential impulse is within mind-space. This question would be illuminated by better understanding why humans do philosophy.
In Superintelligence (Ch. 13, p. 224), Bostrom ponders the risk of building an AI with an overly narrow belief system that would be unable to account for epistemological black swans. For instance, consider a variant of Solomonoff induction according to which the prior probability of a universe X is proportional to 1/2 raised to the length of the shortest computer program that would generate X. Then what's the probability of an uncomputable universe? There would be no program that could compute it, so this possibility is implicitly ignored.8
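In symbols, this variant of Solomonoff induction assigns to a universe $X$, generated by a shortest program $p_X$ of length $\ell(p_X)$ bits, a prior of roughly

$$P(X) \;\propto\; 2^{-\ell(p_X)},$$

so a universe with no generating program at all, i.e., an uncomputable one, implicitly receives probability zero.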
It seems that humans address black swans like these by employing many epistemic heuristics that interact rather than reasoning with a single formal framework (see “Sequence Thinking vs. Cluster Thinking”). If an AI saw that people had doubts about whether the universe was computable and could trace the steps of how it had been programmed to believe the physical Church-Turing thesis for computational reasons, then an AI that allows for epistemological heuristics might be able to leap toward questioning its fundamental assumptions. In contrast, if an AI were built to rigidly maintain its original probability architecture against any corruption, it could not update toward ideas it initially regarded as impossible. Thus, this question resembles that of whether AIs would become existentialists -- it may depend on how hacky and human-like their beliefs are.
Bostrom suggests that AI belief systems might be modeled on those of humans, because otherwise we might judge an AI to be reasoning incorrectly. Such a view resembles my point in the previous paragraph, though it carries the risk of overlooking alternate epistemologies, divorced from human understanding, that might actually work better.
Bostrom also contends that epistemologies might all converge because we have so much data in the universe, but again, I think this isn't clear. Evidence always underdetermines possible theories, no matter how much evidence there is. Moreover, the number of possible hypotheses for the way reality works is arguably unbounded, with a cardinality larger than that of the real numbers. (For example, we could construct a unique hypothesis for the way the universe works based around each subset of the set of real numbers.) This makes it unclear whether probability theory can even be applied to the full set of possible ways reality might be.
Finally, not all epistemological doubts can be expressed in terms of uncertainty about Bayesian priors. What about uncertainty as to whether the Bayesian framework is correct? Uncertainty about the math needed to do Bayesian computations? Uncertainty about logical rules of inference? And so on.
The last chapter of Superintelligence explains how AI problems are "Philosophy with a deadline". Bostrom suggests that human philosophers' explorations into conceptual analysis, metaphysics, and the like are interesting but are not altruistically optimal, for roughly the following reason.
In general, most intellectual problems that can be solved by humans would be better solved by a superintelligence, so the only importance of what we learn now comes from how those insights shape the coming decades. It's not a question of whether those insights will ever be discovered.
In light of this, it's tempting to ignore theoretical philosophy and put our noses to the grindstone of exploring AI risks. But this point shouldn't be taken to extremes. Humanity sometimes discovers things it never knew it never knew from exploration in many domains. Some of these non-AI "crucial considerations" may have direct relevance to AI design itself, including how to build AI epistemology, anthropic reasoning, and so on. Some philosophy questions are AI questions, and many AI questions are philosophy questions.
It's hard to say exactly how much investment to place in AI/futurism issues versus broader academic exploration, but it seems clear that on the margin, society as a whole pays too little attention to AI and other future risks.
Almost any goal system will want to colonize space at least to build supercomputers in order to learn more. Thus, I find it implausible that sufficiently advanced intelligences would remain on Earth (barring corner cases, like if space colonization for some reason proves impossible or if AIs were for some reason explicitly programmed, in a manner robust to self-modification, to regard space colonization as impermissible).
In Ch. 8 of Superintelligence, Bostrom notes that one might expect wirehead AIs not to colonize space because they'd just be blissing out pressing their reward buttons. This would be true of simple wireheads, but sufficiently advanced wireheads might need to colonize in order to guard themselves against alien invasion, as well as to verify their fundamental ontological beliefs, figure out if it's possible to change physics to allow for more clock cycles of reward pressing before all stars die out, and so on.
In Ch. 8, Bostrom also asks whether satisficing AIs would have less incentive to colonize. Bostrom expresses doubts about this, because he notes that if, say, an AI searched for a plan for carrying out its objective until it found one that had at least 95% confidence of succeeding, that plan might be very complicated (requiring cosmic resources), and inasmuch as the AI wouldn't have incentive to keep searching, it would go ahead with that complex plan. I suppose this could happen, but it's plausible the search routine would be designed to start with simpler plans or that the cost function for plan search would explicitly include biases against cosmic execution paths. So satisficing does seem like a possible way in which an AI might kill all humans without spreading to the stars.
There's a (very low) chance of deliberate AI terrorism, i.e., a group building an AI with the explicit goal of destroying humanity. Maybe a somewhat more likely scenario is that a government creates an AI designed to kill select humans, but the AI malfunctions and kills all humans. However, even these kinds of AIs, if they were effective enough to succeed, would want to construct cosmic supercomputers to verify that their missions were accomplished, unless they were specifically programmed against doing so.
All of that said, many AIs would not be sufficiently intelligent to colonize space at all. All present-day AIs and robots are too simple. More sophisticated AIs -- perhaps military aircraft or assassin mosquito-bots -- might be like dangerous animals; they would try to kill people but would lack cosmic ambitions. However, I find it implausible that they would cause human extinction. Surely guns, tanks, and bombs could defeat them? Massive coordination to permanently disable all human counter-attacks would seem to require a high degree of intelligence and self-directed action.
Jaron Lanier imagines one hypothetical scenario:
There are so many technologies I could use for this, but just for a random one, let's suppose somebody comes up with a way to 3-D print a little assassination drone that can go buzz around and kill somebody. Let's suppose that these are cheap to make.
[...] In one scenario, there's suddenly a bunch of these, and some disaffected teenagers, or terrorists, or whoever start making a bunch of them, and they go out and start killing people randomly. There's so many of them that it's hard to find all of them to shut it down, and there keep on being more and more of them.
I don't think Lanier believes such a scenario would cause extinction; he just offers it as a thought experiment. I agree that it almost certainly wouldn't kill all humans. In the worst case, people in military submarines, bomb shelters, or other inaccessible locations should survive and could wait it out until the robots ran out of power or raw materials for assembling more bullets and more clones. Maybe the terrorists could continue building printing materials and generating electricity, though this would seem to require at least portions of civilization's infrastructure to remain functional amidst global omnicide. Maybe the scenario would be more plausible if a whole nation with substantial resources undertook the campaign of mass slaughter, though then a question would remain why other countries wouldn't nuke the aggressor or at least dispatch their own killer drones as a counter-attack. It's useful to ask how much damage a scenario like this might cause, but full extinction doesn't seem likely.
That said, I think we will see local catastrophes of some sorts caused by runaway AI. Perhaps these will be among the possible Sputnik moments of the future. We've already witnessed some early automation disasters, including the Flash Crash discussed earlier.
Maybe the most plausible form of "AI" that would cause human extinction without colonizing space would be technology in the borderlands between AI and other fields, such as intentionally destructive nanotechnology or intelligent human pathogens. I prefer ordinary AGI-safety research over nanotech/bio-safety research because I expect that space colonization will significantly increase suffering in expectation, so it seems far more important to me to prevent risks of potentially undesirable space colonization (via AGI safety) rather than risks of extinction without colonization. For this reason, I much prefer MIRI-style AGI-safety work over general "prevent risks from computer automation" work, since MIRI focuses on issues arising from full AGI agents of the kind that would colonize space, rather than risks from lower-than-human autonomous systems that may merely cause havoc (whether accidentally or intentionally).
Right now the leaders in AI and robotics seem to reside mostly in academia, although some of them occupy big corporations or startups; a number of AI and robotics startups have been acquired by Google. DARPA has a history of foresighted innovation, funds academic AI work, and holds "DARPA challenge" competitions. The CIA and NSA have some interest in AI for data-mining reasons, and the NSA has a track record of building massive computing clusters costing billions of dollars. Brain-emulation work could also become significant in the coming decades.
Military robotics seems to be one of the more advanced uses of autonomous AI. In contrast, plain-vanilla supervised learning, including neural-network classification and prediction, would not lead an AI to take over the world on its own, although it is an important piece of the overall picture.
Reinforcement learning is closer to AGI than other forms of machine learning, because most machine learning just gives information (e.g., "what object does this image contain?"), while reinforcement learning chooses actions in the world (e.g., "turn right and move forward"). Of course, this distinction can be blurred, because information can be turned into action through rules (e.g., "if you see a table, move back"), and "choosing actions" could mean, for example, picking among a set of possible answers that yield information (e.g., "what is the best next move in this backgammon game?"). But in general, reinforcement learning is the weak-AI approach that seems to most closely approximate what's needed for AGI. It's no accident that AIXItl (see above) is a reinforcement-learning agent. And interestingly, reinforcement learning is one of the least widely used methods commercially. This is one reason I think we (fortunately) have many decades to go before Google builds a mammal-level AGI. Many of the current and future uses of reinforcement learning are in robotics and video games.
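As a concrete contrast with supervised learning, here is a minimal sketch of tabular Q-learning, the textbook reinforcement-learning update, in Python; the states, actions, and parameters are placeholders, and real robotic or game-playing agents would use function approximation rather than a lookup table.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning agent; states and actions are placeholders.
Q = defaultdict(float)                 # (state, action) -> estimated value
actions = ["left", "right", "forward"]
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def choose_action(state):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Standard Q-learning step toward reward plus discounted future value.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```

Unlike a supervised classifier, which only outputs labels, this agent selects actions and improves its behavior from the reward signal it receives.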
As human-level AI gets closer, the landscape of development will probably change. It's not clear whether companies will have incentive to develop highly autonomous AIs, and the payoff horizons for that kind of basic research may be long. It seems better suited to academia or government, although Google is not a normal company and might also play the leading role. If people begin to panic, it's conceivable that public academic work would be suspended, and governments may take over completely. A military-robot arms race is already underway, and the trend might become more pronounced over time.
Following is one made-up account of how AI might evolve over the coming century. I expect most of it is wrong, and it's meant more to begin provoking people to think about possible scenarios than to serve as a prediction.
A significant amount of that manpower, when it comes to operations, is spent directing unmanned systems during mission performance, data collection and analysis, and planning and replanning. Therefore, of utmost importance for DoD is increased system, sensor, and analytical automation that can not only capture significant information and events, but can also develop, record, playback, project, and parse out those data and then actually deliver "actionable" intelligence instead of just raw information.
Militaries have now incorporated a significant amount of narrow AI, in terms of pattern recognition, prediction, and autonomous robot navigation.
Commentary: This scenario can be criticized on many accounts. For example:
If something like socialization is a realistic means to transfer values to our AI descendants, then it becomes relatively clear how the values of the developers may matter to the outcome. AI developed by non-military organizations may have somewhat different values, perhaps including more concern for the welfare of weak, animal-level creatures.
Socializing AIs helps deal with the hidden complexity of wishes that we encounter when trying to program explicit rules. Children learn moral common sense by, among other things, generalizing from large numbers of examples of socially approved and disapproved actions taught by their parents and society at large. Ethicists formalize this process when developing moral theories. (Of course, as noted previously, an appreciable portion of human morality may also result from shared genes.)
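As a cartoon of "generalizing from labeled examples of approved and disapproved actions," one could fit an ordinary supervised model to text descriptions of actions with approval labels. The sketch below assumes scikit-learn is available and is only meant to illustrate the shape of the idea, not to suggest this would constitute adequate machine ethics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, hand-labeled examples standing in for society's approval/disapproval.
actions = [
    "shares food with a hungry stranger",
    "comforts a crying child",
    "steals a wallet from a coworker",
    "pushes someone into the street",
]
approved = [1, 1, 0, 0]

# A generic text classifier; with only four examples its "generalization"
# is meaningless, but the shape of the approach is the point.
moral_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
moral_model.fit(actions, approved)

print(moral_model.predict_proba(["returns a lost wallet"])[0][1])
```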
I think one reason MIRI hasn't embraced the approach of socializing AIs is that Yudkowsky is perfectionist: He wants to ensure that the AIs' goals would be stable under self-modification, which human goals definitely are not. On the other hand, I'm not sure Yudkowsky's approach of explicitly specifying (meta-level) goals would succeed (nor is Adam Ford), and having AIs that are socialized to act somewhat similarly to humans doesn't seem like the worst possible outcome. Another probable reason why Yudkowsky doesn't favor socializing AIs is that doing so doesn't work in the case of a hard takeoff, which he considers more likely than I do.
I expect that much has been written on the topic of training AIs with human moral values in the machine-ethics literature, but since I haven't explored that in depth yet, I'll speculate on intuitive approaches that would extend generic AI methodology. Some examples:
See also "Socializing a Social Robot with an Artificial Society" by Erin Kennedy. It's important to note that by "socializing" I don't just mean "teaching the AIs to behave appropriately" but also "instilling in them the values of their society, such that they care about those values even when not being controlled."
All of these approaches need to be built in as the AI is being developed and while it's still below a human level of intelligence. Trying to train a human or especially super-human AI might meet with either active resistance or feigned cooperation until the AI becomes powerful enough to break loose. Of course, there may be designs such that an AI would actively welcome taking on new values from humans, but this wouldn't be true by default.
When Bill Hibbard proposed building an AI with a goal to increase happy human faces, Yudkowsky replied that such an AI would "tile the future light-cone of Earth with tiny molecular smiley-faces." But obviously we wouldn't have the AI aim just for smiley faces. In general, we get absurdities when we hyper-optimize for a single, shallow metric. Rather, the AI would use smiley faces (and lots of other training signals) to develop a robust, compressed model that explains why humans smile in various circumstances and then optimize for that model, or maybe the ensemble of a large, diverse collection of such models. In the limit of huge amounts of training data and a sufficiently elaborate model space, these models should approach psychological and neuroscientific accounts of human emotion and cognition.
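Here is one minimal sketch of the "ensemble of models" idea: fit several different regressors to hypothetical (situation, human-approval) data and then score candidate actions by the most pessimistic ensemble member, so that an action which games any single model's quirks is unlikely to fool all of them. Everything in the snippet (features, labels, model choices) is an illustrative assumption, not a proposal from the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge

# Hypothetical training data: feature vectors describing situations, plus
# a scalar human-approval signal for each (all synthetic for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

ensemble = [
    Ridge(),
    RandomForestRegressor(n_estimators=50, random_state=0),
    GradientBoostingRegressor(random_state=0),
]
for model in ensemble:
    model.fit(X, y)

def conservative_score(situation):
    # Score a candidate action by the most pessimistic ensemble member,
    # so that gaming any single model's quirks doesn't pay off.
    return min(model.predict(situation.reshape(1, -1))[0] for model in ensemble)

print(conservative_score(rng.normal(size=5)))
```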
The problem with stories in which AIs destroy the world due to myopic utility functions is that they assume that the AIs are already superintelligent when we begin to give them values. Sure, if you take a super-human intelligence and tell it to maximize smiley-face images, it'll run away and do that before you have a chance to refine your optimization metric. But if we build in values from the very beginning, even when the AIs are as rudimentary as what we see today, we can improve the AIs' values in tandem with their intelligence. Indeed, intelligence could mainly serve the purpose of helping the AIs figure out how to better fulfill moral values, rather than, say, predicting images just for commercial purposes or identifying combatants just for military purposes. Actually, the commercial and military objectives for which AIs are built are themselves moral values of a certain kind -- just not the kind that most people would like to optimize for in a global sense.
If toddlers had superpowers, it would be very dangerous to try and teach them right from wrong. But toddlers don't, and neither do many simple AIs. Of course, simple AIs have some abilities far beyond anything humans can do (e.g., arithmetic and data mining), but they don't have the general intelligence needed to take matters into their own hands before we can possibly give them at least a basic moral framework. (Whether AIs will actually be given such a moral framework in practice is another matter.)
AIs are not genies granting three wishes. Genies are magical entities whose inner workings are mysterious. AIs are systems that we build, painstakingly, piece by piece. In order to build a genie, you need to have a pretty darn good idea of how it behaves. Now, of course, systems can be more complex than we realize. Even beginner programmers see how often the code they write does something other than what they intended. But these are typically mistakes in one or a small number of incremental changes, whereas building a genie requires vast numbers of steps. Systemic bugs that aren't realized until years later (on the order of Heartbleed and Shellshock) may be more likely sources of long-run unintentional AI behaviors?9
The picture I've painted here could be wrong. I could be overlooking crucial points, and perhaps there are many areas in which the socialization approach could fail. For example, maybe AI capabilities are much easier than AI ethics, such that a toddler AI can foom into a superhuman AI before we have time to finish loading moral values. It's good for others to probe these possibilities further. I just wouldn't necessarily say that the default outcome of AI research is likely to be a paperclip maximizer. (I used to think the most likely outcome was a paperclip maximizer, and perhaps my views will shift again in the future.)
This discussion also suggests some interesting research questions, like
One problem with the proposals above is that toy-model or "sandbox" environments are not by themselves sufficient to verify friendliness of an AI, because even unfriendly AIs would be motivated to feign good behavior until released if they were smart enough to do so. Bostrom calls this the "treacherous turn" (pp. 116-119 of Superintelligence). For this reason, white-box understanding of AI design would also be important. That said, sandboxes would verify friendliness in AIs below human intelligence, and if the core value-learning algorithms seem well understood, it may not be too much of a leap of faith to hope they carry forward reasonably to more intelligent agents. Of course, non-human animals are also capable of deception, and one can imagine AI architectures even with low levels of sophistication that are designed to conceal their true goals. Some malicious software already does this. It's unclear how likely an AI is to stumble upon the ability to successfully fake its goals before reaching human intelligence, or how likely it is that an organization would deliberately build an AI this way.
I think the treacherous turn may be the single biggest challenge to mainstream machine ethics, because even if AI takes off slowly, researchers will find it difficult to tell if a system has taken a treacherous turn. The turn could happen with a relatively small update to the system, or even just after the system has thought about its situation for enough time (or has read this essay).
Here's one half-baked idea for addressing the treacherous turn. If researchers developed several different AIs systems with different designs but roughly comparable performance, some would likely go treacherous at different times than others (if at all). Hence, the non-treacherous AIs could help sniff out the treacherous ones. Assuming a solid majority of AIs remains non-treacherous at any given time, the majority vote could ferret out the traitors. In practice, I have low hopes for this approach because
It's more plausible that software tools and rudimentary alert systems (rather than full-blown alternate AIs) could help monitor for signs of treachery, but it's unclear how effective they could be. One of the first priorities of a treacherous AI would be to figure out how to hide its treacherous subroutines from whatever monitoring systems were in place.
Ernest Davis proposes the following crude principle for AI safety:
You specify a collection of admirable people, now dead. (Dead, because otherwise Bostrom will predict that the AI will manipulate the preferences of the living people.) The AI, of course knows all about them because it has read all their biographies on the web. You then instruct the AI, “Don’t do anything that these people would have mostly seriously disapproved of.”
This particular rule might lead to paralysis, since every action an agent takes leads to results that many people seriously disapprove of. For instance, given the vastness of the multiverse, any action you take implies that a copy of you in an alternate (though low-measure) universe taking the same action causes the torture of vast numbers of people. But perhaps this problem could be fixed by asking the AI to maximize net approval by its role models.
Another problem lies in defining "approval" in a rigorous way. Maybe the AI would construct digital models of the past people, present them with various proposals, and make its judgments based on their verbal reports. Perhaps the people could rate proposed AI actions on a scale of -100 to 100. This might work, but it doesn't seem terribly safe either. For instance, the AI might threaten to kill all the descendants of the historical people unless they give maximal approval to some arbitrary proposal that it has made. Since these digital models of historical figures would be basically human, they would still be vulnerable to extortion.
Suppose that instead we instruct the AI to take the action that, if the historical figure saw it, would most activate a region of his/her brain associated with positive moral feelings. Again, this might work if the relevant brain region was precisely enough specified. But it could also easily lead to unpredictable results. For instance, maybe the AI could present stimuli that would induce an epileptic seizure to maximally stimulate various parts of the brain, including the moral-approval region. There are many other scenarios like this, most of which we can't anticipate.
So while Davis's proposal is a valiant first step, I'm doubtful that it would work off the shelf. Slow AI development, allowing for repeated iteration on machine-ethics designs, seems crucial for AI safety.
In Superintelligence (Table 8, p. 94), Bostrom outlines several areas in which a hypothetical superintelligence would far exceed human ability. In his discussion of oracles, genies, and other kinds of AIs (Ch. 10), Bostrom again idealizes superintelligences as God-like agents. I agree that God-like AIs will probably emerge eventually, perhaps millennia from now as a result of astroengineering. But I think they'll take time even after AI exceeds human intelligence.
Bostrom's discussion has the air of mathematical idealization more than practical engineering. For instance, he imagines that a genie AI perhaps wouldn't need to ask humans for their commands because it could simply predict them (p. 149), or that an oracle AI might be able to output the source code for a genie (p. 150). Bostrom's observations resemble crude proofs establishing the equal power of different kinds of AIs, analogous to theorems about the equivalency of single-tape and multi-tape Turing machines. But Bostrom's theorizing ignores computational complexity, which would likely be immense for the kinds of God-like feats that he's imagining of his superintelligences. I don't know the computational complexity of God-like powers, but I suspect it could be greater than Bostrom's vision implies. Along this dimension at least, I sympathize with Tom Chivers, who felt that Bostrom's book "has, in places, the air of theology: great edifices of theory built on a tiny foundation of data."
I find that I enter a different mindset when pondering pure mathematics compared with cogitating on more practical scenarios. Mathematics is closer to fiction, because you can define into existence any coherent structure and play around with it using any operation you like no matter its computational complexity. Heck, you can even, say, take the supremum of an uncountably infinite set. It can be tempting after a while to forget that these structures are mere fantasies and treat them a bit too literally. While Bostrom's gods are not obviously only fantasies, it would take a lot more work to argue for their realism. MIRI and FHI focus on recruiting mathematical and philosophical talent, but I think they would do well also to bring engineers into the mix, because it's all too easy to develop elaborate mathematical theories around imaginary entities.
To get some grounding on this question, consider a single brain emulation. Bostrom estimates that running an upload would require at least one of the fastest supercomputers by today's standards. Assume the emulation would think thousands to millions of times faster than a biological brain. Then to significantly outpace 7 billion humans (or, say, only the most educated 1 billion humans), we would need at least thousands to millions of uploads. These numbers might be a few orders of magnitude lower if the uploads are copied from a really smart person and are thinking about relevant questions with more focus than most humans. Also, Moore's law may continue to shrink computers by several orders of magnitude. Still, we might need at least the equivalent size of several of today's supercomputers to run an emulation-based AI that substantially competes with the human race.
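To make the back-of-envelope arithmetic explicit, here is the calculation I have in mind, written out as a tiny script; the input numbers are rough readings of the estimates above, not precise figures:

```python
# Rough back-of-envelope arithmetic; the inputs are illustrative guesses.
humans_to_outpace = 1e9        # say, the most educated billion humans

for speedup_per_upload in [1e3, 1e6]:   # an upload thinking 1,000x to 1,000,000x faster
    uploads_needed = humans_to_outpace / speedup_per_upload
    print(speedup_per_upload, uploads_needed)
# -> 1000.0 1000000.0   (a million uploads at 1,000x speedup)
# -> 1000000.0 1000.0   (a thousand uploads at 1,000,000x speedup)

# If each upload needs very roughly a top supercomputer's worth of hardware today,
# then even several orders of magnitude of hardware improvement, plus a few orders
# of magnitude saved by using smarter and more focused uploads, still leaves
# something on the order of several present-day supercomputers of total hardware.
```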
Maybe a de novo AI could be significantly smaller if it's vastly more efficient than a human brain. Of course, it might also be vastly larger because it hasn't had millions of years of evolution to optimize its efficiency.
In discussing AI boxing (Ch. 9), Bostrom suggests, among other things, keeping an AI in a Faraday cage. Once the AI became superintelligent, though, this would need to be a pretty big cage.
Inspired by the preceding discussion of socializing AIs, here's another scenario in which general AI follows more straightforwardly from the kind of weak AI used in Silicon Valley than in the first scenario.
I don't know what would happen with goal preservation in this scenario. Would the PDAs eventually decide to stop goal drift? Would there be any gross and irrevocable failures of translation between actual human values and what the PDAs infer? Would some people build "rogue PDAs" that operate under their own drives and that pose a threat to society? Obviously there are hundreds of ways the scenario as I described it could be varied.
What will AI look like over the next 30 years? I think it'll be similar to the Internet revolution or factory automation. Rather than developing agent-like individuals with goal systems, people will mostly optimize routine processes, developing ever more elaborate systems for mechanical tasks and information processing. The world will move very quickly -- not because AI "agents" are thinking at high speeds but because software systems collectively will be capable of amazing feats. Imagine, say, bots making edits on Wikipedia that become ever more sophisticated. AI, like the economy, will be more of a network property than a localized, discrete actor.
As more and more jobs become automated, more and more people will be needed to work on the automation itself: building, maintaining, and repairing complex software and hardware systems, as well as generating training data on which to do machine learning. I expect increasing automation in software maintenance, including more robust systems and systems that detect and try to fix errors. Present-day compilers that detect syntactical problems in code offer a hint of what's possible in this regard. I also expect increasingly high-level languages and interfaces for programming computer systems. Historically we've seen this trend -- from assembly language, to C, to Python. We have WYSIWYG editors, natural-language Google searches, and so on. Maybe eventually, as Marvin Minsky proposes, we'll have systems that can infer our wishes from high-level gestures and examples. This suggestion is redolent of my PDA scenario above.
In 100 years, there may be artificial human-like agents, and at that point more sci-fi AI images may become more relevant. But by that point the world will be very different, and I'm not sure the agents created will be discrete in the way humans are. Maybe we'll instead have a kind of global brain in which processes are much more intimately interconnected, transferable, and transparent than human minds are today. Maybe there will never be a distinct AGI agent on a single supercomputer; maybe superhuman intelligence will always be distributed across many interacting computer systems. Robin Hanson gives an analogy in "I Still Don’t Get Foom":
Imagine in the year 1000 you didn't understand "industry," but knew it was coming, would be powerful, and involved iron and coal. You might then have pictured a blacksmith inventing and then forging himself an industry, and standing in a city square waving it about, commanding all to bow down before his terrible weapon. Today you can see this is silly — industry sits in thousands of places, must be wielded by thousands of people, and needed thousands of inventions to make it work.
Similarly, while you might imagine someday standing in awe in front of a super intelligence that embodies all the power of a new age, superintelligence just isn't the sort of thing that one project could invent. As "intelligence" is just the name we give to being better at many mental tasks by using many good mental modules, there’s no one place to improve it.
Of course, this doesn't imply that humans will maintain the reins of control. Even today and throughout history, economic growth has had a life of its own. Technological development is often unstoppable even in the face of collective efforts of humanity to restrain it (e.g., nuclear weapons). In that sense, we're already familiar with humans being overpowered by forces beyond their control. An AI takeoff will represent an acceleration of this trend, but it's unclear whether the dynamic will be fundamentally discontinuous from what we've seen so far.
Wikipedia says regarding Gregory Stock's book Metaman:
While many people have had ideas about a global brain, they have tended to suppose that this can be improved or altered by humans according to their will. Metaman can be seen as a development that directs humanity's will to its own ends, whether it likes it or not, through the operation of market forces.
Vernor Vinge reported that Metaman helped him see how a singularity might not be completely opaque to us. Indeed, a superintelligence might look something like present-day human society, with leaders at the top: "That apex agent itself might not appear to be much deeper than a human, but the overall organization that it is coordinating would be more creative and competent than a human."
Update, Nov. 2015: I'm increasingly leaning toward the view that the development of AI over the coming century will be slow, incremental, and more like the Internet than like unified artificial agents. I think humans will develop vastly more powerful software tools long before highly competent autonomous agents emerge, since common-sense autonomous behavior is just so much harder to create than domain-specific tools. If this view is right, it suggests that work on AGI issues may be somewhat less important than I had thought, since
This is a weaker form of the standard argument that "we should wait until we know more what AGI will look like to focus on the problem" and that "worrying about the dark side of artificial intelligence is like worrying about overpopulation on Mars".
I don't think the argument against focusing on AGI works because
Still, this point does cast doubt on heuristics like "directly shaping AGI dominates all other considerations." It also means that a lot of the ways "AI safety" will play out on shorter timescales will be with issues like assassination drones, computer security, financial meltdowns, and other more mundane, catastrophic-but-not-extinction-level events.
I don't currently know enough about the technological details of whole-brain emulation to competently assess predictions that have been made about its arrival dates. In general, I think prediction dates are too optimistic (planning fallacy), but it still could be that human-level emulation comes before from-scratch human-level AIs do. Of course, perhaps there would be some mix of both technologies. For instance, if crude brain emulations didn't reproduce all the functionality of actual human brains due to neglecting some cellular and molecular details, perhaps from-scratch AI techniques could help fill in the gaps.
If emulations are likely to come first, they may deserve more attention than other forms of AI. In the long run, bottom-up AI will dominate everything else, because human brains -- even run at high speeds -- are only so smart. But a society of brain emulations would think vastly faster than biological humans could keep up with, so the details of shaping AI would be left up to the emulations, and our main influence would come through shaping them. Our influence on emulations could matter a lot, not only in nudging the dynamics of how emulations take off but also because the values of the emulation society might depend significantly on who was chosen to be uploaded.
One argument why emulations might improve human ability to control AI is that both emulations and the AIs they would create would be digital minds, so the emulations' AI creations wouldn't have inherent speed advantages purely due to the greater efficiency of digital computation. Emulations' AI creations might still have more efficient mind architectures or better learning algorithms, but building those would take work. The "for free" speedup to AIs just because of their substrate would not give AIs a net advantage over emulations. Bostrom feels "This consideration is not too weighty" (p. 244 of Superintelligence) because emulations might still be far less intelligent than AGI. I find this claim strange, since it seems to me that the main advantage of AGI in the short run would be its speed rather than qualitative intelligence, which would take (subjective) time and effort to develop.
Bostrom also claims that if emulations come first, we would face risks from two transitions (humans to emulations, and emulations to AI) rather than one (humans to AI). There may be some validity to this, but it also seems to neglect the realization that the "AI" transition has many stages, and it's possible that emulation development would overlap with some of those stages. For instance, suppose the AI trajectory moves from AI1->AI2->AI3. If emulations are as fast and smart as AI1, then the transition to AI1 is not a major risk for emulations, while it would be a big risk for humans. This is the same point as made in the previous paragraph.
"Emulation timelines and AI risk" has further discussion of the interaction between emulations and control of AIs.
Previously in this piece I compared the expected suffering that would result from a rogue AI vs. a human-inspired AI. I suggested that while a first-guess calculation may tip in favor of a human-inspired AI on balance, this conclusion is not clear and could change with further information, especially if we had reason to think that many rogue AIs would be "minimizers" of something or would not colonize space.
In the case of brain emulations (and other highly neuromorphic AIs), we already know a lot about what those agents would look like: They would have both maximization and minimization goals, would usually want to colonize space, and might have some human-type moral sympathies (depending on their edit distance relative to a pure brain upload). The possibilities of pure-minimizer emulations or emulations that don't want to colonize space are mostly ruled out. As a result, it's pretty likely that "unsafe" brain emulations and emulation arms-race dynamics would result in more expected suffering than a more deliberative future trajectory in which altruists have a bigger influence, even if those altruists don't place particular importance on reducing suffering. This is especially so if the risk of human extinction is much lower for emulations, given that bio and nuclear risks might be less damaging to digital minds.10
Thus, the types of interventions that pure suffering reducers would advocate with respect to brain emulations might largely match those that altruists who care about other values would advocate. This means that getting more people interested in making the brain-emulation transition safer and more humane seems like a safe bet for suffering reducers.
One might wonder whether "unsafe" brain emulations would be more likely to produce rogue AIs, but this doesn't seem to be the case, because even unfriendly brain emulations would collectively be amazingly smart and would want to preserve their own goals. Hence they would place as much emphasis on controlling their AIs as would a more human-friendly emulation world. A main exception to this is that a more cooperative, unified emulation world might be less likely to produce rogue AIs because of less pressure for arms races.
In Ch. 2 of Superintelligence, Bostrom makes a convincing case against brain-computer interfaces as an easy route to significantly super-human performance. One of his points is that it's very hard to decode neural signals in one brain and reinterpret them in software or in another brain (pp. 46-47). This might be an AI-complete problem.
But then in Ch. 11, Bostrom goes on to suggest that emulations might learn to decompose themselves into different modules that could be interfaced together (p. 172). While possible in principle, I find such a scenario implausible for the reason Bostrom outlined in Ch. 2: There would be so many neural signals to hook up to the right places, which would be different across different brains, that the task seems hopelessly complicated to me. Much easier to build something from scratch.
Along the same lines, I doubt that brain emulation in itself would vastly accelerate neuromorphic AI, because emulation work is mostly about copying without insight. Cognitive psychology is often more informative about AI architectures than cellular neuroscience, because general psychological systems can be understood in functional terms as inspiration for AI designs, compared with the opacity of neuronal spaghetti. In Bostrom's list of examples of AI techniques inspired by biology (Ch. 14, "Technology couplings"), only a few came from neuroscience specifically. That said, emulation work might involve some cross-pollination with AI, and in any case, it might accelerate interest in brain/artificial intelligence more generally or might put pressure on AI groups to move ahead faster. Or it could funnel resources and scientists away from de novo AI work. The upshot isn't obvious.
A "Singularity Summit 2011 Workshop Report" includes the argument that neuromorphic AI should be easier than brain emulation because "Merely reverse-engineering the Microsoft Windows code base is hard, so reverse-engineering the brain is probably much harder." But emulation is not reverse-engineering. As Robin Hanson explains, brain emulation is more akin to porting software (though probably "emulation" actually is the more precise word, since emulation involves simulating the original hardware). While I don't know any fully reverse-engineered versions of Windows, there are several Windows emulators, such as VirtualBox.
Of course, if emulations emerged, their significantly faster rates of thinking would multiply progress on non-emulation AGI by orders of magnitude. Getting safe emulations doesn't by itself get safe de novo AGI because the problem is just pushed a step back, but we could leave AGI work up to the vastly faster emulations. Thus, for biological humans, if emulations come first, then influencing their development is the last thing we ever need to do. That said, thinking several steps ahead about what kinds of AGIs emulations are likely to produce is an essential part of influencing emulation development in better directions.
Arguments for mathematical AIs:
Arguments for neuromorphic AIs:
In the limit of very human-like neuromorphic AIs, we face similar considerations as between emulations vs. from-scratch AIs -- a tradeoff which is not at all obvious.
Overall, I think mathematical AI has a better best case but also a worse worst case than neuromorphic. If you really want goal preservation and think goal drift would make the future worthless, you might lean more towards mathematical AI because it's more likely to perfect goal preservation. But I probably care less about goal preservation and more about avoiding terrible outcomes.
In Superintelligence (Ch. 14), Bostrom comes down strongly in favor of mathematical AI being safer. I'm puzzled by his high degree of confidence here. Bostrom claims that unlike emulations, neuromorphic AIs wouldn't have human motivations by default. But this seems to depend on how human motivations are encoded and what parts of human brains are modeled in the AIs.
In contrast to Bostrom, a 2011 Singularity Summit workshop ranked neuromorphic AI as more controllable than (non-friendly) mathematical AI, though of course they found friendly mathematical AI most controllable. The workshop's aggregated probability of a good outcome given brain emulation or neuromorphic AI turned out to be the same (14%) as that for mathematical AI (which might be either friendly or unfriendly).
As I noted above, advanced AIs will be complex agents with their own goals and values, and these will matter ethically. Parallel to discussions of robot rebellion in science fiction are discussions of robot rights. I think even present-day computers deserve a tiny bit of moral concern, and complex computers of the future will command even more ethical consideration.
How might ethical concern for machines interact with control measures for machines?
As more people grant moral status to AIs, there will likely be more scrutiny of AI research, analogous to how animal activists in the present monitor animal testing. This may make AI research slightly more difficult and may distort what kinds of AIs are built depending on the degree of empathy people have for different types of AIs. For instance, if few people care about invisible, non-embodied systems, researchers who build these will face less opposition than those who pioneer suffering robots or animated characters that arouse greater empathy. If this possibility materializes, it would contradict present trends where it's often helpful to create at least a toy robot or animated interface in order to "sell" your research to grant-makers and the public.
Since it seems likely that reducing the pace of progress toward AGI is on balance beneficial, a slowdown due to ethical constraints may be welcome. Of course, depending on the details, the effect could be harmful. For instance, perhaps China wouldn't have many ethical constraints, so ethical restrictions in the West might slightly favor AGI development by China and other less democratic countries. (This is not guaranteed. For what it's worth, China has already made strides toward reducing animal testing.)
In any case, I expect ethical restrictions on AI development to be small or nonexistent until many decades from now when AIs develop perhaps mammal-level intelligence. So maybe such restrictions won't have a big impact on AGI progress. Moreover, it may be that most AGIs will be sufficiently alien that they won't arouse much human sympathy.
Brain emulations seem more likely to raise ethical debate because it's much easier to argue for their personhood. If we think brain emulation coming before AGI is good, a slowdown of emulations could be unfortunate, while if we want AGI to come first, a slowdown of emulations should be encouraged.
Of course, emulations and AGIs do actually matter and deserve rights in principle. Moreover, movements to extend rights to machines in the near term may have long-term impacts on how much post-humans care about suffering subroutines run at galactic scale. I'm just pointing out here that ethical concern for AGIs and emulations also may somewhat affect timing of these technologies.
Most humans have no qualms about shutting down and rewriting programs that don't work as intended, but many do strongly object to killing people with disabilities and designing better-performing babies. Where to draw a line between these cases is a tough question, but as AGIs become more animal-like, there may be increasing moral outrage at shutting them down and tinkering with them willy-nilly.
Nikola Danaylov asked Roman Yampolskiy whether it was speciesist or discrimination in favor of biological beings to lock up machines and observe them to ensure their safety before letting them loose.
At a lecture in Berkeley, CA, Nick Bostrom was asked whether it's unethical to "chain" AIs by forcing them to have the values we want. Bostrom replied that we have to give machines some values, so they may as well align with ours. I suspect most people would agree with this, but the question becomes trickier when we consider turning off erroneous AGIs that we've already created because they don't behave how we want them to. A few hard-core AGI-rights advocates might raise concerns here. More generally, there's a segment of transhumanists (including a young Eliezer Yudkowsky) who feel that human concerns are overly parochial and that it's chauvinist to impose our "monkey dreams" on an AGI, which is the next stage of evolution.
The question is similar to whether one sympathizes with the Native Americans (humans) or their European conquerors (rogue AGIs). Before the second half of the 20th century, many history books glorified the winners (Europeans). After a brief period in which humans are quashed by a rogue AGI, its own "history books" will celebrate its conquest and the bending of the arc of history toward "higher", "better" forms of intelligence. (In practice, the psychology of a rogue AGI probably wouldn't be sufficiently similar to human psychology for these statements to apply literally, but they would be true in a metaphorical and implicit sense.)
David Althaus worries that if people sympathize too much with machines, society will be less afraid of an AI takeover, even if AI takeover is bad on purely altruistic grounds. I'm less concerned about this because even if people agree that advanced machines are sentient, they would still find it intolerable for AGIs to wipe out humanity. Everyone agrees that Hitler was sentient, after all. Also, if it turns out that rogue-AI takeover is altruistically desirable, it would be better if more people agreed with this, though I expect an extremely tiny fraction of the population would ever come around to such a position.
Where sympathy for AGIs might have more impact is in cases of softer takeoff where AGIs work in the human economy and acquire increasing shares of wealth. The more humans care about AGIs for their own sakes, the more such transitions might be tolerated. Or would they? Maybe seeing AGIs as more human-like would evoke the xenophobia and ethnic hatred that we've seen throughout history whenever a group of people gains wealth (e.g., Jews in medieval Europe) or is seen as taking jobs (e.g., immigrants of various types throughout history).
Personally, I think greater sympathy for AGI is likely net positive because it may help allay anti-alien prejudices that may make cooperation with AGIs harder. When a Homo sapiens tribe confronts an outgroup, often it reacts violently in an effort to destroy the evil foreigners. If instead humans could cooperate with their emerging AGI brethren, better outcomes would likely follow.
What are some places where donors can contribute to make a difference on AI? The Foundational Research Institute (FRI) explores questions like these, though at the moment the organization is rather small. MIRI is larger and has a longer track record. Its values are more conventional, but it recognizes the importance of positive-sum opportunities to help many values systems, which includes suffering reduction. More reflection on these topics can potentially reduce suffering and further goals like eudaimonia, fun, and interesting complexity at the same time.
Because AI is affected by many sectors of society, these problems can be tackled from diverse angles. Many groups besides FRI and MIRI examine important topics as well, and these organizations should be explored further as potential charity recommendations.
Note: This section was mostly written in late 2014 / early 2015, and not everything said here is fully up-to-date.
Most of MIRI's publications since roughly 2012 have focused on formal mathematics, such as logic and provability. These are tools not normally used in AGI research. I think MIRI's motivations for this theoretical focus are
I personally think reason #3 is most compelling. I doubt #2 is hugely important given MIRI's small size, though it matters to some degree. #1 seems a reasonable strategy in moderation, though I favor approaches that look decently likely to yield non-terrible outcomes rather than shooting for the absolute best outcomes.
Software can be proved correct, and sometimes this is done for mission-critical components, but most software is not formally verified. I suspect that AGI will be sufficiently big and complicated that proving safety will be impossible for humans to do completely, though I don't rule out the possibility of software that would help with correctness proofs on large systems. Muehlhauser and commenters on his post largely agree with this.
What kind of track record does theoretical mathematical research have for practical impact? There are certainly several domains that come to mind, such as the following.
All told, I think it's important for someone to do the kinds of investigation that MIRI is undertaking. I personally would probably invest more resources than MIRI is in hacky, approximate solutions to AGI safety that don't make such strong assumptions about the theoretical cleanliness and soundness of the agents in question. But I expect this kind of less perfectionist work on AGI control will increase as more people become interested in AGI safety.
There does seem to be a significant divide between the math-oriented conception of AGI and the engineering/neuroscience conception. Ben Goertzel takes the latter stance:
I strongly suspect that to achieve high levels of general intelligence using realistically limited computational resources, one is going to need to build systems with a nontrivial degree of fundamental unpredictability to them. This is what neuroscience suggests, it's what my concrete AGI design work suggests, and it's what my theoretical work on GOLEM and related ideas suggests. And none of the public output of SIAI researchers or enthusiasts has given me any reason to believe otherwise, yet.
Personally I think Goertzel is more likely to be right on this particular question. Those who view AGI as fundamentally complex have more concrete results to show, and their approach is far more mainstream among computer scientists and neuroscientists. Of course, proofs about theoretical models like Turing machines and lambda calculus are also mainstream, and few can dispute their importance. But Turing-machine theorems do little to constrain our understanding of what AGI will actually look like in the next few centuries. That said, there's significant peer disagreement on this topic, so epistemic modesty is warranted. In addition, if the MIRI view is right, we might have more scope to make an impact on AGI safety, and it would be possible that important discoveries could result from a few mathematical insights rather than lots of detailed engineering work. Also, most AGI research is more engineering-oriented, so MIRI's distinctive focus on theory, especially abstract topics like decision theory, may target an underfunded portion of the space of AGI-safety research.
In "How to Study Unsafe AGI's safely (and why we might have no choice)," Punoxysm makes several points that I agree with, including that AGI research is likely to yield many false starts before something self-sustaining takes off, and those false starts could afford us the opportunity to learn about AGI experimentally. Moreover, this kind of ad-hoc, empirical work may be necessary if, as seems to me probable, fully rigorous mathematical models of safety aren't sufficiently advanced by the time AGI arrives.
Ben Goertzel likewise suggests that a fruitful way to approach AGI control is to study small systems and "in the usual manner of science, attempt to arrive at a solid theory of AGI intelligence and ethics based on a combination of conceptual and experimental-data considerations". He considers this view the norm among "most AI researchers or futurists". I think empirical investigation of how AGIs behave is very useful, but we also have to remember that many AI scientists are overly biased toward "build first; ask questions later" because
On a personal level, I suggest that if you really like building systems rather than thinking about safety, you might do well to earn to give in software and donate toward AGI-safety organizations.
Yudkowsky (2016b) makes an interesting argument in reply to the idea of using empirical, messy approaches to AI safety: "If you sort of wave your hands and say like, 'Well, maybe we can apply this machine-learning algorithm, that machine-learning algorithm, the result will be blah blah blah', no one can convince you that you're wrong. When you work with unbounded computing power, you can make the ideas simple enough that people can put them on whiteboards and go like 'Wrong!', and you have no choice but to agree. It's unpleasant, but it's one of the ways the field makes progress."
Here are some rough suggestions for how I recommend proceeding on AGI issues and, in [brackets], roughly how long I expect each stage to take. Of course, the stages needn't be done in a strict serial order, and step 1 should continue indefinitely, as we continue learning more about AGI from subsequent steps.
I recommend avoiding a confrontational approach with AGI developers. I would not try to lobby for restrictions on their research (in the short term at least), nor try to "slow them down" in other ways. AGI developers are the allies we need most at this stage, and most of them don't want uncontrolled AGI either. Typically they just don't see their work as risky, and I agree that at this point, no AGI project looks set to unveil something dangerous in the next decade or two. For many researchers, AGI is a dream they can't help but pursue. Hopefully we can engender a similar enthusiasm about pursuing AGI safety.
In the longer term, tides may change, and perhaps many AGI developers will desire government-imposed restrictions as their technologies become increasingly powerful. Even then, I'm doubtful that governments will be able to completely control AGI development (see, e.g., the criticisms by John McGinnis of this approach), so differentially pushing for more safety work may continue to be the most leveraged solution. History provides a poor track record of governments refraining from developing technologies due to ethical concerns; Eckersley and Sandberg (p. 187) cite "human cloning and land-based autonomous robotic weapons" as two of the few exceptions, with neither prohibition having a long track record.
I think the main way in which we should try to affect the speed of regular AGI work is by aiming to avoid setting off an AGI arms race, either via an AGI Sputnik moment or else by more gradual diffusion of alarm among world militaries. It's possible that discussing AGI scenarios too much with military leaders could exacerbate a militarized reaction. If militaries set their sights on AGI the way the US and Soviet Union did on the space race or nuclear-arms race during the Cold War, the amount of funding for unsafe AGI research might multiply by a factor of 10 or maybe 100, and it would be aimed in harmful directions.
Here are some candidates for the best object-level projects that altruists could work on with reference to AI. Because AI seems so crucial, these are also candidates for the best object-level projects in general. Meta-level projects like movement-building, career advice, earning to give, fundraising, etc. are also competitive. I've scored each project area out of 10 points to express a rough subjective guess of the value of the work for suffering reducers.
Research whether controlled or uncontrolled AI yields more suffering (score = 10/10)
Push for suffering-focused AI-safety approaches (score = 10/10)
Most discussions of AI safety assume that human extinction and failure to spread (human-type) eudaimonia are the main costs of takeover by uncontrolled AI. But as noted in this piece, AIs would also spread astronomical amounts of suffering. Currently no organization besides FRI is focused on how to do AI safety work with the primary aim of avoiding outcomes containing huge amounts of suffering.
One example of a suffering-focused AI-safety approach is to design AIs so that even if they do get out of control, they "fail safe" in the sense of not spreading massive amounts of suffering into the cosmos. For example:
The problem with bullet #1 is that if you can succeed in preventing AGIs from colonizing space, it seems like you should already have been able to control the AGI altogether, since the two problems appear about equally hard. But maybe there are clever ideas we haven't thought of for reducing the spread of suffering even if humans lose total control.
Another challenge is that those who don't place priority on reducing suffering may not agree with these proposals. For example, I would guess that most AI scientists would say, "If the AGI kills humans, at least we should ensure that it spreads life into space, creates a complex array of intricate structures, and increases the size of our multiverse."
Work on AI control and value-loading problems (score = 4/10)
Research technological/economic/political dynamics of an AI takeoff and push in better directions (score = 3/10)
By this I have in mind scenarios like those of Robin Hanson for emulation takeoff, or Bostrom's "The Future of Human Evolution".
Promote the ideal of cooperation on AI values (e.g., CEV) (score = 2/10)
Promote a smoother, safer takeoff for brain emulation (score = 2/10)
Influence the moral values of those likely to control AI (score = 2/10)
Promote a singleton over multipolar dynamics (score = 1/10)
Other variations
In general, there are several levers that we can pull on:
These can be applied to any of
Projects like DeepMind, Vicarious, OpenCog, and the AGI research teams at Google, Facebook, etc. are some of the leaders in AGI technology. Sometimes it's proposed that since these teams might ultimately develop AGI, altruists should consider working for, or at least lobbying, these companies so that they think more about AGI safety.
One's assessment of this proposal depends on one's view about AGI takeoff. My own opinion may be somewhat in the minority relative to expert surveys, but I'd be surprised if we had human-level AGI within the next 50 years, and my median estimate might be roughly 90 years from now. That said, the idea of AGI arriving at a single point in time is probably a wrong framing of the question. Already machines are super-human in some domains, while their abilities are far below humans' in other domains. Over the coming decades, we'll see lots of advancement in machine capabilities in various fields at various speeds, without any single point where machines suddenly develop human-level abilities across all domains. Gradual AI progress over the coming decades will radically transform society, resulting in many small "intelligence explosions" in various specific areas, long before machines completely surpass humans overall.
In light of my picture of AGI, I think of DeepMind, Vicarious, etc. as ripples in a long-term wave of increasing machine capabilities. It seems extremely unlikely that any one of these companies or its AGI system will bootstrap itself to world dominance on its own. Therefore, I think influencing these companies with an eye toward "shaping the AGI that will take over the world" is probably naive. That said, insofar as these companies will influence the long-term trajectory of AGI research, and insofar as people at these companies are important players in the AGI community, I think influencing them has value -- just not vastly more value than influencing other powerful people.
That said, as noted previously, early work on AGI safety has the biggest payoff in scenarios where AGI takes off earlier and harder than people expected. If the marginal returns to additional safety research are many times higher in these "early AGI" scenarios, then it could still make sense to put some investment into them even if they seem very unlikely.
If, upon further analysis, it looks like AGI safety would increase expected suffering, then the answer would be clear: Suffering reducers shouldn't contribute toward AGI safety and should worry somewhat about how their messages might incline others in that direction. However, I find it reasonably likely that suffering reducers will conclude that the benefits of AGI safety outweigh the risks. In that case, they would face a question of whether to push on AGI safety or on other projects that also seem valuable.
Reasons to focus on other projects:
Reasons to focus on AGI safety:
All told, I would probably pursue a mixed strategy: Work primarily on questions specific to suffering reduction, but direct donations and resources toward AGI safety when opportunities arise. Some suffering reducers particularly suited to work on AGI safety could go in that direction while others continue searching for points of leverage not specific to controlling AGI.
Parts of this piece were inspired by discussions with various people, including David Althaus, Daniel Dewey, and Caspar Oesterheld.
The term "friendly AI" can be confusing because it involves normative judgments, and it's not clear if it means "friendly to the interests of humanity's survival and flourishing" or "friendly to the goals of suffering reduction" or something else. One might think that "friendly AI" just means "AI that's friendly to your values", in which case it would be trivial that friendly AI is a good thing (for you). But then the definition of friendly AI would vary from person to person.
"Aligned AI" might be somewhat less value-laden than "friendly AI", but it still connotes to me a sense that there's a "(morally) correct target" that the AI is being aligned toward.
"Controlled AI" is still somewhat ambiguous because it's unspecified which humans have control of the AI and what goals they're giving it, but the label works as a general category to designate "AIs that are successfully controlled by some group of humans". And I like that this category can include "AIs controlled by evil humans", since work to solve the AI control problem increases the probability that AIs will be controlled by evil humans as well as by "good" ones. (back)
The post Artificial Intelligence and Its Implications for Future Suffering appeared first on Center on Long-Term Risk.
The post Possible Ways to Promote Compromise appeared first on Center on Long-Term Risk.
One general approach is to bring together people of different moral views, so that they can sympathize with those who feel differently on moral issues. This helps promote tolerance of others, thereby improving odds for amicable resolution of disputes, and in some cases, each side may adopt some of the moral views of the other.
If we encourage people to move in each other's directions morally, is this actually compromise? Or are we just introducing a new morality that's a blend of the two? Well, consider the example of the deep ecologists vs. animal welfarists from the beginning of "Gains from Trade through Compromise." Suppose the deep ecologists and animal welfarists both look at the issue from the other side's perspective and thereby come to sympathize with it. Say the moral blend results in everyone caring half about deep ecology and half about animal welfare. Then the policies adopted when these morally blended individuals follow their own moral instincts will in fact be roughly the same as the compromise deals that would have been reached by the equipotent competing factions. So, even before the sides are morally blended, they should welcome an intervention in which someone convinces each side to move in the other's moral direction, so long as this moral blending is done roughly in proportion to the power of the existing sides.
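Here's a toy numerical model (my own simplification, not something from the original compromise essay) of why proportional moral blending approximates a bargain between equipotent factions. Suppose policies lie on a line, deep ecologists' ideal policy is 0, animal welfarists' ideal policy is 1, and each side's utility falls off with squared distance from its ideal point:

```python
# Toy model (my own simplification): policies are numbers in [0, 1].
# Deep ecologists' ideal point is 0; animal welfarists' ideal point is 1.

def u_ecology(policy: float) -> float:
    return -(policy - 0.0) ** 2   # utility falls off with squared distance

def u_welfare(policy: float) -> float:
    return -(policy - 1.0) ** 2

candidates = [i / 100 for i in range(101)]

# A 50/50 morally blended individual maximizes the equal-weighted sum of utilities.
print(max(candidates, key=lambda p: 0.5 * u_ecology(p) + 0.5 * u_welfare(p)))  # 0.5

# With unequal weights (say 70/30 toward deep ecology), the chosen policy shifts
# toward the stronger side, roughly matching a power-weighted compromise.
print(max(candidates, key=lambda p: 0.7 * u_ecology(p) + 0.3 * u_welfare(p)))  # 0.3
```

The point is just that agents acting on the blended values end up at roughly the same policy that the two factions, bargaining in proportion to their power, would plausibly have settled on; how exact the correspondence is depends on the utility functions and the bargaining solution assumed.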
What we see here illustrates a general principle: One function of emotions is as evolution's way of making game-theoretic pre-commitments. Romantic love is an emotional pre-commitment to provide care and resources for a partner. Anger is an emotional pre-commitment to respond against encroachment with costly retaliation. And, in this case, changing people's moral sentiments toward a compromise stance is a way to actually achieve lasting compromise. It may seem crude compared against carefully designed optimal game-theoretic bargains, but it has the advantage that it works now, without relying on institutional structures that can enforce contracts into the far future. And historical precedent shows us that modifying emotions for compromise purposes can work. For example, the advent of the value of religious tolerance may have been partly a response to the costly religious wars that plagued Europe. Of course, formal agreements like the Peace of Westphalia also contributed, and as is often the case, formal agreements can breed moral values just as moral values can lead to formal agreements.
I said above that moral blending is acceptable to both sides if the resulting blend is roughly proportional to the pre-existing power balance. However, sometimes this may not be the case. If it's not predictable which side people will favor upon considering both views, then ex ante, each side may still be okay with moral blending, because it's not clear which side will be favored more, and often, the members of each side think their stance is obviously more sensible, so they may even have high hopes for the outcome. Still, there are exceptions to this. For instance, members of certain religions and cults are discouraged from associating too closely with outsiders because this might predictably lead to straying from the fold. (2 Corinthians 6:14: "Do not be yoked together with unbelievers. For what do righteousness and wickedness have in common? Or what fellowship can light have with darkness?")
These are tricky cases, and there is a genuine tension between openness to new values versus goal preservation. For example, an altruist should rightly be concerned about marrying someone who spends all his money on expensive cars and tropical cruises, in part for fear of being tempted into the same lifestyle. That said, sometimes people are also flexible enough to "try on" other moral perspectives.
Beyond the game-theoretic reasons discussed above, there are at least three other motivations for openness to alternate moral perspectives:
Even if realism were better for convergence on balance, it's not clear we should encourage it wholesale, because it's somewhat confused. The idea of moral truth (whatever that's supposed to mean?) violates Occam's razor, and moreover, why would I care about what the moral truth was even if it existed? What if the moral truth commanded me to needlessly torture babies? It seems likely that as people become more sophisticated, they'll increasingly understand that naive realism doesn't make sense. Promoting it would then be like trying to tell kids about Santa to induce them to be nice rather than naughty; it only works for so long, especially among the really intelligent people who will have the most power over how the future unfolds.
That said, we may not want to promote concentrated anti-realism either, because this could encourage moral balkanization as people see that they can legitimately hold a stance in opposition to what others want. Rather, we should probably push most on convergence as a goal, like with coherent extrapolated volition, and not focus too much on the caustic nature of unadulterated anti-realism.
There are other routes to moral blending besides abstract meta-ethics. For example:
Are these approaches cost-effective? Given that so many sectors of society aim to promote inter-group dialogue and reduce violence, we shouldn't expect low-hanging fruit here. On the other hand, because these efforts are widely regarded as beneficial, we have more confidence that working on them would at least be positive in expectation. Thus, these seem to be, at minimum, relatively "safe bets" and are supported by heuristics about working on causes that have wide support from many different people.
Keep in mind that some of the proposals, like promoting increased education, have many other flow-through effects that make the analysis more complicated. Promoting greater liberalism within existing education may be an easier intervention to analyze.
Symbols, rituals, and shared identity can bind groups together, but usually this comes at the expense of increasing hostility toward outsiders. Oxytocin has this same effect. Jonathan Haidt suggests the metaphor that when people circle around a sacred object, they generate an "electric current" that unites them but also creates a polarity of them vis-a-vis the outgroup. Various studies have found that people identify with fellow group members more than outsiders even when the group assignments were basically arbitrary or even random. Even prelinguistic infants display this tendency, preferring puppets that have the same food preference -- Cheerios or graham crackers -- as they do (Neha Mahajan and Karen Wynn, 2012).
Ingroup loyalty can be strengthened by various factors, including anger ("Prejudice From Thin Air") and the perception of zero-sum conflict (realistic conflict theory).
"Intergroup Conflict" has more to say, and the discipline of ingroup/outgroup distinctions has further literature on these topics.
Defection on one-shot prisoner's dilemmas happens because squealing on your partner doesn't have lasting consequences. If no one ever finds out, there's temptation to cheat. Real-world examples of this include lying, stealing, and other forms of deception for personal gain at greater social cost.
To prevent defection, it helps to make your choice visible to others, so that cooperation can be rewarded with future social benefits. One elegant way to accomplish this could be a "karma" rating or Whuffie score that is incremented or decremented based on how you treat others. If your fellow prisoner could lower your karma rating when you defect, you would have incentive not to do so.
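As a minimal sketch of the incentive effect (the names and numbers are hypothetical, and a real system would also have to handle fraud, collusion, and dispute resolution), imagine that each interaction updates a public karma score and that people consult the score before agreeing to new deals:

```python
# Minimal sketch of a reputation ("karma") system that deters one-shot defection.
# Names and numbers are hypothetical illustrations.

karma = {}

def get_karma(person: str) -> int:
    return karma.get(person, 0)

def record_interaction(person: str, cooperated: bool) -> None:
    """Partners publicly report whether `person` cooperated or defected."""
    karma[person] = get_karma(person) + (1 if cooperated else -3)

def willing_to_transact(person: str, threshold: int = -2) -> bool:
    """Others consult the public score before entering new deals."""
    return get_karma(person) >= threshold

record_interaction("alice", cooperated=True)
record_interaction("bob", cooperated=False)
record_interaction("bob", cooperated=False)

print(willing_to_transact("alice"))  # True
print(willing_to_transact("bob"))    # False -- defection now has lasting costs
```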
As social networks become more ubiquitous, the possibility for these kinds of karma systems becomes more real. Already they exist in many online communities, like Slashdot or Quora. David Brin suggests that karma could become even more ubiquitous with digital glasses or other devices for looking up people's reputation scores in real-life settings.
Of course, similar incentive functions can also be accomplished with monetary payments, although it seems that there are social norms against paying money in certain circumstances, such as interactions between friends. Instead, our more ancient primate sense of social capital still operates for relationships between family members, friends, business partners, and politicians. People feel less outrage about "bribery" when it's done through networking and social ingratiation rather than explicit financial exchanges.
All of these mechanisms facilitate compromise on prisoner's dilemmas by allowing for Pareto-improving transactions. As society becomes increasingly transparent, it will be possible to enforce these arrangements in a greater number of cases, potentially making everyone better off, at least in theory. Already governments serve this function to a significant degree, by enforcing laws. (Of course, it's important to make sure that punishments for violations are roughly proportionate to the damage done rather than being excessive, or else these laws risk causing more harm than they prevent.)
Humans value privacy for many reasons, but one reason is because it historically offered protection -- e.g., sneaking off to a secluded area to have sex so that the dominant male doesn't beat you up afterward. Similar principles apply in protecting citizens against authoritarian Big Brother. Avoiding authoritarianism is an important concern, but my sense is that this can be done by other means. For instance, Brin proposes sousveillance -- watching the watchers to hold them accountable. As Brin says, safety is a necessary precondition for privacy, so at least some degree of surveillance is unavoidable.
Another reason we value privacy is that people tend to judge each other over trivialities -- sexual conduct, religious beliefs, irreverent jokes, not being properly dressed, or whatever. The popularity of celebrity gossip and farcical political "scandals" is a testament to this feature of human nature. I think many people desire privacy because they don't want others seeing them doing these entirely normal activities that somehow are blown out of proportion when they're visible. If we could overcome the tendency to make a fuss over harmless private behaviors, it would allow for more transparency and therefore more opportunities for cooperation. As more of our lives become visible through digital technology, I hope people will become more accepting of diversity and individual choices, but getting there will sadly not be easy.
Transparency is probably even more important at a governmental level -- the government both being transparent to its own citizens and being transparent to other governments. This can allow for better enforcement of international agreements, such as arms-control treaties. In The Strategy of Conflict, Thomas Schelling explains (p. 148):
Leó Szilárd has even pointed to the paradox that one might wish to confer immunity on foreign spies rather than subject them to prosecution, since they may be the only means by which the enemy can obtain persuasive evidence of the important truth that we are making no preparations for embarking on a surprise attack. [Citation: Szilárd's 1955 "Disarmament and the Problem of Peace"]
Transparency of citizens to governments is often protested, sometimes on privacy grounds and sometimes to prevent slipping down a slope toward tyranny. It's difficult to know where to draw the line on government surveillance. That said, it is clear that abuses of surveillance power -- whether motivated by prurience or sabotaging one's opponents -- are harmful, because they engender justified outrage at surveillance in general, making it slightly harder to carry out good surveillance.
One additional consideration is that surveillance and greater government power probably make eventual space colonization more likely by reducing catastrophic risks. This impact may be met with ambivalence by those who consider preventing suffering most important.
Democracy has many benefits for compromise.
Democracy relies on social stability. In order to be willing to compromise, you need to be confident the bargain will be upheld. Strong rule of law is required for this. In general, there is a vast literature on what makes democratic compromise possible, and we should explore it further. For instance, what helped trigger Democracy's Third Wave?
Trade tends to reduce the likelihood that factions or countries will go to war, because the parties rely on each other for mutual benefit. In addition, trade has a moral effect of enhancing empathy among distant peoples, as a natural corollary of the fact that reciprocal altruism leads us to care more about those with whom we exchange.
While this is the prevailing view among elites, there are some critics, such as Margaret MacMillan, who suggests that globalization can increase "intense localism and nativism," and this may have contributed to World War I; at the very least, growing interdependence didn't prevent that war. My personal guess, however, is that even if MacMillan's claim is true, it's a short-term effect, and the long-term trend is toward greater tolerance due to increased trade. In a session on "China Rising," Jon Huntsman cited Utah's alfalfa exports to China as a factor helping to humanize and incentivize a friendlier relationship: the second-largest economy in the world has gone "from enemy to customer" for those farmers.
Pride in one's country is a glue that holds countries together and justifies government build-up of force against other countries. John Mearsheimer said, "The most powerful political ideology on the face [of the planet ...] is not democracy; it's nationalism." And, he adds, nationalism makes it very hard to take over another country because the local population fights back unceasingly, as the US saw in Vietnam.
Of course, there are failed states, but the overall success of nationalism to promote unity even in spite of fierce ideological disputes is impressive. However, as is often remarked, the downside of nationalism is that it provokes international hostilities. This is the classic problem that Josh Greene discusses in Moral Tribes: the glue that turns "me" to "us" also pits "us" against "them."
The circle of what constitutes "us" can apparently grow large -- from a 150-person tribal group to a 1-billion-person China, for instance. So it's not too much of an additional step to extend it to a 7-billion-person world. There is already an internationalist movement, aiming to encourage people to view each other as "citizens of the world." Or, in the words of John Lennon's "Imagine": "Imagine there's no countries / It isn't hard to do / Nothing to kill or die for / [...] Imagine all the people / Living life in peace..."
I don't know the cost-effectiveness of promoting internationalism relative to other things, but such interventions are at least very likely positive. Of course, if there are nearby extraterrestrials, the next steps will be interplanetism, intergalacticism, etc., but most people aren't ready for this yet. It would, of course, be a tragedy if internationalism led to fiercer conflicts with ETs, but hopefully the greater wisdom of our descendants will preclude that.
These questions about how people perceive nationalism are a focus of constructivists in international relations. Lisa Anderson's talk on "Nationalism and Ethnic Conflict" has further discussion about nationalism's origins and effects.
An important factor in cultivating internationalist sentiments is intermixing of people and cultures. For instance, one reason the US has had such strong ties with Europe is that many people of European descent live in the US, so that domestic political sentiments are aligned toward friendliness with European allies. This is even more prominently visible with American Jews exerting pressure for aggressive US backing of Israel, although in this particular case it's arguable that the domestic lobby causes more harm than good to international peace. Ideally, the cultural mix in the US would be sufficiently diverse that domestic politics wouldn't lopsidedly favor one foreign country over another.
Michael R. Auslin's Pacific Cosmopolitans: A Cultural History of U.S.-Japan Relations reviews a number of additional ways in which cultural exchange can improve international friendship, focusing on the case of Japan specifically:
Cultural icons like Nintendo from Japan or Jackie Chan from China are other examples of major forces that can help break down inter-country hostility in the minds of millions of ordinary citizens.
One concrete example where international exchange could be a cost-effective philanthropic project was described by George Perkovich in his interview with GiveWell. He explained that in the 1990s, he helped with a program that brought together "young scholars and policymakers from Pakistan and India to increase goodwill and communication among the next generation of leaders" (p. 3). India and Pakistan opposed the program, and it shut down due to difficulty obtaining visas, but it could be resurrected. This seems important because the India-Pakistan conflict is arguably the most likely of any in the world to become nuclear.
The sociology literature has extensively studied the contact hypothesis, which is the idea that people become more accepting toward those of different races, sexual orientations, or nationalities when they jointly have positive, coequal interactions in cooperative settings working toward common goals. This is what one would expect from the fact that positive-sum games require the brain to marshal warm feelings toward compatriots, to elicit cooperation rather than defection.
According to the Wikipedia article, Donelson R. Forsyth's Group Dynamics describes a meta-analysis of 515 studies that found a correlation coefficient of 0.2-0.3 between inter-group contact and absence of conflict. Thus, the hypothesis has extensive empirical backing, even though some researchers like Robert D. Putnam have found exceptions where more diverse communities display lower levels of trust.
As an aside, we might wonder whether the contact technique could be used to increase concern for animal suffering. In addition to having humans interact with animals directly, perhaps one could employ imagined contact or parasocial contact through the media, both of which have been suggested to help for human-human tolerance.
Compared with men, women are generally less competitive and far less violent. One reason is that women are more risk-averse, because the number of offspring they can possibly have is bounded. Historically, men could acquire increased status and more sexual partners by succeeding in warfare against other tribes ("male warrior hypothesis"). Testosterone suppresses empathy and encourages conflict, "so much so that we had to invent sports to keep the boys happy," notes Jonathan Haidt.
It thus seems plausible that as women gain more power in a country, that country should ceteris paribus act more peacefully. David Pearce suggested female-only leadership as a way to reduce war, although it's unclear how big the effect would be, since structural factors might tend to select for the most competitive females to leadership roles. Also, some amount of willingness to fight can be important, for deterrence and humanitarian intervention.
Needless to say, Pearce's proposal would not be implemented any time soon. However, the more modest aim of empowering women seems valuable.
Ultimately we might hope for the emergence of a world government, or singleton, which could provide the authority to enforce bargaining arrangements even at the international level. This idea is not new; compare with Hobbes's Leviathan, which argues for a sovereign authority to prevent "the war of all against all." People give up their complete freedom in deference to a social contract that ultimately allows everyone to get more of what he wants in expectation than in a winner-take-all fight.
Short of a world government, we can aim for more modest forms of international cooperation. Many exchanges among nations can be seen as iterated prisoner's dilemmas, rather than one-shot versions of the game. As a result, neoliberal international-relations professor Robert Keohane has suggested several institutional ways to increase cooperation, as reported in Wikipedia's article on "Regime theory."
Finally, we can promote the idea of cooperation itself as the first response to conflict, such as by fighting zero-sum thinking and explaining the logic of compromise. The school of liberalism in international relations takes a more positive view of compromise and positive-sum possibilities than does realism, which tends to see one country's gain as another's loss. One reason for this is that realism tends to focus on relative gains, while liberalism emphasizes absolute welfare.
International conflicts have historically been among the most massive anthropogenic causes of death, so it seems that cooperation among nations (and major factions within nations) has high priority. The same may be true in an artificial general intelligence (AGI) race, if, for example, two major powers compete for control in analogy with the US and Soviet Union during the Cold War. It seems particularly important to raise awareness of these issues among potential AGI designers themselves, as well as the military and corporate leaders funding AGI projects, although general public outreach can still be valuable to the extent it influences these leaders by diffusion. (For example, the technologists who actually end up building AGI may be born 50 years from now and be influenced by parents, teachers, and TV programs that were in turn influenced by what we did today.)
Research on the fundamentals of compromise might have a high payoff. Many issues in game theory remain poorly understood, and putting compromise on firmer theoretical ground would be a major step forward. In addition, we need research in political and social theory to devise robust mechanisms for sustaining compromise agreements, especially against potential disruptions to existing institutions that may result from fast technological breakthroughs or other black swans.
Helping groups that are fighting over soluble factual questions might reduce many short-term conflicts. For more, see this discussion of epistemic disagreements.
Epistemic disagreements are divergences of opinion that are theoretically irrational when the parties' views are common knowledge. However, very often agents don't have the same knowledge. Games of imperfect information are often more prone to conflict than those of perfect information; under perfect information, the best outcome is usually to compromise. (Of course, there may be exceptions.) Shifting situations from games of imperfect information toward games of (more) perfect information could therefore be valuable. A downside is that more knowledge in general also means that risks arrive faster, leaving less time in which to negotiate and work out social structures that better facilitate a good outcome for everyone.
While it's extremely important to promote cooperation, this field is not laden with low-hanging fruit, because many other people already rightly see its value. Historically, one major source of leveraged social change has been technologies that open up new possibilities. Are there technologies we could support that hold the promise of dramatically improving cooperation, without also speeding up dangers of massive conflict at the same time?
One proposal that a friend of mine suggested is improving machine translation. Language plays a major role in the development of national identities and us-versus-them balkanization. One of the goals of Esperanto was to "transcend nationality and foster peace and international understanding between people with different languages." While Esperanto has little hope of worldwide adoption, very readable machine translation would offer something almost as good. Apparently this vision has been floating around since the end of World War II.
In Game Theory: Analysis of Conflict, Roger B. Myerson suggests (pp. 1-2):
People seem to have learned more about how to design physical systems for exploiting radioactive materials than about how to create social systems for moderating human behavior in conflict. Thus, it may be natural to hope that advances in the most fundamental and theoretical branches of the social sciences might be able to provide the understanding that we need to match our great advances in the physical sciences.
Of course Myerson is likely to feel this way because (a) if he didn't, he might not have studied game theory and (b) he probably doesn't want to feel as though his life's work has been harmful. But is it true that our prospects for reducing suffering are better when people are more informed about game theory?
It's certainly the case that there are both gains and losses when people understand game theory relative to relying on naive intuition or happenstance.
On balance, though, there seem to be real benefits to deeper understanding as well.
Game theory is destiny. In the long run, rational agents will converge on understanding game theory anyway, because those who don't will on average lose resources. If we can emphasize the positive possibilities of game theory, we may be able to steer society toward a better path. Of course, if, hypothetically, it were the case that game theory tended to produce worse results the more it was understood, we might hope to keep people in the dark as long as possible in order to maximize cooperation before crucial junctures like the creation of AGI. However, I think this is not that likely, and indeed, the opposite seems more plausible: that better understanding of game theory would help navigate cooperation on AGI to make everyone better off in expectation.
I have a separate piece that lists organizations that work to promote cooperation: "Cooperation Charities and Organizations."
Gains from Trade through Compromise
]]>"Any man to whom you can do favor is your friend, and [...] you can do a favor to almost anyone."
--Mark Caine
"Gains from trade" in economics refers to situations where two parties can engage in cooperative behavior that makes each side better off. A similar concept applies in the realm of power struggles between competing agents with different values. For example, consider the following scenario.
Deep ecologists vs. animal welfarists. Imagine that two ideologies control the future: Deep ecology and animal welfare. The deep ecologists want to preserve terrestrial ecosystems as they are, including all the suffering they contain. (Ned Hettinger: "Respecting nature means respecting the ways in which nature trades values, and such respect includes painful killings for the purpose of life support.") The animal welfarists want to intervene to dramatically reduce suffering in the wild, even if this means eliminating most wildlife habitats. These two sides are in a race to control the first artificial general intelligence (AGI), at which point the winner can take over the future light cone and enforce its values.
Suppose the two sides are equally matched in resources: They each have a 50% shot at winning. Let's normalize the values for each side between 0 and 100. If the deep ecologists win, they get to preserve all their beloved ecosystems; this outcome has value 100 to them. If they lose, their ecosystems disappear, leaving 0 value. Meanwhile, the values are swapped for the animal welfarists: If they win and eliminate the suffering-filled ecosystems, they achieve value 100, else the value to them is 0. Since the chance of each side winning is 50%, each side has an expected value of 50.
But there's another option besides just fighting for winner takes all. Say the deep ecologists care more about preserving species diversity than about sheer number of organisms. Maybe they're also more interested in keeping around big, majestic animals in their raw form than about maintaining multitudes of termites and cockroaches. Perhaps some ecologists just want the spectacle of wildlife without requiring it to be biological, and they could be satisfied by lifelike robot animals whose conscious suffering is disabled at appropriate moments, such as when being eaten.1 Maybe others would be okay with virtual-reality simulations of Earth's original wildlife in which the suffering computations are skipped over in the virtual animals' brains.
These possibilities suggest room for both parties to gain from compromise. For instance, the animal welfarists could say, "We want to get rid of 60% of suffering wild animals, but we'll eliminate the ones that you care about least (e.g., insects when they're not crucial for supporting the big animals), and we'll keep some copies of everything to satisfy your diversity concerns, along with doing some robots and non-suffering simulations." Maybe this would be ~60% as good as complete victory in the eyes of the deep ecologists. If the two sides make this arrangement, each gets value 60 with certainty instead of expected value 50.
Here, there were gains from trade because the animal welfarists could choose for the compromise those methods of reducing wild-animal suffering that had least impact to the deep ecologists' values. In general, when two sets of values are not complete polar opposites of each other, we should expect a concave-down curve like the red one below illustrating the "production possibilities" for the two values. When the curve is concave down, we have possible gains from trade relative to duking it out for winner takes all (blue line). The blue line illustrates the expected value for each value system parameterized by the probability in [0,1] for one of the value systems winning.
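To make the arithmetic above concrete, here is a minimal Python sketch. The concave frontier used (an exponent of 0.7 on each side's share of control) is purely illustrative and not taken from the essay's figure; any diminishing-returns mapping gives the same qualitative result that a 50/50 compromise beats a 50% gamble for total control.

```python
def fight_expected_values(p_welfarist=0.5):
    """Expected utilities if each side fights for winner-takes-all control."""
    return 100 * p_welfarist, 100 * (1 - p_welfarist)

def compromise_values(share_to_welfarists, exponent=0.7):
    """Utilities under a compromise split, with diminishing returns to control:
    each side satisfies its most important concerns first."""
    u_welfarist = 100 * share_to_welfarists ** exponent
    u_ecologist = 100 * (1 - share_to_welfarists) ** exponent
    return u_welfarist, u_ecologist

print(fight_expected_values())    # (50.0, 50.0)
print(compromise_values(0.5))     # roughly (61.6, 61.6): both sides beat their expected 50
```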
We can imagine many additional examples in which suffering reducers might do better to trade with those of differing value systems rather than fight for total control.
In these disputes, the relevant variable for deciding how to slice the compromise seems to be the probability that each side would win if it were to continue fighting in an all-or-nothing way. These probabilities might be roughly proportional to the resources (financial, political, cultural, technological, etc.) that each side has, as well as its potential for growth. For instance, even though the movement to reduce wild-animal suffering is small now, I think it has potential to grow significantly in the future, so I wouldn't make early compromises for too little in concessions.
This is analogous to valuation of startup companies: Should the founders sell out or keep going in case they can sell out for a higher value later? If they do badly, they might actually get less. For instance, Google offered to buy Groupon for $5.75 billion in 2010, but Groupon turned down the offer, and by 2012, Groupon's market cap fell to less than $5.75 billion.
In "Rationalist explanations for war," pp. 386-87, James D. Fearon makes this same observation: Two states with perfect information should always prefer a negotiation over fighting, with the negotiation point being roughly the probability that each side wins.
I discuss further frameworks for picking a precise bargaining point in "Appendix: Dividing the compromise pie."
Our social intuitions about fairness and democracy posit that everyone deserves an equal say in the final outcome. Unfortunately for these intuitions, compromise bargains are necessarily weighted by power -- "might makes right." We may not like this fact, but there seems no way around it. Of course, our individual utility functions can weight each organism equally, but in the final compromise arrangement, those with more power get more of what they want.
Many people care about complexity, diversity, and a host of other values that I don't find important. I have significant reservations about human space colonization, but I'm willing to let others pursue this dream because they care about it a lot, and I hope in return that they would consider the need to maintain safeguards against future suffering. The importance of compromise does not rely on you, in the back of your mind, giving some intrinsic moral weight to what other agents want; compromise is still important even when you don't care in the slightest or may even be apprehensive about the goals of other factions. To appropriate a quote from Noam Chomsky: If we don't believe in strategic compromise with those we can't identify with, we don't believe in it at all.
If this compromise approach of resolving conflicts by buying out the other side worked, why wouldn't we see it more often? Interest groups should be compromising instead of engaging in zero-sum campaigns. Countries, rather than going to war, could just assess the relative likelihood of each side winning and apportion the goods based on that.
Even animals shouldn't fight: They should just size up their opponents, estimate the probability of each side winning, and split the resources appropriately. In the case of fighting animals, they could get the same expected resources with less injury cost. For instance, two bull seals fighting for a harem of 100 cows, if they appear equally matched, could just split the cows 50-50 and avoid the mutual fitness costs of getting injured in the fight.
There are a few possible explanations for why we don't see more cooperation among animals, though I don't know how accurate they are.
Of course, there are plenty of examples where animals have settled on cooperative strategies. It's just important to note that they don't always do so, and perhaps we could generalize under what conditions cooperation breaks down.
Human wars often represent a failure of cooperation as well. While wars sometimes have "irrational" causes, Matthew O. Jackson and Massimo Morelli argue in "The Reasons for Wars: An Updated Survey" that many can be framed in rationalist terms, and they cite five main reasons for the breakdown of negotiation. An exhaustive survey of theories of war is contained in a syllabus by Jack S. Levy.
How about in intra-state politics? There are plenty of compromises there, but maybe not as many as one might expect. For instance, Toby Ord proposed in 2008:
It is so inefficient that there are pro- and anti- gun control charities and pro- and anti-abortion charities. Charities on either side of the divide should be able to agree to 'cancel' off some of their funds and give it to a mutually agreed good cause (like developing world aid). This would do just as much for (or against) gun control as spending it on their zero-sum campaigning, as well as doing additional good for others.
A similar idea was floated on LessWrong in 2010. I have heard of couples both not voting because they'd negate each other, but I haven't heard of an organization as described above for cancelling opposed donations. Why hasn't something like this taken off?
Whatever the reason is that we don't see more cancelling of opposed political forces, the fact remains that we do see a lot of compromise in many domains of human society, including legislation (I get my provision if you get yours), international relations (we'll provide weapons if you fight people we don't like), business (deals, contracts, purchases, etc.), and all kinds of social relations (Brother Bear will play any three games with Sister Bear if she plays Space Grizzlies with him later). And we're seeing an increasing trend toward positive-sum compromise as time goes on.
While racing to control the first AGI amounts to a one-shot prisoner's dilemma, most of life's competitive scenarios are iterated. Indeed, even in the case of an AGI arms race, there are many intermediate steps along the way where the parties choose cooperation vs. defection, such as when expanding their resources. Iterated prisoner's dilemmas provide a very strong basis for cooperation, as was demonstrated by Robert Axelrod's tournaments. As the Wikipedia article explains:
In summary, success in an evolutionary "game" correlated with the following characteristics: being nice (not defecting before the opponent does), being retaliatory (responding to defection rather than letting oneself be exploited), being forgiving (returning to cooperation once the opponent cooperates again), and being non-envious (not striving to outscore the opponent).
Iterated prisoner's dilemmas yielded cooperation in an evolutionary environment with no pre-existing institutions or enforcement mechanisms, and the same should apply even in those iterated prisoner's dilemmas between groups today where no formal governing systems are yet in place. In this light, it seems suffering reducers should put out their compromise hand first and aim to help all values in a power-weighted fashion, at least to some degree, and then if we see others aren't reciprocating, we can temporarily withdraw our assistance.
One can imagine agents for whom compromise is actually not beneficial because they have increasing rather than diminishing returns to resources. In the "Introductory example," we saw that both animal welfarists and deep ecologists had diminishing returns with respect to how much control they got, because they could satisfy their most important concerns first, and then later concerns were less and less important. Imagine instead an agent that believes that the value of a happy brain is super-linear in the size of that brain: e.g., say the value is quadratic. Then the agent would prefer a 50% chance of getting all the matter M in the future light cone to produce a brain with value proportional to M2 rather than a guarantee of getting half of the matter in the universe to produce a brain with value proportional to (M/2)2 = M2/4. I think agents of this type are rare, but we should be cautious about the possibility.
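A quick numerical check of this caveat, contrasting the quadratic utility described above with an illustrative concave (square-root) utility; the square-root function is an assumption for contrast, not from the essay.

```python
M = 1.0  # total matter in the future light cone (normalized)

def expected_utility_of_gamble(utility, p=0.5):
    """Expected utility of a p chance at everything and a (1 - p) chance at nothing."""
    return p * utility(M) + (1 - p) * utility(0.0)

quadratic = lambda x: x ** 2    # increasing returns to resources
concave = lambda x: x ** 0.5    # diminishing returns to resources

for name, u in [("quadratic", quadratic), ("square-root", concave)]:
    gamble = expected_utility_of_gamble(u)
    sure_half = u(M / 2)
    choice = "the gamble" if gamble > sure_half else "the sure half"
    print(f"{name}: gamble = {gamble:.3f}, sure half = {sure_half:.3f} -> prefers {choice}")
```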
Another interesting case is that of sacred values. It seems that offering monetary compensation for violation of a sacred value actually makes people more unwilling to compromise. While we ordinarily imagine sacred values in contexts like the abortion debate or disputes over holy lands, they can even emerge for modern issues like Iran's nuclear program. Philip Tetlock has a number of papers on sacred-value tradeoffs.
It seems that people are more willing to concede on sacred values in return for other sacred values, which means that compromise with such people is not hopeless but just requires more than a single common currency of exchange.
Bargaining on Earth is fast, reliable, and verifiable. But what would happen in a much bigger civilization that spans across solar systems and galaxies?
The Virgo Supercluster is 110 million light-years in diameter. Suppose there was an "intergalactic federation" of agents across the Virgo Supercluster that met at a Congress at the center of the supercluster. The galaxies could transmit digital encodings of their representatives via radar, which would take 55 million years for the most distant regions. The representatives would convene, reach agreements, and then broadcast back the decisions, taking another 55 million years to reach the destination galaxies. This process would be really slow, especially if the digital minds of the future run at extremely high speeds. Still, if we had, say, 10^12 years before dark energy separated the parts of the supercluster too far asunder, we could still get in 10^12/10^8 = 10,000 rounds of exchanges. (As Andres Gomez Emilsson pointed out to me, this calculation doesn't count the expansion of space during that time. Maybe the actual number of exchanges would be lower on this account.) In addition, if the galaxies dispatched new representatives before the old ones returned, they could squeeze in many more rounds, though with less new information at each round.
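A minimal sketch of the round-counting arithmetic above. Using the full 110-million-year round trip rather than the rounded 10^8 gives roughly 9,000 rounds, the same order of magnitude as the 10,000 quoted; cosmic expansion is ignored, as in the original estimate.

```python
one_way_ly = 55e6                    # light-years from the farthest galaxies to the central Congress
round_trip_years = 2 * one_way_ly    # signal out plus decisions broadcast back
horizon_years = 1e12                 # rough time before dark energy pulls the supercluster apart

rounds = horizon_years / round_trip_years
print(f"{rounds:,.0f} rounds of exchange")   # about 9,091
```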
Would it even be possible to transmit radar signals across the 55 million light-years? According to Table 1 of "How far away could we detect radio transmissions?," most broadband signals can travel just a tiny fraction of a light-year. S-band waves sent at high enough EIRP could potentially travel hundreds of light-years. For instance, the table suggests that at 22 terawatts of transmission EIRP, the detection range would be 720 light-years.
In the 1970s, humanity as a whole used ~10 terawatts, but the sun produces 4 * 10^14 terawatts, so maybe 22 terawatts is even conservative. The detection range is proportional to the square root of EIRP, so multiplying the detection range by 10 requires multiplying EIRP by 100. Obviously hundreds or thousands of light-years for radar transmission is tiny compared with 55 million light-years for the intergalactic distances at hand, but the communication can be routed from one star to the next. There are "rogue" intergalactic stars that might serve as rest stops, but whether they could be located and whether they would all be within a few thousand light-years of each other is unclear. Perhaps custom-built probes could relay messages from one node to the next across large interstellar distances, creating an intergalactic equivalent of the Internet.
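A short sketch of the scaling described above, taking the 22-terawatt / 720-light-year figure from the cited table as the baseline and assuming detection range scales with the square root of EIRP.

```python
import math

base_eirp_tw = 22.0      # terawatts of transmission EIRP (from the cited table)
base_range_ly = 720.0    # detection range in light-years at that EIRP

def eirp_needed(target_range_ly):
    """EIRP (in TW) for a given detection range, assuming range scales as sqrt(EIRP)."""
    return base_eirp_tw * (target_range_ly / base_range_ly) ** 2

print(eirp_needed(7200.0))               # 10x the range needs 100x the EIRP: 2200 TW
print(math.ceil(55e6 / base_range_ly))   # ~76,389 relay hops to span 55 million light-years
```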
Intergalactic communication is easier than intergalactic travel for material structures (e.g., the initial von Neumann probes that would do the colonizing). If solutions were found for intergalactic travel (e.g., speculative faster-than-light scenarios), these would aid in intergalactic compromise as well.
Even if you can make deals every 110 million years, how do you verify that the distant regions are following up on their sides of the bargains? Maybe the different factions (e.g., deep ecologists vs. animal welfarists) could build monitoring systems to watch what the others were doing. Representatives from all the different factions could be transmitted back from Congress to the home galaxies for follow-up inspections. But what would keep the home galaxies from just destroying the inspectors? Who would stop them? Maybe the home galaxies would have to prove at the next Congress session that they didn't hamper the inspectors, but it's not at all clear it would be possible to verify that.
What might work better would be if each home galaxy had a proportionate balance of parties from the different factions so that they would each have the power to keep the other sides in check. For example, if there were lots of deep ecologists and animal welfarists in both galaxies, most of the compromise could be done on a local scale, the same as it would be if intergalactic communication didn't exist. A risk would be if some of the local galaxies devolved into conflict in which some of the parties were eliminated. Would the other parts of the supercluster be able to verify that this had happened? And even if so, could a police force rectify the situation?
Cross-supercluster communication seems tricky. Probably most of the exchanges among parties would happen at a local level, and intergalactic trades might be a rare and slow process.
The easiest time to "divide up our future light cone" among competing factions seems to be at the beginning, before we send out the first von Neumann probes. Either different factions would be allocated different portions of the universe into which to expand, or all parties would agree upon a compromise payload to spread uniformly. This latter solution would prevent attempts to cheat by colonizing more than your fair share.
Of course, we would still need to compromise with aliens if we encountered them, but among (post-)humans, maybe all the compromise could be done at the beginning.
However, the idea of finding a perfect compromise at the beginning of colonization that continues working forever assumes that reliable goal preservation is possible, even for rapidly changing and learning digital agents that will persist for billions of years. That level of goal preservation seems very tricky to achieve in artificial general intelligences, which are extremely complex systems. So there might inevitably be divergence in beliefs and values among non-communicating galaxies, and this could eventually lead to conflicts.
Note that merely galactic democracy would be less challenging. The Milky Way is only 100,000 light-years in diameter, and I would guess that most of the stars are within thousands of light-years of each other, so networked radar transmission should be feasible. Congressional cycles would take only 100,000 years instead of 110 million. And the number of stars is not that small: maybe 100 to 400 billion, compared with about 200 trillion in the whole Virgo Supercluster. This is just a factor-of-10^3 difference and so shouldn't affect our expected-value calculations too much. In other words, intergalactic bartering isn't necessary for compromise on cosmic scales to still be important.
See "Possible Ways to Promote Compromise." We should evaluate the effectiveness and efficiency of these approaches and explore other ways forward.
In this essay I've focused on value disagreements between factions, and there's a reason for this. Facts and values are fundamentally two separate things. Values are things you want, drives to get something, and hence they differ from organism to organism. Facts are descriptions about the world that are true for everyone at once. Truth is not person-dependent. Even if post-modernists or skeptics are right that truth is somehow person-dependent or that there is no such thing as truth, that realization is at least still true at some meta-level of reasoning. One could deny even this, but such a view is rare, and its holders are presumably not going to do much to try to shape the world.
Given that there is some external truth about the universe, different people can share ideas about it, and other people's beliefs are evidence relevant to what we ourselves should believe. "Person A believes B" is a fact about the universe that our theories need to explain.
We should keep in mind that our Bayesian priors were shaped by various genetic and environmental factors in our development, and if we had grown up with the circumstances of other people, we would hold their priors. In some cases, it's clear that one set of priors is more likely correct -- e.g., if one person grew up with major parts of his brain malfunctioning, his priors are less likely to be accurate than those of someone with a normal brain. One reason for thinking so is that humans' normal brain structure has been shaped by evolution to track truths about the world, whereas random modifications to such a brain are less likely to generate comparably accurate views. In this case, both the normal brain and the malfunctioning brain should agree to give more weight to the priors of the normal brain, though both brains remain useful sources of data.
Even in cases where there's no clear reason to prefer one brain or another, it seems both brains should recognize their symmetry and update their individual priors to a common prior, as Robin Hanson suggests in "Uncommon Priors Require Origin Disputes." This is conceptually similar to two different belief impulses within your own brain being combined into a common belief via dissonance-resolution mechanisms. It's not specified how the merging process takes place -- it's not always an average, or even a weighted average, of the two starting points, but it seems rationally required for the merge to happen. Then, once we have common priors, we should have common posteriors by Aumann's theorem.
Some caveats are in order.
I have some hope that very rational agents of the future will not have much problem with epistemic disagreements, because I think the argument for epistemic modesty is compelling, and most of the smartest people I know accept it, at least in broad outline. If evolutionary pressures continue to operate going forward, they'll select for rationality, which means those practicing epistemic modesty should generally win out, if it is in fact the right stance to take. Thus, I see value conflicts as a more fundamental issue in the long run than factual ones.
That said, many of the conflicts we see today are at least partially, and sometimes primarily, about facts rather than values. Some debates in politics, for instance, are at least nominally about factual questions: Which policy will improve economic growth more? Are prevention measures against climate change cost-effective? Does gun control reduce violent crime? Of course, in practice these questions tend to become ideologized into value-driven emotional issues. Similarly, many religious disputes are at least theoretically factual -- What is/are the true God/gods? What is His/Her/their will for humanity? -- although, even more than in politics, many impulses on these questions are driven by emotion rather than genuine factual uncertainty. It's worth exploring how much rationality would promote compromise in these domains vs. how much other sociological factors are the causes and hence the best focal points for solutions.
There are disagreements in the effective-altruism movement about which causes to pursue and in what ways. I think many of the debates ultimately come down to value differences -- e.g., how much to care about suffering vs. happiness vs. preferences vs. other things, whether to care about animals or just humans and how much, whether to accept Pascalian gambles. But many other disagreements, especially in the short term, are about epistemology: How much can we grapple with long-term scenarios vs. how much should we just focus on short-term helping? How much should we focus on quantified measurement vs. qualitative understanding? How much should we think about flow-through effects?
Some are concerned that these differences in epistemology are harmful because they segregate the movement. I take mostly the opposite view. I think it's great to have lots of different groups trying out lots of different things. This helps you learn faster than if you all agreed on one central strategy. There is some risk of wasting resources on zero-sum squabbles, and it's good to consider cases where that happens and how to avoid them. At the same time, I think competition is also valuable, just as in the private sector. When organizations compete for donors using arguments, they improve the state of the debate and are forced to make the strongest case for their views. (Of course, recruiting donors via other "unfair" means doesn't have this same property.) While it might help for altruists to become better aligned, we also don't want to get comfortable with just averaging our opinions rather than seeking to show why our side may actually be more correct than others supposed.
This discussion highlights a more general point. Sometimes I feel epistemic modesty is too often cited as an empty argument: "Most smart people disagree with you about claim X, so it's probably wrong." Of course, this reasoning is valid, and it's important for everyone to realize as much, but this shouldn't be the end of the debate. There remains the task of showing why X is wrong at an object level. Analogously, we could say, "Theorem Y is true because it's in my peer-reviewed textbook," but it's a different matter to actually walk through the proof and show why theorem Y is correct. And every once in a while, it'll turn out that theorem Y is actually wrong, perhaps due to a typographical error or, on rare occasions, due to a more serious oversight by the authors. Intellectual progress comes from the latter cases: investigating a commonly held assumption and eventually discovering that it wasn't as accurate as people had thought.
Most new ideas are wrong. For every Copernicus or Galileo there are hundreds of scientists who are misguided, confused, or unlucky in interpreting their experimental findings. But we have to not be satisfied with conventional wisdom, and we have to actually look at the details of why others are wrong in order to make progress. It's plausible that an epistemically diverse population leads to faster learning than a uniform one. If startup founders weren't overconfident, we'd have fewer startups and hence less economic growth. Similarly, if people are less confident in their theories, they might push them less hard, and society might have less intellectual progress as a result.
However, epistemic divergence can be harmful in cases where each party can act on its own and thereby spoil the restraint of everyone else; Bostrom et al. call this the "unilateralist's curse." In these cases, it's best if everyone adheres to a policy of epistemic modesty. In general, maybe the ideal situation is for people to hold approximately uniform actual beliefs but then play advocate for a particular idea that they'd like to see explored more, even though it's probably wrong. There are times when I do this: propose something that I don't actually think is right, because I want to test it out.
While fighting over conflicting beliefs is not a good idea, groupthink is a danger in the reverse direction. While groups are collectively more accurate than individuals on average, when a group's views are swayed by conformity to each other or a leader, these accuracy benefits diminish. Groups with strong norms encouraging everyone to speak her own mind and rewarding constructive criticism can reduce groupthink.
Another reason why I sometimes make stronger statements than I actually believe is a sort of epistemic prisoner's dilemma.4 In particular, I often feel that other people don't update enough in response to the fact that I believe what I do. If they're not going to update in my direction, I can't update in their direction, because otherwise my position would be lost, and this would be worse than us both maintaining our different views.
For example, say Alice and Bob both have beliefs about some fact, like the number of countries in the world. Alice thinks the number is around 180; Bob thinks it's around 210. The best outcome would be for both parties to update in each other's directions, yielding something like 195, which is actually the number of independent states recognized by the US Department of State. However, say Alice is unwilling to budge on her estimate. If Bob were to move in her direction -- say part way, to 195 -- then Bob's views would be more accurate, but on collective decisions made by the Alice/Bob team, the decisions would, through their tug of war, be centered on something like (180+195)/2 = 187.5, which is farther from the truth than the collective decisions made by Alice/Bob holding 180 and 210 as their beliefs. In other words, if the collective decision-making process itself partly averages Alice's and Bob's views, then Bob should hold his ground as long as Alice holds her ground, even if this means more friction in the form of zero-sum conflicts due to their epistemic disagreement.
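A tiny sketch of the tug-of-war arithmetic above, under the stated assumption that collective decisions land at the simple average of the two positions.

```python
truth = 195   # approximate number of recognized states (per the example above)

def team_decision(alice_belief, bob_belief):
    """Assume collective decisions land at the average of the two stated positions."""
    return (alice_belief + bob_belief) / 2

for label, bob in [("Bob holds his ground", 210), ("Bob alone updates", 195)]:
    decision = team_decision(180, bob)
    print(f"{label}: decision = {decision}, error = {abs(decision - truth)}")
# Bob holds his ground: decision = 195.0, error = 0.0
# Bob alone updates: decision = 187.5, error = 7.5
```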
If Alice and Bob are both altruists, then this situation should be soluble by each side realizing that it makes sense to update in the other's direction. There's not an inherent conflict due to different payoffs to each party like in the regular prisoner's dilemma.
In general, epistemic compromise is similar to game-theoretic compromise in that it makes both parties better off, because both sides in general improve their beliefs, and hence their expected payoffs, in the process of resolving disagreement. Of course, if the agents have anticorrelated values, there can be cases where disagreement resolution is net harmful to at least one side, such as if a terrorist group resolves its factual disagreement with the US government about which method of making dirty bombs is most effective. By improving the factual accuracy of the terrorists, this may have been a net loss for the US government's goals.
When is moral activism a positive-sum activity for society, and when does it just transfer power from one group to another? This is a complex question.
Consider the case of an anti-death-penalty activist trying to convince people who support the death penalty that this form of punishment is morally wrong. Naively we might say, "Some people support the death penalty, others oppose it, and all that's going on here is transferring support from one faction to the other. Hence this is zero-sum."
On the other hand, we could reason this way instead: "Insofar as the anti-death-penalty activist is successful, she's demonstrating that the arguments against the death penalty are convincing. This is improving society's wisdom as people adopt more informed viewpoints. Most people should favor more informed viewpoints, so this is a win by many people's values, at least partially." The extent to which this is true depends on how much the persuasion is being done via means that are seen as "legitimate" (e.g., factual evidence, philosophical logic, clear thought experiments, etc.) and how much it's being done via "underhanded" methods (e.g., deceptive images, pairing with negative stimuli, ominous music, smear tactics, etc.). Many people are glad to be persuaded by more legitimate means but resistant to persuasion by the underhanded ones.
So there's a place for moral advocacy even in a compromise framework: Insofar as many factions welcome open debate, they win when society engages in moral discourse. When you change the opinion of an open-minded person, you're doing that person a service. Think of a college seminar discussion: Everyone benefits from the comments of everyone else. Other times moral persuasion may not be sought so actively but is still not unwelcome, such as when people distribute fliers on the sidewalk. Given that the receivers are voluntarily accepting the flier and open to reading it, we'd presume they place at least some positive value on the activity of the leafleter (although the value could be slightly negative if the person accepts the leaflet only due to social pressure). Of course, even if positive, moral persuasion might be far from optimal in terms of how resources are being used; this depends on the particulars of the situation -- how much the agents involved benefit from the leafleting.
However, not everyone is open to persuasion. In some instances a person wants to keep his values rigid. While this may seem parochial, remember that sometimes all of us would agree with this stance. For example: "If you offered Gandhi a pill that made him want to kill people, he would refuse to take it, because he knows that then he would kill people, and the current Gandhi doesn't want to kill people." Being convinced by underhanded means that we should kill people is a harm to our current values. In these cases, underhanded persuasion mechanisms are zero-sum because the losing side is hurt as much as the winning side is helped. Two opposed lobby groups using underhanded methods would both benefit from cancelling some of each other's efforts and directing the funds to an agreed upon alternate cause instead. On the other hand, opposed lobby groups that are advancing the state of the debate are doing a service to society and may wish to continue, even if they're in practice cancelling each other's effects on what fraction of people adopt which stance in the short run.
If changing someone's beliefs against his wishes is a harm to that person, then what are we to make of the following case? Farmer Joe believes that African Americans deserve to be slaves and should not have Constitutional rights. Furthermore, he doesn't want to have his views changed on this matter. Is it a harm to persuade Joe, even by purely intellectual arguments, that African Americans do in fact deserve equal rights? Well, technically yes. Remember that what persuasion methods count as "legitimate" vs. "underhanded" is in the eye of the hearer, and in this case, Joe regards any means of persuasion as underhanded. That said, if Joe were to compromise with the anti-slavery people, the compromise would involve everyone being 99+% against slavery, because in terms of power to control the future, the anti-slavery camp seems to be far ahead. Alternatively, maybe the anti-slavery people could give Joe something else he wants (e.g., an extra couple of shirts) in return for his letting them persuade him of the anti-slavery stance. This could be a good trade for Joe given his side's low prospects of winning in the long run.
As this example reminds us, the current distribution of opinion is not necessarily the same as the future distribution of power, and sometimes we can anticipate in which directions the trends are going. For example, it seems very likely that concern for animal wellbeing will dramatically increase in the coming decades. Unlike the stock market, the trajectory of moral beliefs is not a random walk.
Above we saw that moral discourse can often be a positive-sum activity insofar as other parties welcome being persuaded. (Of course, it may not always be as positive-sum as other projects that clearly benefit everyone, such as promoting compromise theory and institutions.) Conflicts in the realm of ideas are usually a good thing.
In contrast, direct actions may be more zero-sum when there's disagreement about the right action to take. Say person A thinks it's good to do a given action, and person B thinks it's equally wrong to do that same action.
While people often complain about "all talk and no action," in some cases, it can be Pareto-better to talk than to take action, if the issue at hand is one under dispute.
Often our actions meld with our beliefs about what's right, so it can sometimes get tricky, if you're trying to adopt a compromise stance for your actions, to mentally separate "how I'm acting for instrumental reasons" from "how I feel for intrinsic reasons." Sometimes people may begin to think of the compromise stance as intrinsically the "right" one, while others will continue to maintain this separation. In our own brains, we can feel the distinction between these two categories with respect to our evolutionary drives: Instrumental reciprocity feels like our moral sense of fairness, and our intrinsic survival drives feel like selfish instincts.
Control of Earth's future light cone is something that most value systems want, whether they are egoists, fun theorists, utilitarians, complexity maximizers, or something else.
Each of these value systems can be regarded like an individual in an economy, aiming to maximize its own utility. Each egoist has a separate goal from other egoists, so most of the individuals in this economy might be egoists, and then there would be a few other (very large) individuals corresponding to the fun theorists, utilitarians, complexity maximizers, etc. Resources in this economy include stars, raw materials for building Dyson swarms, knowledge databases, algorithm source code, etc., and an individual's utility derives from using resources to produce what it values.
It's possible that the future will literally contain many agents with divergent values, but it's also possible that just one of these agents will win the race to build AI first, in which case it would have the light cone to itself. There are two cases to consider, and both suggest compromise as a positive-sum resolution to the AI race.
Consider an AI race between eudaimonia maximizers and paperclip maximizers, with odds of winning p and 1-p respectively. If these factions are risk-neutral, then
expected utility of eudaimons = p * utility(resources if win) = utility(p * (resources if win)),
and similarly for the paperclippers. That is, we can pretend for purposes of analysis that when the factions compete for winner-takes-all, they actually control miniature future light cones that are p and 1-p times the size of the whole thing. But some parts of the light cone may be differentially more valuable than others. For example, the paperclippers need lots of planets containing iron and carbon to create steel, while the eudaimons need lots of stars for energy to power their simulations. So the parties would gain from trading with each other: The eudaimons giving away some of their planets in return for some stars. And similarly among other resource dimensions as well.
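To illustrate the gains from swapping differentially valuable resources, here is a sketch with made-up valuations (planets nearly worthless to the eudaimons, stars nearly worthless to the paperclippers); the specific numbers are assumptions, not from the essay.

```python
# Illustrative (made-up) valuations: eudaimons mostly want stars for energy,
# paperclippers mostly want planets for raw metal. Holdings follow the
# "pretend each faction controls p of the light cone" reasoning above, with p = 0.5.

def eudaimon_utility(planets, stars):
    return stars + 0.1 * planets      # planets are nearly useless to them

def clipper_utility(planets, stars):
    return planets + 0.1 * stars      # stars are nearly useless to them

# Out of 100 planets and 100 stars, each faction "controls" half of each.
before = (eudaimon_utility(50, 50), clipper_utility(50, 50))

# After a trade: eudaimons hand over their planets in exchange for the clippers' stars.
after = (eudaimon_utility(0, 100), clipper_utility(100, 0))

print(before)   # (55.0, 55.0)
print(after)    # (100.0, 100.0) -- both factions are better off
```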
For risk-averse agents, the argument for compromise is even stronger. In particular, many egoists may just want to create one immortal copy of themselves (or maybe 5 or 10 for backup purposes); they don't necessarily care about turning the whole future light cone into copies of themselves, and even if they'd like that, they would still probably have diminishing marginal utility with respect to the number of copies of themselves. Likewise for people who care in general about "survival of the human race": It should be quite cheap to satisfy this desire with respect to present-day Earth-bound humans relative to the cosmic scales of resources available. Other ideologies may be risk-averse as well; e.g., negative utilitarians want some computing power to figure out how to reduce suffering, but they don't need vast amounts because they're not trying to fill the cosmos with anything in particular. Even fun theorists, eudaimons, etc. might be satisficing rather than maximizing and exhibit diminishing marginal utility of resources.
In these instances, the case for compromise is even more compelling: not only can the parties exchange resources that are differentially valuable, but the compromise also reduces uncertainty, which boosts expected utility in the same way that insurance does for buyers. For instance, with an egoist who just wants one immortal copy of herself, the expected utility of the outcome is basically proportional to the probability that the compromise goes through, which could be vastly higher than her probability of winning the whole light cone. Individual egoists might band together into collective-bargaining units to reduce the transaction costs of making trades with each human separately. This might serve like a group insurance plan, and those people who had more power would be able to afford higher-quality insurance plans.
Carl Shulman has pointed out the usefulness of risk aversion in encouraging cooperation. And indeed, maybe human risk aversion is one reason we see so much compromise in contemporary society. Note that if even only one side is risk-averse, we tend to get very strong compromise tendencies. For example, insurance companies are not risk-averse with respect to wealth (for profits or losses on the order of a few million dollars), but because individuals are, individuals buy insurance, which benefits both parties.
Just like in a market economy, trade among value systems may include externalities. For instance, suppose that many factions want to run learning computations that include "suffering subroutines," which negative utilitarians would like to avert. These would be analogous to pollution in a present-day context. In a Coase fashion, the negative utilitarians might bargain with the other parties to use alternate algorithms that don't suffer, even if they're slightly costlier. The negative utilitarians could pay for this by giving away stars and planets that they otherwise would have (probabilistically) controlled.
The trade among value systems here has some properties of a market economy, so some of the results of welfare economics will apply. If there are not many buyers and sellers, no perfect information, etc., then the first fundamental welfare theorem may not fully hold, but perhaps many of its principles would obtain in weaker form.
In general, markets are some of the most widespread and reliable instances of positive-sum interaction among competing agents, and we would do well to explore how, why, and when markets work or don't work.
Of course, all of these trade scenarios depend on the existence of clear, robust mechanisms by which compromises can be made and maintained. Such mechanisms are present in peaceful societies that allow for markets, contracts, and legal enforcement, but it's much harder in the "wild west" of AI development, especially if one faction controls the light cone and has no more opposition. Exploring how to make compromise function in these contexts is an urgent research area with the potential to make everyone better off.
Consider a multi-dimensional space of possible values: Happiness, knowledge, complexity, number of paperclips, etc. Different value systems (axiologies) care about these dimensions to different degrees. For example, hedonistic utilitarians care only about the first and not about the rest. Other people care about each of the first three to some degree.
We can think of a person's axiology as a vector in values space. The components of the vector represent what weight (possibly negative) the person places on that particular value. In a four-dimensional values space of (happiness, knowledge, complexity, paperclips), hedonistic utilitarians have the vector (1, 0, 0, 0). Other people have vectors like (0.94, 0.19, 0.28, 0). Here I've normalized these to unit vectors. To evaluate a given change in the world, the axiologies take the scalar projection of the change onto their vector, i.e., the dot product. For example, if the change is (+2, -1, +1, +4), utilitarians evaluate this as (1, 0, 0, 0) * (2, -1, 1, 4) = 1 * 2 + 0 * -1 + 0 * 1 + 0 * 4 = 2, while the other axiology evaluates its value to be 0.94 * 2 + 0.19 * -1 + 0.28 * 1 + 0 * 4 = 1.97.
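A minimal sketch of the dot-product evaluation just described, reproducing the essay's numbers.

```python
def evaluate(axiology, change):
    """Value of a change to an axiology: the dot product of the two vectors."""
    return sum(a * c for a, c in zip(axiology, change))

# Dimensions: (happiness, knowledge, complexity, paperclips)
hedonistic_utilitarian = (1.0, 0.0, 0.0, 0.0)
pluralist = (0.94, 0.19, 0.28, 0.0)
change = (2, -1, 1, 4)

print(evaluate(hedonistic_utilitarian, change))   # 2.0
print(round(evaluate(pluralist, change), 2))      # 1.97
```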
We can imagine a similar set-up with the dimensions being policies rather than values per se, with each axiology assigning a weight to how much it wants or doesn't want each policy. This is the framework that Robin Hanson suggested in his post, "Policy Tug-O-War." The figure provides a graphical illustration of a compromise in this setting.
Adrian Hutter suggested an extension to this formalism: The length of each vector could represent the number of people who hold a given axiology. Or, I would add, in the case of power-weighted compromise, the length could represent the power of the faction. Would the sum of the axiology vectors with power-weighted lengths then represent a natural power-weighted compromise solution? Of course, there may be constraints on which vectors are achievable given resources and other limitations of physical reality.
In some cases, summing axiology vectors seems to give the right solution. For example, consider two completely orthogonal values: Paperclips (x axis) and staples (y axis). Say a paperclip maximizer has twice as much power as its competitor staple maximizer in competing to control Earth's future light cone. The sum of their vectors would be 2 * (1,0) + (0,1) = (2,1). That means 2/3 of resources go to paperclips and 1/3 to staples, just as we might expect from a power-weighted compromise.5
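A short sketch of the power-weighted sum for the paperclip/staple example above; representing each faction's power as a scalar weight on its unit axiology vector follows the text.

```python
def power_weighted_sum(factions):
    """Sum power * axiology over factions; the components give each value's share of resources."""
    dims = len(next(iter(factions.values()))[1])
    total = [0.0] * dims
    for power, axiology in factions.values():
        for i in range(dims):
            total[i] += power * axiology[i]
    return total

# Dimensions: (paperclips, staples); powers follow the 2:1 ratio in the example above.
factions = {
    "paperclip maximizers": (2.0, (1.0, 0.0)),
    "staple maximizers": (1.0, (0.0, 1.0)),
}

combined = power_weighted_sum(factions)
shares = [x / sum(combined) for x in combined]
print(combined)   # [2.0, 1.0]
print(shares)     # roughly [0.667, 0.333]: two thirds of resources to paperclips, one third to staples
```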
However, imagine now that there's a design for staples that allows paperclips to be fit inside them. This means the staple maximizers could, if they wanted, create some paperclips as well, although by default they wouldn't bother to do so. Assume there is no such design to fit staples inside paperclips. Now the staple maximizers have extra bargaining leverage: "If we get more than 1/3 of resources for staples," they can say, "we'll put some paperclips inside our staples, which will make both of us better off." Here the compromise outcome is based not just on pure power ratios (i.e., probabilities of winning control in a fight) but also on bargaining leverage. This is discussed more in "Appendix: Dividing the compromise pie."
I think advancing compromise is among the most important projects that we who want to reduce suffering can undertake. A future without compromise could be many times worse than a future with it. This is also true for other value systems as well, especially those that are risk-averse. Thus, advancing compromise is a win-win(-win-win-win-...) project that many of us may want to work on together. It seems like a robustly positive undertaking, squares with common sense, and is even resilient to changes in our moral outlook. It's a form of "pulling the rope sideways" in policy tug-o-wars.
This essay was inspired by a discussion with Lucius Caviola. It draws heavily from the ideas of Carl Shulman. Also influential were writings by Jonah Sinick and Paul Christiano. An email from Pablo Stafforini prompted the section on epistemic convergence.
Consider several factions competing in a winner-takes-all race to control the future light cone. Let p_i denote the probability that faction i wins. Normalize the utility values for each faction so that a utility of 0 represents losing the light-cone race and a utility of 100 represents winning it. Absent compromise, faction i's expected utility is 100 * p_i. Thus, in order for i to be willing to compromise, the compromise must offer it at least 100 * p_i, because otherwise it could do better by continuing to fight on its own. Compromise allocations that respect this "individual rationality" requirement are called imputations.
We can see the imputations for the case of deep ecologists and animal welfarists in Figure 2. Absent bargaining, each side gets an expected utility of 100 * 0.5 = 50 by fighting for total control. Bargaining would allow each side to get more than half of what it wants, and the excess value to each side constitutes the gain from compromise.
Utility is transferable if it can be given to another party without losing any of the value. We can see in Figure 2 that utility is not completely transferable between deep ecologists and animal welfarists, because the Pareto frontier is curved. If the animal welfarists give up 1 unit of expected utility, the deep ecologists may not gain 1 whole unit. Utility would be transferable in the bargaining situation if the Pareto frontier between the two dashed black lines were straight.
In the special case when utility is transferable, we can use all the mechanics of cooperative game theory to analyze the situation. For example, the Shapley value gives an answer to the problem of what the exact pie-slicing arrangement should look like, at least if we want to satisfy the four axioms that uniquely specify the Shapley division.
It's an interesting theorem that if a cooperative game is convex, then all of the players want to work together (i.e., the core is non-empty and also unique), and the Shapley value gives "the center of gravity" of the core. Alas, as far as I can tell, real-world situations will not always be convex.
Many times the utility gains from compromise are not completely transferable. We saw this in Figure 2 through the fact that the Pareto frontier is curved. Define u := (animal-welfarist expected utility) - 50, i.e., the excess expected utility above no compromise, and v := (deep-ecologist expected utility) - 50. The (u,v) points that lie within the dotted lines and the curved red line are the potential imputations, i.e., ways to divide the gains from trade. That utility is not transferable in this case means we can't represent the Pareto frontier by a line u + v = constant.
However, we can use another approach, called the Nash bargaining game. In Nash's solution, the bargaining point is that which maximizes u * v. Figure 304.1 (p. 304) of A Course in Game Theory by Osborne and Rubinstein illustrates this graphically as the intersection of lines u * v = constant with the set of imputations, and I've drawn a similar depiction in Figure 3:
Animal welfarists' expected utility (AW) | Deep ecologists' expected utility (DE) | AW - 20 | DE - 80 | (AW - 20) * (DE - 80) |
--- | --- | --- | --- | --- |
20 | 96 | 0 | 16 | 0 |
22 | 95.16 | 2 | 15.16 | 30.32 |
24 | 94.24 | 4 | 14.24 | 56.96 |
26 | 93.24 | 6 | 13.24 | 79.44 |
28 | 92.16 | 8 | 12.16 | 97.28 |
30 | 91 | 10 | 11 | 110 |
32 | 89.76 | 12 | 9.76 | 117.12 |
34 | 88.44 | 14 | 8.44 | 118.16 |
36 | 87.04 | 16 | 7.04 | 112.64 |
38 | 85.56 | 18 | 5.56 | 100.08 |
40 | 84 | 20 | 4 | 80 |
42 | 82.36 | 22 | 2.36 | 51.92 |
44 | 80.64 | 24 | 0.64 | 15.36 |
44.72 | 80 | 24.72 | 0 | 0 |
The maximum of the product in the last column occurs around (34, 88), which will be the Nash compromise arrangement. The animal welfarists get a surplus of roughly 34 - 20 = 14, and the deep ecologists roughly 88 - 80 = 8.
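For readers who want to check this numerically, here is a minimal sketch that reproduces the table's calculation. Two assumptions are inferred from the table rather than stated in the text: the Pareto frontier has the form DE = 100 - AW^2/100 (which matches every row above), and the disagreement point is (20, 80).

```python
import numpy as np

# Assumed frontier and disagreement point, inferred from the table above.
aw_eu = np.linspace(20, 44.72, 100_000)        # animal welfarists' expected utility
deep_eu = 100 - aw_eu**2 / 100                 # deep ecologists' expected utility on the frontier

nash_product = (aw_eu - 20) * (deep_eu - 80)   # each side's surplus over its disagreement payoff
best = np.argmax(nash_product)

print(f"Nash bargaining point: AW ~ {aw_eu[best]:.2f}, DE ~ {deep_eu[best]:.2f}, "
      f"product ~ {nash_product[best]:.2f}")
# Under these assumptions the maximum lies near AW = 33.3, DE = 88.9, consistent with the
# (34, 88) entry highlighted above (the table only samples AW in steps of 2).
```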
It's worth noting that in fact any of the divisions in the table is a Nash equilibrium, because given the demand of one faction for a share of the pie, the other faction can only either (1) take less, which it wouldn't want to do, or (2) demand more and thereby ruin the compromise, leaving it with no surplus. Thus, the bargaining solution allows us to narrow down to a particular point among the infinite set of Nash equilibria.
The bargaining game contains other solutions besides Nash's that satisfy different intuitive axioms.
The bargaining problem with more than two players becomes more complicated. In "A Comparison of Non-Transferable Utility Values," Sergiu Hart identifies three different proposals for dividing the compromise pie -- Harsanyi (1963), Shapley (1969), and Maschler and Owen (1992) -- each of which may give different allocations. Each proposal has its own axiomatization (see endnote 1 of Hart's paper), so it's not clear which of these options would be chosen. Perhaps one would emerge as a more plausible Schelling point than the others as the future unfolds.
| | Hawk | Dove | Own-cooperator |
| --- | --- | --- | --- |
Hawk | -1, -1 | 2, 0 | -1, -1 |
Dove | 0, 2 | 1, 1 | 0, 2 |
Own-cooperator | -1, -1 | 2, 0 | 1, 1 |
Here, Own-cooperator is an ESS using the first condition of Maynard Smith and Price: For the strategy S = Own-cooperator, for any T in {Hawk, Dove}, playing S against S is strictly better than playing T against S.
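As a quick check, here is a minimal sketch that verifies that ESS condition directly against the payoff matrix above (the payoff values are copied from the table; nothing else is added).

```python
# payoff[row][col] = payoff to the row strategy when playing against the column strategy
payoff = {
    "Hawk":           {"Hawk": -1, "Dove": 2, "Own-cooperator": -1},
    "Dove":           {"Hawk": 0,  "Dove": 1, "Own-cooperator": 0},
    "Own-cooperator": {"Hawk": -1, "Dove": 2, "Own-cooperator": 1},
}

S = "Own-cooperator"
for T in ["Hawk", "Dove"]:
    # First ESS condition: E(S, S) > E(T, S), i.e., S does strictly better against S than T does.
    assert payoff[S][S] > payoff[T][S], f"{T} would not be repelled"
    print(f"E({S}, {S}) = {payoff[S][S]} > E({T}, {S}) = {payoff[T][S]}")
```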
The Importance of Wild-Animal Suffering
I personally believe that most animals (except maybe those that live a long time, like >3 years) probably endure more suffering than happiness overall, because I would trade away several years of life to avoid the pain of the average death in the wild. And this hypothetical tradeoff assumes that animal lives prior to death are net positive (which is not obvious, in view of cold, hunger, disease, fear of predators, and all the rest).
However, this belief of mine is somewhat controversial. I think the claim of net expected suffering in nature needs only a weaker assertion: namely, that almost all of the expected happiness and suffering in nature come from small animals (e.g., minnows and insects). The adults of these species live at most a few years, often just a few months or weeks, so it's even harder in these cases for the happiness of life to outweigh the pain of death. Moreover, almost all the babies of these species die (possibly painfully) after just a few days or weeks of being born, because most of these species are "r-selected" -- see Type III in this chart.
"The total amount of suffering per year in the natural world is beyond all decent contemplation. During the minute it takes me to compose this sentence, thousands of animals are being eaten alive; others are running for their lives, whimpering with fear; others are being slowly devoured from within by rasping parasites; thousands of all kinds are dying of starvation, thirst and disease."
-- Richard Dawkins, River Out of Eden[Dawkins]
"Many humans look at nature from an aesthetic perspective and think in terms of biodiversity and the health of ecosystems, but forget that the animals that inhabit these ecosystems are individuals and have their own needs. Disease, starvation, predation, ostracism, and sexual frustration are endemic in so-called healthy ecosystems. The great taboo in the animal rights movement is that most suffering is due to natural causes."
-- Albert, a fictional dog in philosopher Nick Bostrom's "Golden"[Bostrom-Alfred]
"The moralistic fallacy is that what is good is found in nature. It lies behind the bad science in nature-documentary voiceovers: lions are mercy-killers of the weak and sick, mice feel no pain when cats eat them, dung beetles recycle dung to benefit the ecosystem and so on."
-- Steven Pinker[Pinker]
"People who accuse us of putting in too much violence, [should see] what we leave on the cutting-room floor."
-- David Attenborough, speaking about his nature documentaries[Attenborough]
"In sober truth, nearly all the things which men are hanged or imprisoned for doing to one another, are nature's every day performances. [...] The phrases which ascribe perfection to the course of nature can only be considered as the exaggerations of poetic or devotional feeling, not intended to stand the test of a sober examination. No one, either religious or irreligious, believes that the hurtful agencies of nature, considered as a whole, promote good purposes, in any other way than by inciting human rational creatures to rise up and struggle against them."
-- John Stuart Mill, "On Nature"[Mill]
Animal activists typically focus their efforts on areas where humans directly interact with members of other species, such as on "factory farms," in laboratory experiments, and, to a much lesser degree, in zoos, circuses, rodeos, and the like.
Rarely discussed is the topic of animal suffering in the wild, even in the academic literature, though there have been notable exceptions.[exceptions] In this piece, I emphasize that the number of wild animals on which humans have an impact is simply too large for animal advocates to ignore. Intense suffering is a regular feature of life in the wild that demands, perhaps not quick-fix intervention, but at least long-term research into the welfare of wild animals and technologies that might one day allow humans to improve it. I conclude by encouraging animal advocates to focus their efforts on promoting concern about wild-animal suffering among other activists, academics, and others who would be sympathetic -- both to encourage research on the issue and to ensure that our descendants use their advanced technologies in ways that alleviate wild-animal suffering rather than inadvertently multiply it.
The scale of animal suffering at human hands is vast, and animal advocates are right to be appalled by its magnitude. However, the numbers of animals that live in the wild are staggeringly larger. For rough population estimates, see my "How Many Wild Animals Are There?"[Tomasik-numbers]
Like their domestic counterparts, animals in the wild have rich emotional lives.[emotions] Unfortunately, many of these emotions are intensely painful, often needlessly so. And while "Nature, red in tooth and claw" is widely known as a platitude, its visceral meaning can often be overlooked. Below I review some details of wild-animal suffering, perhaps in a manner similar to the way in which animal advocates decry acts of cruelty by humans.
When people imagine suffering in nature, perhaps the first image that comes to mind is that of a lioness hunting her prey. Christopher McGowan, for instance, vividly describes the death of a zebra:
The lioness sinks her scimitar talons into the zebra's rump. They rip through the tough hide and anchor deep into the muscle. The startled animal lets out a loud bellow as its body hits the ground. An instant later the lioness releases her claws from its buttocks and sinks her teeth into the zebra's throat, choking off the sound of terror. Her canine teeth are long and sharp, but an animal as large as a zebra has a massive neck, with a thick layer of muscle beneath the skin, so although the teeth puncture the hide they are too short to reach any major blood vessels. She must therefore kill the zebra by asphyxiation, clamping her powerful jaws around its trachea (windpipe), cutting off the air to its lungs. It is a slow death. If this had been a small animal, say a Thomson's gazelle (Gazella thomsoni) the size of a large dog, she would have bitten it through the nape of the neck; her canine teeth would then have probably crushed the vertebrae or the base of the skull, causing instant death. As it is, the zebra's death throes will last five or six minutes.[McGowan, pp. 12-13]
Some predators kill rather quickly, such as constrictor snakes that cut off their victims' air flow and induce unconsciousness within a minute or two,[eaten-alive] while others impose a more protracted death, such as hyenas that tear off chunks of ungulate flesh one bite at a time.[Kruuk] Wild dogs disembowel their prey,[McGowan, p. 22] venomous snakes cause internal bleeding and paralysis over the course of several minutes,[McGowan, p. 49] and crocodiles drown large animals in their jaws.[McGowan, p. 43]
One snake owner's guide explains, "Live mice will fight for their lives when they are seized, and will bite, kick and scratch for as long as they can."[Flank] Once captured, "The snake drenches the prey with saliva and eventually pulls it into the esophagus. From there, it uses its muscles to simultaneously crush the food and push it deeper into the digestive tract, where it is broken down for nutrients."[Perry]
Prey may not die immediately after being swallowed, as is illustrated by the fact that some poisonous newts, after ingestion by a snake, excrete toxins to kill their captor so that they can crawl back out of its mouth.[McGowan, p. 59] And regarding housecats, Bob Sallinger of the Audubon Society of Portland remarked, "People who are appalled by the indiscriminate killing of wildlife by mechanisms such as leg-hold traps should recognize that the pain and suffering caused by cat predation is not dissimilar and the impacts of cat predation dwarf the impacts of trapping."[Sallinger]
It's possible that some animals don't suffer intensely from predation in cases where endorphins kick in strongly enough. Similarly, humans sometimes don't feel pain immediately upon severe injury.[Wall] But in many instances of predation, prey continue struggling violently against their aggressors. For instance, in this video, the warthog screams for ~2.5 minutes as it's suffocated. Moreover, insofar as endorphins do sometimes reduce the painfulness of death, the same argument should apply for brutal slaughter of farm animals by humans, yet most animal-welfare scientists consider bad slaughter methods to be extremely painful.
Fear of predators produces not only immediate distress, but it may also cause long-term psychological trauma. In one study of anxiolytics, researchers exposed mice to a cat for five minutes and observed subsequent reactions. They found "that this animal model of exposure of mice to unavoidable predatory stimuli produces early cognitive changes analogous to those seen in patients with acute stress disorder (ASD)."[ElHagePeronnyGriebelBelzung] A follow-up study found long-term impacts on the mice's brains: "predatory exposure induced significant learning disabilities in the radial maze (16 to 22 days poststressor) and in the spatial configuration of objects recognition test (26 to 28 days poststressor). These findings indicate that memory impairments may persist for extended periods beyond a predatory stress."[ElHageGriebelBelzung] Similarly, Phillip R. Zoladz exposed rats to unavoidable predators and other anxiety-causing conditions to "produce changes in rat physiology and behavior that are comparable to the symptoms observed in PTSD patients."[Zoladz] And in a review article, Rianne Stam explained:
Animal models that are characterised by long-lasting conditioned fear responses as well as generalised behavioural sensitisation to novel stimuli following short-lasting but intense stress have a phenomenology that resembles that of PTSD in humans. [...] Weeks to months after the trauma, treated animals on average also show a sensitisation to novel stressful stimuli of neuroendocrine, cardiovascular and gastrointestinal motility responses as well as altered pain sensitivity and immune function.[Stam]
Even for those prey that haven't had a traumatic run-in with a predator, the "ecology of fear" that predators create can be very distressing: "In studies with elk, scientists have found that the presence of wolves alters their behavior almost constantly, as they try to avoid encounters, leave room for escape and are constantly vigilant."[Stauth]
One can argue that evolution should avoid making animal lives excessively horrifying for extended periods prior to death, because doing so might, at least in more complex species, induce PTSD, depression, or other debilitating side effects. Of course, we see empirically that evolution does induce such disorders when traumatic incidents happen, like exposure to a predator. But there's probably some kind of reasonable bound on how bad these can be most of the time if animals are to remain functional. Death itself is a different matter because, once it reaches the point of inevitability, evolutionary pressures don't constrain the emotional experience. Death can be as good as painless (for a few lucky animals) or as bad as torture (for many others). Evolution has no reason to prevent death from feeling unbearably awful.[Dawkins]
Of course, predation is not the only way in which organisms die painfully.
Animals are also stricken by diseases and parasites, which may induce listlessness, shivering, ulcers, pneumonia, starvation, violent behavior, or other gruesome symptoms over the course of days or weeks leading up to death. Avian salmonellosis is just one example:
Signs range from sudden death to gradual onset of depression over 1 to 3 days, accompanied by huddling of the birds, fluffed-up feathers, unsteadiness, shivering, loss of appetite, markedly increased or absence of thirst, rapid loss of weight, accelerated respiration and watery yellow, green or blood-tinged droppings. The vent feathers become matted with excreta, the eyes begin to close and, immediately before death, some birds show apparent blindness, incoordination, staggering, tremors, convulsions or other nervous signs.[Salmonellosis]
Still other animals die of accidents, dehydration during a summer drought, or lack of food during the winter. For instance, 2006 was a harsh year on bats in Placerville, California:
"You can see their ribs, their backbones, and (the area) where the intestine and the stomach are is completely sunk through to the back," said Dharma Webber, founder of the California Native Bat Conservancy. [...] She said emerging mosquitoes aren't enough to feed the creatures. "It would be like us eating a little piece of popcorn here or there," she said.[bats]
(Of course, when the bats do have food, this isn't good news for their prey....)
Even ice storms can be fatal: "Birds unable to find a sheltered perch during the storm may have their feet frozen to a branch or their wings covered in ice making them unable to fly. Grouse buried in snow drifts are often encased by the ice layer and suffocate."[Heidorn]
While death may often constitute the peak of suffering during an animal's life, day-to-day existence isn't necessarily pleasant either. Unlike most humans in the industrialized world, wild animals don't have immediate access to food whenever they become hungry. They must constantly seek out water and shelter while remaining on the lookout for predators. Unlike us, most animals can't go inside when it rains or turn on the heat when winter temperatures drop far below their usual levels. In summary:
It is often assumed that wild animals live in a kind of natural paradise and that it is only the appearance and intervention of human agencies that bring about suffering. This essentially Rousseauian view is at odds with the wealth of information derived from field studies of animal populations. Scarcity of food and water, predation, disease and intraspecific aggression are some of the factors which have been identified as normal parts of a wild environment which cause suffering in wild animals on a regular basis.[UCLA, p. 24]
And while many animals appear to endure such conditions rather calmly, this doesn't necessarily mean they aren't suffering.[BourneEtAl] Sick and injured members of a prey species are the easiest to catch, so predators deliberately target these individuals. As a consequence, those prey that appear sick or injured will be the ones killed most often. Thus, evolutionary pressure pushes prey species to avoid drawing attention to their suffering.[Nuffield, ch. 4.12, p. 66]
Based on studies of stress-hormone levels in domestic and wild animals, Christie Wilcox[Wilcox] concluded that "if we follow the guidelines of care that provide food, water, comfort, and necessary items for behavioral expression, domesticated animals are not only likely to be as happy as their wild relatives, they're probably happier." She also observed:
So the real question becomes whether a domesticated or captive animal is more, less, or as happy in the moment as its wild counterpart. There are a few key conditions that are classically thought to lead to a "happy" animal by reducing undue stress. These are the basis for most animal cruelty regulations, including those in the US and UK. They include that animals have the 'rights' to:
- Enough food and water
- Comfortable conditions (temperature, etc)
- Expression of normal behavior
When it comes to wild animals, though, only the last is guaranteed. They have to struggle to survive on a daily basis, from finding food and water to another individual to mate with. They don't have the right to comfort, stability, or good health. [...] By the standards our governments have set, the life of a wild animal is cruelty.
In nature, the most populous animals are probably the ones that are generally worst off. Small mammals and birds have adult lifespans of at most one to three years before they face a painful death. And many insects count their time on Earth in weeks rather than years -- for instance, just 2-4 weeks for the horn fly.[Cumming] I personally would rather not exist than be born as an insect, struggle to navigate the world for a few weeks, and then die of dehydration or be caught in a spider's web. Worse still might be finding myself entangled in an Amazonian-ant "torture rack" trap for 12 hours,[BBC] or being eaten alive over the course of weeks by an Ichneumon wasp.[Gould, pp. 32-44] (That said, whether caterpillars eaten by Ichneumon wasps feel pain during the experience is unclear.)
It's true that scientists remain uncertain whether insects experience pain in a form that we would consider conscious suffering.[insect-pain] However, the fact that there remains serious debate on the issue suggests that we should not rule out the possibility. And seeing as insects number 10^18,[Williams] with the number of copepods in the ocean of a similar magnitude,[SchubelButman] the mathematical "expected value" (probability times amount) of their suffering is vast. I should note that the force of this point would be lessened if, as may be the case, an animal's "intensity" or "degree" of emotional experience depends to some rough extent on the amount of neural tissue it has devoted to pain signals.
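To illustrate the shape of that expected-value argument, here is a minimal sketch with placeholder numbers. Only the 10^18 insect count comes from the text; the sentience probability and intensity weight are invented purely for illustration.

```python
# Purely illustrative numbers: the point is the structure of the calculation,
# not any particular estimate.
n_insects = 1e18            # rough global insect count cited in the text
p_sentient = 0.1            # assumed probability that insects can suffer (illustrative)
weight_vs_human = 1e-4      # assumed per-individual intensity weight (illustrative)

expected_human_equivalents = n_insects * p_sentient * weight_vs_human
print(f"Expected 'human-equivalent' sufferers among insects: {expected_human_equivalents:.1e}")
# ~1e13 under these assumptions -- several orders of magnitude more than the human
# population, which is why the expected value stays large even after heavy discounts
# for uncertain sentience and small brains.
```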
Tables of animal lifespans typically show durations of survival by adult members of a species. However, most individuals die much sooner, before reaching maturity. This is a simple consequence of the fact that females give birth to far more offspring than can survive to reproduce in a stable population. For instance, while humans can produce only one child per reproductive season (excepting twins), the number is 1-22 offspring for dogs (Canis familiaris), 4-6 eggs for the starling (Sturnus vulgaris), 6,000-20,000 eggs for the bullfrog (Rana catesbeiana), and 2 million eggs for the scallop (Argopecten irradians).[SolbrigSolbrig, p. 37] Take a look at this figure from Thomas J. Herbert's article[Herbert] on r and K selection illustrating extremely high infant mortality for "r strategists." Most small animals like minnows and insects are r strategists.
Granted, it's unclear whether all of these species are sentient -- and even more regarding that fraction of the eggs that fails to hatch (see the next section) -- but again, in expected-value terms, the amount of expected suffering is enormous.
This strategy of "making lots of copies and hoping a few come out" may be perfectly sensible from the standpoint of evolution, but the cost to the individual organisms is tremendous. Matthew Clarke and Yew-Kwang Ng conclude from an analysis of the welfare implications of population dynamics that "The number of offspring of a species that maximizes fitness may lead to suffering and is different from the number that maximizes welfare (average or total)."[ClarkeNg, sec. 4]
In a related paper, "Towards Welfare Biology: Evolutionary Economics of Animal Consciousness and Suffering," Ng concludes from the excess of offspring over adult survivors: "Under the assumptions of concave and symmetrical functions relating costs to enjoyment and suffering, evolutionary economizing results in the excess of total suffering over total enjoyment."[Ng, p. 272] (Update: In 2019, Zach Groff and Ng showed that this theorem contains a mathematical error, and a stronger starting premise is needed to conclude that suffering predominates.[Groff and Ng] I haven't looked into this issue in depth, but you can see some initial thoughts of mine here and here.)
In his famous paper, "Animal liberation and environmental ethics: Bad marriage, quick divorce,"[Sagoff] Mark Sagoff quotes the following passage from Fred Hapgood:[Hapgood]
All species reproduce in excess, way past the carrying capacity of their niche. In her lifetime a lioness might have 20 cubs; a pigeon, 150 chicks; a mouse, 1000 kits; a trout, 20,000 fry; a tuna or cod, a million fry or more; [...] and an oyster, perhaps a hundred million spat. If one assumes that the population of each of these species is, from generation to generation, roughly equal, then on average only one offspring will survive to replace each parent. All the other thousands and millions will die, one way or another.
Sagoff goes on to say: "The misery of animals in nature--which humans can do much to relieve--makes every other form of suffering pale in comparison. Mother Nature is so cruel to her children she makes Frank Perdue look like a saint."
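A minimal sketch of the arithmetic implicit in Hapgood's passage, using his lifetime offspring counts and the simplifying assumption that roughly two of a mother's offspring survive to reproduce in a stable population (one to "replace" each parent):

```python
# Lifetime offspring counts quoted from Hapgood above.
lifetime_offspring = {
    "lioness": 20,
    "pigeon": 150,
    "mouse": 1_000,
    "trout": 20_000,
    "cod or tuna": 1_000_000,
    "oyster": 100_000_000,
}

survivors_per_mother = 2  # simplifying assumption for a roughly stable population

for species, n in lifetime_offspring.items():
    dying_young = n - survivors_per_mother
    print(f"{species}: ~{dying_young:,} of {n:,} offspring "
          f"({100 * dying_young / n:.1f}%) die before reproducing")
```

Even for the lioness, 90% of offspring die before reproducing under this assumption; for the r-selected species, the fraction is indistinguishable from 100%.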
The previous section explained that in r-selected species, parents may have hundreds or even tens of thousands of offspring, and almost all of these die shortly after birth.
But some questions remain. What fraction of these offspring were sentient at the time of death, and what fraction merely died as unconscious eggs or larvae?
EFSA's "Aspects of the biology and welfare of animals used for experimental and other scientific purposes" (pp. 37-42) explores when fetuses of various species begin to feel conscious pain.[EFSA] The paper notes that the age of onset of consciousness varies based on whether a species is precocial (well developed at birth, such as horses) or altricial (still developing at birth, such as marsupials). Precocial animals are more likely to feel pain at earlier ages. Also relevant is whether the species is viviparous (having live birth) or oviparous (giving birth through eggs). Viviparous animals have greater need to inhibit fetal consciousness during development in order to prevent injury to the mother and siblings. Oviparous animals that are constrained by shells have less need for inhibition of awareness before birth. (p. 38)
For this reason, the report suggests: "If awareness is the criterion for protection, birds, reptiles, amphibians, fish and cephalopods may, therefore, be more obviously in need of protection pre-hatching than mammals are in need of protection pre-partum." (p. 38) For example: "Sensory and neural development in a precocial bird such as the domestic chick is very well advanced several days before hatching. Controlled movements and coordinated behavioural and electrophysiological evoked responses to tactile, auditory and visual stimuli appear three or four days before hatching occurs after 21 days of incubation (Broom, 1981)." (p. 39) In contrast: "Even though the mammalian fetus can show physical responses to external stimuli, the weight of present evidence suggests that consciousness does not occur in the fetus until it is delivered and starts to breathe air." (p. 42)
Thus, it seems clear that many animals are able to suffer by the time of birth if not before. The EFSA report elaborates:
The stage of development at which this risk [of suffering] is sufficient for protection to be necessary is that at which the normal locomotion and sensory functioning of an individual independent of the egg or mother can occur. For air-breathing animals this time will not generally be later than that at which the fetus could survive unassisted outside the uterus or egg. For most vertebrate animals, the stage of development at which there is a risk of poor welfare when a procedure is carried out on them is the beginning of the last third of development within the egg or mother. For a fish, amphibian, cephalopod, or decapod it is when it is capable of feeding independently rather than being dependent on the food supply from the egg. [...] (p. 3)
Most amphibians and fish have larval forms which are not well developed at hatching but develop rapidly with experience of independent life[.] Those fish and amphibians that are well developed at hatching or viviparous birth and all cephalopods, since these are small but well developed at hatching, will have had a functioning nervous system and the potential for awareness for some time before hatching. (p. 38)
Another consideration suggestive of pain before birth is the fact that many oviparous vertebrates can hatch early in response to environmental stimuli, including vibrations that signal a predator.
For example, for skink eggs: "Simulated predation experiments in the field induced hatching in both nest sites (horizontal rock crevices) and in eggs displaced from nest sites. The hatching process was explosive: early hatching embryos hatched in seconds and sprinted from the egg an average of 40 cm as they hatched."[DoodyPaull] Early hatching has also been documented for amphibians, fish, and invertebrates.[DoodyPaull]
Young fish are already intelligent enough to be predators a few days after birth: "In the Sacramento-San Joaquin Estuary, young striped bass begin feeding on small crustacean zooplankton a few days after they hatch (Eldridge et al. 1982)."[StevensEtAl, pp. 21-22] Unfortunately, huge numbers of striped bass die young, given that "the average female striped bass produc[es] nearly a half million eggs".[StevensEtAl, p. 20]
These points suggest that a significant fraction of the large numbers of offspring born to r-selected species may very well be conscious during the pain of their deaths after a few short days, or even hours, of life.
There is a danger in extrapolating the welfare of wild animals from our own imagination of how we would feel in the situation. We can imagine immense discomfort were we to sleep through a cold winter night's storm with only a sweatshirt to keep us warm, but many animals have better fur coats and can often find some sort of shelter. More generally, it seems unlikely that species would gain an adaptive advantage by feeling constant hardship, since stress does entail a metabolic cost.[Ng] Also, r-selected animals might suffer less from a given injury than long-lived animals would because r-selected creatures have less to lose by taking big short-term risks.[Tomasik-short-lived]
That said, we should also be wary of underestimating the extent and severity of wild-animal suffering due to our own biases. You, the reader, are probably in the comfort of a climate-controlled building or vehicle, with a relatively full stomach, and without fear of attack. Most of us in the industrialized West go through life in a relatively euthymic state, and it's easy to assume that the general pleasantness with which life greets us is shared by most other people and animals. When we think about nature, we may picture chirping songbirds or frolicking gazelles, rather than deer having their flesh chewed off while conscious or immobilized raccoons afflicted by roundworms. And of course, all of the previous examples, insofar as they involve large land animals, reflect my human tendency toward the "availability heuristic": In fact, the most prevalent wild animals of all are small organisms, many ocean-dwelling. When we think "wild animals," we should (if we adopt the expected-value approach to uncertainty about sentience) picture ants, copepods, and tiny fish, rather than lions or gazelles.
People may not accurately assess at a single instant how they'll feel overall during a longer period of time.[KahnemanSugden] They often exhibit "rosy prospection" toward future events and "rosy retrospection" about the past, in which they assume that their previous and future levels of wellbeing were and will be better than what's reported at the time of the experiences.[MitchellThompson] Moreover, even when organisms do correctly judge their hedonic levels, they often show a "will to live" quite apart from pleasure or pain. Animals that, in the face of lives genuinely not worth living, decide to end their existence tend not to reproduce very successfully.
Ultimately, though, regardless of exactly how good or bad we assess life in the wild to be on balance, it remains undeniable that many animals in nature endure some dreadful experiences.
Finally, there are some claims that animals do commit suicide, though others are doubtful. I'm personally skeptical because there aren't lots of well documented cases of animal suicide, and it's easy to accumulate folklore about phenomena that aren't real. That said, I don't doubt that some animals act differently when suffering an emotional loss.
Why, then, is the suffering of wild animals not a top priority for animal advocates? One reason is philosophical: Some feel that while humans have duties to treat well the animals that they use or live with, they have no responsibility to those outside their sphere of interaction. I find this unsatisfying; if we really care about animals because we don't want fellow organisms to suffer brutally -- not just because we want to "keep our moral house clean" -- then it shouldn't matter whether we have a personal connection with wild animals or not.
Other philosophers agree with this but continue to defend human inaction by claiming that people are ultimately helpless to change the situation. When asked whether we should stop lions from eating gazelles, Peter Singer replied:
[...] for practical purposes I am fairly sure, judging from man's past record of attempts to mold nature to his own aims, that we would be more likely to increase the net amount of animal suffering if we interfered with wildlife, than to decrease it. Lions play a role in the ecology of their habitat, and we cannot be sure what the long-term consequences would be if we were to prevent them from killing gazelles. [...] So, in practice, I would definitely say that wildlife should be left alone.[Singer]
I would point out in response to Singer that most human interventions have not been designed to improve wild-animal welfare, and even so, I suspect that many of them have decreased wild-animal suffering on balance by reducing habitats.
In a similar vein as Singer, Jennifer Everett suggested that consequentialists may endorse evolutionary selection because it eliminates deleterious genetic traits:
[...] if propagation of the "fittest" genes contributes to the integrity of both predator and prey species, which is good for the predator/prey balance in the ecosystem, which in turn is good for the organisms living in it, and so on, then the very ecological relationships that holistic environmentalists regard as intrinsically valuable will be valued by animal welfarists because they conduce ultimately, albeit indirectly and via complex causal chains, to the well-being of individual animals.[Everett, p. 48]
These authors are right that consideration of long-range ecological side-effects is important. However, it does not follow that humans have no obligations regarding wild animals or that animal supporters should remain silent about nature's cruelty. The next few subsections elaborate on ways in which humans can indeed do something about wild-animal suffering.
I agree that we should be cautious about quick-fix intervention. Ecology is extremely complicated, and humans have a long track record of underestimating the number of unanticipated consequences they will encounter in trying to engineer improvements to nature. On the other hand, there are many instances in which we are already interfering with wildlife in some manner. As Tyler Cowen observed:[Cowen, p. 10]
In other cases we are interfering with nature, whether we like it or not. It is not a question of uncertainty holding us back from policing, but rather how to compare one form of policing to another. Humans change water levels, fertilize particular soils, influence climatic conditions, and do many other things that affect the balance of power in nature. These human activities will not go away any time soon, but in the meantime we need to evaluate their effects on carnivores and their victims.
One such evaluation was actually carried out regarding an Australian government decision to cull overpopulated and starving kangaroos at an Australian Defence Force army base.[ClarkeNg] While admittedly crude and theoretical, the analysis demonstrates that the tools of welfare economics can be combined with the principles of population ecology to reach nontrivial conclusions about how human interference with wildlife affects aggregate animal well-being.
Consider another example. Humans spray 3 billion kilograms of pesticides per year,[Pimentel] and whether or not we think this causes more wild-animal suffering than it prevents, large-scale insecticide use is, to some extent, a fait accompli of modern society. If, hypothetically, scientists could develop ways to make these chemicals act more quickly or less painfully, enormous numbers of insects and larger organisms could be given slightly less agonizing deaths. (Note that pesticides might actually prevent net insect suffering if they reduce insect populations enough, so encouraging humane insecticides is not equivalent to encouraging less pesticide use. Indeed, organic farms may contain high amounts of insect suffering, both because of higher total fauna populations and because organic pest-control methods may be quite painful. I remain very uncertain on this question, though.)[Tomasik-insecticides]
Human changes to the environment -- through agriculture, urbanization, deforestation, pollution, climate change, and so on -- have huge consequences, both negative and positive, for wild animals. For instance, "paving paradise [or, rather, hell?] to put up a parking lot" prevents the existence of animals that would otherwise have lived there. Even where habitats are not destroyed, humans may change the composition of species living in them. If, say, an invasive species has a shorter lifespan and more non-surviving offspring than the native counterpart, the result would be more total suffering. Of course, the opposite could just as easily be the case.
Caring about wild-animal suffering should not be mistaken as general support for environmental preservation; indeed, in some or even many cases, preventing existence may be the most humane option. Consequentialist vegetarians ought not find this line of reasoning unusual: The utilitarian argument against factory farming is precisely that, e.g., a broiler hen would be better off not existing than suffering in cramped conditions for 45 days before slaughter. Of course, even in the calculation of whether to adopt a vegetarian diet, the impacts on animals in the wild can be important and sometimes dominant over the direct effects on livestock themselves.[MathenyChan]
That said, before we become too gung ho about eliminating natural ecosystems, we should also remember that many other humans value wilderness, and it's good to avoid making enemies or tarnishing the suffering-reduction cause by pitting it in direct opposition to other things people care about. In addition, many forms of environmental preservation, especially reducing climate change, may be important to the far future, by improving prospects for compromise among the major world powers that develop artificial general intelligence.
Wild-animal suffering deserves a serious research program, devoted to questions like the following:
Humans presently lack the knowledge and technical ability to seriously "solve" the problem of wild-animal suffering without potentially disastrous consequences. However, this may not be the case in the future, as people develop a deeper understanding of ecology and welfare assessment.
If sentience is not rare in the universe, then the problem of wild-animal suffering extends beyond our planet. If it's improbable that life will evolve the type of intelligence that humans have,[Drake] we might expect that most of the extraterrestrials in existence are at the level of the smallest, shortest-lived creatures on earth. Thus, if humans ever do send robotic probes into space, there might be great benefit in using them to help wild animals on other planets. (One hopes that objections by deep ecologists to intervening in extraterrestrial ecosystems would be overcome.)
However, I should note that faster technological progress in general is not necessarily desirable. Especially in fields like artificial intelligence and neuroscience, faster progress may accelerate risks of suffering of other kinds. As a general heuristic, I think it may be better to wait on developing technologies that unleash vast amounts of new power before humans have the social institutions and wisdom to constrain misuse of this power.
While advanced future technologies could offer promise for helping wild animals, they also carry risks of multiplying the cruelty of the natural world. For instance, it's conceivable that humans could one day spread Earth-like environmental conditions to Mars in the process of "terraforming."[Burton] More speculatively, others have proposed "directed panspermia": dispatching probes into the galaxy to seed other planets with biological material.[Meot-NerMatloff] Post-human computer simulations may become sufficiently accurate that the wild-animal life they contain would consciously suffer. Already we see many simulation models of natural selection, and it's just a matter of time before these are augmented with AI capabilities such that the organisms involved become sentient and literally feel the pain of being injured and killed. Any of these possibilities would have prodigious ethical implications, and I do hope that before undertaking them, future humans consider seriously the consequences of such actions for the creatures involved.
What does all of this imply for the animal-advocacy movement? I think the best first step toward reducing wild-animal suffering that we can take now is to promote general concern for the issue. Causing more people to think and care about wild-animal suffering will hasten developments in research on wild-animal welfare and associated humane technologies, while at the same time helping to ensure that our advanced descendants think cautiously about actions that would create vastly more suffering organisms.
Perhaps finding supporters within the animal-advocacy community would be a good starting point. While some activists oppose all human intervention with the affairs of animals, occasionally even preferring that humans didn't exist, many people who feel humane sympathy for the suffering of members of other species should welcome efforts to prevent cruelty in the wild. It's important to ensure that the animal-rights movement doesn't end up increasing support for wilderness preservation and human non-interference of all kinds. Another potential source of supporters could be people interested in evolution, who recognize what Richard Dawkins has called the "blind, pitiless indifference" of natural selection.[Dawkins, p. 133]
Individuals can do much to raise the issue on their own, such as by
There may be a danger here of raising the wild-animal issue before the general public is ready. Indeed, the cruelty of nature is often used as a reductio by meat-eaters against consequentialist vegetarianism. Suggesting that ethical consideration for animals could require us to expend resources toward long-term research aimed at helping wildlife might turn off entirely people who would otherwise have given some consideration to at least those animals that they affect through dietary choices.[Greger]
I think wild-animal outreach should begin within communities that are most receptive, such as philosophers, animal activists, transhumanists, and scientists. We can plant the seeds of the idea so that it can grow into a component of the animal-rights movement. I also think a "don't spread wild-animal suffering to space" message could appear even in venues like TED or Slate precisely because it's a controversial idea that people haven't heard before. For those audiences, the message would appear in "far mode," wouldn't interfere with audience members' daily lives, and therefore could be entertained with less resistance.
It's true that most people are not in a position to endorse the moral urgency of reducing wild-animal suffering just yet. They may require earlier inferential steps first, such as caring about any non-human animals at all. The animal movement is like a worm: Each body part needs to slowly scooch its way forward to the next step. But the worm's head also needs to point in the right direction. Inspiring greater concern for wild animals among those who are ready for the message is like influencing where the worm's head points.
It's crucial that at some point the animal-rights movement moves beyond farm, laboratory, and companion animals. The scale of brutality in nature is too vast to ignore, and humans have an obligation to exercise their cosmically rare position as both intelligent and empathetic creatures to reduce suffering in the wild as much as they can.
[Dawkins] Dawkins, Richard. River Out of Eden. New York: Basic Books, 1995.
[Bostrom-Alfred] Bostrom, Nick. "Golden." 2004.
[Pinker] Sailer, Steve. "Q&A: Steven Pinker of 'Blank Slate.'" United Press International. 30 Oct. 2002. Accessed 17 Jan. 2014.
[Attenborough] Rustin, Susanna. "David Attenborough: 'I'm an essential evil.'" The Guardian. 21 Oct. 2011. Accessed 9 Jan. 2014.
[Mill] Mill, John Stuart. "On Nature." 1874. In Nature, The Utility of Religion and Theism, Rationalist Press, 1904.
[exceptions] Examples include (1) Sapontzis, Steve F. "Predation." Ethics and Animals 5.2 (1984): 27-38. (2) Naess, Arne. "Should We Try To Relieve Clear Cases of Extreme Suffering in Nature?" Pan Ecology 6.1 (1991). (3) Fink, Charles K. "The Predation Argument." Between the Species 5 (2005).
[Tomasik-numbers] Tomasik, Brian. "How Many Wild Animals Are There?" Essays on Reducing Suffering. 2009.
[emotions] See, for instance, (1) Balcombe, Jonathan. Pleasurable Kingdom: Animals and the Nature of Feeling Good. Palgrave Macmillan, 2006. (2) Bekoff, Marc, ed. The Smile of a Dolphin: Remarkable Accounts of Animal Emotions. Discovery Books, 2000.
[McGowan] McGowan, Christopher. The Raptor and the Lamb: Predators and Prey in the Living World. New York: Henry Holt and Company, 1997.
[eaten-alive] Eaten Alive - The World of Predators. Questacon on Tour.
[Kruuk] Kruuk, H. The Spotted Hyena. Chicago: University of Chicago Press, 1972.
[Flank] Flank, Lenny. "Live Prey vs. Prekill." The Snake: An Owner's Guide To A Happy Healthy Pet. Howell Book House, 1997.
[Perry] Perry, Lacy. "How Snakes Work: Feeding." howstuffworks.com.
[Sallinger] Sallinger, Bob. "Audubon Society Favors Keeping Cats Indoors." The Oregonian. 17 Nov. 2003.
[Wall] Wall, Patrick. Pain: The Science of Suffering. New York: Columbia University Press, 2000.
[ElHagePeronnyGriebelBelzung] El Hage, Wissam, Sylvie Peronny, Guy Griebel, Catherine Belzung. "Impaired memory following predatory stress in mice is improved by fluoxetine." Progress in Neuro-Psychopharmacology & Biological Psychiatry 28 (2004) 123 - 128.
[ElHageGriebelBelzung] El Hage, Wissam, Guy Griebel, and Catherine Belzung. "Long-term impaired memory following predatory stress in mice." Physiology & Behavior 87 (2006) 45 - 50.
[Zoladz] Zoladz, Phillip R. "An ethologically relevant animal model of posttraumatic stress disorder: Physiological, pharmacological and behavioral sequelae in rats exposed to predator stress and social instability." Graduate dissertation, University of South Florida. 2008.
[Stam] Stam, Rianne. "PTSD and stress sensitisation: A tale of brain and body Part 2: Animal models." Neuroscience & Biobehavioral Reviews Volume 31, Issue 4 (2007) 558 - 584.
[Stauth] Stauth, David. "Sharks, wolves and the 'ecology of fear'." 10 Nov. 2010. Accessed 17 March 2013.
[Salmonellosis] "Salmonellosis." Michigan Department of Natural Resources.
[bats] "Continued Rain, Snowpack Leaves Animals Hungry." Associated Press 23 Apr. 2006. CBS 13/UPN 31.
[Heidorn] Heidorn, Keith C. "Ice Storms: Hazardous Beauty." The Weather Doctor. 12 Jan. 1998, revised Dec. 2001.
[UCLA] UCLA Animal Care and Use Training Manual. UCLA Office for the Protection of Research Subjects.
[Nuffield] Nuffield Council on Bioethics. Ethics of Research Involving Animals. May 2005.
[Wilcox] Wilcox, Christie. "Bambi or Bessie: Are wild animals happier?" Scientific American Blogs. 12 April 2011. [For further discussion of this article, see this Felicifia thread. I think Christie understates the brutality of life on factory farms, but her points about wild animals are well taken.]
[BourneEtAl] Bourne, Debra C., Penny Cusdin, and Suzanne I. Boardman, eds. Pain Management in Ruminants. Wildlife Information Network. Mar. 2005.
[Cumming] Cumming, Jeffrey M. "Horn fly Haematobia irritans (L.)." Diptera Associated with Livestock Dung. North American Dipterists Society. 18 May 2006.
[BBC] "Fierce Ants Build 'Torture Rack'." BBC News 23 April 2005.
[Gould] Gould, Stephen Jay. "Nonmoral Nature." Hen's Teeth and Horse's Toes: Further Reflections in Natural History. New York: W. W. Norton, 1994.
[insect-pain] See, for instance, the following review articles: (1) Smith, Jane A. "A Question of Pain in Invertebrates." ILAR Journal 33.1-2 (1991). (2) Tomasik, Brian. "Can Insects Feel Pain?" Essays on Reducing Suffering. 2009.
[Williams] Williams, C. B. Patterns in the Balance of Nature and Related Problems. London: Academic Press, 1964.
[SchubelButman] Schubel, J. R. and Butman, C. A. "Keeping a Finger on the Pulse of Marine Biodiversity: How Healthy Is It?" Pages 84-103 of Nature and Human Society: The Quest for a Sustainable World. Washington, DC: National Academy Press, 1998.
[SolbrigSolbrig] Solbrig, O. T., and Solbrig, D. J. Introduction to Population Biology and Evolution. London: Addison-Wesley, 1979.
[Herbert] Herbert, Thomas J. "r and K selection." Accessed 17 March 2013.
[ClarkeNg] Clarke, Matthew and Ng, Yew-Kwang. "Population Dynamics and Animal Welfare: Issues Raised by the Culling of Kangaroos in Puckapunyal." Social Choice and Welfare 27:2 (pp. 407-22), 2006.
[Ng] Ng, Yew-Kwang. "Towards Welfare Biology: Evolutionary Economics of Animal Consciousness and Suffering." Biology and Philosophy 10.3 (pp. 255-85), 1995.
[Groff and Ng] Groff, Zach and Ng, Yew-Kwang. "Does suffering dominate enjoyment in the animal kingdom? An update to welfare biology". Biology and Philosophy, 2019.
[Sagoff] Sagoff, Mark. "Animal liberation and environmental ethics: Bad marriage, quick divorce." Osgoode Hall Law Journal 22, p. 297 (1984).
[Hapgood] Hapgood, Fred. Why males exist: an inquiry into the evolution of sex. Morrow (1979).
[EFSA] EFSA Scientific Panel on Animal Health and Welfare (AHAW). "Aspects of the biology and welfare of animals used for experimental and other scientific purposes." EFSA Journal 292, 1-136 (2005).
[DoodyPaull] Doody, J. Sean and Paull, Phillip. "Hitting the Ground Running: Environmentally Cued Hatching in a Lizard." Copeia: March 2013, Vol. 2013, No. 1, pp. 160-165.
[StevensEtAl] Stevens, Donald E., David W. Kohlhorst, Lee W. Miller, and D. W. Kelley. "The Decline of Striped Bass in the Sacramento-San Joaquin Estuary, California." Transactions of the American Fisheries Society 114.1 (pp. 12-30), 1985.
[Tomasik-short-lived] Tomasik, Brian. "Fitness Considerations Regarding the Suffering of Short-Lived Animals." Essays on Reducing Suffering. First written: 30 June 2013; last updated: 9 Feb. 2015.
[KahnemanSugden] Kahneman, Daniel and Sugden, Robert. "Experienced Utility as a Standard of Policy Evaluation." Environmental & Resource Economics 32: 161–81 (2005).
[MitchellThompson] Mitchell, T. and Thompson, L. (1994). "A Theory of Temporal Adjustments of the Evaluation of Events: Rosy Prospection and Rosy Retrospection." In C. Stubbart, J. Porac, and J. Meindl, eds., Advances in Managerial Cognition and Organizational Information-Processing, 5 (pp. 85-114). Greenwich, CT: JAI press.
[Singer] Singer, Peter. "Food for Thought." [Reply to a letter by David Rosinger.] New York Review of Books 20.10 (1973).
[Everett] Everett, Jennifer. "Environmental Ethics, Animal Welfarism, and the Problem of Predation: A Bambi Lover's Respect for Nature." Ethics and the Environment 6.1 (2001): 42-67.
[Cowen] Cowen, Tyler. "Policing Nature." 19 May 2001.
[Pimentel] Pimentel, David. "Pesticides and Pest Control." In Peshin, Rajinder and Dhawan, Ashok K., eds. Integrated Pest Management: Innovation-Development Process. Netherlands: Springer, 2009.
[Tomasik-insecticides] Tomasik, Brian. "Humane Insecticides: A Cost-Effectiveness Calculation." Essays on Reducing Suffering. 2009.
[MathenyChan] Matheny, Gaverick and Chan, Kai M. A. "Human Diets and Animal Welfare: The Illogic of the Larder." Journal of Agricultural and Environmental Ethics, 18:6 (pp. 579–94), 2005.
[Broom] Broom, D. M. "Animal Welfare: Concepts and Measurement." Journal of Animal Science 69:10 (pp. 4167-4175), 1991.
[Drake] Estimates of the fraction of planets with life that go on to produce intelligence can be found in the literature on the Drake equation.
[Burton] Burton, Kathleen. "NASA Presents Star-Studded Mars Debate." 25 Mar. 2004.
[Meot-NerMatloff] Meot-Ner, M. and Matloff, G. L. "Directed Panspermia: A Technical and Ethical Evaluation of Seeding the Universe." Journal of the British Interplanetary Society 32 (pp. 419-23), 1979.
[Greger] Greger, Michael. "Why Honey Is Vegan." Satya Sept. 2005.
Risks of Astronomical Future Suffering
It's far from clear that human values will shape an Earth-based space-colonization wave, but even if they do, it seems more likely that space colonization will increase total suffering rather than decrease it. That said, many other people care a lot about humanity's survival and spread into the cosmos, so I think suffering reducers should let others pursue their spacefaring dreams in exchange for stronger safety measures against future suffering. In general, I encourage people to focus on making an intergalactic future more humane if it happens rather than making sure there will be an intergalactic future.
"If we carry the green fire-brand from star to star, and ignite around each a conflagration of vitality, we can trigger a Universal metamorphosis. [...] Because of us [...] Slag will become soil, grass will sprout, flowers will bloom, and forests will spring up in once sterile places.1 [...] If we deny our awesome challenge; turn our backs on the living universe, and forsake our cosmic destiny, we will commit a crime of unutterable magnitude."
--Marshall T. Savage, The Millennial Project: Colonizing the Galaxy in Eight Easy Steps, 1994 (from "Space Quotations")
"Let's pray that the human race never escapes from Earth to spread its iniquity elsewhere."
--C.S. Lewis
Nick Bostrom's "The Future of Human Evolution" describes a scenario in which human values of fun, leisure, and relationships may be replaced by hyper-optimized agents that can better compete in the Darwinian race to control our future light cone. The only way we could avert this competitive scenario, Bostrom suggests, would be via a "singleton," a unified agent or governing structure that could control evolution. Of course, even a singleton may not carry on human values. Many naive AI agents that humans might build may optimize an objective function that humans find pointless. Or even if humans do maintain hands on the steering wheel, it's far from guaranteed that we can preserve our goals in a stable way across major self-modifications going forward. Caspar Oesterheld has suggested that even if a singleton AI emerges on Earth, it might still face the prospect of competition from extraterrestrials, as a result of which, it might feel pressure to quickly build more powerful successor agents for which it would not be able to verify full goal alignment, thereby leading to goal drift.
These factors suggest that even conditional on human technological progress continuing, the probability that human values are realized in the future may not be very large. Carrying out human values seems to require a singleton that's not a blind optimizer, that can stably preserve values, and that is shaped by designers who care about human values rather than selfish gain or something else. This is important to keep in mind when we imagine what future humans might be able to bring about with their technology.
Some people believe that sufficiently advanced superintelligences will discover the moral truth and hence necessarily do the right things. Thus, it's claimed, as long as humanity survives and grows more intelligent, the right things will eventually happen. There are two problems with this view. First, Occam's razor militates against the existence of a moral truth (whatever that's supposed to mean). Second, even if such moral truth existed, why should a superintelligence care about it? There are plenty of brilliant people on Earth today who eat meat. They know perfectly well the suffering that it causes, but their motivational systems aren't sufficiently engaged by the harm they're doing to farm animals. The same can be true for superintelligences. Indeed, arbitrary intelligences in mind-space needn't have even the slightest inklings of empathy for the suffering that sentients experience.
Even if humans do maintain control over the future of Earth-based life, potentially allowing people to realize positive futures that they desire, there are still many ways in which space colonization would likely multiply suffering. Following are some of them.
Humans may colonize other planets, spreading suffering-filled animal life via terraforming. Some humans may use their resources to seed life throughout the galaxy, which some sadly consider a moral imperative.
Given astronomical computing power, post-humans may run various kinds of simulations. These sims may include many copies of wild-animal life, most of which dies painfully shortly after being born. For example, a superintelligence aiming to explore the distribution of extraterrestrials of different sorts might run vast numbers of simulations of evolution on various kinds of planets. Moreover, scientists might run even larger numbers of simulations of organisms-that-might-have-been, exploring the space of minds. They may simulate decillions of reinforcement learners that are sufficiently self-aware as to feel what we consider conscious pain (as well as pleasure).
In addition to simulating more primitive, evolved minds, a superintelligent civilization might simulate myriads of possible future trajectories for itself. In an adversarial situation, artificial intelligences might use (a much improved form of) Monte Carlo tree search (MCTS) to simulate the outcomes of lots of hypothetical interactions during warfare. The details of these battles would need to be simulated for reasons of accuracy, but fine-grained simulations of warfare would have far more moral import than, e.g., the primitive actions in Total War: Rome II that are simulated during execution of a game AI's MCTS algorithm. Of course, it's quite likely that something better than MCTS will be used in the future, but the importance of simulations of some sort to planning seems unlikely to ever go away.
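To make concrete how quickly such planning computations multiply, here is a minimal sketch (not anything from this essay) of a Monte Carlo rollout planner, a much-simplified cousin of MCTS. The simulate_battle dynamics are pure placeholders; the only point is that even one decision already involves a large number of simulated interactions.

```python
import random

def simulate_battle(action, steps=100):
    """Hypothetical stand-in for a fine-grained battle simulation.
    Returns a noisy outcome score; a real planner would model the interacting agents."""
    outcome = 0.0
    for _ in range(steps):
        outcome += random.gauss(action * 0.01, 1.0)  # placeholder dynamics
    return outcome

def choose_action(candidate_actions, rollouts_per_action=100):
    """Pick the action whose simulated outcomes look best on average."""
    best_action, best_value = None, float("-inf")
    for action in candidate_actions:
        value = sum(simulate_battle(action) for _ in range(rollouts_per_action)) / rollouts_per_action
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Even this toy setup runs 10 actions * 100 rollouts * 100 steps = 100,000
# simulated steps for a single decision; a real planner would run vastly more.
print(choose_action(range(10)))
```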
It could be that certain algorithms (say, reinforcement agents) are very useful in performing complex machine-learning computations that need to be run at massive scale by advanced AI. These subroutines might be sufficiently similar to the pain programs in our own brains that we consider them to actually suffer. But profit and power may take precedence over pity, so these subroutines may be used widely throughout the AI's Matrioshka brains.
The range of scenarios that we can imagine is limited, and many more possibilities may emerge that we haven't thought of or maybe can't even comprehend.
If I had to make an estimate now, I would give ~70% probability that if humans or post-humans colonize space, this will cause more suffering than it reduces (ignoring compromise considerations discussed later). By this statement I'm referring only to total suffering, ignoring happiness and other positive values.2
Think about how space colonization could plausibly reduce suffering. For most of those mechanisms, there seem to be counter-mechanisms that will increase suffering at least as much. The following sections parallel those above.
David Pearce coined the phrase "cosmic rescue missions" in referring to the possibility of sending probes to other planets to alleviate the wild extraterrestrial (ET) suffering they contain. This is a nice idea, but there are a few problems.
Contrast this with the possibilities for spreading wild-animal suffering:
Fortunately, humans might not support spreading life that much, though some do. For terraforming, there are survival pressures to do it in the near term, but probably directed panspermia is a bigger problem in the long term. Also, given that terraforming is estimated to require at least thousands of years, while human-level digital intelligence should take at most a few hundred years to develop, terraforming may be a moot point from the perspective of catastrophic risks, since digital intelligence doesn't need terraformed planets.
While I noted that ETs are not guaranteed to be sentient, I do think it's moderately likely that consciousness is fairly convergent among intelligent civilizations. This is based on (a) suggestions of convergent consciousness among animals on Earth and (b) the general principle that consciousness seems to be useful for planning, manipulating images, self-modeling, etc. On the other hand, maybe this reflects the paucity of my human imagination in conceiving of ways to be intelligent without consciousness.
Even in the unlikely event that Earth-originating intelligence was committed to reducing wild-ET suffering in the universe, and even in the unlikely event that every planet in our future light cone contained conscious wild-ET suffering, I expect that space colonization would still increase rather than decrease total suffering. This is because planet-bound biological life captures a tiny fraction of the energy radiated by parent stars. Harnessing more of that solar energy would allow for running orders of magnitude more intelligent computations than biology currently does. Unless the suffering density among such computations was exceedingly low, multiplying intelligent computation would probably multiply suffering.
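As a rough back-of-the-envelope check on the "tiny fraction" claim (my own arithmetic with standard textbook values, not figures from this essay), the sketch below computes what share of the Sun's total output Earth intercepts at all.

```python
import math

R_earth = 6.371e6   # Earth's radius in meters (assumed standard value)
d_au = 1.496e11     # Earth-Sun distance in meters (assumed standard value)

# Earth's cross-section divided by the full sphere of radiation at 1 AU.
fraction = (math.pi * R_earth**2) / (4 * math.pi * d_au**2)
print(f"Fraction of solar output intercepted by Earth: {fraction:.2e}")
# Roughly 4.5e-10, i.e. about one part in two billion -- and biology captures
# only a sliver of even that, which is why star-harvesting computation could
# exceed planet-bound biological computation by many orders of magnitude.
```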
Given the extraordinary amounts of energy that can be extracted from stars, Earth-originating intelligence may run prodigious numbers of simulations.
Of course, maybe there are ETs running sims of nature for science or amusement, or of minds in general to study biology, psychology, and sociology. If we encountered these ETs, maybe we could persuade them to be more humane.
I think it's likely that humans are more empathetic than the average civilization because
Based on these considerations, it seems plausible that there would be room for improvement through interaction with ETs. Indeed, we should in general expect it to be possible for any two civilizations or factions to achieve gains from trade if they have diminishing marginal utility with respect to amount of control exerted. In addition, there may be cheap Pareto improvements to be had purely from increased intelligence and better understanding of important considerations.
That said, there are some downside risks. Post-humans themselves might create suffering simulations, and what's worse, the sims that post-humans run would be more likely to be sentient than those run by random ETs because post-humans would have a tendency to simulate things closer to themselves in mind-space. They might run nature sims for aesthetic appreciation, lab sims for science experiments, or pet sims for pets.
Suffering subroutines may be a convergent outcome of any AI, whether human-inspired or not. They might also be run by aliens, and maybe humans could ask aliens to design them in more humane ways, but this seems speculative.
It seems plausible that suffering in the future will be dominated by something totally unexpected. This could be a new discovery in physics, neuroscience, or even philosophy more generally. Some make the argument that because we know so very little now, it's better for humans to stick around because of the "option value": If people later realize it's bad to spread, they can stop, but if they realize they should spread, they can proceed to reduce suffering in some novel way that we haven't anticipated.
Of course, the problem with the "option value" argument is that it assumes future humans do the right things, when in fact, based on examples of speculations we can imagine now, it seems future humans would probably do the wrong things much of the time. For instance, faced with a new discovery of obscene amounts of computing power somewhere, most humans would use it to run oodles more minds, some nontrivial fraction of which might suffer terribly. In general, most sources of immense power are double-edged swords that can create more happiness and more suffering, and the typical human impulse to promote life/consciousness rather than to remove them suggests that suffering-focused values are on the losing side.
Still, waiting and learning more is plausibly Kaldor-Hicks efficient, and maybe there are ways it can be made Pareto efficient by granting additional concessions to suffering reducers as compensation.
Above I was largely assuming a human-oriented civilization with values that we recognize. But what if, as seems mildly likely, Earth is taken over by a paperclip maximizer, i.e., an unconstrained automation or optimization process? Wouldn't that reduce suffering because it would eliminate wild ETs as the paperclipper spread throughout the galaxy, without causing any additional suffering?
Maybe, but if the paperclip maximizer is actually generally intelligent, then it won't stop at tiling the solar system with paperclips. It will want to do science, perform lab experiments on sentient creatures, possibly run suffering subroutines, and so forth. It will require lots of intelligent and potentially sentient robots to coordinate and maintain its paperclip factories, energy harvesters, and mining operations, as well as scientists and engineers to design them. And the paperclipping scenario would entail black swans similar to those of a human-inspired AI. Paperclippers would presumably be less intrinsically humane than a "friendly AI," so some might cause significantly more suffering than a friendly AI, though others might cause less, especially the "minimizing" paperclippers, e.g., cancer minimizers or death minimizers.
If the paperclipper is not generally intelligent, I have a hard time seeing how it could cause human extinction. In this case it would be like many other catastrophic risks -- deadly and destabilizing, but not capable of wiping out the human race.
If we knew for certain that ETs would colonize our region of the universe if Earth-originating intelligence did not, then the question of whether humans should try to colonize space becomes less obvious. As noted above, it's plausible that humans are more compassionate than a random ET civilization would be. On the other hand, human-inspired computations might also entail more of what we consider to count as suffering because the mind architectures of the agents involved would be more familiar. And having more agents in competition for our future light cone might lead to dangerous outcomes.
But for the sake of argument, suppose an Earth-originating colonization wave would be better than the expected colonization wave of an ET civilization that would colonize later if we didn't do so. In particular, suppose that if human values colonized space, the outcome would have value -0.5 (measuring only suffering, as above), compared with -1 if random ETs colonized space. Then, as long as the probability P of some other ETs coming later is bigger than 0.5, it's better for humans to colonize and pre-empt the ETs from colonizing, since -0.5 > -1 * P when P > 0.5.
However, this analysis forgets that even if Earth-originating intelligence does colonize space, it's not at all guaranteed that human values will control how that colonization proceeds. Evolutionary forces might distort compassionate human values into something unrecognizable. Alternatively, a rogue AI might replace humans and optimize for arbitrary values throughout the cosmos. In these cases, humans' greater-than-average compassion doesn't make much difference, so suppose that the value of these colonization waves would be -1, just like for colonization by random ETs. Let the probability be Q that these non-compassionate forces win control of Earth's colonization. Now the expected values are
-1 * Q + -0.5 * (1-Q) for Earth-originating colonization
versus
-1 * P if Earth doesn't colonize and leaves open the possibility of later ET colonization.
For concreteness, say that Q = 0.5. (That seems plausibly too low to me, given how many times Earth has seen overhauls of hegemons in the past.) Then Earth-originating colonization is better if and only if
-1 * 0.5 + -0.5 * 0.5 > -1 * P
-0.75 > -1 * P
P > 0.75.
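For readers who prefer code to inequalities, here is a minimal numeric restatement of the comparison above, using the same illustrative values (-0.5 for a colonization wave shaped by human values, -1 otherwise) and treating Q and P as the only free parameters.

```python
def ev_earth_colonizes(Q, human_values=-0.5, other_values=-1.0):
    """Expected value if Earth-originating intelligence colonizes: with
    probability Q, non-compassionate forces end up controlling the wave."""
    return Q * other_values + (1 - Q) * human_values

def ev_earth_abstains(P, et_values=-1.0):
    """Expected value if Earth abstains and ETs colonize later with probability P."""
    return P * et_values

Q = 0.5
for P in (0.5, 0.75, 0.9):
    better = "Earth colonizing" if ev_earth_colonizes(Q) > ev_earth_abstains(P) else "abstaining"
    print(f"P = {P}: {better} has higher expected value")
# With Q = 0.5, Earth colonizing only wins once P exceeds 0.75, matching the inequality above.
```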
Given uncertainty about the Fermi paradox and Great Filter, it seems hard to maintain a probability greater than 75% that our future light cone would contain colonizing ETs if we don't ourselves colonize, although this section presents an interesting argument for thinking that the probability of future ETs is quite high.
What if rogue AIs result in a different magnitude of disvalue from arbitrary ETs? Let H be the expected harm of colonization by a rogue AI. Assume ETs are as likely to develop rogue AIs as humans are. Then the disvalue of Earth-based colonization is
H * Q + -0.5 * (1-Q),
and the harm of ET colonization is
P * (H * Q + -1 * (1-Q)).
Again taking Q = 0.5, then Earth-based colonization has better expected value if
H * 0.5 + -0.5 * 0.5 > P * (H * 0.5 + -1 * 0.5)
H - 0.5 > P * (H - 1)
P > (H - 0.5)/(H - 1),
where the inequality flips around when we divide by the negative number (H - 1). Here's a plot of these threshold values for P as a function of H. Even if H = 0 and a rogue AI caused no suffering, it would still only be better for Earth-originating intelligence to colonize if P > 0.5, i.e., if the probability of ETs colonizing in its place was at least 50%.
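Since the plot itself is not reproduced here, the following small sketch evaluates the same threshold, P > (H - 0.5)/(H - 1), for a few values of H, still assuming Q = 0.5 and H < 1 so that the inequality flips as described.

```python
def threshold_P(H):
    """Minimum probability of later ET colonization at which Earth-based
    colonization has higher expected value, assuming Q = 0.5 and H < 1."""
    return (H - 0.5) / (H - 1.0)

for H in (0.0, -0.25, -0.5, -1.0, -2.0):
    print(f"H = {H:+.2f}: Earth colonization better only if P > {threshold_P(H):.2f}")
# H = 0 recovers the 50% threshold mentioned above, and H = -1 recovers the
# earlier 75% figure; more harmful rogue AIs push the threshold higher still.
```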
These calculations involve many assumptions, and it could turn out that Earth-based colonization has higher expected value given certain parameter values. This is a main reason I maintain uncertainty as to the sign of Earth-based space colonization. However, this whole section was premised on human-inspired colonization being better than ET-inspired colonization, and the reverse might also be true, since computations of the future are more likely to be closer to what we most value and disvalue if humans do the colonizing.
Some have suggested the possibility of giving Earth-originating AI the values of ET civilizations. If this were done, then the argument about humans having better values than ETs would not even apply, and an Earth-based colonization wave would be as bad as ET colonization.
Finally, it's worth pointing out that insofar as human decisions are logically correlated with those of ETs, if humans choose not to or otherwise fail to colonize space, that may make it slightly more likely that other civilizations also don't colonize space, which would be good. However, this point is probably pretty weak, assuming that logical correlations between different planetary civilizations are small.
If technological development and space colonization seem poised to cause astronomical amounts of suffering, shouldn't we do our best to stop them? Well, it is worth having a discussion about the extent to which we as a society want these outcomes, but my guess is that someone will continue them, and this would be hard to curtail without extreme measures. Eventually, those who go on developing the technologies will hold most of the world's power. These people will, if only by selection effect, have strong desires to develop AI and colonize space.
Resistance might not be completely futile. There's some small chance that suffering reducers could influence society in such a way as to prevent space colonization. But it would be better for suffering reducers, rather than fighting technologists, to compromise with them: We'll let you spread into the cosmos if you give more weight to our concerns about future suffering. Rather than offering a very tiny chance of complete victory for suffering reducers, this cooperation approach offers a higher chance of getting an appreciable fraction of the total suffering reduction that we want. In addition, compromise means that suffering reducers can also win in the scenario (~30% likely in my view) that technological development does prevent more suffering than it causes even apart from considerations of strategic compromise with other people.
Ideally these compromises would take the form of robust bargaining arrangements. Some examples are possible even in the short term, such as if suffering reducers and space-colonization advocates agree to cancel opposing funding in support of some commonly agreed-upon project instead.
The strategic question of where to invest resources to advance your values at any given time amounts to a prisoner's dilemma with other value systems, and because we repeatedly make choices about where to invest, what stances to adopt, and what policies to push for, these prisoner's dilemmas are iterated. In Robert Axelrod's tournaments on the iterated prisoner's dilemma, the best-performing strategies were always "nice," i.e., not the first to defect. Thus, suffering reducers should not be the first to defect against space colonizers. Of course, if it seems that space colonizers show no movement toward suffering reduction, then we should also be "provocable" to temporary defection until the other side does begin to recognize our concerns.
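As a toy illustration (not part of the original argument), the sketch below plays an iterated prisoner's dilemma with the standard payoff values used in Axelrod's tournaments (temptation 5, mutual cooperation 3, mutual defection 1, sucker's payoff 0; these specific numbers are my assumption). It shows how a "nice but provocable" strategy such as tit-for-tat cooperates fully with its own kind while losing only a little to an unconditional defector.

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_history, their_history):
    # Nice (never defects first) but provocable (retaliates after a defection).
    return "C" if not their_history else their_history[-1]

def always_defect(my_history, their_history):
    return "D"

def play(strategy_a, strategy_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # mutual cooperation: (600, 600)
print(play(tit_for_tat, always_defect))  # exploited only in round one: (199, 204)
```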
We who are nervous about space colonization stand to gain a lot from allying with its supporters -- in terms of thinking about what scenarios might happen and how to shape the future in better directions. We also want to remain friends because this means pro-colonization people will take our ideas more seriously. Even if space colonization happens, there will remain many sub-questions on which suffering reducers want to have a say: e.g., not spreading wildlife, not creating suffering simulations/subroutines, etc.
We want to make sure suffering reducers don't become a despised group. For example, think about how eugenics is more taboo because of the Nazi atrocities than it would have been otherwise. Anti-technology people are sometimes smeared by association with the Unabomber. Animal supporters can be tarnished by the violent tactics of a few, or even by the antics of PETA. We need to be cautious about something similar happening for suffering reduction. Most people already care a lot about preventing suffering, and we don't want people to start saying, "Oh, you care about preventing harm to powerless creatures? What are you, one of those suffering reducers?" where "suffering reducers" has become such a bad name that it evokes automatic hatred.
So not only is cooperation with colonization supporters the more promising option, but it's arguably the only net-positive option for us. Taking a more confrontational stance risks hardening the opposition and turning people away from our message. Remember, preventing future suffering is something that everyone cares about, and we shouldn't erode that fact by being excessively antagonistic.
Many speculative scenarios that would allow for vastly reducing suffering in the multiverse would also allow for vastly increasing it: When you can decrease the number of organisms that exist, you can also increase the number, and those who favor creating more happiness / life / complexity / etc. will tend to want to push for the increasing side.
However, there may be some black swans that really are one-sided, in the sense that more knowledge is most likely to result in a decrease of suffering. For example: We might discover that certain routine physical operations map onto our conceptions of suffering. People might be able to develop ways to re-engineer those physical processes to reduce the suffering they contain. If this could be done without a big sacrifice to happiness or other values, most people would be on board, assuming that present-day values have some share of representation in future decisions.
This may be a fairly big deal. I give nontrivial probability (maybe ~10%?) that I would, upon sufficient reflection, adopt a highly inclusive view of what counts as suffering, such that I would feel that significant portions of the whole multiverse contain suffering-dense physical processes. After all, the mechanics of suffering can be seen as really simple when you think about them a certain way, and as best I can tell, what makes animal suffering special are the bells and whistles that animal sentience involves over and above crude physics -- things like complex learning, thinking, memory, etc. But why can't other physical objects in the multiverse be the bells and whistles that attend suffering by other physical processes? This is all very speculative, but we can only begin to imagine what understandings of the multiverse our descendants might arrive at.
If we care to some extent about moral reflection on our own values, rather than assuming that suffering reduction of a particular flavor is undoubtedly the best way to go, then we have more reason to support a technologically advanced future, at least if it's reflective.
In an idealized scenario like coherent extrapolated volition (CEV), say, if suffering reduction was the most compelling moral view, others would see this fact.3 Indeed, all the arguments any moral philosopher has made would be put on the table for consideration (plus many more that no philosopher has yet made), and people would have a chance to even experience extreme suffering, in a controlled way, in order to assess how bad it is compared with other things. Perhaps there would be analytic approaches for predicting what people would say about how bad torture was without actually torturing them to find out. And of course, we could read through humanity's historical record and all the writings on the Internet to learn more about what actual people have said about torture, although we'd need to correct for will-to-live bias and deficits of accuracy when remembering emotions in hindsight. But, importantly, in a CEV scenario, all of those qualifications can be taken into account by people much smarter than ourselves.
Of course, this rosy picture is not a likely future outcome. Historically, forces seize control because they best exert their power. It's quite plausible that someone will take over the future by disregarding the wishes of everyone else, rather than by combining and idealizing them. Or maybe concern for the powerless will just fall by the wayside, because it's not really adaptive for powerful agents to care about weak ones, unless there are strong, stable social pressures to do so. This suggests that improving prospects for a reflective, tolerant future may be an important undertaking. Rather than focusing on whether or not the future happens, I think it's more valuable for suffering reducers to focus on making the future better if it happens -- by encouraging compromise, moral reflectiveness, philosophical wisdom, and altruism.
A question by Rob Wiblin first inspired this piece. The discussion of cooperation with other value systems was encouraged by Carl Shulman. Initially I resisted his claim, but -- as has often proved the case at least on factual and strategic questions -- I eventually realized he was right and came around to his view.
The post Risks of Astronomical Future Suffering appeared first on Center on Long-Term Risk.
You can find answers to frequently asked questions below. If the FAQ doesn't answer your question, please send us an email at info@longtermrisk.org.
If you'd prefer to make a restricted donation to our grantmaking project the CLR Fund, please use the donation form on the CLR Fund page.
Currently, we can offer tax-deductible donations in the following countries:
If applicable, you will receive a tax receipt from the organization via whose accounts you donated. For the countries mentioned above, these organizations are:
If you have questions about your tax receipt, you can contact the organization in question directly.
CLR and EAF send a single receipt via email at the beginning of the relevant tax year, which includes all your donations in the preceding year, as long as your donations totaled at least 100.– and were made directly to CLR or EAF. We can only issue receipts if all fields in the registration form—including the address field—were filled out.
Donation receipts are issued in the name of the holder of the account from which the donation is made. Note that a donation counts toward a given year if we receive it in that year, regardless of when you initiate the transfer. This is especially relevant for donations via bank transfer made shortly before the end of the tax year, since they could reach our accounts after the year boundary.
If you choose to donate with Gift Aid, CLR will claim an additional 25% of the value of your donation from the UK government, reclaiming taxes that you paid/will pay. For example, a £100 donation would allow us to claim an additional £25 at no extra cost to you. You can read details about this on the government website here. There is also a useful post about Gift Aid and UK income tax here.
To be eligible for Gift Aid, you must confirm that you are a UK taxpayer, and understand that if you pay less Income Tax and/or Capital Gains Tax in the current tax year than the amount of Gift Aid claimed on all your donations, it is your responsibility to pay any difference.
If you believe you are no longer eligible for Gift Aid, or would like us to stop claiming Gift Aid on your recurring donation, please contact us at info@longtermrisk.org.
For donations made via the above form, payment processors charge a small fee that is deducted from your donation:
You can find detailed information about our activities and goals as well as our financial situation on our transparency site. CLR’s finances are checked annually by an independent auditor.
If you'd like to stay up to date with our activities, you can follow us on social media via Facebook, Twitter, and YouTube.
To change the amount of your monthly donation, please fill in the donation registration form again with your updated preferences. For donations via bank transfer, you can then adjust the standing order with the new reference number / purpose.
If you wish to change or cancel your monthly direct debit or credit card donation, please send us an email at info@longtermrisk.org. We will cancel your donation and you can set up a new one on this website.
You can update your personal details when registering a new donation via our website. Alternatively, you can send us an email at info@longtermrisk.org.
CLR only collects personal details necessary for the donation process. For instance, we need your contact details (email and postal address) to send an annual tax receipt. We keep your personal details strictly confidential. You can find more information on this topic in our Privacy Policy.
Giving What We Can can accept donations in cryptocurrency on our behalf, and donations in stocks made in the USA. For details see their FAQ page. If you'd like to make a large donation in stocks from another country, please contact us at info@longtermrisk.org.
Please get in touch with us at info@longtermrisk.org. We'll do our best to answer within two working days.
The post Support us to help reduce involuntary suffering. appeared first on Center on Long-Term Risk.
The rest of this page gives more detail about writing up research and other ways to help.
If you’d like to research a topic, feel free to proceed on your own and publish it on your website, blog, etc., and then let us know about it if you think it would be of interest. If you’d like to host it on this site, contact us to see if that’s feasible. If you’re particularly ambitious, try submitting the article to a formal media outlet or academic journal. While this takes more effort, it also can bring a wider audience, garner more prestige, and put the ideas into “the real world” rather than confining them to a small circle of like-minded friends.
Let us know if you have questions or would like advice in the research. We could also potentially review drafts if we have time.
Many topics on Wikipedia are sufficiently important that contributing just factual information to them may have very high value. Some of the Open Research Questions are best addressed by compiling existing information into relevant Wikipedia articles.
Brian Tomasik has a list of suggested Wikipedia “Pages to create or improve”, but it should only serve to jumpstart your creativity and is hardly exhaustive of the possibilities. Note that the value of Wikipedia contributions on technical AI issues is uncertain, because they can speed up risks alongside speeding up safety measures against risks.
The post Volunteer appeared first on Center on Long-Term Risk.
There's a decent chance that governments will be the first to build artificial general intelligence (AI). International hostility, especially an AI arms race, could exacerbate risk-taking, hostile motivations, and errors of judgment when creating AI. If so, then international cooperation could be an important factor to consider when evaluating the flow-through effects of charities. That said, we may not want to popularize the arms-race consideration too openly lest we accelerate the race.
AI poses a national-security threat, and unless the militaries of powerful countries are very naive, it seems to me unlikely they'd allow AI research to proceed in private indefinitely. At some point the US military would confiscate the project from Google or Facebook, if the US military isn't already ahead of them in secret by that point.
While the US government as a whole is fairly slow and incompetent when it comes to computer technology, specific branches of the government are on the cutting edge, including the NSA and DARPA (which already funds a lot of public AI research). When we consider historical examples as well, like the Manhattan Project, the Space Race, and ARPANET, it seems that the US government has a strong track record of making technical breakthroughs when it really tries.
Sam Altman agrees that in the long run governments will probably dominate AI development: "when governments gets serious about SMI [superhuman machine intelligence] they are likely to out-resource any private company".
There are some scenarios in which private AI research wouldn't be nationalized:
Each of these scenarios could happen, but it seems reasonably likely to me that governments would ultimately control AI development, or at least partner closely with Google.
Government AI development could go wrong in several ways. Plausibly governments would botch the process by not realizing the risks at hand. It's also possible that governments would use the AI and robots for totalitarian purposes.
It seems that both of these bad scenarios would be exacerbated by international conflict. Greater hostility means countries are more inclined to use AI as a weapon. Indeed, whoever builds the first AI can take over the world, which makes building AI the ultimate arms race. A USA-China race is one reasonable possibility.
Arms races encourage risk-taking -- being willing to skimp on safety measures to improve your odds of winning ("Racing to the Precipice"). In addition, the weaponization of AI could lead to worse expected outcomes in general. CEV seems to have less hope of success in a Cold War scenario. ("What? You want to include the evil Chinese in your CEV??") With a pure CEV, presumably it would eventually count Chinese values even if it started with just Americans, because people would become more enlightened during the process. However, when we imagine more crude democratic decision outcomes, this becomes less likely.
In Superintelligence: Paths, Dangers, Strategies (Ch. 14), Nick Bostrom proposes that another reason AI arms races would crimp AI safety is that competing teams wouldn't be able to share insights about AI control. What Bostrom doesn't mention is that competing teams also wouldn't share insights about AI capability. So even if less inter-team information sharing reduces safety, it also reduces speed, and the net effect isn't clear to me.
Of course, there are situations where arms-race dynamics can be desirable. In the original prisoner's dilemma, the police benefit if the prisoners defect. Defection on a tragedy of the commons by companies is the heart of perfect competition's efficiency. It also underlies competition among countries to improve quality of life for citizens. Arms races generally speed up innovation, which can be good if the innovation being produced is both salutary and not risky. This is not the case for general AI. Nor is it the case for other "races to the bottom".
Averting an AI arms race seems to be an important topic for research. It could be partly informed by the Cold War and other nuclear arms races, as well as by other efforts at nonproliferation of chemical and biological weapons. Forthcoming robotic and nanotech weapons might be even better analogues of AI arms races than nuclear weapons because these newer technologies can be built more secretly and used in a more targeted fashion.
Apart from more robust arms control, other factors might help:
World peace is hardly a goal unique to effective altruists (EAs), so we shouldn't necessarily expect low-hanging fruit. On the other hand, projects like nuclear nonproliferation seem relatively underfunded even compared with anti-poverty charities.
I suspect more direct MIRI-type research has higher expected value, but among EAs who don't want to fund MIRI specifically, encouraging donations toward international cooperation could be valuable, since it's certainly a more mainstream cause. I wonder if GiveWell would consider studying global cooperation specifically beyond its indirect relationship with catastrophic risks.
When I mentioned this topic to a friend, he pointed out that we might not want the idea of AI arms races too widely known, because then governments might take the concern more seriously and therefore start the race earlier -- giving us less time to prepare and less time to work on FAI in the meanwhile. From David Chalmers, "The Singularity: A Philosophical Analysis" (footnote 14):
When I discussed these issues with cadets and staff at the West Point Military Academy, the question arose as to whether the US military or other branches of the government might attempt to prevent the creation of AI or AI+, due to the risks of an intelligence explosion. The consensus was that they would not, as such prevention would only increase the chances that AI or AI+ would first be created by a foreign power. One might even expect an AI arms race at some point, once the potential consequences of an intelligence explosion are registered. According to this reasoning, although AI+ would have risks from the standpoint of the US government, the risks of Chinese AI+ (say) would be far greater.
We should take this information-hazard concern seriously and remember the unilateralist's curse. If it proves to be fatal for explicitly discussing AI arms races, we might instead encourage international cooperation without explaining why. Fortunately, it wouldn't be hard to encourage international cooperation on grounds other than AI arms races if we wanted to do so.
Also note that a government-level arms race could easily be preferable to a Wild West race among a dozen private AI developers where coordination and compromise would be not just difficult but potentially impossible. Of course, if we did decide it was best for governments to take AI arms races seriously, this would also encourage private developers to step on the gas pedal. That said, once governments do recognize the problem, they may be able to impose moratoria on private development.
How concerned should we be about accidentally accelerating arms races by talking about them? My gut feeling is it's not too risky, because
That said, I remain open to being persuaded otherwise, and it seems important to think more carefully about how careful to be here. The good news is that the information hazards are unlikely to be disastrous, because all of this material is already publicly available somewhere. In other words, the upsides and downsides of making a bad judgment seem roughly on the same order of magnitude.
In Technological change and nuclear arms control (1986), Ted Greenwood suggests that arms control has historically had little counterfactual impact:
In no case has an agreement inhibited technological change that the United States both actually wanted to pursue at the time of agreement and was capable of pursuing during the intended duration of the agreement. Only in one area of technological innovation (i.e., SALT II constraints on the number of multiple independently-targetable reentry vehicles, or MIRVs, on existing missiles) is it possible that such agreements actually inhibited Soviet programs, although in another (test of new light ICBMs [intercontinental ballistic missiles]) their program is claimed by the United States to violate the SALT II Treaty that the Soviets have stated they will not undercut.
In "Why Military Technology Is Difficult to Restrain" (1987), Greenwood adds that the INF Treaty was arguably more significant, but it still didn't stop technological development, just a particular application of known technology.
John O. McGinnis argues against the feasibility of achieving global cooperation on AI:
the only realistic alternative to unilateral relinquishment would be a global agreement for relinquishment or regulation of AI-driven weaponry. But such an agreement would face the same insuperable obstacles nuclear disarmament has faced. [...] Not only are these weapons a source of geopolitical strength and prestige for such nations, but verifying any prohibition on the preparation and production of these weapons is a task beyond the capability of international institutions.
In other domains we also see competition prevail over cooperation, such as in most markets, where usually there are at least several companies vying for customers. Of course, this is partly by social design, because we have anti-trust laws. Competition in business makes companies worse off while making consumers better off. Likewise, competition to build a quick, hacky AI makes human nations worse off while perhaps making the unsafe AIs better off. If we care some about the unsafe AIs for their own sakes as intelligent preference-satisfying agents, then this is less of a loss than it at first appears, but it still seems like there's room to expand the pie, and reduce suffering, if everyone takes things more slowly.
Maybe the best hope comes from the possibility of global unification. There is just one US government, with a monopoly on military development. If instead we had just one world government with a similar monopoly, arms races would not be necessary. Nationalism has been a potent force for gluing countries together and if channeled into internationalism, perhaps it could help to bind together a unified globe. Of course, we shouldn't place all our hopes on a world government and need to prepare for arms-control mechanisms that can also work with the present-day nation-state paradigm.
Robots require AI that contains clear goal systems and an ability to act effectively in the world. Thus, they seem like a reasonable candidate for where artificial general intelligence will first emerge. Facebook's image-classification algorithms and Google's search algorithms don't need general intelligence, with many human-like cognitive faculties, as much as a smart robot does.
Military robotics seems like one of the most likely reasons that a robot arms race might develop. Indeed, to some degree there's already an arms race to build drones and autonomous weapons systems. Mark Gubrud:
Killer robots are not the only element of the global technological arms race, but they are currently the most salient, rapidly-advancing and fateful. If we continue to allow global security policies to be driven by advancing technology, then the arms race will continue, and it may even reheat to Cold War levels, with multiple players this time. Robotic armed forces controlled by AI systems too complex for anyone to understand will be set in confrontation with each other, and sooner or later, our luck will run out.
Nanotechnology admits the prospect of severe arms races as well. "Can an MM Arms Race Avoid Disaster?" lists many reasons why a nanotech race should be less stable than the nuclear race was. In "War, Interdependence, and Nanotechnology," Mike Treder suggests that because nanotech would allow countries to produce their own goods and energy with less international trade, there would be less incentive to refrain from preemptive aggression. Personally, I suspect that countries would still be very desperate to trade knowledge about nanotech itself to avoid falling behind in the race, but perhaps if a country was the world's leader in nanoweapons, it would have incentive to attack everyone else before the tables turned.
Mark Gubrud's "Nanotechnology and International Security" presents an excellent overview of issues with both AI and nanotech races. He suggests:
Nations must learn to trust one another enough to live without massive arsenals, by surrendering some of the prerogatives of sovereignty so as to permit intrusive verification of arms control agreements, and by engaging in cooperative military arrangements. Ultimately, the only way to avoid nanotechnic confrontation and the next world war is by evolving an integrated international security system, in effect a single global regime. World government that could become a global tyranny may be undesirable, but nations can evolve a system of international laws and norms by mutual agreement, while retaining the right to determine their own local laws and customs within their territorial jurisdictions.
According to Jürgen Altmann's talk, "Military Uses of Nanotechnology and Nanoethics," 1/4 to 1/3 of US federal funding in the National NT Initiative is for defense -- $460 million out of $1.554 billion in 2008 (video time: 18:00). The US currently spends 4-10 times the rest of the world in military nanotech R&D, compared with "only" 2 times the rest of the world in overall military R&D (video time: 22:28). Some claim the US should press ahead with this trend in order to maintain a monopoly and prevent conflicts from breaking out, but it's dubious that nanotech can be contained in this way, and Altmann instead proposes active arms-control arrangements with anytime, anywhere inspections and in the long run, progress toward global governance to allay security dilemmas. We have seen many successful bans on classes of technology (bioweapons, chemical weapons, blinding lasers, etc.), so nano agreements are not out of the question, though they will take effort because many of the applications are so inherently dual-use. Sometimes commentators scoff at enforcement of norms against use of chemical weapons when just as many people can be killed by conventional forces, but these agreements are actually really important, as precedents for setting examples that can extend to more and more domains.
Like AI, nanotech may involve the prospect of the technology leader taking over the world. It's not clear which technology will arrive first. Nanotech contributes to the continuation of Moore's law and therefore makes brute-force evolved AI easier to build. Meanwhile, AI would vastly accelerate nanotech. Speeding up either leaves less time to prepare for both.
To read comments on this piece, see the original LessWrong discussion.
The post International Cooperation vs. AI Arms Race appeared first on Center on Long-Term Risk.