love is value alignment: Nonlinear Function
Created: July 12, 2020
Modified: November 28, 2023

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

What does it mean to love someone? Of course this question has as many answers as there are people, and probably more. But here's one view: love is when you care about someone and want them to be happy.

We might map this into more technical language: love is when you adopt someone else's utility as a term in your own utility function.
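As a sketch of what "a term in your own utility function" could mean, suppose A loves B. The symbols below, and the choice of a simple additive form, are assumptions made here for illustration rather than part of the original note:

```latex
U_A(x) = U_A^{0}(x) + w \, U_B(x), \qquad w > 0
```

Here $x$ is an outcome, $U_A^{0}$ is what A would care about anyway, $U_B$ is B's utility, and $w$ is how much B's happiness counts for A. Setting $w = 0$ recovers indifference.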

This is a somewhat bloodless definition of love. It totally ignores the experiential aspect, the sense of warmth, the melting of the heart, that humans in love can experience. But, acknowledging that all models are wrong, this is still perhaps an interesting model of love to consider and critique.

Of course, we never know someone else's utility function. Part of trying to maximize it is trying to learn what it is: what does someone like, and how can we make them happy? Formally, this is the domain of inverse reinforcement learning and, if the subject is actively trying to teach us, of cooperative inverse reinforcement learning.
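To make "learning what someone likes" concrete, here is a toy sketch. It is not real inverse reinforcement learning, which works over sequential decisions. It is a one-step Bayesian version that assumes the person picks options with softmax probability under one of two candidate utility functions. The candidate names, options, and numbers are all hypothetical.

```python
import math

# Hypothetical candidate utility functions over three options (illustrative values).
CANDIDATES = {
    "likes_quiet": {"concert": 0.0, "hike": 1.0, "book": 2.0},
    "likes_crowds": {"concert": 2.0, "hike": 1.0, "book": 0.0},
}

def choice_prob(utility, chosen, options, beta=1.0):
    """Softmax probability of picking `chosen` from `options` under `utility`."""
    weights = {o: math.exp(beta * utility[o]) for o in options}
    return weights[chosen] / sum(weights.values())

def update_posterior(prior, observations, beta=1.0):
    """Bayes update over the candidate utilities given (chosen, options) pairs."""
    posterior = dict(prior)
    for chosen, options in observations:
        for name, utility in CANDIDATES.items():
            posterior[name] *= choice_prob(utility, chosen, options, beta)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

# Start undecided, then watch two choices.
prior = {"likes_quiet": 0.5, "likes_crowds": 0.5}
observations = [("book", ["concert", "book"]), ("hike", ["concert", "hike"])]
posterior = update_posterior(prior, observations)
print(posterior)  # most of the probability mass moves to "likes_quiet"
```

Two observed choices are enough to shift most of the belief toward one hypothesis, but only because the hypothesis space was fixed in advance. Much of the real difficulty is that we do not know the candidates ahead of time.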

The love in question can be 'big love', like love for a partner or family member, or it can be casual love: fleeting encounters and attractions can still be non-zero-sum. The difference is in how much priority we give to the other person's utility, or how durable our devotion to it is.
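The "how much priority" part can be illustrated with a small sketch: an agent picks whichever plan maximizes its own utility plus a weighted copy of the other person's. The plans, payoffs, and the additive weighting are hypothetical choices made for this example, and durability over time is not modeled.

```python
# Hypothetical payoffs for each evening plan: (my utility, their utility).
PLANS = {
    "my_favorite_film": (3.0, 0.0),
    "compromise_dinner": (2.0, 2.0),
    "their_favorite_show": (0.0, 3.0),
}

def combined_utility(own, other, weight):
    """My utility plus the other person's utility, scaled by how much I care."""
    return own + weight * other

def best_plan(weight):
    """Return the plan that maximizes the combined utility for a given weight."""
    return max(PLANS, key=lambda p: combined_utility(*PLANS[p], weight))

for w in (0.0, 1.0, 3.0):
    print(w, best_plan(w))
# 0.0 my_favorite_film
# 1.0 compromise_dinner
# 3.0 their_favorite_show
```

With zero weight the agent is simply selfish. A moderate weight lands on the non-zero-sum option, and a large weight has the agent putting the other person's preferences ahead of its own.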

This definition gets at a certain sort of 'helpfulness' maybe, more than 'love' in its deepest sense. By this definition, a personal assistant running errands for his boss might be said to 'love' his boss (within the bounds of his work persona at least). The helping agent need not even 'know' that they are in love. The Netflix recommendation algorithm tries to learn a user's preferences and help recommend movies that they'd like, but it has no broader understanding of the world or of its own existence and behavior as a rational agent.


A more advanced definition of love might be a relaxing of the boundaries of the self. To love someone deeply can feel "selfless", like losing yourself in them, defining the pair of you as a new fundamental unit of being.

This definition applies to self-aware agents, agents that actively maintain a world model that specifically contains a representation of the agent itself as an entity in the world. It is a change, not in reward functions (such an agent may not know its reward function, if it even has one --- it only knows the model of reward associated with its self-representation), but in the world model's units of analysis, the way in which it purports to carve the world at the joints.

I'm unsure if there are any actually-existing cases of artificial agents where this form of love can be cleanly operationalized. It seems like a very interesting direction for study as agents and our mechanistic interpretability techniques both become more advanced.

This definition might imply the utility-theoretic definition, but it also works for agents that do not exactly have coherent utility functions. A parent, who is composed of conflicting impulses, might love their child, who they also see as a bundle of conflicting impulses. Acting effectively might involve understanding those parts and mediating between them to find a harmonious path forwards. Loving the child just expands the scope of the work to include both of their parts.

This accords with the view of love in some spiritual traditions. Buddhist practice leads to the realization that the self is a construct --- that we are not really separate from the world, that the boundaries that we draw around our 'selves' are not real. This realization can be accompanied by a strong sense of love for the world and all beings, understanding that they are not different, not separate from us. Conversely, cultivating that sense of love can prime the mind towards the realization of no-self.


supporting growth, including finding new goals and purpose

similar capacity - enough to really understand the person in the deepest sense, their full being. (does this rule out dogs loving humans?)


How does self-love fit into this framework? From a purely decision-theoretic standpoint, self-love seems like a contradiction in terms. Of course you care about your own utility function! If there were some 'utility function' that you didn't care about, then self-evidently that function couldn't be your actual utility function. Similarly, the 'boundaries of the self' would seem to self-evidently include the self --- what could it mean to expand them?

But we have a lot of evidence that self-love is a psychologically meaningful concept. So something is wrong with this model. What?

  • Thread A: We don't have direct access to anyone's utility function, including our own; we can only attempt to model it. Meanwhile, the function itself is changing under our feet. Our identities matter partly because they are models of our own utility functions, and having a good model of a utility function is a prerequisite to optimizing it.
  • Thread B: We tend to treat other people similarly to how we treat ourselves, because projection is unavoidable.
  • Thread C: Our actions only optimize any utility insofar as we are coherent and rational, which of course we never perfectly are. Minds are made of parts, which may have contradictory impulses. People often act self-destructively, or are hard on themselves.

Synthesis?

Utility-theoretic: our 'revealed utility' is the utility implicitly optimized by our actions, to the extent that there is one. Meanwhile we have a 'true utility', the stuff that will really make us happy, which is unknown even to us. Just as loving someone else involves curiosity about their utility, loving ourselves involves curiosity about our own utility: actually paying attention to what feels good and constantly improving our models. It also involves training our action models to align with our models of our utility. This is a dual-process cognition setup in which our 'system 2' models our 'true values' (ultimately supervised by system-1 primal rewards) and serves as a supervision signal for the system-1 policies that guide our everyday thoughts and actions.
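The dual-process picture can be caricatured in a few lines. In this toy sketch, a slow value model is updated from felt rewards, and a table of habit strengths (standing in for 'revealed utility') is nudged toward that model. Every name, number, and update rule here is an assumption made for illustration, and for simplicity every action is sampled in every round.

```python
# Hypothetical felt rewards: the 'true utility' that the agent cannot read off directly.
FELT_REWARD = {"scroll_phone": 0.2, "call_friend": 0.9}

def update_value_model(model, action, felt, lr=0.5):
    """'System 2': move the modeled value of `action` toward the reward actually felt."""
    model[action] += lr * (felt - model[action])

def train_habits(habits, model, lr=0.5):
    """'System 1': nudge each habit strength toward the value model's estimate."""
    for action in habits:
        habits[action] += lr * (model[action] - habits[action])

value_model = {"scroll_phone": 0.5, "call_friend": 0.5}  # initial self-model
habits = {"scroll_phone": 0.9, "call_friend": 0.1}       # initial 'revealed' preferences

for _ in range(10):
    # Pay attention to how each action actually feels, then retrain the habits.
    for action, felt in FELT_REWARD.items():
        update_value_model(value_model, action, felt)
    train_habits(habits, value_model)

preferred_habit = max(habits, key=habits.get)
print(preferred_habit)  # call_friend
```

The habits start out favoring the less rewarding action and end up tracking the felt rewards, but only because the value model was kept honest by repeatedly checking it against experience.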

Parts-theoretic: most of us have parts of ourselves (traits, preferences, habits, physical characteristics, etc) that we have trouble accepting. We wish we had smoother skin, neater habits, better judgement. We might be ashamed of sexuality, of ambition or the lack thereof, etc. Whatever is within "the boundaries of the self", the job is to accept it as it is, to see the state of things as fundamentally okay, even as we want the best for it.


Or: there is a felt experience of compassion, of grace, of wanting the best for ourselves (or someone else), that is not automatic. And feeling it for ourselves helps us feel it for others.


A radical perspective is that all love is self-love. We never experience other people directly, only our mental representations of them. So when we think we are loving another person, ultimately the love is directed towards parts of our own mind.

This may seem like a technicality, but it has real implications, because our minds are highly connected. We see our lover in terms of concepts, traits that they embody, ranging from high-level concepts like ambition all the way down to basic perceptual organization: the textures, smells, and sounds we associate with them. Especially at the higher levels (but to some extent even the perceptual levels), these concepts will be reused in our own self-representation. We will also have many of the same traits to a greater or lesser degree. So fully loving another person, accepting everything about them, will mean loving and accepting some parts of ourselves. It is impossible to love fully without being changed.


Ultimately, love is perhaps the only durable source of purpose. Rather than viewing love as a manipulation of utility functions, we might ask: where do utility functions come from? If I want to help you, and you want to help me, then we are both being helped, which is good, but more importantly, we both have purpose.