Postdoc on the Interactive Machine Learning project, led by Andrea Thomaz, at Georgia Tech in Atlanta:
During my postdoc at Georgia Tech I explored what happens when an artificial learner interprets the evaluations of a human teacher as referring to the quality of the learner's action choices. This interpretation is the foundation of the Policy Shaping algorithm, which consists of two components: a reinforcement learning (RL) component that operates on environmental rewards (for example Q-learning operating on score changes in Pac-Man), and a component that learns from a teacher.
The component that learns from a teacher is the focus of the research, and the objective is to find better interpretations of human teachers. This social component modifies the policy that the RL component produces, shaping it based on teacher evaluations. Teacher-generated data is represented as (state s, action a, evaluation e) triples. Given an (s, a, e) triple, the probability of taking action a in state s is increased if the evaluation e is positive and decreased if it is negative (a triple shapes the policy directly, but only in the state contained in that triple). The evaluations generated by the teacher are not maximised, and are not used to modify a reward function or a set of Q-values. They instead shape the policy coming from the RL component.
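To make the shaping concrete, here is a minimal sketch of how per-state evaluation counts can shape an RL policy. The Boltzmann temperature and the teacher-consistency parameter are my assumptions for illustration, not details taken from the description above.

```python
import numpy as np

def shaped_policy(q_values, delta, temperature=1.0, consistency=0.9):
    """Combine an RL policy with teacher evaluations (hedged sketch).

    q_values: array of Q-values for each action in the current state.
    delta: array of (#positive - #negative) evaluations per action.
    consistency: assumed probability that an evaluation is correct.
    """
    # Boltzmann policy from the RL component.
    rl = np.exp(q_values / temperature)
    rl /= rl.sum()
    # Probability that each action is a good choice, given the counts:
    # more positive than negative presses pushes this towards 1.
    c, d = consistency, np.asarray(delta, dtype=float)
    teacher = c**d / (c**d + (1.0 - c)**d)
    # Shape the RL policy with the teacher term and renormalise;
    # note that nothing here maximises the evaluations themselves.
    combined = rl * teacher
    return combined / combined.sum()
```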
To illustrate the interpretation of human evaluations that Policy Shaping is built on, consider an agent that gets one negative evaluation for creating a problem, and then multiple positive evaluations for a series of actions that resolve that problem and return the agent to the original state. If the evaluations are interpreted as something that should be maximised, the implication is that it would be a good idea to go through this cycle as often as possible. The human behavior is then (explicitly or implicitly) interpreted as urging the agent to deliberately create the problem. The interpretation that Policy Shaping is built on instead implies that one should not take the action that created the problem, because that action choice was given a negative evaluation. For the actions fixing the problem, the interpretation implies only that if the learner happens to be in that specific state, then it should take the actions that received positive evaluations. The Policy Shaping interpretation does not imply that this situation should be created. Positive evaluations are simply interpreted as indicating a good action choice, nothing more.
A cognitive argument in favour of the Policy Shaping interpretation is that pressing positive/negative evaluation buttons when a robot does the right/wrong thing is much easier than, for example, giving the type of feedback that would be suitable for maximisation. In a task where I only cared about the end state, it would for example be very difficult to make the expected discounted sum of my evaluations path independent. This would be difficult even if I knew the exact numerical value of every end state and knew the world dynamics perfectly. Keeping track of the total sum of the current interaction, and making sure it ends up the same as the sum that the learner would have received for taking some other path, would still be tricky. The question here is not what human teachers should do, or what type of human behavior would be convenient for designers of systems. What really matters is what humans actually do.
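To see why, here is a small worked example (my own illustration, not taken from the papers): a teacher who presses the positive button once per correct action produces evaluation sums that depend on the path taken, not just on the end state reached.

```latex
% Two paths reach the same end state; the teacher gives +1 per correct action:
%   path A: s_0 \to s_1 \to s_g   (two actions)
%   path B: s_0 \to s_g           (one action)
\begin{align*}
\sum_{\text{path A}} e_t &= 1 + 1 = 2, \\
\sum_{\text{path B}} e_t &= 1.
\end{align*}
% A learner maximising the sum of evaluations prefers the longer path A,
% even though both paths end in the same state. To make the sums path
% independent, the teacher would have to track running totals and adjust
% every button press accordingly.
```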
For a learner that is trying to do what a human teacher would like it to do, maximising the number of times the teacher pushes a positive evaluation button minus the number of times the teacher pushes a negative evaluation button corresponds to an assumption about how the human behavior should be interpreted. A learner that is simply trying to get good evaluations is not making the same assumption, because any clever way of getting good evaluations or avoiding bad evaluations would, by definition, constitute success. But given a learner that is trying to figure out what a human teacher would like it to do, it is possible to compare the accuracy of different interpretations. Consider a teacher that, for the entire duration of learning, tends to give more positive than negative evaluations (for example giving positive evaluations to every reasonable action, and only giving negative evaluations to very serious mistakes), and that will stop giving evaluations when the learner is done learning. The strategy of deliberately slowing down learning, resulting in a larger sum of rewards, is a failure for the type of learner we are concerned with. The interpretation of the button presses is bad, in a well-defined way, because interpreting the human behavior as urging the learner to slow down learning is simply inaccurate. An agent that is only trying to extract rewards from the teacher is, however, not making an interpretation mistake when slowing down learning, if this is the best way of extracting rewards. A similar point can be made about a teacher that only uses the negative evaluation button. In this case, some creative strategy that makes the teacher want to avoid the learner and stop the interaction could be the best way to maximise the sum of rewards.
For a learner trying to figure out what the teacher would like it to do, and using the interpretation of evaluations as referring to action choice, the behavior of the positive teacher is not interpreted as urging the learner to slow down learning, and the behavior of the negative teacher is not interpreted as urging the learner to terminate the interaction. A teacher that sometimes presses minus, but never presses plus, is simply interpreted as urging the learner to avoid certain actions. This seems like a more reasonable interpretation than: "please make the interaction so unpleasant that I will avoid you". These particular issues, and other specific failure modes that one happens to notice, could of course be dealt with using workarounds or patches. But starting from a more accurate interpretation of the human seems like a cleaner and more robust solution. The issue of how various human behaviors should be interpreted is also interesting as a scientific question in its own right. For a learner trying to figure out what the teacher would like it to do, a reasonable response to the negative teacher under the action evaluation interpretation would simply be to reduce the probability of taking actions that have been evaluated negatively (which is what the Policy Shaping algorithm does). That these types of issues arise even with specifically designated evaluative buttons illustrates that it is not always easy to see what implicit interpretation assumptions are embedded in various learning algorithms. It seems likely that similar foundational interpretation issues will need to be addressed when designing an agent that learns from a more complex set of inputs, such as the types of social signals that humans naturally produce when observing or interacting with an artificial system. Dealing with related interpretation issues for social signals seems like an underexplored research area (see also the future directions section). I see this as strongly intertwined with, but not identical to, the question of how to detect social signals.
PhD at Inria, Bordeaux, with Pierre-Yves Oudeyer as advisor:
I first worked on a version of Gaussian Mixture Regression that made it easier to learn an unknown number of tasks from unlabelled demonstrations. This work was later extended to include language learning, where language was treated like any other part of the context, allowing the same system to learn linguistic and non-linguistic tasks. Alongside this work, I developed a formalism for the type of social learning where a learner is trying to figure out what a teacher would like it to do, focusing on flawed teachers and unknown interaction protocols.
Incremental Local Online Gaussian Mixture Regression (ILO-GMR):
In this setup, a teacher/demonstrator performed a number of unlabelled demonstrations. This means that the learner/imitator does not know the number of tasks, and does not know which demonstrations are instances of the same task. The first experiment investigating this setup used ILO-GMR, a regression technique that makes it unnecessary to explicitly represent the number of tasks. Each demonstration is split into individual data points, and during reproduction the learner: (i) retrieves a set of local data points, consisting of those points whose context is most similar to the current context, (ii) uses only these local points to build a Gaussian Mixture Model (GMM) with a small number of Gaussians, (iii) does regression on the GMM to get the next action, (iv) updates the context after taking the chosen action, and retrieves new points (back to step (i)); a sketch of this loop is given below. This research resulted in the paper "Incremental Local Online Gaussian Mixture Regression for Imitation Learning of Multiple Tasks".
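As a rough illustration of steps (i)-(iv), here is a minimal sketch of one reproduction step, assuming the demonstrations have been chopped into an array of concatenated (context, action) points. The neighbour count, the number of Gaussians, and the use of scikit-learn are my assumptions, not details of the ILO-GMR implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def local_gmr_step(data, context, k=50, n_gaussians=3):
    """One reproduction step: local points -> small GMM -> regression.

    data:    (N, dc + da) array of [context, action] points pooled from
             all demonstrations, chopped into individual data points.
    context: (dc,) current context vector.
    Returns the next action as the conditional mean E[action | context].
    """
    dc = context.shape[0]
    # (i) Retrieve the k points whose context is closest to the current one.
    dists = np.linalg.norm(data[:, :dc] - context, axis=1)
    local = data[np.argsort(dists)[:k]]
    # (ii) Fit a small GMM on the local points only.
    gmm = GaussianMixture(n_components=n_gaussians).fit(local)
    # (iii) Gaussian mixture regression: condition each Gaussian on the
    # current context and average the conditional action means, weighted
    # by how well each component explains that context.
    action, norm = np.zeros(data.shape[1] - dc), 0.0
    for m, c, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
        mc, ma = m[:dc], m[dc:]
        ccc, cac = c[:dc, :dc], c[dc:, :dc]
        resp = w * multivariate_normal.pdf(context, mean=mc, cov=ccc)
        cond = ma + cac @ np.linalg.solve(ccc, context - mc)
        action += resp * cond
        norm += resp
    # (iv) The caller applies this action, updates the context, and loops.
    return action / norm
```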
Imitation learning and language:
A big part of my work at Inria was based on exploring language learning from the point of view of imitation learning, which resulted in a conference paper, a journal article, and a book chapter (see the publications section). The basic idea was to construct a generic imitation learning algorithm that was capable of learning both linguistic and non-linguistic tasks. To illustrate the similarity with non-linguistic imitation, the algorithm was tested on a mixed set of unlabelled tasks, both linguistic and non-linguistic. The learner/imitator observed a teacher/demonstrator and learnt a number of tasks. It was, however, not told which demonstrations were of which task, how many separate tasks there were, or how many of the demonstrations were of linguistic vs non-linguistic tasks. The setup contained a learner that watched two humans interact: one teacher and one interactant. The interactant might perform a communicative act, either a speech utterance or a hand sign, and the teacher performed a demonstration that might be a response to the interactant, or might be a response to some other part of the context. In order to solve this problem, the learner treats the interactant like any other part of the context. The learner learns how to respond to speech and hand signs, as well as how to respond to the position of an object. Learning linguistic skills is cast as a special case of learning other sensorimotor skills, and a single system is able to learn linguistic and non-linguistic skills through observation (without needing to be told which of the skills are linguistic).
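The key mechanism can be stated in a few lines of code. This is my own illustration, with hypothetical feature names: the interactant's communicative acts enter the context vector exactly like the position of an object does, so the same regression machinery (for example the local GMR step sketched above) handles linguistic and non-linguistic tasks alike.

```python
import numpy as np

def build_context(object_position, speech_features, sign_features):
    """Treat the interactant's communicative acts as ordinary context.

    The names are illustrative. Speech and hand-sign features are simply
    concatenated with physical context such as object positions, so the
    learner never needs to be told which dimensions are linguistic.
    """
    return np.concatenate([object_position, speech_features, sign_features])
```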
One experiment focused on imitation of internal cognitive operations. The teacher/demonstrator performed a "focus on object" operation as a response to some aspect of the environment. This operation could not be observed directly, and the artificial learner had to infer what the teacher had done. The learner observes one teacher, as well as one interactant. The interactant performs two hand signs, which are treated as part of the context. The teacher responds to one of them by focusing on the object that was indicated by the sign (there are three objects, with one hand sign for each), and responds to the other hand sign by performing a movement in the reference frame of the object that the teacher is now focusing on (there are three movements, and three movement-request signs). Because the sign inputs are continuous, the learner/imitator must infer how many different types of hand signs it has observed, and must also infer which hand sign triggered the focus-on-object operation and which hand sign triggered the movement type.
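One standard way to infer the number of distinct sign types from continuous observations is model selection over mixture models. The sketch below uses BIC-scored GMMs; this is my illustration of the inference problem, not necessarily the mechanism used in the experiment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def infer_sign_types(sign_observations, max_types=10):
    """Infer how many distinct hand signs the continuous inputs contain.

    Fits GMMs with 1..max_types components and keeps the one with the
    lowest BIC score; the cluster labels then serve as discrete sign
    identities for the rest of the inference.
    """
    X = np.asarray(sign_observations)
    models = [GaussianMixture(n_components=k).fit(X)
              for k in range(1, max_types + 1)]
    best = min(models, key=lambda m: m.bic(X))
    return best.n_components, best.predict(X)
```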
Theoretical issues that arise when learning from teachers that are flawed/limited/mistaken/etc:
Alongside the work on imitation learning and language, I pursued more theoretical work that eventually resulted in the journal article "A Social Learning Formalism for Learners Trying to Figure Out What a Teacher Wants Them to Do". The formalism presents a theoretical foundation for approaching the problem of how a learner can infer what a teacher wants it to do through strongly ambiguous interaction or observation. The formalism groups the interpretation of a broad range of information sources under the same theoretical framework. A teacher's motion demonstrations, eye gaze during a reproduction attempt, facial expressions, evaluative buttons, speech comments, EEG readings, etc. are all treated as specific instances of the same general class of information sources. These sources all provide (partial and ambiguous) information about what the teacher wants the learner to do, and all need to be interpreted concurrently. Learning setups are introduced, and algorithm outlines are presented to illustrate some of the practical problems that must be overcome. There is a shift in how interpretation is viewed: from a static interpretation of a teacher's behavior to a parameterised hypothesis space that is updated based on observations.
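In rough Bayesian terms (my notation, not the article's), the shift can be summarised as maintaining a joint posterior over the task and the interpretation parameters:

```latex
% \tau:    what the teacher wants the learner to do (the task)
% \theta:  parameters of the interpretation hypothesis (how the
%          teacher's signals relate to the task)
% D:       observed signals (demonstrations, gaze, button presses, ...)
P(\tau, \theta \mid D) \;\propto\; P(D \mid \tau, \theta)\, P(\tau)\, P(\theta)
% A static interpretation corresponds to fixing \theta in advance;
% the formalism instead keeps \theta uncertain and updates it from data.
```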
An experiment taking the initial steps towards learning interpretation hypotheses, and exemplifying the type of system that the formalism is meant to describe, resulted in the conference paper "Simultaneous Acquisition of Task and Feedback Models". Perfect knowledge of the feedback protocol would make learning the task easier, and a perfect understanding of the task would make learning the feedback protocol easier. The solution presented was to concurrently update the flawed models of both the feedback protocol and the task.
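A toy version of this chicken-and-egg structure, entirely my own construction rather than the paper's setup: the task is which of three actions is correct, the feedback model is how reliably the teacher's button presses reflect that, and both are inferred jointly from the same observations.

```python
import numpy as np

ACTIONS = [0, 1, 2]
PROTOCOLS = [0.6, 0.8, 0.95]  # candidate teacher consistencies (assumed)

def joint_posterior(observations):
    """observations: list of (action, feedback) with feedback in {+1, -1}.

    Enumerates the joint hypothesis space (correct action, consistency)
    and returns the normalised posterior, assuming a uniform prior.
    Learning either model alone would be harder; jointly, evidence about
    the task sharpens the feedback model and vice versa.
    """
    post = np.ones((len(ACTIONS), len(PROTOCOLS)))
    for a, f in observations:
        for i, correct in enumerate(ACTIONS):
            for j, c in enumerate(PROTOCOLS):
                agrees = (f == +1) == (a == correct)
                post[i, j] *= c if agrees else 1.0 - c
    return post / post.sum()

# Example: the teacher mostly praises action 2 and criticises the others.
obs = [(2, +1), (0, -1), (2, +1), (1, -1), (2, +1)]
print(joint_posterior(obs))  # mass concentrates on action 2, high consistency
```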
Language games at the AI-Lab of the VUB in Brussels:
I worked on language game research under the supervision of Luc Steels as part of my master's program, which resulted in the conference paper "Combining different interaction strategies reduces uncertainty when bootstrapping a lexicon". The general research question of language games is how a population of initially non-linguistic agents can bootstrap a language. My focus was on how to reduce referential uncertainty, and on how to choose which type of interaction to initiate. Agents in language games start with some initial conventions, such as pointing and an interaction protocol, and use these to negotiate a set of shared linguistic conventions.
Agents in my experiment were learning words of the type "shape" and "colour" concurrently with words of the type "blue", "round", "red", and "square". If one agent uses the description "gavagai" to refer to an object, then there is referential uncertainty: it is not clear what aspect of the object has been described. Referential uncertainty can be reduced if two agents observe an object and one asks "what colour is that?" and the other answers "gavagai". Knowing words such as "red" and "blue" makes it easier to learn words such as "colour", and knowing words such as "colour" makes it easier to learn words such as "red" and "blue". Agents concurrently negotiated meanings for both types of words. There is no human teacher in language game setups, but the inference problem faced in a language game is similar to the inference problem faced by an agent that learns from a human teacher.
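Below is a minimal sketch of the kind of cross-situational scorekeeping that language game agents commonly use, with a lateral-inhibition update. The update constants and the interface are illustrative assumptions, not the parameters of my experiment.

```python
from collections import defaultdict

class Lexicon:
    """Word-meaning association scores negotiated over repeated games."""

    def __init__(self):
        self.scores = defaultdict(float)   # (word, meaning) -> score

    def interpret(self, word, candidate_meanings):
        # Pick the candidate meaning with the highest current score.
        return max(candidate_meanings,
                   key=lambda m: self.scores[(word, m)])

    def update(self, word, meaning, success, all_meanings):
        if success:
            self.scores[(word, meaning)] += 1.0
            # Lateral inhibition: punish competing meanings of the word.
            for m in all_meanings:
                if m != meaning:
                    self.scores[(word, m)] -= 0.5
        else:
            self.scores[(word, meaning)] -= 1.0

# Asking "what colour is that?" narrows the candidates for "gavagai"
# from all attributes of the object down to its colour attributes alone.
lex = Lexicon()
lex.update("gavagai", "red", success=True,
           all_meanings=["red", "round", "blue", "square"])
print(lex.interpret("gavagai", ["red", "round"]))  # -> "red"
```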