Much of what drives my research interests, and those of our group, is the basic notion that HCI systems of the future will sense the human user and the user's state, and adapt to it (in whatever way is useful and suitable for a particular application). In particular, the young field of affective computing and its interaction with HCI is an area of research focus and future direction. Along the way, we need to solve quite a number of problems, be it in computer vision, signal processing, affective computing, bio-engineering, speech processing or others. Multimodal, multi-sensor systems also play a central role: combining several modalities and sensors leads to more accurate and more robust systems that can be used in real-world applications and that outperform single-modality, single-sensor systems.
I collaborate closely with a number of researchers at local institutions (other departments/schools at the University of Canberra, ANU, ADFA, NICTA, CSIRO, Australian Institute of Sport) as well as nationally (CSIRO Sydney, Black Dog Institute Sydney, University of Western Sydney, Flinders University, University of Adelaide) and internationally (Carnegie Mellon University, University of Pittsburgh, University of Maryland, IBM TJ Watson Research Center, Fraunhofer Institute for Computer Graphics). Postgraduate students often spend 3-6 months of their PhD studies as visiting researchers at one of these institutions.
Note 1: The following list is in no particular order. The topics are not set in stone and it is certainly possible to combine parts of different topics listed here to form a new project.
Note 2: If you would like to provide your own topic proposal, that is OK as long as it roughly aligns with my research interests.
In the area of affective computing, one future project is on applying facial expression analysis, facial movement analysis, voice analysis and possibly EEG analysis to sense the distinctive differences between people with psychological disorders, in particular depression and melancholia, and people without these disorders. The aim is to develop an objective measurement of the severity and characteristics of a particular patient's depressive disorder. This has real-world, immediate applications, as psychological disorders are a major global burden in terms of health and the economy. Ultimately, we aim to develop a low-cost, easy-to-use system that incorporates at least face and voice analysis, runs on a standard laptop, and can be used by clinicians and patients to monitor the progress of the disorder during and after treatment.
Open questions include the age-old computer vision dream of real-time, non-rigid face tracking of unseen faces (i.e. *anyone* can show their face in front of the camera(s) and the system can track the face and the movement of the facial features, such as eyebrows, lips, jaw line, eyes, …) instantaneously, in any kind of lighting conditions and from any angle. Current face tracking technology has moved forward a lot, but we are still quite a way from fulfilling this dream. We have developed a lot of non-rigid face tracking technology in our group, so this could be an extension of that work. Another totally unsolved research question in this context is that of the voice characteristics of patients with depression. The addition of physiological signals, e.g. EEG, but also galvanic skin response, skin temperature and ECG, is absolutely cutting edge. Each of these areas in itself would probably be worth a PhD, but of particular interest to us is to investigate a multimodal system, which we hypothesise will perform better than any single modality.
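To make the multimodal idea a little more concrete, here is a minimal sketch of one common way to combine modalities, namely weighted late fusion of per-modality scores. It is an illustration under stated assumptions, not the method used in our group: the function name `fuse_scores`, the modality names, the weights and the example scores are all hypothetical, and each score is assumed to be a number in [0, 1] produced by a separate single-modality classifier.

```python
def fuse_scores(scores, weights):
    """Weighted average of per-modality scores (assumed to lie in [0, 1]).

    A modality whose score is None (e.g. no face was found in this
    recording) is skipped, and the remaining weights are renormalised.
    """
    available = [m for m in scores if m in weights and scores[m] is not None]
    if not available:
        raise ValueError("no modality produced a score")
    total_weight = sum(weights[m] for m in available)
    return sum(weights[m] * scores[m] for m in available) / total_weight


# Hypothetical weights and per-modality outputs for one recording.
weights = {"face": 0.5, "voice": 0.3, "eeg": 0.2}
fused_all = fuse_scores({"face": 0.8, "voice": 0.6, "eeg": 0.4}, weights)
fused_no_eeg = fuse_scores({"face": 0.8, "voice": 0.6, "eeg": None}, weights)
```

Skipping missing modalities is one reason a multi-sensor system can be more robust in practice: the fused estimate degrades gracefully when one sensor fails instead of failing outright.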
In a second application of affective computing, or perhaps more correctly affective sensing, we are interested in developing technology that can sense a user's state from an HCI point of view. That is, we wish to be able to automatically analyse someone's facial expressions, their characteristic movements (face or possibly other body parts or even the entire body!), their speech patterns in different social contexts, their age and gender (this has been done in audio and video separately with some limited success, but not in a multimodal combination), their emotional/affective state (e.g. their mood or even just their short-term emotions) and so on, so that an HCI system can react to this input in a suitable way. This fits into work we do in the large-scale, multi-university project called From Talking Heads to Thinking Heads, but is also of strong interest generally.
Again, there is a whole host of research problems to solve, some similar to the ones mentioned for the first project above, but here with a focus on applications to HCI. For example, imagine a kiosk system where the user's age is estimated and the system changes its appearance (for example, choosing a different, age-appropriate avatar head on screen), content (for example, using easier sentences for children than for adults) and voice (for example, choosing a younger voice for children). So, here, we also have the generic research question of face tracking, but also questions such as age estimation from a number of modalities (at least audio and video), gender recognition, and affect recognition from video and audio. In addition, there are questions such as: how should an HCI system react if it were able to sense all these user characteristics? Obviously, that is context dependent, but we could pick a particular application and develop a theory on how to adapt around that. We estimate that this entire work would easily be enough for two PhD projects. I would envisage that any project in this area will be of a multimodal, multi-sensor nature.
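The kiosk example can be sketched in a few lines. This is only a toy illustration of the adaptation step, assuming an age estimate and a confidence value are already available from some upstream estimator; the names `Presentation`, `adapt_presentation` and `CHILD_MAX_AGE`, the cut-off age and the confidence threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Presentation:
    avatar: str         # which avatar head to show on screen
    reading_level: str  # how simple the sentences should be
    voice: str          # which synthetic voice to use


CHILD_MAX_AGE = 12  # assumed cut-off between child and adult presentation


def adapt_presentation(estimated_age, confidence, min_confidence=0.6):
    """Choose a presentation from an age estimate and its confidence.

    If the estimator is not confident enough, fall back to a neutral
    default rather than risk an inappropriate adaptation.
    """
    if confidence < min_confidence:
        return Presentation("neutral", "standard", "adult")
    if estimated_age <= CHILD_MAX_AGE:
        return Presentation("child", "simple", "young")
    return Presentation("adult", "standard", "adult")


child_view = adapt_presentation(estimated_age=8, confidence=0.9)
unsure_view = adapt_presentation(estimated_age=8, confidence=0.3)
```

The fallback branch reflects the open research question raised above: what the system should do is context dependent, and a real application would need a properly motivated adaptation policy rather than two hard-coded cases.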
- Another affective computing topic is stress detection. This has applications in the medical / health care area as well as in safety (e.g. driver stress detection) and education (e.g. stress during learning). Again, we would start from an angle of analysing facial expressions / movements and vocal characteristics, which we record while the subject is performing certain tasks that are designed to induce different levels of stress. (Note, we primarily talk about 'mental stress' or 'emotional stress' here, not physiological stress.) We wish to answer questions such as: Can we accurately measure the level of stress that a person is under? What is an appropriate stress model? How robust are the measurements to changes in the environment, e.g. changes in illumination, head pose, acoustic background noise, etc.?
- The following is purely a computer vision project, extending our work in the area of face modelling, face detection and face tracking. So far, our work has been in 2D only, partially because 2D is computationally more efficient than working with 3D models, although the latter clearly have the potential to deliver more accurate solutions (in particular, for extreme head poses). Many of the algorithms we have developed in recent years were developed and tested in 2D, but extending them to 3D is a natural next step and would be a very interesting topic to pursue. Using 3D models, or maybe a combination of 2D and 3D models, is expected to be a strong component in any system tackling the already mentioned dream of being able to detect and track anybody's face in a whole range of environmental conditions and head poses. Applications of such technology are manifold; I have already listed some above, but there are many others. This project also has the potential for commercialisation, but the science comes first!
This PhD topic is in the area of face recognition. While this area has seen a lot of attention in the last 10-15 years and much research has been done, the overall applicability of face recognition systems in large-scale environments (i.e. thousands of faces or more to recognise) is still fairly low and performance is poor. In one aspect of my current work, we are interested in biometrics and biometric security (for example, can a face recognition system detect if an impostor shows it a laptop screen with a prerecorded video of a known face?). Our previous work on face detection and face fitting/tracking means that we can accurately detect a face and that we can remove effects due to different poses and facial expressions. There is also a very interesting idea of adding shadow removal, illumination invariance and contrast enhancement as a preprocessing step. A related problem is that of performing face recognition from just a single image at the time of enrolment. Current face recognition systems work well when they have sufficient training data at hand, in other words multiple images of the same face, but in reality such data is often not possible to obtain. What strategies can we develop to deal with that problem?
This project could look into either or both of 2D face recognition and 3D face recognition. As already mentioned, many of our algorithms have been originally developed for 2D but could equally be employed in 3D. 3D face recognition, the automatic fitting (and thus adapting) of a generic 3D face model to a series of 2D or 3D face images, the feature extraction etc. are all still in their infancy and have a lot of potential.
Another possible project would be in the areas of image enhancement and image completion/inpainting, so this is much more pure image processing than the others, although the results and algorithms have applications in computer vision as well. In some recent work, we have been able to reconstruct images from contrast information alone (see Khwaja & Goecke, DICTA 2008) in a biologically inspired way. We have since applied this technique to automatically enhance images of low contrast (over- or underexposure) with great success. One of my current students works in this area but will finish his PhD in about 12 months from now, so there is an opportunity to continue and further drive this work. This kind of research has immediate applications for digital camera manufacturers, who have a keen interest in giving their customers the 'best' image quality, irrespective of how bad the original shot was. I have also worked on image completion / inpainting, where one tries to fill a part of an image where information is not available (a missing image part, or an object deliberately removed from a scene). It would be very interesting to drive this work further, for example by extending it to the more complex Conditional Random Fields, instead of the currently used Markov Random Fields, to further improve the results.
The ability to produce good quality images and to remove objects at will from a scene, so that for example we are left with only a background image, has further applications in areas such as surveillance and tracking, event detection, etc. See next project as well. In other words, this is not just an exercise in doing research for the sake of it (nothing wrong with that, but I am application driven), but has real applications and commercial value. It could also be combined with automatic image segmentation, to automatically detect the object boundaries of objects that we wish to remove.
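To illustrate what 'being left with only a background image' means, here is a minimal sketch of a much simpler, standard alternative to inpainting: a per-pixel temporal median over several frames from a static camera. This is explicitly not the inpainting approach described above, and it only works when each pixel shows the background in more than half of the frames. The function name `median_background` and the tiny example frames are hypothetical.

```python
from statistics import median


def median_background(frames):
    """Estimate a background image as the per-pixel median over frames.

    `frames` is a list of equally sized 2D lists of grey values taken
    by a static camera. Objects that cover a pixel in fewer than half
    of the frames are removed by the median.
    """
    if not frames:
        raise ValueError("need at least one frame")
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]


# Three hypothetical 2x2 frames: a bright object (255) moves across a
# uniform background (10), covering a different pixel in each frame.
frames = [
    [[255, 10], [10, 10]],
    [[10, 255], [10, 10]],
    [[10, 10], [255, 10]],
]
background = median_background(frames)
```

Inpainting becomes necessary exactly where this sketch fails, for example when only a single image is available or when the object never moves.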
In this project, a student would work on visual tracking of people which has applications in surveillance (security) as well as crowd movement analysis (for example, in a sports stadium), to name just a few. Again, there is a lot of research going on in this area, but it would be a very interesting project to apply some of the algorithms we developed for the specific case of face tracking to a more generic problem of people/object tracking. Can we use the same kind of statistical modelling techniques? How can we adapt them to this more generic problem? What about tracking in a multi-camera environment? Can we build a system where there is a 'handover' of tracked persons as they leave one camera's field of view and enter another one? Can we detect unusual events (e.g. someone leaving a piece of luggage behind at an airport)?
This work could look at tracking individual people or objects, but could also investigate the movement patterns of large crowds, e.g. in a train/metro/bus station. Can one detect 'people moving against the stream', so as to identify unusual behaviour?
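As a rough illustration of the 'moving against the stream' question, the sketch below flags tracked people whose direction of motion opposes the mean crowd flow, using the cosine of the angle between each velocity and the mean velocity. It assumes that per-person velocity vectors are already available from a tracker; the function name `against_the_stream`, the threshold and the example velocities are hypothetical, and a single global mean is a strong simplification for a real station with several flows.

```python
import math


def against_the_stream(velocities, threshold=-0.5):
    """Return indices of tracks moving against the mean crowd flow.

    `velocities` is a list of (vx, vy) tuples, one per tracked person.
    A track is flagged when the cosine between its velocity and the
    mean velocity is below `threshold` (-1 means exactly opposite).
    """
    if not velocities:
        return []
    mean_vx = sum(vx for vx, _ in velocities) / len(velocities)
    mean_vy = sum(vy for _, vy in velocities) / len(velocities)
    mean_norm = math.hypot(mean_vx, mean_vy)
    if mean_norm == 0:
        return []  # no dominant flow direction to compare against
    flagged = []
    for index, (vx, vy) in enumerate(velocities):
        speed = math.hypot(vx, vy)
        if speed == 0:
            continue  # a person standing still has no direction
        cosine = (vx * mean_vx + vy * mean_vy) / (speed * mean_norm)
        if cosine < threshold:
            flagged.append(index)
    return flagged


# Hypothetical velocities: four people walk right, one walks left.
crowd = [(1.0, 0.0), (1.0, 0.1), (0.9, -0.1), (-1.0, 0.0), (1.0, 0.0)]
unusual = against_the_stream(crowd)
```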
I also want to point out an application that is perhaps not so obvious. I have done some work with the Australian Institute of Sport (supporting elite athletes), which has a keen interest in cutting-edge technology if it gives Australian sportsmen and sportswomen a competitive advantage. One area they have been interested in is developing video analysis software that automatically tracks players in ball sports on the field, captures and measures their paths, and provides quantitative data for analysis by the coaches (e.g. How many km did this player run? Which paths did he/she take most often? Which part of the pitch did they occupy?). The technology of people tracking and event detection goes well beyond surveillance-type applications!
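Once a player's path has been tracked, the coaching questions above reduce to simple computations. The following is a minimal sketch, assuming the tracker already delivers positions in metres in pitch coordinates; the function names `distance_covered_km` and `zone_occupancy`, the grid cell size and the example path are hypothetical.

```python
import math
from collections import Counter


def distance_covered_km(path_m):
    """Total path length in km for a list of (x, y) positions in metres."""
    metres = sum(math.dist(a, b) for a, b in zip(path_m, path_m[1:]))
    return metres / 1000.0


def zone_occupancy(path_m, cell_size_m=10.0):
    """Fraction of samples spent in each grid cell of the pitch.

    Cells are indexed by (column, row); positions are assumed to be
    sampled at a constant frame rate, so sample counts reflect time.
    """
    cells = Counter(
        (int(x // cell_size_m), int(y // cell_size_m)) for x, y in path_m
    )
    total = sum(cells.values())
    return {cell: count / total for cell, count in cells.items()}


# Hypothetical tracked positions of one player, in metres.
path = [(0.0, 0.0), (30.0, 40.0), (30.0, 100.0)]
km_run = distance_covered_km(path)
occupancy = zone_occupancy(path, cell_size_m=50.0)
```

The hard research problem is of course the tracking itself; the analysis on top of reliable tracks is comparatively straightforward.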
Finally, there is a possible project in automatically analysing videos (e.g. movies, security camera tapes) and providing a video synopsis that shows a condensed view of the entire video, or highlights only the sections where something 'unusual' happens, or allows the user to select specific parts of a video to be shown based on the appearance of a particular person. Obviously, to be able to do that, one must be able to identify people or events occurring in the video, track people, then select representative segments of the video, and so on. Zisserman's Video Google paper gives a good idea of what this is about. To some extent, this project combines many of the other projects (as you can see, these projects are not necessarily discrete, non-overlapping entities, but rather project ideas that can be changed, put together, divided differently etc.). Here, we would look at combining face detection, face tracking, people tracking, face recognition, possibly facial expression recognition, age and gender recognition etc. in one system, so that the user could search for particular people (e.g. an actor), particular scenes, or all people who have a certain characteristic (e.g. all men in scenes where they smile).
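The search step at the end of such a pipeline can be sketched as a query over per-frame annotations. This is a minimal illustration, assuming that detection, recognition and attribute estimation have already produced one record per person per frame; the `Detection` record, its fields, the function `find_segments` and the example data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    frame: int        # frame index in the video
    person_id: str    # identity assigned by face recognition
    gender: str       # output of gender recognition
    expression: str   # output of facial expression recognition


def find_segments(detections, gender, expression, max_gap=1):
    """Return (start_frame, end_frame) runs containing a matching person.

    Matching frames that are at most `max_gap` frames apart are merged
    into one segment, which can then be shown in the synopsis.
    """
    frames = sorted(
        {d.frame for d in detections
         if d.gender == gender and d.expression == expression}
    )
    segments = []
    for frame in frames:
        if segments and frame - segments[-1][1] <= max_gap:
            segments[-1] = (segments[-1][0], frame)  # extend current run
        else:
            segments.append((frame, frame))          # start a new run
    return segments


# Hypothetical annotations for a short clip.
annotations = [
    Detection(1, "actor_a", "male", "smile"),
    Detection(2, "actor_a", "male", "smile"),
    Detection(2, "actor_b", "female", "smile"),
    Detection(3, "actor_a", "male", "smile"),
    Detection(4, "actor_a", "male", "neutral"),
    Detection(10, "actor_c", "male", "smile"),
    Detection(11, "actor_c", "male", "smile"),
]
smiling_men = find_segments(annotations, gender="male", expression="smile")
```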