# A Survey of Action Prediction Models

In this post, we examine several studies on future event prediction. Various computer vision techniques can be used in event prediction, such as instance pose estimation, feature extraction, and motion trajectories. Recent papers show a trend toward Variational Autoencoders (VAEs) with amortized inference, in which a neural network with shared parameters approximates the posterior distribution. You can check Jaan’s Tutorial on Variational Autoencoders to familiarize yourself with VAEs.

Long-term prediction, also called intention prediction, infers future actions from observed actions. The task matters because anticipating future actions lets us take advantage of them or take precautions against probable future events. However, predicting future actions (events) is hard because the future is uncertain, which makes any prediction error-prone. In information-theoretic terms, this uncertainty means high entropy in the distribution over possible future actions and their timings. Successful future prediction has applications in many domains, such as predicting future states of a disease in medicine, forecasting and recommending in recommendation systems, forecasting in the stock market, trend projection in the social sciences, and predicting pedestrian actions for autonomous cars.

It is safe to say that future prediction has been studied since at least the mid-90s. The approaches are mainly probabilistic methods that incorporate the uncertainty in the future, including Bayesian approaches ranging from Hidden Markov Models to Variational Autoencoders. The main challenge in future action prediction is learning this uncertainty by fitting a “good” model. For the last 10 years, deep learning frameworks have been used in probabilistic models that are trained end to end. In this blog post, we see how previous works approach future event prediction by summarizing the motivations of selected papers. Then, we focus specifically on long-term asynchronous action prediction with Variational Autoencoders (VAEs) by presenting the state-of-the-art model in this area.

## What is action prediction?

There are two types of action prediction tasks: short-term prediction and long-term prediction [1]. For the sake of completeness, we define both of them, although we focus on the latter. Short-term prediction is mainly about inferring action labels from temporally incomplete action videos. The model therefore sees the initial frames of the action to be predicted as a clue.

The latter, intention prediction or long-term prediction, infers future actions based on a history of actions. The model has the sequence of past actions as a clue, where the history may be synchronous (each input is separated by a fixed time interval) or asynchronous (the time intervals between actions are not constant).

Problem definition: The input for our task is a sequence of actions $x_{1:n} = (x_{1}, \ldots, x_{n})$ in which $x_{n}$ is the $n$-th action. Each action $x_{n} = (a_{n}, \tau_{n})$ is represented by an action category $a_{n} \in \{1, 2, \ldots, K\}$ and an inter-arrival time $\tau_{n} \in \mathbb{R}^{+}$. The inter-arrival time is the difference between the starting times of the actions $x_{n-1}$ and $x_{n}$. The goal is to produce a distribution over the action category $a_{n}$ and the inter-arrival time $\tau_{n}$ given the sequence of actions $x_{1:n-1}$.
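To make the notation concrete, here is a minimal Python sketch of how a list of timestamped actions could be turned into $(a_{n}, \tau_{n})$ pairs. The `Action` class, the `to_action_sequence` helper, and the choice to measure the first inter-arrival time from a `start_time` of zero are illustrative assumptions, not part of any of the surveyed papers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Action:
    category: int         # a_n, an integer label in {1, ..., K}
    inter_arrival: float  # tau_n, time since the previous action started


def to_action_sequence(events: List[Tuple[int, float]],
                       start_time: float = 0.0) -> List[Action]:
    """Convert (category, absolute start time) pairs into (a_n, tau_n) pairs.

    Assumption: the first inter-arrival time is measured from `start_time`.
    """
    sequence = []
    previous = start_time
    for category, start in sorted(events, key=lambda e: e[1]):
        tau = start - previous
        if tau <= 0:
            raise ValueError("inter-arrival times must be strictly positive")
        sequence.append(Action(category, tau))
        previous = start
    return sequence


# Three actions with irregular (asynchronous) start times.
events = [(3, 1.5), (1, 4.0), (3, 4.5)]
sequence = to_action_sequence(events)

# The model would see x_{1:n-1} and predict a distribution over x_n.
history, target = sequence[:-1], sequence[-1]
print(history, target)
```

Note that the gaps between actions differ (1.5, 2.5, and 0.5 time units), which is exactly what makes the sequence asynchronous.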

## Early Papers

A Self-Correcting Point Process, 1979 [2]: This paper studies a point process that stays very close to a deterministic rate $\rho$, that is, one that produces approximately $\rho t$ points in the time interval $(0,t]$ for all $t$. The authors formulate this mathematically by making the instantaneous rate of the process at time $t$ a suitable function of $n - \rho t$, where $n$ is the number of points in $[0, t]$. They then generalize the point process into two types: Markovian and arbitrary. They analyze both types and derive the mean number of points in a given interval, along with its variance.
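The self-correcting behavior can be illustrated with a rough simulation. The sketch below is not taken from the paper: it assumes a rate of the form $\rho \exp(-\beta (n - \rho t))$ as one possible "suitable function" of $n - \rho t$, and it approximates the continuous-time process with small Bernoulli steps of length `dt`. The function name and all parameter values are hypothetical.

```python
import math
import random


def simulate_self_correcting(rho, beta, horizon, dt=0.01, seed=0):
    """Discrete-time approximation of a self-correcting point process.

    Assumed rate (illustrative choice): rho * exp(-beta * (n - rho * t)),
    where n is the number of points generated so far.
    """
    rng = random.Random(seed)
    times = []
    for i in range(round(horizon / dt)):
        t = i * dt
        deviation = len(times) - rho * t       # n - rho * t
        rate = rho * math.exp(-beta * deviation)
        # Probability of one point in [t, t + dt), clipped to 1.
        if rng.random() < min(1.0, rate * dt):
            times.append(t + dt)
    return times


points = simulate_self_correcting(rho=1.0, beta=1.0, horizon=200.0)
print(len(points))  # should stay close to rho * horizon = 200
```

When the process falls behind schedule ($n < \rho t$) the rate increases, and when it runs ahead the rate drops, so the count tends to track $\rho t$ much more tightly than a Poisson process with the same rate would.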

Spatial-temporal Event Prediction: a New Model, 1998 [3]: In this paper, the authors propose a new model for predicting the probability of occurrence of spatial-temporal random events, based on the theory of point patterns. The model considers event characteristics in both space and time. As a preliminary step, it also uses a clustering-based criterion to identify the key features that explain the spatial pattern.

A Bayesian Approach to Event Prediction, 2003 [4]: Even though this is not computer vision research, alarm systems are related to the topics of future event prediction and anomaly detection. In this paper, the authors propose a Bayesian predictive approach to event prediction. Their novel contributions can be summarized as follows: first, variation in the model parameters is incorporated in the analysis; second, the proposed model allows ‘on-line prediction’, such that posterior probabilities and predictions are updated at each time point.
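The idea of on-line prediction, where the posterior and the prediction are refreshed at every time point, can be sketched with a much simpler model than the one in the paper. The example below uses a Beta-Bernoulli model as a stand-in; the function names and the observation sequence are made up for illustration.

```python
def update(alpha, beta, event_occurred):
    """Posterior update of a Beta(alpha, beta) belief after one observation."""
    if event_occurred:
        return alpha + 1.0, beta
    return alpha, beta + 1.0


def predictive(alpha, beta):
    """Posterior predictive probability that the event occurs next."""
    return alpha / (alpha + beta)


alpha, beta = 1.0, 1.0               # uniform prior over the event probability
observations = [0, 0, 1, 0, 1, 1]    # 1 = event occurred at that time point

for obs in observations:
    # Predict first, then update the belief with what was actually observed.
    print(f"predicted p(event) = {predictive(alpha, beta):.3f}, observed {obs}")
    alpha, beta = update(alpha, beta, bool(obs))
```

The parameter uncertainty is carried along in `alpha` and `beta` rather than collapsed to a point estimate, which is the sense in which variation in the model parameters enters the prediction.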

A Data-driven Approach for Event Prediction, 2010 [5]: The authors propose a simple method to identify videos with unusual events in a large collection of short video clips. The approach is purely based on computer vision techniques. The model uses a distribution of expected motion which is built by utilizing the videos that are similar to the query. In the end, this approach can be used for future event prediction and anomaly detection.

## Papers from 2010-Present

Parsing Video Events with Goal inference and Intent Prediction, 2011 [6]: In this paper, the authors propose an event parsing algorithm to understand events. The parsing is based on a Stochastic Context Sensitive Grammar (SCSG), which represents the hierarchical parts of events and the temporal relations between the sub-events. The alphabet of the SCSG consists of atomic actions (very similar to a vocabulary in NLP tasks), defined by the poses of the agents and their interactions with surrounding objects in the scene. One of the contributions of the paper is that the model infers the goals of the agents and predicts their intents through a top-down process.

Activity Forecasting, 2012 [7]: This paper proposes an approach that models the effect of the physical environment on the choice of human actions, which allows the model to forecast activities accurately. This is accomplished mainly through semantic scene understanding and optimal control theory, which is used to find a control law for a dynamical system over a period of time such that an objective function is optimized. The model also integrates several other key elements of activity analysis, such as destination forecasting, sequence smoothing, and transfer learning. Experiments show that the model accurately predicts distributions over the future actions of individuals.

Anticipating Visual Representations from Unlabeled Video, 2016 [8]: This paper presents a framework that learns human actions by learning the temporal dynamics, or structure, of unlabeled video. The key idea is to predict the encoded representations of future images with ConvNets and fully connected layers; these encodings are created via transfer learning, e.g., from AlexNet. Future actions are then predicted by applying recognition algorithms to the predicted future encodings. However, the method works in a synchronous manner, so predictions are made for fixed time points in the future.

Recurrent Marked Temporal Point Processes, Embedding Event History to Vector (APP-LSTM), 2016 [9]: This paper focuses on the time interval between two events and argues that this characteristic makes event data different from i.i.d. or time-series data, where time and space are treated as indexes rather than random variables. The authors propose the Recurrent Marked Temporal Point Process (RMTPP) to model the event timings and markers simultaneously. The key idea is to view the intensity function of a temporal point process as a non-linear function of the history and to use a recurrent neural network to automatically learn a representation of the influences from the event history. They show that when the true models have parametric specifications, RMTPP can learn the dynamics of such models without knowing the actual parametric forms. When the true models are unknown, RMTPP can also learn the dynamics, and it achieves better predictive performance than parametric alternatives based on particular prior assumptions.
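To show how an intensity function that depends on the history translates into a distribution over the next event time, here is a small sketch. It assumes an exponential-form intensity $\lambda(t) = \exp(s + w t)$, where $t$ is the time since the last event and $s$ is a scalar score that a recurrent network would compute from the event history. The name `history_score`, the parameter values, and this specific functional form are assumptions for illustration, not a reproduction of the paper's implementation.

```python
import math


def intensity(t, history_score, w):
    """Assumed conditional intensity: lambda(t) = exp(history_score + w * t)."""
    return math.exp(history_score + w * t)


def log_density(t, history_score, w):
    """Log density of the next event time under the assumed intensity.

    Uses log f(t) = log lambda(t) - integral_0^t lambda(s) ds,
    where the integral has a closed form for w != 0.
    """
    integral = (math.exp(history_score + w * t) - math.exp(history_score)) / w
    return history_score + w * t - integral


# Sanity check: the density should integrate to (approximately) one.
dt = 0.001
total = sum(math.exp(log_density(i * dt, history_score=0.0, w=1.0)) * dt
            for i in range(10000))
print(round(total, 3))
```

In a trained model, `history_score` would change after every event, so the distribution over the next inter-arrival time adapts to what has been observed so far.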

Learning to Generate Long-term Future via Hierarchical Prediction, 2017 [10]: The authors use a hierarchical approach for making long-term predictions of future frames. The pipeline consists of three steps: estimating the high-level structure in the input frames, predicting how that structure evolves in the future, and constructing the future frames from a single past frame and the predicted high-level structure. The key advantage is that future frames are constructed without observing any of the pixel-level predictions. The model is a combination of an LSTM and encoder-decoder convolutional neural networks, which can independently predict the video structure and generate future frames. Although this paper does not focus on possible future actions, it is a good study that uses different computer vision elements, such as pose estimation, to predict future frames.

What will Happen Next? Forecasting Player Moves in Sports Videos, 2017 [11]: In this work, the authors develop a generic framework for forecasting future events in team sports videos (water polo and basketball) from visual data. They use two types of visual input: the raw input and an overhead representation of the raw input. After creating a nine-dimensional feature vector for each player, they estimate each player’s probability of receiving the ball with a random forest classifier.

When will you do what? - Anticipating Temporal Occurrences of Activities (TD-LSTM), 2018 [12]: In this paper, the authors propose two methods, one based on a CNN and one on an RNN, to predict a considerably large number of future actions and their durations. Both are trained to learn future video labels based on previously seen content. They show that the proposed models can generate accurate predictions of future actions even when given a long video with a vast number of different actions and noise.

Time Perception Machine, Temporal Point Processes for the When, Where and What of Activity Prediction, 2018 [13]: In this paper, the authors develop novel models for learning the temporal distribution of human activities in streaming data (e.g., videos and person trajectories). They propose an integrated framework of neural networks and temporal point processes for predicting when the next activity will happen. Because point processes are limited to taking event frames as input, they use a mechanism that extracts features from the frames of interest while preserving the rich information in the remaining frames. Furthermore, they extend the model to a joint estimation framework that predicts the timing, spatial location, and category of the activity simultaneously, answering the 3 W’s of activity prediction: when, where, and what.

## State of the Art Model

A Variational Auto-Encoder Model for Stochastic Point Processes (APP-VAE), 2019 [14]: The paper proposes a probabilistic generative model for action sequences. The main idea is that the Action Point Process VAE (APP-VAE), a variational autoencoder, captures the distribution over the times and categories of action sequences. The framework has the advantage of generalizability thanks to its use of latent representations and non-linear functions to parameterize the distributions. These distributions are over the event that is likely to occur next and the inter-arrival time until that event occurs.

One of the novelty claims of the paper is showing how effective a VAE is as a base model even with point data, that is, an action category and the time until the action. Another is that the proposed model produces an approximate posterior distribution from which the latent variables of the input sequence are sampled. Using the posterior distribution, one can analyze the similarities between multiple action sequences and create new scenarios by tweaking the latent variable.

Previous models for action prediction typically use regularly spaced data as input, which may feed irrelevant information to the model. In contrast, the input to this model is asynchronous, meaning that the time intervals between events are not necessarily uniform, which is also the reason for predicting the time intervals between actions. The reasoning is that events of interest, the elements of our event space, are generally infrequent. With asynchronous input, both the category and the time interval are predicted through a generated probability distribution. The authors’ main assumption is that probabilistic models are better suited to capturing the uncertainty in the future. They work with a VAE, which introduces randomness through the latent variable $z$, combined with LSTMs [15] to encode the sequences.

The key idea is to learn both

• Prior $p_{\psi}\left(z_{n+1} | x_{1: n}\right)=\mathcal{N}\left(\mu_{\psi_{n+1}}, \sigma_{\psi_{n+1}}^{2}\right)$
• Posterior $q_{\phi}\left(z_{n} | x_{1: n}\right)=\mathcal{N}\left(\mu_{\phi_{n}}, \sigma_{\phi_{n}}^{2}\right)$

instead of using a fixed prior distribution, which is generally chosen to be a standard Normal distribution. I believe this is the main trick in the paper: because the prior is not fixed, the temporal dependencies in the predictions are preserved. Moreover, the prior network can be used to predict the latent representation of the future action at test time. The generated latent representation is then decoded to obtain a distribution over the category and inter-arrival time of the next event. Compared to previous works, the LSTM and the VAE are combined without imposing a fixed prior. The parameters of the model can be jointly optimized by maximizing the ELBO, which is defined as:

$$\mathcal{L}_{\theta, \phi, \psi}\left(x_{1: N}\right)=\sum_{n=1}^{N}\left(\mathbb{E}_{q_{\phi}\left(z_{n} | x_{1: n}\right)}\left[\log p_{\theta}\left(x_{n} | z_{n}\right)\right] - D_{K L}\left(q_{\phi}\left(z_{n} | x_{1: n}\right) \| p_{\psi}\left(z_{n} | x_{1: n-1}\right)\right)\right)$$
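A single term of this ELBO can be sketched in plain Python. The sketch below replaces the outputs of the prior, posterior, and decoder networks with hand-picked numbers, and it assumes a categorical distribution over the action category and an exponential distribution over the inter-arrival time for the decoder. All function names and numeric values are illustrative assumptions, and the expectation is stood in for by a single fixed likelihood evaluation rather than Monte Carlo samples.

```python
import math
import random


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)), summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (math.log(sp / sq)
                  + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
                  - 0.5)
    return total


def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def log_likelihood(action, tau, category_probs, rate):
    """log p(x_n | z_n), assuming categorical action and exponential time.

    `action` is a 0-based index here, unlike the 1-based labels in the text.
    """
    return math.log(category_probs[action]) + math.log(rate) - rate * tau


def elbo_term(recon_log_lik, mu_q, sigma_q, mu_p, sigma_p):
    """One summand of the ELBO: reconstruction term minus KL to learned prior."""
    return recon_log_lik - kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)


# Hand-picked stand-ins for network outputs at step n (hypothetical values).
mu_q, sigma_q = [0.0], [1.0]   # posterior q(z_n | x_{1:n})
mu_p, sigma_p = [1.0], [1.0]   # learned prior p(z_n | x_{1:n-1})
recon = log_likelihood(action=1, tau=0.5,
                       category_probs=[0.1, 0.7, 0.2], rate=2.0)
print(elbo_term(recon, mu_q, sigma_q, mu_p, sigma_p))

# At test time, z would be sampled from the prior and passed to the decoder.
z = sample_latent(mu_p, sigma_p, random.Random(0))
```

The KL term is zero only when the posterior matches the learned prior, so maximizing the ELBO pushes the prior network toward latent codes that the posterior would have produced after seeing the actual next action.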

Written on September 30, 2020