Activity Grammars for Temporal Action Segmentation

Dayoung Gong*  Joonseok Lee*  Deunsol Jung  Suha Kwak  Minsu Cho
Pohang University of Science and Technology (POSTECH)
{dayoung.gong, jameslee, deunsol.jung, suha.kwak, mscho}@postech.ac.kr
http://cvlab.postech.ac.kr/research/KARI
*Equal contribution.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.

1 Introduction

Human activities in videos do not proceed by accident; they are structured, governed by generative rules imposed by the goal of the activity, the properties of individual actions, the physical environment, and so on. Comprehending such a compositional structure of multi-granular semantics in human activity poses a significant challenge in video understanding research. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. Recent methods based on deep neural networks [25, 9, 45, 2, 15, 16, 1] have shown remarkable improvement in learning temporal relations of actions in an implicit manner, but they often produce out-of-context errors that reveal a limited capacity to capture the intricate structure of human activity, and the scarcity of annotated data exacerbates the issue in training.

In this work, we address the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. Grammar is a natural and powerful way of explicitly representing the hierarchical structure of languages [14] and can also be applied to express the structure of activities. Despite the extensive body of grammar-based research for video understanding [23, 24, 33, 35, 32], none of these approaches have successfully integrated recursive rules, which are indispensable for expressing complex and realistic structures found in action phrases and activities. To achieve this, we introduce a novel activity grammar induction algorithm, Key-Action-based Recursive Induction (KARI), that extracts a powerful probabilistic context-free grammar while capturing the characteristics of the activity. Since an activity is composed of multiple actions, each activity exhibits a distinctive temporal structure based on pivotal actions, setting it apart from other activities. The proposed grammar induction
enables recursive rules with flexible temporal orders, which leads to powerful generalization capability. We also propose a novel activity grammar evaluation framework to evaluate the generalization and discrimination power of the proposed grammar induction algorithm. To incorporate the induced activity grammar into temporal action segmentation, we develop an effective parser, dubbed BEP, which searches for the optimal rules according to the classification outputs generated by an off-the-shelf action segmentation model. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure.

The main contributions of this paper can be summarized as follows:
• We introduce a novel grammar induction algorithm that extracts a powerful context-free grammar with recursive rules based on key actions and temporal dependencies.
• We develop an effective parser that efficiently handles recursive rules of context-free grammars by using breadth-first search and pruning.
• We propose a new grammar evaluation framework to assess the generalization and discrimination capabilities of the induced activity grammars.
• We show that the proposed method significantly improves the performance of temporal action segmentation models, as demonstrated through a comprehensive evaluation on two benchmarks, Breakfast and 50 Salads.

2 Related work

Grammar for activity analysis. Grammar is an essential tool to represent the compositional structure of language [14] and has been mainly studied in the context of natural language processing (NLP) [21, 22, 20, 37]; it has also been extensively studied in various other research areas [28, 29, 32, 8, 43, 13, 42, 11, 27, 12]. Similarly, a grammatical framework can be used to express the structure of activities. Several works [23, 24, 33, 35] have defined context-free grammars based on possible temporal transitions between actions for action detection and recognition. Vo and Bobick [41] propose a stochastic grammar to model a hierarchical representation of activity based on AND-rules and OR-rules. Richard et al. [34] propose a context-free grammar defined on action sequences for weakly-supervised temporal action segmentation. Qi et al. [30, 32, 31] utilize a grammar induction algorithm named ADIOS [37] to induce a grammar from an action corpus. However, none of the proposed grammars for activity analysis includes recursive rules, which are a fundamental factor in expressing repetitions of actions or action phrases. In this paper, we propose a novel activity grammar for temporal action segmentation based on key actions and temporal dependencies between actions, taking the recursive temporal structure into account.

Temporal action segmentation (TAS). Various methods have been proposed to address the task. Early work utilizes temporal sliding windows [36, 19] to detect action segments, and language-based methods [24, 23] have been proposed to utilize a temporal hierarchy of actions during segmentation. Recently, a deep-learning-based model named the temporal convolutional network (TCN) has been proposed with an encoder-decoder architecture [25, 9]. Moreover, transformer-based models [45, 2] have recently been introduced to leverage global temporal relations between actions based on self-attention and cross-attention mechanisms [40]. Further studies improve the accuracy of temporal action segmentation on top of existing models [9, 45].
Huang et al. [15] introduce a network module named the Graph-based Temporal Reasoning Module (GTRM) that is applied on top of baseline models to learn temporal relations of action segments. Ishikawa et al. [16] suggest an action segment refinement framework (ASRF) that divides the task into frame-wise action segmentation and boundary regression, refining frame-level classification results with the predicted action boundaries. Gao et al. [10] propose a global-to-local search scheme to find appropriate receptive field combinations instead of heuristic receptive fields. Ahn and Lee [1] recently propose a hierarchical action segmentation refiner (HASR), which refines segmentation results by applying multi-granular context information from videos. Souri et al. [38] propose FIFA, a fast approximate inference method for temporal action segmentation and alignment that replaces dynamic programming. Other studies [5, 6] reformulate TAS as a cross-domain problem with different domains of spatio-temporal variations, introducing self-supervised temporal domain adaptation. Xu et al. [44] propose differentiable temporal logic (DTL), a model-agnostic framework that imposes temporal constraints on neural networks. In this paper, we propose a neuro-symbolic approach where the activity grammar induced by the proposed grammar induction algorithm guides a temporal action segmentation model to refine segmental errors through parsing.

Figure 1: Overall pipeline of the proposed method. (a) KARI induces an activity grammar G from action sequences in the training data, (b) BEP parses neural predictions Y from the off-the-shelf temporal action segmentation model given a video F by using the KARI-induced grammar G, and (c) the final output of optimal action sequences and lengths (a*, l*) is obtained through segmentation optimization (Seg. Opt.). It is best viewed in color.

3 Our approach

Given a video of T frames F = [F_1, F_2, ..., F_T] and a predefined set of action classes A, the goal of temporal action segmentation is to translate the video into a sequence of actions a = [a_1, a_2, ..., a_N] and their associated frame lengths l = [l_1, l_2, ..., l_N], where N is unknown, a_i ∈ A for 1 ≤ i ≤ N, a_i ≠ a_{i+1} for 1 ≤ i ≤ N − 1, and \sum_{i=1}^{N} l_i = T.² The resulting output (a, l) indicates that the video consists of N segments, and each pair (a_i, l_i) represents the action and length of the i-th segment.

² In fact, this form of output is equivalent to that of frame-level action classification, which predicts an action class for each frame; the sequence of frame-level actions is easily converted to (a, l) and vice versa.

In this work, we introduce an activity grammar that guides neural predictions for temporal action segmentation through parsing. We propose a novel activity grammar induction algorithm named KARI and an efficient parser called BEP. The overall pipeline of the proposed method consists of three steps, as illustrated in Fig. 1. First of all, KARI induces an activity grammar from action sequences in the training data. Using the KARI-induced grammar, BEP then takes the frame-level class prediction Y ∈ R^{T×|A|} from the off-the-shelf temporal action segmentation model [45, 9] and produces a grammar-consistent action sequence a*. Finally, segmentation optimization is performed to obtain optimal action lengths l* based on a* and Y.
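As a side illustration of the equivalence noted in the footnote above, the following sketch converts frame-level labels into the segment representation (a, l) and back; the function names and the toy clip are ours, chosen only for this example.

```python
from itertools import groupby

def frames_to_segments(frame_labels):
    """Collapse frame-level labels into (a, l): action classes and segment lengths."""
    actions, lengths = [], []
    for action, run in groupby(frame_labels):
        actions.append(action)
        lengths.append(len(list(run)))
    return actions, lengths

def segments_to_frames(actions, lengths):
    """Expand (a, l) back into a frame-level label sequence of length T = sum(l)."""
    frames = []
    for action, length in zip(actions, lengths):
        frames.extend([action] * length)
    return frames

# A hypothetical 10-frame clip with four segments.
frames = ["take cup", "take cup", "pour coffee", "pour coffee", "pour coffee",
          "pour milk", "pour milk", "pour milk", "stir coffee", "stir coffee"]
a, l = frames_to_segments(frames)
print(a, l)  # ['take cup', 'pour coffee', 'pour milk', 'stir coffee'] [2, 3, 3, 2]
assert segments_to_frames(a, l) == frames and sum(l) == len(frames)
```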
In the following, we introduce the activity grammar as a probabilistic context-free grammar (Section 3.1), present KARI (Section 3.2) and BEP (Section 3.3), and describe a segmentation optimization method for final outputs (Section 3.4).

3.1 Activity grammar

We define the activity grammar as a probabilistic context-free grammar (PCFG) [17], designed to derive diverse action sequences pertaining to a specific activity class. The activity grammar, denoted as G = (V, Σ, P, S), follows the conventional PCFG, which consists of four components: a finite set of variables V, a finite set of terminals Σ, a finite set of production rules P, and the start symbol S ∈ V. In our context, the set of terminals Σ becomes the set of action classes A, and the production rules P are used to generate action sequences from the start variable S. We use two types of production rules, 'AND' and 'OR', defined as follows:

AND: V → α, where V ∈ V and α ∈ (Σ ∪ V)*,   (1)
OR: V → V_1 [p_1] | V_2 [p_2] | ... | V_n [p_n], where V, V_1, ..., V_n ∈ V.   (2)

The AND rule replaces a head variable V with a sequence of variables and terminals α, determining the order of the terminals and variables. In contrast, the OR rule converts a head variable V to a sub-variable V_i with the probability p_i, providing multiple alternatives for replacement; '|' denotes the 'OR' operation. These two types of rules allow us to generate action sequences hierarchically.

Figure 2: Example of activity grammar induction of KARI. (a) Example action sequences are provided with 'pour coffee' as the key action and N^key set to 1: a_1 = [pour coffee, pour milk], a_2 = [pour coffee, spoon sugar, pour milk], a_3 = [take cup, pour coffee, pour milk, spoon sugar, stir coffee], a_4 = [take cup, pour coffee, spoon sugar, stir coffee]. (b) Each action sequence, e.g., a_3, is segmented into sub-sequences based on the key action. (c) All action sub-sequences a^Ω are collected into the sets of sub-sequences D^L = {ϵ, [take cup]}, D^M = {[pour coffee]}, and D^R = {[pour milk], [spoon sugar, pour milk], [pour milk, spoon sugar, stir coffee], [spoon sugar, stir coffee]}. (d) The action sets A^L = {take cup}, A^M = {pour coffee}, and A^R = {pour milk, spoon sugar, stir coffee} contain all the actions occurring in D^Ω. (e) Temporally independent actions are grouped, and the action groups are arranged by their temporal dependencies into the action group sequences d^L = [{take cup}], d^M = [{pour coffee}], and d^R = [{pour milk, spoon sugar}, {stir coffee}]. (f) The resultant KARI-induced activity grammar: S → V^L V^M V^R; V^L → 'take cup' | ϵ; V^M → 'pour coffee'; V^R → V^R_1 V^R_2; V^R_1 → 'pour milk' V^R_1 | 'spoon sugar' V^R_1 | ϵ; V^R_2 → 'stir coffee' V^R_2 | ϵ. For simplicity, we omit the probabilities; it is best viewed in color.

3.2 Grammar induction: Key-Action-based Recursive Induction (KARI)

Grammar induction refers to the process of learning grammars from data [37].
In our context, it takes action sequences of a specific activity in the training set and produces an activity grammar that is able to parse action sequences of the activity; the induced grammar should be able to parse unseen sequences of the activity as well as the sequences in the training set for generalization. To obtain an effective activity grammar avoiding under-/over-generalization, we introduce two main concepts for grammar induction: key action and temporal dependency. The key actions for a specific activity are those consistently present in every action sequence from the training dataset; specifically, the top N^key most frequently occurring actions among these are selected as the key actions. The hyperparameter N^key, the number of key actions, affects the degree of generalization achieved by the induced grammar. The temporal dependency refers to the relevance of temporal orders across actions. Temporally independent actions do not occur in a specific temporal order. This concept of temporal dependency can also be extended to groups of actions, meaning that some groups of actions can be temporally dependent on others.

We induce an activity grammar based on the key actions and the temporal dependency. Action sequences are divided into sub-sequences using the key actions as reference points, and the temporal dependencies between actions within the sub-sequences are established; temporally dependent actions are represented using AND rules (Eq. 1), while temporally independent actions are expressed with OR rules (Eq. 2). We give an example of grammar induction in Fig. 2; four action sequences are given in Fig. 2a, where the action class 'pour coffee' is chosen as the key action with the number of key actions N^key set to 1.

Given the action sequences from the training dataset D, we begin grammar induction by identifying a set of key actions K ⊂ A with the pre-defined hyperparameter N^key. Using the key actions, each action sequence a ∈ D is divided into three parts: a = [a^L, a^M, a^R]. The sub-sequences a^L, a^M, and a^R denote the portions of the original action sequence that occur before, between, and after the key actions, respectively; the sub-sequence a^M starts from the first key action and includes up to the last key action in K. An example in Fig. 2b shows that the action sequence a_3 is divided into three sub-sequences using the key action. For notational convenience, we use the superscript Ω ∈ {L, M, R} to denote one of the three parts. All action sub-sequences a^Ω in a specific part Ω are grouped into a corresponding set of sub-sequences D^Ω (cf. Fig. 2c). The action set A^Ω ⊆ A is then defined to contain all the actions occurring in D^Ω (cf. Fig. 2d). To determine the temporal dependencies among the actions of A^Ω, pairwise temporal orders are considered as follows: if one action always occurs before the other in D^Ω, then the two actions are temporally dependent, and otherwise temporally independent. Based on these concepts, we construct the action group sequence d^Ω by collecting temporally independent actions into an action group and arranging such action groups according to their temporal dependencies (cf. Fig. 2e).
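To make the induction steps above concrete, here is a simplified sketch of key-action selection, sub-sequence partitioning, and action grouping on the Fig. 2 toy data. It is an illustration under our own naming and with naive tie-breaking, not the authors' implementation.

```python
from collections import Counter

def select_key_actions(sequences, n_key):
    """Key actions: actions present in every training sequence; keep the N^key most frequent."""
    common = set.intersection(*[set(seq) for seq in sequences])
    counts = Counter(a for seq in sequences for a in seq if a in common)
    return {a for a, _ in counts.most_common(n_key)}

def split_lmr(seq, key_actions):
    """Split a sequence into the parts before, between, and after the key actions (a^L, a^M, a^R)."""
    idx = [i for i, a in enumerate(seq) if a in key_actions]
    first, last = idx[0], idx[-1]
    return seq[:first], seq[first:last + 1], seq[last + 1:]

def action_group_sequence(subsequences):
    """Group temporally independent actions and order the groups by temporal dependency."""
    actions = sorted({a for s in subsequences for a in s})

    def always_before(x, y):
        pairs = [(s.index(x), s.index(y)) for s in subsequences if x in s and y in s]
        return bool(pairs) and all(i < j for i, j in pairs)

    def independent(x, y):
        return not always_before(x, y) and not always_before(y, x)

    groups = []
    for a in actions:
        for g in groups:
            if all(independent(a, b) for b in g):
                g.add(a)
                break
        else:
            groups.append({a})
    # order groups by the earliest position at which any of their actions occurs
    groups.sort(key=lambda g: min(min(s.index(a) for a in g if a in s)
                                  for s in subsequences if g & set(s)))
    return groups

sequences = [["pour coffee", "pour milk"],
             ["pour coffee", "spoon sugar", "pour milk"],
             ["take cup", "pour coffee", "pour milk", "spoon sugar", "stir coffee"],
             ["take cup", "pour coffee", "spoon sugar", "stir coffee"]]
keys = select_key_actions(sequences, n_key=1)          # {'pour coffee'}
rights = [split_lmr(s, keys)[2] for s in sequences]    # the a^R sub-sequences, i.e., D^R
print(action_group_sequence(rights))                   # [{'pour milk', 'spoon sugar'}, {'stir coffee'}]
```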
In the following, we describe how to construct the production rules P of the activity grammar G.

Start rule. We first create the rule for the start variable S:

S → V^L V^M V^R,   (3)

where V^L, V^M, and V^R are variables used to derive the left, middle, and right parts of the action sequence, respectively.

Rule for the variable V^Ω. For V^Ω, Ω ∈ {L, R}, we construct an AND rule of action groups based on the action group sequence d^Ω:

V^Ω → V^Ω_1 V^Ω_2 ... V^Ω_{|d^Ω|},   (4)

where the variable V^Ω_i represents the i-th action group d^Ω_i in the action group sequence d^Ω. Since actions in an action group are considered temporally independent, we construct an OR rule for each action group:

V^Ω_i → d^Ω_{i,1} V^Ω_i [p^Ω_{i,1}] | d^Ω_{i,2} V^Ω_i [p^Ω_{i,2}] | ... | d^Ω_{i,|d^Ω_i|} V^Ω_i [p^Ω_{i,|d^Ω_i|}] | ϵ [p^Ω_{i,ϵ}],   (5)

where d^Ω_{i,j} denotes the j-th action of the action group d^Ω_i. The variable V^Ω_i yields d^Ω_{i,j} V^Ω_i with the probability p^Ω_{i,j}. This rule can be used recursively to proceed to the variable V^Ω_i in the next step. This recursive structure allows for repeated selection of actions within the same action group, leading to the generation of diverse action sequences, which is effective for generalization. To avoid an infinite loop of the recursion, the empty string ϵ with the escape probability p^Ω_{i,ϵ} is added to Eq. 5. For the details, refer to the transition probability p^Ω_{i,j} and the escape probability p^Ω_{i,ϵ} in Appendix A.1.

Rule for the middle variable V^M. Since the temporal order of key actions might vary, we consider all the possible temporal orders between key actions in K. The set of temporal permutations of the key actions is denoted as Π, where each possible temporal permutation is represented by the OR rule:

V^M → V^M_1 [p^M_1] | V^M_2 [p^M_2] | ... | V^M_{|Π|} [p^M_{|Π|}] | ϵ [p^M_ϵ].   (6)

The rule for the permutation variable V^M_i is defined by the AND rule:

V^M_i → π_{i,1} V^{M(i,1)} ... π_{i,|π_i|} V^{M(i,|π_i|)} V^M,   (7)

where all the key actions are included. Note that π_{i,j} represents the j-th action of the permutation π_i ∈ Π, and the variable V^{M(i,j)} derives action sub-sequences between the actions π_{i,j} and π_{i,j+1}. The production rule for V^{M(i,j)} adheres to the rules specified in Eqs. 4 and 5. The resultant KARI-induced grammar for the example is shown in Fig. 2f, highlighting the compositional structure of actions.
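For illustration, the following sketch assembles the rules of Eqs. (4)–(5) from an action group sequence d^Ω. The rule encoding and the uniform placeholder probabilities are ours; the actual transition and escape probabilities are defined in Appendix A.1.

```python
def rules_for_part(part, group_sequence, escape_prob=0.5):
    """Emit the AND rule of Eq. (4) and one recursive OR rule per action group (Eq. (5)).

    Each rule is (head, alternatives); an OR alternative is (symbols, probability),
    and an empty symbol list stands for the empty string epsilon."""
    group_vars = [f"V_{i + 1}^{part}" for i in range(len(group_sequence))]
    rules = [(f"V^{part}", [(group_vars, 1.0)])]                # Eq. (4): V^O -> V_1^O ... V_|d^O|^O
    for var, group in zip(group_vars, group_sequence):
        p_action = (1.0 - escape_prob) / len(group)             # uniform stand-in for p_{i,j}^O
        alternatives = [([action, var], p_action) for action in sorted(group)]
        alternatives.append(([], escape_prob))                  # escape rule (epsilon) ends the recursion
        rules.append((var, alternatives))                       # Eq. (5): V_i^O -> d_{i,j}^O V_i^O | ... | eps
    return rules

# Action group sequence d^R from the Fig. 2 example.
d_R = [{"pour milk", "spoon sugar"}, {"stir coffee"}]
for head, alternatives in rules_for_part("R", d_R):
    print(head, "->", alternatives)
```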
3.3 Parser: Breadth-first Earley Parser (BEP)

The goal of the parser is to identify the optimal action sequence a* by discovering the most likely grammatical structure based on the output of the action segmentation model [9, 45]. In other words, the parser examines the production rules of the activity grammar to determine whether the given neural prediction Y can be parsed by the grammar G. However, when the grammar includes recursive rules, existing parsers struggle to complete the parsing within a reasonable time due to the significant increase in branches of the parse tree. To address this challenge, we introduce an effective parser dubbed BEP, integrating breadth-first search (BFS) and a pruning technique into the generalized Earley parser (GEP) [32]. Since the BFS prioritizes production rules closer to the start variable, it helps the parser understand the entire context of the activity before branching into recursive iterations. Simultaneously, pruning effectively reduces the vast search space generated by OR nodes and recursion, enabling the parser to focus on more relevant rules for the activity.

For parsing, we employ two heuristic probabilities introduced in [32] to compute the probability of variables and terminals within the parse tree. Specifically, let Y_{t,x} denote the probability of frame t being labeled as x. In this context, we denote the last action in the action sequence a as x, i.e., x = a_N, where a = [a_1, a_2, ..., a_N], for simplicity. The transition probability g(x | a_{1:N−1}, G) determines the probability of parsing the action x given a_{1:N−1} and the grammar G. The parsing probability p(F_{1:T} → a | G) computes the probability of a being the action sequence for F_{1:T}. The probability at t = 1 is initialized by

p(F_1 → a | G) = { g(x | ϵ, G) Y_{1,x}   if a contains only x,
                   0                     otherwise,            (8)

where ϵ indicates an empty string. Since we assume that the last action of a is classified as x, the parsing probability p(F_{1:t} → a | G) can be represented with the probability of the previous frames:

p(F_{1:t} → a | G) = Y_{t,x} ( p(F_{1:t−1} → a | G) + g(x | a_{1:N−1}, G) p(F_{1:t−1} → a_{1:N−1} | G) ).   (9)

The prefix probability p(F_{1:T} → a... | G) represents the probability of a being a prefix of a*. This probability is computed by measuring the probability that a is the action sequence for the frames F_{1:t} with t in the range [1, T]:

p(F_{1:T} → a... | G) = p(F_1 → a | G) + g(x | a_{1:N−1}, G) \sum_{t=2}^{T} Y_{t,x} p(F_{1:t−1} → a_{1:N−1} | G).   (10)

The parsing operation is structured following the original Earley parser [7], consisting of three key operations: prediction, scanning, and completion. These operations involve the update and generation of states, where every state comprises the rule being processed, the parent state, the parsed action sequence denoted as a, and the prefix probability denoted as p(a...). The states are enqueued and prioritized by their depth d within the parse tree.

• Prediction: for every state Q(m, n, d) of the form (A → α · Bβ, Q(i, j, k), a, p(a...)), add (B → ·Γ, Q(m, n, d), a, p(a...)) to Q(m, n, d + 1) for every production rule in the grammar with B on the left-hand side.
• Scanning: for every state in Q(m, n, d) of the form (A → α · wβ, Q(i, j, k), a, p(a...)), append the new terminal w to a and compute the probability p((a + w)...). Create a new set Q(m + 1, n′, d), where n′ is the current size of Q(m + 1). Add (A → αw · β, Q(i, j, k), a + w, p((a + w)...)) to Q(m + 1, n′, d).
• Completion: for every state in Q(m, n, d) of the form (A → Γ·, Q(i, j, k), a, p(a...)), find states in Q(i, j, k) of the form (B → α · Aβ, Q(i′, j′, k′), a′, p(a′...)) and add (B → αA · β, Q(i′, j′, k′), a, p(a...)) to Q(m, n, d − 1).

The symbols α, β, and Γ represent arbitrary strings consisting of terminals and variables, i.e., α, β, Γ ∈ (Σ ∪ V)*. The symbols A and B refer to variables, while w denotes a single terminal. The symbol Q represents a set of states, and the dot (·) denotes the current position of the parser within the production rule. Additionally, we introduce a pruning technique that limits the queue size to reduce the vast search space of the parse tree, similar to beam search: the parser preserves only the top N^queue elements of the queue in order of the parsing probability of each state. The parsing process terminates when the parser identifies that the parsed action sequence a* has a higher parsing probability than the prefix probabilities of any other states in the queue. For further details, refer to Appendix B.
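As an aside, the recursions in Eqs. (8)–(10) can be tabulated over prefixes of a candidate sequence; the sketch below is our illustrative re-implementation (not the authors' code), where Y is a T×|A| array of frame probabilities and g is supplied as a function.

```python
import numpy as np

def parse_probabilities(Y, actions, g):
    """Tabulate p(F_{1:t} -> a_{1:n} | G) for all prefixes a_{1:n} of `actions` and all t,
    following Eqs. (8)-(9), and return the prefix probability of Eq. (10).

    Y: (T, |A|) array of frame-level class probabilities.
    actions: candidate action sequence as a list of class indices.
    g(x, prefix): transition probability g(x | a_{1:N-1}, G)."""
    T, N = len(Y), len(actions)
    p = np.zeros((N + 1, T + 1))  # p[n][t] = p(F_{1:t} -> a_{1:n} | G); p[0][:] stays 0
    for n in range(1, N + 1):
        x, prefix = actions[n - 1], actions[:n - 1]
        # Eq. (8): only a single-action sequence can explain the first frame
        p[n][1] = g(x, prefix) * Y[0][x] if n == 1 else 0.0
        for t in range(2, T + 1):
            # Eq. (9): either frame t-1 already ended with x, or x starts at frame t
            p[n][t] = Y[t - 1][x] * (p[n][t - 1] + g(x, prefix) * p[n - 1][t - 1])
    x, prefix = actions[-1], actions[:-1]
    # Eq. (10): probability that `actions` is a prefix of the optimal action sequence
    prefix_prob = p[N][1] + g(x, prefix) * sum(Y[t - 1][x] * p[N - 1][t - 1]
                                               for t in range(2, T + 1))
    return p, prefix_prob

# Toy usage: 4 frames, 2 classes, uniform transition probabilities.
Y = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
p, prefix_prob = parse_probabilities(Y, [0, 1], g=lambda x, prefix: 0.5)
print(p[2][4], prefix_prob)
```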
3.4 Segmentation optimization

Segmentation optimization aims to determine the optimal alignment between the input classification probability matrix Y and the action sequence a*. In other words, all frames are allocated to the action sequence a* = [a*_1, a*_2, ..., a*_N] obtained from the parser to determine the optimal action lengths l* = [l*_1, l*_2, ..., l*_N]. In this work, we utilize a dynamic-programming-based Viterbi-like algorithm [35] for activity parsing. Similar to [26, 32], the optimizer explores all possible allocations and selects the one with the maximum product of probabilities:

l* = argmax_l p(l | a*, Y_{1:T}),   (11)
p(l | a, Y_{1:t}) = max_i ( p(l_{1:N−1} | a_{1:N−1}, Y_{1:i}) \prod_{t'=i+1}^{t} Y_{t',a_N} ).   (12)

A.1 Formulation of the transition and escape probabilities

Formulation of the transition probability. The action sequence a = [a_1, a_2, ..., a_N] represents the distinct action labels of the video segments, where a_i ≠ a_{i+1} as described in Section 3. In order to prevent the repetition of the same action in Eq. 5, we introduce an additional input q when defining the transition probability. Here, q refers to the index of the action selected by the rule in the previous step, specifically at the (n^rec − 1)-th step, where n^rec > 1. We simply set q to 0 at the first recursion, i.e., n^rec = 1, which does not affect the results. The transition probability p^Ω_{i,j} is defined by

p^Ω_{i,j}(n^rec, q) = { |{a ∈ h^Ω_i | a_1 = d^Ω_{i,j}}| / |h^Ω_i|                              if n^rec = 1,
                       0                                                                     if n^rec > 1 and j = q,
                       p^Ω_{i,j}(1, 0) (1 − p^Ω_{i,ϵ}(n^rec)) / \sum_{l≠q} p^Ω_{i,l}(1, 0)   otherwise.   (14)

The escape probability p^M_ϵ and the transition probability p^M_{i,j} for the middle variable V^M are defined in the same way as Eq. 13 and Eq. 14, respectively.

A.2 Derivation of the escape probability

We introduce the formulation of the escape probability p^Ω_{i,ϵ} in Appendix A.1. The escape probability is required to avoid an infinite loop of the rules and to control the length of sequences generated by the recursive rules in Eq. 5. Since the number of recursions directly determines the sequence lengths, we determine the escape probability based on the length of action sequences. For notational simplicity, we denote the escape probability p^Ω_{i,ϵ} by p, omitting superscripts and subscripts. The expectation of the number of recursions is calculated by

lim_{n→∞} \sum_{k=1}^{n} k p (1 − p)^k = lim_{n→∞} (1 − p)(1 − (1 − p)^n − n p (1 − p)^n) / p   (15)
                                        = (1 − p) / p.   (16)

Since we derive the escape probability when n^rec > 1, the expected number of recursions is equal to N̄ − 1:

(1 − p) / p = N̄ − 1,   (17)

where N̄ is the average length of action sequences. Finally, we obtain the escape probability

p = 1 / N̄,   (18)

which is used in Eq. 13 when n^rec > 1. The escape probability p^M_ϵ of the middle variable V^M in Eq. 6 is derived in the same way.
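For illustration, the following sketch computes the quantities in Eqs. (14) and (18) under our reading of the definitions: h_i^Ω is taken to be the set of group-i sub-strings observed in D^Ω, and the toy data below corresponds to the Fig. 2 example rather than a real dataset.

```python
def escape_probability(sub_strings):
    """Eq. (18): p_eps = 1 / N_bar, with N_bar the average length of the observed sub-strings."""
    n_bar = sum(len(s) for s in sub_strings) / len(sub_strings)
    return 1.0 / n_bar

def transition_probability(j, n_rec, q, sub_strings, actions, p_escape):
    """Eq. (14), sketched: probability of choosing the j-th action of an action group.

    sub_strings: the observed sub-strings h_i^Omega for this action group,
    actions: the ordered actions d_{i,1..|d_i|}^Omega of the group (indexed from 1 via j),
    q: index chosen at the previous recursion step (0 at the first step)."""
    def first_step(k):
        # relative frequency of sub-strings that start with the k-th action
        return sum(1 for s in sub_strings if s and s[0] == actions[k - 1]) / len(sub_strings)

    if n_rec == 1:
        return first_step(j)
    if j == q:  # never repeat the action chosen at the previous recursion step
        return 0.0
    denom = sum(first_step(l) for l in range(1, len(actions) + 1) if l != q)
    return first_step(j) * (1.0 - p_escape) / denom

# Group {pour milk, spoon sugar} of the R part in the Fig. 2 example.
h = [["pour milk"], ["spoon sugar", "pour milk"], ["pour milk", "spoon sugar"], ["spoon sugar"]]
acts = ["pour milk", "spoon sugar"]
p_eps = escape_probability(h)  # 1 / 1.5
print(p_eps, transition_probability(1, n_rec=2, q=2, sub_strings=h, actions=acts, p_escape=p_eps))
```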
B Breadth-first Earley Parser (BEP)

B.1 Earley parser

The Earley parser [7] is a classic algorithm that efficiently parses strings for a context-free grammar. It operates by maintaining a set of states of the parsing process. Each state consists of a production rule, a position within that rule, and a position in the input string. The parser builds a parse tree for the input string, which records the structure of parsing. The Earley parser is commonly used for natural language processing tasks such as syntactic analysis and semantic parsing. It consists of three main operations: scanning, prediction, and completion.

• Scanning: the parser matches a terminal symbol in the input string with the current position in the production rule. This operation moves the parser forward in the input string.
• Prediction: the parser expands a variable in the production rule based on the current position. It adds new states to the set of states for possible future matches.
• Completion: when the parser reaches the end of a production rule, it searches for other states predicting the head variable of the current rule and updates the positions within the rules of the found states.

By iterating these three operations, the Earley parser builds a parse chart that represents all possible parse trees for the input string.

B.2 Implementation details

The parsing probability p(F_{1:t} → a | G) (Eq. 9) can suffer from numerical underflow due to its exponential decrease as t increases. To overcome the issue, we compute the probabilities in logarithmic space, following [31]. For simplicity, we denote log(p(F_{1:t−1} → a | G)) as P_N and log(p(F_{1:t−1} → a_{1:N−1} | G)) as P_{N−1} below:

P′_{N−1} = log(g(x | a_{1:N−1}, G)) + P_{N−1},   (19)
z = max(P_N, P′_{N−1}),   (20)
log(p(F_{1:t} → a | G)) = log(Y_{t,x}) + z + log(exp(P_N − z) + exp(P′_{N−1} − z)).   (21)

For computational efficiency, we set the sampling stride of the input matrix Y to 50 for Breakfast and 100 for 50Salads. Additionally, we set the maximum length of the refined action sequence to 20 for Breakfast and 25 for 50Salads.
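The log-space update of Eqs. (19)–(21) is a standard log-sum-exp; a minimal sketch (ours, not the released code) follows.

```python
import math

def log_parsing_update(log_Y_tx, log_p_prev, log_p_prefix_prev, log_g):
    """Log-space version of Eq. (9), following Eqs. (19)-(21).

    log_p_prev        = log p(F_{1:t-1} -> a | G)          (P_N)
    log_p_prefix_prev = log p(F_{1:t-1} -> a_{1:N-1} | G)  (P_{N-1})
    log_g             = log g(x | a_{1:N-1}, G), log_Y_tx = log Y_{t,x}."""
    term = log_g + log_p_prefix_prev                 # Eq. (19)
    z = max(log_p_prev, term)                        # Eq. (20)
    if z == float("-inf"):                           # both probabilities are zero
        return float("-inf")
    return log_Y_tx + z + math.log(math.exp(log_p_prev - z) + math.exp(term - z))  # Eq. (21)

# Probabilities around 1e-40 stay finite in log space.
print(log_parsing_update(math.log(0.9), math.log(1e-40), math.log(2e-41), math.log(0.5)))
```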
B.3 Parsing algorithm

Algorithm 1 shows the parsing procedure of BEP. We utilize a priority queue that sorts its elements in ascending order of the key, which is the depth d of each state. The currentSet stores multiple states with the same m, n, and d; see Appendix B.4 for an example. BEP stops parsing when a* has the highest probability among all states in the queue, while ensuring that the current state can reach depth 1 with only completions.

Algorithm 1: Breadth-first Earley Parser (BEP)
Input: probability matrix Y, grammar G, queue size N^queue
Output: best parsed sequence a*

function BreadthFirstEarleyParser
    q ← priorityQueue()                                  // init priority queue (ascending in depth d)
    Q(0, 0, 0) ← (Γ → ·S, Q(0, 0, 0), ϵ, 1.0)            // set initial state
    q.push(0, (1.0, 0, 0, ϵ, Q(0, 0, 0)))                // push initial state to the queue
    a* ← ϵ                                               // init a*
    while (d, (p(a...), m, n, a, currentSet)) ← q.pop() do
        for (r, Q(i, j, k), a, p(a...)) ∈ currentSet do
            if p(a) > p(a*) then a* ← a end if           // update a* when a has a higher probability
            // prediction
            if r is (A → α · Bβ) then
                for each (B → Γ) in G do
                    r′ ← (B → ·Γ)
                    Q′ ← (r′, Q(m, n, d), a, p(a...))
                    Q(m, n, d + 1).add(Q′)
                    q.push(d + 1, (p(a...), m, n, a, Q(m, n, d + 1)))
                end for
            end if
            // scanning
            if r is (A → α · wβ) then
                r′ ← (A → αw · β)
                n′ ← |Q(m + 1)|
                Q′ ← (r′, Q(i, j, k), a + w, p((a + w)...))
                Q(m + 1, n′, d).add(Q′)
                q.push(d, (p((a + w)...), m + 1, n′, a + w, Q(m + 1, n′, d)))
            end if
            // completion
            if r is (B → Γ·) then
                for each ((A → α · Bβ), Q(i′, j′, k′), a, p(a...)) in Q(i, j, k) do
                    r′ ← (A → αB · β)
                    Q′ ← (r′, Q(i′, j′, k′), a, p(a...))
                    Q(m, n, d − 1).add(Q′)
                    q.push(d − 1, (p(a...), m, n, a, Q(m, n, d − 1)))
                end for
            end if
        end for
        // early stop when a* has the highest probability and has been fully parsed
        if p(a*) > p(a′...) for all a′ in q then
            if a* has been fully parsed then return a* end if
        end if
        // queue pruning
        if |q| > N^queue then
            q′ ← sorted(q, key = p(a...), descending)    // sort q by probability in descending order
            q.clear()
            for i ← 1 to N^queue do q.push(q′[i]) end for
        end if
    end while
    return a*
end function

B.4 Parsing example

In this section, we provide an example to help understand how BEP works. For the sake of simplicity and ease of calculation, we assume that the frame-wise probabilities of all action classes from the temporal action segmentation network are equal. First of all, we define the toy grammar shown in Figure 6.

Figure 6: Toy grammar used for the example of BEP parsing.
S → A B C
A → A1 A2
A1 → x1 [0.7] | x2 [0.3]
A2 → A3 A4 [0.5] | ϵ [0.5]
A3 → x4 A4 [0.5] | ϵ [0.5]
A4 → x3 A3 [0.5] | ϵ [0.5]
B → x5 [1.0]
C → x6 [0.7] | x7 [0.3]

In the context of this grammar, S indicates the starting variable; A, B, C, and Ai for i ∈ {1, 2, 3, 4} represent variables, while xj for j ∈ {1, ..., 7} represent terminals. Table 9 is the history of parsing with the toy grammar. It shows the currently popped state, the visiting order, the parsed prefix, the previous state, and which states are currently in the queue. The three consecutive numbers in the columns pop, from, and queue indicate m, n, and d of the state. The column p represents the prefix probability excluding the frame-wise probability, which can be considered a parsing probability since all frame-wise probabilities are assumed to be the same. Note that the table includes some history after the parsed sequence satisfies the early-stop constraint, to illustrate how BEP prioritizes the states: BEP preferentially searches states with a small depth, e.g., at order 14, even though the probability of state Q(1, 1, 3) is higher, BEP visits the state Q(3, 0, 2), which has a lower depth.

Table 9: Parsing log for the given toy grammar through BEP (columns: visiting order, popped state, rule, parsed prefix, operation, parent state, prefix probability p, and queued states; the full log is omitted here).
C Comparison with the existing grammar induction algorithms

In this section, we introduce the existing grammar induction algorithms for activity grammars [23, 35, 32] (Section C.1) and provide additional experimental results and analysis comparing KARI with these algorithms (Sections C.2 and C.3).

C.1 Existing grammar induction algorithms

Kuehne et al. [23] introduce a hierarchical context-free grammar induction algorithm. The root rule with the starting variable S is induced as S → V_1 | V_2 | ... | V_{N^A}, where the variable V_i represents a single activity and N^A is the number of activities in the dataset. Each V_i then expands into the action sequences of that activity, i.e., the rule is formed as V_i → A_{i,1} | A_{i,2} | ... | A_{i,|A_i|}, where A_i is the set of action sequences from the i-th activity and A_{i,j} represents the j-th action sequence in A_i.

Richard et al. [35] propose a grammar induction method for a probabilistic right-regular grammar, where every rule has the form H̃ → c H. The algorithm is motivated by n-gram models [18] and finite grammars [14]. Specifically, the variable H̃ represents an action sequence a_{1:n}, H represents the action sequence a_{1:n−1}, and the terminal c is the action class of a_n. The induced grammar can express the intermediate action sequences a_{1:n} and expands its rules based on the sequential order of actions.

Recently, Qi et al. [32] adopt the Automatic Distillation of Structure (ADIOS) [37] algorithm to induce a probabilistic context-free grammar. The ADIOS algorithm finds significant patterns (AND rules) and equivalence classes of actions (OR rules) from the given action sequences; it identifies repetitive patterns in action sequences to minimize redundant sequences and to find potential candidates for generalized action classes. Following the grammar induction method of ADIOS, we set the decreasing ratio of the motif extraction algorithm η to 1, the significance level for the decrease ratio γ to 0.1, and the context window size to 1 for the ADIOS-AND-induced grammar; for the ADIOS-OR-induced grammar, we set η to 0.9, γ to 0.1, and the context window size to 4.
However, none of these approaches have managed to effectively integrate recursive rules, which are crucial for representing intricate and realistic structures of action phrases and activities. The proposed KARI algorithm introduces a probabilistic context-free grammar that allows for the expression of complex activity structures and captures a distinctive temporal structure based on key actions.

Table 10: Performance comparison with other grammar induction algorithms on the two benchmark datasets.

dataset        | reprod. | refinement | grammar induction   | edit | F1@10 | F1@25 | F1@50 | acc.
50Salads [39]  |   ✓     |     –      | –                   | 76.5 | 83.8  | 81.7  | 74.8  | 86.1
               |   ✓     |     ✓      | Kuehne et al. [23]  | 62.9 | 73.0  | 70.6  | 63.0  | 78.6
               |   ✓     |     ✓      | Richard et al. [35] | 63.1 | 73.1  | 70.5  | 62.9  | 78.7
               |   ✓     |     ✓      | ADIOS-AND           | 58.3 | 70.0  | 68.0  | 59.4  | 76.2
               |   ✓     |     ✓      | ADIOS-OR            | 61.1 | 72.0  | 70.1  | 62.4  | 78.9
               |   ✓     |     ✓      | KARI (ours)         | 79.9 | 85.4  | 83.8  | 77.4  | 85.3
Breakfast [23] |   ✓     |     –      | –                   | 75.6 | 77.3  | 72.0  | 59.4  | 74.3
               |   ✓     |     ✓      | Kuehne et al. [23]  | 72.8 | 74.0  | 69.1  | 55.5  | 72.9
               |   ✓     |     ✓      | Richard et al. [35] | 77.3 | 77.2  | 72.2  | 59.4  | 74.1
               |   ✓     |     ✓      | ADIOS-AND           | 69.2 | 69.8  | 64.9  | 52.2  | 72.4
               |   ✓     |     ✓      | ADIOS-OR            | 70.3 | 71.8  | 66.8  | 54.2  | 71.8
               |   ✓     |     ✓      | KARI (ours)         | 77.8 | 78.8  | 73.7  | 60.8  | 74.0

C.2 Performance on temporal action segmentation

In Table 10, we compare the performance of each grammar induction algorithm in refining the temporal action segmentation model [45]. The overall results show that the KARI-induced grammar achieves the best refinement performance among the grammar induction algorithms, with a significant performance gap on both datasets. The induced grammar of Richard et al. also shows better performance than the other grammar induction algorithms except for KARI, indicating that the ability to represent intermediate action sequences by production rules helps improve refinement performance. In conclusion, generalization capability and variability in expressing action sequences are essential to guide the temporal action segmentation network to better refinement results.

D Examples of KARI-induced grammars

In this section, we provide an activity grammar induced by KARI. Based on the example action sequences of the 'coffee' activity from the Breakfast dataset shown in Fig. 7, the resultant KARI-induced grammar is obtained as follows:

S → 'SIL' V^L V^M V^R 'SIL' [1.0]
V^L → 'take cup' [0.4545] | ϵ [0.5455]
V^M → 'pour coffee' [1.0]
V^R → V^R_1 V^R_2 [1.0]
V^R_1 → 'pour milk' V^R_{1,1} [0.4545] | 'pour sugar' V^R_{1,2} [0.0909] | 'spoon sugar' V^R_{1,3} [0.2727] | ϵ [0.1819]
V^R_{1,1} → 'pour sugar' V^R_{1,2} [0.1333] | 'spoon sugar' V^R_{1,3} [0.2667] | ϵ [0.6]
V^R_{1,2} → 'pour milk' V^R_{1,1} [0.24] | 'spoon sugar' V^R_{1,3} [0.16] | ϵ [0.6]
V^R_{1,3} → 'pour milk' V^R_{1,1} [0.3] | 'pour sugar' V^R_{1,2} [0.1] | ϵ [0.6]
V^R_2 → 'stir coffee' [0.5455] | ϵ [0.4545]

Figure 7: Example sequences of the 'coffee' activity from Breakfast.
a1 = ['pour coffee']
a2 = ['pour coffee', 'pour milk']
a3 = ['take cup', 'pour coffee']
a4 = ['take cup', 'pour coffee', 'pour milk']
a5 = ['pour coffee', 'spoon sugar', 'pour milk']
a6 = ['pour coffee', 'spoon sugar', 'stir coffee']
a7 = ['pour coffee', 'pour sugar', 'pour milk', 'stir coffee']
a8 = ['pour coffee', 'pour milk', 'spoon sugar', 'stir coffee']
a9 = ['take cup', 'pour coffee', 'pour milk', 'spoon sugar', 'stir coffee']
a10 = ['take cup', 'pour coffee', 'spoon sugar', 'pour milk', 'stir coffee']
a11 = ['take cup', 'pour coffee', 'pour milk', 'pour sugar', 'stir coffee']

To clarify the varying transition and escape probabilities according to the number of recursions n^rec in Eq. 14, we modify the expression of the recursive rule in Eq. 5 by introducing sub-variables V^Ω_{i,j}. Refer to the official GitHub repository (https://github.com/gongda0e/KARI) for more results.
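To illustrate how such a grammar generates action sequences, the sketch below encodes the 'coffee' grammar above as Python data and samples from it top-down. The encoding and the sampler are ours, for illustration only.

```python
import random

# The KARI-induced 'coffee' grammar above, as (expansion, probability) alternatives.
# Symbols not listed as heads (e.g. 'SIL', 'pour coffee') are terminals; [] is epsilon.
GRAMMAR = {
    "S":     [(["SIL", "V_L", "V_M", "V_R", "SIL"], 1.0)],
    "V_L":   [(["take cup"], 0.4545), ([], 0.5455)],
    "V_M":   [(["pour coffee"], 1.0)],
    "V_R":   [(["V_R1", "V_R2"], 1.0)],
    "V_R1":  [(["pour milk", "V_R11"], 0.4545), (["pour sugar", "V_R12"], 0.0909),
              (["spoon sugar", "V_R13"], 0.2727), ([], 0.1819)],
    "V_R11": [(["pour sugar", "V_R12"], 0.1333), (["spoon sugar", "V_R13"], 0.2667), ([], 0.6)],
    "V_R12": [(["pour milk", "V_R11"], 0.24), (["spoon sugar", "V_R13"], 0.16), ([], 0.6)],
    "V_R13": [(["pour milk", "V_R11"], 0.3), (["pour sugar", "V_R12"], 0.1), ([], 0.6)],
    "V_R2":  [(["stir coffee"], 0.5455), ([], 0.4545)],
}

def sample(symbol="S"):
    """Sample one action sequence by expanding variables top-down."""
    if symbol not in GRAMMAR:              # terminal: an action label
        return [symbol]
    expansions, weights = zip(*GRAMMAR[symbol])
    chosen = random.choices(expansions, weights=weights)[0]
    return [action for part in chosen for action in sample(part)]

random.seed(0)
for _ in range(3):
    print(sample())   # e.g. ['SIL', 'take cup', 'pour coffee', 'pour milk', 'stir coffee', 'SIL']
```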
E Qualitative results

We provide additional qualitative results for Breakfast and 50Salads. Figure 8 shows examples of successful refinements by the KARI-induced grammar, demonstrating its ability to cover various sequences comprising combinations of multiple actions. This is further evident in Figure 8a, where ADIOS-OR falls short in covering the 'add dressing' action following the 'serve salad' action, while KARI handles it proficiently. We also show failure cases of the KARI-induced grammar in Figure 9, where further improvement is needed. We acknowledge that the KARI-induced grammar sometimes deletes certain actions. This deletion of actions, along with the challenges posed by inaccurate identification of actions by the segmentation model, are areas for improvement in the refinement process. We recognize these as opportunities for future work to enhance the performance of the grammar induction algorithm and address these limitations.

F Broader Impact

The research presented in this paper holds significant potential for impact across multiple domains. The development of efficient and effective grammar induction algorithms for activity grammars, coupled with the Breadth-first Earley parser, has the potential to greatly enhance human activity recognition and understanding systems. This, in turn, can have far-reaching implications in various applications, such as video surveillance, human-computer interaction, robotics, and healthcare monitoring. By improving the accuracy and efficiency of activity recognition systems, our research contributes to advancements in these domains, enabling more robust and intelligent systems. The broader implications of this research extend beyond activity grammar induction itself, fostering innovation and enhancing the capabilities of intelligent systems in diverse fields.

Figure 8: Qualitative results on successful cases, comparing GT, ASFormer, ADIOS-OR, and KARI (ours): (a) 50Salads, (b) Breakfast (activity: cereals), (c) Breakfast (activity: tea).
For convenience, we assume that the frame-wise probabilities of all action classes from the segmentation model are equal 3 https://github.com/gongda0e/KARI 3 20 GT ASFormer ADIOS-OR KARI (Ours) action start/end cut cheese cut lettuce place cheese place lettuce cut tomato mix ingredients add oil place tomato add vinegar peel cucumber cut cucumber add salt add pepper pour milk stir_cereals mix dressing serve salad (a) 50Salads GT ASFormer ADIOS-OR KARI (Ours) SIL take bowl pour cereals (b) Breakfast (activity: cereals) GT ASFormer ADIOS-OR KARI (Ours) SIL take cup add_teabag pour water (c) Breakfast (activity: tea) Figure 8: Qualitative results on successful cases 21 place cucumber add dressing P07_webcam01_cereals GT ASFormer ADIOS-OR KARI (Ours) P07_webcam01_cereals action start/end add pepper peel cucumber cut tomato add dressing mix ingredient add salt add oil add vinegar place cucumber into bowl serve salad onto plate mix dressing cut cheese place cheese into bowl place lettuce into bowl cut lettuce place tomato into bowl cut cucumber (a) 50Salads SILtake_bowlpour_cerealspour_milkstir_cerealsSIL GT ASFormer ADIOS-OR KARI (Ours) SIL pour oil butter pan crack egg stirfry egg add saltnpepper take plate put egg2plate (b) Breakfast (activity: scrambled egg) Figure 9: Qualitative results on failure cases SILtake_bowlpour_cerealspour_milkstir_cerealsSIL 22