Activity Grammars for Temporal Action Segmentation
Dayoung Gong∗

Joonseok Lee∗

Deunsol Jung

Suha Kwak

Minsu Cho

Pohang University of Science and Technology (POSTECH)

arXiv:2312.04266v1 [cs.CV] 7 Dec 2023

{dayoung.gong, jameslee, deunsol.jung, suha.kwak, mscho}@postech.ac.kr
http://cvlab.postech.ac.kr/research/KARI

Abstract
Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual
properties. The task of temporal action segmentation, which aims at translating an
untrimmed activity video into a sequence of action segments, remains challenging
for this reason. This paper addresses the problem by introducing an effective
activity grammar to guide neural predictions for temporal action segmentation. We
propose a novel grammar induction algorithm that extracts a powerful context-free
grammar from action sequence data. We also develop an efficient generalized
parser that transforms frame-level probability distributions into a reliable sequence
of actions according to the induced grammar with recursive rules. Our approach
can be combined with any neural network for temporal action segmentation to
enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action
segmentation in terms of both performance and interpretability on two standard
benchmarks, Breakfast and 50 Salads.

1

Introduction

Human activities in videos do not proceed by accident; they are structured being subject to generative
rules imposed by the goal of activities, the properties of individual actions, the physical environment,
and so on. Comprehending such a compositional structure of multi-granular semantics in human
activity poses a significant challenge in video understanding research. The task of temporal action
segmentation, which aims at translating an untrimmed activity video into a sequence of action
segments, remains challenging due to the reason. The recent methods based on deep neural networks [25, 9, 45, 2, 15, 16, 1] have shown remarkable improvement in learning temporal relations of
actions in an implicit manner, but often face out-of-context errors that reveal the lack of capacity to
capture the intricate structures of human activity, and the scarcity of annotated data exacerbates the
issue in training. In this work, we address the problem by introducing an effective activity grammar
to guide neural predictions for temporal action segmentation.
Grammar is a natural and powerful way of explicitly representing the hierarchical structure of
languages [14] and can also be applied to express the structure of activities. Despite the extensive body
of grammar-based research for video understanding [23, 24, 33, 35, 32], none of these approaches
have successfully integrated recursive rules. Recursive rules are indispensable for expressing complex
and realistic structures found in action phrases and activities. To achieve this, we introduce a novel
activity grammar induction algorithm, Key-Action-based Recursive Induction (KARI), that extracts a
powerful probabilistic context-free grammar while capturing the characteristics of the activity. Since
an activity is composed of multiple actions, each activity exhibits a distinctive temporal structure
based on pivotal actions, setting it apart from other activities. The proposed grammar induction
∗

Equal contribution.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

enables recursive rules with flexible temporal orders, which leads to powerful generalization capability.
We also propose a novel activity grammar evaluation framework to evaluate the generalization and
discrimination power of the proposed grammar induction algorithm. To incorporate the induced
activity grammar into temporal action segmentation, we develop an effective parser, dubbed BEP,
which searches the optimal rules according to the classification outputs generated by an off-the-shelf
action segmentation model. Our approach can be combined with any neural network for temporal
action segmentation to enhance the sequence prediction and discover its compositional structure.
The main contribution of this paper can be summarized as follows:
• We introduce a novel grammar induction algorithm that extracts a powerful context-free
grammar with recursive rules based on key actions and temporal dependencies.
• We develop an effective parser that efficiently handles recursive rules of context-free grammar by using Breadth-first search and pruning.
• We propose a new grammar evaluation framework to assess the generalization and discrimination capabilities of the induced activity grammars.
• We show that the proposed method significantly improves the performance of temporal
action segmentation models, as demonstrated through a comprehensive evaluation on two
benchmarks, Breakfast and 50 Salads.

2

Related work

Grammar for activity analysis. Grammar is an essential tool to represent the compositional
structure of language [14] and has been mainly studied in the context of natural language processing
(NLP) [21, 22, 20, 37]. Grammar has been extensively studied in various research areas [28, 29, 32,
8, 43, 13, 42, 11, 27, 12]. Similarly, a grammatical framework can be used to express the structure
of activities. Several work [23, 24, 33, 35] have defined context-free grammars based on possible
temporal transitions between actions for action detection and recognition. Vo and Bovick [41]
propose a stochastic grammar to model a hierarchical representation of activity based on AND-rules
and OR-rules. Richard et al. [34] propose a context-free grammar defined on action sequences for
weakly-supervised temporal action segmentation. Qi et al. [30, 32, 31] utilize a grammar induction
algorithm named ADIOS [37] to induce grammar from action corpus. However, none of the proposed
grammar for activity analysis includes recursive rules, which are a fundamental factor in expressing
repetitions of actions or action phrases. In this paper, we propose a novel action grammar for temporal
action segmentation based on key action and temporal dependency between actions considering
recursive temporal structure.
Temporal action segmentation (TAS). Various methods have been proposed to address the task.
Early work utilizes temporal sliding windows [36, 19] to detect action segments, and language-based
methods [24, 23] has been proposed to utilize a temporal hierarchy of actions during segmentation.
Recently, a deep-learning-based model named the temporal convolutional networks (TCN) has been
proposed with an encoder-decoder architecture [25, 9]. Moreover, transformer-based models [45, 2]
are recently introduced to leverage global temporal relations between actions based on self-attention
and cross-attention mechanisms [40]. Other researches have been proposed to improve the accuracy
of temporal action segmentation based on existing models [9, 45]. Huang et al. [15] introduce a
network module named Graph-based Temporal Reasoning Module (GTRM) that is applied on top
of baseline models to learn temporal relations of action segments. Ishikawa et al. [16] suggest an
action segment refinement framework (ASRF) dividing a task into frame-wise action segmentation
and boundary regression. They refine frame-level classification results with the predicted action
boundaries. Gao et al. [10] propose a global-to-local search scheme to find appropriate receptive field
combinations instead of heuristic respective fields. Ahn and Lee [1] recently propose a hierarchical
action segmentation refiner (HASR), which refines segmentation results by applying multi-granular
context information from videos. A fast approximate inference method named FIFA for temporal
action segmentation and alignment instead of dynamic programming is proposed by Souri et al. [38].
Other researches [5, 6] reformulate TAS as a cross-domain problem with different domains of
spatio-temporal variations, introducing self-supervised temporal domain adaptation. Xu et al. [44]
proposes differentiable temporal logic (DTL), which is a model-agnostic framework to give temporal
constraints to neural networks. In this paper, we propose a neuro-symbolic approach where the
activity grammar induced by the proposed grammar induction algorithm guides a temporal action
segmentation model to refine segmental errors through parsing.
2

S
A

action sequences

TAS
model

C

B

KARI

(

'

(a) activity grammar induction

Seg.
Opt.

BEP

I

!

(b) parsing

#∗

(#∗ , %∗ )

(c) optimization

!: KARI-induced grammar, ": video, #: prediction, (%∗ , '∗ ) : refined action (sequences, length), Seg. Opt.: segmentation optimization

Figure 1: Overall pipeline of the proposed method. (a) KARI induces an activity grammar G from
action sequences in the training data, (b) BEP parses neural predictions Y from the off-the-shelf
temporal action segmentation model given a video F by using the KARI-induced grammar G, and (c)
the final output of optimal action sequences and lengths (a∗ , l∗ ) is achieved through segmentation
optimization. It is best viewed in color.

3

Our approach

Given a video of T frames F = [F1 , F2 , ..., FT ] and a predefined set of action classes A, the goal of
temporal action segmentation is to translate the video into a sequence of actions a = [a1 , a2 , ..., aN ]
and their associated frame lengths l = [l1 , l2 , ..., lN ] where N is unknown, ai ∈ A for 1 ≤ i ≤ N ,
PN
ai ̸= ai+1 for 1 ≤ i ≤ N − 1, and i=1 li = T .2 The resultant output of a and l indicates that the
video consists of N segments and each pair (ai , li ) represents the action and length of ith segment.
In this work, we introduce an activity grammar that guides neural predictions for temporal action
segmentation through parsing. We propose a novel activity grammar induction algorithm named
KARI and an efficient parser called BEP. The overall pipeline of the proposed method consists of
three steps, as illustrated in Fig. 1. First of all, KARI induces an activity grammar from action
sequences in the training data. Using the KARI-induced grammar, BEP then takes the frame-level
class prediction Y ∈ RT ×|A| from the off-the-shelf temporal action segmentation model [45, 9] and
produces a grammar-consistent action sequence a∗ . Finally, segmentation optimization is performed
to obtain optimal action lengths l∗ based on a∗ and Y . In the following, we introduce the activity
grammar as a probabilistic context-free grammar (Section 3.1), present KARI (Section 3.2) and
BEP (Section 3.3), and describe a segmentation optimization method for final outputs (Section 3.4).
3.1

Activity grammar

We define the activity grammar as a probabilistic context-free grammar (PCFG) [17], designed
to derive diverse action sequences pertaining to a specific activity class. The activity grammar,
denoted as G = (V, Σ, P, S), follows the conventional PCFG which consists of four components: a
finite set of variables V, a finite set of terminals Σ, a finite set of production rules P, and the start
symbol S ∈ V. In our context, the set of terminals Σ becomes the set of action classes A, and the
production rules P are used to generate action sequences from the start variable S. We use two types
of production rules, ‘AND’ and ‘OR’, defined as follows:
AND : V → α
OR : V → V1 [p1 ] | V2 [p2 ] | · · · | Vn [pn ]

where V ∈ V and α ∈ (Σ ∪ V)∗ ,
where V, V1 , ..., Vn ∈ V.

(1)
(2)

The AND rule replaces a head variable V with a sequence of variables and terminals α, determining
the order of the terminals and variables. In contrast, the OR rule converts a head variable V to a
sub-variable Vi with the probability pi , providing multiple alternatives for replacement; ‘|’ denotes
‘OR’ operation. These two types of rules allow us to generate action sequences hierarchically.
3.2

Grammar induction: Key-Action-based Recursive Induction (KARI)

Grammar induction refers to the process of learning grammars from data [37]. In our context, it takes
action sequences of a specific activity in the training set and produces an activity grammar that is
2

In fact, this form of output is equivalent to that of frame-level action classification, which predict an action
class for each frame, and the sequence of frame-level actions is easily converted to (a, l) and vice versa.

3

• !! = [pour coffee, pour milk]
• !" = [pour coffee, spoon sugar, pour milk]
• !# = [take cup, pour coffee, pour milk, spoon sugar, stir coffee]
• !$ = [take cup, pour coffee, spoon sugar, stir coffee]
(a) Action sequences in the training set !
take cup

pour coffee

pour milk spoon sugar stir coffee

!'#

!%
#

!&#

• - → /' /% /&
• / ' → `take cup( | $
• / % → `pour coffee(
• / & → /)& /*&
• /)& → `pour milk ′ /)& | `spoon sugar( /)& | $
• /*& → `stir coffee( /*& | $

!
"!

(b) Sub-sequence partitioning based on the key action

""

"#

• #' = $ , take cup , #% = pour coffee ,

# & = { pour milk , spoon sugar, pour milk , [pour milk, spoon
sugar, stir coffee], spoon sugar, stir coffee }

"$# "%#

#

(c) The set of sub-sequences !! , !" , and !#

• + = take cup , +% = pour coffee , +& = {pour milk, spoon
'

"$&

sugar, stir coffee}

"

#

…

(d) The action set $ , $ , and $

• , = [{take cup}], ,% = [{pour coffee}], ,& = [{pour milk,
'

spoon sugar}, stir coffee ]

(e) The dependency sequence "! , "" , and "#

#

#

…

!

"$#

: variable

: terminal

: AND

: OR

: terminal (key action)

(f) KARI-induced activity grammar

Figure 2: Example of activity grammar induction of KARI. (a) Example action sequences are
provided with ‘pour coffee’ as the key action with N key set to 1. (b) Action sequence, e.g., a3 , is
segmented into sub-sequences based on key actions. (c) All action sub-sequences aΩ consist in a set
of sub-sequences DΩ . (d) The action set AΩ contains all the actions occurring in DΩ . (e) Temporally
independent actions are grouped, where each action group is temporally dependent in the action
group sequence dΩ . (f) The resultant KARI-induced activity grammar is shown. For simplicity, we
omit the probability, and it is best viewed in color.
able to parse action sequences of the activity; the induced grammar should be able to parse unseen
sequences of the activity as well as the sequences in the training set for generalization. To obtain an
effective activity grammar avoiding under-/over-generalization, we introduce two main concepts for
grammar induction: key action and temporal dependency.
The key actions for a specific activity are those consistently present in every action sequence from
the training dataset. Specifically, the top N key most frequently occurring actions among these are
selected as the key actions. The hyperparameter of the number of key actions N key affects the degree
of generalization achieved by the induced grammar. The temporal dependency refers to the relevance
of temporal orders across actions. Temporally independent actions do not occur in a specific temporal
order. This concept of temporal dependency can also be extended to groups of actions, meaning that
some groups of actions can be temporally dependent on others.
We induce an activity grammar based on the key actions and the temporal dependency. Action
sequences are divided into sub-sequences using the key actions as reference points, and the temporal
dependencies between actions within the sub-sequences are established; temporally dependent actions
are represented using AND rules (Eq. 1), while temporally independent actions are expressed with
OR rules (Eq. 2). We give an example of grammar induction in Fig. 2; four action sequences are
given in Fig. 2a, where the action class ‘pour coffee’ is chosen as the key action with the number of
key actions N key set to 1.
Given the action sequences from the training dataset D, we begin grammar induction by identifying a
set of key actions K ⊂ A with the pre-defined hyperparameter N key . Using the key actions, each
action sequence a ∈ D is divided into three parts: a = [aL , aM , aR ]. The sub-sequences aL , aM ,
and aR denote the portions of the original action sequence that occurred before, between, and after
the key actions, respectively; the sub-sequence aM starts from the first key action and includes up to
the last key action in K. An example in Fig. 2b shows that the action sequence a3 is divided into
three sub-sequences using the key actions. For notational convenience, we will use the superscript
Ω ∈ {L, M, R} to denote one of the three parts. All action sub-sequences aΩ in a specific part Ω are
grouped to consist in a corresponding set of sub-sequences DΩ (cf. Fig 2c). The action set AΩ ⊆ A
4

is then defined to contain all the actions occurring in DΩ (cf. Fig 2d). To determine the temporal
dependencies among the actions of AΩ , pairwise temporal orders are considered as follows. If one
action always occurs before the other in DΩ , then the two actions are temporally dependent and
otherwise temporally independent. Based on the concepts, we construct the action group sequence
dΩ by collecting the temporally independent actions as an action group and arranging such action
groups according to their temporal dependencies (cf. Fig 2e).
In the following, we describe how to construct the production rules P of the activity grammar G.
Start rule. We first create the rule for the start variable S:

S →VLVMVR,

(3)

where V L , V M , and V R are variables used to derive left, middle, and right parts of the action
sequence, respectively.
Rule for the variable V Ω . For V Ω , Ω ∈ {L, R}, we construct an AND rule of action groups based
on action group sequence dΩ :
V Ω → V1Ω V2Ω · · · V|dΩΩ | ,

(4)

where the variable ViΩ represents the ith action group in the action group sequence dΩ
i . Since actions
in an action group are considered temporally independent, we construct an OR rule for each action
group:
Ω
Ω
Ω
Ω
Ω
Ω
Ω
Ω
ViΩ → dΩ
i,1 Vi [pi,1 ] | di,2 Vi [pi,2 ] | · · · | di,|dΩ | Vi [pi,|dΩ | ] | ϵ [pi,ϵ ] ,

(5)

Ω
Ω
Ω
Ω
where dΩ
i,j denotes the jth action from the action group di . The variable Vi yields di,j Vi with the
Ω
probability pΩ
i,j . This rule can be recursively used to proceed to the variable Vi in the next step. This
recursive structure allows for repeated selection of actions within the same action group, leading to
the generation of diverse action sequences, which is effective for generalization. To avoid an infinite
loop of the recursion, the empty string ϵ with the escape probability pΩ
i,ϵ is added to Eq. 5. For the
details, refer to the transition probability pΩ
and
the
escape
probability
pΩ
i,j
i,ϵ in Appendix A.1.

Rule for the middle variable V M . Since the temporal order of key actions might vary, we consider
all the possible temporal orders between key actions in K. A set of temporal permutations of actions
is denoted as Π, where each possible temporal permutation is represented by the OR rule:
M M
M
M
M
V M → V1M [pM
1 ] | V2 [p2 ] | · · · | V|Π| [pΠ ] | ϵ [pϵ ].

(6)

The rule for the permutation variable ViM is defined by the AND rule:
ViM → πi,1 V M(i,1) · · · πi,|πi | V M(i,|πi |) V M ,

(7)

where all the key actions are included. Note that πi,j represents the jth action of the permutation
πi ∈ Π, and the variable V M(i,j) derives action sub-sequences between actions πi,j and πi,j+1 . The
production rule for V M(i,j) adheres to the rules specified in Eq. 4 and 5. The resultant KARI-induced
grammar from the example is shown in Fig. 2f, highlighting the compositional structure of actions.
3.3

Parser: Breadth-first Earley Parser (BEP)

The goal of the parser is to identify the optimal action sequence a∗ by discovering the most likely
grammatical structure based on the output of the action segmentation model [9, 45]. In other words,
the parser examines the production rules of the activity grammar to determine whether the given
neural prediction Y can be parsed by the grammar G. However, when the grammar includes recursive
rules, the existing parser struggles to complete the parsing within a reasonable time due to the
significant increase in branches from the parse tree. To address this challenge, we introduce an
effective parser dubbed BEP, integrating Breadth-first search (BFS) and a pruning technique into a
generalized Earley parser (GEP) [32]. Since the BFS prioritizes production rules closer to the start
variable, it helps the parser understand the entire context of the activity before branching to recursive
iterations. Simultaneously, pruning effectively reduces the vast search space generated by OR nodes
and recursion, enabling the parser to focus on more relevant rules for the activity.
For parsing, we employ two heuristic probabilities introduced in [32] to compute the probability of
variables and terminals within the parse tree. Specifically, let Yt,x denote the probability of frame t
5

being labeled as x. In this context, we denote the last action in the action sequence a as x, i.e. x = aN ,
where a = [a1 , a2 , ..., aN ], for simplicity. The transition probability g(x | a1:N −1 , G) determines the
probability of parsing action x given the a1:N −1 and the grammar G.
The parsing probability p(F1:T → a | G) computes the probability of a being the action sequence
for F1:T . The probability at t = 1 is initialized by:

g(x | ϵ, G) Y1,x if a contains only x,
p(F1 → a | G) =
(8)
0
otherwise,
where ϵ indicates an empty string.

Since we assume that the last action of a is classified as x, the parsing probability p(F1:t → a | G)
can be represented with the probability of the previous frames:
p(F1:t → a | G) = Yt,x ( p(F1:t−1 → a | G) + g(x | a1:N −1 , G) p(F1:t−1 → a1:N −1 | G) ). (9)
The prefix probability p(F1:T → a... | G) represents the probability of a being the prefix of a∗ . This
probability is computed by measuring the probability that a is the action sequence for the frame F1:t
with t in the range [1, T ]:
p(F1:T → a... | G) = p(F1 → a | G) + g(x | a1:N −1 , G)

T
X
t=2

Yt,x p(F1:t−1 → a1:N −1 | G). (10)

The parsing operation is structured following the original Earley parser [7], consisting of three key
operations: prediction, scanning, and completion. These operations involve the update and generation
of states, where every state comprises the rule being processed, the parent state, the parsed action
sequence denoted as a, and the prefix probability denoted as p(a...). The states are enqueued and
prioritized by their depth d within the parse tree.
• Prediction: for every state Q(m, n, d) of the form (A → α · Bβ, Q(i, j, k), a, p(a...)),
add (B → ·Γ, Q(m, n, d), a, p(a...)) to Q(m, n, d + 1) for every production rule in the
grammar with B on the left-hand side.
• Scanning: for every state in Q(m, n, d) of the form (A → α · wβ, Q(i, j, k), a, p(a...)),
append the new terminal w to a and compute the probability p((a + w)...). Create a new set
Q(m + 1, n′ , d) where n′ is the current size of Q(m + 1). Add (A → αw · β, Q(i, j, k), a +
w, p((a + w)...)) to Q(m + 1, n′ , d).
• Completion: for every state in Q(m, n, d) of the form (A → Γ·, Q(i, j, k), a, p(a...)),
find states in Q(i, j, k) of the form (B → α · Aβ, Q(i′ , j ′ , k ′ ), a′ , p(a′ ...)) and add (B →
αA · β, Q(i′ , j ′ , k ′ ), a, p(a...)) to Q(m, n, d − 1).
The symbols α, β, and Γ represent arbitrary strings consisting of terminals and variables, i.e. α, β, Γ ∈
(Σ ∪ V )∗ . The symbols A and B refer to the variables, while w denotes a single terminal. The symbol
Q represents the set of states, and the dot (·) denotes the current position of the parser within the
production rule.
Additionally, we introduce a pruning technique of limiting the queue size to reduce the vast search
space in the parse tree, similar to the beam search. Specifically, the parser preserves only the top
N queue elements from the queue in order of the parsing probability of each state. The parsing
process terminates when the parser identifies that the parsed action sequence a∗ has a higher parsing
probability than the prefix probabilities of any other states in the queue. For the further details, refer
to Appendix B.
3.4

Segmentation optimization

Segmentation optimization aims to determine the optimal alignment between the input classification
probability matrix Y and the action sequence a∗ . In other words, the entire frames are allocated
within the action sequences a∗ = [a∗1 , a∗2 , ..., a∗N ], obtained from the parser, to determine the optimal
∗
action lengths l∗ = [l1∗ , l2∗ , ..., lN
]. In this work, we utilize dynamic programming-based Viterbi-like
algorithm [35] for activity parsing. Similar to [26, 32], the optimizer explores all possible allocations
and selects the one with the maximum product of probabilities:
l∗ = arg max(p(l | a∗ , Y1:T )),
(11)
l

p(l | a, Y1:t ) = max(p(l1:N −1 | a1:N −1 , Y1:i )
i<t

6

t
Y
j=i

Yj,aN ).

(12)

""& : #$% induced "

generation

……

grammar
induction

O

unseen

parser

Synthetic grammar index

X

all
…

seen

Synthetic grammar index

""# : #$% synthetic "

$/1

$/0

KARI-induced grammar index

ADIOS-OR-induced grammar index

Figure 4: Confusion matrix of activity grammars. The
results of KARI-induced grammar are similar to the synthetic
grammar, showing high recall with comparable precision.

X

Figure 3: Grammar evaluation

Table 1: Synthetic G I

Table 2: Synthetic G II

grammar

precision

recall

grammar

0.97
0.99
0.93

0.08
0.25
0.98

ADIOS-AND
ADIOS-OR
KARI (ours)

ADIOS-AND
ADIOS-OR
KARI (ours)

4

Experimental evaluation and analysis

4.1

Datasets and evaluation metrics

precision

recall

0.98
1.00
0.92

0.06
0.33
0.96

Datasets. We conduct experiments on two widely used benchmark datasets for temporal action
segmentation: Breakfast [23] and 50 Salads [39]. The Breakfast dataset, consisting of 1,712 videos,
involves 52 individuals preparing 10 different breakfast activities comprised of 48 actions in 18
different kitchens. Similarly, the 50 Salads dataset comprises 50 egocentric videos of people preparing
salads of a single activity with 17 fine-grained actions from 25 people. We used I3D [4] features
provided by [9].
Evaluation metrics. For evaluation metrics, we report edit score, F1@{10, 25, 50} scores, and
frame-wise accuracy following the previous work [9, 45].
4.2

Implementation details

For KARI, we set the hyperparameters of the number of key actions N key to 4 for Breakfast, and 3
for 50 Salads. We individually induce separate activity grammar for the ten activity classes within
Breakfast and subsequently merge them into a unified grammar. For the comparison with the existing
grammar used in the previous work [32, 31], we induce activity grammars of ADIOS [37] provided
by [31]. Two types of ADIOS-induced grammar are induced: ADIOS-AND-induced grammar,
primarily composed of AND rules with limited generalization capabilities, and ADIOS-OR-induced
grammar, predominantly incorporating OR rules, offering improved generalization. Please refer to
Appendix C.1 for grammar induction details.
For BEP, we configured the queue size N queue to be 20. For efficiency, we adjust the sampling rate
of the input video features to 50 for Breakfast and 100 for 50 Salads. We use two widely used models
for the temporal action segmentation: ASFormer [45] based on Transformer and MS-TCN [9] based
on CNNs. Since we apply the proposed method to the reproduced temporal action segmentation
models, we directly compare and evaluate the performance based on the reproduced results.
4.3

Evaluation framework for activity grammar

We propose a novel evaluation framework to assess the generalization and discrimination capabilities
of the activity grammar. Figure 3 shows the overall process of the grammar evaluation framework.
We first generate a set of synthetic activity grammars G S randomly. Action sequences a ∈ Diall are
generated from each synthetic grammar GSi ∈ G S , and these sequences are randomly divided into two
sets: seen seen and unseen. For each seen set, a grammar induction algorithm is applied, resulting in
the induced grammar GIi consisting in a corresponding set of induced grammars G I . For grammar
evaluation, the induced grammar GIi ∈ G I parses action sequences from the entire unseen sets. The
induced grammar should accurately parse the action sequences generated by the original synthetic
grammar from which it was induced, while also effectively discriminating those generated by other
synthetic grammars.
To simulate real-world video action sequences, we generate the synthetic activity grammars assuming
temporal dependencies across actions. This indicates that certain actions follow a temporal order
7

Table 3: The performance comparison on 50Salads
model

reprod.

refinement

grammar
induction

edit

F1@10

F1@25

F1@50

acc.

ASFormer
[45]

✓
✓
✓
✓

✓
✓
✓

ADIOS-AND
ADIOS-OR
KARI (ours)

75.0
76.5
58.3
61.1
79.9

76.0
83.8
70.0
72.0
85.4

70.6
81.7
68.0
70.1
83.8

57.4
74.8
59.4
62.4
77.4

73.5
86.1
76.2
78.9
85.3

MS-TCN
[9]

✓
✓
✓
✓

✓
✓
✓

ADIOS-AND
ADIOS-OR
KARI (ours)

67.9
62.4
56.8
61.9
66.7

76.3
69.5
66.4
69.1
75.1

74.0
65.3
63.8
66.9
73.2

64.5
55.7
52.9
57.2
60.8

80.7
75.2
72.5
74.2
76.7

Table 4: The performance comparison on Breakfast
model

reprod.

refinement

grammar
induction

edit

F1@10

F1@25

F1@50

acc.

ASFormer
[45]

✓
✓
✓
✓

✓
✓
✓

ADIOS-AND
ADIOS-OR
KARI (ours)

75.0
75.6
69.2
70.3
77.8

76.0
77.3
69.8
71.8
78.8

70.6
72.0
64.9
66.8
73.7

57.4
59.4
52.2
54.2
60.8

73.5
74.3
72.4
71.8
74.0

MS-TCN
[9]

✓
✓
✓
✓

✓
✓
✓

ADIOS-AND
ADIOS-OR
KARI (ours)

61.7
69.7
68.0
69.6
74.9

52.6
70.7
66.7
69.2
74.6

48.1
65.1
61.0
63.3
68.7

37.9
52.6
48.0
50.3
55.1

66.3
69.4
68.4
68.2
68.8

while others do not adhere to such dependencies. To prevent parsing failures arising from uncovered
terminals, we maintain a consistent set of terminals throughout the entire grammar while randomly
assigning key actions to these terminals. The number of variables is randomly determined for
each synthetic grammar. As evaluation metrics, we use precision and recall similar to the previous
work [37, 3]. For the induced grammar GIi , action sequences successfully parsed from the synthetic
grammar GSi are classified as positive samples from the entire unseen sets, otherwise considered
negative samples.
Details. In our experiment, we generate a total of 100 grammars, each consisting of 20 variables and
20 terminals. We have developed two types of synthetic grammars that differ in terms of temporal
hierarchical difficulty. In synthetic grammar I, each terminal is allocated to a single variable, while in
synthetic grammar II, terminals are randomly assigned multiple times to different variables. Three
types of grammars are evaluated: induced by ADIOS-AND, ADIOS-OR, and proposed KARI.
Results. Table 1 and Table 2 show the results of grammar evaluation by using synthetic grammar I and
II, respectively. Our activity grammar demonstrates robust generalization performances, achieving a
recall of approximately 1.0 on unseen action sequences compared to others, maintaining comparable
precision. The ADIOS-OR-induced grammar shows better generalization ability compared to the
ADIOS-AND-induced grammar. We visualize a confusion matrix of the three types of grammar:
synthetic grammar, KARI-induced grammar, and ADIOS-OR-induced grammar, as shown in Fig. 4.
The confusion matrix shows the parsing accuracy of each unseen set over the synthetic grammar II.
Higher accuracy is represented by brighter cells in the matrix. KARI-induced grammars demonstrate
similar patterns in their confusion matrix compared to the synthetic grammars. This similarity
indicates their capacity to generalize to unseen sets from which each grammar is induced, allowing
effective discrimination of action sequences from other synthetic grammars.
4.4

Effects of the grammar-based refinement on temporal action segmentation

Table 3 and Table 4 show the performance of applying the proposed method to temporal action
segmentation models [9, 45] across two benchmark datasets. The first row in Table 3 and Table 4
indicates the performance from the original paper [9, 45], whereas the second row represents the
8

Table 5: Ablation study of KARI on 50 Salads. Using both key action and recursive rules is effective
for refining neural predictions from the temporal action segmentation models.
refinement

key actions

recursive rules

edit

F1@10

F1@25

F1@50

acc.

✓
✓
✓

✓
✓
-

✓
-

79.9
69.2 (10.7↓)
62.6 (17.3↓)

85.4
77.1 (8.3↓)
72.9 (12.5↓)

83.8
74.9 (8.9↓)
70.5 (13.3↓)

77.4
67.4 (10.0↓)
63.0 (14.1↓)

85.3
80.9 (4.4↓)
78.8 (6.5↓)

Table 6: BEP vs. GEP. BEP is effective under a fair Table 7: Ablation on N key . Using
comparison to GEP.
proper number of key actions matters.
parser

N queue

edit

F1@10

F1@25

F1@50

acc.

N key

edit

F1@10

F1@25

F1@50

acc.

GEP

10
20
30

73.3
72.1
72.3

81.1
80.5
79.8

79.1
79.1
78.1

72.9
72.3
71.4

84.0
83.9
84.2

BEP

10
20
30

78.3
79.9
78.9

84.9
85.4
85.5

83.2
83.8
83.8

76.9
77.4
77.3

84.9
85.3
85.1

1
2
3
4
5
6

73.0
77.8
79.9
74.3
71.2
68.4

81.5
85.1
85.4
82.2
78.6
75.2

79.9
83.5
83.8
80.5
76.3
73.3

72.6
77.1
77.4
73.5
68.6
64.2

83.5
85.8
85.3
83.3
80.4
77.9

reproduced performance obtained using official codes. The comparison between the second and the
last row of each compartment in each table reveals significant improvements in both edit scores and F1
scores. This result validates the effectiveness of leveraging activity grammars to refine segment-wise
classification. Remarkably, the KARI-induced grammar shows great performance compared to both
ADIOS-induced grammars, demonstrating the importance of generalizing the grammar to cover
unseen action sequences during inference effectively.
4.5

Analysis

Ablation studies of KARI. Ablation studies of KARI are conducted on the 50 Salads dataset using
ASFormer [45], as shown in Table 5 to demonstrate the effectiveness of each component, including
key actions and temporal dependency. The results show that both key actions and recursive rules
contribute to the significant improvement of grammar-based refinement. In particular, using recursive
rules is essential for the activity grammar to be generalized to the unseen action sequences.
BEP vs. GEP. Table 6 presents the performance comparison of GEP and BEP using KARI-induced
grammar. We limit the queue size of both parsers, as the parser, without limitation, fails to complete
parsing within a reasonable time. The results indicate that our BEP outperforms GEP under the
same condition. This is attributed to GEP prioritizing states based on the highest probability, which
increases the risk of getting trapped in local optima when performing selective pruning within specific
branches. In contrast, BEP, which prioritizes low-depth states, allows for easier escape from cycles
and OR nodes, contributing to improved overall performance.
The number of key actions. Table 7 shows the results by adjusting the number of key actions N key
of KARI on 50 Salads. We set the value of N key ranging from 1 to 6, where the induced grammar with
a smaller value generates the larger activity corpus. We find that setting N key to 3 outperforms the
others, demonstrating the importance of achieving an appropriate level of generalization for effective
refinement. Both excessive and insufficient generalization can negatively impact performance,
highlighting the need to strike a balance in the generalization ability of the activity grammar.
Grammar evaluation on real data. We evaluate the parsing recall on the unseen action sequences
of the Breakfast dataset. The results present the average recall across all splits for each activity.
The number inside brackets indicates the average length of action sequences of each activity in D.
Table 8 compares the generalization capability of the five grammar induction algorithms [23, 35, 37],
including KARI (details in Appendix C.1). The result demonstrates that KARI-induced grammar
shows better generalization ability on real data compared to others. Remarkably, the KARI-induced
grammar shows robust performance with the extended average length of the action sequences, whereas
other algorithms exhibit poor generalization.
4.6

Qualitative results

Figure 5 presents a visual representation of the refined segmentation results on benchmark datasets.
The proposed method successfully parses and identifies the actions ‘pour oil’ (red bar in Fig. 5a)
9

Table 8: Grammar evaluation on real data. We evaluate the proposed KARI-induced-grammar on
Breakfast, demonstrating the superior high recall on unseen action sequences from each activity. The
average length of action sequences of each activity is shown in parentheses.
Grammar induction

scrambled
egg (11.9)

pancake
(11.1)

salad
(9.9)

fried egg
(9.5)

juice
(7.2)

coffee
(6.7)

sandwich
(6.0)

cereal
(5.1)

milk
(5.0)

tea
(5.0)

total
(7.7)

Kuehne et al. [23]
Richard et al. [35]
ADIOS-AND [37]
ADIOS-OR [37]
KARI (ours)

0.25
0.25
0.25
0.39
0.84

0.24
0.24
0.24
0.30
0.71

0.0
0.0
0.0
0.37
0.90

0.32
0.32
0.32
0.53
0.70

0.53
0.53
0.53
0.55
0.77

0.80
0.80
0.80
0.80
1.00

0.63
0.63
0.63
0.73
0.91

0.96
0.96
0.96
0.96
0.96

0.78
0.78
0.78
0.78
0.90

0.91
0.91
0.91
0.92
0.98

0.53
0.54
0.54
0.63
0.87

GT

ASFormer
ADIOS-OR
KARI (Ours)
SIL

pour oil

crack egg

stir egg

pour egg2pan

stirfry egg

add saltnpepper

take plate

put egg2plate

put pancake2plate

peel cucumber

cut cucumber

place cucumber

(a) Breakfast
Rgb-23-2

GT

ASFormer
ADIOS-OR
KARI (Ours)
action start/end
cut cheese

cut lettuce
place cheese

place lettuce

cut tomato

mix ingredients

add oil

place tomato
add vinegar

add salt

add pepper

mix dressing

serve salad

(b) 50 Salads

Figure 5: Qualitative results. KARI-induced grammar efficiently insert missing actions and removes
out-of-context actions in ASFormer [45].
and ‘add saltnpepper’, (blue bar in Fig. 5a), which are omitted in the results obtained by using the
ADIOS-OR induced grammar. The results show that KARI-induced grammar allows a more flexible
temporal structure between actions. Furthermore, our method effectively removes actions such as
‘put pancake2plate’ that do not correspond to the intended activity. Similarly, qualitative results on 50
Salads in Fig. 5b show the effectiveness of the proposed method with complex action sequences. The
overall results show that activity grammar-based refinement for the temporal action segmentation
model is effective for correcting the neural predictions by using the grammar as a guide.

5

Conclusion

We have shown that the proposed approach enhances the sequence prediction and discovers its
compositional structure, significantly improving temporal action segmentation in terms of both
performance and interpretability. However, the improvement is limited by the initial output of the
action segmentation network, which remains further research in the future. We believe that the
grammar induction and parsing methods can be easily applied to other sequence prediction tasks.
10

6

Acknowledgements

This work was supported by the IITP grants (2022-0-00264: Comprehensive video understanding and
generation with knowledge-based deep logic (50%), 2022-0-00290: Visual intelligence for space-time
understanding and generation based on multi-layered visual common sense (20%), 2022-0-00959:
Few-shot learning of causal inference in vision and language (20%), and 2019-0-01906: AI graduate
school program at POSTECH (10%)) funded by the Korea government (MSIT).

References
[1] H. Ahn and D. Lee. Refining action segmentation with hierarchical video representations. In
Proc. IEEE International Conference on Computer Vision (ICCV), pages 16302–16310, 2021.
1, 2
[2] N. Behrmann, S. A. Golestaneh, Z. Kolter, J. Gall, and M. Noroozi. Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In Proc.
European Conference on Computer Vision (ECCV), pages 52–68. Springer, 2022. 1, 2
[3] P. Belcák, D. Hofer, and R. Wattenhofer. A neural model for regular grammar induction. arXiv
preprint arXiv:2209.11628, 2022. 8
[4] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics
dataset. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
6299–6308, 2017. 7
[5] M.-H. Chen, B. Li, Y. Bao, and G. AlRegib. Action segmentation with mixed temporal domain
adaptation. In Proc. IEEE Winter Conference on Applications of Computer Vision (WACV),
pages 605–614, 2020. 2
[6] M.-H. Chen, B. Li, Y. Bao, G. AlRegib, and Z. Kira. Action segmentation with joint selfsupervised temporal domain adaptation. In Proc. IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 9454–9463, 2020. 2
[7] J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–
102, 1970. 6, 15
[8] H.-S. Fang, Y. Xu, W. Wang, X. Liu, and S.-C. Zhu. Learning pose grammar to encode human
body configuration for 3d pose estimation. In Proc. AAAI Conference on Artificial Intelligence
(AAAI), volume 32, 2018. 2
[9] Y. A. Farha and J. Gall. Ms-tcn: Multi-stage temporal convolutional network for action
segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 3575–3584, 2019. 1, 2, 3, 5, 7, 8
[10] S.-H. Gao, Q. Han, Z.-Y. Li, P. Peng, L. Wang, and M.-M. Cheng. Global2local: Efficient
structure search for video action segmentation. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 16805–16814, 2021. 2
[11] M. Guo, V. Thost, B. Li, P. Das, J. Chen, and W. Matusik. Data-efficient graph grammar learning
for molecular generation. In Proc. International Conference on Learning Representations
(ICLR), 2021. 2
[12] Y. Hong, Q. Li, R. Gong, D. Ciao, S. Huang, and S.-C. Zhu. Smart: A situation model for algebra
story problems via attributed grammar. In Proc. AAAI Conference on Artificial Intelligence
(AAAI), pages 13009–13017, 2021. 2
[13] Y. Hong, Q. Li, S.-C. Zhu, and S. Huang. Vlgrammar: Grounded grammar induction of vision
and language. In Proc. IEEE International Conference on Computer Vision (ICCV), pages
1665–1674, 2021. 2
[14] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to automata theory, languages, and
computation. Acm Sigact News, 32(1):60–65, 2001. 1, 2, 18
[15] Y. Huang, Y. Sugano, and Y. Sato. Improving action segmentation via graph-based temporal
reasoning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 14024–14034, 2020. 1, 2
11

[16] Y. Ishikawa, S. Kasai, Y. Aoki, and H. Kataoka. Alleviating over-segmentation errors by
detecting action boundaries. In Proc. IEEE Winter Conference on Applications of Computer
Vision (WACV), pages 2322–2331, 2021. 1, 2
[17] F. Jelinek, J. D. Lafferty, and R. L. Mercer. Basic methods of probabilistic context free grammars.
Springer, 1992. 3
[18] D. Jurafsky. Speech & language processing. Pearson Education India, 2000. 18
[19] S. Karaman, L. Seidenari, and A. Del Bimbo. Fast saliency based pooling of fisher encoded
dense trajectories. In ECCV THUMOS Workshop, volume 1, page 5, 2014. 2
[20] Y. Kim. Sequence-to-sequence learning with latent neural grammars. Advances in Neural
Information Processing Systems, 34:26302–26317, 2021. 2
[21] Y. Kim, C. Dyer, and A. M. Rush. Compound probabilistic context-free grammars for grammar
induction. In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, pages 2369–2385, 2019. 2
[22] Y. Kim, A. M. Rush, L. Yu, A. Kuncoro, C. Dyer, and G. Melis. Unsupervised recurrent neural
network grammars. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 1105–1117, 2019. 2
[23] H. Kuehne, A. Arslan, and T. Serre. The language of actions: Recovering the syntax and
semantics of goal-directed human activities. In Proc. IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 780–787, 2014. 1, 2, 7, 9, 10, 18, 19
[24] H. Kuehne, A. Richard, and J. Gall. Weakly supervised learning of actions from transcripts.
Computer Vision and Image Understanding, 163:78–89, 2017. 1, 2
[25] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. Temporal convolutional networks for
action segmentation and detection. In Proc. IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 156–165, 2017. 1, 2
[26] J. Li, P. Lei, and S. Todorovic. Weakly supervised energy-based learning for action segmentation.
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October
2019. 6
[27] Q. Li, S. Huang, Y. Hong, Y. Chen, Y. N. Wu, and S.-C. Zhu. Closed loop neural-symbolic
learning via integrating neural perception, grammar parsing, and symbolic reasoning. In Proc.
International Conference on Machine Learning (ICML), pages 5884–5894. PMLR, 2020. 2
[28] A. Piergiovanni, A. Angelova, and M. S. Ryoo. Differentiable grammars for videos. In Proc.
AAAI Conference on Artificial Intelligence (AAAI), pages 11874–11881, 2020. 2
[29] A. Piergiovanni, A. Angelova, A. Toshev, and M. S. Ryoo. Adversarial generative grammars for
human activity prediction. In Proc. European Conference on Computer Vision (ECCV), pages
507–523. Springer, 2020. 2
[30] S. Qi, S. Huang, P. Wei, and S.-C. Zhu. Predicting human activities using stochastic grammar.
In Proc. IEEE International Conference on Computer Vision (ICCV), pages 1164–1172, 2017. 2
[31] S. Qi, B. Jia, S. Huang, P. Wei, and S.-C. Zhu. A generalized earley parser for human activity
parsing and prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence,
43(8):2538–2554, 2020. 2, 7, 15
[32] S. Qi, B. Jia, and S.-C. Zhu. Generalized earley parser: Bridging symbolic grammars and
sequence data for future prediction. In Proc. International Conference on Machine Learning
(ICML), pages 4171–4179. PMLR, 2018. 1, 2, 5, 6, 7, 18
[33] A. Richard, H. Kuehne, and J. Gall. Weakly supervised action learning with rnn based fineto-coarse modeling. In Proc. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 754–763, 2017. 1, 2
[34] A. Richard, H. Kuehne, and J. Gall. Action sets: Weakly supervised action segmentation without
ordering constraints. In Proc. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 5987–5996, 2018. 2
[35] A. Richard, H. Kuehne, A. Iqbal, and J. Gall. Neuralnetwork-viterbi: A framework for
weakly supervised video learning. In Proc. IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 7386–7395, 2018. 1, 2, 6, 9, 10, 18, 19
12

[36] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity
detection of cooking activities. In Proc. IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1194–1201. IEEE, 2012. 2
[37] Z. Solan, D. Horn, E. Ruppin, and S. Edelman. Unsupervised learning of natural languages.
Proceedings of the National Academy of Sciences, 102(33):11629–11634, 2005. 2, 3, 7, 8, 9,
10, 18
[38] Y. Souri, Y. A. Farha, F. Despinoy, G. Francesca, and J. Gall. Fifa: Fast inference approximation
for action segmentation. In Pattern Recognition: 43rd DAGM German Conference, DAGM
GCPR 2021, Bonn, Germany, September 28–October 1, 2021, Proceedings, pages 282–296.
Springer, 2022. 2
[39] S. Stein and S. J. McKenna. Combining embedded accelerometers with computer vision for
recognizing food preparation activities. In Proceedings of the 2013 ACM international joint
conference on Pervasive and ubiquitous computing, pages 729–738, 2013. 7, 19
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and
I. Polosukhin. Attention is all you need. Proc. Neural Information Processing Systems
(NeurIPS), 30, 2017. 2
[41] N. N. Vo and A. F. Bobick. From stochastic grammar to bayes network: Probabilistic parsing
of complex activity. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 2641–2648, 2014. 2
[42] B. Wan, W. Han, Z. Zheng, and T. Tuytelaars. Unsupervised vision-language grammar induction
with shared structure modeling. In Proc. International Conference on Learning Representations
(ICLR), 2021. 2
[43] Y. Xu, W. Wang, T. Liu, X. Liu, J. Xie, and S.-C. Zhu. Monocular 3d pose estimation via
pose grammar and data augmentation. IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI), 2021. 2
[44] Z. Xu, Y. S. Rawat, Y. Wong, M. Kankanhalli, and M. Shah. Don’t pour cereal into coffee: Differentiable temporal logic for temporal action segmentation. In Advances in Neural Information
Processing Systems, 2022. 2
[45] F. Yi, H. Wen, and T. Jiang. Asformer: Transformer for action segmentation. In Proc. British
Machine Vision Conference (BMVC), 2021. 1, 2, 3, 5, 7, 8, 9, 10, 19

13

Appendices
In this supplement, we provide detailed descriptions of the proposed method and additional results,
which are omitted in the main paper due to the lack of space. In Section A, we will describe the
formulation of the probability of activity grammar. Algorithmic details of BEP are included in
Section B. Section C compares KARI with the existing grammar induction algorithms for activity
grammar and Section E presents additional qualitative results. We conclude this Appendix by
discussing the broader impact of our research in Section F.

A

Formulation of the probabilities in KARI

In this section, we describe the formulation of transition probability pΩ
i,j and the escape probability
M
pΩ
and
p
in
Eq.
5
and
6
in
Section
A.1.
In
the
following,
the
derivation
of the expectation of the
ϵ
i,ϵ
escape probability is described in Section A.2.
A.1

Formulation of the escape and transition probability

We first introduce and escape probabilities and the transition probabilities introduced in Eq. 5.
Pre-processing. Let hΩ
i represent a list of action sub-sequences, where the sub-sequence from
Ω
hΩ
i removes actions that does not exist in the action group di from the action sub-sequence in
Ω
D . The empty string ϵ remains when the action sub-sequence does not include actions within
R
the action group dΩ
i . For example in Fig. 2, a list of sub-sequences h1 can be structured as
R
h1 = [[pour milk], [spoon sugar, pour milk], [pour milk, spoon sugar], [spoon sugar]] with the
R
corresponding action group dR
1 = {pour milk, spoon sugar}. Similary, a list of sub-sequences h2
R
R
is structured as h2 = [ϵ, ϵ, [stir coffee], [stir coffee]] with the action group d2 = {stir coffee}. This
pre-processing step of generating hΩ
i enables us to consider the statistical probabilities associated
with actions.
Formulation of the escape probability. The escape probability pΩ
i,ϵ and the transition probability
Ω
rec
pi,j are both defined based on the number of recursion n of the current timestep; thereby these
probabilities are represented as functions of nrec . We first define the escape probability function:

 
a ∈ hΩ

i |a = ϵ

if nrec = 1 ,

|hΩ
|
Ω
rec
i
pi,ϵ (n ) =
(13)


 1Ω
otherwise ,
N̄ hi
Ω
rec
where N̄ hi is the average length of the sub-sequences in hΩ
= 1,
i . In the first recursion, i.e., n
the probability calculation solely considers statistics of the actions. Otherwise, the probability is
calculated based on the expected number of recursions, which will be introduced in Appendix A.2. In
R
Fig. 2, pR
1,ϵ (1) = 0, since none of the action sub-sequence a from h1 is equal to the empty sequence,
2
R
rec
and p1,ϵ (n > 1) = 3 , since the average length of sub-strings in hR
1 is 1.5.
Formulation of the transition probability. The action sequence a = [a1 , a2 , ..., aN ] represents the
distinct action labels for the video segments, where ai ̸= ai+1 as described in Section 3. In order to
prevent the repetition of the same action in Eq. 5, we introduce an additional input q when defining
the transition probability. Here, q refers to the index of the actions selected by the rule in the previous
step, specifically at (nrec − 1)th step where nrec > 1. We simply put q to 0 in the first recursion,
i.e. nrec = 1, which does not affect the results. The transition probability pΩ
i,j is defined by:

 
Ω
Ω
a ∈ hi | a1 = di,j



if nrec = 1 ,

Ω|

|h

i

rec
if nrec > 1 and j = q,
pΩ
, q) = 0
(14)
i,j (n



Ω
Ω
rec

p
(1,
0)
1
−
p
(n
)
 i,j
i,ϵ


P
otherwise.

Ω (1, 0)
p
l̸=q i,l
M
M
The escape probability pM
in Eq. 6 is
ϵ and the transition probability pi,j for the middle variable V
defined in the same way as Eq. 13 and Eq. 14, respectively.

14

A.2

Derivation of the escape probability

We introduce the formulation of the escape probability pΩ
i,ϵ in Appendix A.1. The escape probability
is required to avoid an infinite loop of the rules and guarantee the length of sequences from the
recursive rules in Eq. 5. Since the number of recursions directly determines the sequence lengths, we
determine the escape probability regarding the length of action sequences. For notational simplicity,
we denote the escape probabilities pΩ
i,ϵ by p, omitting superscripts and subscripts. The expectation of
the number of recursions is calculated by:
lim

n→∞

n
X
k=1

(1 − p)(1 + (1 − p)n − np(1 − p)n )
,
n→∞
p

(15)

1−p
.
p

(16)

kp(1 − p)k = lim
=

Since we derive the escape probability when nrec > 1, the expected number of recursions is equal to
N̄ − 1:
1−p
= N̄ − 1,
(17)
p
, where N̄ is the average length of action sequences. Finally, we obtain the escape probability by
p=

1
,
N̄

(18)

where this equation is used in Eq. 13 when nrec > 1. The derivation of the escape probability pM
ϵ of
the middle variable V M in Eq. 6 is also formulated as the same.

B

Breadth-first Earley Parser (BEP)

B.1

Earley parser

The Earley parser [7] is a classic algorithm that efficiently parses strings for context-free grammar. It
operates by maintaining a set of states of the parsing process. Each state consists of a production rule,
a position within that rule, and a position in the input string. The parser builds a parse tree for the
input string, which records the structure of parsing. The Earley parser is commonly used for natural
language processing tasks, such as syntactic analysis and semantic parsing.
The Earley parser consists of three main operations: scanning, prediction, and completion.
• Scanning: The parser matches a terminal symbol in the input string with the current position
in the production rule. This operation moves the parser forward in the input string.
• Prediction: The parser expands a variable in the production rule based on the current
position. It adds new states to the set of states for possible future matches.
• Completion: When the parser reaches the end of the production rule, it searches for other
states predicting the head variable of the current rule. Subsequently, the parser update the
positions within the rule of the searched states.
By iterating these three operations, the Earley parser builds a parse chart that represents all possible
parse trees for the input string.
B.2

Implementation details

The parsing probability p(F1:t → a | G) (Eq. 9) can suffer from numerical underflow due to its
exponential decrease as t increases. To overcome the issue, we compute the probabilities in logarithmic space, following [31]. For simplicity, we denote log(p(F1:t−1 → a | G)) as PN and
log(p(F1:t−1 → a1:N −1 | G)) as PN −1 below:
PN′ −1 = log(g(x|a1:N −1 , G)) + PN −1 ,
z = max(PN , PN′ −1 ),

log(p(F1:t → a | G)) =

log(Yt,x ) + z + log(exp(PN − z) + exp(PN′ −1 − z)).
15

(19)

(20)
(21)

120
121
122
123
124

125
126
127
128
129
130
131
132
133
134

135
136
137
138

139
140
141

B.4

Parsing example

In this section, we provide an example to help understand how BEP works. For the sake of simplicity
and ease of calculation, we make the assumption that the frame-wise probability of the entire action
class from the temporal action segmentation networks is equal. First of all, we define toy grammar as
follows.
S !ABC
A ! A1 A2
A1 ! x1 [0.7] | x2 [0.3]
A2 ! A3 A4 [0.5] | ✏ [0.5]
A3 ! x4 A4 [0.5] | ✏ [0.5]
A4 ! x3 A3 [0.5] | ✏ [0.5]
B ! x5 [1.0]
C ! x6 [0.7] | x7 [0.3]
In the context of grammar, S indicates the starting variable. A, B, C, and Ai for i = [1, 2, 3, 4]
Figurewhile
6: Toy
used
for4,the
example
of theterminals.
BEP parsing
represent the variables,
xjgrammar
for j = [1,
2, 3,
5, 6,
7] represent
Table A1 is the history of parsing with the toy grammar. It shows the currently popped state, visiting
order,
previous state,
and which
states
are currently
in the
queue.
The numbers
column
For
theprefix,
computational
efficiency,
we set
the sampling
stride
of input
matrix
Y as 50 in
forthe
Breakfast
pop,100
from,
queue indicate
m, n, and
d ofthe
themaximum
state. Thelength
column
represents
the prefix
probability
and
forand
50Salads.
Additionally,
we set
of pthe
refined action
sequence
as 20
excluding
the and
frame-wise
probability, which can be considered a parsing probability since all framefor
Breakfast
25 for 50Salads.
wise probabilities are assumed to be the same. Note that the table includes some history after the
parsed sequence satisfied the early stop constraint to illustrate how BEP prioritizes the states. In
B.3 Parsing algorithm
order 14, even though the probability of state S(1, 1, 3) is higher, BEP visits the state S(3, 0, 2) with
a lower depth.
Algorithm
1 shows the parsing procedure of BEP. We utilize a priority queue that sorts the elements
in ascending order. The currentSet stores multiple states with the same m, n, and d. See B.4 for
the
BEP stops
parsing
when
the probability
of a∗ has
the highest probability compared to
C examples.
Comparison
with
other
grammar
induction
algorithms
states in the queue while ensuring the current state can reach depth 1 with only completions.
In this section, we introduce the existing grammar induction algorithms for activity grammar [2, 5, 6]
B.4
Parsing
in Section
C.1. example
Additional experimental results and analysis to compare KARI with the other grammar
induction algorithms will be given in Section C.2 and C.3.
In this section, we provide an example to help understand how BEP works. For simplicity, we
assume
that frame-wise
classalgorithms
probabilities from the segmentation model are identical across all action
C.1 Grammar
induction
classes. First of all, we define toy grammar as shown in Figure 6. In the context of grammar, S
indicates
the starting
variable.grammar
A, B, C,proposed
and Ai for
= [1, 2,et3,al.
4] introduces
represent the
variables,
while
xj
The
hierarchical
context-free
by iKuehne
a root
rule that
allows
for
j =selection
[1, 2, 3, 4,of5,activities.
6, 7] represent
for the
In thisterminals.
grammar, the root rule with the starting variable S is defined as
Table 9 is the history of parsing with the toy grammar. It shows the currently popped state, the
visiting order, the parsed prefix, the previous state,
4 and which states are currently in the queue. The
three consecutive numbers in the column pop, from, and queue indicate m, n, and d of the state.
The column p represents the prefix probability excluding the frame-wise probability, which can be
considered a parsing probability since all frame-wise probabilities are assumed to be the same. Note
that the table includes some history after the parsed sequence satisfied the early stop constraint to
illustrate how BEP prioritizes the states. Returning to the subject, the table shows BEP preferentially
searches for states with a small depth. In order 14, even though the probability of state Q(1, 1, 3) is
higher, BEP visits the state Q(3, 0, 2) with a lower depth.

16

Algorithm 1: Breadth-first Earley Parser (BEP)
Input :probability matrix Y , grammar G, queue size N queue
Output: Best parsed sequence a∗
1 function Breadth-first Earley Parser
2
q ← priorityQueue() ;
// init priority queue
3
Q(0, 0, 0) ← (Γ → R, Q(0, 0, 0), ϵ, 1.0) ;
// set initial state
4
q.push(0, (1.0, 0, 0, ϵ, Q(0, 0, 0))) ;
// push initial state to queue
5
a∗ ← ϵ ;
// init a∗
6
while (d, (p(a1:|a|−1 ), m, n, a1:|a|−1 , currentSet)) ← q.pop() do
7
for (r, Q(i, j, k), a, p(a...)) ∈ currentSet do
// update a∗ when a has higher probability

8
9
10

if p(a) > p(a∗ ) then
a∗ ← a
end if

// prediction
11
12
13
14
15
16
17

if r is (A → α · Bβ) then
for each (B → Γ) in G do
r′ ← (B → ·Γ)
Q′ ← (r′ , Q(m, n, d), a, p(a...))
Q(m, n, d + 1).add(Q′ ) q.push(d + 1, (p(a...), m, n, a, Q(m, n, d + 1)))
end for
end if

// scanning
18
19
20
21
22
23
24

if r is (A → α · xβ) then
r′ ← (A → αx · β)
n′ ← |Q(m + 1)|
Q′ ← (r′ , Q(i, j, k), a + x, p((a + x)...))
Q(m + 1, n′ , d).add(Q′ )
q.push(d, (p(a + x)..., m + 1, n′ , d, Q(i, j, k)))
end if

// completion

25
26
27
28
29
30
31
32
33

if r is (B → Γ·) then
for each ((A → α · Bβ), Q(i′ , j ′ , k ′ ), a, p(a...)) in Q(i, j, k) do
r′ ← (A → αB · β)
Q′ ← (r′ , Q(i′ , j ′ , k ′ ), a, p(a...))
Q(m, n, d − 1).add(Q′ )
q.push(d − 1, (p(a...), m, n, a, Q(m, n, d − 1)))
end for
end if
end for

// early stop when a∗ has the highest probability and finished parsing
34
35
36
37
38

if p(a∗ ) > p(a′ ) for all a′ in q then
if a∗ has parsed then
return a∗
end if
end if

// Queue pruning
39

if |q| > N queue then

// sort q in probability descending order

q ′ ← sorted(q, key = p(a...), reverse = T rue)
q.clear()
42
for i ← 1 to N queue do
43
q.push(q ′ .pop())
44
end for
45
end if
46
end while
47
return a∗
48 end function
40

41

17

Table 9: Parsing log for the given toy grammar through BEP.
pop

order

m

n

d

-

1

0

0

0

000

2

0

0

1

001

3

0

0

2

002

4

0
0

0
0

3
3

003

5
19

1
1

0
1

3
3

103

6

1

0

2

102

7

1
1

0
0

3
3

0
0
0

4
4
3

8

1
1
2

203

9

2

0

2

202

10

2

0

1

201

11

2

0

2

202

12

3

0

2

302

13

3

0

1

301

14

3
3

0
0

2
2

302

15
17

4
4

0
1

2
2

402

16

4

0

1

103

-

401

-

-

-

-

412

18

4

1

1

411

-

-

-

-

113

-

1

1

2

rule

prefix

operation

from

p

queue

Γ → ·S

-

ROOT

-

1

000

PRED

000

1

001

A → ·A1 A2

-

PRED

001

1

002

-

PRED
PRED

002
002

0.7
0.3

003

A1 → x 1 ·
A1 → x 2 ·

x1
x2

SCAN
SCAN

003
003

0.7
0.3

103, 113

A → A1 · A2

x1

COMP

103

0.7

113, 102

A2 → ·A3 A4
A2 → ·e

x1
x1

PRED
PRED

102
102

0.35
0.35

103, 113

A3 → ·a4A4
A3 → ·e
A2 → e·

x1
x1
x1

PRED
PRED
SCAN

103
103
103

0.175
0.175
0.35

203, 113, 104

A → A1 A2 ·

x1

COMP

203

0.35

202, 113, 104

x1

COMP

202

0.35

201, 113, 104

B → ·x5

x1

PRED

201

0.35

202, 113, 104

x1 x5

SCAN

202

0.35

302, 113, 104

S → AB · C

x 1 x5

COMP

302

0.35

301, 113, 104

x1 x5
x1 x5

PRED
PRED

301
301

0.245
0.105

302, 113, 104

C → x6 ·
C → x7 ·

x1 x5 x6
x1 x5 x7

SCAN
SCAN

302
302

0.245
0.105

402, 412, 113, 104

x1 x5 x6

COMP

402

0.245

401, 412, 113, 104

Γ → S·

x1 x5 x6

COMP

401

0.245

412, 113, 104

x1 x5 x7

COMP

412

0.105

411, 113, 104

Γ → S·

x1 x5 x7

COMP

411

0.105

113, 104

x2

COMP

113

0.3

112, 104

S → ·ABC
A1 → ·x1
A1 → ·x2

S → A · BC
B → x5 ·
C → ·x6
C → ·x7

S → ABC·
S → ABC·
A → A1 · A2

C

Comparison with the existing grammar induction algorithms

C.1

Existing grammar induction algorithms

Kuehne et al. [23] introduce a hierarchical context-free grammar induction algorithm. The root rule
with the starting variable S is induced as S → V1 | V2 | ... | VN A , where the variable Vi represents a
single activity and N A is the number of activities from the dataset. Then each Vi expands into action
sequences from each activity, i.e., the rule is formed as: Vi → Ai,1 | Ai,2 | ... | Ai,|Ai | , where Ai is a
set of action sequences from the i-th activity and each Ai,j represents a j-th action sequence in Ai .

Richard et al. [35] propose a grammar induction method for a probabilistic right-regular grammar,
where every rule has the form of H̃ → c H. The algorithm is motivated by n-gram models [18] and
finite grammars [14]. Specifically, the variable H̃ represents an action sequence a1:n , H represents
an action sequence a1:n−1 , and a terminal c is an action class of an . The induced grammar can
express the intermediate action sequences a1:n and expands its rules based on the sequential order of
actions.
Recently, Qi et al. [32] adopt the Automatic Distillation of Structure (ADIOS) [37] algorithm to induce
a probabilistic context-free grammar. The ADIOS algorithm finds the significant patterns (AND
rules) and equivalence action classes (OR rules) from the given action sequences. The algorithm
identifies repetitive patterns in action sequences to minimize redundant sequences and find potential
candidates for generalized action classes. Following the grammar induction methods of ADIOS, we
18

Table 10: The performance comparison with other grammar induction algorithms on two benchmark
datasets.
dataset

reprod.

refinement

grammar induction

edit

F1@10

F1@25

F1@50

acc.

50Salads
[39]

✓
✓
✓
✓
✓
✓

✓
✓
✓
✓
✓

Kuehne et al. [23]
Richard et al.[35]
ADIOS-AND
ADIOS-OR
KARI (ours)

76.5
62.9
63.1
58.3
61.1
79.9

83.8
73.0
73.1
70.0
72.0
85.4

81.7
70.6
70.5
68.0
70.1
83.8

74.8
63.0
62.9
59.4
62.4
77.4

86.1
78.6
78.7
76.2
78.9
85.3

Breakfast
[23]

✓
✓
✓
✓
✓
✓

✓
✓
✓
✓
✓

Kuehne et al. [23]
Richard et al.[35]
ADIOS-AND
ADIOS-OR
KARI (ours)

75.6
72.8
77.3
69.2
70.3
77.8

77.3
74.0
77.2
69.8
71.8
78.8

72.0
69.1
72.2
64.9
66.8
73.7

59.4
55.5
59.4
52.2
54.2
60.8

74.3
72.9
74.1
72.4
71.8
74.0

set a decreasing ratio of the motif extraction algorithm η to 1, a significance level for the decrease ratio
γ to 0.1, and the context window size 1 for ADIOS-AND-induced grammar. For ADIOS-OR-induced
grammar, we set η to 0.9, γ to 0.1, and the context window size to 4.
However, none of these approaches have managed to effectively integrate recursive rules, which are
crucial for representing intricate and lifelike structures of action phrases and activities. The proposed
KARI algorithm introduces a probabilistic context-free grammar that allows for the expression of
complex activity structures, which captures a distinctive temporal structure based on key actions.
C.2

Performance on temporal action segmentation

In Table 10, we compare the performance of each grammar induction algorithm on refining temporal action segmentation models [45]. The overall results show that the KARI-induced grammar
demonstrates the best refinement performance compared to the other grammar induction algorithms,
showing a significant performance gap in both datasets. The induced grammar of Richard et al.also
shows better performance than other grammar induction algorithms except for KARI, indicating
that the ability to represent intermediate action sequences by production rules helps improve refinement performance. In conclusion, the generalization capabilities and variability of expressing
action sequences are essential to guide the temporal action segmentation network to better refinement
results.

D

Examples of KARI-induced grammars

In this section, we provide an activity grammar induced by KARI. Based on example action sequences
related to ‘coffee’ activity from the Breakfast dataset, as shown in Fig. 7, a resultant KARI-induced
grammar is obtained as follows:
S → ‘SIL’ V L V M V R ‘SIL’[1.0]

V L → ‘take cup’ [0.4545] | ϵ [0.5455]

V M → ‘pour coffee’ [1.0]
V R → V1R V2R [1.0]

R
R
R
V1R → ‘pour milk’ V1,1
[0.4545] | ‘pour sugar’ V1,2
[0.0909] | ‘spoon sugar’ V1,3
[0.2727]

| ϵ [0.1819]

R
V1,1
→ ‘pour
R
V1,2 → ‘pour
R
V1,3
→ ‘pour
R
V2 → ‘stir

R
R
sugar’ V1,2
[0.1333] | ‘spoon sugar’ V1,3
[0.2667] | ϵ [0.6]

R
R
milk’ V1,1
[0.24] | ‘spoon sugar’ V1,3
[0.16] | ϵ [0.6]
R
R
milk’ V1,1
[0.3] | ‘pour sugar’ V1,2
[0.1] | ϵ [0.6]

coffee’ [0.5455] | ϵ [0.4545]

19

a1 = ‘pour coffee’
a2 = ‘pour coffee’ ‘pour milk’
a3 = ‘take cup’ ‘pour coffee’
a4 = ‘take cup’ ‘pour coffee’ ‘pour milk’
a5 = ‘pour coffee’ ‘spoon sugar’ ‘pour milk’
a6 = ‘pour coffee’ ‘spoon sugar’ ‘stir coffee’
a7 = ‘pour coffee’ ‘pour sugar’ ‘pour milk’ ‘stir coffee’
a8 = ‘pour coffee’ ‘pour milk’ ‘spoon sugar’ ‘stir coffee’
a9 = ‘take cup’ ‘pour coffee’ ‘pour milk’ ‘spoon sugar’ ‘stir coffee’
a10 = ‘take cup’ ‘pour coffee’ ‘spoon sugar’ ‘pour milk’ ‘stir coffee’
a11 = ‘take cup’ ‘pour coffee’ ‘pour milk’ ‘pour sugar’ ‘stir coffee’

Figure 7: Example sequences of ‘coffee’ activity from Breakfast.
S ! E L E M E R [1.0]
(18)
To clarify varyingLtransition and the escape probabilities according to the number of recursions nrec
E ! ✏ [0.55] | ‘take cup’ [0.45]
(19)Ω
in Eq. 14, we modify the expression of the recursive rule in Eq. 5 by introducing a sub-variable Vi,j
.
M
3
E
!
‘pour
coffee’
[1.0]
(20)
Refer to the official GitHub repository for more results .
E R ! E1R E2R [1.0]
(21)

E

R
R
R
E1R !
✏ [0.19] | E1,1
[0.45] | E1,2
[0.09] | E1,3
[0.28]
Qualitative
results

(22)

R
R
E1,1
! ‘pour milk’ R1,1
[1.0]
(23)
We provide additional
qualitative results for the Breakfast and 50Salads. Figure 8 shows examples
R
R
E1,2 !refinements
‘pour sugar’
R1,2
[1.0]
(24)
of successful output
by the
KARI-induced
grammar, demonstrating its ability to cover
R comprising combinations
R of multiple actions. This is further evident in Figure 8a,
various sequences
E1,3 ! ‘spoon sugar’ R1,3 [1.0]
(25)
where ADIOS-OR
falls short in covering the ‘add dressing’ action following the ‘serve salad’ action,
R
R
R
R1,1 !it✏proficiently.
[0.6] | ‘pour sugar’ R1,2 [0.13] | ‘spoon sugar’ R1,3 [0.27]
(26)
while KARI handles
R
R
R
✏ [0.6]
‘pour
milk’ R1,1grammar
[0.24] | ‘spoon
sugar’
R1,3further
[0.16]improvement
(27)is
We also show R
failure
of| the
KARI-induced
in Figure
9, where
1,2 !cases
R
R grammar sometimes deletes
R
needed. We acknowledge
that
the
KARI-induced
certain
actions.
This
R1,3 ! ✏ [0.6] | ‘pour milk’ R1,1 [0.3] | ‘pour sugar’ R1,2 [0.1]
(28)
deletion of actions,
along with the challenges posed by inaccurate identification of actions by the
E2R !show
✏ [0.45]
| ‘stir
coffee’ [0.55]
(29)as
segmentation model,
areas
for improvement
in the refinement process. We recognize these
70
71

opportunities for future work to enhance the performance of the grammar induction algorithm and
A.2 Breadth-first
Earley Parser (BEP)
address
these limitations.
A.2.1

F

Implementation details

Broader Impact

In practical scenarios, the probability p(f0:t ! a) exhibits exponential decrease as t increases, which
canresearch
eventually
result ininnumeric
underflow.
To overcome
this issue,
we compute
the probabilities
The
presented
this paper
holds significant
potential
for impact
across multiple
domains.
74
in
logarithmic
space,
following
[2].
For
convenience,
we
denote
log(p(f
!
a)) as P
0:t
1
|a| and
The development of efficient and effective grammar induction algorithms for activity
grammar,
75 coupled
log(p(f0:t
!
a
))
as
P
.
1
0:|a|
1
|a|
1
with the Breadth-first Earley parser, has the potential to greatly enhance human activity
0
recognition and understanding
systems. This, in turn, can have far-reaching implications in various
P|a|
(30)
1 = log(g(x|a0:|a| 1 , G)) + P|a| 1 ,
applications, such as video
surveillance, human-computer interaction, robotics, and healthcare
0
z =accuracy
max(P|a|and
, P|a|
(31)
monitoring. By improving the
efficiency
of activity recognition systems, our research
1 ),
0
contributeslog(p(f
to advancements
in these domains, enabling more robust
and intelligentz)).
systems.(32)
The
z) + exp(P
0:t ! a)) = log(Yt,x ) + z + log(exp(P|a|
|a| 1
broader implications of this research extend beyond activity grammar induction
itself, fostering
innovation
and enhancing
the capabilities
intelligent
systems
fields.
76
For the computational
efficiency,
we set theofsampling
stride
as 50in
fordiverse
Breakfast
and 100 for 50salads.
72
73

77

A.2.2

78

Algorithm 1 shows the parsing procedure of BEP.

79

A.2.3

80
81

Parsing algorithm

Parsing example

In this section, we provide an example to help understand how BEP works. For convenience, we
assume that the frame-wise probabilities of all action classes from the segmentation model are equal
3

https://github.com/gongda0e/KARI

3

20

GT

ASFormer
ADIOS-OR
KARI (Ours)
action start/end
cut cheese

cut lettuce

place cheese

place lettuce

cut tomato

mix ingredients

add oil

place tomato
add vinegar

peel cucumber

cut cucumber

add salt

add pepper

pour milk

stir_cereals

mix dressing

serve salad

(a) 50Salads

GT

ASFormer
ADIOS-OR
KARI (Ours)
SIL

take bowl

pour cereals

(b) Breakfast (activity: cereals)

GT

ASFormer
ADIOS-OR
KARI (Ours)
SIL

take cup

add_teabag

pour water

(c) Breakfast (activity: tea)

Figure 8: Qualitative results on successful cases

21

place cucumber
add dressing

P07_webcam01_cereals

GT

ASFormer
ADIOS-OR
KARI (Ours)

P07_webcam01_cereals

action start/end

add pepper

peel cucumber

cut tomato

add dressing

mix ingredient

add salt

add oil

add vinegar

place cucumber into bowl
serve salad onto plate

mix dressing

cut cheese

place cheese into bowl
place lettuce into bowl

cut lettuce

place tomato into bowl
cut cucumber

(a) 50Salads
SILtake_bowlpour_cerealspour_milkstir_cerealsSIL

GT
ASFormer
ADIOS-OR
KARI (Ours)
SIL

pour oil

butter pan

crack egg

stirfry egg

add saltnpepper

take plate

put egg2plate

(b) Breakfast (activity: scrambled egg)

Figure 9: Qualitative results on failure cases
SILtake_bowlpour_cerealspour_milkstir_cerealsSIL

22