CODEX: A Cluster-Based Method for Explainable Reinforcement Learning

Timothy K. Mathes1, Jessica Inman2, Andrés Colón1, and Simon Khan3
1 Assured Information Security, Inc.
2 Georgia Tech Research Institute
3 Air Force Research Laboratory

Abstract

Despite the impressive feats demonstrated by Reinforcement Learning (RL), these algorithms have seen little adoption in high-risk, real-world applications due to current difficulties in explaining RL agent actions and building user trust. We present Counterfactual Demonstrations for Explanation (CODEX), a method that incorporates semantic clustering, which can effectively summarize RL agent behavior in the state-action space. Experimentation on the MiniGrid and StarCraft II gaming environments reveals the semantic clusters retain temporal as well as entity information, which is reflected in the constructed summary of agent behavior. Furthermore, clustering the discrete+continuous game-state latent representations identifies the most crucial episodic events, demonstrating a relationship between the latent and semantic spaces. This work contributes to the growing body of work that strives to unlock the power of RL for widespread use by leveraging and extending techniques from Natural Language Processing.

Figure 1: Four Rooms and Door Key MiniGrid environments. The RL agent (red triangle) is tasked with autonomously reaching the goal (green square) by maneuvering walls and locked doors.

1 Introduction

Reinforcement Learning (RL) is a revolutionary technology capable of superhuman long-term decision-making in complex and fast-paced domains (Tesauro, 1992; Mnih et al., 2015; Silver et al., 2018; Schrittwieser et al., 2020; Vinyals et al., 2019). Effective RL-enabled systems will readily outperform the greatest human minds at most tasks.[1] However, a major challenge in the field has been explaining RL agent decisions. This limitation is prohibitive because existing Explainable Reinforcement Learning (XRL) methods neither account for the fact that autonomous decision-making agents can change future observations of data based on the actions they take, nor effectively reason over the long-term objectives of the underlying agent mission. For example, the AI AlphaStar competes against top-tier StarCraft II players, but gaining an understanding of the AI requires extensive empirical study. Effective XRL approaches that overcome these limitations are necessary to unlock the power of RL for widespread use.

One potential approach is to develop text-based XRL techniques using world models (Hafner et al., 2020). World models have proven to be extremely effective, as several of the recent top-performing RL algorithms are world-modeling based (Kaiser et al., 2020; Schrittwieser et al., 2020, 2021; Hafner et al., 2021). They may be used to show a user: a) what the RL agent expects is happening after it makes a decision; and b) what the RL agent expects would have happened had it made a different decision. The former is termed a factual and the latter a counterfactual.

In this paper, we propose a global post-hoc clustering method for XRL called Counterfactual Demonstrations for Explanation (CODEX) as a step towards understanding factuals and counterfactuals.

Presented at the International Joint Conference on Artificial Intelligence (IJCAI) 2023 Workshop on Explainable Artificial Intelligence (XAI).
[1] https://www.nitrd.gov/pubs/National-AI-RD-Strategy-2019.pdf
CODEX automatically produces natural language episode tags that describe the agent's states and actions while interacting with the MiniGrid (Chevalier-Boisvert et al., 2018) and StarCraft II (Vinyals et al., 2017) environments. Figure 1 shows two MiniGrid environments in which the agent must navigate walls and locked doors to reach the goal. CODEX offers several benefits: 1) the vector representations are densely clustered with very good separation, even when the tags are short (~5-6 words) with minimal semantic distinctiveness; 2) the centroid-conditioned cluster topics are fully extractive, avoiding the issue of hallucinations observed in SOTA summarization models (Kryściński et al., 2019; Falke et al., 2019); 3) the semantic clusters retain temporal as well as entity information, which in turn is reflected in the constructed summary of agent behavior in the state-action space; 4) two user-defined parameters provide more fine-grained and detailed summaries, revealing tags that occur rarely and may be important; and 5) clustering discrete+continuous game-state latent representations visually reveals the most crucial episode tags, demonstrating a relationship between the latent and semantic spaces. By summarizing world-model-based factual and counterfactual examples while combining them with latent cluster visualizations, our method enables an intuitive and broader understanding of an RL agent's behavior. Our code is publicly available.[2]

2 Related Work

Explainability in Reinforcement Learning. While Explainable Artificial Intelligence (XAI) has established and widely accepted techniques such as the SHAP library (Lundberg and Lee, 2017) and its encompassed methods (Ribeiro et al., 2016; Štrumbelj and Kononenko, 2014; Shrikumar et al., 2017; Datta et al., 2016; Bach et al., 2015; Lipovetsky and Conklin, 2001), XRL research has not yet yielded such well-regarded methods. To aid in the development of new XRL approaches, XRL researchers have created useful taxonomies to describe and compare methods. There is a two-step taxonomy, largely established by Puiutta and Veith (2020) based upon the XAI taxonomy of Adadi and Berrada (2018), that incorporates ideas from Heuillet et al. (2021), in turn based upon the XAI taxonomy of Arrieta et al. (2020): scope and extraction type. Explanation scope can be either local or global, while explanation extraction types are either intrinsic or post-hoc. Local explanations provide insight into specific predictions, while global explanations provide insight into overall model structure, or logic.

Several recent works are related to, yet distinct from, CODEX. van der Waa et al. (2018) leverage a policy simulation and a translation of states and actions into descriptions that are easier for human users to understand, enabling an RL agent to explain its behavior with contrastive explanations in terms of the expected consequences of state transitions and outcomes. That work discerns interpretable state changes by applying classification models to the state representation. In contrast, CODEX leverages decoded visual representations of states to identify semantic properties. Nguyen et al. (2022) leverage human annotations of interpretations of agent behavior along with automated rationale generation to create natural language explanations for a sequence of actions taken by an RL agent. This work is closely related to CODEX in that observed agent behaviors are summarized with text, but differs in its requirement for a large set of human annotations. CODEX does not require human annotation of observations in order to generate a text summary of an episode. Additionally, Nguyen et al. describe a human's expectation of why an agent may take a particular action. This type of human bias works well in real-world scenarios where humans have a good understanding of world dynamics. CODEX, in contrast, leverages the RL agent's world-model-based understanding of its environment, which enables CODEX to elucidate deficiencies in the agent's understanding of its world.

[2] https://github.com/ainfosec/codex
3 The MiniGrid and StarCraft II Environments

MiniGrid refers to a collection of simple and easily configurable grid-world environments designed for RL research (Chevalier-Boisvert et al., 2018). The games feature a triangle-shaped player that must reach the goal in a discrete action space that depends on the type of game. StarCraft II is a real-time strategy game that involves fast-paced micro-actions as well as high-level planning and execution. For this effort, we use DeepMind's PySC2 API (Vinyals et al., 2017) to train models to play two StarCraft II minigames, which provide simplified environments and objectives for the agents to learn.

We implement the world-model-based DreamerV2 (Hafner et al., 2021) RL agent to enable visualization of counterfactuals. DreamerV2 uses a Recurrent State Space Model (RSSM) as its dynamics model to predict transitions. It accepts as input an encoded state z_t, represented as a latent feature vector that encodes observable and inferred state information, and an action a_t, and outputs a prediction of the state the world will transition to, z_{t+1}, and the reward that will be received, \hat{r}_{t+1}. To reduce the complexity of the RSSM, DreamerV2 uses a paired encoder-decoder Variational Auto-Encoding (VAE) architecture to learn how to encode raw high-dimensional observed states x as low-dimensional feature vectors z. We term the latent representation z a Dreamer state. The entire DreamerV2 architecture is trained in a self-supervised manner on sequences of observational data samples from the target domain, i.e., episodes, by comparing the predicted transitions, [\hat{x}_{t+1}, \hat{r}_t], with the actual observed transitions, [x_{t+1}, r_t]. The trained decoder can project an image of a predicted world-state from any plausible latent state z, which is central to how CODEX visualizes counterfactuals.

4 Framework

CODEX generates natural language tags for MiniGrid and StarCraft II based primarily on the locations (i.e., coordinates) of the entities in the game. Before we can annotate MiniGrid episodes with natural language, we need to know which of the four directions (left, right, up, or down) the player faces at each timestep. Fortunately, the player's triangular shape means we can compute its direction using image moments, i.e., the weighted averages of the player's pixel intensities. For a greyscale image with pixel intensities I(x, y), the raw image moments M_{pq} are given by:

M_{pq} = \sum_x \sum_y x^p y^q I(x, y)

Likewise, the central moments \mu_{pq} are given by:

\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q I(x, y)

We can discern whether the player is facing one of up/down or one of left/right using first- and second-order central moments:

\Theta = \frac{1}{2} \arctan\left( \frac{2\mu'_{11}}{\mu'_{20} - \mu'_{02}} \right)

where

\mu'_{20} = \mu_{20}/\mu_{00} = M_{20}/M_{00} - \bar{x}^2
\mu'_{02} = \mu_{02}/\mu_{00} = M_{02}/M_{00} - \bar{y}^2
\mu'_{11} = \mu_{11}/\mu_{00} = M_{11}/M_{00} - \bar{x}\bar{y}
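To make the direction computation concrete, the following is a minimal sketch of how the orientation axis could be derived from these moments. It is illustrative only: the stand-in sprite, the variable names, and the thresholding of Θ are our assumptions, not the released CODEX implementation, and we use the two-argument arctan2 form of the arctan expression above so the case μ'_20 < μ'_02 is handled.

```python
import numpy as np

def raw_moment(img, p, q):
    """Raw image moment M_pq = sum_x sum_y x^p y^q I(x, y)."""
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    return float((xs**p * ys**q * img).sum())

def orientation(img):
    """Orientation angle Theta from normalized second-order central moments."""
    m00 = raw_moment(img, 0, 0)
    xbar = raw_moment(img, 1, 0) / m00
    ybar = raw_moment(img, 0, 1) / m00
    mu20 = raw_moment(img, 2, 0) / m00 - xbar**2      # mu'_20
    mu02 = raw_moment(img, 0, 2) / m00 - ybar**2      # mu'_02
    mu11 = raw_moment(img, 1, 1) / m00 - xbar * ybar  # mu'_11
    # arctan2 variant of Theta = 0.5 * arctan(2*mu'_11 / (mu'_20 - mu'_02))
    return 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

# Hypothetical usage: a greyscale crop containing only the player's triangle.
player_pixels = np.zeros((7, 7))
player_pixels[1:6, 3] = 1.0  # a vertically elongated blob as a stand-in sprite
theta = orientation(player_pixels)
# Theta near 0 -> long axis horizontal (left/right); near +/- pi/2 -> vertical (up/down).
axis = "left/right" if abs(theta) < np.pi / 4 else "up/down"
print(f"theta = {theta:.2f} rad, player axis: {axis}")
```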
Once the entities' coordinates and the player's direction are extracted, we generate tags using templates corresponding to specific states or events. For MiniGrid, we use:

The player/goal/key/door is at (x, y).
The player is facing left/right/up/down.
The player turns left/right.
The player moves forward.
The player reached the goal.
The door is open/closed.
The key has been picked up.
The player picks up/drops the key.
The player opens/closes the door.

Because a DreamerV2 world model can sometimes produce invalid images from a state vector, we include two additional tags to annotate visual anomalies:

The player is facing a non-cardinal direction.
The state of the door is unknown.

Note that the set of templates includes both state-driven tags, e.g., "The key is at (x, y).", which appear at every timestep, and event-driven tags like "The player picks up the key.", which appear only at the timestep where a specific action is taken. For StarCraft II, we use the following:

Marine/Beacon [ID] moves from (x, y) to (x, y).
Beacon/Shard appears at (x, y).
Shard [ID] is collected.
Marine [ID] collects shard [ID].
Marine [ID] moves closer to/farther from group/shard [ID].

where each entity has a unique, randomly generated [ID]. Additionally, we use the following group-related tags:

Entity [ID] leaves/joins group [ID].
Entities [IDs] leave/join/form group [ID].
Group [ID] is dissolved.
Group [ID] merges with group [ID].
Group [ID] moves from (x, y) to (x, y).

We now present our summarization pipeline, constructed from three key components: 1) the contextualized-embeddings language model; 2) the dimensionality reduction and clustering algorithms; and 3) the topic model. We evaluate three pretrained Transformer-based language models: BERT-base (Devlin et al., 2018), BERTweet-base (Nguyen et al., 2020), and paraphrase-MiniLM-L6-v2 (Reimers and Gurevych, 2019). We considered BERTweet because its pre-training data includes short text, and MiniLM because its fine-tuning on paraphrase detection may be advantageous on data with minimal semantic differences. We then employ the UMAP algorithm (McInnes et al., 2018) for dimensionality reduction, followed by HDBSCAN (Campello et al., 2013) for semantic cluster enumeration. The game-state summary S' is constructed by identifying the most important tags in each semantic cluster. This is done by incorporating a Latent Dirichlet Allocation (LDA) (Blei et al., 2003) topic model to predict the exemplar tag for each cluster. LDA's ngram_range parameter is set to the length of the shortest and longest tags in a cluster.
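Before turning to exemplar selection, here is a minimal sketch of the first two pipeline stages, tag embedding followed by UMAP and HDBSCAN, using the fixed parameter values reported in §6.1. The stand-in tag list is our own; this is illustrative, not the released implementation.

```python
# A minimal sketch of the tag -> embedding -> UMAP -> HDBSCAN stages.
# Assumes: pip install sentence-transformers umap-learn hdbscan
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

# Stand-in episode tags; a real MiniGrid episode has ~150 state/event tags.
tags = ([f"The player is at ({x}, {y})." for x in range(5) for y in range(5)]
        + ["The player picks up the key."] * 12
        + ["The door is open."] * 12)

# 1) Contextualized embeddings from the selected language model.
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
embeddings = model.encode(tags)

# 2) Dimensionality reduction with the fixed values from Section 6.1.
reduced = umap.UMAP(n_neighbors=10, min_dist=0.0, n_components=2,
                    metric="cosine", n_epochs=500).fit_transform(embeddings)

# 3) Semantic cluster enumeration; HDBSCAN labels noise points as -1.
labels = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=1,
                         cluster_selection_method="leaf").fit_predict(reduced)
```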
The underlying idea is that entire tags may be selected as the exemplar for each cluster, avoiding the issue of hallucinations by being fully extractive. However, there are cases where LDA predicts a substring to be the cluster topic because of how ngram_range is set; for instance, "Marine [ID] moves closer to beacon", which is missing the beacon "[ID]". In these cases, the tag closest to the cluster centroid that contains the LDA topic substring is selected. In addition, cosine similarities between the selected tag vector and the rest of the tag vectors in the cluster are calculated. If a similarity falls below a threshold of 0.6, the corresponding tag is selected as well; we re-use the 0.6 parameter setting from previous semantic similarity work. Presumably, such tags increase the summary's informativeness. The selected tags are sorted by step number to produce the final summary S'.
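The exemplar selection step could look roughly like the sketch below. The function name, the way the LDA topic is passed in as a plain substring, and the fallback when no tag contains the substring are our assumptions for illustration; only the centroid-nearest selection and the 0.6 similarity threshold come from the description above.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_cluster_tags(tags, vectors, topic_substring, sum_thresh=0.6):
    """Pick the exemplar tag for one cluster, plus any dissimilar tags.

    tags: tag strings in the cluster (assumed already sorted by step number)
    vectors: matching embedding vectors (2-D numpy array, one row per tag)
    topic_substring: the (possibly partial) LDA topic for this cluster
    """
    centroid = vectors.mean(axis=0)
    # Exemplar: the tag nearest the centroid that contains the LDA topic.
    # Fallback to all tags if none contains the substring (our assumption).
    candidates = [i for i, t in enumerate(tags) if topic_substring in t] \
                 or range(len(tags))
    exemplar = max(candidates, key=lambda i: cosine(vectors[i], centroid))
    selected = {exemplar}
    # Tags whose similarity to the exemplar falls below the threshold are
    # kept as well, presumably adding information to the summary.
    for i in range(len(tags)):
        if i != exemplar and cosine(vectors[i], vectors[exemplar]) < sum_thresh:
            selected.add(i)
    return [tags[i] for i in sorted(selected)]  # step order preserved
```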
5 Evaluation Metrics

To the best of our knowledge, datasets with gold summaries for XRL on gaming environments are not publicly available, making automatic evaluation a challenge. In order to evaluate clustering performance on the language model embeddings, we adopt two metrics given the lack of ground-truth labels.

Silhouette Coefficient. The calculation for a single sample s uses the mean distance a between the sample and all other points in the same cluster, and the mean distance b between the sample and all points in the next-nearest cluster (Rousseeuw, 1987). It is a measure of how well defined the clusters are spatially:

s = \frac{b - a}{\max(a, b)}

Global Cosine Similarity. We take the mean across all clusters of each cluster's cosine similarity, which is computed as the mean cosine similarity between all cluster vectors and the cluster's centroid vector. It is a measure of semantic homogeneity and density:

\frac{1}{m} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{A_j \cdot v_c}{\|A_j\| \, \|v_c\|}

where m is the number of clusters, n_i is the number of vectors in cluster i, A_j is a given cluster vector, and v_c is the cluster centroid vector.
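A small sketch of how both metrics could be computed, assuming tags have already been clustered: the silhouette call uses scikit-learn, while the global cosine similarity helper is our own reading of the formula above. Whether the similarity is measured on the full embeddings or the reduced vectors is not pinned down by the text, so the choice below is an assumption.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def global_cosine_similarity(vectors, labels):
    """Mean over clusters of the mean cosine similarity to each centroid."""
    labels = np.asarray(labels)
    per_cluster = []
    for c in set(labels) - {-1}:  # -1 is HDBSCAN noise; skip it
        A = vectors[labels == c]
        v_c = A.mean(axis=0)
        sims = A @ v_c / (np.linalg.norm(A, axis=1) * np.linalg.norm(v_c))
        per_cluster.append(sims.mean())
    return float(np.mean(per_cluster))

# Hypothetical usage with the reduced embeddings and labels from the pipeline:
# mask = labels != -1
# sil = silhouette_score(reduced[mask], labels[mask])
# gcs = global_cosine_similarity(reduced, labels)
```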
6 Experiments

To analyze the effectiveness of our CODEX method from different perspectives, we propose three research questions (RQs) to guide our experiments:

RQ1: Which language model is the most appropriate choice for CODEX considering the trade-offs between clustering performance and model efficiency in terms of size and inference time?
RQ2: Are the constructed summaries concise and informative while still retaining temporal and/or entity information from the state-action space?
RQ3: Does clustering an environment's DreamerV2 game-state latent representations reveal a relationship between the latent and semantic spaces?

6.1 Semantic Clustering Comparison (RQ1)

Experimental Setup. We experiment on 100 MiniGrid episodes (MiniGrid-100) for tag embedding, dimensionality reduction, and semantic clustering to compare the performance of BERT-base, BERTweet, and MiniLM embeddings on short text. We do not conduct additional training or fine-tuning. We log the average amount of time it takes for each model to produce episode embeddings using an NVIDIA Quadro P1000 GPU with 4 GB of memory. The choice of limited hardware is to assess whether CODEX could be used on edge devices with restricted resources. We undertake a sweep of two key parameters: UMAP's n_neighbors = {10, 15, 20, 25, 30} and HDBSCAN's min_cluster_size = {5, 10, 15, 20, 25}. The remaining parameters are kept fixed: UMAP's min_dist=0.0, n_components=2, metric=cosine, and n_epochs=500; HDBSCAN's min_samples=1 and cluster_selection_method=leaf. These values are based on previous work. We also compute the percentage of tags clustered, since HDBSCAN can identify datapoints as noise; ideally, the tags would be maximally clustered.

Spatial Separation and Semantic Homogeneity. To evaluate the semantic clusters, we take the mean of the Silhouette Coefficient (sil score) and Global Cosine Similarity (global cos sim) for each pair of parameter values across 100 episodes (i.e., n_neighbors=10, min_cluster_size=5 for 100 episodes; n_neighbors=10, min_cluster_size=10 for 100 episodes; etc.). Table 1 reports the best-performing parameter values for each model. Full results are in Appendix A. All three models perform best when n_neighbors=10 and min_cluster_size=10 on MiniGrid-100. The average percentage of tags clustered across all 100 episodes is in the range of 98.4-98.8%. BERTweet has the highest sil score and global cos sim mean at 0.953, although BERT-base and MiniLM exhibit comparable performance at 0.951 and 0.933, respectively.

Table 1: BERT-base, BERTweet, and MiniLM peak performance on the MiniGrid-100 episodes.

model      n_neighbors  min_cluster_size  clustered (%)  sil score  global cos sim  mean
BERT-base  10           10                98.8           0.916      0.986           0.951
BERTweet   10           10                98.4           0.913      0.993           0.953
MiniLM     10           10                98.6           0.902      0.964           0.933

Efficiency. We also consider model size and speed in the selection process. Table 2 shows model sizes along with the average time (secs.) each model takes to generate episode embeddings when run on a single GPU. At 23M parameters, MiniLM produces episode embeddings in 0.059 seconds on average, across 100 episodes that contain a total of 15,392 natural language tags. This is significantly faster than the much larger BERT-base and BERTweet models. Moreover, we observe that the MiniLM semantic clusters are dense and well-separated (see Figure 2). We choose MiniLM for further experimentation to answer RQ2 and RQ3 given its smaller size, speed, and comparable clustering performance, while taking into account the environmental impact, financial cost, and other pitfalls associated with Large Language Models (see Bender et al., 2021; Wei et al., 2022; and Thompson et al., 2022 for discussions).

Table 2: Model sizes, embedding dimensions, and mean generation times (secs.) on the MiniGrid-100 episodes.

model      # params  dim.  # epi.  mean
BERT-base  109M      768   100     3.19
BERTweet   135M      768   100     0.337
MiniLM     23M       384   100     0.059

6.2 Summary Analysis (RQ2)

Since the final summary is crucial to understanding agent behavior, we conduct an exploratory qualitative analysis to study: conciseness – do the summaries possess a sufficient number of tags that are interpretable by humans without redundancy? – and informativeness – do the summaries include all relevant information from the game-state space?

MiniGrid-100. We visually inspect the semantic clusters and summaries from each MiniGrid-100 episode generated by the MiniLM >> UMAP >> HDBSCAN pipeline followed by LDA topic extraction. The summaries are constructed as outlined in §4. Figure 2 is indicative of what we observe for the MiniGrid episodes. Labels are shown above each cluster, with "x" marking the centroids. As seen in the legend at the top left, episode 8 has 145 total event+state-driven tags. The 9 enumerated clusters are dense and well-separated when min_cluster_size=10 and min_samples=1, with 100% of the tags clustered. The evaluation measurements approach a value of 1.0, with sil score=0.966 and global cos sim=0.979. The inset summary is shown to the right. Each line is formatted as: [tag] [cluster ID]. The asterisks around *[cluster ID]* indicate the tag is selected because it falls below the 0.6 summary threshold, as explained in §4. The summary consists of 11 total tags in this case, resulting in a low compression rate of (11 ÷ 145) = 0.076. Interestingly, clusters 0 and 7 are far apart in the UMAP 2-dim. projection even though the clustered tags differ by one word, i.e., "The door is closed." [cluster 0] vs. "The door is open." [cluster 7]. We hypothesize that MiniLM's fine-tuning on a paraphrase detection task enables it to generate separable embeddings when the semantic distinctions are minimal. The summary captures the key events of the game-state episode, such as "The key has been picked up." [cluster 5], "The door is open." [cluster 7], and "The player reached the goal." *[cluster 1]*.

Figure 2: Semantic clusters and summary for MiniGrid episode 8.

StarCraft II. We visually inspect the semantic clusters and summaries from 100 randomly selected StarCraft II episodes out of 5,000 total episodes. These episodes have a significantly larger set of event-driven tags per episode compared to MiniGrid. Step numbers are included in the vector representations, designating when the events begin and end, to consider the effect of providing the pipeline with temporal information. Each tag is prefixed with a timestamp denoting the starting and ending steps as "t1 -- t2". For instance, in the tag "5 -- 8 Marine 4299161601 moves farther from shard 4299423745", "5" is when the event starts (Step 5) and "8" is when it ends (Step 8). Figure 4 in Appendix B illustrates what we typically observe in the StarCraft II episodes. We adjust min_cluster_size=20 and sum_thresh=0.7, since the number of tags is approximately 3x larger. Episode 4 (Figure 4) has 530 event-driven tags that are enumerated as 12 clusters when min_cluster_size=20; 93.58% are clustered, with tags marked as noise denoted in gray. The Silhouette Coefficient is 0.605, reflecting less cluster separation compared to MiniGrid. One possible reason is that the sequences of digits that make up the entity IDs place the vector representations close together in 2-dim. space, although HDBSCAN predicts separate clusters.

It is notable that the clusters capture temporal as well as entity information. For example, "13 -- 14 Marine 4298899457 moves closer to shard 4303880193" [cluster 6] clusters separately from "45 -- 48 Marine 4298899457 moves closer to shard 4303880193" [cluster 9], when the only differences are the step numbers. In fact, we find that a cluster may have tags from the first half of the episode, which is reflected by the step numbers, with identical tags in another cluster from the second half. Consequently, the constructed summary includes exemplars from both halves of the episode. Moreover, we observe similar behavior when the differentiator is an entity ID. In terms of summary length, Figure 4 shows that the compression rate is low at (15 ÷ 530) = 0.028.

We visualize the clusters and summary from the same episode with min_cluster_size=10 (Figure 5 in Appendix B). The summary is longer and more fine-grained due to the enumeration of more clusters. [Cluster 0] at the bottom center is particularly interesting: it contains tags related to group behavior – "Entities 4299161601, 4298899457 form group 1", "Group 1 is dissolved", "Group 2 is dissolved", "Group 2 moves from (122, 128) to (110, 128)", "Group 3 is dissolved", etc. While marked as noise when min_cluster_size=20 (Figure 4 at bottom center), these tags are clustered in Figure 5 because their number meets the threshold of min_cluster_size=10. We consider this significant because rare but important tags can be uncovered by adjusting min_cluster_size. Thus, users could adjust the parameters min_cluster_size and sum_thresh along a sliding scale from coarse- to fine-grained semantic clustering and summarization of an RL agent's behavior.
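To illustrate the sliding-scale idea, the sketch below re-runs only the clustering stage at several granularities and reports how the cluster count and noise fraction change. The loop and reporting are our own illustration; the parameter values come from the sweeps above, and `reduced` refers to the UMAP output from the pipeline sketch in §4.

```python
import hdbscan

# Re-cluster the same reduced embeddings at coarse-to-fine granularities.
for mcs in (20, 15, 10, 5):
    labels = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=1,
                             cluster_selection_method="leaf").fit_predict(reduced)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    pct = 100.0 * (labels != -1).mean()
    print(f"min_cluster_size={mcs}: {n_clusters} clusters, "
          f"{pct:.1f}% of tags clustered")
```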
6.3 Game-State Latent Representations (RQ3)

We address RQ3 by clustering the DreamerV2 discrete+continuous latent representations from MiniGrid and StarCraft II with the UMAP >> HDBSCAN piece of the pipeline. The 2048-dimensional, stepwise latent representations are extracted as described in §3 (i.e., the z Dreamer states).

Figure 3: Latent representation clusters for MiniGrid episode 8.

Figure 3 shows latent clusters from MiniGrid-100 episode 8. These results are what we typically observe for MiniGrid episodes. We find that lowering min_cluster_size to 5 produces 3 clusters with 100% of the 24 latent representations clustered. We see that Step 14 is the point at which "The door is open."; in the 2-dim. space, Step 14 operates as a transition point between its cluster and the next. In our careful inspection of the latent clusters from the MiniGrid episodes, we can surmise when the door opens by visually identifying the transition point. Moreover, we can predict when "The key is picked up." by identifying the last point between the first and second clusters (Step 5). The datapoints and their clusters form an arc from the bottom left to the top right for episode 8, which is a visual interpretation of the episode's progression through time.

For StarCraft II, HDBSCAN did not cluster the latent representations, marking all points as noise for every episode. We hypothesize that MiniGrid's longer model training and lower-complexity environment are possible reasons why HDBSCAN successfully clusters its latent representations. An interesting future direction is exploring how much training is necessary to achieve latent representation clustering for a variety of environment complexities.
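A sketch of this latent-space clustering might look as follows, reusing the UMAP >> HDBSCAN settings from earlier. The random array of Dreamer states is a stand-in, since extracting z from a trained DreamerV2 agent is environment-specific; the transition-point scan at the end is our own illustration of how a label change can flag a candidate event step.

```python
import numpy as np
import umap
import hdbscan

# Stand-in for the stepwise 2048-dim Dreamer states z of one episode
# (in practice these come from the trained DreamerV2 encoder, one per step).
rng = np.random.default_rng(0)
dreamer_states = rng.normal(size=(24, 2048))

reduced = umap.UMAP(n_neighbors=10, min_dist=0.0, n_components=2,
                    metric="cosine", n_epochs=500).fit_transform(dreamer_states)
labels = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=1,
                         cluster_selection_method="leaf").fit_predict(reduced)

# Steps where the cluster label changes are candidate transition points,
# e.g., the step at which the door opens in MiniGrid episode 8.
transitions = [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
print("candidate transition steps:", transitions)
```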
7 Discussion

In CODEX, we construct game-state summaries by identifying centroid-conditioned LDA topic exemplars for each semantic cluster. We choose this design over an abstractive summarization approach so that CODEX is an unsupervised method that is fully extractive in nature. We believe that this design choice contributes to user trust. Moreover, the finding that the MiniGrid latent representations can be clustered to reveal important episodic events prompts the scientific community to consider new research questions about the nature of the RL latent space. Ongoing research is investigating how to present factual and counterfactual summaries, show latent cluster visualizations, and allow users to intuitively manipulate the min_cluster_size and sum_thresh parameters, which opens the door to understanding what an RL agent expects will happen vs. what would have happened had a different decision been made.

8 Conclusion

CODEX produces text-based summaries that provide representations of factuals and counterfactuals. These summaries could be leveraged to summarize collections of counterfactuals, perhaps with hierarchical summarization techniques. Other avenues for future work include extending CODEX to additional semantically diverse environments, exploring the limits of CODEX with respect to episode length and state complexity, extracting more information from the latent space, and increasing CODEX's efficiency by, e.g., automating state tagging with the latest computer vision techniques.

A Full Results on the MiniGrid-100 Episodes

Table 3: BERT-base performance on the MiniGrid-100 episodes.

n_neighbors  min_cluster_size  clustered (%)  sil score  global cos sim  mean
10           5                 93.8           0.747      0.992           0.870
10           10                98.8           0.916      0.986           0.951
10           15                91.7           0.844      0.968           0.906
10           20                88.6           0.737      0.939           0.838
10           25                84.0           0.638      0.904           0.771
15           5                 92.1           0.688      0.991           0.839
15           10                98.8           0.883      0.982           0.932
15           15                99.0           0.912      0.975           0.943
15           20                94.0           0.834      0.950           0.892
15           25                90.6           0.724      0.914           0.819
20           5                 89.7           0.654      0.989           0.822
20           10                98.2           0.844      0.981           0.913
20           15                99.4           0.887      0.972           0.930
20           20                98.6           0.900      0.960           0.930
20           25                92.8           0.818      0.937           0.877
25           5                 89.5           0.623      0.989           0.806
25           10                97.6           0.802      0.977           0.889
25           15                99.0           0.875      0.969           0.922
25           20                99.2           0.896      0.955           0.925
25           25                97.4           0.871      0.941           0.906
30           5                 87.3           0.620      0.988           0.804
30           10                97.1           0.796      0.976           0.886
30           15                98.6           0.881      0.967           0.924
30           20                98.9           0.904      0.954           0.929
30           25                96.3           0.884      0.939           0.912

Table 4: BERTweet performance on the MiniGrid-100 episodes.

n_neighbors  min_cluster_size  clustered (%)  sil score  global cos sim  mean
10           5                 93.8           0.764      0.996           0.880
10           10                98.4           0.913      0.993           0.953
10           15                92.7           0.830      0.983           0.906
10           20                89.1           0.724      0.968           0.846
10           25                87.2           0.620      0.950           0.785
15           5                 93.1           0.720      0.996           0.858
15           10                98.3           0.902      0.993           0.947
15           15                97.8           0.882      0.987           0.934
15           20                91.3           0.779      0.972           0.876
15           25                89.5           0.664      0.953           0.809
20           5                 90.7           0.705      0.995           0.850
20           10                98.5           0.911      0.992           0.951
20           15                96.9           0.901      0.987           0.944
20           20                94.3           0.828      0.974           0.901
20           25                89.9           0.718      0.954           0.836
25           5                 89.3           0.691      0.995           0.843
25           10                98.6           0.909      0.991           0.950
25           15                97.4           0.904      0.986           0.945
25           20                93.9           0.845      0.975           0.910
25           25                90.4           0.766      0.962           0.864
30           5                 88.9           0.687      0.995           0.841
30           10                99.1           0.892      0.991           0.942
30           15                98.6           0.900      0.987           0.943
30           20                97.1           0.830      0.975           0.902
30           25                92.9           0.754      0.959           0.857

Table 5: MiniLM performance on the MiniGrid-100 episodes.

n_neighbors  min_cluster_size  clustered (%)  sil score  global cos sim  mean
10           5                 93.1           0.719      0.983           0.851
10           10                98.6           0.902      0.964           0.933
10           15                93.8           0.863      0.928           0.895
10           20                89.4           0.793      0.887           0.840
10           25                84.6           0.696      0.832           0.764
15           5                 91.8           0.670      0.976           0.823
15           10                99.1           0.871      0.957           0.914
15           15                98.7           0.905      0.940           0.923
15           20                94.5           0.842      0.905           0.874
15           25                89.2           0.742      0.845           0.794
20           5                 89.8           0.640      0.971           0.805
20           10                99.2           0.870      0.953           0.911
20           15                98.4           0.901      0.938           0.919
20           20                96.1           0.881      0.910           0.895
20           25                92.5           0.783      0.856           0.820
25           5                 88.2           0.624      0.970           0.797
25           10                98.7           0.844      0.951           0.898
25           15                98.4           0.864      0.934           0.899
25           20                95.7           0.850      0.905           0.877
25           25                94.9           0.837      0.866           0.852
30           5                 87.5           0.620      0.968           0.794
30           10                98.5           0.825      0.948           0.887
30           15                97.6           0.860      0.931           0.895
30           20                95.8           0.851      0.897           0.874
30           25                94.4           0.833      0.866           0.850

B StarCraft II Example Semantic Clusters and Summaries

Figure 4: Semantic clusters and summary for StarCraft II episode 4 (min_cluster_size=20).
Figure 5: Semantic clusters and summary for StarCraft II episode 4 (min_cluster_size=10).

Ethical Statement

Explainable Reinforcement Learning, as well as explainability more broadly, raises ethical concerns when applied to real-world problems. The main issue is that explanations could be used by malicious actors to manipulate AI systems, which would be especially pernicious for a field such as robotics. For instance, consider a scenario where a robotic agent is sent to retrieve an apple as quickly as possible.
There are two paths to the apple, one that is shorter and uses a set of stairs and one that is longer and uses an elevator. Suppose the agent chooses to retrieve the apple using the longer elevator path. A typical user may find this to be an unexpected choice, since the longer path should take more time to traverse. By reviewing the counterfactual example, one can observe what the agent expects to happen if it chooses the shorter stair path. In this case, the agent expects to start a slow descent before falling down the stairs, finding itself unable to right itself and continue. Given the above scenario, a malicious actor with access to the factual and counterfactuals would know that at least one of the counterfactuals causes a negative outcome. Thus, directing the robotic agent to choose the shorter path would result in its failure. This outcome would have far-reaching consequences if the robotic agent is employed in a healthcare setting or in an area where dangerous materials are being handled. It is crucial to continue discussions on ways to mitigate this situation.

Acknowledgements

This material is based upon work supported by the United States Air Force under Contract No. FA8750-22-C-1003. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force.

References

Amina Adadi and Mohammed Berrada. 2018. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160.

Alejandro B. Arrieta, Natalia Diaz-Rodriguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. 2020. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115.

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One, 10(7). https://doi.org/10.1371/journal.pone.0130140.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, Virtual Event, Canada.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022. BSD-3-Clause license.

Ricardo J.G.B. Campello, Davoud Moulavi, and Joerg Sander. 2013. Density-based clustering based on hierarchical density estimates. In Advances in Knowledge Discovery and Data Mining (PAKDD 2013), Lecture Notes in Computer Science vol. 7819, pages 160–172. Springer, Berlin, Heidelberg. BSD-3-Clause license.

Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. 2018. Minimalistic gridworld environment for OpenAI Gym. Apache 2.0 license. https://github.com/maximecb/gym-minigrid.

Anupam Datta, Shayak Sen, and Yair Zick. 2016. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In IEEE Symposium on Security and Privacy (SP), pages 598–617, San Jose, California.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805. Apache 2.0 license.

Tobias Falke, Leonardo F.R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220, Florence, Italy. Association for Computational Linguistics.
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. 2020. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations.

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. 2021. Mastering Atari with discrete world models. ArXiv, abs/2010.02193. MIT license.

Alexandre Heuillet, Fabien Couthouis, and Natalia Díaz-Rodríguez. 2021. Explainability in deep reinforcement learning. Knowledge-Based Systems, 214:106685.

Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, and Henryk Michalewski. 2020. Model-based reinforcement learning for Atari. ArXiv, abs/1903.00374.

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Evaluating the factual consistency of abstractive text summarization. ArXiv, abs/1910.12840.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81. Association for Computational Linguistics.

Stan Lipovetsky and Michael Conklin. 2001. Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry, 17(4):319–330.

Michael L. Littman, Thomas L. Dean, and Leslie P. Kaelbling. 1995. On the complexity of solving Markov decision problems. ArXiv, abs/1302.4971.

Scott Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. ArXiv, abs/1705.07874.

Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform manifold approximation and projection for dimension reduction. ArXiv, abs/1802.03426. BSD-3-Clause license.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518:529–533.

Dat Q. Nguyen, Thanh Vu, and Anh T. Nguyen. 2020. BERTweet: A pre-trained language model for English tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 9–14. EMNLP. MIT license.
X. Phong Nguyen, Tho H. Tran, Nguyen B. Pham, Dung N. Do, and Takehisa Yairi. 2022. Human language explanation for a decision making agent via automated rationale generation. IEEE Access, 10:110727–110741.

Erika Puiutta and Eric M.S.P. Veith. 2020. Explainable reinforcement learning: A survey. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, pages 77–95. Springer, Cham.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics. Apache 2.0 license.

Marco T. Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, San Francisco, California. Association for Computing Machinery.

Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65. BSD-3-Clause license.

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. 2020. Mastering Atari, Go, chess and shogi by planning with a learned model. ArXiv, abs/1911.08265.

Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, and David Silver. 2021. Online and offline reinforcement learning by planning with a learned model. ArXiv, abs/2104.06294.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In International Conference on Machine Learning, PMLR 70, pages 3145–3153.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144.

Erik Štrumbelj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665.

Ahmad Terra, Rafia Inam, and Elena Fersman. 2022. BEERL: Both ends explanations for reinforcement learning. Applied Sciences, 12(21):10947. https://doi.org/10.3390/app122110947.

Gerald Tesauro. 1992. Practical issues in temporal difference learning. Machine Learning, 8(3):257–277.

Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso. 2022. The computational limits of deep learning. ArXiv, abs/2007.05558.

Jasper van der Waa, Jurriaan van Diggelen, Karel van den Bosch, and Mark Neerincx. 2018. Contrastive explanations for reinforcement learning in terms of expected consequences. ArXiv, abs/1807.08706.

Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander S. Vezhnevets, Michelle Yeo, Alireza Makhzani, et al. 2017. StarCraft II: A new challenge for reinforcement learning. ArXiv, abs/1708.04782.

Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354. MIT license.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams W. Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652.