1 MIMo: A Multi-Modal Infant Model for Studying Cognitive Development arXiv:2312.04318v1 [cs.AI] 7 Dec 2023 Dominik Mattern, Pierre Schumacher, Francisco M. López, Marcel C. Raabe, Markus R. Ernst, Arthur Aubret, Jochen Triesch Abstract—Human intelligence and human consciousness emerge gradually during the process of cognitive development. Understanding this development is an essential aspect of understanding the human mind and may facilitate the construction of artificial minds with similar properties. Importantly, human cognitive development relies on embodied interactions with the physical and social environment, which is perceived via complementary sensory modalities. These interactions allow the developing mind to probe the causal structure of the world. This is in stark contrast to common machine learning approaches, e.g., for large language models, which are merely passively “digesting” large amounts of training data, but are not in control of their sensory inputs. However, computational modeling of the kind of self-determined embodied interactions that lead to human intelligence and consciousness is a formidable challenge. Here we present MIMo, an open-source multi-modal infant model for studying early cognitive development through computer simulations. MIMo’s body is modeled after an 18-month-old child with detailed five-fingered hands. MIMo perceives its surroundings via binocular vision, a vestibular system, proprioception, and touch perception through a full-body virtual skin, while two different actuation models allow control of his body. We describe the design and interfaces of MIMo and provide examples illustrating its use. All code is available at https://github.com/trieschlab/MIMo. Index Terms—cognitive development, developmental AI, infant model, multimodal perception, physics simulation I. I NTRODUCTION A good measure of our understanding of a complex system or process is our ability to rebuild it. In the context of human cognitive development this translates to constructing models of how the developing brain comes to control the developing body in increasingly sophisticated ways. Human cognitive development depends crucially on embodied interactions with the physical and social environment, that is sensed through our different sensory modalities. Faithful models of cognitive development therefore need to reproduce these interactions and how they give rise to sensory representations, motor skills, conceptual structures, and a broad range of cognitive abilities. D. Mattern is with the Department of Computer Science and Mathematics, Goethe-University Frankfurt, Frankfurt am Main, Germany. P. Schumacher is with the Max Planck Institute for Intelligent Systems, Tübingen, Germany and the Hertie Institute for Clinical Brain Research, Tübingen, Germany. F.M. López, M. Raabe, M.R. Ernst, A. Aubret and J. Triesch are with the Frankfurt Institute for Advanced Studies, Frankfurt am Main, Germany. This research was supported by “The Adaptive Mind” and “The Third Wave of Artificial Intelligence” funded by the Excellence Program of the Hessian Ministry of Higher Education, Science, Research and Art. J. Triesch was supported by the Johanna Quandt foundation. P. Schumacher was supported by the International Max Planck Research School for Intelligent Systems (IMPRS-IS). This version of MIMo extends a previous version published at ICDL [1]. (a) Full body view (b) Facial expressions Fig. 1. MIMo, the multimodal infant model. (a) MIMo sitting in a room with toys. 
(b) Six facial expressions of MIMo. Importantly, these interactions are largely initiated and controlled by the developing child. For example, an infant controls its visual, proprioceptive, and haptic inputs through its eye and body movements. This ability to control and “experiment” with different movements to observe their effects on sensory inputs may be crucial for cognitive development, giving the developing mind a means to probe the causal structure of the world, rather than merely passively observing correlations among sensed variables. Importantly, such an active, self-controlled form of learning is in stark contrast to a large body of work in Artificial Intelligence (AI), including large language or vision models, which learn certain aspects of the statistical structure of large training datasets without having any control over their inputs. Despite the impressive successes of such models, they may ultimately be severely limited in their ability to understand the causal mechanisms underlying their training data, which will limit their ability to generalize in novel situations. This should come as no surprise. Every scientist learns that mere correlation is not sufficient evidence for inferring causation. Therefore, scientists routinely exploit the ability to manipulate a system under study using clever experimental designs where they interfere with some variables and observe the effects on others to distill the causal mechanisms at work. Therefore, recreating the physical interaction with the environment must be a central aspect of rebuilding cognitive development and it may be equally essential for achieving human- 2 like intelligence and consciousness in AIs. The idea that an AI could learn like a developing child can be traced back all the way to Turing [2]. However, serious “Developmental Robotics” or “Developmental AI” efforts have become more common and practical only in the last 20 years, for reviews see [3]–[8]. There are two options for modeling the interaction of a developing mind with its physical and social environment. The first option is using humanoid robots. This is the approach taken by the Developmental Robotics community. The main advantage is the inherent realism: the system operates in the real world governed by the actual laws of physics. However, working with humanoid robots is expensive, time-consuming, and suffers from the brittleness of today’s humanoid hardware. All these factors negatively impact the reproducibility of the research. Furthermore, the sensing abilities of today’s robots are usually not comparable to those of actual humans. This is particularly problematic for the sense of touch. The human body is covered by a flexible skin containing various types of mechanoreceptors, thermoreceptors, and nociceptors (pain receptors). These allow us to sense touch, pressure, vibration, temperature, and pain. Today, reproducing such a human-like skin in humanoid robots is still out of reach. The second option for modeling the interaction of the developing mind with its physical and social environment is to do it completely in silico. Many physics simulators or game engines are available today that can approximate the physics of such interactions [9]–[11], for review see [12]. Among the disadvantages of such an approach are 1) inaccuracies of such simulations due to inevitable approximations and 2) the high computational costs, especially when non-rigid body parts and objects are considered. 
Nevertheless, the in silico approach avoids all the problems of working with humanoid robot hardware mentioned above. Furthermore, it ensures perfect reproducibility. Lastly, if simulations can be run (much) faster than real-time, this greatly facilitates the simulation of developmental processes unfolding over long periods of time (weeks, months, or even years). To support such in silico research, we here present the open source software platform MIMo, the Multi-Modal Infant Model (Fig. 1). MIMo is intended to support two kinds of research: 1) developing computational models of human cognitive development and 2) building developmental AIs that develop more human-like intelligence and consciousness by similarly exploiting their ability to probe the causal structure of the world through their actions. We have decided to model the body of MIMo after an average 18-month-old child. In total, MIMo has 82 degrees of freedom of the body and 6 degrees of freedom of the eyes. MIMo also features different facial expressions that can be used for studies of social development. To simulate MIMo’s interaction with the physical environment we use the MuJoCo physics engine [13], because of its strength at simulating contact physics with friction. Generally, we have aimed for a balance between realism and computational efficiency in the design of MIMo. To accelerate the simulation of physics and touch sensation, MIMo’s body is composed of simple rigid shape primitives such as a sphere for the head and capsules for most other body parts. Presently, MIMo features four sensory modalities: binocular vision, proprioception, full-body touch sensation, and a vestibular system. We also plan to add audition in the future. The remainder of this article describes the detailed design of MIMo and illustrates how to use it. In particular, we present four scenarios where MIMo learns to 1) reach for an object, 2) stand up, 3) touch different locations on his body, and 4) catch a falling ball with his five-fingered hand. These examples are used to illustrate and benchmark potential uses of MIMo. They are not intended as faithful models of how infants acquire these behaviors. This paper is an extended version of preliminary work previously published at the ICDL 2022 conference [1]. Compared to the preliminary MIMo version we have added 1) fivefingered hands, 2) a new actuation model that more accurately models the force-generating behaviour and compliance of real muscles, 3) a detailed play room (see Fig. 2), in which MIMo can interact, and 4) a new demo environment of learning to catch a falling ball. A fifth contribution is that we have improved computational efficiency and benchmarks have been updated accordingly. II. R ELATED W ORK We focus on two major classes of software platforms for simulating cognitive development during embodied interactions with the environment. The first kind is designed to simulate a particular physical robot used in Developmental Robotics research and intended to complement the work with that physical robot. Examples are the iCub simulator [17] and the simulator for the NICO robot [18] that has been implemented using the V-Rep robotics simulation environment [10]. Fig. 2. Top view of the play room environment. The room has a size of 4×4 m2 and is filled with furniture and toys. Furniture models are taken from the 3D-FUTURE dataset [14] and toys from the Toys4K 3D Object dataset [15]. 
Pictures taken from the Man-made category of the McGill Calibrated Color Image Database [16] are placed on the walls. Additionally, the room has a door and a window, which provides a view into a garden. 3 Such platforms typically aim to faithfully reproduce the design and behavior of the physical robot, permitting to substitute work with the real robot through simulations. However, there always remains a notorious gap between simulation and real world. Furthermore, such simulation platforms also inherit any shortcomings of the robot design relative to the human body and human sensing capabilities. For example, if the robot possesses only poor touch sensation, its simulated counterpart will suffer from the same limitation. The second kind of platform emulates human body and sensing abilities directly and thus is not restricted by limitations of current robotics technology in general or that of specific robots in particular. An early example is the seminal work by Kuniyoshi and Sangawa [19]. More frequently, simulation models of specific aspects of sensorimotor development have been proposed. These typically encompass only a small subset of degrees of freedom and sensory modalities. An early example is work on the development of grasping by Oztop and colleagues [20]. A more recent example is the OpenEyeSim simulator, which has been designed to support modeling the development of active binocular vision [21] and is built using the OpenSim software for simulating neuromusculoskeletal systems [22]. While widely used in Reinforcement Learning (RL) for locomotion tasks [23], standard humanoids introduced within the MuJoCo [24] or Bullet [11] platforms only incorporate very limited haptic abilities and do not model a child-like physical appearance. This may be too limiting, because the structure and physiology of the body constrains the kinds of interactions that are possible. III. P HYSICAL D ESIGN MIMo’s design is based on the MuJoCo humanoid, consisting of geometric primitives, primarily capsules. His overall body dimensions and proportions were adapted from anthropometric measurements of 16–19 month old infants [25], treating the unit of distance in MuJoCo as one meter. Body dimensions for which no direct measurement is found in [25] where induced from other measurements. For example, the length of the lower arm is derived from the elbow-to-fingertip and hand length measurements. We then made many small adjustments to MIMo’s proportions to give him a more natural look, while staying close to the experimental measurements. For example, the upper arms appeared very thin compared to the torso and had their circumferences increased by 0.4 cm, well within the 1.3 cm standard deviation reported by [25]. All joints are modeled as a series of one-axis hinges and split into flexion/extension, abduction/adduction or internal/external rotation as appropriate for the joint. For the shoulders we merged the commonly used abduction and flexion axes, instead using horizontal flexion, abduction and internal rotation. This allows MIMo the same total range of motion by combining horizontal flexion and abduction. Keeping both axes would have allowed for an unrealistic, very large total range of motion if both flexion and abduction were at their limit, and MuJoCo’s joint limit functions were too simple to prevent this in a satisfactory manner. We provide two different versions of the hands for MIMo, a simple one with only (a) Mitten hand (b) Full hand Fig. 3. Different versions of MIMo’s hands. 
(a) Simple “mitten” hand with 4 degrees of freedom. (b) Five-fingered hand with 26 degrees of freedom.
a single finger and a fully articulated five-fingered version, modified from [26], to enable experiments where dexterous manipulation is required (see Fig. 3). In total the mitten hand version has 44 degrees of freedom, while the full hand version has 88. These consist of 30 degrees of freedom in the body, 6 in the eyes, and 8 and 52 in the hands (4 and 26 per hand) for the two versions, respectively. To produce range of motion and strength measurements, we collected data from a large number of sources [27]–[38] and then inferred missing values from the available data. For range of motion we took values from the youngest age reported in the various studies and assumed no change. For muscle strengths we took data from [27] as a baseline. These authors consider children aged 3–9 and we used values from the lower end of this range for all joints reported. Joint strengths missing from [27] were induced from other studies by assuming that the relative strengths of joints stay constant, using the strengths of the knee or elbow from [27] as reference values. Where required we converted forces to torques using the appropriate lever arms from MIMo based on the methodologies used in the respective source. Table I shows the range of motion and strengths for all main joints. We treat extension, adduction and internal rotation as positive and flexion, abduction and external rotation as negative. Citations show the source of the data, while entries without marked sources indicate best guesses based on the other values. For the full hand model we kept the range of motion from [26], with finger strengths for individual fingers derived from [39]. For the mitten hand the strength was derived from [27], with a range of motion that allows him to fully close the hand.
MIMo can also change his facial expression. This is implemented by changing the texture of the head (see Fig. 1). In addition to the neutral expression we provide six extra textures, corresponding to the six basic emotions proposed by [40] (enjoyment, sadness, surprise, disgust, anger, and fear). These can be used to convey an internal emotional state, for example for studies of social learning with multiple agents.

TABLE I
JOINT RANGE OF MOTION AND STRENGTH FOR MIMO

Joint                 | ROM [°]                | Voluntary Torque [Nm]
Neck flexion/ext.     | -70 [27] to 80 [27]    | -1.17 [31]* to 2.10 [31]*
Neck lateral flex.    | -70 [32] to 70 [32]    | -1.17 to 1.17
Neck rotation         | -111 [32] to 111 [32]  | -1.17 to 1.17
Trunk flexion/ext.    | -61 [30] to 34 [30]    | -8.13 [30]* to 10.58 [30]*
Trunk lateral flex.   | -41 [30] to 41 [30]    | -7.25 [30]* to 7.25 [30]*
Trunk rotation        | -36 [30] to 36 [30]    | -3.63 [30]* to 3.63 [30]*
Shoulder horizontal   | -118 [34] to 28 [34]   | -1.8 [36]* to 1.8 [36]*
Shoulder flexion/ext. | -183 [33] to 84 [33]   | -2.75 [35]* to 4 [35]*
Shoulder rotation     | -99 [27] to 67 [27]    | -1.6 [35]* to 2.5 [35]*
Elbow flexion/ext.    | -146 [27] to 5 [33]    | -3.6 [27] to 3.0 [27]
Wrist palmar/dorsi    | -92 [33] to 86 [33]    | -1.24 [41] to 0.7 [28]*
Wrist ulnar/radial    | -53 [37] to 48 [37]    | -0.83 [41] to 0.95 [41]
Wrist rotation        | -90 [33] to 90 [33]    | -0.7 to 0.7
Hip flexion/ext.      | -133 [27] to 20 [29]   | -8 [28]* to 8 [28]*
Hip ab-/adduction     | -51 [29] to 17 [29]    | -6.24 [28]* to 6.24 [27]
Hip rotation          | -32 [27] to 41 [27]    | -2.66 [27] to 3.54 [27]
Knee flexion/ext.     | -145 [27] to 4 [27]    | -6.5 [27] to 10 [27]
Ankle plantar/dorsi   | -63 [27] to 32 [27]    | -3.78 [27] to 1.89 [27]
Ankle e-/inversion    | -33 [38] to 31 [38]    | -1.06 [38]* to 1.16 [38]*
Ankle rotation        | -20 to 30              | -1.2 to 1.2
* Reported value scaled to be proportional to knee or elbow reference.

IV. ACTUATION MODELS

We have implemented two actuation models incorporating muscle-like properties and providing different trade-offs regarding accuracy and computational efficiency. In both models we pursue a “big-picture” approach, grouping the various muscle groups acting on each joint into one actuator per movement axis, with each model using a different internal mechanism for torque generation.

A. Spring-Damper Model

In the first approach, each actuator is modeled as a combination of a motor with a spring-damper system. This approach requires little run time, while providing reasonable accuracy. The motor provides the voluntary muscle force while the spring acts to return the joint to its neutral position. The spring and the damper loosely approximate viscoelastic characteristics of real muscles. Note that the spring opposes any motion deviating from the equilibrium position. The joint strengths were set as the maximum output torque of the motor. Damper strengths were adjusted manually to ensure simulation stability. Spring stiffness for most joints was set such that at maximum joint deflection, net torque is reduced by 10%.

B. Muscle Model

The second approach implements a muscle-like model based on [42], with adjusted parameters for MIMo. It is similar to MuJoCo’s muscle actuators, while being more flexible regarding muscle parameters. Compared to the spring-damper model, this approach more accurately models the behavior of real muscles, allowing in particular for adaptive compliance, at the cost of increased run time (see Sec. IV-C, VII-B). Each actuator is modeled as two opposing, independently controllable muscles. The output torque of the whole actuator is the sum of the torques of both muscles. The torque of a single muscle depends on its current length l, velocity l̇, and muscle activity a through

\tau = \left[ F_L(l)\, F_V(\dot{l}, v_{\max})\, a(u, t) + F_P(l) \right] d\, f_{\max},    (1)

where F_L is the force-length curve, F_V the force-velocity curve and F_P is a passive elastic force. The factor d accounts for the moment arm. The muscle activity a depends on time t and the control input u through

\frac{da}{dt} = \frac{1}{k}\,(u - a),    (2)

with time constant k = 10 ms. See [42] for more details. We adjusted two of the parameters, f_max and v_max, for MIMo specifically. The parameter f_max scales the normalized, unitless force curves into the proper range, while v_max scales the damping properties of the F_V curve. To determine appropriate f_max values we recreated the experimental setups from previous studies measuring the joint strengths (see Sec. III). A given joint is fixed in position and the maximum control input is applied. After a short time the applied torque is measured and f_max adjusted so that the applied torque matches the values from the literature. As v_max measurements do not exist for most joints, we produce an initial set of values for all joints that produce stable behaviour and then scale them to the appropriate range by a common factor derived from one of the few reference values from the literature. To create the initial set, we use an iterative approach. We set up a custom environment in which MIMo is suspended in the air with gravity disabled.
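As an illustration of Eqs. (1) and (2), the per-muscle torque computation and activation dynamics can be sketched in a few lines of Python. This is a minimal sketch, not MIMo's implementation: the curve shapes and all parameter values below are simplified placeholders rather than the fitted curves of [42].

```python
import numpy as np

def muscle_torque(l, l_dot, a, d, f_max, v_max):
    """Torque of one muscle, following the structure of Eq. (1).

    l, l_dot : normalized muscle length and contraction velocity
    a        : current muscle activity in [0, 1]
    d        : moment arm; f_max : maximum isometric force scale
    """
    FL = np.exp(-((l - 1.0) / 0.4) ** 2)            # placeholder force-length curve
    FV = np.clip(1.0 - l_dot / v_max, 0.0, 1.5)     # placeholder force-velocity curve
    FP = np.maximum(0.0, 3.0 * (l - 1.0)) ** 2      # passive force, only when stretched
    return (FL * FV * a + FP) * d * f_max

def update_activity(a, u, dt, k=0.010):
    """First-order activation dynamics of Eq. (2), Euler-integrated with time constant k."""
    return a + dt * (u - a) / k

# One actuator = two opposing muscles; the joint torque is their sum.
a_flex, a_ext = 0.0, 0.0
for _ in range(200):                                # 200 physics steps of 5 ms
    u_flex, u_ext = 0.6, 0.2                        # co-contraction level sets compliance
    a_flex = update_activity(a_flex, u_flex, dt=0.005)
    a_ext = update_activity(a_ext, u_ext, dt=0.005)
    tau = (muscle_torque(1.05, 0.0, a_flex, d=+1.0, f_max=3.0, v_max=8.0)
           + muscle_torque(0.95, 0.0, a_ext, d=-1.0, f_max=3.0, v_max=8.0))

print(f"net joint torque: {float(tau):.3f} Nm")
```

Because each joint is driven by two such opposing muscles, the same net torque can be produced at different co-contraction levels, which is what enables the adaptive compliance discussed in Sec. IV-C.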
After setting a starting v_max value, we take Bernoulli-distributed control inputs of 0 or 1 (i.e., bang-bang control) every two seconds for 30 seconds. We determine the maximum achieved velocity and use it as the new v_max. This process is repeated for a number of episodes. To ensure convergence, we update v_max using a learning rate α ∈ [0, 1], which is exponentially decayed over multiple iterations. We also leverage the symmetry of the body by averaging between left and right versions of joints. We then determine a scaling factor between our v_max value and the one from [43], both for the knee joint, and apply this factor to our values for all joints. We chose the knee joint for this because we have a reference value for it, and because many of the strength measurements were induced using the same method of proportional scaling, also using the knee joint (see Sec. III).

C. Muscle Compliance

A key property of muscular systems is adaptive compliance. When a limb is perturbed from a stable position, the muscles and tendons stretch elastically, altering the force response. It has been shown that these properties support motion stabilization during certain tasks [44], [45]. The degree of compliance can be adapted by the relative contraction levels of the muscle pair. In our spring-damper model, adaptive compliance is not possible, as spring and damper properties are fixed parameters. The muscle model features two independently controllable muscles for each joint, instead of a single actuator. This allows for the generation of the same movement with different co-contraction levels, naturally altering the level of compliance and the response to perturbations. This effect is present in humans, as seen in [46]. Here, two groups of people were asked to catch falling balls while keeping their hands steady. In the healthy group, pre-stiffening of the wrist and elbow muscles was observed before impact, reducing deflection when the ball hit their hand. In the second group, suffering from cerebellar ataxia, the muscle activity did not increase until after impact. This led to a larger deviation in hand position.

To demonstrate this capability, we created a scenario where MIMo holds his arm outstretched in a stable pose before having a ball dropped on his arm. All joints except the shoulder joint were locked in place. The shoulder joint was held in position by the associated actuator. We measure shoulder position and actuator torque as the arm deflects from the impact and then returns to position. The control input to the actuator was chosen to hold the arm stable before impact and was kept constant during the whole simulation. We repeated this for both actuation models. Control inputs were chosen manually. We picked three sets of inputs with different activation levels for the muscle model. There was only a single input for the spring-damper motors that kept the arm steady. The result is shown in Fig. 4. In the muscle model, higher co-contraction levels lead to less deflection as the resisting force increases more strongly after the impact. This adaptive compliance is not possible using the spring-damper model. Torque curves and overall response are quite similar between the motor and the medium stiffness muscles, demonstrating that the spring-damper model is reasonably accurate for experiments where high muscle fidelity is not critical.

Fig. 4. These charts show MIMo’s shoulder position and actuator torque as a ball is dropped onto his outstretched arm. The vertical line indicates the moment of impact. For the motor in the spring-damper model, the actual motor torque is constant, but net actuation torque changes due to the spring-damper as the joint is displaced, with the spring offering less resistance to the motor and the damper acting to slow the motion.

D. Costs

We implement two versions of an action cost function c(u) commonly used in reinforcement learning. Both use a measure of the amount of effort required for the actuation, but the second version c_w(u) additionally considers the strength of the actuators. These functions have the same general form for both actuation models. For the spring-damper model we have

c(u) = \frac{\sum_i u_i^2}{n},    (3)

c_w(u) = \frac{\sum_i u_i^2\, T_i}{n \sum_i T_i},    (4)

where n is the number of motors in the scene and u_i and T_i are the control input and maximum torque of motor i, respectively. For the muscle model we have

c(u) = \frac{\sum_i a_i(u_i)^2}{n},    (5)

c_w(u) = \frac{\sum_i a_i(u_i)^2\, f_{\max,i}}{n \sum_i f_{\max,i}},    (6)

where n is the number of muscles, a_i is the activity of muscle i and f_max,i is as described in Section IV-B. Both actuation models expose any parameters or intermediate results for computing custom action penalties or metabolic costs.

V. MULTIMODAL SENSING

As with the actuation models, we pursue a big-picture approach for our models of the sensory modalities. For example, Golgi tendon organs measure the strain between muscles and their tendons and thus measure the mechanical load on the muscle, while muscle spindles measure muscle contraction and velocity. All these receptors over a single joint in essence measure the joint position, velocity, and applied torque over that joint. We do not model all these various receptors themselves, instead computing the quantities directly. Four different modalities are implemented: proprioception, vision, a vestibular system, and a full-body touch-sensitive skin. The first three have simple implementations and we list the information they collect below. The touch-sensitive skin is described in more detail.

MIMo’s proprioceptive system provides him with joint positions and velocities, torques applied across each joint, and limit sensors that activate as each joint approaches its range of motion limits. In addition there are also quantities from the actuation model, which depend on the specific implementation. For example, the muscle model provides the current muscle activation for both muscles in each actuator. Vision is implemented through two color cameras located in his eyes. The range of motion is ±45° horizontally, -47° to 33° vertically [47] and ±8° torsionally [48]. The cameras render two RGB images with a 60° field of view, equivalent to the central vision of humans [49]. The vestibular system, which provides our sense of balance, is implemented as a combination gyroscope and accelerometer located at the center of MIMo’s head.

(a) Binocular vision (b) Touch perception
Fig. 5. MIMo’s multimodal perception while holding a ball. (a) Anaglyph of left and right eye views. (b) Visualization of touch sensors in the hand, with those reporting a force colored red. The size of the circle corresponds to the amount of force sensed.

A. Touch Perception

Human touch sensation is produced by a variety of receptors responding to specific aspects of touch, such as the Slowly Adapting type 1 (SA1) receptors, which respond primarily to direct pressure and coarse texture, or the Rapidly Adapting (RA) type, which responds to slip and fine texture [50].
As with the other modalities, we model these types very simply, ignoring signal travel times and condensing the various types of receptors into a single generic “touch” sensor type that measures normal and frictional forces at its location. Sensor points are spread evenly on each individual body part, with the sensor density varying between body parts based on the two-point discrimination distances by [51]. Thus, the front and the back of the palm have the same sensor density, but not the fingertips. MuJoCo only performs rigid-body simulations in which all contacts are treated as point contacts. Area contacts between flat surfaces produce multiple contact points, for example at the corners of the contact area. As soft-body physics is very computationally expensive, we do not adjust these physics but weakly simulate the deformation of the skin to these point contacts, by distributing the point contact forces over nearby sensor points according to a surface response function. Our response function decreases sensed force linearly with distance from the contact point and then normalizes over all sensor points, such that the total sensed force remains identical to the initial point contact force from MuJoCo. The distance measure is the euclidean distance as geodesic approximations were too computationally expensive. To avoid bleed-through to the opposite side of thin bodies (such as the palm), we only select nearby sensor points with a breadth-first-search on the mesh of sensor points. While also expensive, the results of this search can be cached and reused, leading to a only a small run time penalty compared to using just the euclidean distance. All of these aspects, from the sensor point density through the response function can be easily adjusted or expanded. VI. A PPLICATION P ROGRAMMING I NTERFACE Our code is written in Python and built as an OpenAI gym [52] environment to allow easy integration into existing experimental setups and take advantage of the large amount of documentation and third-party libraries that already exist, such as the stable baselines library [53]. This environment is intended as an abstract base class that will be subclassed and adapted by other environments for specific experiments. These subclasses would handle reward structure, sensor limits or any additional constraints. Underdeveloped or limited sensors can be implemented through their configurations, but for the most part the user would implement any perceptual constraints. The environment is set up to facilitate this in a straightforward way. The configuration of the sensory modalities, such as the density of the touch sensors or the field of view and resolution of the visual system can be adjusted or disabled easily during initialization without modifying the underlying MuJoCo XMLs. Swapping between actuation models only requires a single line change as well. The action and observation spaces are generated automatically based on the configuration of the MuJoCo XMLs and the sensor modules. Disabling touch perception also removes the associated entry from the observation space. All of the sensory modalities are programmed as separate modules and can be readily attached to any MuJoCo-based gym environment. Our simulation is time discrete with two different time step types. The physics time step determines the temporal resolution of the physics simulation and choosing a sufficiently small one is important for simulation stability. 
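In practice, setting up an experiment then amounts to subclassing the base environment and passing the desired sensor and actuation configuration at initialization. The following is a schematic sketch only: the module path, constructor arguments, and observation keys are illustrative assumptions rather than the exact names used in the MIMo repository.

```python
import numpy as np
from mimoEnv.envs import MIMoEnv   # assumed import path; see the repository for actual names

class ReachEnv(MIMoEnv):
    """Example subclass implementing one specific experiment."""

    def __init__(self):
        super().__init__(
            # Sensor modules are configured at initialization; the underlying
            # MuJoCo XMLs are not modified. Parameter names are placeholders.
            vision_params={"width": 256, "height": 256},   # per-eye resolution
            touch_params={"scale": 1.0},                    # touch sensor density multiplier
            vestibular_params={},                           # enabled with default settings
            proprio_params={},
            actuation_model="spring_damper",                # or "muscle": a one-line swap
        )

    def compute_reward(self, obs):
        # Task-specific reward, e.g. negative fingertip-to-target distance;
        # the observation keys used here are hypothetical.
        return -np.linalg.norm(obs["fingertip_pos"] - obs["target_pos"])

env = ReachEnv()
obs = env.reset()
for _ in range(100):
    obs, reward, done, info = env.step(env.action_space.sample())
```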
The second step type is the control step, during which observations are collected and the control algorithm is queried for new motor commands. Control frequency is often an important factor for RL algorithms. Control step size must be an integer multiple of the physics step size.

VII. EXPERIMENTS

A. Illustrations of learning

While RL is a powerful framework for generating controllers, its chaotic nature and its need for large amounts of data require a stable and efficient model. In the following, we demonstrate that MIMo is well-suited for RL even while including multimodal inputs. Note that badly designed models can negatively affect RL performance, even if they are physically accurate in some regimes. We use four example tasks: reaching for objects, standing up, self-body knowledge, and catching a falling ball. In these tasks, two different state-of-the-art, widely used deep RL algorithms, Proximal Policy Optimization (PPO) [54] and Soft Actor-Critic (SAC) [23], are trained to optimize task-dependent reward functions by controlling MIMo’s actuators. While both algorithms are actor-critics, PPO is a widely used on-policy approach closely related to trust-region algorithms [55] and SAC is an off-policy algorithm with a maximum-entropy formulation [56]. We use the default parameters of the Stable-Baselines3 library [53], consisting of networks with two hidden layers of size 64 for PPO and size 256 for SAC, as well as common improvements [57], [58] that are critical for performance. Input and output layer sizes vary depending on the observation and action spaces of the environments. Performance is compared for PPO and SAC with 10 different seeds (Fig. 6). We do not claim that such extrinsically motivated learning is how human infants learn these skills. We merely use these examples to showcase MIMo learning from multimodal input.

(a) Reaching for objects (b) Standing up (c) Self-body knowledge (d) Catching objects
Fig. 6. Comparison of learning curves for PPO (red) and SAC (blue) in the four illustration environments, with 10 different seeds each (average reward plotted against millions of environment steps). Snapshots show typical postures of MIMo at the start (top) and end (bottom) of an episode. Videos are available at https://tinyurl.com/MIMo-playlist.

a) Reaching for Objects: Reaching is a complex behavior that emerges within the first 6 months of life. Since it requires hand-eye coordination, infants must combine vision, proprioception, and touch to produce the desired motion [59]. A motor command is generated for the arm and hand muscles to produce the reaching movement towards an object in their visual field, with immediate haptic feedback about its success. In our illustration, MIMo learns to reach for a ball. He is standing in front of the target, which changes position randomly in each episode, always within reach of MIMo’s right hand. He can only move his right shoulder, elbow, and hand joints. His head and eyes are set to look directly at the ball, i.e., the initial visual search and object fixation are assumed. The observation space includes only the proprioception sensory modality. MIMo can use the joint angles of his head and eyes to determine the position of the target.
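Since the experiments rely on the default Stable-Baselines3 implementations of PPO and SAC, training on such an environment requires only a few lines. The sketch below uses a placeholder environment ID; the registered names in the MIMo repository may differ.

```python
import gym
from stable_baselines3 import PPO, SAC

env = gym.make("MIMoReach-v0")     # placeholder ID

# Default hyperparameters, as in Sec. VII-A: two hidden layers of size 64 for PPO
# (size 256 for SAC). Use "MultiInputPolicy" if the observation space is a dict.
model = PPO("MlpPolicy", env, verbose=1)       # or: SAC("MlpPolicy", env)
model.learn(total_timesteps=500_000)

# Roll out the learned policy for one episode.
obs = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```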
We introduce an additional difficulty relative to the original version of the task [1] by also sampling the initial condition of the right arm from 10 random positions. This ensures that MIMo effectively uses his proprioception to reach the ball as fast as possible. The reward function

r = \begin{cases} 100 & \text{if target reached,} \\ -\lVert p_{\mathrm{fingers}} - p_{\mathrm{target}} \rVert & \text{otherwise} \end{cases}    (7)

is the negative distance between the positions of the fingers and the target, with a sparse positive reward when contact is detected. Each episode lasts 1000 time steps or until MIMo successfully touches the ball. Results for this task are shown in Fig. 6a.

b) Standing up: Infants learn to stand up by themselves at around 10 months and start walking shortly after. This is a gradual process that includes previous stages such as crawling and maintaining balance. One particular stage is marked by the emergence of pulling-to-stand, when infants who are unable to stand without support grasp the edge of a solid surface and pull themselves upwards, thus combining the strengths of their arms and legs [60]. This behavior appears as early as 7 months of age and is a necessary milestone during independent locomotion development. To reproduce the pulling-to-stand behavior, we design an environment where MIMo is placed sitting inside a crib. His feet are fixed to the ground and his hands are fixed to the crib’s rail guard, at a height of 45 cm. He can move the joints in his arms, torso, and legs, with the aim of standing up. The observation space includes the proprioception and vestibular sensory modalities. The latter can be particularly useful by providing information about vertical acceleration. The extrinsic reward is given by

r = z_{\mathrm{head}} - 0.01 \sum_{i \in \mathrm{joints}} u_i^2    (8)

where z_head is the head’s height measured from an initial height of 20 cm, u_i is the control input for joint i, and the sum is taken over all active joints. This reward function favors standing positions while penalizing states that require excessive force. The parameters are set to balance the two components. All episodes last 500 time steps. Results for this task are shown in Fig. 6b.

c) Self-body Knowledge: Infants learn not only about the world but also about themselves and their own bodies. In fact, this begins as a tactile exploratory behavior before birth and continues over the first few months of life [61]. Infants develop a self-body knowledge that allows them to map the multimodal sensory inputs to the different parts of their bodies. MIMo can learn this self-body knowledge by using his touch perception. We design an environment where MIMo is sitting with his legs crossed, such that his right arm can reach all of his body parts. In each episode he is given a target body part sampled uniformly at random from the geometric primitives that make up his body. By only moving his right arm, he is trained to activate the touch sensors on the target body part. The observation space includes proprioception and touch, as well as the target as a vector with one-hot encoding. The reward function

r = \begin{cases} 500 & \text{if target touched,} \\ -\lVert p_{\mathrm{touched}} - p_{\mathrm{target}} \rVert & \text{if other part touched,} \\ -1 & \text{otherwise} \end{cases}    (9)

is positive only if the target body part is touched. Otherwise, it is either the negative distance to the target body part if another touch signal is activated or a fixed negative value if there is no touch signal. Each episode lasts 500 time steps or until MIMo successfully touches the target. Results for this task are shown in Fig. 6c.
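For concreteness, the reward functions of Eqs. (7)–(9) translate directly into code. The following is a minimal sketch; function and variable names are ours, and the positions are assumed to be numpy arrays extracted from the simulator state.

```python
import numpy as np

def reach_reward(p_fingers, p_target, target_reached):
    # Eq. (7): sparse bonus on contact, otherwise negative fingertip-to-target distance.
    if target_reached:
        return 100.0
    return -np.linalg.norm(p_fingers - p_target)

def standup_reward(z_head, controls):
    # Eq. (8): head height minus a small quadratic control penalty over active joints.
    return z_head - 0.01 * np.sum(np.square(controls))

def selfbody_reward(p_touched, p_target, target_touched, other_touched):
    # Eq. (9): bonus for touching the target part, shaped penalty otherwise.
    if target_touched:
        return 500.0
    if other_touched:
        return -np.linalg.norm(p_touched - p_target)
    return -1.0
```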
d) Catching Objects: As they develop, infants learn to rely on sensory feedback to adapt their actions in order to achieve their goals. One example is catching a falling object, where visual and haptic information gives a cue about when and how to grasp. While newborns have an innate palmar grasp reflex, they learn to predict the motion of objects to catch them at around 8 months of age [62]. We illustrate this behavior in an experiment where MIMo learns to grasp a ball that falls onto his hand. The ball’s size, mass, and initial position are randomized in each episode. Using the full hand model, MIMo’s body is fixed in a standing position with his right arm stretched out in front of him, such that he can only move his right wrist and fingers. The thumb is locked to the side. At each time step, his head and eyes are automatically rotated to fixate on the falling ball. His observation space includes proprioception, touch, and the size of the ball. MIMo needs to learn to integrate these pieces of information to successfully grasp the ball. This experiment, unlike the others, uses the muscle actuation model. The reward function is given by

r = \begin{cases} -100 & \text{if ball falls beyond hand,} \\ 100 & \text{if ball is held for 1 second,} \\ N_c - c_w(a) & \text{otherwise} \end{cases}    (10)

where N_c is the number of geometric primitives in contact with the ball and c_w(a) is the cost function for the actuators as given by Equation (6). This cost function rewards grasping with the fingers over using the wrist. Each episode lasts 800 time steps, or until either the ball falls beyond the hand’s height or MIMo has held onto the ball continuously for a full second. Results for this task are shown in Fig. 6d.

Both SAC and PPO quickly converge to the optimal behavior for the simpler tasks, demonstrating that the parametrization of MIMo allows for stable motor control. The two harder tasks, (b) and (d), require either full-body control or difficult hand coordination across all fingers. In these scenarios, SAC performs worse and we observe larger variance across seeds. We conjecture that PPO benefits from its monotonic improvement guarantees and therefore achieves better learning stability and lower variance across seeds.

B. Benchmarking MIMo

In this section we measure the simulation speed of MIMo. In particular, we are interested in assessing under what conditions we can achieve faster than real-time simulations. Each benchmark runs for one hour of simulation time. Environments have a maximum episode duration and are reset when a goal is achieved or the time limit is reached. We measure the real time spent in each run, as well as in each of the different components of the system: MuJoCo and the sensory modalities. Physics steps last 5 ms for all benchmarks. Unless stated otherwise we use the spring-damper actuation model with the mitten hand version of MIMo. Results are reported as real seconds required for each simulation second. The test system is equipped with an AMD FX-8350, 16 GB RAM, and a GTX 1070. The execution times are measured using Python’s cProfile library. In the first experiment we test performance with multiple configurations for the different sensory modalities, focusing on the vision and touch modules since they are most sensitive to their configuration and consume the bulk of the processing time. The environment consists of MIMo and two objects. MIMo takes random actions continuously. Each episode has a fixed length of 1 minute, with 60 episodes per benchmark. The results can be seen in Fig. 7.
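A benchmark run of this kind can be reproduced with a simple profiling loop. The following is a sketch under stated assumptions: the environment ID is a placeholder and, for simplicity, one physics step per env.step call is assumed.

```python
import cProfile
import pstats
import time
import gym

env = gym.make("MIMoBench-v0")   # placeholder ID for a scene with MIMo and two objects

SIM_SECONDS = 60.0               # one 1-minute episode
PHYSICS_DT = 0.005               # 5 ms physics steps, as in Sec. VII-B

def run_episode():
    env.reset()
    for _ in range(int(SIM_SECONDS / PHYSICS_DT)):
        env.step(env.action_space.sample())   # random actions, as in the benchmark

start = time.perf_counter()
cProfile.run("run_episode()", "bench.prof")   # per-component times, analogous to the paper's cProfile setup
wall = time.perf_counter() - start

print(f"real seconds per simulated second: {wall / SIM_SECONDS:.2f}")
pstats.Stats("bench.prof").sort_stats("cumulative").print_stats(10)
```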
Fig. 7. Results of the performance benchmarks. Each bar represents one run consisting of 60 episodes of 1-minute length each. The labels indicate the pixel resolution for the visual system (V) and a scalar multiplier for the sensor density for the touch system (T) used. The 4 leftmost bars correspond to configurations with increasing touch sensor density and constant visual resolution, the 3 rightmost bars to increasing visual resolution and constant touch sensor density. All configurations with visual resolutions lower than 512×512 pixels run significantly faster than real time (dashed horizontal line). (Y-axis: real time per simulated second; bars are broken down into initialization, physics, touch, vision, proprioception, vestibular, and other components.)

The default configuration of MIMo (vision resolution of 256 × 256 pixels) requires 0.69 real seconds for each simulation second, i.e., it is 1.44 times faster than real time.

In the next benchmark we test the performance of our demo environments. The number of episodes is no longer fixed and individual episodes may be cut short if MIMo achieves the goal of the environment. The reach, self-body, and catch experiments perform two physics steps for every control step, the default setting for MIMo environments. The stand-up experiment uses one physics step per control step in order to increase the stability of the simulation due to the additional constraints between MIMo’s hands and the crib. We also test the performance of the full hand version of MIMo. The configuration is the same as in the first benchmark, but with the version of MIMo replaced. Results for both benchmarks are plotted in Fig. 8.

Fig. 8. Performance benchmarks for each of the demo environments and the full hand version of MIMo. The benchmark of the default configuration, using the mitten hands, is also plotted for reference. (Y-axis: real time per simulated second; the component breakdown additionally includes the muscle model.)

The “Reach”, “Standup”, and “SelfBody” experiments are all based on the default configuration with some modalities disabled and joints locked in place. As a result, they perform significantly faster. Interestingly, the physics simulation for the stand-up environment takes longer than the baseline. This slowdown comes from: 1) a slowdown in MuJoCo’s solver, as MIMo’s joint positions are more constrained since both his hands and feet are fixed in place, and 2) the environment configuration: MIMo’s initial position is slightly randomized and the simulation is allowed to settle for several physics steps without any actions before each episode begins.

As to the full hand version, the performance of both MuJoCo and the touch module is highly dependent on the number of contacts in the simulation. Contacts between two body parts of MIMo are particularly expensive as the force distribution has to be computed for both bodies. This leads to a significant slow-down for both MuJoCo and the touch module due to contacts between MIMo’s fingers. The “Catch” environment is based on the full hand version, with all joints except the right hand locked into fixed positions, and touch sensation only in the right arm and hand. This speeds up both touch and physics performance. Muscle functions make up 3% of the run time. The muscle actuation model adds a flat run-time cost of 0.8 ms per physics step for both versions of MIMo, which leads to a 23% increase in run time.

Performance can be improved for specific experiments in multiple ways. In addition to adjusting the configuration of the modalities or the scene, reducing the frequency of control steps reduces the frequency of observation collection and thus the time spent in the sensory modules.

VIII. DISCUSSION

In this work we have presented MIMo, the multimodal infant model, an open-source research software platform for 1) building models of cognitive development in infants and toddlers and 2) constructing AIs that can learn in a similar self-directed fashion from interactions with their environment. Compared to previous software platforms [11], [17]–[19], [24], a key strength of MIMo is the combination of 1) state-of-the-art physics simulation based on MuJoCo (https://mujoco.org) with 2) a full-body touch-sensitive skin, and 3) a plausible approximation of muscle-driven movement, while maintaining the ability to run simulations faster than real-time on standard hardware. We believe that these ingredients are essential for advancing our understanding of the development of, e.g., early self-models in infants, which pave the way toward full-fledged adult-like forms of intelligence and consciousness.

In the past, computational models of cognitive development have often been restricted to isolated cognitive phenomena. Some examples are works on binocular vision [63], [64], visual object and category learning [65]–[67], gaze following [68], [69], learning to grasp objects [20], perseverative reaching [70], word learning [71]–[73], and countless others. While such models have produced many important insights, they often work with simplified sensory inputs and it is not clear how to scale them to the rich multimodal sensory input provided by our sense organs. Recent approaches that train models with first-person video and audio recordings of infants [74], [75] try to overcome this limitation. Critically, however, such a model still learns by just passively observing inputs recorded in this way, while the actual infant actively generated this sensory input through its own behavior. Therefore, as we have argued in the Introduction, this first-person-recording approach is likely to miss an essential aspect of cognitive development: the infant’s ability to actively probe the causal structure of the world in a targeted fashion. Capturing such more advanced forms of learning in models necessitates modeling environments that support embodied interaction with the physical environment. We feel that the time is ripe to fully embrace such models and we believe that it should be done without incurring the burden and limited reproducibility associated with working with humanoid robots.

In the design of MIMo we faced a number of trade-offs between realism and computational efficiency. The “right” trade-off will always depend on the particular phenomenon being investigated. The choices we made already permit faster than real-time simulations on today’s standard hardware. However, this has resulted in various limitations. For example, MIMo’s body is composed of simple rigid shape primitives. Furthermore, MIMo is presently limited to just four sensory modalities (binocular vision, proprioception, touch, and a vestibular system). In the future, we plan to incorporate nociception (pain perception), audition, and possibly olfaction. Future work could also study the effects of a growing body during development.
All relevant aspects, such as MIMo’s physical size and strength, can be adjusted even within episodes. A model of the development of, say, a particular sensorimotor skill could be trained while parameters of MIMo’s body slowly change. We hope that MIMo will facilitate research into how cognition develops from embodied interactions with the physical and social environment, whose rich structure is sensed through multiple modalities. At the very least, it should make such efforts easier and more reproducible. Furthermore, we hope that MIMo will enable a more cumulative and collaborative approach to such research, where models of the development of higher-level cognitive functions are built on top of previously published models of the development of precursor skills. After all, human development is often a cumulative process, where new representations, skills, and competences are built on top of already existing ones in an open-ended fashion. R EFERENCES [1] D. Mattern, F. M. López, M. R. Ernst, A. Aubret, and J. Triesch, “Mimo: A multi-modal infant model for studying cognitive development in humans and ais,” in 2022 IEEE International Conference on Development and Learning (ICDL). IEEE, 2022, pp. 23–29. [2] A. M. Turing, “Computing machinery and intelligence,” Mind, vol. 59, no. 236, pp. 433–460, 1950. [3] M. Asada, K. F. MacDorman, H. Ishiguro, and Y. Kuniyoshi, “Cognitive developmental robotics as a new paradigm for the design of humanoid robots,” Robotics and Autonomous Systems, vol. 37, no. 2, pp. 185–193, 2001, Humanoid Robots. [4] M. Lungarella, G. Metta, R. Pfeifer, and G. Sandini, “Developmental robotics: a survey,” Connection Science, vol. 15, no. 4, pp. 151–190, 2003. [5] J. Schmidhuber, “Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts,” Connection Science, vol. 18, no. 2, pp. 173–187, 2006. [6] M. Asada, K. Hosoda, Y. Kuniyoshi, H. Ishiguro, T. Inui, Y. Yoshikawa, M. Ogino et al., “Cognitive developmental robotics: A survey,” IEEE Transactions on Autonomous Mental Development, vol. 1, no. 1, pp. 12–34, 2009. [7] A. Cangelosi and M. Schlesinger, “From babies to robots: The contribution of developmental robotics to developmental psychology,” Child Development Perspectives, vol. 12, no. 3, pp. 183–188, 2018. [8] K. Doya and T. Taniguchi, “Toward evolutionary and developmental intelligence,” Current Opinion in Behavioral Sciences, vol. 29, pp. 91– 96, 2019, artificial Intelligence. [9] C. Gan, J. Schwartz, S. Alter, D. Mrowca, M. Schrimpf, J. Traer, J. D. Freitas et al., “ThreeDWorld: A platform for interactive multi-modal physical simulation,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. [Online]. Available: https://openreview.net/forum?id=db1InWAwW2T [10] E. Rohmer, S. P. N. Singh, and M. Freese, “V-rep: A versatile and scalable robot simulation framework,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013, pp. 1321–1326. [11] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” 2016. [12] J. Collins, S. Chand, A. Vanderkop, and D. Howard, “A review of physics simulators for robotic applications,” IEEE Access, vol. 9, pp. 51 416– 51 431, 2021. [13] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026–5033. [14] H. Fu, R. Jia, L. Gao, M. Gong, B. Zhao, S. 
Maybank, and D. Tao, “3d-future: 3d furniture shape with texture,” International Journal of Computer Vision, pp. 1–25, 2021.
[15] S. Stojanov, A. Thai, and J. M. Rehg, “Using shape to categorize: Low-shot learning with an explicit shape bias,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 1798–1808.
[16] A. Olmos and F. A. A. Kingdom, “McGill calibrated colour image database,” http://tabby.vision.mcgill.ca.
[17] V. Tikhanoff, A. Cangelosi, P. Fitzpatrick, G. Metta, L. Natale, and F. Nori, “An open-source simulator for cognitive robotics research: The prototype of the iCub humanoid robot simulator,” in Proceedings of the 8th Workshop on Performance Metrics for Intelligent Systems, ser. PerMIS ’08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 57–61.
[18] M. Kerzel, E. Strahl, S. Magg, N. Navarro-Guerrero, S. Heinrich, and S. Wermter, “Nico — neuro-inspired companion: A developmental humanoid robot platform for multimodal interaction,” in 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2017, pp. 113–120.
[19] Y. Kuniyoshi and S. Sangawa, “Early motor development from partially ordered neural-body dynamics: experiments with a cortico-spinal-musculo-skeletal model,” Biological Cybernetics, vol. 95, no. 6, pp. 589–605, 2006.
[20] E. Oztop, N. S. Bradley, and M. A. Arbib, “Infant grasp learning: a computational model,” Experimental Brain Research, vol. 158, no. 4, pp. 480–503, 2004.
[21] A. Priamikov, M. Fronius, B. Shi, and J. Triesch, “Openeyesim: a biomechanical model for simulation of closed-loop visual perception,” Journal of Vision, vol. 16, no. 15, pp. 25–25, 12 2016.
[22] S. L. Delp, F. C. Anderson, A. S. Arnold, P. Loan, A. Habib, C. T. John, E. Guendelman et al., “Opensim: Open-source software to create and analyze dynamic simulations of movement,” IEEE Transactions on Biomedical Engineering, vol. 54, no. 11, pp. 1940–1950, 2007.
[23] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1861–1870.
[24] Y. Tassa, T. Erez, and E. Todorov, “Synthesis and stabilization of complex behaviors through online trajectory optimization,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 4906–4913.
[25] S. Ressler, “Anthrokids - anthropometric data of children,” National Institute of Standards and Technology, 1977. [Online]. Available: https://math.nist.gov/~SRessler/anthrokids/
[26] V. Kumar, “Manipulators and manipulation in high dimensional spaces,” Ph.D. dissertation, University of Washington, Seattle, 2016. [Online]. Available: https://digital.lib.washington.edu/researchworks/handle/1773/38104
[27] M. J. McKay, J. N. Baldwin, P. Ferreira, M. Simic, N. Vanicek, J. Burns, the 1000 Norms Project Consortium et al., “Normative reference values for strength and flexibility of 1,000 children and adults,” Neurology, vol. 88, no. 1, pp. 36–43, 2017.
[28] M. N. Eek, A.-K. Kroksmark, and E. Beckung, “Isometric muscle torque in children 5 to 15 years of age: Normative data,” Archives of Physical Medicine and Rehabilitation, vol. 87, no. 8, pp. 1091–1099, 2006.
[29] W. N. Sankar, C. T. Laird, and K. D. Baldwin, “Hip range of motion in children: what is the norm?” Journal of Pediatric Orthopaedics, vol. 32, no. 4, pp. 399–405, 2012.
[30] T. Gomez, G. Beach, C. Cooke, W. Hrudey, and P. Goyert, “Normative database for trunk range of motion, strength, velocity, and endurance with the isostation b-200 lumbar dynamometer,” Spine, vol. 16, no. 1, pp. 15–21, January 1991.
[31] A. Jordan, J. Mehlsen, P. M. Bülow, K. Østergaard, and B. Danneskiold-Samsøe, “Maximal isometric strength of the cervical musculature in 100 healthy volunteers,” Spine, vol. 24, no. 13, p. 1343, 1999.
[32] A. M. Öhman and E. R. Beckung, “Reference values for range of motion and muscle function of the neck in infants,” Pediatric Physical Therapy, vol. 20, no. 1, pp. 53–58, 2008.
[33] H. Watanabe, K. Ogata, T. Amano, and T. Okabe, “The range of joint motions of the extremities in healthy Japanese people–the difference according to the age (author’s transl),” Nihon Seikeigeka Gakkai Zasshi, vol. 53, no. 3, pp. 275–261, 1979.
[34] I. Günal, N. Köse, O. Erdogan, E. Göktürk, and S. Seber, “Normal range of motion of the joints of the upper extremity in male subjects, with special reference to side,” Journal of Bone & Joint Surgery, vol. 78, no. 9, p. 1401, 1996.
[35] R. E. Hughes, M. E. Johnson, S. W. O’Driscoll, and K.-N. An, “Age-related changes in normal isometric shoulder strength,” The American Journal of Sports Medicine, vol. 27, no. 5, pp. 651–657, 1999.
[36] M. Katoh, “Test-retest reliability of isometric shoulder muscle strength measurement with a handheld dynamometer and belt,” Journal of Physical Therapy Science, vol. 27, no. 6, pp. 1719–1722, 2015.
[37] S. N. Da Paz, A. Stalder, S. Berger, and K. Ziebarth, “Range of motion of the upper extremity in a healthy pediatric population: introduction to normative data,” European Journal of Pediatric Surgery, vol. 26, no. 5, pp. 454–461, 2016.
[38] S.-K. Bok, T. H. Lee, and S. S. Lee, “The effects of changes of ankle strength and range of motion according to aging on balance,” Annals of Rehabilitation Medicine, vol. 37, no. 1, pp. 10–16, 2013.
[39] T. Ohtsuki, “Decrease in grip strength induced by simultaneous bilateral exertion with reference to finger strength,” Ergonomics, vol. 24, no. 1, pp. 37–48, 1981.
[40] P. Ekman, “An argument for basic emotions,” Cognition and Emotion, vol. 6, no. 3-4, pp. 169–200, 1992.
[41] J. M. Vanswearingen, “Measuring wrist muscle strength,” Journal of Orthopaedic & Sports Physical Therapy, vol. 4, no. 4, pp. 217–228, 1983.
[42] I. Wochner, P. Schumacher, G. Martius, D. Büchler, S. Schmitt, and D. Haeufle, “Learning with muscles: Benefits for data-efficiency and robustness in anthropomorphic tasks,” in 6th Annual Conference on Robot Learning, 2022. [Online]. Available: https://openreview.net/forum?id=Xo3eOibXCQ8
[43] L. A. Frey-Law, A. Laake, K. G. Avin, J. Heitsman, T. Marler, and K. Abdel-Malek, “Knee and elbow 3d strength surfaces: peak torque-angle-velocity relationships,” Journal of Applied Biomechanics, vol. 28, no. 6, pp. 726–737, 2012.
[44] A. Mo, F. Izzi, D. F. B. Haeufle, and A. Badri-Spröwitz, “Effective viscous damping enables morphological computation in legged locomotion,” Frontiers in Robotics and AI, vol. 7, 2020. [Online]. Available: https://www.frontiersin.org/articles/10.3389/frobt.2020.00110
[45] F. Izzi, A. Mo, S. Schmitt, A. Badri-Spröwitz, and D. F. B. Haeufle, “Muscle prestimulation tunes velocity preflex in simulated perturbed hopping,” Scientific Reports, vol. 13, no. 1, p. 4559, Mar 2023. [Online]. Available: https://doi.org/10.1038/s41598-023-31179-6
[46] C. E. Lang and A. J. Bastian, “Cerebellar subjects show impaired adaptation of anticipatory emg during catching,” Journal of Neurophysiology, vol. 82, no. 5, pp. 2108–2119, 1999.
[47] W. J. Lee, J. H. Kim, Y. U. Shin, S. Hwang, and H. W. Lim, “Differences in eye movement range based on age and gaze direction,” Eye, vol. 33, no. 7, pp. 1145–1151, 2019.
[48] A. L. Rosenbaum and A. P. Santiago, Clinical strabismus management: principles and surgical techniques. W.B. Saunders, 1999.
[49] H. Strasburger, I. Rentschler, and M. Jüttner, “Peripheral vision and pattern recognition: A review,” Journal of Vision, vol. 11, no. 5, pp. 13–13, 12 2011.
[50] K. O. Johnson and S. S. Hsiao, “Neural mechanisms of tactual form and texture perception,” Annual Review of Neuroscience, vol. 15, no. 1, pp. 227–250, 1992.
[51] F. Mancini, A. Bauleo, J. Cole, F. Lui, C. A. Porro, P. Haggard, and G. D. Iannetti, “Whole-body mapping of spatial acuity for pain and touch,” Annals of Neurology, vol. 75, no. 6, pp. 917–924, 2014.
[52] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.
[53] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
[54] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[55] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in Proceedings of the 32nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille, France: PMLR, 07–09 Jul 2015, pp. 1889–1897. [Online]. Available: https://proceedings.mlr.press/v37/schulman15.html
[56] S. Han and Y. Sung, “A max-min entropy framework for reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 25732–25745, 2021.
[57] L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, F. Janoos, L. Rudolph, and A. Madry, “Implementation matters in deep rl: A case study on ppo and trpo,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=r1etN1rtPB
[58] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar et al., “Soft actor-critic algorithms and applications,” CoRR, vol. abs/1812.05905, 2018. [Online]. Available: http://arxiv.org/abs/1812.05905
[59] D. Corbetta, R. F. Wiener, S. L. Thurman, and E. G. McMahon, “The embodied origins of infant reaching: Implications for the emergence of eye-hand coordination,” Kinesiology Review, vol. 7, no. 1, pp. 10–17, 2018.
[60] O. Atun-Einy, S. E. Berger, and A. Scher, “Pulling to stand: Common trajectories and individual differences in development,” Developmental Psychobiology, vol. 54, no. 2, pp. 187–198, 2012.
[61] L. Jacquey, J. Fagard, K. O’Regan, and R. Esseily, “Development of body know-how during the baby’s first year of life,” Enfance, vol. 2, no. 2, pp. 175–192, 2020.
[62] P. van Hof, J. van der Kamp, and G. J. Savelsbergh, “The relation between infants’ perception of catchableness and the control of catching,” Developmental Psychology, vol. 44, no. 1, pp. 182–194, 2008.
[63] M. Dominguez and R. A. Jacobs, “Developmental constraints aid the acquisition of binocular disparity sensitivities,” Neural Computation, vol. 15, no. 1, pp. 161–182, 2003.
[64] S. Eckmann, L. Klimmasch, B. E. Shi, and J. Triesch, “Active efficient coding explains the development of binocular vision and its failure in amblyopia,” Proceedings of the National Academy of Sciences, vol. 117, no. 11, pp. 6156–6162, 2020.
[65] D. Mareschal, R. M. French, and P. C. Quinn, “A connectionist account of asymmetric category learning in early infancy,” Developmental Psychology, vol. 36, no. 5, p. 635, 2000.
[66] F. Schneider, X. Xu, M. R. Ernst, Z. Yu, and J. Triesch, “Contrastive learning through time,” in SVRHM 2021 Workshop @ NeurIPS, 2021.
[67] A. Aubret, C. Teulière, and J. Triesch, “Toddler-inspired embodied vision for learning object representations,” in 2022 IEEE International Conference on Development and Learning (ICDL). IEEE, 2022, pp. 81–87.
[68] Y. Nagai, K. Hosoda, A. Morita, and M. Asada, “A constructive model for the development of joint attention,” Connection Science, vol. 15, no. 4, pp. 211–229, 2003.
[69] J. Triesch, C. Teuscher, G. O. Deák, and E. Carlson, “Gaze following: why (not) learn it?” Developmental Science, vol. 9, no. 2, pp. 125–147, 2006.
[70] E. Thelen, G. Schöner, C. Scheier, and L. B. Smith, “The dynamics of embodiment: A field theory of infant perseverative reaching,” Behavioral and Brain Sciences, vol. 24, no. 1, pp. 1–34, 2001.
[71] D. K. Roy and A. P. Pentland, “Learning words from sights and sounds: A computational model,” Cognitive Science, vol. 26, no. 1, pp. 113–146, 2002.
[72] C. Yu and D. H. Ballard, “A unified model of early word learning: Integrating statistical and social cues,” Neurocomputing, vol. 70, no. 13-15, pp. 2149–2165, 2007.
[73] F. Xu and J. B. Tenenbaum, “Word learning as Bayesian inference,” Psychological Review, vol. 114, no. 2, p. 245, 2007.
[74] S. Bambach, D. Crandall, L. Smith, and C. Yu, “Toddler-inspired visual object learning,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[75] E. Orhan, V. Gupta, and B. M. Lake, “Self-supervised learning through the eyes of a child,” Advances in Neural Information Processing Systems, vol. 33, pp. 9960–9971, 2020.