arXiv:2312.04308v1 [cs.RO] 7 Dec 2023

Multi Actor-Critic DDPG for Robot Action Space Decomposition: A Framework to Control Large 3D Deformation of Soft Linear Objects

Mélodie Daniel1, Aly Magassouba2, Miguel Aranda3, Laurent Lequièvre2, Juan Antonio Corrales Ramón4, Roberto Iglesias Rodriguez4 and Youcef Mezouar2

1 Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400 Talence, France. 2 CNRS, Clermont Auvergne INP, Institut Pascal, Université Clermont Auvergne, Clermont-Ferrand, France. 3 Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, Zaragoza, Spain. 4 Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Spain. JACR was funded by the Spanish government through a 'Beatriz Galindo' fellowship (Ref. BG20/00143), by the research project PID2020-119367RB-I00 and by the Galician Government through the programme 'Captación e Retención de Talento'. MA was supported via projects PID2021-124137OB-I00 and TED2021-130224B-I00 funded by MCIN/AEI/10.13039/501100011033, by ERDF A way of making Europe and by the European Union NextGenerationEU/PRTR. Corresponding author: Mélodie Daniel, e-mail: melodie.daniel@u-bordeaux.fr.

Abstract— Robotic manipulation of deformable linear objects (DLOs) has great potential for applications in diverse fields such as agriculture or industry. However, a major challenge lies in acquiring accurate deformation models that describe the relationship between robot motion and DLO deformations. Such models are difficult to calculate analytically and vary among DLOs. Consequently, manipulating DLOs poses significant challenges, particularly in achieving large deformations that require highly accurate global models. To address these challenges, this paper presents MultiAC6: a new multi actor-critic framework for robot action space decomposition to control large 3D deformations of DLOs. In our approach, two deep reinforcement learning (DRL) agents orient and position a robot gripper to deform a DLO into the desired shape. Unlike previous DRL-based studies, MultiAC6 is able to overcome the sim-to-real gap, achieving large 3D deformations up to 40 cm in real-world settings. Experimental results also show that MultiAC6 has a 66% higher success rate than a single-agent approach. Further experimental studies demonstrate that MultiAC6 generalizes well, without retraining, to DLOs with different lengths or materials. We released the code at this URL1. A demonstration video is available at this URL2.

1 https://github.com/MelodieDANIEL/MultiAC6
2 https://youtu.be/CWyCozJEiQk

Fig. 1: Overview of MultiAC6: a multi actor-critic framework controlling the gripper pose to achieve large 3D DLO deformations.

I. INTRODUCTION

Following the Industry 4.0 paradigm, industrial robots are increasingly being requested to manipulate various objects in real-world settings. In this context, providing robots with the ability to manipulate soft objects has many practical uses. This particularly concerns deformable linear objects (DLOs), which are one-dimensional soft objects such as cables, plants, or beams [1]. Typical applications are related to cable harnessing [2], [3], hose manipulation [4], or plant stem bending for harvesting [5], [6].

Modeling DLOs for robot manipulation remains a challenge. In fact, such objects exhibit nonlinear deformations that are difficult and computationally expensive to accurately model [7]. Therefore, simplified models are generally used [8], but at the expense of accuracy and flexibility.
Indeed, a single deformation model cannot fully capture the length or material of various DLOs.

Different lines of research have tackled DLO manipulation. Analytical approaches generally consider 2D deformations [2], [6], [9], [10], [11]. Fewer works have addressed the more challenging case of 3D deformations [3], [7], [12], [13]. In general, these methods are limited by the accuracy of the deformation model used. To avoid modeling DLO deformations, another line of research explored deep reinforcement learning (DRL) approaches. Although promising results could be obtained, these approaches are validated mainly in simulation [14], [15], [16]. Indeed, the sim-to-real gap, peculiar to DRL approaches, is still an obstacle to real-world applications [7]. This sim-to-real gap is mainly caused by the approximations of the simulators, such as unrealistic deformations. Despite this limitation, a few DRL approaches have been validated in real-world settings, but only for 2D deformations [17], [18].

In contrast, this paper addresses the 3D manipulation of DLOs in real-world settings with a single-arm robot. To this end, we propose a novel multi actor-critic (MultiAC6) DRL framework based on the deep deterministic policy gradient (DDPG) algorithm. This framework decomposes the 6-degree-of-freedom (DOF) action space of the robot gripper across multiple agents, as shown in Figure 1. This paper is an extension of our previous work [15]. In that work, we proposed a single-agent framework that controls the 3-DOF position of the robot gripper to deform a DLO in simulation. In contrast, in this paper, we propose a collaborative multi-agent framework with action space decomposition. Recent work in natural language processing [19] demonstrated that decomposing action spaces between agents achieves better results than a single-agent framework. Indeed, such an approach reduces the state-action space size significantly and makes exploration in the agent training phase more efficient.

The following key points of MultiAC6 are worth highlighting. First, unlike existing DRL-based approaches [15], MultiAC6 controls the gripper pose (6 DOF) instead of the gripper position (3 DOF). Therefore, the robot can achieve more complex deformations. Second, MultiAC6 overcomes the sim-to-real gap and is experimentally validated. Third, MultiAC6 is robust to DLO variations without retraining or online fine-tuning. The different contributions of this article can be summarized as follows:
• We propose a new DRL collaborative multi actor-critic framework with action space decomposition to address 3D manipulation of DLOs in real-world settings. Our approach consists of two agents controlling the gripper position (3 DOF) and orientation (3 DOF).
• We define an optimized reward function based on the maximum error between the current shape of the DLO and its desired shape. This reward performs better than a reward function based on the average error [15].
• We validate the robustness of MultiAC6 to DLO variations through extensive real-world experiments involving large 3D deformations. These experiments are carried out, using the same MultiAC6 model, for DLOs with varying length, material, and stiffness.

II. RELATED WORK

Non-DRL methods: Several recent works showed 3D shape control of DLOs with dual-arm manipulation [3], [7], [12], [13].
The approach in [13] uses a geometrical model of the object to compute an online Jacobian that guides the control task. In [3], [12], quasi-static adaptive controllers that compute a Jacobian from a sensor-based deformation model are proposed. To address large 3D deformations, the authors of [7] combine offline and online learning of a radial basis function network. These studies require an online adaptation for each new DLO used. In contrast, we achieve generalization to various real-world DLOs without needing online estimations or specific training. Additionally, our setup consists of a single arm, which is more challenging due to the fewer actuated DOFs.

DRL methods: Another branch of research explored DRL-based methods to avoid modeling DLO deformation. These methods are mainly validated in simulation. For example, in [14], a method based on the DDPG algorithm is introduced to control elastoplastic DLOs. In our previous work [15], we addressed 3D deformations with a DDPG-based architecture. However, such techniques do not offer a way to transfer the learned policies to real-world settings [7].

Fig. 2: Left: DLO manipulation in a simulated environment considering different workspaces. These were used to create different deformation datasets to evaluate MultiAC6 (cf. Section VI). Right: Overview of a singular configuration for which the deformation cannot be predicted, which leads to a sim-to-real gap.

Sim-to-real gap: Resolving this sim-to-real gap is still an open problem since there are no accurate and standard simulation environments for deformable objects. Most of the existing approaches develop their own environment using a physics-based simulation engine such as Bullet [20] or Mujoco [21]. These engines generally model DLOs with the mass-spring method or the finite element method (FEM). This is the case for [17] and [18], where different DRL agents are trained in customized Mujoco environments. Given the above limitations, these works are among the very few validated in real-world settings. In [17], a sample-efficient reinforcement learning method named PILCO is proposed to close the sim-to-real gap for 2D deformations. In [18], a Soft Actor-Critic (SAC) algorithm is presented to control 2D deformations of real objects. These contributions are nonetheless limited to 2D deformations of DLOs with no compression strength (i.e., cables, strings, and ropes) [1]. Instead, our contribution MultiAC6 aims to close the sim-to-real gap for 3D manipulation of large-strain DLOs, such as elastic tubes. The MultiAC6 action space decomposition is inspired by the multi-agent dialog policy framework proposed in [19]. Within this framework, each component of the action is carried out by a different agent. In [19], this dialog framework was shown to be 11% more accurate than a hierarchical DRL framework and 66% more accurate than a single-agent framework.

III. PROBLEM STATEMENT

Let us consider the 3D manipulation of DLOs using a single-arm robot. In this configuration, we assume that a robot grasps one extremity of a DLO. The second extremity of this DLO is fixed to the ground. The DLO is long (> 60 cm) and is assumed to be elastic. An elastic deformation implies that the DLO returns to its original shape once the deformation force is no longer applied [1]. The goal is then to control the pose of the robot gripper to shape the DLO with a desired deformation. The DLO shape is tracked in real-time by a set of feature points defined by their 3D positions (x, y, z).
We assume that the feature points can be tracked accurately in real-time with a vision-based algorithm.

Fig. 3: Overview of the MultiAC6 framework for DLO manipulation. MultiAC6 decomposes the robot action space using two agents. Agento orients the DLO tip towards ζ, then Agentp positions the gripper to reach the desired deformation.

Therefore, the manipulation task consists of moving the gripper so that the DLO feature points reach target positions representing a desired shape (see Figure 2). To achieve a particular deformation, the robot gripper should follow a specific trajectory. Indeed, knowing the gripper's final pose is not sufficient to guarantee the achieved deformation accuracy. There is an ambiguity related to the gripper configuration: a unique gripper pose may correspond to very different DLO deformations. Given this background, this work focuses specifically on achieving large deformations for objects with large strains (plants, tubes, etc.). Our approach can generate a suitable trajectory without needing online fine-tuning based on DLO deformation testing [13], [22]. This can help in avoiding damage to the DLO. We quantify the magnitude of a DLO deformation as the maximum among the distances between every feature point's initial and target position. Using the results in existing works as a criterion (as done in [7]), we define large deformations as those that exceed 15 cm in real-world settings.

Similarly to many DRL frameworks, MultiAC6 is trained in a simulator, for safer interaction and shorter training time [23]. Unfortunately, the transfer of trained policies to real-world settings generally does not work well due to the sim-to-real gap [15]. The sim-to-real gap is caused by the difficulty of synthesizing realistic interactions in simulation (due to under-modeling, wrong/approximated parameters, model discrepancies, etc.). For DLOs, the sim-to-real gap cannot be solved using classical sim-to-real transfer techniques, which mainly address perception [23]. Simulated DLOs differ significantly from real DLOs. First, mechanical parameters (Young's modulus, Poisson coefficient, mass, friction, etc.) are only valid for one instance of a DLO. Second, real DLOs may be elastoplastic [1] and partially maintain deformations. Finally, some simulated deformations do not match the real ones for the same action. This is the case for singular positions of DLOs [24], [25], as illustrated in Figure 2. In such a configuration, DLOs are at equilibrium, but unstable in the sense that any slight gripper motion leads to unpredictable deformations. This occurs, for example, when a vertical force is exerted on a straight DLO. Such a singularity is rarely addressed in the literature. The authors of [12] acknowledged this singularity as a limitation of their method: the singularity occurs for objects with very small curvature. In the case of analytical methods, the Jacobian matrix is singular and the model becomes unstable. These singular deformations cannot be replicated in simulation. This discrepancy between real and simulated deformations violates the Markovian observability property [26] of DRL methods. Consequently, the policies learned in simulation are no longer valid in real-world settings.

Achieving such large DLO deformations in 3D presents multiple challenges. To address these, we propose a new DRL framework, described in the next sections, based on DDPG.
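To make the large-deformation criterion above concrete, the following is a minimal sketch of how the deformation magnitude and the 15 cm threshold could be evaluated from the initial and target feature points. The function names are illustrative and are not part of the released code.

```python
import numpy as np

# Hypothetical helpers illustrating the criterion above: the deformation
# magnitude is the largest displacement among the feature points between the
# initial and the target shape (3D coordinates in metres).
def deformation_magnitude(F_init, F_target):
    F_init, F_target = np.asarray(F_init), np.asarray(F_target)
    return float(np.max(np.linalg.norm(F_target - F_init, axis=1)))

def is_large_deformation(F_init, F_target, threshold=0.15):
    # "Large" deformations are those exceeding 15 cm in real-world settings.
    return deformation_magnitude(F_init, F_target) > threshold
```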
IV. BACKGROUND ON DDPG

The DDPG algorithm is an off-policy actor-critic method used to deal with continuous action spaces [27]. Considering the continuous state space S and action space A, the DDPG agent aims to learn the optimal policy π*: S → A. The learning process involves the acquisition of a Q-function and a policy [28]. For this purpose, an actor, also known as the policy network, takes the current state s_t as input and generates the optimal action a_t as output. Simultaneously, a critic, called the Q-function network, assesses the optimality of the action a_t chosen in the state s_t by assigning a Q-value Q_t(s_t, a_t) to the state-action pair (s_t, a_t).

The actor and critic networks are trained from data stored in a replay buffer. This replay buffer is filled with transitions. A transition T is composed of the action a_t predicted by the actor, the state s_t, the next state s′_t after applying a_t, and the reward r_t obtained for (s_t, a_t). The actor and critic networks are trained once the replay buffer contains at least N transitions, from which a batch of non-sequential transitions is extracted (see Table I). These transitions are selected randomly to guarantee that the data are independent and identically distributed. Utilizing batches and allowing agents to learn from previous experiences accelerates the learning process while removing unwanted temporal correlations [29].

In fact, the critic network aims to minimize the error between the predicted Q-values Q(s, a) and the Q-values calculated using the Bellman equation [30], QB(s, a) = r + γ Q′(s′, a′), where γ is the discount factor. More specifically, the critic network is optimized by minimizing the mean square error (MSE) between QB(s, a) and Q(s, a). Given a batch of size N sampled from the replay buffer, the critic loss ℓc becomes:

ℓc = (1/N) Σ_{n=1}^{N} (QB_n(s_n, a_n) − Q_n(s_n, a_n))².   (1)

The actor network predicts actions that maximize the Q-values. Therefore, the actor network is optimized by minimizing the negative Q-value. The policy loss ℓp is calculated by averaging Q(s, a) over the batch [27]:

ℓp = −(1/N) Σ_{n=1}^{N} Q_n(s_n, a_n).   (2)

V. METHOD

A. Action and State Spaces

In the MultiAC6 framework (see Figure 3), the action space of a robot is divided between two agents. Each agent within MultiAC6 is a DDPG agent. In this setup, the goal is to control the gripper pose P. Let us define P = (X, θ) with X = (x, y, z) the position and θ = (θx, θy, θz) the orientation of the gripper in the world frame. From this notation, let us define a position agent (Agentp) that actuates the gripper translation velocity Ẋ, and an orientation agent (Agento) that actuates the gripper angular velocity θ̇. In this framework, the robot deforms a DLO to minimize the error between the current feature points F and the desired feature points Fd. The division of the action space between two agents is also translated into the way the task is performed. First, Agento orients the gripper and then Agentp positions the gripper so that the desired deformation is achieved.

In a timestep t, and considering the continuous state space S and action space A, the Agentp action a^p_t ∈ A^p is Ẋ_t ∈ R³. The Agentp state s^p_t consists of the current position X_t and the translation velocity Ẋ_t of the gripper, and the current and desired feature points. Hence, s^p_t ∈ S^p is (X, Ẋ, F, Fd)_t ∈ R^(6+6m), with m the number of selected feature points. For Agento, its action a^o_t ∈ A^o is defined as θ̇_t ∈ R³. The state s^o_t of Agento consists of the current gripper orientation θ_t, the desired DLO tip orientation ζ = (ζx, ζy, ζz), and the desired feature points. Hence, s^o_t ∈ S^o is (θ, ζ, Fd)_t ∈ R^(6+3m). The Agento state is designed in a similar way as in many DRL-based manipulation tasks [31], [32], [33], specifying the desired goals in the state vector.
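As a concrete illustration of the two state spaces defined above, the sketch below assembles the Agentp and Agento state vectors for m = 4 feature points. The array names follow the notation above; the values are placeholders.

```python
import numpy as np

# Illustrative assembly of the two MultiAC6 state vectors (placeholder values).
m = 4
X, X_dot = np.zeros(3), np.zeros(3)            # gripper position and translation velocity
theta, zeta = np.zeros(3), np.zeros(3)         # gripper orientation and desired DLO tip orientation
F, F_d = np.zeros((m, 3)), np.zeros((m, 3))    # current and desired feature points

s_p = np.concatenate([X, X_dot, F.ravel(), F_d.ravel()])   # Agentp state in R^(6+6m)
s_o = np.concatenate([theta, zeta, F_d.ravel()])           # Agento state in R^(6+3m)

assert s_p.shape == (6 + 6 * m,) and s_o.shape == (6 + 3 * m,)   # (30,) and (18,)
```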
B. MultiAC6 action space decomposition

1) Principle: The proposed action space decomposition provides a straightforward but still efficient way to bridge the sim-to-real gap for DLO manipulation. For this purpose, a specific decoupled training strategy is proposed as follows. In our settings, Agento is trained to achieve a given desired DLO tip orientation ζ. This desired orientation ζ is hand-crafted (only for training) and is known to lead to the desired DLO deformation. Therefore, this agent is not trained with the simulator. Indeed, the Agento state (θ, ζ, Fd) is independent of the DLO deformation represented by the feature points Ft. With the assumption that the DLO tip is locally rigid, the gripper orientation θ_t can be obtained by integrating θ̇_t. It is worth noting that ζ is defined to avoid singular configurations of the DLO. As a direct benefit, the sim-to-real gap can be avoided for Agento. In parallel, from the desired orientation of the DLO tip, Agentp is trained to control the translation velocity of the gripper to deform the DLO into the desired shape. Given that Agentp always starts from a DLO oriented with ζ, the sim-to-real gap related to singular DLO configurations can also be avoided.

Each of the agents is trained separately to avoid error accumulation. This strategy has been used in [32] for a pick-and-place task, where it was shown to outperform a sequential training strategy. Since MultiAC6 agents are trained separately, issues of non-stationary environments [19] are avoided. Such issues occur when both agents update the environment simultaneously: Agentp and Agento would not be able to correctly map states to actions, and learning an optimal policy would therefore be more challenging. When both agents are trained, the manipulation task is solved in several steps. First, Agento orients the DLO tip towards ζ. Thereafter, Agentp positions the gripper so that the feature points Ft reach the target points Fd.

2) Theoretical reasoning: Although DRL approaches usually only control the position of the gripper, it is more intuitive and natural to also actuate the gripper orientation. A 6-DOF gripper is less restricted and can subsequently achieve more complex deformations than a 3-DOF gripper. Furthermore, as mentioned previously, singular configurations can be avoided with a proper orientation of the DLO tip. Unfortunately, using more DOFs leads to the well-known curse of dimensionality inherent in DRL approaches: the action space grows exponentially with the number of controlled DOFs. It becomes more difficult to find an optimal policy to achieve the desired DLO deformations. To mitigate this issue, our proposed action space decomposition framework combines the advantages of 6-DOF control of the gripper with the benefits of a limited action space. Indeed, by decoupling the gripper control over two agents, each of them only explores a limited action space, allowing them to find useful learning signals to achieve their respective tasks.
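The two-stage execution described above can be summarized by the following sketch. The `env`, `agent_o`, and `agent_p` objects and their methods are illustrative placeholders rather than the interface of the released code.

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

def max_error(F, F_d):
    return float(np.max(np.linalg.norm(np.asarray(F) - np.asarray(F_d), axis=1)))

def run_multiac6(env, agent_o, agent_p, zeta, F_d, delta_o, delta_p, max_steps=300):
    # Stage 1: Agento drives the gripper angular velocity until the DLO tip
    # orientation is close enough to the desired orientation zeta.
    s_o = env.orientation_state(zeta, F_d)                       # (theta, zeta, F_d)
    for _ in range(max_steps):
        if rmse(env.gripper_orientation(), zeta) <= delta_o:
            break
        s_o = env.apply_angular_velocity(agent_o.act(s_o))       # action: theta_dot
    # Stage 2: Agentp drives the gripper translation velocity until the
    # feature points reach the target points F_d within delta_p.
    s_p = env.position_state(F_d)                                # (X, X_dot, F, F_d)
    for _ in range(max_steps):
        if max_error(env.feature_points(), F_d) <= delta_p:
            return True
        s_p = env.apply_translation_velocity(agent_p.act(s_p))   # action: X_dot
    return False
```

This mirrors the Agento-then-Agentp ordering shown in Figure 3.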
C. Optimization framework

1) Learning parallelization: MultiAC6 uses the learning parallelization technique introduced for the A3C (asynchronous advantage actor-critic) algorithm [34]. The principle is to run multiple agents simultaneously in parallel on different environments. With this approach, more data can be collected for a given time period. For off-policy algorithms such as DDPG, the replay buffer is filled faster. Furthermore, since agent environments and actions are not correlated, transitions containing more diverse state-action pairs can be collected in the replay buffer. Therefore, learning parallelization decreases training time while yielding better results, as shown in [15].

TABLE I: DDPG parameters
Parameter       Value
Nb. layers      3
Hidden size     256
αA              0.0001
αC              0.001
Replay buffer   50,000
Batch size N    128
γ               0.99

2) Reward function: The reward function controls the optimization of the agent's action selection policy [35]. For Agentp, the reward function r1^p_t is defined as the maximum error: r1^p_t is computed as the negative of the maximum Euclidean distance Dt between the current feature points and the desired feature points:

r1^p_t = −max(Dt(Ft, Fd)).   (3)

For Agento, the reward function r^o_t is defined as the negative root-mean-square error (RMSE) between the current Euler orientation (roll, pitch, yaw angles) of the gripper and the desired DLO tip orientation:

r^o_t = −RMSE(θ_t, ζ).   (4)
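For reference, a minimal sketch of the two reward functions (3) and (4) is given below; F and Fd are (m, 3) arrays of current and desired feature points, while theta and zeta are 3-vectors of Euler angles. The function names are illustrative and do not come from the released code.

```python
import numpy as np

def reward_position(F, F_d):
    # Eq. (3): negative maximum Euclidean distance over the paired feature points.
    return -float(np.max(np.linalg.norm(np.asarray(F) - np.asarray(F_d), axis=1)))

def reward_orientation(theta, zeta):
    # Eq. (4): negative RMSE between the gripper Euler angles and the desired
    # DLO tip orientation.
    return -float(np.sqrt(np.mean((np.asarray(theta) - np.asarray(zeta)) ** 2)))
```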
VI. EXPERIMENTS

A. Simulation setup

1) Environment configuration: As mentioned in previous sections, a simulator is required to train the DDPG agents. For this purpose, PyBullet, the Python version of Bullet [36], was used as the simulator physics engine. The simulated environment consisted of a 7-DOF Franka Emika Panda robot and a DLO of dimensions 5×5×103 cm. The DLO deformations were modeled using FEM. A unique DLO model was defined with a 3D tetrahedral mesh comprising 70 nodes, 104 tetrahedrons, 241 links and 136 faces. This DLO was characterized by a Young's modulus of 2.5 MPa, a Poisson coefficient of 0.3, a mass of 0.2 kg, a damping ratio of 0.01, and a friction coefficient of 0.5. In the simulator, the current feature points Ft and the desired feature points Fd were defined using the positions of some mesh nodes. Four mesh nodes (m = 4) were selected all along the DLO (cf. Figure 2). This number is enough to characterize the DLO shape and works well in practice [15].

2) Datasets: Three datasets of deformations were created to evaluate MultiAC6. Each of these datasets was collected in a workspace of different dimensions, as illustrated in Figure 2. The workspaces are defined as follows:
• A small 15×40×25 cm³ workspace, used to collect both the training and the seen test dataset.
• A medium 20×50×25 cm³ workspace, used to collect the unseen test dataset.
• A large 20×65×30 cm³ workspace, used to collect the large unseen test dataset.
Each dataset contained 1000 deformations defined by Fd and ζ. The unseen datasets were excluded from the training phase to assess how well MultiAC6 could handle unseen samples. It is worth noting that the large unseen dataset corresponded to the full robot workspace. Deformations were collected within each workspace by moving the gripper to a random pose.

TABLE II: Simulation results for MultiAC6* for different reward functions, with AE (standard deviation σ) in cm and ME in cm.

Test Seen
                         80 episodes                    100 episodes
Reward function    δp    SR↑   AE↓          ME↓        SR↑   AE↓          ME↓
Maximum error r1p  5     1.0   3.38 (0.96)  1.09       1.0   3.04 (1.09)  1.01
                   3     0.98  2.34 (0.57)  0.40       0.99  2.01 (0.64)  0.67
Mean error r2p     5     0.93  4.11 (1.32)  0.94       1.0   3.40 (0.99)  0.66
                   3     0.52  3.54 (1.63)  0.66       1.0   2.04 (0.61)  0.20
DTW r3p            5     0.86  4.44 (1.49)  1.04       0.94  3.86 (1.20)  1.22
                   3     0.52  3.91 (1.77)  1.02       0.76  3.10 (1.32)  0.99

Test Unseen
                         80 episodes                    100 episodes
Reward function    δp    SR↑   AE↓          ME↓        SR↑   AE↓          ME↓
Maximum error r1p  5     0.99  3.61 (0.90)  0.78       1.0   3.23 (1.03)  0.47
                   3     0.89  2.55 (0.78)  0.23       0.97  2.10 (0.73)  0.49
Mean error r2p     5     0.89  4.32 (1.31)  1.51       1.0   3.54 (0.95)  1.14
                   3     0.47  3.83 (1.68)  0.69       0.97  2.21 (0.67)  0.34
DTW r3p            5     0.85  4.51 (1.68)  1.11       0.92  4.13 (1.60)  0.93
                   3     0.44  4.22 (1.97)  1.16       0.65  3.52 (1.85)  0.97

Test Large Unseen
                         80 episodes                    100 episodes
Reward function    δp    SR↑   AE↓          ME↓        SR↑   AE↓          ME↓
Maximum error r1p  5     0.88  4.62 (3.79)  0.79       0.96  3.99 (3.87)  0.97
                   3     0.59  3.91 (3.98)  0.79       0.89  2.96 (3.98)  0.44
Mean error r2p     5     0.70  5.36 (2.42)  1.36       0.94  4.03 (0.22)  0.54
                   3     0.31  5.05 (2.71)  0.34       0.79  3.08 (2.43)  0.54
DTW r3p            5     0.62  7.11 (9.59)  1.25       0.73  6.63 (9.18)  1.13
                   3     0.25  7.00 (9.66)  1.13       0.33  6.41 (9.31)  1.05

B. Training configuration

1) DDPG parameters: The DDPG parameters were obtained empirically (see Table I). The actor and critic networks consisted of three fully connected hidden layers of dimension 256 with a rectified linear unit (ReLU) activation function. The actor output a_t was passed through a Tanh activation function. For exploration purposes, Ornstein-Uhlenbeck noise was added to the action a_t, as described in [27]. The network gradients were updated with the ADAM optimizer. The learning rate was set to αA = 0.0001 for the actor and αC = 0.001 for the critic. A batch of N = 128 transitions was randomly sampled from a 50,000-transition replay buffer. Finally, the discount factor was set to a constant value (γ = 0.99).

2) Training parameters: Agentp was trained with 32 parallel agents for 100 episodes of 300 steps. In this configuration, a manipulation task was considered successful when the maximum error (as defined in Section V-C.2) was below a threshold δp set at 5 cm. This threshold is generally sufficient for applications such as manipulating plants. From this, we could define the success rate (SR). Similarly, for Agento, 32 agents were trained in parallel for 60 episodes of 100 steps. The training dataset was used to sample the desired mesh nodes Fd. The angular error threshold δo was set to 3° (or 0.0524 rad). Both Agentp and Agento were trained on supercomputers with 64 GB memory and Intel Xeon E5-2698 v4 2.20 GHz processors at the UCA University Mesocentre. The average training time was two and a half days, mainly due to the slowness of the FEM computation. The training time can be reduced by using more powerful computers or optimized simulators such as Isaac Gym [37].
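The training loop implied by this configuration is summarized in the rough sketch below of a single Agentp training episode; `env`, `agent_p`, and their methods are illustrative placeholders, and the actual training additionally runs 32 such workers in parallel to fill the replay buffer.

```python
def training_episode(env, agent_p, F_d, delta_p=0.05, n_steps=300):
    s = env.reset(F_d)                        # DLO pre-oriented with the hand-crafted zeta
    for _ in range(n_steps):
        a = agent_p.act(s, explore=True)      # translation velocity + Ornstein-Uhlenbeck noise
        s_next, r = env.step(a)               # r is the maximum-error reward of Eq. (3)
        agent_p.buffer.add((s, a, r, s_next))
        agent_p.update()                      # DDPG update with a batch of N = 128 transitions
        s = s_next
        if -r < delta_p:                      # maximum error below the success threshold
            return True                       # counted as a success for the SR metric
    return False
```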
C. Simulation results

Several experiments were conducted in simulation to (i) assess the performance of the Agentp reward function, and (ii) evaluate the MultiAC6 framework. All results were obtained for 1000 desired random goals sampled from the seen, unseen, and large unseen datasets. Several evaluation metrics were used, namely SR, average error (AE), and minimum final error (ME).

1) Reward function evaluation: A first evaluation consisted of assessing the performance of our proposed reward function. For this purpose, the maximum error reward function was compared to a mean error and a dynamic time warping (DTW) reward function [38]. The DTW reward computes the similarity between two point sets (DLO feature points). The mean error reward function was calculated as the negative average Euclidean distance D̄t between the current feature points Ft and the desired feature points Fd:

r2^p_t = −D̄t(Ft, Fd) = −(1/m) Σ_{j=1}^{m} Dt(Ft^j, Fd^j).   (5)

The DTW reward function was used to measure the similarity between the current feature points Ft and the desired feature points Fd:

r3^p_t = −DTW(Ft, Fd) = −Σ_{j=1}^{m} Dt(Ft^j, Fd^j).   (6)

To capture only the effect of the reward functions, only Agentp was evaluated, with the initial hand-crafted DLO tip orientation. This framework was denoted MultiAC6*. As shown in Table II, the maximum error reward function performed well overall. With 80 episodes, our proposed reward function always performed the best, with large differences in success rates compared to the DTW or the mean error reward. With 100 training episodes, 89% of the deformations were successfully performed under the most challenging condition (large unseen with δp = 3 cm) for the maximum error reward. In comparison, the mean error reward function had a success rate of 79%, while the DTW reward function only achieved 33%. These results with 80 or 100 training episodes support the superiority of our proposed reward. We believe that the maximum error reward performs better because it does not smooth the error as the mean error reward does. Furthermore, this reward is easier to maximize than the DTW reward. For the following experiments, a maximum error reward was used with 100 training episodes.

2) MultiAC6 evaluation: The MultiAC6 framework was then compared with different approaches. In particular, MultiAC6 was compared with single-agent frameworks controlling the 6 DOF (AC6) or 3 DOF (AC3 [15]) of the robot gripper. AC6 is a single-agent framework that directly outputs both translation and angular velocities. AC3 and AC6 were trained for 100 episodes of 300 steps under the same conditions and with the same parameters as MultiAC6.

TABLE III: AC3, AC6, and MultiAC6 simulation results for the test seen, unseen, and large unseen datasets, with AE ± the standard deviation σ in cm and ME in cm.

                         Test Seen                      Test Unseen                    Test Large Unseen
Method           δp    SR↑   AE↓           ME↓       SR↑   AE↓           ME↓       SR↑   AE↓            ME↓
AC3 [15]         5     0.64  4.83 ± 1.22   1.78      0.59  9.04 ± 8.60   1.34      0.49  12.17 ± 11.37  1.66
                 3     0.26  4.45 ± 1.57   1.55      0.33  8.56 ± 8.90   1.08      0.30  11.75 ± 11.68  1.12
AC6              5     1.0   4.40 ± 0.44   2.08      0.60  9.50 ± 7.76   2.04      0.47  12.39 ± 10.12  1.48
                 3     1.0   2.61 ± 0.36   1.26      0.54  8.55 ± 8.40   1.47      0.39  11.76 ± 10.63  1.20
MultiAC6 (ours)  5     1.0   3.02 ± 1.07   0.98      1.0   3.23 ± 1.02   1.34      0.96  3.99 ± 3.86    0.99
                 3     0.99  2.01 ± 0.62   0.62      0.97  2.09 ± 0.71   0.52      0.89  2.95 ± 3.98    0.47

As shown in Table III, AC3 performed poorly even with seen deformations. As initially assumed, controlling 3 DOF is not sufficient to achieve large 3D deformations. AC6, on the other hand, performed well only with seen deformations. Its success rate dropped drastically for unseen datasets (down to 39%). This suggests that AC6 may not be able to perform well in real-world conditions. In contrast, our MultiAC6 framework achieved at least 89% of the deformations even under the most challenging conditions (see Figure 4).
These results, which are consistent with [19], confirm the benefit of using the action space decomposition. Indeed, with MultiAC6, agents explore smaller state-action spaces than single-agent frameworks. Furthermore, on average, deformations are achieved with an accuracy between 2 and 3 cm. This accuracy can reach 0.51 cm in the best-case scenarios. These results, obtained with datasets involving unseen deformations, demonstrate the robustness of the MultiAC6 framework.

D. Experimental results

For real-world experiments, we used a 7-DOF Franka Emika Panda robot to manipulate a long foam bar, as illustrated in Figure 4(a). Feature points Ft on the foam bar were defined by markers. These markers were tracked in real-time with a motion capture (MOCAP) system. For all experiments, the threshold δp was set to 5 cm for Agentp and δo was 3° for Agento.

Fig. 4: Deformation performed by MultiAC6 with a one-meter-long foam bar and δp = 5 cm. The initial configuration is given in (a), then in (b) Agento orients the gripper, and finally in (c)-(d), Agentp positions the gripper to reach the desired deformation.

1) MultiAC6 real-world evaluation: The experimental results are presented in Table IV. These results were obtained using 30 samples of reachable desired deformations. The success rate of AC3 and AC6 was 7/30 and 9/30, respectively. In contrast, MultiAC6 achieved 29/30 (+66% compared to AC6) deformations with an average error of 3.65 cm. As hypothesized from the simulation results, AC6 was not able to overcome the sim-to-real gap. By analyzing the results, we discovered that AC6 was heavily affected by the elastoplasticity of the foam bar (different from the initial elasticity assumption) as well as singular configurations. On the contrary, MultiAC6 was able to avoid singular configurations thanks to the decoupled training framework of Agentp and Agento (see Section V-B). With the additional benefit of the action space decomposition, MultiAC6 policies are more efficient and thus transferable to real-world settings.

TABLE IV: AC3, AC6, and MultiAC6 results in the real world with a one-meter-long DLO.
Method            SR↑     ∆       AE ± σ (cm)↓
AC3 [15]          7/30    −73%    12.10 ± 8.73
AC6               9/30    −66%    11.46 ± 8.41
MultiAC6*         29/30   0%      3.66 ± 0.84
MultiAC6 (ours)   29/30   —       3.65 ± 0.86

2) MultiAC6 robustness: To further test the robustness of our approach, MultiAC6 was evaluated on seven foam bars (see Figure 5) with different characteristics. These characteristics involved different lengths, materials, and stiffness, as presented in Table V.

Fig. 5: Foam bars with different lengths and materials that have been used in the real experiments.

TABLE V: MultiAC6 real-world experiment results with different types of foam bars, with YM the Young's modulus in MPa, stiffness in N/mm, and length in m.
Type   YM     Stiffness   Length   Success rate↑   Initial deformation Max/Mean (cm)
M1     0.10   4.8         0.8      15/17           34.78/25.53
                          1.0      17/17           35.61/28.86
                          1.2      17/17           40.54/27.41
M2     0.07   3.6         0.8      17/17           37.28/25.18
M3     0.16   7.5         1.0      14/17           33.40/25.28
M4     0.05   2.8         1.0      17/17           37.74/27.14
M5     0.59   38.6        1.2      16/17           38.70/27.30
We believe that these characteristics are relevant to capture MultiAC6 robustness to significantly different DLOs. The foam bars made of materials M1 to M4 were cubical (section = 5×5 cm²). The foam bar made of M5 was cylindrical (diameter = 6.5 cm). The results in Table V were obtained from 17 samples of reachable desired deformations. The same MultiAC6 model as in Section VI-D.1 was used without additional training or online fine-tuning. MultiAC6 achieved 95% of all deformations with very different types of foam bars. This result emphasizes the flexibility of our approach, which is particularly suitable for real-world applications (see Figure 6). We hypothesize that MultiAC6 can generalize well to different workspaces and materials, as the agents mainly learn the dynamics of any DLO, and not the model of the DLO manipulated in simulation. Furthermore, the results in Table IV showed that MultiAC6 was able to achieve large deformations (26 cm on average). Some configurations even exceeded 40 cm.

Fig. 6: Various deformations achieved by MultiAC6 with different foam bars.

E. Discussion

The simulation and experimental results clearly emphasize the benefits of actuating the 6 DOF of a gripper. These results suggest that controlling the gripper orientation is necessary, but not sufficient, to close the sim-to-real gap. By exploiting the gripper orientation within the MultiAC6 framework, the robot can achieve, with the same model, complex deformations for various DLOs. To do so, the desired orientation of the DLO tip ζ is required to define the state of Agento. This ζ can be obtained empirically without the dynamic model of the DLO. We acknowledge that this may be impractical for real-world deployment. This limitation is comparable to that of many methods that require online fine-tuning [7], [3], [13]. However, during our experiments, we noticed that MultiAC6 could accommodate coarse values of ζ to achieve the desired DLO deformation. Taking advantage of the robustness of MultiAC6, the same orientation ζ can be transferred to different DLOs without affecting real-world performance. Furthermore, these orientations ζ can be defined in advance without accurate measurements (see Figure 7). Therefore, we believe that our approach may be less restrictive than online fine-tuning.

Fig. 7: MultiAC6 success rate with respect to ζ error.

Further limitations of MultiAC6 are related to the discrete actuation peculiar to DRL. Discrete actuation can induce jerky motion and delays. Fortunately, these can be mitigated with longer time steps and interpolated velocities.

VII. CONCLUSION

This article introduced MultiAC6, a new multi actor-critic framework to control large 3D deformations of DLOs with a single-arm robot. MultiAC6 decomposes the action space of a robot over different agents: one agent controls the gripper position and another controls the gripper orientation. The learning process is then simplified, since both the action and the state spaces are reduced. MultiAC6 was validated through extensive experiments in simulation and in the real world. The results proved that MultiAC6 can perform large deformations of up to 40 cm in a real setup. Furthermore, MultiAC6 is able to handle several types of DLOs without retraining or online fine-tuning. We validated the robustness of MultiAC6 in real experiments using various unknown DLOs with an average success rate of 95%. In the future, we wish to develop new DRL frameworks to test MultiAC6 on soft objects other than DLOs and make new comparisons.

ACKNOWLEDGMENT

This work is funded by the EU Horizon 2020 research and innovation programme under grant agreement No 101017284 (Project 'ACROBA') and by the French government through the France 2030 programme IdEx université de Bordeaux / RRI ROBSYS.

REFERENCES

[1] J. Sanchez, J. Corrales, et al., "Robotic manipulation and sensing of deformable objects in domestic and industrial applications: A survey," IJRR, vol. 37, no. 7, pp. 688–716, 2018.
[2] J. Zhu, B. Navarro, et al., "Dual-arm robotic manipulation of flexible cables," in IEEE/RSJ IROS, pp. 479–484, 2018.
[3] R. Lagneau, A. Krupa, and M. Marchal, "Automatic shape control of deformable wires based on model-free visual servoing," IEEE RA-L, vol. 5, no. 4, pp. 5252–5259, 2020.
[4] P. Mitrano, D. McConachie, and D. Berenson, "Learning where to trust unreliable models in an unstructured world for deformable object manipulation," Science Robotics, vol. 6, no. 54, p. eabd8170, 2021.
[5] T. Botterill, S. Paulin, et al., "A robot system for pruning grape vines," Journal of Field Robotics, vol. 34, no. 6, pp. 1100–1122, 2017.
[6] O. Aghajanzadeh, M. Aranda, et al., "Adaptive deformation control for elastic linear objects," Frontiers in Robotics and AI, vol. 9, pp. 1–13, 2022.
[7] M. Yu, K. Lv, et al., "Global model learning for large deformation control of elastic deformable linear objects: An efficient and adaptive approach," IEEE T-RO, vol. 39, no. 1, pp. 417–436, 2023.
[8] J. Zhu, A. Cherubini, et al., "Challenges and outlook in robotic manipulation of deformable objects," IEEE RAM, vol. 29, no. 3, pp. 67–77, 2022.
[9] O. Aghajanzadeh, M. Aranda, G. López-Nicolás, R. Lenain, and Y. Mezouar, "An offline geometric model for controlling the shape of elastic linear objects," in IEEE/RSJ IROS, pp. 2175–2181, 2022.
[10] N. Lv, J. Liu, and Y. Jia, "Dynamic modeling and control of deformable linear objects for single-arm and dual-arm robot manipulations," IEEE T-RO, vol. 38, no. 4, pp. 2341–2353, 2022.
[11] S. Jin, C. Wang, and M. Tomizuka, "Robust deformation model approximation for robotic cable manipulation," in IEEE/RSJ IROS, pp. 6586–6593, 2019.
[12] D. Navarro-Alarcon, H. M. Yip, et al., "Automatic 3-D manipulation of soft objects by robotic arms with an adaptive deformation model," IEEE T-RO, vol. 32, no. 2, pp. 429–441, 2016.
[13] M. Shetab-Bushehri, M. Aranda, et al., "Lattice-based shape tracking and servoing of elastic objects," IEEE T-RO, pp. 1–18, 2023.
[14] R. Laezza and Y. Karayiannidis, "Learning shape control of elastoplastic deformable linear objects," in IEEE ICRA, pp. 4438–4444, 2021.
[15] M. H. Daniel Zakaria, M. Aranda, et al., "Robotic control of the deformation of soft linear objects using deep reinforcement learning," in IEEE CASE, pp. 1516–1522, 2022.
[16] L. Pecyna, S. Dong, and S. Luo, "Visual-tactile multimodality for following deformable linear objects using reinforcement learning," in IEEE/RSJ IROS, pp. 3987–3994, 2022.
[17] H. Han, G. Paul, and T. Matsubara, "Model-based reinforcement learning approach for deformable linear object manipulation," in IEEE CASE, pp. 750–755, 2017.
[18] Y. Wu, W. Yan, et al., "Learning to manipulate deformable objects without demonstrations," in Robotics: Science and Systems, 2020.
[19] H. Wang and K. Wong, "A collaborative multi-agent reinforcement learning framework for dialog action decomposition," in EMNLP, pp. 7882–7889, 2021.
[20] D. Seita, P. Florence, et al., "Learning to rearrange deformable cables, fabrics, and bags with goal-conditioned transporter networks," in IEEE ICRA, pp. 4568–4575, 2021.
[21] S. Chen, Y. Liu, et al., "DiffSRL: Learning dynamical state representation for deformable object manipulation with differentiable simulation," IEEE RA-L, vol. 7, no. 4, pp. 9533–9540, 2022.
[22] C. Chi, B. Burchfiel, et al., "Iterative residual policy for goal-conditioned dynamic manipulation of deformable objects," in Robotics: Science and Systems XVIII, 2022.
[23] W. Zhao, J. P. Queralta, and T. Westerlund, "Sim-to-real transfer in deep reinforcement learning for robotics: A survey," in IEEE SSCI, pp. 737–744, 2020.
[24] T. Bretl and Z. McCarthy, "Quasi-static manipulation of a Kirchhoff elastic rod based on a geometric analysis of equilibrium configurations," IJRR, vol. 33, no. 1, pp. 48–68, 2014.
[25] A. Borum, D. Matthews, and T. Bretl, "State estimation and tracking of deforming planar elastic rods," in IEEE ICRA, pp. 4127–4132, 2014.
[26] R. S. Sutton and A. G. Barto, Reinforcement Learning – An Introduction. Adaptive Computation and Machine Learning, MIT Press, 1998.
[27] T. P. Lillicrap, J. J. Hunt, et al., "Continuous control with deep reinforcement learning," in ICLR, 2016.
[28] R. Jangir, G. Alenyà, and C. Torras, "Dynamic cloth manipulation with deep reinforcement learning," in IEEE ICRA, pp. 4630–4636, 2020.
[29] R. Liu and J. Zou, "The effects of memory replay in reinforcement learning," in IEEE Allerton, pp. 478–485, 2018.
[30] O. Nachum, M. Norouzi, et al., "Bridging the gap between value and policy based reinforcement learning," in NeurIPS, pp. 2775–2785, 2017.
[31] Y. Li, C. Pan, et al., "Efficient bimanual handover and rearrangement via symmetry-aware actor-critic learning," in IEEE ICRA, pp. 3867–3874, 2023.
[32] L. Marzari, A. Pore, et al., "Towards hierarchical task decomposition using deep reinforcement learning for pick and place subtasks," in IEEE ICAR, pp. 640–645, 2021.
[33] L. Chen, Z. Jiang, et al., "Deep reinforcement learning based trajectory planning under uncertain constraints," Frontiers in Neurorobotics, vol. 16, p. 883562, 2022.
[34] V. Mnih, A. P. Badia, et al., "Asynchronous methods for deep reinforcement learning," in ICML, vol. 48, pp. 1928–1937, 2016.
[35] M. H. Daniel Zakaria, S. Lengagne, J. A. C. Ramón, and Y. Mezouar, "General framework for the optimization of the human-robot collaboration decision-making process through the ability to change performance metrics," Frontiers in Robotics and AI, vol. 8, 2021.
[36] E. Coumans and Y. Bai, "PyBullet, a Python module for physics simulation for games, robotics and machine learning." http://pybullet.org, 2016–2021.
[37] V. Makoviychuk, L. Wawrzyniak, et al., "Isaac Gym: High performance GPU based physics simulation for robot learning," in NeurIPS, 2021.
[38] D. J. Berndt and J. Clifford, "Using dynamic time warping to find patterns in time series," in Knowledge Discovery in Databases: Papers from the 1994 AAAI Workshop, Technical Report WS-94-03, pp. 359–370, AAAI Press, 1994.