Finite ‐ Time Bounds for AMSGrad ‐ Enhanced Neural TD

: Although the combination of adaptive methods and deep reinforcement learning has achieved tremendous success in practical applications, its theoretical convergence properties are not well understood. To address this issue, we propose a neural network-based adaptive TD algorithm, called NTD-AMSGrad, which is a variant of temporal difference learning. Moreover, we rigorously analyze the convergence performance of the proposed algorithm and establish a finite-time bound for NTD-AMSGrad under the Markov observation model. Specifically, when the neural network is wide enough, the proposed algorithm can converge to the optimal action-value function at a rate of, where is the number of iterations.


Introduction
Reinforcement learning (RL) has garnered considerable attention in recent years due to its wide-ranging applications, including medical diagnosis [1], financial quantization [2], chat generative pre-trained transformer [3], smart grid [4], and many more. At its core, RL involves the agent interacting with the environment in a trial-and-error process to learn an optimal policy that maximizes its cumulative long-term reward [5]. A key challenge in developing reinforcement learning algorithms is estimating the long-term reward associated with a given policy. This problem, which is commonly referred to as policy evaluation, is of fundamental importance in reinforcement learning.
Temporal-difference learning (TD), originally proposed by Sutton [6], is a crucial approach for solving the policy evaluation problem by estimating the value of a state or action based on the difference between the predicted and actual reward received at each time step. TD has been demonstrated to be both efficient and effective in a wide range of tasks, including e-sports games and robot control [7,8]. In addition, TD can be easily combined with function approximation, such as neural networks, to learn effective policies in highdimensional state spaces.
Despite its many advantages, TD still faces several challenges, particularly with respect to theoretical analysis in the context of nonlinear function approximators. Nonlinear approximators can introduce issues such as instability and divergence in TD algorithms [9,10,11], which can make it difficult to learn an accurate value function. As a result, developing effective techniques for TD with nonlinear function approximation remains an active area of research.
In this context, recent work has explored the use of neural networks as function approximators in TD [7,12,13]. These approaches have shown significant promise in addressing the challenges associated with nonlinear function approximation and have led to breakthrough results in a variety of applications. Furthermore, researchers in [14,15,16] have presented convergence results for neural TD, albeit under certain additional assumptions and restrictions. In an effort to enhance the efficiency of TD, adaptive methods inspired by stochastic algorithms have been proposed in [17,18] for Deep Q-Networks (DQN). Empirical evidence indicates that these adaptive TD variants outperform their vanilla counterparts in numerous tasks. However, there is still much to be learned about the theoretical properties of these algorithms and the conditions under which they are guaranteed to converge to an optimal policy. As such, further research is needed to fully understand the potential of neural TD learning and to develop stable algorithms that can be applied to a wide range of realworld problems.
To address this research gap, the present study proposes an Adam-type TD algorithm with neural network approximation, termed as NTD-AMSGrad. The proposed algorithm combines AMSGrad algorithm [19] with neural TD and adaptively adjusts the learning rate of different weights and biases in the neural network by utilizing the moving average of historical gradients. Moreover, we provide a rigorous nonasymptotic convergence analysis of the proposed algorithm under Markov observation. Additionally, we elaborate on the key contributions of this paper, which are summarized below: We propose an adaptive TD with neural network approximation called NTD-AMSGrad under Markovian sampling We demonstrate that, given a sufficiently wide neural network, NTD-AMSGrad can converge to the optimal actionvalue function at a rate of , where is the number of iterations.
The remainder of this paper is organized as follows. In Section 2, we introduce the necessary preliminaries. In Section 3, we formulate reinforcement learning problem with neural network function approximation. To solve this problem, we propose NTD-AMSGrad and provide the standard assumptions. The main results of this paper are presented in Section 4. In Section 5, we provide the rigorous proofs of main results in detail. Finally, we conclude this paper in Section 6.
The subsequent sections of this paper are structured as follows. Section 2 presents the necessary preliminaries. In Section 3, we formulate the reinforcement learning problem with neural network function approximation. To address this problem, we propose NTD-AMSGrad and outline the standard assumptions. In Section 4, we provide detailed and rigorous proofs of the main theorems. Finally, Section 5 concludes this paper.

Preliminaries
In this section, we provide the necessary preliminaries for the policy evaluation problem. We denote the set of states as . Similarly, the set of actions is denoted as . The Markovian transition probability matrix is represented by , and the reward function is denoted as . Therefore, a Markov reward process can be defined using a 5-tuple , where represents the discount factor. For any given policy , the associated value function is denoted as the corresponding action-value function can be defined as The properties of Markov reward process leads to the Bellman equation, which is given by (1) where is the Bellman operator, and is the unique fixed point of .  In this paper, we consider the -hidden-layer ReLU network to approximate the action-value function , as depicted in Figure 1. The formulation is as follows (2)  represents a constraint set. Then, solving the policy evaluation problem transforms into minimizing the mean square projected Bellman error (MSPBE), as expressed in the following equation (4) where denotes the projection operator. Subsequently, we outline the neural TD approach. The updates in neural TD are carried out using the gradient, where the gradient term is defined as follows (5) where represents the TD error. Definition 1. [20] A point is said to be the approximate stationary if (6) where and the TD error is (7) Cai et al. [20] have proved the existence of an approximate stationary point that minimizes the MSPBE. In addition, we introduce a vector-value map gradient that remains independent of the data point and is defined as (8) Likewise, based on the linearized function, the following gradient terms is given by (9) (10) Next, in conjunction with the AMSGrad, we introduce the remaining definitions of Algorithm 1. The first order moment is updated by the following rule, (11) where is a hyper-parameter? Furthermore, the second order moment is defined as (12) where denotes a hyper-parameter, and . Then, the parameter is updated as (13) where is a weighted projection operator onto , which is given by (14) Thus, this section devises an adaptive TD with neural network approximation, named NTD-AMSGrad, which is summarized in Algorithm 1. Before delving into the analysis of the finite-time bounds of NTD-AMSGrad, it is imperative to introduce the following set of standard assumptions.

Convergence Analysis
In this section, we will rigorously analyze the convergence performance of NTD-AMSGrad. To accomplish this, we will present several key results that are essential for our analysis. Lemma  where the second inequality uses the assumption , and . Finally, with with the probability at least over the randomness of , and by combining the bounds of terms and , we can derive Theorem 1 as follows (30) where we use the fact that and . To this end, the result of Theorem 1 is dereived.
Proof of Theorem 2. By the proof approach in [16], we can establish the result of Theorem 2.
According to the Theorem 1, the theoretical analysis demonstrates that NTD-AMSGrad can converge to the minimizer of MSPBE at a rate of when the width of the ReLU neural networks is sufficiently large. Moreover, Theorem 2 presents the finite-time bounds for NTD-AMSGrad.

Conclusion
This paper presents a novel neural TD algorithm named NTD-AMSGrad, which adopts the Adam-type optimization method. The finite-time bound of the proposed algorithm is established under Markov observation. Specifically, the theoretical results demonstrate that the proposed algorithm achieves convergence at a rate of , when the neural network is sufficiently wide.